C++ Templates for HLS

Introduction

High-level synthesis (HLS) lets you build complex hardware designs with traditional software tools while keeping control over hardware-specific optimizations. Three commonly applied optimizations are loop unrolling, pipelining, and array partitioning. However, for these optimizations to be applied to loops, arrays, and kernels, those components must be statically defined in the HLS language being used. For example, to unroll a loop, the loop's bounds must be fixed and known ahead of time so that the HLS compiler can perform the optimization. Similarly, to partition an array, the array size must be fixed and statically defined, which also means that any function (or kernel) that takes the array as an argument must declare the fixed array size in its argument list. Writing HLS code in a style that accommodates this requirement for static definitions can be challenging and messy.

This article focuses on using the templates feature of the C++ language to overcome these challenges and produce clean and concise code while still taking advantage of these optimizations.

Example: ReLU Activation Layer

Below is a simple example of a C++ implementation of an activation layer being applied to a vector.

#include <ap_fixed.h>

typedef ap_fixed<32, 16> F_TYPE;
typedef ap_fixed<32, 16> W_TYPE;

// Define ReLU operation
F_TYPE relu(F_TYPE in){
	if(in > 0){
		return in;
	}
	else{
		return 0;
	}
}


void apply_relu(F_TYPE *out, int length){
#pragma HLS INLINE off
	for(int out_idx = 0; out_idx < length; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

// Vector sizes (a #define takes no '=' sign)
#define DATA_VECTOR_1_SIZE 32
#define DATA_VECTOR_2_SIZE 64

void top_example(F_TYPE data_vector_1[DATA_VECTOR_1_SIZE],
                 F_TYPE data_vector_2[DATA_VECTOR_2_SIZE]){
    apply_relu(data_vector_1, DATA_VECTOR_1_SIZE);
    apply_relu(data_vector_2, DATA_VECTOR_2_SIZE);
}

This implementation allows you to define the apply_relu() function once and use it to apply the ReLU function to data arrays of different sizes.

However, a couple of flaws prevent HLS optimizations from being applied to the apply_relu() function. In apply_relu(), the length of the data vector is passed in as a function argument and becomes the loop bound. Because this value is unknown at compile time and can take any value at run time, the loop bound is not fixed, so you cannot apply pipelining and loop unrolling optimizations to this function. Additionally, because apply_relu() receives the input array as a pointer, the array size is unknown inside the function, and array partitioning cannot be applied.

Let's make some simple changes to make the input array size and loop bound fixed.

#include <ap_fixed.h>

typedef ap_fixed<32, 16> F_TYPE;
typedef ap_fixed<32, 16> W_TYPE;

// Define ReLU operation
F_TYPE relu(F_TYPE in){
	if(in > 0){
		return in;
	}
	else{
		return 0;
	}
}

// Vector sizes (a #define takes no '=' sign)
#define DATA_VECTOR_1_SIZE 32
#define DATA_VECTOR_2_SIZE 64

// ReLU over a vector of exactly DATA_VECTOR_1_SIZE elements
void apply_relu_1(F_TYPE out[DATA_VECTOR_1_SIZE]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < DATA_VECTOR_1_SIZE; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

// Same body, duplicated only to change the fixed size
void apply_relu_2(F_TYPE out[DATA_VECTOR_2_SIZE]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < DATA_VECTOR_2_SIZE; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}


void top_example(F_TYPE data_vector_1[DATA_VECTOR_1_SIZE],
                 F_TYPE data_vector_2[DATA_VECTOR_2_SIZE]){
    apply_relu_1(data_vector_1);
    apply_relu_2(data_vector_2);
}

In this example, the dynamic loop bound and the unknown array size in apply_relu() have both been resolved. To get there, however, we had to write the function twice, as apply_relu_1() and apply_relu_2(), so that the loop bound and input array size are statically defined for each of the two data vectors, data_vector_1 and data_vector_2. Writing the C++ code this way lets us apply valid pipelining, loop unrolling, and array partitioning optimizations, which the HLS compiler will implement when it builds the final hardware design.

The major downside is that this coding style does not scale to larger designs. For example, a design that applies apply_relu to 20 data vectors of 20 different sizes would need 20 separate definitions of the function, each with a different array argument size and loop bound. This is a practical issue in larger or more complex designs such as a neural network accelerator, where many of the same operations (linear, relu, conv2d) are called in different places with different fixed loop bounds, input sizes, and output sizes, as well as fixed domain-specific parameters such as stride and dilation. Duplicating every function for every parameter combination quickly becomes messy and unproductive. There are some hacky ways to write slightly better code, such as using large switch statements within a single function definition, but they sidestep this coding style issue rather than solve it.
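To make the switch-statement workaround concrete, here is a rough sketch of what it might look like. It is only one guess at that style, not a recommended pattern: the LayerId enum and the apply_relu_switch() name are hypothetical, and float stands in for ap_fixed<32, 16> so the code compiles with an ordinary C++ compiler. HLS pragmas are left out.

```cpp
// Stand-in for ap_fixed<32, 16> so the sketch builds without HLS headers (assumption).
typedef float F_TYPE;

const int DATA_VECTOR_1_SIZE = 32;
const int DATA_VECTOR_2_SIZE = 64;

// Hypothetical identifiers naming each call site; not part of the original example.
enum LayerId { LAYER_1, LAYER_2 };

F_TYPE relu(F_TYPE in) { return in > 0 ? in : F_TYPE(0); }

// One function definition, but every case repeats the loop with its own fixed bound.
void apply_relu_switch(F_TYPE *out, LayerId layer) {
    switch (layer) {
    case LAYER_1:
        for (int out_idx = 0; out_idx < DATA_VECTOR_1_SIZE; out_idx++) {
            out[out_idx] = relu(out[out_idx]);
        }
        break;
    case LAYER_2:
        for (int out_idx = 0; out_idx < DATA_VECTOR_2_SIZE; out_idx++) {
            out[out_idx] = relu(out[out_idx]);
        }
        break;
    }
}
```

Each loop now has a constant bound, but the loop body is still duplicated once per size, and the array is still passed as a bare pointer, so the function signature carries no size information.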

Let's now introduce C++ templates to solve this issue.

#include <ap_fixed.h>

typedef ap_fixed<32, 16> F_TYPE;
typedef ap_fixed<32, 16> W_TYPE;

// Define ReLU operation
F_TYPE relu(F_TYPE in){
	if(in > 0){
		return in;
	}
	else{
		return 0;
	}
}

// Vector sizes (a #define takes no '=' sign)
#define DATA_VECTOR_1_SIZE 32
#define DATA_VECTOR_2_SIZE 64

// SIZE fixes both the array size and the loop bound at compile time
template<int SIZE>
void apply_relu(F_TYPE out[SIZE]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < SIZE; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

void top_example(F_TYPE data_vector_1[DATA_VECTOR_1_SIZE],
                 F_TYPE data_vector_2[DATA_VECTOR_2_SIZE]){
    apply_relu<DATA_VECTOR_1_SIZE>(data_vector_1);
    apply_relu<DATA_VECTOR_2_SIZE>(data_vector_2);
}

By using templates, we can define the apply_relu() function template once and use it to apply the ReLU function to data arrays of different sizes. The template parameter SIZE sets both the size of the input array and the loop bound, so both are statically defined for each of the data vectors, data_vector_1 and data_vector_2, that we apply the function to. Because function templates are instantiated at compile time, only compile-time constants (for example, literals, macros that expand to literals, or constexpr values) can be used as template arguments.
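As a small variation on the same idea, the sketch below uses constexpr constants as the compile-time values and passes the arrays by reference so that the compiler can deduce SIZE on its own. It makes two assumptions. First, float stands in for ap_fixed<32, 16> so the code builds with a standard C++ compiler. Second, it assumes your HLS toolchain accepts array-reference parameters, which you should check against its documentation. The HLS pragmas are omitted for brevity.

```cpp
// Stand-in for ap_fixed<32, 16> so the sketch builds without HLS headers (assumption).
typedef float F_TYPE;

F_TYPE relu(F_TYPE in) { return in > 0 ? in : F_TYPE(0); }

// constexpr values are compile-time constants, so they can size arrays
// and serve as template arguments.
constexpr int DATA_VECTOR_1_SIZE = 32;
constexpr int DATA_VECTOR_2_SIZE = 64;

// Taking the array by reference keeps its size in the parameter type,
// which lets the compiler deduce SIZE from the argument.
template <int SIZE>
void apply_relu(F_TYPE (&out)[SIZE]) {
    for (int out_idx = 0; out_idx < SIZE; out_idx++) {
        out[out_idx] = relu(out[out_idx]);
    }
}

void top_example(F_TYPE (&data_vector_1)[DATA_VECTOR_1_SIZE],
                 F_TYPE (&data_vector_2)[DATA_VECTOR_2_SIZE]) {
    apply_relu(data_vector_1);  // SIZE deduced as 32
    apply_relu(data_vector_2);  // SIZE deduced as 64
}
```

With a plain array parameter such as F_TYPE out[SIZE], the argument decays to a pointer and SIZE cannot be deduced, which is why the calls in the listing above spell out the template argument explicitly.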

A big caveat of using templates in HLS code is that calls to a function template with different template arguments are compiled into different functions, which the HLS compiler treats as separate components. A naive use of templates therefore prevents hardware component reuse across those calls. Working around this requires keeping the limitation in mind and restructuring the code accordingly.
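The "different functions" point can be observed in plain C++. In the sketch below, each instantiation of a function template gets its own function-local static counter, so counting calls per instantiation shows that apply_relu<32> and apply_relu<64> are separate functions. The call_count() helper is hypothetical and exists only for this software demonstration, and float again stands in for ap_fixed<32, 16>. The sketch says nothing about how a particular HLS tool allocates hardware.

```cpp
// Stand-in for ap_fixed<32, 16> so the sketch builds without HLS headers (assumption).
typedef float F_TYPE;

F_TYPE relu(F_TYPE in) { return in > 0 ? in : F_TYPE(0); }

// Hypothetical helper: every instantiation has its own static counter.
template <int SIZE>
int &call_count() {
    static int count = 0;
    return count;
}

template <int SIZE>
void apply_relu(F_TYPE out[SIZE]) {
    call_count<SIZE>()++;  // records a call to this particular instantiation
    for (int out_idx = 0; out_idx < SIZE; out_idx++) {
        out[out_idx] = relu(out[out_idx]);
    }
}

void top_example(F_TYPE a[32], F_TYPE b[32], F_TYPE c[64]) {
    apply_relu<32>(a);  // instantiation apply_relu<32>
    apply_relu<32>(b);  // the same instantiation again
    apply_relu<64>(c);  // a second, separate instantiation
}
```

After one call to top_example(), call_count<32>() is 2 and call_count<64>() is 1. The two vectors of size 32 go through one function, while the vector of size 64 goes through another.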