Tiling-based Convolution for HLS

Article by Akshay Kamath. Last updated on Mar 4, 2022.

Motivation - ResNet-50 for HD Inputs

ResNet-50 is a popular convolutional neural network (CNN) that is 50 layers deep. At its core, ResNet-50 is a feature extraction network. Based on how it is used in a deep learning model, ResNet-50 can act as an image classifier or as a feature extractor for object detection/tracking.

Typical implementations of ResNet-50 use the ImageNet dataset for training and inference. This dataset has over a million images, but each image is quite small - just a 224 x 224 RGB image. Feature maps at this resolution easily fit on a low-end FPGA, which can then achieve real-time inference latency.

However, we are interested in scaling this model to HD inputs, say, 720 x 1280 RGB images. Due to the sheer size, it is not possible to fit entire feature maps on the FPGA (limited by on-chip memory!). In such a scenario, we need to "tile" the inputs for on-chip processing.

Let's understand this with an intermediate 3x3 convolution layer of ResNet-50 (Layer 2.1.2) when the model is fed with an HD input image.

Layer Parameter                        Value
Input Feature Map (ID, IH, IW)         (64, 184, 320)
Stride (S)                             1
Padding (P)                            1
Filter Size (OD)                       64
Weight Kernel Map (OD, ID, KH, KW)     (64, 64, 3, 3)
Output Feature Map (OD, OH, OW)        (64, 184, 320)

Convolution Equation

Each feature (f, i, j) of the output feature map (OD, OH, OW) is given by:

$$Y[f][i][j] = \mathrm{ReLU}\left( B[f] + \sum_{c=0}^{ID-1} \sum_{kh=0}^{KH-1} \sum_{kw=0}^{KW-1} X[c][i+kh-1][j+kw-1] \cdot W[f][c][kh][kw] \right)$$

where X is the input feature map, W is the weight kernel map, and B is the bias map.


📌   Each output channel is a function of all input channels and all channels of the corresponding weight kernel.

Convolution Pseudocode (Unit Stride & Padding)

for(f = 0; f < OD; f++)              // Filter Size (Output Depth)
    for(i = 0; i < OH; i++)             // Output Height
        for(j = 0; j < OW; j++)           // Output Width
            for(c = 0; c < ID; c++)        // Input Depth
                for(kh = 0; kh < KH; kh++)   // Kernel Height
                    for(kw = 0; kw < KW; kw++) // Kernel Width		
                    {
                        if(c == 0 && kh == 0 && kw == 0)  // Initialize output feature
                            Y[f][i][j]  = X[c][i+kh-1][j+kw-1] * W[f][c][kh][kw] + B[f];
                        else                              // Multiply and Accumulate (MAC)
                            Y[f][i][j] += X[c][i+kh-1][j+kw-1] * W[f][c][kh][kw];
                    }

⚠️   The equation and the pseudocode above do not describe how border features are handled.
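
One common way to handle the borders is zero padding, implemented as a bounds check on the input indices. Below is a minimal C sketch assuming the layer parameters above; it also applies the bias and the ReLU from the convolution equation. This is an illustration, not the article's actual implementation.

for(f = 0; f < OD; f++)
    for(i = 0; i < OH; i++)
        for(j = 0; j < OW; j++)
        {
            float acc = B[f];                       // start from the bias
            for(c = 0; c < ID; c++)
                for(kh = 0; kh < KH; kh++)
                    for(kw = 0; kw < KW; kw++)
                    {
                        int ih = i + kh - 1;        // input row    (P = 1)
                        int iw = j + kw - 1;        // input column (P = 1)
                        if(ih >= 0 && ih < IH && iw >= 0 && iw < IW)
                            acc += X[c][ih][iw] * W[f][c][kh][kw];
                        // out-of-range indices contribute zero (padding)
                    }
            Y[f][i][j] = (acc > 0) ? acc : 0;       // ReLU
        }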

Estimating Trivial Convolution Computation Latency

If 1 MAC operation takes 1 clock cycle, then the number of clock cycles required to compute the convolution of X(ID, IH, IW) with weight kernel W(OD, ID, KH, KW) to generate the output Y(OD, OH, OW) is given by:

$$\text{No. of clock cycles} = OD \times OH \times OW \times ID \times KH \times KW$$

→ For the layer specified above, the trivial latency is then

= 64 × 184 × 320 × 64 × 3 × 3

≈ 2.17 billion cycles

At 100 MHz (T = 10 ns), the trivial latency is about 21.7 seconds!

Tiling-Based Convolution

Feature maps are partitioned into tile cubes, each of volume (TD, TH, TW), where TD is the same as the input feature map depth (ID), TH is a divisor of IH, and TW is a divisor of IW. For an initial implementation, it is a good idea to start with small tile dimensions; sizes can be increased as feasible in further optimizations.

$$\text{No. of tiles} = \frac{IH \times IW}{TH \times TW}$$

As the input feature map depth (ID) and the filter size (OD) can be large, these too are split into smaller blocks for on-chip processing. We call these tile slices. Essentially, the granularity of the on-chip buffer volumes is decided by the dimensions of the tile slices.

❗   Note the difference between a “tile cube” and a “tile slice”. A tile cube comprises several tile slices of some fixed depth. The depth of the tile cube, on the other hand, is simply the depth of the input feature map.

Sample Design

Consider a tile slice of dimensions (8, 23, 20) for the above-mentioned ResNet-50 layer.

  • The input feature map then comprises (184 × 320) / (23 × 20) = 128 tiles.
  • The depth of each input tile cube is 64. It is split into 8 slices, each of depth 8. In other words, each tile cube (64, 23, 20) is made up of 8 tile slices, each of dimensions (8, 23, 20).
  • The filter size here is 64, i.e., 64 weight kernels. Suppose we process groups of 16 kernels at a time. Then we have 4 kernel groups.
  • The computation then involves generating partial output feature maps for each kernel group, iterating over all groups.
  • The input tile cube must be slightly taller and wider than the output tile cube to allow correct computation at the tile borders. This is left as an exercise for the reader.
  • The volumes of the on-chip buffers are (see the sketch after this list):
    • Input Tile Slice: (8, 23 + ??, 20 + ??)
    • Weight Kernel Slice: (16, 8, 3, 3)
    • Output Tile Slice: (16, 23, 20)
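
For reference, these on-chip buffers could be declared as below in HLS C. HALO stands in for the extra rows/columns from the exercise above, and all names here are illustrative, not taken from the article.

// On-chip buffers for the sample design (float assumed for clarity).
// HALO is the extra height/width needed at tile borders for the 3x3
// window; its value is deliberately left as the exercise above.
float x_tile[8][23 + HALO][20 + HALO];   // input tile slice
float w_buf[16][8][3][3];                // weight kernel slice
float y_tile[16][23][20];                // output tile slice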

The trivial tile-convolution computation latency is then given by:

$$\text{Kernel Groups} \times \text{Kernels per Group} \times \text{Channel Groups} \times \text{Channels per Group} \times KH \times KW \times TH \times TW$$

  • For the chosen tile size, the trivial latency is (4 × 16) × (8 × 8) × 3 × 3 × 23 × 20 ≈ 17M cycles.
  • This is the same as the previously computed latency divided by the number of tiles (128).
📌   The buffer volumes also determine the number of cycles required for loading and storing data from and to off-chip memory.
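
Putting the sample design together, the overall tile-processing loop nest might look like the sketch below. The bounds come from the numbers above; the load/store steps are placeholders, and none of the names are from the article.

for(t = 0; t < 128; t++)             // tile cubes
    for(kg = 0; kg < 4; kg++)        // kernel groups (16 kernels each)
    {
        // y_tile (16, 23, 20) accumulates partial outputs for this group
        for(cs = 0; cs < 8; cs++)    // channel slices (depth 8 each)
        {
            // load the input tile slice (8, 23 + ??, 20 + ??) and the
            // weight kernel slice (16, 8, 3, 3) from off-chip memory,
            // then multiply-accumulate into y_tile
        }
        // store y_tile back to off-chip memory
    }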

Computation Flow On-Chip

Each feature (f, i, j) of the output tile (64, 23, 20) is given by:

$$Y[f][i][j] = \sum_{c=0}^{63} \sum_{kh=0}^{2} \sum_{kw=0}^{2} X[c][i+kh-1][j+kw-1] \cdot W[f][c][kh][kw]$$

Let the inner summations be represented as follows:

$$Z[f][c][i][j] = \sum_{kh=0}^{2} \sum_{kw=0}^{2} X[c][i+kh-1][j+kw-1] \cdot W[f][c][kh][kw]$$

Then, $$Y[f][i][j] = \sum_{c=0}^{63} Z[f][c][i][j]$$

The overall computation can be represented as follows, where each summation indicates the on-chip processing (load-store) granularity:

$$Y[0{:}15][i][j] = \sum_{c=0}^{7} Z[0{:}15][c][i][j] + \sum_{c=8}^{15} Z[0{:}15][c][i][j] + \dots + \sum_{c=56}^{63} Z[0{:}15][c][i][j]$$

...

$$Y[48{:}63][i][j] = \sum_{c=0}^{7} Z[48{:}63][c][i][j] + \dots + \sum_{c=56}^{63} Z[48{:}63][c][i][j]$$
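
In code, this accumulation for the first kernel group could be sketched as follows, treating Z as if it were a materialized array (in practice, each Z term is computed on the fly). Names are illustrative.

for(cs = 0; cs < 8; cs++)                   // channel slice (group)
    for(c = 8 * cs; c < 8 * cs + 8; c++)    // channel within slice
        for(f = 0; f < 16; f++)             // output channel in group [0:15]
            for(i = 0; i < 23; i++)
                for(j = 0; j < 20; j++)
                {
                    if(c == 0)              // first channel initializes
                        Y[f][i][j]  = Z[f][c][i][j];
                    else                    // remaining channels accumulate
                        Y[f][i][j] += Z[f][c][i][j];
                }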

Basic Optimization

The tile buffers can be partitioned suitably to exploit parallelism. This may be considered intra-tile parallelization.

📌   If intra-tile computation latency = L, then the overall layer latency = L * No. of tiles

One way to partition the input buffer is along the width dimension TW. When implemented correctly in HLS, this results in a direct 20x speedup, since all TW = 20 width positions of a row are computed in parallel.

💡   Doubling the tile width doubles the speedup to 40x. However, it also increases LUT utilization, which limits how wide the tile can be made to exploit parallelism!
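
In Vitis HLS terms, this width-wise parallelization could be expressed roughly as below, reusing the illustrative buffers sketched earlier. The inner MAC is shown in isolation; pragma placement and indexing (which assumes the halo offsets discussed above) are illustrative.

// Partition both buffers along the width dimension so that all 20
// width positions of a row can be updated in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=x_tile complete dim=3
#pragma HLS ARRAY_PARTITION variable=y_tile complete dim=3

for(j = 0; j < 20; j++)
{
#pragma HLS UNROLL                 // 20 parallel MACs per cycle
    y_tile[f][i][j] += x_tile[c][i + kh][j + kw] * w_buf[f][c][kh][kw];
}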

Computation Flow

For illustration, consider the computation with the first weight kernel (f = 0) using partial output storage. Each of the statements below executes in a single clock cycle.

Yp[0][i][0:19]  = Z[0][0][i][0:19]

Yp[0][i][0:19] += Z[0][1][i][0:19]

...

Yp[0][i][0:19] += Z[0][63][i][0:19]

Y[0][i][0:19]   = Yp[0][i][0:19]

Estimating Computation Latency

Computation of each Z for all tile rows i ∈ [0, 22] requires TH × KH × KW cycles = 23 × 3 × 3 = 207 cycles.

Each Z is computed sequentially. So, the computation of the final outputs for each kernel takes 207 × ID = 207 × 64 = 13,248 cycles.

As there are a total of 64 kernels in the filter, the above computation has to be repeated 64 times.

Thus, the computation latency of the entire tile = 13,248 × OD = 13,248 × 64 = 847,872 cycles.

The same can be obtained by substituting the values in the equation below.

$$\frac{\text{Kernel Groups} \times \text{Kernels per Group} \times \text{Channel Groups} \times \text{Channels per Group} \times KH \times KW \times TH \times TW}{TW}$$

❗   This latency estimate does not account for data movement overhead. However, this overhead can be absorbed into the computation latency by exploiting a wider memory bus for load-store and/or using ping-pong buffers to overlap data movement with computation.
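
As a sketch of the ping-pong idea, two buffers alternate roles so that the fetch of tile t+1 overlaps the computation of tile t. Here load_tile and compute_tile are hypothetical helpers, and in HLS the two calls can run concurrently (e.g., under #pragma HLS DATAFLOW) since they touch different buffers.

load_tile(buf0, 0);                           // prologue: fetch tile 0
for(t = 0; t < NUM_TILES; t++)
{
    if(t % 2 == 0)
    {
        if(t + 1 < NUM_TILES)
            load_tile(buf1, t + 1);           // fetch next tile...
        compute_tile(buf0, t);                // ...while computing current
    }
    else
    {
        if(t + 1 < NUM_TILES)
            load_tile(buf0, t + 1);
        compute_tile(buf1, t);
    }
}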

Summary

With 20x speedup, the intra-tile latency reduces from ~17M cycles to ~848K cycles and the overall layer latency reduces from ~2.1B cycles to ~108.5M cycles. At 100MHz (T = 10ns), the computation latency is thus reduced from ~21 seconds to ~1085 ms!

Higher speedups are possible with advanced optimizations targeted at specific boards. The upper bound on speedup is typically set by the number of DSP units on the FPGA. For example, on the Ultra96v2 with 360 DSP units, a 360x speedup is achievable, reducing the overall latency to under 65 ms!