When we write code intended for HLS implementations, we tend to implement repetitive algorithms that process blocks of data — for example, signal or image processing.
As such, our HLS source code in either C or C++ tends to include several loops or nested loops. When it comes to optimizing, performance loops are one on the places we can start exploring optimization.
By default, HLS loops are kept rolled. This means that each iteration of the loop uses the same hardware. Of course, this provides the worst performance as each iteration is sequential.
Let’s take a simple accumulator, as shown below.
With no optimization, we will notice that in the analysis view the implementation is sequential. This can be confirmed by looking at the trip count, which reports the number of iterations for the loop and the initiation interval. The initiation interval is the number of clocks between then the module can accept a new input values.
To get optimal throughput, we want to ensure initiation interval is a short as possible.
To increase performance, we can unroll the loop and implement parallel implementations provided the loop bounds are static.
When this is synthesized, we will see a significant reduction in the Initiation Interval.
There is no trip count defined as the loop is fully unrolled.
When we unroll loops, we are gaining a higher performance for increased logic utilization.
Of course, there may be times when we are unable to work accept the performance of the rolled loop but unable to accommodate a fully unrolled loops logic demands.
In this situation, we can partially unroll loops, by defining a unroll factor. For an example, let’s change the size of the accumulator.
By setting the unroll factor we can control the amount of parallelization of the loop.
The analysis report will show the factored unrolled performance. In this case, tt takes half the latency of the rolled example.
You will also see this in the schedule viewer. Under the schedule viewer, you will see implementations of the algorithm.
When we partially unroll the code, the HLS tool will implement an exit check in the loop in case the unroll factor does not result in a integer. However, if it is an integer, we can skip the exit check.
But how do we handle nested loops for example if we are doing matrix operations or image processing?
In this case, we have several options, although it depends on how we implement our loops. To get the best performance (lowest latency) when working with nested loops, we want to create a perfect nested loops.
In a perfect nested loop, again the loop bounds are constant and only the inner most loop contains any functionality.
In the case of a perfect loop, we can flatten the loop merging the outer and inner loops — enabling better performance when synthesized. A flattened loop will have a rolled trip count of n * m where n and m are the number of iterations of each loop.
If we have several loops implemented sequentially, we can also merge them.
Merging loops differs from flattening loop. In flattening loop, we are merging nested loops; when merging, we are merging several independent loops.
When loops are merged, the maximum bound constraint is used.
If we want to merge the loops, there are a few constraints we need to consider first. We cannot merge loops that have FIFO accesses, variable bounds or execution side effects.
Let’s take a look at a very simple example of a loop merge, the code for which can be seen below.
When this code is synthesized with no optimization, the latency and trip counts are:
However, if we take the opportunity to merge the three loops:
The results when all three loops are merged can be seen below a much lower overall latency.
It is not just loop unrolling that we want to explore when optimizing our design for HLS. We also want to look at pipe-lining and memory partitioning, among others. I will do similar in depth deep dives soon!