MicroZed Chronicles: Tips and Tricks When Working with HLS — Part One
High Level Synthesis really is a game changer when we are developing algorithms for our programmable logic devices.
Not only does it allow us to work at a higher level of abstraction, it also cuts down the verification time as we can perform the initial test benching using un-timed C test benches. We can even use the same C test bench for Co-Simulation with HDL at later stages in development.
Several times throughout this series we have looked at using HLS for image processing. In this blog, we are going to look at how we can use HLS to create IP blocks for signal processing, and along the way I am going to point out a number of useful tips and tricks. These include:
Defining the correct approach to implement the desired algorithm
Defining the correct interface type for your application
Optimizing throughput and pipelining
Understanding different block RAM structures
Working with fixed point arbitrary precision number systems
The first thing we need to do is define the problem we wish to solve. In this example, we will address an issue faced by many industrial and control applications.
It is common to monitor temperatures in industrial applications using either a thermistor or a platinum resistance thermometer (PRT). Both are non-linear and require a conversion from the measured resistance to the actual temperature to be of further use.
Of the two, the PRT requires the more complicated conversion: the Callendar-Van Dusen equation. This equation is named after H. L. Callendar and M. S. Van Dusen; Van Dusen published the extended form of the equation in 1925.
The short form of the equation can be seen below, which provides the resistance for a specific temperature.
R = R₀ × (1 + a×t + b×t²)
Naturally, for our applications we want to go in the opposite direction, from a resistance to a temperature. As such, we can rearrange the equation using the quadratic formula, taking the positive root:

t = (−a + √(a² − 4×b×(1 − R/R₀))) / (2×b)
While we could implement the above equation directly, it would be complex no matter whether we implemented it in HLS or HDL.
Another approach to implementing the Callendar-Van Dusen equation is to implement a polynomial equation which fits the curve with sufficient accuracy. This is the approach we will take for this example.
Having decided on how we will implement the equation, the next step is to open Vivado HLS (I am using 2018.2) and create a new project.
Within the project, we are going to create three files — two source files and one header file.
C Source file implementing the algorithm
C Source file as the test bench
C Header file defining the function(s) for acceleration
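To give a flavour of how the files fit together, below is a minimal sketch of the un-timed C test bench. The names cvd_convert and SAMPLES, and the stimulus values, are my own assumptions for illustration, not the project's actual identifiers; the device function body would normally live in the separate source file and is inlined here only so the sketch is self-contained, and the check itself would normally sit in main(), where returning 0 signals a pass to both C simulation and co-simulation.

```cpp
#include <cstdio>

#define SAMPLES 10

// Normally declared in the shared header and defined in the source
// file under test; inlined here so the sketch is self-contained
void cvd_convert(float x[SAMPLES], float y[SAMPLES]) {
    for (int i = 0; i < SAMPLES; i++)
        y[i] = 2e-9f * x[i] * x[i] * x[i] * x[i]
             - 4e-7f * x[i] * x[i] * x[i]
             + 0.0011f * x[i] * x[i]
             + 2.403f * x[i] - 251.26f;
}

// In the real project this logic sits in main() of the test bench
int test_cvd() {
    float resistance[SAMPLES];
    float temperature[SAMPLES];

    // Drive the function with resistances around the PT100 range
    for (int i = 0; i < SAMPLES; i++)
        resistance[i] = 90.0f + 10.0f * i;

    cvd_convert(resistance, temperature);

    // 100 ohms should convert to roughly 0 degC for a PT100 sensor
    if (temperature[1] < -1.0f || temperature[1] > 1.0f) {
        printf("Test failed\n");
        return 1;
    }
    printf("Test passed\n");
    return 0;
}
```
Because the test bench is plain C, exactly the same stimulus and checks are reused later for co-simulation against the generated HDL.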
As I stated above, the equation we are implementing is:
y = 2E-09×x⁴ − 4E-07×x³ + 0.0011×x² + 2.403×x − 251.26
Just like we would for any C application, we can write this in a single line of code, and yes, we can even use floating point representation (we are going to look much more at this in part two).
I have written the code in such a way that I can select to process either a single resistance value or several. For this initial application, I will process a batch of 10 results at a time.
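A sketch of what that source function might look like is shown below. The function and loop labels are my assumptions for illustration, but the arithmetic is the polynomial above written directly in C using floating point.

```cpp
#define SAMPLES 10  // batch of results processed per call

// Hypothetical function name: x holds the measured resistances and
// y receives the calculated temperatures
void cvd_convert(float x[SAMPLES], float y[SAMPLES]) {
    convert_loop:
    for (int i = 0; i < SAMPLES; i++) {
        float r = x[i];
        // The fourth-order polynomial fit, written in a single line of
        // code just as we would for any C application
        y[i] = 2e-9f * r * r * r * r - 4e-7f * r * r * r
             + 0.0011f * r * r + 2.403f * r - 251.26f;
    }
}
```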
When we run this through C Synthesis we find the X and Y interfaces are implemented as block memories. This is to be expected for arrays when they are synthesized into hardware.
Of course, as we are just passing data through the function, a FIFO interface would be better and simpler. As such, we can use the pragma HLS INTERFACE ap_fifo to ensure FIFOs are used in place of RAMs when the function is synthesized.
If we so desire, we can also use AXI interfaces (m_axi, axis or s_axilite) if we want to interface our HLS core with the PS of a Zynq / Zynq MPSoC, or with IP that uses AXI.
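As a sketch (the port and function names are my assumptions), the FIFO interfaces are requested with pragmas at the top of the function, and the commented lines show where the AXI alternatives would go. An ordinary compiler simply ignores HLS pragmas, so the same source still runs in the C test bench.

```cpp
#define SAMPLES 10

void cvd_convert(float x[SAMPLES], float y[SAMPLES]) {
    // Replace the default block RAM ports with FIFO streams
    #pragma HLS INTERFACE ap_fifo port=x
    #pragma HLS INTERFACE ap_fifo port=y
    // AXI alternatives for connecting to the PS or other AXI IP:
    // #pragma HLS INTERFACE axis port=x
    // #pragma HLS INTERFACE axis port=y
    // #pragma HLS INTERFACE s_axilite port=return

    for (int i = 0; i < SAMPLES; i++) {
        float r = x[i];
        y[i] = 2e-9f * r * r * r * r - 4e-7f * r * r * r
             + 0.0011f * r * r + 2.403f * r - 251.26f;
    }
}
```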
The next thing we need to look at is the performance. The initial synthesis implementation does not pipeline or unroll loops. As a result, in order to implement the 10 calculations, it takes 551 clock cycles. We can make this function perform much better.
By using the pragma HLS PIPELINE, we can improve the function's throughput. HLS PIPELINE can be applied to both loops and functions, and pipelining a region will automatically unroll any loops nested within it.
Pipelining reduces the initiation interval (II) of the function — the II describes when the function can begin to process new data. As such, to get high performance and throughput, we want the II to be as low as possible.
It stands to reason that pipelining the function will also use more resources; in this case, as we are using floating point, significantly more. We can also fine-tune the pipeline with the rewind option, which reduces the delay between successive loop iterations.
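Applied to our example, the pragma sits inside the body of the loop to be pipelined. A sketch, again with assumed names:

```cpp
#define SAMPLES 10

void cvd_convert(float x[SAMPLES], float y[SAMPLES]) {
    for (int i = 0; i < SAMPLES; i++) {
        // Request a new loop iteration every clock (II = 1); rewind
        // removes the pause between successive runs of the loop
        #pragma HLS PIPELINE II=1 rewind
        float r = x[i];
        y[i] = 2e-9f * r * r * r * r - 4e-7f * r * r * r
             + 0.0011f * r * r + 2.403f * r - 251.26f;
    }
}
```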
Although it is not the case in this example, if we have several HLS functions which work together, we can also use the HLS DATAFLOW pragma to enable function-level pipelining and further optimize the throughput.
If we use memory structures (C arrays) internal to our function, and not just on the IO, these will be implemented using block RAM. Because block RAM has a limited number of read and write ports, these internal memories can become performance bottlenecks.
To address this, we can change how these block RAMs are organised to store data more effectively for acceleration. The pragma HLS ARRAY_PARTITION allows us to control how we split up block memories and store data. With this pragma, we can use one of three approaches: block, cyclic or complete.
If necessary, we can also use the HLS ARRAY_RESHAPE pragma, which partitions an array and then recombines the resulting elements into wider memory words.
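To illustrate, the sketch below uses a hypothetical kernel (the names window_sum, buffer and N are my own) that makes four reads from an internal buffer every iteration; cyclic partitioning lets those reads happen in parallel, and the commented lines show the block and complete alternatives along with ARRAY_RESHAPE.

```cpp
#define N 16

// Hypothetical kernel: sums a four-element window from an internal buffer
void window_sum(const float in[N], float out[N]) {
    float buffer[N];
    // cyclic  : interleave elements round-robin across several RAMs
    #pragma HLS ARRAY_PARTITION variable=buffer cyclic factor=4
    // block   : split the array into consecutive chunks instead
    // #pragma HLS ARRAY_PARTITION variable=buffer block factor=4
    // complete: dissolve the array entirely into registers
    // #pragma HLS ARRAY_PARTITION variable=buffer complete
    // reshape : recombine the partitions into one wider memory word
    // #pragma HLS ARRAY_RESHAPE variable=buffer cyclic factor=4

    for (int i = 0; i < N; i++)
        buffer[i] = in[i];

    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        // Four buffer reads per iteration benefit from the partitioning
        for (int j = 0; j < 4; j++)
            sum += buffer[(i + j) % N];
        out[i] = sum;
    }
}
```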
By this point, we have implemented a simple, commonly used industrial equation with HLS, which we can deploy within our FPGA design, and it provides a reasonable throughput.
However, we are still working with floating-point numbers at this point, which require significant resources and impact throughput. A much more efficient method is to implement a solution that uses fixed-point numbers.
Next week, we will look at how we can convert this function to work with the fixed point arbitrary precision libraries, which enable us to achieve even better results.
You can find the files associated with this project here: