Many of the FPGAs and SoCs I design for clients are used in high reliability / mission critical systems, across a range of applications including automotive, defense, aerospace and space.
This means I want to ensure any errors which might occur in the system can be either tolerated or avoided.
Of course, creating a mission critical system is a holistic task that must consider the system architecture, hardware design and programmable logic design, and that is before we even start thinking about verification, analysis and certification.
However, as logic designers, we should be familiar with capabilities provided internally within programmable logic that help us avoid and tolerate errors.
One such capability is the support for Error Correcting Codes (ECC) on Block RAMs, which enables a single bit error to be corrected and a double bit error to be detected.
In our designs, we use Block RAMs for many different types of storage. The data may be stored for only a short period of time, e.g. in a FIFO. In other applications, we may be storing data in Block RAM for longer periods — for example, configuration settings in a processing pipeline.
When we store data in Block RAM structures, there is the potential for it to be corrupted. This corruption could be caused by several factors, such as soft errors from atmospheric neutron radiation. This is typically not a significant issue at ground level and is experienced more by aerospace and high-altitude applications, but in some applications we have to consider it.
To help us tolerate these issues, the Block RAMs provided in Seven Series, UltraScale and UltraScale+ devices contain Error Detection and Correction (EDAC) capabilities.
This comes from the ability of the RAMB36E1 / RAMB36E2 to implement 512 by 72-bit memory structures when configured as simple dual port RAMs. This means we can store 64 bits of data with 8 bits of parity. These eight parity bits enable single error correction and double error detection of data within BRAMs and FIFOs using a Hamming code.
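To see why eight parity bits are enough for 64 data bits, it can help to model the idea in software. The sketch below is a generic extended Hamming (72,64) code: seven Hamming parity bits at the power-of-two positions plus one overall parity bit. It is an illustration only. The bit layout and the names encode and decode are my own assumptions, not the parity matrix or port names the BRAM hardware uses.

```python
# Generic extended Hamming (72,64) SECDED model.
# Assumption: this bit layout is illustrative, NOT the BRAM's actual encoding.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]           # 7 Hamming parity bits
DATA_POSITIONS = [p for p in range(1, 72)
                  if p not in PARITY_POSITIONS]       # 64 data bit positions


def _syndrome(word: int) -> int:
    """XOR together the positions (1..71) of every set bit."""
    syndrome = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            syndrome ^= pos
    return syndrome


def encode(data: int) -> int:
    """Return a 72-bit codeword for a 64-bit data word."""
    assert 0 <= data < (1 << 64)
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Set the Hamming parity bits so the syndrome of the codeword is zero.
    syndrome = _syndrome(word)
    for p in PARITY_POSITIONS:
        if syndrome & p:
            word |= 1 << p
    # Bit 0 is the overall parity bit: make the total number of ones even.
    if bin(word).count("1") % 2:
        word |= 1
    return word


def decode(word: int):
    """Return (data, sbiterr, dbiterr) for a 72-bit codeword."""
    syndrome = _syndrome(word)
    overall_odd = bin(word).count("1") % 2 == 1
    sbiterr = dbiterr = False
    if overall_odd:
        # Odd parity: treat as a single flipped bit at position `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        sbiterr = True
        word ^= 1 << syndrome
    elif syndrome != 0:
        # Even parity but non-zero syndrome: two bits flipped, uncorrectable.
        dbiterr = True
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, sbiterr, dbiterr
```

Flipping any one of the 72 bits of an encoded word and passing it to decode returns the original data with sbiterr set, while flipping any two bits sets dbiterr instead.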
When we configure BRAMs in this way, several optional inputs and outputs are now present on the BRAM block. These include:
injectsbiterr: used for testing to inject a single bit error
injectdbiterr: used for testing to inject a double bit error
sbiterr: indicates a single bit error has occurred
dbiterr: indicates a double bit error has occurred
rdaddrecc: the address of the current data output
In normal operation, we can use the sbiterr and dbiterr outputs to determine the correctness of the output data. If the word at the addressed memory location contains no error, then neither sbiterr nor dbiterr will be asserted. If a single bit error has been detected and corrected, then sbiterr will be asserted. In this case, the output word can be safely used as the error has been corrected.
If two bits are in error, dbiterr will be asserted and the output word cannot be safely used. Knowing this allows us at the system level to take the appropriate action.
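The three cases above can be summarised as a small decision function. This is only a sketch of one possible policy. The names ReadStatus and classify_read are hypothetical, and the actions attached to each status are examples of what a system might choose to do rather than anything the BRAM itself mandates.

```python
from enum import Enum


class ReadStatus(Enum):
    # The action strings are illustrative system-level choices.
    CLEAN = "use data"
    CORRECTED = "use data, schedule write-back"
    UNCORRECTABLE = "discard data, raise fault"


def classify_read(sbiterr: bool, dbiterr: bool) -> ReadStatus:
    """Map the two ECC status flags onto a system-level decision."""
    if dbiterr:
        return ReadStatus.UNCORRECTABLE   # output word cannot be trusted
    if sbiterr:
        return ReadStatus.CORRECTED       # output word is good, memory is not
    return ReadStatus.CLEAN
```

Checking dbiterr first is a deliberately defensive choice, so that an uncorrectable word is never treated as usable.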
Of course, when a single bit error is detected and corrected, only the output word is corrected. The corrupted contents of the BRAM itself are not updated. To aid in the correction of the corrupted BRAM contents, the block provides an output signal, rdaddrecc.
This signal indicates the address of the current output. It can be used in conjunction with a simple state machine to write back the corrected value if a single bit error is reported. An alternative approach would be to cycle around all of the memory locations periodically, refreshing and correcting any single bit errors (a technique called scrubbing).
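A scrubbing pass can be sketched in software to show the control flow the state machine would follow. The EccRamModel class below is a hypothetical behavioural stand-in that I have made up for illustration, not a vendor simulation model. It tracks which bits of each word have been upset rather than implementing the real ECC, and for simplicity it treats two or more upsets in a word as an uncorrectable error.

```python
class EccRamModel:
    """Hypothetical behavioural stand-in for an ECC-enabled BRAM."""

    def __init__(self, depth=512):
        self.words = [0] * depth
        self.upsets = [set() for _ in range(depth)]   # flipped bits per word

    def write(self, addr, value):
        self.words[addr] = value & ((1 << 64) - 1)
        self.upsets[addr] = set()     # a write re-encodes the word

    def upset(self, addr, bit):
        self.upsets[addr] ^= {bit}    # toggle one stored bit (models an SEU)

    def read(self, addr):
        flipped = self.upsets[addr]
        sbiterr = len(flipped) == 1
        dbiterr = len(flipped) >= 2   # simplification, see lead-in
        value = self.words[addr]
        if dbiterr:                   # uncorrected: corruption reaches output
            for bit in flipped:
                value ^= 1 << bit
        return value, sbiterr, dbiterr, addr   # last item plays rdaddrecc


def scrub(ram):
    """Read every location; write back corrected words, log the rest."""
    corrected, uncorrectable = [], []
    for addr in range(len(ram.words)):
        value, sbiterr, dbiterr, rdaddrecc = ram.read(addr)
        if sbiterr:
            ram.write(rdaddrecc, value)       # repair the stored contents
            corrected.append(rdaddrecc)
        elif dbiterr:
            uncorrectable.append(rdaddrecc)   # needs system-level handling
    return corrected, uncorrectable
```

The same read, check and write-back sequence applies to the on-demand approach. The only difference is that the write-back is triggered by a normal read reporting sbiterr rather than by a periodic sweep.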
We will want to test our system's handling of single- and double-bit errors in simulation. This is where injectsbiterr and injectdbiterr come into play. These inputs let us inject either a single- or double-bit error as we write the data into the BRAM.
To inject an error, we set injectsbiterr or injectdbiterr high during the write, as below.
When these addresses are then read out from the BRAM, you will see the corresponding error flag asserted.
We can then check that our design takes the appropriate action depending on whether the error is a single- or double-bit error.
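The inject-then-check flow can be prototyped as a quick behavioural model before writing the HDL test bench. Everything in this sketch is an assumption made for illustration: the class name InjectableEccRam, the choice of which bits get flipped, and the rule that a double bit injection wins if both inputs are set. The memory resources user guide for your device family is the place to confirm what the hardware actually does.

```python
MASK64 = (1 << 64) - 1
SBIT_FLIP = 1 << 30                  # illustrative choice of corrupted bit
DBIT_FLIP = (1 << 30) | (1 << 62)    # illustrative choice of corrupted bits


class InjectableEccRam:
    """Hypothetical model of error injection on write (not a vendor model)."""

    def __init__(self, depth=512):
        self.golden = [0] * depth    # data as originally written
        self.errors = [0] * depth    # XOR mask of bits corrupted on write

    def write(self, addr, data, injectsbiterr=False, injectdbiterr=False):
        self.golden[addr] = data & MASK64
        if injectdbiterr:            # assumption: double wins if both are set
            self.errors[addr] = DBIT_FLIP
        elif injectsbiterr:
            self.errors[addr] = SBIT_FLIP
        else:
            self.errors[addr] = 0

    def read(self, addr):
        flipped = bin(self.errors[addr]).count("1")
        sbiterr, dbiterr = flipped == 1, flipped == 2
        # A single error is corrected on output; a double error is not.
        dout = self.golden[addr] ^ (self.errors[addr] if dbiterr else 0)
        return dout, sbiterr, dbiterr


# Write one clean word, one with a single bit error, one with a double.
ram = InjectableEccRam()
ram.write(0, 0x1234)
ram.write(1, 0x1234, injectsbiterr=True)
ram.write(2, 0x1234, injectdbiterr=True)

assert ram.read(0) == (0x1234, False, False)   # clean read
assert ram.read(1) == (0x1234, True, False)    # corrected, flag raised
dout, sbiterr, dbiterr = ram.read(2)
assert dbiterr and not sbiterr and dout != 0x1234   # data not trustworthy
```

In a real test bench the three assertions would become checks on the design's response, for example that a write-back is issued for address 1 and a fault is raised for address 2.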
Now we understand a little more about how we can use the ECC capabilities provided by the Block RAM to create designs that are more tolerant of Block RAM corruption.