Angelo Jacobo

UberDDR3 Feature Update: Error Correction (Part 2) - Post #6

In Part 1, we explored how you can now add error protection to your designs by just enabling the ECC_ENABLE parameter on UberDDR3. We covered burst-granular (ECC_ENABLE = 1) and word-granular (ECC_ENABLE = 2) sideband ECC. Now, we will discuss the third and final ECC option for UberDDR3, inspired by LPDDR4: In-line ECC.



I. What is In-line ECC?

The third and final option for ECC is In-line ECC, inspired by LPDDR4. But before we dive into it, let’s clarify what “In-line ECC” is and how it differs from “Sideband ECC.”

  • In Sideband ECC, the ECC code is sent as sideband data along with the actual data to memory. For example, an x64 DRAM will require 8 additional bits for ECC storage, making the total device width 72 bits.

  • In In-line ECC, the ECC code is not sent with the actual data. Instead, the controller generates separate overhead write and read commands for the ECC codes. Thus, an x64 DRAM remains x64, but there is a penalty in throughput due to the overhead write and read commands.


For a detailed and excellent deep dive into Sideband ECC, In-line ECC, and other types of ECC, check out this article from Synopsys. For a simpler introduction to In-line ECC, I recommend this short video from Cadence.


Burst-granular (ECC_ENABLE = 1) and Word-granular (ECC_ENABLE = 2) Sideband ECC are simple to implement, as we just extend the bus width to store the ECC codes, as explained in Part 1 of this series. However, with ECC_ENABLE = 3, we need to address two things:

  • Map the memory space to store the ECC codes.

  • Implement overhead write and read access for the ECC codes.


II. Mapping Memory Space for ECC Parity Bits

II.I Dilemma: How to allot Memory Space for ECC?

In-line ECC stores the ECC parity bits in separate memory space from the actual data. The simplest solution we can think of is to allot the 8th bank of the DDR3 DRAM for ECC parity bits and the remaining 7 banks for data. Something like this:

BUT, we have to take care of penalty in throughput due to address mapping.


Let me first explain address mapping. The memory address mapping for a DRAM is best set to bank interleaving: {row, bank, column}. This means that once we reach the last column, we go to the next bank, and once we reach the last bank, we go to the next row. This avoids the timing penalty of changing rows on the same bank during sequential access.


The diagram below illustrates a sequential access from address 0 to address 65:

To move from address 3 to address 4, there’s a transition across banks, from bank 0 to bank 1. Similarly, to go from address 63 to address 64, it’s necessary to switch rows, from row 1 to row 2, wrapping from bank 7 back to bank 0.


Now, why am I explaining this bank-interleaving memory address mapping? Because if we go back to the previous idea of allotting bank 7 for ECC codes, a bank-interleaving mapping means that bank 7 will be overwritten with data! In the illustration above, write requests to addresses 28 to 31 land in bank 7 and will overwrite the ECC codes stored there.
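To make the {row, bank, column} mapping concrete, here is a minimal Python sketch. The geometry (4 columns per row and 8 banks) is an assumption inferred from the address numbers quoted above, and `decode_row_bank_col` is a hypothetical helper, not part of UberDDR3.

```python
# Assumed toy geometry, inferred from the example addresses in the text.
COLS_PER_ROW = 4
NUM_BANKS = 8

def decode_row_bank_col(addr):
    """Split a linear address using the {row, bank, column} order.

    Returns (bank, row, col); the column is the least significant field.
    """
    col = addr % COLS_PER_ROW
    bank = (addr // COLS_PER_ROW) % NUM_BANKS
    row = addr // (COLS_PER_ROW * NUM_BANKS)
    return bank, row, col

if __name__ == "__main__":
    for addr in (3, 4, 28, 31, 63, 64):
        bank, row, col = decode_row_bank_col(addr)
        print(f"addr {addr:3d} -> bank {bank}, row {row}, col {col}")
```

Under these assumptions, addresses 28 to 31 decode to bank 7, which is why a dedicated ECC bank would be clobbered by ordinary sequential writes.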


We have to find a way such that sequential access will not hit the memory space for ECC. Now, considering this statement, why not just put the ECC codes on the last rows of each bank?


To put this into perspective:


Now, this is actually a good idea. Basically, the ECC codes will be stored in the last set of addresses in the {row, bank, column} mapping, which are the last few rows of each bank. From the user's perspective, the addresses stop at address 191, but internally there are addresses 192 to 223 where the ECC codes are stored.


HOWEVER, there is still a caveat here. Considering the fact that the ECC codes are spread out across all banks, a potential issue arises during random access: there’s a significant chance that we may attempt to access a memory bank while its last row, designated for ECC storage, is open. This situation incurs a timing penalty, requiring a precharge and activation cycle for that bank.


For example, we might access bank 7 which has its last row opened due to ECC write access done by bank 0 (assuming bank 0 writes ECC code to bank 7). Since the wrong row is already opened, we need to close this using the precharge command and then open the right row we want to access using the activate command. That's a waste of time!


Hmmmm, let us think of another way. Maybe there is a memory mapping we can do where the ECC codes are on the last set of addresses AND at the same time minimize the number of banks that need to store ECC?


II.II Answer to Dilemma: Customized Memory Mapping Strategy

As mentioned, memory mapping is optimally set to bank interleaving in the order of {row, bank, column} to avoid the timing penalty associated with switching rows within the same bank. However, is there a strategy that retains the benefits of bank interleaving while ensuring that the last set of addresses, designated for storing ECC, is not spread out across all banks as in {row, bank, column}?


My proposed solution is to use the mapping: {bank[2:1], row, bank[0], column}

This is hard to explain in words, so to illustrate:


Bank interleaving occurs between every pair of banks (bank 0/1, bank 2/3, bank 4/5, bank 6/7). When the last column of bank 0 is reached, the mapping switches to bank 1. Subsequently, upon reaching the last column of bank 1, it switches back to bank 0, but in the next row. Unlike the typical {row, bank, column} mapping, the last set of addresses are not the last row of every bank but are now exclusively on the last rows of banks 6 and 7 only!


The beauty of this approach is that if we allocate the last set of addresses (the last few rows of banks 6 and 7) as the memory space for ECC codes, we can ensure that those addresses are not spread out but are just contained within banks 6 and 7. Banks 0 to 5 do not have to store ECC codes, thereby reducing the likelihood of accessing a bank that has an already opened row due to ECC code accesses.
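The difference between the two mappings can be checked with a small Python sketch. The field widths (2 column bits, 3 row bits) are assumptions chosen only to keep the address space tiny, and both decoder functions are hypothetical helpers rather than UberDDR3 code. The sketch decodes the last eighth of the address space under each mapping and lists which banks it touches.

```python
COL_BITS = 2   # assumed: 4 columns per row
ROW_BITS = 3   # assumed: 8 rows per bank, to keep the example small
BANK_BITS = 3  # DDR3 has 8 banks

def decode_standard(addr):
    """{row, bank, column} order. Returns (bank, row, col)."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & 0b111
    row = addr >> (COL_BITS + BANK_BITS)
    return bank, row, col

def decode_custom(addr):
    """{bank[2:1], row, bank[0], column} order. Returns (bank, row, col)."""
    col = addr & ((1 << COL_BITS) - 1)
    bank_lo = (addr >> COL_BITS) & 1
    row = (addr >> (COL_BITS + 1)) & ((1 << ROW_BITS) - 1)
    bank_hi = (addr >> (COL_BITS + 1 + ROW_BITS)) & 0b11
    return (bank_hi << 1) | bank_lo, row, col

TOTAL = 1 << (BANK_BITS + ROW_BITS + COL_BITS)
# The last eighth of the linear address space, reserved here for ECC.
ECC_REGION = range(TOTAL - TOTAL // 8, TOTAL)

banks_standard = sorted({decode_standard(a)[0] for a in ECC_REGION})
banks_custom = sorted({decode_custom(a)[0] for a in ECC_REGION})

if __name__ == "__main__":
    print("standard mapping, ECC banks:", banks_standard)
    print("custom mapping,   ECC banks:", banks_custom)
```

With these assumed widths, the standard mapping spreads the reserved region over all eight banks, while the custom mapping confines it to banks 6 and 7. Sequential access still alternates between the two banks of a pair before advancing to the next row.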


It’s important to note that there are other methods for memory mapping the space for ECC code, but this strategy is what I consider to be balanced in terms of pros and cons.


II.III Mapping Data Address to ECC Address

Now that we have decided to use the memory mapping {bank[2:1], row, bank[0], column}, where the memory space for the ECC codes is on the last few rows of bank 6 and bank 7, we have to decide specifically which address in the ECC memory space will hold the ECC codes of each data word.


Quoting this text from the Synopsys article about in-line ECC:

... Every WR and RD command for the actual data is accompanied with an overhead WR and RD command respectively for the ECC data. High-performance controllers reduce the penalty of such overhead ECC commands by packing the ECC data of several consecutive addresses in one overhead ECC WR command....
Hence, the more sequential the traffic pattern is, the latency penalty is less due to such ECC overhead commands. ....

The main idea here is that we can pack the ECC codes of multiple data before doing a single overhead write command for ECC.


For example, an x64 DDR3 has a total word width of 512 bits (8 bursts of x64). Each x64 burst needs 8 parity bits, and since there are eight bursts, each word produces a total of 64 ECC parity bits. Since a single write request to DDR3 always has a total word width of 512 bits, we can wait until we do 8 sequential writes so we can pack 8 sets of 64 parity bits for a total of 512 bits!


This is shown below:

As noted here, the ECC codes for "Data 0" to "Data 7" are packed together as a single write request for "Write 8".
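The packing arithmetic can be sketched in a few lines of Python. The slot ordering (slot i occupying bits 64*i to 64*i+63) is an assumption made for illustration, and the function names are hypothetical; the source only states that eight 64-bit parity fields form one 512-bit word.

```python
BURSTS_PER_WORD = 8         # 8 bursts of x64 form one 512-bit word
PARITY_BITS_PER_BURST = 8   # 8 parity bits protect each 64-bit burst
PARITY_BITS_PER_WORD = BURSTS_PER_WORD * PARITY_BITS_PER_BURST  # 64
SLOTS_PER_ECC_WORD = 512 // PARITY_BITS_PER_WORD                # 8

SLOT_MASK = (1 << PARITY_BITS_PER_WORD) - 1

def pack_parity_slots(parities):
    """Concatenate eight 64-bit parity fields into one 512-bit ECC word.

    Assumed layout: slot i sits at bits [64*i, 64*i + 63].
    """
    if len(parities) != SLOTS_PER_ECC_WORD:
        raise ValueError("expected exactly 8 parity fields")
    word = 0
    for slot, parity in enumerate(parities):
        if not 0 <= parity <= SLOT_MASK:
            raise ValueError("each parity field must fit in 64 bits")
        word |= parity << (PARITY_BITS_PER_WORD * slot)
    return word

def unpack_parity_slot(word, slot):
    """Extract the 64-bit parity field stored in the given slot."""
    return (word >> (PARITY_BITS_PER_WORD * slot)) & SLOT_MASK

if __name__ == "__main__":
    # Made-up parity values standing in for "Data 0" to "Data 7".
    fields = [0x1111111111111111 * (i + 1) for i in range(8)]
    ecc_word = pack_parity_slots(fields)
    print(f"ECC word uses {ecc_word.bit_length()} bits")
    print(hex(unpack_parity_slot(ecc_word, 3)))
```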


Now, the question is, to which ECC address will “Write 8” be written? We know that the ECC memory space is somewhere in the last rows of banks 6 and 7, but which specific address will it be?

After my analysis, the best way to map the data memory address to the ECC address is using this golden formula 😎:

  • ECC_BANK = {11, !bank[0]}

  • ECC_ROW = {1, row >> 1}

  • ECC_COL = {row[0], bank[2:1], column >> 3}


If this confuses anyone, apologies, as I do not know how best to explain this other than going straight to the point. As before, the best way to explain this is to illustrate:


First, ECC_BANK = {11, !bank[0]} ensures that ECC codes for data on odd banks (bank 1/3/5/7) are stored on bank 6, while ECC codes for data on even banks (bank 0/2/4/6) are stored on bank 7. Why do we need to do this?

  • This ensures that writing data to bank 6 will not also store the ECC code on that same bank but on bank 7. The same applies to bank 7, where its ECC codes will be stored on bank 6 and not on that same bank. This prevents switching rows on the same bank just to store the ECC codes.


Second, ECC_ROW = {1, row >> 1} ensures that ECC codes will be stored in the last rows of the corresponding bank specified by ECC_BANK = {11, !bank[0]}. Because the most significant bit of ECC_ROW is always 1, all mapped ECC addresses fall in the upper half of the row range of banks 6 and 7.


Lastly, the ECC_COL = {row[0], bank[2:1], column>>3} is quite complicated to explain but basically, this ensures that the addresses of all banks will be mapped cleanly and there will be no duplicated mapped ECC address.
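The three formulas can be exercised with a short Python sketch. It assumes that `column` counts full 512-bit words (so `column >> 3` groups 8 consecutive words), and the default field widths are arbitrary small values picked for illustration. `ecc_address` and `is_user_data` are hypothetical helpers, not names from the UberDDR3 source.

```python
def ecc_address(bank, row, col, row_bits=4, col_bits=5):
    """Map a data address to its ECC address using the formulas above.

    ECC_BANK = {11, !bank[0]}
    ECC_ROW  = {1, row >> 1}
    ECC_COL  = {row[0], bank[2:1], column >> 3}
    `col` is assumed to be in units of full words. Returns (bank, row, col).
    """
    ecc_bank = 0b110 | ((bank & 1) ^ 1)
    ecc_row = (1 << (row_bits - 1)) | (row >> 1)
    ecc_col = (((row & 1) << (col_bits - 1))
               | ((bank >> 1) << (col_bits - 3))
               | (col >> 3))
    return ecc_bank, ecc_row, ecc_col

def is_user_data(bank, row, row_bits=4):
    """True if the location lies outside the reserved ECC region."""
    return not (bank >= 6 and row >= (1 << (row_bits - 1)))

def check_mapping(row_bits=4, col_bits=5):
    """Return True if no two 8-word groups share an ECC address."""
    seen = {}
    for bank in range(8):
        for row in range(1 << row_bits):
            if not is_user_data(bank, row, row_bits):
                continue
            for col in range(1 << col_bits):
                group = (bank, row, col >> 3)
                ecc = ecc_address(bank, row, col, row_bits, col_bits)
                if seen.setdefault(ecc, group) != group:
                    return False
    return True

if __name__ == "__main__":
    print(ecc_address(0, 0, 0))
    print("collision free:", check_mapping())
```

Under these assumptions, every group of 8 consecutive words gets its own ECC address, the ECC address always lands in bank 6 or 7 with the top row bit set, and it never shares a bank with the data it protects.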


III. Implementation of Overhead Write and Read Requests

III.I Hardware Architecture

Now that we understand how to map data addresses to their corresponding ECC addresses, let’s implement the overhead read/write commands on UberDDR3.


To handle these ECC requests, we need to modify the pipeline design. As mentioned in the previous blog post "UberDDR3: An Opensource DDR3 Controller", UberDDR3 has two pipeline stages:

  • Stage 1: Data from a Wishbone write request is stored here.

  • Stage 2: Data is then piped from Stage 1 to the DDR3 PHY.


For in-line ECC, we introduce a new Stage 0. Here’s a simple illustration of the data flow:


Consider an x64 DRAM with a total word width of 512 bits (8 bursts of x64). For a write request:

  • The user sends 512-bit Wishbone write data to i_wb_data. If stage 1 and 2 are busy, the write data is stored on stage 0 (stage0_data). Otherwise, it goes to Stage 1 (stage1_data).

  • In Stage 1, eight ECC encoders generate 8 parity bits for each 64-bit burst, totaling 64 parity bits for the 8 bursts.

  • If the current transaction is not an overhead write ECC request, the 512-bit write data from stage 1 is piped to stage 2 (stage2_data_unaligned), and the 64 parity bits are stored in stage2_encoded_parity. The unaligned stage 2 data is then piped to a write data alignment block and is then sent to the DDR3 PHY.

  • The 64-bit stage2_encoded_parity is stored in one of the eight slots of stage2_ecc_write_data_q. As data is written sequentially, these slots fill up until they form a 512-bit word.

  • Finally, if the current transaction is an overhead write request, the 512-bit stage2_ecc_write_data_q (concatenation of ECC parity bits from multiple words) is piped to stage2_data_unaligned instead of the stage 1 data.


For read requests, the data flow is as follows:

  • Data comes from the DRAM, goes to the PHY, and then to the read data alignment of the DDR3 controller.

  • If the current transaction is an overhead read request (reading ECC parity bits stored in DRAM), the 512-bit read data (o_wb_data_q_current) is stored in stage2_ecc_read_data_q.

  • If the current transaction is a normal read request, the 512-bit read data from o_wb_data_q_current is received by the ECC decoders along with the 64-bit decode_parity from one of the eight slots in stage2_ecc_read_data_q.

  • The ECC decoders decode each 64-bit burst of the 512-bit data and output the corrected data to o_wb_data.
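To illustrate the read path, here is a Python sketch that splits a 512-bit word into eight 64-bit bursts and decodes each with its 8 parity bits. The code construction is a generic extended Hamming (72,64) SECDED scheme chosen for illustration; it is an assumption and not necessarily the code UberDDR3 uses, and the bit ordering of the parity field is likewise assumed.

```python
MASK64 = (1 << 64) - 1
# Codeword positions 1..71: powers of two hold check bits, the rest hold data.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]

def _position_xor(data):
    """XOR together the codeword positions of every set data bit."""
    acc = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            acc ^= pos
    return acc

def _ones(x):
    return bin(x).count("1")

def encode_burst(data):
    """Return 8 parity bits: 7 Hamming check bits plus 1 overall parity bit."""
    check = _position_xor(data)
    overall = (_ones(data) + _ones(check)) & 1
    return check | (overall << 7)

def decode_burst(data, parity):
    """Return (data, status); corrects one flipped bit, detects two."""
    check, overall = parity & 0x7F, parity >> 7
    syndrome = _position_xor(data) ^ check
    odd = (_ones(data) + _ones(check) + overall) & 1
    if syndrome == 0 and not odd:
        return data, "ok"
    if odd:  # odd number of flips: treat as a single-bit error
        if syndrome > 71:
            return data, "uncorrectable"
        if syndrome in DATA_POSITIONS:
            data ^= 1 << DATA_POSITIONS.index(syndrome)
        return data, "corrected"  # otherwise the flip was in a parity bit
    return data, "uncorrectable"  # even number of flips, non-zero syndrome

def encode_word(word512):
    """Produce the 64 parity bits for one 512-bit word (8 bits per burst)."""
    parity64 = 0
    for burst in range(8):
        data = (word512 >> (64 * burst)) & MASK64
        parity64 |= encode_burst(data) << (8 * burst)
    return parity64

def decode_word(word512, parity64):
    """Decode all eight bursts; returns (corrected word, list of statuses)."""
    out, statuses = 0, []
    for burst in range(8):
        data = (word512 >> (64 * burst)) & MASK64
        parity = (parity64 >> (8 * burst)) & 0xFF
        fixed, status = decode_burst(data, parity)
        out |= fixed << (64 * burst)
        statuses.append(status)
    return out, statuses
```

In this sketch, `parity64` plays the role of the 64-bit decode_parity slot, and the corrected word corresponds to what would be driven on o_wb_data.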


While there are more details involved in the DDR3 controller, this covers the basic data flow.


III.II When to Perform Overhead Write/Read Requests?

The final point to explain is the condition for performing overhead write and read accesses. The mechanics are simple: if the mapped ECC address of the current request differs from the previous mapped ECC address, an overhead write or read access will be generated in stage 2 of the pipeline.



For example, ECC parity bits for write requests to addresses 6 and 7 are mapped to ECC address {bank = 7, row = 1000, col = 500}. However, ECC parity bits for write requests to addresses 8 to 10 are mapped to the next ECC column {bank = 7, row = 1000, col = 501}. Before switching to this new ECC address, the ECC parity bits for the previously mapped ECC address will be stored in DRAM via an overhead write.


This ensures that data stored in stage2_ecc_write_data_q is saved to its corresponding ECC address before being overwritten by new parity bits for a new ECC address.
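The trigger condition can be modeled with a small Python sketch. `schedule_writes` is a hypothetical helper that only tracks ECC columns; how the controller eventually flushes the final buffered parity word is not covered here, so the sketch simply reports it as pending.

```python
def schedule_writes(requests):
    """Model when overhead ECC writes are issued.

    `requests` is a list of (data_address, ecc_address) pairs in arrival
    order. An overhead write for the previous ECC address is emitted just
    before the first request that maps to a different ECC address.
    Returns (commands, pending_ecc_address).
    """
    commands = []
    current = None  # ECC address whose parity bits are currently buffered
    for data_addr, ecc_addr in requests:
        if current is not None and ecc_addr != current:
            commands.append(("ECC_WRITE", current))
        current = ecc_addr
        commands.append(("DATA_WRITE", data_addr))
    return commands, current

if __name__ == "__main__":
    # The example above: addresses 6 and 7 map to ECC column 500,
    # addresses 8 to 10 map to ECC column 501.
    reqs = [(6, 500), (7, 500), (8, 501), (9, 501), (10, 501)]
    cmds, pending = schedule_writes(reqs)
    for cmd in cmds:
        print(cmd)
    print("still buffered for ECC column:", pending)
```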


IV. Simulation Testbench

In our previous blog post, Getting Started with UberDDR3 (Part 1), we discussed the self-checking testbench included with UberDDR3, which uses the Micron DDR3 model file. To run the simulation for the ECC test, set the localparam for ECC_ENABLE to 3 inside the testbench, then follow the instructions in the blog.


As shown below, incoming Wishbone requests write sequentially starting from address 0. UberDDR3 receives these requests and pipes them from stage 0 to stage 2, where they appear as column, bank, and row addresses. Since the Wishbone address is word-addressable while the DRAM column address is burst-addressable, the Wishbone address is multiplied by 8 to get the column address.


Stage 2 column addresses 0 to 56 are sequentially written to DRAM, followed by an overhead write to {column = 0, bank = 7, row = h8000}, containing the parity bits for the previous 8 sequential writes. Then this process is repeated for column address 64 to 120 where the parity bits for these 8 sequential words are mapped to {column = 8, bank = 7, row = h8000}.


This approach minimizes the impact on throughput by performing ECC overhead operations only every 8 words.
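The column numbers seen in the simulation can be reproduced with a short Python sketch. It only models the low addresses where bank and row stay constant, and `sequential_efficiency` counts commands only, ignoring activate/precharge timing, so treat it as a rough estimate. All function names are hypothetical.

```python
BURSTS_PER_WORD = 8      # one Wishbone word spans 8 burst-addressed columns
WORDS_PER_ECC_WORD = 8   # parity bits of 8 words fill one ECC word

def stage2_column(wb_addr):
    """Column address of a Wishbone word address (valid within one row)."""
    return wb_addr * BURSTS_PER_WORD

def ecc_column(wb_addr):
    """Column of the ECC word that holds this word's parity bits."""
    return (wb_addr // WORDS_PER_ECC_WORD) * BURSTS_PER_WORD

def sequential_efficiency(n_words):
    """Fraction of commands carrying user data in a sequential stream."""
    overhead = -(-n_words // WORDS_PER_ECC_WORD)  # ceiling division
    return n_words / (n_words + overhead)

if __name__ == "__main__":
    for wb_addr in range(16):
        print(wb_addr, stage2_column(wb_addr), ecc_column(wb_addr))
    print(f"efficiency for 8 words: {sequential_efficiency(8):.3f}")
```

For a purely sequential stream, this gives 8 data commands for every 9 commands issued.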


The read requests follow a similar pattern to the sequential writes, with an overhead read occurring after every 8 requests.


V. Hardware Test

Testing this on hardware is straightforward. We can use the demo projects available in the UberDDR3 repo. In my case, I used the Arty-S7 project. By setting the ECC_ENABLE parameter to 3, we enable in-line ECC on UberDDR3.


As mentioned in the blog “Getting Started with UberDDR3 (Part 1)”, there is an internal test sequence that runs after the initial calibration. This acts as a built-in self-test, performing write and read requests of various sequences to all DDR3 addresses. The calibration officially ends only if this test sequence completes without errors.


This means we can simply run UberDDR3 with ECC on hardware. Once it finishes calibration (indicated by LEDs lighting up), we can confirm that UberDDR3 with ECC has been properly tested on hardware!


For demonstration purposes, I attached an ILA to the demo project. As shown below, it reaches the DONE_CALIBRATE state successfully (final_calibration_done is high), indicating that the calibration and test sequence completed without issues:

VI. Conclusion

In conclusion, this two-part blog series has demonstrated how to equip your designs with robust error protection using UberDDR3's ECC capabilities. Part 1 introduced Sideband ECC, exploring both burst-granular and word-granular implementations, while here in Part 2, we explored the more advanced In-line ECC, inspired by LPDDR4. Through careful memory mapping and optimized hardware pipelines, we’ve addressed the challenges of implementing In-line ECC while minimizing throughput penalties.


With both simulation and hardware tests validating its effectiveness, UberDDR3 with ECC offers a flexible and reliable solution for enhancing data integrity in your FPGA designs.


That wraps up this post. Catch you in the next blog post!
