2

There are several techniques for transferring data between two asynchronous clock domains. For a few bits, and depending on the direction of transfer, one could use a register (synchronizer) chain for slow-to-fast crossings and pulse stretching for fast-to-slow crossings. Handshake mechanisms are also possible, I believe. For a large number of bits, a clock-crossing FIFO has to be used. I am not aware of any other techniques.
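For concreteness, here is a minimal sketch of the single-bit register chain (two-flop synchronizer) mentioned above; the module and signal names are illustrative, not from any particular vendor library:

```verilog
// Minimal two-flop synchronizer: brings one asynchronous bit into
// the destination clock domain. Names are illustrative.
module sync_2ff (
    input  wire dst_clk,   // destination-domain clock
    input  wire async_in,  // bit arriving from the other clock domain
    output wire sync_out   // safe to use in the dst_clk domain
);
    reg meta, stable;      // first flop may go metastable; second resolves

    always @(posedge dst_clk) begin
        meta   <= async_in; // may sample mid-transition
        stable <= meta;     // a full clock period to resolve
    end

    assign sync_out = stable;
endmodule
```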

Is it true that, although we can move data into the other clock domain, there will always be a small chance that the data is not transferred accurately? In other words, is there always a very small chance that the receiving end gets corrupted data? One can take steps to reduce the likelihood of such an event, but it can never be prevented completely. Is that true?

quantum231
  • First you need to define the success criterion for your safety-critical system. Does it need to be 99.999%? 99.9999%? Then you need to know the probability of corrupt data getting passed through and acted on. – SteveSh Jun 20 '22 at 01:16
  • If after doing all that the answer is "we still have a problem" then some form of Error Detection and Correction (EDC) or FEC (Forward Error Correction) can be used. – SteveSh Jun 20 '22 at 01:18
  • Your question is not meaningful because even with a wire there is a small chance of the signal being lost or corrupted. – DKNguyen Jun 20 '22 at 01:19
  • The first thing I want to establish is that there is always a chance of data being corrupted during clock domain crossing; it is just the way it is. It will still happen even if the voltage, temperature, and process conditions are ideal and there are no cosmic rays, etc. Is this correct? And if so, how do we know what the failure rate will be? The part about safety-critical systems comes after that. – quantum231 Jun 20 '22 at 01:23
  • I have removed the last line of the question, which could confuse readers. – quantum231 Jun 20 '22 at 01:24
  • I had assumed that with multi-bit values, such as 16- or 32-bit transfers, the compounded failure rate would increase, since just one bit with the wrong value corrupts the entire word (see the note after these comments). Am I wrong in this assumption? – quantum231 Jun 20 '22 at 11:03
  • Yes. Mathematically, there is always a tiny probability of failure, as defined by the MTBF of the CDC structure, which is usually designed in such a manner that it fails at most once in a million years or so. – Mitu Raj Jun 26 '22 at 06:33
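A note on the compounding question raised in the comments above: if each of $N$ bits were synchronized independently, with a per-transfer failure probability $p$ for one bit, the word would fail with probability $1-(1-p)^N \approx Np$ for small $p$. So the failure rate grows only linearly in the width; the bigger problem is that independently synchronized bits can arrive skewed by a clock cycle, which is why multi-bit buses are crossed via a single synchronized control signal or Gray-coded pointers rather than bit by bit.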

3 Answers

3

It seems you've heard of metastability calculations. If so, you've probably noticed that the mean time between failures (MTBF) depends on four parameters: the metastability resolution time, the metastability time window, and the clock and asynchronous-data edge frequencies. You design your CDC synchronizers so that the MTBF is sufficiently large, say, much greater than the lifetime of the product you are developing. Given the MTBF's exponential dependence on delay times, the critical parameters of your design, such as latency or FIFO sizes, do not noticeably deteriorate even if you require the MTBF to be as large as the age of the Universe (believed to be 13.8 billion years). Would that satisfy your expectation of a negligible chance, indistinguishable from complete elimination, of data corruption (attributable purely to CDC, as you put it in your comment)?
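The usual form of that calculation, for a flip-flop given $t_{res}$ seconds to resolve, is

$$\mathrm{MTBF} = \frac{e^{\,t_{res}/\tau}}{T_W \cdot f_{clk} \cdot f_{data}}$$

where $\tau$ is the flop's metastability time constant, $T_W$ the metastability window, $f_{clk}$ the sampling clock frequency, and $f_{data}$ the average toggle rate of the asynchronous input. To illustrate with assumed (not vendor-specific) values $\tau = 50\,\mathrm{ps}$, $T_W = 100\,\mathrm{ps}$, $t_{res} = 4\,\mathrm{ns}$, $f_{clk} = 200\,\mathrm{MHz}$, $f_{data} = 50\,\mathrm{MHz}$: the exponent is $e^{80} \approx 5.5 \times 10^{34}$ and the denominator is $10^{6}\,\mathrm{s^{-1}}$, giving an MTBF of roughly $5.5 \times 10^{28}$ seconds, or about $1.7 \times 10^{21}$ years.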

V.V.T
  • When multiple signals go from one clock domain to another, not just a single bit, and all of them together represent a single data value, corruption in any one of them corrupts the entire data word. Therefore, the calculations used for a single register chain do not hold for multi-bit transfers; there, the probability of failure is greatly increased. I have so far not found any resource that deals with the subject of CDC in its full length and breadth, hence my confusion. – quantum231 Jun 20 '22 at 10:58
  • Also, I was expecting something like 20%-30% of transfers to get corrupted if they are multi-bit values like 16, 32, or even 64 bits. – quantum231 Jun 20 '22 at 10:59
  • The metastability phenomenon is the only direct conceptual answer to your title question. Also, it seems you are well aware of some techniques used to implement multi-bit CDC designs. Did you use adequate tooling in your project? If an FPGA manufacturer provides a multi-bit-capable synchronizer cell with a Verilog model, plus SDF tools that help you control setup/hold violations, a multi-bit CDC design becomes a feasible, however tedious, piece of work for FPGA solution engineers. – V.V.T Jun 21 '22 at 06:20
  • For example, in the **Figure: Multi-Bit CDC Decision Tree** (https://docs.xilinx.com/r/2021.2-English/ug949-vivado-design-methodology/Multi-Bit-CDC), XILINX offers a graph that helps you get started. Later in the project you generate a CDC report, which helps you re-evaluate your design decisions if need be. You may also benefit from wider reading about multi-bit CDC implementation strategies (used in ASIC designs as well). – V.V.T Jun 21 '22 at 06:21
  • @quantum231 - How did you arrive at the 20%-30% value? That seems extremely high and unlikely to me, given the low probability of failure of a single bit if proper CDC techniques are used. – SteveSh Jun 26 '22 at 14:11
  • I am not sure what the error rate could be for multibit signals. (Please note that 231 in radix 10 is 0xe7) – gyuunyuu Jun 27 '22 at 13:37
3

No.

If you move one bit into the other domain, and that bit is sufficiently slow (compared to both clock rates), then there is a chance that its value will be "incorrect" or unresolved for ONE clock cycle - e.g. due to the bit change hitting the tiny metastability window.

However, even in that case, the NEXT clock edge will be a whole clock period away from the metastability window, and is thus guaranteed to capture a stable value.

So you design the CDC with this limitation in mind. If the above bit is your "data available" handshaking signal, you can guarantee that by the second clock, the actual data has been stable for plenty of time.
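As a rough sketch of that arrangement - the "data available" flag is synchronized, while the data bus itself is simply held stable by the sender (all names illustrative, reset omitted for brevity):

```verilog
// Sketch: capture a multi-bit word on a synchronized "data available"
// flag. Assumes the sender holds data_bus stable while the flag is high.
module cdc_capture #(parameter W = 16) (
    input  wire         dst_clk,
    input  wire [W-1:0] data_bus,  // driven and held stable by the sender
    input  wire         avail,     // "data available" flag from the sender
    output reg  [W-1:0] data_q,
    output reg          data_valid
);
    reg avail_meta, avail_sync, avail_d;

    always @(posedge dst_clk) begin
        avail_meta <= avail;       // two-flop synchronizer on the flag only
        avail_sync <= avail_meta;
        avail_d    <= avail_sync;

        data_valid <= 1'b0;
        if (avail_sync && !avail_d) begin
            // By now data_bus has been stable for well over one dst_clk
            // period, so this sample cannot go metastable.
            data_q     <= data_bus;
            data_valid <= 1'b1;
        end
    end
endmodule
```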

Now, if your data rate is comparable with the slower of the clock rates, it gets more complex, and FIFOs are your friend (which just means Peter Alfke did the difficult bit for you in the 1990s!). But there is no reason you have to live with uncertainty in the data path.
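The key trick in those asynchronous FIFOs is that the read and write pointers cross domains Gray-coded, so successive pointer values differ in exactly one bit, and a metastable sample can only resolve to the old or the new value, never to something unrelated. A small, self-contained illustration (the 4-bit width is arbitrary):

```verilog
module gray_demo;
    // Binary-to-Gray conversion as used for async FIFO pointers:
    // adjacent codes differ in exactly one bit position.
    function [3:0] bin2gray(input [3:0] b);
        bin2gray = b ^ (b >> 1);
    endfunction

    integer i;
    initial begin
        for (i = 0; i < 16; i = i + 1)
            $display("bin=%b gray=%b", i[3:0], bin2gray(i[3:0]));
    end
endmodule
```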

2

The solution is to make sure the data (presumably a multi-bit value) is stable and unchanging when the value is accessed from the opposite domain. This normally requires a "full handshake": a logic signal such as "the data is ready" (from domain A to B) and a separate signal "I have received the data" (from domain B to A). These must be handled as fully asynchronous in the receiving domain (with proper clock-synchronized sampling), and as such the transfer generally takes two clocks of the receiving domain in each direction.
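A minimal sketch of such a full (four-phase) handshake, under the assumptions above - the sender holds the word stable while its request is asserted, and each flag passes through a two-flop synchronizer, which is where the two clocks per direction come from (all names illustrative, reset omitted for brevity):

```verilog
module handshake_cdc #(parameter W = 32) (
    // Domain A (sender)
    input  wire         a_clk,
    input  wire         a_send,    // pulse: request to transfer a_data
    input  wire [W-1:0] a_data,
    output reg          a_busy,    // high while a handshake is in flight
    // Domain B (receiver)
    input  wire         b_clk,
    output reg  [W-1:0] b_data,
    output reg          b_valid    // one-cycle pulse when b_data updates
);
    reg         req, ack;
    reg [W-1:0] hold;              // held stable while req is asserted
    reg [1:0]   ack_sync;          // ack synchronized into domain A
    reg [2:0]   req_sync;          // req synchronized into domain B (+ edge reg)

    // Domain A: raise req with stable data, wait for ack, then drop req.
    always @(posedge a_clk) begin
        ack_sync <= {ack_sync[0], ack};
        if (a_send && !a_busy) begin
            hold   <= a_data;
            req    <= 1'b1;
            a_busy <= 1'b1;
        end else if (req && ack_sync[1]) begin
            req <= 1'b0;           // B has captured the data
        end else if (!req && !ack_sync[1] && a_busy) begin
            a_busy <= 1'b0;        // four-phase handshake complete
        end
    end

    // Domain B: detect the synchronized rising edge of req, capture the
    // (now long-stable) word, and mirror req with ack.
    always @(posedge b_clk) begin
        req_sync <= {req_sync[1:0], req};
        b_valid  <= 1'b0;
        if (req_sync[1] && !req_sync[2]) begin
            b_data  <= hold;
            b_valid <= 1'b1;
        end
        ack <= req_sync[1];
    end
endmodule
```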

The confounding principle, though, is that the usual tool suites are very poor at handling asynchronous logic, inputs, and transitions, so validation cannot automatically check the setup and hold timing of the asynchronous R-S latch that is required in the middle of all this.

antiquus
  • R-S latch - does this come from an ASIC cell library that contains some primitive for CDC? I do not see how there can be an RS latch in an FPGA. – quantum231 Jun 20 '22 at 10:59
  • RS latch in FPGA? – gyuunyuu Jun 27 '22 at 13:38
  • My experience is with standard cells in ASICs. In any case, an RS latch is constructed from two NOR gates (active-high inputs) or two NAND gates (active-low inputs). The output of each one goes to an input of the other, with the remaining input(s) being the R and S signals. The key here is that you only have to fiddle with one signal (pair), not the entire data word, to guarantee that it can be read atomically and safely. Edit to say that this often breaks things like boundary-scan logic and clock timing, so it is not very often employed. – antiquus Jun 29 '22 at 03:27
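For reference, a minimal structural sketch of the cross-coupled NOR latch described in that comment (this is combinational feedback, which most FPGA flows will flag, consistent with the remark above about broken timing):

```verilog
// Active-high RS latch from two cross-coupled NOR gates.
// r=1 resets q to 0, s=1 sets q to 1, r=s=1 is the forbidden input.
module rs_latch_nor (
    input  wire r, s,
    output wire q, q_n
);
    assign q   = ~(r | q_n);  // each NOR output feeds the other's input
    assign q_n = ~(s | q);
endmodule
```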