
I've been reading about how asynchronous CPUs work, but they always seem to involve some complicated communication scheme. Wikipedia talks about two-phase and four-phase handshakes. I found a PDF from iosrjournals.org that talks about communicating Requests, Acks, and Data either in 2-phase bundles or 4-phase bundles.
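
To check that I'm describing the same thing everyone else means, here is roughly how I picture the four-phase (return-to-zero) Req/Ack scheme, sketched as a little Python model; the names, timings, and threading are just my own illustration, not taken from that paper:

```python
# A four-phase "bundled data" handshake, modeled with two threads and two Events.
# All names and timings are illustrative only.
import threading, time, random

req = threading.Event()   # sender raises Req when the data bundle is valid
ack = threading.Event()   # receiver raises Ack when it has latched the data
data = [None]             # the "bundled data" wires

def sender(values):
    for v in values:
        data[0] = v              # 1. put the operands on the data bundle
        req.set()                # 2. raise Req
        ack.wait()               # 3. wait for Ack to go high
        req.clear()              # 4. return Req to zero
        while ack.is_set():      # 5. wait for Ack to return to zero
            time.sleep(0.001)

def receiver(count, out):
    for _ in range(count):
        req.wait()                           # Req high: the data is valid now
        out.append(data[0])                  # latch the data
        time.sleep(random.uniform(0, 0.01))  # receiver works at its own pace
        ack.set()                            # raise Ack
        while req.is_set():                  # wait for Req to drop...
            time.sleep(0.001)
        ack.clear()                          # ...then return Ack to zero

results = []
rx = threading.Thread(target=receiver, args=(3, results))
rx.start()
sender(["ADD 3,4", "AND 5,6", "ADD 3,4"])    # repeating the same operands still works
rx.join()
print(results)                               # all three transfers arrive, no clock anywhere
```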

I don't understand why any of this is necessary. Why don't you just use a single bit on a dedicated line that is 0 until the operation is complete? Then it becomes a 1. The accumulator at the end of the ALU will then know the "final answer" is finished and can switch its buffer from read to write, outputting the result on some other bus.

That of course raises the question of how the ALU will know when it's done. I'm not exactly sure how to do that, but the simplest way would be to drive the line to 1 immediately and make sure its propagation delay is long enough that the worst-case operands finish in time. This would be a unique trace for each operation the ALU is capable of; it's not like a clock signal, which has to be long enough for the worst case of any ALU operation.
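
Here is a toy Python model of roughly what I am picturing; the operation set, the class name, and the delay numbers are all made up for illustration (and grossly scaled up so the toy is observable):

```python
# Toy model of the "single completion bit" idea: each operation gets its own
# matched delay equal to that operation's worst-case propagation time.
import threading

# Made-up worst-case delays per operation, in seconds (real ones would be ns).
WORST_CASE_DELAY = {"add": 0.002, "and": 0.0005, "mul": 0.008}

class ToyALU:
    def __init__(self):
        self.done = False        # the dedicated completion line: 0 until finished
        self.result = None

    def start(self, op, a, b):
        self.done = False
        # The real circuit starts computing immediately; here we just arrange
        # for the result to "settle" after this operation's own matched delay.
        def settle():
            self.result = {"add": a + b, "and": a & b, "mul": a * b}[op]
            self.done = True     # completion bit finally goes high
        threading.Timer(WORST_CASE_DELAY[op], settle).start()

alu = ToyALU()
alu.start("and", 12, 10)
while not alu.done:              # the accumulator just watches the done line
    pass
print(alu.result)                # 8, ready after the AND delay, not the MUL delay
```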

Is there something fundamentally wrong with this idea? And if so, isn't there some easier way for the logic to figure out when it's done and then signal completion using a single bit: 0 for not complete, 1 for complete?

Note: everything I've read so far seems to indicate 2 or 4 dedicated lines for communication in asynchronous designs, so I don't think a single "completion bit" on one dedicated line would be too expensive, or that there wouldn't be room for it.

Shashank V M
DrZ214
  • If the signal line goes high after an operation completes, what happens on the next operation? It either has to go low again before the next operation (in which case, it needs to know when the other circuit has detected it high), or the logic needs to be reversed for the next operation. If we reverse the logic for the next operation, how do we signal the case where we want to do the operation again with the same input signal values? Whatever solution you find for these questions becomes a handshaking scheme of some kind. – The Photon Jul 04 '21 at 22:00
  • _"Why don't you just use a single bit on a dedicated line that is 0 **until the operation is complete**? Then it becomes a 1"_ - how does the bit know when to change? – Bruce Abbott Jul 04 '21 at 22:01
  • "I'm not exactly sure how to do that," is precisely the problem! If you *become* sure how, and prove your approach in practice, you'll get at least some good papers out of it. But fair warning: it's one of those areas where people have been stuck since at least the early 1990s... –  Jul 04 '21 at 22:56
  • @ThePhoton Good point, but there are 2 simple solutions. It can just be a "ping" that goes high then low again really fast, before the first signal has even propagated to the end. The accumulator with the answer can do its thing on the rising edge of that 1 signal, as if it's a clock but with only one pulse at asynch times. Alternatively, just use 2 lines and alternate which one to listen to for the 1. I know it sounds wasteful, but if the normal way of doing asynch is 2-phase or 4-phase, then it already has 2 or 4 dedicated lines for some kind of communication. – DrZ214 Jul 04 '21 at 22:59
  • @BruceAbbott It changes as soon as the ALU receives operands and forwards them down the right circuitry. The point is that the 1 bit needs a long propagation time, either from a longer path or maybe some kind of capacitor delaying it. The propagation delay is supposed to be slightly longer than the "logic delay", i.e. however much time the circuit needs to compute an operation such as add. – DrZ214 Jul 04 '21 at 23:01
  • @user_1818839 I proposed one way in this OP and asked why it can't work. To me it seems like a very simple and straightforward way of doing asynch stuff. – DrZ214 Jul 04 '21 at 23:02
  • How do you know how long your "ping" has to be for the other circuit to detect it? If you rely on the other circuit to tell you it detected it, you're handshaking again. If you assume some specific time (10 ns, say) is sufficient, you're now effectively using a clock, and no longer doing asynchronous logic. – The Photon Jul 04 '21 at 23:36
  • Anyway, you still need a way for the other circuit to tell you when it's presenting new data to you (in case the new data is identical to the old data)...so when it does that it's basically handshaking your "completed" signal one way or the other. – The Photon Jul 04 '21 at 23:38
  • @DrZ214 Perhaps if you were to show us a case where there was a self-reset to "0", followed by a self-rise to "1" after the longest-path combinatorial logic completes, it might help produce useful answers. You've got a concept. Show us how this works. How does this signal become "0" and how does this signal become "1"? Pick as simple a case as possible to make the point. How does the self-clocking work? (Yes, people have worked this out. But I want to see how you visualize this.) – jonk Jul 05 '21 at 05:51
  • Imagine one entity sends operands continuously, faster than the ALU can read them. How will you manage this without handshaking? I.e., the ALU needs some way to say "Oye, not now, I am busy". – Mitu Raj Jul 05 '21 at 06:30
  • @MituRaj Well, if you have a completion bit (0 or 1), you can send it to both ends of the ALU. It would then also serve as the ready signal telling the previous stage to put new work on the ALU. – DrZ214 Jul 06 '21 at 21:28
  • That's called handshaking: the 'previous stage' says when the operands are valid, and the ALU says when it's ready. – Mitu Raj Jul 06 '21 at 21:31
  • @MituRaj I disagree. There is no 2-way communication, so it cannot be called a handshake. It's just the ALU communicating to both ends, and both ends are purely **listening** for a 1. No need to exchange data. So the ALU is talking, never listening. And the previous stage is listening, never talking (unless you mean sending the operands is talking, but by that point it already knows to send its operands). Same thing with the forward end, the accumulator: it is only listening for when to receive a valid answer, never talking to the ALU. – DrZ214 Jul 07 '21 at 22:44
  • Read your own comment. You said you send something to both ends of the ALU and one of them serves as a 'ready' signal. Perhaps you should put up a block diagram of both entities, with the signals, to illustrate your point, as your words are NOT clear to me. And prove how it will work without any handshaking. – Mitu Raj Jul 08 '21 at 05:03

1 Answer


Let me try to summarize what you propose, because I'm not completely sure I follow your description.

You have an ALU (for example) that can perform different computations. Alongside the ALU you have a block of logic that generates a "Done" signal that will go high when the computation is finished. The "Done" logic is designed so that the delay to generate the "Done" signal is equal to the worst-case delay through the ALU for the particular operation being performed rather than the worst-case delay for all possible operations.

What are the problems with this approach? First, consider the complexity of the "Done" logic. How exactly do you intend to generate relatively long delays without using very long strings of logic gates? Remember that the delays in the "Done" logic will vary with temperature and supply voltage, and must track the same variations in the actual ALU. You also need to generate different delays for the different operations. Designing the "Done" logic to very closely track the ALU delays is likely to require a significant amount of logic, which means more power wasted to make "Done", more silicon used, and more expensive chips.

Second, every time you say you will make some delay equal to the "worst-case delay" of some complex function you lose some of the potential speed benefit of asynchronous logic. To really get the most improvement in speed you need to assert "Done" as soon as possible.
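
To make the second point concrete, here is a rough sketch for a ripple-carry add; the gate delay, adder width, and the simplified carry-chain measure are all assumptions of mine, not figures from any real process:

```python
# How much speed a fixed per-operation "Done" delay gives up on a ripple-carry add:
# the sum actually settles once the longest real carry chain has rippled through,
# but the matched delay must always cover the full adder width.
import random

GATE_DELAY_NS = 0.1   # assumed delay per adder stage (invented number)
WIDTH = 32            # adder width in bits

def longest_carry_chain(a, b, width=WIDTH):
    """How far a carry actually ripples for these particular operands."""
    longest = run = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:               # this bit generates a carry: a chain starts here
            run = 1
        elif (x ^ y) and run:   # this bit propagates an existing carry
            run += 1
        else:                   # the carry is killed (or there was none)
            run = 0
        longest = max(longest, run)
    return longest

random.seed(0)
chains = [longest_carry_chain(random.getrandbits(WIDTH), random.getrandbits(WIDTH))
          for _ in range(10_000)]

avg_actual_ns = GATE_DELAY_NS * sum(chains) / len(chains)
matched_ns    = GATE_DELAY_NS * WIDTH      # what the per-operation "Done" delay must cover

print(f"average time the sum is actually settled: {avg_actual_ns:.2f} ns")
print(f"matched 'Done' delay asserted after:      {matched_ns:.2f} ns")
# Typically well under 1 ns versus 3.2 ns here: every add pays the full-width
# worst case, even though most operand pairs finish much sooner.
```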

When you work through these trade-offs of speed, power consumption, design effort, and chip cost you start to see why asynchronous processors have not become dominant over synchronous processors.

Elliot Alderson
  • `the delay to generate the "Done" signal is equal to the worst-case delay through the ALU for the particular operation being performed rather than the worst-case delay for all possible operations.` Yes, exactly right. However, I was imagining a unique done path for every operation, rather than a single shared done path for the whole ALU that somehow adjusts its propagation delay for a given operation. Obviously these unique paths would be MUXed/DEMUXed by the CU instruction decoder. So how to make a long propagation delay? I was thinking either a long path or a capacitor to delay the high voltage. – DrZ214 Jul 07 '21 at 22:39
  • `Second, every time you say you will make some delay equal to the "worst-case delay" of some complex function you lose some of the potential speed benefit of asynchronous logic. To really get the most improvement in speed you need to assert "Done" as soon as possible.` Yes, I completely agree; however, the point is to simplify asynch at the expense of fully optimal speed. I am pretty sure the worst case of a given op, rather than the entire ALU critical path that the clock has to be sized to, will still be much, much faster than traditional clocked designs. – DrZ214 Jul 07 '21 at 22:41
  • "A unique done path for every operation" sacrifices computation speed because the actual calculation time is also a function of the operands. For arithmetic operations your circuit will be much slower than necessary. How to make the long delay? That is also a significant problem, because your delay circuit must react to changes in voltage and temperature just like your ALU. – Elliot Alderson Jul 07 '21 at 22:42
  • Yes, you may be able to achieve faster average computation, but at what cost in silicon area and power? What happens when you are forced to communicate with synchronous IO devices? – Elliot Alderson Jul 07 '21 at 22:43
  • I understand some of the problems you are bringing up, but the cost in silicon area and power? I see nothing extra in power consumption besides one extra trace and the contacts for the MUX/DEMUX it connects to. Remember that eliminating a clock signal will save significant power. As for silicon area, I cannot imagine this would take up much area at all. It is just one extra trace per operation, and the combinational logic of that op will be far, far more area than that trace. Remember the MUX/DEMUX are already there; we just need one extra trace to integrate with them. – DrZ214 Jul 07 '21 at 22:56
  • Communication with synchronous IO devices is not done through the ALU, thankfully. Special regs can use buffers with their own clock to control exactly when they are forwarded onto a bus. Alternatively, you can use an asynch bus and give the IO device itself special regs that only listen at the right clock moment. Either way seems relatively simple to me. – DrZ214 Jul 07 '21 at 22:58