How to pipeline an algorithm that not only has latency but also relies on feedback of the previous run?

Question

Trapped in this problem for several days, I feel I can't think of a proper solution on myself. The problem is as below.

Say, a source-input is streaming a sequence of A, denoted by A0, A1, A2, ..., Ai, ..., An.

Now I'd like to do some action on A, denoted by B_{i+1} = act_on(Ai, Op(Bi) ), e.g.

B2 = act_on(A1, Op(B1) )
B3 = act_on(A2, Op(B2) )
B4 = act_on(A3, Op(B3) )
....

But neither act_on() nor Op() can be done in a single clock of A.

I've tried to pipeline A and B, but it failed. Take an example as explanation: Suppose A0_at_stage_3 and Op(B0)_at_stage_2 are pipelined for act_on() at Clock T, and thus B1 can be obtained. It is fine. But unluckily, at Clock T+1, we cannot use Op(B1)_at_stage_2 as what we just do at Clock T, because B1 is not ready yet when calculating Op(B1) or Op(B1)_at_stage_1 (which should be done at Clock T-2 and T-1).

This problem really confused me much. What is the correct way to solve it? Thank you very much.

Edit

For my particular case, B actually stands for three lookup tables and an index, say list_1, list_2, list_3, and idx, which are all possibly expected to be dynamically updated by act_on(). That means, e.g. act_on() may update list_1[2], list_2[4], list_3[6], none of them or all of them, depending on its result.

The Op(B) is cross-looking-up operations between these tables, such as idx2 = list_1[idx], idx3 = list_2[idx3], idx4 = list_3[idx3].

The difficulty I met with is that after act_on() updating these tables, the pipelined tables would immediately become out-of-date. But if using a chain looking-up like list_3[ list_2[ list_1[ 6 ] ] ] in a single clock, the time performance could be very bad (I didn't try but just to imagine). Is it possible to have a workaround? Thank you.

I can't say it applies directly, but have you read Ellis' *Bulldog: A Compiler for VLIW Architectures*? — periblepsis, Jun 17 '23 at 13:53
@periblepsis Not yet, what's it about? Later on I will try searching for it. — xc wang, Jun 17 '23 at 14:04
You only reminded me about a compiler that used info from prior runs to optimize. For example, optimizing access to dram to cope with banks, back when that mattered. Mostly just curious. — periblepsis, Jun 17 '23 at 14:10
I think a diagram or code of your *particular case* would be very helpful. — greybeard, Jun 18 '23 at 05:56
do you need a guarranteed result every cycle, or are occasional failed cycles permitted? — Neil_UK, Jun 19 '23 at 08:26

Neil_UK · Accepted Answer · 2023-06-18T05:37:42.313

There is no solution where you are looking for it at the moment. If you want feedback from one cycle ago, you need to compute all dependent values in one clock cycle.

You need either to use faster hardware, so that act_on(op()) can be completed in a single cycle. Perhaps using a faster clock, so a logical cycle is several physical cycles.

Or you need to modify your algorithm, so that you don't need feedback from one cycle ago. Whether this is possible depends on the details of what you are trying to achieve.

For instance, in an IIR (feedback) filter, although its ideal sequence of taps would be one per cycle, it's often possible to omit the first delayed tap (basically set it to zero) and alter the others with a minimal impact on the frequency response.

Sometimes, it's possible to recast an IIR operation as FIR (no feedback), which can then be pipelined to your heart's content.

Could a prediction be made of the output of act_on(op()) to sufficient precision in one cycle to be useful? Perhaps with correction of the errors in a later stage?

So, it all depends on the detail. If you can share what you are trying to achieve, you might get some more creative answers.

Thank you for your answer, very helpful. I've also updated the question. Btw, if there exist ssome workaround for my case but I failed to find it, can HLS tools be useful in magically turning C code to a high-time-performant HDL code? — xc wang, Jun 17 '23 at 13:38
HLS nearly never writes highly time (or area) performant HDL, C is a horrible place to start for a software to HDL transformer, because C makes assumptions about things being sequential. I seldom expect HLS to be better then what I can write in HDL. — Dan Mills, Jun 17 '23 at 13:48

score 6 · Answer 2 · answered Jun 17 '23 at 14:10

You cannot really do what you are trying to do, feedback systems inherently pipeline poorly.

Your system strikes me as one that might benefit from designing in some speculative execution, is the case where act_on does NOT update any of the lists the common case by any chance?

If so you can have a path that assumes the lists are not changing and can run with the list lookup fully pipelined (6 clocks three lists in fully pipelined block ram, more or less and depends on the chip), then only if act_on updates one of the tables do you have to invalidate the result and wait for the new data to propagate thru the lookups (And you only have to invalidate the data that was in queue after the table that got updated.

It makes time to get a result somewhat non deterministic in that in the pathological case you get a multi cycle per iteration delay, but depending on the nature of the workload the average case might be much better.

The other thing you can sometimes do is play with the data structure, particularly if one of two of those tables are small, you could compute every possible value for a given pair of tables, it will eat a lot of area and ram, but if say T1 is 16 entries and T2 is 256 entries then you can collapse that to one table lookup of 4096 entries, at the cost of the update having to write multiple entries (Which might be a pain if you needed to do it often), but the lookup will be screaming fast, it is the update that will be slow.

You could even combine the two approaches, have a big single table that you can use when it is up to date, but accept that it will take a while to update because every address corresponding to a given table entry will need to be written, AND a slower set of three table lookups for use when act_on has invalidated the big table and you are waiting for the table update logic to finish with the dual port ram.

What works depends strongly on the actual workload here, and in particular on how often act_on triggers a table update.

You will likely find passing a 'data valid' flag along with the pipeline will make ensuring the correct data is marked invalid, that stuff can get annoyingly confusing otherwise.

Thank you for these instructive case guidelines, which inspired me much to make reasonable assumptions or data structures. I currently think by intuition that there may exist a space to optimise the way to update lists so that its old content keeps consensus. I will think it over and try, thanks very much. — xc wang, Jun 17 '23 at 14:37

How to pipeline an algorithm that not only has latency but also relies on feedback of the previous run?

Edit

2 Answers2