
I have a particularly large signal-processing transform that needs to be ported from MATLAB to VHDL. It definitely requires some kind of resource sharing. A bit of calculation gave me the following:

  • 512 ffts of 64-points
  • 41210 multiply-add operations

Considering that the largest Virtex-6 FPGA has ~2000 DSP48E blocks, I know that I can resource-share in order to re-use each block multiple times. Execution time isn't really an issue; the processing can take a relatively long time in FPGA terms.

Looking at resource usage, the Radix-2 Lite architecture costs me 4 DSP blocks per FFT, so the 512 FFTs alone need 2048 DSP blocks; together with the multiply-adds that's a total of ~43k DSP operations. The biggest Virtex-6 FPGA has ~2k blocks, which means each block would have to be shared across ~20 operations via a mux.

Obviously including such large muxes in the fabric is also going to take up slices. Where do I find the upper end of this limit? I can't share the FPGA resources indefinitely. Are 41210 multipliers too big? How do I calculate what is too big?
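As a first-order sanity check, the required sharing factor is just the operation count divided by the available DSP blocks. A minimal sketch using the figures above (the 2000-block figure is the question's rough estimate, not a datasheet value):

```python
import math

# Figures from the question: 41210 multiply-adds plus 512 FFTs
# at 4 DSP48E blocks each (Radix-2 Lite).
dsp_ops_needed = 41210 + 512 * 4          # ~43k DSP operations in total
dsp_blocks_available = 2000               # largest Virtex-6, roughly

# Each physical DSP block must be time-shared this many ways,
# i.e. fed by an operand mux roughly this wide.
share_factor = math.ceil(dsp_ops_needed / dsp_blocks_available)
print(share_factor)  # ~22-way sharing
```

This only bounds the mux width; the slice cost of those muxes is a separate calculation.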

I've also looked at the other resources (slices, BRAMs, etc.). Radix-2 Lite also needs 4 × 18k BRAMs per FFT = 2048 BRAMs; the biggest Xilinx FPGA contains 2128 BRAMs. Very borderline. I'm concerned that my design is just too big.


UPDATE:

Some more info on the design itself. I can't go into detail, but here is what I can give:

Initial conditions -> 512 ffts -> 40k multipliers ---------|----> output data to host 

                 ^------re-calculate initial conditions----|

output data rate spec: "faster than the MATLAB simulation"

Calculation-wise, this is where I am:

FFT stage: easy. I can implement 1/2/4/8 FFTs, store the results in SDRAM, and access them later. Relatively small; even if it takes long, that's OK. Using Radix-2 Lite I get 2 DSP48Es and 2 18k BRAMs per FFT; streaming gives 6 DSP48Es and 0 BRAMs per FFT. Either way, a 64-point FFT is small in FPGA resource terms.

Multipliers: this is my problem. The multiplier inputs are taken from either lookup tables or FFT data. It really is just a whole bunch of multiply-adds, and there isn't much to optimise. It's not a filter, but it has filter-like characteristics.

Considering resource sharing on the FPGA, the maths works out as follows: one LUT-6 can be used as a 4-way mux. The cost of an N-way, M-bit mux is:

N*M/3 LUTs, or N*M/12 slices (4 LUTs per slice).

Crunching the numbers for my implementation does not give good results: 90% of the Virtex-6 family doesn't have enough slices to resource-share their DSPs in order to perform 40k operations.
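That rule of thumb is easy to mechanise. A minimal sketch; the 20-way, 18-bit case is the sharing scenario discussed in the comments, and all figures are estimates:

```python
def mux_cost(n_ways, m_bits):
    """Estimate LUT/slice cost of an N-way, M-bit mux using the
    rule of thumb above (one LUT-6 acts as a 4:1 mux)."""
    luts = n_ways * m_bits / 3     # N*M/3 LUTs
    slices = luts / 4              # 4 LUT-6s per Virtex-6 slice
    return luts, slices

# 2000 shared DSP blocks, each fed by a 20-way mux of 18-bit operands:
luts, slices = mux_cost(20, 18)
total_slices = 2000 * slices       # 60,000 slices for the muxes alone
```

The comment thread uses a more pessimistic 7 LUT-6/bit figure for the same mux; either way the slice budget, not the DSP count, is the binding constraint.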

Kortuk
stanri
  • The most efficient forms of resource sharing are partial serialization where you can access data by addressing memory. Of course, at an extreme of this you are back to a conventional stored-program processor - the lack of hard performance requirements starts to point back towards the flexibility of a software implementation perhaps running in a compute cloud. – Chris Stratton Dec 31 '12 at 15:45
  • 1
    This isn't part of your question, but in your resource calculation you didn't state what size operand. 512 FFTs x 64 points x how many bits? In an FPGA the operand size is totally up to you, so you have to consider it when working out the size of your problem. – The Photon Dec 31 '12 at 15:54
  • I don't know if you realized, but those big FPGAs are quite expensive. Some can be above $5k. Perhaps you should consider that as well, unless cost isn't an issue. – Gustavo Litovsky Dec 31 '12 at 17:54
  • @ThePhoton: I used the core generator to estimate the FFT resource usage; each FFT is 64 points, fixed-point, with 32 bits of input data and 24 bits of phase data. –  Jan 01 '13 at 01:48
  • 1
    Unfortunately beyond the sort-of alternate solution suggestions you got in the answers so far, I doubt if we can do to much more for you. I mean, you could make just one FFT core and run your 512 inputs through it one after the other, and obviously that would fit in even a fairly small FPGA. Somewhere between that and doing everything in parallel is the right balance of speed vs resources for your application...but it's hard for anyone but you to say where that balance should be. – The Photon Jan 01 '13 at 02:02
  • This might be the kind of idea to throw around in chat rather than here in the Q&A area...but I don't think there's any serious digital gurus who are regulars in chat. – The Photon Jan 01 '13 at 02:05
  • I'm not too concerned about the FFT, honestly. With a bit of SDRAM I can store the FFT results sequentially and recall the data later. I'm more concerned about the ~40k multiply operations: at 18 bits each, and assuming 2k multipliers, I'd have to mux each multiplier 20 ways. A 20-way, 18-bit mux needs 7 LUT-6/bit, or 40 slices. 40 x 2k = 80k. Just small enough to fit into the two biggest Virtex-6 FPGAs :/ –  Jan 01 '13 at 02:37
  • 1
    I guess I didn't understand what you meant about the multipliers. I thought the multiplies were the ones needed to do the FFTs. – The Photon Jan 01 '13 at 05:46
  • yeah, looking at it again - it is a bit ambiguous on my part. The FFTs are in addition to the multiply operations. –  Jan 01 '13 at 05:49
  • @ThePhoton I've added more detail about the design to the question. –  Jan 01 '13 at 16:28
  • 1
    Do you have a budget number for this? Like Gustavo pointed out, high-end FPGAs are expensive, as is developing a PCB to sit them on. Whereas just doubling (or quadrupling or ...) the amount of compute hardware and continuing to use the existing, proven(?), Matlab code could probably meet the speed spec as given. – The Photon Jan 01 '13 at 17:07
  • "I originally assumed that the professor who wrote the paper on the algorithm would have optimised it as much as it could be, ..." Could you provide a link to the paper in question? – Dave Tweed Jan 01 '13 at 18:49
  • If you have a new question please post it as a new question, give the required details for the specific question you want and ask another. Users will upvote that also and it will help keep each question focused on a single technical issue. – Kortuk Jan 02 '13 at 15:00
  • If you are always arranging multiplier inputs in the same sequence you can get away with shifters, rather than wide input muxes. I.e. implementing `mult_in <= mux[n]` where `n` is incremented by one each cycle can be massively minimised. – shuckc Jun 30 '15 at 11:19

4 Answers

9

I wonder if there is another way of looking at the problem?

Playing off your estimate of 512 FFT operations (64 points each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?

Now you have found an FFT core using 4 DSP units... but how many clock cycles does it take per FFT (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k MAC operations in 64 cycles - perhaps 1k MAC units, each handling 42 operations across those cycles.
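The balance can be sketched numerically. This assumes a one-cycle-per-point FFT core, which is an assumption about throughput, not a datasheet figure:

```python
import math

fft_cycles = 64        # assumed throughput: one 64-point FFT every 64 cycles
mac_ops = 42000        # approximate MAC count per pass, from the question

# To keep the MAC stage from becoming the bottleneck, this many MAC
# results must be produced every clock cycle on average.
macs_per_cycle = math.ceil(mac_ops / fft_cycles)
print(macs_per_cycle)  # ~657 results per cycle needed
```

Slowing the FFT stage down (fewer butterfly units, more cycles) shrinks this number proportionally, which is exactly the speed-versus-resources dial mentioned below.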

Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...

Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher powers). Unrolling them, I could generate the values iteratively without multiplication: I was quite pleased with myself the day I built a Difference Engine on an FPGA!
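For the curious, the difference-engine trick works because the second difference of a quadratic is constant, so each new value needs only additions. A behavioural sketch (names are illustrative):

```python
def quadratic_by_differences(a, b, c, n):
    """Yield a*i*i + b*i + c for i = 0..n-1 using only additions
    after the initial setup: the classic difference-engine trick."""
    y = c            # f(0)
    d1 = a + b       # first difference: f(1) - f(0)
    d2 = 2 * a       # second difference, constant (a wire shift in hardware)
    out = []
    for _ in range(n):
        out.append(y)
        y += d1      # advance f by the current first difference
        d1 += d2     # advance the first difference itself
    return out
```

In fabric this is just two accumulators chained together - no DSP blocks consumed at all.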

Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.

Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...

(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)

Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).

Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.
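In other words, the mux problem turns into an addressing problem. A toy behavioural model of that arrangement (coefficient values and sizes are placeholders, not from the actual design):

```python
# Toy model: one shared multiply-accumulate unit fed by a stream of FFT
# outputs on one input and a coefficient ROM on the other. The ROM address
# is just a counter running in step with the stream, so no wide operand
# mux is needed - only an address counter per shared multiplier.
COEFF_ROM = [3, -1, 4, 1, 5, -9, 2, 6]   # placeholder coefficients

def mac_pass(fft_stream):
    acc = 0
    for addr, sample in enumerate(fft_stream):
        acc += sample * COEFF_ROM[addr % len(COEFF_ROM)]
    return acc
```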

On performance: I think Dave Tweed was being needlessly pessimistic. The FFT takes O(N log N) operations, but you get to choose O(N) butterfly units and O(log N) cycles, or O(log N) units and O(N) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...

  • An FFT implemented with a single hardware butterfly is going to require NlogN clock cycles to complete; for 512 points, that would be 256*8 butterflies, or 2048 clocks. That means that the 41210 (or 32768?) MACs would only require 8-10 hardware multipliers to get done in the same amount of time. – Dave Tweed Jan 01 '13 at 16:04
  • I mean, 16-20 multipliers. – Dave Tweed Jan 01 '13 at 16:12
  • Sorry, I just realized I got that backwards. The individual FFTs are 64 points, so the single-butterfly implementation will require 32*5 = 160 clocks. The MACs can then be done with 200-250 hardware multipliers. – Dave Tweed Jan 01 '13 at 16:23
  • this is what stumps me. How can xilinx design a core capable of doing 16k/32k ffts that require 400k multiply-add operations (NlogN) and yet I'm struggling with my 41k? there must be a way! –  Jan 01 '13 at 16:31
  • @Dave : I believe you mean 160 multiplications, not 160 cycles, surely? There is nothing quite so inherently serialised in an FFT... –  Jan 01 '13 at 19:55
  • A radix-2 FFT butterfly requires 4 multiplications and two additions. I'm assuming that the "radix-2 lite architecture" mentioned by the OP can produce a pair of results every clock cycle, based on the resources used. Sequencing the input/intermediate data and coefficients through the hardware is left as an exercise for the reader, but it's quite amenable to pipelining. – Dave Tweed Jan 01 '13 at 20:04
3

If this problem doesn't have hard realtime constraints, and it sounds like it doesn't — you just want it to run "faster", then it seems like it might be quite amenable to acceleration on one or more GPUs. There are several software libraries that make this a relatively straightforward proposition, and this would be about an order of magnitude easier than going straight to custom FPGA hardware.

Just Google for "GPU-enabled library" or "GPU-accelerated library" to get started.

Dave Tweed
  • Interestingly enough, i mentioned GPUs to the client when i heard about this project, and he wasn't interested. –  Jan 01 '13 at 01:43
  • Did he say why? – Dave Tweed Jan 01 '13 at 02:10
  • He didn't really say why, just that he had looked into it before, and using an FPGA seemed like less work, apparently. I'm going to have to bring it up again. –  Jan 01 '13 at 05:51
  • @stanri: Even if you do ultimately end up in an FPGA implementation, it seems to me that GPU might be a good way to "breadboard" the overall system architecture. Do you have (and could you share?) some sort of high-level dataflow graph for the algorithm, and can you give us an idea of the amount of data involved? Without answers to questions like these, it's going to be really hard to give you anything other than very generic advice. – Dave Tweed Jan 01 '13 at 12:28
  • It's actually a very very simple algorithm, it's just the scale that makes it so complicated. Basically as follows: initial conditions -> 512 ffts in parallel -> 32768 multiply operations on FFT output -> adjust initial conditions -> rinse and repeat –  Jan 01 '13 at 15:48
  • The fact that the same FFT -> multiplier loop is occurring repeatedly makes this particular implementation particularly suited to an FPGA, however that's only if the design will fit. –  Jan 01 '13 at 15:49
  • That means for each iteration of the algorithm, you need to perform 512*32*5 = 81920 butterflies. If you put together an FPGA with, say, two hardware butterflies and one MAC, you could complete one iteration is something like 40000 clock cycles, or 4 ms at a conservative clock of 100 MHz. 32 butterflies (one iteration of one complete row of an FFT) and 16 MACs would get it done in 2560 clock cycles. How wide are the data words? Whether or not you need to buffer to external memory will determine how far you can practically push the performance. – Dave Tweed Jan 01 '13 at 16:36
  • Again, this sounds like a *really* good fit for GPU acceleration. – Dave Tweed Jan 01 '13 at 16:38
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/6926/discussion-between-stacey-anne-rieck-and-dave-tweed) –  Jan 01 '13 at 16:40
1

How little of an issue is execution time?

This really seems like a situation where you should implement a soft MCU, an FPGA with an integrated hard MCU, or even a separate MCU device, and serialize all your operations.

Assuming you have the execution time, doing your FFTs in software will be both far easier to debug, and probably far simpler to design too.

Connor Wolf
  • 1
    Doing heavy computation in a soft-core CPU on an FPGA is silly; if you are going to do the computation in a stored-program architecture (something that should be considered), do it on high performance-per-dollar hard CPU(s), where you don't pay the speed penalty of flexible logic over comparable-fab-generation hard logic. – Chris Stratton Dec 31 '12 at 21:21
  • @ChrisStratton - Good point. Added an additional note to that effect. – Connor Wolf Dec 31 '12 at 21:25
  • 1
    Even the built in hard-CPUs are not going to hold a candle to commodity conventional processors/GPUs for software-based tasks, and will cost drastically more. – Chris Stratton Dec 31 '12 at 21:26
  • @ChrisStratton - I thought the most common integrated hard-CPU architectures was either ARM or POWER? In that case, it basically *is* a commodity CPU. – Connor Wolf Dec 31 '12 at 21:30
  • Not of the sort someone would use for a compute-driven task. I don't know if Amazon publishes details of the hardware they are buying for their compute cloud, but such an installation (shared or private) will be a careful consideration of the intersection of performance gains in the latest offerings, acquisition cost, and power cost of operation. A CPU stuck into a corner of an FPGA die isn't going to come anywhere close on cost or performance to what you can do by cherry picking what the drastically larger general purpose computing market has to offer on any given day. – Chris Stratton Dec 31 '12 at 21:34
  • @ChrisStratton - I think we're thinking radically different compute scales. If you want a **lot** of crunch (such as a modern GPU or X86 CPU), you are correct that you will need a separate processor. However, designing a system that uses such processors will be extremely hard going on impossible. If you need that much crunch, you're better off just buying a desktop PC, and making your FPGA device a PCIe card. – Connor Wolf Dec 31 '12 at 21:40
  • I say that execution time isn't an issue because I'm not streaming data through the FPGA, I'm generating it on the FPGA. Currently the PC implementation takes hours. Unfortunately, GPU or CPU isn't an option here, client insists on FPGA. –  Jan 01 '13 at 01:57
  • I don't know what to tell you, then. It sounds like your project is ideally suited for a GPGPU implementation. Is your client totally intractable? – Connor Wolf Jan 01 '13 at 04:11
  • 1
    Given your other FPGA question, building the FPGA board is likely to be a learning experience that will cost quite a bit more than estimated. I think the thing to do at this point would be to give the client some hard price/performance numbers from trial compute cloud runs (which could eventually become purchased hardware), vs some idea of the higher price and much higher risk of the FPGA effort. – Chris Stratton Jan 02 '13 at 15:14
1

It's possible to use specialized hardware or an FPGA (or even a CPLD) to greatly accelerate certain kinds of math operations. The key thing to bear in mind when trying to design hardware (circuitry or FPGA logic) to accelerate math operations is to figure out in what order data will need to go into and come out of your device. A device with an efficient I/O layout may offer much better performance than one with an inefficient layout, even if the latter device requires much more circuitry.

I haven't tried working out a hardware-assist design for an FFT, but one I have looked at is hardware assistance for large multiply operations (as might be used for RSA encryption). Many microcontrollers, even those with special fast-multiplication hardware, are not terribly efficient at such operations because they require a lot of register shuffling. Hardware designed to minimize register swapping can achieve much better performance with multi-precision multiplication operations, even if the hardware itself isn't as sophisticated. For example, hardware which can perform a pipelined 16xN multiplication two bits at a time (shifting in the two lower bits of the multiplicand, and shifting out the two upper bits of the result) may achieve better performance than hardware which can perform an 8x8 multiply in one cycle, even though the former may take less circuitry (and, by virtue of pipelining, have a shorter critical data path). The key is to figure out what the "inner loop" of the necessary code will look like, and figure out if there are any inefficiencies that can easily be eliminated.
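A behavioural sketch of that two-bits-per-step scheme (widths and names are illustrative; in hardware each small partial product is formed with shifts and adds, not a real multiplier):

```python
def mul_two_bits_per_step(multiplier, multiplicand, width):
    """Shift-add multiply consuming two multiplicand bits per step, as a
    pipelined datapath would. Each step's partial product uses only a
    2-bit operand (0..3), cheap to form from shifted copies of the
    multiplier."""
    result = 0
    for step in range(0, width, 2):
        two_bits = (multiplicand >> step) & 0b11   # shift in 2 low bits
        result += (multiplier * two_bits) << step  # accumulate partial product
    return result
```

For a 16-bit multiplicand this takes 8 steps, trading cycles for a much narrower datapath per cycle.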

supercat
  • What kinds of operations are particularly suited to this form of optimisation? I've edited the question above to detail a bit more about the nature of the multiply operation. Hardware-assist design sounds really interesting! –  Jan 01 '13 at 16:35