62

I hear of people using FPGAs to improve performance of systems that do things like Bitcoin mining, electronic trading, and protein folding.

How can an FPGA compete with a CPU on performance when the CPU is typically running at least an order of magnitude faster (in terms of clock speed)?

Cheese Lover
David Gardner

5 Answers

55

CPUs are sequential processing devices. They break an algorithm up into a sequence of operations and execute them one at a time.

FPGAs are (or can be configured as) parallel processing devices. An entire algorithm might be executed in a single tick of the clock or, at worst, in far fewer clock ticks than a sequential processor needs. One of the costs of the increased logic complexity is typically a lower maximum rate at which the device can be clocked.

Bearing the above in mind, FPGAs can outperform CPUs on certain tasks because they can do the same task in fewer clock ticks, albeit at a lower overall clock rate. The gains that can be achieved are highly dependent on the algorithm, but at least an order of magnitude is not atypical for something like an FFT.

Further, because you can build multiple parallel execution units into an FPGA, if you have a large volume of data that you want to pass through the same algorithm, you can distribute the data across the parallel execution units and obtain further orders of magnitude higher throughput than even a multi-core CPU can achieve.
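As a software-flavored sketch of that data distribution (the unit count and the `process` function here are illustrative stand-ins, not anything from the answer above): on an FPGA, each iteration of the outer loop below exists as a physical copy of the logic, so all copies run in the same clock cycle.

```c
#define UNITS 4  /* hypothetical number of replicated execution units */

/* Stand-in for the algorithm each execution unit implements. */
static int process(int x) { return x * x; }

/* Software model of distributing a batch across UNITS parallel units.
   In hardware the outer loop is not a loop at all: it is UNITS separate
   copies of the same logic, all working simultaneously. */
void process_batch(const int *in, int *out, int n) {
    for (int u = 0; u < UNITS; u++)
        for (int i = u; i < n; i += UNITS)
            out[i] = process(in[i]);
}
```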

The price you pay for the advantages is power consumption and $$$'s.

markt
  • 2
    +1; FPGAs however are not as dynamic as CPUs, which is why CPUs are usually better suited for PCs – Nick Williams Mar 02 '14 at 03:24
  • 21
    "The price you pay for the advantages is power consumption and $$$'s." -- This is often true, but you can squarely beat a high-end multi-$1000 Intel Xeon machine with a low-end $50 Xilinx Spartan-6 for many algorithms. But that typically takes a lot of engineering time and you may end up with a very custom design that only works for one application and is hard to change. So the tradeoff is not just power and money, but algorithm development time, reusability and flexibility. (Although you can argue time == money.) – wjl Mar 02 '14 at 13:05
  • markt, about your last sentence, aren't FPGAs much lower power than CPUs? There are a broad range of devices for both CPUs and FPGAs, but if we look at the ones that are used for things like bit-coin mining, aren't the CPUs used for those tasks much more power hungry than the FPGAs that would be used? – David Gardner Mar 02 '14 at 15:13
  • 4
    @David: When talking about Bitcoin mining, the relevant metric is number of hashes per watt. Markt is talking about overall power consumption. That is, a given FPGA may consume 3x the power of a typical CPU, but be much more than 3x faster at Bitcoin mining; so for Bitcoin that's a win. – Billy ONeal Mar 02 '14 at 18:05
  • 2
    @Billy: the number of hashes per watt · second, not per watt. – Paŭlo Ebermann Mar 03 '14 at 09:28
  • @PaŭloEbermann: True. – Billy ONeal Mar 03 '14 at 22:10
  • Rather, number of hashes per Joule, which is how electricity is billed... – bobflux Dec 02 '17 at 00:45
  • Yes if your bitcoin mining rig comes with a geothermal power station and is cooled by an arctic climate. Otherwise we tend to get billed in fiat currency. – Sentinel Mar 19 '18 at 21:30
38

Markt has this mostly right, but I'm going to throw in my 2 cents here:

Imagine that I told you that I wanted to write a program which reversed the order of bits inside of a 32-bit integer. Something like this:

unsigned int reverseBits(unsigned int input) {
    unsigned int output = 0;
    for (int i = 0; i < 32; i++) {
        output = output << 1;        // make room for the next output bit
        // Check if the lowest bit is set
        if ((input & 1) != 0) {
            output = output | 1;     // set the lowest bit to match in the output!
        }
        input = input >> 1;
    }
    return output;
}

Now my implementation is not elegant, but I'm sure you agree that there would be some number of operations involved in doing this, and probably some sort of loop. This means that on a CPU you have spent many more than one cycle implementing this operation.

In an FPGA, you can simply wire this up as a pair of latches. You get your data into some register, then you wire it into a different register in reverse bit order. This means that the operation will complete in a single clock cycle in the FPGA. Thus, in a single cycle, the FPGA has completed an operation that took your general-purpose CPU many thousands of cycles to complete! In addition, you can wire up probably a few hundred of these registers in parallel. So if you can move a few hundred numbers onto the FPGA, in a single cycle it will finish those thousands of operations hundreds of times over, all in one FPGA clock cycle.
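For contrast, even the fastest common software version of this operation (the mask-and-shift "butterfly" technique) still costs a dozen or so ALU instructions on a CPU, whereas the FPGA version is pure routing:

```c
#include <stdint.h>

/* Classic O(log n) bit reversal: swap adjacent bits, then bit pairs,
   then nibbles, then bytes. Roughly a dozen ALU operations; far better
   than a 32-iteration loop, but still not "free" the way hard-wired
   routing on an FPGA is. */
uint32_t reverse32(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = (x << 24) | ((x & 0x0000FF00u) << 8)
      | ((x >> 8) & 0x0000FF00u) | (x >> 24);
    return x;
}
```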

There are many things which a general-purpose CPU can do, but as a limitation we give it generalized, simple instructions, which means many tasks necessarily expand into long lists of those simple instructions. So I could give the general-purpose CPU an instruction like "reverse bit order of a 32-bit register" and grant the CPU the same capability as the FPGA we just built, but there is an infinite number of such possibly useful instructions, and so we only put in the ones which warrant the cost in popular CPUs.

FPGAs, CPLDs, and ASICs all give you access to the raw hardware, which allows you to define crazy operations like "decrypt AES256-encrypted bytes with key" or "decode frame of h.264 video". These have latencies of more than one clock cycle in an FPGA, but they can be implemented far more efficiently than writing out the operation in millions of lines of general-purpose assembly code. This also has the benefit of making the fixed-purpose FPGA/ASIC for many of these operations more power-efficient, because it doesn't have to do as much extraneous work!

Parallelism is the other part which markt pointed out, and while that is important as well, the main thing is when an FPGA parallelizes something which was already expensive in the CPU in terms of cycles needed to perform the operation. Once you start saying "I can perform in 10 FPGA cycles a task which takes my CPU 100,000 cycles, and I can do this task in parallel 4 items at a time," you can easily see why an FPGA could be a heck of a lot faster than a CPU!
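Putting hypothetical numbers on that trade-off (the clock rates below are illustrative assumptions, not measurements):

```c
/* Throughput in tasks per second, given a clock rate, the number of
   cycles one task needs, and how many tasks run in parallel. */
double tasks_per_second(double clock_hz, double cycles_per_task, double parallel) {
    return clock_hz / cycles_per_task * parallel;
}
```

With the figures from the paragraph above, a 3 GHz CPU needing 100,000 cycles per task manages 30,000 tasks/s, while a 200 MHz FPGA needing 10 cycles per task with 4 parallel units manages 80,000,000 tasks/s: roughly a 2,700x speedup despite a 15x slower clock.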

So why don't we use FPGAs, CPLDs, and ASICs for everything? Because in general it is a whole chip which does nothing but one operation. This means that although you can get a process to run many orders of magnitude faster in your FPGA/ASIC, you can't change it later when that operation is no longer useful. The reason you can't (generally) change an FPGA once it's in a circuit is that the wiring for the interface is fixed, and normally the circuit doesn't include components which would allow you to reprogram the FPGA into a more useful configuration. There are some researchers trying to build hybrid FPGA-CPU modules, where there is a section of the CPU which is capable of being rewired/reprogrammed like an FPGA, allowing you to "load" an effective section of the CPU, but none of these have ever made it to market (as far as I'm aware).

Kit Scuzz
  • 4
    For the example of reversing bits (and all other bit swap/selection tasks) it doesn't really take 1 clock cycle, it takes 0. In your example, it takes 1 clock cycle to *store data in a latch*, which is not the same operation. It takes 1 clock cycle whether you reverse the bits or not. The operation of reversing the bits is 0 clock cycles; no overhead, just different routing. The difference is not just semantics, especially when you start adding things up. For example, how long does it take to shift a 32-bit word down 3 bits, then swap every other nibble, then reverse it? – wjl Mar 02 '14 at 13:16
  • 2
    "hybrid FPGA-CPU module" -- these have been on the market for a long time (see http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/index.htm for a modern successful one), but even without special support, combining software & HDL is commonly done by implementing a soft CPU inside the FPGA on the fabric. – wjl Mar 02 '14 at 13:18
  • @wjl You're right that it technically takes no cycles to perform the operation itself. I would argue that your example is only semantically different though, mostly because doing those three operations logically translates into a fixed bit pattern (i.e. I start with b1b2b3b4 and I end with b3b1b4b2). This was kind of my point in the whole answer. I was trying to point out that describing an operation as a series of steps is frequently only necessary when you have a fixed instruction set/gate arrangement. – Kit Scuzz Mar 02 '14 at 23:42
  • @wjl: The way david-gardner asked the question, he seems to be saying "CPU" is equivalent to an Intel or AMD x86/x86_64 highly clocked, pipelined, and optimized CPU. There are many soft "CPUs", but none of the ones designed to sit in an FPGA can be clocked like an i7, nor are they nearly as optimized or capable. As for hybrids, I more meant something like this: http://newsroom.intel.com/docs/DOC-1512 which apparently does exist – Kit Scuzz Mar 02 '14 at 23:52
  • 1
    the Zynq really isn't too bad of a processor (ARM Cortex-A9 -- the same thing that runs tablet computers, etc), but I agree it would be way more awesome to have an integrated FPGA with a high speed x86_64. =) – wjl Mar 03 '14 at 01:16
  • @wjl - I think calling it ***"0 cycles"*** is not correct in the real world. Repeating the operation of reversing the bits of many 1000's of 32-bit integers (or even a small fraction of that) would clearly not be doable in ***"0 cycles"***. – Kevin Fegan Aug 17 '14 at 03:29
  • 1
    @KevinFegan -- Cycles is just a quantized unit of time. So, yes, "0 cycles" is exactly the correct thing to call it in the real world. Again, in an FPGA or ASIC, reversing 1000's of 32-bit integers would take 0 clock cycles if the FPGA had enough space to do it all in parallel. What takes time is *storage*, not rerouting bits, which just takes space. – wjl Aug 17 '14 at 21:31
32

All of the other popular answers presented here talk about literal differences between FPGAs and CPUs. They point out the parallel nature of the FPGA versus the sequential nature of a CPU, or give examples of why certain algorithms might work well on an FPGA. All of those are good and true, but I would suggest that there is a more fundamental difference between CPUs and FPGAs.

What’s the common denominator between an FPGA and a CPU? It is that they are both built on top of silicon. And in some cases literally the same silicon processes.

The fundamental difference is the abstractions that we pile on top of that silicon. It’s not possible for one human to understand the full detail of a single modern CPU design from silicon to packaged IC. So as part of the engineering process we divide that complex problem into smaller manageable problems that humans can wrap their heads around.

Consider what it takes to turn that silicon into a functioning CPU. Here’s a somewhat simplified view of the layers of abstraction necessary for that goal:

  1. First we have engineers who know how to create transistors from silicon. They know how to design tiny transistors that sip power and switch at rates of tens or even hundreds of gigahertz, and they know how to design beefy transistors that can drive signals with enough power to send them out of an IC package and across a PCB to another chip.

  2. Then we have digital logic designers who know how to put those transistors together into libraries with hundreds of different logic cells: logic gates, flip-flops, muxes, and adders, to name a few, all in a variety of configurations.

  3. Next we have various groups of engineers who know how to put those digital (and sometimes analog) blocks together to form higher-level functional blocks such as high-speed transceivers, memory controllers, branch predictors, ALUs, etc.

  4. Then we have CPU designers who architect high-end CPU designs by pulling those functional units together into a complete system.

And it doesn’t stop there. At this point we have a working CPU which runs assembly code but that’s not a language most programmers write to these days.

  1. We might have a C compiler that compiles to assembly code (probably through some intermediate representation)
  2. We could add another abstraction on top of C to get an object-oriented language
  3. We might even write a virtual machine on top of C or C++ so that we can interpret things like Java bytecode

And the abstraction layers can go on from there. The important point here is that those abstraction layers combine to yield a CPU based system that scales massively and costs a tiny fraction of a custom silicon design.

HOWEVER, the important point to be made here is that each abstraction also carries a cost itself. The transistor designer doesn't build the perfect transistor for every use case. He builds a reasonable library, and so sometimes a transistor is used that consumes a little more power or a little more silicon than is really needed for the job at hand. And similarly, the logic designers don't build every possible logic cell. They might build a 4-input NAND gate and an 8-input NAND gate, but what happens when another engineer needs a 6-input NAND? He uses the 8-input NAND gate and ties off 2 unused inputs, which results in lost silicon resources and wasted power. And so it goes up the chain of abstractions, each layer giving us a way to handle the complexity, but at the same time charging us an additional incremental cost in terms of silicon and power.

Now compare those abstractions to what is needed for an FPGA. Essentially, the FPGA abstractions stop at #2 in the list above. The FPGA allows developers to work at the digital logic layer. It's somewhat more sophisticated than that, because CPUs are 'hard coded' at this layer while FPGAs must be configured at run time (which, by the way, is why CPUs typically run at much higher frequencies), but the essential truth is that there are far fewer abstractions for FPGAs than for CPUs.

So, why can an FPGA be faster than a CPU? In essence it's because the FPGA uses far fewer abstractions than a CPU, which means the designer works closer to the silicon. He doesn't pay the costs of all the many abstraction layers which are required for CPUs. He codes at a lower level and has to work harder to achieve a given bit of functionality, but the reward is higher performance.

But of course there is a downside to fewer abstractions as well. All those CPU abstractions are there for good reason. They give us a much simpler coding paradigm, which means more people can easily develop for them. That in turn means that there are many more CPU designs in existence, and thus we have massive price/scale/time-to-market benefits from CPUs.

So there you have it. FPGAs have fewer abstractions, so they can be faster and more power-efficient but are difficult to program for. CPUs have many abstractions designed to make them easy to develop for, scalable, and cheap. But they give up speed and power in trade for those benefits.

David Gardner
  • Also, FPGAs are designed using simple repetitive blocks that carry out simple logical tasks. They are tailor-made for certain types of tasks. CPUs, OTOH, have many complex functional parts all doing different things. One could consider that a CPU is a group of many different FPGA-like devices (after all, it's all just silicon, electronics, and math). So it's just not about abstractions, it's about complexity. CPUs are complex devices made up of many different types of electrical devices while an FPGA is made up of a few. A CPU is a shotgun while an FPGA is a rifle. – AbstractDissonance Jul 18 '16 at 00:26
22

Whilst the other answers are all correct, none of them yet addresses the Bitcoin mining example from your question, which is indeed a decent example. Bitcoin mining involves repeatedly calculating a cryptographic hash function, SHA-256 of the result of another SHA-256 calculation, of data where only a single 32-bit integer changes, until the resulting hash has certain properties. Each SHA-256 consists of 64 repetitions of the same algorithm involving 32-bit additions, bitshifts, and some more bit-mangling operations.
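Those per-round operations come straight from the SHA-256 specification (FIPS 180-4); in C they look like the sketch below, and on an FPGA the rotations cost nothing at all, since they are just wiring:

```c
#include <stdint.h>

/* 32-bit right rotation; n must be in 1..31 here. */
static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* The bit-mangling functions used in every one of the 64 SHA-256 rounds. */
static uint32_t Ch(uint32_t e, uint32_t f, uint32_t g)  { return (e & f) ^ (~e & g); }
static uint32_t Maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }
static uint32_t Sigma0(uint32_t a) { return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22); }
static uint32_t Sigma1(uint32_t e) { return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25); }
```

On a CPU each of these lines is several instructions; in an FPGA, `Ch`, `Maj`, and the XOR trees map onto a handful of LUTs and the rotations disappear entirely into routing.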

If you program this loop on a 32-bit (or wider) CPU, you will find its instruction set very well suited to the task; SHA-256 was designed to run efficiently on CPUs. Still, you will only be using maybe 2% of a modern CPU's silicon area, with area-intensive functionality like caching, multiplication, division, floating-point operations, branching, and branch prediction either not used at all or unable to provide a significant performance boost for this particular task.

In configurable hardware like an FPGA, you simply implement only those 2%, and optimize further by forgetting all about code execution, instead designing gates to directly compute each of those often-repeated subfunctions. Pipelined such that each stage passes its result to the next every clock cycle, and repeated 128 times (with some special additional logic where each SHA-256 begins and ends), you end up getting a result every clock cycle (for maybe 100 million hashes per second on an FPGA advertised to support 300 MHz on simpler logic than this), whilst on a modern CPU you can expect one result every few thousand clock cycles per core, say 10 million hashes per second on a multi-core multi-GHz CPU.
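The pipelining idea can be modeled in software; the stage count and the "+1" unit of work per stage below are toy stand-ins for the 128 hash rounds:

```c
#define STAGES 4   /* toy stand-in for the 128 pipelined hash rounds */

static int stage[STAGES];  /* one register per pipeline stage */

/* Advance the pipeline one clock tick: every stage does its unit of
   work and hands its result to the next stage, while a new input
   enters at the front. Returns whatever leaves the final stage. */
int pipeline_tick(int next_input) {
    int out = stage[STAGES - 1];
    for (int i = STAGES - 1; i > 0; i--)
        stage[i] = stage[i - 1] + 1;   /* "+ 1" stands in for one round of hashing */
    stage[0] = next_input + 1;
    return out;
}
```

After a fill latency of STAGES ticks, one fully processed result (input + STAGES) emerges every tick, which is exactly the "result every clock cycle" behavior described above.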

If this particular example is of interest to you, you may want to have a look at my related answer about the internals of ASIC miners on bitcoin.stackexchange, since many FPGA miners work in the same way, using configurable rather than custom-made hardware. Just for completeness' sake: there are other possibilities, like limiting or avoiding the pipelining I described in favor of a more trivial parallelization using multiple independent SHA-256 hashers. Depending on the constraints given by your FPGA's internals and its total size, that can even give better performance, although it would be less efficient in terms of gate count and routing overhead if you had perfect freedom in designing the entire chip rather than just an FPGA's configuration.

  • 3
    That's a very good point about utilization of silicon. – markt Mar 02 '14 at 20:01
  • But maybe (unintentionally!) misleading, considering that an FPGA consists of somewhat complex cells with many physical gates, of which a typical application again only uses a fraction, allowing their manufacturers to advertise equivalent gate counts in an attempt to tell you how much all of that might be worth in a "typical" application... –  Mar 02 '14 at 21:13
3

The answers above, while correct, miss the point about why FPGAs (and custom ASICs) are especially good for bitcoin calculations.

The real advantage is that a large proportion of the SHA-256 calculations are logical operations (for example, bit shifts) which can be done in wiring. When done this way, they require 0 clock cycles.

Another important advantage is that FPGAs are much more power-efficient (i.e. MIPS per Watt) than CPUs, so the amount of energy required for the calculations is much less. This is important because the cost of mining a bitcoin depends on how much electricity you use to make it.

ASIC chips are more energy-efficient than FPGAs, so they can perform the same computation even more cheaply. You can also cram more execution units onto the die to make them faster. The disadvantage is that the cost of making a custom ASIC is very high, so you would need to sell quite a few chips to cover the manufacturing cost.
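A back-of-the-envelope break-even calculation makes that trade-off concrete (all dollar figures below are illustrative assumptions, not real quotes):

```c
/* Units you must ship before a custom ASIC beats buying FPGAs.
   Rounds up, since you cannot ship a fraction of a chip. */
long break_even_units(double asic_nre,        /* one-time design/mask cost */
                      double asic_unit_cost,  /* per-chip cost once tooled */
                      double fpga_unit_cost)  /* per-chip cost of the FPGA */
{
    double units = asic_nre / (fpga_unit_cost - asic_unit_cost);
    long whole = (long)units;
    return (units > (double)whole) ? whole + 1 : whole;  /* ceiling */
}
```

For example, with a hypothetical $2M one-time cost, a $5/chip ASIC cost, and a $50/chip FPGA cost, you would need to ship about 44,445 units before the ASIC pays off.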

GPUs are also used for mining bitcoins, but since they are much less energy-efficient they have been losing ground to FPGAs and custom ASICs.

0xc000005
  • 1
    If you look at Monero hashing algorithm aka cryptonight, you will see that a FPGA implementation is near to impossible due to the high amount of memory needed to be accessed randomly (2MB). A CPU has the advantage in this case. – lucas92 Dec 01 '17 at 22:05
  • @lucas92 can't you integrate RAM into the FPGA to accommodate the amount of memory needed? – GENIVI-LEARNER Feb 13 '20 at 13:40
  • You probably won't have enough logic elements in the FPGA for it. – lucas92 Apr 23 '20 at 20:49
  • @GENIVI-LEARNER but then you can only have one or two mining threads per FPGA because on-chip memory isn't cheap, nor is off-chip memory bandwidth – user253751 Apr 05 '22 at 16:49
  • @user253751 ok makes sense. And when you say on-chip memory isn't cheap, it's due to manufacturing constraints? – GENIVI-LEARNER Apr 05 '22 at 20:11
  • @GENIVI-LEARNER I would assume so. Have a look at some FPGAs and compare the price and the amount of on-chip memory. You can use external memory chips (almost like computer RAM sticks) but then you have to get the data to and from the memory chips and that has a limited bandwidth. – user253751 Apr 06 '22 at 08:33
  • 2
    @GENIVI-LEARNER Monero now (I think it was different in 2017) uses an algorithm called RandomX, where the mining algorithm **is a CPU** and so the best way to mine is to make a CPU! They've specifically designed it so that there's no better way to mine. – user253751 Apr 06 '22 at 08:35