Markt has this mostly right, but I'm going to throw in my 2 cents here:
Imagine that I told you that I wanted to write a program which reversed the order of bits inside of a 32-bit integer. Something like this:
#include <stdint.h>

uint32_t reverseBits(uint32_t input) {
    uint32_t output = 0;
    for (int i = 0; i < 32; i++) {
        // Make room in the output for the next bit
        output = output << 1;
        // Check if the lowest bit of the input is set
        if ((input & 1) != 0) {
            output = output | 1; // set the lowest bit to match in the output!
        }
        input = input >> 1;
    }
    return output;
}
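(If you want to convince yourself that the loop does what it claims, here is a quick sanity check I'm adding, assuming the function above is in the same file; the bit-reversal of 0x00000001 is 0x80000000, and of 0x12345678 is 0x1E6A2C48.)

#include <stdio.h>
#include <inttypes.h>

/* Assumes reverseBits() from above is defined in the same file. */
int main(void) {
    printf("%08" PRIX32 "\n", reverseBits(0x00000001u)); /* prints 80000000 */
    printf("%08" PRIX32 "\n", reverseBits(0x12345678u)); /* prints 1E6A2C48 */
    return 0;
}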
Now, my implementation is not elegant, but I'm sure you'll agree that some number of operations is involved, and probably some sort of loop. That means the CPU spends far more than one cycle to carry out this operation.
In an FPGA, you can simply wire this up as a pair of registers. You clock your data into the first register, then wire it into the second register with the bit order reversed. This means the operation completes in a single clock cycle on the FPGA. Thus, in a single cycle, the FPGA has completed an operation that took your general purpose CPU many cycles! In addition, you can wire up probably a few hundred of these register pairs in parallel. So if you can move a few hundred numbers onto the FPGA, it will finish all of those reversals, each of which would have cost the CPU that whole loop, in 1 FPGA clock cycle.
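If it helps to see that "pure wiring" idea in software terms, here's a loose C analogy I'll add (an 8-bit version with a made-up name, purely for illustration): each output bit depends on exactly one input bit, so in hardware the whole thing is just eight wires crossing over between two registers, with nothing forced to happen "one after another" the way a CPU loop is.

#include <stdint.h>

/* Loose software analogy of the FPGA wiring (8 bits to keep it short).
   Each output bit is computed from exactly one input bit, so in hardware
   these are just eight crossed wires between two registers, and the whole
   result settles at once rather than bit by bit. */
static uint8_t reverseBits8Wired(uint8_t in) {
    return (uint8_t)( ((in >> 7) & 1u) << 0
                    | ((in >> 6) & 1u) << 1
                    | ((in >> 5) & 1u) << 2
                    | ((in >> 4) & 1u) << 3
                    | ((in >> 3) & 1u) << 4
                    | ((in >> 2) & 1u) << 5
                    | ((in >> 1) & 1u) << 6
                    | ((in >> 0) & 1u) << 7 );
}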
There are many things a general purpose CPU can do, but by design it offers only generalized, simple instructions, so many tasks necessarily expand into long lists of those simple instructions. I could give the general purpose CPU an instruction like "reverse bit order of a 32 bit register" and hand it the same capability as the FPGA we just built, but there is an infinite number of such possibly useful instructions, and we only put in the ones that warrant the cost in the popular CPUs.
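As it happens, bit reversal is one of the operations that some CPU designers did decide was worth the cost: ARM cores have an RBIT instruction that reverses a register's bits in one go, and Clang exposes it through a builtin (GCC, last I checked, does not offer this particular builtin, hence the fallback). A hedged sketch, with the fallback just repeating the loop from the top of this answer:

#include <stdint.h>

uint32_t reverseBitsFast(uint32_t input) {
#if defined(__clang__)
    /* Clang's builtin; on ARM targets this compiles down to a single RBIT
       instruction, i.e. exactly the "reverse bit order" instruction imagined
       above. On other ISAs the compiler emits a bit-twiddling sequence. */
    return __builtin_bitreverse32(input);
#else
    /* Portable fallback: same idea as the loop at the top of this answer. */
    uint32_t output = 0;
    for (int i = 0; i < 32; i++) {
        output = (output << 1) | (input & 1u);
        input >>= 1;
    }
    return output;
#endif
}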
FPGAs, CPLDs, and ASICs all give you access to the raw hardware, which allows you to define crazy operations like "decrypt AES256 encrypted bytes with key" or "decode frame of h.264 video". These have latencies of more than one clock cycle in an FPGA, but they can be implemented far more efficiently than by executing millions of general purpose instructions. This also has the benefit of making a fixed-purpose FPGA/ASIC for many of these operations more power-efficient, because it doesn't have to do as much extraneous work!
Parallelism is the other part, which markt pointed out, and while that is important as well, the main win comes when an FPGA parallelizes something that was already expensive for the CPU in terms of cycles per operation. Once you start saying "I can perform in 10 FPGA cycles a task which takes my CPU 100,000 cycles, and I can do this task in parallel 4 items at a time" (that's 10,000 times fewer cycles per item, times 4 items at once), you can easily see why an FPGA could be a heck of a lot faster than a CPU!
So why don't we use FPGAs, CPLDs, and ASICs for everything? Because in general it is a whole chip which does nothing but one operation. This means that although you can get a process to run many orders of magnitude faster in your FPGA/ASIC, you can't change it later when that operation is no longer useful. The reason you can't (generally) change an FPGA once it's in a circuit is that the wiring for the interface is fixed, and normally the circuit doesn't include components which would allow you to reprogram the FPGA into a more useful configuration. There are some researchers trying to build hybrid FPGA-CPU modules, where there is a section of the CPU which is capable of being rewired/reprogrammed like an FPGA, allowing you to "load" an effective section of the CPU, but none of these have ever made it to market (as far as I'm aware).