I'm no expert on the topic. I'm a hobbyist, at most, today. I watch the news, buy and test out development boards from time to time, and broadly speaking enjoy the subject area of DSP. I've also developed professional products on both the TI C30 and C40 lines (years ago) and on Analog Devices' ADSP-21xx integer DSP line. (I prefer the ADSP-21xx.) I still play with DSP algorithms from time to time. But the last time I did anything like this professionally was in 2013. Given the pace of things, this means I don't know very much.
It's also hard to know what you are really asking about. But if you feel there may be a bright-line answer, then I think you may be out of luck. Which hardware features might boost performance has everything to do with the algorithms you are using.
DSPs traditionally have focused on a few core concepts. These include:
- multiply-accumulate (MAC) instructions
- simultaneous reads from two separate memory systems
- several micro-operations combined into a single-cycle instruction
For example, the ADSP-21xx processor can read from two different memory systems (data memory, plus instruction memory treated as if it were data), perform an ALU operation, and write to memory, all within a single cycle. In practice that means reading new data and its associated constants (or other data), performing an ALU operation, and writing out a prior ALU result, every clock cycle.
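To make the MAC idea concrete, here is a plain-C sketch of the inner loop of a FIR filter. Nothing about the C itself is DSP-specific; the point is that on a DSP like the ADSP-21xx each iteration (the two memory reads, the multiply-accumulate, and the pointer updates) issues as roughly one instruction cycle per tap, while a generic MCU may spend several cycles on the same work:

```c
#include <assert.h>

/* FIR dot-product: on a classic DSP, the read of the sample (from
 * data memory), the read of the coefficient (from program memory),
 * the multiply-accumulate, and the address updates all fit in a
 * single cycle per tap, with zero-overhead looping. */
static long fir_mac(const int *samples, const int *coeffs, int taps)
{
    long acc = 0;                              /* MAC accumulator */
    for (int i = 0; i < taps; i++)
        acc += (long)samples[i] * coeffs[i];   /* one MAC per tap */
    return acc;
}
```

The function and names here are mine, for illustration only; a real implementation would also use the circular-buffer addressing these chips provide for the delay line.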
The ADSP-21xx was relatively low-power and didn't support floating point in hardware. Instead, it targeted cheap, low-power applications and aided software floating point with a fully combinatorial barrel shifter that could normalize and denormalize in a single ALU operation.
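As an illustration of what that barrel shifter buys you: normalizing a fixed-point value means finding the leading bit and shifting it up, which the DSP does as one exponent-detect-plus-shift operation. The sketch below (my own, for illustration) shows the loop an ordinary ALU would need instead:

```c
#include <stdint.h>

/* Normalize a nonzero positive 16-bit fixed-point value: shift left
 * until the MSB (bit 15) is set, returning the shift count (the
 * "exponent").  A combinatorial barrel shifter does the exponent
 * detect and the full shift in a single ALU operation; without one,
 * you pay a cycle (or more) per bit of shift. */
static int normalize16(uint16_t *x)
{
    int exp = 0;
    while (*x != 0 && !(*x & 0x8000u)) {   /* until bit 15 is set */
        *x <<= 1;
        exp++;
    }
    return exp;
}
```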
All competitive products have to balance dozens, if not hundreds, of competing issues: power consumption, manufacturing and calibration cost, size, weight, legal risks, response to environmental variations, variations between users, availability, tool complexity and cost, and the outlook for all of the above and more.
There's no single answer to an application area.
More modern processors, like the Micro Magic RISC-V announced in October, deliver 11k CoreMarks at \$200\:\text{mW}\$, which is damn good. It's not a DSP, but it is fast and provides that speed at low power.
But if I had to hang my hat on a single thing that, from experience, makes a DSP nice to have and other processors, despite their overall performance, not so nice, it is rigorous, deterministic timing from input sample to output sample.
In all the work I've done on signal processing, the one thing that has made my products excel where much more highly funded competing projects failed is my focus on keeping the time between sampling the input and driving the output fixed and short.
I shoot for zero-cycle variance. With some DSPs (the Analog Devices ADSP-21xx family, for example), I can achieve this. With others (the TI C30, for example), I cannot, under any circumstances. So even among DSPs, some are better than others in this particular area.
So I look for a system where I can sample the ADC at an absolutely fixed rate. In many of these cases, the DSP or MCU must toggle pins and otherwise operate an external ADC manually (not uncommon). Doing this on a common MCU with zero-cycle variance is very difficult; doing it on a good DSP is not. With the ADSP-21xx, I've been able to operate very fast ADCs with zero-cycle sampling variance, which very few MCUs can be expected to achieve. Given the rigor with which the instructions execute, I can also ensure zero-cycle variance in delivering output changes to the DAC. (There will always be some sub-cycle analog variance beyond my control, though.) And I can ensure that the work in between is performed quite quickly, given the dual memory read plus memory write plus ALU op that the ADSP-21xx allows me in each cycle.
Using most MCUs (CISC or RISC) usually means I don't have such tight control. They may be fast (which is good), but if there is variation in input sampling and output drive, that processing speed doesn't help me. FFT processing assumes regularly spaced samples, so it will immediately smear the results if I cannot deliver consistent input/output timing.
So, perhaps, if I had to pick one thing out of many that has helped me deliver solutions to customers who were failing miserably with the products of large companies (Omega, etc.), it would be that I can very strictly control the data flow in a DSP, where on an MCU I lose some of that control because the instructions have variable timing, the interrupt system isn't predictable in its response, and other details besides.
A lot of people just borrow algorithms from others and stuff them in place, without caring about the timing or what is going on under the hood. Not all compilers translate those routines the same way, and a great deal of variation arises from the use of "library code."
But I write every line of code myself, and I test and validate each and every routine, start to finish. When I specify that the processing takes 1783 cycles, that is exactly what it takes. Not 1782 cycles, not 1784. Exactly 1783 cycles, every single time, no exceptions. So you will know the delay down to a fraction of a cycle.
Please note, though, that not all DSPs can provide rigorous timing. The TI C30 and C40 lines couldn't come close. In one case, working alongside TI experts on an application where timing was vital, I found that the documentation said a routine should take 7 cycles while the measured behavior was 11 cycles. We spent months on it, and neither TI's field engineers nor their internal design staff were ever able to explain the discrepancy. I could no longer trust their devices for my application, so we closed the door on their DSP.
So it's only some DSPs, not all. And I'd expect some RISC processors to be very competitive today.
Perhaps your experiences are due to this effect. I can't tell, though.