27

Given the same number of pipeline stages, the same manufacturing node (say, 65 nm), and the same voltage, simple devices should run faster than more complicated ones. Also, merging multiple pipeline stages into one should not slow a design down by a factor greater than the number of stages merged.

Now take a five-year-old CPU, running 14 pipeline stages at 2.8 GHz. Suppose one merged all the stages into one; by the reasoning above, the clock should drop by at most a factor of 14, i.e. to no less than about 200 MHz. Now increase the voltage and reduce the number of bits per word; that should actually speed things up further.
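
Spelling out the back-of-the-envelope numbers I have in mind (just a sketch of the arithmetic above, nothing more):

```c
#include <stdio.h>

int main(void) {
    double f_pipelined_ghz = 2.8;  /* example CPU clock with 14 pipeline stages */
    int    stages          = 14;

    /* Merging all stages into one should divide the clock by at most
     * the number of stages, so it should still reach about 200 MHz. */
    double f_merged_mhz = f_pipelined_ghz * 1000.0 / stages;
    printf("Expected single-stage clock: %.0f MHz\n", f_merged_mhz);  /* 200 MHz */
    return 0;
}
```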

That's why I don't understand why many currently manufactured microcontrollers, such as AVRs, run at abysmally low clock rates (such as 20 MHz at 5 V), even though far more complicated CPUs manufactured years ago were capable of running roughly 140x faster (2.8 GHz vs. 20 MHz), or 10x faster with all pipeline stages rolled into one, at around 1.2 V. By even the coarsest back-of-the-envelope calculation, microcontrollers, even when manufactured on borderline obsolete technology, should run at least 10x faster at a quarter of the voltage they are supplied with.

Thus the question: What are the reasons for slow microcontroller clock rates?

jskroch
Michael
    A good chunk of microcontrollers are manufactured with borderline obsolete technology because the fab is paid for. – Matt Young Apr 11 '16 at 18:31
  • 20
    Power. Factor in the power consumption of both CPUs and they'll be quite close to the same performance/watt, or the micro will win. –  Apr 11 '16 at 18:51
  • 36
    The idea that simpler == faster is simply wrong. A lot of the complexity of a modern CISC CPU goes into features to make it faster, like multi-level caches, pipelines, and branch prediction. – PlasmaHH Apr 11 '16 at 19:13
  • 2
    That old CPU doesn't run from a small battery for months/years. It used cutting-edge (read: expensive) technology for its day. It didn't have to wait on slow/cheap flash for every instruction. There is rarely a need for an MCU to run fast; they can take some new Verilog for the sake of the developers and implement it on whatever foundry. I like the bicycle vs. Formula 1 car comment the best; I think that sums it up. – old_timer Apr 11 '16 at 19:56
  • 1
    One way Intel is getting better MIPS/watt performance is by simply running an old design much slower. – old_timer Apr 11 '16 at 19:59
  • 1
    "reduce number of bits per word" makes surprisingly little difference outside of a multiplier. Which is an optional feature on smaller microcontrollers. – pjc50 Apr 11 '16 at 21:06
  • 1
    .. and you can get a 200MHz microcontroller if you want: http://www.marketwired.com/press-release/nxp-ships-worlds-fastest-arm-cortex-m4-and-cortex-m3-microcontrollers-nasdaq-nxpi-1594382.htm - helpfully they tell you it's made on 90nm. – pjc50 Apr 11 '16 at 21:10
  • 16
    20 MHz is not slow at all. We are just pampered by GHz speeds for PCs, where most of the resources are used for rendering fancy graphics. You can fly to the Moon with a Kilohertz processor... – vsz Apr 12 '16 at 04:27
  • 1
    The whole point of a pipeline is that you perform instructions as fast as they go through. When Canada puts a barrel of oil into a pipeline, they don't need to wait until a Texas refinery pulls it out to put any more in. – Nick T Apr 12 '16 at 21:24
  • 1
    "microcontrollers so much slower than CPUs?" - A microcontroller *is* a CPU :) - just saying – Reversed Engineer Apr 14 '16 at 09:51
  • What's surprising is that using larger node with more area is still cheaper than a smaller node with less area. – FourierFlux Jan 02 '21 at 03:54
  • @FourierFlux I assume it has to do with the sunk costs of the facilities and yield. – DKNguyen Jan 02 '21 at 06:32
  • Yes but at some point newer nodes should be cheaper not more expensive. It's like buying a VHS player today. – FourierFlux Jan 02 '21 at 08:58

6 Answers

69

There are other factors that contribute to the speed.

  • Memory: Actual performance is often limited by memory latency. Intel CPUs have large caches to make up for this. Microcontrollers usually don't. Flash memory is much slower than DRAM.

  • Power consumption: This is often a big deal in embedded applications. Actual 200 MHz Intel CPUs consumed more than 10 watts (often much more), and needed a big heat-sink and a fan. That takes space and money, and that's not even counting the external logic and memory that went with it. A 20 MHz AVR takes about 0.2 watts, which includes everything you need. This is also related to the process -- faster transistors tend to be leakier. (There's a rough numerical sketch of this trade-off after this list.)

  • Operating conditions: As Dmitry points out in the comments, many microcontrollers can operate over a wide voltage and temperature range. That ATMega I mentioned above works from -40C to 85C, and can be stored at anything from -65C to 150C. (Other MCUs work up to 125C or even 155C.) The VCC voltage can be anything from 2.7V to 5.5V (5V +/- 10% for peak performance). This Core i7 datasheet is hard to read since they trim the allowed VCC during manufacturing, but the voltage and temperature tolerances are certainly narrower -- ~3% voltage tolerance and 105C max junction temperature. (5C minimum, but when you're pulling >100 amps, minimum temperatures aren't really a problem.)

  • Gate count: Simpler isn't always faster. If it were, Intel wouldn't need any CPU architects! It's not just pipelining; you also need things like a high-performance FPU. That jacks up the price. A lot of low-end MCUs have integer-only CPUs for that reason.

  • Die area budget: Microcontrollers have to fit a lot of functionality into one die, which often includes all of the memory used for the application. (SRAM and reliable NOR flash are quite large.) PC CPUs talk to off-chip memory and peripherals.

  • Process: Those 5V AVRs are made on an ancient low-cost process. Remember, they were designed from the ground up to be cheap. Intel sells consumer products at high margins using the best technology money can buy. Intel's also selling pure CMOS. MCU processes need to produce on-chip flash memory, which is more difficult.
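
To put rough numbers on the power-consumption point above (a sketch only: dynamic switching power P ≈ C·V²·f, leakage and I/O ignored, and the effective-capacitance figures are made-up placeholders rather than measured values):

```c
#include <stdio.h>

/* Dynamic (switching) power only; leakage is ignored. */
static double dyn_power_watts(double c_eff_farads, double volts, double freq_hz) {
    return c_eff_farads * volts * volts * freq_hz;
}

int main(void) {
    /* Hypothetical effective switched capacitances, chosen only to show
     * how voltage and frequency trade off against each other. */
    printf("20 MHz AVR-class MCU at 5 V:  ~%.2f W\n", dyn_power_watts(1e-10, 5.0, 20e6));
    printf("2.8 GHz desktop CPU at 1.2 V: ~%.0f W\n", dyn_power_watts(2e-8, 1.2, 2.8e9));
    return 0;
}
```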

Many of the above factors are related.

You can buy 200 MHz microcontrollers today (here's an example). Of course, they cost ten times as much as those 20 MHz ATMegas...

The short version is that achieving speed takes more than simplicity, and cheap products are optimized for cheapness, not speed.

Adam Haun
26

A major underlying technical reason for the slow speeds is that cheap/small MCUs only use on-chip flash memory for program storage (i.e. they don't execute from RAM).

Small MCUs generally don't cache program memory, so they always need to read an instruction from flash before they execute it, every cycle. This gives deterministic performance and cycle counts per operation, is just cheaper/simpler, and avoids PC-like issues where code and data are mixed, creating a new set of threats from buffer overflows, etc.

The latency of reading from flash memory (on the order of 50-100ns) is much slower than reading from SRAM or DRAM (on the order of 10ns or below), and that latency must be incurred every cycle, limiting the clock speed of the part.
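
A quick sketch of that constraint (order-of-magnitude access times assumed, as above; the point is only that the flash access time bounds the clock period if every fetch must complete in a single cycle):

```c
#include <stdio.h>

int main(void) {
    /* Assumed access times in nanoseconds -- rough, illustrative figures. */
    double flash_ns = 50.0;
    double sram_ns  = 10.0;

    /* One instruction fetch per cycle: the clock period can't be shorter
     * than the memory access time, so f_max = 1 / t_access. */
    printf("Max clock with single-cycle flash fetch: %.0f MHz\n", 1000.0 / flash_ns);  /* 20 MHz  */
    printf("Max clock with single-cycle SRAM fetch:  %.0f MHz\n", 1000.0 / sram_ns);   /* 100 MHz */
    return 0;
}
```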

compumike
  • 4
    Also, power (and therefore heat) increases more than linearly with frequency. – The Unknown Dev Apr 11 '16 at 20:34
  • 1
    I don't think reading from flash is anywhere _near_ 100 ns, is it? IIRC it's two orders of magnitude bigger. However, _if_ your flash controller contains a small DRAM cache, and the code is not too branchy, the cache hit rate can be very high (90%+) so your average latency can be a lot lower. – MSalters Apr 12 '16 at 13:12
  • 3
    This AT91SAM7S datasheet I have open says of its internal flash: "Fast access time, 30 MHz single-cycle access in Worst Case conditions". That's 33ns. And it has one dword of prefetch buffer. Off-die Flash may indeed have higher latency. – pjc50 Apr 13 '16 at 14:44
  • 1
    @Jamil I don't remember the exact formula, but I believe it was the square of the frequency. – jaskij Apr 13 '16 at 15:13
24

Why do people ride a bicycle or a small motorbike when they could have a Formula 1 car? Surely it must be better to drive, say, 300 km/h and get everywhere instantly?

To put it simply, there's no need for them to be faster than they are. I mean, sure, there is some need, and faster microcontrollers do enable some things, but what are you going to do in, say, a vending machine that is in continuous use for maybe 1 hour a day? What are you going to do in, say, a remote control for a TV?

On the other hand, they have other important capabilities, like low power consumption, being MUCH simpler to program, and so on. Basically, they're not PC processors and do different things.

AndrejaKo
  • That's not exactly my question. I understand why one may want a $1 slow device rather than a $100 fast one when the slow one would suffice. What I don't understand is why the $1 device would ever run **as** slow as it does, given that simplicity usually implies speed. – Michael Apr 11 '16 at 18:35
  • 12
    @Michael Where do you get the idea simple = fast? – Matt Young Apr 11 '16 at 18:38
  • 3
    @Michael A bicycle is much simpler than a car, but it's still slower. In any case, Matt is right. Something simple is not automatically fast. That is to say, something fast is going to be complicated, just due to considerations needed for higher frequencies. – AndrejaKo Apr 11 '16 at 18:41
  • @MattYoung: that's the whole idea behind RISC architectures: simple tasks can be handled faster than complicated ones. And, in the absence of a multi-stage pipeline, the controller becomes much simpler. – Michael Apr 11 '16 at 18:56
  • 2
    High-performance CISC processors tend to issue way more instructions than simple embedded processors. They are doing a lot more work in parallel, so they are both more complex and faster. – The Unknown Dev Apr 11 '16 at 20:33
  • 1
    @Michael: One way to achieve speed is by keeping the instruction-set simple so it can be pipelined. **That doesn't mean that all simple hardware could work at high clocks, though!** How fast you can clock something depends on how many gate-delays there are in the longest pipeline stage, or something like that. In a "simple" design, there are probably some pretty slow stages that limit the max clock because it's *not* heavily pipelined. – Peter Cordes Apr 12 '16 at 06:21
  • 2
    @Michael $1 could be luxuriously expensive for some applications, I've read that the micro-controllers in micro SD cards cost around 19 cents – Xen2050 Apr 12 '16 at 17:40
  • 3
    @Michael "that's the whole idea behind RISC architectures: simple tasks can be handled faster than complicated ones" No! Modern RISC architectures are extremely complex because they have to introduce more instructions (like SIMD) and support more features like superscalar, hyperthreading, out-of-order execution... Their complexity may easily exceed CISC architectures. MIPS nowadays have hundreds or thousands of instructions. ["CISC v RISC is largely a historical debate"](https://www.quora.com/What-are-CISC-and-RISC-architecture-How-do-they-differ-from-each-other) – phuclv Apr 13 '16 at 07:25
  • 1
    excellent opening analogy, accurate AND lolworthy, would read again. – underscore_d Apr 13 '16 at 21:08
13

There are plenty of ARM controllers that run at hundreds of MHz or more. Who needs a 500 MHz PIC and is willing to pay enough per part to justify million dollar masks for a close to state-of-the-art process?

The popular ATmega328 is reportedly made with 350 nm technology, which is quite a bit behind the latest production Intel CPUs (14 nm for Skylake).

Even the cheapie 8-bit controllers have slowly been edging up in speed, and you can get 32 and 64 MHz PIC controllers (for example, PIC18F14K22) that still operate at 5 V (the latter is a consideration in total system cost).

One consideration is that these controllers have an architecture that is optimized for small memory spaces and slow clock speeds. Once you start getting into high clock speeds you have to rejig things with prescalers, etc.
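
A small illustration of that rejigging (a sketch only; the divisor formula below is the common UART form with assumed 16x oversampling, and no particular part's registers are shown): the same peripheral needs a much wider clock divider, or a prescaler in front of it, once the core clock goes up.

```c
#include <stdint.h>
#include <stdio.h>

/* Common UART baud-rate divisor, assuming 16x oversampling. */
static uint32_t uart_divisor(uint32_t f_cpu_hz, uint32_t baud) {
    return f_cpu_hz / (16UL * baud) - 1U;
}

int main(void) {
    printf("Divisor at  20 MHz, 9600 baud: %u\n", (unsigned)uart_divisor(20000000UL, 9600));  /* 129  */
    printf("Divisor at 500 MHz, 9600 baud: %u\n", (unsigned)uart_divisor(500000000UL, 9600)); /* 3254 */
    return 0;
}
```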

There was an attempt made way back (late 1990s) to produce very fast PIC-like controllers, with the idea that firmware could substitute for peripherals if the microcontroller was fast enough. For example, you could bit-bang a UART. I don't think they were all that commercially successful -- Scenix -> Ubicom -> Qualcomm (game over).
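
For what it's worth, here is a minimal sketch of what "bit-bang a UART" means in firmware (8N1 framing, LSB first; set_tx_pin() and delay_one_bit_time() are hypothetical stand-ins for whatever GPIO and timing primitives the target actually provides):

```c
/* Transmit one byte by driving a GPIO pin in software at the baud-rate interval.
 * set_tx_pin(level) drives the TX line; delay_one_bit_time() waits one bit
 * period (about 104 us at 9600 baud). Both are assumed, platform-specific helpers. */
void uart_tx_bitbang(unsigned char byte,
                     void (*set_tx_pin)(int level),
                     void (*delay_one_bit_time)(void))
{
    set_tx_pin(0);                        /* start bit */
    delay_one_bit_time();
    for (int i = 0; i < 8; i++) {         /* data bits, least significant first */
        set_tx_pin((byte >> i) & 1);
        delay_one_bit_time();
    }
    set_tx_pin(1);                        /* stop bit (line idles high) */
    delay_one_bit_time();
}
```

The faster the core, the tighter the timing you can hit this way, which was exactly the Scenix pitch.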

Peter Mortensen
Spehro Pefhany
  • 350 nm? That would explain it. Didn't know that anybody would manufacture anything using 20 year old technology. – Michael Apr 11 '16 at 18:57
  • 5
    Some of us are still designing in (not just using) 4000 series CMOS which is something like 3000nm. – Spehro Pefhany Apr 11 '16 at 19:02
  • 7
    Older processes are also potentially useful for folks dealing with radiation environments, or high-reliability systems that demand traceability. – Krunal Desai Apr 11 '16 at 19:30
  • 6
    Game not over -- the Parallax Propeller is a continuation of that concept. – Dave Tweed Apr 11 '16 at 19:31
  • @DaveTweed Or, for a more commercially successful example, take a look at the Cypress PSoC line. –  Apr 11 '16 at 20:20
  • @duskwuff: I'm not following you. AIUI, PSoC is not about replacing hardware with firmware, but rather having configurable hardware to replace fixed-function blocks. – Dave Tweed Apr 11 '16 at 20:22
  • @DaveTweed It's a weird architecture, somewhere in the middle between hardware and software. The configurable components of the PSoC can be modified at runtime, and contain both PLD elements and "datapath" elements which are essentially very small, special-purpose microcontrollers. –  Apr 11 '16 at 20:35
  • By my understanding, the Scenix parts managed a software cause/effect relationship that was much faster than any other controller I've heard of (testing one port pin and conditionally setting another in 20ns total). I've long wondered why Microchip never bothered to produce anything comparable. – supercat Apr 11 '16 at 22:01
  • 1
    @SpheroPefhany -- CD4k CMOS is going to be with us for the foreseeable future -- nothing else can serve its role in mixed linear/logic systems (where the logic is called upon to run at the 12/15V linear supply voltage). (Either that, or a more modern HVCMOS process such as ADI's iCMOS will be used to make a replacement.) – ThreePhaseEel Apr 11 '16 at 23:36
  • 4
    @Michael: It's not just the age of the technology. The size matters as well. Larger process sizes have lower defect rates, which means fewer rejects and thus higher yield - that leads to lower cost per chip. If you're willing to pay $100 for a CPU (like desktops) then the higher cost due to lower yield is justified. If you're only willing to pay 50 cents then it's not justified. – slebetman Apr 12 '16 at 05:45
  • To add to the idea of creating firmware-defined peripherals... Some of the Motorola/Freescale/NXP HCS12 processors (like the MC9S12XEP100 and others) have an XGATE co-processor to do software peripherals. Some of the PowerPC parts (like the MPC5674 and others) have eTPU coprocessors to implement software peripherals. – user4574 Jan 02 '21 at 04:06
  • @user4574 if memory serves, the concept originated with Motorola and their processors aimed at the massive automotive ECU market. They called it a TPU. Modern processors such as the ARM used in the BeagleBone have multiple 32-bit PRUs. – Spehro Pefhany Jan 02 '21 at 04:12
3

Imagine one wants to produce automobiles. One approach would be to use a bunch of pieces of equipment in the factory sequentially, building one car at a time. This approach can be done with a modest amount of moderately complicated equipment, since many pieces of equipment may be used to perform more than one step. On the other hand, much of the equipment in the factory would still be sitting idle much of the time.

Another approach is to set up an assembly line, so that as soon as the equipment that handled the first step of production has finished that operation on the first car, it can then proceed to start the corresponding operation on the next car. Trying to reuse one piece of equipment at multiple stages in the manufacturing process would be complicated, so in most cases it would be better to use more pieces of equipment that are each optimized to perform one very specific task (e.g. if it's necessary to drill 50 holes of 10 different sizes, then a minimal-equipment setup would include one drill with 10 bits and a quick-change mechanism, but an assembly line could have 50 drills each with one permanently-installed bit and no need for a quick-change).

For things like DSPs or GPUs, it's possible to achieve very high speeds relatively cheaply because the nature of the work to be performed is very consistent. Unfortunately, many CPUs need to be able to handle arbitrary mixtures of instructions of differing complexity. Doing that efficiently is possible, but it requires very complex scheduling logic. In many modern CPUs, the logic necessary to "do work" isn't overly complicated or expensive, but the logic necessary to coordinate everything else, is.

supercat
  • 2
    Sorry if I missed it, but what relevance does this have to CPUs vs 'slower' microcontrollers? It only seems to focus on CPUs vs (typically even faster) specialised processors. – underscore_d Apr 13 '16 at 21:10
  • 1
    @underscore_d: The first paragraph covers the simpler microcontrollers--they're like the small shop that builds one car at a time. The second paragraph notes that there are some cheap controllers that can perform lots of operations very quickly, but are limited in the kinds of operations they can do. What's hard is being able to perform an arbitrary mix of operations while overlapping them to a significant (but highly variable) degree. If one has a subsystem which on every cycle can accept two numbers and will output the product of two numbers that were submitted four cycles ago, and... – supercat Apr 13 '16 at 21:41
  • 1
    ...another that will on each cycle accept two numbers and output the sum of the two that were submitted two cycles ago, trying to figure out when values need to be submitted, when results will be available, when things should be loaded from and saved to registers, etc. can get very complicated, especially if one wants to avoid padding out all the pipelines to match the longest one. – supercat Apr 13 '16 at 21:44
  • Thanks; that clears it up. Yeah, it makes sense that fast general-purpose CPUs incur most of their costs, both financial and energy, on 'scaffolding' - pipelining, cache, scheduling, RAM control, etc. Things that are not only prohibitively costly but also often not required for micros. Equally, it never ceases to amaze me what can be done with a relatively tiny clock frequency in a processor specifically tailored for one application. Fascinating stuff on both sides! – underscore_d Apr 13 '16 at 21:58
  • @underscore_d: The MIPS architecture was designed on the premise that compilers would be responsible for some of the scheduling issues, thus allowing hardware to be simplified. The concept never really caught on, I think, because newer processors often require more pipeline stages than older ones, but code written for a processor with shorter pipelines won't work on a processor with longer ones in the absence of hardware interlocks. – supercat Apr 13 '16 at 22:04
  • As a user of Ubicom-neé-Scenix’s SX48, at 100MHz no less, the silicon was great but the tools were crap. To program these chips effectively you had to interleave the code for multiple peripherals - the core was fully deterministic timing-wise. I made it achieve some nifty things but had no time to work on tool development so I just did the minimum needed to get the product done and moved on. Such architectures require programming in a fusion of C and a logic description language to describe timing constraints. That’s possible but the market couldn’t support the development costs obviously. – Kuba hasn't forgotten Monica Jan 02 '21 at 03:49
  • I did data acquisition on it and FIR-filtered data from a 120ks/s 16-bit SPI ADC while communicating with a host over 1MBit/s async serial connection. Slow processing was done using a VM that ran a simple “scripting” byte code. Once the acquisition was going, there were few clock cycles left, barely, but it worked and was rock solid. The code was threaded by a custom tool I wrote that took the assembly sub-programs that had to run in parallel, together with timing constraints on certain I/O transitions, and produced the interleaved cycle-accurate final program that couldn’t be debugged. – Kuba hasn't forgotten Monica Jan 02 '21 at 03:57
  • Debugging was impossible because I had no flash left for the debug “blob” the tool had to upload to the chip to run the debug function. The code was fully unrolled, and I barely made it fit - there were maybe two dozen basic blocks (straight line code) in the whole thing. So while the silicon could certainly do it and with a better tool I could have probably factored out some duplicate code to save flash, there was the specter of running out of cycles. And each time I made one of the threads “fancier”, I had to update tool with more understanding of what the assembly meant. – Kuba hasn't forgotten Monica Jan 02 '21 at 04:02
  • So what was supposed to have been a rather uncontroversial microcontroller project turned into a code generation research project and that made me use that part in two products and dump it with prejudice as soon as I reimplemented the firmware for another micro. This thing had banked data space *and* banked code space, and not really enough of either. So a “simple” PIC concept doesn’t scale to faster clocks unless you have expensive and complex development tools and can convince customers that you’ll support them. So OP should be careful why they wish for: been there and it was not great. – Kuba hasn't forgotten Monica Jan 02 '21 at 04:06
  • @Kubahasn'tforgottenMonica: I find it curious that the line of 100MHz PIC-style chips wasn't extended to use a wider code space to eliminate the need for code and data banking, since from what I understand the Scenix parts could poll an I/O pin and switch an output based upon its value in under 30ns, which is something that even faster ARM chips can't do. BTW, on one project I did with the PIC 16C505, I used a form of dual threading that actually benefited from the banked data space, since I had two threads that used the same code but ran in different data banks. Still, if one were to... – supercat Jan 02 '21 at 04:34
  • ...simply expand the code width of the Scenix part to 14 bits, and make it so that when not using the FSR, the two bits used to select the RAM bank or GOTO target bank would simply be loaded from those extra two bits of the instruction, that should have made many things a lot more convenient without increasing the length of the logic pipelines. – supercat Jan 02 '21 at 04:35
  • Once you’d expand the SX core to do all that, you couldn’t run it so fast anymore on the old process. And also: 20ns pin toggle time is useless in practice, because when you do that you can do nothing else, and I dare say there are 0 applications that make it worthwhile to operate oddball silicon in tight loops that just flip pins. You could do it cheaper with an external GAL or even jellybean logic. Even if SX had no weird paging, without tooling the speed was unusable. Parallax was a user and they figured quickly enough that you needed multiple fast cores to make it usable. And thus Prop 1:) – Kuba hasn't forgotten Monica Jan 03 '21 at 22:33
  • And even on Prop 1 if you wanted a cog to do multiple simple things in parallel, you were left hand-threading assembly code together, so in practice most people ran just one “peripheral” per cog, and development of those virtual peripherals was a pain. On Prop 2, cogs are faster so interrupts can be used to switch threads, or you can stream multiple threads from hub memory and interleaved them in an executor VM, still running about as fast per thread as Prop 1’s cogs ran, all while coding “normally”, with no manual interleaving. Prop 2 could still use better tooling, but time will cure it. – Kuba hasn't forgotten Monica Jan 03 '21 at 22:37
  • @Kubahasn'tforgottenMonica: Is there some reason I'm missing that adding two more bits of code space fetched in parallel with the rest and using those instead of banking bits would increase the critical path time? I suppose if memory is partially decoded using banking bits, and if the data fetched from code space arrives late enough that the remainder of the decode is on the critical path, using banking bits instead of flash bits could be a tiny bit faster, but I would not expect those to be the critical path. The main situation where the I/O turnaround speed is useful... – supercat Jan 04 '21 at 03:15
  • ...is when communicating with synchronous interfaces that don't quite fit the way normal SPI peripherals work (e.g. because three wires (clock, tx, and rx) are used to handle handshaking and framing without requiring the use of additional pins for those purposes). Though the other reason I find it interesting that the I/O turnaround time on such a chip is faster than on higher-end processors. – supercat Jan 04 '21 at 03:19
1

Your wishes came true :) The Parallax Propeller 2 runs at 300 MHz and has 8 cores, 512 kB of shared RAM, and 2 kB of local core RAM plus another 2 kB shared by pairs of cores. Most instructions run in 2 clock cycles, so 150 MIPS per core, or 1.2 GIPS peak per chip. It has 64 very flexible smart pins – you can have 64 1 MSPS 15-bit ADCs if you can use them – and it is built using a modern TSMC process (35 nm, but don't quote me on that – I couldn't find the forum posts where that was discussed).

So, as for the question of “why don’t they do it”: but they do! And it’s an amazing part, if I may say so.

  • That thing is stupidly overpriced. You're better off buying a pi. – FourierFlux Jan 02 '21 at 04:11
  • @FourierFlux $10 in quantity to get 64 channels of ADC/DAC and a very fast microcontroller is overpriced? Not if you actually need its features. And it is hard real-time, so if you need that - forget about pi unless you want to learn to code in assembly for the VideoCore. And also pi has no general purpose ADC and DAC - only audio I/O. Oh, and complete documentation for the pi is only available under a non-disclosure agreement, and runs thousands of pages. In that case, pi is only cheaper if your time is free. Everything there’s to know about P2 fits in under 100 pages (dense ones, true). – Kuba hasn't forgotten Monica Jan 03 '21 at 22:19
  • The eval board is $150+; I couldn't find a cheaper way to get it on a premade board. – FourierFlux Jan 03 '21 at 22:41