@vicatcu's answer is pretty comprehensive. One additional thing to note is that the CPU may run into wait states (stalled CPU cycles) when accessing I/O, including program and data memory.
For example, we're using a TI F28335 DSP; some areas of RAM are 0-wait-state for program and data memory, so when you execute code in RAM, it runs at 1 cycle per instruction (except for those instructions that take more than 1 cycle). When you execute code from FLASH memory (built-in EEPROM, more or less), however, the wait states keep it from running at the full 150 MHz, and it is several times slower.
With respect to high-speed interrupt code, you must learn a number of things.
First, become very familiar with your compiler. If the compiler does a good job, it shouldn't be that much slower than hand-coded assembly for most things ("that much slower" meaning: a factor of 2 would be OK by me; a factor of 10 would be unacceptable). You need to learn how (and when) to use compiler optimization flags, and every once in a while you should look at the compiler's output to see how it does.
Some other things that you can have the compiler do to speed up code:
use inline functions (standard in C since C99 via the `inline` keyword, as well as in C++; older C compilers usually offer a vendor extension), both for small functions and for functions that are going to be executed only once or twice. The downside is that inline functions are hard to debug, especially if compiler optimization is turned on. But they save you unnecessary call/return sequences, especially if the "function" abstraction is for conceptual design purposes rather than code implementation.
Look at your compiler's manual to see if it has intrinsic functions: compiler-specific built-in functions that map directly to the processor's assembly instructions. Some processors have instructions that do useful things like min / max / bit-reverse, and an intrinsic lets you use them without dropping into assembly.
If you're doing numerical computation, make sure you're not calling math-library functions unnecessarily. We had one case where the code was something like `y = (y+1) % 4` for a counter that had a period of 4, expecting the compiler to implement the modulo 4 as a bitwise AND. Instead it called the math library. So we replaced it with `y = (y+1) & 3` to do what we wanted. (Note the two forms are only equivalent when `y` is unsigned or known to be non-negative.)
Get familiar with the bit-twiddling hacks page. I guarantee you'll use at least one of these often.
You also should be using your CPU's timer peripheral(s) to measure code execution time -- most of them have a timer/counter that can be set to run at the CPU clock frequency. Capture a copy of the counter at the beginning and end of your critical code, and you can see how long it takes. If you can't do that, another alternative is to lower an output pin at the beginning of your code, and raise it at the end, and look at this output on an oscilloscope to time the execution. There are tradeoffs to each approach: the internal timer/counter is more flexible (you can time several things) but harder to get the information out, whereas setting/clearing an output pin is immediately visible on a scope and you can capture statistics, but it's hard to distinguish multiple events.
Finally, there is a very important skill that comes with experience -- both general and with specific processor/compiler combinations: knowing when and when not to optimize. In general the answer is don't optimize. The Donald Knuth quote gets posted frequently on StackOverflow (usually just the last part):
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
But you're in a situation where you know you have to do some kind of optimization, so it's time to bite the bullet and optimize (or get a faster processor, or both). Do NOT write your whole ISR in assembly. That is almost a guaranteed disaster -- if you do it, within months or even weeks you'll forget parts of what you did and why, and the code is likely to be very brittle and difficult to change. There are likely to be portions of your code, however, that are good candidates for assembly.
Signs that parts of your code are well-suited for assembly-coding:
- functions that are well-contained, well-defined small routines unlikely to change
- functions that can utilize specific assembly instructions (min/max/right shift/etc)
- functions that get called many times (a multiplier: if you save 0.5 usec on each call, and it gets called 10 times, that saves you 5 usec, which is significant in your case)
Learn your compiler's function calling conventions (e.g. where it puts the arguments in registers, and which registers it saves/restores) so that you can write C-callable assembly routines.
In my current project, we have a pretty large codebase with critical code that has to run in a 10kHz interrupt (100usec -- sound familiar?), and there aren't that many functions written in assembly. The ones that are tend to be things like CRC calculation, software queues, and ADC gain/offset compensation.
Good luck!