I was doing a project recently with the mbed (LPC1768), using the DAC to output various waves. I read parts of the datasheet, and it talked about how it had DMA for a lot of the peripherals. This seemed like it would be useful, but on further reading, I found that the DMA used the same data bus as the CPU (which I guess is normal). Does this mean that the CPU can't interact with any of the memories while the DAC is getting data? Also, since the DAC didn't have a buffer (as far as I could tell) and therefore has to use DMA very often, what's the point of DMA? If the CPU can't do memory transactions, can it do anything?
-
I suggest you look at the features of your CPU and whether it can do anything other than access memory. I have heard of some CPUs that can do things like decisions or calculations; not sure if this is common at all... – PlasmaHH Jun 14 '16 at 15:25
-
Should the CPU spend its time transferring data to an I/O port, or delegate the task to a dedicated device? – StainlessSteelRat Jun 14 '16 at 15:36
-
Yes, the CPU can do other stuff, but in an embedded system it's probably spending a lot of its time interfacing with peripherals, especially I/O ports. Wouldn't it make more sense to have an extra data bus just for DMA? Or is that usually not necessary? The situation where you would want an extra bus is when you're trying to push the limits of your hardware, which I assume is when you would want to use DMA anyway? – BeB00 Jun 14 '16 at 15:44
-
Simple example: say you want to print a lot of information to a serial port. You can either sit and wait for each byte to be sent (slow), copy it to a buffer and then use interrupts on the CPU to send each byte when the port is ready (lots of context switching = slow), or copy it to a buffer and let the DMA controller time the data out while the CPU is busy doing other things (can be faster). – Tom Carpenter Jun 14 '16 at 16:01
-
Saw a cover of EDN once that showed a drawing of a man wearing an enormous, three foot long shoe, and a headline, "If it's a shoe, wear it." The point was, if a part does ten things you don't need, and one thing you do need, and the price, footprint, and power budget all fit, then you should just use it, and not waste your time looking for something with fewer features. – Solomon Slow Jun 15 '16 at 14:00
-
For example, if the CPU wants to read data from an HDD, instead of wasting time reading byte-by-byte, the CPU can just leave the work to the DMA and do more useful things while waiting for the data to be copied to RAM. – phuclv Jun 16 '16 at 09:06
7 Answers
The long and short of it is that DMA allows the CPU to effectively run at its native speed, while the peripherals effectively run at theirs. Most of the numbers in the example below are made up.
Let's compare two options to periodically collect data from an ADC:
- You can read the ADC in an interrupt handler (periodic or otherwise)
- You can create a buffer, and tell the DMA to transfer ADC readings to the buffer.
Let's transfer 1000 samples from the ADC to RAM.
Using option 1, for every sample:
- 12 cycles are spent entering the interrupt
- read the ADC
- store the value in RAM
- 12 cycles are spent exiting the interrupt
Let's say the body of this interrupt function is 76 instructions; with the entry and exit overhead, the whole routine is 100 instructions long, assuming single-cycle execution (best case). That means option 1 spends 100,000 cycles of CPU time on the 1000 samples.
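A minimal C sketch of what that per-sample ISR looks like. The register names and addresses are made up for illustration; the real ones come from the vendor header (e.g. LPC17xx.h):

    #include <stdint.h>

    /* Illustrative register definitions -- not the real LPC1768 ones. */
    #define ADC_RESULT  (*(volatile uint16_t *)0x40034004)
    #define ADC_IRQ_CLR (*(volatile uint32_t *)0x40034008)

    #define NUM_SAMPLES 1000
    volatile uint16_t samples[NUM_SAMPLES];
    volatile uint32_t sample_idx;

    /* The CPU enters this handler once per sample, paying the 12-cycle
       entry and 12-cycle exit overhead every single time. */
    void ADC_IRQHandler(void)
    {
        if (sample_idx < NUM_SAMPLES)
            samples[sample_idx++] = ADC_RESULT; /* read the ADC, store in RAM */
        ADC_IRQ_CLR = 1;                        /* acknowledge the interrupt */
    }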
Option 2: The DMA is configured to collect 1000 ADC samples. Let's assume the ADC has a hardware trigger from a timer counter.
- The ADC and DMA transfer 1000 samples into RAM
- DMA interrupts your CPU after 1000 samples
- 12 cycles are spent entering interrupt
- Code happens (let's say it tells the DMA to overwrite the RAM)
- 12 cycles are spent exiting interrupt
Pretending the whole interrupt (with entry and exit overhead) is 100 single-cycle instructions, with DMA you only spend 100 cycles of CPU time to collect the same 1000 samples.
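And a sketch of option 2. Again, the channel register names here are invented for illustration; the real GPDMA registers are documented in the LPC1768 user manual (UM10360):

    #include <stdint.h>

    /* Illustrative DMA channel registers -- not the real GPDMA definitions. */
    #define DMA_CH0_SRC   (*(volatile uint32_t *)0x50004100)
    #define DMA_CH0_DST   (*(volatile uint32_t *)0x50004104)
    #define DMA_CH0_COUNT (*(volatile uint32_t *)0x50004108)
    #define DMA_CH0_CTRL  (*(volatile uint32_t *)0x5000410C)
    #define DMA_ENABLE      (1u << 0)
    #define DMA_IRQ_ON_DONE (1u << 1)

    #define ADC_RESULT_ADDR 0x40034004u   /* illustrative peripheral address */
    #define NUM_SAMPLES 1000
    volatile uint16_t samples[NUM_SAMPLES];

    void start_adc_dma(void)
    {
        DMA_CH0_SRC   = ADC_RESULT_ADDR;   /* ADC data register as source */
        DMA_CH0_DST   = (uint32_t)samples; /* buffer in RAM as destination */
        DMA_CH0_COUNT = NUM_SAMPLES;       /* hardware moves all 1000 samples */
        DMA_CH0_CTRL  = DMA_ENABLE | DMA_IRQ_ON_DONE;
    }

    /* Fires once per 1000 samples instead of once per sample. */
    void DMA_IRQHandler(void)
    {
        start_adc_dma();   /* e.g. rearm the transfer (overwrite the RAM) */
    }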
Now, each time the DMA accesses the bus, yes, there might be contention between the CPU and the DMA. The CPU may even be forced to wait for the DMA to finish up. But waiting for the DMA to finish is much, much shorter than locking the CPU into servicing the ADC. If the CPU core clock is 2x the bus clock, then the CPU might waste a few core cycles waiting for the DMA to finish. This means that the effective execution time of the transfer is between 1000 cycles (assuming the CPU never waits) and 9000 cycles. Still WAY better than the 100,000 cycles.

-
Important to note that RAM is not the only place that the CPU can store data. In general the CPU loads the data from RAM into registers before working on it. – Aron Jun 15 '16 at 01:16
-
Many microcontrollers also have a multilayer bus, so concurrent operations are possible, e.g. ADC-to-RAM and flash-to-register at the same time. Also, many instructions take longer than one clock, so there is plenty of time for the DMA. – Jeroen3 Sep 06 '17 at 12:05
The LPC1768 datasheet I found has the following quotes (emphasis mine):
Eight channel General Purpose DMA controller (GPDMA) on the AHB multilayer matrix that can be used with SSP, I2S-bus, UART, Analog-to-Digital and Digital-to-Analog converter peripherals, timer match signals, and for memory-to-memory transfers.
Split APB bus allows high throughput with few stalls between the CPU and DMA
The block diagram on page 6 shows SRAM with multiple channels to the AHB matrix, and the following quote backs this up:
The LPC17xx contain a total of 64 kB on-chip static RAM memory. This includes the main 32 kB SRAM, accessible by the CPU and DMA controller on a higher-speed bus, and two additional 16 kB each SRAM blocks situated on a separate slave port on the AHB multilayer matrix. This architecture allows CPU and DMA accesses to be spread over three separate RAMs that can be accessed simultaneously
And this is reinforced by the following quote:
The GPDMA enables peripheral-to-memory, memory-to-peripheral, peripheral-to-peripheral, and memory-to-memory transactions.
Therefore you could stream data to your DAC from one of the separate SRAM blocks or from a different peripheral, whilst using the main SRAM for other functions.
This kind of peripheral-to-peripheral DMA is common in smaller parts where the memory interface is quite simple (compared to, say, a modern Intel processor).
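In practice that can be as simple as telling the linker to place the DMA buffer in one of the separate AHB SRAM banks. A sketch using a GCC section attribute; the section name ".ahb_sram" is an assumption and must match whatever your linker script actually calls that region:

    #include <stdint.h>

    /* Waveform buffer in a separate AHB SRAM bank, so the GPDMA can
       stream it to the DAC while the CPU works out of main SRAM.
       ".ahb_sram" is a made-up section name -- check your linker script. */
    static uint32_t waveform[256] __attribute__((section(".ahb_sram")));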

-
Ahh, thanks, I didn't realise that was possible; I'm kind of new to DMA. Does that imply that the CPU can access peripherals while the DAC is accessing the separate SRAM? – BeB00 Jun 14 '16 at 16:02
-
Yes - this is exactly what the AHB matrix is for. It allows different controllers (CPU, DMA, certain peripherals like Ethernet and USB) to access different things at the same time. This is why there are multiple 'ports' to the SRAM. – David Jun 14 '16 at 17:06
-
Yeah, the AHB matrix in these cheap little critters delivers insane memory bandwidth thanks to the parallel memory banks: you can have Ethernet, USB2 and everything run at max throughput and the CPU doesn't even notice... – bobflux Sep 06 '17 at 12:27
-
Also, Thumb code can pack two instructions into one 32-bit word, so the CPU may not need to access the bus that often when doing math or operations that mostly involve registers... On the other hand, I think the M3 and M4 can do several memory accesses per clock (instruction and data) thanks to having several buses. – bobflux Sep 06 '17 at 12:29
If on a given cycle the processor and a DMA controller would need to access the same bus, one or the other would have to wait. Many systems, however, contain multiple areas of memory with separate buses along with a bus "bridge" that will allow the CPU to access one memory while the DMA controller accesses another.
Further, many CPUs may not need to access a memory device on every cycle. If a CPU would normally only need to access memory on two out of three cycles, a low-priority DMA device may be able to exploit cycles when the memory bus would otherwise be idle.
Even in cases where every DMA cycle would cause the CPU to be stalled for a cycle, however, DMA may still be very helpful if data arrives at a rate which is slow enough that the CPU should be able to do other things between incoming data items, but fast enough that the per-item overhead needs to be minimized. If an SPI port were feeding data to a device at a rate of one byte every 16 CPU cycles, for example, interrupting the CPU for each transfer would likely cause it to spend almost all of its time entering and returning from the interrupt service routine and none doing any actual work. Using DMA, however, even if each DMA transfer caused the CPU to stall for two cycles, the overhead would be reduced to about 13% (two stall cycles out of every sixteen).
Finally, some CPUs allow DMA to be performed while the CPU is asleep. Using an interrupt-based transfer would require that the system wake up completely for each unit of data transferred. Using DMA, however, it may be possible for the sleep controller to feed the memory controller a couple of clocks every time a byte comes in but let everything else stay asleep, thus reducing power consumption.

-
The Cortex-M parts like the LPC1768 have a distinct memory path from flash to the instruction decoder, so in fact register-to-register operations may mean the CPU can execute multiple instructions between the times when it needs access to data memory. – Chris Stratton Jun 15 '16 at 04:53
From a programmer's perspective, DMA is an option for transferring data to and from the peripherals that support it. For the classic example of shifting a large buffer through a serial peripheral like SPI or UART, or collecting a number of samples from an ADC, you have three methods of moving that data (contrasted in the sketch after this list):
Polling method. This is where you wait on register flags to know when you can shift in/out the next byte. The problem is that you are holding up all execution of the CPU while waiting for the flag. Or, if you have to share CPU time in an operating system, your transfer will be drastically slowed down.
Interrupt method. This is where you write an interrupt service routine (ISR) that executes on every byte transfer, and you write the code in the ISR that manages the transfer. This is more CPU-efficient because the CPU services your ISR only when needed; it is free for other use at all other times. The ISR is also one of the faster options in terms of transfer speed.
DMA. You configure the DMA with source/destination pointers and the number of transfers, and off it goes. It will steal bus cycles and CPU time to accomplish the transfer, and the CPU is free to do other things in the meantime. You can configure a flag or interrupt to indicate when the transfer is done. It is usually a touch faster than an ISR and is usually your fastest transfer option.
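To make the contrast concrete, here is a rough sketch of the three methods for transmitting a buffer out of a UART. Every register name is hypothetical; the shape of the code is the point, not the names:

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative UART/DMA registers -- not from any real vendor header. */
    #define UART_TX_READY (*(volatile uint32_t *)0x40008000)
    #define UART_DATA     (*(volatile uint32_t *)0x40008004)
    #define DMA_SRC       (*(volatile uint32_t *)0x50004000)
    #define DMA_COUNT     (*(volatile uint32_t *)0x50004004)
    #define DMA_GO        (*(volatile uint32_t *)0x50004008)

    /* 1. Polling: the CPU spins on a flag for every single byte. */
    void send_polling(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            while (!UART_TX_READY) { }   /* busy-wait: no useful work */
            UART_DATA = buf[i];
        }
    }

    /* 2. Interrupt: one ISR entry/exit per byte. */
    static const uint8_t *isr_buf;
    static size_t isr_len, isr_pos;

    void send_interrupt(const uint8_t *buf, size_t len)
    {
        isr_buf = buf; isr_len = len; isr_pos = 0;
        UART_DATA = isr_buf[isr_pos++];  /* prime the first byte; ISR does the rest */
    }

    void UART_IRQHandler(void)
    {
        if (isr_pos < isr_len)
            UART_DATA = isr_buf[isr_pos++];
    }

    /* 3. DMA: point the controller at the buffer and walk away. */
    void send_dma(const uint8_t *buf, size_t len)
    {
        DMA_SRC   = (uint32_t)buf;   /* source pointer */
        DMA_COUNT = len;             /* number of transfers */
        DMA_GO    = 1;               /* hardware paces the bytes; CPU is free */
    }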
As a programmer, I prefer DMA because it is the easiest to code and essentially the fastest technique for making the transfer. Typically, you just need to configure a couple of registers for the source/destination pointers and the number of transfers, and off it goes. I spend far more hours working on ISR code than on DMA-accelerated code, because ISR code requires critical design skills and has to be coded, tested, verified, etc. The DMA code is far smaller, the code I have to write myself is relatively trivial, and I get maximum transfer speed into the bargain.
In my experience, lately with Atmel SAM3/4 processors, DMA runs a touch faster than an efficient ISR of my own crafting. I had an application that would read in a pile of bytes from SPI every 5 ms. A lot of floating-point math was occurring in background tasks, so I wanted the CPU to be as free as possible for those tasks. The initial implementation used an ISR, and I then moved to DMA to compare and to try to buy a little more CPU time between samples. The transfer speed improved, but only slightly; it was barely measurable on the o-scope.
That is because on the recent microcontrollers I've seen, ISR and DMA operate in almost the same fashion: they take CPU cycles as required, and the DMA performs essentially the same operations as I would have coded in an efficient ISR.
In rare cases, I've seen peripherals that have their own RAM area that is ONLY accessible by DMA. This was on Ethernet MACs and USB.

DMA is most likely used here so that the DAC can have some regular timing, generating a waveform by changing the analog output at some known interval.
Yes, if it is a shared bus then... you have to share.
The CPU is not always using the bus, so it is sometimes a good idea to share it with a DMA engine. And of course that means priorities get involved. Sometimes it is just whoever got there first (for example, a command FIFO in front of the resource that queues up requests in the order they arrive; yes, that would be not necessarily deterministic). In a case like this you may want the DMA to have priority over the CPU, so that time-sensitive things like DACs or ADCs get deterministic timing. It depends on how they chose to implement it.
Folks sometimes have the often-incorrect assumption that DMA is free. It isn't: it still consumes bus time. If it is shared with the CPU (which it eventually is, as it talks to a resource the CPU can talk to), then the CPU and/or the DMA is held off, so the CPU still has to wait some time. In some implementations (likely not your microcontroller) the CPU is completely held off until the DMA completes; the CPU is stopped for the duration. It just depends on the implementation. The free part is that the CPU doesn't have to be constantly interrupted, or polling, or holding its breath waiting for some event to feed data. It can take its time to create the next buffer to DMA over. It does have to watch for the DMA transfer to complete and deal with that, but instead of, say, every byte, it is now a block of multiple bytes.
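A common way to "take your time creating the next buffer" is a ping-pong (double-buffer) scheme. Here is a sketch, where fill_buffer(), start_dma() and dma_done() are hypothetical helpers standing in for whatever your part actually provides:

    #include <stdint.h>

    #define BUF_LEN 512
    static uint16_t ping[BUF_LEN], pong[BUF_LEN];
    static uint16_t *filling = ping;   /* CPU builds the next block here */
    static uint16_t *sending = pong;   /* DMA streams from here */

    /* Hypothetical helpers: generate data, program the controller,
       and read its completion flag. */
    extern void fill_buffer(uint16_t *buf);
    extern void start_dma(const uint16_t *buf, uint32_t len);
    extern int  dma_done(void);

    void pump_one_block(void)
    {
        fill_buffer(filling);      /* CPU works while the DMA streams the old block */
        while (!dma_done()) { }    /* wait once per block, not once per byte */
        uint16_t *t = filling;     /* swap the buffers and restart the DMA */
        filling = sending;
        sending = t;
        start_dma(sending, BUF_LEN);
    }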
There is no one universal answer; "it depends" on the specific design of the specific thing you are using. Even within one chip/board/system design there may be multiple DMA engines, and there is no reason to assume they all work the same way. For every instance you have to figure it out, and unfortunately they often don't document it, or don't document it well enough. So you may have to run some experiments if it is a concern.

-
Note that embedded has nothing to do with it. The point of DMA is to gain performance by doing work the CPU would otherwise need code for, and by taking advantage of normally unused bus cycles to do work there. And also, as in your question, feeding data at the right time, ideally without CPU overhead. These advantages are useful embedded or not. – old_timer Jun 15 '16 at 11:10
The answers so far talk about the “speed” at which the CPU can do work and how DMA benefits that. However, there is another consideration: power.
If the CPU wished to send out a packet of data on a slow link, it would need to be awake for most of that time if using polling or interrupts; with DMA, however, the main CPU may be able to stay in a sleep state while the transfer is being done.
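A sketch of that pattern on a Cortex-M part. __WFI() is the standard CMSIS wait-for-interrupt intrinsic; start_dma_transfer() is a hypothetical placeholder for programming the DMA channel:

    #include "LPC17xx.h"   /* device header; pulls in CMSIS, which provides __WFI() */

    extern void start_dma_transfer(void);   /* hypothetical: program the DMA channel */
    volatile int dma_complete;              /* set to 1 by the DMA completion ISR */

    void transfer_while_sleeping(void)
    {
        dma_complete = 0;
        start_dma_transfer();
        while (!dma_complete)
            __WFI();   /* core sleeps; the DMA keeps moving data; the completion
                          interrupt wakes the CPU when the whole block is done */
    }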

Some processors, like the STM32H7 series, have a lot of RAM options, including heaps of tightly-coupled RAM. Having separate RAM banks allows DMA to hammer one lot of RAM whilst the processor processes data in the tightly-coupled RAM, which doesn't require caching and doesn't get hammered by DMA. To move data around you can use the MDMA. I built an FMCW radar set using one of these. The ADCs get IQ data from two inputs into one SRAM. I then scale the data and perform the floating-point 256-bin complex FFT in DTCM RAM, then FIFO the result into a 2D array in AXI RAM using the MDMA.
I then take a second 64-bin FFT across the FIFO for the velocity vector. Then I take the magnitude of the complex data and send the resulting 128 and 64 floating-point values out to another H7, over SPI at 12.5 MHz, for the detection. I do all of this in 4 ms.
The sampling rate of the ADCs is 84 kHz, and using oversampling I'm getting about 18 bits of resolution.
Not bad for a general-purpose processor only running in the MHz range and with no external RAM.
The large caches this device has also help performance for calculations outside of the DTCM.
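For a feel of what the FFT stage of such a pipeline looks like with CMSIS-DSP (only a sketch: the ".dtcm" section name is an assumption that must match your linker script, and the buffers show just the 256-bin case):

    #include "arm_math.h"   /* CMSIS-DSP */

    /* Keep the working buffers in DTCM, out of reach of the DMA traffic
       hammering the other RAM banks. ".dtcm" is an assumed section name. */
    static float32_t iq[2 * 256] __attribute__((section(".dtcm"))); /* interleaved I/Q */
    static float32_t mag[256]    __attribute__((section(".dtcm")));

    void range_fft(void)
    {
        /* In-place 256-point complex FFT on the scaled I/Q samples. */
        arm_cfft_f32(&arm_cfft_sR_f32_len256, iq, 0, 1);
        /* Magnitude of the complex bins, used downstream for detection. */
        arm_cmplx_mag_f32(iq, mag, 256);
    }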
