The AVR XMega has 16 general purpose I/O (GPIO) registers in its I/O space in addition to its internal SRAM. Interestingly, the timing of LD instructions differs between these locations: accessing the (rather large) I/O space is faster than accessing the internal SRAM.
This can be observed, more or less, in the instruction set documentation: http://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf (from page 22), rather "less" as it is quite ambiguous.
Not being sure how the timing works, I ran tests on actual hardware, measuring the timing of various instructions.
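Roughly, the measurement looked like the sketch below (a minimal sketch, assuming an XMega timer TCC0 clocked at the CPU clock and the avr-libc register names; the real runs repeated the instruction under test many more times and subtracted the overhead of the timer reads):

    #include <avr/io.h>
    #include <stdint.h>

    static volatile uint8_t sram_byte;        /* target located in internal SRAM */

    /* Returns the raw tick count around a short block of LD instructions.
     * Interrupts are assumed to be disabled; subtract the count measured
     * around an empty block to get cycles per instruction. */
    uint16_t measure_ld_sram(void)
    {
        uint16_t start, end;

        TCC0.CTRLA = TC_CLKSEL_DIV1_gc;       /* 1 timer tick = 1 CPU cycle */

        start = TCC0.CNT;
        __asm__ __volatile__(
            "ld r24, %a0 \n\t"                /* instruction under test, repeated */
            "ld r24, %a0 \n\t"
            "ld r24, %a0 \n\t"
            "ld r24, %a0 \n\t"
            :
            : "e" (&sram_byte)
            : "r24"
        );
        end = TCC0.CNT;

        return end - start;
    }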
Accessing the SRAM:
                          Mega    XMega
    LD   reg, ptreg       2cy     2cy
    LD   reg, ptreg+      2cy     2cy
    LD   reg, -ptreg      2cy     3cy
    LDD  reg, ptreg+imm   2cy     3cy
    LDS  reg, imm16       2cy     3cy
    ST   ptreg, reg       2cy     1cy
    ST   ptreg+, reg      2cy     1cy
    ST   -ptreg, reg      2cy     2cy
    STD  ptreg+imm, reg   2cy     2cy
    STS  imm16, reg       2cy     2cy
Accessing the I/O area (tested with GPIO and UART registers), the loads are one cycle faster (a sketch of the pointer-based access I used follows the table):
                          Mega    XMega
    LD   reg, ptreg       2cy     1cy
    LD   reg, ptreg+      2cy     1cy
    LD   reg, -ptreg      2cy     2cy
    LDD  reg, ptreg+imm   2cy     2cy
    LDS  reg, imm16       2cy     2cy
    ST   ptreg, reg       2cy     1cy
    ST   ptreg+, reg      2cy     1cy
    ST   -ptreg, reg      2cy     2cy
    STD  ptreg+imm, reg   2cy     2cy
    STS  imm16, reg       2cy     2cy
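For reference, the I/O registers were reached through pointers, so the compiler has to use the LD/ST forms from the table rather than IN/OUT. A minimal sketch of that access pattern (function names are just for illustration; with a compile-time constant address avr-gcc may well pick IN/OUT or LDS/STS instead):

    #include <stdint.h>

    /* When the pointer value is not known at compile time, avr-gcc typically
     * emits ld/st through X, Y or Z, i.e. the forms timed above. */
    __attribute__((noinline))
    uint8_t read_via_pointer(volatile uint8_t *p)
    {
        return *p;                            /* e.g. ld r24, Z */
    }

    __attribute__((noinline))
    void write_via_pointer(volatile uint8_t *p, uint8_t v)
    {
        *p = v;                               /* e.g. st Z, r22 */
    }

    /* The XMega general purpose I/O registers sit at data addresses
     * 0x0000-0x000F; 0x0004 below is only an illustrative pick. */
    uint8_t read_one_gpio_reg(void)
    {
        return read_via_pointer((volatile uint8_t *)0x0004);
    }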
Could anyone provide some insight into why it was designed this way? The performance difference is rather significant for memory-intensive tasks, and the 16 GPIO registers are not much if someone needs some faster RAM, while the I/O space is huge in comparison and apparently wholly capable of operating at the faster timing. That is a design choice I can't really understand if the faster access required more costly hardware.
A particularly bad case is the 3-cycle displaced loads (LDD), which are very frequent; even AVR-GCC seems to lean towards using them rather than walking through memory areas with post-increment loads when optimizing.
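To make that last point concrete, here is the kind of access pattern I mean (plain C; the assembly in the comments is what I would expect avr-gcc to emit for the field accesses, not a verified listing):

    #include <stdint.h>

    typedef struct { uint8_t a, b, c, d; } item_t;

    /* Field accesses through a pointer usually become displaced loads, e.g.
     *     ld   r24, Z        ; p->a
     *     ldd  r25, Z+1      ; p->b
     *     ldd  r18, Z+2      ; p->c
     *     ldd  r19, Z+3      ; p->d
     * where each ldd costs 3 cycles on XMega when p points into SRAM (the
     * plain ld is 2), while walking the same bytes with 2-cycle
     * post-increment loads (ld rN, Z+) would be cheaper there. */
    uint8_t sum_item(const item_t *p)
    {
        return p->a + p->b + p->c + p->d;
    }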