The AVR XMega has 16 general purpose I/O (GPIO) registers in its I/O space in addition to its internal SRAM. Interestingly, the timing of LD instructions differs between these locations: accessing the (rather large) I/O space is faster than accessing the internal SRAM.
This can be observed, more or less, in the instruction set documentation: http://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf (from page 22), rather "less" as it is quite ambiguous.
Not being sure how the timing works, I ran tests on actual hardware, measuring the timing of various instructions.
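Roughly, the measurement looked like the sketch below (a minimal sketch, assuming an XMega timer TCC0 clocked at the CPU clock and the avr-libc register names; the real runs repeated the instruction under test many more times and subtracted the overhead of the timer reads):

    #include <avr/io.h>
    #include <stdint.h>

    static volatile uint8_t sram_byte;        /* target located in internal SRAM */

    /* Returns the raw tick count around a short block of LD instructions.
     * Interrupts are assumed to be disabled; subtract the count measured
     * around an empty block to get cycles per instruction. */
    uint16_t measure_ld_sram(void)
    {
        uint16_t start, end;

        TCC0.CTRLA = TC_CLKSEL_DIV1_gc;       /* 1 timer tick = 1 CPU cycle */

        start = TCC0.CNT;
        __asm__ __volatile__(
            "ld r24, %a0 \n\t"                /* instruction under test, repeated */
            "ld r24, %a0 \n\t"
            "ld r24, %a0 \n\t"
            "ld r24, %a0 \n\t"
            :
            : "e" (&sram_byte)
            : "r24"
        );
        end = TCC0.CNT;

        return end - start;
    }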
Accessing the SRAM:
                          Mega    XMega
    LD   reg, ptreg       2cy     2cy
    LD   reg, ptreg+      2cy     2cy
    LD   reg, -ptreg      2cy     3cy
    LDD  reg, ptreg+imm   2cy     3cy
    LDS  reg, imm16       2cy     3cy
    ST   ptreg, reg       2cy     1cy
    ST   ptreg+, reg      2cy     1cy
    ST   -ptreg, reg      2cy     2cy
    STD  ptreg+imm, reg   2cy     2cy
    STS  imm16, reg       2cy     2cy
Accessing the I/O area (tested with GPIO and UART registers), the loads are one cycle faster (a sketch of the pointer-based access I used follows the table):
                          Mega    XMega
    LD   reg, ptreg       2cy     1cy
    LD   reg, ptreg+      2cy     1cy
    LD   reg, -ptreg      2cy     2cy
    LDD  reg, ptreg+imm   2cy     2cy
    LDS  reg, imm16       2cy     2cy
    ST   ptreg, reg       2cy     1cy
    ST   ptreg+, reg      2cy     1cy
    ST   -ptreg, reg      2cy     2cy
    STD  ptreg+imm, reg   2cy     2cy
    STS  imm16, reg       2cy     2cy
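For reference, the I/O registers were reached through pointers, so the compiler has to use the LD/ST forms from the table rather than IN/OUT. A minimal sketch of that access pattern (function names are just for illustration; with a compile-time constant address avr-gcc may well pick IN/OUT or LDS/STS instead):

    #include <stdint.h>

    /* When the pointer value is not known at compile time, avr-gcc typically
     * emits ld/st through X, Y or Z, i.e. the forms timed above. */
    __attribute__((noinline))
    uint8_t read_via_pointer(volatile uint8_t *p)
    {
        return *p;                            /* e.g. ld r24, Z */
    }

    __attribute__((noinline))
    void write_via_pointer(volatile uint8_t *p, uint8_t v)
    {
        *p = v;                               /* e.g. st Z, r22 */
    }

    /* The XMega general purpose I/O registers sit at data addresses
     * 0x0000-0x000F; 0x0004 below is only an illustrative pick. */
    uint8_t read_one_gpio_reg(void)
    {
        return read_via_pointer((volatile uint8_t *)0x0004);
    }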
Could anyone provide some insight into why it was designed this way? The performance difference is rather significant for memory-intensive tasks, and the 16 GPIO registers are not much if someone needs some faster RAM, while the I/O space is huge in comparison and apparently wholly capable of operating at the faster timing. That is a design choice I can't really understand if the faster access required more costly hardware.
A particularly bad case is the 3-cycle displaced loads (LDD), which are very frequent; even AVR-GCC seems to lean towards using them rather than walking through memory areas with post-increment loads when optimizing.
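To make that last point concrete, here is the kind of access pattern I mean (plain C; the assembly in the comments is what I would expect avr-gcc to emit for the field accesses, not a verified listing):

    #include <stdint.h>

    typedef struct { uint8_t a, b, c, d; } item_t;

    /* Field accesses through a pointer usually become displaced loads, e.g.
     *     ld   r24, Z        ; p->a
     *     ldd  r25, Z+1      ; p->b
     *     ldd  r18, Z+2      ; p->c
     *     ldd  r19, Z+3      ; p->d
     * where each ldd costs 3 cycles on XMega when p points into SRAM (the
     * plain ld is 2), while walking the same bytes with 2-cycle
     * post-increment loads (ld rN, Z+) would be cheaper there. */
    uint8_t sum_item(const item_t *p)
    {
        return p->a + p->b + p->c + p->d;
    }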