12-1-2025

64 bit ARM architecture -- What happened to the LDM instruction?

If you spent any time working with 32 bit ARM (aarch32) you probably ran into the LDM and STM instructions. These could load or store any or all registers in a single instruction, selected by a bit mask.
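
To make the mask idea concrete, here is a small Python sketch. It assumes the usual convention that bit n of the register list selects register rn, and the helper names are made up for illustration. They are not part of any ARM tool.

```python
# Hypothetical helpers for illustration; not part of any ARM toolchain.
# Assumption: bit n of the register list selects register rn (r0..r15).

def regs_to_mask(regs):
    """Build a 16-bit register list from register numbers 0..15."""
    mask = 0
    for r in regs:
        if not 0 <= r <= 15:
            raise ValueError(f"no such register: r{r}")
        mask |= 1 << r
    return mask

def mask_to_regs(mask):
    """List the register numbers selected by a register list."""
    return [r for r in range(16) if mask & (1 << r)]

# A list like {r3, r4, r5, r6, r7, r8, r9, sl}  (sl is another name for r10)
stm_mask = regs_to_mask(range(3, 11))
# A list like {r0, r1, r2, r3}
ldm_mask = regs_to_mask(range(0, 4))

print(hex(stm_mask), mask_to_regs(stm_mask))
print(hex(ldm_mask), mask_to_regs(ldm_mask))
```

One instruction with 16 mask bits can therefore describe anything from a single register to all of them, which is exactly what makes it awkward for a simple pipeline.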

They were used all the time to enter and leave subroutines, as well as to save and restore registers during exception handling. Typical uses looked like this:

stmia   r1!, {r3, r4, r5, r6, r7, r8, r9, sl}
ldm r7!, {r0, r1, r2, r3}

These instructions are gone in aarch64. What happened?

Well, the ARM designers came to their senses, that is what happened. The real question is how a CISC-like instruction such as ldm/stm ever found its way into the ARM architecture in the first place.

What we have now is ldp/stp, which loads (or stores) a pair of registers. This may not seem as handy, but it plays much better with RISC chip design, pipelines, and all of that. Consider that many ARM chips these days have multiple execution units and can perform 3 or 4 operations in parallel. It is quite possible for one of these instructions to run in a single cycle, storing (or loading) both registers at the same time. Remember that these are 64 bit registers, so that takes a 128 bit wide data path to memory, and this is quite likely to be available with modern cache designs.
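
For anyone unsure what the two writeback forms of ldp/stp do to the stack pointer, here is a toy Python model. It is only a sketch: memory is a plain dict, the function names are invented for illustration, and nothing about timing or alignment checking is modeled.

```python
# Toy model for illustration only. Memory is a dict keyed by byte address,
# registers are a dict, and each X register occupies one 8-byte slot.

def stp_pre(regs, mem, rt, rt2, offset):
    """Model stp rt, rt2, [sp, #offset]! : adjust sp first, then store the pair."""
    regs["sp"] += offset
    mem[regs["sp"]] = regs[rt]
    mem[regs["sp"] + 8] = regs[rt2]

def ldp_post(regs, mem, rt, rt2, offset):
    """Model ldp rt, rt2, [sp], #offset : load the pair, then adjust sp."""
    regs[rt] = mem[regs["sp"]]
    regs[rt2] = mem[regs["sp"] + 8]
    regs["sp"] += offset

regs = {"sp": 0x1000, "x0": 11, "x1": 22, "x2": 33, "x3": 44}
mem = {}

# Push two pairs, 16 bytes each.
stp_pre(regs, mem, "x0", "x1", -16)
stp_pre(regs, mem, "x2", "x3", -16)
sp_after_push = regs["sp"]

# Pretend the code in between clobbered everything.
for name in ("x0", "x1", "x2", "x3"):
    regs[name] = 0

# Pop in the reverse order of the pushes.
ldp_post(regs, mem, "x2", "x3", 16)
ldp_post(regs, mem, "x0", "x1", 16)
print(regs)
```

The pre-indexed store and the post-indexed load are mirror images, so popping the pairs in reverse order puts every register and the stack pointer back where they started.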

Here is an example of these instructions being used in an exception handler. Registers x0 to x18 and x30 get saved. It is not necessary to save x19 to x29. I hear the screams already. You need to save ALL of the registers! Well you do, but it is not the responsibility of this code to do so. The ARM procedure call standard for aarch64 (AAPCS64) says that x19 to x29 are "callee saved" registers. So, it is the responsibility of the subroutine being called to save any of those registers it intends to use. Often a subroutine uses none of them and nothing needs to be done, which is certainly an optimization over blindly saving them all.

ffff0a80:   a9bf07e0    stp x0, x1, [sp, #-16]!
ffff0a84:   a9bf0fe2    stp x2, x3, [sp, #-16]!
ffff0a88:   a9bf17e4    stp x4, x5, [sp, #-16]!
ffff0a8c:   a9bf1fe6    stp x6, x7, [sp, #-16]!
ffff0a90:   a9bf27e8    stp x8, x9, [sp, #-16]!
ffff0a94:   a9bf2fea    stp x10, x11, [sp, #-16]!
ffff0a98:   a9bf37ec    stp x12, x13, [sp, #-16]!
ffff0a9c:   a9bf3fee    stp x14, x15, [sp, #-16]!
ffff0aa0:   a9bf47f0    stp x16, x17, [sp, #-16]!
ffff0aa4:   a9bf7bf2    stp x18, x30, [sp, #-16]!
ffff0aa8:   94000a63    bl  0xffff3434
ffff0aac:   a8c17bf2    ldp x18, x30, [sp], #16
ffff0ab0:   a8c147f0    ldp x16, x17, [sp], #16
ffff0ab4:   a8c13fee    ldp x14, x15, [sp], #16
ffff0ab8:   a8c137ec    ldp x12, x13, [sp], #16
ffff0abc:   a8c12fea    ldp x10, x11, [sp], #16
ffff0ac0:   a8c127e8    ldp x8, x9, [sp], #16
ffff0ac4:   a8c11fe6    ldp x6, x7, [sp], #16
ffff0ac8:   a8c117e4    ldp x4, x5, [sp], #16
ffff0acc:   a8c10fe2    ldp x2, x3, [sp], #16
ffff0ad0:   a8c107e0    ldp x0, x1, [sp], #16
ffff0ad4:   d69f03e0    eret
I told you aarch64 is quite different from aarch32, and this is a great example.
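
If you are curious how the hex words in a listing like this map onto the instruction text, the following Python sketch picks apart the fields of a 64 bit ldp/stp. The field positions are my reading of the A64 encoding and should be checked against the ARM architecture manual. The function name is made up, and only the pre-index and post-index forms are handled.

```python
# Sketch only: decodes 64-bit integer LDP/STP with pre- or post-index writeback.
# Assumed field layout: opc[31:30] 101 V[26] mode[25:23] L[22]
#                       imm7[21:15] Rt2[14:10] Rn[9:5] Rt[4:0]

def decode_pair(word):
    """Return the fields of an ldp/stp instruction word as a dict."""
    # Fixed bits: opc=10 (64-bit), 101, V=0 (integer registers).
    if (word >> 26) & 0x3F != 0b101010:
        raise ValueError("not a 64-bit integer load/store pair")
    mode = {0b001: "post", 0b011: "pre"}.get((word >> 23) & 0x7)
    if mode is None:
        raise ValueError("addressing mode not handled in this sketch")
    imm7 = (word >> 15) & 0x7F
    if imm7 & 0x40:              # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return {
        "op": "ldp" if (word >> 22) & 1 else "stp",
        "mode": mode,
        "rt": word & 0x1F,
        "rt2": (word >> 10) & 0x1F,
        "rn": (word >> 5) & 0x1F,    # 31 means sp in this instruction
        "offset": imm7 * 8,          # immediate is scaled by the register size
    }

first_save = decode_pair(0xA9BF07E0)     # stp x0, x1, [sp, #-16]!
last_save = decode_pair(0xA9BF7BF2)      # stp x18, x30, [sp, #-16]!
first_restore = decode_pair(0xA8C17BF2)  # ldp x18, x30, [sp], #16
print(first_save)
print(last_save)
print(first_restore)
```

Notice how regular this is compared to a 16 bit register mask: two register fields, a base register, and a small scaled offset.
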
Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org