May 12, 2017 (notes on AArch64 added 12-31-2021)

ARM assembly language

There are myriad ARM assembly language guides and tutorials online. This is my own condensed "executive guide" to ARM assembly language. It is by no means appropriate for the first-time assembly language programmer. I skim over details that are familiar to me and focus on those things that are unique or important for the ARM. Although I do write a little ARM assembly from time to time, my main interest is in understanding disassembled code that I am reverse engineering.

We can all be glad that the current popularity of ARM has saved us from the wretched and vile mess that x86 assembly is. It is a tribute to the work of clever engineers at Intel that they are able to coax the amazing performance they do out of the x86 while maintaining binary compatibility with such a miserable architecture. Enough said on that topic. ARM by contrast is quite civilized.

I will say little if anything about machine language. If you are writing an assembler or disassembler, you will need to care about how instructions are encoded, but I find little need to be concerned about it, although sometimes it forces itself on us. It is worth knowing that ARM instructions have a constant 32 bit size (but see my notes on "thumb mode" below).

ARM 64

This essay is entirely about the original 32 bit ARM. Arm64 is quite different and will require its own summary. The obvious and expected thing is that registers are 64 bit (but you can get at them in a 32 bit fashion as well). You also get more of them (31 general purpose registers, x0 through x30, rather than 16). Register number 31 is special: depending on the instruction, it names either the stack pointer or a zero register. So the stack pointer is now fixed by the hardware (rather than just being r13 by convention). In ARM 64 (properly called AArch64) the PC is its own special thing and not one of the registers. In fact, the PC is not accessible in a general way in AArch64. There is more of course, but this is not the place for it. I simply include this discussion to point out how different it is.

Registers

You get 16 registers (r0 through r15). These are all 32 bit registers. The first 13 of these (r0 to r12) are entirely general purpose. The last three (and maybe five if you use gcc) are dedicated to special functions.
R11 - fp (frame pointer if you use gcc)
R13 - sp (the stack pointer, entirely by convention)
R14 - lr (the link register, holds the return address for subroutine calls)
R15 - pc (the program counter)
Although r13 is the stack pointer only by convention, you would be making severe trouble for yourself if you did otherwise. If you are using gcc, r11 serves as the frame pointer.

Why do we skip r12 you should be asking. The hardware gives it no special role. The ARM procedure call standard (AAPCS) names it "ip" and allows linker generated veneers to clobber it on the way into a subroutine, so it is best treated as a scratch register that does not survive a call.

The Gnu tools use aliases for the R10 and R12 registers. The name "sl" comes from an older ABI that did stack limit checking, while "ip" is still defined by the AAPCS. In most code both behave as ordinary general registers, so it is somewhat confusing that the Gnu tools print these names rather than r10 and r12.

R10 - sl (the stack limit)
R12 - ip (intra procedure scratch register)

Subroutine calls

Unlike most processors, the ARM does not get the stack involved when it makes subroutine calls. The usual way to call a subroutine is:
bl	mysub
The call places the return address in the "lr" register, which can simply be copied into "pc" to return. However, the usual way to return is as follows:
bx	lr
Since to return you just copy lr to pc, you can achieve the proper effect in any number of ways. You are probably saying, "why not just return by using mov pc,lr?" and indeed you could. The "bx" instruction does something extra to allow switching to and from thumb mode (see below). If you are not using thumb mode (and I have yet to use it), this does not matter, but doing a return using "bx lr" is as good as anything and a fine convention to follow. So if you were wondering why we use this special "bx" instruction to do something that doesn't seem all that special, now you know.

This business of not using the stack makes for tidy and efficient "leaf" routines, but if your subroutine intends to call other subroutines, you will need to shove lr onto the stack, otherwise it will get overwritten by the next call. This leads to the following idiom for coding a subroutine. Let's suppose we also want to save the r4 register:

push    {r4, lr}
...
bl      next_sub
...
pop     {r4, pc}
Popping the saved lr value directly into pc performs the return. This idiom can be expanded to save and restore any number of registers.
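
Why lr has to be saved in a non-leaf routine can be sketched with a toy model. This is Python rather than ARM code, and the Cpu class, the function names, and the addresses are all made up for illustration:

```python
# Toy model of ARM call/return. Everything here is hypothetical;
# the addresses are arbitrary numbers chosen for the example.
class Cpu:
    def __init__(self):
        self.lr = 0        # link register
        self.stack = []    # stands in for the memory below sp

    def bl(self, return_address):
        # "bl" only writes lr, it never touches the stack
        self.lr = return_address


def nonleaf_without_save(cpu):
    # Entered with lr holding the caller's return address.
    cpu.bl(0x2008)              # nested call overwrites lr
    return cpu.lr               # "bx lr" would now go to the wrong place


def nonleaf_with_save(cpu):
    cpu.stack.append(cpu.lr)    # push {lr}
    cpu.bl(0x2008)              # nested call overwrites lr
    return cpu.stack.pop()      # pop {pc}


cpu = Cpu()
cpu.bl(0x1004)                  # caller does "bl" with return address 0x1004
bad = nonleaf_without_save(cpu)
cpu.bl(0x1004)
good = nonleaf_with_save(cpu)
print(hex(bad), hex(good))
```

The first routine "returns" to the address left behind by its own nested call, while the second one gets back to its caller.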

As near as I can tell, you never need to save r0, r1, r2, or r3 to play nice with gcc. Also r0 serves to return function values. The first four subroutine arguments get passed in r0 through r3, and any beyond that go on the stack. This is all about C compiler conventions, so if you are writing pure assembly language you can do anything you can keep straight with yourself. Good luck, and "Vaya con Dios".

Thumb mode

I intend to ignore this as much as possible in this write-up, but if you want to know a little about what we are ignoring, you can read this section.

The ARM has an alternate encoding where each instruction occupies 16 bits that is known as "thumb mode". This allows compact code and possibly more efficient code if the memory bus is a design bottleneck. Thumb mode has a fair number of instruction differences and even instructions that have no direct counterpart in regular ARM mode.

You enter thumb mode using the "bx" instruction with the target address in some register. If the address being branched to is odd (the low bit is set) execution switches to thumb mode. Another way is to set the "T" bit in the SPSR and then let an exception return instruction copy the SPSR into the CPSR. You cannot just write the T bit in the CPSR directly.

To exit thumb mode, execute a "bx" instruction with an even target address.
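
What "bx" does with the low bit of the target can be modeled in a few lines. This is a Python sketch, not ARM code, and the function name and return convention are my own invention:

```python
def bx(target):
    """Model of "bx": return (new_pc, thumb_mode) for a target address."""
    thumb = bool(target & 1)        # the low bit selects the instruction set
    new_pc = target & 0xFFFFFFFE    # the low bit is not part of the address
    return new_pc, thumb


print(bx(0x8001))   # odd target: execution continues at 0x8000 in thumb mode
print(bx(0x8000))   # even target: execution continues at 0x8000 in ARM mode
```

This ignores the fact that an ARM mode target should also be word aligned.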

The stack

The stack grows to smaller addresses. The stack pointer points to the item currently on top of the stack. This means that when you initialize the stack pointer, you set it to point to the address after the block of memory you have allocated for the stack.

The instruction PUSH {r3} writes the contents of r3 to the address sp-4, then subtracts 4 from the value in sp.

The instruction POP {r3} reads the contents of the address pointed to by sp into r3, then adds 4 to the value in sp.

push and pop are really just shorthand for the more general and powerful multiple register instructions (push is STMDB with sp as the base, and pop is LDMIA, which can also be written LDM). Ignoring that though, they nicely handle any number of registers in a single instruction.

And no, you cannot just write "push r3". The assembler demands the curly braces.
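
The push and pop behavior described in this section can be sketched as a small Python model. The Stack class and the use of a dictionary for memory are assumptions made for the sake of the example. Registers are identified by number, and the lowest numbered register always lands at the lowest address:

```python
class Stack:
    """Hypothetical model of a full descending ARM stack."""

    def __init__(self, top):
        self.sp = top     # address just past the block set aside for the stack
        self.mem = {}     # memory as a dictionary: address -> 32 bit word

    def push(self, regs, numbers):
        # Decrement before each store, highest register number first,
        # so the lowest numbered register ends up at the lowest address.
        for n in sorted(numbers, reverse=True):
            self.sp -= 4
            self.mem[self.sp] = regs[n]

    def pop(self, regs, numbers):
        # Load, then increment, lowest register number first.
        for n in sorted(numbers):
            regs[n] = self.mem[self.sp]
            self.sp += 4


regs = {3: 0xAA, 4: 0xBB}
s = Stack(0x1000)
s.push(regs, [3, 4])              # push {r3, r4}
sp_after_push = s.sp
regs[3] = regs[4] = 0             # clobber the registers
s.pop(regs, [3, 4])               # pop {r3, r4}
print(hex(sp_after_push), hex(s.sp), regs)
```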

Getting constant values into registers

The name of the game is the "mov" instruction, so you do things like this:
mov	r0,#0
mov	r3,#0x40
This is all well and good, but things get more complicated with larger values. Also, there is no clever trick to clear a register; loading a zero immediate works as well as anything. This is a RISC processor after all and that instruction will run in a single clock like any other.

The story with larger values is simply that there is only so much room in a 32 bit instruction set aside for the immediate value, namely 12 bits. But it isn't even that simple - the 12 bits are divided into a 4 bit "rotation" and an 8 bit value, and the value gets rotated right by twice the rotation count. So values from 0-255 are no big deal. Beyond that, the assembler does the dirty work, so you can write things like this:

mov	r1, #0x00ab0000
The assembler stores the 8 bit value 0xab along with a rotation value, so it looks like magic stuffing 32 bit constants into 12 bits. There is even more to this, which you can read about elsewhere. If you need to load some general 32 bit constant that is not accommodated by this compression scheme, you have two choices. You can use two instructions and load it one half at a time, or you can stick it someplace in memory and then fetch it. The first scheme works like this:
movw    sp, #:lower16:my_stack
movt    sp, #:upper16:my_stack
Note that you must do the "movw" first, as it zeroes the upper half of the word, while "movt" leaves the lower half alone. The second scheme uses a nice assembler construct like this:
ldr     r2, =0x01F00220
Just for the record, this typically assembles into something like the following. The assembler finds some spot nearby (typically after a subroutine) to dump the constant, then loads it via a PC relative address, like this:
ldr     r2, [pc, #140]
There is a fair chance this will be in the cache and the instruction will nicely execute in one cycle.
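
The two instruction scheme can be modeled in Python to show why the order matters. The function names below just mirror the instruction names and are otherwise my own. The point is that movw writes the low half and clears the high half, while movt writes the high half and keeps the low half:

```python
def movw(reg, imm16):
    # Writes the low 16 bits and zeroes the high 16 bits.
    # The old register contents (reg) are discarded entirely.
    return imm16 & 0xFFFF


def movt(reg, imm16):
    # Writes the high 16 bits and leaves the low 16 bits alone.
    return ((imm16 & 0xFFFF) << 16) | (reg & 0xFFFF)


def load32(value):
    lower16 = value & 0xFFFF            # what #:lower16: selects
    upper16 = (value >> 16) & 0xFFFF    # what #:upper16: selects
    r = 0xDEADBEEF                      # stale register contents
    r = movw(r, lower16)
    r = movt(r, upper16)
    return r


right_order = load32(0x01F00220)
wrong_order = movw(movt(0xDEADBEEF, 0x01F0), 0x0220)
print(hex(right_order), hex(wrong_order))
```

Doing the movt first loses the upper half, because the movw that follows wipes it out.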

The MVN instruction

After reading the above, you may be amazed to find that the assembler will happily swallow this statement:
mov     r2, #0xFFFFFFFF
32 bits and all ones, how does this fit into an 8 bit immediate field? The answer is that the assembler is being clever and mapping this to the "mvn" instruction. The assembler will do many things of this sort, so one approach to big immediate values is to just write naive code and wait for the assembler to blow the whistle, then convert as needed to one of the other forms mentioned above.

The "mvn" instruction flips all the bits of the operand, then loads it into a register, so the statement above gets assembled as "mvn r2, #0".
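
The decision the assembler makes for a "mov" with an immediate can be sketched as follows. This is a Python model under the assumption that an immediate is an 8 bit value rotated right by twice a 4 bit count. The function names are made up, and a real assembler may well choose among equivalent encodings differently:

```python
def ror32(value, amount):
    """Rotate a 32 bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_imm(value):
    """Return (imm8, rot4) if value fits an ARM immediate, else None."""
    value &= 0xFFFFFFFF
    for rot4 in range(16):
        # The hardware computes ror32(imm8, 2 * rot4), so undo that
        # rotation and see whether what is left fits in 8 bits.
        imm8 = ror32(value, 32 - 2 * rot4)
        if imm8 <= 0xFF:
            return imm8, rot4
    return None


def pick_instruction(value):
    """Try "mov" first, then "mvn" with the inverted value."""
    enc = encode_imm(value)
    if enc is not None:
        return ("mov",) + enc
    enc = encode_imm(~value)
    if enc is not None:
        return ("mvn",) + enc
    return None    # needs movw/movt or a constant fetched from memory


for v in (0x40, 0x00AB0000, 0xFFFFFFFF, 0x01F00220):
    print(hex(v), pick_instruction(v))
```

A result of None is the case where the assembler blows the whistle and one of the other forms is needed.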

Load and Store

You probably heard somewhere that RISC means that all access to memory is via load and store instructions. That is exactly right and is precisely the name of the game with the ARM. You get your address into one register and then fetch or store like this:
ldr     r0, [r4]
str     r1, [r0]
But you can add offsets, and do other things like load and store bytes and halfwords:
ldrh    r4, [r0, #136]
ldrb    r0, [r5, #24]
strb    r1, [r0, #24]
strh    r0, [r1, #130]
And you can do this. This is post-indexed addressing: the low byte of r4 is stored at the address in r3, and then r3 is incremented by 1. No shift is involved. It is just the thing for walking a pointer through a buffer.
strb    r4, [r3], #1
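
For what it is worth, the last form is post-indexed addressing, and the difference from a plain offset can be shown with a small Python model. The function names and the dictionaries standing in for memory and registers are assumptions for the example:

```python
def strb_offset(mem, regs, rt, rn, offset):
    # strb rt, [rn, #offset] : store at rn + offset, rn is left unchanged
    mem[regs[rn] + offset] = regs[rt] & 0xFF


def strb_post(mem, regs, rt, rn, offset):
    # strb rt, [rn], #offset : store at rn, then add offset to rn
    mem[regs[rn]] = regs[rt] & 0xFF
    regs[rn] += offset


mem = {}
regs = {"r3": 0x100, "r4": 0x141}
strb_post(mem, regs, "r4", "r3", 1)       # strb r4, [r3], #1
strb_offset(mem, regs, "r4", "r3", 24)    # strb r4, [r3, #24]
print(mem, regs)
```

Only the low byte of r4 gets stored, and only the post-indexed form changes r3.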

Right to left or left to right

If you have dealt with a variety of assembly languages, you know that you should take nothing for granted. In general on the ARM, the first register mentioned is the destination or target that receives the result. The notable exception (and it makes sense) is the "str" instruction in all of its different flavors. So the flow in each assembler statement is in general from right to left. So the following statement computes r5+r6 and places the result in r4.
add     r4, r5, r6

And, Or, and Bic

eor     r0, r0, r8
orr     r5, r5, #1
and     r6, r0, r1
bic     r6, r0, r1
The first three instructions in the above should be obvious enough, and note that we slipped in EOR (the exclusive or) just to see if you were paying attention. However, you should be asking, "What the heck is "bic" and what is it doing with these familiar logical operations?"

BIC stands for "bit clear" and it is simply an "and" with the complement of the second operand (which clears every bit that is set in the mask). Note that the second operand is a "mask" not a bit number and multiple bits can be cleared, as in the following instruction:

bic     r0, r0, #0x1f
Handy enough, but you may be saying, "why not just use "and" and let the assembler invert the mask?". Well, you can code that way if you want, but sometimes it is handy to have a single mask in a register and just use different instructions to set or clear a bit. And just in case you are asking where the "BIS" instruction is, there ain't one. You just use "ORR" to set bits.
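
The idea of keeping one mask and using it to both set and clear bits looks like this in a Python model. The function names mirror the instructions and everything is held to 32 bits:

```python
MASK32 = 0xFFFFFFFF


def bic(a, b):
    # bic rd, a, b : a AND (NOT b)
    return a & ~b & MASK32


def orr(a, b):
    # orr rd, a, b : a OR b
    return (a | b) & MASK32


mask = 0x1F
value = 0xFF
cleared = bic(value, mask)    # bic r0, r0, #0x1f
restored = orr(cleared, mask) # orr r0, r0, #0x1f
print(hex(cleared), hex(restored))
```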

Arithmetic

Adding values is easy enough, you do things like this:
add     r6, r6, r0
add     r5, r5, #1
add     r0, r0, r5, lsl #4
add     r1, r1, r0, asr #7
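The first two of these are clear enough, but what about the last two? The ARM has a place in each data processing instruction to provide a shift specification that is applied to the last operand before it is used. So the third instruction computes r0 + (r5 << 4), and the fourth adds r0 arithmetically shifted right by 7 bits to r1. You can also do the following, where an "s" suffix makes the instruction update the condition flags and a suffix like "eq" or "cc" makes it execute only when the flags satisfy that condition: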
mul     r0, r5, r0
sub     r0, r1, #1
subs    r9, r9, #1
subeq   r0, r0, r2, lsr #1
subscc  r4, r4, lr, lsr #4
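
The shifted operand forms used in this section can be modeled in Python to make the three kinds of shift concrete. The helper names mirror the ARM mnemonics and all values are held to 32 bits. Flags and condition codes are not modeled here:

```python
MASK32 = 0xFFFFFFFF


def lsl(v, n):
    # logical shift left, bits shifted out the top are lost
    return (v << n) & MASK32


def lsr(v, n):
    # logical shift right, zeroes come in at the top
    return (v & MASK32) >> n


def asr(v, n):
    # arithmetic shift right, the sign bit is copied in at the top
    v &= MASK32
    if v & 0x80000000:
        v -= 1 << 32
    return (v >> n) & MASK32


def add(a, b):
    return (a + b) & MASK32


r0, r5 = 1, 3
r0 = add(r0, lsl(r5, 4))    # add r0, r0, r5, lsl #4
print(r0, hex(lsr(0xFFFFFF00, 7)), hex(asr(0xFFFFFF00, 7)))
```

The lsr and asr results differ only when the top bit of the value is set.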

ARM division

You may be (as many are) surprised to learn that the ARM may or may not provide an integer divide instruction. You should expect not to have it. The Cortex-A7 has one, but the Cortex-A8 does not. You can find out by looking at bits 27:24 of "Instruction Set Attribute Register 0" (ID_ISAR0).
If they are 0000, you lose.
If they are 0001, you get SDIV and UDIV in the thumb instructions.
If they are 0010, you get SDIV and UDIV in both thumb and ARM instructions.
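
Decoding that field is just a shift and a mask. Here is a Python sketch that assumes you have already read the register value by some other means. The function name and the sample values are made up for illustration:

```python
def divide_support(id_isar0):
    """Decode bits 27:24 of a value read from the attribute register."""
    field = (id_isar0 >> 24) & 0xF
    meanings = {
        0: "no divide instructions",
        1: "SDIV and UDIV in thumb only",
        2: "SDIV and UDIV in thumb and ARM",
    }
    return meanings.get(field, "reserved value")


# Sample register values, invented for the example.
for value in (0x00101111, 0x01101110, 0x02101110):
    print(hex(value), divide_support(value))
```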

There are all kinds of ways to accomplish integer division without a divide instruction. Being lazy, my approach is to write code in C and let the compiler sweat it out. Many ARM processors these days have floating point hardware, and you can use floating point instructions to do your division for you.
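
One of those ways is the classic shift and subtract method, which works one bit at a time much like long division on paper. The following is a Python sketch of unsigned 32 bit division. It is not what gcc actually generates, and the function name is my own:

```python
def udiv32(n, d):
    """Unsigned 32 bit divide by shift and subtract: return (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    q = 0
    r = 0
    for bit in range(31, -1, -1):
        # Bring down the next bit of the dividend.
        r = (r << 1) | ((n >> bit) & 1)
        # If the divisor fits, subtract it and set this quotient bit.
        if r >= d:
            r -= d
            q |= 1 << bit
    return q, r


print(udiv32(100, 7))
```

This takes 32 trips through the loop no matter what the operands are, which is why real library routines use smarter tricks.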

Floating point

This deserves a section of its own:
Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org