June 2, 2021

A short guide to Xtensa assembly language

The ESP8266 has an Xtensa lx106 processor at its core. This is a 32 bit RISC processor with 16 registers.

This little guide is my "cheat sheet" to the Xtensa architecture. There is a 662 page PDF document "Xtensa Instruction Set Architecture reference manual" that this is derived from.

Important things to know

This is a load/store machine with either 16 or 24 bit instructions. This leads to higher code density than with constant 32 bit encoding. Some instructions have optional "short" 16 bit encodings indicated by appending ".n" to the mnemonic. In general, you should simply ignore the ".n" at the end of instructions.

The Xtensa implements SPARC like register windows on subroutine calls, but I have never seen this feature used in either the bootrom or code generated by gcc, so this can be ignored.

There are 16 registers named a0 through a15.

The processor can be either big or little endian. The bootrom and toolchain use the processor in little endian mode.

The PC is an independent register. There is also a dedicated 6 bit "SAR" (shift amount register).


Think of the load instructions as "flowing" from right to left. In other words, the first register is the destination, the remaining registers and/or stuff are operands.

Store instructions are just the opposite: the first register is the source and will be stored at some address generated from the operands.

Instructions like "add", "and", "xor" flow like the load. The first register is the destination and the operands are on the right.


The a1 register is used as the stack pointer. As near as I can tell, this is simply compiler convention and any other register could have been used. There are no push or pop instructions and subroutine calls do not use the stack register.

Since subroutine calls use the a0 register, it is necessary to save it to the stack explicitly before another subroutine is called. Notice also that once a0 has been saved to the stack, it is perfectly OK to use it as a temporary register or whatever.

The l32r instruction

There is no way to cram a 32 bit address or constant into a 24 bit instruction, so 32 bit constants get dumped into memory (often grouped by the compiler in a block prior to the routine that uses them). They get loaded into a register by the "l32r" instruction, typically with a negative PC relative address, like this:

l32r    a2, 400011a0    ; ( 3fffa000 )
In this form (what I used to use in my disassembly) the value 400011a0 is the memory location referenced by the l32r instruction. The comment conveniently shows the value fetched from that address and placed into a2.

Note that there is no "s32r" instruction.

I later changed how I display things in my disassembly. In the following I show the value 0x3fffcba0 that is fetched from someplace and loaded into the a10 register. It is in fact stored at 0x40003364, but that location is really of no particular interest.

l32r    a10, [0x3fffcba0]       ; 0x40003364
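
For the curious, the address arithmetic behind l32r can be sketched in Python. This is my reading of the ISA manual (a 16 bit immediate treated as an always negative word offset from the PC rounded up to a multiple of 4), not code from any toolchain. The function name and the PC used in the example are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def l32r_target(pc, imm16):
    """Address read by an l32r located at pc (hypothetical helper)."""
    # The 16 bit immediate is a word offset with the upper bits forced to
    # ones, so the byte offset is always negative: -262144 to -4.
    offset = 0xFFFC0000 | ((imm16 & 0xFFFF) << 2)
    # The base is PC + 3 with its low two bits cleared.
    base = (pc + 3) & ~3 & MASK32
    return (base + offset) & MASK32

# Assumed PC, chosen only to show the arithmetic.
print(hex(l32r_target(0x40003400, 0xFFD9)))
```

Because the offset can only be negative, the constants always have to sit at a lower address than the code that loads them.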

Loads and stores

Loads and stores are done using a base register and an 8 bit unsigned offset. The following code increments the value stored at 0x3fffa000.

l32r    a2, 400011a0    ; ( 3fffa000 )
l32i.n  a3, a2, 0
addi.n  a3, a3, 1
s32i.n  a3, a2, 0
There are also 8 and 16 bit loads and stores. The stores are simple enough. The 8 bit load is "l8ui" and loads an 8 bit unsigned value. The 16 bit load comes in two flavors, "l16ui" and "l16si", which load unsigned (zero fill) or signed (sign extended). Note that the encoded offset is shifted left by 1 or 2 for 16 and 32 bit values, so it can reach up to 510 or 1020 bytes. In general this is transparent and handled by the compiler.
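
Here is a small Python model of what the different loads leave in the destination register. It is only a sketch: memory is a plain bytes object, everything is little endian (as on the ESP8266), and the function names simply mirror the mnemonics.

```python
import struct

def l8ui(mem, base, offset):
    # 8 bit load, zero filled
    return mem[base + offset]

def l16ui(mem, base, offset):
    # 16 bit load, zero filled
    return struct.unpack_from("<H", mem, base + offset)[0]

def l16si(mem, base, offset):
    # 16 bit load, sign extended to the full 32 bit register
    return struct.unpack_from("<h", mem, base + offset)[0] & 0xFFFFFFFF

def l32i(mem, base, offset):
    # 32 bit load
    return struct.unpack_from("<I", mem, base + offset)[0]

def max_offset(size_bytes):
    # The 8 bit unsigned offset field is scaled by the access size.
    return 255 * size_bytes

mem = bytes([0x34, 0x12, 0xFE, 0xFF])
print(hex(l16ui(mem, 0, 2)), hex(l16si(mem, 0, 2)))
```

The same two bytes come back as 0xfffe from l16ui and 0xfffffffe from l16si.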

The "mov" instruction copies register to register, and "movi" loads a small immediate value into a register:

mov.n   a7, a2		; a7 <-- a2
movi.n  a3, 0

There are conditional flavors of the "mov" instruction. The destination is left unchanged if the condition is false:

moveqz  a7, a8, a9	; a7 = a8 if a9 == 0
movnez  a4, a3, a11	; a4 = a3 if a11 != 0
movgez  a2, a3, a4	; a2 = a3 if a4 >= 0
movltz  a6, a10, a11	; a6 = a10 if a11 < 0
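
The conditional moves can be modelled in Python as below. This is just an illustration of the comments above: each function takes the current destination value, the source, and the register being tested, and returns the new destination value. The helper to_signed is my own name, not anything from the toolchain.

```python
def to_signed(value):
    # Interpret a 32 bit register value as a signed number.
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def moveqz(dest, src, test):
    return src if test == 0 else dest

def movnez(dest, src, test):
    return src if test != 0 else dest

def movgez(dest, src, test):
    return src if to_signed(test) >= 0 else dest

def movltz(dest, src, test):
    return src if to_signed(test) < 0 else dest

# a6 keeps its old value (1) unless a11 is negative
print(movltz(1, 2, 0xFFFFFFFF), movltz(1, 2, 5))
```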

Jumps and calls

The "j" instruction is a PC relative unconditional jump.
The "jx" instruction jumps to an address held in a register.

The "call0" instruction does a PC relative subroutine call.
The "callx0" instruction does a call to an address held in a register.
The "ret" (usually "ret.n") does a return from subroutine, and is equivalent to "jx a0".

Gcc uses registers a2 to a7 to hold subroutine arguments. If there are more than 6 arguments, the extras go on the stack.


Branches that compare against zero take only a register as an argument (and allow for a bigger PC relative offset, but in general you don't care and the compiler takes care of that).
beqz    a3, 40003518
bnez    a1, 400033da
bgez    a4, 40005ced
bltz    a4, 4000647d

Other branches compare against an immediate value. Here the "ge" and "lt" tests also come in an unsigned "u" flavor (the compare against zero branches above do not).

beqi    a2, 2, 40006a08
bnei    a2, 6, 400073b9
bgei    a4, 1, 400087dc
blti    a0, 1, 400094aa
bgeui   a4, 8, 40009944
bltui   a13, 6, 4000a18d

Others test if a bit is set or clear via an immediate value. The immediate value is the bit number (0-31). 0 is the lsb.

bbci    a3, 1, 4000bedc
bbsi    a2, 1, 4000df2c
Also a bit can be tested using a bit number held in a second register.
bbc     a14, a12, 40004e0c
bbs     a14, a12, 40005d0c
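
A quick Python model of the bit test branches, assuming little endian bit numbering (bit 0 is the lsb). For the register forms, my understanding of the ISA manual is that only the low 5 bits of the second register are used as the bit number. The function names are mine.

```python
def bit_is_set(value, bit):
    # Only the low 5 bits of the bit number matter (0-31).
    return ((value >> (bit & 31)) & 1) == 1

def bbs_taken(value, bit):
    # bbs / bbsi: branch if the bit is set
    return bit_is_set(value, bit)

def bbc_taken(value, bit):
    # bbc / bbci: branch if the bit is clear
    return not bit_is_set(value, bit)

print(bbs_taken(0b0010, 1), bbc_taken(0b0010, 1))
```
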
Two registers can be compared. Here (for example) "GE" is true if the first register is GE the second.
beq     a3, a6, 400000e0
bne     a3, a4, 40000e74
bge     a3, a6, 400000e0
blt     a3, a4, 40000e74
bgeu    a3, a6, 400000e0
bltu    a3, a4, 40000e74
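
The difference between the signed and unsigned compares only shows up when the top bit is set. A small Python sketch (function names mirror the mnemonics, to_signed is my own helper):

```python
MASK32 = 0xFFFFFFFF

def to_signed(value):
    # Interpret a 32 bit register value as a signed number.
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def bge_taken(first, second):
    return to_signed(first) >= to_signed(second)

def bgeu_taken(first, second):
    return (first & MASK32) >= (second & MASK32)

# 0xffffffff is -1 when signed, but the largest value when unsigned
print(bge_taken(0xFFFFFFFF, 1), bgeu_taken(0xFFFFFFFF, 1))
```
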
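Tests can be performed on the first register using a bit mask held in the second: "bany" branches if any of the masked bits are set, "bnone" if none are, "ball" if all are, and "bnall" if at least one is clear.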
bany     a12, a14, 40005fe2
bnone    a12, a14, 40005fe2
ball     a12, a14, 40005fe2
bnall    a12, a14, 40005fe2
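
The four mask tests boil down to two expressions and their opposites. A Python sketch, with the first argument being the register under test and the second the mask:

```python
MASK32 = 0xFFFFFFFF

def bany_taken(value, mask):
    # any masked bit set
    return (value & mask & MASK32) != 0

def bnone_taken(value, mask):
    # no masked bit set
    return (value & mask & MASK32) == 0

def ball_taken(value, mask):
    # every masked bit set
    return (~value & mask & MASK32) == 0

def bnall_taken(value, mask):
    # at least one masked bit clear
    return (~value & mask & MASK32) != 0

print(bany_taken(0b0101, 0b0110), ball_taken(0b0101, 0b0110))
```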

Odds and Ends

Bit field extraction -
extui   a0, a1, 16, 3		; a0 = (a1>>16) & mask
Where the mask has the specified number of low bits set, i.e. in this case the mask is 0x7.
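
In Python the same extraction looks like this (a sketch, with the register contents chosen arbitrarily):

```python
def extui(src, shift, bits):
    # Shift right, then keep the low "bits" bits.
    mask = (1 << bits) - 1
    return (src >> shift) & mask

# 3 bits starting at bit 16 of 0x12345678
print(extui(0x12345678, 16, 3))
```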

There are also some special case adds:

12d1ff          addmi   a1, a1, 0xffffff00
3195f6          l32r    a3, 3fffdaac    ; ( 3fffc000 )
3032a0          addx4   a3, a2, a3      ; add to base
The addmi instruction takes an 8 bit immediate value and shifts it left 8 bits and does sign extension. The disassembler displays the effective result. It yields a signed value in the range -32768 to 32512 at multiples of 256.

The addx4 instruction is perfect for handling lookup tables with 32 bit values. It multiplies the middle register by 4 (by shifting left 2) and adds this to the value in the last register. The result, as always, goes to the first. There are also x2 and x8 variants of the same sort.
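
Both of these are easy to model in Python. The sketch below decodes the addmi immediate the way the disassembler displays it and does the addx4 table index calculation. The function names and the table index are mine, only for illustration.

```python
MASK32 = 0xFFFFFFFF

def addmi_immediate(imm8):
    # Sign extend the 8 bit field, then shift left 8 bits.
    signed = imm8 - 256 if imm8 & 0x80 else imm8
    return (signed << 8) & MASK32

def addx4(index, base):
    # index * 4 + base, i.e. the address of a 32 bit table entry
    return ((index << 2) + base) & MASK32

print(hex(addmi_immediate(0xFF)))     # the "ff" byte in 12d1ff
print(hex(addx4(3, 0x3FFFC000)))      # entry 3 of a table at 3fffc000
```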

Adds and such

These are what would typically be called "ALU" operations on some processors. The general case involves 3 registers:
add	a1, a2, a3
addi	a1, a2, N
The "add" adds a2 and a3, placing the result in a1.
The "addi" adds a2 and N (the 8 bit signed immediate), placing the result in a1. The value of N is in the range [-128,127].

Subtract works the same way. The following calculates a1 = a2 - a3:

sub	a1, a2, a3
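
Since the registers are 32 bits wide, these all wrap around modulo 2^32. A small Python sketch of that behavior (function names mirror the mnemonics, and the range check on the immediate is my own addition to flag values that would not fit in the instruction):

```python
MASK32 = 0xFFFFFFFF

def add(a, b):
    return (a + b) & MASK32

def addi(a, n):
    # The immediate must fit in a signed 8 bit field.
    if not -128 <= n <= 127:
        raise ValueError("addi immediate out of range")
    return (a + n) & MASK32

def sub(a, b):
    return (a - b) & MASK32

print(hex(add(0xFFFFFFFF, 1)), hex(sub(2, 3)), hex(addi(0, -1)))
```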

Special registers

There are a raft of "special registers" that are read and written using the rsr and wsr instructions.
wsr.intenable   a5
rsr.ccount      a2
wsr.intclear    a2
xsr.ps		a2
The "rsync" instruction waits for all prior wsr instructions to finish. The "xsr" instruction does a swap with a register.

Note that the ccount register increments on every processor cycle. There is a "ccompare0" register that can be used in conjunction with it to generate interrupts.


There is a special register "PS" that holds, along with other things, an interrupt level in the range 0-15. The RSIL instruction (read and set interrupt level) copies the value of PS to a register, then sets the interrupt level in PS to a value 0-15.
rsil    a7, 2
This is often used in an idiom to block interrupts over a section of code as follows:
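rsil    a7, 2		; save PS in a7, raise interrupt level to 2
			; ... code that must not be interrupted ...
wsr.ps	a7		; restore the saved PS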
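
The save, raise, restore pattern can be modelled in Python as follows. This is a sketch that assumes the interrupt level lives in the low 4 bits of PS (which is how I read the ISA manual). The starting PS value is arbitrary.

```python
MASK32 = 0xFFFFFFFF
INTLEVEL_MASK = 0xF   # assumed: interrupt level is the low 4 bits of PS

def intlevel(ps):
    return ps & INTLEVEL_MASK

def rsil(ps, level):
    # Returns (value copied to the register, new PS).
    saved = ps
    new_ps = (ps & ~INTLEVEL_MASK & MASK32) | (level & INTLEVEL_MASK)
    return saved, new_ps

ps = 0x20                      # arbitrary starting PS, level 0
saved, ps = rsil(ps, 2)        # rsil: remember old PS, raise level to 2
print(intlevel(ps))            # level while the protected code runs
ps = saved                     # wsr.ps with the saved value
print(intlevel(ps))            # back to the original level
```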



memw is "memory wait". It is basically a pipeline sync. It waits until all loads and stores finish.

The "excw" waits until all prior instructions are either exception free or any exceptions have been taken.

Unusual instructions

There is an s32ri that pairs with the l32ai for multiprocessor synchronization.