August 6, 2023

RP2040 - now blink the LED using assembly language

Hopefully you at least peeked at the code that does this using the C language.

Here is the code for the whole project on Github:

And here is the heart of the code. I am not showing all of it here (see Github for that). A detailed discussion follows.

An important note first. This file is blink.S. The capital "S" extension is very important. It triggers the C preprocessor to run, allowing me to use #define macros as part of the assembly code, which I consider a life saver. See the makefile, as I also use gcc rather than "as" to assemble the code. Yes, that probably sounds odd, but gcc runs the preprocessor on a ".S" file and then hands the result to the assembler, which is what allows the use of macros.

/* This is just one of 32 bits */
#define R_IO_BANK0      0x20

#define GPIO_25      0x02000000

/* An odd thing.  The disassembly (dump) shows "subs", but we must
 * write it here as "sub" or we get errors.  Without ".syntax unified",
 * gnu as uses the older divided syntax, where the thumb "sub" already
 * sets the flags.
 *
 * This fits in only 76 bytes as compared to 124 for the C version
 */

.cpu cortex-m0
.thumb
.text

@ execution starts here.
@ note that we don't need a stack
@ in fact we don't need SRAM at all.
start:

	@ reset IO Bank 0
	ldr	r1,=RESET_BASE_CLR
	mov	r0,#R_IO_BANK0
	str	r0, [r1]

	@ loop/poll until done
	ldr	r1,=RESET_BASE_RW
wait_loop:
	ldr	r2, [r1,#8]
	tst	r0, r2
	beq	wait_loop

	@ Set function select to software IO
	@ for GPIO 25
	ldr	r1,=IO_BANK0_BASE_RW
	mov	r2, #0xcc
	mov	r0,#5
	str	r0,  [r1,r2]

	@ Enable output for GPIO 25
	ldr	r1,=SIO_BASE
	ldr	r0,=GPIO_25
	str	r0, [r1,#SIO_OE_SET]

blink_loop:
	str	r0, [r1,#SIO_OUT_SET]
	bl	blink_delay
	str	r0, [r1,#SIO_OUT_CLR]
	bl	blink_delay
	b	blink_loop

@ ==================================
@ This blinks at about 1 Hz
#define DELAY_COUNT	0x80000

blink_delay:
	ldr	r3,=DELAY_COUNT
delay_loop:
	sub	r3, r3, #1
	cmp	r3, #0
	bne	delay_loop
	bx	lr

The heart of the code is the "blink_loop". Everything else is just setup.

Let's talk about registers. I use r3 in the delay function and it is important not to forget this and also use it in the mainline code. Other than that, I get everything done using only the 3 registers r0, r1, and r2. I use r2 to hold one offset. I use r1 to hold a base address and r0 to hold data.

Notice how I use the "=" trick to make the assembler find a place to stick data. The way this works is that the assembler puts the 32 bit value somewhere nearby for you (in what is called a literal pool) and generates a PC relative load instruction to go fetch it.
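
To see what such a load looks like from the processor's side, here is a small sketch in C that decodes the 16 bit thumb "ldr rt, [pc, #imm]" encoding and works out where the stored value must live. The function names are made up for this illustration. The opcode 0x4908 and the address 0x10000000 in the usage note are example values only, not taken from a dump of this project.

```c
#include <stdint.h>

/* Given the address and 16 bit encoding of a thumb "ldr rt, [pc, #imm]"
 * instruction (bit pattern 01001 rt imm8), return the address of the
 * 32 bit word it loads.  Returns 0 if the opcode is something else.
 * (Hypothetical helper, for illustration only.) */
uint32_t ldr_literal_address(uint32_t insn_addr, uint16_t opcode)
{
    if ((opcode & 0xF800u) != 0x4800u)
        return 0;

    /* imm8 counts words, so multiply by 4 to get bytes */
    uint32_t imm = (uint32_t)(opcode & 0xFFu) * 4u;

    /* PC reads as the instruction address plus 4, rounded down to a word */
    uint32_t pc = (insn_addr + 4u) & ~(uint32_t)3u;

    return pc + imm;
}

/* Destination register (r0 to r7) named by the same encoding. */
unsigned ldr_literal_reg(uint16_t opcode)
{
    return (opcode >> 8) & 7u;
}
```

For example, ldr_literal_address(0x10000000, 0x4908) gives 0x10000024 and ldr_literal_reg(0x4908) gives 1. In other words, that opcode is "ldr r1" fetching a word stored 0x24 bytes past the instruction. Only the 8 bit offset lives in the opcode, never the 32 bit value itself.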

PC relative addressing and thumb mode

The RP2040 is an "M" series ARM device, i.e. a microcontroller. Regular ARM devices use 32 bit opcodes. M series devices execute only "thumb" instructions, which are almost all 16 bit opcodes (the "bl" used above is one of the few 32 bit ones). The benefit is that the code is a good deal smaller. The downside is that there is little if any room in instructions to hold addresses and constants.

Consider addresses. We need 32 bit addresses. What the assembler does for us is to generate PC relative addresses. The instruction holds an offset from the current PC, which is small enough to fit somewhere in the 16 bit thumb opcode. It does this for constants and for branch targets (such as "wait_loop"). Take note of two things. One is that this is all transparent. I just code up a label and a branch and the assembler does the dirty work. The other is a happy side effect. The code is position independent. With no hard addresses in the code, it can be relocated anywhere.
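
The branch case can be sketched in C as well. The function below (a made up name, for illustration) computes the target of a 16 bit thumb conditional branch the way the processor does. The opcode 0xD0FC is what a "beq" back over two 16 bit instructions should encode as, which matches the shape of the wait_loop above. The addresses assume that "start" sits exactly at the link address. Neither value is taken from an actual dump.

```c
#include <stdint.h>

/* Target of a 16 bit thumb conditional branch (bit pattern 1101 cond imm8).
 * imm8 is a signed count of halfwords relative to PC, and PC reads as
 * the address of the branch itself plus 4.
 * (Hypothetical helper, for illustration only.) */
uint32_t cond_branch_target(uint32_t insn_addr, uint16_t opcode)
{
    int32_t imm = (int32_t)(opcode & 0xFFu);

    /* sign extend the 8 bit field */
    if (imm & 0x80)
        imm -= 0x100;

    /* halfwords to bytes, then add to PC (unsigned wraparound is intended) */
    return insn_addr + 4u + (uint32_t)(imm * 2);
}
```

With the code linked at 0x10000000 the branch would sit at 0x1000000C, and cond_branch_target(0x1000000C, 0xD0FC) gives 0x10000008. Move everything to 0x90000000 and the very same opcode yields 0x90000008. The opcode never changes, which is the position independence described above.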

You might be asking just where this code does run. This is specified in the linker script (see blink.lds). Linker scripts are one of the things embedded programmers have to deal with that regular programmers never care or know about. The one for this project is very simple, as follows:

SECTIONS
{
    . = 0x10000000;
    .text   : { *(.text*)   }
}

This gathers all of the text sections together and locates them at 0x10000000. The careful reader will notice that this is the address of the XIP (execute in place) area in the RP2040 chip. However, don't be fooled by this.

Given that all the code is position independent with PC relative addressing used every place where a definite address might be used, the code could be loaded anywhere and would run! The value in the linker script can be anything at all. In fact I changed it to 0x90000000 and everything worked just fine.

Finding out just what really goes on will require some study of the bootrom code. I don't yet know whether XIP is being used (I seriously doubt that it is in this case), and the question is probably academic.

Here is what we know so far. The bootrom pulls 256 bytes (or a bit less), loads it someplace into SRAM and starts it running. This is intended to be a second stage boot loader. In our case it is the whole program, which works fine for this tiny demo. For something bigger we will have to learn more.

Both the bootrom details and XIP are a topic for another writeup when I learn more about them.

ASM versus C

This assembly code fits in 76 bytes and does exactly the same job as the C code that fits into 124 bytes. How can this be? There are two main things. One is that the C compiler did a speed optimization and put two copies of the delay loop inline. Of course this is pointless for a delay loop, but the compiler had no way of knowing that. It also retained the actual delay function itself, so we end up with 3 copies. The other thing is that the C compiler generates code that uses the stack. I have always been surprised (and disappointed) that the C compiler doesn't just keep something like the count variable in the delay function in a register. Instead it stores to it and fetches from it on every pass. Honestly this is due to my declaring it volatile. Without volatile, the compiler would optimize the loop away entirely. With it, the compiler's hands are tied and the variable goes on the stack.
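
For reference, a volatile delay function of the kind described here might look like the sketch below. This is not the actual C code from the other project, just a minimal stand-in to show the idea. The iteration counter and the return value are additions of mine so that the function can be checked on an ordinary PC.

```c
#include <stdint.h>

#define DELAY_COUNT 0x80000u   /* same count the assembly version uses */

/* Count down from "count" to zero and report how many passes were made.
 * Because "n" is volatile, the compiler must perform every read and
 * write of it, so the loop cannot be thrown away.  In practice "n"
 * ends up in a stack slot rather than in a register.
 * (Illustrative sketch, not the code from the C project.) */
uint32_t blink_delay(uint32_t count)
{
    volatile uint32_t n = count;
    uint32_t passes = 0;

    while (n != 0) {
        n = n - 1;
        passes++;
    }
    return passes;
}
```

Calling blink_delay(DELAY_COUNT) makes 0x80000 passes through the loop. Take away the volatile (and the pass counter) and an optimizing compiler is free to reduce the whole function to nothing, which is exactly the behavior described above.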

So there is a place for assembly coding after all. This code avoids use of the stack entirely, and in fact does not use SRAM at all. But as far as the end result goes (blinking the LED), both get the job done.

A negative aspect of the assembly code is several hard coded constants (like 0xcc for the offset to get to the proper function select register). The C code was able to let the compiler figure this out by using a well designed "struct". The code has several other constants and honestly they all should be set up using informative macros (such as the value "5" to select software IO). This doesn't matter terribly much for this little demo project, but for a bigger project it would lead to code that would be hard to understand and maintain.
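
Here is a rough sketch in C of how a struct lets the compiler come up with that 0xcc. The struct, macro and function names are hypothetical (the real ones are in the C project). The layout assumed is the one the offset implies: each GPIO in IO bank 0 has a status register followed by a control register.

```c
#include <stddef.h>
#include <stdint.h>

/* One status/control register pair per GPIO.  (Hypothetical names.) */
struct gpio_regs {
    volatile uint32_t status;
    volatile uint32_t ctrl;     /* holds the function select field */
};

struct io_bank0 {
    struct gpio_regs gpio[30];  /* GPIO 0 through 29 */
};

#define LED_GPIO     25
#define FUNCSEL_SIO  5          /* function select: software IO */

/* Byte offset of the control register for a given GPIO. */
size_t gpio_ctrl_offset(unsigned gpio)
{
    return offsetof(struct io_bank0, gpio)
         + gpio * sizeof(struct gpio_regs)
         + offsetof(struct gpio_regs, ctrl);
}

/* Bit mask for a given GPIO, as used for the SIO set and clear registers. */
uint32_t gpio_mask(unsigned gpio)
{
    return (uint32_t)1 << gpio;
}
```

Here gpio_ctrl_offset(LED_GPIO) works out to 0xcc and gpio_mask(LED_GPIO) to 0x02000000, the same two magic numbers that appear in the assembly. Since blink.S already goes through the C preprocessor, the same arithmetic could be written there as #define macros.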


Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org