December 22, 2016

Register access shootout

The question here is this: what is the best way to write C code to access device registers on the ARM?

This question (or my view on it) may be surprising to some, so I should explain it. I use a coding style where I define a structure to represent the register layout for a device and then set a structure pointer to the base address of the actual hardware. A lot of other code (in particular linux kernel code) uses a bunch of macro definitions that are wrappers on offsets, along with hardware access macros. My concern was that my approach might be generating inefficient code, and I wanted to investigate the details. If this is not clear already, it should become clearer as you follow along.

I will also venture to say that the conclusions apply to hardware other than the ARM.

Before diving into the shootout (which looks at details of code generated by the compiler) it is worth discussing other metrics and issues besides generating fast code.

Readable Code

This can be a matter of taste or what a person is used to. My preference is to avoid introducing macros and thus producing a new C dialect. So I would rather see a write to a control register written as:
    base->ctrl = 0x22;
In lieu of:
    writel(0x22, CTRL);
Also consider that the form I like allows things like this to be written:
    base->ctrl |= 0x20;

Portability to hardware with IO instructions

This was a strong argument against what I like to do back in the early days of x86 processors. A header file full of offset values could be made to work both with IO mapped instructions (like inb/outb) as well as memory mapped addresses. This is not an issue with the ARM and IO instructions now seem to be largely out of favor in the x86 world, but the style and prejudice towards coding device drivers with a bunch of offsets seems to persist.

On with the shootout

A template for a subset of the registers in a 16550 compatible UART can be written like so:
#define SUNXI_UART0_BASE        0x01C28000

struct h3_uart {
        volatile unsigned int data;     /* 00 */
        volatile unsigned int ier;      /* 04 */
        volatile unsigned int iir;      /* 08 */
        volatile unsigned int lcr;      /* 0c */
        int _pad;
        volatile unsigned int lsr;      /* 14 */
};

#define UART_BASE       ((struct h3_uart *) SUNXI_UART0_BASE)
#define Uart            ((struct h3_uart *) SUNXI_UART0_BASE)
#define TX_READY        0x40
Some people will argue that the compiler may introduce odd padding and is otherwise free to cause mischief with how a structure is laid out, but in practice I have never seen this to be an issue.

Alternately, this gang of definitions can be used:

#define UART0_RBR (SUNXI_UART0_BASE + 0x0)    /* receive buffer register */
#define UART0_THR (SUNXI_UART0_BASE + 0x0)    /* transmit holding register (same offset as RBR) */
#define UART0_IER (SUNXI_UART0_BASE + 0x4)    /* interrupt enable register */
#define UART0_IIR (SUNXI_UART0_BASE + 0x8)    /* interrupt identity register */
#define UART0_LCR (SUNXI_UART0_BASE + 0xc)    /* line control register */
#define UART0_LSR (SUNXI_UART0_BASE + 0x14)   /* line status register */

#define readl(addr)             (*((volatile unsigned long  *)(addr)))
#define writel(v, addr)         (*((volatile unsigned long  *)(addr)) = (unsigned long)(v))
Given these definitions, we want to see what sort of code the compiler will generate from the following three alternatives. The first is the way I would prefer to write code. The second is how many in the linux community like to do things. The third is an alternative I might consider if it generates better code than (1) and more like (2). (This turns out to be moot.)
void
uart0_putc1 ( char c )
{
        struct h3_uart *up = UART_BASE;

        while ( !(up->lsr & TX_READY) )
            ;

        up->data = c;
}

void
uart0_putc2 ( char c )
{
        while (!(readl(UART0_LSR) & (1 << 6)))
            ;

        writel(c, UART0_THR);
}

void
uart0_putc3 ( char c )
{
        while ( !(Uart->lsr & TX_READY) )
            ;

        Uart->data = c;
}
For the record, the compiler I am using is "arm-linux-gnu-gcc (GCC) 6.1.1", a cross compiler running on an x86_64 system under Fedora 24 linux. To get the disassembled code, I use "arm-linux-gnu-objdump -d uart.elf >uart.dump".

First we run the compiler without optimization. I will say this up front. Without optimization the code is so bad that you should not worry or care which coding method you use. So aim for clarity and readability (as you should in any event).

0000029c <uart0_putc1>:
 29c:   e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
 2a0:   e28db000        add     fp, sp, #0
 2a4:   e24dd014        sub     sp, sp, #20
 2a8:   e1a03000        mov     r3, r0
 2ac:   e54b300d        strb    r3, [fp, #-13]
 2b0:   e3a03902        mov     r3, #32768      ; 0x8000
 2b4:   e34031c2        movt    r3, #450        ; 0x1c2
 2b8:   e50b3008        str     r3, [fp, #-8]
 2bc:   e320f000        nop     {0}
 2c0:   e51b3008        ldr     r3, [fp, #-8]
 2c4:   e5933014        ldr     r3, [r3, #20]
 2c8:   e2033040        and     r3, r3, #64     ; 0x40
 2cc:   e3530000        cmp     r3, #0
 2d0:   0afffffa        beq     2c0 <uart0_putc1+0x24>
 2d4:   e55b200d        ldrb    r2, [fp, #-13]
 2d8:   e51b3008        ldr     r3, [fp, #-8]
 2dc:   e5832000        str     r2, [r3]
 2e0:   e320f000        nop     {0}
 2e4:   e24bd000        sub     sp, fp, #0
 2e8:   e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
 2ec:   e12fff1e        bx      lr

0000029c <uart0_putc2>:
 29c:   e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
 2a0:   e28db000        add     fp, sp, #0
 2a4:   e24dd00c        sub     sp, sp, #12
 2a8:   e1a03000        mov     r3, r0
 2ac:   e54b3005        strb    r3, [fp, #-5]
 2b0:   e320f000        nop     {0}
 2b4:   e3083014        movw    r3, #32788      ; 0x8014
 2b8:   e34031c2        movt    r3, #450        ; 0x1c2
 2bc:   e5933000        ldr     r3, [r3]
 2c0:   e2033040        and     r3, r3, #64     ; 0x40
 2c4:   e3530000        cmp     r3, #0
 2c8:   0afffff9        beq     2b4 <uart0_putc2+0x18>
 2cc:   e3a03902        mov     r3, #32768      ; 0x8000
 2d0:   e34031c2        movt    r3, #450        ; 0x1c2
 2d4:   e55b2005        ldrb    r2, [fp, #-5]
 2d8:   e5832000        str     r2, [r3]
 2dc:   e320f000        nop     {0}
 2e0:   e24bd000        sub     sp, fp, #0
 2e4:   e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
 2e8:   e12fff1e        bx      lr

0000029c <uart0_putc3>:
 29c:   e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
 2a0:   e28db000        add     fp, sp, #0
 2a4:   e24dd00c        sub     sp, sp, #12
 2a8:   e1a03000        mov     r3, r0
 2ac:   e54b3005        strb    r3, [fp, #-5]
 2b0:   e320f000        nop     {0}
 2b4:   e3a03902        mov     r3, #32768      ; 0x8000
 2b8:   e34031c2        movt    r3, #450        ; 0x1c2
 2bc:   e5933014        ldr     r3, [r3, #20]
 2c0:   e2033040        and     r3, r3, #64     ; 0x40
 2c4:   e3530000        cmp     r3, #0
 2c8:   0afffff9        beq     2b4 <uart0_putc3+0x18>
 2cc:   e3a03902        mov     r3, #32768      ; 0x8000
 2d0:   e34031c2        movt    r3, #450        ; 0x1c2
 2d4:   e55b2005        ldrb    r2, [fp, #-5]
 2d8:   e5832000        str     r2, [r3]
 2dc:   e320f000        nop     {0}
 2e0:   e24bd000        sub     sp, fp, #0
 2e4:   e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
 2e8:   e12fff1e        bx      lr
Conclusions? The code from each of these is very similar. What in the world are the "nop" instructions all about?? The compiler is generating a 32 bit pointer to the device register from two 16 bit pieces in each case (and even repeating the process in the loop rather than holding the constructed item in a register). The only real difference is that case (1) gets penalized because the compiler feels obligated to save what looks like a local variable on the stack.

So let us repeat this with "gcc -O" and see what we get.

00000114 <uart0_putc1>:
 114:   e3a02902        mov     r2, #32768      ; 0x8000
 118:   e34021c2        movt    r2, #450        ; 0x1c2
 11c:   e5923014        ldr     r3, [r2, #20]
 120:   e3130040        tst     r3, #64 ; 0x40
 124:   0afffffc        beq     11c <uart0_putc1+0x8>
 128:   e3a03902        mov     r3, #32768      ; 0x8000
 12c:   e34031c2        movt    r3, #450        ; 0x1c2
 130:   e5830000        str     r0, [r3]
 134:   e12fff1e        bx      lr

00000114 <uart0_putc2>:
 114:   e3a02902        mov     r2, #32768      ; 0x8000
 118:   e34021c2        movt    r2, #450        ; 0x1c2
 11c:   e5923014        ldr     r3, [r2, #20]
 120:   e3130040        tst     r3, #64 ; 0x40
 124:   0afffffc        beq     11c <uart0_putc2+0x8>
 128:   e3a03902        mov     r3, #32768      ; 0x8000
 12c:   e34031c2        movt    r3, #450        ; 0x1c2
 130:   e5830000        str     r0, [r3]
 134:   e12fff1e        bx      lr

00000114 <uart0_putc3>:
 114:   e3a02902        mov     r2, #32768      ; 0x8000
 118:   e34021c2        movt    r2, #450        ; 0x1c2
 11c:   e5923014        ldr     r3, [r2, #20]
 120:   e3130040        tst     r3, #64 ; 0x40
 124:   0afffffc        beq     11c <uart0_putc3+0x8>
 128:   e3a03902        mov     r3, #32768      ; 0x8000
 12c:   e34031c2        movt    r3, #450        ; 0x1c2
 130:   e5830000        str     r0, [r3]
 134:   e12fff1e        bx      lr
Conclusions? This is much, much better. It is also hilarious, because with optimization the code is identical in all three cases. What is surprising is that the compiler could do better. It already has the pointer value it needs in r2 and does not need to construct it again as it does.

The compiler can do better, and it does if we run it as gcc -O2 as follows:

0000010c <uart0_putc1>:
 10c:   e3a02902        mov     r2, #32768      ; 0x8000
 110:   e34021c2        movt    r2, #450        ; 0x1c2
 114:   e5923014        ldr     r3, [r2, #20]
 118:   e3130040        tst     r3, #64 ; 0x40
 11c:   0afffffc        beq     114 <uart0_putc1+0x8>
 120:   e5820000        str     r0, [r2]
 124:   e12fff1e        bx      lr

0000010c <uart0_putc2>:
 10c:   e3a02902        mov     r2, #32768      ; 0x8000
 110:   e34021c2        movt    r2, #450        ; 0x1c2
 114:   e5923014        ldr     r3, [r2, #20]
 118:   e3130040        tst     r3, #64 ; 0x40
 11c:   0afffffc        beq     114 <uart0_putc2+0x8>
 120:   e5820000        str     r0, [r2]
 124:   e12fff1e        bx      lr

0000010c <uart0_putc3>:
 10c:   e3a02902        mov     r2, #32768      ; 0x8000
 110:   e34021c2        movt    r2, #450        ; 0x1c2
 114:   e5923014        ldr     r3, [r2, #20]
 118:   e3130040        tst     r3, #64 ; 0x40
 11c:   0afffffc        beq     114 <uart0_putc3+0x8>
 120:   e5820000        str     r0, [r2]
 124:   e12fff1e        bx      lr
Again, all three routines generate identical code. I have no complaints about this code. I trust that it is faster to generate the 32 bit constant from two 16 bit pieces held in immediate values than to fetch it from memory in a single instruction.

Conclusion

Two things stand out:
  1. It is very much worthwhile to run the compiler with the -O switch and even better to run it with -O2.
  2. I no longer feel any guilt about writing structures for device register templates as I prefer to do.

I will note that my motivation for using a base pointer and a template is an expectation that the compiler will load that pointer into a register and then repeatedly use it with offsets, as indeed the optimized code does.


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org