This question (or my view on it) may be surprising to some, so I should explain it. I use a coding style where I define a structure to represent the register layout of a device and then set a structure pointer to the base address of the actual hardware. A lot of other code (in particular Linux kernel code) instead uses a set of macro definitions that wrap register offsets, along with hardware access macros. My concern was that my approach might be generating inefficient code, and I wanted to investigate the details. This should become clearer as you follow along if it is not clear already.
I will also venture to say that the conclusions apply to hardware other than the ARM.
Before diving into the shootout (which looks at details of code generated by the compiler) it is worth discussing other metrics and issues besides generating fast code.
    base->ctrl = 0x22;

In lieu of:

    writel(0x22,STATUS);

Also consider that the form I like allows things like this to be written:

    base->ctrl |= 0x20;
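To make the comparison concrete, here is a minimal sketch of the same read-modify-write in both styles. Everything named `demo_*` is hypothetical and invented for illustration: the "device" is an ordinary variable rather than real hardware so the sketch can be compiled on any host, and the accessors are simplified stand-ins in the spirit of readl/writel, not the Linux kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device with one control register at offset 0.
 * Real code would point at a hardware address; here the "device"
 * lives in ordinary memory so the sketch runs anywhere. */
struct demo_dev {
    volatile uint32_t ctrl;   /* 00 */
};

/* Simplified 32-bit accessors (assumed for this sketch only). */
#define demo_readl(addr)      (*(volatile uint32_t *)(addr))
#define demo_writel(v, addr)  (*(volatile uint32_t *)(addr) = (uint32_t)(v))

/* Structure style: the read-modify-write is a single expression. */
uint32_t set_bits_struct(struct demo_dev *base, uint32_t bits)
{
    base->ctrl |= bits;
    return base->ctrl;
}

/* Macro style: the read and the write are spelled out separately. */
uint32_t set_bits_macro(volatile uint32_t *ctrl_addr, uint32_t bits)
{
    demo_writel(demo_readl(ctrl_addr) | bits, ctrl_addr);
    return demo_readl(ctrl_addr);
}
```

Both functions perform one load and one store on the register; the difference is only in how much the source has to say to get there.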
    #define SUNXI_UART0_BASE 0x01C28000

    struct h3_uart {
        volatile unsigned int data;   /* 00 */
        volatile unsigned int ier;    /* 04 */
        volatile unsigned int iir;    /* 08 */
        volatile unsigned int lcr;    /* 0c */
        int _pad;                     /* 10 */
        volatile unsigned int lsr;    /* 14 */
    };

    #define UART_BASE ((struct h3_uart *) SUNXI_UART0_BASE)
    #define Uart ((struct h3_uart *) SUNXI_UART0_BASE)

    #define TX_READY 0x40

Some people will argue that the compiler may introduce odd padding and is otherwise free to cause mischief with how a structure is laid out, but in practice I have never seen this to be an issue.
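For anyone who does worry about padding, one possible safeguard is to have the compiler check the layout at build time. The sketch below repeats the structure and uses C11 `_Static_assert` with `offsetof`. It assumes a target where `unsigned int` is 32 bits, and the helper `h3_uart_lsr_offset` is a hypothetical name added only so the result can be inspected at run time.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

struct h3_uart {
    volatile unsigned int data;   /* 00 */
    volatile unsigned int ier;    /* 04 */
    volatile unsigned int iir;    /* 08 */
    volatile unsigned int lcr;    /* 0c */
    int _pad;                     /* 10 */
    volatile unsigned int lsr;    /* 14 */
};

/* The build fails if the layout ever drifts from the register map. */
_Static_assert(sizeof(unsigned int) == 4, "unsigned int must be 32 bits");
_Static_assert(offsetof(struct h3_uart, ier) == 0x04, "ier offset");
_Static_assert(offsetof(struct h3_uart, iir) == 0x08, "iir offset");
_Static_assert(offsetof(struct h3_uart, lcr) == 0x0c, "lcr offset");
_Static_assert(offsetof(struct h3_uart, lsr) == 0x14, "lsr offset");

/* Hypothetical helper: reports the lsr offset the compiler chose. */
size_t h3_uart_lsr_offset(void)
{
    return offsetof(struct h3_uart, lsr);
}
```

These checks cost nothing in the generated code, since they are evaluated entirely at compile time.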
Alternately, this gang of definitions can be used:
    #define UART0_RBR (SUNXI_UART0_BASE + 0x0)  /* receive buffer register */
    #define UART0_THR (SUNXI_UART0_BASE + 0x0)  /* transmit holding register (same offset as RBR, used when writing) */
    #define UART0_IER (SUNXI_UART0_BASE + 0x4)  /* interrupt enable register */
    #define UART0_IIR (SUNXI_UART0_BASE + 0x8)  /* interrupt identity register */
    #define UART0_LCR (SUNXI_UART0_BASE + 0xc)  /* line control register */
    #define UART0_LSR (SUNXI_UART0_BASE + 0x14) /* line status register */

    #define readl(addr) (*((volatile unsigned long *)(addr)))
    #define writel(v, addr) (*((volatile unsigned long *)(addr)) = (unsigned long)(v))

Given these definitions, we want to see what sort of code the compiler will generate from the following three alternatives. The first is the way I would prefer to write code. The second is how many in the Linux community like to do things. The third is an alternative I might consider if it generates better code than (1) and more like (2). (This turns out to be moot).
    void
    uart0_putc1 ( char c )
    {
        struct h3_uart *up = UART_BASE;

        while ( !(up->lsr & TX_READY) )
            ;
        up->data = c;
    }

    void
    uart0_putc2 ( char c )
    {
        while (!(readl(UART0_LSR) & (1 << 6)))
            ;
        writel(c, UART0_THR);
    }

    void
    uart0_putc3 ( char c )
    {
        while ( !(Uart->lsr & TX_READY) )
            ;
        Uart->data = c;
    }

For the record, the compiler I am using is "arm-linux-gnu-gcc (GCC) 6.1.1", a cross compiler running on an x86_64 system under Fedora 24 Linux. To get the disassembled code, I use:

    arm-linux-gnu-objdump -d uart.elf >uart.dump
First we run the compiler without optimization. I will say this up front. Without optimization the code is so bad that you should not worry or care which coding method you use. So aim for clarity and readability (as you should in any event).
(1)

    0000029c:
     29c: e52db004  push {fp}           ; (str fp, [sp, #-4]!)
     2a0: e28db000  add  fp, sp, #0
     2a4: e24dd014  sub  sp, sp, #20
     2a8: e1a03000  mov  r3, r0
     2ac: e54b300d  strb r3, [fp, #-13]
     2b0: e3a03902  mov  r3, #32768     ; 0x8000
     2b4: e34031c2  movt r3, #450       ; 0x1c2
     2b8: e50b3008  str  r3, [fp, #-8]
     2bc: e320f000  nop  {0}
     2c0: e51b3008  ldr  r3, [fp, #-8]
     2c4: e5933014  ldr  r3, [r3, #20]
     2c8: e2033040  and  r3, r3, #64    ; 0x40
     2cc: e3530000  cmp  r3, #0
     2d0: 0afffffa  beq  2c0
     2d4: e55b200d  ldrb r2, [fp, #-13]
     2d8: e51b3008  ldr  r3, [fp, #-8]
     2dc: e5832000  str  r2, [r3]
     2e0: e320f000  nop  {0}
     2e4: e24bd000  sub  sp, fp, #0
     2e8: e49db004  pop  {fp}           ; (ldr fp, [sp], #4)
     2ec: e12fff1e  bx   lr

(2)

    0000029c:
     29c: e52db004  push {fp}           ; (str fp, [sp, #-4]!)
     2a0: e28db000  add  fp, sp, #0
     2a4: e24dd00c  sub  sp, sp, #12
     2a8: e1a03000  mov  r3, r0
     2ac: e54b3005  strb r3, [fp, #-5]
     2b0: e320f000  nop  {0}
     2b4: e3083014  movw r3, #32788     ; 0x8014
     2b8: e34031c2  movt r3, #450       ; 0x1c2
     2bc: e5933000  ldr  r3, [r3]
     2c0: e2033040  and  r3, r3, #64    ; 0x40
     2c4: e3530000  cmp  r3, #0
     2c8: 0afffff9  beq  2b4
     2cc: e3a03902  mov  r3, #32768     ; 0x8000
     2d0: e34031c2  movt r3, #450       ; 0x1c2
     2d4: e55b2005  ldrb r2, [fp, #-5]
     2d8: e5832000  str  r2, [r3]
     2dc: e320f000  nop  {0}
     2e0: e24bd000  sub  sp, fp, #0
     2e4: e49db004  pop  {fp}           ; (ldr fp, [sp], #4)
     2e8: e12fff1e  bx   lr

(3)

    0000029c:
     29c: e52db004  push {fp}           ; (str fp, [sp, #-4]!)
     2a0: e28db000  add  fp, sp, #0
     2a4: e24dd00c  sub  sp, sp, #12
     2a8: e1a03000  mov  r3, r0
     2ac: e54b3005  strb r3, [fp, #-5]
     2b0: e320f000  nop  {0}
     2b4: e3a03902  mov  r3, #32768     ; 0x8000
     2b8: e34031c2  movt r3, #450       ; 0x1c2
     2bc: e5933014  ldr  r3, [r3, #20]
     2c0: e2033040  and  r3, r3, #64    ; 0x40
     2c4: e3530000  cmp  r3, #0
     2c8: 0afffff9  beq  2b4
     2cc: e3a03902  mov  r3, #32768     ; 0x8000
     2d0: e34031c2  movt r3, #450       ; 0x1c2
     2d4: e55b2005  ldrb r2, [fp, #-5]
     2d8: e5832000  str  r2, [r3]
     2dc: e320f000  nop  {0}
     2e0: e24bd000  sub  sp, fp, #0
     2e4: e49db004  pop  {fp}           ; (ldr fp, [sp], #4)
     2e8: e12fff1e  bx   lr

Conclusions? The code from each of these is very similar. What in the world are the "nop" instructions all about?? The compiler is generating a 32 bit pointer to the device register from two 16 bit pieces in each case (and even repeating the process in the loop rather than holding the constructed item in a register). The only real difference is that case (1) gets penalized because the compiler feels obligated to save what looks like a local variable on the stack.
So let us repeat this with "gcc -O" and see what we get.
(1)

    00000114:
     114: e3a02902  mov  r2, #32768     ; 0x8000
     118: e34021c2  movt r2, #450       ; 0x1c2
     11c: e5923014  ldr  r3, [r2, #20]
     120: e3130040  tst  r3, #64        ; 0x40
     124: 0afffffc  beq  11c
     128: e3a03902  mov  r3, #32768     ; 0x8000
     12c: e34031c2  movt r3, #450       ; 0x1c2
     130: e5830000  str  r0, [r3]
     134: e12fff1e  bx   lr

(2)

    00000114:
     114: e3a02902  mov  r2, #32768     ; 0x8000
     118: e34021c2  movt r2, #450       ; 0x1c2
     11c: e5923014  ldr  r3, [r2, #20]
     120: e3130040  tst  r3, #64        ; 0x40
     124: 0afffffc  beq  11c
     128: e3a03902  mov  r3, #32768     ; 0x8000
     12c: e34031c2  movt r3, #450       ; 0x1c2
     130: e5830000  str  r0, [r3]
     134: e12fff1e  bx   lr

(3)

    00000114:
     114: e3a02902  mov  r2, #32768     ; 0x8000
     118: e34021c2  movt r2, #450       ; 0x1c2
     11c: e5923014  ldr  r3, [r2, #20]
     120: e3130040  tst  r3, #64        ; 0x40
     124: 0afffffc  beq  11c
     128: e3a03902  mov  r3, #32768     ; 0x8000
     12c: e34031c2  movt r3, #450       ; 0x1c2
     130: e5830000  str  r0, [r3]
     134: e12fff1e  bx   lr

Conclusions? This is much, much better. It is also hilarious, because with optimization the code is identical in all three cases. What is surprising is that the compiler could do better. It already has the pointer value it needs in r2 and does not need to construct it again as it does.
The compiler can do better, and it does if we run it as gcc -O2 as follows:
(1)

    0000010c:
     10c: e3a02902  mov  r2, #32768     ; 0x8000
     110: e34021c2  movt r2, #450       ; 0x1c2
     114: e5923014  ldr  r3, [r2, #20]
     118: e3130040  tst  r3, #64        ; 0x40
     11c: 0afffffc  beq  114
     120: e5820000  str  r0, [r2]
     124: e12fff1e  bx   lr

(2)

    0000010c:
     10c: e3a02902  mov  r2, #32768     ; 0x8000
     110: e34021c2  movt r2, #450       ; 0x1c2
     114: e5923014  ldr  r3, [r2, #20]
     118: e3130040  tst  r3, #64        ; 0x40
     11c: 0afffffc  beq  114
     120: e5820000  str  r0, [r2]
     124: e12fff1e  bx   lr

(3)

    0000010c:
     10c: e3a02902  mov  r2, #32768     ; 0x8000
     110: e34021c2  movt r2, #450       ; 0x1c2
     114: e5923014  ldr  r3, [r2, #20]
     118: e3130040  tst  r3, #64        ; 0x40
     11c: 0afffffc  beq  114
     120: e5820000  str  r0, [r2]
     124: e12fff1e  bx   lr

Again, all three routines generate identical code. I have no complaints about this code. I trust that it is faster to generate the 32 bit constant from two 16 bit pieces held in immediate values than to fetch it from memory in a single instruction.
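For readers unfamiliar with the mov/movt pair, the arithmetic behind it can be sketched in plain C. This is only a model of what the two instructions do to the register, not anything the compiler emits: the function names `movt` and `build_uart0_base` are hypothetical, chosen to mirror the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Model of ARM "movt": replace the upper 16 bits of a register
 * with an immediate, leaving the lower 16 bits untouched. */
uint32_t movt(uint32_t reg, uint16_t imm16)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}

/* Rebuild the UART base address the way the listing does. */
uint32_t build_uart0_base(void)
{
    uint32_t r2 = 0x8000;    /* mov  r2, #32768  (low half)  */
    r2 = movt(r2, 0x1c2);    /* movt r2, #450    (high half) */
    return r2;               /* 0x01C28000 */
}
```

With the base held in r2, the status register is reached as base plus 20 (0x01C28014) and the data register as base plus 0, which is exactly the `[r2, #20]` and `[r2]` addressing seen in the -O2 listing.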
I will note that my motivation for using a base pointer and a template is an expectation that the compiler will load that pointer into a register and then repeatedly use it with offsets, as indeed the optimized code does.
Tom's electronics pages / tom@mmto.org