May 12, 2017

ARM floating point architecture

Many ARM processors these days include one or more floating point units. The Orange Pi I am working with has four ARM Cortex-A7 cores. The popular Raspberry Pi 2 also uses Cortex-A7 cores. The Cortex-A7 has floating point hardware of the following sort:

The BBB (beaglebone black) has a single ARM Cortex-A8 core, which has:

I am going to ignore NEON in this write-up. It is a SIMD floating point device that targets multi-media applications. It supports only single precision floating point, and has no divide.

The VFP is an IEEE-754 compatible (with a few caveats) floating point unit. The "V" for vector was an early feature that was quickly dropped, and now the letter sticks around just to foster confusion.

Note that Cortex-A8 is not necessarily better than Cortex-A7. Both of these implement the ARMv7 instruction-set architecture, but the Cortex-A7 has an integer divide instruction and the Cortex-A8 does not.

This is as good a time as any to discuss the muddle of terminology that ARM has produced. A good way to try to keep your mind straight is to realize that on one hand we have the "architecture", i.e. ARMv7, while on the other hand we have "marketing names" like Cortex-A8, with absolutely no obvious relationship between the labels used in the two worlds.

This is as good a place as any to discuss the issue of documentation. There are two printed books entitled the "ARM Architectural Reference Manual". These are fossils, though still of use with some care. The original book, edited by Dave Jaggar (1996) covers through ARMv4 (and is over 20 years old). The second edition, edited by David Seal (2000) covers through ARMv5, and is also plenty old. After these two books, they gave up trying to produce printed manuals, ARM variants proliferated, and you absolutely must find the manual for your ARM variant online and study it for the last word on any specific details. The ARMv7-A reference manual I currently use is 2734 pages.

That being said, the second edition book has a nice section on VFPv1 that provides a good starting point.

Useful references

The Wikipedia article has nice overviews of the ARM floating point facilities.

Getting the compiler in the mood for ARM floating point

I have been using gcc for years to compile C code for the ARM, but have never done a thing with floating point. I currently use the 6.1.1 version of the compiler (not that it particularly matters). I use the compiler from the Fedora repositories, not that that should matter either. I use it with a long list of options, including:
-marm -march=armv7-a -msoft-float
Naturally the "soft-float" option prevents any hardware floating point code from being generated. To find out what target dependent options are available, you can reference the link above, or do something like this:
arm-linux-gnu-gcc --target-help
This will yield several screens of options. What did the trick for me was to change my options to:
-marm -march=armv7-a -mfpu=vfpv4
This generates floating point instructions, but running it yields an undefined instruction exception when it encounters the first floating point instruction, namely this:
vldr    s13, [r3]
So, the floating point unit itself needs to be enabled before you can use it.
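
If you are not sure which floating point mode a given set of options selects, one way to check is to look at the macros the compiler predefines. What follows is only a sketch: fp_target() is a made-up name, and it assumes the GCC predefines __SOFTFP__ and __ARM_PCS_VFP behave as I understand them (soft float, and hard-float calling convention, respectively). On a non-ARM host none of them are defined.

```c
/* Report which floating point mode the compiler was told to target.
 * Assumption: __SOFTFP__ is defined by gcc when no FP instructions
 * will be generated, and __ARM_PCS_VFP when arguments are passed
 * in VFP registers.  fp_target() is a hypothetical helper name. */
static const char *
fp_target ( void )
{
#if !defined(__arm__)
        return "not a 32-bit ARM target";
#elif defined(__SOFTFP__)
        return "soft float, no FP instructions";
#elif defined(__ARM_PCS_VFP)
        return "hardware FP, arguments passed in FP registers";
#else
        return "hardware FP, arguments passed in integer registers";
#endif
}
```

Printing the string from fp_target() early in startup code is a cheap way to confirm that a change of options actually took effect.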

Getting the processor in the mood for ARM floating point

ARM floating point is handled by a pair of coprocessors, specifically numbers 10 and 11. Usually coprocessor 10 handles single precision and coprocessor 11 handles double precision.

The processor comes up with the floating point coprocessors disabled. My old friend David Welch suggests the following code:

mrc p15, 0, r0, c1, c0, 2
orr r0, r0, #0x300000 @ single precision
orr r0, r0, #0xC00000 @ double precision
mcr p15, 0, r0, c1, c0, 2

mov r0, #0x40000000
fmxr fpexc,r0
The manuals recommend an ISB (instruction synchronization barrier) after the "mcr" instruction.

The "mrc" and "mcr" instructions are accessing the "Coprocessor access control register" and setting the bits to enable both single and double precision floating point. Supposedly bad things happen if you enable one and not the other.
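
The magic numbers in the code above are easier to remember once you see where they come from. The following is a small sketch in C that just does the arithmetic; the function and macro names are made up for illustration. It assumes the layout used above: each coprocessor n owns a two bit field starting at bit 2n of the access control register, a value of 3 in that field grants full access, and the enable bit in FPEXC is bit 30.

```c
#include <stdint.h>

/* Value 3 (binary 11) in a coprocessor's field means full access. */
#define CPACR_FULL_ACCESS       3u

/* FPEXC enable bit, the 0x40000000 loaded before the fmxr. */
#define FPEXC_EN                (1u << 30)

/* Mask granting full access to coprocessor "cp" (hypothetical helper). */
static uint32_t
cpacr_full_access ( unsigned cp )
{
        return CPACR_FULL_ACCESS << (2 * cp);
}

/* What the two "orr" instructions do to the value read by "mrc". */
static uint32_t
cpacr_enable_vfp ( uint32_t cpacr )
{
        return cpacr | cpacr_full_access ( 10 ) | cpacr_full_access ( 11 );
}
```

So cpacr_full_access(10) gives 0x300000 and cpacr_full_access(11) gives 0xC00000, which are the two constants in the assembly.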

The ARM online documentation is set up in such a way that it is impossible to get URLs for pages within it. So search for the "C1" register list and then look at the details for the Coprocessor Access Control Register.

Manuals and documentation

Note that there are entire manuals (albeit very short) dedicated to the floating point units, so if you are going to work with the Cortex-A7, you will want to track those down. In fact you probably want to track down all of the following, and perhaps others I am not yet aware of.
ARM Architecture Reference Manual (ARMv7-A and ARMv7-R edition)   (2734 pages)
Cortex-A7 MPCore Technical Reference Manual                       (268 pages)
Cortex-A7 Floating-Point Unit Technical Reference Manual          (25 pages)
Cortex-A7 NEON Media Processing Engine Technical Reference Manual (26 pages)
Cortex-A Series (Version: 2.0) Programmer’s Guide                 (455 pages)

A bare metal floating point example

Since ARM provides a hardware square root, it seemed like a fine idea to exercise it. I added the following C routines to some bare metal code I have available. Since I don't have %f available in my printf function, I convert the result to a scaled integer for display. Also note that the "w" letter is the trick to indicate one of the single precision registers (s0 to s31) in the vector floating point unit.

It turns out that the "w" constraint letter also works to indicate one of the double precision registers, which is nice. Apparently the compiler is clever enough to get a clue from the type of the variable being translated. If it is a float, it maps to an "s" register, and if it is a double, it maps to a "d" register.

More about these sorts of things in these excellent guides:

Note that the second link digs a bunch of these inline tricks out of the "constraints.md" file, which is in the gcc source tree somewhere. And the comment is made that a lot of this is being kept intentionally secret because the gcc maintainers do not consider it a public interface and may change it at any time.
static int
sqrt_i ( int arg )
{
        float farg = arg;
        float root;

        asm volatile ("vsqrt.f32 %0, %1" : "=w" (root) : "w" (farg) );

        return 10000 * root;
}

void
arm_float ( void )
{
        int val;
        int num = 2;

        val = sqrt_i ( num );
        printf ( "Square root of %d is %d\n", num, val );
}

void
my_main ( void )
{
        fp_enable ();
        arm_float ();
}
When this runs, I get:
Square root of 2 is 14142
Notice the call in the above to fp_enable(). This is a bit of assembly language in my start.S file that looks like this (and which you should recognize from above):
        .globl fp_enable
fp_enable:
        mrc     p15, 0, r0, c1, c0, 2
        orr     r0, r0, #0x300000 @ single precision
        orr     r0, r0, #0xC00000 @ double precision
        mcr     p15, 0, r0, c1, c0, 2
        isb
        mov     r0, #0x40000000
        fmxr    fpexc,r0
        mov     pc, lr
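
The scaled integer trick used in sqrt_i() above can be taken one step further to get something that looks like a decimal number. This is a sketch, not what my code does: scaled_to_str() is a made-up name, and it assumes an snprintf that handles zero padded widths, which a small bare metal printf may not.

```c
#include <stdio.h>

/* Turn a value scaled by 10000 into a string like "1.4142".
 * Hypothetical helper; assumes snprintf supports "%04u". */
static void
scaled_to_str ( char *buf, size_t len, int scaled )
{
        const char *sign = scaled < 0 ? "-" : "";
        /* Work on the magnitude so the fraction digits come out right. */
        unsigned mag = scaled < 0 ? -(unsigned) scaled : (unsigned) scaled;

        snprintf ( buf, len, "%s%u.%04u", sign, mag / 10000, mag % 10000 );
}
```

Handing it the 14142 from the run above yields the string "1.4142".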

ARM VFP architecture

Now that you have seen it in action, let's talk about how it works, like we should have up front.

There are the usual single and double precision entities, but also a 16 bit "half precision". The "vector" in VFP is a bit of a misnomer as the vector operations are now deprecated. I guess they expect you to use the NEON unit if that is your game.
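
To make the 16 bit format concrete, here is a sketch of decoding a half precision value in plain C. It assumes the IEEE 754 layout (1 sign bit, 5 exponent bits with a bias of 15, 10 fraction bits); the function names are made up, and this is software decoding for illustration, not what the hardware conversion instructions do internally.

```c
#include <stdint.h>
#include <math.h>       /* only for the INFINITY and NAN macros */

/* Multiply v by 2 to the power e using exact steps. */
static float
scale_pow2 ( float v, int e )
{
        while ( e > 0 ) { v *= 2.0f; e--; }
        while ( e < 0 ) { v *= 0.5f; e++; }
        return v;
}

/* Decode an IEEE 754 half precision bit pattern (hypothetical helper). */
static float
half_to_float ( uint16_t h )
{
        int sign = (h >> 15) & 1;
        int exp  = (h >> 10) & 0x1f;
        int frac = h & 0x3ff;
        float val;

        if ( exp == 0 )                 /* zero or subnormal */
                val = scale_pow2 ( (float) frac, -24 );
        else if ( exp == 31 )           /* infinity or NaN */
                val = frac ? NAN : INFINITY;
        else                            /* normal, implicit leading 1 */
                val = scale_pow2 ( (float) (frac | 0x400), exp - 15 - 10 );

        return sign ? -val : val;
}
```

With only 10 fraction bits you get about three decimal digits, and the largest finite value (bit pattern 0x7BFF) is 65504.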

The VFP unit provides the usual floating math operations, and includes square root in hardware.

Different implementations of the ARM VFP can have different numbers of registers. The usual situation seems to be that you get 32 single precision registers or 16 double precision registers. Each double precision register sits on top of two single precision registers. The result of setting a double precision register, then accessing one of the underlying single precision registers is undefined - so there is nothing clever of that sort going on. To do a single or double precision add where s1 = s2 + s3 (or d1 = d2 + d3) you do either:

vadd.f32 s1, s2, s3
vadd.f64 d1, d2, d3
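
The way the double precision registers sit on top of the single precision ones can be sketched with a little arithmetic. This assumes the ARMv7 numbering where d<n> overlays s<2n> and s<2n+1>, for the 16 double registers that overlap the 32 single ones; the function names are made up for illustration.

```c
/* Lower and upper single precision register under d<d>, d in 0..15. */
static int d_to_s_low ( int d )  { return 2 * d; }
static int d_to_s_high ( int d ) { return 2 * d + 1; }

/* The double precision register that s<s> is part of, s in 0..31. */
static int s_to_d ( int s ) { return s / 2; }

/* Nonzero if s<s> and d<d> share storage. */
static int
regs_overlap ( int s, int d )
{
        return s_to_d ( s ) == d;
}
```

Note what this means for the two instructions above: d1 shares storage with s2 and s3, so the double precision add writes over the two inputs of the single precision add.
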
Loads and stores from and to memory look like:
vldr s1, [r3]
vldr s2, [r3, #4]
vstr s10, [r4]
The "vmov" instruction can move between regular ARM registers and floating point registers.
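
A move between a floating point register and a regular ARM register copies the bit pattern unchanged; it does not convert the number to an integer. The closest thing in portable C is a memcpy between a float and a 32 bit integer, sketched here with made-up function names and assuming a 32 bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Raw bits of a float, like moving an s register to an ARM register. */
static uint32_t
float_bits ( float f )
{
        uint32_t bits;

        memcpy ( &bits, &f, sizeof bits );
        return bits;
}

/* The other direction: reinterpret 32 bits as a float. */
static float
bits_float ( uint32_t bits )
{
        float f;

        memcpy ( &f, &bits, sizeof f );
        return f;
}
```

For example float_bits(1.0f) is 0x3f800000, not 1. An actual conversion between integer and floating point is a separate operation.
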
Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org