September 12, 2024

NEON

I have ignored this up until now. NEON is a SIMD vector unit that is part of many ARM processors. It is part of the Cortex-A processors that I often work with. I have ignored it, along with normal floating point, but recently got curious.

Note that in a multicore chip with 64 bit ARM (ARMv8), each ARM core will have its own NEON unit. This is even true of a "little" core in a big/little multicore system.

NEON is a standard part of aarch64 (64 bit ARM), so if you have ARMv8 you have NEON whether you like it or not.

In addition, gcc will generate code that makes good use of the NEON, apparently without any special effort on your part -- but you may want to read all the fine print. It should "just work".
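As an illustration, here is a minimal sketch of the kind of loop a compiler can turn into NEON instructions. The function name add_arrays is made up for this example. It also assumes that optimization is turned on (something like -O3). Whether a particular gcc version actually vectorizes it is best checked in the disassembly.

```c
#include <stddef.h>

/* Hypothetical example: element-wise sum of two float arrays.
 * "restrict" promises the compiler that the arrays do not overlap,
 * which makes it easier to process several elements per instruction.
 */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling with something like "gcc -O3 -S" and looking for operands such as v0.4s is one way to see whether NEON was used.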

ARM registers

In a 64 bit ARM, an instruction can select one of 32 integer registers. You have x0 through x30 as 64 bit registers. You can access the low 32 bits of the very same registers via the names w0 through w30. (Note that on a 32 bit ARM chip, you have 16 32 bit integer registers, r0 through r15.) So in truth you only have 31 general purpose integer registers. Register number 31 is special, acting as either the stack pointer (sp) or as an "always zero" register (xzr or wzr), according to the instruction.

Floating point and SIMD share a separate set of 32 registers, v0 to v31, each 128 bits wide. Scalar floating point uses the low bits of these registers.

I wrote some simple floating point code in C and disassembled it, finding the following for "float" manipulations. Note that it uses "sN" registers. Also note that it uses the ordinary "str" and "ldr" instructions to move these to and from memory.

	fmov    s31, #2.000000000000000000e+00
	str     s31, [sp, #28]
	ldr     s30, [sp, #28]
	fmov    s31, #1.000000000000000000e+00
	fadd    s31, s30, s31

I changed the "float" to "double" and now I see this:

	fmov    d31, #2.000000000000000000e+00
	str     d31, [sp, #24]
	ldr     d30, [sp, #24]
	fmov    d31, #1.000000000000000000e+00
	fadd    d31, d30, d31
	str     d31, [sp, #24]

So, we use s13 for 32 bit single precision registers and d13 for a 64 bit double precision register. Remember that s13 and d13 are the same register, we are just using it in different ways.
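For reference, a C fragment along the following lines would be expected to produce similar disassembly. This is a sketch, not the actual test program. The function names and the use of "volatile" are assumptions made for illustration.

```c
/* Hypothetical reproduction, not the original test program.
 * "volatile" is an assumption: it forces the value to be stored to
 * the stack and loaded back, as the str/ldr pairs in the listings do.
 */
float bump_float(void)
{
    volatile float x = 2.0f;
    x = x + 1.0f;       /* expect an fadd on sN registers */
    return x;
}

double bump_double(void)
{
    volatile double x = 2.0;
    x = x + 1.0;        /* expect an fadd on dN registers */
    return x;
}
```

The only difference between the two functions is the type, which is what selects the sN or dN view of the register.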

What happened to the "v13" register? We have a series of letters to use for register names depending on how we want to access the floating point registers:

Bn - for byte access (8 bits) -- don't ask me what use this has.
Hn - for 16 bit access -- there is a 16 bit half precision float
Sn - for 32 bit access -- float
Dn - for 64 bit access -- double
Qn - for 128 bit access -- "quad" word, but no 128 bit floating point ops

Why did we ever even mention Vn? Those names are only used for SIMD operations. One way this works is that we have 4 "lanes", each 32 bits wide, and we operate on 4 values at once in a 128 bit Vn register. Here are two examples:

	fadd    v0.2d, v5.2d, v6.2d

In this example we are doing two 64 bit double additions in parallel: each of the 2 lanes of v5 is added to the matching lane of v6, and the 2 results go into v0.

	fmul    v1.4s, v5.4s, v3.s[2]

Here v5 holds 4 single precision (32 bit) elements. Each of them is multiplied by one selected element of v3 (element 2), putting the 4 results into v1.
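To spell out what those two instructions compute, here is a plain C model of each. This is only a sketch of the arithmetic: the function names are made up, and the loops stand in for what the hardware does across all lanes in a single instruction.

```c
/* Plain C model of:  fadd v0.2d, v5.2d, v6.2d
 * Two 64 bit lanes, each added independently.
 */
void fadd_2d(double out[2], const double a[2], const double b[2])
{
    for (int lane = 0; lane < 2; lane++)
        out[lane] = a[lane] + b[lane];
}

/* Plain C model of:  fmul v1.4s, v5.4s, v3.s[2]
 * Four 32 bit lanes of a, each multiplied by one selected element of b.
 * The index argument plays the role of the [2] in the instruction.
 */
void fmul_4s_by_element(float out[4], const float a[4],
                        const float b[4], int index)
{
    float scalar = b[index];    /* read once, before writing any result */
    for (int lane = 0; lane < 4; lane++)
        out[lane] = a[lane] * scalar;
}
```

Calling fmul_4s_by_element(out, a, b, 2) corresponds to the second instruction, with a standing in for v5, b for v3, and out for v1.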

Once again, just to be clear, v13, d13, and s13 are all the same register, just being used in different ways.
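One way to picture the different views of the same 128 bits is a C union. This is just a mental model with made up names (vreg_model, lanes_for), not anything from a real header. It assumes the usual 32 bit float and 64 bit double.

```c
#include <stdint.h>

/* Hypothetical model of one 128 bit Vn register; not a real API. */
typedef union {
    uint8_t  b[16];     /* 16 lanes of 8 bits  */
    uint16_t h[8];      /*  8 lanes of 16 bits */
    float    s[4];      /*  4 lanes of 32 bits */
    double   d[2];      /*  2 lanes of 64 bits */
} vreg_model;

/* Number of lanes in a 128 bit register for a given element width. */
static inline int lanes_for(int element_bits)
{
    return 128 / element_bits;
}
```

In this picture the scalar names Sn and Dn refer to the low bits of the register, which corresponds to s[0] and d[0] on a little endian machine.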

So we can have 2 lanes (double), 4 lanes (float) or 8 lanes (half-float). You don't believe me that there is half float?


Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org