September 12, 2024
Note that in a multicore chip with 64 bit ARM (ARMv8), each ARM core will have its own NEON unit. This is true even of a "little" core in a big/little multicore system.
NEON is a standard part of aarch64 (64 bit ARM), so if you have ARMv8 you have NEON whether you like it or not.
In addition, gcc will generate code that makes good use of NEON, apparently without any special effort on your part. It should "just work", but you may want to read all the fine print.
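As a rough illustration of the kind of code a compiler can turn into NEON instructions, here is a plain C loop over float arrays. The function name `add_arrays` is made up for this sketch. Whether gcc actually emits vector instructions for it depends on the gcc version and optimization level, so check the generated assembly rather than take this on faith.

```c
#include <stddef.h>

/* Hypothetical example: add two float arrays element by element.
 * The restrict qualifiers promise the compiler that the arrays do not
 * overlap, which makes the loop an easier target for SIMD code. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Nothing in the source mentions NEON at all. Any use of the vector unit here is entirely up to the compiler.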
There are 32 floating point registers, v0 to v31, each 128 bits wide. Scalar floating point and NEON share this one set of registers.
I wrote some simple floating point code in C and disassembled it, finding the following for "float" manipulations. Note that it uses "sN" registers. Also note that it uses the ordinary "str" and "ldr" instructions to move these to and from memory.
    fmov    s31, #2.000000000000000000e+00
    str     s31, [sp, #28]
    ldr     s30, [sp, #28]
    fmov    s31, #1.000000000000000000e+00
    fadd    s31, s30, s31

I changed the "float" to "double" and now I see this:

    fmov    d31, #2.000000000000000000e+00
    str     d31, [sp, #24]
    ldr     d30, [sp, #24]
    fmov    d31, #1.000000000000000000e+00
    fadd    d31, d30, d31
    str     d31, [sp, #24]

So, we use s13 for 32 bit single precision registers and d13 for a 64 bit double precision register. Remember that s13 and d13 are the same register, we are just using it in different ways.
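The C source behind these listings is not shown. As a guess only, unoptimized code along the following lines would plausibly compile to something similar. The function names `add_one_float` and `add_one_double` are made up for illustration, and the exact registers and stack offsets will vary with the compiler version.

```c
/* Hypothetical reconstruction, NOT the original test program.
 * Compile without optimization (for example gcc -O0 -S) so that the
 * local variable is kept on the stack and the store and load remain. */
float add_one_float(void)
{
    float x = 2.0f;   /* constant into an sN register, then stored */
    x = x + 1.0f;     /* loaded back, added with fadd on sN registers */
    return x;
}

double add_one_double(void)
{
    double x = 2.0;   /* same pattern, but with dN registers */
    x = x + 1.0;
    return x;
}
```

With optimization turned on, the compiler is free to fold the arithmetic into a single constant, so the store, load and fadd may disappear entirely.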
What happened to the "v13" register? We have a series of letters to use for register names depending on how we want to access the floating point registers:
    Bn - for byte access (8 bits) -- don't ask me what use this has.
    Hn - for 16 bit access -- there is a 16 bit half precision float
    Sn - for 32 bit access -- float
    Dn - for 64 bit access -- double
    Qn - for 128 bit access -- "quad" word, but no 128 bit floating point ops

Why did we ever even mention Vn? Those names are only used for SIMD operations. One way this works is that we have 4 "lanes", each 32 bits wide, and we operate on 4 values at once in a 128 bit Vn register. Here are two examples:
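    fadd    v0.2d, v5.2d, v6.2d

In this example we are doing two 64 bit double additions in parallel. Each of the two lanes of v5 is added to the matching lane of v6, and the two results go into v0.

    fmul    v1.4s, v5.4s, v3.s[2]

Here v5 has 4 single precision (32 bit) elements, and we multiply each of them by one selected element of v3 (element 2), putting the 4 results into v1.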
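To spell out what those two instructions compute, here is a scalar C model of each one. This is only a sketch of the lane arithmetic, with made-up function names `fadd_2d` and `fmul_4s_by_element`. It does not use NEON itself, and the real instructions do all lanes at once rather than in a loop.

```c
/* Model of: fadd v0.2d, v5.2d, v6.2d
 * Two double lanes, each added to the matching lane. */
void fadd_2d(double dst[2], const double a[2], const double b[2])
{
    for (int lane = 0; lane < 2; lane++)
        dst[lane] = a[lane] + b[lane];
}

/* Model of: fmul v1.4s, v5.4s, v3.s[idx]
 * Four float lanes, each multiplied by one chosen element of b. */
void fmul_4s_by_element(float dst[4], const float a[4],
                        const float b[4], int idx)
{
    float scalar = b[idx];   /* read first, so dst may be the same array as b */
    for (int lane = 0; lane < 4; lane++)
        dst[lane] = a[lane] * scalar;
}
```

For example, with a = {1, 2, 3, 4}, b = {10, 20, 30, 40} and idx = 2, every lane of a is multiplied by 30.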
Once again, just to be clear, v13, d13, and s13 are all the same register, just being used in different ways.
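The "same register, different views" idea can be modeled in portable C. This is a sketch only: the type `vreg_t` and the helper names are made up, it assumes IEEE 754 float and double, and it imitates the register in ordinary memory rather than touching real hardware. One detail it does copy from aarch64 is that a scalar write to an Sn or Dn name clears the upper bits of the 128 bit register.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128 bit Vn register as two 64 bit halves. */
typedef struct { uint64_t lo, hi; } vreg_t;

/* Like writing Dn: the double fills the low 64 bits, the rest is zeroed. */
void write_d(vreg_t *v, double x)
{
    memcpy(&v->lo, &x, sizeof x);
    v->hi = 0;
}

/* Like writing Sn: the float fills the low 32 bits, the rest is zeroed. */
void write_s(vreg_t *v, float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof x);
    v->lo = bits;
    v->hi = 0;
}

/* Like reading Dn: interpret the low 64 bits as a double. */
double read_d(const vreg_t *v)
{
    double x;
    memcpy(&x, &v->lo, sizeof x);
    return x;
}

/* Like reading Sn: interpret the low 32 bits as a float. */
float read_s(const vreg_t *v)
{
    uint32_t bits = (uint32_t)v->lo;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Writing 2.0 with `write_d` and then reading with `read_s` gives 0.0, because the low 32 bits of the double 2.0 are all zero. The bits are shared between the two views, but no conversion happens between them.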
So we can have 2 lanes (double), 4 lanes (float) or 8 lanes (half-float). You don't believe me that there is half float?
Tom's Computer Info / tom@mmto.org