January 18, 2023

ARM Processor 101 -- Cortex cores

I may have mentioned already that this "Cortex" business is ARM marketing lingo. It needs to be translated into useful information. In particular: is it a 32 bit or 64 bit device, and exactly which ARM architecture is inside? I have never seen anything ancient enough to be ARMv6 or earlier. My parts are all v7 or v8, but each of those has variants.

Once you know the ISA (instruction set architecture) you are most of the way there, but each chip maker often gets to make choices about how big the L2 cache is, whether or not the chip has floating point (and what kind or kinds), and what mode the chip starts up in.
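One way to do the "translation" is to look at the part number the core reports about itself in its Main ID Register (MIDR). Here is a minimal sketch in Python of decoding a MIDR value and looking up the Cortex name. The function names and the table are my own illustration, the table only covers the cores mentioned on this page, and the sample value is just a typical A53 reading, not something measured for this article.

```python
# Sketch: split a MIDR value into fields and map the part number to a core.
# Field layout: implementer [31:24], variant [23:20], part [15:4], revision [3:0].

ARM_IMPLEMENTER = 0x41  # ASCII 'A', used by cores designed by ARM itself

# part number -> (marketing name, architecture, widest register size in bits)
# Hypothetical helper table, limited to cores discussed in this article.
CORTEX_PARTS = {
    0xC07: ("Cortex-A7", "ARMv7-A", 32),
    0xC08: ("Cortex-A8", "ARMv7-A", 32),
    0xC09: ("Cortex-A9", "ARMv7-A", 32),
    0xD03: ("Cortex-A53", "ARMv8-A", 64),
    0xD05: ("Cortex-A55", "ARMv8.2-A", 64),
    0xD08: ("Cortex-A72", "ARMv8-A", 64),
    0xD0B: ("Cortex-A76", "ARMv8.2-A", 64),
    0xD44: ("Cortex-X1", "ARMv8.2-A", 64),
}


def decode_midr(midr):
    """Return the MIDR fields as a dictionary of integers."""
    return {
        "implementer": (midr >> 24) & 0xFF,
        "variant": (midr >> 20) & 0xF,
        "part": (midr >> 4) & 0xFFF,
        "revision": midr & 0xF,
    }


def identify(midr):
    """Return (name, architecture, bits), or None if the core is not in the table."""
    fields = decode_midr(midr)
    if fields["implementer"] != ARM_IMPLEMENTER:
        return None
    return CORTEX_PARTS.get(fields["part"])


if __name__ == "__main__":
    sample = 0x410FD034  # assumed example value (a Cortex-A53)
    print(decode_midr(sample))
    print(identify(sample))
```

On a running Linux system the same part number shows up as "CPU part" in /proc/cpuinfo, so the table can be used there as well.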

32 bit cores

"Cortex A7" -- My Orange Pi board with the Allwinner H3 chip has a "Cortex-A7" processor. The innards are an ARMv7-A core (4 of them). Each core has 32K of I cache, and 32K of D cache (these are L1). There is also 512K of L2 cache that all the cores share. It has an 8 stage dual issue pipeline, 28nm process.

"Cortex A9" -- My Xilinx Zynq chip has a "Cortex-A9" processor (two of them). They run at 667 Mhz. This is also an ARMv7-A architecture. Each core has 32K of I cache, and 32K of D cache (these are L1). There is at least 128K of L2 cache shared by the cores. The actual amount may vary and is supposed to be 512K in the Zynq. It has a "dual issue partly out of order 8 stage pipeline" and typically gets 50 percent better performance than the A8. They claim 2.5 DMIPS per Mhz. Zynq uses 20nm process.

"Cortex A8" -- The BBB (beaglebone black) has a "Cortex-A8" (single core) inside the AM3358 chip. It has 32k/32k of I/D L1 cache along with 256K of L2 cache. It is also an ARMv7-A architecture chip. I am reading that the Cortex-A8 came out in 2005. Golly, that makes it nearly 20 years old. That is quite a long time in the world of computer technology.

The A8 has a 13 stage dual issue pipeline, and TI builds the AM3358 on a 45nm process.

A quick note about 512K of L2 cache. Currently the executable image for my Kyu operating system is 240K. This will easily fit in its entirety into 512K of L2 cache and even just barely fits into the 256K of cache on the BBB. Of course some of that cache gets used for data, but this puts the cache size in a certain perspective.
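To put numbers on that, here is a tiny sketch of the arithmetic. The function name is made up for illustration, and it deliberately ignores the fact that data competes with code for the same cache.

```python
# Sketch: how much of an L2 cache would a 240K image occupy?

def l2_occupancy(image_kib, l2_kib):
    """Return (percent of L2 used by the image, KiB left over)."""
    percent = 100.0 * image_kib / l2_kib
    leftover = l2_kib - image_kib
    return percent, leftover


if __name__ == "__main__":
    kyu_image_kib = 240  # size of the Kyu image quoted above
    for label, l2_kib in (("512K L2", 512), ("256K L2 (BBB)", 256)):
        percent, leftover = l2_occupancy(kyu_image_kib, l2_kib)
        print(label, round(percent), "percent used,", leftover, "K left over")
```

So the image takes a bit under half of a 512K L2, while on the BBB it takes nearly all of the 256K, with only 16K to spare.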

The fact that all of these are ARMv7-A architecture means that what I learn about the MMU and cache will apply in the same way to each of them. It may be important to verify cache line sizes, and this is information that can be read from the chip itself (ARMv7-A provides a Cache Type Register in CP15 for this purpose). A sensible person might ask what the difference is between the A7, A8, and A9 if they are all ARMv7-A devices. The difference may be in the number of transistors, number of execution units, pipelines and such, but the instruction set presented by each is the same. Though one may run faster or cooler or take up less area in silicon, they all act the same from a programmer's point of view.
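As a sketch of what "interrogating the chip" looks like: on ARMv7-A the Cache Type Register (CTR) is read with "mrc p15, 0, Rt, c0, c0, 1", and two 4 bit fields give the smallest I and D cache line sizes as log2 of a count of 32 bit words. The Python below only shows the decoding step. The function name is mine, and the two sample values are illustrative readings that I have not verified on my own boards, so read the register on your own hardware before trusting any number.

```python
# Sketch: decode the minimum cache line sizes from an ARMv7-A CTR value.
# IminLine is bits [3:0], DminLine is bits [19:16]; each is log2(words),
# and a word is 4 bytes, so the line size in bytes is 4 << field.

def ctr_line_sizes(ctr):
    """Return (smallest I-cache line, smallest D-cache line) in bytes."""
    imin_line = ctr & 0xF
    dmin_line = (ctr >> 16) & 0xF
    return 4 << imin_line, 4 << dmin_line


if __name__ == "__main__":
    # Assumed example values, for illustration only.
    for ctr in (0x84448003, 0x83338003):
        i_bytes, d_bytes = ctr_line_sizes(ctr)
        print(hex(ctr), "I line:", i_bytes, "bytes, D line:", d_bytes, "bytes")
```

The point of checking is that cache maintenance loops (clean, invalidate) step through memory one line at a time, so hard coding the wrong line size on one of these chips would skip lines or do extra work.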

64 bit cores

These are all ARMv8.2-A (except for the A53 and A72 which are ARMv8-A).
Note that ARM announced ARMv9 in 2021, which is v8 with security and vector processing extensions.

"Cortex A53" - I have Orange Pi boards with the Allwinner H5 chip that have a quad core A53. This is ARM64 (ARMv8-A) with 32/32K of I/D L1 cache and a 512K unified L2 cache. I also have a NanoPi Fire3 board with the Samsung s5p6818 chip. This has 8 cores (in 2 groups of 4), in essence is has two of the 4 core groups that the H5 has. Each 4 core group has a 512K L2 cache (so there is a grand total of 1M of L2, but I don't know what goes on as far as synchronizing the two). I also have the Rockchip RK3328, which has four A53 cores just like the Allwinner H5.

"Cortex A72" - I have a Rockchip RK3399 based board. The chip contains two A72 cores and four A53 cores. The 4 core A53 unit is just like the four core unit discussed above. The A72 cores are in a two core unit and are bigger and faster. They have 48K of I cache and 32K of D cache in L1 for each core, as well as 1024K of L2 cache for the pair.

"Cortex A76" - The Rockchip RK3588 is an 8 core chip with 4 A76 cores and 4 A55 cores. The A76 cores have a 64/64K I/D cache along with 2M of L2 cache (512K for each processor). There is also 3M of L3 cache shared by all 8 processors. Can run at up to 2.4 Ghz.

"Cortex A55" - This is the 4 core "cluster" of small cores in the RK3588 mentioned just above. These have 32/32K of I/D L1 cache along with 128K of L2 cache for each processor.

I am getting distracted by the RK3588 for the moment (I don't yet have one). They also throw in 3 Cortex M0 cores "just for fun", so you are really getting 11 ARM cores with one of these. These M0 cores are probably active when the chip is in low power modes. The RK3588 is available on the Orange Pi 5 with 4, 8 or 16G of RAM -- you need an M.2 SSD to make good use of it. So I would pay $100 along with $25 or so for an M.2 SSD.
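The core counts and L2 sizes quoted in this section are easy to lose track of, so here is a small sketch that tallies them per chip. The class and function names are my own invention, the numbers are simply the ones given in the text above, and the M0 cores are entered with zero L2 only because nothing is said about a cache for them.

```python
# Sketch: tally cores and L2 cache per SoC from the figures quoted above.
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    core: str     # core type, e.g. "Cortex-A53"
    count: int    # number of cores in the cluster
    l2_kib: int   # total L2 for the whole cluster, in KiB


SOCS = {
    "s5p6818": [Cluster("Cortex-A53", 4, 512), Cluster("Cortex-A53", 4, 512)],
    "RK3399": [Cluster("Cortex-A72", 2, 1024), Cluster("Cortex-A53", 4, 512)],
    "RK3588": [
        Cluster("Cortex-A76", 4, 4 * 512),  # 512K per core
        Cluster("Cortex-A55", 4, 4 * 128),  # 128K per core
        Cluster("Cortex-M0", 3, 0),         # assumption: no L2 mentioned
    ],
}


def total_cores(clusters):
    """Count every ARM core on the chip, including the little helpers."""
    return sum(c.count for c in clusters)


def total_l2_kib(clusters):
    """Add up the L2 cache across all clusters, in KiB."""
    return sum(c.l2_kib for c in clusters)


if __name__ == "__main__":
    for name, clusters in SOCS.items():
        print(name, total_cores(clusters), "cores,", total_l2_kib(clusters), "K of L2")
```

For the RK3588 this gives the 11 cores mentioned above, and 2560K of L2 before counting the 3M of shared L3.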

Google Pixel 6 "tensor"

Getting even further distracted, I began wondering just what ARM cores are in my Pixel 6 phone. The interesting thing is the Cortex-X1 core (there are two of them). It is 5 way superscalar and uses something called a 3K "macro-OP (MOP)" cache. It can fetch 5 instructions and 8 MOPs per cycle. It has a 64/64K I/D L1 cache and up to 1M per core of L2 cache.

The X1 is an improved design based on the Cortex-A78 with the design optimized entirely for performance.

Big - Little

The observant reader will have noticed the mix of core types on the Rockchip units. Let's consider what is different between an A76 and an A55 core.

First of all, they say that the A55 is a more power efficient replacement for the A53.

The A75 has a 3 way superscalar design (the A73 was 2 way). The A75 has 7 execution units as well as two load/store units. The A75 fetches 4 instructions per cycle and has an exclusive L2 cache.

The A76 has a 4 way superscalar design (so it can execute 4 instructions in parallel).

Both the A53 and A55 have a "short" 8 stage pipeline.
The A76 has a 13 stage pipeline.


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org