January 2, 2026

Kyu - Aarch64 -- cache related system registers

One of the nicest things about Aarch64 as compared to Aarch32 is the use of named system registers, rather than the miserable and cryptic coprocessor 15 (CP15) scheme used for such things on Aarch32 systems.

MRS

This is the "move to register from system register" instruction.
It reads from a system register, like so:
asm volatile("mrs %0, CurrentEL" : "=r" (val) : : "cc");
Note that the "cc" clobber here is unnecessary (MRS does not change the condition flags), but harmless.

MSR

This is the "move to system register from register" instruction.
It writes to a system register, like so:
asm volatile ("msr PMCCFILTR_EL0, %0" : : "r" (val) );

ARM inline assembly with GCC

In a nutshell, each asm statement has 4 parts, separated by colons:
asm ( "A" : O : I : C );
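Here A is the assembly code, O is the list of output operands, I is the list of input operands, and C is the clobber list (things like "cc" or "memory" that the asm changes behind the compiler's back).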

Note that "=r" will be used for things in the "O" section, while "r" is used for things in the "I" section.
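The "r" is the actual constraint, asking for the operand to live in a general purpose register. The "=" is a constraint modifier that marks the operand as write-only: the asm puts a result there and never reads its old value, which is why it is used for things in the O section. Inputs are only read by the asm, so they get a plain "r".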

Two examples

From Kyu asm64/cache.c
 asm volatile ("msr csselr_el1, %0" : : "r" (val) );
 asm volatile ("mrs %0, ccsidr_el1" : "=r" (val) );
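Here we write to the cache size selection register (CSSELR_EL1) to pick a cache level and type, then read CCSIDR_EL1, which describes whichever cache was selected. Strictly speaking there should be an ISB between the two, so that the write to CSSELR_EL1 is guaranteed to have taken effect before CCSIDR_EL1 is read.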

Info from the Allwinner H5 (four Cortex-A53 cores)

--------   cache CCSIDR for L1 D = 700fe01a
NO Write through
Write back
Read allocate
Write allocate
ARM L1 D cache line size: 64
 associativity: 4
 sets: 128
--------   cache CCSIDR for L1 I = 201fe00a
NO Write through
NO Write back
Read allocate
NO Write allocate
ARM L1 I cache line size: 64
 associativity: 2
 sets: 256
--------   cache CCSIDR for L2 = 703fe07a
NO Write through
Write back
Read allocate
Write allocate
ARM L2 cache line size: 64
 associativity: 16
 sets: 512
From here, we move from practice back to theory.

Write-back and all that

Write-through is simple. Whenever a write to the cache occurs, it is also pushed through to main memory, most likely at the full penalty of a write to main memory. What is the advantage then, you ask? The advantage comes with the speed of subsequent reads which hit the cache.
None of the caches on the H5 implement write-through.

Write-back is all about procrastination. The cache write takes place at full speed and the line remains "dirty" (i.e. out of sync with main memory) for some indeterminate time, perhaps even forever. Forever is unlikely: it could only happen if the line were invalidated without first being cleaned, in which case the write is simply lost. Normally the write-back happens when the line gets evicted by an allocate (see below), or when software explicitly cleans it.

Read allocate says that when a cache miss occurs on a read, the entire line gets fetched to the cache, then the read takes place from the cache. Note that all of the caches above implement read allocate.

Write allocate says that when a cache miss occurs on a write, the entire line gets read from main memory, then gets modified by the write. Without write allocate, the write just goes to main memory, bypassing the cache. Of the caches above, only the L1 I-cache does not implement write allocate. This is reasonable, since the CPU never stores through the I-cache; it only gets filled by instruction fetches.

Note that whenever a line gets read into the cache, it is possible and perhaps likely that another line needs to be evicted to make room. A clean line can just be discarded, but a dirty line will need to get written back.

More lingo: VIPT and PIPT

In the Cortex-A53 cores of the H5 (as in many other ARM designs), the L1 I cache is VIPT, while the L1 D cache is PIPT.

VIPT is "virtually indexed, physically tagged".
PIPT is "physically indexed, physically tagged".

Having the I cache virtually indexed allows the cache and the TLB (in the MMU) to be accessed in parallel for speed. The VA selects the set of cache lines while the TLB translates the VA to a PA; the PA is then compared against the tags of the lines in that set to find the one that hits.


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org