January 2, 2026

Kyu - ARMv8 (aarch64) cache operations

Handling the caches in a multicore ARM system is surprisingly complex.

Considering just a single core, there are three caches to think about: the L1 I cache, the L1 D cache, and the L2 "unified" cache. Of course we also have main memory.

When we add additional cores, they each have a private L1 I and D cache, but share the L2 "unified" cache with other cores.

Some systems (such as the 8-core Fire3 I have) have "clusters", typically of 4 cores, and each cluster has its own L2 unified cache.

Coherency

Once we have more than one copy of anything, the issue of stale or inconsistent data arises, and it can arise in a number of ways. The ARM hardware takes care of D cache coherence between cores in a cluster, but things are more complex with more than one cluster.

Note that there is no "automatic" mechanism to ensure coherence between main memory and a D cache. This is expected in the case of device DMA. What may be surprising is that if a core has its D cache disabled and writes to main memory, the D cache of another core (with its D cache enabled) will not be informed. As far as the cache system is concerned, this is exactly the same as DMA writing to memory.

A variety of issues arise with self-modifying code. I avoid this in my own programming, but debugging tools may modify code "on the fly" and require careful attention. Relocating code also needs attention: the writes will have gone through the D cache, and the I cache may hold stale instructions.
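The usual recipe after copying or patching code can be sketched as follows. This is only a sketch (aarch64, GCC inline asm), not Kyu's actual code; the function name is mine, and I assume a 64 byte cache line (the real granule comes from CTR_EL0):

```c
/* Sketch: make code written via the D cache visible to instruction fetch.
 * Clean the D cache to the PoU, then invalidate the I cache by range.
 * Assumed 64 byte line size; name "sync_icache" is my own invention.
 */
#include <stdint.h>

#define CACHE_LINE 64   /* assumed; read CTR_EL0 for the real granule */

void
sync_icache ( void *buf, unsigned long len )
{
    uintptr_t p = (uintptr_t) buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t) buf + len;

    /* push the new instructions out of the D cache to the PoU */
    for ( ; p < end; p += CACHE_LINE )
        asm volatile ( "dc cvau, %0" : : "r"(p) : "memory" );
    asm volatile ( "dsb ish" ::: "memory" );

    /* now discard any stale copies in the I cache */
    p = (uintptr_t) buf & ~(uintptr_t)(CACHE_LINE - 1);
    for ( ; p < end; p += CACHE_LINE )
        asm volatile ( "ic ivau, %0" : : "r"(p) : "memory" );
    asm volatile ( "dsb ish" ::: "memory" );
    asm volatile ( "isb" ::: "memory" );
}
```

The first dsb ensures the cleans complete before the invalidates begin, and the final isb discards anything the core has already prefetched.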

Cache flush

The concept here is that the contents of the D cache are forced to be written back to memory, ensuring coherence between the D cache in question and main memory.

ARM documentation avoids the term "flush" (for whatever reason) and defines the following operations:

clean - write any dirty lines back to memory.
invalidate - mark lines invalid, discarding their contents without writing them back.
clean and invalidate - a clean followed by an invalidate.

What I do when I am writing a packet via a network interface is to write my data to the buffer, then perform a "clean" as per the above, so that the buffer in main memory is ready for DMA.

In the other direction, when a packet has been received and deposited in main memory by DMA, I do an invalidate. It is possible to specify an address range for each action, so the entire cache does not need to get involved.

I don't believe I have ever had use for "clean and invalidate".
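The two DMA cases above can be sketched like this (aarch64, GCC inline asm). The function names and the 64 byte line size are my own assumptions, not Kyu's actual routines:

```c
/* Sketch: cache maintenance by address range around DMA.
 * "dc cvac" cleans to the PoC, "dc ivac" invalidates (EL1 only).
 * Assumed 64 byte line size; names are my own invention.
 */
#include <stdint.h>

#define CACHE_LINE 64   /* assumed; CTR_EL0 gives the real value */

/* Before transmit DMA: push CPU writes out to main memory */
void
dcache_clean_range ( void *buf, unsigned long len )
{
    uintptr_t p = (uintptr_t) buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t) buf + len;

    for ( ; p < end; p += CACHE_LINE )
        asm volatile ( "dc cvac, %0" : : "r"(p) : "memory" );
    asm volatile ( "dsb sy" ::: "memory" );
}

/* After receive DMA: discard stale cached copies of the buffer */
void
dcache_inval_range ( void *buf, unsigned long len )
{
    uintptr_t p = (uintptr_t) buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t) buf + len;

    for ( ; p < end; p += CACHE_LINE )
        asm volatile ( "dc ivac, %0" : : "r"(p) : "memory" );
    asm volatile ( "dsb sy" ::: "memory" );
}
```

Note that dc works on whole lines, so the start address is rounded down to a line boundary. A buffer that shares a line with unrelated data is a source of subtle bugs, which is why DMA buffers are usually cache-line aligned.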

Point of Coherency and Point of Unification

These terms (abbreviated as PoC and PoU) are used by the ARM documentation.

PoC is typically main memory (unless external caches exist, which is not the case with any systems I work with). It is defined as the point at which all observers (cores, DMA devices, and so on) see the same copy.

PoU can be (and usually is) more local. It is the point at which the I and D caches of a single core see the same copy. This will typically be the L2 cache in systems that have one (all my systems have an L2 cache).

"inner" vs "outer"

ARM calls these "shareability domains", and this language seems to show up only in aarch32 (ARMv7) documents. It was always confusing, so perhaps the newer PoC and PoU language is less so. Exactly what this means is specific to the cache design of each given system.

This typically distinguishes L1 (inner) from L2, along with L3 (when it exists), as outer. However there is more to it. We actually have 4 classifications, not just two.

The trick is defining which entities are in each domain. This becomes useful, important, and interesting in a system with two CPU clusters (like a big.LITTLE system). Things besides CPUs can get involved (such as a GPU).

Fortunately my multi-cluster systems are all aarch64 based, so I won't have to wrestle with this muddled terminology; I can just view inner as L1 and outer as L2. Where I have run into this is when setting up MMU entries and being allowed to designate pages as belonging to one of these 4 classes.

This explanation is admittedly incomplete, as is my understanding at this point.

Set/Way

A set/way cache is one of several cache designs (the others being direct mapped and fully associative). It is more often called a set-associative cache. The cache is divided into sets, and each set has multiple "ways" (lines). A given address selects exactly one set, but the line for that address may reside in any way within the set. You will see phrases like "4-way set-associative cache".


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org