January 10, 2026

Kyu - ARM - Aarch64 cache - code to flush / invalidate

Here is the code for arm 64 -- from U-boot arch/arm/cpu/armv8/cache.S
.pushsection .text.__asm_flush_dcache_range, "ax"
ENTRY(__asm_flush_dcache_range)
    mrs x3, ctr_el0
    ubfx    x3, x3, #16, #4
    mov x2, #4
    lsl x2, x2, x3      /* cache line size */

    /* x2 <- minimal cache line size in cache system */
    sub x3, x2, #1
    bic x0, x0, x3
1:  dc  civac, x0   /* clean & invalidate data or unified cache */
    add x0, x0, x2
    cmp x0, x1
    b.lo    1b
    dsb sy
    ret
ENDPROC(__asm_flush_dcache_range)
Replace "civac" with "ivac" to get the invalidate routine.

ENTRY and ENDPROC are defined in include/linux/linkage.h -- they are trivial and of no particular interest here.

Get the cache line length

First of all, this routine is called with two arguments.
The first is in x0 and is the starting address.
The second is in x1 and is the ending address.
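
In C terms, the calling convention amounts to something like the following. U-boot has its own declarations in its headers; this sketch (including the buf and len variables) is mine:

    void __asm_flush_dcache_range ( unsigned long start, unsigned long end );

    /* e.g. flush a buffer before handing it to a DMA device */
    __asm_flush_dcache_range ( (unsigned long) buf, (unsigned long) buf + len );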

The "mrs" instruction reads the "ctr_el0" register (cache type register). The ubfx (unsigned bitfield extract) instruction pulls a 4 bit field from offset 16 from the 32 bit value this returns (namely bits 19:16). This gives the log2 of the size in words of the smallest cache line in the L1 and L2 caches controlled by this core.

The value "4" is loaded into x2 to effectively multiply the result to be obtained by 4, converting the word count to a byte count. The shift then inverts the log2 to give us the actual line size in bytes (in x2).

The "bic" instruction is "bit clear". It does NOT just clear one bit. It clears all the bits given (in this case) by the mask in x3. This mask is (x2-1), i.e. the line size minus 1. Clearing these low bits in x0 (start address argument) gives a nice start to the loop that follows.

The loop should be clear. We issue the "dc civac" (or "dc ivac") instruction for each cache line in the range, stepping by the line size, until the address reaches the ending value in x1. Note that the b.lo makes the end address exclusive.

Finally, a "dsb sy" barrier ensures that the function does not return until all the writes have been performed.

Compare this to code for aarch32

This code is mostly in arch/arm/cpu/armv7/cache_v7.c. Here the code is written in C and uses inline assembly as needed. The code is also more convoluted -- the following is my own condensation of several routines.
Note that aarch32 uses mrc and mcr in place of the mrs and msr instructions that aarch64 uses.
#define CCSIDR_LINE_SIZE_OFFSET     0
#define CCSIDR_LINE_SIZE_MASK       0x7

    u32 line_len, ccsidr;
    u32 mva;
    u32 start, stop;    /* the address range -- arguments in the real code */

    /* Read current CP15 Cache Size ID Register */
    asm volatile ("mrc p15, 1, %0, c0, c0, 0" : "=r" (ccsidr));

    /* the LineSize field encodes log2(words per line) - 2,
       so adding 2 gives log2 of the line length in words */
    line_len = ((ccsidr & CCSIDR_LINE_SIZE_MASK) >>
        CCSIDR_LINE_SIZE_OFFSET) + 2;

    /* Converting from words to bytes */
    line_len += 2;
    /* converting from log2(linelen) to linelen */
    line_len = 1 << line_len;
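    /* e.g. a LineSize field of 2 gives 1 << (2+2+2) = 64 bytes */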

#if FLUSH
    /* Align start to cache line boundary */
    start &= ~(line_len - 1);

    for (mva = start; mva < stop; mva = mva + line_len) {
        /* DCCIMVAC - Clean & Invalidate data cache by MVA to PoC */
        asm volatile ("mcr p15, 0, %0, c7, c14, 1" : : "r" (mva));
    }
#endif
#if INVALIDATE
    for (mva = start; mva < stop; mva = mva + line_len) {
        /* DCIMVAC - Invalidate data cache by MVA to PoC */
        asm volatile ("mcr p15, 0, %0, c7, c6, 1" : : "r" (mva));
    }
#endif

    /* the original code calls a dsb() macro, which amounts to this */
    asm volatile ("dsb sy" : : : "memory");
Take note in the above that the C code must do a shift and mask to extract a field from the register, while on aarch64 we have the ubfx instruction to do just this sort of thing for us.

Here we have the infernal and cursed syntax for the aarch32 system registers. Let me go on a rant here. Why, o why? Why did they not make the assembler take care of this and allow us instead to write code somewhat like:

	mrc r0, CCSIDR
	mcr DCCIMVAC, r1
They certainly could have -- and should have! I have written post processors to produce this syntax for code I have disassembled, and it has been a huge benefit. This is what assemblers are for and what they should do!!

The U-boot code kindly provides comments to tell us what is intended.

Why does the flush get the start aligned to a line boundary, while the invalidate does not? Note that the two cases are not symmetric. Rounding a clean & invalidate out to whole lines is harmless: any extra data in the boundary lines gets written back before being discarded. Rounding a pure invalidate out is another matter. A boundary line may also hold unrelated dirty data, and invalidating it throws that data away. So excess flushing never causes errors (just a slight performance penalty), but excess invalidating can lose data. Handing an unaligned range to an invalidate routine is really a caller bug, and quietly aligning the address (as the aarch64 code above does) only papers over it.
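
To make the hazard concrete, here is a hypothetical scenario. The names, the layout, and the invalidate_dcache_range call are purely for illustration:

    /* suppose these two end up sharing one 64 byte cache line */
    char buf[48];   /* a DMA receive buffer */
    int count;      /* unrelated data, in the same line by bad luck */

    count = 5;      /* this store sits dirty in the cache */
    invalidate_dcache_range ( (u32) buf, (u32) buf + sizeof(buf) );
    /* if the invalidate rounds the range out to whole lines, the
       line holding count is discarded and the store of 5 is lost */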

CTR_EL0 versus CCSIDR

Aarch64 has both registers, and either can be used to obtain the cache line size. Aarch32 (ARMv7) has a CTR system register as well. It can be accessed as:
    asm volatile ("mrc p15, 0, %0, c0, c0, 1" : "=r" (ctr));
So the code can be written to use either. The DminLine field in the CTR gives the minimum line size across ALL the caches, which is just what we want here. The CCSIDR, on the other hand, describes only whichever single cache is currently selected by the CSSELR register, so what it reports depends on the specific core (A7 or A53 or ...) and on which cache was last selected.
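
Assuming we take the CTR route on aarch32, the line size calculation would be a sketch like this, with DminLine again living in bits 19:16:

    u32 ctr, line_len;

    /* CTR - cache type register */
    asm volatile ("mrc p15, 0, %0, c0, c0, 1" : "=r" (ctr));
    /* DminLine (bits 19:16) is log2 of the line size in words */
    line_len = 4 << ((ctr >> 16) & 0xf);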


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org