January 27, 2023

Kyu networking -- Allwinner H3 - I cache experiments

So, what happens when we switch off the I cache?

I do the following:

    invalidate_icache_all();
    icache_disable();

This boils down to:

#define CP15ISB asm volatile ("mcr     p15, 0, %0, c7, c5, 4" : : "r" (0))
#define CP15DSB asm volatile ("mcr     p15, 0, %0, c7, c10, 4" : : "r" (0))

    /* Invalidate all instruction caches
     * Also flushes branch target cache.
     */
    asm volatile ("mcr p15, 0, %0, c7, c5, 0" : : "r" (0));

    /* Invalidate entire branch predictor array */
    asm volatile ("mcr p15, 0, %0, c7, c5, 6" : : "r" (0));

    /* Full system DSB - make sure that the invalidation is complete */
    CP15DSB;

    /* ISB - make sure the instruction stream sees it */
    CP15ISB;

    get_SCTLR ( sctlr );
    sctlr &= ~SCTLR_I_CACHE;
    set_SCTLR ( sctlr );

Can we detect a change after this?

I run my memcpy() benchmark and the timings are really no different:
core 0 SCTLR = 00c5087d
 100K:  77271  77212  77234  77150 per 1K =   772
  20K:  15422  15436  15454  15459 per 1K =   771
   1K:    775    774    767    773 per 1K =   775

Note that the SCTLR value no longer has bit 0x1000 (bit 12, the I bit) set, so the I cache is indeed disabled.
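
The SCTLR value printed above can be checked bit by bit. The following is a small sketch, assuming the standard ARMv7-A SCTLR layout (M is bit 0, C is bit 2, Z is bit 11, I is bit 12). Apart from SCTLR_I_CACHE, the macro and function names are made up for illustration and are not taken from Kyu.

```c
#include <stdint.h>
#include <stdio.h>

/* ARMv7-A SCTLR bits.  Only SCTLR_I_CACHE is a name used in the text,
 * the other names are illustrative.
 */
#define SCTLR_M         0x0001u   /* bit 0:  MMU enable */
#define SCTLR_C         0x0004u   /* bit 2:  D cache enable */
#define SCTLR_Z         0x0800u   /* bit 11: branch prediction enable */
#define SCTLR_I_CACHE   0x1000u   /* bit 12: I cache enable */

/* Return 1 if every bit in mask is set in the SCTLR value, else 0. */
int
sctlr_bit_set ( uint32_t sctlr, uint32_t mask )
{
        return (sctlr & mask) == mask;
}

/* Print the state of the four bits of interest for a raw SCTLR value. */
void
sctlr_show ( uint32_t sctlr )
{
        printf ( "SCTLR = %08x\n", (unsigned int) sctlr );
        printf ( "  MMU               %s\n", sctlr_bit_set ( sctlr, SCTLR_M ) ? "on" : "off" );
        printf ( "  D cache           %s\n", sctlr_bit_set ( sctlr, SCTLR_C ) ? "on" : "off" );
        printf ( "  branch prediction %s\n", sctlr_bit_set ( sctlr, SCTLR_Z ) ? "on" : "off" );
        printf ( "  I cache           %s\n", sctlr_bit_set ( sctlr, SCTLR_I_CACHE ) ? "on" : "off" );
}
```

Calling sctlr_show ( 0x00c5087d ) reports the I cache as off, while the MMU, D cache and branch prediction bits are all still set.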

I am now omitting the 500K timing because it doesn't add new information and is annoyingly slow.

What I do find, though, is that my LED blink delay test now runs extremely slowly. The heart of this test is a call to delay_ms(), which looks like this:

void
delay_us ( int delay )
{
        volatile unsigned int count;

        count = delay * us_delay_count;
        while ( count -- )
            ;
}

// 1003 gives 1.000 ms
void
delay_ms ( int delay )
{
        unsigned int n;

        for ( n=delay; n; n-- )
            delay_us ( 1003 );
}

The heart of the delay_us() loop looks like this:

40002dac:       e51b3008        ldr     r3, [fp, #-8]
40002db0:       e2432001        sub     r2, r3, #1
40002db4:       e50b2008        str     r2, [fp, #-8]
40002db8:       e3530000        cmp     r3, #0
40002dbc:       1afffffa        bne     40002dac

The load from and store to [fp, #-8] are references to a variable on the stack. This surprises me somewhat. It could be that the volatile forces this, or it could simply be because we are not giving a -O switch to the compiler.

The question, though, is why this slows down so much while memcpy seems to run in the same amount of time. Whatever the case, we have certainly confirmed that we are able to switch off the I cache.

L2 control on the Allwinner H3

I had a hunch and spent some time digging through the data sheet. On pages 148-149 there is a section on CPUCFG. It has two registers, one with a bit to enable the clock to the L2 and another to reset (or not) the L2. Judging by both, the L2 should be active.

Interestingly there is a 64 bit counter here (fed by the 24M clock).
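
A counter like this would be handy for timing. As a minimal sketch of the arithmetic only, assuming the "24M" clock is exactly 24 MHz: the function names are hypothetical, and how the counter is actually read (register addresses, latching) is not covered here.

```c
#include <stdint.h>

/* Counter input clock, from the "24M" in the text (assumed exactly 24 MHz) */
#define COUNTER_HZ      24000000ull

/* Convert a tick count from the 64 bit counter to whole microseconds. */
uint64_t
ticks_to_us ( uint64_t ticks )
{
        return ticks / (COUNTER_HZ / 1000000ull);
}

/* Elapsed microseconds between two counter readings.
 * Unsigned subtraction gives the right difference even across a wrap.
 */
uint64_t
elapsed_us ( uint64_t start, uint64_t end )
{
        return ticks_to_us ( end - start );
}
```

At 24 ticks per microsecond, a 10 ms interval corresponds to 240000 ticks.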

More I cache related timings

I wrote some code to check the timing for a 10 ms delay as determined by the above delay_ms() function.

Delay for 10 ms =   8019 (with I cache enabled)
Delay for 10 ms = 144717 (with I cache disabled)
Delay for 10 ms = 144715 (with I cache disabled, optimized)

Delay for 10 ms =   9991 (BBB with I cache enabled, just right!)
Delay for 10 ms =   9991 (BBB with I cache enabled, optimized, just right!)
Delay for 10 ms = 104896 (BBB with I cache disabled)

Comparing to the BBB timings is interesting, but not the main focus right now. It is notable that the BBB timings with the I cache enabled are almost exactly 10 ms.
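
To put the timings above in proportion, the slowdown from disabling the I cache can be worked out directly. This is just arithmetic on the numbers listed; the function names are made up for illustration.

```c
#include <stdio.h>

/* Ratio of the time with the I cache disabled to the time with it enabled. */
double
slowdown ( unsigned int t_enabled, unsigned int t_disabled )
{
        return (double) t_disabled / (double) t_enabled;
}

/* Print the slowdown for both boards using the timings listed above. */
void
show_slowdowns ( void )
{
        printf ( "H3  slowdown: %.1f\n", slowdown ( 8019, 144717 ) );
        printf ( "BBB slowdown: %.1f\n", slowdown ( 9991, 104896 ) );
}
```

This gives a factor of about 18 for the H3 and about 10.5 for the BBB.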

I removed the volatile, and there was no change in the timing. I tried adding "register", but that changed nothing either. Apparently (as advertised), register is no more than a hint and is in general just ignored these days.

I was very much surprised that the "optimized" version (with no memory references in the loop) ran no faster than the unoptimized one. Details follow as to what this "optimized" version is all about.

As for the "optimized" timing, I discovered I could optimize just one function in a file (using gcc) with the following:

__attribute__ ((optimize(1)))
static void
e_delay_us ( int delay )
{
        // volatile unsigned int count;
        register unsigned int count;

        count = delay * us_delay_count;
        while ( count -- ) {
            asm volatile ( "nop" );
            asm volatile ( "nop" );
            asm volatile ( "nop" );
        }
}

I added the 3 "nop" instructions so there would be 5 instructions in the loop, as in the non-optimized case above. The idea is to eliminate memory references in the loop, while keeping everything else constant. This being a RISC machine, it should execute every instruction in a single clock if there are no conflicts.

4001dd90 <e_delay_us>:
4001dd90:       e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
4001dd94:       e28db000        add     fp, sp, #0
4001dd98:       e30b3590        movw    r3, #46480      ; 0xb590
4001dd9c:       e3443003        movt    r3, #16387      ; 0x4003
4001dda0:       e5933000        ldr     r3, [r3]
4001dda4:       e0000093        mul     r0, r3, r0
4001dda8:       e3500000        cmp     r0, #0
4001ddac:       0a000004        beq     4001ddc4 
4001ddb0:       e320f000        nop     {0}
4001ddb4:       e320f000        nop     {0}
4001ddb8:       e320f000        nop     {0}
4001ddbc:       e2500001        subs    r0, r0, #1
4001ddc0:       1afffffa        bne     4001ddb0 
4001ddc4:       e28bd000        add     sp, fp, #0
4001ddc8:       e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
4001ddcc:       e12fff1e        bx      lr
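
The one-instruction-per-clock expectation for these 5 instruction loops can be turned into a number to compare against a measurement. This is only a sketch of that idealized model: the CPU clock rate and the iteration count are parameters that are not given here, the function name is hypothetical, and a real core will deviate from it.

```c
#include <stdint.h>

/* Ideal loop time in nanoseconds, assuming one instruction per clock
 * and no stalls of any kind (an idealization, not a measurement).
 *
 *  iterations   - number of trips around the loop
 *  insns_per_it - instructions per trip (5 for the loops shown above)
 *  cpu_hz       - CPU clock in Hz (hypothetical, not given in the text)
 */
double
ideal_loop_ns ( uint64_t iterations, unsigned int insns_per_it, uint64_t cpu_hz )
{
        double clocks = (double) iterations * (double) insns_per_it;

        return clocks * 1.0e9 / (double) cpu_hz;
}
```

For example, with a hypothetical 1 GHz clock, 1000 trips around a 5 instruction loop would ideally take 5000 ns.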


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org