January 25, 2023

Kyu networking -- Study the NetBSD startup

Just to remind myself, this is all about making 100 Mbit network transfers run as fast as they should on my Allwinner H3 boards. I discovered that there was a big delay in (of all things) memcpy -- and I believe (but have not yet proven) that this must be related to the caches not being properly enabled.

I am now studying the NetBSD startup code to see how they initialize the H3 ARM processor. I have timed a 1 megabyte transfer (using netcat) with NetBSD on my Orange Pi PC board and it takes 0.17 seconds, just like it should (with Kyu the time is now 5 seconds). This proves that the H3 is capable of doing much better, and also that my specific board does not have broken hardware.

I built NetBSD for the H3 just so I would have a record of the files actually used in the build. This is all but impossible to work out just looking at the sources themselves. Unfortunately, NetBSD does not leave the object files next to the sources in the nice way that U-boot does (so they can serve as breadcrumbs). I captured the build log and extracted a list of filenames from it, which is proving to be invaluable.

Two files hold the main execution thread for startup:

arch/arm/arm/armv6_start.S
arch/arm/arm32/locore.S
I have no idea why there is arm and arm32. The 64 bit ARM stuff is all in arch/aarch64. Whatever the case, there is an interesting comment at the start of armv6_start.S --
 * At this point, this code has been loaded into SDRAM and the MMU should be off
 * with data caches disabled.
 * linux image type should be used in uboot images to ensure this is the case.
This is certainly not the case for Kyu since uboot simply loads a binary file into memory and jumps into it. In fact when I test the control registers I see that the cache enable bits are on, and the MMU is enabled. One might think I could just take the U-boot initialization and run with it, but that doesn't work.

EFI booting

NetBSD has uboot do something called EFI booting. This adds another layer of complication to what I have to understand. Someday.

Sheer luck looking at the U-boot sources

I trip over this routine in arch/arm/cpu/armv7/cpu.c
int cleanup_before_linux_select(int flags)
{
        /*
         * this function is called just before we call linux
         * it prepares the processor for linux
         *
         * we turn off caches etc ...
         */

        disable_interrupts();

        if (flags & CBL_DISABLE_CACHES) {
                /*
                * turn off D-cache
                * dcache_disable() in turn flushes the d-cache and disables MMU
                */
                dcache_disable();
                v7_outer_cache_disable();

                /*
                * After D-cache is flushed and before it is disabled there may
                * be some new valid entries brought into the cache. We are
                * sure that these lines are not dirty and will not affect our
                * execution. (because unwinding the call-stack and setting a
                * bit in CP15 SCTRL is all we did during this. We have not
                * pushed anything on to the stack. Neither have we affected
                * any static data) So just invalidate the entire d-cache again
                * to avoid coherency problems for kernel
                */
                invalidate_dcache_all();

                icache_disable();
                invalidate_icache_all();
        } else {
                /*
                 * Turn off I-cache and invalidate it
                 */
                icache_disable();
                invalidate_icache_all();

                flush_dcache_all();
                invalidate_icache_all();
                icache_enable();
        }

        /*
         * Some CPU need more cache attention before starting the kernel.
         */
        cpu_cache_initialization();

        return 0;
}

int cleanup_before_linux(void)
{
        return cleanup_before_linux_select(CBL_ALL);
}
Worth reading might be: doc/README.arm-caches

So where is the above called?

./lib/efi_loader/efi_boottime.c:	cleanup_before_linux();
./arch/arm/lib/spl.c:	cleanup_before_linux();
./arch/arm/lib/bootm.c:	cleanup_before_linux();
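No matter what the comment in NetBSD says, I see no code in U-boot that explicitly disables the MMU. The only hint is the comment in the routine above, which claims that dcache_disable() disables the MMU as a side effect.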

Use Kyu to look at the ARM setup U-Boot hands to us

It is worth noting that U-boot is a moving target. I have done a lot of my Orange Pi work with a 2016 vintage U-boot. The tests described below were done with 2022 vintage U-boot:
U-Boot SPL 2022.10-dirty (Jan 16 2023 - 21:31:52 -0700)

I add code to Kyu to record the values of the MMU registers the very instant we gain control in locore.S. I see:

orig SCTLR = 00c5187d
orig TTBR0 = 7fff4000
orig TTBR1 = 40040059
orig TTBCR = 80000f00
orig DACR  = 55555555
All this is interesting. The DACR gives permissions for the 16 domains (2 bits per domain). Each hex digit 5 (0101) sets two adjacent domains to 01 ("client"), which means access is checked using the information in the tables. I have always used f (1111), which sets 11 ("manager") and says to just skip checking access altogether.

The SCTLR tells us that both I and D caches are enabled and the MMU is enabled as well.
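
The bit tests described above for the SCTLR and DACR can be sketched in C. This is only an illustration and not code from Kyu: the macro and function names are made up here, and the bit positions (M = bit 0, C = bit 2, I = bit 12) are taken from the ARMv7-A manual.

```c
#include <assert.h>
#include <stdint.h>

/* SCTLR enable bits (ARMv7-A). Names are illustrative, not from Kyu. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Nonzero if every bit in `mask` is set in the SCTLR value. */
int sctlr_has(uint32_t sctlr, uint32_t mask)
{
    return (sctlr & mask) == mask;
}

/* 2-bit access field for one of the 16 domains in a DACR value:
 * 0 = no access, 1 = client (permissions in the tables are checked),
 * 2 = reserved, 3 = manager (no permission checks). */
unsigned dacr_domain(uint32_t dacr, unsigned domain)
{
    assert(domain < 16);
    return (dacr >> (2 * domain)) & 0x3u;
}
```

With the values recorded above, sctlr_has(0x00c5187d, SCTLR_M | SCTLR_C | SCTLR_I) is nonzero, and dacr_domain(0x55555555, n) is 1 (client) for every domain n.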

The low 3 bits of TTBCR are zero and that means that only TTBR0 is being used. The big surprise is bit 31 being set (see below).

The value in TTBR1 is not used and is a value I set, which has persisted through a reset and never was changed by U-Boot. I set TTBR1 equal to TTBR0 (why not), even though it should never be used.

The value of TTBR0 shows us that U-Boot stuck the MMU table way out near the very end of the 1G of ram (0x1000_0000 is 256M of ram).

We can use Kyu to look at the table:

 dl 0x7fff4000 32
7fff4000  7fff0003 00000000 7fff1003 00000000
7fff4010  7fff2003 00000000 7fff3003 00000000
7fff4020  00000000 00000000 00000000 00000000
Kyu, ready> dl 7fff0000 8
7fff0000  00000441 00400000 00200441 00400000
7fff0010  00400441 00400000 00600441 00400000
7fff0020  00800441 00400000 00a00441 00400000
7fff0030  00c00441 00400000 00e00441 00400000
7fff0040  01000441 00400000 01200441 00400000
7fff0050  01400441 00400000 01600441 00400000
7fff0060  01800441 00400000 01a00441 00400000
7fff0070  01c00441 00400000 01e00441 00400000
Kyu, ready> dl 7fff1000 8
7fff1000  40000449 00000000 40200449 00000000
7fff1010  40400449 00000000 40600449 00000000
7fff1020  40800449 00000000 40a00449 00000000
7fff1030  40c00449 00000000 40e00449 00000000
7fff1040  41000449 00000000 41200449 00000000
7fff1050  41400449 00000000 41600449 00000000
7fff1060  41800449 00000000 41a00449 00000000
7fff1070  41c00449 00000000 41e00449 00000000
Holy smokes. This makes no sense at all.
Well it does. Apparently the Cortex-A7 can work in LPAE mode, which uses 64 bit MMU table entries. And U-boot is using it in that mode. (I wonder if NetBSD will do so also.) Bit 31 in the TTBCR enables this (and you can see that bit set in the dump above). The TTBCR register bit assignments change radically when this bit is set. The mysterious "f" in the lower 16 bits of that register (bits 11:8) sets inner/outer cacheability for the mmu tables, making them "write back, no write allocate".
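
To make the TTBCR reading concrete, here is a rough sketch of pulling out the fields mentioned above when the register is in its LPAE layout. The struct and function names are invented for this example; the field positions (EAE = bit 31, T0SZ = bits 2:0, IRGN0 = bits 9:8, ORGN0 = bits 11:10) and the cacheability encodings are the ones given in the ARMv7 A/R manual.

```c
#include <assert.h>
#include <stdint.h>

/* A few TTBCR fields in the long-descriptor (EAE = 1) layout.
 * Names are illustrative only. */
struct ttbcr_lpae {
    unsigned eae;    /* bit 31: 1 selects 64-bit long descriptors */
    unsigned t0sz;   /* bits 2:0: size offset for the TTBR0 region */
    unsigned irgn0;  /* bits 9:8: inner cacheability of TTBR0 table walks */
    unsigned orgn0;  /* bits 11:10: outer cacheability of TTBR0 table walks */
};

struct ttbcr_lpae ttbcr_decode(uint32_t v)
{
    struct ttbcr_lpae f;

    f.eae   = (v >> 31) & 0x1u;
    f.t0sz  = v & 0x7u;
    f.irgn0 = (v >> 8) & 0x3u;
    f.orgn0 = (v >> 10) & 0x3u;
    return f;
}

/* Text for a 2-bit IRGN/ORGN cacheability code. */
const char *rgn_name(unsigned rgn)
{
    static const char *const names[4] = {
        "non-cacheable",
        "write back, write allocate",
        "write through",
        "write back, no write allocate",
    };

    assert(rgn < 4);
    return names[rgn];
}
```

For the value 0x80000f00 seen above this gives eae = 1, t0sz = 0, and 3 for both irgn0 and orgn0, which rgn_name() reports as "write back, no write allocate".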

What about LPAE and NetBSD?

Looking at sources, I see the compiler switch -DARM_HAS_LPAE. The only place I see this referenced is in arch/arm/arm32/arm32_kvminit.c. I see no evidence that my build included LPAE.

How does LPAE work -- how do we interpret the tables set up by U-boot?

LPAE allows a 3 level setup, and the first level has four 1G descriptors. So the virtual address space is limited to 4G (physical space can be much bigger). The second level table has 512 entries, each for 2M of virtual memory. Each second level table entry can point to a level 3 table, also with 512 entries, each describing a 4K page.
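
The three level split described above can be sketched as index arithmetic on a 32 bit virtual address. This is an illustration only, assuming the full 4G is covered by TTBR0 (T0SZ = 0); the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Table indices for a 32-bit virtual address under LPAE. */
struct lpae_index {
    unsigned l1;      /* VA bits 31:30, one of four 1G entries */
    unsigned l2;      /* VA bits 29:21, one of 512 2M entries */
    unsigned l3;      /* VA bits 20:12, one of 512 4K entries */
    unsigned offset;  /* VA bits 11:0, byte within the 4K page */
};

struct lpae_index lpae_split(uint32_t va)
{
    struct lpae_index i;

    i.l1     = (va >> 30) & 0x3u;
    i.l2     = (va >> 21) & 0x1ffu;
    i.l3     = (va >> 12) & 0x1ffu;
    i.offset = va & 0xfffu;
    return i;
}

/* Address of entry `index` in a table at `table_base`.
 * Every LPAE entry is 8 bytes. */
uint32_t lpae_entry_addr(uint32_t table_base, unsigned index)
{
    assert(index < 512);
    return table_base + 8u * index;
}
```

For example, lpae_split(0x40200000) gives l1 = 1 and l2 = 1. The first level entry is then at lpae_entry_addr(0x7fff4000, 1) = 0x7fff4008, which holds 7fff1003 in the dump above, and the second level entry is at lpae_entry_addr(0x7fff1000, 1) = 0x7fff1008, which is where the dump shows 40200449.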

A side note. The Cortex-A8 and A9 do not provide LPAE, but the A7 does. Also LPAE is the default with arm64.

The details are there (as they should be) in the ARMv7 A/R manual. Search for "long-descriptor translation table format descriptors".

In the first 2 levels, if bit 1 of an entry is clear the entry describes a block, and it carries attributes for that block in the low word, with more attributes in the upper word.

So, in the above, U-boot is setting the XN bit for what we can see of the first 1G (which is not ram). The low bits are 0x0440, so we have bit 6 set (for AP) and bit 10 set (the access flag).

They say that "to be consistent with the short format" the bit AP[0] is not defined.

Setting AP[2] makes the memory "read only" (otherwise it is R/W) - We see it set to 0.
Setting AP[1] allows access at any level, otherwise it is privileged. - We see it set to 1.

The access flag bit will yield a fault if it is zero and the entry is read into the TLB. Software is expected to do something, then set the flag to one. Since I don't want this, setting this to 1 initially sounds like just the right thing.
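
The attribute bits discussed above can be pulled out of one 64 bit entry as follows. This is a sketch with made-up names, taking the two 32 bit words in the order the dump shows them (low word first); the bit positions are from the long-descriptor format in the ARMv7 A/R manual.

```c
#include <stdint.h>

/* Selected fields of an LPAE level 1 or level 2 entry.
 * The attribute fields only mean something for block entries. */
struct lpae_desc {
    unsigned valid;  /* bit 0: entry is valid */
    unsigned table;  /* bit 1: 1 = next-level table, 0 = block */
    unsigned ap1;    /* bit 6, AP[1]: 1 = access at any privilege level */
    unsigned ap2;    /* bit 7, AP[2]: 1 = read only */
    unsigned af;     /* bit 10: access flag */
    unsigned xn;     /* bit 54 (bit 22 of the upper word): execute never */
};

struct lpae_desc lpae_decode(uint32_t lo, uint32_t hi)
{
    uint64_t d = ((uint64_t)hi << 32) | lo;
    struct lpae_desc f;

    f.valid = (unsigned)(d & 1u);
    f.table = (unsigned)((d >> 1) & 1u);
    f.ap1   = (unsigned)((d >> 6) & 1u);
    f.ap2   = (unsigned)((d >> 7) & 1u);
    f.af    = (unsigned)((d >> 10) & 1u);
    f.xn    = (unsigned)((d >> 54) & 1u);
    return f;
}
```

Calling lpae_decode(0x00000441, 0x00400000) on the first entry of the dump reports a valid block with AP[1] = 1, AP[2] = 0, the access flag set and XN set. The ram entries such as lpae_decode(0x40000449, 0x00000000) come out the same except that XN is clear.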

A quick peek at the BBB

orig SCTLR = 00c5187f
orig ACTLR = 00000042
orig    SP = 9ef40818
orig TTBR0 = 9fff0000
orig TTBR1 = 00000000
orig TTBCR = 00000000
orig DACR  = fffffffd

Kyu (bbb), ready> di 9fff0000 8
9fff0000  00000c12 00100c12 00200c12 00300c12
9fff0010  00400c12 00500c12 00600c12 00700c12
9fff0020  00800c12 00900c12 00a00c12 00b00c12
9fff0030  00c00c12 00d00c12 00e00c12 00f00c12
9fff0040  01000c12 01100c12 01200c12 01300c12
9fff0050  01400c12 01500c12 01600c12 01700c12
9fff0060  01800c12 01900c12 01a00c12 01b00c12
9fff0070  01c00c12 01d00c12 01e00c12 01f00c12
So, the I cache is enabled, the D cache also, and the MMU.
The BBB is a Cortex-A8 which (nicely) has a bit in the ACTLR dedicated to enabling the L2 cache. This is bit 1 (0x2) and we see that it is also enabled.
Bit 6 (0x40) is "IBE" invalidate BTB enable (whatever that is).
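
Since TTBCR is zero on the BBB, the table dumped above is in the ordinary short-descriptor format, one 32 bit word per 1M section. Here is a hedged sketch of taking one of those words apart. The names are made up for the example, and the field positions are the section descriptor layout from the ARMv7 A/R manual.

```c
#include <stdint.h>

/* Selected fields of a short-descriptor first-level section entry. */
struct section_desc {
    unsigned is_section;  /* bits 1:0 == 2 and bit 18 == 0 */
    unsigned b;           /* bit 2: bufferable */
    unsigned c;           /* bit 3: cacheable */
    unsigned xn;          /* bit 4: execute never */
    unsigned domain;      /* bits 8:5: one of the 16 DACR domains */
    unsigned ap;          /* bits 11:10: AP[1:0] */
    uint32_t base;        /* bits 31:20: physical base of the 1M section */
};

struct section_desc section_decode(uint32_t v)
{
    struct section_desc f;

    f.is_section = ((v & 0x3u) == 0x2u) && (((v >> 18) & 1u) == 0);
    f.b          = (v >> 2) & 1u;
    f.c          = (v >> 3) & 1u;
    f.xn         = (v >> 4) & 1u;
    f.domain     = (v >> 5) & 0xfu;
    f.ap         = (v >> 10) & 0x3u;
    f.base       = v & 0xfff00000u;
    return f;
}
```

Applied to 0x00100c12 this reports a section at 0x00100000 in domain 0 with AP = 3, XN set, and C and B both clear. The DACR value fffffffd puts domain 0 in client mode (01), so for these entries the AP bits are actually checked.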


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org