October 18, 2023

Orange Pi H3 -- the ARM MMU

The Allwinner H3 chip on the Orange Pi PC I am working with contains a 4 core cpu cluster. These use ARM Cortex-A7 MPCore processors, each with 32K of D cache and 32K of I cache. The cluster also has 512K of L2 cache.

It is often important to get exactly the ARM manual for the specific processor, and there is a specific manual for the Cortex-A7 MPCore. For details of the MMU, however, you go to the more general "ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition". Look for the section labeled VMSA (virtual memory system architecture) and find the subsection that describes "translation tables". These are sections B3 (page 1307) and B3.3 (page 1318).

Why do we care about this at all if we are doing bare metal programming and want a static transparent address map? Because we must set up the translation tables if we want to enable the caches. Addresses that reference hardware registers must not be cached, so we set up a linear (transparent) mapping for the entire 4G address space, but only enable caching for the 1G sub-space that could contain ram.

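The first complication that arises when reading the manual is that it discusses various "extensions" that may or may not be present on your specific processor. The ones that come up below are the Security Extensions, the Virtualization Extensions, the Large Physical Address Extension (LPAE), and the Multiprocessing Extensions.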

Which of these are present on the Allwinner H3? As near as I can tell, these are all available to a chip designer, who may include or omit them as they see fit. How do we know which we have and which we can ignore? It looks like you can read the "Processor Feature Registers" to find out. I do so, and here is what I see:
PFR0: 00001131
PFR1: 00011011
These are described in sections B4.1.93 and B4.1.94 of the manual.
Each 4 bit group (i.e. hex digit) indicates the state of some feature.

For PFR0 we have (right to left): ARM, Thumb2, Jazelle, ThumbEE

For PFR1, right to left: STD model, Security extensions, NO M profile, virtualization extensions, generic timer
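
The nibble-by-nibble reading above can be sketched in C. This is only a sketch: the helper names (id_field, read_id_pfr0, read_id_pfr1) are my own, and the CP15 reads are guarded so that they only compile on a 32 bit ARM target, where they must run in a privileged mode.

```c
#include <stdint.h>

/* Return 4-bit field number n of an ID register value.
 * Field 0 is bits 3:0, field 1 is bits 7:4, and so on. */
static inline unsigned id_field(uint32_t reg, unsigned n)
{
    return (reg >> (4u * n)) & 0xFu;
}

#if defined(__arm__)
/* Read ID_PFR0 from CP15 (privileged modes only). */
static inline uint32_t read_id_pfr0(void)
{
    uint32_t v;
    __asm__ volatile ("mrc p15, 0, %0, c0, c1, 0" : "=r"(v));
    return v;
}

/* Read ID_PFR1 from CP15 (privileged modes only). */
static inline uint32_t read_id_pfr1(void)
{
    uint32_t v;
    __asm__ volatile ("mrc p15, 0, %0, c0, c1, 1" : "=r"(v));
    return v;
}
#endif
```

With the values shown above, id_field(0x00011011, 1) picks out the Security extensions field and id_field(0x00011011, 3) the virtualization field.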

So we have answered two of our questions, but still don't know about LPAE and Multiprocessing extensions. There are 4 "memory model feature registers", but they don't discuss LPAE.

Searching the Cortex-A7 MPCore manual, I find the statement: "The Cortex-A7 MPCore processor supports the Virtualization Extensions (VE) and the Large Physical Address Extension (LPAE)". I am just going to ignore the multiprocessing extensions.

Translation tables

At this point I am interested in examining how U-Boot has set up the translation tables. This should start with looking at the TTBCR (translation table base control register).
TTBCR: 80000F00
The high bit is set, indicating that EAE (40 bit) addresses will be used. This means the TTBR0 is a 64 bit register!! This is something I have been unaware of up to now. There are special instructions that get or set such a thing from a pair of 32 bit arm registers:
/* Set 64-bit TTBR0 (upper 32 bits written as zero) */
asm volatile ( "mcrr p15, 0, %0, %1, c2" : : "r"(low_32), "r"(0) : "memory");
/* Get 64-bit TTBR0 */
asm volatile ( "mrrc p15, 0, %0, %1, c2" : "=r" (low_32), "=r" (hi_32) );
I read this out and see:
TTBR0: 00000000 7FFF4000
Note that if I just use mcr/mrc to access this register, I mess with the low 32 bits, which is fine as long as the upper 32 have been previously set to 0.

There is also another register (TTBR1), but it is never initialized or used. It reads as different random garbage after each power cycle of the board.

So, our first level page tables are at 0x7fff_4000 and they look like this:

TT 0: 7FFF4000 - 00000000 7FFF0003
TT 1: 7FFF4008 - 00000000 7FFF1003
TT 2: 7FFF4010 - 00000000 7FFF2003
TT 3: 7FFF4018 - 00000000 7FFF3003
Each of these 64 bit entries describes a 1G section of the address space. I was confused by these initially, expecting to find cacheability bits here, but the entries are all the same except for the addresses. They simply point to second level page tables with 2M entries (512 of them for each 1G; note that the addresses below step by 0x200000 and the tables are 0x1000 bytes apart). They look like this:
TT 0: 7FFF0000 - 00400000 00000441
TT 1: 7FFF0008 - 00400000 00200441
TT 2: 7FFF0010 - 00400000 00400441
TT 3: 7FFF0018 - 00400000 00600441

TT 0: 7FFF1000 - 00000000 40000449
TT 1: 7FFF1008 - 00000000 40200449
TT 2: 7FFF1010 - 00000000 40400449
TT 3: 7FFF1018 - 00000000 40600449

TT 0: 7FFF2000 - 00400000 80000441
TT 1: 7FFF2008 - 00400000 80200441
TT 2: 7FFF2010 - 00400000 80400441
TT 3: 7FFF2018 - 00400000 80600441

TT 0: 7FFF3000 - 00400000 C0000441
TT 1: 7FFF3008 - 00400000 C0200441
TT 2: 7FFF3010 - 00400000 C0400441
TT 3: 7FFF3018 - 00400000 C0600441
Here I have just dumped the first 4 of the 512 entries in each second level table.
Notice the difference for the second table, where the low word ends in "9" instead of "1" (and the upper word, shown first, has "0" in lieu of "4").
The second table describes addresses 0x4000_0000 to 0x7fff_ffff, which is ram.
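
To make the layout concrete, here is a small sketch of the index arithmetic. It assumes what the dumps suggest: level 2 entries step by 0x200000 (2M) and each level 2 table is 0x1000 bytes (512 entries of 8 bytes). The macro and function names are mine, not U-Boot's.

```c
#include <stdint.h>

#define L1_SHIFT   30u   /* each level 1 entry covers 1G */
#define L2_SHIFT   21u   /* each level 2 entry covers 2M */
#define L2_ENTRIES 512u  /* entries per level 2 table */
#define DESC_SIZE  8u    /* bytes per 64 bit descriptor */

/* Which of the 4 level 1 entries covers this address? */
static inline uint32_t l1_index(uint32_t va)
{
    return va >> L1_SHIFT;
}

/* Which entry within the selected level 2 table covers it? */
static inline uint32_t l2_index(uint32_t va)
{
    return (va >> L2_SHIFT) & (L2_ENTRIES - 1u);
}

/* Address of the level 1 descriptor, given the table base. */
static inline uint32_t l1_entry_addr(uint32_t l1_base, uint32_t va)
{
    return l1_base + l1_index(va) * DESC_SIZE;
}

/* Address of the level 2 descriptor, given that table's base. */
static inline uint32_t l2_entry_addr(uint32_t l2_base, uint32_t va)
{
    return l2_base + l2_index(va) * DESC_SIZE;
}
```

For example, address 0x40200000 lands in level 1 entry 1 (at 0x7FFF4008) and then in entry 1 of the table at 0x7FFF1000 (at 0x7FFF1008), which matches the dump.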

Analysis

After some searching in my U-Boot sources, I located the code that sets up the mmu. It is in U-boot/OrangePi/orangepi/arch/arm/lib/cache-cp15.c in the routine mmu_setup(). A reasonable name for such a routine. Note the "cp15", which refers to the infernal "Coprocessor 15" that handles all the processor control registers on a 32 bit ARM system. Note if you are studying this code that CONFIG_ARMV7_LPAE is defined and we have 64 bit page descriptors.

We have gotten ahead of ourselves, but have now presented the whole picture. Let's go back and look at the values in the registers and tables in detail.

First the "CR" (control register). We see: TTBCR: 80000F00. B4.1.153 in the manual describes this. We have already noticed that bit 31 (seen here as "8") enables extended addressing using 40 bits and 64 bit table entries. Why do this if we only have a 32 bit (4G) address space? It is because it allows large blocks (2M) and segments (1G) and thus a relatively compact page table.

The 4 bits set to "F" are OOII where OO selects "outer" cacheability and II selects "inner" cacheability. I have run around in circles trying to find out what inner and outer are talking about. This is clearly a flaw in the documentation. What I take it to be is that "inner" refers to the L1 cache and "outer" refers to the L2 cache. The setting with the two bits set to one (0x3) is "Outer Write-Back no Write-Allocate Cacheable". As I read the register description, these fields apply to the memory accesses made by the hardware table walk itself; the attributes for the mapped memory come from the table entries.
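
The TTBCR fields can be pulled apart with a few shifts. This is a sketch under the assumption that EAE is set, so the long-descriptor layout applies; the struct and function names are mine. It also extracts the low 3 bits (T0SZ) and the shareability field SH0 in bits 13:12.

```c
#include <stdint.h>

/* TTBCR fields in the long-descriptor (EAE = 1) layout. */
struct ttbcr_fields {
    unsigned eae;    /* bit 31: extended (40 bit) addressing enabled */
    unsigned sh0;    /* bits 13:12: shareability for TTBR0 table walks */
    unsigned orgn0;  /* bits 11:10: outer cacheability for TTBR0 walks */
    unsigned irgn0;  /* bits 9:8: inner cacheability for TTBR0 walks */
    unsigned t0sz;   /* bits 2:0: size offset for the TTBR0 region */
};

static inline struct ttbcr_fields ttbcr_decode(uint32_t v)
{
    struct ttbcr_fields f;
    f.eae   = (v >> 31) & 0x1u;
    f.sh0   = (v >> 12) & 0x3u;
    f.orgn0 = (v >> 10) & 0x3u;
    f.irgn0 = (v >> 8)  & 0x3u;
    f.t0sz  = v & 0x7u;
    return f;
}

#if defined(__arm__)
/* Read TTBCR from CP15 (privileged modes only). */
static inline uint32_t read_ttbcr(void)
{
    uint32_t v;
    __asm__ volatile ("mrc p15, 0, %0, c2, c0, 2" : "=r"(v));
    return v;
}
#endif
```

Feeding it 0x80000F00 gives EAE = 1, both cacheability fields equal to 3, and T0SZ = 0.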

The low 3 bits of the TTBCR determine how many bits in TTBR0 are used for the base address. I always just ignore this aspect of things and slap the actual address into this register, which works fine as long as all the bits beyond those used as the base are zero. There are no fancy bits in TTBR0, so it just holds the base address of the translation table:

TTBR0: 00000000 7FFF4000
This takes us to the level 1 table, which has 4 entries like so:
TT 0: 7FFF4000 - 00000000 7FFF0003
TT 1: 7FFF4008 - 00000000 7FFF1003
TT 2: 7FFF4010 - 00000000 7FFF2003
TT 3: 7FFF4018 - 00000000 7FFF3003
The low bit is a "valid" bit -- if it is 0, any access to that 1G block of addresses is invalid and causes a fault. The next bit (bit 1) is set to 1, and that indicates that this simply points to a level 2 table.
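
A sketch of that test in C follows. The function names are mine, and the address mask assumes a 40 bit physical address with 4K aligned tables, which is how I read the long-descriptor format.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit 0: entry is valid. */
static inline bool desc_valid(uint64_t d)
{
    return (d & 0x1u) != 0;
}

/* Bits 1:0 == 11: entry points to a next level table. */
static inline bool desc_is_table(uint64_t d)
{
    return (d & 0x3u) == 0x3u;
}

/* Next level table address, bits 39:12 of the descriptor. */
static inline uint64_t desc_table_addr(uint64_t d)
{
    return d & 0x000000FFFFFFF000ULL;
}
```

Applied to the second level 1 entry (00000000 7FFF1003), this reports a valid table descriptor pointing at 0x7FFF1000.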

The level 2 tables (there are 4 of them) look like so:

TT 0: 7FFF0000 - 00400000 00000441
TT 1: 7FFF0008 - 00400000 00200441
TT 2: 7FFF0010 - 00400000 00400441
TT 3: 7FFF0018 - 00400000 00600441

TT 0: 7FFF1000 - 00000000 40000449
TT 1: 7FFF1008 - 00000000 40200449
TT 2: 7FFF1010 - 00000000 40400449
TT 3: 7FFF1018 - 00000000 40600449

TT 0: 7FFF2000 - 00400000 80000441
TT 1: 7FFF2008 - 00400000 80200441
TT 2: 7FFF2010 - 00400000 80400441
TT 3: 7FFF2018 - 00400000 80600441

TT 0: 7FFF3000 - 00400000 C0000441
TT 1: 7FFF3008 - 00400000 C0200441
TT 2: 7FFF3010 - 00400000 C0400441
TT 3: 7FFF3018 - 00400000 C0600441
The low 2 bits again indicate the type of entry. Here we see "01", which marks a block entry that maps memory directly rather than pointing to another table. In this format, bits 2-11 and 52-63 give attributes.

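The "4" in the upper word (present for everything but RAM) is bit 22 of that word, which is bit 54 of the 64 bit entry. In the long-descriptor format bit 54 is "XN" (execute never); bit 53, one position lower, is "PXN" (privileged execute never, meaning do not execute here at PL1). So an instruction fetch from these regions would cause a fault.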

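Now, what about bits 11-2? We see 11-4 are always "44", which sets bit 10 (the access flag) and bit 6 (the low bit of the access permission field). The interesting thing is bit 3, which gets set to 1 for RAM. Bits 4-2 form the AttrIndx field, an index into the MAIR0/MAIR1 registers, so RAM uses attribute 2 and everything else uses attribute 0. The actual cacheability depends on what those MAIR entries hold, but the effect must be that this region is cacheable.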
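
Here is a sketch that splits a level 2 block entry into its fields, following my reading of the long-descriptor format in the manual (AttrIndx in bits 4:2, AP in bits 7:6, SH in bits 9:8, AF in bit 10, PXN in bit 53, XN in bit 54). The struct and function names are mine.

```c
#include <stdint.h>

/* Attribute fields of a long-descriptor level 2 block entry. */
struct block_attrs {
    unsigned attr_indx; /* bits 4:2: index into MAIR0/MAIR1 */
    unsigned ap;        /* bits 7:6: access permissions AP[2:1] */
    unsigned sh;        /* bits 9:8: shareability */
    unsigned af;        /* bit 10: access flag */
    unsigned pxn;       /* bit 53: privileged execute never */
    unsigned xn;        /* bit 54: execute never */
    uint64_t out_addr;  /* bits 39:21: base of the 2M block */
};

static inline struct block_attrs block_decode(uint64_t d)
{
    struct block_attrs a;
    a.attr_indx = (unsigned)((d >> 2) & 0x7u);
    a.ap        = (unsigned)((d >> 6) & 0x3u);
    a.sh        = (unsigned)((d >> 8) & 0x3u);
    a.af        = (unsigned)((d >> 10) & 0x1u);
    a.pxn       = (unsigned)((d >> 53) & 0x1u);
    a.xn        = (unsigned)((d >> 54) & 0x1u);
    a.out_addr  = d & 0x000000FFFFE00000ULL;
    return a;
}
```

Decoding 00400000 00000441 gives attribute index 0 with XN set, and 00000000 40200449 gives attribute index 2 with XN clear and a block address of 0x40200000.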

At this point I am getting lazy. That 1 in bit 3 indicates that ram should be cached, and selects a caching flavor. The ARM documentation is all but inscrutable about the translation table entries. Looking at U-Boot would be the better bet, with the help of ctags (naturally) and checking which CONFIG options are actually active for the Orange Pi.

But for my purposes, just mimicking that bit without fully understanding it will do just fine. Do I really care about the execute-never bit on other regions? It might catch some really crazy code runaways, but I doubt that it really matters.

So for me 0x441 for non-RAM and 0x449 for RAM.
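
If I ever want to build my own copy of these tables, the following sketch reproduces the values seen in the dumps. It is not U-Boot's code: the function name, the macro names and the parameters (l2_phys, ram_gb) are my own, and it keeps the execute-never bit on non-RAM regions just as U-Boot does. The tables would also need to be suitably aligned in memory, which this sketch leaves to the caller.

```c
#include <stdint.h>

#define DESC_TABLE   0x3ULL                 /* valid, points to next level */
#define DESC_NONRAM  0x0040000000000441ULL  /* as dumped for non-RAM blocks */
#define DESC_RAM     0x0000000000000449ULL  /* as dumped for RAM blocks */

/* Fill a 4 entry level 1 table and four 512 entry level 2 tables
 * with a linear (identity) map of the 4G address space.
 *   l2_phys: physical address where the first level 2 table will live;
 *            the other three are assumed to follow at 0x1000 intervals.
 *   ram_gb:  which 1G region (0..3) holds RAM and should be cached. */
void build_identity_map(uint64_t l1[4], uint64_t l2[4][512],
                        uint32_t l2_phys, unsigned ram_gb)
{
    for (unsigned g = 0; g < 4; g++) {
        uint64_t attrs = (g == ram_gb) ? DESC_RAM : DESC_NONRAM;

        /* Level 1 entry: table descriptor for this 1G region. */
        l1[g] = (uint64_t)(l2_phys + g * 0x1000u) | DESC_TABLE;

        /* Level 2 entries: one 2M block each, mapped to itself. */
        for (unsigned i = 0; i < 512; i++) {
            uint64_t pa = ((uint64_t)g << 30) | ((uint64_t)i << 21);
            l2[g][i] = pa | attrs;
        }
    }
}
```

Calling build_identity_map(l1, l2, 0x7FFF0000, 1) yields the same entries as the dumps above, for example 7FFF1003 in level 1 entry 1 and 40200449 in entry 1 of the RAM table.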

What I need to know all this for is when I kick off another core. The smart thing to do would be to just tell it to use the very same translation table as core 0. The other smart thing would be to ensure that I don't stomp on the table set up by U-Boot. And that might be a little trickier if I work on a different H3 board (like the NanoPi) with less ram. The "right thing" to do would be to have core 0 read the TTBR0 and post the value someplace to be used when I set up core 1.

Another very important discovery is that TTBR0 is a 64 bit register. I have been getting away with treating it like a 32 bit register for core 0, relying on the upper 32 bits being set to zero by U-Boot. But when I start core 1, I must be sure to initialize the entire 64 bits.
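
One way the hand-off to core 1 might look is sketched below. Everything here is hypothetical: the struct mmu_handoff, the function names, and the choice of the GCC/Clang builtin __sync_synchronize() as the barrier are my assumptions, not anything U-Boot provides. Keep in mind that core 1 starts with its caches off, so on real hardware core 0 would probably also need to clean this structure out of its data cache before core 1 reads it; that step is not shown.

```c
#include <stdint.h>

/* Mailbox that core 0 fills in and core 1 reads (hypothetical). */
struct mmu_handoff {
    uint32_t ttbr0_lo;        /* low 32 bits of TTBR0 */
    uint32_t ttbr0_hi;        /* high 32 bits of TTBR0 */
    uint32_t ttbcr;           /* TTBCR value to copy */
    volatile uint32_t ready;  /* set to 1 once the fields are valid */
};

/* Core 0: publish the register values, then raise the ready flag. */
static inline void handoff_post(struct mmu_handoff *h,
                                uint32_t lo, uint32_t hi, uint32_t ttbcr)
{
    h->ttbr0_lo = lo;
    h->ttbr0_hi = hi;
    h->ttbcr = ttbcr;
    __sync_synchronize();   /* make the values visible before the flag */
    h->ready = 1;
    __sync_synchronize();
}

/* Core 1: returns 1 and fills the outputs if values were posted. */
static inline int handoff_take(const struct mmu_handoff *h,
                               uint32_t *lo, uint32_t *hi, uint32_t *ttbcr)
{
    if (!h->ready)
        return 0;
    __sync_synchronize();   /* read the values only after the flag */
    *lo = h->ttbr0_lo;
    *hi = h->ttbr0_hi;
    *ttbcr = h->ttbcr;
    return 1;
}

#if defined(__arm__)
/* Write all 64 bits of TTBR0, so the upper half is never left stale. */
static inline void write_ttbr0_64(uint32_t lo, uint32_t hi)
{
    __asm__ volatile ("mcrr p15, 0, %0, %1, c2"
                      : : "r"(lo), "r"(hi) : "memory");
    __asm__ volatile ("isb" : : : "memory");
}
#endif
```

Core 0 would call handoff_post() with the values it reads from its own TTBR0 and TTBCR, and core 1 would call handoff_take() followed by write_ttbr0_64() before enabling its MMU.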


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org