January 12, 2025

Driving an LED panel - We need our ARM to run faster

This page is by no means specific to running a HUB75 panel. This is entirely about clocks on the Zynq.

A spoiler up front. This is mostly a detective story driven by erroneous information that indicated that the CPU was running at 10 Mhz. This was bogus and wrong, but did lead to several important discoveries.

Banging out GPIO bits to send data to the HUB75 panel did expose the issue. Even though I have two ARM cores in the Zynq, and they both can run at 666 Mhz, I discovered that the one I am using is running only at 10 Mhz. Why? And how can I fix this?

Zynq clock diagram

The above diagram from the TRM should explain everything. You also need to figure out where the registers are that control the various blocks in the diagram. For that, refer to section 25 of the Zynq TRM (technical reference manual) along with B.28 in the Appendix (page 1570) which lists and documents all the registers in the "slcr" (system level control register) section.

What do we know and how do we know it?

The CPU is running at 10 Mhz. We know this by reading out the CCNT (cycle count register) once per second and finding that the count increases by 10,000,000 each second. I do this using the Kyu "i 8" command, and here are the results:
CCNT for 1 sec: 9993902
CCNT for 1 sec: 10000467
CCNT for 1 sec: 10000496
CCNT for 1 sec: 10000485
CCNT for 1 sec: 10000483

We get a nice 50 Mhz Fabric Clock 1 in the FPGA. We have routed this to an external pin and measured it with a scope. We have also fiddled with the FPGA clock control registers in the slcr and been able to set the Fabric clock rate as we please, from 25 to 250 Mhz. The Fabric clock is derived from the IO clock, which is running at 1000 Mhz.

We see a multiplier of 30 set for the IO clock. We also have a 33.3 Mhz crystal on our board. Multiplying 33.3 by 30 gives us 1000 Mhz, which confirms that the PS_CLK in the clock diagram above is indeed running at 33.3 Mhz.

The 3 PLL registers

These contain a multiplier and several other bits, including a bypass bit.
PLL -- arm: 0x00028008
PLL -- ddr: 0x00020008
PLL -- io : 0x0001E008
I see that bit 3 is set (008) in all three. I was confused for some time thinking this was the bypass bit. It is not -- that is bit 4.

Consider the arm PLL register. The multiplier is 0x28 = 40. With a 33.3 Mhz crystal this would give 1332 Mhz. Then if we divide this by 2, we get 666 Mhz, which is exactly what we want.
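The decoding above can be sketched in C. This is only an illustration: the helper names are mine, and the field positions (multiplier in bits 18:12, forced bypass in bit 4, reset in bit 0) are my reading of the slcr register descriptions in the TRM, so check them against your copy.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions as I read them in the TRM (assumption, verify locally). */
#define PLL_RESET         (1u << 0)   /* low bit: PLL held in reset */
#define PLL_BYPASS_QUAL   (1u << 3)   /* the bit I first mistook for bypass */
#define PLL_BYPASS_FORCE  (1u << 4)   /* the real software bypass bit */
#define PLL_FDIV_SHIFT    12
#define PLL_FDIV_MASK     0x7Fu       /* 7 bit multiplier field */

/* Extract the multiplier (feedback divider) from a PLL control value. */
unsigned pll_fdiv(uint32_t ctrl)
{
    return (ctrl >> PLL_FDIV_SHIFT) & PLL_FDIV_MASK;
}

/* Nonzero when software is forcing the PLL to be bypassed. */
int pll_bypass_forced(uint32_t ctrl)
{
    return (ctrl & PLL_BYPASS_FORCE) != 0;
}

/* PLL output in Hz for a given crystal frequency, assuming no bypass. */
uint64_t pll_out_hz(uint32_t ctrl, uint32_t ps_clk_hz)
{
    assert(ps_clk_hz != 0);
    return (uint64_t)ps_clk_hz * pll_fdiv(ctrl);
}
```

With the values read above, pll_fdiv(0x00028008) is 40, pll_fdiv(0x0001E008) is 30, and pll_out_hz(0x00028008, 33300000) is 1332000000. None of the three registers has the forced bypass bit set.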

What does the timer tell us?

We have a timer giving us 1000 Hz interrupts for Kyu. As far as I remember, I used trial and error to set the values without having any real idea of what clock was feeding it.

The documentation could be better. As near as I can tell the timer (the "triple timer") is not in the IOP collection. Rather it is special and the Pclk it gets is the cpu_1x signal, which ought to be 111 Mhz in a properly configured system.

The timer has a prescaler of 16 and a preload value of 6666. Multiplying 6666 by 16 gives 106,656 input clocks per interrupt, and at 1000 interrupts per second that implies an input clock of about 106.7 Mhz, which is within a few percent of 111 Mhz. This suggests that the clock feeding the timer is the cpu_1x signal, and that that signal is indeed running at roughly 111 Mhz.
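The timer arithmetic can be written out as a small C sketch. The function names are hypothetical, and I am assuming the timer divides its input by exactly prescale times preload; whether the hardware really counts preload or preload+1 ticks is not settled here.

```c
#include <assert.h>
#include <stdint.h>

/* Input clock implied by a prescale/preload pair and the interrupt rate. */
uint64_t timer_implied_clock_hz(uint32_t prescale, uint32_t preload,
                                uint32_t irq_hz)
{
    return (uint64_t)prescale * preload * irq_hz;
}

/* Preload giving irq_hz interrupts from clock_hz, rounded to nearest. */
uint32_t timer_preload_for(uint64_t clock_hz, uint32_t prescale,
                           uint32_t irq_hz)
{
    uint64_t div = (uint64_t)prescale * irq_hz;

    assert(div != 0);
    return (uint32_t)((clock_hz + div / 2) / div);
}
```

For the settings above, timer_implied_clock_hz(16, 6666, 1000) is 106656000. Going the other way, a 111 Mhz input with the same prescaler would want a preload of 6938 for 1000 Hz.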

The timer documents call this "pclk". Note that there is a 4 bit prescaler inside the timer. (The one we note above is set to 16). It is not one of the 6 bit programmable dividers shown in the clock diagram above.

Section 8.5 of the TRM talks about the "triple timer". It can select one of three sources for its clock. Pclk is one, an external clock from MIO is another, and a clock from the PL (fpga) is the third. I see the Pclk versus Extclk selection in the Timer registers, but not the PL clock selection.

But even more important is that this is not the 3 way selection shown for IOP devices in the clock diagram above.

Experiments with the ARM PLL register

I then changed the multiplier (the PLL feedback divider) from 40 to 20. This had no apparent effect. The processor did not get upset and I still measured it as running at 10 Mhz.

It is as though the processor has some secret way of getting the 10 Mhz clock it is running on. It seems independent of what I am doing to the ARM PLL. Even more interesting, the timer also seems undisturbed. I ask the timer for a 20 second delay (using the Kyu command "i 8") and I get a 20 second delay. I would have expected the ARM PLL output to be halved and the timer to be running at half speed (and thus get a 40 second delay when I asked for 20). See below for more on this!

Some ideas at this point

First of all, my fooling with the ARM PLL registers affected neither the CPU nor the timer. It is as though these changes did nothing at all. Maybe that is possible. There may be rules for monkeying with these settings that I don't know about. Even if my changes don't affect the CPU for some reason, I would expect them to affect the timer.

On page 1578, the TRM says that the PLL must first be bypassed and then put into reset mode before changing the divisor. The reset is the low bit of the same register .... Aha!
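The order of operations described there might look like the following C sketch. It is not taken from the TRM or from Kyu: the function name and the lock callback are hypothetical, the bit positions are my reading of the register description, and the caller is expected to pass a pointer to the PLL control register. On real hardware the slcr may need to be unlocked first, and the TRM also describes a companion PLL configuration register whose values depend on the multiplier; neither is handled here.

```c
#include <assert.h>
#include <stdint.h>

#define PLL_RESET         (1u << 0)
#define PLL_BYPASS_FORCE  (1u << 4)
#define PLL_FDIV_SHIFT    12
#define PLL_FDIV_MASK     (0x7Fu << PLL_FDIV_SHIFT)

/* Hypothetical callback: returns nonzero once the PLL reports lock. */
typedef int (*pll_locked_fn)(void);

/* Change the PLL multiplier: bypass, reset, rewrite, release, un-bypass. */
void pll_set_fdiv(volatile uint32_t *ctrl, unsigned fdiv, pll_locked_fn locked)
{
    assert(fdiv >= 1 && fdiv <= 0x7F);

    *ctrl |= PLL_BYPASS_FORCE;      /* 1. run downstream clocks from bypass */
    *ctrl |= PLL_RESET;             /* 2. hold the PLL in reset */
    *ctrl = (*ctrl & ~PLL_FDIV_MASK)
          | ((uint32_t)fdiv << PLL_FDIV_SHIFT);   /* 3. new multiplier */
    *ctrl &= ~PLL_RESET;            /* 4. release reset */
    while (locked && !locked())     /* 5. wait for lock, if we can check */
        ;
    *ctrl &= ~PLL_BYPASS_FORCE;     /* 6. switch back to the PLL output */
}
```

Run against a plain variable holding 0x00028008, a call with fdiv = 20 leaves 0x00014008, which is a handy way to check the bit twiddling before pointing it at real hardware.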

Second is that the 10 Mhz CPU clock does not correspond to anything I am finding in reading about clocks. Could it be some kind of boot setting that needs to be bypassed somewhere else entirely?

What if the PLLs are indeed bypassed?

Our crystal is 33.3 Mhz and our FPGA fabric clock does seem to run at the proper 50 Mhz (derived from the IO PLL). If the ARM PLL were bypassed, this 33.3 Mhz clock would go to the divider, which would divide by 2 and give 16.65 Mhz into the clock ratio generator. Now, if this 16.65 Mhz replaced the expected 666 Mhz, cpu_1x would then be 2.775 Mhz -- I don't see any way to reconcile this with the proper behavior of the timer we see.
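The two cases can be compared with a few lines of C. This sketch assumes the chain used in the reasoning above: a divide by 2 after the PLL (or after the bypassed crystal), then a 6 to 1 ratio between the CPU clock and cpu_1x, which is what the 666 and 111 Mhz figures imply. The type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t cpu_6x;   /* CPU core clock in Hz */
    uint64_t cpu_1x;   /* slow peripheral clock (timer Pclk) in Hz */
} cpu_clocks;

/* Derive the CPU clocks from whatever feeds the divider. */
cpu_clocks cpu_clocks_from(uint64_t src_hz, unsigned divisor)
{
    cpu_clocks c;

    assert(divisor != 0);
    c.cpu_6x = src_hz / divisor;   /* programmable divider */
    c.cpu_1x = c.cpu_6x / 6;       /* clock ratio generator, assumed 6:1 */
    return c;
}
```

Fed with 33300000 * 40 and a divisor of 2 this gives 666000000 and 111000000. Fed with the bare 33300000 crystal it gives 16650000 and 2775000, which matches neither the 10 Mhz reading nor a working 1000 Hz timer.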

More experiments with the ARM PLL register

Now I know that the PLL must be both bypassed and held in reset. I skip the bypass, but do the reset and get results. I change the multiplier from 40 to 20 and now my supposed 20 second test (governed by the timer) takes 40 seconds. This tells me that cpu_1x is running at half the prior rate. But interestingly, the CCNT values now read for 2 seconds (measured by stopwatch) are identical to what the 1 second values were. This suggests that the CPU clock itself (supposedly cpu_6x) also was cut in half! Well, that is surprising but suggests that the CPU is somehow linked to the ARM PLL. I start to wonder about my 10 Mhz measurement.

What about that CCNT thing?

It is all but impossible to figure out what ARM documents describe all this. I have learned that processor details can change between specific processors. I worked out my CCNT code using the Cortex-A8 on the BBB. The code then seemed to work OK with the Orange Pi (Cortex A7). The Zynq has a Cortex A9, so maybe we have something new going on.

I will note that the CCNT has a divide by 64 feature that can be enabled. Suppose it was accidentally enabled (due to some effectively undocumented change in the Cortex A9). That would mean that my processor is really running at 640 Mhz (which is mighty close to the expected 666). This seems awfully suspicious. Maybe I can figure out an independent way to check the clock speed. Something like cranking out pulses on some MIO pin on the Zynq. 666/64 = 10.41 Mhz. Very suspicious.
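A sketch of how the divide by 64 could be accounted for in C follows. It is not the Kyu code; the function names are mine. It relies on my understanding that bit 3 (D) of the ARMv7 performance monitor control register selects counting every 64th cycle, and the coprocessor read is only compiled on a 32 bit ARM target and only works in a privileged mode.

```c
#include <assert.h>
#include <stdint.h>

#define PMCR_D  (1u << 3)   /* assumed: 1 = cycle counter ticks every 64 cycles */

#if defined(__arm__)
/* Read the performance monitor control register (privileged modes only). */
static inline uint32_t read_pmcr(void)
{
    uint32_t v;
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r" (v));
    return v;
}
#endif

/* Counts between two reads; unsigned subtraction copes with one wrap. */
uint32_t ccnt_delta(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Convert a counter delta over interval_ms into a CPU clock in Hz. */
uint64_t ccnt_to_hz(uint32_t delta, uint32_t pmcr, uint32_t interval_ms)
{
    uint64_t cycles = delta;

    assert(interval_ms != 0);
    if (pmcr & PMCR_D)
        cycles *= 64;       /* undo the divide by 64 */
    return cycles * 1000u / interval_ms;
}
```

With the divider bit set, a one second delta of 10000483 works out to 640030912 Hz. With it clear, the same delta is just 10000483 Hz, which is the misleading figure I started with.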

Note that my 1000 Hz timer was set up expecting a 100 Mhz Pclk, but I now know it is a 111 Mhz signal. I could recalibrate my timer knowing this and might learn some new things. But the first thing to do is set up the experiment with MIO pulses. Some decent CCNT performance monitor documentation for the Cortex A9 would certainly be nice.

The next day

I do two things. I adjust the preload for my 1000 Hz timer based on what I now know is the 111 Mhz clock that feeds it. I also scale up my CCNT values by 64 based on my theory that the divide by 64 feature of the CCNT performance monitoring system is getting enabled somehow in this Cortex A9 system. This gives me:
Kyu (zynq), ready> i 8
Kyu (zynq), ready> Collecting data for 8 seconds
CCNT for 1/10 sec: 665453568
CCNT for 1/10 sec: 666046464
CCNT for 1/10 sec: 666047296
CCNT for 1/10 sec: 666046720
CCNT for 1/10 sec: 666046976
CCNT for 1/10 sec: 666046976
CCNT for 1/10 sec: 666046592
CCNT for 1/10 sec: 666047104
Looks like 666 Mhz CPU clock
So I am calling this case closed. The big surprise was the divide by 64 in the CCNT system. But I learned a lot about the Zynq clocks and also made the useful discovery that the timer is getting a 111 Mhz clock, which allowed me to make adjustments and improve accuracy.

A last (?) experiment

I wrote some code to generate a square wave on the LAT pin. Here is the assembly the compiler generated, followed by the C code it came from:
20004b94:   e51b3008    ldr r3, [fp, #-8]
20004b98:   e51b2014    ldr r2, [fp, #-20]  @ 0xffffffec
20004b9c:   e5832010    str r2, [r3, #16]
20004ba0:   e51b3008    ldr r3, [fp, #-8]
20004ba4:   e51b2018    ldr r2, [fp, #-24]  @ 0xffffffe8
20004ba8:   e5832010    str r2, [r3, #16]
20004bac:   eafffff8    b   20004b94 

    for ( ;; ) {
        gp->output2_low = m_on;
        gp->output2_low = m_off;
    }
I measure 1.54 Mhz, with a 330 ns high time and a 320 ns low time. This boils down to 7 instructions in the loop, and the loop taking 650 ns, so we get 92.9 ns per instruction.

This is pretty close to the 100 ns per instruction that a 10 Mhz CPU clock would give us. This is both surprising and disappointing.
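To put the scope numbers in terms of CPU cycles, here is the arithmetic as a small C sketch. The function names are made up, and the 666 Mhz figure is the CPU clock worked out earlier; the result is only an average over the 7 instructions, most of which are loads and stores.

```c
#include <assert.h>

/* Frequency of a square wave from its measured high and low times. */
double square_wave_hz(double high_ns, double low_ns)
{
    assert(high_ns + low_ns > 0.0);
    return 1e9 / (high_ns + low_ns);
}

/* Average time per instruction for one pass through the loop. */
double ns_per_instruction(double high_ns, double low_ns, unsigned n_instr)
{
    assert(n_instr != 0);
    return (high_ns + low_ns) / n_instr;
}

/* How many CPU clocks that average represents at a given clock rate. */
double cycles_per_instruction(double ns_per_instr, double cpu_hz)
{
    return ns_per_instr * 1e-9 * cpu_hz;
}
```

For 330 ns high, 320 ns low and 7 instructions this gives about 92.9 ns per instruction, or roughly 62 CPU cycles per instruction at 666 Mhz. A number that large points at memory or peripheral access time rather than at the core clock.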

My take on this is that the 10 Mhz rate is just a coincidence (in that it nearly matches the erroneous 10 Mhz that we sorted out above). My guess is that I need to investigate whether caches are enabled, and related issues involving the caches. Rather than invest time in that now, I am going to transition my work to using the FPGA for better performance sending data to my HUB75 panel.


Have any comments? Questions? Drop me a line!

Tom's Electronics pages / tom@mmto.org