Tom's computer pages

November 10, 2023

Let's learn USB! -- Delays

There is absolutely nothing here that is specific to USB.

This topic did arise from my investigation of SOF interrupts, which you can read about in the next page.

I have a routine "delay_ms ( count )" that you can call to delay for a certain number of milliseconds. It is implemented like so:

static void
delay_one_ms ( void )
{
        volatile int count = 7273;

        while ( count-- )
            ;
}

Actually I lied. I call this N times to get an N-millisecond delay, but you get the idea. I typically "tune" it using a stopwatch and trial and error, and arrive at a number that gives close to a 1 ms delay. My investigation of USB frames revealed that the value above actually yields a delay of about 0.913 milliseconds. So I could grab my calculator and do this:

N = 7273/0.913 = 7966

Using that value would no doubt be an improvement, but I am curious if we can work on this from the other direction and maybe learn something via the process.
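The calculator step generalizes to any tuned count and any measured delay. Here is a minimal sketch of that rescaling in integer arithmetic. The helper name "rescale_count" and the choice of microseconds as the unit are made up for illustration and are not part of the papoon code.

```c
#include <stdint.h>

/* Hypothetical helper: scale a tuned loop count so that a delay that
 * measured measured_us microseconds becomes target_us microseconds.
 * The result is rounded to the nearest integer. */
static uint32_t
rescale_count ( uint32_t old_count, uint32_t measured_us, uint32_t target_us )
{
        /* 64 bits so that old_count * target_us cannot overflow */
        uint64_t scaled = (uint64_t) old_count * target_us;

        return (uint32_t) ( ( scaled + measured_us / 2 ) / measured_us );
}
```

Calling rescale_count ( 7273, 913, 1000 ) gives 7966, the same number as the calculator.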

My make command always (or often) yields a file xyz.dump for project "xyz". In this case I do indeed have papoon.dump, which contains disassembled code for the entire binary that I flash into the chip.

08000158 :
 8000158:       b168            cbz     r0, 8000176 
 800015a:       1e41            subs    r1, r0, #1
 800015c:       f641 4069       movw    r0, #7273       @ 0x1c69
 8000160:       b082            sub     sp, #8
 8000162:       9001            str     r0, [sp, #4]
 8000164:       9b01            ldr     r3, [sp, #4]
 8000166:       1e5a            subs    r2, r3, #1
 8000168:       9201            str     r2, [sp, #4]
 800016a:       2b00            cmp     r3, #0
 800016c:       d1fa            bne.n   8000164 
 800016e:       3901            subs    r1, #1
 8000170:       d2f7            bcs.n   8000162 
 8000172:       b002            add     sp, #8
 8000174:       4770            bx      lr
 8000176:       4770            bx      lr

I search through papoon.dump and find the above. The optimizer has forced the "delay_one_ms" function inline and discarded the name. Note the useless second "bx lr" instruction that has resulted. We even see our magic value "7273". We have an outer loop (that doesn't particularly interest us right now) and an inner loop. The inner loop is:

 8000164:       9b01            ldr     r3, [sp, #4]
 8000166:       1e5a            subs    r2, r3, #1
 8000168:       9201            str     r2, [sp, #4]
 800016a:       2b00            cmp     r3, #0
 800016c:       d1fa            bne.n   8000164

So, we have 5 ARM instructions. Also, the processor is running at 72 MHz. If this were truly a RISC processor (and it is!) each instruction would take a single clock cycle. But other issues may get involved. The branch instruction could flush the pipeline, and I know we have specified 2 wait states for accesses to flash memory.

Now consider our 72 MHz processor clock. In 1 ms we will have 72,000 cycles. According to the calculation above, the inner loop has to run 7966 times to fill a millisecond. So calculate 72,000/7966 = 9.038. Let's call that 9 processor cycles to run the above 5 instructions. I can believe that, but am unsure how to partition those 9 cycles among the 5 instructions. Certainly each instruction uses at least 1 cycle. That leaves 4 left over. A true ARM guru would know the answer. Here is a wild guess -- the two instructions that access the stack use 3 cycles each and all the other instructions use 1 cycle. That would total to 9. But I lose interest at this point, although several interesting questions do arise.
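The arithmetic in the paragraph above can be written down as a short sketch. The function names are invented for illustration. The per-instruction table is only the wild guess from the text, not measured or documented timing.

```c
#include <stdint.h>

/* Cycles per loop iteration in thousandths of a cycle (no floating point).
 * clock_hz is the processor clock, iters_per_ms the number of loop
 * iterations that fill one millisecond. The result is truncated. */
static uint32_t
millicycles_per_iter ( uint32_t clock_hz, uint32_t iters_per_ms )
{
        uint64_t cycles_per_ms = clock_hz / 1000u;

        return (uint32_t) ( cycles_per_ms * 1000u / iters_per_ms );
}

/* The wild guess: 3 cycles for each stack access, 1 for everything else.
 * These numbers are an assumption, not taken from any manual. */
static const uint8_t guess_cycles[5] = {
        3,      /* ldr   r3, [sp, #4] */
        1,      /* subs  r2, r3, #1   */
        3,      /* str   r2, [sp, #4] */
        1,      /* cmp   r3, #0       */
        1       /* bne.n              */
};

static uint32_t
guess_total ( void )
{
        uint32_t sum = 0;

        for ( int i = 0; i < 5; i++ )
            sum += guess_cycles[i];

        return sum;
}
```

Here millicycles_per_iter ( 72000000, 7966 ) gives 9038, i.e. 9.038 cycles, and guess_total () gives 9.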

If our 9 cycles per loop iteration is correct and our goal is to burn up 72,000 cycles to delay a millisecond, we can calculate the number of loop iterations simply enough. N = 72,000/9 = 8000. Let's put that value into our delay function and move on.
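The same calculation as a sketch in C, so the count follows the clock rather than being typed in by hand. The names "CPU_HZ", "CYCLES_PER_ITER" and "iters_per_ms" are made up for this example, and the 9 cycles per iteration is the estimate from above, not a documented figure.

```c
#include <stdint.h>

#define CPU_HZ          72000000u       /* processor clock, 72 MHz */
#define CYCLES_PER_ITER 9u              /* estimate from the text, an assumption */

/* Loop iterations needed to burn one millisecond of processor cycles. */
static uint32_t
iters_per_ms ( uint32_t cpu_hz, uint32_t cycles_per_iter )
{
        return ( cpu_hz / 1000u ) / cycles_per_iter;
}
```

With the values above, iters_per_ms ( CPU_HZ, CYCLES_PER_ITER ) gives 8000. If the clock were changed, the count would scale along with it, provided the loop still costs the same number of cycles.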

When I do this, I get a delay of 1.005 milliseconds. This is good enough for me and certainly better than 0.913. Why aren't we spot on? There are two possibilities.

One is that our 72 MHz clock is derived from an 8 MHz crystal of unknown precision. The other is that there is some overhead in setting up the loop, calling the delay_ms function, and in the case of multiple millisecond delays, running the loop wrapped around the "delay_one_ms()" function. This leads to the following idea. Why not have a single loop and use N*8000 as the count, like the following?

void
delay_ms ( int ms )
{
        /* 8000 iterations at about 9 cycles each is roughly 1 ms at 72 MHz */
        volatile int count = ms * 8000;

        while ( count-- )
            ;
}

I do this and still measure a 1.005 millisecond delay. I'll blame the crystal. If I was endlessly curious and had time to burn, I might try running this on different boards, but I'm not that curious.
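Two numbers fall out of the 1.005 ms measurement and may help in weighing the explanations: the clock error it would imply, and the cycles per iteration it would imply if the clock were exact. A small sketch follows. The function names are invented for illustration.

```c
#include <stdint.h>

/* Error of a measured delay relative to the target, in parts per million.
 * Positive means the delay ran long. */
static int32_t
error_ppm ( uint32_t measured_us, uint32_t target_us )
{
        int64_t diff = (int64_t) measured_us - (int64_t) target_us;

        return (int32_t) ( diff * 1000000 / (int64_t) target_us );
}

/* Cycles per loop iteration (in thousandths of a cycle) implied by a
 * measured delay, assuming the clock runs at exactly clock_hz. */
static uint32_t
implied_millicycles ( uint32_t clock_hz, uint32_t measured_us, uint32_t iters )
{
        /* total cycles spent during the measured delay */
        uint64_t cycles = (uint64_t) clock_hz * measured_us / 1000000u;

        return (uint32_t) ( cycles * 1000u / iters );
}
```

Here error_ppm ( 1005, 1000 ) gives 5000, which is 0.5 percent, and implied_millicycles ( 72000000, 1005, 8000 ) gives 9045, or about 9.045 cycles per iteration. That is close to the 9.038 that came out of the 0.913 ms measurement earlier, which may hint that the loop averages slightly more than 9 cycles, though that is only speculation.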

Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org