January 17, 2023

Kyu networking -- make the Orange Pi run at 100 Mbit -- review

I have been working on this for over a week with many detours, and it seems like a good time to review and plan what to do next. The problem came to light when I booted up Kyu on a BBB and saw a 1M transfer run in 0.17 seconds! That is the kind of performance I am looking for on the Orange Pi.

My current (new) thought is that although it now takes 5 full seconds to transfer a 1M file on the Orange Pi, I really may have the network in 100 Mbit mode. This is what the lights on my network switch say, after all. I tried three experiments timing the 1M transfer.

The business of "inheriting" the U-Boot setup is done by commenting out the call to phy_init() in Kyu (as well as the call to emac_reset()) and just using the hardware in whatever state U-Boot has left it.
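
In code terms this is a trivial change, something like the following sketch (the #ifdef name is mine, not anything in Kyu):

    /* Sketch of the idea, not the actual Kyu driver code. */
    void
    emac_init ( void )
    {
    #ifndef INHERIT_UBOOT
        emac_reset ();      /* full reset of the emac */
        phy_init ();        /* renegotiate the link from scratch */
    #endif
        /* ring setup, interrupt enable, etc. as usual */
    }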

The new idea is that my PHY and emac initialization is correct and something else is producing the slow transfer rates. 5 seconds is long even for 10 Mbit. I would expect 10 Mbit on the BBB to take maybe 1.7 seconds (i.e. 10 times longer than the 100 Mbit time). 5 seconds is out of the question regardless.

It seems unlikely to be the TCP code itself, since this is exactly the same on both the BBB and the Orange Pi. I don't know where this big delay is coming from, but it is unique to the Orange Pi.

Wireshark shows that the transfer takes place 500 bytes at a time. With 1M bytes, there are 2000 of these "chunks", each consisting of a packet carrying a 500 byte payload and the corresponding ACK. 5/2000 = 2.5 milliseconds per "chunk". So we are looking to find out where just over 2 milliseconds is being wasted. Wireshark timings may be helpful.

Now consider the BBB, where the 1M transfer happens in 0.17 seconds. Take 0.17/2000 = 0.085 milliseconds (85 microseconds) per "chunk".

And consider time on the wire for a 500 byte (4000 bit) message: at 100 Mbit/s that is only 40 microseconds. Clearly 2 milliseconds is crazy.

I wrote some code to use the CCNT hardware counter to timestamp different events as the TCP transfer is taking place.
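
The mechanism is simple: each sample point reads the ARMv7 cycle counter (CCNT) and records the delta from the previous event. Here is a sketch of the idea (my reconstruction, not the exact Kyu code; the CPU_MHZ scaling is an assumption, chosen so the numbers below come out in microseconds):

    /* Sketch: timestamps from the ARMv7 cycle counter.
     * CPU_MHZ is an assumed constant (1000 on the BBB, about
     * 1008 on the H3) used to scale raw cycles to microseconds.
     */
    #define CPU_MHZ 1000

    /* one-time setup: turn the cycle counter on */
    static void
    ccnt_enable ( void )
    {
        asm volatile ( "mcr p15, 0, %0, c9, c12, 0" : : "r" (1) );          /* PMCR.E */
        asm volatile ( "mcr p15, 0, %0, c9, c12, 1" : : "r" (0x80000000) ); /* CCNT on */
    }

    static inline unsigned int
    ccnt_read ( void )
    {
        unsigned int val;

        asm volatile ( "mrc p15, 0, %0, c9, c13, 0" : "=r" (val) );
        return val;
    }

    static unsigned int last_stamp;

    /* called at each event point (stcp, send, tx, rx, rtcp) */
    static unsigned int
    event_stamp ( void )
    {
        unsigned int now = ccnt_read ();
        unsigned int delta = (now - last_stamp) / CPU_MHZ;

        last_stamp = now;
        return delta;       /* microseconds since the previous event */
    }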

     stcp     send       tx       rx     rtcp
        0      880       56     3368       39
     1979      880        0        0        0
      167      880        8        0        0
      167      929       55       70      186
     5131      880        0        0        0
      167      932        8       95        0
      168      880        8        0        0
      167      895       55      105      179
     5124      880        0        0        0
      167      940       15       87        0

The times shown are in microseconds (1 MHz ticks) since the previous "event". So the "send" column shows the time from the "stcp" event (which is TCP putting the packet on the output queue) to the actual "send" event in the network driver that hands the packet to the hardware. The "tx" event is the interrupt that says the packet has been sent and the hardware is ready for another.

It is alarming that the "send" column shows a 0.88 millisecond delay after TCP decided to send a packet. This is systematic and significant.
And here are the results from the BBB --

     stcp     send       tx       rx     rtcp
        0 -1470440        0        0        0
        8       12        0        0        0
        7       12        0        0        0
        7       12       71        0        0
       34       11        0        0        0
        7       12        0        0        0
        7       12        0        0        0
        7       12       71      538       11
       38       11        0        0        0
        8       12        0        0        0

The differences are quite striking. My attention is particularly captured by the "send" column in the Orange Pi data, but the occasional 5 and 2 millisecond delays in the "stcp" column for the Orange Pi are certainly of concern. The events are:
stcp - TCP code sends a packet
send - network driver sends the packet
tx   - network driver gets interrupt after sending
rx   - network driver gets receive interrupt
rtcp - TCP code gets packet to process

It is highly suggestive that the counts in the "send" column are so close to 1 ms, which is the system timer rate. There is no reason why packets should "stall" in the output queue for 1 clock tick, but they seem to. And they do not seem to on the BBB. Consider that if each "chunk" suffers a 1 ms delay of this sort and there are 2000 "chunks" in a 1M transfer, this delay totals to 2 seconds.

So who are the players who deal with the network output queue? The tcp-input thread running at priority 24 (or, less likely, the user thread at priority 31 making a call to tcp_output) puts the packet on the queue. The net-out thread running at priority 21 should be removing the packet and calling emac_send(). (That is its sole purpose for existing, which is why it runs at the high priority of 21.) It is hard to see why a 1 millisecond delay should be going on here (and/or why it is absent on the BBB).
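
For reference, here is roughly what that producer/consumer pair looks like. This is a sketch, not the actual net_main.c code; netq_add(), netq_remove(), and the exact semaphore calls are my stand-ins for the real Kyu primitives:

    /* Sketch of the output queue producer/consumer (names approximate). */
    static struct netq net_outq;                /* assumed queue type */
    static struct sem *net_outq_sem;

    /* producer -- TCP (via ip_output() and friends) queues a packet */
    void
    net_send ( struct netbuf *nbp )
    {
        netq_add ( &net_outq, nbp );
        sem_unblock ( net_outq_sem );           /* wake net-out */
    }

    /* consumer -- this loop is the entire net-out thread */
    static void
    net_out_thread ( int xxx )
    {
        struct netbuf *nbp;

        for ( ;; ) {
            sem_block ( net_outq_sem );
            while ( (nbp = netq_remove ( &net_outq )) )
                emac_send ( nbp );              /* hand the packet to the driver */
        }
    }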

Here is the thread mix on the BBB (the Orange Pi is exactly the same):

  Thread:       name (  &tp   )    state     sem       pc       sp     pri
  Thread:      shell (400a7a20)   READY I          40004620 40568000   11
  Thread:     net-in (400a7650)     SEM I  net-inq 40011170 40570000   20
  Thread:    net-out (400a755c)     SEM J net-outq 40010cd0 40572000   21
  Thread:  net-timer (400a7468)  REPEAT I          40011170 40574000   22
  Thread:  tcp-input (400a7838)     SEM I  tcp-inq 4001116c 4056c000   24
* Thread:  tcp-timer (400a7744)   READY I          40011170 4056e000   25
  Thread: wangdoodle (400a7280)     SEM J   socket 40010cd0 40578000   31
  Thread:       idle (400a792c)   READY C          400111c4 4056a000 1234

I should note here that I am giving thought to getting rid of the net-out thread. The device driver itself could pull packets from the output queue directly, and that could be handled in the Tx interrupt routine. This is pretty much what goes on in NetBSD. We certainly need an output queue, but perhaps not a task dedicated to emptying it. Nonetheless, I want to leave things as they are and find out what is different between the BBB and the Orange Pi with the current setup.
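
For what it's worth, the NetBSD-style scheme I have in mind would look something like this (a sketch only; splnet()/splx() and the helper names are assumptions, not Kyu code):

    /* Sketch only -- the same net_outq is kept; only the thread goes away. */
    static int tx_busy;

    void
    emac_send ( struct netbuf *nbp )
    {
        int x = splnet ();                      /* lock out the Tx interrupt */

        if ( tx_busy )
            netq_add ( &net_outq, nbp );        /* interrupt will get to it */
        else {
            tx_busy = 1;
            emac_start_tx ( nbp );              /* hardware idle, send right now */
        }
        splx ( x );
    }

    /* the Tx "done" interrupt keeps the queue draining */
    void
    emac_tx_isr ( void )
    {
        struct netbuf *nbp;

        if ( (nbp = netq_remove ( &net_outq )) )
            emac_start_tx ( nbp );              /* tx_busy stays set */
        else
            tx_busy = 0;
    }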

Divide and conquer

A new day has begun (1-18-2023). The idea is to focus on the 0.9 millisecond delay between TCP sending a packet and the network driver handing that packet to the hardware. I am going to add two more timing points, "A" and "B", to my display. I will position A right after inserting the packet into the queue, just before signaling that a new thing is in the queue. I will position B right after being alerted that a new packet is available. I am also going to add the display of the thread name sending the packet (is it tcp-input or tcp-timer?). I will also tidy up the output a bit.

Here is the result. I am surprised. I expected the delay to appear in the B column and to be something to do with one task signaling another and some other task hogging the processor. What we see, though, is the 0.888 millisecond delay now fairly evenly split between the "A" and "send" columns.

     stcp        A        B     send       tx       rx     rtcp
        0      413        9      459       56     3369       39
     1979      420        9      451       -       -       -
      167      421        8      451        8       -       -
      167      496        8      451       56       67      215
     5148      413        9      459       -       -       -
      167      479        9      459        8       81       -
      168      413        8      458        8       -       -
      167      460        9      451       56       81      178
     5139      413        9      451       -       -       -
      167      476        9      459       15       84       -

I'll note that I did an experiment yesterday to simulate the business of one task signaling another with a semaphore, and got exactly the same kind of delays, around 8 or 9 microseconds. This result confirms that the experiment was testing the right thing.
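
That experiment was along these lines (a sketch from memory; the thread and semaphore calls may not match the real Kyu API exactly; ccnt_read() and CPU_MHZ are as in the timestamp sketch above):

    /* Sketch of the semaphore wakeup latency test (names approximate). */
    static struct sem *test_sem;
    static unsigned int t_signal;

    static void
    waiter_thread ( int xxx )
    {
        for ( ;; ) {
            sem_block ( test_sem );
            printf ( "wakeup after %d microseconds\n",
                (ccnt_read () - t_signal) / CPU_MHZ );
        }
    }

    void
    sem_latency_test ( void )
    {
        int i;

        test_sem = sem_signal_new ( SEM_FIFO );                 /* assumed name */
        (void) thr_new ( "waiter", waiter_thread, 0, 20, 0 );   /* assumed args */

        for ( i = 0; i < 10; i++ ) {
            thr_delay ( 50 );                   /* settle, then fire */
            t_signal = ccnt_read ();
            sem_unblock ( test_sem );           /* waiter reports ~9 us later */
        }
    }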

We could start spinning theories, or we could start moving the A and B sample points in the code to pin down where the delay is taking place. What I do next is in net_main.c: I put "A" just before locking interrupts and adding the packet to the queue, and I put "B" where A used to be (after the packet has been added to the queue, interrupts have been unlocked, and just before firing the signal with the semaphore). Now we get this:

     thread     stcp        A        B     send       tx       rx     rtcp
   wang_thr        -      419        1      459       56     3376       39
   wang_thr     1980      413        1      467        -        -        -
   wang_thr      168      411        1      467        8        -        -
   wang_thr      167      496        1      467       55       37      180
   wang_thr     5124      419        1      459        -        -        -
   wang_thr      168      522        1      459        8       45        -
   wang_thr      168      412        1      459       15        -        -
   wang_thr      168      507        1      467       56       26      179
   wang_thr     5131      412        1      467        -        -        -
   wang_thr      167      513        1      459        8       54        -

Here I have prettied up the output a bit and added a "thread" column. This shows us which thread is actually calling tcp_output to send the packet, and it turns out to be the user thread calling tcp_send(). To my surprise (again), the delay is not between points A and B, where we lock interrupts and put the packet on the queue; it is some time before that.

Now I move A and B into ip_arp_send() and rerun the above.

     thread     stcp        A        B     send       tx       rx     rtcp
   wang_thr        -      404        8      461       56     3369       46
   wang_thr     1980      403        8      468        -        -        -
   wang_thr      168      404        8      468        8        -        -
   wang_thr      167      497        8      460       56       34      178
   wang_thr     5125      412        8      461        -        -        -
   wang_thr      168      477        8      461        8       82        -
   wang_thr      167      403        7      469       15        -        -
   wang_thr      168      495        7      469       56       29      179
   wang_thr     5131      412        8      461        -        -        -
   wang_thr      168      490        8      461        8       67        -

We get the above, showing that the delay occurs prior to calling ip_arp_send(). Now we position A and B in net_ip.c, putting A at the very start of ip_send() and B right before the call to ip_arp_send(). We get this:
     thread     stcp        A        B     send       tx       rx     rtcp
   wang_thr        -      390       14      476       55     3369       39
   wang_thr     1979      390       14      476        -        -        -
   wang_thr      167      398       14      468        9        -        -
   wang_thr      167      440       14      468       55       78      179
   wang_thr     5138      390       14      477        -        -        -
   wang_thr      167      481       14      477        8       56        -
   wang_thr      168      390       14      476        8        -        -
   wang_thr      168      431       14      476       56       79      179
   wang_thr     5124      397       14      469        -        -        -
   wang_thr      168      464       14      469        8       80        -

Now the delay is moving back into the TCP code. Next we position A and B inside the routine ip_output(), which is where the "stcp" sample point is.
     thread     stcp        A        B     send       tx       rx     rtcp
   wang_thr        -        5      385      490       55     3370       39
   wang_thr     1980        6      414      483        -        -        -
   wang_thr      167        6      392      483        8        -        -
   wang_thr      167        5      453      490       55       52      208
   wang_thr     5147        6      385      491        -        -        -
   wang_thr      167        6      466      483        8       72        -
   wang_thr      168        5      392      483        8        -        -
   wang_thr      168        6      438      483       56       75      186
   wang_thr     5130        6      384      491        -        -        -
   wang_thr      167        5      447      490        8       85        -

Now we have located the delay between points A and B (within ip_output()). I move A down and B up in the routine so they now surround a loop that copies mbuf data to a Kyu netbuf. I get this, which shows that the delay is indeed in that loop:
     thread     stcp        A        B     send       tx       rx     rtcp
   wang_thr        -        6      379      496       55     3369       39
   wang_thr     1980        6      385      489        -        -        -
   wang_thr      167        6      386      489        8        -        -
   wang_thr      167        5      454      488       55       52      186
   wang_thr     5132        6      378      496        -        -        -
   wang_thr      167        6      479      497        8       46        -
   wang_thr      168        6      378      496        8        -        -
   wang_thr      168        6      451      488       56       54      179
   wang_thr     5137        6      378      489        -        -        -
   wang_thr      167        6      460      497       16       64        -

The next experiment gets tricky, since I don't know for sure how many mbufs the loop will process, and having multiple calls to A and B will lead to some ambiguity -- but I just go ahead and position A and B around what is really the only interesting call in that loop, which is memcpy():
     thread     stcp        A        B     send       tx       rx     rtcp
   wang_thr        -        2      338      498       55     3369       39
   wang_thr     1980        2      346      490        -        -        -
   wang_thr      167        2      345      490        8        -        -
   wang_thr      167        2      433      490       55       32      179
   wang_thr     5130        3      337      498        -        -        -
   wang_thr      167        3      468      497        8       16        -
   wang_thr      168        3      345      489        8        -        -
   wang_thr      167        3      423      489       56       42      178
   wang_thr     5137        3      337      497        -        -        -
   wang_thr      168        3      462      498        8       22        -

Well -- who would have thought? How can memcpy be our culprit?
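
For the record, the loop in question flattens a BSD mbuf chain into a single flat Kyu netbuf. Roughly like this (a sketch; the netbuf field names are my guesses; mtod(), m_next, and m_len are the standard BSD mbuf accessors):

    /* Sketch of the copy loop, with the final A and B points
     * hugging the memcpy itself.
     */
    static unsigned int t_A, t_B;       /* raw CCNT samples */

    static void
    mbuf_to_netbuf ( struct mbuf *m, struct netbuf *nbp )
    {
        char *dest = nbp->dptr;         /* assumed field name */

        nbp->dlen = 0;                  /* assumed field name */
        for ( ; m; m = m->m_next ) {
            t_A = ccnt_read ();
            memcpy ( dest, mtod ( m, char * ), m->m_len );
            t_B = ccnt_read ();
            dest += m->m_len;
            nbp->dlen += m->m_len;
        }
    }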

We will take that story to a whole new page.


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org