November 25, 2016

Intel Galileo - Gen 2 - Processor performance

The Intel Galileo contains a 400 MHz x86 processor. This seems underwhelming compared to a BBB (or Raspberry Pi) with a 1 GHz ARM. Some misinformed people have claimed that the x86 will outperform expectations based on its clock rate by virtue of being a CISC rather than a RISC processor. This is flat wrong on several counts, as I discuss below.

Nothing beats making actual measurements. So I dusted off the source code for the venerable "Dhrystone" benchmark and ran it on 3 processors that I had handy: my Intel i7 based desktop, a Beaglebone Black (BBB) with a 1 GHz ARM, and the Galileo with a 400 MHz x86.

3.5 GHz i7  -- 13,676,000 Dhrystones/sec
1.0 GHz ARM --  1,092,900 Dhrystones/sec
400 MHz x86 --    282,326 Dhrystones/sec
And there you have it. The 1000 MHz ARM outperforms the 400 MHz x86 by a factor of nearly 4 (1,092,900 / 282,326 = 3.87), more than its 2.5x clock advantage alone would predict. Right off the top, I can give several reasons for preferring the BBB over the Galileo, since the two are now available at roughly the same price:
1) A processor that runs about 3 times faster.
2) Onboard eMMC (2G or 4G)
3) Better documentation.
4) Much nicer and more sophisticated IO (look at the GPIO architecture, not to mention the PRU).
5) A smaller board
Why would anyone choose a Galileo? Probably because they are infatuated with the Arduino development model. Indeed, compared to an Arduino with an 8 bit AVR processor, the Galileo is a massive leap in performance. Compared to a 1 GHz ARM, it is every bit as underwhelming and mediocre as you might expect.

Wrong thinking - an example

The following is extracted from this article, where the fellow manages to get almost everything wrong. He isn't the only person saying such misleading things, but he makes a nice example.
Most development boards similar to the Galileo (if not all) are driven by ARM-based processors onboard. That makes perfect sense, and indeed ARM chips dominate when it comes to mobile and embedded devices. They're cheaper, more power-efficient and have less consumption.

However, the Quark is certainly faster and more powerful and fundamentally different, since it's derived from Intel's x86 processors, usually seen in desktop computers - running Windows or OS X. Moreover, the Quark chip (being x86 architecture) is a CISC (Complex Instruction Set Computing) processor which is a more powerful and complex set.

On the contrary, ARM processors are RISC (Reduced Instruction Set Computing) which is a smaller, simpler instruction set architecture. The difference being that something which takes an ARM processor a few cycles to complete, might take an x86 just one.

CISC versus RISC

CISC has everything to do with the architecture presented to the outside world. CISC processors have "cool features" such as a multitude of complex addressing modes intended to make programming them easy. RISC processors on the other hand are lean and mean and rely on sophisticated compilers. In general a RISC processor will complete one instruction every clock (pipelines allow this). Any processor with superscalar features (multiple execution units) can complete more than one instruction per clock. Superscalar processors can be RISC or CISC, but are most often RISC, since the only CISC architecture still alive is the Intel x86. The "complex" in CISC has nothing to do with the internals of the processor and everything to do with the instructions offered to the compiler (or programmer). The DEC VAX is (was) the flagship CISC machine, along with the Motorola m68k and the original 8086 series, carried on through the 80386 and subsequent Intel processors. Why hasn't Intel abandoned CISC? It all has to do with Microsoft. To understand the issues, ask why Windows doesn't run on non-Intel architectures like linux does.

Some additional thoughts

All of this began when I began to get annoyed by claims that the 400 MHz x86 in the Galileo was somehow "magic" and would outperform ARM processors running at higher clock rates. These claims often mentioned the CISC versus RISC aspect of the two processors and implied that a CISC processor was somehow superior. I was strongly skeptical (I smelled a rat).

So the thing to do was to run some experiments. I dug up the code for the venerable "Dhrystone" benchmark. It has been widely criticized and discussed, but it is a lump of C code that can be compiled and run to evaluate processors (in truth, the processor/compiler combination, with due consideration to compiler options, optimization, and so on). So I compiled and ran it on 3 machines: my linux desktop (an Intel i7 x86 running at 3.5 GHz), my BBB (an ARM v7 running at 1.0 GHz), and the Galileo (an x86 running at 400 MHz). My desktop is a multi-core SMP machine, but that doesn't matter; the benchmark runs on just one core. The results were as follows. Times were determined by processor clocks but verified with a stopwatch.

3.5 GHz i7  -- 13,676,000 Dhrystones/sec
1.0 GHz ARM --  1,092,900 Dhrystones/sec
400 MHz x86 --    282,326 Dhrystones/sec
To my surprise, the ARM outperforms the x86 even more than its clock rate might indicate. If we scale the Galileo result up to 1000 MHz, we get 282,326 * 1000 / 400 = 705,815 Dhrystones/sec, still well short of the ARM's 1,092,900.

I used the compiler with no special options whatsoever in each case.

I had expected the x86 to produce more exciting results based on superscalar features of the Intel architecture. The x86 in some incarnations has a CISC "face" it presents to the outside world, for binary compatibility with something, presumably ancient Microsoft software. But under that skin there is a more RISC-like core, with many more registers than are externally visible and multiple execution units. Perhaps the x86 core in the Quark X1000 on the Galileo has done away with much of that to bring cost (and power consumption) down. Who knows.

Note however that if we scale the Quark up to the i7 clock rate, 282,326 * 3500 / 400 = 2,470,352 Dhrystones/sec. This is far short of the 13,676,000 Dhrystones achieved by the i7, so my bet is that the i7 does indeed have the superscalar features I have read about and the Quark does not.

A side note on the Galileo. I put a linux image pulled from the Intel site (labelled "IoT DevKit image") on an SD card and was pleased to find that it had vim, gcc, and sshd (and many other things), making the Galileo a true linux machine. It even did a proper dhcp without any special effort, making my job running this benchmark very easy. This looks like "real linux" compared to the pared down "embedded linux" shipped in onboard flash. My notes are here:

Two final comments. One is that linux includes a well named (and, as you will see, misleading and useless) "benchmark" called "bogomips". This can be viewed on any system via "cat /proc/cpuinfo". The values on the 3 systems in question are:
i7      = 6988 bogomips
bbb     =  297 bogomips
galileo =  798 bogomips
Notice that the ARM (bbb) is significantly underrepresented by this measure (which, as I remember, simply increments some scalar variable to a target value and measures the time required). It is unclear just what "bogomips" measures; perhaps memory bandwidth rather than processing speed. I am not particularly interested in digging into this in any detail. The name itself is a warning, and these results just underscore the need to give little credibility to "bogomips" values, as many have already pointed out.

This is simply another reason for me to not get excited by the Galileo. Even at the reduced price of $45 (which puts it on a par with the BBB), the BBB is a far better choice.


Feedback? Questions? Drop me a line!

Tom's Computer Info / tom@mmto.org