The Big Benchmarking Roundup

Getting started with machine learning and edge computing

Over the last six months I’ve been looking at machine learning on the edge, publishing a series of articles trying to answer some of the questions that people have been asking about inferencing on embedded hardware.

But after half a year of posts, talks, and videos, it’s all a bit of a sprawling mess, and the overall picture of what’s really happening is rather confusing.

So here’s a great big benchmarking roundup!

Inferencing time in milliseconds for the MobileNet v1 SSD 0.75 depth model (left-hand bars) and the MobileNet v2 SSD model (right-hand bars), both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300. Standalone platforms are shown in green, while the (single) bars for the Xnor AI2GO platform are timings for their proprietary binary weight model and are shown in blue. All other measurements using accelerator hardware attached to the Raspberry Pi 3, Model B+, are in yellow, while measurements on the Raspberry Pi 4, Model B, are in red.

While some people have dismissed the idea of benchmarks for inferencing as irrelevant because “…it’s training times that matter,” that doesn’t really seem justified. If you take an academic approach to machine learning you will often train thousands of different models to find one that is ‘paper worthy’, but this does not seem to be how things work out in the world.

Instead, for embedded systems, training is a sunk cost, with the final model being used thousands, perhaps even millions, of times depending on how many systems make use of it. Those models will also tend to hang around, potentially for decades if you’re talking about hardware that’s going into factories, homes, or public spaces. So in the long term it’s how fast those models run on the embedded hardware that’s important, not how long they took to train.

Discussion of the methodology behind the benchmarks can be found in the original post in the series, while the latest results can be found below, and are also discussed in both the first and the final post in the series.
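For readers who just want the general shape of an inferencing benchmark, here is a minimal sketch of a timing harness. It is not the benchmarking code used for these results: `run_inference` is a hypothetical zero-argument callable wrapping a single forward pass of whatever model and runtime you are testing, and the choice to discard a warm-up run is an assumption of this sketch.

```python
import statistics
import time


def benchmark(run_inference, runs=100, warmup=1):
    """Time a callable and return the per-run timings in milliseconds.

    `run_inference` is a hypothetical zero-argument callable that performs
    one forward pass of the model under test.
    """
    # Warm-up calls are executed but not timed (an assumption of this sketch:
    # the first call often includes one-off set-up costs).
    for _ in range(warmup):
        run_inference()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarise(timings_ms):
    """Return (mean, sample standard deviation) for a list of timings."""
    mean = statistics.mean(timings_ms)
    stdev = statistics.stdev(timings_ms) if len(timings_ms) > 1 else 0.0
    return mean, stdev


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; swap in a real model call.
    timings = benchmark(lambda: sum(i * i for i in range(10_000)), runs=20)
    mean, stdev = summarise(timings)
    print(f"{mean:.2f} ms ± {stdev:.2f} ms over {len(timings)} runs")
```

Reporting a spread alongside the mean makes it easier to tell a genuine difference between platforms from run-to-run noise.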

Final benchmarking results in milliseconds for the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300, alongside the Xnor AI2GO platform and their proprietary binary weight model.

While inferencing speed is probably our most important measure, these are devices intended to do machine learning at the edge. That means we also need to pay attention to environmental factors.

Designing a smart object isn’t just about the software you put on it. You also have to pay attention to other factors, and here we’re especially concerned with heating and cooling, and the power envelope. It might be necessary to trade off inferencing speed against these other factors when designing for the Internet of Things.

Idle and peak current consumption for our benchmarked platforms before and during extended testing. All measurements for USB connected accelerated platforms were done using a Raspberry Pi 3, Model B+.
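One way to think about the trade-off between speed and power is energy per inference, which is just voltage × current × time. The sketch below assumes a 5 V supply, and the current and timing figures in the example are made up for illustration; they are not measurements from these benchmarks.

```python
def energy_per_inference_mj(current_ma, inference_ms, voltage_v=5.0):
    """Energy used by one inference in millijoules (E = V * I * t).

    The 5 V default is an assumption of this sketch, not a measured value.
    """
    power_w = voltage_v * (current_ma / 1000.0)  # watts
    return power_w * inference_ms  # watts x milliseconds = millijoules


if __name__ == "__main__":
    # Hypothetical boards: the faster one draws more current while running.
    fast = energy_per_inference_mj(current_ma=1000, inference_ms=15)
    slow = energy_per_inference_mj(current_ma=500, inference_ms=20)
    print(f"fast board: {fast:.1f} mJ, slow board: {slow:.1f} mJ")
```

With these illustrative numbers the slower board uses less energy per inference, which is the sort of result that might matter more than raw speed for a battery-powered device.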

Discussion of environmental factors like power consumption and heating and cooling can mostly be found in the original post in the series, although some discussion is scattered through the follow-up posts where warranted.

Getting Started with Google’s Edge TPU

The Coral Edge TPU-based hardware was found to be ‘best in class’ according to our benchmark results. With the addition of USB 3 to the Raspberry Pi 4, Model B, the Coral USB Accelerator is the fastest accelerator platform that is currently available.

Getting Started with Intel’s Movidius

Getting Started with the Intel Neural Compute Stick 2 and the Raspberry Pi

While first to market, Intel’s Movidius-based hardware may now be showing its age. Although most of the boards, cards, sticks, and other widgets you see advertising themselves as machine learning accelerators are actually based around Movidius hardware, our benchmarks show poor performance when compared with Google’s newer Edge TPU hardware.

Getting Started with NVIDIA’s GPUs

Getting Started with the NVIDIA Jetson Nano Developer Kit

A major takeaway from our benchmarking on the NVIDIA Jetson Nano dev kit is that if you want things to run quickly you need to optimise your TensorFlow model using NVIDIA’s own TensorRT framework.

TensorFlow (dark blue) compared to TensorFlow with TensorRT optimisation (light blue) for MobileNet SSD V1 with 0.75 depth (left) and MobileNet SSD V2 (right) on the NVIDIA Jetson Nano

We’ve also seen that while NVIDIA’s GPU-based hardware is more flexible, that extra capability comes with a speed penalty. NVIDIA’s board is built around their existing GPU technology, while Google’s Edge TPU hardware is aimed directly at running smaller quantised models. We’re seeing that the Edge TPU hardware is faster because running smaller models at the edge is what it’s designed to do.

Benchmarking Machine Learning

Our original benchmarks were run before the arrival of the Raspberry Pi 4, Model B. However, our main results were only reinforced by the arrival of the newer hardware, and they are really starting to make me wonder whether we’ve gone ahead and started optimising in hardware just a little too soon. The fast inferencing times we see with the AI2GO framework and the Edge TPU, both of which make use of quantisation, suggest that we may need to explore software strategies before continuing to optimise our hardware any further.
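To give a feel for what quantisation means in practice, here is a minimal sketch of generic 8-bit affine quantisation, where each floating point value is stored as a small integer plus a shared scale and zero point. This is an illustration of the general idea only. It is not the toolchain used for the Edge TPU, and Xnor’s binary weight model is a different, proprietary scheme; the function names are my own.

```python
def quantise(values, num_bits=8):
    """Map floats to unsigned integers using one scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include zero so that 0.0 maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    quantised = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantised, scale, zero_point


def dequantise(quantised, scale, zero_point):
    """Recover approximate floats: real = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantised]


if __name__ == "__main__":
    weights = [-0.8, 0.0, 0.3, 1.2]
    q, scale, zero_point = quantise(weights)
    print(q, dequantise(q, scale, zero_point))
```

Each value now needs one byte rather than four, at the cost of a rounding error of roughly one quantisation step at most, and integer arithmetic is what this class of accelerator hardware is built to do quickly.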

Machine Learning on the Raspberry Pi 4

Perhaps the biggest takeaway for those wishing to use the new Raspberry Pi 4 for inferencing is the performance gains seen with the Coral USB Accelerator. The addition of USB 3 to the Raspberry Pi 4 means we see an approximate ×3 increase in inferencing speed over our original results using the Raspberry Pi 3 and USB 2.

Benchmarking results in milliseconds for the Coral USB Accelerator using the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset, for the Raspberry Pi 3, Model B+ (left), and the Raspberry Pi 4, Model B, over USB 3 (middle) and USB 2 (right).

The somewhat surprising result of slower inferencing for the Raspberry Pi 4 and USB 2 is most likely due to the architectural changes made to the new Raspberry Pi.
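A speed-up factor like the ×3 quoted here is simply the ratio of two mean inferencing times. A small sketch, using placeholder timings rather than the measured values from the figure:

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new timing is than the baseline."""
    return baseline_ms / new_ms


def frames_per_second(inference_ms):
    """Throughput implied by a per-frame inferencing time."""
    return 1000.0 / inference_ms


if __name__ == "__main__":
    # Placeholder timings for illustration only.
    usb2_ms, usb3_ms = 60.0, 20.0
    print(f"x{speedup(usb2_ms, usb3_ms):.1f} faster, "
          f"{frames_per_second(usb3_ms):.0f} frames per second")
```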

Benchmarking results in milliseconds for the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300, for the new Raspberry Pi 4, Model B, running TensorFlow (blue) and TensorFlow Lite (green).

But it’s not until we look at TensorFlow Lite on the Raspberry Pi 4 that we see the real surprise. Here we see between a ×3 and ×4 increase in inferencing speed between our original TensorFlow benchmark, and the new results using TensorFlow Lite.

Processor temperature in °C against time in seconds during extended testing.

Given the need to actively cool the Raspberry Pi during testing, I’d recommend that if you intend to use the new board for inferencing for extended periods you add at least a passive heatsink, although to avoid the possibility of CPU throttling entirely a small fan is likely to be a good idea.
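If you want to keep an eye on this yourself, the sketch below reads the processor temperature from the Linux thermal sysfs file, which reports millidegrees Celsius. The 80 °C limit and 5 °C margin are assumptions chosen for illustration, so check the documentation for your own board before relying on them.

```python
from pathlib import Path

# Standard Linux thermal sysfs file; the value is in millidegrees Celsius.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")
THROTTLE_LIMIT_C = 80.0  # assumed throttling threshold, not a measured value


def parse_millidegrees(raw):
    """Convert a raw sysfs reading such as '48312\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_cpu_temp_c(path=THERMAL_ZONE):
    """Read the current processor temperature in degrees Celsius."""
    return parse_millidegrees(path.read_text())


def is_near_throttling(temp_c, limit_c=THROTTLE_LIMIT_C, margin_c=5.0):
    """True when the temperature is within `margin_c` of the assumed limit."""
    return temp_c >= limit_c - margin_c


if __name__ == "__main__":
    if THERMAL_ZONE.exists():
        temp = read_cpu_temp_c()
        print(f"{temp:.1f} °C, near throttling: {is_near_throttling(temp)}")
    else:
        print("No thermal zone found on this machine.")
```

Logging this value once a second during an extended run gives you a temperature-against-time trace of the same kind as the plot above.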


The Benchmarking Code

If you’re interested in reproducing these results, or just want to get a much better understanding of my methodology, I’ve made all the resources you’ll need to run and duplicate the benchmark results available for download.


The Great Big Roundup

The Raspberry Pi 4 is probably the most affordable and accessible way to get started with embedded machine learning right now. Use it on its own with TensorFlow Lite for competitive performance, or with the Coral USB Accelerator from Google for ‘best in class’ performance.

The Big Benchmarking Roundup was originally published in Hackster Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Original article: The Big Benchmarking Roundup
Author: Alasdair Allan