Good inferencing chips can move data very quickly.
The number of new inferencing chip companies announced this past year is enough to make your head spin. With so many chips and a lack of quality benchmarks, the industry often forgets one critical piece: the memory subsystem. The truth is, you can’t have a good inference chip without a good memory subsystem. Thus, if an inferencing chip company talks only about TOPS and says very little about SRAM, DRAM and the memory subsystem in general, it probably doesn’t have a very good solution.
It’s All About Data Throughput
Good inferencing chips are architected to move data through them very quickly, which means they have to process that data fast and move it in and out of memory just as fast. If you compare models such as ResNet-50 and YOLOv3, you will see a striking difference not only in their computational load, but also in how each uses memory.
Each image takes 2 billion multiply-accumulates (MACs) with ResNet-50, but over 200 billion MACs with YOLOv3. That is a hundredfold increase. Part of this is because YOLOv3 has more weights (62 million versus approximately 23 million for ResNet-50). However, the biggest difference is the image size in the typical benchmark. ResNet-50 uses 224×224, a size no one actually uses in practice, while YOLOv3 uses 2 megapixels. Thus, the computational load is much greater for YOLOv3.
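The ratios behind this comparison can be checked with a few lines of arithmetic. This is a minimal sketch using the approximate figures quoted above; it assumes that "2 megapixels" means a 1920×1080 frame, which the article does not state, and the variable names are illustrative.

```python
# Approximate figures quoted in the article.
RESNET50_MACS = 2e9        # MACs per 224x224 image
YOLOV3_MACS = 200e9        # MACs per 2-megapixel image ("over 200 billion")
RESNET50_WEIGHTS = 23e6
YOLOV3_WEIGHTS = 62e6

# Assumption: "2 megapixels" is taken to be a 1920x1080 frame.
RESNET50_PIXELS = 224 * 224
YOLOV3_PIXELS = 1920 * 1080

mac_ratio = YOLOV3_MACS / RESNET50_MACS
weight_ratio = YOLOV3_WEIGHTS / RESNET50_WEIGHTS
pixel_ratio = YOLOV3_PIXELS / RESNET50_PIXELS

print(f"MACs:    {mac_ratio:.0f}x")
print(f"Weights: {weight_ratio:.1f}x")
print(f"Pixels:  {pixel_ratio:.1f}x")
```

Under that assumption the input image is roughly 41 times larger while the weight count is under 3 times larger, which is consistent with image size being the dominant factor.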
Using the example above, you can see that we have two different workloads and one takes 100 times more computation. The obvious question is: does this mean YOLOv3 runs 100 times slower? The only way to answer that is to look at the memory subsystem, because that determines the actual throughput on any given chip.
The Memory Subsystem
With inferencing chips, we are not just creating a chip, we are creating a system. The MACs are the engine, but without the right fuel delivery system (memory and interconnect) the engine will stall.
If you look at what happens in an inferencing chip, new images have to be supplied at a certain rate, such as 30 frames per second. Images come into the chip, and what comes out is some sort of result. Image size varies, but most applications need to process megapixel images to get sufficient accuracy.
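To put the frame rate in perspective, the sustained MAC rate a chip must deliver is simply MACs per image times frames per second. The sketch below combines the 30 frames per second example with the per-image MAC counts quoted earlier in the article; the function name is illustrative.

```python
def required_macs_per_second(macs_per_image: float, frames_per_second: float) -> float:
    """Sustained MAC rate needed to keep up with the incoming frames."""
    return macs_per_image * frames_per_second

FPS = 30  # example frame rate from the text

yolov3_rate = required_macs_per_second(200e9, FPS)   # ~200 billion MACs per image
resnet50_rate = required_macs_per_second(2e9, FPS)   # ~2 billion MACs per image

print(f"YOLOv3:    {yolov3_rate / 1e12:.1f} TMAC/s")
print(f"ResNet-50: {resnet50_rate / 1e12:.2f} TMAC/s")
```

This is the rate that must be sustained, not the peak figure on a datasheet. TOPS figures commonly count each MAC as two operations, so the two numbers should not be compared directly.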
Inside the chip, the images are processed using a neural network model, so there are code, weights and the intermediate activations produced at the end of each layer. All of this needs to be stored somewhere and moved to and from the computational units of the inference chip.
The applications for inference are numerous, with edge applications such as autonomous driving representing one of the largest opportunities. In the future, every automobile will have multiple inference engines so that pedestrians, buses and cars can be detected and avoided in real time. This makes processing large images, as in the YOLOv3 benchmark, very important. This is common sense when you think about looking at an image with your own eyes. If someone shows you a small image, you are going to miss the details and misinterpret the image. In the case of self-driving cars and surveillance cameras, the small details are absolutely critical.
Autonomous driving represents one of the largest opportunities for AI accelerators at the edge (Image: Flex Logix)
The difference between the edge and the cloud is that at the edge you need to respond immediately, whereas in the cloud you typically have lots of data and time to process it. If you are in a car, for example, you need to know where people are so you have a chance to avoid them. In the data center, by contrast, there are applications such as photo tagging that can run with large batch sizes overnight. This does not work in edge applications, where everything has to be fast and low latency, which also means batch size = 1.
So in essence, we are rearchitecting chips to deliver results in a short time (low latency) so the right action can be taken in time. We need to process data right away and get results back right away, which makes memory an absolutely critical part of the design.
If you look at ResNet-50, there are many chips where the performance at batch size = 10 or 100 is very high, but falls off as you go to batch size = 1. In some chips, this drop-off is as much as 75 percent. That means that whatever MAC utilization they achieve at a large batch size, they achieve only a quarter of it at batch = 1. Thus, at batch = 1, which is critical at the edge, some chips use less than 25 percent of the available cycles in their compute MACs.
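The arithmetic of the drop-off can be made explicit. This is a minimal sketch in which the chip and its large-batch throughput are hypothetical; only the 75 percent drop-off comes from the text.

```python
def batch1_throughput(large_batch_throughput: float, drop_off: float) -> float:
    """Throughput remaining at batch = 1 after a fractional drop-off (0.75 means 75%)."""
    if not 0.0 <= drop_off <= 1.0:
        raise ValueError("drop_off must be between 0 and 1")
    return large_batch_throughput * (1.0 - drop_off)

# Hypothetical chip: 1,000 images/s on ResNet-50 at a large batch size.
remaining = batch1_throughput(1000.0, 0.75)
print(f"Throughput at batch = 1: {remaining:.0f} images/s")
```

Because utilization can never exceed 100 percent at a large batch size, a 75 percent drop-off caps utilization at batch = 1 at 25 percent, and in practice it will be lower.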
A Shift in Architectures
In the old days, memory architectures for processors, which still do the bulk of inferencing in data centers today, had DRAM and multiple levels of caches that all fed into a processor. Memory was unified and centralized. With inference chips, memory tends to be distributed. One way to process data faster is to split the MACs up into chunks and give each chunk its own localized SRAM. This approach is used by companies such as Flex Logix and Intel, and it will be dominant in the future. The reason is that having memory close to the MACs results in shorter access times, and having the MACs distributed results in more parallelism.
The difference in memory subsystem design for general purpose processors and inference accelerators. Inference accelerators tend to have distributed memory in the form of localized SRAM (Image: Flex Logix)
Another key requirement in edge applications is meeting cost and power budgets. Unlike chips used for training that take up a whole wafer, chips used for applications such as cars and surveillance cameras have an associated dollar budget and power limit. Typically, the amount of SRAM that fits within these budgets is not sufficient to store all the weights, code and intermediate activations on the chip. These chips process vast amounts of data continuously, because the bulk of applications at the edge are always on. Since all chips dissipate heat, more processing means more heat. The architectures that get more throughput out of the same amount of silicon and power will be the winners because they can deliver more results for less power and less cost.
Optimizing for Power and Cost
There are shortcuts that companies can take that trade off precision or accuracy in detecting objects. However, this is not the way customers want to go. Customers want to run a model and get high accuracy in object detection and recognition, but within certain power limits. The key to doing this is really in the memory subsystem.
If you look at ResNet-50 or YOLOv3, you need to store the weights. The weights in ResNet-50 are about 23 megabytes, while the weights in YOLOv3 are around 62 megabytes. Just storing these weights on chip will make a chip close to 100 square millimeters, which is not economical for most applications. That means bulk memory needs to be off chip, and that means DRAM.
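The weight storage follows directly from the weight counts quoted earlier (about 23 million for ResNet-50 and 62 million for YOLOv3). The sketch below assumes one byte per weight, that is, 8-bit weights. The article does not state the precision, so treat the default as an assumption.

```python
def weight_storage_mb(num_weights: float, bits_per_weight: int = 8) -> float:
    """Storage for the weights in megabytes (10^6 bytes)."""
    return num_weights * bits_per_weight / 8 / 1e6

# Weight counts quoted in the article; 8-bit precision is an assumption.
for name, count in [("ResNet-50", 23e6), ("YOLOv3", 62e6)]:
    print(f"{name}: {weight_storage_mb(count):.0f} MB at 8 bits, "
          f"{weight_storage_mb(count, 16):.0f} MB at 16 bits")
```

Doubling the precision doubles the storage, which is one reason precision choices feed directly into the SRAM and DRAM budget.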
As a side note, we are frequently asked whether the type of DRAM matters. The answer is that it matters a lot. High-bandwidth memory (HBM) is extremely expensive, which does not work for cost-constrained edge applications. LPDDR4 is a better choice because it comes in a wide bus configuration that provides more bandwidth out of a single DRAM. DRAM is also very heat sensitive, which can be an issue in automobiles and in surveillance cameras operating outdoors. Thus, for both cost and heat reasons, it is best to minimize your use of DRAM.
How to Design the Best Inferencing Chip
The best inferencing chips will be the ones architected by designers who thought about what type of processing customers will be doing, what their loads and applications will be, and where they will be used. At the end of the day, customers want the highest throughput – which means they need high MAC utilization. The way to get high MAC utilization is to have high bandwidth feeding into the MACs, but you want to do this with the least SRAM and the least DRAM.
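One common way to reason about the link between bandwidth and MAC utilization is a roofline-style bound: the MACs can run no faster than the memory system can feed them. The sketch below uses that general model, not a method described in the article, and every number in it is hypothetical.

```python
def mac_utilization(peak_macs_per_s: float,
                    bandwidth_bytes_per_s: float,
                    macs_per_byte: float) -> float:
    """Upper bound on MAC utilization when throughput is limited by memory bandwidth.

    macs_per_byte is how many MACs the workload performs per byte it moves.
    """
    attainable = min(peak_macs_per_s, bandwidth_bytes_per_s * macs_per_byte)
    return attainable / peak_macs_per_s

# Hypothetical chip: 4 TMAC/s peak, workload doing 100 MACs per byte moved.
PEAK = 4e12
MACS_PER_BYTE = 100.0

for bandwidth in (10e9, 40e9, 80e9):  # bytes per second, hypothetical
    util = mac_utilization(PEAK, bandwidth, MACS_PER_BYTE)
    print(f"{bandwidth / 1e9:.0f} GB/s -> {util:.0%} utilization")
```

In this model, adding bandwidth beyond the point where the MACs are fully fed buys nothing, which is why the goal is enough bandwidth with the least SRAM and DRAM rather than as much memory as possible.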
Chip designers need to model the kinds of applications they expect customers to run and look closely at the weights, code size and activations. Modeling tools available today allow chip designers to vary the number of MACs and the amount of SRAM and DRAM, which enables them to make a series of tradeoffs to determine how to deliver the cheapest chip with the highest throughput.
There are also many things designers can do to organize their MACs to get them to run at higher frequencies. For example, MACs can be optimized for 8-bit multiply and accumulate, which runs faster than a 16-bit multiply and accumulate. The only tradeoff in this case is slightly lower accuracy, but it offers much more throughput per dollar and per watt.
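The accuracy side of the 8-bit tradeoff can be illustrated with a small dot product. This is a sketch using simple symmetric quantization, one of several possible schemes and not necessarily what any particular chip does; the activation and weight values are made up for illustration. It shows only the numerical error, not the speed or power benefit.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers using a shared scale (symmetric quantization)."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_dot(a_q, b_q):
    """Integer multiply-accumulate; a Python int stands in for a wide hardware accumulator."""
    acc = 0
    for x, y in zip(a_q, b_q):
        acc += x * y
    return acc

# Hypothetical activations and weights, chosen only for illustration.
activations = [0.50, -1.20, 0.75, 2.00]
weights = [0.10, 0.40, -0.30, 0.25]

a_scale = max(abs(v) for v in activations) / 127
w_scale = max(abs(v) for v in weights) / 127

exact = sum(a * w for a, w in zip(activations, weights))
approx = int8_dot(quantize(activations, a_scale),
                  quantize(weights, w_scale)) * a_scale * w_scale

print(f"float result: {exact:.4f}")
print(f"int8 result:  {approx:.4f}")
```

The two results differ slightly because each value is rounded to one of 256 levels before the multiply.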
So how do ResNet-50 and YOLOv3 differ in memory usage? While the weight counts differ by less than 3X (about 23 million versus 62 million), the biggest difference is in their activations. Every layer of ResNet-50 generates activations, and the maximum activation size for ResNet-50 is 1 megabyte, with some of the layers even smaller than that. With YOLOv3, the activation of the largest layer is 64 megabytes, and all 64 megabytes have to be stored so they are ready for the next layer. When you look at the on-chip or DRAM capacity requirements, the activations in YOLOv3 actually drive more storage than the weights, which is very different from ResNet-50. In fact, the trick customers need to be wary of is that some companies design chips that can store the ResNet-50 weights on chip, knowing that the activations are small, which can make their performance “seem” very good. However, in real-life applications, that chip’s performance will drop off drastically.
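The size of an activation is the product of its height, width, channel count and bytes per value. The sketch below estimates the first-layer outputs of both networks under assumptions the article does not state: 8-bit activations, a 1920×1080 input for YOLOv3, and first-layer shapes taken from the commonly published network definitions (32 channels at full resolution for YOLOv3, 64 channels at 112×112 for ResNet-50).

```python
def activation_bytes(height: int, width: int, channels: int, bytes_per_value: int = 1) -> int:
    """Storage needed for one layer's output activations."""
    return height * width * channels * bytes_per_value

MIB = 2 ** 20

# Assumed shapes (see the note above); not figures from the article.
yolov3_first = activation_bytes(1080, 1920, 32)
resnet50_first = activation_bytes(112, 112, 64)

print(f"YOLOv3 first layer:    {yolov3_first / MIB:.1f} MiB")
print(f"ResNet-50 first layer: {resnet50_first / MIB:.2f} MiB")
```

Under these assumptions the estimates come out near the 64 megabyte and 1 megabyte figures quoted above. The gap of nearly two orders of magnitude comes mostly from the input resolution.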
Tradeoff between SRAM and DRAM
DRAM chips cost money, and what costs even more money is the connections to the DRAM chip. Companies tend to focus on die size, but the chip package size is a big determinant of cost, and the package can sometimes be more expensive than the die. Every DRAM that is added adds at least 100 balls. Some chips today have 8 DRAMs connected to them, which pushes them into 1,000-ball packages that are extremely expensive. While companies realize they can’t fit all the memory they need on chip as SRAM, they also realize they can’t solve their cost equation by having too much DRAM. What they really need is as little DRAM as possible with as little SRAM as possible. To do this, chip designers need to study the activations. If you look at the number of activations that are 64MB, there is only one. Most activations are smaller, so if you put 8 megabytes of SRAM on chip, most of the intermediate activations will be stored on the chip and you only need to use DRAM to handle the biggest activations.
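The sizing exercise described above can be sketched as a simple partition of layers by activation size, plus the ball-count arithmetic. The per-layer sizes below are hypothetical; only the 64 MB peak, the 8 MB SRAM figure and the roughly 100 balls per DRAM come from the text. The model is also deliberately simplified: it ignores the fact that a layer's input and output activations, and its weights, may need to be resident at the same time.

```python
def split_activations(activation_mb, sram_mb):
    """Separate layer activations that fit in on-chip SRAM from those that spill to DRAM."""
    on_chip = [size for size in activation_mb if size <= sram_mb]
    to_dram = [size for size in activation_mb if size > sram_mb]
    return on_chip, to_dram

def dram_balls(num_drams, balls_per_dram=100):
    """Package balls consumed by DRAM interfaces, at roughly 100 balls per DRAM."""
    return num_drams * balls_per_dram

# Hypothetical per-layer activation sizes in MB; only the 64 MB peak is from the text.
layers_mb = [64, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1]

on_chip, to_dram = split_activations(layers_mb, sram_mb=8)
print(f"{len(on_chip)} of {len(layers_mb)} layers fit in 8 MB of SRAM")
print(f"Layers spilling to DRAM (MB): {to_dram}")
print(f"Balls for 8 DRAMs: {dram_balls(8)}")
```

Sweeping `sram_mb` over a model's real activation profile is one way to see where adding SRAM stops reducing DRAM traffic.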
This is the sweet spot for inferencing chips and what chip designers should be striving for in their designs. And if you are a customer, you need to start asking questions about your chip’s memory subsystem, because it is a huge determining factor in how the chip will perform in real-life applications.
Geoff Tate is CEO of Flex Logix.
This article was originally published on our sister site, EE Times.
The post Inference chip performance builds on optimized memory subsystem design appeared first on Embedded.com.