Comparing dense compute platforms for AI
Jun 15, 2017
In the world of artificial intelligence, there has been a lot of talk about the performance and capabilities of hardware platforms. It is true that today’s computing power is what allowed the AI revolution to (re)happen, thanks to a combination of 1) increased data set sizes and 2) high-density compute. In this blog, I’d like to focus on the compute side and provide a framework for comparing different high-density computing devices.
Numerous efforts have been started to solve this problem by ‘building a better mousetrap’ than a CPU or GPU. My own startup, Nervana (acquired by Intel in August 2016), is an example. While there are certainly ways to better arrange transistors on a silicon die to gain a performance and power advantage for this application, there are some fundamental issues that any architecture must address. A problem today is that many of the performance numbers being tossed about may not correlate well with real AI performance. Raw TeraFLOPs/s or TeraOPs/s have been used to compare various platforms, and below we’ll delve into some reasons why this metric is not sufficient to assess performance on neural network training.
You may have heard of the “Von Neumann” architecture and how it’s dead. Simply stated, the Von Neumann architecture is one where data lives in a memory connected to an arithmetic device (ALU) via some narrow data pipe. This arrangement has several key issues. When data is moved back and forth between the memory and the arithmetic device, energy is used and latency is incurred. In addition, the memory pipe can become a bottleneck if the arithmetic device can consume data faster than the memory can supply it. The new thinking is that if we bring the memory closer to the arithmetic device, we burn less energy and mitigate bottlenecks. The problem with building this into a real silicon device is that memory grouped together will generally be denser and lower power than memory interspersed with digital logic. This is true for on-die SRAM but is even starker when we consider standard external memory technologies like DDR4, HBM2, or HMC, which achieve very high density and power efficiency. The parameter sizes of today’s neural networks are generally too large to fit into on-die memory resources, so we are stuck with a data pipe between an off-die memory and the arithmetic device. On-die memory can be used to mitigate the memory bandwidth problem, but deciding what stays on-die vs off-die requires careful management to achieve high performance.
Utilization in this context is the percentage of the raw compute capabilities of the chip that can be effectively used for a real workload. Deep learning and neural networks use a relatively small number of computational primitives, and only a few of those occupy much of the compute time. Matrix multiplication (MM) and transposes are fundamental operations. MM is composed of Multiply Accumulate (MAC) operations. OPs/s numbers are derived from how many MACs can be done per second (the multiply and the accumulate each count as 1 operation, so a MAC is actually 2 OPs). So, we can define utilization as the ratio of the OPs/s actually achieved on a real workload to the peak OPs/s the chip can deliver.
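As a rough illustration of that definition, here is a minimal Python sketch that counts the OPs in a dense matrix multiply (2 OPs per MAC, as above) and divides the achieved rate by the peak rate. The function names, matrix sizes, timing, and peak figure are all hypothetical values chosen for illustration, not measurements of any real device.

```python
def matmul_ops(m: int, k: int, n: int) -> int:
    """OPs in an (m x k) @ (k x n) matrix multiply.

    Each output element needs k MACs, and each MAC counts as 2 OPs
    (one multiply plus one accumulate).
    """
    macs = m * n * k
    return 2 * macs


def utilization(achieved_ops_per_s: float, peak_ops_per_s: float) -> float:
    """Fraction of the chip's peak OPs/s that a real workload achieves."""
    if peak_ops_per_s <= 0:
        raise ValueError("peak_ops_per_s must be positive")
    return achieved_ops_per_s / peak_ops_per_s


# Hypothetical numbers, for illustration only.
ops = matmul_ops(4096, 4096, 4096)   # total OPs in one large matrix multiply
elapsed_s = 0.02                     # assumed measured wall-clock time
peak_ops_per_s = 10e12               # assumed peak of 10 TeraOPs/s

achieved = ops / elapsed_s
print(f"utilization = {utilization(achieved, peak_ops_per_s):.1%}")
```

In practice the achieved rate would come from timing a full training workload rather than a single matrix multiply.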
Now, if the MAC capabilities of a design are ‘starved’ by the memory bandwidth, the design will never reach high utilization. All of the OPs/s in the world will not make the design run faster, since the memory bandwidth has become the bottleneck. We call this being memory bound. The memory subsystem has the job of keeping all of the compute on the chip busy. This can be done by being clever about how data is managed between external memory and on-chip memory. Caches are an example of this.
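One common way to reason about being memory bound is a roofline-style estimate, which is not part of the discussion above but captures the same idea: attainable throughput is capped either by peak compute or by memory bandwidth times the number of OPs performed per byte fetched. The sketch below assumes that simple model, and every number in it is hypothetical.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         mem_bytes_per_s: float,
                         ops_per_byte: float) -> float:
    """Roofline-style cap: the lower of the compute roof and the memory roof."""
    memory_roof = mem_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, memory_roof)


def is_memory_bound(peak_ops_per_s: float,
                    mem_bytes_per_s: float,
                    ops_per_byte: float) -> bool:
    """True when memory bandwidth, not the MAC array, limits throughput."""
    return mem_bytes_per_s * ops_per_byte < peak_ops_per_s


# Hypothetical device: 10 TeraOPs/s peak, 500 GB/s of memory bandwidth.
peak = 10e12
bandwidth = 500e9

for ops_per_byte in (4, 40):  # low vs high data reuse (assumed values)
    rate = attainable_ops_per_s(peak, bandwidth, ops_per_byte)
    bound = "memory" if is_memory_bound(peak, bandwidth, ops_per_byte) else "compute"
    print(f"{ops_per_byte:>3} OPs/byte: {rate / 1e12:.1f} TeraOPs/s attainable "
          f"({bound} bound, utilization cap {rate / peak:.0%})")
```

Holding more data on-chip raises the OPs performed per byte fetched from external memory, which is how careful memory management moves a workload from the memory-bound case toward the compute-bound one.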
As might be obvious, the more compute a chip has, the more memory bandwidth is required to keep the MAC units busy. So, additional circuitry like buffers, transpose logic, and nonlinearity (ReLU) logic must be employed to accomplish this. These come at a cost in die area and power. All of these factors must be carefully balanced to make a device that devotes enough power and area to keeping the MACs busy and using the memory bandwidth optimally. Simply throwing more and more OPs/s at the problem won’t help much in the real world if these other operations are not considered.
One of the main knobs we have for making better use of memory bandwidth, utilization, and power is to go to lower bit precision for each MAC. It is beyond the scope of this blog to describe the exact challenges and solutions with lower precision, but it is an area of active research. In addition, we can exploit sparsity and employ techniques like pruning to get more apparent computation out of a device.
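To make the bandwidth side of the precision knob concrete, the following sketch estimates how much data must cross the memory pipe to read a network’s parameters once at different bit widths. The parameter count and bandwidth are hypothetical, and the estimate ignores activations, gradients, and any on-die reuse.

```python
def parameter_bytes(num_params: int, bits: int) -> float:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits / 8


def transfer_time_s(num_params: int, bits: int, mem_gbits_per_s: float) -> float:
    """Seconds to stream all parameters once over the memory pipe."""
    return num_params * bits / (mem_gbits_per_s * 1e9)


# Hypothetical model and device, for illustration only.
num_params = 100_000_000      # assumed 100 million parameters
mem_gbits_per_s = 4000.0      # assumed memory bandwidth in GigaBits/s

for bits in (32, 16, 8):
    size_mb = parameter_bytes(num_params, bits) / 1e6
    time_ms = transfer_time_s(num_params, bits, mem_gbits_per_s) * 1e3
    print(f"{bits:>2}-bit: {size_mb:7.1f} MB, {time_ms:.2f} ms per full read")
```

Halving the bit width halves both the footprint and the time to move the parameters, so the same memory pipe can feed twice as many MACs.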
There is a desire for simple metrics to compare AI workloads on various platforms. CPUs used to use clock rate as a basis for comparison, but better benchmarks eventually obviated that need. Similarly, in the dense compute space we commonly see the use of TeraFLOPs/s or TeraOPs/s. Instead, we need a metric that scales linearly with the relative training performance of hardware platforms. If device A has twice the metric rating of device B, that should imply, for instance, that device A delivers double the performance when training most neural networks.
To this end, I’d like to propose the following metric: Computational Capacity (CC). The 3 factors involved are the bit width of the numeric representation, memory bandwidth, and OPs/s.
Let b = number of bits of representation, m = memory bandwidth in GigaBits/s, and o = compute throughput in TeraOPs/s.
We use the square of the number of bits of representation as a simple proxy for the relative area of the multipliers needed to implement that precision. This implies that 16-bit multipliers are approximately 4 times larger than 8-bit multipliers, which is close to reality.
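The square-of-bits proxy can be written down in a couple of lines. This sketch only encodes the approximation stated above, with an 8-bit multiplier as an assumed baseline; it is not a measured area model, and the function names are hypothetical.

```python
def relative_multiplier_area(bits: int, baseline_bits: int = 8) -> float:
    """Area of a bits-wide multiplier relative to a baseline-width one,
    using the square-of-bits approximation."""
    return (bits / baseline_bits) ** 2


def multipliers_in_same_area(bits: int, baseline_count: int,
                             baseline_bits: int = 8) -> float:
    """How many bits-wide multipliers fit in the area of baseline_count
    baseline-width multipliers, under the same approximation."""
    return baseline_count / relative_multiplier_area(bits, baseline_bits)


for bits in (8, 16, 32):
    area = relative_multiplier_area(bits)
    count = multipliers_in_same_area(bits, baseline_count=1024)
    print(f"{bits:>2}-bit: {area:4.1f}x area, {count:6.0f} multipliers "
          f"in the area of 1024 8-bit ones")
```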
As with any comparison metric, the CC metric is only an approximation of performance and will not capture the nuances of different architectures. An obvious issue is that chip-to-chip I/O is not considered at all, and this might be a further refinement to the metric later on (indeed, feel free to reach out on Twitter or email us with any suggestions). Power and area devoted to interconnect can be highly advantageous if performance scales across multiple chips. The goal of this blog is really to get the community thinking more deeply about what it takes to achieve high performance on AI workloads.