Intel® Nervana™ Neural Network Processor: Architecture Update

Dec 06, 2017


Carey Kloss

Vice President of Hardware, Artificial Intelligence Products Group

Recently, we announced a new architecture built from the ground up for neural networks, known as the Intel® Nervana™ Neural Network Processor (NNP). The goal of this new architecture is to provide the flexibility needed to support all deep learning primitives while making the core hardware components as efficient as possible. We designed the NNP to free us from the limitations imposed by existing hardware, which was not explicitly designed for AI. Today, we are excited to disclose more details on the architecture and on some of the new features designed to increase neural network training and inference performance. In this blog, I will talk about the attributes that make the NNP so innovative and why these features are important tools for neural network designers.

To solve a large data problem with a neural network, the designer needs to iterate quickly over different candidate networks using large data sets. In other words, it's all about being able to train a given neural network against more data in a given amount of time. Three factors are key to achieving this: 1) maximizing compute utilization, 2) scaling easily to more compute nodes, and 3) doing both with as little power as possible. The NNP architecture provides novel solutions to these problems and gives neural network designers powerful tools for solving larger and more difficult problems.

Spatial architecture

The NNP gives software the flexibility to directly manage data locality, both within the processing elements and in the high bandwidth memory (HBM) itself. Tensors can be split across HBM modules in order to ensure that in-memory data is always closest to the relevant compute elements. This minimizes data movement across the die, saving power and reducing on-die congestion. Similarly, software can determine which blocks of data are stored long-term inside the processing elements, saving more power by reducing data movement to and from external memory.
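To illustrate what software-managed placement might look like, below is a minimal sketch of sharding a weight tensor across HBM modules and pinning each shard near a cluster of processing elements. The function names, module count, and placement scheme are hypothetical assumptions made purely for illustration; they are not the NNP's actual programming interface.

```python
# Hypothetical sketch of software-managed tensor placement.
# Names and the module count below are illustrative only; they are not
# the NNP's real programming interface.
import numpy as np

NUM_HBM_MODULES = 4  # assumed number of HBM stacks, for illustration


def shard_tensor(tensor, num_modules=NUM_HBM_MODULES):
    """Split a tensor along its first axis so each shard can live in the
    HBM module physically closest to the compute elements that use it."""
    return np.array_split(tensor, num_modules, axis=0)


def placement_plan(shards):
    """Map each shard to an (hbm_module, processing_element_cluster) pair.
    A real compiler would choose placements to minimize cross-die traffic;
    here we simply pin shard i to module i and its neighboring cluster."""
    return [
        {"shard": i, "hbm_module": i, "pe_cluster": i, "resident": True}
        for i, _ in enumerate(shards)
    ]


weights = np.random.rand(1024, 1024).astype(np.float32)
for entry in placement_plan(shard_tensor(weights)):
    print(entry)
```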

HBM is at the forefront of memory technology, supporting up to 1TB/s of bandwidth between the compute elements and the large (16-32GB) external memory. But even with this large amount of memory bandwidth, deep learning workloads can easily become memory-limited. Until new memory technologies become available, it is important for deep learning compute architectures to use creative strategies that minimize data movement and maximize data re-use in order to leverage all of their computational resources. The NNP employs a number of these creative strategies, all under the control of software. The local memory of each processing element is large (>2MB each, with more than 30MB of local memory per chip). This larger on-die memory reduces the number of times data needs to be read from external memory, and enables local transforms that don't touch the HBM subsystem. After data is loaded into the local memory of a processing element, it can be moved to other processing elements without re-visiting the HBM, leaving more HBM bandwidth available for pre-fetching the tensor for the next operation. Tensors can even be sent off-die to neighboring chips directly from processing element to processing element, again without requiring a second trip in and out of the HBM subsystem. Even simple aspects of the architecture, like free (zero-cycle) transpose, are targeted at reducing overall memory bandwidth demand.
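A back-of-the-envelope roofline check shows why data re-use matters so much. The 1 TB/s bandwidth figure comes from the paragraph above; the peak-compute number is a hypothetical placeholder, since this post does not publish one, so treat the result purely as an illustration of the memory-bound vs. compute-bound trade-off.

```python
# Back-of-the-envelope roofline check (illustrative numbers only).
HBM_BANDWIDTH = 1e12        # bytes/s, ~1 TB/s as stated above
ASSUMED_PEAK_FLOPS = 5e13   # flop/s, hypothetical placeholder


def is_memory_bound(flops, bytes_moved,
                    peak_flops=ASSUMED_PEAK_FLOPS,
                    bandwidth=HBM_BANDWIDTH):
    """A kernel is memory-bound when its arithmetic intensity (flops per
    byte moved) falls below the machine balance (peak flops per byte of
    bandwidth)."""
    arithmetic_intensity = flops / bytes_moved
    machine_balance = peak_flops / bandwidth
    return arithmetic_intensity < machine_balance


# Example: a small-batch GEMM of (M, K) x (K, N) with 16-bit operands,
# assuming a single pass through HBM and no on-die re-use.
M, K, N = 16, 1024, 1024
flops = 2 * M * K * N                       # each multiply-accumulate counts as 2 flops
bytes_moved = 2 * (M * K + K * N + M * N)   # 2 bytes per element
print(is_memory_bound(flops, bytes_moved))  # True here; re-use in local memory raises intensity
```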

Flexpoint numerics optimized for neural networks

We designed Flexpoint, the core numerical technology powering the NNP, to achieve results similar to FP32 while using only 16 bits of storage. As opposed to FP16, we use all 16 bits for the mantissa, passing the exponent in the instruction. This new numeric format effectively doubles the memory bandwidth available on a system compared to FP32, and it uses 16-bit integer multiply-accumulate logic, which is more power-efficient than even FP16.
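To make the idea concrete, here is a minimal software emulation of Flexpoint-style encoding: a tensor of 16-bit integer mantissas that share a single exponent, which would travel with the instruction rather than with each value. This sketch only illustrates the shared-exponent concept; it is not the hardware's actual encoding or exponent-management pipeline.

```python
# Minimal software emulation of a shared-exponent, 16-bit-mantissa format.
# Illustrative only; not the NNP's actual Flexpoint implementation.
import numpy as np


def flexpoint_encode(x, mantissa_bits=16):
    """Quantize a float32 tensor to signed integer mantissas with one
    shared exponent, chosen so the largest magnitude just fits."""
    max_mag = float(np.max(np.abs(x)))
    if max_mag == 0.0:
        return np.zeros(x.shape, dtype=np.int16), 0
    max_mantissa = 2 ** (mantissa_bits - 1) - 1       # 32767 for 16 bits
    exponent = int(np.ceil(np.log2(max_mag / max_mantissa)))
    mantissas = np.round(x / 2.0 ** exponent).astype(np.int16)
    return mantissas, exponent


def flexpoint_decode(mantissas, exponent):
    """Recover approximate float32 values from mantissas + shared exponent."""
    return mantissas.astype(np.float32) * 2.0 ** exponent


x = np.random.randn(4, 4).astype(np.float32)
m, e = flexpoint_encode(x)
print(np.max(np.abs(flexpoint_decode(m, e) - x)))  # small quantization error
```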

Flexpoint is also modular: while our first-generation NNP focuses on 16-bit multipliers with a 5-bit exponent, future silicon will enable even smaller bit widths to save even more power.

New forms of parallelism

The NNP includes high-speed SerDes links that provide more than a terabit per second of bidirectional off-chip bandwidth. As with our memory subsystem, this bandwidth is fully software-controlled. QoS can be maintained on each individual link using software-configurable, adjustable-bandwidth virtual channels with multiple priorities within each channel. Data can be moved between chips either between their HBM memories or directly between processing elements. This high bandwidth enables model parallelism (a set of chips acting together as a single compute element), in addition to data parallelism (where a job is split along input data boundaries). The ability to move data directly from local to remote processing elements ensures that each HBM read can be reused as many times as possible, maximizing data re-use in memory-bound applications.
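To make the distinction between the two parallelism styles concrete, the sketch below emulates both with NumPy on a single host: data parallelism splits the batch across "chips" with replicated weights, while model parallelism shards one weight matrix so the chips jointly hold a single model. The chip count, shapes, and helper names are illustrative assumptions, not NNP specifics.

```python
# Data parallelism vs. model parallelism, emulated on one host.
# Chip count and tensor shapes are arbitrary examples.
import numpy as np

NUM_CHIPS = 4


def data_parallel_forward(x, w):
    """Data parallelism: split the batch, replicate the weights, and let
    each 'chip' process its own slice of the inputs."""
    batches = np.array_split(x, NUM_CHIPS, axis=0)
    return np.concatenate([b @ w for b in batches], axis=0)


def model_parallel_forward(x, w):
    """Model parallelism: split the weight matrix by output columns so the
    chips jointly hold one model, each computing a slice of the output."""
    w_shards = np.array_split(w, NUM_CHIPS, axis=1)
    return np.concatenate([x @ ws for ws in w_shards], axis=1)


x = np.random.rand(64, 256).astype(np.float32)
w = np.random.rand(256, 512).astype(np.float32)
ref = x @ w
assert np.allclose(data_parallel_forward(x, w), ref)
assert np.allclose(model_parallel_forward(x, w), ref)
```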

By designing the Intel® Nervana™ Neural Network Processor to maximize compute utilization and easily scale to more compute nodes while using as little power as possible, we have created a novel form of hardware that is optimized for deep learning.
