Lowering Numerical Precision to Increase Deep Learning Performance

Deep learning training and inference are poised to be computational heavyweights of the coming decades. For example, training an image classifier can require 10^18 single-precision operations[1]. This demand has made the acceleration of deep learning computations an important area of research for both Intel and the artificial intelligence community at large.

An approach we are particularly excited about is the use of lower-precision mathematical operations for deep learning inference and training. In a new white paper from Intel, we review recent research on lower-precision deep learning, look at how Intel is facilitating low-precision deep learning on Intel® Xeon® Scalable processors, and preview upcoming work to further accelerate these operations for current and future microarchitectures.

The Benefits of Lower-Precision Operations

Today, most commercial deep learning applications use 32 bits of floating point precision in their training and inference workloads. However, many studies[2] have demonstrated that both training and inference can be performed with lower numerical precision with little to no loss in accuracy of outcomes.

Lower-precision operations have two main benefits. First, many deep learning operations are memory-bandwidth bound. In these cases, reduced precision may permit better cache usage and the reduction of memory bottlenecks. This allows data to be moved more quickly and compute resources to be maximized. Second, lower-precision multipliers require less silicon area and power. This can enable the hardware to execute a greater number of operations per second, further accelerating workloads. Due to these benefits, the use of lower-precision operations is expected to become standard practice in the near future, particularly for convolutional neural networks.
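The memory-bandwidth benefit is easy to quantify: int8 storage is one quarter the size of fp32, so a bandwidth-bound layer moves 4x less data per pass. A toy illustration (the tensor shape below is hypothetical, not from the white paper):

```python
import numpy as np

# Hypothetical activation tensor: batch x channels x height x width.
shape = (8, 16, 14, 14)

# The same tensor stored in fp32 (4 bytes/element) vs. int8 (1 byte/element).
fp32_bytes = np.zeros(shape, dtype=np.float32).nbytes
int8_bytes = np.zeros(shape, dtype=np.int8).nbytes

print(fp32_bytes / int8_bytes)  # -> 4.0
```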

Lower-Precision Operations on the Intel® Xeon® Scalable Platform

Our white paper details how the Intel Xeon Scalable platform’s 512-bit wide Fused Multiply Add (FMA) core instructions, part of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set, accelerate deep learning by enabling lower-precision multiplies with higher-precision accumulates. Multiplying two 8-bit values and accumulating the result to 32 bits requires three instructions, with one of the 8-bit vectors in unsigned int8 (u8) format, the other in signed int8 (s8) format, and the accumulation in signed int32 (s32) format. This allows for 4x more input at the cost of 3x more instructions, or 33.33% more compute with 1/4 the memory requirement. Additionally, the reduced memory footprint and higher frequency available for lower-precision operations may further speed execution. Please see Figure 1 for details.

Figure 1: The Intel® Xeon® Scalable processor enables 8-bit multiplies with 32-bit accumulates with 3 instructions: VPMADDUBSW multiplies u8×s8→s16, VPMADDWD multiplies by a broadcast vector of 1s to accumulate adjacent s16 pairs into s32, and VPADDD adds the s32 result to the accumulator. This allows for 4x more input over fp32 at the cost of 3x more instructions, or 33.33% more compute and 1/4 the memory requirement. The reduced memory and higher frequency available with lower precision make it even faster. Image credit to Israel Hirsh.
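To make the three-instruction sequence concrete, here is a minimal NumPy sketch of the per-lane semantics (lane widths and vector lengths simplified; only VPMADDUBSW's signed saturation is modeled):

```python
import numpy as np

def vpmaddubsw(u8, s8):
    # Multiply unsigned 8-bit by signed 8-bit elementwise, then add adjacent
    # product pairs with signed saturation into the s16 range.
    prod = u8.astype(np.int32) * s8.astype(np.int32)
    pairs = prod[0::2] + prod[1::2]
    return np.clip(pairs, -32768, 32767).astype(np.int16)

def vpmaddwd_ones(s16):
    # Multiply by a broadcast vector of 1s and add adjacent s16 pairs -> s32.
    widened = s16.astype(np.int32)
    return widened[0::2] + widened[1::2]

def fma_u8s8_s32(u8, s8, acc):
    # VPADDD step: add the widened pair sums into the s32 accumulator.
    return acc + vpmaddwd_ones(vpmaddubsw(u8, s8))
```

For example, accumulating the dot product of [1, 2, 3, 4] (u8) and [5, 6, 7, -8] (s8) into a zeroed accumulator yields 5 + 12 + 21 - 32 = 6.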

Future Enhancements for Low-Precision Operations

A new set of Intel AVX-512 instructions called AVX512_VNNI (Vector Neural Network Instructions) will further increase deep learning performance in future microarchitectures. AVX512_VNNI includes an FMA instruction for 8-bit multiplies with 32-bit accumulates (u8×s8→s32), as shown in Figure 2, and an FMA instruction for 16-bit multiplies with 32-bit accumulates (s16×s16→s32), as shown in Figure 3. The theoretical peak compute gains are 4x int8 OPS and 2x int16 OPS over fp32 OPS, respectively. In practice, the gains may be lower due to memory bandwidth bottlenecks. Compiler support for these AVX512_VNNI instructions is currently underway.

Figure 2: AVX512_VNNI enables 8-bit multiplies with 32-bit accumulates with 1 instruction. The VPMADDUBSW, VPMADDWD, and VPADDD instructions of Figure 1 are fused into the single VPDPBUSD instruction u8×s8→s32. This allows for 4x more inputs over fp32 and (theoretical peak) 4x more compute with 1/4 the memory requirements. Image credit to Israel Hirsh.

Figure 3: The AVX512_VNNI VPDPWSSD instruction s16×s16→s32 enables 16-bit multiplies with 32-bit accumulates. This allows for 2x more inputs over fp32 and (theoretical peak) 2x more compute with 1/2 the memory requirements. Image credit to Israel Hirsh.
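The semantics of the two fused VNNI instructions can be sketched per 32-bit lane as follows (a simplified model, not hardware-accurate: vector lengths are reduced and intermediate saturation behavior is omitted):

```python
import numpy as np

def vpdpbusd(acc, u8, s8):
    # One fused instruction replaces the three in Figure 1: four u8*s8
    # products per 32-bit lane are summed into the s32 accumulator.
    prod = u8.astype(np.int32) * s8.astype(np.int32)
    return acc + prod.reshape(-1, 4).sum(axis=1)

def vpdpwssd(acc, a16, b16):
    # Two s16*s16 products per 32-bit lane, summed into the s32 accumulator.
    prod = a16.astype(np.int32) * b16.astype(np.int32)
    return acc + prod.reshape(-1, 2).sum(axis=1)
```

The int8 variant consumes four input pairs per accumulator lane where the int16 variant consumes two, which is where the 4x and 2x theoretical peak gains over fp32 come from.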

Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and Lower-Precision Primitives

The Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) contains popular deep learning functions, or primitives, used across various models, as well as the functions necessary to manipulate the layout of tensors (high-dimensional arrays). To better support lower-precision primitives, new functions were added to Intel MKL-DNN for inference workloads with 8 bits of precision in convolutional, ReLU, fused convolutional-plus-ReLU, and pooling layers. Functions for recurrent neural networks (RNNs), other fused operations, and Winograd convolutions with 8 bits for inference, as well as support for 16-bit functions for training, are designated as future work. Please consult our white paper for instructions on quantizing model weights and activations and for descriptions of the lower-numerical-precision functions available in Intel MKL-DNN.
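As a rough illustration of what quantizing weights and activations involves (a hypothetical symmetric-scaling sketch, not the Intel MKL-DNN API; see the white paper for the actual scheme):

```python
import numpy as np

def quantize_symmetric(x, unsigned=False):
    # Hypothetical sketch: scale fp32 values into the 8-bit integer range.
    # Activations after ReLU are non-negative, so they can use u8;
    # weights are signed, so they use s8.
    if unsigned:
        qmax = 255                      # u8 range [0, 255]
        scale = qmax / x.max()
        q = np.clip(np.round(x * scale), 0, qmax).astype(np.uint8)
    else:
        qmax = 127                      # s8 range [-128, 127]
        scale = qmax / np.abs(x).max()
        q = np.clip(np.round(x * scale), -128, qmax).astype(np.int8)
    return q, scale                     # keep scale to dequantize s32 results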

Framework Support for Lower-Precision Operations

Intel has enabled 8-bit inference in Intel® Distribution of Caffe*. Intel’s DL Inference Engine, Apache* MXNet*, and TensorFlow* 8-bit inference optimizations are expected to be available in Q2 2018. All these 8-bit optimizations are currently limited to CNN models; support for RNN models and other frameworks will follow later in 2018. Our white paper provides detailed reports on framework support for lower-precision operations as well as explanations of the modifications necessary to enable lower-precision computation in other deep learning frameworks.

More on Lower-Precision Operations

Intel is enabling excellent deep learning performance on the versatile, well-known, standards-based Intel® architecture already relied upon for many other popular workloads. Low-precision operations present an exciting opportunity to accelerate deep learning workloads. We look forward to continuing our work to enable better and more widespread support for low-precision operations.

For more information on enabling low-precision operations on Intel Xeon Scalable processors, please consult our white paper “Lower Numerical Precision Deep Learning Inference and Training.” Additionally, please stay tuned to AI.intel.com for more on Intel’s work to accelerate deep learning on Intel architecture.


Notices and Disclaimers:

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”.  Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804

Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.

© 2018 Intel Corporation. Intel, the Intel logo, Xeon, Xeon logos, and neon, are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

[1] https://itpeernetwork.intel.com/science-engineering-supercomputers-ai/

[2] E.g., Vanhoucke et al. (2011); Hwang et al. (2014); Courbariaux et al. (2015); Koster et al. (2017); Kim and Smaragdis (2016)