Deep Learning is Coming of Age

In the early days of artificial intelligence (AI), Hans Moravec asserted what became known as Moravec’s paradox: “it is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.”

This assertion held for a quarter of a century but is now unraveling, primarily due to the ascent of deep learning (DL). With the rapid advance of deep learning, it is essential to understand how this technology will mature by the end of this decade and which forces are shaping leadership hardware and software in the 2019/2020 timeframe.

In the early 2000s, the availability of massively parallel compute capabilities and huge amounts of data was joined by breakthroughs in machine learning techniques. The beginning of a revolution in the field of computing was marked by leaps such as those achieved by Geoffrey Hinton’s lab at the University of Toronto (2006). That work included training neural networks one layer at a time, which paved the way to larger networks with more layers, essentially making the models “deeper”. The resulting speedup yielded effective and viable compute models for solving real-world problems. Deep learning was born as an updated take on the earlier ‘connectionism’ approach, and has profoundly boosted the reach and results of machine learning ever since.

Scientists started throwing larger datasets at the newly defined deep neural networks (DNNs), and were soon solving tasks such as image and speech recognition with near-human accuracy. Traditional automakers and technology companies engaged in a race to bring the promise of autonomous driving closer every day. After a long winter, AI blossomed and transformed the face of tech giants such as Google, Microsoft, Amazon, Apple, and Facebook, while at the same time a multitude of deep learning startups received hundreds of millions in venture capital.

The transformative impact of AI and deep learning extends well beyond the high-tech industry. AI was in its ‘childhood’ during the years 2014-15, when fields like medicine, health care, agriculture, oil and gas, finance, banking, transportation, and even urban planning started experimenting with the application of this technology. By 2020, deep learning will have reached a fundamentally different stage of maturity and efficiency, and will be ‘coming of age.’ Deployment and adoption will cross the line from experimentation into core implementation and will permeate most industries and fields of research.

This rapid and fundamental transformation across industries and fields is happening not as a result of a one-dimensional breakthrough event, but rather through the collection of advances on a number of different fronts. We expand on these below.

Increased Topology Variety and Dataset Size and Complexity

In 2012, Google’s neural networks taught themselves to recognize cats and humans in YouTube* videos with 70% accuracy, a significant improvement over machine learning methods at the time. Just six years later, this technology has evolved to support real-life applications such as the identification of potentially malignant cells in 3D medical imaging — computations involving many more dimensions than earlier. As datasets have become more complex over the years, deep learning topologies have evolved.

The initial progress with neural nets has been illustrated by the ImageNet* project, an annual competition in visual recognition, with different topologies being developed to address the challenge: AlexNet* (Alex Krizhevsky, 2012), GoogLeNet*/Inception (2014), and Residual Neural Network (ResNet*), among others. The solutions improved at a very fast pace and have resulted in major changes to the structure of DL topologies, e.g., growing from 8 layers to 152 layers in just 4 years.

AI Shifts Towards Significantly More Inference Cycles

Advancements in speed and accuracy have made deep learning both viable and cost-effective, pushing it from its exploration phase into broad deployment. As deep learning has been adopted more broadly, there has been a clear shift in the ratio between cycles of training (producing models) and inference (applying models), from 1:1 in the early days of DL to potentially well over 1:5 by 2020. Additionally, real-world applications have increasingly strict latency requirements, demanding lightning-fast DL inference. This puts inference acceleration in the spotlight and at the core of many hardware and software solutions, and promises a more rapid expansion of inference infrastructure.
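
To make the ratio concrete, the short sketch below converts a training-to-inference cycle ratio into the share of all cycles spent on inference. The helper name `inference_share` is hypothetical, and the 1:1 and 1:5 ratios are simply the illustrative figures quoted above, not measurements.

```python
from fractions import Fraction


def inference_share(training_parts: int, inference_parts: int) -> Fraction:
    """Return the fraction of all DL cycles spent on inference.

    The arguments are the two sides of a training:inference ratio,
    e.g. (1, 5) for a 1:5 ratio. Hypothetical helper for illustration.
    """
    if training_parts < 0 or inference_parts < 0:
        raise ValueError("ratio parts must be non-negative")
    total = training_parts + inference_parts
    if total == 0:
        raise ValueError("at least one ratio part must be positive")
    return Fraction(inference_parts, total)


early = inference_share(1, 1)  # 1:1 ratio from the early days of DL
later = inference_share(1, 5)  # 1:5 ratio projected for 2020

print(f"1:1 -> {float(early):.0%} of cycles are inference")
print(f"1:5 -> {float(later):.0%} of cycles are inference")
```

Under these assumptions, moving from 1:1 to 1:5 raises inference from half of all cycles to five sixths of them, which is why inference infrastructure is expected to grow faster than training infrastructure.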

Deep Learning Frameworks and the Democratization of Data Science

Early AI techniques traditionally required deep expertise in programming data-parallel machines and in specialized statistical libraries for languages and environments such as R*, MATLAB*, Python*, Java*, Scala*, or C++*. The rapid maturation of deep learning into broad deployment has been supported by the development of frameworks that abstract away lower-level dependencies and facilitate deep learning implementation at scale: from Theano* in 2007, to Caffe* in 2013, and Torch* in 2014. However, the major shift to deep learning at scale started circa 2016, with deep learning frameworks providing a new level of abstraction, tools, and hardware independence. This spurred a tremendous increase in the number of data scientists and AI practitioners developing models in the new frameworks.

TensorFlow* was released by Google in 2015. Its flexible architecture allows for the easy deployment of computation across a variety of platforms, from desktops, to clusters of servers, to mobile and edge devices. TensorFlow quickly became the most widely used deep learning framework. Other frameworks were introduced during the 2016 to 2018 period, including Caffe2*, PyTorch*, MXNet*, CNTK*, and PaddlePaddle*.

The release of multiple frameworks that significantly ease the implementation of deep learning in different environments has greatly contributed to its democratization, making it much more accessible and further strengthening the dissemination and wide adoption of DL-based applications.

Hardware Architectures for Heterogeneous Deployments

The adoption of AI throughout its childhood phase was strongly supported by GPUs. Yet, GPUs were not originally architected for neural network deployment. Due to the availability of GPUs, DL algorithms were adapted to use them for acceleration, gaining a significant boost in performance and bringing the training of several real-world problems to a feasible and viable range for the first time.

CPUs, libraries, and other software capabilities were initially not optimized for DL, but have gone through significant optimization during 2016-2018. Through the addition of dedicated libraries such as Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), CPU systems have demonstrated enormous performance gains compared to levels observed in early 2016 (i.e., Intel® Xeon® processor E5 v3 (codenamed Haswell) with BVLC-Caffe*): up to 277x for inference (for GoogLeNet v1* using Intel® Optimization for Caffe*) and up to 241x for training (for AlexNet using Intel Optimization for Caffe); see the configuration details under Notices and Disclaimers. Many topologies specifically benefit from processing on CPUs like Intel® Xeon® Scalable processors because of the memory requirements or hybrid workloads involved. Additionally, since Intel Xeon Scalable processors are relied upon for so many other enterprise workloads, leveraging them for AI comes at minimal extra cost.

For customers whose DL demands grow more intensive and sustained, a new class of accelerators is emerging, with very high concurrency across large numbers of compute elements (spatial architectures), fast data access, high-speed memory close to the compute, high-speed interconnect, and multi-node scaled solutions. Intel is developing a purpose-built accelerator, called the Intel® Nervana™ Neural Network Processor (Intel® Nervana™ NNP), which delivers on all of these fronts. Similarly, Intel® Movidius™ Myriad™ X VPU and other Movidius VPUs offer capabilities for media processing and DL inference for low-power edge devices.
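
As a rough cross-check of the CPU speedup figures above, the measured throughputs listed in the configuration notes (1449 images/sec for GoogLeNet v1 inference, 1257 images/sec for AlexNet training) can be divided by the quoted speedups to estimate the baseline throughput they imply. The baseline values computed here are derived estimates, not numbers reported in this document, and the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(measured_imgs_per_sec: float, speedup: float) -> float:
    """Baseline throughput implied by a measured throughput and a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return measured_imgs_per_sec / speedup


# Measured throughputs and speedup factors as quoted in this document.
inference_baseline = implied_baseline(1449, 277)  # GoogLeNet v1 inference
training_baseline = implied_baseline(1257, 241)   # AlexNet training

print(f"implied inference baseline: {inference_baseline:.2f} imgs/sec")
print(f"implied training baseline:  {training_baseline:.2f} imgs/sec")
```

Both estimates come out at roughly 5.2 images per second, which gives a sense of how low the unoptimized starting point was on the older platform.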

Increased Breadth of Deployment

The years 2014 and 2015 witnessed a great amount of experimentation with AI in commercial deployments. Today, most companies are still exploring and starting to understand the benefits that AI can bring to their individual industries. By 2020, AI will move in the technology adoption cycle from “early adopters” to “early majority,” with deployments moving from experimental into the main lines of business. We will also observe accelerated adoption within academic/scientific and government environments. Such broad deployment requires solutions that cover a wide range of settings, from sub-watt end devices to megawatt racks and everything in between. Rolling out deployments in lines of business implies shifting the criteria towards scale, flexibility, power, and cost efficiency.

Increased Accuracy in Performance Assessments

Earlier hardware performance metrics such as TOPs (peak theoretical Tera Operations per second) are not a representative measure of performance for AI. TOPs measure the raw compute potential of a platform, which is rarely achieved in real AI applications. They do not capture meaningful utilization constraints from real DL workloads such as data residency and reuse, data flow and interconnect, and workloads that combine compute types (e.g., attention in Neural Machine Translation, or Reinforcement Learning). More relevant benchmarks are starting to emerge, in which actual workloads are exercised. In those benchmarks, accuracy is generally maintained while the measured metrics include effective TOPs (or throughput), power efficiency (inferences per second per watt), latency, and Total Cost of Ownership (TCO). To address the various limits of machines, benchmarks will tend to cover image recognition, speech or Neural Machine Translation (NMT), and elements of recommendation systems.
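
The gap between peak and effective figures can be illustrated with a minimal sketch. The `BenchmarkRun` class, its field names, and all of the numbers below are hypothetical, chosen only to show how utilization and power efficiency might be derived from a measured workload run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One measured run of a real DL workload (all fields are hypothetical inputs)."""
    peak_tops: float           # vendor-quoted peak theoretical TOPs
    effective_tops: float      # TOPs actually sustained on the workload
    inferences_per_sec: float  # measured throughput
    avg_power_watts: float     # average power drawn during the run

    def utilization(self) -> float:
        """Fraction of peak compute that the workload actually achieved."""
        return self.effective_tops / self.peak_tops

    def inferences_per_sec_per_watt(self) -> float:
        """Power efficiency: throughput normalized by power draw."""
        return self.inferences_per_sec / self.avg_power_watts


# Made-up numbers, used only to demonstrate the calculation.
run = BenchmarkRun(peak_tops=100.0, effective_tops=25.0,
                   inferences_per_sec=2000.0, avg_power_watts=250.0)

print(f"utilization: {run.utilization():.0%}")
print(f"efficiency:  {run.inferences_per_sec_per_watt():.1f} inferences/sec/W")
```

In this made-up case a platform rated at 100 peak TOPs sustains only a quarter of that on the workload, which is the kind of gap a peak-only metric hides.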

What it Takes to Win at Scale

From the most-advanced drone to the largest data center, the upcoming larger and broader deployment of AI will demand a comprehensive portfolio of products and platforms addressing different requirements and constraints (such as at the edge vs. in the data center). Today, inference primarily runs on CPU, and a typical deployment will use a mix of neural networks and other types of compute. The trade-offs between performance, power efficiency, and latency vary significantly across compute environments, and highly efficient solutions demand completely integrated systems with optimized interactions between CPU, accelerator (when needed), memory, storage, and fabric. The AI space is becoming increasingly complex, and a one-size-fits-all solution cannot optimally address the unique constraints of each environment across the spectrum of AI compute.

The key elements of a leadership solution include a widely deployed, highly effective CPU as a flexible and powerful platform foundation, combined with a portfolio of specialized, competitive accelerators for specific demanding workloads from the end-point to the data center, a tightly integrated system to provide the best overall real-world solution, and a strong software stack to support all the popular deep learning frameworks. These are the considerations we have in mind as we build our product roadmap to best support AI on Intel® architecture.

Notices and Disclaimers:
Performance results are based on testing as of 06/15/2015 (v3 baseline), 05/29/2018 (241x) & 6/07/2018 (277x) and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Configurations for Inference throughput:
Tested by Intel as of 6/7/2018:Platform :2 socket Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz / 28 cores HT ON , Turbo ON Total Memory 376.28GB (12slots / 32 GB / 2666 MHz),4 instances of the framework, CentOS Linux-7.3.1611-Core , SSD sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB , Deep Learning Framework caffe version: a3d5b022fe026e9092fc7abc7654b1162ab9940d Topology:GoogleNet v1 BIOS:SE5C620.86B.00.01.0004.071220170215 MKLDNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396 NoDataLayer. Measured: 1449 imgs/sec vs Tested by Intel as of 06/15/2018 Platform: 2S Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 64GB DDR4-2133 ECC RAM. BIOS: SE5C610.86B.01.01.0024.021320181901, CentOS Linux-7.5.1804(Core) kernel 3.10.0-862.3.2.el7.x86_64, SSD sdb INTEL SSDSC2BW24 SSD 223.6GB. Framework BVLC-Caffe: https://github.com/BVLC/caffe, Inference & Training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594
Configuration for training throughput:
Tested by Intel as of 05/29/2018 Platform :2 socket Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz / 28 cores HT ON , Turbo ON Total Memory 376.28GB (12slots / 32 GB / 2666 MHz),4 instances of the framework, CentOS Linux-7.3.1611-Core , SSD sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB , Deep Learning Framework caffe version: a3d5b022fe026e9092fc7abc765b1162ab9940d Topology:alexnet BIOS:SE5C620.86B.00.01.0004.071220170215 MKLDNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396 NoDataLayer. Measured: 1257 imgs/sec vs Tested by Intel as of 06/15/2018 Platform: 2S Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 64GB DDR4-2133 ECC RAM. BIOS: SE5C610.86B.01.01.0024.021320181901, CentOS Linux-7.5.1804(Core) kernel 3.10.0-862.3.2.el7.x86_64, SSD sdb INTEL SSDSC2BW24 SSD 223.6GB. Framework BVLC-Caffe: https://github.com/BVLC/caffe, Inference & Training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594
For more information go to www.intel.com/benchmarks.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel, the Intel logo, Xeon, Nervana, Movidius, and Myriad are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation