Ushering in the convergence of AI and HPC: What will it take?

Jul 24, 2017

We have entered the era of AI. The abundance of data and algorithmic innovations in deep learning have led to a sharp increase in the compute needed to train increasingly deep networks. However, commonly used deep neural networks are not easy to scale for a fixed-size problem onto a large system with thousands of nodes. Training at this scale requires compute power that grows to the size of high performance computing infrastructures if AI is to transform industries such as healthcare, retail, and finance. Adding to the dilemma, unlike a traditional HPC programmer who is well versed in low-level APIs for parallel and distributed programming, such as OpenMP or MPI, a typical data scientist who trains deep neural networks on a supercomputer is likely familiar only with high-level, scripting-language-based frameworks like Caffe* or TensorFlow*.

Intel is conducting research to help bridge this gap and bring about the much-needed HPC-AI convergence. Working in collaboration with researchers at the National Energy Research Scientific Computing Center (NERSC), Stanford University, and the University of Montreal, we have achieved a scaling breakthrough for deep learning training. We have scaled to more than 9,000 Intel® Xeon Phi™ processor-based nodes on the Cori supercomputer, while staying within the accuracy and small-batch-size constraints of today’s popular stochastic gradient descent (SGD) variants by using a hybrid parameter update scheme. We will share this work at the upcoming Supercomputing Conference in Denver, November 12–17, 2017.
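The post does not detail the hybrid parameter update scheme itself, but the synchronous data-parallel SGD baseline it builds on can be sketched in a few lines. The following is a minimal, single-process NumPy simulation, not the actual Cori implementation: the worker count, the least-squares model, and the learning rate are all illustrative assumptions, and the gradient averaging stands in for the allreduce an MPI job would perform across nodes.

```python
import numpy as np

def data_parallel_sgd_step(w, shards, lr=0.1):
    """One synchronous data-parallel SGD step (illustrative sketch).

    Each simulated 'worker' computes a gradient on its own data shard;
    the gradients are then averaged, as an MPI allreduce would do across
    nodes, and a single global update is applied.
    """
    grads = []
    for X, y in shards:                    # one iteration per simulated worker
        err = X @ w - y
        grads.append(X.T @ err / len(y))   # local mini-batch gradient
    g = np.mean(grads, axis=0)             # averaged gradient (the "allreduce")
    return w - lr * g

# Toy usage: recover w_true on a least-squares problem split across 4 workers.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(256, 2))
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 simulated workers

w = np.zeros(2)
for _ in range(200):
    w = data_parallel_sgd_step(w, shards)
```

The point of the hybrid and small-batch work referenced above is precisely that this naive scheme stops scaling once the effective global batch grows with the node count; the sketch shows only the baseline being improved upon.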

Additionally, in collaboration with Dr. Jorge Nocedal at Northwestern University, we have developed novel insights into why large-batch methods don’t scale and how to address the fundamental roadblocks that limit parallelism and the ability to bring deep learning to HPC scale. This work was published at the International Conference on Learning Representations (ICLR) in April of this year.

An equally important challenge in the convergence of HPC and AI is the gap between programming models. HPC programmers are “parallel programming ninjas”, but deep learning and machine learning are mostly programmed using MATLAB-like frameworks. We must address the challenge of delivering scalable, HPC-like performance for AI applications without the need to train data scientists in low-level parallel programming.

We have addressed this problem at the levels of library, language, and runtime. More specifically, we have achieved significant[1] (more than 10x) performance improvement by enabling MPI libraries to become efficiently callable from Apache Spark*, which is described in our Very Large Data Bases Conference (VLDB) paper presented earlier this year. Additionally, in collaboration with Julia Computing and MIT, we have managed to significantly speed up Julia programs both at the node level and on clusters. Under the hood, ParallelAccelerator and the High Performance Analytics Toolkit (HPAT) turn programs written in productivity languages (such as Julia* and Python*) into highly performant code. These have been released as open-source projects, which will help enable research in academia and industry to push advanced runtime capabilities even further.
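To make the productivity-language idea concrete, here is a rough sketch, not HPAT's actual API or generated code, of the transformation such compilers aim to automate. The first function is the kind of array-style code a data scientist writes; the second is a hand-written, single-process imitation of the sharded, rank-per-loop parallelization with a final reduction (conceptually an MPI_Allreduce) that a compiler would emit. The rank count and the Monte Carlo pi example are illustrative assumptions.

```python
import numpy as np

def pi_estimate(n):
    """Productivity-style code a data scientist would write:
    a Monte Carlo estimate of pi over n random points."""
    rng = np.random.default_rng(42)
    pts = rng.random((n, 2))
    return 4.0 * np.mean((pts ** 2).sum(axis=1) < 1.0)

def pi_estimate_sharded(n, nranks=4):
    """Hand-written sketch of the parallelization a compiler like HPAT
    would generate: each simulated 'rank' processes its own shard of
    points, and the partial counts are combined in a final reduction
    (conceptually an MPI_Allreduce across nodes)."""
    hits = 0
    shard = n // nranks
    for rank in range(nranks):
        rng = np.random.default_rng(42 + rank)  # independent per-rank stream
        pts = rng.random((shard, 2))
        hits += np.count_nonzero((pts ** 2).sum(axis=1) < 1.0)
    return 4.0 * hits / (shard * nranks)        # the reduction step
```

The design point is that the user only ever writes the first version; the compiler's job is to produce the second, with real MPI ranks instead of a Python loop, so scaling comes without any hand-written parallel code.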

The emerging AI community on HPC infrastructure is critical to achieving the vision of AI: machines that don’t just crunch numbers, but help us make better and more informed complex decisions. This type of decision making requires a plethora of data and deep learning algorithms that scale to the size of today’s high performance computing infrastructures to train these increasingly deep networks. Delivering this compute power is what it will take to expand the reach of compute to improve medical treatment, deliver autonomous vehicles, help make important business decisions in real time, and more.