Heterogeneous Computing: AI Hardware Designed for Specific Tasks

Artificial Intelligence (AI) is quickly advancing in its ability to process data and automate repetitive tasks through training and inference. To truly unleash the power of machine learning (ML) and deep learning (DL), and free them from processing delays, various types of AI hardware, along with a robust software stack, are needed to bring computation closer to end users’ devices and IoT systems. Data centers will continue to handle the bulk of general-purpose processing in addition to new AI applications, so enterprises and cloud service providers must ensure their infrastructure is flexible enough to deliver a range of services.

Applications leveraging ML or DL span many different use cases, each with its own processing needs. Part of the practitioner’s job is to identify the right hardware and software platform for each unique task. Let’s take a look at some of the hardware options for AI processing mentioned in the recent paper “AI Requires Many Approaches” by Linley Gwennap, Principal Analyst, The Linley Group:

CPUs power PCs, servers, and supercomputers, and run a wide variety of software programs. Data center operators can use these flexible processors to run ML and DL workloads that rely on sophisticated mathematical and statistical computations. Their superior memory capacity also allows CPUs to tackle deep learning on large unstructured data sets and dense imaging applications.

Intel® Xeon® Scalable processors have been enhanced for high-performance AI algorithms alongside the data center and cloud applications they already run. Intel powers the majority of data center servers running AI workloads today. From classical methods, such as support vector machines and random forests, to deep learning, analytics algorithms, and AI techniques like regression learning, Intel Xeon Scalable processors have the compute and memory to handle them all.
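
To make that concrete, here is a minimal, illustrative sketch (not from the paper) of the classical methods mentioned above, a support vector machine and a random forest, training entirely on CPU cores with scikit-learn; the data set is synthetic and the model settings are arbitrary:

```python
# Illustrative sketch: classical ML methods running on a CPU.
# Assumes scikit-learn is installed; data and settings are arbitrary.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic tabular data standing in for a real enterprise data set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (SVC(kernel="rbf"), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)  # trains entirely on CPU cores
    print(type(model).__name__, model.score(X_test, y_test))
```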

GPUs are very efficient at processing graphics and rendering images for gaming, and it was later discovered that they can also be programmed for the highly parallel matrix mathematics used in deep neural networks (DNNs). GPUs work well for batched image inferencing, but because of their memory constraints and programmability limits, purpose-built accelerators offer an alternative for deep learning workloads that require lower latency, flexible form factors, or lower/customizable power.
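
As an illustration of why DNNs map so well to GPUs, here is a sketch of the batched matrix math behind a single fully connected layer; NumPy runs it on the CPU, but a GPU executes the same pattern across thousands of parallel threads (sizes are arbitrary):

```python
# Illustrative sketch of the batched matrix math at the heart of DNN inference.
import numpy as np

batch, in_features, out_features = 64, 512, 256  # arbitrary sizes
x = np.random.rand(batch, in_features).astype(np.float32)         # batched inputs
W = np.random.rand(in_features, out_features).astype(np.float32)  # layer weights
b = np.zeros(out_features, dtype=np.float32)                      # layer bias

# One fully connected layer: every input in the batch is transformed by the
# same weight matrix, so the work is embarrassingly parallel.
y = np.maximum(x @ W + b, 0.0)  # matrix multiply + bias + ReLU
print(y.shape)                  # (64, 256)
```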

Purpose-Built Accelerators are newly developed custom ASICs optimized for deep learning acceleration, providing a bridge to AI. Accelerator development is still in its infancy, but these chips are expected to deliver high performance with optimized power consumption. Most flavors of accelerators will share common features: a focus on smaller integer data types, a systolic multiply-accumulate (MAC) array, and special hardware to compute common DNN functions.
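
For intuition, here is a toy Python model (illustrative only, not a hardware description from the paper) of the integer multiply-accumulate step a systolic MAC array performs, with INT8 operands summed into a wider INT32 accumulator:

```python
# Toy model of the INT8 multiply-accumulate (MAC) step that systolic arrays
# implement in hardware. Sizes and values are arbitrary.
import numpy as np

a = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)  # activations
w = np.random.randint(-128, 128, size=(8, 4), dtype=np.int8)  # weights

# Each cell of a systolic array performs one MAC per cycle; INT8 products are
# accumulated in a wider INT32 register to avoid overflow.
acc = np.zeros((4, 4), dtype=np.int32)
for i in range(4):          # rows of activations streaming in
    for k in range(4):      # columns of weights streaming in
        for j in range(8):  # the accumulate dimension
            acc[i, k] += int(a[i, j]) * int(w[j, k])

# Sanity check against a plain integer matrix multiply.
assert (acc == a.astype(np.int32) @ w.astype(np.int32)).all()
```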

The upcoming Intel® Nervana™ Neural Network Processor is an architecture built to accelerate deep learning. It optimizes memory and interconnects to provide more computation capability and better model scalability than existing data center architectures. Built to scale, it has a roadmap designed to unleash higher levels of training performance as newer algorithms become available.

Field Programmable Gate Arrays (FPGAs) are programmable chips that can implement new, custom architectures for AI acceleration. Unlike a custom ASIC or a GPU, their purpose and power profile can be adapted again and again for any number of workloads and a wide range of structured and unstructured data types. This flexibility comes from their reconfigurable fabric, which enables direct connection of various inputs, such as cameras, without needing an intermediary.

Intel® FPGAs make real-time inference possible by offering an architecture built with high-bandwidth memory and multi-core parallelism, which translates to low-latency processing. Because they are completely customizable, FPGA accelerators can be tuned for exact precision, performance, power, and accuracy. Additionally, various levels of numeric precision can be applied across floating-point and integer values, including FP32, FP16, INT8, and INT4. When new precisions are needed, the Intel FPGA can be reprogrammed for added flexibility.
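
As a rough illustration of trading numeric precision for efficiency, here is a sketch of symmetric linear INT8 quantization in NumPy; this scheme is a common one chosen purely for illustration, not a description of Intel FPGA tooling:

```python
# Illustrative sketch: quantizing FP32 weights to INT8 and measuring the
# precision given up in exchange for a 4x smaller memory footprint.
import numpy as np

w = np.random.randn(1024).astype(np.float32)   # FP32 weights
scale = float(np.abs(w).max()) / 127.0         # one per-tensor scale factor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_fp = w_int8.astype(np.float32) * scale       # dequantized approximation

print("max abs error:", float(np.abs(w - w_fp).max()))
print("bytes fp32:", w.nbytes, "bytes int8:", w_int8.nbytes)
```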

Computer Vision powers a variety of applications, from autonomous vehicles to factory automation and search-engine image categorization, using sophisticated AI algorithms and hardware fed with data from cameras and sensors. Vision processing needs are often closely tied to on-device compute because the processors must be portable and powerful while remaining power efficient.

Intel® Movidius™ vision processing units (VPUs) push the boundaries of what’s possible in AI at the edge with extremely low-power inferencing right on the device. They also run vision algorithms such as depth mapping, feature extraction, and visual odometry. Consumer applications of computer vision often complement existing product categories, making them more intuitive to use and more capable, and in many cases providing an ambient computing experience rather than direct interaction. This can be seen in experiences such as digital payments and smart appliances for the home.
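
To give a flavor of the feature-extraction workloads described above, here is an illustrative sketch using OpenCV’s ORB detector on a synthetic frame; it runs on the CPU here and stands in for, rather than demonstrates, the Movidius software stack:

```python
# Illustrative sketch of a feature-extraction workload of the kind a VPU
# accelerates at the edge. Assumes opencv-python is installed.
import cv2
import numpy as np

# Synthetic grayscale frame standing in for a camera capture.
frame = np.zeros((240, 320), dtype=np.uint8)
cv2.rectangle(frame, (60, 60), (180, 160), 255, thickness=2)
cv2.circle(frame, (240, 120), 40, 200, thickness=2)

orb = cv2.ORB_create(nfeatures=200)  # keypoint detector and descriptor
keypoints, descriptors = orb.detectAndCompute(frame, None)
print(len(keypoints), "keypoints;",
      None if descriptors is None else descriptors.shape, "descriptor matrix")
```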

When taking into consideration the different types of AI solutions that are being developed and the unique needs and requirements for each solution, it’s clear that there is no single best processor or hardware accelerator for AI. The type of workload, algorithm, constraints, and parameters of the task will determine the most suitable approach.

For a deeper dive into this subject, see the paper “AI Requires Many Approaches” by Linley Gwennap, Principal Analyst, The Linley Group.