The Nervana Cloud provides unprecedented performance, ease of use, and the ability to apply deep learning to a large range of machine learning problems. With modern networks taking days, weeks or even months to train, performance is one of our fundamental goals. GPUs allow us to greatly improve performance by parallelizing convolution and matrix multiply operations over thousands of CUDA cores. Nervana tops the benchmarks for deep learning performance on single GPUs. In order to provide the best possible performance, we have extended this parallelization across multiple GPUs within a physical machine. Since the GPUs are linked together with high speed PCIe buses and direct peer-to-peer memory copies are possible through the driver, we have added a multi-GPU backend implemented with pycuda to our cloud version of neon.

Methods of Distributing Models

There are two commonly used ways to distribute the training of a neural network among processors with independent memory: data parallel and model parallel[1]. In the data parallel method, a minibatch is distributed amongst the nodes such that each node handles the forward and backward propagation for a fragment of the inputs. In the model parallel method, parameters for each layer are distributed amongst the nodes, and each node maintains and updates a fragment of the parameters. In addition to providing a speedup from parallel execution, these methods are useful when the network or desired minibatch size is too large to fit in a single node’s memory.
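To make the distinction concrete, here is a minimal NumPy sketch that simulates both schemes for a single linear layer. The devices are simulated with plain Python lists, and the shapes, the device count, and all variable names are assumptions chosen for illustration; nothing here is neon code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices = 4                          # assumed device count
X = rng.standard_normal((64, 32))      # minibatch: 64 inputs, 32 features
W = rng.standard_normal((32, 16))      # layer parameters
full = X @ W                           # reference result on a single device

# Data parallel: every device holds all of W and a fragment of the minibatch.
data_parallel = np.concatenate(
    [x_frag @ W for x_frag in np.split(X, n_devices, axis=0)], axis=0)

# Model parallel: every device holds all of X and a fragment of W's columns.
model_parallel = np.concatenate(
    [X @ w_frag for w_frag in np.split(W, n_devices, axis=1)], axis=1)

# In the data parallel case each device only sees part of the minibatch,
# so the per-device weight gradients must be summed to recover the full one.
dY = rng.standard_normal(full.shape)   # stand-in for the upstream gradient
grad_full = X.T @ dY
grad_parts = [x.T @ dy for x, dy in
              zip(np.split(X, n_devices), np.split(dY, n_devices))]
grad_sum = sum(grad_parts)

print(np.allclose(data_parallel, full), np.allclose(model_parallel, full))
print(np.allclose(grad_sum, grad_full))
```

Both schemes reproduce the single-device output. The gradient check at the end shows why data parallel training needs a synchronization step: no single device has the complete weight gradient until the partial results are combined.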

For our multi-GPU backend, we began with support for data parallel training, since it is easily applied to new types of layers and models. Data parallelism allows us to parallelize many types of layers over multiple GPUs with low overhead. Since each GPU is training with a different subset of the inputs, it is necessary to synchronize parameters between GPUs after each minibatch. This synchronization step is the primary overhead, and has influenced our implementation. Due to the nature of the networks we are training, it is generally not beneficial to parallelize fully connected layers in this manner because of the large number of parameters present in these layers and relatively low computational workload. Our implementation switches from data parallel to a single GPU (not distributed) when fully connected layers are encountered. This results in our other primary overhead, which is gathering the activations from the data parallel convolutional layers to a single GPU.

Multi-GPU Implementation and Basic Operations

An early goal for the multi-GPU backend was to be lightweight in a way that wouldn’t require lots of maintenance work every time a new type of layer or model is added to neon. To accomplish this, the multi-GPU backend is designed to wrap the functionality of our NervanaGPU backend. Most of the work handled by the multi-GPU backend is in fragmenting, gathering, replicating, and reducing tensors between GPUs. Additionally, the implementation is built to infer when these operations need to occur so that the user requires no knowledge of the backend.

As mentioned above, there are four fundamental operations that are handled by the multi-GPU backend: fragment, gather, replicate, and reduce. In the multi-GPU backend we have several tensor types which closely relate to these operations. A singlenode tensor is one which exists only on the primary GPU and contains the full tensor’s data. A replica tensor is one where the entire tensor exists in replicated form on every GPU. A fragment tensor is one which has been split between multiple GPUs such that each GPU contains an equal, but different, portion of the data.

# `be` is assumed to be an already-initialized multi-GPU backend instance
# Backend will create a singlenode tensor by default
A = be.empty((1000, 128)) # this is equivalent to...
A = be.empty((1000, 128), parallel=False, distributed=False)
# Assign some data to the tensor
A[:] = be.rand()

# Create new tensor and replicate data to all GPUs
B = be.empty(A.shape, parallel=True, distributed=False)
B[:] = A


# Create new tensor and fragment data between GPUs
C = be.empty(A.shape, parallel=True, distributed=True)
C[:] = A

The examples above illustrate how the backend tensor creation function empty can be used to create different types of multi-GPU tensors with the parallel and distributed arguments. The parallel argument tells the backend that (a copy or fragments of) this tensor should be present on all GPUs, so that computations involving the tensor can be done in parallel. The distributed argument tells the backend that the tensor data should be divided among all of the GPUs, so that workload involving this tensor can be distributed. Together these arguments form our three tensor categories. Note that these arguments will not have any effect when running on single node backends such as NervanaGPU and NervanaCPU.
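The mapping from the two arguments to the three categories can be summarized in a few lines. This is only an illustrative sketch: tensor_category is a hypothetical helper, not part of the backend, and raising an error for distributed=True without parallel=True is an assumption, since that combination is not described here.

```python
def tensor_category(parallel=False, distributed=False):
    """Return the tensor category implied by the two flags (hypothetical helper)."""
    if not parallel:
        if distributed:
            # Assumption: this combination is not described in the text.
            raise ValueError("distributed=True requires parallel=True")
        return "singlenode"   # full data, primary GPU only
    if distributed:
        return "fragment"     # a different, equal-sized piece on every GPU
    return "replica"          # a full copy on every GPU


print(tensor_category())                                  # default flags
print(tensor_category(parallel=True))                     # like tensor B
print(tensor_category(parallel=True, distributed=True))   # like tensor C
```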

We must be able to convert between these three types of tensors. Operations to do so are implemented by the multi-GPU backend. The fragment operation converts from a singlenode tensor to a fragment tensor by dividing the data and using peer-to-peer memcpy to distribute a fragment to each GPU. In practice this happens when inputs are split among GPUs for the data parallel method, but a fragment operation also occurs behind the scenes in the example above when A is assigned to the fragment tensor C. Gather is the opposite, and copies the fragment from each GPU into a singlenode tensor containing all of the data on the primary GPU. This is necessary when activations from a parallel convolution layer need to be fed into a non-parallel fully connected layer. Replicate is used to copy the full tensor to every GPU and is mostly used during initialization of parameters. In the example above, a replicate operation happens during the assignment of A to B. The reduce operation is used to synchronize a replica tensor which has diverged. This happens after back propagation when each node has updated weights based on different training data. Reduce performs a bandwidth-efficient accumulation of the replicas. In the diagram below, several of these operations are used in the training of a simple convnet.
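The semantics of the four operations can be simulated on the host with NumPy, using a Python list to stand in for the set of GPUs. This is a rough sketch rather than the backend's implementation: the function names, the device count, the choice of axis 1 as the split axis, and the use of a plain sum for the reduce are all assumptions, and no peer-to-peer copies are involved.

```python
import numpy as np

N_GPUS = 4  # assumed device count


def fragment(singlenode, axis=1):
    """Split a singlenode tensor into one equal piece per simulated GPU."""
    return np.split(singlenode, N_GPUS, axis=axis)


def gather(fragments, axis=1):
    """Reassemble the per-GPU pieces into one singlenode tensor."""
    return np.concatenate(fragments, axis=axis)


def replicate(singlenode):
    """Give every simulated GPU its own full copy of the tensor."""
    return [singlenode.copy() for _ in range(N_GPUS)]


def reduce_replicas(replicas):
    """Accumulate divergent replicas and hand the total back to every GPU."""
    total = np.sum(replicas, axis=0)
    return [total.copy() for _ in range(N_GPUS)]


A = np.random.default_rng(0).random((1000, 128))
C = fragment(A)                   # analogous to C[:] = A
B = replicate(A)                  # analogous to B[:] = A
for gpu, replica in enumerate(B):
    replica += gpu                # each copy changes differently: B diverges
B = reduce_replicas(B)            # all copies agree again

print(C[0].shape, np.array_equal(gather(C), A))
print(all(np.array_equal(B[0], b) for b in B))
```

Fragment and gather are inverses of each other, while reduce is what turns a divergent replica back into a consistent one.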

[Figure: training a simple convnet]

The above diagram shows how a convolutional neural network can be parallelized over multiple GPUs using our data parallel implementation. The speedup comes from parallelizing the workload from convolutional layers and other layers supported by the multi-GPU backend. The primary overhead is in the inter-GPU transfers, shown in the diagram with colored arrows. When using neon through the Nervana cloud on multiple GPUs, this parallelization will all be done transparently with no modifications to the network or script.
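The forward pass of this hybrid scheme can be sketched in NumPy. The layer below is a stand-in that, like a convolution, processes each example independently; the shapes, names, and device count are assumptions for illustration, and the GPUs are simulated with a list.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, batch = 4, 64                      # assumed device count and minibatch
X = rng.standard_normal((batch, 48))       # inputs, one row per example
W_conv = rng.standard_normal((48, 24))     # stand-in conv parameters (replica)
W_fc = rng.standard_normal((24, 10))       # fully connected parameters (singlenode)


def conv_like(x, w):
    # Stand-in for a convolutional layer: works on each example independently.
    return np.maximum(x @ w, 0.0)


# Fragment: split the minibatch across the simulated GPUs.
x_frags = np.split(X, n_gpus, axis=0)
# Data parallel section: each GPU runs the layer on its own fragment.
act_frags = [conv_like(x, W_conv) for x in x_frags]
# Gather: collect the activations on the primary GPU.
acts = np.concatenate(act_frags, axis=0)
# Single-GPU section: the fully connected layer sees the whole minibatch.
out = acts @ W_fc

reference = conv_like(X, W_conv) @ W_fc    # same network on one device
print(np.allclose(out, reference))
```

The gather step in the middle corresponds to the activation transfer described above as one of the two primary overheads.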

Handling Compounded Operations in the OpTree

Aside from these basic operations, the multi-GPU backend must be able to infer when they are necessary. This need arises with OpTrees or compound backend operations with inputs of more than one type of tensor. In these situations, the implementation can infer which conversions must be executed based on the output tensor’s type. The output tensor is assumed to be of the correct type. Based on this assumption, the implementation fragments, gathers, or reduces the input tensors as necessary. This is illustrated in the following code.

# A, B, and C come from the previous code example
B[:] = B * (be.sum(C, axis=1))
A[:] = B * 3

The first expression above outputs to a replica tensor, B, and takes as inputs a fragment tensor, C, and B itself. In this case, reducing C along axis 1 results in a vector, which can be multiplied with each replica of B. However, this results in a replica tensor which is not consistent between GPUs (referred to as divergent in our implementation), since the result of be.sum(C, axis=1) will be different on each GPU. To maintain the tensor as a replica, the multi-GPU backend will automatically perform a reduce operation on this result so that all replicas of B will be the same. The second expression above writes to a singlenode tensor, A, and takes the replica, B, as an input. In this case, the backend knows that B is consistent between GPUs and will only perform the operation on GPU0.
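The two expressions can be traced by hand with a NumPy simulation. This sketch assumes that C is fragmented along axis 1 and that the reduce is a plain sum across GPUs; both are assumptions, and the GPUs are again simulated with lists.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4                                   # assumed device count
A = rng.random((1000, 128))                  # singlenode tensor
B = [A.copy() for _ in range(n_gpus)]        # replica of A on every GPU
C = np.split(A, n_gpus, axis=1)              # fragment of A on every GPU

# What a single device would compute for B * sum(C, axis=1).
expected = A * A.sum(axis=1, keepdims=True)

# Each GPU only sees its own fragment of C, so the per-GPU results differ.
divergent = [b * c.sum(axis=1, keepdims=True) for b, c in zip(B, C)]

# Reduce: accumulate across GPUs so that every replica of B agrees again.
total = np.sum(divergent, axis=0)
B = [total.copy() for _ in range(n_gpus)]

# Second expression: B is consistent, so only GPU 0 needs to do the work.
A_new = B[0] * 3

print(np.allclose(divergent[0], divergent[1]))   # per-GPU results disagree
print(np.allclose(B[0], expected))               # reduced result is correct
```

Because multiplication distributes over the sum, accumulating the per-GPU products gives the same answer as multiplying B by the full row sums on one device.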


Execution times for fprop, bprop, and update on the VGG E* network are given below for configurations of 1, 2, 4, and 8 GPUs. It is worth mentioning that the speedup attained by using multiple GPUs is highly dependent on the network being used. Networks with smaller layers (requiring fewer FLOPs) will not parallelize as well, because the driver and communication overheads will starve the GPUs for work. Additionally, networks that are mostly composed of fully connected layers will not parallelize well on this implementation, since these layers are executed on a single GPU. Nervana is currently building a novel distributed deep learning processor that will address these and other limitations of GPUs (which are, after all, Graphics Processing Units) for deep learning applications.

[Figure: execution times for fprop, bprop, and update on VGG E with 1, 2, 4, and 8 GPUs]

Times given for training on 512 images. This equates to 8 minibatches of 64 images each.

*The VGG network in the linked script has a pool window size of 3, whereas the network used for testing has a window size of 2.

Next Steps

In addition to the above synchronous data parallel approach, we have also implemented data parallel approaches that can extend across multiple nodes including parameter servers. These can scale better for fully connected and small convolutional networks while also supporting recurrent neural network and LSTM based models. Future blog posts will describe those approaches.

To get started with the Nervana Cloud or request more information, please contact us at


[1] Alex Krizhevsky (2014). One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997.
