Intern Spotlight: Implementing Language Models
Nov 04, 2015
During my internship at Nervana Systems, I implemented a few language models using Recurrent Neural Networks (RNNs) and achieved a significant speedup in training image captioning models. RNNs are good at learning relationships over sequences of data. For example, an RNN could be fed characters of Shakespearean text, learn an internal representation of what Shakespearean text looks like, and then sample from its predictions to generate some new Shakespeare. A vanilla RNN takes a sequence of data as input, updates a hidden state at each time step, and outputs a sequence.
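As a rough illustration of the "input sequence, hidden state over time, output sequence" description, here is a minimal NumPy sketch of a vanilla RNN forward pass. It is not neon™ code; the function name, weight names, and sizes are all assumptions made for the example, and the tanh nonlinearity is simply a common choice.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs of shape (T, input_dim).

    Returns the hidden states (T, hidden_dim) and outputs (T, output_dim).
    All parameter names are hypothetical, chosen for this sketch only.
    """
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    hs, ys = [], []
    for x in xs:
        # The new hidden state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hs.append(h)
        # One output per time step, read off the hidden state.
        ys.append(W_hy @ h + b_y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 8, 16, 8
xs = rng.standard_normal((T, input_dim))
W_xh = 0.1 * rng.standard_normal((hidden_dim, input_dim))
W_hh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
W_hy = 0.1 * rng.standard_normal((output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
```

For a character model, each row of `ys` would be turned into a probability distribution over the next character (for instance with a softmax) and sampled from.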
(Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
A variant of the RNN, the long short-term memory network (LSTM), helps resolve certain issues with the vanilla RNN, including difficulty learning long-term dependencies. Using an LSTM in Nervana’s deep learning library, neon™, I was able to produce some coherent Shakespeare-like text:
BAPTISTA: Where is the matter? what says my brother?
BENVOLIO: I will not be a soldier.
DON ADRIANO DE ARMADO: I will not be a consent of my soul.
AEMELIA: I have seen thee as the sea of the state of the state
LSTMs have also achieved state-of-the-art results on other NLP tasks like speech recognition, translation, and question answering (see Andrej Karpathy’s excellent RNN blog post).
A downside of using LSTMs is that training can take a long time, especially if the number of time steps is large. One of my projects involved implementing an image captioning model in neon™ and optimizing its training time. The image captioning model from Google (paper) takes the output features of the second-to-last layer of a Convolutional Neural Network (CNN) and feeds them to an LSTM to generate a sentence. This model treats the image features as the first word of a sentence and then recursively predicts the next word until a stop token is predicted.
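The generation loop described above can be sketched as follows. This is only an illustration under stated assumptions: `generate_caption`, `step_fn`, `embed`, and `stop_id` are hypothetical names, the recurrent step is replaced by a toy stand-in so the example runs on its own, and the greedy argmax choice of the next word is an assumption (other decoding strategies, such as beam search, are possible).

```python
import numpy as np

def generate_caption(image_features, step_fn, embed, stop_id, max_len=20):
    """Greedy decoding: the image features play the role of the first 'word'.

    step_fn(x, state) -> (scores over the vocabulary, new state) is a
    hypothetical stand-in for one step of a trained LSTM.
    """
    x, state, words = image_features, None, []
    for _ in range(max_len):
        scores, state = step_fn(x, state)
        word_id = int(np.argmax(scores))  # assumption: pick the top-scoring word
        if word_id == stop_id:
            break                          # stop token ends the sentence
        words.append(word_id)
        x = embed[word_id]                 # feed the predicted word back in
    return words

# Toy stand-in for a trained model: it ignores its input and emits a fixed script.
vocab_size = 5
script = [3, 1, 4, 0]  # word id 0 is the stop token in this toy setup

def toy_step(x, state):
    t = 0 if state is None else state
    scores = np.zeros(vocab_size)
    scores[script[t]] = 1.0
    return scores, t + 1

embed = np.eye(vocab_size)           # hypothetical word embeddings
image_features = np.zeros(vocab_size)  # must match the embedding size
caption_ids = generate_caption(image_features, toy_step, embed, stop_id=0)
```

In a real model the image features would first be projected to the word-embedding size so that they can be fed to the LSTM like any other word.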
(Vinyals et al., 2015)
One way to speed up training is to process a batch of training examples at the same time. This lets you perform large matrix multiplications, which GPUs are optimized for. To deal with variable-length sentences, we can pad shorter sentences up to the maximum sentence length and then apply a mask to ignore the padded parts. Using the NVIDIA Visual Profiler, we can also look for periods where the GPU is not being fully utilized (indicated by gaps between bars below).
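A minimal sketch of the padding and masking idea, assuming sentences are lists of integer word ids. The helper names and the choice of `0` as the padding id are assumptions for the example, and here the padding length is the longest sentence in the batch.

```python
import numpy as np

def pad_and_mask(sentences, pad_id=0):
    """Pad variable-length sentences into one (batch, max_len) array.

    The boolean mask is True for real tokens and False for padding.
    """
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sentences), max_len), dtype=bool)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

def masked_mean_loss(per_token_loss, mask):
    """Average a (batch, max_len) loss over real tokens only."""
    return (per_token_loss * mask).sum() / mask.sum()

sentences = [[5, 2, 9], [7, 1], [4, 8, 3, 6]]
batch, mask = pad_and_mask(sentences)
loss = masked_mean_loss(np.ones(batch.shape), mask)
```

Because the mask zeroes out the padded positions, they contribute nothing to the loss, so the whole batch can still be processed with a single large matrix multiplication per time step.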
Data loading can be performed asynchronously with computation. Moving data from the CPU to the GPU is generally very expensive and should be minimized or done in parallel with computation, so that we reduce the gaps where the GPU sits idle. Additionally, we can look at which kernel calls are made most often and check whether we can save computation by storing temporary values. Certain parts of the LSTM computation can be sped up by combining matrix multiplications, which reduces the number of kernel calls made to the GPU and therefore the launch latency. We also want to choose the right sizes for partitioning the data to allow for optimal memory access. Finally, at an even lower level, the kernels themselves can be optimized for 100% utilization.
With these tricks, training an image captioning model on the Flickr8k dataset (approx. 8,000 images and 40,000 sentences) takes around 20 seconds for one pass through the entire dataset on a Maxwell GPU. Existing code that only ran on a CPU (NeuralTalk) takes about 1.1 hours for one pass, so the GPU version is roughly 200x faster. The model reaches reasonable performance in about 15 passes, so the difference in total training time is 5 minutes versus 16.5 hours. Lower training time makes tweaking your models much easier because you can train several models with different hyperparameters and then choose the one that performs best.
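One common way to combine LSTM matrix multiplications, as mentioned earlier, is to stack the weights for all four gates into a single matrix so that each time step needs one large multiplication instead of four small ones. The NumPy sketch below shows that idea and checks it against the unfused version. It is not the neon™ implementation; the function names, the gate ordering, and the weight layout are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_fused(x, h, c, W, b):
    """One LSTM step with a single matmul for all four gates.

    x: (batch, input_dim); h, c: (batch, hidden)
    W: (input_dim + hidden, 4 * hidden); b: (4 * hidden,)
    Assumed gate order along the last axis: input, forget, output, candidate.
    """
    n = h.shape[1]
    z = np.concatenate([x, h], axis=1) @ W + b  # one large multiplication
    i, f, o = sigmoid(z[:, :n]), sigmoid(z[:, n:2 * n]), sigmoid(z[:, 2 * n:3 * n])
    g = np.tanh(z[:, 3 * n:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def lstm_step_separate(x, h, c, W, b):
    """Same computation with four separate matmuls, for comparison."""
    n = h.shape[1]
    xh = np.concatenate([x, h], axis=1)
    pre = [xh @ W[:, k * n:(k + 1) * n] + b[k * n:(k + 1) * n] for k in range(4)]
    i, f, o = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
    g = np.tanh(pre[3])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

rng = np.random.default_rng(0)
batch_size, input_dim, hidden = 4, 6, 8
x = rng.standard_normal((batch_size, input_dim))
h0 = np.zeros((batch_size, hidden))
c0 = np.zeros((batch_size, hidden))
W = 0.1 * rng.standard_normal((input_dim + hidden, 4 * hidden))
b = np.zeros(4 * hidden)

h_fused, c_fused = lstm_step_fused(x, h0, c0, W, b)
h_sep, c_sep = lstm_step_separate(x, h0, c0, W, b)
```

Both versions produce the same result; on a GPU the fused form would issue fewer, larger kernel calls, which is where the latency saving comes from.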
Interning at Nervana was an amazing experience. I was able to work on a few big projects in deep learning, including GoogLeNet, image captioning, and helping implement a fast LSTM. I was given complete responsibility for how I wanted to complete these projects and was also able to shape how large parts of the neon™ library worked to make it more modular and extensible. The people at Nervana Systems are extremely smart and collaborative. I could ask my mentor for help on a variety of problems, and he would immediately know a solution for things like how to optimize code for the GPU or how to design simple network layers. In addition, I could ask questions of, and gain insight from, the experts there in fields ranging from GPU programming to distributed computing.
The workplace and living in San Diego were also quite fun. San Diego is famous for its craft beer, and on weekends we would visit the local breweries. Most of us were into the outdoors, and the interns would organize regular outings for surfing, rock climbing, tennis, and mountain biking. Co-workers would hold events like picnics at the park, housewarming parties, and beer mile races. The office had an endless supply of snacks, its own squat rack, a two-wheel self-balancing scooter, and two cute dogs that co-workers brought in. Overall, I had an awesome experience while being able to build scalable deep learning models.
[Figure 1. http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
[Figure 2. http://arxiv.org/pdf/1411.4555v2.pdf]