Publications

  • Matthew Sotoudeh
  • Sara S. Baghsorkhi

DeepThin: A Self-Compressing Library for Deep Neural Networks

As the industry deploys increasingly large and complex neural networks to mobile devices, more pressure is put on the memory and compute resources of those devices. Deep compression, or compression of deep neural network weight matrices, is a technique to stretch resources for such scenarios. Existing compression methods cannot effectively compress models smaller than 1-2%…
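
The excerpt above stops before the method itself, so the following is only a generic sketch of weight-matrix compression via low-rank factorization (not the DeepThin scheme): a dense layer stores two thin factors instead of its full weight matrix and reconstructs the product lazily at inference time. The layer size and rank are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense layer: a 512 x 512 weight matrix (~262k parameters).
W = rng.standard_normal((512, 512)).astype(np.float32)

# Generic low-rank compression (illustrative only, not the DeepThin method):
# keep two thin factors U (512 x r) and V (r x 512) instead of W itself.
r = 4                                     # hypothetical rank, ~1.6% of the original size
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]                 # fold the singular values into U
V = Vt[:r, :]
print(f"stored parameters: {(U.size + V.size) / W.size:.1%} of the original")

# The layer computes its output from the factors without ever
# materializing the full 512 x 512 matrix.
def dense_forward(x, U, V):
    return (x @ U) @ V

x = rng.standard_normal((8, 512)).astype(np.float32)
print(dense_forward(x, U, V).shape)       # (8, 512)
```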

View Publication

A Progressive Batching L-BFGS Method for Machine Learning (PBQN)

The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise…
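
For context on the quasi-Newton machinery mentioned above, here is a textbook sketch of the standard L-BFGS two-loop recursion that turns stored curvature pairs into a search direction. It is generic background rather than the paper's progressive-batching algorithm; the paper's contribution concerns how the gradients fed to a recursion like this are sampled and batched.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion.

    grad    -- current (possibly mini-batch) gradient estimate
    s_list  -- recent iterate differences   s_k = x_{k+1} - x_k
    y_list  -- recent gradient differences  y_k = g_{k+1} - g_k
    Returns the quasi-Newton search direction -H_k @ grad.
    """
    q = grad.astype(float)
    saved = []
    # First loop: newest curvature pair back to the oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        saved.append((alpha, rho, s, y))
    # Scale by the usual initial Hessian estimate gamma = s'y / y'y.
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)
    # Second loop: oldest pair back to the newest.
    for alpha, rho, s, y in reversed(saved):
        beta = rho * np.dot(y, q)
        q += (alpha - beta) * s
    return -q
```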

View Publication

Abstractions for Containerized Machine Learning Workloads in the Cloud

Many institutions rely on Machine Learning (ML) to meet their goals. ML workloads are computationally intensive, and as a result there is investment in accelerators such as ASICs, FPGAs, and GPUs to improve their performance. At the same time, these institutions are increasingly adopting cloud infrastructure, with containers gaining traction relative to virtual machines…

View Publication

Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top-performing models is deployment on resource-constrained inference systems -- the models (often deep networks or wide networks or…
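
As background on the distillation techniques referenced above, the sketch below implements a generic temperature-scaled knowledge-distillation loss: softened teacher targets mixed with the ordinary hard-label term. The temperature T and mixing weight alpha are placeholder values, and this is standard Hinton-style distillation rather than Apprentice's specific low-precision training schemes.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)             # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Generic temperature-scaled distillation loss (placeholder T and alpha)."""
    p_teacher = softmax(teacher_logits, T)            # softened teacher targets
    p_student = softmax(student_logits, T)
    soft = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=1))
    p_hard = softmax(student_logits)                  # ordinary softmax for hard labels
    hard = -np.mean(np.log(p_hard[np.arange(len(labels)), labels] + 1e-12))
    # The T**2 factor keeps soft-target gradients on the same scale as hard ones.
    return alpha * (T ** 2) * soft + (1.0 - alpha) * hard

# Toy example: batch of 2 examples, 5 classes.
rng = np.random.default_rng(0)
print(distillation_loss(rng.standard_normal((2, 5)),
                        rng.standard_normal((2, 5)),
                        labels=np.array([1, 3])))
```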

View Publication

WRPN: Wide Reduced-Precision Networks

For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks. Activation maps, however, occupy a large memory footprint during both the training and inference steps when using mini-batches of inputs. One way to reduce this large memory footprint is to reduce the…
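
To make "reducing numeric precision" concrete, here is a small sketch of the uniform k-bit quantizer commonly used in reduced-precision work: clip a value into [0, 1], then snap it to one of 2^k - 1 evenly spaced levels. The bit-width is a placeholder, and this is one generic quantizer rather than the exact scheme used in WRPN.

```python
import numpy as np

def quantize_activations(x, k=4):
    """Uniform k-bit quantization into [0, 1]; k is a placeholder bit-width."""
    x = np.clip(x, 0.0, 1.0)               # bound the activations first
    levels = 2 ** k - 1                     # number of representable steps
    return np.round(x * levels) / levels    # snap to the nearest level

a = np.array([0.03, 0.26, 0.52, 0.98, 1.40])
print(quantize_activations(a, k=4))         # every value becomes a multiple of 1/15
```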

View Publication

Mixed Precision Training of Convolutional Neural Networks Using Integer Operations

The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low-precision floating point operations, and in particular, FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand, while a lot of research has also happened in the domain of low- and mixed-precision integer training, these works either present results for…
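
The "accumulate in a wider type than you multiply in" pattern described above carries over to integers. The sketch below shows it for a toy dot product: operands quantized into int16, products summed in an int32 accumulator, then rescaled back to floating point. The per-tensor scales and the 12-bit quantization range are illustrative choices, not the paper's dynamic fixed-point recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
x_fp = rng.standard_normal(64).astype(np.float32)
w_fp = rng.standard_normal(64).astype(np.float32)

# Quantize both operands with simple per-tensor scales (hypothetical scheme).
# A 12-bit range keeps this toy int32 accumulator comfortably clear of overflow.
x_scale = np.abs(x_fp).max() / 2047.0
w_scale = np.abs(w_fp).max() / 2047.0
x_q = np.round(x_fp / x_scale).astype(np.int16)
w_q = np.round(w_fp / w_scale).astype(np.int16)

# Narrow multiplies, wide accumulation: the integer analogue of
# FP16 products accumulating into an FP32 register.
acc = np.dot(x_q.astype(np.int32), w_q.astype(np.int32))   # int32 accumulator

# Rescale the accumulator back to floating point and compare.
y = acc * (x_scale * w_scale)
print(y, np.dot(x_fp, w_fp))   # the two dot products should be close
```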

View Publication