In recent years, batch-normalization has become a common component of deep networks, enabling faster training and higher performance across a wide variety of applications. However, the reasons for these benefits are not well understood, and several shortcomings have hindered the use of batch-normalization for certain tasks.
In a paper co-authored with Elad Hoffer, Itay Golan, and Daniel Soudry of the Technion (Israel Institute of Technology), we offer a novel view of normalization methods and weight decay as tools to decouple the weights' norm from the underlying optimized objective. Additionally, we improve the use of weight-normalization and show the connection between practices such as normalization, weight decay, and learning-rate adjustments. Finally, we suggest alternatives to the widely used L2 batch-normalization and show that normalization in L1 and L∞ spaces can substantially improve numerical stability in low-precision implementations while also providing computational and memory-use benefits. Together, these findings have many implications for increasing training performance while maintaining high accuracy, especially for lower-precision workloads. We have been invited to present this research as a Spotlight Paper and Poster Session at the 2018 Conference on Neural Information Processing Systems (NeurIPS).
Batch-normalization, despite its merits, suffers from several issues, as pointed out by previous work. These issues are not yet solved by current normalization methods.
Numerical precision. Though interest in low-precision training continues to increase, current normalization methods are notably ill-suited to low precision because they rely on L2 normalization, which involves sums of squares and square-root operations that require high precision. Using norm spaces other than L2 can alleviate these problems, as we demonstrate in the paper.
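To make the contrast concrete, here is a minimal NumPy sketch of an L1-based normalization of the kind the paper advocates (illustrative code, not the paper's implementation): the variance estimate is replaced by a scaled mean absolute deviation, which needs no squares or square roots over the data. The sqrt(π/2) factor calibrates the statistic to match the standard deviation when the inputs are roughly Gaussian.

```python
import numpy as np

def l1_batch_norm(x, eps=1e-5):
    """Normalize each feature of a batch using an L1 statistic.

    The mean absolute deviation, scaled by sqrt(pi/2), estimates the
    standard deviation under a Gaussian assumption, while avoiding the
    sums of squares and square roots that demand high precision.
    """
    mu = x.mean(axis=0)
    mad = np.abs(x - mu).mean(axis=0)          # mean absolute deviation
    scale = np.sqrt(np.pi / 2.0) * mad + eps   # std estimate, no squaring
    return (x - mu) / scale
```

Because the per-element work is just a subtraction and an absolute value, this statistic is far friendlier to 16-bit or 8-bit accumulation than a sum of squares.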
Computational costs. The computational overhead of batch-normalization is significant. Previous analysis has found batch-normalization to account for up to 24% of the computation time needed for an entire model. Further, it can require as much as twice the memory of a non-batch-normalization network during the training phase. Methods like weight-normalization have smaller computational costs, but can result in lower accuracy on large-scale tasks.
Interplay with other regularization mechanisms. Other regularization mechanisms are typically used in conjunction with batch-normalization. Though earlier studies have shown that explicit regularization, such as weight decay, can improve generalization performance, it is not clear how weight decay interacts with batch-normalization, or whether weight decay is even necessary, given that batch-normalization already constrains the output norms.
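One way to see why the interaction is subtle: with batch-normalization, a layer's output is invariant to the scale of its weights, so weight decay cannot shape the function directly; it mainly shrinks the weights' norm, which in turn changes the effective step size of gradient updates. A small NumPy sketch (hypothetical names, no learned scale or shift) demonstrates the invariance:

```python
import numpy as np

def batch_norm(z, eps=1e-8):
    # Standard per-feature batch normalization (no learned gamma/beta).
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

rng = np.random.default_rng(42)
x = rng.normal(size=(128, 16))   # a batch of inputs
w = rng.normal(size=(16, 8))     # layer weights

out = batch_norm(x @ w)
out_scaled = batch_norm(x @ (10.0 * w))  # scale the weights 10x

# Scaling w scales the pre-activations, their mean, and their std by the
# same factor, which cancels in the normalization, so the outputs match.
assert np.allclose(out, out_scaled, atol=1e-4)
```

Since multiplying the weights by any constant leaves the normalized output unchanged, the only lasting effect of an L2 penalty on these weights is on their norm, and through it on the effective learning rate, which is exactly the decoupling the paper analyzes.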
Task-specific limitations. A key assumption in batch-normalization is independence between samples appearing in each batch. While this assumption seems to hold for most convolutional networks used to classify images in conventional datasets, it falls short in domains with strong correlations between samples, such as time-series prediction, reinforcement learning, and generative modeling. For example, weight-normalization and layer-normalization were devised to address the finding that batch-normalization required modification for use with recurrent networks.
Our paper makes the following contributions:

- A novel view of normalization methods and weight decay as tools to decouple the weights' norm from the underlying optimized objective.
- An improved use of weight-normalization, together with an analysis of the connection between normalization, weight decay, and learning-rate adjustments.
- Alternatives to the widely used L2 batch-normalization, showing that normalization in L1 and L∞ spaces can substantially improve numerical stability in low-precision implementations while providing computational and memory-use benefits.
We look forward to discussing these findings with our peers and colleagues at the 2018 Conference on Neural Information Processing Systems. In subsequent work, we extend these results by suggesting an even more numerically stable batch normalization, called range batch-norm, in which only the largest and smallest input values need to be computed. This makes the batch-norm calculation highly tolerant of low-precision hardware, since the max() and min() operations do not degrade accuracy.
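The idea behind range batch-norm can be sketched as follows (an illustrative NumPy version; the calibration constant below is a rough Gaussian extreme-value approximation, not the exact constant derived in the follow-up paper). For roughly Gaussian inputs, the range of n samples grows like sigma·sqrt(2·ln n), so a scaled range can stand in for the standard deviation:

```python
import numpy as np

def range_batch_norm(x, eps=1e-5):
    """Normalize each feature using only the batch max and min.

    For n roughly Gaussian samples, max - min is approximately
    2 * sigma * sqrt(2 * ln(n)), so dividing the range by that factor
    yields a rough standard-deviation estimate that requires no
    squaring or square roots over the data.
    """
    n = x.shape[0]
    mu = x.mean(axis=0)
    centered = x - mu
    spread = centered.max(axis=0) - centered.min(axis=0)
    scale = spread / (2.0 * np.sqrt(2.0 * np.log(n))) + eps
    return centered / scale
```

The appeal for low-precision hardware is that max() and min() are exact at any precision: unlike a sum of squares, they cannot accumulate rounding error, so the normalization statistic stays well-behaved even in 8-bit arithmetic.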
For more on this research, please review our paper, “Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks,” look for us at the 2018 NeurIPS conference, and stay tuned to https://ai.intel.com and @IntelAIDev on Twitter.