Improving the Efficiency and Accuracy of Normalization Schemes in Deep Networks

In recent years, batch-normalization has been widely used in deep networks, enabling faster training and higher performance in a wide variety of applications. However, the reasons for these benefits have not been well understood, and several shortcomings have hindered the use of batch-normalization for certain tasks.

In a paper co-authored with Elad Hoffer, Itay Golan, and Daniel Soudry of the Technion - Israel Institute of Technology, we offer a novel view of normalization methods and weight decay as tools to decouple the weights' norm from the underlying optimized objective. Additionally, we improve the use of weight-normalization and show the connection between practices such as normalization, weight decay, and learning-rate adjustments. Finally, we suggest alternatives to the widely used L2 batch-normalization and show that normalizing in the L1 and L∞ spaces can substantially improve numerical stability in low-precision implementations while also providing computational and memory-use benefits. Together, these findings have many implications for increasing training performance while maintaining high accuracy, especially for lower-precision workloads. We have been invited to present this research as a spotlight paper and poster session at the 2018 Conference on Neural Information Processing Systems (NeurIPS).

Challenges with Current Normalization Methods

Batch-normalization, despite its merits, suffers from several issues, as pointed out by previous work[5]. These issues are not yet solved by current normalization methods.

Numerical precision. Though interest in low-precision training continues to increase[14][15], current normalization methods are notably ill-suited to low precision due to their reliance on L2 normalization, which involves several operations requiring high precision. Using norm spaces other than L2 can alleviate these problems, as we demonstrate in the paper.
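To make the precision argument concrete, here is a minimal NumPy sketch (our own illustration, not the paper's implementation) contrasting standard L2 batch-normalization with an L1 variant that replaces the sum of squares and square root with absolute values. The sqrt(pi/2) rescaling assumes roughly Gaussian activations.

```python
import numpy as np

def batch_norm_l2(x, eps=1e-5):
    # Standard batch norm: squaring and a square root, both of which
    # need wide accumulators in low-precision hardware.
    mu = x.mean(axis=0)
    var = ((x - mu) ** 2).mean(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_l1(x, eps=1e-5):
    # L1 variant: mean absolute deviation instead of std; only adds,
    # absolute values, and one division. sqrt(pi/2) rescales the L1
    # statistic so it matches the std when activations are Gaussian.
    mu = x.mean(axis=0)
    mad = np.abs(x - mu).mean(axis=0)
    return (x - mu) / (np.sqrt(np.pi / 2) * mad + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 8))
# For Gaussian inputs the two normalizations nearly coincide.
print(np.abs(batch_norm_l1(x) - batch_norm_l2(x)).mean())
```

For Gaussian inputs the two outputs agree closely, while the L1 version avoids the operations that are hardest to carry out in reduced precision.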

Computational costs. The computational overhead of batch-normalization is significant. Previous analysis has found batch-normalization to constitute up to 24% of the computation time needed for an entire model[11]. Further, it can require as much as twice the memory of a non-batch-normalized network during the training phase[12]. Methods like weight-normalization have smaller computational costs, but can result in lower accuracy when used with large-scale tasks[13].

Interplay with other regularization mechanisms. Other regularization mechanisms are typically used in conjunction with batch-normalization. Though earlier studies[6] have shown that explicit regularization, such as weight decay, can improve generalization performance, it is not clear how weight decay interacts with batch-normalization, or whether weight decay is necessary at all, as batch-normalization already constrains the output norms[7].
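One way to see the interaction is through the scale invariance that normalization induces: if a layer's output is normalized, scaling its weights by a factor leaves the loss unchanged, so the gradient shrinks by the same factor and the effective step size behaves like the learning rate divided by the squared weight norm. A small finite-difference sketch (our own toy normalized linear layer, not from the paper) demonstrates this:

```python
import numpy as np

def normalized_layer_loss(w, x, y):
    # A linear layer followed by normalization of its output:
    # scaling w leaves the loss unchanged (scale invariance).
    z = x @ w
    z_hat = (z - z.mean()) / (z.std() + 1e-12)
    return ((z_hat - y) ** 2).mean()

def grad(f, w, eps=1e-6):
    # Central finite differences; sufficient for a sanity check.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
x, y, w = rng.normal(size=(32, 5)), rng.normal(size=32), rng.normal(size=5)
f = lambda w_: normalized_layer_loss(w_, x, y)
g1, g2 = grad(f, w), grad(f, 2.0 * w)
# Doubling ||w|| halves the gradient, so a fixed learning rate takes
# relatively smaller steps: the effective step scales as lr / ||w||^2.
print(np.linalg.norm(g2) / np.linalg.norm(g1))  # ~0.5
```

This is why weight decay, which shrinks the norm, indirectly increases the effective learning rate of a normalized layer rather than regularizing the function the layer computes.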

Task-specific limitations. A key assumption in batch-normalization is independence between samples appearing in each batch. While this assumption seems to hold for most convolutional networks used to classify images in conventional datasets, it falls short in domains with strong correlations between samples, such as time-series prediction, reinforcement learning, and generative modeling. For example, weight-normalization[8] and layer-normalization[9] were devised to address the finding[10] that batch-normalization required modification for use with recurrent networks.

Improving Batch-Normalization

Our paper makes the following contributions:

  • We show that we can replace the standard L2 batch-normalization with L1 and L∞ variations without reduced accuracy on CIFAR* or ImageNet*. This improves the suitability of batch-normalization for hardware implementations of low-precision neural networks.
  • We suggest that it is redundant to use weight decay before normalization. We demonstrate that the effect of weight decay on the learning dynamics can be mimicked by adjusting the learning rate or normalization method.
  • We show that by bounding the norm in the weight-normalization scheme, we can significantly improve its performance in convolutional neural networks trained on ImageNet* and in long short-term memory (LSTM) networks trained on WMT14 de-en*. This method can alleviate several of batch-normalization’s task-specific limitations while also reducing compute and memory costs.
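The norm-bounding idea in the last point can be sketched as a projection step after each update. This is a simplified stand-in for the scheme in the paper; `project_to_norm` and the fixed norm `rho` are our own illustrative names.

```python
import numpy as np

def project_to_norm(w, rho):
    # Bounded weight normalization (a sketch of the idea, not the
    # paper's exact scheme): after each update, rescale the weight
    # vector back to a fixed norm rho, so optimization adjusts only
    # the weight direction while the magnitude stays bounded.
    return rho * w / (np.linalg.norm(w) + 1e-12)

rng = np.random.default_rng(2)
w = rng.normal(size=16)
rho = np.linalg.norm(w)          # keep the norm at its initial value
for _ in range(100):
    g = rng.normal(size=16)      # stand-in for a real gradient
    w = project_to_norm(w - 0.1 * g, rho)
print(abs(np.linalg.norm(w) - rho))  # ~0: the norm never drifts
```

Because the norm cannot grow or vanish, the effective learning rate discussed above stays stable throughout training.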

Advancing AI on Intel® Architecture

We look forward to discussing these findings with our peers and colleagues at the 2018 Conference on Neural Information Processing Systems. In subsequent work, we extend these results with an even more numerically stable variant, called range batch-norm, in which only the largest and smallest input values need to be computed. Because max() and min() do not lose accuracy at reduced precision, the batch-norm calculation becomes very tolerant of low-precision hardware.
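A simplified sketch of the range batch-norm idea follows. The scaling constant 1/sqrt(2·ln(n)), which relates the range of a Gaussian batch of size n to its standard deviation, is our assumption here; the follow-up work's exact constant and per-channel details may differ.

```python
import numpy as np

def range_batch_norm(x, eps=1e-5):
    # Range batch-norm sketch: estimate the scale from the min-to-max
    # range of the batch instead of the variance. max() and min() are
    # exact in any precision, so no sums of squares or square roots
    # over the batch are required.
    n = x.shape[0]
    mu = x.mean(axis=0)
    spread = x.max(axis=0) - x.min(axis=0)   # only max/min needed
    # Assumed Gaussian-based constant relating range to scale.
    scale = spread / np.sqrt(2.0 * np.log(n))
    return (x - mu) / (scale + eps)

rng = np.random.default_rng(3)
x = rng.normal(size=(4096, 8))
out = range_batch_norm(x)
print(out.mean(), out.std())  # zero-mean; scale set by the range statistic
```

The key property is that the only batch-wide reductions are a mean, a max, and a min, all of which are cheap and robust in low-precision arithmetic.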

For more on this research, please review our paper, “Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks,”  look for us at the 2018 NeurIPS conference, and stay tuned to https://ai.intel.com and @IntelAIDev on Twitter.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation  
[1] Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456, 2015.
[2] Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Ioffe, Sergey. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1942–1950, 2017.
[4] Ulyanov, Dmitry, Vedaldi, Andrea, and Lempitsky, Victor S. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016. URL http://arxiv.org/abs/1607.08022.
[5] For example:
  • Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
  • Ioffe, Sergey. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1942–1950, 2017.
  • Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava, and Govindaraju, Venu. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In International Conference on Machine Learning, pp. 1168–1176, 2016.
[6] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[7] Huang, Lei, Liu, Xianglong, Lang, Bo, and Li, Bo. Projection based weight normalization for deep neural networks. arXiv preprint arXiv:1710.02338, 2017.
[8] Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
[9] Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[10] Cooijmans, Tim, Ballas, Nicolas, Laurent, César, Gülçehre, Çağlar, and Courville, Aaron. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
[11] Gitman, Igor and Ginsburg, Boris. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR, abs/1709.08145, 2017. URL http://arxiv.org/abs/1709.08145.
[12] Rota Bulò, Samuel, Porzi, Lorenzo, and Kontschieder, Peter. In-place activated batchnorm for memory-optimized training of dnns. arXiv preprint arXiv:1712.02616, 2017.
[13] Gitman, Igor and Ginsburg, Boris. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR, abs/1709.08145, 2017. URL http://arxiv.org/abs/1709.08145.
[14] Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, ElYaniv, Ran, and Bengio, Yoshua. Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115, 2016.
[15] Venkatesh, Ganesh, Nurvitadhi, Eriko, and Marr, Debbie. Accelerating deep convolutional networks using low-precision and sparsity. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 2861–2865. IEEE, 2017.
