Stochastic Gradient Descent and its variants, referred to here collectively as SGD, have been the de facto methods for training neural networks. These methods aim to minimize a network-specific loss function *F(x)* whose lower values correspond to better-trained versions of the neural network in question. To find a minimal point x*, SGD relies solely on knowing (approximately) the gradient at the neural network weights *x _{k}* at iteration *k*.

My collaborators Raghu Bollapragada, Jorge Nocedal, and Hao-Jun Michael Shi from Northwestern University, and Dheevatsa Mudigere from Intel Labs in Bangalore, India, and I have recently adapted the higher-order L-BFGS optimization method to a stochastic setting suitable for machine learning. We derived various convergence proofs for our algorithm, and its implementation exhibits encouraging experimental results. In this blog post I introduce the work informally; the full paper can be found here.

In optimization theory and practice, higher-order methods often significantly outperform first-order methods. Textbooks such as [1] routinely present examples where a higher-order algorithm requires two orders of magnitude fewer iterations to converge than a first-order analogue. High-level descriptions of the two families follow. A first-order method tries to find a minimal point of *F(x)* by moving *x* along the negative gradient direction *−∇F(x)*, as shown in the top part of the figure; that is, it uses information on how *F(x)* changes. A higher-order method instead tries to find a zero of *F(x)*'s gradient, using information on how *∇F(x)* changes. Newton's method is the best-known higher-order method: *x* is moved along *−(∇ ^{2}F(x))^{-1}∇F(x)*, which involves applying the inverse of the Hessian matrix to the gradient of *F*.
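To make the contrast concrete, here is a minimal sketch (a toy example of my own, not from the paper) comparing the two step rules on an ill-conditioned quadratic, whose Hessian is the constant matrix A:

```python
import numpy as np

# Toy strongly convex quadratic F(x) = 0.5 x^T A x - b^T x,
# with gradient A x - b and constant Hessian A.
A = np.array([[10.0, 0.0], [0.0, 1.0]])  # deliberately ill-conditioned
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)           # exact minimizer

def grad(x):
    return A @ x - b

# First-order method: move x along the negative gradient.
x = np.zeros(2)
for _ in range(100):
    x = x - 0.05 * grad(x)
gd_error = np.linalg.norm(x - x_star)

# Newton's method: move x along -(Hessian)^{-1} gradient.
# On a quadratic, a single Newton step lands on the minimizer.
x = np.zeros(2)
for _ in range(2):
    x = x - np.linalg.solve(A, grad(x))
newton_error = np.linalg.norm(x - x_star)
```

Even on this tiny problem, 100 gradient steps leave a visible error while two Newton steps reach the minimizer to machine precision.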

BFGS is a well-known quasi-Newton method (see [1], for example) that operates as follows. At each iteration *k*, with *x _{k}* being the current iterate, it constructs the "secant" matrix *H _{k}* as a very small change (a rank-2 update, in linear algebra terminology) to *H _{k-1}* while satisfying the secant condition. If the very first matrix, *H _{0}*, is simple (e.g. a diagonal matrix), then none of the *H _{k}* needs to be known explicitly, because applying *H _{k}* to any vector can be computed recursively as long as all the "curvature pairs" *(s _{l}, y _{l})*, where *s _{l} = x _{l+1} − x _{l}* and *y _{l} = ∇F(x _{l+1}) − ∇F(x _{l})*, are stored.
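For illustration only (in practice the matrix is never formed), the rank-2 update can be written out explicitly; this is the textbook BFGS inverse-Hessian update, not code from our implementation:

```python
import numpy as np

def bfgs_update(H, s, y):
    # One BFGS rank-2 update of the inverse-Hessian approximation H
    # from a curvature pair (s, y). By construction the result
    # satisfies the secant condition: H_new @ y == s.
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

The update preserves symmetry and, when *y ^{T}s > 0*, positive definiteness, which is what makes the resulting search directions descent directions.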

The *limited memory* version, L-BFGS [2], relaxes the memory requirement by using only a fixed number (10 usually suffices) of the most recent curvature pairs. That it works even better in practice than its well-established convergence theory predicts makes L-BFGS one of the favorite methods for standard (non-machine-learning-specific) optimization problems.
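The machinery that makes this possible is the classical two-loop recursion, sketched below from the textbook description (not our implementation): it applies *H _{k}* to a vector using only the stored pairs, never forming a matrix.

```python
import numpy as np

def lbfgs_direction(grad, pairs, gamma=1.0):
    # Two-loop recursion: compute H_k @ grad from the most recent
    # curvature pairs (s, y), with H_0 = gamma * I, matrix-free.
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):          # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = gamma * q                         # apply the simple H_0
    for (s, y), a in zip(pairs, reversed(alphas)):  # oldest pair first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r += (a - beta) * s
    return r  # the search direction is -r
```

With *m* pairs in dimension *d*, each application costs only O(*md*) work and memory, versus O(*d ^{2}*) for storing a dense *H _{k}*.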

In training neural networks, however, deterministic approaches become infeasible. Typically, the function *F(x)* to be optimized is given as a sum whose number of summands equals the size of the training set in question. A stochastic approach is therefore taken: at each iteration, one deals only with a sample of these summands, whose (stochastic) gradient we compute. The following notation is common:
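As a small illustration (a toy least-squares loss of my own, not the paper's setting), the stochastic gradient over a sample of summands might be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy summands: f_i(x) = 0.5 * (a_i @ x - b_i)^2, so that
# F(x) = (1/N) * sum_i f_i(x) and grad f_i(x) = (a_i @ x - b_i) * a_i.
N, d = 1000, 5
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def full_gradient(x):
    # Deterministic gradient: uses all N summands.
    return A.T @ (A @ x - b) / N

def stochastic_gradient(x, batch_size):
    # Stochastic gradient: averages only a sample S of the summands.
    S = rng.choice(N, size=batch_size, replace=False)
    return A[S].T @ (A[S] @ x - b[S]) / batch_size
```

When the sample is the whole training set, the stochastic gradient coincides with the full one; smaller samples trade accuracy for cost.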

The set *S* is typically a stochastic batch. A high-level adaptation of the previously outlined L-BFGS algorithm is stated in the box below. To customize L-BFGS for machine learning, we adapt three components: (1) the sampling of *S _{k}*, (2) the choice of the step size *α _{k}*, and (3) the computation of the curvature pairs.
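The following is a heavily simplified, self-contained sketch of such a loop on a toy least-squares problem; the batch growth factor, memory size, and curvature threshold are illustrative choices of mine, not the paper's tuned algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem so that curvature is well behaved.
N, d = 2000, 5
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def sgrad(x, S):                        # stochastic gradient over sample S
    return A[S].T @ (A[S] @ x - b[S]) / len(S)

def two_loop(g, pairs):                 # L-BFGS two-loop recursion, H_0 = I
    q, alphas = g.copy(), []
    for s, y in reversed(pairs):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    r = q
    for (s, y), a in zip(pairs, reversed(alphas)):
        r += (a - (y @ r) / (y @ s)) * s
    return r

x, pairs, batch = np.zeros(d), [], 100
for k in range(60):
    S = rng.choice(N, size=min(batch, N), replace=False)  # (1) sample S_k
    g = sgrad(x, S)
    p = -two_loop(g, pairs)             # quasi-Newton search direction
    alpha = 1.0                         # (2) step size, initialized at 1
    x_new = x + alpha * p
    s, y = x_new - x, sgrad(x_new, S) - g  # (3) pair from the SAME sample
    if y @ s > 1e-10:                   # keep only safely positive curvature
        pairs = (pairs + [(s, y)])[-10:]  # limited memory of 10 pairs
    x, batch = x_new, int(batch * 1.2)  # progressively grow the batch
```

Once the batch has grown to the full training set, the loop reduces to deterministic L-BFGS and converges rapidly on this toy problem.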

The underlying principle is to obtain a stochastic gradient whose expected value equals the exact gradient and, moreover, such that the resulting direction *p _{k}* makes a small angle with the latter with high probability. We developed a simple estimate relating the variance of this angle to the batch size |*S _{k}*|.
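As a rough illustration of the flavor of such a test (a simplified variant written for this post; the paper's actual test and its constant θ differ in details), one can ask whether the per-example gradients in the batch agree with their average:

```python
import numpy as np

def batch_large_enough(grads_S, theta=0.9):
    # Simplified inner-product-style test. grads_S has shape (|S|, d):
    # one per-example gradient per row. The batch is deemed large enough
    # when the variance of the per-example projections onto the batch
    # gradient g is small relative to ||g||^4.
    g = grads_S.mean(axis=0)
    proj = grads_S @ g                     # per-example inner products with g
    var = proj.var(ddof=1) / len(grads_S)  # variance of the mean projection
    return var <= theta**2 * (g @ g)**2
```

If the test fails, the batch size is increased; perfectly consistent gradients always pass, while widely scattered ones demand a larger sample.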

Long-established analysis for deterministic BFGS and L-BFGS suggests initializing *α _{k}* to 1, since the unit step is eventually accepted and yields fast convergence; we retain this choice and fall back on a (stochastic) line search when the unit step is not acceptable.
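A minimal Armijo backtracking sketch that starts from the unit step, in the spirit of that analysis (the stochastic safeguards used in our paper are omitted here):

```python
import numpy as np

def backtracking(F, x, p, g, alpha=1.0, c=1e-4, tau=0.5, max_iter=30):
    # Armijo backtracking line search starting from the unit step.
    # F: objective, x: current iterate, p: search direction (descent),
    # g: gradient at x. Shrinks alpha by tau until sufficient decrease.
    fx = F(x)
    for _ in range(max_iter):
        if F(x + alpha * p) <= fx + c * alpha * (g @ p):
            return alpha
        alpha *= tau
    return alpha
```

With a well-scaled quasi-Newton direction, the unit step passes immediately most of the time, so the search costs little on average.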

Note that the curvature pair *(s _{l}, y _{l})* in the deterministic case involves the difference between two gradients. In Step 3, we need a stochastic analogue, and we compute both gradients on the same sample so that their difference reflects genuine curvature rather than sampling noise.
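A sketch of this idea on a toy least-squares loss, where evaluating both gradients on the same sample makes *y* exactly the sampled Hessian applied to *s*; the names and the choice of overlap set here are illustrative, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares summands: grad f_i(x) = (a_i @ x - b_i) * a_i.
N, d = 500, 4
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def sgrad(x, S):
    return A[S].T @ (A[S] @ x - b[S]) / len(S)

def curvature_pair(x_old, x_new, overlap):
    # Evaluate BOTH gradients on the same sample (e.g. the overlap of
    # consecutive batches), so that y = sampled_Hessian @ s exactly,
    # rather than curvature plus cross-sample noise.
    s = x_new - x_old
    y = sgrad(x_new, overlap) - sgrad(x_old, overlap)
    return s, y
```

Had the two gradients been taken on different samples, their difference would mix curvature with gradient noise and could easily violate *y ^{T}s > 0*, destabilizing the update.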

Our paper establishes several convergence results, such as linear convergence when the loss function is strongly convex, and general (slower) convergence without convexity assumptions. We apply our algorithm to a logistic regression problem on eight datasets and to several neural networks of convolutional and residual types on the MNIST and CIFAR-10 datasets. We present just a snapshot of these results here and invite our readers to read the full paper on the arXiv site.

For the logistic regression problem, we compared against the standard SGD approach and against the variance reduction method SVRG. The two figures below show test accuracy versus the number of (full) gradient evaluations, also called epochs. They show that a higher-order method is highly competitive with other training methods, a view not commonly held, as Figure 3.1 of [3] illustrates.

The figures below show that our progressive batching L-BFGS, while promising, requires more work on general neural networks. Compared with SGD and Adam, L-BFGS makes reasonable progress in reducing the test loss until it overfits. We are currently working on several ideas to mitigate overfitting, as well as on a fully parallel implementation, since a progressively increasing batch size is most amenable to massive parallelism.

References

[1] Nocedal, J. and Wright, S. *Numerical Optimization*. Springer, New York, 2nd edition, 2006.

[2] Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. *Mathematical Programming*, 45(1-3):503-528, 1989.

[3] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. *SIAM Review*, 60(2):223-311, 2018.