ASHIVNI SHEKHAWATResearch scientist at Lyft Inc.

Most practitioners of numerical optimization (or minimization) know of the perils of narrow curved valleys in the cost function (see the Rosenbrock function or one of the many other pathological cost functions). While such behavior might be considered to be “pathological” in functions of a few variables, this behavior is the norm for ML where one routinely optimizes functions of millions of variables. As a result, several methods have been developed to overcome these issues.

Suppose we have a scalar function f(w), where w is a vector. The standard first order method for minimizing this function is called gradient descent, and it predates deep learning by a long shot (even Newton knew of this method, and in fact, Newton’s method is an improvement over gradient descent method, and is widely used for function minimization when w is low dimensional). In the gradient descent method, one simply does the following iteration till convergence

where dwt is a shorthand for df/dw evaluated at wt. Note that defining vt in the above equations does not serve any purpose as of now, but we still do so because it is useful for comparison with other methods described below. The parameter alpha is called the learning rate. While gradient descent is guaranteed to converge to a local minimum for sufficiently small alpha, the issue is that convergence can be slow and the iteration can “bounce around” in narrow canyons of the objective function. The methods below have been developed to deal with these convergence issues.

### Why the Name — Gradient Descent?

Why is the above method called gradient descent? It is quite simple. The term dw is the gradient, and the parameters w are changed in the direction of the negative gradient, thus making a figurative descent along the gradient direction.

### Momentum

As one can see, as opposed to using the gradient calculated at the most recent parameter estimate, we instead use an exponential average of the gradient over the last few steps (of the order of 1/(1-beta) steps). 0.9 is a typically used value of beta. The reason this helps is that if there are some directions along which the direction of the derivative dw changes rapidly (i.e. as the estimate bounces from one side of the canyon to the other), then the averaging process should roughly nullify these rapid changes and leave the derivative along the canyon intact.

### Nesterov Momentum

The Nesterov momentum differs only slightly from the usual momentum method. The update equations for Nesterov momentum are the same as the usual momentum, except for the fact that the gradient dw is calculated at wt - alpha * vt (almost).

### RMS Prop

The basic issue that momentum and Nesterov method (or for that matter all other methods here) try to address is that the different components of the gradient vector have vastly different magnitudes. Momentum tries to address this issues by averaging (or as some folks prefer, by integrating), so that components that oscillate rapidly get averaged to zero, while the persistent components stay. Another approach is to effectively have a different learning rate in different directions, thus taking smaller steps in directions of large gradient, while taking larger steps in directions of small gradient. The RMS prop method does just this, but in a more principled manner. The equations for the RMS prop method are

In the above equations, the square, the square-root, and the division are all element-wise, and epsilon is a small number (~10^8) for numerical stability. As one can imagine, the components of S are larger for directions that have large gradients, and serve as an adaptive learning rate for each direction.

In practice, the ADAM method works fairly well for most problems. The typical value of betav is 0.9 while that of betas is 0.999.