Director of Machine Learning and Data Science at Shipt.
Most practitioners of numerical optimization (or minimization) know of the perils of narrow curved valleys in the cost function (see the Rosenbrock function or one of the many other pathological cost functions). While such behavior might be considered to be “pathological” in functions of a few variables, this behavior is the norm for ML where one routinely optimizes functions of millions of variables. As a result, several methods have been developed to overcome these issues.
Suppose we have a scalar function
w is a vector. The standard first order method for minimizing this function is called
gradient descent, and it predates deep learning by a long shot (even Newton knew of this method, and in fact, Newton’s method is an improvement over gradient descent method, and is widely used for function minimization when
w is low dimensional). In the gradient descent method, one simply does the following iteration till convergence
dwt is a shorthand for
df/dw evaluated at
wt. Note that defining
vt in the above equations does not serve any purpose as of now, but we still do so because it is useful for comparison with other methods described below. The parameter
alpha is called the learning rate. While gradient descent is guaranteed to converge to a local minimum for sufficiently small
alpha, the issue is that convergence can be slow and the iteration can “bounce around” in narrow canyons of the objective function. The methods below have been developed to deal with these convergence issues.
Why the Name — Gradient Descent?
Why is the above method called
gradient descent? It is quite simple. The term
dw is the gradient, and the parameters
w are changed in the direction of the negative gradient, thus making a figurative descent along the gradient direction.
The so-called momentum method is a minor adjustment on the gradient
As one can see, as opposed to using the gradient calculated at the most recent parameter estimate, we instead use an exponential average of the gradient over the last few steps (of the order of
1/(1-beta) steps). 0.9 is a typically used value of
beta. The reason this helps is that if there are some directions along which the direction of the derivative
dw changes rapidly (i.e. as the estimate bounces from one side of the canyon to the other), then the averaging process should roughly nullify these rapid changes and leave the derivative along the canyon intact.
The Nesterov momentum differs only slightly from the usual momentum method. The update equations for Nesterov momentum are the same as the usual momentum, except for the fact that the gradient
dw is calculated at
wt - alpha * vt (almost).
The basic issue that momentum and Nesterov method (or for that matter all other methods here) try to address is that the different components of the gradient vector have vastly different magnitudes. Momentum tries to address this issues by averaging (or as some folks prefer, by
integrating), so that components that oscillate rapidly get averaged to zero, while the persistent components stay. Another approach is to effectively have a different learning rate in different directions, thus taking smaller steps in directions of large gradient, while taking larger steps in directions of small gradient. The RMS prop method does just this, but in a more principled manner. The equations for the RMS prop method are
In the above equations, the square, the square-root, and the division are all element-wise, and
epsilon is a small number (~10^8) for numerical stability. As one can imagine, the components of
S are larger for directions that have large gradients, and serve as an adaptive learning rate for each direction.
The ADAM method attempts to get the best of both worlds by combining the momentum and RMS prop methods above. The update equations are
In practice, the ADAM method works fairly well for most problems. The typical value of
betav is 0.9 while that of
betas is 0.999.