COMP5329 Deep Learning Week 3 – Optimization for Training Deep Models
1. Optimizer
1.1. Momentum. Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction γ of the update vector of the past time step to the current update vector:
(1) vt = γvt−1 + η∇θJ(θ),
(2) θ = θ − vt,
where the momentum term is usually set to 0.9 or a similar value.
The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
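To make the update concrete, a minimal NumPy sketch of a single Momentum step could look as follows; the callable grad (standing in for ∇θJ(θ)) and the default hyperparameter values are illustrative placeholders, not part of the lecture material.

import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # Eq. (1): accumulate a decaying sum of past gradients.
    v = gamma * v + lr * grad(theta)
    # Eq. (2): move the parameters against the accumulated direction.
    theta = theta - v
    return theta, v

Starting from v = np.zeros_like(theta), repeated calls reproduce the behaviour described above: components whose gradients keep the same sign build up speed, while oscillating components largely cancel.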
1.2. Nesterov. We know that we will use our momentum term γvt−1 to move the parameters θ. Computing θ − γvt−1 thus gives us an approximation of the next position of the parameters. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters θ but w.r.t. the approximate future position of our parameters:
(3) vt = γvt−1 + η∇θJ(θ − γvt−1),
(4) θ = θ − vt.
Momentum first computes the current gradient (J(θ)), and then takes a big jump in the direction of the updated accumulated gradient (θ = θ − vt). Nesterov accelerated gradient (NAG) first makes a big jump in the direction of the previous accumulated gradient (θ−γvt−1), measures the gradient (J(θ − γvt−1)) and then makes a correction (θ = θ − vt) which results in the complete NAG update.
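The only change relative to the Momentum sketch above is where the gradient is evaluated: at the look-ahead point θ − γvt−1 rather than at θ. Again, grad and the default values are placeholders.

import numpy as np

def nag_step(theta, v, grad, lr=0.01, gamma=0.9):
    # Eq. (3): evaluate the gradient at the approximate future position theta - gamma*v.
    v = gamma * v + lr * grad(theta - gamma * v)
    # Eq. (4): apply the corrected update.
    theta = theta - v
    return theta, v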
1.3. Adagrad. Previously, we performed an update for all parameters θ at once, as every parameter θi used the same learning rate η. Adagrad uses a different learning rate for every parameter θi at every time step t. For brevity, we use gt to denote the gradient at time step t. gt,i is then the partial derivative of the objective function w.r.t. the parameter θi at time step t:
(5) gt,i = ∇θJ(θt,i).
Adagrad sets the learning rate η at each time step t for every parameter θi based on the past gradients that have been computed for θi:
(6) θt+1,i = θt,i − η/√(Gt,ii + ε) · gt,i,
where Gt ∈ Rd×d is a diagonal matrix. Each diagonal element in Gt is the sum of the squares of the gradients w.r.t. θi up to time step t. ε is a smoothing term that avoids division by zero (e.g., 10−8).
As Gt contains the sum of the squares of the past gradients w.r.t. all parameters θ along its diagonal, we can now vectorize our implementation by performing an element-wise matrix-vector product ⊙ between Gt and gt:
(7) θt+1 = θt − η/√(Gt + ε) ⊙ gt,
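Since Gt is diagonal, an implementation only needs to store its diagonal, i.e. a running sum of squared gradients with the same shape as θ. A minimal sketch of Eq. (7), with grad and the default values again as placeholders:

import numpy as np

def adagrad_step(theta, G_diag, grad, lr=0.01, eps=1e-8):
    g = grad(theta)
    # Diagonal of G_t: running sum of squared gradients, one entry per parameter.
    G_diag = G_diag + g ** 2
    # Eq. (7): per-parameter step size eta / sqrt(G_t + eps).
    theta = theta - lr / np.sqrt(G_diag + eps) * g
    return theta, G_diag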
One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that. Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added
term is positive, the accumulated sum keeps growing during training. This in turn causes the
learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
1.4. Adadelta. Adadelta is an extension of Adagrad that seeks to reduce its aggressive, mono- tonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size ω.
Instead of inefficiently storing ω previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average E[g²]t at time step t then depends (as a fraction γ, similarly to the Momentum term) only on the previous average and the current gradient:
(8) E[g²]t = γE[g²]t−1 + (1 − γ)g²t,
where γ can be set as a similar value as the momentum term, around 0.9.
Based on Eq. (7), we simply replace the diagonal matrix Gt with the decaying average over past squared gradients E[g²]t, and get
(9) θt+1 = θt − η/√(E[g²]t + ε) · gt.
Since the denominator is just the root mean square (RMS) of the gradient, we can rewrite the update as
(10) θt+1 = θt − η/RMS[g]t · gt.
Note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, we define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:
(11) E[∆θ²]t = γE[∆θ²]t−1 + (1 − γ)∆θ²t,
where
(12) ∆θt = − η/RMS[g]t · gt.
The root mean squared error of parameter updates is thus:
(13) RMS[∆θ]t = √(E[∆θ²]t + ε).
Since RMS[∆θ]t is unknown, we approximate it with the RMS of parameter updates until the previous time step. Replacing the learning rate η in the previous update rule with RMS[∆θ]t−1 finally yields the Adadelta update rule:
(14) ∆θt = − RMS[∆θ]t−1/RMS[g]t · gt,
(15) θt+1 = θt + ∆θt.
With Adadelta, we do not even need to set a default learning rate, as it has been eliminated
from the update rule.
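Putting Eqs. (8), (11), (14) and (15) together, a minimal sketch of one Adadelta step could look as follows (grad and the default values are placeholders); note that no learning rate appears.

import numpy as np

def adadelta_step(theta, E_g2, E_dx2, grad, gamma=0.9, eps=1e-8):
    g = grad(theta)
    # Eq. (8): decaying average of squared gradients.
    E_g2 = gamma * E_g2 + (1 - gamma) * g ** 2
    # Eq. (14): scale by RMS of past updates (still at t-1) over RMS of gradients.
    dtheta = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * g
    # Eq. (11): decaying average of squared parameter updates.
    E_dx2 = gamma * E_dx2 + (1 - gamma) * dtheta ** 2
    # Eq. (15): apply the update.
    theta = theta + dtheta
    return theta, E_g2, E_dx2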
1.5. RMSprop. RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve Adagrad’s radically diminishing learning rates. RMSprop is in fact identical to the first update vector of Adadelta that we derived above:
(16) E[g²]t = γE[g²]t−1 + (1 − γ)g²t,
(17) θt+1 = θt − η/√(E[g²]t + ε) · gt.
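Equivalently, in the same sketch style as above (grad and the default values, including the learning rate of 0.001, are only illustrative):

import numpy as np

def rmsprop_step(theta, E_g2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    g = grad(theta)
    # Eq. (16): decaying average of squared gradients.
    E_g2 = gamma * E_g2 + (1 - gamma) * g ** 2
    # Eq. (17): divide the learning rate by the RMS of recent gradients.
    theta = theta - lr / np.sqrt(E_g2 + eps) * g
    return theta, E_g2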
1.6. Adam. Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients mt, similar to momentum. We compute the decaying averages of past and past squared gradients mt and vt respectively as follows:
(18) mt = β1mt−1 + (1 − β1)gt,
(19) vt = β2vt−1 + (1 − β2)g²t.
mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As mt and vt are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:
(20) mˆt = mt / (1 − β1^t),
(21) vˆt = vt / (1 − β2^t).
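For instance, with β2 = 0.999 and v0 = 0, the first step gives v1 = (1 − β2)g²1 = 0.001·g²1, a severe underestimate of the second moment of the gradient; dividing by 1 − β2^1 = 0.001 recovers vˆ1 = g²1.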
They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:
(22) θt+1 = θt − η/(√vˆt + ε) · mˆt.
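Combining Eqs. (18)–(22), a minimal sketch of one Adam step is given below; grad stands in for ∇θJ(θ), t is the 1-based time step, and the default hyperparameter values are shown only for illustration.

import numpy as np

def adam_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    # Eq. (18): decaying average of past gradients (first moment).
    m = beta1 * m + (1 - beta1) * g
    # Eq. (19): decaying average of past squared gradients (second moment).
    v = beta2 * v + (1 - beta2) * g ** 2
    # Eqs. (20)-(21): bias-corrected moment estimates (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Eq. (22): parameter update.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v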