COMP5329 Deep Learning Week 3 – Optimization for Training Deep Models

1. Optimizer

1.1. Momentum. Momentum is a method that helps accelerate SGD in the relevant direction
and dampens oscillations. It does this by adding a fraction γ of the update vector of the past
time step to the current update vector:

(1) vt = γvt−1 + η∇θJ(θ)

(2) θ = θ − vt,
where the momentum coefficient γ is usually set to 0.9 or a similar value.
The momentum term increases for dimensions whose gradients point in the same directions and
reduces updates for dimensions whose gradients change directions. As a result, we gain faster
convergence and reduced oscillation.
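
For illustration, a minimal NumPy sketch of one momentum step from Eqs. (1)-(2) could look as
follows; the function name momentum_update, the learning rate 0.01, and the toy gradient are
illustrative assumptions rather than part of these notes.

import numpy as np

def momentum_update(theta, grad, v, lr=0.01, gamma=0.9):
    # Eq. (1): v_t = gamma * v_{t-1} + eta * gradient
    v = gamma * v + lr * grad
    # Eq. (2): theta = theta - v_t
    theta = theta - v
    return theta, v

# Toy usage: parameters, velocity, and a gradient from one mini-batch.
theta, v = np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])
theta, v = momentum_update(theta, grad, v)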

1.2. Nesterov. We know that we will use our momentum term γvt−1 to move the parameters θ.
Computing θ−γvt−1 thus gives us an approximation of the next position of the parameters. We
can now effectively look ahead by calculating the gradient not w.r.t. our current parameters
θ but w.r.t. the approximate future position of our parameters:

(3) vt = γvt−1 + η∇θJ(θ − γvt−1)

(4) θ = θ − vt.
Momentum first computes the current gradient ∇θJ(θ) and then takes a big jump in the direction
of the updated accumulated gradient (θ = θ − vt). Nesterov accelerated gradient (NAG) first
makes a big jump in the direction of the previous accumulated gradient (θ − γvt−1), measures the
gradient ∇θJ(θ − γvt−1) there, and then makes a correction (θ = θ − vt), which results in the
complete NAG update.
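
As a sketch of this look-ahead in Eqs. (3)-(4), the snippet below assumes access to a gradient
function grad_fn(θ); the function name nesterov_update and the quadratic toy objective are
illustrative assumptions.

import numpy as np

def nesterov_update(theta, grad_fn, v, lr=0.01, gamma=0.9):
    # Gradient measured at the approximate future position theta - gamma * v_{t-1}
    lookahead_grad = grad_fn(theta - gamma * v)
    # Eq. (3): v_t = gamma * v_{t-1} + eta * gradient at the look-ahead point
    v = gamma * v + lr * lookahead_grad
    # Eq. (4): theta = theta - v_t
    theta = theta - v
    return theta, v

# Toy usage with J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta, v = np.array([1.0, -2.0]), np.zeros(2)
theta, v = nesterov_update(theta, lambda th: th, v)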

1.3. Adagrad. Previously, we performed an update for all parameters θ at once as every param-
eter θi used the same learning rate η. Adagrad uses a different learning rate for every parameter
θi at every time step t. For brevity, we use gt to denote the gradient at time step t. gt,i is then
the partial derivative of the objective function w.r.t. the parameter θi at time step t:

(5) gt,i = ∇θJ(θt,i).
Adagrad adapts the learning rate η at each time step t for every parameter θi, based on the past
gradients that have been computed for θi:

(6) θt+1,i = θt,i − η / √(Gt,ii + ε) · gt,i,

where Gt ∈ Rd×d is a diagonal matrix. Each diagonal element in Gt is the sum of the squares
of the gradients w.r.t. θi up to time step t. ε is a smoothing term that avoids division by zero
(e.g., 10−8).
As Gt contains the sum of the squares of the past gradients w.r.t. all parameters θ along its
diagonal, we can now vectorize our implementation by performing an element-wise matrix-vector
product ⊙ between Gt and gt:

(7) θt+1 = θt − η / √(Gt + ε) ⊙ gt.

One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning
rate. Most implementations use a default value of 0.01 and leave it at that. Adagrad’s main
weakness is its accumulation of the squared gradients in the denominator: Since every added
term is positive, the accumulated sum keeps growing during training. This in turn causes the
learning rate to shrink and eventually become infinitesimally small, at which point the algorithm
is no longer able to acquire additional knowledge.
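
A minimal sketch of the vectorized Adagrad step in Eq. (7), keeping only the diagonal of Gt as a
vector of accumulated squared gradients; the name adagrad_update and the default values are
illustrative assumptions.

import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    # Accumulate squared gradients per parameter (the diagonal of G_t)
    G = G + grad ** 2
    # Eq. (7): per-parameter step, scaled by the accumulated gradient history
    theta = theta - lr / np.sqrt(G + eps) * grad
    return theta, G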

1.4. Adadelta. Adadelta is an extension of Adagrad that seeks to reduce its aggressive, mono-
tonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta
restricts the window of accumulated past gradients to some fixed size ω.
Instead of inefficiently storing ω previous squared gradients, the sum of gradients is recursively
defined as a decaying average of all past squared gradients. The running average E[g²]t at time
step t then depends (as a fraction γ similarly to the Momentum term) only on the previous
average and the current gradient:

(8) E[g²]t = γE[g²]t−1 + (1 − γ)g²t,

where γ can be set as a similar value as the momentum term, around 0.9.
Based on Eq. (7), we simply replace the diagonal matrix Gt with the decaying average over past
squared gradients E[g²]t, and get

(9) θt+1 = θt − η / √(E[g²]t + ε) · gt.

Note that the denominator is just the root mean square (RMS) of the gradient, so we can rewrite
the update as

(10) θt+1 = θt − η / RMS[g]t · gt.

Note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match,
i.e. the update should have the same hypothetical units as the parameter. To realize this, we
define another exponentially decaying average, this time not of squared gradients but of squared
parameter updates:

(11) E[∆θ²]t = γE[∆θ²]t−1 + (1 − γ)∆θ²t,

where

(12) ∆θt = − η / RMS[g]t · gt.

The root mean squared error of parameter updates is thus:

(13) RMS[∆θ]t = √(E[∆θ²]t + ε).

Since RMS[∆θ]t is unknown, we approximate it with the RMS of parameter updates until the
previous time step. Replacing the learning rate η in the previous update rule with RMS[∆θ]t−1
finally yields the Adadelta update rule:

(14) ∆θt = − RMS[∆θ]t−1 / RMS[g]t · gt.

(15) θt+1 = θt + ∆θt.

With Adadelta, we do not even need to set a default learning rate, as it has been eliminated
from the update rule.
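
A minimal sketch of one Adadelta step combining Eqs. (8), (11), (14), and (15); the name
adadelta_update and the state variables Eg2 and Edx2 (the running averages E[g²] and E[∆θ²]) are
illustrative assumptions.

import numpy as np

def adadelta_update(theta, grad, Eg2, Edx2, gamma=0.9, eps=1e-8):
    # Eq. (8): decaying average of squared gradients
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    # Eq. (14): the ratio of RMS terms replaces the learning rate eta
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # Eq. (11): decaying average of squared parameter updates
    Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2
    # Eq. (15): apply the update
    theta = theta + delta
    return theta, Eg2, Edx2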

1.5. RMSprop. RMSprop and Adadelta were both developed independently around the same time,
stemming from the need to resolve Adagrad’s radically diminishing learning rates.
RMSprop in fact is identical to the first update vector of Adadelta that we derived above:

(16) E[g²]t = γE[g²]t−1 + (1 − γ)g²t,

(17) θt+1 = θt − η / √(E[g²]t + ε) · gt.
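
Since RMSprop is just the decaying-average step of Eqs. (16)-(17), a sketch is correspondingly
short; the name rmsprop_update and the default values (e.g., lr = 0.001) are illustrative
assumptions.

import numpy as np

def rmsprop_update(theta, grad, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    # Eq. (16): decaying average of squared gradients
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    # Eq. (17): per-parameter scaled step
    theta = theta - lr / np.sqrt(Eg2 + eps) * grad
    return theta, Eg2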

1.6. Adam. Adaptive Moment Estimation (Adam) is another method that computes adaptive
learning rates for each parameter. In addition to storing an exponentially decaying average of
past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying
average of past gradients mt, similar to momentum. We compute the decaying averages of past
and past squared gradients mt and vt respectively as follows:

(18) mt = β1mt−1 + (1− β1)gt

(19) vt = β2vt−1 + (1 − β2)g²t
mt and vt are estimates of the first moment (the mean) and the second moment (the un-
centered variance) of the gradients respectively, hence the name of the method. As mt and vt
are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero,
especially during the initial time steps, and especially when the decay rates are small. They
counteract these biases by computing bias-corrected first and second moment estimates:

(20) m̂t = mt / (1 − β1^t)

(21) v̂t = vt / (1 − β2^t)
They then use these to update the parameters just as we have seen in Adadelta and RMSprop,
which yields the Adam update rule:

(22) θt+1 = θt − η / (√v̂t + ε) · m̂t.
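
A minimal sketch of one Adam step implementing Eqs. (18)-(22); the name adam_update is an
illustrative assumption, and the defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 are the commonly used
values. Note that t is the 1-indexed time step needed for the bias correction.

import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eqs. (18)-(19): decaying averages of past and past squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Eqs. (20)-(21): bias-corrected first and second moment estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Eq. (22): the Adam parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v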
