Optimization for
Training Deep Models
Dr Chang Xu
School of Computer Science
A common training process for neural networks
1. Initialize the parameters
2. Choose an optimization algorithm
3. Repeat these steps:
1. Forward propagate an input
2. Compute the cost function
3. Compute the gradients of the cost with respect to the parameters using backpropagation
4. Update each parameter using the gradients, according to the optimization algorithm
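As a minimal PyTorch sketch of this loop (the model, data, and hyperparameters below are illustrative assumptions, not the lecture's):

```python
import torch
import torch.nn as nn

# 1. Initialize the parameters (here, PyTorch's default initializers).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# 2. Choose an optimization algorithm.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)  # toy data

# 3. Repeat:
for step in range(100):
    pred = model(x)          # 3.1 forward propagate an input
    loss = loss_fn(pred, y)  # 3.2 compute the cost function
    optimizer.zero_grad()
    loss.backward()          # 3.3 gradients via backpropagation
    optimizer.step()         # 3.4 update parameters per the optimizer
```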
Outline
1. Gradient Descent Optimization
2. Initialization
• Gradient descent
• Momentum
• Nesterov
• Adagrad
• Adadelta
• RMSprop
• Adam
• Adamax
Gradient descent

$J$ is the objective function, $\theta$ are the parameters, and $D$ is the training dataset. We want to optimize $\theta$ by:

$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)$

where $\nabla_\theta J(\theta_t)$ is the gradient of the objective function w.r.t. the parameters and $\eta$ is the learning rate.
Gradient descent
• Batch gradient descent:
  Compute the gradient over the entire training dataset.
• Stochastic gradient descent:
  Compute the gradient for each single training example.
• Mini-batch gradient descent:
  Compute the gradient over each mini-batch of training examples.
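A minimal sketch of the three variants, assuming a hypothetical `grad(theta, X, y)` function that returns $\nabla_\theta J$ evaluated on the given examples:

```python
import numpy as np

def batch_gd_step(theta, X, y, grad, lr=0.01):
    return theta - lr * grad(theta, X, y)                  # full dataset

def sgd_step(theta, X, y, grad, lr=0.01):
    i = np.random.randint(len(X))                          # one random example
    return theta - lr * grad(theta, X[i:i+1], y[i:i+1])

def minibatch_gd_step(theta, X, y, grad, lr=0.01, batch_size=32):
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    return theta - lr * grad(theta, X[idx], y[idx])        # random mini-batch
```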
Some Challenges for Gradient Descent

• Choosing a proper learning rate can be difficult.
  - A learning rate that is too small leads to painfully slow convergence.
  - A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
http://srdas.github.io/DLBook/GradientDescentTechniques.html
Some Challenges for Gradient Descent

• The same learning rate applies to all parameter updates.
  - If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but instead perform larger updates for rarely occurring features.
Some Challenges for Gradient Descent

• SGD easily gets trapped at the numerous saddle points of the loss.
  - Saddle points are points where one dimension slopes up and another slopes down. They are usually surrounded by a plateau of nearly constant error, which makes them notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
Momentum

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima.

Momentum tries to accelerate SGD in the relevant direction and dampen oscillations.

[Figure: optimization trajectories without momentum vs. with momentum]
https://www.willamette.edu/~gorr/classes/cs449/momrate.html
Momentum

$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$
$\theta = \theta - v_t$

The momentum term increases the update for dimensions whose gradients point in the same direction, and reduces updates for dimensions whose gradients change direction.

The momentum term $\gamma$ is usually set to 0.9 or a similar value.

As a result, we gain faster convergence and reduced oscillation.
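A one-step sketch of this update, assuming a hypothetical `grad(theta)` function returning $\nabla_\theta J(\theta)$:

```python
def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    """v_t = gamma * v_{t-1} + lr * g_t;  theta = theta - v_t."""
    v = gamma * v + lr * grad(theta)   # accumulate a decaying sum of gradients
    return theta - v, v
```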
Nesterov accelerated gradient (NAG)

• The standard momentum method:
  - First computes the gradient at the current location.
  - Then takes a big jump in the direction of the updated accumulated gradient.
• Nesterov accelerated gradient (often works better):
  - First make a big jump in the direction of the previous accumulated gradient.
  - Then measure the gradient where you end up and make a correction.

[Figure: blue vectors = standard momentum; brown vector = jump; red vector = correction; green vector = accumulated gradient]
https://isaacchanghau.github.io/2017/05/29/parameter-update-methods/
Nesterov accelerated gradient (NAG)

First make a big jump in the direction of the previous accumulated gradient.
Then measure the gradient where you end up.
Finally accumulate the gradient and make a correction.

$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$

This anticipatory update prevents us from going too fast.

[Figure: blue vectors = standard momentum; brown vector = jump; red vector = correction; green vector = accumulated gradient]
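The same sketch with the Nesterov look-ahead, again assuming a hypothetical `grad(theta)`:

```python
def nesterov_step(theta, v, grad, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * v         # big jump along the previous accumulated gradient
    v = gamma * v + lr * grad(lookahead)  # measure the gradient where you end up
    return theta - v, v                   # apply the corrected update
```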
Adaptive Learning Rate Methods
Motivation

• Until now we have assigned the same learning rate to all features.
• If features vary in importance and frequency, why would this be a good idea?
• It's probably not!
Motivation
[Figure: loss contours, the nice case: all features are equally important]
Motivation
[Figure: loss contours, a harder case]
Adagrad

• Original gradient descent: perform an update for all parameters using the same learning rate $\eta$:
  $\theta_{t+1} = \theta_t - \eta \, g_t$   (the same $\eta$ for all parameters)
• Adagrad: adapt the learning rate to the parameters, based on the past gradients that have been computed for $\theta_i$:
  $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$

Here $\theta_t = [\theta_{t,1}, \theta_{t,2}, \dots, \theta_{t,i}, \dots, \theta_{t,d}]$, and $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix where each diagonal element is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$.
Adagrad

• Adapt the learning rate to the parameters.
  - Perform larger updates for infrequently updated parameters, and smaller updates for frequently updated parameters.
• Well-suited for dealing with sparse data.

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$

where $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ and $g_{t,i} = \nabla_\theta J(\theta_t)_i$.
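A per-parameter sketch of Adagrad's accumulator, assuming `g` is the current gradient vector:

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.01, eps=1e-8):
    G = G + g ** 2                               # G_{t,ii}: running sum of squared gradients
    return theta - lr / np.sqrt(G + eps) * g, G  # per-parameter effective learning rate
```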
Adagrad

• The learning rate eventually becomes infinitesimally small.
  - $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ increases monotonically, so the effective step size $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ will eventually be infinitesimally small.
• Need to manually select a global learning rate $\eta$.

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$

where $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ and $g_{t,i} = \nabla_\theta J(\theta_t)_i$.
Adadelta

• In Adagrad, the denominator accumulates the squared gradients from every iteration:
  $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$
  - Fix: accumulate over a window. If we set the window size to 2, we get
  $G_{t,ii} = g_{t-1,i}^2 + g_{t,i}^2$
Adadelta – Accumulate over window

• Storing a fixed number of previous squared gradients means the accumulation cannot grow to infinity; instead it becomes a local estimate based on recent gradients.
• Adadelta implements this as an exponentially decaying average of all past squared gradients:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$
Adadelta

• Adagrad's second problem: the need for a manually selected global learning rate $\eta$.
  - Fix: correct the units of the update with a Hessian approximation.

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$
Adadelta – Correct units with Hessian approximation

• The units of the update in SGD and Momentum relate to the gradient, not to the parameter.
• Second-order methods such as Newton's method do have the correct units.

* Assume the cost function $J$ is unitless.
Adadelta – Correct units with Hessian approximation

From Newton's method,
$\Delta\theta \propto H^{-1} g \propto \frac{\partial J / \partial \theta}{\partial^2 J / \partial \theta^2} \propto \text{units of } \theta$, so $\frac{1}{\partial^2 J / \partial \theta^2} = \frac{\Delta\theta}{\partial J / \partial \theta}$.

We assume the curvature is locally smooth and approximate $\Delta\theta$ by computing the exponentially decaying RMS of the previous updates, $\mathrm{RMS}[\Delta\theta]$. This gives the Adadelta update:

$\Delta\theta_t = -\frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$
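A sketch of the full Adadelta step as described above (note there is no global learning rate):

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, g, rho=0.9, eps=1e-6):
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2                # decaying average of g^2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g  # RMS[dx]_{t-1} / RMS[g]_t * g
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2             # decaying average of dx^2
    return theta + dx, Eg2, Edx2
```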
RMSprop

• RMSprop and Adadelta were developed independently around the same time.
• RMSprop is in fact identical to the first update vector of Adadelta:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$

Hinton suggests setting $\gamma$ to 0.9, while a good default value for the learning rate $\eta$ is 0.001.
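The corresponding sketch, keeping a global learning rate unlike Adadelta:

```python
import numpy as np

def rmsprop_step(theta, Eg2, g, lr=0.001, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2   # decaying average of squared gradients
    return theta - lr / np.sqrt(Eg2 + eps) * g, Eg2
```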
Adam

Adam combines the advantages of Adagrad and RMSprop.
• Adagrad: adaptive learning rates for different parameters.
• RMSprop: exponentially decaying averages.
  - Store an exponentially decaying average of past squared gradients:
  $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

In addition, keep an exponentially decaying average of past gradients:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
Adam – bias correction

Since $m_t$ and $v_t$ are initialized as vectors of 0's, they are biased towards zero, especially during the initial time steps or when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1). Adam counteracts this with bias-corrected estimates:

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$
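A sketch of one Adam step with bias correction ($t$ starts at 1):

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g       # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias-correct the first moment
    v_hat = v / (1 - beta2 ** t)          # bias-correct the second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```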
Visualization

[Animation: optimizer trajectories on the contours of a loss surface]
• All methods run towards the negative gradient direction.
• Momentum and NAG run faster in the relevant direction.
http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualization

[Animation: optimizer behaviour around a saddle point]
A saddle point is a point where one dimension has a positive slope while another dimension has a negative slope.
• It is easier for RMSprop, Adadelta, and Adagrad to escape the saddle point.
http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Summary of 1. Gradient Descent Optimization

| Methods  | Adaptive learning rate | Consistent units | Easily handles saddle points |
|----------|------------------------|------------------|------------------------------|
| SGD      | No                     | No               | No                           |
| Momentum | No                     | No               | No                           |
| Nesterov | No                     | No               | No                           |
| Adagrad  | Yes                    | No               | Yes                          |
| Adadelta | Yes                    | Yes              | Yes                          |
| RMSprop  | Yes                    | No               | Yes                          |
| Adam     | Yes                    | No               | Yes                          |
A common training process for neural networks
1. Initialize the parameters
2. Choose an optimization algorithm
3. Repeat these steps:
1. Forward propagate an input
2. Compute the cost function
3. Compute the gradients of the cost with respect to the parameters using backpropagation
4. Update each parameter using the gradients, according to the optimization algorithm
Outline
1. Gradient Descent Optimization
2. Initialization
• Gradient descent
• Momentum
• Nesterov
• Adagrad
• Adadelta
• RMSprop
• Adam
• Adamax
The initialization step can be critical to the model’s ultimate
performance, and it requires the right method.
Initialization

• All-zero initialization
• Small random numbers
• Calibrating the variances
Initialization

All-zero initialization:
• Every neuron computes the same output.
• They will also compute the same gradients.
• They will undergo the exact same parameter updates.
• There will be no difference between any of the neurons.
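A small PyTorch sketch of this symmetry problem, using a constant initialization (all-zero is the extreme case): every hidden unit computes the same output, so the rows of the first layer's weight gradient come out identical.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)          # symmetric (constant) initialization

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

print(net[0].weight.grad)  # all three rows are identical: the units never differentiate
```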
Initialize weights with values too small or too large

Despite breaking the symmetry, initializing the weights with values that are (i) too small or (ii) too large leads, respectively, to (i) slow learning or (ii) divergence.

Assuming all the activation functions are linear (the identity function), we have
$\hat{y} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[2]} W^{[1]} x$

Case 1: A too-large initialization leads to exploding gradients.
Case 2: A too-small initialization leads to vanishing gradients.
https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear
https://pytorch.org/docs/stable/nn.init.html?highlight=kaiming_uniform#torch.nn.init.kaiming_uniform_

PyTorch's default initialization for nn.Linear draws the weights from
$\mathcal{U}\!\left(-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right)$, where $n$ is the number of input features.
How to find appropriate initialization values

1. The mean of the activations should be zero.
2. The variance of the activations should stay the same across every layer.

Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer; it should travel to the input layer without exploding or vanishing.
Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

How does Xavier initialization keep the variance the same across every layer?

Assume the activation function is tanh. The forward propagation is:
$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
$a^{[l]} = \tanh(z^{[l]})$

How should we initialize our weights such that $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$?

Early in training we are in the linear regime of tanh: values are small enough that $\tanh(z^{[l]}) \approx z^{[l]}$, meaning that
$\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$
Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

How does Xavier initialization keep the variance the same across every layer?

$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, $a^{[l]} = \tanh(z^{[l]})$, $\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$

Component-wise,
$z_k^{[l]} = \sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]}$

so
$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\!\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]}\right) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{kj}^{[l]} a_j^{[l-1]}\right)$

using the assumptions:
1. Weights are independent and identically distributed.
2. Inputs are independent and identically distributed.
3. Weights and inputs are mutually independent.
Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

How does Xavier initialization keep the variance the same across every layer?

$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{kj}^{[l]} a_j^{[l-1]}\right)$

For independent $X$ and $Y$,
$\mathrm{Var}(XY) = E[X]^2 \mathrm{Var}(Y) + \mathrm{Var}(X) E[Y]^2 + \mathrm{Var}(X)\mathrm{Var}(Y)$

so
$\mathrm{Var}\!\left(w_{kj}^{[l]} a_j^{[l-1]}\right) = E\!\left[w_{kj}^{[l]}\right]^2 \mathrm{Var}\!\left(a_j^{[l-1]}\right) + \mathrm{Var}\!\left(w_{kj}^{[l]}\right) E\!\left[a_j^{[l-1]}\right]^2 + \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right)$

Weights are initialized with zero mean, and inputs are normalized (zero mean), so the first two terms vanish and

$\mathrm{Var}\!\left(a_k^{[l]}\right) = n^{[l-1]} \, \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right)$
Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

How does Xavier initialization keep the variance the same across every layer?

$\mathrm{Var}\!\left(a_k^{[l]}\right) = n^{[l-1]} \, \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right)$

The variance is preserved when
$\mathrm{Var}\!\left(W^{[l]}\right) = \frac{1}{n^{[l-1]}}$

This expression holds for every layer of our network:
$\mathrm{Var}\!\left(a^{[L]}\right) = n^{[L-1]} \mathrm{Var}\!\left(W^{[L]}\right) \mathrm{Var}\!\left(a^{[L-1]}\right)$
$= n^{[L-1]} \mathrm{Var}\!\left(W^{[L]}\right) n^{[L-2]} \mathrm{Var}\!\left(W^{[L-1]}\right) \mathrm{Var}\!\left(a^{[L-2]}\right)$
$= \cdots = \left[\prod_{l=1}^{L} n^{[l-1]} \mathrm{Var}\!\left(W^{[l]}\right)\right] \mathrm{Var}(x)$
Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

$\mathrm{Var}\!\left(a^{[L]}\right) = \left[\prod_{l=1}^{L} n^{[l-1]} \mathrm{Var}\!\left(W^{[l]}\right)\right] \mathrm{Var}(x)$

For each factor $n^{[l-1]} \mathrm{Var}\!\left(W^{[l]}\right)$:
  < 1 → vanishing signal
  = 1 → $\mathrm{Var}\!\left(a^{[L]}\right) = \mathrm{Var}(x)$
  > 1 → exploding signal

Hence we choose $\mathrm{Var}\!\left(W^{[l]}\right) = \frac{1}{n^{[l-1]}}$
https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_
Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

We worked on the activations computed during the forward propagation and got
$\mathrm{Var}\!\left(W^{[l]}\right) = \frac{1}{n^{[l-1]}}$

The same result can be derived for the backpropagated gradients:
$\mathrm{Var}\!\left(W^{[l]}\right) = \frac{1}{n^{[l]}}$

Xavier initialization therefore initializes the weights as either
$W^{[l]} \sim \mathcal{N}\!\left(0, \frac{2}{n^{[l-1]} + n^{[l]}}\right)$   (xavier_normal)
or
$W^{[l]} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n^{[l-1]} + n^{[l]}}}, \sqrt{\frac{6}{n^{[l-1]} + n^{[l]}}}\right)$   (xavier_uniform)
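In PyTorch these correspond directly to the built-in initializers; a minimal usage sketch:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
nn.init.xavier_normal_(layer.weight)    # N(0, 2 / (fan_in + fan_out))
# or the uniform variant:
nn.init.xavier_uniform_(layer.weight)   # U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out)))
nn.init.zeros_(layer.bias)
```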
Initialization (He) -- [Kaiming He, Xiangyu Zhang, et al., 2015]

If we use ReLU as the activation function, and $n$ is the number of inputs, the derivation uses the identity:

$\mathrm{Var}(X) = E[X^2] - E[X]^2$

[Figure: the ReLU activation function]
Initialization (He) -- [Kaiming He, Xiangyu Zhang, et al., 2015]

If we use ReLU as the activation function and $n = n^{[l-1]}$ is the number of inputs, then for layer $l$ we have:
$\mathrm{Var}\!\left(z^{[l]}\right) = n \, \mathrm{Var}\!\left(w^{[l]}\right) E\!\left[\left(a^{[l-1]}\right)^2\right]$

If $w^{[l]}$ has zero mean, and $z^{[l-1]}$ has a symmetric distribution around zero, then ReLU keeps only the positive half, so:
$E\!\left[\left(a^{[l-1]}\right)^2\right] = \frac{1}{2} \mathrm{Var}\!\left(z^{[l-1]}\right)$

So we have $\mathrm{Var}\!\left(z^{[l]}\right) = \frac{1}{2} n \, \mathrm{Var}\!\left(w^{[l]}\right) \mathrm{Var}\!\left(z^{[l-1]}\right)$, which is preserved when $\mathrm{Var}\!\left(w^{[l]}\right) = \frac{2}{n}$. He initialization therefore uses either

$W^{[l]} \sim \mathcal{N}\!\left(0, \frac{2}{n}\right)$   (kaiming_normal)
or
$W^{[l]} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n}}, \sqrt{\frac{6}{n}}\right)$   (kaiming_uniform)
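The matching PyTorch initializers; a minimal usage sketch:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# fan_in mode with the ReLU gain gives Var(w) = 2 / fan_in:
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
# or the uniform variant, U(-sqrt(6/fan_in), +sqrt(6/fan_in)):
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)
```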
Thank you!