
Optimization for Training Deep Models
Dr Chang Xu
School of Computer Science
The University of Sydney

A common training process for neural networks
1. Initialize the parameters
2. Choose an optimization algorithm
3. Repeat these steps:
   1. Forward propagate an input
   2. Compute the cost function
   3. Compute the gradients of the cost with respect to the parameters using backpropagation
   4. Update each parameter using the gradients, according to the optimization algorithm

Outline
1. Gradient Descent Optimization
• Gradient descent
• Momentum
• Nesterov
• Adagrad
• Adadelta
• RMSprop
• Adam
• Adamax
2. Initialization

Gradient descent
$J(\theta)$ is the objective function, $\theta$ are the parameters, $D$ is the training dataset. We want to optimize $\theta$ by:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

where $\nabla_\theta J(\theta_t)$ is the gradient of the objective function w.r.t. the parameters and $\eta$ is the learning rate.

Gradient descent
• Batch gradient descent:
Compute the gradient for the entire training dataset.
• Stochastic gradient descent:
Compute the gradient for each training example.
• Mini-batch gradient descent:
Compute the gradient for every mini-batch of training examples.
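To make the three variants concrete, here is a minimal NumPy sketch on a toy least-squares problem; the data, function names, and hyperparameters are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy design matrix
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad(theta, Xb, yb):
    # Gradient of the mean squared error on the batch (Xb, yb).
    return Xb.T @ (Xb @ theta - yb) / len(yb)

def train(eta=0.1, epochs=20, batch_size=None):
    theta, n = np.zeros(5), len(y)
    size = n if batch_size is None else batch_size
    for _ in range(epochs):
        idx = rng.permutation(n)                    # reshuffle each epoch
        for s in range(0, n, size):
            b = idx[s:s + size]
            theta = theta - eta * grad(theta, X[b], y[b])
    return theta

theta_batch = train()                # batch GD: whole dataset per step
theta_sgd = train(batch_size=1)      # SGD: one example per step
theta_mini = train(batch_size=32)    # mini-batch GD: the usual compromise
```

Batch gradient descent takes one exact but expensive step per pass over the data; SGD takes many cheap, noisy steps; mini-batch trades off between the two.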

Some Challenges For Gradient descent
• Choosing a proper learning rate can be difficult.
– A learning rate that is too small leads to painfully slow convergence.
– A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
http://srdas.github.io/DLBook/GradientDescentTechniques.html
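The two failure modes are easy to see numerically. A minimal sketch, assuming a one-dimensional quadratic $f(x) = 5x^2$ (illustrative, not from the lecture):

```python
def gd(eta, steps=20):
    # Minimize f(x) = 5 * x^2, whose gradient is 10 * x, starting from x = 1.
    x = 1.0
    for _ in range(steps):
        x = x - eta * 10 * x
    return x

print(gd(0.001))  # too small: after 20 steps, still far from the minimum at 0
print(gd(0.05))   # reasonable: converges quickly
print(gd(0.25))   # too large: |1 - 10*eta| > 1, so the iterates diverge
```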

Some Challenges For Gradient descent
• The same learning rate applies to all parameter updates.
– If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform larger updates for rarely occurring features.

Some Challenges For Gradient descent
• SGD easily gets trapped at one of numerous saddle points.
– Saddle points are points where one dimension slopes up and another slopes down. They are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

Momentum
SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima.
[Figure: optimization trajectories without momentum vs. with momentum; source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html]

Momentum tries to accelerate SGD in the relevant direction and dampens oscillations.

Momentum
The momentum update adds a fraction $\gamma$ of the previous update vector to the current gradient step:

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta), \qquad \theta = \theta - v_t$$

The momentum term increases the update for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. The momentum coefficient $\gamma$ is usually set to 0.9 or a similar value.

As a result, we gain faster convergence and reduced oscillation.
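A minimal sketch of the update above; the `grad_fn` interface and the "ravine" objective are illustrative:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, eta=0.01, gamma=0.9, steps=200):
    # v accumulates a decaying sum of past gradients; theta moves along -v.
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(theta)
        theta = theta - v
    return theta

# Ravine-like objective f(x, y) = 0.5 * (x**2 + 50 * y**2): steep in y, shallow in x.
theta = sgd_momentum(lambda t: np.array([t[0], 50.0 * t[1]]),
                     np.array([1.0, 1.0]))
```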

Nesterov accelerated gradient (NAG)
• The standard momentum method:
– First computes the gradient at the current location.
– Then takes a big jump in the direction of the updated accumulated gradient.
• Nesterov accelerated gradient (often works better):
– First make a big jump in the direction of the previous accumulated gradient.
– Then measure the gradient where you end up and make a correction.

[Figure: blue vectors = standard momentum; brown vector = jump, red vector = correction, green vector = accumulated gradient. Source: https://isaacchanghau.github.io/2017/05/29/parameter-update-methods/]

Nesterov accelerated gradient (NAG)
1. First make a big jump in the direction of the previous accumulated gradient.
2. Then measure the gradient where you end up.
3. Finally accumulate the gradient and make a correction.

This update prevents us from going too fast. A common formulation is

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta = \theta - v_t$$

[Figure: blue vectors = standard momentum; brown vector = jump, red vector = correction, green vector = accumulated gradient.]
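A sketch of the look-ahead version, under the same illustrative setup as the momentum sketch above; the only change is where the gradient is measured:

```python
import numpy as np

def nag(grad_fn, theta, eta=0.01, gamma=0.9, steps=200):
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v              # big jump along the previous accumulated gradient
        v = gamma * v + eta * grad_fn(lookahead)   # measure the gradient there and correct
        theta = theta - v
    return theta

theta = nag(lambda t: np.array([t[0], 50.0 * t[1]]), np.array([1.0, 1.0]))
```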

Adaptive Learning Rate Methods

Motivation
• Till now, we have assigned the same learning rate to all features.
• If features vary in importance and frequency, why is this a good idea?
• It’s probably not!

Motivation
[Figure: loss contours; caption: Nice (all features are equally important)]

Motivation
[Figure: loss contours; caption: Harder]

Adagrad
• Original gradient descent:

$$\theta_{t+1,i} = \theta_{t,i} - \eta \, g_{t,i}, \qquad g_t = [g_{t,1}, g_{t,2}, \dots, g_{t,i}, \dots, g_{t,d}]$$

The same learning rate $\eta$ is used for every parameter.
• Adagrad: adapt the learning rate to the parameters, based on the past gradients that have been computed for each $\theta_i$.

$G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix where each diagonal element $G_{t,ii}$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$.

Adagrad
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}, \qquad \text{where } g_{t,i} = \nabla_\theta J(\theta_{t,i}) \text{ and } G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

• Adapts the learning rate to the parameters:
– Performs larger updates for infrequently updated parameters, and smaller updates for frequently updated parameters.
• Well-suited for dealing with sparse data.
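A minimal sketch of the per-parameter update above (the `grad_fn` interface and hyperparameters are illustrative):

```python
import numpy as np

def adagrad(grad_fn, theta, eta=0.01, eps=1e-8, steps=200):
    G = np.zeros_like(theta)        # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G = G + g ** 2              # grows monotonically
        theta = theta - eta / np.sqrt(G + eps) * g
    return theta
```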

Adagrad
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}, \qquad \text{where } g_{t,i} = \nabla_\theta J(\theta_{t,i}) \text{ and } G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

• The learning rate eventually becomes infinitesimally small:
– $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ increases monotonically, so the effective step size $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ will eventually become infinitesimally small.
• A global learning rate $\eta$ still needs to be selected manually.

Adadelta
• In Adagrad, the denominator accumulates the squared gradients from each iteration:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}, \qquad G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

– First fix: accumulate over a window. If we set the window size to 2, we get $G_{t,ii} = g_{t,i}^2 + g_{t-1,i}^2$.

Adadelta – Accumulate over window
• Storing a fixed number of previous squared gradients means the accumulation cannot grow to infinity; it instead becomes a local estimate based on recent gradients.
• Adadelta implements this as an exponentially decaying average of all past squared gradients:

$$E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1 - \gamma) \, g_t^2$$

Adadelta
• Remaining issue: a manually selected global learning rate $\eta$ is still needed in

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

– Second fix: correct the units with a Hessian approximation.

Adadelta – Correct units with Hessian approximation
• The units of the update in SGD and Momentum relate to the gradient, not to the parameter:

$$\Delta\theta \propto g \propto \frac{\partial J}{\partial \theta} \propto \frac{1}{\text{units of } \theta}$$

• Second-order methods such as Newton’s method do have the correct units:

$$\Delta\theta \propto H^{-1} g \propto \frac{\partial J / \partial \theta}{\partial^2 J / \partial \theta^2} \propto \text{units of } \theta$$

(*Assume the cost function $J$ is unitless.)

Adadelta – Correct units with Hessian approximation
.’#* = ∆!
We assume the curvature is locally smooth and approximate computing the exponentially decaying RMS of .
by
.1(∝ 1 ∝∆”∝∆” 1+, 1, )
1″+ 1″
The University of Sydney
Page 24
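Putting the two fixes together (decaying average of squared gradients, RMS-ratio step) gives the sketch below; `gamma` and `eps` are typical choices, not values prescribed by the lecture:

```python
import numpy as np

def adadelta(grad_fn, theta, gamma=0.9, eps=1e-6, steps=200):
    Eg2 = np.zeros_like(theta)      # decaying average of squared gradients
    Edx2 = np.zeros_like(theta)     # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # RMS ratio: no global learning rate
        Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2
        theta = theta + dx
    return theta
```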

RMSprop
• RMSprop and Adadelta were developed independently around the same time.
• RMSprop is in fact identical to the first update rule of Adadelta:

$$E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1 - \gamma) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

Hinton suggests setting $\gamma$ to 0.9, while a good default value for the learning rate $\eta$ is 0.001.
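A sketch with Hinton’s suggested defaults; note it is the Adadelta sketch above without the RMS-of-updates numerator, so a global learning rate reappears:

```python
import numpy as np

def rmsprop(grad_fn, theta, eta=0.001, gamma=0.9, eps=1e-8, steps=200):
    Eg2 = np.zeros_like(theta)      # decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        theta = theta - eta / np.sqrt(Eg2 + eps) * g
    return theta
```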

Adam
Adam combines the advantages of Adagrad and RMSprop:
• Adagrad: adaptive learning rates for different parameters.
• RMSprop: exponentially decaying averages.
– Store an exponentially decaying average of past squared gradients: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$.
– Keep an exponentially decaying average of past gradients: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$.

Adam – bias correction
Since $m_t$ and $v_t$ are initialized as vectors of 0’s, they are biased towards zero, especially during the initial time steps or when the decay rates are small (i.e. $\beta_1$, $\beta_2$ close to 1). Adam therefore uses bias-corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
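A minimal sketch combining the two decaying averages with the bias correction (defaults are the values suggested in the Adam paper):

```python
import numpy as np

def adam(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(theta)        # decaying average of gradients
    v = np.zeros_like(theta)        # decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)    # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```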

Visualization
• All methods run towards the negative gradient direction.
• Momentum and NAG run faster in the relevant direction.

[Figure: optimizer trajectories on the contours of a loss surface; source: http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html]

Visualization
Saddle point: a point where the surface slopes up in one dimension and down in another.
• It is easier for RMSprop, Adadelta, and Adagrad to escape the saddle point.

[Figure: optimizer behaviour around a saddle point; source: http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html]

1. Gradient Descent Optimization

Method    | Adaptive learning rate | Consistent units | Easily handles saddle points
----------|------------------------|------------------|-----------------------------
SGD       | No                     | No               | No
Momentum  | No                     | No               | No
Nesterov  | No                     | No               | No
Adagrad   | Yes                    | No               | Yes
Adadelta  | Yes                    | Yes              | Yes
RMSprop   | Yes                    | No               | Yes
Adam      | Yes                    | No               | Yes

A common training process for neural networks
1. Initialize the parameters
2. Choose an optimization algorithm
3. Repeat these steps:
   1. Forward propagate an input
   2. Compute the cost function
   3. Compute the gradients of the cost with respect to the parameters using backpropagation
   4. Update each parameter using the gradients, according to the optimization algorithm

Outline
1. Gradient Descent Optimization
• Gradient descent
• Momentum
• Nesterov
• Adagrad
• Adadelta
• RMSprop
• Adam
• Adamax
2. Initialization
The initialization step can be critical to the model’s ultimate performance, and it requires the right method.

Initialization
• All-zero initialization
• Small random numbers
• Calibrating the variances

Initialization
All-zero initialization:
Every neuron computes the same output, so all neurons also compute the same gradients and undergo exactly the same parameter updates. There will be no difference between the neurons: the symmetry is never broken.
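A small illustrative sketch of the symmetry problem. With exactly zero weights every gradient below is zero as well; the sketch uses a constant non-zero value to show that any identical initialization keeps all hidden units identical (shapes and the constant are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 examples, 3 features
W1 = np.full((3, 5), 0.5)            # every first-layer weight identical
W2 = np.full((5, 1), 0.5)            # every second-layer weight identical

h = np.tanh(x @ W1)                  # all 5 hidden units compute the same output
dy = h @ W2 - 1.0                    # gradient of a squared error vs. target 1
dW1 = x.T @ ((dy @ W2.T) * (1 - h ** 2))

print(np.allclose(h, h[:, :1]))      # True: identical activations
print(np.allclose(dW1, dW1[:, :1]))  # True: identical gradients, symmetry never breaks
```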

Initialize weights with values too small or too large
Despite breaking the symmetry, initializing the weights with values that are (i) too small or (ii) too large leads, respectively, to (i) slow learning or (ii) divergence.

Assuming all the activation functions are linear (the identity function), we have

$$\hat{y} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[3]} W^{[2]} W^{[1]} x$$

Case 1: a too-large initialization leads to exploding gradients.
Case 2: a too-small initialization leads to vanishing gradients.
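A quick numerical illustration of the two cases, pushing an input through a stack of random linear layers (depth, width, and scales are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def output_norm(scale, depth=50, width=100):
    # Norm of x after `depth` layers with weight entries of standard deviation `scale`.
    x = rng.normal(size=width)
    for _ in range(depth):
        x = (scale * rng.normal(size=(width, width))) @ x
    return np.linalg.norm(x)

print(output_norm(0.05))               # too small: signal shrinks towards 0
print(output_norm(0.20))               # too large: signal blows up
print(output_norm(1 / np.sqrt(100)))   # ~1/sqrt(n): signal stays moderate
```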

PyTorch’s default initialization for `nn.Linear` can be seen in its source:
https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear

`torch.nn.init.kaiming_uniform_` (with the ReLU gain) samples weights from

$$U\!\left(-\sqrt{\frac{6}{n}},\; \sqrt{\frac{6}{n}}\right)$$

https://pytorch.org/docs/stable/nn.init.html?highlight=kaiming_uniform#torch.nn.init.kaiming_uniform_

How to find appropriate initialization values
1. The mean of the activations should be zero.
2. The variance of the activations should stay the same across every layer.

Under these two conditions, the backpropagated gradient signal is not multiplied by values that are too small or too large in any layer, so it can travel all the way to the input layer without exploding or vanishing.

Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]
How does Xavier initialization keep the variance the same across every layer?

Assume the activation function is tanh. The forward propagation is:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = \tanh(z^{[l]})$$

How should we initialize the weights such that $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$?

Early on in training, we are in the linear regime of tanh: values are small enough that $\tanh(z^{[l]}) \approx z^{[l]}$, meaning that $\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$.

Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]
How does Xavier initialization keep the variance the same across every layer?

Writing the pre-activation element-wise,

$$z_j^{[l]} = \sum_{k=1}^{n^{[l-1]}} w_{jk}^{[l]} a_k^{[l-1]}, \qquad a^{[l]} = \tanh(z^{[l]})$$

$$\mathrm{Var}(a_j^{[l]}) = \mathrm{Var}(z_j^{[l]}) = \mathrm{Var}\!\left(\sum_{k=1}^{n^{[l-1]}} w_{jk}^{[l]} a_k^{[l-1]}\right) = \sum_{k=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{jk}^{[l]} a_k^{[l-1]}\right)$$

Assumptions:
1. Weights are independent and identically distributed.
2. Inputs are independent and identically distributed.
3. Weights and inputs are mutually independent.

Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]
How does Xavier initialization keep the variance the same across every layer?

For independent $X$ and $Y$, $\mathrm{Var}(XY) = E[X]^2 \, \mathrm{Var}(Y) + \mathrm{Var}(X) \, E[Y]^2 + \mathrm{Var}(X) \, \mathrm{Var}(Y)$. Applying this to each term:

$$\mathrm{Var}(a_j^{[l]}) = \sum_{k=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{jk}^{[l]} a_k^{[l-1]}\right) = \sum_{k=1}^{n^{[l-1]}} \left[ E\!\left[w_{jk}^{[l]}\right]^2 \mathrm{Var}\!\left(a_k^{[l-1]}\right) + \mathrm{Var}\!\left(w_{jk}^{[l]}\right) E\!\left[a_k^{[l-1]}\right]^2 + \mathrm{Var}\!\left(w_{jk}^{[l]}\right) \mathrm{Var}\!\left(a_k^{[l-1]}\right) \right]$$

Weights are initialized with zero mean, and inputs are normalized, so the first two terms vanish and

$$\mathrm{Var}(a^{[l]}) = n^{[l-1]} \, \mathrm{Var}(w^{[l]}) \, \mathrm{Var}(a^{[l-1]})$$

Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]
How does Xavier initialization keep the variance the same across every layer?

$$\mathrm{Var}(a^{[l]}) = n^{[l-1]} \, \mathrm{Var}(w^{[l]}) \, \mathrm{Var}(a^{[l-1]})$$

so choosing $\mathrm{Var}(w^{[l]}) = \dfrac{1}{n^{[l-1]}}$ keeps the variance constant. This expression holds for every layer of our network:

$$\mathrm{Var}(a^{[L]}) = \left[\prod_{l=1}^{L} n^{[l-1]} \, \mathrm{Var}(w^{[l]})\right] \mathrm{Var}(x)$$

$$n^{[l-1]} \, \mathrm{Var}(w^{[l]}) \begin{cases} < 1 & \text{vanishing signal} \\ = 1 & \mathrm{Var}(a^{[L]}) = \mathrm{Var}(x) \\ > 1 & \text{exploding signal} \end{cases}$$

https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_

Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]
We worked on the activations computed during the forward propagation, and obtained

$$\mathrm{Var}(w^{[l]}) = \frac{1}{n^{[l-1]}}$$

The same derivation for the backpropagated gradients gives

$$\mathrm{Var}(w^{[l]}) = \frac{1}{n^{[l]}}$$

As a compromise between the two, Xavier initialization initializes the weights as either

xavier_normal: $\;\mathcal{N}\!\left(0, \dfrac{2}{n^{[l-1]} + n^{[l]}}\right)$

xavier_uniform: $\;U\!\left(-\sqrt{\dfrac{6}{n^{[l-1]} + n^{[l]}}},\; \sqrt{\dfrac{6}{n^{[l-1]} + n^{[l]}}}\right)$

Initialization (He) -- [Kaiming He, Xiangyu Zhang, et al. 2015]
If we use ReLU as the activation function and $n$ is the number of inputs to a layer, we repeat the variance analysis using $\mathrm{Var}(X) = E[X^2] - E[X]^2$. If $z^{[l-1]}$ has zero mean and a symmetric distribution around zero, ReLU zeroes half of the signal, so $E[(a^{[l-1]})^2] = \tfrac{1}{2}\mathrm{Var}(z^{[l-1]})$, and keeping the variance constant now requires

$$\mathrm{Var}(w^{[l]}) = \frac{2}{n}$$

So we have:

kaiming_normal: $\;\mathcal{N}\!\left(0, \dfrac{2}{n}\right)$

kaiming_uniform: $\;U\!\left(-\sqrt{\dfrac{6}{n}},\; \sqrt{\dfrac{6}{n}}\right)$
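A matching check for the ReLU case, with the same illustrative setup as the tanh check above but $\mathrm{Var}(w) = 2/n$; for ReLU the quantity that is preserved is the second moment $E[a^2]$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
a = rng.normal(size=(10000, n))

for layer in range(10):
    W = rng.normal(size=(n, n)) * np.sqrt(2.0 / n)  # He: Var(w) = 2/n
    a = np.maximum(a @ W, 0.0)                      # ReLU zeroes half the signal
    print(layer, round((a ** 2).mean(), 4))         # stays near 1 across layers
```

Thank you!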