

Optimization for
Training Deep Models

Dr Chang Xu

School of Computer Science


A common training process for neural networks

1. Initialize the parameters
2. Choose an optimization algorithm
3. Repeat these steps:
   1. Forward propagate an input
   2. Compute the cost function
   3. Compute the gradients of the cost with respect to the parameters using backpropagation
   4. Update each parameter using the gradients, according to the optimization algorithm
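A minimal runnable sketch of this loop in Python, using plain gradient descent on a toy least-squares problem (the data, model, and hyperparameters here are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])               # toy targets

w = np.zeros(3)                                  # 1. initialize the parameters
eta = 0.1                                        # 2. choose an optimizer (plain GD)
for step in range(200):                          # 3. repeat:
    y_pred = X @ w                               #    3.1 forward propagate an input
    cost = np.mean((y_pred - y) ** 2)            #    3.2 compute the cost function
    grad = 2.0 * X.T @ (y_pred - y) / len(y)     #    3.3 gradient of the cost w.r.t. w
    w = w - eta * grad                           #    3.4 update the parameters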


Outline

1. Gradient Descent Optimization
   • Gradient descent
   • Momentum
   • Nesterov
   • Adagrad
   • Adadelta
   • RMSprop
   • Adam
   • Adamax
2. Initialization


Gradient descent

$J(\theta)$ is the objective function and $\theta \in \mathbb{R}^d$ are the parameters, trained on a given training dataset. We want to optimize $J$ by:

$$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta J(\theta_t)$$

where $\nabla_\theta J(\theta_t)$ is the gradient of the objective function w.r.t. the parameters and $\eta$ is the learning rate.


Gradient descent

• Batch gradient descent:
  Compute the gradient over the entire training dataset.

• Stochastic gradient descent:
  Compute the gradient for each individual training example.

• Mini-batch gradient descent:
  Compute the gradient over each mini-batch of training examples.

The three variants differ only in how much data feeds each gradient evaluation, as sketched below.
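A sketch of the three update variants in Python on a least-squares objective (the data, batch size, and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, eta, n = np.zeros(5), 0.01, 32

def grad(Xb, yb, w):
    # gradient of the mean squared error on the batch (Xb, yb)
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = w - eta * grad(X, y, w)                    # batch: the entire dataset
i = rng.integers(len(y))
w = w - eta * grad(X[i:i+1], y[i:i+1], w)      # stochastic: a single example
j = rng.integers(len(y) - n)
w = w - eta * grad(X[j:j+n], y[j:j+n], w)      # mini-batch: n examples at a time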


Some challenges for gradient descent

• Choosing a proper learning rate can be difficult.
  – A learning rate that is too small leads to painfully slow convergence.
  – A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.

http://srdas.github.io/DLBook/GradientDescentTechniques.html


Some challenges for gradient descent

• The same learning rate applies to all parameter updates.
  – If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform a larger update for rarely occurring features.


Some challenges for gradient descent

• SGD easily gets trapped at saddle points.
  – Saddle points are points where one dimension slopes up and another slopes down. They are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.


Momentum

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima.

Momentum accelerates SGD in the relevant direction and dampens oscillations.

[Figure: trajectories without momentum (left) and with momentum (right)]

https://www.willamette.edu/~gorr/classes/cs449/momrate.html


Momentum

Momentum adds a fraction $\gamma$ of the previous update vector to the current one:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta), \qquad \theta = \theta - v_t$$

The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. The momentum term $\gamma$ is usually set to 0.9 or a similar value.

As a result, we gain faster convergence and reduced oscillation.
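A one-step sketch of this update in Python (the function name and default hyperparameters are illustrative; gamma = 0.9 as on the slide):

def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad;  theta = theta - v_t
    v = gamma * v + eta * grad
    return theta - v, v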


Nesterov accelerated gradient (NAG)

• The standard momentum method:
  – First computes the gradient at the current location.
  – Then takes a big jump in the direction of the updated accumulated gradient.

• Nesterov accelerated gradient (often works better):
  – First makes a big jump in the direction of the previous accumulated gradient.
  – Then measures the gradient where it ends up and makes a correction.

[Figure: blue vectors = standard momentum; brown vector = jump; red vector = correction; green vector = accumulated gradient]
https://isaacchanghau.github.io/2017/05/29/parameter-update-methods/

Nesterov accelerated gradient (NAG)

1. First make a big jump in the direction of the previous accumulated gradient.
2. Then measure the gradient where you end up.
3. Finally accumulate the gradient and make a correction.

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta = \theta - v_t$$

This anticipatory update prevents us from going too fast.

[Figure: blue vectors = standard momentum; brown vector = jump; red vector = correction; green vector = accumulated gradient]
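A one-step sketch in Python; grad_fn is a hypothetical callable that returns the gradient at a given point (names and defaults are illustrative):

def nag_step(theta, v, grad_fn, eta=0.01, gamma=0.9):
    lookahead = theta - gamma * v       # first make the big jump
    g = grad_fn(lookahead)              # then measure the gradient where you end up
    v = gamma * v + eta * g             # finally accumulate and correct
    return theta - v, v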


Adaptive Learning Rate Methods


Motivation

• Until now we have assigned the same learning rate to all features.

• If features vary in importance and frequency, why would this be a good idea?

• It's probably not!


Motivation

[Figure: "Nice" case, a well-conditioned loss surface where all features are equally important]


Motivation

[Figure: "Harder" case, an ill-conditioned loss surface]


Adagrad

• Adagrad adapts the learning rate to the parameters based on the past gradients that have been computed for each $\theta_i$.

• Original gradient descent: perform an update for all parameters using the same learning rate,

$$\theta_{t+1} = \theta_t - \eta \cdot g_t, \qquad g_t = [g_{t,1}, g_{t,2}, \cdots, g_{t,i}, \cdots, g_{t,d}]$$

• Adagrad: the learning rate is no longer the same for all parameters,

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

where $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix in which each diagonal element $G_{t,ii}$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$.


Adagrad

• Adapts the learning rate to the parameters.
  – Performs larger updates for infrequently updated parameters, and smaller updates for frequently updated parameters.

• Well-suited for dealing with sparse data.

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}, \qquad \text{where } G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2, \quad g_{t,i} = \left(\nabla_\theta J(\theta_t)\right)_i$$
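A per-parameter sketch in Python; G holds the diagonal of $G_t$ as a vector (function name and defaults are illustrative):

import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    G = G + grad ** 2                                # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * grad    # per-parameter step size
    return theta, G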


Adagrad

• The learning rate eventually becomes infinitesimally small.
  – Since $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ increases monotonically, the effective step size $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ will eventually become infinitesimally small.

• We still need to manually select a global learning rate $\eta$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$


Adadelta

• In Adagrad, the denominator accumulates the squared gradients from every iteration:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}, \qquad G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

  – Adadelta instead accumulates over a window. For example, with a window size of 2 we would get $G_{t,ii} = g_{t,i}^2 + g_{t-1,i}^2$ rather than the full sum.


Adadelta – Accumulate over window

• Storing a fixed number of previous squared gradients means the accumulation cannot grow to infinity; it instead becomes a local estimate using recent gradients.

• Adadelta implements this as an exponentially decaying average of all past squared gradients:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$$


Adadelta

• Adagrad (and the windowed variant above) still needs a manually selected global learning rate $\eta$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

  – Adadelta removes it by correcting the units of the update with a Hessian approximation.


Adadelta – Correct units with Hessian approximation

• The units of the update in SGD and Momentum relate to the gradient, not to the parameter:

$$\Delta\theta \propto g \propto \frac{\partial J}{\partial \theta} \propto \frac{1}{\text{units of } \theta}$$

• Second-order methods such as Newton's method do have the correct units:

$$\Delta\theta \propto H^{-1} g \propto \frac{\partial J / \partial \theta}{\partial^2 J / \partial \theta^2} \propto \text{units of } \theta$$

* Assume the cost function $J$ is unitless.


Adadelta – Correct units with Hessian approximation

We assume the curvature is locally smooth, and approximate the inverse Hessian by the ratio of the exponentially decaying RMS of the updates to that of the gradients:

$$H^{-1} \propto \frac{1}{\partial^2 J / \partial \theta^2} \propto \frac{\Delta\theta}{\partial J / \partial \theta}, \qquad \Delta\theta_t = - \frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t$$

where $\mathrm{RMS}[g]_t = \sqrt{E[g^2]_t + \epsilon}$.


RMSprop

• RMSprop and Adadelta were both developed independently around the same time.

• RMSprop is in fact identical to the first update vector of Adadelta:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

Hinton suggests $\gamma$ be set to 0.9, while a good default value for the learning rate $\eta$ is 0.001.
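A sketch in Python with Hinton's suggested defaults (function name is illustrative):

import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2        # decaying average of g^2
    return theta - eta / np.sqrt(Eg2 + eps) * grad, Eg2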


Adam

Adam combines the advantages of Adagrad and RMSprop.

• Adagrad: adaptive learning rates for different parameters.

• RMSprop: exponentially decaying average.
  – Store an exponentially decaying average of past squared gradients $v_t$:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

• In addition, keep an exponentially decaying average of past gradients $m_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$


Adam – bias correction

Since $m_t$ and $v_t$ are initialized as vectors of 0's, they are biased towards zero, especially during the initial time steps or when the decay rates are small (i.e. $\beta_1$, $\beta_2$ close to 1). Adam counteracts this bias with corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
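A sketch in Python with the defaults from the Adam paper; t is the 1-based time step needed for the bias correction (function name is illustrative):

import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v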


Visualization

[Animation: optimizer trajectories on the contours of a loss surface]

• All methods move towards the negative gradient direction.

• Momentum and NAG run faster in the relevant direction.

http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html


Visualization

[Animation: optimizer behaviour around a saddle point]

Saddle point: a point where one dimension has a positive slope, while the other dimensions have negative slopes.

• RMSprop, Adadelta, and Adagrad escape the saddle point more easily.

http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html


Methods     Adaptive learning rate   Consistent units   Easily handles saddle points
SGD         No                       No                 No
Momentum    No                       No                 No
Nesterov    No                       No                 No
Adagrad     Yes                      No                 Yes
Adadelta    Yes                      Yes                Yes
RMSprop     Yes                      No                 Yes
Adam        Yes                      No                 Yes




Outline

1. Gradient Descent Optimization
   • Gradient descent
   • Momentum
   • Nesterov
   • Adagrad
   • Adadelta
   • RMSprop
   • Adam
   • Adamax
2. Initialization

The initialization step can be critical to the model's ultimate performance, and it requires choosing the right method.


Initialization

• All-zero initialization

• Small random numbers

• Calibrating the variances


Initialization

All-zero initialization:

• Every neuron computes the same output.
• They will also compute the same gradients.
• They will undergo the exact same parameter updates.
• There will be no difference between any of the neurons, as demonstrated below.
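A small numerical demonstration of the symmetry problem (the network shape, loss, and data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                     # 4 examples, 3 features
y_true = rng.normal(size=(4, 1))
W1, W2 = np.zeros((3, 5)), np.zeros((5, 1))     # all-zero initialization

h = np.tanh(x @ W1)                             # every hidden unit computes the same output
y = h @ W2
dy = y - y_true                                 # gradient of 0.5 * ||y - y_true||^2
dW2 = h.T @ dy
dW1 = x.T @ ((dy @ W2.T) * (1 - h ** 2))        # backprop through tanh
print(np.allclose(dW1, dW1[:, :1]))             # True: every hidden unit gets the same gradient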


Initialize weights with values too small or too large

Despite breaking the symmetry, initializing the weights with values (i) too small or (ii) too large leads respectively to (i) slow learning or (ii) divergence.

Assuming all the activation functions are linear (identity function), we have

$$\hat{y} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[3]} W^{[2]} W^{[1]} x$$

Case 1: a too-large initialization leads to exploding gradients. For example, if every $W^{[l]} = a I$ with $a > 1$, the output scales as $a^{L} x$.

Case 2: a too-small initialization leads to vanishing gradients ($a < 1$ above).
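A numerical illustration of both cases with a deep linear network (depth, width, and scale factors are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
for scale in (0.5, 1.0, 1.5):
    a = x.copy()
    for _ in range(50):                                   # 50 linear layers
        W = scale * rng.normal(size=(100, 100)) / np.sqrt(100)
        a = W @ a
    print(scale, np.linalg.norm(a))                       # vanishes for 0.5, stable for 1.0, explodes for 1.5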


PyTorch's default initialization for nn.Linear draws the weights from

$$\mathcal{U}\left(-\frac{1}{\sqrt{n}}, \; \frac{1}{\sqrt{n}}\right)$$

where $n$ is the number of input features (implemented via kaiming_uniform_):

https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear
https://pytorch.org/docs/stable/nn.init.html?highlight=kaiming_uniform#torch.nn.init.kaiming_uniform_


How to find appropriate initialization values

1. The mean of the activations should be zero.
2. The variance of the activations should stay the same across every layer.

Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.


Initialization (Xavier) -- [Xavier Glorot, Yoshua Bengio, 2010]

How does Xavier initialization keep the variance the same across every layer?

Assume the activation function is tanh. The forward propagation is:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = \tanh(z^{[l]})$$

How should we initialize our weights such that $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$?

Early on in the training, we are in the linear regime of tanh: values are small enough that $\tanh(z^{[l]}) \approx z^{[l]}$, meaning that

$$\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$$


Initialization (Xavier) — [Xavier Glorot, Yoshua.Bengio 2010]

How Xavier Initialization keeps the variance the same across every
layer?

8[6] = 5[6]9[61(] + :[6]

9[6] = tanh(8[6])

A9 B 9 6 = A9B 8 6

87
[6] = *

8*(

9[%&’]

D78
[6]98

[61(]

A9 B 97
6 = A9B 87

6

= A9B *
8*(

9 %&’

D78
6 98

61(

= *
8*(

9 %&’

A9B(D78
6 98

61( )

1. Weights are independent and identically distributed
2. Inputs are independent and identically distributed
3. Weights and inputs are mutually independent


Initialization (Xavier) — [Xavier Glorot, Yoshua.Bengio 2010]

How Xavier Initialization keeps the variance the same across every
layer?

87
[6] = *

8*(

9[%&’]

D78
[6]98

[61(]

A9 B 97
6 = A9B 87

6

= A9B *
8*(

9 %&’

D78
6 98

61(

= *
8*(

9 %&’

A9B(D78
6 98

61( )

*+,(./)
= 2 . !*+,(/) + *+, . 2 / ! + *+,(.)*+,(/)

*+, 4″#
$ +#

$%&

= 2 4″#
$ ! *+, +#

$%& + *+, 4″#
$ 2 +#

$%& !

+ *+, 4″#
$ *+,(+#

$%& )

Weights are initialized with zero mean,
and inputs are normalized.

45 6 55
6 = 7[6’#]456 859

6 456(59
6’# )


Initialization (Xavier) — [Xavier Glorot, Yoshua.Bengio 2010]

How Xavier Initialization keeps the variance the same across every
layer?

45 6 55
6 = 7[6’#]456 859

6 456(59
6’# )

456 9[6] =
1

7[6’#]
This expression holds for every layer of our network:

45 6 5[:] = 7 :’# 456 9 : 456 5 :’#
= 7 :’# 456 9 : 7 :’$ 456 9 :’# 456 5 :’$

= ⋯ = ;
6;#

:
7 6’# 456 9 6 456(<) The University of Sydney Page 43 Initialization (Xavier) -- [Xavier Glorot, Yoshua.Bengio 2010] 45 6 5[:] = ; 6;# : 7 6'# 456 9 6 456(<) E 61( A9B 5 6 < 1 → Vanishing Signal = 1 → 45 6 5[:] = 45 6 < > 1 → Exploding Signal

456 9[6] = 17[6’#]
https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_
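A quick numerical check of this result (width and depth are illustrative): with $\mathrm{Var}(W) = 1/n^{[l-1]}$ the activation variance stays on the same order across layers instead of exploding or vanishing:

import numpy as np

rng = np.random.default_rng(0)
n = 512
a = rng.normal(size=n)                                   # normalized input
for layer in range(1, 21):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))   # Var(W) = 1/n
    a = np.tanh(W @ a)
    if layer % 5 == 0:
        print(layer, a.var())                            # stays near a constant value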


Initialization (Xavier) — [Xavier Glorot, Yoshua.Bengio 2010]

456 9[6] = 17[6’#]

We worked on activations computed during the forward propagation,
and get

The same result can be derived for the backpropagated gradients:

456 9[6] =
1
7[6]

Xavier initialization would either initialize the weights as

@(0, 27 6’# + 7[6])
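In PyTorch, the corresponding initializers live in torch.nn.init (a brief usage sketch; the layer shape is arbitrary):

import torch.nn as nn

layer = nn.Linear(256, 128)
nn.init.xavier_normal_(layer.weight)    # N(0, 2 / (fan_in + fan_out))
nn.init.xavier_uniform_(layer.weight)   # U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out)))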

Initialization (He) -- [Kaiming He, Xiangyu Zhang, et al., 2015]

If we use ReLU as the activation function, half of each layer's input distribution is zeroed out, which changes the variance analysis. Using $\mathrm{Var}(x) = E[x^2] - E[x]^2$: if $z^{[l-1]}$ has zero mean and a symmetric distribution around zero, then $E[(a^{[l-1]})^2] = \frac{1}{2}\mathrm{Var}(z^{[l-1]})$. So for a layer with $n$ inputs we have

$$\mathrm{Var}(W^{[l]}) = \frac{2}{n}$$

He initialization therefore initializes the weights as either

$$W^{[l]} \sim \mathcal{N}\!\left(0, \; \frac{2}{n}\right) \quad \text{(kaiming\_normal)}$$

or

$$W^{[l]} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n}}, \; +\sqrt{\frac{6}{n}}\right) \quad \text{(kaiming\_uniform)}$$

Thank you!