
Deep Learning – COSC2779 – Neural Network Optimization

Deep Learning – COSC2779
Neural Network Optimization

Dr. Ruwan Tennakoon

August 2, 2021

Reference: Chapters 7 and 8, Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 1 / 56

Outline

Part 1: Optimization Techniques

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Part 2: Regularization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 2 / 56

Machine Learning

The Task can be expressed as an unknown
target function:

y = f (x)

ML finds a Hypothesis (model), h (·), from
hypothesis space H, which approximates
the unknown target function.

ŷ = h∗ (x) ≈ f (x)

The Experience is typically a data set, D, of values

D = {(x(i), f(x(i)))} for i = 1, …, N

∗Assume supervised learning for now

The Performance is typically a numerical measure that determines how well the
hypothesis matches the experience.

Last week we discussed a representation of the hypothesis (model), h (·).

h (x) = h(3)(h(2)(h(1)(x)))

This week: How can we find the optimal hypothesis h∗ (x)?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 3 / 56

Objectives for this Lecture

Explore techniques that can be used to find the optimal hypothesis in a NN.
Understand the optimization techniques so that we can come up with the
“best approach for a problem” in a way that is better than a random or
exhaustive search over the applicable techniques in the deep learning toolbox.
Start understanding how to do evidence-based debugging when the model
is not learning.
Applicable to all NN types, not just feed-forward networks.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 4 / 56

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 5 / 56

Loss Function
Figure: the data D and the model h(·; w), with parameters w, feed into the cost L(w).

We would like to find the parameters, w, of a neural network that minimize a cost function
L(w).

w∗ = argmin_w L(w)

The cost function usually consists of two parts: the loss (risk) term and the regularization term.

L(w) = E(x,y)∼pdata L(y, h(x; w)) + λ R(w)

(the first term is the risk; the second is the regularization term)

Here pdata is the data-generating distribution, and L(·) is some function that quantifies the
deviation from the expected result.

This is an optimization problem. But in ML we do not know pdata, so we cannot
minimize the risk directly.

∗Our main attention in these slides is given to supervised learning.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 6 / 56

Loss Function

In ML we do not know pdata, so we minimize the empirical risk instead: the expected loss
on the training set.

L(w) = E(x,y)∼p̂data L(y, h(x; w)) + λ R(w)

(the first term is the empirical risk)

Here p̂data is the training data distribution.

E(x,y)∼p̂data L(y, h(x; w)) = (1/N) Σi L(y(i), h(x(i); w)),

where the sum runs over the N training examples.
The training process based on minimizing this average training error is known as empirical
risk minimization.
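
As a minimal sketch of this objective (hypothetical names; a linear model and a squared-error loss stand in for h and L, purely for illustration):

import numpy as np

def empirical_risk(w, X, y):
    """Average squared-error loss of a linear model h(x; w) = X @ w over the training set."""
    predictions = X @ w                        # h(x(i); w) for every example
    per_example_loss = (y - predictions) ** 2  # L(y(i), h(x(i); w))
    return per_example_loss.mean()             # (1/N) * sum of per-example losses

# toy data: N = 100 examples with 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(3), X, y))       # empirical risk at w = 0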

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 7 / 56

Loss Function

The derivative of the loss function is zero
(∂L(w)/∂w = 0) at any critical point of the loss
function.

A local minimum is a point
where L (w) is lower than at all
neighboring points.
A point that obtains the absolute
lowest value of L (w) is a global
minimum.

If the loss is convex we have only one
minimum – The global minimum.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 8 / 56

Finding the Minimum

Random search – a bad idea.

Solve ∇wL(w) = 0 for a closed-form solution – not applicable in many cases.
Guided search – apply an iterative method, gradient descent:
w[t] = w[t−1] − α ∇wL(w)

Gradient descent

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 9 / 56

Gradient Descent

Note that w = [w0,w1, · · · ,wm]

Algorithm 1: Basic Gradient Descent
Result: final weights w
Initialize weights randomly, w ∼ N(0, σ²);
while not converged do
    Compute gradients, ∇wL(w);
    Update weights, w ← w − α ∇wL(w);
end

α is the learning rate.
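
A minimal sketch of this loop in NumPy, assuming a toy quadratic loss whose gradient can be written analytically (hypothetical example, not from the slides):

import numpy as np

def loss(w):
    return np.sum((w - 3.0) ** 2)            # L(w), minimized at w = [3, 3]

def grad(w):
    return 2.0 * (w - 3.0)                   # gradient of L(w)

alpha = 0.1                                  # learning rate
w = np.random.normal(0.0, 1.0, size=2)       # initialize weights randomly ~ N(0, sigma^2)
for _ in range(100):                         # "while not converged", simplified to a fixed budget
    w = w - alpha * grad(w)                  # w <- w - alpha * grad_w L(w)
print(w, loss(w))                            # w approaches [3, 3]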

How can we efficiently apply gradient descent to neural networks?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 10 / 56

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 11 / 56

Computing Gradients in Neural Networks

Figure: a feed-forward network with inputs x1, …, x5; w(1)1,1 labels one weight in the first layer.

The gradient, ∇wL(w), is simply the vector of
partial derivatives in each dimension.

How can we calculate ∂L(w)/∂w(1)1,1?

∇wL(w) = [ ∂L(w)/∂w(1)1,1 , ∂L(w)/∂w(1)1,2 , · · · ]

Numerical derivatives (finite-difference approximation):

∂L(w)/∂w(1)1,1 = lim δ→0 [ L(w(1)1,1 + δ) − L(w(1)1,1) ] / δ

∗Other elements of w will remain constant.

How about ∂L(w)/∂w(1)1,2?

This looks expensive if we have millions of weights.
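
A sketch of the finite-difference idea on a toy loss (hypothetical function; δ is small rather than a true limit). Each weight needs its own extra loss evaluation, which is exactly why this becomes expensive with millions of weights:

import numpy as np

def loss(w):                                  # stand-in for L(w): any scalar function of the weights
    return np.sum(w ** 2) + w[0] * w[1]

def numerical_gradient(loss_fn, w, delta=1e-6):
    grad = np.zeros_like(w)
    for k in range(w.size):                   # one extra forward pass per weight
        w_shifted = w.copy()
        w_shifted[k] += delta                 # perturb only w_k; all other weights stay constant
        grad[k] = (loss_fn(w_shifted) - loss_fn(w)) / delta
    return grad

w = np.array([1.0, 2.0, -0.5])
print(numerical_gradient(loss, w))            # approximately [4.0, 5.0, -1.0]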

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 12 / 56

Back-Propagation

Back-Propagation is a
computationally efficient method to
calculate derivatives.

Repeated application of the chain-rule.

Let's have a look at the intuition behind back-prop using a simple neural network.

A good explanation of back-prop is in
Calculus on Computational Graphs:
Backpropagation

Figure: the chain x → h1 → h2 → ŷ → L(w), with weights w(1), w(2), w(3) on the three stages.

∂L(w)/∂w(3) = ∂L(w)/∂ŷ × ∂ŷ/∂w(3)

∂L(w)/∂w(2) = ∂L(w)/∂ŷ × ∂ŷ/∂h2 × ∂h2/∂w(2)

∂L(w)/∂w(1) = ∂L(w)/∂ŷ × ∂ŷ/∂h2 × ∂h2/∂h1 × ∂h1/∂w(1)

(The leading factors in each line are already computed for the layer above and can be reused.)
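
A sketch of these equations applied by hand to a scalar version of the chain above (hypothetical choices: each stage is a sigmoid with a single weight and L is squared error). Notice that the leading factors computed for the deeper weights are reused for the shallower ones; that reuse is the efficiency gain of back-propagation:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, y, w1, w2, w3 = 1.0, 0.0, 0.5, -0.3, 0.8

# forward pass: x -> h1 -> h2 -> y_hat -> L(w)
h1 = sigmoid(w1 * x)
h2 = sigmoid(w2 * h1)
y_hat = sigmoid(w3 * h2)
L = 0.5 * (y_hat - y) ** 2

# backward pass, reusing upstream factors as we move towards the input
dL_dyhat = y_hat - y
dL_dw3 = dL_dyhat * y_hat * (1 - y_hat) * h2

dyhat_dh2 = y_hat * (1 - y_hat) * w3
dL_dw2 = dL_dyhat * dyhat_dh2 * h2 * (1 - h2) * h1            # reuses dL_dyhat

dh2_dh1 = h2 * (1 - h2) * w2
dL_dw1 = dL_dyhat * dyhat_dh2 * dh2_dh1 * h1 * (1 - h1) * x   # reuses dL_dyhat * dyhat_dh2

print(dL_dw1, dL_dw2, dL_dw3)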

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 13 / 56

http://colah.github.io/posts/2015-08-Backprop/

Implementation – TensorFlow

In TensorFlow 2.0, the
GradientTape looks after the
gradient calculations. We
only need to do the forward
pass.

There is also a much simpler
API in TensorFlow that does
all of this under-the-hood.
More on this in the lab.

import tensorflow as tf

def compute_loss(w):                       # stand-in forward pass (defined outside on the original slide)
    return tf.reduce_sum((w - 3.0) ** 2)

w = tf.Variable(tf.random.normal([1]))     # tf.random.normal needs a shape
lr = 0.001

while True:  # loop forever
    with tf.GradientTape() as g:
        loss = compute_loss(w)             # forward pass, recorded by the tape
    gradient = g.gradient(loss, w)         # back-propagation handled by the tape
    w.assign_sub(lr * gradient)            # w <- w - lr * gradient (keeps w a tf.Variable)
    # need to check convergence and break the loop

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 15 / 56

Revision Questions

1 Why can't we directly minimize the risk in ML?
2 What is the difference between the notations ∇wL(w) and ∂L(w)/∂w?
3 Will gradient descent find the weights that minimize the cost function in a deep feed-forward NN?
4 When calculating ∂L(w)/∂w(1)1,1, what values will you use for the other weights (e.g. w(1)1,2)?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 16 / 56

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 17 / 56

Approximate Derivatives

Calculating the gradients of the empirical risk.

∇wL(w) = ∇w E(x,y)∼p̂data L(y, h(x; w)) = (1/N) Σi ∇wL(y(i), h(x(i); w))
Computing this expectation exactly is very expensive because it requires
evaluating the model on every example in the entire dataset.

In practice, we can compute these expectations by randomly sampling a small
number of examples from the dataset, then taking the average over only those
examples.

∇w E(x,y)∼p̂data L(y, h(x; w)) ≈ (1/Nb) Σi ∇wL(y(i), h(x(i); w)),

where the sum runs over a mini-batch of Nb randomly sampled examples.
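
A sketch of this mini-batch estimate (hypothetical setup: a linear model with squared loss, so the per-example gradients have a closed form that we can average over a random sample of Nb examples):

import numpy as np

def minibatch_gradient(w, X, y, batch_size, rng):
    """Estimate the gradient of the empirical risk from one random mini-batch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample Nb examples
    Xb, yb = X[idx], y[idx]
    residual = Xb @ w - yb                                    # h(x(i); w) - y(i)
    return 2.0 * Xb.T @ residual / batch_size                 # (1/Nb) * sum of per-example gradients

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0])
w = np.zeros(5)
print(minibatch_gradient(w, X, y, batch_size=32, rng=rng))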

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 18 / 56

Approximate Derivatives

Why do approximate derivatives work (intuition)?
The standard error of the mean estimated with n samples is σ/√n, so the reduction
in standard error diminishes as n increases.
Redundancy in the training set: some data points will produce nearly the same
gradients.
We just need an approximate direction in which to step. This works as long as the
gradients are not totally random.

Gradient descent methods:
Deterministic (batch): uses all the training data when calculating the gradient for
each step.
Stochastic (mini-batch): uses a small, randomly sampled subset of the training data
when calculating the gradient for each step.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 20 / 56

Determining Batch-Size

Larger batches provide a more accurate estimate of the gradient, but with
less than linear returns.
Multi-core architectures are usually underutilized by extremely small
batches.
The amount of memory required scales with the batch size.
When using GPUs, it is common for power-of-2 batch sizes to offer better
run time.
Small batches can offer a regularizing effect.

The general inefficiency of batch training for gradient descent learning

We also wish for two subsequent gradient estimates to be independent from
each other, so two subsequent mini-batches of examples should also be
independent from each other.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 21 / 56

http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf

Stochastic (Mini-Batch) Gradient Descent

Note that w = [w0,w1, · · · ,wm]

Algorithm 2: Stochastic (Mini-Batch) Gradient Descent
Input: training data D, learning-rate schedule
Result: final weights w
Initialize weights randomly;
t ← 1;
while not converged do
    Sample an i.i.d. mini-batch from the training data, Db ⊂ D;
    Compute gradients using Db, ∇wL(w);
    Update weights, w ← w − αt ∇wL(w);
    t ← t + 1;
    Check for convergence;
end

αt is the learning rate at iteration t.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 22 / 56

Implementation – TensorFlow

An Epoch is one pass
through all the training data.

The data should be shuffled
after each epoch to make
sure the gradients are
independent. This should be
handled in the data
generator.

More in labs.

import tensorflow as tf

model = MNISTModel()   # defined outside
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
train_loss = tf.keras.metrics.Mean(name='train_loss')

EPOCHS = 5
for epoch in range(EPOCHS):
    # iterate over all batches in the dataset
    for image_batch, label_batch in mnist_train:   # mnist_train: a batched tf.data.Dataset, defined outside
        with tf.GradientTape() as tape:
            predictions = model(image_batch)
            loss = loss_object(label_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        train_loss(loss)

    print('Epoch ', epoch, ' Train loss: ', train_loss.result())
    train_loss.reset_states()

Do we need anything else, or are we done?
Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 23 / 56

Aside: Learning Curves

Learning curves are a
powerful tool to understand
the behaviour of NN
training.

More on plotting learning curves in the labs.

Image: Stanford cs231n

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 24 / 56

Revision Questions

1 What are the advantages of using a small batch size?
2 Why do we need to shuffle the data between epochs?
3 Why is there a negative sign in the weight update of SGD?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 25 / 56

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 26 / 56

Local Minima

With non-convex functions, such as
neural nets, it is possible to have
many local minima.

Local minima can be problematic if they have high
cost in comparison to the global minimum.

“For large neural networks, most local minima have a low cost
value, and that it is not important to find a true global
minimum rather than to find weights with low cost.”

To rule out local minima, plot the norm of the
gradient over time.

If the norm of the gradient does not shrink, the
problem is not caused by any kind of critical point.
Positively establishing that local minima are the
problem can be very difficult in DNNs.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 27 / 56

Saddle Points

A saddle point has a local minimum along one cross-section of the cost function and a local maximum along another cross-section.

In higher-dimensional spaces, local minima are rare, and saddle points are
more common.
For first-order optimization, the gradient can often become very small
near a saddle point or flat region.
Second-order methods (e.g. Newton's method) can terminate at saddle points.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 28 / 56

Flat Regions, Cliffs and Exploding Gradients

Image: Goodfellow 2016

Neural networks with many layers often have extremely steep regions
resembling cliffs.
On the face of an extremely steep cliff structure, the gradient update step
can move the parameters extremely far.
Serious consequences can be avoided using gradient clipping, which
prevents any gradient from having a norm greater than a threshold (see the sketch below).
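
A minimal sketch of clipping in TensorFlow (the toy gradients below stand in for tape.gradient(loss, model.trainable_variables) from the earlier training loop); tf.clip_by_global_norm rescales the whole set of gradients so that their combined norm never exceeds the threshold:

import tensorflow as tf

gradients = [tf.constant([3.0, 4.0]), tf.constant([12.0])]   # global norm = sqrt(9 + 16 + 144) = 13

clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=1.0)
print(global_norm.numpy())                                   # 13.0
print([g.numpy() for g in clipped])                          # same directions, scaled down by 1/13

# Keras optimizers can also clip every update for you:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=1.0)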

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 29 / 56

Long-Term Dependencies

Vanishing and Exploding gradient problem.

Arises when:
The computational graph becomes extremely deep.
The same operation is repeatedly applied at each time step of a long
temporal sequence (in RNNs, discussed later).

Vanishing gradients make it difficult to know which direction the parameters
should move in to improve the cost function, while exploding gradients can make
learning unstable.

How can we solve these problems?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 30 / 56

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 31 / 56

Stochastic Gradient Descent – SGD

Image: Stanford cs231n

We already discussed SGD.
The main parameter is the learning rate. This can be a fixed
value (α) or can vary over time. A simple linear learning-rate
schedule is shown below:

αt = (1 − t/τ) α0 + (t/τ) ατ

α0 is the initial learning rate, ατ the final learning rate, τ the number of iterations.
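
A sketch of this schedule as a plain function (hypothetical values for α0, ατ and τ; after iteration τ the rate is usually held constant at ατ):

def linear_lr(t, alpha_0=0.1, alpha_tau=0.001, tau=1000):
    """Linear decay: alpha_t = (1 - t/tau) * alpha_0 + (t/tau) * alpha_tau."""
    if t >= tau:                          # after tau iterations, keep the final rate
        return alpha_tau
    frac = t / tau
    return (1.0 - frac) * alpha_0 + frac * alpha_tau

print(linear_lr(0), linear_lr(500), linear_lr(2000))   # 0.1, 0.0505, 0.001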

The learning rate may be chosen by trial and error, but it is best
to choose it by monitoring learning curves.

If it is too large, the learning curve will show violent
oscillations, and the loss often increases significantly.
If the learning rate is too low, learning proceeds
slowly.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 32 / 56

Momentum

Normal gradient descent update:

w[t] ← w[t−1] − αt ∇wL(w)

Pathological curvature: regions of the cost function that are not scaled
properly.

The iterates either bounce between the walls of a valley or approach the optimum in small
steps.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 33 / 56

Momentum

Momentum gives gradient descent a short-term memory.

z[t] ← βz[t−1] − (1− β)∇wL (w)

w[t] ← w[t−1] + αz[t]

Hyper-parameter β controls how quickly the contributions of previous gradients exponentially decay. A
typical value is β = 0.9.
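
A sketch of this update on the same kind of toy quadratic loss used earlier (hypothetical, for illustration only); the velocity z accumulates an exponentially decaying average of past gradients:

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)                     # gradient of a toy quadratic centred at [3, 3]

alpha, beta = 0.1, 0.9
w = np.zeros(2)
z = np.zeros_like(w)                           # velocity: short-term memory of past gradients
for _ in range(200):
    z = beta * z - (1.0 - beta) * grad(w)
    w = w + alpha * z
print(w)                                       # approaches [3, 3]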

Why Momentum Really Works
On the importance of initialization and momentum in deep learning.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 34 / 56

https://distill.pub/2017/momentum/
https://www.cs.toronto.edu/~fritz/absps/momentum.pdf

Nesterov Momentum

Variant: Nesterov momentum. The gradient is evaluated after the current velocity is applied.

z[t] ← βz[t−1] − (1 − β)∇wL(w + βz[t−1])
w[t] ← w[t−1] + αt z[t]

Hyper-parameter β controls how quickly the contributions of previous gradients exponentially decay.

Why Momentum Really Works
On the importance of initialization and momentum in deep learning.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 35 / 56

https://distill.pub/2017/momentum/
https://www.cs.toronto.edu/~fritz/absps/momentum.pdf

Revision Questions

How can the problem of saddle points be mitigated in first-order methods?
What happens if β = 0 in momentum gradient descent?
Why is it a good idea to reduce the learning rate over iterations in SGD?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 36 / 56

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 37 / 56

Adaptive Learning Rate

Image: Stanford cs231n

Learning rate is a key parameter in gradient descent.

If the learning rate is too small, updates are small
and optimization is slow.
If the learning rate is too large, updates will be large
and the optimization is likely to diverge.

Start with a large learning rate and decay it during
training.

Finding the “best decay schedule” is not trivial.
Adaptive learning-rate algorithms such as Adam and
RMSprop can adjust the learning rate during the
optimization process.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 38 / 56

RMSprop

Normal gradient descent update: w[t] ← w[t−1] − αt ∇wL(w)

Gradient descent can oscillate: it is slow in one direction and takes large steps in another.
RMSprop uses the root mean square of the gradient in each direction to scale the update.

r[t] = ρ r[t−1] + (1 − ρ) (∇wL(w))²

w[t] ← w[t−1] − α ∇wL(w) / (√r[t] + ε)

ε is a small value to prevent division by zero (ε = 10⁻⁸). r[t] is a vector; the division and square are applied element-wise.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 39 / 56

Adam

Combines RMSprop and momentum. ADAM: A Method For Stochastic Optimization

r[t] = ρ r[t−1] + (1 − ρ) (∇wL(w))²   (RMSprop term)

z[t] ← β z[t−1] + (1 − β) ∇wL(w)   (momentum term)

r[t] ← r[t]/(1 − ρ^t) ;  z[t] ← z[t]/(1 − β^t)   (bias correction)

w[t] ← w[t−1] − α z[t] / (√r[t] + ε)

ε is a small value to prevent division by zero (ε = 10⁻⁸). r[t] is a vector; the division and square are applied element-wise.

Works with a wide range of problems. The hyperparameters [β, ρ] are named
[β1, β2] in TensorFlow and are usually set to β1 = 0.9, β2 = 0.999.
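
A sketch of the update above in NumPy (toy gradient, default hyperparameters assumed; in practice you would simply use tf.keras.optimizers.Adam(learning_rate=0.001) in your training loop):

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)                 # toy gradient, minimum at [3, 3]

alpha, beta, rho, eps = 0.01, 0.9, 0.999, 1e-8
w = np.zeros(2)
z = np.zeros_like(w)                       # first moment (momentum term)
r = np.zeros_like(w)                       # second moment (RMSprop term)
for t in range(1, 2001):
    g = grad(w)
    z = beta * z + (1.0 - beta) * g
    r = rho * r + (1.0 - rho) * g ** 2
    z_hat = z / (1.0 - beta ** t)          # bias correction
    r_hat = r / (1.0 - rho ** t)
    w = w - alpha * z_hat / (np.sqrt(r_hat) + eps)
print(w)                                   # approaches [3, 3]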

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 40 / 56

https://arxiv.org/pdf/1412.6980.pdf

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 41 / 56

Choosing the Right Optimization Algorithm
Unfortunately, there is currently no consensus on which algorithm to use.

SGD

Gradient descent can use parallelization efficiently.
SGD usually converges faster than GD on large datasets.
Of the optimizers profiled here, SGD uses the least memory
for a given batch size.

Momentum
Momentum usually speeds up the learning.
Momentum uses more memory for a given batch size than
stochastic gradient descent but less than RMSprop and Adam.

RMSprop
RMSprop’s adaptive learning rate usually prevents the learning
rate decay from diminishing too slowly or too fast.
RMSprop maintains per-parameter learning rates.

Adam
The hyperparameters of Adam are usually set to predefined values.
Adam performs a form of learning rate annealing with adaptive step-sizes.
Adam is often the default optimizer in machine learning.

If your input data is sparse, you will likely achieve the best results using one of the adaptive learning-rate
methods.
An overview of gradient descent optimization algorithms

Beyond Gradient Descent

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 42 / 56

https://ruder.io/optimizing-gradient-descent/
https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 43 / 56

Parameter Initialization

Deep networks have many parameters:

h(x) = h(3)(h(2)(h(1)(x; W(1)); W(2)); W(3))
Gradient descent learning is iterative and thus
requires the user to specify some initial point.

w(t+1) ← w(t) − αt ∇wL(w)

The starting point of gradient descent can be
critical to the model's ultimate performance.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 44 / 56

Parameter Initialization

The initial point can determine whether the algorithm converges at all:
Some initial points are so unstable that the algorithm may fail altogether.
The initial point can determine how quickly learning converges.
Local minima can have wildly varying generalization error, and the initial point
can affect which minimum is selected.

The most important consideration when initializing the weights is that the initial
parameters need to “break symmetry” between different units.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 45 / 56

Initializing: Break Symmetry

Initial parameters need to “break symmetry”
between different units.

If two hidden units with the same activation function
are connected to the same inputs, then these units
must have different initial parameters.
If they have the same initial parameters, then a
learning algorithm will constantly update both of
these units in the same way.

This motivates random initialization of the parameters.

Figure: a small network with inputs x1, x2, hidden units h(1)(·) connected to the same inputs, and output y1.

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 46 / 56

Random Initialization

Figure: a deep chain of layers with inputs x1, x2 and weights w(1), w(2), …, w(L).

Can initialize all the weights in the model to values
drawn randomly from a Gaussian or uniform distribution.

The scale of the initial distribution has a large effect on
the outcome of the optimization procedure.

Too small: vanishing gradients, learning too slow.
Too large: exploding gradients, unstable.

Assuming all activations are linear:

ŷ = w(L) w(L−1) · · · w(2) w(1) x

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 47 / 56

Initialization: How to select the scale

Figure: a network with inputs x1, x2 and hidden layers h(1), h(2), …, h(L).

Rule of thumb to prevent exploding/vanishing gradients:

The mean of the activations should be zero:

E(h(l−1)) = E(h(l)) = 0

The variance of the activations should stay the same across every layer:

Var(h(l−1)) = Var(h(l))

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 48 / 56

Xavier initialization

All the weights of layer l are picked randomly from a normal distribution with
mean µ = 0 and variance σ² = 1/n[l−1], where n[l−1] is the number of neurons in
layer l − 1.

W(l) ∼ N(0, 1/n[l−1])
Biases are initialized with zeros.

b(l) = 0

The mathematics of why this works is explained in the Deeplearning.ai notes.
A variant of Xavier that works well with ReLU is He initialization.
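
A sketch of Xavier initialization in NumPy, plus the equivalent Keras initializers (note: Keras's 'glorot_normal' uses the closely related variance 2/(n[l−1] + n[l]), and 'he_normal' uses 2/n[l−1] for ReLU):

import numpy as np
import tensorflow as tf

n_prev, n_curr = 256, 128                                # n[l-1] and n[l]

# Xavier: W(l) ~ N(0, 1/n[l-1]), biases zero
W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
b = np.zeros(n_curr)

# The same idea via Keras initializers:
layer = tf.keras.layers.Dense(n_curr, activation='relu',
                              kernel_initializer='he_normal',     # or 'glorot_normal' for Xavier
                              bias_initializer='zeros')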

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 49 / 56

https://www.deeplearning.ai/ai-notes/initialization/
https://arxiv.org/pdf/1502.01852.pdf
https://arxiv.org/pdf/1502.01852.pdf

Outline

1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 50 / 56

Input Normalization

Generally we know that normalizing the input features helps to speed up learning.

µj = (1/N) Σi x(i)j ;   σ²j = (1/N) Σi (x(i)j − µj)²

x̃(i)j = (x(i)j − µj) / σj
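
A sketch in NumPy (the per-feature statistics are computed on the training set; the same µj and σj are then reused to normalize validation and test data):

import numpy as np

X_train = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(1000, 3))

mu = X_train.mean(axis=0)                  # mu_j for each feature j
sigma = X_train.std(axis=0)                # sigma_j for each feature j
X_train_norm = (X_train - mu) / sigma      # zero mean, unit variance per feature

# apply the *training* statistics to new data too, e.g. X_test_norm = (X_test - mu) / sigma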

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 51 / 56

Batch Normalization

How about hidden layer activations?

Figure: a network with inputs x1, x2 and hidden layers h(1), h(2), …, h(L).

Can we normalize the input to each layer?

z(i) = W(l)x(i) + b(l)   (affine transform)
h(i) = g(z(i))   (activation)

Typically we normalize z(i).
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 52 / 56

https://arxiv.org/abs/1502.03167

Batch Normalization

µj = (1/m) Σi z(i)j ;   σ²j = (1/m) Σi (z(i)j − µj)²

ẑ(i)j = (z(i)j − µj) / √(σ²j + ε)   (zero mean, unit variance)

z̃(i)j = γ ẑ(i)j + β   (γ, β are learnable parameters)

Batch Normalization allows us to normalize the feature values across a
mini-batch.

The mean and variance are computed over the mini-batch.

The bias in the affine transform z(i) = W(l)x(i) + b(l) is not useful when using
batch-norm: subtracting the batch mean cancels it, and the learnable shift β takes over its role.
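
A sketch of the training-time transform above in NumPy (γ and β would be learned by gradient descent along with the other weights). In Keras this is provided by tf.keras.layers.BatchNormalization, which also tracks the moving averages needed at inference time (next slide):

import numpy as np

def batch_norm_train(Z, gamma, beta, eps=1e-8):
    """Normalize a mini-batch of pre-activations Z with shape (m, num_units)."""
    mu = Z.mean(axis=0)                     # per-unit mean over the mini-batch
    var = Z.var(axis=0)                     # per-unit variance over the mini-batch
    Z_hat = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * Z_hat + beta             # learnable scale and shift

Z = np.random.default_rng(0).normal(size=(32, 4))          # m = 32 examples, 4 hidden units
out = batch_norm_train(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))                    # approximately 0 and 1 per unit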

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 53 / 56

Batch Normalization: Inference

At test time you do not have a batch to compute mean and variance.

Keep a weighted average over mini-batches while training.

µ̂j = ρ µ̂j + (1 − ρ) µj
σ̂²j = ρ σ̂²j + (1 − ρ) σ²j

Apply the moving averages while testing.

ẑ(i)j = (z(i)j − µ̂j) / √(σ̂²j + ε)

z̃(i)j = γ ẑ(i)j + β

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 54 / 56

Revision Questions

1 What happens if the weights in a neural network are all initialized to 0?
2 Why should we not have a bias in neural network layers when batch-norm is used?
3 Should batch-norm be used before or after the activation?

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 55 / 56

Summary

1 Neural networks are usually trained using iterative, gradient-based
methods.

2 There are many gradient-based optimization techniques. Methods with
adaptive learning rates seem to be robust across many problems.

3 Initialization and batch normalization help speed up learning.

Lab: Continue to learn how NN can be implemented in TensorFlow.

Next week:
1 Convolutions

Lecture 3 (Part 1) Deep Learning – COSC2779 August 2, 2021 56 / 56

Loss Function & Empirical Risk Minimization
Back-Propagation
Stochastic Mini-batch Gradient Descent
Challenges in Neural Network Optimization
Basic Algorithms
Advanced Algorithms: Adaptive Learning Rate
Choosing the Right Optimization Algorithm
Parameter Initialization Strategies
Batch Normalization