Deep Learning – COSC2779 – Neural Network Optimization
Deep Learning – COSC2779
Neural Network Optimization
Dr. Ruwan Tennakoon
August 2, 2021
Reference: Chapters 7 and 8 of Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.
Outline
Part 1: Optimization Techniques
1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization
Part 2: Regularization
Machine Learning
The Task can be expressed as an unknown target function:
y = f(x)
ML finds a Hypothesis (model), h (·), from
hypothesis space H, which approximates
the unknown target function.
ŷ = h∗ (x) ≈ f (x)
The Experience is typically a data set, D, of values:
D = {(x^(i), f(x^(i)))}_{i=1}^{N}
∗Assume supervised learning for now
The Performance is typically a numerical measure that determines how well the hypothesis matches the experience.
Last week we discussed a representation of the Hypothesis (model), h(·):
h(x) = h^(3)( h^(2)( h^(1)(x) ) )
This week: How can we find the optimal hypothesis h∗ (x)?
Objectives for this Lecture
Explore techniques that can be used to find the optimal hypothesis in a NN.
Understand the optimization techniques well enough to choose the best approach for a problem, rather than randomly or exhaustively searching through the deep learning toolbox.
Start understanding how to do evidence-based debugging when the model is not learning.
These ideas apply to all NN types, not just feed-forward networks.
Loss Function
[Diagram: the training data D is fed through the model h(·; w) with parameters w to produce the cost L(w).]
We would like to find the parameters, w, of a neural network that minimize a cost function L(w):
w* = argmin_w L(w)
The cost function usually consists of two parts: the loss (risk) term and the regularization term.
L(w) = E_{(x,y)~p_data}[ L(y, h(x; w)) ] + λ R(w)
The first term is the risk; the second is the regularization term.
Here p_data is the data-generating distribution and L(·) is some function that quantifies the deviation from the expected result.
This is an optimization problem. But in ML we do not know p_data, therefore we cannot minimize the risk directly.
∗Our main attention in these slides is given to supervised learning.
Loss Function
In ML we do not know p_data, so instead we minimize the empirical risk: the expected loss on the training set.
L(w) = E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] + λ R(w)
The first term is the empirical risk; here p̂_data is the training data (empirical) distribution.
E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] = (1/N) Σ_{i=1}^{N} L( y^(i), h(x^(i); w) )
The training process based on minimizing this average training error is known as empirical risk minimization.
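As a concrete illustration, the following is a minimal sketch of computing the empirical risk as the average loss over a training set. The names model, X_train and y_train are assumed to exist, and mean squared error is used only as an example of the per-example loss L(·).

import tensorflow as tf

# A minimal sketch of empirical risk: the average loss over the training set.
# `model`, `X_train` and `y_train` are assumed to be defined elsewhere; MSE is
# just one possible choice for the per-example loss L(y, h(x; w)).
def empirical_risk(model, X_train, y_train):
    loss_fn = tf.keras.losses.MeanSquaredError()
    y_pred = model(X_train)              # h(x; w) for every training example
    return loss_fn(y_train, y_pred)      # averages over the N examples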
Loss Function
The derivative of the loss function is zero (∂L(w)/∂w = 0) at any critical point of the loss function.
A local minimum is a point
where L (w) is lower than at all
neighboring points.
A point that obtains the absolute
lowest value of L (w) is a global
minimum.
If the loss is convex, we have only one minimum – the global minimum.
Finding the Minimum
Random search – a bad idea.
Solve ∇wL(w) = 0 and find a closed-form solution – not applicable in many cases.
Guided search – apply an iterative method, gradient descent:
w[t] = w[t−1] − α ∇wL(w)
[Figure: gradient descent on a loss surface]
Gradient Descent
Note that w = [w_0, w_1, · · · , w_m]
Algorithm 1: Basic Gradient Descent
Result: Final weights w
Initialize weights randomly, w ∼ N(0, σ²);
while not converged do
    Compute gradients, ∇wL(w);
    Update weights, w ← w − α ∇wL(w);
end
α is the learning rate.
How can we efficiently apply gradient descent to neural networks?
Computing Gradients in Neural Networks
[Diagram: a feed-forward network with inputs x1 … x5, output ŷ, and one highlighted weight w^(1)_{1,1}.]
The gradient, ∇wL(w), is simply the vector of partial derivatives, one per weight.
How can we calculate ∂L(w)/∂w^(1)_{1,1}?
∇wL(w) = [ ∂L(w)/∂w^(1)_{1,1}, ∂L(w)/∂w^(1)_{1,2}, · · · ]
Numerical derivative (finite difference approximation):
∂L(w)/∂w^(1)_{1,1} = lim_{δ→0} [ L(w^(1)_{1,1} + δ) − L(w^(1)_{1,1}) ] / δ
∗The other elements of w remain constant.
How about ∂L(w)/∂w^(1)_{1,2}?
This looks expensive if we have millions of weights.
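A rough sketch of this finite-difference idea in code (assuming a hypothetical loss_fn that evaluates L(w) for a full weight vector): every single weight needs its own perturbed forward pass, which is why the approach does not scale.

import numpy as np

# Finite-difference approximation of the gradient (a sketch).
# `loss_fn` evaluates L(w) for a 1-D weight vector w; each weight requires an
# extra forward pass, so the cost grows linearly with the number of weights.
def numerical_gradient(loss_fn, w, delta=1e-5):
    grad = np.zeros_like(w)
    base = loss_fn(w)
    for j in range(w.size):
        w_plus = w.copy()
        w_plus[j] += delta                    # perturb only w_j; the rest stay fixed
        grad[j] = (loss_fn(w_plus) - base) / delta
    return grad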
Back-Propagation
Back-Propagation is a
computationally efficient method to
calculate derivatives.
Repeated application of the chain-rule.
Let’s have a look at the intuition behind back-prop using a simple neural network.
A good explanation of back-prop is given in Calculus on Computational Graphs: Backpropagation (http://colah.github.io/posts/2015-08-Backprop/).
x → h1 → h2 → ŷ → L(w), with weights w^(1), w^(2), w^(3)

∂L(w)/∂w^(3) = ∂L(w)/∂ŷ × ∂ŷ/∂w^(3)

∂L(w)/∂w^(2) = ∂L(w)/∂ŷ × ∂ŷ/∂h2 × ∂h2/∂w^(2)

∂L(w)/∂w^(1) = ∂L(w)/∂ŷ × ∂ŷ/∂h2 × ∂h2/∂h1 × ∂h1/∂w^(1)

At each step, the leading factors (e.g. ∂L(w)/∂ŷ, then ∂L(w)/∂ŷ × ∂ŷ/∂h2) have already been computed for the layer above, so they can be reused when moving backwards through the network.
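To make the chain rule concrete, here is a toy, hand-coded backward pass for the scalar chain above, assuming linear units (h1 = w^(1)x, h2 = w^(2)h1, ŷ = w^(3)h2) and a squared-error loss; all the numeric values are purely illustrative.

# Toy back-propagation by hand (a sketch; linear units and squared error are
# assumptions made only for illustration).
x, y = 2.0, 1.0
w1, w2, w3 = 0.5, -0.3, 0.8

# Forward pass: keep the intermediate activations, they are reused going backwards.
h1 = w1 * x
h2 = w2 * h1
y_hat = w3 * h2
loss = (y - y_hat) ** 2

# Backward pass: repeated application of the chain rule.
dL_dyhat = -2.0 * (y - y_hat)
dL_dw3 = dL_dyhat * h2            # dL/dw3 = dL/dy_hat * dy_hat/dw3
dL_dh2 = dL_dyhat * w3            # shared factor, reused for the earlier layers
dL_dw2 = dL_dh2 * h1              # dL/dw2 = dL/dh2 * dh2/dw2
dL_dh1 = dL_dh2 * w2
dL_dw1 = dL_dh1 * x               # dL/dw1 = dL/dh1 * dh1/dw1

print(dL_dw1, dL_dw2, dL_dw3)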
Implementation – TensorFlow
In TensorFlow 2.0, the
GradientTape looks after the
gradient calculations. We
only need to do the forward
pass.
There is also a much simpler
API in TensorFlow that does
all of this under-the-hood.
More on this in the lab.
import tensorflow as tf

w = tf.Variable(tf.random.normal([1]))   # random initialization
lr = 0.001

while True:  # loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(w)  # forward function defined outside
    gradient = g.gradient(loss, w)
    w.assign_sub(lr * gradient)          # w <- w - lr * gradient
    # need to check convergence and break the loop
Revision Questions
1 Why can’t we directly minimize the risk in ML?
2 What is the difference between the notations ∇wL(w) and ∂L(w)/∂w?
3 Will gradient descent find the weights that minimize the cost function in a deep feed-forward NN?
4 When calculating ∂L(w)/∂w^(1)_{1,1}, what values will you use for the other weights (e.g. w^(1)_{1,2})?
Approximate Derivatives
Calculating the gradient of the empirical risk:
∇wL(w) = ∇w E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] = (1/N) Σ_{i=1}^{N} ∇w L( y^(i), h(x^(i); w) )
Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset.
In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples:
∇w E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] ≈ (1/N_b) Σ_{i=1}^{N_b} ∇w L( y^(i), h(x^(i); w) )
Approximate Derivatives
Why do approximate derivatives work (intuition)?
The standard error of the mean estimated with n samples is σ/√n, so the reduction in standard error diminishes as n increases.
Redundancy in the training set: some data points will result in (nearly) the same gradient values.
We just need an approximate direction to step in. This will work as long as the gradients are not totally random.
Gradient descent methods:
Deterministic: uses all the training data when calculating the gradient for each step.
Stochastic (mini-batch): uses a randomly sampled small number of examples from the training data when calculating the gradient for each step.
Determining Batch-Size
Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
Multi-core architectures are usually underutilized by extremely small batches.
The amount of memory required scales with the batch size.
When using GPUs, it is common for powers-of-2 batch sizes to offer better run time.
Small batches can offer a regularizing effect.
See: The general inefficiency of batch training for gradient descent learning (http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf)
We also wish for two subsequent gradient estimates to be independent of each other, so two subsequent mini-batches of examples should also be independent of each other.
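One common way to obtain shuffled, (approximately) independent mini-batches in TensorFlow is a tf.data pipeline, sketched below; X_train and y_train are assumed NumPy arrays, and the batch size of 64 is only an example.

import tensorflow as tf

# A minimal sketch of mini-batch sampling with tf.data. Reshuffling on every
# iteration keeps successive mini-batches (approximately) independent.
dataset = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
           .shuffle(buffer_size=10_000, reshuffle_each_iteration=True)
           .batch(64))   # a power-of-2 batch size often runs well on GPUs

for x_batch, y_batch in dataset:   # one full pass over `dataset` is one epoch
    pass                           # compute the gradient on this batch only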
Stochastic (Mini-Batch) Gradient Descent
Note that w = [w0,w1, · · · ,wm]
Algorithm 2: Stochastic Gradient Descent
Input: Training data D, Learning rate schedule
Result: Final Weights w
Initialize weights randomly;
t ← 1 ;
while not converged do
Sample an iid batch from training data, Db ⊂ D ;
Compute Gradients using Db, ∇wL (w) ;
Update weights, w← w− αt ∇wL (w) ;
t ← t + 1;
Check for convergence ;
end
αt is the learning rate at iteration t.
Implementation – TensorFlow
An Epoch is one pass
through all the training data.
The data should be shuffled
after each epoch to make
sure the gradients are
independent. This should be
handled in the data
generator.
More in labs.
import tensorflow as tf

model = MNISTModel()  # defined outside
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
train_loss = tf.keras.metrics.Mean(name='train_loss')

EPOCHS = 5
for epoch in range(EPOCHS):
    # iterate over all mini-batches in the dataset
    for image_batch, label_batch in mnist_train:
        with tf.GradientTape() as tape:
            predictions = model(image_batch)
            loss = loss_object(label_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        train_loss(loss)
    print('Epoch', epoch, 'Train loss:', train_loss.result())
    train_loss.reset_states()
Do we need anything else, or are we done?
Aside: Learning Curves
Learning curves are a powerful tool for understanding the behaviour of NN training.
More on plotting learning curves in the labs.
Image: Stanford cs231n
Revision Questions
1 What are the advantages of using a small batch size?
2 Why do we need to shuffle the data between epochs?
3 Why is there a negative sign in the weight update of the SGD?
Local Minima
With non-convex functions, such as
neural nets, it is possible to have
many local minima.
Local minima can be problematic if they have high
cost in comparison to the global minimum.
“For large neural networks, most local minima have a low cost value, and it is not important to find a true global minimum rather than weights with a sufficiently low cost.”
To rule out local minima: plot the norm of the
gradient over time.
If the norm of the gradient does not shrink, the
problem is not any kind of critical point.
Positively establishing that local minima are the
problem can be very difficult in DNN.
Saddle Points
A saddle point has a local minimum along one cross-section of the cost function and a local maximum along another cross-section.
In higher-dimensional spaces, local minima are rare, and saddle points are
more common.
For first-order optimization, the gradient can often become very small near a saddle point or flat region.
Second-order methods (Newton’s method) can terminate at saddle points.
Flat Regions, Cliffs and Exploding Gradients
Image: Goodfellow 2016
Neural networks with many layers often have extremely steep regions resembling cliffs.
On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
Serious consequences can be avoided using gradient clipping, which prevents any gradient from having a norm greater than a threshold.
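As an illustration, one way to apply gradient clipping in TensorFlow is the optimizer-level clipnorm argument (a sketch, not necessarily the exact setup used in the labs):

import tensorflow as tf

# Any gradient whose L2 norm exceeds 1.0 is rescaled before the update.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# The same effect by hand inside a custom training loop:
# clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]
# optimizer.apply_gradients(zip(clipped, model.trainable_variables))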
Long-Term Dependencies
The vanishing and exploding gradient problems.
These arise when:
The computational graph becomes extremely deep.
The same operation is applied repeatedly at each time step of a long temporal sequence (in RNNs, discussed later).
Vanishing gradients make it difficult to know which direction the parameters should move in to improve the cost function, while exploding gradients can make learning unstable.
How can we solve these problems?
Stochastic Gradient Descent – SGD
Image: Stanford cs231n
We already discussed SGD.
The main parameter is the learning rate. This can be a fixed value (α) or can vary over time. A simple linear learning-rate schedule is:
α_t = (1 − t/τ) α_0 + (t/τ) α_τ
where α_0 is the initial learning rate, α_τ the final learning rate, and τ the number of iterations over which the rate is decayed.
The learning rate may be chosen by trial and error. Best
to choose it by monitoring learning curves.
If it is too large, the learning curve will show violent oscillations, and the cost often increases significantly.
If the learning rate is too low, learning proceeds
slowly.
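The linear schedule above can be written directly as a small function, and Keras ships a roughly equivalent built-in schedule (PolynomialDecay with power=1); the numbers below are placeholders.

import tensorflow as tf

def linear_lr(t, tau, alpha_0, alpha_tau):
    # Linear schedule from the slide: interpolate from alpha_0 to alpha_tau
    # over tau iterations, then keep the final rate.
    k = min(t / tau, 1.0)
    return (1.0 - k) * alpha_0 + k * alpha_tau

# Roughly the same behaviour with a built-in Keras schedule (power=1 is linear).
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1, decay_steps=10_000, end_learning_rate=0.001, power=1.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)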
Momentum
Normal gradient descent update:
w[t] ← w[t−1] − α_t ∇wL(w)
Pathological curvature: regions of the cost function that are not scaled properly.
The iterates either jump between the sides of a valley, or approach the optimum in small steps.
steps.
Momentum
Momentum gives gradient descent a short-term memory.
z[t] ← βz[t−1] − (1− β)∇wL (w)
w[t] ← w[t−1] + αz[t]
Hyper-parameter β controls how quickly the contributions of previous gradients exponentially decay. A
typical value is β = 0.9.
Why Momentum Really Works (https://distill.pub/2017/momentum/)
On the importance of initialization and momentum in deep learning (https://www.cs.toronto.edu/~fritz/absps/momentum.pdf)
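A direct transcription of this update rule as code (a sketch; a hypothetical grad_fn(w) returning ∇wL(w) is assumed to exist):

# Momentum update as written on this slide (a sketch).
def momentum_step(w, z, grad_fn, alpha=0.01, beta=0.9):
    z = beta * z - (1.0 - beta) * grad_fn(w)   # exponentially decaying velocity
    w = w + alpha * z
    return w, z

# Keras' tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9) implements a
# closely related update (without the (1 - beta) scaling of the gradient).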
Nesterov Momentum
Variant: Nesterov momentum. The gradient is evaluated after the current velocity is applied.
z[t] ← βz[t−1] − (1 − β) ∇wL( w + βz[t−1] )
w[t] ← w[t−1] + α_t z[t]
Hyper-parameter β controls how quickly the contributions of previous gradients exponentially decay.
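The corresponding sketch, with the gradient evaluated at the look-ahead point (again assuming the hypothetical grad_fn); in Keras this variant is available as a flag on the SGD optimizer.

# Nesterov momentum as on this slide (a sketch): the gradient is taken at the
# "look-ahead" point w + beta*z rather than at w.
def nesterov_step(w, z, grad_fn, alpha=0.01, beta=0.9):
    z = beta * z - (1.0 - beta) * grad_fn(w + beta * z)
    w = w + alpha * z
    return w, z

# Built-in equivalent (up to scaling of the velocity term):
# tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)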
Revision Questions
How can the problem of saddle points be minimized with first-order methods?
What happens if β = 0 in momentum gradient descent?
Why is it a good idea to reduce the learning rate with iterations in SGD?
Adaptive Learning Rate
Image: Stanford cs231n
Learning rate is a key parameter in gradient descent.
If the learning rate is too small, updates are small
and optimization is slow.
If the learning rate is too large, updates will be large
and the optimization is likely to diverge.
Start with a large learning rate and decay it during
training.
Finding the “best decay schedule” is not trivial.
Adaptive learning-rate algorithms such as Adam and
RMSprop can adjust the learning rate during the
optimization process.
RMSprop
Normal gradient descent update: w[t] ← w[t−1] − α_t ∇wL(w)
This can oscillate: slow progress in one direction while overshooting in another. RMSprop uses the root mean square of the gradient in each direction to scale the update.
r[t] = ρ r[t−1] + (1 − ρ) (∇wL(w))²
w[t] ← w[t−1] − α ∇wL(w) / ( √r[t] + ε )
ε is a small value to prevent division by zero (ε = 10⁻⁸). r[t] is a vector; the division and square are applied element-wise.
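A sketch of this update in code (using the same hypothetical grad_fn as before), together with the built-in Keras optimizer:

import numpy as np

# RMSprop update as on this slide (a sketch). r is the running average of the
# squared gradients and rescales the step per parameter, element-wise.
def rmsprop_step(w, r, grad_fn, alpha=0.001, rho=0.9, eps=1e-8):
    g = grad_fn(w)
    r = rho * r + (1.0 - rho) * g ** 2
    w = w - alpha * g / (np.sqrt(r) + eps)
    return w, r

# Built-in equivalent: tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)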
Adam
Combines RMSprop and momentum. ADAM: A Method For Stochastic Optimization (https://arxiv.org/pdf/1412.6980.pdf)
r[t] = ρ r[t−1] + (1 − ρ) (∇wL(w))²   (RMSprop)
z[t] ← βz[t−1] + (1 − β) ∇wL(w)   (momentum)
r[t] ← r[t]/(1 − ρ^t) ;  z[t] ← z[t]/(1 − β^t)   (bias correction)
w[t] ← w[t−1] − α z[t] / ( √r[t] + ε )
ε is a small value to prevent division by zero (ε = 10⁻⁸). r[t] is a vector; the division and square are applied element-wise.
Works with a wide range of problems. The hyper-parameters [β, ρ] are named [β1, β2] in TensorFlow and are usually set to β1 = 0.9, β2 = 0.999.
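In practice one rarely codes Adam by hand; a sketch of the Keras optimizer with the hyper-parameter names mentioned above (the values are the common defaults):

import tensorflow as tf

# Adam in Keras: beta_1 corresponds to the momentum decay (beta above) and
# beta_2 to the squared-gradient decay (rho above).
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Drop-in replacement for SGD in the earlier training-loop example:
# optimizer.apply_gradients(zip(gradients, model.trainable_variables))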
Choosing the Right Optimization Algorithm
Unfortunately, there is currently no consensus on which algorithm to use.
SGD
Gradient descent can use parallelization efficiently.
SGD usually converges faster than GD on large datasets.
Of the optimizers profiled here, SGD uses the least memory for a given batch size.
Momentum
Momentum usually speeds up the learning.
Momentum uses more memory for a given batch size than
stochastic gradient descent but less than RMSprop and Adam.
RMSprop
RMSprop’s adaptive learning rate usually prevents the learning
rate decay from diminishing too slowly or too fast.
RMSprop maintains per-parameter learning rates.
Adam
The hyperparameters of Adam are usually set to predefined values.
Adam performs a form of learning rate annealing with adaptive step-sizes.
Adam is often the default optimizer in machine learning.
If your input data is sparse, you will likely achieve the best results using one of the adaptive learning-rate methods.
An overview of gradient descent optimization algorithms (https://ruder.io/optimizing-gradient-descent/)
Beyond Gradient Descent (https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html)
Parameter Initialization
Deep networks have many parameters:
h(x) = h^(3)( h^(2)( h^(1)(x; W^(1)); W^(2) ); W^(3) )
Gradient descent learning is iterative and thus requires the user to specify some initial point:
w[t+1] ← w[t] − α_t ∇wL(w)
The starting point of gradient descent can be critical to the model’s ultimate performance.
Parameter Initialization
The initial point can determine whether the algorithm converges:
Some initial points can be unstable and the algorithm may fail altogether.
The initial point can determine how quickly learning converges.
Local minima can have wildly varying generalization error, and the initial point can affect which minimum is selected.
The most important consideration when initializing the weights is that the initial
parameters need to “break symmetry” between different units.
Initializing: Break Symmetry
Initial parameters need to “break symmetry”
between different units.
If two hidden units with the same activation function
are connected to the same inputs, then these units
must have different initial parameters.
If they have the same initial parameters, then a
learning algorithm will constantly update both of
these units in the same way.
This motivates random initialization of the parameters.
[Diagram: two inputs x1, x2 connected to identical hidden units h^(1)(·) feeding the output y1.]
Random Initialization
[Diagram: a deep network with inputs x1, x2, output ŷ, and weights w^(1), w^(2), …, w^(L).]
We can initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution.
The scale of the initial distribution has a large effect on the outcome of the optimization procedure.
Too small: vanishing gradients, learning too slow.
Too large: exploding gradients, unstable.
Assuming all activations are linear:
ŷ = w^(L) w^(L−1) · · · w^(2) w^(1) x
Initialization: How to Select the Scale
Rule of thumb to prevent exploding/vanishing gradients:
The mean of the activations should be zero:
E[h^(l−1)] = E[h^(l)] = 0
The variance of the activations should stay the same across every layer:
Var[h^(l−1)] = Var[h^(l)]
Xavier initialization
All the weights of layer l are picked randomly from a normal distribution with mean µ = 0 and variance σ² = 1/n^[l−1], where n^[l−1] is the number of neurons in layer l − 1:
W^(l) ∼ N(0, 1/n^[l−1])
Biases are initialized with zeros:
b^(l) = 0
The mathematics of why this works is explained in the Deeplearning.ai notes (https://www.deeplearning.ai/ai-notes/initialization/).
A variant of Xavier that works well with ReLU is He initialization (https://arxiv.org/pdf/1502.01852.pdf).
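In Keras the initialization scheme can be set per layer; a sketch is below (note that Keras’ GlorotNormal uses the closely related scale 2/(n^[l−1] + n^[l]) rather than exactly 1/n^[l−1], and the layer sizes are placeholders).

import tensorflow as tf

# Xavier (Glorot) initialization for tanh-like activations, He initialization
# for ReLU layers; biases start at zero in both cases (a sketch).
layer_tanh = tf.keras.layers.Dense(
    64, activation='tanh',
    kernel_initializer=tf.keras.initializers.GlorotNormal(),
    bias_initializer='zeros')

layer_relu = tf.keras.layers.Dense(
    64, activation='relu',
    kernel_initializer=tf.keras.initializers.HeNormal(),
    bias_initializer='zeros')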
Input Normalization
Generally we know that normalizing the input features helps to speed up learning.
µ_j = (1/N) Σ_i x_j^(i) ;   σ_j² = (1/N) Σ_i ( x_j^(i) − µ_j )²
x̃_j^(i) = ( x_j^(i) − µ_j ) / σ_j
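A sketch of this standardization in code, computing the statistics from the training set only and reusing them for any held-out data (X_train and X_test are assumed NumPy arrays of shape [N, d]):

import numpy as np

# Standardize each feature with training-set statistics (a sketch).
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8        # small epsilon avoids division by zero

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma       # reuse the training statistics at test time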
Batch Normalization
How about hidden layer activations?
Can we normalize the input to each layer?
z^(i) = W^(l) x^(i) + b^(l)   (affine transform)
h^(i) = g( z^(i) )   (activation)
Typically we normalize z^(i).
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (https://arxiv.org/abs/1502.03167)
Batch Normalization
µ_j = (1/m) Σ_i z_j^(i) ;   σ_j² = (1/m) Σ_i ( z_j^(i) − µ_j )²
ẑ_j^(i) = ( z_j^(i) − µ_j ) / √( σ_j² + ε )   (mean zero, variance one)
z̃_j^(i) = γ ẑ_j^(i) + β   (γ, β learnable parameters)
Batch Normalization allows us to normalize the feature values across a mini-batch.
The mean and variance are computed over the mini-batch.
The bias in the affine transform, z^(i) = W^(l) x^(i) + b^(l), is not useful when using batch-norm: the learnable shift β makes b^(l) redundant.
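A sketch of batch normalization in a Keras model, placed between the (bias-free) affine transform and the activation; the layer sizes are placeholders.

import tensorflow as tf

# Batch norm applied to z = Wx before the activation; use_bias=False because
# the layer's learnable beta takes over the role of the bias (a sketch).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, use_bias=False),   # affine transform z = Wx
    tf.keras.layers.BatchNormalization(),         # normalize z, then scale/shift by gamma, beta
    tf.keras.layers.Activation('relu'),           # activation h = g(z~)
    tf.keras.layers.Dense(10, activation='softmax'),
])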
Batch Normalization: Inference
At test time you do not have a mini-batch to compute the mean and variance from.
Keep a weighted (moving) average over mini-batches while training:
µ̂_j = ρ µ̂_j + (1 − ρ) µ_j
σ̂_j² = ρ σ̂_j² + (1 − ρ) σ_j²
Apply the moving averages while testing:
ẑ_j^(i) = ( z_j^(i) − µ̂_j ) / √( σ̂_j² + ε )
z̃_j^(i) = γ ẑ_j^(i) + β
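In Keras this switch is handled by the training argument: batch statistics are used (and the moving averages updated) during training, while the stored moving averages are applied at inference. A sketch, reusing model and image_batch from the earlier training-loop example:

# Training step: use batch statistics and update the moving averages.
predictions = model(image_batch, training=True)

# Inference: apply the stored moving mean and variance instead.
predictions = model(image_batch, training=False)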
Revision Questions
1 What happens if the weights in a neural network are all initialized to 0?
2 Why should we not have a bias in neural network layers when batch-norm
is used?
3 Should batch-norm be used before or after the activation?
Summary
1 Neural networks are usually trained using iterative, gradient-based methods.
2 There are many gradient-based optimization techniques. Methods with adaptive learning rates seem to be robust across many problems.
3 Initialization and batch normalization help speed up learning.
Lab: Continue to learn how NN can be implemented in TensorFlow.
Next week:
1 Convolutions