Deep Learning – COSC2779 – Neural Network Optimization
Deep Learning – COSC2779
Neural Network Optimization
Dr. Ruwan Tennakoon
August 2, 2021
Reference: Chapters 7 and 8 of Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.
Outline
Part 1: Optimization Techniques
1 Loss Function & Empirical Risk Minimization
2 Back-Propagation
3 Stochastic Mini-batch Gradient Descent
4 Challenges in Neural Network Optimization
5 Basic Algorithms
6 Advanced Algorithms: Adaptive Learning Rate
7 Choosing the Right Optimization Algorithm
8 Parameter Initialization Strategies
9 Batch Normalization
Part 2: Regularization
Machine Learning
The Task can be expressed as an unknown target function:
y = f(x)
ML finds a Hypothesis (model), h (·), from
hypothesis space H, which approximates
the unknown target function.
ŷ = h∗ (x) ≈ f (x)
The Experience is typically a data set, D, of values:
D = {(x^(i), f(x^(i)))}_{i=1}^{N}
∗Assume supervised learning for now
The Performance is typically a numerical measure that determines how well the hypothesis matches the experience.
Last week we discussed a representation of the Hypothesis (model), h(·):
h(x) = h^(3)( h^(2)( h^(1)(x) ) )
This week: How can we find the optimal hypothesis h∗ (x)?
Objectives for this Lecture
Explore techniques that can be used to find the optimal hypothesis in a NN.
Understand the optimization techniques well enough to choose the best approach for a problem, rather than randomly or exhaustively searching through the deep learning toolbox.
Start understanding how to do evidence-based debugging when the model is not learning.
These ideas apply to all NN types, not just feed-forward networks.
Loss Function
[Diagram: the training data D is fed through the model h(·; w) with parameters w to produce the cost L(w).]
We would like to find the parameters, w, of a neural network that minimize a cost function L(w):
w* = argmin_w L(w)
The cost function usually consists of two parts: the loss (risk) term and the regularization term.
L(w) = E_{(x,y)~p_data}[ L(y, h(x; w)) ] + λ R(w)
The first term is the risk; the second is the regularization term.
Here p_data is the data-generating distribution and L(·) is some function that quantifies the deviation from the expected result.
This is an optimization problem. But in ML we do not know p_data, therefore we cannot minimize the risk directly.
∗Our main attention in these slides is given to supervised learning.
Loss Function
In ML we do not know p_data, so instead we minimize the empirical risk: the expected loss on the training set.
L(w) = E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] + λ R(w)
The first term is the empirical risk; here p̂_data is the training data (empirical) distribution.
E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] = (1/N) Σ_{i=1}^{N} L( y^(i), h(x^(i); w) )
The training process based on minimizing this average training error is known as empirical risk minimization.
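As a concrete illustration, the following is a minimal sketch of computing the empirical risk as the average loss over a training set. The names model, X_train and y_train are assumed to exist, and mean squared error is used only as an example of the per-example loss L(·).

import tensorflow as tf

# A minimal sketch of empirical risk: the average loss over the training set.
# `model`, `X_train` and `y_train` are assumed to be defined elsewhere; MSE is
# just one possible choice for the per-example loss L(y, h(x; w)).
def empirical_risk(model, X_train, y_train):
    loss_fn = tf.keras.losses.MeanSquaredError()
    y_pred = model(X_train)              # h(x; w) for every training example
    return loss_fn(y_train, y_pred)      # averages over the N examples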
Loss Function
The derivative of the loss function is zero (∂L(w)/∂w = 0) at any critical point of the loss function.
A local minimum is a point
where L (w) is lower than at all
neighboring points.
A point that obtains the absolute
lowest value of L (w) is a global
minimum.
If the loss is convex, we have only one minimum – the global minimum.
Finding the Minimum
Random search – a bad idea.
Solve ∇wL(w) = 0 and find a closed-form solution – not applicable in many cases.
Guided search – apply an iterative method, gradient descent:
w[t] = w[t−1] − α ∇wL(w)
[Figure: gradient descent on a loss surface]
Gradient Descent
Note that w = [w_0, w_1, · · · , w_m]
Algorithm 1: Basic Gradient Descent
Result: Final weights w
Initialize weights randomly, w ∼ N(0, σ²);
while not converged do
    Compute gradients, ∇wL(w);
    Update weights, w ← w − α ∇wL(w);
end
α is the learning rate.
How can we efficiently apply gradient descent to neural networks?
Computing Gradients in Neural Networks
[Diagram: a feed-forward network with inputs x1 … x5, output ŷ, and one highlighted weight w^(1)_{1,1}.]
The gradient, ∇wL(w), is simply the vector of partial derivatives, one per weight.
How can we calculate ∂L(w)/∂w^(1)_{1,1}?
∇wL(w) = [ ∂L(w)/∂w^(1)_{1,1}, ∂L(w)/∂w^(1)_{1,2}, · · · ]
Numerical derivative (finite difference approximation):
∂L(w)/∂w^(1)_{1,1} = lim_{δ→0} [ L(w^(1)_{1,1} + δ) − L(w^(1)_{1,1}) ] / δ
∗The other elements of w remain constant.
How about ∂L(w)/∂w^(1)_{1,2}?
This looks expensive if we have millions of weights.
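A rough sketch of this finite-difference idea in code (assuming a hypothetical loss_fn that evaluates L(w) for a full weight vector): every single weight needs its own perturbed forward pass, which is why the approach does not scale.

import numpy as np

# Finite-difference approximation of the gradient (a sketch).
# `loss_fn` evaluates L(w) for a 1-D weight vector w; each weight requires an
# extra forward pass, so the cost grows linearly with the number of weights.
def numerical_gradient(loss_fn, w, delta=1e-5):
    grad = np.zeros_like(w)
    base = loss_fn(w)
    for j in range(w.size):
        w_plus = w.copy()
        w_plus[j] += delta                    # perturb only w_j; the rest stay fixed
        grad[j] = (loss_fn(w_plus) - base) / delta
    return grad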
Back-Propagation
Back-Propagation is a
computationally efficient method to
calculate derivatives.
Repeated application of the chain-rule.
Let’s have a look at the intuition behind back-prop using a simple neural network.
A good explanation of back-prop is given in Calculus on Computational Graphs: Backpropagation (http://colah.github.io/posts/2015-08-Backprop/).
x → h1 → h2 → ŷ → L(w), with weights w^(1), w^(2), w^(3)

∂L(w)/∂w^(3) = ∂L(w)/∂ŷ × ∂ŷ/∂w^(3)

∂L(w)/∂w^(2) = ∂L(w)/∂ŷ × ∂ŷ/∂h2 × ∂h2/∂w^(2)

∂L(w)/∂w^(1) = ∂L(w)/∂ŷ × ∂ŷ/∂h2 × ∂h2/∂h1 × ∂h1/∂w^(1)

At each step, the leading factors (e.g. ∂L(w)/∂ŷ, then ∂L(w)/∂ŷ × ∂ŷ/∂h2) have already been computed for the layer above, so they can be reused when moving backwards through the network.
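To make the chain rule concrete, here is a toy, hand-coded backward pass for the scalar chain above, assuming linear units (h1 = w^(1)x, h2 = w^(2)h1, ŷ = w^(3)h2) and a squared-error loss; all the numeric values are purely illustrative.

# Toy back-propagation by hand (a sketch; linear units and squared error are
# assumptions made only for illustration).
x, y = 2.0, 1.0
w1, w2, w3 = 0.5, -0.3, 0.8

# Forward pass: keep the intermediate activations, they are reused going backwards.
h1 = w1 * x
h2 = w2 * h1
y_hat = w3 * h2
loss = (y - y_hat) ** 2

# Backward pass: repeated application of the chain rule.
dL_dyhat = -2.0 * (y - y_hat)
dL_dw3 = dL_dyhat * h2            # dL/dw3 = dL/dy_hat * dy_hat/dw3
dL_dh2 = dL_dyhat * w3            # shared factor, reused for the earlier layers
dL_dw2 = dL_dh2 * h1              # dL/dw2 = dL/dh2 * dh2/dw2
dL_dh1 = dL_dh2 * w2
dL_dw1 = dL_dh1 * x               # dL/dw1 = dL/dh1 * dh1/dw1

print(dL_dw1, dL_dw2, dL_dw3)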
Implementation – TensorFlow
In TensorFlow 2.0, the
GradientTape looks after the
gradient calculations. We
only need to do the forward
pass.
There is also a much simpler
API in TensorFlow that does
all of this under-the-hood.
More on this in the lab.
import tensorflow as tf

w = tf.Variable(tf.random.normal([1]))   # random initialization
lr = 0.001

while True:  # loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(w)  # forward function defined outside
    gradient = g.gradient(loss, w)
    w.assign_sub(lr * gradient)          # w <- w - lr * gradient
    # need to check convergence and break the loop
Revision Questions
1 Why can’t we directly minimize the risk in ML?
2 What is the difference between the notations ∇wL(w) and ∂L(w)/∂w?
3 Will gradient descent find the weights that minimize the cost function in a deep feed-forward NN?
4 When calculating ∂L(w)/∂w^(1)_{1,1}, what values will you use for the other weights (e.g. w^(1)_{1,2})?
Approximate Derivatives
Calculating the gradient of the empirical risk:
∇wL(w) = ∇w E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] = (1/N) Σ_{i=1}^{N} ∇w L( y^(i), h(x^(i); w) )
Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset.
In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples:
∇w E_{(x,y)~p̂_data}[ L(y, h(x; w)) ] ≈ (1/N_b) Σ_{i=1}^{N_b} ∇w L( y^(i), h(x^(i); w) )
Approximate Derivatives
Why do approximate derivatives work (intuition)?
The standard error of the mean estimated with n samples is σ/√n, so the reduction in standard error diminishes as n increases.
Redundancy in the training set: some data points will result in (nearly) the same gradient values.
We just need an approximate direction to step in. This will work as long as the gradients are not totally random.
Gradient descent methods:
Deterministic: uses all the training data when calculating the gradient for each step.
Stochastic (mini-batch): uses a randomly sampled small number of examples from the training data when calculating the gradient for each step.
Determining Batch-Size
Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
Multi-core architectures are usually underutilized by extremely small batches.
The amount of memory required scales with the batch size.
When using GPUs, it is common for powers-of-2 batch sizes to offer better run time.
Small batches can offer a regularizing effect.
See: The general inefficiency of batch training for gradient descent learning (http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf)
We also wish for two subsequent gradient estimates to be independent of each other, so two subsequent mini-batches of examples should also be independent of each other.
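One common way to obtain shuffled, (approximately) independent mini-batches in TensorFlow is a tf.data pipeline, sketched below; X_train and y_train are assumed NumPy arrays, and the batch size of 64 is only an example.

import tensorflow as tf

# A minimal sketch of mini-batch sampling with tf.data. Reshuffling on every
# iteration keeps successive mini-batches (approximately) independent.
dataset = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
           .shuffle(buffer_size=10_000, reshuffle_each_iteration=True)
           .batch(64))   # a power-of-2 batch size often runs well on GPUs

for x_batch, y_batch in dataset:   # one full pass over `dataset` is one epoch
    pass                           # compute the gradient on this batch only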
Stochastic (Mini-Batch) Gradient Descent
Note that w = [w0,w1, · · · ,wm]
Algorithm 2: Stochastic Gradient Descent
Input: Training data D, Learning rate schedule
Result: Final Weights w
Initialize weights randomly;
t ← 1 ;
while not converged do
Sample an iid batch from training data, Db ⊂ D ;
Compute Gradients using Db, ∇wL (w) ;
Update weights, w← w− αt ∇wL (w) ;
t ← t + 1;
Check for convergence ;
end
αt is the learning rate at iteration t.
Implementation – TensorFlow
An Epoch is one pass
through all the training data.
The data should be shuffled
after each epoch to make
sure the gradients are
independent. This should be
handled in the data
generator.
More in labs.
import tensorflow as tf

model = MNISTModel()  # defined outside
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
train_loss = tf.keras.metrics.Mean(name='train_loss')

EPOCHS = 5
for epoch in range(EPOCHS):
    # iterate over all mini-batches in the dataset
    for image_batch, label_batch in mnist_train:
        with tf.GradientTape() as tape:
            predictions = model(image_batch)
            loss = loss_object(label_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        train_loss(loss)
    print('Epoch', epoch, 'Train loss:', train_loss.result())
    train_loss.reset_states()
Do we need anything else, or are we done?
Aside: Learning Curves
Learning curves are a powerful tool for understanding the behaviour of NN training.
More on plotting learning curves in the labs.
Image: Stanford cs231n
Revision Questions
1 What are the advantages of using a small batch size?
2 Why do we need to shuffle the data between epochs?
3 Why is there a negative sign in the weight update of the SGD?
Local Minima
With non-convex functions, such as
neural nets, it is possible to have
many local minima.
Local minima can be problematic if they have high
cost in comparison to the global minimum.
“For large neural networks, most local minima have a low cost value, and it is not important to find a true global minimum rather than weights with a sufficiently low cost.”
To rule out local minima: plot the norm of the
gradient over time.
If the norm of the gradient does not shrink, the
problem is not any kind of critical point.
Positively establishing that local minima are the
problem can be very difficult in DNN.
Saddle Points
A saddle point has a local minimum along one cross-section of the cost function and a local maximum along another cross-section.
In higher-dimensional spaces, local minima are rare, and saddle points are
more common.
For first-order optimization, the gradient can often become very small near a saddle point or flat region.
Second-order methods (Newton’s method) can terminate at saddle points.
Flat Regions, Cliffs and Exploding Gradients
Image: Goodfellow 2016
Neural networks with many layers often have extremely steep regions resembling cliffs.
On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
Serious consequences can be avoided using gradient clipping, which prevents any gradient from having a norm greater than a threshold.
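As an illustration, one way to apply gradient clipping in TensorFlow is the optimizer-level clipnorm argument (a sketch, not necessarily the exact setup used in the labs):

import tensorflow as tf

# Any gradient whose L2 norm exceeds 1.0 is rescaled before the update.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# The same effect by hand inside a custom training loop:
# clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]
# optimizer.apply_gradients(zip(clipped, model.trainable_variables))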
Long-Term Dependencies
The vanishing and exploding gradient problems.
These arise when:
The computational graph becomes extremely deep.
The same operation is applied repeatedly at each time step of a long temporal sequence (in RNNs, discussed later).
Vanishing gradients make it difficult to know which direction the parameters should move in to improve the cost function, while exploding gradients can make learning unstable.
How can we solve these problems?
Stochastic Gradient Descent – SGD
Image: Stanford cs231n
We already discussed SGD.
The main parameter is the learning rate. This can be a fixed value (α) or can vary over time. A simple linear learning-rate schedule is:
α_t = (1 − t/τ) α_0 + (t/τ) α_τ
where α_0 is the initial learning rate, α_τ the final learning rate, and τ the number of iterations over which the rate is decayed.
The learning rate may be chosen by trial and error. Best
to choose it by monitoring learning curves.
If it is too large, the learning curve will show violent oscillations, and the cost often increases significantly.
If the learning rate is too low, learning proceeds
slowly.
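The linear schedule above can be written directly as a small function, and Keras ships a roughly equivalent built-in schedule (PolynomialDecay with power=1); the numbers below are placeholders.

import tensorflow as tf

def linear_lr(t, tau, alpha_0, alpha_tau):
    # Linear schedule from the slide: interpolate from alpha_0 to alpha_tau
    # over tau iterations, then keep the final rate.
    k = min(t / tau, 1.0)
    return (1.0 - k) * alpha_0 + k * alpha_tau

# Roughly the same behaviour with a built-in Keras schedule (power=1 is linear).
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1, decay_steps=10_000, end_learning_rate=0.001, power=1.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)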
Momentum
Normal gradient descent update:
w[t] ← w[t−1] − α_t ∇wL(w)
Pathological curvature: regions of the cost function that are not scaled properly.
The iterates either jump between the sides of a valley, or approach the optimum in small steps.
steps.
Momentum
Momentum gives gradient descent a short-term memory.
z[t] ← βz[t−1] − (1− β)∇wL (w)
w[t] ← w[t−1] + αz[t]
Hyper-parameter β controls how quickly the contributions of previous gradients exponentially decay. A
typical value is β = 0.9.
Why Momentum Really Works (https://distill.pub/2017/momentum/)
On the importance of initialization and momentum in deep learning (https://www.cs.toronto.edu/~fritz/absps/momentum.pdf)
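A direct transcription of this update rule as code (a sketch; a hypothetical grad_fn(w) returning ∇wL(w) is assumed to exist):

# Momentum update as written on this slide (a sketch).
def momentum_step(w, z, grad_fn, alpha=0.01, beta=0.9):
    z = beta * z - (1.0 - beta) * grad_fn(w)   # exponentially decaying velocity
    w = w + alpha * z
    return w, z

# Keras' tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9) implements a
# closely related update (without the (1 - beta) scaling of the gradient).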
Nesterov Momentum
Variant: Nesterov momentum. The gradient is evaluated after the current velocity is applied.
z[t] ← βz[t−1] − (1 − β) ∇wL( w + βz[t−1] )
w[t] ← w[t−1] + α_t z[t]
Hyper-parameter β controls how quickly the contributions of previous gradients exponentially decay.
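The corresponding sketch, with the gradient evaluated at the look-ahead point (again assuming the hypothetical grad_fn); in Keras this variant is available as a flag on the SGD optimizer.

# Nesterov momentum as on this slide (a sketch): the gradient is taken at the
# "look-ahead" point w + beta*z rather than at w.
def nesterov_step(w, z, grad_fn, alpha=0.01, beta=0.9):
    z = beta * z - (1.0 - beta) * grad_fn(w + beta * z)
    w = w + alpha * z
    return w, z

# Built-in equivalent (up to scaling of the velocity term):
# tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)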
Revision Questions
How can the problem of saddle points be minimized with first-order methods?
What happens if β = 0 in momentum gradient descent?
Why is it a good idea to reduce the learning rate with iterations in SGD?
Adaptive Learning Rate
Image: Stanford cs231n
Learning rate is a key parameter in gradient descent.
If the learning rate is too small, updates are small
and optimization is slow.
If the learning rate is too large, updates will be large
and the optimization is likely to diverge.
Start with a large learning rate and decay it during
training.
Finding the “best decay schedule” is not trivial.
Adaptive learning-rate algorithms such as Adam and
RMSprop can adjust the learning rate during the
optimization process.
RMSprop
Normal gradient descent update: w[t] ← w[t−1] − α_t ∇wL(w)
This can oscillate: slow progress in one direction while overshooting in another. RMSprop uses the root mean square of the gradient in each direction to scale the update.
r[t] = ρ r[t−1] + (1 − ρ) (∇wL(w))²
w[t] ← w[t−1] − α ∇wL(w) / ( √r[t] + ε )
ε is a small value to prevent division by zero (ε = 10⁻⁸). r[t] is a vector; the division and square are applied element-wise.
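A sketch of this update in code (using the same hypothetical grad_fn as before), together with the built-in Keras optimizer:

import numpy as np

# RMSprop update as on this slide (a sketch). r is the running average of the
# squared gradients and rescales the step per parameter, element-wise.
def rmsprop_step(w, r, grad_fn, alpha=0.001, rho=0.9, eps=1e-8):
    g = grad_fn(w)
    r = rho * r + (1.0 - rho) * g ** 2
    w = w - alpha * g / (np.sqrt(r) + eps)
    return w, r

# Built-in equivalent: tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)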
Adam
Combines RMSprop and momentum. ADAM: A Method For Stochastic Optimization (https://arxiv.org/pdf/1412.6980.pdf)
r[t] = ρ r[t−1] + (1 − ρ) (∇wL(w))²   (RMSprop)
z[t] ← βz[t−1] + (1 − β) ∇wL(w)   (momentum)
r[t] ← r[t]/(1 − ρ^t) ;  z[t] ← z[t]/(1 − β^t)   (bias correction)
w[t] ← w[t−1] − α z[t] / ( √r[t] + ε )
ε is a small value to prevent division by zero (ε = 10⁻⁸). r[t] is a vector; the division and square are applied element-wise.
Works with a wide range of problems. The hyper-parameters [β, ρ] are named [β1, β2] in TensorFlow and are usually set to β1 = 0.9, β2 = 0.999.
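In practice one rarely codes Adam by hand; a sketch of the Keras optimizer with the hyper-parameter names mentioned above (the values are the common defaults):

import tensorflow as tf

# Adam in Keras: beta_1 corresponds to the momentum decay (beta above) and
# beta_2 to the squared-gradient decay (rho above).
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Drop-in replacement for SGD in the earlier training-loop example:
# optimizer.apply_gradients(zip(gradients, model.trainable_variables))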
Choosing the Right Optimization Algorithm
Unfortunately, there is currently no consensus on which algorithm to use.
SGD
Gradient descent can use parallelization efficiently.
SGD usually converges faster than GD on large datasets.
Of the optimizers profiled here, SGD uses the least memory for a given batch size.
Momentum
Momentum usually speeds up the learning.
Momentum uses more memory for a given batch size than
stochastic gradient descent but less than RMSprop and Adam.
RMSprop
RMSprop’s adaptive learning rate usually prevents the learning
rate decay from diminishing too slowly or too fast.
RMSprop maintains per-parameter learning rates.
Adam
The hyperparameters of Adam are usually set to predefined values.
Adam performs a form of learning rate annealing with adaptive step-sizes.
Adam is often the default optimizer in machine learning.
If your input data is sparse, you will likely achieve the best results using one of the adaptive learning-rate methods.
An overview of gradient descent optimization algorithms (https://ruder.io/optimizing-gradient-descent/)
Beyond Gradient Descent (https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html)
Parameter Initialization
Deep networks have many parameters:
h(x) = h^(3)( h^(2)( h^(1)(x; W^(1)); W^(2) ); W^(3) )
Gradient descent learning is iterative and thus requires the user to specify some initial point:
w[t+1] ← w[t] − α_t ∇wL(w)
The starting point of gradient descent can be critical to the model’s ultimate performance.
Parameter Initialization
The initial point can determine whether the algorithm converges:
Some initial points can be unstable and the algorithm may fail altogether.
The initial point can determine how quickly learning converges.
Local minima can have wildly varying generalization error, and the initial point can affect which minimum is selected.
The most important consideration when initializing the weights is that the initial
parameters need to “break symmetry” between different units.
Initializing: Break Symmetry
Initial parameters need to “break symmetry”
between different units.
If two hidden units with the same activation function
are connected to the same inputs, then these units
must have different initial parameters.
If they have the same initial parameters, then a
learning algorithm will constantly update both of
these units in the same way.
This motivates random initialization of the parameters.
[Diagram: two inputs x1, x2 connected to identical hidden units h^(1)(·) feeding the output y1.]
Random Initialization
[Diagram: a deep network with inputs x1, x2, output ŷ, and weights w^(1), w^(2), …, w^(L).]
We can initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution.
The scale of the initial distribution has a large effect on the outcome of the optimization procedure.
Too small: vanishing gradients, learning too slow.
Too large: exploding gradients, unstable.
Assuming all activations are linear:
ŷ = w^(L) w^(L−1) · · · w^(2) w^(1) x
Initialization: How to Select the Scale
Rule of thumb to prevent exploding/vanishing gradients:
The mean of the activations should be zero:
E[h^(l−1)] = E[h^(l)] = 0
The variance of the activations should stay the same across every layer:
Var[h^(l−1)] = Var[h^(l)]
Xavier initialization
All the weights of layer l are picked randomly from a normal distribution with mean µ = 0 and variance σ² = 1/n^[l−1], where n^[l−1] is the number of neurons in layer l − 1:
W^(l) ∼ N(0, 1/n^[l−1])
Biases are initialized with zeros:
b^(l) = 0
The mathematics of why this works is explained in the Deeplearning.ai notes (https://www.deeplearning.ai/ai-notes/initialization/).
A variant of Xavier that works well with ReLU is He initialization (https://arxiv.org/pdf/1502.01852.pdf).
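In Keras the initialization scheme can be set per layer; a sketch is below (note that Keras’ GlorotNormal uses the closely related scale 2/(n^[l−1] + n^[l]) rather than exactly 1/n^[l−1], and the layer sizes are placeholders).

import tensorflow as tf

# Xavier (Glorot) initialization for tanh-like activations, He initialization
# for ReLU layers; biases start at zero in both cases (a sketch).
layer_tanh = tf.keras.layers.Dense(
    64, activation='tanh',
    kernel_initializer=tf.keras.initializers.GlorotNormal(),
    bias_initializer='zeros')

layer_relu = tf.keras.layers.Dense(
    64, activation='relu',
    kernel_initializer=tf.keras.initializers.HeNormal(),
    bias_initializer='zeros')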
Input Normalization
Generally we know that normalizing the input features helps to speed up learning.
µ_j = (1/N) Σ_i x_j^(i) ;   σ_j² = (1/N) Σ_i ( x_j^(i) − µ_j )²
x̃_j^(i) = ( x_j^(i) − µ_j ) / σ_j
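A sketch of this standardization in code, computing the statistics from the training set only and reusing them for any held-out data (X_train and X_test are assumed NumPy arrays of shape [N, d]):

import numpy as np

# Standardize each feature with training-set statistics (a sketch).
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8        # small epsilon avoids division by zero

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma       # reuse the training statistics at test time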
Batch Normalization
How about hidden layer activations?
Can we normalize the input to each layer?
z^(i) = W^(l) x^(i) + b^(l)   (affine transform)
h^(i) = g( z^(i) )   (activation)
Typically we normalize z^(i).
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (https://arxiv.org/abs/1502.03167)
Batch Normalization
µ_j = (1/m) Σ_i z_j^(i) ;   σ_j² = (1/m) Σ_i ( z_j^(i) − µ_j )²
ẑ_j^(i) = ( z_j^(i) − µ_j ) / √( σ_j² + ε )   (mean zero, variance one)
z̃_j^(i) = γ ẑ_j^(i) + β   (γ, β learnable parameters)
Batch Normalization allows us to normalize the feature values across a mini-batch.
The mean and variance are computed over the mini-batch.
The bias in the affine transform, z^(i) = W^(l) x^(i) + b^(l), is not useful when using batch-norm: the learnable shift β makes b^(l) redundant.
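A sketch of batch normalization in a Keras model, placed between the (bias-free) affine transform and the activation; the layer sizes are placeholders.

import tensorflow as tf

# Batch norm applied to z = Wx before the activation; use_bias=False because
# the layer's learnable beta takes over the role of the bias (a sketch).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, use_bias=False),   # affine transform z = Wx
    tf.keras.layers.BatchNormalization(),         # normalize z, then scale/shift by gamma, beta
    tf.keras.layers.Activation('relu'),           # activation h = g(z~)
    tf.keras.layers.Dense(10, activation='softmax'),
])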
Batch Normalization: Inference
At test time you do not have a mini-batch to compute the mean and variance from.
Keep a weighted (moving) average over mini-batches while training:
µ̂_j = ρ µ̂_j + (1 − ρ) µ_j
σ̂_j² = ρ σ̂_j² + (1 − ρ) σ_j²
Apply the moving averages while testing:
ẑ_j^(i) = ( z_j^(i) − µ̂_j ) / √( σ̂_j² + ε )
z̃_j^(i) = γ ẑ_j^(i) + β
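In Keras this switch is handled by the training argument: batch statistics are used (and the moving averages updated) during training, while the stored moving averages are applied at inference. A sketch, reusing model and image_batch from the earlier training-loop example:

# Training step: use batch statistics and update the moving averages.
predictions = model(image_batch, training=True)

# Inference: apply the stored moving mean and variance instead.
predictions = model(image_batch, training=False)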
Revision Questions
1 What happens if the weights in a neural network are all initialized to 0?
2 Why should we not have a bias in neural network layers when batch-norm
is used?
3 Should batch-norm be used before or after the activation?
Summary
1 Neural networks are usually trained using iterative, gradient-based methods.
2 There are many gradient-based optimization techniques. Methods with adaptive learning rates seem to be robust across many problems.
3 Initialization and batch normalization help speed up learning.
Lab: Continue to learn how NN can be implemented in TensorFlow.
Next week:
1 Convolutions