Deep Learning – COSC2779 – Deep Feed Forward Networks
Deep Learning – COSC2779
Deep Feed Forward Networks
Dr. Ruwan Tennakoon
July 26, 2021
Reference: Chapter 6: Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 1 / 46
Outline
Part 1: Deep Feed Forward Networks
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Part 2: Deep Learning Software & Hardware
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 2 / 46
Machine Learning
The Task can be expressed as an unknown
target function:
y = f(x)
ML finds a Hypothesis (model), h(·), from a
hypothesis space H, which approximates
the unknown target function:
ŷ = h*(x) ≈ f(x)
The Experience is typically a data set, D,
of values
D = {(x^(i), f(x^(i)))}, i = 1, …, N
∗Assume supervised learning for now
The Performance is typically a numerical
measure that determines how well the
hypothesis matches the experience.
How is the Hypothesis (model), h (·) represented?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 3 / 46
Example: Linear Regression
[Figure: the unknown target f(x) and the model h(x), each mapping distance from city (x1) and floor area (x2) to house price (y)]
Hypothesis (Model):
ŷ^(i) = h(x^(i)) = w0 + w1·x1^(i) + w2·x2^(i)
Hypothesis space H: all possible combinations of (w0, w1, w2)
What are the other methods we can
use to represent h (x)?
Tree (regression/classification)
Rules
Neural networks
· · ·
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 4 / 46
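To make the hypothesis concrete, here is a minimal sketch of evaluating h(x) for one house; the weights and the example inputs are made-up values for illustration, not from the lecture.

```python
# Hypothetical weights: w0 is the bias/intercept
w0, w1, w2 = 800_000.0, -15_000.0, 3_000.0

def h(x1, x2):
    """Linear-regression hypothesis: predicted house price."""
    return w0 + w1 * x1 + w2 * x2

# Example house: 10 km from the city, 120 m^2 floor area
print(h(10.0, 120.0))  # 800000 - 150000 + 360000 = 1010000.0
```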
Objectives for this Lecture
Explore the elements used in representing the hypothesis space of a
feed-forward neural network.
Understand the applicable techniques so that we can identify the best
hypothesis space for a problem in a better way than randomly searching
all possible combinations (which is not feasible).
Gain the ability to justify your model to others.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 5 / 46
Outline
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 6 / 46
Perceptron
Neural Networks are inspired by the structure of
the human brain. The basic building block of
the brain is a neuron.
A Neuron is formed of:
A series of incoming synapses
An activation cell
A single outgoing synapse that connects to
other Neurons.
A Neuron is modelled as a Perceptron
(Rosenblatt 1962)
Image: https://appliedgo.net/perceptron/
Reading: The Nature of Code, Chapter 10
(https://natureofcode.com/book/chapter-10-neural-networks/)
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 7 / 46
Perceptron
Neural Networks are inspired by the structure
of the human brain. The basic building block of
the brain is a neuron.
A Neuron is formed of:
A series of incoming synapses
An activation cell
A single outgoing synapse that connects to
other Neurons.
A Neuron is modelled as a Perceptron
(Rosenblatt 1962)
[Diagram: perceptron with inputs x1, x2 plus a constant 1, weights w1, w2 and bias b feeding a weighted sum Σ, followed by the activation σ to produce ŷ]
ŷ = σ(wᵀx + b),   where   σ(wᵀx + b) = 1 / (1 + exp(−(wᵀx + b)))
With sigmoid activation the basic perceptron is similar to logistic regression.
How do we find the weights w, b?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 8 / 46
Perceptron
[Figure: data points in the (x1, x2) plane separated by the decision boundary w1·x1 + w2·x2 + b = 0; the model output is ŷ = σ(w1·x1 + w2·x2 + b)]
ŷ = σ(wᵀx + b),   where   σ(wᵀx + b) = 1 / (1 + exp(−(wᵀx + b)))
With sigmoid activation the basic perceptron is similar to logistic regression.
How do we find the weights w, b?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 8 / 46
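A minimal NumPy sketch of this forward pass; the weights, bias and input below are made-up values, and the 0.5 threshold is just the usual convention for turning the sigmoid output into a class label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    """Sigmoid perceptron: y_hat = sigmoid(w.T x + b)."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([1.5, -2.0])   # hypothetical weights w1, w2
b = 0.5                     # hypothetical bias
x = np.array([2.0, 1.0])    # one input point (x1, x2)

y_hat = perceptron(x, w, b)      # a value in (0, 1)
print(y_hat, int(y_hat > 0.5))   # probability-like score and hard class label
```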
Outline
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 9 / 46
Aside: Maximum Likelihood Estimation
D = {x^(1), …, x^(N)}: a set of data drawn independently
from the unknown data-generating distribution p_data.
p_model(x; θ): a family of distributions parameterized by θ.
We want to find the θ that best matches the
observations (the data D), i.e. the θ that maximizes p(θ | D):
p(θ | D) = p(D | θ) p(θ) / p(D)        (posterior = likelihood × prior / evidence)
Maximum Likelihood Estimation (MLE) is a
method to fit a distribution to data:
θ̂ = argmax_θ p(D | θ)
For independent data:
θ̂ = argmax_θ ∏_{i=1}^{N} p_model(x^(i); θ)
Taking the logarithm of the likelihood does not
change its argmax but makes the maths convenient:
θ̂ = argmax_θ ∑_{i=1}^{N} log p_model(x^(i); θ)
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 10 / 46
Aside: Maximum Likelihood Estimation
[Figure: observed heights on the x-axis and three candidate Gaussian models p_model with parameters θ_ch = [µ = 0.3 m, σ = 0.1 m], θ_s = [1.0 m, 0.15 m], θ_c = [1.7 m, 0.2 m]]
Likelihood of the observations under each candidate:
p_model(x; θ_ch) ⇒ ≈ 0 × 0 × 0 × 0
p_model(x; θ_s) ⇒ ≈ 0.02 × 0.3 × 0.15 × 0.01
p_model(x; θ_c) ⇒ ≈ 0 × 0 × 0.001 × 0.01
Example only; the values are not accurate.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 10 / 46
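The calculation sketched on this slide can be reproduced directly; the observed heights below are hypothetical, and only the candidate parameters come from the slide.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

heights = np.array([1.55, 1.62, 1.71, 1.80])   # hypothetical observations (metres)

candidates = {                                  # (mu, sigma) values from the slide
    "theta_ch": (0.3, 0.10),
    "theta_s":  (1.0, 0.15),
    "theta_c":  (1.7, 0.20),
}

for name, (mu, sigma) in candidates.items():
    likelihood = np.prod(gaussian_pdf(heights, mu, sigma))  # independent data: product
    print(name, likelihood)   # theta_c gives by far the largest likelihood here
```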
Aside: Maximum Likelihood Estimation
D = {x^(1), …, x^(N)}: a set of data drawn independently
from the unknown data-generating distribution p_data.
p_model(x; θ): a family of distributions parameterized by θ.
We want to find the θ that best matches the
observations (the data D), i.e. the θ that maximizes p(θ | D):
p(θ | D) = p(D | θ) p(θ) / p(D)        (posterior = likelihood × prior / evidence)
In supervised learning we have a conditional model.
The (unconditional) maximum likelihood estimate was
θ̂ = argmax_θ ∑_{i=1}^{N} log p_model(x^(i); θ)
Conditional log-likelihood:
θ̂ = argmax_θ ∑_{i=1}^{N} log p(y^(i) | x^(i); θ)
Take a few minutes to go through the following videos to get a good understanding of MLE:
StatQuest: Maximum Likelihood, clearly explained
StatQuest: Maximum Likelihood For the Normal Distribution, step-by-step!
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 11 / 46
Maximum Likelihood Solution
For the sigmoid model we can write (y is a Bernoulli random variable):
p(y^(i) | x^(i); w) = σ(wᵀx^(i) + b)        if y^(i) = 1
p(y^(i) | x^(i); w) = 1 − σ(wᵀx^(i) + b)    if y^(i) = 0
or, as a single expression,
p(y^(i) | x^(i); w) = ( σ(wᵀx^(i) + b) )^{y^(i)} ( 1 − σ(wᵀx^(i) + b) )^{(1 − y^(i))}
Aside: assume you have a biased coin with p(O = h) = 0.7;
then p(O = t) = 1 − p(O = h) = 0.3.
Assume you observed the sequence h, h, t, h, t, h. What is the likelihood of that happening?
Given that coin tosses are independent: 0.7 × 0.7 × 0.3 × 0.7 × 0.3 × 0.7.
The log-likelihood of one example:
log p(y^(i) | x^(i); w) = y^(i) log( σ(wᵀx^(i) + b) ) + (1 − y^(i)) log( 1 − σ(wᵀx^(i) + b) )
Writing ŷ^(i) = σ(wᵀx^(i) + b), the log-likelihood of the whole data set is
log p(Y | X; w) = ∑_{i=1}^{N} [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 12 / 46
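The coin aside can be checked numerically; a small sketch (using the h, h, t, h, t, h sequence from the slide) prints the likelihood of the sequence under p = 0.7 and shows that the log-likelihood over a grid is maximised near p = 4/6.

```python
import numpy as np

tosses = np.array([1, 1, 0, 1, 0, 1])       # h = 1, t = 0: the sequence from the slide

def log_likelihood(p, y):
    """Bernoulli log-likelihood of the sequence y under head-probability p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(np.exp(log_likelihood(0.7, tosses)))  # 0.7*0.7*0.3*0.7*0.3*0.7 ~= 0.0216

grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax([log_likelihood(p, tosses) for p in grid])]
print(best)                                 # ~0.67, i.e. 4 heads out of 6 tosses
```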
Finding Maximum
Loss (cost) function:
L(w) = − log p(Y | X; w) = − ∑_{i=1}^{N} [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]
w* = argmin_w L(w)
There is no closed-form solution for the maximum likelihood estimate of this model.
However, the loss is convex, so we can use gradient descent (equivalently, gradient ascent on the log-likelihood).
Gradient update:
w^(t+1) = w^(t) − α_t · ∂L(w)/∂w
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 13 / 46
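A minimal gradient-descent sketch for this loss; the toy data, learning rate and iteration count are made-up. For the sigmoid model with cross-entropy loss the gradients work out to ∂L/∂w = Σᵢ (ŷ^(i) − y^(i)) x^(i) and ∂L/∂b = Σᵢ (ŷ^(i) − y^(i)), which is what the code uses (and is also the answer to revision question 4 on the next slide).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy, linearly separable data (hypothetical)
X = np.array([[0.5, 1.0], [1.0, 2.0], [3.0, 2.5], [4.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, alpha = np.zeros(2), 0.0, 0.05        # weights, bias, learning rate

for t in range(5000):
    y_hat = sigmoid(X @ w + b)              # forward pass
    grad_w = X.T @ (y_hat - y)              # dL/dw for the cross-entropy loss
    grad_b = np.sum(y_hat - y)              # dL/db
    w -= alpha * grad_w                     # gradient-descent update
    b -= alpha * grad_b

print(w, b, sigmoid(X @ w + b).round(2))    # predictions approach [0, 0, 1, 1]
```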
Revision Questions
1 What is the purpose of the non-linear function in the Perceptron?
2 Can the Perceptron be used for regression? How?
3 Is the sigmoid Perceptron appropriate for classifying the data shown below?
4 Calculate the partial derivatives for the sigmoid Perceptron, ∂L(w)/∂w.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 14 / 46
Outline
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 15 / 46
Increasing Model Capacity
Given data with attributes x = [x1, x2],
how can we increase the capacity of a
linear regression model?
y = b + ∑_{j=1}^{d} w_j x_j
Apply a (non-linear) polynomial transformation to x: x → φ(x).
Fit a linear model on φ(x).
Are there better choices for φ(·)? How do we choose φ(·)?
Use generic transformations: radial basis functions (e.g., as used in SVMs).
Hand-crafted (engineered) features: SIFT, HoG features in computer vision.
Learn φ(·) from data: combines the good points of the first two approaches;
φ(·) can be highly generic, and the engineering effort can go into the
architecture.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 16 / 46
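A quick sketch of the first two steps above (a fixed polynomial transform followed by a linear least-squares fit); the synthetic data and the chosen degree are made-up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))        # attributes x1, x2 (synthetic)
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=50)

def phi(X):
    """Degree-2 polynomial transform of [x1, x2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

w, *_ = np.linalg.lstsq(phi(X), y, rcond=None)   # fit a linear model on phi(x)
print(w.round(2))                                # approximately [1, 1, -2, 3, 0, 0]
```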
Feed Forward Neural Networks
Learned transformation:
φ(x) := h^(i)(x; w^(i))
The model is composed of many such
transformations organised in a
sequence (a hierarchy):
h(x) = h^(3)( h^(2)( h^(1)(x) ) )
Information flows forward: function evaluation begins at the input,
passes through the intermediate computations,
and produces the output y.
[Diagram: a network with inputs x1, …, x5, two hidden layers h^(1)(·) and h^(2)(·), and output y]
Note: All neurons have biases. For convenience they are not represented in the
diagram.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 17 / 46
Feed Forward Neural Networks
h(x) = h^(3)( h^(2)( h^(1)(x) ) )
No feedback connections (until we
get to Recurrent Networks!)
The function composition can be
described by a directed acyclic graph.
Gives up convexity.
[Diagram: the same network with inputs x1, …, x5, hidden layers h^(1)(·) and h^(2)(·), and output y]
Note: All neurons have biases. For convenience they are not represented in the
diagram.
It is important for each h^(i)(·) to have some non-linearity: stacking purely linear
layers still gives a linear model.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 18 / 46
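Such a chain h^(3)(h^(2)(h^(1)(x))) maps directly onto a stack of Dense layers; a minimal Keras sketch in which the layer sizes and activations are arbitrary illustrative choices, not prescribed by the lecture.

```python
import tensorflow as tf

# Five input features, as in the diagram; two hidden layers, then one output unit
model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(4, activation="relu"),     # h(1): affine transform + non-linearity
    tf.keras.layers.Dense(3, activation="relu"),     # h(2)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output unit (binary classification here)
])

model.summary()   # each Dense layer also carries a bias vector, as noted above
```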
XOR Example
Image: Deep learning, Goodfellow.
[Diagram: a network with inputs x1, x2, one hidden layer h^(1)(·) of two units, and output y]
Model:
y = (w^(2))ᵀ max{0, (w^(1))ᵀ x}
where the biases are absorbed into the weights: the input and the hidden
activations are augmented with a constant 1, so the first column of w^(1) and the
first entry of w^(2) act as the biases.
Weights:
w^(1) = [ 0  1  1 ;  −1  1  1 ],    w^(2) = [0, 1, −2]ᵀ
Assume the weights are given.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 19 / 46
XOR Example
Image: Deep learning, Goodfellow.
Model and weights as on the previous slide:
y = (w^(2))ᵀ max{0, (w^(1))ᵀ x},    w^(1) = [ 0  1  1 ;  −1  1  1 ],    w^(2) = [0, 1, −2]ᵀ
[Diagram: the same network, with inputs x1, x2, hidden layer h^(1)(·), and output y]
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 20 / 46
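With the reading above (first column/entry of the weight matrices acting as biases, ReLU hidden units), the given weights can be checked directly; a small NumPy sketch:

```python
import numpy as np

W1 = np.array([[1.0, 1.0],        # hidden weights: last two columns of w(1)
               [1.0, 1.0]])
c  = np.array([0.0, -1.0])        # hidden biases: first column of w(1)
w2 = np.array([1.0, -2.0])        # output weights: last two entries of w(2)
b2 = 0.0                          # output bias: first entry of w(2)

def xor_net(x):
    h = np.maximum(0.0, x @ W1 + c)   # ReLU hidden layer
    return h @ w2 + b2                # linear output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0 — i.e. XOR
```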
Increase Model Capacity
Image: Goodfellow, 2016.
Simpler functions are more likely to generalize, but a sufficiently complex
hypothesis is needed to achieve low training error.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 21 / 46
Multi-class Classification and Regression
Multi-Class Classification: add output units equal to the number of classes.
Regression: output units with linear activation.
[Diagram: a network with inputs x1, …, x5, hidden layers h^(1)(·) and h^(2)(·), and three output units y1, y2, y3]
Note: All neurons have biases. For convenience they are not represented in the
diagram.
Does the cross entropy loss function work?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 22 / 46
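In Keras terms the two cases differ only in the output layer; a sketch in which the hidden-layer sizes and the optimizer are arbitrary illustrative choices.

```python
import tensorflow as tf

# Multi-class classification: one output unit per class, softmax activation
clf = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),   # 3 classes -> 3 output units
])
clf.compile(optimizer="sgd", loss="categorical_crossentropy")

# Regression: a single output unit with linear activation
reg = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])
reg.compile(optimizer="sgd", loss="mse")
```

The pairing of softmax outputs with the cross-entropy loss is covered on the “Multi-Class Classification” slide later in this part.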
Outline
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 23 / 46
Hidden Units
The hidden layers of a feed-forward NN consist of an affine transformation
followed by an activation:
z^(i) = W^(l) x^(i) + b^(l)        (affine transform)
h^(i) = g(z^(i))                   (activation)
The activation is applied (usually) element-wise.
What functions can be used for activation?
Design of Hidden units is an active area of research.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 24 / 46
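In code, one hidden layer is exactly these two steps; a NumPy sketch with arbitrary shapes and random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # layer weights: 3 inputs -> 4 hidden units
b = np.zeros(4)                    # layer biases
x = rng.normal(size=3)             # one input example

z = W @ x + b                      # affine transform
h = np.maximum(0.0, z)             # activation g(.) applied element-wise (ReLU here)
print(z.round(2), h.round(2))
```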
Sigmoid Function
g(z^(i)) = 1 / (1 + exp(−z^(i)))
Squashing-type non-linearity: pushes outputs into the
range [0, 1].
Saturates across most of its domain; strongly
sensitive only when z is close to zero.
Saturation makes gradient-based learning difficult.
The tanh function is similar to the sigmoid but pushes
outputs into the range [−1, 1].
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 25 / 46
Rectified Linear Unit
Image: Deep learning, Goodfellow.
g(z^(i)) = max(0, z^(i))
Gives large and consistent gradients (does not
saturate) when active.
Efficient to optimize; converges much faster than
the sigmoid.
Not everywhere differentiable: in practice not a
problem, return a one-sided derivative at z = 0
(stochastic-gradient-based optimization is subject to
numerical error anyway).
Units that are inactive will never update.
Good practice: initialize all elements of b to a small
positive value, such as 0.1.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 26 / 46
Generalized Rectified Linear Units
Image: Xu, B.“Empirical evaluation of rectified activations in convolutional network”.
g(z^(i)) = max(0, z^(i)) + a_i · min(0, z^(i))
Gives a non-zero slope when z^(i) < 0.
Leaky ReLU (Maas et al., 2013): fix a_i to a small value (e.g., 0.001).
Randomized ReLU (Xu et al., 2015): sample a_i from a fixed range during training, fix it
during testing.
Parametric ReLU (He et al., 2015): learn a_i.
Lecture 2 (Part 1) Deep Learning - COSC2779 July 26, 2021 27 / 46
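The activations discussed on the last few slides (and the ELU on the next one) are one-liners; a NumPy sketch, treating a_i and α as fixed hyper-parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.001):            # fixed small negative slope (Maas et al., 2013)
    return np.maximum(0.0, z) + a * np.minimum(0.0, z)

def elu(z, alpha=1.0):                 # exponential negative part (next slide)
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(relu(z), leaky_relu(z), elu(z), sigmoid(z), np.tanh(z), sep="\n")
```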
Exponential Linear Units (ELUs)
Image: G. Zhang, “Effectiveness of Scaled Exponentially-Regularized Linear Units (SERLUs)”.
g(z^(i)) = z^(i)                  if z^(i) > 0
g(z^(i)) = α (exp(z^(i)) − 1)     if z^(i) ≤ 0
Gives a non-zero slope when z^(i) < 0; calculating the exponential is comparatively expensive.
Paper: Fast and Accurate Deep Network Learning by Exponential Linear Units (https://arxiv.org/pdf/1511.07289.pdf)
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 28 / 46
MaxOut Units
g(z^(i))_j = max_{k ∈ G^(j)} z^(i)_k
Fundamentally different to the other units we have discussed so far: not element-wise.
Generalizes rectified linear units further: maxout units divide z into groups of k values,
and each maxout unit outputs the maximum element of one of these groups.
A maxout unit can learn a piece-wise linear, convex function with up to k pieces.
With a large enough k, a maxout unit can learn to approximate any convex function with
arbitrary fidelity (e.g., ReLU).
Goodfellow I., et al. “Maxout networks”. In International Conference on Machine Learning, 2013. (https://arxiv.org/pdf/1302.4389.pdf)
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 29 / 46
Revision Questions
1 Why don’t we just use the identity transform as an activation unit?
2 What is an issue with sigmoid activation?
3 What is an issue with ReLU activation?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 30 / 46
Outline
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 31 / 46
Loss functions
Similar to the Perceptron: define p_model and use the principle of maximum likelihood.
L(w) = − (1/N) ∑_{i=1}^{N} log p(y^(i) | x^(i); w)
If p_model = N(y; h(x; w), I) then (up to scaling and additive constants):
L(w) = (1/N) ∑_{i=1}^{N} ‖y^(i) − h(x^(i); w)‖²
The choice of output units is very important for the choice of cost function.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 32 / 46
Binary Classification
Task: predict a binary variable y ∈ {0, 1}.
Use a sigmoid output unit. If the output of the penultimate layer is g^(i):
ŷ^(i) = p(y^(i) = 1 | x^(i)) = σ(Wᵀ g^(i) + b)
y is a Bernoulli random variable:
p(y^(i) | x^(i); w) = (ŷ^(i))^{y^(i)} (1 − ŷ^(i))^{(1 − y^(i))}
L(w) = − (1/N) ∑_{i=1}^{N} [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]
∗This is not the only option.
Saturation thus occurs only when the
model already has the right answer.
Other loss functions, such as mean
squared error, can saturate anytime
σ(z) saturates.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 33 / 46
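In practice the sigmoid and this loss are usually fused for numerical stability, so that log σ(z) is never computed from an underflowed probability; a TensorFlow sketch with made-up labels and logits, using the standard `from_logits` option.

```python
import tensorflow as tf

y_true = tf.constant([[1.0], [0.0], [1.0]])
logits = tf.constant([[2.0], [-1.0], [0.5]])   # z = W^T g + b, before the sigmoid

# Pass raw logits and let the loss apply the sigmoid internally (numerically stable)
bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(bce_logits(y_true, logits).numpy())

# Equivalent but less stable: apply the sigmoid first, then the cross-entropy
probs = tf.sigmoid(logits)
bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)
print(bce_probs(y_true, probs).numpy())
```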
Other Loss Functions
Hinge loss: max(0, 1 − h(x^(i); w) · y^(i))^p — SVM. When used for the standard SVM, the
loss function denotes the size of the margin between the linear separator and its closest
points in either class.
Log loss: log(1 + exp(−h(x^(i); w) · y^(i))) — logistic regression. One of the most popular
loss functions in machine learning, since its outputs are well-calibrated probabilities.
Exponential loss: exp(−h(x^(i); w) · y^(i)) — AdaBoost. This function is very aggressive:
the loss of a misprediction increases exponentially with the value of −h(x^(i); w) · y^(i).
Zero-one loss: δ( sign(h(x^(i); w)) ≠ y^(i) ) — the actual classification loss. Non-continuous
and thus impractical to optimize.
The target is represented as y ∈ {−1, +1}.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 34 / 46
Multi-Class Classification
Task: predict a categorical variable, encoded one-hot as y ∈ {0, 1}^c.
Use a linear layer followed by a SoftMax (let z be the output of the final
linear layer):
ŷ^(i)_j = SoftMax(z)_j = exp(z_j) / ∑_k exp(z_k)
The SoftMax outputs across all units add up to one. As y is a multinomial
random variable, we can write:
p(y^(i) | x^(i); w) = ∏_{j=1}^{c} ( SoftMax(z^(i))_j )^{y^(i)_j}
L(w) = − ∑_{i=1}^{N} ∑_{j=1}^{c} y^(i)_j log(ŷ^(i)_j)
Example:
ŷ^(i) = [0.1, 0.2, 0.6, 0.1],    y^(i) = [0.0, 0.0, 1.0, 0.0]
L(w)^(i) = 0 × 2.3 + 0 × 1.6 + 1 × 0.5 + 0 × 2.3
Maximizing the log-likelihood pushes the correct z^(i)_j up, while the softmax pushes all the other z
down (competition).
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 35 / 46
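The worked example on this slide can be reproduced directly; the logits z below are made-up values chosen so that the softmax comes out at roughly [0.1, 0.2, 0.6, 0.1].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))            # shift by max(z) for numerical stability
    return e / np.sum(e)

z = np.array([0.0, 0.69, 1.79, 0.0])     # hypothetical logits
y = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot target

y_hat = softmax(z)
loss = -np.sum(y * np.log(y_hat))        # cross-entropy for this single example
print(y_hat.round(2))                    # ~[0.1, 0.2, 0.6, 0.1]
print(loss.round(2))                     # ~0.51, i.e. -log(0.6)
```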
Regression
Task: predict a real-valued variable y ∈ R.
Use a linear output activation and p_model = N(y; h(x; w), I); then:
N(y^(i); h(x^(i); w), I) = (1 / √(2πσ²)) exp{ −‖y^(i) − h(x^(i); w)‖² / (2σ²) }
L(w) = (1/N) ∑_{i=1}^{N} ‖y^(i) − h(x^(i); w)‖²    (up to scaling and additive constants)
Because linear units do not saturate, they pose little difficulty for gradient-based optimization
algorithms and may be used with a wide variety of optimization algorithms.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 36 / 46
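The link between the Gaussian model and the squared-error loss can be checked numerically; a sketch with made-up targets and predictions and σ = 1.

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])        # targets (hypothetical)
y_hat = np.array([1.1, 1.8, 3.3])        # model outputs h(x; w) (hypothetical)
sigma = 1.0

# Average negative log-likelihood of a Gaussian with mean h(x; w)
nll = np.mean(0.5 * (y - y_hat) ** 2 / sigma ** 2 + 0.5 * np.log(2 * np.pi * sigma ** 2))

mse = np.mean((y - y_hat) ** 2)          # the usual squared-error loss

# The two differ only by a constant and a factor of 1/2, so they share the same minimiser
print(nll, 0.5 * mse + 0.5 * np.log(2 * np.pi))
```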
Other Loss Functions
Squared loss: (h(x^(i)) − y^(i))². ADVANTAGE: differentiable everywhere. DISADVANTAGE: somewhat sensitive to outliers/noise. Also known as Ordinary Least Squares (OLS).
Absolute loss: |h(x^(i)) − y^(i)|. ADVANTAGE: less sensitive to noise. DISADVANTAGE: not differentiable at 0.
Huber loss: ½(h(x^(i)) − y^(i))² if |h(x^(i)) − y^(i)| < δ, otherwise δ(|h(x^(i)) − y^(i)| − δ/2). ADVANTAGE: “best of both worlds” of the squared and absolute losses; once differentiable. Behaves like the squared loss when the error is small and like the absolute loss when the error is large.
Log-cosh loss: log(cosh(h(x^(i)) − y^(i))). ADVANTAGE: similar to the Huber loss, but twice differentiable everywhere.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 37 / 46
Revision Questions
1 What happens if sigmoid output units (one for each class) are used when the task is multi-class classification?
2 Can you use a softmax activation for binary classification?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 38 / 46
Outline
1 Perceptron
2 Maximum Likelihood Estimation
3 Feed Forward Neural Networks
4 Hidden Units
5 Loss function & output units
6 Universal Approximation Properties and Depth
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 39 / 46
Model Architecture
The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
Commonly, neural networks are organized into groups of units called layers, and most neural network architectures arrange these layers in a chain structure:
h(x; W) = h^(3)( h^(2)( h^(1)(x; W^(1)); W^(2) ); W^(3) )
In these chain-based architectures, the main architectural considerations are choosing the depth of the network and the width of each layer.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 40 / 46
Universal Approximation Theorem
Universal Approximation Theorem (Hornik et al., 1989; Cybenko, 1989): “A feed-forward network with a linear output layer and at least one hidden layer with any activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.”
In simple terms: you can always come up with a neural network that approximates any complex relation between input and output, provided it has at least one hidden layer with non-linear activation and “enough” neurons.
Seems like this is a silver bullet. Are we done?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 41 / 46
Universal Approximation Theorem
The Universal Approximation Theorem says that a large MLP will be able to represent any complex function (under some assumptions). It does not guarantee that we will be able to learn this function.
Learning can fail:
The optimization procedure may not find appropriate weights (e.g., it may find a local minimum).
It might choose the wrong weights due to over-fitting.
[Figure: the hypothesis space H of an MLP compared with the space of “smooth” functions]
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 42 / 46
Network Depth
A one-hidden-layer NN might, in theory, need an infeasibly large number of neurons and may fail to learn and generalize correctly.
In practice, using deeper models can reduce the number of neurons required and can reduce the amount of generalization error.
Intuition: when we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve the composition of several simpler functions.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 43 / 46
Network Depth
Image: Deep learning, Goodfellow. [Figure: test accuracy as a function of network depth]
Empirically, greater depth does seem to result in better generalization.
Many further practical tricks are needed, e.g. residual blocks, convolutions, etc. We will cover these later.
Is this because greater depth results in a larger number of parameters?
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 44 / 46
Network Depth
Image: Deep learning, Goodfellow. [Figure: test accuracy versus number of parameters for models of different depths]
Deeper models perform better not merely because the model is larger. The results from Goodfellow et al. (https://arxiv.org/pdf/1312.6082.pdf) show that increasing the number of parameters in the layers of convolutional networks without increasing their depth is not nearly as effective at increasing test-set performance.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 45 / 46
Summary
1 Feed-forward neural networks built from perceptrons are a very general (flexible) type of function approximator.
2 There are many ways to customize the models.
3 Nice theoretical results show that neural network models can be used to model “any” complex function.
Lab: We will see how a NN can be implemented in TensorFlow.
Next week:
1 How to optimize feed-forward NNs.
2 Regularization.
Lecture 2 (Part 1) Deep Learning – COSC2779 July 26, 2021 46 / 46