Introduction to Statistical Machine Learning (ISML): Deep Learning
Dong Gong
Parts of these slides are adapted from Fei-Fei Li et al. and François Fleuret.
10/2021
Linear Classifier for Image Classification
2
Image Classification
Dataset: CIFAR-10 [Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009]
3
Image Classification
4
Image Classification
● Image classification with linear classifier
5
Image Classification
● Image classification with linear classifier
6
Image Classification
● An image example with 4 pixels and 3 classes.
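A minimal NumPy sketch of the linear score computation f(x, W) = Wx + b on this kind of toy example; the specific pixel values, weights, and class names are made up purely for illustration.

```python
import numpy as np

# Toy example: a flattened 4-pixel image and 3 classes (e.g. cat / dog / ship).
# All numbers below are illustrative placeholders, not values from the slides.
x = np.array([56.0, 231.0, 24.0, 2.0])           # input image, stretched into a 4-vector

W = np.array([[0.2, -0.5, 0.1, 2.0],             # one row of weights per class
              [1.5,  1.3, 2.1, 0.0],
              [0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])                   # one bias per class

scores = W @ x + b                               # 3 class scores: f(x, W) = Wx + b
print(scores)                                    # the highest score is the predicted class
```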
7
Image Classification
● Interpreting a linear classifier.
8
Image Classification
● Hard cases for a linear classifier.
9
From Linear Classifiers to (Non-linear) Neural Networks
10
Neural Networks
● Starting from the original linear classifier
Neural networks: the original linear classifier
(Before) Linear score function: f = Wx
11
Neural Networks
● 2 layers
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
(In practice we will usually add a learnable bias at each layer as well.)
12
Neural Networks
● 2 layers
● Also called a fully connected network
● Fully connected layer
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
(In practice we will usually add a learnable bias at each layer as well.)
“Neural Network” is a very broad term; these are more accurately called “fully-connected networks” or sometimes “multi-layer perceptrons” (MLPs).
13
Neural Networks
● 2 layers
● Also called as fully connected network
● Fully connected layer
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
Neural networks: hierarchical computation (see the sketch below)
x (3072) --W1--> h (100) --W2--> s (10)
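A minimal NumPy sketch of this two-layer computation with the same shapes (3072 → 100 → 10); the random weights are placeholders standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the slide: input 3072 (a 32x32x3 image), hidden 100, output 10 classes.
x  = rng.standard_normal(3072)              # flattened input image (placeholder values)
W1 = rng.standard_normal((100, 3072)) * 0.01
b1 = np.zeros(100)
W2 = rng.standard_normal((10, 100)) * 0.01
b2 = np.zeros(10)

h = np.maximum(0.0, W1 @ x + b1)            # hidden layer: ReLU(W1 x + b1)
s = W2 @ h + b2                             # class scores: W2 h + b2
print(h.shape, s.shape)                     # (100,) (10,)
```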
14
Neural Networks
● 3 layers
Neural networks: 3 layers
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))
(In practice we will usually add a learnable bias at each layer as well)
15
Neural Networks
● Activation function
● The function max(0, z) is called the activation function.
● What happens if we build a neural network without the activation function?
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
Neural networks: why is the max operator important?
16
Neural Networks
● Activation function
● The function max(0, z) is called the activation function.
● What happens without the activation function?
○ The model becomes linear: W2 (W1 x) = (W2 W1) x, which is just another linear classifier.
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
Neural networks: why is the max operator important?
17
Neural Networks
● Activation functions
○ Non-linear functions
○ Common choices: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU (see the sketch below)
○ ReLU is a good default choice for most problems
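A minimal NumPy sketch of a few of these activations, following their standard definitions; the 0.01 leaky slope and alpha = 1.0 for ELU are common defaults, not values taken from the slides.

```python
import numpy as np

def relu(z):                      # max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):    # small slope for z < 0 (0.01 is a common default)
    return np.where(z > 0, z, slope * z)

def sigmoid(z):                   # 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):            # alpha * (e^z - 1) for z < 0
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(relu(z))
print(sigmoid(z))
```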
18
Neural Networks
● Architectures (for MLP)
“Fully-connected” layers
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
19
Neural Networks
● Architectures (for CNNs)
Convolutional Neural Networks (CNNs)
20
From Neural Networks to “Deep Learning”
21
From Neural Networks to “Deep Learning”
A bit of history…
Frank Rosenblatt, ~1957: Perceptron
The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. It recognized letters of the alphabet.
Update rule: w_i(t+1) = w_i(t) + α (d_j − y_j(t)) x_{j,i}
22
From Neural Networks to “Deep Learning”
23
From Neural Networks to “Deep Learning”
24
From Neural Networks to “Deep Learning”
25
From Neural Networks to “Deep Learning”
26
From Neural Networks to “Deep Learning”
Transformer
27
From Neural Networks to “Deep Learning”
● DL is everywhere
28
From Neural Networks to “Deep Learning”
● DL is everywhere
29
From Neural Networks to “Deep Learning”
● DL is everywhere
30
From Neural Networks to “Deep Learning”
● DL is everywhere
31
Convolutional Neural Network (CNN), from MLP to CNN
32
Convolutional Neural Networks (CNNs)
An overview of a CNN
33
Convolutional Neural Networks (CNNs)
● Recap: fully connected (FC) layer
Fully Connected Layer
32x32x3 image -> stretch to a 3072 x 1 input vector
Weights W: 10 x 3072
Activation (output): 10 class scores; each one is 1 number, the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
34
Convolution Layer (2D)
35
Convolution Layer
36
Convolution Layer
37
Convolution Layer
38
Convolution Layer
Notice the difference in size. Why?
39
Convolution Layer
40
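A minimal NumPy sketch of the convolution operation these slides depict: sliding one filter over the input and taking a dot product at every spatial position. The 32×32×3 input and 5×5×3 filter sizes are assumed here for illustration (matching the 32x32x3 image used in the FC-layer recap above).

```python
import numpy as np

def conv2d_single_filter(image, filt, bias=0.0, stride=1):
    """Naive 2D convolution of one filter over an (H, W, C) image.

    A teaching sketch only: no padding, plain loops instead of vectorization.
    """
    H, W, C = image.shape
    F, F2, Cf = filt.shape
    assert F == F2 and Cf == C, "filter must be square and match the input depth"
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F, :]
            out[i, j] = np.sum(patch * filt) + bias   # dot product of patch and filter
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # assumed 32x32x3 input
filt  = rng.standard_normal((5, 5, 3))     # assumed 5x5x3 filter
print(conv2d_single_filter(image, filt).shape)   # (28, 28) activation map
```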
41
42
A closer look at spatial dimensions:
Output size: (N - F) / stride + 1, for an N × N input and an F × F filter.
54
55
N will not be 7 after padding: with P pixels of zero padding the effective input size becomes N + 2P, so the output size is (N - F + 2P) / stride + 1.
56
Other padding operations: replication padding, reflection padding …
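A small helper, as a sketch, that evaluates the output-size formula above for a given input size, filter size, stride, and amount of zero padding; the function name is mine, not from the slides.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv/pool layer: (N - F + 2P) / S + 1."""
    size = (n - f + 2 * pad) / stride + 1
    assert size == int(size), "filter does not tile the input evenly with this stride/padding"
    return int(size)

print(conv_output_size(32, 5))                  # 28: 32x32 input, 5x5 filter, stride 1
print(conv_output_size(7, 3, stride=2))         # 3:  7x7 input, 3x3 filter, stride 2
print(conv_output_size(7, 3, stride=1, pad=1))  # 7:  1 pixel of zero padding preserves the size
```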
57
58
59
Pooling layer
60
Max Pooling
Other pooling operations: average pooling …
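A minimal NumPy sketch of max pooling on a single channel; the 2×2 window with stride 2 is a common default choice, not a value taken from the slides.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Naive max pooling over a single (H, W) channel."""
    H, W = x.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))        # 2x2 output: the max of each 2x2 block
```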
61
Pooling layer: summary
Let’s assume the input is W1 x H1 x C
The pooling layer needs 2 hyperparameters:
– The spatial extent F
– The stride S
This will produce an output of W2 x H2 x C where:
– W2 = (W1 – F)/S + 1
– H2 = (H1 – F)/S + 1
Number of parameters: 0
62
Fully Connected Layer (FC layer)
• Contains neurons that connect to the entire input volume, as in ordinary Neural Networks
63
Fully Connected Layer
64
Summary of CNNs
● ConvNets stack CONV, POOL, and FC layers
● Trend towards smaller filters and deeper architectures
● Trend towards getting rid of POOL/FC layers (just CONV)
● Historically, architectures looked like [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX, where N is usually up to ~5, M is large, and 0 <= K <= 2 (see the sketch below).
○ but recent advances such as ResNet/GoogLeNet have challenged this paradigm
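As an illustration of that historical [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K pattern, a small PyTorch sketch; the layer sizes and filter counts are arbitrary choices for a 32×32×3 input, not an architecture taken from the slides.

```python
import torch
from torch import nn

# [(CONV-RELU)*2 - POOL] * 2 - (FC-RELU) - FC  (softmax is folded into the loss at training time)
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                   # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                   # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),                                # 10 class scores (e.g. CIFAR-10)
)

x = torch.randn(1, 3, 32, 32)                          # a dummy image batch
print(model(x).shape)                                  # torch.Size([1, 10])
```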
65
Learning, Optimization, and Backpropagation
66
Loss Function
● Recap
Prediction: ŷ = f(x; W)
Loss function: l(ŷ, y), measuring the difference between the prediction and the ground truth y.
Loss to optimize: the total loss over the training set, L(W) = Σ_n l(f(x_n; W), y_n) (one common choice of l is sketched below).
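As one common concrete choice (softmax cross-entropy; the slides leave the specific loss generic), a minimal NumPy sketch for a single sample:

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross-entropy loss of one sample: -log softmax(scores)[label]."""
    shifted = scores - scores.max()                  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

scores = np.array([2.0, -1.0, 0.5])                  # class scores from the model
print(softmax_cross_entropy(scores, label=0))        # small loss: class 0 already scores highest
```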
67
Loss Function
68
Optimization with Gradient Descent
● Finding the best parameters W
● Minimizing the loss
● An analytic closed-form solution is hard to derive or compute
Finding the best W: optimize with Gradient Descent, iteratively updating the parameters in the direction of the negative gradient (see the sketch below).
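A minimal sketch of vanilla gradient descent. It assumes a hypothetical helper loss_and_grad(W) that returns the loss value and its gradient with respect to W; the quadratic toy loss at the end is only there to make the example runnable.

```python
import numpy as np

def gradient_descent(W, loss_and_grad, lr=1e-3, steps=100):
    """Vanilla gradient descent: W <- W - lr * dL/dW at every step."""
    for _ in range(steps):
        loss, grad = loss_and_grad(W)    # loss value and gradient w.r.t. W
        W = W - lr * grad                # step in the negative gradient direction
    return W

# Toy usage: minimize the quadratic loss ||W||^2, whose gradient is 2W.
toy = lambda W: (np.sum(W ** 2), 2 * W)
print(gradient_descent(np.array([3.0, -2.0]), toy, lr=0.1, steps=50))  # close to [0, 0]
```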
69
Stochastic Gradient Descent (SGD)
The full sum over all N training examples is expensive when N is large!
Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common (see the sketch below).
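A minimal sketch of the minibatch loop, again assuming a hypothetical loss_and_grad helper, here taking a batch of inputs and labels; the batch size of 64 is one of the common values quoted above.

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, lr=1e-3, batch_size=64, steps=1000):
    """SGD: at each step, estimate the gradient on a random minibatch only."""
    rng = np.random.default_rng(0)
    N = X.shape[0]
    for _ in range(steps):
        idx = rng.choice(N, size=batch_size, replace=False)   # sample a minibatch
        loss, grad = loss_and_grad(W, X[idx], y[idx])          # gradient on the minibatch
        W = W - lr * grad
    return W
```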
70
Problem: How to compute gradients?
Nonlinear score function: s = W2 max(0, W1 x)
SVM loss on predictions: L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)
Regularization: R(W) = Σ_k W_k^2
Total loss: data loss + regularization, L = (1/N) Σ_i L_i + λ R(W1) + λ R(W2)
If we can compute ∂L/∂W1 and ∂L/∂W2, then we can learn W1 and W2.
71
(Bad) Idea: Derive the gradients on paper
Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch =(
Problem: Very tedious: lots of matrix calculus, need lots of paper
Problem: Not feasible for very complex models!
72
Really complex neural networks!
Figure reproduced with permission from a Twitter post by Andrej Karpathy: a huge computational graph running from the input image to the loss.
73
Solution: Backpropagation (BP)
We want to train an MLP by minimizing a loss over the training set
L(w, b) = Σ_n l(f(x_n; w, b), y_n).
To use gradient descent, we need the expression of the gradient of the per-sample loss
l_n = l(f(x_n; w, b), y_n)
with respect to the parameters, e.g. ∂l_n/∂w^(l)_{i,j} and ∂l_n/∂b^(l)_i.
74
Backpropagation (BP)
For clarity, we consider a single training sample x, and introduce s^(1), ..., s^(L) as the summations before the activation functions:
x^(0) = x --(w^(1), b^(1))--> s^(1) --> x^(1) --(w^(2), b^(2))--> s^(2) --> ... --(w^(L), b^(L))--> s^(L) --> x^(L) = f(x; w, b).
Formally we set x^(0) = x, and for all l = 1, ..., L:
s^(l) = w^(l) x^(l−1) + b^(l)
x^(l) = σ(s^(l)),
and we set the output of the network as f(x; w, b) = x^(L); the loss l(·) is then computed on this output.
This is the forward pass (see the sketch below).
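A minimal NumPy sketch of this forward pass for an L-layer MLP; σ = ReLU is an assumed choice, and the parameter shapes and values are placeholders.

```python
import numpy as np

def forward(x, weights, biases, sigma=lambda s: np.maximum(0.0, s)):
    """Forward pass: s(l) = w(l) x(l-1) + b(l), x(l) = sigma(s(l)), for l = 1..L.

    Returns all intermediate x(l) and s(l); they are reused by the backward pass.
    """
    xs, ss = [x], []
    for w, b in zip(weights, biases):
        s = w @ xs[-1] + b
        ss.append(s)
        xs.append(sigma(s))           # sigma applied at every layer, as in the slide
    return xs, ss

rng = np.random.default_rng(0)
weights = [rng.standard_normal((100, 3072)) * 0.01, rng.standard_normal((10, 100)) * 0.01]
biases  = [np.zeros(100), np.zeros(10)]
xs, ss = forward(rng.standard_normal(3072), weights, biases)
print([v.shape for v in xs])          # [(3072,), (100,), (10,)]
```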
75
Backpropagation (BP)
Consider one layer of the network: x^(l−1) --(w^(l), b^(l))--> s^(l) --σ--> x^(l).
Since s^(l)_i influences l only through x^(l)_i, with x^(l)_i = σ(s^(l)_i), we have
∂l/∂s^(l)_i = (∂l/∂x^(l)_i) · (∂x^(l)_i/∂s^(l)_i) = (∂l/∂x^(l)_i) · σ′(s^(l)_i).
And since x^(l−1)_j influences l only through the s^(l)_i, with s^(l)_i = Σ_j w^(l)_{i,j} x^(l−1)_j + b^(l)_i, we have
∂l/∂x^(l−1)_j = Σ_i (∂l/∂s^(l)_i) · (∂s^(l)_i/∂x^(l−1)_j) = Σ_i (∂l/∂s^(l)_i) · w^(l)_{i,j}.
76
Backpropagation (BP)
Consider again x^(l−1) --(w^(l), b^(l))--> s^(l) --σ--> x^(l).
Since w^(l)_{i,j} and b^(l)_i influence l only through s^(l)_i, with s^(l)_i = Σ_j w^(l)_{i,j} x^(l−1)_j + b^(l)_i, we have
∂l/∂w^(l)_{i,j} = (∂l/∂s^(l)_i) · (∂s^(l)_i/∂w^(l)_{i,j}) = (∂l/∂s^(l)_i) · x^(l−1)_j,
∂l/∂b^(l)_i = ∂l/∂s^(l)_i.
78
Backpropagation (BP)
To summarize: we can compute ∂l/∂x^(L)_i from the definition of l, and recursively propagate the derivatives of the loss w.r.t. the activations backward with
∂l/∂s^(l)_i = (∂l/∂x^(l)_i) · σ′(s^(l)_i)
and
∂l/∂x^(l−1)_j = Σ_i (∂l/∂s^(l)_i) · w^(l)_{i,j}   (passed on for the further backward steps).
And then compute the derivatives w.r.t. the parameters with
∂l/∂w^(l)_{i,j} = (∂l/∂s^(l)_i) · x^(l−1)_j
and
∂l/∂b^(l)_i = ∂l/∂s^(l)_i   (the gradients on the parameters).
This is the backward pass (see the sketch below).
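A minimal NumPy sketch of this backward pass, matching the forward sketch above. It assumes σ = ReLU and that dl_dxL, the gradient of the loss w.r.t. the network output x^(L), is supplied by the loss function.

```python
import numpy as np

def backward(dl_dxL, xs, ss, weights):
    """Backward pass for the MLP of the forward sketch (sigma = ReLU assumed).

    dl_dxL: gradient of the loss w.r.t. the network output x(L).
    xs, ss: activations and pre-activations saved during the forward pass.
    Returns the gradients w.r.t. every w(l) and b(l).
    """
    grads_w, grads_b = [], []
    dl_dx = dl_dxL
    for l in reversed(range(len(weights))):
        dl_ds = dl_dx * (ss[l] > 0)                   # dl/ds(l) = dl/dx(l) * sigma'(s(l))
        grads_w.insert(0, np.outer(dl_ds, xs[l]))     # dl/dw(l) = dl/ds(l) x(l-1)^T
        grads_b.insert(0, dl_ds)                      # dl/db(l) = dl/ds(l)
        dl_dx = weights[l].T @ dl_ds                  # dl/dx(l-1) = w(l)^T dl/ds(l)
    return grads_w, grads_b

# Typical use: xs, ss = forward(x, weights, biases); then supply dl_dxL from the loss.
```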
79
Backpropagation (BP)
● Computational graph
Forward: x^(l−1) and w^(l) enter a multiplication node, b^(l) is added to give s^(l), and σ produces x^(l).
Backward: [∂l/∂x^(l)] is multiplied elementwise by σ′(s^(l)) to give [∂l/∂s^(l)]; from it, [∂l/∂b^(l)] is read off directly, [[∂l/∂w^(l)]] is obtained by multiplying with x^(l−1)ᵀ, and [∂l/∂x^(l−1)] by multiplying with w^(l)ᵀ.
https://fleuret.org/dlc/materials/dlc-slides-3-6-backprop.pdf
80
Deep Learning Frameworks/Packages
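Modern deep learning frameworks implement this backward pass automatically. As an illustration, a tiny PyTorch sketch (PyTorch is one such framework; the two-layer sizes below are arbitrary placeholders):

```python
import torch

# A 2-layer network, a cross-entropy loss, and gradients via automatic differentiation.
x = torch.randn(1, 3072)                         # one flattened input image (dummy values)
y = torch.tensor([3])                            # ground-truth class index
W1 = torch.nn.Parameter(0.01 * torch.randn(100, 3072))
W2 = torch.nn.Parameter(0.01 * torch.randn(10, 100))

h = torch.relu(x @ W1.T)                         # hidden layer
scores = h @ W2.T                                # class scores
loss = torch.nn.functional.cross_entropy(scores, y)

loss.backward()                                  # backpropagation, done by the framework
print(W1.grad.shape, W2.grad.shape)              # torch.Size([100, 3072]) torch.Size([10, 100])
```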
81
The End
82