
Introduction to
Statistical Machine Learning (ISML):

Deep Learning

Dong Gong

Some of the slides are adapted from Fei-Fei Li et al. and François Fleuret.
10/2021

Linear Classifier for Image Classification

2

Image Classification

Dataset: CIFAR10 [Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.]

3

Image Classification

4

Image Classification

● Image classification with linear classifier

5

Image Classification

● Image classification with linear classifier

6

Image Classification

● An image example with 4 pixels and 3 classes.

7
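As a concrete companion to the 4-pixel, 3-class example above, here is a minimal NumPy sketch of the linear score function f = Wx + b; all of the pixel, weight, and bias values below are made up for illustration and are not taken from the slide.

# Linear classifier on a toy image with 4 pixels and 3 classes.
# All numerical values are illustrative placeholders.
import numpy as np

x = np.array([56., 231., 24., 2.])          # the 4 pixel values, stretched into a column
W = np.array([[ 0.2, -0.5,  0.1,  2.0],     # one row of weights per class (3 x 4)
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])              # one bias per class

scores = W @ x + b                          # linear score function f = Wx + b
print(scores)                               # 3 class scores; the largest is the prediction
print(scores.argmax())                      # index of the predicted class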

Image Classification

● Interpreting a linear classifier.

8

Image Classification

● Hard cases for a linear classifier.

9

From Linear Classifiers to (Non-linear) Neural Networks

10

Neural Networks

● Starting from the original linear classifier


Neural networks: the original linear classifier

(Before) Linear score function: f = W x

11

Neural Networks

● 2 layers


(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: 2 layers

(In practice we will usually add a learnable bias at each layer as well)
12

Neural Networks

● 2 layers
● Also called a fully connected network
● Fully connected layer


(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: also called fully connected network

(In practice we will usually add a learnable bias at each layer as well)

“Neural Network” is a very broad term; these are more accurately called
“fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)

13

Neural Networks

● 2 layers
● Also called a fully connected network
● Fully connected layer


(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: hierarchical computation

x (3072) → W1 → h (100) → W2 → s (10)

14
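The hierarchical computation above can be sketched directly in NumPy; the random weights below are placeholders standing in for learned parameters, with the 3072/100/10 sizes taken from the slide.

# Forward pass of the 2-layer network: x (3072) -> h (100) -> scores s (10).
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3072)          # flattened 32x32x3 image
W1 = rng.standard_normal((100, 3072)) * 0.01
b1 = np.zeros(100)
W2 = rng.standard_normal((10, 100)) * 0.01
b2 = np.zeros(10)

h = np.maximum(0.0, W1 @ x + b1)        # hidden layer: max(0, W1 x + b1)
s = W2 @ h + b2                         # class scores: W2 h + b2
print(h.shape, s.shape)                 # (100,) (10,)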

Neural Networks

● 3 layers


Neural networks: 3 layers

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

(In practice we will usually add a learnable bias at each layer as well)

15

Neural Networks

● Activation function
● The function max(0, z) is called the activation function.

● Q: What happens if we try to build a neural network without one?

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: why is the max operator important?

16

Neural Networks

● Activation function
● The function max(0, z) is called the activation function.

● Q: What happens if we try to build a neural network without one?
○ The model collapses back into a linear classifier: W2 (W1 x) = (W2 W1) x.

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: why is the max operator important?

17
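A quick numerical check of this point (a sketch, not part of the original slides): without the activation, the two weight matrices collapse into a single linear map, and the ReLU is what prevents the collapse.

import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3072)
W1 = rng.standard_normal((100, 3072))
W2 = rng.standard_normal((10, 100))

two_linear_layers = W2 @ (W1 @ x)       # no nonlinearity in between
single_linear     = (W2 @ W1) @ x       # one equivalent linear classifier
print(np.allclose(two_linear_layers, single_linear))   # True

with_relu = W2 @ np.maximum(0.0, W1 @ x)                # the nonlinearity breaks the collapse
print(np.allclose(with_relu, single_linear))            # False (in general)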

Neural Networks

● Activation functions
○ Non-linear functions


Sigmoid

tanh

ReLU

Leaky ReLU

Maxout

ELU

ReLU is a good default choice for most problems

18
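For reference, here are standard NumPy definitions of the listed activations; the 0.01 slope for Leaky ReLU and alpha = 1.0 for ELU are common defaults rather than values taken from the slide, and Maxout is shown over two inputs only as an example.

import numpy as np

def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))
def tanh(z):       return np.tanh(z)
def relu(z):       return np.maximum(0.0, z)
def leaky_relu(z): return np.where(z > 0, z, 0.01 * z)          # 0.01: common default slope
def elu(z, alpha=1.0): return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))
def maxout(z1, z2):    return np.maximum(z1, z2)                # max over k linear pieces (k = 2 here)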

Neural Networks

● Architectures (for MLP)


“Fully-connected” layers
“2-layer Neural Net”, or
“1-hidden-layer Neural Net”

“3-layer Neural Net”, or
“2-hidden-layer Neural Net”

Neural networks: Architectures

19
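As a sketch, the two architectures above could be written in PyTorch (assumed here as the framework; see the last section on frameworks) as follows, with the 3072/100/10 sizes following the earlier slides.

import torch.nn as nn

two_layer_net = nn.Sequential(          # "2-layer Neural Net" / "1-hidden-layer Neural Net"
    nn.Linear(3072, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

three_layer_net = nn.Sequential(        # "3-layer Neural Net" / "2-hidden-layer Neural Net"
    nn.Linear(3072, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)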

Neural Networks

● Architectures (for CNNs)
Convolutional Neural Networks (CNNs)

20

From Neural Networks to “Deep Learning”

21

From Neural Networks to “Deep Learning”


A bit of history…

Frank Rosenblatt, ~1957: Perceptron

The Mark I Perceptron machine was the first implementation of the perceptron algorithm.

The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. It recognized letters of the alphabet.

Update rule: w_i(t+1) = w_i(t) + α (d_j − y_j(t)) x_{j,i}

22

From Neural Networks to “Deep Learning”

23

From Neural Networks to “Deep Learning”

24

From Neural Networks to “Deep Learning”

25

From Neural Networks to “Deep Learning”

26

From Neural Networks to “Deep Learning”

Transformer

27

From Neural Networks to “Deep Learning”

● DL is everywhere

28

From Neural Networks to “Deep Learning”

● DL is everywhere

29

From Neural Networks to “Deep Learning”

● DL is everywhere

30

From Neural Networks to “Deep Learning”

● DL is everywhere

31

Convolutional Neural Networks (CNNs): from MLP to CNN

32

Convolutional Neural Networks (CNNs)

An overview of a CNN

33

Convolutional Neural Networks (CNNs)

● Recap: fully connected (FC) layer

Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1

Input: 3072 x 1 vector; weights W: 10 x 3072; output activation: 10 x 1.

Each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).

Recall the fully-connected architectures from the MLP slides: each layer in those “2-layer” / “3-layer” nets is an FC layer.

34
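A minimal sketch of the fully connected layer above: stretch the 32x32x3 image into a 3072-vector and take a dot product with each of the 10 rows of W. The random values are placeholders.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # a CIFAR-10-sized image
x = image.reshape(-1)                    # stretch to a 3072-vector
W = rng.standard_normal((10, 3072)) * 0.01
activation = W @ x                       # 10 numbers, one dot product per row of W
print(x.shape, activation.shape)         # (3072,) (10,)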

Convolution Layer (2D)

35

Convolution Layer

36

Convolution Layer

37

Convolution Layer

38

Convolution Layer

Notice the difference in size: the activation maps are spatially smaller than the input.
Why?

39

Convolution Layer

40
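The following naive NumPy sketch shows what a single convolution filter computes: a dot product between the filter and every local patch of the input, producing one activation map. The 32x32x3 input and 5x5x3 filter sizes are illustrative assumptions (no padding, stride 1).

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))              # H x W x C input
filt  = rng.standard_normal((5, 5, 3))       # one conv filter, spanning the full input depth
bias  = 0.1
F = filt.shape[0]
H, W, _ = image.shape
out = np.zeros((H - F + 1, W - F + 1))       # one activation map: 28 x 28 here

for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + F, j:j + F, :]   # local receptive field
        out[i, j] = np.sum(patch * filt) + bias   # dot product + bias -> 1 number

print(out.shape)                             # (28, 28); K such filters would give 28 x 28 x K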

41

42

A closer look at spatial dimensions:

43


54

55

N will not be 7 after padding (padding increases the effective input size).

56

Other padding operations: replication padding, reflection padding …
57
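A small helper for this spatial-dimension bookkeeping, using the standard output-size formula (N + 2P − F)/S + 1; the 3x3 filter in the example calls below is an assumption for illustration.

# N: input size, F: filter size, S: stride, P: zero padding on each side.
def conv_output_size(N, F, S=1, P=0):
    assert (N + 2 * P - F) % S == 0, "filter does not fit cleanly with this stride"
    return (N + 2 * P - F) // S + 1

print(conv_output_size(7, 3, S=1, P=0))   # 5
print(conv_output_size(7, 3, S=2, P=0))   # 3
print(conv_output_size(7, 3, S=1, P=1))   # 7: padding preserves the spatial size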

58

59

Pooling layer

60

Max Pooling

Other pooling operations: average pooling …
61


Pooling layer: summary

Let’s assume the input is W1 x H1 x C.
The pooling layer needs 2 hyperparameters:
– The spatial extent F
– The stride S

This will produce an output of W2 x H2 x C where:
– W2 = (W1 – F)/S + 1
– H2 = (H1 – F)/S + 1

Number of parameters: 0
62
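A minimal NumPy sketch of max pooling, applied independently to each channel; the 2x2 window with stride 2 is a common choice, not a value taken from the slide.

import numpy as np

def max_pool(x, F=2, S=2):
    H, W, C = x.shape
    H2, W2 = (H - F) // S + 1, (W - F) // S + 1      # same formula as in the summary above
    out = np.zeros((H2, W2, C))
    for i in range(H2):
        for j in range(W2):
            window = x[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over the spatial window
    return out                                       # note: no learnable parameters

x = np.random.default_rng(0).random((4, 4, 3))
print(max_pool(x).shape)                             # (2, 2, 3)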

Fully Connected Layer (FC layer)

• Contains neurons that connect to the entire input
volume, as in ordinary Neural Networks

63

Fully Connected Layer

64
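As a sketch of how these pieces fit together (see also the summary on the next slide), here is a small PyTorch-style CNN for 32x32x3 inputs; the channel counts and layer sizes are illustrative assumptions, not an architecture from the lecture.

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    nn.Flatten(),                          # 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 100), nn.ReLU(),
    nn.Linear(100, 10),                    # 10 class scores
)

scores = cnn(torch.randn(1, 3, 32, 32))    # one fake image, NCHW layout
print(scores.shape)                        # torch.Size([1, 10])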

Summary of CNNs

● ConvNets stack CONV,POOL,FC layers
● Trend towards smaller filters and deeper architectures
● Trend towards getting rid of POOL/FC layers (just CONV)
● Historically architectures looked like [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX,
   where N is usually up to ~5, M is large, 0 <= K <= 2.
   ○ but recent advances such as ResNet/GoogLeNet have challenged this paradigm

65

Learning, Optimization, and Backpropagation

66

Loss Function

● Recap
   ○ Prediction: the model output f(x; W)
   ○ Loss function: measuring the difference between prediction and ground truth, ℓ(f(x_n; W), y_n)
   ○ Loss to optimize: the loss over the training set, L(W) = Σ_n ℓ(f(x_n; W), y_n)

67

Loss Function

68

Optimization with Gradient Descent

● Finding the best parameter W by minimizing the loss
● An analytic closed-form solution is hard to derive or calculate
● Instead, optimize with gradient descent: iteratively update the parameters in the direction of the negative gradient

69

Stochastic Gradient Descent (SGD)

● The full sum over the training set is expensive when N is large!
● Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common minibatch sizes

70

Problem: How to compute gradients?

● Nonlinear score function
● SVM loss on the predictions
● Regularization
● Total loss: data loss + regularization
● If we can compute the gradients of the loss w.r.t. W1 and W2, then we can learn W1 and W2

71

(Bad) Idea: Derive the gradients on paper

● Problem: Very tedious: lots of matrix calculus, lots of paper
● Problem: What if we want to change the loss, e.g. use softmax instead of the SVM loss? Need to re-derive from scratch =(
● Problem: Not feasible for very complex models!

72

Really complex neural networks!!
(Figure: a full network mapping an input image to a loss; reproduced with permission from a Twitter post by Andrej Karpathy.)

73

Solution: Backpropagation (BP)

We want to train an MLP by minimizing a loss over the training set:

L(w, b) = Σ_n ℓ(f(x_n; w, b), y_n).

To use gradient descent, we need the expression of the gradient of the per-sample loss ℓ_n = ℓ(f(x_n; w, b), y_n) with respect to the parameters, e.g.

∂ℓ_n/∂w^(l)_{i,j}   and   ∂ℓ_n/∂b^(l)_i.

74

Backpropagation (BP)

For clarity, we consider a single training sample x, and introduce s^(1), ..., s^(L) as the summations before the activation functions:

x^(0) = x  →(w^(1), b^(1))→  s^(1)  →σ→  x^(1)  →(w^(2), b^(2))→  s^(2)  →σ→  ...  →(w^(L), b^(L))→  s^(L)  →σ→  x^(L) = f(x; w, b).

Formally we set x^(0) = x and, for all l = 1, ..., L,

s^(l) = w^(l) x^(l-1) + b^(l),
x^(l) = σ(s^(l)),

and we set the output of the network as f(x; w, b) = x^(L), to which the loss ℓ(·) is applied. This is the forward pass.

75

Backpropagation (BP)

x^(l-1)  →(w^(l), b^(l))→  s^(l)  →σ→  x^(l)

Since s^(l)_i influences ℓ only through x^(l)_i, with x^(l)_i = σ(s^(l)_i), we have

∂ℓ/∂s^(l)_i = (∂ℓ/∂x^(l)_i) (∂x^(l)_i/∂s^(l)_i) = (∂ℓ/∂x^(l)_i) σ'(s^(l)_i).

And since x^(l-1)_j influences ℓ only through the s^(l)_i, with s^(l)_i = Σ_j w^(l)_{i,j} x^(l-1)_j + b^(l)_i, we have

∂ℓ/∂x^(l-1)_j = Σ_i (∂ℓ/∂s^(l)_i) (∂s^(l)_i/∂x^(l-1)_j) = Σ_i (∂ℓ/∂s^(l)_i) w^(l)_{i,j}.

76
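The two recursions above can be written directly in NumPy for a single layer. This sketch assumes σ is ReLU, so σ'(s) is simply the indicator of s > 0, and it takes the upstream gradient ∂ℓ/∂x^(l) as given (it would come from the layer after this one).

import numpy as np

rng = np.random.default_rng(0)
x_prev = rng.standard_normal(4)            # x^(l-1)
w = rng.standard_normal((3, 4))            # w^(l)
b = rng.standard_normal(3)                 # b^(l)

s = w @ x_prev + b                         # forward: summation before the activation
x = np.maximum(0.0, s)                     # forward: x^(l) = sigma(s^(l)), here ReLU

dl_dx = rng.standard_normal(3)             # dl/dx^(l), handed back by the next layer
dl_ds = dl_dx * (s > 0)                    # dl/ds^(l) = dl/dx^(l) * sigma'(s^(l))
dl_dx_prev = w.T @ dl_ds                   # dl/dx^(l-1)_j = sum_i dl/ds^(l)_i * w^(l)_{i,j}
print(dl_dx_prev.shape)                    # (4,): passed on to the previous layer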
Backpropagation (BP)

x^(l-1)  →(w^(l), b^(l))→  s^(l)  →σ→  x^(l)

Since w^(l)_{i,j} and b^(l)_i influence ℓ only through s^(l)_i, with s^(l)_i = Σ_j w^(l)_{i,j} x^(l-1)_j + b^(l)_i, we have

∂ℓ/∂w^(l)_{i,j} = (∂ℓ/∂s^(l)_i) (∂s^(l)_i/∂w^(l)_{i,j}) = (∂ℓ/∂s^(l)_i) x^(l-1)_j,
∂ℓ/∂b^(l)_i = ∂ℓ/∂s^(l)_i.

78

Backpropagation (BP)

To summarize: we can compute ∂ℓ/∂x^(L)_i from the definition of ℓ, and recursively propagate the derivatives of the loss w.r.t. the activations backward (for further backward steps) with

∂ℓ/∂s^(l)_i = (∂ℓ/∂x^(l)_i) σ'(s^(l)_i)   and   ∂ℓ/∂x^(l-1)_j = Σ_i (∂ℓ/∂s^(l)_i) w^(l)_{i,j},

and then compute the derivatives w.r.t. the parameters (the gradients on the parameters) with

∂ℓ/∂w^(l)_{i,j} = (∂ℓ/∂s^(l)_i) x^(l-1)_j   and   ∂ℓ/∂b^(l)_i = ∂ℓ/∂s^(l)_i.

This is the backward pass.

79

Backpropagation (BP)

● Computational graph: x^(l-1), w^(l) and b^(l) feed the affine node producing s^(l), which passes through σ to give x^(l). In the backward direction, [∂ℓ/∂x^(l)] is multiplied elementwise by σ'(s^(l)) to give [∂ℓ/∂s^(l)], which is then combined with w^(l) and x^(l-1) (via transposed products) to give [∂ℓ/∂x^(l-1)], [∂ℓ/∂w^(l)] and [∂ℓ/∂b^(l)].

https://fleuret.org/dlc/materials/dlc-slides-3-6-backprop.pdf

80
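To tie the forward pass, backward pass, and SGD update together, here is a self-contained NumPy sketch that trains a 2-layer ReLU network on random data with a squared-error loss. It is an illustration of the recursions above, not the exact CIFAR-10 / SVM-loss setup from the earlier slides; the layer sizes follow the 3072/100/10 example.

import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 64, 3072, 100, 10          # minibatch size and layer sizes
X = rng.standard_normal((N, D_in))             # fake minibatch of inputs
Y = rng.standard_normal((N, D_out))            # fake regression targets

w1 = rng.standard_normal((D_in, H)) * 0.01; b1 = np.zeros(H)
w2 = rng.standard_normal((H, D_out)) * 0.01;  b2 = np.zeros(D_out)
lr = 1e-3                                      # learning rate (illustrative value)

for step in range(300):
    # forward pass
    s1 = X @ w1 + b1                           # s^(1)
    x1 = np.maximum(0.0, s1)                   # x^(1) = ReLU(s^(1))
    s2 = x1 @ w2 + b2                          # s^(2): output scores (no final activation)
    loss = 0.5 * np.mean(np.sum((s2 - Y) ** 2, axis=1))

    # backward pass (gradients averaged over the minibatch)
    dl_ds2 = (s2 - Y) / N                      # dl/ds^(2) for the squared-error loss
    dl_dw2 = x1.T @ dl_ds2                     # dl/dw^(2) = dl/ds^(2) * x^(1)
    dl_db2 = dl_ds2.sum(axis=0)                # dl/db^(2) = dl/ds^(2)
    dl_dx1 = dl_ds2 @ w2.T                     # dl/dx^(1)
    dl_ds1 = dl_dx1 * (s1 > 0)                 # dl/ds^(1) = dl/dx^(1) * sigma'(s^(1))
    dl_dw1 = X.T @ dl_ds1
    dl_db1 = dl_ds1.sum(axis=0)

    # SGD parameter update
    w1 -= lr * dl_dw1; b1 -= lr * dl_db1
    w2 -= lr * dl_dw2; b2 -= lr * dl_db2

print(loss)                                    # the loss should go down over the iterations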

Deep Learning Frameworks/Packages

81
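With a framework such as PyTorch (shown here as one example), the backward pass is generated automatically by automatic differentiation: we write only the forward pass and call .backward() to obtain all the gradients computed by backpropagation.

import torch

x = torch.randn(64, 3072)
y = torch.randn(64, 10)
w1 = torch.randn(3072, 100, requires_grad=True)
w2 = torch.randn(100, 10, requires_grad=True)

scores = torch.relu(x @ w1) @ w2            # forward pass of a 2-layer ReLU network
loss = ((scores - y) ** 2).mean()           # a simple squared-error loss
loss.backward()                             # backward pass: gradients via backprop
print(w1.grad.shape, w2.grad.shape)         # torch.Size([3072, 100]) torch.Size([100, 10])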

The End

82