
Introduction to
Statistical Machine Learning (ISML):

Deep Learning

Dong Gong

Some of the slides are adapted from Fei-Fei Li et al. and François Fleuret.
10/2021

Linear Classifier for Image Classification

2

Image Classification

Dataset: CIFAR10 [Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.]

3

Image Classification

4

Image Classification

● Image classification with linear classifier

5

Image Classification

● Image classification with linear classifier

6

Image Classification

● An image example with 4 pixels and 3 classes.

7
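As a concrete companion to the 4-pixel, 3-class example above, here is a minimal NumPy sketch of the linear score function f = Wx + b; all of the pixel, weight, and bias values below are made up for illustration and are not taken from the slide.

# Linear classifier on a toy image with 4 pixels and 3 classes.
# All numerical values are illustrative placeholders.
import numpy as np

x = np.array([56., 231., 24., 2.])          # the 4 pixel values, stretched into a column
W = np.array([[ 0.2, -0.5,  0.1,  2.0],     # one row of weights per class (3 x 4)
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])              # one bias per class

scores = W @ x + b                          # linear score function f = Wx + b
print(scores)                               # 3 class scores; the largest is the prediction
print(scores.argmax())                      # index of the predicted class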

Image Classification

● Interpreting a linear classifier.

8

Image Classification

● Hard cases for a linear classifier.

9

From Linear Classifiers to (Non-linear) Neural Networks

10

Neural Networks

● Starting from the original linear classifier


Neural networks: the original linear classifier

(Before) Linear score function: f = W x

11

Neural Networks

● 2 layers


(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: 2 layers

(In practice we will usually add a learnable bias at each layer as well)
12

Neural Networks

● 2 layers
● Also called a fully connected network
● Fully connected layer


(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: also called fully connected network

(In practice we will usually add a learnable bias at each layer as well)

“Neural Network” is a very broad term; these are more accurately called
“fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)

13

Neural Networks

● 2 layers
● Also called a fully connected network
● Fully connected layer


(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: hierarchical computation

x (3072) → W1 → h (100) → W2 → s (10)

14
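The hierarchical computation above can be sketched directly in NumPy; the random weights below are placeholders standing in for learned parameters, with the 3072/100/10 sizes taken from the slide.

# Forward pass of the 2-layer network: x (3072) -> h (100) -> scores s (10).
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3072)          # flattened 32x32x3 image
W1 = rng.standard_normal((100, 3072)) * 0.01
b1 = np.zeros(100)
W2 = rng.standard_normal((10, 100)) * 0.01
b2 = np.zeros(10)

h = np.maximum(0.0, W1 @ x + b1)        # hidden layer: max(0, W1 x + b1)
s = W2 @ h + b2                         # class scores: W2 h + b2
print(h.shape, s.shape)                 # (100,) (10,)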

Neural Networks

● 3 layers


Neural networks: 3 layers

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

(In practice we will usually add a learnable bias at each layer as well)

15

Neural Networks

● Activation function
● The function max(0, z) is called the activation function.

● Q: What happens if we try to build a neural network without one?

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: why is the max operator important?

16

Neural Networks

● Activation function
● The function max(0, z) is called the activation function.

● Q: What happens if we try to build a neural network without one?
○ The model collapses back into a linear classifier: W2 (W1 x) = (W2 W1) x.

(Before) Linear score function: f = W x

(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Neural networks: why is the max operator important?

17
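A quick numerical check of this point (a sketch, not part of the original slides): without the activation, the two weight matrices collapse into a single linear map, and the ReLU is what prevents the collapse.

import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3072)
W1 = rng.standard_normal((100, 3072))
W2 = rng.standard_normal((10, 100))

two_linear_layers = W2 @ (W1 @ x)       # no nonlinearity in between
single_linear     = (W2 @ W1) @ x       # one equivalent linear classifier
print(np.allclose(two_linear_layers, single_linear))   # True

with_relu = W2 @ np.maximum(0.0, W1 @ x)                # the nonlinearity breaks the collapse
print(np.allclose(with_relu, single_linear))            # False (in general)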

Neural Networks

● Activation functions
○ Non-linear functions


Sigmoid

tanh

ReLU

Leaky ReLU

Maxout

ELU

ReLU is a good default choice for most problems

18
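For reference, here are standard NumPy definitions of the listed activations; the 0.01 slope for Leaky ReLU and alpha = 1.0 for ELU are common defaults rather than values taken from the slide, and Maxout is shown over two inputs only as an example.

import numpy as np

def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))
def tanh(z):       return np.tanh(z)
def relu(z):       return np.maximum(0.0, z)
def leaky_relu(z): return np.where(z > 0, z, 0.01 * z)          # 0.01: common default slope
def elu(z, alpha=1.0): return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))
def maxout(z1, z2):    return np.maximum(z1, z2)                # max over k linear pieces (k = 2 here)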

Neural Networks

● Architectures (for MLP)


“Fully-connected” layers
“2-layer Neural Net”, or
“1-hidden-layer Neural Net”

“3-layer Neural Net”, or
“2-hidden-layer Neural Net”

Neural networks: Architectures

19
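As a sketch, the two architectures above could be written in PyTorch (assumed here as the framework; see the last section on frameworks) as follows, with the 3072/100/10 sizes following the earlier slides.

import torch.nn as nn

two_layer_net = nn.Sequential(          # "2-layer Neural Net" / "1-hidden-layer Neural Net"
    nn.Linear(3072, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

three_layer_net = nn.Sequential(        # "3-layer Neural Net" / "2-hidden-layer Neural Net"
    nn.Linear(3072, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)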

Neural Networks

● Architectures (for CNNs)
Convolutional Neural Networks (CNNs)

20

From Neural Networks to “Deep Learning”

21

From Neural Networks to “Deep Learning”


A bit of history…

Frank Rosenblatt, ~1957: Perceptron

The Mark I Perceptron machine was the first implementation of the perceptron algorithm.

The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. It recognized letters of the alphabet.

Update rule: w_i(t+1) = w_i(t) + α (d_j − y_j(t)) x_{j,i}

22

From Neural Networks to “Deep Learning”

23

From Neural Networks to “Deep Learning”

24

From Neural Networks to “Deep Learning”

25

From Neural Networks to “Deep Learning”

26

From Neural Networks to “Deep Learning”

Transformer

27

From Neural Networks to “Deep Learning”

● DL is everywhere

28

From Neural Networks to “Deep Learning”

● DL is everywhere

29

From Neural Networks to “Deep Learning”

● DL is everywhere

30

From Neural Networks to “Deep Learning”

● DL is everywhere

31

Convolutional Neural Networks (CNNs): from MLP to CNN

32

Convolutional Neural Networks (CNNs)

An overview of a CNN

33

Convolutional Neural Networks (CNNs)

● Recap: fully connected (FC) layer

Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1

Input: 3072 x 1 vector; weights W: 10 x 3072; output activation: 10 x 1.

Each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).

Recall the fully-connected architectures from the MLP slides: each layer in those “2-layer” / “3-layer” nets is an FC layer.

34
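A minimal sketch of the fully connected layer above: stretch the 32x32x3 image into a 3072-vector and take a dot product with each of the 10 rows of W. The random values are placeholders.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # a CIFAR-10-sized image
x = image.reshape(-1)                    # stretch to a 3072-vector
W = rng.standard_normal((10, 3072)) * 0.01
activation = W @ x                       # 10 numbers, one dot product per row of W
print(x.shape, activation.shape)         # (3072,) (10,)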

Convolution Layer (2D)

35

Convolution Layer

36

Convolution Layer

37

Convolution Layer

38

Convolution Layer

Notice the difference in size: the activation maps are spatially smaller than the input.
Why?

39

Convolution Layer

40
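The following naive NumPy sketch shows what a single convolution filter computes: a dot product between the filter and every local patch of the input, producing one activation map. The 32x32x3 input and 5x5x3 filter sizes are illustrative assumptions (no padding, stride 1).

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))              # H x W x C input
filt  = rng.standard_normal((5, 5, 3))       # one conv filter, spanning the full input depth
bias  = 0.1
F = filt.shape[0]
H, W, _ = image.shape
out = np.zeros((H - F + 1, W - F + 1))       # one activation map: 28 x 28 here

for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + F, j:j + F, :]   # local receptive field
        out[i, j] = np.sum(patch * filt) + bias   # dot product + bias -> 1 number

print(out.shape)                             # (28, 28); K such filters would give 28 x 28 x K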

41

42

A closer look at spatial dimensions:

43


54

55

N will not be 7 after padding (padding increases the effective input size).

56

Other padding operations: replication padding, reflection padding …
57
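A small helper for this spatial-dimension bookkeeping, using the standard output-size formula (N + 2P − F)/S + 1; the 3x3 filter in the example calls below is an assumption for illustration.

# N: input size, F: filter size, S: stride, P: zero padding on each side.
def conv_output_size(N, F, S=1, P=0):
    assert (N + 2 * P - F) % S == 0, "filter does not fit cleanly with this stride"
    return (N + 2 * P - F) // S + 1

print(conv_output_size(7, 3, S=1, P=0))   # 5
print(conv_output_size(7, 3, S=2, P=0))   # 3
print(conv_output_size(7, 3, S=1, P=1))   # 7: padding preserves the spatial size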

58

59

Pooling layer

60

Max Pooling

Other pooling operations: average pooling …
61


Pooling layer: summary

Let’s assume the input is W1 x H1 x C.
The pooling layer needs 2 hyperparameters:
– The spatial extent F
– The stride S

This will produce an output of W2 x H2 x C where:
– W2 = (W1 – F)/S + 1
– H2 = (H1 – F)/S + 1

Number of parameters: 0
62
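A minimal NumPy sketch of max pooling, applied independently to each channel; the 2x2 window with stride 2 is a common choice, not a value taken from the slide.

import numpy as np

def max_pool(x, F=2, S=2):
    H, W, C = x.shape
    H2, W2 = (H - F) // S + 1, (W - F) // S + 1      # same formula as in the summary above
    out = np.zeros((H2, W2, C))
    for i in range(H2):
        for j in range(W2):
            window = x[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over the spatial window
    return out                                       # note: no learnable parameters

x = np.random.default_rng(0).random((4, 4, 3))
print(max_pool(x).shape)                             # (2, 2, 3)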

Fully Connected Layer (FC layer)

• Contains neurons that connect to the entire input
volume, as in ordinary Neural Networks

63

Fully Connected Layer

64
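As a sketch of how these pieces fit together (see also the summary on the next slide), here is a small PyTorch-style CNN for 32x32x3 inputs; the channel counts and layer sizes are illustrative assumptions, not an architecture from the lecture.

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    nn.Flatten(),                          # 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 100), nn.ReLU(),
    nn.Linear(100, 10),                    # 10 class scores
)

scores = cnn(torch.randn(1, 3, 32, 32))    # one fake image, NCHW layout
print(scores.shape)                        # torch.Size([1, 10])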

Summary of CNNs

● ConvNets stack CONV,POOL,FC layers
● Trend towards smaller filters and deeper architectures
● Trend towards getting rid of POOL/FC layers (just CONV)
● Historically architectures looked like [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX,
   where N is usually up to ~5, M is large, 0 <= K <= 2.
   ○ but recent advances such as ResNet/GoogLeNet have challenged this paradigm

65

Learning, Optimization, and Backpropagation

66

Loss Function

● Recap
   ○ Prediction: the model output f(x; W)
   ○ Loss function: measuring the difference between prediction and ground truth, ℓ(f(x_n; W), y_n)
   ○ Loss to optimize: the loss over the training set, L(W) = Σ_n ℓ(f(x_n; W), y_n)

67

Loss Function

68

Optimization with Gradient Descent

● Finding the best parameter W by minimizing the loss
● An analytic closed-form solution is hard to derive or calculate
● Instead, optimize with gradient descent: iteratively update the parameters in the direction of the negative gradient

69

Stochastic Gradient Descent (SGD)

● The full sum over the training set is expensive when N is large!
● Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common minibatch sizes

70

Problem: How to compute gradients?

● Nonlinear score function
● SVM loss on the predictions
● Regularization
● Total loss: data loss + regularization
● If we can compute the gradients of the loss w.r.t. W1 and W2, then we can learn W1 and W2

71

(Bad) Idea: Derive the gradients on paper

● Problem: Very tedious: lots of matrix calculus, lots of paper
● Problem: What if we want to change the loss, e.g. use softmax instead of the SVM loss? Need to re-derive from scratch =(
● Problem: Not feasible for very complex models!

72

Really complex neural networks!!
(Figure: a full network mapping an input image to a loss; reproduced with permission from a Twitter post by Andrej Karpathy.)

73

Solution: Backpropagation (BP)

We want to train an MLP by minimizing a loss over the training set:

L(w, b) = Σ_n ℓ(f(x_n; w, b), y_n).

To use gradient descent, we need the expression of the gradient of the per-sample loss ℓ_n = ℓ(f(x_n; w, b), y_n) with respect to the parameters, e.g.

∂ℓ_n/∂w^(l)_{i,j}   and   ∂ℓ_n/∂b^(l)_i.

74

Backpropagation (BP)

For clarity, we consider a single training sample x, and introduce s^(1), ..., s^(L) as the summations before the activation functions:

x^(0) = x  →(w^(1), b^(1))→  s^(1)  →σ→  x^(1)  →(w^(2), b^(2))→  s^(2)  →σ→  ...  →(w^(L), b^(L))→  s^(L)  →σ→  x^(L) = f(x; w, b).

Formally we set x^(0) = x and, for all l = 1, ..., L,

s^(l) = w^(l) x^(l-1) + b^(l),
x^(l) = σ(s^(l)),

and we set the output of the network as f(x; w, b) = x^(L), to which the loss ℓ(·) is applied. This is the forward pass.

75

Backpropagation (BP)

x^(l-1)  →(w^(l), b^(l))→  s^(l)  →σ→  x^(l)

Since s^(l)_i influences ℓ only through x^(l)_i, with x^(l)_i = σ(s^(l)_i), we have

∂ℓ/∂s^(l)_i = (∂ℓ/∂x^(l)_i) (∂x^(l)_i/∂s^(l)_i) = (∂ℓ/∂x^(l)_i) σ'(s^(l)_i).

And since x^(l-1)_j influences ℓ only through the s^(l)_i, with s^(l)_i = Σ_j w^(l)_{i,j} x^(l-1)_j + b^(l)_i, we have

∂ℓ/∂x^(l-1)_j = Σ_i (∂ℓ/∂s^(l)_i) (∂s^(l)_i/∂x^(l-1)_j) = Σ_i (∂ℓ/∂s^(l)_i) w^(l)_{i,j}.

76
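The two recursions above can be written directly in NumPy for a single layer. This sketch assumes σ is ReLU, so σ'(s) is simply the indicator of s > 0, and it takes the upstream gradient ∂ℓ/∂x^(l) as given (it would come from the layer after this one).

import numpy as np

rng = np.random.default_rng(0)
x_prev = rng.standard_normal(4)            # x^(l-1)
w = rng.standard_normal((3, 4))            # w^(l)
b = rng.standard_normal(3)                 # b^(l)

s = w @ x_prev + b                         # forward: summation before the activation
x = np.maximum(0.0, s)                     # forward: x^(l) = sigma(s^(l)), here ReLU

dl_dx = rng.standard_normal(3)             # dl/dx^(l), handed back by the next layer
dl_ds = dl_dx * (s > 0)                    # dl/ds^(l) = dl/dx^(l) * sigma'(s^(l))
dl_dx_prev = w.T @ dl_ds                   # dl/dx^(l-1)_j = sum_i dl/ds^(l)_i * w^(l)_{i,j}
print(dl_dx_prev.shape)                    # (4,): passed on to the previous layer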
Backpropagation (BP)

x^(l-1)  →(w^(l), b^(l))→  s^(l)  →σ→  x^(l)

Since w^(l)_{i,j} and b^(l)_i influence ℓ only through s^(l)_i, with s^(l)_i = Σ_j w^(l)_{i,j} x^(l-1)_j + b^(l)_i, we have

∂ℓ/∂w^(l)_{i,j} = (∂ℓ/∂s^(l)_i) (∂s^(l)_i/∂w^(l)_{i,j}) = (∂ℓ/∂s^(l)_i) x^(l-1)_j,
∂ℓ/∂b^(l)_i = ∂ℓ/∂s^(l)_i.

78

Backpropagation (BP)

To summarize: we can compute ∂ℓ/∂x^(L)_i from the definition of ℓ, and recursively propagate the derivatives of the loss w.r.t. the activations backward (for further backward steps) with

∂ℓ/∂s^(l)_i = (∂ℓ/∂x^(l)_i) σ'(s^(l)_i)   and   ∂ℓ/∂x^(l-1)_j = Σ_i (∂ℓ/∂s^(l)_i) w^(l)_{i,j},

and then compute the derivatives w.r.t. the parameters (the gradients on the parameters) with

∂ℓ/∂w^(l)_{i,j} = (∂ℓ/∂s^(l)_i) x^(l-1)_j   and   ∂ℓ/∂b^(l)_i = ∂ℓ/∂s^(l)_i.

This is the backward pass.

79

Backpropagation (BP)

● Computational graph: x^(l-1), w^(l) and b^(l) feed the affine node producing s^(l), which passes through σ to give x^(l). In the backward direction, [∂ℓ/∂x^(l)] is multiplied elementwise by σ'(s^(l)) to give [∂ℓ/∂s^(l)], which is then combined with w^(l) and x^(l-1) (via transposed products) to give [∂ℓ/∂x^(l-1)], [∂ℓ/∂w^(l)] and [∂ℓ/∂b^(l)].

https://fleuret.org/dlc/materials/dlc-slides-3-6-backprop.pdf

80
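To tie the forward pass, backward pass, and SGD update together, here is a self-contained NumPy sketch that trains a 2-layer ReLU network on random data with a squared-error loss. It is an illustration of the recursions above, not the exact CIFAR-10 / SVM-loss setup from the earlier slides; the layer sizes follow the 3072/100/10 example.

import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 64, 3072, 100, 10          # minibatch size and layer sizes
X = rng.standard_normal((N, D_in))             # fake minibatch of inputs
Y = rng.standard_normal((N, D_out))            # fake regression targets

w1 = rng.standard_normal((D_in, H)) * 0.01; b1 = np.zeros(H)
w2 = rng.standard_normal((H, D_out)) * 0.01;  b2 = np.zeros(D_out)
lr = 1e-3                                      # learning rate (illustrative value)

for step in range(300):
    # forward pass
    s1 = X @ w1 + b1                           # s^(1)
    x1 = np.maximum(0.0, s1)                   # x^(1) = ReLU(s^(1))
    s2 = x1 @ w2 + b2                          # s^(2): output scores (no final activation)
    loss = 0.5 * np.mean(np.sum((s2 - Y) ** 2, axis=1))

    # backward pass (gradients averaged over the minibatch)
    dl_ds2 = (s2 - Y) / N                      # dl/ds^(2) for the squared-error loss
    dl_dw2 = x1.T @ dl_ds2                     # dl/dw^(2) = dl/ds^(2) * x^(1)
    dl_db2 = dl_ds2.sum(axis=0)                # dl/db^(2) = dl/ds^(2)
    dl_dx1 = dl_ds2 @ w2.T                     # dl/dx^(1)
    dl_ds1 = dl_dx1 * (s1 > 0)                 # dl/ds^(1) = dl/dx^(1) * sigma'(s^(1))
    dl_dw1 = X.T @ dl_ds1
    dl_db1 = dl_ds1.sum(axis=0)

    # SGD parameter update
    w1 -= lr * dl_dw1; b1 -= lr * dl_db1
    w2 -= lr * dl_dw2; b2 -= lr * dl_db2

print(loss)                                    # the loss should go down over the iterations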

Deep Learning Frameworks/Packages

81
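With a framework such as PyTorch (shown here as one example), the backward pass is generated automatically by automatic differentiation: we write only the forward pass and call .backward() to obtain all the gradients computed by backpropagation.

import torch

x = torch.randn(64, 3072)
y = torch.randn(64, 10)
w1 = torch.randn(3072, 100, requires_grad=True)
w2 = torch.randn(100, 10, requires_grad=True)

scores = torch.relu(x @ w1) @ w2            # forward pass of a 2-layer ReLU network
loss = ((scores - y) ** 2).mean()           # a simple squared-error loss
loss.backward()                             # backward pass: gradients via backprop
print(w1.grad.shape, w2.grad.shape)         # torch.Size([3072, 100]) torch.Size([100, 10])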

The End

82