Changjae Oh
Computer Vision
– Introduction to Deep Learning –
Semester 1, 22/23
• Machine learning basics++
• Introduction to deep learning
• Linear classifier
What is Machine Learning?
• Learning = Looking for a Function
http://www.slideshare.net/ckmarkohchang
Machine Learning: Basic Concept
• Prediction task
̶ Regression: returns a specific value
̶ Classification: returns a class label
Regression example
What is the predicted value of y when x = 10?
Classification example
For a new image, what class does it belong to?
Machine Learning: Basic Concept
• Training data
𝕏 = {𝐱1 = 2.0 , 𝐱2 = 4.0 , 𝐱3 = 6.0 , 𝐱4 = 8.0 }
𝕐 = {𝑦1 = 3.0, 𝑦2 = 4.0, 𝑦3 = 5.0, 𝑦4 = 6.0}
Training the model with data
– Finding optimal parameters: start from random values and iteratively update them to reduce the error on the training data
– The real goal is to minimize the error on new samples (the test set)
– Generalization refers to high performance on the test set
Optimal parameters for the data above: w = 0.5, b = 2.0 (i.e. y = 0.5x + 2.0)
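As a minimal sketch (not part of the original slides), the toy model y = wx + b can be fitted to the four training pairs above with gradient descent; it should converge close to the optimal parameters w = 0.5, b = 2.0.

import numpy as np

# Toy training data from the slide: y = 0.5*x + 2.0
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 4.0, 5.0, 6.0])

w, b = np.random.randn(), np.random.randn()  # start from random values
lr = 0.01                                    # step size (learning rate)

for _ in range(5000):
    err = (w * x + b) - y          # prediction error on the training set
    grad_w = 2 * np.mean(err * x)  # d(MSE)/dw
    grad_b = 2 * np.mean(err)      # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # approximately 0.5, 2.0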
Machine Learning: Basic Concept
• Multi-dimensional feature space
̶ d-dimensional data: x=(x1,x2, … ,xd)
• Linear classifier for d-dimensional data
̶ 1st-order (linear) classifier: # of parameters = d + 1
  f(x) = w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d + w_{d+1}
  • Widely used in machine learning
̶ 2nd-order classifier: # of parameters = (d+1)(d+2)/2
  f(x) = w_1 x_1² + ⋯ + w_d x_d² + w_{d+1} x_1 x_2 + ⋯ + w_{d(d+1)/2} x_{d−1} x_d + (linear terms) + (constant term)
Note) d = 784 in MNIST data
Machine Learning: Basic Concept
• Feature space transformation
̶ Map a linearly non-separable feature space into separable space
Toy example: the original feature space (not linearly separable) is mapped to a transformed feature space that is linearly separable.
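A small illustrative sketch (the data and the mapping below are my own toy example, not the one shown on the slide): XOR-like points are not linearly separable in (x1, x2), but appending the product feature x1*x2 makes them separable with a single linear rule.

import numpy as np

# XOR-like toy data: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Feature space transformation: append the product x1*x2 as a third feature
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the transformed space a linear rule separates the classes,
# e.g. score = x1 + x2 - 2*x1*x2, thresholded at 0.5
w = np.array([1.0, 1.0, -2.0])
scores = phi @ w
print((scores > 0.5).astype(int))  # [0 1 1 0] matches y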
Machine Learning: Basic Concept
• Representation learning
̶ Aims to find good feature space automatically
̶ Deep learning finds a hierarchical feature space by using neural networks with
multiple hidden layers.
• The first hidden layer captures low-level features (edges, corner points, etc.), while later layers capture high-level features (faces, wheels, etc.)
Data for Machine Learning
• The quality of the training data
̶ To increase estimation accuracy, sufficiently large and diverse data should be collected for the given application.
̶ Ex) After training on a database containing only frontal faces, recognition accuracy on side (profile) faces will be degraded.
• MNIST database
̶ Handwritten numeric database
̶ Training data: 60,000
̶ Test data: 10,000
Data for Machine Learning
• Database size vs. training accuracy
̶ Ex) MNIST: 28×28 binary images
→ The total number of possible samples is 2^784, but MNIST has only 60,000 training images.
Data for Machine Learning
• How does a small database achieve high performance?
̶ In the feature space, the actual data occupy only a very small subspace.
̶ Manifold assumption: an arbitrary combination of feature values is unlikely to occur; real data vary smoothly along a low-dimensional manifold according to certain rules.
Training Model: under-fitting vs. over-fitting
• Under-fitting
̶ Model capacity is too small to fit the data properly.
̶ A model with higher order can be used.
1st order 2nd order 3rd order 4th order 12th order
Training Model: under-fitting vs. over-fitting
• Over-fitting
̶ The 12th-order polynomial model fits the training set perfectly.
̶ But on "new" data it runs into a big problem.
̶ Since the model capacity is large, the training process also fits the noise in the data.
̶ A model with appropriate capacity should therefore be selected.
Actual data vs. predicted data
Training Model: under-fitting vs. over-fitting
• The 1st- and 2nd-order models show poor performance on both the training and the test set.
• The 12th-order model shows high performance on the training set but low performance on the test set → low generalization ability.
• The 3rd- and 4th-order models score lower than the 12th-order model on the training set, but perform better on the test set → higher generalization capability.
1st order 2nd order 3rd order 4th order 12th order
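The comparison above can be reproduced qualitatively with a short sketch (the synthetic data below are my own, chosen only to show the trend): low-order polynomials underfit, while a 12th-order fit drives the training error down but typically increases the test error.

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic ground truth: a smooth curve plus noise (illustrative only)
    x = rng.uniform(-1, 1, n)
    return x, np.sin(2 * x) + 0.1 * rng.standard_normal(n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

for order in [1, 2, 3, 4, 12]:
    coeffs = np.polyfit(x_train, y_train, order)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")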
Spectrum of supervision
Supervised
Semi-Supervised
Unsupervised
Reinforcement
Computer vision
Spectrum of supervision
• Supervised learning
̶ Both the feature vector 𝕏 and the output 𝕐 are given.
̶ Regression and classification problem
• Unsupervised learning
̶ The feature vector 𝕏 is given, but the output 𝕐 is not given.
̶ Ex) Clustering, density estimation, feature space conversion
Spectrum of supervision
• Reinforcement learning
̶ The output (reward) is given, but in a different form from supervised learning.
• Ex) Once a game is over, you receive a score (credit): +1 if you win, −1 otherwise.
• The credit must then be distributed over the individual samples (moves) of the game.
• Semi-supervised Learning
̶ Some of the data have both 𝕏 and 𝕐, but others have only 𝕏.
̶ It is becoming important, since collecting 𝕏 is easy, but obtaining 𝕐 requires manual labelling.
Introduction to
Deep Learning
• Deep learning
̶ is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
Deep learning success
• Image classification
• Machine translation
• Speech recognition
• Speech synthesis
• Game playing
• .. and many, many more
Why deep learning?
• Hand-crafted features vs. Learned features
• Large datasets
• GPU hardware advances + Price decreases
• Improved techniques (algorithm)
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ End-to-end training: what each function should do is learned automatically
̶ Deep learning usually refers to neural network based model
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ Representation Learning: learning features/representations
̶ Deep Learning: learning (multi-level) features and an output
Machine Learning vs. Deep Learning
• Deep vs Shallow: Image Recognition
̶ Shallow model using machine learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
• Deep vs Shallow: Image Recognition
̶ Deep model using deep learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
• Machine Learning vs. Deep Learning
Machine Learning = Feature descriptor + Classifier
̶ Feature: hand-crafted, domain-specific knowledge, i.e. describing your data with features that a computer can understand.
  Ex) SIFT, Bag-of-Words (BoW), Histogram of Oriented Gradients (HOG)
̶ Classifier: optimizing the classifier weights on the features.
  Ex) Nearest Neighbor (NN), Support Vector Machine (SVM), Random Forest (RF)
Machine Learning vs. Deep Learning
• Machine Learning vs. Deep Learning
Deep Learning = Feature descriptor + Classifier, learned jointly
̶ Feature: representation learned by the machine (automatically learned internal knowledge)
̶ Classifier: optimizing the classifier weights on the learned representation
̶ Both are realized by a neural-network-based model: a series of linear classifiers and non-linear activations + a loss function
Deep Learning
• A single neuron
z = 𝒘ᵀ𝒙 + b = Σ_k w_k x_k + b
Deep Learning
• A single layer with multiple neurons
z_1 = 𝒘_1ᵀ𝒙 + b_1 = Σ_k w_{1k} x_k + b_1,   a_1 = 1 / (1 + e^{−z_1})
⋮
z_M = 𝒘_Mᵀ𝒙 + b_M = Σ_k w_{Mk} x_k + b_M,   a_M = 1 / (1 + e^{−z_M})
Deep Learning
• Deep Neural Network
̶ Cascading the neurons to form a neural network
Each layer consists of a linear classifier and an activation function; stacking such layers (linear classifiers + activations) yields the neural network.
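A minimal numpy sketch of the forward pass described above: each layer applies a linear map (Wx + b) followed by a sigmoid activation, and layers are cascaded. The layer sizes are arbitrary illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One layer = linear classifier (Wx + b) followed by a non-linear activation
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                          # input feature vector (d = 4)

# Two cascaded layers forming a small neural network (sizes are illustrative)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)   # layer 1: 4 -> 5
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)   # layer 2: 5 -> 3

h = layer(x, W1, b1)     # hidden activations
out = layer(h, W2, b2)   # network output
print(out.shape)         # (3,)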
Parametric Approach
• (Review) Unit 3 ML basics and classification
Input image 𝒙: array of 32×32×3 numbers (3072 numbers total)
f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃, with parameters (or weights) 𝐖 (10×3072) and bias 𝒃 (10×1)
Output: 10 numbers giving the class scores
Parametric Approach
Example with an image with 4 pixels, and 3 classes (cat/dog/ship)
• (Review) Unit 3 recognition
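The 4-pixel, 3-class example can be written directly in numpy; the pixel values, weights, and biases below are illustrative placeholders, and only the shapes follow the slide (W is 3×4, b is 3×1).

import numpy as np

# Toy image with 4 pixels, flattened into a vector
x = np.array([56.0, 231.0, 24.0, 2.0])

# Illustrative weights and biases: one row / one bias per class (cat, dog, ship)
W = np.array([[ 0.2, -0.5,  0.1,  2.0],
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b   # f(x, W) = Wx + b -> 3 class scores
print(scores)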
What do we need now?
• Functions to measure the error between the output of a classifier and the given target value.
̶ Let’s talk about designing error (a.k.a. loss) functions!
Changjae Oh
Computer Vision
– Loss functions –
Semester 1, 22/23
Loss function
• Loss function
̶ quantifies our unhappiness with the scores across the training data.
• Type of loss function
̶ Hinge loss
̶ Cross-entropy loss
̶ Log likelihood loss
̶ Regression loss
Loss Function: Hinge Loss
• Binary hinge loss (= binary SVM loss)
  L_i = max(0, 1 − y_i · s),  where s = 𝒘ᵀ𝒙_i + b and y_i = ±1 for positive/negative samples
• Hinge loss (= multiclass SVM loss),  C: the number of classes (> 2)
  L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1),  where 𝒔 = 𝐖𝒙_i + 𝒃
  𝒙_i: input data (e.g. image),  y_i: class label (integer, 1 ≤ y_i ≤ C)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W, the scores f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃 are as shown in the figure.
Given a dataset of examples {(𝒙_i, y_i)}_{i=1}^N, where y_i is the class label (integer),
the loss over the dataset is the average of the per-example losses:
L = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i )
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W, the scores f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃 are as shown in the figure.
Multiclass SVM loss (= hinge loss), where the score vector 𝒔 = f(𝒙_i, 𝐖):
L_i = Σ_{j≠y_i} { 0 if s_{y_i} ≥ s_j + 1;  s_j − s_{y_i} + 1 otherwise }
    = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W, the scores f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃 are as shown in the figure.
Multiclass SVM loss (= hinge loss): L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)

Example 1 (correct-class score 3.2):
L_1 = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1)
    = max(0, 2.9) + max(0, −3.9)
    = 2.9 + 0 = 2.9

Example 2 (correct-class score 4.9):
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1)
    = max(0, −2.6) + max(0, −1.9)
    = 0 + 0 = 0

Example 3 (correct-class score −3.1):
L_3 = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6 = 12.9

Loss over the full dataset is the average:
L = (2.9 + 0 + 12.9)/3 ≈ 5.27
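The per-example computations above can be checked with a small numpy function; the score columns below are read off the slide's example (one column per training image), with the correct-class index for each column.

import numpy as np

def multiclass_hinge(scores, y):
    # L_i = sum_{j != y} max(0, s_j - s_y + 1)
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0                       # do not count the correct class
    return margins.sum()

# Class scores from the slide's example, one column per training image
S = np.array([[ 3.2, 1.3,  2.2],
              [ 5.1, 4.9,  2.5],
              [-1.7, 2.0, -3.1]])
labels = [0, 1, 2]                         # correct class index per column

losses = [multiclass_hinge(S[:, i], y) for i, y in enumerate(labels)]
print(losses)              # approximately [2.9, 0.0, 12.9]
print(np.mean(losses))     # approximately 5.27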
Loss Function: Log Likelihood Loss
• Log likelihood loss
L_i = −log p_j, where j satisfies z_{ij} = 1 (i.e. j = y_i)

Ex) Suppose the i-th image belongs to class 2, C = 10, and p_2 = 0.7. Then L_i = −log 0.7.

𝒛_i: class label vector for the i-th image (C × 1 vector, z_{ij} = 1 when j = y_i and 0 otherwise)
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
y_i: class label (integer, 1 ≤ y_i ≤ C)
Loss Function: Cross-entropy Loss
• Cross-entropy loss
L_i = −Σ_j [ z_{ij} log p_j + (1 − z_{ij}) log(1 − p_j) ]

Ex) Suppose the i-th image belongs to class 2, C = 10, and 𝒑 = (0.1, 0.7, 0.2, 0, …). Then
L_i = −log(1 − 0.1) − log 0.7 − log(1 − 0.2) − ⋯

𝒛_i: class label vector for the i-th image (C × 1 vector, z_{ij} = 1 when j = y_i and 0 otherwise)
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
y_i: class label (integer, 1 ≤ y_i ≤ C)
Softmax Activation Function
• Softmax activation function
̶ Scores = unnormalized log probabilities of the classes; probabilities are computed from the scores as below.
̶ Probability of the class label being k for an image 𝒙_i (softmax activation):
  P(Y = k | X = 𝒙_i) = p_k = e^{s_k} / Σ_j e^{s_j},  where 𝒔 = 𝐖𝒙_i + 𝒃
  y_i: class label (integer, 1 ≤ y_i ≤ C)
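A short sketch of the softmax activation followed by the log-likelihood loss (negative log probability of the true class); the scores below are illustrative numbers, not taken from a slide.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # subtract the max for numerical stability
    return e / e.sum()

def log_likelihood_loss(p, y):
    # L_i = -log p_y  (the loss used by the 'softmax classifier')
    return -np.log(p[y])

s = np.array([3.2, 5.1, -1.7])   # unnormalized class scores (illustrative)
p = softmax(s)                   # class probabilities, sum to 1
print(p, p.sum())
print(log_likelihood_loss(p, y=1))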
Softmax + Log Likelihood Loss
The combination of softmax activation + log likelihood loss is often called the 'softmax classifier'.
Softmax + Cross-entropy Loss
Example: L_i = −log(0.13) − log(1 − 0.87) − log(1 − 0.0)
Loss Function: Regression Loss
• Regression loss
̶ Using L1 or L2 norms
̶ Widely used in pixel-level prediction (e.g. image denoising)
L1 loss:  L_i = |𝒚_i − 𝒔_i| = |0 − 0.1| + |1 − 0.7| + |0 − 0.2| + ⋯  (example)
L2 loss:  L_i = (𝒚_i − 𝒔_i)²
Regularization
L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1),  𝒔 = 𝐖𝒙 + 𝒃
Suppose that we found a W such
that L = 0. Is this W unique?
No! 2W also has L = 0!
With the original W (example 2):
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1)
    = max(0, −2.6) + max(0, −1.9)
    = 0 + 0 = 0
With W twice as large:
L_2 = max(0, 2.6 − 9.8 + 1) + max(0, 4.0 − 9.8 + 1)
    = max(0, −6.2) + max(0, −4.8)
    = 0 + 0 = 0
Regularization
L = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i ) + λR(𝐖)
̶ Data loss: model predictions should match the training data.
̶ Regularization: the model should be "simple", so that it avoids overfitting and works on test data.
Minimizing the data loss only vs. minimizing data + regularization loss
Regularization
• L2 regularization: R(𝐖) = Σ_{k,l} W_{k,l}²
• L1 regularization: R(𝐖) = Σ_{k,l} |W_{k,l}|
• Elastic net (L1 + L2): R(𝐖) = Σ_{k,l} ( βW_{k,l}² + |W_{k,l}| )
• Max norm regularization: constrain ‖𝒘_j‖ < c for all j
• Dropout (will see later)
• Batch normalization, stochastic depth (will see later)

L = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i ) + λR(𝐖),  λ: regularization strength (hyperparameter)
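A sketch of how the L2 penalty enters the total objective; the per-example losses and the weight matrix below are illustrative numbers only.

import numpy as np

def l2_reg(W):
    # R(W) = sum over all entries of W_{k,l}^2
    return np.sum(W ** 2)

def total_loss(per_example_losses, W, lam):
    # L = (1/N) * sum_i L_i  +  lambda * R(W)
    return np.mean(per_example_losses) + lam * l2_reg(W)

W = np.array([[0.2, -0.5, 0.1],
              [1.5,  1.3, 2.1]])
print(total_loss([2.9, 0.0, 12.9], W, lam=0.1))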
Optimization: Gradient Descent
• Gradient Descent
̶ The simplest approach to minimizing a loss function
̶ 𝛼: step size (a.k.a. learning rate)
𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖
Optimization: Gradient Descent
Optimization: Stochastic Gradient Descent (SGD)
L(𝐖) = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i ) + λR(𝐖)
∂L/∂𝐖 = (1/N) Σ_i ∂L_i( f(𝒙_i, 𝐖), y_i )/∂𝐖 + λ ∂R(𝐖)/∂𝐖

The full sum is too expensive when N is large!
Instead, approximate the sum using a minibatch of 32 / 64 / 128 / 256 examples.
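A schematic minibatch SGD loop, not tied to any particular model; loss_and_grad is a hypothetical function assumed to return the average data loss and its gradient with respect to W on the given minibatch.

import numpy as np

def sgd(W, X, Y, loss_and_grad, lr=1e-3, lam=1e-4, batch_size=128, steps=1000):
    # Vanilla minibatch SGD. loss_and_grad(W, xb, yb) is assumed to return
    # (data loss, gradient of the data loss w.r.t. W) on the minibatch.
    N = X.shape[0]
    for _ in range(steps):
        idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        loss, grad = loss_and_grad(W, X[idx], Y[idx])
        grad = grad + 2 * lam * W      # add the gradient of the L2 regularizer
        W = W - lr * grad              # W^(t+1) = W^(t) - alpha * dL/dW
    return W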
Changjae Oh
Computer Vision
- Backpropagation-
Semester 1, 2021
Backpropagation
• A widely used algorithm for training feedforward neural networks.
• A way of computing gradients of expressions through recursive application of the chain rule.
̶ Backpropagation computes the gradient of the loss function with respect to the weights of the network (model) for a single input–output example.
Derivative
• Optimization using derivative
̶ 1st-order derivative f′(x): the slope of the function, indicating the direction in which the value increases
→ The minimum of the objective function may therefore lie in the direction of −f′(x).
→ Gradient descent algorithm: move in the direction of −f′(x), i.e.
  𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖
Derivative
• Partial derivative
̶ Derivatives of functions with multiple variables
̶ Gradient: the vector of partial derivatives, ∇f = ( ∂f/∂x_1, …, ∂f/∂x_d )ᵀ
Derivative
• Jacobian matrix
̶ 1st order partial derivative matrix for 𝐟: ℝ𝑑 ↦ ℝ𝑚
• Hessian matrix
̶ 2nd order partial derivative matrix
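For reference, the standard definitions of the two matrices named above, written out explicitly (the Hessian is for a scalar-valued f: ℝ^d → ℝ):

\[
\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_d} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_d}
\end{pmatrix} \in \mathbb{R}^{m \times d},
\qquad
\mathbf{H} =
\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d^2}
\end{pmatrix} \in \mathbb{R}^{d \times d}
\]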
Derivative
• Chain rule
f(x) = g(h(x))  →  f′(x) = g′(h(x)) · h′(x)
f(x) = g(h(i(x)))  →  f′(x) = g′(h(i(x))) · h′(i(x)) · i′(x)
Why are we talking about derivatives?
• Gradient Descent
̶ The simplest approach to minimizing a loss function
̶ 𝛼: step size (a.k.a. learning rate)
𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖
Derivative
• Example) Applying chain rule to single-layer perceptron
̶ Example of composite function
̶ Back-propagation: use the chain rule to compute
Pipeline: 𝒙 → 𝐖𝒙 + 𝒃 → softmax → 𝒑 (n × 1), compared with the ground truth (n × 1) via the (log-)likelihood loss.
Analytic Gradient: Linear Equation
𝒔 = 𝐖𝒙 + 𝒃, where 𝐖 is an n × d matrix with entries w_{jk} and rows 𝒘_jᵀ

∂𝒔/∂𝒘_1 = [ 𝒙 𝟎 𝟎 ⋯ 𝟎 ] ∈ ℜ^{d×n}   (𝒙 in the 1st column)
∂𝒔/∂𝒘_j = [ 𝟎 ⋯ 𝒙 ⋯ 𝟎 ] ∈ ℜ^{d×n}   (𝒙 in the j-th column)
∂𝒔/∂𝒃 = 𝐈 ∈ ℜ^{n×n}
Analytic Gradient: Linear Equation
∂𝒔/∂𝒙 = [ 𝒘_1 𝒘_2 ⋯ 𝒘_n ] ∈ ℜ^{d×n} (i.e. 𝐖ᵀ in the layout used above), where 𝒘_jᵀ is the j-th row of 𝐖

𝐖 = [ w_11 w_12 ⋯ w_1d ; w_21 w_22 ⋯ w_2d ; ⋯ ; w_n1 w_n2 ⋯ w_nd ],  𝒔 = 𝐖𝒙 + 𝒃
Analytic Gradient: Sigmoid Function
• Sigmoid function
For a scalar x:
σ(x) = 1 / (1 + e^{−x})
dσ(x)/dx = e^{−x} / (1 + e^{−x})² = (1 − σ(x)) σ(x)

Similarly, for a vector 𝒔 ∈ ℜ^{n×1}:
∂σ(𝒔)/∂𝒔 = diag( (1 − σ(s_j)) σ(s_j) ),  for j = 1, …, n
          = [ (1 − σ(s_1))σ(s_1)  ⋯  0 ;  ⋮ ⋱ ⋮ ;  0  ⋯  (1 − σ(s_n))σ(s_n) ]
Analytic Gradient: Softmax Activation Function
• Softmax function
• 1st order derivative of softmax function
̶ Score function 𝒔 → probability 𝒑 = e^{𝒔} / Σ_j e^{s_j}  (in vector form, p_k = e^{s_k} / Σ_j e^{s_j})
̶ Jacobian:
  ∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{s_j} − e^{𝒔}(e^{𝒔})ᵀ ) / ( Σ_j e^{s_j} )²
Analytic Gradient: Softmax Activation Function
Element-wise, with D = ∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{s_j} − e^{𝒔}(e^{𝒔})ᵀ ) / ( Σ_j e^{s_j} )²:

Diagonal entries:       D_aa = e^{s_a}( Σ_j e^{s_j} − e^{s_a} ) / ( Σ_j e^{s_j} )² = p_a(1 − p_a)
Off-diagonal (a ≠ b):   D_ab = −e^{s_a} e^{s_b} / ( Σ_j e^{s_j} )² = −p_a p_b

Compactly: D_ab = p_a( δ_ab − p_b ),  where δ_ab = 1 if a = b and 0 otherwise.
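A quick numerical sanity check of the formula D_ab = p_a(δ_ab − p_b), comparing the analytic Jacobian against finite differences (my own verification sketch, not from the slides).

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([1.0, -0.5, 2.0, 0.3])
p = softmax(s)

# Analytic Jacobian: D_ab = p_a * (delta_ab - p_b)
D = np.diag(p) - np.outer(p, p)

# Finite-difference approximation of dp/ds, column by column
eps = 1e-6
D_num = np.zeros_like(D)
for b in range(len(s)):
    e_b = np.zeros_like(s)
    e_b[b] = eps
    D_num[:, b] = (softmax(s + e_b) - softmax(s - e_b)) / (2 * eps)

print(np.max(np.abs(D - D_num)))   # should be tiny (around 1e-10)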
Analytic Gradient: Hinge Loss
• 1st order derivative of binary hinge loss
For simplicity of notation, the index
of training image i is omitted here
L = max(0, 1 − y·s) = { 1 − y·s  if 1 − y·s > 0;  0 otherwise }

∂L/∂s = { −y  if 1 − y·s > 0;  0 otherwise }

where s = 𝒘ᵀ𝒙 + b and y = ±1 for positive/negative samples
Analytic Gradient: Hinge Loss
• 1st order derivative of hinge loss
For simplicity of notation, the index
of training image i is omitted here
L = Σ_{j≠y} max(0, s_j − s_y + 1),  𝒔 = 𝐖𝒙 + 𝒃,  y: class label (integer, 1 ≤ y ≤ n)

∂L/∂s_j = 𝟙[ s_j − s_y + 1 > 0 ]              for j ≠ y
∂L/∂s_y = −Σ_{j≠y} 𝟙[ s_j − s_y + 1 > 0 ]     for j = y

where 𝟙[F] = 1 if F is true and 0 otherwise
Analytic Gradient: Log Likelihood Loss
For simplicity of notation, the index
of training image i is omitted here
L = −log p_y, where y satisfies z_y = 1
∂L/∂𝒑 = (0, …, 0, −1/p_y, 0, …, 0)ᵀ  (the only nonzero entry is at position y)

Ex) Suppose the i-th image belongs to class 2 and n = 10 → y = 2, so ∂L/∂𝒑 = (0, −1/p_2, 0, …, 0)ᵀ

y: class label for the i-th image (1 ≤ y ≤ n)
𝒛: class label vector for the i-th image, 𝒛 = (z_1 z_2 … z_n)ᵀ with z_y = 1 and z_{k≠y} = 0
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
Analytic Gradient: Cross-entropy Loss
L = −Σ_j [ z_j log p_j + (1 − z_j) log(1 − p_j) ]
∂L/∂p_j = −z_j/p_j + (1 − z_j)/(1 − p_j), i.e.
∂L/∂𝒑 = ( 1/(1 − p_1), …, −1/p_y, …, 1/(1 − p_n) )ᵀ

Ex) Suppose the i-th image belongs to class 2, n = 10 → y = 2, and 𝒑 = (0.1, 0.7, 0.2, …):
∂L/∂𝒑 = ( 1/(1 − 0.1), −1/0.7, 1/(1 − 0.2), … )ᵀ

y: class label for the i-th image (1 ≤ y ≤ n)
𝒛: class label vector for the i-th image, 𝒛 = (z_1 z_2 … z_n)ᵀ with z_y = 1 and z_{k≠y} = 0
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
For simplicity of notation, the index of the training image i is omitted here.
Analytic Gradient: Regression Loss
• Regression loss
L = (𝒚 − 𝒔)²
∂L/∂𝒔 = −2(𝒚 − 𝒔),  where 𝒔 = 𝐖𝒙 + 𝒃
For simplicity of notation, the index of the training image i is omitted here.
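Putting the pieces together, a sketch of one gradient step for a single-layer softmax classifier with the log-likelihood loss. Applying the chain rule through the softmax and the loss gives the well-known simplification ∂L/∂𝒔 = 𝒑 − 𝒛 (the slides derive ∂L/∂𝒑 and ∂𝒑/∂𝒔 separately; combining them yields this form). All sizes and values below are illustrative.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 5, 3                            # input dimension and number of classes
W, b = 0.01 * rng.standard_normal((n, d)), np.zeros(n)

x = rng.standard_normal(d)             # one training example
y = 1                                  # ground-truth class index
z = np.zeros(n)
z[y] = 1.0                             # one-hot label vector

# Forward pass: s = Wx + b, p = softmax(s), L = -log p_y
s = W @ x + b
p = softmax(s)
L = -np.log(p[y])

# Backward pass via the chain rule: dL/ds = p - z, dL/dW = (dL/ds) x^T, dL/db = dL/ds
ds = p - z
dW = np.outer(ds, x)
db = ds

# One gradient-descent step
alpha = 0.1
W -= alpha * dW
b -= alpha * db
print(L, -np.log(softmax(W @ x + b)[y]))   # the loss should decrease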
Why are we talking about derivatives?
• Gradient Descent
̶ The simplest approach to minimizing a loss function
̶ 𝛼: step size (a.k.a. learning rate)
𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖