EBU7240 Computer Vision
Introduction to Deep Learning
Semester 1, 2021
Changjae Oh
• Machine learning basics
• Introduction to deep learning
• Linear classifier
What is Machine Learning?
• Learning = looking for a function
http://www.slideshare.net/ckmarkohchang
Machine Learning: Basic Concept
• Prediction task
̶ Regression: returns a specific value
̶ Classification: returns a class label
Regression example
What is the value of y when x = 10?
Classification example
For a new image, what class does it belong to?
Machine Learning: Basic Concept
• Training data
𝕏={𝐱1= 2.0,𝐱2= 4.0,𝐱3= 6.0,𝐱4= 8.0}
𝕐 = {𝑦_1 = 3.0, 𝑦_2 = 4.0, 𝑦_3 = 5.0, 𝑦_4 = 6.0}
Training the model with data
– Finding optimal parameters
– Starting from random values, the parameters are updated iteratively until the optimal parameters are found
Optimal parameters: w = 0.5, b = 2.0
– Minimize the error for new samples (test set): generalization
– Generalization refers to high performance on the test set
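As a concrete illustration of this training process, the sketch below fits y = wx + b to the four training pairs above by gradient descent on the mean squared error; it converges near the optimal parameters w = 0.5, b = 2.0. (A minimal sketch; the learning rate and iteration count are illustrative choices, not from the slides.)

```python
import numpy as np

# Training data from the slide
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 4.0, 5.0, 6.0])

# Start from random parameters
rng = np.random.default_rng(0)
w, b = rng.normal(size=2)

lr = 0.01                      # learning rate (illustrative)
for _ in range(5000):
    pred = w * x + b
    err = pred - y
    # Gradients of the mean squared error with respect to w and b
    dw = 2 * np.mean(err * x)
    db = 2 * np.mean(err)
    w -= lr * dw
    b -= lr * db

print(w, b)                    # ~0.5, ~2.0
```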
Machine Learning: Basic Concept
• Multi-dimensional feature space
̶ d-dimensional data: 𝐱 = (x_1, x_2, …, x_d)ᵀ (Note: d = 784 for MNIST data)
• Linear classifier for d-dimensional data
Linear (1st-order) classifier
Widely used in machine learning
# of variables = 𝑑 + 1
2nd-order (quadratic) classifier
# of variables = (d+1)(d+2)/2
y = w_1 x_1² + w_2 x_2² + ⋯ + w_d x_d² + w_{d+1} x_1 x_2 + ⋯ + w_{d(d+1)/2} x_{d−1} x_d + w_{d(d+1)/2+1} x_1 + ⋯ + w_{d(d+1)/2+d} x_d + w_{(d+1)(d+2)/2}
Machine Learning: Basic Concept
• Feature space transformation
̶ Map a linearly non-separable feature space into a separable space
̶ Toy example: original feature space → transformed feature space, which is linearly separable
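A minimal sketch of such a transformation, assuming a toy 2-D dataset (not from the slides) where one class lies inside a circle and the other outside: adding the squared radius as a third feature makes the classes separable by a linear boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: class 0 inside the unit circle, class 1 outside (not linearly separable in 2-D)
radius = np.concatenate([rng.uniform(0.0, 0.8, 100), rng.uniform(1.2, 2.0, 100)])
angle = rng.uniform(0.0, 2 * np.pi, 200)
X = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
y = np.concatenate([np.zeros(100), np.ones(100)])

# Feature space transformation: (x1, x2) -> (x1, x2, x1^2 + x2^2)
X_new = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

# In the new space the classes are separated by the plane x1^2 + x2^2 = 1
pred = (X_new[:, 2] > 1.0).astype(float)
print("accuracy in transformed space:", (pred == y).mean())   # 1.0
```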
Machine Learning: Basic Concept
• Representation learning
̶ Aims to find a good feature space automatically
̶ Deep learning finds a hierarchical feature space by using neural networks with multiple hidden layers.
̶ Early hidden layers capture low-level features (edges, corner points, etc.), while later layers capture high-level features (faces, wheels, etc.).
Data for Machine Learning
• The quality of the training data
̶ To increase estimation accuracy, diverse and sufficient data should be collected for a given application.
̶ Ex) After learning from a database with a frontal face only, the recognition accuracy of side face will be degraded.
• MNIST database
̶ Handwritten digit database
̶ Training data: 60,000
̶ Test data: 10,000
Data for Machine Learning
• Database size vs. training accuracy
̶ Ex) MNIST: 28×28 binary image
→ The total number of possible binary samples is 2^784, but MNIST has only 60,000 training images.
Data for Machine Learning
• How does a small database achieve high performance?
̶ In the feature space, real data occupies only a very small subspace; samples outside it are extremely unlikely to occur.
̶ Manifold assumption
Data changes smoothly according to certain rules (e.g., gradual changes in rotation or stroke thickness of a handwritten digit)
Training Model: Under-fitting vs. Over-fitting
• Under-fitting
̶ The model capacity is too small to fit the data properly.
̶ A model with a higher order can be used.
[Figure: polynomial fits of 1st, 2nd, 3rd, 4th, and 12th order]
Training Model: Under-fitting vs. Over-fitting
• Over-fitting
̶ The 12th-order polynomial model approximates the training set perfectly.
̶ But for "new" data, prediction becomes a big problem.
̶ Since the model capacity is large, the training process also fits the noise in the data. A model with appropriate capacity should be selected.
[Figure: predicted data vs. actual data under the 12th-order model]
Training Model: Under-fitting vs. Over-fitting
• The 1st- and 2nd-order models show poor performance on both the training and the test set.
• The 12th-order model shows high performance on the training set but low performance on the test set → low generalization ability.
• The 3rd- and 4th-order models are lower than the 12th-order model on the training set, but show high performance on the test set → higher generalization ability (a code sketch illustrating this comparison follows below).
[Figure: training vs. test performance of the 1st-, 2nd-, 3rd-, 4th-, and 12th-order models]
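A minimal sketch of this comparison, assuming a noisy 1-D toy dataset (not the slides' data): low-order fits under-fit, the 12th-order fit typically over-fits, and the 3rd/4th-order fits generalize best.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative): a noisy sine curve
x_train = rng.uniform(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for order in [1, 2, 3, 4, 12]:
    # High-order fits may trigger a conditioning warning; that is part of the point
    coeffs = np.polyfit(x_train, y_train, order)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# Typically: orders 1-2 under-fit (high train and test error), order 12 over-fits
# (near-zero train error but larger test error), orders 3-4 generalize best.
```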
Spectrum of supervision
Computer vision
Supervised learning · Semi-supervised learning · Unsupervised learning · Reinforcement learning
Spectrum of supervision
• Supervised learning
̶ Both the feature vector 𝕏 and the output 𝕐 are given.
̶ Regression and classification problems
• Unsupervised learning
̶ The feature vector 𝕏 is given, but the output 𝕐 is not given.
̶ Ex) Clustering, density estimation, feature space conversion
Spectrum of supervision • Reinforcement learning
̶ The output is given, but in a different form from supervised learning.
̶ Ex) In a game, a score (credit) is received only once the game is over: +1 for a win, −1 otherwise.
̶ The credit must then be distributed to each sample (move) of the game.
• Semi-supervised Learning
̶ Some of data have both 𝕏 and 𝕐, but others have only 𝕏.
̶ It is becoming important, since it is easy to collect 𝕏, but obtaining 𝕐 requires manual labelling.
Introduction to Deep Learning
• Deep learning
̶ is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
Deep learning success
• Image classification
• Machine translation
• Speech recognition
• Speech synthesis
• Game playing
• .. and many, many more
Why deep learning?
• Hand-crafted features vs. Learned features
• Large datasets
• GPU hardware advances + Price decreases
• Improved techniques (algorithm)
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ End-to-end training: what each function should do is learned automatically
̶ Deep learning usually refers to neural-network-based models
“Fruit, Red, Spherical”
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ Representation Learning: learning features/representations
̶ Deep Learning: learning (multi-level) features and an output
Machine Learning vs. Deep Learning
• Deep vs. Shallow: Image Recognition
̶ Shallow model using machine learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
• Deep vs. Shallow: Image Recognition
̶ Deep model using deep learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
Machine Learning = Feature descriptor + Classifier
̶ Feature: hand-crafted, domain-specific knowledge
Describing your data with features that a computer can understand
Ex) SIFT, Bag-of-Words (BoW), Histogram of Oriented Gradients (HOG)
̶ Classifier: optimizing the classifier weights on the features
Ex) Nearest Neighbor (NN), Support Vector Machine (SVM), Random Forest (RF)
Machine Learning vs. Deep Learning
Deep Learning = Feature descriptor + Classifier
̶ Feature: representation learned by the machine (automatically learned internal knowledge)
Neural-network-based model
̶ Classifier: optimizing the classifier weights on the features
A series of linear classifiers and non-linear activations + Loss function
Deep Learning
• A single neuron
𝑧 = 𝒘ᵀ𝒙 + 𝑏 = Σ_{k=1}^{d} 𝑤_k 𝑥_k + 𝑏
Deep Learning
• A single layer with multiple neurons
𝑧_1 = 𝒘_1ᵀ𝒙 + 𝑏_1 = Σ_k 𝑤_{1k} 𝑥_k + 𝑏_1,   𝑦_1 = 1 / (1 + e^{−𝑧_1})
⋮
𝑧_M = 𝒘_Mᵀ𝒙 + 𝑏_M = Σ_k 𝑤_{Mk} 𝑥_k + 𝑏_M,   𝑦_M = 1 / (1 + e^{−𝑧_M})
Deep Learning
• Deep Neural Network
̶ Cascading the neurons to form a neural network
Each layer consists of the linear classifier and activation function
Linear classifier
Neural Network
Linear classifiers
Parametric Approach
• (Review) Unit 3 ML basics and classification
Array of 32x32x3 numbers (3072 numbers total)
f(x,W) = Wx + b
Dimensions: f(x,W): 10×1, W: 10×3072, x: 3072×1, b: 10×1
10 numbers giving class scores
parameters (or weights)
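A minimal sketch of this parametric score function for a CIFAR-10-sized input (32×32×3 = 3072 values, 10 classes); the random weights below are placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, dim = 10, 32 * 32 * 3               # 10 classes, 3072 input values
W = 0.01 * rng.normal(size=(num_classes, dim))   # parameters (weights), 10 x 3072
b = np.zeros(num_classes)                        # bias, 10 x 1

x = rng.random(dim)                              # a flattened 32x32x3 image
scores = W @ x + b                               # f(x, W) = Wx + b
print(scores.shape)                              # (10,) -> 10 class scores
```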
Parametric Approach
• (Review) Unit 3 recognition
Example with an image of 4 pixels and 3 classes (cat/dog/ship)
What do we need now?
• Functions to measure the error between the output of a classifier and the given target value.
̶ Let’s talk about designing error (a.k.a. loss) functions!
EBU7240 Computer Vision
Changjae Oh
Loss functions
Semester 1, 2021
Loss function
• Loss function
̶ quantifies our unhappiness with the scores across the training data.
• Types of loss functions
̶ Hinge loss
̶ Cross-entropy loss
̶ Log likelihood loss
̶ Regression loss
Loss Function: Hinge Loss
• Binary hinge loss (= binary SVM loss)
𝐿_i = max(0, 1 − 𝑦_i ∙ 𝑠)
𝑠 = 𝒘T𝒙𝒊 + 𝑏
𝑦𝑖 = ±1 for positive/negative samples
• Hinge loss (= multiclass SVM loss)
̶ 𝐶: the number of classes (> 2)
𝐿_i = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
𝒔 = 𝐖𝒙𝑖 + 𝒃
𝒙𝑖: input data (e.g. image)
𝑦𝑖: class label (integer, 1 ≤ 𝑦𝑖 ≤ 𝐶)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W the scores 𝑓(𝒙,𝐖) = 𝐖𝒙 + 𝒃 are
Given a dataset of examples
{(𝒙_i, 𝑦_i)}_{i=1}^{N}
𝑦_i: class label (integer)
Loss over the dataset is the average of the per-example losses:
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W the scores 𝑓(𝒙,𝐖) = 𝐖𝒙 + 𝒃 are
Multiclass SVM loss (=hinge loss)
𝐿_i = Σ_{j≠y_i} { 0                    if 𝑠_{y_i} ≥ 𝑠_j + 1
                  𝑠_j − 𝑠_{y_i} + 1    otherwise }
    = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
where the score vector 𝒔 = 𝑓(𝒙_i, 𝐖)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W the scores 𝑓(𝒙,𝐖) = 𝐖𝒙 + 𝒃 are
Multiclass SVM loss (=hinge loss)
𝐿_i = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
L = (2.9 + 0 + 12.9)/3 = 5.27
= max(0, 5.1 – 3.2 + 1) + max(0, -1.7 – 3.2 + 1) = max(0, 2.9) + max(0, -3.9)
= 2.9 + 0 = 2.9
= max(0, 1.3 – 4.9 + 1) + max(0, 2.0 – 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
= max(0, 2.2 – (-3.1) + 1) + max(0, 2.5 – (-3.1) + 1) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9
Loss over full dataset is average
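A minimal sketch of this computation. The score matrix reproduces the worked example above, and the true class indices (0, 1, 2) are those implied by the per-image arithmetic; the result matches L ≈ 5.27.

```python
import numpy as np

# Scores from the worked example: rows = images, columns = classes
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
y = np.array([0, 1, 2])      # true class of each image (consistent with the slide's arithmetic)

def multiclass_svm_loss(scores, y, margin=1.0):
    # L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + margin)
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(len(y)), y] = 0.0          # exclude the true-class term
    return margins.sum(axis=1)

L_i = multiclass_svm_loss(scores, y)
print(L_i)          # [2.9, 0.0, 12.9]
print(L_i.mean())   # 5.266... (≈ 5.27 on the slide)
```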
Loss Function: Log Likelihood Loss
• Log likelihood loss
𝐿𝑖 = −log 𝑝𝑗 where 𝑗 satisfies 𝑧𝑖𝑗 = 1
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_C)ᵀ: class probability vector for the 𝑖-th image
(assumed to be normalized, i.e. Σ_j 𝑝_j = 1)
Suppose 𝑖𝑡h image belongs to class 2 and 𝐶 = 10.
𝒛𝒊: class label for 𝑖𝑡h image
(𝐶 × 1 vector, 𝑧𝑖𝑗 = 1 when 𝑗 = 𝑦𝑖 and 0 otherwise)
𝑦𝑖: class label (integer, 1 ≤ 𝑦𝑖 ≤ 𝐶)
Ex) 𝒛_i = (0, 1, 0, …, 0)ᵀ, 𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ → 𝐿_i = −log 0.7
Loss Function: Cross-entropy Loss
• Cross-entropy loss
𝐿_i = − Σ_{j=1}^{C} [ 𝑧_{ij} log 𝑝_j + (1 − 𝑧_{ij}) log(1 − 𝑝_j) ]
Suppose the 𝑖-th image belongs to class 2 and 𝐶 = 10.
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_C)ᵀ: class probability vector for the 𝑖-th image
(assumed to be normalized, i.e. Σ_j 𝑝_j = 1)
𝒛𝒊: class label for 𝑖𝑡h image
(𝐶 × 1 vector, 𝑧𝑖𝑗 = 1 when 𝑗 = 𝑦𝑖 and 0 otherwise)
𝑦𝑖: class label (integer, 1 ≤ 𝑦𝑖 ≤ 𝐶)
Ex) 𝒛_i = (0, 1, 0, …, 0)ᵀ, 𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ → 𝐿_i = −log(1 − 0.1) − log 0.7 − log(1 − 0.2)
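A minimal sketch computing both losses for the example above (true class 2, probability vector (0.1, 0.7, 0, …, 0, 0.2)); natural logarithms are assumed.

```python
import numpy as np

C = 10
p = np.zeros(C)
p[0], p[1], p[-1] = 0.1, 0.7, 0.2      # probability vector from the slide example
z = np.zeros(C)
z[1] = 1.0                             # true class is class 2 (index 1 in 0-based indexing)

# Log likelihood loss: -log p_j for the true class j
log_likelihood_loss = -np.log(p[z == 1.0]).item()

# Cross-entropy loss: -sum_j [ z_j log p_j + (1 - z_j) log(1 - p_j) ]
eps = 1e-12                            # avoid log(0) for zero entries
cross_entropy_loss = -np.sum(z * np.log(p + eps) + (1 - z) * np.log(1 - p + eps))

print(log_likelihood_loss)   # -log 0.7 ≈ 0.357
print(cross_entropy_loss)    # -log(0.9) - log(0.7) - log(0.8) ≈ 0.685
```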
Softmax Activation Function
• Softmax activation function
̶ Scores are unnormalized log probabilities of the classes. Probabilities can be computed from the scores as below.
Probability of the class label being 𝑘 for an image 𝒙_i:
𝑝_k = e^{𝑠_k} / Σ_{j=1}^{C} e^{𝑠_j}   (softmax activation function)
𝑦_i: class label (integer, 1 ≤ 𝑦_i ≤ 𝐶)
𝒔 = 𝐖𝒙𝑖 + 𝒃
Softmax Activation Function + Log Likelihood Loss
Softmax + Log likelihood loss:
is often called ‘softmax classifier’
𝐿_i = −log( e^{𝑠_{y_i}} / Σ_{j=1}^{C} e^{𝑠_j} ) − Σ_{k=1, k≠y_i}^{C} log( 1 − e^{𝑠_k} / Σ_{j=1}^{C} e^{𝑠_j} )
Ex) 𝐿_i = −log(0.13) − log(1 − 0.87) − log(1 − 0.0) = 0.89 × 2   (using base-10 logarithms)
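A minimal sketch of this softmax classifier loss. The scores below are assumed values chosen so that the softmax probabilities come out as (0.13, 0.87, 0.00) with the true class at index 0, matching the example; base-10 logarithms are used here only so the result matches the 0.89 × 2 figure, whereas deep learning libraries normally use the natural logarithm.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # assumed class scores s = Wx + b for one image
y = 0                                  # assumed true class index, chosen so p_y ≈ 0.13

p = np.exp(scores) / np.exp(scores).sum()    # softmax activation, p ≈ [0.13, 0.87, 0.00]

# Softmax + loss as written on the slide (base-10 log to match the 0.89 x 2 figure)
L = -np.log10(p[y]) - np.sum(np.log10(1.0 - np.delete(p, y)))
print(np.round(p, 2))                  # [0.13 0.87 0.  ]
print(L)                               # ≈ 1.77  (= 0.89 * 2)
```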
Loss Function: Regression Loss
• Regression loss
Using L1 or L2 norms
Widely used in pixel-level prediction (e.g. image denoising)
𝐿_i = |𝒚_i − 𝒔_i|   (L1),    𝐿_i = (𝒚_i − 𝒔_i)²   (L2)
Ex) 𝐿_i = |𝒚_i − 𝒔_i| = |0 − 0.1| + |1 − 0.7| + ⋯ + |0 − 0.2|
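A minimal sketch of the L1 and L2 regression losses on the example vectors above.

```python
import numpy as np

y = np.zeros(10); y[1] = 1.0                          # target vector (class 2, 1-based)
s = np.zeros(10); s[0], s[1], s[-1] = 0.1, 0.7, 0.2   # predicted vector from the example

l1_loss = np.abs(y - s).sum()     # |0-0.1| + |1-0.7| + ... + |0-0.2| = 0.6
l2_loss = ((y - s) ** 2).sum()    # 0.01 + 0.09 + ... + 0.04 = 0.14
print(l1_loss, l2_loss)
```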
Regularization
𝐿_i = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
Suppose that we found a W such that L = 0. Is this W unique?
No! 2W also has L = 0!
= max(0, 1.3 – 4.9 + 1) + max(0, 2.0 – 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
With W twice as large:
= max(0, 2.6 – 9.8 + 1) + max(0, 4.0 – 9.8 + 1) = max(0, -6.2) + max(0, -4.8) = 0 + 0 = 0
Regularization
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i) + 𝜆𝑅(𝐖)
Data loss: model predictions should match the training data.
Regularization: the model should be "simple" to avoid overfitting, so it works on test data.
Minimizing data loss
Minimizing data + regularization loss
Regularization
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i) + 𝜆𝑅(𝐖)
𝜆: regularization strength (hyperparameter)
• L2 regularization: 𝑅(𝐖) = Σ_{k,l} 𝑊_{k,l}²
• L1 regularization: 𝑅(𝐖) = Σ_{k,l} |𝑊_{k,l}|
• Elastic net (L1 + L2): 𝑅(𝐖) = Σ_{k,l} (𝛽𝑊_{k,l}² + |𝑊_{k,l}|)
• Max norm regularization: ‖𝒘_j‖ < 𝑐 for all 𝑗 (the L2, L1, and elastic net terms are sketched in code after this list)
• Dropout (will see later)
• Batch normalization, stochastic depth (will see later)
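A minimal sketch of the L2, L1, and elastic net regularizers, assuming a tiny weight matrix and an illustrative regularization strength.

```python
import numpy as np

def l2_reg(W):
    # R(W) = sum_{k,l} W_{k,l}^2
    return np.sum(W ** 2)

def l1_reg(W):
    # R(W) = sum_{k,l} |W_{k,l}|
    return np.sum(np.abs(W))

def elastic_net_reg(W, beta=0.5):
    # R(W) = sum_{k,l} (beta * W_{k,l}^2 + |W_{k,l}|)
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.array([[1.0, -2.0], [0.5, 0.0]])        # tiny illustrative weight matrix
lam = 0.1                                      # regularization strength (hyperparameter)
data_loss = 5.27                               # e.g. the hinge loss computed earlier
total_loss = data_loss + lam * l2_reg(W)       # L = data loss + lambda * R(W)
print(l2_reg(W), l1_reg(W), elastic_net_reg(W), total_loss)
```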
Optimization: Gradient Descent
• Gradient Descent
̶ The simplest approach to minimizing a loss function:
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
̶ 𝛼: step size (a.k.a. learning rate)
Optimization: Gradient Descent
Optimization: Stochastic Gradient Descent (SGD)
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i) + 𝜆𝑅(𝐖)
The full sum is too expensive when N is large! Instead, approximating the sum using a minibatch of 32 / 64 / 128 / 256 examples is common.
∂𝐿/∂𝐖 = (1/𝑁) Σ_{i=1}^{N} ∂𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i)/∂𝐖 + 𝜆 ∂𝑅(𝐖)/∂𝐖
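A minimal sketch of a minibatch SGD loop under these definitions, assuming a linear softmax classifier with a natural-log cross-entropy data loss, L2 regularization, and random toy data; all sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 1024, 32, 10                        # toy dataset: N samples, D features, C classes
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)

W = 0.01 * rng.normal(size=(C, D))
b = np.zeros(C)
lr, lam, batch = 0.1, 1e-3, 64                # step size, regularization strength, minibatch size

for step in range(200):
    idx = rng.choice(N, batch, replace=False)      # sample a minibatch
    xb, yb = X[idx], y[idx]

    s = xb @ W.T + b                               # scores, shape (batch, C)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities

    data_loss = -np.log(p[np.arange(batch), yb]).mean()
    loss = data_loss + lam * np.sum(W ** 2)        # data loss + L2 regularization

    # Gradient of the minibatch loss (softmax + cross-entropy + L2)
    dscores = p.copy()
    dscores[np.arange(batch), yb] -= 1.0
    dscores /= batch
    dW = dscores.T @ xb + 2 * lam * W
    db = dscores.sum(axis=0)

    W -= lr * dW                                   # gradient descent update
    b -= lr * db
```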
EBU7240 Computer Vision
Backpropagation
Semester 1, 2021
Changjae Oh
Backpropagation
• A widely used algorithm for training feedforward neural networks.
• A way of computing gradients of expressions through recursive application of the chain rule.
̶ Backpropagation computes the gradient of the loss function with respect to the weights of the network (model) for a single input–output example.
Derivative
• Optimization using derivative
̶ 1st-order derivative
𝑓′(𝑥): the slope of the function, indicating the direction in which the value increases
→ A minimum of the objective function may lie in the direction of −𝑓′(𝑥).
→ Gradient descent moves in the direction of −𝑓′(𝑥).
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
Derivative
• Partial derivative
̶ Derivatives of functions with multiple variables
̶ Gradient: the vector of partial derivatives
Ex) ∇𝑓 = ∂𝑓/∂𝐱 = (∂𝑓/∂𝑥_1, ∂𝑓/∂𝑥_2, …)ᵀ
Derivative
• Jacobian matrix
̶ 1st order partial derivative matrix for 𝐟: R𝑑 ↦ R𝑚 Ex)
• Hessian matrix
̶ 2nd order partial derivative matrix
Derivative • Chain rule
𝑓(𝑥) = 𝑔(h(𝑥)) → 𝑓′(𝑥) = 𝑔′(h(𝑥)) h′(𝑥);    𝑓(𝑥) = 𝑔(h(𝑖(𝑥))) → 𝑓′(𝑥) = 𝑔′(h(𝑖(𝑥))) h′(𝑖(𝑥)) 𝑖′(𝑥)
Why are we talking about derivatives?
• Gradient Descent
̶ The simplest approach to minimizing a loss function:
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
̶ 𝛼: step size (a.k.a. learning rate)
Derivative
Example) Applying chain rule to single-layer perceptron
̶ Example of composite function
̶ Back-propagation: use the chain rule to compute ∂𝐿/∂𝐖 and ∂𝐿/∂𝒃
[Diagram: 𝒙 (𝑑×1) → linear layer 𝒔 = 𝐖𝒙 + 𝒃 (𝑛×1) → activation → 𝒑 (𝑛×1) → log likelihood loss 𝐿, compared against the ground truth 𝒛]
∂𝐿/∂𝒔 = (∂𝒑/∂𝒔)(∂𝐿/∂𝒑)
∂𝐿/∂𝒘_j = (∂𝒔/∂𝒘_j)(∂𝐿/∂𝒔)
∂𝐿/∂𝒃 = (∂𝒔/∂𝒃)(∂𝐿/∂𝒔) = ∂𝐿/∂𝒔
∂𝐿/∂𝐖 = [ ∂𝐿/∂𝒘_1  ∂𝐿/∂𝒘_2  ⋯  ∂𝐿/∂𝒘_n ]
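A minimal sketch verifying these chain-rule gradients numerically, assuming the activation is a softmax and the loss is the log likelihood (both derivatives appear on later slides). The analytic result ∂L/∂s = 𝒑 − 𝒛, and hence ∂L/∂W = (𝒑 − 𝒛)𝒙ᵀ in the same row layout as W, follows from combining those derivatives; the sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                                   # input dimension and number of classes
x = rng.normal(size=d)
y = 1                                         # true class index
z = np.zeros(n); z[y] = 1.0                   # one-hot ground truth

W = rng.normal(size=(n, d))
b = rng.normal(size=n)

def forward(W, b):
    s = W @ x + b                             # linear layer
    p = np.exp(s) / np.exp(s).sum()           # softmax activation
    return -np.log(p[y]), p                   # log likelihood loss

L, p = forward(W, b)

# Analytic gradients via the chain rule: dL/ds = p - z, dL/dW = (p - z) x^T, dL/db = p - z
dW = np.outer(p - z, x)
db = p - z

# Numerical check with finite differences
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(n):
    for j in range(d):
        Wp = W.copy(); Wp[i, j] += eps
        dW_num[i, j] = (forward(Wp, b)[0] - L) / eps
print(np.max(np.abs(dW - dW_num)))            # tiny: analytic and numerical gradients agree
```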
Analytic Gradient: Linear Equation
𝒔 = 𝐖𝒙 + 𝒃
∂𝑠_1/∂𝒘_1 = 𝒙,   ∂𝑠_2/∂𝒘_1 = 𝟎,  …,  ∂𝑠_n/∂𝒘_1 = 𝟎
𝑠_1 = 𝒘_1ᵀ𝒙 + 𝑏_1,  𝑠_2 = 𝒘_2ᵀ𝒙 + 𝑏_2,  …,  𝑠_n = 𝒘_nᵀ𝒙 + 𝑏_n
𝒔 = (𝑠_1 𝑠_2 ⋯ 𝑠_n)ᵀ,   𝒙 = (𝑥_1 𝑥_2 ⋯ 𝑥_d)ᵀ,   𝒃 = (𝑏_1 𝑏_2 ⋯ 𝑏_n)ᵀ
𝐖 = [𝒘_1ᵀ; 𝒘_2ᵀ; ⋮; 𝒘_nᵀ] = [𝑤_11 𝑤_12 ⋯ 𝑤_1d; 𝑤_21 𝑤_22 ⋯ 𝑤_2d; ⋮; 𝑤_n1 𝑤_n2 ⋯ 𝑤_nd]
∂𝒔/∂𝒘_1 = [𝒙 𝟎 𝟎 ⋯ 𝟎] ∈ R^{d×n}
∂𝒔/∂𝒘_j = [𝟎 ⋯ 𝒙 ⋯ 𝟎] ∈ R^{d×n}   (𝒙 in the 𝑗-th column)
∂𝒔/∂𝒃 = [1 ⋯ 0; ⋮ ⋱ ⋮; 0 ⋯ 1] = 𝐈 ∈ R^{n×n}
Analytic Gradient: Linear Equation
𝒔 = 𝐖𝒙 + 𝒃
∂𝑠_1/∂𝒙 = 𝒘_1,   ∂𝑠_2/∂𝒙 = 𝒘_2,  …,  ∂𝑠_n/∂𝒙 = 𝒘_n
∂𝒔/∂𝒙 = [𝒘_1 𝒘_2 ⋯ 𝒘_n] = 𝐖ᵀ ∈ R^{d×n}
(with 𝒔, 𝐖, 𝒙, 𝒃 defined as on the previous slide)
Analytic Gradient: Sigmoid Function
• Sigmoid function
For a scalar 𝑥:
𝜎(𝑥) = 1/(1 + e^{−𝑥}) → ∂𝜎(𝑥)/∂𝑥 = e^{−𝑥}/(1 + e^{−𝑥})² = ((1 + e^{−𝑥} − 1)/(1 + e^{−𝑥})) · (1/(1 + e^{−𝑥})) = (1 − 𝜎(𝑥))𝜎(𝑥)
Similarly, for a vector 𝒔 ∈ R^{n×1}:
𝒑 = 𝜎(𝒔) = 1/(1 + e^{−𝒔})   (element-wise)
∂𝒑/∂𝒔 = diag((1 − 𝜎(𝑠_j))𝜎(𝑠_j)),  for 𝑗 = 1, …, 𝑛
      = [ (1−𝜎(𝑠_1))𝜎(𝑠_1)  ⋯  0
          ⋮                  ⋱  ⋮
          0                  ⋯  (1−𝜎(𝑠_n))𝜎(𝑠_n) ]
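A minimal sketch checking the derivative formula (1 − σ(x))σ(x) against a central finite-difference approximation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
analytic = (1.0 - sigmoid(x)) * sigmoid(x)                   # (1 - sigma(x)) * sigma(x)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central finite difference
print(np.max(np.abs(analytic - numeric)))                    # tiny: the formula checks out
```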
Analytic Gradient: Softmax Activation Function
• Softmax function
𝑝_k = e^{𝑠_k} / Σ_{j=1}^{n} e^{𝑠_j}   (probability of class 𝑘)
𝒑 = e^{𝒔} / Σ_{j=1}^{n} e^{𝑠_j}   (in vector form; 𝒔 is the score vector)
• 1st order derivative of softmax function
∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{𝑠_j} − e^{𝒔}(e^{𝒔})ᵀ ) / (Σ_j e^{𝑠_j})²
      = 1/(Σ_j e^{𝑠_j})² · ( [ e^{𝑠_1}Σ_j e^{𝑠_j}  ⋯  0;  ⋮ ⋱ ⋮;  0  ⋯  e^{𝑠_n}Σ_j e^{𝑠_j} ]
                              − [ e^{𝑠_1}e^{𝑠_1}  ⋯  e^{𝑠_1}e^{𝑠_n};  ⋮ ⋱ ⋮;  e^{𝑠_n}e^{𝑠_1}  ⋯  e^{𝑠_n}e^{𝑠_n} ] )
Analytic Gradient: Softmax Activation Function
𝐃 = ∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{𝑠_j} − e^{𝒔}(e^{𝒔})ᵀ ) / (Σ_j e^{𝑠_j})²
For 𝑎 = 𝑏:   𝐷_{ab} = e^{𝑠_a}(Σ_j e^{𝑠_j} − e^{𝑠_a}) / (Σ_j e^{𝑠_j})² = 𝑝_a(1 − 𝑝_a)
For 𝑎 ≠ 𝑏:   𝐷_{ab} = − e^{𝑠_a}e^{𝑠_b} / (Σ_j e^{𝑠_j})² = −𝑝_a 𝑝_b
In one expression:   𝐷_{ab} = 𝑝_a(𝛿_{ab} − 𝑝_b),   where 𝛿_{ab} = 1 if 𝑎 = 𝑏 and 0 otherwise
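A minimal sketch verifying D_ab = p_a(δ_ab − p_b) against a numerical Jacobian of the softmax.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())              # subtract max for numerical stability
    return e / e.sum()

s = np.array([1.0, 2.0, 0.5, -1.0])
p = softmax(s)

# Analytic Jacobian: D_ab = p_a (delta_ab - p_b)
D = np.diag(p) - np.outer(p, p)

# Numerical Jacobian via central differences
eps = 1e-6
n = len(s)
D_num = np.zeros((n, n))
for b in range(n):
    sp, sm = s.copy(), s.copy()
    sp[b] += eps
    sm[b] -= eps
    D_num[:, b] = (softmax(sp) - softmax(sm)) / (2 * eps)

print(np.max(np.abs(D - D_num)))         # tiny: the analytic Jacobian is correct
```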
Analytic Gradient: Hinge Loss
• 1st-order derivative of the binary hinge loss
𝐿 = max(0, 1 − 𝑦 ∙ 𝑠) = { 1 − 𝑦 ∙ 𝑠   if 1 − 𝑦 ∙ 𝑠 > 0
                          0           otherwise
∂𝐿/∂𝑠 = { −𝑦   if 1 − 𝑦 ∙ 𝑠 > 0
          0    otherwise
For simplicity of notation, the index of training image i is omitted here
𝑠 = 𝒘T𝒙 + 𝑏
𝑦 = ±1 for positive/negative samples
Analytic Gradient: Hinge Loss
• 1st-order derivative of the multiclass hinge loss
For simplicity of notation, the index of training image i is omitted here
𝐿 = Σ_{j=1, j≠y}^{n} max(0, 𝑠_j − 𝑠_y + 1),   where 𝒔 = 𝐖𝒙 + 𝒃
∂𝐿/∂𝑠_j = 1[𝑠_j − 𝑠_y + 1 > 0]   for 𝑗 ≠ 𝑦
∂𝐿/∂𝑠_y = − Σ_{j≠y} 1[𝑠_j − 𝑠_y + 1 > 0]
𝑦: class label (integer, 1 ≤ 𝑦 ≤ 𝑛);   𝒔 = (𝑠_1 𝑠_2 ⋯ 𝑠_n)ᵀ;   𝐖 = [𝒘_1ᵀ; 𝒘_2ᵀ; ⋮; 𝒘_nᵀ]
1[𝐹] = 1 if 𝐹 is true, 0 otherwise
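A minimal sketch of this gradient, checked numerically on the scores of the third image from the earlier hinge-loss example (scores (2.2, 2.5, −3.1), true class index 2); all margins are strictly positive there, so the finite-difference check is valid.

```python
import numpy as np

def hinge_loss_and_grad(s, y):
    # L = sum_{j != y} max(0, s_j - s_y + 1)
    margins = s - s[y] + 1.0
    margins[y] = 0.0
    active = (margins > 0).astype(float)        # indicator 1[s_j - s_y + 1 > 0]
    loss = np.maximum(0.0, margins).sum()
    grad = active.copy()                        # dL/ds_j for j != y
    grad[y] = -active.sum()                     # dL/ds_y
    return loss, grad

s = np.array([2.2, 2.5, -3.1])                  # scores of the third image in the earlier example
y = 2                                           # its true class index
loss, grad = hinge_loss_and_grad(s, y)
print(loss, grad)                               # 12.9, [ 1.  1. -2.]

# Numerical check (valid here since no margin sits exactly at the hinge point)
eps = 1e-6
num = np.array([(hinge_loss_and_grad(s + eps * np.eye(3)[j], y)[0] - loss) / eps for j in range(3)])
print(num)                                      # ≈ [ 1.  1. -2.]
```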
Analytic Gradient: Log Likelihood Loss
For simplicity of notation, the index of training image i is omitted here
𝐿 = −log 𝑝𝑦 where 𝑦 satisfies 𝑧𝑦 = 1
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_n)ᵀ: class probability vector for the 𝑖-th image
(assumed to be normalized, i.e. Σ_j 𝑝_j = 1)
𝑦: class label for 𝑖𝑡h image (1 ≤ 𝑦 ≤ 𝑛)
𝒛: class probability for 𝑖𝑡h image
𝒛 = (𝑧1 𝑧2 … 𝑧𝑛)𝑇, 𝑧𝑦 = 1 and 𝑧𝑘≠𝑦 = 0
Suppose 𝑖𝑡h image belongs to class 2 and 𝑛 = 10. → 𝑦 = 2
𝒛 = (0, 1, 0, …, 0)ᵀ,  𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ  →  ∂𝐿/∂𝒑 = −(0, 1/0.7, 0, …, 0)ᵀ
Analytic Gradient: Cross-entropy Loss
For simplicity of notation, the index of the training image i is omitted here
𝐿 = − Σ_{j=1}^{n} [ 𝑧_j log 𝑝_j + (1 − 𝑧_j) log(1 − 𝑝_j) ]
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_n)ᵀ: class probability vector for the 𝑖-th image (assumed normalized, i.e. Σ_j 𝑝_j = 1)
𝑦: class label for the 𝑖-th image (1 ≤ 𝑦 ≤ 𝑛)
𝒛: class probability vector for the 𝑖-th image, 𝒛 = (𝑧_1 𝑧_2 … 𝑧_n)ᵀ, 𝑧_y = 1 and 𝑧_{k≠y} = 0
∂𝐿/∂𝑝_j = −𝑧_j/𝑝_j + (1 − 𝑧_j)/(1 − 𝑝_j),   i.e.   ∂𝐿/∂𝒑 = ( 1/(1 − 𝑝_1), …, −1/𝑝_y, …, 1/(1 − 𝑝_n) )ᵀ
Suppose 𝑖𝑡h image belongs to class 2 and 𝑛 = 10. → 𝑦 = 2
𝒛 = (0, 1, 0, …, 0)ᵀ,  𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ
→ ∂𝐿/∂𝒑 = ( 1/(1 − 0.1), −1/0.7, …, 1/(1 − 0.2) )ᵀ
Analytic Gradient: Regression Loss
• Regression loss
For simplicity of notation, the index of training image i is omitted here
𝒚 = (𝑦_1 𝑦_2 ⋯ 𝑦_n)ᵀ,   𝒔 = (𝑠_1 𝑠_2 ⋯ 𝑠_n)ᵀ
𝐿 = (𝒚 − 𝒔)²,   ∂𝐿/∂𝒔 = −2(𝒚 − 𝒔)
Why are we talking about derivatives? • Gradient Descent
̶ The simplest approach to minimizing a loss function:
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
̶ 𝛼: step size (a.k.a. learning rate)