EBU7240 Computer Vision
Introduction to Deep Learning
Semester 1, 2021
Changjae Oh
• Machine learning basics
• Introduction to deep learning
• Linear classifier
What is Machine Learning?
• Learning = looking for a function
http://www.slideshare.net/ckmarkohchang
Machine Learning: Basic Concept
• Prediction task
̶ Regression: returns a specific value
̶ Classification: returns a class label
Regression example
What is the value of y when x = 10?
Classification example
For a new image, what class does it belong to?
Machine Learning: Basic Concept
• Training data
𝕏={𝐱1= 2.0,𝐱2= 4.0,𝐱3= 6.0,𝐱4= 8.0}
𝕐 = {𝑦_1 = 3.0, 𝑦_2 = 4.0, 𝑦_3 = 5.0, 𝑦_4 = 6.0}
Training the model with data
– Finding optimal parameters
– Starting from random values, the parameters are updated iteratively until the optimal parameters are found
Optimal parameters: w = 0.5, b = 2.0
– Minimize the error for new samples (test set): generalization
– Generalization refers to high performance on the test set
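As a concrete illustration of this training process, the sketch below fits y = wx + b to the four training pairs above by gradient descent on the mean squared error; it converges near the optimal parameters w = 0.5, b = 2.0. (A minimal sketch; the learning rate and iteration count are illustrative choices, not from the slides.)

```python
import numpy as np

# Training data from the slide
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 4.0, 5.0, 6.0])

# Start from random parameters
rng = np.random.default_rng(0)
w, b = rng.normal(size=2)

lr = 0.01                      # learning rate (illustrative)
for _ in range(5000):
    pred = w * x + b
    err = pred - y
    # Gradients of the mean squared error with respect to w and b
    dw = 2 * np.mean(err * x)
    db = 2 * np.mean(err)
    w -= lr * dw
    b -= lr * db

print(w, b)                    # ~0.5, ~2.0
```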
Machine Learning: Basic Concept
• Multi-dimensional feature space
̶ d-dimensional data: 𝐱 = (x_1, x_2, …, x_d)ᵀ (Note: d = 784 for MNIST data)
• Linear classifier for d-dimensional data
Linear (1st-order) classifier
Widely used in machine learning
# of variables = 𝑑 + 1
2nd-order (quadratic) classifier
# of variables = (d+1)(d+2)/2
y = w_1 x_1² + w_2 x_2² + ⋯ + w_d x_d² + w_{d+1} x_1 x_2 + ⋯ + w_{d(d+1)/2} x_{d−1} x_d + w_{d(d+1)/2+1} x_1 + ⋯ + w_{d(d+1)/2+d} x_d + w_{(d+1)(d+2)/2}
Machine Learning: Basic Concept
• Feature space transformation
̶ Map a linearly non-separable feature space into a separable space
̶ Toy example: original feature space → transformed feature space, which is linearly separable
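A minimal sketch of such a transformation, assuming a toy 2-D dataset (not from the slides) where one class lies inside a circle and the other outside: adding the squared radius as a third feature makes the classes separable by a linear boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: class 0 inside the unit circle, class 1 outside (not linearly separable in 2-D)
radius = np.concatenate([rng.uniform(0.0, 0.8, 100), rng.uniform(1.2, 2.0, 100)])
angle = rng.uniform(0.0, 2 * np.pi, 200)
X = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
y = np.concatenate([np.zeros(100), np.ones(100)])

# Feature space transformation: (x1, x2) -> (x1, x2, x1^2 + x2^2)
X_new = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

# In the new space the classes are separated by the plane x1^2 + x2^2 = 1
pred = (X_new[:, 2] > 1.0).astype(float)
print("accuracy in transformed space:", (pred == y).mean())   # 1.0
```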
Machine Learning: Basic Concept
• Representation learning
̶ Aims to find a good feature space automatically
̶ Deep learning finds a hierarchical feature space by using neural networks with multiple hidden layers.
̶ Early hidden layers capture low-level features (edges, corner points, etc.), while later layers capture high-level features (faces, wheels, etc.).
Data for Machine Learning
• The quality of the training data
̶ To increase estimation accuracy, diverse and sufficient data should be collected for a given application.
̶ Ex) After learning from a database with a frontal face only, the recognition accuracy of side face will be degraded.
• MNIST database
̶ Handwritten digit database
̶ Training data: 60,000
̶ Test data: 10,000
Data for Machine Learning
• Database size vs. training accuracy
̶ Ex) MNIST: 28×28 binary image
→ The total number of possible binary samples is 2^784, but MNIST has only 60,000 training images.
Data for Machine Learning
• How does a small database achieve high performance?
̶ In the feature space, real data occupies only a very small subspace; samples outside it are extremely unlikely to occur.
̶ Manifold assumption
Data changes smoothly according to certain rules (e.g., gradual changes in rotation or stroke thickness of a handwritten digit)
Training Model: Under-fitting vs. Over-fitting
• Under-fitting
̶ The model capacity is too small to fit the data properly.
̶ A model with a higher order can be used.
[Figure: polynomial fits of 1st, 2nd, 3rd, 4th, and 12th order]
Training Model: Under-fitting vs. Over-fitting
• Over-fitting
̶ The 12th-order polynomial model approximates the training set perfectly.
̶ But for "new" data, prediction becomes a big problem.
̶ Since the model capacity is large, the training process also fits the noise in the data. A model with appropriate capacity should be selected.
[Figure: predicted data vs. actual data under the 12th-order model]
Training Model: Under-fitting vs. Over-fitting
• The 1st- and 2nd-order models show poor performance on both the training and the test set.
• The 12th-order model shows high performance on the training set but low performance on the test set → low generalization ability.
• The 3rd- and 4th-order models are lower than the 12th-order model on the training set, but show high performance on the test set → higher generalization ability (a code sketch illustrating this comparison follows below).
[Figure: training vs. test performance of the 1st-, 2nd-, 3rd-, 4th-, and 12th-order models]
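A minimal sketch of this comparison, assuming a noisy 1-D toy dataset (not the slides' data): low-order fits under-fit, the 12th-order fit typically over-fits, and the 3rd/4th-order fits generalize best.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative): a noisy sine curve
x_train = rng.uniform(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for order in [1, 2, 3, 4, 12]:
    # High-order fits may trigger a conditioning warning; that is part of the point
    coeffs = np.polyfit(x_train, y_train, order)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# Typically: orders 1-2 under-fit (high train and test error), order 12 over-fits
# (near-zero train error but larger test error), orders 3-4 generalize best.
```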
Spectrum of supervision
Computer vision
Supervised learning · Semi-supervised learning · Unsupervised learning · Reinforcement learning
Spectrum of supervision
• Supervised learning
̶ Both the feature vector 𝕏 and the output 𝕐 are given.
̶ Regression and classification problems
• Unsupervised learning
̶ The feature vector 𝕏 is given, but the output 𝕐 is not given.
̶ Ex) Clustering, density estimation, feature space conversion
Spectrum of supervision • Reinforcement learning
̶ The output is given, but in a different form from supervised learning.
̶ Ex) In a game, a score (credit) is received only once the game is over: +1 for a win, −1 otherwise.
̶ The credit must then be distributed to each sample (move) of the game.
• Semi-supervised Learning
̶ Some of data have both 𝕏 and 𝕐, but others have only 𝕏.
̶ It is becoming important, since it is easy to collect 𝕏, but obtaining 𝕐 requires manual labelling.
Introduction to Deep Learning
• Deep learning
̶ is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
Deep learning success
• Image classification
• Machine translation
• Speech recognition
• Speech synthesis
• Game playing
• .. and many, many more
Why deep learning?
• Hand-crafted features vs. Learned features
• Large datasets
• GPU hardware advances + Price decreases
• Improved techniques (algorithm)
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ End-to-end training: what each function should do is learned automatically
̶ Deep learning usually refers to neural-network-based models
“Fruit, Red, Spherical”
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ Representation Learning: learning features/representations
̶ Deep Learning: learning (multi-level) features and an output
Machine Learning vs. Deep Learning
• Deep vs. Shallow: Image Recognition
̶ Shallow model using machine learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
• Deep vs. Shallow: Image Recognition
̶ Deep model using deep learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
Machine Learning = Feature descriptor + Classifier
̶ Feature: hand-crafted, domain-specific knowledge
Describing your data with features that a computer can understand
Ex) SIFT, Bag-of-Words (BoW), Histogram of Oriented Gradients (HOG)
̶ Classifier: optimizing the classifier weights on the features
Ex) Nearest Neighbor (NN), Support Vector Machine (SVM), Random Forest (RF)
Machine Learning vs. Deep Learning
Deep Learning = Feature descriptor + Classifier
̶ Feature: representation learned by the machine (automatically learned internal knowledge)
Neural-network-based model
̶ Classifier: optimizing the classifier weights on the features
A series of linear classifiers and non-linear activations + Loss function
Deep Learning
• A single neuron
𝑧 = 𝒘ᵀ𝒙 + 𝑏 = Σ_{k=1}^{d} 𝑤_k 𝑥_k + 𝑏
Deep Learning
• A single layer with multiple neurons
𝑧_1 = 𝒘_1ᵀ𝒙 + 𝑏_1 = Σ_k 𝑤_{1k} 𝑥_k + 𝑏_1,   𝑦_1 = 1 / (1 + e^{−𝑧_1})
⋮
𝑧_M = 𝒘_Mᵀ𝒙 + 𝑏_M = Σ_k 𝑤_{Mk} 𝑥_k + 𝑏_M,   𝑦_M = 1 / (1 + e^{−𝑧_M})
Deep Learning
• Deep Neural Network
̶ Cascading the neurons to form a neural network
Each layer consists of the linear classifier and activation function
Linear classifier
Neural Network
Linear classifiers
Parametric Approach
• (Review) Unit 3 ML basics and classification
Array of 32x32x3 numbers (3072 numbers total)
f(x,W) = Wx + b
Dimensions: f(x,W): 10×1, W: 10×3072, x: 3072×1, b: 10×1
10 numbers giving class scores
parameters (or weights)
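A minimal sketch of this parametric score function for a CIFAR-10-sized input (32×32×3 = 3072 values, 10 classes); the random weights below are placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, dim = 10, 32 * 32 * 3               # 10 classes, 3072 input values
W = 0.01 * rng.normal(size=(num_classes, dim))   # parameters (weights), 10 x 3072
b = np.zeros(num_classes)                        # bias, 10 x 1

x = rng.random(dim)                              # a flattened 32x32x3 image
scores = W @ x + b                               # f(x, W) = Wx + b
print(scores.shape)                              # (10,) -> 10 class scores
```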
Parametric Approach
• (Review) Unit 3 recognition
Example with an image of 4 pixels and 3 classes (cat/dog/ship)
What do we need now?
• Functions to measure the error between the output of a classifier and the given target value.
̶ Let’s talk about designing error (a.k.a. loss) functions!
EBU7240 Computer Vision
Changjae Oh
Loss functions
Semester 1, 2021
Loss function
• Loss function
̶ quantifies our unhappiness with the scores across the training data.
• Types of loss functions
̶ Hinge loss
̶ Cross-entropy loss
̶ Log likelihood loss
̶ Regression loss
Loss Function: Hinge Loss
• Binary hinge loss (= binary SVM loss)
𝐿_i = max(0, 1 − 𝑦_i ∙ 𝑠)
𝑠 = 𝒘T𝒙𝒊 + 𝑏
𝑦𝑖 = ±1 for positive/negative samples
• Hinge loss (= multiclass SVM loss)
̶ 𝐶: the number of classes (> 2)
𝐿_i = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
𝒔 = 𝐖𝒙𝑖 + 𝒃
𝒙𝑖: input data (e.g. image)
𝑦𝑖: class label (integer, 1 ≤ 𝑦𝑖 ≤ 𝐶)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W the scores 𝑓(𝒙,𝐖) = 𝐖𝒙 + 𝒃 are
Given a dataset of examples
{(𝒙_i, 𝑦_i)}_{i=1}^{N}
𝑦_i: class label (integer)
Loss over the dataset is the average of the per-example losses:
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W the scores 𝑓(𝒙,𝐖) = 𝐖𝒙 + 𝒃 are
Multiclass SVM loss (=hinge loss)
𝐿_i = Σ_{j≠y_i} { 0                    if 𝑠_{y_i} ≥ 𝑠_j + 1
                  𝑠_j − 𝑠_{y_i} + 1    otherwise }
    = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
where the score vector 𝒔 = 𝑓(𝒙_i, 𝐖)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W the scores 𝑓(𝒙,𝐖) = 𝐖𝒙 + 𝒃 are
Multiclass SVM loss (=hinge loss)
𝐿_i = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
L = (2.9 + 0 + 12.9)/3 = 5.27
= max(0, 5.1 – 3.2 + 1) + max(0, -1.7 – 3.2 + 1) = max(0, 2.9) + max(0, -3.9)
= 2.9 + 0 = 2.9
= max(0, 1.3 – 4.9 + 1) + max(0, 2.0 – 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
= max(0, 2.2 – (-3.1) + 1) + max(0, 2.5 – (-3.1) + 1) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9
Loss over full dataset is average
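A minimal sketch of this computation. The score matrix reproduces the worked example above, and the true class indices (0, 1, 2) are those implied by the per-image arithmetic; the result matches L ≈ 5.27.

```python
import numpy as np

# Scores from the worked example: rows = images, columns = classes
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
y = np.array([0, 1, 2])      # true class of each image (consistent with the slide's arithmetic)

def multiclass_svm_loss(scores, y, margin=1.0):
    # L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + margin)
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(len(y)), y] = 0.0          # exclude the true-class term
    return margins.sum(axis=1)

L_i = multiclass_svm_loss(scores, y)
print(L_i)          # [2.9, 0.0, 12.9]
print(L_i.mean())   # 5.266... (≈ 5.27 on the slide)
```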
Loss Function: Log Likelihood Loss
• Log likelihood loss
𝐿𝑖 = −log 𝑝𝑗 where 𝑗 satisfies 𝑧𝑖𝑗 = 1
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_C)ᵀ: class probability vector for the 𝑖-th image
(assumed to be normalized, i.e. Σ_j 𝑝_j = 1)
Suppose 𝑖𝑡h image belongs to class 2 and 𝐶 = 10.
𝒛𝒊: class label for 𝑖𝑡h image
(𝐶 × 1 vector, 𝑧𝑖𝑗 = 1 when 𝑗 = 𝑦𝑖 and 0 otherwise)
𝑦𝑖: class label (integer, 1 ≤ 𝑦𝑖 ≤ 𝐶)
Ex) 𝒛_i = (0, 1, 0, …, 0)ᵀ, 𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ → 𝐿_i = −log 0.7
Loss Function: Cross-entropy Loss
• Cross-entropy loss
𝐿_i = − Σ_{j=1}^{C} [ 𝑧_{ij} log 𝑝_j + (1 − 𝑧_{ij}) log(1 − 𝑝_j) ]
Suppose the 𝑖-th image belongs to class 2 and 𝐶 = 10.
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_C)ᵀ: class probability vector for the 𝑖-th image
(assumed to be normalized, i.e. Σ_j 𝑝_j = 1)
𝒛𝒊: class label for 𝑖𝑡h image
(𝐶 × 1 vector, 𝑧𝑖𝑗 = 1 when 𝑗 = 𝑦𝑖 and 0 otherwise)
𝑦𝑖: class label (integer, 1 ≤ 𝑦𝑖 ≤ 𝐶)
Ex) 𝒛_i = (0, 1, 0, …, 0)ᵀ, 𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ → 𝐿_i = −log(1 − 0.1) − log 0.7 − log(1 − 0.2)
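A minimal sketch computing both losses for the example above (true class 2, probability vector (0.1, 0.7, 0, …, 0, 0.2)); natural logarithms are assumed.

```python
import numpy as np

C = 10
p = np.zeros(C)
p[0], p[1], p[-1] = 0.1, 0.7, 0.2      # probability vector from the slide example
z = np.zeros(C)
z[1] = 1.0                             # true class is class 2 (index 1 in 0-based indexing)

# Log likelihood loss: -log p_j for the true class j
log_likelihood_loss = -np.log(p[z == 1.0]).item()

# Cross-entropy loss: -sum_j [ z_j log p_j + (1 - z_j) log(1 - p_j) ]
eps = 1e-12                            # avoid log(0) for zero entries
cross_entropy_loss = -np.sum(z * np.log(p + eps) + (1 - z) * np.log(1 - p + eps))

print(log_likelihood_loss)   # -log 0.7 ≈ 0.357
print(cross_entropy_loss)    # -log(0.9) - log(0.7) - log(0.8) ≈ 0.685
```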
Softmax Activation Function
• Softmax activation function
̶ Scores are unnormalized log probabilities of the classes. Probabilities can be computed from the scores as below.
Probability of the class label being 𝑘 for an image 𝒙_i:
𝑝_k = e^{𝑠_k} / Σ_{j=1}^{C} e^{𝑠_j}   (softmax activation function)
𝑦_i: class label (integer, 1 ≤ 𝑦_i ≤ 𝐶)
𝒔 = 𝐖𝒙𝑖 + 𝒃
Softmax Activation Function + Log Likelihood Loss
Softmax + Log likelihood loss:
is often called ‘softmax classifier’
𝐿_i = −log( e^{𝑠_{y_i}} / Σ_{j=1}^{C} e^{𝑠_j} ) − Σ_{k=1, k≠y_i}^{C} log( 1 − e^{𝑠_k} / Σ_{j=1}^{C} e^{𝑠_j} )
Ex) 𝐿_i = −log(0.13) − log(1 − 0.87) − log(1 − 0.0) = 0.89 × 2   (using base-10 logarithms)
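A minimal sketch of this softmax classifier loss. The scores below are assumed values chosen so that the softmax probabilities come out as (0.13, 0.87, 0.00) with the true class at index 0, matching the example; base-10 logarithms are used here only so the result matches the 0.89 × 2 figure, whereas deep learning libraries normally use the natural logarithm.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # assumed class scores s = Wx + b for one image
y = 0                                  # assumed true class index, chosen so p_y ≈ 0.13

p = np.exp(scores) / np.exp(scores).sum()    # softmax activation, p ≈ [0.13, 0.87, 0.00]

# Softmax + loss as written on the slide (base-10 log to match the 0.89 x 2 figure)
L = -np.log10(p[y]) - np.sum(np.log10(1.0 - np.delete(p, y)))
print(np.round(p, 2))                  # [0.13 0.87 0.  ]
print(L)                               # ≈ 1.77  (= 0.89 * 2)
```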
Loss Function: Regression Loss
• Regression loss
Using L1 or L2 norms
Widely used in pixel-level prediction (e.g. image denoising)
𝐿_i = |𝒚_i − 𝒔_i|   (L1),    𝐿_i = (𝒚_i − 𝒔_i)²   (L2)
Ex) 𝐿_i = |𝒚_i − 𝒔_i| = |0 − 0.1| + |1 − 0.7| + ⋯ + |0 − 0.2|
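A minimal sketch of the L1 and L2 regression losses on the example vectors above.

```python
import numpy as np

y = np.zeros(10); y[1] = 1.0                          # target vector (class 2, 1-based)
s = np.zeros(10); s[0], s[1], s[-1] = 0.1, 0.7, 0.2   # predicted vector from the example

l1_loss = np.abs(y - s).sum()     # |0-0.1| + |1-0.7| + ... + |0-0.2| = 0.6
l2_loss = ((y - s) ** 2).sum()    # 0.01 + 0.09 + ... + 0.04 = 0.14
print(l1_loss, l2_loss)
```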
Regularization
𝐿_i = Σ_{j≠y_i} max(0, 𝑠_j − 𝑠_{y_i} + 1)
Suppose that we found a W such that L = 0. Is this W unique?
No! 2W also has L = 0!
= max(0, 1.3 – 4.9 + 1) + max(0, 2.0 – 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0
With W twice as large:
= max(0, 2.6 – 9.8 + 1) + max(0, 4.0 – 9.8 + 1) = max(0, -6.2) + max(0, -4.8) = 0 + 0 = 0
Regularization
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i) + 𝜆𝑅(𝐖)
Data loss: model predictions should match the training data.
Regularization: the model should be "simple" to avoid overfitting, so it works on test data.
Minimizing data loss
Minimizing data + regularization loss
Regularization
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i) + 𝜆𝑅(𝐖)
𝜆: regularization strength (hyperparameter)
• L2 regularization: 𝑅(𝐖) = Σ_{k,l} 𝑊_{k,l}²
• L1 regularization: 𝑅(𝐖) = Σ_{k,l} |𝑊_{k,l}|
• Elastic net (L1 + L2): 𝑅(𝐖) = Σ_{k,l} (𝛽𝑊_{k,l}² + |𝑊_{k,l}|)
• Max norm regularization: ‖𝒘_j‖ < 𝑐 for all 𝑗 (the L2, L1, and elastic net terms are sketched in code after this list)
• Dropout (will see later)
• Batch normalization, stochastic depth (will see later)
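A minimal sketch of the L2, L1, and elastic net regularizers, assuming a tiny weight matrix and an illustrative regularization strength.

```python
import numpy as np

def l2_reg(W):
    # R(W) = sum_{k,l} W_{k,l}^2
    return np.sum(W ** 2)

def l1_reg(W):
    # R(W) = sum_{k,l} |W_{k,l}|
    return np.sum(np.abs(W))

def elastic_net_reg(W, beta=0.5):
    # R(W) = sum_{k,l} (beta * W_{k,l}^2 + |W_{k,l}|)
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.array([[1.0, -2.0], [0.5, 0.0]])        # tiny illustrative weight matrix
lam = 0.1                                      # regularization strength (hyperparameter)
data_loss = 5.27                               # e.g. the hinge loss computed earlier
total_loss = data_loss + lam * l2_reg(W)       # L = data loss + lambda * R(W)
print(l2_reg(W), l1_reg(W), elastic_net_reg(W), total_loss)
```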
Optimization: Gradient Descent
• Gradient Descent
̶ The simplest approach to minimizing a loss function:
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
̶ 𝛼: step size (a.k.a. learning rate)
Optimization: Gradient Descent
Optimization: Stochastic Gradient Descent (SGD)
𝐿 = (1/𝑁) Σ_{i=1}^{N} 𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i) + 𝜆𝑅(𝐖)
The full sum is too expensive when N is large! Instead, approximating the sum using a minibatch of 32 / 64 / 128 / 256 examples is common.
∂𝐿/∂𝐖 = (1/𝑁) Σ_{i=1}^{N} ∂𝐿_i(𝑓(𝒙_i, 𝐖), 𝑦_i)/∂𝐖 + 𝜆 ∂𝑅(𝐖)/∂𝐖
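A minimal sketch of a minibatch SGD loop under these definitions, assuming a linear softmax classifier with a natural-log cross-entropy data loss, L2 regularization, and random toy data; all sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 1024, 32, 10                        # toy dataset: N samples, D features, C classes
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)

W = 0.01 * rng.normal(size=(C, D))
b = np.zeros(C)
lr, lam, batch = 0.1, 1e-3, 64                # step size, regularization strength, minibatch size

for step in range(200):
    idx = rng.choice(N, batch, replace=False)      # sample a minibatch
    xb, yb = X[idx], y[idx]

    s = xb @ W.T + b                               # scores, shape (batch, C)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities

    data_loss = -np.log(p[np.arange(batch), yb]).mean()
    loss = data_loss + lam * np.sum(W ** 2)        # data loss + L2 regularization

    # Gradient of the minibatch loss (softmax + cross-entropy + L2)
    dscores = p.copy()
    dscores[np.arange(batch), yb] -= 1.0
    dscores /= batch
    dW = dscores.T @ xb + 2 * lam * W
    db = dscores.sum(axis=0)

    W -= lr * dW                                   # gradient descent update
    b -= lr * db
```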
EBU7240 Computer Vision
Backpropagation
Semester 1, 2021
Changjae Oh
Backpropagation
• A widely used algorithm for training feedforward neural networks.
• A way of computing gradients of expressions through recursive application of the chain rule.
̶ Backpropagation computes the gradient of the loss function with respect to the weights of the network (model) for a single input–output example.
Derivative
• Optimization using derivative
̶ 1st-order derivative
𝑓′(𝑥): the slope of the function, indicating the direction in which the value increases
→ A minimum of the objective function may lie in the direction of −𝑓′(𝑥).
→ Gradient descent moves in the direction of −𝑓′(𝑥).
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
Derivative
• Partial derivative
̶ Derivatives of functions with multiple variables
̶ Gradient: the vector of partial derivatives
Ex) ∇𝑓 = ∂𝑓/∂𝐱 = (∂𝑓/∂𝑥_1, ∂𝑓/∂𝑥_2, …)ᵀ
Derivative
• Jacobian matrix
̶ 1st order partial derivative matrix for 𝐟: R𝑑 ↦ R𝑚 Ex)
• Hessian matrix
̶ 2nd order partial derivative matrix
Derivative • Chain rule
𝑓(𝑥) = 𝑔(h(𝑥)) → 𝑓′(𝑥) = 𝑔′(h(𝑥)) h′(𝑥);    𝑓(𝑥) = 𝑔(h(𝑖(𝑥))) → 𝑓′(𝑥) = 𝑔′(h(𝑖(𝑥))) h′(𝑖(𝑥)) 𝑖′(𝑥)
Why are we talking about derivatives?
• Gradient Descent
̶ The simplest approach to minimizing a loss function:
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
̶ 𝛼: step size (a.k.a. learning rate)
Derivative
Example) Applying chain rule to single-layer perceptron
̶ Example of composite function
̶ Back-propagation: use the chain rule to compute ∂𝐿/∂𝐖 and ∂𝐿/∂𝒃
[Diagram: 𝒙 (𝑑×1) → linear layer 𝒔 = 𝐖𝒙 + 𝒃 (𝑛×1) → activation → 𝒑 (𝑛×1) → log likelihood loss 𝐿, compared against the ground truth 𝒛]
∂𝐿/∂𝒔 = (∂𝒑/∂𝒔)(∂𝐿/∂𝒑)
∂𝐿/∂𝒘_j = (∂𝒔/∂𝒘_j)(∂𝐿/∂𝒔)
∂𝐿/∂𝒃 = (∂𝒔/∂𝒃)(∂𝐿/∂𝒔) = ∂𝐿/∂𝒔
∂𝐿/∂𝐖 = [ ∂𝐿/∂𝒘_1  ∂𝐿/∂𝒘_2  ⋯  ∂𝐿/∂𝒘_n ]
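A minimal sketch verifying these chain-rule gradients numerically, assuming the activation is a softmax and the loss is the log likelihood (both derivatives appear on later slides). The analytic result ∂L/∂s = 𝒑 − 𝒛, and hence ∂L/∂W = (𝒑 − 𝒛)𝒙ᵀ in the same row layout as W, follows from combining those derivatives; the sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                                   # input dimension and number of classes
x = rng.normal(size=d)
y = 1                                         # true class index
z = np.zeros(n); z[y] = 1.0                   # one-hot ground truth

W = rng.normal(size=(n, d))
b = rng.normal(size=n)

def forward(W, b):
    s = W @ x + b                             # linear layer
    p = np.exp(s) / np.exp(s).sum()           # softmax activation
    return -np.log(p[y]), p                   # log likelihood loss

L, p = forward(W, b)

# Analytic gradients via the chain rule: dL/ds = p - z, dL/dW = (p - z) x^T, dL/db = p - z
dW = np.outer(p - z, x)
db = p - z

# Numerical check with finite differences
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(n):
    for j in range(d):
        Wp = W.copy(); Wp[i, j] += eps
        dW_num[i, j] = (forward(Wp, b)[0] - L) / eps
print(np.max(np.abs(dW - dW_num)))            # tiny: analytic and numerical gradients agree
```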
Analytic Gradient: Linear Equation
𝒔 = 𝐖𝒙 + 𝒃
∂𝑠_1/∂𝒘_1 = 𝒙,   ∂𝑠_2/∂𝒘_1 = 𝟎,  …,  ∂𝑠_n/∂𝒘_1 = 𝟎
𝑠_1 = 𝒘_1ᵀ𝒙 + 𝑏_1,  𝑠_2 = 𝒘_2ᵀ𝒙 + 𝑏_2,  …,  𝑠_n = 𝒘_nᵀ𝒙 + 𝑏_n
𝒔 = (𝑠_1 𝑠_2 ⋯ 𝑠_n)ᵀ,   𝒙 = (𝑥_1 𝑥_2 ⋯ 𝑥_d)ᵀ,   𝒃 = (𝑏_1 𝑏_2 ⋯ 𝑏_n)ᵀ
𝐖 = [𝒘_1ᵀ; 𝒘_2ᵀ; ⋮; 𝒘_nᵀ] = [𝑤_11 𝑤_12 ⋯ 𝑤_1d; 𝑤_21 𝑤_22 ⋯ 𝑤_2d; ⋮; 𝑤_n1 𝑤_n2 ⋯ 𝑤_nd]
∂𝒔/∂𝒘_1 = [𝒙 𝟎 𝟎 ⋯ 𝟎] ∈ R^{d×n}
∂𝒔/∂𝒘_j = [𝟎 ⋯ 𝒙 ⋯ 𝟎] ∈ R^{d×n}   (𝒙 in the 𝑗-th column)
∂𝒔/∂𝒃 = [1 ⋯ 0; ⋮ ⋱ ⋮; 0 ⋯ 1] = 𝐈 ∈ R^{n×n}
Analytic Gradient: Linear Equation
𝒔 = 𝐖𝒙 + 𝒃
∂𝑠_1/∂𝒙 = 𝒘_1,   ∂𝑠_2/∂𝒙 = 𝒘_2,  …,  ∂𝑠_n/∂𝒙 = 𝒘_n
∂𝒔/∂𝒙 = [𝒘_1 𝒘_2 ⋯ 𝒘_n] = 𝐖ᵀ ∈ R^{d×n}
(with 𝒔, 𝐖, 𝒙, 𝒃 defined as on the previous slide)
Analytic Gradient: Sigmoid Function
• Sigmoid function
For a scalar 𝑥:
𝜎(𝑥) = 1/(1 + e^{−𝑥}) → ∂𝜎(𝑥)/∂𝑥 = e^{−𝑥}/(1 + e^{−𝑥})² = ((1 + e^{−𝑥} − 1)/(1 + e^{−𝑥})) · (1/(1 + e^{−𝑥})) = (1 − 𝜎(𝑥))𝜎(𝑥)
Similarly, for a vector 𝒔 ∈ R^{n×1}:
𝒑 = 𝜎(𝒔) = 1/(1 + e^{−𝒔})   (element-wise)
∂𝒑/∂𝒔 = diag((1 − 𝜎(𝑠_j))𝜎(𝑠_j)),  for 𝑗 = 1, …, 𝑛
      = [ (1−𝜎(𝑠_1))𝜎(𝑠_1)  ⋯  0
          ⋮                  ⋱  ⋮
          0                  ⋯  (1−𝜎(𝑠_n))𝜎(𝑠_n) ]
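A minimal sketch checking the derivative formula (1 − σ(x))σ(x) against a central finite-difference approximation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
analytic = (1.0 - sigmoid(x)) * sigmoid(x)                   # (1 - sigma(x)) * sigma(x)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central finite difference
print(np.max(np.abs(analytic - numeric)))                    # tiny: the formula checks out
```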
Analytic Gradient: Softmax Activation Function
• Softmax function
𝑝_k = e^{𝑠_k} / Σ_{j=1}^{n} e^{𝑠_j}   (probability of class 𝑘)
𝒑 = e^{𝒔} / Σ_{j=1}^{n} e^{𝑠_j}   (in vector form; 𝒔 is the score vector)
• 1st order derivative of softmax function
∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{𝑠_j} − e^{𝒔}(e^{𝒔})ᵀ ) / (Σ_j e^{𝑠_j})²
      = 1/(Σ_j e^{𝑠_j})² · ( [ e^{𝑠_1}Σ_j e^{𝑠_j}  ⋯  0;  ⋮ ⋱ ⋮;  0  ⋯  e^{𝑠_n}Σ_j e^{𝑠_j} ]
                              − [ e^{𝑠_1}e^{𝑠_1}  ⋯  e^{𝑠_1}e^{𝑠_n};  ⋮ ⋱ ⋮;  e^{𝑠_n}e^{𝑠_1}  ⋯  e^{𝑠_n}e^{𝑠_n} ] )
Analytic Gradient: Softmax Activation Function
𝐃 = ∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{𝑠_j} − e^{𝒔}(e^{𝒔})ᵀ ) / (Σ_j e^{𝑠_j})²
For 𝑎 = 𝑏:   𝐷_{ab} = e^{𝑠_a}(Σ_j e^{𝑠_j} − e^{𝑠_a}) / (Σ_j e^{𝑠_j})² = 𝑝_a(1 − 𝑝_a)
For 𝑎 ≠ 𝑏:   𝐷_{ab} = − e^{𝑠_a}e^{𝑠_b} / (Σ_j e^{𝑠_j})² = −𝑝_a 𝑝_b
In one expression:   𝐷_{ab} = 𝑝_a(𝛿_{ab} − 𝑝_b),   where 𝛿_{ab} = 1 if 𝑎 = 𝑏 and 0 otherwise
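A minimal sketch verifying D_ab = p_a(δ_ab − p_b) against a numerical Jacobian of the softmax.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())              # subtract max for numerical stability
    return e / e.sum()

s = np.array([1.0, 2.0, 0.5, -1.0])
p = softmax(s)

# Analytic Jacobian: D_ab = p_a (delta_ab - p_b)
D = np.diag(p) - np.outer(p, p)

# Numerical Jacobian via central differences
eps = 1e-6
n = len(s)
D_num = np.zeros((n, n))
for b in range(n):
    sp, sm = s.copy(), s.copy()
    sp[b] += eps
    sm[b] -= eps
    D_num[:, b] = (softmax(sp) - softmax(sm)) / (2 * eps)

print(np.max(np.abs(D - D_num)))         # tiny: the analytic Jacobian is correct
```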
Analytic Gradient: Hinge Loss
• 1st-order derivative of the binary hinge loss
𝐿 = max(0, 1 − 𝑦 ∙ 𝑠) = { 1 − 𝑦 ∙ 𝑠   if 1 − 𝑦 ∙ 𝑠 > 0
                          0           otherwise
∂𝐿/∂𝑠 = { −𝑦   if 1 − 𝑦 ∙ 𝑠 > 0
          0    otherwise
For simplicity of notation, the index of training image i is omitted here
𝑠 = 𝒘T𝒙 + 𝑏
𝑦 = ±1 for positive/negative samples
Analytic Gradient: Hinge Loss
• 1st-order derivative of the multiclass hinge loss
For simplicity of notation, the index of training image i is omitted here
𝐿 = Σ_{j=1, j≠y}^{n} max(0, 𝑠_j − 𝑠_y + 1),   where 𝒔 = 𝐖𝒙 + 𝒃
∂𝐿/∂𝑠_j = 1[𝑠_j − 𝑠_y + 1 > 0]   for 𝑗 ≠ 𝑦
∂𝐿/∂𝑠_y = − Σ_{j≠y} 1[𝑠_j − 𝑠_y + 1 > 0]
𝑦: class label (integer, 1 ≤ 𝑦 ≤ 𝑛);   𝒔 = (𝑠_1 𝑠_2 ⋯ 𝑠_n)ᵀ;   𝐖 = [𝒘_1ᵀ; 𝒘_2ᵀ; ⋮; 𝒘_nᵀ]
1[𝐹] = 1 if 𝐹 is true, 0 otherwise
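A minimal sketch of this gradient, checked numerically on the scores of the third image from the earlier hinge-loss example (scores (2.2, 2.5, −3.1), true class index 2); all margins are strictly positive there, so the finite-difference check is valid.

```python
import numpy as np

def hinge_loss_and_grad(s, y):
    # L = sum_{j != y} max(0, s_j - s_y + 1)
    margins = s - s[y] + 1.0
    margins[y] = 0.0
    active = (margins > 0).astype(float)        # indicator 1[s_j - s_y + 1 > 0]
    loss = np.maximum(0.0, margins).sum()
    grad = active.copy()                        # dL/ds_j for j != y
    grad[y] = -active.sum()                     # dL/ds_y
    return loss, grad

s = np.array([2.2, 2.5, -3.1])                  # scores of the third image in the earlier example
y = 2                                           # its true class index
loss, grad = hinge_loss_and_grad(s, y)
print(loss, grad)                               # 12.9, [ 1.  1. -2.]

# Numerical check (valid here since no margin sits exactly at the hinge point)
eps = 1e-6
num = np.array([(hinge_loss_and_grad(s + eps * np.eye(3)[j], y)[0] - loss) / eps for j in range(3)])
print(num)                                      # ≈ [ 1.  1. -2.]
```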
Analytic Gradient: Log Likelihood Loss
For simplicity of notation, the index of training image i is omitted here
𝐿 = −log 𝑝𝑦 where 𝑦 satisfies 𝑧𝑦 = 1
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_n)ᵀ: class probability vector for the 𝑖-th image
(assumed to be normalized, i.e. Σ_j 𝑝_j = 1)
𝑦: class label for 𝑖𝑡h image (1 ≤ 𝑦 ≤ 𝑛)
𝒛: class probability for 𝑖𝑡h image
𝒛 = (𝑧1 𝑧2 … 𝑧𝑛)𝑇, 𝑧𝑦 = 1 and 𝑧𝑘≠𝑦 = 0
Suppose 𝑖𝑡h image belongs to class 2 and 𝑛 = 10. → 𝑦 = 2
𝒛 = (0, 1, 0, …, 0)ᵀ,  𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ  →  ∂𝐿/∂𝒑 = −(0, 1/0.7, 0, …, 0)ᵀ
Analytic Gradient: Cross-entropy Loss
For simplicity of notation, the index of the training image i is omitted here
𝐿 = − Σ_{j=1}^{n} [ 𝑧_j log 𝑝_j + (1 − 𝑧_j) log(1 − 𝑝_j) ]
𝒑 = (𝑝_1 𝑝_2 ⋯ 𝑝_n)ᵀ: class probability vector for the 𝑖-th image (assumed normalized, i.e. Σ_j 𝑝_j = 1)
𝑦: class label for the 𝑖-th image (1 ≤ 𝑦 ≤ 𝑛)
𝒛: class probability vector for the 𝑖-th image, 𝒛 = (𝑧_1 𝑧_2 … 𝑧_n)ᵀ, 𝑧_y = 1 and 𝑧_{k≠y} = 0
∂𝐿/∂𝑝_j = −𝑧_j/𝑝_j + (1 − 𝑧_j)/(1 − 𝑝_j),   i.e.   ∂𝐿/∂𝒑 = ( 1/(1 − 𝑝_1), …, −1/𝑝_y, …, 1/(1 − 𝑝_n) )ᵀ
Suppose 𝑖𝑡h image belongs to class 2 and 𝑛 = 10. → 𝑦 = 2
𝒛 = (0, 1, 0, …, 0)ᵀ,  𝒑 = (0.1, 0.7, 0, …, 0, 0.2)ᵀ
→ ∂𝐿/∂𝒑 = ( 1/(1 − 0.1), −1/0.7, …, 1/(1 − 0.2) )ᵀ
Analytic Gradient: Regression Loss
• Regression loss
For simplicity of notation, the index of training image i is omitted here
𝒚 = (𝑦_1 𝑦_2 ⋯ 𝑦_n)ᵀ,   𝒔 = (𝑠_1 𝑠_2 ⋯ 𝑠_n)ᵀ
𝐿 = (𝒚 − 𝒔)²,   ∂𝐿/∂𝒔 = −2(𝒚 − 𝒔)
Why are we talking about derivatives? • Gradient Descent
̶ The simplest approach to minimizing a loss function:
𝐖^{T+1} = 𝐖^T − 𝛼 ∂𝐿/∂𝐖^T
̶ 𝛼: step size (a.k.a. learning rate)