Changjae Oh
Computer Vision
– Introduction to Deep Learning –
Semester 1, 22/23
• Machine learning basics++
• Introduction to deep learning
• Linear classifier
What is Machine Learning?
• Learning = Looking for a Function
http://www.slideshare.net/ckmarkohchang
Machine Learning: Basic Concept
• Prediction task
̶ Regression: returns a specific value
̶ Classification: returns a class label
Regression example
What is the predicted value of y when x = 10?
Classification example
For a new image, what class does it belong to?
Machine Learning: Basic Concept
• Training data
𝕏 = {𝐱1 = 2.0 , 𝐱2 = 4.0 , 𝐱3 = 6.0 , 𝐱4 = 8.0 }
𝕐 = {𝑦1 = 3.0, 𝑦2 = 4.0, 𝑦3 = 5.0, 𝑦4 = 6.0}
Training the model with data
– Finding optimal parameters: start from random values and iteratively update them to reduce the error on the training data
– The real goal is to minimize the error on new samples (the test set)
– Generalization refers to high performance on the test set
Optimal parameters for the data above: w = 0.5, b = 2.0 (i.e. y = 0.5x + 2.0)
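As a minimal sketch (not part of the original slides), the toy model y = wx + b can be fitted to the four training pairs above with gradient descent; it should converge close to the optimal parameters w = 0.5, b = 2.0.

import numpy as np

# Toy training data from the slide: y = 0.5*x + 2.0
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 4.0, 5.0, 6.0])

w, b = np.random.randn(), np.random.randn()  # start from random values
lr = 0.01                                    # step size (learning rate)

for _ in range(5000):
    err = (w * x + b) - y          # prediction error on the training set
    grad_w = 2 * np.mean(err * x)  # d(MSE)/dw
    grad_b = 2 * np.mean(err)      # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # approximately 0.5, 2.0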
Machine Learning: Basic Concept
• Multi-dimensional feature space
̶ d-dimensional data: x=(x1,x2, … ,xd)
• Linear classifier for d-dimensional data
̶ 1st-order (linear) classifier: # of parameters = d + 1
  f(x) = w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d + w_{d+1}
  • Widely used in machine learning
̶ 2nd-order classifier: # of parameters = (d+1)(d+2)/2
  f(x) = w_1 x_1² + ⋯ + w_d x_d² + w_{d+1} x_1 x_2 + ⋯ + w_{d(d+1)/2} x_{d−1} x_d + (linear terms) + (constant term)
Note) d = 784 in MNIST data
Machine Learning: Basic Concept
• Feature space transformation
̶ Map a linearly non-separable feature space into separable space
Toy example: the original feature space (not linearly separable) is mapped to a transformed feature space that is linearly separable.
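A small illustrative sketch (the data and the mapping below are my own toy example, not the one shown on the slide): XOR-like points are not linearly separable in (x1, x2), but appending the product feature x1*x2 makes them separable with a single linear rule.

import numpy as np

# XOR-like toy data: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Feature space transformation: append the product x1*x2 as a third feature
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the transformed space a linear rule separates the classes,
# e.g. score = x1 + x2 - 2*x1*x2, thresholded at 0.5
w = np.array([1.0, 1.0, -2.0])
scores = phi @ w
print((scores > 0.5).astype(int))  # [0 1 1 0] matches y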
Machine Learning: Basic Concept
• Representation learning
̶ Aims to find good feature space automatically
̶ Deep learning finds a hierarchical feature space by using neural networks with
multiple hidden layers.
• The first hidden layer captures low-level features (edges, corner points, etc.), while later layers capture high-level features (faces, wheels, etc.)
Data for Machine Learning
• The quality of the training data
̶ To increase estimation accuracy, sufficiently large and diverse data should be collected for the given application.
̶ Ex) After training on a database containing only frontal faces, recognition accuracy on side (profile) faces will be degraded.
• MNIST database
̶ Handwritten numeric database
̶ Training data: 60,000
̶ Test data: 10,000
Data for Machine Learning
• Database size vs. training accuracy
̶ Ex) MNIST: 28×28 binary images
→ The total number of possible samples is 2^784, but MNIST has only 60,000 training images.
Data for Machine Learning
• How does a small database achieve high performance?
̶ In the feature space, the actual data occupy only a very small subspace.
̶ Manifold assumption: an arbitrary combination of feature values is unlikely to occur; real data vary smoothly along a low-dimensional manifold according to certain rules.
Training Model: under-fitting vs. over-fitting
• Under-fitting
̶ Model capacity is too small to fit the data properly.
̶ A model with higher order can be used.
1st order 2nd order 3rd order 4th order 12th order
Training Model: under-fitting vs. over-fitting
• Over-fitting
̶ The 12th-order polynomial model fits the training set perfectly.
̶ But on "new" data it runs into a big problem.
̶ Since the model capacity is large, the training process also fits the noise in the data.
̶ A model with appropriate capacity should therefore be selected.
Actual data vs. predicted data
Training Model: under-fitting vs. over-fitting
• The 1st- and 2nd-order models show poor performance on both the training and the test set.
• The 12th-order model shows high performance on the training set but low performance on the test set → low generalization ability.
• The 3rd- and 4th-order models score lower than the 12th-order model on the training set, but perform better on the test set → higher generalization capability.
1st order 2nd order 3rd order 4th order 12th order
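The comparison above can be reproduced qualitatively with a short sketch (the synthetic data below are my own, chosen only to show the trend): low-order polynomials underfit, while a 12th-order fit drives the training error down but typically increases the test error.

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic ground truth: a smooth curve plus noise (illustrative only)
    x = rng.uniform(-1, 1, n)
    return x, np.sin(2 * x) + 0.1 * rng.standard_normal(n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

for order in [1, 2, 3, 4, 12]:
    coeffs = np.polyfit(x_train, y_train, order)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")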
Spectrum of supervision
Supervised
Semi-Supervised
Unsupervised
Reinforcement
Computer vision
Spectrum of supervision
• Supervised learning
̶ Both the feature vector 𝕏 and the output 𝕐 are given.
̶ Regression and classification problem
• Unsupervised learning
̶ The feature vector 𝕏 is given, but the output 𝕐 is not given.
̶ Ex) Clustering, density estimation, feature space conversion
Spectrum of supervision
• Reinforcement learning
̶ The output (reward) is given, but in a different form from supervised learning.
• Ex) Once a game is over, you receive a score (credit): +1 if you win, −1 otherwise.
• The credit must then be distributed over the individual samples (moves) of the game.
• Semi-supervised Learning
̶ Some of the data have both 𝕏 and 𝕐, but others have only 𝕏.
̶ It is becoming important, since collecting 𝕏 is easy, but obtaining 𝕐 requires manual labelling.
Introduction to
Deep Learning
• Deep learning
̶ is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
Deep learning success
• Image classification
• Machine translation
• Speech recognition
• Speech synthesis
• Game playing
• .. and many, many more
Why deep learning?
• Hand-crafted features vs. Learned features
• Large datasets
• GPU hardware advances + Price decreases
• Improved techniques (algorithm)
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ End-to-end training: what each function should do is learned automatically
̶ Deep learning usually refers to neural network based model
What is Deep Learning?
• Stacked Functions Learned by Machine
̶ Representation Learning: learning features/representations
̶ Deep Learning: learning (multi-level) features and an output
Machine Learning vs. Deep Learning
• Deep vs Shallow: Image Recognition
̶ Shallow model using machine learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
• Deep vs Shallow: Image Recognition
̶ Deep model using deep learning
http://www.slideshare.net/ckmarkohchang
Machine Learning vs. Deep Learning
• Machine Learning vs. Deep Learning
Machine Learning = Feature descriptor + Classifier
̶ Feature: hand-crafted, domain-specific knowledge, i.e. describing your data with features that a computer can understand.
  Ex) SIFT, Bag-of-Words (BoW), Histogram of Oriented Gradients (HOG)
̶ Classifier: optimizing the classifier weights on the features.
  Ex) Nearest Neighbor (NN), Support Vector Machine (SVM), Random Forest (RF)
Machine Learning vs. Deep Learning
• Machine Learning vs. Deep Learning
Deep Learning = Feature descriptor + Classifier, learned jointly
̶ Feature: representation learned by the machine (automatically learned internal knowledge)
̶ Classifier: optimizing the classifier weights on the learned representation
̶ Both are realized by a neural-network-based model: a series of linear classifiers and non-linear activations + a loss function
Deep Learning
• A single neuron
z = 𝒘ᵀ𝒙 + b = Σ_k w_k x_k + b
Deep Learning
• A single layer with multiple neurons
z_1 = 𝒘_1ᵀ𝒙 + b_1 = Σ_k w_{1k} x_k + b_1,   a_1 = 1 / (1 + e^{−z_1})
⋮
z_M = 𝒘_Mᵀ𝒙 + b_M = Σ_k w_{Mk} x_k + b_M,   a_M = 1 / (1 + e^{−z_M})
Deep Learning
• Deep Neural Network
̶ Cascading the neurons to form a neural network
Each layer consists of a linear classifier and an activation function; stacking such layers (linear classifiers + activations) yields the neural network.
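A minimal numpy sketch of the forward pass described above: each layer applies a linear map (Wx + b) followed by a sigmoid activation, and layers are cascaded. The layer sizes are arbitrary illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One layer = linear classifier (Wx + b) followed by a non-linear activation
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                          # input feature vector (d = 4)

# Two cascaded layers forming a small neural network (sizes are illustrative)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)   # layer 1: 4 -> 5
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)   # layer 2: 5 -> 3

h = layer(x, W1, b1)     # hidden activations
out = layer(h, W2, b2)   # network output
print(out.shape)         # (3,)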
Parametric Approach
• (Review) Unit 3 ML basics and classification
Input image 𝒙: array of 32×32×3 numbers (3072 numbers total)
f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃, with parameters (or weights) 𝐖 (10×3072) and bias 𝒃 (10×1)
Output: 10 numbers giving the class scores
Parametric Approach
Example with an image with 4 pixels, and 3 classes (cat/dog/ship)
• (Review) Unit 3 recognition
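The 4-pixel, 3-class example can be written directly in numpy; the pixel values, weights, and biases below are illustrative placeholders, and only the shapes follow the slide (W is 3×4, b is 3×1).

import numpy as np

# Toy image with 4 pixels, flattened into a vector
x = np.array([56.0, 231.0, 24.0, 2.0])

# Illustrative weights and biases: one row / one bias per class (cat, dog, ship)
W = np.array([[ 0.2, -0.5,  0.1,  2.0],
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b   # f(x, W) = Wx + b -> 3 class scores
print(scores)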
What do we need now?
• Functions to measure the error between the output of a classifier and the given target value.
̶ Let’s talk about designing error (a.k.a. loss) functions!
Changjae Oh
Computer Vision
– Loss functions –
Semester 1, 22/23
Loss function
• Loss function
̶ quantifies our unhappiness with the scores across the training data.
• Type of loss function
̶ Hinge loss
̶ Cross-entropy loss
̶ Log likelihood loss
̶ Regression loss
Loss Function: Hinge Loss
• Binary hinge loss (= binary SVM loss)
  L_i = max(0, 1 − y_i · s),  where s = 𝒘ᵀ𝒙_i + b and y_i = ±1 for positive/negative samples
• Hinge loss (= multiclass SVM loss),  C: the number of classes (> 2)
  L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1),  where 𝒔 = 𝐖𝒙_i + 𝒃
  𝒙_i: input data (e.g. image),  y_i: class label (integer, 1 ≤ y_i ≤ C)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W, the scores f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃 are as shown in the figure.
Given a dataset of examples {(𝒙_i, y_i)}_{i=1}^N, where y_i is the class label (integer),
the loss over the dataset is the average of the per-example losses:
L = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i )
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W, the scores f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃 are as shown in the figure.
Multiclass SVM loss (= hinge loss), where the score vector 𝒔 = f(𝒙_i, 𝐖):
L_i = Σ_{j≠y_i} { 0 if s_{y_i} ≥ s_j + 1;  s_j − s_{y_i} + 1 otherwise }
    = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)
Loss Function: Hinge Loss
Suppose: 3 training examples, 3 classes.
With some W, the scores f(𝒙, 𝐖) = 𝐖𝒙 + 𝒃 are as shown in the figure.
Multiclass SVM loss (= hinge loss): L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)

Example 1 (correct-class score 3.2):
L_1 = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1)
    = max(0, 2.9) + max(0, −3.9)
    = 2.9 + 0 = 2.9

Example 2 (correct-class score 4.9):
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1)
    = max(0, −2.6) + max(0, −1.9)
    = 0 + 0 = 0

Example 3 (correct-class score −3.1):
L_3 = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6 = 12.9

Loss over the full dataset is the average:
L = (2.9 + 0 + 12.9)/3 ≈ 5.27
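The per-example computations above can be checked with a small numpy function; the score columns below are read off the slide's example (one column per training image), with the correct-class index for each column.

import numpy as np

def multiclass_hinge(scores, y):
    # L_i = sum_{j != y} max(0, s_j - s_y + 1)
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0                       # do not count the correct class
    return margins.sum()

# Class scores from the slide's example, one column per training image
S = np.array([[ 3.2, 1.3,  2.2],
              [ 5.1, 4.9,  2.5],
              [-1.7, 2.0, -3.1]])
labels = [0, 1, 2]                         # correct class index per column

losses = [multiclass_hinge(S[:, i], y) for i, y in enumerate(labels)]
print(losses)              # approximately [2.9, 0.0, 12.9]
print(np.mean(losses))     # approximately 5.27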
Loss Function: Log Likelihood Loss
• Log likelihood loss
L_i = −log p_j, where j satisfies z_{ij} = 1 (i.e. j = y_i)

Ex) Suppose the i-th image belongs to class 2, C = 10, and p_2 = 0.7. Then L_i = −log 0.7.

𝒛_i: class label vector for the i-th image (C × 1 vector, z_{ij} = 1 when j = y_i and 0 otherwise)
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
y_i: class label (integer, 1 ≤ y_i ≤ C)
Loss Function: Cross-entropy Loss
• Cross-entropy loss
L_i = −Σ_j [ z_{ij} log p_j + (1 − z_{ij}) log(1 − p_j) ]

Ex) Suppose the i-th image belongs to class 2, C = 10, and 𝒑 = (0.1, 0.7, 0.2, 0, …). Then
L_i = −log(1 − 0.1) − log 0.7 − log(1 − 0.2) − ⋯

𝒛_i: class label vector for the i-th image (C × 1 vector, z_{ij} = 1 when j = y_i and 0 otherwise)
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
y_i: class label (integer, 1 ≤ y_i ≤ C)
Softmax Activation Function
• Softmax activation function
̶ Scores = unnormalized log probabilities of the classes; probabilities are computed from the scores as below.
̶ Probability of the class label being k for an image 𝒙_i (softmax activation):
  P(Y = k | X = 𝒙_i) = p_k = e^{s_k} / Σ_j e^{s_j},  where 𝒔 = 𝐖𝒙_i + 𝒃
  y_i: class label (integer, 1 ≤ y_i ≤ C)
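A short sketch of the softmax activation followed by the log-likelihood loss (negative log probability of the true class); the scores below are illustrative numbers, not taken from a slide.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # subtract the max for numerical stability
    return e / e.sum()

def log_likelihood_loss(p, y):
    # L_i = -log p_y  (the loss used by the 'softmax classifier')
    return -np.log(p[y])

s = np.array([3.2, 5.1, -1.7])   # unnormalized class scores (illustrative)
p = softmax(s)                   # class probabilities, sum to 1
print(p, p.sum())
print(log_likelihood_loss(p, y=1))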
Softmax + Log Likelihood Loss
The combination of softmax activation + log likelihood loss is often called the 'softmax classifier'.
Softmax + Cross-entropy Loss
Example: L_i = −log(0.13) − log(1 − 0.87) − log(1 − 0.0)
Loss Function: Regression Loss
• Regression loss
̶ Using L1 or L2 norms
̶ Widely used in pixel-level prediction (e.g. image denoising)
L1 loss:  L_i = |𝒚_i − 𝒔_i| = |0 − 0.1| + |1 − 0.7| + |0 − 0.2| + ⋯  (example)
L2 loss:  L_i = (𝒚_i − 𝒔_i)²
Regularization
L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1),  𝒔 = 𝐖𝒙 + 𝒃
Suppose that we found a W such
that L = 0. Is this W unique?
No! 2W also has L = 0!
With the original W (example 2):
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1)
    = max(0, −2.6) + max(0, −1.9)
    = 0 + 0 = 0
With W twice as large:
L_2 = max(0, 2.6 − 9.8 + 1) + max(0, 4.0 − 9.8 + 1)
    = max(0, −6.2) + max(0, −4.8)
    = 0 + 0 = 0
Regularization
L = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i ) + λR(𝐖)
̶ Data loss: model predictions should match the training data.
̶ Regularization: the model should be "simple", so that it avoids overfitting and works on test data.
Minimizing the data loss only vs. minimizing data + regularization loss
Regularization
• L2 regularization: R(𝐖) = Σ_{k,l} W_{k,l}²
• L1 regularization: R(𝐖) = Σ_{k,l} |W_{k,l}|
• Elastic net (L1 + L2): R(𝐖) = Σ_{k,l} ( βW_{k,l}² + |W_{k,l}| )
• Max norm regularization: constrain ‖𝒘_j‖ < c for all j
• Dropout (will see later)
• Batch normalization, stochastic depth (will see later)

L = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i ) + λR(𝐖),  λ: regularization strength (hyperparameter)
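A sketch of how the L2 penalty enters the total objective; the per-example losses and the weight matrix below are illustrative numbers only.

import numpy as np

def l2_reg(W):
    # R(W) = sum over all entries of W_{k,l}^2
    return np.sum(W ** 2)

def total_loss(per_example_losses, W, lam):
    # L = (1/N) * sum_i L_i  +  lambda * R(W)
    return np.mean(per_example_losses) + lam * l2_reg(W)

W = np.array([[0.2, -0.5, 0.1],
              [1.5,  1.3, 2.1]])
print(total_loss([2.9, 0.0, 12.9], W, lam=0.1))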
Optimization: Gradient Descent
• Gradient Descent
̶ The simplest approach to minimizing a loss function
̶ 𝛼: step size (a.k.a. learning rate)
𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖
Optimization: Gradient Descent
Optimization: Stochastic Gradient Descent (SGD)
L(𝐖) = (1/N) Σ_i L_i( f(𝒙_i, 𝐖), y_i ) + λR(𝐖)
∂L/∂𝐖 = (1/N) Σ_i ∂L_i( f(𝒙_i, 𝐖), y_i )/∂𝐖 + λ ∂R(𝐖)/∂𝐖

The full sum is too expensive when N is large!
Instead, approximate the sum using a minibatch of 32 / 64 / 128 / 256 examples.
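A schematic minibatch SGD loop, not tied to any particular model; loss_and_grad is a hypothetical function assumed to return the average data loss and its gradient with respect to W on the given minibatch.

import numpy as np

def sgd(W, X, Y, loss_and_grad, lr=1e-3, lam=1e-4, batch_size=128, steps=1000):
    # Vanilla minibatch SGD. loss_and_grad(W, xb, yb) is assumed to return
    # (data loss, gradient of the data loss w.r.t. W) on the minibatch.
    N = X.shape[0]
    for _ in range(steps):
        idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        loss, grad = loss_and_grad(W, X[idx], Y[idx])
        grad = grad + 2 * lam * W      # add the gradient of the L2 regularizer
        W = W - lr * grad              # W^(t+1) = W^(t) - alpha * dL/dW
    return W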
Changjae Oh
Computer Vision
- Backpropagation-
Semester 1, 2021
Backpropagation
• A widely used algorithm for training feedforward neural networks.
• A way of computing gradients of expressions through recursive application of the chain rule.
̶ Backpropagation computes the gradient of the loss function with respect to the weights of the network (model) for a single input–output example.
Derivative
• Optimization using derivative
̶ 1st-order derivative f′(x): the slope of the function, indicating the direction in which the value increases
→ The minimum of the objective function may therefore lie in the direction of −f′(x).
→ Gradient descent algorithm: move in the direction of −f′(x), i.e.
  𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖
Derivative
• Partial derivative
̶ Derivatives of functions with multiple variables
̶ Gradient: the vector of partial derivatives, ∇f = ( ∂f/∂x_1, …, ∂f/∂x_d )ᵀ
Derivative
• Jacobian matrix
̶ 1st order partial derivative matrix for 𝐟: ℝ𝑑 ↦ ℝ𝑚
• Hessian matrix
̶ 2nd order partial derivative matrix
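For reference, the standard definitions of the two matrices named above, written out explicitly (the Hessian is for a scalar-valued f: ℝ^d → ℝ):

\[
\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_d} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_d}
\end{pmatrix} \in \mathbb{R}^{m \times d},
\qquad
\mathbf{H} =
\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d^2}
\end{pmatrix} \in \mathbb{R}^{d \times d}
\]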
Derivative
• Chain rule
f(x) = g(h(x))  →  f′(x) = g′(h(x)) · h′(x)
f(x) = g(h(i(x)))  →  f′(x) = g′(h(i(x))) · h′(i(x)) · i′(x)
Why are we talking about derivatives?
• Gradient Descent
̶ The simplest approach to minimizing a loss function
̶ 𝛼: step size (a.k.a. learning rate)
𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖
Derivative
• Example) Applying chain rule to single-layer perceptron
̶ Example of composite function
̶ Back-propagation: use the chain rule to compute
Pipeline: 𝒙 → 𝐖𝒙 + 𝒃 → softmax → 𝒑 (n × 1), compared with the ground truth (n × 1) via the (log-)likelihood loss.
Analytic Gradient: Linear Equation
𝒔 = 𝐖𝒙 + 𝒃, where 𝐖 is an n × d matrix with entries w_{jk} and rows 𝒘_jᵀ

∂𝒔/∂𝒘_1 = [ 𝒙 𝟎 𝟎 ⋯ 𝟎 ] ∈ ℜ^{d×n}   (𝒙 in the 1st column)
∂𝒔/∂𝒘_j = [ 𝟎 ⋯ 𝒙 ⋯ 𝟎 ] ∈ ℜ^{d×n}   (𝒙 in the j-th column)
∂𝒔/∂𝒃 = 𝐈 ∈ ℜ^{n×n}
Analytic Gradient: Linear Equation
∂𝒔/∂𝒙 = [ 𝒘_1 𝒘_2 ⋯ 𝒘_n ] ∈ ℜ^{d×n} (i.e. 𝐖ᵀ in the layout used above), where 𝒘_jᵀ is the j-th row of 𝐖

𝐖 = [ w_11 w_12 ⋯ w_1d ; w_21 w_22 ⋯ w_2d ; ⋯ ; w_n1 w_n2 ⋯ w_nd ],  𝒔 = 𝐖𝒙 + 𝒃
Analytic Gradient: Sigmoid Function
• Sigmoid function
For a scalar x:
σ(x) = 1 / (1 + e^{−x})
dσ(x)/dx = e^{−x} / (1 + e^{−x})² = (1 − σ(x)) σ(x)

Similarly, for a vector 𝒔 ∈ ℜ^{n×1}:
∂σ(𝒔)/∂𝒔 = diag( (1 − σ(s_j)) σ(s_j) ),  for j = 1, …, n
          = [ (1 − σ(s_1))σ(s_1)  ⋯  0 ;  ⋮ ⋱ ⋮ ;  0  ⋯  (1 − σ(s_n))σ(s_n) ]
Analytic Gradient: Softmax Activation Function
• Softmax function
• 1st order derivative of softmax function
̶ Score function 𝒔 → probability 𝒑 = e^{𝒔} / Σ_j e^{s_j}  (in vector form, p_k = e^{s_k} / Σ_j e^{s_j})
̶ Jacobian:
  ∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{s_j} − e^{𝒔}(e^{𝒔})ᵀ ) / ( Σ_j e^{s_j} )²
Analytic Gradient: Softmax Activation Function
Element-wise, with D = ∂𝒑/∂𝒔 = ( diag(e^{𝒔}) · Σ_j e^{s_j} − e^{𝒔}(e^{𝒔})ᵀ ) / ( Σ_j e^{s_j} )²:

Diagonal entries:       D_aa = e^{s_a}( Σ_j e^{s_j} − e^{s_a} ) / ( Σ_j e^{s_j} )² = p_a(1 − p_a)
Off-diagonal (a ≠ b):   D_ab = −e^{s_a} e^{s_b} / ( Σ_j e^{s_j} )² = −p_a p_b

Compactly: D_ab = p_a( δ_ab − p_b ),  where δ_ab = 1 if a = b and 0 otherwise.
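A quick numerical sanity check of the formula D_ab = p_a(δ_ab − p_b), comparing the analytic Jacobian against finite differences (my own verification sketch, not from the slides).

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([1.0, -0.5, 2.0, 0.3])
p = softmax(s)

# Analytic Jacobian: D_ab = p_a * (delta_ab - p_b)
D = np.diag(p) - np.outer(p, p)

# Finite-difference approximation of dp/ds, column by column
eps = 1e-6
D_num = np.zeros_like(D)
for b in range(len(s)):
    e_b = np.zeros_like(s)
    e_b[b] = eps
    D_num[:, b] = (softmax(s + e_b) - softmax(s - e_b)) / (2 * eps)

print(np.max(np.abs(D - D_num)))   # should be tiny (around 1e-10)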
Analytic Gradient: Hinge Loss
• 1st order derivative of binary hinge loss
For simplicity of notation, the index
of training image i is omitted here
L = max(0, 1 − y·s) = { 1 − y·s  if 1 − y·s > 0;  0 otherwise }

∂L/∂s = { −y  if 1 − y·s > 0;  0 otherwise }

where s = 𝒘ᵀ𝒙 + b and y = ±1 for positive/negative samples
Analytic Gradient: Hinge Loss
• 1st order derivative of hinge loss
For simplicity of notation, the index
of training image i is omitted here
L = Σ_{j≠y} max(0, s_j − s_y + 1),  𝒔 = 𝐖𝒙 + 𝒃,  y: class label (integer, 1 ≤ y ≤ n)

∂L/∂s_j = 𝟙[ s_j − s_y + 1 > 0 ]              for j ≠ y
∂L/∂s_y = −Σ_{j≠y} 𝟙[ s_j − s_y + 1 > 0 ]     for j = y

where 𝟙[F] = 1 if F is true and 0 otherwise
Analytic Gradient: Log Likelihood Loss
For simplicity of notation, the index
of training image i is omitted here
L = −log p_y, where y satisfies z_y = 1
∂L/∂𝒑 = (0, …, 0, −1/p_y, 0, …, 0)ᵀ  (the only nonzero entry is at position y)

Ex) Suppose the i-th image belongs to class 2 and n = 10 → y = 2, so ∂L/∂𝒑 = (0, −1/p_2, 0, …, 0)ᵀ

y: class label for the i-th image (1 ≤ y ≤ n)
𝒛: class label vector for the i-th image, 𝒛 = (z_1 z_2 … z_n)ᵀ with z_y = 1 and z_{k≠y} = 0
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
Analytic Gradient: Cross-entropy Loss
L = −Σ_j [ z_j log p_j + (1 − z_j) log(1 − p_j) ]
∂L/∂p_j = −z_j/p_j + (1 − z_j)/(1 − p_j), i.e.
∂L/∂𝒑 = ( 1/(1 − p_1), …, −1/p_y, …, 1/(1 − p_n) )ᵀ

Ex) Suppose the i-th image belongs to class 2, n = 10 → y = 2, and 𝒑 = (0.1, 0.7, 0.2, …):
∂L/∂𝒑 = ( 1/(1 − 0.1), −1/0.7, 1/(1 − 0.2), … )ᵀ

y: class label for the i-th image (1 ≤ y ≤ n)
𝒛: class label vector for the i-th image, 𝒛 = (z_1 z_2 … z_n)ᵀ with z_y = 1 and z_{k≠y} = 0
𝒑: class probability vector for the i-th image (assumed normalized, i.e. Σ_j p_j = 1)
For simplicity of notation, the index of the training image i is omitted here.
Analytic Gradient: Regression Loss
• Regression loss
L = (𝒚 − 𝒔)²
∂L/∂𝒔 = −2(𝒚 − 𝒔),  where 𝒔 = 𝐖𝒙 + 𝒃
For simplicity of notation, the index of the training image i is omitted here.
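Putting the pieces together, a sketch of one gradient step for a single-layer softmax classifier with the log-likelihood loss. Applying the chain rule through the softmax and the loss gives the well-known simplification ∂L/∂𝒔 = 𝒑 − 𝒛 (the slides derive ∂L/∂𝒑 and ∂𝒑/∂𝒔 separately; combining them yields this form). All sizes and values below are illustrative.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 5, 3                            # input dimension and number of classes
W, b = 0.01 * rng.standard_normal((n, d)), np.zeros(n)

x = rng.standard_normal(d)             # one training example
y = 1                                  # ground-truth class index
z = np.zeros(n)
z[y] = 1.0                             # one-hot label vector

# Forward pass: s = Wx + b, p = softmax(s), L = -log p_y
s = W @ x + b
p = softmax(s)
L = -np.log(p[y])

# Backward pass via the chain rule: dL/ds = p - z, dL/dW = (dL/ds) x^T, dL/db = dL/ds
ds = p - z
dW = np.outer(ds, x)
db = ds

# One gradient-descent step
alpha = 0.1
W -= alpha * dW
b -= alpha * db
print(L, -np.log(softmax(W @ x + b)[y]))   # the loss should decrease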
Why are we talking about derivatives?
• Gradient Descent
̶ The simplest approach to minimizing a loss function
̶ 𝛼: step size (a.k.a. learning rate)
𝐖^(t+1) = 𝐖^(t) − α ∂L/∂𝐖