Review
CS542 Machine Learning
Support Vector Machines
slides based on lecture by R. Urtasun
http://www.cs.toronto.edu/~urtasun/courses/CSC2515/CSC2515_Winter15.html
Max Margin Classifier
“Expand” the decision boundary to include a margin (until we hit the first point on either side)
Use margin of 1
Inputs in the margins are of unknown class
Linear SVM Formulation
This is the primal formulation; it can be optimized directly, e.g., with a quadratic-programming solver or (projected) gradient descent.
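For reference, the standard hard-margin primal (in the usual notation, with labels $y_i \in \{-1, +1\}$) is:
$$\min_{w, b} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad y_i\left(w^T x_i + b\right) \ge 1 \ \text{ for all } i$$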
Apply Lagrange multipliers: formulate equivalent problem
Lagrange Multipliers
Convert the primal constrained minimization to an unconstrained optimization problem: represent constraints as penalty terms:
For data {(x_i, y_i)}, use the corresponding penalty term (see the Lagrangian sketched below)
Note: we are now minimizing with respect to w and b, and maximizing with respect to α (additional parameters)
Rewrite the minimization problem, where {α_i} are the Lagrange multipliers:
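In the usual notation, the resulting min-max problem takes the standard form:
$$\min_{w, b} \ \max_{\alpha \ge 0} \ J(w, b; \alpha) = \min_{w, b} \ \max_{\alpha \ge 0} \left[ \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i \left( y_i \left( w^T x_i + b \right) - 1 \right) \right]$$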
Solution to Linear SVM
Swap the ‘max’ and ‘min’:
First minimize J() w.r.t. {w,b} for any fixed setting of the Lagrange multipliers:
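Setting the derivatives of J to zero gives the standard conditions:
$$\frac{\partial J}{\partial w} = 0 \ \Rightarrow \ w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial J}{\partial b} = 0 \ \Rightarrow \ \sum_i \alpha_i y_i = 0$$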
Then substitute back into J() and simplify to get final optimization:
Dual Problem
Maximize the dual objective with respect to {α_i}, subject to the constraints below; note that only dot products of data points are needed.
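The standard form of the resulting dual is:
$$\max_{\alpha} \ L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \, y_i y_j \, x_i^T x_j \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0$$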
Prediction on Test Example
Now we have the solution for the weights and bias
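Using the condition $w = \sum_i \alpha_i y_i x_i$, the prediction on a new input x takes the standard form (the sum runs over the support vectors, since $\alpha_i = 0$ for all other points):
$$\hat{y}(x) = \operatorname{sign}\left( w^T x + b \right) = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i y_i \, x_i^T x + b \right)$$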
Dual vs Primal SVM
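For reference, the two formulations side by side (standard forms; here d is the input dimension and m the number of training examples):
Primal: $\min_{w, b} \ \frac{1}{2}\lVert w \rVert^2$ s.t. $y_i(w^T x_i + b) \ge 1$ for all i; about d + 1 parameters (w and b).
Dual: $\max_{\alpha \ge 0, \ \sum_i \alpha_i y_i = 0} \ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$; m parameters (one $\alpha_i$ per training example), and only dot products of inputs appear.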
Kernel SVM
Non-linear decision boundaries
• Note that both the learning objective and the decision function depend only on dot products between patterns
• How to form non-linear decision boundaries in input space?
• Basic idea:
1. Map data into feature space
2. Replace dot products between inputs with dot products between feature points
3. Find a linear decision boundary in feature space
• Problem: what is a good feature function φ(x)?
Kernel Trick
• Kernel trick: dot-products in feature space can be computed as a kernel function
• Idea: work directly on x, avoid having to compute φ(x)
• Example (sketched below):
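A standard two-dimensional illustration (this particular kernel and mapping are assumed, not taken from the slide):
$$K(x, z) = \left( x^T z \right)^2 = \left( x_1 z_1 + x_2 z_2 \right)^2 = \phi(x)^T \phi(z), \quad \text{where } \phi(x) = \left( x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2 \right)$$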
Input transformation
Mapping to a feature space can produce problems:
• High computational burden due to high dimensionality
• Many more parameters
SVM solves these two issues simultaneously:
• Kernel trick produces efficient classification
• Dual formulation only assigns parameters to samples, not features
Kernels and Similarity
Given x, compute new features depending on proximity to training points (landmarks).
Example: Gaussian kernel. The similarity is high when x is close to a landmark and low when x is far from it.
Predict label “1” when the (weighted) similarity features are large enough.
[Figure: landmarks in the (x1, x2) plane, with regions of high and low similarity]
SVM with Kernels
Given the training examples, choose them as the landmarks.
Given an example x, compute its similarity features f_i = similarity(x, l^(i)) (a sketch of this computation follows below).
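A minimal Octave sketch of computing Gaussian-kernel similarity features for one example (the variable names, landmark matrix L, and bandwidth sigma are illustrative assumptions, not from the slide):
% x: n-by-1 example (column vector); L: one landmark per row; sigma: kernel bandwidth
d2 = sum((L - x') .^ 2, 2);     % squared distances from x to every landmark
f  = exp(-d2 / (2 * sigma^2));  % f(i) = similarity(x, l^(i)) under the Gaussian kernel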
Kernels
Examples of kernels (kernels measure similarity):
1. Polynomial
2. Gaussian
3. Sigmoid
Each kernel computation corresponds to a dot product for a particular mapping φ(x): it implicitly maps to a high-dimensional space.
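Common forms of these kernels (standard definitions; the constants c, d, σ, a, r are free parameters):
Polynomial: $K(x, z) = \left( x^T z + c \right)^d$
Gaussian (RBF): $K(x, z) = \exp\!\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right)$
Sigmoid: $K(x, z) = \tanh\!\left( a\, x^T z + r \right)$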
Why is this useful?
1. Rewrite training examples using more complex features
2. Dataset not linearly separable in original space may be linearly separable in higher dimensional space
Classification with non-linear SVMs
Non-linear SVM using kernel function K():
Maximize LK w.r.t. {α}, under constraints α≥0
Unlike the linear SVM, we cannot express w explicitly as a combination of the input vectors (it lives in the feature space), so we must retain the support vectors to classify new examples.
Final decision function:
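The resulting decision function has the standard kernelized form:
$$\hat{y}(x) = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i y_i \, K\!\left( x_i, x \right) + b \right)$$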
Unsupervised Learning II
Principal Component Analysis
How to choose lower-dim subspace?
Minimize “error”
Choose subspace with minimal “information loss”
Reduce from 2 dimensions to 1 dimension: find a direction (a vector $u^{(1)}$) onto which to project the data, so as to minimize the projection error.
Reduce from n dimensions to K dimensions: find K vectors $u^{(1)}, u^{(2)}, \dots, u^{(K)}$ (e.g., $u^{(1)}, u^{(2)} \in \mathbb{R}^3$ when projecting 3-D data onto a plane) onto which to project the data, so as to minimize the projection error.
Principal Components Analysis
Find orthonormal basis vectors $U = [\, u^{(1)} \ \dots \ u^{(K)} \,]$, where $K \ll n$.
Projected data point: $z = U^T x$, with $z_k = u^{(k)T} x$.
Reconstructed data point: $\hat{x} = \sum_{k=1}^{K} z_k \, u^{(k)}$.
Cost function (reconstruction error): $J = \frac{1}{m} \sum_{i=1}^{m} \left\lVert \hat{x}^{(i)} - x^{(i)} \right\rVert^2$.
Want: $\min_U J$.
PCA Solution
• The solution turns out to be the first K eigenvectors of the data covariance matrix, i.e., those with the largest eigenvalues (see Bishop 12.1 for details)
• Closed-form solution: apply the Singular Value Decomposition (SVD) to the covariance matrix
• Other PCA formulations
– can derive via maximizing variance of projected data
– probabilistic formulation of PCA possible, or the similar factor analysis, see Bishop 8.1.4
PCA Algorithm
Normalize features (ensure every feature has zero mean) and optionally scale features
Compute the “covariance matrix” Σ:
Sigma = (1/m) * X' * X;   % assuming X is m-by-n with one mean-normalized example per row
Compute its “eigenvectors”:
[U,S,V] = svd(Sigma);
Keep first K eigenvectors and project to get new features z
Ureduce = U(:,1:K);
z = Ureduce' * x;
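Putting the steps above together, a minimal Octave sketch (assuming the data matrix X is m-by-n with one example per row and that K has already been chosen; variable names are illustrative):
m = size(X, 1);                     % number of examples
mu = mean(X);                       % 1-by-n vector of feature means
X_norm = X - mu;                    % mean-normalize each feature
Sigma = (1/m) * X_norm' * X_norm;   % n-by-n covariance matrix
[U, S, V] = svd(Sigma);             % columns of U are the principal directions
Ureduce = U(:, 1:K);                % keep the first K eigenvectors
Z = X_norm * Ureduce;               % m-by-K projections (each row is z = Ureduce' * x)
X_rec = Z * Ureduce';               % reconstructions in the original space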
PCA Algorithm
Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$, where $\mu_j$ is the mean of feature $j$.
If different features are on different scales (e.g., size of house vs. number of bedrooms), scale the features to have a comparable range of values.
Andrew Ng
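A minimal Octave sketch of this preprocessing step (illustrative; assumes X is the m-by-n training matrix with one example per row):
mu = mean(X);                 % per-feature means
sigma = std(X);               % per-feature standard deviations (any sensible scale works)
X_norm = (X - mu) ./ sigma;   % zero-mean features with comparable ranges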
PCA is not linear regression
• There is no “output” in PCA, all dimensions are equal Regression 𝑥𝑥1 → 𝑥𝑥2 Regression 𝑥𝑥2 → 𝑥𝑥1 PCA
Choosing K (the number of principal components)
Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \hat{x}^{(i)} \rVert^2$
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} \rVert^2$
Typically, choose K to be the smallest value so that
$$\frac{\frac{1}{m} \sum_{i} \lVert x^{(i)} - \hat{x}^{(i)} \rVert^2}{\frac{1}{m} \sum_{i} \lVert x^{(i)} \rVert^2} \le 0.01 \quad (1\%)$$
“99% of variance is retained”
Choosing K (the number of principal components)
[U,S,V] = svd(Sigma)
Pick the smallest value of K for which $\frac{\sum_{k=1}^{K} S_{kk}}{\sum_{k=1}^{n} S_{kk}} \ge 0.99$ (99% of variance retained)
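A minimal Octave sketch of this selection rule, using the diagonal of the S matrix returned by svd (illustrative):
[U, S, V] = svd(Sigma);
s = diag(S);                     % eigenvalues of the covariance matrix, sorted in decreasing order
retained = cumsum(s) / sum(s);   % fraction of variance retained for K = 1, 2, ..., n
K = find(retained >= 0.99, 1);   % smallest K retaining at least 99% of the variance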
Good use of PCA
– Compression
– Reduce memory/disk needed to store data
– Speed up learning algorithm
– Visualization
Maximum Likelihood for Linear Regression
So far, we have treated outputs as noiseless
• Defined cost function as “distance to true output”
• An alternate view:
– data (x,y) are generated by unknown process
– however, we only observe a noisy version
– how can we model this uncertainty?
• Alternative cost function?
How to model uncertainty in data?
Hypothesis: $h_\theta(x) = \theta^T x$
$\theta$: parameters
$D = \{(x^{(i)}, y^{(i)})\}$: data
[Figure: training data with a fitted regression line]
New cost function: maximize the probability of the data given the model, $p\big( (x^{(i)}, y^{(i)}) \mid \theta \big)$.
Recall: Cost Function
Training set $\{x^{(i)}, y^{(i)}\}$ → cost function $J$ → find $\theta$ with minimal cost on the training set.
Model: input x → $h_\theta$ → output y.
Alternative View: “Maximum Likelihood”
Training set $\{x^{(i)}, y^{(i)}\}$ → probability of the training set, $p(\{x^{(i)}, y^{(i)}\} \mid \theta)$ → find $\theta$ with maximum probability.
Model: input x → $h_\theta$ → output y.
Maximum likelihood: a way of estimating the model parameters $\theta$.
Maximum likelihood estimate
In general, assume the data is generated by some distribution: $u \sim p(u \mid \theta)$.
Observations (i.i.d.): $D = \{u^{(1)}, u^{(2)}, \dots, u^{(m)}\}$
Likelihood: $L(D) = \prod_{i=1}^{m} p(u^{(i)} \mid \theta)$
Log likelihood: $\theta_{ML} = \operatorname*{argmax}_{\theta} L(D) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p(u^{(i)} \mid \theta)$
Note: p replaces h! $\log f(x)$ is monotonically increasing, so it has the same argmax as $f(x)$.
i.i.d. observations
• independent, identically distributed random variables
• If $u^{(1)}, \dots, u^{(m)}$ are i.i.d. random variables, then
$p(u^{(1)}, u^{(2)}, \dots, u^{(m)}) = p(u^{(1)}) \, p(u^{(2)}) \cdots p(u^{(m)})$
• A reasonable assumption about many datasets, but not always
ML: Another example
• Observe a dataset of points $D = \{x^{(i)}\}_{i=1:10}$
• Assume x is generated by a Normal distribution, $x \sim N(x \mid \mu, \sigma)$
• Find the parameters $\theta_{ML} = [\mu, \sigma]$ that maximize $\prod_{i=1}^{10} N(x^{(i)} \mid \mu, \sigma)$
[Figure: the data points and four candidate densities, $f_1 \sim N(10, 2.25)$ (the ML solution), $f_2 \sim N(10, 9)$, $f_3 \sim N(10, 0.25)$, $f_4 \sim N(8, 2.25)$]
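For reference, this maximization has the familiar closed-form solution (a standard result, stated here for completeness, with $m = 10$ in this example):
$$\mu_{ML} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad \sigma^2_{ML} = \frac{1}{m} \sum_{i=1}^{m} \big( x^{(i)} - \mu_{ML} \big)^2$$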
ML for Linear Regression
Assume: $t = y + \epsilon = h_\theta(x) + \epsilon$, with noise $\epsilon \sim N(\epsilon \mid 0, \beta^{-1})$, where $\beta = \frac{1}{\sigma^2}$.
We don't get to see y, only t.
[Figure: the curve $h_\theta(x)$ with observed targets $t^{(i)}$ scattered around $h_\theta(x^{(i)})$, illustrating $p(t \mid x, \theta, \beta)$]
ML for Linear Regression
Assume (as above): $t = h_\theta(x) + \epsilon$, with noise $\epsilon \sim N(\epsilon \mid 0, \beta^{-1})$, where $\beta = \frac{1}{\sigma^2}$.
Probability of one data point: $p(t \mid x, \theta, \beta) = N(t \mid h_\theta(x), \beta^{-1})$
Likelihood function: $p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = \prod_{i=1}^{m} N(t^{(i)} \mid h_\theta(x^{(i)}), \beta^{-1})$
Max. likelihood solution: $\theta_{ML} = \operatorname*{argmax}_{\theta} p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$, $\quad \beta_{ML} = \operatorname*{argmax}_{\beta} p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$
Want to maximize w.r.t. $\theta$:
$$\ln p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = -\frac{\beta}{2} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - t^{(i)} \big)^2 + \frac{m}{2} \ln \beta - \frac{m}{2} \ln(2\pi)$$
Multiply by $-\frac{1}{m\beta}$ (changing max to min) and omit the last two terms (they don't depend on $\theta$):
$$\frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - t^{(i)} \big)^2$$
… but this is the same as minimizing the sum-of-squares cost, i.e., the same as our SSE cost from before!
Summary: Maximum Likelihood Solution for Linear Regression
Hypothesis: $h_\theta(x) = \theta^T x$
$\theta$: parameters
$D = \{(x^{(i)}, t^{(i)})\}$: data
[Figure: training data with the fitted line $h_\theta(x)$]
Likelihood: $p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = \prod_{i=1}^{m} N\!\big( t^{(i)} \mid h_\theta(x^{(i)}), \beta^{-1} \big)$
Goal: maximize the likelihood, which is equivalent to
$$\operatorname*{argmin}_{\theta} \ \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - t^{(i)} \big)^2$$
(same as minimizing SSE)
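A minimal Octave sketch of evaluating this equivalent cost for a given parameter vector (illustrative; assumes X is the m-by-n design matrix, t the m-by-1 target vector, and theta the n-by-1 parameters):
m = size(X, 1);                      % number of training examples
h = X * theta;                       % predictions h_theta(x^(i)) for all examples
J = (1 / (2*m)) * sum((h - t).^2);   % sum-of-squares cost; minimizing J maximizes the likelihood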