Introduction to Machine Learning The Kernel Trick, Regularization and
Regression
Prof. Kutty
Announcements
• Midterm Conflict forms will be out today; due in 1 week
– hard conflicts only
– proof required
• Project 1 is out
– get started today (or sooner!)
Today’s Agenda
• Recap: Dual Formulation to kernelize SVMs
• Section 1: The kernel function
• Section 2: Regularization
• Section 3: Linear Regression with Squared Loss
Recap: (Efficient) Non-linear decision boundaries with SVMs
SVMs and the kernel trick
Dual formulation of Hard-Margin SVM
Assuming data are linearly separable, S_n = {(x̄^(i), y^(i))}, i = 1, …, n:

max_{ᾱ, αᵢ ≥ 0}   Σ_{i=1}^n αᵢ  −  (1/2) Σ_{i=1}^n Σ_{j=1}^n αᵢ αⱼ y^(i) y^(j) (x̄^(i) · x̄^(j))

This is the dual of the (primal) problem of finding a boundary
• that classifies the training set correctly, i.e., y^(i)(θ̄ · x̄^(i)) ≥ 1 for i ∈ {1, …, n}
• that is maximally removed from the training examples closest to the decision boundary
Feature mapping with SVM
max_{ᾱ}   Σ_{i=1}^n αᵢ  −  (1/2) Σ_{i=1}^n Σ_{j=1}^n αᵢ αⱼ y^(i) y^(j) (x̄^(i) · x̄^(j))
subject to αᵢ ≥ 0 ∀i = 1, …, n

Replacing each inner product x̄^(i) · x̄^(j) with φ(x̄^(i)) · φ(x̄^(j)):

max_{ᾱ}   Σ_{i=1}^n αᵢ  −  (1/2) Σ_{i=1}^n Σ_{j=1}^n αᵢ αⱼ y^(i) y^(j) (φ(x̄^(i)) · φ(x̄^(j)))
subject to αᵢ ≥ 0 ∀i = 1, …, n

Issue: computing the feature map φ explicitly is potentially very inefficient.
Kernels and Feature Maps
Feature map φ: X → F
takes input x̄ ∈ X (e.g., x̄ ∈ R^d) and maps it to feature space F (e.g., R^p)

Kernel function K: X × X → R
Each kernel has an associated feature mapping. Intuitively, the kernel function takes two inputs and gives their similarity in the feature space:
K(x̄, z̄) = φ(x̄) · φ(z̄)
• Sometimes it is much more efficient to compute K(x̄^(i), x̄^(j)) directly
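For a quick sanity check, here is a minimal NumPy sketch comparing the explicit feature-map computation with the direct kernel computation for the quadratic kernel K(ū, v̄) = (ū · v̄)²; the function names are illustrative, not from the lecture code.

import numpy as np

def phi_quadratic(x):
    # explicit feature map for K(u, v) = (u . v)^2 with 2-dimensional input
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_quadratic(u, v):
    # kernel computed directly, without ever forming the feature map
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
print(np.dot(phi_quadratic(u), phi_quadratic(v)))   # 1.0
print(k_quadratic(u, v))                            # 1.0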
Kernelized Dual SVM
max_{ᾱ}   Σ_{i=1}^n αᵢ  −  (1/2) Σ_{i=1}^n Σ_{j=1}^n αᵢ αⱼ y^(i) y^(j) (φ(x̄^(i)) · φ(x̄^(j)))
subject to αᵢ ≥ 0 ∀i = 1, …, n

becomes

max_{ᾱ}   Σ_{i=1}^n αᵢ  −  (1/2) Σ_{i=1}^n Σ_{j=1}^n αᵢ αⱼ y^(i) y^(j) K(x̄^(i), x̄^(j))
subject to αᵢ ≥ 0 ∀i = 1, …, n
Classifying a new example
Recall: θ̄* = Σ_{i=1}^n αᵢ y^(i) φ(x̄^(i))   (never computed explicitly)

θ̄* · φ(x̄) = Σ_{i=1}^n αᵢ y^(i) (φ(x̄^(i)) · φ(x̄))
           = Σ_{i=1}^n αᵢ y^(i) K(x̄^(i), x̄)
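A hedged sketch of this decision rule in NumPy, assuming the dual variables alpha have already been obtained from some dual solver (the names here are illustrative):

import numpy as np

def predict(x_new, X_train, y_train, alpha, kernel):
    # sign of sum_i alpha_i * y^(i) * K(x^(i), x_new); theta-bar is never formed explicitly
    score = sum(a * y * kernel(x_i, x_new)
                for a, y, x_i in zip(alpha, y_train, X_train))
    return np.sign(score)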
Section 1: The kernel function
Inferring the feature map
K(ū, v̄) = (ū · v̄)²  — what is the corresponding feature map φ(x̄)?

To check, expand the kernel and see whether it can be written as a dot product of feature vectors (for 2-dimensional input):
(ū · v̄)² = (u₁v₁ + u₂v₂)² = u₁²v₁² + 2u₁v₁u₂v₂ + u₂²v₂² = [u₁², √2·u₁u₂, u₂²] · [v₁², √2·v₁v₂, v₂²]
so φ(x̄) = [x₁², √2·x₁x₂, x₂²]
Exercise: Assuming the original feature space is 2-dimensional, find the kernel that corresponds to the following φ(x̄):
φ(x̄) = [r, √(2r)·x₁, √(2r)·x₂, √2·x₁x₂, x₁², x₂²]
a) K(x̄^(i), x̄^(j)) = (r + x̄^(i) · x̄^(j))²
b) K(x̄^(i), x̄^(j)) = (x̄^(i) · x̄^(j))² + r
c) K(x̄^(i), x̄^(j)) = r (x̄^(i) · x̄^(j))²
d) K(x̄^(i), x̄^(j)) = r² (x̄^(i) · x̄^(j))²
e) K(x̄^(i), x̄^(j)) = (x̄^(i) · x̄^(j))²
Hint: separate out the terms that include r from the terms that do not:
φ(ū) · φ(v̄) = r² + 2r·u₁v₁ + 2r·u₂v₂ + 2u₁v₁u₂v₂ + u₁²v₁² + u₂²v₂²
            = r² + 2r(ū · v̄) + (ū · v̄)²
            = (r + ū · v̄)²   ⟹ answer (a)
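A small numerical check of answer (a), purely illustrative: the dot product of the given feature maps should equal (r + ū · v̄)² for any inputs.

import numpy as np

def phi(x, r):
    # feature map from the exercise, for 2-dimensional input
    x1, x2 = x
    return np.array([r, np.sqrt(2 * r) * x1, np.sqrt(2 * r) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

def k_a(u, v, r):
    # candidate (a): K(u, v) = (r + u . v)^2
    return (r + np.dot(u, v)) ** 2

u, v, r = np.array([1.0, 2.0]), np.array([0.5, -1.0]), 3.0
print(np.dot(phi(u, r), phi(v, r)))   # 2.25
print(k_a(u, v, r))                   # 2.25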
Examples of Valid Kernels
Linear Kernel
K(ū, v̄) = ū · v̄
Quadratic Kernel
K(ū, v̄) = (ū · v̄ + r)², with r ≥ 0
RBF Kernel (aka Gaussian Kernel)
K(ū, v̄) = exp(−γ‖ū − v̄‖²)
(corresponds to an infinite-dimensional feature map)
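A minimal NumPy sketch of these three kernels (the default values of r and γ below are arbitrary placeholders):

import numpy as np

def linear_kernel(u, v):
    return np.dot(u, v)

def quadratic_kernel(u, v, r=1.0):
    # valid for r >= 0
    return (np.dot(u, v) + r) ** 2

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))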
Gaussian Kernel: the γ parameter
K(x̄^(i), x̄^(j)) = exp(−γ‖x̄^(i) − x̄^(j)‖²)
[Visualization: SVM decision boundaries with the RBF kernel for different values of γ]
γ controls how quickly similarity decays with distance, i.e., how much the classifier should care about far-away examples; the underlying feature map corresponds to a sum of polynomial feature maps of higher and higher degree.
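To reproduce this kind of visualization yourself, scikit-learn's SVC supports the RBF kernel directly; a rough sketch on a synthetic dataset (the data and γ values here are made up for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)   # circular decision boundary

for gamma in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(gamma, clf.score(X, y))   # larger gamma -> more flexible (more local) boundary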
RBF Kernels
K(ū, v̄) = exp(−γ‖ū − v̄‖²)
Taylor series: exp(z) = Σ_{n=0}^∞ zⁿ/n!
Expanding the exponential term by term shows that the RBF kernel is a weighted sum of polynomial kernels of every degree, i.e., it corresponds to an infinite-dimensional feature map.
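A numerical illustration of this expansion, assuming the factorization exp(−γ‖ū − v̄‖²) = exp(−γ‖ū‖²) exp(−γ‖v̄‖²) exp(2γ ū·v̄) and Taylor-expanding the last factor:

import numpy as np
from math import factorial

def rbf(u, v, gamma):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def rbf_truncated(u, v, gamma, N):
    # exp(-g||u||^2) * exp(-g||v||^2) * sum_{n=0..N} (2g u.v)^n / n!
    # each term (u.v)^n corresponds to a polynomial kernel of degree n
    prefactor = np.exp(-gamma * np.dot(u, u)) * np.exp(-gamma * np.dot(v, v))
    series = sum((2 * gamma * np.dot(u, v)) ** n / factorial(n) for n in range(N + 1))
    return prefactor * series

u, v, gamma = np.array([0.3, -0.2]), np.array([0.1, 0.5]), 1.0
print(rbf(u, v, gamma))
print(rbf_truncated(u, v, gamma, N=10))   # approaches the exact RBF value as N grows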
Kernel Algebra
Let K₁(·,·) and K₂(·,·) be valid kernels. Then the following are also valid kernels:
• sum:            K(x̄, z̄) = K₁(x̄, z̄) + K₂(x̄, z̄)
• scalar product: K(x̄, z̄) = αK₁(x̄, z̄), with α > 0
• direct product: K(x̄, z̄) = K₁(x̄, z̄) K₂(x̄, z̄)
i.e., in all these cases there exists a feature map φ such that K(x̄, z̄) = φ(x̄) · φ(z̄)
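A sketch of how these rules let you build new kernels from old ones (the kernel choices and constants below are arbitrary):

import numpy as np

def k1(u, v):                       # linear kernel
    return np.dot(u, v)

def k2(u, v, gamma=0.5):            # RBF kernel
    return np.exp(-gamma * np.sum((u - v) ** 2))

def k_combined(u, v, alpha=2.0):
    # sum, positive scaling, and product of valid kernels are all valid kernels
    return k1(u, v) + alpha * k2(u, v) + k1(u, v) * k2(u, v)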
Mercer’s Theorem
Intuition:
A function K: R^d × R^d → R is a valid kernel if and only if
for any x̄^(1), …, x̄^(n) with x̄^(i) ∈ R^d and finite n, the n×n Gram matrix G with Gᵢⱼ = K(x̄^(i), x̄^(j)) is positive semidefinite.
That is, G is
• symmetric: G = Gᵀ
• and ∀z̄ ∈ Rⁿ: z̄ᵀGz̄ ≥ 0
In other words, for such a function K(x̄, z̄), there exists a function φ such that K(x̄, z̄) = φ(x̄) · φ(z̄).
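A quick empirical check of these two conditions on a random sample (a sanity check, not a proof of validity):

import numpy as np

def gram_matrix(X, kernel):
    # G_ij = K(x^(i), x^(j))
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(1).normal(size=(5, 2))
G = gram_matrix(X, lambda u, v: np.exp(-np.sum((u - v) ** 2)))   # RBF kernel, gamma = 1

print(np.allclose(G, G.T))                        # symmetric
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)    # eigenvalues nonnegative (up to round-off)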
Section 2: Regularization
Bias Variance Tradeoff
bias: the model is not able to fully capture the patterns in the dataset
variance: the model is sensitive to noise in the dataset
Bias-Variance Tradeoff
Estimation Error (variance)
*low variance → constant function
*high variance → high-degree polynomial, RBF kernel
Structural Error (bias)
*low bias → linear regression applied to linear data, RBF kernel
*high bias → constant function, linear model applied to non-linear data
How to find boundaries that generalize well?
• Regularization
• Maximum margin separator
As we will see, these are related
Regularization
Idea: prefer a simpler hypothesis
• pushes parameters toward some default value (typically zero)
• resists setting parameters away from the default value when the data only weakly tell us otherwise
J_{n,λ}(θ̄) = λZ(θ̄) + R_n(θ̄)
where R_n(θ̄) is the training error (how well θ̄ does w.r.t. S_n), Z(θ̄) is the regularization term/penalty, and λ ≥ 0 is a hyperparameter.
What should Z(θ̄) be?
• Desirable characteristics:
– should force components of θ̄ to be small (close to zero)
– convex, smooth
• A popular choice: ℓ1 / ℓ2 norms
– Let's use the (squared) ℓ2 norm as the penalty function: Z(θ̄) = ‖θ̄‖²/2
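As a concrete, illustrative instance of J_{n,λ}, a NumPy sketch with the ℓ2 penalty Z(θ̄) = ‖θ̄‖²/2 and average squared loss as the training error (squared loss is introduced in Section 3):

import numpy as np

def regularized_objective(theta, X, y, lam):
    # J_{n,lambda}(theta) = lambda * Z(theta) + R_n(theta)
    penalty = 0.5 * lam * np.dot(theta, theta)           # Z(theta) = ||theta||^2 / 2
    empirical_risk = np.mean((y - X @ theta) ** 2) / 2   # average squared loss over S_n
    return penalty + empirical_risk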
Soft-Margin SVM: exercise
Claim: Soft-margin SVM is an optimization problem with hinge loss as the objective function and an L2-norm regularizer.

min_{θ̄, b, ξ̄}   ‖θ̄‖²/2 + C Σ_{i=1}^n ξᵢ
subject to y^(i)(θ̄ · x̄^(i) + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for i ∈ {1, …, n}

is equivalent (with λ = 1/(nC)) to

min_{θ̄, b}   (λ/2)‖θ̄‖² + (1/n) Σ_{i=1}^n max{1 − y^(i)(θ̄ · x̄^(i) + b), 0}

Derivation left as exercise:
• Write ξᵢ ≥ 1 − y^(i)(θ̄ · x̄^(i) + b) and ξᵢ ≥ 0
• Observe that the objective function includes the term min_{ξ̄} Σ_{i=1}^n ξᵢ
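A sketch of the regularized hinge-loss objective above, assuming the data are rows of X and the labels are in {−1, +1} (written for clarity rather than efficiency):

import numpy as np

def soft_margin_objective(theta, b, X, y, lam):
    # (lambda/2) * ||theta||^2 + (1/n) * sum_i max{1 - y^(i) (theta . x^(i) + b), 0}
    margins = y * (X @ theta + b)
    hinge_loss = np.mean(np.maximum(1.0 - margins, 0.0))
    return 0.5 * lam * np.dot(theta, theta) + hinge_loss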
Section 3: Linear Regression with Squared Loss
Supervised Learning
– Given data x̄ (in feature space) and the labels y
– Learn to predict y from x̄
• Labels could be discrete or continuous – Discrete labels: classification
– Continuous labels: regression
Regression vs Classification
Classification problem: y ∈ {−1, 1} or y ∈ {0, 1}
Regression problem: y ∈ R
Regression function f : R^d → R, where f ∈ F
Linear Regression
A linear regression function is simply a linear function of the feature vector:
f(x̄; θ̄, b) = θ̄ · x̄ + b
Learning task:
Choose parameters in response to training set
S_n = {(x̄^(i), y^(i))}_{i=1}^n, where x̄ ∈ R^d and y ∈ R
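A minimal NumPy sketch of fitting such a function by ordinary least squares on synthetic data (the data-generating parameters below are made up for illustration):

import numpy as np

# synthetic data: y = 2*x + 1 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=100)

# append a constant feature so the offset b is absorbed into theta
Xb = np.hstack([X, np.ones((len(X), 1))])

# least-squares fit: minimizes the average squared error (1/n) sum_i (y^(i) - theta . x^(i))^2
theta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(theta_hat)   # approximately [2.0, 1.0]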