Lecture 15: Kernels (1) CS 189 (CDSS offering)
2022/02/23
The SVM dual optimization problem
• Final form: $\arg\max_{\alpha} \; \alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Q \alpha$  s.t.  $\alpha^\top y = 0$,  $0 \le \alpha_i \le C \;\; \forall i$
• This is also a QP! After solving, we get $w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i$
• We previously had an optimization problem on the d-dimensional w, and now we have an optimization problem on the N-dimensional $\alpha$
• If you’re interested: the Karush-Kuhn-Tucker (KKT) conditions assure us that the solution we find from either the primal or the dual formulation is optimal
Some intuition about $\alpha$
• The support vectors correspond to the indices i for which $\alpha_i > 0$
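• As a minimal sketch (not from the lecture): assuming the dual variables $\alpha$ have already been obtained from some QP solver, recovering $w^*$ and the support vectors could look like this in NumPy; the tolerance 1e-8 is an arbitrary numerical cutoff for "nonzero":

```python
import numpy as np

def recover_primal(X, y, alpha, tol=1e-8):
    """Given data X (N, d), labels y in {-1, +1} (N,), and dual variables
    alpha (N,) from a QP solver, recover w* and the support vector indices."""
    # w* = sum_i alpha_i* y_i x_i
    w = (alpha * y) @ X
    # Support vectors: points with alpha_i > 0 (up to numerical tolerance)
    support_idx = np.where(alpha > tol)[0]
    return w, support_idx
```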
Today’s lecture
• Today’s lecture (finally) returns to the closing question from the first lecture:
• Machine learning is all about “predicting y from x”, but what is x?
• We have thus far just used whatever x (scalar or vector) is provided, and we have focused on devising different models for predicting y
• Various forms of linear regression, logistic regression, SVMs…
• These are all linear models! What if the underlying relationship isn’t linear?
• An overarching theme for much of the remaining course is to put a linear model on top of a feature representation $\phi(x)$, rather than x directly — we start this today
Recall: how powerful are linear models anyway? Or, the importance of featurization
• Linear models, by themselves, might not seem that useful
• However, they can be, if we use the right set of features
• That is, instead of using x directly, we work with a featurization $\phi(x)$ that may be better suited to the data (see the example below)
• This week, we will discuss a general approach to featurization that makes it feasible to work with very large (even infinite dimensional!) $\phi(x)$
• This approach is called kernelization
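• As a concrete (hypothetical) illustration of this idea, the sketch below fits an ordinary least squares model on the quadratic featurization $\phi(x) = [1, x, x^2]$ and recovers a nonlinear relationship between x and y; the data-generating coefficients are made up for illustration:

```python
import numpy as np

# Synthetic 1-D data with a quadratic (nonlinear) relationship between x and y
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(100)

# A linear model on top of phi(x) = [1, x, x^2] instead of x directly
Phi = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares on phi(x)
print(w)                                      # approximately [1.0, -2.0, 0.5]
```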
A classic example of featurization
(this example is for linear classification, not regression)
[Figure: classic featurization example for a linear classifier; image credit: Advanced Machine Learning Specialization, Coursera]
The problem with (naïve) featurization
• Suppose my original input x is a d-dimensional vector $[x_1, \dots, x_d]$
• A quadratic featurization of x would include all the original features $x_i$ along with all products $x_i x_j$ for $i \in \{1, \dots, d\}$ and $j \in \{1, \dots, d\}$
• This feature vector scales quadratically with d, in terms of both the space required to store the data and the computation time
• In general, if we want to use an order-p polynomial featurization, we will have $O(d^p)$ time and space requirements — there must be a better way…
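• As a rough illustration (not from the lecture), here is a naive order-p polynomial featurization written with itertools; note how quickly the number of features grows with d and p:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x, p):
    """Naive order-p polynomial featurization: all products of p or fewer entries of x."""
    feats = [1.0]
    for degree in range(1, p + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.random.randn(20)          # d = 20
print(len(poly_features(x, 2)))  # 231 features for p = 2
print(len(poly_features(x, 3)))  # 1771 features for p = 3, i.e., roughly O(d^p) growth
```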
Revisiting the SVM dual formulation
• The dual form: $\arg\max_{\alpha} \; \alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Q \alpha$  s.t.  $\alpha^\top y = 0$,  $0 \le \alpha_i \le C \;\; \forall i$
• $Q_{ij} = y_i y_j x_i^\top x_j$ for $i \ne j$, and $Q_{ii} = x_i^\top x_i$
• Also, we have $w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i$ — for a new point x, we would compute $w^{*\top} x = \sum_{i=1}^{N} \alpha_i^* y_i \, x_i^\top x$ in order to make our prediction
• Notice how everything only depends on inner products between xs… interesting?
Why is this interesting? What’s so great about inner products?
• For the soft margin SVM, during training we would need to compute all $x_i^\top x_j$ to get Q
• When classifying a new x, we would need to compute $x_i^\top x$ for all support vectors $x_i$
• With featurization, we would need to compute $\phi(x_i)^\top \phi(x_j)$ and $\phi(x_i)^\top \phi(x)$ instead
• And if we are working with large featurizations, doing this naïvely can get expensive
• Key idea: here, we never actually need $\phi(x)$ directly, only inner products of features
• Can this buy us anything? Turns out, yes! With the kernel trick
The kernel trick — example
• Consider quadratic features for $x \in \mathbb{R}^2$: $\phi(x) = [\,1, \; \sqrt{2}\,x_1, \; \sqrt{2}\,x_2, \; x_1^2, \; x_2^2, \; \sqrt{2}\,x_1 x_2\,]^\top$
• Then $\phi(x)^\top \phi(x') = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_1' x_2 x_2'$
• Look familiar? This is exactly the alternate expression $(1 + x^\top x')^2$
• So evaluating $(1 + x^\top x')^2$ gives the inner product of the quadratic features, but we didn’t have to compute the features explicitly
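• A quick numerical check of this identity (a sketch; the explicit $\sqrt{2}$-weighted feature map below is written out to match the expansion above):

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for x in R^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x  = np.array([0.5, -1.2])
xp = np.array([2.0,  0.3])

lhs = (1.0 + x @ xp) ** 2     # kernel evaluation: O(d) work
rhs = phi(x) @ phi(xp)        # explicit features: O(d^2) work
print(np.isclose(lhs, rhs))   # True: same inner product, no explicit features needed
```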
The kernel trick, in general
• The kernel trick is based on the observation that, for many useful featurizations $\phi(x)$ of x, we can compute inner products without explicitly constructing $\phi(x)$
• This is really useful if the $\phi(x)$ we want to use is massive
• We can even work with infinite dimensional $\phi(x)$! More on that in a second
• This trick is only applicable for models which only require inner products between different $\phi(x)$ and not the $\phi(x)$ themselves
• This is why we learned the SVM dual formulation, which can be kernelized
Kernelization
• Kernelizing a model, at a high level, involves replacing all $x^\top x'$ with $\phi(x)^\top \phi(x')$
• The kernel function $k(x, x') = \phi(x)^\top \phi(x')$ specifies how to compute this inner product without computing the featurizations themselves
• The kernel matrix $K_{ij} = k(x_i, x_j)$ stores the inner products between all the pairs of training points (see the sketch below)
• (Soft margin) SVM is our first example of a model which can be kernelized — next time, we will see a couple more examples
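• A minimal sketch of building the kernel matrix K for a training set, here using the quadratic kernel from the earlier example (the particular kernel and the toy data are just for illustration):

```python
import numpy as np

def kernel_matrix(X, kernel):
    # K_ij = k(x_i, x_j) for every pair of training points
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

quadratic = lambda x, xp: (1.0 + x @ xp) ** 2   # the kernel worked out earlier
X = np.random.randn(5, 3)                       # N = 5 training points in R^3
K = kernel_matrix(X, quadratic)                 # (5, 5) and symmetric
```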
The kernelized soft margin SVM
• The dual form: $\arg\max_{\alpha} \; \alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Q \alpha$  s.t.  $\alpha^\top y = 0$,  $0 \le \alpha_i \le C \;\; \forall i$
• $Q_{ij} = y_i y_j K_{ij}$ for $i \ne j$, and $Q_{ii} = K_{ii}$
• For a new point x, we compute $\sum_{i=1}^{N} \alpha_i^* y_i \, k(x_i, x)$ — note that most $\alpha_i^* = 0$
• This approach is better than the primal formulation if the number of features dominates — e.g., for an order-p polynomial kernel, if $d^p \gg N$
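• A minimal prediction sketch, assuming the dual variables $\alpha$ have already been found by solving this QP (consistent with the formula above, the bias term is omitted):

```python
import numpy as np

def svm_predict(x_new, X, y, alpha, kernel, tol=1e-8):
    # Sum only over support vectors (alpha_i > 0); all other terms are zero
    sv = np.where(alpha > tol)[0]
    score = sum(alpha[i] * y[i] * kernel(X[i], x_new) for i in sv)
    return np.sign(score)
```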
Examples of kernel functions and their corresponding featurizations
• The polynomial kernel $k(x, x') = (c + x^\top x')^p$ corresponds to a $\phi(x)$ containing all combinations (products) of p (or fewer) dimensions of x
• We worked out the quadratic kernel, where $p = 2$ and we set $c = 1$
• The radial basis function (RBF) kernel $k(x, x') = \exp\!\left( -\frac{\| x - x' \|^2}{2 \sigma^2} \right)$ corresponds to an infinite dimensional $\phi(x)$! You’ll see this in your homework
• And many more, which are far less frequently used…
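• The two kernels above, as simple NumPy functions (a sketch; the default values of c, p, and sigma are arbitrary choices, not values from the lecture):

```python
import numpy as np

def polynomial_kernel(x, xp, c=1.0, p=2):
    # k(x, x') = (c + x^T x')^p
    return (c + x @ xp) ** p

def rbf_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
```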