
Lecture 16: Kernels (2)
CS 189 (CDSS offering)
2022/02/25

Today’s lecture


• Last time, we saw how to kernelize the soft margin SVM, such that we can use featurizations φ(x) that are very high dimensional (or even infinite dimensional)
• Today, we will kernelize a couple more of the models we have already studied — the perceptron and ridge regression
• Keep in mind what this allows us to do — we can “extend” these linear models into problems requiring nonlinear solutions!
• Also worth noting: a number of models that we will study later in the course can also be kernelized, so it really is quite a useful technique

Recall: kernelization
• Kernelizing a model, at a high level, involves replacing all x^T x' with φ(x)^T φ(x')
• All machine learning models must specify two things: how they are learned at training time, and how they are used to predict at test time
• If both of these can be specified using only inner products, we are in business
• The kernel function k(x, x') = φ(x)^T φ(x') specifies how to compute this inner product without computing the featurizations
• The kernel matrix K_ij = k(x_i, x_j) stores the inner products between all the pairs of training points
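To make the kernel matrix concrete, here is a minimal sketch (not from the lecture) of building K for a small dataset; the polynomial kernel and all variable names are illustrative choices of mine.

    import numpy as np

    def poly_kernel(x, x_prime, degree=2):
        # One possible k(x, x'): the polynomial kernel (1 + x^T x')^degree
        return (1.0 + x @ x_prime) ** degree

    def kernel_matrix(X, kernel):
        # K[i, j] = k(x_i, x_j) for all pairs of training points
        N = X.shape[0]
        K = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                K[i, j] = kernel(X[i], X[j])
        return K

    X = np.random.randn(5, 3)          # 5 training points in 3 dimensions
    K = kernel_matrix(X, poly_kernel)  # 5 x 5 matrix of feature-space inner products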

Recall: the perceptron algorithm
The simplest version
• The perceptron is best understood as repeating the following update to w
• Pick a (x_i, y_i) — if sign(w^T x_i) = y_i, move on; otherwise, w ← w + y_i x_i
• We repeat this until sign(w^T x_i) = y_i for all i
• For linearly separable datasets, this is guaranteed to eventually happen!
• If the data are not linearly separable, this algorithm will never terminate
• Turns out, kernelizing this algorithm and the linear model is pretty straightforward!
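As a quick illustration of the update above, here is a minimal sketch of the primal perceptron; the max_passes cap is my own addition so that the loop also stops on non-separable data, and labels are assumed to be in {-1, +1}.

    import numpy as np

    def perceptron(X, y, max_passes=100):
        # X: (N, d) data matrix, y: (N,) labels in {-1, +1}
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(max_passes):
            mistakes = 0
            for i in range(N):
                if np.sign(w @ X[i]) != y[i]:    # misclassified (sign(0) counts as a mistake)
                    w = w + y[i] * X[i]          # the perceptron update
                    mistakes += 1
            if mistakes == 0:                    # every point correctly classified: done
                return w
        return w                                 # may not have converged if not separable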

Characterizing w for the kernel perceptron
• If we initialize w = 0, then, since the update is w ← w + y_i x_i, at any iteration during learning, I can write w = Σ_{i=1}^N α_i y_i x_i for the appropriately chosen α_i
• Specifically, α_i will be the number of times the i-th data point has been misclassified up to the current training iteration
• And, if the algorithm terminates, it will return some w* = Σ_{i=1}^N α_i* y_i x_i
• An aside: this is an instance of the more general representer theorem

Learning in the kernel perceptron
• Before: initialize w = 0
  • Pick a (x_i, y_i)
  • If sign(w^T x_i) ≠ y_i, then w ← w + y_i x_i
  • Once all points correctly classified: return w*
• Now: initialize α = 0
  • Pick a (x_i, y_i)
  • If sign(Σ_{j=1}^N α_j y_j x_j^T x_i) ≠ y_i, then α_i ← α_i + 1
  • Once all points correctly classified: return α*

Predicting in the kernel perceptron
And, actually kernelizing the kernel perceptron
• Prediction on a new x was previously just sign(w*^T x)
• Now, it is sign(Σ_{i=1}^N α_i* y_i x_i^T x)
• We are definitely in business, since we only have inner products everywhere!
• To switch from x to φ(x), replace all x_i^T x_j for pairs of training points with K_ij, and replace all other x_i^T x with k(x_i, x), for the appropriate K and k
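Putting the learning and prediction rules together, here is a minimal sketch of a kernel perceptron written entirely in terms of k(x, x'); the polynomial kernel, the max_passes cap, and all names are illustrative assumptions of mine, not the official course implementation.

    import numpy as np

    def poly_kernel(x, x_prime, degree=2):
        # An example kernel; any valid k(x, x') works here
        return (1.0 + x @ x_prime) ** degree

    def kernel_perceptron_fit(X, y, kernel, max_passes=100):
        # alpha[i] counts how many times the i-th point has been misclassified
        N = X.shape[0]
        K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
        alpha = np.zeros(N)
        for _ in range(max_passes):
            mistakes = 0
            for i in range(N):
                score = np.sum(alpha * y * K[:, i])   # sum_j alpha_j y_j k(x_j, x_i)
                if np.sign(score) != y[i]:
                    alpha[i] += 1
                    mistakes += 1
            if mistakes == 0:
                break
        return alpha

    def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
        # sign( sum_i alpha_i y_i k(x_i, x) ) -- only kernel evaluations are needed
        k_x = np.array([kernel(x_i, x) for x_i in X_train])
        return np.sign(np.sum(alpha * y_train * k_x))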

Kernelizing linear classifiers in general
• sign(Σ_{i=1}^N α_i* y_i x_i^T x) looks very similar to how the kernel SVM turned out
• And, in fact, it is also the same for kernel logistic regression
• This is no accident — it is a direct consequence of the representer theorem applied to linear models
• From this perspective, the big difference between these three methods for learning linear classifiers is that they will produce different α_i*
• SVM is quite nice, from this perspective, because most α_i* will be zero!

Recall: ridge regression
• Ridge regression adds ℓ2-regularization to least squares linear regression: arg min_w ‖Xw - y‖² + λ‖w‖²
• The solution we derived was given by w* = (X^T X + λI)^{-1} X^T y
• If this doesn’t come back to you immediately, it would be good to review this
• At first glance, it may not be obvious how to kernelize this solution, but we will see how it becomes obvious using a cool math trick
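For reference, a minimal numpy sketch of this closed-form solution (my own code, using np.linalg.solve rather than an explicit matrix inverse):

    import numpy as np

    def ridge_regression(X, y, lam):
        # w* = (X^T X + lam * I_d)^{-1} X^T y, computed by solving a linear system
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    X = np.random.randn(20, 5)
    y = np.random.randn(20)
    w_star = ridge_regression(X, y, lam=0.1)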

A cool math trick
• Claim: (X^T X + λI_d)^{-1} X^T = X^T (XX^T + λI_N)^{-1}
• To see this, start from X^T (XX^T + λI_N) = (X^T X + λI_d) X^T, which holds because both sides equal X^T XX^T + λX^T
• Left-multiply by (X^T X + λI_d)^{-1} and right-multiply by (XX^T + λI_N)^{-1} to get the claim
• Applying it to the ridge solution: w* = (X^T X + λI_d)^{-1} X^T y = X^T (XX^T + λI_N)^{-1} y
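A quick way to sanity-check this identity is to compare both expressions numerically on random data; this check is my own sketch, not part of the lecture.

    import numpy as np

    N, d, lam = 20, 5, 0.1
    X = np.random.randn(N, d)
    y = np.random.randn(N)

    # Primal form: (X^T X + lam I_d)^{-1} X^T y
    w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # Dual form: X^T (X X^T + lam I_N)^{-1} y
    w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

    print(np.allclose(w_primal, w_dual))   # True, up to numerical precision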

Kernel ridge regression
• So, we can rewrite w* = (X^T X + λI)^{-1} X^T y = X^T (XX^T + λI)^{-1} y, for an N × N identity I instead of a d × d I
• What is XX^T? By inspection, we have (XX^T)_ij = x_i^T x_j
• XX^T is called a Gram matrix and can be directly replaced with K!
• For predicting on a new x, we compute w*^T x = y^T (XX^T + λI)^{-1} Xx
• What is Xx? By inspection, we have (Xx)_i = x_i^T x — replace with k(x_i, x)
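Following those two substitutions, here is a minimal sketch of kernel ridge regression; the kernel is left to the caller, and the variable names are my own.

    import numpy as np

    def kernel_ridge_fit(K, y, lam):
        # Solve (K + lam I)^{-1} y once at training time; K[i, j] = k(x_i, x_j)
        N = K.shape[0]
        return np.linalg.solve(K + lam * np.eye(N), y)

    def kernel_ridge_predict(coef, kernel, X_train, x):
        # Prediction y^T (K + lam I)^{-1} k_x, where (k_x)_i = k(x_i, x);
        # since K + lam I is symmetric, this equals coef @ k_x
        k_x = np.array([kernel(x_i, x) for x_i in X_train])
        return coef @ k_x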

If time: the RBF kernel
• We gave the example last time of the radial basis function (RBF) kernel k(x, x') = exp(-‖x - x'‖₂² / 2), also called the Gaussian kernel
• It can be understood as an infinite dimensional φ(x), but it is perhaps better understood as a "distance" between x and x'
• For linear regression, the prediction function becomes a linear combination of Gaussians!
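To see the "linear combination of Gaussians" concretely, here is a small sketch that fits kernel ridge regression with an RBF kernel on toy 1-D data; the bandwidth sigma, the regularizer lam, and the sine-wave data are all my own illustrative choices.

    import numpy as np

    def rbf_kernel(x, x_prime, sigma=1.0):
        # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); sigma is a bandwidth parameter
        return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))

    # Toy 1-D regression data
    X_train = np.linspace(-3, 3, 10).reshape(-1, 1)
    y_train = np.sin(X_train).ravel()
    lam = 0.1

    # Kernel ridge fit: coef = (K + lam I)^{-1} y
    K = np.array([[rbf_kernel(a, b) for b in X_train] for a in X_train])
    coef = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

    def predict(x):
        # A weighted sum of Gaussian bumps centered at the training points
        return sum(c * rbf_kernel(x_i, x) for c, x_i in zip(coef, X_train))

    print(predict(np.array([0.5])))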
