
COMP9417 – Machine Learning Tutorial: Kernel Methods
Weekly Problem Set: Please submit questions 4a and 7 on Moodle by 11:55am Tuesday 29th March, 2022. Please only submit these requested questions and no others.
Question 1 (Dual Perceptron)
Review the development of the Perceptron training algorithm from lectures. Now compare this to the algorithm for Perceptron training in dual form introduced in the “Kernel Methods” lecture. The two algorithms are very similar but differ in a few crucial places. Provide an explanation of how the dual version of the algorithm relates to the original.
For the purposes of this answer, we ignore the learning rate η, or equivalently assume that η = 1. In the original Perceptron training algorithm we try to classify the current example and check for an error, which occurs if yi⟨w, xi⟩ ≤ 0. When this happens, we update the weight vector w by adding ηyixi = yixi to it.
Thinking about this, clearly the learned weight vector will have been updated once for each time an example was misclassified. Denoting the number of mistakes made on example xi by αi, the weight vector can be expressed as
w = Σ_{i=1}^n αi yi xi.
On this dual view, we are learning instance weights αi rather than feature weights wj . An instance
x is classified as
ŷ = sgn( Σ_{i=1}^n αi yi ⟨xi, x⟩ ).
During training, the only information needed about the training data is all pairwise dot products,
which can be computed and stored easily in an n × n matrix called the Gram matrix:

G = XXᵀ = [ ⟨x1, x1⟩  ⟨x1, x2⟩  · · ·  ⟨x1, xn⟩ ]
          [ ⟨x2, x1⟩  ⟨x2, x2⟩  · · ·  ⟨x2, xn⟩ ]
          [     ⋮         ⋮       ⋱        ⋮   ]
          [ ⟨xn, x1⟩  ⟨xn, x2⟩  · · ·  ⟨xn, xn⟩ ]
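As a concrete sketch of the relationship between the two algorithms, the dual form can be implemented directly from the Gram matrix. The following is a minimal NumPy illustration (the function names and the toy data are our own, and we take η = 1 as above):

```python
import numpy as np

def dual_perceptron(X, y, epochs=100):
    """Perceptron training in dual form: learn one mistake count alpha_i
    per training example, rather than one weight per feature.

    X : (n, p) array of examples; y : (n,) array of labels in {-1, +1}.
    """
    n = X.shape[0]
    G = X @ X.T                    # Gram matrix of all pairwise dot products
    alpha = np.zeros(n)
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            # classify x_i using only dot products with the training data
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1      # a mistake on x_i increments its count
                mistakes += 1
        if mistakes == 0:          # converged: no errors on a full pass
            break
    return alpha

# toy linearly separable data (separable through the origin)
X = np.array([[2., 2.], [1., 3.], [-2., -1.], [-1., -3.]])
y = np.array([1., 1., -1., -1.])
alpha = dual_perceptron(X, y)

# recover the primal weight vector: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
print(np.sign(X @ w) == y)         # all training points classified correctly
```

Note that the training loop never touches the raw feature vectors after `G` has been formed; this is exactly the property the kernel trick exploits later.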
Question 2 (Feature Transformations)
Recall that the XOR function (graphed below) is not linearly separable. Show how we can learn the XOR function using a linear classifier after applying a feature transformation to the original dataset. As a concrete example, show how to extend the Dual Perceptron from the previous question to learn the XOR function.
[Figure 1: the XOR function in (x1, x2)-space]
Clearly, a linear classifier trained on the original data will not work, since linear classifiers only work when the data is linearly separable. We can either try to learn a non-linear model, or, as suggested in the question, we could transform our data to try to make it linearly separable. There are many possible transformations that will achieve this. Consider the following transformation φ : R² → R⁶, defined by
φ(x) = φ([x1, x2]ᵀ) = [1, √2 x1, √2 x2, x1², x2², √2 x1 x2]ᵀ.
In other words, φ is the function that takes a vector [x1,x2] in 2-dimensions (original space), and returns a 6-dimensional vector (feature space). So, for our XOR problem, we would have:
φ([1, 1]ᵀ)   = [1, √2, √2, 1, 1, √2]ᵀ,
φ([−1, 1]ᵀ)  = [1, −√2, √2, 1, 1, −√2]ᵀ,
φ([1, −1]ᵀ)  = [1, √2, −√2, 1, 1, −√2]ᵀ,
φ([−1, −1]ᵀ) = [1, −√2, −√2, 1, 1, √2]ᵀ.

Now, we can't visualise 6-dimensional space, but let's pick 2 of the 6 features and plot, namely the second and sixth features, (√2 x1, √2 x1 x2):

[Figure: XOR in (√2 x1, √2 x1 x2)-space — the two classes are now linearly separable]
Clearly, a linear classifier will work now! Therefore, the idea is to first transform your data using an appropriate feature transformation, φ, and then implement some of the linear classifiers we have studied previously but on the transformed data. To learn the XOR function using the dual perceptron, we would simply run the dual perceptron on the transformed features, which means that we would only update the i-th weight, αi if
Σ_{j=1}^n αj yi yj ⟨φ(xj), φ(xi)⟩ ≤ 0,

so all that changes is that we now refer to the Gram matrix of the transformed features:

[ ⟨φ(x1), φ(x1)⟩  ⟨φ(x1), φ(x2)⟩  · · ·  ⟨φ(x1), φ(xn)⟩ ]
[ ⟨φ(x2), φ(x1)⟩  ⟨φ(x2), φ(x2)⟩  · · ·  ⟨φ(x2), φ(xn)⟩ ]
[        ⋮               ⋮          ⋱          ⋮        ]
[ ⟨φ(xn), φ(x1)⟩  ⟨φ(xn), φ(x2)⟩  · · ·  ⟨φ(xn), φ(xn)⟩ ]

Note that we did not really need to use the entire feature transformation here: we could have simply transformed the data using the following transformation γ : R² → R², defined by

γ(x) = γ([x1, x2]ᵀ) = [√2 x1, √2 x1 x2]ᵀ,

which obviously also allows us to use a linear classifier (even without the √2 terms). This is obviously a better choice, since it is cheaper to compute a 2-dimensional feature transformation than a 6-dimensional one. We will explore in the next question why, in general, we use higher dimensional (even infinite dimensional) feature transformations.

Question 3. (The Kernel Trick)
Using the context of the previous question, discuss the computational issues that arise from having to compute high dimensional feature transformations. Show how these can be mitigated by using the kernel trick, and use this to extend dual perceptron learning to kernel perceptron learning.

In the previous question, we saw that we could apply linear classifiers to non-linearly separable data by first transforming the data so that it is linearly separable in the feature space. We used the transformation φ, and showed that the dual perceptron can be trained on the transformed data. This requires us to compute a 6-dimensional transformation of every point, and then compute all pairwise dot products between the transformed points. In many cases we will have to compute very high dimensional feature transformations to get a linear classifier working on non-linearly separable data (think of real world datasets), and so this approach may become computationally prohibitive. This is where the idea of kernels comes in.
A kernel function is a function k : Rᵖ × Rᵖ → R, which takes two points in the original (p-dimensional) space and computes a number (a measure of similarity between the two inputs). There are infinitely many kernel functions; here are a few popular ones:
• Polynomial kernel: k(x1, x2) = (m + ⟨x1, x2⟩)ᵈ, where m ≥ 0 and d ∈ N are hyper-parameters to be chosen.
• Gaussian kernel: k(x1, x2) = exp(−∥x1 − x2∥²/(2σ²)), with hyper-parameter σ² > 0.
• Laplace kernel: k(x1, x2) = exp(−(1/b)∥x1 − x2∥₁), with hyper-parameter b > 0.
Note that importantly, all the above kernels are functions of the data in the original space. Let’s return to our feature transformation from the previous question. Let x, x′ be any two points in the original space, that is:
x = [x1, x2]ᵀ,   x′ = [x′1, x′2]ᵀ,
and recall that to run a linear classifier, we need all dot products of the transformed points, which are of the form:
⟨φ(x), φ(x′)⟩ = ⟨[1, √2 x1, √2 x2, x1², x2², √2 x1 x2]ᵀ, [1, √2 x′1, √2 x′2, x′1², x′2², √2 x′1 x′2]ᵀ⟩
             = 1 + 2 x1 x′1 + 2 x2 x′2 + x1² x′1² + x2² x′2² + 2 x1 x2 x′1 x′2,

which can be written simply as:

1 + 2 x1 x′1 + 2 x2 x′2 + x1² x′1² + x2² x′2² + 2 x1 x2 x′1 x′2 = (1 + ⟨x, x′⟩)².
From this, it is clear that the dot product is in fact an evaluation of the polynomial kernel with m = 1 and d = 2, i.e. that
k(x, x′) = ⟨φ(x), φ(x′)⟩.
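We can verify this identity numerically: the 2-dimensional kernel evaluation and the 6-dimensional dot product agree. A small NumPy check (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    """The explicit 6-dimensional feature map from Question 2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, xp, m=1, d=2):
    """Polynomial kernel k(x, x') = (m + <x, x'>)^d."""
    return (m + x @ xp) ** d

x = np.array([0.5, -1.0])
xp = np.array([2.0, 3.0])
print(phi(x) @ phi(xp))     # 6-dimensional dot product
print(poly_kernel(x, xp))   # 2-dimensional kernel evaluation: same value
```

The kernel route touches only the two 2-dimensional inputs, yet returns exactly the feature-space dot product (up to floating point).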
Why is this so important? Well, instead of having to first compute φ(x) and then compute all pairwise dot products, the above calculation shows that we simply have to compute the kernel function between the two original vectors, and this computation is relatively much cheaper since it only operates on points in the original space! In other words, instead of doing a long computation of dot products, all we need to do is compute the value of the kernel. This works because, recall, we only need access to dot products for learning; we don't really need to compute the feature transformations explicitly. This is called the kernel trick. In fact, we do not even have to think about computing a feature transformation function φ anymore: we simply choose a kernel, and each kernel gives us a corresponding feature transformation φ. This gives rise to the kernel perceptron, which is just a renaming of the dual perceptron on transformed features defined above, so we update the i-th weight αi if
Σ_{j=1}^n αj yi yj k(xi, xj) ≤ 0,

so all that changes is that we now refer to the kernel matrix, which is the pairwise evaluation of the chosen kernel k on the original data:

[ k(x1, x1)  k(x1, x2)  · · ·  k(x1, xn) ]
[ k(x2, x1)  k(x2, x2)  · · ·  k(x2, xn) ]
[     ⋮          ⋮        ⋱        ⋮    ]
[ k(xn, x1)  k(xn, x2)  · · ·  k(xn, xn) ]

This is equivalent to the Gram matrix of transformed features in the previous question, but cheaper to compute relative to the naive approach of computing feature transformations and then taking dot products. Going back to the question of why we chose a 6-dimensional representation: it should be clear now that this was done so that we could relate it back to the choice of kernel. In practice, if we are interested in polynomial features up to degree 8, say, then one would simply choose a polynomial kernel with d = 8.

Question 4. (Kernels and their Feature Representations)
In this question, we will show how the choice of kernel gives us different feature transformations. Note that in practice, we will simply choose a kernel and not be too concerned with the exact feature transformation, but it is important to know that different kernels correspond to different representations.

(a) Let x, y ∈ R² (i.e. x and y are two-dimensional vectors), and consider the kernel

k(x, y) = (2⟨x, y⟩ + 3)³.

Compute the feature vector φ(x) corresponding to this kernel. (In other words, the feature representation of x and y such that ⟨φ(x), φ(y)⟩ = k(x, y).)

Write x = (x1, x2)ᵀ and y = (y1, y2)ᵀ; then

k(x, y) = (2⟨x, y⟩ + 3)³
        = (2x1y1 + 2x2y2 + 3)³
        = 8x1³y1³ + 8x2³y2³ + 24x1²y1²x2y2 + 24x1y1x2²y2² + 36x1²y1² + 36x2²y2² + 72x1y1x2y2 + 54x1y1 + 54x2y2 + 27.

From here, it is easy to see that

φ(x)ᵀ = [√8 x1³, √8 x2³, √24 x1²x2, √24 x1x2², √36 x1², √36 x2², √72 x1x2, √54 x1, √54 x2, √27].

(b) Challenge: Let x, y ∈ R, and consider the Gaussian kernel:

k(x, y) = exp(−(x − y)²/(2σ²)),   σ > 0.

Compute the feature vector φ(x) corresponding to this kernel.
Hint: Use a Taylor expansion to rewrite the exponential in terms of a summation.
This exercise shows the full power of the kernel trick, since the Gaussian kernel allows us to compute infinite dimensional feature representations!
k(x, y) = exp(−(x − y)²/(2σ²))                                            (definition of the Gaussian kernel)
        = exp(−x²/(2σ²)) exp(−y²/(2σ²)) exp(xy/σ²)
        = exp(−x²/(2σ²)) exp(−y²/(2σ²)) Σ_{k=0}^∞ (xy)ᵏ/(σ²ᵏ k!)           (Taylor expansion of exp(xy/σ²))
        = Σ_{k=0}^∞ [exp(−x²/(2σ²)) xᵏ/(σᵏ √k!)] [exp(−y²/(2σ²)) yᵏ/(σᵏ √k!)],

which can be written as ⟨φ(x), φ(y)⟩ where:

φ(x) = exp(−x²/(2σ²)) [1, x/(σ√1!), x²/(σ²√2!), x³/(σ³√3!), …]ᵀ,
so we see that the implicit feature representation when using the Gaussian kernel is infinite dimensional! This means that by computing simple functions in the original space (via the kernel function), we are able to get powerful representations of the data without having to do any computation in infinite dimensional space!
Note again that in practice, we never actually need to compute explicit feature vectors, since we can always calculate dot products in the feature space simply by evaluating k, and for many kernels, such a computation is impossible anyway.
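Although the feature map is infinite dimensional, we can still check the identity numerically by truncating the expansion: the first K + 1 coordinates of φ already reproduce the kernel value to machine precision for moderate K. A small sketch (the names, test points, and the choice K = 20 are our own):

```python
import numpy as np
from math import factorial

def gauss_kernel(x, y, sigma=1.0):
    """Gaussian kernel on scalars: k(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def phi_truncated(x, sigma=1.0, K=20):
    """First K+1 coordinates of the infinite Gaussian feature map:
    phi_k(x) = exp(-x^2/(2 sigma^2)) * x^k / (sigma^k * sqrt(k!))."""
    scale = np.exp(-x ** 2 / (2 * sigma ** 2))
    return np.array([scale * x ** k / (sigma ** k * np.sqrt(factorial(k)))
                     for k in range(K + 1)])

x, y = 0.7, -0.3
print(phi_truncated(x) @ phi_truncated(y))  # truncated inner product
print(gauss_kernel(x, y))                   # exact kernel value: they agree
```

The series converges very quickly because the k-th term is proportional to (xy)ᵏ/k!, so a short truncation already matches the kernel evaluation to within floating-point error.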
Question 5 (More of the Kernel Trick)
You are told that the “kernel trick” means that a non-linear mapping can be realised from the original data representation to a new, implicit feature space simply by defining a kernel function on dot products of pairs of instances from the original data. To see why this is so, you take two instances x = [1, 2]ᵀ and y = [3, 2]ᵀ, and take their dot product ⟨x, y⟩ and obtain the answer 7. Clearly, raising this dot product to the power of two will give (⟨x, y⟩)² = 49. Now expand out this expression to show that this is the same answer you would have obtained if you had simply done a set of feature transformations on the original data.
Before reading this, if you are confused about feature representations and the kernel trick, have a look through the additional comments that follow this. In this problem, we are told that the original space is R², and we have two points:

x = [x1, x2]ᵀ,   y = [y1, y2]ᵀ,
and further that we are using the kernel:
k(x, y) = (⟨x, y⟩)2.
The goal is to figure out what feature representation (φ) we are choosing when we make use of this
kernel. To figure this out, note that:
k(x, y) = (⟨x, y⟩)²
        = (x1y1 + x2y2)²
        = x1²y1² + x2²y2² + 2x1y1x2y2
        = ⟨[x1², x2², √2 x1x2]ᵀ, [y1², y2², √2 y1y2]ᵀ⟩
        = ⟨φ(x), φ(y)⟩.
So, using the kernel k(x, y) = (⟨x, y⟩)2 is equivalent to using the feature mapping:
φ([x1, x2]ᵀ) = [x1², x2², √2 x1 x2]ᵀ.
Question 6 (More Feature Transformations)
Consider the following depiction of a feature transformation from two dimensional space (R2) to three dimensional space (R3). Why would we use such a transformation? Generate one example from the ◦ class in the original space, and another example from the × class in the original space, and show their transformed values in the new space.
[Figure: the ◦ and × classes in the original space R², and their images under a feature transformation into R³, where they become linearly separable]

The transformation allows us to use a linear classifier to correctly classify the data. In the original space, the points are not linearly separable, and the only way to do classification here would be to use a non-linear model, which can be difficult. The smart choice of transformation allows us to use our existing tools for linear classification, such as the perceptron. We can demonstrate this transformation on two points. Denote the transformation by φ, where φ : R² → R³, φ(x1, x2) = (x1², x2², √2 x1 x2). From the ◦ class consider x◦ = (√2, √2), and from the × class consider x× = (2√2, 3√2); then

φ(x◦) = (2, 2, 2√2),   φ(x×) = (8, 18, 12√2).

Question 7 (Support Vector Machines)
The Support Vector Machine is essentially an approach to learning linear classifiers, but it uses an alternative objective function to the methods we looked at before, namely maximising the margin. Learning algorithms for this problem typically use quadratic optimisation solvers, but it is possible to derive the solution manually for a small number of support vectors.
Here is a toy data set of three examples, shown as the matrix X, of which the first two are classified as positive and the third as negative, shown as the vector y. Start by constructing the Gram matrix for this data, incorporating the class labels, i.e., form the matrix X′(X′)ᵀ. Then solve to find the support vectors, their Lagrange multipliers α, then determine the weight vector w, threshold t and the margin m.

X = [ 1 3 ]    y = [ +1 ]    X′ = [ 1  3 ]
    [ 2 1 ]        [ +1 ]         [ 2  1 ]
    [ 0 1 ]        [ −1 ]         [ 0 −1 ]

To find a maximum margin classifier requires finding a solution for w, t and margin m. For this we can use the following steps (refer to slides 30–35 from the “Kernel Methods” lecture):
1. Set up the Gram matrix for labelled data
2. Set up the expression to be minimised
3. Take partial derivatives
4. Set to zero and solve for each multiplier
5. Solve for w
6. Solve for t
7. Solve for m

Recall that computing the SVM classifier is equivalent to solving the following constrained optimisation problem:

arg min_{w,t} (1/2)∥w∥²   subject to   yi(⟨xi, w⟩ − t) ≥ 1 for i = 1, …, n.

The way to interpret this expression is that we are looking for w, t such that:
1. We classify all points correctly, by having observations from the first class on one side of the line ⟨xi, w⟩ = t + 1 and points from the second class on the opposite side of the line ⟨xi, w⟩ = t − 1. This is taken care of in the above expression by requiring yi(⟨xi, w⟩ − t) ≥ 1 for all i, which captures both statements succinctly. Compare this to the weaker condition required for perceptron learning: yi(⟨xi, w⟩ − t) ≥ 0.
2. We require the margin between the two lines ⟨xi, w⟩ = t ± 1 to be as ‘fat’ as possible. Recall that this margin has width 1/∥w∥, which we would like to maximise, or equivalently, minimise ∥w∥, or equivalently minimise (1/2)∥w∥².

The plot below depicts the geometry of the problem (note that we take m = 1!).

[Figure: the decision boundary ⟨x, w⟩ = t and the two margin lines ⟨x, w⟩ = t ± 1]

Next, it was shown in the lecture that it is simpler to consider the dual problem:

arg max_{α1,…,αn}  Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ⟨xi, xj⟩
subject to  Σ_{i=1}^n αi yi = 0,   αi ≥ 0 for i = 1, …, n.

In the initial problem we learn the parameters w, t; in the dual problem we focus instead on the vector α = (α1, …, αn)ᵀ.
Note that we can still recover the initial parameters from α through the relationship:

w = Σ_{i=1}^n αi yi xi,

and, for the parameter t, we can solve the equation yi(⟨w, xi⟩ − t) = 1 for any xi on the decision boundary (any xi that is a support vector), that is:

t = ⟨w, xi⟩ − 1/yi.

The following steps explain how to solve the dual problem for the simple dataset provided in the question.

1. Considering the first term in the dual problem, we see that we will need the terms ⟨xi, xj⟩ for all i, j. The design matrix is the matrix of the xi's, defined by

X = [ x1ᵀ ]
    [ x2ᵀ ]  ∈ R^{n×p}.
    [  ⋮  ]
    [ xnᵀ ]

The Gram matrix (the matrix of all pairwise dot products) is then XXᵀ ∈ R^{n×n}, in which case the (i, j)-th element of the Gram matrix gives us the term ⟨xi, xj⟩. Here, however, we note that we further need the terms ⟨yi xi, yj xj⟩, which we can get by defining the augmented design matrix

X′ = [ y1 x1ᵀ ]
     [ y2 x2ᵀ ]  ∈ R^{n×p},
     [   ⋮    ]
     [ yn xnᵀ ]

and the augmented Gram matrix G′ as:

G′ ≡ (X′)(X′)ᵀ = [ 10   5  −3 ]
                 [  5   5  −1 ]
                 [ −3  −1   1 ]

This step just shows that when computing the SVM, all we need is the matrix G′; we are no longer required to carry around X. To see this more clearly, we can rewrite the dual problem as:

arg max_{α1,…,αn}  Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj G′[i, j]
subject to  Σ_{i=1}^n αi yi = 0,   αi ≥ 0 for i = 1, …, n.

2. The dual optimisation problem is thus

arg max_{α1,α2,α3} −(1/2)(10α1² + 10α1α2 − 6α1α3 + 5α2² − 2α2α3 + α3²) + α1 + α2 + α3

subject to α1 ≥ 0, α2 ≥ 0, α3 ≥ 0 and α1 + α2 − α3 = 0. Here, we note that since α1 + α2 − α3 = 0 implies α3 = α1 + α2, we can replace every occurrence of α3 accordingly and simplify our problem into a maximisation over α1 and α2 only:

arg max_{α1,α2} −(1/2)(10α1² + 10α1α2 − 6α1(α1 + α2) + 5α2² − 2α2(α1 + α2) + (α1 + α2)²) + 2α1 + 2α2
 = arg max_{α1,α2} −(1/2)(5α1² + 4α1α2 + 4α2²) + 2α1 + 2α2.

3. Compute the partial derivatives with respect to α1 and α2:

∂/∂α1 = −5α1 − 2α2 + 2,   ∂/∂α2 = −2α1 − 4α2 + 2.

4. Setting the partial derivatives to zero and solving gives α1 = 1/4 and α2 = 3/8. Also, since α3 = α1 + α2, we have α3 = 5/8. Note that since αi ≠ 0 for all i, all three points are support vectors; this is intuitive since our dataset is so small. In general, only the support vectors (points on the margins) will actually have a non-zero αi term. In this sense, αi captures the importance of the i-th point for learning the model. Points deep inside their respective classes are relatively unimportant: if we classify the points closest to the boundary between the classes correctly, we will always classify points deep inside a class correctly too, and so their contribution to the model is zero.

5. From slide 30, w = Σ_{i=1}^n αi yi xi, so

w = (1/4)x1 + (3/8)x2 − (5/8)x3 = [1, 1/2]ᵀ.

6. t can be obtained from any support vector, say x3, since y3(⟨w, x3⟩ − t) = 1; this gives t = 3/2. Note that in general, the support vectors are those points xi for which αi ≠ 0.

7. Finally, the margin is m = 1/∥w∥ = 1/√(1 + 1/4) = 2/√5.

We can now visualise the model we have learned:

[Figure: the learned maximum-margin classifier for the toy dataset]
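The hand-derived solution can be verified numerically end to end (a minimal NumPy sketch; the variable names are our own):

```python
import numpy as np

# toy dataset from Question 7
X = np.array([[1., 3.], [2., 1.], [0., 1.]])
y = np.array([1., 1., -1.])

# augmented Gram matrix: G'[i, j] = y_i y_j <x_i, x_j>
Gp = np.outer(y, y) * (X @ X.T)
print(Gp)                            # = [[10, 5, -3], [5, 5, -1], [-3, -1, 1]]

alpha = np.array([1/4, 3/8, 5/8])    # multipliers derived by hand above

w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i = [1, 1/2]
t = w @ X[2] - 1 / y[2]              # from y_3(<w, x_3> - t) = 1, so t = 3/2
m = 1 / np.linalg.norm(w)            # margin = 1/||w|| = 2/sqrt(5)
print(w, t, m)

# every point sits exactly on its margin line: y_i(<w, x_i> - t) = 1
print(y * (X @ w - t))
```

The last line confirms that all three points are support vectors: each satisfies its margin constraint with equality.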