COMP9417 – Machine Learning Tutorial: Kernel Methods
Weekly Problem Set: Please submit questions 4a and 7 on Moodle by 11:55am Tuesday 29th March, 2022. Please only submit these requested questions and no others.
Question 1 (Dual Perceptron)
Review the development of the Perceptron training algorithm from lectures. Now compare this to the algorithm for Perceptron training in dual form introduced in the “Kernel Methods” lecture. The two algorithms are very similar but differ in a few crucial places. Provide an explanation of how the dual version of the algorithm relates to the original.
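To make the comparison concrete, here is a minimal sketch (assuming a small NumPy data matrix X with one row per example, labels y in {-1, +1}, no bias term, and a fixed number of epochs). The primal version stores an explicit weight vector w; the dual version stores one mistake count alpha_i per training example and only ever touches the data through dot products.

import numpy as np

def primal_perceptron(X, y, epochs=100):
    """Primal perceptron: learn an explicit weight vector w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:   # mistake on example i
                w += y[i] * X[i]              # primal update
    return w

def dual_perceptron(X, y, epochs=100):
    """Dual perceptron: record mistake counts alpha_i instead of updating w."""
    n = len(X)
    alpha = np.zeros(n)
    G = X @ X.T                               # Gram matrix of pairwise dot products
    for _ in range(epochs):
        for i in range(n):
            # the prediction uses only dot products with the training examples
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1                 # dual update: one more mistake on i
    return alpha

The two are equivalent in the sense that the dual solution recovers the primal weights via w = Σ_i α_i y_i x_i (in code: w = (alpha * y) @ X), which is the relationship the question asks you to explain.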
Question 2 (Feature Transformations)
Recall that the XOR function (graphed below) is not linearly separable. Show how we can learn the XOR function using a linear classifier after applying a feature transformation to the original dataset. As a concrete example, show how to extend the Dual Perceptron from the previous question to learn the XOR function.
Figure 1: The XOR function in (x1, x2)-space
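One way to see this concretely: the sketch below (a hedged example, assuming the XOR labels are encoded as -1/+1 and using one of several transformations that work, namely appending the product x1*x2 and a constant bias feature) shows that XOR becomes linearly separable after the transformation.

import numpy as np

# XOR data with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def phi(x):
    """One possible transformation: (x1, x2) -> (x1, x2, x1*x2, 1)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, 1.0])

Z = np.array([phi(x) for x in X])

# In the transformed space a single linear classifier suffices, e.g. this
# weight vector classifies every XOR example correctly:
w = np.array([1.0, 1.0, -2.0, -0.5])
print(np.sign(Z @ w))    # -> [-1.  1.  1. -1.], matching y

To extend the dual perceptron from Question 1, train it on the transformed data (Z, y) rather than on (X, y): since the transformed data is linearly separable, the perceptron convergence theorem guarantees it finds a separating α.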
Question 3 (The Kernel Trick)
Using the context of the previous question, discuss the computational issues that arise from having to compute high-dimensional feature transformations. Show how these can be mitigated by using the kernel trick, and use this to extend dual perceptron learning to kernel perceptron learning.
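A sketch of the resulting kernel perceptron (assuming the same setup as the dual perceptron above, with every dot product ⟨x_i, x_j⟩ replaced by a kernel evaluation k(x_i, x_j), so the high-dimensional feature map is never computed explicitly):

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=100):
    """Dual perceptron with dot products replaced by kernel evaluations."""
    n = len(X)
    alpha = np.zeros(n)
    # Gram matrix of kernel values; no explicit feature transformation is ever formed
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(epochs):
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1
    return alpha

# Example: a quadratic kernel learns XOR with no explicit feature transformation.
quad = lambda u, v: (np.dot(u, v) + 1) ** 2
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, y, quad)
scores = [(alpha * y) @ np.array([quad(xj, x) for xj in X]) for x in X]
print(np.sign(scores))    # -> [-1.  1.  1. -1.]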
Question 4 (Kernels and their Feature Representations)
In this question, we will show how the choice of kernel gives us different feature transformations. Note that in practice, we will simply choose a kernel and not be too concerned with the exact feature transformation, but it is important to know that different kernels correspond to different representations.
(a) Let x, y ∈ R² (i.e. x and y are two-dimensional vectors), and consider the kernel k(x, y) = (2⟨x, y⟩ + 3)³.
Compute the feature vector φ(x) corresponding to this kernel. (In other words, find the feature representation of x and y such that ⟨φ(x), φ(y)⟩ = k(x, y).)
(b) Challenge: Let x, y ∈ R, and consider the Gaussian kernel:
k(x, y) = exp(−(x − y)²/(2σ²)), σ > 0. Hint: Use a Taylor expansion to rewrite the exponential in terms of a summation.
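For part (a), once you have derived a candidate feature map, a quick numerical check that ⟨φ(x), φ(y)⟩ = k(x, y) on random points will catch most algebra slips. The sketch below is only a test harness: phi is a placeholder for whatever map you derive, and check is a hypothetical helper, not something given in the question.

import numpy as np

def k(x, y):
    """The kernel from part (a): (2<x, y> + 3)^3."""
    return (2 * np.dot(x, y) + 3) ** 3

def phi(x):
    """Placeholder: fill in the feature map you derive in part (a)."""
    raise NotImplementedError

def check(phi, k, dim=2, trials=1000, seed=0):
    """Compare <phi(x), phi(y)> against k(x, y) on random pairs of points."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
    print("feature map matches the kernel on", trials, "random pairs")

# check(phi, k)   # run this after filling in phi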
Question 5 (More of the Kernel Trick)
You are told that the “kernel trick” means that a non-linear mapping can be realised from the original data representation to a new, implicit feature space simply by defining a kernel function on dot products of pairs of instances from the original data. To see why this is so, you take two instances x = [1, 2]^T and y = [3, 2]^T, and take their dot product ⟨x, y⟩ and obtain the answer 7. Clearly, raising this dot product to the power of two will give (⟨x, y⟩)² = 49. Now expand out this expression to show that this is the same answer you would have obtained if you had simply done a set of feature transformations on the original data.
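As a numerical check of that expansion (using one standard choice of feature map for the squared dot product, namely (x1, x2) ↦ (x1², √2·x1·x2, x2²); other equivalent maps exist):

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 2.0])

# Kernel view: square the ordinary dot product.
print(np.dot(x, y) ** 2)          # 7^2 = 49.0

# Feature-map view: transform first, then take an ordinary dot product.
phi = lambda v: np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])
print(np.dot(phi(x), phi(y)))     # also 49.0 (up to floating-point rounding)

Expanding (x1·y1 + x2·y2)² by hand produces exactly the three terms that this feature map generates.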
Question 6 (More Feature Transformations)
Consider the dataset depicted below, in which each example belongs either to the ◦ class or to the × class. Suggest a feature transformation from two-dimensional space (R²) to three-dimensional space (R³). Why would we use such a transformation? Generate one example from the ◦ class in the original space, and another example from the × class in the original space, and show their transformed values in the new space.
Figure 2: a dataset with two classes, ◦ and ×, plotted in (x1, x2)-space
Note: kernels have become very popular, e.g. for data such as images, strings, ...
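The exact points are in the figure, so the sketch below is only illustrative: it assumes that one class sits near the origin and the other further out, and uses one candidate transformation, appending the squared distance from the origin as a third coordinate. The two example points are hypothetical, not read off the figure.

import numpy as np

def phi(x):
    """Candidate map R^2 -> R^3: keep x1 and x2, append the squared distance from the origin."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2 + x2 ** 2])

x_circle = np.array([0.5, 0.5])   # hypothetical example from the ◦ class
x_cross  = np.array([3.0, 1.0])   # hypothetical example from the × class

print(phi(x_circle))   # [0.5 0.5 0.5]
print(phi(x_cross))    # [ 3.  1. 10.]

If the classes really do differ in their distance from the origin, then in the new space a plane of the form x3 = c separates them, which is exactly why we would use such a transformation.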
Question 7 (Support Vector Machines)
The Support Vector Machine is essentially an approach to learning linear classifiers, but it uses an alternative objective function to the methods we looked at before, namely maximising the margin. Learning algorithms for this problem typically use quadratic optimization solvers, but it is possible to derive the solution manually for a small number of support vectors.
Here is a toy data set of three examples shown as the matrix X, of which the first two are classified as positive and the third as negative, shown as the vector y. Start by constructing the Gram matrix for this data, incorporating the class labels, i.e., form the matrix X′(X′)^T. Then solve to find the support vectors and their Lagrange multipliers α, then determine the weight vector w, threshold t and the margin m.
    1 3
X = 2 1

    +1
y = +1
    −1

     1 3
X′ = 2 1

To find a maximum margin classifier requires finding a solution for w, t and the margin m. For this we can use the following steps (refer to slides 30–35 from the “Kernel Methods” lecture); a numerical sketch of the same steps on a separate toy example is given after the list.
1. Set up the Gram matrix for the labelled data
2. Set up the expression to be minimised
3. Take partial derivatives
4. Set to zero and solve for each multiplier
5. Solve for w
6. Solve for t
7. Solve for m
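The following is a minimal numerical sketch of those steps on a hypothetical three-example dataset (the X below is illustrative and is not the X from the question). It assumes the decision rule sign(⟨w, x⟩ − t) and, as the steps suggest, that all three examples turn out to be support vectors, so that setting the partial derivatives of the dual objective to zero yields a small linear system for the multipliers α.

import numpy as np

# Hypothetical toy data: first two examples positive, third negative.
X = np.array([[3.0,  1.0],
              [3.0, -1.0],
              [1.0,  0.0]])
y = np.array([1.0, 1.0, -1.0])

# Step 1: Gram matrix of the labelled data X' = diag(y) X.
Xp = y[:, None] * X
G = Xp @ Xp.T

# Steps 2-4: assuming every example is a support vector, stationarity of the
# dual Lagrangian gives the linear system  G a + mu y = 1,  y . a = 0,
# where mu is the multiplier attached to the constraint sum_i a_i y_i = 0.
A = np.block([[G, y[:, None]], [y[None, :], np.zeros((1, 1))]])
b = np.array([1.0, 1.0, 1.0, 0.0])
alpha = np.linalg.solve(A, b)[:3]      # -> [0.25, 0.25, 0.5], all positive as assumed

# Step 5: weight vector  w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X                    # -> [1., 0.]

# Step 6: threshold from any support vector, using y_i (w . x_i - t) = 1
t = X[0] @ w - y[0]                    # -> 2.0

# Step 7: margin
m = 1.0 / np.linalg.norm(w)            # -> 1.0

print(alpha, w, t, m)

For the actual question you would repeat the same steps with the given X, y and the Gram matrix X′(X′)^T.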