CS 189 Introduction to Machine Learning
Spring 2022 HW3
Due 3/15/22 at 11:59pm
• Homework 3 consists of both written and coding questions.
• We prefer that you typeset your answers using LaTeX or other word processing software. If you haven’t yet learned LaTeX, one of the crown jewels of computer science, now is a good time! Neatly handwritten and scanned solutions will also be accepted for the written questions.
• In all of the questions, show your work, not just the final answer.
Deliverables:
1. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope assignment entitled “HW3 Write-Up”. Please start each question on a new page. If there are graphs, include those graphs in the correct sections. Do not put them in an appendix. We need each solution to be self-contained on pages of its own.
• In your write-up, please state with whom you worked on the homework. This should be on its own page and should be the first page that you submit.
• In your write-up, please copy the following statement and sign your signature next to it. (Mac Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF file.) We want to make it extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my own words and that I have not looked at another student’s solutions. I have given credit to all external sources I consulted.”
• Replicate all your code in an appendix. Begin the code for each coding question on a fresh page. Do not put code from multiple questions on the same page. When you upload this PDF on Gradescope, make sure that you assign the relevant pages of your code from the appendix to the correct questions.
1 Administrivia (2 points)
1. Please fill out the Check-In Survey if you haven’t already. Please write down the 10-digit alphanumeric code you get after completing the survey.
2. Declare and sign the following statement:
“I certify that all solutions in this document are entirely my own and that I have not looked
at anyone else’s solution. I have given credit to all external sources I consulted.”
Signature:
While discussions are encouraged, everything in your solution must be your (and only your) creation. Furthermore, all external material (i.e., anything outside lectures and assigned readings, including figures and pictures) should be cited properly. We wish to remind you that the consequences of academic misconduct are particularly severe.
2 Kernels (16 points)
For a function k(xi, xj) to be a valid kernel, it suffices to show either of the following conditions is true:
1. k has an inner product representation: ∃ Φ : Rd → H, where H is some (possibly infinite-dimensional) inner product space such that ∀xi, xj ∈ Rd, k(xi, xj) = ⟨Φ(xi), Φ(xj)⟩.
2. For every sample x1, x2, . . . , xn ∈ Rd, the kernel matrix

       ⎡ k(x1, x1)  · · ·  k(x1, xn) ⎤
   K = ⎢     ⋮      k(xi, xj)    ⋮   ⎥
       ⎣ k(xn, x1)  · · ·  k(xn, xn) ⎦

is positive semidefinite.
Starting with part (c), you can use either condition (1) or (2) in your proofs.
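As an optional aside (not part of the required solution), condition (2) can be probed numerically: build the kernel matrix for a random sample and inspect its eigenvalues. The sketch below uses only numpy; tiny negative eigenvalues on the order of machine precision are floating-point noise, not evidence of indefiniteness.

import numpy as np

def kernel_matrix(X, k):
    # Build the n x n matrix K with K[i, j] = k(X[i], X[j]).
    n = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

# A linear kernel, which is valid by condition (1) with Phi(x) = x.
K = kernel_matrix(X, lambda xi, xj: xi @ xj)
print(np.linalg.eigvalsh(K).min())  # nonnegative up to floating-point error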
(a) (2 points) Show that the first condition implies the second one, i.e. if ∀xi, xj ∈ Rd, k(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ then the kernel matrix K is PSD.
(b) (2 points) Show that if the second condition holds, then for any finite set of vectors, X = {x1, x2, . . . , xn}, in Rd there exists a feature map ΦX that maps the finite set X to Rn such that, for all xi and xj in X, we have k(xi, xj) = ⟨ΦX(xi), ΦX(xj)⟩.
(c) (2 points) Given a positive semidefinite matrix A, show that k(xi, xj) = xi⊤Axj is a valid kernel.
(d) (2 points) Give a counterexample that shows why k(xi, xj) = xi⊤(rev(xj)) (where rev(x) reverses
the order of the components in x) is not a valid kernel.
(e) (4 points) Show that when k: Rd ×Rd → R is a valid kernel, for all vectors x1,x2 ∈ Rd we have
k(x1, x2) ≤ √(k(x1, x1) k(x2, x2)).
Show how the classical Cauchy-Schwarz inequality is a special case.
(f) (4 points) Suppose k1 and k2 are valid kernels with feature maps Φ1 : Rd → Rp and Φ2 : Rd → Rq respectively, for some finite positive integers p and q. Construct a feature map for the product of the two kernels in terms of Φ1 and Φ2, i.e. construct Φ3 such that for all x1, x2 ∈ Rd we have
k(x1, x2) = k1(x1, x2)k2(x1, x2) = ⟨Φ3(x1), Φ3(x2)⟩.
Hint: Recall that the inner product between two matrices A, B ∈ Rp×q is defined to be
⟨A, B⟩ = tr(A⊤B) = Σi Σj AijBij.
3 Kernel Ridge Regression: Theory (10 points)
(a) (2 points) As we already know, the following procedure gives us the solution to ridge regression in feature space:
arg min_w ∥Φw − y∥² + λ∥w∥²    (1)
Recall that the solution to ridge regression is given by
wˆ = (Φ⊤Φ + λId)⁻¹Φ⊤y.
Show that we can rewrite wˆ as
wˆ = Φ⊤(ΦΦ⊤ + λIn)⁻¹y.
You may have previously seen this in lecture.
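Although this part asks for a derivation, the identity is easy to sanity-check numerically on random data before (or after) you prove it. The snippet below is only a check, not a substitute for the algebra; the shapes are arbitrary choices made for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 8, 0.1
Phi = rng.standard_normal((n, d))   # featurized data matrix, one row per point
y = rng.standard_normal(n)

# "Primal" form: solve a d x d system.
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
# "Dual" form: solve an n x n system, then map back through Phi^T.
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

print(np.allclose(w_primal, w_dual))  # True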
(b) (2 points) The prediction for a test point x is given by φ(x)⊤wˆ , where wˆ is the solution to (1). In this part we will show how φ(x)⊤wˆ can be computed using only the kernel k(xi,xj) = φ(xi)⊤φ(xj). Denote the following object:
k(x) := [k(x, x1), k(x, x2), . . . , k(x, xn)]⊤.
Using the result from part (a), show that
φ(x)⊤wˆ = k(x)⊤(K + λI)⁻¹y.
In other words, if we define αˆ := (K + λI)⁻¹y, then
φ(x)⊤wˆ = Σi αˆi k(x, xi)
— our prediction is a linear combination of kernel functions at different data points.
Note: To be clear, φ(x) is not the same as Φ. Φ is the featurized X matrix, while φ(x) is the featurization of a single data point x. Specifically, Φ is an n × d matrix whose i-th row is φ(xi)⊤.
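The prediction formula above can also be sanity-checked numerically with an explicit feature map, confirming that the kernel-only computation k(x)⊤(K + λI)⁻¹y matches φ(x)⊤wˆ. The quadratic feature map below is a toy choice made purely for this check; it is not the featurization used elsewhere in the assignment.

import numpy as np

def phi(x):
    # Toy quadratic feature map for a 2-dimensional input (illustration only).
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

rng = np.random.default_rng(1)
n, lam = 30, 0.05
X = rng.standard_normal((n, 2))
y = rng.standard_normal(n)

Phi = np.vstack([phi(xi) for xi in X])           # n x d featurized matrix
K = Phi @ Phi.T                                  # K[i, j] = phi(x_i)^T phi(x_j)
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # alpha_hat = (K + lam I)^{-1} y
w_hat = Phi.T @ alpha                            # w_hat from part (a)

x_test = rng.standard_normal(2)
k_vec = Phi @ phi(x_test)                        # [k(x, x_1), ..., k(x, x_n)]
print(np.allclose(phi(x_test) @ w_hat, k_vec @ alpha))  # True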
(c) (6 points) We will now consider kernel functions that do not directly correspond to a finite-dimensional featurization of the input points. For simplicity, we will stick to a scalar underlying raw input x. (The same style of argument can help you understand the vector case as well.) Consider the radial basis function (RBF) kernel function
k(x, z) = exp(−∥x − z∥² / (2σ²)),
for some fixed hyperparameter σ. It turns out that this kernel does not correspond to any finite-dimensional featurization φ(x). However, there exists a series φd(x) of d-dimensional features, such that φd(x)⊤φd(z) converges as d → ∞ to k(x, z). Using Taylor expansions, find φd(x).
(Hint: focus your attention on the Taylor expansion of e^(xz/σ²).)
4 Kernel Ridge Regression: Practice (18 points)
In the following problem, you will implement Polynomial Ridge Regression and its kernel variant Kernel Ridge Regression, and compare them with each other. You will be dealing with a 2D regression problem, i.e., xi ∈ R2. We give you three datasets, circle.npz (small dataset), heart.npz (medium dataset), and asymmetric.npz (large dataset). In this problem, the labels are actually discrete yi ∈ {−1, +1}, so in practice you should probably use a different model such as kernel SVMs, kernel logistic regression, or neural networks. The use of ridge regression here is for your practice and ease of coding.
You are only allowed to use numpy.*, numpy.linalg.*, and matplotlib in the following questions. Make sure to include plots and results in your write-up.
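A minimal loading and splitting sketch appears below. The array names "x" and "y" inside the .npz files are an assumption made for illustration; inspect data.files to see the actual keys stored in each file.

import numpy as np

data = np.load("circle.npz")
print(data.files)            # lists the arrays stored in the file
X, y = data["x"], data["y"]  # assumed key names; adjust to what data.files reports

n_train = int(0.8 * X.shape[0])                  # the data is already shuffled
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]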
(a) (2 points) Use matplotlib to visualize all the datasets and attach the plots to your report. Label the points with different y values with different colors and/or shapes.
(b) (6 points) Implement polynomial ridge regression (non-kernelized version) to fit the datasets circle.npz, asymmetric.npz, and heart.npz. The data is already shuffled. Use the first 80% of the data as the training dataset and the last 20% as the validation dataset. Report both the average training squared loss and the average validation squared loss for polynomial order p ∈ {2, 4, 6, 8, 10, 12}. Use the regularization term λ = 0.001 for all p. Visualize your result and attach the heatmap plots for the learned predictions over the entire 2D domain for p ∈ {2, 4, 6, 8, 10, 12} in your report. Code for generating heatmap plots is included for your convenience, but you will need to generate polynomial features yourself by writing the featurize function.
(c) (8 points) Implement kernel ridge regression to fit the datasets circle.npz, heart.npz, and optionally (due to the computational requirements), asymmetric.npz. Use the polynomial kernel K(xi, xj) = (1 + xi⊤xj)ᵖ. Use the first 80% of the data as the training dataset and the last 20% as the validation dataset. Report both the average training squared loss and the average validation squared loss for polynomial order p ∈ {1, . . . , 16}. Use the regularization term λ = 0.001 for all p. For circle.npz, also report the average training squared loss and validation squared loss for polynomial order p ∈ {1, . . . , 24} when you use only the first 15% of the data as the training dataset and the remaining 85% as the validation dataset. Based on the error, comment on when you want to use a high-order polynomial in linear/ridge regression. Lastly, comment on which of polynomial ridge regression and kernel ridge regression runs faster, and why.
(d) (2 points) A popular kernel function that is widely used in various kernelized learning algorithms is called the radial basis function kernel (RBF kernel). It is defined as
K(x, x′) = exp(−∥x − x′∥² / (2σ²)).    (2)
Implement the RBF kernel function for kernel ridge regression to fit the dataset heart.npz. Use the regularization term λ = 0.001. Report the average squared loss, visualize your result and attach the heatmap plots for the fitted functions over the 2D domain for σ ∈
{10,3,1,0.3,0.1,0.03} in your report. You may want to vectorize your kernel functions to speed up your implementation.
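One common way to vectorize the RBF kernel matrix (a sketch, not required code) is to expand ∥xi − zj∥² = ∥xi∥² − 2xi⊤zj + ∥zj∥², so that all pairwise squared distances come from a single matrix product instead of a double loop over points.

import numpy as np

def rbf_kernel(X, Z, sigma):
    # Return the matrix K with K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2)).
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        - 2.0 * X @ Z.T
        + np.sum(Z ** 2, axis=1)[None, :]
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clamp tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))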