
EECS 189 Introduction to Machine Learning Fall 2020
This homework is due Wednesday, September 30 at 11:59 p.m.
2 Tikhonov Regularization and Weighted Least Squares
In this problem, we look at Tikhonov regularization from a probabilistic standpoint.
The main goal is to deepen your understanding of how priors, and thus the corresponding regularization, impact the MAP estimator. First, you will work out how introducing a certain probabilistic prior before maximizing the posterior is equivalent to adding the Tikhonov regularization term: by adding the Tikhonov regularization term, we effectively constrain our optimization space. Similarly, using a probabilistic prior drives our optimization towards solutions that have a high (prior) probability of occurring. In the second half of the problem you will then do some simulations to see how different priors influence the estimator explicitly, as well as how this effect changes as the number of samples grows.
(a) Let x ∈ Rd be a d-dimensional vector and Y ∈ R be a one-dimensional random variable. Assume a linear model: Y = x⊤w + Z where Z ∈ R is a standard Gaussian random variable Z ∼ N(0, 1) and w ∈ Rd is a d-dimensional Gaussian random vector w ∼ N(0, Σ). Σ is a known symmetric positive-definite covariance matrix. Note that w is independent of the observation noise. What is the conditional distribution of Y given x and w?
Solution: A Gaussian random variable plus a constant is a Gaussian random variable with the mean having that constant added to it and the same variance, so Y|x,w ∼ N(x⊤w,1). Concretely, we have
f(Y | x, w) = (1/√(2π)) · exp{−(1/2)(Y − x⊤w)²},

the pdf of a Gaussian distribution.
(b) (Tikhonov regularization) Let us assume that we are given n training data points
{(x1, Y1), (x2, Y2), · · · , (xn, Yn)} which we know are generated i.i.d. according to the model of (x,Y) in the previous part, i.e. we draw one w and use this to generate all Yi given distinct but arbitrary {xi}ni=1 , but the observation noises Zi vary across the different training points in an i.i.d. fashion. Derive the posterior distribution of w given the training data. Based on your result, what is the MAP estimate of w? Comment on how Tikhonov regularization is a generalization of ridge regression from a probabilistic perspective.
Note: w and Y = (Y1, Y2, . . . , Yn) are jointly Gaussian in this problem given {xi}ni=1.

[Hint: (You may or may not find this useful.) If the probability density function of a random variable is of the form

f(v) = C · exp{−(1/2)v⊤Av + b⊤v},

where C is some constant that makes f(v) integrate to 1, then the mean of v is A⁻¹b. This can be used to help complete squares if you choose to go that way.]
Solution: By Bayes' Theorem, we have

f(w | x1, ..., xn, Y1, ..., Yn) ∝ [∏_{i=1}^n f(Yi | xi, w)] f(w).

The above follows by omitting the constant terms in the denominator, which do not depend on w. Writing out the pdfs directly,

[∏_{i=1}^n f(Yi | xi, w)] f(w) ∝ [∏_{i=1}^n exp{−(1/2)(Yi − xi⊤w)²}] exp{−(1/2) w⊤Σ⁻¹w}
 = exp{−(1/2)[∑_{i=1}^n (Yi − xi⊤w)² + w⊤Σ⁻¹w]}
 ∝ exp{−(1/2)[w⊤(X⊤X + Σ⁻¹)w − 2(X⊤Y)⊤w]},

where the last step follows from expanding out the squared terms. Here X ∈ Rn×d with ith row xi⊤, and Y ∈ Rn with ith entry Yi. Therefore the posterior of w | x1, ..., xn, Y1, ..., Yn is N((X⊤X + Σ⁻¹)⁻¹X⊤Y, (X⊤X + Σ⁻¹)⁻¹), derived from the following lemma.
Lemma: If the probability density function of a random variable is of the form

f(v) = C · exp{−(1/2)v⊤Av + b⊤v},

where C is some constant that makes f(v) integrate to 1, and A is a symmetric positive definite matrix, then v is distributed as N(A⁻¹b, A⁻¹). The proof of the lemma is as follows: completing the square in the exponent,

f(v) = C · exp{−(1/2)v⊤Av + b⊤v}
     = C · exp{−(1/2)(v − A⁻¹b)⊤A(v − A⁻¹b) + (1/2)b⊤A⁻¹b}
     = C′ · exp{−(1/2)(v − A⁻¹b)⊤A(v − A⁻¹b)},   with C′ = C · exp{(1/2)b⊤A⁻¹b},

which is the density of a Gaussian random variable with mean A⁻¹b and covariance matrix A⁻¹.

Now that we have the posterior distribution of w, it is simple to find the MAP estimate. The MAP estimate is simply a maximization over the posterior probability. Since we know the posterior of w is a Gaussian, we have that the MAP estimate is the mean of that Gaussian. So we have that MAP estimate of w is (X⊤X + Σ−1)−1X⊤Y.
In ridge regression, we also assume a Gaussian prior on w. However, we constrain the covariance of the prior to be a simple identity scaled by some constant. Tikhonov regularization generalizes this and allows us to select any covariance matrix for our prior on w. Note in particular that this means our prior can involve non-zero covariance terms, or scale variances in different directions differently (e.g., the first entry of w may have a smaller variance than the second entry); Tikhonov regularization allows us to encode this knowledge in the problem setup. As you will see in class, being able to give priority to certain directions in feature space can be important if we want to be able to take advantage of the regularizing effect of lots of features.
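To make the formula concrete, here is a minimal numpy sketch (not part of the official solution) that simulates the model of part (a) and computes the MAP/Tikhonov estimate (X⊤X + Σ⁻¹)⁻¹X⊤Y; with Σ⁻¹ = λI it reduces to the familiar ridge estimate. All names here are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])   # assumed prior covariance on w
w_true = rng.multivariate_normal(np.zeros(d), Sigma)

X = rng.normal(size=(n, d))
Y = X @ w_true + rng.normal(size=n)          # unit-variance observation noise

# MAP / Tikhonov estimate: (X^T X + Sigma^{-1})^{-1} X^T Y
w_map = np.linalg.solve(X.T @ X + np.linalg.inv(Sigma), X.T @ Y)

# Ridge regression is the special case Sigma^{-1} = lam * I
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(w_true, w_map, w_ridge)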
(c) We have so far assumed that the observation noise has a standard normal distribution. While this assumption is nice to work with, we would like to be able to handle a more general noise model. In particular, we would like to extend our result from the previous part to the case where the observation noise variables Zi are no longer independent across samples, i.e. Z is no longer N(0, In) but instead distributed as N(μz, Σz) for some mean μz and some covariance Σz (still independent of the parameter w). Derive the posterior distribution of w by appropriately changing coordinates. We make the reasonable assumption that Σz is invertible, since otherwise there would be some dimension in which there is no noise.
(Hint: Write Z as a function of a standard normal Gaussian vector V ∼ N(0, In) and use the result in (b) for an equivalent model of the form Y􏰖 = X􏰖w + V.)
Solution: By changing variables, our goal is to reduce to the previous case with white observation noise.

We want to define a matrix Σz^{1/2} such that Σz^{1/2}(Σz^{1/2})⊤ = Σz. Take the eigenvalue decomposition Σz = U∆U⊤: the spectral theorem guarantees orthonormal eigenvectors in U, and since Σz is positive definite the diagonal matrix ∆ has strictly positive entries, so we can choose Σz^{1/2} = U∆^{1/2}. Now define V such that Z = μz + Σz^{1/2}V. Then V ∼ N(0, In) and V = Σz^{-1/2}(Z − μz). Starting from the model,

Y = Xw + Z
Y − μz = Xw + Z − μz
Σz^{-1/2}(Y − μz) = Σz^{-1/2}Xw + Σz^{-1/2}(Z − μz)
Σz^{-1/2}(Y − μz) = Σz^{-1/2}Xw + V.

So the problem reduces to the previous part by denoting Σz^{-1/2}X as X̃ and Σz^{-1/2}(Y − μz) as Ỹ, which yields the posterior of w | x1, ..., xn, Y1, ..., Yn as the multivariate Gaussian

N((X̃⊤X̃ + Σ⁻¹)⁻¹X̃⊤Ỹ, (X̃⊤X̃ + Σ⁻¹)⁻¹),

or equivalently,

N((X⊤Σz⁻¹X + Σ⁻¹)⁻¹X⊤Σz⁻¹(Y − μz), (X⊤Σz⁻¹X + Σ⁻¹)⁻¹).

Notice how the Σz⁻¹ matrix now sits in between the X⊤ and X in the X⊤Σz⁻¹X term.
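As a numerical sanity check (not part of the official solution), the following sketch whitens correlated noise with a Cholesky factor L (any factor with LL⊤ = Σz whitens, not just the symmetric square root used above) and verifies that the whitened estimate matches the direct formula. All variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 2
Sigma = np.eye(d)                            # prior covariance on w
A = rng.normal(size=(n, n))
Sigma_z = A @ A.T / n + 0.1 * np.eye(n)      # a generic SPD noise covariance
mu_z = rng.normal(size=n)

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = X @ w_true + rng.multivariate_normal(mu_z, Sigma_z)

# Whitening: L L^T = Sigma_z, so L^{-1}(Z - mu_z) is N(0, I)
L = np.linalg.cholesky(Sigma_z)
X_t = np.linalg.solve(L, X)                  # plays the role of Sigma_z^{-1/2} X
Y_t = np.linalg.solve(L, Y - mu_z)           # plays the role of Sigma_z^{-1/2} (Y - mu_z)
w_whitened = np.linalg.solve(X_t.T @ X_t + np.linalg.inv(Sigma), X_t.T @ Y_t)

# Direct formula from the posterior above
Sz_inv = np.linalg.inv(Sigma_z)
w_direct = np.linalg.solve(X.T @ Sz_inv @ X + np.linalg.inv(Sigma),
                           X.T @ Sz_inv @ (Y - mu_z))
print(np.allclose(w_whitened, w_direct))     # True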
(d) (Compare the effect of different priors) In this part, you will generate plots that show how different priors on w affect our prediction of the true w which generated the data points. Pay attention to how the amount of data used and the choice of prior relative to the true w we use are related to the final prediction.
Do the following for Σ = Σ1, Σ2, Σ3, Σ4, Σ5, Σ6 respectively, where

Σ1 = [[1, 0], [0, 1]];   Σ2 = [[1, 0.25], [0.25, 1]];   Σ3 = [[1, 0.9], [0.9, 1]];
Σ4 = [[1, −0.25], [−0.25, 1]];   Σ5 = [[1, −0.9], [−0.9, 1]];   Σ6 = [[0.1, 0], [0, 0.1]].

Under the priors above, the coordinates of the (random) vector w are: (1) independent with large variance, (2) mildly positively correlated, (3) strongly positively correlated, (4) mildly negatively correlated, (5) strongly negatively correlated, and (6) independent with small variances, respectively.
With the starter code in the corresponding part of the iPython notebook, generate n ∈ {5, 6, . . . , 500} data points Y = x1 + x2 + Z with x1, x2 ∼ N(0, 5) and Z ∼ N(0, 1) as training data (all of them independent of each other). Here, the true w is thus [1 1]⊤. Note that the randomness of xi here is only for the generation of the plot, but in our probabilistic model for parameter estimation we consider them as fixed and given. The starter code helps you generate an interactive plot where you can adjust the covariance prior and the number of samples used to calculate the posterior. Include 6 plots of the contours of the posteriors on w for various settings of Σ and number of data points. Write the covariance prior and number of samples for each plot. What do you observe as the number of data points increases?
Solution: We can observe that a good prior matters, but enough data will wash away the effect of the prior. A more detailed analysis is given in the next subproblem.
X = np.random.randn(n, 2) * np.sqrt(5)
Z = np.random.randn(n)
y = np.sum(X, axis=1) + Z
w = np.linalg.inv(X.T.dot(X) + np.linalg.inv(Sigma)).dot(X.T.dot(y))

mean = tikhonov_regression(X, y, Sigma)  # note that "mean" is actually the weight vector for the
                                         # Tikhonov regression; we call it mean for convenience when plotting
var = np.linalg.inv(X.T.dot(X) + np.linalg.inv(Sigma))
mux, muy = mean
sigmax = np.sqrt(var[0, 0])
sigmay = np.sqrt(var[-1, -1])
sigmaxy = var[0, -1]

mux, muy, sigmax, sigmay, sigmaxy = compute_mean_var(X, y, Sigma)
Z = bivariate_normal(X_grid, Y_grid, sigmax, sigmay, mux, muy, sigmaxy)

Figure: Posterior contours for (a) Σ3, n = 5; (b) Σ3, n = 50; (c) Σ3, n = 500; (d) Σ6, n = 5; (e) Σ6, n = 50; (f) Σ6, n = 500.
(e) (Influence of Priors) For our simulations, we will generate n training data samples from Y = x1 + x2 + Z where again x1, x2 ∼ N(0, 5) and Z ∼ N(0, 1) (all of them independent of each other) as before. Notice that the true parameters w1 = 1, w2 = 1 are moderately large and positively correlated with each other. We want to quantitatively understand how the effect of the prior influences the mean square error as we get more training data. This should corroborate the qualitative results you saw in the previous part.

In this case, we could directly compute the "test error" for a given estimator ŵ of the parameter w (our prediction for Y given a new data point x = (x1, x2)⊤ is then Ŷ = ŵ1 x1 + ŵ2 x2). Specifically, considering ŵ now fixed, the expected error for a randomly drawn Y given the true (but unknown) parameter vector w = (1, 1)⊤ is equal to E_{Z,x}(Y − Ŷ)² = 5(ŵ1 − 1)² + 5(ŵ2 − 1)² + 1. We call this the theoretical average test error. For simplicity, we refer to it as the theoretical MSE. Note that here, by our choice of definition, the expectation for new test samples is over x as well, although our estimator is not explicitly taking the randomness of x in the training data into account.

In practice, the expectation with respect to the true conditional distribution of Y given w cannot be computed since the true w is unknown. Instead, we are only given a finite number of samples from the model (which we call the test set; it is independent of the training data but identically distributed in this case, even though in other cases it could be drawn differently

as well), so that it is only possible to compute

(1/n) ∑_{i=1}^n (Yi − Ŷi)²,

which we call the empirical average test error. For simplicity, we refer to it as the empirical MSE. Again, note that here Ŷi = xi⊤ŵ, where xi ∈ R² and ŵ in your model is the solution to the least squares problem with Tikhonov regularization given the training data.
With the starter code in the corresponding part of the iPython notebook, generate a test set of 500 data points (xi, Yi) from the above model. Plot the empirical and theoretical mean square error between Ŷi and Yi over the test data with respect to the size of the training data n (increase n from 5 to 200 in increments of 5).
Note: If we just plotted both the empirical and theoretical average test errors with respect to the amount of training data for one "round" of training data, the results would still look jagged. In order to give a quantitative statement about the test error with respect to the amount of training data n with a "smoother" plot, what we really want to know is the expectation of the theoretical average test error with respect to w and the training samples (xi, Yi), i.e. E_{(x1,Y1),...,(xn,Yn)} E_{Z,x}(Y − Ŷ)² (note that in this term, only Ŷ depends on the training data (x1, Y1), . . . , (xn, Yn), whereas (x, Y) is an independent fresh test sample). Consequently, as an approximation, it is worth replicating the entire experiment a few times (say 100 times) to get an empirical estimate of this quantity. (It is also insightful to look at the spread.) Compare what happens for different priors as the amount of training data increases. Try plotting the theoretical MSE with logarithmic x and y axes and explain the plot. What constitutes a "good" prior and which of the given priors are "good" choices for our particular w = (1, 1)⊤? Describe how the influence of different priors changes with the number of data points.
Solution: We first plot the empirical test MSE (MSE for short) and the theoretical MSE on the usual scale. We observe that both of them decrease as n increases, but the lines overlap and are indistinguishable. Then we use a log scale for both the x and y axes. Interestingly, we observe a linear relationship under the log scale, which follows from the fact that the MSE decreases as the reciprocal of n, the number of training samples. We can also distinguish the different priors better.

We can see in the log-scaled plot that Σ3 has the best error for a small amount of training data, so we could say that Σ3 is the best prior for w = (1, 1)⊤; however, as we add more training data the influence of the prior washes away and all priors converge to roughly the same error.
We can summarize the quality of the priors as follows:
• Σ3 is the best prior in terms of data efficiency because it asserts that the variables w1 and
w2 are strongly correlated. This is indeed the case for the ground truth.
• Σ1, Σ2 and Σ4 are doing pretty well in terms of data efficiency. This is because while they assert that the w1 and w2 are uncorrelated or mildly positively or negatively correlated, they leave a large amount of uncertainty on the values w1 and w2 by having large diagonal entries, so they can easily adapt to the data.

• Σ5 and Σ6 need the most data to estimate the parameters well. This is because they either assert the wrong model (w1 and w2 are anti-correlated) or are very certain about the wrong model (independent w1 and w2). Therefore we need more data to “overrule” the prior.
Three observations:
• Enough data will wash away the effect of prior.
• A good prior helps when there are not enough data.
• How to plot is IMPORTANT for scientific observation.
mse = np.mean((np.squeeze(X.dot(w)) - Y)**2)
theoretical_mse = sum((w-1) ** 2) * 5 + 1
X,y = generate_data(n_train) # Generate data.
w = tikhonov_regression(X,y,Sigma) # Estimate w.
mses[i,j,seed] = compute_mse(X_test, y_test, w) # Compute MSE.
theoretical_mses[i,j,seed] = compute_theoretical_mse(w)
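Here is a minimal self-contained sketch (not the official starter code; the helpers below are simplified stand-ins) of the replication loop suggested in the Note above, averaging the theoretical MSE over many regenerated training sets:

import numpy as np

def tikhonov_regression(X, y, Sigma):
    # assumed behavior of the starter helper: (X^T X + Sigma^{-1})^{-1} X^T y
    return np.linalg.solve(X.T @ X + np.linalg.inv(Sigma), X.T @ y)

def generate_data(n, rng):
    X = rng.normal(scale=np.sqrt(5), size=(n, 2))
    y = X.sum(axis=1) + rng.normal(size=n)   # true w = (1, 1)
    return X, y

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])   # e.g. the prior Sigma_3
n_runs = 100

avg_theoretical_mse = []
for n_train in range(5, 205, 5):
    runs = []
    for _ in range(n_runs):                  # replicate the experiment to smooth the curve
        X, y = generate_data(n_train, rng)
        w = tikhonov_regression(X, y, Sigma)
        runs.append(5 * (w[0] - 1) ** 2 + 5 * (w[1] - 1) ** 2 + 1)  # theoretical MSE
    avg_theoretical_mse.append(np.mean(runs))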
(g) Average (over runs) MSE on test data. We can see this dropping, but it is hard to understand what exactly is going on.
(h) Average (over runs) theoretical MSE. This reveals that there are still some wiggles, so the wiggles are also coming from variance in our estimator (given that we are only taking the average over a finite number of runs), not random peculiarities of our test data.
(i) Log-log plot of average (over runs) theoretical MSE. By looking at a log-log plot, where both axes are on log scale, we can see the pattern more clearly: the quality of the prior impacts where we start with small amounts of training data, but the rate of convergence is given by a power law.

3 Kernel Ridge Regression: Theory
In traditional ridge regression, we are given a vector y ∈ Rn and a matrix X ∈ Rn×l, where n is the number of training points and l is the dimension of the raw data points. In most practical settings we don't want to work with just the raw feature space, so we lift/distill the data points using features and replace X with Φ ∈ Rn×d, whose ith row is φ(xi)⊤ with φ(xi) ∈ Rd. Then we solve a well-defined optimization problem that involves the matrix Φ and y to find the parameters w ∈ Rd. Note the problem that arises here. If we have polynomial features of degree at most p in the raw l-dimensional space, then there are d = (l+p choose p) terms that we need to optimize, which can be very, very large (often larger than the number of training points n). Wouldn't it be useful if instead of solving an optimization problem over d variables, we could solve an equivalent problem over n variables (where n is potentially much smaller than d), and achieve a computational runtime independent of the number of augmented features? As it turns out, the concept of kernels (in addition to a technique called the kernel trick) will allow us to achieve this goal.
When people talk about "the kernel trick," they usually mean the following: imagine that we have some machine learning algorithm which only uses data vectors to compute scalar products with other data vectors. Then we can substitute the operation of taking inner products ⟨xi, xj⟩ with some other function K(xi, xj) and obtain a new ML algorithm! There is, however, another interesting consideration: if our algorithm only operates with inner products, and we do featurization, maybe we could use features so that those scalar products ⟨φ(xi), φ(xj)⟩ =: K(xi, xj) were easy to compute (e.g. without explicitly writing out long vectors φ(xi) and φ(xj)). As it turns out, those ideas are closely connected: kernels and features provide two different views of the same thing, and it may sometimes be more convenient to think about one and sometimes about the other.

In this problem we will derive this connection between features and kernels, observe which features correspond to some particular kernels and vice versa, and see how the kernel perspective can help with computation.
(a) As we already know, the following procedure gives us the solution to ridge regression in feature space:

arg min_w ∥Φw − y∥₂² + λ∥w∥₂².    (1)

To obtain the prediction for a test point x one needs to compute φ(x)ŵ, where ŵ is the solution to (1). In this part we will show how φ(x)ŵ can be computed using only the kernel K(xi, xj) = φ(xi)φ(xj)⊤. Denote the following two objects:

k(x) := [K(x, x1), K(x, x2), . . . , K(x, xn)]⊤,   K = {K(xi, xj)}_{i,j=1}^n,

where the matrix K is filled with the pairwise kernel similarity computations among the training points.

Show that

φ(x)ŵ = k(x)⊤(K + λI)⁻¹y.

In other words, if we define α̂ := (K + λI)⁻¹y, then

φ(x)ŵ = ∑_{i=1}^n α̂[i] K(x, xi),

that is, our prediction is a linear combination of kernel functions at different data points.
HINT: use problem 6b of HW1 to write wˆ using the Moore–Penrose inverse (pseudoinverse), and then use the explicit expression for pseudoinverse for matrices with linearly independent rows. After that you will just need to understand how K and k(x) can be expressed through Φ and φ(x).
Solution: In problem 6b of HW1 we derived the "ghost-feature augmentation" view of ridge regression. That is, solving (1) is equivalent to solving the following problem:

min_{w ∈ Rd, ξ ∈ Rn}  ∥w∥₂² + ∥ξ∥₂²   s.t.  [Φ, √λ I] [w; ξ] = y,

writing [w; ξ] for the stacked vector. We know that the solution to this problem is given by the following formula:

[w; ξ] = [Φ, √λ I]† y,

where A† denotes the pseudo-inverse of the matrix A. Now let's plug in an explicit expression for the pseudo-inverse: if the rows of A are linearly independent, then A† = A⊤(AA⊤)⁻¹. Therefore, we get

[w; ξ] = [Φ⊤; √λ I] (ΦΦ⊤ + λI)⁻¹ y.

Finally, we can use this explicit formula to calculate the expression for the prediction on a new data point:

φ(x)w = φ(x)Φ⊤(ΦΦ⊤ + λI)⁻¹y = k(x)⊤(K + λI)⁻¹y,

where we plugged in k(x)⊤ = φ(x)Φ⊤ and K = ΦΦ⊤.
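A quick numerical check (not part of the official solution) that the dual/kernelized prediction k(x)⊤(K + λI)⁻¹y agrees with the primal ridge prediction φ(x)ŵ for an arbitrary explicit feature matrix:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 30, 8, 0.1
Phi = rng.normal(size=(n, d))                 # rows are phi(x_i)
y = rng.normal(size=n)
phi_test = rng.normal(size=d)                 # phi(x) for a new point

# Primal ridge: w_hat = (Phi^T Phi + lam I)^{-1} Phi^T y
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Dual / kernel form: k(x)^T (K + lam I)^{-1} y with K = Phi Phi^T
K = Phi @ Phi.T
k_test = Phi @ phi_test
alpha = np.linalg.solve(K + lam * np.eye(n), y)
print(np.isclose(phi_test @ w_hat, k_test @ alpha))   # True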
(b) (Polynomial Regression from a kernelized view) In this part, we will show that polynomial regression with a particular Tikhonov regularization is the same as kernel ridge regression with a polynomial kernel for second-order polynomials. Recall that a degree-2 polynomial kernel function on Rl × Rl is defined as

K(xi, xj) = (1 + xi⊤xj)²,    (2)

for any xi, xj ∈ Rl. Given a dataset (xi, yi) for i = 1, 2, . . . , n, show the solution to kernel ridge regression is the same as a regularized least squares solution to polynomial regression (with standard unweighted monomials as features) for p = 2, given the right choice of regularization for the polynomial regression. That is, show for any new point x given in

the prediction stage, both methods give the same prediction yˆ with the same training data. (Hint: you should consider a regularization term of the form ∥Mw∥2 for some matrix M. Such regularization is called Tikhonov regularization.)
Feel free to just work with the l = 2 case since it shows the basic behavior. Solution: For clarity, define a vector-valued function from R2 → R6 such that
φ(a) = (1, a1², a2², √2·a1, √2·a2, √2·a1a2)⊤    (3)

for a = (a1, a2)⊤.

(In general, we would have one constant 1, exactly l pure square terms, exactly l pure linear terms that have been multiplied by √2, and exactly l(l−1)/2 cross terms that are each multiplied by √2. So the whole general story looks very much like the l = 2 case given here.)
Define the matrix Φ ∈ Rn×6 whose ith row is φ(xi)⊤:

Φ = [φ(x1)⊤; φ(x2)⊤; . . . ; φ(xn)⊤].

We observe that K(xi, xj) = φ(xi)⊤φ(xj). For the kernel matrix K ∈ Rn×n with Kij = K(xi, xj), we have

Kij = φ(xi)⊤φ(xj) = (ΦΦ⊤)ij,  i.e.  K = ΦΦ⊤.    (4)

Recall that the solution to kernel ridge regression is a function f with

f(x) = ∑_{i=1}^n αi K(xi, x) = ∑_{i=1}^n αi φ(xi)⊤φ(x) = α⊤Φφ(x),

where

α = (K + λnIn)⁻¹y.    (5)

Therefore, we can write f(x) as

f(x) = y⊤(ΦΦ⊤ + λnIn)⁻¹Φφ(x).    (6)

For polynomial regression, define a vector-valued function from R2 → R6 such that

φ̃(a) = (1, a1², a2², a1, a2, a1a2)⊤
for a = (a1, a2)⊤.

Define the matrix Φ̃ ∈ Rn×6 whose ith row is φ̃(xi)⊤:

Φ̃ = [φ̃(x1)⊤; φ̃(x2)⊤; . . . ; φ̃(xn)⊤].    (7)

Observe the relationship between φ and φ̃: we have

φ̃(x) = Dφ(x),   Φ̃ = ΦD,    (8)

for the diagonal matrix D ∈ R6×6 with

D = diag(1, 1, 1, 1/√2, 1/√2, 1/√2).
A polynomial regression is nothing but replacing the linear features X by φ̃(x) ∈ R6 and adding Tikhonov regularization over the parameters w ∈ R6. Recall that in a previous homework we showed it has the closed-form solution

w = (Φ̃⊤Φ̃ + Λ)⁻¹Φ̃⊤y,    (9)

for a polynomial regression with Tikhonov regularization matrix Λ ∈ Rd×d. Let Λ be the diagonal matrix defined by

Λ = diag(λn, λn, λn, λn/2, λn/2, λn/2) = D(λnI6)D.    (10)

The predictor produced by Tikhonov regression is

g(x) = w⊤φ̃(x)
     = [(Φ̃⊤Φ̃ + Λ)⁻¹Φ̃⊤y]⊤φ̃(x)
     = y⊤Φ̃(Φ̃⊤Φ̃ + Λ)⁻¹φ̃(x)
     = y⊤ΦD(DΦ⊤ΦD + D(D⁻¹ΛD⁻¹)D)⁻¹Dφ(x)
     = y⊤ΦDD⁻¹(Φ⊤Φ + D⁻¹ΛD⁻¹)⁻¹D⁻¹Dφ(x)
     = y⊤Φ(Φ⊤Φ + D⁻¹ΛD⁻¹)⁻¹φ(x)
     = y⊤Φ(Φ⊤Φ + λnI6)⁻¹φ(x)
     = y⊤(ΦΦ⊤ + λnIn)⁻¹Φφ(x) = f(x),

where the last equality follows from the identity Φ(Φ⊤Φ + λnI6)⁻¹ = (ΦΦ⊤ + λnIn)⁻¹Φ (the hint from the earlier homework). Hence, we have shown the equivalence between the two predictors.
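A small numerical check of this equivalence (not part of the official solution), using the plain monomial features φ̃ and Λ = D(λnI)D exactly as above:

import numpy as np

def phi_tilde(a):  # plain monomial features (1, a1^2, a2^2, a1, a2, a1*a2)
    a1, a2 = a
    return np.array([1.0, a1**2, a2**2, a1, a2, a1 * a2])

rng = np.random.default_rng(0)
n, lam = 25, 0.05
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
x_new = rng.normal(size=2)

# Kernel ridge regression with K(x, z) = (1 + x^T z)^2
K = (1.0 + X @ X.T) ** 2
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
f_kernel = alpha @ ((1.0 + X @ x_new) ** 2)

# Tikhonov-regularized polynomial regression with Lambda = D (lam * n * I) D
D = np.diag([1, 1, 1, 1 / np.sqrt(2), 1 / np.sqrt(2), 1 / np.sqrt(2)])
Phi_t = np.array([phi_tilde(x) for x in X])
Lam = D @ (lam * n * np.eye(6)) @ D
w = np.linalg.solve(Phi_t.T @ Phi_t + Lam, Phi_t.T @ y)
f_poly = w @ phi_tilde(x_new)

print(np.isclose(f_kernel, f_poly))   # True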
(c) In general, we can show the equivalence between kernel ridge regression with a polynomial kernel (1 + x⊤z)^p of order p and polynomial regression with a pth-order polynomial on Rl with an appropriately specified Tikhonov regularization. Comment on the computational complexity of doing polynomial regression with this Tikhonov regularization directly and that of doing kernel ridge regression at the training stage. (That is, the complexity of finding α and finding w.)


Compare with the computational complexity of actually doing prediction as well.
Solution: In polynomial regression with Tikhonov regularization, for any data point (xi, yi), computing its polynomial features of order p takes O(d_p), where d_p is the number of such features. The complexity of then solving the regularized least squares problem is O(d_p³ + d_p n). The total complexity is O(d_p³ + d_p n).

In kernel ridge regression, the complexity of computing the kernel matrix is O(n²l log p), where l is the raw dimension. (Why only a log for taking the power? Because we can use the repeated squaring trick to speed things up.)

The complexity of getting α after that is O(n³). The total complexity is O(n³ + n²l log p). It only has a log dependence on p but a cubic dependence on n. Kernel ridge regression is preferred when p is large.
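The "repeated squaring trick" mentioned above, sketched as a generic fast elementwise exponentiation (illustrative only; numpy's ** operator already handles this internally):

import numpy as np

def elementwise_power(B, p):
    # Computes B**p elementwise using O(log p) elementwise multiplications.
    result = np.ones_like(B)
    base = B.copy()
    while p > 0:
        if p & 1:
            result = result * base
        base = base * base
        p >>= 1
    return result

X = np.random.randn(5, 3)
B = 1.0 + X @ X.T              # base matrix for the polynomial kernel
print(np.allclose(elementwise_power(B, 7), B ** 7))   # True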
(d) We will now consider kernel functions that do not directly correspond to a finite-dimensional featurization of the input points. For simplicity, we will stick to a scalar underlying raw input x. (The same style of argument can help you understand the vector case as well.) Consider the kernel function
k(x, z) = exp{−|x − z|² / (2σ²)},
for some fixed hyperparameter σ. Assume, for a moment, that it corresponds to some featur- ized Euclidean vectors φ(x) and φ(z) of the inputs x and z.
Imagine what these vectors would look like. In particular, recall that k(x,z) = φ(x)Tφ(z). When would φ(x) and φ(z) be collinear? When would they be (almost) orthogonal?
(Hint: think about when k(x, z) is maximized, and when it goes to zero)
This is a qualitative question. If you are having trouble with it, you can just jump ahead to the
next part and come back to it.
Solution: It is clear that k(x,z) is maximized when x = z, and approaches zero as x and z diverge. This tells us that our feature vectors can be thought of as “collinear” when the input points are the same, and begin to diverge in direction as the input points get further apart.
(e) We will now try to find a concrete and explicit featurization φ(x) corresponding to our kernel function. It turns out that no such finite-dimensional featurization exists. However, there exists a series φd(x) of d-dimensional features, such that φd(x)T φd(z) converges as d → ∞ to k(x, z). By considering Taylor expansions, find φd(x).
(Hint: focus your attention on the Taylor expansion of e^{xz/σ²}, since the cross term is what determines the interaction. You can leverage the property of exponentials to pull the pure terms out so that the kernel function is written as a product involving a pure function of x, a pure function of z, and a function that has x and z interacting. This will let you pull the pure functions into φd(x) and φd(z) respectively.)
Solution: We can rewrite k(x, z) as
k(x, z) = e^{−x²/(2σ²)} e^{−z²/(2σ²)} e^{xz/σ²}.


Now observe that, by the Taylor expansion,

e^{xz/σ²} = 1 + xz/σ² + (xz)²/(σ⁴·2!) + (xz)³/(σ⁶·3!) + · · ·

We can rewrite this as the inner product of

[1, x/σ, x²/(σ²√2!), x³/(σ³√3!), · · ·]⊤   and   [1, z/σ, z²/(σ²√2!), z³/(σ³√3!), · · ·]⊤.

Truncating to just d terms and substituting back into our expression for k(x, z), we see that

k(x, z) ≈ φd(x)⊤φd(z),

where

φd(x) = e^{−x²/(2σ²)} [1, x/σ, x²/(σ²√2!), · · ·, x^{d−1}/(σ^{d−1}√(d−1)!)]⊤,

with equality achieved in the limit as d → ∞.

(f) Use the Jupyter notebook to implement kernel ridge regression using the φd features for fixed values of d, as well as the kernel function directly to see the behavior in the infinite-dimensional case. Consider both their effectiveness at fitting functions, as well as their one-hot "impulse" responses. Comment on your observations.

Solution: We see that for small σ, the impulse response appears more "spikey", essentially remaining at zero except at the single nonzero test point, where it jumps up to close to 1. As we increase σ, the impulse response flattens out, becoming wider but less tall. As we increase d, our fit of the sine wave improves, as would be expected. Notice that it fits the middle of the true function better than the edges. One way of justifying this would be to look at the impulse response. For a test point near the middle, the predicted function is the sum of many (non-negligibly large) impulse responses from the training points on both sides of the test point. But for test points near either side, there are fewer impulse responses that make an observable contribution to the predicted function, since there are no training points further to one side of the test point, so the predicted function decays back to zero.
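A minimal numerical sketch (not part of the official solution) of parts (e)/(f): the truncated features φd reproduce the RBF kernel value as d grows.

import numpy as np
from math import factorial

def phi_d(x, d, sigma):
    # phi_d(x) = exp(-x^2/(2 sigma^2)) * [x^k / (sigma^k sqrt(k!))] for k = 0, ..., d-1
    ks = np.arange(d)
    scale = sigma**ks * np.sqrt([factorial(int(k)) for k in ks])
    return np.exp(-x**2 / (2 * sigma**2)) * x**ks / scale

def rbf(x, z, sigma):
    return np.exp(-(x - z)**2 / (2 * sigma**2))

x, z, sigma = 0.7, -0.3, 1.0
for d in [2, 4, 8, 16]:
    approx = phi_d(x, d, sigma) @ phi_d(z, d, sigma)
    print(d, approx, rbf(x, z, sigma))   # the approximation converges to the exact kernel value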
4 Learning 1d functions using kernel regression
This question is designed to help you to be able to visualize what kernel regression is like so that you can understand the different phenomena that can occur.
(a) Do the tasks from part a of the associated jupyter notebook and report the results. Solution: Increasing gamma makes the plot narrower, but its height remains the same.

(b) Do the tasks from part b of the associated jupyter notebook and report the results. Solution: As we saw in the previous part, increasing gamma leaves diagonal elements the
same, but makes off-diagonal elements smaller. Because of that, when gamma increases the singular values of K become close to each other. This aligns well with the experiments. From the experiments we can also see that n has almost no influence on the shape of the plot of the singular values of K.
(c) Do the tasks from part c of the associated jupyter notebook and report the results. Solution: To compute αˆ one can do the following:
alpha = np.linalg.inv(K) @ y
This isn’t the most efficient implementation or the most stable one, but it is fine.
(d) Do the tasks from part d of the associated jupyter notebook and report the results.
Solution: 1. The impulse response behaves more robustly for the Laplacian kernel. This aligns well with the results of part b, as for the RBF kernel the singular values of the kernel matrix are small,
and thus (K + λI)−1 is sensitive to small changes of those eigenvalues.
2. As we increase gamma, the vector αˆ also converges to an impulse vector (i.e. to y in this
case). This happens because as gamma increases, K converges to identity and αˆ = (K+λI)−1y.
3. The impulse response is closest to the step function when gamma is about 30–100 (depending on the choice of point). This suggests that for such gamma and lambda the prediction of kernel ridge regression is close to that of 1-nearest-neighbor regression, as each point propagates its value in its own small vicinity.
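A small sketch (illustrative assumptions only, not the notebook's exact setup) of the "impulse response" idea used in these parts: fit kernel ridge regression to a one-hot label vector and evaluate the resulting prediction function.

import numpy as np

gamma, lam = 30.0, 1e-4
x_train = np.linspace(0, 1, 50)
K = np.exp(-gamma * (x_train[:, None] - x_train[None, :])**2)   # RBF kernel matrix
y = np.zeros(50)
y[25] = 1.0                                   # one-hot "impulse" at one training point

alpha = np.linalg.solve(K + lam * np.eye(50), y)

x_test = np.linspace(0, 1, 200)
K_test = np.exp(-gamma * (x_test[:, None] - x_train[None, :])**2)
impulse_response = K_test @ alpha             # how one labeled point propagates its value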
(e) Do the tasks from part e of the associated jupyter notebook and report the results. Solution:
Before seeing the experiment it would be reasonable to assume, for example, that the RBF kernel would be better for a smooth function and the Laplacian one for a non-smooth function, based on how the impulse responses for those kernels behave.

1. When lambda is set to its minimum, the RBF kernel achieves a better approximation of sin(20x) and the Laplacian kernel a better approximation of the "piecewise linear function".

2. When λ is of order 10⁻⁸, the RBF kernel can achieve a better approximation for both functions.

3. When γ is of order 10², the RBF kernel can achieve a better approximation for both functions (the RBF kernel needs a larger λ for fitting the piecewise linear function). For the Laplacian kernel, it is hard to fit both functions when γ is of order 10².

4. When λ is chosen to be very large, the learned function will be the zero function; when γ is chosen to be very large, the learned function will be almost the zero function and interpolate the training data points.

5. The initial guess that we made was not correct: it turned out that the expressive abilities of the RBF kernel are so good that it can approximate a piecewise function better than the Laplacian kernel unless there is no regularization.

(f) Do the tasks from part f of the associated jupyter notebook and report the results. Solution:
1. For the RBF kernel the prediction looks (relatively) reasonable for γ from 45 to ~1,450 (at least it doesn't blow up or predict zero everywhere...). Higher values of γ turn out to be more susceptible to noise! This doesn't align with the results of part b, as higher values of γ ensure a better condition number of K. For the Laplacian kernel the range of gamma is from 0 to 45. For the whole range the prediction is not very sensitive to noise. In part b we could see that the shape of the spectrum for the Laplacian kernel also doesn't depend on γ too much, but it's not clear whether that is a coincidence (we cannot really vary both things very much, so it's hard to claim that there is some dependence).

2. For the RBF kernel, we could use (γ, λ) = (8.0, 1e−8). For the Laplacian kernel, we could use (γ, λ) = (2.0, 0.0156). The Laplacian kernel is less sensitive to the choice of parameters. Because of this stiffness it's hard to make it approximate the function. The RBF kernel is much easier to cast into the desired shape.
(g) Do the tasks from part g of the associated jupyter notebook and report the results. Solution:
1. The optimal λ⋆ increases as the noise level increases.
(h) Do the tasks from part h of the associated jupyter notebook and report the results.
Solution:
1. When there is no noise, γ⋆ is larger than in the noisy case. When there is noise in the output, we observe a similar γ⋆ for a wide range of noise levels.

2. The only qualitative difference in terms of the shape of the curve is that if there is no noise, there is no second descent (i.e. the curve is U-shaped for σ = 0, but it is W-shaped for σ > 0). We can see what exactly happens if we do the experiments in part f. First, in the noiseless case, for small values of γ our kernel is very flat and our learned function is close to zero. Second, if gamma is too large, the kernel is so narrow that our solution has spikes at the data points and is again zero everywhere else. Third, for the middle values of gamma kernel regression does something reasonable. That's how we get a U-shaped curve in the noiseless case.

Now imagine that we have only noise and no signal. In this case we want to learn zero, so small and large γ do exactly what we want, but middle values of gamma learn something nontrivial from noise. Therefore we get something like a flipped U-shaped curve (a ∩-shaped curve?).
When we add signal and noise, those two curves add up, and the result is a W-shaped curve.
(i) Do the tasks from part i of the associated jupyter notebook and report the results.
Solution:
1. The picked hyperparameters perform reasonably well on the test data. 2. The picked hyperparameters are pretty stable when we change k.

Figure 1: Dataset visualization: (j) circle.npz, (k) asymmetric.npz, (l) heart.npz.

5 Kernel Ridge Regression: Practice
In the following problem, you will implement Polynomial Ridge Regression and its kernel variant Kernel Ridge Regression, and compare them with each other. You will be dealing with a 2D regres- sion problem, i.e., xi ∈ R2. We give you three datasets, circle.npz (small dataset), heart.npz (medium dataset), and asymmetric.npz (large dataset). In this problem, we choose yi ∈ {−1, +1}, so you may view this question as a classification problem. Later on in the course we will learn more about logistic regression and SVMs, which can solve classification problems much better and can also leverage kernels. (Actually, you might already be able to see how you can leverage kernels for logistic regression since you’ve seen how it can be solved by iterating least-squares type problems.)
You are only allowed to use numpy.*, numpy.linalg.*, and matplotlib in the following ques- tions. Make sure to include plots and results in your writeups.
(a) Use matplotlib to visualize all the datasets and attach the plots to your writeup. Label the points with different y values with different colors and/or shapes.
Solution:
See Figure 1.
idx1 = y == 1.
idxm1 = y == -1.
X1 = X[idx1, :]
Xm1 = X[idxm1, :]
plt.scatter(X1[:, 0], X1[:, 1], label='$y=+1$')
plt.scatter(Xm1[:, 0], Xm1[:, 1], label='$y=-1$')
(b) Implement polynomial ridge regression (the non-kernelized version that you should already have seen in a previous homework) to fit the datasets circle.npz, asymmetric.npz, and heart.npz. Use the first 80% of the data as the training dataset and the last 20% of the data as the validation dataset. Report both the average training squared loss and the average validation squared loss for polynomial order p ∈ {1, . . . , 16}. Use the regularization term λ = 0.001 for all p. Visualize your result and attach the heatmap plots for the learned predictions over the entire 2D domain for p ∈ {2, 4, 6, 8, 10, 12} in your writeup.

Figure 2: Heat map of circle.npz for (a) p = 2, (b) p = 4, (c) p = 6, (d) p = 8, (e) p = 10, (f) p = 12.
Solution:
See Figures 2, 3, and 4. The errors can be found in the next part. If you directly use the code from Homework 2, you may find that your result is slightly different from the errors here due to the difference in the constant factors used in the polynomial features, e.g., the feature x1x2 vs the feature √2·x1x2, but your plot should be similar.
from scipy.special import binom

xs = []
for d0 in range(D + 1):
    for d1 in range(D - d0 + 1):
        # non-kernel polynomial feature
        # xs.append((x[:, 0]**d0) * (x[:, 1]**d1))
        # kernel polynomial feature
        xs.append((x[:, 0]**d0) * (x[:, 1]**d1)
                  * np.sqrt(binom(D, d0) * binom(D - d0, d1)))
poly_x = np.column_stack(xs)

Figure 3: Heat map of heart.npz for (a) p = 2, (b) p = 4, (c) p = 6, (d) p = 8, (e) p = 10, (f) p = 12.

Figure 4: Heat map of asymmetric.npz for (a) p = 2, (b) p = 4, (c) p = 6, (d) p = 8, (e) p = 10, (f) p = 12.
Xd_train = assemble_feature(X_train, D)
Xd_valid = assemble_feature(X_valid, D)
w = lstsq(Xd_train, y_train, LAMBDA)
error_train = np.average(np.square(y_train - Xd_train @ w))
error_valid = np.average(np.square(y_valid - Xd_valid @ w))
cond = np.linalg.cond(Xd_valid.T @ Xd_valid + np.eye(Xd_valid.shape[1]))
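The snippets in this solution call the starter-code helpers assemble_feature and lstsq; a plausible minimal stand-in for lstsq (an assumption about its behavior, not the official starter code) is the ridge closed form:

import numpy as np

def lstsq(A, b, lam):
    # Ridge / Tikhonov closed form: (A^T A + lam I)^{-1} A^T b
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)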
(c) Implement kernel ridge regression to fit the datasets circle.npz, heart.npz, and option- ally (due to the computational requirements), asymmetric.npz. Use the polynomial kernel K(xi,xj) = (1 + x⊤i xj)p. Use the first 80% data as the training dataset and the last 20% data as the validation dataset. Report both the average training squared loss and the aver- age validation squared loss for polynomial order p ∈ {1, . . . , 16}. Use the regularization term λ = 0.001 for all p. Code for generating heatmap plots is included in the notebook. For circle.npz, also report the average training squared loss and validation squared loss for polynomial order p ∈ {1, . . . , 24} when you use only the first 15% of data as the training dataset and the final 85% of data as the validation dataset. Based on the error, comment on when you want to use a high-order polynomial in linear/ridge regression.
Solution:
You can see that when your training data is not enough, i.e., in the case where you only use 15% of the data for training, you can easily overfit your training data if you use a high-order polynomial. When you have enough training data, i.e., in the case where you use 80% of the data for training, overfitting is less likely. Therefore, you want to use a high-order polynomial only when you have enough training data to avoid the overfitting problem. The average errors are:
########## 80% Training Data ################
####### circle.npz #######
p =  1  train_error = 0.997088  validation_error = 0.997579  cond = 3.885463
p =  2  train_error = 0.995537  validation_error = 1.001056  cond = 40.439621
p =  3  train_error = 0.992699  validation_error = 1.019356  cond = 230.817918
p =  4  train_error = 0.943011  validation_error = 0.997941  cond = 437.187915
p =  5  train_error = 0.935539  validation_error = 1.029308  cond = 804.009794
p =  6  train_error = 0.511241  validation_error = 0.547531  cond = 1307.933645
p =  7  train_error = 0.507592  validation_error = 0.549927  cond = 2159.011214
p =  8  train_error = 0.086389  validation_error = 0.101056  cond = 3630.740079
p =  9  train_error = 0.081809  validation_error = 0.097989  cond = 6230.776776
p = 10  train_error = 0.043086  validation_error = 0.054167  cond = 10920.048093
p = 11  train_error = 0.013966  validation_error = 0.018290  cond = 19529.648519
p = 12  train_error = 0.008685  validation_error = 0.011348  cond = 35549.340362
p = 13  train_error = 0.006517  validation_error = 0.008556  cond = 65983.294010
p = 14  train_error = 0.003665  validation_error = 0.004821  cond = 123976.972506
p = 15  train_error = 0.001912  validation_error = 0.002475  cond = 234627.222155
p = 16  train_error = 0.001400  validation_error = 0.001797  cond = 446625.921685
####### heart.npz #######
p =  1  train_error = 0.962643  validation_error = 0.959952  cond = 6.646302
p =  2  train_error = 0.236718  validation_error = 0.189837  cond = 26.941658
p =  3  train_error = 0.115481  validation_error = 0.090813  cond = 217.010014
p =  4  train_error = 0.012163  validation_error = 0.009089  cond = 348.834425
p =  5  train_error = 0.003759  validation_error = 0.002975  cond = 638.648596
p =  6  train_error = 0.002294  validation_error = 0.001613  cond = 1262.823064
p =  7  train_error = 0.001441  validation_error = 0.001056  cond = 2554.245128
p =  8  train_error = 0.000665  validation_error = 0.000428  cond = 5222.932534
p =  9  train_error = 0.000305  validation_error = 0.000202  cond = 10754.752173
p = 10  train_error = 0.000189  validation_error = 0.000138  cond = 22259.613418
p = 11  train_error = 0.000139  validation_error = 0.000114  cond = 46259.310324
p = 12  train_error = 0.000111  validation_error = 0.000097  cond = 96458.107873
p = 13  train_error = 0.000093  validation_error = 0.000084  cond = 201706.212544
p = 14  train_error = 0.000081  validation_error = 0.000075  cond = 422842.117216
p = 15  train_error = 0.000072  validation_error = 0.000068  cond = 888359.857996
p = 16  train_error = 0.000064  validation_error = 0.000062  cond = 1870033.835947
####### asymmetric.npz #######
p =  1  train_error = 0.999989  validation_error = 1.000194  cond = 4.303603
p =  2  train_error = 0.998260  validation_error = 1.000176  cond = 82.880736
p =  3  train_error = 0.991565  validation_error = 0.991388  cond = 559.928514
p =  4  train_error = 0.828692  validation_error = 0.822373  cond = 4924.555570
p =  5  train_error = 0.758986  validation_error = 0.748816  cond = 15783.658385
p =  6  train_error = 0.263368  validation_error = 0.241398  cond = 36482.622481
p =  7  train_error = 0.218690  validation_error = 0.195606  cond = 73065.066532
p =  8  train_error = 0.140721  validation_error = 0.120891  cond = 148442.373823
p =  9  train_error = 0.120781  validation_error = 0.102239  cond = 303228.309085
p = 10  train_error = 0.109520  validation_error = 0.092603  cond = 623400.268355
p = 11  train_error = 0.095645  validation_error = 0.081190  cond = 1289425.566871
p = 12  train_error = 0.083126  validation_error = 0.070826  cond = 2682742.562813
p = 13  train_error = 0.069519  validation_error = 0.059635  cond = 5613779.945180
p = 14  train_error = 0.052339  validation_error = 0.044942  cond = 11813079.998338
p = 15  train_error = 0.037785  validation_error = 0.032575  cond = 24993651.532068
p = 16  train_error = 0.029511  validation_error = 0.025690  cond = 53158174.199813
########## Just using 15% Training Data ################
####### circle.npz #######
p =  1  train_error = 0.977122  validation_error = 1.017212  cond = 154347.326799
p =  2  train_error = 0.965179  validation_error = 1.040716  cond = 188799.151210
p =  3  train_error = 0.935814  validation_error = 1.083452  cond = 260636.616808
p =  4  train_error = 0.828087  validation_error = 1.220925  cond = 388234.123476
p =  5  train_error = 0.808276  validation_error = 1.294004  cond = 605958.721676
p =  6  train_error = 0.465600  validation_error = 0.731820  cond = 974938.119166
p =  7  train_error = 0.418462  validation_error = 0.701896  cond = 1604147.948302
p =  8  train_error = 0.094915  validation_error = 0.326256  cond = 2690114.807338
p =  9  train_error = 0.064552  validation_error = 0.979804  cond = 4592713.085243
p = 10  train_error = 0.054649  validation_error = 2.273410  cond = 7981356.922646
p = 11  train_error = 0.036871  validation_error = 3.763307  cond = 14136597.558594
p = 12  train_error = 0.019774  validation_error = 1.865602  cond = 26239673.362870
p = 13  train_error = 0.009580  validation_error = 0.104549  cond = 49619782.252457
p = 14  train_error = 0.005777  validation_error = 0.372263  cond = 94594909.390382
p = 15  train_error = 0.004199  validation_error = 0.544182  cond = 181457265.287672
p = 16  train_error = 0.002995  validation_error = 0.436762  cond = 349803221.168144
p = 17  train_error = 0.001924  validation_error = 0.705161  cond = 677043148.807441
p = 18  train_error = 0.001210  validation_error = 1.518994  cond = 1314776445.035100
p = 19  train_error = 0.000851  validation_error = 3.576013  cond = 2560349372.861672
p = 20  train_error = 0.000678  validation_error = 7.938049  cond = 4997765669.676615
p = 21  train_error = 0.000571  validation_error = 16.370187  cond = 9775415811.240183
p = 22  train_error = 0.000483  validation_error = 32.763564  cond = 19153899435.104542
p = 23  train_error = 0.000405  validation_error = 62.110989  cond = 37587428504.160706
p = 24  train_error = 0.000344  validation_error = 103.845313  cond = 73859595026.545380

K = np.power(X @ XT + 1, D)
K = poly_kernel(X_train, X_train.T, D) + LAMBDA * np.eye(X_train.shape[0])
coeff = np.linalg.solve(K, y_train)
error_train = np.average(np.square(y_train - poly_kernel(X_train, X_train.T, D) @ coeff))
error_valid = np.average(np.square(y_valid - poly_kernel(X_valid, X_train.T, D) @ coeff))
print("p = {:2d}  train_error = {:7.6f}  validation_error = {:7.6f}  cond = {:14.6f}".
      format(D, error_train, error_valid, np.linalg.cond(K)))
if D in [2, 4, 6, 8, 10, 12]:
    isubplot += 1
    plt.subplot(3, 2, isubplot)
    heatmap(lambda x, y: poly_kernel(np.column_stack([x, y]), X_train.T, D) @ coeff)
    plt.title("D = %d" % D)
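The code above assumes a helper poly_kernel(X, XT, D); given the one-line kernel computation shown with the tables above (K = np.power(X @ XT + 1, D)), a matching wrapper would be (an assumption consistent with how it is called, not the official starter code):

def poly_kernel(X, XT, D):
    # (1 + x_i^T x_j)^D for all pairs of rows of X and columns of XT
    return np.power(X @ XT + 1, D)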
(d) Diminishing influence of the prior with growing amount of data: With increasing amounts of data, the gains from regularization diminish. Sample a subset of training data from the first 80% of data from asymmetric.npz and use the data from the last 20% of data for validation. Make a plot whose x axis is the amount of the training data and y axis is the validation squared loss of the non-kernelized ridge regression algorithm. Include 6 curves for hyper-parameters λ ∈ {0.0001, 0.001, 0.01} and p = {5, 6}. Your plot should demonstrate

that with the same p, the validation squared loss will converge with enough data, regardless of the choice of λ. You can use log scaling on the x axis for clarity, and you need to resample the data multiple times for a given p, λ, and amount of training data in order to get a smooth curve.
Solution: See Figure 5.
Figure 5: Plot for Diminishing Influence of the Prior. (a) Kernel Ridge Regression; (b) Non-Kernel Ridge Regression. Each panel shows the validation squared loss versus the amount of training data for lambda ∈ {0.001, 0.010, 0.100} and D ∈ {5, 6}.
plt.xscale('log')
for D in [5, 6]:
    Xd_train = assemble_feature(X_train, D)
    Xd_valid = assemble_feature(X_valid, D)
    for LAMBDA in [0.001, 0.01, 0.1]:
        print("D=%d, LAMBDA=%.3f" % (D, LAMBDA))
        pltx = [int(1.5**x) for x in range(4, 300) if 1.5**x < n_train] + [n_train]
        plty = []
        for n_sampl in pltx:
            error = []
            time = max(int(40000 / n_sampl), 1)
            for ttt in range(time):
                idx = np.random.randint(n_train, size=n_sampl)
                Xd_sampl = Xd_train[idx, :]
                y_sampl = y_train[idx]
                w = lstsq(Xd_sampl, y_sampl, LAMBDA)
                error_valid = np.average(np.square(y_valid - Xd_valid @ w))
                error.append(error_valid)
            plty.append(np.average(error))
        plt.plot(pltx, plty, label="lambda = %.3f D = %d" % (LAMBDA, D))
plt.ylim([0, 5])
plt.legend()
plt.show()

(e) A popular kernel function that is widely used in various kernelized learning algorithms is called the radial basis function kernel (RBF kernel). It is defined as

K(x, x′) = exp{−∥x − x′∥₂² / (2σ²)}.    (11)

Implement the RBF kernel function for kernel ridge regression to fit the dataset heart.npz. Use the regularization term λ = 0.001. Report the average squared loss, visualize your result and attach the heatmap plots for the fitted functions over the 2D domain for σ ∈ {10, 3, 1, 0.3, 0.1, 0.03} in your writeup. You may want to vectorize your kernel functions to speed up your implementation. Comment on the effect of σ.

Solution: The average fitting error is

sigma = 10.000  train_error = 0.279653  validation_error = 0.224638  cond = 800690.695468
sigma =  3.000  train_error = 0.119629  validation_error = 0.082379  cond = 778537.061196
sigma =  1.000  train_error = 0.005872  validation_error = 0.004201  cond = 648473.876828
sigma =  0.300  train_error = 0.000053  validation_error = 0.000050  cond = 469873.484420
sigma =  0.100  train_error = 0.000000  validation_error = 0.000000  cond = 442247.855472
sigma =  0.030  train_error = 0.000000  validation_error = 0.000078  cond = 291224.335632

The heat maps can be found in Figure 6 for σ ∈ {10, 3, 1, 0.3, 0.1, 0.03}. As we see, the larger σ is, the more data the kernel averages over and the more blurry the image of the heatmap gets. The previous code from kernel regression includes the implementation of the RBF kernel.

XXT = -2 * X @ XT
XXT += np.sum(X * X, axis=1, keepdims=True)
XXT += np.sum(XT * XT, axis=0, keepdims=True)
K = np.exp(-XXT / (2 * sigma * sigma))

K = rbf_kernel(X_train, X_train.T, sigma) + LAMBDA * np.eye(X_train.shape[0])
coeff = np.linalg.solve(K, y_train)
error_train = np.average(
    np.square(y_train - rbf_kernel(X_train, X_train.T, sigma) @ coeff))
error_valid = np.average(
    np.square(y_valid - rbf_kernel(X_valid, X_train.T, sigma) @ coeff))
print("sigma = {:6.3f}  train_error = {:7.6f}  validation_error = {:7.6f}  cond = {:14.6f}".
      format(sigma, error_train, error_valid, np.linalg.cond(K)))
isubplot += 1
plt.subplot(3, 2, isubplot)
heatmap(lambda x, y: rbf_kernel(np.column_stack([x, y]), X_train.T, sigma) @ coeff)
plt.title("sigma = %.2f" % sigma)
Figure 6: Heatmap of heart.npz for (a) σ = 10, (b) σ = 3, (c) σ = 1, (d) σ = 0.3, (e) σ = 0.1, (f) σ = 0.03.

(f) For polynomial ridge regression, which of your implementations is more efficient, the kernelized one or the non-kernelized one? For the RBF kernel, explain whether it is possible to implement it in the non-kernelized ridge regression. Summarize when you prefer the kernelized to the non-kernelized ridge regression.

Solution: For polynomial regression, our non-kernelized method is more efficient in practice. This matches our prediction, where the computational complexity of the non-kernelized method is smaller when d ≪ n. However, this problem is in a 2D space. When the feature space is in R20, for example, using a degree-10 polynomial can be impractical for the non-kernelized method. In summary, we want to use the kernelized method when the feature space is much larger than the number of samples, or when the dimension of the feature space is infinite as with the RBF kernel, in which case it is impossible to use the non-kernelized method.

(g) Disable the clip option in the provided heatmap function and redraw the heatmap plots for the functions learned by the polynomial kernel and the RBF kernel. Experiment on the provided datasets and describe one potential problem of the polynomial kernel related to what you see here. Does the RBF kernel have such a problem? Compute, compare, comment, and attach the heatmap plots of the polynomial kernel and the RBF kernel on the heart.npz dataset.

Solution: Figure 7 shows the comparison between the polynomial kernel (left) and the RBF kernel (right) with similar fitting error. The most observable problem is that the polynomial kernel does not extrapolate well, i.e., you will see absurdly large numbers outside the domain of the
6 Nearest Neighbors and Bayes risk

In lecture, you saw an argument for why, if the training data is dense enough and the underlying pattern is smooth enough, a 1-nearest-neighbor estimator is going to basically approach a test squared error that is the sum of the variance of the training noise and the irreducible noise at test time. You also saw an argument for why k-nearest-neighbor estimation will do better by being able to average away the training noise.

In this problem, the goal is to help you understand what the counterpart looks like for classification-type problems. A census of Middle Earth is conducted to see the extent to which we can predict that someone is an orc or an elf based on age and height. We design x = [A, H] where A indicates the candidate's age and H their height. The census includes enough data to give a posterior probability P(y | x), where y indicates whether this being is an elf.

[Figure omitted: the (A, H) input space is divided into four regions; from left to right, P(y = 1 | x) takes the values 0, 0.1, 0.9, and 1.]

We label the regions above from left to right R1 to R4. For illustrative simplicity, we have made the conditional probability of being an elf piecewise constant, so that it is constant over each of those regions.

(a) What is the Bayes risk in each of the four regions?
Here, the Bayes risk means the probability of error of the optimum Bayes classifier that knows the underlying pattern perfectly. The Bayes risk is the counterpart of the "irreducible error" since it reflects an intrinsic uncertainty that we cannot remove when we are trying to do prediction based on the given information.

Solution: Clearly, the Bayes decision boundary lies between R2 and R3. According to this decision rule, the left side (R1, R2) is classified with label 0 and the right side (R3, R4) with label 1. Now we analyze the regions one by one.
R1: E_bayes = 0. If the sample lies in region 1, its label is 0 deterministically, so the Bayes classifier, which predicts 0, is never wrong.
R2: E_bayes = 0.1. If the sample lies in region 2, it has probability 0.1 of having label 1, but the Bayes decision labels it 0. Therefore we have a 0.1 chance of labeling it wrong.
R3: E_bayes = 0.1, similar to R2.
R4: E_bayes = 0, similar to R1.

(b) Assume that the training data is dense enough so that all the nearest neighbors of a test sample lie in the same region as the test sample itself.
Now consider a test sample x which falls in region R_i. For i ∈ {1, 2, 3, 4}, what is the probability that x and its nearest neighbor have different labels y? Here, we are assuming that the training data has labels generated by the same random process that generates the labels at test time, and that the labels on training points are independent of the labels on test points.

Solution:
R1: 0. If the sample lies in region 1, its label is 0 deterministically, and its nearest neighbor, which is also in this region, is labeled 0 too. Therefore the probability that the two labels differ is 0.
R2: 0.1 · 0.9 + 0.9 · 0.1 = 0.18. If the sample lies in region 2, it is labeled 1 with probability 0.1 while its nearest neighbor is labeled 0 with probability 0.9, contributing 0.1 · 0.9; by symmetry, it is labeled 0 with probability 0.9 while its nearest neighbor is labeled 1 with probability 0.1, contributing 0.9 · 0.1. Altogether, the chance that the nearest pair is labeled differently is 0.18.
R3: same as R2, namely 0.18.
R4: 0, similar to R1.

(c) What is the 1-nearest neighbor classification error rate in each region?

Solution: This follows directly from the scenario in part (b): the 1-nearest-neighbor rule predicts the nearest neighbor's label, so an error occurs exactly when the test sample's actual label differs from its nearest neighbor's label. Therefore, the error rate is what we calculated in the previous part:
R1: E_1NN = 0
R2: E_1NN = 0.18
R3: E_1NN = 0.18
R4: E_1NN = 0

(d) Now we generalize the problem above:
R1: P(y = 1 | x) = 0
R2: P(y = 1 | x) = p
R3: P(y = 1 | x) = 1 − p
R4: P(y = 1 | x) = 1
where p is a number between 0 and 0.5. Calculate the answers to the previous questions using this new information. Can you see why the classification performance of 1-nearest neighbor with sufficiently dense training data is never worse than twice the Bayes risk?

Solution: The calculation is exactly the same as above; for the previous questions we can think of p = 0.1 and 1 − p = 0.9. Thus we have
R1: E_bayes = 0, E_1NN = 0
R2: E_bayes = p, E_1NN = 2p(1 − p)
R3: E_bayes = p, E_1NN = 2p(1 − p)
R4: E_bayes = 0, E_1NN = 0
Notice here that 2p(1 − p) < 2p. This is actually a pretty celebrated result that says that when there is enough training data and it comes from the same distribution as the test data, simple 1-nearest-neighbor methods will approach something better than twice the Bayes risk. The intuition is that there are two sources of error: the test point might itself have an irreducible error in its own label, or the nearest training point might happen to be mislabeled.
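These closed forms are easy to sanity check by simulation. The sketch below (not part of the assigned solution) works under the same assumptions as this part: the nearest neighbor lies in the same region as the test point and its label is an independent draw with the same P(y = 1 | x) = p.

import numpy as np

p = 0.1                           # P(y = 1 | x) inside the region, e.g. R2
rng = np.random.default_rng(0)
n = 1_000_000

test_labels = rng.random(n) < p   # labels of the test points
nn_labels = rng.random(n) < p     # independent labels of their nearest neighbors

bayes_pred = 0                    # the Bayes classifier predicts the majority label since p < 0.5
bayes_error = np.mean(test_labels != bayes_pred)
one_nn_error = np.mean(test_labels != nn_labels)

print(f"Bayes error ~ {bayes_error:.4f}   (theory: {p})")
print(f"1-NN error  ~ {one_nn_error:.4f}   (theory: {2 * p * (1 - p):.4f})")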
(e) Now, consider the previous part, but think about k-nearest neighbors where we take the majority vote of the labels of the nearest neighbors. Here, we continue with the assumptions from the previous parts: all the nearest neighbors are in the same region as the test point. Remember, in the model given above, the training labels of data points are i.i.d. conditioned on which region they are found in, and the test point's label is also identically distributed and independent of all the training labels. What is the probability of classification error for this k-nearest-neighbor classifier in each of the regions? Can you see why the performance will approach the Bayes risk as k → ∞?

Solution: Similarly, for k-nearest neighbors, R1 and R4 still have an error rate of 0. For R2 and R3, the error rate changes. To avoid breaking ties, we choose k to be odd. We need to think more carefully about the error events now.

There are two ways that errors can happen. First, the actual test point might have a label that disagrees with the Bayes optimum predictor. In this case, we will probably get an error unless something strange happens and the majority of the k neighbors choose the less likely label. This weirdly correct case has the probability that a Binomial random variable with k coin tosses, each with independent probability p < 1/2 of coming up heads, actually has a majority of heads; in other words, the Binomial shows ⌈k/2⌉ or higher. Consequently, this particular error event has probability

p · Σ_{i=0}^{⌊k/2⌋} C(k, i) p^i (1 − p)^{k−i},

where C(k, i) denotes the binomial coefficient "k choose i". Notice that this event has a probability that is upper bounded by p no matter what.

The other error event involves the test point having the same label as the Bayes optimum predictor, but the majority of the k nearby points strangely disagreeing. Consequently, this particular error event has probability

(1 − p) · Σ_{i=⌈k/2⌉}^{k} C(k, i) p^i (1 − p)^{k−i}.

Because of the weak law of large numbers, the above event has a probability that tends to 0 as k increases. An appropriate Chernoff bound would reveal that this latter probability is no bigger than (1 − p) exp(−k D(1/2 ‖ p)), where the binary Kullback-Leibler divergence D(q ‖ p) = q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)) is a non-negative quantity whenever q ≠ p. This means that this error event is going to zero exponentially in k, and eventually, with k high enough, we basically approach the optimal Bayes risk.
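The two error terms above can be evaluated exactly. The sketch below (not part of the assigned solution; the value of p and the grid of odd k are arbitrary illustrative choices) computes p · P(Bin(k, p) ≤ ⌊k/2⌋) + (1 − p) · P(Bin(k, p) ≥ ⌈k/2⌉) and shows it decreasing from 2p(1 − p) at k = 1 toward the Bayes risk p as k grows.

from math import comb

def knn_error(k, p):
    """Exact k-NN error rate in a region with P(y = 1 | x) = p < 0.5, for odd k."""
    pmf = [comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k + 1)]
    no_lucky_flip = sum(pmf[: k // 2 + 1])   # P(Bin(k, p) <= floor(k/2)): majority votes 0
    unlucky_flip = sum(pmf[k // 2 + 1:])     # P(Bin(k, p) >= ceil(k/2)): majority votes 1
    return p * no_lucky_flip + (1 - p) * unlucky_flip

p = 0.1
for k in [1, 3, 5, 11, 21, 51, 101]:
    print(f"k = {k:3d}   error = {knn_error(k, p):.6f}   (Bayes risk = {p})")
# k = 1 reproduces 2p(1 - p) = 0.18; the error decreases toward p = 0.1 as k grows.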
7 Quantifying contamination in ridge regularization

In this problem, we want to make sure that you can understand why contamination is generically an issue with non-trivial ridge regularization, and where it is coming from at a fundamental level. Here, we are going to be leveraging the "feature augmentation" perspective on ridge regression since that provides the key intuition. For simplicity throughout this problem, we are going to assume that the true pattern is a proportion w* of the first feature x[1] in each sample. So y_i = w* x_i[1] when we have training data {(x_i, y_i)}_{i=1}^n.

(a) We will start with something that might initially feel totally unrelated. Suppose that there were only two features, and the first and second features are just multiples of each other. Suppose x[2] = βx[1] where β is some nonzero constant. In this case, the second feature is spiritually a perfect "alias" of the first feature for the training data. Compute what the minimum norm solution will be to the equation Xw = y. How does the survival of the true weight in the first coefficient depend on β?

Solution: We can write our equation as

[x[1]  βx[1]] [w1; w2] = w* x[1].

Expanding out, we see that w1 + βw2 = w*, and we wish to minimize w1² + w2². This gives us the new minimum norm problem

[1  β] w = w*,

which we know can be immediately solved by Moore-Penrose to yield

ŵ = [1  β]⊤ ([1  β][1  β]⊤)⁻¹ w* = (1/(1 + β²)) [1; β] w*.

So the first coefficient becomes

w1 = w*/(1 + β²),

which approaches zero as β → ∞, as would be expected (the weight should shift to the second feature as it grows much larger than the first one, and is hence very attractive from the perspective of the minimum norm solution). When β → 0, all the weight sits on the first feature. Notice also the immediate parallel to what we know happens in ridge regression generally. Here β² is behaving like the λ in traditional ridge regression; this should not be surprising given the entire context of this problem.

(b) Now, we will return to ridge regression from the "augmented feature" perspective that you saw in an earlier homework. Here, assume that there are m features in x and the second feature is no longer a scaled version of the first feature. In the "augmented feature" perspective, you construct a new matrix A = [X  √λ I] and solve a minimum norm problem of the type Aη = y. Here, though, the target of interest is just the first feature column X[1]: consider the new matrix Ã that is obtained by dropping the first column from A, let η̃ = [w̃; ξ̃], and consider the minimum norm problem Ã η̃ = X[1]. Give an expression for η̃ that is the minimum norm solution.
(HINT: This is just a plug-into-the-formula type of problem. The interesting thing is that you can actually get the answer using two different formulas. The Moore-Penrose pseudo-inverse is the straightforward approach. But you can also go via the traditional solution to ridge regression by reinterpreting the dropped-column problem back as a ridge regression with a smaller parameter space.)

Solution: For notational convenience, let X̃ be X with the first column removed. We are asked to find the minimum-norm solution for the problem

[X̃  √λ I] η̃ = X[1].

Observe that the rows of Ã are clearly linearly independent (since they each contain a different identity row vector in the second portion of each row), so we can apply the known minimum norm solution to obtain

η̃ = [X̃  √λ I]⊤ ([X̃  √λ I][X̃  √λ I]⊤)⁻¹ X[1] = [X̃  √λ I]⊤ (X̃X̃⊤ + λI)⁻¹ X[1].   (12)

Although we didn't ask you to do this here, the expression is interesting because we can see the w̃ part distinctly here:

w̃ = X̃⊤ (X̃X̃⊤ + λI)⁻¹ X[1].   (13)

Notice that this only has to do with the internal relationship among the features. It is a little ridge regression in which we best estimate the first column in terms of the others. This will in general be bigger the more collinear the first feature is with the rest of them. The great thing about recognizing this as a little ridge regression is that we can immediately give an alternative formula for it as well:

w̃ = (X̃⊤X̃ + λI)⁻¹ X̃⊤ X[1].   (14)

This alternative formula is based on the more classic view of ridge regression.

Look at the expression for ξ̃ as well:

ξ̃ = √λ (X̃X̃⊤ + λI)⁻¹ X[1],   (15)

where this can also be understood in a different way as

ξ̃ = (1/√λ)(X[1] − X̃ w̃) = (1/√λ)(I − X̃X̃⊤ (X̃X̃⊤ + λI)⁻¹) X[1].   (16)

Notice that ξ̃ is behaving like a scaled residual: it is capturing the mismatch between the part of the first feature that is well represented among the rest of the features and the first feature itself.
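Since (12), (13), (14), and (16) are all closed-form, they are easy to check numerically. The following is a minimal sketch on random data (not part of the assigned solution; the dimensions n, m and the value of λ are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 8, 4, 0.5
X = rng.normal(size=(n, m))
x1 = X[:, 0]                        # the first (true) feature column X[1]
Xt = X[:, 1:]                       # X with the first column removed (X tilde)

A_tilde = np.hstack([Xt, np.sqrt(lam) * np.eye(n)])

# (12): minimum-norm solution of A_tilde @ eta = x1.
eta = A_tilde.T @ np.linalg.solve(Xt @ Xt.T + lam * np.eye(n), x1)
w_tilde = eta[: m - 1]              # weights on the real (non-first) features
xi_tilde = eta[m - 1:]              # weights on the "ghostly" ridge features

# (13) and (14): the two equivalent formulas for w_tilde.
w13 = Xt.T @ np.linalg.solve(Xt @ Xt.T + lam * np.eye(n), x1)
w14 = np.linalg.solve(Xt.T @ Xt + lam * np.eye(m - 1), Xt.T @ x1)

print(np.allclose(A_tilde @ eta, x1))                              # the synthetic alias reproduces X[1]
print(np.allclose(w_tilde, w13), np.allclose(w13, w14))            # (13) and (14) agree
print(np.allclose(xi_tilde, (x1 - Xt @ w_tilde) / np.sqrt(lam)))   # (16): scaled residual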
(c) In the previous part, you have effectively computed how, for this particular ridge parameter λ, the last features together with the "ridge features" permit you to construct a synthetic perfect alias for the first feature as far as these training points are concerned. Prove that the solution to the original ridge regression is going to have contamination that is spread across features other than the first feature in a manner that is proportional to the solution you found in the previous part.
(HINT: The minimum-norm solution perspective will be helpful. Think about the equality constraint and what that implies in this case.)

Solution: Consider the minimum-norm problem when the first column is reintroduced. In an analogous manner to the previous part, we obtain the minimum-norm problem

[X  √λ I] η̂ = w* X[1].

Let η̂ be the optimal solution to the above problem, let η̂[1] be the first component of η̂ (i.e. the weight applied to the first feature), and let η̌ contain all the subsequent components of η̂. We can rewrite the above equality as

X[1] η̂[1] + [X̃  √λ I] η̌ = w* X[1],

which can be rearranged to yield

[X̃  √λ I] η̌ = (w* − η̂[1]) X[1].

Clearly, since we are trying to minimize the norm of η̂, we want to minimize the norm of η̌ for a fixed η̂[1], so we can view the above equation as a minimum-norm problem for η̌. Comparing it to the minimum-norm problem from the previous part, we see that it is the same up to a scale factor. The Moore-Penrose pseudoinverse is a linear operation, and so

η̌ = (w* − η̂[1]) η̃.

The actual weights here that are going to cause contamination in test predictions are those associated with the original features, in other words, those proportional to w̃. The weights associated with ξ̃ multiply the "ghostly" ridge features that have no presence in the real world of prediction at test time. Because we don't know η̂[1] yet, we can't exactly nail down what this contamination cost is going to be. But that's what the next part is for.

(d) Now we want to leverage the first part of this problem. What is the effective β for the synthetic alias? In other words, reinterpret the optimization for the original ridge regression into two phases: in the first phase, you compute the cheapest alias you can for the first feature. In the second phase, you find the solution for ridge regression by optimally weighing the true feature and this cheapest alias in a manner reminiscent of the first part of this problem. (Understanding this reinterpretation is the heart of this part of the problem.)

We can further think about ŵ = [ŵ[1]; w̌], where w̌ = [ŵ[2], ..., ŵ[m]]⊤ represents the learned weights on the contaminating features. Give expressions for both ŵ[1] and w̌ that show their proportionality to w*. If the underlying features were orthonormal relative to the test distribution for the input X, then Σ_{j=2}^{m} (w̌[j])² would represent the component of test prediction squared error that is arising from the contamination represented by w̌. Although we do not ask you to do this here, it is interesting to think about how everything above depends on λ.

Solution: The approach here is a divide-and-conquer approach. We first reduce to the first part of the problem by finding out how to represent the first feature using all the others (including the "ghostly" ridge features) as far as the training data is concerned. Then we can use the formula from the first part to get what we want.

Recall from the first part that the weight on the second feature (i.e. the one that will eventually correspond to contamination) was w* β/(1 + β²), while the weight on the first feature (the true feature) was w*/(1 + β²). What remains is figuring out β. The factor β reflected how much cheaper (or more expensive, if β < 1) the weights were on the alias as compared to the original feature.
Since the original feature would have a weight of 1 to be fully present, the β value is the reciprocal of the norm of the weight vector η̃ computed in (12) in an earlier part. Why? Because we want the norm-squared needed for the second component of the reduced problem of part (a) to reflect the norm-squared needed in the original problem. Consequently, the β is given by

β* = 1/‖η̃‖   (17)
   = 1/√((X[1])⊤ (X̃X̃⊤ + λI)⁻¹ X[1]).   (18)

From here, we instantly see that

ŵ[1] = w* · 1/(1 + (β*)²)   and   w̌ = w* · (β*)²/(1 + (β*)²) · w̃,

where the β* is as above and the w̃ is given in (13). The above suffices to solve this problem and helps us better understand where contamination comes from in ridge regression.

Of course, we can also just solve the problem directly without using the earlier parts. The proportionality to w* would still be instantly apparent. For example, we can just directly solve ridge using the straightforward Moore-Penrose formula to get

[ŵ[1]; w̌] = w* X⊤ (XX⊤ + λI)⁻¹ X[1] = [ w* (X[1])⊤ (XX⊤ + λI)⁻¹ X[1] ; w* X̃⊤ (XX⊤ + λI)⁻¹ X[1] ].

Notice that the rules of matrix multiplication allow us to segment the expression more easily than if we had used the traditional form of ridge. The expression that determines contamination is particularly interesting. It is a "weighted inner-product" between the various false features and the true feature, where the "inner product weights" are given by the inverse of a regularized kernel-type matrix that involves the similarities between the feature vectors on training samples.

However, this expression looks different from the formula that we calculated above. Similar, but different. Where is the 1/(1 + (β*)²) here? Here, the Sherman-Morrison formula for a rank-1 update to a matrix inverse,

(A + uv⊤)⁻¹ = A⁻¹ − (A⁻¹ u v⊤ A⁻¹)/(1 + v⊤ A⁻¹ u),

turns out to be helpful. Notice that (XX⊤ + λI) = (X[1])(X[1])⊤ + X̃X̃⊤ + λI by the outer-product interpretation of matrix multiplication. At this point, you can apply the Sherman-Morrison formula to compute the inverse, and after some algebra you'll be led to the expression

(X[1])⊤ (XX⊤ + λI)⁻¹ X[1] = 1/(β*)² − (1/(β*)²)(1/(β*)²)/(1 + 1/(β*)²) = (1/(β*)²)(1 − 1/(1 + (β*)²)) = 1/(1 + (β*)²).

So these expressions are the same after all.
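As a final numerical sanity check (not part of the assigned solution; the dimensions, λ, and w* below are arbitrary), we can confirm that the direct ridge solution matches the expressions for ŵ[1] and w̌ built from β* and w̃:

import numpy as np

rng = np.random.default_rng(1)
n, m, lam, w_star = 10, 5, 0.3, 2.0
X = rng.normal(size=(n, m))
x1, Xt = X[:, 0], X[:, 1:]
y = w_star * x1                                # the true pattern uses only the first feature

# Direct ridge solution via the Moore-Penrose / dual form.
w_hat = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Ingredients from the earlier parts.
M_inv_x1 = np.linalg.solve(Xt @ Xt.T + lam * np.eye(n), x1)
w_tilde = Xt.T @ M_inv_x1                      # (13)
beta_star = 1.0 / np.sqrt(x1 @ M_inv_x1)       # (17)-(18)

w1_pred = w_star / (1 + beta_star**2)
w_check_pred = w_star * beta_star**2 / (1 + beta_star**2) * w_tilde

print(np.allclose(w_hat[0], w1_pred))          # survival of the true feature's weight
print(np.allclose(w_hat[1:], w_check_pred))    # contamination is proportional to w_tilde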
8 Student Survey

Please fill out the survey located at https://forms.gle/e2s63eC88YKmnzYW6.

Solution: (No deliverables)

9 Your Own Question

Write your own question, and provide a thorough solution.

Writing your own problems is a very important way to really learn the material. The famous "Bloom's Taxonomy" that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. make yourself flashcards). But we don't want the same to be true about the highest level.

As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.

Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don't have to achieve this every week. But unless you try every week, it probably won't happen ever.

Contributors:
• Alexander Tsigler
• Anant Sahai
• Fanny Yang
• Jerome Baek
• Josh Sanz
• Kate Sanders
• Manav Rathod
• Peter Wang
• Philipp Moritz
• Rahul Arya
• Soroush Nasiriany
• Yaodong Yu
• Yichao Zhou
• bsword