EECS 189 Introduction to Machine Learning Fall 2020
HW9
This homework is due Wednesday, November 4 at 11:59 p.m.

2 Classification Policy
Suppose we have a classification problem with classes labeled $1, \dots, c$ and an additional "doubt" category labeled $c+1$. Let $f : \mathbb{R}^d \to \{1, \dots, c+1\}$ be a decision rule. Define the loss function
$$L(f(x), y) = \begin{cases} 0 & \text{if } f(x) = y, \; f(x) \in \{1, \dots, c\}, \\ \lambda_c & \text{if } f(x) \ne y, \; f(x) \in \{1, \dots, c\}, \\ \lambda_d & \text{if } f(x) = c + 1, \end{cases} \tag{1}$$
where $\lambda_c \ge 0$ is the loss incurred for making a misclassification and $\lambda_d \ge 0$ is the loss incurred for choosing doubt. In words this means the following:
• When you are correct, you should incur no loss.
• When you are incorrect, you should incur some penalty λc for making the wrong choice.
• When you are unsure about what to choose, you might want to select a category corresponding to "doubt" and you should incur a penalty λd.
We can see that in practice we’d like to have this sort of loss function if we don’t want to make a decision if we are unsure about it. This sort of loss function, however, doesn’t help you in instances where you have high certainty about a decision, but that decision is wrong.
To understand the expected amount of loss we will incur with decision rule $f(x)$, we look at what is called the risk. (We've also used the word "loss" for this kind of thing.) The risk of classifying a new data point $x$ as class $f(x) \in \{1, 2, \dots, c+1\}$ is
$$R(f(x) \mid x) = \sum_{i=1}^{c} L(f(x), i)\, P(Y = i \mid x).$$
(a) Show that the following policy fopt(x) obtains the minimum risk:
• (R1) Find class i such that P(Y = i|x) ≥ P(Y = j|x) for all j, meaning you pick the class with the highest probability given x.
• (R2) Choose class $i$ if $P(Y = i \mid x) \ge 1 - \frac{\lambda_d}{\lambda_c}$.
• (R3) Choose doubt otherwise.

Solution:
• Let's first simplify the risk given our specific loss function. If $f(x) = i$ where $i$ is not doubt, then the risk is
$$R(f(x) = i \mid x) = \sum_{j=1}^{c} L(f(x) = i, y = j)\, P(Y = j \mid x) \tag{2}$$
$$= 0 \cdot P(Y = i \mid x) + \lambda_c \sum_{j=1,\, j \ne i}^{c} P(Y = j \mid x) \tag{3}$$
$$= \lambda_c \left(1 - P(Y = i \mid x)\right) \tag{4}$$
When $f(x) = c + 1$, meaning you've chosen doubt, the risk is:
$$R(f(x) = c+1 \mid x) = \sum_{j=1}^{c} L(f(x) = c+1, y = j)\, P(Y = j \mid x) \tag{5}$$
$$= \lambda_d \sum_{j=1}^{c} P(Y = j \mid x) \tag{6}$$
$$= \lambda_d \tag{7}$$
where $\sum_{j=1}^{c} P(Y = j \mid x) = 1$ since it's a proper probability distribution.
• Now let $f_{opt} : \mathbb{R}^d \to \{1, \dots, c+1\}$ be the decision rule which implements (R1)–(R3). We want to show that in expectation the rule $f_{opt}$ is at least as good as an arbitrary rule $f$. Let $x \in \mathbb{R}^d$ be a data point which we want to classify. Let's examine all the possible scenarios where $f_{opt}(x)$ and another arbitrary rule $f(x)$ might differ:

Case 1: Let $f_{opt}(x) = i$ where $i \ne c+1$.

– Case 1a: $f(x) = k$ where $k \ne i$. Then we get with (R1) that
$$R(f_{opt}(x) = i \mid x) = \lambda_c \left(1 - P(Y = i \mid x)\right) \le \lambda_c \left(1 - P(Y = k \mid x)\right) = R(f(x) = k \mid x).$$

– Case 1b: $f(x) = c+1$. Then we get with (R2) that
$$R(f_{opt}(x) = i \mid x) = \lambda_c \left(1 - P(Y = i \mid x)\right) \le \lambda_c \left(1 - \left(1 - \frac{\lambda_d}{\lambda_c}\right)\right) = \lambda_d = R(f(x) = c+1 \mid x),$$
because $P(Y = i \mid x) \ge 1 - \lambda_d/\lambda_c$ under (R2).

Case 2: Let $f_{opt}(x) = c+1$ and $f(x) = k$ where $k \ne c+1$. Then:
$$R(f(x) = k \mid x) = \lambda_c \left(1 - P(Y = k \mid x)\right), \qquad R(f_{opt}(x) = c+1 \mid x) = \lambda_d.$$
We are in case (R3), which means that
$$\max_{j \in \{1, \dots, c\}} P(Y = j \mid x) < 1 - \lambda_d/\lambda_c,$$
hence $P(Y = k \mid x) < 1 - \lambda_d/\lambda_c$, which means
$$R(f(x) = k \mid x) > \lambda_d = R(f_{opt}(x) \mid x).$$
Therefore in every case we proved that the rule $f_{opt}$ is at least as good as the arbitrary rule $f$, which proves that $f_{opt}$ is an optimal rule.

(b) How would you modify your optimal decision rule if $\lambda_d = 0$? What happens if $\lambda_d > \lambda_c$? Explain why this is or is not consistent with what one would expect intuitively.
Solution: If $\lambda_d = 0$, then condition (R2) holds iff there exists an $i \in \{1, \dots, c\}$ such that $P(Y = i \mid x) = 1$. So we will classify $x$ in class $i$ only if we are 100% sure about this, or else we will choose doubt. Of course this is completely consistent with our intuition, because choosing doubt does not have any penalty at all, since $\lambda_d = 0$.
If $\lambda_d > \lambda_c$, then we will always classify $x$ in the class $i \in \{1, \dots, c\}$ which gives the highest probability of correct classification. Once again this makes sense, since the cost of choosing doubt is higher than classifying $x$ in any of the classes, hence our best option is to classify $x$ in the class which gives the highest probability for a correct classification.
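For a concrete instance of the rule (numbers chosen purely for illustration): with $\lambda_c = 1$ and $\lambda_d = 0.25$, rule (R2) commits to the top class exactly when
$$\max_{i} P(Y = i \mid x) \ge 1 - \frac{0.25}{1} = 0.75,$$
and chooses doubt otherwise; once $\lambda_d \ge \lambda_c$, the threshold $1 - \lambda_d/\lambda_c \le 0$ is met by every posterior, so doubt is never chosen.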
3 Sensors, Objects, and Localization (Part 2)
Let us say there are $n$ objects and $m$ sensors located in a 2D plane. The $n$ objects are located at the points $(x_1, y_1), \dots, (x_n, y_n)$. The $m$ sensors are located at the points $(a_1, b_1), \dots, (a_m, b_m)$. We have measurements for the distances between the objects and the sensors: $D_{ij}$ is the measured distance from object $i$ to sensor $j$. The distance measurement has noise in it. Specifically, we model
$$D_{ij} = \|(x_i, y_i) - (a_j, b_j)\| + Z_{ij},$$
where $Z_{ij} \sim N(0, 1)$. The noise is independent across different measurements.
Assume we observe Dij = dij with (Xi,Yi) = (xi,yi) for i = 1,…,n and j = 1,…,m. Here, m = 7. Our goal is to predict (Xi′ , Yi′ ) from newly observed Di′1, . . . , Di′7. For a data set with q points, the error is measured by the average distance between the predicted object locations and the true object locations,
$$\frac{1}{q} \sum_{i=1}^{q} \sqrt{(\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2},$$
where (xˆi, yˆi) is the location of objects predicted by a model.
We are going to consider seven models in this problem and compare their performance:
• A Zeros Model: This model always predicts (0, 0) as the location. It serves as a baseline for how well we might do if we totally ignore the data.
• A Generative Model: This is basically from the earlier assignment (HW7 prob. 10): we first estimate sensor locations from the training data by solving the nonlinear least squares problem using the Gauss–Newton algorithm¹, and then use the estimated sensor locations to estimate the new object locations.
¹This is not covered in the class, but you can find the related information at https://www.wikiwand.com/en/Gauss-Newton_algorithm
• An Oracle Model: This is the same as the generative model except that we will use the ground-truth sensor locations rather than the estimated sensor locations.
• A Linear Model. Using the training set, the linear model attempts to fit (Xi, Yi) directly from the distance measurements (Di1, . . . , Di7). Then it predicts (Xi′ , Yi′ ) from (Di′1, . . . , Di′7) using the map that it found during training. (It never tries to explicitly model the underlying sensor locations.)
• A Second-Order Polynomial Regression Model. The set-up is similar to the linear model, but including second-order polynomial features.
• A Third-Order Polynomial Regression Model. The set-up is similar to the linear model, but including third-order polynomial features.
• A Neural Network Model. The Neural Network should have two hidden layers, each with 100 neurons, and use ReLU and/or tanh for the non-linearity. (You are encouraged to explore on your own beyond this, however. These parameters were chosen to teach you a hype-deflating lesson.) The neural net approach also follows the principle of finding/learning a direct connection between the distance measurements and the object location.
(a) Implement the last four models listed above in part A of the Jupyter notebook. Starter code has been provided for data generation and visualization to aid your explorations. We provide a simple gradient descent framework for you to implement the neural network, but you are also free to use the TensorFlow and PyTorch code from your previous homework and other third-party libraries.
Solution: See the code that we released.
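For reference, here is a minimal sketch of the linear model on synthetic stand-in data (the variable names and data generation below are assumptions for illustration; the released notebook differs in its details):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: 200 object locations and 7 sensor locations in the plane.
P_train = rng.uniform(-10, 10, size=(200, 2))
sensors = rng.uniform(-10, 10, size=(7, 2))
# Noisy distances from each object to each sensor, as in the problem model.
D_train = np.linalg.norm(P_train[:, None, :] - sensors[None, :, :], axis=2)
D_train += rng.normal(size=D_train.shape)

# Least-squares linear map from the 7 distances (plus a bias feature) to the 2 coordinates.
D_aug = np.hstack([D_train, np.ones((len(D_train), 1))])
W, *_ = np.linalg.lstsq(D_aug, P_train, rcond=None)
P_hat = D_aug @ W
print(np.mean(np.linalg.norm(P_hat - P_train, axis=1)))  # average distance error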
(b) In the following parts, we will deal with two types of data, the “regular” data set, and the “shifted” data set. The “regular” data set has the same distribution as the training data set, while the “shifted” data set has a different distribution. Run part B of the notebook to visualize the sensor location, the distribution of the “regular” data set, and the distribution of the “shifted” data set. Attach the plot.
Solution:
See the following figure.
(c) The starter code generated a set of 7 sensors and the following data sets:
• 15 training sets where ntrain varies from 10 to 290 in increments of 20.
• A “regular” testing data set where ntest = 1000.
• A "shifted" testing data set where ntest = 1000. You can generate this by setting original_dist to False in the function generate_data in starter.py.
Use part C of the notebook to train each of the seven models on each of the fifteen training sets. Use your results to generate three figures (est. run time: 30 min). Each figure should include all of the models on the same plot so that you can compare them:
• A plot of training error versus ntrain (the amount of data used to train the model) for all of the models.
• A plot of testing error on the “regular” test set versus ntrain (the amount of data used to train the model) for all of the models.
• A plot of testing error on the “shifted” test set versus ntrain (the amount of data used to train the model) for all of the models.
Briefly describe your observations. What are the strengths and weaknesses of each model? How do they compare to the predict-zero baseline?
Solution: Plots varying the number of data points ntrain from 10 to 290 by 20 are shown. We observe that the generative model achieves the best performance as expected. Why? When data is drawn from a different distribution, we see that although all the models are performing worse, the generative model is still generalizing at some reasonable level. The rest of the models are behaving quite poorly indeed. This is because extrapolation is fundamentally challenging for
any universal-function-approximation type model. The second-order polynomial has the second-best performance. The third-order polynomial performs well when the data is generated from the same distribution, followed by the neural networks. The linear model fails to perform well even with ample training data. On the "shifted" test data the neural net and linear model perform equivalently to always guessing (0, 0). The third-order polynomial performs even worse than that.
(d) We are now going to do some hyper-parameter tuning on our neural network. Fix the number of hidden layers to be two and let l be the number of neurons in each of these two layers. Try values for l between 100 and 500 in increments of 50. Use data sets with ntrain = 200 and ntest = 1, 000. What is the best choice for l (the number of neurons in the hidden layers)? Justify your answer with plots. The starter code is in part D of the notebook.
Solution: The best performance is achieved when the number of neurons is in the range 300–500. (Any reasonable answer that is supported with a plot is acceptable.)
(e) We are going to do some more hyper-parameter tuning on our neural network. Let k be the number of hidden layers and let l be the number of neurons in each hidden layer. Write a formula for the total number of weights in our network in terms of l and k. If we want to keep the total number of weights in the network approximately equal to 10000, find a formula for l in terms of k. Try values of k between 1 and 4 with the appropriate implied choice for l. Use data sets with ntrain = 200 and ntest = 1000. What is the best choice for k (the number of layers)? Justify your answer with plots. The starter code is in part E of the notebook.
Solution: For the number of weights $n_{\text{weights}}$ at given $k$ and $l$, it is enough to use the approximate formula $n_{\text{weights}} \approx (k-1)l^2$ for $k > 1$ (each pair of adjacent hidden layers contributes $l^2$ weights). We therefore get $l \approx 100/\sqrt{k-1}$. We get the following numbers: $k = 1$: $l \approx 1100$ (here the interior term vanishes, so the input and output connections dominate), $k = 2$: $l \approx 100$, $k = 3$: $l \approx 70$, and $k = 4$: $l \approx 50$.
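A quick sketch of this calculation, including the input and output connections that the approximation above drops (input dimension 7 and output dimension 2 as in this problem; biases ignored):

import numpy as np

def width_for_budget(k, budget=10000, d_in=7, d_out=2):
    # Total weights for k hidden layers of width l:
    #   d_in*l (input) + (k-1)*l^2 (interior) + l*d_out (output).
    # Solve (k-1)*l^2 + (d_in + d_out)*l - budget = 0 for l.
    a, b, c = k - 1, d_in + d_out, -budget
    if a == 0:
        return -c / b
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)

for k in range(1, 5):
    print(k, round(width_for_budget(k)))  # roughly 1111, 96, 68, 56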
The best performance is achieved with 3 layers when using tanh and with 2 layers when using ReLU. (Any reasonable answer that is supported with a plot is acceptable.)
(f) You might have seen that the neural network performance is disappointing compared to the generative model on the "shifted" data. Try increasing the number of training data and tuning the hyper-parameters. Can you get it to generalize to the "shifted" test data? Attach the "number of training data vs. accuracy" plot to justify your conclusion. What is the intuition for how the neural network makes its predictions from D? The starter kit is provided in part F of the notebook.
Solution:
Figure 1: Generalization of Neural Network
No, increasing the training data does not work well, as we see in Figure 1. The reason is that the data are generated from a different distribution and the neural network cannot transfer its performance to it, as mentioned above. Performance of the tuned neural network on data generated from a different distribution can be seen in the figures above. The neural network
works like using a nonlinear metric to interpolate the test data using the training samples. When the distribution of testing data is not similar to the training data, it is hard for the neural network to figure it out without the prior knowledge of the underlying model.
4 MLE of Multivariate Gaussian
In lecture, we discussed uses of the multivariate Gaussian distribution. We just assumed that we knew the parameters of the distribution (the mean vector μ and covariance matrix Σ). In prac- tice, though, we will often want to estimate μ and Σ from data. (This will come up even beyond regression-type problems: for example, when we want to use Gaussian models for classification problems.) This problem asks you to derive the Maximum Likelihood Estimate for the mean and variance of a multivariate Gaussian distribution.
(a) Let X have a multivariate Gaussian distribution with mean μ ∈ Rd and covariance matrix Σ ∈ Rd×d. Write the log likelihood of drawing the n i.i.d. samples x1, . . . , xn ∈ Rd from X given Σ and μ.
Solution: First, we calculate the likelihood by separating out the product:
$$P(x_1, x_2, \dots, x_n \mid \mu, \Sigma) = \prod_{i=1}^{n} P(x_i \mid \mu, \Sigma) = \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \right\}$$
$$= \frac{1}{(2\pi)^{nd/2} |\Sigma|^{n/2}} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \right\}$$
Then we take the log:
$$\log P(x_1, x_2, \dots, x_n \mid \mu, \Sigma) = -\frac{nd}{2} \log 2\pi - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu).$$
(b) (Optional) Prove that the maximum likelihood estimates of μ and Σ are the sample mean and covariance. For taking derivatives with respect to matrices, you may use any formula in “The Matrix Cookbook” without proof. This is a reasonably involved problem part with lots of steps to get to the answer. We recommend students first do the one-dimensional case and then the two-dimensional case to warm up.
Note: Conventions for gradient and derivative in “The Matrix Cookbook” may vary from the conventions we saw in the discussion.
Solution: To find the MLE of $\mu$ and $\Sigma$, we solve the following problem:
$$\hat{\mu}, \hat{\Sigma} = \arg\max_{\mu, \Sigma} \log P(x_1, x_2, \dots, x_n \mid \mu, \Sigma)$$
$$= \arg\max_{\mu, \Sigma} -\frac{nd}{2} \log 2\pi - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$$
$$= \arg\min_{\mu, \Sigma} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) + n \log |\Sigma|$$
Here, we expected many students would simply take the derivatives and set them to zero to find the optimum solution. If that is what you did, it is fine for now and you can give yourself full credit. However, strictly speaking, that is not a correct argument. This is because finding a zero derivative is just finding a critical point, not necessarily an optimal solution. In general, one needs to find all the critical points and check the "boundary" (which in this case would be points on the boundary of the positive semidefinite cone, since that is the domain of $\log |\Sigma|$). Alternatively, we can leverage what we know about convexity (a good optimization course like EECS 127 is really helpful here in building your intuition about this).
Anyway, since the objective above is not jointly convex, we cannot simply take derivatives and set them to 0! Instead, we decompose the minimization into a nested optimization problem:
$$\min_{\mu, \Sigma} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) + n \log |\Sigma| = \min_{\Sigma} \min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) + n \log |\Sigma|$$
The optimization problem has been decomposed into an inner problem that optimizes for μ given a fixed Σ, and an outer problem that optimizes for Σ given the optimal value μˆ. Let’s first solve the inner optimization problem. Given a fixed positive definite Σ (i.e. symmetric with all eigenvalues strictly larger than zero), the objective is transparently convex in μ since it is a classic quadratic.
$$\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) + n \log |\Sigma| = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$$
So we find the MLE of the mean by differentiating with respect to $\mu$. In Discussion 0 we saw that $\frac{\partial}{\partial x} x^\top A x = x^\top (A^\top + A)$. If we denote $v_i = x_i - \mu$, we get that:
$$\frac{\partial (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)}{\partial \mu} = \frac{\partial v_i^\top \Sigma^{-1} v_i}{\partial v_i} \frac{\partial v_i}{\partial \mu} = (x_i - \mu)^\top (\Sigma^{-\top} + \Sigma^{-1}) \cdot (-1) = -2 (x_i - \mu)^\top \Sigma^{-1}$$
Therefore,
$$\frac{\partial}{\partial \mu} \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) = \sum_{i=1}^{n} -2 (x_i - \mu)^\top \Sigma^{-1}$$
We set the expression equal to 0:
$$\sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} = 0 \implies \hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
Notice that the choice of optimizing mean did not depend on the covariance Σ. Having solved the inner optimization problem, we now have that
$$\hat{\Sigma} = \arg\min_{\Sigma} \sum_{i=1}^{n} (x_i - \hat{\mu})^\top \Sigma^{-1} (x_i - \hat{\mu}) + n \log |\Sigma|$$
We find the MLE of the variance by differentiating with respect to $\Sigma$, which requires using Equations 57 and 61 from The Matrix Cookbook. According to Equation 61 we get that:
$$\frac{\partial \left( (x_i - \hat{\mu})^\top \Sigma^{-1} (x_i - \hat{\mu}) \right)}{\partial \Sigma} = -\Sigma^{-1} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top \Sigma^{-1}$$
From Equation 57 we get that:
$$\frac{\partial \log |\Sigma|}{\partial \Sigma} = \Sigma^{-1}$$
Therefore,
$$\frac{\partial}{\partial \Sigma} \left[ \sum_{i=1}^{n} (x_i - \hat{\mu})^\top \Sigma^{-1} (x_i - \hat{\mu}) + n \log |\Sigma| \right] = \sum_{i=1}^{n} -\Sigma^{-1} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top \Sigma^{-1} + n \Sigma^{-1}$$
We set the expression equal to 0:
$$\sum_{i=1}^{n} -\Sigma^{-1} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top \Sigma^{-1} + n \Sigma^{-1} = 0 \implies \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top$$
It remains to argue that the unique stationary point we found for the optimization problem is indeed a minimum. Since there are no discontinuities to check, we do this by checking the boundary conditions for the objective
$$f(\mu, \Sigma) = \sum_{i=1}^{n} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) + n \log |\Sigma|.$$
Consider fixed Σ now. For the objective to be finite, we need all eigenvalues of Σ to be positive (if this is not the case, the Σ−1 term is not defined). In this case, if one component μj → ∞ or μj → −∞, the quadratic term goes to ∞.
On the other hand, now consider μ fixed and we vary Σ. Since we are optimizing over all positive semidefinite Σ, we approach the boundaries of the domain if eigenvalues of Σ are going to 0 or ∞. For an eigenvalue λi going to 0 or λi going to ∞ we have
$$f(\mu, \Sigma) = o\left(\frac{1}{\lambda_i}\right) + o\left(\log(\lambda_i)\right) \to \infty,$$
and therefore the solution we found is indeed a minimum.
(c) Use the following code to sample from a two-dimensional Multivariate Gaussian and plot the samples:
import numpy as np
import matplotlib.pyplot as plt
mu = [15, 5]
sigma = [[20, 0], [0, 10]]
samples = np.random.multivariate_normal(mu, sigma, size=100)
plt.scatter(samples[:, 0], samples[:, 1])
plt.show()
Try the following three values of Σ:
$$\Sigma = \begin{bmatrix} 20 & 0 \\ 0 & 10 \end{bmatrix}; \qquad \Sigma = \begin{bmatrix} 20 & 14 \\ 14 & 10 \end{bmatrix}; \qquad \Sigma = \begin{bmatrix} 20 & -14 \\ -14 & 10 \end{bmatrix}.$$
Calculate the mean and covariance matrix of these distributions from the samples (that is, implement part (b)). Report your results. Include your code in your write-up. Note: you are allowed to use numpy.
Solution: There is no “correct” answer for the sample mean and variance, since it depends on randomness, but hopefully you got values for the MLE mean and variance that were close to the true mean and variance. If they are very far off, this may indicate an error in your code. The following code works by computing the formula above directly (including the matrix multiplication):
sample_mean = np.average(samples, axis=0)
# MLE covariance: average the outer products of the centered samples
# (the 1/n = 0.01 factor for n = 100 samples).
centered = samples - sample_mean
sample_covariance = centered.T @ centered / len(samples)

5 Regularized and Kernel k-Means
Recall that in k-means clustering we attempt to minimize the objective
$$\min_{C_1, C_2, \dots, C_k} \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2, \quad \text{where}$$
$$\mu_i = \arg\min_{\mu_i \in \mathbb{R}^d} \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2 = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \dots, k.$$
The samples are $\{x_1, \dots, x_n\}$, where $x_j \in \mathbb{R}^d$. $C_i$ is the set of sample points assigned to cluster $i$ and $|C_i|$ is its cardinality. Each sample point is assigned to exactly one cluster.
(a) What is the minimum value of the objective when k = n (the number of clusters equals
the number of sample points)?
Solution: The value is 0, as every point can have its own cluster.
(b) (Regularized k-means) Suppose we add a regularization term to the above objective. The objective is now
$$\sum_{i=1}^{k} \left( \lambda \|\mu_i\|_2^2 + \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2 \right).$$
Show that the optimum of
$$\min_{\mu_i \in \mathbb{R}^d} \lambda \|\mu_i\|_2^2 + \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2$$
is obtained at $\mu_i = \frac{1}{|C_i| + \lambda} \sum_{x_j \in C_i} x_j$.

Solution: Consider the function
$$f(\mu_i) = \sum_{x_j \in C_i} \|x_j - \mu_i\|_2^2 + \lambda \|\mu_i\|_2^2.$$
Its gradient with respect to $\mu_i$ is
$$\nabla_{\mu_i} f(\mu_i) = 2 \sum_{x_j \in C_i} (\mu_i - x_j) + 2\lambda \mu_i = 2 \left( (|C_i| + \lambda)\mu_i - \sum_{x_j \in C_i} x_j \right).$$
Setting it to zero, we have $\mu_i = \frac{1}{|C_i| + \lambda} \sum_{x_j \in C_i} x_j$. As the function $f$ is convex, the minimum is obtained at $\mu_i = \frac{1}{|C_i| + \lambda} \sum_{x_j \in C_i} x_j$.
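As a quick numerical sanity check of this closed form (a sketch on synthetic cluster data; the names and values are illustrative only):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_c = rng.normal(size=(8, 2))  # points assigned to one cluster
lam = 3.0

# Closed-form optimum derived above.
closed_form = X_c.sum(axis=0) / (len(X_c) + lam)

# Numerical optimum of the regularized objective.
def objective(mu):
    return lam * np.dot(mu, mu) + ((X_c - mu) ** 2).sum()

numeric = minimize(objective, np.zeros(2)).x
print(np.allclose(closed_form, numeric, atol=1e-5))  # True, up to solver tolerance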
(c) Here is a (sadly "dream world") example where we would want to regularize clusters. Suppose there are $n$ students who live in an $\mathbb{R}^2$ Euclidean world and who wish to share rides efficiently to Berkeley for their in-person final exam in EECS 189/289A. The university permits $k$ shuttles which may be used for shuttling students to the exam location. The students need to figure out $k$ good locations to meet up. The students will then walk to the closest meet-up point and then the shuttles will deliver them to the exam location. Let $x_j$ be the location of student $j$, and let the exam location be at $(0, 0)$. Assume that we can drive as the crow flies, i.e., by taking the shortest path between two points. Write down an appropriate objective function to minimize the total distance that the students and vehicles need to travel. How is this different from the regularized k-means objective above?
(Hint: your result should be similar to the regularized k-means objective.)

Solution: The objective function that minimizes the total distance that the students and vehicles need to travel is
$$\min_{\mu_i \in \mathbb{R}^d} \sum_{i=1}^{k} \left( \|\mu_i\|_2 + \sum_{x_j \in C_i} \|x_j - \mu_i\|_2 \right). \tag{8}$$
It differs from regularized k-means in that there aren’t squares on the norms.
Reference: Cathy Wu, Ece Kamar, Eric Horvitz. Clustering for Set Partitioning with a Case Study in Ridesharing. IEEE Intelligent Transportation Systems Conference (ITSC), 2016.
(d) (Kernel k-means) Suppose we have a dataset $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^l$, that we want to split into $k$ clusters, i.e., finding the best k-means clustering without the regularization. Furthermore, suppose we know a priori that this data is best clustered in an impractically high-dimensional feature space $\mathbb{R}^m$ with an appropriate metric. Fortunately, instead of having to deal with the (implicit) feature map $\phi : \mathbb{R}^l \to \mathbb{R}^m$ and (implicit) distance metric², we have a kernel function $\kappa(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$ that we can compute easily on the raw samples without having to explicitly compute all the features. How should we perform the kernelized counterpart of k-means clustering?
Derive the underlined portion of this algorithm, and show your work in deriving it. The main issue is that although we define the means μi in the usual way, we can’t ever compute φ explicitly because it’s way too big. Therefore, in the step where we determine which cluster each sample point is assigned to, we must use the kernel function κ to obtain the right result. (Review the lecture on kernels if you don’t remember how that’s done.)
Algorithm 1 Kernel k-means
Require: Data matrix $X \in \mathbb{R}^{n \times d}$; number of clusters $K$; kernel function $\kappa(x_1, x_2)$
Ensure: Cluster class class($j$) for each sample $x_j$.
1: function Kernel-k-means($X$, $K$)
2:   Randomly initialize class($j$) to be an integer in $1, 2, \dots, K$ for each $x_j$.
3:   while not converged do
4:     for $i \leftarrow 1$ to $K$ do
5:       Set $C_i = \{ j \in \{1, 2, \dots, n\} : \text{class}(j) = i \}$.
6:     end for
7:     for $j \leftarrow 1$ to $n$ do
8:       Set class($j$) = $\arg\min_k$ ________
9:     end for
10:   end while
11:   Return $C_i$ for $i = 1, 2, \dots, K$.
12: end function
Solution: Given a clustering $C_i$, we define
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} \phi(x)$$
to minimize $\sum_{x \in C_i} \|\phi(x) - \mu_i\|_2^2$. (But we don't explicitly compute $\mu_i$!)
Given this choice of the $\mu_k$'s, the k-means clustering is found by assigning each $x_i$ to the cluster $\arg\min_k f(i, k)$, where
$$f(i, k) = \|\phi(x_i) - \mu_k\|^2$$
$$= \phi(x_i) \cdot \phi(x_i) - 2\,\phi(x_i) \cdot \mu_k + \mu_k \cdot \mu_k$$
$$= \phi(x_i) \cdot \phi(x_i) - \frac{2}{|C_k|} \sum_{x_j \in C_k} \phi(x_i) \cdot \phi(x_j) + \frac{1}{|C_k|^2} \sum_{x_j \in C_k} \sum_{x_l \in C_k} \phi(x_j) \cdot \phi(x_l)$$
$$= \kappa(x_i, x_i) - \frac{2}{|C_k|} \sum_{x_j \in C_k} \kappa(x_i, x_j) + \frac{1}{|C_k|^2} \sum_{x_j, x_l \in C_k} \kappa(x_j, x_l).$$
The first term does not vary with $k$, so we can drop it. Therefore, we write
$$\text{class}(i) = \arg\min_k \; \frac{1}{|C_k|^2} \sum_{x_j, x_l \in C_k} \kappa(x_j, x_l) - \frac{2}{|C_k|} \sum_{x_j \in C_k} \kappa(x_i, x_j). \tag{9}$$

²Just as how the interpretation of kernels in kernelized ridge regression involves an implicit prior/regularizer as well as an implicit feature space, we can think of kernels as generally inducing an implicit distance metric as well. Think of how you would represent the squared distance between two points in terms of pairwise inner products and operations on them.
(e) The expression you derived may have unnecessary terms or redundant kernel computations. Explain how to eliminate them; that is, how to perform the computation quickly without doing irrelevant computations or redoing computations already done.
Solution: First, we dropped the term κ(xi, xi). Second, observe that the first summation above is the same for every sample point xi, so we can compute it once for each cluster per iteration, and use it with every sample point. Third, observe that we need to compute a kernel computa- tion for every pair of sample points (not necessarily distinct), but we only need to do it once at the beginning of the algorithm. This amounts to computing every entry of the kernel matrix, keeping in mind that the kernel function (and the kernel matrix) is symmetric to avoid redun- dant computations. Then the computed kernel values can be used multiple times per iteration, in both the first and second summations.
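To make parts (d) and (e) concrete, here is a minimal sketch of kernel k-means implementing equation (9) with the savings just described (the function name and toy data below are illustrative, not from the released code):

import numpy as np

def kernel_kmeans(K_mat, k, n_iters=100, seed=0):
    # K_mat: precomputed symmetric n x n kernel matrix, computed once up front.
    n = K_mat.shape[0]
    labels = np.random.default_rng(seed).integers(0, k, size=n)
    for _ in range(n_iters):
        scores = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                continue
            # Cluster self-similarity term: identical for every sample point,
            # so compute it once per cluster per iteration.
            intra = K_mat[np.ix_(idx, idx)].sum() / idx.size ** 2
            # Cross term between each sample and the cluster members.
            cross = K_mat[:, idx].sum(axis=1) / idx.size
            scores[:, c] = intra - 2 * cross
        new_labels = scores.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy usage with an RBF kernel (the bandwidth here is an arbitrary choice).
X = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(20, 2)) for m in (0.0, 3.0)])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(kernel_kmeans(np.exp(-0.5 * sq_dists), k=2))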
(f) For this part, we will try to gain more intuition on kernel k-means by running it on real ex- amples and visualizing it. There is no code needed to be written, and you simply have to run the demo to find the “best” kernel and hyperparameters. Of course, you are welcome to add some code if it helps you automate the process. Please refer to the Jupyter notebook for directions and questions you need to answer.
Solution:
(a) Gaussian: There are clearly 5 clusters that can be easily separated. Setting the number of clusters to 5 and choosing the linear kernel works very well.
(b) Circles: There are 2 concentric rings in this case with a little bit of noise. Obviously, we pick the number of clusters to be 2. The RBF kernel with γ = 0.001 works best in this case. A rough rule of thumb is that the linear kernel works fine when the clusters are linearly separable, and the RBF kernel is useful when they are not.
(c) Digits: Visually, the data appear to have 4 clusters, but there are actually 5, with the 2 clusters on the left mixing together. This is an example of how our assumptions and k-means may not align with the structure we wish to discover, which in this case is the class labels. Both the linear kernel and the RBF kernel with γ ≤ 0.01 are acceptable answers. There is no perfect way to separate the 2 mixed clusters.
6 LDA and QDA on MNIST
All parts of this question are to be completed in the associated Jupyter notebook. The goal of this problem is to make you more familiar with the MNIST dataset as well as to help you understand how to use LDA and QDA on such a standard real-world dataset.
(a) The MNIST dataset consists of 28 by 28 grayscale images of digits, where each pixel can take on integer values from 0 to 255 (inclusive). When trying to learn this dataset using QDA, we need to learn the means μi of each class, as well as the covariance matrices Σi of each class. Recall that there are 10 classes: 0 through 9. How many degrees of freedom are there in the QDA model? (i.e. how many different parameters are there to learn)
For LDA, we learn a single covariance matrix Σ, not one for each class. How many degrees of freedom are there in our LDA model?
What does this tell you about the amount of data needed to train a QDA versus an LDA model? In particular, approximately how many images from the MNIST dataset do we need in order to create a QDA classifier? What about an LDA classifier?
(Hint: We don’t expect an exact answer, just an order of magnitude with some reasoning. Think about a linear system with x unknowns. For such a system, think about how many equations we need to solve for all the unknowns uniquely.)
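A back-of-the-envelope sketch of the counting (a sketch, assuming we count one mean per class plus the independent entries of each symmetric covariance matrix):

d, num_classes = 28 * 28, 10
cov_dof = d * (d + 1) // 2              # independent entries of a symmetric d x d matrix
qda_dof = num_classes * (d + cov_dof)   # a mean and a covariance per class
lda_dof = num_classes * d + cov_dof     # per-class means, one shared covariance
print(qda_dof, lda_dof)                 # 3085040 vs. 315560 parameters

By the linear-system heuristic in the hint, the training set size should be at least on the order of these parameter counts, so QDA needs roughly ten times as much data as LDA here.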
(b) Implement the compute_qda_params_single_class and compute_qda_params functions for QDA, and inspect the MLE means of each of the 10 digit classes.
Solution:
# compute_qda_params_single_class: MLE mean and covariance for one class.
mu = np.sum(data, axis=0) / n
centered_data = data - mu
sigma = centered_data.T @ centered_data / n

# compute_qda_params: collect the per-class parameters.
for k in range(num_classes):
    indices = [i for i in range(n) if labels[i] == k]
    params.append(compute_qda_params_single_class(data[indices]))
(c) Implement the classify function to use the learned parameters to classify a vector of test points.
(Hint: look into using the slogdet and einsum functions from NumPy to improve perfor- mance and avoid floating point errors.)
Then run the subsequent cells and observe your classifier’s performance on the training and test sets.
Solution:
densities = []
for k, (mu, sigma) in enumerate(params):
    # Negative log-density up to constants: Mahalanobis term plus log|Sigma|.
    densities.append(
        ((test_points - mu).T * (np.linalg.inv(fuzz(sigma)) @ (test_points - mu).T)).sum(axis=0) +
        np.linalg.slogdet(fuzz(sigma))[1]
    )
labels = np.argmin(np.array(densities), axis=0)
(d) Implement the accuracy_vs_n function and run the cell to view the training and test accuracy of your classifier as we increase the number of training points made available. Comment on what you see in the results.
Solution:
params = compute_params(training_data[:n], training_labels[:n])
training_accuracies.append(compute_accuracy(classify(params, training_data[:n]), training_labels[:n]))
test_accuracies.append(compute_accuracy(classify(params, test_data), test_labels))
(e) Implement the compute_lda_params function to implement LDA, wherein you estimate a single covariance matrix for all the classes instead of a covariance matrix per class. Then run the subsequent cells to observe its performance on the training and test sets, as a function of the number of training points made available. Comment on your results, with reference to your analogous observations for QDA.
Solution:
params = compute_qda_params(data, labels)
sigma = np.zeros((d, d))
for mu_k, sigma_k in params:
    sigma += sigma_k
# Summing rather than averaging rescales the shared sigma by a constant,
# which does not change the argmin since sigma is the same for every class.
for i in range(10):
    params[i] = (params[i][0], sigma)
(f) Run the digit_accuracy_vs_n function for QDA and LDA to see the per-digit accuracy for both classifiers. Notice how the accuracy is not the same for different digits. Comment on your observations.
(g) Modify the fuzz function to treat eps as a tunable hyperparameter. Which choice of eps is best? Remember to construct a validation set to evaluate different hyperparameter choices, so the test data is kept hidden until you evaluate the final model.
Solution:
# Hold out part of the training set for validation.
hyper_tuning_training_data = training_data[:8000]
hyper_tuning_training_labels = training_labels[:8000]
hyper_tuning_validation_data = training_data[8000:]
hyper_tuning_validation_labels = training_labels[8000:]

best_validation, best_eps = 0.0, None
for eps in np.logspace(-8, 2, 15):
    curr_validation = validation_score()
    if curr_validation > best_validation:
        best_validation = curr_validation
        best_eps = eps
    print("eps = {} gives validation accuracy = {}".format(eps, curr_validation))
7 Loss Shaping, Outliers, and Noise
In this problem you will explore how quirks in your training data can cause some of the ML techniques you have already seen, like least-squares regression, to perform poorly. You will then try alternate objective functions and other techniques for shaping your loss landscape to improve your performance. Although this problem is by no means a comprehensive introduction to the many loss shaping techniques used in practice, and doesn’t touch on techniques particular to neural networks at all, we hope that you will start to see the kind of analytical thinking that goes into choosing a loss function and modifying basic ML techniques for real problems.
Unless otherwise noted, each of the problem parts will ask you to write some code in the accom- panying Jupyter notebook, train a model, and think about the results.
(a) Classification: MSE
We begin with a classification problem where our training data has a significant class imbalance relative to the test data. This makes it easy to diagnose when our classifier breaks down, but there are many situations where training and test data might have identical class proportions but other considerations, such as prediction performance per class or equivalence of outcomes, require changes to the classifier to deal with this imbalance.
To begin, train a linear least-squares classifier, report the test accuracy, and visualize the decision boundary. Think about why the classifier failed. It may help to consider training a least-squares classifier on the 1D example below, with the decision function being whether the prediction $\hat{f}(x_{\text{test}}) > 0$. The two classes are $y \in \{-1, 1\}$.
[Figure 2: Example 1D classification problem]
Solution: Our classifier prediction accuracy was 0.5 on the test set. The decision boundary is far outside the region the two classes are concentrated in.
Our code was
w1 = np.linalg.solve(np.dot(X_train_aug.T, X_train_aug), np.dot(X_train_aug.T, Y_train))
resulting in the figures
(a) Train data and decision boundary (b) Test data and decision boundary
The 1D example above should be illustrative for understanding why the MSE error is prob- lematic here. In order to minimize error at the far right of the class 1 samples, the slope of the predictor is very shallow and the x-intercept (where the decision boundary is) is pushed far to the left because the low population class (-1) doesn’t have enough “mass” to pull the boundary in.
(b) Classification: Weighting
One way to overcome the class imbalance is to add copies of the samples belonging to the lower population class to the training set. Show that adding c copies of a sample is equivalent to weighting its error by c in the least-squares regression problem. Then create a new training set with enough copies of the underrepresented class samples to equalize the imbalance and retrain your classifier. Report the test accuracy and visualize the decision boundary.
Solution: We now achieve a prediction accuracy of 0.9707 on the test set. The decision boundary sits just outside the edge of blue class samples in the test set.
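The "c copies equals weight c" claim asked for above is one line: if the sample $(x_i, y_i)$ appears $c$ times in the training set, its total contribution to the least-squares objective is
$$\sum_{m=1}^{c} (w^\top x_i - y_i)^2 = c\,(w^\top x_i - y_i)^2,$$
which is exactly the original residual weighted by $c$.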
Our code was
# Generate new training data with repeated items from the underrepresented class
k_repeat = int(N_1/N_2)
X_2_repeat = np.repeat(X_2, k_repeat, axis=0)
y_2_repeat = -1.0 * np.ones(X_2_repeat.shape[0])
X_train_repeat = np.concatenate((X_1, X_2_repeat), axis=0)
X_train_repeat_aug = np.ones((X_train_repeat.shape[0], X_train_repeat.shape[1]+1))
X_train_repeat_aug[:,:-1] = X_train_repeat
Y_train_repeat = np.concatenate((y_1, y_2_repeat), axis=0)

# Visualize the repeated training dataset
visualize_dataset(X_train_repeat, Y_train_repeat)

# Solve the linear regression on the training data
w1 = np.linalg.solve(np.dot(X_train_repeat_aug.T, X_train_repeat_aug), np.dot(X_train_repeat_aug.T, Y_train_repeat))

# Report the classification accuracy on the test set
ypred = -1.0*(X_test_aug.dot(w1) < 0.0) + 1.0*(X_test_aug.dot(w1) > 0.0)
print("prediction accuracy on the test set is ", 1.0*sum(ypred == Y_test) / N_test)

# Visualize the decision boundary
y0 = classifier_linear(X_vis_aug, w1)
visualize_classifier(y0)
resulting in the figures
(c) Train data and decision boundary with copies (d) Test data and decision boundary with copies
(c) Classification: Logistic Regression
An alternative to modifying the training data is to use logistic regression, where we model the probability that a data point corresponds to class 1:
$$h_\theta(x_i) = P(y_i = 1 \mid x_i, \theta) = \frac{1}{1 + e^{-x_i^\top \theta}}.$$
This technique is called logistic regression because we are learning a regression model with the probability of membership in a class as the target, rather than a {1, −1} proxy for class membership. When we need a hard decision we simply choose the class with the highest probability.
$\theta$ is typically found using maximum likelihood estimation with the log-likelihood
$$\max_\theta \log L(\theta \mid X, y) = \max_\theta \log \prod_{i=1}^{N} P(y_i \mid x_i, \theta)$$
$$= \max_\theta \log \prod_{i=1}^{N} h_\theta(x_i)^{y_i} \left(1 - h_\theta(x_i)\right)^{(1 - y_i)}$$
$$= \max_\theta \sum_{i=1}^{N} y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right).$$
Here we use y ∈ {0, 1}N rather than {−1, 1}N as with the MSE loss. This loss function may be familiar to some of you as the cross-entropy loss. Here we are actually maximizing the log-likelihood, but we could equivalently minimize the negative log-likelihood.
Use the LogisticRegression class from sklearn to train a model with the unbalanced training set, then report the test accuracy and visualize the decision boundary. Then do the same with your copy-balanced training set. Explain which loss function, between logistic loss and MSE loss, is more sensitive to class imbalance and why.
Solution: Our model achieves an accuracy of 0.934 on the unbalanced training set, compared to 0.5 for an MSE loss. The decision boundary lies on the edge of the red class’s data, bisecting the blue class.
It achieves an accuracy of 0.9875 on the copy-augmented training set, compared to 0.9707 for an MSE loss. The decision boundary splits the overlapping area of the two classes, misclassi- fying both with approximately equal probability.
Logistic regression is much less sensitive to class imbalance. Because its outputs are restricted to (0, 1) it won’t penalize a sample at the far edge of a class’s distribution from the decision boundary a much greater amount than a sample near the boundary, unlike MSE.
Our code was
classifier_logistic = LogisticRegression(penalty='none')
classifier_logistic.fit(X_train_aug, Y_train)
# Test
ypred = classifier_logistic.predict(X_test_aug)
print("prediction accuracy on the test set is ", 1.0*sum(ypred == Y_test) / N_test)
# Visualize
y0 = classifier_logistic.predict(X_vis_aug)
visualize_classifier(y0)
resulting in the figures
(e) Train data and decision boundary (f) Test data and decision boundary
and with copies
classifier_logistic = LogisticRegression(penalty='none')
classifier_logistic.fit(X_train_repeat_aug, Y_train_repeat)
ypred = classifier_logistic.predict(X_test_aug)
print("prediction accuracy on the test set is ", 1.0*sum(ypred == Y_test) / N_test)
# Visualize
y0 = classifier_logistic.predict(X_vis_aug)
visualize_classifier(y0)
resulting in the figures
(g) Train data and decision boundary (h) Test data and decision boundary
(d) Regression: Outliers
Now we will leave classification behind and explore some of the behaviors possible in regres- sion. We will start with a linear model and training data corrupted by outliers from another distribution:
$$f(x) = ax + b$$
$$y = \delta \left( f(x) + \varepsilon \right), \quad \varepsilon \sim N(0, \sigma^2)$$
$$\delta = \begin{cases} 1 & \text{w.p. } p \\ -5 & \text{w.p. } 1 - p \end{cases}$$
In the notebook you will fit this data using three loss functions: L1, L2, and Huber. You should be familiar with the L2 loss from least-squares regression. You have seen an L1 penalty before in LASSO regression, but it is also possible to train a model with an L1 penalty on the errors. The Huber loss is a hybrid of L1 and L2 loss functions. It acts like the L2 loss for ‘small’ errors less than some threshold, then switches to the L1 loss for errors above the threshold. (This should remind you of the relationship between logistic loss and the exponential loss that you saw all the way back in HW0.) This allows the Huber loss to provide a good estimate of the mean while limiting the influence of outliers without totally discounting them. Compare this to our use of OMP from a previous homework, where we used it to remove the influence of outliers entirely by a kind of feature augmentation. Anyway, the Huber loss is
$$L_{\text{Huber}}(f(x), y, \delta) = \begin{cases} \frac{1}{2} |f(x) - y|^2, & |f(x) - y| \le \delta \\ \delta\,|f(x) - y| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$
Train a model on the data set with outliers using each of the three loss functions. Report the test MSE against the true function, and comment on the effect of the outliers and estimate of the true coefficients in each case.
Solution: Although we provided a random seed, if you run certain cells multiple times (like SGDRegressor) you will end up with a different random state than we had, so your numerical results may differ slightly from ours.
L2 Loss: The test error was 0.6. The outliers had a strong impact on the learned function, pulling the slope down significantly.
# rr is a Ridge (L2) regressor constructed earlier in the notebook.
rr.fit(X, y)
rr_coef = rr.coef_
print(rr_coef)
plot_regression(X, y, w_true, rr, X_test, 'L2 Loss')
Figure 3: L2 loss regression line
L1 Loss: The effect of the outliers was almost eliminated, although the slope was still slightly lower than the true function. The test error was 0.015.
sgdr = SGDRegressor(loss='epsilon_insensitive', penalty='l2', alpha=0.0001, epsilon=0.0, fit_intercept=False)
sgdr.fit(X, y)
sgd_coef = sgdr.coef_
print(sgd_coef)
plot_regression(X, y, w_true, sgdr, X_test, 'L1 Loss')

Figure 4: L1 loss regression line

Huber Loss: The effect of the outliers was essentially eliminated, and the L2 portion of the loss function helped us estimate the true coefficients almost exactly. The test error was 0.002.
hr = HuberRegressor(epsilon=1.2, alpha=0.0001, fit_intercept=False)
hr.fit(X, y)
h_coef = hr.coef_
print(h_coef)
plot_regression(X, y, w_true, hr, X_test, 'Huber Loss')
Figure 5: Huber loss regression line
(e) Regression: Multiplicative Input Noise
Now we have an underlying exponential model, but instead of additive output observation noise like you are used to, the noise is multiplicative and applied to the input. In other words, instead of corrupted y values we observe corrupted x values. The ys we observe actually came from different x values than we see in the training data.
$$y = e^{zx}, \quad z \sim U[1 - \varepsilon, 1 + \varepsilon]$$
We will attempt to learn this function with a polynomial model of degree p.
$$\hat{f}(x) = w_0 + \sum_{i=1}^{p} w_i x^i$$
As you may recall, the Taylor expansion of $e^x$ is $1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$, so we can approximate the true model arbitrarily well with large enough degree $p$.
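For concreteness, here is a minimal sketch of how the degree-$p$ feature matrix fed to the regressors below might be built (the released notebook's featurization may differ):

import numpy as np

p = 5
x = np.linspace(0.0, 2.0, 50)        # raw 1D inputs
X = x[:, None] ** np.arange(p + 1)   # columns: x^0 (bias), x^1, ..., x^p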
Train regressors with L1, L2, and Huber loss functions and report the test error. Com- ment on the suitability of each loss function in this situation.
Solution: L2 Loss: The test error was 67 with α = 0.01. This loss function performs the best of the three, but is still multiple times worse than the best possible regressor.
rr = Ridge(alpha=.01, fit_intercept=False)
rr.fit(X, y)
rr_coef = rr.coef_
print(rr_coef)
plot_regression(X, y, w_true, rr, X_test, 'L2 Loss')
Figure 6: L2 loss regression line
L1 Loss: The test error was $2 \times 10^{21}$. This loss is useless for data like this.
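(The code block for this regressor did not survive in this document; by analogy with part (d), it would look like the following sketch, with the alpha value an assumption.)
sgdr = SGDRegressor(loss='epsilon_insensitive', penalty='l2', alpha=0.01, epsilon=0.0, fit_intercept=False)
sgdr.fit(X, y)
sgd_coef = sgdr.coef_
print(sgd_coef)
plot_regression(X, y, w_true, sgdr, X_test, 'L1 Loss')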
Figure 7: L1 loss regression line
Huber Loss: The test error was 630 with $\varepsilon = 5$, $\alpha = 0.01$. If you try changing $p$ and the number of samples, you will see that the learned function and MSE are very sensitive to changing conditions, relative to the simple L2 loss.
hr = HuberRegressor(epsilon=5, alpha=0.01, fit_intercept=False)
hr.fit(X, y)
h_coef = hr.coef_
print(h_coef)
plot_regression(X, y, w_true, hr, X_test, 'Huber Loss')
Figure 8: Huber loss regression line
(f) Regression: Transformed Target
Since the function we want to learn is exponential, a natural experiment to try is to regress against a log-transform of the y data. This will result in a linear relationship with x and reduce the range of noise variances. Use sklearn’s TransformedTargetRegressor class to train a ridge regression model on log-transformed y values. Report the test error and comment on the result.
Solution: Our test error was 41, significantly lower than we achieved with ridge regression on untransformed targets.
You have seen a technique similar to this with lifting, where we transform input data to make our regression or classification problem easier (or even just solvable). This is applying the same kind of thinking to the regression target instead. Computer vision is one domain that commonly employs input or output transformations to make tasks easier to learn.
lrr = TransformedTargetRegressor(regressor=Ridge(alpha=1.0, fit_intercept=False),
                                 func=np.log,
                                 inverse_func=np.exp)
lrr.fit(X, y)
plot_regression(X, y, w_true, lrr, X_test, 'Log Relative Loss')
Figure 9: Regression line with log-transformed targets
Contributors:
• Alvin Wan
• Anant Sahai
• Ashwin Pananjady
• Cathy Wu
• Chawin Sitawarin
• Inigo Incer
• Jianbo Chen
• Jonathan Shewchuk
• Josh Sanz
• Marc Khoury
• Mark Velednitsky
• Michael MacDonald
• Philipp Moritz
• Rahul Arya
• Yichao Zhou