CS 189 Introduction to Machine Learning
Spring 2020 Final Exam
• The exam is open book, open notes, and open web. However, you may not consult or communicate with other people (besides your exam proctors).
• You will submit your answers to the multiple-choice questions through Gradescope via the assignment “Final Exam – Multiple Choice”; please do not submit your multiple-choice answers on paper. By contrast, you will submit your answers to the written questions by writing them on paper by hand, scanning them, and submitting them through Gradescope via the assignment “Final Exam – Writeup.”
• Please write your name at the top of each page of your written answers. (You may do this before the exam.)
• You have 180 minutes to complete the final exam (3:00–6:00 PM). (If you are in the DSP program and have an allowance of 150% or 200% time, that comes to 270 minutes or 360 minutes, respectively.)
• When the exam ends (6:00 PM), stop writing. You must submit your multiple-choice answers before 6:00 PM sharp. Late multiple-choice submissions will be penalized at a rate of 5 points per minute after 6:00 PM. (The multiple-choice questions are worth 64 points total.)
• From 6:00 PM, you have 15 minutes to scan the written portion of your exam and turn it in to Gradescope via the assignment “Final Exam – Writeup.” Most of you will use your cellphone and a third-party scanning app. If you have a physical scanner, you may use that. Late written submissions will be penalized at a rate of 5 points per minute after 6:15 PM.
• Mark your answers to multiple-choice questions directly into Gradescope. Write your answers to written questions on blank paper. Clearly label all written questions and all subparts of each written question. Show your work in written questions.
• Following the exam, you must use Gradescope’s page selection mechanism to mark which questions are on which pages of your exam (as you do for the homeworks).
• The total number of points is 150. There are 16 multiple choice questions worth 4 points each, and six written questions worth 86 points total.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.
Q1. [64 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.
(1) [4 pts] Which of the following are true for the k-nearest neighbor (k-NN) algorithm?
A: k-NN can be used for both classification and regression.
B: As k increases, the bias usually increases.
⃝ C: The decision boundary looks smoother with smaller values of k.
⃝ D: As k increases, the variance usually increases.
(2) [4 pts] Let X be a matrix with singular value decomposition X = UΣV⊤. Which of the following are true for all X?
A: rank(X) = rank(Σ).
⃝ B: If all the singular values are unique, then the SVD is unique.
C: The first column of V is an eigenvector of X⊤X.
D: The singular values and the eigenvalues of X⊤X are the same.
A is correct because the number of nonzero singular values equals the rank. B is incorrect because you could change both U → −U and V → −V. C is correct because X⊤X = VΣ²V⊤, which is an eigendecomposition of X⊤X. D is correct because X⊤X is positive semidefinite, so its eigenvalues are nonnegative and coincide with its singular values.
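For a quick numerical check of options A and C, here is a minimal numpy sketch (the matrix below is an arbitrary rank-2 example, not taken from the exam):

    import numpy as np

    # Arbitrary rank-2 matrix used only to sanity-check the claims.
    X = np.array([[1., 2., 3.],
                  [2., 4., 6.],
                  [0., 1., 1.]])

    U, s, Vt = np.linalg.svd(X)

    # Option A: rank(X) equals the number of nonzero singular values.
    print(np.linalg.matrix_rank(X), np.sum(s > 1e-10))

    # Option C: the first column of V is an eigenvector of X^T X,
    # with eigenvalue equal to the largest singular value squared.
    v1 = Vt[0]
    print(np.allclose(X.T @ X @ v1, (s[0] ** 2) * v1))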
(3) [4 pts] Lasso (with a fictitious dimension), random forests, and principal component analysis (PCA) all . . .
A: can be used for dimensionality reduction or feature subset selection
⃝ B: compute linear transformations of the input features
⃝ C: are supervised learning techniques
D: are translation invariant: changing the origin of the coordinate system (i.e., translating all the training and test data together) does not change the predictions or the principal component directions
Option B is incorrect because random forests don’t compute linear transformations. Option C is incorrect because PCA is unsupervised.
(4) [4 pts] Suppose your training set for two-class classification in one dimension (d = 1; xi ∈ R) contains three sample points: point x1 = 3 with label y1 = 1, point x2 = 1 with label y2 = 1, and point x3 = −1 with label y3 = −1. What are the values of w and b given by a hard-margin SVM?
⃝ A: w = 1, b = 1
⃝ B: w = 0, b = 1
C: w = 1, b = 0
⃝ D: w = ∞, b = 0
(5) [4 pts] Use the same training set as question (4). What is the value of w and b given by logistic regression (with no regularization)?
⃝ A: w = 1, b = 1
⃝ B: w = 0, b = 1
⃝ C: w = 1, b = 0
D: w = ∞, b = 0
(6) [4 pts] Below are some choices you might make while training a neural network. Select all of the options that will generally make it more difficult for your network to achieve high accuracy on the test data.
A: Initializing the weights to all zeros
B: Normalizing the training data but leaving the test data unchanged
⃝ C: Using momentum
⃝ D: Reshuffling the training data at the beginning of each epoch
A) Initializing the weights to all zeros makes it impossible to learn: identically initialized units receive identical gradient updates, so the symmetry is never broken. B) The mean and standard deviation should be computed on the training set and then used to standardize the validation and test sets, so that the distributions are matched for each set; normalizing only the training data creates a train/test mismatch. C) Momentum will generally help training, not hurt it. D) Reshuffling the training data at the beginning of each epoch is standard best practice.
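A minimal numpy sketch of the correct standardization procedure described in B (synthetic data; variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
    X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

    # Compute the standardization statistics on the training set only...
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)

    # ...and apply the SAME transform to both sets, so their distributions match.
    X_train_std = (X_train - mu) / sigma
    X_test_std = (X_test - mu) / sigma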
(7) [4 pts] To the left of each graph below is a number. Select the choices for which the number is the multiplicity of the eigenvalue zero in the Laplacian matrix of the graph.
[Graphs omitted.]
B: 1
⃝ D: 4
The multiplicity of the eigenvalue zero is equal to the number of connected components in the graph.
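This fact is easy to check numerically. A minimal numpy sketch on an arbitrary small two-component graph (not one of the exam’s graphs):

    import numpy as np

    # Adjacency matrix of a graph with two connected components:
    # vertices {0, 1, 2} form a path and {3, 4} form a single edge.
    A = np.array([[0, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)

    L = np.diag(A.sum(axis=1)) - A          # unnormalized graph Laplacian
    eigvals = np.linalg.eigvalsh(L)         # L is symmetric positive semidefinite
    print(np.sum(np.abs(eigvals) < 1e-10))  # prints 2 = number of connected components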
(8) [4 pts] Given the spectral graph clustering optimization problem
Find y that minimizes y⊤Ly
subject to y⊤y = n and 1⊤y = 0,
which of the following optimization problems produce a vector y that leads to the same sweep cut as the optimization problem above? M is a diagonal mass matrix with different masses on the diagonal.
A: Minimize y⊤Ly subject to y⊤y = 1 and 1⊤y = 0
⃝ B: Minimize y⊤Ly subject to ∀i, yi = 1 or yi = −1
C: Minimize y⊤Ly/(y⊤y) subject to 1⊤y = 0
⃝ D: Minimize y⊤Ly subject to y⊤My = 1 and 1⊤My = 0
(9) [4 pts] Which of the following methods will cluster the data in panel (a) of the figure below into the two clusters (red circle and blue horizontal line) shown in panel (b)? Every dot in the circle and the line is a data point. In all the options that involve hierarchical clustering, the algorithm is run until we obtain two clusters.
[Figure: (a) Unclustered data; (b) Desired clustering.]
⃝ A: Hierarchical agglomerative clustering with Euclidean distance and complete linkage
B: Hierarchical agglomerative clustering with Euclidean distance and single linkage
⃝ C: Hierarchical agglomerative clustering with Euclidean distance and centroid linkage
⃝ D: k-means clustering with k = 2
Single linkage uses the minimum distance between two clusters as a metric for merging clusters. Since the two clusters are densely packed with points and the minimum distance between the two clusters is greater than the within-cluster distances between points, single linkage doesn’t link the circle to the line until the very end.
The other three methods will all join some of the points at the left end of the line with the circle, before they are joined with the right end of the line.
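The single-linkage behavior is easy to reproduce with scipy on synthetic stand-in data (a dense circle next to a dense horizontal line; the exam’s actual figure is not reproduced here, so the geometry is only an approximation):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Synthetic stand-in for panel (a): a dense circle and a dense horizontal line to its right.
    theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    circle = np.c_[np.cos(theta), np.sin(theta)]
    line = np.c_[np.linspace(2.0, 8.0, 200), np.zeros(200)]
    X = np.vstack([circle, line])

    for method in ["single", "complete", "centroid"]:
        Z = linkage(X, method=method)
        labels = fcluster(Z, t=2, criterion="maxclust")
        # With single linkage, the first 200 points (the circle) all share one label and
        # the last 200 (the line) share the other; the other linkages need not split it that way.
        print(method, set(labels[:200]), set(labels[200:]))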
(10) [4 pts] Which of the following statement(s) about kernels are true?
A: The dimension of the lifted feature vectors Φ(·), whose inner products the kernel function computes, can be infinite.
⃝ B: For any desired lifting Φ(x), we can design a kernel function k(x, z) that will evaluate Φ(x)⊤Φ(z) more quickly than explicitly computing Φ(x) and Φ(z).
C: The kernel trick, when it is applicable, speeds up a learning algorithm if the number of sample points is substantially less than the dimension of the (lifted) feature space.
D: If the raw feature vectors x, y are of dimension 2, then k(x, y) = x₁²y₁² + x₂y₂ is a valid kernel.
A is correct; consider the Gaussian kernel from lecture. B is wrong; most liftings don’t lead to super-fast kernels. Just some special ones do. C is correct, straight from lecture. Though in this case, the dual algorithm is faster than the primal whether you use a fancy kernel or not. D is correct because k(x, y) is the inner product of Φ(x) = [x₁², x₂]⊤ and Φ(y) = [y₁², y₂]⊤.
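A quick numerical check of option D, using the lifting Φ named in the solution (a minimal sketch with random inputs):

    import numpy as np

    def k(x, y):
        # The kernel from option D: k(x, y) = x1^2 y1^2 + x2 y2
        return x[0]**2 * y[0]**2 + x[1] * y[1]

    def phi(x):
        # The corresponding lifting: Phi(x) = [x1^2, x2]
        return np.array([x[0]**2, x[1]])

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=2), rng.normal(size=2)
    print(np.isclose(k(x, y), phi(x) @ phi(y)))   # True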
(11) [4 pts] We want to use a decision tree to classify the training points depicted. Which of the following decision tree classifiers is capable of giving 100% accuracy on the training data with four splits or fewer?
[Figure of the training points omitted.]
⃝ A: A standard decision tree with axis-aligned splits
⃝ B: Using PCA to reduce the training data to one dimension, then applying a standard decision tree
C: A decision tree with multivariate linear splits
D: Appending a new feature |x1| + |x2| to each sample point x, then applying a standard decision tree
A standard decision tree will need (substantially) more than four splits. PCA to 1D will make it even harder. However, four non-axis-aligned multivariate linear splits suffice to cut the diamond out of the center. Finally, adding the L1 norm feature lets us perfectly classify the data with a single split that cuts off the top of the pyramid.
(12) [4 pts] Which of the following are true about principal components analysis (PCA)?
⃝ A: The principal components are eigenvectors of the centered data matrix.
B: The principal components are right singular vectors of the centered data matrix.
C: The principal components are eigenvectors of the sample covariance matrix.
D: The principal components are right singular vectors of the sample covariance matrix.
B and C follow directly from the definitions: the principal components are the eigenvectors of the sample covariance matrix, which are the right singular vectors of the centered data matrix. A is incorrect because the centered data matrix is generally not square, so it does not have eigenvectors. D holds because the covariance matrix is symmetric positive semidefinite, so its right singular vectors are its eigenvectors.
(13) [4 pts] Suppose we are doing ordinary least-squares linear regression with a fictitious dimension. Which of the following changes can never make the cost function’s value on the training data smaller?
A: Discard the fictitious dimension (i.e., don’t append a 1 to every sample point).
⃝ B: Append quadratic features to each sample point.
C: Project the sample points onto a lower-dimensional subspace with PCA (without changing the labels) and perform regression on the projected points.
D: Center the design matrix (so each feature has mean zero).
A: Correct. Discarding the fictitious dimension forces the linear regression function to be zero at the origin, which may increase the cost function but can never decrease it.
B: Incorrect. Added quadratic features often help to fit the data better.
C: Correct. Regular OLS is at least as expressive. Projecting the points may increase the cost function but can never decrease it. Centering the features doesn’t matter, so WLOG assume X has centered features. If the full SVD is X = UΣV⊤, then projecting onto the top k principal directions gives Xk = UΣkV⊤, where Σk zeroes out all but the k largest singular values. If wk is a solution for the PCA-projected OLS, we can take w = Vz, where z agrees with V⊤wk in its first k entries and is zero elsewhere, and get Xw = Xkwk.
D: Correct. Since we’re using a fictitious dimension, translating the points does not affect the cost of the optimal regression function (which translates with the points).
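For option B, a minimal numpy sketch (arbitrary synthetic data) illustrating that appending features can only lower the training cost, because the smaller model is nested inside the larger one:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 30
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(x) + 0.1 * rng.normal(size=n)       # arbitrary nonlinear data

    A1 = np.c_[x, np.ones(n)]                      # linear feature + fictitious dimension
    A2 = np.c_[x, x**2, np.ones(n)]                # same, with an appended quadratic feature

    for A in (A1, A2):
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        print(np.sum((A @ w - y) ** 2))            # training SSE; never larger for A2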
(14) [4 pts] Which of the following are true about principal components analysis (PCA)? Assume that no two eigenvectors of the sample covariance matrix have the same eigenvalue.
A: Appending a 1 to the end of every sample point doesn’t change the results of performing PCA (except that the useful principal component vectors have an extra 0 at the end, and there’s one extra useless component with eigenvalue zero).
B: If you use PCA to project d-dimensional points down to j principal coordinates, and then you run PCA again to project those j-dimensional coordinates down to k principal coordinates, with d > j > k, you always get the same result as if you had just used PCA to project the d-dimensional points directly down to k principal coordinates.
⃝ C: If you perform an arbitrary rigid rotation of the sample points as a group in feature space before performing PCA, the principal component directions do not change.
D: If you perform an arbitrary rigid rotation of the sample points as a group in feature space before performing PCA, the largest eigenvalue of the sample covariance matrix does not change.
Appending an extra dimension with the same value for every point introduces no variance in that dimension, so PCA will ignore it. PCA discards the eigenvector directions associated with the smallest eigenvalues; as the eigenvectors are mutually orthogonal, this does not affect the variance in the surviving dimensions, so your results depend solely on how many directions you discard. Rotating the sample points rotates the principal components, but it doesn’t change the variance along each of those (rotated) component directions.
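A quick numpy check of option D, using an arbitrary synthetic data set and rotation angle:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])  # anisotropic sample

    theta = 0.7                                   # arbitrary rotation angle
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X_rot = X @ R.T                               # rigidly rotate all sample points together

    cov = np.cov(X, rowvar=False)
    cov_rot = np.cov(X_rot, rowvar=False)

    # The eigenvalues are unchanged by the rotation; the eigenvectors rotate with the data.
    print(np.allclose(np.sort(np.linalg.eigvalsh(cov)),
                      np.sort(np.linalg.eigvalsh(cov_rot))))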
(15) [4 pts] Consider running a single iteration of AdaBoost on three sample points, starting with uniform weights on the sample points. All the ground truth labels and predictions are either +1 or −1. In the table below, some values have been omitted. Which of the following statements can we say with certainty?
        True Label   Classifier Prediction   Initial Weight   Updated Weight
X1      −1           −1                      1/3              ?
X2      ?            +1                      1/3              √2/3
X3      ?            ?                       1/3              √2/6
A: X1’s updated weight is √2/6
⃝ B: X3’s classifier prediction is −1
C: X2 is misclassified
⃝ D: X3 is misclassified
In the AdaBoost algorithm, all correctly classified points have their weights changed by the same multiplicative factor. Since we observe two different updated weights, we know one of X2 or X3 is correctly classified and the other is misclassified. Since X1 is correctly classified, the error rate is err = 1/3. As the error rate is less than 1/2, the weights of correctly classified points will decrease and the weights of misclassified points will increase. Hence, X2 is misclassified and X3 is correctly classified. As X1 is correctly classified, it has the same updated weight as X3. But we can’t tell what X3’s classifier prediction is; only that it is correctly classified.
As an aside, we can confirm the multipliers used for reweighting misclassified and correctly classified points (in that order):
√((1 − err)/err) = √((2/3)/(1/3)) = √2,  and  √(err/(1 − err)) = √((1/3)/(2/3)) = √2/2.
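A minimal numpy sketch reproducing these updated weights, using the correct/misclassified assignment deduced above:

    import numpy as np

    w = np.full(3, 1/3)                        # initial uniform weights
    correct = np.array([True, False, True])    # X1 and X3 correct, X2 misclassified

    err = np.sum(w[~correct])                  # weighted error rate = 1/3
    w_new = np.where(correct,
                     w * np.sqrt(err / (1 - err)),     # correct: multiply by sqrt(2)/2
                     w * np.sqrt((1 - err) / err))     # misclassified: multiply by sqrt(2)
    print(w_new)   # [sqrt(2)/6, sqrt(2)/3, sqrt(2)/6] ≈ [0.2357, 0.4714, 0.2357]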
(16) [4 pts] Consider running the hierarchical agglomerative clustering algorithm on the following set of four points in R2, breaking ties arbitrarily. If we stop when only two clusters remain, which of the following linkage methods ensures the resulting clusters are balanced (each have two sample points)? Select all that apply.
[Figure omitted: the four points form a rhombus with vertices (1, 0), (0, 1), (−1, 0), and (0, −1).]
A: Complete linkage
⃝ B: Single linkage
C: Centroid linkage
D: Average linkage
Under each of these linkage methods, we know that two adjacent points along a side of the rhombus will be fused together first. Without loss of generality, assume these are the points (0, 1) and (1, 0). Treating these as a cluster, the average, maximum, and centroid distances to each of the two remaining points are all larger than the distance between the two remaining points themselves. Therefore, complete linkage, centroid linkage, and average linkage all result in balanced clusters. On the other hand, single linkage does not, since, for example, (1, 0) and (0, −1) are at the same distance as (−1, 0) and (0, −1), so single linkage might merge (0, −1) into the first cluster instead.
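The argument can be checked numerically with scipy on the four rhombus points. Note that scipy breaks ties deterministically rather than arbitrarily, so the single-linkage run below shows only one possible tie-break; the exam’s point is that only complete, centroid, and average linkage guarantee a balanced result for every tie-break.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # The four points of the rhombus referenced in the solution.
    P = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

    for method in ["complete", "single", "centroid", "average"]:
        Z = linkage(P, method=method)
        print(method, fcluster(Z, t=2, criterion="maxclust"))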
Q2. [14 pts] Principal Components Analysis
Consider the following design matrix, representing four sample points Xi ∈ R2:
        ⎡ 4  1 ⎤
    X = ⎢ 2  3 ⎥ .
        ⎢ 5  4 ⎥
        ⎣ 1  0 ⎦
We want to represent the data in only one dimension, so we turn to principal components analysis (PCA).
(1) [5 pts] Compute the unit-length principal component directions of X, and state which one the PCA algorithm would choose if you request just one principal component. Please provide an exact answer, without approximation. (You will need to use the square root symbol.) Show your work!
We center X, yielding
        ⎡  1  −1 ⎤
    Ẋ = ⎢ −1   1 ⎥ .
        ⎢  2   2 ⎥
        ⎣ −2  −2 ⎦
Then
    Ẋ⊤Ẋ = ⎡ 10   6 ⎤
           ⎣  6  10 ⎦ .
(Divide by 4 if you want the sample covariance matrix. But we don’t care about the magnitude.)
Its eigenvectors are [1/√2  1/√2]⊤ with eigenvalue 16 and [1/√2  −1/√2]⊤ with eigenvalue 4. The former eigenvector, having the larger eigenvalue, is the one the PCA algorithm would choose. (Negated versions of these vectors also get full points.)
(2) [5 pts] The plot below depicts the sample points from X. We want a one-dimensional representation of the data, so draw the principal component direction (as a line) and the projections of all four sample points onto the principal direction.
Label each projected point with its principal coordinate value (where the origin’s principal coordinate is zero). Give the principal coordinate values exactly.
The principal coordinates are 1/√2, 5/√2, 5/√2, and 9/√2. (Alternatively, all of these could be negative, but they all have to have the same sign.)
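These numbers can be double-checked with a short numpy computation (a sanity check only, not part of the required hand-worked answer):

    import numpy as np

    X = np.array([[4., 1.],
                  [2., 3.],
                  [5., 4.],
                  [1., 0.]])

    Xc = X - X.mean(axis=0)                    # center the data
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    print(eigvals)                             # [ 4. 16.]
    v = eigvecs[:, -1]                         # principal component: [1/sqrt(2), 1/sqrt(2)] up to sign
    print(v)

    # Principal coordinates of the four points (the origin maps to zero), as in part (2):
    print(X @ v)                               # [5/sqrt(2), 5/sqrt(2), 9/sqrt(2), 1/sqrt(2)]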
(3) [4 pts] The plot below depicts the sample points from X rotated 30 degrees counterclockwise about the origin.
As in part (2), identify the principal component direction that the PCA algorithm would choose and draw it (as a line) on the plot. Also draw the projections of the rotated points onto the principal direction. Label each projected point with the exact value of its principal coordinate.
The line passes through the origin and is parallel to the segment joining the two sample points that are farthest apart, so it’s easy to draw. Rotation has not changed the principal coordinates: 1/√2, 5/√2, 5/√2, and 9/√2. (Again, these could all be negative.)
Q3. [14 pts] A Decision Tree
In this question we investigate whether students will pass or fail CS 189 based on whether or not they studied, cheated, and slept well before the exam. You are given the following data for five students. There are three features, “Studied,” “Slept,” and “Cheated.” The column “Result” shows the label we want to predict.
            Studied   Slept   Cheated   Result
Student 1   Yes       No      No        Passed
Student 2   Yes       No      Yes       Failed
Student 3   No        Yes     No        Failed
Student 4   Yes       Yes     Yes       Failed
Student 5   Yes       Yes     No        Passed
(1) [4 pts] What is the entropy H(Result) at the root node? (There is no need to compute the exact number; you may write it as an arithmetic expression.)
H(Result) = −(2/5) log₂(2/5) − (3/5) log₂(3/5).
(2) [5 pts] Draw the decision tree where every split maximizes the information gain. (An actual drawing, please; a written description does not suffice.) Do not perform a split on a pure leaf or if the split will produce an empty child; otherwise, split. Explain (with numbers) why you chose the splits you chose.
A tree that first splits on “Cheated” and then, in the Cheated = No branch, splits on “Studied.” At the root, splitting on Cheated leaves a weighted entropy of about 0.55 bits (the Cheated = Yes child is pure, and the Cheated = No child has entropy ≈ 0.92 over 3 of the 5 students), versus 0.8 bits for Studied and about 0.95 bits for Slept, so Cheated gives the largest information gain. In the Cheated = No child, splitting on Studied separates the remaining students perfectly (Studied = Yes passed, Studied = No failed), giving pure leaves.
(3) [2 pts] Did the tree you built implicitly perform feature subset selection? Explain.
Yes, because it does not use the feature “Slept.”
(4) [3 pts] Suppose you have a sample of n students for some large n, with the same three features. Assuming that we use a reasonably efficient algorithm to build the tree (as discussed in class), what is the worst-case running time to build the decision tree? (Write your answer in the simplest asymptotic form possible.) Why?
We have 3 binary features, so the tree’s depth cannot exceed 3 and each sample point participates in at most four tree nodes. Hence the worst-case running time to build the tree is Θ(n).
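The information-gain numbers quoted in part (2) can be reproduced with a short Python sketch over the table above (1 = Yes/Passed, 0 = No/Failed):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # The five students from the table: columns are (Studied, Slept, Cheated).
    features = np.array([[1, 0, 0],
                         [1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 1],
                         [1, 1, 0]])
    result = np.array([1, 0, 0, 0, 1])       # 1 = Passed, 0 = Failed

    H_root = entropy(result)                 # about 0.971 bits
    for name, col in zip(["Studied", "Slept", "Cheated"], features.T):
        H_split = sum((col == v).mean() * entropy(result[col == v]) for v in (0, 1))
        print(name, "info gain =", H_root - H_split)
    # Cheated has the largest information gain (about 0.42 bits), so it is the first split.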
Q4. [20 pts] Spectral Graph Clustering
Figure 1: An undirected, weighted graph in which all vertices have mass 1. The numbers inside the vertices are their indices (not masses).
In this problem, we approximate the sparsest cut of the graph above with the spectral graph clustering algorithm. Recall the spectral graph clustering optimization objective is to
find y that minimizes y⊤Ly subject to y⊤y = 6
and 1⊤y = 0.
(1) [4 pts] Write out the Laplacian matrix L.
⃝ [0.36, −0.58, 0.61, −0.4, 0.04, −0.03]
⃝ [0.3, −0.33, −0.23, 0.3, −0.6, 0.55]
⃝ [−0.41, 0.41, 0.41, −0.41, −0.41, 0.41]
⃝ [−0.56, −0.32, 0.43, 0.62, −0.06, −