CS 189/289A Introduction to Machine Learning
Spring 2021 Final
• The exam is open book, open notes for material on paper. On your computer screen, you may have only this exam, Zoom, a limited set of PDF documents (see Piazza for details), and four browser windows/tabs: Gradescope, the exam instructions, clarifications on Piazza, and the form for submitting clarification requests.
• You will submit your answers to the multiple-choice questions directly into Gradescope via the assignment “Final – Multiple Choice”; please do not submit your multiple-choice answers on paper. If you are in the DSP program and have been granted extra time, select the “DSP, 150%” or “DSP, 200%” option. By contrast, you will submit your answers to the written questions by writing them on paper by hand, scanning them, and submitting them through Gradescope via the assignment “Final – Free Response.”
• Please write your name at the top of each page of your written answers. (You may do this before the exam.) Please start each top-level question (Q2, Q3, etc.) on a new sheet of paper. Clearly label all written questions and all subparts of each written question.
• You have 180 minutes to complete the final exam (3:10–6:10 PM). (If you are in the DSP program and have an allowance of 150% or 200% time, that comes to 270 minutes or 360 minutes, respectively.)
• When the exam ends (6:10 PM), stop writing. You must submit your multiple-choice answers before 6:10 PM sharp. Late multiple-choice submissions will be penalized at a rate of 5 points per minute after 6:10 PM. (The multiple-choice questions are worth 60 points total.)
• From 6:10 PM, you have 15 minutes to scan the written portion of your exam and turn it into Gradescope via the assignment “Final – Free Response.” Most of you will use your cellphone/pad and a third-party scanning app. If you have a physical scanner, you may use that. Late written submissions will be penalized at a rate of 10 points per minute after 6:25 PM. (The written portion is worth 90 points total.)
• Following the exam, you must use Gradescope’s page selection mechanism to mark which questions are on which pages of your exam (as you do for the homeworks). Please get this done before midnight. This can be done on a computer different than the device you submitted with.
• The total number of points is 150. There are 15 multiple choice questions worth 4 points each, and six written questions worth a total of 90 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.
Q1. [60 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.
(a) [4 pts] Which of the following conditions could serve as a sensible stopping condition while building a decision tree?
⃝ A: Stop if you find the validation error is decreasing as the tree grows
⃝ B: Don’t split a treenode that has an equal number of sample points from each class
C: Don’t split a treenode whose depth exceeds a specified threshold
⃝ D: Don’t split a treenode if the split would cause a large reduction in the weighted average entropy
A: It is sensible to stop when validation error is increasing, but not when it is decreasing. B: This is the kind of node for which splitting is most likely to be effective, so this is not a sensible stopping condition. C: It is a common practice to limit the maximum depth of the tree. D: This is the kind of node for which splitting is most effective, so this is not a sensible stopping condition.
(b) [4 pts] Let classifier A be a random forest. Let classifier B be an ensemble of decision trees with bagging—identical to classifier A except that we do not limit the splits in each treenode to a subset of the features; at every treenode, the very best split among all d features is chosen. Which statements are true?
⃝ A: After training, all the trees in classifier B must be identical
⃝ B: Classifier B will tend to have higher bias than Classifier A
C: Classifier B will tend to have higher variance than Classifier A
D: Classifier B will tend to have higher training accuracy than Classifier A
A is incorrect because we still use bagging. B is incorrect because limiting the splits, as a random forest does, tends to increase the bias. C is correct because randomly limiting the splits, as a random forest does, tends to decrease the ensemble variance— this is what motivated random forests in the first place! D is correct because reduced bias increases training accuracy.
(c) [4 pts] We are using an ensemble of decision trees for a classification problem, with bagging. We notice that the decision trees look too similar. We would like to build a more diverse set of learners. What are possible ways to accomplish that?
⃝ A: Increase the size of each random subsample
B: Decrease the size of each random subsample
⃝ C: Apply normalization to the design matrix first
⃝ D: Sample without replacement
(d) [4 pts] Suppose we have a feature map Φ and a kernel function k(Xi, Xj) = Φ(Xi) · Φ(Xj). Select the true statements about kernels.
⃝ A: If there are n sample points of dimension d, it takes O(nd) time to compute the kernel matrix
B: The kernel trick implies we do not compute Φ(Xi) explicitly for any sample point Xi
⃝ C: For every possible feature map Φ : Rd → RD you could imagine, there is a way to compute k(Xi, Xj) in O(d) time
D: Running times of kernel algorithms do not depend on the dimension D of the feature space Φ(·)
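To illustrate option B, here is a minimal sketch (not part of the exam; the random data and the degree-2 polynomial kernel are assumed for illustration): the kernel matrix is filled in from inner products in Rd, and Φ(Xi) is never formed. Each of the n² entries costs O(d), so the whole matrix takes on the order of n²d time, not O(nd).

import numpy as np

# Sketch (assumed data): the degree-2 polynomial kernel k(x, z) = (x . z + 1)^2
# corresponds to an explicit quadratic feature map Phi, yet the n x n kernel
# matrix below is computed from d-dimensional inner products alone -- Phi is
# never evaluated. Filling all n^2 entries costs on the order of n^2 d time.
n, d = 5, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))     # n sample points with d features each
K = (X @ X.T + 1.0) ** 2            # kernel matrix without explicit Phi(X_i)
print(K.shape)                      # (5, 5)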
(e) [4 pts] Which of the following are benefits of using the backpropagation algorithm to compute gradients?
⃝ A: Its running time is linear in the total number of units (neurons) in the network
B: It can be applied to any arithmetic function (assuming the directed computation graph has no cycles and we evaluate the gradient at a point where the gradient exists)
C: Compared to naive gradient computation, it improves the speed of each iteration of gradient descent by eliminating repeated computations of the same subproblem
⃝ D: Compared to naive gradient computation, it reduces the number of iterations required to get close to a local minimum, by protecting against sigmoid unit saturation (vanishing gradients)
A: Its running time is linear in the number of edges in the network, not the number of units. The number of edges could be quadratic in the number of units.
B: The backpropagation algorithm requires computing gradients in topological order on the reversed computation graph. Thus, any graph that can be topologically sorted (any DAG) can be backpropagated over.
C: Yes; that is its purpose.
D: Backpropagation doesn’t change the gradient or descent step taken by each iteration; it just speeds up computing the gradient. Therefore, the number of iterations doesn’t change and the vanishing gradient problem is not diminished.
(f) [4 pts] Which of the following are benefits of using convolutional neural networks—as opposed to fully connected ones—for image recognition tasks?
⃝ A: The ability to express a wider variety of more complicated functions of the input features
⃝ B: Fewer model architecture hyperparameters for the designer to select
C: Enables the network to more easily learn and recognize features regardless of their position in the image
D: Typically requires less data to train well
In principle, fully connected networks offer greater expressivity and model capacity, as they have many more weights; so option A has it backwards. Since CNNs share weights, the same features can be learned and recognized at different positions. Many fewer parameters often implies less data is required to train a reasonable model. CNNs involve more architecture hyperparameters (kernel size, stride, dilation, padding, pooling, etc.).
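To make the parameter-count contrast concrete, here is a back-of-the-envelope sketch (the layer sizes are assumed for illustration; they are not from the exam):

# Sketch with assumed sizes: weights in one 3x3 convolutional layer versus one
# fully connected layer mapping a 32x32x3 input to the same 32x32x16 output.
h, w, c_in, c_out, k = 32, 32, 3, 16, 3

conv_params = k * k * c_in * c_out + c_out                 # shared weights + biases
fc_params = (h * w * c_in) * (h * w * c_out) + h * w * c_out

print(conv_params)   # 448
print(fc_params)     # 50348032 -- roughly five orders of magnitude more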
(g) [4 pts] Select the correct statements about principal component analysis (PCA).
A: PCA is a method of dimensionality reduction
B: If we select only one direction (a one-dimensional subspace) to represent the data, the sample variance of the projected points is zero if and only if the original sample points are all identical
C: The orthogonal projection of a point x onto a unit direction vector w is (x⊤w)w
⃝ D: If we select only one direction (a one-dimensional subspace) to represent the data, PCA chooses the eigenvector of the sample covariance matrix that corresponds to the least eigenvalue
A and C are standard knowledge that we told you in lecture. B is true because PCA chooses the variance-maximizing direction; if there are two sample points that are not identical to each other, there is a direction with strictly positive variance. D is false because we want the eigenvector associated with the greatest eigenvalue.
(h) [4 pts] Consider a centered design matrix X ∈ Rn×d, where Xi ∈ Rd is the i-th sample point (a column vector) and Xi⊤ is the i-th row of X. Which of these optimization problems is a correct formulation of finding the first principal component v?
A: Subject to ∥v∥2 = 1, find v ∈ Rd that minimizes ∑ᵢ₌₁ⁿ ∥Xi − vv⊤Xi∥²
⃝ C: Subject to ∥v∥2 = 1, find v ∈ Rd that maximizes ∑ᵢ₌₁ⁿ ∥Xi − vv⊤Xi∥²
⃝ B: Subject to ∥v∥2 = 1, find v ∈ Rd that minimizes v⊤X⊤Xv
D: Subject to ∥v∥2 = 1, find v ∈ Rd that maximizes v⊤X⊤Xv
(i) [4 pts] Consider a centered design matrix X ∈ Rn×d and its singular value decomposition X = UDV⊤. X has d principal components, which are found by principal components analysis (PCA). Which statements are correct?
A: The row space of X (if not trivial) is spanned by some of the principal components
B: The null space of X (if not trivial) is spanned by some of the principal components
C: The principal components are all right singular vectors of X
D: The matrix UD lists the principal coordinates of every sample point in X
A, B: The row space is spanned by the principal components with nonzero singular values, and the null space is spanned by the principal components with zero singular values. C is true because the right singular vectors of X are the eigenvectors of X⊤X. D is true because UD = XV.
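A minimal numpy check of statements C and D (random centered data is assumed; this sketch is not part of the exam):

import numpy as np

# Sketch (assumed random data): the right singular vectors of a centered design
# matrix are eigenvectors of X^T X, and UD = XV gives the principal coordinates.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
X = X - X.mean(axis=0)                       # center the design matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
evals = np.linalg.eigvalsh(X.T @ X)          # eigenvalues of X^T X, ascending

print(np.allclose(np.sort(s**2), evals))     # squared singular values = eigenvalues
print(np.allclose(U * s, X @ Vt.T))          # principal coordinates: U D = X V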
(j) [4 pts] Select the correct statements about the Fiedler vector.
⃝ A: The Fiedler vector is the eigenvector of the Laplacian matrix that is associated with the smallest eigenvalue
B: The Fiedler vector always satisfies the balance constraint (as written as an equation)
⃝ C: The Fiedler vector is a solution to the unrelaxed, NP-hard optimization problem
D: The sweep cut is a spectral graph partitioning technique that tries n − 1 different cuts (in an n-vertex graph) and picks one of them.
Option A is incorrect because it’s associated with the second-smallest eigenvalue. Option B is correct because 1 is always an eigenvector of the Laplacian matrix, eigenvectors are orthogonal to each other, and the balance condition is 1⊤y = 0. Option C is incorrect; the Fiedler vector is a solution to the relaxed problem, but probably not the unrelaxed one. Option D is a fact.
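A small sketch of these facts on a toy graph (the 5-vertex path graph is an assumed example, not from the exam): compute the Fiedler vector, check the balance constraint 1⊤y = 0, and run the sweep cut over the n − 1 candidate splits.

import numpy as np

# Sketch (toy 5-vertex path graph): the Fiedler vector is the eigenvector of the
# Laplacian associated with the second-smallest eigenvalue; it is orthogonal to
# the all-ones vector, so it satisfies the balance constraint 1^T y = 0.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)      # adjacency matrix
L = np.diag(A.sum(axis=1)) - A                    # graph Laplacian

evals, evecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
fiedler = evecs[:, 1]                             # second-smallest eigenvalue's eigenvector
print(np.isclose(fiedler @ np.ones(5), 0.0))      # balance constraint: True

# Sweep cut: sort vertices by Fiedler value and try the n - 1 prefix cuts.
order = np.argsort(fiedler)
for i in range(1, len(order)):
    left, right = order[:i], order[i:]
    cut_weight = A[np.ix_(left, right)].sum()     # total weight crossing the cut
    print(i, cut_weight)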
(k) [4 pts] Consider the optimization problem of finding q ∈ Rr that minimizes ∥y − XPq∥² + λ∥q∥², where λ > 0, X ∈ Rn×d is a design matrix, y ∈ Rn is a vector of labels, and P ∈ Rd×r is an arbitrary matrix with r < d. This problem has one unique solution. Which of the following is that one unique solution?
A: q = (P⊤X⊤XP + λI)⁻¹P⊤X⊤y
⃝ B: q = (X⊤P⊤PX + λI)⁻¹P⊤X⊤y
⃝ C: q = (P⊤X⊤XP + λI)⁻¹X⊤y
⃝ D: q = (X⊤P⊤PX + λI)⁻¹X⊤y
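A quick numerical check of the selected formula (a sketch on assumed random data, not part of the exam): writing A = XP turns the objective into ordinary ridge regression in q, whose unique minimizer is (A⊤A + λI)⁻¹A⊤y, i.e., option A.

import numpy as np

# Sketch (assumed random data): with A = X P the objective is ridge regression,
# so the closed-form minimizer is (A^T A + lam I)^{-1} A^T y (option A).
rng = np.random.default_rng(0)
n, d, r, lam = 30, 6, 3, 0.5
X = rng.standard_normal((n, d))
P = rng.standard_normal((d, r))
y = rng.standard_normal(n)

A = X @ P
q_star = np.linalg.solve(A.T @ A + lam * np.eye(r), A.T @ y)

def objective(q):
    return np.sum((y - A @ q) ** 2) + lam * np.sum(q ** 2)

# No random perturbation of q_star achieves a lower objective value.
print(all(objective(q_star) <= objective(q_star + 0.1 * rng.standard_normal(r))
          for _ in range(100)))    # True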
(l) [4 pts] Select the correct statements about AdaBoost.
A: “Ada” stands for “adaptive,” as the meta-learner adapts to the performance of its learners
⃝ B: AdaBoost works best with support vector machines
C: At test/classification time, AdaBoost computes a weighted sum of predictions
⃝ D: AdaBoost can transform any set of classifiers to a classifier with better training accuracy
A and C are basic. B is false: linear classifiers are less well suited to ensembling than most nonlinear classifiers are. D is false: if the weak classifiers all have 50% accuracy, then AdaBoost can’t build a reliable meta-learner, even for just the training points.
(m) [4 pts] Select the correct statements about AdaBoost.
⃝ A: When we go from iteration t to iteration t + 1, the weight of sample point Xi is increased if the majority of the t learners misclassify Xi.
⃝ B: Unlike with decision trees, two data points that are identical (Xi = Xj) but have different labels (yi ≠ yj) can be classified correctly by AdaBoost.
C: AdaBoost can benefit from a learner that has only 5% training accuracy.
⃝ D: If you train enough learners and every learner achieves at least 51% validation accuracy, AdaBoost can always achieve 100% validation accuracy.
A: The change in weight of a data point is only determined by the most recent learner at time t, not any of the previous learners. B: Even though AdaBoost is an ensemble, it only has one output for a given Xi. C: AdaBoost treats a learner with 5% training accuracy by flipping its output so it has 95% training accuracy; then it can benefit very much! D: Would be true for training accuracy, but of course it’s false for validation accuracy!
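A tiny sketch of point C (the coefficient formula β = ½ ln((1 − err)/err) is the standard AdaBoost learner weight, used here for illustration): a learner with 5% training accuracy has weighted error 0.95 and receives a negative coefficient, which amounts to flipping its predictions into a 95%-accurate learner.

import numpy as np

# Sketch (illustrative numbers): the standard AdaBoost coefficient
# beta = 0.5 * ln((1 - err) / err) is negative when the weighted error
# exceeds 1/2, so a 5%-accurate learner (err = 0.95) is used "flipped."
for err in (0.05, 0.5, 0.95):
    beta = 0.5 * np.log((1 - err) / err)
    print(err, round(beta, 3))    # 0.05 -> 1.472, 0.5 -> 0.0, 0.95 -> -1.472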
(n) [4 pts] In which of the following cases should you prefer k-nearest neighbors over k-means clustering? For all the four options, you have access to images X1, X2, . . . , Xn ∈ Rd.
⃝ A: You do not have access to labels. You want to find out if any of the images are very different from the rest, i.e., are outliers.
⃝ B: You have access to labels y1,y2,...,yn telling us whether image i is a cat or a dog. You want to find out whether the distribution of cats is unimodal or bimodal. You already know that the distribution of cats has either one or two modes, but that’s all you know about the distribution.
C: You have access to labels y1,y2,...,yn telling us whether image i is a cat or a dog. You want to find out whether a new image z is a cat or a dog.
D: You have access to labels y1,y2,...,yn telling us whether image i is a cat or a dog. Given a new image z, you want to approximate the posterior probability of z being a cat and the posterior probability of z being a dog.
A: You don’t have access to labels, so you cannot use k-nearest neighbors.
B: In order to do this, you would first filter out all the dogs, so you’re left with cat images only. Then you perform k-means clustering twice, once with k = 1 and again with k = 2. If the total distance between the data points and the cluster centers is markedly lower in the second case than in the first, that would suggest that the data is bimodal.
C: You want to perform classification, so you must use k-NN. k-means is a clustering algorithm and not a classification algorithm.
D: In this case, you must use k-NN. Instead of finding the plurality class in the k points closest to z, you find the class histogram of those points.
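A minimal sketch of the recipe in D (the toy feature vectors and labels are assumed; a real system would use actual image features): estimate the posteriors from the class histogram of the k nearest training points.

import numpy as np

# Sketch (assumed toy data): approximate P(cat | z) and P(dog | z) by the class
# histogram of the k nearest training points to the new image z.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))     # training images as feature vectors
y = rng.integers(0, 2, size=100)      # assumed labels: 0 = dog, 1 = cat
z = rng.standard_normal(5)            # new image to classify
k = 7

dists = np.linalg.norm(X - z, axis=1) # exhaustive nearest-neighbor search
nearest = np.argsort(dists)[:k]       # indices of the k closest points
p_cat = np.mean(y[nearest] == 1)      # histogram-based posterior estimate
print(p_cat, 1.0 - p_cat)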
(o) [4 pts] Select the correct statements about the k-nearest neighbor classifier.
A: For exact nearest neighbor search in a very high-dimensional feature space, generally it is faster to use exhaustive search than to use a k-d tree.
B: When using a k-d tree, approximate nearest neighbor search is sometimes substantially faster than exact nearest neighbor search.
⃝ C: When using exhaustive search, approximate nearest neighbor search is sometimes substantially faster than exact nearest neighbor search.
⃝ D: Order-k Voronoi diagrams are widely used in practice for k-nearest neighbor search (with k > 1) in a two-dimensional feature space.
A: In very high-dimensional spaces, k-d trees with exact search generally don’t succeed in pruning many points away, so you’re doing exhaustive search anyway, only with the greater (by a constant factor) overhead of searching a tree data structure too.
B: Yes; sometimes by a factor of up to 100.
C: No; you don’t gain any speedup at all.
D: No; they take up O(k2n) space, and there is a lack of publicly available software for these difficult computations.
Q2. [20 pts] A Decision Tree
The decision tree drawn above stores sample points from two classes, A and B. The tables indicate the number of sample points of each class stored in each leaf node.
(a) [4 pts] What is entropy at the root (Treenode 0)? Simplify your answer if possible, but you don’t need to compute logarithms.
(b) [6 pts] What is the information gain of the split at the root node (Treenode 0)? Show your work so we understand how you got that answer. Simplify your answer as much as possible, but again, you don’t need to compute logarithms.
(c) [2 pts] What is the training accuracy of the tree?
(d) [4 pts] How can we increase the decision tree’s training accuracy? Give a reasonable explanation for why we might have considered doing that but decided not to do that. (“Because we set a limit” is not a valid answer; we’re looking for the underlying reason why we would set a limit.)
(e) [4 pts] Suppose I told you that we did increase the decision tree’s training accuracy in the way you suggest in part (d), but when training completed, the decision tree training algorithm decided to revert to this tree in the end anyway. Explain how that might happen and why it might be the right thing to do.
(a) There are 15 class A samples and 10 class B samples in total. Therefore, the entropy at Treenode 0 is
H(0) = −(15/25) log₂(15/25) − (10/25) log₂(10/25) = −(3/5) log₂(3/5) − (2/5) log₂(2/5).
(b) The entropy at Treenode 1 is
H(1) = −(1/3) log₂(1/3) − (2/3) log₂(2/3)
and the entropy at Treenode 2 is H(2) = 0, as Treenode 2 is pure. Therefore, the information gain is
H(0) − (15/25) H(1) − (10/25) H(2) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) + (1/5) log₂(1/3) + (2/5) log₂(2/3).
We didn’t require further simplifications for full points, but this can be further simplified to
(3/5) log₂(5/3) − (1/5) log₂ 3 + (2/5) log₂(5/3) = log₂(5/3) − (1/5) log₂ 3.
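As a numerical sanity check of parts (a) and (b), here is a short sketch assuming, as in the calculation above, class counts of 15 + 10 at the root, 5 + 10 at Treenode 1, and a pure Treenode 2 with 10 points:

import numpy as np

def entropy(counts):
    # Entropy in bits of a node with the given class counts (0 log 0 = 0).
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H0 = entropy([15, 10])                  # about 0.971 bits
H1 = entropy([5, 10])                   # about 0.918 bits
H2 = entropy([10, 0])                   # 0 bits: Treenode 2 is pure
gain = H0 - (15 / 25) * H1 - (10 / 25) * H2
print(round(H0, 3), round(H1, 3), round(gain, 3))   # 0.971 0.918 0.42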
(c) The training accuracy is 21/25 (i.e., 84%).
(d) We could improve training accuracy by continuing to split beyond Treenodes 3 and 4 (which could also be written, “increase the tree depth”).
The likely reason we didn’t is the risk of overfitting. Alternatively, we might have built a deeper tree then pruned it back . . .
(e) We may have built a deeper tree then pruned it back because the shorter tree performs better on the validation data.
Q3. [20 pts] The Singular Value Decomposition
We want you to compute (by hand) part of the singular value decomposition (SVD) X = UDV⊤ of the matrix
X = [  1   0
       0   1
      −1  −1 ].
(a) [6 pts] One way to compute the SVD by hand is to exploit the relationship between the singular values of X and the eigenvalues of a related symmetric matrix. What symmetric matrix has eigenvalues related to the singular values of X? Compute that matrix, write down its characteristic polynomial and simplify it, and determine its eigenvalues. Show your work!
(b) [4 pts] What is the relationship between the singular values of X and the eigenvalues of your symmetric matrix? Use that relationship to write down the two specific singular values of X.
(c) [5 pts] Generalize your answer above. For any matrix Z (not only X), what symmetric matrix has eigenvalues related to the singular values of Z? Prove that every singular value of Z can be identified by looking at the eigenvalues of the matrix your method constructs.
(d) [5 pts] Explicitly write down enough eigenvectors of your symmetric matrix to form a complete basis for the space. (Hint: the symmetry of your matrix should make it easy to guess eigenvectors; then you can confirm that they are eigenvectors.) Based on these, write down the value of U or V (whichever is appropriate).
(a) Consider the symmetric matrix
Y = X⊤X = [ 2  1
            1  2 ].
The characteristic polynomial of Y is (λ − 2)² − 1 = λ² − 4λ + 3. We can factor it into (λ − 3)(λ − 1) or simply use the quadratic formula; either way, the eigenvalues of Y are 3 and 1.
Note: if you like doing things the hard way, it is also correct to use the matrix
XX⊤ = [  1   0  −1
         0   1  −1
        −1  −1   2 ].
(b) The singular values of X are square roots of eigenvalues of X⊤X (or XX⊤). Therefore, they are √3 and 1.
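A short numpy check of parts (a) and (b) (a sketch, not part of the required hand computation):

import numpy as np

# Sketch: verify the hand computation. The eigenvalues of X^T X are 1 and 3,
# so the singular values of X are 1 and sqrt(3).
X = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])

print(np.linalg.eigvalsh(X.T @ X))          # [1. 3.]
print(np.linalg.svd(X, compute_uv=False))   # [1.732..., 1.]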