CS 189 Introduction to Machine Learning
Spring 2019 Final Exam
• Please do not open the exam before you are instructed to do so.
• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• When you start, the first thing you should do is check that you have all 12 pages and all 6 questions. The second thing is to please write your initials at the top right of every page after this one (e.g., write “JS” if you are Jonathan Shewchuk).
• The exam is closed book, closed notes except your two cheat sheets.
• You have 3 hours.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 150. There are 26 multiple choice questions worth 3 points each, and 5 written questions worth a total of 72 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.
First name
First and last name of student to your left
First and last name of student to your right

Q1. [78 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit: the set of all correct answers must be checked. (In this solutions copy, ● marks a correct choice and ○ marks an incorrect one.)
(a) [3 pts] Which of the following algorithms can learn nonlinear decision boundaries? The decision trees use only axis-aligned splits.
● A depth-five decision tree
● AdaBoost with depth-one decision trees
● Quadratic discriminant analysis (QDA)
○ Perceptron
The only non-obvious case is AdaBoost with depth-one decision trees: although the metalearner is a weighted sum of the weak learners, each depth-one tree (stump) is a nonlinear step function of the input, so the sign of the weighted sum yields a decision boundary that is generally nonlinear.
(b) [3 pts] Which of the following classifiers are capable of achieving 100% training accuracy on the data below? The decision trees use only axis-aligned splits.
○ Logistic regression
○ AdaBoost with depth-one decision trees
● A neural network with one hidden layer
● AdaBoost with depth-two decision trees
AdaBoost with depth-one decision trees: each weak learner either classifies the two points of a pair into different classes or puts every point in the same class. Since the metalearner is a weighted sum of weak classifiers that each have 50% training accuracy, it cannot achieve 100% training accuracy.
A neural network with one hidden layer (with enough units) is a universal function approximator.
Logistic regression finds a linear decision boundary, which cannot separate the data.
AdaBoost with depth-two decision trees: a depth-two decision tree can fully separate the data.
(c) [3 pts] Which of the following are true of support vector machines?
● Increasing the hyperparameter C tends to decrease the training error
○ The hard-margin SVM is a special case of the soft-margin SVM with the hyperparameter C set to zero
● Increasing the hyperparameter C tends to decrease the margin
○ Increasing the hyperparameter C tends to decrease the sensitivity to outliers
Increasing C tends to decrease the training error: true; slack becomes more expensive, so the optimizer tolerates fewer margin violations and misclassified training points.
Hard-margin SVM with C = 0: false; the hard-margin SVM is the limit as C tends toward infinity, not zero.
Increasing C tends to decrease the margin: true; with expensive slack, the optimizer shrinks the margin rather than allow points inside it or on the wrong side of it. (Conversely, decreasing C makes slack cheaper, so points may lie farther on the wrong side of the margin and the margin grows.)
Increasing C tends to decrease sensitivity to outliers: false; the more expensive the slack, the more a single outlier can distort the decision boundary.
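For reference, a standard statement of the soft-margin SVM objective (supplied here for context; it is not quoted from the exam):

    \min_{w,\,b,\,\xi} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^\top X_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,

so as C → ∞ slack becomes prohibitively expensive and we recover the hard-margin SVM, whereas C = 0 removes the penalty on slack entirely.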
(d) [3 pts] Let r(x) be a decision rule that minimizes the risk for a three-class classifier with labels y ∈ {0, 1, 2} and an asymmetric loss function. What is true about r(·)?
○ ∀y ∈ {0, 1, 2}, ∃x : r(x) = y
● If we don’t have access to the underlying data distribution P(X) or P(Y|X), we cannot exactly compute the risk of r(·)
○ ∀x, r(x) is a class y that maximizes the posterior probability P(Y = y|X = x)
● If P(X = x) changes but P(Y = y|X = x) remains the same for all x, y, then r(X) still minimizes the risk
∀y ∃x with r(x) = y: false; it is possible that r(x) is the same class for every x.
Cannot exactly compute the risk: true; by definition, the risk is an expectation over these two distributions.
r(x) maximizes the posterior: false; because the loss function is asymmetric, the risk-minimizing class need not be the most probable one.
Changing P(X) only: true; r(·) is unconstrained, so it can pick the class that minimizes the expected loss at every x independently, without trade-offs. That choice depends only on P(Y|X = x), so changing the marginal P(X) does not affect it.
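As a reminder, the standard definition of risk for a loss L (notation assumed here, not taken from the exam) makes the last point explicit:

    R(r) = \mathbb{E}_{X,Y}\big[L(r(X), Y)\big] = \sum_{x} P(X = x) \sum_{y=0}^{2} L(r(x), y)\, P(Y = y \mid X = x)

(with an integral over x for continuous X). The minimizing rule picks, for each x, the class z that minimizes \sum_{y} L(z, y)\, P(Y = y \mid X = x); the marginal P(X) only weights the per-x terms and cannot change which z is best at any x.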
(e) [3 pts] Which of the following are true about two-class Gaussian discriminant analysis? Assume you have estimated the parameters μ̂_C, Σ̂_C, π̂_C for class C and μ̂_D, Σ̂_D, π̂_D for class D.
○ If μ̂_C = μ̂_D and π̂_C = π̂_D, then the LDA and QDA classifiers are identical
○ If Σ̂_C = I (the identity matrix) and Σ̂_D = 5I, then the LDA and QDA classifiers are identical
● If Σ̂_C = Σ̂_D, π̂_C = 1/6, and π̂_D = 5/6, then the LDA and QDA classifiers are identical
○ If the LDA and QDA classifiers are identical, then the posterior probability P(Y = C|X = x) is linear in x
Equal means and priors: false; the covariance matrices might still differ, making the QDA decision function nonlinear.
Σ̂_C = I and Σ̂_D = 5I: false; with unequal covariances the QDA decision function is nonlinear, so it differs from LDA.
Equal covariances with priors 1/6 and 5/6: correct; QDA with equal covariances reduces to LDA (the priors merely shift the linear decision boundary).
Posterior linear in x: false; the posterior is a logistic (sigmoid) function of a linear function of x, not itself linear in x.
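For context, the standard QDA discriminant function (a standard formula, not quoted from the exam) shows why equal covariances give a linear boundary. Up to a constant shared by both classes,

    Q_C(x) = -\tfrac{1}{2}(x - \mu_C)^\top \Sigma_C^{-1}(x - \mu_C) - \tfrac{1}{2}\ln|\Sigma_C| + \ln \pi_C.

When \Sigma_C = \Sigma_D, the quadratic term x^\top \Sigma^{-1} x cancels in Q_C(x) - Q_D(x), leaving a decision function that is linear in x — exactly the LDA classifier.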
(f) [3 pts] Consider an n × d design matrix X with labels y ∈ R^n. What is true of fitting this data with dual ridge regression with the polynomial kernel k(X_i, X_j) = (X_i^⊤ X_j + 1)^p = Φ(X_i)^⊤ Φ(X_j) and regularization parameter λ > 0?
○ If the polynomial degree is high enough, the polynomial will fit the data exactly
○ The algorithm computes Φ(X_i) and Φ(X_j) in O(d^p) time
● The algorithm solves an n × n linear system
○ When n is very large, this dual algorithm is more likely to overfit than the primal algorithm with degree-p polynomial features
Fits the data exactly: false; because λ > 0, ridge regression penalizes the weights and generally does not interpolate the data, no matter the degree.
Computes Φ(X_i) and Φ(X_j): false; there is no need to compute the lifted features at all — the kernel trick evaluates their dot product directly as (X_i^⊤ X_j + 1)^p.
Solves an n × n linear system: true; see the definition of dual ridge regression, which solves for n dual weights.
Dual more likely to overfit: false; the dual and primal algorithms compute the same solution, no matter n. (The dual is preferred here because the lifted feature space has roughly d^p dimensions, far more than n.)
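A minimal sketch of dual (kernel) ridge regression with this kernel, assuming numpy and made-up variable names (illustrative only; this is not the exam's notation):

import numpy as np

def kernel_ridge_fit_predict(X, y, X_test, p=3, lam=1.0):
    """Dual ridge regression with the polynomial kernel k(a, b) = (a.b + 1)^p."""
    K = (X @ X.T + 1.0) ** p                            # n x n kernel matrix
    a = np.linalg.solve(K + lam * np.eye(len(X)), y)    # solve the n x n system (K + lam*I) a = y
    K_test = (X_test @ X.T + 1.0) ** p                  # kernel between test and training points
    return K_test @ a                                   # h(z) = sum_i a_i k(X_i, z)

# Example usage with random data:
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 2)), rng.standard_normal(20)
print(kernel_ridge_fit_predict(X, y, X[:3]))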
(g) [3 pts] Consider the kernel perceptron algorithm on an n × d design matrix X. We choose a matrix M ∈ R^{D×d} and define the feature map Φ(x) = Mx ∈ R^D and the kernel k(x, z) = Φ(x) · Φ(z). Which of the following are always true?
● The kernel matrix is XM⊤MX⊤
○ The kernel matrix is MX⊤XM⊤
○ If the primal perceptron algorithm terminates, then the kernel perceptron algorithm terminates
● If the kernel perceptron algorithm terminates, then the primal perceptron algorithm terminates
The kernel matrix is XM⊤MX⊤: true; Φ(X) = XM⊤ (each row is Φ(X_i)⊤ = X_i⊤M⊤), so K = Φ(X)Φ(X)⊤ = XM⊤MX⊤. This is a valid kernel matrix because k(x, z) = ⟨Mx, Mz⟩ = x⊤M⊤Mz and M⊤M is positive semidefinite for every matrix M. By contrast, MX⊤XM⊤ is D × D, not the n × n kernel matrix.
Primal terminates ⟹ kernel terminates: false. Counterexample: let one class have (−2, 1), (1, 1) and the other class have (−1, −1), (2, −1), and let M project the points onto the x-axis. The raw data is linearly separable, but the projected data is not, so the kernel perceptron never terminates.
Kernel terminates ⟹ primal terminates: true; if w linearly separates the transformed points Mx_i, then M⊤w linearly separates the original points, since w⊤(Mx_i) = (M⊤w)⊤x_i.
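A quick numerical sanity check of the kernel-matrix identity (a sketch with arbitrary shapes, not part of the exam):

import numpy as np

rng = np.random.default_rng(0)
n, d, D = 5, 3, 4
X = rng.standard_normal((n, d))      # design matrix, one sample point per row
M = rng.standard_normal((D, d))      # feature map Phi(x) = M x

Phi = X @ M.T                        # n x D matrix whose rows are Phi(X_i)
K = Phi @ Phi.T                      # kernel matrix K_ij = Phi(X_i) . Phi(X_j)
assert np.allclose(K, X @ M.T @ M @ X.T)   # K = X M^T M X^T, an n x n matrix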
(h) [3 pts] Which of the following are true of decision trees? Assume splits are binary and are done so as to maximize the information gain.
○ If there are at least two classes at a given node, there exists a split such that information gain is strictly positive
○ As you go down any path from the root to a leaf, the information gain at each level is non-increasing
● The deeper the decision tree is, the more likely it is to overfit
● Random forests are less likely to overfit than decision trees
Positive information gain always exists: false; recall the example from section — with an XOR-like arrangement, every axis-aligned split can leave both children with the same class proportions as the parent, giving zero information gain.
Information gain is non-increasing along a path: false; in the same example, the first split has zero gain while a later split has positive gain.
Deeper trees are more likely to overfit: correct.
Random forests are less likely to overfit than decision trees: correct.
(i) [3 pts] While solving a classification problem, you use a pure, binary decision tree constructed by the standard greedy procedure we outlined in class. While your training accuracy is perfect, your validation accuracy is unexpectedly low. Which of the following, in isolation, is likely to improve your validation accuracy in most real-world applications?
○ Lift your data into a quadratic feature space
○ Select a random subset of the features and use only those in your tree
○ Normalize each feature to have variance 1
● Prune the tree, using validation to decide how to prune
Quadratic features: false; lifting to a more complex feature space will not generally stop you from overfitting.
Random feature subset: false; in a single tree this mostly discards information — random feature selection pays off only as part of an ensemble such as a random forest.
Normalizing features: false; axis-aligned splits are unaffected by rescaling a feature, so this does not change the tree's behavior.
Pruning: correct; shrinking a fully grown, pure tree defends against overfitting.
(j) [3 pts] For the sigmoid activation function and the ReLU activation function, which of the following are true in general?
● Both activation functions are monotonically non-decreasing
○ Both functions have a monotonic first derivative
○ Compared to the sigmoid, the ReLU is more computationally expensive
● The sigmoid derivative s′(γ) is quadratic in s(γ)
Monotonically non-decreasing: true — simply graph the two activation functions.
Monotonic first derivative: false; the sigmoid’s derivative increases and then decreases (it peaks at γ = 0), so it is not monotonic.
ReLU more expensive: false; ReLU is simpler — its derivative is 1 for positive inputs and 0 for negative inputs — while the sigmoid requires computing an exponential.
s′(γ) quadratic in s(γ): true; s′(γ) = s(γ)(1 − s(γ)) = s(γ) − s(γ)². (As side notes: ReLU zeroes out all negative inputs, implicitly introducing sparsity in the network, and unlike the sigmoid its gradient factors do not shrink toward 0, since each is either 0 or 1 — see also part (m).)
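For completeness, the standard derivation of the last fact (supplied for context, not quoted from the exam):

    s(\gamma) = \frac{1}{1 + e^{-\gamma}}, \qquad s'(\gamma) = \frac{e^{-\gamma}}{(1 + e^{-\gamma})^2} = s(\gamma)\,\big(1 - s(\gamma)\big) = s(\gamma) - s(\gamma)^2,

a quadratic polynomial in s(γ) whose maximum value 1/4 occurs at s(γ) = 1/2.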
(k) [3 pts] Which of the following are true in general for backpropagation?
● It is a dynamic programming algorithm
● Some of the derivatives cannot be fully computed until the backward pass
○ The weights are initially set to zero
○ Its running time grows exponentially in the number of layers
Dynamic programming: true; backpropagation caches values from the forward pass and reuses each partial derivative many times rather than recomputing it.
Derivatives need the backward pass: true; the gradient with respect to an early layer depends on quantities propagated back from later layers.
Weights set to zero: false; the weights are initialized randomly (to break symmetry).
Exponential running time: false; the running time is linear in the size of the network, and in particular linear in the number of layers.
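A minimal numpy sketch of one backward pass for a two-layer network with ReLU hidden units and squared-error loss (hypothetical shapes and names, shown only to illustrate caching forward values and propagating derivatives backward):

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(2)      # one training point
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))

# Forward pass: cache the intermediate values needed later (the dynamic programming part).
z1 = W1 @ x               # pre-activations of the hidden layer
h = np.maximum(z1, 0.0)   # ReLU hidden units
yhat = W2 @ h             # linear output units
loss = 0.5 * np.sum((yhat - y) ** 2)

# Backward pass: these derivatives cannot be computed until now,
# because they depend on yhat and on gradients flowing back from the output.
dyhat = yhat - y                 # dL/dyhat
dW2 = np.outer(dyhat, h)         # dL/dW2
dh = W2.T @ dyhat                # dL/dh, propagated backward through W2
dz1 = dh * (z1 > 0)              # ReLU derivative: 1 for positive inputs, else 0
dW1 = np.outer(dz1, x)           # dL/dW1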
(l) [3 pts] Facets of neural networks that have (reasonable, though not perfect) analogs in human brains include
○ backpropagation
○ convolutional masks applied to many patches
● linear combinations of input values
● edge detectors
(m) [3 pts] Which of the following are true of the vanishing gradient problem for sigmoid units?
● Deeper neural networks tend to be more susceptible to vanishing gradients
○ If a unit has the vanishing gradient problem for one training point, it has the problem for every training point
● Using ReLU units instead of sigmoid units can reduce this problem
○ Networks with sigmoid units don’t have this problem if they’re trained with the cross-entropy loss function
Deeper networks: true; the gradient is a product of per-layer factors, so as the number of layers goes up, the gradient is more likely to vanish during backpropagation — if one unit yields a gradient close to zero, the gradients of units in earlier layers along that path are also very small. (This is part of the incentive for architectures such as ResNets.)
One point implies all points: false; whether a sigmoid unit saturates depends on its input, so a unit can have vanishing gradients for some training points and not others.
ReLU units: true; ReLU is generally better here because its derivative is 1 for positive inputs, so the gradient does not shrink toward zero the way a saturated sigmoid's does.
Cross-entropy loss: false; hidden sigmoid units can still saturate, and once their gradients vanish the weights effectively stop changing, regardless of the loss function.
(n) [3 pts] Suppose our input is two-dimensional sample points, with ten non-exclusive classes those points may belong to (i.e., a point can belong to more than one class). To train a classifier, we build a fully-connected neural network (with bias terms) that has a single hidden layer of twenty units and an output layer of ten units (one for each class). Which statements apply?
○ For the output units, softmax activations are more appropriate than sigmoid activations
○ This network will have 240 trainable parameters
● For the hidden units, ReLU activations are more appropriate than linear activations
● This network will have 270 trainable parameters
Softmax creates a valid probability distribution across all the outputs, making it well suited to predicting the single class a point is most likely to belong to, but not to predicting whether or not the point is in each class. A sigmoid gives a valid in-class probability for each class independently, which is what we need for non-exclusive (multi-label) predictions.
There are 2 · 20 + 20 = 60 parameters in the first layer and 20 · 10 + 10 = 210 in the second, giving a total of 270 parameters. With a linear activation on the hidden layer, the network collapses to a linear model and cannot model nonlinear decision boundaries.
(Randomly initializing the weights breaks the symmetry of the network; it’s okay, and in fact standard practice, to initialize the bias terms to zero.)
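One way to double-check the parameter count (a sketch using PyTorch, which the exam does not mention; the layer arrangement mirrors the question):

import torch.nn as nn

# 2 inputs -> 20 hidden units (ReLU) -> 10 outputs (sigmoid), with bias terms
model = nn.Sequential(nn.Linear(2, 20), nn.ReLU(), nn.Linear(20, 10), nn.Sigmoid())
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 2*20 + 20 + 20*10 + 10 = 270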
(o) [3 pts] Which of the following can lead to valid derivations of PCA?
● Fit the mean and covariance matrix of a Gaussian distribution to the sample data with maximum likelihood estimation
○ Find the direction w that minimizes the sample variance of the projected data
○ Find the direction w that minimizes the sum of projection distances
● Find the direction w that minimizes the sum of squares of projection distances
This is best explained by the lecture notes — in particular, lecture note 20 from the Spring 2019 iteration of the course. In short: the eigenvectors of the maximum likelihood covariance matrix are the principal component directions; PCA maximizes (not minimizes) the variance of the projected data; and it minimizes the sum of squared projection distances, not the sum of unsquared distances.
(p) [3 pts] Write the SVD of an n × d design matrix X (with n ≥ d) as X = UDV⊤. Which of the following are true?
● The components of D are all nonnegative
● The columns of V all have unit length and are orthogonal to each other
○ If X is a real, symmetric matrix, the SVD is always the same as the eigendecomposition
● The columns of D are orthogonal to each other
Components of D nonnegative: true; the diagonal entries of D are the singular values, which are nonnegative by convention.
Columns of V: true; V is an orthogonal matrix, so its columns are orthonormal. (A subset of the columns of V spans the null space of X, and the columns of V can be used for PCA.)
SVD of a real, symmetric matrix: false; let the spectral decomposition be QΛQ⁻¹ — Λ may have negative values, whereas D may not.
Columns of D orthogonal: true; D is diagonal, so any two distinct columns have non-overlapping nonzero entries and their dot product is zero.
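A small numpy check of these properties (an illustrative sketch, not exam material):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))            # n = 6 sample points, d = 3 features
U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.all(s >= 0)                      # singular values (diagonal of D) are nonnegative
assert np.allclose(Vt @ Vt.T, np.eye(3))   # columns of V are orthonormal
assert np.allclose((U * s) @ Vt, X)        # X = U D V^T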
(q) [3 pts] Which of the following is true about Lloyd’s algorithm for k-means clustering?
○ It is a supervised learning algorithm
● It never returns to a particular assignment of classes to sample points after changing to another one
● If run for long enough, it will always terminate
○ No algorithm (Lloyd’s or any other) can always find the optimal solution
k-means is an unsupervised learning algorithm. Each iteration of Lloyd’s algorithm never increases the k-means objective, so once it leaves an assignment it never returns to it; because there are only finitely many assignments, it must terminate. Finding the optimal clustering is NP-hard, but an exhaustive search over all assignments can always find it (very slowly), so the last statement is false. (The number of clusters k is a hyperparameter.)
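A minimal sketch of Lloyd's algorithm in numpy (illustrative only; function and variable names are made up, and no optimization or careful tie-breaking is attempted):

import numpy as np

def lloyd_kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # initialize with k random sample points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break   # nothing changed: the algorithm has terminated
        centers = new_centers
    return labels, centers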
(r) [3 pts] Which of the following are advantages of using k-medoid clustering instead of k-means?
● k-medoids is less sensitive to outliers
● Medoids make more sense than means for non-Euclidean distance metrics
○ Medoids are faster to compute than means
○ The k-medoids algorithm with the Euclidean distance metric has no hyperparameters, unlike k-means
Both k-means and k-medoids have k as a hyperparameter. Medoids are much more expensive to compute than means (calculating a medoid requires all pairwise distances within the cluster, rather than just summing all the points and averaging).

(s) [3 pts] We wish to cluster 2-dimensional points into two clusters, so we run Lloyd’s algorithm for k-means clustering until convergence. Which of the following clusters could it produce? (The points inside an oval belong to the same cluster.)
(t) [3 pts] Which of the following are true of hierarchical clustering?
○ The number k of clusters is a hyperparameter
● The greedy agglomerative clustering algorithm repeatedly fuses the two clusters that minimize the distance between clusters
○ Complete linkage works only with the Euclidean distance metric
● During agglomerative clustering, single linkage is more sensitive to outliers than complete linkage
k as a hyperparameter: false; part of the point of a hierarchy is that you don’t have to guess k in advance.
Greedy agglomerative clustering: correct.
Complete linkage: false; complete linkage is compatible with any distance function.
Single vs. complete linkage: true; single linkage is very sensitive to outliers.
(u) [3 pts] Which of the following are true of spectral clustering?
○ The Fiedler vector is the eigenvector associated with the second largest eigenvalue of the Laplacian matrix
● Nobody knows how to find the sparsest cut in polynomial time
● The relaxed optimization problem for partitioning a graph involves minimizing the Rayleigh quotient of the Laplacian matrix and an indicator vector (subject to a constraint)
○ The Laplacian matrix of a graph is invertible
Fiedler vector: false; it corresponds to the second smallest eigenvalue of the Laplacian.
Sparsest cut: true; finding the sparsest cut is NP-hard.
Relaxed optimization problem: correct; the relaxation minimizes the Rayleigh quotient subject to constraints.
Laplacian invertibility: false; the Laplacian is never invertible — the all-ones vector 1 is always in its null space.
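For reference, one common form of the relaxed problem (a standard formulation from spectral graph partitioning, not quoted from the exam): with Laplacian L and relaxed indicator vector y,

    \min_{y \ne 0,\; \mathbf{1}^\top y = 0} \; \frac{y^\top L y}{y^\top y},

whose minimizer is the Fiedler vector, the eigenvector of L associated with its second smallest eigenvalue λ₂. Note also that L\mathbf{1} = 0, since each row of L sums to zero, which is why L is never invertible.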
(v) [3 pts] For binary classification, which of the following statements are true of AdaBoost?
● It can be applied to neural networks
● The metalearner provides not just a classification, but also an estimate of the posterior probability
○ It uses the majority vote of learners to predict the class of a data point
● The paper on AdaBoost won a Gödel Prize

(w) [3 pts] For binary classification, which of the following statements are true of AdaBoost with decision trees?
● It usually has lower bias than a single decision tree
● It is popular because it usually works well even before any hyperparameter tuning
○ To use the weight w_i of a sample point X_i when training a decision tree G, we scale the loss function L(G(X_i), y_i) by w_i
○ It can train multiple decision trees in parallel

(x) [3 pts] Which of the following are reasons one might choose latent factor analysis (LFA) over k-means clustering to group together n data points in R^d?
● LFA is not sensitive to how you initialize it, whereas Lloyd’s algorithm is
● LFA allows us to consider points as belonging to multiple “overlapping” clusters, whereas in k-means, each point belongs to only one cluster
○ In market research, LFA can distinguish different consumer types, whereas k-means cannot
● k-means requires you to guess k in advance, whereas LFA makes it easier to infer the right number of clusters after the computation
LFA is computed with the SVD, which is deterministic, whereas Lloyd’s algorithm’s result depends on its initialization. LFA supports overlapping clusters: we can measure how much a point (e.g., a user vector) belongs to a particular cluster by taking its inner product with the corresponding singular vector. k-means can also distinguish consumer types, so that is not a reason to prefer LFA. With LFA, you can inspect the singular values after the computation to decide how many factors to keep, whereas k-means needs k before it runs. Speed is not a reason to prefer LFA either: for I iterations and k clusters, k-means runs in O(nkId) time, whereas the SVD takes O(min(dn², d²n)) time.
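A small sketch of how LFA-style soft cluster memberships could be computed with the SVD (hypothetical variable names and an arbitrary threshold, shown only to illustrate the "overlapping clusters" and "choose k afterward" points):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))     # n = 100 points in R^8 (e.g., user vectors)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = int(np.sum(s > 0.1 * s[0]))       # pick the number of factors AFTER seeing the singular values
                                      # (the 0.1 threshold is an arbitrary illustrative choice)
memberships = X @ Vt[:k].T            # inner product of each point with each top singular vector
print(k, memberships.shape)           # a point can have nonzero membership in several factors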
(y) [3 pts] Which of the following are true for k-nearest neighbor classification?
● It is more likely to overfit with k = 1 (1-NN) than with k = 1,000 (1,000-NN)
● In very high dimensions, exhaustively checking every training point is often faster than any widely used competing exact k-NN query algorithm
● If you have enough training points drawn from the same distribution as the test points, k-NN can achieve accuracy almost as good as the Bayes decision rule
○ The optimal running time to classify a point with k-NN grows linearly with k
Overfitting: correct; smaller values of k overfit more.
Exhaustive search in high dimensions: true; this is an empirical fact.
Almost as good as the Bayes decision rule: true (Fix & Hodges, 1951).
Running time linear in k: false; for example, with a linear-time selection algorithm the k nearest neighbors can be found in O(nd) time, which does not grow with k.
(z) [3 pts] Suppose we use the k-d tree construction and query algorithms described in class to find the approximate nearest neighbor of a query point among n sample points. Select the true statements.
● It is possible to guarantee that the tree has O(log n) depth by our choice of splitting rule at each tree node
○ Sometimes we permit the k-d tree to be unbalanced so we can choose splits with better information gain
○ Querying the k-d tree is faster than querying a Voronoi diagram for sample points in R²
● Sometimes the query algorithm declines to search inside a box that’s closer to the query point than the nearest neighbor it’s found so far

Q2. [17 pts] Getting Down(hill) with the Funk Function
The Netflix Prize was an op
