
CS 189 Introduction to Machine Learning
Spring 2016 Final
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your two-page cheat sheet.


• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• You have 3 hours.
• Please write your initials at the top right of each page (e.g., write “JS” if you are ). Finish
this by the end of your 3 hours.
• Mark your answers on the front of each page, not the back. We will not scan the backs of the pages, but you may
use them as scratch paper. Do not attach any extra sheets.
• The total number of points is 150. There are 30 multiple choice questions worth 3 points each, and 6 written
questions worth a total of 60 points.
• For multiple-choice questions, fill in the boxes for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple-choice questions: the set of all correct answers must be checked.
First name
First and last name of student to your left
First and last name of student to your right

Q1. [90 pts] Multiple Choice
Check the boxes for ALL CORRECT CHOICES. Every question should have at least one box checked. NO PARTIAL CREDIT: the set of all correct answers (only) must be checked.
(1) [3 pts] What strategies can help reduce overfitting in decision trees?
☒ Pruning
☒ Enforce a minimum number of samples in leaf nodes
☐ Make sure each leaf node is one pure class
☒ Enforce a maximum depth for the tree
(2) [3 pts] Which of the following are true of convolutional neural networks (CNNs) for image analysis?
☒ Filters in earlier layers tend to include edge detectors
☒ Pooling layers reduce the spatial resolution of the image
☐ They have more parameters than fully-connected networks with the same number of layers and the same numbers of neurons in each layer
☐ A CNN can be trained for unsupervised learning tasks, whereas an ordinary neural net cannot
(3) [3 pts] Neural networks
☐ optimize a convex cost function
☒ can be used for regression as well as classification
☐ always output values between 0 and 1
☒ can be used in an ensemble
(4) [3 pts] Which of the following are true about generative models?
☒ They model the joint distribution P(class = C AND sample = x)
☒ They can be used for classification
☐ The perceptron is a generative model
☒ Linear discriminant analysis is a generative model
(5) [3 pts] Lasso can be interpreted as least-squares linear regression where
☒ weights are regularized with the ℓ1 norm
☐ weights are regularized with the ℓ2 norm
☐ the weights have a Gaussian prior
☐ the solution algorithm is simpler
(6) [3 pts] Which of the following methods can achieve zero training error on any linearly separable dataset?
☒ Decision tree
☒ Hard-margin SVM
☐ 15-nearest neighbors
☒ Perceptron
(7) [3 pts] The kernel trick
☐ can be applied to every classification algorithm
☐ changes ridge regression so we solve a d × d linear system instead of an n × n system, given n sample points with d features
☐ is commonly used for dimensionality reduction
☒ exploits the fact that in many learning algorithms, the weights can be written as a linear combination of input points

(8) [3 pts] Suppose we train a hard-margin linear SVM on n > 100 data points in R2, yielding a hyperplane with exactly 2 support vectors. If we add one more data point and retrain the classifier, what is the maximum possible number of support vectors for the new hyperplane (assuming the n + 1 points are linearly separable)?
☐ 2   ☐ 3   ☐ n   ☒ n + 1
(9) [3 pts] In latent semantic indexing, we compute a low-rank approximation to a term-document matrix. Which of the following motivate the low-rank reconstruction?
☒ Finding documents that are related to each other, e.g. of a similar genre
☒ In many applications, some principal components encode noise rather than meaningful structure
☐ The low-rank approximation provides a lossless method for compressing an input matrix
☐ Low-rank approximation enables discovery of nonlinear relations
(10) [3 pts] Which of the following are true about subset selection?
☐ Subset selection can substantially decrease the bias of support vector machines
☐ Ridge regression frequently eliminates some of the features
☒ Subset selection can reduce overfitting
☒ Finding the true best subset takes exponential time
(11) [3 pts] In neural networks, nonlinear activation functions such as sigmoid, tanh, and ReLU
☐ speed up the gradient calculation in backpropagation, as compared to linear units
☒ help to learn nonlinear decision boundaries
☐ are applied only to the output units
☐ always output values between 0 and 1
(12) [3 pts] Suppose we are given data comprising points of several different classes. Each class has a different probability distribution from which the sample points are drawn. We do not have the class labels. We use k-means clustering to try to guess the classes. Which of the following circumstances would undermine its effectiveness?
☐ Some of the classes are not normally distributed
☒ Each class has the same mean
☐ The variance of each distribution is small in all directions
☒ You choose k = n, the number of sample points
(13) [3 pts] Which of the following are true of spectral graph partitioning methods?
☐ They find the cut with minimum weight
☒ They use one or more eigenvectors of the Laplacian matrix
☐ They minimize a quadratic function subject to one constraint: the partition must be balanced
☐ The Normalized Cut was invented at Stanford
(14) [3 pts] Which of the following can help to reduce overfitting in an SVM classifier?
☒ Use of slack variables
☐ High-degree polynomial features
☐ Normalizing the data
☐ Setting a very low learning rate

(15) [3 pts] Which value of k in the k-nearest neighbors algorithm generates the solid decision boundary depicted here? There are only 2 classes. (Ignore the dashed line, which is the Bayes decision boundary.)
☐ k = 1   ☐ k = 2   ☒ k = 10   ☐ k = 100
(16) [3 pts] Consider one layer of weights (edges) in a convolutional neural network (CNN) for grayscale images, connecting one layer of units to the next layer of units. Which type of layer has the fewest parameters to be learned during training? (Select one.)
☐ A convolutional layer with ten 3 × 3 filters
☒ A max-pooling layer that reduces a 10 × 10 image to 5 × 5
☐ A convolutional layer with eight 5 × 5 filters
☐ A fully-connected layer from 20 hidden units to 4 output units
(17) [3 pts] In the kernelized perceptron algorithm with learning rate ε = 1, the coefficient ai corresponding to a training example xi represents the weight for K(xi, x). Suppose we have a two-class classification problem with yi ∈ {1, −1}. If yi = 1, which of the following can be true for ai?
☐ ai = −1   ☒ ai = 1   ☒ ai = 0   ☒ ai = 5
(18) [3 pts] Suppose you want to split a graph G into two subgraphs. Let L be G’s Laplacian matrix. Which of the following could help you find a good split?
☐ The eigenvector corresponding to the second-largest eigenvalue of L
☒ The eigenvector corresponding to the second-smallest eigenvalue of L
☐ The left singular vector corresponding to the second-largest singular value of L
☒ The left singular vector corresponding to the second-smallest singular value of L
(19) [3 pts] Which of the following are properties that a kernel matrix always has?
☐ Invertible
☐ At least one negative eigenvalue
☐ All the entries are positive
☒ Symmetric

(20) [3 pts] How does the bias-variance decomposition of a ridge regression estimator compare with that of ordinary least squares regression? (Select one.)
☐ Ridge has larger bias, larger variance
☒ Ridge has larger bias, smaller variance
☐ Ridge has smaller bias, larger variance
☐ Ridge has smaller bias, smaller variance
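A quick way to see this trade-off concretely is to simulate it. The sketch below is illustrative only (the data, the penalty λ, and the test point are made up, not part of the exam): it fits OLS and ridge on many resampled training sets and compares the bias and variance of their predictions at one fixed test point.

```python
import numpy as np

# Hypothetical simulation: ridge predictions have larger bias but smaller
# variance than OLS predictions across resampled training sets.
rng = np.random.default_rng(0)
n, d, lam, trials = 30, 10, 10.0, 2000
w_true = rng.normal(size=d)
x_test = rng.normal(size=d)            # one fixed test point

preds = {"ols": [], "ridge": []}
for _ in range(trials):
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(size=n)
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    preds["ols"].append(x_test @ w_ols)
    preds["ridge"].append(x_test @ w_ridge)

truth = x_test @ w_true
for name, p in preds.items():
    p = np.array(p)
    print(name, "bias^2 =", (p.mean() - truth) ** 2, "variance =", p.var())
```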
(21) [3 pts] Both PCA and Lasso can be used for feature selection. Which of the following statements are true?
☒ Lasso selects a subset (not necessarily a strict subset) of the original features
☒ PCA produces features that are linear combinations of the original features
☐ PCA and Lasso both allow you to specify how many features are chosen
☐ PCA and Lasso are the same if you use the kernel trick
(22) [3 pts] Which of the following are true about forward subset selection?
☐ O(2^d) models must be trained during the algorithm, where d is the number of features
☒ It greedily adds the feature that most improves cross-validation accuracy
☐ It finds the subset of features that give the lowest test error
☒ Forward selection is faster than backward selection if few features are relevant to prediction
(23) [3 pts] You’ve just finished training a random forest for spam classification, and it is getting abnormally bad performance on your validation set, but good performance on your training set. Your implementation has no bugs. What could be causing the problem?
☒ Your decision trees are too deep
☒ You are randomly sampling too many features when you choose a split
☒ You have too few trees in your ensemble
☒ Your bagging implementation is randomly sampling sample points without replacement
(24) [3 pts] Consider training a decision tree given a design matrix X = [6 3; 2 7; 9 6] (rows are sample points) and labels y = [1; 0; 1]. Let f1 denote feature 1, corresponding to the first column of X, and let f2 denote feature 2, corresponding to the second column. Which of the following splits at the root node gives the highest information gain? (Select one.)
☐ f1 > 2
☐ f2 > 3
☒ f1 > 4
☐ f2 > 6
(25) [3 pts] In terms of the bias-variance decomposition, a 1-nearest neighbor classifier has ______ than a 3-nearest neighbor classifier.
☒ higher variance
☐ lower variance
☐ higher bias
☒ lower bias

(26) [3 pts] Which of the following are true about bagging?
☒ In bagging, we choose random subsamples of the input points with replacement
☐ Bagging is ineffective with logistic regression, because all of the learners learn exactly the same decision boundary
☐ The main purpose of bagging is to decrease the bias of learning algorithms
☒ If we use decision trees that have one sample point per leaf, bagging never gives lower training error than one ordinary decision tree
(27) [3 pts] An advantage of searching for an approximate nearest neighbor, rather than the exact nearest neighbor, is that
☐ it sometimes makes exhaustive search much faster
☒ it sometimes makes searching in a k-d tree much faster
☐ the nearest neighbor classifier is sometimes much more accurate
☐ you find all the points within a distance of (1 + ε)r from the query point, where r is the distance from the query point to its nearest neighbor
(28) [3 pts] In the derivation of the spectral graph partitioning algorithm, we relax a combinatorial optimization problem to a continuous optimization problem. This relaxation has the following effects.
☒ The combinatorial problem requires an exact bisection of the graph, but the continuous algorithm can produce (after rounding) partitions that aren't perfectly balanced
☐ The combinatorial problem cannot be modified to accommodate vertices that have different masses, whereas the continuous problem can
☐ The combinatorial problem requires finding eigenvectors, whereas the continuous problem requires only matrix multiplication
☒ The combinatorial problem is NP-hard, but the continuous problem can be solved in polynomial time
(29) [3 pts] The firing rate of a neuron
☐ determines how strongly the dendrites of the neuron stimulate axons of neighboring neurons
☐ only changes very slowly, taking a period of several seconds to make large adjustments
☒ is more analogous to the output of a unit in a neural net than the output voltage of the neuron
☐ can sometimes exceed 30,000 action potentials per second
(30) [3 pts] In algorithms that use the kernel trick, the Gaussian kernel
☒ gives a regression function or predictor function that is a linear combination of Gaussians centered at the sample points
☒ is less prone to oscillating than polynomials, assuming the variance of the Gaussians is large
☐ is equivalent to lifting the d-dimensional sample points to points in a space whose dimension is exponential in d
☐ has good properties in theory but is rarely used in practice
(31) 3 bonus points! The following Berkeley professors were cited in this semester’s lectures (possibly self-cited) for specific research contributions they made to machine learning.

Q2. [8 pts] Feature Selection
A newly employed former CS 189/289A student trains the latest Deep Learning classifier and obtains state-of-the-art accuracy. However, the classifier uses too many features! The boss is overwhelmed and asks for a model with fewer features.
Let’s try to identify the most important features. Start with a simple dataset in R2.
(1) [4 pts] Describe the training error of a Bayes optimal classifier that can see only the first feature of the data. Describe the training error of a Bayes optimal classifier that can see only the second feature.
The first feature yields a training error of 50% (like random guessing). The second feature offers a training error of zero.
(2) [4 pts] Based on this toy example, the student decides to fit a classifier on each feature individually, then rank the features by their classifier’s accuracy, take the best k features, and train a new classifier on those k features. We call this approach variable ranking. Unfortunately, the classifier trained on the best k features obtains horrible accuracy, unless k is very close to d, the original number of features!
Construct a toy dataset in R2 for which variable ranking fails. In other words, a dataset where a variable is useless by itself, but potentially useful alongside others. Use + for data points in Class 1, and O for data points in Class 2.
An XOR dataset is unpredictable from either feature alone: for example, + points at (0, 0) and (1, 1), and O points at (0, 1) and (1, 0). (This extends to n dimensions with the n-bit parity function.)
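As a concrete check, here is a minimal sketch (the dataset and the stump classifier are my own illustration, not part of the exam solution): the best single-feature threshold classifier gets 50% training error on either feature, while the two features together classify perfectly.

```python
import numpy as np

# Hypothetical XOR dataset: "+" points in class 1, "O" points in class 2.
X = np.array([[0., 0.], [1., 1.],   # class 1 (+)
              [0., 1.], [1., 0.]])  # class 2 (O)
y = np.array([1, 1, 2, 2])

def best_stump_error(f, y):
    """Training error of the best single-feature threshold (decision stump) classifier."""
    errors = []
    for t in np.unique(f):
        for below, above in ((1, 2), (2, 1)):
            pred = np.where(f <= t, below, above)
            errors.append(np.mean(pred != y))
    return min(errors)

print(best_stump_error(X[:, 0], y))           # 0.5: feature 1 alone is no better than guessing
print(best_stump_error(X[:, 1], y))           # 0.5: feature 2 alone is no better than guessing

xor_pred = np.where(X[:, 0] == X[:, 1], 1, 2) # use both features (the XOR rule)
print(np.mean(xor_pred != y))                 # 0.0: together they classify perfectly
```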

Q3. [10 pts] Gradient Descent for k-means Clustering
Recall the loss function for k-means clustering with k clusters, sample points x1, …, xn, and centers μ1, …, μk:
L = Σ_j Σ_{xi∈Sj} ∥xi − μj∥²,
where Sj refers to the set of data points that are closer to μj than to any other cluster mean.
(1) [4 pts] Instead of updating μj by computing the mean, let’s minimize L with batch gradient descent while holding the sets Sj fixed. Derive the update formula for μ1 with learning rate (step size) ε.
∂L/∂μ1 = ∂/∂μ1 Σ_{xi∈S1} (xi − μ1)⊤(xi − μ1) = Σ_{xi∈S1} 2(μ1 − xi).
Therefore (absorbing the factor of 2 into the step size) the update formula is
μ1 ← μ1 + ε Σ_{xi∈S1} (xi − μ1).
(Note: writing 2ε instead of ε is fine.)
(2) [2 pts] Derive the update formula for μ1 with stochastic gradient descent on a single sample point xi. Use
learning rate ε.
μ1 ← μ1 + ε(xi − μ1) if xi ∈ S1, otherwise no change.
(3) [4 pts] In this part, we will connect the batch gradient descent update equation with the standard k-means algorithm. Recall that in the update step of the standard algorithm, we assign each cluster center to be the mean (centroid) of the data points closest to that center. It turns out that a particular choice of the learning rate ε (which may be different for each cluster) makes the two algorithms (batch gradient descent and the standard k-means algorithm) have identical update steps. Let’s focus on the update for the first cluster, with center μ1. Calculate the value of ε so that both algorithms perform the same update for μ1. (If you do it right, the answer should be very simple.)
In the standard algorithm, we assign μ1 ← (1/|S1|) Σ_{xi∈S1} xi.
Comparing to the answer in (1), we set (1/|S1|) Σ_{xi∈S1} xi = μ1 + ε Σ_{xi∈S1} (xi − μ1) and solve for ε:
(1/|S1|) Σ_{xi∈S1} xi − (1/|S1|) Σ_{xi∈S1} μ1 = ε Σ_{xi∈S1} (xi − μ1)
(1/|S1|) Σ_{xi∈S1} (xi − μ1) = ε Σ_{xi∈S1} (xi − μ1).
Thus ε = 1/|S1|.
(Note: answers that differ by a constant factor are fine if consistent with the answer for (1).)
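A small numerical sanity check of part (3) (illustrative only; the random cluster below is hypothetical, not exam data): one batch gradient-descent step with ε = 1/|S1| lands exactly on the centroid of S1, i.e., it reproduces the standard k-means update.

```python
import numpy as np

# With eps = 1/|S1|, one gradient-descent step on mu_1 equals the k-means centroid update.
rng = np.random.default_rng(0)
S1 = rng.normal(size=(7, 2))    # hypothetical points currently assigned to cluster 1
mu1 = rng.normal(size=2)        # current center of cluster 1

eps = 1.0 / len(S1)
mu1_gd = mu1 + eps * np.sum(S1 - mu1, axis=0)   # gradient-descent update from part (1)
mu1_kmeans = S1.mean(axis=0)                    # standard k-means centroid update

print(np.allclose(mu1_gd, mu1_kmeans))          # True
```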

Q4. [10 pts] Kernels
(1) [2 pts] What is the primary motivation for using the kernel trick in machine learning algorithms?
If we want to map sample points to a very high-dimensional feature space, the kernel trick can save us from
having to compute those features explicitly, thereby saving a lot of time.
(Alternative solution: the kernel trick enables the use of infinite-dimensional feature spaces.)
(2) [4 pts] Prove that for every design matrix X ∈ Rn×d, the corresponding kernel matrix is positive semidefinite.
For every vector z ∈ Rn,
z⊤Kz = z⊤XX⊤z = ∥X⊤z∥²,
which is clearly nonnegative, so the kernel matrix K = XX⊤ is positive semidefinite.
(3) [2 pts] Suppose that a regression algorithm contains the following line of code.
w ← w + X⊤MXX⊤u
Here, X ∈ Rn×d is the design matrix, w ∈ Rd is the weight vector, M ∈ Rn×n is a matrix unrelated to X, and u ∈ Rn is a vector unrelated to X. We want to derive a dual version of the algorithm in which we express the weights w as a linear combination of samples Xi (rows of X) and a dual weight vector a contains the coefficients of that linear combination. Rewrite the line of code in its dual form so that it updates a correctly (and so that w does not appear).
a ← a + MXX⊤u
(4) [2 pts] Can this line of code for updating a be kernelized? If so, show how. If not, explain why.
Yes, using the kernel matrix K = XX⊤:
a ← a + MKu
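To see the equivalence concretely, the sketch below (variable names and random data are hypothetical, not from the exam) checks numerically that updating a with a ← a + MKu, where K = XX⊤, keeps w = X⊤a equal to the result of the primal update w ← w + X⊤MXX⊤u.

```python
import numpy as np

# Verify that the dual update keeps w = X^T a consistent with the primal update.
rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))
M = rng.normal(size=(n, n))
u = rng.normal(size=n)

a = rng.normal(size=n)
w = X.T @ a                        # primal weights induced by the dual weights

K = X @ X.T                        # kernel matrix for the linear kernel
a_new = a + M @ K @ u              # dual update
w_new = w + X.T @ M @ X @ X.T @ u  # primal update

print(np.allclose(X.T @ a_new, w_new))   # True
```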

Q5. [12 pts] Let’s PCA
You are given a design matrix X = [6 −4; −3 5; −2 6; 7 −3] (rows are sample points). Let's use PCA to reduce the dimension from 2 to 1.
(1) [6 pts] Compute the covariance matrix for the sample points. (Warning: Observe that X is not centered.) Then compute the unit eigenvectors, and the corresponding eigenvalues, of the covariance matrix. Hint: If you graph the points, you can probably guess the eigenvectors (then verify that they really are eigenvectors).
After centering the sample points (their mean is (2, 1)), the covariance matrix is [82 −80; −80 82]. Its unit eigenvectors are (1/√2, 1/√2)⊤ with eigenvalue 2 and (1/√2, −1/√2)⊤ with eigenvalue 162. (Note: either eigenvector can be replaced with its negation.)
(2) [3 pts] Suppose we use PCA to project the sample points onto a one-dimensional space. What one-dimensional subspace are we projecting onto? For each of the four sample points in X (not the centered version of X!), write the coordinate (in principal coordinate space, not in R2) that the point is projected to.
We are projecting onto the subspace spanned by (1/√2, −1/√2)⊤. (Equivalently, onto the line x + y = 0.) The projections are (6, −4) → 10/√2, (−3, 5) → −8/√2, (−2, 6) → −8/√2, (7, −3) → 10/√2.
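If you want to verify parts (1) and (2) numerically, a short check along these lines works. This is just a sanity check, not part of the required solution; the signs of the eigenvectors, and hence of the projections, may come out flipped.

```python
import numpy as np

# Numerical check of the PCA answer above.
X = np.array([[6., -4.], [-3., 5.], [-2., 6.], [7., -3.]])

Xc = X - X.mean(axis=0)              # center the sample points
cov = Xc.T @ Xc                      # covariance matrix (up to a 1/n factor)
eigvals, eigvecs = np.linalg.eigh(cov)
print(eigvals)                       # [  2. 162.]
print(eigvecs)                       # columns ~ (1, 1)/sqrt(2) and (1, -1)/sqrt(2), up to sign

v = eigvecs[:, -1]                   # principal direction (eigenvalue 162)
print(X @ v)                         # ~ [10, -8, -8, 10] / sqrt(2), possibly negated
```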
(3) [3 pts] Given a design matrix X that is taller than it is wide, prove that every right singular vector of X with singular value σ is an eigenvector of the covariance matrix with eigenvalue σ2.
If v is a right singular vector of X, then there is a singular value decomposition X = UDV⊤ such that v is a column of V. Here each of U and V has orthonormal columns, V is square, and D is square and diagonal. The covariance matrix is X⊤X = VDU⊤UDV⊤ = VD²V⊤. This is an eigendecomposition of X⊤X, so each singular vector in V with singular value σ is an eigenvector of X⊤X with eigenvalue σ².

Q6. [10 pts] Trees
[Figure: two views of the same k-d tree, a planar subdivision of the numbered sample points (left) and the corresponding tree of treenodes (right), with the query point marked by a bold X.]
(1) [5 pts] Above, we have two depictions of the same k-d tree, which we have built to solve nearest neighbor queries. Each node of the tree at right represents a rectangular box at left, and also stores one of the sample points that lie inside that box. (The root node represents the whole plane R2.) If a treenode stores sample point i, then the line passing through point i (in the diagram at left) determines which boxes the child treenodes represent.
Simulate running an exact 1-nearest neighbor query, where the bold X is the query point. Recall that the query algorithm visits the treenodes in a smart order, and keeps track of the nearest point it has seen so far.
• Write down the numbers

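For reference, the query procedure described above can be sketched as follows. This is a generic exact 1-nearest-neighbor search in a 2D k-d tree; the class layout and names are my own and are not tied to the specific tree in the figure.

```python
import math
from typing import Optional

# Each treenode stores one sample point and a splitting line. The search visits
# the child box containing the query first ("smart order"), tracks the nearest
# point seen so far, and prunes the other child when its box cannot be closer.

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point    # sample point stored at this treenode
        self.axis = axis      # 0 or 1: which coordinate the splitting line uses
        self.left = left      # child box with coordinate <= point[axis]
        self.right = right    # child box with coordinate > point[axis]

def nearest(node: Optional[Node], query, best=None, best_dist=math.inf):
    """Return (best_point, best_dist) for an exact 1-nearest-neighbor query."""
    if node is None:
        return best, best_dist
    d = math.dist(query, node.point)
    if d < best_dist:                 # this treenode's point is the nearest seen so far
        best, best_dist = node.point, d
    if query[node.axis] <= node.point[node.axis]:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best, best_dist = nearest(near, query, best, best_dist)
    # Search the far child only if its box could still contain a closer point.
    if abs(query[node.axis] - node.point[node.axis]) < best_dist:
        best, best_dist = nearest(far, query, best, best_dist)
    return best, best_dist
```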