CS 189 Introduction to Machine Learning
Spring 2017 Final
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your two page cheat sheet.
• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• You have 3 hours.
• Please write your initials at the top right of each odd-numbered page (e.g., write “JS” if those are your initials). Finish this by the end of your 3 hours.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 150. There are 26 multiple choice questions worth 3 points each, and 7 written questions worth a total of 72 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.
First name
First and last name of student to your left
First and last name of student to your right
Q1. [78 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.
(1) [3 pts] Which of the following are NP-hard problems? Let X ∈ Rn×d be a design matrix, let y ∈ Rn be a vector of labels, let L be the Laplacian matrix of some n-vertex graph, and let 1 = [1 1 . . . 1]⊤.
● min_{μ,y} Σ_{i=1}^{k} Σ_{j : yj=i} |Xj − μi|², where each μi is the mean of the sample points assigned class i
⃝ min_{y} Σ_{i=1}^{k} Σ_{j : yj=i} |Xj − μi|², with each μi fixed
⃝ min_{y∈Rn} ¼ y⊤Ly subject to |y|² = n and 1⊤y = 0
● min_{y∈Rn} ¼ y⊤Ly subject to ∀j, yj ∈ {−1, +1} and 1⊤y = 0
(2) [3 pts] Which clustering algorithms permit you to decide the number of clusters after the clustering is done?
⃝ k-means clustering
● a k-d tree used for divisive clustering
● agglomerative clustering with single linkage
⃝ spectral graph clustering with 3 eigenvectors
(3) [3 pts] For which of the following does normalizing your input features influence the predictions?
⃝ decision tree (with usual splitting method)
● neural network
● Lasso
● soft-margin support vector machine
(4) [3 pts] With the SVD, we write X = UDV⊤. For which of the following matrices are the eigenvectors the columns of U?
⃝ X⊤X
⃝ X⊤XX⊤X
● XX⊤
● XX⊤XX⊤
(5) [3 pts] Why is PCA sometimes used as a preprocessing step before regression?
● To reduce overfitting by removing poorly predictive dimensions.
⃝ To expose information missing from the input data.
● To make computation faster by reducing the dimensionality of the data.
⃝ For inference and scientific discovery, we prefer features that are not axis-aligned.
(6) [3 pts] Consider the matrix X = Σ_{i=1}^{r} αi ui vi⊤ where each αi is a scalar and each ui and vi is a vector. It is possible that the rank of X might be
⃝ r + 1
● r − 1
● r
● 0
(7) [3 pts] Why would we use a random forest instead of a decision tree?
⃝ For lower training error.
● To reduce the variance of the model.
● To better approximate posterior probabilities.
⃝ For a model that is easier for a human to interpret.
(8) [3 pts] What tends to be true about increasing the k in k-nearest neighbors?
● The decision boundary tends to get smoother.
● The bias tends to increase.
⃝ The variance tends to increase.
● As the number of sample points approaches infinity (with n/k → ∞), the error rate approaches less than twice the Bayes risk (assuming training and test points are drawn independently from the same distribution).
(9) [3 pts] Which of the following statements are true about the entropy of a discrete probability distribution?
● It is a useful criterion for picking splits in decision trees.
⃝ It is a convex function of the class probabilities.
● It is maximized when the probability distribution is uniform.
⃝ It is minimized when the probability distribution is uniform.
(10) [3 pts] A low-rank approximation of a matrix can be useful for
● removing noise.
● filling in unknown values.
● matrix compression.
● discovering latent categories in the data.
(11) [3 pts] Let L be the Laplacian matrix of a graph with n vertices. Let
γ = min_{y∈Rn} y⊤Ly subject to |y|² = n and 1⊤y = 0,
β = min_{y∈Rn} y⊤Ly subject to ∀i, yi ∈ {−1, +1} and 1⊤y = 0.
Which of the following are true for every Laplacian matrix L?
● β ≥ γ
(12) [3 pts] Which of the following are true about decision trees?
⃝ They can be used only for classification.
⃝ The tree depth never exceeds O(log n) for n sample points.
⃝ All the leaves must be pure.
● Pruning usually achieves better test accuracy than stopping early.
(13) [3 pts] Which of the following is an effective way of reducing overfitting in neural networks?
● Augmenting the training data with similar synthetic examples
⃝ Increasing the number of layers
● Weight decay (i.e., l2 regularization)
● Dropout
(14) [3 pts] If the VC dimension of a hypothesis class H is an integer D < ∞ (i.e., VC(H) = D), this means
● there exists some set of D points shattered by H.
⃝ all sets of D points are shattered by H.
● no set of D + 1 points is shattered by H.
● Π_H(D) = 2^D.
(15) [3 pts] Consider the minimizer w∗ of the l2-regularized least squares objective J(w) = |Xw − y|² + λ|w|² with λ > 0. Which of the following are true?
⃝ Xw∗ = y
⃝ w∗ exists if and only if X⊤X is nonsingular
⃝ w∗ = X+y, where X+ is the pseudoinverse of X
● The minimizer w∗ is unique
(16) [3 pts] You are training a neural network, but the training error is high. Which of the following, if done in isolation, has a better-than-tiny chance of reducing the training error?
● Adding another hidden layer
● Adding more units to hidden layers
● Normalizing the input data
⃝ Training on more data
(17) [3 pts] Filters in the late layers of a convolutional neural network designed to classify objects in photographs likely represent
⃝ edge detectors.
● concepts such as “this image contains wheels.”
● concepts such as “there is an animal.”
⃝ concepts such as “Jen is flirting with Dan.”
(18) [3 pts] Which of the following techniques usually speeds up the training of a sigmoid-based neural network on a classification task?
⃝ Using batch descent instead of stochastic
● Having a good initialization of the weights
⃝ Increasing the learning rate with every iteration
● Using the cross-entropy loss instead of the mean squared error
(19) [3 pts] In a soft-margin support vector machine, decreasing the slack penalty term C causes
⃝ more overfitting.
⃝ a smaller margin.
● less overfitting.
● less sensitivity to outliers.
(20) [3 pts] The shortest distance from a point z to a hyperplane w⊤x = 0 is
⃝ w⊤z
● w⊤z / |w|
⃝ w⊤z / |w|²
(21) [3 pts] The Bayes decision rule
● does the best a classifier can do, in expectation
⃝ can be computed exactly from a large sample
● chooses the class with the greatest posterior probability, if we use the 0-1 risk function
● minimizes the risk functional
(22) [3 pts] Which of the following are techniques commonly used in training neural nets?
⃝ linear programming
● backpropagation
⃝ Newton’s method
● cross-validation
(23) [3 pts] Which of these statements about learning theory are correct?
● The VC dimension of halfplanes is 3.
⃝ For a fixed set of training points, the more dichotomies Π we have, the higher the probability that the training error is close to the true risk.
⃝ The VC dimension of halfspaces in 3D is ∞.
● For a fixed hypothesis class H, the more training points we have, the higher the probability that the training error is close to the true risk.
(24) [3 pts] Which of the following statements are true for a design matrix X ∈ Rn×d with d > n? (The rows are n sample points and the columns represent d features.)
⃝ Least-squares linear regression computes the weights w = (X⊤X)−1X⊤y.
⃝ X has exactly d − n eigenvectors with eigenvalue zero.
⃝ The sample points are linearly separable.
● At least one principal component direction is orthogonal to a hyperplane that contains all the sample points.
(25) [3 pts] Which of the following visuals accurately represent the clustering produced by greedy agglomerative hierarchical clustering with centroid linkage on the set of feature vectors {(-2, -2), (-2, 0), (1, 3), (2, 2), (3, 4)}?
[The answer choices were figures (dendrograms/scatter plots over these five points); they are not reproducible from the extracted text.]
(26) [3 pts] Which of the following statements is true about the standard k-means clustering algorithm?
⃝ The random partition initialization method usually outperforms the Forgy method.
● After a sufficiently large number of iterations, the clusters will stop changing.
⃝ It is computationally infeasible to find the optimal clustering of n = 15 points in k = 3 clusters.
⃝ You can use the metric d(x, y) = (x · y)/(|x| · |y|).
Q2. [9 pts] A Miscellany
(a) [3 pts] Consider a convolutional neural network for reading the handwritten MNIST letters, which are 28 × 28 images. Suppose the first hidden layer is a convolutional layer with 20 different 5 × 5 filters, applied to the input image with a stride of 1 (i.e., every filter is applied to every 5 × 5 patch of the image, with patches allowed to overlap). Each filter has a bias weight. How many weights (parameters) does this layer use?
20 × (5 × 5 + 1) = 520.
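The same count follows from the general weight-sharing formula, filters × (filter height × filter width × input channels + 1). A minimal sketch (not part of the exam solution; the helper name is illustrative):

import math  # not strictly needed; pure arithmetic

def conv_param_count(num_filters, filter_h, filter_w, in_channels=1, bias=True):
    """Number of weights in a convolutional layer with weight sharing."""
    per_filter = filter_h * filter_w * in_channels + (1 if bias else 0)
    return num_filters * per_filter

# The layer from part (a): 20 filters of size 5x5 on a single-channel image, one bias each.
assert conv_param_count(20, 5, 5) == 20 * (5 * 5 + 1) == 520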
(b) [3 pts] Let X be an n × d design matrix representing n sample points in Rd. Let X = UDV⊤ be the singular value decomposition of X. We stated in lecture that row i of the matrix UD gives the coordinates of sample point Xi in principal coordinates space, i.e., Xi · vj for each j, where Xi is the ith row of X and vj is the jth column of V. Show that this is true.
As V is an orthogonal matrix, we can write XV = UDV⊤V = UD. By the definition of matrix multiplication, (UD)ij = Xi · vj.
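A quick numerical illustration of this identity (not part of the exam solution); note that numpy’s SVD returns V⊤, so we transpose it back:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # toy design matrix, n = 6, d = 4

U, s, Vt = np.linalg.svd(X, full_matrices=False)
UD = U @ np.diag(s)                            # row i = principal coordinates of X_i

# (UD)_{ij} should equal X_i . v_j, i.e. UD == XV
assert np.allclose(UD, X @ Vt.T)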
(c) [3 pts] Let x, y ∈ Rd be two points (e.g., sample or test points). Consider the function k(x, y) = x⊤rev(y), where rev(y) reverses the order of the components in y. For example, rev([1 2 3]⊤) = [3 2 1]⊤. Show that k cannot be a valid kernel function.
Hint: remember how the kernel function is defined, and show a simple two-dimensional counterexample.
We have that k((−1, 1), (−1, 1)) = −2, but this is impossible as, if k is a valid kernel, then there is some function Φ such that k(x, x) = Φ(x)⊤Φ(x) ≥ 0.
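The counterexample is easy to check numerically; a minimal sketch:

import numpy as np

def k(x, y):
    """k(x, y) = x^T rev(y), where rev reverses the components of y."""
    return x @ y[::-1]

x = np.array([-1.0, 1.0])
print(k(x, x))   # -2.0: negative on the diagonal, so k cannot be a valid (PSD) kernel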
Q3. [10 pts] Maximum Likelihood Estimation for Reliability Testing
Suppose we are reliability testing n units taken randomly from a population of identical appliances. We want to estimate the mean failure time of the population. We assume the failure times come from an exponential distribution with parameter λ > 0, whose probability density function is f(x) = λe^{−λx} (on the domain x ≥ 0) and whose cumulative distribution function is F(x) = ∫_0^x f(x) dx = 1 − e^{−λx}.
(a) [6 pts] In an ideal (but impractical) scenario, we run the units until they all fail. The failure times are t1, t2, . . . , tn.
Formulate the likelihood function L(λ; t1 , . . . , tn ) for our data. Then find the maximum likelihood estimate λˆ for the distribution’s parameter.
L(λ; t1, . . . , tn) = ∏_{i=1}^{n} f(ti) = ∏_{i=1}^{n} λe^{−λti} = λ^n e^{−λ Σ_{i=1}^{n} ti}
ln L(λ) = n ln λ − λ Σ_{i=1}^{n} ti
∂/∂λ ln L(λ) = n/λ − Σ_{i=1}^{n} ti = 0
λ̂ = n / Σ_{i=1}^{n} ti
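A small simulation (not part of the exam) that checks λ̂ = n / Σ ti against data drawn from a known exponential; the chosen rate and sample size are arbitrary:

import numpy as np

rng = np.random.default_rng(189)
lam_true = 0.5
t = rng.exponential(scale=1.0 / lam_true, size=100_000)   # simulated failure times

lam_hat = len(t) / t.sum()          # the MLE from part (a)
print(lam_hat)                      # close to 0.5 for a large sample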
(b) [4 pts] In a more realistic scenario, we run the units for a fixed time T . We observe r unit failures, where 0 ≤ r ≤ n, and there are n − r units that survive the entire time T without failing. The failure times are t1 , t2 , . . . , tr .
Formulate the likelihood function L(λ; n, r, t1 , . . . , tr ) for our data. Then find the maximum likelihood estimate λˆ for the distribution’s parameter.
Hint 1: What is the probability that a unit will not fail during time T?
Hint 2: It is okay to define L(λ) in a way that includes contributions (densities and probability masses) that are not commensurate with each other. Then the constant of proportionality of L(λ) is meaningless, but that constant is irrelevant for finding the best-fit parameter λ̂.
Hint 3: If you’re confused, for part marks write down the likelihood that r units fail and n − r units survive; then try the full problem.
Hint 4: If you do it right, λ̂ will be the number of observed failures divided by the sum of unit test times.
L(λ; n, r, t1, . . . , tr) ∝ ∏_{i=1}^{r} f(ti) · (1 − F(T))^{n−r}
= ∏_{i=1}^{r} λe^{−λti} · (e^{−λT})^{n−r}
= λ^r e^{−λ Σ_{i=1}^{r} ti} e^{−λ(n−r)T}
ln L(λ) = r ln λ − λ Σ_{i=1}^{r} ti − λ(n − r)T + constant
∂/∂λ ln L(λ) = r/λ − Σ_{i=1}^{r} ti − (n − r)T = 0
λ̂ = r / (Σ_{i=1}^{r} ti + (n − r)T)
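The censored-data estimate λ̂ = r / (Σ ti + (n − r)T) can be checked the same way; a sketch, with the test horizon T chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(189)
lam_true, n, T = 0.5, 100_000, 3.0
failure_times = rng.exponential(scale=1.0 / lam_true, size=n)

observed = failure_times[failure_times <= T]    # the r failures seen before time T
r = len(observed)
lam_hat = r / (observed.sum() + (n - r) * T)    # the MLE from part (b)
print(lam_hat)                                  # close to 0.5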
Q4. [10 pts] Decision Trees
Consider the design matrix
X = ( 4 6 9 1 7 5 )⊤
    ( 1 6 5 2 3 4 )
representing 6 sample points, each with two features f1 and f2 (the first row above lists the f1 values and the second row the f2 values).
The labels for the data are
y = ( 1 0 1 0 1 0 )⊤.
In this question, we build a decision tree of depth 2 by hand to classify the data.
(a) [2 pts] What is the entropy at the root of the tree?
−0.5 log2 0.5 − 0.5 log2 0.5 = 1
(b) [3 pts] What is the rule for the first split? Write your answer in a form like f1 ≥ 4 or f2 ≥ 3. Hint: you should be able to eyeball the best split without calculating the entropies.
If we sort by f1, the features and the corresponding labels are
f1:     1 4 5 6 7 9
labels: 0 1 0 0 1 1
If we sort by f2, we have
f2:     1 2 3 4 5 6
labels: 1 0 1 0 1 0
The best split is f1 ≥ 7.
(c) [3 pts] For each of the two treenodes after the first split, what is the rule for the second split? For the treenode with labels (1, 1), there’s no need to split again.
For the treenode with labels (0, 1, 0, 0), if we sort by f1, we have
f1:     1 4 5 6
labels: 0 1 0 0
If we sort using f2, we get
f2:     1 2 4 6
labels: 1 0 0 0
We easily see we should choose f2 ≥ 2.
(d) [2 pts] Let’s return to the root of the tree, and suppose we’re incompetent tree builders. Is there a (not trivial) split at the root that would have given us an information gain of zero? Explain your answer.
Yes. The rules f1 ≥ 5, f2 ≥ 3, or f2 ≥ 5 would all fail to reduce the weighted average entropy below 1.
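These entropy and information-gain claims are easy to verify mechanically; a small numpy sketch (not part of the original solution), run on the six exam points:

import numpy as np

f1 = np.array([4, 6, 9, 1, 7, 5])
f2 = np.array([1, 6, 5, 2, 3, 4])
y  = np.array([1, 0, 1, 0, 1, 0])

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def weighted_entropy(feature, threshold, labels):
    """Average child entropy for the split 'feature >= threshold'."""
    left, right = labels[feature < threshold], labels[feature >= threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

print(entropy(y))                    # 1.0 at the root, part (a)
print(weighted_entropy(f1, 7, y))    # ~0.54: the first split f1 >= 7, part (b)
print(weighted_entropy(f1, 5, y))    # 1.0: zero information gain, part (d)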
Q5. [11 pts] Bagging and Random Forests
We are building a random forest for a 2-class classification problem with t decision trees and bagging. The input is a n × d design matrix X representing n sample points in Rd (quantitative real-valued features only). For the ith decision tree we create an n-point training set X(i) through standard bagging. At each node of each tree, we randomly select k of the features (this random subset is selected independently for each treenode) and choose the single-feature split that maximizes the information gain, compared to all possible single-feature splits on those k features. Assume that we can radix sort real numbers in linear time, and we can randomly select an item from a set in constant time.
(a) [3 pts] Remind us how bagging works. How do we generate the data sets X(i)? What do we do with duplicate points?
For each training set X(i), we select n sample points from X uniformly at random with replacement. Duplicate points have proportionally greater weight in the entropy (or other cost function) calculations. (We will accept an answer that states that duplicate points are treated as if they were separate points infinitesimally close together.)
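A minimal sketch of generating the bagged index sets (the function name is illustrative):

import numpy as np

def bagged_indices(n, num_trees, rng):
    """For each tree, sample n point indices uniformly at random with replacement."""
    return [rng.integers(0, n, size=n) for _ in range(num_trees)]

rng = np.random.default_rng(0)
for idx in bagged_indices(n=10, num_trees=3, rng=rng):
    print(idx)   # duplicates appear; they simply get proportionally more weight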
(b) [3 pts] Fill in the blanks to derive the overall running time to construct a random forest with bagging and random subset selection. Let h be the height/depth (they’re the same thing) of the tallest/deepest tree in the forest. You must use the tightest bounds possible with respect to n, d, t, k, h, and n′.
Consider choosing the split at a treenode whose box contains n′ sample points. We can choose the best split for these n′ sample points in O( ) time. Therefore, the running time per sample point in that node is O( ).
Each sample point in X(i) participates in at most O( ) treenodes, so each sample point contributes at most O( ) to the time. Therefore, the total running time for one tree is O( ).
We have t trees, so the total running time to create the random forest is O( ).
The blanks in order: O(n′k), O(k), O(h), O(kh), O(nkh), O(nkht). They are each worth half a point.
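To make the O(n′k) per-node bound concrete, here is a sketch of the split search over k randomly chosen features, assuming binary labels; it updates the class counts incrementally while sweeping thresholds, and uses comparison sorting for simplicity where the analysis assumes linear-time radix sort. Names are illustrative, not from the exam.

import numpy as np

def node_entropy(counts):
    """Binary-class entropy from a length-2 count vector."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return -(p * np.log2(p)).sum()

def best_split(X_node, y_node, k, rng):
    """Best single-feature split over k random features at one treenode."""
    n_prime, d = X_node.shape
    best = (np.inf, None, None)                       # (weighted entropy, feature, threshold)
    for f in rng.choice(d, size=k, replace=False):
        order = np.argsort(X_node[:, f])              # radix sort would make this O(n')
        vals, labels = X_node[order, f], y_node[order]
        left = np.zeros(2)                            # class counts below the threshold
        right = np.bincount(labels, minlength=2).astype(float)
        for i in range(1, n_prime):                   # sweep candidate thresholds
            left[labels[i - 1]] += 1
            right[labels[i - 1]] -= 1
            if vals[i] == vals[i - 1]:
                continue
            cost = (i * node_entropy(left) + (n_prime - i) * node_entropy(right)) / n_prime
            if cost < best[0]:
                best = (cost, f, vals[i])
    return best

rng = np.random.default_rng(0)
X_node = np.column_stack([[4, 6, 9, 1, 7, 5], [1, 6, 5, 2, 3, 4]]).astype(float)
y_node = np.array([1, 0, 1, 0, 1, 0])
print(best_split(X_node, y_node, k=2, rng=rng))       # recovers the f1 >= 7 split from Q4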
(c) [2 pts] If we instead use a support vector machine to choose the split in each treenode, how does that change the asymp- totic query time to classify a test point?
It slows queries down by a factor of Θ(d) or Θ(k) (depending whether you run the SVM on k features or all d features— we’ll accept either interpretation), because it is necessary to inspect all d (or k) features of the query point at each treenode.
(d) [3 pts] Why does bagging by itself (without random subset selection) tend not to improve the performance of decision trees as much as we might expect?
It is common that the same few features tend to dominate in all of the subsets, so almost all the trees will tend to have very similar early splits, and therefore all the trees will produce very similar estimates. The models are not decorrelated enough.
Q6. [11 pts] One-Dimensional ConvNet Backprop
Consider this convolutional neural network architecture. In the first layer, we have a one-dimensional convolution with a single filter of size 3, such that hi = s(Σ_{j=1}^{3} vj x_{i+j−1}). The second layer is fully connected, such that z = Σ_{i=1}^{4} wi hi. The hidden units’ activation function s(x) is the logistic (sigmoid) function with derivative s′(x) = s(x) (1 − s(x)). The output unit is linear (no activation function). We perform gradient descent on the loss function R = (y − z)², where y is the training label for x.
(a) [1 pt] What is the total number of parameters in this neural network? Recall that convolutional layers share weights. There are no bias terms.
The answer is 7. There are 3 parameters in layer 1 and 4 parameters in layer 2.
(b) [4 pts] Compute ∂R/∂wi.
∂R/∂wi = −2(y − z) hi
(c) [1 pt] Vectorize the previous expression—that is, write ∂R/∂w.
∂R/∂w = −2(y − z) h
(d) [5 pts] Compute ∂R/∂vj.
∂R/∂vj = −2(y − z) ∂z/∂vj = −2(y − z) Σ_{i=1}^{4} (∂z/∂hi)(∂hi/∂vj) = −2(y − z) Σ_{i=1}^{4} wi hi (1 − hi) x_{i+j−1}
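These gradients can be checked against finite differences; a minimal sketch of the network (not part of the exam solution), with arbitrary random inputs and x of length 6 so the convolution yields 4 hidden units:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, v, w):
    # h_i = s(sum_{j=1}^{3} v_j x_{i+j-1}),  z = sum_i w_i h_i,  R = (y - z)^2
    h = sigmoid(np.array([v @ x[i:i + 3] for i in range(4)]))
    return h, w @ h

def gradients(x, y, v, w):
    h, z = forward(x, v, w)
    dR_dw = -2 * (y - z) * h                                    # part (c)
    dR_dv = np.array([-2 * (y - z) * sum(w[i] * h[i] * (1 - h[i]) * x[i + j]
                                         for i in range(4))
                      for j in range(3)])                       # part (d), zero-indexed j
    return dR_dw, dR_dv

rng = np.random.default_rng(0)
x, y = rng.normal(size=6), 1.0
v, w = rng.normal(size=3), rng.normal(size=4)

dR_dw, dR_dv = gradients(x, y, v, w)

# finite-difference check on v_1 (index 0)
eps = 1e-6
vp = v.copy(); vp[0] += eps
_, zp = forward(x, vp, w)
_, z0 = forward(x, v, w)
print(dR_dv[0], ((y - zp) ** 2 - (y - z0) ** 2) / eps)          # should roughly agree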
Q7. [11 pts] Spectral Graph Partitioning
(a) [3 pts] Write down the Laplacian matrix LG of the following graph G. Every edge has weight 1.
[The figure of G is not reproduced here; from the Laplacian below, G is a triangle on vertices 1, 2, 3 plus the edges {4, 7} and {5, 6}.]
       ⎡  2 −1 −1  0  0  0  0 ⎤
       ⎢ −1  2 −1  0  0  0  0 ⎥
       ⎢ −1 −1  2  0  0  0  0 ⎥
  LG = ⎢  0  0  0  1  0  0 −1 ⎥
       ⎢  0  0  0  0  1 −1  0 ⎥
       ⎢  0  0  0  0 −1  1  0 ⎥
       ⎣  0  0  0 −1  0  0  1 ⎦
(b) [2 pts] Find three orthogonal eigenvectors of LG, all having eigenvalue 0.
x = [1 1 1 0 0 0 0]⊤, y = [0 0 0 1 0 0 1]⊤, z = [0 0 0 0 1 1 0]⊤ (the indicator vectors of the three connected components).
(c) [2 pts] Use two of those three eigenvectors (it doesn’t matter which two) to assign each vertex of G a spectral vector in R2. Draw these vectors in the plane, and explain how they partition G into three clusters. (Optional alternative: if you can draw 3D figures well, you are welcome to use all three eigenvectors and assign each vertex a spectral vector in R3.)
The eigenvectors x and y give the embedding
1, 2, 3 → (1, 0)⊤,   4, 7 → (0, 1)⊤,   5, 6 → (0, 0)⊤.
Each of the three clusters is mapped to a single point in R2.
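The Laplacian and its zero-eigenvalue eigenvectors can be verified numerically; a short sketch (vertices 0-indexed):

import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (3, 6), (4, 5)]   # triangle 1-2-3, plus {4,7} and {5,6}
n = 7
L = np.zeros((n, n))
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1

x = np.array([1, 1, 1, 0, 0, 0, 0], dtype=float)   # indicator vectors of the components
y = np.array([0, 0, 0, 1, 0, 0, 1], dtype=float)
z = np.array([0, 0, 0, 0, 1, 1, 0], dtype=float)
for v in (x, y, z):
    assert np.allclose(L @ v, 0)                   # eigenvalue 0

print(np.round(np.linalg.eigvalsh(L), 6))          # three zeros: one per connected component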
(d) [3 pts] Let Kn be the complete graph on n vertices (every pair of vertices is connected by an edge of weight 1) and let LKn be its Laplacian matrix. The eigenvectors of LKn are v1 = 1 = [1 1 . . . 1]⊤ and every vector that is orthogonal to 1. What are the eigenvalues of LKn?
λ1 = 0 and λ2 = · · · = λn = n.
(e) [1 pt] What property of these eigenvalues gives us a hint that the complete graph does not have any good partitions?
λ2 is large, so there is no low-sparsity cut. (The optimal cut has sparsity ≥ λ2/2.)
Q8. [10 pts] We Hope You Learned This
Consider learning closed intervals on the real line. Our hypothesis class H consists of all intervals of the form [a, b] where a < b and a, b ∈ R. We interpret an interval (hypothesis) [a, b] ∈ H as a classifier that identifies a point x as being in class C if a ≤ x ≤ b, and not in class C otherwise.