
CS 189 Introduction to Machine Learning
Spring 2019 Midterm
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your one-page cheat sheet.


• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• You have 1 hour and 20 minutes.
• Please write your initials at the top right of each page after this one (e.g., write “JS” if you are Jonathan Shewchuk). Finish this by the end of your 1 hour and 20 minutes.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 100. There are 20 multiple choice questions worth 3 points each, and 4 written questions worth a total of 40 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.
First name
First and last name of student to your left
First and last name of student to your right

Q1. [60 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.
(a) [3 pts] Let A be a real, symmetric n × n matrix. Which of the following are true about A’s eigenvectors and eigenvalues?
⃝ A can have no more than n distinct eigenvalues
⃝ A can have no more than 2n distinct unit-length eigenvectors
⃝ The vector ⃗0 is an eigenvector, because A⃗0 = λ⃗0
⃝ We can find n mutually orthogonal eigenvectors of A
(b) [3 pts] The matrix that has eigenvector [1, 2]⊤ with eigenvalue 2 and eigenvector [−2, 1]⊤ with eigenvalue 1 (note that these are not unit eigenvectors!) is
⃝ [ 9  −2 ; −2  6 ]
⃝ [ 9/5  −2/5 ; −2/5  6/5 ]
⃝ [ 6  2 ; 2  9 ]
⃝ [ 6/5  2/5 ; 2/5  9/5 ]
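One way to sanity-check the arithmetic behind (b) is to rebuild a matrix from its eigenpairs as A = V diag(λ) V⁻¹ and verify A v = λ v. A minimal numpy sketch, using the eigenpairs stated in the question:

    import numpy as np

    # Columns of V are the given (non-unit) eigenvectors; lam holds the matching eigenvalues.
    V = np.array([[1.0, -2.0],
                  [2.0,  1.0]])
    lam = np.array([2.0, 1.0])

    A = V @ np.diag(lam) @ np.linalg.inv(V)              # A = V diag(lam) V^{-1}
    print(A)
    print(np.allclose(A @ V[:, 0], lam[0] * V[:, 0]),    # check A v1 = 2 v1
          np.allclose(A @ V[:, 1], lam[1] * V[:, 1]))    # check A v2 = 1 v2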
(c) [3 pts] Consider a binary classification problem where we know both of the class conditional distributions exactly. To compute the risk,
⃝ we need to know all the sample points
⃝ we need to know the class prior probabilities
⃝ we need to know the loss function
⃝ we need to use gradient descent
(d) [3 pts] Assuming we can find algorithms to minimize them, which of the following cost functions will encourage sparse solutions (i.e., solutions where many components of w are zero)?
⃝ ∥Xw − y∥² + λ∥w∥₁
⃝ ∥Xw − y∥² + λ · (# of nonzero components of w)
⃝ ∥Xw − y∥² + λ∥w∥₁²
⃝ ∥Xw − y∥² + λ∥w∥²
(e) [3 pts] Which of the following statements about logistic regression are correct?
⃝ The cost function of logistic regression is convex
⃝ Logistic regression uses the squared error as the loss function
⃝ The cost function of logistic regression is concave
⃝ Logistic regression assumes that each class’s points are generated from a Gaussian distribution
(f) [3 pts] Which of the following statements about stochastic gradient descent and Newton’s method are correct?
⃝ Newton’s method often converges faster than stochastic gradient descent, especially when the dimension is small
⃝ Newton’s method converges in one iteration when the cost function is exactly quadratic with one unique minimum.
⃝ If the function is continuous with continuous derivatives, Newton’s method always finds a local minimum
⃝ Stochastic gradient descent reduces the cost function at every iteration.

(g) [3 pts] Let X ∈ Rn×d be a design matrix containing n sample points with d features each. Let y ∈ Rn be the corresponding real-valued labels. What is always true about every solution w∗ that locally minimizes the linear least squares objective function ∥Xw − y∥², no matter what the value of X is?
⃝ w∗ = X⁺y (where X⁺ is the pseudoinverse)
⃝ w∗ satisfies the normal equations
⃝ w∗ is in the null space of X
⃝ All of the local minima are global minima
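For intuition on (g), the relationship between the pseudoinverse solution and the normal equations X⊤Xw = X⊤y can be checked numerically. A small sketch, with a made-up design matrix and labels chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))                # arbitrary 5-point, 3-feature design matrix
    y = rng.normal(size=5)                     # arbitrary labels

    w = np.linalg.pinv(X) @ y                  # pseudoinverse solution w = X^+ y
    print(np.allclose(X.T @ X @ w, X.T @ y))   # does w satisfy the normal equations?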
(h) [3 pts] We are using linear discriminant analysis to classify points x ∈ Rd into three different classes. Let S be the set of points in Rd that our trained model classifies as belonging to the first class. Which of the following are true?
⃝ The decision boundary of S is always a hyperplane
⃝ The decision boundary of S is always a subset of a union of hyperplanes
⃝ S can be the whole space Rd
⃝ S is always connected (that is, every pair of points in S is connected by a path in S)
(i) [3 pts] Which of the following apply to linear discriminant analysis?
⃝ It approximates the Bayes decision rule
⃝ The model produced by LDA is never the same as the model produced by QDA
⃝ You calculate the sample mean for each class
⃝ You calculate the sample covariance matrix using the mean of all the data points
(j) [3 pts] Which of the following are reasons why you might adjust your model in ways that increase the bias?
⃝ You observe high training error and high validation error
⃝ You have few data points
⃝ You observe low training error and high validation error
⃝ Your data are not linearly separable
(k) [3 pts] Suppose you are given the one-dimensional data {x1, x2, . . . , x25} illustrated below and you have only a hard-margin support vector machine (with a fictitious dimension) at your disposal. Which of the following modifications can give you 100% training accuracy?
⃝ Centering the data
⃝ Add a feature xi²
⃝ Add a feature that is 1 if x ≤ 50, or −1 if x > 50
⃝ Add two features, xi² and xi³
(l) [3 pts] You are performing least-squares polynomial regression. As the degree of your polynomials increases, which of the following is commonly seen to go down at first but then go up?
⃝ Training error
⃝ Validation error
⃝ Variance
⃝ Bias

(m) [3 pts] Let f : R → R be a continuous, smooth function whose derivative f′(x) is also continuous. Suppose f has a unique global minimum x∗ ∈ (−∞, ∞), and you are using gradient descent to find x∗. You fix some x(0) ∈ R and ε > 0, and run x(t) = x(t−1) − ε f′(x(t−1)) repeatedly. Which of the following statements are true?
⃝ Gradient descent is sure to converge, to some value, for any step size ε > 0
⃝ If f has a local minimum x′ different from the global one, i.e., x′ ≠ x∗, and x(t) = x′ for some t, gradient descent will not converge to x∗
⃝ Assuming gradient descent converges, it converges to x∗ if and only if f is convex
⃝ If, additionally, f is the objective function of logistic regression, and gradient descent converges, then it converges to x∗
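A minimal version of the iteration in (m), run on one smooth function picked purely for illustration (f(x) = x⁴ − x³, whose unique global minimum is at x = 3/4):

    # Gradient descent x^(t) = x^(t-1) - eps * f'(x^(t-1)) on f(x) = x^4 - x^3.
    def f_prime(x):
        return 4 * x**3 - 3 * x**2

    x = 2.0          # x^(0)
    eps = 0.01       # step size
    for _ in range(5000):
        x = x - eps * f_prime(x)
    print(x)         # approaches the unique global minimum at x = 3/4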
(n) [3 pts] Suppose you are trying to choose a good subset of features for a least-squares linear regression model. Let algorithm A be forward stepwise selection, where we start with zero features and at each step add the new feature that most decreases validation error, stopping only when validation error starts increasing. Let algorithm B be similar, but at each step we include the new feature that most decreases training error (measured by the usual cost function, mean squared error), stopping only when training error starts increasing. Which of the following is true?
⃝ Algorithm B will select no more features than Algorithm A does
⃝ Algorithm B will select at least as many features as Algorithm A does
⃝ The first feature chosen by the two algorithms will be the same
⃝ Algorithm A sometimes selects features that Algorithm B does not select
(o) [3 pts] Suppose you have a multivariate normal distribution with a positive definite covariance matrix Σ. Consider a second multivariate Gaussian distribution whose covariance matrix is κΣ, where κ = cos θ > 0. Which of the following statements are true about the ellipsoidal isocontours of the second distribution, compared to the first distribution?
⃝ The principal axes of the ellipsoids would be rotated by θ
⃝ The principal axes (radii) of the ellipsoids will be scaled by κ
⃝ The principal axes (radii) of the ellipsoids will be scaled by 1/κ
⃝ The principal axes (radii) of the ellipsoids will be scaled by √κ
(p) [3 pts] Suppose M and N are positive semidefinite matrices. Under what conditions is M − N certain to be positive semidefinite?
⃝ The smallest eigenvalue of M is greater than the largest eigenvalue of N
⃝ If M and N share all the same eigenvectors
⃝ The largest eigenvalue of M is greater than the largest eigenvalue of N
(q) [3 pts] You are given four sample points X1 = [−1, −1]⊤, X2 = [−1, 1]⊤, X3 = [1, −1]⊤, and X4 = [1, 1]⊤. Each of them is in class C or class D. For what feature representations are the lifted points Φ(Xi) guaranteed to be linearly separable (with no point lying exactly on the decision boundary) for every possible class labeling?
⃝ Φ(x) = [x1, x2, 1]
⃝ Φ(x) = [x1², x2², x1, x2, 1]
⃝ Φ(x) = [x1, x2, x1² + x2², 1]
⃝ Φ(x) = [x1², x2², x1x2, x1, x2, 1]
(r) [3 pts] Let Li(w) be the loss corresponding to a sample point Xi with label yi. The update rule for stochastic gradient descent with step size ε is
⃝ wnew ← w − ε ∇Xi Li(w)
⃝ wnew ← w − ε ∑_{i=1}^n ∇Xi Li(w)
⃝ wnew ← w − ε ∇w Li(w)
⃝ wnew ← w − ε ∑_{i=1}^n ∇w Li(w)
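As a concrete instance of the kind of rule asked about in (r), here is a single stochastic gradient step with squared loss Li(w) = (Xi · w − yi)², chosen purely as an illustrative loss (the question does not fix Li), on arbitrary made-up data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 3))           # arbitrary data: 10 points, 3 features
    y = rng.normal(size=10)
    w = np.zeros(3)
    eps = 0.1

    i = rng.integers(10)                   # pick a single sample point at random
    grad = 2 * (X[i] @ w - y[i]) * X[i]    # gradient of L_i(w) = (X_i.w - y_i)^2 with respect to w
    w = w - eps * grad                     # one stochastic gradient descent step
    print(w)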

(s) [3 pts] Suppose you have a sample in which each point has d features and comes from class C or class D. The class conditional distributions are (Xi | yi = C) ∼ N(μC, σC²) and (Xi | yi = D) ∼ N(μD, σD²) for unknown values μC, μD ∈ Rd and σC², σD² ∈ R. The class priors are πC and πD. We use 0-1 loss.
⃝ If πC = πD and σC = σD, then the Bayes decision rule assigns a test point z to the class whose mean is closest to z.
⃝ If σC = σD, then the Bayes decision boundary is always linear.
⃝ If πC = πD, then the Bayes decision rule is r∗(z) = argmin_{A∈{C,D}} ( |z − μA|²/(2σA²) + d ln σA )
⃝ If σC = σD, then QDA will always produce a linear decision boundary when you fit it to your sample.
(t) [3 pts] Let f ∈ [0, 1] be the unknown, fixed probability that a person in a certain population owns a dog (how cute!). We model f with a hypothesis h ∈ [0, 1]. Before we observe any data at all, we can’t even guess what f might be, so we set our prior probability for f to be the uniform distribution, i.e., P( f = h) = 1 over h ∈ [0, 1]. Now, we pick one person from the population, and it turns out that they have a cute little labradoodle named Dr. Frankenstein. Which of the following is true about the posterior probability that f = h given this one sample point?
⃝ The posterior is uniform over h ∈ [0, 1].
⃝ The posterior increases linearly over h ∈ [0, 1].
⃝ The posterior increases nonlinearly over h ∈ [0, 1].
⃝ The posterior is a delta function at 1.

Q2. [10 pts] The Perceptron Learning Algorithm
The table below is a list of sample points in R2. Suppose that we run the perceptron algorithm, with a fictitious dimension, on these sample points. We record the total number of times each point participates in a stochastic gradient descent step because it is misclassified, throughout the run of the algorithm.
x1   x2    y    times misclassified
−3    2   +1    0
−1    1   +1
−1   −1   −1
 2    2   −1
 1   −1   −1
(a) [5 pts] Suppose that the learning rate is ε = 1 and the initial weight vector is w(0) = (−3, 2, 1), where the last component is the bias term. What is the equation of the separating line found by the algorithm, in terms of the features x1 and x2?
(b) [2 pts] In some cases, removing even a single point can change the decision boundary learned by the perceptron algorithm. For which, if any, point(s) in our dataset would the learned decision boundary change if we removed it? Explain your answer.
(c) [3 pts] How would our result differ if we were to add the additional training point (2, −2) with label +1?
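The update used throughout Q2 is the perceptron step w ← w + ε yi [xi; 1], applied whenever point i is misclassified. The sketch below only illustrates that procedure on the table’s points, counting how many times each point triggers an update during one particular (cyclic) run; it is a sketch of the mechanics, not a substitute for the counts recorded in the table.

    import numpy as np

    # Sample points from the table, each augmented with a fictitious 1 for the bias term.
    X = np.array([[-3.0,  2.0, 1.0],
                  [-1.0,  1.0, 1.0],
                  [-1.0, -1.0, 1.0],
                  [ 2.0,  2.0, 1.0],
                  [ 1.0, -1.0, 1.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0, -1.0])

    w = np.array([-3.0, 2.0, 1.0])          # w^(0), with the bias term last
    eps = 1.0
    times_misclassified = np.zeros(len(y), dtype=int)

    changed = True
    while changed:                          # repeat until every point is classified correctly
        changed = False
        for i in range(len(y)):
            if y[i] * (w @ X[i]) <= 0:      # point i is misclassified
                w = w + eps * y[i] * X[i]   # perceptron (stochastic gradient) update
                times_misclassified[i] += 1
                changed = True

    print(w, times_misclassified)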

Q3. [10 pts] Quadratic Discriminant Analysis
(a) [4 pts] Consider 12 labeled data points sampled from three distinct classes:
Class 0: [0, 2]⊤, [−2, 0]⊤, [5, 3]⊤, [−3, −5]⊤
Class 1: [√2, √2]⊤, [−√2, −√2]⊤, [4√2, −4√2]⊤, [−4√2, 4√2]⊤
Class 2: [3, 5]⊤, [1, 3]⊤, [8, 6]⊤, [0, −2]⊤
For each class C ∈ {0, 1, 2}, compute the class sample mean μC , the class sample covariance matrix ΣC , and the estimate
of the prior probability πC that a point belongs to class C. (Hint: μ1 = μ0 and Σ2 = Σ0.)
(b) [4 pts] Sketch one or more isocontours of the QDA-produced normal distribution or quadratic discriminant function (they each have the same contours) for each class. The isovalues are not important; the important aspects are the centers, axis directions, and relative axis lengths of the isocontours. Clearly label the centers of the isocontours and to which class they correspond.
(c) [2 pts] Suppose that we apply LDA to classify the data given in part (a). Why will this give a poor decision boundary?
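As a numerical companion to part (a): the QDA estimates are per-class sample means, per-class sample covariances, and empirical priors. A sketch of that computation on class 0’s four points only, using the maximum-likelihood covariance (dividing by the class count; substitute your course’s convention if it differs):

    import numpy as np

    X0 = np.array([[ 0.0,  2.0],    # the four class-0 points from part (a)
                   [-2.0,  0.0],
                   [ 5.0,  3.0],
                   [-3.0, -5.0]])
    n_total = 12                    # 12 labeled points across the three classes

    mu0 = X0.mean(axis=0)                             # class sample mean
    Sigma0 = (X0 - mu0).T @ (X0 - mu0) / len(X0)      # class sample covariance (divide by n_C)
    pi0 = len(X0) / n_total                           # estimated prior for class 0
    print(mu0, Sigma0, pi0)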

Q4. [10 pts] Ridge Regression with One Feature
We are given a sample in which each point has only one feature. Therefore, our design matrix is a column vector, which we
will write x ∈ Rn (instead of X). Consider the scalar data generation model yi = ωxi + ei
where xi ∈ R is point i’s sole input feature, yi ∈ R is its scalar label (a noisy measurement), ei ∼ N(0,1) is standard unit- variance zero-mean Gaussian noise, and ω ∈ R is the true, fixed linear relationship that we would like to estimate. The ei’s are independent and identically distributed random variables, and the sole source of randomness. We will treat the design vector x as fixed (not random).
Our goal is to fit a linear model and get an estimate wλ for the true parameter ω. The ridge regression estimate for ω is

wλ = argmin_{w ∈ R} [ λw² + ∑_{i=1}^n (yi − xi w)² ],   where λ ≥ 0.

(a) [4 pts] Express wλ in terms of λ, Sxx, and Sxy, where Sxx = ∑_{i=1}^n xi² and Sxy = ∑_{i=1}^n xi yi.
(b) [5 pts] Compute the squared bias of the ridge estimate wλ z at a test point z ∈ R, defined to be bias²(wλ, z) = (E[wλ z] − ωz)², where the expectation is taken with respect to the yi’s. Express your result in terms of ω, λ, Sxx, and z. (Hint: simplify the expectation first.)
(c) [1 pt] What will the bias be if we are using ordinary least squares, i.e., λ = 0?
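One way to check an answer to part (a) is to minimize the stated objective numerically on synthetic data and compare with the closed form you derive. A sketch, with x, ω, and λ chosen arbitrarily for illustration:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.array([0.5, 1.0, 1.5, 2.0])        # arbitrary fixed design vector
    omega, lam = 3.0, 2.0                     # arbitrary true parameter and ridge penalty
    y = omega * x + rng.normal(size=len(x))   # y_i = omega * x_i + e_i

    # Minimize lam*w^2 + sum_i (y_i - x_i*w)^2 by plain gradient descent.
    w = 0.0
    for _ in range(2000):
        grad = 2.0 * lam * w - 2.0 * np.sum(x * (y - x * w))
        w -= 0.01 * grad
    print(w)    # compare with the w_lambda from part (a), using S_xx and S_xy of this x and y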

Q5. [10 pts] Logistic Regression with One Feature
We are given another sample in which each point has only one feature. Consider a binary classification problem in which sample values x ∈ R are drawn randomly from two different class distributions. The first class, with label y = 0, has its mean to the left of the mean of the second class, with label y = 1. We will use a modified version of logistic regression to classify these data points. We model the posterior probability at a test point z ∈ R as
P(y = 1|z) = s(z − α),
where α ∈ R is the sole parameter we are trying to learn and s(γ) = 1/(1 + e^(−γ)) is the logistic function. The decision boundary is z = α (because s(z − α) = 1/2 there).
We will learn the parameter α by performing gradient descent on the logistic loss function (a.k.a. cross-entropy). That is, for a
data point x with label y ∈ {0, 1}, we find the α that minimizes
J(α) = −y ln s(x − α) − (1 − y) ln(1 − s(x − α)).
(a) [5 pts] Derive the stochastic gradient descent update for J with step size ε > 0, given a sample value x and a label y. Hint: feel free to use s as an abbreviation for s(x − α).
(b) [3 pts] Is J(α) convex over α ∈ R? Justify your answer.
(c) [2 pts] Now we consider multiple sample points. As d = 1, we are given an n × 1 design matrix X and a vector y ∈ Rn of labels. Consider batch gradient descent on the cost function ∑_{i=1}^n J(α; Xi, yi). There are circumstances in which this cost function does not have a minimum over α ∈ R at all. What is an example of such a circumstance?
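A finite-difference check can be useful for verifying the hand-derived gradient in part (a). The sketch below evaluates dJ/dα numerically at arbitrary example values of x, y, and α:

    import numpy as np

    def s(g):                                  # the logistic function s(gamma)
        return 1.0 / (1.0 + np.exp(-g))

    def J(alpha, x, y):                        # logistic loss for one point, as defined above
        p = s(x - alpha)
        return -y * np.log(p) - (1 - y) * np.log(1 - p)

    x, y, alpha, h = 1.3, 1, 0.2, 1e-6         # arbitrary example values
    num_grad = (J(alpha + h, x, y) - J(alpha - h, x, y)) / (2 * h)
    print(num_grad)                            # compare with your analytic dJ/d(alpha)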
