DS-GA 1003 : Machine Learning
Spring 2020: Midterm Practice Exam (100 Minutes)
Answer the questions in the spaces provided. If you run out of room for an answer, use the blank page at the end of the test. Please don’t miss the last question, on the back of the last test page.
Topics covered:
Variants of the SVM objective
Hypothesis spaces
L1/L2 regularization
Stopping rules
Sparsity in Lasso and SVM
Determining kernels for classification
Kernelized regression
Question:  1   2   3   4   5   6   7   8   Total
Points:    2   4   1   5   3   3   9   8   35
Score:
1. (2 points) Suppose you have a function

φ(c) = argmin_{w ∈ R^d} (c/n) Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + (1/2)||w||²,

where c > 0. Show how we could use this φ(c) to find a minimizer w∗ ∈ R^d of the following objective function (where λ > 0):

J2(w) = (1/n) Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + λ||w||².
Solution: Multiplying an objective by a positive constant does not change its set of minimizers, so

argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + λ||w||²
= argmin_{w ∈ R^d} (1/(2λ)) [ (1/n) Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + λ||w||² ]
= argmin_{w ∈ R^d} (1/(2λ)) · (1/n) Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + (1/2)||w||².

Thus, let c = 1/(2λ). Then φ(1/(2λ)) will return a minimizer of the objective J2(w).
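A minimal numerical sketch of this equivalence, assuming a small synthetic dataset and a derivative-free optimizer (the hinge loss is not differentiable): minimizing φ's objective with c = 1/(2λ) and minimizing J2 directly should yield essentially the same w.

```python
# Check numerically that phi(1/(2*lam)) minimizes J2 on a small synthetic dataset.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n))

def hinge_avg(w):
    # (1/n) * sum_i max(0, 1 - y_i w^T x_i)
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def phi_objective(w, c):
    # (c/n) * sum_i hinge_i + (1/2)||w||^2
    return c * hinge_avg(w) + 0.5 * np.dot(w, w)

def J2(w, lam):
    # (1/n) * sum_i hinge_i + lam * ||w||^2
    return hinge_avg(w) + lam * np.dot(w, w)

lam = 0.1
w_phi = minimize(phi_objective, np.zeros(d), args=(1.0 / (2 * lam),), method="Nelder-Mead").x
w_J2 = minimize(J2, np.zeros(d), args=(lam,), method="Nelder-Mead").x
print(np.round(w_phi, 2), np.round(w_J2, 2))  # the two minimizers should (approximately) agree
```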
Page 1 of 9
2. Let X = Y = A = R. Suppose you receive the (x,y) data points (-1,1), (-1,0), (0,0), (1,1), and (2,2).
(a) Assume we’re using the 0-1 loss function l(a, y) = 1(a ≠ y).
i. (1 point) Suppose we restrict to the hypothesis space Fc of constant functions.
Give an empirical risk minimizer f̂(x).
Solution: f̂(x) = 0 or f̂(x) = 1 (each misclassifies 3 of the 5 points).
ii. (1 point) Suppose we restrict to the hypothesis space Fc of constant functions. What is R̂(f̂), the empirical risk of f̂, where f̂ is an empirical risk minimizer?
Solution: 3/5.
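A minimal sketch of the computation behind parts i and ii, assuming it suffices to try each observed label value as the constant prediction (which it does under 0-1 loss):

```python
# Empirical risk of constant hypotheses under 0-1 loss for the five data points.
data = [(-1, 1), (-1, 0), (0, 0), (1, 1), (2, 2)]
ys = [y for _, y in data]

# For each candidate constant prediction c, compute the fraction of points it misclassifies.
risks = {c: sum(1 for y in ys if y != c) / len(ys) for c in set(ys)}
print(risks)  # {0: 0.6, 1: 0.6, 2: 0.8} -> minimizers are c = 0 or c = 1, risk 3/5
```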
iii. (1 point) Suppose we restrict to the hypothesis space Fl of linear functions. Give an empirical risk minimizer f̂(x).

Solution: f̂(x) = x.

(b) Now assume we’re using the absolute loss function l(a, y) = |a − y|.

i. (1 point) What is the minimum empirical risk achievable over the hypothesis space of all functions?

Solution: 1/5.

3. (1 point) Consider the following version of the elastic-net objective:

J(w) = (1/n)∥Xw − y∥² + λ1∥w∥1 + λ2∥w∥².

Our training data is (x1, y1), . . . , (xn, yn) ∈ R^d × R, sampled i.i.d. from some distribution P. As usual, the design matrix X ∈ R^{n×d} has x_i as its ith row, and y ∈ R^n has y_i as its ith coordinate. Which ONE of the following hyperparameter settings is most likely to give a sparse solution?

λ1 = 0, λ2 = 1
λ1 = 1, λ2 = 0
λ1 = 0, λ2 = 0
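A minimal sketch of why the L1 term drives sparsity, assuming a small synthetic regression problem and using scikit-learn's Lasso and Ridge as stand-ins for the two extreme hyperparameter settings (λ1 = 1, λ2 = 0 versus λ1 = 0, λ2 = 1):

```python
# Compare the number of exactly-zero coefficients under a pure L1 vs a pure L2 penalty.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]           # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty only
ridge = Ridge(alpha=0.1).fit(X, y)      # L2 penalty only
print("exact zeros (lasso):", np.sum(lasso.coef_ == 0.0))   # many exact zeros
print("exact zeros (ridge):", np.sum(ridge.coef_ == 0.0))   # typically none
```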
4. The penalized empirical risk for f ∈ F and dataset D is given by

J(f; D) = R̂(f; D) + λΩ(f),

where Ω : F → [0, ∞) is a regularization function, λ > 0 is a regularization parameter, and

R̂(f; D) = (1/|D|) Σ_{(x,y) ∈ D} l(f(x), y)

is the empirical risk of f for the data D, where |D| is the size of the set D.
(a) (1 point) Suppose we use an iterative descent method to minimize J(f; D) for some training data D. Let f^(i) be the prediction function at the ith iteration. If our goal is to find a minimizer f̂ ∈ argmin_{f ∈ F} J(f; D), which ONE of the following is the better stopping condition?
R̂(f^(i); D) − R̂(f^(i+1); D) < ε, for some appropriately chosen ε > 0.
J(f^(i); D) − J(f^(i+1); D) < ε, for some appropriately chosen ε > 0.
(b) (1 point) A friend reminds us that our real goal is to find an f that has small risk, i.e. a small value of R(f) = E l(f(x), y). Suppose we have found f̃ using D and we have an independent validation set D_val from the same distribution as D. Select ALL of the following that are unbiased estimators of R(f̃):

J(f̃, D_val)    R̂(f̃, D)    J(f̃, D)    R̂(f̃, D_val)
(c) (2 points) A friend thinks that we are running too many iterations of the optimization procedure to minimize J(f; D), when our real goal is to find f with small risk. Let

f∗ ∈ argmin_f E l(f(x), y),
f_F ∈ argmin_{f ∈ F} E l(f(x), y),
f̂ ∈ argmin_{f ∈ F} J(f; D),

and again let f^(i) be the prediction function at the current iteration. Select ALL of the following that would support your friend's claim that it's time to stop the optimization algorithm:
J(f^(i), D) = J(f̂, D)
R(f^(i)) < R(f̂)
J(f^(i−100), D_val) − J(f^(i), D_val) < ε for some appropriately chosen ε > 0.
R̂(f^(i), D_val) − R̂(f̂, D_val) < ε for some appropriately chosen ε > 0.
J(f^(i), D_val) − J(f̂, D_val) < ε for some appropriately chosen ε > 0.
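A minimal sketch of a validation-based stopping rule in the spirit of parts (a)-(c), assuming gradient descent on an L2-regularized least-squares objective: the training objective J is what we descend on, but we stop once the validation empirical risk R̂(·; D_val) (with no penalty term) stops improving.

```python
# Gradient descent on J(w; D) = (1/n)||Xw - y||^2 + lam*||w||^2, stopping when
# the validation empirical risk has not improved for `patience` consecutive steps.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
Xtr, ytr, Xval, yval = X[:150], y[:150], X[150:], y[150:]

lam, step, patience = 0.01, 0.01, 20
w = np.zeros(d)
best_val, best_w, since_best = np.inf, w.copy(), 0
for t in range(5000):
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr) + 2 * lam * w
    w -= step * grad
    val_risk = np.mean((Xval @ w - yval) ** 2)   # R_hat(w; D_val), no penalty term
    if val_risk < best_val - 1e-8:
        best_val, best_w, since_best = val_risk, w.copy(), 0   # best_w is what we would return
    else:
        since_best += 1
        if since_best >= patience:
            print("stopping at iteration", t, "with validation risk", round(best_val, 4))
            break
```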
5. Decide whether the following statements apply to full batch gradient descent (GD), minibatch GD, neither, or both. Assume we're minimizing a differentiable, convex objective function J(w) = (1/n) Σ_{i=1}^n f_i(w), and we are currently at w_t, which is not a minimum. For full batch GD, take v = ∇_w J(w_t), and for minibatch GD take v to be a minibatch estimate of ∇_w J(w_t) based on a random sample of the training data.
(a) (1 point) For any step size η > 0, after applying the update rule w_{t+1} ← w_t − ηv, we must have J(w_{t+1}) < J(w_t). (Choose ONE answer below.)
Full batch Minibatch Both Neither
(b) (1 point) There must exist some η > 0 such that after applying the update rule w_{t+1} ← w_t − ηv we have J(w_{t+1}) < J(w_t). (Choose ONE answer below.)
Full batch Minibatch Both Neither
(c) (1 point) v is an unbiased estimator of the full batch gradient. (Choose ONE answer below.)
Full batch Minibatch Both Neither
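A minimal sketch for part (c), assuming square-loss terms f_i(w) = (w^T x_i − y_i)²: averaging many independent minibatch gradient estimates at a fixed w_t should recover the full-batch gradient, illustrating unbiasedness.

```python
# Empirically check that a minibatch gradient is an unbiased estimate of the
# full-batch gradient of J(w) = (1/n) sum_i f_i(w), with f_i(w) = (w^T x_i - y_i)^2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w_t = rng.normal(size=d)

def grad_on(idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w_t - yb) / len(idx)

full = grad_on(np.arange(n))                       # full-batch gradient at w_t
est = np.mean([grad_on(rng.choice(n, size=32, replace=False))
               for _ in range(20000)], axis=0)     # average of minibatch estimates
print(np.round(full, 2))
print(np.round(est, 2))   # close to the full-batch gradient (unbiasedness)
```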
6. We have discussed two methods that we claim can give sparsity: Lasso and SVM. In this question you’ll compare the “sparsity” achieved through these methods.
(a) (1 point) Suppose f : R^d → R is a prediction function attained by linear least squares Lasso regression. Give an expression for f(x) in terms of a vector that may be sparse in the lasso context, and specify which vector may be sparse.

Solution: f(x) = w^T x, where the weight vector w may be sparse.

(b) (2 points) Suppose f : R^d → R is a prediction function (i.e. score function) from a linear SVM. Give an expression for f(x) in terms of a vector that may be sparse in the SVM context, and specify which vector may be sparse.

Solution: f(x) = Σ_{i=1}^n α_i x_i^T x, where the vector of dual coefficients α = (α_1, . . . , α_n)^T may be sparse.
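A minimal sketch contrasting the two notions of sparsity, assuming scikit-learn on synthetic data: the lasso can set entries of the primal weight vector w to exactly zero, while a linear SVM is typically sparse in the dual coefficients α (only support vectors have α_i ≠ 0).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -2.0, 1.0]
y_reg = X @ w_true + 0.1 * rng.normal(size=n)            # regression targets for the lasso
y_cls = np.sign(X @ w_true + 0.1 * rng.normal(size=n))   # binary labels for the SVM

lasso = Lasso(alpha=0.1).fit(X, y_reg)
svm = SVC(kernel="linear", C=1.0).fit(X, y_cls)

print("nonzero entries of w (lasso):", np.sum(lasso.coef_ != 0), "of", d)
print("support vectors (nonzero alpha_i):", len(svm.support_), "of", n)
```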
7. (a) (1 point) Show that the following kernel function is a Mercer kernel (i.e. it represents an inner product):

k(x, y) = x^T y / (∥x∥ ∥y∥),

where x, y ∈ R^d.

Solution: Take φ(x) = x/∥x∥. Then k(x, y) = ⟨φ(x), φ(y)⟩, so k is an inner product of feature vectors and hence a Mercer kernel.

(b) (2 points) Consider the binary classification problem shown in Figure 7. Denote the input space by X = {(x1, x2) ∈ R²}. Give a feature mapping for which a linear classifier could perfectly separate the two classes shown.

Solution: Take (x1, x2) → (1, x1, x2, x1², x1x2, x2²).

(c) (2 points) For the classification problem in Figure 7, check all classifiers that could perfectly separate the classes:

linear SVM
SVM with quadratic kernel
SVM with radial basis functions

(d) (2 points) Suppose we fit a hard-margin SVM to N data points, and we have 2 data points “on the margin”. If we add a new data point to the training set and refit the SVM, what's the largest number of data points that could end up “on the margin”? Support your answer (a picture could suffice).

Solution: N + 1; for example, the new point can shrink the margin so that all N + 1 points lie on the two margin boundaries.
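A minimal sketch for parts (b) and (c), assuming Figure 7 shows one class enclosed by the other (e.g., points inside versus outside a circle), which is the situation the quadratic feature map in the part (b) solution handles:

```python
# Train SVMs with different kernels on "inside vs. outside a circle" data and
# compare training accuracy; only the nonlinear kernels can separate the classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-2, 2, size=(n, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # label by the unit circle

for name, clf in [("linear", SVC(kernel="linear")),
                  ("quadratic", SVC(kernel="poly", degree=2, coef0=1.0)),
                  ("rbf", SVC(kernel="rbf"))]:
    clf.fit(X, y)
    print(name, "training accuracy:", round(clf.score(X, y), 3))
```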
8. Consider the regression setting in which X = R^d, Y = N, and A = R, with a linear hypothesis space F = {f(x) = w^T x | w ∈ R^d} and the loss function

l(ŷ, y) = −yŷ + exp(ŷ) + log(y!),

where ŷ is the action and y is the outcome. Consider the objective function

J(w) = (1/n) Σ_{i=1}^n l(w^T x_i, y_i) + λ||w||².

(a) (4 points) Provide a kernelized objective function J_k(α) : R^n → R. You may write your answer in terms of the Gram matrix K ∈ R^{n×n}, defined as K_ij = x_i^T x_j. (Hint: Recall the representer theorem.)
Solution: By the representer theorem (you should be able to justify its application, but you do not need to for full credit), some minimizer can be written as w∗ = Σ_{i=1}^n α_i∗ x_i = X^T α∗, where X ∈ R^{n×d} is the usual design matrix. So w∗^T x_i = (Kα∗)_i, where K = XX^T is the Gram matrix. Thus the kernelized objective function is

J_k(α) = (1/n) Σ_{i=1}^n l((Kα)_i, y_i) + λ α^T K α.

You can also write out the loss explicitly, in which case you can drop the log(y_i!) term, since it is independent of α.
(b) (1 point) Let w∗ be a minimizer of J(w), and let α∗ be a minimizer of J_k(α). Give an expression relating w∗ to α∗.

Solution: w∗ = Σ_{i=1}^n α_i∗ x_i = X^T α∗.

(c) (1 point) Give a kernelized form of the prediction function f̂(x) = x^T w∗.

Solution: f̂(x) = w∗^T x = (Σ_{i=1}^n α_i∗ x_i)^T x = k_x^T α∗, where k_x^T = [⟨x_1, x⟩, · · · , ⟨x_n, x⟩].

(d) (1 point) Which ONE of the following is the MOST accurate regarding kernel methods:

Good to use on very small data sets when unkernelized methods are overfitting.
Good to use on very large data sets (e.g. millions or billions of data points) due to their efficiency.
Good to use on medium-sized data sets when unkernelized methods are underfitting.

(e) (1 point) True or False: Suppose φ : R^d → R^D is the feature map corresponding to our kernel function. Since k(x, x′) = ⟨φ(x), φ(x′)⟩ is an inner product in a D-dimensional space, kernel methods become infeasible when D is very large.

Solution: False.
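A minimal sketch tying together parts (a)-(c), assuming a small synthetic dataset with count-valued outcomes and the linear kernel K = XX^T: it minimizes J_k(α) numerically (dropping the log(y!) term, which does not depend on α) and then predicts at a new point via f̂(x) = k_x^T α∗.

```python
# Kernelized Poisson-loss regression: minimize
#   J_k(alpha) = (1/n) sum_i [ -y_i (K alpha)_i + exp((K alpha)_i) ] + lam * alpha^T K alpha,
# then predict at a new point x with f_hat(x) = k_x^T alpha*, where k_x = [<x_1,x>, ..., <x_n,x>].
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 60, 3, 0.1
X = rng.normal(size=(n, d))
w_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ w_true))      # count-valued outcomes, consistent with Y = N
K = X @ X.T                              # Gram matrix, K_ij = x_i^T x_j

def Jk(alpha):
    s = K @ alpha                        # scores (K alpha)_i, playing the role of w^T x_i
    return np.mean(-y * s + np.exp(s)) + lam * alpha @ K @ alpha

alpha_star = minimize(Jk, np.zeros(n), method="L-BFGS-B").x

x_new = rng.normal(size=d)
k_x = X @ x_new                          # k_x = [<x_1, x_new>, ..., <x_n, x_new>]
print("kernelized prediction f_hat(x_new):", float(k_x @ alpha_star))
print("same value via w* = X^T alpha*:    ", float((X.T @ alpha_star) @ x_new))
```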