
DS-GA-1003: Machine Learning (Spring 2020) Midterm Exam (March 10 5:20-11:59PM)
• While the exam should take 90 minutes, you have until 11:59 PM on Tuesday, March 10 to submit your answers on Gradescope. You have until 11:59 PM on Wednesday, March 11 for late submissions.
• No textbooks, notes, online resources, or calculators are allowed. However, you may use a double-sided reference sheet.
• The exam consists of 10 pages. If you are annotating the exam, then mark your answers in the provided space. If you lack space for an answer, then use the blank space on page 10. If you are typing your responses, then please follow the directions on Piazza.


Score table by topic (Points / Score): Decomposing Risk, Regularization, Scaling, Gradient Descent, Loss Functions, Decision Boundaries, Kernels

1. Consider input space X, output space Y and action space A. Fix a loss function l on A × Y. Consider hypothesis space F of functions from X to A. Fix a sample S drawn from X × Y. Take
• $f^* = \operatorname{argmin}_{f} \; \mathbb{E}[\ell(f(x), y)]$
• $f_{\mathcal{F}} = \operatorname{argmin}_{f \in \mathcal{F}} \; \mathbb{E}[\ell(f(x), y)]$
• $\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \; \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)$
where m is the number of samples in S.
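For orientation, these three minimizers are tied together by the standard decomposition of the excess risk, whose two terms are exactly the quantities defined in parts (a) and (b) below:
$$R(\hat{f}) - R(f^*) = \underbrace{\bigl(R(\hat{f}) - R(f_{\mathcal{F}})\bigr)}_{\text{estimation error}} + \underbrace{\bigl(R(f_{\mathcal{F}}) - R(f^*)\bigr)}_{\text{approximation error}}$$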
(a) Recall that the approximation error is the difference of risks $R(f_{\mathcal{F}}) - R(f^*)$.
i. (1 point) The approximation error is
☐ Positive or Zero ☐ Negative or Zero ☐ Cannot be Determined
ii. (1 point) The approximation error is
☐ Random ☐ Non-Random ☐ Cannot be Determined
iii. (1 point) If we increase the size of F, then the approximation error is
☐ Increased or Unchanged ☐ Decreased or Unchanged ☐ Cannot be Determined
iv. (1 point) If we increase the size of S, then the approximation error is
☐ Changed ☐ Unchanged ☐ Cannot be Determined
v. (1 point) Do we need to know the data generating distribution to compute the approximation error?
☐ True ☐ False
(b) Recall that the estimation error is the difference of risks $R(\hat{f}) - R(f_{\mathcal{F}})$.
i. (1 point) The estimation error is
☐ Positive or Zero ☐ Negative or Zero ☐ Cannot be Determined
ii. (1 point) For fixed sample S, the estimation error is
☐ Random ☐ Non-Random ☐ Cannot be Determined
iii. (1 point) If we increase the size of F, then the estimation error is
☐ Increased or Unchanged ☐ Decreased or Unchanged ☐ Cannot be Determined
iv. (1 point) If we increase the size of S, then the estimation error is
☐ Changed ☐ Unchanged ☐ Cannot be Determined

v. (1 point) Do we need to know the data generating distribution to compute the estimation error?
☐ True ☐ False
(c) (1 point) For some models like ______, we have different approaches to fitting the training data. Each approach attempts to find $\hat{f}$. Does the choice of the approach affect
☐ Approximation Error ☐ Estimation Error ☐ Neither
2. We have a dataset D = {(0, 1), (1, 4), (2, 3)} that we fit by minimizing an objective
function of the form:
$$J(\alpha_0, \alpha_1) = \lambda_1(\alpha_0 + \alpha_1) + \lambda_2(\alpha_0^2 + \alpha_1^2) + \sum_{i}(\alpha_0 + \alpha_1 x_i - y_i)^2,$$
and the corresponding fitted function is given by $f(x) = \alpha_0 + \alpha_1 x$. We tried four different settings of $\lambda_1$ and $\lambda_2$, and the results are shown below.
For each of the following parameter settings, give the number of the plot that shows the resulting fit.
i. (2 points) $\lambda_1 = 0$ and $\lambda_2 = 2$.
ii. (2 points) $\lambda_1 = 0$ and $\lambda_2 = 0$.
iii. (2 points) $\lambda_1 = 0$ and $\lambda_2 = 10$.
iv. (2 points) $\lambda_1 = 5$ and $\lambda_2 = 0$.
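As a sanity check outside the exam itself, here is a minimal sketch of how fits of this form could be reproduced numerically, assuming NumPy and SciPy are available; the dataset and objective are copied from the problem statement, and the chosen (lambda1, lambda2) pair is just one of the settings above.

import numpy as np
from scipy.optimize import minimize

# Dataset D = {(0, 1), (1, 4), (2, 3)} from the problem statement.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 4.0, 3.0])

def J(alpha, lam1, lam2):
    """Objective lam1*(a0 + a1) + lam2*(a0^2 + a1^2) + sum of squared residuals."""
    a0, a1 = alpha
    penalty = lam1 * (a0 + a1) + lam2 * (a0 ** 2 + a1 ** 2)
    residuals = a0 + a1 * x - y
    return penalty + np.sum(residuals ** 2)

# Fit f(x) = alpha0 + alpha1*x for one setting, e.g. lambda1 = 0, lambda2 = 2.
result = minimize(J, x0=np.zeros(2), args=(0.0, 2.0))
print("fitted (alpha0, alpha1):", result.x)

Plotting the fitted line for each setting against the three data points reproduces the kind of plots the question refers to.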

3. Suppose we have input space X = {−1.5, −0.5, 0.5, 1.5} × {−0.001, 0.001}, output space Y = {−1,1} and action space R. Assume the following about the data generating distribution
• Y coordinate has equal probability of being −1, 1
• The $X_1$ coordinate has equal probability of being $\{-1.5, -0.5, 0.5, 1.5\}$. $X_1$ is related to Y through $X_1 = Y - 0.5Z$, where $Z = \pm 1$ with equal probability.
• The $X_2$ coordinate has equal probability of being $\{-0.001, 0.001\}$. $X_2$ is related to Y through $X_2 = Y/1000$.
Suppose we have ridge regression with m samples and objective
$$J(w) = \lambda\left(w_1^2 + w_2^2\right) + \frac{1}{m} \sum_{i=1}^{m} \left( w_1 x_1^{(i)} + w_2 x_2^{(i)} - y_i \right)^2$$
We are trying to decide between the weights $w_{\text{accurate}} = [0, 1000]$ and $w_{\text{small}} = [1, 0]$.
(a) (2 points) What is the value of $J(w_{\text{accurate}})$?
☐ $1000\lambda$ ☐ $1000$ ☐ $1000^2\lambda$ ☐ $1000^2$
(b) (2 points) For large values of m, the empirical risk
$$\frac{1}{m} \sum_{i=1}^{m} \left( w_1 x_1^{(i)} + w_2 x_2^{(i)} - y_i \right)^2$$
approximates the statistical risk
$$\mathbb{E}\left[ (w_1 X_1 + w_2 X_2 - Y)^2 \right].$$
Use the statistical risk to approximate the value of $J(w_{\text{small}})$.
☐ $0.5 + \lambda$ ☐ $0.25 + \lambda$ ☐ $1 + \lambda$ ☐ $0.75 + \lambda$
(c) (2 points) Using your answers above, determine $\lambda^*$ such that we would choose $w_{\text{small}}$ for any $\lambda > \lambda^*$.

(d) (2 points) For most values of $\lambda$, we would choose $w_{\text{small}}$. How could we transform the features to avoid choosing the less accurate weights?
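A minimal simulation sketch (illustrative only, with an arbitrary choice of lambda) that draws samples from the data-generating distribution described above and evaluates the ridge objective at both candidate weight vectors:

import numpy as np

rng = np.random.default_rng(0)
m, lam = 100_000, 0.01  # sample size and an arbitrary regularization strength

# Data-generating distribution from the problem statement:
# Y is +/-1 with equal probability, X1 = Y - 0.5*Z with Z = +/-1, X2 = Y/1000.
y = rng.choice([-1.0, 1.0], size=m)
z = rng.choice([-1.0, 1.0], size=m)
x1 = y - 0.5 * z
x2 = y / 1000.0

def J(w, lam):
    """Ridge objective: lam*(w1^2 + w2^2) plus the empirical squared error."""
    w1, w2 = w
    return lam * (w1 ** 2 + w2 ** 2) + np.mean((w1 * x1 + w2 * x2 - y) ** 2)

print("J(w_accurate) =", J([0.0, 1000.0], lam))
print("J(w_small)    =", J([1.0, 0.0], lam))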
4. Momentum is a variation of gradient descent where we include the gradient at a previous iteration in the current iteration. The update rule is
$$w^{(t+1)} = w^{(t)} - \alpha \frac{\partial L}{\partial w}\left(w^{(t)}\right) - \gamma \frac{\partial L}{\partial w}\left(w^{(t-1)}\right)$$
Here L is the objective function and $\alpha, \gamma > 0$ are the learning rates. Assume that for iterations $t = 0$ and $t = -1$ we set $w^{(t)} = w_0$, the initial guess.
Figure 1: Graph of objective function L
(a) Refer to the chart in Figure 1.
i. (1 point) Assuming that w starts in a flat region that is not a minimum and α > 0, will the basic gradient descent algorithm terminate at a minimum? Note that the basic gradient descent algorithm is the momentum gradient descent algorithm with γ = 0.
☐ Yes with enough iterations ☐ Maybe ☐ Never
ii. (1 point) Assuming that w starts in a sloped region and α > 0, will the basic
gradient descent algorithm terminate at a minimum?
☐ Yes with enough iterations ☐ Maybe ☐ Never
iii. (1 point) Assuming that w starts in a flat region that is not a minimum and both α > 0 and γ > 0, will the momentum gradient descent algorithm terminate at a minimum?
☐ Yes with enough iterations ☐ Maybe ☐ Never

iv. (1 point) Assuming that w starts in a sloped region and both α > 0 and γ > 0, will the momentum gradient descent algorithm terminate at a minimum?
☐ Yes with enough iterations ☐ Maybe ☐ Never
v. (1 point) Is L(w) convex?
☐ Yes ☐ No ☐ No, but −L(w) is convex ☐ No, but L(−w) is convex
(b) (6 points) Fill in the twelve blanks in the code with the following variables to implement gradient descent with momentum.
w, X, y, w_prev, num_iter, w0, temp, alpha, gamma, range, len, t
Note that the same variable can be used multiple times. Some variables may not be used at all. Only use one variable per blank.
def grad(X, y, w):
    ''' Returns gradient dL/dw at w
    X: matrix, training data features
    y: vector, training data labels
    w: vector, weights '''

def grad_desc_momentum(X, y, num_iter, alpha, gamma, w0):
    ''' Returns weights w computed after num_iter iterations.
    X: matrix, training data features
    y: vector, training data labels
    num_iter: number, number of iterations to run
    alpha: number, learning rate
    gamma: number, learning rate for momentum
    w0: weights for t=0 and t=-1 '''
    w, w_prev = _________, _________
    for _________ in _________(_________):
        g = grad(X, y, w)
        m = grad(X, y, _________)
        _________, _________ = _________ - _________ * g \
                               - _________ * m, _________
Answers: i. ______  ii. ______  iii. ______  iv. ______  v. ______  vi. ______
vii. ______  viii. ______  ix. ______  x. ______  xi. ______  xii. ______
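Separately from the fill-in-the-blank exercise above, here is a self-contained sketch of the update rule stated at the start of this question, using a generic gradient function and a toy quadratic objective; the step sizes, starting point, and toy objective are arbitrary illustrative choices, not part of the exam.

import numpy as np

def momentum_descent(grad_L, w0, alpha, gamma, num_iter):
    """Iterate w(t+1) = w(t) - alpha*grad_L(w(t)) - gamma*grad_L(w(t-1)),
    with w(0) = w(-1) = w0 as stated in the problem."""
    w, w_prev = w0, w0
    for _ in range(num_iter):
        w, w_prev = w - alpha * grad_L(w) - gamma * grad_L(w_prev), w
    return w

# Toy objective L(w) = ||w||^2 / 2, so grad_L(w) = w; the minimizer is the origin.
w_final = momentum_descent(grad_L=lambda w: w,
                           w0=np.array([5.0, -3.0]),
                           alpha=0.1, gamma=0.05, num_iter=200)
print(w_final)  # close to [0, 0]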

5. Consider input space X = {1, 2, 3, 4}, output space Y = {1, 2, 3, 4} and action space R. Take the square loss: $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
(3 points) Fix x. Determine the constant c such that $\mathbb{E}\left[(Y - c)^2 \mid X = x\right]$ is minimized. Note that you need to take a derivative.
(3 points) Assume the following about the data generating distribution:
• The coordinate X is uniformly distributed on X, so equal probability 1/4 for each of the features {1, 2, 3, 4}.
• The coordinate Y given the coordinate X is uniformly distributed on {1, . . . , x}, so equal probability 1/x for each of the labels {1, . . . , x} conditional on feature x.
What is the target function? In other words, for fixed x, how should we choose $f^*(x)$ to minimize the expected square loss?
(4 points) What is the expected square loss of the target function?
6. (a) (2 points) Figure 2 contains a training set $\{x_1, x_2, \ldots, x_{25}\}$. Below we have several feature transformations. By themselves, which might allow us to separate the transformed data with a linear decision boundary? Select all possible choices.
Figure 2: Training Data

☐ Centering the data
☐ Add a feature $x^2$
☐ Add a feature that is 1 if $x \le 50$ or −1 if $x > 50$
☐ Add two features $x^2$ and $x^3$
(b) (2 points) Figure 3 contains a training set $\{(x_1^{(1)}, x_2^{(1)}), \ldots, (x_1^{(100)}, x_2^{(100)})\}$. Below we have several feature transformations. By themselves, which might allow us to separate the transformed data with a linear decision boundary? Select all possible choices.
Figure 3: Training Data
☐ Scaling the data
☐ Adding features $x_1^2$, $x_2^2$, $x_1 x_2$
☐ Adding a feature $|x_1|$
☐ Adding a feature that is 1 if $x_2 \ge 10$ or −1 if $x_2 < 10$
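Since the exam figures are not reproduced in this text version, here is a small synthetic illustration (the data below are made up and are not Figure 2 or Figure 3) of how adding a squared feature can make one-dimensional data separable by a linear boundary in the lifted space:

import numpy as np

# Hypothetical 1-D data: the +1 class sits in the middle, the -1 class on both sides,
# so no single threshold on x alone separates the classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Lift each point to (x, x^2); the line x^2 = 2 in the lifted space separates the classes.
phi = np.column_stack([x, x ** 2])
pred = np.where(phi[:, 1] <= 2.0, 1, -1)
print(np.all(pred == y))  # True for this toy data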
7. Define the Huber loss function $h : \mathbb{R} \to \mathbb{R}$ by
$$h(x) = \begin{cases} x^2/2 & \text{if } |x| \le 1, \\ |x| - 1/2 & \text{if } |x| > 1. \end{cases}$$
Consider the objective function
$$J(w) = \lambda \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} h\left(w^T x_i - y_i\right)$$
where $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$. Fix $\lambda > 0$. Note that the function is differentiable.
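For intuition (not required by any part of this question), a small sketch of the Huber function defined above together with its derivative, which is what makes the objective differentiable everywhere:

import numpy as np

def huber(x):
    """h(x) = x^2/2 if |x| <= 1, else |x| - 1/2."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def huber_grad(x):
    """h'(x) = x if |x| <= 1, else sign(x); the two pieces agree at |x| = 1."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, x, np.sign(x))

print(huber([-2.0, -0.5, 0.0, 0.5, 2.0]))       # [1.5, 0.125, 0.0, 0.125, 1.5]
print(huber_grad([-2.0, -0.5, 0.0, 0.5, 2.0]))  # [-1.0, -0.5, 0.0, 0.5, 1.0]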
(a) (3 points) We want to minimize J(w) using stochastic gradient descent. Assume the current data point is $(x_i, y_i)$. The step direction is given by $v = -\nabla_w G(w)$, for some function G(w). Give an explicit expression for G(w) in terms of h, $\lambda$, and the given data. You do not have to expand the function h.

(b) (3 points) Assume J(w) has a minimizer $w^*$. Give an expression for $w^*$ in terms of a vector $\alpha \in \mathbb{R}^n$ that is guaranteed by the representer theorem. You may use the design matrix $X \in \mathbb{R}^{n \times d}$.

(c) (3 points) Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a Mercer kernel, and let $K \in \mathbb{R}^{n \times n}$ denote the Gram matrix $K_{ij} = k(x_i, x_j)$. Give a kernelized form of the objective J in terms of K. Recall that $w^T x_i = (Xw)_i$, where $X \in \mathbb{R}^{n \times d}$ is the matrix with ith row $x_i^T$.

8. Figure 4 shows a training set in $\mathbb{R}^2$. Suppose that we use the perceptron algorithm for classification. We record the total number of times each point occurs in the update step. Remember that if a point is misclassified, then it occurs in the update step.

Figure 4: Training Data

(a) i. (3 points) Assume that the initial weight is $w^{(0)} = [-3, 2, 1]$, where 1 is the offset term, so the feature $x_3 = 1$ is constant. What is the equation of the separating line, expressed in terms of $x_1$ and $x_2$, determined by the algorithm?

ii. (1 point) In some cases, removing a single point can change the decision boundary. Here, would removing a single point from the training set change the decision boundary? Please explain your answer.

iii. (2 points) If we added the point [2, −2] with label +1 to the training set, then would we obtain different results? In particular, would the algorithm converge?

(b) Figure 5 shows training data with two classes. We want to use a hard-margin support vector machine. Remember that we choose the decision boundary to maximize the margin.

Figure 5: Training Data

i. (2 points) Draw the decision boundary obtained by the hard-margin SVM method with a solid line. Draw the margins on either side with dashed lines.

ii. (1 point) What are the possible support vectors? Please indicate the number.

iii. (1 point) What is the classification error on the training set? In other words, how many points are incorrectly classified?

iv. (1 point) Would the removal of a single point change the decision boundary? If so, then which points?

v. (1 point) Suppose we use leave-one-out cross validation, meaning we use 7-fold cross validation with a split of 6 to 1 between training set and validation set. Compute the average classification error over the 7 folds.