DS-GA-1003: Machine Learning (Spring 2020) Midterm Exam (March 10 5:20-7:00PM)
• You have 90 minutes to complete the exam. No textbooks, notes, computers, or calculators. However, you are allowed a double-sided reference sheet.
• The exam consists of 9 single-sided pages. Mark your answers on the exam itself. Do not write on the back of pages. If you lack space for an answer, then use the blank space on page 9. We will not grade answers written on scratch paper.
Topics: Decomposing Risk, Regularization, Scaling, Gradient Descent, Loss Functions, Decision Boundaries, Kernels
1. Consider input space $\mathcal{X}$, output space $\mathcal{Y}$ and action space $\mathcal{A}$. Fix a loss function $\ell$ on $\mathcal{A} \times \mathcal{Y}$. Consider a hypothesis space $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathcal{A}$. Fix a sample $S$ drawn from $\mathcal{X} \times \mathcal{Y}$. Take
• $f^* = \operatorname{argmin}_{f} \mathbb{E}\big[\ell(f(x), y)\big]$
• $f_{\mathcal{F}} = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}\big[\ell(f(x), y)\big]$
• $\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \ell(f(x_i), y_i)$
where $m$ is the number of samples in $S$.
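The parts below compare these risks. For reference, the standard decomposition of the excess risk of $\hat{f}$, written with the definitions above, is:

% R(f) denotes the risk E[l(f(x), y)]; the excess risk of the empirical risk
% minimizer splits into the two terms the questions below refer to.
\[
\underbrace{R(\hat{f}) - R(f^*)}_{\text{excess risk}}
  = \underbrace{R(\hat{f}) - R(f_{\mathcal{F}})}_{\text{estimation error}}
  + \underbrace{R(f_{\mathcal{F}}) - R(f^*)}_{\text{approximation error}},
\qquad R(f) = \mathbb{E}\big[\ell(f(x), y)\big].
\]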
(a) Recall that the approximation error is the difference of risks $R(f_{\mathcal{F}}) - R(f^*)$.
i. (1 point) The approximation error is
Positive or Zero Negative or Zero Cannot be Determined
ii. (1 point) The approximation error is
Random Non-Random Cannot be Determined
iii. (1 point) If we increase the size of F, then the approximation error is
Increased or Unchanged Decreased or Unchanged Cannot be Determined
iv. (1 point) If we increase the size of S, then the approximation error is Changed Unchanged Cannot be Determined
v. (1 point) Do we need to know the data generating distribution to compute the approximation error?
True False
(b) Recall that the estimation error is the difference of risks $R(\hat{f}) - R(f_{\mathcal{F}})$.
i. (1 point) The estimation error is
Positive or Zero Negative or Zero Cannot be Determined
ii. (1 point) For fixed sample S, the estimation error is
Random Non-Random Cannot be Determined
iii. (1 point) If we increase the size of F, then the estimation error is
Increased or Unchanged Decreased or Unchanged Cannot be Determined
iv. (1 point) If we increase the size of S, then the estimation error is Changed Unchanged Cannot be Determined
v. (1 point) Do we need to know the data generating distribution to compute the estimation error?
True False
(c) (1 point) For some models like ________, we have different approaches to fitting the training data. Each approach attempts to find $\hat{f}$. Does the choice of approach affect
Approximation Error Estimation Error Neither
2. (8 points) We have a dataset $D = \{(0, 1), (1, 4), (2, 3)\}$ that we fit by minimizing an objective function of the form
$$J(\alpha_0, \alpha_1) = \lambda_1(\alpha_0 + \alpha_1) + \lambda_2(\alpha_0^2 + \alpha_1^2) + \sum_{i=1}^{3}(\alpha_0 + \alpha_1 x_i - y_i)^2,$$
and the corresponding fitted function is given by $f(x) = \alpha_0 + \alpha_1 x$. We tried four different settings of $\lambda_1$ and $\lambda_2$; the resulting fits appear in Plots 1-4 (not reproduced here). For each of the following parameter settings, give the number of the plot that shows the resulting fit.
i. $\lambda_1 = 0$ and $\lambda_2 = 2$.
ii. $\lambda_1 = 0$ and $\lambda_2 = 0$.
iii. $\lambda_1 = 0$ and $\lambda_2 = 10$.
iv. $\lambda_1 = 5$ and $\lambda_2 = 0$.
Solution: i. Plot 1, ii. Plot 4, iii. Plot 3, iv. Plot 2.
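The fitted coefficients for each setting can be reproduced numerically; below is a minimal sketch (scipy.optimize is an assumed tool here, not part of the exam):

import numpy as np
from scipy.optimize import minimize

# Dataset from the problem: D = {(0, 1), (1, 4), (2, 3)}
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 4.0, 3.0])

def J(alpha, lam1, lam2):
    """Objective from the problem: linear penalty + ridge penalty + squared error."""
    a0, a1 = alpha
    return (lam1 * (a0 + a1)
            + lam2 * (a0**2 + a1**2)
            + np.sum((a0 + a1 * x - y) ** 2))

for lam1, lam2 in [(0, 2), (0, 0), (0, 10), (5, 0)]:
    res = minimize(J, x0=np.zeros(2), args=(lam1, lam2))
    a0, a1 = res.x
    print(f"lambda1={lam1}, lambda2={lam2}: f(x) = {a0:.2f} + {a1:.2f} x")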
3. Suppose we have input space $\mathcal{X} = \{-1.5, -0.5, 0.5, 1.5\} \times \{-0.001, 0.001\}$, output space $\mathcal{Y} = \{-1, 1\}$ and action space $\mathbb{R}$. Assume the following about the data generating distribution:
• The coordinate $Y$ has equal probability of being $-1$ or $1$.
• The coordinate $X_1$ has equal probability of being $-1.5$, $-0.5$, $0.5$, or $1.5$. $X_1$ is related to $Y$ through $X_1 = Y - 0.5Z$, where $Z = \pm 1$ with equal probability.
• The coordinate $X_2$ has equal probability of being $-0.001$ or $0.001$. $X_2$ is related to $Y$ through $X_2 = Y/1000$.
Suppose we run ridge regression with $m$ samples and objective
$$J(w) = \lambda(w_1^2 + w_2^2) + \frac{1}{m}\sum_{i=1}^{m}\left(w_1 x_1^{(i)} + w_2 x_2^{(i)} - y_i\right)^2.$$
We're trying to decide between the weights $w_{\text{accurate}} = [0, 1000]$ and $w_{\text{small}} = [1, 0]$.
(a) (2 points) What is the value of $J(w_{\text{accurate}})$?
$1000\lambda$   $1000$   $1000^2\lambda$   $1000^2$
(b) (2 points) For large values of $m$, the empirical risk
$$\frac{1}{m}\sum_{i=1}^{m}\left(w_1 x_1^{(i)} + w_2 x_2^{(i)} - y_i\right)^2$$
approximates the statistical risk
$$\mathbb{E}\left[(w_1 X_1 + w_2 X_2 - Y)^2\right].$$
Use the statistical risk to approximate the value of $J(w_{\text{small}})$.
$0.5 + \lambda$   $0.25 + \lambda$   $1 + \lambda$   $0.75 + \lambda$
(c) (2 points) Using your answers above, determine $\lambda^*$ such that we would choose $w_{\text{small}}$ for any $\lambda > \lambda^*$.
(d) (2 points) For most values of $\lambda$, we would choose $w_{\text{small}}$. How could we transform the features to avoid choosing the less accurate weights?
Solution (c): $\lambda^* = \dfrac{0.25}{1000^2 - 1}$
Solution (d): Scale the features, for example with a min-max scaler.
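Filling in the arithmetic behind parts (a)-(c), using the data generating distribution above:

% (a) With w_accurate = [0, 1000], the prediction is 1000 X_2 = Y, so the squared error vanishes.
\[
J(w_{\text{accurate}}) = \lambda\,(0^2 + 1000^2) + 0 = 1000^2 \lambda .
\]
% (b) With w_small = [1, 0], the prediction is X_1 = Y - 0.5Z, so the squared error is (0.5Z)^2 = 0.25.
\[
J(w_{\text{small}}) \approx \lambda\,(1^2 + 0^2) + \mathbb{E}\big[(X_1 - Y)^2\big]
 = \lambda + \mathbb{E}\big[(0.5Z)^2\big] = 0.25 + \lambda .
\]
% (c) We prefer w_small when its objective is smaller.
\[
0.25 + \lambda < 1000^2 \lambda \iff \lambda > \frac{0.25}{1000^2 - 1} = \lambda^* .
\]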
4. Momentum is a variation of gradient descent where we include the gradient at a previous iteration in the current iteration. The update rule is
$$w^{(t+1)} = w^{(t)} - \alpha \frac{\partial L}{\partial w}\left(w^{(t)}\right) - \gamma \frac{\partial L}{\partial w}\left(w^{(t-1)}\right).$$
Here $L$ is the objective function and $\alpha, \gamma > 0$ are the learning rates. Assume that for iterations $t = 0$ and $t = -1$ we set $w^{(t)} = w_0$, the initial guess.
Figure 1: Graph of the objective function $L$
(a) Refer to the chart in Figure 1.
i. (1 point) Assuming that w starts in a flat region that is not a minimum and α > 0, will the basic gradient descent algorithm terminate at a minimum? Note that the basic gradient descent algorithm is the momentum gradient descent algorithm with γ = 0
Yes with enough iterations Maybe Never
ii. (1 point) Assuming that w starts in a sloped region and α > 0, will the basic
gradient descent algorithm terminate at a minimum?
Yes with enough iterations Maybe Never
iii. (1 point) Assuming that w starts in a flat region that is not a minimum and both α > 0 and γ > 0, will the momentum gradient descent algorithm terminate at a minimum?
Yes with enough iterations Maybe Never
iv. (1 point) Assuming that w starts in a sloped region and both α > 0 and γ > 0,
will the momentum gradient descent algorithm terminate at a minimum? Yes with enough iterations Maybe Never
v. (1 point) Is L(w) convex?
Yes No No, but −L(w) is convex No, but L(−w) is convex
(b) (6 points) Fill in the twelve blanks in the code with the following variables to implement gradient descent with momentum.
w   X   y   w_prev   num_iter   w0   temp   alpha   gamma   range   len   t
Note that the same variable can be used multiple times. Some variables may not be used at all. Only use one variable per blank.
def grad(X, y, w):
    '''Returns gradient dL/dw at w
    X: matrix, training data features
    y: vector, training data labels
    w: vector, weights'''

def grad_desc_momentum(X, y, num_iter, alpha, gamma, w0):
    '''Returns weights w computed after num_iter iterations.
    X: matrix, training data features
    y: vector, training data labels
    num_iter: number, number of iterations to run
    alpha: number, learning rate
    gamma: number, learning rate for momentum
    w0: weights for t=0 and t=-1'''
    w, w_prev = ____(i)____, ____(ii)____
    for ____(iii)____ in ____(iv)____(____(v)____):
        g = grad(X, y, w)
        m = grad(X, y, ____(vi)____)
        ____(vii)____ = ____(viii)____
        w = ____(ix)____ - ____(x)____ * g - ____(xi)____ * m
        w_prev = ____(xii)____
    return w
Solution: i. w0   ii. w0   iii. t   v. num_iter   x. alpha   xi. gamma
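For reference, one consistent way to fill in the blanks, as a runnable sketch (the squared-error grad and the toy data below are illustrative assumptions; the exam does not specify them):

import numpy as np

def grad(X, y, w):
    '''Gradient dL/dw for the squared-error objective L(w) = ||Xw - y||^2 / (2n).
    (The exam leaves grad unspecified; this choice is only for illustration.)'''
    n = X.shape[0]
    return X.T @ (X @ w - y) / n

def grad_desc_momentum(X, y, num_iter, alpha, gamma, w0):
    '''Gradient descent with momentum, as in the update rule above.'''
    w, w_prev = w0, w0                      # blanks i, ii
    for t in range(num_iter):               # blanks iii, iv, v
        g = grad(X, y, w)                   # gradient at the current iterate
        m = grad(X, y, w_prev)              # gradient at the previous iterate
        temp = w                            # blanks vii, viii: remember current w
        w = w - alpha * g - gamma * m       # blanks ix, x, xi
        w_prev = temp                       # blank xii
    return w

# Tiny usage example on synthetic data (illustrative only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = grad_desc_momentum(X, y, num_iter=200, alpha=0.1, gamma=0.05, w0=np.zeros(2))
print(w)  # approaches the least-squares solution [1, 2]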
5. Consider input space $\mathcal{X} = \{1, 2, 3, 4\}$, output space $\mathcal{Y} = \{1, 2, 3, 4\}$ and action space $\mathbb{R}$. Take the square loss: $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
(a) (3 points) Fix $x$. Determine the constant $c$ such that $\mathbb{E}[(Y - c)^2 \mid X = x]$ is minimized. Note that you need to take a derivative.
Solution: $c = \mathbb{E}[Y \mid X = x]$
(b) (3 points) Assume the following about the data generating distribution:
• The coordinate $X$ is uniformly distributed on $\mathcal{X}$, so equal probability $\tfrac{1}{4}$ for each of the features $\{1, 2, 3, 4\}$.
• Given the coordinate $X = x$, the coordinate $Y$ is uniformly distributed on $\{1, \ldots, x\}$, so equal probability $\tfrac{1}{x}$ for each of the labels $\{1, \ldots, x\}$.
What is the target function? In other words, for fixed $x$, how should we choose $f^*(x)$ to minimize the expected square loss?
Solution: $f^*(x) = (x + 1)/2$.
(c) (4 points) What is the expected square loss of the target function?
Solution: $\mathbb{E}[(Y - f^*(X))^2] = \mathbb{E}\big[\mathbb{E}[(Y - (X+1)/2)^2 \mid X]\big] = \tfrac{26}{48}$.
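The value $\tfrac{26}{48}$ follows from the variance of a uniform distribution on $\{1, \ldots, x\}$:

% For Y uniform on {1,...,x}: E[Y | X = x] = (x+1)/2 and Var(Y | X = x) = (x^2 - 1)/12.
\[
\mathbb{E}\big[(Y - f^*(X))^2\big]
 = \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]
 = \frac{1}{4}\sum_{x=1}^{4} \frac{x^2 - 1}{12}
 = \frac{1}{4}\left(0 + \frac{3}{12} + \frac{8}{12} + \frac{15}{12}\right)
 = \frac{26}{48}.
\]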
6. (a) (2 points) Figure 2 contains a training set $\{x_1, x_2, \ldots, x_{25}\}$. Below we have several feature transformations. By themselves, which might allow us to separate the transformed data with a linear decision boundary? Select all possible choices.
Centering the data
Add a feature $x^2$
Add a feature that is $1$ if $x \le 50$ or $-1$ if $x > 50$
Add two features $x^2$ and $x^3$
Figure 2: Training Data
Figure 3: Training Data
(b) (2 points) Figure 3 contains a training set $\{(x_1^{(1)}, x_2^{(1)}), \ldots, (x_1^{(100)}, x_2^{(100)})\}$. Below we have several feature transformations. By themselves, which might allow us to separate the transformed data with a linear decision boundary? Select all possible choices.
Scaling the data
Adding features $x_1^2$, $x_2^2$, $x_1 x_2$
Adding a feature that is $1$ if $x_2 \ge 10$ or $-1$ if $x_2 < 10$
Adding a feature $|x_1|$
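As background on what such feature transformations do, here is a small sketch on synthetic, circularly separated data (illustrative only; this is not the data in Figure 2 or Figure 3, and the sklearn classifier is an assumed tool):

import numpy as np
from sklearn.svm import LinearSVC

# Synthetic data whose classes are separated by a circle around the origin.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.where((X**2).sum(axis=1) < 2.0, 1, -1)   # label by distance from the origin

def add_quadratic_features(X):
    """Append x1^2, x2^2 and x1*x2 so a circle becomes a hyperplane in the new space."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# A linear classifier cannot represent the circular boundary on the raw features,
# but the transformed data is linearly separable.
print(LinearSVC(C=10.0, max_iter=10000).fit(X, y).score(X, y))
X_quad = add_quadratic_features(X)
print(LinearSVC(C=10.0, max_iter=10000).fit(X_quad, y).score(X_quad, y))  # ~1.0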
7. Define the Huber loss function $h : \mathbb{R} \to \mathbb{R}$ by
$$h(x) = \begin{cases} x^2/2 & \text{if } |x| \le 1, \\ |x| - 1/2 & \text{if } |x| > 1. \end{cases}$$
Consider the objective function
$$J(w) = \lambda \|w\|^2 + \sum_{i=1}^{n} h(w^T x_i - y_i),$$
where $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$. Fix $\lambda > 0$. Note that the function is differentiable.
(a) (3 points) We want to minimize $J(w)$ using stochastic gradient descent. Assume the current data point is $(x_i, y_i)$. The step direction is given by $v = -\nabla_w G(w)$, for some function $G(w)$. Give an explicit expression for $G(w)$ in terms of $h$, $\lambda$, and the given data. You do not have to expand the function $h$.
(b) (3 points) Assume J(w) has a minimizer w∗. Give an expression for w∗ in terms of a vector α ∈ Rn that is guaranteed by the representer theorem. You may use the design matrix X ∈ Rn×d.
(c) (3 points) Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a Mercer kernel, and let $K \in \mathbb{R}^{n \times n}$ denote the Gram matrix $K_{ij} = k(x_i, x_j)$. Give a kernelized form of the objective $J$ in terms of $K$. Recall that $w^T x_i = (Xw)_i$, where $X \in \mathbb{R}^{n \times d}$ is the matrix with $i$th row $x_i^T$.
Solution (a): $G(w) = h(w^T x_i - y_i) + \lambda \|w\|^2$
Solution (b): $w^* = \sum_{i=1}^{n} \alpha_i x_i = X^T \alpha$
Solution (c): $J(\alpha) = \sum_{i=1}^{n} h((K\alpha)_i - y_i) + \lambda \alpha^T K \alpha$
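A small numerical sketch of these formulas (the RBF kernel and the random data are illustrative assumptions, not specified by the exam):

import numpy as np

def huber(x):
    """Huber loss h from the problem: x^2/2 for |x| <= 1, |x| - 1/2 otherwise."""
    return np.where(np.abs(x) <= 1, 0.5 * x**2, np.abs(x) - 0.5)

def rbf_kernel(A, B, sigma=1.0):
    """Mercer kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); an illustrative choice."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernelized_objective(alpha, K, y, lam):
    """J(alpha) = sum_i h((K alpha)_i - y_i) + lam * alpha^T K alpha."""
    r = K @ alpha - y
    return huber(r).sum() + lam * alpha @ K @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # n = 20 points in R^d with d = 3
y = rng.normal(size=20)
K = rbf_kernel(X, X)                   # Gram matrix K_ij = k(x_i, x_j)
alpha = np.zeros(20)
print(kernelized_objective(alpha, K, y, lam=0.1))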
8. Figure 4 shows a training set in $\mathbb{R}^2$. Suppose that we use the perceptron algorithm for classification. We record the total number of times each point occurs in the update step. Remember that if a point is misclassified, then it occurs in the update step.
Figure 4: Training Data
i. (3 points) Assume that the initial weight is $w^{(0)} = [-3, 2, 1]$, where $1$ is the offset term, so the feature $x_3 = 1$ is constant. What is the equation of the separating line, expressed in terms of $x_1$ and $x_2$, determined by the algorithm?
Solution: $-3x_1 + 2x_2 - 2 = 0$
ii. (1 point) In some cases, removing a single point can change the decision boundary. Here, would removing a single point from the training set change the decision boundary? Please explain your answer.
iii. (2 points) If we added the point [2, −2] with label +1 to the training set, then would we obtain different results? In particular, would the algorithm converge?
Solution (ii): Yes; removing one of the two points that occur in the update step would change the decision boundary.
Solution (iii): The data would not be linearly separable, so the algorithm would not converge.
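For reference, a minimal sketch of the perceptron update described above (the toy data is illustrative, not the points in Figure 4):

import numpy as np

def perceptron(X, y, w0, max_epochs=100):
    """Perceptron algorithm: whenever a point is misclassified (y_i * w.x_i <= 0),
    it 'occurs in the update step' and we set w <- w + y_i * x_i."""
    w = w0.astype(float).copy()
    update_counts = np.zeros(len(y), dtype=int)    # times each point triggers an update
    for _ in range(max_epochs):
        made_update = False
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            if y_i * (w @ x_i) <= 0:               # misclassified (or on the boundary)
                w += y_i * x_i
                update_counts[i] += 1
                made_update = True
        if not made_update:                        # converged: no misclassified points
            break
    return w, update_counts

# Illustrative data in homogeneous coordinates [x1, x2, 1] (constant feature as offset).
X = np.array([[1.0, 2.0, 1.0], [2.0, -1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, counts = perceptron(X, y, w0=np.array([-3.0, 2.0, 1.0]))
print(w, counts)   # final weights define the line w[0]*x1 + w[1]*x2 + w[2] = 0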
9. Figure 5 shows training data with two classes. We want to use a hard-margin support vector machine. Remember that we choose the decision boundary to maximize the margin.
Figure 5: Training Data
i. (2 points) Draw the decision boundary obtained by the hard-margin SVM method with a solid line. Draw the margins on either side with dashed lines.
ii. (1 point) What are the possible support vectors? Please indicate their numbers.
iii. (1 point) What is the classification error on the training set? In other words, how many points are incorrectly classified?
iv. (1 point) Would the removal of a single point change the decision boundary? If so, then what points?
v. (1 point) Suppose we use leave-one-out cross validation, meaning we use 7-fold cross validation with a split of 6 to 1 between training set and validation set. Compute the average classification error over the 7 folds.
Solution (ii): 1, 2, 3, 4, 5
Solution (iii): 0
Solution (iv): Point 5
Solution (v): 1/7
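The leave-one-out computation in part v can be checked numerically along these lines (sklearn's SVC with a large C approximates a hard-margin SVM; the data below is illustrative, not the seven points in Figure 5):

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Illustrative 2D data with 7 points (NOT the data in Figure 5).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4], [2, 2]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1, 1])

# A linear SVC with a very large C behaves like a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)

# Leave-one-out: 7 folds, each trained on 6 points and tested on the held-out point.
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("average classification error:", 1 - scores.mean())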