DS-GA 1003: Machine Learning
March 12, 2019: Midterm Exam (100 Minutes)
Answer the questions in the spaces provided. If you run out of room for an answer, use the blank page at the end of the test. Please don’t miss the last questions, on the back of the last test page.

1) Bayes Optimal (7 points)
2) Risk Decomposition (6 points)
3) Linear Separability and Loss Functions (6 points)
4) SVM with Slack (9 points)
5) Dependent Features (6 points)
6) RBF Kernel (4 points)
7) l2-norm Penalty (6 points)
Total: 44 points

1. (7 points) Consider a binary classification problem. For class y = 0, x is sampled from {1, 2, 3, 4, 5, 6, 7, 8} with equal probability; for class y = 1, x is sampled from {7, 8, 9, 10} with equal probability. Assume that both classes are equally likely. Let f∗ : {1, 2, …, 10} → {0, 1} represent the Bayes prediction function for the given setting under 0-1 loss. Find f∗ and calculate the Bayes risk.
Solution: The 0-1 loss is
l(a, y) = 1(a ≠ y) := { 1 if a ≠ y, 0 otherwise }.
Therefore
R(f) = E[1(f(x) ≠ y)] = 0 · P(f(x) = y) + 1 · P(f(x) ≠ y) = P(f(x) ≠ y),
which is just the misclassification error rate. The Bayes prediction function assigns each x to its most likely class,
f∗(x) = argmax_{c ∈ {0,1}} p(y = c | x),
which here gives
f∗(x) = 0 if x ∈ {1, 2, 3, 4, 5, 6}, and f∗(x) = 1 if x ∈ {7, 8, 9, 10}.
(For x ∈ {7, 8} both classes have positive probability, but p(y = 1 | x) = 2/3 > p(y = 0 | x) = 1/3, so f∗ predicts 1 there.)
Under 0-1 loss, the risk is the probability of misclassification. f∗ misclassifies exactly the points from class 0 that land in {7, 8}. Hence the Bayes risk is
p(y = 0, x ∈ {7, 8}) = p(x ∈ {7, 8} | y = 0) p(y = 0) = 1/4 × 1/2 = 1/8.
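This calculation can be sanity-checked by enumerating the joint distribution p(x, y) directly; below is a minimal Python sketch (not part of the original exam), with the class-conditional distributions taken from the problem statement.

```python
# Sanity check of the Bayes risk by enumerating the joint distribution p(x, y).
# p(y=0) = p(y=1) = 1/2; x | y=0 is uniform on {1,...,8}; x | y=1 is uniform on {7,...,10}.
p_x_given_y = {
    0: {x: 1 / 8 for x in range(1, 9)},
    1: {x: 1 / 4 for x in range(7, 11)},
}
p_y = {0: 0.5, 1: 0.5}

def f_star(x):
    # Bayes prediction: pick the class with the larger joint probability p(x, y).
    return max((0, 1), key=lambda y: p_x_given_y[y].get(x, 0.0) * p_y[y])

# Bayes risk = probability that the Bayes predictor disagrees with the sampled label.
bayes_risk = sum(
    p_x_given_y[y].get(x, 0.0) * p_y[y]
    for y in (0, 1)
    for x in range(1, 11)
    if f_star(x) != y
)
print(bayes_risk)  # prints 0.125 = 1/8
```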

2. Consider the statistical learning problem for the distribution D on X × Y, where X = Y = R. A labeled example (x, y) ∈ R² sampled from D has probability distribution given by x ∼ N(0, 1) and y | x ∼ N(f∗(x), 0.1), where f∗(x) = Σ_{i=0}^{5} (i + 1) x^i.
Let P_k denote the set of all polynomials of degree k on R, that is, the set of all functions of the form f(x) = Σ_{i=0}^{k} a_i x^i for some a_0, …, a_k ∈ R.
Let D_m be a training set (x_1, y_1), …, (x_m, y_m) ∈ R × R drawn i.i.d. from D. We perform empirical risk minimization over a hypothesis space H for the square loss. That is, we try to find f ∈ H minimizing
R̂_m(f) = (1/m) Σ_{i=1}^{m} (f(x_i) − y_i)².
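For concreteness (not part of the exam), ERM over P_k with the square loss is just least squares on the polynomial features (1, x, …, x^k). A minimal numpy sketch follows; the sample size m, the random seed, and the reading of the .1 in N(f∗(x), .1) as the noise variance are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # True regression function: sum_{i=0}^{5} (i+1) x^i.
    return sum((i + 1) * x**i for i in range(6))

# Draw a training set D_m from the distribution described above.
m = 1000
x = rng.normal(0.0, 1.0, size=m)
y = f_star(x) + rng.normal(0.0, np.sqrt(0.1), size=m)  # assuming noise variance 0.1

def erm_poly(x, y, k):
    # ERM over P_k with square loss = least squares on the features (1, x, ..., x^k).
    Phi = np.vander(x, N=k + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coef

print(erm_poly(x, y, 5))  # roughly recovers the true coefficients (1, 2, 3, 4, 5, 6)
```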
(a) (2 points) If we change the hypothesis space H from P_3 to P_4 while keeping the same training set, select ALL of the following that MUST be true:
[ ] Approximation error increases or stays the same.
[x] Approximation error decreases or stays the same.
[ ] Estimation error increases or stays the same.
[ ] Bayes risk decreases.
(b) (2 points) If we change the hypothesis space H from P_5 to P_6 while keeping the same training set, select ALL of the following that MUST be true:
[x] Approximation error stays the same.
[ ] Estimation error stays the same.
[ ] Optimization error stays the same.
[x] Bayes risk stays the same.
(c) (2 points) If we increase the size of the training set m from 1000 to 5000 while keeping the same hypothesis space P_5, select ALL of the following that MUST be true:
[x] Approximation error stays the same.
[ ] Estimation error decreases or stays the same.
[x] The variance of R̂_m(f) for f(x) = x² decreases (checked numerically in the sketch below).
[x] Bayes risk stays the same.
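A minimal Monte Carlo sketch (not part of the exam) of the variance claim above: for the fixed function f(x) = x², the variance of R̂_m(f) over independent draws of the training set shrinks roughly like 1/m. The repetition count and the reading of .1 as the noise variance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    return sum((i + 1) * x**i for i in range(6))

def empirical_risks(m, f=lambda x: x**2, reps=2000):
    # Draw `reps` independent training sets of size m and compute R_hat_m(f) on each.
    risks = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0.0, 1.0, size=m)
        y = f_star(x) + rng.normal(0.0, np.sqrt(0.1), size=m)  # assuming noise variance 0.1
        risks[r] = np.mean((f(x) - y) ** 2)
    return risks

# The second variance should be roughly 5 times smaller (1/m scaling),
# though the Monte Carlo estimates themselves are noisy.
print(empirical_risks(1000).var(), empirical_risks(5000).var())
```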

3. Let D_t denote a training set (x_1, y_1), …, (x_{n_t}, y_{n_t}) ∈ R^d × {−1, 1} and D_v a validation set (x_1, y_1), …, (x_{n_v}, y_{n_v}) ∈ R^d × {−1, 1}. The training set D_t is linearly separable.
Define J(θ) = (1/n_t) Σ_{(x,y)∈D_t} l(m), where l(m) is a margin-based loss function and m is the margin defined by m = y(θ^T x).
We have run an iterative optimization algorithm for 100 steps and attained θ̃ as our approximate minimizer of J(θ).
Denote the training accuracy by α(D_t) = (1/n_t) Σ_{(x,y)∈D_t} 1(y θ̃^T x > 0) and the validation accuracy by α(D_v) = (1/n_v) Σ_{(x,y)∈D_v} 1(y θ̃^T x > 0).
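For reference (not part of the exam), the objective J(θ) and the accuracies above are simple functions of the margins m = y θ^T x. The following numpy sketch uses a tiny made-up separable dataset and an arbitrary θ̃; note that this θ̃ attains zero hinge and perceptron loss but strictly positive logistic loss, which is the intuition behind several of the answers below.

```python
import numpy as np

def margins(theta, X, y):
    # m_i = y_i * theta^T x_i for each example.
    return y * (X @ theta)

def logistic_loss(m):
    return np.log1p(np.exp(-m))

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def perceptron_loss(m):
    return np.maximum(0.0, -m)

def objective(theta, X, y, loss):
    return loss(margins(theta, X, y)).mean()

def accuracy(theta, X, y):
    # Fraction of examples with positive margin, i.e. 1(y theta^T x > 0).
    return (margins(theta, X, y) > 0).mean()

# Placeholder data: a tiny linearly separable set and an arbitrary theta_tilde.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
theta_tilde = np.array([1.0, 1.0])
print(accuracy(theta_tilde, X, y),
      objective(theta_tilde, X, y, logistic_loss),
      objective(theta_tilde, X, y, hinge_loss),
      objective(theta_tilde, X, y, perceptron_loss))
```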
(a) Answer the following for the logistic loss l(m) = log(1 + e^{−m}):
i. (1 point) F True or False: Achieving 100% training accuracy (α(D_t) = 1) implies that we have achieved a minimizer of the objective function (θ̃ ∈ argmin_θ J(θ)).
ii. (1 point) F True or False: Achieving 100% validation accuracy (α(D_v) = 1) implies that we have achieved a minimizer of the objective function (θ̃ ∈ argmin_θ J(θ)).
(b) Answer the following for the hinge loss l(m) = max(0, 1 − m):
i. (1 point) F True or False: Achieving 100% training accuracy (α(D_t) = 1) implies that we have achieved a minimizer of the objective function (θ̃ ∈ argmin_θ J(θ)).
ii. (1 point) T True or False: Achieving a minimizer of the objective function (θ̃ ∈ argmin_θ J(θ)) implies we have achieved 100% training accuracy (α(D_t) = 1).
(c) Answer the following for the perceptron loss l(m) = max(0, −m):
i. (1 point) T True or False: Achieving 100% training accuracy (α(D_t) = 1) implies that we have achieved a minimizer of the objective function (θ̃ ∈ argmin_θ J(θ)).
ii. (1 point) F True or False: Achieving a minimizer of the objective function (θ̃ ∈ argmin_θ J(θ)) implies we have achieved 100% training accuracy (α(D_t) = 1).

Figure 1: A subset of datapoints from Dm with the decision boundary.
4. Given a dataset D_m = {(z_1, y_1), …, (z_m, y_m)} ⊂ R² × {−1, 1}, we solve the optimization problem below to obtain w, b, which characterize the hyperplane classifying any point z ∈ R² into one of the classes y = +1 or y = −1, and a number ξ_i for each datapoint z_i ∈ D_m, referred to as slack.
minimize_{w,b,ξ}  ‖w‖² + (C/m) Σ_{i=1}^{m} ξ_i
subject to  y_i(w^T z_i − b) ≥ 1 − ξ_i for all i,
            ξ_i ≥ 0 for all i.
On solving the optimization problem on D_100 for some C ≥ 0, we obtain ŵ = (1, 1)^T and b̂ = 1. Define f(z) = ŵ^T z − b̂. Figure 1 shows a subset of the datapoints from D_m; assume that for all datapoints z_i ∈ D_m not shown in Figure 1 we have y_i f(z_i) > 1. In the figure, a label of + represents y = 1 and the other marker represents y = −1.

Figure 2: Solution to SVM Question
(a) (2 points) On the graph in Figure 1, draw lines to characterize the margin of the classifier ŵ^T z = b̂. The lines characterizing the margin are defined by {z ∈ R² : f(z) = 1} and {z ∈ R² : f(z) = −1}.
(b) (4 points) Let ξ_{(x_1,x_2)} denote the slack of the point located at z = (x_1, x_2). For each of the following, fill in the blank with the best choice from =, > or < (each comparison can be checked with the numerical sketch below):
ξ_{(2,4)} < ξ_{(2,0)}
ξ_{(−1,1)} = ξ_{(−1,−2)}
ξ_{(−3,1)} > ξ_{(−1,2.5)}
ξ_{(2,4)} = ξ_{(−1,−2)}
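These comparisons follow from the given ŵ = (1, 1)^T and b̂ = 1: for C > 0, at the optimum the slack of a point with label y is ξ = max(0, 1 − y f(z)), since the constraints force ξ ≥ 1 − y f(z) and ξ ≥ 0 while the objective pushes each ξ as low as possible. The sketch below (not part of the exam) prints f(z) and the slack under either possible label for each listed point; the actual labels must be read off Figure 1.

```python
import numpy as np

w_hat, b_hat = np.array([1.0, 1.0]), 1.0

def f(z):
    return w_hat @ np.asarray(z, dtype=float) - b_hat

def slack(z, y):
    # At the optimum, xi = max(0, 1 - y * f(z)) for each training point.
    return max(0.0, 1.0 - y * f(z))

for z in [(2, 4), (2, 0), (-1, 1), (-1, -2), (-3, 1), (-1, 2.5)]:
    print(z, f(z), {+1: slack(z, +1), -1: slack(z, -1)})
```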
(c) From the representer theorem and from duality, we saw that ŵ can be expressed as ŵ = Σ_{i=1}^{m} α_i z_i, where any z_i with α_i ≠ 0 is called a support vector. The complementary slackness conditions give us the following possibilities for any training example:
1. The example definitely IS a support vector.
2. The example definitely IS NOT a support vector.
3. We cannot determine from the complementary slackness conditions whether or not the example is a support vector.
For each of the following training points, select the ONE best option from the possibilities above:
i. (1 point) Example at (2, 4): [ ] 1  [x] 2  [ ] 3
ii. (1 point) Example at (1, 1): [ ] 1  [ ] 2  [x] 3
iii. (1 point) Example at (2, 0): [x] 1  [ ] 2  [ ] 3

5. Let D_n represent a dataset (x_1, y_1), …, (x_n, y_n) ∈ R^d × R. The first two dimensions (i.e. features) of every vector x_i are related to each other by scaling: x_{i1} = s·x_{i2} for all i = 1, 2, …, n, for some s ∈ R. Let X ∈ R^{n×d} be the design matrix whose ith row contains x_i^T, with rank(X) = d − 1 (i.e. there are no other linear dependencies besides the one given). Consider the following objective function for the elastic net defined over D_n:
J(θ) = (1/n) Σ_{i=1}^{n} (θ^T x_i − y_i)² + λ_1 ‖θ‖_1 + λ_2 ‖θ‖_2²
(a) Suppose that |s| ≠ 1. We optimize J(θ) using subgradient descent. We start the optimization from θ_0 and converge to θ̂ ∈ argmin_{θ∈R^d} J(θ). We then restart the optimization from a different point θ_0′ and converge to θ̂′ ∈ argmin_{θ∈R^d} J(θ). Consider the following possibilities:
1. Must have θ̂ = θ̂′
2. May have θ̂ ≠ θ̂′ but must have J(θ̂) = J(θ̂′)
3. May have θ̂ ≠ θ̂′ and J(θ̂) ≠ J(θ̂′)
For each of the subparts below, select the ONE best possibility from the three given above (a numerical illustration of cases i and iii follows below):
i. (1 point) λ_1 = 0, λ_2 = 0: [ ] 1  [x] 2  [ ] 3
ii. (1 point) λ_1 > 0, λ_2 = 0: [x] 1  [ ] 2  [ ] 3
iii. (1 point) λ_1 = 0, λ_2 > 0: [x] 1  [ ] 2  [ ] 3
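A small numpy illustration (not part of the exam) of cases i and iii: with the collinear columns and no regularization, gradient descent from different starting points can return different θ̂ with equal objective value, while with λ_2 > 0 the minimizer is unique. The values of s, n, d, the step size, and the use of plain gradient descent (valid here since λ_1 = 0) are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with the stated dependence between the first two features: column 0 = s * column 1.
n, d, s = 50, 4, 2.0
X = rng.normal(size=(n, d))
X[:, 0] = s * X[:, 1]
y = rng.normal(size=n)

def obj(theta, lam2):
    # J(theta) with lambda_1 = 0: (1/n)||X theta - y||^2 + lam2 ||theta||_2^2.
    return np.mean((X @ theta - y) ** 2) + lam2 * theta @ theta

def grad(theta, lam2):
    return 2.0 / n * X.T @ (X @ theta - y) + 2.0 * lam2 * theta

def gd(theta0, lam2, steps=20000, lr=1e-2):
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * grad(theta, lam2)
    return theta

for lam2 in (0.0, 0.1):
    t1, t2 = gd(rng.normal(size=d), lam2), gd(rng.normal(size=d), lam2)
    print(lam2,
          "same minimizer:", np.allclose(t1, t2, atol=1e-4),
          "same objective:", np.isclose(obj(t1, lam2), obj(t2, lam2)))
# Expected: lam2 = 0.0 -> different minimizers, equal objective; lam2 = 0.1 -> same minimizer.
```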

(b) (3 points) Fix λ_1 = 0 and λ_2 > 0. We optimize J(θ) using stochastic gradient descent, starting from 0, and we attain θ̂ ∈ argmin_{θ∈R^d} J(θ). Let f̂(x) = θ̂^T x. Consider a new point x_t ∈ R^d such that x_t^T x_i = 0 for all i = 1, 2, …, n. Show that f̂(x_t) = 0. (This holds for any s, though you should not need to mention s in your answer.)
Solution: First we need to know that θ̂ = Σ_{i=1}^{n} α_i x_i for some α ∈ R^n. This follows from the representer theorem, or from the fact that each stochastic (sub)gradient step adds a multiple of a data point and rescales the current iterate, so an iterate started at 0 always stays in the span of the data.
Once we know that θ̂ is in the span of the data, we simply apply it to x_t:
f̂(x_t) = θ̂^T x_t = Σ_{i=1}^{n} α_i x_i^T x_t = 0.
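The span argument can also be checked numerically. In the sketch below (not part of the exam), θ̂ is computed in closed form for the ridge case λ_1 = 0, λ_2 > 0, and x_t is constructed to be orthogonal to every training input; the dimensions and λ_2 are arbitrary (with d > n so that such an x_t exists).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam2 = 5, 8, 0.1                      # d > n so an orthogonal x_t exists
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge minimizer of (1/n)||X theta - y||^2 + lam2 ||theta||_2^2 in closed form.
theta_hat = np.linalg.solve(X.T @ X / n + lam2 * np.eye(d), X.T @ y / n)

# Build x_t orthogonal to every training input by removing from a random vector
# its projection onto the span of the x_i (the column space of X.T).
v = rng.normal(size=d)
x_t = v - X.T @ np.linalg.lstsq(X.T, v, rcond=None)[0]

print(np.allclose(X @ x_t, 0.0))            # x_t^T x_i = 0 for all i
print(np.isclose(theta_hat @ x_t, 0.0))     # f_hat(x_t) = 0
```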

6. Let k(x, x′) = exp(−‖x − x′‖_X² / (2σ²)), σ > 0, be the radial basis function (RBF) kernel. By Mercer's theorem, the kernel k corresponds to a feature map φ : X → H mapping inputs into an inner product space (in fact a Hilbert space). Let ‖·‖_H be the norm in H and ‖·‖_X be the norm in X.
(a) (4 points) Show that for any inputs x_1, x_2, x_3 ∈ X,
‖x_2 − x_1‖_X² ≤ ‖x_3 − x_1‖_X² ⟹ ‖φ(x_2) − φ(x_1)‖_H² ≤ ‖φ(x_3) − φ(x_1)‖_H².
(Hint: Expand ‖φ(x) − φ(x′)‖_H² using inner products, and then derive the conclusion.)
Solution:
‖φ(x_2) − φ(x_1)‖_H² = ⟨φ(x_2), φ(x_2)⟩ + ⟨φ(x_1), φ(x_1)⟩ − 2⟨φ(x_2), φ(x_1)⟩
= k(x_2, x_2) + k(x_1, x_1) − 2k(x_2, x_1)
= 1 + 1 − 2 exp(−‖x_1 − x_2‖_X² / (2σ²))
≤ 1 + 1 − 2 exp(−‖x_1 − x_3‖_X² / (2σ²))   (since ‖x_2 − x_1‖_X² ≤ ‖x_3 − x_1‖_X² and t ↦ exp(−t/(2σ²)) is decreasing)
= k(x_3, x_3) + k(x_1, x_1) − 2k(x_3, x_1)
= ⟨φ(x_3), φ(x_3)⟩ + ⟨φ(x_1), φ(x_1)⟩ − 2⟨φ(x_3), φ(x_1)⟩
= ‖φ(x_3) − φ(x_1)‖_H².
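A quick numerical check of this monotonicity (not part of the exam), using the identity ‖φ(x) − φ(x′)‖_H² = 2 − 2k(x, x′); the value of σ and the random points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.3

def k(x, xp):
    # RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma**2))

def feature_dist_sq(x, xp):
    # ||phi(x) - phi(x')||_H^2 = k(x,x) + k(x',x') - 2 k(x,x') = 2 - 2 k(x,x').
    return 2.0 - 2.0 * k(x, xp)

for _ in range(1000):
    x1, x2, x3 = (rng.normal(size=3) for _ in range(3))
    if np.sum((x2 - x1) ** 2) <= np.sum((x3 - x1) ** 2):
        assert feature_dist_sq(x2, x1) <= feature_dist_sq(x3, x1)
```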

7. Consider the regression setting in which X = R^d, Y = R, and A = R, with a linear hypothesis space F = {f(x) = w^T x | w ∈ R^d} and the loss function
l(ŷ, y) = (ŷ − y)²,
where ŷ is the action and y is the outcome. Consider the objective function
J(w) = Σ_{i=1}^{n} l(w^T x_i, y_i) + λ‖w‖,
where ‖w‖ = √(Σ_{i=1}^{d} w_i²) is the l2 norm of w.
(a) (4 points) Provide a kernelized objective function J_k(α) : R^n → R. You may write your answer in terms of the Gram matrix K ∈ R^{n×n}, defined as K_{ij} = x_i^T x_j.
Solution: The kernelized objective function is
J_k(α) = Σ_{i=1}^{n} l((Kα)_i, y_i) + λ√(α^T K α).
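As a check (not part of the exam), the kernelized objective agrees with the primal objective whenever w = Σ_i α_i x_i; the data, λ, and α in this numpy sketch are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

K = X @ X.T                                   # Gram matrix, K_ij = x_i^T x_j

def J_primal(w):
    return np.sum((X @ w - y) ** 2) + lam * np.linalg.norm(w)

def J_kernel(alpha):
    # J_k(alpha) = sum_i l((K alpha)_i, y_i) + lam * sqrt(alpha^T K alpha)
    return np.sum((K @ alpha - y) ** 2) + lam * np.sqrt(alpha @ K @ alpha)

alpha = rng.normal(size=n)                    # arbitrary alpha for illustration
print(np.isclose(J_kernel(alpha), J_primal(X.T @ alpha)))  # agree when w = sum_i alpha_i x_i
```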

(b) (1 point) T True or False: Suppose we use subgradient descent to optimize the objective function and want to find a global minimum. If at some step the subdifferential at the current iterate contains zero, we should stop subgradient descent immediately.
(c) (1 point) T True or False: Let w∗ be any minimizer of J(w). Then w∗ has the form w∗ = Σ_{i=1}^{n} α_i x_i for some α ∈ R^n.