Machine Learning I Exam from 24.09.2020
This document is not official; it is not endorsed by the university or the lecturer.
120 minutes, no auxiliary tools allowed, 20 + 15 + 25 + 20 + 20 = 100 points.
1. Multiple choice (4 × 5 = 20 Points)
Answer the following multiple choice questions.
(a) The Bayes error is
the lowest error of a linear classifier.
the expected error of a random linear classifier.
the error of any nonlinear classifier.
the error of a naive Bayes classifier.
(b) The Fisher linear discriminant finds the projection y = w^T x of the data that maximises
the margin between the two data generating distributions.
the within-class variance divided by the between-class variance.
the margin between the means of the data generating distributions.
the between-class variance divided by the within-class variance.
(c) A biased estimator is used to
make the estimator less affected by the sampling of the data.
make the estimation procedure more sensitive to the sample data.
reduce the risk of underfitting the data.
None of the above, an unbiased estimator is always better.
(d) Let x1, . . . , xN ∈ R^d be unlabelled observations. Consider a Gaussian Gram matrix K ∈ R^{N×N}. Which is always true?
∀u ∈ R^N: u^T K u ≥ 0.
∀u ∈ R^N: u^T K u ≤ 0.
2. Neural Networks (10 + 5 = 15 Points)
(a) Build a neural network that models the function f: R^2 → {0, 1}, x → 1_{min(x1, x2) ≤ −1}(x) with at most three neurons of the form a_j = step(∑_i w_{ij} a_i + b_j), where step(z) := 1_{z ≥ 0}(z). State the weights and biases.
(b) State the number of neurons needed to build a neural network that models f: R^d → {0, 1}, x → 1_{∥x∥∞ ≤ 5}(x), and describe the weights and bias of one such neuron.
3. Maximum likelihood and Bayes (5 × 5 = 25 Points)
People queue at the post office and their i.i.d. processing times are D = (x1, x2, x3) = (1, 1, 2). The data generating distribution is P(xi = k) = (1 − θ)^{k−1} θ, where k ∈ N ∪ {∞} and θ ∈ [0, 1] is unknown.
(a) State the likelihood function P(D|θ).
(b) Find the maximum likelihood parameter θ̂.
(c) Evaluate P(x4 > 1 | θ̂).
Define a1 = step(−x1 − 1) and a2 = step(−x2 − 1) to check whether x1 ≤ −1 and whether x2 ≤ −1, respectively. If at least one of them gives 1, we want the output to be one and zero otherwise. Thus a3 = step(a1 + a2 − 1).
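A minimal numpy check of this three-neuron network (our own addition, not part of the exam; step and f_net are hypothetical helper names):

import numpy as np

step = lambda z: (np.asarray(z) >= 0).astype(float)

def f_net(x):
    a1 = step(-x[0] - 1)      # fires iff x1 <= -1
    a2 = step(-x[1] - 1)      # fires iff x2 <= -1
    return step(a1 + a2 - 1)  # fires iff at least one of a1, a2 is 1

for x in [(-2.0, 0.0), (0.0, -1.0), (0.0, 0.0), (-1.0, -1.0), (3.0, -5.0)]:
    assert f_net(x) == float(min(x) <= -1)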
We need 2d + 1 neurons. We have that ∥x∥∞ ≤ 5 if and only if −5 ≤ xk ≤ 5 for all k ∈ {1, …, d}. For k ∈ {1, …, d}, we thus take a_{2k−1} = step(5 − xk) (to check that xk ≤ 5) and a_{2k} = step(xk + 5) (to check that xk ≥ −5). The output neuron is
a_{2d+1} = step((1/(2d)) ∑_{k=1}^{2d} a_k − 1),
as we only want to output 1 if all other a_j give 1.
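A numpy sketch of this 2d + 1 neuron construction for general d (our own illustration, not part of the exam; f_inf_ball is a hypothetical name):

import numpy as np

step = lambda z: (np.asarray(z) >= 0).astype(float)

def f_inf_ball(x):
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    a_upper = step(5 - x)                   # a_{2k-1}: fires iff x_k <= 5
    a_lower = step(x + 5)                   # a_{2k}:   fires iff x_k >= -5
    a = np.concatenate([a_upper, a_lower])  # the 2d hidden neurons
    return step(a.sum() / (2 * d) - 1)      # output fires iff all hidden neurons fire

assert f_inf_ball([0, 3, -5]) == 1.0
assert f_inf_ball([0, 6, -5]) == 0.0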
P(D|θ) = (1 − θ)^{1−1} θ · (1 − θ)^{1−1} θ · (1 − θ)^{2−1} θ = θ^3 (1 − θ).
We have θ̂ = argmax_θ P(D|θ). We have d/dθ [θ^3 (1 − θ)] = 3θ^2 − 4θ^3, so θ = 0 or θ = 3/4.
We also have to check the boundary of the definition domain of P(D|θ): we have
P(D|0) = 0 = P(D|1) < P(D|3/4) = 27/256, so θ̂ = 3/4.
Since x4 can be any integer from 2 to ∞, we have
P(x4 > 1 | θ̂) = ∑_{k=2}^{∞} P(x4 = k | θ̂) = ∑_{k=2}^{∞} (1 − θ̂)^{k−1} θ̂ = (3/4) ∑_{k=2}^{∞} (1/4)^{k−1} = (3/4) · (1/3) = 1/4.
The sum is a geometric series, so we can get the finite expression 1/3 for it. Simpler computation using the complement:
P(x4 > 1 | θ̂) = 1 − P(x4 = 1 | θ̂) = 1 − (1 − θ̂)^{1−1} θ̂ = 1 − θ̂ = 1 − 3/4 = 1/4.
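As a quick numerical cross-check of (b) and (c) (our own addition, not part of the exam), one can maximise the likelihood on a grid and evaluate the predictive probability:

import numpy as np

thetas = np.linspace(0, 1, 100001)
likelihood = thetas**3 * (1 - thetas)        # P(D | theta)
theta_hat = thetas[np.argmax(likelihood)]
print(theta_hat)                             # approx 0.75
print(1 - (1 - theta_hat)**0 * theta_hat)    # P(x4 > 1 | theta_hat), approx 0.25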
We now adopt a Bayesian viewpoint on this problem, where we assume a prior distribution for the parameter θ defined as:
p(θ) = 1 for θ ∈ [0, 1], and p(θ) = 0 otherwise.
(d) Show that the posterior distribution p(θ|D) is 20 (1 − θ) θ^3 for θ ∈ [0, 1] and zero elsewhere.
By Bayes' theorem and the law of total probability we have
p(θ|D) = p(D|θ) p(θ) / p(D) = p(D|θ) p(θ) / ∫_R p(D|θ) p(θ) dθ = θ^3 (1 − θ) · 1_{[0,1]}(θ) / ∫_0^1 θ^3 (1 − θ) dθ
= θ^3 (1 − θ) · 1_{[0,1]}(θ) / (1/20) = 20 (1 − θ) θ^3 · 1_{[0,1]}(θ).
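The normalisation constant 1/20 can be verified symbolically, e.g. with sympy (our own check, not part of the exam):

import sympy as sp

theta = sp.symbols('theta')
Z = sp.integrate(theta**3 * (1 - theta), (theta, 0, 1))  # evidence integral = 1/20
posterior = theta**3 * (1 - theta) / Z
print(Z, posterior)                                      # 1/20 and 20*theta**3*(1 - theta)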
(e) Evaluate P(x4 > 1 | D) = ∫ P(x4 > 1 | θ) p(θ|D) dθ.
P(x4 > 1 | D) = 1 − P(x4 = 1 | D) = 1 − ∫_R 20 (1 − θ) θ^3 · 1_{[0,1]}(θ) · θ (1 − θ)^{1−1} dθ
= 1 − 20 ∫_0^1 θ^4 (1 − θ) dθ = 1 − 20 ∫_0^1 (θ^4 − θ^5) dθ = 1 − 20 (1/5 − 1/6)
= 1 − 2/3 = 1/3.
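The posterior predictive integral can be checked symbolically in the same way (our own verification, not part of the exam):

import sympy as sp

theta = sp.symbols('theta')
posterior = 20 * (1 - theta) * theta**3
p_first = sp.integrate(posterior * theta, (theta, 0, 1))  # P(x4 = 1 | D) = 2/3
print(1 - p_first)                                        # P(x4 > 1 | D) = 1/3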
4. Lagrange multipliers (4 × 5 = 20 Points)
Let Σ ∈ R^{d×d} be a positive semidefinite matrix. Consider the constrained maximisation problem:
max_{w ∈ R^d} ∥w∥^2 subject to w^T Σ^{-1} w = 1.
(a) State the Lagrange function.
(b) Show that the problem is an eigenvalue problem of Σ.
L(w, λ) := ∥w∥^2 + λ (1 − w^T Σ^{-1} w).
For w to be optimal, we need
∂L(w,λ)/∂w = 2w − 2λ Σ^{-1} w =! 0 ⟺ w = λ Σ^{-1} w ⟺ Σ w = λ w,
so w has to be an eigenvector of Σ with eigenvalue λ.
(c) Show that the solution is the eigenvector associated with the largest eigenvalue of Σ.
(d) Let w1, . . . , wT be a sequence of vectors where w_t is obtained from w_{t−1} as the solution of the constrained problem
max_{z ∈ R^d} z^T w_{t−1} subject to z^T Σ^{-1} z = 1.
Find a closed-form solution for w_t as a function of w_{t−1}.
From the constraint w^T Σ^{-1} w = 1 and w = λ Σ^{-1} w we get (as Σ is symmetric)
∥w∥^2 = w^T w = λ w^T Σ^{-1} w = λ.
Thus the value of the eigenvalue coincides with the quantity we want to maximise, so the optimum is attained by the eigenvector associated with the largest eigenvalue of Σ (scaled to satisfy the constraint).
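A small numerical illustration of (b) and (c) (our own addition; Sigma here is just a random positive definite matrix): scaling the top eigenvector to satisfy the constraint attains ∥w∥^2 equal to the largest eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T                         # positive (semi)definite
lam, V = np.linalg.eigh(Sigma)
w = np.sqrt(lam[-1]) * V[:, -1]         # top eigenvector, scaled to meet the constraint
print(w @ np.linalg.solve(Sigma, w))    # w^T Sigma^{-1} w = 1
print(w @ w, lam[-1])                   # ||w||^2 equals the largest eigenvalue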
For the iterative problem in (d), the Lagrangian is
L(z, λ) := z^T w_{t−1} + λ (1 − z^T Σ^{-1} z).
In order for z to be optimal, we require
∂L(z,λ)/∂z = w_{t−1} − 2λ Σ^{-1} z =! 0 ⟺ w_{t−1} = 2λ Σ^{-1} z ⟺ z = (1/(2λ)) Σ w_{t−1}.
Plugging the second-to-last equality into the constraint z^T Σ^{-1} z = 1, we get
z^T w_{t−1} = 2λ z^T Σ^{-1} z = 2λ,
and using the last equality we get
2λ = z^T w_{t−1} = (1/(2λ)) w_{t−1}^T Σ w_{t−1},
so (2λ)^2 = w_{t−1}^T Σ w_{t−1} and
2λ = sqrt(w_{t−1}^T Σ w_{t−1}),
as Σ is positive semidefinite (so we do not have to consider −√…). We thus get
w_t = z = Σ w_{t−1} / sqrt(w_{t−1}^T Σ w_{t−1}) = Σ w_{t−1} / ∥Σ w_{t−1}∥_{Σ^{-1}},
with ∥x∥_{Σ^{-1}} := sqrt(x^T Σ^{-1} x).
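The closed-form update behaves like power iteration; a short numpy sketch (our own addition, not from the exam) shows the iterates aligning with the top eigendirection of Σ:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T
w = rng.standard_normal(5)
for _ in range(100):
    w = Sigma @ w / np.sqrt(w @ Sigma @ w)    # w_t = Sigma w_{t-1} / sqrt(w_{t-1}^T Sigma w_{t-1})

lam, V = np.linalg.eigh(Sigma)
print(abs(V[:, -1] @ w) / np.linalg.norm(w))  # close to 1: w is aligned with the top eigenvector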
5. Ridge regression (10 + 10 = 20 Points)
Consider the problem
min_{w ∈ R^d} ∥y − Xw∥^2 subject to ∥w∥∞ ≤ C,
where C > 0 is a constant, y ∈ R^N and X ∈ R^{N×d} is the data matrix.
(a) Show that the problem is equivalent to
min_{w ∈ R^d} w^T X^T X w − 2 y^T X w subject to −C ≤ w_i ≤ C for all i ∈ {1, …, d}.
∥y − Xw∥^2 = (y − Xw)^T (y − Xw) = y^T y − y^T X w − (Xw)^T y + (Xw)^T X w
= y^T y − 2 y^T X w + w^T X^T X w.
Since y^T y is independent of w, we can neglect it when minimising over w. We have
y^T X w = (Xw)^T y, as it is a scalar and hence equal to its transpose.
Furthermore, ∥w∥∞ = max{|w1|, . . . , |wd|}, so ∥w∥∞ ≤ C is equivalent to |wk| ≤ C for all k ∈ {1, …, d}, i.e. −C ≤ wk ≤ C for all k ∈ {1, …, d}.
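The algebraic expansion can be sanity-checked numerically (our own addition, not part of the exam):

import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.standard_normal((5, 3)), rng.standard_normal(5), rng.standard_normal(3)
lhs = np.sum((y - X @ w) ** 2)
rhs = y @ y - 2 * y @ X @ w + w @ X.T @ X @ w
print(np.isclose(lhs, rhs))   # True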
(b) At our disposal we have a quadratic solver QP(Q, l, A, b), which solves the generic quadratic problem
min_v v^T Q v + l^T v subject to A v ≤ b.
Write the numpy code constructing the arrays Q, l, A and b from X, y and C.
import numpy as np

def Reg(X, y, C):
    Q = X.T.dot(X)                    # quadratic term: w^T X^T X w
    l = -2 * y.T.dot(X).T             # linear term: -2 y^T X w
    d = Q.shape[0]
    # Stack I and -I so that A w <= b encodes w_i <= C and -w_i <= C.
    A = np.concatenate([np.identity(d), -np.identity(d)], axis=0)
    b = C * np.ones(2 * d)
    w = QP(Q, l, A, b)                # call the given quadratic solver
    return w
The grey code was given.
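As a quick check (our own addition, not part of the exam) that the constructed A and b encode the box constraint −C ≤ w_i ≤ C via A w ≤ b:

import numpy as np

d, C = 3, 0.5
A = np.concatenate([np.identity(d), -np.identity(d)], axis=0)
b = C * np.ones(2 * d)
w_inside, w_outside = np.array([0.3, -0.5, 0.0]), np.array([0.6, 0.0, 0.0])
print(np.all(A @ w_inside <= b), np.all(A @ w_outside <= b))   # True, False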
Thanks to everyone contributing to this account of the exam and its solutions 🙂