City University of Hong Kong
Course code & title: CS5487 Machine Learning
Session: Semester A 2020/21
Time allowed:
Format:
1. The final exam has 6 pages including this page, consisting of 4 questions.
2. The following resources are allowed on the exam:
• You are allowed a cheat sheet that is one A4 page (double-sided) handwritten with pen or pencil.
3. All other resources are not allowed, e.g., internet searches, classmates, other textbooks.
4. Answer the questions on physical paper using pen or pencil.
• Answer ALL questions.
• Remember to write your name, EID, and student number at the top of each answer paper.
5. You should stay on Zoom during the entire exam time.
• If you have any questions, use the private chat function in Zoom to message Antoni.
6. Final submission
• Take pictures of your answer paper and submit it to the “Final Exam” Canvas assignment. You may submit it as jpg/png/pdf.
• It is the student’s responsibility to make sure that the captured images are legible. Illegible images will be graded as is, similar to illegible handwriting.
• If you have problems submitting to Canvas, then email your answer paper to Antoni.
Question    1    2    3    4    Total
Marks       25   25   25   25   100

CILO Question Weights (% of exam): CILO 1: 35, CILO 2: 0, CILO 3: 20, CILO 4: 45.
Statement of Academic Honesty
Below is a Statement of Academic Honesty. Please read it.
I pledge that the answers in this exam are my own and that I will not seek or obtain an unfair advantage in producing these answers. Specifically,
• I will not plagiarize (copy without citation) from any source;
• I will not communicate or attempt to communicate with any other person during the exam; neither will I give or attempt to give assistance to another student taking the exam; and
• I will use only approved devices (e.g., calculators) and/or approved device
• I understand that any act of academic dishonesty can lead to disciplinary action.
I pledge to follow the Rules on Academic Honesty and understand that violations may lead to severe penalties.
Name: EID: Student ID: Signature:
(a) Copy the above statement of academic honesty to your answer sheet. Fill in your name, EID, and student ID, and sign your signature to show that you agree with the statement and will follow its terms.
Problem 1 EM for GMMs with shared mean [25 marks]
Let x ∈ R be distributed as a (univariate) Gaussian mixture model with K components. Assume that the Gaussian components have a shared mean μ and different variances σ_j²,

p(x|θ) = Σ_{j=1}^K π_j N(x | μ, σ_j²),   (1)

where θ = {{π_j, σ_j²}_{j=1}^K, μ} are the parameters. This is called a Gaussian scale mixture. Let X = {x_1, · · · , x_n} be a set of observed samples, and Z = {z_1, · · · , z_n} the set of corresponding hidden values.
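For concreteness, a minimal NumPy/SciPy sketch of evaluating the density in (1) is given below; the function name gsm_pdf and the example parameter values are illustrative choices, not part of the problem.

```python
import numpy as np
from scipy.stats import norm

def gsm_pdf(x, pis, mu, sigma2s):
    """Evaluate the Gaussian scale mixture density sum_j pi_j N(x | mu, sigma_j^2)."""
    x = np.asarray(x, dtype=float)
    return sum(pi_j * norm.pdf(x, loc=mu, scale=np.sqrt(s2_j))
               for pi_j, s2_j in zip(pis, sigma2s))

# Example: two components with a shared mean of 0 and different variances.
xs = np.linspace(-10, 10, 401)
px = gsm_pdf(xs, pis=[0.5, 0.5], mu=0.0, sigma2s=[1.0, 9.0])
```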
(a) [5 marks] Plot an example of the Gaussian scale mixture with parameters
K = 2,   π_1 = π_2 = 1/2,   μ = 0,   σ_1² = 1,   σ_2² = 9.   (2)
Intuitively, what type of data could be well modeled with a Gaussian scale mixture?
(b) [5 marks] For the general case (not assuming the parameters in (a)), write down the complete data log-likelihood, log p(X, Z|θ).
(c) [5 marks] Derive the E-step, i.e., the Q function, Q(θ; θ̂_old).
(d) [5 marks] Derive the M-step, i.e., the parameter updates of θ.
(e) [5 marks] What is the intuitive explanation of the E- and M-steps in (c) and (d)? Note any differences or similarities with the EM algorithm for standard GMMs (without a shared mean).
Problem 2 BDR and Naive Bayes for discrete variables [25 marks]
For high-dimensional observation spaces, it might be difficult to learn a joint density over the space (e.g., if not enough data is available). One common assumption is to use a “Naive Bayes” model, where we assume that the individual features (dimensions) are conditionally independent given the class,
p(x|y = j) = ∏_{i=1}^d p(x_i | y = j),   (3)
where x = [x_1, · · · , x_d]^T is the observation vector, and x_i is an individual feature. While the features are conditionally independent given the class, the features are still dependent in the overall distribution of observations p(x) (similar to a GMM with diagonal covariance matrices).
Let the vector x be a collection of d binary-valued features, i.e.,

x = [x_1, · · · , x_d]^T,   x_i ∈ {0, 1}.   (4)
The individual features xi are binary, e.g., each xi is an indicator variable representing the presence/absence of a specific property in the data sample. Assume there are 2 classes, with class variable y ∈ {1, 2} and prior distribution p(y = 1) = π, p(y = 2) = 1 − π. Now define
p_i = p(x_i = 1 | y = 1),   ∀i,   (5)
q_i = p(x_i = 1 | y = 2),   ∀i.   (6)
The goal is to recover the class y given a measurement x.
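As an illustration of the generative model described above, the sketch below draws labelled samples; the parameter values and the helper sample_naive_bayes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for d = 3 binary features.
prior_pi = 0.6                  # p(y = 1); p(y = 2) = 1 - prior_pi
p = np.array([0.9, 0.2, 0.5])   # p_i = p(x_i = 1 | y = 1)
q = np.array([0.1, 0.7, 0.5])   # q_i = p(x_i = 1 | y = 2)

def sample_naive_bayes(n):
    """Draw n (x, y) pairs: sample the class, then each feature independently."""
    y = np.where(rng.random(n) < prior_pi, 1, 2)
    bern_means = np.where((y == 1)[:, None], p, q)        # per-sample feature probabilities
    x = (rng.random((n, p.size)) < bern_means).astype(int)
    return x, y

X, y = sample_naive_bayes(5)
```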
(a) [5 marks] Interpret in words the meaning of p_i, q_i, and log(p_i/q_i).
(b) [5 marks] Write down the class-conditional densities p(x|y = 1) and p(x|y = 2) in terms of p_i, q_i, and x_i.
(c) [10 marks] Using the 0-1 loss function, derive the Bayes decision rule (BDR) for class y for a given x. State the derived BDR in the following form:
y∗ = { 1,  g(x) > 0,
       2,  otherwise,   (7)

for some decision function g(x) (which you derive).
(d) [5 marks] What is the intuitive interpretation of the decision function g(x)? In particular, what is the role of feature xi, its associated pi and qi, and the prior π? What is the shape of the decision surface of this BDR classifier?
Problem 3 Support vector regression [25 marks]
In this problem, we will consider support vector regression (SVR), which applies margin principles from SVMs to regression. The goal is to learn a linear function,
f(x) = w^T x + b,   (8)

which fits a given dataset D = {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d and y_i ∈ R. Suppose we form a “band” or “tube” around f(x) with width ε (see figure below).
[Figure: a linear function f(x) and its ε-tube, bounded above by f(x) + ε and below by f(x) − ε.]
We can consider any training pair (xi, yi) that falls inside of the tube as correctly regressed, while points falling outside of the tube are errors. Assuming that ε is known, the SVR primal problem is
min_{w,b}   (1/2)∥w∥²
s.t.   y_i − (w^T x_i + b) ≤ ε,
       (w^T x_i + b) − y_i ≤ ε,   ∀i.   (9)
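For intuition, the primal in (9) can be handed directly to a generic convex solver; the sketch below uses CVXPY on synthetic data constructed so the hard ε-tube constraints remain feasible (the data, the ε value, and the use of CVXPY are assumptions for illustration only).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
eps = 0.2

# Synthetic 1-D data lying within eps of some line, so the hard constraints are satisfiable.
x = rng.uniform(-4.0, 6.0, size=(30, 1))
y = 0.8 * x[:, 0] + 1.0 + rng.uniform(-0.5 * eps, 0.5 * eps, size=30)

w = cp.Variable(1)
b = cp.Variable()
f = x @ w + b                                   # predictions f(x_i) = w^T x_i + b
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [y - f <= eps, f - y <= eps])
problem.solve()
print(w.value, b.value)
```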
(a) [5 marks] Explain the role of the objective function and the inequality constraints in the SVR primal problem. When are the inequality constraints “active” or “inactive”?
(b) [5 marks] Let {α_i} and {α̂_i} be the Lagrange multipliers for the first and second sets of inequality constraints, respectively. Show that the Lagrangian L(w, b, α, α̂) for the SVR primal problem is:
L(w, b, α, α̂) = (1/2)∥w∥² − Σ_{i=1}^n α_i (ε − y_i + (w^T x_i + b)) − Σ_{i=1}^n α̂_i (ε + y_i − (w^T x_i + b)).   (10)
(c) [5 marks] Use the Lagrangian to derive conditions for the minimum of the SVR primal problem.
(d) [5 marks] Derive the SVR dual problem.
(e) [5 marks] Using the KKT conditions, at the optimum of the SVR dual problem, what do the Lagrange multipliers α_i, α̂_i indicate about the data point (x_i, y_i)?
Problem 4 Kernel logistic regression [25 marks]
Consider the two-class logistic regression problem with a prior on the weight vector w, where x ∈ Rd is the input vector and y ∈ {0,1} is the class label. The training set is D = {X,y}, where X = [x1,··· ,xn] are the input vectors, and y = [y1,··· ,yn]T are the class labels. The conditional probability of the output class is
p(y_i | x_i) = π_i^{y_i} (1 − π_i)^{1−y_i},   (11)

where π_i = σ(w^T x_i) is the conditional probability that x_i belongs to class 1, and σ(a) = 1/(1 + e^{−a}) is the logistic sigmoid function. The prior distribution on w is a zero-mean Gaussian with known precision matrix Γ (i.e., the inverse of the covariance matrix), p(w) = N(w | 0, Γ⁻¹).
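Combining (11) with the Gaussian prior on w, the MAP estimate maximizes the log-posterior, which up to an additive constant is

log p(w|D) = Σ_{i=1}^n [ y_i log π_i + (1 − y_i) log(1 − π_i) ] − (1/2) w^T Γ w.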
The MAP estimate can be obtained using the Newton-Raphson iterations
w^(new) = (X R X^T + Γ)^{−1} X R z,   (12)

where {R, z, π} are calculated from the previous w^(old),

π_i = σ(x_i^T w^(old)),   π = [π_1, · · · , π_n]^T,   (13)
R = diag(π_1(1 − π_1), · · · , π_n(1 − π_n)),   (14)
z = X^T w^(old) − R^{−1}(π − y).   (15)
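A minimal NumPy sketch of iterating (12)-(15) is shown below; the function name, the fixed iteration count, and the clipping of R's diagonal for numerical stability are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_logistic_newton(X, y, Gamma, n_iters=20):
    """Newton-Raphson / IRLS updates (12)-(15) for the MAP estimate of w.

    X is d x n (columns are the inputs x_i), y is a length-n vector of {0, 1}
    labels, and Gamma is the d x d prior precision matrix.
    """
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        pi = sigmoid(X.T @ w)                       # (13): pi_i = sigma(x_i^T w)
        r = np.clip(pi * (1.0 - pi), 1e-12, None)   # diagonal of R in (14)
        z = X.T @ w - (pi - y) / r                  # (15): z = X^T w - R^{-1}(pi - y)
        XR = X * r                                  # scales column i of X by r_i
        w = np.linalg.solve(XR @ X.T + Gamma, XR @ z)   # (12)
    return w
```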
(a) [5 marks] Consider the case when the precision matrix is Γ = λI. How does Γ help to regularize the estimate of w in (12)?
(b) [15 marks] Derive kernel logistic regression, i.e., apply the kernel trick to calculate the probability π∗ = σ(w^T x∗), and to MAP estimation using the Newton-Raphson iterations.
(c) [5 marks] Discuss the role of the prior precision Γ in kernel logistic regression.