CSCI 567 Fall 2018 Practice Final Exam
Problem:  1    2    3    4    5    6    Total
Max:      30   6    17   8    25   14   100
Points:
Please read the following instructions carefully:
• Duration of the exam is 2 hours and 20 minutes. Questions are not ordered by their difficulty. Budget your time on each question carefully.
• Answers should be concise and written down legibly. All questions can be done within 5-12 lines.
• Select one and only one answer for all multiple choice questions.
1 Multiple Choice (30 points)
(a) Suppose we apply the kernel trick with a kernel function k to the nearest neighbor algorithm (with L2 distance in the new feature space). What is the nearest neighbor of a new data point x from a training set S = {x1,…,xN}?
(A) $\arg\min_{x_n \in S} \; k(x_n, x_n) + k(x, x) - 2k(x_n, x)$
(B) $\arg\min_{x_n \in S} \; k(x_n, x)$
(C) $\arg\min_{x_n \in S} \; (k(x_n, x) - k(x, x_n))^2$
(D) $\arg\min_{x_n \in S} \; k(x_n, x_n) + k(x, x) + 2k(x_n, x)$
Ans: A.
(2 points)
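As a quick sanity check of (A), here is a minimal numpy sketch (not from the exam; the RBF kernel and the toy points are illustrative assumptions) that computes the kernelized squared distance k(x_n, x_n) + k(x, x) − 2k(x_n, x):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2); any valid kernel could be used here
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_nearest_neighbor(x, S, k=rbf_kernel):
    # squared L2 distance in the induced feature space:
    # ||phi(x_n) - phi(x)||^2 = k(x_n, x_n) + k(x, x) - 2 k(x_n, x)
    dists = [k(xn, xn) + k(x, x) - 2 * k(xn, x) for xn in S]
    return S[int(np.argmin(dists))]

S = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([5.0, 5.0])]
print(kernel_nearest_neighbor(np.array([1.5, 1.0]), S))  # -> [2. 1.]
```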
(b) Let $X \in \mathbb{R}^{N \times D}$ be a data matrix with each row corresponding to the feature vector of an example and $y \in \mathbb{R}^N$ be the vector of all the outcomes, as used in the class. The least squares solution is $(X^T X)^{-1} X^T y$. Which of the following is the least squares solution if we scale each data point by a factor of 2 (i.e. the new dataset is 2X)?
(A) $2(X^T X)^{-1} X^T y$ (B) $4(X^T X)^{-1} X^T y$ (C) $\frac{1}{2}(X^T X)^{-1} X^T y$ (D) $\frac{1}{4}(X^T X)^{-1} X^T y$
Ans: C.
(2 points)
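A quick numpy check (illustrative only; the random data below are not from the exam):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

w = np.linalg.solve(X.T @ X, X.T @ y)                           # (X^T X)^{-1} X^T y
w_scaled = np.linalg.solve((2 * X).T @ (2 * X), (2 * X).T @ y)

print(np.allclose(w_scaled, 0.5 * w))   # True: scaling X by 2 halves the solution
```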
(c) Which of the following surrogate losses is not an upper bound of the 0-1 loss?
(A) perceptron loss $\max\{0, -z\}$
(B) hinge loss $\max\{0, 1 - z\}$
(C) logistic loss $\log(1 + \exp(-z))$
(D) exponential loss $\exp(-z)$
Ans: A.
(2 points)
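A one-point numerical illustration (not from the exam) of why (A) can fall below the 0-1 loss:

```python
import numpy as np

z = -0.5                        # a misclassified point: z = y * w^T x < 0
zero_one    = float(z <= 0)     # 1.0
perceptron  = max(0.0, -z)      # 0.5 -> falls below the 0-1 loss, so not an upper bound
hinge       = max(0.0, 1 - z)   # 1.5
exponential = np.exp(-z)        # ~1.65
print(perceptron < zero_one)    # True
```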
(d) The following table shows a binary classification training set and the number of times each point is misclassified during a run of the perceptron algorithm. Which of the following is the final output of the algorithm?
x         y    Times misclassified
(-3, 2)   +1   5
(-1, 1)   -1   5
(5, 2)    +1   3
(2, 2)    -1   4
(1, -2)   +1   3
(A) (0, 3) (B) (2, -1) (C) (-2, 1) (D) (0, -3)
Ans: D.
(2 points)
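Since every perceptron mistake on (x, y) adds y·x to the weight vector (initialized at zero), the final weights can be checked with a short script (not from the exam):

```python
import numpy as np

# (x, y, number of times misclassified); each mistake adds y * x to w
data = [((-3, 2), +1, 5), ((-1, 1), -1, 5), ((5, 2), +1, 3), ((2, 2), -1, 4), ((1, -2), +1, 3)]
w = sum(c * y * np.array(x, dtype=float) for x, y, c in data)
print(w)  # [ 0. -3.] -> option (D)
```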
(e) Suppose we obtain a hyperplane w via logistic regression and are going to make a randomized prediction on the label y of a new point x based on the sigmoid model. What is the probability of predicting y = +1?
(A) $e^{-w^T x}$
(B) $\mathbb{I}[w^T x \ge 0]$
(C) $\frac{1}{1 + e^{-w^T x}}$
(D) $\frac{1}{1 + e^{w^T x}}$
Ans: C.
(2 points)
(f) The multiclass perceptron algorithm is shown in Alg 1, and it is essentially minimizing the multiclass perceptron loss via SGD with learning rate 1. Based on this information, which of the following is the multiclass perceptron loss?
(2 points)
Algorithm 1: Multiclass Perceptron
Input: A training set (x_1, y_1), ..., (x_N, y_N)
Initialize: w_1 = ··· = w_C = 0 (or randomly)
while not converged do
    randomly pick an example (x_n, y_n), make prediction ŷ = argmax_{k ∈ [C]} w_k^T x_n
    if ŷ ≠ y_n then
        w_ŷ ← w_ŷ − x_n
        w_{y_n} ← w_{y_n} + x_n
(A) $\sum_{n=1}^N \left( \max_{k \ne y_n} w_k^T x_n - w_{y_n}^T x_n \right)$
(B) $\sum_{n=1}^N \max\left\{ 0,\; \max_{k \ne y_n} w_k^T x_n - w_{y_n}^T x_n \right\}$
(C) $\sum_{n=1}^N \max_{k \in [C]} w_k^T x_n$
(D) $\sum_{n=1}^N \max\left\{ 0,\; \max_{k \in [C]} w_k^T x_n \right\}$
Ans: B.
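A minimal sketch (not from the exam) of one SGD step on the loss in (B) with learning rate 1, matching the update in Algorithm 1; the toy inputs are made up:

```python
import numpy as np

def multiclass_perceptron_step(W, x, y):
    # loss on (x, y): max{0, max_{k != y} w_k^T x - w_y^T x}
    y_hat = int(np.argmax(W @ x))
    if y_hat != y:              # prediction is wrong -> take a (sub)gradient step of size 1
        W[y_hat] -= x
        W[y] += x
    return W

W = np.zeros((3, 2))            # C = 3 classes, D = 2 features
W = multiclass_perceptron_step(W, np.array([1.0, 2.0]), y=1)
print(W)
```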
(g) For a fixed multiclass problem, which of the following multiclass-to-binary reductions has the largest testing time complexity?
(A) One-versus-all (B) One-versus-one (C) Tree reduction (D) Both (A) and (B)
Ans: B.
(2 points)
(h) Which of the following is wrong about kernel functions?
(A) If k1 and k2 are kernel functions, then c1 k1 + c2 k2 is a kernel function too for any c1, c2 ∈ R.
(B) If k1 and k2 are kernel functions, then k1 k2 is a kernel function too.
(C) A kernel function must be symmetric, that is, k(x, x′) = k(x′, x).
(D) If k is a kernel function, then exp(k) is a kernel function too.
Ans: A. (2 points)
(i) Which of the following is wrong about neural nets?
(A) A fully connected feedforward neural net without nonlinear activation functions is the same as a linear model.
(B) Dropping random neurons in each iteration of Backpropagation helps prevent overfitting.
(C) A neural net with one hidden layer and a fixed number of neurons can represent any continuous function.
(D) A max-pooling layer has no parameters to be learned.
Ans: C.
(2 points)
(j) Which of the following about SVM is true?
(A) Support vectors are training points that are misclassified.
(B) Support vectors are training points that are not on the learned hyperplane.
(C) Removing examples that are not support vectors will not affect the final hyperplane.
(D) Only misclassified training points could be support vectors, but not all of them are.
Ans: C.
(2 points)
(k) Recall that the entropy of a distribution p over a set of K items is defined as $-\sum_{k=1}^K p(k) \ln p(k)$. Which of the following distributions has the largest entropy?
(A) (0.25, 0.25, 0.25, 0.25)
(B) (0.2, 0.3, 0.5, 0)
(C) (1, 0, 0, 0)
(D) (0.5, 0.5, 0, 0)
Ans: A. Entropy is maximized when the distribution is uniform.
(2 points)
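A quick numerical check (not from the exam), treating 0 ln 0 as 0:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * ln 0 = 0
    return -np.sum(p * np.log(p))

for p in [(0.25, 0.25, 0.25, 0.25), (0.2, 0.3, 0.5, 0), (1, 0, 0, 0), (0.5, 0.5, 0, 0)]:
    print(p, round(entropy(p), 3))     # the uniform (0.25, ...) gives ln 4 ≈ 1.386, the largest
```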
(l) Which of the following statements is true?
(A) AdaBoost outputs a linear classifier if the base classifiers are linear.
(B) In PCA, the first principal component is the direction where the projection of the dataset has the smallest variance.
(C) Kernel density estimation is a nonparametric method.
(D) The EM algorithm always converges to the MLE, but it can take very long for this to happen.
Ans: C. (2 points)
(m) Which of the following is wrong about PCA?
(A) The first step of PCA is to center the original dataset.
(B) The first principal component is the eigenvector of the covariance matrix with the smallest eigenvalue.
(C) PCA outputs a compressed dataset that is a linear transformation of the original dataset.
(D) Kernel PCA requires computing eigenvalues and eigenvectors of the Gram matrix.
Ans: B. (2 points)
(n) Which of the following has the most adaptive exploration for the multi-armed bandit problem?
(A) A completely random strategy
(B) Greedy strategy
(C) Explore-then-exploit
(D) Upper Confidence Bound (UCB) algorithm
Ans: D.
(2 points)
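For context, a minimal UCB1 sketch (not from the exam; Bernoulli arms and the means below are illustrative assumptions) showing how the exploration bonus adapts to the observed counts:

```python
import numpy as np

def ucb1(means, T=20000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    counts, totals = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                     # pull each arm once first
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)       # confidence width shrinks with counts
            a = int(np.argmax(totals / counts + bonus))   # optimism in the face of uncertainty
        counts[a] += 1
        totals[a] += rng.random() < means[a]              # Bernoulli reward
    return counts

print(ucb1([0.2, 0.5, 0.7]))   # most pulls should concentrate on the 0.7 arm
```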
(o) Which of the following is true about reinforcement learning?
(A) Value iteration converges to the actual value function.
(B) Model-free approaches require more space than model-based approaches.
(C) Q-learning is a model-based approach.
(D) Model-based approaches do not need to deal with the exploitation-exploration trade-off.
Ans: A.
(2 points)
2 Decision Tree (6 points)
Suppose we would like to build a decision tree classifier to predict “whether the professor will read the email”, using the following training dataset where the first 5 columns represent 5 binary features (whether the professor knows the sender, whether the email is too long, and whether it is about certain topics), and the last column is the label.
Known sender?   Is Long?   About research?   About grade?   About lottery?   Read?
no              no         yes               yes            no               no
yes             yes        no                yes            no               no
no              yes        yes               yes            yes              no
yes             yes        yes               yes            no               no
no              yes        no                no             no               no
yes             no         yes               yes            yes              yes
no              no         yes               no             no               yes
yes             no         no                no             no               yes
yes             no         yes               yes            no               yes
yes             yes        yes               yes            yes              no
Below are two examples of decision trees for this task:

[Figure: Tree 1 and Tree 2, two decision trees built from the features "Known sender?", "Is long?", and "About grade?", with leaves "Will read" and "Won't read"; the diagrams are not reproduced here.]
(a) How many training points are correctly classified by Tree 1 and Tree 2 respectively?
(A) 8 and 7 (B) 9 and 8 (C) 8 and 8 (D) 9 and 7
Ans: D
(2 points)
(b) Define the Information Gain of a node n with children Children(n) as
$$\text{Gain}(n) = \text{Entropy}(S_n) - \sum_{m \in \text{Children}(n)} \frac{|S_m|}{|S_n|}\,\text{Entropy}(S_m),$$
where $S_n$ and $S_m$ are the subsets of training examples that belong to the node n and one of its child nodes m respectively. Compute the information gain for the root of Tree 2. Express your answer in terms of "log", that is, you do not need to calculate the value of the logarithm (and therefore the base of the logarithm also does not matter).
The entropy of the root is
$$\text{Entropy}(S) = -\tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{3}{5}\log\tfrac{3}{5}. \quad \text{(1 point)}$$
The entropies of the left child and the right child are respectively
$$\text{Entropy}(S_1) = -\tfrac{2}{7}\log\tfrac{2}{7} - \tfrac{5}{7}\log\tfrac{5}{7} \quad \text{(1 point)}$$
and
$$\text{Entropy}(S_2) = -\tfrac{2}{3}\log\tfrac{2}{3} - \tfrac{1}{3}\log\tfrac{1}{3}. \quad \text{(1 point)}$$
The information gain is therefore
$$\text{Entropy}(S) - \tfrac{7}{10}\,\text{Entropy}(S_1) - \tfrac{3}{10}\,\text{Entropy}(S_2). \quad \text{(1 point)}$$
Note: an incorrect information gain due to mistakes in the previous entropy calculations can still get the 1 point if the formula is applied correctly.
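A numerical check (not from the exam), assuming, consistently with the 7/3 split sizes above, that the root of Tree 2 tests "About grade?":

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

about_grade = np.array(["yes","yes","yes","yes","no","yes","no","no","yes","yes"])  # root split
read        = np.array(["no","no","no","no","no","yes","yes","yes","yes","no"])     # labels

gain = entropy(read) - sum((about_grade == v).mean() * entropy(read[about_grade == v])
                           for v in ("yes", "no"))
print(round(gain, 4))   # ≈ 0.0633 (natural log)
```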
3 Boosting (17 points)
A version of the AdaBoost algorithm is shown below where the base algorithm is simply searching for a classifier with the smallest weighted error from a fixed classifier set H.
Algorithm 2: AdaBoost
1   Given: a training set {(x_n, y_n ∈ {+1, −1})}_{n=1}^N and a set of classifiers H, where each h ∈ H takes a feature vector as input and outputs +1 or −1.
2   Goal: learn H(x) = sign( Σ_{t=1}^T β_t h_t(x) ), where h_t ∈ H, β_t ∈ R, and sign(a) = +1 if a ≥ 0 and −1 otherwise.
3   Initialization: D_1(n) = 1/N, ∀n ∈ [N].
4   for t = 1, 2, ..., T do
5       Find h_t = argmin_{h ∈ H} Σ_{n: y_n ≠ h(x_n)} D_t(n).
6       Compute ε_t = Σ_{n: y_n ≠ h_t(x_n)} D_t(n) and β_t = (1/2) log( (1 − ε_t) / ε_t ).
7       Compute D_{t+1}(n) = D_t(n) exp(−β_t y_n h_t(x_n)) / Z_t for each n ∈ [N], where Z_t = Σ_{m=1}^N D_t(m) exp(−β_t y_m h_t(x_m)) is the normalization factor.

3.1 Executing AdaBoost
Imagine running AdaBoost with a 1-dimensional training set of 8 examples as shown in Fig. 1, where circles mean y = +1 and crosses mean y = −1. The base classifier set H consists of all decision stumps, where each of them is parameterized by a pair (s, b) ∈ {+1, −1} × R such that
$$h_{(s,b)}(x) = \begin{cases} s, & \text{if } x > b, \\ -s, & \text{otherwise.} \end{cases}$$
Figure 1: The 1-dimensional training set with 8 examples. Circles mean y = +1 and crosses mean y = −1. The number under each example is its x coordinate.
(a) Which of the following is a possible parameter for h_1?
(A) (s, b) = (+1, 3.5) (B) (s, b) = (−1, 3.5) (C) (s, b) = (+1, 7.5) (D) (s, b) = (−1, 7.5)
Ans: A.
(2 points)

(b) Suppose we run AdaBoost for two rounds and observe that β_1 and β_2 are both positive but not equal. Is it possible that the final classifier H after these two rounds (see Line 2) has zero training error? Why or why not?

No, it is not possible. (1 point) The example x = 8 is misclassified by h_1, and there are two cases for h_2:
• h_2 also misclassifies this example, and thus H misclassifies it too;
• h_2 correctly classifies this example; then it must misclassify at least one of the other 7 points (call it x′). Suppose H correctly classifies x = 8, which implies β_1 < β_2; but then H misclassifies x′, since the majority of the weight is on h_2.
(As long as the argument is correct, you will get 3 points.)

3.2 Training Error of AdaBoost
In this problem we make the following so-called "weak learning assumption": for every iteration of AdaBoost,
$$\varepsilon_t \le \tfrac{1}{2} - \gamma$$
for some constant $\gamma \in (0, \tfrac{1}{2})$. In other words, the base algorithm is always able to return a base classifier $h_t$ whose weighted error rate is at least $\gamma$ smaller than that of random guessing. Under this assumption, you are going to prove that the training error of AdaBoost decreases exponentially fast, following the two steps below.
(a) Prove that for each t, the normalization factor $Z_t$ satisfies
$$Z_t \le \sqrt{1 - 4\gamma^2}.$$

$$
\begin{aligned}
Z_t &= \sum_{n=1}^N D_t(n) \exp(-\beta_t y_n h_t(x_n)) \\
&= \sum_{n:\, y_n h_t(x_n) = 1} D_t(n) \exp(-\beta_t) + \sum_{n:\, y_n h_t(x_n) = -1} D_t(n) \exp(\beta_t) \\
&= (1 - \varepsilon_t) \exp(-\beta_t) + \varepsilon_t \exp(\beta_t) && \text{(2 points)} \\
&= 2\sqrt{(1 - \varepsilon_t)\,\varepsilon_t} && \text{(2 points)} \\
&\le 2\sqrt{\left(\tfrac{1}{2} + \gamma\right)\left(\tfrac{1}{2} - \gamma\right)} = \sqrt{1 - 4\gamma^2}, && \text{(3 points)}
\end{aligned}
$$
where the inequality holds due to the weak learning assumption and the fact that the function $(1 - \varepsilon)\varepsilon$ is increasing in $[0, 1/2]$ (and you have to mention this reasoning in order to get the 3 points for this step; arriving at this step with incorrect reasons or no explanation at all gets 1 point).
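A numerical spot-check of this bound (not from the exam; γ = 0.1 is an arbitrary choice):

```python
import numpy as np

gamma = 0.1
eps = np.linspace(1e-6, 0.5 - gamma, 1000)        # weak learning: eps_t <= 1/2 - gamma
beta = 0.5 * np.log((1 - eps) / eps)
Z = (1 - eps) * np.exp(-beta) + eps * np.exp(beta)

print(np.allclose(Z, 2 * np.sqrt(eps * (1 - eps))))       # True
print(np.all(Z <= np.sqrt(1 - 4 * gamma**2) + 1e-12))     # True
```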
(b) It can be shown that the training error of AdaBoost is bounded as (you are not asked to prove this)
$$\frac{1}{N}\sum_{n=1}^N \mathbb{I}[H(x_n) \ne y_n] \le \prod_{t=1}^T Z_t.$$
Use this fact and the inequality $1 + z \le e^z$ for all $z \in \mathbb{R}$ to prove the following:
$$\frac{1}{N}\sum_{n=1}^N \mathbb{I}[H(x_n) \ne y_n] \le \exp(-2T\gamma^2),$$
which implies that the training error of AdaBoost decreases in an exponential manner.

Using the inequality $1 + z \le e^z$ with $z = -4\gamma^2$ we have
$$Z_t \le \sqrt{1 - 4\gamma^2} \le \sqrt{\exp(-4\gamma^2)} = \exp(-2\gamma^2), \quad \text{(2 points)}$$
and therefore with the provided fact we have
$$\frac{1}{N}\sum_{n=1}^N \mathbb{I}[H(x_n) \ne y_n] \le \prod_{t=1}^T \exp(-2\gamma^2) = \exp(-2T\gamma^2). \quad \text{(2 points)}$$
4 K-means (8 points)
Consider the following dataset. All points are unlabeled and part of the same set. The triangles are used to distinguish two points later. Please do not draw on this diagram until you have read the problems below.
Suppose we run the K-means algorithm on this dataset with K = 2 and the two points indicated by triangles as the initial centroids. When the algorithm converges, there will be two clearly separated clusters. Directly on Figure 2, draw a straight line that separates these two clusters, as well as the centroids of these two clusters.
Figure 2: An unlabeled dataset (diagram not reproduced; the x-axis runs from 0 to 7)
There are infinitely many choices for the separating line, for example a vertical line at x = 4. (2 points) The two centroids are (2, 1) and (5.5, 1). (2 points)
Next, find two other different sets of initial centroids that will converge to the exact same result if we apply K-means. Please follow the instructions below:
• the initial centroids have to be points of the dataset;
• directly on Figure 3, use two triangles to indicate the first set, and two squares to indicate the second;
• these two sets of points can overlap with each other, but of course cannot be the same;
• similarly these sets can overlap with the initialization of Figure 2 but cannot be the same;
• do not pick those that lead to ambiguous results due to different ways of breaking ties.
Figure 3: The same unlabeled dataset (diagram not reproduced; the x-axis runs from 0 to 7)
There are many choices, for example (with y_b and y_t being the y-coordinates of the bottom and top points): (2.5, y_b) and (5.5, y_t); (2.5, y_t) and (5.5, y_b); (2.5, y_t) and (5.5, y_t); (1.5, y_b) and (6.5, y_b); and so on. Note that answers like (1.5, y_b) and (5.5, y_b) are not acceptable due to the ambiguity from tie-breaking.
Two points for each set.
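A minimal K-means (Lloyd's algorithm) sketch that can be used to check such initializations (not from the exam; the coordinates below are made-up stand-ins, since the figure is not reproduced):

```python
import numpy as np

def kmeans(X, centroids, iters=100):
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid (L2 distance)
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # update step: each centroid moves to the mean of its assigned points
        new = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

X = np.array([[1.0, 0.5], [1.5, 1.5], [2.5, 0.5], [2.8, 1.5],   # a loose left group
              [5.0, 0.5], [5.5, 1.5], [6.0, 1.0]])              # a loose right group
print(kmeans(X, [X[0], X[4]])[0])
```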
5 Generative Models (25 points)

5.1 Naive Bayes
Suppose we have the following training data. Each data point has three features (Weather, Emotion, Homework), where Weather ∈ {Sunny, Cloudy}, Emotion ∈ {Happy, Normal, Unhappy}, Homework ∈ {Much, Little}. The label PlayBasketball indicates whether it is suitable to play basketball. You are asked to build a naive Bayes classifier. Recall the naive Bayes assumption is
P (Weather, Emotion, Homework | PlayBasketball) =
P (Weather | PlayBasketball) × P (Emotion | PlayBasketball) × P (Homework | PlayBasketball).
Weather   Emotion   Homework   PlayBasketball
Sunny     Happy     Little     Yes
Sunny     Normal    Little     Yes
Cloudy    Happy     Much       Yes
Cloudy    Unhappy   Little     Yes
Sunny     Unhappy   Little     No
Cloudy    Normal    Much       No
(a) Write down the MLE for the following parameters (you only need to provide the final number):
• P(PlayBasketball = Yes) =
• P(Weather = Sunny | PlayBasketball = Yes) =
• P(Emotion = Normal | PlayBasketball = No) =
• P(Homework = Much | PlayBasketball = Yes) =

The answers are 2/3, 1/2, 1/2 and 1/4 respectively (one point for each).
(b) Given a new data instance x = (Weather = Sunny, Emotion = Normal, Homework = Much), compute P(PlayBasketball = Yes | x) (show your work).

P(PlayBasketball = Yes | x)
∝ P(PlayBasketball = Yes, x)
= P(PlayBasketball = Yes) × P(Weather = Sunny | PlayBasketball = Yes) × P(Emotion = Normal | PlayBasketball = Yes) × P(Homework = Much | PlayBasketball = Yes)   (1 point)
= 2/3 × 1/2 × 1/4 × 1/4 = 1/48   (1 point)

Similarly
P(PlayBasketball = No | x)
∝ P(PlayBasketball = No) × P(Weather = Sunny | PlayBasketball = No) × P(Emotion = Normal | PlayBasketball = No) × P(Homework = Much | PlayBasketball = No)   (1 point)
= 1/3 × 1/2 × 1/2 × 1/2 = 1/24   (1 point)

Therefore
P(PlayBasketball = Yes | x) = (1/48) / (1/48 + 1/24) = 1/3.   (2 points)
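A quick check of this arithmetic with exact fractions (not from the exam):

```python
from fractions import Fraction as F

# MLE estimates from the 6-example table above
yes = F(2, 3) * F(1, 2) * F(1, 4) * F(1, 4)   # P(Yes) * P(Sunny|Yes) * P(Normal|Yes) * P(Much|Yes)
no  = F(1, 3) * F(1, 2) * F(1, 2) * F(1, 2)   # P(No)  * P(Sunny|No)  * P(Normal|No)  * P(Much|No)
print(yes, no, yes / (yes + no))               # 1/48 1/24 1/3
```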
5.2 Gaussian Mixture Models and EM
Let X ∈ R be a one-dimensional random variable distributed according to a mixture of two Gaussian distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$. In particular,
$$P(X = x) = \sum_{k=1}^2 P(X = x, Z = k) = \sum_{k=1}^2 P(Z = k)\, P(X = x \mid Z = k) = \omega_1 \mathcal{N}(x \mid \mu_1, \sigma_1^2) + \omega_2 \mathcal{N}(x \mid \mu_2, \sigma_2^2),$$
where $\omega_1$ and $\omega_2$ are weights of mixture components such that $\omega_1 \ge 0$, $\omega_2 \ge 0$, and $\omega_1 + \omega_2 = 1$. Recall that the Gaussian density is $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$, where $\mu$ is the mean and $\sigma^2$ is the variance of the Gaussian distribution.
Now suppose x1, x2, ···, xN are i.i.d. samples of this model. Derive the EM algorithm for this model
by following the two steps below.
(a) E-Step: derive the expected (complete) log-likelihood explicitly, assuming
$$\theta^{OLD} = \{\omega_1^{OLD}, \omega_2^{OLD}, \mu_1^{OLD}, \mu_2^{OLD}, \sigma_1^{OLD}, \sigma_2^{OLD}\}$$
are the current estimates of the parameters. Recall the definition of the expected (complete) log-likelihood is the following:
$$Q(\theta; \theta^{OLD}) = \sum_{n=1}^N \sum_{k=1}^2 \gamma_{nk} \ln P(X_n = x_n, Z_n = k; \theta), \tag{1}$$
where
$$\gamma_{nk} = P(Z_n = k \mid X_n = x_n; \theta^{OLD}),$$
and $\theta$ is the set of parameters $\{\omega_1, \omega_2, \mu_1, \mu_2, \sigma_1, \sigma_2\}$. To simplify your answer you can use the density function $\mathcal{N}$ without plugging in its formula.
By Bayes' rule,
$$\gamma_{nk} = \frac{\omega_k^{OLD}\, \mathcal{N}(x_n \mid \mu_k^{OLD}, (\sigma_k^{OLD})^2)}{\omega_1^{OLD}\, \mathcal{N}(x_n \mid \mu_1^{OLD}, (\sigma_1^{OLD})^2) + \omega_2^{OLD}\, \mathcal{N}(x_n \mid \mu_2^{OLD}, (\sigma_2^{OLD})^2)}. \quad \text{(3 points)}$$
So the expected complete log-likelihood $Q(\theta; \theta^{OLD})$ is
$$\sum_{n=1}^N \sum_{k=1}^2 \frac{\omega_k^{OLD}\, \mathcal{N}(x_n \mid \mu_k^{OLD}, (\sigma_k^{OLD})^2)}{\omega_1^{OLD}\, \mathcal{N}(x_n \mid \mu_1^{OLD}, (\sigma_1^{OLD})^2) + \omega_2^{OLD}\, \mathcal{N}(x_n \mid \mu_2^{OLD}, (\sigma_2^{OLD})^2)}\, \ln\!\big(\omega_k\, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)\big). \quad \text{(2 points)}$$
(b) M-step: derive the updates of all the parameters ω1, ω2, μ1, μ2, σ1, σ2 by maximizing Q(θ ; θOLD). You should use the notation γnk from the last question to simplify your answer.
Plugging in the Gaussian density we have
$$Q(\theta; \theta^{OLD}) = \sum_{n=1}^N \sum_{k=1}^2 \gamma_{nk} \left( \ln \omega_k - \frac{1}{2} \ln(2\pi\sigma_k^2) - \frac{(x_n - \mu_k)^2}{2\sigma_k^2} \right). \quad \text{(1 point)}$$
The update for $\omega_k$ is simply
$$\omega_k = \frac{\sum_{n=1}^N \gamma_{nk}}{\sum_{n=1}^N \sum_{k'=1}^2 \gamma_{nk'}} = \frac{\sum_{n=1}^N \gamma_{nk}}{N}. \quad \text{(1 point)}$$
(It is okay to omit the derivation here since we have seen this many times.)
Next, setting the derivative w.r.t. $\mu_k$ gives
$$\sum_{n=1}^N \frac{\gamma_{nk}(x_n - \mu_k)}{\sigma_k^2} = 0, \quad \text{(2 points)}$$
and therefore the update for $\mu_k$ is
$$\mu_k = \frac{\sum_{n=1}^N \gamma_{nk}\, x_n}{\sum_{n=1}^N \gamma_{nk}}. \quad \text{(2 points)}$$
Finally, setting the derivative w.r.t. $\sigma_k$ gives
$$\sum_{n=1}^N \gamma_{nk} \left( -\frac{1}{\sigma_k} + \frac{(x_n - \mu_k)^2}{\sigma_k^3} \right) = 0, \quad \text{(2 points)}$$
and solving for $\sigma_k$ leads to the update
$$\sigma_k^2 = \frac{\sum_{n=1}^N \gamma_{nk}\, (x_n - \mu_k)^2}{\sum_{n=1}^N \gamma_{nk}}. \quad \text{(2 points)}$$
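A compact numpy implementation of these E and M updates (not from the exam; the synthetic data and the min/max initialization are illustrative choices):

```python
import numpy as np

def em_gmm_1d(x, iters=200):
    N = len(x)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])      # simple deterministic initialization
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: responsibilities gamma[n, k] proportional to w_k * N(x_n | mu_k, var_k)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        gamma = w * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates derived above
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return w, mu, np.sqrt(var)             # weights, means, standard deviations

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))   # roughly recovers weights (0.6, 0.4), means (-2, 3), stds (1, 0.5)
```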
6 Hidden Markov Model (14 points)
Recall a hidden Markov model is parameterized by
• initial state distribution P (Z1 = s) = πs
• transition distribution P (Zt+1 = s′ | Zt = s) = as,s′
• emission distribution P (Xt = o | Zt = s) = bs,o
These parameters are assumed to be known for this problem.
6.1 Missing data
Suppose we observe a sequence of outcomes x_1, . . . , x_{t-1}, x_{t+1}, . . . , x_T with the outcome at time t missing (2 ≤ t ≤ T − 1). Derive the conditional probability of the state at time t being some s, that is,
$$P(Z_t = s \mid X_{1:t-1} = x_{1:t-1}, X_{t+1:T} = x_{t+1:T}).$$
Express your answer in terms of the forward message
$$\alpha_{s'}(t-1) = P(Z_{t-1} = s', X_{1:t-1} = x_{1:t-1})$$
and the backward message
$$\beta_{s'}(t) = P(X_{t+1:T} = x_{t+1:T} \mid Z_t = s').$$
$$
\begin{aligned}
P(Z_t = s \mid X_{1:t-1} = x_{1:t-1}, X_{t+1:T} = x_{t+1:T})
&\propto P(Z_t = s, X_{1:t-1} = x_{1:t-1}, X_{t+1:T} = x_{t+1:T}) \\
&= P(X_{t+1:T} = x_{t+1:T} \mid Z_t = s, X_{1:t-1} = x_{1:t-1})\, P(Z_t = s, X_{1:t-1} = x_{1:t-1}) \\
&= P(X_{t+1:T} = x_{t+1:T} \mid Z_t = s) \sum_{s'} P(Z_t = s, Z_{t-1} = s', X_{1:t-1} = x_{1:t-1}) \\
&\qquad \text{(by conditional independence and marginalization)} \\
&= \beta_s(t) \sum_{s'} P(Z_t = s \mid Z_{t-1} = s')\, P(Z_{t-1} = s', X_{1:t-1} = x_{1:t-1}) \\
&= \beta_s(t) \sum_{s'} a_{s',s}\, \alpha_{s'}(t-1).
\end{aligned}
$$
The derivation is worth 4 points (and there are possibly different ways to do this), and getting the final correct answer gets 2 points.
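A small sketch of this computation for a discrete-output HMM (not from the exam; the indexing convention and the handling of the missing entry are my own assumptions):

```python
import numpy as np

def missing_obs_posterior(pi, A, B, obs, t):
    """P(Z_t = s | x_{1:t-1}, x_{t+1:T}) for 1-indexed t with 2 <= t <= T-1.
    pi[s], A[s, s'], B[s, o] as defined above; obs[t-1] (the missing outcome) is ignored."""
    T, S = len(obs), len(pi)
    # forward messages: alpha[:, u] = P(Z_{u+1} = ., x_{1:u+1}) for u = 0, ..., t-2
    alpha = np.zeros((S, t - 1))
    alpha[:, 0] = pi * B[:, obs[0]]
    for u in range(1, t - 1):
        alpha[:, u] = (alpha[:, u - 1] @ A) * B[:, obs[u]]
    # backward message: beta = P(x_{t+1:T} | Z_t = .)
    beta = np.ones(S)
    for u in range(T - 1, t - 1, -1):
        beta = A @ (B[:, obs[u]] * beta)
    unnorm = beta * (A.T @ alpha[:, -1])      # beta_s(t) * sum_{s'} a_{s',s} alpha_{s'}(t-1)
    return unnorm / unnorm.sum()
```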
6.2 Viterbi algorithm
Recall that the Viterbi algorithm starts with $\delta_s(1) = \pi_s b_{s,x_1}$ for each s, and then computes in a forward manner
$$\delta_s(t) = \max_{z_{1:t-1}} P(Z_t = s, Z_{1:t-1} = z_{1:t-1}, X_{1:t} = x_{1:t}) = b_{s,x_t} \max_{s'} a_{s',s}\, \delta_{s'}(t-1)$$
and
$$\Delta_s(t) = \operatorname*{argmax}_{z_{t-1}} \max_{z_{1:t-2}} P(Z_t = s, Z_{1:t-1} = z_{1:t-1}, X_{1:t} = x_{1:t}) = \operatorname*{argmax}_{s'} a_{s',s}\, \delta_{s'}(t-1)$$
for each s and t = 2, . . . , T. Using these quantities, describe
(a) how to find the most likely hidden state path $z_1^*, \ldots, z_T^*$ given the entire observations $x_1, \ldots, x_T$;
(b) for an arbitrary $T_0 < T$, how to find the most likely hidden state path $z_1^*, \ldots, z_{T_0}^*$ given only the first $T_0$ observations $x_1, \ldots, x_{T_0}$.
(a) The last state of the most likely path is
$$z_T^* = \operatorname*{argmax}_s \delta_s(T); \quad \text{(2 points)}$$
the rest of the states $z_{T-1}^*, \ldots, z_1^*$ can be found in a backward manner:
$$z_{t-1}^* = \Delta_{z_t^*}(t) \quad \text{for } t = T, \ldots, 2. \quad \text{(2 points)}$$

(b) Similarly,
$$z_{T_0}^* = \operatorname*{argmax}_s \delta_s(T_0); \quad \text{(2 points)}$$
the rest of the states can be found in the exact same way:
$$z_{t-1}^* = \Delta_{z_t^*}(t) \quad \text{for } t = T_0, \ldots, 2. \quad \text{(2 points)}$$
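A minimal Viterbi sketch matching these recursions (not from the exam; discrete outputs assumed, and part (b) corresponds to simply passing only the first T0 observations):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path; pi[s], A[s, s'], B[s, o] as defined above."""
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))
    Delta = np.zeros((T, S), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[s', s] = delta_{s'}(t-1) * a_{s',s}
        Delta[t] = scores.argmax(axis=0)
        delta[t] = B[:, obs[t]] * scores.max(axis=0)
    path = [int(delta[-1].argmax())]                # z*_T = argmax_s delta_s(T)
    for t in range(T - 1, 0, -1):                   # backtrack: z*_{t-1} = Delta_{z*_t}(t)
        path.append(int(Delta[t, path[-1]]))
    return path[::-1]
```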