ELEC 400M / EECE 571M – Final Exam
26 April 2021
Instructions:
• The exam has three problems.
• The exam is open book.
• Please write your answers to the problems in your own answer booklet and upload it electronically through
the Canvas assignment submission. Please use a common file format, such as PDF, JPEG, or PNG, for the
upload.
• You have 90 minutes to solve the problems. You have an extra 10 minutes to create a version suitable for
upload and to complete the upload.
• For all answers, show all steps.
• Sub-tasks that can be solved independently of previous sub-tasks are identified by "→".
1. (Multilayer) perceptron
We consider $(D+1)$-dimensional feature vectors $\mathbf{x}_n \in \{1\} \times \mathbb{R}^D$ and binary targets $y_n \in \{-1, +1\}$,
$n = 1, 2, \ldots, N$.
(a) We first consider the linear model
$$\hat{y}_n = \mathrm{sign}(\mathbf{w}^T \mathbf{x}_n)$$
with the $D+1$ parameters $\mathbf{w} = [w_0, w_1, \ldots, w_D]^T$.
To train this model, we apply the error measure
$$e_n(\mathbf{w}) = \max\left(0,\, (1 - y_n \mathbf{w}^T \mathbf{x}_n)^3\right).$$
i. → (3 marks) Determine the relationship between the error measure $e_n$ and the decision error
$$e'_n = \begin{cases} 1, & \text{if } y_n \neq \hat{y}_n \\ 0, & \text{if } y_n = \hat{y}_n. \end{cases}$$
ii. → (5 marks) Derive the stochastic gradient descent algorithm for updating $\mathbf{w}$ based on $e_n$. State
the steps of the algorithm (write it as pseudo-code).
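(Side note, for reference only and not part of the required derivation: below is a minimal sketch of what an SGD routine for this error measure could look like. NumPy, the learning rate `eta`, the number of epochs, and the zero initialization of $\mathbf{w}$ are assumptions made purely for illustration.)

```python
import numpy as np

def sgd_cubed_hinge(X, y, eta=0.1, epochs=100):
    """Sketch of SGD for e_n(w) = max(0, (1 - y_n w^T x_n)^3).

    X : (N, D+1) array of feature vectors x_n (first column all ones),
    y : (N,) array of labels in {-1, +1}.
    """
    N, d = X.shape
    w = np.zeros(d)                            # zero initialization (assumption)
    for _ in range(epochs):
        for n in np.random.permutation(N):     # visit samples in random order
            margin = 1.0 - y[n] * (w @ X[n])
            if margin > 0:                     # otherwise e_n = 0 and the gradient is zero
                # d e_n / d w = -3 * margin^2 * y_n * x_n
                grad = -3.0 * margin**2 * y[n] * X[n]
                w = w - eta * grad             # stochastic gradient descent step
    return w
```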
(b) We now assume $D = 2$ and $N = 4$, i.e., we have four data samples. The four samples are given by
$$(\mathbf{x}_1 = [1, 0, 0]^T,\ y_1 = -1), \quad (\mathbf{x}_2 = [1, 1, 0]^T,\ y_2 = +1),$$
$$(\mathbf{x}_3 = [1, 0, 1]^T,\ y_3 = +1), \quad (\mathbf{x}_4 = [1, 1, 1]^T,\ y_4 = -1).$$
i. → (3 marks) Denoting $\mathbf{x}_n = [1, x_{n1}, x_{n2}]^T$, sketch the four data points in the $(x_{n1}, x_{n2})$-coordinate
system. Identify data points with different labels using different markers.
Draw decision boundaries that consist of straight lines and identify the corresponding decision
regions such that a classifier would correctly decide on those four data points.
ii. (5 marks) Sketch a two-layer perceptron that would realize the decision boundaries as drawn in
the previous sub-problem and thus would correctly classify those four data points. Label the
necessary details and briefly explain how it implements the decision boundaries.
(You do not need to determine the numerical values of the weights that determine the slope and
intercept of the straight lines.)
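(Side note, for orientation only: the sketch below illustrates the structure such a two-layer perceptron could have, namely two hidden sign units, one per straight-line boundary, feeding a single output sign unit. The numerical weights in the code are hypothetical placeholders, chosen only so that the example runs; they are not the requested answer.)

```python
import numpy as np

def sign(v):
    # sign(0) is taken as +1 here; an assumption for this sketch
    return np.where(v >= 0, 1, -1)

def two_layer_perceptron(x1, x2):
    """Two hidden sign units (one per straight-line boundary) feed an
    output sign unit.  All numerical weights are illustrative placeholders."""
    h1 = sign(x1 + x2 - 0.5)   # hidden unit 1: boundary x1 + x2 = 0.5
    h2 = sign(1.5 - x1 - x2)   # hidden unit 2: boundary x1 + x2 = 1.5
    return sign(h1 + h2 - 1)   # output unit: +1 only between the two lines

# the four data points from sub-problem (b)
for (x1, x2), target in [((0, 0), -1), ((1, 0), +1), ((0, 1), +1), ((1, 1), -1)]:
    print((x1, x2), int(two_layer_perceptron(x1, x2)), target)
```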
2. Validation
We partition a data set $\mathcal{D}$ into a training data set $\mathcal{D}_{\text{train}}$ and a validation data set $\mathcal{D}_{\text{val}}$ with $\mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}} = \mathcal{D}$
and $\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{val}} = \emptyset$. We denote the model obtained by training on $\mathcal{D}_{\text{train}}$ as $g^-$ and the corresponding
validation error as $E_{\text{val}}(g^-)$.
(a) We first use $E_{\text{val}}(g^-)$ to estimate the out-of-sample error $E_{\text{out}}(g^-)$.
i. → (3 marks) Is $E_{\text{val}}(g^-)$ an unbiased estimate of $E_{\text{out}}(g^-)$? Show why or why not.
ii. → (2 marks) State a bound for $E_{\text{out}}(g^-)$ in terms of $E_{\text{val}}(g^-)$ and explain the variables in this bound.
(b) We now use the validation error to decide on hyperparameters (say $m \in \{1, \ldots, M\}$) of the model.
We denote the model with the selected hyperparameter as $g^-_{m^*}$.
i. → (3 marks) Is $E_{\text{val}}(g^-_{m^*})$ an unbiased estimate of $E_{\text{out}}(g^-_{m^*})$? Why or why not? If not, does $E_{\text{val}}(g^-_{m^*})$
tend to overestimate or underestimate the out-of-sample error?
ii. → (2 marks) Could a test data set be used to determine the model hyperparameters, i.e., $m^*$? Why
or why not?
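(Side note, for orientation only: a minimal sketch of the hold-out procedure referred to in this problem. The function names `train` and `error` are hypothetical placeholders for the learning algorithm and the error measure; they are assumptions, not part of the problem.)

```python
# Sketch of hold-out validation for hyperparameter selection.
# `train(D, m)` and `error(g, D)` are hypothetical placeholders for the
# learning algorithm and the error measure used in the course.

def select_hyperparameter(D_train, D_val, candidates, train, error):
    """Train one model per hyperparameter m on D_train, evaluate each on
    D_val, and return the candidate with the smallest validation error."""
    best_m, best_g, best_e = None, None, float("inf")
    for m in candidates:                 # m in {1, ..., M}
        g_minus = train(D_train, m)      # model g^-_m trained on D_train only
        e_val = error(g_minus, D_val)    # validation error E_val(g^-_m)
        if e_val < best_e:
            best_m, best_g, best_e = m, g_minus, e_val
    return best_m, best_g, best_e        # best_e is E_val(g^-_{m*})
```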
3. Expectation-maximization method
We assume that observed samples $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ of $D$-dimensional binary data $\mathbf{x}_n = [x_{n1}, \ldots, x_{nD}] \in \{0,1\}^D$
have been generated independently from the $K$-mixture distribution with probability mass function (pmf)
$$P(\mathbf{x}_n|\mathbf{w}, \boldsymbol{\mu}) = \sum_{j=1}^{K} w_j P_j(\mathbf{x}_n|\boldsymbol{\mu}_j),$$
which has $K$ component distributions
$$P_j(\mathbf{x}_n|\boldsymbol{\mu}_j) = \prod_{i=1}^{D} \mu_{ji}^{x_{ni}} (1 - \mu_{ji})^{1 - x_{ni}}$$
and parameter vectors $\mathbf{w} = [w_1, \ldots, w_K]$ and $\boldsymbol{\mu} = [\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K]$.
$w_j > 0$ is the weight for the $j$th component distribution and $\sum_{j=1}^{K} w_j = 1$.
Furthermore, $\boldsymbol{\mu}_j = [\mu_{j1}, \ldots, \mu_{jD}]$ with $\mu_{ji} \in [0, 1]$.
The interpretation of the mixture distribution is that $\mathbf{x}_n$ is drawn from the $j$th component distribution with
probability $w_j$.
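(Side note, for orientation only: the sketch below shows how this mixture pmf could be evaluated numerically. The example values of `w`, `mu`, and `x_n` are arbitrary assumptions used only for illustration.)

```python
import numpy as np

def component_pmf(x_n, mu_j):
    """P_j(x_n | mu_j) = prod_i mu_ji^{x_ni} * (1 - mu_ji)^{1 - x_ni}."""
    return np.prod(mu_j**x_n * (1.0 - mu_j)**(1 - x_n))

def mixture_pmf(x_n, w, mu):
    """P(x_n | w, mu) = sum_j w_j * P_j(x_n | mu_j)."""
    return sum(w_j * component_pmf(x_n, mu_j) for w_j, mu_j in zip(w, mu))

# arbitrary example with K = 2 components and D = 3 (assumed values)
w = np.array([0.4, 0.6])
mu = np.array([[0.9, 0.1, 0.5],
               [0.2, 0.8, 0.5]])
x_n = np.array([1, 0, 1])
print(mixture_pmf(x_n, w, mu))
```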
(a) Let us assume that $K = 1$.
i. → (1 mark) State the pmf $P(\mathbf{x}|\boldsymbol{\mu})$.
ii. (4 marks) Using $P(\mathbf{x}|\boldsymbol{\mu})$, determine the maximum likelihood estimate for $\boldsymbol{\mu}$ as a function of $\mathbf{x}$.
(It is likely useful to consider the logarithm of probabilities.)
(b) Now we consider a general value for $K$ and want to develop the expectation-maximization (EM)
algorithm to estimate the parameters of the pmf.
For this, we introduce the vector $\mathbf{j} = [j_1, \ldots, j_N]$ of unobserved data, where $j_n = k$ if $\mathbf{x}_n$ is drawn
from the $k$th component distribution.
i. → (4 marks) Determine the "complete log-likelihood" $\log\left(P(\mathbf{x}, \mathbf{j}|\mathbf{w}, \boldsymbol{\mu})\right)$ of the "complete data" $(\mathbf{x}, \mathbf{j})$.
ii. → (4 marks) Determine the probability
$$\gamma_{nk} = P(j_n = k|\mathbf{x}_n, \hat{\mathbf{w}}, \hat{\boldsymbol{\mu}})$$
for given parameter estimates $\hat{\mathbf{w}}$ and $\hat{\boldsymbol{\mu}}$.
iii. → (4 marks) Following the steps for the EM algorithm as presented in the course, we can express
the updated parameter estimates as
$$\boldsymbol{\mu}_k^* = \arg\max_{\boldsymbol{\mu}_k} \sum_{n=1}^{N} \gamma_{nk} \log\left(P_k(\mathbf{x}_n|\boldsymbol{\mu}_k)\right), \qquad w_k^* = \frac{\sum_{n=1}^{N} \gamma_{nk}}{N}.$$
Solve the optimization for $\boldsymbol{\mu}_k^*$ and express the solution in terms of the observed data $\mathbf{x}$ and the
probabilities $\gamma_{nk}$.
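(Side note, for orientation only and not a substitute for the requested derivation: the sketch below shows how the quantities $\gamma_{nk}$, $w_k^*$, and $\boldsymbol{\mu}_k^*$ could fit together in one EM iteration. The initialization, the fixed number of iterations, and the clipping of $\boldsymbol{\mu}$ are assumptions made for illustration.)

```python
import numpy as np

def em_bernoulli_mixture(X, K, iters=50, seed=0):
    """Sketch of EM for the K-component Bernoulli mixture.

    X : (N, D) binary data matrix; returns weight and mean estimates
    together with the final responsibilities gamma."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)                      # uniform initial weights (assumption)
    mu = rng.uniform(0.25, 0.75, size=(K, D))    # random initial means (assumption)
    for _ in range(iters):
        # E-step: responsibilities gamma_{nk} = P(j_n = k | x_n, w, mu)
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T   # (N, K)
        joint = w * np.exp(log_p)
        gamma = joint / joint.sum(axis=1, keepdims=True)
        # M-step: update the weights and the component means
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        mu = np.clip(mu, 1e-6, 1 - 1e-6)         # avoid log(0) in the next E-step
    return w, mu, gamma
```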
(c) → (2 marks) Describe how the EM algorithm can be used to categorize the samples $\mathbf{x}_n$ according to the
component distribution they have been drawn from.