Midterm 2 A, CSC311 Fall 2020 (27 marks total)
1. [6] To get full marks, in each case provide a brief explanation.
Grading Notes. 1 mark for correct true/false, 1 mark for explanation.
(a) We apply PCA to reduce the number of features of a linearly separable, binary classification
dataset. True or False: The resulting dataset is necessarily linearly separable.
Answer. False. The data may be linearly separable only along a direction of low variance, which PCA discards when reducing the dimensionality.
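For intuition, here is a small numpy/scikit-learn sketch (not part of the exam) of such a counterexample: the classes are separated only along a low-variance axis, and projecting onto the first principal component destroys the separation. All names and numbers are illustrative.

```python
# Hedged sketch: a linearly separable 2-D dataset whose class separation lies
# along the low-variance axis, so PCA to 1 dimension loses separability.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(scale=10.0, size=n)       # high-variance direction, same for both classes
labels = rng.integers(0, 2, size=n)       # binary labels
x2 = np.where(labels == 1, 1.0, -1.0)     # classes separated only along this low-variance axis
X = np.column_stack([x1, x2])             # linearly separable by the rule sign(x2)

Z = PCA(n_components=1).fit_transform(X)  # first PC aligns with the high-variance x1 axis
# The 1-D projections of the two classes now overlap heavily: no longer linearly separable.
print(Z[labels == 1].min(), Z[labels == 1].max())
print(Z[labels == 0].min(), Z[labels == 0].max())
```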
(b) True or False: It is possible for a nonlinear autoencoder to achieve zero loss on a finite training
set (i.e., perfectly reconstruct all x in the training set) even if latent dimension K < D.
Answer. True. There is a non-linear manifold (even in 1 dimension) that connects any finite number of points in R^D. Even if we consider the entire data generating distribution, it might live on a K-dimensional manifold of R^D. Alternatively, the latent space R^K has the same cardinality as the feature space R^D, so we could find a bijection between them.
Grading Notes. Should be flexible with the explanation here: anything along the lines of a space-filling curve or the data living on a low-dimensional manifold should be accepted. We should also accept: "False. The data distribution might not lie on a lower-dimensional manifold, so there is no differentiable map from R^D to R^K that represents the entire data distribution." There are two key parts here: (1) this answer needs to assume the entire data distribution rather than a finite dataset, and (2) it needs to assume the encoder/decoder are differentiable maps.
(c) Suppose that we are given a training set {x^(1), x^(2)} of exactly 2 points, and we apply Expectation Maximization (EM) to fit a learned Gaussian mixture model with K = 2 components. True or False: the two Gaussian centers of our model will always converge to the two training points, i.e., at convergence we will have µ̂1 = x^(1) and µ̂2 = x^(2), or (equivalently) µ̂2 = x^(1) and µ̂1 = x^(2).
Answer. False. EM can get stuck at a local optimum with µ̂1 = µ̂2 = (x^(1) + x^(2))/2 and π1 = π2 = 0.5.
Grading Notes. We can be somewhat flexible here if the student argues something about numerical precision or that it will converge for most initializations. The student has to demonstrate an understanding that there might be some counterexamples.
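As a concrete illustration (not part of the official solution), here is a minimal numpy sketch of EM for a 1-D mixture of two unit-variance Gaussians fit to two points with a symmetric initialization; the means stay stuck at the midpoint rather than converging to the two training points.

```python
# Minimal EM sketch: 2-component, 1-D Gaussian mixture (fixed unit variances),
# fit to two training points with both means initialized at their midpoint.
import numpy as np

x = np.array([0.0, 4.0])        # the two training points x^(1), x^(2)
mu = np.array([2.0, 2.0])       # symmetric init: both means at the midpoint
sigma2 = np.array([1.0, 1.0])   # fixed unit variances for simplicity
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities r[n, k] proportional to pi_k * N(x^(n) | mu_k, sigma2_k)
    log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2 - 0.5 * np.log(2 * np.pi * sigma2)
    r = pi * np.exp(log_lik)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update mixing proportions and means
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk

print(mu, pi)   # both means remain at 2.0 and pi stays at [0.5, 0.5]
```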
2. [3] Consider two algorithms for binary classification: (A) logistic regression and (B) a multi-layer neural network with non-linear activations. If you had to boost one algorithm and bag the other, which would you choose? Explain why briefly.
Answer. We should bag the neural network (low bias but higher variance) and boost the logistic
regression (possibly high bias and lower variance), since boosting lets us reduce bias, whereas bagging
lets us reduce variance.
Grading Notes. 1 point for each correct allocation, and 1 for the correct explanation. We should also accept explanations based on compute (typically boosted ensembles have more members, so we should choose the faster-to-compute classifier for boosting).
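As a rough illustration (not from the exam), one way to realize this allocation with scikit-learn; the estimator classes are real, but the keyword name is an assumption about the installed version (`estimator=` in recent releases, `base_estimator=` in older ones).

```python
# Hedged sketch: bag the high-variance neural network, boost the high-bias logistic regression.
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Bagging averages many neural nets trained on bootstrap samples (variance reduction).
bagged_nn = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
    n_estimators=10,
)

# Boosting fits a sequence of reweighted logistic regressions (bias reduction).
boosted_lr = AdaBoostClassifier(
    estimator=LogisticRegression(),
    n_estimators=50,
)
```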
3. [6] Consider a data generating distribution with the pdf:
p(x|θ) = θ exp(−θx),
parameterized by θ > 0. Suppose we observe a dataset D = {x^(1), x^(2), . . . , x^(N)} with N observations x^(i) > 0 generated by this distribution.
(a) [4] Using a prior p(θ) = exp(−θ), derive the Maximum A Posteriori (MAP) estimator of θ given D.
Answer. The log-posterior is log p(θ|D) = N log θ − θ Σ_i x^(i) − θ + const. Setting its derivative to zero gives 0 = N/θ − Σ_i x^(i) − 1, so θ̂_MAP = N/(Σ_i x^(i) + 1).
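As a sanity check (not part of the official solution), a short numpy/scipy sketch that numerically maximizes the log-posterior on synthetic data and compares it with the closed form; the data and settings are made up for illustration.

```python
# Hedged check: numerically maximize N log(theta) - theta * sum(x) - theta and
# compare with the closed-form MAP estimate N / (sum(x) + 1).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
true_theta = 2.0
x = rng.exponential(scale=1.0 / true_theta, size=100)   # x^(i) ~ Exponential(theta)

def neg_log_posterior(theta):
    return -(len(x) * np.log(theta) - theta * x.sum() - theta)

numeric = minimize_scalar(neg_log_posterior, bounds=(1e-6, 50.0), method="bounded").x
closed_form = len(x) / (x.sum() + 1.0)
print(numeric, closed_form)   # the two estimates agree closely
```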
(b) [2] The conditional distribution p(x|θ) in this question is known as an exponential distribution. True or False: the posterior p(θ|x) is an exponential distribution. Explain why or why not.
Answer. False. We have p(θ|D) ∝ p(D|θ)p(θ) = θ^N exp(−θ(Σ_i x^(i) + 1)), which is not of the same form as an exponential distribution because of the θ^N factor (it is in fact a Gamma distribution).
Grading Notes. For (a), 3 marks are given for correct intermediate steps; full marks are given for a correct final answer. In (b), 1 mark is given for the correct answer and 1 mark for the correct explanation.
4. [2] For each 2D dataset below, you are asked to identify the direction of the first principal component. You may specify the direction by the closest clock number (e.g., 12 is "Up" and 3 is "Right").
(a) What is the first principal component for the dataset in (A)?
Answer. 11 or 5
(b) What is the first principal component for the dataset in (B)?
Answer. 3 or 9
Grading Notes. 1 point for each correct answer
5. [4] I apply PCA to reduce the number of features of a 3-dimensional dataset. Instead of using a computer, I do all of my calculations by hand and arrive at a covariance matrix Σ̂ with the following spectral decomposition:
$$
\hat{\Sigma} =
\overbrace{\begin{pmatrix} 0 & \tfrac{\sqrt{2}}{2} & -\tfrac{\sqrt{2}}{2} \\ 0.7 & 0 & 0 \\ 0.3 & -\tfrac{\sqrt{2}}{2} & \tfrac{\sqrt{2}}{2} \end{pmatrix}}^{Q}
\;
\overbrace{\begin{pmatrix} 2 & 0 & 0.5 \\ 0 & 0.5 & 0 \\ 0.5 & 0 & -1 \end{pmatrix}}^{D}
\;
\overbrace{\begin{pmatrix} 0 & \tfrac{\sqrt{2}}{2} & -\tfrac{\sqrt{2}}{2} \\ 0.7 & 0 & 0 \\ 0.3 & -\tfrac{\sqrt{2}}{2} & \tfrac{\sqrt{2}}{2} \end{pmatrix}}^{Q^{\top}}
$$
It seems I made a few mistakes. For example, the first column of the first matrix (Q) should be a unit vector, but it is not, since ‖(0, 0.7, 0.3)^T‖ ≠ 1.
Identify FOUR OTHER things that are wrong about my result. Be specific, as in our example.
Answer.
(a) The 2nd and 3rd columns of Q are collinear rather than orthogonal.
(b) The third matrix is written identical to Q; it should be the transpose Q^T.
(c) D should be a diagonal matrix, but it has nonzero off-diagonal entries.
(d) The 3rd eigenvalue in D should be non-negative, since variance is non-negative (a covariance matrix is positive semi-definite).
Grading Notes. 1 mark for each correct answer
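For reference (not part of the exam), a small numpy sketch of what a valid spectral decomposition of a covariance matrix looks like; the data here is synthetic and just for illustration.

```python
# Hedged sketch: properties a correct spectral decomposition must satisfy,
# checked numerically with numpy.linalg.eigh on a sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # a hypothetical 3-dimensional dataset
Sigma = np.cov(X, rowvar=False)          # sample covariance matrix

eigvals, Q = np.linalg.eigh(Sigma)       # Sigma = Q D Q^T
D = np.diag(eigvals)

assert np.allclose(Q @ D @ Q.T, Sigma)       # the decomposition reconstructs Sigma
assert np.allclose(Q.T @ Q, np.eye(3))       # Q is orthogonal (orthonormal columns)
assert np.all(eigvals >= -1e-12)             # eigenvalues (variances) are non-negative
assert np.allclose(D, np.diag(np.diag(D)))   # D is diagonal
```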
6. [6] We are faced with a univariate, supervised regression task. The data is generated according to the following distribution p_sample:
x ∼ U(−2π, 2π) (uniform distribution over [−2π, 2π])
t ∼ N (0, 1) (standard normal distribution)
Note: the targets are generated independently of the inputs. We decide to fit the data using a small
hypothesis space of size 2, H = {y = sin(x), y = cos(x)}, and we pick between the 2 hypotheses by
minimizing squared error on a training set.
Because of the way the data are generated, a training algorithm that minimizes the squared error will
randomly choose between the 2 hypotheses. In other words, for the purposes of this question, you may
assume that our training algorithm chooses sin(x) with probability 0.5 and cos(x) with probability 0.5.
(a) [4] Let's first consider the Bayes error, bias, variance, and expected generalization error at x = π
for a single classifier. You are not required to show your work, but doing so may help you obtain
partial credit.
i. What is the Bayes error at x = π?
ii. What is the bias, (E[y] − y*)^2, at x = π?
iii. What is the variance at x = π?
iv. What is the overall expected generalization error, E[(y − t)^2], at x = π?
Answer. Bayes error = 1, Bias = 0.25, Variance = 0.25, Generalization error = 1.5
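For reference (this working is not part of the official solution), the numbers follow from y* = E[t | x = π] = 0 and the two equally likely predictions sin(π) = 0 and cos(π) = −1:

$$
\begin{aligned}
\text{Bayes error} &= \operatorname{Var}(t) = 1,\\
\mathbb{E}[y] &= \tfrac{1}{2}\sin(\pi) + \tfrac{1}{2}\cos(\pi) = -\tfrac{1}{2},
\qquad \text{bias} = \big(\mathbb{E}[y] - y_\star\big)^2 = \tfrac{1}{4},\\
\operatorname{Var}(y) &= \tfrac{1}{2}\big(0 + \tfrac{1}{2}\big)^2 + \tfrac{1}{2}\big({-1} + \tfrac{1}{2}\big)^2 = \tfrac{1}{4},\\
\mathbb{E}[(y - t)^2] &= 1 + \tfrac{1}{4} + \tfrac{1}{4} = 1.5.
\end{aligned}
$$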
(b) [2] Now suppose we train an ensemble of 5 models: each model is chosen from the hypothesis
space of part (a), but trained using a different independently sampled training set. Our ensemble
prediction is the average prediction over the 5 models. Which of the values in part (a) (i)-(iv)
will change, if any, and what are their new values?
Answer. Variance = 0.05, generalization error = 1.3.
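Again for reference (not part of the official solution): averaging 5 independently trained models leaves the Bayes error and bias unchanged and divides the variance by 5:

$$
\operatorname{Var}(\bar{y}) = \frac{0.25}{5} = 0.05,
\qquad
\mathbb{E}[(\bar{y} - t)^2] = 1 + 0.25 + 0.05 = 1.3.
$$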
Grading Notes. 1 mark for each correct number (mistakes carried through, so the same mistake is deducted only once). Partial marks for correct work shown but mistaken numbers.