Midterm 2 B CSC311 Fall 2020


[24 pts] Midterm 2 B

1. [6 pts] To get full marks, provide a brief explanation in each case.

Grading Notes. 1 mark for the correct true/false answer, 1 mark for the explanation.

(a) True or False: Gaussian discriminant analysis is a strictly more expressive binary linear classi-
fication model than linear regression.

Answer. True: GDA models can represent both linear and quadratic decision boundaries,
whereas linear regression can only represent linear decision boundaries.

(b) True or False: Linear regression is a strictly more expressive binary linear classification model
than a restricted variant of Gaussian discriminant analysis that uses the identity as the covariance
for all classes.

Answer. False: GDA models with identity covariance for all classes can represent every linear
decision boundary, so linear regression is not strictly more expressive.

(c) Suppose we are given a training set {x(1),x(2), . . . ,x(N)} of N points that were generated by
a Gaussian mixture model with K = 10 components, and we apply Expectation Maximization
(EM) to fit a learned Gaussian mixture model with K = 10 components. True or False: EM
will exactly recover the true Gaussian mixture components.

Answer. False: EM may converge to a local optimum of the likelihood. Even if it found the
global optimum, the finite dataset does not perfectly represent the underlying distribution, so
the fitted components will not exactly match the true ones.
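As an illustration (not part of the exam), a minimal pure-Python EM sketch on a toy 1D two-component mixture (an arbitrary stand-in for the K = 10 setting) shows the fitted means landing near, but not exactly on, the true means:

```python
import math
import random

random.seed(0)

# Toy finite sample from a known 1D mixture of two unit-variance
# Gaussians with means -2 and +2, equal mixing weights.
data = [random.gauss(-2.0, 1.0) for _ in range(100)] + \
       [random.gauss(+2.0, 1.0) for _ in range(100)]

def em_1d(xs, mu0, mu1, iters=50):
    """EM for a 2-component 1D mixture with unit variances, equal weights."""
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - mu0) ** 2)
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            r.append(p1 / (p0 + p1))
        # M-step: responsibility-weighted means.
        mu0 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(ri for ri in r)
    return mu0, mu1

mu0, mu1 = em_1d(data, -1.0, 1.0)
# Close to (-2, +2), but not exact: the finite sample does not
# perfectly represent the true distribution.
print(mu0, mu1)
```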

2. [4 pts] We are running AdaBoost on a new binary classification dataset, which has exactly 9 samples.
Consider the “trivial” classifier, whose hypothesis space is {+1,−1} (i.e., it predicts either all positive,
or all negative on the entire input space).

(a) [1 pts] Argue that the trivial classifier is guaranteed to do better than chance on our training
set, regardless of the labels.

Answer. There are an odd number of samples, so if it predicts the majority class, it will get at
least 5 right and do better than chance.
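The counting argument can be sketched in a few lines of Python (the label values below are arbitrary):

```python
from collections import Counter

def trivial_best_accuracy(labels):
    """Training accuracy of the better of the two constant classifiers."""
    counts = Counter(labels)
    return max(counts[+1], counts[-1]) / len(labels)

# With 9 samples there can be no tie between the classes, so the majority
# constant classifier always gets at least 5 of 9 correct.
labels = [+1, +1, -1, +1, -1, -1, -1, -1, +1]  # arbitrary labels
assert trivial_best_accuracy(labels) > 0.5
```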

(b) [3 pts] Recall that AdaBoost is guaranteed to achieve zero training error, if at every iteration
the weak learner, which is trained to minimize the weighted error at that iteration, achieves a
weighted error slightly less than 50%. Does it follow that AdaBoost with the trivial classifier will
successfully fit our training set? Why or why not?

Answer. No: a weighted combination of trivial classifiers is itself a trivial classifier, so the
boosted ensemble can never fit a training set whose labels are not all identical. This does not
contradict the guarantee that AdaBoost with a weak learner fits the training set: even though
the trivial classifier is guaranteed to do better than 50% on the initial (uniformly weighted)
training set, there exists a weighted training set on which it achieves exactly 50% (indeed, this
occurs on the very next iteration, as follows from Homework 3 Q1(a)).
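The reweighting step can be verified numerically: after one AdaBoost weight update, the trivial classifier's weighted error is exactly 1/2. The labels below are hypothetical:

```python
import math

# Hypothetical labels for the 9 training points: 6 positive, 3 negative.
y = [+1] * 6 + [-1] * 3
h = [+1] * 9            # trivial classifier: predict +1 everywhere
w = [1 / 9] * 9         # initial uniform weights

eps = sum(wi for wi, yi, hi in zip(w, y, h) if yi != hi)   # 1/3 < 1/2
alpha = 0.5 * math.log((1 - eps) / eps)

# AdaBoost reweighting: up-weight mistakes, down-weight correct points.
w = [wi * math.exp(-alpha * yi * hi) for wi, yi, hi in zip(w, y, h)]
Z = sum(w)
w = [wi / Z for wi in w]

# Under the new weights, the same trivial classifier's weighted error
# is exactly 1/2 -- no better than chance.
new_eps = sum(wi for wi, yi, hi in zip(w, y, h) if yi != hi)
```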

Grading Notes. In part (b) 2 marks are allocated to a valid explanation. A valid explanation should
mention that the trivial classifier doesn’t always do better than chance because of the weighted training
set.

3. [4 pts] We are working on a regression task and deciding between a single linear regression model
(“Single Model”) and an ensemble of 5 linear regression models (“Ensemble”). We train both the
Single Model and each member of the Ensemble on an independently sampled training set of the same
size.


(a) [1 pts] Is the hypothesis space of the Single Model or the Ensemble more expressive?

Answer. Neither: an average of linear functions is itself a linear function, so the two hypothesis spaces are identical.

(b) [1 pts] Between the Single Model and the Ensemble, which one has larger bias (in the sense of
bias-variance), or do they have the same bias?

Answer. They have the same bias.

(c) [2 pts] Suppose the Single Model has an expected generalization error of 2.5 and a variance of
1. Can you say anything about the expected generalization error of the Ensemble Model?

Answer. Yes. Averaging 5 models trained on independent training sets reduces the variance by
a factor of 5 (from 1 to 0.2) while leaving the bias and irreducible error (2.5 − 1 = 1.5) unchanged,
so the Ensemble's expected generalization error is 1.5 + 0.2 = 1.7.
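The arithmetic, assuming (as the question implies) that the 5 independently trained members divide the variance by exactly 5:

```python
# Decomposition: expected error = bias^2 + variance + irreducible noise.
single_error = 2.5
single_variance = 1.0
bias_sq_plus_noise = single_error - single_variance   # 1.5, unchanged by averaging

k = 5
ensemble_variance = single_variance / k               # independent training sets
ensemble_error = bias_sq_plus_noise + ensemble_variance
print(ensemble_error)  # 1.7
```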

Grading Notes. 2 pts in (c) for giving the true expected generalization error of 1.7. 1pt in (c) for
just saying that it improves because variance falls.

4. [6 pts] Consider a data set of N pairs {(x(1), y(1)), . . . , (x(N), y(N))} where x(i) ∈ {0, 1} and y(i) ∈ R.
Consider the following generative model with parameters µ, θ−, θ+ for this data:

p(y|µ, θ) = (1/√(2π)) exp(−(y − µ)²/2)

p(x|y, µ, θ) = θ_sign(y)^x (1 − θ_sign(y))^(1−x)

In other words, y has a 1D Gaussian distribution N(µ, 1) with mean µ and unit variance, and x has a
Bernoulli distribution with parameter p(x = 1|y, µ, θ) = θ_sign(y) that switches depending on the sign of y.

(a) Derive the Maximum Likelihood Estimator (MLE) of µ, θ−, θ+.

Answer. The MLE of µ is the mean of the y(i). The MLE of θ− is the mean of the x(i) paired
with y(i) < 0, and the MLE of θ+ is the mean of the x(i) paired with y(i) > 0.
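A sketch of these estimators on a small hypothetical dataset:

```python
# Hypothetical (x, y) pairs with x in {0, 1} and real-valued y.
data = [(1, -0.5), (0, -1.2), (1, 0.7), (1, 2.1), (0, 0.3), (0, -2.0)]

ys = [y for _, y in data]
mu_hat = sum(ys) / len(ys)               # MLE of mu: sample mean of the ys

xs_neg = [x for x, y in data if y < 0]   # xs paired with y < 0
xs_pos = [x for x, y in data if y > 0]   # xs paired with y > 0
theta_minus_hat = sum(xs_neg) / len(xs_neg)
theta_plus_hat = sum(xs_pos) / len(xs_pos)
print(mu_hat, theta_minus_hat, theta_plus_hat)
```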

(b) Write a concise expression for p(x = 1|µ, θ) and p(x = 0|µ, θ), using only θ−, θ+, and
w_µ = ∫_{−∞}^{0} (1/√(2π)) exp(−(y − µ)²/2) dy.

Answer. p(x = 1|µ, θ) = w_µ θ− + (1 − w_µ) θ+
p(x = 0|µ, θ) = w_µ (1 − θ−) + (1 − w_µ)(1 − θ+)
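These expressions can be sanity-checked numerically; w_µ is a Gaussian CDF value, computable with math.erf (the parameter values below are arbitrary):

```python
import math

def w(mu):
    """w_mu = P(y < 0) for y ~ N(mu, 1), i.e. the Gaussian CDF at 0."""
    return 0.5 * (1 + math.erf(-mu / math.sqrt(2)))

def p_x1(mu, theta_minus, theta_plus):
    """p(x = 1 | mu, theta): marginalize over the sign of y."""
    return w(mu) * theta_minus + (1 - w(mu)) * theta_plus

def p_x0(mu, theta_minus, theta_plus):
    """p(x = 0 | mu, theta)."""
    return w(mu) * (1 - theta_minus) + (1 - w(mu)) * (1 - theta_plus)

# The two probabilities are complementary for any parameter setting.
mu, tm, tp = 0.3, 0.2, 0.9
assert abs(p_x1(mu, tm, tp) + p_x0(mu, tm, tp) - 1) < 1e-9
```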

(c) True or False: p(y|x, µ, θ) is a Gaussian distribution. Explain why or why not.
Answer. False. Consider θ− = 0 and θ+ = 1. Then the posterior of y given x = 1 puts all its mass
on the (renormalized) part of the Gaussian above 0, i.e. a truncated Gaussian, which is not Gaussian.

Grading Notes. 2 pts for each part. In Part (a) we do not require derivations. In part (b) there is 1
point for each answer, and part marks for some valid work. In Part (c) there is 1 point for the answer,
and 1 point for a reasonable explanation.

5. [4 pts] We consider applying PCA to reduce the number of features of a 3-dimensional dataset.
Suppose the empirical mean is

µ̂ = (4, −1, 2)ᵀ
and the empirical covariance matrix Σ̂ has the following spectral decomposition:

Σ̂ =
[ −√2/2   √2/2   0 ] [ 2    0    0 ] [ −√2/2   0   √2/2 ]
[    0      0    1 ] [ 0   0.5   0 ] [  √2/2   0   √2/2 ]
[  √2/2   √2/2   0 ] [ 0    0    4 ] [    0    1     0  ]

(a) [2 pts] If we reduce the dataset to ONE dimension using PCA, what will be the reconstruction
x̃ of the point

x = (2, 0, 4)ᵀ?

Answer. First subtract the mean to get (−2, 1, 2). Projecting onto the top principal direction
(0, 1, 0) gives the code 1, so the reconstruction in centered coordinates is (0, 1, 0). Adding back
the mean gives x̃ = (4, 0, 2), a not-so-great reconstruction!

(b) [2 pts] If we reduce the dataset to TWO dimensions using PCA, what will be the latent code
z for x?

Answer. The first coordinate is (0, 1, 0)ᵀ(−2, 1, 2) = 1 (from above). The second is
(−√2/2, 0, √2/2)ᵀ(−2, 1, 2) = 2√2. So the latent code is z = (1, 2√2).
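Both parts can be checked in a few lines of pure Python, using the empirical mean and the two leading eigenvectors (eigenvalues 4 and 2) from the question:

```python
mean = [4, -1, 2]
s = 2 ** 0.5 / 2
u_top = [0, 1, 0]        # eigenvalue 4: top principal direction
u_second = [-s, 0, s]    # eigenvalue 2: second principal direction

x = [2, 0, 4]
centered = [xi - mi for xi, mi in zip(x, mean)]   # (-2, 1, 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# (a) One dimension: project onto u_top, then reconstruct.
z1 = dot(u_top, centered)                                # 1
x_recon = [mi + z1 * ui for mi, ui in zip(mean, u_top)]  # (4, 0, 2)

# (b) Two dimensions: the latent code stacks both projections.
z = (z1, dot(u_second, centered))                        # (1, 2*sqrt(2))
print(x_recon, z)
```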

Grading Notes. In each part:

• 0.5 pts for subtracting/adding back the mean.
• 0.5 pts for writing out the correct formula.
• 0.5 pts for a correct intermediate step.
• 0.5 pts for the answer (we will accept either order for the components).
