Midterm 2 B
CSC311 Fall 2020
[24 pts] Midterm 2 B
1. [6 pts] To get full marks, in each case provide a brief explanation. Grading Notes. 1 mark for
correct true/false, 1 mark for explanation.
(a) True or False: Gaussian discriminant analysis is a strictly more expressive binary linear classification model than linear regression.
Answer. True: GDA models can represent both linear and quadratic decision boundaries, whereas linear regression can only represent linear decision boundaries.
(b) True or False: Linear regression is a strictly more expressive binary linear classification model than a restricted variant of Gaussian discriminant analysis that uses the identity as the covariance for all classes.
Answer. False: GDA models with identity covariances can represent all linear decision boundaries.
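As a brief aside (standard GDA algebra, not part of the exam answer): with class priors π₁, π₀ and class conditionals p(x | y = k) = N(μ_k, I), the log-odds are

    log[ p(y = 1 | x) / p(y = 0 | x) ] = (μ₁ − μ₀)ᵀ x + log(π₁/π₀) + ½(‖μ₀‖² − ‖μ₁‖²),

which is linear in x, and any weight vector and bias can be obtained by choosing μ₁ − μ₀ and the priors.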
(c) Suppose we are given a training set {x(1),x(2),…,x(N)} of N points that were generated by a Gaussian mixture model with K = 10 components, and we apply Expectation Maximization (EM) to fit a learned Gaussian mixture model with K = 10 components. True or False: EM will exactly recover the true Gaussian mixture components.
Answer. False: EM might get stuck in a local optimum of the likelihood. Even if it did not, the dataset is finite and does not perfectly represent the underlying distribution, so the fitted components would not exactly match the true ones.
2. [4 pts] We are running AdaBoost on a new binary classification dataset, which has exactly 9 samples. Consider the “trivial” classifier, whose hypothesis space is {+1, −1} (i.e., it predicts either all positive, or all negative on the entire input space).
(a) [1 pts] Argue that the trivial classifier is guaranteed to do better than chance on our training set, regardless of the labels.
Answer. There is an odd number of samples, so whichever class is the majority appears at least 5 times out of 9; if the trivial classifier predicts that class it gets at least 5 right and does better than chance.
(b) [3 pts] Recall that AdaBoost is guaranteed to achieve zero training error, if at every iteration the weak learner, which is trained to minimize the weighted error at that iteration, achieves a weighted error slightly less than 50%. Does it follow that AdaBoost with the trivial classifier will successfully fit our training set? Why or why not?
Answer. No: a weighted combination of trivial classifiers is itself a trivial classifier, so AdaBoost built on the trivial weak learner can never fit a training set that contains both labels. This does not contradict the guarantee for AdaBoost with a weak learner: even though the trivial classifier is guaranteed to do better than chance on the initial (uniformly weighted) training set, there exists a weighted training set on which its weighted error is exactly 50% (indeed, this occurs on the very next iteration, as follows from Homework 3 Q1(a)).
Grading Notes. In part (b) 2 marks are allocated to a valid explanation. A valid explanation should mention that the trivial classifier doesn’t always do better than chance because of the weighted training set.
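An illustrative sketch of the reweighting argument in (b) (the labels below are hypothetical; any 9 labels with a majority class behave the same way): after one AdaBoost round the trivial classifier sits at exactly 50% weighted error.

    import numpy as np

    y = np.array([+1, +1, +1, +1, +1, -1, -1, -1, -1])  # 9 hypothetical labels, majority +1
    w = np.ones(len(y)) / len(y)                         # uniform initial weights

    h = np.full(len(y), +1)            # trivial classifier: predict the majority class everywhere
    err = w[h != y].sum()              # weighted error = 4/9 < 0.5
    alpha = 0.5 * np.log((1 - err) / err)

    w = w * np.exp(-alpha * y * h)     # AdaBoost reweighting: upweight mistakes, downweight correct
    w = w / w.sum()

    print(err)                 # 0.444... on the original uniform weights
    print(w[h != y].sum())     # exactly 0.5 on the new weights, so no further progress is possible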
3. [4 pts] We are working on a regression task and deciding between a single linear regression model (“Single Model”) and an ensemble of 5 linear regression models (“Ensemble”). We train both the Single Model and each member of the Ensemble on an independently sampled training set of the same size.
(a) [1 pts] Is the hypothesis space of the Single Model or the Ensemble more expressive?
Answer. Neither: an average (or any linear combination) of linear functions is again a linear function, so the two hypothesis spaces coincide.
(b) [1 pts] Between the Single Model and the Ensemble, which one has larger bias (in the sense of bias-variance), or do they have the same bias?
Answer. They have the same bias.
(c) [2 pts] Suppose the Single Model has an expected generalization error of 2.5 and a variance of 1. Can you say anything about the expected generalization error of the Ensemble Model?
Answer. Yes: averaging 5 members trained on independent datasets divides the variance by 5, so the variance falls from 1 to 0.2 while the bias and Bayes error are unchanged, giving an expected generalization error of 2.5 − 0.8 = 1.7.
Grading Notes. 2 pts in (c) for giving the true expected generalization error of 1.7. 1pt in (c) for just saying that it improves because variance falls.
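A small simulation of the variance argument in (c) (the setup is a hypothetical stand-in, not the exam's regression problem): averaging 5 independently trained, equally biased predictors divides the prediction variance by 5.

    import numpy as np

    rng = np.random.default_rng(0)
    n_trials, n_members = 100_000, 5

    # Each draw stands in for "the prediction of one model trained on a fresh dataset",
    # with variance 1 around its expected prediction, as in the question.
    single = rng.normal(0.0, 1.0, size=n_trials)
    ensemble = rng.normal(0.0, 1.0, size=(n_trials, n_members)).mean(axis=1)

    print(single.var())    # about 1.0
    print(ensemble.var())  # about 0.2 = 1/5, hence 2.5 - 1 + 0.2 = 1.7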
4. [6 pts] Consider a data set of N pairs {(x(1), y(1)), . . . , (x(N), y(N))} where x(i) ∈ {0, 1} and y(i) ∈ R. Consider the following generative model with parameters μ, θ−, θ+ for this data:
p(y | μ, θ) = (1/√(2π)) exp(−(y − μ)²/2),
p(x | y, μ, θ) = θ_sign(y)^x · (1 − θ_sign(y))^(1−x).
In other words, p(y | μ, θ) = N(μ, 1) is a 1D Gaussian with mean μ and unit variance, and x is Bernoulli with success probability p(x = 1 | y, μ, θ) = θ_sign(y), which switches depending on the sign of y.
(a) Derive the Maximum Likelihood Estimator (MLE) of μ, θ−, θ+.
Answer. The MLE of μ is the mean of the y's. The MLE of θ− is the mean of the x's associated with a label y < 0. The MLE of θ+ is the mean of the x's associated with a label y > 0.
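For concreteness, a sketch of these estimators in numpy (the arrays x and y are hypothetical stand-ins for the dataset):

    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 0, 1])                    # hypothetical binary x's
    y = np.array([0.5, -1.2, 2.0, 0.3, -0.7, -2.1, 1.4])   # hypothetical real-valued y's

    mu_hat = y.mean()                  # MLE of mu: sample mean of the y's
    theta_minus_hat = x[y < 0].mean()  # MLE of theta_-: mean of the x's with y < 0
    theta_plus_hat = x[y > 0].mean()   # MLE of theta_+: mean of the x's with y > 0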
(b) Write a concise expression for p(x = 1 | μ, θ) and p(x = 0 | μ, θ), using only θ−, θ+, and w_μ = ∫_{−∞}^{0} (1/√(2π)) exp(−(y − μ)²/2) dy.
Answer. Marginalizing over the sign of y:
p(x = 1 | μ, θ) = w_μ θ− + (1 − w_μ) θ+,
p(x = 0 | μ, θ) = w_μ (1 − θ−) + (1 − w_μ) (1 − θ+).
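A quick Monte Carlo sanity check of this expression (the parameter values below are made up for illustration; scipy's normal CDF supplies w_μ):

    import numpy as np
    from scipy.stats import norm

    mu, theta_minus, theta_plus = 0.3, 0.2, 0.9   # hypothetical parameter values
    rng = np.random.default_rng(0)

    y = rng.normal(mu, 1.0, size=1_000_000)
    theta = np.where(y < 0, theta_minus, theta_plus)
    x = rng.random(y.shape) < theta               # x = 1 with probability theta_sign(y)

    w_mu = norm.cdf(0, loc=mu, scale=1.0)         # w_mu = P(y < 0)
    print(x.mean())                                       # empirical p(x = 1 | mu, theta)
    print(w_mu * theta_minus + (1 - w_mu) * theta_plus)   # closed form, should match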
(c) True or False: p(y|x, μ, θ) is a Gaussian distribution. Explain why or why not.
Answer. False. Consider θ− = 0, θ+ = 1. Then all of the posterior mass of y given x = 1 lies in the (normalized) part of the Gaussian above 0, i.e. the posterior is a truncated Gaussian, which is not a Gaussian distribution.
Grading Notes. 2 pts for each part. In Part (a) we do not require derivations. In part (b) there is 1 point for each answer, and part marks for some valid work. In Part (c) there is 1 point for the answer, and 1 point for a reasonable explanation.
5. [4 pts] We consider applying PCA to reduce the number of features of a 3-dimensional dataset. Suppose the empirical mean is
μˆ = (4, −1, 2)ᵀ
and the empirical covariance matrix Σˆ has the following spectral decomposition:
Σˆ = U Λ Uᵀ, where the columns of U are the eigenvectors of Σˆ:

    U = [ 0   −√2/2   √2/2 ]
        [ 1     0       0  ]
        [ 0    √2/2    √2/2 ],   Λ = diag(2, 1, 0.5).
(a) [2 pts] If we reduce the dataset to ONE dimension using PCA, what will be the reconstruction x ̃ of the point
x = (2, 0, 4)ᵀ?
Answer. First subtract the mean to get (−2, 1, 2)ᵀ. Then project onto the top principal component (0, 1, 0)ᵀ: the coefficient is 1, so the reconstruction in the centered space is (0, 1, 0)ᵀ. Adding back the mean gives x̃ = (4, 0, 2)ᵀ, a not so great reconstruction!
(b) [2 pts] If we reduce the dataset to TWO dimensions using PCA, what will be the latent code z for x?
Answer. The first coordinate is (0, 1, 0)(−2, 1, 2)ᵀ = 1 (from above). The second is (−√2/2, 0, √2/2)(−2, 1, 2)ᵀ = 2√2. So the latent code is z = (1, 2√2).
Grading Notes. In each part:
• 0.5 pts for subtracting/adding back the mean.
• 0.5 pts for writing out the correct formula.
• 0.5 pts for a correct intermediate step.
• 0.5 pts for the answer (we will accept either order for the components).
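A quick numerical check of parts (a) and (b) with numpy, using the mean, eigenvectors, and data point as written above:

    import numpy as np

    mu = np.array([4.0, -1.0, 2.0])
    U = np.array([[0.0, -np.sqrt(2) / 2, np.sqrt(2) / 2],
                  [1.0,  0.0,            0.0           ],
                  [0.0,  np.sqrt(2) / 2, np.sqrt(2) / 2]])   # columns = eigenvectors, eigenvalues 2, 1, 0.5
    x = np.array([2.0, 0.0, 4.0])

    # (a) one-dimensional reconstruction: project the centered point onto the top PC
    z1 = U[:, :1].T @ (x - mu)
    x_tilde = mu + U[:, :1] @ z1
    print(x_tilde)   # [4. 0. 2.]

    # (b) two-dimensional latent code
    z2 = U[:, :2].T @ (x - mu)
    print(z2)        # [1. 2.8284...]  i.e. (1, 2*sqrt(2))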