Midterm 2 A, CSC311 Fall 2020 (27 marks total)
1. [6] To get full marks, in each case provide a brief explanation.
Grading Notes. 1 mark for correct true/false, 1 mark for explanation.
(a) We apply PCA to reduce the number of features of a linearly separable, binary classification
dataset. True or False: The resulting dataset is necessarily linearly separable.
Answer. False. The data may be linearly separable only along a direction of low variance, which PCA discards when reducing the dimensionality.
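For intuition, here is a small numpy/scikit-learn sketch (not part of the exam) of such a counterexample: the classes are separated only along a low-variance axis, and projecting onto the first principal component destroys the separation. All names and numbers are illustrative.

```python
# Hedged sketch: a linearly separable 2-D dataset whose class separation lies
# along the low-variance axis, so PCA to 1 dimension loses separability.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(scale=10.0, size=n)       # high-variance direction, same for both classes
labels = rng.integers(0, 2, size=n)       # binary labels
x2 = np.where(labels == 1, 1.0, -1.0)     # classes separated only along this low-variance axis
X = np.column_stack([x1, x2])             # linearly separable by the rule sign(x2)

Z = PCA(n_components=1).fit_transform(X)  # first PC aligns with the high-variance x1 axis
# The 1-D projections of the two classes now overlap heavily: no longer linearly separable.
print(Z[labels == 1].min(), Z[labels == 1].max())
print(Z[labels == 0].min(), Z[labels == 0].max())
```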
(b) True or False: It is possible for a nonlinear autoencoder to achieve zero loss on a finite training
set (i.e., perfectly reconstruct all x in the training set) even if latent dimension K < D.
Answer. True. There is a non-linear manifold (even in 1 dimension) that connects any finite number of points in R^D. Even if we consider the entire data generating distribution, it might live on a K-dimensional manifold of R^D. Alternatively, the latent space R^K has the same cardinality as the feature space R^D, so we could find a bijection between them.
Grading Notes. Should be flexible with the explanation here: anything along the lines of a space-filling curve or the data living on a low-dimensional manifold should be accepted. We should also accept: "False. The data distribution might not lie on a lower-dimensional manifold, so there is no differentiable map from R^D to R^K that represents the entire data distribution." There are two key parts here: (1) this answer needs to assume the entire data distribution rather than a finite dataset, and (2) it needs to assume the encoder/decoder are differentiable maps.
(c) Suppose that we are given a training set {x^(1), x^(2)} of exactly 2 points, and we apply Expectation Maximization (EM) to fit a learned Gaussian mixture model with K = 2 components. True or False: the two Gaussian centers of our model will always converge to the two training points, i.e., at convergence we will have µ̂1 = x^(1) and µ̂2 = x^(2), or (equivalently) µ̂2 = x^(1) and µ̂1 = x^(2).
Answer. False. EM can get stuck at a local optimum with µ̂1 = µ̂2 = (x^(1) + x^(2))/2 and π1 = π2 = 0.5.
Grading Notes. We can be somewhat flexible here if the student argues something about numerical precision or that it will converge for most initializations. The student has to demonstrate an understanding that there might be some counterexamples.
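As a concrete illustration (not part of the official solution), here is a minimal numpy sketch of EM for a 1-D mixture of two unit-variance Gaussians fit to two points with a symmetric initialization; the means stay stuck at the midpoint rather than converging to the two training points.

```python
# Minimal EM sketch: 2-component, 1-D Gaussian mixture (fixed unit variances),
# fit to two training points with both means initialized at their midpoint.
import numpy as np

x = np.array([0.0, 4.0])        # the two training points x^(1), x^(2)
mu = np.array([2.0, 2.0])       # symmetric init: both means at the midpoint
sigma2 = np.array([1.0, 1.0])   # fixed unit variances for simplicity
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities r[n, k] proportional to pi_k * N(x^(n) | mu_k, sigma2_k)
    log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2 - 0.5 * np.log(2 * np.pi * sigma2)
    r = pi * np.exp(log_lik)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update mixing proportions and means
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk

print(mu, pi)   # both means remain at 2.0 and pi stays at [0.5, 0.5]
```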
2. [3] Consider two algorithms for binary classification: (A) logistic regression and (B) a multi-layer neural network with non-linear activations. If you had to boost one algorithm and bag the other, which would you choose? Explain why briefly.
Answer. We should bag the neural network (low bias but higher variance) and boost the logistic
regression (possibly high bias and lower variance), since boosting lets us reduce bias, whereas bagging
lets us reduce variance.
Grading Notes. 1 point for each correct allocation, and 1 for the correct explanation. We should also accept explanations based on compute (typically boosted ensembles have more members, so we should choose the faster-to-compute classifier for boosting).
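As a rough illustration (not from the exam), one way to realize this allocation with scikit-learn; the estimator classes are real, but the keyword name is an assumption about the installed version (`estimator=` in recent releases, `base_estimator=` in older ones).

```python
# Hedged sketch: bag the high-variance neural network, boost the high-bias logistic regression.
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Bagging averages many neural nets trained on bootstrap samples (variance reduction).
bagged_nn = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
    n_estimators=10,
)

# Boosting fits a sequence of reweighted logistic regressions (bias reduction).
boosted_lr = AdaBoostClassifier(
    estimator=LogisticRegression(),
    n_estimators=50,
)
```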
3. [6] Consider a data generating distribution with the pdf:
p(x|θ) = θ exp(−θx),
parameterized by θ > 0. Suppose we observe a dataset D = {x^(1), x^(2), . . . , x^(N)} with N observations x^(i) > 0 generated by this distribution.
(a) [4] Using a prior p(θ) = exp(−θ), derive the Maximum A Posteriori (MAP) estimator of θ given D.
Answer. The log-posterior is log p(θ|D) = N log θ − θ Σ_i x^(i) − θ + const. Setting its derivative to zero gives 0 = N/θ − Σ_i x^(i) − 1, so θ̂_MAP = N/(Σ_i x^(i) + 1).
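As a sanity check (not part of the official solution), a short numpy/scipy sketch that numerically maximizes the log-posterior on synthetic data and compares it with the closed form; the data and settings are made up for illustration.

```python
# Hedged check: numerically maximize N log(theta) - theta * sum(x) - theta and
# compare with the closed-form MAP estimate N / (sum(x) + 1).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
true_theta = 2.0
x = rng.exponential(scale=1.0 / true_theta, size=100)   # x^(i) ~ Exponential(theta)

def neg_log_posterior(theta):
    return -(len(x) * np.log(theta) - theta * x.sum() - theta)

numeric = minimize_scalar(neg_log_posterior, bounds=(1e-6, 50.0), method="bounded").x
closed_form = len(x) / (x.sum() + 1.0)
print(numeric, closed_form)   # the two estimates agree closely
```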
(b) [2] The conditional distribution p(x|θ) in this question is known as an exponential distribution. True or False: the posterior p(θ|x) is an exponential distribution. Explain why or why not.
Answer. False. We have p(θ|D) ∝ p(D|θ)p(θ) = θ^N exp(−θ(Σ_i x^(i) + 1)), which is not of the same form as an exponential distribution because of the θ^N factor (it is in fact a Gamma distribution).
Grading Notes. For (a), 3 marks are given for correct intermediate steps; full marks are given for a correct final answer. In (b), 1 mark is given for the correct answer and 1 mark for the correct explanation.
4. [2] For each 2D dataset below, you are asked to identify the direction of the first principal component. You may specify the direction by the closest clock number (e.g., 12 is "Up" and 3 is "Right").
(a) What is the first principal component for the dataset in (A)?
Answer. 11 or 5
(b) What is the first principal component for the dataset in (B)?
Answer. 3 or 9
Grading Notes. 1 point for each correct answer
5. [4] I apply PCA to reduce the number of features of a 3-dimensional dataset. Instead of using a computer, I do all of my calculations by hand and arrive at a covariance matrix Σ̂ with the following spectral decomposition:
$$
\hat{\Sigma} =
\overbrace{\begin{pmatrix} 0 & \tfrac{\sqrt{2}}{2} & -\tfrac{\sqrt{2}}{2} \\ 0.7 & 0 & 0 \\ 0.3 & -\tfrac{\sqrt{2}}{2} & \tfrac{\sqrt{2}}{2} \end{pmatrix}}^{Q}
\;
\overbrace{\begin{pmatrix} 2 & 0 & 0.5 \\ 0 & 0.5 & 0 \\ 0.5 & 0 & -1 \end{pmatrix}}^{D}
\;
\overbrace{\begin{pmatrix} 0 & \tfrac{\sqrt{2}}{2} & -\tfrac{\sqrt{2}}{2} \\ 0.7 & 0 & 0 \\ 0.3 & -\tfrac{\sqrt{2}}{2} & \tfrac{\sqrt{2}}{2} \end{pmatrix}}^{Q^{\top}}
$$
It seems I made a few mistakes. For example, the first column of the first matrix (Q) should be a unit vector, but it is not, since ‖(0, 0.7, 0.3)^T‖ ≠ 1.
Identify FOUR OTHER things that are wrong about my result. Be specific, as in our example.
Answer.
(a) The 2nd and 3rd columns of Q are collinear rather than orthogonal.
(b) The third matrix is written identical to Q; it should be the transpose Q^T.
(c) D should be a diagonal matrix, but it has nonzero off-diagonal entries.
(d) The 3rd eigenvalue in D should be non-negative, since variance is non-negative (a covariance matrix is positive semi-definite).
Grading Notes. 1 mark for each correct answer
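For reference (not part of the exam), a small numpy sketch of what a valid spectral decomposition of a covariance matrix looks like; the data here is synthetic and just for illustration.

```python
# Hedged sketch: properties a correct spectral decomposition must satisfy,
# checked numerically with numpy.linalg.eigh on a sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # a hypothetical 3-dimensional dataset
Sigma = np.cov(X, rowvar=False)          # sample covariance matrix

eigvals, Q = np.linalg.eigh(Sigma)       # Sigma = Q D Q^T
D = np.diag(eigvals)

assert np.allclose(Q @ D @ Q.T, Sigma)       # the decomposition reconstructs Sigma
assert np.allclose(Q.T @ Q, np.eye(3))       # Q is orthogonal (orthonormal columns)
assert np.all(eigvals >= -1e-12)             # eigenvalues (variances) are non-negative
assert np.allclose(D, np.diag(np.diag(D)))   # D is diagonal
```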
6. [6] We are faced with a univariate, supervised regression task. The data is generated according to the following distribution p_sample:
x ∼ U(−2π, 2π) (uniform distribution over [−2π, 2π])
t ∼ N (0, 1) (standard normal distribution)
Note: the targets are generated independently of the inputs. We decide to fit the data using a small
hypothesis space of size 2, H = {y = sin(x), y = cos(x)}, and we pick between the 2 hypotheses by
minimizing squared error on a training set.
Because of the way the data are generated, a training algorithm that minimizes the squared error will
randomly choose between the 2 hypotheses. In other words, for the purposes of this question, you may
assume that our training algorithm chooses sin(x) with probability 0.5 and cos(x) with probability 0.5.
(a) [4] Let's first consider the Bayes error, bias, variance, and expected generalization error at x = π
for a single classifier. You are not required to show your work, but doing so may help you obtain
partial credit.
i. What is the Bayes error at x = π?
ii. What is the bias, (E[y] − y*)^2, at x = π?
iii. What is the variance at x = π?
iv. What is the overall expected generalization error, E[(y − t)^2], at x = π?
Answer. Bayes error = 1, Bias = 0.25, Variance = 0.25, Generalization error = 1.5
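For reference (this working is not part of the official solution), the numbers follow from y* = E[t | x = π] = 0 and the two equally likely predictions sin(π) = 0 and cos(π) = −1:

$$
\begin{aligned}
\text{Bayes error} &= \operatorname{Var}(t) = 1,\\
\mathbb{E}[y] &= \tfrac{1}{2}\sin(\pi) + \tfrac{1}{2}\cos(\pi) = -\tfrac{1}{2},
\qquad \text{bias} = \big(\mathbb{E}[y] - y_\star\big)^2 = \tfrac{1}{4},\\
\operatorname{Var}(y) &= \tfrac{1}{2}\big(0 + \tfrac{1}{2}\big)^2 + \tfrac{1}{2}\big({-1} + \tfrac{1}{2}\big)^2 = \tfrac{1}{4},\\
\mathbb{E}[(y - t)^2] &= 1 + \tfrac{1}{4} + \tfrac{1}{4} = 1.5.
\end{aligned}
$$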
(b) [2] Now suppose we train an ensemble of 5 models: each model is chosen from the hypothesis
space of part (a), but trained using a different independently sampled training set. Our ensemble
prediction is the average prediction over the 5 models. Which of the values in part (a) (i)-(iv)
will change, if any, and what are their new values?
Answer. Variance = 0.05, generalization error = 1.3.
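Again for reference (not part of the official solution): averaging 5 independently trained models leaves the Bayes error and bias unchanged and divides the variance by 5:

$$
\operatorname{Var}(\bar{y}) = \frac{0.25}{5} = 0.05,
\qquad
\mathbb{E}[(\bar{y} - t)^2] = 1 + 0.25 + 0.05 = 1.3.
$$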
Grading Notes. 1 mark for each correct number (mistakes carried through, so the same mistake is deducted only once). Partial marks for correct work shown but mistaken numbers.