Midterm 2 A
CSC311 Fall 2020
1. 6 For each of the following statements, answer True or False. To get full marks, in each case provide a brief explanation. Grading Notes. 1 mark for the correct True/False, 1 mark for the explanation.
(a) We apply PCA to reduce the number of features of a linearly separable, binary classification dataset. True or False: The resulting dataset is necessarily linearly separable.
Answer. False. The data may be linearly separable only along a direction of low variance, which PCA discards when reducing the number of features.
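A concrete counterexample may help (a minimal 2-D sketch; the particular coordinates are chosen only for illustration):
\[
\text{Class 1: } (\pm 10,\ +0.1), \qquad \text{Class 2: } (\pm 10,\ -0.1).
\]
This dataset is linearly separable by the line y = 0, but almost all of the variance lies along the x-axis, so the first principal component is horizontal. Projecting onto it sends class-1 and class-2 points to the same locations, and the reduced 1-D dataset is no longer linearly separable.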
(b) True or False: It is possible for a nonlinear autoencoder to achieve zero loss on a finite training set (i.e., perfectly reconstruct all x in the training set) even if the latent dimension K < D.
Answer. True. There is a non-linear manifold (even in 1 dimension) that passes through any finite number of points in R^D. Even if we consider the entire data generating distribution, it might live on a K-dimensional manifold of R^D. Alternatively, the latent space R^K has the same cardinality as the feature space R^D, so we could find a bijection between them.
Grading Notes. Should be flexible with the explanation here... anything along the lines of a space-filling curve or the data living on a low-dimensional manifold should be accepted. We should also accept: "False. The data distribution might not lie on a lower-dimensional manifold, so there is no differentiable map from R^D to R^K that represents the entire data distribution." There are two key parts here: (1) this answer needs to assume the entire data distribution rather than a finite dataset, and (2) it needs to assume the encoder/decoder are differentiable maps.
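One explicit construction along these lines (a sketch; it assumes the N training points are distinct and that the encoder and decoder are flexible enough to interpolate finitely many points, e.g. sufficiently large networks):
\[
f(x^{(i)}) = i, \qquad g(z)_d = \text{the degree-}(N{-}1)\text{ polynomial in } z \text{ with } g(i)_d = x^{(i)}_d, \qquad i = 1,\dots,N,
\]
so that g(f(x^{(i)})) = x^{(i)} for every training point and the reconstruction loss is exactly zero even with latent dimension K = 1.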
(c) Suppose that we are given a training set {x^(1), x^(2)} of exactly 2 points, and we apply Expectation Maximization (EM) to fit a Gaussian mixture model with K = 2 components. True or False: the two Gaussian centers of our model will always converge to the two training points, i.e., at convergence we will have μ̂_1 = x^(1) and μ̂_2 = x^(2) or (equivalently) μ̂_2 = x^(1) and μ̂_1 = x^(2).
Answer. False: EM can get stuck in a local optimum with μ̂_1 = μ̂_2 = (x^(1) + x^(2))/2 and π_1 = π_2 = 0.5.
Grading Notes. We can be somewhat flexible here if the student argues something about numerical precision or that for most initializations it will converge. The student has to demonstrate an understanding that there might be some counterexamples.
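A short argument for why this symmetric configuration is a fixed point of EM (filling in the reasoning behind the stated answer): if μ̂_1 = μ̂_2, Σ̂_1 = Σ̂_2 and π_1 = π_2 = 0.5, then every E-step responsibility equals 1/2, so the M-step does not move the parameters:
\[
r^{(i)}_k = \frac{\pi_k\,\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{2} \pi_j\,\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)} = \frac{1}{2}
\quad\Longrightarrow\quad
\mu_k^{\text{new}} = \frac{\sum_i r^{(i)}_k x^{(i)}}{\sum_i r^{(i)}_k} = \frac{x^{(1)} + x^{(2)}}{2}, \qquad \pi_k^{\text{new}} = \frac{1}{2}.
\]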
2. 3 Consider two algorithms for binary classification: (A) logistic regression and (B) a multi-layer neural network with non-linear activations. If you had to boost one algorithm and bag the other, which would you choose for each? Explain why briefly.
Answer. We should bag the neural network (low bias but higher variance) and boost the logistic regression (possibly high bias and lower variance), since boosting lets us reduce bias, whereas bagging lets us reduce variance.
Grading Notes. 1 point for each correct allocation, and 1 for the correct explanation. We should also accept explanations based on compute (typically boosted ensembles have more members, so we should choose the faster to compute classifier).
3. 6 Consider a data generating distribution with the pdf: p(x|θ) = θ exp(−θx),
parameterized by θ > 0. Suppose we observe a dataset D = {x(1), x(2), . . . , x(N)} with N observations x(i) > 0 generated by this distribution.
(a) 4 Using a prior p(θ) = exp(−θ), derive the Maximum A-Posteriori Estimator (MAP) of θ given D.
Answer. Setting the derivative of the log posterior to zero: 0 = ∑_i 1/θ − ∑_i x^(i) − 1, i.e., 0 = N/θ − ∑_i x^(i) − 1, so θ̂_MAP = N / (∑_i x^(i) + 1).
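The intermediate steps referred to in the grading notes, written out (a sketch of the standard derivation):
\begin{align*}
\log p(\theta \mid \mathcal{D}) &= \log p(\mathcal{D} \mid \theta) + \log p(\theta) + \text{const}
  = N \log\theta - \theta \sum_{i=1}^{N} x^{(i)} - \theta + \text{const}, \\
0 &= \frac{\partial}{\partial \theta} \log p(\theta \mid \mathcal{D})
  = \frac{N}{\theta} - \sum_{i=1}^{N} x^{(i)} - 1
\quad\Longrightarrow\quad
\hat\theta_{\mathrm{MAP}} = \frac{N}{\sum_{i=1}^{N} x^{(i)} + 1}.
\end{align*}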
(b) 2 The conditional distribution p(x|θ) in this question is known as an exponential distribution. True or False: the posterior p(θ|D) is an exponential distribution. Explain why or why not.
Answer. False. We have that p(θ|D) ∝ p(D|θ)p(θ) = θ^N exp(−θ(∑_i x^(i) + 1)), which is not of the same form as the exponential distribution due to the θ^N factor.
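For completeness (a side note beyond what the question asks): the posterior does have a standard named form, namely a Gamma distribution, whose mode N/(∑_i x^(i) + 1) matches the MAP estimate from part (a):
\[
p(\theta \mid \mathcal{D}) \propto \theta^{N} \exp\!\Big(-\theta \Big(\sum_{i=1}^{N} x^{(i)} + 1\Big)\Big)
\quad\Longrightarrow\quad
\theta \mid \mathcal{D} \sim \mathrm{Gamma}\Big(N + 1,\ \sum_{i=1}^{N} x^{(i)} + 1\Big) \ \text{(shape, rate)}.
\]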
Grading Notes. For (a), 3 marks are given for correct intermediate steps; full marks are given for a correct final answer. In (b), 1 mark is given for the correct answer, and 1 mark for the correct explanation.
4. 2 For each 2D dataset below, you are asked to identify the direction of the first principal component. You may specify the direction by the closest clock number (e.g., 12 is "Up" and 3 is "Right").
(a) What is the first principal component for the dataset in (A)?
Answer. 11 or 5
(b) What is the first principal component for the dataset in (B)?
Answer. 3 or 9
Grading Notes. 1 point for each correct answer
5. 4 I apply PCA to reduce the number of features of a 3-dimensional dataset. Instead of using a computer, I do all of my calculations by hand and arrive at a covariance matrix Σ̂ with the following spectral decomposition:
Σ̂ = Q D Qᵀ, written out as

        Q                          D                          Qᵀ
[  0     √2/2  −√2/2 ]    [  2     0     0.5 ]    [  0     √2/2  −√2/2 ]
[  0.7   0      0    ]    [  0     0.5   0   ]    [  0.7   0      0    ]
[  0.3  −√2/2   √2/2 ]    [  0.5   0    −1   ]    [  0.3  −√2/2   √2/2 ]
It seems I made a few mistakes. For example, the first column of the first matrix (Q) should be a unit vector, but it is not, since ∥(0, 0.7, 0.3)⊤∥ ≠ 1.
Identify FOUR OTHER things that are wrong about my result. Be specific, as in our example. Answer.
(a) The 2nd and 3rd columns of Q are collinear rather than orthogonal.
(b) The third matrix is written as Q again; it should be Qᵀ, the transpose of Q.
(c) D should be a diagonal matrix, but it has nonzero off-diagonal entries.
(d) The 3rd eigenvalue in D should be non-negative, since variance is non-negative (a covariance matrix is positive semi-definite).
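For reference, these four answers (and the example) appeal to the properties a correct spectral decomposition of a covariance matrix must satisfy:
\[
\hat\Sigma = Q D Q^{\top}, \qquad Q^{\top} Q = I \ \ (\text{orthonormal columns}), \qquad D = \operatorname{diag}(\lambda_1, \lambda_2, \lambda_3), \qquad \lambda_k \ge 0.
\]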
Grading Notes. 1 mark for each correct answer
6. 6 We are faced with a univariate, supervised regression task. The data is generated according to the following distribution p_sample:
x ∼ U(−2π,2π) (uniform distribution over [−2π,2π])
t ∼ N (0, 1) (standard normal distribution)
Note: the targets are generated independently of the inputs. We decide to fit the data using a small hypothesis space of size 2, H = {y = sin(x), y = cos(x)}, and we pick between the 2 hypotheses by minimizing squared error on a training set.
Because of the way the data are generated, a training algorithm that minimizes the squared error will randomly choose between the 2 hypotheses. In other words, for the purposes of this question, you may assume that our training algorithm chooses sin(x) with probability 0.5 and cos(x) with probability 0.5.
(a) 4 Let's first consider the Bayes error, bias, variance, and expected generalization error at x = π for a single trained model. You are not required to show your work, but doing so may help you obtain partial credit.
i. What is the Bayes error at x = π?
ii. What is the bias, (E[y] − y∗)², at x = π?
iii. What is the variance at x = π?
iv. What is the overall expected generalization error, E[(y − t)²], at x = π? Answer. Bayes error = 1, Bias = 0.25, Variance = 0.25, Generalization error = 1.5
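A worked check of these values (a sketch using the bias–variance decomposition; here y∗ = E[t | x = π] = 0 because the targets are independent of the inputs):
\begin{align*}
\text{Bayes error} &= \operatorname{Var}(t) = 1, \\
\mathbb{E}[y] &= 0.5 \sin(\pi) + 0.5 \cos(\pi) = -0.5,
\qquad \text{bias} = (\mathbb{E}[y] - y^*)^2 = 0.25, \\
\text{variance} &= 0.5\,(0 - (-0.5))^2 + 0.5\,(-1 - (-0.5))^2 = 0.25, \\
\mathbb{E}[(y - t)^2] &= 1 + 0.25 + 0.25 = 1.5.
\end{align*}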
(b) 2 Now suppose we train an ensemble of 5 models: each model is chosen from the hypothesis space of part (a), but trained using a different independently sampled training set. Our ensemble prediction is the average prediction over the 5 models. Which of the values in part (a) (i)-(iv) will change, if any, and what are their new values?
Answer. Only the variance and the generalization error change: Variance = 0.05, Generalization error = 1.3.
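A brief check (a sketch; the 5 models are trained on independent training sets, so averaging divides the variance by 5 while the Bayes error and bias are unchanged):
\[
\text{variance} = \frac{0.25}{5} = 0.05, \qquad \mathbb{E}[(y - t)^2] = 1 + 0.25 + 0.05 = 1.3.
\]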
Grading Notes. 1 mark for each correct number (mistakes carried through, so the same mistake is deducted only once). Partial marks for correct work shown but mistaken numbers.