DS-GA-1003: Machine Learning (Spring 2021) Midterm Exam (March 23 – March 24)
• You should finish the exam within 2 hours once it is started and submit on Gradescope by 5:20pm EST on March 24.
• You can refer to textbooks, lecture slides, and notes. However, searching for answers online and collaboration are not allowed.
• Please write your solutions on a separate sheet, either by hand or typed. The final submission must be in PDF format, and each question (six in total) should start on a new page.

• Using the provided template is optional. You can find the template here: https://tinyurl.com/3fhjtxwj
Question   Generalization   MLE   Optimization   Regularization   SVM   Kernels   Total
Points     15               15    15             15               15    15        90
Score

1. Generalization and risk decomposition. Alice and Bob are trying to build a classifier to predict whether a student will take DS1003 next spring. They have collected data on students who did and did not take DS1003 in the past.
(a) (2 points) They first split the data into train, validation, and test sets. In 1–2 sentences, explain the difference between the test set and the validation set.
(b) (2 points) After training a linear classifier, they find that both the training error and the validation error are very high. Alice suggests that they try to reduce training error first. To decrease training error, they would want to use
A. more training data
B. less training data
(c) (2 points) After changing the amount of data, the training error is still quite high. Bob figures that there might be something wrong with the features. To reduce training error, they should use
A. more features
B. fewer features
(d) (3 points) After some feature engineering, the training error finally goes down. Encouraged by this result, Bob suggests using the student NetID as a feature. Is this a good idea? Why or why not?
(e) (1 point) When checking the validation error, they find that it is much higher than the training error. This is a sign of overfitting / poor generalization.
(f) (3 points) To reduce the gap between training and validation error, Alice makes the following suggestions. Which one(s) would you agree with if you were Bob? (Select all correct answers)
A. Collect more training data
B. Add more features
C. Add regularization
D. Use a non-linear classifier
(g) (2 points) Finally, Alice and Bob are happy with their classifier and send it to the department head with a reported test accuracy of 90%. If you were the department head, do you expect this classifier to have similar prediction accuracy next spring? Why or why not?
Solution (a): The validation set is used for model selection and the test set is used for final evaluation.
Solution (d): Using NetID, the model can achieve zero training error by memorizing the label of each student, so it will cause overfitting.
Solution (g): The accuracy is expected to be lower, as the cohort next spring will differ from past students. (A positive answer is also considered correct if it assumes that students do not change much within one year.)
2. MLE and probabilistic models. Alice and Bob are still trying to predict who will take DS1003 next spring, as in Question 1. This time, however, they would like to estimate the probability that a student enrolls in DS1003. At first, they consider the students to be independent and identically distributed and collect $N$ outcomes $D = \{y_i\}_{i=1}^N$, with $y_i = 1$ if student $i$ took DS1003 and $y_i = 0$ if not.
(a) (2 points) What is an appropriate parametric family of distributions to model the desired probability?
A. Gaussian family
B. Beta family
C. Bernoulli family
D. Poisson family
(b) (1 point) We call θ ∈ Θ the parameter of the distribution in the family, denoted {p(y|θ), θ ∈ Θ}. What is the appropriate Θ?
(c) (3 points) Alice and Bob are going to use Maximum Likelihood Estimation to choose a value for θ. What is the likelihood of $D$ given θ, as a function of the $y_i$ and θ?
(d) (1 point) At this point, Alice and Bob think of differentiating between the students using features they collected. We denote by $x \in \mathbb{R}^d$ the features, and the training data set is now $D = \{x_i, y_i\}_{i=1}^N$. They choose a conditional family with a linear predictor parametrized by $w \in \mathbb{R}^d$, such that under their model $p(y = 1|x, w) = f(w^T x)$. The transfer function $f$ must be a mapping from which space to which space?
A. [0, 1] → R
B. [0, 1] → [0, 1]
C. R → [0, 1]
(e) (3 points) They choose $f(x) = 1/(1 + e^{-x})$. Write down the log-likelihood of the data set $D$ as a function of $w$, the $x_i$, and the $y_i$.
Solution (b): Θ = [0, 1].
Solution (c): $p(D|\theta) = \prod_{i=1}^N \theta^{y_i} (1 - \theta)^{1 - y_i}$.
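The exam stops at the likelihood, but the maximizing θ follows in one line; a standard derivation, added here for completeness:

$$\log p(D|\theta) = \sum_{i=1}^N \big[\, y_i \log\theta + (1-y_i)\log(1-\theta) \,\big],
\qquad
\frac{\partial \log p(D|\theta)}{\partial\theta} = \frac{\sum_i y_i}{\theta} - \frac{N - \sum_i y_i}{1-\theta} = 0
\;\Longrightarrow\;
\hat\theta = \frac{1}{N}\sum_{i=1}^N y_i .$$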
Solution (e): $p(D|w) = \prod_{i=1}^N f(w^T x_i)^{y_i}\,(1 - f(w^T x_i))^{1 - y_i}$, so

$\log p(D|w) = -\sum_{i=1}^N \big[\, y_i \log(1 + e^{-w^T x_i}) + (1 - y_i)\log(1 + e^{w^T x_i}) \,\big]$.
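A minimal NumPy sketch (not part of the exam; the data below is synthetic placeholder data) of this log-likelihood and of fitting $w$ by gradient ascent, which anticipates part (f):

```python
import numpy as np

def log_likelihood(w, X, y):
    """log p(D|w) for the logistic model, in the numerically stable form
    -sum_i [ y_i log(1 + e^{-w.x_i}) + (1 - y_i) log(1 + e^{w.x_i}) ]."""
    z = X @ w
    return -np.sum(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))

def grad_log_likelihood(w, X, y):
    """Gradient: sum_i (y_i - f(w.x_i)) x_i, where f is the sigmoid."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p)

# Gradient ascent on the log-likelihood (no closed form exists).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                       # placeholder features
y = (rng.random(100) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)    # placeholder labels
w = np.zeros(3)
for _ in range(500):
    w += 0.01 * grad_log_likelihood(w, X, y)
```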
(f) (3 points) What is the optimization problem to solve to get $\hat w$ by maximum likelihood estimation, and which method would you use to solve it? (Write the optimization problem in the form $\hat w = \arg\max_w \dots$.)
Solution (f): $\hat w = \arg\max_w -\sum_{i=1}^N \big[\, y_i \log(1 + e^{-w^T x_i}) + (1 - y_i)\log(1 + e^{w^T x_i}) \,\big]$.
There is no closed form a priori; we can use gradient descent (on the negative log-likelihood).

(g) (2 points) Alice and Bob have access to a test set and a validation set on top of the training data they used so far. They decide to finally test three possible transfer functions $f_1$, $f_2$, and $f_3$, obtaining by maximum likelihood the corresponding $\hat w_1$, $\hat w_2$, and $\hat w_3$. They compute the likelihood of the different sets for the different models they obtained and collect the results in a table:
           training set   validation set   test set
f1, ŵ1     0.55           0.45             0.43
f2, ŵ2     0.57           0.43             0.43
f3, ŵ3     0.56           0.42             0.45
Which model would you choose and why?
Solution (g): Model 1, since it has the maximum likelihood on the validation set.
3. Loss functions and optimization. Your boss asks you to build a classifier to detect fake news on COVID-19. You decide to use a linear classifier for simplicity:
$f(x) = \begin{cases} 1 & \text{if } w \cdot \phi(x) \ge 0 \\ -1 & \text{otherwise}, \end{cases}$
where φ is the feature map and w ∈ Rd is the parameter. We denote the label by y ∈ {−1, +1}. A positive prediction (+1) means x is fake news.
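For concreteness (a trivial sketch, not part of the exam), this decision rule in NumPy:

```python
import numpy as np

def predict(w, phi_x):
    """Return +1 (fake news) if w . phi(x) >= 0, else -1."""
    return 1 if np.dot(w, phi_x) >= 0 else -1
```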
(a) You will use the familiar ERM objective and SGD to learn the parameter w. For each of the following loss functions, explain whether it is suitable for the classification task and the learning framework. If yes, write down the (sub)gradient for a single example.
i. (2 points) $l(x, y, w) = I[(\phi(x) \cdot w)\,y \le 0]$, where $I(\cdot)$ is the indicator function.
Solution: No. The (sub)gradient is zero (almost) everywhere, so SGD makes no progress.
ii. (2 points) $l(x, y, w) = \min((\phi(x) \cdot w)\,y,\; 0)$

Solution: No. The loss decreases as the prediction becomes more incorrect (more negative margin).

iii. (2 points) $l(x, y, w) = 1 - (\phi(x) \cdot w)\,y$

Solution: No. The optimal w is either zero or does not exist (the objective is linear in w, so scaling w typically drives the loss to $-\infty$).

iv. (2 points)

$l(x, y, w) = \begin{cases} 1 - 2(\phi(x) \cdot w)\,y & \text{if } (\phi(x) \cdot w)\,y \le 0 \\ (1 - (\phi(x) \cdot w)\,y)^2 & \text{if } 0 \le (\phi(x) \cdot w)\,y \le 1 \\ 0 & \text{otherwise} \end{cases}$

Solution: Yes.

$\nabla_w l(x, y, w) = \begin{cases} -2y\,\phi(x) & \text{if } (\phi(x) \cdot w)\,y \le 0 \\ -2y\,(1 - (\phi(x) \cdot w)\,y)\,\phi(x) & \text{if } 0 \le (\phi(x) \cdot w)\,y \le 1 \\ 0 & \text{otherwise} \end{cases}$
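A sketch (added for illustration; the test point is random placeholder data) of loss (iv) and its gradient, with a finite-difference check:

```python
import numpy as np

def loss_iv(w, phi_x, y):
    """The piecewise loss from (a)iv, as a function of the margin m = (phi(x).w) y."""
    m = np.dot(phi_x, w) * y
    if m <= 0:
        return 1 - 2 * m
    if m <= 1:
        return (1 - m) ** 2
    return 0.0

def grad_iv(w, phi_x, y):
    """The gradient given in the solution above."""
    m = np.dot(phi_x, w) * y
    if m <= 0:
        return -2 * y * phi_x
    if m <= 1:
        return -2 * y * (1 - m) * phi_x
    return np.zeros_like(w)

# Finite-difference check at a random point (placeholder data).
rng = np.random.default_rng(0)
w, phi_x, y = rng.normal(size=3), rng.normal(size=3), 1
eps = 1e-6
numeric = np.array([(loss_iv(w + eps * e, phi_x, y) - loss_iv(w - eps * e, phi_x, y)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, grad_iv(w, phi_x, y), atol=1e-4)
```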
(b) (2 points) You recall from DS1003 that hinge loss is commonly used for classification tasks:
l(x, y, w) = max(1 − (φ(x) · w)y, 0),
and decide to give it a try. After several SGD epochs, you find that the average hinge loss of the training examples is 0.2. Your boss does not know hinge loss though and asks about the accuracy. What can you say about the training accuracy? Your answer should be one sentence.
Solution (b): The training accuracy is at least 0.8, because hinge loss is an upper bound on the 0-1 loss.
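A quick numerical check (synthetic margins, not exam material) that hinge loss pointwise upper-bounds the 0-1 loss, which is what the bound relies on:

```python
import numpy as np

rng = np.random.default_rng(0)
margins = rng.normal(size=1000)            # placeholder margins (phi(x).w) * y
hinge = np.maximum(1 - margins, 0)
zero_one = (margins <= 0).astype(float)    # 0-1 loss of the sign classifier
# Pointwise, hinge >= 0-1 loss, so mean hinge bounds the error rate:
assert np.all(hinge >= zero_one)
print(f"error <= {hinge.mean():.3f}, accuracy >= {1 - hinge.mean():.3f}")
```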
(c) (2 points) You soon realize that there is ten times more real news (negative examples) than fake news (positive examples) in your dataset, because it is expensive to have fake news annotated. In this case, is a high accuracy meaningful? Why or why not?
(d) (3 points) Your next job is to design a new loss function such that the loss is zero if the margin $(y\,\phi(x) \cdot w)$ is larger than 1, and the loss of a positive example is 10 times the loss of a negative example with the same margin, because you do not want to miss any fake news! Write down your loss function $l(x, y, w)$ and its gradient $\nabla_w l(x, y, w)$.
Solution (c): No. It is easy to get high accuracy (e.g. 90%) by predicting all examples as real news.
Solution (d):

$l(x, y, w) = \begin{cases} 10\max(1 - y\,\phi(x) \cdot w,\; 0) & \text{if } y = 1 \\ \max(1 - y\,\phi(x) \cdot w,\; 0) & \text{if } y = -1 \end{cases}$

$\nabla_w l(x, y, w) = \begin{cases} 0 & \text{if } y\,\phi(x) \cdot w \ge 1 \\ -10\,\phi(x) & \text{if } y = 1 \text{ and } y\,\phi(x) \cdot w < 1 \\ \phi(x) & \text{if } y = -1 \text{ and } y\,\phi(x) \cdot w < 1 \end{cases}$

4. Regularization. Recall that if we consider the perceptron algorithm as SGD using a certain loss function, we can define the perceptron loss:

$l(x, y, w) = \max(-y\,w^T x,\; 0),$

and the hypothesis space is given by

$\mathcal{H} = \{ f \mid f(x) = w^T x,\; w \in \mathbb{R}^d \}.$

(a) (3 points) Given linearly separable data, does the perceptron algorithm find a unique solution $w^*$? Why or why not?

Solution: No. The solution depends on the order of the examples.

(b) (3 points) Now consider minimizing the ERM objective using the perceptron loss with L2 regularization:

$J(w) = \sum_i \max(-y_i\,w^T x_i,\; 0) + c\,\|w\|^2$

Is the solution unique?

Solution: Yes.

(c) (3 points) Give a solution $w$ that achieves the minimum $J(w)$ but cannot separate the training data.

Solution: The zero vector.

(d) (3 points) To avoid the degenerate solution in the previous question, let's add a constraint $\|w\|^2 \ge 1$. Write the Lagrangian for the problem of minimizing $J(w)$ subject to the constraint, and specify which variable(s) is (are) the Lagrange multiplier(s). (You can use $J(w)$ in the expression.)

Solution: $L(w, \lambda) = J(w) + \lambda(1 - \|w\|^2)$, where $\lambda \ge 0$ is the Lagrange multiplier.

(e) (3 points) Is the new constrained minimization problem convex? Why or why not? Recall that convex problems must have a convex objective and a convex feasible set.

Solution: No. $\{w \mid \|w\|^2 \ge 1\}$ is not a convex set.

5. Support Vector Machines.

(a) (3 points) Recall that the SVM objective is equivalent to minimizing the average hinge loss with L2 regularization. What if we minimize the hinge loss without regularization? Explain why the solution is not unique without regularization. (You can assume the data is linearly separable.)

Solution: Given $w^*$ that achieves zero training loss, any scaled version $t\,w^*$ with $t \ge 1$ also achieves zero loss.

(b) Recall the hard-margin SVM objective:

minimize $\frac{1}{2}\|w\|^2$  s.t. $y_i(w^T x_i + b) \ge 1 \;\; \forall i$

The constraint says that the (functional) margin of each example is at least 1. If we change the constraint to require the margin to be at least $c$ ($c > 0$), i.e. solving

minimize $\frac{1}{2}\|w\|^2$  s.t. $y_i(w^T x_i + b) \ge c \;\; \forall i$
i. (3 points) Would it change the separating hyperplane? Why or why not?
ii. (3 points) Let $w^*$ be the solution of the original hard-margin SVM, and $w'$ the solution of the modified problem with margin at least $c$. Write an expression for $w'$ in terms of $w^*$.
(c) Consider the following dataset [figure not reproduced], where circles are positive examples and crosses are negative examples.
i. (2 points) Draw the decision boundary given by a linear (hard-margin) SVM trained on this dataset, and circle the support vectors.
ii. (2 points) Now add a single example to the training set such that all examples would be support vectors.
(d) (2 points) Given your observation in the previous question, explain why soft-margin SVM may be preferred even if the data is linearly separable.
Solution (b)i: No. It would only scale w and b.
Solution (b)ii: $w' = c\,w^*$.
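This scaling can be verified with a change of variables (a standard argument, added here for completeness):

$$\text{Set } w = c\,v,\; b = c\,a:\quad
y_i(w^T x_i + b) \ge c \;\Longleftrightarrow\; y_i(v^T x_i + a) \ge 1,
\qquad
\tfrac{1}{2}\|w\|^2 = \tfrac{c^2}{2}\|v\|^2 .$$

Minimizing $\tfrac{c^2}{2}\|v\|^2$ over the original feasible set gives $v = w^*$, hence $w' = c\,w^*$ (and $b' = c\,b^*$).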
Solution (d): It is more robust to outliers; as (c)ii suggests, a single added example can drastically change the hard-margin solution.
6. Kernels. Consider the following dataset [figure not reproduced]:
Each point is an example in R2, i.e. x = (x1,x2). Circles are positive examples and crosses are negative examples.
(a) (3 points) What’s the maximum training accuracy we can achieve on this dataset if we use a linear classifier?
(b) (3 points) Add a single feature to make the dataset linearly separable.
(c) (3 points) Suppose we are going to use an SVM to solve the binary classification problem. Which of the following kernel(s) can separate the data? (Select all correct answers)
A. linear kernel
B. polynomial kernel with degree 3
C. Gaussian kernel
(d) Now consider the general setting where $x \in \mathbb{R}^p$ and we have $n$ training examples. Suppose we learn an SVM with a quadratic kernel and find that there are $m$ support vectors.
i. (1 point) At inference time, what is the computational cost of making a single prediction using the primal solution? Express your answer in m, n, p using big-O notation.
Solution (a): 75%.
Solution (b): There are many answers, e.g. $(x_1 - x_2)^2$.
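The dataset figure is not reproduced here; assuming an XOR-style arrangement consistent with the 75% answer and this feature, a sketch:

```python
import numpy as np

# Hypothetical XOR-style data consistent with the stated solutions:
# circles (+1) on the diagonal, crosses (-1) off it.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, -1, -1])

# No linear boundary gets all four points right (best is 3/4 = 75%),
# but the added feature (x1 - x2)^2 separates the classes with a threshold:
feat = (X[:, 0] - X[:, 1]) ** 2        # values: 0, 0, 1, 1
pred = np.where(feat < 0.5, 1, -1)
assert np.all(pred == y)
```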
Solution (d)i: $O(p^2)$.
ii. (1 point) What if we use the dual solution (i.e. use the kernel trick)?
iii. (1 point) Given your above answers, when do we want to use the dual solution for prediction?
(e) (3 points) For which of the following model(s)/algorithm(s) can we use the kernel trick? (Select all correct answers)
A. Perceptron
B. Logistic regression with L2 regularization
C. Lasso regression
D. k-nearest neighbor
Solution (d)ii: $O(mp)$.
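A sketch (hypothetical data and names, added for illustration) contrasting the two prediction paths for the quadratic kernel $k(x, z) = (x^T z)^2$: the explicit feature map has $O(p^2)$ coordinates, while the dual sum touches the $m$ support vectors at $O(p)$ each:

```python
import numpy as np

def quad_features(x):
    """Explicit feature map for k(x, z) = (x.z)^2: all pairwise products, O(p^2) coordinates."""
    return np.outer(x, x).ravel()

def predict_primal(w_feat, x):
    """O(p^2): dot product in the explicit quadratic feature space."""
    return np.sign(w_feat @ quad_features(x))

def predict_dual(alphas, ys, support_vecs, x):
    """O(mp): kernel evaluations against the m support vectors only."""
    k = (support_vecs @ x) ** 2            # m kernel values, O(p) each
    return np.sign(np.sum(alphas * ys * k))

# The two paths agree, since w_feat = sum_i alpha_i y_i phi(x_i):
rng = np.random.default_rng(0)
sv, alphas, ys = rng.normal(size=(4, 3)), rng.random(4), np.array([1., -1., 1., -1.])
w_feat = (alphas * ys) @ np.array([quad_features(s) for s in sv])
x = rng.normal(size=3)
assert np.isclose(w_feat @ quad_features(x), np.sum(alphas * ys * (sv @ x) ** 2))
```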
Solution (d)iii: When $m < p$ (the dual cost $O(mp)$ beats the primal cost $O(p^2)$).

Congratulations! You have reached the end of the exam.