10-301/10-601 Fall 2020 Midterm 1 Practice Problems
1 K-Nearest Neighbors
1. Select all that apply: Which of the following statements about k-NN apply?
⃝ k-NN works great with a small amount of data, but it is too slow when the amount of data becomes large.
⃝ k-NN is sensitive to outliers; therefore, in general, we decrease k to avoid overfitting.
⃝ k-NN can only be applied to classification problems, and it cannot be used to solve regression problems.
⃝ We can always achieve zero training error (perfect classification) on a consistent data set with k-NN, but it may not generalize well in testing.
2. (1 point) Select one: A k-Nearest Neighbor model with a large value of k is analogous to:
⃝ A short Decision Tree with a low branching factor
⃝ A short Decision Tree with a high branching factor
⃝ A long Decision Tree with a low branching factor
⃝ A long Decision Tree with a high branching factor
3. (1 point) Select one: Imagine you are using a k-Nearest Neighbor classifier on a data set with lots of noise. You want your classifier to be less sensitive to the noise. Which change is more likely to help, and with what side effect?
⃝ Increase the value of k → Increase in prediction time
⃝ Decrease the value of k → Increase in prediction time
⃝ Increase the value of k → Decrease in prediction time
⃝ Decrease the value of k → Decrease in prediction time
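To build intuition for these questions, the sketch below is a minimal plain-numpy k-NN classifier on made-up 1-D data (the data set and labels are invented for illustration): one mislabeled "noisy" point flips the prediction at k = 1 but is outvoted at k = 3, and every prediction scans all N training points.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Classify x_query by majority vote among its k nearest training points."""
    # Distance to every training point: prediction cost grows with N.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]         # majority vote

# Toy 1-D data: class 0 near 0, class 1 near 10, plus one mislabeled point at 1.15.
X = np.array([[0.0], [0.5], [1.0], [9.0], [9.5], [10.0], [1.15]])
y = np.array([0, 0, 0, 1, 1, 1, 1])          # the last label is noise

print(knn_predict(X, y, np.array([1.1]), k=1))  # 1 (the noisy neighbor wins)
print(knn_predict(X, y, np.array([1.1]), k=3))  # 0 (majority of true neighbors)
```

Increasing k smooths out the noisy label, but every query still computes N distances, so larger k also means more voting work at prediction time.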
10-301/10-601 Midterm 1 Practice Problems – Page 2 of 6
2 Model Selection and Errors
1. Training Sample Size: In this problem, we will consider the effect of training sample
size N on a linear regression problem with M features.
The following plot shows the general trend for how the training and testing error change as we increase the training sample size N. Your task in this question is to analyze this plot and identify which curve corresponds to the training and test error. Specifically:
1. Which curve represents the training error? Please provide 1–2 sentences of justification.
2. In one word, what does the gap between the two curves represent?
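The general trend can be reproduced empirically. The sketch below uses a synthetic linear-plus-noise problem (all data and constants are made up for illustration) and fits least squares on growing training samples, printing both errors:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5                                        # number of features
w_true = rng.normal(size=M)

def make_data(n):
    """Synthetic linear data with Gaussian noise (illustrative only)."""
    X = rng.normal(size=(n, M))
    y = X @ w_true + 0.5 * rng.normal(size=n)
    return X, y

X_test, y_test = make_data(1000)

for N in [5, 20, 100, 1000]:
    X, y = make_data(N)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares fit
    train_mse = np.mean((X @ w - y) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"N={N:4d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Running this and comparing the two columns as N grows gives the same qualitative picture as the plot in the question.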
3 Linear Regression
1. (1 point) Select one: The closed form solution for linear regression is θ̂ = (XᵀX)⁻¹Xᵀy. Suppose you have N = 35 training examples and M = 5 features (excluding the bias term). Once the bias term is included, what are the dimensions of X, y, and θ̂ in the closed form equation?
⃝ X is 35×6, y is 35×1, θ̂ is 6×1
⃝ X is 35×6, y is 35×6, θ̂ is 6×6
⃝ X is 35×5, y is 35×1, θ̂ is 5×1
⃝ X is 35×5, y is 35×5, θ̂ is 5×5
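A quick way to sanity-check the dimensions is to build the matrices in numpy with random toy data (all values below are made up; the bias is handled by prepending a column of ones):

```python
import numpy as np

N, M = 35, 5
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(N, M))              # 35 examples, 5 raw features
X = np.hstack([np.ones((N, 1)), X_raw])      # prepend the bias column of ones
y = rng.normal(size=(N, 1))

# Closed-form least-squares solution from the question.
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

print(X.shape, y.shape, theta_hat.shape)     # (35, 6) (35, 1) (6, 1)
```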
2. Consider linear regression on N 1-dimensional points x(i) ∈ R with labels y(i) ∈ R. We apply linear regression in both directions on this data, i.e., we first fit y with x and get y = β1x as the fitted line, then we fit x with y and get x = β2y as the fitted line. Discuss the relation between β1 and β2:
True or False: The two fitted lines are always the same, i.e., we always have β2 = 1/β1.
⃝ True
⃝ False
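Before answering, one can test the claim empirically. The sketch below (toy noisy data, invented for illustration) computes both least-squares slopes for lines through the origin, matching the problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)           # noisy linear relationship

# Least-squares slopes for lines through the origin, in both directions.
beta1 = (x @ y) / (x @ x)                    # fit y = beta1 * x
beta2 = (x @ y) / (y @ y)                    # fit x = beta2 * y

print(beta1, 1 / beta1, beta2)               # compare beta2 against 1/beta1
```

Comparing the last two printed numbers on noisy data is a useful check of the claim; varying the noise level changes how far apart they are.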
3. Please circle True or False for the following questions, providing brief explanations to
support your answer.
(i) [3 pts] Consider a linear regression model with only one parameter, the bias, i.e., y = b. Then given N data points (x(i), y(i)) (where x(i) is the feature and y(i) is the output), minimizing the sum of squared errors results in b being the median of the y(i) values.
Circle one: True False Brief explanation:
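Claims like (i) can be probed numerically before committing to an answer. The sketch below uses toy y values (chosen so that mean and median differ) and minimizes the sum of squared errors over a fine grid of candidate b:

```python
import numpy as np

y = np.array([0.0, 1.0, 10.0])               # toy outputs; x is irrelevant when y = b

bs = np.linspace(-5, 15, 20001)              # candidate bias values, step 0.001
sse = ((y[None, :] - bs[:, None]) ** 2).sum(axis=1)   # SSE for each candidate
best_b = bs[np.argmin(sse)]

print(best_b, np.median(y), np.mean(y))      # compare the SSE minimizer to both
```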
(ii) [3 pts] Given data D = {(x(1), y(1)), …, (x(N), y(N))}, we obtain θ̂, the parameters that minimize the training error cost for the linear regression model y = θᵀx we learn from D.
Consider a new dataset Dnew generated by duplicating the points in D and adding 10 points that lie along y = θ̂ᵀx. Then the θ̂new that we learn for y = θᵀx from Dnew is equal to θ̂.
Circle one: True False Brief explanation:
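Claim (ii) can likewise be checked numerically. The sketch below (toy 1-D data with a bias column, all values invented for illustration) fits least squares, then duplicates the data set, appends 10 points lying exactly on the fitted line, and refits:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 1))])   # bias + 1 feature
y = X @ np.array([1.0, 3.0]) + rng.normal(size=20)            # noisy linear data

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Duplicate D and add 10 points that lie exactly on y = theta_hat^T x.
X_extra = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 1))])
X_new = np.vstack([X, X, X_extra])
y_new = np.concatenate([y, y, X_extra @ theta_hat])

theta_new, *_ = np.linalg.lstsq(X_new, y_new, rcond=None)
print(np.allclose(theta_hat, theta_new))     # did the refit parameters change?
```

Working through the normal equations XᵀXθ = Xᵀy for the stacked data explains whatever the printout shows: duplication scales both sides equally, and the appended points have zero residual under θ̂.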
4. We have an input x and we want to estimate an output y using linear regression.
Consider the dataset S plotted in Fig. 1 along with its associated regression line. For each of the altered data sets Snew plotted in Fig. 2, indicate which regression line (relative to the original one) in Fig. 3 corresponds to the regression line for the new data set. Write your answers in the table below.
Dataset         | (a) | (b) | (c) | (d) | (e)
Regression line |     |     |     |     |
Figure 1: An observed data set and its associated regression line.
(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set. Two on one side and one on the other side.
(d) Duplicating the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.
Figure 2: New data set Snew.
Figure 3: New regression lines for altered data sets Snew.