IEOR 142: Introduction to Machine Learning and Data Analytics, Spring 2021
Name: SID:
Practice Midterm Exam 3
March 2021
Instructions:
1. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.
2. You are allowed one (double sided) 8.5 x 11 inch note sheet and a simple pocket calculator. The use of any other note sheets, textbook, computer, cell phone, other electronic device besides a simple pocket calculator, or other study aid is not permitted.
3. You will have until 5:00PM to turn in the exam.
4. Whenever a question asks for a numerical answer (such as 2.7), you may write your answer as an
expression involving simple arithmetic operations (such as 2(1) + 1(0.7)).
5. Good luck!
IEOR 142 Practice Midterm Exam, Page 2 of 13 March 2021
1 True/False and Multiple Choice Questions – 45 Points
Instructions: Please circle exactly one response for each of the following 15 questions. Each question is worth 3 points. There will be no partial credit for these questions.
1. The probability model underlying logistic regression states that $\Pr(Y = 1 \mid X) = h(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)$, where $Y$ is the dependent variable, $X$ is the vector of independent variables, $(\beta_0, \beta_1, \ldots, \beta_p)$ are the logistic regression coefficients, and $h(w) = \frac{1}{1 + e^{-w}}$ is the logistic function.
A. True B. False
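For reference, the model form quoted in question 1 can be sketched in a few lines of Python (the exam itself uses R). This is only an illustrative sketch: the function names `logistic` and `predict_proba` and the coefficient values are hypothetical and do not come from the exam.

```python
import math

def logistic(w):
    """Logistic function h(w) = 1 / (1 + e^(-w))."""
    return 1.0 / (1.0 + math.exp(-w))

def predict_proba(beta, x):
    """Pr(Y = 1 | X = x) for coefficients beta = [b0, b1, ..., bp]."""
    w = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return logistic(w)

# Hypothetical coefficients and feature values, chosen only for illustration.
beta = [-1.0, 0.5, 2.0]
x = [2.0, 0.0]
print(predict_proba(beta, x))  # the linear score is 0, so this prints 0.5
```

The output is always strictly between 0 and 1, which is what lets it be read as a probability.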
2. Consider a linear regression model with a highly insignificant variable such that the p-value of the corresponding coefficient is greater than 0.50. Then, removing this variable from the model and retraining always results in a decrease in the training set $R^2$ value.
A. True B. False
3. Consider a linear regression model with a highly insignificant variable such that the p-value of the corresponding coefficient is greater than 0.50. Then, removing this variable from the model and retraining always results in an increase in the test set $OSR^2$ value.
A. True B. False
4. Consider a simple linear regression problem with a continuous dependent variable $Y$ and a single independent variable $X$. Suppose that we have a training dataset of $n = 2$ observations $(x_1, y_1), (x_2, y_2)$ that satisfies $x_1 \neq x_2$ and $y_i = \beta_0 + \beta_1 x_i$ for $i = 1, 2$, where $\beta_0, \beta_1$ are the true coefficients for the model. Let $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the estimates of $\beta_0$ and $\beta_1$, respectively, based on minimizing the RSS (residual sum of squared errors) on the training set. Then, it must be the case that $\hat{\beta}_0 = \beta_0$ and $\hat{\beta}_1 = \beta_1$.
A. True B. False
5. In order to train a boosting model (with trees as the base models), one of the required inputs to the algorithm is the number of splits in each of the base tree models, and this parameter should ideally be tuned with cross-validation.
A. True B. False
6. Consider training a CART model for binary classification and suppose that we use either the error rate impurity function or the Gini index impurity function. Then, in both cases, the total impurity cost of the tree is guaranteed to strictly decrease after every additional split.
A. True B. False
7. Consider using the bootstrap to assess the variability of the $OSR^2$ value of a previously trained Random Forests model on the test set, e.g., by constructing a confidence interval. Suppose that we set B = 10,000 for the number of bootstrap replications. Then, this procedure requires computing the $OSR^2$ value of the Random Forests model on 10,000 different bootstrapped datasets.
A. True B. False
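The bootstrap procedure described in question 7 might look roughly like the Python sketch below (the exam itself uses R). It assumes that $OSR^2$ is defined as 1 − SSE/SST with the training-set mean as the baseline, and that the interval is a simple percentile interval; both are assumptions, since the exam states neither. The data and all function names are hypothetical. The model is never retrained: only the test-set rows are resampled.

```python
import random

def osr2(y_true, y_pred, train_mean):
    """Out-of-sample R^2, assuming the baseline is the training-set mean."""
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - train_mean) ** 2 for y in y_true)
    return 1.0 - sse / sst

def bootstrap_osr2(y_true, y_pred, train_mean, B=1000, seed=0):
    """Recompute OSR^2 on B resamples (with replacement) of the test rows."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(osr2([y_true[i] for i in idx],
                           [y_pred[i] for i in idx], train_mean))
    return values

def percentile_interval(values, level=0.95):
    """Percentile interval taken from the sorted bootstrap values."""
    ordered = sorted(values)
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * len(ordered))
    hi_idx = int((1.0 - alpha) * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]

# Hypothetical test-set outcomes and fixed model predictions.
y_true = [1.0, 2.0, 3.5, 0.5, 4.0, 2.5]
y_pred = [1.2, 1.8, 3.0, 1.0, 3.6, 2.7]
values = bootstrap_osr2(y_true, y_pred, train_mean=2.2, B=1000)
lo, hi = percentile_interval(values)
print(len(values), lo, hi)
```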
8. The accuracy of a logistic regression model does not depend on the choice of the probability threshold value.
A. True B. False
9. Consider the dataset below in Figure 1 for a binary classification problem with p = 2 features and where + denotes a positive label and − denotes a negative label.
Figure 1
Then, it is possible for some classifier to achieve perfect 100% accuracy on this dataset.
A. True B. False
10. After removing punctuation, the bag of words representation of “Paul likes to travel” is the same as that of “Paul likes to travel. Paul likes to travel.”
A. True B. False
11. It is always the case that nonparametric methods (like boosting and random forests) will outperform parametric methods (like linear regression) in terms of out-of-sample predictive performance.
A. True B. False
12. Consider a binary classification problem where the test set has $N_{\text{pos}} > 0$ positive observations and $N_{\text{neg}} > 0$ negative observations. Suppose that we have previously trained a model on the training set, and that, on the test set, this model has a true positive rate value denoted by TPR and a false positive rate value denoted by FPR. Then a correct expression for the accuracy of this model on the test set is given by:
$$\text{Accuracy} = \frac{N_{\text{pos}} \cdot \text{TPR} + N_{\text{neg}} (1 - \text{FPR})}{N_{\text{pos}} + N_{\text{neg}}}$$
A. True B. False
13. Which of the following actions has the least risk of increasing the likelihood of overfitting?
A. Increasing the number of trees/iterations when training a boosting model
B. Increasing the number of trees when training a random forests model while leaving the value of m (mtry) fixed
C. Decreasing the value of m (mtry) when training a random forests model while leaving the number of trees fixed
D. Introducing new independent variables in a linear regression model that are quadratic functions of the original set of independent variables
14. Which of the following statements are true regarding k-fold cross-validation?
1. Increasing the value of k results in more overall computation time for the cross-validation procedure.
2. Using k = n, where n is the number of data points in the training set, is the same as leave-one-out cross-validation (LOOCV).
3. Using k = 1 is the same as the validation set method.
A. Only (1.) and (2.)
B. Only (1.) and (3.)
C. Only (2.) and (3.)
D. All three statements
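As background for question 14, the partitioning step of k-fold cross-validation can be sketched in Python (the exam itself uses R). This is a minimal sketch under stated assumptions: the helper name `kfold_indices` is hypothetical, folds are taken as contiguous index blocks, and the shuffling that is usually done first is omitted.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous validation folds of near-equal size.

    Returns a list of (train_indices, validation_indices) pairs, one per fold.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    base, extra = divmod(n, k)
    splits = []
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional observation.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train, val))
        start += size
    return splits

splits = kfold_indices(6, 3)   # three folds of two observations each
loocv = kfold_indices(6, 6)    # six folds of one observation each
for train, val in splits:
    print("train:", train, "validate:", val)
```

Each of the k pairs corresponds to one model fit on `train` and one evaluation on `val`.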
15. Consider training a CART model for a classification problem on a training set of size $n = 6$ with $p = 2$ independent variables. Figure 2 below displays a scatter plot of the independent variables $(X_1, X_2)$ along with 5 regions corresponding to the CART model that was trained. What is the most definitive (i.e., strongest) statement that can be made about the accuracy $A$ of this CART model on the training set?
Figure 2
A. $0 \le A \le 1$ B. $4/6 \le A \le 1$ C. $5/6 \le A \le 1$ D. $A = 1$
2 Short Answer Questions – 55 Points
Instructions: Please provide justification and/or show your work for all questions, but please try to keep your responses brief. Your grade will depend on the clarity of your answers, the reasoning you have used, as well as the correctness of your answers.
The first two problems concern a dataset [1] of golf player statistics with 162 observations, each corresponding to a different top professional golfer who participated in the PGA tour in 2018. Various attributes [2] concerning player performance and winnings throughout the entire length of the 2018 season were collected and aggregated. Table 1 below describes these attributes in more detail. For clarity, the first 6 observations of the dataset are also included below. We are primarily interested in building models for predicting player success – in terms of monetary winnings – based on the four direct performance statistics/attributes that are provided. We are also interested in which performance statistics have the greatest impact on success.
Table 1: Description of the dataset.
Variable            Description
PlayerName          The player’s name
Winnings            Total monetary winnings over the entire season, in millions of dollars (USD)
AverageScore        Average total point score per 18 hole round
AveragePutts        Average number of putts per hole
AverageDrivingDist  Average drive distance per hole, in yards
DrivingAccuracy     Percentage of shots where the drive shot successfully lands on the fairway area
> head(golf_data)
# A tibble: 6 x 6
  PlayerName     Winnings AverageScore AveragePutts AverageDrivingDist DrivingAccuracy
1 Aaron Baddeley    0.905         70.8         1.72               286.            57.7
2 Aaron Wise        1.05          70.7         1.73               303.            61.8
3 Abraham Ancer     3.17          70.6         1.75               293.            70.2
4 Adam Hadwin       2.22          70.5         1.73               291.            67.8
5 Adam Long         1.65          71.5         1.79               292             66.5
6 Adam Schenk       1.26          70.8         1.75               301.            61.3
[1] This dataset is a subset of a much more comprehensive dataset available at https://www.kaggle.com/bradklassen/pga-tour-20102018-data.
[2] To understand some of the attributes better, note that a “putt” is a very short distance shot taken on the “green” near the hole, whereas a “drive” is the initial shot which is typically a very long distance shot.
1. (25 points) The dataset was split into a training set with 105 observations and a test set with 57 observations, and a linear regression model was built, using the training data, to predict Winnings based on the four direct player performance stats, namely AverageScore, AveragePutts, AverageDrivingDist, and DrivingAccuracy. The output from R is given below.
> summary(mod1)
Call:
lm(formula = Winnings ~ AverageScore + AveragePutts + AverageDrivingDist +
DrivingAccuracy, data = golf_train)
Residuals:
Min 1Q Median 3Q Max
-1.4422 -0.6233 -0.0387 0.5000 4.9060
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        114.161145  14.961424   7.630 1.40e-11 ***
AverageScore        -1.745918   0.186456  -9.364 2.46e-15 ***
AveragePutts         2.192717   4.374250   0.501    0.617
AverageDrivingDist   0.026401   0.017066   1.547    0.125
DrivingAccuracy     -0.003636   0.029532  -0.123    0.902
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.89 on 100 degrees of freedom
Multiple R-squared: 0.6087, Adjusted R-squared: 0.5931
F-statistic: 38.89 on 4 and 100 DF, p-value: < 2.2e-16
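As a consistency check on how the columns of the regression output above relate to one another, the short Python sketch below (the exam itself uses R) recomputes each t value as Estimate / Std. Error and recomputes the adjusted R-squared from the multiple R-squared, using n = 105 training observations and p = 4 predictors. All numbers are copied from the output; only the variable names are introduced here.

```python
# (Estimate, Std. Error, reported t value), copied from the summary(mod1) output.
coefs = {
    "(Intercept)": (114.161145, 14.961424, 7.630),
    "AverageScore": (-1.745918, 0.186456, -9.364),
    "AveragePutts": (2.192717, 4.374250, 0.501),
    "AverageDrivingDist": (0.026401, 0.017066, 1.547),
    "DrivingAccuracy": (-0.003636, 0.029532, -0.123),
}

t_values = {}
for name, (estimate, std_error, t_reported) in coefs.items():
    # The t statistic is the estimate divided by its standard error.
    t_values[name] = estimate / std_error
    print(f"{name:20s} t = {t_values[name]:8.3f} (reported {t_reported})")

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
n, p, r2 = 105, 4, 0.6087
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"adjusted R^2 = {adj_r2:.5f}")  # agrees with the reported 0.5931 up to rounding of R^2
```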
The training data was further used to compute a correlation table, the output of which is given below.
> cor(golf_train[,c(2,5,6,7,8)])
                      Winnings AverageScore AveragePutts AverageDrivingDist DrivingAccuracy
Winnings            1.00000000   -0.7610713  -0.27987104         0.29221875     -0.01249549
AverageScore       -0.76107132    1.0000000   0.42866433        -0.16524349     -0.14614163
AveragePutts       -0.27987104    0.4286643   1.00000000         0.04029845      0.04488629
AverageDrivingDist  0.29221875   -0.1652435   0.04029845         1.00000000     -0.70911574
DrivingAccuracy    -0.01249549   -0.1461416   0.04488629        -0.70911574      1.00000000
Furthermore, variance inflation factors for the independent variables in the linear regression model were also computed.
> vif(mod1)
AverageScore AveragePutts AverageDrivingDist DrivingAccuracy
1.651558 1.393514 2.648311 2.629330
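The correlation table and the variance inflation factors are linked by a standard identity: for a linear model with an intercept, the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The Python sketch below (the exam itself uses R) applies that identity to the four predictor rows and columns of the correlation output. That the `vif` call shown above computes its values in this way is an assumption; the helper `invert` is written here only for illustration.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]

names = ["AverageScore", "AveragePutts", "AverageDrivingDist", "DrivingAccuracy"]
# Predictor-only block of the correlation table (Winnings row/column dropped).
corr = [
    [1.0, 0.42866433, -0.16524349, -0.14614163],
    [0.42866433, 1.0, 0.04029845, 0.04488629],
    [-0.16524349, 0.04029845, 1.0, -0.70911574],
    [-0.14614163, 0.04488629, -0.70911574, 1.0],
]

inverse = invert(corr)
vifs = {name: inverse[i][i] for i, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name:20s} {value:.6f}")
```

The printed values can be compared against the `vif(mod1)` line above.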
Please answer the following questions.
(a) (4 points) A particular golf player is considering adjusting his training strategy and expects that, in the 2019 season, his average score will be 70.50 points per round, he will average 1.76 putts per hole, his average driving distance will be 300 yards, and his driving accuracy will be 60%. The player also expects that there are no major differences in how player performance impacts winnings in the 2019 season versus the 2018 season. Use the R output on the previous pages to make a prediction for this player’s total winnings in millions of dollars in the 2019 season.
(b) (4 points) Is there a high degree of multicollinearity present in the training set? On what have you based your answer?
(c) (4 points) Based on the R output on the previous pages, is there enough evidence to conclude that the true coefficient corresponding to AverageScore is not equal to 0? On what have you based your answer?
(d) (4 points) Based on the R output on the previous pages, is there enough evidence to conclude that the true coefficient corresponding to AveragePutts is not equal to 0? On what have you based your answer?
(e) (4 points) Consider adding a new independent variable to the model called AveragePuttsPerRound, which is equal to the average number of putts per 18 hole round. (Recall that AveragePutts is the average number of putts per hole, and you may assume that each round consists of exactly 18 holes.) Is it possible for this new variable to improve the linear regression model for predicting Winnings? Explain your answer.
Figure 3: Two scatter plots on the training data, one of Winnings versus AverageScore and one of log(Winnings) versus AverageScore.
(f) (5 points) A data scientist working with the PGA Tour has determined that a simple linear regression model that only uses a single independent variable, AverageScore, would strike the best balance between interpretability and performance in this application domain. The data scientist is considering using one of two possible dependent variables: Winnings as before, or a logarithmic transformation log(Winnings). Figure 3 shows scatter plots on the training data of these two possible dependent variables versus AverageScore. Based on Figure 3, which dependent variable choice would you recommend in order to get the best predictive performance? Explain your answer.
2. (20 points) Next, a CART model was built to predict Winnings as a function of the four provided independent variables. The tree diagram corresponding to this model is shown in Figure 4 below.
AverageScore >= 70.52
  yes: AverageScore >= 71.12
    yes: 0.8197
    no:  1.506
  no: AverageScore >= 70.02
    yes: 2.897
    no:  5.145
Figure 4
Note that the training set $R^2$ value of the above CART model is 0.704. Furthermore, the value of the cp parameter used when training the above CART model was set to cp = 0.01.
(a) (5 points) Consider a new CART model that results after some new split on one of the four leaf nodes (buckets) of the current model. Using the information above, what is the most definitive (i.e., strongest) statement you can make concerning the training set $R^2$ value of this new CART model? Explain your answer.
(b) (5 points) Consider a new CART model that results after removing the bottom right split “AverageScore ≥ 70.02”. Using the information above, what is the most definitive (i.e., strongest) statement you can make concerning the training set $R^2$ value of this new CART model? Explain your answer.
(c) (4 points) Provide a brief but precise explanation for why AverageScore is the only (out of four possible) independent variable that was selected by the CART algorithm at each split in the above tree.
(d) (6 points) Let $\hat{f}(\text{AverageScore})$ denote the prediction function corresponding to this CART model, i.e., the function that returns the predicted value of Winnings as a function of AverageScore. Draw the graph of the function $\hat{f}(\text{AverageScore})$ in Figure 5 below. (You only need to draw the graph for values of AverageScore between 69 and 72 and you do not have to be concerned with evenly spacing the ticks on the x and y axes.)
Figure 5: Blank axes for the answer, with AverageScore on the x-axis and Winnings on the y-axis.
3. (10 points) Consider again the golfer from Q1 part (a) who is considering adjusting his training strategy and expects that, in the 2019 season, his average score will be 70.50 points per round, he will average 1.76 putts per hole, his average driving distance will be 300 yards, and his driving accuracy will be 60%. A friend of yours offers you a bet: if you accept, you pay your friend $100 right now. If the golfer mentioned above earns over $2.5 million in the 2019 season, then your friend will pay you back $150. Otherwise, your friend keeps the $100 and does not pay you back anything.
Do you currently have enough information to decide if you should take this bet or not? If yes, then please mention if you will take the bet or not and describe how you used the information in the previous problems to make your decision. If no, then please precisely describe what additional information you need and, if applicable, what additional model(s) you would build on the training data and how you would use the results of those model(s) to make your decision.