1. Please read this statement and Agree/Disagree below:
“In submitting this assessment, I confirm that my conduct during this quiz adheres to the Code
of Behaviour on Academic Matters. I confirm that I did NOT act in such a way that would
constitute cheating, misrepresentation, or unfairness, including but not limited to, using
unauthorized aids and assistance, impersonating another person, and committing plagiarism. I
pledge upon my honour that I have not violated the Faculty of Applied Science & Engineering’s
Honour Code during this assessment.”
2. [2] Which of the following statements is false?
a. Basis vectors forming an orthogonal basis are always orthonormal.
b. Basis vectors forming an orthonormal basis are always orthogonal.
c. Basis vectors forming an orthonormal basis are always normal.
d. All vectors in an orthonormal basis have length 1.
3. [2] In lecture, we discussed decision trees – an intuitive classification model that splits on
different attributes, creating a tree-like structure. A data scientist is given a large data set and
uses part of the data to train a really big decision tree, with many branches and nodes, that
perfectly fits the training data. When they apply it to the validation data, overall accuracy is only 78%.
a. Why is validation performance so poor?
b. What can the data scientist do to improve the model?
4. [2] A data scientist has a data set with a lot of features and chooses to use some of these
features to train a model on training data and evaluate performance on testing data. They find
that both training and testing accuracy are poor. What would you recommend: (i) removing a few
features, or (ii) adding more features? Explain.
5. [2] A data set with 4 features has the following covariance matrix:
        A        B        C        D
A     0.5      0.018    0.11     0.048
B     0.018    0.01     0.0025   0.14
C     0.11     0.0025   0.023    0.0055
D     0.048    0.14     0.0055   6
You’re asked to remove a highly correlated feature from the data set. Which one would you
remove?
2. a. is false – basis vectors forming an orthogonal basis need not have unit length, so they are not always orthonormal.
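A quick numerical check (a numpy sketch; the vectors are illustrative, not from the quiz):

    import numpy as np

    # An orthogonal but NOT orthonormal basis for R^2: the dot product is 0,
    # but the vectors do not have unit length.
    u = np.array([2.0, 0.0])
    v = np.array([0.0, 3.0])
    print(np.dot(u, v))        # 0.0 -> orthogonal
    print(np.linalg.norm(u))   # 2.0 -> not unit length, so not orthonormal

    # Normalizing each vector yields an orthonormal basis.
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    print(np.dot(u_hat, v_hat), np.linalg.norm(u_hat))  # 0.0 1.0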
3. a. The tree is overfitting the training data.
b. Multiple strategies: stop splitting each branch based on some criterion during
creation of the tree (e.g., a minimum # of examples required to continue), set a maximum
length (depth) for a branch, or use "pruning" – removing branches based on the least
important features, or in such a way as to not hurt accuracy too much.
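These strategies map directly onto hyperparameters in common libraries. A scikit-learn sketch (the specific values are illustrative, not prescribed by the quiz):

    from sklearn.tree import DecisionTreeClassifier

    # Each argument corresponds to one strategy above:
    # - min_samples_split: minimum # of examples required to keep splitting
    # - max_depth: maximum length (depth) of any branch
    # - ccp_alpha: cost-complexity pruning, which removes branches that
    #   contribute little to accuracy
    tree = DecisionTreeClassifier(
        min_samples_split=20,
        max_depth=5,
        ccp_alpha=0.01,
    )
    # tree.fit(X_train, y_train)  # X_train, y_train: the training split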
4. Best answer: (ii) adding more features – they provide additional information used to
make our predictions (or to differentiate between classes), but arguments could
also be made for removing features.
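To see why (ii) helps when both accuracies are poor, a small synthetic sketch (scikit-learn; the data set and model are made up for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # 20 informative features; training on only 2 of them underfits.
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=20, n_redundant=0,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for n in (2, 20):  # few features vs. all features
        model = LogisticRegression(max_iter=1000).fit(X_tr[:, :n], y_tr)
        print(n, model.score(X_tr[:, :n], y_tr),   # training accuracy
                 model.score(X_te[:, :n], y_te))   # testing accuracy

Both training and testing accuracy improve as informative features are added, the signature of underfitting.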
5. Example: corr(A,B)
cov(A,B) = 0.018
cov(A,A) = 0.5
cov(B,B) = 0.01
corr(A,B) = 0.018 / (sqrt(0.5) * sqrt(0.01)) = 0.255
Similarly:
corr(A,C) = 1.026
corr(A,D) = 0.028
corr(B,C) = 0.165
corr(B,D) = 0.572
corr(C,D) = 0.015
Features A and C are highly correlated, and removing either
feature A or C is acceptable.
Note: corr(A,C) > 1 due to rounding error in the values provided in the
table; correlations should be limited to the range -1 to 1.
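The full correlation matrix can be checked in one step by normalizing the covariance matrix (a numpy sketch using the table's values):

    import numpy as np

    # Covariance matrix from question 5, in order A, B, C, D.
    cov = np.array([
        [0.5,   0.018,  0.11,   0.048],
        [0.018, 0.01,   0.0025, 0.14],
        [0.11,  0.0025, 0.023,  0.0055],
        [0.048, 0.14,   0.0055, 6.0],
    ])

    # corr(i, j) = cov(i, j) / (sqrt(cov(i, i)) * sqrt(cov(j, j)))
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    print(np.round(corr, 3))  # corr(A, C) ~ 1.026 due to the table's rounding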
6. [4] You have two binary classification models, P_1 and P_2, that use a series of features to
predict the probability of an email being spam. The computed probabilities are shown in the table
below, along with the actual labels, for six validation examples.
Email   Label   P_1   P_2
  1       0     0.1   0.1
  2       0     0.4   0.5
  3       0     0.3   0.5
  4       1     0.5   0.4
  5       1     0.4   0.8
  6       1     0.8   0.6
a. Calculate the AUC for each model.
b. Assuming you value F1-score, which model would you choose?
c. What are the precision, recall, accuracy, and confusion matrix for this best model?
(a) TPR = TP/(TP+FN), FPR = FP/(FP+TN)

Example threshold >= 0.5:
P_1 predictions: (0, 0, 0, 1, 0, 1)
TPR_P1 = 2/(2+1) = 2/3
FPR_P1 = 0/(0+3) = 0
P_2 predictions: (0, 1, 1, 0, 1, 1)
TPR_P2 = 2/(2+1) = 2/3
FPR_P2 = 2/(2+1) = 2/3

For threshold >= 0.3:
TPR_P1 = 3/(3+0) = 1
FPR_P1 = 2/(2+1) = 2/3
TPR_P2 = 3/(3+0) = 1
FPR_P2 = 2/(2+1) = 2/3

Continue for the remaining thresholds, plot TPR vs. FPR for each model, and take the area under each curve:

[Figure: ROC curves (TPR vs. FPR) for P_1 and P_2, with the computed points marked.]

AUC_P1 = 17/18 ≈ 0.94, AUC_P2 = 7/9 ≈ 0.78
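Both areas can be verified with scikit-learn's trapezoidal AUC (a sketch; the arrays are just the columns of the table above):

    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 0, 1, 1, 1]              # actual labels, emails 1-6
    p1 = [0.1, 0.4, 0.3, 0.5, 0.4, 0.8]      # model P_1 probabilities
    p2 = [0.1, 0.5, 0.5, 0.4, 0.8, 0.6]      # model P_2 probabilities

    print(roc_auc_score(y_true, p1))  # 0.944... = 17/18
    print(roc_auc_score(y_true, p2))  # 0.777... = 7/9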
(b) F1-score = 2 * precision * recall / (precision + recall),
where precision = TP/(TP+FP) and recall = TP/(TP+FN).

Example for threshold >= 0.5:
precision of P_1 = TP/(TP+FP) = 2/(2+0) = 1
recall of P_1 = TP/(TP+FN) = 2/(2+1) = 2/3
-> F1-score for P_1 = 2 * 1 * (2/3) / (1 + 2/3) = 0.8
precision of P_2 = 2/(2+2) = 1/2
recall of P_2 = 2/(2+1) = 2/3
-> F1-score for P_2 = 2 * (1/2) * (2/3) / (1/2 + 2/3) = 0.57

Continue finding F1-scores for all thresholds. Computing across all thresholds,
the best F1-score for P_1 = 0.857 (at threshold >= 0.4) and for P_2 = 0.8
(at threshold >= 0.6). Hence, we select model P_1.
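The threshold sweep can be automated (a scikit-learn sketch; each distinct score is used as a ">=" cutoff):

    import numpy as np
    from sklearn.metrics import f1_score

    y_true = np.array([0, 0, 0, 1, 1, 1])
    models = {"P_1": np.array([0.1, 0.4, 0.3, 0.5, 0.4, 0.8]),
              "P_2": np.array([0.1, 0.5, 0.5, 0.4, 0.8, 0.6])}

    for name, s in models.items():
        # Best F1-score over all thresholds drawn from the model's scores.
        best = max(f1_score(y_true, (s >= t).astype(int))
                   for t in np.unique(s))
        print(name, round(best, 3))  # P_1: 0.857, P_2: 0.8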
(c) Best model: P_1 with a threshold of 0.35 (any threshold above 0.3 and up to
0.4 gives the same predictions: (0, 1, 0, 1, 1, 1)).

Confusion matrix:
                Actual 1   Actual 0
Predicted 1     TP = 3     FP = 1
Predicted 0     FN = 0     TN = 2

accuracy = # correct / total = 5/6 = 0.833
precision of P_1 = TP/(TP+FP) = 3/4 = 0.75
recall of P_1 = TP/(TP+FN) = 3/3 = 1
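These numbers can be confirmed directly (a scikit-learn sketch; note confusion_matrix orders rows/columns as actual/predicted [0, 1]):

    import numpy as np
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    y_true = np.array([0, 0, 0, 1, 1, 1])
    p1 = np.array([0.1, 0.4, 0.3, 0.5, 0.4, 0.8])
    y_pred = (p1 >= 0.35).astype(int)   # predictions at the 0.35 threshold

    print(confusion_matrix(y_true, y_pred))  # [[TN FP], [FN TP]] = [[2 1], [0 3]]
    print(accuracy_score(y_true, y_pred))    # 0.833
    print(precision_score(y_true, y_pred))   # 0.75
    print(recall_score(y_true, y_pred))      # 1.0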
Note: it is often acceptable to just compute the F1-score in part (b) using the
0.5 threshold; either approach would receive full marks.