CS代考程序代写 AI algorithm decision tree SQL Data 100, Final Fall 2019

Data 100, Final Fall 2019
Name:
Email:
Student ID:
Exam Room:
First and last name of student to your left: First and last name of student to your right: All work on this exam is my own (please sign):
@berkeley.edu
Instructions:
• This final exam consists of 117 points and must be completed in the 170 minute time period ending at 6:00PM, unless you have accommodations supported by a DSP letter.
• Please write your initials on the top of every page.
• Note that some questions have circular bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply. When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use three cheat sheets each with two sides.
• Please show your work for computation questions as we may award partial credit.
1

Data 100 Page 2 of 36 Initials:
1 An Instructor Thinks This Is A Good Question [9 Pts.]
The average response time for a question on Piazza this semester was 11 minutes. As always, the number of questions answered by each TA is highly variable, with a few TAs going above and beyond the call of duty. Below are the number of contributions for the top four TAs (out of 20, 000 total Piazza contributions):
Suppose we take an SRS (simple random sample) of size n = 500 contributions from the original 20, 000 contributions. We will also define some random variables:
• Di = 1 when the ith contribution in our sample is made by Daniel; else Di = 0.
• Si = 1 when the ith contribution in our sample is made by Suraj; else Si = 0.
• Mi = 1 when the ith contribution in our sample is made by Mansi; else Mi = 0.
• Ai = 1 when the ith contribution in our sample is made by Allen; else Ai = 0.
• Oi = 1 when the ith contribution is made by anyone other than Daniel, Suraj, Mansi, or Allen; else, Oi = 0
Throughout this problem, you may leave your answer as an unsimplified fraction. If your answer is much more complicated than necessary, we may deduct points. Some of these problems are simple, and some are quite tricky. If you’re stuck, move on and come back later.
(a) i. [1 Pt] What is P(A1 = 1)? P(A1 =1)=
ii. [1 Pt] What is E[S1]? E[S1] =
TA
# contributions
Daniel Suraj Mansi Allen
2000
1800
700
500
Solution: 500 = 1 20000 40
Solution: 1800 = 9 20000 100

Data 100 iii.
iv.
v.
Page 3 of 36 Initials:
[1 Pt] What is E[M100]? E[M100] =
[1 Pt] What is Var[D50]? Var[D50] =
Solution: 700 = 7 20000 200
Solution: D50 ∼Bernoulli(2000 = 1 ).Var(D50)= 1 ·(1− 1 )= 9 20000 10 10 10 100
[1Pt] WhatisD400 +S400 +A400 +M400 +O400? D400 +S400 +A400 +M400 +O400 =
Solution: 1. The 400th contribution must be made by Daniel, Suraj, Mansi, or other so one of the 5 random variables must 1 and the rest are 0s.
(b) For parts b.i and b.ii, let:
• ND =􏰂500 Di i=1
• NS =􏰂500 Si i=1
• NM = 􏰂500 Mi i=1
• NA =􏰂500 Ai i=1
• NO =􏰂500 Oi i=1
i. [1 Pt] What is E[NA]? E[NA] =

Data 100 Page 4 of 36 Initials:
Solution:
500
E[NA] = E[􏰃 Ai] i=1
500
= 􏰃E[Ai] i=1
500 500 =􏰃
i=1 20000 = 500 · 500
20000 = 25
2
ii. [1Pt] WhatisVar(ND +NS +NA +NM +NO)? Var(ND +NS +NA +NM +NO)=
(c) [2 Pts] Let’s consider the situation where we sample with replacement instead of taking a SRS. If we take a sample with replacement of 10 contributions, what is the probability that 3 were by Daniel, 3 were by Suraj, and 4 were by Mansi?
Probability =
Solution: 0.ND +NS +NA +NM +NO =500.Var(500)=0.Thevariance of a constant is 0.
Solution: This is a multinomial distribution where P (Daniel) = 2000 , P (Suraj) =
1800 and P (Mansi) = 700 and the number of trials n = 10. 20000 20000
20000
Then,
P(3 Daniel,3 Suraj,4 Mansi) =
Note, 􏰎10􏰏 · 􏰎7􏰏 = 10! . 3 3 3!3!4!
􏰀10􏰁 􏰀2000􏰁3 􏰀7􏰁 􏰀1800􏰁3 􏰀 700 􏰁4 3 · 20000 · 3 ∗ 20000 · 20000

Data 100 Page 5 of 36 Initials:
2 Relative Mean Squared Error [6 Pts.]
Consider a set of points {x1, x2, …, xn}, where each xi ∈ R, and further suppose we want to determine a summary statistic c for this data. Naturally, our choice of loss function determines the optimal c.
In this problem, let’s consider a new loss function l(c) = (x−c)2/x. We call this loss function the relative squared error loss. If we compute the average over an entire dataset, we get the empirical risk function below:
L(c)=1􏰃n (xi−c)2 n i=1 xi
Forexample,supposeourdatais[0.1, 0.2, 0.5, 0.5, 1],andweconsiderthesum- mary statistic c = 1. The empirical risk would be:
1 􏰀(0.1 − 1)2 (0.2 − 1)2 (0.5 − 1)2 (0.5 − 1)2 (1 − 1)2 􏰁 5 0.1 + 0.2 + 0.5 + 0.5 + 1
= (8.1+3.2+0.5+0.5) = 2.46 5
[6 pts] Give the summary statistic that minimizes the relative mean squared error for the data above, i.e. [0.1, 0.2, 0.5, 0.5, 1]. Make sure to show your work in the space below, correct answers will not be accepted without shown work.
cˆ =
Solution: Since the loss function is convex, we can find the optimizer of the loss function by setting the derivative with respect to c to 0 and solve for c.

Data 100 Page 6 of 36 Initials:
dL = 1 􏰃n d (xi − c)2 dc n i=1 dc xi
=1􏰃n −2(xi−c) n i=1 xi
1􏰃n 2c
= n −2 + x (1)
i 1n n2c
i=1
= (􏰃−2+􏰃 )
n i=1 i=1 xi
= 1(−2n+2c􏰃n 1)=−2+2c􏰃n 1 =0
Therefore,
n
i=1 xi n i=1 xi − 2 n + 2 c 􏰃n 1 = 0
cˆ=􏰂n 1 i=1 xi
i=1 xi n
Plugging in the data, we have that the optimal value of c is
cˆ = 5 = 5 = 1
1+1+1+1+1 204 0.1 0.2 0.5 0.5 1

Data 100 Page 7 of 36 Initials:
3 Election (Pandas) [12 Pts.]
You are given an DataFrame elections with the results of each U.S. presidential election. The first 8 rows of elections is shown on the left. The max votes Series on the right is described later on this page.
elections max votes
(a) [3 Pts] Suppose we want to add a new column called Popular Result that is equal to ’win’ if the candidate won the popular vote and ’loss’ if the candidate lost the popu- lar vote. Note, this is not the same thing as the Result column, e.g. Donald Trump won the 2016 election but lost the popular vote, i.e. did not have the largest value for Popular Vote in 2016.
To do this, we’ll start by using a new pandas function we have not learned in class called transform. For example, the code below creates a Series called max votes shown at the top right of this page.
max_votes = elections.groupby(“Year”)[“Popular_Vote”].transform(max)
max_votes.to_frame().head(8) # to_frame used so that it looks nicer
Using the max votes Series, create the new Popular Result column in elections. Your code may not use any loops. We have done the first line for you. If you’re not quite sure what your goal is, we provide a picture of the result on the next page. You may not need all lines. Hint: The .loc feature in pandas accepts boolean arrays for either of its arguments.
elections[“Popular_Result”] = “loss”
__________________________________________________________
__________________________________________________________
___________________________________________________ = “win”

Data 100 Page 8 of 36 Initials:
Solution:
elections[“Popular_Result”] = “loss”
popular_winners = elections[“Popular_Vote”] == max_votes
elections.loc[popular_winners, “Popular_Result”] = “win”

Data 100 (b)
Page 9 of 36 Initials: [2 Pts] Below is the correct result for part a of this problem.
elections
Fill in the code below so that df is a dataframe with only candidates whose ultimate
result was not the same as the popular vote, i.e.
df
You may not need all lines. Make sure to assign df somewhere.
_______________________________________________________
_______________________________________________________
_______________________________________________________
Solution:
df = elections[elections[“Result”] != elections[“Popular_Result”]]
alternatively,
df = elections.query(“Result != Popular_Result”)

Data 100 (c)
Page 10 of 36 Initials:
[4 Pts] Create a series win fraction giving the fraction each candidate won out of all elections participated in by that candidate. For example, Andrew Jackson participated in 3 presidential elections (1824, 1828, and 1832) and won 2 of these (1828 and 1832), so his fraction is 2/3. You should use the Result column, not the Popular Result column. For example, win fraction.to frame().head(9) would give us:
win fraction
You may not use loops of any kind. You do not need to worry about the order of the candidates. You may assume that no two candidates share the same name.
def f(s):
________________________________________
________________________________________
________________________________________
win_fraction = ________________________________________
Solution:
def f(s):
num_wins = sum(s == “win”)
num_elections = len(s)
return num_wins / num_elections
win_fraction = elections.groupby(“Candidate”)[“Result”].agg(f)
[3 Pts] Create a series s that gives the name of the last candidate who successfully won office for each party. That is, s.to frame() would give us:
(d)

Data 100 Page 11 of 36 Initials:
s
elections_sorted = elections.sort_values(_____________)
winners_only = ____________________________________
s = winners_only.____________(___________)[____________].__________
Solution:
elections_sorted = elections.sort_values(“Year”)
winners_only = elections_sorted[elections_sorted[“Result”]==”win”]
winners_only.groupby(“Party”)[“Candidate”].last()

Data 100 Page 12 of 36 Initials:
4 Regression [13 Pts.]
Recall from lab 9 the tips dataset from the seaborn library, which contains records about tips, total bills, and information about the person who paid the tip. Throughout this entire problem, assume there are a total of 20 records, though we only ever show 5. The first 5 rows of the resulting dataframe are shown below. The integer on the far left is the index, not a column of the DataFrame.
Suppose we want to predict the tip from the other available data. Four possible design matrices XMFB, XMF , XFB, and XF are given below.

Data 100 (a) i.
Page 13 of 36 Initials: [2 Pts] What is the rank of each of our four design matrices?
rank(XMFB)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19 rank(XMF)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19 rank(XFB)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19 rank(XF)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19
ii. [2 Pts] Recall that an Ordinary Least Squares (OLS) model is an
⃝20 ⃝20 ⃝20 ⃝ 20
unregularized lin-
ear model that minimizes the MSE for a given design matrix. Suppose we train three
different unregularized OLS models on XM F , XF B and XF , respectively. The result-
⃗⃗⃗
ing predictions given by each model are yˆM F , yˆF B , and yˆF . Which of the following
statements are true?
⃗⃗
􏰄 yˆ M F = yˆ F B
⃗⃗ 􏰄 yˆ M F = yˆ F
⃗⃗ 􏰄 yˆ F B = yˆ F
􏰄 None of These
iii. In lecture, we said that the residuals sum to zero for an OLS model trained on a
feature matrix that includes a bias term. For example, if SFB is the sum of the ⃗
residualsforyˆFB,thenSFB =0becauseXFB includesabiasterm.
i. [2Pts] LetSMF,SFB,andSF bethesumsoftheresidualsforourthreemodels. Which of the following are true? We have omitted SFB from the list below
because we already gave away the answer above. 􏰄 SMF =0 􏰄 SF =0 􏰄 Neitherofthese
ii. [2 Pts] Let SF , SF , and SF be the sums of the residuals for only female MFFB F
customers. For example, SF is the sum of the residuals for the 0th, 4th, etc. MF
rowsofXMF,SF isthesumoftheresidualsforthe0th,4th,etc.rowsofXFB, FB
and similarly for SF . Which of the following are true?
􏰄SF =0 􏰄SF =0 􏰄SF =0 􏰄Noneofthese MF FB F

Data 100 (b)
Page 14 of 36 Initials:
Suppose we create a new design matrix XB that contains only the total bill, size, and
a bias term. Suppose we then fit an OLS model on XB, which generates predictions ⃗
yˆ = [yˆ0, yˆ1, …, yˆ19] = [2.631665, 2.0483329, . . . ] with residuals ⃗r = [r0, r1, …, r19] = [−1.621665, −0.388329, . . . ].
Suppose we then do a very strange thing: We create a new design matrix W that has ⃗
the columns from XB, as well as two new columns corresponding to yˆ and ⃗r from our model on XB. Note: You’d never ever do this, but we’re asking as a way to probe your knowledge of regression. The first 5 rows of W are given below.
i. [2 Pts] What is the rank of W?
⃝0 ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝10 ⃝20 ⃝40
ii. [3 Pts] Let βˆ1, βˆ2, βˆ3, βˆ4, βˆ5 be optimal parameters of a linear regression model on W, e.g. βˆ4 is the weight of the yhat column of our data frame. Give a set of parameters that minimizes the MSE.
βˆ1 = βˆ2 = βˆ3 = βˆ4 = βˆ5 =

Data 100 Page 15 of 36 Initials:
Solution: βˆ1 =0,βˆ2 =0,βˆ3 =0,βˆ4 =1,βˆ5 =1.
r = y − yˆ so yˆ + r = yˆ + y − yˆ = y. The MSE would be 0.

Data 100 Page 16 of 36 Initials:
5 Alternate Classification Techniques [14 Pts.]
The primary technique for binary classification in our course was logistic regression, where we first calculated P(Y = 1|⃗x) = σ(⃗xTβ⃗), then applied a threshold T to compute a label (either 0 or 1). In other words, we predict yˆ = f(⃗x) = I(σ(⃗xT β⃗) > T), where I is an indicator function (i.e. returns 1 if the argument is true, 0 otherwise).
We trained such a model by finding the β⃗ that minimizes the cross entropy loss between our predicted probabilities and the true labels.
In this problem we’ll explore some variants on this idea. (a) In this part, we’ll consider various loss functions.
i. [2 Pts] Suppose our true labels are ⃗y = [0, 0, 1], our predicted probabilities of being in class 1 are [0.1, 0.6, 0.9], and our threshold is T = 0.5. Give the total (not average) cross-entropy loss. Do not simplify your answer.
Total CE Loss =
ii. [2 Pts] For the same values as above, give the total squared loss. Do not simplify your answer.
Squared Loss =
iii. [2 Pts] Which of the following are valid reasons why we minimized the average cross-entropy loss rather than the average squared loss?
􏰄 To prevent our parameters from going to infinity for linearly separable data.
􏰄 There is no closed form solution for the average squared loss.
􏰄 To improve the chance that gradient descent converges to a good set
of parameters.
􏰄 The cross entropy loss gives a higher penalty to very wrong probabili- ties.
􏰄 None of the above
Solution:
−(log(1 − 0.1) + log(1 − 0.6) + log(0.9)) = −(log(0.9) + log(0.4) + log(0.9))
Solution:
(0−0.1)2 +(0−0.6)2 +(1−0.92) = 0.12 +0.62 +0.12

Data 100
Page 17 of 36 Initials:
Solution: Accuracy is # correct / total, and zero-one loss is total – # correct. If you minimize the zero-one loss, you maximize the # correct. We have seen in class that something that maximizes accuracy does not necessarily maximize recall or precision.
(b)
iv. [1Pt] Athirdlossfunctionwemightconsideristhezero-oneloss,givenbyLZO(y,yˆ)= I(y ̸= yˆ). In other words, the loss is 1 if the label is incorrect, and 0 if it is correct. For the same values above, what is the total zero-one loss?
⃝0⃝1⃝2⃝3
v. [2 Pts] The zero-one loss is a function of both β⃗ and T . This is in contrast to the
⃗ ⃗ˆ ˆ
cross-entropy loss, which is only a function of β. Let βZO and TZO be parameters
⃗ˆ ˆ that minimize the zero-one loss. Which of the following are true about βZO and TZO?
􏰄 They maximize accuracy. 􏰄 They maximize precision. 􏰄 They maximize recall. 􏰄 None of these
vi. [3 Pts] A DS100 student wants to run gradient descent on the total zero-one loss to
⃗ˆ ˆ
find optimal βZO and TZO. Give the very specific reason that this will always fail.
Answer in 10 words or less. Vague answers will be given no credit.
[2 Pts] In this part, we’ll consider an alternative to the logistic function.
Instead of using the logistic function as our choice of f , let’s say we instead use a scaled
inverse tangent function, f(x) = 1 tan−1(x) + 1. This choice of f has the exact same π2
tail-end behavior as σ(x). In other words, it is always between 0 and 1. A plot of f is below:
Which of the following are true?
􏰄 The cross entropy loss is still well defined for all possible outputs of our model.
Solution: The gradient will always be 0 so running gradient descent will never find the global minimum unless we start at it.

Data 100
Page 18 of 36 Initials:
6
􏰄 We are still able to construct an ROC curve and use the AUC as a metric for our classifier.
􏰄 We can still compute a confusion matrix from our classifier. 􏰄 We can still assume that log P (Y =1|x) is linear.
P (Y =0|x)
􏰄 None of the above
Linear Separability [4 Pts.]
Suppose we fit a logistic regression model with two features x1,x2, and find that with clas- ⃗ˆ ˆ ˆ
sification threshold T = 0.75 and β = [β1, β2] = [2, 3], we achieve 100% training accuracy. Let x2 = mx1 + b be the equation for the line that separates the two classes. Give m and b (you may leave your answers in terms of ln). Hint: You might find the following fact useful: σ(ln(3)) = 0.75.
m= b=
Solution:
We are given that σ(2×1 + 3×2) = 0.75 is a hyperplane that achieves 100% training accu- racy, i.e. that linearly separates our data.
To get this in the required form, we can take the inverse sigmoid of each side, getting
2×1 +3×2 = σ−1(0.75) = ln(3). Rearranging, we get x2 = −2×1 + ln(3), so m = −2 and
b = ln(3) . 3
333

Data 100 Page 19 of 36 Initials:
7 ROC Curves [5 Pts.]
Here, we present a ROC curve, with unlabelled axes.
(a) [4 Pts] Fill in the pseudocode below to generate a ROC Curve. (Ignore the ”X” above.)
Hint: You can convert a boolean array to an array of 1’s and 0’s by multiplying the array by 1:
>>> y
array([False, False, True, True], dtype=bool)
>>> 1 * y
array([0, 0, 1, 1])
predicted_probs = np.array([0.37, 0.1, …])
y_actual = np.array([1, 0, …])
thresholds = np.linspace(______, _______, 1000)
tprs, fprs = [], []
for t in ________________________________:
y_pred = ________________________________
a = np.sum((y_pred == y_actual) & (y_pred == 1))
b = np.sum((y_pred == y_actual) & (y_pred == 0))
c = np.sum((y_pred != y_actual) & (y_pred == 1))
d = np.sum((y_pred != y_actual) & (y_pred == 0))
tprs.append(________________________________)
fprs.append(________________________________)
plt.plot(fprs, tprs)
Solution:
thresholds = np.linspace(0, 1, 1000)
for t in thresholds:
y_pred = 1 * (pedicted_probs > t)

Data 100 Page 20 of 36 Initials:
a = np.sum((y_pred == y_actual) & (y_pred == 1))
b = np.sum((y_pred == y_actual) & (y_pred == 0))
c = np.sum((y_pred != y_actual) & (y_pred == 1))
d = np.sum((y_pred != y_actual) & (y_pred == 0))
tprs.append(a / (a + d))
fprs.append(c / (b + c))
(b) [1 Pt] Which of the following classification thresholds most likely corresponds to the point marked with an ”X” above?
⃝0.1 ⃝0.65 ⃝0.9 ⃝1.0

Data 100
Page 21 of 36
Initials:
8
(a)
PCA [7 Pts.]
Consider the matrix X below.
(b)
Suppose we decompose X using PCA into X = UΣV T . Let r x c be the dimensions of VT.
i. [1 Pt] What is r?
⃝0 ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝Noneofthese
ii. [1 Pt] What is c?
⃝0 ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝Noneofthese
[3 Pts] Let P be the principal component matrix of X. That is, P = UΣ.
Suppose we now decompose the principal component matrix P into its principal compo- nents,givingusP =UPΣPVPT.WhatisVPT?
VPT =
0 2 −1
0 2 −2 X=1 1 −3 1 1 −4
2 0 −5
Solution:
1 0 0 0 1 0 001
One can arrive at this answer by equating U Σ = UP ΣP VPT and noticing that this impliesVPT =I,butthatisn’tveryinterestingoruseful.
Instead,wecanrecognizethatVT isarotationmatrixthatrotatesouroriginaldata X so that it is axis-aligned. In other words, P = UΣ is the data rotated such that the greatest variance occurs along the x axis.
If we perform PCA again, we’re basically trying to rotate P so that it is axis-aligned. However,itisalreadyaxis-alignedsotherotationmatrixVPT shoulddonothing.The matrix which does nothing is the identity matrix. The dimensions are 3 x 3 for the same reasons as in part a.

Data 100 Page 22 of 36 Initials:
Note, it was not enough to state I as the solution, as that leaves the dimensions am- biguous. We had many students who gave identity matrices that had the wrong di- mensions.
(c) Consider the statement: ”When we created 2D PCA scatter plots in this class, we were usually plotting the first 2 _________ of __________”?
i. [1 Pt] For the first blank, what is the appropriate word? ⃝ rows ⃝ columns
ii. [1 Pt] For the second blank, what is the appropriate object?
⃝ X ⃝ U ⃝ Σ ⃝ VT ⃝ UΣ ⃝ ΣVT ⃝ UΣVT

Data 100
Page 23 of 36 Initials:
9
(a)
SQL [10 Pts.]
[5 Pts] In this problem, we have the two tables below, named brackets and names respectively. The left table is a list of U.S. tax brackets, e.g. a person’s first $9700 of income is taxed at 10%, income between $9701 and $39475 is taxed at 12%, etc. The right table is a list of people, their ages, and incomes.
brackets names
Give a SQL query that results in the table below, except that the order of your rows may be different. Here, the rate column represents the highest tax bracket at which their income is taxed. For example, Lorenza earns $165,743, so her highest income is taxed at the 32% rate. The how much column says how much of the person’s income is taxed at this rate, e.g. $5,017 of Lorenza’s income is taxed at 32% since her income exceeds the low of the 32% bracket of $160,726 by $5,017. Your output should have the same column names as the example below.
SELECT ___________________________________________________
FROM _____________________________________________________
WHERE ________________________ AND _______________________;

Data 100 Page 24 of 36 Initials:
Solution:
SELECT rate, name, and income-low AS how_much
FROM brackets, names
WHERE income >= low AND income <= high; Data 100 (b) Page 25 of 36 Initials: [5 Pts] For this problem, we have the ds100 grades table below. ds100 grades Suppose we want to generate the table below. The table above provides the average grade on HW5 for students with ”ee” in their name, separated into two groups: those who have taken Data 8 and those who have not. For example, Akeem and Desiree both have ”ee” in their names, and have taken Data 8. The average of their scores is 91. Kathleen has an ”ee” in her name, but has not taken Data 8. Since she is the only person in the table who has not taken Data 8, the average is just her score of 95. Penelope and Ashoka do not have ”ee” in their names, so their data will not get included in the table. Each table includes the count, the HW5 average, and whether the row corresponds to students who took Data 8 or not. Give a query below that generates this table. The order of your rows does not matter. Your output should have the same column names as the example below. SELECT ___________________________________________________ FROM _____________________________________________________ WHERE ____________________________________________________ __________________________________________________________; Data 100 Page 26 of 36 Initials: Solution: SELECT COUNT(*) AS count, AVG(hw5) AS hw5_average, data8 FROM ds100_grades WHERE name LIKE ’%ee%’ GROUP BY data8; Data 100 Page 27 of 36 Initials: 10 Decision Trees [8 Pts.] Suppose we are trying to train a decision tree model for a binary classification task. We denote the two classes as 0 (the negative class) and 1 (the positive class) respectively. Our input data consists of 6 sample points and 2 features x1 and x2. The data is given in the table below, and is also plotted for your convenience on the right. (a) [2 Pts] What is the entropy at the root of the tree? Do not simplify your answer. entropy = (b) [3 Pts] Suppose we split the root note with a rule of the form xi ≥ β, where i could be either 1 or 2. Which of the following rules minimizes the weighted entropy of the two resulting child nodes? ⃝x1≥3 ⃝x1≥4.5 ⃝x1≥8.5 ⃝x2≥3.5 ⃝x2≥4.5 (c) [3 Pts] Now, suppose we split the root note with a different rule of the form below: x1 ≥β1 andx2 ≤β2, where β1,β2 are the thresholds we choose for splitting. Give a β1 and β2 value that minimizes the entropy of the two resulting child nodes of the root. β1 = β2 = Solution: Proportion 0 = 4 = 2 . 63 Proportion 1 = 2 = 1. 63 Entropy=−(2 log2 +1 log1). 3333 Data 100 Page 28 of 36 Initials: Solution: β1 = 4, β2 = 3. Any solution that contains only the two points labeled 1 is sufficient. Data 100 Page 29 of 36 Initials: 11 Clustering [7 Pts.] (a) The two figures below show two datasets clustered into three clusters each. For each dataset, state whether the given clustering could have been generated by the K-means and Max- agglomerative clustering algorithms. By max-agglomerative we mean the exact algorithm discussed in class, where the distance between two clusters is given by the maximum distance between any two points in those clusters. Note: There are no hidden overplotted cluster markers. For example, there’s no need to look closely at all the triangles to see if there is a square or circle hidden somewhere. i. [2 Pts] Dataset 1: 􏰄 K-means 􏰄 Max-agglomerative 􏰄 None of these ii. [2 Pts] Dataset 2: 􏰄 K-means 􏰄 Max-agglomerative 􏰄 None of these (b) For each of the following statements, say whether the statement is true or false. i. [1 Pt] If we run K-Means clustering three times, and the generated labels are exactly equal all three times, then the locations of the generated cluster centers are also exactly equal all three times. ⃝ True ⃝ False ii. [1 Pt] Assuming no two points have the same distance, the cluster labels computed by K-means are always the same for a given dataset. ⃝ True ⃝ False Solution: This problem was more difficult than we had intended. The problem was scored entirely based on your answer to the K-means part of the problem, i.e. you were not penalized (or rewarded) for getting max-agglomerative correct as actually proving the correct answer is extremely difficult without a computer. Data 100 Page 30 of 36 Initials: iii. [1 Pt] Assuming no two points have the same distance, the cluster labels computed by Max-agglomerative clustering are always the same for a given dataset. ⃝ True ⃝ False Data 100 Page 31 of 36 Initials: 12 Potpourri [16 Pts.] (a) [1 Pt] Suppose we train an OLS model to predict a person’s salary from their age and get β1 as the coefficient. Suppose we then train another OLS model to a predict a person’s salary from both their age and number of years of education and get parameters γ1 and γ2, respectively. For these two models β1 = γ1. ⃝ Always True ⃝ Sometimes True ⃝ Never True (b) [1 Pt] Suppose we train a ridge regression model with non-zero hyperparameter λ to predict a person’s salary from their age and get β1 as the coefficient. Suppose we then train another ridge regression model using the same non-zero hyperparameter λ to predict a person’s salary from both their age and number of years of education and get parameters γ1 and γ2, respec- tively. For these two models β1 = γ1. ⃝ Always True ⃝ Sometimes True ⃝ Never True (c) [1 Pt] If we get 100% training accuracy with a logistic regression model, then the data is linearly separable. ⃝ Always True ⃝ Sometimes True ⃝ Never True (d) [1 Pt] If we get 100% training accuracy with a decision tree model, then the data is linearly separable. ⃝ Always True ⃝ Sometimes True ⃝ Never True (e) [1Pt] Increasingthehyperparameterλinaridgeregressionmodeldecreasestheaverageloss. ⃝ Always True ⃝ Sometimes True ⃝ Never True (f) [1 Pt] Let MSE1 be the training MSE for an unregularized OLS model trained on X1. Let MSE2 be the training MSE for an unregularized OLS model trained on X2, where X2 is just X1 with one new linearly independent column. If MSE1 > 0, then MSE2 < MSE1. ⃝ Always True ⃝ Sometimes True ⃝ Never True Solution: This one is pretty tricky. While adding another column definitely never makes the loss worse (because we’re simply increasing the size of the span of our columns a.k.a. observations), it is possible that the MSE stays the same. As a trivial counterexample: For example suppose the y vector we’re trying to predict is the column vector [1, 0, 0] and we want to predict this from a single feature [0, 1, 0]. The closest point on the span of [0, 1, 0] is [0, 0, 0] which has MSE 1/3. Suppose we add a linearly independent feature so our design matrix is now: Data 100 Page 32 of 36 Initials: 0 0 1 0 01 The span of these two (linearly independent) columns is now a plane orthogonal to y. The closest point in this plane is still [0, 0, 0] which has MSE of 1/3. (g) [1 Pt] When using regularization on a linear regression model, you should center and scale the quantitative non-bias columns of your design matrix. ⃝ Always True ⃝ Sometimes True ⃝ Never True Data 100 Page 33 of 36 Initials: (h) [3 Pts] Suppose you have the following .xml file:

DS 100
Fall 2019 Josh Hug Deb Nolan


CS 61B
Spring 2019 Josh Hug
Fernando Perez

Which of the following XPath queries will return only the strings ”Josh Hug” and ”Deb Nolan” (can have multiple of each)? There is at least one correct answer.
􏰄 //professor/text()
􏰄 //professor/../class/professor/text()
􏰄 //class/professor/../class/professor/text()
􏰄 //semester/../professor/text()
􏰄 /catalog/class[name/text()=”DS 100″]/professor/text() 􏰄 /catalog/class/name[text()=”DS 100″]/professor/text()
(i) [3 Pts] Consider the regular expression \d\w{2,5}d+[hug+]$
Which of the following strings match this regular expression? At least one of these is correct.
􏰄 123445dg
􏰄 1234dddhug 􏰄 61bdug
􏰄 61bdg
􏰄 61bdugggg
􏰄 1hello234gg
(j) [3 Pts] Consider the string 61bdugggg
Which of these regular expressions match the entire string? At least one of these is correct.
􏰄 \dbug*|\w*
􏰄 [61b]+\d{1,3}[a-z]* 􏰄 \d{2}b+[ds100][hug]* 􏰄 .*g$
􏰄 61bdugggg
􏰄 61[b|d]{1}ug+

Data 100 Page 34 of 36 Initials:
13 HCE [6 Pts.]
In this problem, we will ask a somewhat sensitive and complex real world problem. We will be lenient in grading this problem, but we want you to provide an opinion and try to defend it. You will not be penalized for unpopular or ”politically incorrect” opinions. Joke answers will receive no credit. If there is something unclear in the problem description, write your assumption.
In a hypothetical course, all submitted work is automatically reviewed for cheating by plagiarism detection software. However, some students also have the entirety of their subjected to an intensive manual review at the end of the semester. It is not possible to manually review all student’s work due to the large number of students in the course.
One approach is to randomly select students for manual review. An alternate approach is to use a model to try to target students who are more likely to plagiarize. For example, a student who has all perfect scores on assignments but very poor midterm grades might warrant manual review.
Suppose you build a logistic regression model to classify students with one of two labels: ”investi- gate” or ”do not investigate”. Students who are given the ”investigate” label have all of their work carefully reviewed by a teaching assistant (TA) for evidence of cheating. Students who are given the ”do not investigate” label are not manually reviewed at all.
The model uses as features the full text of all of the student’s electronically submitted work, grades on each problem for all assignments and exams, submission times for electronically submitted work, and the full text of all the student’s Piazza posts. The model works by generating a plagiarism probability for each student. Students with a plagiarism probability above a certain threshold will be assigned the ”investigate” label. The model is trained on a dataset collected during previous semesters of the course, where each student has a true label corresponding to whether or not the student was caught plagiarizing.
(a) [3 Pts] Below, describe at least one benefit and at least one downside of using such a logistic regression model compared to the randomized approach.
Solution: Any answer that addresses the prompt should be given full credit. Example answer:
A benefit is that the TAs may have a higher precision with this model – of all the students the TAs look at, more of them may have actually plagiarized than if the students were just randomly chosen. In turn, this causes more students to be caught for plagiarism.
A downside is that the model is trained on previous students who are labeled as having plagiarized or not plagiarized. In the past, a student who may have actually plagiarized might have not been caught, and so the model is unable to detect these ”smart” plagiariz- ers. This means the model may be likely to classify students who plagiarize in ways that were previously undetected as having not plagiarized.

Data 100 Page 35 of 36 Initials:
(b) [3 Pts] Suppose we add a demographic factor to our design matrix, specifically whether the student is international or not. Suppose that after training, the coefficient related to the inter- national feature is non-zero. Is it ethical to include this feature in your model? Why or why not?
Solution: Any answer that addresses the prompt should be given full credit. Example answer:
It is unethical to include this feature in the model because it discriminates against students based on something they cannot control. Even though this only increases the odds their work will be reviewed, there may be false positives during the review process.
Note: Some students took offense at the existence of this question on the exam, some quite gravely. The concern expressed was that the inclusion of this question on the exam reinforces stereotypes against international students. We believe that as data scientists you will face challenging moral questions like these, and we believe that engaging with a difficult question despite the discomforts is worth it. In real world situations, models often make conclusions that seem unjust, e.g. the Amazon applicant review algorithm that we discussed in lecture that specifically lowered the scores of female job applicants.
Models do not exist in a vacuum. They are trained on real world data, and the predictions they generate are used in many different ways. This problem gave you a chance to explore these broader context issues. Many of you highlighted the fact that the conclusions of the algorithm were only used as the first step in a thorough code review process. Others objected to the fact that the algorithm discriminates based on characteristics that a student cannot control. All of these and more were angles worth pursuing.
It is worth noting that we did not say whether the coefficient was positive or negative.

Data 100 Page 36 of 36 Initials:
14
(a)
(b)
1729 [0 Pts.]
[0 Pts] What is the height difference between Josh Hug and Suraj Rampure? (Make sure to specify units.)
Height difference =
[0 Pts] What should Josh name his new kid (assume female if you want a gender specific name)?
Name =
Solution: 1 foot.
Solution: Taco Bell on Durant