DATA 100 Final Exam, Fall 2020
INSTRUCTIONS
This is your exam. Complete it either at exam.cs61a.org or, if that doesn’t work, by emailing course staff with your solutions before the exam deadline.
This exam is intended for the student with email address
For questions with circular bubbles, you should select exactly one choice. You must choose either this option
Or this one, but not both!
For questions with square checkboxes, you may select multiple choices. You could select this choice.
You could select this one too!
You may start your exam now. Your exam is due at
Exam generated for
Preliminaries
You can complete and submit these questions before the exam starts.
(a) What is your full name?

(b) What is your Berkeley email?

(c) What is your student ID number?

(d) When are you taking this exam?

Tuesday 7pm PST

Wednesday 8am PST

Other

(e) Honor Code: All work on this exam is my own.

By writing your full name below, you are agreeing to this code:

(f) Important: You must copy the following statement exactly into the box below. Failure to do so may result in points deducted on the exam.
“I certify that all work on this exam is my own. I acknowledge that collaboration of any kind is forbidden, and that I will face severe penalties if I am caught, including at minimum, harsh penalties to my grade and a letter sent to the Center for Student Conduct.”
Consider sampling students from the audience of a comedy show at UC Berkeley. The theater, which is currently at full capacity, is divided into three sections: Front, Middle, and Back. The following table contains the capacity of each section:
Section   Capacity
Front     20
Middle    35
Back      25
In the first two subparts of this question, we sample 5 students uniformly at random with replacement.

A. i. (1.0 pt) In our sample of 5 students, what is the expected number of students sitting in the middle?
9/4

5/4

35/16

7/16

25/16

None of the above
B. (2.0 pt) In our sample of 5 students, what is the probability that not everyone is in the same section? Select all that apply.

∑_{i=0}^{5} C(5, i) (1/4)^i (5/16)^i (7/16)^i

(1/4)^5 (5/16)^5 (7/16)^5

1 − (1/4)^5 − (5/16)^5 − (7/16)^5

1 − ∑_{i=0}^{5} C(5, i) (1/4)^i (5/16)^{5−i} (7/16)^{5−i}
None of the above
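These answers can be verified exactly with Python’s fractions module. The sketch below is our own check, assuming the section probabilities implied by the 80 seats (Front 1/4, Middle 7/16, Back 5/16):

```python
from fractions import Fraction

# Probabilities of each section when one of the 80 seats is chosen uniformly
p_front, p_middle, p_back = Fraction(20, 80), Fraction(35, 80), Fraction(25, 80)
n = 5  # sample size

# Expected number of sampled students in the middle (binomial mean n * p)
expected_middle = n * p_middle          # 35/16

# P(not everyone in the same section) = 1 - P(all five in one section)
p_not_all_same = 1 - (p_front**n + p_middle**n + p_back**n)

print(expected_middle, p_not_all_same)
```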
ii. Consider the population of UC Berkeley students. We are interested in finding the expectation and variance of the number of students that have a driver’s license in a sample from this population. We are given the following information:
• 70% of students are in-state and 30% of students are out-of-state
• 60% of in-state students have driver’s licenses and 30% of out-of-state students have driver’s
licenses
We sample 120 students uniformly at random with replacement.
A. (2.0 pt) Define the random variable Xi to be 1 if the ith student in our sample has a driver’s license, and 0 otherwise.
What is P (Xi = 1)? Please answer as a decimal rounded to two decimal places.
B. (1.0 pt) How many students do we expect to hold a driver’s license in our sample? Your answer should be an algebraic expression involving prevletter, where prevletter is the correct answer to the previous part.
C. (1.0 pt) What is the variance of the number of students that hold a driver’s license in our sample? Again, your answer should be an algebraic expression involving prevletter, as defined above.
D. (2.0 pt) In the previous two parts, we assumed that we were sampling with replacement. How would your answers to the above two parts change if we were instead sampling without replacement?
Expectation and variance would both stay the same
Expectation and variance would both be different
Expectation would stay the same while variance would be different
Expectation would be different while the variance would stay the same
A: 0.7 · 0.6 + 0.3 · 0.3 = 0.51

B: 120p, where p = prevletter.

C: 120p(1 − p), where p = prevletter.
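As a quick numeric check of these three answers (a sketch of our own, not part of the exam):

```python
# P(X_i = 1) via the law of total probability over in-state / out-of-state
p = 0.7 * 0.6 + 0.3 * 0.3          # 0.51
n = 120

expectation = n * p                # E[number holding a license]
variance = n * p * (1 - p)         # binomial variance (with replacement)
print(p, expectation, variance)
```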
Consider sampling students from the audience of a comedy show at UC Berkeley. The theater, which is currently at full capacity, is divided into three sections: Front, Middle, and Back. The following table contains the capacity of each section:
Section   Capacity
Front     35
Middle    20
Back      25
In the first two subparts of this question, we sample 5 students uniformly at random with replacement.

(b) A. i. (1.0 pt) In our sample of 5 students, what is the expected number of students sitting in the middle?
9/4

5/4

35/16

7/16

25/16
None of the above
B. (2.0 pt) In our sample of 5 students, what is the probability that not everyone is in the same section? Select all that apply.

∑_{i=0}^{5} C(5, i) (1/4)^i (5/16)^i (7/16)^i

(1/4)^5 (5/16)^5 (7/16)^5

1 − (1/4)^5 − (5/16)^5 − (7/16)^5

1 − ∑_{i=0}^{5} C(5, i) (1/4)^i (5/16)^{5−i} (7/16)^{5−i}
None of the above
ii. Consider the population of UC Berkeley students. We are interested in finding the expectation and variance of the number of students that have a driver’s license in a sample from this population. We are given the following information:
• 30% of students are in-state and 70% of students are out-of-state
• 20% of in-state students have driver’s licenses and 80% of out-of-state students have driver’s
licenses
We sample 150 students uniformly at random with replacement.
A. (2.0 pt) Define the random variable Xi to be 1 if the ith student in our sample has a driver’s license, and 0 otherwise.
What is P (Xi = 1)? Please answer as a decimal rounded to two decimal places.
B. (1.0 pt) How many students do we expect to hold a driver’s license in our sample? Your answer should be an algebraic expression involving prevletter, where prevletter is the correct answer to the previous part.
C. (1.0 pt) What is the variance of the number of students that hold a driver’s license in our sample? Again, your answer should be an algebraic expression involving prevletter, as defined above.
D. (2.0 pt) In the previous two parts, we assumed that we were sampling with replacement. How would your answers to the above two parts change if we were instead sampling without replacement?
Expectation and variance would both stay the same
Expectation and variance would both be different
Expectation would stay the same while variance would be different
Expectation would be different while the variance would stay the same
A: 0.3 · 0.2 + 0.7 · 0.8 = 0.62

B: 150p, where p = prevletter.

C: 150p(1 − p), where p = prevletter.
Throughout this question, we are dealing with pandas DataFrame and Series objects. All code for this question, where applicable, must be written in Python. You may assume that pandas has been imported as pd.
The following DataFrame cars contains the names of car models from 1970 to 1982. The name column is the primary key of the table.
The first five rows are shown below.
name                 mpg   horsepower  weight  acceleration  year  origin  brand
toyota corolla 1200  32.0  65          1836    21.0          1974  Japan   toyota
buick skylark 320    15.0  165         3693    11.5          1970  USA     buick
fiat 128             29.0  49          1867    19.5          1973  Europe  fiat
ford mustang gl      27.0  86          2790    15.6          1982  USA     ford
ford torino          17.0  140         3449    10.5          1970  USA     ford
(a) (2.0 pt) Below, write a line of Pandas code that creates a Series of the names of cars created by brand “carbrand” with greater than mpgnum mpg. The resulting Series should be assigned to the variable varname.
(b) (4.0 pt) Below, write a line of Pandas code to create a DataFrame containing data only for those car models whose brands have at least mpgnum2 mpg for each of their models. The resulting DataFrame must have the same structure and format as cars. The resulting DataFrame should be assigned to the variable varname2.
varname = cars[(cars["brand"] == "carbrand") & (cars["mpg"] > mpgnum)]["name"]

varname2 = cars.groupby("brand").filter(lambda x: min(x["mpg"]) >= mpgnum2)
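Both answers can be exercised on a toy reconstruction of the table above. The values of carbrand, mpgnum, and mpgnum2 below are hypothetical placeholders chosen only to make the example run:

```python
import pandas as pd

# Toy version of the first five rows of cars (subset of columns)
cars = pd.DataFrame({
    "name": ["toyota corolla 1200", "buick skylark 320", "fiat 128",
             "ford mustang gl", "ford torino"],
    "mpg": [32.0, 15.0, 29.0, 27.0, 17.0],
    "brand": ["toyota", "buick", "fiat", "ford", "ford"],
})

carbrand, mpgnum, mpgnum2 = "ford", 20, 25  # hypothetical placeholders

# (a) names of cars by carbrand with mpg greater than mpgnum
varname = cars[(cars["brand"] == carbrand) & (cars["mpg"] > mpgnum)]["name"]

# (b) rows for brands whose every model has at least mpgnum2 mpg
varname2 = cars.groupby("brand").filter(lambda x: min(x["mpg"]) >= mpgnum2)
```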
In this question, we’re interested in finding the number of classes taken by students at Zoom University. We will be working with two DataFrames, students and enrollment. Throughout this question, you may assume that pandas has been imported as pd.
Each row in the students DataFrame represents a student. The students DataFrame contains the following columns:
• student_name: the student’s name
• SID: the student’s ID
• major: the student’s major
Here are the first four rows in students:

   student_name  SID  major
0  Alice Red     123  Computer Science
1  Bob Lime      128  Biology
2  Susie Orange  209  Anthropology
3  Frank Blue    212  History
Each row in the enrollment DataFrame represents an enrollment record for a specific student in a single class. If a student is enrolled in multiple classes, each class taken by the student is a separate row in enrollment. The enrollment DataFrame contains the following columns:
• SID: the student’s ID
• class_name: the name of the class the student is enrolled in
• class_id: the ID of the class
Here are the first five rows in enrollment:
   SID  class_name             class_id
0  123  Intro to Data Science  200
1  128  Organic Chemistry      145
2  128  Intro to Data Science  100
3  209  US History             185
4  212  US History             185
Note: It is possible for rows with different class_id to share the same class_name in the enrollment DataFrame. For example, there is an “Intro to Data Science” with class_id 100 and another “Intro to Data Science” with class_id 200.
(a) (4.0 pt) Suppose you are asked to add a column num_class to the students DataFrame that indicates the number of classes each student is enrolled in. If a student does not have any enrollment records, they should have a value of 0 in num_class. You are allowed to change the index of students, but the number of rows should stay the same after adding the column, and the name and major columns should be kept the same.
Which of the following accomplishes this task? There is only one correct answer.
A:
num_class = students.merge(enrollment, left_on='student_name', right_on='class_name', how='right')
                    .groupby('SID').count()
num_class = num_class.drop(columns=['class_name', 'student_name', 'major'])
num_class = num_class.rename(columns={'class_id': 'num_class'})
students = students.merge(num_class, left_on='SID', right_index=True)
B:
num_class = enrollment.groupby('SID').count()
num_class = num_class.set_index('SID')
num_class = num_class.rename(columns={'class_id': 'num_class'})
students['num_class'] = num_class['class_id']
C:
num_class = students.merge(enrollment, on='SID', how='outer').groupby('SID').count()
num_class = num_class.drop(columns=['class_name', 'student_name', 'major'])
num_class = num_class.rename(columns={'class_id': 'num_class'})
students = students.merge(num_class, left_on='SID', right_index=True)
A
B
C
None of the above
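One way to check the options is against a direct computation on toy versions of the two tables (values taken from the rows shown in the question). The map-based count below is our own reference sketch, not one of the answer choices:

```python
import pandas as pd

students = pd.DataFrame({
    "student_name": ["Alice Red", "Bob Lime", "Susie Orange", "Frank Blue"],
    "SID": [123, 128, 209, 212],
    "major": ["Computer Science", "Biology", "Anthropology", "History"],
})
enrollment = pd.DataFrame({
    "SID": [123, 128, 128, 209, 212],
    "class_name": ["Intro to Data Science", "Organic Chemistry",
                   "Intro to Data Science", "US History", "US History"],
    "class_id": [200, 145, 100, 185, 185],
})

# Reference: number of classes per student, 0 when a student has no records
counts = enrollment.groupby("SID").size()
students["num_class"] = students["SID"].map(counts).fillna(0).astype(int)
print(students)
```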
(b) (4.0 pt) Now you are asked to find all unique majors across all students enrolled in Intro to Data Science. Specifically, you need to create a Series ds_majors that has majors as the index and the counts of students enrolled in Intro to Data Science in each major as the values.
Which of the following accomplishes this task? There is only one correct answer.
A:
ds = enrollment[enrollment['class_name'] == 'Intro to Data Science']
ds_majors = ds.merge(students, on='SID', how='outer').groupby('major')['SID'].count()
B:
ds = enrollment[enrollment['class_name'] == 'Intro to Data Science']
ds_majors = ds.merge(students, on='SID', how='left').groupby('major')['SID'].count()
C:
major_count = students.groupby('major').count()
merged = enrollment.merge(major_count, on='SID')
ds = merged[merged['class_name'] == 'Intro to Data Science']
ds_majors = ds['major']
A
B
C
None of the above
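The difference between the merge strategies is easy to see on the same toy tables: with how='outer', students who never took Intro to Data Science are pulled in as well, so their majors get counted too. This is a sketch with hypothetical data mirroring the question:

```python
import pandas as pd

students = pd.DataFrame({
    "SID": [123, 128, 209, 212],
    "major": ["Computer Science", "Biology", "Anthropology", "History"],
})
enrollment = pd.DataFrame({
    "SID": [123, 128, 128, 209, 212],
    "class_name": ["Intro to Data Science", "Organic Chemistry",
                   "Intro to Data Science", "US History", "US History"],
})

ds = enrollment[enrollment["class_name"] == "Intro to Data Science"]
left = ds.merge(students, on="SID", how="left").groupby("major")["SID"].count()
outer = ds.merge(students, on="SID", how="outer").groupby("major")["SID"].count()

# left counts only enrolled students' majors; outer also counts the rest
print(left.to_dict(), outer.to_dict())
```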
A biology class grows and weighs yams as part of a class project. Some yams were grown in hot water and some were grown in cold water. A student, Shirley, decides to create a histogram of the yam weights.
(a) (2.0 pt) Professor Kane decides that yams weighing between 8 and 9 kilograms are his favorite. What percentage of yams weigh between 8 and 9 kilograms?
20%
25%
30%
35%
Impossible to tell
With the information given, you can’t determine how many yams in the 7.5-8.5 kg bin weigh over 8 kg.
(b) (2.0 pt) Another student, Jeff, suspects that the yams grown in hot water didn’t grow as well as the yams grown in cold water and as such ended up weighing less. If 20 yams were grown in total and weighed, how many yams weigh less than 7 kilograms?
3
4
5
6
Impossible to tell
The proportion of yams weighing less than 7 kilograms is given by (6 − 5)(0.15) + (6.5 − 6)(0.2) + (7 − 6.5)(0.1) = 0.15 + 0.1 + 0.05 = 0.3. That gives us 20 · 0.3 = 6 yams weighing less than 7 kilograms.
(c) Consider the median bin (7.5 to 8.5 kilograms) and the maximum bin (9.5 to 10 kilograms). Which bin contains more yams?

Median bin (7.5 to 8.5 kg bin)
Maximum bin (9.5 to 10 kg bin)
They contain the same number of yams
Impossible to tell
(8.5 − 7.5)(0.25) = (10 − 9.5)(0.5) = 0.25, so the two bins contain the same proportion, and hence the same number, of yams.
(a) Suppose we have the following dataset from the neighborhood CVS store on Shattuck. The table shows total rain (mm) for each quarter and total number of umbrellas sold for each quarter. Note: For the first three parts of this question, our dataset only has these four rows.
Quarter   Total rain (mm)  Total number of umbrellas sold
Jan-Mar   300              200
Apr-Jun   50               40
Jul-Sep   10               10
Oct-Dec   200              100
i. (2.0 pt) We first decide to model umbrella sales using the constant model ŷ = θ. We will use squared loss as our loss function (no regularization).

Which expression below correctly gives the average loss of our fitted model on the given dataset? Select the closest answer.

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − 87.5x_i)^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − 140)^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − 87.5)^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − 140x_i)^2

The MSE-minimizing value for the constant model is the mean, which in this case is (200 + 40 + 10 + 100)/4 = 87.5.
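A one-line check of the fitted constant (our own sketch):

```python
import numpy as np

y = np.array([200, 40, 10, 100])   # umbrellas sold per quarter
theta_hat = y.mean()               # MSE-minimizing constant prediction
mse = np.mean((y - theta_hat) ** 2)
print(theta_hat)  # 87.5
```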
ii. (3.0 pt) Now we decide to fit a simple linear model with an intercept term ŷ = θ_0 + θ_1x that predicts total number of umbrellas sold (y) given total rain (x). We will use squared loss as our loss function, and we will not use regularization.

We are given r = 0.979, σ_x = 116.40, and σ_y = 72.59, which are the correlation coefficient, standard deviation of x, and standard deviation of y, respectively.

Which expression below correctly gives the average loss of our fitted model on the given dataset? Select the closest answer.

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − (10 + 0.61x_i))^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − (0.61 + 2x_i))^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − (2.57 + 1.57x_i))^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (y_i − (2 + 0.61x_i))^2

θ̂_1 = r·σ_y/σ_x = 0.98 · 72.59/116.4 ≈ 0.61

θ̂_0 = ȳ − θ̂_1·x̄ = 87.5 − 0.61 · 140 ≈ 2
iii. (3.0 pt) For whatever reason, we decide to reverse our model. That is, we decide to predict total rain (x) given total number of umbrellas sold (y) using a simple linear model with an intercept term x̂ = θ_0 + θ_1y. Again, we will use squared loss as our loss function, and we will not use regularization.

Which expression below correctly gives the average loss of our fitted model on the given dataset? Select the closest answer.

R̂(θ) = (1/4) ∑_{i=1}^{4} (x_i − (10 + 1.57y_i))^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (x_i − (0.61 + 2y_i))^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (x_i − (2.57 + 1.57y_i))^2

R̂(θ) = (1/4) ∑_{i=1}^{4} (x_i − (2 + 0.61y_i))^2

θ̂_1 = r·σ_x/σ_y = 0.98 · 116.4/72.59 ≈ 1.57

θ̂_0 = x̄ − θ̂_1·ȳ = 140 − 1.57 · 87.5 ≈ 2.57
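Both fitted lines can be recovered numerically from the four data points; note the reversed fit is not the algebraic inverse of the forward fit. A sketch with our own variable names:

```python
import numpy as np

x = np.array([300, 50, 10, 200])   # total rain (mm)
y = np.array([200, 40, 10, 100])   # umbrellas sold

r = np.corrcoef(x, y)[0, 1]

# Forward model: y-hat = theta0 + theta1 * x
slope_xy = r * y.std() / x.std()
intercept_xy = y.mean() - slope_xy * x.mean()

# Reversed model: x-hat = theta0 + theta1 * y
slope_yx = r * x.std() / y.std()
intercept_yx = x.mean() - slope_yx * y.mean()

print(round(slope_xy, 2), round(intercept_xy, 1))
print(round(slope_yx, 2), round(intercept_yx, 2))
```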
For the remainder of this question, assume that we have many more rows of data, not just the four given originally.
In the first part of this question, we didn’t use the Quarter column. Let’s suppose we want to one-hot encode Quarter for use in our model, but with a twist – we only want to encode whether or not the current Quarter is Jul-Sep, since that’s when rainfall is at a low.
The resulting design matrix, along with an intercept column, is provided below. (Note, the “Total number of umbrellas sold” column is no longer visible since it’s not part of our design matrix.)

Intercept  Quarter=Jul-Sep  Quarter!=Jul-Sep  Total rain (mm)
1          0                1                 300
1          0                1                 50
1          1                0                 10
...        ...              ...               ...
1          0                1                 200
We fit two different linear models using ordinary least squares, both of which use a subset of the columns of the above design matrix:
• We fit a linear model on all columns except Quarter!=Jul-Sep. After doing so, we end up with the following fitted model, where our optimal model parameter is θ̂ = [letter1, letter2, letter3]^T:

ŷ = letter1 + letter2 · (Quarter=Jul-Sep) + letter3 · (Total rain)

• We fit a linear model on all columns except Quarter=Jul-Sep. After doing so, we end up with the following fitted model, where our optimal model parameter is β̂ = [D, E, F]^T:

ŷ = D + E · (Quarter!=Jul-Sep) + F · (Total rain)
In this problem, you will express D, E, and F in terms of letter1, letter2, and letter3. Your answers should all be algebraic expressions, for instance “100 * letter1 * letter2 * letter3” (that is not the correct answer to any of these parts). If you don’t believe it’s possible to determine the answer, just write “not possible”.
i. What is D in terms of letter1, letter2, and letter3?

Note, this solution assumes A = letter1, B = letter2, and C = letter3.

D = A + B

We know that the design matrices used in both models convey the same information, i.e. their spans are the same. (Specifically, Quarter=Jul-Sep = Intercept − Quarter!=Jul-Sep.) Thus, their predictions must also be the same.

Let x represent the column Quarter=Jul-Sep in the first matrix, and let z represent the total rain column. We then have

A + Bx + Cz = D + E(1 − x) + Fz

Changing the column we one-hot encoded in this case doesn’t affect the coefficient on total rain, so C = F. Looking closer at the first two terms on both sides, we have

A + Bx = D + E − Ex

Since the LHS and RHS above must be equal regardless of the value x takes on, we have A = D + E and B = −E. That gives E = −B and D = A − E = A + B, as needed.
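This relationship is easy to confirm numerically: fit both parameterizations by ordinary least squares on randomly generated data. The data and coefficients below are hypothetical, not the exam’s:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.integers(0, 2, n).astype(float)       # dummy: Quarter == Jul-Sep
rain = rng.uniform(0, 300, n)                 # total rain
y = 5 + 3 * x + 0.4 * rain + rng.normal(0, 1, n)
ones = np.ones(n)

# Model 1: columns [Intercept, Quarter=Jul-Sep, rain] -> [A, B, C]
A, B, C = np.linalg.lstsq(np.column_stack([ones, x, rain]), y, rcond=None)[0]
# Model 2: columns [Intercept, Quarter!=Jul-Sep, rain] -> [D, E, F]
D, E, F = np.linalg.lstsq(np.column_stack([ones, 1 - x, rain]), y, rcond=None)[0]

# Same span, same predictions: D = A + B, E = -B, F = C
print(D, A + B, E, -B, F, C)
```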
ii. (2.0 pt) What is E in terms of letter1, letter2, and letter3?

E = −B

iii. (2.0 pt) What is F in terms of letter1, letter2, and letter3?

F = C

iv. (1.0 pt) Suppose we now regularize the previous two models using L2 regularization with some fixed value of λ > 0.

We denote the optimal regularized model parameters by θ̂_ridge and β̂_ridge, corresponding to the first and second models in the previous part, respectively. All three of our features, including our intercept term, are regularized.

True or False: The relationships involving D, E, F, letter1, letter2, and letter3 from the previous part still hold true, even though our model is now regularized.

True

False

Though θ̂ and β̂ from the previous parts have the same predictions, their components are different, and they have different L2 norms. Thus, we can’t use the same “comparison trick” we used in the previous part to match values of θ̂_ridge and β̂_ridge.

However, if we instead didn’t regularize the intercept term, the relationships would remain true!
(a) In class, we derived the following bias-variance decomposition under a specific set of conditions:

model risk = σ^2 + (model bias)^2 + model variance

We assume that there is an unknown underlying function g(x) that generates the points we observe. Specifically, we observe Y_i = g(x_i) + ε_i, where ε_i is a zero-mean noise term with variance σ^2 that is independent for each observation. Our model’s goal is to approximate g(x) as best as possible.
i. (1.0 pt) Does this decomposition hold true for linear models and squared loss?

Yes

No
We derived this specific equation by decomposing mean squared error. We did not make any assumptions
about the kind of prediction function we were using; it holds true for any model with squared loss.
ii. (1.0 pt) Does this decomposition hold true for non-linear models and squared loss?

Yes

No

iii. (1.0 pt) Does this decomposition hold true for linear models and absolute loss?

Yes

No

iv. (1.0 pt) Does this decomposition hold true for classification decision trees and zero-one loss? (Zero-one loss is equal to 0 if a prediction is correct, and 1 if it is incorrect.)

Yes

No
What effect does pruning a decision tree have on its

i. (1.0 pt) Bias?

Increases it

Decreases it
When we prune a decision tree, we remove branches that are not crucial to its classifications. This reduces complexity, which increases bias while reducing variance.
ii. (1.0 pt) Variance?

Increases it

Decreases it

iii. (1.0 pt) Complexity?

Increases it

Decreases it

Depends on the splitting rule
For each of the following prompts, answer true if the given modification to k-fold cross-validation will result in overfitting, and false if it will not. Assume that we have a large dataset that we have split into a training set and test set.
(a) (1.0 pt) The test set is divided into k folds. For each fold of the test set, we use the entire training set to train the model, and use the given fold/subset of the test set for validation. The average error among all k folds is the cross-validation error.
True or False: This modification will result in overfitting.
True
False
We shouldn’t be using the test set for validation purposes; that defeats the purpose of cross-validation.
(b) (1.0 pt) We use normal k-fold cross-validation, but for each fold we only use half of the validation set for validation.
True or False: This modification will result in overfitting.
True
False
This will not cause overfitting, but it is essentially throwing away data; we could be training our model on more data without overfitting.
(c) (1.0 pt) We use normal k-fold cross-validation, but for each fold we use the entire training set for training.

True or False: This modification will result in overfitting.
True
False
The purpose of training on k − 1 folds and using the remaining fold for validation is to not train and validate our model on the same fold. By making the modification proposed in the question, we would be doing just that.
(d) (1.0 pt) We use normal k-fold cross-validation, but after the train-test split, we standardize the training set before running cross-validation so that each column has mean 0 and variance 1.
True or False: This modification will result in overfitting.

True
False
This subpart is tricky; this is what you’re supposed to do. If you standardize before the train-test split, you’re encoding information about the test set into your training set (to standardize, you need to know the mean of a column, but if you compute the mean of a given column for the entire dataset, that gives you information about what the test data’s mean is). This principle is called leakage.
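A minimal sketch of the leak-free recipe described above: compute standardization statistics on the training set only, then reuse them on held-out data (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=10, scale=3, size=(100, 2))
train, test = data[:80], data[80:]

# Statistics come from the training set alone
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma   # same statistics reused: no leakage

print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
```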
Consider the following model:
f_θ(x) = θ_0 + 2^{θ_1}·x + θ_1θ_2·x^2

We have a training dataset with two observations (x_i, y_i): {(1, 1), (2, 3)}.

In order to determine optimal model parameters θ̂_0, θ̂_1, and θ̂_2, we choose squared loss with L2 regularization. Assume that the regularization hyperparameter λ = 1/2 for the entirety of this question, and assume that we regularize the intercept term θ_0. Our objective function is the sum of our loss function averaged across our entire dataset and a regularization penalty.

We decide to use gradient descent to help us solve for the optimal parameters.

(a) (3.0 pt) Which of the following is equal to the objective function for our model, loss, regularization, and training data?

R(θ) = (θ_0 + 2^{θ_1} + θ_1θ_2 − 1)^2 + (θ_0 + 2^{θ_1+1} + 4θ_1θ_2 − 3)^2 + (1/2)(θ_0^2 + θ_1^2 + θ_2^2)

R(θ) = (1/2)[(1 − (θ_0 + 2^{θ_1} + θ_1θ_2))^2 + (3 − (θ_0 + 2^{θ_1+1} + 4θ_1θ_2))^2] + |θ_0| + |θ_1| + |θ_2|

R(θ) = (1/2)[(1 − (θ_0 + 2^{θ_1} + θ_1θ_2))^2 + (3 − (θ_0 + 2^{θ_1+1} + 4θ_1θ_2))^2] + 2(θ_1^2 + θ_2^2)

R(θ) = (1/2)[(θ_0 + 2^{θ_1} + θ_1θ_2 − 1)^2 + (θ_0 + 2^{θ_1+1} + 4θ_1θ_2 − 3)^2 + θ_0^2 + θ_1^2 + θ_2^2]

The last option is the only one equivalent to

(1/n) ∑_{i=1}^{n} (y_i − f_θ(x_i))^2 + λ ∑_{i=0}^{p} θ_i^2

Note that in this question 1/n = λ, which makes things tricky.

Suppose we start our gradient descent procedure at the initial guess θ^(0) = [a, b, c]^T, where a, b, c are some constants.

Then ∂R/∂θ_0 evaluated at θ = θ^(0), the partial derivative of our objective function with respect to θ_0 at our initial guess, is of the form

G·a + H·2^b + 5bc − 4

where G and H are integers.
(b)
i. (3.0 pt) What is G?

-3
-2
-1
0
1
2
3

Starting with

R(θ) = (1/2)[(θ_0 + 2^{θ_1} + θ_1θ_2 − 1)^2 + (θ_0 + 2^{θ_1+1} + 4θ_1θ_2 − 3)^2 + θ_0^2 + θ_1^2 + θ_2^2]

we can take the partial derivative with respect to θ_0:

∂R/∂θ_0 = (1/2)[2(θ_0 + 2^{θ_1} + θ_1θ_2 − 1) + 2(θ_0 + 2·2^{θ_1} + 4θ_1θ_2 − 3) + 2θ_0] = 3θ_0 + 3·2^{θ_1} + 5θ_1θ_2 − 4

We’re told we’re evaluating the partial derivative at our initial guess θ^(0) = [a, b, c]^T. Then ∂R/∂θ_0 evaluated at θ = θ^(0) equals 3a + 3·2^b + 5bc − 4, giving G = 3 and H = 3.

ii. (3.0 pt) What is H?

-3
-2
-1
0
1
2
3
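The claimed partial derivative can be checked against a central finite difference at an arbitrary (hypothetical) point:

```python
def R(t0, t1, t2):
    # Objective: (1/2)[(f(1) - 1)^2 + (f(2) - 3)^2 + t0^2 + t1^2 + t2^2]
    return 0.5 * ((t0 + 2**t1 + t1 * t2 - 1) ** 2
                  + (t0 + 2 ** (t1 + 1) + 4 * t1 * t2 - 3) ** 2
                  + t0**2 + t1**2 + t2**2)

a, b, c = 0.3, -0.7, 1.2                      # arbitrary test point
analytic = 3 * a + 3 * 2**b + 5 * b * c - 4   # claimed dR/dtheta0

eps = 1e-6
numeric = (R(a + eps, b, c) - R(a - eps, b, c)) / (2 * eps)
print(numeric, analytic)
```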
(c) Suppose we define γ = [γ_0, γ_1, γ_2]^T such that

γ_0 = θ_0, γ_1 = 2^{θ_1}, γ_2 = θ_2

Can we use ridge regression to find γ̂?

Yes

No

The updated model is f_γ(x_i) = γ_0 + γ_1·x_i + log_2(γ_1)·γ_2·x_i^2, which is not linear in γ, meaning we can’t use ridge regression to solve for γ̂.
(d) (1.0 pt) Suppose our model is instead

f_θ(x_i) = θ_0 + 2^{θ_1}·x_{i,1} + θ_2·x_{i,1}·x_{i,2}

where x_{i,1} and x_{i,2} are scalars corresponding to feature 1 and feature 2 for observation i, respectively. Let γ be as defined in the previous part.

Can we use ridge regression to find γ̂?

Yes

No

Our updated model is f_γ(x_i) = γ_0 + γ_1·x_{i,1} + γ_2·x_{i,1}·x_{i,2}. This is indeed linear in terms of γ, so we can use ridge regression to find γ̂.
Below is a buggy implementation of sgd, a function which is supposed to perform stochastic gradient descent with batch size B on the training dataset X and y by applying the gradient gradient_function with learning rate alpha.
def sgd(X, y, theta0, gradient_function, alpha, B, max_iter=100000):
    """
    Performs stochastic gradient descent.

    Args:
        X: A 2D array, the dataset, with features stored in columns
            and observations stored in rows
        y: A 1D array, the outcome values
        theta0: A 1D array, the initial weights
        gradient_function: A function that takes in a vector
            of weights, a dataset, and outcome values and
            returns the value of the gradient
        alpha: A float, the learning rate
        B: An integer, the batch size
        max_iter (optional): The maximum number of iterations
            to attempt during SGD

    Returns:
        A 1D array of optimal weights

    Notes:
        gradient_function takes 3 arguments: a 1D array of weights,
        a 2D array of data points, and a 1D array of outcomes. It
        returns a 1D array of the same shape as the weights, the
        value of the gradient evaluated with those parameters.
    """
    theta = theta0
    for _ in range(max_iter):
        idx = np.random.choice(X.shape[1], size=B, replace=True)
        Xb, yb = X[idx, :], y[idx]
        grad = gradient_function(theta, Xb, yb)
        theta = theta - alpha * grad
    return theta
Which of the following edits need to be made to the implementation of sgd above so that it works correctly
(as specified in class)? Select all that apply.
X.shape[1] should be replaced with X.shape[0]
size=B should be replaced with size=X.shape[0]
replace=True should be replaced with replace=False
theta - alpha*grad should be replaced with theta + alpha*grad

gradient_function(theta, Xb, yb) should be replaced with gradient_function(theta, X, y)

X[idx,:], y[idx] should be replaced with X[:, idx], y
None of the above
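For reference, here is a corrected sketch with the needed fix applied (observations live in rows, so indices are sampled from X.shape[0]); the quadratic-loss gradient and toy data at the bottom are illustrative additions of our own:

```python
import numpy as np

def sgd_fixed(X, y, theta0, gradient_function, alpha, B, max_iter=2000):
    """SGD that samples B observations (rows of X) per step."""
    theta = theta0
    for _ in range(max_iter):
        idx = np.random.choice(X.shape[0], size=B, replace=True)  # rows!
        Xb, yb = X[idx, :], y[idx]
        theta = theta - alpha * gradient_function(theta, Xb, yb)
    return theta

def mse_grad(theta, Xb, yb):
    # Gradient of mean squared error for a linear model
    return 2 / len(yb) * Xb.T @ (Xb @ theta - yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([1.5, -2.0])
y = X @ true_theta                  # noise-free toy regression
theta_hat = sgd_fixed(X, y, np.zeros(2), mse_grad, alpha=0.05, B=20)
```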
In this problem, we’ll be using logistic regression to build a classifier that differentiates between 2 varieties of wine produced in the same region of Italy.
In this problem, assume the following:
• We are working with a design matrix X with two features: the hue of the wine (hue, x1) and its alcohol by volume (abv, x2). Note that both hue and abv are quantitative (hue is a quantitative measure of a wine’s color).
• X is standardized.
• All wines are either type 0 or 1 (y).
We are modeling the probability that a particular wine is of type 1 using P(Y = 1 | x) = σ(θ_1 · hue + θ_2 · abv)
(a) (2.0 pt) Consider the following scatter plot of our two (standardized) features. Note, this scatter plot is only relevant in this subpart of the question.
Which of the following statements are true about an unregularized logistic regression model fit on the above data? Select all that apply.
After performing logistic regression, the weight for the hue feature will very likely have a negative sign.
After performing logistic regression, the weight for the abv feature will very likely have a negative sign.
After performing logistic regression, the abv feature will very likely have a higher magnitude weight than the hue feature.
This data is linearly separable between the two wine types without any feature transformations.
(b) Below are three rows from our training data, along with our model’s predictions ŷ for some choice of θ:

hue    abv    y  ŷ
-0.17  0.24   0  0.45
-1.18  1.61   0  0.19
1.25   -0.97  1  0.80

What is the mean cross-entropy loss on just the above three rows of our training data?

−(1/3)[log(0.45) + log(0.19) + log(0.20)]

−(1/3)[log(0.55) + log(0.19) + log(0.80)]

−(1/3)[log(0.45) + log(0.81) + log(0.80)]

−(1/3)[log(0.55) + log(0.81) + log(0.80)]

None of the above
(c) (3.0 pt) After thresholding ŷ, we compute a confusion matrix for our model’s predictions. As a reminder, type 0 and type 1 refer to wine types.

               Predicted Type 0  Predicted Type 1
Actual Type 0  57                ???
Actual Type 1  ???               62

For some reason, our confusion matrix is corrupted, and doesn’t contain the information on the off-diagonals. However, we somehow know that our model’s accuracy is 119/130 and our model’s precision is 31/32.

What is our model’s recall? Give your answer as a reduced fraction with no spaces, i.e. in the form a/b (no decimals or spaces).

62/71

From the confusion matrix, we’re given TP = 62 and TN = 57. We’re also given that the accuracy is 119/130, and we know that accuracy is (TP + TN)/(TP + TN + FP + FN). Since TP + TN = 119, we know that FP + FN = 11.

Since we know the precision is 31/32, we can solve for FP:

TP/(TP + FP) = 62/(62 + FP) = 31/32

Thus, 62 + FP = 64 and FP = 2. This means that FN = 11 − FP = 9, and the recall is

TP/(TP + FN) = 62/(62 + 9) = 62/71
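The algebra can be reproduced exactly with fractions (our own sketch):

```python
from fractions import Fraction

TP, TN, total = 62, 57, 130
accuracy = Fraction(119, 130)
precision = Fraction(31, 32)

assert Fraction(TP + TN, total) == accuracy   # consistency check
fp_plus_fn = total - TP - TN                  # accuracy  => FP + FN = 11
FP = TP / precision - TP                      # precision => FP = 2
FN = fp_plus_fn - FP                          # FN = 9
recall = TP / (TP + FN)
print(recall)  # 62/71
```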
(d) Suppose we choose θ̂ = [2, 1]^T. Consider the wine “Billywine” with hue 1/4 and abv −2.
i. What is β, the odds that Billywine is of type 1? There is only one correct answer.

β = 3/2

β = −3/2

β = e^{3/2}

β = e^{−3/2}

β = σ(−3/2)

β = log(−3/2)

β = (3/2)/(1 + 3/2)

We know that with the logistic regression model, the log-odds of the probability of belonging to class 1 is linear, and specifically is x^Tθ̂. Here, x^Tθ̂ = 2 · (1/4) + 1 · (−2) = −3/2.

Then, the odds is the log-odds exponentiated with base e:

β = e^{x^Tθ̂} = e^{−3/2}
ii. What is γ, the probability that Billywine is of type 1? Select all that apply. (β is as defined in the previous subpart.)

γ = e^{−3/2}

γ = σ(−3/2)

γ = σ(3/2)

γ = (β − 1)/β

γ = β/(β + 1)

As we’ve studied in class, P(Y = 1 | x) = σ(x^Tθ̂) = σ(−3/2), which gives one answer choice.

To arrive at the other, we need to realize that

σ(x^Tθ̂) = 1/(1 + e^{−x^Tθ̂}) = e^{x^Tθ̂}/(1 + e^{x^Tθ̂})

Since β = e^{x^Tθ̂}, another correct answer choice is β/(β + 1).
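Both facts check out numerically (our own sketch; variable names are ours):

```python
import math

log_odds = 2 * 0.25 + 1 * (-2)          # x^T theta-hat for Billywine = -3/2
beta = math.exp(log_odds)               # odds, e^{-3/2}
gamma = 1 / (1 + math.exp(-log_odds))   # sigma(x^T theta-hat)

# sigma(-3/2) and beta / (beta + 1) are the same number
print(beta, gamma, beta / (beta + 1))
```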
iii. (2.0 pt) Suppose that we choose a threshold T such that the decision boundary of our model is 2 · hue + abv = 3/2. What value of T results in this decision boundary? There is only one correct answer. (β and γ are as defined in the previous two subparts.)

T = β

T = e^{−γ}

T = γ

T = −β

T = 1 − β

T = log(γ/(1 − γ))

T = 1 − γ

Notice that in this specific case, the decision boundary σ(2 · hue + abv) = T is equivalent to 2 · hue + abv = −x^Tθ̂. This is because x^Tθ̂ = −3/2, and we are given that 2 · hue + abv = 3/2.

This means that σ^{−1}(T) = −x^Tθ̂, or equivalently that T = σ(−x^Tθ̂).

We know that σ(−t) = 1 − σ(t). We also know that γ = σ(x^Tθ̂). This means that

T = σ(−x^Tθ̂) = 1 − σ(x^Tθ̂) = 1 − γ
(a) Suppose we are given the following scatter plot.
We have data that is plotted in the space of features x1 and x2. Suppose we want to perform PCA on these two features.
i. (1.0 pt) Which of the following is most likely to be the equation of the line representing PC 1?

x_2 = (11/3)x_1 − 9

x_2 = 3x_1

x_2 = −(20/3)x_1 + 5

x_2 = (2/3)x_1

x_2 = −3x_1
x_2 = −3x_1

x_2 = (1/3)x_1 + 5

x_2 = −4x_1 + 10

x_2 = 3x_1

x_2 = −(3/2)x_1
(b) In this part of this question, we will look at emotion ratings of images for a psychology experiment. Each row of the DataFrame F represents an image, and each column represents an emotion. There are 940 images and 7 emotions. An example row of F is provided below.
Say we perform the SVD on F using the following code:

X = (F - np.mean(F, axis=0))
u, s, vt = np.linalg.svd(X, full_matrices=False)
i. (1.0 pt)
The above scree plot depicts the proportion of variance captured by each PC. Ignoring the plot’s title, which of the following lines of code could have created the above plot?
plt.plot(s**2/np.sum(s**2), u)

plt.plot(F[:, :7], s**2/np.sum(s))

plt.plot(np.arange(1, F.shape[1]+1), s**2/np.sum(s**2))

plt.plot(np.arange(1, F.shape[1]+1), s**2/np.sum(s))

plt.plot(u@s, s**2/np.sum(s**2))
As covered in class.
s[1]?
0.3
3.3
6
8
36
From the above scree plot, we know that the proportion of variance captured by PC 2 is roughly 0.3.
This means σ₂² / (σ₁² + σ₂² + · · · + σ₇²) = 0.3. We are told the denominator is 121, which gives σ₂² = 36.3 ⟹ σ₂ ≈ 6.
(Note that due to Python's zero-indexing, σ₂ corresponds to s[1].)
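The arithmetic here can be checked quickly; a minimal sketch assuming, as stated, that the squared singular values sum to 121:

```python
import numpy as np

total = 121                    # sum of squared singular values (given)
prop_pc2 = 0.3                 # proportion of variance from the scree plot
sigma2_sq = prop_pc2 * total   # 36.3
sigma2 = np.sqrt(sigma2_sq)
print(round(sigma2, 2))        # 6.02, i.e. roughly 6
```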
iii. (1.0 pt) Which of the following statements evaluates to True?
(u @ np.diag(s)).shape == (940, 7)
(u @ np.diag(s)).shape == (7, 7)
(u @ np.diag(s)).shape == (940, 940)
(u @ np.diag(s)).shape == (7, 940)
None of the above
Our PC matrix UΣ has the same dimensions as our data matrix.
iv. (1.0 pt) True or False: Ignoring numerical precision issues, the expression

np.var((X @ vt.T)[:, i]) == s[i]**2 / len(X)

evaluates to True for all integers i between 0 and X.shape[1] - 1.
True
False
Impossible to tell
Both the left and right sides are equal to the variance of PC i.
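This identity is straightforward to confirm on random centered data; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(940, 7))
X = X - X.mean(axis=0)                     # center, as in the question

u, s, vt = np.linalg.svd(X, full_matrices=False)

for i in range(X.shape[1]):
    # variance of the i-th principal component equals s[i]**2 / n
    assert np.isclose(np.var((X @ vt.T)[:, i]), s[i]**2 / len(X))
print("all components match")
```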
(a) Consider the following three datasets, each consisting of two features (x1 and x2) and a class label (red crosses and blue stars).
The green triangle in Dataset 3 represents a point with an overlapping red cross and blue star point at the same position. Assume that otherwise, there are no overlapping points of different classes in any of the above datasets.
i. (2.0 pt) On which of the above datasets could logistic regression (fit with no regularization) achieve 100% training accuracy? Select all that apply.
Dataset 1
Dataset 2
Dataset 3
None of the above
Logistic regression can achieve 100% training accuracy only when the training data is linearly separable, which it is for Dataset 1 but not for the others.
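A quick illustration of the separability point, sketched with scikit-learn (the tiny 1-D dataset below is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linearly separable 1-D data: class 0 left of the origin, class 1 right of it
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # 1.0 -- separable data can be classified perfectly
```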
ii. (2.0 pt) On which of the above datasets could a decision tree achieve 100% training accuracy? Select all that apply.
Dataset 1
Dataset 2
Dataset 3
None of the above
A decision tree or random forest can achieve 100% training accuracy only when there are no overlapping points of different classes (which would force impure leaf nodes). There are no such points in Datasets 1 and 2, so both models can achieve 100% training accuracy there. Dataset 3 contains an overlapping point of different classes (the green triangle), so neither model can achieve 100% training accuracy on it.
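The same point can be illustrated for trees; a minimal scikit-learn sketch with hypothetical 1-D data, where two identical points carry different labels, as with the green triangle:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Consistent labels: a full-depth tree reaches pure leaves
X_clean = np.array([[0.0], [1.0], [2.0], [3.0]])
y_clean = np.array([0, 0, 1, 1])

# Two identical points with different labels: no split can separate them,
# so the leaf containing them stays impure and one point is misclassified
X_conf = np.array([[0.0], [1.0], [2.0], [2.0]])
y_conf = np.array([0, 0, 1, 0])

acc_clean = DecisionTreeClassifier().fit(X_clean, y_clean).score(X_clean, y_clean)
acc_conf = DecisionTreeClassifier().fit(X_conf, y_conf).score(X_conf, y_conf)
print(acc_clean, acc_conf)  # 1.0 0.75
```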
iii. (2.0 pt) On which of the above datasets could a random forest achieve 100% training accuracy? Select all that apply.
Dataset 1
Dataset 2
Dataset 3
None of the above
and binary response variable y, and we want to train a binary classifier.
The all-zero classifier is a classifier that predicts 0 for all observations, regardless of input. The training accuracy of the all-zero classifier on our training data is 1/8.
If we were to build a decision tree for classification, what would be the entropy of the tree at the root node, where all observations begin?
−(7/8) log₂(1/8) − log₂(7/8)
−(1/8) log₂(1/8) − log₂(7/8)
−(8/64) log₂(1/8) − (56/64) log₂(7/8)
−(1/8) log₂(1/8) − (7/8) log₂(7/8)
−8 log₂(1/8) − 7 log₂(7/8)
Impossible to tell
At the root node, 1 point belongs to class 0 and 7 points belong to class 1. Thus, the entropy of the node is
−(1/8) log₂(1/8) − (7/8) log₂(7/8)
Note, we accidentally made it so that two answer choices are correct; we awarded credit to students who selected either one.
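The entropy value itself can be computed directly (using log base 2):

```python
import numpy as np

p = np.array([1/8, 7/8])            # class proportions at the root node
entropy = -np.sum(p * np.log2(p))
print(round(entropy, 3))            # 0.544
```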
Consider a DataFrame people containing the height, weight, and BMI (body mass index) of several individuals. Our dataset has three columns:
• height (cm): Height in centimeters
• weight (kg): Weight in kilograms
• bmi: Body Mass Index, calculated as
people['bmi'] = people['weight (kg)'] / (people['height (cm)'] / 100) ** 2

The first five rows of people might look something like:

height (cm)  weight (kg)      bmi
     185.42      109.545  31.8626
     172.72      73.6364  24.6835
     187.96      96.3636  27.2761
     180.34          100  30.7479
     175.26      93.6364  30.4845
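The bmi computation can be reproduced on the first two rows above; a minimal pandas sketch:

```python
import pandas as pd

people = pd.DataFrame({
    'height (cm)': [185.42, 172.72],
    'weight (kg)': [109.545, 73.6364],
})
# same formula as in the question
people['bmi'] = people['weight (kg)'] / (people['height (cm)'] / 100) ** 2
print(people['bmi'].round(2).tolist())  # [31.86, 24.68]
```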
(a) (2.0 pt) Let r(x, y) be a function that computes the correlation coefficient r for two Series of numbers x and y.
Suppose, just for this part, that the values in height (cm) and weight (kg) are generated using an uncorrelated random number generator (that is, r(people['height (cm)'], people['weight (kg)']) == 0).
What is the most likely value of R = r(people['height (cm)'], people['bmi'])?
R < -0.2
-0.2 <= R < 0.2
R >= 0.2
Since bmi = weight / height², if we fix a value of weight and let height vary (which we can do without changing weight, since the two are uncorrelated), bmi decreases as height increases. So on average, as height increases, bmi decreases, making R negative.
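This effect can be checked by simulation; a minimal sketch with hypothetical, independently generated heights and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
height = rng.normal(175, 10, n)    # cm, generated independently of weight
weight = rng.normal(85, 15, n)     # kg
bmi = weight / (height / 100) ** 2

r = np.corrcoef(height, bmi)[0, 1]
print(r < -0.2)  # True -- height and bmi are substantially negatively correlated
```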
(b) (2.0 pt) For whatever reason, we decide to add Imperial units to our dataset, which we will now call humans. That is, we add the columns height (in) and weight (lb), where humans['height (in)'] = humans['height (cm)'] / 2.54 and humans['weight (lb)'] = humans['weight (kg)'] * 2.2.
height (in)  height (cm)  weight (lb)  weight (kg)      bmi
         73       185.42          241      109.545  31.8626
         68       172.72          162      73.6364  24.6835
         74       187.96          212      96.3636  27.2761
         71       180.34          220          100  30.7479
         69       175.26          206      93.6364  30.4845
Which of the following sets of columns are linearly independent and have a span that is equal to the span of the columns of humans? Select all that apply.
height (in), height (cm), weight (lb), weight (kg), bmi
height (in), weight (lb), bmi
height (cm), weight (lb), bmi
height (in), height (cm), weight (lb), bmi
height (cm), bmi
None of the above
A correct answer includes exactly one of the height columns, exactly one of the weight columns, and the bmi column.
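One way to sanity-check the span claim is with matrix rank; a minimal numpy sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
cm = rng.uniform(150, 200, 100)
kg = rng.uniform(50, 110, 100)
bmi = kg / (cm / 100) ** 2
inch, lb = cm / 2.54, kg * 2.2     # exact scalar multiples of existing columns

full = np.column_stack([inch, cm, lb, kg, bmi])
small = np.column_stack([cm, kg, bmi])

# inch and lb add no new directions, so both matrices have rank 3
print(np.linalg.matrix_rank(full), np.linalg.matrix_rank(small))  # 3 3
```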
(c) (2.0 pt) Now suppose we fit two linear models on the humans data.

Model A:
b̂mi = θ0 + θin · height (in) + θcm · height (cm) + θlb · weight (lb) + θkg · weight (kg)

Model B:
b̂mi = β0 + βcm · height (cm) + βkg · weight (kg)
Suppose we create 95% confidence intervals for each of the above non-intercept parameters using the bootstrap method. Which of the following parameters’ confidence interval will likely contain the value 0? Select all that apply.
θin
θcm
θlb
θkg
βcm
βkg
None of the above
There is strong multicollinearity in the first model. There is also some multicollinearity in the second model, since in reality height and weight are positively correlated, but height and weight are both independently useful in predicting bmi.
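As a reminder of the mechanics (not the exam's data), the percentile bootstrap builds a confidence interval by resampling the data with replacement; a minimal sketch for a sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(10, 3, 200)          # hypothetical observed data

# resample with replacement and recompute the statistic each time
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile interval
print(lo < sample.mean() < hi)  # True
```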
(d) (1.0 pt) Suppose we add random noise to all columns in humans except for bmi. Assume that our random noise is drawn from the Normal distribution with mean 0 and variance 2, and that the noise for each element in the DataFrame is independent. We call this new DataFrame noisy_humans.
Suppose we fit Model A and Model B on noisy_humans and create bootstrapped confidence intervals for each of the above six parameters. True or False: our answer to the previous part remains the same.
True
False
A small amount of noise will make it so that our design matrix is full rank, but two of the columns are still redundant.
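A minimal sketch of this rank effect, with a hypothetical exactly-collinear pair of columns:

```python
import numpy as np

rng = np.random.default_rng(3)
cm = rng.uniform(150, 200, 100)
inch = cm / 2.54                                   # exactly collinear with cm

A = np.column_stack([cm, inch])
A_noisy = A + rng.normal(0, np.sqrt(2), A.shape)   # noise with variance 2

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A_noisy))  # 1 2
```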
Below, we’ve clustered three different datasets each into three classes (orange circles, blue crosses, and green stars). Assume that there are no overlapping points anywhere.
(a) In which of the above dataset/clustering combinations can we write
inertia = n · distortion
where n is a positive integer? Select all that apply.
Clustering A
Clustering B
Clustering C
None of the above
In the correct answer choices, all three clusters have the same number of points. This means the distortion will be of the form
(a₁² + a₂² + a₃²)/3 + (b₁² + b₂² + b₃²)/3 + (c₁² + c₂² + c₃²)/3
where aᵢ represents the distance from point i in cluster a to its cluster center. On the other hand, inertia will be of the form
a₁² + a₂² + a₃² + b₁² + b₂² + b₃² + c₁² + c₂² + c₃²
meaning that inertia = 3 · distortion. Note, this is not true for the first answer choice, because the clusters have different numbers of points in them.
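With equal-sized clusters, the factor of 3 falls out directly; a minimal sketch with hypothetical squared distances:

```python
import numpy as np

# Squared distances from each point to its cluster center,
# for three clusters of three points each (hypothetical values)
a = np.array([1.0, 2.0, 3.0]) ** 2
b = np.array([0.5, 1.5, 2.5]) ** 2
c = np.array([1.0, 1.0, 2.0]) ** 2

inertia = a.sum() + b.sum() + c.sum()               # total squared distance
distortion = a.sum() / 3 + b.sum() / 3 + c.sum() / 3  # per-cluster averages
print(np.isclose(inertia, 3 * distortion))  # True
```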
(b) (1.0 pt) In which of the above dataset/clustering combinations is there a point with a negative silhouette score? Select all that apply.
Clustering A
Clustering B
Clustering C
None of the above
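Recall that a point's silhouette score is (b − a) / max(a, b), where a is its mean distance to points in its own cluster and b its mean distance to points in the nearest other cluster; the score is negative exactly when a > b. A minimal sketch with hypothetical distances:

```python
a = 4.0   # mean distance to points in the same cluster
b = 2.5   # mean distance to points in the nearest other cluster

score = (b - a) / max(a, b)
print(score)  # -0.375 -- negative because the point sits closer to another cluster
```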
(a) (1.0 pt) Fill in the blanks: In the star schema for data storage, the fact table contains ____ that refer to ____ in ____.
primary keys, secondary keys, dimension tables
integers, primary keys, dimension tables
primary keys, dimension tables, foreign keys
primary keys, foreign keys, dimension tables
foreign keys, primary keys, dimension tables
(b) (1.0 pt) Fill in the blanks: ____ is/are designed to manipulate small amounts of data. ____ is/are designed to manipulate large amounts of data. ____ do/does both.
numpy and pandas, Hadoop and Spark, Modin
Hadoop and Spark, Modin, numpy and pandas
Hadoop and Spark, numpy and pandas, Modin
Modin, numpy and pandas, Hadoop and Spark
(c) (1.0 pt) True or False: Hadoop, Spark, and Modin were all created at Berkeley.
True
False
Spark and Modin were, but Hadoop was not.
No more questions.