Data 100 & 200A Principles and Techniques of Data Science
Spring 2019
INSTRUCTIONS
Final
• You have 2 hours and 50 minutes to complete the exam.
• The exam is closed book, closed notes, closed computer, closed calculator, except for the provided midterm 1
reference sheet and up to three 8.5″ × 11″ sheets of notes of your own creation.
• Mark your answers on the exam itself. We will not grade answers written on scratch paper.
Last name
First name
Student ID number
CalCentral email (_@berkeley.edu)
Exam room
Name of the person to your left
Name of the person to your right
All the work on this exam is my own.
(please sign)
Terminology and Notation Reference:
exp(x)
ex
log(x)
loge x
Linear regression model
E[Y |X] = XT β
Logistic (or sigmoid) function
σ(t) = 1 1+exp(−t)
Logistic regression model
P(Y =1|X)=σ(XTβ)
Squared error loss function
L(y, θ) = (y − θ)2
Absolute error loss function
L(y, θ) = |y − θ|
Cross-entropy loss function
L(y, θ) = −y log θ − (1 − y) log(1 − θ)
Bias
Bias[θˆ, θ] = E[θˆ] − θ
Variance
Var[θˆ] = E[(θˆ− E[θˆ])2]
Mean squared error
MSE[θˆ, θ] = E[(θˆ − θ)2]
2
1. (6 points) Visualization
You have data on 500 rental housing units, including rent price, number of bedrooms, and square footage.
(a) (2 pt) Fill the box for all suitable visualizations for comparing the distribution of rent price for 2, 3, and 4 bedroom units.
Scatterplot Violin plots Barplot Boxplots None of these
(b) (2 pt) Fill the box for all suitable visualizations for examining how rent price and square footage relate.
Density plots Scatterplot Histograms Boxplots None of these
The scatterplot generated by sns.jointplot below to the left visualizes a dataset containing 600 (x, y) pairs. You partition the pairs by their x value into three groups:
• The 200 with the lowest x values are in group 0.
• The 200 with intermediate x values are in group 1. • The 200 with the highest x values are in group 2.
(c) (1 pt) Bubble the letter of the boxplot that shows the distribution of x values in each of the three groups. ⃝A⃝B⃝C⃝D
(d) (1 pt) Bubble the letter of the boxplot that shows the distribution of y values in each of the three groups. ⃝A⃝B⃝C⃝D
2. (4 points) Classification and Regression Trees (CART) and Random Forests
(a) (1 pt) When predicting a qualitative outcome, the predicted class at each terminal node is:
⃝ the most common class in the node ⃝ the average class for the node ⃝ neither
(b) (1 pt) When predicting a quantitative outcome, the predicted value at each terminal node is:
⃝ the most common outcome in the node ⃝ the average outcome for the node ⃝ neither
(c) (2 pt) The main idea behind ensemble prediction methods such as Random Forests is that aggregating predictions from predictors applied to multiple bootstrap samples of the learning set:
⃝ reduces bias ⃝ reduces variability ⃝ reduces both bias and variability ⃝ none of these
Name: 3 3. (8 points) Feature Engineering
These instructions are not identical to the similar question on midterm 2, so please read carefully. For each dataset depicted below in a scatterplot, fill in the squares next to all of the letters for the functions f thatwouldmakeitpossibletochoosescalarsβ0,β1,andβ2 suchthatyi =β0+xiβ1+f(xi)β2 forall(xi,yi) pairs in the dataset. The input to each f is a scalar x shown on the horizontal axis, and the corresponding y value is shown on the vertical axis.
(A) f(x) = −x
(B) f(x)=x
(C) f(x) = x2
(D) f(x) = |x|
(E) None of the above
(i) (2 pt) A B C D E (ii) (2 pt) A B C D E
62
4 2 0
0 −2
−2 −4
−4 −6
−6
−1 0 1 2 3 4 5 xx
(iii) (2 pt) A 4.5
4 3.5 3 2.5
B
C
D
E
(iv) (2 pt) A
8 7 6 5 4 3
B
C
D
E
−1 0 1 2 3 4 5
22
−2 0 2 4 6 8 10 12 xx
−3 −2 −1 0 1 2 3
yy
yy
4
i
xi
yi
1 2 3
0 3
3 0
6 0
-3 0 3
4. (11 points) Regression
Suppose you are interested in predicting a one-dimensional quantitative random variable Y (outcome) in terms of a two-dimensional quantitative random variable X (covariates). You have a learning set (x1, y1), . . . , (x3, y3) of three observations, given in the table to the right.
(a) (2 pt) You estimate the regression function E[Y |X] using ridge regression with no intercept term and a
shrinkage parameter of λ. Select the correct expression for the regression function, where β = [β1 ⃝ E[Y |X] = XT β
⃝ E[Y |X] = XT β + ξ
⃝ E [ Y | X ] = X T β + λ · ( β 12 + β 2 2 )
⃝ E [ Y | X ] = X T β + ξ + λ · ( β 12 + β 2 2 ) ⃝ None of these
(b) (2 pt) Why are the letters X and Y capitalized in the models in (a) for the regression function? ⃝ They represent random variables.
⃝ They represent matrix-valued elements of the learning set.
⃝ They represent unknown parameters that need to be estimated from the learning set.
β2]T .
(c) (2 pt) There are two derivations below: left and right. Select the one that expresses the regularized empirical risk (mean squared error) for ridge regression for this dataset in terms of β1 and β2 when λ=1.
⃝ 1(−3+3β2)2 +(3β1)2 +(3+6β1)2+(β12+β2) ⃝ 1(−3−3β2)2 +(−3β1)2 +(3−6β1)2+(β12+β2) 33
= 16β12 +12β1 +4β2 −6β2 +6 = 16β12 −12β1 +4β2 +6β2 +6
(d) (3 pt) Find the β1 and β2 that minimize the regularized empirical risk objective from part (c) above.
Show your work for full credit.
β1 = β2 =
(e) (2 pt) Choose the value of λ for which the values of β1 and β2 that would minimize the regularized empirical risk for ridge regression would also minimize the mean squared error on the learning set.
⃝-1 ⃝0 ⃝1 ⃝None of these ⃝Impossible to tell
Name: 5 5. (12 points) Logistic Regression
(a) (4 pt) Select always, sometimes, or never to describe when each statement below is true about a logistic regression model P(Y = 1|X) = σ(XT β), where Y is binary and X is vector-valued. X and β are finite.
⃝ Always ⃝ Always ⃝ Always
⃝ Always
⃝ Sometimes
⃝ Sometimes
⃝ Sometimes
⃝ Sometimes
⃝Never:
⃝Never:
⃝ Never:
⃝ Never:
P(Y =1|X)>XTβ
P(Y =1|X)=P(Y =0|−X)
P(Y = 1|X) < 1
σ(XT β) ≤ σ(XT (2 · β))
(b) (8 pt) Complete the code below to assign best_λ to the regularization parameter λ among choices that gives the smallest 2-fold cross-validation risk for L1-regularized logistic regression. The learning set x has n observations and m features. Correct labels are named y. Assume minimize can minimize any function.
n, m = x.shape
assert len(y) == n
from scipy.optimize import minimize as scipy_minimize
def minimize(f, k):
"Return the k-length array that minimizes f."
return scipy_minimize(f, np.zeros(k))['x']
def sigma(t):
return 1/(1 + np.exp(-t))
def log_loss(y, z):
return -y * np.log(z) - (1 - y) * np.log(1 - z)
def risk(λ):
"Return 2-fold cross-validation risk for regularization hyperparameter λ." def objective(b):
losses = log_loss(y_train, sigma(x_train @ b))
return __________________________________________________________________________
half = n // 2
x_1, x_2 = x[:half, :], x[half:, :]
y_1, y_2 = y[:half], y[half:]
x_train, y_train = ________________________________, ________________________________
b = minimize(objective, m)
risk_1 = ____________________________________________________________________________
x_train, y_train = ________________________________, ________________________________
b = minimize(objective, m)
risk_2 = ____________________________________________________________________________
return (risk_1 + risk_2)/2
choices = [2 ** c for c in range(-10, 10)]
best_λ = ________________________________________________________________________________
6
6. (13 points) Taxi Trips
The streets of Anytown, USA, are a perfect grid with every block exactly one tenth of a mile long. Taxi drivers only pick up and drop off at intersections. You have acquired a log of all 200,000 taxi trips in Anytown during 2018, which includes measurements for the following 3 variables: number of blocks, total distance, and fare. Taxi fare in Anytown is $2/mile + $3/trip. Assume no measurement error and no missing values.
(a) (1 pt) What is the rank of the matrix with rows corresponding to the 200,000 taxi trips and columns to the 3 variables?
⃝ 1 ⃝ 2 ⃝ 3 ⃝ 4 ⃝ None of these ⃝ Impossible to tell
(b) (1 pt) What is the rank of a four-column matrix containing columns for the three variables as well as a
column of all ones?
⃝ 1 ⃝ 2 ⃝ 3 ⃝ 4 ⃝ None of these ⃝ Impossible to tell
(c) (4 pt) If you apply PCA to this four-column matrix, which of the following statements will be true?
U, Σ, and V refer to the
singular value decomposition of the centered four-column matrix X: X = UΣV T . The sum of the singular values will be the total variance of the dataset.
All singular values will be greater than zero.
The dot product of the first and second columns of V will be equal to 1. A scatterplot of the first two columns of UΣ will form a straight line.
⃝ True ⃝ True ⃝ True ⃝ True
⃝ False: ⃝ False: ⃝ False: ⃝ False:
A simple random sample of 500 taxi trips in Anytown is taken on May 1, 2018. The data matrix for this sample has 4 columns: number of blocks, total distance, fare, and the duration of the trip.
(d) (5 pt) Fill the boxes for all of the following that are possible to compute reliably via bootstrapping, using this 4-column sample from May 1 as well as the original 3-column population dataset for all of 2018.
A confidence interval for the mean trip duration in 2018. A confidence interval for the mean trip duration in 2019.
An estimate of the variance of the average duration computed for each possible sample of 500 trips taken on May 1, 2018.
An estimate of the variance of the average duration computed for each possible sample of 100 trips taken on May 1, 2018.
A confidence interval for the difference between the average trip distance on May 1, 2018, and the average trip distance for all of 2018.
None of the above
(e) (2 pt) Using this random sample, you decide to fit an ordinary least squares linear regression model with no intercept using the normal equations, where the outcome is trip duration and the design matrix contains three covariates: number of blocks, total distance, and fare. Which problem below do you expect to encounter first?
⃝ The model will overfit because the sample is too small.
⃝ Processing the dataset will require more than one computer.
⃝ Gradient descent will reach a local minimum that is not the global minimum. ⃝ It will be impossible to fit the parameters of the model.
⃝ None of these
Name: 7 7. (22 points) Tables
Fill in both the Python code and the SQL query to produce each result below, assuming that the following tables are stored both as Pandas DataFrames and Sqlite tables. Only the first few rows are shown for each table. No tables have missing values. The scores table contains a row for each submitted homework in a course and contains columns for the id of the student who submitted, the hw number, and the score. Assume that a student can submit a homework at most once, every enrolled student has submitted at least one homework, and every homework has been submitted by at least one student. The hws table contains a row for each homework and contains columns for the name, which is always the letters “hw” followed by the homework number, as well as the max possible score for the homework. Each homework appears exactly once in the hws table. The ugrads table contains the student ID (sid) for each undergraduate enrolled at the university. There are both undergrads and graduate students in the course.
scores hws ugrads
(a) (6 pt) Select the number of undergrads enrolled in the course.
Python: _____________________(_____________________________.isin(_______________________________))
SQL: SELECT COUNT(*) FROM ugrads WHERE _______________________________________________________; (b) (6 pt) List the IDs for students who are missing at least one homework.
Python: counts = _______________________________________________________________________________
list(__________________[_________________________________________________________].index)
SQL: SELECT id FROM scores ______________________________ COUNT(*)<(SELECT COUNT(*) FROM hws); (c) (6 pt) Build a table with one row per homework containing the total number of points scored by under-
graduates on that homework.
Python: (scores.__________________________________________________________________________________
.groupby(_______________________________)________________________________________________) SQL: SELECT hw, ______________ FROM scores _______________________________________ GROUP BY hw;
(d) (4 pt) Select the average fraction of possible points that was scored on each submitted homework. Python: names = scores['hw'].apply(lambda n: f'_________________________________________________')
np.mean(scores['score'] / np.array(hws.set_index('name')________________________________)) SQL: SELECT __________________________ FROM scores JOIN hws ON _______________________________;
id
hw
score
90210
94720
90210
1 3 3
4 3 2
name
max
hw1 hw2 hw3
5 5 3
sid
94709
94114
94720
8
8. (8 points) Bias and Variance
(a) (2 pt) Select the best description of the bias of an estimator θˆ for a parameter θ, where θˆ is computed based on a simple random sample from the population of interest.
⃝ The difference between θ and the value of θˆ for a particular sample.
⃝ The average difference between θ and θˆ computed for each element of the sample.
⃝ The average difference between θ and θˆ over all possible simple random samples from the population.
⃝ The difference between the expected value of θ and the value of θˆ for a particular sample.
You estimate the maximum possible outcome (number of dots) of rolling of a fair 6-sided die using the outcome of a single roll of the die (pretending you don’t already know the number of dots on the faces of the die).
(b) (2 pt) What is the bias of this estimator?
⃝ -3.5 ⃝ -2.5 ⃝ -1.5 ⃝ 0 ⃝ 1.5 ⃝ 2.5
(c) (1 pt) What is the variance of this estimator?
⃝ 0 ⃝ 5 ⃝ 7 ⃝ 35 ⃝ 47 ⃝ 55 ⃝ 61 2 2 12 12 6 6
(d) (1 pt) What is the mean squared error of this estimator? ⃝ 0 ⃝ 5 ⃝ 7 ⃝ 35 ⃝ 47 ⃝ 55 ⃝ 61 2 2 12 12 6 6
⃝ 3.5 ⃝ Other: ________
⃝ Other: ________
⃝ Other: ________
(e) (2 pt) How would the magnitude (absolute value) of the estimator’s bias change if your estimator was the maximum outcome that appeared in 100 rolls of the die instead of just one roll?
⃝ It would increase ⃝ It would decrease ⃝ It would stay the same
9. (4 points) Sampling Distributions
The temperature difference between the outside and inside of a room measured each minute has this distribution:
You sample 10 differences at random with replacement. For each distribution below, select the histogram that best visualizes the distribution.
⃝A⃝B⃝C ⃝A ⃝B ⃝C ⃝A⃝B⃝C ⃝A⃝B⃝C
⃝ D ⃝ E ⃝ F: The sampling distribution of the sample 90th percentile. ⃝D ⃝E ⃝F: Thesamplingdistributionofthesamplemedian.
⃝ D ⃝ E ⃝ F: The sampling distribution of the sample standard deviation. ⃝ D ⃝ E ⃝ F: The bootstrap distribution of the sample median.
Name: 9 10. (8 points) Distributed File Systems
A large file is partitioned into 8 equal-sized blocks, and each block is replicated 2 times. For each block, a simple random sample of 2 machines is chosen (without replacement) out of 4 total machines, and one replica of the block is stored on each machine chosen in this way.
(a) (2 pt) What is the expected number of blocks stored on each machine?
⃝ 0 ⃝ 1 ⃝ 2 ⃝ 4 ⃝ 8 ⃝ 16 ⃝ 32 ⃝ Other: ________
(b) (2 pt) If one machine fails, what is the chance that the entire file can still be read? ⃝ 0 ⃝ 1 ⃝ 1 ⃝ 1 ⃝ 3 ⃝ 7 ⃝ 1 ⃝ Other: ________
(c) (2 pt) If two machines fail, what is the chance that the entire file can still be read?
⃝0 ⃝ 1 ⃝18 ⃝ 1 ⃝ 5 ⃝58 ⃝1−58 ⃝1 ⃝Other: ________
(d) (2 pt) If three machines fail, what is the chance that the entire file can still be read?
⃝0 ⃝ 1 ⃝18 ⃝ 1 ⃝ 5 ⃝58 ⃝1−58 ⃝1 ⃝Other: ________
11. (4 points) Distributed Computing
Assume you have the long-running remote function f defined below and you want to call it 10 times.
@ray.remote
def f():
run_time = 10 * np.random.random() # Chosen uniformly at random from 0 to 10 seconds
time.sleep(run_time) # Do nothing for run_time seconds
return run_time
(a) (1 pt) Assuming there is no overhead, what is the expected amount of time required to call f 10 times if all calls are executed serially?
⃝ 1 second ⃝ 5 seconds ⃝ 10 seconds ⃝ 50 seconds ⃝ 100 seconds ⃝ None of these
(b) (1 pt) Assuming there is no overhead, what is the maximum amount of time required to call f 10 times
if all calls to f are executed in parallel?
⃝ 1 second ⃝ 5 seconds ⃝ 10 seconds ⃝ 50 seconds ⃝ 100 seconds ⃝ None of these
(c) (2 pt) Select all of the expressions below that correctly compute a list of the return values from calling
f 10 times and execute all calls in parallel. [f() for _ in range(10)]
[f.remote() for _ in range(10)] ray.get([f.remote() for _ in range(10)]) [ray.get(f.remote()) for _ in range(10)] None of these
84248
226666
226666
10
12. (0 points) Optional: Draw a Picture About Berkeley Data Science