Name:
Email address: Student id:
DS-100 Practice Final Questions Fall 2017
Instructions:
• These are a random selection of previous final exam questions.
• This is not representative of the length of the final (its too long!). • You may use a single page (two-sided) cheat sheet.
1
DS100 Final, Page 2 of 42
1 Loss Minimization
1. In a petri dish, yeast populations grow exponentially over time. In order to estimate the growth rate of a certain yeast, you place yeast cells in each of n petri dishes and observe the population yi at time xi and collect a dataset {(x1, y1), . . . , (xn, yn)}. Because yeast populations are known to grow exponentially, you propose the following model:
log(yi) = γxi (1) where γ is the growth rate parameter (which you are trying to estimate). We would like to
derive the L2 regularized estimator least squares estimator.
(1) [4 Pts.] Write the regularized least squares loss function for γ under this model. Use λ as
the regularization parameter.
(2) [8 Pts.] Solve for the optimal γ as a function of the data and λ
Solution:
1 n
L(γ) = n
i=1
(log(yi) − γxi)2 + λγ2 (2)
Solution: Taking the derivative of the regularized loss function function:
∂ L(γ) = 1 n ∂ (log(yi)−γxi)2 + ∂ λγ2 (3)
∂γ
Setting the derivative equal to zero and solving for γ:
2n 2γ n
n i=1 ∂γ ∂γ 2 n
= −n (log(yi) − γxi)xi + 2λγ (4) i=1
2n 2γ n
=− log(yi)xi+ λ+x2i (5)
n i=1 n i=1
0=− log(yi)xi+ λ+x2i (6)
γ
λ + x2i i=1
= log(yi)xi i=1
(7)
(8) (9)
n i=1 n i=1 nn
n −1n
γ= λ+x2 log(y)x
iii i=1 i=1
DS100 Final, Page 3 of 42
2. Suppose we observe a dataset {x1, . . . , xn} of independent and identically distributed samples from the exponential distribution. Suppose we give you a ”probability model” parameterized by λ:
fλ(x) = λe−λx
that estimates the probability of a particular data point. In addition we give you the “log-
likelihood” loss function as the following:
n
L(λ) = −n log (λ) + λ xi (10)
i=1
Derive the parameter value λ that minimizes this loss function. Circle your answer.
Solution: Taking the derivative of the loss function with respect to the parameter λ we get:
∂ ∂ ∂n
∂λL(λ) = −n∂λ log (λ) + ∂λλ
1 n
xi (11) (12)
(13)
x i (14) (15)
= −nλ +
To minimize the loss we set the above derivative equal to zero and solve:
i=1
1 n
xi
i=1
0 = − n λˆ + 1 1n
λˆ=n xi i=1
i=1
i=1 i
ˆn λ=n x
(16)
Thus loss minimizing estimate is:
ˆ n 1n −1 1
λ=n x= n xi =Mean(x) (17) i=1 i i=1
DS100 Final, Page 4 of 42
3. Suppose we collect a dataset of n observations {x1, . . . , xn} which we believe are drawn from a distribution with the following PDF:
(x−μ)6
fμ(x)=Cexp − 6 (18)
where C is a constant that does not depend on μ. As before we are given the loss function: 1 n
L(μ)=−nlogC+6
(1) [4 Pts.] Compute the derivative of the derivative of the loss with respect to μ.
(xi −μ)6 (19)
(2) [3 Pts.] Because there is no closed form solution for μ in ∂ L(μ) = 0, we would likely ∂μ
use gradient descent to approximately compute μˆ. Given the gradient function:
g(μ)= ∂ logL(μ), (23)
∂μ
and a step size ρ(t), what is the gradient descent update rule to go from μ(t) to μ(t+1)? (Hint: your answer should contain only the variables g(μ(t)), μ(t), μ(t+1), and ρ(t).)
i=1
Solution: Taking the derivative:
∂ L(μ) = − ∂ n log C + 1 n ∂ (xi − μ)6 (20)
i=1 n
= − (xi − μ)5 (22) i=1
∂μ ∂μ
=0−(xi −μ)5 (21)
n
6 i=1 ∂μ
Solution: Recall that the gradient points in the “uphill” direction. When we minimize we want to go the opposite direction (negative gradient. So the update rule would look like:
μ(t+1) ← μ(t) − ρ(t)g(μ(t)) (24)
DS100 Final, Page 5 of 42
2
Wrangling and Querying Data 2.1 SQL
For the questions in this subsection, assume we have a massive database in the cloud with the following schema:
— A simple digital media store database
CREATE TABLE media
(mid integer PRIMARY KEY,
name text, type char, year_released integer, length integer, buy_cost float, rent_cost float, avg_rating float);
CREATE TABLE customers
(cid integer PRIMARY KEY,
name text, joined date, nation_id integer, activity_level integer);
CREATE TABLE transactions (tid integer PRIMARY KEY,
tdate date, item integer, customer integer,
rent_or_buy integer, price_paid float, percent_viewed float, FOREIGN KEY (item) REFERENCES media,
FOREIGN KEY (customer) REFERENCES customers);
CREATE VIEW stats AS
SELECT min(length) AS len_min, max(length) AS len_max,
avg(length) AS len_mu, stddev(length) AS len_sigma, min(avg_rating) AS ar_min, max(avg_rating) AS ar_max, avg(avg_rating) AS ar_mu, stddev(avg_rating) AS ar_sigma
FROM media;
DS100 Final, Page 6 of 42
4. [4 Pts.] In the media table above, the type column encodes the type of media as a unique character code (e.g., ’S’ for song, ’M’ for movie, ’E’ for episode, etc.). Suppose we wanted to modify the stats view to display the stats for each type of media. Which of the following are true? (Select all that apply.)
A. We need to change the granularity of the view to be finer than it is above.
B. We need to add a GROUP BY type clause to the view.
C. It would be helpful to add media.type to the list of columns in the SELECT clause of the view.
D. The modified view should have more rows than the original view above.
E. None of the above.
5. [3 Pts.] Which of the following queries finds the ids of media that are 2 standard deviations
longer than the mean length? (Select only one.) A.
B.
C.
SELECT media.mid FROM media, stats
WHERE media.mid = stats.mid
AND media.length >= stats.len_mu
+ 2*(stats.len_sigma);
SELECT media.mid
FROM media, stats
WHERE media.length >= stats.len_mu
+ 2*(stats.len_sigma);
SELECT media.mid FROM media
WHERE media.length >= avg(media.length)
+ 2*stddev(media.length);
D. None of the above.
DS100 Final, Page 7 of 42
2.2 SQL Sampling
The transactions table has 30 million (30 × 106) rows. It is too large to load into the memory of our laptop. We will extract a sample from the database server to process on our laptop in Python.
SELECT *
FROM transactions TABLESAMPLE Bernoulli(.0001);
6. [2 Pts.] In expectation, how many rows will there be in the answer to this query?
7. [4Pts.] YourfriendEmilyEngineertellsyoutoavoidBernoullisampling,andusethefollowing query instead:
SELECT *
FROM transactions
LIMIT XX;
(where XX is replaced by the correct answer to the previous question). Select all the true
statements:
A. Emily’s LIMIT query will probably run faster than the TABLESAMPLE query. For Emily’s query, the database engine can simply access the first XX rows it finds in the table, and skip the rest.
B. Emily’s query result may be biased to favor certain rows.
C. The output of the TABLESAMPLE query provides a hint about how many rows there are in the transactions table while Emily’s LIMIT query does not.
D. Emily’s LIMIT query may run fast, but it will swamp the memory on your laptop, since it doesn’t sample the database.
Solution: 3000 = 30 × 102
Solution: True. For reasoning above.
Solution: True. The database will optimize for speed, which will likely fa- vor clusters of records stored near each other.
Solution: True. You can extrapolate from the sample size and the sample probability to predict the table size.
DS100
Final, Page 8 of 42
Solution: False. Emily’s query will only return XX rows to the laptop.
E. None of the above.
8. [2 Pts.] You will recall from Homework 5 that it is possible to do bootstrap sampling in SQL by constructing a design table with two columns. Each of the columns used in that scheme is described by a single choice below. Identify the two correct choices:
A. A foreign key to the table being sampled.
B. A count column to capture the number of tuples in each bootstrap sample. C. An identifier to group rows together into bootstrap samples.
D. A regularization column to prevent overfitting.
2.3 Pandas
For the questions in this subsection, assume that we have pandas dataframes with the same schemas as described in the previous section on SQL. That is, we have a media dataframe with columns mid, name, type, year, et cetera. Assume that the index column of each dataframe is meaningless—the primary key is represented as a regular column.
9. [3 Pts.] Consider the following code snippet:
def get_average_price_paid(join_method): return (customers
.merge(transactions, how=join_method, left_on=’cid’, right_on=’customer’)
.loc[:,’price_paid’]
.fillna(0) # <- Important .mean()
)
inner = get_average_price_paid(’inner’)
outer = get_average_price_paid(’outer’)
left = get_average_price_paid(’left’)
right = get_average_price_paid(’right’)
Assume that all item prices are positive, all transactions refer to valid customers in the customers table, but some customers may have no transactions.
(1) How are inner and outer related? Pick one best answer. A. inner < outer
B. inner ≤ outer C. inner = outer
DS100
Final, Page 9 of 42
D. inner ≥ outer E. inner > outer
(2) How are left and right related? Pick one best answer. A. left < right
B. left ≤ right C. left = right D. left ≥ right E. left > right
(3) How are left and outer related? Pick one best answer. A. left < outer
B. left ≤ outer C. left = outer D. left ≥ outer
10. [3 Pts.] We wish to write a python expression to find the largest amount of money spent by one person on any single date. We will use the following code:
biggie = transactions.groupby(_____)[’price_paid’].sum().max()
What should we be pass in as our groupby predicate? Select only one answer. A. ’tdate’
B. ’customer’
C. [’item’, ’tdate’]
D. [’customer’, ’tdate’] E. [’customer’, ’item’]
11. [6 Pts.] Fill in the following python code that finds the names of every customer who has spent over $100.
merged = customers.merge(__A__, left_on=__B__, right_on=__C__) grouped = merged.groupby(__D__).__E_()
names = grouped[__F__].index
Solution:
merged = customers.merge(transactions, \
left_on=cid, right_on=customer)
grouped = merged.groupby(cid).sum()
names = grouped[grouped.price_paid > 100].index
DS100 Final, Page 10 of 42
12. [4 Pts.] We wish to find years where the average price paid (over all time) for products released in that year is greater than the average price paid across all transactions; from those years we want to return the earliest (smallest). We have the following code:
merged = transactions.merge(media, left_on=”item”, \ right_on=”mid”)
mean_price = merged.groupby(“year_released”)\ .mean().price_paid.mean() # Line A
by_year = merged.groupby(“year_released”).count() # Line B is_greater = by_year[by_year.price_paid > mean_price] # Line C result = is_greater.sort_index(ascending=False).index[0] # Line D
Some of these lines need to be modified in order for the code to work properly. We have suggested replacements for each line below. Which lines need to be replaced? Select all that apply.
A. mean_price = merged.price_paid.mean()
B. by_year = merged.groupby(“year_released”).mean()
C. is_greater = by_year.where(by_year.price_paid > mean_price)
D. result = is_greater.sort_index(ascending=True).index[0] E. All the lines are correct.
DS100 Final, Page 11 of 42
3 Feature Engineering
For this question you were given the following sales data and asked to build a model to predict units sold based on the the product attributes to guide the design of future products.
…
13. Write down a reasonable schema for this data.
ProdID
Name
Desc
Price
Category
Units Sold
13
Errorplane
“A truly uncaught exception . . . ”
404.00
Toy
9
42
Rock Kit
“Launch into minerology with . . . ”
123.45
Toys
1
54
Punative Jokes
“Jokes that will get you fined . . . ”
1.00
Books
30
Solution:
Products(prodId INTEGER, name VARCHAR, desc VARCHAR,
price REAL,
cat CHARACTER(20), sold INTEGER);
14. Suppose we are interested in building a linear predictive model. For each of the columns indicate which (one or more) of the feature transformations could be appropriate.
(1) The ProdID column:
A. Drop the column
B. One-Hot Encoding C. Leave as is
(2) The Name column:
A. The length of the text in characters
B. One-Hot Encoding
C. Bag-of-words Encoding
D. Leave as is
Solution: Because we are trying to predict sales of future products and each product is likely to have a unique product id, this feature is not likely to be helpful and may harm predictions.
DS100 Final, Page 12 of 42
Solution: The length of the name could be helpful in predicting how well product sell. Perhaps long names are hard to remember? A one-hot encoding would not likely be helpful unless we are going to make new products with the same name as old products. A bag-of-word encoding could be helpful if for example certain words like kit implied a decrease or increase in sales.
(3) The Desc column:
A. The length of the text in characters
B. One-Hot Encoding
C. Bag-of-words Encoding D. bi-gram Encoding
E. Leave as is
(4) The Price column:
A. The length of the text in characters
B. One-Hot Encoding
C. Bag-of-words Encoding
D. Convert the price to an indicator indicating if it is less than 19.99. E. Leave as is
(5) The Category column:
A. The length of the text in characters.
B. One-Hot Encoding
C. Bag-of-words Encoding D. N-Gram Encoding
E. Leave as is
15. It might be reasonable to assume that the relationship between units sold and price differs for each category (e.g., an expensive toy might be less likely to sell than expensive jewelry). Which of the following feature functions might capture this intuition?
Solution: This is similar to the Name but given longer post a bi-gram featurization may also be helpful.
Solution: Asafloatingpointnumberwecouldleavethepriceasis.Wemighthowever also compare the price to a few thresholds (e.g., 19.99 or 199.99) to test if the price is in some critical sales ranges.
Solution: Categorical data is typically best represented in a one-hot encoding.
DS100
Final, Page 13 of 42
A. φ(category, price) = category + price
B. φ(category, price) = price × category
C. φ(category, price) = OneHot (category) + price D. φ(category, price) = price × OneHot (category)
Solution: This will result in a vector of length k, where k is the number of distinct categories. Each element will be equal to the price for examples that match the category, and 0 otherwise. So the learned coefficient for each element is the slope for prices for that category. Most of the other choices would result in type errors, since it doesn’t make sense to add or multiply a category (a string) by any number. Choice F would run, but it would only allow predictions for each category to vary by additive factors; each category would get a different intercept rather than a different slope on price.
E. φ(category, price) = category × OneHot (price)
F. φ(category, price) = Concatenate(OneHot(category), price)
DS100 Final, Page 14 of 42
4 Feature Engineering 2
For this problem we collected the following data on the new social networking app UFace.
…
16. Suppose we are interested in predicting the number of responses for future posts. For each of the columns, indicate which (one or more) of the given feature transformations could be informative. Select all that apply.
(1) [2 Pts.] The PostID column: A. Drop the column
B. One-Hot encoding C. Leave as is
(2) [2 Pts.] The Time column:
A. Take the hour as a float
B. One-Hot encoding
C. Bag-of-words encoding
D. Time since midnight in seconds
(3) [2 Pts.] The Text column:
A. The length of the text
B. One-Hot encoding
C. Bag-of-words encoding
D. Leave as is
(4) [2 Pts.] The State column: A. The length of the text
B. One-Hot encoding
C. Bag-of-words encoding D. Leave as is
PostID
UTC Time
Text
Num. Responses
State
3
08:10 PM
“Checkout my breakfast . . . ”
2
VA
13
11:00 AM
“Studied all night for . . . ”
5
CA
14
12:04 PM
“Hello world!”
0
NY
17
11:35 PM
“That exam was lit …”
42
CA
DS100 Final, Page 15 of 42
17. [4 Pts.] Suppose we believe that people are more likely to respond to tweets in the afternoon (roughly from hours 13 to 17). Which of the following feature functions would help capture this intuition? Assume that the function localHour takes a time and a state as its arguments and returns the hour of the day (in 24-hour time) in the state’s time zone. Also assume that any boolean-valued feature is encoded as 0 (false) or 1 (true). Select all that apply.
A. φ(time, state) = localHour(time, state)
B. φ(time, state) = 13 < localHour(time, state) < 17
C. φ(time, state)=exp−(localHour(time, state)−15)2 D. φ(time, state) = exp (localHour(time, state) − 15)
E. None of the above.
18. [2 Pts.] Given the following text from a BigData Borat post: “Data Science is statistics on a Mac.”
Which of the following is the bi-gram encoding including stop-words? (Select only one.)
A. {(’data’, 1), (’science’, 1), (’statistics’, 1),
(’mac’, 1)}
B. {(’data science’, 1), (’science statistics’, 1),
(’statistics mac’, 1)}
C. { (’data science’, 1), (’science is’, 1),
(’is statistics’, 1), (’statistics on’, 1), (’on a’, 1), (’a mac’, 1)}
D. {(’data science’, 1), (’is statistics’, 1), (’on a’, 1), (’mac’, 1)}
DS100 Final, Page 16 of 42
5 Least Squares Regression and Regularization
19. Binary Features You are part of a team that is analyzing data from a clinical trial. Let X be a
full-column-rank n × 2 design matrix with the following columns:
1. X0 is a column of 1s. This generates an intercept term.
2. X1 is a binary treatment indicator vector taking on values 0 or 1. Xi1 = 1 means that yi represents the response of a treated individual. Xi1 = 0 means that yi represents the response of an untreated individual
You propose the following linear model:
yi = β0 + β1Xi1 + εi
You solve for βˆ0 and βˆ1 (estimates of β0 and β1, respectively) by minimizing the residual sum
of squares. Show that
βˆ 1 = y ̄ T − y ̄ C ,
• y ̄T is the average response of all the treated individuals, and
• y ̄C is the average response over all untreated or “control” individuals. Hint: Think about the meaning of ni=1 Xi1 and how y ̄ is related to y ̄T and y ̄C
where:
DS100 Final, Page 17 of 42
Solution: First, here’s a proof that is not very concise but attempts to give intuition for the problem.
ˆˆˆ
When we search for the function f (x) = β0 + β1 x minimizing the sum of squared residuals,
fˆ will be evaluated only at the points x = 0 and x = 1, since those are the only values
the Xi1s take. (A scatter plot of Xi1 against y would look like 2 vertical slices at x = 0
and x = 1.) Since a line parametrized by an intercept and a slope can pass through
any two points, our function fˆ can do the best possible job of minimizing the squared
errors, by passing through the vertical “middles” of the slices at x = 0 and x = 1. More ˆ
precisely, f (0) will be the value minimizing the sum of squared errors for those responses ˆˆˆ
with x = 0, and similarly for f(1) = β0 + β1. We have seen that the minimizer of the squared error between a number and a list of numbers is the mean of the list of numbers, so
ˆˆˆˆˆˆ
f(0)=β0 =y ̄C,andf(1)=β0 +β1 =y ̄T.Therefore,β1 =y ̄T −y ̄C.
Here is an alternative “brute force” proof by calculation. Let m = ni=1 Xi1, the number of treated individuals. Then:
βˆ0 T −1 T
βˆ = ( X X ) X y (25)
1
n m−1
=mm XTy (26) 1 m−mT
=m(n−m)−m n Xy (27)
1 1 −1iyi
=n (28)
n−m−1m my ̄T βˆ1 is just the second element of this vector, which is:
βˆ1= 1 (−yi+nmy ̄T) (29) n−m m
= 1 (−(my ̄T +(n−m)y ̄C)+ny ̄T) (30) n−m
= y ̄ T − y ̄ C (31)
20. For this question we use the following toy dataset:
DS100 Final, Page 18 of 42
(1) [3 Pts.] We have fit several models depicted as curves in the following plots:
(a) (b) (c)
Select the plot that best matches each of the models below. Each plot is used exactly
once.
1. Linear regression model ⃝(A) ⃝(B) √(C)
2. Linear regression with degree 10 polynomial features ⃝(A) √(B) ⃝(C)
3. Ridge regression with degree 10 polynomial features and substantial regularization. √(A) ⃝(B) ⃝(C)
DS100 Final, Page 19 of 42
(2) [2 Pts.] We fit two more models to these data. Again, the solid curves display the predic- tions made by each model.
(a) (b)
Select the plot that best matches each of the models below. Each plot is used exactly once.
1. Ridge regression with degree 10 polynomial features, λ = 0.1. √ (A) ⃝ (B)
2. Ridge regression with degree 10 polynomial features, λ = 1.0. ⃝ (A) √ (B)
21. Suppose you are given a dataset {(xi, yi)}ni=1 where xi ∈ R is a one dimensional feature and yi ∈ R is a real-valued response. To model this data you choose a model characterized by the following objective function:
n2
J(θ)=yi −θ0 −xiθ1 −x2iθ22 +λ|θi| (32)
i=1 i=1
(1) [7 Pts.] Select all the true statements for the above objective function (Equation 32).
A. This loss function likely corresponds to a classification problem. B. θ is the regularization parameter.
C. This is an example of L1 regularization.
D. This is not a linear model in θ.
E. This model includes a bias/intercept term.
F. This model incorporates a non-linear feature transformation.
G. Large values of λ would reduce the model to a constant θ0.
H. None of the above are true.
(2) [2 Pts.] Suppose in our implementation we accidentally forget to square the first term:
n2
J(θ)=yi −θ0 −xiθ1 −x2iθ2+λ|θi| (33)
i=1 i=1
What would change if we tried to train a model using gradient descent on this objective function rather than the original objective function? (Select only one)
DS100
Final, Page 20 of 42
A. B. C. D. E.
Thetrainingcodewouldraiseanerrorduetoamatrix/vectordimensionproblem. The training process would diverge with θ0 → −∞
The training process would diverge with θ0 → ∞
The training process would converge to a different regression line.
Nothing; the training process would eventually converge to the same regression line.
22. [5 Pts.] Let X be a n × p design matrix with full column rank and y be a n × 1 response vector. Let βˆ be the optimal solution to the least squares problem and r be its associated error. In other words,
y = Xβˆ + r (34)
Consider X2 the second column of X.
(1) [1 Pt.] True or False. Without any additional assumptions,
r · X2 = 0
where · denotes the usual dot product?
(2) [4 Pts.] Provide a short proof or counter example.
You may use the following scratch space but we will only grade what you put on the answer sheet.
Solution: True. It suffices to show that r is orthogonal to the column space of X.
XTr=XT(y−X(XTX)−1XTy)=(XT −XTX(XTX)−1XT)y=(XT −XT)y=0
DS100 Final, Page 21 of 42
6 Classification
23. For each of the following circle T for true or F for false.
(1) [1 Pt.] Binary or multi-class classification techniques are most appropriate when making predictions about continuous responses.
(2) [1 Pt.] In a setting with extreme class imbalance in which 95% of the training data have the same label it is always possible to get at least 95% training accuracy.
(3) [1 Pt.] In a setting with extreme class imbalance in which 95% of the training data have the same label it is always possible to get at least 95% test accuracy.
(4) [1 Pt.] In logistic regression, predictor variables (X) are continuous, with values from 0 to 1.
(5) [1 Pt.] In two-class logistic regression, the response variable (y) is continuous, with values from 0 to 1.
(6) [1 Pt.] In logistic regression, the outputs of the sigmoid function are continuous, with values from 0 to 1.
(7) [1 Pt.] In two-class logistic regression, the output of the sigmoid function for each data point represents the category of that data point.
Solution: False. Classification techniques are used in setting like spam classification where the response variable is one of potential many classes.
Solution: True. In settings with class imbalance always predicting the most common label is guaranteed to get a training accuracy that matches the proportion of the most common label.
Solution: False. The test accuracy could be much lower depending on the class imbalance in the test data.
Solution: False. There is no such constraint on the values that predictor variables might take. They are continuous from −∞ to ∞.
Solution: False. The response variable is categorical, only taking values 0 or 1.
Solution: True. This is a simple property of the sigmoid function.
Solution: False. The outputs of the sigmoid function represent the probability that the data point falls into a certain class (0 or 1).
DS100 Final, Page 22 of 42
(8) [1 Pt.] In logistic regression, we calculate the weights θˆ as θˆ = XT X−1 XT y, and then
fitresponsesasyˆ =σ xTθˆ . ii
24. Using the following figure to answer each of the following questions:
(A) (B) (C) (D)
(1) Which of the above plots represents a linearly separable binary classification task? √(A) ⃝(B) ⃝(C) ⃝(D)
(2) Which of the above plots represents a binary classification task that is not linearly
separable? √
Solution: False. We cannot analytically solve for θˆ - we must use gradient descent. You can tell this wouldn’t work because it would give us the same decision boundary
Tˆ1 Tˆ
(where σ(xi θ) = 2 , or equivalently where xi θ = 0) as linear regression.
x2
x2
x1
x2
x1
x1
⃝ (A) ⃝ (B) ⃝ (C)
(3) Which of the above plots represents a multi-class classification task?
(D)
⃝(A) √(B) ⃝(C) ⃝(D)
(4) Which of the above plots depicts a 1-dimensional Gaussian mixture model?
⃝(A) ⃝(B) √(C) ⃝(D)
25. Consider the following buggy Python implementation of gradient descent.
1 2 3 4 5 6 7 8 9
10 11 12 13
def grad_descent(X, Y, theta0,
grad_function, max_iter = 1000000):
"""X: A 2D array, the feature matrix.
Y: A 1D array, the response vector.
theta0: A 1D array, the initial parameter vector.
grad_function: Maps a parameter vector, a feature matrix, and
a response vector to the gradient of some loss function at
the given parameter value. The return value is a 1D array."""
theta = theta0
for t in range(1, max_iter+1):
grad = grad_function(theta, X, Y)
theta = theta0 + 1/t * grad return theta
DS100 Final, Page 23 of 42
Select all the issues with this Python implementation
A. Line 11 theta should be replaced by theta0. B. Line 12 theta0 should be replaced by theta. C. Line 12 1/t should be replaced by t.
D. Line 12 + should be replaced by -.
26. Suppose we collect a binary classification dataset consisting of {(xi, yi)}ni=1 where xi ∈ R and yi ∈ {0, 1}. Recall that the the probability mass function for a Bernoulli random variable y ∈ {0, 1} is:
P(y | θ) = θy(1 − θ)(y−1) (35) and the sigmoid function is given by:
σ(t) = 1 (36) 1+e−t
Which of the following is the loss function for the logistic regression model with L2 regular- ization?
A. J(θ) = ni=1 θiy(1 − θ)(yi−1) + λ|θ|
B. J(θ) = − ni=1(θxi)y(1 − θxi)(yi−1) + λθ2
C. J(θ) = ni=1 σ (θxi)y (1 − σ (θxi))(yi−1)
D. J(θ)=λθ2 −ni=1[yilogσ(θxi)+(yi −1)log(1−σ(θxi))] E. J(θ)=λθ2 +ni=1[yilogσ(θxi)+(yi −1)log(1−σ(θxi))]
27. [4 Pts.] Which of the following can help deal with overfitting in a logistic regression model? A. Adding additional features.
B. Obtaining additional training data.
C. Performing regularization.
D. Removing data until your classes are linearly separable.
DS100 Final, Page 24 of 42
7 Classification 2
28. For each of the following select T for true or F for false on the answer sheet.
(1) [1 Pt.] A binary or multi-class classification technique should be used whenever there are categorical features.
(2) [1 Pt.] Logistic regression is actually used for classification.
(3) [1 Pt.] The logistic regression loss function was derived by modeling the observations as noisy observations with a Gaussian noise model.
(4) [1 Pt.] Class imbalance can be a serious problem in which the number of training data points from one class is much larger than another.
(5) [1 Pt.] A broken binary classifier that always predicts 0 is likely to get a test accuracy around 50% on all prediction tasks.
(6) [1 Pt.] The root mean squared error is the correct metric for evaluating the prediction accuracy of a binary classifier.
29. Consider the following binary classification dataset
Solution: False. Categorical features may appear in both classification and regression settings and should be addressed using one-hot-encoding.
Solution: True. Logistic regression is somewhat confusingly named as it applies to classifications tasks but builds on the linear models we introduced in least squares linear regression.
Solution: False. Logistic regression was derived using the Bernoulli likelihood of function.
Solution: True.Classimbalancecanbeaseriousproblemandoftenoccursinsettings like disease diagnosis where a large fraction of the population is healthy.
Solution: False. In many case class imbalance could result in substantially higher or lower accuracy.
Solution: False. Root mean squared error is a standard measure of accuracy for regression. Logistic regression accuracy is often measured by the fraction of examples predicted correctly or in some cases the likelihood of the data under the model.
DS100 Final, Page 25 of 42
y
1 0
x
(1) [3 Pts.] Draw a reasonable approximation of the logistic regression probability estimates for P(Y = 1 | x) on top of the figure on the answersheet.
Solution: Anything close to the following would be acceptable:
y
1 0
x
It is important that:
1. the curve is higher for smaller values of x 2. the curve is smooth
3. the curve is a sigmoid
(2) [1 Pt.] Are these data linearly separable? A. Yes
B. No
DS100 Final, Page 26 of 42
30. [3 Pts.] Suppose you are given θ for the logistic regression model to predict whether a tumor is malignant (y = 1) or benign (y = 0) based on features of the tumor x. If you get a new patient x∗ and find that xT∗ θ > 0, what can you say about the tumor? Select only one.
A. The tumor is benign
B. The tumor is more likely benign
C. The tumor is more likely to be malignant D. The tumor is malignant
31. [4Pts.] Whichofthefollowingexplanationsthatapplyingregularizationtoalogisticregression model? Select all that apply.
A. The training error is too high.
B. The test error is too low.
C. The data are high-dimensional.
D. There is a large class imbalance.
E. None of the above justify regularization for logistic regression.
DS100 Final, Page 27 of 42
8 Bias-Variance Tradeoff
32. For each of the following circle T for true or F for false.
(1) Increasing the regularization penalty decreases bias.
(2) Without taking precautions, reducing bias often leads to increased variance.
(3) In the bias variance trade-off the variance refers to the variability in predictions across different training datasets.
(4) As we improve the model to reduce bias we often run the risk of under-fitting.
33. In this question we complete the following figure describing the train-test split and k-fold cross validation. Note the data table with many records and a few columns is depicted on the left as a tall rectangle.
Solution: False. Increasing the regularization penalty reduces variance but increases bias.
Solution: True. This is the fundamental bias variance tradeoff.
Solution: True. In the bias variance trade-off the variance refers to variability in predictions due to variability in training data.
Solution: False. Improving the model by reducing bias often leads to over-fitting
(A) (D)
(E)
K-Fold Cross Validation
(1) This part of the figure refers to the validation data. ⃝(A) ⃝(B) ⃝(C) ⃝(D) √(E)
(2) This part of the figure refers to the testing data ⃝(A) √(B) ⃝(C) ⃝(D) ⃝(E)
Data
(B) (C)
DS100 (3)
(4)
Final, Page 28 of 42
This part of the figure refers to the process of constructing the train-test split.
√(A) ⃝(B) ⃝(C) ⃝(D) ⃝(E)
Select all the following statements that apply to the above figure.
A. This figure illustrates 5-fold cross validation.
B. This figure illustrates 6-fold cross-validation.
C. Assuming all the data points are distinct each of the validation data sets are also distinct.
D. The test data should be used during cross-validation to fully evaluate the model.
34. For each of the following select T for true or F for false on the answer sheet.
(1) [1 Pt.] Regularization can be used to manage the bias-variance trade-off.
(2) [1 Pt.] When conducting linear regression, adding polynomial features to your data often decreases the variance of your fitted model.
(3) [1 Pt.] When conducting linear regression, adding polynomial features to your data often decreases the bias of your fitted model.
(4) [1 Pt.] Suppose your data are an i.i.d. sample from a population. Then collecting a larger sample for use as a training set can help reduce bias.
(5) [1 Pt.] Suppose your data are an i.i.d. sample from a population. Then collecting a larger sample for use as a training set can help reduce variance.
(6) [1 Pt.] Training error is typically larger than test error.
Solution: True. Regularization encourages simpler models which can help to reduce variance but increase bias.
Solution: False. Adding more features tends to increase your model’s variance since there are more parameters to fit.
Solution: True. Adding more features tends to decrease your model’s bias since your model can fit more complicated patterns in the data.
Solution: False. Increasing the dataset size without changing the modeling procedure can often reduce variance but is unlikely to address bias.
Solution: True. More data often helps to reduce variance in the model fitting process.
Solution: False. Training error often under-estimates the test error.
DS100 Final, Page 29 of 42
(7) [1 Pt.] If you include the test set in your training data, your accuracy as measured on the test set will probably increase.
(8) [1 Pt.] It is important to frequently evaluate models on the test data throughout the process of model development.
35. [2 Pts.] A colleague has been developing models all quarter and noticed recently that her test error has started to gradually increase while her training error has been decreasing. Which of the following is the most likely explanation for what is happening? Select only one.
A. She is starting to over-fit to her training data.
B. She is starting to under-fit to her training data. C. The model is overly biased.
D. None of the above.
Solution: True. Training on the test data improves test accuracy but this improvement can be misleading due to over-fitting.
Solution: False. Noooooooooooo. Once test data is used it is no longer test data. You should create validation datasets or use cross-validation procedures to evaluate models.
DS100 Final, Page 30 of 42 36. [5 Pts.] Given the following general loss formulation:
nd
argmin yi −xTi θ2 +λθp2 (37)
θ
i=1 p=1
Which of the following statements are true? Select all that apply. A. There are d data points.
B. There are n data points.
C. The data is d dimensional.
D. This is a classification problem.
E. This is a linear model.
F. This problem has LASSO regularization.
G. Larger values of λ imply increased regularization. H. Larger values of λ will increase variance.
I. Larger values of λ will likely increase bias. J. None of the above are true.
37. [3 Pts.] In class we broke the least-squares error into three separate terms:
E(y−fθ(x))2=E(y−h(x))2+E(h(x)−fθ(x))2+E(fθ(x)−E[fθ(x)])2 (38)
where y = h(x) + ε, h(x) is the true model and ε is zero-mean noise. For each of the following terms, indicate its usual interpretation in the bias variance trade-off:
1. E [(y − h(x))2]: A. Bias B. Variance C. Noise
2. E(h(x)−fθ(x))2: A. Bias B. Variance C. Noise
3. E(fθ(x)−E[fθ(x)])2: A. Bias B. Variance C. Noise
DS100 Final, Page 31 of 42
9 Big Data
38. Which of the following are true:
A. Star schemas are designed to decrease redundancy.
B. A Data Warehouse is typically updated every time a change occurs in a related Operational Data Store.
C. A typical Data Warehouse favors cleanliness over completeness: it rejects data that does not conform to the warehouse schema.
D. A typical Data Lake favors completeness over cleanliness: it allows you to store any data you like, without even requiring a schema.
E. The “T” in ETL involves many of the same tasks as Data Wrangling.
39. Consideradatawarehouseofautomobilesensorreadings,whichrecordsinformationonsensors, readings, and vehicles where the sensors are placed. Which of the following are true:
A. Because a traditional ETL process only loads data into the warehouse periodically, it will lose sensor information recorded in the operational data store.
B. Each sensor reading should be timestamped in the data warehouse.
C. There is no reason for the data warehouse to record timestamps for information on the vehicles.
40. Which of the following features are typical of a distributed file system:
A. It can store large volumes of data.
B. It is optimized to store data as compactly as possible.
Solution: False. That would be too resource-intensive. Instead, we run batch ETL periodically.
Solution: False. For this application, sensor readings would not be UPDATEs to existing sensor readings; instead it makes more sense for each sensor reading to be timestamped and stored in the operational data store separately. So the ETL process would pick up the full history of sensor readings in this case.
Solution: False. It would be useful for the warehouse to record the history of changes to information about vehicles.
DS100 Final, Page 32 of 42
Solution: False. It in fact wastes storage by replicating data, to provide fault tolerance among other features.
C. It can keep serving files even after a certain number of machine failures.
D. After a crash, if any data can be recovered at all, then all the data can be recovered.
41. In class, we asserted that MapReduce is being used less and less in practice. It is being replaced by what other programming interfaces? Why?
42. Which of the following are true?
A. Because people can store any file in a Data Lake, it is harder to assess data quality in a Lake than in a Warehouse.
B. The raw data in a Data Lake will likely require more wrangling than the data in a well-governed Data Warehouse.
C. The lack of a unifying schema in a Data Lake makes it difficult to get a global view of information being captured.
D. Relative to traditional Data Warehouses, Data Lakes make it easier to secure data in a well-governed way.
43. Consider the following simple Data Warehouse schema from a Cellular Service Provider, which records activity on a cell phone network:
CREATE TABLE devices (
did integer, customer_id integer, phone_number varchar(13), firstname text, lastname text,
Solution: In general this is false. It is possible for some shards to be recoverable and others not so.
Solution: SQL and Dataframe APIs. Both of those interfaces can achieve anything that MapReduce can do, and offer higher-level constructs that are convenient for tabular data: e.g. joins, built-in aggregates (reduce functions), etc.
Solution: The first 3 parts are true and self-explanatory. The last one is false … data lakes encourage users to “land” data files that are unstructured and experimental. It is hard to know what confidential information might be in such files, and who should get access to them as a result.
DS100 Final, Page 33 of 42
zip varchar(12), registered_on varchar(2), PRIMARY KEY(did),
UNIQUE (customer_id) — a ‘‘candidate’’ key );
CREATE TABLE billing (
rate_code char PRIMARY KEY,
description text, base_fee float, per_minute float, max_minutes integer, overage_fee float,
PRIMARY KEY (rate_code));
CREATE TABLE calls (
caller_handset_id integer, callee_handset_id integer, cell_tower_id integer, call_start datetime, call_end datetime, billing_code char,
PRIMARY KEY (caller_handset_id, call_start),
FOREIGN KEY (caller_handset_id) REFERENCES devices,
FOREIGN KEY (billing_code) REFERENCES billing;
(1) [3 Pts.] Which of these tables is a dimension table? Select all that apply. A. devices
B. calls
C. billing
D. None of the above.
(2) [3 Pts.] Which of the following statements are true? Select all that apply.
A. The calls.billing code column violates star schema design because any
update to a single billing fee requires updates to many call records.
B. If we want to look for correlations between a device’s average call length and the time since it was registered, we have to perform a join.
C. If the cell service provider implemented a Data Lake, it would make it easier for them to load audio recordings of calls for subsequent analysis.
D. None of the above statements are true.
44. [3 Pts.] The figure below depicts a distributed file system with one logical “big file” partitioned into 4 “shards” (A, B, C, D) and replicated across multiple worker machines (1, 2, 3, 4).
Solution: devices and billing
Big File
D
Worker 1
B
D
Worker 2
A
Worker 3
AC D
Worker 4
A
D
A
C
B
C
C
B
B
DS100 Final, Page 34 of 42
Suppose workers 1 AND 2 both fail. Which of the following statements are true? Select all
that apply.
A. The full file will remain available since worker 3 and worker 4 are both still running.
B. The system can tolerate one more worker failure without losing data.
C. If every request requires all 4 shards of the file, then worker 3 and worker 4 can share the work evenly.
D. None of the above statements are true.
45. Consider only the mechanism of partitioning files into shards, and storing different shards on
different machines. Which of the following statements are true? Select all that apply. A. Partitioning enhances the ability of the system to store large files.
B. Partitioning allows the system to tolerate machine failures without losing data. C. Partitioning allows the system to read files in parallel.
D. None of the above statements are true.
46. [2 Pts.] Recall the statistical query pattern discussed in class for computing on very large data
sets. Which of the following statements are true? Select all that apply.
A. It eliminates the need for the end-user device (e.g. a laptop) to acquire all the
data.
B. It pushes the computational task closer to the large-scale data storage.
C. It is well suited to both MapReduce and SQL interfaces.
D. Analternativetothestatisticalquerypatternforbigdataistoacquireasample of the full dataset on the end-user device.
E. None of the above statements are true.
DS100 Final, Page 35 of 42
10 XPath
47. [3 Pts.] Below is a tree representation of an XML document. The tree has 10 nodes, which we have been numbered 1 through 10. Two of these are text nodes: one containing “Some Text” (labeled #5), and the other “More Text” (labeled #10). In addition, some of the nodes have attributes, e.g. #7 is the tag
For each of the following XPath expressions, provide the numbers for the nodes which are located by the expression. If no nodes match, say NULL.
1. //f
2. //b/..
3. //c//f
4. //b[@id]
Solution: 4 and 9
Solution: 1 and 3
Solution: Any “f” that is a child of any “c”: 9
Solution: Any “b” that has the ATtribute id: 8
DS100 Final, Page 36 of 42
11 EDA and Visualization
48. [2 Pts.] Consider the following statistics for infant mortality rate. According to these statistics, which transformation would best symmetrize the distribution? (Select only one.)
Transformation lower quartile median upper quartile x 13 30 68
√x 3.5 5 8 log(x) 1.15 1.5 1.8
A. no transformation
B. square root
C. log
D. not possible to tell with this information
49. [5 Pts.] For each of the following scenarios, determine which plot type is most appropriate to reveal the distribution of and/or the relationships between the following variable(s). For each scenario, select only one plot type. Some plot types may be used multiple times.
A. histogram
B. pie chart
C. bar plot
D. line plot
E. side-by-side boxplots
F. scatter plot
G. stacked bar plot H. overlaid line plots
I. mosaic plot
(1) [1 Pt.] sale price and number of bedrooms (assume integer) for houses sold in Berkeley in 2010.
(2) [1 Pt.] sale price and date of sale for houses sold in Berkeley between 1995 and 2015.
(3) [1 Pt.] infant birth weight (grams) for babies born at Alta Bates hospital in 2016.
Solution: E. Side-by-side Boxplots. We might imagine using a scatter plot since we are plotting the relationship between two numeric quantities. However because the number of bedrooms is an integer and most houses will only have a small number, we are likely to encounter over-plotting in the scatter plot. Therefore side-by-side boxplots are likely to be most informative.
Solution: F. Scatter Plot. Here we are plotting two numeric quantities with sufficient spread on each axis.
DS100 Final, Page 37 of 42
Solution: A. Histogram. Here we are plotting the distribution of a likely large number of observations and therefore a histogram would be most appropriate.
(4) [1 Pt.] mother’s education-level (highest degree held) for students admitted to UC Berke- ley in 2016
(5) [1 Pt.] SAT score and HS GPA of students admitted to UC Berkeley in 2016
(6) [1 Pt.] race and gender of students admitted to UC Berkeley in 2016
(7) [1 Pt.] The percentage of female student admitted to UC Berkeley each year from 1950 to 2000.
(8) [1 Pt.] SAT score for males and females of students admitted to UCB from 1950 to 2000
Solution: C. Bar Plot. Here we want to visualize counts of a categorical variable.
Solution: F. Scatter Plot. Here we are visualizing the relationship between two continuous quantities.
Solution: I. mosaic plot Here we are visualizing the relationship between two cate- gorical variables.
Solution: D. Line plot. This allows us to see the trends over time.
Solution: E. side-by-side boxplots. This allows us to see the distributions of SAT scores per gender and year.
DS100 Final, Page 38 of 42 50. [4 Pts.] Consider the following empirical distribution:
(1) [1 Pt.] The distribution has mode(s). A.1 B.2 C.3 D.4
(2) [1 Pt.] The distribution is: A. Skewed left
B. Symmetric
C. Skewed right
(3) [2 Pts.] Select all of the following properties displayed by the distribution: A. gaps
B. outliers
C. normal left tail
D. None of the above
DS100 Final, Page 39 of 42
51. [4 Pts.] Select all of the problems associated with the following plot (there may be more than one problem):
A. Over-plotting
B. Use of chart junk
C. Vertical axis should be in log scale
D. Missing vertical axis label
E. Poor use of the horizontal dimension
F. Graph elements interfere with data G. Stacking
H. Use of angles to convey information
I. None of the above are problems with this awesome plot.
DS100 Final, Page 40 of 42
52. In the odd questions, name the plot’s type (for example, “scatter” or “box”). In the even questions, answer whether the plot is useful for answering the given query.
(1) [1 Pt.]
(2) [1 Pt.] True or false: This plot is useful for answering the question: “Did underemploy- ment generally increase when unemployment increased?”
Solution: This is a stacked line plot.
Solution: False. Because of the stacking, it’s hard even to tell how much underem- ployment there was in each year, much less compare changes in underemployment with changes in unemployment. To answer this question, a good visualization would be a scatter plot of the change in unemployment and underemployment for each year.
(3) [1 Pt.]
DS100 Final, Page 41 of 42
(The plot is from http://www.stubbornmule.net/2009/07/love-old-fashioned/ and displays topics of selected popular music over time. Not all pop songs are represented
in the dataset.)
(4) [1 Pt.] True or false: This plot is useful for answering the question: “Among the songs in this dataset, how many were released in each of the five decades from 1960 to 2010?”
(5) [1 Pt.]
(6) [1Pt.] Trueorfalse:Thisplotisusefulforansweringthequestion:“Assumingthebacteria population grew linearly over time, what was the rate of increase?”
Solution: This is a mosaic plot.
Solution: True. The widths of the bars show us how many songs were released in each category, and the categories include the 1960s through 2000s.
Solution: This is a line plot on a logarithmic scale.
Solution: False. Since the plot is on a log scale, the slope of the line is the exponential rate of growth, not the linear rate. (The question would have been clearer if it had read, “If we model the population growth as a linear function of time, what would be a good
estimate of the linear growth rate?”)
DS100 Final, Page 42 of 42
End of Exam