Name:
Email address: Student id:
DS-100 Final Exam Spring 2017
Instructions:
• Please fill in your name, email address, and student id at the top of both this exam booklet and your answer sheet.
• All answers must be written on the separate answer sheet.
• This exam must be completed in the 3 hour time period ending at 6:00PM.
• You may use a single page (two-sided) cheat sheet.
• Work quickly through each question. There are a total of 172 points on this exam.
• You must turn in both this exam booklet and your answer sheet.
• Don’t cheat!
DS100 Final, Page 2 of 31 May 11, 2017
1 Maximum Likelihood and Loss Minimization
1. Suppose we observe a dataset {x1, . . . , xn} of independent and identically distributed samples from the exponential distribution. The probability density function (PDF) of an exponential distribution (for x ≥ 0) parameterized by the parameter λ is given by:
f_\lambda(x) = \lambda e^{-\lambda x}
(1) [4 Pts.] What is the log-likelihood function of this dataset with respect to λ?
Solution: The likelihood function is given by:
L(\lambda) = \prod_{i=1}^{n} f_\lambda(x_i) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i}   (1)

Therefore the log-likelihood function is given by:

\log L(\lambda) = \sum_{i=1}^{n} \log\left(\lambda e^{-\lambda x_i}\right) = \sum_{i=1}^{n} \left[\log(\lambda) + \log e^{-\lambda x_i}\right]   (2)

= n \log(\lambda) - \lambda \sum_{i=1}^{n} x_i   (3)
(2) [6 Pts.] Derive the maximum likelihood estimate \hat{\lambda}_{MLE}. Circle your answer.
Solution: Taking the derivative of the log-likelihood function with respect to the parameter λ we get:

\frac{\partial}{\partial\lambda} \log L(\lambda) = n \frac{\partial}{\partial\lambda}\log(\lambda) - \frac{\partial}{\partial\lambda}\left(\lambda \sum_{i=1}^{n} x_i\right)   (4)

= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i   (5)

To compute the maximum likelihood estimate \hat{\lambda}_{MLE} we set the above derivative equal to zero and solve:

0 = \frac{n}{\hat{\lambda}_{MLE}} - \sum_{i=1}^{n} x_i   (6)

\frac{n}{\hat{\lambda}_{MLE}} = \sum_{i=1}^{n} x_i   (7)

\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^{n} x_i}   (8)

Thus the maximum likelihood parameter estimate is:

\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^{n} x_i} = \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^{-1} = \frac{1}{\mathrm{Mean}(x)}   (9)
You may use the following scratch space but we will only grade what you put on the answer sheet.
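The closed-form result above can be sanity-checked numerically. A minimal sketch, assuming synthetic data drawn with NumPy (the variable names here are our own, not from the exam):

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.5
# numpy parameterizes the exponential by the scale 1/lambda
x = rng.exponential(scale=1 / lam_true, size=100_000)

# The MLE derived above: lambda_hat = n / sum(x) = 1 / mean(x)
lam_hat = 1 / x.mean()
print(lam_hat)
```

With many samples the estimate lands close to the true rate of 2.5, as the derivation predicts.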
2. Suppose we collect a dataset of n IID observations {x1, . . . , xn} which we believe are drawn from a distribution with the following PDF:
f_\mu(x) = C \exp\left(-\frac{(x-\mu)^6}{6}\right)   (11)
where C is a constant that does not depend on μ.
(1) [3 Pts.] Write the log-likelihood function for μ.
Solution: Because we assumed the data are independent and identically distributed (IID) according to fμ, the likelihood function is the product of the probabilities of each observation:
L(\mu) = \prod_{i=1}^{n} f_\mu(x_i) = \prod_{i=1}^{n} C \exp\left(-\frac{(x_i-\mu)^6}{6}\right)   (12)

= C^n \exp\left(-\frac{1}{6}\sum_{i=1}^{n}(x_i-\mu)^6\right)   (13)

Taking the log of the likelihood function we get:

\log L(\mu) = n \log C - \frac{1}{6}\sum_{i=1}^{n}(x_i-\mu)^6   (14)
(2) [4 Pts.] Compute the derivative of the log-likelihood with respect to μ.
Solution: Taking the derivative:

\frac{\partial}{\partial\mu}\log L(\mu) = \frac{\partial}{\partial\mu}\, n \log C - \frac{1}{6}\sum_{i=1}^{n}\frac{\partial}{\partial\mu}(x_i-\mu)^6   (15)

= 0 + \sum_{i=1}^{n}(x_i-\mu)^5   (16)
(3) [3 Pts.] Because there is no closed-form solution for μ in \frac{\partial}{\partial\mu}\log L(\mu) = 0, we would likely use gradient ascent to approximately compute \hat{\mu}_{MLE}. Given the gradient function:

g(\mu) = \frac{\partial}{\partial\mu}\log L(\mu),   (18)

and a step size \rho^{(t)}, what is the gradient ascent update rule to go from \mu^{(t)} to \mu^{(t+1)}? (Hint: your answer should contain only the variables g(\mu^{(t)}), \mu^{(t)}, \mu^{(t+1)}, and \rho^{(t)}.)
Solution: Recall that the gradient points in the “uphill” direction. When we maximize, uphill is the way we want to go. So the update rule would look like:
\mu^{(t+1)} \leftarrow \mu^{(t)} + \rho^{(t)} g(\mu^{(t)})   (19)
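The update rule can be sketched in a few lines of Python. This is a toy illustration on made-up data with a fixed step size (both assumptions of ours, not part of the exam), using the gradient Σᵢ(xᵢ − μ)⁵ derived in part (2):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=50)  # toy data centered near 3

def g(mu):
    # Gradient of the log-likelihood derived above: sum_i (x_i - mu)^5
    return np.sum((x - mu) ** 5)

mu = x.mean()   # a reasonable starting point
rho = 1e-4      # fixed step size, for simplicity
for t in range(5000):
    mu = mu + rho * g(mu)   # ascent: step *with* the gradient

print(mu)
```

After enough iterations the gradient at μ is essentially zero, i.e. μ has reached a stationary point of the log-likelihood.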
2 Wrangling and Querying Data

2.1 SQL
For the questions in this subsection, assume we have a massive database in the cloud with the following schema:
-- A simple digital media store database
CREATE TABLE media
(mid integer PRIMARY KEY,
name text, type char, year_released integer, length integer, buy_cost float, rent_cost float, avg_rating float);
CREATE TABLE customers
(cid integer PRIMARY KEY,
name text, joined date, nation_id integer, activity_level integer);
CREATE TABLE transactions (tid integer PRIMARY KEY,
tdate date, item integer, customer integer,
rent_or_buy integer, price_paid float, percent_viewed float, FOREIGN KEY (item) REFERENCES media,
FOREIGN KEY (customer) REFERENCES customers);
CREATE VIEW stats AS
SELECT min(length) AS len_min, max(length) AS len_max,
avg(length) AS len_mu, stddev(length) AS len_sigma, min(avg_rating) AS ar_min, max(avg_rating) AS ar_max, avg(avg_rating) AS ar_mu, stddev(avg_rating) AS ar_sigma
FROM media;
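For readers more comfortable in pandas, the stats view can be mimicked as a one-row DataFrame of aggregates. The toy media DataFrame below is a made-up stand-in with only the columns the view touches:

```python
import pandas as pd

# Toy stand-in for the media table (column names from the schema above).
media = pd.DataFrame({
    "mid": [1, 2, 3, 4],
    "length": [180, 240, 5400, 2700],
    "avg_rating": [3.5, 4.0, 2.5, 5.0],
})

# Pandas analogue of the stats view: a single row of aggregate statistics.
stats = pd.DataFrame([{
    "len_min": media["length"].min(),
    "len_max": media["length"].max(),
    "len_mu": media["length"].mean(),
    "len_sigma": media["length"].std(),
    "ar_min": media["avg_rating"].min(),
    "ar_max": media["avg_rating"].max(),
    "ar_mu": media["avg_rating"].mean(),
    "ar_sigma": media["avg_rating"].std(),
}])
print(stats)
```

Like the SQL view, this has exactly one row, which matters for the granularity questions below.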
3. [4 Pts.] In the media table above, the type column encodes the type of media as a unique character code (e.g., ’S’ for song, ’M’ for movie, ’E’ for episode, etc.). Suppose we wanted to modify the stats view to display the stats for each type of media. Which of the following are true? (Select all that apply.)
A. We need to change the granularity of the view to be finer than it is above.
B. We need to add a GROUP BY type clause to the view.
C. It would be helpful to add media.type to the list of columns in the SELECT clause of the view.
D. The modified view should have more rows than the original view above.
E. None of the above.
4. [3 Pts.] Which of the following queries finds the ids of media that are 2 standard deviations
longer than the mean length? (Select only one.)

A. SELECT media.mid FROM media, stats
   WHERE media.mid = stats.mid
   AND media.length >= stats.len_mu + 2*(stats.len_sigma);

B. SELECT media.mid
   FROM media, stats
   WHERE media.length >= stats.len_mu + 2*(stats.len_sigma);

C. SELECT media.mid FROM media
   WHERE media.length >= avg(media.length) + 2*stddev(media.length);

D. None of the above.
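The same computation is easy to express in pandas, which makes the logic of the correct option concrete. The data below is invented for illustration:

```python
import pandas as pd

# Hypothetical media table: 19 short items and one very long one.
media = pd.DataFrame({
    "mid": range(1, 21),
    "length": [100] * 19 + [500],
})

mu = media["length"].mean()
sigma = media["length"].std()
# The one-row stats view applies to every media row, so no join key
# is needed -- just compare each length against mu + 2*sigma.
long_mids = media.loc[media["length"] >= mu + 2 * sigma, "mid"]
print(list(long_mids))  # -> [20]
```

Only the 500-second item clears the two-standard-deviation threshold here.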
2.2 SQL Sampling
The transactions table has 30 million (30 × 106) rows. It is too large to load into the memory of our laptop. We will extract a sample from the database server to process on our laptop in Python.
SELECT *
FROM transactions TABLESAMPLE Bernoulli(.0001);
5. [2 Pts.] Suppose you ran this query many times. What distribution describes the output sizes (in number of rows) you would see across runs?

Solution: Binomial.

6. [2 Pts.] In expectation, how many rows will there be in the answer to this query?

Solution: 3000 = (30 × 10⁶) × 10⁻⁴

7. [4 Pts.] Your friend Emily Engineer tells you to avoid Bernoulli sampling, and use the following query instead:

SELECT *
FROM transactions
LIMIT XX;

(where XX is replaced by the correct answer to the previous question). Select all the true statements:

A. Emily’s LIMIT query will probably run faster than the TABLESAMPLE query. For Emily’s query, the database engine can simply access the first XX rows it finds in the table, and skip the rest.

Solution: True, for the reasoning given in the statement.

B. Emily’s query result may be biased to favor certain rows.

Solution: True. The database will optimize for speed, which will likely favor clusters of records stored near each other.

C. The output of the TABLESAMPLE query provides a hint about how many rows there are in the transactions table while Emily’s LIMIT query does not.

Solution: True. You can extrapolate from the sample size and the sampling probability to predict the table size.

D. Emily’s LIMIT query may run fast, but it will swamp the memory on your laptop, since it doesn’t sample the database.

Solution: False. Emily’s query will only return XX rows to the laptop.

E. None of the above.
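The Binomial answer to question 5 can be simulated directly. A minimal sketch, assuming each of the 30 million rows is kept independently with probability 10⁻⁴:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 30_000_000
p = 0.0001

# Each row is kept independently with probability p, so the number of
# returned rows across repeated runs is Binomial(n_rows, p).
sizes = rng.binomial(n=n_rows, p=p, size=10_000)

print(sizes.mean())  # close to the expected 3000 rows
print(sizes.std())
```

The spread across runs (standard deviation around 55 rows) is exactly why repeated TABLESAMPLE runs return slightly different row counts.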
8. [2 Pts.] You will recall from Homework 4 that it is possible to do bootstrap sampling in SQL by constructing a design table with two columns. Each of the columns used in that scheme is described by a single choice below. Identify the two correct choices:
A. A foreign key to the table being sampled.
B. A count column to capture the number of tuples in each bootstrap sample.
C. An identifier to group rows together into bootstrap samples.
D. A regularization column to prevent overfitting.
2.3 Pandas
For the questions in this subsection, assume that we have pandas dataframes with the same schemas as described in the previous section on SQL. That is, we have a media dataframe with columns mid, name, type, year_released, et cetera. Assume that the index column of each dataframe is meaningless—the primary key is represented as a regular column.
9. [3 Pts.] Consider the following code snippet:
def get_average_price_paid(join_method):
    return (customers
            .merge(transactions, how=join_method,
                   left_on='cid', right_on='customer')
            .loc[:, 'price_paid']
            .fillna(0)  # <- Important
            .mean())
inner = get_average_price_paid(’inner’)
outer = get_average_price_paid(’outer’)
left = get_average_price_paid(’left’)
right = get_average_price_paid(’right’)
Assume that all item prices are positive, all transactions refer to valid customers in the customers table, but some customers may have no transactions.
(1) How are inner and outer related? Pick one best answer.
A. inner < outer
B. inner ≤ outer
C. inner = outer
D. inner ≥ outer
E. inner > outer

(2) How are left and right related? Pick one best answer.
A. left < right
B. left ≤ right
C. left = right
D. left ≥ right
E. left > right

(3) How are left and outer related? Pick one best answer.
A. left < outer
B. left ≤ outer
C. left = outer
D. left ≥ outer
10. [3 Pts.] We wish to write a python expression to find the largest amount of money spent by one person on any single date. We will use the following code:
biggie = transactions.groupby(_____)[’price_paid’].sum().max()
What should we pass in as our groupby predicate? Select only one answer.
A. 'tdate'
B. 'customer'
C. ['item', 'tdate']
D. ['customer', 'tdate']
E. ['customer', 'item']
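A quick worked example of the groupby-sum-max pattern, on made-up transaction rows using the column names from the schema:

```python
import pandas as pd

# Hypothetical transactions rows (columns from the SQL schema above).
transactions = pd.DataFrame({
    "customer": [1, 1, 2, 2, 2],
    "tdate": ["2017-05-01", "2017-05-01",
              "2017-05-01", "2017-05-02", "2017-05-02"],
    "price_paid": [10.0, 5.0, 8.0, 20.0, 30.0],
})

# Grouping by both customer and date totals each person's spending per day;
# the max of those totals is the largest single-person, single-day amount.
biggie = transactions.groupby(["customer", "tdate"])["price_paid"].sum().max()
print(biggie)  # -> 50.0
```

Grouping by 'tdate' alone would instead sum over all customers on a date, which answers a different question.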
11. [6 Pts.] Fill in the following python code that finds the names of every customer who has spent over $100.
merged = customers.merge(__A__, left_on=__B__, right_on=__C__)
grouped = merged.groupby(__D__).__E__()
names = grouped[__F__].index
Solution:

merged = customers.merge(transactions,
                         left_on='cid', right_on='customer')
grouped = merged.groupby('cid').sum()
names = grouped[grouped.price_paid > 100].index
12. [4 Pts.] We wish to find years where the average price paid (over all time) for products released in that year is greater than the average price paid across all transactions; from those years we want to return the earliest (smallest). We have the following code:
merged = transactions.merge(media, left_on="item",
                            right_on="mid")
mean_price = merged.groupby("year_released")\
    .mean().price_paid.mean()                             # Line A
by_year = merged.groupby("year_released").count()         # Line B
is_greater = by_year[by_year.price_paid > mean_price]     # Line C
result = is_greater.sort_index(ascending=False).index[0]  # Line D
Some of these lines need to be modified in order for the code to work properly. We have suggested replacements for each line below. Which lines need to be replaced? Select all that apply.
A. mean_price = merged.price_paid.mean()
B. by_year = merged.groupby(“year_released”).mean()
C. is_greater = by_year.where(by_year.price_paid > mean_price)
D. result = is_greater.sort_index(ascending=True).index[0]
E. All the lines are correct.
3 Feature Engineering
For this problem we collected the following data on the new social networking app UFace.
13. Suppose we are interested in predicting the number of responses for future posts. For each of the columns, indicate which (one or more) of the given feature transformations could be informative. Select all that apply.
(1) [2 Pts.] The PostID column:
A. Drop the column
B. One-Hot encoding
C. Leave as is
(2) [2 Pts.] The Time column:
A. Take the hour as a float
B. One-Hot encoding
C. Bag-of-words encoding
D. Time since midnight in seconds
(3) [2 Pts.] The Text column:
A. The length of the text
B. One-Hot encoding
C. Bag-of-words encoding
D. Leave as is
(4) [2 Pts.] The State column:
A. The length of the text
B. One-Hot encoding
C. Bag-of-words encoding
D. Leave as is
PostID | UTC Time | Text                           | Num. Responses | State
3      | 08:10 PM | “Checkout my breakfast . . . ” | 2              | VA
13     | 11:00 AM | “Studied all night for . . . ” | 5              | CA
14     | 12:04 PM | “Hello world!”                 | 0              | NY
17     | 11:35 PM | “That exam was lit …”          | 42             | CA
14. [4 Pts.] Suppose we believe that people are more likely to respond to tweets in the afternoon (roughly from hours 13 to 17). Which of the following feature functions would help capture this intuition? Assume that the function localHour takes a time and a state as its arguments and returns the hour of the day (in 24-hour time) in the state’s time zone. Also assume that any boolean-valued feature is encoded as 0 (false) or 1 (true). Select all that apply.
A. φ(time, state) = localHour(time, state)
B. φ(time, state) = 13 < localHour(time, state) < 17
C. φ(time, state) = exp(−(localHour(time, state) − 15)²)
D. φ(time, state) = exp(localHour(time, state) − 15)
E. None of the above.
15. [2 Pts.] Given the following text from a BigData Borat post: “Data Science is statistics on a Mac.”
Which of the following is the bi-gram encoding including stop-words? (Select only one.)
A. {(’data’, 1), (’science’, 1), (’statistics’, 1),
(’mac’, 1)}
B. {(’data science’, 1), (’science statistics’, 1),
(’statistics mac’, 1)}
C. { (’data science’, 1), (’science is’, 1),
(’is statistics’, 1), (’statistics on’, 1), (’on a’, 1), (’a mac’, 1)}
D. {(’data science’, 1), (’is statistics’, 1), (’on a’, 1), (’mac’, 1)}
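A bi-gram bag can be built in a few lines. This is a minimal sketch with a hypothetical helper of our own (not a library function), keeping stop-words as the question requires:

```python
# Minimal bag-of-bigrams encoder (hypothetical helper, for illustration).
def bigrams(text):
    tokens = text.lower().replace(".", "").split()
    pairs = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return counts

enc = bigrams("Data Science is statistics on a Mac.")
print(enc)
```

The seven tokens yield six bigrams, including the stop-word pairs ('science is', 'on a'), matching option C.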
4 Least Squares Regression and Regularization
16. For this question we use the following toy dataset:
(1) [3 Pts.] We have fit several models depicted as curves in the following plots:
(a) (b) (c)
Select the plot that best matches each of the models below. Each plot is used exactly
once.
1. Linear regression model ⃝(A) ⃝(B) √(C)
2. Linear regression with degree 10 polynomial features ⃝(A) √(B) ⃝(C)
3. Ridge regression with degree 10 polynomial features and substantial regularization. √(A) ⃝(B) ⃝(C)
(2) [2 Pts.] We fit two more models to these data. Again, the solid curves display the predic- tions made by each model.
(a) (b)
Select the plot that best matches each of the models below. Each plot is used exactly once.
1. Ridge regression with degree 10 polynomial features, λ = 0.1. √ (A) ⃝ (B)
2. Ridge regression with degree 10 polynomial features, λ = 1.0. ⃝ (A) √ (B)
17. Suppose you are given a dataset {(xi, yi)}ni=1 where xi ∈ R is a one dimensional feature and yi ∈ R is a real-valued response. To model this data you choose a model characterized by the following objective function:
J(\theta) = \sum_{i=1}^{n}\left(y_i - \theta_0 - x_i\theta_1 - x_i^2\theta_2\right)^2 + \lambda\sum_{i=1}^{2}|\theta_i|   (20)
(1) [7 Pts.] Select all the true statements for the above objective function (Equation 20).
A. This loss function likely corresponds to a classification problem.
B. θ is the regularization parameter.
C. This is an example of L1 regularization.
D. This is not a linear model in θ.
E. This model includes a bias/intercept term.
F. This model incorporates a non-linear feature transformation.
G. Large values of λ would reduce the model to a constant θ0.
H. None of the above are true.
(2) [2 Pts.] Suppose in our implementation we accidentally forget to square the first term:
J(\theta) = \sum_{i=1}^{n}\left(y_i - \theta_0 - x_i\theta_1 - x_i^2\theta_2\right) + \lambda\sum_{i=1}^{2}|\theta_i|   (21)
What would change if we tried to train a model using gradient descent on this objective function rather than the original objective function? (Select only one)
A. The training code would raise an error due to a matrix/vector dimension problem.
B. The training process would diverge with θ0 → −∞
C. The training process would diverge with θ0 → ∞
D. The training process would converge to a different regression line.
E. Nothing; the training process would eventually converge to the same regression line.
18. [5 Pts.] Let X be a n × p design matrix with full column rank and y be a n × 1 response vector. Let βˆ be the optimal solution to the least squares problem and r be its associated error. In other words,
y = X\hat{\beta} + r   (22)
Consider X_2, the second column of X.
(1) [1 Pt.] True or False. Without any additional assumptions,
r · X2 = 0
where · denotes the usual dot product?
(2) [4 Pts.] Provide a short proof or counter example.
You may use the following scratch space but we will only grade what you put on the answer sheet.
Solution: True. It suffices to show that r is orthogonal to the column space of X.
X^T r = X^T\left(y - X(X^T X)^{-1}X^T y\right) = \left(X^T - X^T X(X^T X)^{-1}X^T\right)y = \left(X^T - X^T\right)y = 0
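The orthogonality claim is easy to verify numerically. A sketch on random data (our own toy setup, not part of the exam):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # full column rank with probability 1
y = rng.normal(size=50)

# Least-squares fit and its residual vector
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

# The residual is orthogonal to every column of X, including the second.
print(r @ X[:, 1])  # numerically ~0
```

Any column of X, or any linear combination of them, gives the same near-zero dot product, since r is orthogonal to the entire column space.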
5 Classification
19. For each of the following select T for true or F for false on the answer sheet.
(1) [1 Pt.] A binary or multi-class classification technique should be used whenever there are categorical features.
(2) [1 Pt.] Logistic regression is actually used for classification.
(3) [1 Pt.] The logistic regression loss function was derived by modeling the observations as noisy observations with a Gaussian noise model.
(4) [1 Pt.] Class imbalance can be a serious problem in which the number of training data points from one class is much larger than another.
(5) [1 Pt.] A broken binary classifier that always predicts 0 is likely to get a test accuracy around 50% on all prediction tasks.
(6) [1 Pt.] The root mean squared error is the correct metric for evaluating the prediction accuracy of a binary classifier.
20. Consider the following binary classification dataset
Solution: False. Categorical features may appear in both classification and regression settings and should be addressed using one-hot-encoding.
Solution: True. Logistic regression is somewhat confusingly named as it applies to classifications tasks but builds on the linear models we introduced in least squares linear regression.
Solution: False. Logistic regression was derived using the Bernoulli likelihood of the data.
Solution: True. Class imbalance can be a serious problem and often occurs in settings like disease diagnosis where a large fraction of the population is healthy.
Solution: False. In many cases class imbalance could result in substantially higher or lower accuracy.
Solution: False. Root mean squared error is a standard measure of accuracy for regression. Logistic regression accuracy is often measured by the fraction of examples predicted correctly or in some cases the likelihood of the data under the model.
[Figure: scatter plot of a binary response y ∈ {0, 1} against a one-dimensional feature x.]
(1) [3 Pts.] Draw a reasonable approximation of the logistic regression probability estimates for P(Y = 1 | x) on top of the figure on the answer sheet.
Solution: Anything close to the following would be acceptable:

[Figure: a smooth sigmoid curve overlaid on the data, high for small x and decreasing as x grows.]

It is important that:
1. the curve is higher for smaller values of x
2. the curve is smooth
3. the curve is a sigmoid
(2) [1 Pt.] Are these data linearly separable?
A. Yes
B. No
21. [3 Pts.] Suppose you are given θ for the logistic regression model to predict whether a tumor is malignant (y = 1) or benign (y = 0) based on features of the tumor x. If you get a new patient x_* and find that x_*^T θ > 0, what can you say about the tumor? Select only one.
A. The tumor is benign
B. The tumor is more likely benign
C. The tumor is more likely to be malignant
D. The tumor is malignant
22. [4 Pts.] Which of the following would justify applying regularization to a logistic regression model? Select all that apply.
A. The training error is too high.
B. The test error is too low.
C. The data are high-dimensional.
D. There is a large class imbalance.
E. None of the above justify regularization for logistic regression.
6 Clustering
23. [4 Pts.] The following diagram shows a scatter plot of a small 2-dimensional dataset with 8 elements. An initialization of the k-means algorithm (with k = 3) is displayed; the initial cluster centers are displayed as stars. (They have the same locations as 3 of the points themselves, which is a common initialization of the k-means algorithm.)
Which of the following depicts the 3 cluster centers that would result from a single iteration of the k-means algorithm (with k = 3), starting from the initial cluster center locations above? Select only one.
[Answer choices A–F: six plots of candidate cluster-center locations; figures omitted.]
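A single k-means iteration is short enough to write out. A sketch on invented 2-D points, with initial centers taken from the points themselves as in the question setup:

```python
import numpy as np

# Toy 2-D points; initial centers are three of the points, mirroring
# the initialization described in the question (data here is made up).
points = np.array([[0., 0.], [0., 1.], [1., 0.],
                   [5., 5.], [5., 6.],
                   [9., 0.], [9., 1.], [10., 0.]])
centers = points[[0, 3, 5]].copy()

# One k-means iteration: assign each point to its nearest center,
# then move each center to the mean of its assigned points.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)
new_centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
print(new_centers)
```

Each updated center is the centroid of its assigned cluster, which is exactly what the correct answer choice must depict.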
24. [4 Pts.] Consider the data plotted below. Which of the following clustering methods is most
likely to return the groupings below using untransformed x and y values? Select only one.
A. Single-linkage clustering
B. Complete-linkage clustering
C. Average-linkage clustering
D. k-means clustering
E. None of the above are likely to recover the true groups
7 Bias-Variance Tradeoff
25. For each of the following select T for true or F for false on the answer sheet.
(1) [1 Pt.] Regularization can be used to manage the bias-variance trade-off.
(2) [1 Pt.] When conducting linear regression, adding polynomial features to your data often decreases the variance of your fitted model.
(3) [1 Pt.] When conducting linear regression, adding polynomial features to your data often decreases the bias of your fitted model.
(4) [1 Pt.] Suppose your data are an i.i.d. sample from a population. Then collecting a larger sample for use as a training set can help reduce bias.
(5) [1 Pt.] Suppose your data are an i.i.d. sample from a population. Then collecting a larger sample for use as a training set can help reduce variance.
(6) [1 Pt.] Training error is typically larger than test error.
(7) [1 Pt.] If you include the test set in your training data, your accuracy as measured on the test set will probably increase.
Solution: True. Regularization encourages simpler models which can help to reduce variance but increase bias.
Solution: False. Adding more features tends to increase your model’s variance since there are more parameters to fit.
Solution: True. Adding more features tends to decrease your model’s bias since your model can fit more complicated patterns in the data.
Solution: False. Increasing the dataset size without changing the modeling procedure can often reduce variance but is unlikely to address bias.
Solution: True. More data often helps to reduce variance in the model fitting process.
Solution: False. Training error often under-estimates the test error.
Solution: True. Training on the test data improves test accuracy but this improvement can be misleading due to over-fitting.
(8) [1 Pt.] It is important to frequently evaluate models on the test data throughout the process of model development.
26. [2 Pts.] A colleague has been developing models all quarter and noticed recently that her test error has started to gradually increase while her training error has been decreasing. Which of the following is the most likely explanation for what is happening? Select only one.
A. She is starting to over-fit to her training data.
B. She is starting to under-fit to her training data.
C. The model is overly biased.
D. None of the above.
Solution: False. Noooooooooooo. Once test data is used it is no longer test data. You should create validation datasets or use cross-validation procedures to evaluate models.
27. [5 Pts.] Given the following general loss formulation:
\arg\min_{\theta} \sum_{i=1}^{n}\left(y_i - x_i^T\theta\right)^2 + \lambda\sum_{p=1}^{d}\theta_p^2   (23)
Which of the following statements are true? Select all that apply.
A. There are d data points.
B. There are n data points.
C. The data is d dimensional.
D. This is a classification problem.
E. This is a linear model.
F. This problem has LASSO regularization.
G. Larger values of λ imply increased regularization.
H. Larger values of λ will increase variance.
I. Larger values of λ will likely increase bias.
J. None of the above are true.
28. [3 Pts.] In class we broke the least-squares error into three separate terms:
E\left[(y - f_\theta(x))^2\right] = E\left[(y - h(x))^2\right] + E\left[(h(x) - f_\theta(x))^2\right] + E\left[(f_\theta(x) - E[f_\theta(x)])^2\right]   (24)
where y = h(x) + ε, h(x) is the true model and ε is zero-mean noise. For each of the following terms, indicate its usual interpretation in the bias variance trade-off:
1. E\left[(y - h(x))^2\right]:  A. Bias  B. Variance  C. Noise
2. E\left[(h(x) - f_\theta(x))^2\right]:  A. Bias  B. Variance  C. Noise
3. E\left[(f_\theta(x) - E[f_\theta(x)])^2\right]:  A. Bias  B. Variance  C. Noise
8 Big Data
29. Consider the following simple Data Warehouse schema from a Cellular Service Provider, which records activity on a cell phone network:
CREATE TABLE devices (
did integer, customer_id integer, phone_number varchar(13),
firstname text, lastname text,
zip varchar(12), registered_on varchar(2), PRIMARY KEY(did),
UNIQUE (customer_id) -- a ‘‘candidate’’ key
);
CREATE TABLE billing (
rate_code char,
description text, base_fee float, per_minute float, max_minutes integer, overage_fee float,
PRIMARY KEY (rate_code));
CREATE TABLE calls (
caller_handset_id integer, callee_handset_id integer, cell_tower_id integer, call_start datetime, call_end datetime, billing_code char,
PRIMARY KEY (caller_handset_id, call_start),
FOREIGN KEY (caller_handset_id) REFERENCES devices,
FOREIGN KEY (billing_code) REFERENCES billing);
(1) [3 Pts.] Which of these tables is a dimension table? Select all that apply.
A. devices
B. calls
C. billing
D. None of the above.
(2) [3 Pts.] Which of the following statements are true? Select all that apply.
A. The calls.billing_code column violates star schema design because any
update to a single billing fee requires updates to many call records.
B. If we want to look for correlations between a device’s average call length and the time since it was registered, we have to perform a join.
Solution: devices and billing
C. If the cell service provider implemented a Data Lake, it would make it easier for them to load audio recordings of calls for subsequent analysis.
D. None of the above statements are true.
30. [3 Pts.] The figure below depicts a distributed file system with one logical “big file” partitioned into 4 “shards” (A, B, C, D) and replicated across multiple worker machines (1, 2, 3, 4).
Suppose workers 1 AND 2 both fail. Which of the following statements are true? Select all that apply.
A. The full file will remain available since worker 3 and worker 4 are both still running.
B. The system can tolerate one more worker failure without losing data.
C. If every request requires all 4 shards of the file, then worker 3 and worker 4 can share the work evenly.
D. None of the above statements are true.
31. Consider only the mechanism of partitioning files into shards, and storing different shards on
different machines. Which of the following statements are true? Select all that apply.
A. Partitioning enhances the ability of the system to store large files.
B. Partitioning allows the system to tolerate machine failures without losing data.
C. Partitioning allows the system to read files in parallel.
D. None of the above statements are true.
32. [2 Pts.] Recall the statistical query pattern discussed in class for computing on very large data
sets. Which of the following statements are true? Select all that apply.
A. It eliminates the need for the end-user device (e.g. a laptop) to acquire all the
data.
B. It pushes the computational task closer to the large-scale data storage.
C. It is well suited to both MapReduce and SQL interfaces.
D. An alternative to the statistical query pattern for big data is to acquire a sample of the full dataset on the end-user device.
E. None of the above statements are true.
[Diagram: one logical “Big File” partitioned into four shards (A, B, C, D), each replicated across Workers 1–4.]
9 EDA and Visualization
33. [2 Pts.] Consider the following statistics for infant mortality rate. According to these statistics, which transformation would best symmetrize the distribution? (Select only one.)
Transformation | lower quartile | median | upper quartile
x              | 13             | 30     | 68
√x             | 3.5            | 5      | 8
log(x)         | 1.15           | 1.5    | 1.8
A. no transformation
B. square root
C. log
D. not possible to tell with this information
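One way to reason about this question: for a symmetric distribution, the upper quartile is about as far above the median as the lower quartile is below it. A quick check on the table's numbers:

```python
# Quartile symmetry check for the table above: (upper - median) should
# roughly equal (median - lower) if the distribution is symmetric.
quartiles = {
    "x": (13, 30, 68),
    "sqrt(x)": (3.5, 5, 8),
    "log(x)": (1.15, 1.5, 1.8),
}

skews = {}
for name, (lo, med, hi) in quartiles.items():
    skews[name] = (hi - med) - (med - lo)  # 0 means symmetric quartiles
    print(name, skews[name])
```

The log transform brings this gap closest to zero (about −0.05, versus 1.5 for the square root and 21 untransformed), pointing to answer C.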
34. [5 Pts.] For each of the following scenarios, determine which plot type is most appropriate to reveal the distribution of and/or the relationships between the following variable(s). For each scenario, select only one plot type. Some plot types may be used multiple times.
A. histogram
B. pie chart
C. bar plot
D. line plot
E. side-by-side boxplots
F. scatter plot
G. stacked bar plot
H. overlaid line plots
I. mosaic plot
(1) [1 Pt.] sale price and number of bedrooms (assume integer) for houses sold in Berkeley in 2010.
(2) [1 Pt.] sale price and date of sale for houses sold in Berkeley between 1995 and 2015.
(3) [1 Pt.] infant birth weight (grams) for babies born at Alta Bates hospital in 2016.
Solution: E. Side-by-side Boxplots. We might imagine using a scatter plot since we are plotting the relationship between two numeric quantities. However because the number of bedrooms is an integer and most houses will only have a small number, we are likely to encounter over-plotting in the scatter plot. Therefore side-by-side boxplots are likely to be most informative.
Solution: F. Scatter Plot. Here we are plotting two numeric quantities with sufficient spread on each axis.
Solution: A. Histogram. Here we are plotting the distribution of a likely large number of observations and therefore a histogram would be most appropriate.
(4) [1 Pt.] mother’s education-level (highest degree held) for students admitted to UC Berke- ley in 2016
(5) [1 Pt.] SAT score and HS GPA of students admitted to UC Berkeley in 2016
(6) [1 Pt.] race and gender of students admitted to UC Berkeley in 2016
(7) [1 Pt.] The percentage of female students admitted to UC Berkeley each year from 1950 to 2000.
(8) [1 Pt.] SAT scores of male and female students admitted to UCB from 1950 to 2000
Solution: C. Bar Plot. Here we want to visualize counts of a categorical variable.
Solution: F. Scatter Plot. Here we are visualizing the relationship between two continuous quantities.
Solution: I. mosaic plot Here we are visualizing the relationship between two cate- gorical variables.
Solution: D. Line plot. This allows us to see the trends over time.
Solution: E. side-by-side boxplots. This allows us to see the distributions of SAT scores per gender and year.
35. [4 Pts.] Consider the following empirical distribution:
(1) [1 Pt.] The distribution has ____ mode(s).
A. 1
B. 2
C. 3
D. 4
(2) [1 Pt.] The distribution is:
A. Skewed left
B. Symmetric
C. Skewed right
(3) [2 Pts.] Select all of the following properties displayed by the distribution:
A. gaps
B. outliers
C. normal left tail
D. None of the above
36. [4 Pts.] Select all of the problems associated with the following plot (there may be more than one problem):
A. Over-plotting
B. Use of chart junk
C. Vertical axis should be in log scale
D. Missing vertical axis label
E. Poor use of the horizontal dimension
F. Graph elements interfere with data
G. Stacking
H. Use of angles to convey information
I. None of the above are problems with this awesome plot.
End of Exam