Name:
Email: ____________@berkeley.edu
Student ID:
DS-100 Midterm Exam Fall 2017
Instructions:
• This exam must be completed in the 1.5 hour time period ending at 8:30PM.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must shade in the box/circle. Checkmarks will likely be mis-graded.
• You may use a single page (two-sided) study guide.
• Work quickly through each question. There are a total of 116 points on this exam.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:
DS100 Midterm, Page 2 of 25 October 12, 2017
Syntax Reference
Regular Expressions
"^" matches the position at the beginning of the string (unless used for negation, "[^ ]").
"$" matches the position at the end of the string.
"?" matches the preceding literal or sub-expression 0 or 1 times. When following "+" or "*", it results in non-greedy matching.
"+" matches the preceding literal or sub-expression one or more times.
"*" matches the preceding literal or sub-expression zero or more times.
"." matches any character except newline.
"[ ]" matches any one of the characters inside; accepts a range, e.g., "[a-c]".
"( )" is used to create a sub-expression.
"\d" matches any digit character. "\D" is the complement.
"\w" matches any word character (letters, digits, underscore). "\W" is the complement.
"\s" matches any whitespace character, including tabs and newlines. "\S" is the complement.
"\b" matches the boundary between words.
Some useful re package functions.
re.split(pattern, string): split the string at substrings that match the pattern. Returns a list.
re.sub(pattern, replace, string): apply the pattern to string, replacing matching substrings with replace. Returns a string.
Useful Pandas Syntax
df.loc[row_selection, col_list] # row selection can be boolean
df.iloc[row_selection, col_list] # row selection can be boolean
df.groupby(group_columns)[['colA', 'colB']].sum()
pd.merge(df1, df2, on='hi') # Merge df1 and df2 on the 'hi' column
pd.pivot_table(df,                # The input dataframe
               index=out_rows,    # values to use as rows
               columns=out_cols,  # values to use as cols
               values=out_values, # values to use in table
               aggfunc="mean",    # aggregation function
               fill_value=0.0)    # value used for missing comb.
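The difference between greedy and non-greedy matching in the reference above is easiest to see on a small input. The following is a minimal sketch; the sample strings are made up for illustration and are not part of the exam.

```python
import re

# Hypothetical input, chosen only to contrast greedy and non-greedy matching.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", text)

# Non-greedy: "?" after "+" stops at the first ">" that completes a match.
lazy = re.findall(r"<.+?>", text)

# re.split cuts the string wherever the pattern matches.
words = re.split(r"\s+", "a  b\tc")

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(words)   # ['a', 'b', 'c']
```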
Data Generation and Probability Samples
For each of the following questions select the single best answer.
1. [2 Pts] A political scientist is interested in answering a question about a country composed of three states with exactly 10000, 20000, and 30000 voting adults. To answer this question, a political survey is administered by randomly sampling 25, 50, and 75 voting adults from each state, respectively. Which sampling plan was used in the survey?
⃝ cluster sampling
√ stratified sampling
⃝ quota sampling
⃝ snowball sampling
2. [2 Pts] A deck with 26 cards labeled A through Z is thoroughly shuffled, and the value of the third card in the deck is recorded. What is the probability that we observe the letter C on the third card?
√ 1/26 ⃝ 3/26 ⃝ (25/26)·(24/26)·(1/26) ⃝ (1/26)·(1/26)·(24/26) ⃝ None of the above.
3. [3 Pts] Suppose Sam visits your store to buy some items. He buys toothpaste for $2.00 with probability 0.5. He buys a toothbrush for $1.00 with probability 0.1. Let the random variable X be the total amount Sam spends. What is E[X]? Show your work in the space provided.
√ $1.10
⃝ $1.50
⃝ $3.00
⃝ The toothpaste purchase may not be independent of the toothbrush purchase so we can’t compute this expectation.
You may show your work in the following box for partial credit:
Solution: Let X_toothpaste be the amount Sam spends on toothpaste, and X_toothbrush be the amount Sam spends on a toothbrush.
From the linearity of expectation, which holds whether or not the two purchases are independent, we have:
E[X] = E[X_toothpaste + X_toothbrush] = E[X_toothpaste] + E[X_toothbrush]
We know that E[X_toothpaste] = (0.5)(0) + (0.5)(2) = 1, and E[X_toothbrush] = (0.9)(0) + (0.1)(1) = 0.1. Thus, E[X] = 1.1.
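The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch of the calculation; the dictionary layout and variable names are illustrative choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# (price in dollars, purchase probability) for each item, taken from the question.
items = {
    "toothpaste": (Fraction(2), Fraction(1, 2)),
    "toothbrush": (Fraction(1), Fraction(1, 10)),
}

# Expected spend on one item: price * P(buy) + 0 * P(not buy).
expected_per_item = {name: price * p for name, (price, p) in items.items()}

# Linearity of expectation: the total is the sum of the per-item expectations.
expected_total = sum(expected_per_item.values())

print(float(expected_total))  # 1.1
```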
4. [3 Pts] Suppose we have a coin that lands heads 80% of the time. Let the random variable X be the proportion of times the coin lands tails out of 100 flips. What is Var[X]? You must show your work in the space provided.
⃝ 0.8 ⃝ 0.16 ⃝ 0.04 √ 0.0016 ⃝ 0.008
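Solution: Let X_i be the outcome of the ith flip. If the ith flip lands heads then we say X_i = 1 and otherwise X_i = 0, so each X_i has variance p(1 − p) with p = 0.8. With n = 100, the proportion of flips that land heads is given by:
\[ Y = \frac{1}{n} \sum_{i=1}^{n} X_i \]
The proportion of tails is X = 1 − Y, so Var[X] = Var[Y]. We can compute the variance of Y using the following identities:
\begin{align*}
\mathrm{Var}[Y] &= \mathrm{Var}\left[\frac{1}{100} \sum_{i=1}^{n} X_i\right] \\
&= \frac{1}{100^2} \mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] && \text{(a constant multiple comes out of the variance squared)} \\
&= \frac{1}{100^2} \sum_{i=1}^{n} \mathrm{Var}[X_i] && \text{(independent variables imply linearity of variance)} \\
&= \frac{1}{100^2} \cdot 100 \, p(1 - p) = \frac{p(1 - p)}{100} \\
&= \frac{.8(1 - .8)}{100} = \frac{.16}{100} = .0016
\end{align*}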
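As a cross-check on the derivation, the variance can also be computed directly from the binomial distribution without using the identities above. The helper name proportion_variance is an illustrative choice, not something from the exam.

```python
from math import comb

def proportion_variance(n, p):
    """Exact variance of (number of successes) / n over n independent trials."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * w for k, w in enumerate(pmf))
    return sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))

var_heads = proportion_variance(100, 0.8)  # proportion of heads
var_tails = proportion_variance(100, 0.2)  # proportion of tails
closed_form = 0.8 * (1 - 0.8) / 100        # p(1 - p) / n

print(round(var_heads, 6), round(var_tails, 6), round(closed_form, 6))
```

Both proportions have the same variance, which is why it does not matter whether heads or tails is coded as 1.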
5. A small town has 5 houses with the following people living in each house:
• Abe, Ben
• Cat, Dan, Emma
• Frank, George
• Hank, Ira, Jen
• Kim, Lars
Suppose we take a cluster sample of 2 houses (without replacement), what is the chance that:
(1) [2 Pts] Kim and Lars are in the sample
⃝ 0 ⃝ 1/20 ⃝ 1/10 ⃝ 1/6 ⃝ 1/5 √ 2/5 ⃝ 1
You may show your work in the following box for partial credit:
Solution: The chance that Kim and Lars are in the sample is given by the chance of choosing their house. The chance of choosing their house on the first draw is 1/5. Because we are drawing without replacement, the chance of choosing their house on the second draw is given by the chance of not choosing their house on the first draw (4/5) times the chance of choosing their house on the second draw (1/4). Thus the total chance of choosing them in the first two draws is:
1/5 + 4/5 × 1/4 = 2/5
(2) [2 Pts] Kim, Abe, and Ben are in the sample
⃝ 0 ⃝ 1/20 √ 1/10 ⃝ 1/6 ⃝ 1/5 ⃝ 2/5 ⃝ 1
You may show your work in the following box for partial credit:
Solution: To draw Kim, Abe, and Ben we would need to draw both of their houses. This can be done two ways (draw Abe and Ben’s house first and then Kim’s or vice versa). Each way has probability:
1/5 × 1/4
Thus the total probability is:
2 × 1/5 × 1/4 = 2/20 = 1/10
(3) [1 Pt] Kim and Dan are in the sample – Select all that apply
□ The same as the chance Kim and Lars are in the sample
√ The same as the chance Kim, Abe, and Ben are in the sample
□ Neither of the above
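All three parts of this question can be verified by listing every possible cluster sample. The sketch below assumes each unordered pair of houses is equally likely, which is what drawing 2 houses at random without replacement implies; the function name chance is an illustrative choice.

```python
from fractions import Fraction
from itertools import combinations

houses = [
    {"Abe", "Ben"},
    {"Cat", "Dan", "Emma"},
    {"Frank", "George"},
    {"Hank", "Ira", "Jen"},
    {"Kim", "Lars"},
]

# Every unordered pair of houses is an equally likely cluster sample.
samples = [a | b for a, b in combinations(houses, 2)]

def chance(people):
    """Probability that everyone in `people` lands in the sample."""
    hits = sum(1 for sample in samples if people <= sample)
    return Fraction(hits, len(samples))

p_kim_lars = chance({"Kim", "Lars"})
p_kim_abe_ben = chance({"Kim", "Abe", "Ben"})
p_kim_dan = chance({"Kim", "Dan"})

print(p_kim_lars, p_kim_abe_ben, p_kim_dan)  # 2/5 1/10 1/10
```

Kim and Dan live in different houses, so both houses must be drawn, exactly as for Kim, Abe, and Ben.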
Data Cleaning and EDA
6. True or False. For each of the following statements select true or false.
(1) [1 Pt] Exploratory data analysis is the process of testing key hypotheses.
⃝ True √ False
Solution: False. Exploratory data analysis is the process of gaining understanding about data to inform future analysis.
(2) [1 Pt] The structure of the data describes how it is formatted and organized.
√ True ⃝ False
Solution: True. The structure of data includes its formatting (e.g., JSON, CSV, XML, raw text) as well as the fields and organization of records.
(3) [1 Pt] Throughout the process of exploratory data analysis it is often necessary to transform and clean data.
√ True ⃝ False
Solution: True. A key step in exploratory data analysis is identifying and in some cases correcting anomalies and issues with data.
(4) [1 Pt] During the data cleaning process it is generally a good idea to drop records that contain missing values.
⃝ True √ False
Solution: False. Nooooo. It is very important that the cleaning process is done with care to avoid introducing transformations that might bias subsequent analysis. Dropping records with missing values (for example, missing addresses) could substantially bias the data (e.g., removing homeless people).
7. In homework 3, we analyzed ride sharing data comparing the weekday and weekend patterns for both casual and registered riders.
(1) [1 Pt] On weekdays, the number of casual riders was most frequently ________ the number of registered riders.
⃝ higher than √ lower than ⃝ similar to
(2) [1 Pt] Which group of riders demonstrated a pronounced bi-modal daily usage pattern:
⃝ Casual Riders √ Registered Riders ⃝ Both casual and registered riders.
8. Use the following snippet of data to answer each of the questions below.
Business.data
"business_id","name","address","phone"
10,"TIRAMISU KITCHEN","033 BELDEN PL","+14154217044"
19,"LIFESTYLE CAFE","1200 VAN NESS AVE","+14157763262"
24,"OMNI S.F. HOTEL"," ","9999999999999999"
42,"The "Best", Food!","500 CALIFORNIA ST","+14156211114"
43,"The "Best", Food!","3716 Cesar Chavez","+14156211114"
(1) [1 Pt] Which of the following best describes the format of this file?
⃝ Raw Text
⃝ Tab Separated Values
√ Comma Separated Values
⃝ JSON
(2) [1 Pt] Which of the following best describes the granularity of each record?
⃝ Restaurant Chains
√ Individual Restaurant Locations
⃝ Strings
⃝ Daily
(3) [4 Pts] Select all the true statements.
√ From the available data the business id appears to be a primary key.
□ There appear to be no missing values.
√ While the data appears to be quoted there may be issues with the quote character.
□ There are nested records.
□ None of the above statements is true.
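The three checks behind these answers (primary key, quoting, missing values) can be run on the snippet with Python's standard csv module. This is a rough sketch: the variable names are illustrative, and treating an all-9s phone number as a placeholder for a missing value is an assumption rather than something the file states.

```python
import csv
import io

# The snippet from the question, copied verbatim (including the unescaped quotes).
raw = '''"business_id","name","address","phone"
10,"TIRAMISU KITCHEN","033 BELDEN PL","+14154217044"
19,"LIFESTYLE CAFE","1200 VAN NESS AVE","+14157763262"
24,"OMNI S.F. HOTEL"," ","9999999999999999"
42,"The "Best", Food!","500 CALIFORNIA ST","+14156211114"
43,"The "Best", Food!","3716 Cesar Chavez","+14156211114"
'''

header, *rows = list(csv.reader(io.StringIO(raw)))

# Primary key check: every business_id should appear exactly once.
ids = [row[0] for row in rows]
ids_are_unique = len(ids) == len(set(ids))

# Quote check: rows whose field count differs from the header were split
# in the wrong place, here because of the unescaped quotes around Best.
malformed_ids = [row[0] for row in rows if len(row) != len(header)]

# Missing-value check: a blank address or an all-9s phone number
# (the all-9s rule is an assumption about how missing phones are coded).
suspicious_ids = [
    row[0] for row in rows
    if len(row) == len(header) and (not row[2].strip() or set(row[3]) == {"9"})
]

print(ids_are_unique, malformed_ids, suspicious_ids)
```

The parser does not raise an error on the badly quoted rows; it silently produces extra fields, which is why a field-count check is worth running before trusting the parsed table.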
Transformations and Smoothing
9. [3 Pts] Which of the following are reasonable motivations for applying a power transformation? Select all that apply:
√ To help visualize highly skewed distributions
□ Bring data distribution closer to random sampling
√ To help straighten relationships between pairs of variables
□ Reduce the dimension of data
□ Remove missing values
□ None of the above
10. [3 Pts] Which of the following transformations could help make linear the relationship shown in the plot below? Select all that apply:
√ log(y) √ x^2 √ sqrt(y) □ log(x) □ y^2 □ None of the above
[Figure 1: plot of Y (vertical axis, 0 to 80) against X (horizontal axis, 1 to 9).]
[Figure 2: histogram, rug plot, and kernel density estimate; horizontal axis from −3 to 2, vertical axis from 0.0 to 0.7.]
11. [2 Pts] The above plot contains a histogram, rug plot, and Gaussian kernel density estimator. The Gaussian kernel is defined by:
\[ K_\alpha(x, z) = \frac{1}{\sqrt{2\pi\alpha^2}} \exp\left(-\frac{(x - z)^2}{2\alpha^2}\right) \]
Judging from the shape of separate standing peaks, which of the following is the most likely value for the kernel parameter α.
⃝ α=0 √ α=0.1 ⃝ α=10 ⃝ α=100
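The effect of α on the shape of the estimate can be sketched in a few lines. The code below implements the kernel as defined in the question and averages one kernel per observation; the data values are hypothetical and are not the data behind Figure 2.

```python
import math

def gaussian_kernel(alpha, x, z):
    """K_alpha(x, z) as defined in the question."""
    return math.exp(-((x - z) ** 2) / (2 * alpha ** 2)) / math.sqrt(2 * math.pi * alpha ** 2)

def kde(alpha, data, x):
    """Kernel density estimate at x: the average of one kernel per data point."""
    return sum(gaussian_kernel(alpha, x, z) for z in data) / len(data)

# Hypothetical observations, not the data behind Figure 2.
data = [-2.0, -1.0, 0.0, 0.2, 1.5]

for alpha in (0.1, 10, 100):
    at_point = kde(alpha, data, -2.0)  # density on top of an observation
    in_gap = kde(alpha, data, -1.5)    # density halfway between two observations
    print(alpha, in_gap / at_point)
```

With α = 0.1 the density in the gap is a tiny fraction of the density at an observation, which produces separate standing peaks. With α = 10 or α = 100 the two values are nearly equal, so the curve would look almost flat over this range.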
Regular Expressions
12. [2 Pts] Select all the strings that fully match the regular expression: [^dp]an
√ Dan □ pan √ fan √ man □ None of the above.
13. [2 Pts] Select all the strings that fully match the regular expression: <[a-z]*@\w+.edu>
√ <@berkeley$edu>
√
□ None of the above strings match.
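The answers to questions 12 and 13 can be checked with re.fullmatch, which succeeds only when the whole string matches the pattern. This sketch covers only the options whose text appears above.

```python
import re

# Q12: the character class [^dp] accepts any single character except d or p.
q12 = {s: bool(re.fullmatch(r"[^dp]an", s)) for s in ["Dan", "pan", "fan", "man"]}

# Q13: the unescaped "." matches any character, so "$" is accepted before "edu",
# and [a-z]* may match the empty string before "@".
q13 = bool(re.fullmatch(r"<[a-z]*@\w+.edu>", "<@berkeley$edu>"))

print(q12)  # {'Dan': True, 'pan': False, 'fan': True, 'man': True}
print(q13)  # True
```

"Dan" matches because the class excludes only the lowercase letters d and p.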
14. [2 Pts] Select all the strings that fully match the regular expression: ^Go.*
□ Way to ^Go!
√ Go Bears!
□ go trees?
□ None of the above strings match
15. [2 Pts] What is the result of evaluating the following python command?
len(re.split(r"\d+", "You get a 99.9 on the exam."))
⃝ 2 √ 3 ⃝ 4 ⃝ 5
16. For the following tasks, write the corresponding Python code or regular expression.
(1) [2 Pts] Write a regular expression that only matches substrings consisting of an a immediately followed by zero or one b characters.
regx = r'_________________________________________________'
Solution:
regx = r'ab?'
(2) [3 Pts] Suppose we’ve run the code below:
text = 'Data\t \t Science 100'
Use a method in the re module to replace all the continuous segments of whitespace with a single comma. The resulting string should look like "Data,Science,100".
re._______________________________________________________
Solution:
re.sub(r'\s+', ',', text)
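Questions 14 through 16 can all be verified directly in Python. The sketch below runs each pattern from the questions; the test string "abb a cab" for 16 (1) is made up for illustration.

```python
import re

# Q14: "^Go.*" must start at the beginning of the string with a capital G.
q14 = {s: bool(re.fullmatch(r"^Go.*", s)) for s in ["Way to ^Go!", "Go Bears!", "go trees?"]}

# Q15: "\d+" matches "99" and then "9", so the string is cut in two places.
pieces = re.split(r"\d+", "You get a 99.9 on the exam.")

# Q16 (1): "ab?" picks out an "a" with an optional single "b" after it.
found = re.findall(r"ab?", "abb a cab")

# Q16 (2): "\s+" collapses each run of spaces and tabs into one comma.
text = "Data\t \t Science 100"
joined = re.sub(r"\s+", ",", text)

print(q14)     # {'Way to ^Go!': False, 'Go Bears!': True, 'go trees?': False}
print(pieces)  # ['You get a ', '.', ' on the exam.']
print(found)   # ['ab', 'a', 'ab']
print(joined)  # Data,Science,100
```

In question 15 the "." between the digits survives as its own list element, which is why the length is 3 rather than 2.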
DataFrames, Joins, and Aggregation
17. The ti and fare DataFrames contain data about the people aboard the Titanic when it crashed:
>>> ti.head()
   survived  class     sex    id
0         0  Third    male  1410
1         1  First  female  1522
2         1  Third  female  1864
3         1  First  female  1687
4         0  Third    male  1173
DataFrame ti
>>> fare.head()
      fare  alone    id
0  73.5000   True  1457
1   9.2250   True  1645
2   8.6625   True  1716
3  59.4000  False  1367
4  18.0000  False  1639
DataFrame fare
Both tables contain one row for each passenger, uniquely identified by the id column. Here’s a description of the columns in each DataFrame:
survived: 1 if the person survived, else 0
class: ticket class (First, Second, or Third)
sex: Sex of person (male or female)
fare: Price of ticket in USD
alone: True if the person was alone at purchase.
Fill in the blanks to compute the following statements. You may assume that the pandas module is imported as pd. You may not use more lines than the ones provided.
(1) [2 Pts] The total number of survivors.
Solution:
ti['survived'].sum()
(2) [4 Pts] The proportion of females who survived (a float).
ti.loc[_______________________________,___________].mean()
Solution:
ti.loc[ti['sex'] == 'female', 'survived'].mean()
(3) [4 Pts] A DataFrame containing the proportion of survivors for each sex. It should look like:
Solution:
ti[['survived', 'sex']].groupby('sex').mean()
(4) [5 Pts] A DataFrame containing the proportion of survivors for each sex and class. It should look like:
Solution:
pd.pivot_table(ti, values='survived',
               index='sex', columns='class')
(5) [8 Pts] A DataFrame containing the proportion of survivors for each sex after filtering out those that bought their ticket alone. The table should have the same structure as (3) but with different numbers.
merged = ___________________________________________________
(merged_____________________________________________________
___________________________________________________________
__________________________________________________________)
Solution:
merged = pd.merge(ti, fare, on='id')
(merged[merged['alone']]
    .loc[:, ['survived', 'sex']]
    .groupby('sex').mean())
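To try the five solutions end to end, they can be run on a small pair of tables. The six passengers below are made up for illustration (the ids in the two head() listings do not overlap, so those rows cannot be merged), and the result variable names are illustrative. Part (5) is reproduced exactly as the solution writes it, which keeps the rows where alone is True.

```python
import pandas as pd

# Toy data: six invented passengers with matching ids in both tables.
ti = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1],
    "class": ["Third", "First", "Third", "First", "Third", "Second"],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "id": [1, 2, 3, 4, 5, 6],
})
fare = pd.DataFrame({
    "fare": [8.66, 73.5, 59.4, 18.0, 9.22, 30.0],
    "alone": [True, True, False, False, True, False],
    "id": [3, 1, 2, 6, 5, 4],
})

# (1) Total number of survivors.
total_survivors = ti["survived"].sum()

# (2) Proportion of females who survived.
female_rate = ti.loc[ti["sex"] == "female", "survived"].mean()

# (3) Proportion of survivors for each sex.
by_sex = ti[["survived", "sex"]].groupby("sex").mean()

# (4) Proportion of survivors for each sex and class.
by_sex_class = pd.pivot_table(ti, values="survived", index="sex", columns="class")

# (5) As in the solution: join on id, keep rows where alone is True, then group.
merged = pd.merge(ti, fare, on="id")
alone_by_sex = (merged[merged["alone"]]
                .loc[:, ["survived", "sex"]]
                .groupby("sex").mean())

print(total_survivors, female_rate)
print(by_sex)
print(by_sex_class)
print(alone_by_sex)
```

Because survived is coded as 0 and 1, the mean of the column within any group is the proportion of survivors in that group.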
18. [3 Pts] From the following list select all statements that are true for Pandas Data Frames.
√ All data frames must have an index.
□ All columns must be the same type.
√ You can always index a record by its row number.
□ Missing values in string columns are always encoded as NaN.
□ None of the above
Visualizations
19. The figure below is a scatter plot of the heights of mothers (in) and fathers (in) of a sample of 1000 UC Berkeley students.
(1) [2 Pts] What is the main problem with this plot?
⃝ Choice of scale
⃝ Jiggling the baseline
⃝ Aspect ratio
√ Overplotting
⃝ Lack of context
⃝ Perception (length, angle, area)
(2) [2 Pts] What is the remedy for this problem?
⃝ Overlay plots
√ Jitter values
⃝ Use color to condition on student’s gender
⃝ Transform one variable or the other or both
⃝ Improve labels and legends
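Jittering means adding a small amount of random noise to each value so that points recorded at the same whole-inch position no longer sit exactly on top of one another. The sketch below assumes heights rounded to whole inches; the data, the jitter helper, and the noise width of 0.4 are all illustrative choices, not part of the exam.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical heights in whole inches (made up, not the survey data).
mothers = [random.choice(range(60, 70)) for _ in range(1000)]
fathers = [random.choice(range(65, 76)) for _ in range(1000)]

def jitter(values, width=0.4):
    """Add uniform noise in [-width, width] to each value."""
    return [v + random.uniform(-width, width) for v in values]

# Count how many distinct plotting positions exist before and after jittering.
raw_points = set(zip(mothers, fathers))
jittered_points = set(zip(jitter(mothers), jitter(fathers)))

print(len(raw_points), len(jittered_points))
```

Before jittering, 1000 students collapse onto at most 110 distinct positions; after jittering, the density of points in each region becomes visible.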
20. [2 Pts] The following figure is a line plot of CO2 emissions over time. What is the main problem with this plot?
[Figure: line plot of co2 (vertical axis, 0 to 400) against date (horizontal axis, 1960 to 2010).]
√ Empty data region
⃝ Jiggling the baseline
⃝ Overplotting
⃝ Lack of context
⃝ Perception (length, angle, area)
21. Consider the following visualization of the number of casual riders per hour by day of the week, which has been constructed from the bike sharing data used in Homework 3.
[Figure: casual (vertical axis, 0 to 350) for each weekday (horizontal axis: Sat, Sun, Mon, Tue, Wed, Thu, Fri).]
(1) [2 Pts] Which days of the week frequently (at least 75% of the time) had fewer than 50 casual riders? Select all that apply.
□ Saturday □ Sunday √ Monday √ Tuesday □ None of the above.
(2) [3 Pts] Which of the following describe conclusions that we can draw about the distribution of rider counts on Tuesdays using the above plot? Select all that apply.
□ Skewed left □ Symmetric √ Skewed right □ Unimodal √ Has outliers □ None of the above
Estimation and Loss Minimization
22. Consider the following loss function.
\[ L(\theta, x) = \begin{cases} 4(\theta - x) & \theta \ge x \\ x - \theta & \theta < x \end{cases} \]
(1) [2 Pts] Select all statements that are true.
□ The loss function is concave.
√ The loss function is convex.
□ The loss function is smooth.
□ None of the above statements are true.
(2) [4 Pts] Given a sample x_1, ..., x_n, which value of θ minimizes the average loss? Show your work in the space provided.
√ 20th percentile ⃝ 25th percentile ⃝ 75th percentile ⃝ 80th percentile
(3) [2 Pts] The optimal value θ∗ is a percentile for the
√ sample
⃝ population
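One way to see the percentile answer: moving θ up by a small amount raises the loss at rate 4 for every observation below θ and lowers it at rate 1 for every observation above, so the average loss stops decreasing once about one fifth of the sample lies below θ. The sketch below checks this numerically on a hypothetical sample of the integers 1 to 11; the function names and the percentile convention used (smallest sample value with at least 20% of the data at or below it) are illustrative assumptions.

```python
import math

def loss(theta, x):
    """The loss from the question: overshooting costs 4 times as much as undershooting."""
    return 4 * (theta - x) if theta >= x else x - theta

def average_loss(theta, sample):
    return sum(loss(theta, x) for x in sample) / len(sample)

# Hypothetical sample, chosen so the minimizer is unique.
sample = list(range(1, 12))  # 1, 2, ..., 11

# The average loss is piecewise linear and convex, so a minimizer is always
# attained at one of the sample values; it is enough to try each of them.
best_theta = min(sample, key=lambda theta: average_loss(theta, sample))

# 20th percentile: smallest sample value with at least 20% of the data at or below it.
percentile_20 = sorted(sample)[math.ceil(0.2 * len(sample)) - 1]

print(best_theta, percentile_20)  # 3 3
```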