Name:
Email: @berkeley.edu
Student ID:
DS-100 Midterm Exam Spring 2018
Instructions:
• This midterm exam must be completed in the 80-minute time period ending at
12:30PM, unless you have accommodations supported by a DSP letter.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use a one-sheet (two-sided) study guide.
• Work quickly through each question. There are a total of 168 points on this exam.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:
DS100 Midterm,
Page 2 of 22 March 8th, 2018
Syntax Reference
Regular Expressions
“^” matches the position at the beginning of string (unless used for negation “[^]”)
“$” matches the position at the end of string character.
“?” match preceding literal or sub-expression 0 or 1 times. When following “+” or “*” results in non-greedy matching.
“+” match preceding literal or sub-expression one or more times.
“*” match preceding literal or sub-expression zero or more times.
“[ ]” match any one of the characters inside, accepts a range, e.g., “[a-c]”.
“( )” used to create a sub-expression.
“\d” match any digit character. “\D” is the complement.
“\w” match any word character (letters, digits, underscore). “\W” is the complement.
“\s” match any whitespace character including tabs and newlines. “\S” is the complement.
“\b” match boundary between words.
“.” match any character except new line.
Some useful re and requests package functions.
re.findall(pattern, st) return the list of all sub-strings in st that match pattern.
requests.get(url, auth, params, data) makes a GET request with params in the header and data in the body.
requests.post(url, auth, params, data) makes a POST request with params in the header and data in the body.
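The regular expression entries in this reference can be tried out with re.findall. The following is a minimal sketch; the sample strings are made up for illustration and are not part of the exam.

```python
import re

# Made-up sample strings (assumptions, not exam data).
phone_text = "Call 510-642-6000 or 510-643-7000"
html_text = "<b>now</b> <i>today</i>"
sentence = "cat concat cat's"

# \d matches one digit; + repeats the preceding item one or more times.
digit_runs = re.findall(r"\d+", phone_text)

# Greedy .* runs to the last ">"; non-greedy .*? stops at the first ">".
greedy = re.findall(r"<.*>", html_text)
non_greedy = re.findall(r"<.*?>", html_text)

# \b only matches at a word boundary, so "concat" does not count.
whole_words = re.findall(r"\bcat\b", sentence)
any_position = re.findall(r"cat", sentence)

print(digit_runs, greedy, non_greedy, len(whole_words), len(any_position))
```

Placing “?” after “*” is what switches the match from greedy to non-greedy, which is why the second pattern returns each tag separately.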
Useful Pandas Syntax
df.loc[row_selection, col_list]   # row selection can be boolean
df.iloc[row_selection, col_list]  # row selection can be boolean
df.groupby(group_columns)[['colA', 'colB']].sum()
pd.merge(df1, df2, on='hi')  # Merge df1 and df2 on the 'hi' column
pd.pivot_table(df,                 # The input dataframe
               index=out_rows,     # values to use as rows
               columns=out_cols,   # values to use as cols
               values=out_values,  # values to use in table
               aggfunc="mean",     # aggregation function
               fill_value=0.0)     # value used for missing comb.
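As a small illustration of the selection syntax, the sketch below builds a made-up DataFrame (the column names follow the reference; the values are assumptions). One detail worth noting: .loc accepts a boolean Series, while .iloc expects a plain boolean array, so the mask is converted with to_numpy().

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "hi": ["a", "b", "a", "c"],
    "colA": [1.0, 2.0, 3.0, 4.0],
    "colB": [10.0, 20.0, 30.0, 40.0],
})

mask = df["colA"] > 1.5  # boolean Series aligned to df's index

# .loc: boolean Series (or labels) for rows, column names for columns.
by_label = df.loc[mask, ["hi", "colA"]]

# .iloc: integer positions for columns; the boolean mask must be a plain array.
by_position = df.iloc[mask.to_numpy(), [0, 1]]

# groupby + sum collapses the rows that share a value of "hi".
totals = df.groupby("hi")[["colA", "colB"]].sum()

print(by_label)
print(totals)
```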
Data Design and Bias
1. [1 Pt] Your letter grade (e.g., A+, A, ...) in a class that grades on a curve is most accurately described as what kind of data?
⃝ Nominal √ Ordinal ⃝ Quantitative ⃝ Numerical
Solution: There is a clear ordering in the grades; however, the difference between two grades depends on the grade distribution.
2. [1 Pt] The number of gold medals won by each country in the 2018 Olympics is an example of what kind of data:
⃝ Nominal ⃝ Ordinal ⃝ Qualitative √ Quantitative
Solution: While the type of medal is an ordinal variable, the number of medals of a particular type is a quantitative variable.
3. A discussion leader with 32 students in her section would like to sample a single student that is representative of the total population of students in her section. She enumerates her students 0 to 31 and follows one of the following procedures:
(a) [4 Pts] She flips a fair coin 31 times and records the number of heads. She then selects the student with the number that matches the number of heads. What type of sample has the discussion leader taken? Select all that apply.
□ Simple random sample
√ Probability sample
□ Convenience sample
□ None of the above
Solution: Since we can write down the probability that each student is selected, this is an example of a probability sample. However, this is not a simple random sample as not all students are equally likely to be selected.
(b) [4 Pts] She flips a fair coin 5 times and records the sequence of heads and tails as 1’s and 0’s, respectively. She then selects the student whose number corresponds to the binary sequence. For example, if she flipped [1, 1, 0, 0, 1] then she would select:
$1 \times 2^0 + 1 \times 2^1 + 0 \times 2^2 + 0 \times 2^3 + 1 \times 2^4 = 19$, i.e., student 19
What type of sample has the discussion leader taken? Select all that apply.
√ Simple random sample
√ Probability sample
□ Convenience sample
□ None of the above
Solution: Since we can write down the probability that each student is selected, this is an example of a probability sample. Also, since each student (and “subset” of students) had an equal chance of being chosen, we also have a simple random sample.
4. Sampling True/False. For each of the following, select true or false:
(a) [1 Pt] If each element/member of the population has an equal chance of being chosen, then we have a simple random sample.
⃝ True √ False
Solution: False, each subset must have an equal chance of being chosen (this is a stronger condition).
(b) [1 Pt] In cluster sampling, each cluster has an equal chance of being chosen.
√ True ⃝ False
Solution: True, by definition of cluster sampling.
(c) [1 Pt] In stratified sampling, each element of the population is assigned to exactly one stratum.
√ True ⃝ False
Solution: True, by the definition of stratified sampling.
(d) [1 Pt] A small simple random sample can often be more representative of the population than a very large convenience sample.
√ True ⃝ False
Solution: Convenience samples are often heavily biased. In class we considered a scenario where a small carefully constructed simple random sample was less biased than a very large convenience sample.
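The selection probabilities behind the two coin-flip procedures in question 3 can be written out exactly. This is a minimal sketch: the helper student_from_flips is a hypothetical name, and it reads the flips least significant bit first, as in the worked sum in 3(b).

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

n_students = 32

# Procedure (a): the student number is the count of heads in 31 fair flips,
# so P(student k) = C(31, k) / 2**31 (a binomial distribution, not uniform).
p_heads_count = [Fraction(comb(31, k), 2**31) for k in range(n_students)]

def student_from_flips(flips):
    """Read the flips as binary digits, least significant bit first."""
    return sum(bit * 2**i for i, bit in enumerate(flips))

# Procedure (b): enumerate all 2**5 equally likely sequences of 5 flips.
counts = Counter(student_from_flips(seq) for seq in product([0, 1], repeat=5))
p_binary = {student: Fraction(c, 2**5) for student, c in counts.items()}

print(p_heads_count[0], p_heads_count[15])  # very unequal chances under (a)
print(set(p_binary.values()))               # a single shared chance under (b)
```

Both procedures are probability samples because every probability above can be written down; only the second gives every student the same chance.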
5. We would like to understand the sleeping habits of university students living in campus dorms across the United States.
(a) [2 Pts] To keep costs down we randomly sample a subset of dorms across the United States and then construct a simple random sample of students within each of the selected dorms. This is an example of which sampling procedure:
⃝ Simple random sample ⃝ Stratified sample √ Cluster sample
Solution: This is a form of cluster sampling where the clusters correspond to dorms. Within each dorm we have selected a simple random sample; however, a census or even a stratified sample could be used.
(b) [2 Pts] Which of the following sampling procedures would ensure that we have good coverage of both male and female students within each dorm?
⃝ Simple random sample √ Stratified sample ⃝ Cluster sample
Solution: A stratified sample of the students within each dorm would ensure that we have good coverage of male and female students.
Pandas
6. Pandas True/False
(a) [1 Pt] If the pandas DataFrame df has 10 columns, then df.iloc[:, 0:5] will return a DataFrame with 5 columns.
√ True ⃝ False
Solution: True. df.iloc is inclusive of the starting value and exclusive of the ending value, so it will return columns 0, 1, 2, 3, 4.
(b) [1 Pt] Assuming that len(df1) == 100 and len(df2) == 100 are both true, then df1.merge(df2, how='outer') produces at most 200 rows.
⃝ True √ False
Solution: False. An outer join keeps every pair of rows whose keys match, so when all rows share the same key value it yields the full cross product and can produce up to 10,000 rows.
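Both answers in 6(a) and 6(b) can be checked on toy data. The sketch below uses made-up frames; the column names key, left, and right are assumptions chosen to produce the worst case for the merge.

```python
import pandas as pd

# iloc slicing is end-exclusive: 0:5 selects positions 0, 1, 2, 3, 4.
df = pd.DataFrame([[0] * 10], columns=[f"c{i}" for i in range(10)])
first_five = df.iloc[:, 0:5]

# Worst case for the merge: every row in both frames shares one key value,
# so each of the 100 left rows pairs with each of the 100 right rows.
df1 = pd.DataFrame({"key": ["k"] * 100, "left": range(100)})
df2 = pd.DataFrame({"key": ["k"] * 100, "right": range(100)})
merged = df1.merge(df2, how="outer")

print(first_five.shape, len(merged))
```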
(c) [1 Pt] The return type of the pandas.DataFrame.groupby function can either be a DataFrame or a Series object.
⃝ True √ False
Solution: False. groupby returns a GroupBy object.
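A quick way to see this is to inspect the types at each step. This is a small sketch on made-up data; the column names group and value are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

grouped = df.groupby("group")       # a GroupBy object; nothing is aggregated yet
summed = grouped.sum()              # aggregating yields a DataFrame
one_col = grouped["value"].sum()    # selecting one column first yields a Series

print(type(grouped).__name__, type(summed).__name__, type(one_col).__name__)
```

Only after an aggregation such as sum() does the result become a DataFrame or a Series.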
7. The tables food and store contain information regarding different ingredients and where to buy them. You may assume all text values are strings and all numbers are floats.
This is a preview of the first 5 rows of the DataFrames. You may assume they have many more rows than what is shown, with the same structure and no missing data.
food
index  name      calories  food_group  color
0      broccoli  25        vegetable   green
1      chicken   200       meat        pink
2      cheddar   350       dairy       yellow
3      mango     40        fruit       yellow
4      carrot    50        vegetable   orange

store
index  food_name  store_name     distance  price
0      broccoli   yasai          1         1.5
1      broccoli   safeway        2         2
2      cheddar    trader joes    1         4
3      mango      berkeley bowl  3         1
4      carrot     costco         6         5

(a) [5 Pts] Which of the following expressions returns a Series containing only the names of all the red vegetables in the food DataFrame? Select all that apply.
□ food[(food["color"] == "red") | (food["food_group"] == "vegetable")]["name"]
√ food[(food["color"] == "red") & (food["food_group"] == "vegetable")]["name"]
□ food[(food["color"] == "red") & (food["food_group"] == "vegetable")]
□ food[(food["name"].isin(store["food_name"])) & (food["food_group"] == "vegetable")]
□ None of the above.
Solution:
A. False; it contains vegetables that may not be red.
B. True; it contains only the names of all red vegetables.
C. False; it is a DataFrame.
D. False; it never filters down to only red vegetables.
E. False.
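The difference between the first three choices in 7(a) shows up on a toy table. The rows below are made up (a red vegetable and a red fruit are added so that the filters return something); only the column names come from the question.

```python
import pandas as pd

# Hypothetical rows; the real food table has many more.
food = pd.DataFrame({
    "name": ["broccoli", "chicken", "tomato", "apple"],
    "calories": [25.0, 200.0, 20.0, 95.0],
    "food_group": ["vegetable", "meat", "vegetable", "fruit"],
    "color": ["green", "pink", "red", "red"],
})

is_red = food["color"] == "red"
is_veg = food["food_group"] == "vegetable"

either = food[is_red | is_veg]["name"]   # red OR vegetable: too many rows
both = food[is_red & is_veg]["name"]     # red AND vegetable: a Series of names
both_frame = food[is_red & is_veg]       # same rows, but still a DataFrame

print(list(either), list(both), type(both_frame).__name__)
```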
(b) [5 Pts] Select all of the following expressions that generate a DataFrame containing only rows of fruit.
□ food.where(food["food_group"] == "fruit")
√ food.set_index("food_group").loc["fruit", :]
□ food["food_group"] == "fruit"
√ food[food["food_group"] == "fruit"]
□ None of the above.
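The four expressions in 7(b) behave quite differently, which a toy table makes visible. The rows here are made up, with two fruit rows so that .loc returns a DataFrame; this is a sketch under that assumption.

```python
import pandas as pd

# Hypothetical rows with two fruits; only the column names come from the exam.
food = pd.DataFrame({
    "name": ["broccoli", "mango", "apple", "cheddar"],
    "calories": [25.0, 40.0, 95.0, 350.0],
    "food_group": ["vegetable", "fruit", "fruit", "dairy"],
    "color": ["green", "yellow", "red", "yellow"],
})

is_fruit = food["food_group"] == "fruit"  # a boolean Series, not a DataFrame

masked = food.where(is_fruit)   # keeps all 4 rows; non-fruit rows become NaN
filtered = food[is_fruit]       # only the fruit rows
by_index = food.set_index("food_group").loc["fruit", :]  # fruit rows, indexed by food_group

print(len(masked), len(filtered), len(by_index))
```

where() preserves the shape of the input, which is why it does not count as "only rows of fruit".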
(c) [5 Pts] Select all true statements about the following expression.
cal100_foods = food[food["calories"] <= 100]
nearby_stores = store[store["distance"] <= 2]
output_df = cal100_foods.merge(nearby_stores,
                               how="left",
                               left_on="name",
                               right_on="food_name")
□ output_df['name'] and output_df['food_name'] are always the same.
√ output_df could contain NaN values.
□ nearby_stores always contains the same number of rows as the output_df.
√ output_df could contain more rows than the original food DataFrame.
□ None of the above.
Solution:
A. False; although these are the columns being merged on, a food with no nearby store gets NaN in food_name.
B. True; if none of the stores have the food item in the left table, the merged columns will be NaN.
C. False; it is a left join, so output_df could have more rows.
D. True; if every food item can be found at multiple stores, a new row is created for each store the food is found in. This can result in a much larger number of rows in the output DataFrame than food.
E. False.
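The behavior of the left merge in 7(c) can be reproduced on a small made-up example. The rows below are assumptions chosen so that one food matches several nearby stores and another matches none.

```python
import pandas as pd

# Hypothetical rows for illustration.
food = pd.DataFrame({
    "name": ["broccoli", "mango"],
    "calories": [25.0, 40.0],
})
store = pd.DataFrame({
    "food_name": ["broccoli", "broccoli", "broccoli", "mango"],
    "store_name": ["yasai", "safeway", "costco", "berkeley bowl"],
    "distance": [1.0, 2.0, 1.5, 3.0],
})

cal100_foods = food[food["calories"] <= 100]
nearby_stores = store[store["distance"] <= 2]
output_df = cal100_foods.merge(nearby_stores, how="left",
                               left_on="name", right_on="food_name")

# broccoli appears once per nearby store; mango has no nearby store, so its
# store columns (including food_name) are NaN.
print(output_df)
```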
(d) [4 Pts] Which of the following tables is represented by agg_df?
safeway_food = store[store["store_name"] == "safeway"]
merged_df = pd.merge(food, safeway_food, left_on="name",
                     right_on="food_name")
agg_df = (merged_df.groupby("food_group")
          .mean()
          .drop(columns="distance"))
[Four candidate tables appeared here as answer choices; the first table is marked correct (√).]
Solution: The first table has food groups on the index and includes both calories and price as columns, since they contain numeric values and will not be dropped in the groupby.
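The shape of agg_df can be checked on toy data. In this sketch the store rows are made up, and numeric_only=True is passed to mean() as an adaptation: recent pandas versions raise an error when asked to average string columns rather than silently dropping them.

```python
import pandas as pd

food = pd.DataFrame({
    "name": ["broccoli", "chicken", "cheddar", "mango", "carrot"],
    "calories": [25.0, 200.0, 350.0, 40.0, 50.0],
    "food_group": ["vegetable", "meat", "dairy", "fruit", "vegetable"],
    "color": ["green", "pink", "yellow", "yellow", "orange"],
})
# Hypothetical store rows so that safeway stocks several foods.
store = pd.DataFrame({
    "food_name": ["broccoli", "broccoli", "carrot", "chicken", "mango"],
    "store_name": ["yasai", "safeway", "safeway", "safeway", "berkeley bowl"],
    "distance": [1.0, 2.0, 2.0, 2.0, 3.0],
    "price": [1.5, 2.0, 1.0, 6.0, 1.0],
})

safeway_food = store[store["store_name"] == "safeway"]
merged_df = pd.merge(food, safeway_food, left_on="name", right_on="food_name")
agg_df = (merged_df.groupby("food_group")
          .mean(numeric_only=True)   # adaptation for recent pandas (assumption)
          .drop(columns="distance"))

print(agg_df)
```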
(e) [4 Pts] Which of the following expressions would generate the following table?
⃝ (food.groupby(["food_group", "color"])[["calories"]]
   .median())
√ pd.pivot_table(food, values="calories",
                 index="food_group", columns="color",
                 aggfunc=np.median)
⃝ (food.set_index("food_group")
   .groupby("color")[["calories"]]
   .mean())
⃝ pd.pivot_table(food, values="calories",
                 index="color", columns="food_group",
                 aggfunc=np.median)
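To compare the shapes that the pivot table and the two-key groupby produce, here is a minimal sketch on the five preview rows of food. It passes aggfunc="median" as a string, which is an equivalent spelling used here instead of np.median.

```python
import pandas as pd

food = pd.DataFrame({
    "name": ["broccoli", "chicken", "cheddar", "mango", "carrot"],
    "calories": [25.0, 200.0, 350.0, 40.0, 50.0],
    "food_group": ["vegetable", "meat", "dairy", "fruit", "vegetable"],
    "color": ["green", "pink", "yellow", "yellow", "orange"],
})

# One row per food group, one column per color; unseen pairs are NaN.
table = pd.pivot_table(food, values="calories",
                       index="food_group", columns="color",
                       aggfunc="median")

# Same medians, but in long form: one row per (food_group, color) pair.
by_pairs = food.groupby(["food_group", "color"])[["calories"]].median()

print(table)
print(by_pairs)
```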
EDA and Visualization
8. [5 Pts] Which of the following claims are true for the distribution shown below? Select all that apply.
□ It is left skewed
√ It is unimodal
√ The right tail is longer than the left tail
□ It is symmetric
□ None of the above
Solution:
A. False; it is right skewed
B. True
C. True
D. False; it is asymmetric
9. [5 Pts] We wish to compare the results of kernel density estimation using a gaussian kernel and a boxcar kernel. For α > 0, which of the following statements are true? Choose all that apply.
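Gaussian Kernel:
$K_\alpha(x, z) = \frac{1}{\sqrt{2\pi\alpha^2}} \exp\left(-\frac{(x - z)^2}{2\alpha^2}\right)$
Box Car Kernel:
$B_\alpha(x, z) = \begin{cases} \frac{1}{\alpha} & \text{if } -\frac{\alpha}{2} \le x - z \le \frac{\alpha}{2} \\ 0 & \text{else} \end{cases}$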
[Figure: kernel shapes, (a) Gaussian, (b) Box Car]
√ Decreasing α for a gaussian kernel decreases the smoothness of the KDE.
□ The gaussian kernel is always better than the boxcar kernel for KDEs.
□ Because the gaussian kernel is smooth, we can safely use large α values for kernel density estimation without worrying about the actual distribution of data.
√ The area under the box car kernel is 1, regardless of the value of α.
□ None of the above
Solution:
A. True
B. False; if the α values are not carefully selected for the gaussian kernel, the box car kernel can provide a better kernel density estimate
C. False; if we set α too high we potentially risk including too many points in our estimate, resulting in a flatter curve.
D. True
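Statements A and D can be checked numerically. The sketch below implements the two kernels from the question with NumPy; the function names, the sample data, and the evaluation grid are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(alpha, x, z):
    return np.exp(-(x - z) ** 2 / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

def boxcar_kernel(alpha, x, z):
    return np.where(np.abs(x - z) <= alpha / 2, 1.0 / alpha, 0.0)

def kde(kernel, alpha, data, xs):
    """Average one kernel per data point, evaluated on the grid xs."""
    return np.mean([kernel(alpha, xs, z) for z in data], axis=0)

xs = np.linspace(-10, 10, 20001)          # evaluation grid
dx = xs[1] - xs[0]
data = np.array([-1.0, 0.0, 0.5, 3.0])    # made-up sample

# Riemann-sum area under one boxcar kernel for several values of alpha.
boxcar_areas = {a: boxcar_kernel(a, xs, 0.0).sum() * dx for a in (0.5, 1.0, 4.0)}

# A small alpha gives a spiky Gaussian KDE; a large alpha gives a flat one.
rough = kde(gaussian_kernel, 0.1, data, xs)
smooth = kde(gaussian_kernel, 2.0, data, xs)

print(boxcar_areas)
print(rough.max(), smooth.max())
```

The boxcar has width α and height 1/α, so its area stays at 1 up to grid error, while the peak height of the Gaussian KDE drops as α grows.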
10. [5 Pts] Which of the following styles of plots are good for visualizing the distribution of a continuous variable? Choose all that apply.
□ Pie Charts
√ Box Plots
□ Bar Plots
√ Histogram
□ None of the above
Solution:
A. False; pie charts show proportions of categories, not the distribution of a continuous variable
B. True
C. False; bar plots are usually used for nominal/ordinal data
D. True
11. [2 Pts] Suppose you wish to compare the number of homes homeowners in the US own and their respective salaries. Which style of plot would be the best?
⃝ Scatter Plot ⃝ Overlaid Line Plots √ Side by Side Box Plots ⃝ Stacked Bar Plot
Solution:
A. False; most people own around 1-2 homes, thus there will be heavy overplotting
B. False; doesn’t make sense
C. True
D. False; stacking is bad and bar plots won’t do a good job
12. [5 Pts] Consider the plot below. What are some ways to improve the plot? Choose all that apply. Assume each is done individually.
√ Remove outliers and then plot on a different scale
□ Plot as a line plot instead of a scatterplot.
□ Jitter the data with noise sampled from a uniform distribution of (-1, 1)
√ Utilize transparency
□ None of the above
Solution:
A. True
B. False; for these data a line plot would be an incomprehensible web.
C. False; simply adding a small random noise doesn’t alleviate the problem at hand of markers being condensed near the x axis.
D. True
13. [5 Pts] Consider the plot below which visualizes day of the week versus the average tip given in dollars. What are serious visualization errors made with this plot? Choose all that apply.
□ Area perception
□ Jittering
□ Overplotting
□ Stacking
√ None of the above
Solution:
A. False; this is a line plot
B. False; jittering is a technique to address overplotting
C. False; there is no overplotting
D. False; there is no stacking
14. True/False
(a) [1 Pt] A data scientist must always consider potential sources of bias in a given dataset.
√ True ⃝ False
(b) [1 Pt] It is always reasonable to drop missing values.
⃝ True √ False
15. Use the following dataset to answer the following questions:
id,diet,pulse,time,kind
1,low fat,85,1 min,rest
1,low fat,85,15 min,rest
1,low fat,88,30 min,rest
2,low fat,90,1 min,rest
2,low fat,92,15 min,rest
2,low fat,93,30 min,rest
3,low fat,97,1 min,rest
3,low fat,97,15 min,rest
(a) [1 Pt] Which of the following best describes the format of this file?
⃝ Raw text
⃝ Tab Separated Values (TSV)
√ Comma Separated Values (CSV)
⃝ JSON
(b) [4 Pts] Select all the true statements.
□ From the data available, the id seems to be a primary key.
√ There appear to be no missing values.
□ There are nested records.
□ None of the above.
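The statements in 15(b) can be checked by loading the eight rows shown into pandas. This is a minimal sketch; the variable names are assumptions, and the checks only cover the rows printed in the question.

```python
import io
import pandas as pd

raw = """id,diet,pulse,time,kind
1,low fat,85,1 min,rest
1,low fat,85,15 min,rest
1,low fat,88,30 min,rest
2,low fat,90,1 min,rest
2,low fat,92,15 min,rest
2,low fat,93,30 min,rest
3,low fat,97,1 min,rest
3,low fat,97,15 min,rest
"""

df = pd.read_csv(io.StringIO(raw))

id_is_unique = df["id"].is_unique          # a primary key must not repeat
missing_count = int(df.isna().sum().sum())  # total number of missing cells
id_time_is_unique = not df.duplicated(subset=["id", "time"]).any()

print(id_is_unique, missing_count, id_time_is_unique)
```

The id repeats across time points, so it cannot be a primary key on its own; in these rows the pair (id, time) is unique.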
16. [5 Pts] Select all the true statements about the following XML file:
<email>
Hello there! How are we today?
</email>
<email>