Data 100, Midterm 1 Fall 2019
Name:
Email: ____________@berkeley.edu
Student ID:
Exam Room:
All work on this exam is my own (please sign):
Instructions:
• This midterm exam consists of 90 points and must be completed in the 80 minute time period ending at 9:30, unless you have accommodations supported by a DSP letter.
• Note that some questions have circular bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use a one-sheet (two-sided) cheat sheet, in addition to the included Midterm 1 Reference Sheet.
Data 100 Midterm 1, Page 2 of 14 October 8, 2019
1 Cereal (Pandas)
You are given a Pandas DataFrame cereal with information about 80 different cereals.
The figure above is the result of running cereal.head(). All values are per-serving. type can be either cold or hot. rating is the average score (out of 100) given by customers.
(a) [1 Pt] What is the granularity of the cereal data frame?
⃝ serving of cereal ⃝ type of cereal ⃝ manufacturer ⃝ name of cereal
(b) [3 Pts] Add a new column to cereal named low_calorie which has the boolean value True if the cereal is low-calorie and False otherwise. A cereal is low-calorie if it has at most 100 calories per serving.
cereal[______________________] = _________________________
Solution:
cereal["low_calorie"] = cereal["calories"] <= 100
(c) [3 Pts] Identify the type for each of the following variables.
fiber ⃝ continuous ⃝ discrete ⃝ nominal ⃝ ordinal
type ⃝ continuous ⃝ discrete ⃝ nominal ⃝ ordinal
low_calorie ⃝ continuous ⃝ discrete ⃝ nominal ⃝ ordinal
(d) [6 Pts] For this problem and the next problem below you may use the following functions:
groupby, agg, filter, merge
unique, value_counts, sort_values, apply
max, min, mean, median, std, count,
np.mean, sum, any, all, isnull, len
You can also use any other methods you have used in class, for example, anything in the Pandas or Numpy libraries. You can leave lines inside parentheses blank to represent a function call with no arguments.
Create a Series indexed by manufacturer. For each manufacturer, the value should be equal to the maximum sugars value of all cereals by that manufacturer. Your series should be sorted by the value in decreasing order. You may not need all lines.
max_sugar = cereal._________________(__________________)["sugars"]
._________________(__________________)
._________________(__________________)
For example, the first few entries of the Series would be:
Kelloggs 15
Post 14
Quaker Oats 14
Solution:
cereal.groupby("manufacturer")["sugars"] \
    .agg(max).sort_values(ascending=False)
(e) [8 Pts] Which manufacturers make only cold cereals? Return an array, list, or series of these manufacturers that exclusively manufacture cold cereals, i.e. your list should not include a company if it makes ANY hot cereals. You may not need all lines. You can leave lines inside parentheses blank to represent a function call with no arguments.
def f(df):
___________________________________________
___________________________________________
___________________________________________
cold_only =
cereal._________________(__________________)
._________________(__________________)["manufacturer"]
._________________(__________________)
Solution:
cereal.groupby("manufacturer") \
    .filter(lambda x: sum(x["type"] == "hot") == 0)["manufacturer"]
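The two groupby patterns used in parts (d) and (e) can be tried on a small frame. The sketch below is only illustrative: the rows are made up and are not the exam's cereal data; only the column names manufacturer, type, and sugars follow the problem. It uses .max() in place of .agg(max), which computes the same thing, and adds a trailing .unique() as one possible use of the template's last blank.

```python
import pandas as pd

# Toy stand-in for `cereal`; these rows are invented, not the exam's data.
cereal = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "manufacturer": ["Kelloggs", "Kelloggs", "Post", "Quaker Oats", "Quaker Oats"],
    "type": ["cold", "cold", "cold", "cold", "hot"],
    "sugars": [15, 3, 14, 14, 1],
})

# Part (d): maximum sugars per manufacturer, largest value first.
max_sugar = (
    cereal.groupby("manufacturer")["sugars"]
    .max()
    .sort_values(ascending=False)
)

# Part (e): keep only the groups with no hot cereal,
# then list each remaining manufacturer once.
cold_only = (
    cereal.groupby("manufacturer")
    .filter(lambda g: (g["type"] == "hot").sum() == 0)["manufacturer"]
    .unique()
)

print(max_sugar)
print(cold_only)
```

Without the final .unique(), the filtered result holds one entry per cereal rather than one per manufacturer, which still satisfies "array, list, or series" but repeats names.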
(f) [2 Pts] Consider the data frame below. Assume cereal was modified correctly in part (b). The interior values are average rating per category, e.g. 32.026596 is the average rating of the low-calorie cereals made by General Mills.
Which of the following four lines of code could be used to create this data frame?
pd.pivot_table(data=cereal, index='manufacturer', columns='low_calorie', values='rating', aggfunc=np.mean)
cereal.groupby(['manufacturer', 'low_calorie'])['rating'].mean()
pd.pivot_table(data=cereal, index='low_calorie', columns='manufacturer', values='rating', aggfunc=np.mean)
cereal.groupby('rating')[['manufacturer', 'low_calorie']].mean()
(g) [2 Pts] The above table contains NaNs because some companies don’t make cereals with the given calorie level. By calorie level, we mean whether low_calorie is True or False. E.g. Nabisco does not make any low-calorie cereals. If we wanted to show a colleague this pivot table to illustrate the average rating across manufacturer and calorie level combinations for cereals, what should we do with these NaN values? Pick the one best option.
⃝ Fill them with the average rating across all the cereals in the same calorie level
⃝ Fill them with the average rating across all cereals from the same manufacturer
⃝ Both A and B are acceptable
⃝ Leave as-is
⃝ Replace with a rating randomly selected from a cereal with the same calorie level
Solution: You want to leave the values as-is because it will show that there are no such cereals in the specified category, which is useful information in itself.
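To see where such NaN cells come from, here is a minimal sketch on invented rows (not the exam's data; only the column names follow the problem). It assumes a reasonably recent pandas and passes aggfunc="mean" as a string, which is an alternative spelling of the mean aggregation. A manufacturer with no cereal at a given calorie level simply has no rows to average, so its cell is missing.

```python
import pandas as pd

# Made-up rows; Nabisco has no cereal with calories <= 100.
cereal = pd.DataFrame({
    "manufacturer": ["General Mills", "General Mills", "Nabisco", "Nabisco"],
    "calories": [100, 110, 120, 130],
    "rating": [30.0, 40.0, 50.0, 70.0],
})
cereal["low_calorie"] = cereal["calories"] <= 100

# Average rating for each (manufacturer, low_calorie) combination.
table = pd.pivot_table(
    data=cereal,
    index="manufacturer",
    columns="low_calorie",
    values="rating",
    aggfunc="mean",
)

# Count the cells that have no underlying cereals.
n_missing = int(table.isna().sum().sum())

print(table)
print("missing cells:", n_missing)
```

The single missing cell is the (Nabisco, low-calorie) combination, which is exactly the information that filling in a value would hide.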
2 Computing Summary Statistics
Suppose we’re given the set of points $\{-15, 10, 20, 30, 30, 35, 40, 50\}$, and we want to determine a summary statistic $c$. For each of the following loss functions, determine or select the optimal value of $c$, $\hat{c}$, that minimizes the corresponding empirical risk.
For (a), (b), and (c), select the correct answer. For (d), write your answer in the provided box. To help you with this task, we have computed the following: mean = 25, median = 30, SD = 20, and n = 8.
(a) [2 Pts] $(x_i - c)^2$
⃝ 20 ⃝ 25 ⃝ 5 ⃝ 50 ⃝ 0
(b) [2 Pts] $5(x_i - c)^2$
⃝ 20 ⃝ 125 ⃝ 25 ⃝ 100 ⃝ -15
(c) [2 Pts] $|x_i - c|$
⃝ 25 ⃝ 50 ⃝ 40 ⃝ 30 ⃝ -15
(d) [5 Pts] $(3x_i - c)^2$
For part (d), write only your answer in the box below. Feel free to show your work elsewhere on this page.
$\hat{c} =$
Solution: $\hat{c} = 3\bar{x} = 75$
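The minimizers in this question can be checked numerically. The sketch below is a brute-force check, not a derivation: it evaluates the average loss on a grid of candidate values of c and picks the smallest. The grid bounds and step (0.01) are arbitrary choices made for this illustration, and argmin_risk is a hypothetical helper name.

```python
import numpy as np

x = np.array([-15, 10, 20, 30, 30, 35, 40, 50])

# Candidate values of c from -50 to 200 in steps of 0.01 (arbitrary bounds).
grid = np.linspace(-50, 200, 25001)

def argmin_risk(loss):
    """Return the grid value of c with the smallest average loss over x."""
    # Broadcasting gives one row per candidate c and one column per data point.
    risks = loss(x[None, :], grid[:, None]).mean(axis=1)
    return float(grid[risks.argmin()])

c_a = argmin_risk(lambda x, c: (x - c) ** 2)       # squared loss
c_b = argmin_risk(lambda x, c: 5 * (x - c) ** 2)   # scaled squared loss
c_c = argmin_risk(lambda x, c: np.abs(x - c))      # absolute loss
c_d = argmin_risk(lambda x, c: (3 * x - c) ** 2)   # squared loss on 3 * x

print(round(c_a, 2), round(c_b, 2), round(c_c, 2), round(c_d, 2))
```

Up to grid resolution, the squared losses in (a) and (b) are minimized at the mean, the absolute loss in (c) at the median, and the loss in (d) at three times the mean, since it is a squared loss applied to the rescaled points $3x_i$. Multiplying a loss by a positive constant, as in (b), does not move its minimizer.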
3 Regex
You are interviewing for a job at Triple Rock, and they want you to prove your skills with regular expressions on synthetic data.
(a) [4 Pts] First they give you a list of two of their distributors.
distributors[0] = "Geyser Beverage: 55 Wright Brothers Ave"
distributors[1] = "Mindful Distribution: 2935 Adeline St"
Give a regular expression that extracts the name and street number from such strings. Your regex should work for either string. For example, after running the code below, name should be 'Geyser Beverage' and street_number should be '55'.
regex_1 = r'___________________________________________'
name, street_number = re.findall(regex_1, distributors[0])[0]
Solution:
([\w\s]+):\s+(\d+)
(b) Next they give you information regarding how each of two tables paid its bill. The first two entries are:
paid_bills[0] = "4123713131673827 paid $30.50, and $37 paid by 5612512165638672."
paid_bills[1] = "$171.25 was charged to 4612512165638672."
i. [4 Pts] Give a regular expression that extracts Visa and Mastercard credit card numbers from such strings. Your regex should work for either string. A Visa credit card number is any sequence of 16 digits that starts with a 4, and a Mastercard is any sequence of 16 digits that starts with a 5. Observe that there are no dashes or other extraneous characters in a credit card number. For example, after running the code below, cc_nums should be ['4123713131673827', '5612512165638672'].
regex_2 = r'___________________________________________'
cc_nums = re.findall(regex_2, paid_bills[0])
Solution:
[45]\d{15}
ii. [4 Pts] Write a regular expression which will correctly find and return the dollar amounts including whatever is to the right of the optional decimal point. Your regex should work for either string. For example, after running the code below, amounts should be ['30.50', '37'].
regex_3 = r'___________________________________________'
amounts = re.findall(regex_3, paid_bills[0])
Solution:
\$(\d+\.?\d*)
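A quick way to check the three patterns is to run them on the sample strings from the problem. In this sketch regex_1 is written with the + outside the character class, so that the first group captures the whole name rather than a single character; the list literals simply collect the strings given above.

```python
import re

distributors = [
    "Geyser Beverage: 55 Wright Brothers Ave",
    "Mindful Distribution: 2935 Adeline St",
]
paid_bills = [
    "4123713131673827 paid $30.50, and $37 paid by 5612512165638672.",
    "$171.25 was charged to 4612512165638672.",
]

# Name (word characters and spaces) before the colon, then the street number.
regex_1 = r"([\w\s]+):\s+(\d+)"
# A 4 or a 5 followed by 15 more digits, i.e. 16 digits in total.
regex_2 = r"[45]\d{15}"
# A dollar sign, then digits with an optional decimal part; only the number is captured.
regex_3 = r"\$(\d+\.?\d*)"

name, street_number = re.findall(regex_1, distributors[0])[0]
cc_nums = re.findall(regex_2, paid_bills[0])
amounts = re.findall(regex_3, paid_bills[0])

print(name, street_number)
print(cc_nums)
print(amounts)
```

Because re.findall returns strings, street_number comes back as the string '55' and the amounts as '30.50' and '37'; converting them to numbers would be a separate step.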
4 EDA
(a) [2 Pts] Which of the following transformations could help make linear the relationship shown in the plot below? Select all that apply:
log(y)   log(x)   $e^x$   $y^3$   None of the above
(b) Sally likes making desserts, and she wants to learn more about the sugar content of her recipes. For each of 100 recipes, she records the amount of sugar (in grams) per serving. A histogram of the sugar measurements appears in the plot below on the left.
Her friend, Max, makes a similar histogram of his 100 recipes, which appears below on the right. Note: The two images shown are exactly alike.
i. [3 Pts] How many of Sally’s recipes have more than 10 grams of sugar per serving? Do not worry about interval endpoints, i.e. assume that no recipes have exactly 0, 5, 10, or 20 grams.
⃝ 15 ⃝ 30 ⃝ 60 ⃝ Impossible to tell
ii. [3 Pts] How would you describe the distribution of sugar in Sally’s recipes? Check all that apply.
unimodal   multimodal   symmetric   skew left   skew right   contains outliers
iii. [2 Pts] Do Max’s recipes have the exact same sugar content as Sally’s?
⃝ Yes ⃝ No ⃝ Impossible to tell
5 Visualizations
(a) [6 Pts] Consider plots A and B below. For each plot, identify its primary flaw (if any) and give a recommendation in the provided box.
Does Plot A have a significant flaw? ⃝ Yes ⃝ No
If you picked yes, in the box below, make a recommendation to fix the primary flaw.
Solution: Increase the bandwidth for a smoother density estimate
Does Plot B have a significant flaw? ⃝ Yes ⃝ No
If you picked yes, in the box below, make a recommendation to fix the primary flaw.
Solution: Re-scale y-axis
(b) [6 Pts] Consider plots C and D below. For each plot, identify its primary flaw (if any) and give a recommendation in the provided box.
Does Plot C have a significant flaw? ⃝ Yes ⃝ No
If you picked yes, in the box below, make a recommendation to fix the primary flaw.
Solution: Make density curves so that you can more easily compare the distributions.
Does Plot D have a significant flaw? ⃝ Yes ⃝ No
If you picked yes, in the box below, make a recommendation to fix the primary flaw.
Solution: Not applicable
(c) [2 Pts] For plot D above, which species of Iris has the highest frequency in the data set?
⃝ Virginica ⃝ Versicolor ⃝ Setosa ⃝ Impossible to tell
6 Sampling
(a) Professor Hug is an instructor for both Data 100 and CS W186 this semester. Students come to his office hours for both of these classes, but some students also come for other reasons. Professor Hug is interested in knowing how many Data 100 students this semester have taken CS 61B. He takes a convenience sample of people that come to office hours.
i. [2 Pts] Name a group or individual that is included in the sampling frame but is not in the population of interest.
Solution: CS W186 students who are not in Data 100; students not in Data 100 who come to Professor Hug’s office hours for reasons unrelated to Data 100 and CS W186
ii. [2 Pts] Name a group or individual that is in the population of interest but not in the sampling frame.
Solution: Data 100 students who do not go to Professor Hug’s office hours
(b) For the rest of the question, assume that the sampling frame is exactly the same as the population. That is, both the population and the sampling frame are the population of Data 100 students. Also, assume that there are 1000 students in Data 100 and that 500 of them have taken CS 61B. Using the class list, Professor Hug takes a simple random sample of 50 students in Data 100. Let $X_i$ be 1 if the $i$th person sampled took CS 61B and 0 otherwise, for $i = 1, \dots, 50$.
Find the following quantities. Somewhere in this problem you might need the "finite population correction factor" given by $\frac{N-n}{N-1}$.
i. [3 Pts] $P(X_5 = 1) =$
Solution: $P(X_5 = 1) = P(\text{5th person has taken CS 61B}) = \frac{500}{1000} = \frac{1}{2}$
ii. [4 Pts] $P(X_5 = 1, X_{50} = 1) =$
Solution: $P(X_5 = 1, X_{50} = 1) = \frac{500}{1000} \times \frac{499}{999} = \frac{1}{2} \times \frac{499}{999}$
iii. [3 Pts] $\mathrm{Var}(X_1 + X_2 + \cdots + X_{50}) =$
Solution:
$\mathrm{Var}(X_1 + X_2 + \cdots + X_{50}) = \frac{N-n}{N-1} \, n p (1-p)$
$= \frac{1000-50}{1000-1} \cdot 50 \cdot \frac{500}{1000} \left(1 - \frac{500}{1000}\right)$
$= \frac{950}{999} \cdot 50 \cdot \frac{1}{4}$
(c) [4 Pts] Now suppose Professor Hug takes a census of the class. Let $X_i$ be 1 if the $i$th person sampled took CS 61B and 0 otherwise, for $i = 1, \dots, 1000$. Find the given quantity.
$\mathrm{Var}(X_1 + X_2 + \cdots + X_{1000}) =$
Solution: If a census is taken, then all 1000 students are surveyed, and $X_1 + \cdots + X_{1000} = 500$. That is, there is no variability in this sum, and
$\mathrm{Var}(X_1 + X_2 + \cdots + X_{1000}) = 0$
Note that the finite population correction factor is 0 in this case.
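The quantities in parts (b) and (c) can be reproduced with exact fractions. This is a small sketch using only the numbers stated in the problem (1000 students, 500 of whom took CS 61B); the helper name var_of_sum and the variable names are hypothetical, chosen for this illustration.

```python
from fractions import Fraction

N, K = 1000, 500  # class size and number who took CS 61B (from the problem)

def var_of_sum(n):
    """Variance of X_1 + ... + X_n for a simple random sample of size n."""
    p = Fraction(K, N)
    fpc = Fraction(N - n, N - 1)  # finite population correction factor
    return fpc * n * p * (1 - p)

# (b) i: each draw, viewed on its own, is equally likely to be any student.
p_single = Fraction(K, N)

# (b) ii: sampling without replacement, so the second factor uses 499 of 999.
p_joint = Fraction(K, N) * Fraction(K - 1, N - 1)

# (b) iii and (c): sample of 50 versus a census of all 1000 students.
var_50 = var_of_sum(50)
var_census = var_of_sum(N)

print(p_single, p_joint, var_50, var_census)
```

With n = N the correction factor is zero, so the same formula that gives the variance for the sample of 50 also gives zero variance for the census.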