Name: Email: Student ID:
@berkeley.edu
DS-100 Final Exam Fall 2017
Instructions:
• This final exam must be completed in the 3 hour time period ending at 11:00AM.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must shade in the box/circle. Check marks will likely be mis-graded.
• You may use a two page (two-sided) study guide.
• Work quickly through each question. There are a total of 127 points on this exam.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:
1
DS100 Final, Page 2 of 30 December 14th, 2017
Syntax Reference

Regular Expressions

"^" matches the position at the beginning of the string (unless used for negation "[^]").
"$" matches the position at the end of the string.
"?" match the preceding literal or sub-expression 0 or 1 times. When following "+" or "*", results in non-greedy matching.
"+" match the preceding literal or sub-expression one or more times.
"*" match the preceding literal or sub-expression zero or more times.
"." match any character except a new line.
"[ ]" match any one of the characters inside; accepts a range, e.g., "[a-c]".
"( )" used to create a sub-expression.
"\d" match any digit character. "\D" is the complement.
"\w" match any word character (letters, digits, underscore). "\W" is the complement.
"\s" match any whitespace character, including tabs and newlines. "\S" is the complement.
"\b" match the boundary between words.

XPath

An XPath expression is made up of location steps separated by forward slashes. Each location step has three parts: an axis, which gives the direction to look; a node test, which indicates the node name or text(); and an optional predicate to filter the matching nodes:

axis::node[predicate]

We have used shortcut names for the axis: "." refers to self, "//" refers to self or descendants, ".." refers to the parent, and child is the default axis and can be dropped. The node test of the XPath expression is either an element name, text() for text content, or @attribute for an attribute.

The predicate contains an expression that evaluates to true or false. Only those nodes that evaluate to true are kept. To check whether an attribute is present in a node, we use, e.g., [@time] (this evaluates to true if the node has a time attribute). Similarly, [foo] evaluates to true if the node has a child node named foo. The value of an attribute can be checked with, e.g., [@time = "2017"].
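The regular-expression rules above can be exercised with Python's built-in re module; a small sketch (the strings below are made up for illustration):

```python
import re

# "\d" matches a digit; "+" repeats it one or more times.
assert re.match(r"\d+/\d+/\d+", "06/15/10")

# "^" and "$" anchor the pattern; fullmatch requires the whole string to match.
assert re.fullmatch(r"[A-Z\s]+", "HELLO WORLD")
assert not re.fullmatch(r"[A-Z\s]+", "Hello World")

# "?" after "+" makes the repetition non-greedy: match as little as possible.
assert re.match(r"a+?", "aaa").group() == "a"

# "\b" marks a word boundary; "\w" matches a word character.
assert re.findall(r"\b\w+\b", "one, two!") == ["one", "two"]
```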
Variance and Expected Value Calculations

The expected value of X is

E[X] = Σ_{j=1}^{m} x_j p_j

The variance of X is

Var[X] = Σ_{j=1}^{m} (x_j − E[X])² p_j = Σ_{j=1}^{m} x_j² p_j − E[X]² = E[X²] − E[X]²

The standard deviation of X is SD[X] = √Var[X].

For X_1, …, X_n,

E[a_1 X_1 + a_2 X_2 + ··· + a_n X_n] = a_1 E[X_1] + ··· + a_n E[X_n]

If the X_i are independent, then

Var[a_1 X_1 + a_2 X_2 + ··· + a_n X_n] = a_1² Var[X_1] + ··· + a_n² Var[X_n]

In the special case where E[X_i] = μ, Var[X_i] = σ², a_i = 1/n, and the X_i are independent, we have

E[X̄] = μ    Var[X̄] = σ²/n    SE[X̄] = σ/√n

PySpark

sc.textFile(filename) Creates an RDD from the file filename with each line in the file as a separate record.
rdd.collect() Takes an RDD and returns a Python list.
rdd.filter(f) Applies the function f to each record in rdd and keeps all the records that evaluate to True.
rdd.map(f) Applies the function f to each record in rdd, producing a new RDD containing the outputs of f.
rdd.mapValues(f) Takes an RDD of key-value pairs (lists) and applies the function f to the value of each record, producing a new RDD containing the outputs of f.
rdd.reduceByKey(f) Takes an RDD of key-value pairs (lists). It then groups the values by key and applies the reduce function f to combine (e.g., sum) all the values, returning an RDD of [key, sum(values)] lists.
s.split() Splits a string on whitespace.
np.array(list) Constructs a vector from a list of elements.
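As a quick numeric check of the expectation and variance formulas above, here is a pure-Python sketch for a made-up distribution (the values and probabilities are invented):

```python
from math import sqrt, isclose

# A made-up distribution: pairs (x_j, p_j).
dist = [(0, 0.2), (1, 0.5), (4, 0.3)]

ex = sum(x * p for x, p in dist)             # E[X]
ex2 = sum(x**2 * p for x, p in dist)         # E[X^2]
var = sum((x - ex)**2 * p for x, p in dist)  # Var[X], definitional form
sd = sqrt(var)                               # SD[X]

# The two expressions for the variance agree: Var[X] = E[X^2] - E[X]^2.
assert isclose(var, ex2 - ex**2)
assert isclose(ex, 1.7) and isclose(var, 2.41)
```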
Data Cleaning, Regular Expressions, and XPath
1. Consider the following text data describing purchases of financial products:
Id Date Product Company
0 99/99/99 Debt collection California Accounts Service
1 06/15/10 Credit reporting EXPERIAN INFORMATION SOLUTIONS INC
3 10/21/14 MORTGAGE OCWEN LOAN SERVICING LLC
5 03/30/15 The CBE Group Inc
6 02/03/16 Debt collection The CBE Group, Inc.
7 01/07/17 Credit reporting Experian Information Solutions Inc.
8 03/15/17 Credit card FIRST NATIONAL BANK OF OMAHA
(1) [2 Pts] Select all the true statements from the following list.
Some of the product values appear to be missing.
Some of the date values appear to be missing.
The file is comma delimited
The file is fixed width formatted.
To analyze the companies we will need to correct for variation in capitalization
and punctuation.
None of the above statements are true.
(2) [2 Pts] Select all of the following regular expressions that properly match the dates.
\d?/\d?/\d?
\d+/\d+/\d+
\d*/\d*/\d*
\d\d/\d\d/\d\d
None of the above regular expressions match.
(3) [2 Pts] Which of the following regular expressions exactly matches the entry FIRST NATIONAL BANK OF OMAHA? Select all that match.
[A-Z]*
FIR[A-Z,\s]* OMAHA
F[A-Z, \s]+A
F[A-Z]*
None of the above regular expressions match.
2. Consider the following HTML document:
<html>
<body>
<h1>Hello!</h1>
<p>my story is here and it's silly.</p>
<table>
  <tr><th>Name</th><th>Instrument</th></tr>
  <tr><td><a href="www.yyy">Abe</a></td><td>violin</td></tr>
  <tr><td>Amy</td><td>violin</td></tr>
  <tr><td>Dan</td><td>viola</td></tr>
  <tr><td><a href="www.ccc">Cal</a></td><td>trumpet</td></tr>
</table>
<table id="xyz">
  <tr><th>Name</th><th>Instrument</th></tr>
  <tr><td>Sally</td><td>bass</td></tr>
  <tr><td><a href="www.ter">Terry</a></td><td>guitar</td></tr>
  <tr><td>Cassie</td><td>drums</td></tr>
  <tr><td>Tobie</td><td>piano</td></tr>
</table>
<p>The End!</p>
</body>
</html>
(1) [2 Pts] Which of the following XPath queries locates the p-elements in the document? Select all that apply.
//p
//table/../p
//body//p
./body/p
(2) [2 Pts] What will be the result of the XPath query ./body/table/tr/td/a/text()?
⃝ www.yyy ⃝ Abe ⃝ [Abe, Cal, Terry] ⃝ [www.yyy, www.ccc, www.ter]
(3) [2 Pts] Which of the following XPath queries locates the names of all musicians in the second table (i.e., Sally, Cassie, and Tobie)? Select all that apply.
//table[@id]//td/text()
./body/table[2]/text()
//table[@id="xyz"]/tr/td[1]/text()
//tr/td[1]/text()
None
The document from the previous page is repeated below for quick reference.
<html>
<body>
<h1>Hello!</h1>
<p>my story is here and it's silly.</p>
<table>
  <tr><th>Name</th><th>Instrument</th></tr>
  <tr><td><a href="www.yyy">Abe</a></td><td>violin</td></tr>
  <tr><td>Amy</td><td>violin</td></tr>
  <tr><td>Dan</td><td>viola</td></tr>
  <tr><td><a href="www.ccc">Cal</a></td><td>trumpet</td></tr>
</table>
<table id="xyz">
  <tr><th>Name</th><th>Instrument</th></tr>
  <tr><td>Sally</td><td>bass</td></tr>
  <tr><td><a href="www.ter">Terry</a></td><td>guitar</td></tr>
  <tr><td>Cassie</td><td>drums</td></tr>
  <tr><td>Tobie</td><td>piano</td></tr>
</table>
<p>The End!</p>
</body>
</html>
(4) [3 Pts] Which of the following XPath queries locates the instruments of all musicians with Web pages? (A musician has a Web page if there is an a-tag associated with their name.) Select all that apply.
//td/a/../../td[2]/text()
//a/ancestor-or-self::table/tr/td[2]/text()
//table/tr/td[a]/../td[2]/text()
//tr/td[a]/text()
None
Visualization
3. [2 Pts] Which of the following transformations could help make linear the relationship shown in the plot below? Select all that apply:
[Scatter plot of Y versus X; the Y axis runs roughly from 7.5 to 15.0 and the X axis from 0 to 1000.]
log(y)
x²
√x
log(x)
y²
None
4. [2 Pts] Which graphing techniques can be used to address problems with over-plotting? Check
all that apply.
jiggling
transparency
smoothing
faceting
banking to 45 degrees
contour plotting
linearizing
5. The following line plot compares the annual production-based and consumption-based carbon dioxide emissions (million tons) in Armenia.
[Figure: line plot of the two emissions series over time.]
(1) [2 Pts] This plot best conveys:
⃝ The relative increase in CO2 emissions since 1990.
⃝ The overall trend in CO2 emissions broken down by source.
⃝ The relative breakdown of CO2 emissions sources over time.
⃝ The cumulative CO2 emissions.
(2) [2 Pts] What kind of plot would facilitate the relative comparison of these two sources of emissions over time?
⃝ stacked barchart
⃝ side-by-side boxplots
⃝ line plot of annual differences
⃝ scatter plot of production-based emissions against consumption-based emissions
Sampling
1a 1b 1c 2a 2b 2c 3a 3b 3c 4a 4b 4c
6. Kalie wants to measure interest for a party on her street. She assigns numbers and letters to each house on her street as illustrated above. She picks a letter “a”, “b”, or “c” at random and then surveys every household on the street ending in that letter.
(1) [1 Pt] What kind of sample has Kalie collected?
⃝ Quota Sample ⃝ Cluster Sample ⃝ Simple Random Sample ⃝ Stratified Sample
(2) [1 Pt] What is the chance that two houses next door to each other are both in the sample?
⃝ 1/3 ⃝ 1/9 ⃝ 1/6 ⃝ 0
For the remaining parts of this question, suppose that 1/2 of the houses ending in "a" favor the party, 3/4 of the houses ending in "b" favor the party, and all of the houses ending in "c" favor the party. Hence, overall, p = 3/4 of the houses favor the party.
(3) [4 Pts] If Kalie estimates how favorable the party is using the proportion p̂ of households in her survey favoring the party, what is the expected value of her estimator, E[p̂]? Show your work in the space below.
⃝ 1/2 ⃝ 2/3 ⃝ 3/4 ⃝ 1
(4) [6 Pts] If, as before, Kalie estimates how favorable the party is using the proportion p̂ of households in her survey favoring the party, what is the variance of her estimator, Var[p̂]? Show your work in the space below.
⃝ 1/9 ⃝ 2/27 ⃝ 1/6 ⃝ 1/24
SQL
7. [2 Pts] From the following list, select all the statements that are true:
A database is a system that stores data.
SQL is a declarative language that specifies what to produce but not how to compute it.
To do large scale data analysis it is usually faster to extract all the data from the database and use Pandas to execute joins and compute aggregates.
The schema of a table consists of the data stored in the table.
The primary key of a relation is the column or set of columns that determine the values of the remaining columns.
None of the above statements are true.
8. [4 Pts] The following relational schema represents a large table describing Olympic medalists.
MedalAwards(year, athlete_name, medal,
event, num_competitors,
country, population, GDP)
If we allow athletes to compete for different countries in different years and in multiple events, which of the following normalized representations most reduces data redundancy while encoding the same information?
⃝ MedalAwards(year, athlete_name, medal, event)
  Athlete(year, athlete_name, country, event, num_competitors, population, GDP)
⃝ MedalAwards(year, athlete_name, medal, event)
  Athlete(year, athlete_name, country, event, num_competitors)
  CountryInfo(year, country, population, GDP)
⃝ MedalAwards(year, athlete_name, medal, event)
  Events(year, event, num_competitors)
  Athlete(year, athlete_name, country)
  CountryInfo(year, country, population, GDP)
⃝ MedalAwards(year, athlete_name, medal, event)
  Events(event, num_competitors)
  Athlete(athlete_name, country)
  CountryInfo(country, population, GDP)
9. For this question you will use the following database consisting of three tables:
CREATE TABLE medalist(
    name TEXT PRIMARY KEY,
    country TEXT,
    birthday DATE);

CREATE TABLE games(
    year INT PRIMARY KEY,
    city TEXT,
    country TEXT
);

-- medaltype column takes three values:
-- 'G' for gold, 'S' for silver,
-- and 'B' for bronze
CREATE TABLE medals(
    name TEXT,
    year INT,
    FOREIGN KEY name REFERENCES medalist,
    FOREIGN KEY year REFERENCES games,
    category TEXT,
    medaltype CHAR);
(1) [1 Pt] Which of the following queries returns 5 rows from the medalist table (select all that apply):
SELECT * FROM medalist WHERE LEN(*) < 5;
SELECT * FROM medalist LIMIT 5;
SELECT * FROM medalist HAVING LEN(*) < 5;
FROM medalist SELECT * WHERE COUNT(*) < 5;
(2) [1 Pt] Which of the following queries returns the names of all the German medalists (select all that apply):
SELECT name FROM medalist WHERE country = 'Germany';
FROM medalist SELECT name WHERE country = 'Germany';
SELECT name FROM medalist HAVING country == 'Germany';
FROM medalist SELECT name HAVING country IS 'Germany';
Summarizing the schema on the previous page for quick reference:
medalist(name, country, birthday);
games(year, city, country);
medals(name, year, category, medaltype);
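As a sanity check of this schema, the tables can be created and queried with Python's built-in sqlite3 module. A sketch with made-up rows (the athletes and counts below are invented, and the foreign-key clauses are written in SQLite's own syntax, which differs slightly from the exam's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE medalist(name TEXT PRIMARY KEY, country TEXT, birthday DATE)")
cur.execute("CREATE TABLE games(year INT PRIMARY KEY, city TEXT, country TEXT)")
cur.execute("""CREATE TABLE medals(
    name TEXT REFERENCES medalist(name),
    year INT REFERENCES games(year),
    category TEXT,
    medaltype CHAR)""")

# A few invented rows for illustration.
cur.executemany("INSERT INTO medalist VALUES (?, ?, ?)",
                [("Simone", "USA", "1997-03-14"), ("Oksana", "UKR", "1975-06-19")])
cur.executemany("INSERT INTO medals VALUES (?, ?, ?, ?)",
                [("Simone", 2016, "vault", "G"),
                 ("Oksana", 2016, "vault", "S"),
                 ("Simone", 2016, "floor", "G")])

# Join medalists to their medals and count medals per country.
rows = cur.execute("""
    SELECT medalist.country, COUNT(*)
    FROM medalist, medals
    WHERE medalist.name = medals.name
    GROUP BY medalist.country
    ORDER BY medalist.country""").fetchall()
print(rows)  # [('UKR', 1), ('USA', 2)]
```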
(3) [3 Pts] Which of the following queries returns the total number of medals broken down by type (gold, silver, and bronze) for each country in the 'vault' competition? (Select all that apply.)

SELECT medalists.country, medals.medaltype,
       COUNT(*) AS medal_count
FROM medals, medalists
WHERE medalists.name = medals.name
  AND medals.category = 'vault'
GROUP BY medalists.country, medals.medaltype

SELECT games.country, medals.medaltype,
       COUNT(medals.medaltype) AS medal_count
FROM medals, games
  AND games.year = medals.year
HAVING medals.category = 'vault'
GROUP BY games.country, medals.medaltype

SELECT medalists.country, medals.medaltype,
       COUNT(*) AS medal_count
FROM medals, medalists
WHERE medalists.name = medals.name
GROUP BY medalists.country, medals.medaltype, medals.category
HAVING category = 'vault'

FROM medals, games SELECT games.country, medals.medaltype,
       COUNT(medals.medaltype) AS medal_count
  AND games.year = medals.year
  AND medals.category = 'vault'
GROUP BY games.country, medals.medaltype
Summarizing the schema on the previous page for quick reference:
medalist(name, country, birthday);
games(year, city, country);
medals(name, year, category, medaltype);
(4) [5 Pts] What does the following query compute?

WITH
country_medal_count(country, count) AS (
    SELECT medalists.country, count(*)
    FROM medalists JOIN medals
      ON medalists.name = medals.name
    GROUP BY country
),
annual_medal_count(country, year, count) AS (
    SELECT medalists.country, medals.year, count(*)
    FROM medalists JOIN medals
      ON medalists.name = medals.name
    GROUP BY medalists.country, year
)
SELECT cmc.country, amc.year, amc.count / cmc.count
FROM country_medal_count AS cmc, annual_medal_count AS amc
WHERE cmc.country = amc.country
GROUP BY cmc.country
⃝ The average number of medals earned for each country in each year.
⃝ The conditional distribution of medals over the years given the country.
⃝ The conditional distribution of medals over countries given the year.
⃝ The joint distribution of medals over countries and years.
Bootstrap Confidence Intervals
10. [6 Pts] Consider the following diagram of the bootstrap process. Fill in the 9 blanks on the diagram using the phrases below:
(A) Population
(B) Bootstrap population
(C) Observed sample
(D) Expected sample
(E) Bootstrap sample
(F) Sampling distribution
(G) Sampling
(H) Bootstrapping
(I) Bootstrap sampling distribution
(J) Empirical distribution
(K) True distribution
(L) Population parameter
(M) Sample Statistic
(N) Bootstrap Statistic
[Diagram of the bootstrap process, starting from a population of 7.6 billion people, with numbered blanks (1)–(9); the symbols θ, θ̂, and θ* appear along the way.]

1. ____  2. ____  3. ____  4. ____  5. ____  6. ____  7. ____  8. ____  9. ____
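The resampling procedure the diagram illustrates can be sketched in a few lines; a generic percentile-bootstrap confidence interval for a mean, using made-up data and only the standard library:

```python
import random
from statistics import mean

random.seed(0)
sample = [8.1, 9.3, 7.7, 8.9, 8.4, 7.9, 9.1, 8.6, 8.0, 8.8]  # observed sample

# Resample with replacement from the observed sample many times,
# computing the statistic (here, the mean) on each bootstrap sample.
boot_stats = sorted(
    mean(random.choices(sample, k=len(sample))) for _ in range(10_000)
)

# The 2.5th and 97.5th percentiles of the bootstrap statistics
# give a 95% percentile-bootstrap confidence interval.
lo, hi = boot_stats[249], boot_stats[9749]
assert lo < mean(sample) < hi
```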
DS100 Final, Page 16 of 30 December 14th, 2017
11. A fast food chain collects a sample of n = 100 service times from their restaurants, and finds a sample average of θ̂ = 8.4 minutes and a sample standard deviation of 2 minutes. They wish to construct a confidence interval for the population mean service time, denoted by θ.
(1) [2 Pts] The 2.5th and 97.5th percentiles of the bootstrap distribution for the mean θ* (depicted below) are located at 7.7 and 9.1, respectively. Which of the following constitutes a valid 95% bootstrap confidence interval for θ?

[Histogram: bootstrap distribution of θ* (10,000 replicates), ranging roughly from 7.0 to 10.0.]
⃝ (8.4 − 1.96 · 2/10, 8.4 + 1.96 · 2/10)
⃝ (7.7, 9.1)
⃝ (7.7 − 1.96 · 2/10, 9.1 + 1.96 · 2/10)
⃝ (8.4 − 7.1, 8.4 + 9.1)
Explain your reasoning in the box below.
(2) [4 Pts] The 2.5th and 97.5th percentiles of the bootstrap distribution for the studentized mean

(θ* − θ̂) / SE(θ*)

(depicted below) are located at −2.2 and 1.9, respectively. Which of the following constitutes a valid 95% bootstrap confidence interval for θ? Recall that θ̂ = 8.4 and SE(θ̂) = 2/√100.
[Histogram: bootstrap distribution of (θ* − θ̂)/SE(θ*) (10,000 replicates), ranging roughly from −4 to 2.]

⃝ (8.4 − 1.9 · 2/√100, 8.4 + 2.2 · 2/√100)
⃝ (−1.9, 2.2)
⃝ (7.7 − 2.2 · 2/√100, 9.1 + 1.9 · 2/√100)
⃝ (8.4 − 2.2 · 2/√100, 8.4 + 1.9 · 2/√100)
Explain your reasoning in the box below.
Map Reduce, Spark, and Big Data
12. [2 Pts] From the following list, select all the statements that are true:
Schema on read means that the organization of data is determined when it is loaded into the data warehouse.
In a star schema the primary keys are stored in the fact table and the foreign keys are stored in the dimension table.
Data stored in a data lake will typically require more data cleaning than the data stored in the data warehouse.
None of the above statements are true.
13. Consider the following layout of the files A and B onto a distributed file-system of 6 machines.
[Diagram: the blocks of file A (A.1–A.4) and file B (B.1–B.4) laid out across Machines 1–6, with each block replicated on several machines.]
Assume that all blocks have the same file size and computation takes the same amount of time.
(1) [1 Pt] If we wanted to load file A in parallel which of the following sets of machines would give the best load performance:
⃝ {M1,M2} ⃝ {M1,M2,M3} ⃝ {M2,M4,M5,M6}
(2) [1 Pt] If we were to lose machines M 1, M 2, and M 3 which of the following file or files
would we lose (select all that apply).
File A File B We would still be able to load both files.
(3) [1 Pt] If each of the six machines fails with probability p, what is the probability that we will lose block B.1 of file B?
⃝ 3p ⃝ p3 ⃝ (1−p)3 ⃝ 1−p3
14. [4 Pts] Suppose you are given the following file raw.txt containing the income for a set of individuals:
State  Age  Income
VA     28   45000
CA     33   72000
VA     24   50000
CA     32   100000
TX     45   53000
ca     42   89000
ca     70   8000
TX     35   41000
TX     48   71000
VA     92   3000
What does the following query compute?
(sc.textFile("raw.txt")
 .map(lambda x: x.split())
 .filter(lambda x: x[0] != "State")
 .map(lambda x: [x[0].upper(), float(x[1]), float(x[2])])
 .filter(lambda x: x[1] <= 65.0)
 .map(lambda x: [x[0], np.array([1.0, x[2], x[2]**2])])
 .reduceByKey(lambda a, b: a + b)
 .mapValues(lambda x: np.sqrt(x[2]/x[0] - (x[1]/x[0])**2))
).collect()
⃝ The variance in income for each state.
⃝ The standard deviation in income for each state.
⃝ The standard deviation of the income for each state excluding individuals who are older than 65.
⃝ The standard deviation of the income excluding individuals who are older than 65.
15. [2 Pts] Select all of the following aggregation operations that will produce the same result
regardless of the ordering of the data.
⃝ lambda a, b: max(a, b) ⃝ lambda a, b: a + b
⃝ lambda a, b: a - b
⃝ lambda a, b: (a-b)**2
Bias Variance Trade-off and Regularized Loss Minimization
16. [1 Pt] Which of the following plots depicts models with the highest model variance?
(a) (b) (c)
⃝(a) ⃝(b) ⃝(c)
17. [3 Pts] Assume a regularization penalty of the form λR(θ), and complete the following illustration. Note that the x-axis is the regularization parameter λ and not the model complexity.
[Plot: two error curves, labeled (A) and (B), against the regularization parameter λ.]
⃝ (A) is the Test Error and (B) is the error due to (Bias)².
⃝ (A) is the error due to Model Variance and (B) is the Training Error.
⃝ (A) is the error due to Model Variance and (B) is the error due to (Bias)².
⃝ (A) is the error due to (Bias)² and (B) is the error due to Model Variance.
18. Suppose you are given a dataset {(x_i, y_i)}_{i=1}^{n} where x_i ∈ R is a one-dimensional feature and y_i ∈ R is a real-valued response. You use f_θ to model the data, where θ is the model parameter. You choose to use the following regularized loss:

L(θ) = (1/n) Σ_{i=1}^{n} (y_i − f_θ(x_i))² + λθ²    (1)

(1) [1 Pt] This regularized loss is best described as:
⃝ Average absolute loss with L2 regularization.
⃝ Average squared loss with L1 regularization.
⃝ Average squared loss with L2 regularization.
⃝ Average Huber loss with λ regularization.
(2) [6 Pts] Suppose you choose the model f_θ(x_i) = θx_i³. Using the above objective, derive and circle the loss-minimizing estimate for θ.
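For a concrete reading of the loss in Eq. (1) with the model f_θ(x) = θx³, here is a small sketch that just evaluates the objective on made-up data (it is an illustration of the objective, not the derivation the question asks for):

```python
def loss(theta, xs, ys, lam):
    """Average squared loss with an L2 penalty, as in Eq. (1)."""
    n = len(xs)
    f = lambda x: theta * x**3  # the model f_theta(x) = theta * x^3
    return sum((y - f(x))**2 for x, y in zip(xs, ys)) / n + lam * theta**2

# Made-up data that roughly follows y = x^3.
xs = [1.0, 2.0, 3.0]
ys = [1.1, 8.3, 26.5]

# The objective is smaller near theta = 1 than far from it.
assert loss(1.0, xs, ys, lam=0.1) < loss(5.0, xs, ys, lam=0.1)
```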
Least Squares Regression
19. Given a full-rank n × p design matrix X, and the corresponding response vector y ∈ Rn, the Least Squares estimator is
β̂ = (XᵀX)⁻¹Xᵀy.
Let e = y − ŷ denote the n × 1 vector of residuals, where ŷ = Xβ̂. (Illustrated below.)
Figure 2: Geometric interpretation of Least Squares, courtesy of Wikipedia.
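The definitions above can be checked numerically even in the simplest case of one feature plus an intercept; a pure-Python sketch with made-up data that fits y = a + b·x by least squares and then inspects the residuals:

```python
# Made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Closed-form simple linear regression: slope and intercept.
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar)**2 for x in xs)
a = ybar - b * xbar

yhat = [a + b * x for x in xs]           # fitted values
e = [y - yh for y, yh in zip(ys, yhat)]  # residuals

# The residual vector is orthogonal to the fitted values: e . yhat = 0.
dot = sum(ei * yhi for ei, yhi in zip(e, yhat))
assert abs(dot) < 1e-9
```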
(1) [1 Pt] There exists a set of weights β such that Σ_{i=1}^{n} (y_i − (Xβ)_i)² < Σ_{i=1}^{n} e_i².
⃝ True ⃝ False
(2) [1 Pt] We always have that e ⊥ yˆ (i.e. eTyˆ = 0).
⃝ True ⃝ False
(3) [1 Pt] For any set of weights β, we always have that e ⊥ X(βˆ − β). ⃝ True ⃝ False
Provide a short argument:
20. [2 Pts] When developing a model for a donkey's weight, we considered the following box plots of weight by age category.
[Box plots of weight by age category.]
This plot suggests:
⃝ age is not needed in the model
⃝ some of the age categories can be combined
⃝ age could be treated as a numeric variable
⃝ none of the above
21. [8 Pts] Suppose that we try to predict a donkey's weight y_i from its sex alone. (Recall that the sex variable has values: gelding, stallion, and female.) In class, we studied the following model consisting of dummy variables:

y_i = θ_F D_{F,i} + θ_G D_{G,i} + θ_S D_{S,i}

where the dummy variable D_{F,i} = 1 if the ith donkey is female and D_{F,i} = 0 otherwise. The dummy variables D_G and D_S are dummies for geldings and stallions, respectively. Prove that if we use the following loss function:

L(θ_F, θ_G, θ_S) = Σ_{i=1}^{n} (y_i − (θ_F D_{F,i} + θ_G D_{G,i} + θ_S D_{S,i}))²

then the loss-minimizing value is θ̂_F = ȳ_F, where ȳ_F is the average weight of the female donkeys.
Classification and Logistic Regression
22. Consider the following figures of different shapes plotted in a two dimensional feature space. Suppose we are interested in classifying the type of shape based on the location.
[Four scatter plots, labeled (A), (B), (C), and (D), of shapes in the (x1, x2) plane with axes crossing at 0.]
(1) [1 Pt] Which figure best illustrates substantial class imbalance? ⃝(A) ⃝(B) ⃝(C) ⃝(D)
(2) [1 Pt] Which figure is linearly separable? ⃝ (A) ⃝ (B) ⃝ (C) ⃝ (D)
(3) [1 Pt] Which figure corresponds to a multi-class classification problem? ⃝ (A) ⃝ (B) ⃝ (C) ⃝ (D)
(4) [3 Pts] Assume we apply the following feature transformation:

φ(x) = [I(x1 < 0), I(x2 > 0), 1.0]

where I(z) is the indicator which is 1.0 if the expression z is true and 0 otherwise. Which of the above plots is linearly separable in the transformed space (select all that apply)?
⃝ (A) ⃝ (B) ⃝ (C) ⃝ (D) ⃝ None of the plots.
23. Suppose you are given the following dataset {(x_i, y_i)}_{i=1}^{n} consisting of x and y pairs, where the covariate x_i ∈ R and the response y_i ∈ {0, 1}.
[Scatter plot: binary responses Y ∈ {0, 1} against X, for X ranging roughly from −3 to 3.]
(1) [1 Pt] Given this data, the value P (Y = 1 | x = 3) is likely closest to: ⃝ 0.95 ⃝ 0.50 ⃝ 0.05 ⃝ -0.95
(2) [2 Pts] Roughly sketch the predictions made by the logistic regression model for P(Y =1|X).
[Blank axes for the sketch: X from −3 to 3, vertical axis P(Y | X) from −0.5 to 1.5.]
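The curve one sketches here is the logistic (sigmoid) function applied to a linear score; a generic sketch (the coefficients below are made up, not fitted to this question's data):

```python
from math import exp

def sigmoid(t):
    """The logistic function maps any real score to (0, 1)."""
    return 1 / (1 + exp(-t))

def predict_proba(x, theta0=0.0, theta1=2.0):
    """P(Y = 1 | x) under a logistic regression model with made-up weights."""
    return sigmoid(theta0 + theta1 * x)

assert sigmoid(0) == 0.5             # the curve crosses 0.5 at score 0
assert 0 < predict_proba(-3) < 0.05  # far left: probability near 0
assert 0.95 < predict_proba(3) < 1   # far right: probability near 1
```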
24. Consider the following broken Python implementation of stochastic gradient descent.
1  def stochastic_grad_descent(
2          X, Y, theta0, grad_function,
3          max_iter=1000000, batch_size=100):
4      """
5      X: A 2D array, the feature matrix.
6      Y: A 1D array, the response vector.
7      theta0: A 1D array, the initial parameter vector.
8      grad_function: Maps a parameter vector, a feature matrix,
9          and a response vector to the gradient of some loss
10         function at the given parameter value.
11     returns the optimal theta
12     """
13     theta = theta0
14     for t in range(1, max_iter+1):
15         # draw a batch
16         (xbatch, ybatch) = (X[1:batch_size, :], Y[1:batch_size])
17         # compute the gradient on the batch
18         grad = grad_function(theta0, xbatch, ybatch)
19         # take a step
20         theta = theta - t * grad
21
22     return theta
(1) [4 Pts] Select all the issues with this Python implementation
Line 16 does not adequately sample all the data.
Line 18 should be evaluated at theta and not theta0.
Line 18 should take the negative of the gradient.
Line 20 should be evaluated at theta0 and not theta.
Line 20, t should be replaced with 1/t.
(2) [2 Pts] Suppose we wanted to add L2 regularization with parameter lam. Which of the following rewrites of Line 18 would achieve this goal:
⃝ grad = (grad_function(theta, xbatch, ybatch) + theta.dot(theta) * lam)
⃝ grad = (grad_function(theta, xbatch, ybatch) – theta.dot(theta) * lam)
⃝ grad = (grad_function(theta, xbatch, ybatch) + 2*theta*lam)
⃝ grad = (grad_function(theta, xbatch, ybatch) – 2*theta*lam)
P-Hacking
25. [2 Pts] An analysis of tweets the day after hurricane Sandy reported a surprising finding: that nightlife picked up the day after the storm. It was supposed that after several days of being stuck at home, cabin fever had struck. However, someone later pointed out that most tweets came from Manhattan and that those tweeting were not suffering from an extended blackout. The earlier study's conclusions are an example of:
⃝ Texas sharpshooter bias
⃝ sampling bias
⃝ confirmation bias
⃝ Simpson’s paradox
26. [2 Pts] Suppose that every one of the 275 students in Data 100 is administered a clairvoyance test as part of the final exam, and two of the students "pass" the test and are declared to be clairvoyant. What kind of mistake have the professors in Data 100 made in their testing:
⃝ post-hoc ergo propter-hoc
⃝ gambler's fallacy
⃝ early stopping
⃝ multiple testing
⃝ Simpson’s paradox
27. [2 Pts] The following plot illustrates a reversal in trends observed when conditioning a model on subgroups.
[Scatter plots of y against x: the overall trend and the trends within subgroups.]
This is an example of:
⃝ post-hoc ergo propter-hoc
⃝ sampling bias
⃝ selection bias
⃝ Simpson’s paradox
Feature Engineering, Over-fitting, and Cross Validation
28. [2 Pts] Select all statements that are true.
If there are two identical features in the data, the L2-regularization will force the coefficient of one redundant feature to be 0.
We cannot use linear regression to find the coefficients θ in y = θ₁x³ + θ₂x² + θ₃x + θ₄ since the relationship between y and x is non-linear.
Introducing more features increases the model complexity and may cause over-fitting.
None of the above statements are true.
29. [2 Pts] Bag-of-words encodings have the disadvantage that they drop semantic information associated with word ordering. Which of the following techniques is able to retain some of the semantic information in the word ordering? Select all that apply.
Remove all the stop words
Use N-gram features.
Give more weight to a word if it occurs multiple times in the document (similar to TF-IDF).
Create special features for common expressions or short phrases.
None of the above.
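For concreteness, N-gram features can be extracted in a few lines of Python (a sketch; the sentence is made up):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog bit the man".split()
# Bigrams keep local word order, which a plain bag of words discards:
# "the dog bit the man" and "the man bit the dog" have the same bag
# of words but different bigrams.
assert ngrams(tokens, 2) == [
    ("the", "dog"), ("dog", "bit"), ("bit", "the"), ("the", "man")
]
```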
30. Suppose you are fitting a model parameterized by θ using a regularized loss with regularization parameter λ. Indicate which error you should use to complete each of the following tasks.
(1) [1 Pt] To optimize θ you should use the:
⃝ Training Error ⃝ Cross-Validation Error ⃝ Test Error
(2) [1 Pt] To determine the best value for λ you should use the:
⃝ Training Error ⃝ Cross-Validation Error ⃝ Test Error
(3) [1 Pt] To evaluate the degree of polynomial features you should use the: ⃝ Training Error ⃝ Cross-Validation Error ⃝ Test Error
(4) [1 Pt] To evaluate the quality of your final model you should use the: ⃝ Training Error ⃝ Cross-Validation Error ⃝ Test Error
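The split behind a cross-validation error can be sketched in a few lines; a generic k-fold index partition, using only the standard library (one of several reasonable ways to form the folds):

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous validation folds."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 3)
# Each index appears in exactly one validation fold.
assert sorted(i for fold in folds for i in fold) == list(range(10))
assert [len(f) for f in folds] == [4, 3, 3]
```

Each fold is held out once as the validation set while the model is trained on the rest; the cross-validation error averages the k validation errors.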
End of Exam