Name: Email: Student ID:
@berkeley.edu
DS-100 Final Exam Fall 2017
Instructions:
• This final exam must be completed in the 3 hour time period ending at 11:00AM.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must shade in the box/circle. Check marks will likely be mis-graded.
• You may use a two page (two-sided) study guide.
• Work quickly through each question. There are a total of 127 points on this exam.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:

DS100 Final, Page 2 of 35 December 14th, 2017
Syntax Reference
Regular Expressions
“^” matches the position at the beginning of the string (unless used for negation, as in “[^ ]”).
“$” matches the position at the end of the string.
“?” matches the preceding literal or sub-expression 0 or 1 times. When following “+” or “*”, it results in non-greedy matching.
“+” matches the preceding literal or sub-expression one or more times.
“*” matches the preceding literal or sub-expression zero or more times.
“.” matches any character except a newline.
“[ ]” matches any one of the characters inside; accepts a range, e.g., “[a-c]”.
“( )” is used to create a sub-expression.
“\d” matches any digit character. “\D” is the complement.
“\w” matches any word character (letters, digits, underscore). “\W” is the complement.
“\s” matches any whitespace character, including tabs and newlines. “\S” is the complement.
“\b” matches the boundary between words.
XPath
An XPath expression is made up of location steps separated by forward slashes. Each location step has three parts: an axis, which gives the direction to look; a node test, which indicates the node name or text(); and an optional predicate to filter the matching nodes:
axis::node[predicate]
We have used shortcut names for the axis: “.” refers to self, “//” refers to self or descendants, “..” refers to the parent, and child is the default axis and can be dropped. The node of the XPath expression is either an element name, text() for text content, or @attribute for an attribute.
The predicate contains an expression that evaluates to true or false. Only those nodes that evaluate to true are kept. To check whether an attribute is present in a node, we use, e.g., [@time] (this evaluates to true if the node has a time attribute). Similarly, [foo] evaluates to true if the node has a child node named foo. The value of an attribute can be checked with, e.g., [@time = "2017"].

Variance and Expected Value Calculations
The expected value of X is
\[ E[X] = \sum_{j=1}^{m} x_j p_j \]
The variance of X is
\[ \mathrm{Var}[X] = \sum_{j=1}^{m} (x_j - E[X])^2 p_j = \sum_{j=1}^{m} x_j^2 p_j - E[X]^2 = E[X^2] - E[X]^2 \]
The standard deviation of X is $\mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$.
For $X_1, \ldots, X_n$,
\[ E[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 E[X_1] + \cdots + a_n E[X_n] \]
If the $X_i$ are independent, then
\[ \mathrm{Var}[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1^2 \mathrm{Var}[X_1] + \cdots + a_n^2 \mathrm{Var}[X_n] \]
In the special case where $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2$, $a_i = 1/n$, and the $X_i$ are independent, we have
\[ E[\bar{X}] = \mu, \qquad \mathrm{Var}[\bar{X}] = \sigma^2/n, \qquad \mathrm{SE}[\bar{X}] = \sigma/\sqrt{n} \]
PySpark
sc.textFile(filename) Creates an RDD from the filename with each line in the file as a separate record.
rdd.collect() Takes an rdd and returns a Python list.
rdd.filter(f) Applies the function f to each record in rdd and keeps all the records that evaluate to True.
rdd.map(f) Applies the function f to each record in rdd, producing a new RDD containing the outputs of f.
rdd.mapValues(f) Takes an rdd of key-value pairs (lists) and applies the function f to the value of each record, producing a new RDD that pairs each key with the output of f.
rdd.reduceByKey(f) Takes an rdd of key-value pairs (lists). It then groups the values by the key and applies the reduce function f to combine (e.g., sum) all the values, returning an RDD of [key, sum(values)] lists.
s.split() Splits a string on whitespace.
np.array(list) Constructs a vector from a list of elements.
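The map, reduceByKey, and mapValues operations listed above can be emulated in plain Python to see how they compose. The sketch below is not PySpark: `reduce_by_key` is a hypothetical helper written only for illustration, and the (state, income) records are made-up values. It accumulates a [count, sum, sum of squares] triple per key and then applies the identity Var[X] = E[X^2] - E[X]^2 from the reference above.

```python
import math

def reduce_by_key(pairs, f):
    """Hypothetical stand-in for rdd.reduceByKey: combine values sharing a key with f."""
    out = {}
    for key, value in pairs:
        out[key] = f(out[key], value) if key in out else value
    return out

# Illustrative (state, income) records; not taken from any real dataset.
records = [("VA", 45000.0), ("CA", 72000.0), ("VA", 50000.0), ("CA", 100000.0)]

# map: emit (key, [count, sum, sum of squares]) for each record
triples = [(state, [1.0, x, x ** 2]) for state, x in records]

# reduceByKey: add the triples element-wise within each key
sums = reduce_by_key(triples, lambda a, b: [u + v for u, v in zip(a, b)])

# mapValues: SD = sqrt(E[X^2] - E[X]^2) computed from the accumulated sums
sd_by_state = {k: math.sqrt(s[2] / s[0] - (s[1] / s[0]) ** 2) for k, s in sums.items()}
print(sd_by_state)
```

Because element-wise addition is associative and commutative, the per-key result does not depend on the order in which records are combined, which is what makes this pattern safe to run in parallel.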

Data Cleaning, Regular Expressions, and XPath
1. Consider the following text data describing purchases of financial products:
Id  Date      Product           Company
0   99/99/99  Debt collection   California Accounts Service
1   06/15/10  Credit reporting  EXPERIAN INFORMATION SOLUTIONS INC
3   10/21/14  MORTGAGE          OCWEN LOAN SERVICING LLC
5   03/30/15                    The CBE Group Inc
6   02/03/16  Debt collection   The CBE Group, Inc.
7   01/07/17  Credit reporting  Experian Information Solutions Inc.
8   03/15/17  Credit card       FIRST NATIONAL BANK OF OMAHA
(1) [2 Pts] Select all the true statements from the following list.
√ To analyze the companies we will need to correct for variation in capitalization and punctuation.
√ Some of the product values appear to be missing.
√ Some of the date values appear to be missing.
☐ The file is comma delimited.
√ The file is fixed width formatted.
☐ None of the above statements are true.
(2) [2 Pts] Select all of the following regular expressions that properly match the dates.
☐ \d?/\d?/\d?
√ \d+/\d+/\d+
√ \d*/\d*/\d*
√ \d\d/\d\d/\d\d
☐ None of the above regular expressions match.
(3) [2 Pts] Which of the following regular expressions exactly matches the entry FIRST NATIONAL BANK OF OMAHA? Select all that match.
☐ [A-Z]*
√ FIR[A-Z,\s]* OMAHA
√ F[A-Z, \s]+A
☐ F[A-Z]*
☐ None of the above regular expressions match.
Solution: The last one is a bit tricky. The expression F[A-Z, \s]* also matches the ‘F’ in “Student Loan Finance Corporation”.
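The answers to parts (2) and (3) can be checked with Python's `re` module. This is a minimal sketch that assumes "properly match" and "exactly match" both mean the pattern must cover the whole string, so it uses `re.fullmatch`; the sample date is one row from the table above.

```python
import re

# Part (2): one of the dates from the table.
date = "06/15/10"
date_patterns = [r"\d?/\d?/\d?", r"\d+/\d+/\d+", r"\d*/\d*/\d*", r"\d\d/\d\d/\d\d"]
date_results = {p: re.fullmatch(p, date) is not None for p in date_patterns}

# Part (3): the company entry named in the question.
company = "FIRST NATIONAL BANK OF OMAHA"
company_patterns = [r"[A-Z]*", r"FIR[A-Z,\s]* OMAHA", r"F[A-Z, \s]+A", r"F[A-Z]*"]
company_results = {p: re.fullmatch(p, company) is not None for p in company_patterns}

for pattern, ok in {**date_results, **company_results}.items():
    print(f"{pattern!r:28} {'matches' if ok else 'does not match'}")
```

The first date pattern fails because `\d?` consumes at most one digit per field, and the two company patterns without `\s` or a space in the character class fail because they cannot cross the spaces between words.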

2. Consider the following HTML document:

Hello!

my story is here and it’s silly.

Name	Instrument
Abe	violin
Amy	violin
Dan	viola
Cal	trumpet

Name	Instrument
Sally	bass
Terry	guitar
Cassie	drums
Tobie	piano

The End!

(1) [2 Pts] Which of the following XPath queries locates the p-elements in the document? Select all that apply.
√ //p
√ //table/../p
√ //body//p
√ ./body/p
(2) [2 Pts] What will be the result of the XPath query ./body/table/tr/td/a/text()?
○ www.yyy
○ Abe
√ [Abe,Cal,Terry]
○ [www.yyy,www.ccc,www.ter]
(3) [2 Pts] Which of the following XPath queries locates the names of all musicians in the second table (i.e., Sally, Cassie, and Tobie)? Select all that apply.
☐ //table[@id]//td/text()
☐ ./body/table[2]/text()
√ //table[@id="xyz"]/tr/td[1]/text()
☐ //tr/td[1]/text()
☐ None

(4) [3 Pts] Which of the following XPath queries locates the instruments of all musicians with Web pages? (A musician has a Web page if there is an a-tag associated with their name.) Select all that apply.
☐ //a/ancestor-or-self::table/tr/td[2]/text()
√ //td/a/../../td[2]/text()
☐ //tr/td[a]/text()
√ //table/tr/td[a]/../td[2]/text()
☐ None
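Two of the selected queries can be tried with Python's standard library. This is a sketch under stated assumptions: the markup below is a hypothetical stand-in built from the rendered tables and the answer choices (the exam's actual HTML source is not reproduced here), and `xml.etree.ElementTree` supports only a subset of XPath with no `text()` node test, so the text is read from each element's `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the document; tag layout and hrefs are assumptions.
doc = """
<html><body>
<h1>Hello!</h1>
<p>my story is here</p>
<table>
<tr><th>Name</th><th>Instrument</th></tr>
<tr><td><a href="www.yyy">Abe</a></td><td>violin</td></tr>
<tr><td>Amy</td><td>violin</td></tr>
<tr><td>Dan</td><td>viola</td></tr>
<tr><td><a href="www.ccc">Cal</a></td><td>trumpet</td></tr>
</table>
<table id="xyz">
<tr><th>Name</th><th>Instrument</th></tr>
<tr><td>Sally</td><td>bass</td></tr>
<tr><td><a href="www.ter">Terry</a></td><td>guitar</td></tr>
<tr><td>Cassie</td><td>drums</td></tr>
<tr><td>Tobie</td><td>piano</td></tr>
</table>
<p>The End!</p>
</body></html>
"""
root = ET.fromstring(doc)

# Analogue of //table[@id="xyz"]/tr/td[1]/text(): direct text of each first cell.
first_cells = root.findall(".//table[@id='xyz']/tr/td[1]")
names_second_table = [td.text for td in first_cells if td.text]

# Analogue of //td[a]/../td[2]/text(): instrument cell in rows whose name cell has an <a>.
instruments_with_pages = [td.text for td in root.findall(".//td[a]/../td[2]")]

print(names_second_table)
print(instruments_with_pages)
```

In this stand-in document Terry's name sits inside an `<a>` element, so the name cell has no direct text of its own; that is consistent with part (3) listing only Sally, Cassie, and Tobie.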

Visualization
3. [2 Pts] Which of the following transformations could help make linear the relationship shown in the plot below? Select all that apply:
[Figure: scatter plot of Y against X, with X ranging from 0 to 1000.]
☐ log(y)  ☐ x^2  √ sqrt(x)  √ log(x)  √ y^2  ☐ None
4. [2 Pts] Which graphing techniques can be used to address problems with over-plotting? Check all that apply.
☐ jiggling  √ transparency  √ smoothing  √ faceting
☐ banking to 45 degrees  √ contour plotting  ☐ linearizing

5. The following line plot compares the annual production-based and consumption-based carbon dioxide emissions (million tons) in Armenia.
(1) [2 Pts] This plot best conveys:
○ The relative increase in CO2 emissions since 1990.
√ The overall trend in CO2 emissions broken down by source.
○ The relative breakdown of CO2 emissions sources over time.
○ The cumulative CO2 emissions.
(2) [2 Pts] What kind of plot would facilitate the relative comparison of these two sources of emissions over time?
√ line plot of annual differences
○ scatter plot of production-based emissions against consumption-based emissions
○ stacked bar chart
○ side-by-side box plots

Sampling
1a 1b 1c 2a 2b 2c 3a 3b 3c 4a 4b 4c
6. Kalie wants to measure interest for a party on her street. She assigns numbers and letters to each house on her street as illustrated above. She picks a letter “a”, “b”, or “c” at random and then surveys every household on the street ending in that letter.
(1) [1 Pt] What kind of sample has Kalie collected?
○ Quota Sample  √ Cluster Sample  ○ Simple Random Sample  ○ Stratified Sample
Solution: Each group of houses, grouped by the last letter of the address, is a cluster.
(2) [1 Pt] What is the chance that two houses next door to each other are both in the sample?
○ 1/3  ○ 1/9  ○ 1/6  √ 0
Solution: None of the adjacent houses end in the same letter, so the chance is zero.
For the remaining parts of this question, suppose that 1/2 of the houses ending in “a” favor the party; 3/4 of the houses ending in “b” favor the party; and all of the houses ending in “c” favor the party. Hence, overall, p = 3/4 of the houses favor the party.
(3) [4 Pts] If Kalie estimates how favorable the party is using the proportion $\hat{p}$ of households in her survey favoring the party, what is the expected value of her estimator $E[\hat{p}]$? Show your work in the space below.
○ 1/2  ○ 2/3  √ 3/4  ○ 1
Solution: The expected value is an average, weighted by the probability of each value. Since each cluster is equally likely,
\[ E[\hat{p}] = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot 1 = \frac{3}{4} \]
(4) [6 Pts] If, as before, Kalie estimates how favorable the party is using the proportion $\hat{p}$ of households in her survey favoring the party, what is the variance of her estimator $\mathrm{Var}[\hat{p}]$? Show your work in the space below.
○ 1/9  ○ 2/27  ○ 1/6  √ 1/24
Solution:
\[ \mathrm{Var}[\hat{p}] = E[(\hat{p} - p)^2] = \frac{1}{3}\left(\frac{1}{16} + 0 + \frac{1}{16}\right) = \frac{1}{3 \cdot 8} = \frac{1}{24} \]
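The arithmetic in parts (3) and (4) can be reproduced exactly with Python's `fractions` module. This is a small sketch; the variable names are chosen here for illustration, and the three cluster proportions are the ones stated in the question.

```python
from fractions import Fraction

# Proportion favoring the party within each cluster, as stated in the question.
cluster_props = {"a": Fraction(1, 2), "b": Fraction(3, 4), "c": Fraction(1)}
prob = Fraction(1, len(cluster_props))  # each letter is equally likely to be picked

# E[p_hat]: probability-weighted average of the possible estimates.
expected = sum(prob * p for p in cluster_props.values())

# Var[p_hat]: probability-weighted average squared deviation from the mean.
variance = sum(prob * (p - expected) ** 2 for p in cluster_props.values())

print(expected, variance)
```

The estimator takes only three possible values, one per cluster, so both sums have exactly three terms.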

SQL
7. [2 Pts] From the following list, select all the statements that are true:
☐ A database is a system that stores data.
Solution: A database is a collection of data. A database management system (DBMS) is the system that manages access to the database.
√ SQL is a declarative language that specifies what to produce but not how to compute it.
Solution: SQL is a declarative programming language which specifies what the user wants to accomplish, allowing the system to determine how to accomplish it.
☐ To do large scale data analysis it is usually faster to extract all the data from the database and use Pandas to execute joins and compute aggregates.
Solution: Doing analysis directly in the database can often be much more efficient, as database management systems are designed to accelerate data access and aggregation over very large datasets.
☐ The schema of a table consists of the data stored in the table.
Solution: The schema of a table consists of the column names, their types, and any constraints on those columns. The instance of a database is the data stored in the database.
√ The primary key of a relation is the column or set of columns that determine the values of the remaining columns.
☐ None of the above statements are true.
8. [4 Pts] The following relational schema represents a large table describing Olympic medalists.
MedalAwards(year, athlete_name, medal,
            event, num_competitors,
            country, population, GDP)
If we allow athletes to compete for different countries in different years and in multiple events, which of the following normalized representations most reduces data redundancy while encoding the same information?
○ MedalAwards(year, athlete_name, medal, event)
  Athlete(year, athlete_name, country, event,
          num_competitors, population, GDP)
○ MedalAwards(year, athlete_name, medal, event)
  Athlete(year, athlete_name, country, event,
          num_competitors)
  CountryInfo(year, country, population, GDP)
√ MedalAwards(year, athlete_name, medal, event)
  Events(year, event, num_competitors)
  Athlete(year, athlete_name, country)
  CountryInfo(year, country, population, GDP)
○ MedalAwards(year, athlete_name, medal, event)
  Events(event, num_competitors)
  Athlete(athlete_name, country)
  CountryInfo(country, population, GDP)

9. For this question you will use the following database consisting of three tables:
CREATE TABLE medalist(
    name TEXT PRIMARY KEY,
    country TEXT,
    birthday DATE
);
CREATE TABLE games(
    year INT PRIMARY KEY,
    city TEXT,
    country TEXT
);
-- medaltype column takes three values:
-- 'G' for gold, 'S' for silver,
-- and 'B' for bronze
CREATE TABLE medals(
    name TEXT,
    year INT,
    category TEXT,
    medaltype CHAR,
    FOREIGN KEY (name) REFERENCES medalist(name),
    FOREIGN KEY (year) REFERENCES games(year)
);
(1) [1 Pt] Which of the following queries returns 5 rows from the medalist table (select all that apply):
☐ SELECT * FROM medalist WHERE LEN(*) < 5;
√ SELECT * FROM medalist LIMIT 5;
☐ SELECT * FROM medalist HAVING LEN(*) < 5;
☐ FROM medalist SELECT * WHERE COUNT(*) < 5;
(2) [1 Pt] Which of the following queries returns the names of all the German medalists (select all that apply):
√ SELECT name FROM medalist WHERE country = 'Germany';
☐ FROM medalist SELECT name WHERE country = 'Germany';
☐ SELECT name FROM medalist HAVING country == 'Germany';
☐ FROM medalist SELECT name HAVING country IS 'Germany';
Solution: The second query has the wrong order of FROM and SELECT; the third and fourth use HAVING, which is to be used only after a GROUP BY clause.
Summarizing the schema for quick reference:
medalist(name, country, birthday);
games(year, city, country);
medals(name, year, category, medaltype);
(3) [3 Pts] Which of the following queries returns the total number of medals broken down by type (gold, silver, and bronze) for each country in the ‘vault’ competition? (Select all that apply.)
√ SELECT medalists.country, medals.medaltype, COUNT(*) AS medal_count
  FROM medals, medalists
  WHERE medalists.name = medals.name
  AND medals.category = 'vault'
  GROUP BY medalists.country, medals.medaltype
☐ SELECT games.country, medals.medaltype,
      COUNT(medals.medaltype) AS medal_count
  FROM medals, games
  AND games.year = medals.year
  HAVING medals.category = 'vault'
  GROUP BY games.country, medals.medaltype
√ SELECT medalists.country, medals.medaltype, COUNT(*) AS medal_count
  FROM medals, medalists
  WHERE medalists.name = medals.name
  GROUP BY medalists.country, medals.medaltype, medals.category
  HAVING category = 'vault'
☐ FROM medals, games
  SELECT games.country, medals.medaltype,
      COUNT(medals.medaltype) AS medal_count
  AND games.year = medals.year
  AND medals.category = 'vault'
  GROUP BY games.country, medals.medaltype
Solution: Both the first and third queries will technically return the desired result (the first might be faster as the filtering is performed before grouping). The second query applies HAVING in the wrong place (before the GROUP BY clause). The last query has the wrong order of FROM and SELECT.
(4) [5 Pts] What does the following query compute?
WITH country_medal_count(country, count) AS (
    SELECT medalists.country, count(*)
    FROM medalists JOIN medals
    ON medalists.name = medals.name
    GROUP BY country
),
annual_medal_count(country, year, count) AS (
    SELECT medalists.country, medals.year, count(*)
    FROM medalists JOIN medals
    ON medalists.name = medals.name
    GROUP BY medalists.country, year
)
SELECT cmc.country, amc.year, amc.count / cmc.count
FROM country_medal_count AS cmc, annual_medal_count AS amc
WHERE cmc.country = amc.country
GROUP BY cmc.country
○ The average number of medals earned for each country in each year.
√ The conditional distribution of medals over the years given the country.
○ The conditional distribution of medals over countries given the year.
○ The joint distribution of medals over countries and years.
Bootstrap Confidence Intervals
10. [6 Pts] Consider the following diagram of the bootstrap process. Fill in 9 blanks on the diagram using the phrases below:
(A) Population
(B) Bootstrap population
(C) Observed sample
(D) Expected sample
(E) Bootstrap sample
(F) Sampling distribution
(G) Sampling
(H) Bootstrapping
(I) Bootstrap sampling distribution
(J) Empirical distribution
(K) True distribution
(L) Population parameter
(M) Sample Statistic
(N) Bootstrap Statistic
[Figure: diagram of the bootstrap process with blanks numbered (1) through (9); the population is labeled “7.6 Billion People”, and the quantities $\theta$, $\hat{\theta}$, and $\hat{\theta}^*$ appear in the diagram.]
1. (A)  2. (L)  3. (G)  4. (C)  5. (M)  6. (H)  7. (E)  8. (N)  9. (I)
11. A fast food chain collects a sample of n = 100 service times from their restaurants, and finds a sample average of $\hat{\theta} = 8.4$ minutes and a sample standard deviation of 2 minutes. They wish to construct a confidence interval for the population mean service time, denoted by $\theta$.
(1) [2 Pts] The 2.5th and 97.5th percentiles of the bootstrap distribution for the mean $\hat{\theta}^*$ below are located at 7.7 and 9.1, respectively.
Which of the following constitutes a valid 95% bootstrap confidence interval for $\theta$?
[Figure: histogram titled “Bootstrap Distribution of $\hat{\theta}^*$ (10,000 Replicates)”.]
○ $(8.4 - 1.96 \cdot \frac{2}{10},\; 8.4 + 1.96 \cdot \frac{2}{10})$
√ $(7.7,\; 9.1)$
○ $(7.7 - 1.96 \cdot \frac{2}{10},\; 9.1 + 1.96 \cdot \frac{2}{10})$
○ $(8.4 - 7.1,\; 8.4 + 9.1)$
Explain your reasoning in the box below.
Solution: Use the percentiles of the bootstrap distribution directly: (7.7, 9.1)
(2) [4 Pts] The 2.5th and 97.5th percentiles of the bootstrap distribution for the studentized mean
\[ \frac{\hat{\theta}^* - \hat{\theta}}{\widehat{\mathrm{SE}}(\hat{\theta}^*)} \]
(depicted below) are located at −2.2 and 1.9, respectively. Which of the following constitutes a valid 95% bootstrap confidence interval for $\theta$? Recall that $\hat{\theta} = 8.4$ and $\widehat{\mathrm{SE}}(\hat{\theta}) = \frac{2}{\sqrt{100}}$.
[Figure: histogram titled “Bootstrap Distribution of $(\hat{\theta}^* - \hat{\theta})/\mathrm{se}(\hat{\theta}^*)$ (10,000 Replicates)”.]
√ $(8.4 - 1.9 \cdot \frac{2}{\sqrt{100}},\; 8.4 + 2.2 \cdot \frac{2}{\sqrt{100}})$
○ $(-1.9,\; 2.2)$
○ $(7.7 - 2.2 \cdot \frac{2}{\sqrt{100}},\; 9.1 + 1.9 \cdot \frac{2}{\sqrt{100}})$
○ $(8.4 - 2.2 \cdot \frac{2}{\sqrt{100}},\; 8.4 + 1.9 \cdot \frac{2}{\sqrt{100}})$
Explain your reasoning in the box below.
Solution: Let $q^*_{.025}$ and $q^*_{.975}$ denote the quantiles of the bootstrap distribution for the studentized mean. Then using $q^*_{.025}$ and $q^*_{.975}$ as estimates of the quantiles of $\frac{\hat{\theta} - \theta}{\mathrm{SE}[\hat{\theta}]}$, we have
\[ 0.95 \approx P\left( q^*_{.025} \le \frac{\hat{\theta} - \theta}{\mathrm{se}(\hat{\theta})} \le q^*_{.975} \right) = P\left( \hat{\theta} - q^*_{.975}\,\mathrm{se}(\hat{\theta}) \le \theta \le \hat{\theta} - q^*_{.025}\,\mathrm{se}(\hat{\theta}) \right) \]
Hence the appropriate interval is
\[ \left( \hat{\theta} - q^*_{.975}\,\mathrm{se}(\hat{\theta}),\; \hat{\theta} - q^*_{.025}\,\mathrm{se}(\hat{\theta}) \right) = \left( 8.4 - 1.9 \cdot \frac{2}{\sqrt{100}},\; 8.4 + 2.2 \cdot \frac{2}{\sqrt{100}} \right) = (8.02,\; 8.84). \]
Map Reduce, Spark, and Big Data
12. [2 Pts] From the following list, select all the statements that are true:
☐ Schema on read means that the organization of data is determined when it is loaded into the data warehouse.
Solution: In schema on load, data is organized as it is loaded into the data warehouse. In schema on read, data is organized as it is read during data analysis.
☐ In a star schema the primary keys are stored in the fact table and the foreign keys are stored in the dimension table.
Solution: In a star schema the fact table contains a foreign key reference to each of the dimension tables.
√ Data stored in a data lake will typically require more data cleaning than the data stored in the data warehouse.
Solution: Data in the data warehouse is typically cleaned during the ETL process, while data in the data lake is captured in its raw form and may require substantial data cleaning and transformation.
☐ None of the above statements are true.
13. Consider the following layout of the files A and B onto a distributed file-system of 6 machines.
[Figure: layout of the blocks of File A (A.1, A.2, A.3, A.4) and File B (B.1, B.2, B.3, B.4) across Machine 1 through Machine 6; each block is stored on three machines.]
Assume that all blocks have the same file size and computation takes the same amount of time.
(1) [1 Pt] If we wanted to load file A in parallel, which of the following sets of machines would give the best load performance:
○ {M1,M2}  ○ {M1,M2,M3}  √ {M2,M4,M5,M6}
Solution: While all choices would be able to load the file, only {M2,M4,M5,M6} could load the file in parallel.
(2) [1 Pt] If we were to lose machines M1, M2, and M3, which of the following file or files would we lose (select all that apply).
☐ File A  ☐ File B  √ We would still be able to load both files.
(3) [1 Pt] If each of the six machines fails with probability p, what is the probability that we will lose block B.1 of file B?
○ $3p$  √ $p^3$  ○ $(1-p)^3$  ○ $1 - p^3$
14. [4 Pts] Suppose you are given the following raw.txt containing the income for a set of individuals:
State  Age  Income
VA     28   45000
CA     33   72000
VA     24   50000
CA     32   100000
TX     45   53000
ca     42   89000
ca     70   8000
TX     35   41000
TX     48   71000
VA     92   3000
What does the following query compute?
(sc.textFile("raw.txt")
    .map(lambda x: x.split())
    .filter(lambda x: x[0] != "State")
    .map(lambda x: [x[0].upper(), float(x[1]), float(x[2])])
    .filter(lambda x: x[1] <= 65.0)
    .map(lambda x: [x[0], np.array([1.0, x[2], x[2]**2])] )
    .reduceByKey(lambda a, b: a + b)
    .mapValues(lambda x: np.sqrt(x[2]/x[0] - (x[1]/x[0])**2))
).collect()
○ The variance in income for each state.
○ The standard deviation in income for each state.
√ The standard deviation of the income for each state excluding individuals who are older than 65.0
○ The standard deviation of the income excluding individuals who are older than 65.
15. [2 Pts] Select all of the following aggregation operations that will produce the same result regardless of the ordering of the data.
√ lambda a, b: max(a, b)
√ lambda a, b: a + b
☐ lambda a, b: a - b
☐ lambda a, b: (a-b)**2
Bias Variance Trade-off and Regularized Loss Minimization
16. [1 Pt] Which of the following plots depicts models with the highest model variance?
[Figure: three plots (a), (b), and (c) of y against x.]
○ (a)  ○ (b)  √ (c)
17. [3 Pts] Assuming a regularization penalty of the form $\lambda R(\theta)$, complete the following illustration. Note that the x-axis is the regularization parameter $\lambda$ and not the model complexity.
[Figure: Error plotted against the regularization parameter $\lambda$, with curves labeled (A) and (B).]
○ (A) is the Test Error and (B) is the error due to (Bias)^2.
○ (A) is the error due to Model Variance and (B) is the Training Error.
○ (A) is the error due to Model Variance and (B) is the error due to (Bias)^2.
√ (A) is the error due to (Bias)^2 and (B) is the error due to Model Variance.
18. Suppose you are given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}$ is a one dimensional feature and $y_i \in \mathbb{R}$ is a real-valued response. You use $f_\theta$ to model the data where $\theta$ is the model parameter.
You choose to use the following regularized loss:
\[ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 + \lambda \theta^2 \tag{1} \]
(1) [1 Pt] This regularized loss is best described as:
○ Average absolute loss with L2 regularization.
○ Average squared loss with L1 regularization.
√ Average squared loss with L2 regularization.
○ Average Huber loss with λ regularization.
(2) [6 Pts] Suppose you choose the model $f_\theta(x_i) = \theta x_i^3$. Using the above objective, derive and circle the loss minimizing estimate for $\theta$.
Solution:
Step 1: Take the derivative of the loss function.
\[ \frac{\partial}{\partial\theta} L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \left( y_i - \theta x_i^3 \right)^2 + \frac{\partial}{\partial\theta} \lambda \theta^2 \tag{2} \]
\[ = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \theta x_i^3 \right) x_i^3 + 2\lambda\theta \tag{3} \]
Step 2: Set the derivative equal to zero and solve for $\theta$.
\[ 0 = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \theta x_i^3 \right) x_i^3 + 2\lambda\theta \tag{4} \]
\[ \theta = \frac{1}{n\lambda} \sum_{i=1}^{n} \left( y_i - \theta x_i^3 \right) x_i^3 \tag{5} \]
\[ \theta = \frac{1}{n\lambda} \sum_{i=1}^{n} y_i x_i^3 - \theta \frac{1}{n\lambda} \sum_{i=1}^{n} x_i^6 \tag{6} \]
\[ \theta \left( 1 + \frac{1}{n\lambda} \sum_{i=1}^{n} x_i^6 \right) = \frac{1}{n\lambda} \sum_{i=1}^{n} y_i x_i^3 \tag{7} \]
\[ \theta = \frac{\frac{1}{n\lambda} \sum_{i=1}^{n} y_i x_i^3}{1 + \frac{1}{n\lambda} \sum_{i=1}^{n} x_i^6} \tag{8} \]
Thus we obtain the final answer:
\[ \hat{\theta} = \frac{\frac{1}{n} \sum_{i=1}^{n} y_i x_i^3}{\lambda + \frac{1}{n} \sum_{i=1}^{n} x_i^6} \tag{9} \]
Least Squares Regression
19. Given a full-rank $n \times p$ design matrix $X$, and the corresponding response vector $y \in \mathbb{R}^n$, the Least Squares estimator is $\hat{\beta} = (X^T X)^{-1} X^T y$. Let $e = y - \hat{y}$ denote the $n \times 1$ vector of residuals, where $\hat{y} = X\hat{\beta}$. (Illustrated below)
Figure 2: Geometric interpretation of Least Squares, courtesy of Wikipedia.
(1) [1 Pt] There exists a set of weights $\beta$ such that $\sum_{i=1}^{n} (y_i - (X\beta)_i)^2 < \sum_{i=1}^{n} (e_i)^2$.
○ True  √ False
Solution: The least squares estimator $\hat{\beta}$ minimizes the RSS, so for every $\beta$,
\[ \sum_{i=1}^{n} (y - X\hat{\beta})_i^2 \le \sum_{i=1}^{n} (y - X\beta)_i^2 \]
(2) [1 Pt] We always have that $e \perp \hat{y}$ (i.e. $e^T \hat{y} = 0$).
√ True  ○ False
(3) [1 Pt] For any set of weights $\beta$, we always have that $e \perp X(\hat{\beta} - \beta)$.
√ True  ○ False
Provide a short argument:
Solution:
• The residual is orthogonal to the column space of X.
• It is true by the picture.
20. [2 Pts] When developing a model for a donkey's weight, we considered the following box plots of weight by age category.
This plot suggests:
○ age is not needed in the model
√ some of the age categories can be combined
○ age could be treated as a numeric variable
○ none of the above
21. [8 Pts] Suppose that we try to predict a donkey's weight, $y_i$, from its sex alone. (Recall that the sex variable has values: gelding, stallion, and female). In class, we studied the following model consisting of dummy variables:
\[ y_i = \theta_F D_{F,i} + \theta_G D_{G,i} + \theta_S D_{S,i} \]
where the dummy variable $D_{F,i} = 1$ if the $i$th donkey is female and $D_{F,i} = 0$ otherwise. The dummy variables $D_G$ and $D_S$ are dummies for geldings and stallions, respectively.
Prove that if we use the following loss function:
\[ L(\theta_F, \theta_G, \theta_S) = \sum_{i=1}^{n} \left( y_i - (\theta_F D_{F,i} + \theta_G D_{G,i} + \theta_S D_{S,i}) \right)^2 \]
then the loss minimizing value $\hat{\theta}_F = \bar{y}_F$, where $\bar{y}_F$ is the average weight of the female donkeys.
Solution: The summation that we are minimizing can be split into three separate sums because only one of the dummy variables is 1 for any observation. That is, when $D_{F,i} = 1$ then $D_{G,i} = 0$ and $D_{S,i} = 0$.
\[ \min_{\theta_F, \theta_G, \theta_S} \sum_{i=1}^{n} \left( y_i - (\theta_F D_{F,i} + \theta_G D_{G,i} + \theta_S D_{S,i}) \right)^2 = \min_{\theta_F, \theta_G, \theta_S} \left[ \sum_{i \in F} (y_i - \theta_F)^2 + \sum_{i \in G} (y_i - \theta_G)^2 + \sum_{i \in S} (y_i - \theta_S)^2 \right] \]
This implies that we can minimize over $\theta_F$ separately, i.e.,
\[ \min_{\theta_F} \sum_{i \in F} (y_i - \theta_F)^2 \]
We can differentiate with respect to $\theta_F$ to get
\[ \sum_{i \in F} -2 (y_i - \theta_F) \]
Set this to 0 and solve for $\theta_F$:
\[ \hat{\theta}_F = \frac{1}{\#F} \sum_{i \in F} y_i = \bar{y}_F \]
Classification and Logistic Regression
22. Consider the following figures of different shapes plotted in a two dimensional feature space. Suppose we are interested in classifying the type of shape based on the location.
[Figure: four scatter plots (A), (B), (C), and (D) of shapes in the $(x_1, x_2)$ feature space.]
(1) [1 Pt] Which figure best illustrates substantial class imbalance?
○ (A)  ○ (B)  ○ (C)  √ (D)
(2) [1 Pt] Which figure is linearly separable?
○ (A)  √ (B)  ○ (C)  ○ (D)
(3) [1 Pt] Which figure corresponds to a multi-class classification problem?
○ (A)  ○ (B)  √ (C)  ○ (D)
(4) [3 Pts] Assuming we applied the following feature transformation:
φ(x) = [I(x1 < 0), I(x2 > 0), 1.0]
where I(z) is the indicator which is 1.0 if the expression z is true and 0 otherwise. Which of the above plots is linearly separable in the transformed space (select all that apply).
○ (A)  ○ (B)  ○ (C)  √ (D)  ○ None of the plots.
Solution: This question is a bit tricky. The feature transformation maps each quadrant to the feature values in the following picture (bias term not included):

            x2
   (1,1)    |    (0,1)
  ----------+---------- x1
   (1,0)    |    (0,0)
We see that in this case (D) is clearly linearly separable. While (C) is almost linearly separable, there is a triangle in the 1st quadrant that would not be separable from the crosses.
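The quadrant picture above can be reproduced by applying the feature transformation to one point from each quadrant. This is a minimal sketch; the function name `phi` and the representative points are chosen here for illustration.

```python
def phi(x1, x2):
    """Feature map [I(x1 < 0), I(x2 > 0), 1.0] from the question."""
    return [float(x1 < 0), float(x2 > 0), 1.0]

# One representative point per quadrant (arbitrary points, for illustration only).
quadrants = {"Q1": (1, 1), "Q2": (-1, 1), "Q3": (-1, -1), "Q4": (1, -1)}

# Drop the constant bias feature to compare with the picture.
mapped = {name: phi(*point)[:2] for name, point in quadrants.items()}
print(mapped)
```

Every point in a quadrant collapses to the same transformed point, so a plot is separable in the transformed space only if no quadrant mixes classes and the resulting corner labels can be split by a line.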

23. Suppose you are given the following dataset $\{(x_i, y_i)\}_{i=1}^{n}$ consisting of x and y pairs where the covariate $x_i \in \mathbb{R}$ and the response $y_i \in \{0, 1\}$.
[Figure: scatter plot of Y against X, with X ranging from −3 to 3.]
(1) [1 Pt] Given this data, the value P(Y = 1 | x = 3) is likely closest to:
○ 0.95  ○ 0.50  √ 0.05  ○ -0.95
(2) [2 Pts] Roughly sketch the predictions made by the logistic regression model for P(Y = 1 | X).
[Figure: blank axes for the sketch, P(Y|X) against X.]
Solution:
[Figure: solution sketch of P(Y|X) against X.]

24. Consider the following broken Python implementation of stochastic gradient descent.
 1  def stochastic_grad_descent(
 2      X, Y, theta0, grad_function,
 3      max_iter = 1000000, batch_size=100):
 4      """
 5      X: A 2D array, the feature matrix.
 6      Y: A 1D array, the response vector.
 7      theta0: A 1D array, the initial parameter vector.
 8      grad_function: Maps a parameter vector, a feature matrix,
 9          and a response vector to the gradient of some loss
10          function at the given parameter value.
11      returns the optimal theta
12      """
13      theta = theta0
14      for t in range(1, max_iter+1):
15
16          (xbatch, ybatch) = (X[1:batch_size, :], Y[1:batch_size])
17
18          grad = grad_function(theta0, xbatch, ybatch)
19
20          theta = theta - t * grad
21
22      return theta
(1) [4 Pts] Select all the issues with this Python implementation
√ Line 16 does not adequately sample all the data.
√ Line 18 should be evaluated at theta and not theta0.
☐ Line 18 should take the negative of the gradient.
☐ Line 20 should be evaluated at theta0 and not theta.
√ Line 20, t should be replaced with 1/t.
(2) [2 Pts] Suppose we wanted to add L2 regularization with parameter lam. Which of the following rewrites of Line 18 would achieve this goal:
○ grad = (grad_function(theta, xbatch, ybatch) + theta.dot(theta) * lam)
○ grad = (grad_function(theta, xbatch, ybatch) - theta.dot(theta) * lam)
√ grad = (grad_function(theta, xbatch, ybatch) + 2*theta*lam)
○ grad = (grad_function(theta, xbatch, ybatch) - 2*theta*lam)
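One way to apply the fixes identified above is sketched below. This is not the course's reference implementation: it uses plain Python lists instead of NumPy arrays to stay dependency-free, and the random batch sampling, the `seed` argument, the helper `squared_loss_grad`, and the toy data are all assumptions made for illustration. It draws a fresh random batch each step, evaluates the gradient at the current theta, adds the L2 term 2*lam*theta, and uses a 1/t step size.

```python
import random

def sgd(X, Y, theta0, grad_function, max_iter=2000, batch_size=10, lam=0.0, seed=0):
    """SGD sketch: random batches, gradient at current theta, 1/t step size."""
    rng = random.Random(seed)
    n = len(X)
    theta = list(theta0)
    for t in range(1, max_iter + 1):
        # Fix for line 16: sample a new batch of row indices every iteration.
        idx = rng.sample(range(n), min(batch_size, n))
        xbatch = [X[i] for i in idx]
        ybatch = [Y[i] for i in idx]
        # Fix for line 18: evaluate at theta (not theta0); add the L2 gradient.
        grad = grad_function(theta, xbatch, ybatch)
        grad = [g + 2 * lam * th for g, th in zip(grad, theta)]
        # Fix for line 20: decaying step size 1/t instead of t.
        theta = [th - (1.0 / t) * g for th, g in zip(theta, grad)]
    return theta

def squared_loss_grad(theta, xbatch, ybatch):
    """Gradient of the average squared loss for the linear model y ~ x . theta."""
    grad = [0.0] * len(theta)
    for x, y in zip(xbatch, ybatch):
        resid = y - sum(xj * tj for xj, tj in zip(x, theta))
        for j in range(len(theta)):
            grad[j] += -2.0 * resid * x[j] / len(xbatch)
    return grad

# Toy data (an assumption for this sketch): y = 2x exactly, one feature.
X = [[i / 10] for i in range(-10, 11)]
Y = [2.0 * x[0] for x in X]
theta_hat = sgd(X, Y, [0.0], squared_loss_grad)
print(theta_hat)
```

With lam left at 0 the estimate should approach the true slope of 2; a positive lam pulls the estimate toward 0.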

P-Hacking
25. [2 Pts] An analysis of tweets the day after hurricane Sandy reported a surprising finding: nightlife picked up the day after the storm. It was supposed that after several days of being stuck at home, cabin fever struck. However, someone later pointed out that most tweets were from Manhattan and that those tweeting were not suffering from an extended blackout. The earlier study's conclusions are an example of:
○ Texas sharpshooter bias  √ sampling bias
○ confirmation bias  ○ Simpson's paradox
26. [2 Pts] Suppose that every one of the 275 students in Data 100 is administered a clairvoyance test as part of the final exam, and two of the students “pass” the test and are declared to be clairvoyant. What kind of mistake have the professors in Data 100 made in their testing:
○ post hoc ergo propter hoc  ○ gambler's fallacy
○ early stopping
√ multiple testing  ○ Simpson's paradox
27. [2 Pts] The following plot illustrates a reversal in trends observed when conditioning a model on subgroups.
[Figure: two scatter plots of y against x, with both axes ranging from −10 to 10.]
This is an example of:
○ post hoc ergo propter hoc  ○ sampling bias
○ selection bias
√ Simpson's paradox

Feature Engineering, Over-fitting, and Cross Validation
28. [2 Pts] Select all statements that are true.
☐ If there are two identical features in the data, the L2-regularization will force the coefficient of one redundant feature to be 0.
☐ We cannot use linear regression to find the coefficients for θ in y = θ1 x^3 + θ2 x^2 + θ3 x + θ4 since the relationship between y and x is non-linear.
√ Introducing more features increases the model complexity and may cause over-fitting.
☐ None of the above statements are true.
29. [2 Pts] Bag-of-words encodings have the disadvantage that they drop semantic information associated with word ordering. Which of the following techniques is able to retain some of the semantic information in the word ordering? Select all that apply.
☐ Remove all the stop words.
√ Use N-gram features.
☐ Give more weight if one word occurs multiple times in the document. (Similar to TF-IDF)
√ Create special features for common expressions or short phrases.
☐ None of the above.
30. Suppose you are fitting a model parameterized by θ using a regularized loss with regularization parameter λ. Indicate which error you should use to complete each of the following tasks.
(1) [1 Pt] To optimize θ you should use the:
√ Training Error  ○ Cross-Validation Error  ○ Test Error
(2) [1 Pt] To determine the best value for λ you should use the:
○ Training Error  √ Cross-Validation Error  ○ Test Error
(3) [1 Pt] To evaluate the degree of polynomial features you should use the:
○ Training Error  √ Cross-Validation Error  ○ Test Error
(4) [1 Pt] To evaluate the quality of your final model you should use the:
○ Training Error  ○ Cross-Validation Error  √ Test Error
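The division of labor in Question 30 can be illustrated with the one-parameter model from Question 18, f(x) = θx^3, whose regularized minimizer has the closed form derived there. The sketch below is only an illustration under stated assumptions: the helper names, the toy data, the train/test split, the interleaved 5-fold scheme, and the candidate λ grid are all made up here. θ is fit on training data, λ is chosen by cross-validation error, and the held-out test set is touched once at the end.

```python
def fit_theta(xs, ys, lam):
    """Closed-form minimizer of (1/n) sum (y - theta x^3)^2 + lam theta^2."""
    n = len(xs)
    num = sum(y * x ** 3 for x, y in zip(xs, ys)) / n
    den = lam + sum(x ** 6 for x in xs) / n
    return num / den

def mse(xs, ys, theta):
    """Average squared error of the model theta * x^3 (no penalty term)."""
    return sum((y - theta * x ** 3) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_error(xs, ys, lam, k=5):
    """k-fold cross-validation error: fit on k-1 folds, score on the held-out fold."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        trn = [i for i in range(n) if i % k != fold]
        theta = fit_theta([xs[i] for i in trn], [ys[i] for i in trn], lam)
        total += mse([xs[i] for i in val], [ys[i] for i in val], theta)
    return total / k

# Toy data (an assumption): y = 2 x^3 plus a small deterministic perturbation.
xs = [i / 10 for i in range(-10, 11)]
ys = [2 * x ** 3 + 0.05 * (-1) ** i for i, x in enumerate(xs)]

# Hold out every fourth point as the test set; the rest is training data.
test_idx = [i for i in range(len(xs)) if i % 4 == 0]
train_idx = [i for i in range(len(xs)) if i % 4 != 0]
train_x, train_y = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
test_x, test_y = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

lambdas = [0.0, 0.01, 0.1, 1.0]
best_lam = min(lambdas, key=lambda lam: cv_error(train_x, train_y, lam))  # CV picks lambda
theta_final = fit_theta(train_x, train_y, best_lam)                       # training fits theta
test_error = mse(test_x, test_y, theta_final)                             # test scores the final model
print(best_lam, theta_final, test_error)
```

Keeping the test set out of both the fit and the λ search is what lets the final number serve as an honest estimate of performance on new data.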

End of Exam
