Data 100, Final Summer 2021
Student ID:
Exam Time:
All work on this exam is my own (please sign):
@berkeley.edu
Honor Code [1 Pt]
1. As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I will not communicate with any other individual during the exam, current student or otherwise. All work on this exam is my own.
(a) Please confirm your agreement with the above statement by writing your name in the space below.
___________________________________________________
Probability Potpourri [8 Pts]
2. Suppose X is a discrete, positively valued random variable. The following graph describes the probability distribution of X²; the graph (not reproduced here) places probability 0.15 on X² = 4, 0.25 on X² = 9, 0.4 on X² = 25, and 0.2 on X² = 49.
(a) [2 Pts] What is the expected value of X? Round your answer to two decimal places.
Solution: 4.45. First note that the question states that X is positively valued. Therefore, the possible values of X are the positive square roots of the values plotted on the graph, namely 2, 3, 5, and 7, with the same probabilities as shown on the graph (0.15, 0.25, 0.4, 0.2). Thus,
E(X) = Σ_k k · P(X = k) = 2(0.15) + 3(0.25) + 5(0.4) + 7(0.2) = 4.45
(b) [2 Pts] Following your answer to the previous question, what is the variance of X? Round your answer to two decimal places.
Solution: 2.85.
Var(X) = E(X²) − E(X)²
We can calculate E(X²) using the plot above:
E(X²) = Σ_k k² · P(X = k) = 4(0.15) + 9(0.25) + 25(0.4) + 49(0.2) = 22.65
From the previous part, E(X)² = 4.45². Thus, Var(X) = 22.65 − 4.45² ≈ 2.85.
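A quick numeric check of these two answers, using the values and probabilities read off the graph (NumPy used only for convenience):

import numpy as np

# Values of X (positive square roots of 4, 9, 25, 49) and their probabilities from the graph.
x = np.array([2, 3, 5, 7])
p = np.array([0.15, 0.25, 0.40, 0.20])

e_x = np.sum(x * p)        # E(X)   = 4.45
e_x2 = np.sum(x**2 * p)    # E(X^2) = 22.65
var_x = e_x2 - e_x**2      # Var(X) ≈ 2.85

print(e_x, var_x)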
3. Oh no! Our friend Kanu has decided to take the Data 100 final without studying at all. He believes he can pass the course by simply guessing uniformly at random on every question. Assume Kanu needs a 10% on the final to pass. The test consists of 20 MCQ questions and 4 FRQ questions. The grading scheme is as follows:
MCQ:
– 5 points are awarded for each correct answer.
– −1/3 points are awarded for each incorrect answer.
– 0 points are awarded for each blank answer.
FRQ:
– 10 points are awarded for each correct answer.
– −1/3 points are awarded for each incorrect answer.
– 0 points are awarded for each blank answer.
There are 140 points available, so Kanu needs at least a 14 to pass.
(a) [4 Pts] Each MCQ question has 4 possible answers, one of which is correct. Each FRQ question has 10 possible answers, one of which is correct. On average, which of the following test-taking strategies will help Kanu pass the class? Select all that apply.
Guess randomly on all MCQ and FRQ.
Guess randomly on all MCQ and leave the FRQ blank.
Guess randomly on all FRQ and leave the MCQ blank.
Guess randomly on all MCQ and 1/2 of the FRQ. Leave the other 1/2 of the FRQ blank.
Guess randomly on 3/4 of the MCQ and all the FRQ. Leave the other 1/4 of the MCQ blank.
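For intuition only, here is a rough Python sketch of the expected score under each strategy. It assumes the reconstructed penalty of −1/3 points per incorrect answer and the 14-point passing threshold stated above, so treat the numbers as illustrative rather than as an answer key:

# Expected points from a single random guess, assuming a -1/3 penalty per wrong answer.
p_mcq, p_frq = 1 / 4, 1 / 10                   # chance of guessing correctly
ev_mcq = 5 * p_mcq + (-1 / 3) * (1 - p_mcq)    # expected points per guessed MCQ
ev_frq = 10 * p_frq + (-1 / 3) * (1 - p_frq)   # expected points per guessed FRQ

# Expected totals for each strategy (blank questions contribute 0 points).
strategies = {
    "all MCQ + all FRQ":        20 * ev_mcq + 4 * ev_frq,
    "all MCQ, FRQ blank":       20 * ev_mcq,
    "all FRQ, MCQ blank":       4 * ev_frq,
    "all MCQ + 1/2 of the FRQ": 20 * ev_mcq + 2 * ev_frq,
    "3/4 of the MCQ + all FRQ": 15 * ev_mcq + 4 * ev_frq,
}
for name, total in strategies.items():
    print(f"{name}: {total:.1f} expected points (>= 14: {total >= 14})")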
Extreme Tradeoffs [12 Pts]
4. Mr. Bean wants to model extreme precipitation events, which are historically difficult to predict accurately. To attempt to create the world’s best model, he tries training multiple models using bootstrap sampling and regularization.
(a) [2 Pts] Mr. Bean designs 4 different models by bootstrap sampling 30% of the total training data to train each model on. On the test set, he creates the following plot displaying the temperature feature against the model’s predictions and true observed values. The dotted line shows the average prediction across all 4 models. Which of the following does the figure indicate? Select all that apply.
High model variance
Low model variance
High model bias
Low model bias
Solution: This displays high model variance since for any particular point, we have 4 predictions of precipitation that are vastly different.
This displays high model bias since on average, our predictions are far away from the ground truth distribution and we don’t have the necessary model complexity to model a non-linear distribution.

(b) [4 Pts] Mr. Bean decides to diagnose the issue further. He increases the number of trained models to 100 and evaluates the models on the point (xi, yi). Using historical data, he assumes that measurement errors follow a normal distribution with mean 0 and standard deviation σ = 4 mm. Given the statistics below, calculated using .describe() on the predictions and loss for these models, estimate the magnitude of the empirical bias.
Round to 3 decimal places.
Hint: Think of how bias is calculated in our bias-variance decomposition and relate the quantities below to the terms in the decomposition.
This question is difficult, so if you are not sure how to start then skip it for now and come back to the question later.
In the box below, show how you obtained the value above. Specifically, write down the bias-variance decomposition, substituting in the relevant quantities. No LaTeX is required; you can use plain English.
Solution: Students received credit for writing the bias-variance tradeoff, either mathematically or in plain English, for example: “risk = model variance + (model bias)² + observation variance”. Points were also awarded for pinpointing the values of each of the three known quantities. The math is below:
E[(y − fθ(x))²] = σ² + Var[fθ(x)] + (E[fθ(x)] − g(x))²
101.9301 = 4² + 8.2559² + bias²
101.9301 − 4² − 8.2559² = bias²
|bias| = 4.215
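A quick arithmetic check of the rearrangement above, plugging in the three known quantities:

import math

sigma = 4.0           # standard deviation of the measurement error, in mm
mean_loss = 101.9301  # average squared loss across the 100 models (empirical risk)
sd_pred = 8.2559      # standard deviation of the 100 models' predictions

# risk = sigma^2 + Var[f(x)] + bias^2, so solve for |bias|.
bias_sq = mean_loss - sigma**2 - sd_pred**2
print(round(math.sqrt(bias_sq), 3))  # ~4.215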
Solution: 4.215 mm

(c) [2 Pts] He decides to change his models to add L2 regularization. What behavior is expected in the training set compared to the unregularized models?
The model bias will decrease.
The model bias will increase.
The model variance will decrease.
The model variance will increase.
The observational variance will decrease.
The observational variance will increase.
Solution: The model bias increases and model variance decreases as per the bias-variance tradeoff when model complexity decreases. The observational variance doesn’t change since the dataset remains the same.
(d) [2 Pts] Regardless of your answer to the previous question, assume that after implementing regularization, the model bias is too high. Which of these solutions helps reduce the model bias?
Add an intercept term.
Use a decision tree with the same features.
Increase the regularization hyperparameter.
Decrease the regularization hyperparameter.
Solution: Adding an intercept term adds model complexity, which decreases model bias.
A decision tree has lower bias than linear regression, even with the same features, as a decision tree tends to fit training points perfectly.
The regularization hyperparameter controls the model complexity. A larger hyperparameter reduces the model complexity, so reducing the regularization hyperparameter will increase model complexity, which decreases model bias.
(e) [2 Pts] Assume that we fixed the previous issue by changing to a different unspecified regression model, and the model bias decreased. Which of the following could have happened as a result?
The model variance increased.
The model variance decreased.
The model variance stayed the same.
The observational variance decreased.
Solution: Since we change to a different type of model, we might have a completely different (or the same) expected loss in the bias-variance decomposition. Therefore, our variance could have increased, decreased, or stayed the same.
The observational variance is not affected by our model because the dataset remains the same.
Thinking Inside the Box [6 Pts]
5. Below are boxplots showing the distributions for three different quantitative variables. We will name these variables Variable A, Variable B, and Variable C.
Some of these distributions may be skewed—if a distribution is skewed, we want to apply a transformation to symmetrize it.
The following three parts will ask which transformations may be suitable for symmetrizing each distribution. If no transformation is necessary, select ”No transformation necessary.”
(a) [1 Pt] Which of the following transformations may symmetrize Variable A?
log(x)   x²   √x   x³   No transformation necessary.
(b) [1 Pt] Which of the following transformations may symmetrize Variable B?
log(x)   x²   √x   x³   No transformation necessary.
(c) [1 Pt] Which of the following transformations may symmetrize Variable C?
log(x)   x²   √x   x³   No transformation necessary.
Solution: Variable A is already symmetric, so it needs no transformation. Variable B is right-skewed, so log and square root transformations may help. Variable C is left-skewed, so power transformations may help.
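To see why these transformations behave this way, here is a small illustrative sketch on a made-up right-skewed sample (the data are hypothetical; only the direction of the change in skewness matters):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=10_000)  # hypothetical right-skewed variable

print(skew(x))           # strongly positive: long right tail
print(skew(np.sqrt(x)))  # smaller positive skew after a square-root transform
print(skew(np.log(x)))   # roughly 0: the log transform symmetrizes it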
The boxplots are repeated here for your convenience.
In each of the following parts, you will see a statement about the boxplots above. Deter- mine if each statement is True, False, or Impossible to tell.
(d) [1 Pt] Variable B has the lowest first quartile among all three variables. ⃝ True ⃝ False ⃝ Impossible to tell

Solution: True. Variable B has the lowest bottom of the box across all three variables.

(e) [1 Pt] Variable A is unimodal.
⃝ True ⃝ False ⃝ Impossible to tell
Solution: Impossible to tell. We know that variable A is symmetric, but the boxplot conceals information about the actual shape of the distribution. It could very well be unimodal, bimodal, or have even more modes.
(f) [1 Pt] Variable C contains zero points greater than 1.5 ∗ IQR above its median. ⃝ True ⃝ False ⃝ Impossible to tell
Solution: True. There are no "big" outlier points in Variable C, only "small" outliers.
Night Owl or Early Bird? [13 Pts]
6. Suriya and Meghna are Data 100 students, and they have a prediction task where they wish to predict whether people are night owls or early birds using their favorite color. They’re given a shortened training set with 5 data points where X = [‘blue’, ‘green’, ‘pink’, ‘purple’, ‘red’] and they wish to predict y = [0, 1, 1, 1, 0].
(a) [1 Pt] What type of variables does X contain?
⃝ Quantitative continuous
⃝ Quantitative discrete
⃝ Qualitative discrete
⃝ Qualitative nominal
⃝ Qualitative ordinal
Solution: X contains colors, which are not quantitative. They don’t have ordering, so the correct response is qualitative nominal.
(b) [1 Pt] They decide to one-hot encode the data in X into a design matrix X′, with the categories being ordered alphabetically from left to right. How many values in X′ are zero?
Solution: The matrix will contain a 1 value for each matching color in a 5×5 matrix. Since there are 5 colors, there will be 25 – 5 = 20 zero values.
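A small pandas sketch of the same count; pd.get_dummies orders the dummy columns alphabetically, matching the setup here:

import pandas as pd

colors = pd.Series(['blue', 'green', 'pink', 'purple', 'red'])
X_prime = pd.get_dummies(colors)        # 5x5 one-hot design matrix
print((X_prime == 0).to_numpy().sum())  # 20 zeros: 25 entries, 5 of them are 1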
Suriya and Meghna decide to use logistic regression with no intercept term, where the predicted probabilities are rounded to the nearest whole number. Suriya decides to try L0 regularization for their logistic regression model. Unlike L1 and L2 regularization, L0 regularization does not add a term to the loss function. Instead, it specifies a constraint that at most k elements of the model’s parameter vector θ can be non-zero.
Hint: σ(0) = .5
(c) [2 Pts] Suppose he applies L0 regularization where k = 5, and finds the optimal θ for X′
and y using logistic regression. How many points does he misclassify?
Solution: Since this means that all of the θ values can be nonzero, this is effectively logistic regression without regularization. As a result, the first and last θ will tend towards −∞, and the middle three θ will tend towards +∞, resulting in 0 misclassified points on the fitted data.

(d) [2 Pts] Suppose he applies L0 regularization where k = 1 and finds the optimal θ for X′ and y using logistic regression. How many points does he misclassify?
Solution: Given the hint, we know that σ(0) = 0.5, and based on our threshold, that is a positive (1) prediction. Therefore, even if our θ values are all 0, we only misclassify the two points whose true label is 0. Since we are allowed one non-zero θ value, we can set the first element θ1 (the ‘blue’ column) to any number less than 0, which yields a predicted probability of less than 0.5 for that point. We will now only misclassify the other point whose label is 0.
Therefore, the answer is 1.
Since linear regression using mean square error is easier to solve than logistic regression, Meghna tries to use that instead to create a quick model.
(e) [1 Pt] Is X′ᵀX′ invertible? ⃝ Yes ⃝ No

Solution: Yes, it is full rank. In fact, it’s orthonormal!

(f) [2 Pts] What is the optimal value of θ if we use mean square error as the loss function? Your answer should be a sequence of 5 elements, e.g. [1, 2, 3, 4, 5].

Solution: The design matrix X′ is the identity matrix. Therefore, our θ is simply the y values exactly: [0, 1, 1, 1, 0].

(g) [2 Pts] For the optimal value of θ, what is the mean squared error on X′?

Solution: With the optimal θ above, we can predict every point perfectly, so the mean squared error is 0.

Since they can’t possibly train a great model with 5 data points, they seek out the full training set and discover that it has 1,000,000 training points with the same colors from before. They decide to use logistic regression, without regularization and without an intercept term, for the remaining parts of the question.

(h) [1 Pt] What is the column rank of the new one-hot encoded dataset?

Solution: Since there are 5 OHE columns (which are linearly independent since they are orthogonal), the rank is 5.

(i) [1 Pt] Meghna wants to use a gradient method to discover the optimal θ. Which of the following options is the best suited to this training set and problem?
⃝ Stochastic gradient descent (batch size = 1)
⃝ Gradient descent, on the complete dataset
⃝ Stochastic gradient descent (batch size = 32)

Solution: Our training set is too large for batch gradient descent on the entire dataset, and using a batch size of 1 is likely to lead to fluctuation and oscillation. The middle ground that is appropriate is SGD with a batch size of 32.
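A short NumPy sketch checking parts (f), (g), and (h): with the 5×5 identity design matrix the least-squares solution is exactly y, the training MSE is 0, and stacking repeated copies of the same five one-hot rows (as in the larger dataset) leaves the column rank at 5. The replication factor below is illustrative only, not the actual 1,000,000-row dataset:

import numpy as np

X = np.eye(5)                                 # one-hot design matrix for the 5 colors
y = np.array([0, 1, 1, 1, 0], dtype=float)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((X @ theta - y) ** 2)
print(theta)                                  # [0. 1. 1. 1. 0.] -- theta equals y
print(mse)                                    # 0.0

# Repeating the same five rows adds no new directions, so the rank stays 5.
X_big = np.tile(X, (1000, 1))
print(np.linalg.matrix_rank(X_big))           # 5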
Donut Decisions [9 Pts]
7. Below is a dataset from which we want to create a classifier. We have two features, x1 and x2, and two classes. Assume the orange points are in class 1 and the blue points are in class 0. The points displayed here are training data.
(a) [1 Pt] Is this dataset linearly separable? ⃝Yes ⃝No
Solution: No line can be drawn that perfectly separates the two classes, so this dataset is not linearly separable.

Below are 4 different possible decision boundaries (a.k.a. classifiers) we can generate. The orange regions are areas where new points would be classified as class 1, and the blue regions are areas where new points would be classified as class 0.
(b) [2 Pts] Which of the above classifiers (A, B, C, D) has perfect accuracy on the training set? Select all that apply.
Note: Do not try to distinguish borderline points—you may assume points right on the boundary are classified correctly.

Solution: Classifier A clearly does not predict every point correctly. Classifier C has 2 misclassified points, both blue points in orange regions. One is at about (7, −13), and the other is at around (5, 0). Classifiers B and D have no misclassified points.

The following parts will ask you about which model(s) could generate each of the boundaries. For each part, assume x1 and x2 are the only features in our model.

(c) [1 Pt] Which of the following models could have generated boundary A?
Logistic Regression    Decision Tree    Random Forest    None of the above

Solution: Boundary A could only have been generated by logistic regression, because decision trees (and therefore random forests) only allow for axis-aligned splits.

(d) [1 Pt] Which of the following models could have generated boundary B?
Logistic Regression    Decision Tree    Random Forest    None of the above

Solution: Boundary B only contains axis-aligned splits, so it could easily have been made by a decision tree, and random forests can create any boundary a decision tree can. Logistic regression could NOT have made this boundary, because logistic regression only returns a linear boundary. This boundary, while it consists of lines, is piecewise linear, not strictly linear.

(e) [1 Pt] Which of the following models could have generated boundary C?
Logistic Regression    Decision Tree    Random Forest    None of the above

Solution: A decision tree could not have made Boundary C, because it classifies some points incorrectly. A random forest could have made this boundary due to the piecewise linear splits.

(f) [1 Pt] Which of the following models could have generated boundary D?
Logistic Regression    Decision Tree    Random Forest    None of the above

Solution: None of logistic regression, decision trees, or random forests can create curved decision boundaries. With just these two features x1 and x2, none of the models could create this boundary.

(g) [2 Pts] Suppose we add a new feature x3, which is some function of x1 and x2. Now, assume x3 is a feature in our model. Which of the following models could have generated boundary D?
Logistic Regression    Decision Tree    Random Forest    None of the above
Solution: If we select an appropriate function, we can make this dataset linearly separable in 3D space, allowing for all 3 models to classify all training points correctly. In this example, any function that is small for points either too close or too far from the origin, and large for points a certain distance from the origin (or vice versa), would do. The function used to generate this figure was
e^(−(√(x1² + x2²) − 7.75)²)
although any function meeting the properties described above would do. Note that a student would not have needed to come up with a specific function to correctly answer the question; only recognizing that such a function exists would have sufficed.
When we add a new feature x3, boundaries that are linear or piecewise linear in the 3D space can look curved in the original 2D space, leading to the boundary seen in figure D.
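A small sketch of this idea on made-up "donut" data, using the radial feature reconstructed above; the only point being demonstrated is that the new feature x3 separates the two classes with a simple threshold, i.e. a linear rule once x3 is in the feature set:

import numpy as np

rng = np.random.default_rng(0)

def ring(r_lo, r_hi, n):
    # Sample n points uniformly in the annulus between radii r_lo and r_hi.
    r = rng.uniform(r_lo, r_hi, n)
    ang = rng.uniform(0, 2 * np.pi, n)
    return np.column_stack([r * np.cos(ang), r * np.sin(ang)])

# Hypothetical data: class 1 lives near radius 7.75, class 0 well inside/outside.
pts1 = ring(7.0, 8.5, 100)                                     # class 1 (the donut)
pts0 = np.vstack([ring(0.0, 4.0, 50), ring(12.0, 15.0, 50)])   # class 0

def x3(pts):
    r = np.sqrt(pts[:, 0] ** 2 + pts[:, 1] ** 2)
    return np.exp(-(r - 7.75) ** 2)   # large near radius 7.75, tiny elsewhere

# Every class-1 point gets a larger x3 than every class-0 point, so the rule
# "predict class 1 when x3 > c" separates the training data perfectly.
print(x3(pts1).min() > x3(pts0).max())   # True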
Feature Engineering [8 Pts]
8. The following dataset contains information about passengers on the Titanic. There are 20 rows in this dataset, and you may assume there are no missing or null values in the dataset. The first 5 rows are shown below.
A brief description of the columns:
• age and fare are strictly positive
• sex takes on values ∈ {male, female}
• class takes on values ∈ {First, Second, Third}
• embark town takes on values ∈ {Southampton, Cherbourg, Queenstown, London, Oxford}
(a) [2 Pts] Suppose we one-hot encode the sex column to get a design matrix Φ1 with 2 columns, sex male and sex female, where values can be 0 or 1 within each column. Note that Φ1 does NOT contain an intercept term.
Select all of the following statements that are true about Φ1.
Φ1 has 20 rows
Φ1 is full column rank
Φ1ᵀΦ1 is invertible
None of the above
Solution: The first choice is correct since one-hot encoding does not change the number of rows of the matrix, only the number of columns.
The columns are linearly independent, so choices 2 and 3 are also correct.

(b) [2 Pts] Suppose we one-hot encode the sex and embark town columns and include an intercept term in the model. This results in a design matrix Φ2 with 8 columns.
Select all of the following statements that are true about Φ2.
Φ2 has 20 rows
Φ2 is full column rank
Φ2ᵀΦ2 is invertible
None of the above

Solution: The first choice is correct since one-hot encoding does not change the number of rows of the matrix, only the number of columns.
The sum of all the one-hot encoded columns resulting from one categorical feature (e.g. the sum of the sex male and sex female columns) is a column of all 1’s, which the design matrix already contains due to the bias term. Therefore, Φ2 is not full column rank.
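A quick rank check on a made-up 10-row sample (hypothetical values, not the actual Titanic rows): the intercept plus the one-hot columns for sex and embark town give 8 columns, but the matrix rank is only 6, since each one-hot group sums to the intercept column:

import numpy as np
import pandas as pd

# Hypothetical sample covering both sexes and all five embarkation towns.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male",
            "female", "male", "female", "male", "female"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown", "London", "Oxford",
                    "Southampton", "Cherbourg", "Queenstown", "London", "Oxford"],
})

phi2 = pd.get_dummies(df).astype(float)   # 2 sex columns + 5 town columns
phi2.insert(0, "intercept", 1.0)          # add the bias column: 8 columns total

print(phi2.shape)                              # (10, 8)
print(np.linalg.matrix_rank(phi2.to_numpy()))  # 6, so Phi2 is not full column rank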