DATA 100 Final Exam, Spring 2021
INSTRUCTIONS
Final Exam
This is your exam. Complete it either at exam.cs61a.org or, if that doesn’t work, by emailing course staff with your solutions before the exam deadline.
This exam is intended for the student with email address
For questions with circular bubbles, you should select exactly one choice. You must choose either this option
Or this one, but not both!
For questions with square checkboxes, you may select multiple choices. You could select this choice.
You could select this one too!
You may start your exam now. Your exam is due at
Preliminaries
You can complete and submit these questions before the exam starts.
What is your full name?
What is your Berkeley email?
What is your student ID number?
When are you taking this exam? Wednesday 11:40am PDT
Wednesday 7:10pm PDT
Honor Code: All work on this exam is my own.
By writing your full name below, you are agreeing to this code:
Important: You must copy the following statement exactly into the box below. Failure to do so may result in points deducted on the exam.
“I certify that all work on this exam is my own. I acknowledge that collaboration of any kind is forbidden, and that I will face severe penalties if I am caught, including at minimum, harsh penalties to my grade and a letter sent to the Center for Student Conduct.”
(a) (2.0 pt) Recall the tips dataset that we worked with on assignments in the past, which includes data about the tip on a restaurant bill as well as the day of week and the sex of the individual. The plot below attempts to examine patterns between the tip as a percentage of the bill and the sex of the individual by the day of week (DOW)
Select the best reason below for why the data visualization is misleading or poorly constructed.
the y-axis should be log transformed
the clustering of bars doesn’t allow a key comparison to be made
the plot suffers from overplotting
the bars for each day of week should be stacked on top of each other (e.g. the bar for “Thur” would have a total height of approximately 0.3)
Which of the following gradient fields most likely corresponds to the surface shown above?
A gradient field is a 2-dimensional plot that shows the direction and relative magnitude of the gradient of a surface: each point has a vector pointing from it in the direction of the gradient at that point, and the length of that vector is proportional to the magnitude of the gradient.
(c) (3.0 points)
We have read in some data as the dataframe df. Consider a subset of df below, which contains some information on the background of various individuals in the US.
i. (2.0 pt) Suppose we want to observe the relationship between the distributions of AFQT (an intelligence metric, in units of percentile) and log_earn_1999 (the log of the individual's earnings in 1999), based on whether the individual's parents both went to college. Select the line of code below that generates the best plot to observe this relationship.
sns.kdeplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
sns.scatterplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
sns.lineplot(x=df['AFQT'], y=df['log_earn_1999'], hue=df['mother_college'] & df['father_college'])
sns.kdeplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
sns.scatterplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
sns.lineplot(x='AFQT', y='log_earn_1999', hue=['mother_college', 'father_college'], data=df)
Hint: Consider overplotting.
Answer: A
Suppose we want to observe the relationship between log_earn_1999 and the zip code (zip_code) of the individual. We run the following code to generate a plot:
df2 = df.groupby("zip_code").mean().reset_index()
sns.lineplot("zip_code", "log_earn_1999", data=df2)
Select the reason below for why this plot would represent a bad data visualization.
treats a categorical variable as a continuous variable
treats a continuous variable as a categorical variable
represents a density with a feature other than area
does not show the relationship between the variables of interest
2. (9.0 points)
(a) (4.0 points)
Recall that a random forest is created from a number of decision trees, with each decision tree created from a bootstrapped version of the original training set. One hyperparameter of a random forest is the number of decision trees we train to create the random forest.
Define T to be the number of decision trees used to create the random forest. Let's say we have two candidate values for T: var1 and var2. We want to perform var3-fold cross-validation to determine the optimal value of T. Assume var1, var2, and var3 are integers.
i. (2.0 pt) In this cross-validation process, how many random forests will we train? Your answer should be in terms of var1, var2, and/or var3 and should be an integer.
ii. (2.0 pt) In this cross-validation process, how many decision trees will we train? Your answer should be in terms of var1, var2, and/or var3 and should be an integer.
(var1 + var2) * var3
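For a quick sanity check with hypothetical values: if var1 = 100, var2 = 200, and var3 = 5, we fit one random forest per (candidate value, fold) pair, i.e. 2 * var3 = 10 random forests, containing 100 * 5 + 200 * 5 = (var1 + var2) * var3 = 1500 decision trees in total.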
(b) (2.0 pt) Let’s say we pick three hyperparameters to tune with cross-validation. We have 5 candidate values for hyperparameter 1, 6 candidate values for hyperparameter 2, and 7 candidate values for hyperparameter 3. We perform 4-fold cross validation to find the optimal combination of hyperparameters, across all possible combinations.
In this cross-validation process, how many random forests will we train? Your answer can be left as a product of multiple integers, e.g. “1 * 2 * 3”, or simplified to a single integer, e.g. “6”. (These are not the correct answers to the problem).
(c) (3.0 pt) Here is some code that attempts to implement the cross-validation procedure described above. However, it is buggy. In one sentence, describe the bug below.
You may assume the following:
• X_train is a pd.DataFrame that contains our design matrix, and Y_train is a pd.Series that contains our response variable, both for the full training set.
• Assume ensemble.RandomForestClassifier(**args) creates a random forest with the appropriate hyperparameter values. The bug is not on this line.
• The candidate values for each hyperparameter have been loaded into the lists cands1, cands2, and cands3, respectively.
from sklearn.model_selection import KFold
from sklearn import ensemble
import numpy as np
import pandas as pd

kf = KFold(n_splits=4)
cv_scores = []
for cand1 in cands1:
    for cand2 in cands2:
        for cand3 in cands3:
            validation_accuracies = []
            for train_idx, valid_idx in kf.split(X_train):
                split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
                split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]
                model = ensemble.RandomForestClassifier(**args)
                model.fit(X_train, Y_train)
                accuracy = np.mean(model.predict(split_X_valid) == split_Y_valid)
                validation_accuracies.append(accuracy)
            cv_scores.append(np.mean(validation_accuracies))
Answer to (b): 4 * 5 * 6 * 7 = 840 random forests.
Answer to (c): Each iteration of the algorithm trains a random forest on the entire training set, as opposed to the part of the training set that is not reserved for validation.
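For reference, here is a minimal corrected sketch of the inner loop (using the same variable names as the code above); the only change is that the model is fit on the cross-validation split rather than on the full training set:

for train_idx, valid_idx in kf.split(X_train):
    split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]
    model = ensemble.RandomForestClassifier(**args)
    # Fix: fit on the split training data, not on X_train / Y_train.
    model.fit(split_X_train, split_Y_train)
    accuracy = np.mean(model.predict(split_X_valid) == split_Y_valid)
    validation_accuracies.append(accuracy)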
We are trying to train a decision tree for a classification task where 0 is the negative class and 1 is the positive class. We are given 8 data points, each described by a pair of features (x1, x2) and a class label y, shown in the table below.
(a) (3.0 pt)
x1  x2  y
3   4   1
2   1   0
1   3   1
5   9   0
9   6   1
7   2   1
4   7   0
8   8   1
What is the entropy at the root of the tree? Round to 4 decimal places.
(b) (2.0 pt) What is the Gini impurity at the root of the tree? Note that the formula for Gini impurity is 1 − Σ_{i=1}^{c} p_i^2, where p_i is the fraction of items labelled with class i and c is the total number of classes.
−((5/8) ln(5/8) + (3/8) ln(3/8)) = 0.6616
1 − ((5/8)² + (3/8)²) = 0.46875
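As a numeric cross-check, here is a short sketch that recomputes both quantities; the label vector is read off the reconstructed data table above and is therefore an assumption:

import numpy as np

labels = np.array([1, 0, 1, 0, 1, 1, 0, 1])   # assumed labels from the table above
p = np.bincount(labels) / len(labels)         # class proportions: [3/8, 5/8]
entropy = -np.sum(p * np.log(p))              # natural-log entropy ~= 0.6616 (np.log2 gives ~= 0.9544)
gini = 1 - np.sum(p ** 2)                     # Gini impurity = 0.46875
print(entropy, gini)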
(c) Which of the following splits minimizes the weighted entropy of the two resulting child nodes?
x1 ≥ 6
x1 ≥ 3.5
x2 ≥ 5
x2 ≥ 3.5
x2 ≥ 6.5
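The sketch below evaluates the weighted entropy of each candidate split; it is illustrative only, since it relies on the reconstructed data table above and on natural-log entropy (matching part (a)):

import numpy as np

# Reconstructed (assumed) data: columns x1, x2, label y.
data = np.array([[3, 4, 1], [2, 1, 0], [1, 3, 1], [5, 9, 0],
                 [9, 6, 1], [7, 2, 1], [4, 7, 0], [8, 8, 1]])

def entropy(y):
    # Natural-log entropy of a 0/1 label vector; empty or pure nodes have entropy 0.
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def weighted_entropy(feature_idx, threshold):
    left = data[data[:, feature_idx] >= threshold][:, 2]
    right = data[data[:, feature_idx] < threshold][:, 2]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

for feat, thr in [(0, 6), (0, 3.5), (1, 5), (1, 3.5), (1, 6.5)]:
    print(f"x{feat + 1} >= {thr}: {weighted_entropy(feat, thr):.4f}")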
(d) (2.0 points)
We have decided to create a food recommendation system using a decision tree! We would like to run our decision tree to see what food it recommends in certain scenarios.
If you have trouble reading the above tree, please go to this link: https://i.imgur.com/9Z40cYP.png
i. (1.0 pt) Bob wants to eat some unhealthy food, specifically at a fast food restaurant. When asked what he’s in the mood for, he replies with “Mediterranean”. Which of the following restaurants could the decision tree recommend for Bob?
Chipotle
Taco Bell
Dyars Cuisine
IBs Burgers
ii. (1.0 pt) Larry would like to eat some unhealthy food as well! However, he got a salary bonus from his job so he does not want to eat at a fast food restaurant. When asked how much he would like to pay, he replies with “I have no preference”. Which of the following restaurants could the decision tree recommend for Larry?
Olive Garden
Cheesecake Factory
Super Dupers Burger
Steakhouse
(e) (3.0 pt) Joey and Andrew are each training their own decision tree for a classification task. Joey decides to limit the depth of his decision tree to 3, while Andrew decides not to set a limit on the depth of his decision tree. When plotting the training error, Joey's error seems to be much higher than Andrew's error. However, when plotting the validation error, Andrew's error seems to be much higher than both his own training error and Joey's error. Andrew is confused and surmises that there must be a bug in his code that is causing this to happen. What happened? Explain. What can he do to improve it? Name at least 3 things he can do to improve his error. Please limit your response to 2 sentences per reason.
He is not correct; there is no bug. Andrew's high validation error combined with his low training error is a sign of overfitting. Joey did not run into this issue because he limited his depth to 3. To improve his validation error, Andrew could limit the depth of his tree, prune his decision tree, prevent splits on nodes that contain less than 1% of the samples, or use random forests.
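For concreteness, here is a sketch of how these remedies map onto scikit-learn arguments; the specific values are illustrative and not from the exam:

from sklearn import ensemble, tree

shallow_tree = tree.DecisionTreeClassifier(max_depth=3)                 # limit tree depth, as Joey did
pruned_tree = tree.DecisionTreeClassifier(ccp_alpha=0.01)               # cost-complexity pruning
constrained_tree = tree.DecisionTreeClassifier(min_samples_split=0.01)  # require at least 1% of samples to split a node
forest = ensemble.RandomForestClassifier(n_estimators=100)              # average many bootstrapped trees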
(a) (3.0 pt) Suppose we are modeling the number of calls to MangoBot food delivery service per minute. We believe that there are likely more calls around lunch time.
Which of the following feature encodings of the time of day (0.0 to 24.0, exclusive of both ends) would capture this assumption? Select all that apply.
time_of_day ** 2
np.log(12 * time_of_day)
1 - np.cos(np.pi * time_of_day / 12)
np.exp(-(time_of_day - 12) ** 2)
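As a quick sanity check (not part of the exam), evaluating each candidate encoding at a few times of day shows which ones actually peak around lunch time:

import numpy as np

t = np.array([0.5, 6.0, 12.0, 18.0, 23.5])   # sample times of day
print(t ** 2)                                # monotonically increasing, no peak at noon
print(np.log(12 * t))                        # also monotonically increasing
print(1 - np.cos(np.pi * t / 12))            # peaks at t = 12
print(np.exp(-(t - 12) ** 2))                # sharply peaked at t = 12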
(b) (4.0 pt) Recall that in a binary classification task, we want our data to become linearly separable so that we can maximize the performance of our classifier. In many cases, however, our data are not directly linearly separable. As a result, we want to apply some transformation to our data so they will become linearly separable afterwards.
For the following dataset, select all transformations that can make the data linearly separable.
(x1, x2) → (x1², x2)
(x1, x2) → (x1, x2)
(x1, x2) → (x1², x2)
(x1, x2) → (x2, x1²)
Consider the following preprocessing steps:
i. Remove all punctuation (., ,, :, ...).
ii. Remove all stopwords (did, the, ...). Note that stopwords do not include words that negate, such as no, not, ...
iii. Lowercase the sentence, and keep only words that consist of the letters a-z.
iv. Encode the sentence as a vector containing the frequencies for all the unique words in the text.
Suppose we use the frequency vector from the steps above as our feature to train a logistic regression model that predicts the sentiment of a sentence (positive, negative). In 1-2 sentences, describe a case where our model would fail and make a false prediction.
Your answer must be specific to the preprocessing steps and include an example sentence to earn credit.
Counting the frequency of all words in a sentence does not address the order of the words in the sentence. This could be problematic when you have the following two sentences:
“I am happy that it does not rain today.” “I am not happy that it does rain today.”
The sentiments of the two sentences above are clearly opposed to each other; however, if we count the frequency of the words following the same preprocessing steps above, we end up with exactly the same frequency vector. This means we will always make a false prediction for one of the sentences.
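A minimal sketch of the preprocessing described above makes the failure concrete; the stopword list and the helper name featurize are hypothetical:

import re
from collections import Counter

# Tiny illustrative stopword list; negating words such as "not" are deliberately kept.
STOPWORDS = {"i", "am", "that", "it", "does", "today"}

def featurize(sentence):
    # Lowercase, keep alphabetic tokens only, drop stopwords, count frequencies.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

s1 = "I am happy that it does not rain today."
s2 = "I am not happy that it does rain today."
print(featurize(s1) == featurize(s2))   # True: both sentences map to the same frequency vector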
(d) (3.0 pt) Recall that in the housing assignment, if we want to include a categorical variable in our linear model, we need to convert it into a collection of dummy variables of values 0 and 1. Suppose we have a dataframe housing that contains a subset of the Cook County data.
We are interested in one-hot-encoding the categorical variable floor_material and using the dummy columns as the sole features to build an ordinary least squares model to predict the sale price of the houses.
Specifically, we create the design matrix X with the following block of code:
X = pd.get_dummies(housing['floor_material']).to_numpy()
In addition, running the code housing['floor_material'].value_counts() gives us the following output:
Which of the following statements are true about the design matrix X? Select all that apply. Note: define θ∗ to be the vector containing the optimal parameters.
X has a dimension of 3 columns and 120 rows.
We can add a bias column of all 1's to X and still find a unique solution for the optimal parameters.
X⊤X is a diagonal matrix (zeros everywhere except along the main diagonal).
All of the entries in X⊤X add up to be 120.
The optimal parameter vector θ∗ contains the average sale price for each type of floor material.
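The following sketch illustrates why X⊤X is diagonal for one-hot columns and why its entries sum to the number of rows; the three floor materials and their counts (60/40/20, totaling 120) are made up, since the value_counts() output is not reproduced above:

import numpy as np
import pandas as pd

# Hypothetical data: 120 houses across three floor materials.
housing = pd.DataFrame({"floor_material": ["carpet"] * 60 + ["hardwood"] * 40 + ["tile"] * 20})
X = pd.get_dummies(housing["floor_material"], dtype=int).to_numpy()

print(X.shape)          # (120, 3): one row per house, one column per material
print(X.T @ X)          # diagonal matrix; the diagonal holds the per-category counts
print((X.T @ X).sum())  # 120: the entries sum to the number of rows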
(e) (3.0 pt) When building your models, one way to select features is to consider the pair-wise relationship between each column and the response variable (i.e. the column you are trying to predict). Consider the following approach:
i. Compute the pairwise correlation coefficient between each column and the response variable in the dataframe.
ii. Sort the correlation coefficients in descending order.
iii. Pick the top k coefficients and select the corresponding columns as the features.
In 1-2 sentences, describe how the approach above can result in multicollinearity and issues with feature diversity.
Your answer must explain why multicollinearity and lack of feature diversity could potentially occur to earn credit for this question.
It is possible for more than one column to share a strong correlation with the response variable at the same time, and, even worse, there can be strong correlations (near multicollinearity) among multiple columns. This can cause high variance in the model and hurt test performance, and it is unfortunately not captured by the approach above.
The approach above is deterministic given the data and will always produce a fixed set of columns. This can limit feature diversity as we are building our model.
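A minimal sketch of steps i-iii with synthetic data (all column names here are hypothetical) shows how two nearly collinear columns can both land in the top k:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
sqft = rng.normal(1500, 300, n)                 # hypothetical feature
sqft_noisy = sqft + rng.normal(0, 5, n)         # nearly identical copy of sqft
bedrooms = rng.integers(1, 6, n).astype(float)  # a genuinely different feature
price = 100 * sqft + 5000 * bedrooms + rng.normal(0, 10000, n)

df = pd.DataFrame({"sqft": sqft, "sqft_noisy": sqft_noisy,
                   "bedrooms": bedrooms, "price": price})

# Steps i-iii: correlate each column with the response, sort, keep the top k.
corrs = df.drop(columns="price").corrwith(df["price"]).sort_values(ascending=False)
top_k = corrs.head(2).index.tolist()
print(top_k)   # likely ['sqft', 'sqft_noisy']: two near-collinear columns selected together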
Suppose we are modelling some response using our data X. For a given observation we have 3 features, x1, x2, x3. Note that the subscripts do not refer to the first, second, and third observations, respectively. For a given data point x, we come up with a model of the form fθ(x) = θ1x1 + θ1θ2x2 + θ1²x3. We use the squared error function, denoted L(y, ŷ), to calculate the error for each observation, and additionally use L2 regularization, denoted R(θ), with penalty λ. You may assume that λ > 0. Thus our objective function is of the form L(y, ŷ) + λR(θ).
(a) (3.0 pt) For a single observation x having response y and features x1, x2, x3, compute the gradient to be used in gradient descent:
−2 [ (y − (θ1x1 + θ1θ2x2 + θ1²x3))(x1 + θ2x2 + 2θ1x3) − λθ1,   (y − (θ1x1 + θ1θ2x2 + θ1²x3))(θ1x2) − λθ2 ]
[ 2(x1 + θ2x2 + 2θ1x3 − y)(θ1x1 + θ1θ2x2 + θ1²x3) + 2λθ1,   2(θ1x2 − y)(θ1x1 + θ1θ2x2 + θ1²x3) + 2λθ2 ]
[ −2(y − (θ1x1 + θ1θ2x2 + θ1²x3))(x1 + θ2x2 + 2θ1x3),   −2(y − (θ1x1 + θ1θ2x2 + θ1²x3))(θ1x2) ]
2 [ (θ1x1 + θ1θ2x2 + θ1²x3 − y)(θ1) + λθ1,   (θ1x1 + θ1θ2x2 + θ1²x3 − y)(θ1θ2) + λθ1θ2,   (θ1x1 + θ1θ2x2 + θ1²x3 − y)(θ1²) + λθ1² ]
[ (θ1x1 + θ1θ2x2 + θ1²x3 − y)²(θ1),   (θ1x1 + θ1θ2x2 + θ1²x3 − y)²(θ2),   (θ1x1 + θ1θ2x2 + θ1²x3 − y)²(θ3) ]
[ (y − (θ1x1 + θ1θ2x2 + θ1²x3)) + λR(θ1),   (y − (θ1x1 + θ1θ2x2 + θ1²x3)) + λR(θ2),   (y − (θ1x1 + θ1θ2x2 + θ1²x3)) ]
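For reference, a worked sketch of the gradient for part (a) (my derivation, assuming the L2 penalty is R(θ) = θ1² + θ2² with no 1/2 factor):

\nabla_\theta \Big[ (y - f_\theta(x))^2 + \lambda(\theta_1^2 + \theta_2^2) \Big]
= \begin{bmatrix}
-2\,(y - f_\theta(x))\,(x_1 + \theta_2 x_2 + 2\theta_1 x_3) + 2\lambda\theta_1 \\
-2\,(y - f_\theta(x))\,(\theta_1 x_2) + 2\lambda\theta_2
\end{bmatrix},
\qquad f_\theta(x) = \theta_1 x_1 + \theta_1\theta_2 x_2 + \theta_1^2 x_3.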
(b) (2.0 pt) Suppose that you and your friend are implementing gradient descent. Just for fun, your friend chooses a negative learning rate α and asks you to fix their code. Which of the following expressions will always result in the same update as the conventional gradient descent algorithm? You may assume that the gradient ∇ is correctly computed and you do not need to worry about the magnitude of α.
θ(t+1) = θ(t) − α∇
θ(t+1) = θ(t) + α∇
θ(t+1) = θ(t) − |α|∇
θ(t+1) = θ(t) + |α|∇
None of the above
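One way to see the relationship (a note, not part of the original exam): since α < 0 we can write α = −|α|, so θ(t) + α∇ = θ(t) − |α|∇. Both of these expressions therefore take a step of magnitude |α| in the descent direction, matching the conventional update with a positive learning rate.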
(c) (4.0 points)
i. (2.0 pt) We seek to optimize a given loss function using stochastic gradient descent with a batch size strictly between 1 and n, where n is the total number of data points. We initialize all model parameters to 0 and use a constant learning rate η(t) = α. Based on the contour plot below, which of the following will most likely result in better minimization of the loss function:
Fewer iterations
Greater learning rate
Smaller batch size
Greater batch size
ii. We now use a decaying learning rate where at time t, the learning rate is η(t) = α / (t + 1), where α > 0. Based on the contour plot below, which of the following will most likely result in better minimization of the loss function:
Fewer iterations
Greater iterations
Negate α
Smaller α
Greater α
η(t) = √α / (t + 1)
(a) (6.0 pt) Leif wants to do a study on the number of flowers in people's gardens. He collects data on 100 different gardens, classifying each of them into three different sizes ('small', 'medium', and 'large') and counting every flower in each person's garden. The following are the first five rows of the data he collected:
Leif then asks you to construct the following table using the data he collected. The table represents the total flowers in each category. For example, there are 1700 Hyacinths in “large” gardens.
Write code below such that the above table is generated. Assume the data Leif collected is placed in a Pandas DataFrame assigned to the variable inputdf. The resul
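Since neither the first five rows of inputdf nor the target table are reproduced here, the exact column names are unknown; one possible sketch, assuming inputdf is in long format with hypothetical columns 'Garden Size', 'Flower', and 'Count' (one row per garden and flower type):

import pandas as pd

# Pivot so that rows are flower types, columns are garden sizes,
# and each cell holds the total count of that flower across gardens of that size.
table = inputdf.pivot_table(index="Flower", columns="Garden Size",
                            values="Count", aggfunc="sum")

If inputdf instead has one column per flower type, inputdf.groupby("Garden Size").sum() would produce the same totals with rows and columns swapped.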