4/11/2021 View Submission | Gradescope
https://www.gradescope.com/courses/228165/assignments/1158235/submissions/74288817 1/22
Q1 Exam Information 0 Points
Corrections/clarification doc
You may find any real-time exam-related information here:
https://docs.google.com/document/d/1c21nMYywG5Sl_STsm71cap2biI0qI_qIoX
Please have this doc open in a tab and be sure to check it periodically.
Mathematical expressions
For answers that include mathematical expressions, you can type your answer in plain text math as best you can or you can attempt to render LaTeX by including double dollar signs ($$) before and after your expression. Either way is totally fine.
For example, if the answer is the expression for the mean squared error of linear regression, ∑_{i=1}^{N} (y^(i) − θ^T x^(i))^2, you can write something like one of the following:
Plain text:
sum i=1 to N (y^(i) - theta^T x^(i))^2
Rendered:
$$\sum_{i=1}^N (y^{(i)} - \theta^T x^{(i)})^2$$ which renders as ∑_{i=1}^{N} (y^(i) − θ^T x^(i))^2
Q2 Multiple-choice and True/False 6 Points
Q2.1
2 Points
For linear regression, recall our MSE objective with l2 regularization:
J(θ) = (1/N) ∑_{i=1}^{N} (y^(i) − θ^T x^(i))^2 + λ ∥θ∥₂²
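As a concrete sketch, the objective above can be computed as follows. The data, parameter values, and the helper name `ridge_mse_objective` are made up for illustration and are not part of the exam:

```python
import numpy as np

def ridge_mse_objective(X, y, theta, lam):
    """J(theta) = (1/N) * sum_i (y_i - theta^T x_i)^2 + lambda * ||theta||_2^2."""
    N = X.shape[0]
    residuals = y - X @ theta          # y^(i) - theta^T x^(i) for each example
    return (residuals ** 2).sum() / N + lam * (theta ** 2).sum()

# Toy data: 3 examples, 2 features (values chosen arbitrarily).
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
y = np.array([2.0, 1.0, 0.0])
theta = np.array([0.5, 0.5])

# With lambda = 0 this is plain training MSE; increasing lambda adds a
# penalty on ||theta||^2, so the training objective can only go up.
print(ridge_mse_objective(X, y, theta, lam=0.0))  # 0.5
print(ridge_mse_objective(X, y, theta, lam=1.0))  # 1.0
```

This also illustrates why decreasing λ tends to reduce training MSE: the penalty term shrinks, letting θ fit the training data more closely.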
Select all that apply:
Suppose we are training a model on some dataset D with the objective above. Which of the following is likely to reduce training mean squared error?
Decrease λ
Decrease training set size
Q2.2
2 Points
Continuing from Q2.1
Select all that apply:
When training MSE is better than test MSE, which of the following should improve test MSE?
Increase λ
Increase training set size
Q2.3
2 Points
True or False:
For any neural network, the validation loss will always decrease monotonically with the number of iterations of gradient descent, provided the step size is sufficiently small.
True False
Q3 Number of Parameters 10 Points
In the following questions, give the number of parameters that will need to be learned in each model. Do not count any hyperparameters.
Q3.1
2 Points
Linear regression with:
20 data points in the training set
2 real input features, x1, x2
Using a bias term
Single real output
Number of parameters: 3
Q3.2
2 Points
Logistic regression with:
11 data points in the training set
3 binary input features, x1, x2, x3
Single binary output
Using a bias term
(No feature mapping)
Number of parameters: 4
Q3.3
2 Points
Logistic regression with:
11 data points in the training set
2 input features, x1, x2, but we use a feature mapping function φ(x) = [1, x1, x2, x1x2, x1², x2²]^T
(Bias term is implicitly embedded in the feature mapping)
Number of parameters: 6
Q3.4
2 Points
Logistic regression with:
11 data points in the training set
3 binary input features, x1, x2, x3
5 binary outputs
Using a bias term
(No feature mapping)
Number of parameters: 20
Q3.5
2 Points
Neural network with:
11 data points in the training set
3 binary input features, x1, x2, x3
5 binary outputs
Zero hidden layers
Using softmax loss
Using a bias term
Number of parameters: 20
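All the counts in Q3 follow one rule: a linear or logistic model with d input dimensions (after any feature mapping), k outputs, and one bias per output has (d + 1) × k parameters; the training-set size never enters the count. A quick sketch (the helper name `num_params` is ours, not from the exam):

```python
def num_params(n_features, n_outputs=1, bias=True, mapped_dim=None):
    """Parameters in a linear/logistic model: one weight per (input dim, output)
    pair, plus one bias per output unless the bias is folded into the mapping."""
    d = mapped_dim if mapped_dim is not None else n_features + (1 if bias else 0)
    return d * n_outputs

print(num_params(2))                   # Q3.1: 3  (2 features + bias)
print(num_params(3))                   # Q3.2: 4  (3 features + bias)
print(num_params(2, mapped_dim=6))     # Q3.3: 6  (bias lives inside phi)
print(num_params(3, n_outputs=5))      # Q3.4: 20 (4 per output, 5 outputs)
print(num_params(3, n_outputs=5))      # Q3.5: 20 (zero hidden layers = same shape)
```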
Q4 Probabilistic Models 10 Points
For each machine learning model, select the objective function that it attempts to fit. Assume that x is the input, y is the output, and θ is the vector of all parameters.
(The set of options is the same for all of the following questions.)
Q4.1
2 Points
Linear regression (without regularization)
p(y ∣ θ)
p(x ∣ θ)
p(y ∣ x, θ)
p(x ∣ y, θ)
p(x ∣ y, θ)p(y ∣ θ)
p(y ∣ θ) exp(r(θ))
p(x ∣ θ) exp(r(θ))
p(y ∣ x, θ) exp(r(θ))
p(x ∣ y, θ) exp(r(θ))
p(x ∣ y, θ)p(y ∣ θ) exp(r(θ))
Q4.2
2 Points
Linear regression with L2 regularization
p(y ∣ θ)
p(x ∣ θ)
p(y ∣ x, θ)
p(x ∣ y, θ)
p(x ∣ y, θ)p(y ∣ θ)
p(y ∣ θ) exp(r(θ))
p(x ∣ θ) exp(r(θ))
p(y ∣ x, θ) exp(r(θ))
p(x ∣ y, θ) exp(r(θ))
p(x ∣ y, θ)p(y ∣ θ) exp(r(θ))
Q4.3
2 Points
Logistic regression with L1 regularization
p(y ∣ θ)
p(x ∣ θ)
p(y ∣ x, θ)
p(x ∣ y, θ)
p(x ∣ y, θ)p(y ∣ θ)
p(y ∣ θ) exp(r(θ))
p(x ∣ θ) exp(r(θ))
p(y ∣ x, θ) exp(r(θ))
p(x ∣ y, θ) exp(r(θ))
p(x ∣ y, θ)p(y ∣ θ) exp(r(θ))
Q4.4
2 Points
Neural networks for classification (no regularization)
p(y ∣ θ)
p(x ∣ θ)
p(y ∣ x, θ)
p(x ∣ y, θ)
p(x ∣ y, θ)p(y ∣ θ)
p(y ∣ θ) exp(r(θ))
p(x ∣ θ) exp(r(θ))
p(y ∣ x, θ) exp(r(θ))
p(x ∣ y, θ) exp(r(θ))
p(x ∣ y, θ)p(y ∣ θ) exp(r(θ))
Q4.5
2 Points
Neural networks for regression with L2 regularization
p(y ∣ θ)
p(x ∣ θ)
p(y ∣ x, θ)
p(x ∣ y, θ)
p(x ∣ y, θ)p(y ∣ θ)
p(y ∣ θ) exp(r(θ))
p(x ∣ θ) exp(r(θ))
p(y ∣ x, θ) exp(r(θ))
p(x ∣ y, θ) exp(r(θ))
p(x ∣ y, θ)p(y ∣ θ) exp(r(θ))
Q5 Neural Networks 17 Points
Q5.1
2 Points
True or False: the decision boundary of any fully-connected neural network with a single hidden layer containing K nodes with sigmoid activations, and a softmax output with K classes can be recreated as a multi-class (multinomial) logistic regression hypothesis function.
True False
Q5.2
3 Points
Let f be a fully-connected neural network with input x ∈ RM , P hidden layers with K nodes per layer and sigmoid activations, and a single sigmoid output. Let g be the same network as f , except we insert another hidden layer with K nodes with no activation, so that g has P + 1 hidden layers. Denote this new layer Lnew . Assume that there are no bias terms for any layer, nor for the input.
Select ALL that apply:
f can learn the same decision boundary as g if the additional linear layer is placed…
Immediately after the input.
Immediately before the output.
Anywhere in between the above two choices.
None of the above.
Q5.3
2 Points
Continuing from Q5.2
Select one:
Assume that Lnew is placed in between two other hidden layers in g. How many additional parameters does g learn over f ?
K
K²
KP
KM
2K²
Q5.4
2 Points
Continuing from Q5.2
True or False:
After training both f and g to convergence, g can have a lower training loss than f.
True False
Q5.5
2 Points
Continuing from Q5.2
True or False:
After training both f and g to convergence, f can have a lower training loss than g.
True False
Q5.6
3 Points
You have trained a neural network on some data. However, you are running it on a computer from the 1990’s, and don’t have enough space to store it. Your friend suggests “trimming” some of the edges by removing any edge between two hidden layers that has an absolute weight less than some threshold. That is, for some threshold t, remove any edge where the corresponding weight, w, satisfies ∣w∣ < t. Assume that you have done hyperparameter searching to find the ideal value of t > 0 under the condition that at least one edge is trimmed.
After “trimming” your neural network, could validation loss increase?
Yes No
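A minimal sketch of the "trimming" operation described in Q5.6; the weight matrix and threshold below are made up for illustration:

```python
import numpy as np

def trim(weights, t):
    """Remove (zero out) every edge whose weight satisfies |w| < t."""
    return np.where(np.abs(weights) < t, 0.0, weights)

W = np.array([[0.8, -0.02, 0.3],
              [0.01, -0.5, 0.05]])
print(trim(W, t=0.1))
# Trimming changes the network's function, so validation loss can move either
# way: it may rise (a removed edge carried useful signal) or fall (small noisy
# weights were fitting noise, so pruning acts like regularization).
```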
Q5.7
0 Points
Continuing from Q5.6
Why or why not? Show your work if you would like it to be considered for partial credit.
Q5.8
3 Points
Continuing from Q5.6
After “trimming” your neural network, could validation loss decrease?
Yes No
Q5.9
0 Points
Continuing from Q5.8
Why or why not? Show your work if you would like it to be considered for partial credit.
Q6 Bias and Variance 12 Points
Suppose we have a decision function hθ(x), characterized by a 2- dimensional vector of parameters θ. We could plot all possible decision functions in 2D space as below.
The three plots correspond to three learning algorithms (a, b, and c) that use the same parameterized representation for hθ(x) but differ in how they are trained.
The solid dot labeled h∗ represents the function we wish to learn. The shaded region for each algorithm (labeled “h possible”) shows the actual learned decision functions when that algorithm was trained repeatedly, each time using a different set of input training examples x , but always labeled by the true h∗(x).
Assume each of the three figures above can be described as one of the following:
i) high variance, high bias
ii) high variance, low bias
iii) low variance, high bias
iv) low variance, low bias
Q6.1
2 Points
Which of these four descriptions corresponds to figure (a)
i) high variance, high bias
ii) high variance, low bias
iii) low variance, high bias
iv) low variance, low bias
Q6.2
2 Points
Which of these four descriptions corresponds to figure (b)
i) high variance, high bias
ii) high variance, low bias
iii) low variance, high bias
iv) low variance, low bias
Q6.3
2 Points
Which of these four descriptions corresponds to figure (c)
i) high variance, high bias
ii) high variance, low bias
iii) low variance, high bias
iv) low variance, low bias
Q6.4
2 Points
Here are the same figures again:
Which of the three figures corresponds to optimal regularization?
a
b
c
Q6.5
2 Points
Which of the three figures corresponds to too much regularization?
a
b
c
Q6.6
2 Points
Which of the three figures corresponds to not enough regularization?
a
b
c
Q7 Regularization and Feature Engineering 6 Points
Q7.1
2 Points
Which one of the following is a guaranteed consequence of regularization in linear regression? (the error metric is mean squared error)
Training error will increase or remain the same as the non-regularized model
Training error will decrease or remain the same as the non-regularized model
True error will increase or remain the same as the non-regularized model
True error will decrease or remain the same as the non-regularized model
Q7.2
2 Points
Which one of the following regularization methods will likely make irrelevant weights go to exactly 0 after training for enough iterations?
l1 regularization
l2 regularization
l0 regularization
None of the above
Q7.3
2 Points
It is possible to turn datasets that are originally not linearly separable into ones that are linearly separable through clever feature engineering.
True False
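A classic instance of the statement in Q7.3: XOR-style data in {−1, +1}² is not linearly separable in (x1, x2), but adding the engineered product feature x1·x2 makes it separable by a single threshold. A sketch:

```python
# XOR-style labels: positive exactly when x1 and x2 have different signs.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [x1 != x2 for x1, x2 in points]   # not separable in (x1, x2) alone

# In the new feature space that includes x1 * x2, the linear rule
# "x1 * x2 < 0" classifies every point correctly.
predictions = [x1 * x2 < 0 for x1, x2 in points]
print(predictions == labels)  # True
```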
Q8 MDP Properties 5 Points
Which of the following statements are true for an MDP?
For an infinite horizon MDP with a finite number of states and actions and with a discount factor γ that satisfies 0 < γ < 1, value iteration is guaranteed to converge.
When running value iteration, if the policy (the greedy policy with respect to the values) has converged, the values must have converged as well.
If one is using value iteration and the values have converged, the policy must have converged as well.
Value iteration will converge to the same vector of values (V*) no matter what values we use to initialize V.
If the only difference between two MDPs is the value of the discount factor, then they must have the same optimal policy.
None of the above
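To illustrate the convergence claims above, here is a value-iteration sketch on a made-up two-state, two-action deterministic MDP; with 0 < γ < 1 the Bellman update is a contraction, so it converges to the same V* from any initialization:

```python
# Made-up deterministic MDP: R[s][a] is the reward for action a in state s,
# T[s][a] is the successor state. Not from the exam.
R = [[1.0, 0.0], [0.0, 2.0]]
T = [[0, 1], [0, 1]]
gamma = 0.9

def value_iteration(V, iters=1000):
    """Repeatedly apply V(s) <- max_a [ R(s,a) + gamma * V(T(s,a)) ]."""
    for _ in range(iters):
        V = [max(R[s][a] + gamma * V[T[s][a]] for a in range(2))
             for s in range(2)]
    return V

# Two very different initializations converge to the same V* = [18, 20].
V_a = value_iteration([0.0, 0.0])
V_b = value_iteration([100.0, -50.0])
print([round(v, 6) for v in V_a])
print([round(v, 6) for v in V_b])
```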
Q9
22 Points
A robot is trying to get to its office hours, occurring on floors 3, 4, or 5 in the Gates building on campus. It is running a bit late and there are a lot of students waiting for it. There are three ways it can travel between floors in Gates: the stairs, the elevator, and the helix.
The state of the robot is the floor that it is currently on (either 3, 4, or 5).
The actions that the robot can take are stairs, elevator, or helix.
In this problem, we are using a linear, feature-based approximation of
the Q-values:
Qw(s, a) = ∑_{i=0}^{3} f_i(s, a) w_i
We define the feature functions as follows (the feature-function table, with the initial weights, was given as a figure not reproduced in this extraction):
Furthermore, the weights will be updated as follows:
wi ← wi + α [ r + γ max_{a′} Qw(s′, a′) − Qw(s, a) ] ∂Qw(s, a) / ∂wi
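The update rule above can be sketched as follows. For a linear Q-function, ∂Qw(s, a)/∂wi = f_i(s, a). The exam's actual feature functions and weights were given in a figure not reproduced here, so every number below is a stand-in; only the update mechanics match the formula:

```python
def q_value(w, features):
    """Q_w(s, a) = sum_i f_i(s, a) * w_i."""
    return sum(fi * wi for fi, wi in zip(features, w))

def q_update(w, feats_sa, reward, feats_next_by_action, alpha, gamma):
    """w_i <- w_i + alpha * [r + gamma * max_a' Q_w(s',a') - Q_w(s,a)] * f_i(s,a)."""
    target = reward + gamma * max(q_value(w, f) for f in feats_next_by_action)
    delta = target - q_value(w, feats_sa)
    return [wi + alpha * delta * fi for wi, fi in zip(w, feats_sa)]

# Stand-in numbers (NOT the exam's features or weights):
w = [1.0, 0.0, 0.5, 0.0]
feats_sa = [1.0, 0.0, 1.0, 0.0]                           # f(s, a)
feats_next = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # f(s', a') per action
print(q_update(w, feats_sa, reward=-2.0, feats_next_by_action=feats_next,
               alpha=0.25, gamma=0.6))
```

Note that all four weights move by α·δ·f_i in one step, since a single sample's TD error δ is shared across every active feature.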
Q9.1 Approximate Q-learning 6 Points
Calculate the following initial Q values given the initial weights above. You may write your numerical answers in unsimplified arithmetic.
Note: It is worth writing these down as you will use them in later parts.
Qw(4, elevator): 54
Qw(4, stairs): 15
Qw(4, helix): 16
Q9.2
3 Points
For this problem, the initial Q-values for state 3 happen to be equal to the corresponding initial Q-values for state 5.
In this problem, as you update the weights, will these values remain equal? I.e., will Qw (3, a) = Qw (5, a) given any action a and vector w?
Yes No
Why or why not?
s only matters through
1. the |s − 4| expression, where |5 − 4| = |3 − 4|, and
2. the emptiness calculations, where s = 3 and s = 5 always give the same values.
Q9.3
3 Points
Given the Q-values for state 4 calculated above, what are the probabilities that each of the following actions could be chosen when using ε-greedy exploration from state 4 (assume random movements are chosen uniformly from all actions)? Write your answers in terms of ε (you can write 'epsilon' or just 'e').
elevator: 1 − 2ε/3
stairs: ε/3
helix: ε/3
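These probabilities follow from how ε-greedy splits its mass: with probability 1 − ε take the greedy action, and with probability ε pick uniformly among all |A| actions (including the greedy one), so the greedy action gets (1 − ε) + ε/3 = 1 − 2ε/3. A quick check, using the state-4 Q-values above (elevator is greedy since 54 > 16 > 15); the helper name and the sample ε are ours:

```python
from fractions import Fraction

def epsilon_greedy_probs(q_values, greedy, eps):
    """P(a) = eps/|A| for every action, plus (1 - eps) for the greedy one."""
    n = len(q_values)
    return {a: eps / n + (Fraction(1) - eps if a == greedy else 0)
            for a in q_values}

q = {"elevator": 54, "stairs": 15, "helix": 16}
eps = Fraction(1, 10)  # illustrative choice of epsilon
probs = epsilon_greedy_probs(q, greedy="elevator", eps=eps)
print(probs)  # elevator: 1 - 2*eps/3; stairs and helix: eps/3 each
```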
Q9.4
8 Points
Repeating the feature functions here for convenience (table not reproduced in this extraction):
Given a sample with start state 3, action = stairs, successor state = 4, and reward = -2, update each of the weights using learning rate α = 0.25 and discount factor γ = 0.6. You should write your numerical answers in unsimplified arithmetic.
w0: 2.35
w1: 54.5
w2: 4.7
w3: 0.1
Q9.5
2 Points
What is one advantage of using approximate Q-learning instead of standard Q-learning? What is one disadvantage?
Advantage: Approximate Q-learning avoids storing huge Q-tables when the state space is large; this can be a great savings in memory.
Disadvantage: Approximate Q-learning needs a well-chosen approximation model. E.g., a linear model can be too limiting to approximate the Q-table, so some extra feature engineering might be needed.
Mock Exam 2 (CAUTION: once opened, you have 80 minutes to complete)
STUDENT: Art Zhu (submitted 1 day, 22 hours late)
TOTAL POINTS: - / 88 pts