Note – These are practice questions and in no way contribute to your grade in the course.
Mathematical expressions
For answers that include mathematical expressions, you can type your answer in plain text math as best you can or you can attempt to render LaTeX by including double dollar signs ($$) before and after your expression. Either way is totally fine.
For example, if the answer is the expression for the mean squared error of linear regression, $$\frac{1}{N} \sum_{i=1}^N (y^{(i)} - \theta^T x^{(i)})^2$$, you can type something like one of the following:
Plain text:
1/N sum i=1 to N (y^i - theta^T x^i)^2
Rendered:
$$\frac{1}{N} \sum_{i=1}^N (y^i - \theta^T x^i)^2$$ which renders as
$$\frac{1}{N} \sum_{i=1}^N \left(y^i - \theta^T x^i\right)^2$$
Q2 Multiple Choice 9 Points
Q2.1 Select the best choice 2 Points
Let $$V_k(s)$$ indicate the value of state s at iteration k in (synchronous) value iteration.
What is the relationship between $$V_{k+1}(s)$$ and $$\sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$, for any a ∈ actions?
Please indicate the most restrictive relationship that applies. For example, if x < y always holds, please use < instead of ≤. Selecting ? means it's not possible to assign any true relationship.
Select the best choice
$$V_{k+1}(s) \;\square\; \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$
=
<
>
≤
≥
?
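For intuition, here is a small numerical check in Python (not part of the original question). The toy two-state MDP below, including its transition probabilities and rewards, is entirely made up for illustration; the point is only to compare the synchronous backup $$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V_k(s')]$$ against the same sum evaluated for a single fixed action.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (all numbers made up for illustration).
# P[s, a, s'] is the transition probability, R[s, a, s'] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9
V = np.zeros(2)  # V_k(s), initialized to 0

for k in range(5):
    # backup[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V_k(s'))
    backup = (P * (R + gamma * V)).sum(axis=2)
    V_next = backup.max(axis=1)  # V_{k+1}(s) takes the max over actions
    # Because V_{k+1}(s) is a max over actions, it is >= the backup for any single action a.
    assert np.all(V_next[:, None] >= backup - 1e-12)
    V = V_next

print(V)
```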
Q2.2 Select the best choice 2 Points
Let Q(s, a) indicate the estimated Q-value of state-action pair (s, a) at some point during Q-learning. Now your learner gets reward r after taking action a at state s and arrives at state s′. Before updating the Q-values based on this experience, what is the relationship between Q(s, a) and $$r + \gamma \max_{a'} Q(s', a')$$?
Please indicate the most restrictive relationship that applies. For example, if x < y always holds, please use < instead of ≤. Selecting ? means it's not possible to assign any true relationship.
Select the best choice
$$Q(s, a) \;\square\; r + \gamma \max_{a'} Q(s', a')$$
=
<
>
≤
≥
?
Q2.3 Select all that apply 2 Points
During standard (not approximate) Q-learning, you get reward r after taking action North from state A and arriving at state B. You compute the sample $$r + \gamma Q(B, \text{South})$$, where South = $$\arg\max_a Q(B, a)$$.
Which of the following Q-values are updated during this step?
Select all that apply
Q(A, North)
Q(A, South)
Q(B, North)
Q(B, South)
None of the above
Q2.4 True/False 3 Points
In general, for Q-learning (standard/tabular Q-learning, not approximate Q-learning) to converge to the optimal Q-values, which of the following are true?
True or False: It is necessary that every state-action pair is visited infinitely often.
True
False
EXPLANATION
In order to ensure convergence in general for Q-learning, this has to be true. In practice, we generally care about the policy, which converges well before the values do, so it is not necessary to run it infinitely often.
True or False: It is necessary that the discount γ is less than 0.5.
True
False
EXPLANATION
The discount factor must be greater than 0 and less than 1; the relevant bound is 1, not 0.5.
True or False: It is necessary that actions get chosen according to arg max_a Q(s, a).
True
False
EXPLANATION
This would actually do rather poorly, because it is purely exploiting based on the Q-values learned thus far, and not exploring other states to try to find a better policy.
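To make the exploration point concrete, here is a minimal tabular Q-learning sketch with ε-greedy action selection (not part of the original question). The tiny random MDP, step count, and hyperparameters below are made-up placeholders; the relevant parts are the ε-greedy choice, which keeps every state-action pair being visited, and the standard update toward the sample r + γ max_a' Q(s', a').

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MDP used only to illustrate the update rule (not the one in this question).
n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, :] over next states
R = rng.normal(size=(n_states, n_actions))                        # reward for taking a in s

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
s = 0
for _ in range(10000):
    # Epsilon-greedy: mostly exploit argmax_a Q(s, a), but keep exploring so that
    # every state-action pair continues to be visited.
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # Standard tabular Q-learning update toward the sample r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))
```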
Q3 Logistic Regression 16 Points
Q3.1 Conditional Likelihood 4 Points
Given the following dataset, D, and a fixed parameter vector, θ, write an expression for the binary logistic regression conditional likelihood.
$$D = \{(x^{(1)}, y^{(1)} = 0), (x^{(2)}, y^{(2)} = 0), (x^{(3)}, y^{(3)} = 1), (x^{(4)}, y^{(4)} = 1)\}$$
Write your answer in terms of θ, x^(1), x^(2), x^(3), and x^(4).
Do not include y^(1), y^(2), y^(3), or y^(4) in your answer.
Don’t try to simplify your expression.
We have provided below the plain text and LaTeX version of the logistic regression hypothesis function, which may help you type up your answers quicker.
In your answer, you don’t have to worry about bold text or parentheses in the superscript. For example, x^1 rather than $$\mathbf{x}^{(1)}$$.
Plain text hypothesis function for input x: 1/(1 + exp(-theta^T * x))
LaTeX hypothesis function for input x: $$\frac{1}{1 + e^{-\theta^T x}}$$
Conditional likelihood:
EXPLANATION
$$\left(1 - \frac{1}{1 + e^{-\theta^T x^{(1)}}}\right)\left(1 - \frac{1}{1 + e^{-\theta^T x^{(2)}}}\right)\left(\frac{1}{1 + e^{-\theta^T x^{(3)}}}\right)\left(\frac{1}{1 + e^{-\theta^T x^{(4)}}}\right)$$
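As a sanity check, this conditional likelihood can be computed numerically. The sketch below assumes made-up values for θ and the four inputs x^(1) through x^(4); only its structure (factors of 1 − σ(θᵀx) for the y = 0 examples and σ(θᵀx) for the y = 1 examples) mirrors the expression in the explanation.

```python
import numpy as np

def sigmoid(z):
    # Logistic regression hypothesis: P(Y = 1 | x, theta) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameter vector and inputs, purely for illustration.
theta = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0],    # x^(1), label 0
              [0.5, 1.5],    # x^(2), label 0
              [2.0, -1.0],   # x^(3), label 1
              [1.5, -0.5]])  # x^(4), label 1
y = np.array([0, 0, 1, 1])

p1 = sigmoid(X @ theta)  # P(Y = 1 | x^(i), theta) for each example
# Product of (1 - p1) for the y = 0 examples and p1 for the y = 1 examples.
likelihood = np.prod(np.where(y == 1, p1, 1.0 - p1))
print(likelihood)
```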
Q3.2 Decision Boundary 4 Points
Write an expression for the decision boundary of binary logistic regression with a bias term for two-dimensional input features x1 ∈ R and x2 ∈ R and parameters b (the intercept parameter), w1 , and w2 .
Assume that the decision boundary occurs when P(Y = 1 ∣ x, b, w1, w2) = P(Y = 0 ∣ x, b, w1, w2).
Write your answer in terms of x1, x2, b, w1, and w2.
Decision boundary equation:
EXPLANATION
$$0 = b + w_1 x_1 + w_2 x_2$$
What is the geometric shape defined by this equation?
EXPLANATION
A line.
Q3.3 Decision Boundary 8 Points
We have now feature engineered the two-dimensional input, x1 ∈ R and x2 ∈ R, mapping it to a new input vector:

$$x = \begin{bmatrix} 1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$$

Write an expression for the decision boundary of binary logistic regression with this feature vector x and the corresponding parameter vector θ = [b, w1, w2]^T. Assume that the decision boundary occurs when P(Y = 1 ∣ x, θ) = P(Y = 0 ∣ x, θ).

Write your answer in terms of x1, x2, b, w1, and w2.

Decision boundary expression:

EXPLANATION

$$0 = b + w_1 x_1^2 + w_2 x_2^2$$

What is the geometric shape defined by this equation?

EXPLANATION

An ellipse. Probably decent partial credit for circle.
If we add an L2 regularization on [w1 , w2 ]T , what happens to parameters as we increase the λ that scales this regularization term?
If we add an L2 regularization on [w1 , w2 ]T , what happens to the decision boundary shape as we increase the λ that scales this regularization term?
EXPLANATION
The parameters shrink, so the ellipse will get bigger.
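A small sketch of the geometry behind this explanation (with made-up numbers, and assuming w1, w2 > 0 and b < 0 so that the boundary 0 = b + w1 x1² + w2 x2² really is an ellipse): shrinking the weights, which is what a larger L2 penalty tends to do, enlarges the semi-axes √(−b/w1) and √(−b/w2).

```python
import numpy as np

# Boundary from the explanation above: 0 = b + w1*x1^2 + w2*x2^2.
# With w1, w2 > 0 and b < 0 this is an ellipse with semi-axis sqrt(-b/w1)
# along x1 and sqrt(-b/w2) along x2, so smaller weights mean a bigger ellipse.
# The numbers below are made up purely for illustration.
b = -1.0
for scale in [1.0, 0.5, 0.25]:   # stand-in for increasing lambda shrinking w1, w2
    w1, w2 = 2.0 * scale, 1.0 * scale
    semi_axes = np.sqrt(-b / np.array([w1, w2]))
    print(f"w1={w1:.2f}, w2={w2:.2f} -> semi-axes {np.round(semi_axes, 2)}")
```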
Q4 Neural Networks 12 Points
Consider the following neural network for a 2-D input, x1 ∈ R and x2 ∈ R:
where:
all g functions are the same arbitrary non-linear activation function with no parameters
$$\ell(y, \hat{y})$$ is an arbitrary loss function with no parameters, and:
$$z_1 = w_A x_1 + w_B x_2, \quad a_1 = g(z_1)$$
$$z_2 = w_C a_1, \quad a_2 = g(z_2)$$
$$z_3 = w_D a_1, \quad a_3 = g(z_3)$$
$$z_4 = w_E a_2 + w_F a_3, \quad \hat{y} = g(z_4)$$
Note: There are no bias terms in this network.
Q4.1 Partial derivatives
4 Points
What is the chain of partial derivatives needed to calculate the derivative $$\frac{\partial \ell}{\partial w_E}$$?

Your answer should be in the form:

$$\frac{\partial \ell}{\partial w_E} = \frac{\partial ?}{\partial ?} \frac{\partial ?}{\partial ?} \cdots \frac{\partial ?}{\partial ?}$$

where each ? and the … are appropriately replaced.

Make sure each partial derivative $$\frac{\partial ?}{\partial ?}$$ in your answer cannot be decomposed further into simpler partial derivatives. Do not evaluate the derivatives. Be sure to specify the correct subscripts in your answer.

You may write your answer:

In plain text as:
dl/dwE = d?/d? * d?/d? * … d?/d?
ŷ can be written as y_hat

In LaTeX as:
$$\frac{d\ell}{dw_E} = \frac{d?}{d?} \frac{d?}{d?} … \frac{d?}{d?}$$
Typing d is fine; no need to use \partial, ∂
ŷ can be written as $$\hat{y}$$

$$\frac{\partial \ell}{\partial w_E} = $$
EXPLANATION
$$\frac{\partial \ell}{\partial w_E} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_4} \frac{\partial z_4}{\partial w_E}$$
Q4.2 Partial derivatives 4 Points
The network diagram from above is repeated here for convenience:
What is the chain of partial derivatives needed to calculate the derivative $$\frac{\partial \ell}{\partial w_C}$$?

Your answer should be in the form:

$$\frac{\partial \ell}{\partial w_C} = \frac{\partial ?}{\partial ?} \frac{\partial ?}{\partial ?} \cdots \frac{\partial ?}{\partial ?}$$

where each ? and the … are appropriately replaced.

Make sure each partial derivative $$\frac{\partial ?}{\partial ?}$$ in your answer cannot be decomposed further into simpler partial derivatives. Do not evaluate the derivatives. Be sure to specify the correct subscripts in your answer.

$$\frac{\partial \ell}{\partial w_C} = $$
EXPLANATION
$$\frac{\partial \ell}{\partial w_C} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_4} \frac{\partial z_4}{\partial a_2} \frac{\partial a_2}{\partial z_2} \frac{\partial z_2}{\partial w_C}$$
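The two chains above can be checked numerically. The sketch below picks a sigmoid activation for g, a squared-error loss for ℓ, and made-up weights and inputs, none of which are specified in the question (g and ℓ are arbitrary there); it then compares the hand-derived chain-rule products for ∂ℓ/∂wE and ∂ℓ/∂wC against finite differences.

```python
import numpy as np

def g(z):            # assumed activation; the question leaves g arbitrary
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

def forward(w, x1, x2):
    wA, wB, wC, wD, wE, wF = w
    z1 = wA * x1 + wB * x2; a1 = g(z1)
    z2 = wC * a1;           a2 = g(z2)
    z3 = wD * a1;           a3 = g(z3)
    z4 = wE * a2 + wF * a3; y_hat = g(z4)
    return z1, a1, z2, a2, z3, a3, z4, y_hat

def loss(y_hat, y):  # assumed loss; the question leaves l arbitrary
    return 0.5 * (y_hat - y) ** 2

# Made-up weights [wA, wB, wC, wD, wE, wF] and inputs, purely for illustration.
w = np.array([0.3, -0.2, 0.5, 0.1, -0.4, 0.7])
x1, x2, y = 1.0, 2.0, 1.0
z1, a1, z2, a2, z3, a3, z4, y_hat = forward(w, x1, x2)

dl_dyhat = y_hat - y          # dl/dy_hat for the squared-error loss
dyhat_dz4 = g_prime(z4)

# Q4.1: dl/dwE = dl/dy_hat * dy_hat/dz4 * dz4/dwE, with dz4/dwE = a2
dl_dwE = dl_dyhat * dyhat_dz4 * a2
# Q4.2: dl/dwC = dl/dy_hat * dy_hat/dz4 * dz4/da2 * da2/dz2 * dz2/dwC
dl_dwC = dl_dyhat * dyhat_dz4 * w[4] * g_prime(z2) * a1   # dz4/da2 = wE, dz2/dwC = a1

# Finite-difference check of both gradients.
eps = 1e-6
for idx, analytic in [(4, dl_dwE), (2, dl_dwC)]:
    w_plus = w.copy();  w_plus[idx] += eps
    w_minus = w.copy(); w_minus[idx] -= eps
    numeric = (loss(forward(w_plus, x1, x2)[-1], y) -
               loss(forward(w_minus, x1, x2)[-1], y)) / (2 * eps)
    print(idx, analytic, numeric)
```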
Q4.3 Regularization 4 Points
The gradient descent update step for weight wC is:
$$w_C \leftarrow w_C - \alpha \frac{\partial \ell}{\partial w_C}$$
where α (alpha) is the learning rate (step size).
Now, we want to change our neural network objective function to add an L2 regularization term on the weights. The new objective is:
$$\ell(y, \hat{y}) + \lambda \frac{1}{2} \|w\|_2^2$$
where λ (lambda) is the regularization hyperparameter and w is all of the weights in the neural network stacked into a single vector, $$w = [w_A, w_B, w_C, w_D, w_E, w_F]^T$$.
Write the right-hand side of the new gradient descent update step for weight wC given this new objective function. You may use $$\frac{\partial \ell}{\partial w_C}$$ in your answer.
Update: $$w_C \leftarrow$$ ____
EXPLANATION
Update for wC :
$$w_C \leftarrow w_C - \alpha \left( \frac{\partial \ell}{\partial w_C} + \lambda w_C \right)$$
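Written out as code, the regularized update is a one-liner; the learning rate, λ, weight, and gradient values below are made up for illustration.

```python
# Minimal sketch of the regularized update for a single weight (all values made up).
alpha, lam = 0.1, 0.01   # learning rate and regularization strength
w_C = 0.5
dl_dwC = -0.3            # stand-in for the backpropagated gradient dl/dw_C
# Gradient of l + (lambda/2)*||w||^2 with respect to w_C is dl/dw_C + lambda * w_C.
w_C = w_C - alpha * (dl_dwC + lam * w_C)
print(w_C)
```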
Q5 Value Iteration 8 Points
Consider training a robot to navigate the following grid-based MDP environment.
There are six states, A, B, C, D, E, and a terminal state T.
Actions from states B, C, and D are Left and Right.
The only action from states A and E is Exit, which leads deterministically to the terminal state.
The reward function is as follows:
R(A, Exit, T) = 10
R(E, Exit, T) = 1
The reward for any other tuple (s, a, s’) equals -1
Assume the discount factor is just 1.
When taking action Left, with 0.8 probability, the robot will successfully move one space to the left, and with 0.2 probability, the robot will move one space in the opposite direction.
Likewise, when taking action Right, with 0.8 probability, the robot will move one space to the right and with 0.2 probability, it will move one space to the left.
Q5.1
8 Points
Run (synchronous) value iteration on this environment for two iterations. Begin by initializing the value for all states, V0 (s), to zero.
Write the value of each state after the first (k = 1) and the second (k = 2) iterations. Write your values as a comma-separated list of 6 numerical expressions in the alphabetical order of the states, specifically V(A), V(B), V(C), V(D), V(E), V(T). Each of the six entries may be a number or an expression that evaluates to a number. Do not include any max operations in your response.
There is a space below to type any work that you would like us to consider. Showing work is optional. Correct answers will be given full credit, even if no work is shown.
V1(A), V1(B), V1(C), V1(D), V1(E), V1(T) (values for 6 states): 10, -1, -1, -1, 1, 0
V2(A), V2(B), V2(C), V2(D), V2(E), V2(T) (values for 6 states): 10, 6.8, -2, -0.4, 1, 0
What is the resulting policy after this second iteration? Write your answer as a comma-separated list of three actions representing the policy for states, B, C, and D, in that order. Actions may be Left or Right.
π(B), π(C), π(D) based on V2: Left, Left, Right
Optional work for this problem:
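For anyone who wants to check the two iterations and the resulting policy, here is a short value-iteration sketch that follows the transition and reward description above (states are represented as strings; γ = 1).

```python
# Value iteration for the MDP in Q5 (gamma = 1), to check V_1, V_2, and the policy above.
states = ["A", "B", "C", "D", "E", "T"]
left_of = {"B": "A", "C": "B", "D": "C"}
right_of = {"B": "C", "C": "D", "D": "E"}

def backup(V, s, a):
    """Expected one-step return: sum_{s'} P(s'|s,a) * (R(s,a,s') + V(s'))."""
    if s in ("A", "E"):                       # only action is Exit, reward 10 or 1
        return (10 if s == "A" else 1) + V["T"]
    intended, other = (left_of[s], right_of[s]) if a == "Left" else (right_of[s], left_of[s])
    return 0.8 * (-1 + V[intended]) + 0.2 * (-1 + V[other])

V = {s: 0.0 for s in states}                  # V_0(s) = 0
for k in (1, 2):
    V_new = {}
    for s in states:
        if s == "T":
            V_new[s] = 0.0                    # terminal state keeps value 0
        elif s in ("A", "E"):
            V_new[s] = backup(V, s, "Exit")
        else:
            V_new[s] = max(backup(V, s, a) for a in ("Left", "Right"))
    V = V_new
    print(f"V_{k}:", {s: round(v, 2) for s, v in V.items()})
    # Prints 10, -1, -1, -1, 1, 0 after k=1 and 10, 6.8, -2, -0.4, 1, 0 after k=2.

policy = {s: max(("Left", "Right"), key=lambda a: backup(V, s, a)) for s in ("B", "C", "D")}
print("policy:", policy)                      # expected: B -> Left, C -> Left, D -> Right
```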
Q6 MDP Settings and Policies 9 Points
Consider a 4×4 Grid World that follows the same rules as Grid World from lecture.
Specifically:
The shaded states have only one action, exit, which leads to a terminal state (not shown) and a reward with the corresponding numerical value printed in that state.
Leaving any other state gives a living reward, R(s) = r.
The agent will travel in the direction of its chosen action with probability 1 − n and will travel in one of the two adjacent directions with probability n/2 each. If the agent travels into a wall, it will remain in the same state.
Match the MDP setting below with the following optimal policies.
Note: We do not expect you to run value iteration to convergence to compute these policies but rather reason about the effect of different MDP settings.
Q6.1
3 Points
γ = 1.0, n = 0.2, r = 0.1
A
B
C
D
E
F
EXPLANATION
With positive living reward, the agent will try to stay alive as long as possible. With non-zero noise, n, it will avoid negative states if at all possible.
Q6.2
3 Points
γ = 1.0, n = 0, r = −0.1
A
B
C
D
E
F
EXPLANATION
With γ = 1, the policy will travel to the 10.0 state, and with zero noise, n, it doesn’t have to worry about slipping sideways into negative states.
Q6.3
3 Points
Figure repeated for convenience:
γ = 0.1, n = 0.2, r = −0.1
A
B
C
D
E
F
EXPLANATION
With low γ, the policy will prefer the closer 1.0 state, but with non-zero noise, it will avoid negative states if at all possible.