Solutions
10-601 Machine Learning Spring 2021
Exam 2 Practice Problems April 11, 2021
Time Limit: N/A
Instructions:
Name: Andrew Email: Room: Seat: Exam Number:
• Fill in your name and Andrew ID above. Be sure to write neatly, or you may not receive credit for your exam.
• Clearly mark your answers in the allocated space on the front of each page. If needed, use the back of a page for scratch space, but you will not get credit for anything written on the back of a page. If you have made a mistake, cross out the invalid parts of your solution, and circle the ones which should be graded.
• No electronic devices may be used during the exam.
• Please write all answers in pen.
• You have N/A to complete the exam. Good luck!
1 Logistic Regression
1. [2 pts] If today I want to predict the probability that a student sleeps more than 8 hours on average (SA) given the course load (C), I will choose to use linear regression over logistic regression.
Circle one: True False
False. The outcome (whether a student sleeps more than 8 hours) is binary, so logistic regression, which models $P(y = 1 \mid x)$ directly and keeps predictions in [0, 1], is the appropriate choice.
2. Answer the following questions with brief explanations where necessary.
a) [2 pts] A generalization of logistic regression to a multiclass setting involves expressing the per-class probabilities as the softmax function
$$P(y = c \mid x) = \frac{\exp(w_c^T x)}{\sum_{d \in C} \exp(w_d^T x)},$$
where c is some class from the set of all classes C.
Consider a 2-class problem (labels 0 or 1). Rewrite the above expression for this situation, to end up with expressions for P (Y = 1|x) and P (Y = 0|x) that we have already come across in class for binary logistic regression.
$$P(y = 1 \mid x) = \frac{\exp(w_1^T x)}{\exp(w_0^T x) + \exp(w_1^T x)} = \frac{\exp((w_1 - w_0)^T x)}{1 + \exp((w_1 - w_0)^T x)} = \frac{\exp(w^T x)}{1 + \exp(w^T x)} = p,$$
where $w = w_1 - w_0$. Therefore,
$$P(y = 0 \mid x) = 1 - p = \frac{1}{1 + \exp(w^T x)}.$$
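A quick numerical check of this reduction (a minimal sketch; the weight vectors and input below are made-up values, not taken from the exam): the 2-class softmax and the sigmoid of $(w_1 - w_0)^T x$ produce the same probability.

```python
import numpy as np

def softmax_two_class(w0, w1, x):
    """P(y = 1 | x) from the 2-class softmax with per-class weights w0, w1."""
    scores = np.array([w0 @ x, w1 @ x])
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores[1] / exp_scores.sum()

def sigmoid_form(w0, w1, x):
    """Equivalent binary logistic form with w = w1 - w0."""
    w = w1 - w0
    return 1.0 / (1.0 + np.exp(-(w @ x)))

# Made-up example values.
w0 = np.array([0.2, -0.5, 1.0])
w1 = np.array([1.1, 0.3, -0.4])
x = np.array([1.0, 2.0, -1.5])
print(softmax_two_class(w0, w1, x), sigmoid_form(w0, w1, x))  # identical values
```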
b) [3 pts] Given 3 data points (1, 1), (1, 0), (0, 0) with labels 0, 1, 0 respectively, consider 2 models that compute $p(y = 1 \mid x)$: Model 1: $\sigma(w_1 x_1 + w_2 x_2)$; Model 2: $\sigma(w_0 + w_1 x_1 + w_2 x_2)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. Using the given data, we can learn parameters wˆ by maximizing the conditional log-likelihood. Suppose we switched (0, 0) to label 1 instead.
Do the parameters learnt for Model 1 change?
Circle one: True False
One-line explanation:
False. The parameters learnt for Model 1 don't change because $w_1 x_1 + w_2 x_2 = 0$ for (0, 0), so the model outputs $p = \sigma(0) = 0.5$ for that point irrespective of its label or the values of w; its contribution to the log-likelihood is therefore constant in w.
What about Model 2?
Circle one: True False One-line explanation:
True. This model has a bias term which remains non-zero for (0, 0), and can thus change the model depending on the label assigned.
c) [2 pts] For logistic regression, we need to resort to iterative methods such as gradient descent to compute the wˆ that maximizes the conditional log likelihood. Why?
There is no closed-form solution: setting the gradient of the conditional log-likelihood to zero yields a system of nonlinear equations in w that cannot be solved analytically.
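To make the "iterative methods" point concrete, here is a minimal sketch of gradient ascent on the conditional log-likelihood (equivalently, gradient descent on its negative); the toy data, step size, and iteration count are made up for illustration and are not prescribed by the exam.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood.
    X: (n, d) design matrix, y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)       # p(y = 1 | x_i; w) for every example
        grad = X.T @ (y - p)     # gradient of the conditional log-likelihood
        w += lr * grad / n       # ascend to (locally) maximize the likelihood
    return w

# Toy data (made up): the first column acts as an intercept feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(fit_logistic(X, y))
```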
d) [3 pts] Considering a Gaussian prior, write out the MAP objective function $J_{MAP}(w)$ in terms of the MLE objective $J_{MLE}(w)$. Name the variant of logistic regression this results in.
$J_{MAP}(w) = J_{MLE}(w) - \lambda \|w\|_2^2$. This is L2-regularized logistic regression.
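To see where the penalty comes from, here is a sketch of the derivation, assuming a zero-mean isotropic Gaussian prior $w \sim \mathcal{N}(0, \sigma^2 I)$ (the exam does not fix the prior's variance):
$$J_{MAP}(w) = \log p(\mathcal{D} \mid w) + \log p(w) = J_{MLE}(w) - \frac{1}{2\sigma^2} \|w\|_2^2 + \text{const},$$
so $\lambda$ plays the role of $\frac{1}{2\sigma^2}$: a larger prior variance (a flatter prior) corresponds to weaker regularization.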
3. Given a training set {(xi,yi),i = 1,…,n} where xi ∈ Rd is a feature vector and yi ∈ {0, 1} is a binary label, we want to find the parameters wˆ that maximize the likelihood for the training set, assuming a parametric model of the form
$$p(y = 1 \mid x; w) = \frac{1}{1 + \exp(-w^T x)}.$$
The conditional log likelihood of the training set is
$$\ell(w) = \sum_{i=1}^{n} \left[ y_i \log p(y_i \mid x_i; w) + (1 - y_i) \log(1 - p(y_i \mid x_i; w)) \right],$$
and the gradient is
$$\nabla \ell(w) = \sum_{i=1}^{n} (y_i - p(y_i \mid x_i; w))\, x_i.$$
a) [5 pts.] Is it possible to get a closed form for the parameters wˆ that maximize the conditional log likelihood? How would you compute wˆ in practice?
There is no closed-form expression for maximizing the conditional log likelihood. One has to resort to iterative optimization methods, such as gradient descent, to compute wˆ.
b) [5 pts.] For a binary logistic regression model, we predict y = 1 when p(y = 1|x) ≥ 0.5. Show that this is a linear classifier.
Using the parametric form for p(y = 1|x):
$$p(y = 1 \mid x) \ge \frac{1}{2} \iff \frac{1}{1 + \exp(-w^T x)} \ge \frac{1}{2} \iff 1 + \exp(-w^T x) \le 2 \iff \exp(-w^T x) \le 1 \iff -w^T x \le 0 \iff w^T x \ge 0,$$
so we predict yˆ = 1 if and only if wᵀx ≥ 0, which is a linear decision rule in x.
c) Consider the case with binary features, i.e., x ∈ {0,1}ᵈ ⊂ Rᵈ, where feature x1 is rare and happens to appear in the training set with only label 1. What is wˆ1? Is the gradient ever zero for any finite w? Why is it important to include a regularization term to control the norm of wˆ?
If a binary feature fired for only label 1 in the training set then, by maximizing the conditional log likelihood, we will make the weight associated with that feature infinite. This is because, whenever this feature is observed, we will want to predict 1 irrespective of everything else. (In particular, the gradient component for w1 is $\sum_i (y_i - p_i) x_{i,1} = \sum_i (1 - p_i) > 0$ over the examples where x1 fires, so it is never zero for any finite w.) This is undesired behaviour from the point of view of generalization performance, as most likely we do not believe this rare feature to have that much information about class 1; most likely, it is a spurious co-occurrence. Controlling the norm of the weight vector will prevent these pathological cases.
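A small sketch of this effect (the toy dataset, regularization strength, and step size below are made up for illustration): without regularization the weight on the rare feature keeps growing, while adding the $-\lambda\|w\|_2^2$ term (which contributes $-2\lambda w$ to the gradient) keeps it finite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_l2(X, y, lam=0.0, lr=0.5, n_iters=5000):
    """Gradient ascent on the L2-regularized conditional log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - 2.0 * lam * w  # the penalty pulls w back toward 0
        w += lr * grad
    return w

# Toy data: the first feature fires only on an example with label 1 (a rare feature).
X = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

print(fit_logistic_l2(X, y, lam=0.0))  # w[0] keeps growing as n_iters increases
print(fit_logistic_l2(X, y, lam=0.1))  # w[0] stays finite
```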
2 Feature Engineering and Regularization
1. Model Complexity: In this question we will consider the effect of increasing the model complexity, while keeping the size of the training set fixed. To be concrete, consider a classification task on the real line R with distribution D and target function c∗ : R → {±1}, and suppose we have a random sample S of size n drawn iid from D. For each degree d, let φd be the feature map given by $\varphi_d(x) = (1, x, x^2, \ldots, x^d)$ that maps points on the real line to (d + 1)-dimensional space.
Now consider the learning algorithm that first applies the feature map φd to all the training examples and then runs logistic regression as in the previous question. A new example is classified by first applying the feature map φd and then using the learned classifier.
a) [4 pts.] For a given dataset S, is it possible for the training error to increase when we increase the degree d of the feature map? Please explain your answer in 1 to 2 sentences.
No. Every linear separator using the feature map φd can also be expressed using the feature map φd+1, since we are only adding new features. It follows that the (optimal) training error cannot increase for any given sample S.
b) [4 pts.] Briefly explain in 1 to 2 sentences why the true error first drops and then increases as we increase the degree d.
When the dimension d is small, the true error is high because the target function is not well approximated by any linear separator in the φd feature space. As we increase d, our ability to approximate c∗ improves, so the true error drops. But as we continue to increase d, we begin to overfit the data and the true error increases again.
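A minimal sketch of the feature map $\varphi_d$ and the training-error trend; the 1-D dataset, the degrees tried, and the use of a least-squares fit as a stand-in for logistic regression are all illustrative choices, not part of the exam.

```python
import numpy as np

def poly_features(x, d):
    """Feature map phi_d(x) = (1, x, x^2, ..., x^d) for scalar inputs x."""
    return np.vander(x, N=d + 1, increasing=True)

# Toy 1-D dataset (made up): the target sign(sin(3x)) is not linearly separable in x.
rng = np.random.default_rng(0)
x_train = rng.uniform(-2.0, 2.0, size=40)
y_train = (np.sin(3.0 * x_train) > 0).astype(float)

for d in (1, 3, 5, 9):
    Phi = poly_features(x_train, d)
    # Least-squares fit thresholded at 0.5, used only to illustrate the trend:
    # the training error typically shrinks as d grows, since phi_{d+1} contains phi_d.
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_err = np.mean((Phi @ w > 0.5) != (y_train > 0.5))
    print(f"degree {d}: training error = {train_err:.2f}")
```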
3 Neural Networks
Figure 1: neural network
1. Consider the neural network architecture shown above for a 2-class (0, 1) classification problem. The values for weights and biases are shown in the figure. We define:
$$a_1 = w_{11} x_1 + b_{11}$$
$$a_2 = w_{12} x_1 + b_{12}$$
$$a_3 = w_{21} z_1 + w_{22} z_2 + b_{21}$$
$$z_1 = \mathrm{relu}(a_1)$$
$$z_2 = \mathrm{relu}(a_2)$$
$$z_3 = \sigma(a_3), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
Use this information to answer the questions that follow.
(i) [6 pts] For x1 = 0.3, compute z3 in terms of e. Show all work.
$$z_3 = \frac{1}{1 + e^{-0.15}}$$
(ii) [2 pts] To which class does the network predict the given data point (x1 = 0.3), i.e., yˆ = ? Note that yˆ = 1 if z3 > 1/2, else yˆ = 0.
Circle one: 0 1
yˆ(x1 = 0.3) = 1
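A minimal sketch of the forward pass used in parts (i) and (ii). Note that the weight and bias values below are placeholders, since the actual values appear only in Figure 1 and are not reproduced in the text.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x1, w11, b11, w12, b12, w21, w22, b21):
    """Forward pass matching the definitions above: a -> relu -> a3 -> sigmoid."""
    z1 = relu(w11 * x1 + b11)
    z2 = relu(w12 * x1 + b12)
    a3 = w21 * z1 + w22 * z2 + b21
    return sigmoid(a3)

# Placeholder parameters (NOT the values from Figure 1).
z3 = forward(x1=0.3, w11=1.0, b11=0.0, w12=-1.0, b12=0.5, w21=0.5, w22=0.5, b21=0.1)
print(z3, "-> predict 1" if z3 > 0.5 else "-> predict 0")
```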
(iii) [6 pts] Perform backpropagation on the bias b21 by deriving the expression for the gradient of the loss function L(y, z3) with respect to the bias term b21, $\frac{\partial L}{\partial b_{21}}$, in terms of the partial derivatives $\frac{\partial \alpha}{\partial \beta}$, where α and β can be any of L, zi, ai, bij, wij, x1 for all valid values of i, j. Your backpropagation algorithm should be as explicit as possible—that is, make sure each partial derivative $\frac{\partial \alpha}{\partial \beta}$ cannot be decomposed further into simpler partial derivatives. Do not evaluate the partial derivatives.
$$\frac{\partial L}{\partial b_{21}} = \frac{\partial L}{\partial z_3} \frac{\partial z_3}{\partial a_3} \frac{\partial a_3}{\partial b_{21}}$$
(iv) [6 pts] Perform backpropagation on the bias b12 by deriving the expression for the gradient of the loss function L(y, z3) with respect to the bias term b12, $\frac{\partial L}{\partial b_{12}}$, in terms of the partial derivatives $\frac{\partial \alpha}{\partial \beta}$, where α and β can be any of L, zi, ai, bij, wij, x1 for all valid values of i, j. Your backpropagation algorithm should be as explicit as possible—that is, make sure each partial derivative $\frac{\partial \alpha}{\partial \beta}$ cannot be decomposed further into simpler partial derivatives. Do not evaluate the partial derivatives.
$$\frac{\partial L}{\partial b_{12}} = \frac{\partial L}{\partial z_3} \frac{\partial z_3}{\partial a_3} \frac{\partial a_3}{\partial z_2} \frac{\partial z_2}{\partial a_2} \frac{\partial a_2}{\partial b_{12}}$$
2. In this problem we will use a neural network to classify the crosses (×) from the circles (◦) in the simple dataset shown in Figure 2a. Even though the crosses and circles are not linearly separable, we can break the examples into three groups, S1, S2, and S3 (shown in Figure 2a) so that S1 is linearly separable from S2 and S2 is linearly separable from S3. We will exploit this fact to design weights for the neural network shown in Figure 2b in order to correctly classify this training set. For all nodes, we will use the threshold activation function
$$\phi(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0. \end{cases}$$
Figure 2: (a) The dataset with groups S1, S2, and S3, plotted over x1 and x2 (each axis from 0 to 5). (b) The neural network architecture: inputs x1 and x2 feed hidden units h1 (weights w11, w12) and h2 (weights w21, w22), which feed the output y (weights w31, w32).
Figure 3: NN classification. Three blank copies of the x1–x2 axes (each from 0 to 5) for drawing the decision boundaries: (a) Set S2 and S3, (b) Set S1 and S2, (c) Set S1, S2 and S3.
(i) First we will set the parameters w11, w12 and b1 of the neuron labeled h1 so that its output h1(x) = φ(w11x1 + w12x2 + b1) forms a linear separator between the sets S2 and S3.
(a) [1 pt.] On Fig 3a, draw a linear decision boundary that separates S2 and S3.
(b) [1 pt.] Write down the corresponding weights w11, w12, and b1 so that h1(x) = 0 for all points in S3 and h1(x) = 1 for all points in S2. One solution would suffice and the same applies to (ii) and (iii).
w11 = −1, w12 = 0, b1 = 3
(ii) Next we set the parameters w21,w22 and b2 of the neuron labeled h2 so that its output h2(x) = φ(w21x1 + w22x2 + b2) forms a linear separator between the sets S1 and S2.
(a) [1 pt.] On Fig 3b, draw a linear decision boundary that separates S1 and S2.
(b) [1 pt.] Write down the corresponding weights w21, w22, and b2 so that h2(x) = 0 for all points in S1 and h2(x) = 1 for all points in S2.
w21 = 3, w22 = 1, b2 = −7
(iii) Now we have two classifiers h1 (to classify S2 from S3) and h2 (to classify S1 from S2). We will set the weights of the final neuron of the neural network based on the results from h1 and h2 to classify the crosses from the circles. Let h3(x) = φ(w31 h1(x) + w32 h2(x) + b3).
(a) [1 pt.] Compute w31, w32, b3 such that h3(x) correctly classifies the entire dataset.
w31 = 1, w32 = 1, b3 = −1.5
(b) [1 pt.] Draw your decision boundary in Fig 3c.
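A small sketch that wires up the three threshold units with the weights chosen above; the probe points are hypothetical, since the actual training points appear only in Figure 2a.

```python
import numpy as np

def step(z):
    """Threshold activation phi(z): 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def h3(x1, x2):
    """Two-layer threshold network using the weights from parts (i)-(iii)."""
    h1 = step(-1.0 * x1 + 0.0 * x2 + 3.0)   # h1 = 1 on S2, 0 on S3
    h2 = step(3.0 * x1 + 1.0 * x2 - 7.0)    # h2 = 1 on S2, 0 on S1
    return step(1.0 * h1 + 1.0 * h2 - 1.5)  # fires only when both h1 and h2 fire

# Hypothetical probe points, roughly one per group.
pts = np.array([[0.5, 4.0],   # plausibly in S1
                [2.5, 2.5],   # plausibly in S2
                [4.0, 1.0]])  # plausibly in S3
print(h3(pts[:, 0], pts[:, 1]))  # only the middle (S2-like) point is classified 1
```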
(iv) Backpropagation
In the above example, we need to learn the weights according to the data. As a first step, we need to compute the gradients of the parameters of the neural network.
Suppose there are m data points xi with labels yi, where i ∈ [1, m]; xi is a d × 1 vector and yi ∈ {0, 1}. We use the data to train a neural network with one hidden layer:
$$h(x) = \sigma(W_1 x + b_1), \qquad p(x) = \sigma(W_2 h(x) + b_2),$$
where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function, W1 is an n-by-d matrix, b1 is an n-by-1 vector, W2 is a 1-by-n matrix, and b2 is a 1-by-1 vector.
We use the cross-entropy loss function and minimize the negative log likelihood to train the neural network:
$$\ell = \frac{1}{m} \sum_i \ell_i = \frac{1}{m} \sum_i -\big(y_i \log p_i + (1 - y_i) \log(1 - p_i)\big),$$
where $p_i = p(x_i)$ and $h_i = h(x_i)$.
(a) Describe how you would derive the gradients w.r.t. the parameters W1, W2 and b1, b2. (No need to write out the detailed mathematical expression.)
Use the chain rule.
(b) When m is large, we typically use a small sample of the full data set to estimate the gradient; this is called stochastic gradient descent (SGD). Explain why we use SGD instead of gradient descent.
Each SGD update only requires a small sample rather than a full pass over all m examples, so it is much cheaper to compute; as a result, SGD typically converges faster than batch gradient descent in terms of total computation.
(c) Work out the following gradients: $\frac{\partial \ell}{\partial p_i}$, $\frac{\partial \ell}{\partial W_2}$, $\frac{\partial \ell}{\partial b_2}$, $\frac{\partial \ell}{\partial h_i}$, $\frac{\partial \ell}{\partial W_1}$, $\frac{\partial \ell}{\partial b_1}$. When deriving the gradient w.r.t. the parameters in the lower layers, you may assume the gradients in the upper layers are available to you (i.e., you can use them in your equations). For example, when calculating $\frac{\partial \ell}{\partial W_1}$, you can assume $\frac{\partial \ell}{\partial p_i}$, $\frac{\partial \ell}{\partial W_2}$, $\frac{\partial \ell}{\partial b_2}$, $\frac{\partial \ell}{\partial h_i}$ are known.
$$\frac{\partial \ell}{\partial p_i} = \frac{1}{m}\left(-\frac{y_i}{p_i} + \frac{1 - y_i}{1 - p_i}\right)$$
$$\frac{\partial \ell}{\partial W_2} = \frac{1}{m} \sum_i \frac{\partial \ell_i}{\partial p_i} \frac{\partial p_i}{\partial W_2} = \frac{1}{m} \sum_i \frac{\partial \ell_i}{\partial p_i}\, p_i (1 - p_i)\, h_i^T$$
$$\frac{\partial \ell}{\partial b_2} = \frac{1}{m} \sum_i \frac{\partial \ell_i}{\partial p_i}\, p_i (1 - p_i)$$
$$\frac{\partial \ell}{\partial h_i} = \frac{\partial \ell_i}{\partial p_i} \frac{\partial p_i}{\partial h_i} = \frac{\partial \ell_i}{\partial p_i}\, p_i (1 - p_i)\, W_2^T$$
$$\frac{\partial \ell}{\partial W_1} = \frac{1}{m} \sum_i \frac{\partial \ell_i}{\partial h_i} \frac{\partial h_i}{\partial W_1} = \frac{1}{m} \sum_i \left(\frac{\partial \ell_i}{\partial h_i} \circ h_i \circ (1 - h_i)\right) x_i^T$$
$$\frac{\partial \ell}{\partial b_1} = \frac{1}{m} \sum_i \frac{\partial \ell_i}{\partial h_i} \frac{\partial h_i}{\partial b_1} = \frac{1}{m} \sum_i \frac{\partial \ell_i}{\partial h_i} \circ h_i \circ (1 - h_i)$$
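A minimal NumPy sketch of these gradients, vectorized over the m examples; the variable names and random test shapes are my own, and the 1/m factor is applied consistently to every term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(W1, b1, W2, b2, X, y):
    """Gradients of the averaged cross-entropy loss for the one-hidden-layer net.
    Shapes: W1 (n, d), b1 (n, 1), W2 (1, n), b2 (1, 1), X (m, d), y (m,)."""
    m = X.shape[0]
    H = sigmoid(X @ W1.T + b1.T)                 # (m, n) hidden activations h_i
    p = sigmoid(H @ W2.T + b2).ravel()           # (m,)   outputs p_i

    dl_dp = (-y / p + (1.0 - y) / (1.0 - p)) / m     # dl/dp_i (1/m folded in)
    delta_out = dl_dp * p * (1.0 - p)                # dl/dp_i * p_i(1 - p_i)
    dW2 = (delta_out @ H).reshape(W2.shape)          # sum_i (...) h_i^T
    db2 = delta_out.sum(keepdims=True).reshape(b2.shape)

    dH = np.outer(delta_out, W2.ravel())             # dl/dh_i rows (via W2^T)
    delta_hid = dH * H * (1.0 - H)                   # the Hadamard products above
    dW1 = delta_hid.T @ X                            # sum_i (...) x_i^T
    db1 = delta_hid.sum(axis=0, keepdims=True).T
    return dW1, db1, dW2, db2

# Tiny random problem just to check that all the shapes line up.
rng = np.random.default_rng(0)
m, d, n = 8, 3, 4
W1, b1 = rng.normal(size=(n, d)), np.zeros((n, 1))
W2, b2 = rng.normal(size=(1, n)), np.zeros((1, 1))
X, y = rng.normal(size=(m, d)), rng.integers(0, 2, size=m).astype(float)
print([g.shape for g in grads(W1, b1, W2, b2, X, y)])
```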
4 Reinforcement Learning
4.1 Markov Decision Process
Environment Setup (may contain spoilers for Shrek 1)
Lord Farquaad is hoping to evict all fairytale creatures from his kingdom of Duloc, and has one final ogre to evict: Shrek. Unfortunately all his previous attempts to catch the crafty ogre have fallen short, and he turns to you, with your knowledge of Markov Decision Processes (MDP’s) to help him catch Shrek once and for all.
Consider the following MDP environment where the agent is Lord Farquaad:
Figure 4: Kingdom of Duloc, circa 2001
Here's how we will define this MDP:
• S (state space): a set of states the agent can be in. In this case, the agent (Farquaad) can be in any location (row, col) and also in any orientation ∈ {N, E, S, W}. Therefore, a state is represented by a three-tuple (row, col, dir), and S is the set of all possible such tuples. Farquaad's start state is (1, 1, E).
• A (action space): a set of actions that the agent can take. Here, we will have just three actions: turn right, turn left, and move forward (turning does not change row or col, just dir). So our action space is {R,L,M}. Note that Farquaad is debilitatingly short, so he cannot travel through (or over) the walls. Moving forward when facing a wall results in no change in state (but counts as an action).
• R(s, a) (reward function): In this scenario, Farquaad gets a reward of 5 by moving into the swamp (the cell containing Shrek), and a reward of 0 otherwise.
• p(s′|s,a) (transition probabilities): We'll use a deterministic environment, so this will be 1 if s′ is reachable from s by taking a, and 0 if not.
1. What are |S| and |A| (size of state space and size of action space)?
|S| = 4 rows × 4 columns × 4 orientations = 64
|A| = |{R, L, M}| = 3
2. Why is it called a "Markov" decision process? (Hint: what is the assumption made with p?)
p(s′|s,a) assumes that s′ is determined only by s and a (and not any other previous states or actions).
3. What are the following transition probabilities?
p((1, 1, N) | (1, 1, N), M) =
p((1, 1, N) | (1, 1, E), L) =
p((2, 1, S) | (1, 1, S), M) =
p((2, 1, E) | (1, 1, S), M) =
p((1, 1, N) | (1, 1, N), M) = 1
p((1, 1, N) | (1, 1, E), L) = 1
p((2, 1, S) | (1, 1, S), M) = 1
p((2, 1, E) | (1, 1, S), M) = 0
4. Given a start position of (1, 1, E) and a discount factor of γ = 0.5, what is the expected discounted future reward from a = R? For a = L? (Fix γ = 0.5 for the following problems.)
For a = R we get $R_R = 5 \cdot \left(\frac{1}{2}\right)^{16}$ (it takes 17 moves for Farquaad to get to Shrek, starting with R, M, M, M, L, …).
For a = L, this is a bad move, and we need another move to get back to our original orientation, from which we can go with our optimal policy. So the reward here is:
$$R_L = \left(\frac{1}{2}\right)^2 \cdot R_R = 5 \cdot \left(\frac{1}{2}\right)^{18}$$
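Evaluating these expressions numerically (a quick arithmetic check):

```python
gamma = 0.5
R_R = 5 * gamma**16   # reward arrives on the 17th action, so it is discounted 16 times
R_L = 5 * gamma**18   # the wasted L (plus the R to undo it) adds two more discounts
print(R_R, R_L)       # ~7.6e-05 and ~1.9e-05
```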
5. What is the optimal action from each state, given that orientation is fixed at E? (if there are multiple options, choose any)
R R M R
R R L R
M R L R
M M L -
(some have multiple options, I just chose one of the possible ones)
6. Farquaad’s chief strategist (Vector from Despicable Me) suggests that having γ = 0.9 will result in a different set of optimal policies. Is he right? Why or why not?
Vector is wrong. While the reward quantity will be different (it is now $5 \cdot \left(\frac{9}{10}\right)^{16}$), the set of optimal policies does not change. (One can only assume that Lord Farquaad and Vector would be in cahoots: both are extremely nefarious!)
7. Vector then suggests the following setup: R(s, a) = 0 when moving into the swamp, and R(s, a) = −1 otherwise. Will this result in a different set of optimal policies? Why or why not?
It will not. While the reward quantity will be different, the set of optimal policies does not change. (Farquaad will still try to minimize the number of steps he takes in order to reach Shrek)
8. Vector now suggests the following setup: R(s, a) = 5 when moving into the swamp, and R(s, a) = 0 otherwise, but with γ = 1. Could this result in a different optimal policy? Why or why not?
This will change the policy, but not in Lord Farquaad’s favor. He will no longer be incentivized to reach Shrek quickly (since γ = 1). The optimal reward from each state is the same (5) and therefore each action from each state is also optimal. Vector really should have taken 10-301/601…
9. Surprise! Elsa from Frozen suddenly shows up. Vector hypnotizes her and forces her to use her powers to turn the ground into ice. The environment is now stochastic: since the ground is slippery, when choosing the action M, with a 0.2 chance Farquaad will slip and move two squares instead of one. What is the expected future-discounted reward from s = (2, 4, S)?
Recall that $R_{exp} = \max_a \mathbb{E}[R(s, a) + \gamma R_{s'}]$.
(notation might be different than in the notes, but conceptually, our reward is the best expected reward we can get from taking any action a from our current state s.)
10-601 Machine Learning Exam 2 Practice Problems – Page 14 of 16
In this case, our best action is obviously to move forward. So we get
Rexp = (expected value of going two steps) + (expected value of going one step)
$$E[\text{2 steps}] = p((4, 4, S) \mid (2, 4, S), M) \times R((4, 4, S), (2, 4, S), M) = 0.2 \times 5 = 1$$
$$E[\text{1 step}] = p((3, 4, S) \mid (2, 4, S), M) \times \left(R((3, 4, S), (2, 4, S), M) + \gamma R_{(3, 4, S)}\right),$$
where $R_{(3, 4, S)}$ is the expected reward from (3, 4, S). Since the best reward from there is obtained by choosing a = M, and we always end up at Shrek, we get
$$E[\text{1 step}] = 0.8 \times (0 + \gamma \times 5) = 0.8 \times 0.5 \times 5 = 2,$$
giving us a total expected reward of $R_{exp} = 1 + 2 = 3$.
(I will be very disappointed if this is not the plot of Shrek 5)
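A quick arithmetic check of the expected-reward computation above:

```python
gamma = 0.5
exp_two_steps = 0.2 * 5                 # slip straight into the swamp: reward 5 now
exp_one_step = 0.8 * (0 + gamma * 5)    # land one square short, collect 5 next step
print(exp_two_steps + exp_one_step)     # 3.0
```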
4.2 Value and Policy Iteration
1. Which of the following environment characteristics would increase the computational complexity per iteration for a value iteration algorithm? Choose all that apply:
A. Large Action Space
B. A Stochastic Transition Function
C. Large State Space
D. Unknown Reward Function
E. None of the Above
A and C (large action space and large state space). The computational complexity for value iteration per iteration is $O(|A||S|^2)$.
2. Which of the following environment characteristics would increase the computational complexity per iteration for a policy iteration algorithm? Choose all that apply:
A. Large Action Space
B. A Stochastic Transition Function
C. Large State Space
D. Unknown Reward Function
E. None of the Above
A and C again. The computational complexity for policy iteration per iteration is $O(|A||S|^2 + |S|^3)$.
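A minimal sketch of one synchronous value-iteration sweep over tabular arrays (a random toy MDP with my own variable names); the P @ V contraction is where the O(|A||S|^2) per-iteration cost comes from.

```python
import numpy as np

def value_iteration_sweep(P, R, V, gamma=0.9):
    """One synchronous Bellman backup.
    P: (|A|, |S|, |S|) transition probabilities, R: (|S|, |A|) rewards, V: (|S|,)."""
    Q = R.T + gamma * (P @ V)   # Q[a, s] = R[s, a] + gamma * sum_s' P[a, s, s'] V[s']
    return Q.max(axis=0)        # new V(s) = max_a Q(a, s)

# Tiny random MDP just to exercise the sweep.
rng = np.random.default_rng(0)
nS, nA = 5, 3
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
V = np.zeros(nS)
for _ in range(50):
    V = value_iteration_sweep(P, R, V)
print(V)
```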
3. In the image below is a representation of the game that you are about to play. There are 5 states: A, B, C, D, and the goal state. The goal state, when reached, gives 100 points as reward. In addition to the goal’s points, you also get points by moving to different states. The amount of points you get are shown next to the arrows. You start at state
B. To figure out the best policy, you use asynchronous value iteration with a decay (γ) of 0.9.
(i) When you first start playing the game, what action would you take (up, down, left, right) at state B?
Up
(ii) What is the total reward at state B at this time?
50 (immediate reward of 50, and future reward (value at state A) starts at 0)
(iii) Let’s say you keep playing until your total values for each state has converged. What action would you take at state B?
Take the action that moves to state C.
(iv) What is the total reward at state B at this time?
174 (30 from the immediate action, and 144 from the future reward (value at state C))
4.3 Q-Learning
1. For the following true/false, circle one answer and provide a one-sentence explanation:
(i) One advantage that Q-learning has over Value and Policy iteration is that it can account for non-deterministic policies.
Circle one: True False
False. All three methods can account for non-deterministic policies
(ii) You can apply Value or Policy iteration to any problem that Q-learning can be applied to.
Circle one: True False
False. Unlike the others, Q-learning doesn't need to know the transition probabilities (p(s′ | s, a)) or the reward function (r(s, a)) to train. This is its biggest advantage.
(iii) Q-learning is guaranteed to converge to the true value Q* for a greedy policy.
Circle one: True False
False. Q-learning converges only if every state will be explored infinitely. Thus, purely exploiting policies (e.g. greedy policies) will not necessarily converge to Q*, but rather to a local optimum.
2. For the following parts of this problem, recall that the update rule for Q-learning is:
$$w \leftarrow w - \alpha \left( q(s, a; w) - \left( r + \gamma \max_{a'} q(s', a'; w) \right) \right) \nabla_w q(s, a; w)$$
(i) From the update rule, let's look at the specific term $X = \left( r + \gamma \max_{a'} q(s', a'; w) \right)$.
Describe in English what is the role of X in the weight update.
An estimate of the true total (discounted) return Q*(s, a). This may get multiple answers, so grade accordingly.
(ii) Is this update rule synchronous or asynchronous?
Asynchronous
(iii) A common adaptation to Q-learning is to incorporate rewards from more time steps into the term X. Thus, our normal term $r_t + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}; w)$ would become $r_t + \gamma r_{t+1} + \gamma^2 \max_{a_{t+2}} q(s_{t+2}, a_{t+2}; w)$. What are the advantages of using more rewards in this estimation?
Incorporating rewards from multiple time steps allows for a more "realistic" estimate of the true total reward, since a larger percentage of it is from real experience. It can help with stabilizing the training procedure, while still allowing training at each time step (bootstrapping). This type of method is called N-Step Temporal Difference Learning.
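A minimal sketch of the update rule above with a linear function approximator $q(s, a; w) = w^T \phi(s, a)$; the feature vectors, step size, and discount below are made up for illustration.

```python
import numpy as np

def q_value(w, features):
    """Linear function approximation: q(s, a; w) = w . phi(s, a)."""
    return w @ features

def q_learning_update(w, phi_sa, r, phi_next_all, alpha=0.1, gamma=0.9):
    """One Q-learning step; phi_next_all holds phi(s', a') for every action a'."""
    target = r + gamma * max(q_value(w, phi) for phi in phi_next_all)  # the term X
    td_error = q_value(w, phi_sa) - target
    # For a linear q, the gradient of q(s, a; w) with respect to w is phi(s, a).
    return w - alpha * td_error * phi_sa

# Made-up feature vectors for a single transition (s, a, r, s').
w = np.zeros(3)
phi_sa = np.array([1.0, 0.5, 0.0])
phi_next_all = [np.array([0.0, 1.0, 1.0]), np.array([1.0, 0.0, 1.0])]
print(q_learning_update(w, phi_sa, r=1.0, phi_next_all=phi_next_all))
```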