AA228/CS238: Decision Making under Uncertainty
Autumn 2016
Prof. Mykel J. Kochenderfer • Durand 255 • email: aa228-aut1617- .edu
MIDTERM CELEBRATION 2
Due date: November 16, 2016
You have 60 minutes to complete this exam. You may use one page of notes (front and back) but no
other resources. All questions are weighted equally.
1. Decision theory and games. Suppose we have a two-player game where there are exactly two Nash
equilibria. What is the possible range for the number of dominant strategy equilibria in this game?
2. Markov decision processes. Suppose we have a Markov decision process consisting of five states, s1:5,
and two actions, stay and continue. We have:
• T(si+1 | si, continue) = 1 for i ∈ {1, 2, 3, 4}
• T(s5 | s5, a) = 1 for all actions a
• T(si | si, stay) = 1 for i ∈ {1, 2, 3, 4}
• R(si, a) = 0 for i ∈ {1, 2, 3, 5} and all actions a
• R(s4, continue) = 10
• R(s4, stay) = 0
What does the discount factor have to be for the utility U(s1) to be 1?
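A minimal value-iteration sketch of the chain MDP above, in Python with numpy; the 0-based state indexing, the action encoding (0 = stay, 1 = continue), the candidate discount factor, and all variable names are illustrative assumptions, not the expected written solution.

import numpy as np

# States s1..s5 indexed 0..4; actions: 0 = stay, 1 = continue.
n_s, n_a = 5, 2
T = np.zeros((n_s, n_a, n_s))
R = np.zeros((n_s, n_a))
for i in range(4):
    T[i, 0, i] = 1.0        # stay keeps the agent in s_i
    T[i, 1, i + 1] = 1.0    # continue moves the agent to s_{i+1}
T[4, :, 4] = 1.0            # s5 is absorbing under both actions
R[3, 1] = 10.0              # R(s4, continue) = 10; all other rewards are 0

def value_iteration(gamma, iters=200):
    U = np.zeros(n_s)
    for _ in range(iters):
        U = np.max(R + gamma * (T @ U), axis=1)   # Bellman backup
    return U

gamma = 0.1 ** (1 / 3)              # candidate discount factor, about 0.4642
print(value_iteration(gamma)[0])    # U(s1) = 10 * gamma**3, about 1.0

Because continuing is never worse than staying for any γ > 0, U(s1) reduces to the reward at s4 discounted over the three steps needed to reach it from s1.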
3. Approximate dynamic programming. What is a disadvantage of sparse sampling over branch-and-
bound?
4. Exploration and exploitation. What is an advantage of interval estimation over softmax for exploration
in bandit problems?
5. Model-based reinforcement learning. What is an advantage of Thompson sampling over other model-
based approaches based on maximum-likelihood estimation?
6. Model-free reinforcement learning. Suppose we are doing Q-learning using a linear approximation for
Q(s, a) with basis function β(s, a) = [s, a, sa, s²]⊤. We initialize Q(s, a) to zero everywhere. When
transitioning from state 1 to state 2 by action 1, we observe a reward of 10. With a learning rate α
of 0.1 and discount rate γ of 0.9, write down the equation for computing Q(s, a) for any s and a after
performing the learning update.
Hint: the update rule for the SARSA algorithm using a linear approximation is
θ ← θ + α (rt + γ θ⊤β(st+1, at+1) − θ⊤β(st, at)) β(st, at)
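A minimal numeric sketch of this update in Python with numpy, assuming the Q-learning form of the hinted rule (the next-action term becomes a max over actions) and an illustrative two-action set {0, 1}, which is not stated in the problem; not the expected written answer.

import numpy as np

def beta(s, a):
    # Basis function from the problem: [s, a, sa, s^2]
    return np.array([s, a, s * a, s ** 2], dtype=float)

theta = np.zeros(4)              # theta = 0 gives Q(s, a) = 0 everywhere
alpha, gamma = 0.1, 0.9
s, a, r, s_next = 1, 1, 10.0, 2  # transition from state 1 to 2 by action 1, reward 10

# Q-learning bootstraps on the best next action; with theta = 0 this term is 0.
q_next = max(theta @ beta(s_next, a_next) for a_next in (0, 1))
theta += alpha * (r + gamma * q_next - theta @ beta(s, a)) * beta(s, a)

print(theta)   # [1. 1. 1. 1.]

With θ initialized to zero, the bootstrapped and current-estimate terms vanish, so the single update reduces to θ ← α rt β(st, at).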