Name: SUID:
AA228/CS238: Decision Making under Uncertainty
Autumn 2020
Prof. Mykel J. Kochenderfer • Remote • email:
MIDTERM 3 Due date: November 13, 2020 (5pm)
You have 90 minutes to complete this exam. This exam is electronically timed; you do not have to keep
track of your own time. To accommodate those in other time zones and complex working situations, you
may choose any 90-minute window between 5pm PST Nov 12th, 2020 and 5pm PST Nov 13th, 2020 to
take the exam. Answer all questions. You may consult any material (e.g., books, calculators, computer
programs, and online resources), but you may not consult other people inside or outside of the class. If you
need clarification on a question, please make a private post on Piazza. Only what is submitted prior to
the deadline will be graded.
Question 1. (1 pt) I understand the above instructions and will adhere to them.
Question 2. (5 pts) You enter a casino and there are two slot machines A and B, each with potentially
different winning probabilities. You have a uniform prior over the machines’ winning probabilities (i.e. you
do not have any prior belief that one is more likely than the other). You have been playing for a few rounds
and keeping track of the outcomes of each attempt, and want to use UCB1 exploration to select your actions.
Use the exploration parameter c = 10. Use a Bayesian approach to estimate your winning probabilities—
do not use a maximum likelihood approach.
a) (2 pts) If you have played on machine A a total of 10 times, winning 8 of those times, and you have
played on machine B a total of 4 times, winning 3 of those times, which machine would you play on
next? What is its estimated upper confidence bound?
b) (2 pts) If you have played on machine A a total of 10 times, winning 9 of those times, and you have
played on machine B a total of 10 times, winning 8 of those times, which machine would you play on
next? What is its estimated upper confidence bound?
c) (1 pt) In which of the above situations (part (a) and/or part (b)) did the UCB1 exploration strategy
lead to a non-greedy action? Why? Provide a quantitative argument.
Solution:
Recall that we compute the upper confidence bound for an action a using

UCB1(a) = ρa + c √( log N(s) / N(s, a) ),

where ρa is the posterior mean winning probability of action a. Here we have a single state (so the s in N(s) and N(s, a) will never change, and we will drop it from our expressions going forward).
a) Because we have a uniform prior we start by modeling ρA and ρB as distributed according to a beta
distribution Beta(1, 1). We then update this based on our experience to get posterior distributions of
the winning probabilities for each machine:
ρA ∼ Beta(9, 3)
ρB ∼ Beta(4, 2)
With a Bayesian approach, N comes from our pseudocounts: N = 12 + 6 = 18, N(aA) = 9 + 3 = 12,
and N(aB) = 4 + 2 = 6. With this, we can now compute the upper confidence bound for both actions:

UCB1(aA) = 9/(9 + 3) + c √( log(18) / 12 ) ≈ 5.66
UCB1(aB) = 4/(4 + 2) + c √( log(18) / 6 ) ≈ 7.61

This would lead us to select action aB since UCB1(aB) ≥ UCB1(aA). The estimated upper confidence
bound is 7.61.
Alternate Solution: We will also accept for full credit solutions that did not use the pseudocounts
in determining N(a) and N. In this approach, we have N = 10 + 4 = 14, N(aA) = 10, and N(aB) = 4,
which are the actual counts of the pulls that we performed. With these values we get:

UCB1(aA) = 9/(9 + 3) + c √( log(14) / 10 ) ≈ 5.89
UCB1(aB) = 4/(4 + 2) + c √( log(14) / 4 ) ≈ 8.79

In this case we still select action aB since UCB1(aB) ≥ UCB1(aA). The estimated upper confidence
bound is 8.79.
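As a sanity check, here is a minimal Python sketch (not part of the official solution; the helper name is illustrative) that reproduces the part (a) numbers for both the pseudocount and raw-count versions:

```python
import math

def ucb1(posterior_mean, n_total, n_action, c=10.0):
    """UCB1 score: posterior mean plus exploration bonus c*sqrt(log(N)/N(a))."""
    return posterior_mean + c * math.sqrt(math.log(n_total) / n_action)

rho_A, rho_B = 9 / (9 + 3), 4 / (4 + 2)   # posterior means of Beta(9,3) and Beta(4,2)

# Bayesian pseudocounts: N = 18, N(aA) = 12, N(aB) = 6.
print(ucb1(rho_A, 18, 12), ucb1(rho_B, 18, 6))    # ~5.66, ~7.61

# Actual pull counts: N = 14, N(aA) = 10, N(aB) = 4.
print(ucb1(rho_A, 14, 10), ucb1(rho_B, 14, 4))    # ~5.89, ~8.79
```

The same helper reproduces the part (b) values by swapping in the Beta(10, 2) and Beta(9, 3) posterior means and the corresponding counts.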
b) Because we have a uniform prior we start by modeling ρA and ρB as distributed according to a beta
distribution Beta(1, 1). We then update this based on our experience to get posterior distributions of
the winning probabilities for each machine:
ρA ∼ Beta(10, 2)
ρB ∼ Beta(9, 3)
With a Bayesian approach we use the pseudocounts, giving us N = 12 + 12 = 24, N(aA) = 10 + 2 = 12,
and N(aB) = 9 + 3 = 12. With this, we can now compute the upper confidence bound for both actions:

UCB1(aA) = 10/(10 + 2) + c √( log(24) / 12 ) ≈ 5.98
UCB1(aB) = 9/(9 + 3) + c √( log(24) / 12 ) ≈ 5.90

This would lead us to select action aA since UCB1(aA) ≥ UCB1(aB). The estimated upper confidence
bound is 5.98.
Alternate Solution: We will also accept for full credit solutions that did not use the pseudocounts in
determining N(a) and N. In this approach, we have N = 10 + 10 = 20, N(aA) = 10, and N(aB) = 10,
the actual counts of the pulls that we performed. With these values we get:

UCB1(aA) = 10/(10 + 2) + c √( log(20) / 10 ) ≈ 6.31
UCB1(aB) = 9/(9 + 3) + c √( log(20) / 10 ) ≈ 6.22

In this case we still select action aA since UCB1(aA) ≥ UCB1(aB). The estimated upper confidence
bound is 6.31.
c) In part (a), the UCB1 exploration strategy led to a non-greedy action. At each step, the greedy action is the one that maximizes
our expected utility. Since we represent the true probability of success as distributed according to
a beta distribution, the expected value of this distribution is the expected utility of taking that
action. For example, if ρA ∼ Beta(9, 3) and we receive a reward of 1 for a successful pull, our expected reward
for taking action aA equals the mean of this beta distribution, ρA = 9/(9 + 3) = 3/4. So to see whether we
took a greedy action, we look at ρA and ρB and check whether we took the action with the larger of the two.

In part (a), we had ρA = 3/4 and ρB = 2/3. With UCB1 exploration we decided to take action aB. Since
ρA > ρB, we took the non-greedy action.
In part (b), we had ρA = 5/6 and ρB = 3/4. With UCB1 exploration we decided to take action aA. Since
ρA > ρB, we took the greedy action.
For full credit, you need both (i) to say that the scenario in part (a) led to a non-greedy action
and (ii) to justify this by comparing the ρ values in the scenarios.
Question 3. (6 pts) You are standing in front of two closed doors. Behind one of the doors is a tiger and
behind the other is a large reward of 100. If you open the door with the tiger, then you receive a penalty of
−10. Instead of opening one of the two doors, you can listen, in order to gain some information about the
location of the tiger. Unfortunately, listening is not free (it has a cost of −1). In addition, listening is also
not entirely accurate. There are only two states sL and sR, which correspond to the tiger being behind the
left door and the tiger being behind the right door respectively.
The transition and observation models can be described in detail as follows. We can open one of the doors
with action aL (open left door) or aR (open right door). After we open a door and collect the associated
reward, the problem “resets”, meaning that we transition into state sL or state sR with equal probability.
The listen action aN does not change the state of the world. When the world is in state sL, the listen
action results in observation oL (i.e. hearing the tiger behind the left door) with probability 0.85 and the
observation oR with probability 0.15; similarly in world state sR, the listen action results in observation oR
with probability 0.85 and the observation oL with probability 0.15.
a) (2 pts) Draw the alpha vectors (as drawn in the book and in class) for each action, aL, aR, and aN, for
a one-step horizon. Be sure to label your axes.
b) (4 pts) Update the sL element of the aN alpha vector (corresponding to the listen action) using the Fast
Informed Bound (i.e. calculate the value of α^(2)_N(sL)), using your alpha vectors from (a) as α^(1)_L, α^(1)_R, and α^(1)_N.
Use discount factor γ = 0.9.
Solution:
a) We found the following alpha vectors by considering the reward from taking a single step for each
action in each of the two states we could be in. For instance, αL = [R(sL, aL), R(sR, aL)].

αL = [−10, 100]
αR = [100, −10]
αN = [−1, −1]
Depending on whether sL or sR is shown on the x-axis, we get one of the two following equally valid
plots. For each alpha vector α ∈ {αL, αR, αN}, we draw the line corresponding to α⊤ [p(sL), p(sR)]⊤
for different values of p(sL) or p(sR) on the x-axis.
[Figure: the alpha vectors αL, αR, and αN plotted as utility (ranging from −20 to 100) against p(sL) in the left plot and against p(sR) in the right plot.]
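For reference, a small Python sketch (assuming numpy and matplotlib are available; not part of the exam solution) that reproduces the left plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# One-step alpha vectors as (value at sL, value at sR).
alphas = {"aL": (-10, 100), "aR": (100, -10), "aN": (-1, -1)}

p_sL = np.linspace(0, 1, 101)
for name, (v_sL, v_sR) in alphas.items():
    # Expected utility under belief [p(sL), 1 - p(sL)] is the dot product with the alpha vector.
    plt.plot(p_sL, v_sL * p_sL + v_sR * (1 - p_sL), label=name)

plt.xlabel("p(sL)")
plt.ylabel("Utility")
plt.legend()
plt.show()
```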
b) Recall that the Fast Informed Bound (FIB) update equation is given by

α^(k+1)_a(s) = R(s, a) + γ Σ_o max_a′ Σ_s′ O(o | a, s′) T(s′ | s, a) α^(k)_a′(s′)
So we’ll apply this for a = aN and s = sL. We first note that the sum over s′ will disappear: with
the listening action we will remain in the same state, so T(s′ | s, a) will be 1 when s′ = sL and 0
otherwise. Removing the sum and plugging in a = aN and s = sL gives us the simplified expression

α^(2)_N(sL) = R(sL, aN) + γ Σ_o max_a′ O(o | aN, sL) α^(1)_a′(sL)
Now, we can expand the sum with our two observations oL and oR:

α^(2)_N(sL) = R(sL, aN) + γ ( max_a′ O(oL | aN, sL) α^(1)_a′(sL) + max_a′ O(oR | aN, sL) α^(1)_a′(sL) )
            = R(sL, aN) + γ ( max_a′ 0.85 α^(1)_a′(sL) + max_a′ 0.15 α^(1)_a′(sL) )
Now the only thing left, in order to resolve the max_a′ terms, is to evaluate 0.85 α^(1)_a′(sL) and
0.15 α^(1)_a′(sL) for all possible values of a′: aL, aR, and aN. We observe that in both cases, since we
are evaluating the alpha vector at sL, we will choose aR, whose alpha vector is [100, −10]. Choosing
a′ = aR, our expression becomes
α^(2)_N(sL) = R(sL, aN) + γ ( 0.85 α^(1)_R(sL) + 0.15 α^(1)_R(sL) )
            = −1 + γ (0.85 · 100 + 0.15 · 100)
            = −1 + 0.9 · 100 = 89
So this will be our updated value. Interestingly, our observation probabilities did not affect the final
result.
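A short Python sketch (illustrative only; the dictionary-based models are written directly from the problem statement) that carries out the same FIB update numerically:

```python
gamma = 0.9
# One-step alpha vectors from part (a): alpha1[a][s].
alpha1 = {"aL": {"sL": -10, "sR": 100},
          "aR": {"sL": 100, "sR": -10},
          "aN": {"sL": -1,  "sR": -1}}
# Listening keeps the state fixed: T(s' | sL, aN).
T_listen_from_sL = {"sL": 1.0, "sR": 0.0}
# Observation model while listening: O[o][s'].
O = {"oL": {"sL": 0.85, "sR": 0.15},
     "oR": {"sL": 0.15, "sR": 0.85}}

R_sL_aN = -1
alpha2_N_sL = R_sL_aN + gamma * sum(
    max(sum(O[o][sp] * T_listen_from_sL[sp] * alpha1[ap][sp] for sp in ("sL", "sR"))
        for ap in ("aL", "aR", "aN"))
    for o in ("oL", "oR"))
print(alpha2_N_sL)   # 89.0
```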
Question 4. (3 pts) You built a little robot and its first task is to localize itself. It has a map of the world,
but only a noisy lidar signal to detect distances to surrounding objects. Name a method it can use to perform
belief updates, and explain why it would be better than some of the other methods discussed in class.
Solution: You could use a particle filter to perform belief updates, which works with continuous state spaces
and does not impose restrictions on the dynamics of the system. This has an advantage over a Kalman filter
(which would impose a more restrictive model on the transition and observation functions). Particle filters
also can represent multimodal beliefs, while Kalman filters are restricted to a unimodal belief.
An extended Kalman filter can also be used to account for nonlinear dynamics, but it would still impose
assumptions around the noise being Gaussian. Similarly, an unscented Kalman filter can be used, which has
the advantage that it is a derivative-free approach. This, like the EKF, assumes Gaussian noise.
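To make the idea concrete, here is a generic particle-filter update sketch (the transition_sample and lidar_likelihood functions are hypothetical placeholders for the robot's motion and sensor models, not something defined in the course materials):

```python
import random

def particle_filter_update(particles, action, observation,
                           transition_sample, lidar_likelihood):
    """One belief update: propagate, weight by the observation, resample."""
    # Propagate each particle through the (possibly nonlinear, stochastic) dynamics.
    propagated = [transition_sample(state, action) for state in particles]
    # Weight each particle by how well it explains the noisy lidar observation.
    weights = [lidar_likelihood(observation, state) for state in propagated]
    total = sum(weights)
    if total == 0.0:
        return propagated              # degenerate case: keep the propagated set
    weights = [w / total for w in weights]
    # Resample with replacement to form the updated belief.
    return random.choices(propagated, weights=weights, k=len(particles))
```

Because the belief is just a set of samples, nothing here requires linear dynamics, Gaussian noise, or a unimodal belief.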
Question 5. (6 pts) Consider an environment with 2 states and 2 actions. An agent performs actions and
observes the rewards and transitions listed below. Each step consists of the current state s which can be s1
or s2, reward r, action a which can be a1 or a2, and next state s′ which can be s1 or s2.
Perform Q-learning after each step, and provide the values of the Q-function after each update. Use
learning rate α = 0.5 and a discount factor γ = 0.5 for each step. Initialize your Q-function with zeros for
all states and actions.
step s a r s′
1 s1 a1 −10 s1
2 s1 a2 −10 s2
3 s2 a1 20 s1
4 s1 a2 −10 s2
a) (4 pts) Report the values of the Q-function after each of the steps described in the table above.
b) (2 pts) What is the optimal policy after the last step?
Solution:
a) Recall the update rule for Q-learning:

Q(s, a) ← Q(s, a) + α ( r + γ max_a′ Q(s′, a′) − Q(s, a) )
Step 1:
Q(s1, a1) ← 0 + 0.5 (−10 + 0.5 max(0, 0) − 0) = −5
Q function (rows are states s1, s2; columns are actions a1, a2):
[ −5  0 ]
[  0  0 ]
Step 2:
Q(s1, a2) ← 0 + 0.5 (−10 + 0.5 max(0, 0) − 0) = −5
Q function:
[ −5  −5 ]
[  0   0 ]
Step 3:
Q(s2, a1) ← 0 + 0.5 (20 + 0.5 max(−5, −5) − 0) = 8.75
Q function:
[ −5    −5 ]
[ 8.75   0 ]
Step 4:
Q(s1, a2) ← −5 + 0.5 (−10 + 0.5 max(8.75, 0) − (−5)) = −5.3125
Q function:
[ −5    −5.3125 ]
[ 8.75   0      ]
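A few lines of Python (a sketch, not part of the official solution; states and actions are 0-indexed) that replay the four steps and reproduce the tables above:

```python
alpha, gamma = 0.5, 0.5
Q = [[0.0, 0.0], [0.0, 0.0]]            # rows: s1, s2; columns: a1, a2
steps = [(0, 0, -10, 0),                # (s, a, r, s') with s1 -> 0, a1 -> 0, etc.
         (0, 1, -10, 1),
         (1, 0,  20, 0),
         (0, 1, -10, 1)]

for s, a, r, sp in steps:
    Q[s][a] += alpha * (r + gamma * max(Q[sp]) - Q[s][a])
    print(Q)    # ends at [[-5.0, -5.3125], [8.75, 0.0]]
```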
b) Given that after Step 4 we have

Q(s1, a1) = −5        Q(s1, a2) = −5.3125
Q(s2, a1) = 8.75      Q(s2, a2) = 0
We conclude that the optimal policy is
πQ(s1) = a1
πQ(s2) = a1
since Q(s1, a1) > Q(s1, a2) and Q(s2, a1) > Q(s2, a2).
Question 6. (6 pts) Assume you are interested in an MDP with 5 discrete states, 3 discrete actions,
and unknown transition and reward functions. You decide to take a model-based approach and learn the
transition and reward functions.
a) (2 pts) What is an advantage of Bayesian model-based methods over maximum likelihood model-based
methods for reinforcement learning?
b) (2 pts) If you model your belief over the transition probabilities using a collection of independent
Dirichlet distributions, how many parameters will you need in total to describe these Dirichlet distributions?
c) (2 pts) If you have a strong belief that P (s′ | s1, a1) is a uniform distribution, give one possible
initialization of the pseudocounts for the Dirichlet distribution over P (s′ | s1, a1) that captures this
belief.
Solution:
a) Possible answers:
• A Bayesian approach will not assign 0 probability to rare transitions that are not observed in the
dataset.
• Some algorithms that depend on a Bayesian approach (like posterior sampling) build exploration
into the process, so you don’t have to pick an exploration strategy and/or tune exploration
hyperparameters.
Other advantages of a Bayesian approach over a maximum-likelihood approach will be accepted for full credit.
b) There are |S|²|A| = 75 parameters. Each state-action pair requires |S| = 5 parameters to describe
its Dirichlet distribution, and we have |S||A| = 5 · 3 = 15 state-action pairs, so we need

5 · (5 · 3) = 75

parameters in total.
c) We would want to initialize all pseudocounts to the same large number, for example Dir(100, 100, 100, 100, 100).
We will accept any solution where (i) all counts are equal and (ii) all counts are greater than or equal
to 2, although to fit the description of a “strong” belief you would typically want counts higher than 2.
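A tiny Python sketch (variable names are illustrative) of the counting in part (b) and the initialization in part (c):

```python
n_states, n_actions = 5, 3

# Part (b): one Dirichlet over next states per state-action pair,
# each described by |S| = 5 pseudocount parameters.
n_params = n_states * (n_states * n_actions)
print(n_params)                          # 75

# Part (c): equal, large pseudocounts encode a strong belief that
# P(s' | s1, a1) is uniform, e.g. Dir(100, 100, 100, 100, 100).
pseudocounts_s1_a1 = [100] * n_states
```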