Name: SUID:

AA228/CS238: Decision Making under Uncertainty
Autumn 2019
Prof. Mykel J. Kochenderfer • Durand 255 • email:

MIDTERM 2 Due date: November 5

You have 60 minutes to complete this exam. You may use one page of notes (front and back) but no other resources. All
questions are weighted equally. You may use your own scratch paper to work through the problems, but what you submit must fit
within the provided boxes. Please answer using a pen or a dark pencil. We will be scanning your exams to upload them to Gradescope
and need your answers to be visible. Please write your full name and SUID at the top of this page.

Question 1. Suppose we have discrete state and action spaces of size n and m, respectively. In general MDPs, the time complexity
of the policy improvement step in policy iteration is O(n²m). If we know that T(s′ | s, a) ∈ {0, 1}, can we provide a tighter complexity
bound? If so, what would it be?
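
For reference, the policy improvement (greedy) step computes

    π(s) ← argmax_a [ R(s, a) + γ Σ_{s′} T(s′ | s, a) U(s′) ]

for each of the n states; the maximization ranges over the m actions and the inner sum ranges over the n possible successor states.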

Question 2. We are building a mobile robot. The problem is governed by the following state variables: robot position x, robot
velocity v, and destination d. The robot controls its acceleration a. A reward of 1 is given when the robot reaches its destination.
When not at its destination, the robot receives a penalty that is proportional to the magnitude of its acceleration. Draw the structure
of a sensible factored representation of this problem, including all the relevant chance (shown with circles), decision (shown with
squares), and utility nodes (shown with diamonds). Do not draw informational edges.

Question 3. Choose two approximate dynamic programming algorithms that use a generative model. Describe two properties for
each algorithm.

Question 4. Consider a two-armed bandit problem where we have two pulls. We will take a Bayesian approach with a uniform
distribution as a prior over the probability of payoff for both arms. The potential payoff for each arm is 1. We will use U∗(w1, ℓ1, w2, ℓ2)
to represent the expected total future payoff of the optimal policy given wi wins and ℓi losses from arm i, respectively. Write down
the value for U∗(1, 0, 0, 0).
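
As a reminder of the prior involved: a uniform prior over a payoff probability is a Beta(1, 1) distribution, so after observing some
wins and losses the posterior over that arm's payoff probability is Beta(1 + wins, 1 + losses). A minimal Python sketch of the
resulting posterior mean (the function name is illustrative only, not part of the course software):

    from fractions import Fraction

    def posterior_mean(wins, losses):
        """Mean of the Beta(1 + wins, 1 + losses) posterior obtained by updating
        a uniform Beta(1, 1) prior over an arm's payoff probability."""
        return Fraction(1 + wins, 2 + wins + losses)

    # For example, an arm with two wins and one loss has posterior mean 3/5.
    assert posterior_mean(2, 1) == Fraction(3, 5)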

Question 5. Suppose we are interacting with a 10 × 10 grid world (similar to what was presented in class and in the textbook), where
we have four actions available: up, down, left, and right. With probability θ, we go in the direction corresponding to the name of
the action; otherwise, we go in one of the other directions with uniform probability. In class and in the book, θ was set to 0.7, but
for this question we assume a uniform distribution over θ prior to interacting with the environment. The true value of θ is globally
applicable, i.e., it does not depend on which cell we are in. We know the reward function exactly. Explain how we would apply
Thompson sampling to decide how to select actions in this environment.
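
One way to picture the posterior bookkeeping such an approach requires is sketched below in Python; the counters, helper name, and
surrounding comments are illustrative assumptions, not part of the course software.

    import random

    # Hypothetical counters: transitions that did / did not move in the intended
    # direction, observable because the grid layout and the chosen action are known.
    matches, mismatches = 0, 0

    def sample_theta():
        """Draw θ from its posterior: a uniform prior updated with the observed
        match/mismatch counts gives a Beta(1 + matches, 1 + mismatches) posterior."""
        return random.betavariate(1 + matches, 1 + mismatches)

    # At each step (or episode) one would: sample θ with sample_theta(), solve the
    # grid-world MDP whose transition model uses that sampled θ (the reward function
    # is known exactly), act according to the resulting policy, and then update the
    # counters based on whether the observed move matched the intended direction.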

Question 6. Is the computational complexity of Sarsa(λ) per experience tuple (sₜ, aₜ, rₜ, sₜ₊₁) greater than that of Sarsa? If so,
explain why one might still want to use Sarsa(λ).
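
For concreteness, here is a minimal sketch of one tabular Sarsa(λ) update with accumulating eligibility traces; the array shapes and
hyperparameter values are assumptions made for the illustration, not course code.

    import numpy as np

    def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=0.95, lam=0.9):
        """Apply one Sarsa(λ) update in place to the Q-table and eligibility
        traces E, both |S| x |A| arrays, for a single experience tuple."""
        delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # one-step TD error (as in Sarsa)
        E[s, a] += 1.0                                   # accumulate the trace for (s, a)
        Q += alpha * delta * E                           # update every (s, a) entry
        E *= gamma * lam                                 # decay every trace
        return Q, E

    # Toy usage with 5 states and 2 actions:
    Q, E = np.zeros((5, 2)), np.zeros((5, 2))
    Q, E = sarsa_lambda_update(Q, E, s=0, a=1, r=1.0, s_next=2, a_next=0)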
