代写 R algorithm GUI math graph COMS 4701: Artificial Intelligence

COMS 4701: Artificial Intelligence
Homework 4
Due: November 1, 2019
1 (Q-)Learning All About Bandits (20 points)
Recall that the multi-armed bandit problem is a special case of active reinforcement learning. In the simplest form, there are no states; the Q-values Qt(a) keep track of our evaluation of each possible action a after taking a specified action at at a given timestep t. We will choose an action according to the upper-confidence-bound (UCB) exploration function:
at =argmax􏰀Qt−1(a)+c􏰂lnt/Nt(a)􏰁. a
Here c is an adjustable parameter and Nt(a) is the number of times that we took action a prior to time t. Transitions are just action-reward pairs (at,rt), and we can learn the Q-values using the simplified TD update rule
Qt(at) = Qt−1(at) + α(rt − Qt−1(at)),
where α is the learning rate as usual. For this problem, we will consider a 3-armed bandit with
fixed rewards as follows: R(A) = −2, R(B) = 0, and R(C) = 2.
(a) Suppose we set α = 0.5 and c = 1. We have just tried each action once, giving us Q-values Q3(A) = −1, Q3(B) = 0, and Q3(C) = 1. From here, find the next three transitions (at,rt) as well as the corresponding Q-value updates Qt(at) for t = 4,5,6. Be sure to perform the Q-value update after each corresponding transition, not all at the end.
(b) Repeat the previous part (starting at t = 4) with c = 5. How does c impact how we view exploration vs exploitation?
(c) Let’s fast forward in time. Suppose that our Q-values have converged to their true values and that we have tried each action exactly 10 times (perhaps due to some other exploration function): Q30(A) = −2, Q30(B) = 0, Q30(C) = 2, and N30(a) = 10. Following the UCB exploration function from here and using c = 1 again, what is the next timestep at which we try an action other than C? You may write out any nonlinear equations and solve them programmatically.
(d) What can you say about the limiting behavior of exploration vs exploitation when using the UCB exploration function? Do we ever truly stop exploring if implementing on a real system? Explain why or why not.
1

2 Creepy App (20 points)
A certain app is deciding whether to show you a personalized ad, represented by a binary random variable A. It can possibly consider two factors: whether you browse the virtual marketplace and whether you make a trip to the physical store location (it can track your location!). These decisions are represented by binary random variables V and P .
You will decide which activity to do (or both or neither) based on whether or not it rains (R) tomorrow, which is forecast with probability 0.4. Suppose that if it rains (+r), the probability of you browsing online (+v) is 0.8 while going out to the store (+p) is 0.1. If it does not rain (−r), you will browse online with probability 0.3 and go to the store with probability 0.6. You can assume that P and V are conditionally independent, given R. The joint probabilities of your two behaviors and an ad popping up are as follows:
V
P
A
Pr(V,P,A)
+v +v +v +v −v −v −v −v
+p +p −p −p +p +p −p −p
+a −a +a −a +a −a +a −a
0.12 0.08 0.18 0.12 0.06 0.14 0.09 0.21
(a) (b) (c)
(d)
3
Compute the marginal distributions Pr(V) and Pr(P) and joint distribution Pr(V,P) for your two activities. Are V and P independent?
Compute Pr(A | V), the table showing probabilities of an ad popping up given only your online activity. What can you say about the relationship between A and P , given V ?
Compute the table of probabilities P r(A | R) for an ad popping given the possibility of rain. Use the assumption that A and R are conditionally independent given P and V . (Hint: Use what you found in (b) to simplify the math. You also have access to P r(V | R).)
Compute P r(+r | +a); what is the likelihood that it’s raining given that an ad pops up? Cleaning Your Apartment (60 points)
You just bought a robot vacuum cleaner, which also has a floor plan of your apartment. The robot knows the locations of dirty spots to clean, but the room also has impassable walls and furniture, as well as accessible rugs or stairways that can damage the sweeper. Unfortunately, the robot’s motors are also defective; it usually moves forward when it intends to do so, but it may occasionally move to the left or right instead. We will implement a MDP model to figure out the best policy for this robot.
tGetting started: Download the code files provided. You will first work with YOURUNI MDP.py; the other three will be used in the last part of the assignment. This file implements a FloorPlan class. FloorPlans are represented as 2D grids. Subsets of grid locations (represented as coordi- nate pairs) are stored in the (free) states, blocked, (terminal) goals, and (terminal) traps fields.1
1Note that we will not treat the “special” grid locations as states. This will enable you to simply loop over the “regular” states without worrying about terminal tests in your implementations.
2

Other fields include the room dimensions, rewards of different “state” types, discount factor, action set, and transition success probability. The four cardinal directions comprise the available actions. The success probability p indicates the likelihood of the robot moving in its intended direction, while slipping left or right of the intended direction occurs with probability 1 (1 − p) each.
2
There are a number of methods, most of which you don’t need to worry about. Some of them are book-keeping methods, while others implement the transition model by finding valid actions and then resolving a selected action according to the current state and success probabilities. The utility functions may be useful if you want to conduct further testing of your implementations, and you can call these from the main function following our provided examples.
First run the original uncommented code in the main method, which will set up a one-room floor plan and print it out. Blocked cells (walls and furniture) are represented by #, goals by G, and traps (stairways and rugs) by T. A random policy is also shown on a second printout of the map, which indicates actions using arrow symbols (∧, V, <, >). This policy is tested by simulating a number of trials (default 500) starting from an initial state and then following it. The average total reward from these trials is shown, as is the success rate (finishing in a goal instead of a trap).2
3.1 Value Iteration (20 points)
Implement the functions value iteration() and extract policy(). Value iteration returns a dic- tionary of the form {state:value} and the number of iterations taken; an iteration is defined as a single update for all state values. You should initialize all values at 0 and loop over self.states and self.actions. The Bellman update can be computed by calling self.Qvalue(). Two ver- sions of the Bellman update should be implemented. If asynch is True, values should be updated in-place; otherwise, you should maintain an old copy of the values for each iteration (Figure 17.4 in AIMA). Convergence is achieved when the maximum change among all values does not exceed ε(1 − γ)/γ (also shown in Figure 17.4).
Once you have the values, a policy can be extracted, which essentially follows the policy improve- ment step of the policy iteration algorithm. Implement this in extract policy(), which takes in a dictionary of states to values and returns a policy dictionary of the form {state:action}. You will again want to refer to self.Qvalue() to efficiently find Q-values.
For the one-room test case and the model parameters provided, you should see the following values after convergence:
2.17 2.32 1.92 1.56 1.22 0.98
2.59 2.85 0.00 0.00 0.98 1.21
3.06 3.42 2.94 1.23 0.81 1.03
3.58 4.14 3.52 -10.00 -10.00 -10.00
4.14 5.00 4.13 -10.00 -10.00 -10.00
3.58 4.14 3.53 3.06 3.51 0.00
3.06 3.43 3.07 3.51 4.19 0.00
-10.00 -10.00 -10.00 0.00 5.00 0.00
0.00 0.00
1.58 0.00
1.98 2.39
2.44 2.91
2.97 3.51
3.59 4.20
4.15 5.00
3.63 4.20
2Occasionally test random policy() may get stuck, since a random policy may generate infinite loops. After you feel that you understand the code sufficiently, you can comment this block out so that you don’t run it every time you test your own code.
3

The converged policy should be easy to assess in that it should avoid traps and push toward goals. You should also see a success rate close to 99% and an average reward of around 3.7.
3.2 Policy Iteration (20 points)
You already have policy improvement in extract policy(), so you just need to implement policy evaluation and put them together to do policy iteration. We will do this iteratively. Implement policy eval sweep(), which takes in a policy and a set of current values (both dictionaries as above) and performs a single update on each value. The only loop in this function should be one over each state-value pair. You will again want to use self.Qvalue() and implement two versions based on the asynch flag. The updated values dictionary is then returned.
policy iteration() will take an initial policy and alternate between evaluation and improve- ment to generate an optimal policy (again, represented as a dictionary). It takes in a parameter k that specifies how many sweeps of iterative policy evaluation to perform before improving it. Con- vergence is achieved when no change occurs between a current policy and an updated one across allstates.3 Aswithvalueiteration,thenumberofiterationsisalsoreturnedbythismethod,where an iteration is defined as both the k iterative evaluation steps and one improvement step together.
You can check that your policy iteration is correct by running the test code and verifying that your results are identical to those of value iteration.
3.3 Q-Learning (15 points)
The robot has been corrupted and has lost all memory of the environment. It will now explore the room and come up with an optimal policy based on its observations. The Qlearner.py file defines a Qlearner class. Fields include the usual Q-learning parameters (α, ε, γ, Q0), as well as a set of states and Q-values, the latter implemented as a {(state, action): Q-value} dictionary.
Implement the update() method, which performs one transition, updates the corresponding Q- value, and returns the successor state. First the agent should select an action according to the ε-greedy method, either following the current optimal policy or choosing a random action among valid actions(state). Once an action is chosen, the robot will move using the provided transition(state, action) function, which returns a successor state and reward. Use this in- formation to update self.Qvalues. When done, the successor state should be returned.
If implemented correctly, Q-learning can be tested back in the original file. We provided a sam- ple set of learning parameters and number of episodes, but you can change these to see if you can see any differences in the learning results. A decent policy may be learned within 50 episodes, but you may have to crank this number up to consistently see a close-to-optimal policy for all states.
For a more interactive visualization of your Qlearner, run the provided crawler.pyc file with its associated graphics file in the same directory (thanks UC Berkeley!). If you have not modified your Qlearner code, a simple crawler robot will take random actions and struggle back and forth. If you have a correct implementation of Q-learning, the robot should eventually learn to move forward. Try playing around with the parameters in the GUI to see whether you can help it learn faster.
3While not strictly correct, we will not perform a final evaluation sweep to find convergent values before declaring a policy converged. So don’t be surprised if you see a suboptimal policy when using a small k.
4

3.4 Writeup and Submission (5 points)
Answer the following questions and submit with problems 1 and 2 in your PDF file on Gradescope.
1. How many iterations does value iteration take to converge using ε = 0.001 for both floor plans? Specifically compare your asynchronous vs synchronous implementations.
2. How many iterations does policy iteration take for both floor plans? You should see that it is significantly fewer than that of value iteration.
3. The trade-off is that each individual iteration of the latter method takes longer. But we can control this by tweaking the number of policy evaluation sweeps per iteration. What do you observe if you set it to 1 (for either floor plan)?
4. About how many episodes of Q-learning do you find to be a reasonable number to consistently learn a policy that produces a success rate and average reward close to value/policy iteration?
5. Fire up the crawler bot and play with some of the parameters while it’s learning. What do you find to be a good strategy to help it learn more quickly than simply leaving the parameters all fixed (e.g., starting with a high value of γ and then decreasing it over time)?
For this problem, submit the following two files on Gradescope: YOURUNI mdp.py (rename this) and Qlearner.py (do NOT rename this).
5