Homework 8: Reinforcement Learning
Written Questions (46 points)
LaTeX Bonus Point (1 point)
(1 point) Select one: Did you use LaTeX for the entire written portion of this homework? ⃝ Yes ⃝ No
Value Iteration (18 points)
While attending an ML conference, you meet scientists at NASA who ask for your help sending spacecraft to the surface of the Sun. Specifically, they ask you to develop a reinforcement learning agent capable of carrying out the space-flight from Earth to the Sun. You model this problem as a Markov decision process (MDP). The figure below depicts the state space.
[Figure: grid of states SA–SL. The Sun occupies terminal state SL (+100 on the arrow from SK), the asteroid Metis occupies terminal state SE (−100 on the arrow from SF), and blocked states are shaded grey.]
Here are the details:
1. Each grid cell is a state SA,SB,…,SL corresponding to a position in the solar system.
2. The action space includes movement up/down/left/right. Transitions are non-deterministic. With probability 80% the agent transitions to the intended state. With probability 10% the agent slips left of the intended direction. With probability 10% the agent slips right of the intended direction. For example, if the agent is in state SF and takes action left, it moves to state SE with 80% probability, it moves to state SB (left of the intended direction) with 10% probability, and it moves to state SJ (right of the intended direction) with 10% probability.
3. It is not possible to move into blocked states, which are shaded grey, since they contain other planets. If the agent's action would move it out of bounds of the board or into a blocked state, it remains in the same state.
4. The start state is SC (Earth). The terminal states include both the SL (Sun) and SE (asteroid Metis).
5. Non-zero rewards are depicted with arrows. Flying into the Sun from the left gives positive reward R(SK, a, SL) = +100 ∀a ∈ {up, down, left, right}, since it is more fuel-efficient than flying into the Sun from below (the agent can use the gravitational field of the planets in SA and SI in addition to SG). However, approaching the Sun from the left has its risks, as flying into Metis is inadvisable and gives negative reward R(SF, a, SE) = −100 ∀a ∈
{up, down, left, right}. Note that flying into the Sun from below still achieves the goal and gives positive reward R(SH, a, SL) = +50 ∀a ∈ {up, down, left, right}. All other rewards are zero. (A sketch of the Bellman backup under this transition and reward model follows the list.)
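To make the backup concrete, here is a minimal Python sketch of one synchronous value-iteration (Bellman) backup under the 80%/10%/10% slip model above. The slip-direction tables follow the SF example; the `move` and `reward` helpers and the state names are hypothetical illustrations, not part of any provided code.

```python
# Hedged sketch: one synchronous Bellman backup
#   V(s) <- max_a  sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
# under the 80%/10%/10% slip model described above.

GAMMA = 0.9

# Slip directions are relative to the intended direction of travel.
LEFT_OF  = {"up": "left",  "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {"up": "right", "right": "down", "down": "left", "left": "up"}

def backup(state, V, move, reward, actions=("up", "down", "left", "right")):
    """Return the backed-up value of a single non-terminal state.

    move(state, direction)  -> successor state (stays put if blocked or out of bounds)
    reward(state, a, s_next) -> immediate reward, e.g. +100 for (SK, a, SL)
    V -> dict mapping every state to its current value estimate
    """
    best = float("-inf")
    for a in actions:
        q = 0.0
        for direction, prob in ((a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)):
            s_next = move(state, direction)
            q += prob * (reward(state, a, s_next) + GAMMA * V[s_next])
        best = max(best, q)
    return best
```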
Below, let V ∗(s) denote the value function for state s using the optimal policy π∗(s).
Synchronous Value Iteration
1. (3 points) Report the value of each state after a single round of synchronous value iteration in the table below. Initialize the value table V 0(s) = 0, ∀s ∈ {SA . . . SL} and assume γ = 0.9. Visit each state in reverse alphabetical order. Ignore the blocked states. Round your answers only to the first decimal place. Do not round intermediate values when calculating your answers.
2. (3 points) What is the policy, π(s), that corresponds to the value table you calculated above? Write one of up, down, left, or right for each state. If multiple actions are acceptable, choose the one that comes alphabetically first. For terminal states, write terminal. Ignore the blocked states.
Asynchronous Value Iteration
1. (3 points) Starting over, report the value of each state for a single round of asynchronous value iteration in the table below. Initialize the value table V 0(s) = 0, ∀s ∈ {SA . . . SL} and assume γ = 0.9. Visit each state in reverse alphabetical order. Ignore the blocked states. Round your answers only to the first decimal place. Do not round intermediate values when calculating your answers.
2. (3 points) What is the policy, π(s), that corresponds to the value table you calculated above? Write one of up, down, left, or right for each state. If multiple actions are acceptable, choose the one that comes alphabetically first. For terminal states, write terminal. Ignore the blocked states.
3. (3 points) Below, we give you the value of each state one round before the convergence of asynchronous value iteration.1 What is the final value of each state, V ∗(s)? Be sure to use asynchronous value iteration, and visit each state in reverse alphabetical order. Ignore the blocked states. Round your answers only to the first decimal place. Do not round intermediate values when calculating your answers.
State   Value (one round before convergence)
SJ      80.9516
SK      96.9920
SF      52.5638
SH      48.4960
SB      44.5586
SC      35.3208
SD      41.2039

Your solution:
1This is actually one round before the policy convergence, not the value convergence. The values we provide are the values after the third iteration.
4. (3 points) What is the policy, π∗(s), that corresponds to V ∗(s)? Write one of up, down, left, or right for each state. If multiple actions are acceptable, choose the one that comes alphabetically first. For terminal states, write terminal. Ignore the blocked states.
Q-Learning (9 points)
Let’s consider the same grid world as before.
This time, however, suppose we don’t know the reward function or the transition probability between states. Some rules for this setup are:
1. Each grid cell is a state SA,SB,…,SL corresponding to a position in the solar system.
2. The action space of the agent is: {up, down, left, right}.
3. If the agent hits the edge of the board, it remains in the same state. It is not possible to move into blocked states, which are shaded grey, since they contain other planets.
4. The start state is SC (Earth). The terminal states include both the SL (Sun) and SE (asteroid Metis).
5. Use the discount factor γ = 0.9, ε = 0.5, and learning rate α = 0.1.
We will go through three iterations of Q-learning in this section. Initialize Q(s, a) = 0, ∀s ∈ {SA, . . . , SL},
∀a ∈ {up, down, left, right}.
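For reference while working through the questions below, here is a hedged sketch of the two tabular update rules the questions refer to. Whether the "update for deterministic transitions" uses the learning rate should be checked against the course notes; the form shown here (no learning rate) is one common convention, and the helper names are illustrative.

```python
# Hedged sketch of the two tabular update rules used in this section.

GAMMA = 0.9   # discount factor given above
ALPHA = 0.1   # learning rate given above
ACTIONS = ("up", "down", "left", "right")

def max_q(Q, s, terminal=False):
    # The value bootstrapped from a terminal successor state is 0.
    return 0.0 if terminal else max(Q[(s, a)] for a in ACTIONS)

def deterministic_update(Q, s, a, r, s_next, terminal=False):
    # Q(s,a) <- r + gamma * max_a' Q(s',a')   (one common convention; no learning rate)
    Q[(s, a)] = r + GAMMA * max_q(Q, s_next, terminal)

def td_update(Q, s, a, r, s_next, terminal=False):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + GAMMA * max_q(Q, s_next, terminal)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```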
1. (1 point) Select all that apply: If the agent were to act greedily, what action would it take at this time?
□ up   □ down   □ left   □ right
2. (1 point) Beginning at state SC, you take the action right and receive a reward of 0. You are now in state SD. What is the new value for Q(SC, right), assuming the update for deterministic transitions? Round your answer to the fourth decimal place.
Q(SC, right)
3. (1 point) What is the new value for Q(SC, right), using the temporal difference error update? Round your answer to the fourth decimal place.
Q(SC, right)
4. (1 point) Select all that apply: Continue to update your Q-function (as calculated by the temporal difference error update) from above. This time, though, assume your run has brought you to state SH with no updates to the Q-function in the process. If the agent were to act greedily, what action would it take at this time?
□ up   □ down   □ left   □ right
5. (1 point) Beginning at state SH, you take the action up, receive a reward of +50, and the run terminates. What is the new value for Q(SH, up), assuming the update for deterministic transitions? Round your answer to the fourth decimal place.
Q(SH, up)
6. (1 point) What is the new value for Q(SH, up), using the temporal difference error update? Round your answer to the fourth decimal place.
Q(SH, up)
7. (1 point) Select all that apply: Continue to update your Q-function (as calculated by the temporal difference error update) from above. You start from state SC since the previous run terminated, but manage to make it to state SF with no updates to the Q-function. If the agent were to act greedily, what action would it take at this time?
□ up   □ down   □ left   □ right
8. (1 point) Beginning at state SF, you take the action left, receive a reward of −100, and the run terminates. What is the new value for Q(SF, left), assuming the update for deterministic transitions? Round your answer to the fourth decimal place.
Q(SF, left)
9. (1 point) What is the new value for Q(SF, left), using the temporal difference error update? Round your answer to the fourth decimal place.
Q(SF, left)
Function Approximation (8 points)
In this question we will motivate function approximation for solving Markov Decision Processes by looking at Breakout, a game on the Atari 2600. The Atari 2600 is a gaming system released in the 1980s, but nevertheless is a popular target for reinforcement learning papers and benchmarks. The Atari 2600 has a resolution of 160 × 192 pixels. In the case of Breakout, we try to move the paddle to hit the ball in order to break as many tiles above as possible. We have the following actions:
• Move the paddle left
• Move the paddle right
• Do nothing
[Figure 1 panels: (a) Atari Breakout; (b) black and white Breakout]
Figure 1: Atari Breakout. 1a is what Breakout looks like. We have the paddle in the bottom of the screen aiming to hit the ball in order to break the tiles at the top of the screen. 1b is our transformation of Atari Breakout into black and white pixels for the purpose of some of the following problems.
1. (1 point) Suppose we are dealing with the black and white version of Breakout2 as in Figure 1b. Furthermore, suppose we are representing the state of the game as just a vector of pixel values without considering if a certain pixel is always black or white. Since we are dealing with the black and white version of the game, these pixel values can either be 0 or 1.
What is the size of the state space?
2. (1 point) In the same setting as the previous part, suppose we wish to apply Q-learning to this problem. What is the size of the Q-value table we will need?
2Play a “Google”-Doodle version here
3. (1 point) Now assume we are dealing with the colored version of Breakout as in Figure 1a. Now each pixel is a tuple of real valued numbers between 0 and 1. For example, black is represented as (0, 0, 0) and white is (1, 1, 1).
What is the size of the state space and Q-value table we will need?
By now you should see that we will need a huge table in order to apply Q-learning (and similarly value iteration and policy iteration) to Breakout given this state representation. This table would not even fit in the memory of any reasonable computer! Now this choice of state representation is particularly naïve. If we choose a better state representation, we could drastically reduce the table size needed.
On the other hand, perhaps we don’t want to spend our days feature engineering a state representation for Breakout. Instead we can apply function approximation to our reinforcement algorithms! The whole idea of function approximation is that states nearby to the state of interest should have similar values. That is, we should be able to generalize the value of a state to nearby and unseen states.
Let us define qπ (s, a) as the true action value function of the current policy π. Assume qπ (s, a) is given to us by some oracle. Also define q(s, a; w) as the action value predicted by the function approximator parameterized by w. Here w is a matrix of size dim(S) × |A|, where dim(S) denotes the dimension of the state space. Clearly we want to have q(s, a; w) be close to qπ (s, a) for all (s, a) pairs we see. This is just our standard regression setting. That is, our objective function is just the Mean Squared Error:
$$J(w) = \frac{1}{2}\,\frac{1}{N} \sum_{s \in S,\, a \in A} \left( q^{\pi}(s,a) - q(s,a;w) \right)^2. \tag{1}$$
Because we want to update for each example stochastically3, we get the following update rule:
$$w \leftarrow w - \alpha \left( q(s,a;w) - q^{\pi}(s,a) \right) \nabla_{w}\, q(s,a;w). \tag{2}$$
However, more often than not we will not have access to the oracle that gives us our target qπ (s, a). So how do we get the target to regress q(s, a; w) on? One way is to bootstrap an estimate of the action value under a greedy policy using the function approximator itself. That is to say
$$q^{\pi}(s,a) \approx r + \gamma \max_{a'} q(s',a';w) \tag{3}$$
where r is the reward observed from taking action a at state s, γ is the discount factor and s′ is the state resulting from taking action a at state s. This target is often called the Temporal Difference (TD) target, and gives rise to the following update for the parameters of our function approximator in lieu of a tabular update:
$$w \leftarrow w - \alpha \underbrace{\left( q(s,a;w) - \left( r + \gamma \max_{a'} q(s',a';w) \right) \right)}_{\text{TD Error}} \nabla_{w}\, q(s,a;w). \tag{4}$$
3 This is not really stochastic, you will be asked in a bit why.
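As a hedged illustration, the update in Eq. (4) for a linear approximator can be written in a few lines of NumPy. The weight layout (one column of w per action) and the function signature are assumptions made for this sketch, not a required interface.

```python
import numpy as np

def linear_td_update(w, s, a, r, s_next, alpha, gamma, terminal=False):
    """One gradient step of Eq. (4) for a linear approximator q(s, a; w) = s^T w[:, a].

    w          : weight matrix of shape (dim(S), |A|), one column per action (assumed layout)
    s, s_next  : feature vectors of shape (dim(S),)
    """
    q_sa = s @ w[:, a]
    # Bootstrapped TD target from Eq. (3); a terminal successor contributes nothing.
    target = r if terminal else r + gamma * np.max(s_next @ w)
    td_error = q_sa - target
    # For a linear model, grad_w q(s, a; w) is s in column a and zero elsewhere.
    w[:, a] -= alpha * td_error * s
    return w
```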
4. (2 points) Consider the setting where we can represent our state by some vector s, action a ∈ {0, 1, 2} and we choose a linear approximator. That is:
$$q(s,a;w) = s^{T} w_{a} \tag{5}$$
Again, assume we are in the black and white setting of Breakout as in Figure 1b. Show that tabular Q-learning is just a special case of Q-learning with a linear function approximator by describing a construction of s. (Hint: Engineer features such that Eq. (5) encodes a table lookup)
5. (3 points) Stochastic Gradient Descent works because we can assume that the samples we receive are independent and identically distributed. Is that the case here? If not, why and what are some ways you think you could combat this issue?
Empirical Questions (10 points)
The following parts should be completed after you work through the programming portion of this as- signment (Section 7).
1. (4 points) Run Q-learning on the mountain car environment using both tile and raw features.
For the raw features: run for 2000 episodes with max iterations of 200, ε set to 0.05, γ set to 0.999, and a learning rate of 0.001.
For the tile features: run for 400 episodes with max iterations of 200, ε set to 0.05, γ set to 0.99, and a learning rate of 0.00005.
For each set of features, plot the return (sum of all rewards in an episode) per episode on a line graph. On the same graph, also plot the rolling mean over a 25 episode window. Comment on the difference between the plots.
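As a hedged illustration of the plotting step (not part of the required interface), the figure can be produced with matplotlib and pandas along the lines below; `returns` is assumed to be the list of per-episode returns collected by your training loop.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_returns(returns, window=25, title="Return per episode"):
    """Plot per-episode returns together with their rolling mean over `window` episodes."""
    series = pd.Series(returns)
    plt.plot(series, label="return per episode")
    plt.plot(series.rolling(window).mean(), label=f"rolling mean ({window}-episode window)")
    plt.xlabel("Episode")
    plt.ylabel("Return")
    plt.title(title)
    plt.legend()
    plt.show()
```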
Plot of Raw
Plot of Tile
Figure 2: Estimated optimal value function visualizations for both types of features
2. (2 points) For both raw and tile features, we have run Q-learning with some good parameters and created visualizations of the value functions after many episodes. For each plot in Figure 2, write down which features (raw or tile) were likely used in Q-learning with function approximation. Explain your reasoning. In addition, interpret each of these plots in the context of the mountain car environment.
3. (2 points) We see that Figure 2b seems to look like a plane. Can the value function depicted in this plot ever be nonlinear (linear here strictly refers to a function that can be expressed in the form of y = Ax + b)? If so, describe a potential shape. If not, explain why.
Hint: How do we calculate the value of a state given the Q-values?
Answer:
[Figure: two policy plots, one per feature type, with Position on the horizontal axis (roughly −1.2 to 0.6) and Velocity on the vertical axis (roughly −0.06 to 0.06); color indicates the chosen Action.]
Figure 3: Estimated optimal policy visualizations for both types of features
4. (2 points) In a similar fashion to the previous question, we have created visualizations of the potential policies learned. For each plot in Figure 3, write down which features (raw or tile) were likely used in Q-learning with function approximation. Explain your reasoning. In addition, interpret each of these plots in the context of the mountain car environment. Specifically, why are the edges linear vs. non-linear? Why do they learn these patches at these specific locations?
7 Programming [68 Points]
Your goal in this assignment is to implement Q-learning with linear function approximation to solve the mountain car environment. You will implement all of the functions needed to initialize, train, evaluate, and obtain the optimal policies and action values with Q-learning. In this assignment we will provide the environment for you. The program you write will be automatically graded using the Gradescope system.
7.1 Specification of Mountain Car
In this assignment, you will be given code that fully defines the Mountain Car environment. In Mountain Car you control a car that starts at the bottom of a valley. Your goal is to reach the flag at the top right, as seen in Figure 4. However, your car is under-powered and cannot climb up the hill by itself. Instead you must learn to leverage gravity and momentum to make your way to the flag. It would also be good to get to this flag as fast as possible.
Figure 4: What the Mountain Car environment looks like. The car starts at some point in the valley. The goal is to get to the top right flag.
The state of the environment is represented by two variables, position and velocity. position can be between [−1.2, 0.6] (inclusive) and velocity can be between [−0.07, 0.07] (inclusive). These are just measurements along the x-axis.
The actions that you may take at any state are {0, 1, 2}, where each number corresponds to an action: (0) pushing the car left, (1) doing nothing, and (2) pushing the car right.
7.2 Q-learning with Linear Approximations
The Q-learning algorithm is a model-free reinforcement learning algorithm, where we assume we don’t have access to the model of the environment the agent is interacting with. We also don’t build a complete model of the environment during the learning process. A learning agent interacts with the environment solely based on calls to the step and reset methods of the environment, as sketched below. The Q-learning algorithm then updates the q-values based on the values returned by these methods. Analogously, in the approximation setting the algorithm will instead update the parameters of the q-value approximator.
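For intuition only, the interaction pattern looks roughly like the sketch below. The signatures assumed here (reset() returning a state, step(action) returning a (next state, reward, done) tuple) are illustrative; the provided environment code defines the actual interface.

```python
def run_episode(env, select_action, update, max_iterations=200):
    """Hedged sketch of one episode of model-free interaction.

    Assumed (illustrative) interfaces:
      env.reset()       -> initial state
      env.step(action)  -> (next_state, reward, done)
      select_action(s)  -> action chosen by the agent (e.g., epsilon-greedy)
      update(s, a, r, s_next, done) -> applies the Q-learning parameter update
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_iterations):
        action = select_action(state)
        next_state, reward, done = env.step(action)
        update(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```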
Let the learning rate be α and discount factor be γ. Recall that we have the information after one interaction with the environment, (s, a, r, s′). The tabular update rule based on this information is:
$$Q(s,a) = (1-\alpha)\,Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') \right).$$
Instead, for the function approximation setting we use the following update rule derived from the Function Approximation Section (Section 4). Note that we have made the bias term explicit here, where before it was implicitly folded into w:
$$w \leftarrow w - \alpha \left( q(s,a;w) - \left( r + \gamma \max_{a'} q(s',a';w) \right) \right) \nabla_{w}\, q(s,a;w), \qquad \text{where} \quad q(s,a;w) = s^{T} w_{a} + b.$$
The epsilon-greedy action selection method selects the optimal action with probability 1 − ε and selects uniformly at random from one of the 3 actions (0, 1, 2) with probability ε. The reason we use epsilon-greedy action selection is that we would like the agent to explore by stochastically selecting random actions with small probability. For the purpose of testing, we will test two cases: ε = 0 and 0 < ε < 1. When ε = 0 (no exploration), the program becomes deterministic and your output has to match our reference output accurately. In this case, pick the action represented by the smallest number if there is a draw in the greedy action selection process. For example, if we are at state s and Q(s, 0) = Q(s, 2), then take action 0. When 0 < ε < 1, your output will need to fall in a certain range within the reference, determined by running exhaustive experiments on the input parameters.
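A minimal sketch of this selection rule, assuming `q_values` is the length-3 array of approximated action values for the current state; np.argmax returns the first maximal entry, which matches the smallest-index tie-breaking described above.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Return a random action with probability epsilon, else the greedy action.

    Ties in the greedy choice are broken toward the smallest action index,
    since np.argmax returns the first maximal entry.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # uniform over the available actions
    return int(np.argmax(q_values))
```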
7.3 Feature Engineering
Linear approximations are great in their ease of use and implementation. However, there is sometimes a downside; they’re linear. This can pose a problem when we think the value function itself is nonlinear with respect to the state.