AA228/CS238: Decision Making under Uncertainty (https://web.stanford.edu/class/aa228/cgi-bin/wp/)
Project 2
Reinforcement Learning
Due Date: by 5 pm on Friday, November 5th. Penalty-free grace period until 5 pm on Monday, November 8th. See the “Late Policy” (https://web.stanford.edu/class/aa228/cgi-bin/wp/) for details.
This project is a competition to find the best policy for three different Markov decision processes given sampled transitions, each consisting of a state s, an action a, a reward r, and a next state sp. The best policy is the one that maximizes the total expected reward.
Three CSV-formatted datasets will be provided. Note that this is an instance of batch reinforcement learning: the data from exploration has already been collected and given to you, and you have to work with it. There will not be an opportunity to explore at runtime in a simulator, because your output will be a (deterministic) policy which we will run on our simulator. Keep this in mind, particularly if you are using a model-free approach.
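For concreteness, one model-free approach that fits this batch setting is to run Q-learning repeatedly over the recorded transitions. The sketch below is only an illustration, not a prescribed solution; it assumes the CSV columns are named s, a, r, and sp, and the learning rate, number of epochs, and problem sizes are placeholders you would tune.

# Minimal batch Q-learning sketch (one option among many).
# Assumes columns named s, a, r, sp; hyperparameters are placeholders.
using CSV, DataFrames

function batch_q_learning(path, n_states, n_actions; γ=0.95, α=0.1, n_epochs=100)
    data = CSV.read(path, DataFrame)
    Q = zeros(n_states, n_actions)
    for _ in 1:n_epochs                       # repeated sweeps over the dataset
        for row in eachrow(data)
            s, a, r, sp = row.s, row.a, row.r, row.sp
            Q[s, a] += α * (r + γ * maximum(Q[sp, :]) - Q[s, a])
        end
    end
    return [argmax(Q[s, :]) for s in 1:n_states]   # greedy action for every state
end

# e.g., policy = batch_q_learning("small.csv", 100, 4)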
1. small.csv (https://raw.githubusercontent.com/sisl/AA228-CS238-Student/master/project2/data/small.csv) 10 x 10 grid world (100 states) with 4 actions. Use LinearIndices((10,10))[x,y] to find the integer state from the x and y coordinates (see the indexing sketch after this list). Actions are 1: left, 2: right, 3: up, 4: down. The discount factor is 0.95.
2. medium.csv (https://raw.githubusercontent.com/sisl/AA228-CS238-Student/master/project2/data/medium.csv) MountainCarContinuous-v0 environment from OpenAI Gym (https://gym.openai.com/envs) with altered parameters. State measurements are given by integers with 500 possible position values and 100 possible velocity values (50,000 possible state measurements). 1+pos+500*vel gives the integer corresponding to a state with position pos and velocity vel. There are 7 actions that represent different amounts of acceleration. This problem is undiscounted, and ends when the goal (the flag (https://web.stanford.edu/class/aa228/cgi-bin/wp/wp-content/uploads/2019/10/flag.gif)) is reached. Note that since the discrete state measurements are calculated after the simulation, the data in medium.csv does not quite satisfy the Markov property.
3. large.csv (https://raw.githubusercontent.com/sisl/AA228-CS238-Student/master/project2/data/large.csv) It’s a
secret! MDP with 312020 states and 9 actions, with a discount factor of 0.95. This problem has a lot of hidden
structure. Look at the transitions and rewards carefully and you might figure some of it out!
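As referenced in item 1 above, here is a small sketch of the state-index conversions for the small and medium problems. Treating the discretized pos and vel as zero-based indices is an assumption inferred from the 1+pos+500*vel formula and the 50,000-state count, so check it against the data.

# Illustrative conversions between coordinates and integer states.

# small.csv: 10 x 10 grid with x, y in 1:10
grid_state(x, y) = LinearIndices((10, 10))[x, y]         # (x, y) -> integer state
grid_coords(s)   = Tuple(CartesianIndices((10, 10))[s])  # integer state -> (x, y)

# medium.csv: assuming the discretized pos is in 0:499 and vel is in 0:99
mc_state(pos, vel) = 1 + pos + 500 * vel                 # (pos, vel) -> integer state
mc_pos(s)          = (s - 1) % 500                       # integer state -> position index
mc_vel(s)          = (s - 1) ÷ 500                       # integer state -> velocity index

# e.g., grid_state(3, 7) == 63 and grid_coords(63) == (3, 7)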
Each line in the CSV file represents a transition from a state s to a next state sp, along with the associated reward r, when taking an action a. You might not have data for every state, nor are all states equally important. Use your best judgment when handling this.
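If you take a model-based route, one possible first step is to tabulate transition counts and mean rewards from the CSV, for example as sketched below. The column names s, a, r, sp and all function and variable names are assumptions.

# Tabulate N(s, a, sp) counts and mean rewards per (s, a) from the data.
using CSV, DataFrames

function build_counts(path)
    data = CSV.read(path, DataFrame)
    N    = Dict{Tuple{Int,Int,Int},Int}()    # (s, a, sp) -> transition count
    Rsum = Dict{Tuple{Int,Int},Float64}()    # (s, a) -> summed reward
    C    = Dict{Tuple{Int,Int},Int}()        # (s, a) -> visit count
    for row in eachrow(data)
        s, a, r, sp = row.s, row.a, row.r, row.sp
        N[(s, a, sp)] = get(N, (s, a, sp), 0) + 1
        Rsum[(s, a)]  = get(Rsum, (s, a), 0.0) + r
        C[(s, a)]     = get(C, (s, a), 0) + 1
    end
    R = Dict(k => Rsum[k] / C[k] for k in keys(Rsum))   # mean reward per (s, a)
    return N, R, C
end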
The files can be accessed from the AA228-CS238-Student repository: https://github.com/sisl/AA228-CS238-Student/
Rules
Note: The scores on the leaderboard are computed after subtracting the score of a random policy, so you should try to at least get positive leaderboard scores to get full credit. For large.csv, the difference is additionally multiplied by 10 to give it a higher weight on the leaderboard (not for grading).
LaTeX Template
We provide an optional LaTeX template on Overleaf (https://www.overleaf.com/read/gsptsmcrzpdv) for your README.pdf write-up. Note that you're free to use your own template (and you're not even required to use LaTeX).
Submission
Submission Video Tutorial – We’ve put together a quick video tutorial (https://youtu.be/0aitSORMFS4) explaining the
repository and how to submit to Gradescope.
FAQ
This list has been extended from last year to reflect common queries made on Ed. You may find your query answered
here without even needing to wait on Ed!
You can use any programming language but you cannot use any package/algorithms directly related to
reinforcement learning. You can use general optimization packages and libraries for things like function
approximation and ODE solving so long as you discuss what you use in your writeup and make it clear how it is
used in your code.
You are free to use the data structures of POMDPs.jl (https://github.com/JuliaPOMDP/POMDPs.jl), just not any
algorithms.
You can use different approaches and algorithms for the different problems. It will probably help to look at the
data carefully and use the structure of the problem to your advantage.
Discussions are encouraged, and we expect similar approaches to the project from multiple people, but you
must write your own code. Otherwise, it violates the Stanford Honor Code.
You may look at the code for the MountainCarContinuous-v0 environment
(https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py), but
note that we will not use exactly the same parameters.
Submit a README.pdf file with a description of your algorithm(s) and their performance characteristics.
Include the running/training time for each policy. Also, typeset your code and include it in the PDF.
Grading Rubric:
Small Problem ( small.policy ) – 10%
Medium Problem ( medium.policy ) – 20%
Large Problem ( large.policy ) – 30%
README.pdf – 40%
Description of strategy – 15%
Performance characteristics, i.e. running/training time – 15%
Code (included in PDF) – 10%
Click the template link (https://www.overleaf.com/read/gsptsmcrzpdv), click “Menu”, then “Copy Project” (make sure you're signed in to Overleaf).
Submit your .policy files via Gradescope (https://www.gradescope.com/) under the Project 2 (.policy
files) assignment.
Submit your README.pdf via Gradescope (https://www.gradescope.com/) under the Project 2
(README.pdf) assignment.
What programming languages can I use?
You may use any language you like! We are only looking for your source code and the output .policy
files.
How do I convert linear indices to subscripts and vice versa?
For Julia, use CartesianIndices
(https://docs.julialang.org/en/v1/base/arrays/#Base.IteratorsMD.CartesianIndices) and LinearIndices
(https://docs.julialang.org/en/v1/base/arrays/#Base.LinearIndices). For Python, use
numpy.unravel_index
(https://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html) and
numpy.ravel_multi_index
(https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html). For MATLAB,
use ind2sub (https://www.mathworks.com/help/matlab/ref/ind2sub.html) and sub2ind
(https://www.mathworks.com/help/matlab/ref/sub2ind.html).
I like the competition and leaderboard aspect. Can we use late days for the competition?
No. You can only use late days for the general project grading. Any submissions after the deadline will not
be considered for the leaderboard.
What’s the output format for the .policy file?
Your program should output a file with the same name as the input filename, but with
a .policy extension, e.g., “small.policy”.
Each output file should contain an action for every possible state in the problem. The i-th row contains
the action to be taken from the i-th state. small.policy should contain 100
lines, medium.policy should contain 50000 lines, and large.policy should contain 312020 lines.
Each line should contain a number between 1 and the total number of actions for that problem. On Linux/macOS, you can check the number of lines with wc -l.
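For example, if your policy is stored as a vector of integers with one entry per state, writing the file can be as simple as the following sketch (the function and variable names are placeholders):

# Write one action per line, one line per state.
function write_policy(policy, filename)
    open(filename, "w") do io
        for a in policy
            println(io, a)
        end
    end
end

# e.g., write_policy(small_policy, "small.policy")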
What does the score represent?
For each problem, the score is the expected value U(s) for a state s drawn from the initial distribution D, that is, E_{s∼D}[U(s)]. For all problems, we evaluate this expectation using simulations. The higher the score, the better. The scores on the leaderboard are after subtracting the score of a random policy.
What’s the reward?
All we can say is that the reward is a function of both the states and actions.
What packages can we use?
For the computational part, sparse matrix libraries (e.g., SciPy's sparse module in Python or Julia's built-in SparseArrays; see the sketch below) could be useful. Feel free to use any deep learning and machine learning packages as well, as long as you are not using any RL implementations therein. For building models with the data, data munging libraries like DataFrames.jl or Python's pandas, or graphing libraries like StatPlots.jl, might be useful.
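As one illustration of the sparse-matrix suggestion above, per-action transition-count matrices can be assembled with Julia's built-in SparseArrays. The column names s, a, sp and the function name are assumptions.

# One sparse count matrix per action; duplicate (s, sp) pairs are summed.
using CSV, DataFrames, SparseArrays

function transition_counts(path, n_states, n_actions)
    data = CSV.read(path, DataFrame)
    return [
        let rows = data[data.a .== a, :]
            sparse(rows.s, rows.sp, ones(Int, nrow(rows)), n_states, n_states)
        end
        for a in 1:n_actions
    ]
end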
What about the terminal states?
There are no terminal states for small.csv and large.csv . For medium.csv , the simulation ends
when the new position is beyond some position (where the flag is). You may want to think about which
integer state measurements correspond to this case.
Why does the same (s, a) lead to a different sp in some lines?
The MDP is stochastic, i.e., the transition function defines a probability distribution over next states. So yes, sometimes you'll reach different next states from the same state, even though you took the same action. However, the probabilities of reaching those next states may differ. Also, for a given (s, a) → sp transition, the reward is always the same.
Is it necessary to implement a score function for this project?
This project is a bit di�erent. In project 1, if you implemented the scoring function correctly, you would
know your place on the leaderboard. However, for project 2, you are evaluated on an MDP model that is
kept secret from you—samples from this model are provided to you and you can create an
approximation of it, but you don’t know it exactly.
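If you want a rough local sanity check anyway, one option, only as good as your estimated model, is to roll out your policy in that model and average the discounted return, as in the sketch below. The sample_next and reward arguments are hypothetical stand-ins for whatever model you build from the data, and drawing start states from the dataset is only a crude substitute for the unknown initial distribution D.

# Monte Carlo self-evaluation in your own estimated model (sanity check only).
function estimate_return(policy, start_states, sample_next, reward;
                         γ=0.95, depth=100, n_rollouts=1000)
    total = 0.0
    for _ in 1:n_rollouts
        s = rand(start_states)               # crude stand-in for s ~ D
        g, discount = 0.0, 1.0
        for _ in 1:depth
            a = policy[s]
            g += discount * reward(s, a)
            s = sample_next(s, a)
            discount *= γ
        end
        total += g
    end
    return total / n_rollouts
end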
Does the value of the discount factor matter?
Yes. Technically you could pick whatever discount factor you want, but the policy will be evaluated with
the specified discount factor. Generally you’d prefer to train with the discount factor you’ll be evaluated
with, but there exist situations where playing with the discount factor can improve performance.
What can we do when we reach a state that has no actions or reward associated with it?
You can use an interpolation scheme to approximate values for such states from the available data. This is particularly feasible for the medium problem, but will be difficult for the large problem because of the underlying structure.
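For instance, one simple nearest-neighbor fallback for the medium problem is sketched below; it assumes the zero-based pos/vel decoding discussed earlier, ignores scaling between the two coordinates, and is just one of many possible schemes.

# Fall back to the action of the nearest state seen in the data.
# `seen_states` is assumed to hold the states that appear in the data and
# `policy` to hold an action for each of them (indexed by state).
mc_pos(s) = (s - 1) % 500
mc_vel(s) = (s - 1) ÷ 500

function nearest_seen_action(s, seen_states, policy)
    p, v = mc_pos(s), mc_vel(s)
    i = argmin([(mc_pos(t) - p)^2 + (mc_vel(t) - v)^2 for t in seen_states])
    return policy[seen_states[i]]
end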
If we are getting negative leaderboard scores, what does that mean?
On the leaderboard, we are subtracting the scores that we get from a random policy, and a negative value
shows that you are doing worse than random. You should definitely be able to do better than a random
policy, especially for the small problem.
If a state is missing actions, do we assign −∞ as a reward for the missing actions?
That's a modeling decision you need to make. But propagating −∞ rewards can be tricky.
What do I need to do for full credit?
Any RL algorithm with reasonable positive leaderboard scores will be fine. You can use different
algorithms for each problem, and we recommend that you try out several. Please mention everything
you try in your README.pdf .
Is there any way we can get a script we can run locally to score our policies? Alternatively, is there any
way we can implement an estimate of the score?
You are allowed to submit via Gradescope as much as you’d like, and that will give you feedback for each
of your policies. Your final submission will be the one that’s graded and each submission will replace the
previous one on the leaderboard. We do not want students to rely on the feedback from Gradescope as
the only means of testing their algorithms—we encourage plotting your policies and reward values to see
if they’re reasonable (not required).
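As one way to do that plotting, the medium policy can be reshaped onto its position/velocity grid and viewed as a heatmap; Plots.jl is just one choice here, and the 500 x 100 reshape assumes the 1+pos+500*vel encoding.

# Optional visualization of medium.policy as a heatmap over (pos, vel).
using Plots

medium_policy = parse.(Int, readlines("medium.policy"))   # 50,000 actions
heatmap(reshape(medium_policy, 500, 100)',
        xlabel = "position index", ylabel = "velocity index",
        title = "action per state")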
When it is mentioned that there are 500 possible position values, and 100 possible velocity values, does
this mean that velocity values are [1, 100], or can velocity values be negative or out of this range?
Velocity values can be negative. They are definitely not in the range [1,100]. Same with position. Again,
the real values are continuous. They have just been discretized in the given fashion. You are welcome to
look at the original OpenAI environment linked in the description.
How do I typeset my code?
If you're using LaTeX, you can use the verbatim environment as a simple approach, or the listings environment for syntax highlighting (or pythontex if you're feeling fancy).