COMP9414 24T2 Artificial Intelligence
Assignment 2 – Reinforcement Learning
Due: Week 9, Wednesday, 24 July 2024, 11:55 PM.
1 Problem context
Taxi Navigation with Reinforcement Learning: In this assignment, you are asked to implement the Q-learning and SARSA methods for a taxi navigation problem. To run your experiments and test your code, you should use the Gym library [1], an open-source Python library for developing and comparing reinforcement learning algorithms. You can install Gym on your computer simply by using the following command in your command prompt:
pip install gym
In the taxi navigation problem, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, one taxi starts off at a random square and the passenger is at a random location (one of the four specified locations). The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends. To show the taxi grid world environment, you can use the following code:
[1] https://www.gymlibrary.dev/environments/toy_text/taxi/
import gym

env = gym.make("Taxi-v3", render_mode="ansi").env
state = env.reset()
rendered_env = env.render()
print(rendered_env)
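The Taxi-v3 environment represents its state as a single integer: 25 taxi squares, 5 passenger locations (the four depots plus "in taxi"), and 4 destinations give 500 states in total. The helpers below are an illustrative sketch of that packing, mirroring the layout described in the Taxi-v3 documentation; they are for understanding only, not part of the library's API:

```python
# Illustrative sketch of how Taxi-v3 packs (taxi_row, taxi_col,
# passenger_location, destination) into one integer state.
# 25 taxi squares x 5 passenger locations x 4 destinations = 500 states.

def encode(taxi_row, taxi_col, pass_loc, dest_idx):
    """Pack the four state components into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_loc) * 4 + dest_idx

def decode(state):
    """Unpack an integer state back into its four components."""
    dest_idx = state % 4
    state //= 4
    pass_loc = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, pass_loc, dest_idx

print(encode(4, 4, 4, 3))   # highest state index: 499
print(decode(499))          # (4, 4, 4, 3)
```

Indexing your Q-table by this integer state (rows) and the action number (columns) gives the usual 500 x 6 tabular layout.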
To render the environment, there are three modes: "human", "rgb_array", and "ansi". The "human" mode visualizes the environment in a way suitable for human viewing; the output is a graphical window that displays the current state of the environment (see Fig. 1). The "rgb_array" mode provides the environment's state as an RGB image; the output is a numpy array representing that image. The "ansi" mode provides a text-based representation of the environment's state; the output is a string that represents the current state using ASCII characters (see Fig. 2).
Figure 1: “human” mode presentation for the taxi navigation problem in Gym library.
You are free to choose between the "human" and "ansi" presentation modes, but for simplicity we recommend the "ansi" mode. Based on the given description, there are six discrete, deterministic actions, presented in Table 1.
For this assignment, you need to implement the Q-learning and SARSA algorithms for the taxi navigation environment. The main objective is for the agent (taxi) to learn to navigate the grid world and deliver the passenger in the minimum possible number of steps. To accomplish the learning task, you should empirically determine suitable hyperparameters, e.g., the learning rate α, the exploration parameters (such as ε or T), and the discount factor γ. Your agent should be penalized -1 for each step it takes.
Figure 2: “ansi” mode presentation for the taxi navigation problem in Gym library. Gold represents the taxi location, blue is the pickup location, and purple is the drop-off location.
Table 1: Six possible actions in the taxi navigation environment.

    Action                 Number of the action
    Move South             0
    Move North             1
    Move East              2
    Move West              3
    Pickup Passenger       4
    Drop off Passenger     5
It should receive a +20 reward for delivering the passenger and incur a -10 penalty for executing the "pickup" and "drop-off" actions illegally. You should try different exploration parameters to find the best balance between exploration and exploitation.
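As a starting point, the tabular Q-learning pieces involved (epsilon-greedy exploration plus the one-step update) can be sketched as follows. The state/action counts match Taxi-v3, but the hyperparameter values are placeholders you are expected to tune, and the single update call at the end is purely illustrative:

```python
import random

# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
# ALPHA, GAMMA, and EPSILON below are placeholder values; the assignment
# asks you to determine them empirically.

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
N_STATES, N_ACTIONS = 500, 6          # Taxi-v3 sizes

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def epsilon_greedy(state):
    """Explore with probability epsilon, otherwise exploit the Q-table."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def q_update(state, action, reward, next_state):
    """One Q-learning step: bootstrap from the best next-state action."""
    target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (target - Q[state][action])

# Single illustrative update with the -1 step penalty on an empty table:
q_update(0, 0, -1.0, 1)
print(Q[0][0])   # 0.1 * (-1 + 0.9*0 - 0) = -0.1
```

SARSA differs only in the target: instead of `max(Q[next_state])`, it bootstraps from the Q-value of the action actually selected in the next state.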
As an outcome, you should plot the accumulated reward per episode and the number of steps taken by the agent in each episode for at least 1000 learning episodes for both the Q-learning and SARSA algorithms. Examples of these two plots are shown in Figures 3–6. Please note that the provided plots are just examples and, therefore, your plots will not be exactly like the provided ones, as the learning parameters will differ for your algorithm.
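One way to collect the per-episode data behind such plots is sketched below with a SARSA training loop. The tiny `LineWorld` class is a made-up stand-in with a gym-like reset()/step() interface, used only so the sketch is self-contained; the hyperparameter values are again placeholders:

```python
import random

random.seed(0)  # for a repeatable sketch

class LineWorld:
    """Toy stand-in for Taxi-v3: walk right from state 0 to the goal at
    state 4; -1 reward per step, +20 on reaching the goal."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):               # 0 = left, 1 = right
        self.s = max(0, self.s + (1 if action == 1 else -1))
        done = self.s == 4
        return self.s, (20.0 if done else -1.0), done

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

env = LineWorld()
Q = [[0.0, 0.0] for _ in range(5)]        # 5 states x 2 actions

def policy(s):
    if random.random() < EPSILON:
        return random.randrange(2)
    return max((0, 1), key=lambda a: Q[s][a])

episode_rewards, episode_steps = [], []   # the data you would plot
for _ in range(200):
    s = env.reset()
    a = policy(s)
    total, steps, done = 0.0, 0, False
    while not done and steps < 100:
        s2, r, done = env.step(a)
        a2 = policy(s2)
        # SARSA bootstraps from the action actually taken next.
        Q[s][a] += ALPHA * (r + GAMMA * (0.0 if done else Q[s2][a2]) - Q[s][a])
        s, a = s2, a2
        total += r
        steps += 1
    episode_rewards.append(total)
    episode_steps.append(steps)

print(episode_steps[-1])   # later episodes approach the 4-step optimum
```

For the assignment itself you would run the same loop against Taxi-v3 and then plot `episode_rewards` and `episode_steps` against the episode index (e.g., with matplotlib).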
After training your algorithm, you should save your Q-values. Based on your saved Q-table, your algorithms will be tested on at least 100 random grid-world scenarios with the same characteristics as the taxi environment, for both the Q-learning and SARSA algorithms, using the greedy action selection method.
Figure 3: Q-learning reward. Figure 4: Q-learning steps.
Figure 5: SARSA reward. Figure 6: SARSA steps.
Consequently, your Q-table will not be updated during testing.
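The assignment lets you choose the Q-table format; one possible approach is shown below, using Python's standard pickle module to save and reload a plain list-of-lists table, together with a greedy (frozen-table) action selector. The file name and the hand-set values for state 42 are illustrative only:

```python
import pickle

# One possible save/load scheme for the Q-table: a plain list of lists,
# Q[state][action], 500 states x 6 actions for Taxi-v3.

Q = [[0.0] * 6 for _ in range(500)]
Q[42] = [0.0, 1.5, -0.5, 0.2, 0.0, 0.0]   # pretend state 42 was trained

with open("q_table.pkl", "wb") as f:       # save after training
    pickle.dump(Q, f)

with open("q_table.pkl", "rb") as f:       # reload before testing
    Q_loaded = pickle.load(f)

def greedy_action(q_table, state):
    """During testing the Q-table is frozen: always take the argmax."""
    return max(range(6), key=lambda a: q_table[state][a])

print(greedy_action(Q_loaded, 42))   # action 1 has the highest value
```

A numpy array with `np.save`/`np.load` would work equally well; what matters is that the saved file can be reloaded and queried during your discussion session.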
Your code should be able to visualize the trained agent for both the Q-learning and SARSA algorithms. This means you should render the "Taxi-v3" environment (you can use the "ansi" mode) and run your trained agent from a random position. You should present the steps your agent takes and how the reward changes from one state to the next. An example of the visualized agent is shown in Fig. 7, where only the first six steps of the taxi are displayed.
2 Testing and discussing your code
As part of the assignment evaluation, your code will be tested by tutors together with you in a discussion carried out in the week 10 tutorial session. The assignment has a total of 25 marks. The discussion is mandatory; therefore, we will not mark any assignment that is not discussed with the tutors.
Before your discussion session, you should prepare the necessary code for this purpose by loading your Q-table and the "Taxi-v3" environment. You should be able to calculate the average number of steps per episode and the average accumulated reward (for a maximum of 100 steps per episode) over the test episodes, using the greedy action selection method.

Figure 7: The first six steps of a trained agent (taxi) based on the Q-learning algorithm.
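The test-time evaluation described above (greedy actions, no Q-table updates, at most 100 steps per episode, then the two averages) can be sketched as follows. The `StubEnv` class is a made-up deterministic stand-in with a gym-like reset()/step() interface so the sketch runs on its own; in your submission you would pass the real Taxi-v3 environment and your loaded Q-table instead:

```python
class StubEnv:
    """Deterministic 3-step episode used only to exercise the loop."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        done = self.t == 3
        return self.t, (20.0 if done else -1.0), done

def evaluate(env, q_table, n_episodes=100, max_steps=100):
    """Average steps and accumulated reward per test episode,
    acting greedily and never updating the Q-table."""
    total_steps, total_reward = 0, 0.0
    for _ in range(n_episodes):
        state, done, steps, acc = env.reset(), False, 0, 0.0
        while not done and steps < max_steps:
            action = max(range(len(q_table[state])),
                         key=lambda a: q_table[state][a])  # greedy
            state, reward, done = env.step(action)
            acc += reward
            steps += 1
        total_steps += steps
        total_reward += acc
    return total_steps / n_episodes, total_reward / n_episodes

Q = [[0.0] * 6 for _ in range(10)]       # placeholder Q-table
avg_steps, avg_reward = evaluate(StubEnv(), Q)
print(avg_steps, avg_reward)             # 3.0 18.0 for the 3-step stub
```

These two averages are exactly the numbers your tutor will check against the thresholds in the next paragraphs.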
You are expected to propose and build your own algorithms for the taxi navigation task. You will receive marks for each of the subtasks shown in Table 2. Beyond what has been mentioned in the previous section, you are welcome to include any other outcome that highlights particular aspects of your work when testing and discussing your code with your tutor.
For both the Q-learning and SARSA algorithms, your tutor will consider the average accumulated reward and the average number of steps taken over the test episodes, with a maximum of 100 steps per episode. For your Q-learning algorithm, the agent should take at most 14 steps per episode on average and obtain an average accumulated reward of at least 7. For your SARSA algorithm, the agent should take at most 15 steps per episode on average and obtain an average accumulated reward of at least 5. Results worse than these thresholds will receive 0 marks for the corresponding section.
Finally, you will receive 1 mark for code readability for each task, and your tutor will also award up to 5 marks for each task depending on your level of code understanding, as follows: 5 Outstanding, 4 Great, 3 Fair, 2 Low, 1 Deficient, 0 No answer.
Table 2: Marks for each task.

    Results obtained from agent learning
      Accumulated rewards and steps per episode plots for Q-learning algorithm    2 marks
      Accumulated rewards and steps per episode plots for SARSA algorithm         2 marks
    Results obtained from testing the trained agent
      Average accumulated rewards and average steps per episode for Q-learning algorithm
      Average accumulated rewards and average steps per episode for SARSA algorithm
      Visualizing the trained agent for Q-learning algorithm
      Visualizing the trained agent for SARSA algorithm
    Code understanding and discussion
      Code readability for Q-learning algorithm                                   1 mark
      Code readability for SARSA algorithm                                        1 mark
      Code understanding and discussion for Q-learning algorithm                  5 marks
      Code understanding and discussion for SARSA algorithm                       5 marks
    Total marks                                                                   25

3 Submitting your assignment
The assignment must be done individually. You must submit your assignment solution via Moodle as a single .zip file containing three files: your .ipynb Jupyter notebook and your saved Q-tables for Q-learning and SARSA (you can choose the format for the Q-tables). Remember that your Q-table files will be loaded during your discussion session to run the test episodes; therefore, your submitted Python code should also include a script to perform these tests. Additionally, your code should include short text descriptions to help markers better understand it. Please be mindful that providing clean and easy-to-read code is part of your assignment.
Please indicate your full name and your zID at the top of the file as a comment. You can submit as many times as you like before the deadline; later submissions overwrite earlier ones.