CSCI 561
Foundation for Artificial Intelligence
Advanced Game Playing Reinforcement Learning
Professor Wei-Min Shen
Outline
• Motivation
  – Agent and Environment (Game)
• States, actions, utility, rewards, policy
• Utility value iteration
• Policy iteration
• Reinforcement Learning
  – Model-based
  – Model-free
• Q-Learning
  – State space for advanced game playing
A Key Question
• In all search and game playing problems, what is the key information that you wish to have for finding the optimal solution?
  – Information you wish to have, but may not always have
Agent and Environment
[Figure: the agent-environment loop. The agent receives percepts through its sensors and acts on the environment through its effectors.]
Agent, Environment, Game
[Figure: the same agent-environment loop, with the environment now viewed as a game.]
Agent and Environment (Star Wars)
Agents: R2D2, C-3PO, Darth Vader
States: 1, ..., 11   Actions: ^, v, >, <   Percepts: ...   Goals: ...   Rewards: ...
State Utility Values: ... (your “crystal ball”)
[Figure: a grid world whose cells are numbered 1 through 11.]
Agent and Environment: The Little Prince’s Planet
• Percepts: {rose, volcano, nothing}
• Actions: {forward, backward, turn-around}
• States: {north, south, east, west}
The Little Prince’s Planet Environment
[Figure: the planet seen from outside; depending on where the Prince stands and which way he faces, he sees the rose, the volcano, or nothing.]
A Model for Little Prince’s Planet
[Figure: a four-state model whose states are labeled by what is seen there: volcano, nothing, nothing, rose.]
His internal action model only needs 4 states! (why?)
State, Action and Sensor Models
Little Prince’s Model
The transition model f can be deterministic or probabilistic. The sensor model q can be deterministic or probabilistic.
Is the transition model f here deterministic? Is the sensor model q deterministic?
Are there any hidden states in this example?
The Environment may be Uncertain
• “Seeing may not always be believing”
• States and observations are not a 1-1 mapping
  – One state may be seen in many different ways
  – The same observation may be seen in many different states
  – E.g., when the Prince sees nothing, where is he? Facing which way?
  – Can you make the Little Prince’s sensors nondeterministic?
• “Actions may not always produce the same result”
• Actions do not always take you to the same states
  – E.g., the Little Prince’s planet may have “planet-quakes” ☺
• Transitions between states by actions may be nondeterministic
  – E.g., Star Wars’ transitions are not deterministic
  – Can you make the Little Prince’s actions nondeterministic?
Examples of Uncertainties
• Speech Recognition
  – “Listening is not always equal to hearing” ☺
• Traveling through rooms with colored walls
  – “Seeing does not always tell where you are” ☺
[Figure: a chain of rooms, Room-1 through Room-3, connected by actions a0 ... a3; the agent only perceives each room’s wall color.]
Uncertainties in the Real World
• Multiple Agents
  – The actions of other agents in the environment may be “uncertain” from your point of view!
• Advanced Game Playing
  – Your opponent’s moves are definitely “uncertain” from your point of view!
• When you are driving on a highway
  – What is uncertain?
Star Wars’ Probabilistic Transition Model
• Transitions can be probabilistic
  – P(state_{t+1} | state_t, action_t)

X_{t-1}  A_{t-1}  P(X_t=1)  P(X_t=2)  P(X_t=3)  P(X_t=4)  P(X_t=5)  ...
1        up       0.9       0.1       0.0       0.0       0.0
1        down     0.1       0.1       0.0       0.0       0.8
1        left     0.9       0.0       0.0       0.0       0.1
1        right    0.1       0.8       0.0       0.0       0.1
2        up       0.1       0.8       0.1       0.0       0.0
...

[Figure: an action’s outcome, 0.8 probability for the intended direction and 0.1 for each sideways direction.]
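A transition table like this can be stored as a nested mapping from (state, action) to a distribution over next states and sampled from directly. The sketch below is only an illustrative encoding of a few rows of the table above; the dictionary layout and function name are my own, not part of the slides:

```python
import random

# Hypothetical encoding of a few rows of the table above:
# P[(state, action)] is a dict {next_state: probability}.
P = {
    (1, "up"):    {1: 0.9, 2: 0.1},
    (1, "down"):  {1: 0.1, 2: 0.1, 5: 0.8},
    (1, "left"):  {1: 0.9, 5: 0.1},
    (1, "right"): {1: 0.1, 2: 0.8, 5: 0.1},
    (2, "up"):    {1: 0.1, 2: 0.8, 3: 0.1},
}

def sample_next_state(state, action):
    """Draw a successor state according to P(s' | s, a)."""
    dist = P[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state(1, "right"))  # usually 2, sometimes 1 or 5
```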
In Game Playing: the Opponent’s Actions are not Certain
[Figure: from the current state, your move leads to one of several next states with probabilities such as 0.1, 0.7, 0.1, 0.1, depending on the opponent’s reply.]
Therefore, it is appropriate to use: P(next_state | current_state, your_move)
The resulting state of your move is not certain because it will be decided by your opponent.
The Key Representations
• Model the environment by states, actions, percepts, and the following:
• The transition model f = P(s_{t+1} | s_t, a_t) may be probabilistic
• The sensor model q = P(z | s) may be probabilistic
• States may have a utility value U(s) or V(s)
• Agents may receive rewards r (implicit “goals”)
  – When entering a state: s → r
  – When performing an action in a state: (s, a) → r
• Behaviors may be represented as a policy: s → a
• Objective: find the optimal policy based on utilities or rewards
Certainty vs. Uncertainty
The certain way: from s1, action a1 always goes to s2.
The uncertain way: from s1, action a1 does not always go to s2; it may also lead to s3.
[Figure: s1 is not always seen as blue and s2 is not always seen as red (observations vary), and “a1” from s1 does not always reach s2. An action’s outcome is 0.8 for the intended direction and 0.1 for each sideways direction.]
Uncertainty in Sensors & Actions
[Figure: several states map to the same observations; the state-to-observation mapping is marked with question marks.]
• Solution: let’s use probabilities
  – Action/Transition model (for states and actions): P(NextState | CurrentState, Action)
  – Sensor model (for states and observations): P(Observation | State)
Hidden Markov Model (with actions)
[Figure: states such as s23, s77, s94 connected by action a1, each emitting observations z1, z2, z3.]
1. Actions
2. Percepts (observations)
3. States
4. Appearance: states → observations
5. Transitions: (states, actions) → states
6. Current state
Hidden Markov Model (1/2): the Action Model
Hidden Markov Model (2/2): the Sensor Model
This is also called “localization”.
The HMM for Little Prince’s Planet (with uncertain actions and sensors)
Transition model: {P(s3|s0,f)=.51, P(s2|s1,b)=.32, P(s4|s3,t)=.89, ...}
Sensor model: {P(rose|s0)=.76, P(volcano|s1)=.83, P(nothing|s3)=.42, ...}
Initial state probabilities: π1(0)=0.25, π2(0)=0.25, π3(0)=0.25, π4(0)=0.25
Experience and State Sequence
• E_{1:T}, an experience, is a sequence of observations and actions
  – {observe_1, act_1, observe_2, act_2, ..., act_{T-1}, observe_T}
  – E_{1:T} = {o_1, a_1, o_2, a_2, o_3, ..., o_{T-1}, a_{T-1}, o_T}
• X_{1:T} is the sequence of states that may be hidden but corresponds to the experience
  – X_{1:T} = {X_1, X_2, X_3, ..., X_{T-1}, X_T}
Little Prince Example
• From an “experience” (time 1 through t)
• Infer the most likely “sequence of states”
  {rose}, forward, {.}, ..., turn, ..., {rose}, bwd, {volcano} ...
Find the “best sequence of (hidden) states” that supports the experience!
[Figure: a trellis over the four states S1–S4, one column per time step.]
Action and Sensor Models (review)
[Figure: the same HMM diagram as before, with states s23, s77, s94, action a1, and observations z1, z2, z3.]
1. Actions
2. Percepts (observations)
3. States
4. Appearance: states → observations
5. Transitions: (states, actions) → states
6. Current state
What about the goals?
Utility Value of States (Utility Value ⇔ Goal Information)
[Figure: the same HMM diagram, now annotated with utilities U(s23)=0.4, U(s94)=0.2, U(s77)=0.4.]
1. Actions
2. Percepts (observations)
3. States
4. Appearance: states → observations
5. Transitions: (states, actions) → states
6. Current state
7. Rewards: R(s) or R(s,a) (related to the goals, given to the agent)
8. Utility value of states: U(s) (how good for the goals; must be computed)
Rewards and Utilities
• The agent may receive rewards when
  – entering a state or performing an action
• Every state may have a utility value U(s) or V(s)

Rewards and utilities for R2D2:

X_t    Reward   Utility
4      +1       0.0
7      -1       0.0
Else   -0.04    0.0
       (given)  (to be learned)
Reward and Action Example
[Grid of rewards, top row to bottom row:
 -0.04  -0.04  -0.04   +1
 -0.04  (wall) -0.04   -1
 -0.04 (start) -0.04  -0.04  -0.04 ]
Rewards (not utilities) as shown:
  Two terminal states: -1 and +1 (the goal)
  -0.04 for all nonterminal states
Actions: >, <, ^, v, with outcome probabilities 0.8 for the intended direction and 0.2 (0.1 + 0.1) for the sideways directions.
State Utility Value Example
[Grid: every cell’s utility value is still unknown (“?”).]
State utility values must be learned or computed by the agent. Initially, they could all be 0.0.
[Figure: an action’s outcome, 0.8 for the intended direction and 0.1 for each sideways direction.]
Little Prince Example (may add Rewards)
• (action) Transition probabilities f, one table per action (Fwd, Back, Turn):

F (Forward)   S0   S1   S2   S3
S0            0.1  0.1  0.1  0.7
S1            0.1  0.1  0.7  0.1
S2            0.7  0.1  0.1  0.1
S3            0.1  0.7  0.1  0.1

B (Backward)  S0   S1   S2   S3
S0            0.1  0.1  0.7  0.1
S1            0.1  0.1  0.1  0.7
S2            0.1  0.7  0.1  0.1
S3            0.7  0.1  0.1  0.1

T (Turn)      S0   S1   S2   S3
S0            0.7  0.1  0.1  0.1
S1            0.1  0.7  0.1  0.1
S2            0.1  0.1  0.1  0.7
S3            0.1  0.1  0.7  0.1

• (sensor) Appearance probabilities θ:

θ    Rose  Volcano  Nothing
S0   0.8   0.1      0.1
S1   0.1   0.8      0.1
S2   0.1   0.1      0.8
S3   0.1   0.1      0.8

• Initial state probabilities π:

π    S0    S1    S2    S3
     .25   .25   .25   .25

You may add rewards to states (e.g., prefer to see the rose).
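These tables can be written down directly as arrays. The following is a minimal encoding of the numbers above (rows of a transition matrix are current states, columns are next states; rows of θ are states, columns are observations); only the variable names are my own:

```python
import numpy as np

# Transition matrices, one per action: row = current state, column = next state.
F = np.array([[0.1, 0.1, 0.1, 0.7],
              [0.1, 0.1, 0.7, 0.1],
              [0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1]])
B = np.array([[0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7],
              [0.1, 0.7, 0.1, 0.1],
              [0.7, 0.1, 0.1, 0.1]])
T = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7],
              [0.1, 0.1, 0.7, 0.1]])

# Sensor model: row = state, columns = P(rose), P(volcano), P(nothing).
theta = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])

# Initial state distribution.
pi = np.array([0.25, 0.25, 0.25, 0.25])

# Every row of a transition or sensor matrix is a probability distribution.
assert np.allclose(F.sum(axis=1), 1) and np.allclose(theta.sum(axis=1), 1)
```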
Little Prince in Action (POMDP)
An experience E_{1:T} over time:
  {rose}, forward, {none}, ..., turn, {rose}, back, {volcano} ...
• Given: the “experience” E_{1:T} (time 1 through T)
• Definitions:
  – States S and actions A (Forward, Back, Turn)
  – (Action model) Transition probabilities Φ
  – (Sensor model) Appearance probabilities θ (rose, volcano, none)
  – (Localization) Initial/current state probabilities π
• Partially observable: the agent sees only the percepts, not the states
Partially Observable
[Figure: the same trellis over states S1–S4, one column per time step; only the observations are visible.]
State Utility and Decision on Actions
• Combine the following together
  – A partially observable action/perception model
    • E.g., the Little Prince example
  – Compute state utility values from rewards (goals)
    • E.g., use Dynamic Programming (next slide)
  – A policy (decide actions based on state utility values)
    • To gain as much reward as you can
    • To get as close to the goals as you can
Compute State Values by Dynamic Programming (review, from ALFE 6.1.1)
Backward recursion: compute future utility/values by working back from the goal, stage by stage.
Rewards are given: a_i[R] = R(s, a_i) is the reward from taking an action a_i in a state s.
Backward recursion equation: U_t(s) = max_a [ R(s,a) + Σ_{s'} T(s,a,s') U_{t+1}(s') ]
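A minimal sketch of this backward recursion over a fixed horizon, assuming the transition model T, reward R, and horizon H are given; all names here are illustrative, not from the slides:

```python
def backward_recursion(states, actions, T, R, H):
    """Finite-horizon dynamic programming.

    T[s][a] is a dict {s_next: probability}, R[s][a] is the immediate reward.
    Returns U[t][s], the value of being in state s at stage t (U[H] is the horizon).
    """
    U = [{s: 0.0 for s in states} for _ in range(H + 1)]   # zero value at the horizon
    for t in range(H - 1, -1, -1):                          # work backward, stage by stage
        for s in states:
            U[t][s] = max(
                R[s][a] + sum(p * U[t + 1][s2] for s2, p in T[s][a].items())
                for a in actions
            )
    return U
```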
Markov Decision Process (MDP)
• An MDP consists of
  – States S and actions A
  – An initial state s0 / probability distribution π
  – A transition model Φ(s'|s,a)
  – A sensor model θ(z|s)
  – A reward function R(s)
Note: a typical MDP has no sensor model
Maximum Expected Utility (MEU) and Rational Agents
• Every state has a utility value U(s)
• The expected utility of an action a, given the current evidence or observation e, is the average utility value of the outcomes, weighted by the probability that each outcome occurs:
    EU(a|e) = Σ_{s'} P(result(a) = s' | a, e) U(s')
• The principle of maximum expected utility (MEU) is that a rational agent should choose the action that maximizes its expected utility:
    action = argmax_a EU(a|e)
[Figure: states s2, s7, s9 reachable by action a1, with U(s2)=0.4, U(s7)=0.4, U(s9)=0.2.]
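A small sketch of MEU action selection under these definitions; the outcome distributions and utilities below are placeholders of my own, not values from the slides:

```python
def meu_action(actions, outcome_probs, U):
    """Pick the action with maximum expected utility.

    outcome_probs[a] is a dict {s_next: P(result(a) = s_next | a, e)};
    U[s] is the utility of state s.
    """
    def expected_utility(a):
        return sum(p * U[s] for s, p in outcome_probs[a].items())
    return max(actions, key=expected_utility)

# Tiny illustrative example (numbers are made up):
U = {"s2": 0.4, "s7": 0.4, "s9": 0.2}
outcome_probs = {"a1": {"s2": 0.5, "s9": 0.5}, "a2": {"s7": 1.0}}
print(meu_action(["a1", "a2"], outcome_probs, U))   # -> "a2" (EU 0.4 > 0.3)
```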
POMDP: Sensor Model, Belief States
• A Partially Observable MDP (POMDP) is defined as follows:
  – A set of states S (with an initial state s0)
  – A set of actions: A(s), the actions available in each state s
  – A transition model P(s'|s,a), or T(s,a,s')
  – A reward function R(s), or R(s,a)
  – A sensor model P(e|s)   ← new
  – A belief of what the current state is, b(s)   ← new
• Belief states (where am I now? what is my current state?):
  – If b(s) was the previous belief state, and the robot does action “a” and then perceives new evidence “e”, then the new belief state is
      b'(s') = α P(e|s') Σ_s P(s'|s,a) b(s)
    where α is a normalization constant making the belief states sum to 1
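A minimal sketch of this belief update, reusing the array layout from the Little Prince tables (one transition matrix per action, one sensor column per observation); the function and variable names are mine:

```python
import numpy as np

def belief_update(b, trans, sensor_col):
    """POMDP belief update: b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) b(s).

    b          : current belief, shape (n_states,)
    trans      : transition matrix of the chosen action, trans[s, s'] = P(s'|s,a)
    sensor_col : P(e|s') for the observed evidence e, shape (n_states,)
    """
    predicted = b @ trans                       # sum_s P(s'|s,a) b(s)
    unnormalized = sensor_col * predicted
    return unnormalized / unnormalized.sum()    # normalize (the alpha constant)

# Example with the Little Prince model: start uniform, go Forward, then see "rose".
pi = np.array([0.25, 0.25, 0.25, 0.25])
F = np.array([[0.1, 0.1, 0.1, 0.7],
              [0.1, 0.1, 0.7, 0.1],
              [0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1]])
P_rose = np.array([0.8, 0.1, 0.1, 0.1])   # theta column for "rose"
print(belief_update(pi, F, P_rose))
```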
Goals, Rewards, Utilities, Policies
• Goals
  – Given to the agent from the problem statements
• Rewards
  – Given to the agent, designed based on the goals
• Utility values for states
  – Computed by the agent, based on the rewards
• Policies
  – Computed or learned by the agent
  – Used by the agent to select its actions
  – The better a policy, the more rewards it collects
Compute Utilities from Rewards over Time
• Utility value = sum of all future rewards
  – Add the rewards as they are:
      U_h([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
  – Discount the far-away rewards in the future:
      U_h([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...
• The expected future utility value U^π(s) obtained by executing a policy π starting from s:
      U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) ]
• The optimal policy:
      π* = argmax_π U^π(s)
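As a quick illustration of the discounted sum, a few lines of Python (the reward sequence here is an arbitrary example, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# E.g., three steps of -0.04 followed by the +1 terminal reward:
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=1.0))   # 0.88
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9))   # ~0.62
```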
Compute Utilities (example)
Rewards (given) as on the earlier slide; utilities (computed) as shown:
  Top row:    0.812  0.868  0.918   +1
  Middle row: 0.762  (wall) 0.660   -1
  Bottom row: 0.705  0.655  0.611  0.388
Given that the reward in every nonterminal state is R(s) = -0.04, compute the utilities using a future discount γ = 1.
(We will see how these utilities are computed.)
State Utility Value Iteration (improving U(s) every step)
• U(s): the expected sum of maximum rewards achievable starting from a particular state s
• Bellman equations:
  – For n states, there are n equations that must be solved simultaneously
      U*(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U*(s')        (17.5)
• Bellman iteration:
  – Converges to U*(s) step by step
      U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s')   (17.6)
(A code sketch of the Bellman iteration follows below.)
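A minimal sketch of the Bellman iteration of eq. (17.6); how T, R, and terminal states are encoded here is my own assumption, not part of the slides:

```python
def value_iteration(states, actions, T, R, gamma, n_iters=100):
    """Bellman iteration: U_{i+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U_i(s').

    T[(s, a)] is a dict {s_next: probability}; R[s] is the reward of state s.
    For a terminal state, use an empty dict so that U(terminal) settles at R(terminal).
    """
    U = {s: 0.0 for s in states}                    # U_0 = 0 everywhere
    for _ in range(n_iters):
        U = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                                   for a in actions)
             for s in states}
    return U
```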
Bellman Iteration Example

The initial utility values (all 0.0):
  0.0  0.0  0.0  0.0
  0.0 (wall) 0.0  0.0
  0.0  0.0  0.0  0.0

Reward R(s):
  -0.04  -0.04  -0.04   +1
  -0.04  (wall) -0.04   -1
  -0.04  -0.04  -0.04  -0.04

U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s')

Transition probability: T(s,a,s') = 0.8 for the intended direction, 0.2 (0.1 + 0.1) for the sideways directions.
Discount: γ = 1.0

The final utility values:
  0.812  0.868  0.918   +1
  0.762  (wall) 0.660   -1
  0.705  0.655  0.611  0.388
Bellman Iteration Example (1st iteration)

The initial utilities U_0 (all 0.0) and the rewards R(s) are as on the previous slide.

U_1(s4) = +1.0 + 1.0 * max_a(Σ T·U_0) = +1.0
U_1(s7) = -1.0 + 1.0 * max_a(Σ T·U_0) = -1.0
Every other state: U_1(s) = -0.04

The utility values U_1 after the 1st iteration:
  -0.04  -0.04  -0.04  +1.0
  -0.04  (wall) -0.04  -1.0
  -0.04  -0.04  -0.04  -0.04

Transition probability: T(s,a,s') = 0.8 for the intended direction, 0.2 sideways. Discount γ = 1.0.
Bellman Iteration Example (2nd iteration)

U_1 (from the previous slide) and the rewards R(s) are used to compute U_2:

U_2(s4) = +1.0 + 1.0 * max(0, 0, 0) = +1.0
U_2(s7) = -1.0 + 1.0 * max(0, 0, 0) = -1.0
U_2(s3) = -0.04 + 1.0 * max(0.8*1 - 0.1*0.04 - 0.1*0.04, ...) = 0.752
U_2(s6) = -0.04 + 1.0 * max(-0.8*0.04 - 0.1*0.04 - 0.1*0.04, ...) = -0.08

Transition probability: T(s,a,s') = 0.8 for the intended direction, 0.2 sideways. Discount γ = 1.0.

The utility values U_2 after the 2nd iteration: U_2(s3) = 0.752, U_2(s4) = +1.0, U_2(s6) = -0.08, U_2(s7) = -1.0. Please fill in the remaining “?” values.
Bellman Iteration Example (final)

Iterating U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s') with the same rewards, transition probabilities (0.8 intended, 0.2 sideways), and γ = 1.0 converges to the final utility values:
  0.812  0.868  0.918   +1
  0.762  (wall) 0.660   -1
  0.705  0.655  0.611  0.388
Please verify these final utility values by yourself!
Utility Value Iteration for POMDP
• In MDP utility value iteration, when sensors are certain, we know the next state s':
    U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s')
• In a POMDP, where sensors are uncertain, we must average over all possible evidences for the next state s' (see the last sum below):
    α_p(s) = R(s) + γ ( Σ_{s'} T(s,a,s') Σ_e P(e|s') α_{p.e}(s') )
Policy and Optimal Policy
• A solution of a (PO)MDP can be represented as
  – A policy π(s) = a
  – Each execution of the policy from s0 may yield a different history or path, due to the probabilistic nature of the transition model
  – An optimal policy π*(s) is a policy that yields the highest expected utility
Policy Based on Known Utility Values
Rewards → Utility Values → Policies
Suppose we know the utility values of the states, as below. What path would an optimal policy choose?
  0.812  0.868  0.918   +1
  0.762  (wall) 0.660   -1
  0.705  0.655  0.611  0.388
Optimal Policy Examples, Depending on the Reward Distribution R(s)
If the rewards for the nonterminal states are evenly distributed (e.g., -0.04), then the path chosen will depend on the probabilities of the transition model:
  →      →      →      +1
  ↑     (wall)  ↑      -1
  ↑      ←      ←      ←
(Nonterminal R(s) = -0.04, on the 4x3 grid.)
Caution: because of the nondeterministic actions, from state (3,2) or (4,1) you may “accidentally” go to (4,2). So there is a “risk” in this policy.
Optimal Policy Examples, Depending on the Reward Distribution R(s)
[Figure: four optimal policies for the 4x3 world under different nonterminal rewards R(s).]
• R(s) < -1.6284 (life is painful, death is good): the agent heads straight for the nearest exit, even the -1 state (“suicide”).
• -0.4278 < R(s) < -0.0850 (life is OK, willing to risk): the agent takes the short but “risky” route to +1, passing next to the -1 state.
• Nonterminal rewards R(s) = -0.04: the agent ends nicely at +1 while taking no risk near -1.
• R(s) > -0.0221 (life is rewarding, I don’t want to end it): the agent avoids both exits and doesn’t try to end the game (“don’t end”, “no risk”).
Why not go up in the risk-avoiding policies? Hint: consider the nature of the actions, deterministic or not. Going up is “risky”.
Optimal Policy
• You can compute the optimal policy once, after all U*(s) are known
• Or you can compute it incrementally, every iteration, when U_i(s) is updated
  – This is called “Policy Iteration” (see the next slide)
    π*(s) = argmax_a Σ_{s'} T(s,a,s') U*(s')
Policy Iteration (improving π(s) every step)
• Start with a randomly chosen initial policy π_0
• Iterate until there is no change in the utility values of the states:
  • Policy evaluation: given a policy π_i, calculate the utility U_i(s) of every state s under π_i by solving the system of equations:
      U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')
  • Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on the current U_i(s):
      π_{i+1}(s) = argmax_a Σ_{s'} T(s,a,s') U_i(s')
(A code sketch of these two alternating steps follows below.)
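A compact sketch of policy iteration, using iterative policy evaluation instead of an exact linear solve; the model encoding and all names are my own assumptions:

```python
import random

def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50, max_iters=100):
    """Alternate policy evaluation and policy improvement until the policy is stable.

    T[(s, a)] is a dict {s_next: probability}; R[s] is the reward of state s.
    """
    pi = {s: random.choice(actions) for s in states}     # random initial policy
    U = {s: 0.0 for s in states}
    for _ in range(max_iters):
        # Policy evaluation: U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, pi[s])].items())
                 for s in states}
        # Policy improvement: one-step look-ahead on the current U.
        new_pi = {s: max(actions,
                         key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
                  for s in states}
        if new_pi == pi:          # no change: the policy has converged
            break
        pi = new_pi
    return pi, U
```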
Policy Iteration Comments
• Each step of policy iteration is guaranteed to strictly improve the policy at some state, whenever an improvement of the state utility values is possible
• Converges to the optimal policy
• Gives the exact value of the optimal policy
Policy Iteration Example
[A worked example, shown over several slides as figures.]
Policy Iteration Example
• The new policy π_2 after one iteration from π_1
  – π_2(Hungry) → Eat
  – π_2(Full) → Sleep
Summary for Uncertain Environments
• POMDP is a very general model (the most general here)
  – Deals with uncertainty through action models and sensor models
  – Incorporates goals → rewards → utilities → policy
  – Utilities and policies can be computed from rewards
    • Systematically (Bellman equations)
    • Iteratively (Bellman’s iteration algorithm)
  – Solves a problem by following a good policy
    • One policy for one problem/goal
    • Different policies are needed for different goals
REINFORCEMENT LEARNING
Markov Decision Process
• We learned that a Markov Decision Process (MDP) consists of
  • A set of states S (with an initial state s0)
  • A set of actions A in each state
  • A set of percepts that can be sensed
  • A transition model P(s'|s,a), or T(s,a,s')
  • A sensor model q = P(z|s)
  • The current state distribution
  • A reward function R(s,a,s')   // careful here
• The MDP’s solution identifies the best action to take in each state
  • The optimal policy π(s) = a, obtained by value iteration or policy iteration (aka dynamic programming)
An Example for MDP
• Given
  – States: s1, ..., sn; Actions: a1, a2
  – Transition probability distribution (assume s_{n+1} = s1):
    • P(si, a1, si+1) = 0.8, P(si, a1, s1) = 0.2, P(si, a2, s1) = 0.9, P(si, a2, si+1) = 0.1
  – Reward: all R(si, ai, sj) = 0.0, except R(sn-1, a1, sn) = 99.9
  – We define the goal state to be sn
  – The future discount factor: gamma = 0.7
• State utility values: U(sn-1), U(sn-2), U(sn-3), ..., and U(s2), U(s1)
• Optimal policy: π(si) = a1
[Figure: a ring of states s1, ..., sn, with reward 99.9 on the transition into the goal state sn.]
Reinforcement Learning
• Given a Markov Decision Process (environment)
  – A set of states S (known)
  – A set of actions A in each state (known)
  – A set of percepts that can be sensed (known)
  – The current state (known)
  – Transitions P(s'|s,a) (only known by the environment)
  – Rewards R(s,a,s') (only known by the environment)
• Could the agent still learn the optimal policy?
Goals, Rewards, Utilities, Policies
• Goals
– Given to the agent from the problem statements
• Rewards
– Given to the agent, designed based on the goals
• Utility values for states
– Computed by the agent, based on the rewards
• Policies
– Computed or learned by the agent
– Used by the agent to select its actions
– The better a policy, the more rewards it collects
Reinforcement Learning: Propagating the Delayed Rewards
[Figure: the same grid world with goal G, shown before RL and after RL.]
After RL, no matter where you are, you know which way to go to reach the Goal!
This is because all states now have utility values that will lead your agent to the goal.
Reinforcement Learning (Key Ideas)
• Can the agent still learn the optimal policy?
  – When it knows only the states S and actions A, but not the transition model P(s,a,s') and/or the reward function R(s,a,s')
• Try an action in a state of the environment, and get a “sample” that includes (s, a, s', r) and maybe U:
  – A state transition (s,a) → s'
  – A reward received for the action, r = R(s, a, s')
  – And maybe the utility value of the next state, U(s')
• Objective: try different actions in different states to discover an optimal policy π(s) = a that eventually tells the agent which action will lead to the most rewarding states
How to Explore the State Space
• Visit different states to discover a policy and improve it (see the sketch after this list)
  – There are numerous strategies
  – A simple strategy is executing actions randomly
    • Initially there is no policy, but random actions eventually discover one
    • At every time step, act randomly with some probability (for exploration); otherwise, follow the current policy π(s) = a
  – Random actions may cause unnecessary action executions once the optimal policy has already been discovered
  – One solution is to lower the probability of selecting a random action over time (reduce the “temperature”)
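A minimal sketch of this explore-or-exploit choice with a decaying exploration probability; epsilon and its decay schedule are illustrative values, not from the slides:

```python
import random

def choose_action(state, policy, actions, epsilon):
    """With probability epsilon act randomly (explore), otherwise follow the policy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return policy[state]

# Decay the exploration probability over time ("reduce the temperature"):
epsilon = 1.0
for step in range(10_000):
    epsilon = max(0.05, epsilon * 0.999)   # never drop exploration to exactly zero
    # a = choose_action(s, policy, actions, epsilon)   # take a, observe (s', r), learn...
```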
Reinforcement Learning
• Objective: learn the optimal policy π*(s) = a
• Two general classes of algorithms:
  – Model-based RL
    • First learn the transition model and the utility values of states, then learn the policy (e.g., policy iteration)
  – Model-free RL
    • Learn the policy without learning an explicit transition model, by receiving “samples” from the environment
• Three algorithms we will learn:
  – Monte Carlo
  – Temporal Difference
  – Q-Learning
An Example for Model-Based RL
• Given
  – States: s1, ..., sn
  – Actions: a1, a2
  – Probabilistic transition model (assume s_{n+1} = s1):
    • P(si, a1, si+1) = 0.8, P(si, a1, s1) = 0.2, P(si, a2, s1) = 0.9, P(si, a2, si+1) = 0.1
  – Reward: all R(si, ai, sj) = 0.0, except R(sn-1, a1, sn) = 99.9
  – Future discount factor: γ = 0.7
• Try to learn: U(sn-1), U(sn-2), U(sn-3), ..., and U(s2), U(s1)
[Figure: the same ring of states, with reward 99.9 into the goal.]
Model-Based RL
• Learn an approximate model based on random actions
  – From a state s, count the outcomes s' for each (s, a)
  – Normalize to give an estimate of P(s, a, s')
  – Learn the state utilities U(s) from the rewards R(s, a, s') when they are encountered
  – Learn a policy (e.g., by policy iteration) and solve the MDP
State Utility Value Iteration (review) (improving U(s) every step)
• U(s): the expected sum of maximum rewards achievable starting from a particular state s
• Bellman equations:
  – Many equations must be solved simultaneously
      U*(s) = R(s,a,s') + γ max_a Σ_{s'} T(s,a,s') U*(s')
• Bellman iteration:
  – Converges to U*(s) step by step
      U_{i+1}(s) ← R(s,a,s') + γ max_a Σ_{s'} T(s,a,s') U_i(s')
Utility Value Iteration (example)
Find your way to the goal. Use Bellman iteration to compute state values from rewards, with γ = 0.9:
    V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i},   0 ≤ γ < 1
    U_{i+1}(s) ← R(s,a,s') + γ max_a Σ_{s'} T(s,a,s') U_i(s')
Policy Iteration (review) (improving π(s) every step)
• Start with a randomly chosen initial policy π_0
• Iterate until there is no change in the utility values of the states:
  • Policy evaluation: given a policy π_i, calculate the utility value U_i(s) of every state s using π_i by solving the system of equations:
      U^π(s) = R(s, π(s), s') + γ Σ_{s'} T(s, π(s), s') U^π(s')
  • Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on U_i(s):
      π_{i+1}(s) = argmax_a Σ_{s'} T(s,a,s') U_i(s')
Compute Utility from Rewards and Policy
[Figure: a state x with value V(x), policy action a = π(x), rewards R(x,a), and transitions P(y|x,a) (i.e., P_xy[a]) to successor states y1, y2, ..., yl with values V(y1), V(y2), ..., V(yj).]
Assume the transitions are known.
An Example for Model-Based RL
• Given
  – States: s1, ..., sn
  – Actions: a1, a2
  – Probabilistic transition model (assume s_{n+1} = s1):
    • P(si, a1, si+1) = 0.8, P(si, a1, s1) = 0.2, P(si, a2, s1) = 0.9, P(si, a2, si+1) = 0.1
  – Reward: all R(si, ai, sj) = 0.0, except R(sn-1, a1, sn) = 99.9
  – We define the goal state to be sn
  – Future discount factor: γ = 0.7
• U(sn-1) = 99.9, U(sn-2) ≈ 70, U(sn-3) ≈ 49, ..., and U(s1) ≈ 99.9 * 0.7^n
[Figure: the same ring of states, with reward 99.9 into the goal.]
Model-Based RL
• Learn an approximate model based on random actions
  – From a state s, count the outcomes s' for each (s, a) (see the counting sketch below)
  – Normalize to give an estimate of P(s, a, s')
  – Learn the state utilities U(s) from the rewards R(s, a, s') when they are encountered
  – Solve the learned MDP and obtain a policy (e.g., policy iteration)
• Pros:
  – If and when the goal changes, one can use the learned model P(s, a, s') to recalculate U(s) and compute a new policy offline
• Cons:
  – Larger representation than model-free RL
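A minimal sketch of estimating the transition model and rewards from experienced samples by counting and normalizing; the data structures and names here are illustrative assumptions:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
reward_sum = defaultdict(float)                  # running sums for average rewards
reward_n = defaultdict(int)

def record_sample(s, a, s_next, r):
    """Record one experienced transition (s, a, s', r)."""
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a, s_next)] += r
    reward_n[(s, a, s_next)] += 1

def estimated_model():
    """Normalize the counts into P(s'|s,a) and average rewards R(s,a,s')."""
    P = {sa: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
         for sa, nexts in counts.items()}
    R = {key: reward_sum[key] / reward_n[key] for key in reward_n}
    return P, R
```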
Model-Free Reinforcement Learning
• Objective: learn the optimal policy π*(s) = a
• Model-based RL
  – First learn the transition model and the utility values of states, then learn the policy (e.g., policy iteration)
• Model-free RL
  – Learn the policy without learning an explicit model, by receiving “samples” from the environment
  – Three algorithms we will learn:
    • Monte Carlo
    • Temporal Difference
    • Q-Learning
Monte Carlo RL (model free)
• Generate a fixed policy π(s) = a based on U(s)
  • In each episode, the learner follows the policy starting at a random state
  • Average the sampled returns originating from each state
  • Generate a policy based on U(s), as in “policy iteration”
• Cons:
  – The utility values of states U(s) are given and fixed (not learned)
  – Works only in episodic problems
  – Takes a very long time to converge, since learning is from complete sample returns
  – Wastes information, as it figures out state values in isolation from other states
Temporal Difference RL (model free)
• Improves on Monte Carlo, still using a fixed policy
  – Update U(s) using each sample (s, a, s', r) received from the environment; each sample includes:
    1. A state transition (s,a) → s' done by the environment
    2. A reward received for the action, r = R(s, a, s')
    3. And the value of the next state, U(s')
  – That is, Sample = r + γ U(s'), where γ is the future discount
• TD algorithm (see the sketch below):
  – U(s) ← (1-α) U(s) + α * Sample = (1-α) U(s) + α [r + γ U(s')]
  – where α is the “learning rate” (how much you trust the sample)
  – You might decrease α over time as U(s) converges to the average
  – The parameter α also represents “forgetting long-past values”
• Cons:
  – Only provides the utility values of states; does not improve the policy directly
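A minimal sketch of this TD update applied to a stream of samples; the environment interface in the comments is an assumption for illustration:

```python
def td_update(U, s, r, s_next, alpha, gamma):
    """One TD(0) update: U(s) <- (1 - alpha) U(s) + alpha (r + gamma U(s'))."""
    U[s] = (1 - alpha) * U[s] + alpha * (r + gamma * U[s_next])

# Usage sketch: follow the fixed policy and update after every sample (s, a, s', r).
# for each step:
#     a = policy[s]
#     s_next, r = env_step(s, a)        # env_step is a placeholder for the environment
#     td_update(U, s, r, s_next, alpha=0.4, gamma=0.7)
#     s = s_next
```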
An Example for TD-RL
• Given
  – States: s1, ..., sn
  – Actions: a1, a2
  – Probabilistic transition model (assume s_{n+1} = s1):
    • P(si, a1, si+1) = 0.8, P(si, a1, s1) = 0.2, P(si, a2, s1) = 0.9, P(si, a2, si+1) = 0.1
  – Reward: all R(si, ai, sj) = 0.0, except R(sn-1, a1, sn) = 99.9
  – We define the goal state to be sn
  – The learning rate α = 0.4
  – The future discount factor: γ = 0.7
• Use the TD algorithm to learn: U(sn-1), U(sn-2), U(sn-3), ..., and U(s2), U(s1)
[Figure: the same ring of states, with reward 99.9 into the goal.]
Q-Learning (Definition)
• The key idea is to use Q(s,a) to represent the value of taking an action a in state s, rather than only the value of a state U(s):
  – Define:
      Q(s,a) ≡ Σ_{s'} P(s,a,s') [R(s,a,s') + γ U(s')]
      Q_{k+1}(s,a) ← Σ_{s'} P(s,a,s') [R(s,a,s') + γ max_{a'} Q_k(s',a')]
  – The last term uses Q to replace U
• Optimal policy: π(s) = argmax_a Q(s,a)
Q-Learning (Algorithm)
• Initially, let Q(s,a) = 0, given α and γ
• Sample a transition and reward: (s, a, s', r)
  – Sample = r + γ max_{a'} Q(s',a')
  – Q(s,a) = (1-α) Q(s,a) + α * Sample
      Q_{t+1}(s,a) = (1-α) Q_t(s,a) + α [R(s,a,s') + γ max_{a'} Q_t(s',a')]
[Figure: taking action a in state s yields reward r and next state s'.]
Advantages:
1. No need for a fixed policy; it can do off-policy learning, i.e., it directly learns a policy
2. The optimal policy π(s) = a selects the action a that has the best Q(s,a)
3. Provably convergent to the optimal policy as t → ∞
(A sketch of this update loop follows below.)
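A minimal sketch of the tabular Q-learning loop under these definitions; the environment interface and the epsilon-greedy exploration are assumptions of mine, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning(env_step, start_state, actions, alpha=0.1, gamma=1.0,
               epsilon=0.1, n_steps=10_000):
    """Tabular Q-learning.

    env_step(s, a) is assumed to return (s_next, reward, done).
    Returns the learned Q table; the greedy policy is argmax_a Q[(s, a)].
    """
    Q = defaultdict(float)                       # Q(s, a) = 0 initially
    s = start_state
    for _ in range(n_steps):
        # Epsilon-greedy exploration.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r, done = env_step(s, a)
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = start_state if done else s_next      # restart the episode at a terminal state
    return Q
```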
The Q-Learning Algorithm
[Figure: pseudocode of the Q-learning algorithm.]
Q-Learning (in One Formula)
Q(s,a) ← (1-α) Q(s,a) + α [ r(s) (or r(s,a)) + γ max_{a'} Q(s',a') ]
Q-Learning example from reward r(s,a), with γ = 0.9:
    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    ← 90   (the previous estimate of 72 is replaced by 90)
Back-propagate Q-values one step backwards.
Q-Learning from state utility values. Notice these:
    π*(s) = argmax_a Q(s,a)
    V*(s) = max_a Q(s,a)
An Example for Q-Learning
• Given
  – States: s1, ..., sn
  – Actions: a1, a2
  – Probabilistic transition model (assume s_{n+1} = s1):
    • P(si, a1, si+1) = 0.8, P(si, a1, s1) = 0.2, P(si, a2, s1) = 0.9, P(si, a2, si+1) = 0.1
  – Reward: all R(si, ai, sj) = 0.0, except R(sn-1, a1, sn) = 99.9
  – We define the goal state to be sn
  – The learning rate α = 0.4
  – The future discount factor: γ = 0.7
• Try to learn: Q(sn-1,a1), Q(sn-1,a2), Q(sn-2,a1), and Q(sn-2,a2) = ?
[Figure: the same ring of states, with reward 99.9 into the goal.]
An Example for Q-Learning (continued)
Initially: all Q(si, aj) = 0.0
Assume you got two samples at state sn-1 for action a1:
• (sn-1, a1, sn), r = R(sn-1, a1, sn) = 99.9   // P(sn-1, a1) → sn
• (sn-1, a1, s1), r = R(sn-1, a1, s1) = 0.0    // P(sn-1, a1) → s1
Updated Q value, if P(s,a) is known:
  – Q(sn-1, a1) = 0.8*[99.9 + 0.7*0.0] + 0.2*[0.0 + 0.7*0.0] = 79.9
If P(s,a) is unknown: sample1 = 99.9 + 0.7*0, sample2 = 0 + 0.7*0
  – Q(sn-1, a1) = (1-0.4)*0.0 + 0.4*99.9 = 39.96
  – Q(sn-1, a1) = (1-0.4)*39.96 + 0.4*0.0 = 23.976
Discussions of RL (1)
• Pros:
  – Easy to use in problems where the data can be mapped to states easily
  – Guaranteed to find the optimal policy given enough time, even with suboptimal actions
• Cons:
  – The states of the environment are not always known
  – Computation time is intractable for large or continuous state spaces
    • E.g., if each cell in a grid world is a state, then the state space grows quickly with the number of rows and columns (states = map size = m*n)
  – Cannot handle raw data; must use an approximation/reduction function
    • Designing approximation functions to disambiguate similar states requires human intelligence or an alternate learning technique
      – E.g., use the relative distance (states = m*n) between two agents in a hunter-prey problem, as opposed to their cell coordinates (states = m*n*m*n)
  – Model-free RL cannot transfer the learned knowledge when the goal changes
    • Forgetting a learned policy is much more difficult (hysteresis); it is quicker to start from scratch
Discussions of RL (2)
The utility function is unknown at the beginning
• When the agent visits each state, it receives a reward
  • Possibly negative
What function should it learn?
• U(s): utility-based agent
  • If it already knows the transition model
  • Then use an MDP algorithm to solve for MEU actions
• Q(s,a): Q-learning agent
  • If it doesn’t already know the transition model
  • Then pick the action that has the highest Q value in the current state
• π*(s): reflex agent
  • Learn a policy directly, then pick the action that the policy says
Q-Learning Summary
Modify the Bellman equation to learn Q values of actions
• U(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) U(s')
• We don’t know R or P
• But when we perform a in s, we move to s' and receive R(s)
• We want Q(s,a) = the expected utility of performing a in state s
Update Q after each step
• If we get a good reward now, then increase Q
• If we later get to a state with a high Q, then increase Q here too
• Q(s,a) ← (1-α) Q(s,a) + α (R(s) + γ max_{a'} Q(s',a'))
• α is the learning rate, γ is the discount factor
Converges to the correct values if α decays over time
• Similar to the “temperature” in simulated annealing
Q-Learning in the Star Wars Environment
[Figure: the grid world with states numbered 1 through 11.]
Initial Q values: Q(s,a) = 0.0 for every state s = 1, ..., 11 and every action a ∈ {U, D, L, R}.
α = 0.1, γ = 1
Step 1
General update: Q(s,a) ← (1-α) Q(s,a) + α (R(s) + γ max_{a'} Q(s',a'))
The agent moves Right from state 1 to state 2:
    Q(1,R) ← (1-α) Q(1,R) + α (R(1) + γ max_{a'} Q(2,a'))
           ← (1-α) Q(1,R) + α (-0.04 + γ max_{a'} Q(2,a'))
           ← (1-α) Q(1,R) + α (-0.04 + γ * 0.0)
           ← (1-α) * 0.0 + α (-0.04 + γ * 0.0)
           ← 0.1 * (-0.04 + 0.0)
           ← -0.004
After Step 1: Q(1,R) = -0.004; all other Q values are still 0.0.
Step 2
Q(2,R) ← (1-α) Q(2,R) + α (R(2) + γ max_{a'} Q(3,a')) = -0.004
After Step 2: Q(1,R) = Q(2,R) = -0.004; all other Q values are 0.0.
Step 3
Q(3,R) ← (1-α) Q(3,R) + α (R(3) + γ max_{a'} Q(4,a')) = -0.004
After Step 3: Q(1,R) = Q(2,R) = Q(3,R) = -0.004; all other Q values are 0.0.
Step 4
State 4 is the +1 terminal state: Q(4,*) ← 1
After Step 4: Q(1,R) = Q(2,R) = Q(3,R) = -0.004 and Q(4,*) = +1; all other Q values are 0.0.
Step 5 (back in state 1, moving Right again)
    Q(1,R) ← (1-α) Q(1,R) + α (R(1) + γ max_{a'} Q(2,a'))
           ← (1-α) Q(1,R) + α (R(1) + γ * 0.0)      [max_{a'} Q(2,a') = 0.0, since Q(2,R) = -0.004 and the rest are 0.0]
           ← 0.9 * (-0.004) + 0.1 * (-0.04 + 0.0)
           ← -0.0076
After Step 5: Q(1,R) = -0.0076; the other Q values are unchanged.
SARSA: State-Action-Reward-State-Action
Modify Q-learning to use the chosen action a', not the max:
• Q-learning: Q(s,a) ← (1-α) Q(s,a) + α (R(s) + γ max_{a'} Q(s',a'))
• SARSA:      Q(s,a) ← (1-α) Q(s,a) + α (R(s) + γ Q(s',a'))
On-policy, instead of off-policy:
• SARSA learns based on real behavior, not optimal behavior
  • Good if the real world provides obstacles to optimal behavior
• Q-learning learns optimal behavior, beyond real behavior
  • Good if the training phase has obstacles that won’t persist
(A sketch of the SARSA update follows below.)
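A minimal sketch of the SARSA loop; it mirrors the Q-learning sketch earlier, except that the update uses the action actually chosen for the next state. The environment interface and exploration are my assumptions:

```python
import random
from collections import defaultdict

def sarsa(env_step, start_state, actions, alpha=0.1, gamma=1.0,
          epsilon=0.1, n_steps=10_000):
    """Tabular SARSA: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma Q(s',a')).

    env_step(s, a) is assumed to return (s_next, reward, done).
    """
    Q = defaultdict(float)

    def pick(s):
        # Epsilon-greedy behavior policy; SARSA learns the values of this behavior.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a_: Q[(s, a_)])

    s = start_state
    a = pick(s)
    for _ in range(n_steps):
        s_next, r, done = env_step(s, a)
        a_next = pick(s_next)                          # the action actually chosen next
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])
        if done:
            s = start_state
            a = pick(s)                                # restart the episode
        else:
            s, a = s_next, a_next
    return Q
```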
Step 5 with SARSA (the agent’s chosen next action from state 2 is again R)
    Q(1,R) ← (1-α) Q(1,R) + α (R(1) + γ Q(2,R))
           ← 0.9 * (-0.004) + 0.1 * (-0.04 + 1 * (-0.004))
           ← -0.008
which is slightly closer to -0.04 than the Q-learning result of -0.0076, because SARSA uses the chosen next action (with Q(2,R) = -0.004) instead of the max.
Q-Learning & SARSA
Converge to the correct values
• Assuming the agent tries all actions, in all states, many times
  • Otherwise, there won’t be enough experience to learn Q
• And if α decays over time
  • Similar to simulated annealing
Avoid the complexity of solving an MDP
• An MDP requires solving for the optimal policy before starting
• An RL agent can start right away and then learn as it goes
• Although this might be wasteful if you already know P and R
Q-Learning for Game Playing
Steps:
1. Define the transition model P(state_{t+1} | state_t, YourMove)
2. Define rewards (e.g., a reward for a win)
3. Learn Q(s,a) = Q(state, YourMove) from the rewards
4. Play games with the learned Q
[Figure: from the current state, your move leads to several possible next states with probabilities 0.1, 0.7, 0.1, 0.1.]
A Key Question
• In all search and game playing problems, what is the key information that you wish to have for finding the optimal solution?
  – Information you wish to have, but may not always have
• The answer: play a lot of games and
  – Learn the state utility values for winning the games
  – Learn the Q values for winning the games
Exploitation vs. Exploration
• RL algorithms do not specify how actions are chosen by the agent
• One strategy would be, in state s, to select the action a that maximizes Q̂(s,a), thereby exploiting the current approximation Q̂.
  – Using this strategy, the agent runs the risk that it will overcommit to actions that were found during early training to have high Q values, while failing to explore other actions that have even higher values.
  – It is common in Q-learning to use a probabilistic approach to selecting actions.
  – Actions with higher Q values are assigned higher probabilities, but every action is assigned a nonzero probability. One way to assign such probabilities is
      P(a_i | s) = e^{Q̂(s,a_i)/T} / Σ_j e^{Q̂(s,a_j)/T}
    where P(a_i | s) is the probability of selecting action a_i given that the agent is in state s, and where T > 0 is a constant that determines how strongly the selection favors actions with high Q values.
Exploitation vs. Exploration
• Hence it is good to try new things now and then
  – If T is large, then do more exploration
  – If T is small, then follow or exploit the current policy
• One can decrease T over time to first explore, and then converge and exploit
  – For example, T = c/(k + d), where k is the iteration of the algorithm
      P(a_i | s) = e^{Q̂(s,a_i)/T} / Σ_j e^{Q̂(s,a_j)/T}
• Decreasing T over time is also known as simulated annealing, which is inspired by the annealing process in nature. T is also known as the Temperature.
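A small sketch of this Boltzmann (softmax) action selection with a decaying temperature; the constants c and d below are illustrative placeholders, not values from the slides:

```python
import math
import random

def boltzmann_action(Q, s, actions, T):
    """Select an action with probability proportional to exp(Q(s,a) / T)."""
    prefs = [math.exp(Q.get((s, a), 0.0) / T) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]

# Decrease the temperature over iterations, e.g. T = c / (k + d):
c, d = 10.0, 1.0
for k in range(1000):
    T = c / (k + d)
    # a = boltzmann_action(Q, s, actions, T)   # explore early, exploit later
```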