
COSC1125/1127: Artificial Intelligence

Week 9: Reinforcement Learning

Instructor: Prof.
RMIT University, Melbourne, Australia
[These slides were created by and for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!


Week 9: From MDP to Reinforcement Learning

Some news…

Take Home Exercise (THE) done!
Well done all! 👏👏👏
Thanks for your professionalism!
Contest feedback is already running
Are you in already?
Check post #170
Preliminary contest due on Sunday @ 11:59pm
Friday is a public holiday.
Attend tutorials on Wed & Thursday.
Drop-in lab moved to Thursday 4:30pm-5:30pm.


Feedback contest is running!

Preliminary submission (10%) due end of this week!
Every day 2am + 2pm


Feedback from the past: What would you say to a fellow student wrt the Pacman Contest?
It’s loads of fun and you’ll learn a lot if you put the time in. Don’t underestimate the challenge though.
“Do it! You will love it!”
It’s really fun, but make sure you are prepared to spend a fair amount of time on it.
Go for it, it’s fun and you can learn a lot by reading, researching and exploring the Pac-Man.
AI can be fun to learn through games. Take the course and you won’t regret it 🙂
It’s a challenge. Starts slow and simple.
It was fun, but a lot of work and will require heaps of time to do well.
Make sure to start thinking about the final part throughout the rest of the assignments
It’s hard, but it’s fun.
Have fun, and refer to the UC Berkeley materials to succeed.
Don’t start late.
It is a fun and practical way to learn and apply the concepts taught in the course.
Ask more questions about the algorithm than I did. I was confused at first, but in the end my confusion stemmed from overcomplicating the task.


From Project 2 to Project Contest

Please watch it to avoid issues

Course Survey is live!
Method of teaching:
Lectorials
Forum
Assessments:
Pacman Projects
Flexible THE
Technology:
EdStem + GitHub
Learning Materials:
Videos + book + tute
If you are an active participant in the course (lectorials, tutes), please, please don’t just ignore it, so I get a balanced view…


Extra 1-on-1 support for Pacman Project

Want 1-on-1 team support for your project?
Book a 15-minute 1-on-1 team consultation on Wednesdays 3pm-4pm, weeks 9 to 12.
Book here!
When you book:
Make a PRIVATE post beforehand to let us know what you want to discuss.
Include the link to your repo.

Use the 1-on-1 tutorial consultation time!
Remember 1-on-1 Mondays 4pm-5pm Consultation Time for tutorials, use that & book it HERE!

No tutorials this Friday Sept 24th: holiday!
For tutorials: attend Wednesdays and Thursdays tutes.
Drop-in lab moved to Thursday 4:30pm


Questions?


IMPORTANT DISCLAIMER
This pack of slides is NOT core study
material. The core & main material is the book or any specific resources given for the week.
The slides in this presentation were prepared only to support the instructor during the lectorial, and only some of them have been used in the session as needed.
This pack of slides does not represent the complete material of the week, only a few bits here and there around what was discussed in lectorials. Most of the week’s content will NOT show up in this pack of slides.
Please refer to the book (and any resources specifically given) for a comprehensive coverage of the week material.

CS 188: Artificial Intelligence

Reinforcement Learning

Instructors: and
University of California, Berkeley
[These slides were created by and for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!


Topics
Model-based vs Model-free learning.
Passive and active RL
Direct evaluation, ADP, Temporal Difference (TD).
Q-value Q(s,a)
Q-learning and TD Q-learning.
Exploration vs Exploitation.
Approximate Q-learning.


This week videos on RL from UCB!
Part I
Part II


Reinforcement Learning

Reinforcement Learning
Basic idea:
Receive feedback in the form of rewards
Agent’s utility is defined by the reward function
Must (learn to) act so as to maximize expected rewards
All learning is based on observed samples of outcomes!
Environment

Agent
Actions: a
State: s
Reward: r

Show robot-soccer learning to walk videos (UT Austin):

Show snake robot:


Example: Learning to Walk
Initial
A Learning Trial
After Learning [1K Trials]
[Kohl and Stone, ICRA 2004]

Example: Toddler Robot
[Tedrake, Zhang and Seung, 2005]
[Video: TODDLER – 40s]

The Crawler!

[Demo: Crawler Bot (L10D1)] [You, in Project 3]

At times no reward is accumulated at all; the crawler just sits still.

Reinforcement Learning
Still assume a Markov decision process (MDP):
A set of states s ∈ S
A set of actions (per state) A
A model T(s,a,s’) = P(s’ | s, a)
A reward function R(s,a,s’)
Still looking for a policy π(s)

New twist: don’t know T or R
I.e. we don’t know which states are good or what the actions do
Must actually try actions and states out to learn

Offline (MDPs) vs. Online (RL)

Offline Solution
Online Learning

Model-Based Learning

Model-based vs Model-free Learning: Expected Age
Goal: Compute expected age of class students
P(a): probability of age a in the class/group – “the model”

Known P(A):
E[A] = Σ_a P(a) · a

Unknown P(A): “Model Based”
Estimate P̂(a) = num(a) / N from samples, then E[A] ≈ Σ_a P̂(a) · a
Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”
Without P(A), instead collect samples [a1, a2, … aN] and average them directly: E[A] ≈ (1/N) Σ_i a_i
Why does this work? Because samples appear with the right frequencies.
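A minimal Python sketch of the two estimators (the age distribution and sample size below are made up for illustration, not course data):

```python
import random
from collections import Counter

# Hypothetical age distribution, used only to generate samples.
true_P = {20: 0.4, 21: 0.3, 22: 0.2, 25: 0.1}
ages = list(true_P)
samples = random.choices(ages, weights=[true_P[a] for a in ages], k=10000)

# Model-based: first estimate P_hat(a) from counts, then compute the expectation from that model.
counts = Counter(samples)
P_hat = {a: n / len(samples) for a, n in counts.items()}
E_model_based = sum(P_hat[a] * a for a in P_hat)

# Model-free: average the samples directly, never building a model of P(A).
E_model_free = sum(samples) / len(samples)

print(E_model_based, E_model_free)   # both approach the true E[A] = 21.2
```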

Adaptive Dynamic Programming:
Model-Based Learning
Model-Based Idea:
Learn an approximate model based on experiences
Solve for values as if the learned model were correct

Step 1: Learn empirical MDP model
Count outcomes s’ for each s, a
Normalize to give an estimate of T̂(s, a, s′)
Discover each R̂(s, a, s′) when we experience (s, a, s′)

Step 2: Solve the learned MDP
For example, use value iteration, as before, to estimate value U(s) (or V(s))

How to estimate T(s,a,s’)?
 
[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]


Example: Model-Based Learning
Input Policy π
Assume: γ = 1
Observed Episodes (Training)
Learned Model
[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]

B, east, C, -1
C, east, D, -1
D, exit, x, +10
B, east, C, -1
C, east, D, -1
D, exit, x, +10
E, north, C, -1
C, east, A, -1
A, exit, x, -10
Episode 1

Episode 2

Episode 3

Episode 4

E, north, C, -1
C, east, D, -1
D, exit, x, +10
T(s,a,s’).

T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25

R(s,a,s’).

R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
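A small Python sketch of Step 1 (count and normalize) on the four episodes above; the episode encoding is my own, but the resulting estimates match the learned model shown here, e.g. T̂(C, east, D) = 0.75:

```python
from collections import defaultdict

# The four observed episodes as (state, action, next_state, reward) transitions.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = number of times observed
rewards = {}                                      # R_hat[(s, a, s')] = observed reward

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Normalize the outcome counts for each (s, a) to get T_hat(s, a, s').
T_hat = {(s, a, s2): n / sum(outcomes.values())
         for (s, a), outcomes in counts.items()
         for s2, n in outcomes.items()}

print(T_hat[("C", "east", "D")])    # 0.75
print(T_hat[("C", "east", "A")])    # 0.25
print(rewards[("B", "east", "C")])  # -1
```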

Adaptive Dynamic Programming
 


Adaptive Dynamic Programming in AIMA 4G


Model-Free Learning

Passive Reinforcement Learning

Passive Reinforcement Learning
Simplified task: policy evaluation
Input: a fixed policy π(s)
You don’t know the transitions T(s,a,s’)
You don’t know the rewards R(s,a,s’)
Goal: learn the state values

In this case:
Learner is “along for the ride”
No choice about what actions to take
Just execute the policy and learn from experience
This is NOT offline planning! You actually take actions in the world.

Direct Evaluation
Goal: Compute values for each state under π

Idea: Average together observed sample values
Act according to π.
Every time you visit a state, write down what the sum of discounted rewards turned out to be.
Average those samples.

This is called direct evaluation

Example: Direct Evaluation
Input Policy π
Assume: γ = 1
Observed Episodes (Training)
Output Values
[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]

B, east, C, -1
C, east, D, -1
D, exit, x, +10
B, east, C, -1
C, east, D, -1
D, exit, x, +10
E, north, C, -1
C, east, A, -1
A, exit, x, -10
Episode 1

Episode 2

Episode 3

Episode 4

E, north, C, -1
C, east, D, -1
D, exit, x, +10
V(B) = +8, V(C) = +4, V(D) = +10, V(A) = -10, V(E) = -2
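A sketch of direct evaluation on the same four episodes with γ = 1; it reproduces the output values above (e.g. V(C) = (9 + 9 + 9 - 11) / 4 = 4):

```python
from collections import defaultdict

GAMMA = 1.0
episodes = [  # same four episodes as in the model-based example above
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

returns = defaultdict(list)          # all sampled returns observed from each visited state

for episode in episodes:
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return-to-go.
    for s, a, s2, r in reversed(episode):
        G = r + GAMMA * G
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # V(B)=8.0, V(C)=4.0, V(D)=10.0, V(A)=-10.0, V(E)=-2.0
```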

Problems with Direct Evaluation
What’s good about direct evaluation?
It’s easy to understand
It doesn’t require any knowledge of T, R
It eventually computes the correct average values, using just sample transitions

What’s bad about it?
It wastes information about state connections
Each state must be learned separately
So, it takes a long time to learn
Output Values
V(B) = +8, V(C) = +4, V(D) = +10, V(A) = -10, V(E) = -2
If B and E both go to C under this policy, how can their values be different?

Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy:
V^π_0(s) = 0
V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]
Each round, replace V with a one-step-look-ahead layer over V

This approach fully exploited the connections between the states
Unfortunately, we need T and R to do it!

Key question: how can we do this update to V without knowing T and R?
In other words, how do we take a weighted average without knowing the weights?

[Diagram: state s, action π(s), q-state (s, π(s)), outcome s']

Sample-Based Policy Evaluation?
We want to improve our estimate of V by computing these averages:
V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]

Idea: Take samples of outcomes s' (by doing the action!) and average:
sample_i = R(s, π(s), s'_i) + γ V^π_k(s'_i)
V^π_{k+1}(s) ← (1/n) Σ_i sample_i

[Diagram: from s take action π(s); sampled outcomes s1′, s2′, s3′]
Almost! But we can’t rewind time to get sample after sample from state s.

Temporal Difference Learning

Temporal Difference Learning
Big idea: learn from every experience!
Update V(s) each time we experience a transition (s, a, s’, r)
Likely outcomes s’ will contribute updates more often

Temporal difference learning of values
Policy still fixed, still doing evaluation!
Move values toward value of whatever successor occurs: running average

[Diagram: state s, action π(s), observed successor s']

Sample of V(s):  sample = R(s, π(s), s') + γ V^π(s')
Update to V(s):  V^π(s) ← (1 - α) V^π(s) + α · sample
Same update:  V^π(s) ← V^π(s) + α (sample - V^π(s))

Exponential Moving Average
Exponential moving average
The running interpolation update:  x̄_n = (1 - α) · x̄_{n-1} + α · x_n

Makes recent samples more important:
x̄_n = [ x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + … ] / [ 1 + (1 - α) + (1 - α)² + … ]

Forgets about the past (distant past values were wrong anyway)

Decreasing learning rate (alpha) can give converging averages
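A tiny Python sketch of the running interpolation update (the sample stream below is invented):

```python
def ema_update(x_bar, x, alpha):
    """Running interpolation: move the average a fraction alpha toward the new sample."""
    return (1 - alpha) * x_bar + alpha * x

x_bar = 0.0
for x in [10, 12, 8, 11, 9]:          # hypothetical stream of samples
    x_bar = ema_update(x_bar, x, alpha=0.5)
print(x_bar)                          # recent samples dominate the estimate
```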

Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
Observed Transitions

B, east, C, -2   then   C, east, D, -2

Gridworld values (A on top; B, C, D in the middle row; E at the bottom), initially all 0 except V(D) = 8:
Initially:                V(B) = 0, V(C) = 0, V(D) = 8
After (B, east, C, -2):   V(B) ← (1/2)·0 + (1/2)·(-2 + 1·0) = -1
After (C, east, D, -2):   V(C) ← (1/2)·0 + (1/2)·(-2 + 1·8) = 3
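A sketch of the TD update that reproduces the two steps above (γ = 1, α = 1/2, with V(D) initialised to 8 as in the example):

```python
GAMMA, ALPHA = 1.0, 0.5
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

def td_update(V, s, r, s2):
    """Move V(s) toward the observed sample r + gamma * V(s')."""
    sample = r + GAMMA * V.get(s2, 0.0)
    V[s] = (1 - ALPHA) * V[s] + ALPHA * sample

td_update(V, "B", -2, "C")   # V(B): 0 -> 0.5*0 + 0.5*(-2 + 0) = -1
td_update(V, "C", -2, "D")   # V(C): 0 -> 0.5*0 + 0.5*(-2 + 8) =  3
print(V["B"], V["C"])        # -1.0 3.0
```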

Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
However, if we want to turn values into a (new) policy, we’re sunk:
π(s) = argmax_a Q(s, a),   where   Q(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ], which again needs T and R

Idea: learn Q-values, not values
Makes action selection model-free too!

[Diagram: state s, action a, q-state (s, a), outcome s']

Active Reinforcement Learning

Active Reinforcement Learning
Full reinforcement learning: optimal policies (like value iteration)
You don’t know the transitions T(s,a,s’)
You don’t know the rewards R(s,a,s’)
You choose the actions now
Goal: learn the optimal policy / values

In this case:
Learner makes choices!
Fundamental tradeoff: exploration vs. exploitation
This is NOT offline planning! You actually take actions in the world and find out what happens…

Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values
Start with V0(s) = 0, which we know is right
Given V_k, calculate the depth k+1 values for all states:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

But Q-values are more useful, so compute them instead
Start with Q0(s,a) = 0, which we know is right
Given Q_k, calculate the depth k+1 q-values for all q-states:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
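A sketch of one round of Q-value iteration, assuming the model is available as plain dictionaries (the T/R encoding and the function name are mine, not from the projects):

```python
def q_value_iteration_step(Q, T, R, gamma, actions):
    """One synchronous backup: Q_{k+1}(s,a) = sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * max_{a'} Q_k(s',a') ]."""
    new_Q = {}
    for (s, a), outcomes in T.items():           # outcomes: list of (s', probability) pairs
        new_Q[(s, a)] = sum(
            p * (R[(s, a, s2)]
                 + gamma * max((Q.get((s2, a2), 0.0) for a2 in actions), default=0.0))
            for s2, p in outcomes
        )
    return new_Q
```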

Q-Learning
Q-Learning: sample-based Q-value iteration

Learn Q(s,a) values as you go
Receive a sample (s,a,s’,r)
Consider your old estimate:  Q(s, a)
Consider your new sample estimate:  sample = r + γ max_{a'} Q(s', a')

Incorporate the new estimate into a running average:  Q(s, a) ← (1 - α) Q(s, a) + α · sample
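A minimal tabular Q-learning update in Python (dictionary Q-table; the names are illustrative, not the Project 3 API):

```python
GAMMA, ALPHA = 0.9, 0.5
Q = {}                                  # Q-table: Q[(s, a)], missing entries treated as 0

def q_learning_update(s, a, r, s2, actions_in_s2):
    """Blend the old estimate with the new sample r + gamma * max_a' Q(s', a')."""
    sample = r + GAMMA * max((Q.get((s2, a2), 0.0) for a2 in actions_in_s2), default=0.0)
    Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * sample
```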

[Demo: Q-learning – gridworld (L10D2)]
[Demo: Q-learning – crawler (L10D3)]

Video of Demo Q-Learning — Gridworld

Video of Demo Q-Learning — Crawler

Q-Learning Properties
Amazing result: Q-learning converges to optimal policy — even if you’re acting suboptimally!

This is called off-policy learning

Caveats:
You have to explore enough
You have to eventually make the learning rate
small enough
… but not decrease it too quickly
Basically, in the limit, it doesn’t matter how you select actions (!)

CS 188: Artificial Intelligence

Reinforcement Learning II

Instructors: and — University of California, Berkeley
[These slides were created by and for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Please retain proper attribution and the reference to ai.berkeley.edu. Thanks!


Reinforcement Learning
We still assume an MDP:
A set of states s ∈ S
A set of actions (per state) A
A model T(s,a,s’)
A reward function R(s,a,s’)
Still looking for a policy π(s)

New twist: don’t know T or R, so must try out actions

Big idea: Compute all averages over T using sample outcomes

The Story So Far: MDPs and RL
Goal → Technique, for each setting:

Known MDP: Offline Solution
Compute V*, Q*, π*  →  Value / policy iteration
Evaluate a fixed policy π  →  Policy evaluation

Unknown MDP: Model-Based
Compute V*, Q*, π*  →  VI/PI on approx. MDP
Evaluate a fixed policy π  →  PE on approx. MDP

Unknown MDP: Model-Free
Compute V*, Q*, π*  →  Q-learning
Evaluate a fixed policy π  →  Value learning


Model-Free Learning
Model-free (temporal difference) learning
Experience world through episodes

Update estimates each transition

Over time, updates will mimic Bellman updates

[Diagram: trajectory s, a, r, s', a', s'' …]

Q-Learning
We’d like to do Q-value updates to each Q-state:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]

But can’t compute this update without knowing T, R

Instead, compute average as we go
Receive a sample transition (s,a,r,s’)
This sample suggests:  Q(s, a) ≈ r + γ max_{a'} Q(s', a')

But we want to average over results from (s,a) (Why?)
So keep a running average:  Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]

Q-Learning Properties
Amazing result: Q-learning converges to optimal policy — even if you’re acting suboptimally!

This is called off-policy learning

Caveats:
You have to explore enough
You have to eventually make the learning rate
small enough
… but not decrease it too quickly
Basically, in the limit, it doesn’t matter how you select actions (!)

[Demo: Q-learning – auto – cliff grid (L11D1)]

Video of Demo Q-Learning Auto

Exploration vs. Exploitation

How to Explore?
Several schemes for forcing exploration
Simplest: random actions (ε-greedy)
Every time step, flip a coin
With (small) probability ε, act randomly
With (large) probability 1-ε, act on current policy

Problems with random actions?
You do eventually explore the space, but keep thrashing around once learning is done
One solution: lower ε over time
Another solution: exploration functions
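A sketch of ε-greedy action selection over a dictionary Q-table (illustrative names, not the project interface):

```python
import random

def epsilon_greedy(s, legal_actions, Q, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily on the current Q-values."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q.get((s, a), 0.0))
```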

[Demo: Q-learning – manual exploration – bridge grid (L11D2)] [Demo: Q-learning – epsilon-greedy — crawler (L11D3)]

Video of Demo Q-learning – Manual Exploration – Bridge Grid

Video of Demo Q-learning – Epsilon-Greedy – Crawler

Exploration Functions
When to explore?
Random actions: explore a fixed amount
Better idea: explore areas whose badness is not
(yet) established, eventually stop exploring

Exploration function
Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n

Note: this propagates the “bonus” back to states that lead to unknown states as well!

Modified Q-Update:  Q(s, a) ← (1 - α) Q(s, a) + α [ R(s, a, s') + γ max_{a'} f( Q(s', a'), N(s', a') ) ]

Regular Q-Update:  Q(s, a) ← (1 - α) Q(s, a) + α [ R(s, a, s') + γ max_{a'} Q(s', a') ]
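A sketch of the modified update with an exploration bonus; I use the variant f(u, n) = u + k/(n + 1) so that unvisited q-states (n = 0) don’t divide by zero, and all names are mine:

```python
GAMMA, ALPHA, K = 0.9, 0.5, 1.0          # K: exploration bonus weight (assumed)
Q, N = {}, {}                            # Q-values and visit counts, keyed by (s, a)

def f(u, n):
    """Optimistic utility: rarely visited q-states look better than their current estimate."""
    return u + K / (n + 1)

def modified_q_update(s, a, r, s2, actions_in_s2):
    """Q-update that backs up optimistic values, propagating the exploration bonus."""
    N[(s, a)] = N.get((s, a), 0) + 1
    best = max((f(Q.get((s2, a2), 0.0), N.get((s2, a2), 0)) for a2 in actions_in_s2),
               default=0.0)
    Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best)
```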
[Demo: exploration – Q-learning – crawler – exploration function (L11D4)]

Video of Demo Q-learning – Exploration Function – Crawler

Regret
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret

Approximate Q-Learning

Approximate Q-Learning

Generalizing Across States
Basic Q-Learning keeps a table of all q-values

In realistic situations, we cannot possibly learn about every single state!
Too many states to visit them all in training
Too many states to hold the q-tables in memory

Instead, we want to generalize:
Learn about some small number of training states from experience
Generalize that experience to new, similar situations
This is a fundamental idea in machine learning, and we’ll see it over and over again

[demo – RL pacman]

Example: Pacman

[Demo: Q-learning – pacman – tiny – watch all (L11D5)]
[Demo: Q-learning – pacman – tiny – silent train (L11D6)]
[Demo: Q-learning – pacman – tricky – watch all (L11D7)]
Let’s say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!

Video of Demo Q-Learning Pacman – Tiny – Watch All

Video of Demo Q-Learning Pacman – Tiny – Silent Train

Video of Demo Q-Learning Pacman – Tricky – Watch All

Feature-Based Representations
Solution: describe a state using a vector of features (properties)
Features are functions from states to real numbers (often 0/1) that capture important properties of the state
Example features:
Distance to closest ghost
Distance to closest dot
Number of ghosts
1 / (dist to dot)²
Is Pacman in a tunnel? (0/1)
… etc.
Is it the exact state on this slide?
Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Linear Value Functions
Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)

Advantage: our experience is summed up in a few powerful numbers

Disadvantage: states may share features but actually be very different in value!

Approximate Q-Learning
Q-learning with linear Q-functions: on each transition (s, a, r, s'), compute
difference = [ r + γ max_{a'} Q(s', a') ] - Q(s, a)

Intuitive interpretation:
Adjust weights of active features
E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features

Formal justification: online least squares

Exact Q’s:  Q(s, a) ← Q(s, a) + α · difference
Approximate Q’s:  w_i ← w_i + α · difference · f_i(s, a)
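A sketch of the approximate Q-learning weight update with a linear Q-function over named features (the feature-dictionary representation is an assumption; Project 3’s actual classes differ):

```python
GAMMA, ALPHA = 0.9, 0.2
weights = {}                                      # one weight per feature name

def q_value(features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as {name: value}."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approx_q_update(features_sa, r, best_next_q):
    """Shift every active feature's weight in proportion to the TD error.

    best_next_q is max_{a'} Q(s', a'), computed by the caller over the legal actions in s'.
    """
    difference = (r + GAMMA * best_next_q) - q_value(features_sa)
    for name, value in features_sa.items():
        weights[name] = weights.get(name, 0.0) + ALPHA * difference * value
```

If something unexpectedly bad happens, the difference is negative and every feature that was active gets its weight pushed down: exactly the “blame the features that were on” intuition above.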

Example: Q-Pacman

[Demo: approximate Q-learning pacman (L11D10)]

Video of Demo Approximate Q-Learning — Pacman

Q-Learning and Least Squares

Linear Approximation: Regression*
Prediction (one feature):  ŷ = w_0 + w_1 f_1(x)
Prediction (two features):  ŷ = w_0 + w_1 f_1(x) + w_2 f_2(x)

Optimization: Least Squares*
total error = Σ_i ( y_i - ŷ_i )² = Σ_i ( y_i - Σ_k w_k f_k(x_i) )²

[Figure: a fitted line with the error (“residual”) shown as the gap between each observation and its prediction]

Minimizing Error*

Imagine we had only one point x, with features f(x), target value y, and weights w:
error(w) = ½ ( y - Σ_k w_k f_k(x) )²
∂ error(w) / ∂ w_m = - ( y - Σ_k w_k f_k(x) ) f_m(x)
w_m ← w_m + α ( y - Σ_k w_k f_k(x) ) f_m(x)

Approximate q update explained: the same gradient step with “target” r + γ max_{a'} Q(s', a') and “prediction” Q(s, a):
w_m ← w_m + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ] f_m(s, a)

Overfitting: Why Limiting Capacity Can Help*
[Figure: a degree-15 polynomial fit to a handful of noisy points, illustrating overfitting]

Policy Search

Policy Search
Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V / Q best
E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
Q-learning’s priority: get Q-values close (modeling)
Action selection priority: get ordering of Q-values right (prediction)
We’ll see this distinction between modeling and prediction again later in the course

Solution: learn policies that maximize rewards, not the values that predict them

Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights

Policy Search
Simplest policy search:
Start with an initial linear value function or Q-function
Nudge each feature weight up and down and see if your policy is better than before

Problems:
How do we tell whether the policy got better?
Need to run many sample episodes!
If there are a lot of features, this can be impractical

Better methods exploit lookahead structure, sample wisely, change multiple parameters…
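A heavily simplified hill-climbing sketch of the “nudge each weight” idea; evaluate_policy is assumed to run sample episodes with the given weights and return their average score (it is not provided here):

```python
def hill_climb(weights, evaluate_policy, step=0.1, iterations=100):
    """Nudge one weight at a time; keep a change only if sampled episodes score better."""
    best_score = evaluate_policy(weights)
    for _ in range(iterations):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate_policy(candidate)   # expensive: many sample episodes
                if score > best_score:
                    weights, best_score = candidate, score
    return weights
```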

Policy Search
[Video: HELICOPTER]

Conclusion

We’ve seen how AI methods can solve problems by:
Search
Knowledge Representation
Automated Planning
Markov Decision Processes (MDPs)
Reinforcement Learning
Basic probabilistic reasoning: beyond true/false.

Next week: Reasoning with Uncertainty using Bayesian Networks.
Exploiting dependencies

Whiteboard used in lectorial
