Artificial Intelligence Review
What is an Agent?
An entity
□ situated: operates in a dynamically changing environment □ reactive: responds to changes in a timely manner
□ autonomous:cancontrolitsownbehaviour
□ proactive:exhibitsgoal-orientedbehaviour
□ communicating: coordinate with other agents??
Examples: humans, dogs, …, insects, sea creatures, …, thermostats? Where do current robots sit on the scale?
Lectures
Environment Types
Fully Observable vs Partially Observable
Agent’s sensors give access to complete state of environment (no internal state required)
DeterministicvsStochastic
Next state of environment determined only by current state and agent’s choice of action
Episodic vs Sequential
Agent’s experience divided into “episodes”; agent doesn’t need to think ahead in episodic environment
Static vs Dynamic
Environment changes while agent deliberates
DiscretevsContinuous
Limited number of distinct, clearly defined percepts and actions
□ Artificial Intelligence and Agents
□ Problem Solving and Search
□ ConstraintSatisfactionProblems
□ Logic and Knowledge Representation □ Reasoning with Uncertainty
□ Machine Learning
□ Natural Language Processing
□ Knowledge Based Systems
□ NeuralNetworksandReinforcementLearning
Specifying Agents
State Space Search Problems
□ percepts: inputs to the agent via sensors
□ actions: outputs available to the agent via effectors
□ goals: objectives or performance measure of the agent □ environment:worldinwhichtheagentoperates
Most generally, a function from percept sequences to actions
Ideally rational agent does whatever action is expected to maximize some performance measure – the agent may not know the performance measure (Russell and Norvig 2010)
Resource bounded agent must make “good enough” decisions based on its perceptual, computational and memory limitations (design tradeoffs)
Example Agents
□ State space — set of all states reachable from initial state(s) byany action sequence
□ Initial state(s) — element(s) of the state space
□ Transitions
◮ Operators — set of possible actions at agent’s disposal; describe
state reached after performing action in current state, or
◮ Successor function — s(x)= set of states reachable from state x by
performing a single action
□ Goal state(s) — element(s) of the state space
□ Path cost — cost of a sequence of transitions used to evaluate solutions (apply to optimization problems)
Example Problem — Romania Map
Agent Type
Percepts
Actions
Goals
Environment
Medical diagnosis system
Symptoms, findings, pa- tient responses
Questions, tests, treat- ments
Healthy patient, minimise costs
Patient, hospital
Satellite im- age system
Pixels of vary- ing intensity, colour
Print cate- gorisation of scene
Correct cate- gorisation
Images from or- biting satellite
Automated taxi driver
Cameras, speedometer, GPS, sonar, microphone
Steer, acceler- ate, brake, talk to passenger
Safe, fast, legal, comfortable trip, maximise profits
Roads, other traffic, pedestri- ans, customers
Robocup robot
Camera
ages,
range
readings, sonar readings
im- laser finder
Move motors, “kick” ball
Score goals
Playing field with ball and other robots
Oradea
Straight−line distance
to Bucharest
Arad 366 Bucharest 0 Craiova 160 Dobreta 242 Eforie 161 Fagaras 178 Giurgiu 77 Hirsova 151 Iasi 226 Lugoj 244 Mehadia 241 Neamt 234 Oradea 380 Pitesti 98 Rimnicu Vilcea
71
Zerind 151
Neamt
87
75
Arad 140
Iasi 92
142 98
Sibiu 80
99
Fagaras
118
Vaslui
Timisoara 111
Rimnicu Vilcea
Lugoj
97
Pitesti
211
70
85
Hirsova 86Sibiu253
146
75 138
101
193
Mehadia
Urziceni
Dobreta
120
Craiova
329 80
Bucharest 90
Giurgiu
Timisoara
Urziceni
Vaslui 199 Zerind 374
Based on Russell and Norvig (2010) Figure 2.5.
Eforie
Summary — Blind Search
Constraint Satisfaction Problems
Criterion
Breadth First
Uniform Cost
Depth- First
Depth- Limited
Iterative Deepening
Bidirectional
Time Space Optimal Complete
bd
bd
Yes Yes
bd
bd
Yes Yes
bm
bm
No No
bl
bl
No Yes, if l ≥ d
bd
bd
Yes Yes
d b2 d b2
Yes Yes
b — branching factor
d — depth of shallowest solution m — maximum depth of tree
l — depth limit
□ Constraint Satisfaction Problems are defined by a set of variables Xi, each with a domain Di of possible values, and a set of constraints C
□ Aim is to find an assignment to each the variables Xi (a value from the domain Di) such that all of the constraints C are satisfied
A∗ Search
□ Idea: Use both cost of path generated and estimate to goal to order
nodes on the frontier
□ g(n) = cost of path from start to n; h(n) = estimate from n to goal
□ Order priority queue using function f (n) = g(n) + h(n)
□ f (n) is the estimated cost of the cheapest solution extending this path
□ Expand node from frontier with smallest f -value
□ Essentially combines uniform-cost search and greedy search
Forward Checking
Idea: Keep track of remaining legal values for unassigned variables Terminate search when any variable has no legal values
WA NT Q NSW V SA T
Arc Consistency
Propositional Logic
Simplest form of constraint propagation is arc consistency Arc (constraint) X → Y is arc consistent if
for every value x in dom(X) there is some allowed y in dom(Y)
WA NT Q NSW V SA T
Make X → Y arc consistent by removing any such x from dom(X )
□ Useletterstostandfor“basic”propositions;combinethemintomore complex sentences using operators for not, and, or, implies, iff
□ Propositional connectives: ¬ negation
∧ conjunction ∨ disjunction
→ implication ↔ bi-implication
¬P
P ∧ Q
P∨Q P → Q P ↔ Q
“not P” “P and Q”
“P or Q”
“If P then Q”
“P if and only if Q”
Hill Climbing by Min-Conflicts
Truth Table Semantics
□ The semantics of the connectives can be given by truthtables
2
2
1
2
3
1
2
3
3
2
3
2
3
0
P
Q
¬P
P∧Q
P∨Q
P→Q
P↔Q
True True False False
True False True False
False False True True
True False False False
True True True False
True False True True
True False False True
□ Variable selection: randomly select any conflicted variable □ Value selection by min-conflicts heuristic
◮ Choose value that violates fewest constraints ◮ Can (often) solve n-Queens for n ≈ 10, 000, 000
□ One row for each possible assignment of True/False to variables
□ Important: P and Q are any sentences, including complex sentences
Definitions
Resolution Rule of Inference
□ A sentence is valid if it is True under all possible assignments of True/False to its variables (e.g. P ∨ ¬P)
□ A tautology is a valid sentence
□ Two sentences are equivalent if they have the same truth table, e.g.
P∧Q and Q∧P
◮ SoPisequivalenttoQifandonlyifP↔Qisvalid
□ A sentence is satisfiable if there is some assignment of True/False to its variables for which the sentence is True
□ A sentence is unsatisfiable if it is not satisfiable (e.g. P ∧ ¬P)
◮ Sentence is False for all assignments of True/False to its variables ◮ So P is a tautology if and only if ¬P is unsatisfiable
Conversion to Conjunctive Normal Form
□ Eliminate↔rewritingP↔Qas(P→Q)∧(Q→P)
□ Eliminate→rewritingP→Qas¬P∨Q
□ Use De Morgan’s laws to push ¬ inwards (repeatedly) ◮ Rewrite ¬(P∧Q) as ¬P∨¬Q
◮ Rewrite ¬(P∨Q) as ¬P∧¬Q
□ Eliminate double negations: rewrite ¬¬P as P
□ Use the distributive laws to get CNF [or DNF] – if necessary
◮ Rewrite (P∧Q)∨R as (P∨R)∧(Q∨R) [for CNF] ◮ Rewrite (P∨Q)∧R as (P∧R)∨(Q∧R) [for DNF]
A1∨···∨Am∨B ¬B∨C1 ∨···∨Cn
❧✱ ❧✱
❧❧
❧❧
✱✱ ✱✱
❧✱
A1 ∨···∨Am ∨C1 ∨···∨Cn
where B is a propositional variable and Ai and Cj are literals
□ B and ¬B are complementary literals
□ A1 ∨···∨Am ∨C1 ∨···∨Cn is the resolvent of the two clauses
□ Special case: If no Ai and Cj, resolvent is empty clause, denoted Q
Applying Resolution Refutation
□ Negate query to be proven (resolution is a refutation system)
□ Convert knowledge base and negated query into CNF
□ Repeatedlyapplyresolutionuntileithertheemptyclause(contradic- tion) is derived or no more clauses can be derived
□ If the empty clause is derived, answer ‘yes’ (query follows from knowledge base), otherwise answer ‘no’ (query does not follow from knowledge base)
Random Variables
Bayes’ Rule
□ Propositions are random variables that can take on several values
P(Weather = Sunny) = 0.8
P(Weather = Rain) = 0.1 P(A)
P(A|B)P(B)
□ AI systems abandon joint probabilities and work directly with
□ Deriving Bayes’ Rule:
P(A ∧B) = P(A|B)P(B) (Definition)
P(B ∧A) = P(B|A)P(A) (Definition)
So P(A|B)P(B) = P(B|A)P(A) since P(A ∧ B) = P(B ∧ A)
Hence P(B|A) = P(A|B)P(B) if P(A) ƒ= 0 P(A)
□ Note: If P(A) = 0, P(B|A) is undefined
P(Weather = Cloudy) = 0.09 P(Weather = Snow) = 0.01
□ Every random variable X has a domain of possible values (x1, x2, ···, xn)
□ ProbabilitiesofallpossiblevaluesP(Weather)=(0.8,0.1,0.09,0.01) is a probability distribution
□ P(Weather, Appendicitis) is a combination of random variables represented by cross product (can also use logical connectives P(A ∧ B) to represent compound events)
Conditional Probability by Enumeration
P(B|A) =
conditional probabilities using Bayes’ Rule
Bayesian Networks
□ Example (Pearl, 1988) Burglary
Earthquake
P(Earthquake)
0.002
toothache
toothache
catch
catch
catch
catch
cavity
.108
.012
.072
.008
cavity
.016
.064
.144
.576
P(¬cavity ∧toothache) P(toothache)
True True 0.95 True False 0.94 False True 0.29 False Flase 0.001
MaryCalls
Alarm P(JohnCalls) True 0.90
False 0.05
□ Probabilities summarize potentially infinite set of possible circum- stances
P(¬cavity|toothache) =
= 0.108 + 0.012 + 0.016 + 0.064 = 0.4
JohnCalls
P(Burglary) 0.001
Alarm
Burglary Earthquake P(Alarm)
0.016 + 0.064
AlarmP(MaryCalls) True 0.70
False 0.01
Example – Causal Inference
Hidden Markov Model for POS Tagging
□ P(JohnCalls|Burglary)
□ P(J|B) = P(J|A ∧ B).P(A|B)+ P(J|¬A ∧ B).P(¬A|B) = P(J|A).P(A|B)+P(J|¬A).P(¬A|B)
= P(J|A).P(A|B)+ P(J|¬A).(1 − P(A|B))
□ Now P(A|B) = P(A|B ∧ E).P(E|B)+ P(A|B ∧ ¬E).P(¬E|B) = P(A|B ∧ E).P(E)+ P(A|B ∧ ¬E).P(¬E)
= 0.95 × 0.002 + 0.94 × 0.998 = 0.94002
□ Therefore P(J|B) = 0.90 × 0.94002 + 0.05 × 0.05998 = 0.849017
□ Fact 3: P(X|Z)=P(X|Y∧Z).P(Y|Z)+P(X|¬Y∧Z).P(¬Y|Z), since
X ∧ Z ⇔ (X ∧ Y ∧ Z) ∨ (X ∧ ¬Y ∧ Z) (conditional version of Fact 2)
Bigram Model
Maximize P(w1, ···, wn|t1, ···, tn).P(t1, ···, tn)
□ Apply independence assumptions (Markov assumptions)
◮ P(w1, ···, wn|t1, ···, tn) = ΠP(wi|ti)
◮ Observations (words) depend only on states (tags)
◮ P(t1, · · · , tn) = P(tn|tn−1). · · · .P(t0|φ), where φ = start
◮ Bigram model: state (tag) depends only on previous state (tag)
□ Estimate probabilities
◮ P(ti|t j) = #((t j,ti occurs)/#(tj starts a bigram)
◮ Choose tag sequence that maximizes ΠP(wi|ti).P(ti|ti−1) ◮ Parts of speech generated by finite state machine
Viterbi Algorithm
1. Sweep forward (one word at a time) saving only the most likely sequence (and its probability) for each tag ti of wi
2. Select highest probability final state
3. Follow chain backwards to extract tag sequence
Supervised Learning
Choosing an Attribute to Split
□ Given a training set and a test set, each consisting of a set of items for each item in the training set, a set of features and a target output
□ Learner must learn a model that can predict the target output for any given item (characterized by its set of features)
□ Learner is given the input features and target output for each item in the training set
◮ Items may be presented all at once (batch) or in sequence (online) ◮ Items may be presented at random or in time order (stream)
◮ Learner cannot use the test set at all in defining the model
□ Model is evaluated by its performance on predicting the output for each item in the test set
Restaurant Training Data
Patrons?
None Some Full
Type?
French
Italian
Thai Burger
Patrons is a “more informative” attribute than Type, because it splits the examples more nearly into sets that are “all positive” or “all negative”
This notion of “informativeness” can be quantified using the mathematical concept of “entropy”
A parsimonious tree can be built by minimizing the entropy at each step
Information Gain
Alt
Bar
F/S
Hun
Pat
Price
Rain
Res
Type
Est
Wait?
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
T
T
F
T
T
F
F
F
F
T
F
T
F
F
T
F
F
T
T
F
T
T
F
T
F
F
F
T
T
F
F
F
T
T
F
T
T
T
F
T
F
T
F
T
F
T
F
T
Some Full Some Full Full Some None Some Full Full None Full
$$$ $ $ $ $$$ $$ $ $$ $ $$$ $ $
F
F
F
F
F
T
T
T
T
F
F
F
T
F
F
F
T
T
F
T
F
T
F
F
French Thai Burger Thai French Italian Burger Thai Burger Italian Thai Burger
0–10 30–60 0–10 10–30 >60 0–10 0–10 0–10 >60 10–30 0–10 30–60
T
F
T
T
F
T
F
T
F
F
F
T
Patrons? None Some
For Patrons, Entropy
For Type, Entropy =
Type?
Full
French Italian
Thai Burger
1 1 1Σ−1 1 2 2Σ (0)+ (0)+ 2 3 log( ) − log( )
=
= 0+0+ Σ (0.585)Σ=0.459
63 333 112
1 6633
(1.585)+ 3
1 (1)+ (1)+ (1)+ (1) = 1
23 11
Induced Decision Tree
Text Classification
Patrons?
None Some Full
□ Input: A document (e-mail, news article, review, tweet) □ Output: One class drawn from a fixed set of classes
◮ So text classification is a multi-class classification problem ◮ . . . and sometimes a multi-label classification problem
□ Learning Problem
◮ Input: Training set of labelled documents {(d , c ), ···}
11
◮ Output: Learned classifier that maps d to predicted class c
F T
French
T
Hungry?
Yes No
Type? F
Italian
F
Thai
Burger
Fri/Sat? T No Yes
FT
Laplace Error and Pruning
Following Ockham’s Razor, prune branches that do not provide much benefit in classifying the items (aids generalization, avoids overfitting)
For a leaf node, all items will assigned the majority class at that node. Estimate error rate on the (unseen) test items using the Laplace error
E =1−n+1 N+k
N = total number of (training) items at the node
n = number of (training) items in the majority class k = number of classes
If the average Laplace error of the children exceeds that of the parent node, prune off the children
Bernoulli Model
Maximize P(x1, · · · , xn|c).P(c)
□ Features are presence or absence of word wi in document
□ Apply independence assumptions
◮ P(x1, ···, xn|c) = P(x1|c). ···.P(xn|c)
◮ Probability of word w (not) in class c independent of context □ Estimate probabilities
◮ P(w|c) = #(w in document in class c)/#(documents in class c) ◮ P(¬w|c) = 1 − P(w|c)
◮ P(c) = #(documents in class c)/#documents
Naive Bayes Classification
MNB Example
w1
w2
w3
w4
Class
1
0
0
1
1
0
0
0
1
0
1
1
0
1
0
1
0
1
1
1
0
1
1
0
0
1
0
0
0
0
1
0
1
0
1
0
1
0
0
1
0
1
0
1
0
1
1
1
0
0
Class = 1
Class = 0
P(Class)
0.40
0.60
P(w1|Class)
0.75
0.50
P(w2|Class)
0.25
0.67
P(w3|Class)
0.50
0.33
P(w4|Class)
0.50
0.50
Words
Class
d1
Chinese Beijing Chinese
c
d2
Chinese Chinese Shanghai
c
d3
Chinese Macao
c
d4
Tokyo Japan Chinese
j
d5
Chinese Chinese Chinese Tokyo Japan
?
To classify document with w2, w3, w4
• P(Class = 1|¬w1, w2, w3, w4)
≈((1−0.75)∗0.25∗0.5∗0.5)∗0.4
= 0.00625
• P(Class = 0|¬w1, w2, w3, w4)
≈((1−0.5)∗0.5∗0.67∗0.33)∗0.6 = 0.03333
Bag of Words Model
I love this movie! It’s sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I’ve seen it several times, and I’m always happy to see it again whenever I have a friend who hasn’t seen it yet!
P(Chinese|c) = (5+1)/(8+6) = 3/7 P(Tokyo|c) = (0+1)/(8+6) = 1/14 P(Japan|c)= (0+1)/(8+6) = 1/14 P(Chinese| j) = (1+1)/(3+6) = 2/9 P(Tokyo| j) = (1+1)/(3+6) = 2/9 P(Japan| j) = (1+1)/(3+6) = 2/9
To classify document d5
• P(c|d5) ∝ [(3/7)3 . 1/14 . 1/14] . 3/4
≈ 0.0003
• P(j|d5)∝[(2/9)3 .2/9.2/9].1/4
≈ 0.0001
• Choose Class c
Natural Languages – Ambiguity
□ Natural languages exhibit ambiguity
“The fisherman went to the bank” (lexical)
“The boy saw a girl with a telescope” (structural)
“Every student took an exam” (semantic)
“The table won’t fit through the doorway because it is too [wide/narrow]” (pragmatic)
□ Ambiguitymakesitdifficulttointerpretmeaningofphrases/sentences ◮ But also makes inference harder to define and compute
□ Resolve ambiguity by mapping to unambiguous representation
it
I
the
to
and
seen
yet
would whimsical times sweet satirical adventure genre fairy humor have
great
6
5
4
3
3
2
1
1
1
1
1
1
1
1
1
1
1
1
Typical (Small) Grammar
Chart Parsing
S → NP VP
NP → [Det] Adj∗ N [AP | PP | Rel Clause]∗ VP → V [NP] [NP] PP∗
AP → Adj PP
PP → P NP
Det → a | an | the | . . .
N → John | park | telescope | . . .
V → saw | likes | believes | . . .
Adj → hot | hotter | . . .
P → in | . . .
Special notation: ∗ is “0 or more”; [ . . ] is “optional”
Syntactic Structure
□ Use a chart to record parsed fragments and hypotheses
□ HypothesesN→α•βwhereN→αβisagrammarrulemeans
“trying to parse N as αβ and have so far parsed α”
□ One node in chart for each word gap, start and end
□ One arc in chart for each hypothesis
□ At each step, apply fundamental rule ◮IfcharthasN→α•Bβfromn1 ton2 andB→γ•fromn2 ton3
addN→αB•βfromn1 ton3
□ Accept sentence when S → α• is added from start to end
□ Can produce any sort of derivation
Example Chart
Syntactically ambiguous = more than one parse tree
First-Order Logic
Defining Semantic Properties
□ Terms: constants, variables, functions applied to terms (refer to objects)
◮ e.g. a, f (a), mother of (Mary), . . .
□ Atomic formulae: predicates applied to tuples of terms
◮ e.g. likes(Mary, mother of (Mary)), likes(x, a)
□ Quantified formulae:
◮ e.g. ∀xlikes(x, a), ∃xlikes(x, mother of (y))
◮ here the second occurrences of x are bound by the quantifier (∀ in
the first case, ∃ in the second) and y in the second formula is free
Brothers are siblings
∀x ∀y (brother(x, y) → sibling(x, y))
“Sibling” is symmetric
∀x ∀y (sibling(x, y) ↔ sibling(y, x))
One’s mother is one’s female parent
∀x ∀y (mother(x, y) ↔ (female(x) ∧ parent(x, y)) A first cousin is a child of a parent’s sibling
∀x ∀y (firstcousin(x, y) ↔ ∃p ∃sparent(p, x) ∧ sibling(p, s) ∧ parent(s, y)
Converting English into First-OrderLogic
□ Everyone likes lying on the beach — ∀xlikes lying on beach(x) □ Someone likes Fido —∃xlikes(x, Fido)
□ No one likes Fido — ¬∃xlikes(x, Fido) (or ∀x¬likes(x, Fido))
□ Fido doesn’t like everyone — ¬∀xlikes(Fido, x)
□ All cats are mammals — ∀x (cat(x) → mammal(x))
□ Some mammals are carnivorous — ∃x(mammal(x)∧carnivorous(x)) □ Note: ∀x A(x) ⇔ ¬∃x¬A(x), ∃x A(x) ⇔ ¬∀x¬A(x)
First-Order Resolution
A1 ∨···∨Am ∨B ❧❧❧
❧❧
¬B′ ∨C1 ∨···∨Cn ✱✱✱
✱✱ ❧✱
❧✱
(A1 ∨ ···∨ Am ∨ C1 ∨ ···∨ Cn)θ
where B, B′ are positive literals, Ai, Cj are literals, θ is an mgu of B and B′ □ B and ¬B′ are complementary literals
□ (A1 ∨···∨Am ∨C1 ∨···∨Cn)θ is the resolvent of the two clauses □ Special case: If no Ai and Cj, resolvent is empty clause, denoted Q
Unification
McCulloch & Pitts Model of a Single Neuron
□ A unifier of two atomic formulae is a substitution of terms for variables that makes them identical
◮ Each variable has at most one associated term ◮ Substitutions are applied simultaneously
□ Unifier of P(x, f(a), z) and P(z, z, u) : {x/f(a), z/f(a), u/f(a)}
□ Substitution σ1 is a more general unifier than a substitution σ2 if for
some substitution τ, σ2 = σ1τ (i.e. σ1 followed by τ)
□ Theorem. If two atomic formulae are unifiable, they have a most
general unifier (mgu).
❍❍❍❍
w1 ❍❍
x1
❍❥ s✲ ✲g(s) w2 ✟✟✟✯
Σ
✟ ✕✁
✟ ✁ x,x areinputs
✁
✟ w=−θ 12
✁0
✁s=w1x1 +w2x2 −θ
= w1x1 +w2x2 +w0
w1, w2 are synaptic weights θ is a threshold
w0 is a bias weight
g is transfer function
Examples
□ {P(x,a),P(b,c)}isnotunifiable
□ {P(f(x),y),P(a,w)} is not unifiable
□ {P(x,c),P(b,c)} is unifiable by {x/b}
□ {P(f(x),y),P(f(a),w)}isunifiableby
σ = {x/a, y/w}, τ = {x/a, y/a, w/a}, υ = {x/a, y/b, w/b} Notethatσisanmguandτ=σθwhereθ=…?
□ {P(x), P( f (x))} is not unifiable (c.f. occur check!)
Perceptron Learning Rule
Adjust the weights as each input is presented Recall s = w1x1 + w2x2 + w0
if g(s) = 0 but should be 1, wk ← wk +ηxk
w0 ← w0 +η
so s ← s+η(1+∑xk) so s ← s−η(1+∑xk)
kk
otherwise weights are unchanged (η > 0 is called the learning rate)
Theorem: This will eventually learn to classify the data correctly, as long as they are linearly separable
g
x2
1
if g(s) =1 but should be 0, wk ← wk −ηxk
w0 ← w0 −η
22
Multi-Layer Neural Networks
Example: Delayed Rewards
XOR
NOR
NOR
+0.5 −1 −1
+1
−1
AND
−1.5 +1
−1 +0.5
Question: Given an explicit logical function, we can design a multi-layer neural network by hand to compute that function – but if we are just given a set of training data, can we train a multi-layer network to fit this data?
Reinforcement Learning Framework
□ Agent interacts with its environment
□ There is a set S of states and a set A ofactions
□ At each time step t, the agent is in some state st and must choose an action at , whereupon it goes into state st+1 = δ(st , at ) and receives reward r(st , at )
□ In general, r() and δ() can be multi-valued, with a random element
□ The aim is to find an optimal policy π : S → A which maximizes the cumulative reward
Calculation
Theorem: In a deterministic environment, for an optimal policy, the value functionV∗ satisfiestheBellmanequations:V∗(s)=r(s,a)+γV∗δ(s,a)
where a = π∗(s) is the optimal action at state s.
Let δ∗(s) be the transition function for π∗(s) and suppose γ = 0.9
1. Supposeδ∗(s1)=s1.ThenV∗(s1)=5+0.9V∗(s1)soV∗(s1)=50 Suppose δ∗(s2) = s2. Then V ∗(s2) = 10 + 0.9V ∗(s2) so V ∗(s2) = 100
∗∗∗∗
2. Supposeδ(s1)=s2. ThenV (s1)=2+0.9V (s2)soV (s1)=92
∗∗∗∗ Supposeδ(s )=s . ThenV (s )=10+0.9V (s )soV (s )=100
22222
3. Supposeδ∗(s1)=s2.ThenV∗(s1)=2+0.9V∗(s2)soV∗(s1)=81.6
Suppose δ∗(s2) = s1. Then V ∗(s2) = 15 + 0.9V ∗(s1) so V ∗(s2) = 88.4 So 2 is the optimal policy