Reasoning Under Uncertainty
Uncertainty material is covered in chapters 13 and 14.
Chapter 13 gives some basic background on probability from the point of view of A.I.
Chapter 14 covers Bayesian Networks, along with exact and approximate reasoning in Bayes Nets, which will be our main topics.
Note: Slides in this section draw on material from Faheim Bacchus and Sheila McIlraith.
Reasoning Under Uncertainty
The world is a very uncertain place.
As of this point, we’ve basically danced around that fact. We’ve assumed that what we see in the world is really there, what we do in the world has predictable outcomes, etc.
i.e., if you are in state S1 and you execute action A you arrive at state S2.
Example: Sokoban
move(S1,’N’) = S2
move(S1,’S’) = S3
Probabilistic Sokoban is a Very Different Game
Move(S1,’N’) = S2 50% of the time
Move(S1,’N’) = S3 50% of the time
Probabilistic Sokoban is a Very Different Game
Based on what we can see, there’s a 30% chance we’re in cell S1, a 30% chance we’re in S2, and a 40% chance we’re in S3.
Life in an Uncertain World
• We might not know the effects of an action: the action might have a random component, like rolling dice; we might not know the long-term effects of a drug; we might not know the status of a road when we choose to drive down it.
• We may not know exactly what state we are in: e.g., we can’t see our opponent’s cards in a poker game; we don’t know what a patient’s ailment is.
• We may still need to act, but we can’t act solely on the basis of facts. We have to “gamble”.
Uncertainty
But how do we gamble rationally?
• If we must arrive at the airport at 9pm on a weeknight, we could “safely” leave for the airport half an hour before. There is some probability of the trip taking longer, but that probability is low.
• If we must arrive at the airport at 4:30pm on a Friday, we most likely need an hour or more to get to the airport. There is a relatively high probability of it taking 1.5 hours.
Acting rationally under uncertainty means we must parameterize the likelihood of our actions’ outcomes and the relative value of these outcomes. We will do this with probability.
Events, Event Spaces
Probability of an Event, Axioms of Probability
Conditional Probability
Independence of Probabilities
Conditional Independence of Probabilities
The Chain Rule
Joint Probability Table, Joint Distribution
Marginalization of a Joint Probability
Bayes’ Rule
Normalization of Probabilities
Benefits of Independence
• Exploiting independence gives a compact representation of the joint distribution for inference; Bayesian networks do just this.
• Bayesian networks are tools to both represent uncertainty in event spaces and to reason about the likelihood of events.
One particularly useful kind of Bayesian network models a sequence of conditional independence relationships.
Sequential Models
Sequences of observations occur in many domains: speech recognition, robot localization, user attention, medical monitoring.
Markov Models capture sequences
▪ Say we have one variable X (perhaps with a very large number of possible value assignments).
▪ We want to track the probability of different values of X (i.e. the probability distribution over X) as its values change over time.
▪ Possible solution: Make multiple copies of X, one for each time point (we assume a discrete model of time): X1, X2, X3 … Xt
▪ A Markov Model is specified by the following two assumptions:
▪ The current state Xt is conditionally independent of the earlier states given the previous state:
P(Xt | Xt-1, Xt-2, …, X1) = P(Xt | Xt-1)
▪The transitions between Xt-1 and Xt are determined by probabilities that do not change over time (they are stationary probabilities).
P(Xt | Xt-1)
Markov Models
▪ P(X1,X2,X3,…) = P(X1) P(X2|X1) P(X3|X2) … (Assumption 1)
▪ All the CPTs (except P(X1)) are the same (Assumption 2)
These assumptions give rise to a Bayesian Network that looks like this:
X1 → X2 → X3 → X4
Markov Models
X1 → X2 → X3 → X4
This model reflects a specific set of conditional independence
relationships:
Xt-1 is conditionally independent of Xt+1, Xt+2, … given Xt.
The current state separates the past from the future.
Example Markov Chain Weather
States: X = {rain, sun}
Initial distribution: P(X1=sun) = 1.0
CPT P(Xt | Xt-1): (transition probability table given on the slide)
[Slide created by and for CS188 Intro to AI at UC Berkeley].
Example Markov Chain Weather
▪ P(X1=sun) = 1.0
▪ What is the probability distribution after one step, P(X2)?
▪ Use the summing-out rule with X1: P(X2) = Σx1 P(X2 | X1=x1) P(X1=x1)
CPT P(Xt | Xt-1): (transition probability table given on the slide)
[Slide created by and for CS188 Intro to AI at UC Berkeley].
Example Markov Chain Weather
▪ What is the probability distribution on day t, P(Xt)? Sum out Xt-1:
P(Xt) = Σxt-1 P(Xt | Xt-1=xt-1) P(Xt-1=xt-1)
▪ Forward simulation: compute P(X2), then P(X3), then P(X4), …
CPT P(Xt | Xt-1): (transition probability table given on the slide)
[Slide created by and for CS188 Intro to AI at UC Berkeley].
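To make the forward simulation concrete, here is a minimal Python sketch. The transition probabilities are illustrative assumptions (the slide’s CPT values are not reproduced in these notes); only the initial distribution P(X1=sun) = 1.0 comes from the example.

import numpy as np

# Hypothetical transition CPT for the weather chain (illustrative values only).
# T[i, j] = P(Xt = j | Xt-1 = i), with state 0 = sun, 1 = rain.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p = np.array([1.0, 0.0])   # P(X1): sun with probability 1.0 (from the example)

# Forward simulation: P(Xt) = sum over xt-1 of P(Xt | xt-1) P(Xt-1 = xt-1).
for t in range(2, 7):
    p = p @ T
    print(f"P(X{t}) = sun: {p[0]:.3f}, rain: {p[1]:.3f}")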
Example Run of Forward Computation
The slide shows P(X1), P(X2), P(X3), P(X4) computed from an initial observation of sun, from an initial observation of rain, and from yet another initial distribution P(X1).
[Slide created by and for CS188 Intro to AI at UC Berkeley.]
Stationary Distributions
For most Markov chains:
Influence of the initial distribution dissipates over time.
The distribution we end up in is independent of the initial distribution
Stationary distribution
The distribution that we end up with is called the stationary distribution of the chain.
It satisfies P∞(x) = Σx' P(Xt = x | Xt-1 = x') P∞(x') for all x; that is, the stationary distribution does not change under a further forward step.
We can compute it by solving simultaneous equations (or by forward simulating the system many times; forward simulation is generally computationally more effective)
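Here is a small sketch of the forward-simulation approach, again with an assumed transition matrix (the actual values are not given in these notes): we repeatedly apply the transition model until the distribution stops changing. Solving the simultaneous equations p = pT with the entries of p summing to 1 gives the same answer.

import numpy as np

# Hypothetical transition matrix, T[i, j] = P(Xt = j | Xt-1 = i) (illustrative values).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p = np.array([1.0, 0.0])          # any initial distribution
for _ in range(1000):
    p_next = p @ T
    if np.allclose(p_next, p):    # no change: we have reached the stationary distribution
        break
    p = p_next

print("stationary distribution:", p_next)   # satisfies p = p @ T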
Hidden Markov Models
Markov chains not so useful for most agents
Need observations to update your beliefs
Hidden Markov models (HMMs)
Underlying Markov chain over states X
But you also observe outputs (effects) at each time step
X1 → X2 → X3 → X4 (hidden states), with observed outputs E1, E2, E3, E4
[Slide created by and for CS188 Intro to AI at UC Berkeley].
Example: Weather HMM
Hidden chain Raint-1 → Raint → Raint+1, each emitting an observation Umbrellat-1, Umbrellat, Umbrellat+1; the transition CPT P(Rt+1|Rt) and the emission CPT are given on the slide.
▪ An HMM is defined by:
▪ Initial distribution: P(X1)
▪ Transitions: P(Xt | Xt-1)
▪ Emissions: P(Et | Xt)
[Slide created by and for CS188 Intro to AI at UC Berkeley].
Joint Distribution of an HMM
X1 → X2 → X3 (hidden), with observations E1, E2, E3
P(X1,E1,X2,E2,X3,E3) = P(X1) P(E1|X1) P(X2|X1) P(E2|X2) P(X3|X2) P(E3|X3)
▪Assumptions:
▪ P(Xt | Xt-1, …, X1, Et-1, …, E1) = P(Xt | Xt-1)
(the current state is conditionally independent of earlier states and evidence given the previous state)
▪ P(Xt | Xt-1) is the same for all time points t (the probabilities are stationary)
▪ P(Et | Xt, …, X1, Et-1, …, E1) = P(Et | Xt)
(the current evidence is conditionally independent of earlier states and earlier evidence given the current state)
Note that two evidence items are not independent, unless one of the intermediate states is known.
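As a sketch of this factorization, the joint probability of a particular state/evidence sequence can be computed directly from the three CPTs. The weather/umbrella numbers below are illustrative assumptions, not values from the slides.

import numpy as np

# Hypothetical weather/umbrella HMM (illustrative assumptions).
# States: 0 = rain, 1 = sun.  Observations: 0 = umbrella, 1 = no umbrella.
P_X1 = np.array([0.5, 0.5])                 # initial distribution P(X1)
P_trans = np.array([[0.7, 0.3],             # P(Xt = j | Xt-1 = i)
                    [0.3, 0.7]])
P_emit = np.array([[0.9, 0.1],              # P(Et = e | Xt = i)
                   [0.2, 0.8]])

def joint_prob(states, evidence):
    # P(X1, E1, ..., XT, ET) = P(X1) P(E1|X1) * product over t of P(Xt|Xt-1) P(Et|Xt)
    p = P_X1[states[0]] * P_emit[states[0], evidence[0]]
    for t in range(1, len(states)):
        p *= P_trans[states[t - 1], states[t]] * P_emit[states[t], evidence[t]]
    return p

print(joint_prob(states=[0, 0, 1], evidence=[0, 0, 1]))   # rain, rain, sun with umbrella, umbrella, none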
Real HMM Examples
▪ Speech recognition HMMs: observations are acoustic signals (continuous valued); states are specific positions in specific words (so, tens of thousands).
▪ Machine translation HMMs: observations are words (tens of thousands); states are translation options.
▪ Robot tracking: observations are range readings (continuous); states are positions on a map (continuous).
[Slide created by and for CS188 Intro to AI at UC Berkeley.]
Tracking/Monitoring
▪ Monitoring is the task of tracking P(Xt | et, …, e1) over time, i.e., determining the state given the current and previous observations.
▪ P(X1) is the initial distribution over the variable (or feature) X. We usually start with a uniform distribution over all values of X.
▪ As time elapses, we make observations and must update our distribution over X, i.e., move from P(Xt-1 | et-1, …, e1) to P(Xt | et, …, e1).
▪ This is done with the HMM update equations. Tools for doing this existed before Bayes Nets, but we can relate them to Variable Elimination.
Example: Robot Localization
Example from
Sensor model: Can read in which directions there is a wall, never more than 1 mistake (a small sketch of this model is given below).
Motion model: Either the move is executed, or with low probability the robot does not move at all. The robot cannot move in the wrong direction.
Initially there is a uniform distribution over where the robot is located: it is equally likely to be anywhere.
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
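A minimal sketch of one way to encode the sensor model described above: the reading reports, for each of the four directions, whether a wall is seen; its likelihood depends on how many directions are misreported, and two or more mistakes are impossible. The per-direction error probability is an illustrative assumption.

def sensor_likelihood(reading, true_walls, p_error=0.1):
    # reading, true_walls: tuples of 0/1 for (N, S, E, W) wall present.
    # p_error is an assumed per-direction error probability, not the slide's value.
    mistakes = sum(r != w for r, w in zip(reading, true_walls))
    if mistakes == 0:
        return (1 - p_error) ** 4
    if mistakes == 1:
        return p_error * (1 - p_error) ** 3
    return 0.0   # more than one mistake: impossible under this sensor model

# A cell with walls above and below only, as in the example reading:
print(sensor_likelihood((1, 1, 0, 0), (1, 1, 0, 0)))   # 0 mistakes: likely
print(sensor_likelihood((1, 0, 0, 0), (1, 1, 0, 0)))   # 1 mistake: less likely
print(sensor_likelihood((0, 1, 1, 0), (1, 1, 0, 0)))   # 2 mistakes: impossible (0.0)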
Example: Robot Localization
(probability heat map; scale from 0 to 1)
t=1: Initially we don’t know where we are. We observe a wall above and below, and no wall to the left or right. There is a low probability of 1 mistake; 2 mistakes are not possible.
White: impossible to get this reading (more than one mistake).
Lighter grey: it was possible to get this reading, but it is less likely because it required 1 mistake.
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
Example: Robot Localization
(probability heat map; scale from 0 to 1)
t=2: Move right. With low probability the robot didn’t move; otherwise it must have moved right. We are still observing a wall above and below.
The robot can only be in a given square if (a) it was at the low-probability square to the left, or (b) it was at this square and the action didn’t work.
[Slide created by and for CS188 Intro to AI at UC Berkeley.]
Example: Robot Localization
[Slide created by and for CS188 Intro to AI at UC Berkeley.]
Example: Robot Localization
[Slide created by and for CS188 Intro to AI at UC Berkeley.]
Update Rules for HMMs
Forward Algorithm for HMMs
• Given a set of observations, we want to create a belief that tells us what state we think we’re in.
• With HMMs, we can use the Forward Algorithm to calculate beliefs that are conditioned on prior observations.
This algorithm is a special version of Variable Elimination (VE), which we will cover in more depth shortly. VE can be applied to any Bayes Net, not just Markov Models.
• The structure of the HMM allows for simple update rules:
We will store beliefs about state at time t, and use these to calculate beliefs at time (t+1).
Forward Algorithm for HMMs
▪ To create a distribution of beliefs about Xt-1, we need only look at the variables that are ancestors of Xt-1:
… → Xt-2 → Xt-1, with evidence …, Et-2, Et-1
Forward Algorithm for HMMs
P(Xt-1 | et-1, et-2, …, e1) = P(Xt-1, et-1, et-2, …, e1) / P(et-1, et-2, …, e1).
We calculate the numerator by marginalizing X1, X2, …, Xt-2 out of P(Xt-1, Xt-2, …, X1, et-1, et-2, …, e1).
X1: P(X1) P(e1|X1) P(X2|X1)
X2: P(e2|X2) P(X3|X2)
…
Xt-2: P(et-2|Xt-2) P(Xt-1|Xt-2)
Xt-1: P(et-1|Xt-1)
Forward Algorithm for HMMs
Summing out X1 we get a factor of X2; summing out X2 we get a factor of X3, and so on:
X1: P(X1) P(e1|X1) P(X2|X1)
X2: P(e2|X2) P(X3|X2) F2(X2)
…
Xt-2: P(et-2|Xt-2) P(Xt-1|Xt-2) Ft-2(Xt-2)
Xt-1: P(et-1|Xt-1) Ft-1(Xt-1)
Forward Algorithm for HMMs
X1: P(X1) P(e1|X1) P(X2|X1)
X2: P(e2|X2) P(X3|X2) F2(X2)
…
Xt-2: P(et-2|Xt-2) P(Xt-1|Xt-2) Ft-2(Xt-2)
Xt-1: P(et-1|Xt-1) Ft-1(Xt-1)
P(Xt-1 | et-1, et-2, …, e1) = normalize(P(et-1|Xt-1) Ft-1(Xt-1))
This is a table with one value for each value of Xt-1.
Forward Algorithm for HMMs
Now say time has passed but no observation has been made yet.
X1: P(X1) P(e1|X1) P(X2|X1)
X2: P(e2|X2) P(X3|X2)
…
Xt-2: P(et-2|Xt-2) P(Xt-1|Xt-2)
Xt-1: P(et-1|Xt-1) P(Xt|Xt-1)
The buckets are the same as before, with one new bucket (Xt) and one new factor (P(Xt|Xt-1)).
Forward Algorithm for HMMs
Sum out variables, as before:
X1: P(X1) P(e1|X1) P(X2|X1)
X2: P(e2|X2) P(X3|X2) F2(X2)
…
Xt-2: P(et-2|Xt-2) P(Xt-1|Xt-2) Ft-2(Xt-2)
Xt-1: P(et-1|Xt-1) P(Xt|Xt-1) Ft-1(Xt-1)
Xt: Ft(Xt)
where Ft(Xt) = ΣXt-1 P(et-1|Xt-1) P(Xt|Xt-1) Ft-1(Xt-1)
HMM Rules, Recap
1. Access the initial distribution P(X1).
2. Calculate state estimates over time:
P(Xt | et-1, et-2, …, e1) = normalize(ΣXt-1 P(Xt | Xt-1) P(Xt-1 | et-1, et-2, …, e1))
3. Weight with the observation:
P(Xt | et, et-1, et-2, …, e1) = normalize(P(Xt | et-1, et-2, …, e1) * P(et | Xt))
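A compact sketch of these update rules in Python: the loop follows steps 1 to 3 above. The weather/umbrella parameters are illustrative assumptions rather than values from the slides.

import numpy as np

def forward_filter(evidence, P_X1, P_trans, P_emit):
    # Track P(Xt | e1..et): predict with the transition model, weight by the
    # observation, then normalize (steps 1-3 above).
    belief = P_X1.copy()                         # step 1: P(X1)
    beliefs = []
    for t, e in enumerate(evidence):
        if t > 0:
            belief = P_trans.T @ belief          # step 2: sum over xt-1 of P(Xt|xt-1) P(xt-1|e1..et-1)
        belief = belief * P_emit[:, e]           # step 3: weight with P(et | Xt)
        belief = belief / belief.sum()           # normalize
        beliefs.append(belief)
    return beliefs

# Hypothetical weather/umbrella parameters (illustrative assumptions only).
P_X1 = np.array([0.5, 0.5])                      # states: 0 = rain, 1 = sun
P_trans = np.array([[0.7, 0.3],                  # P(Xt = j | Xt-1 = i)
                    [0.3, 0.7]])
P_emit = np.array([[0.9, 0.1],                   # P(Et = e | Xt = i); e: 0 = umbrella, 1 = none
                   [0.2, 0.8]])

for t, b in enumerate(forward_filter([0, 0, 1], P_X1, P_trans, P_emit), start=1):
    print(f"P(X{t} | e1..e{t}) =", b)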
Example: Passage of Time
As time passes, uncertainty “accumulates”
▪(Transition model: ghosts usually go clockwise)
T=1, T=2, T=5 (heat maps shown on the slide)
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
Example: Observation
As we get observations, beliefs get re-weighted, uncertainty “decreases”
Before observation vs. after observation (heat maps shown on the slide)
P(Xt | et, et-1, …, e1) = c * P(Xt | et-1, et-2, …, e1) * P(et | Xt)
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
Most Likely Explanation
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
HMMs: MLE Queries
HMMs are defined by:
States X
Observations E
Initial distribution: P(X1)
Transitions: P(Xt | Xt-1)
Emissions: P(Et | Xt)
New query: most likely explanation, argmax x1,…,xt P(x1, …, xt | e1, …, et)
New method: the Viterbi algorithm
X1 → X2 → X3 → X4 → X5 (hidden states), with observations E1, E2, E3, E4, E5
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
State Trellis
State trellis: graph of states and transitions over time
Each arc represents some transition xt-1 → xt
Each arc has weight P(xt | xt-1) P(et | xt)
Each path is a sequence of states
The product of the weights on a path is the probability of that state sequence together with the evidence. The Forward algorithm computes sums over paths; Viterbi computes the best path.
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
Forward / Viterbi Algorithms
Forward Algorithm (Sum): ft(xt) = P(et | xt) Σxt-1 P(xt | xt-1) ft-1(xt-1)
Viterbi Algorithm (Max): mt(xt) = P(et | xt) maxxt-1 P(xt | xt-1) mt-1(xt-1)
[Slide created by and for CS188 Intro to AI at UC Berkeley. ]
# Pi: initial probabilities. A: transition matrix. B: emission matrix.
# A[x, s] = P(Xt = s | Xt-1 = x); B[s, o] = P(Et = o | Xt = s); O holds observation indices.
import numpy as np

def viterbi(O, S, Pi, A, B):
    prob_trellis = np.zeros((len(S), len(O)))
    path_trellis = [[None] * len(O) for _ in S]
    # Determine trellis values for X1.
    for s in range(len(S)):
        prob_trellis[s, 0] = Pi[s] * B[s, O[0]]
        path_trellis[s][0] = [s]
    # For X2..XT, find each current state's most likely prior state x.
    for o in range(1, len(O)):
        for s in range(len(S)):
            x = int(np.argmax(prob_trellis[:, o - 1] * A[:, s] * B[s, O[o]]))
            prob_trellis[s, o] = prob_trellis[x, o - 1] * A[x, s] * B[s, O[o]]
            path_trellis[s][o] = path_trellis[x][o - 1] + [s]
    return prob_trellis, path_trellis
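A small usage example for the viterbi function above, on a hypothetical 2-state weather/umbrella HMM (the numbers are illustrative assumptions). The path stored for the most probable final state is the most likely explanation.

import numpy as np

S = ["rain", "sun"]                     # hidden states
Pi = np.array([0.5, 0.5])               # P(X1)
A = np.array([[0.7, 0.3],               # A[x, s] = P(Xt = s | Xt-1 = x)
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],               # B[s, o] = P(Et = o | Xt = s); o: 0 = umbrella, 1 = none
              [0.2, 0.8]])
O = [0, 0, 1]                           # observed: umbrella, umbrella, no umbrella

prob_trellis, path_trellis = viterbi(O, S, Pi, A, B)
best_last = int(np.argmax(prob_trellis[:, -1]))        # most probable final state
best_path = [S[i] for i in path_trellis[best_last][-1]]
print(best_path)                                       # ['rain', 'rain', 'sun']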