
IS 2440 Artificial Intelligence
Week 4: Agents with Uncertain Info
Daqing He January 28, 2020
© Daqing He
Many slides are from Dan Klein’s (Berkeley) course

Muddiest Points
• Knowledge-based Agents
– We know ambiguity can exist in natural language; can it also exist in propositional logic or first-order logic?
– For the knowledge-based agent, what should we do if there is no way to make further inferences? For example, in the Wumpus world game, what should we do if, based on all the information we have, we still cannot tell which next cell is safe?
– Can we create knowledge-based agents that are also ethical by adding “ethical” knowledge to the knowledge base?
– AI should have learned by itself, right? But from what we learned last week, it seems that, in order to have a workable agent, we have to program both the real-world environment and what the agent should do in each case. How is that different from the traditional way of writing a program?

Muddiest Points
• Knowledge base
– How does the idea of a monotonic knowledge base interact with the idea of unexplored knowledge? We could represent facts about the world which might seem contradictory but in reality are not. For example: “light is a particle”; “light is a wave”. Both of these seem to be true, but also contradictory if we assume that a wave and a particle are different. So can we not add things to a knowledge base that are not understood?
– It says that, as an example of entailment, the KB containing the statements “the Panther won” and “the Penguin won” entails “Either the Panther won or the Penguin won.” However, why are these statements mutually exclusive, i.e., why must it be the case that for the Penguin/Panther to win, the Panther/Penguin must lose? Also, for the other example, wouldn’t it be more accurate to say that x + 5 ≥ y entails x + 7 > y, as if x + 5 ≥ y, mustn’t it be the case that x + 7 > y is true?

Muddiest Points
• First order logic
– What’s the difference between constant symbols and objects? Does that mean all objects are constant symbols? What’s the relationship between them?
– In AIMA 8.2.2, why is “LeftLeg” a function symbol? What is LeftLeg’s functionality?
– On page 69 about first-order logic, what is the difference between relations and functions? Why is “father of” a function while “brother of” is a relation?
– Are there any limitations of first-order logic for knowledge representation?

Muddiest Points
• Can you give some actual use cases of “Propositional logic” and “First Order Logic”, is there any program based on these two?
– http://people.dbmi.columbia.edu/~ehs7001/Buchanan-Shortliffe- 1984/MYCIN%20Book.htm
– https://en.wikipedia.org/wiki/Expert_system
• Generalizability
– When an agent has solved a problem, for example one scenario of the Wumpus problem, what happens to the knowledge base it used to solve that problem, since it will probably not be used in any future Wumpus problem with a different environment?

Resolution example
• Unit resolution rule
– Where l_i and m are complementary literals, i.e., m = ¬l_i
• Proof by contradiction, i.e., show KB ∧ ¬α unsatisfiable
– KB = (B1,1 ⇔ (P1,2 ∨ P2,1)) ∧ ¬B1,1, α = ¬P1,2

Muddiest Points
• Forward and Backward Chaining
– What causes the difference in complexity between forward chaining and backward chaining?
– Will backward chaining always perform poorly on detection tasks, or other tasks in which forward chaining excels, even though it does much less irrelevant work?
– Can forward chaining and backward chaining be used together by an agent when solving a problem? For example, using forward chaining first to learn more about the environment, then using that information with backward chaining to solve a problem?
– Do humans program the agent to do either forward chaining or backward chaining, or can the agent decide for itself to do one or the other?

Agenda
• Basic Probabilistic Inference
• Markov Chain
• Hidden Markov Model

Class Goals
• After this class, you should be able to
– identify the main ideas behind reasoning under uncertainty with probabilistic inference
– identify the main ideas behind the Markov chain and the Hidden Markov Model
– describe Markov chains and Hidden Markov Models, and make simple inferences.

Agents handling uncertainty

Knowledge-based Agents need a Knowledge Base
• The size and complexity of the knowledge base increase with the complexity of the task
• Lack of mechanisms to handle this complexity:
• Selecting knowledge to make inferences
• Interactions among different pieces of knowledge
• Incompleteness in knowledge

Uncertainty
• In agent terms, uncertainty means an agent may never know for sure what state it is in or where it will end up after a sequence of actions
• This has serious drawbacks
– The agent must consider every logically possible explanation for its observations, no matter how unlikely.
• This leads to impossibly large and complex state representations
– The agent has no effective way to select among different possible states
• because it does not know how to compare the merits of plans.

Can we avoid uncertainty?
• In general, we cannot, because
• 1) the problem space can be partially observable, nondeterministic, or both
– This means the problem itself has a certain degree of uncertainty
• 2) our theories and practice are imperfect or incomplete
– Our scientific theories are not fully developed, such as in medicine and the social sciences
– Our experiments/tests cannot fully cover all aspects of a problem, such as knowledge about individual patients with a disease

Rational Agent that can handle uncertainty
• We define an agent as rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action
– This highest expected utility is called the maximum expected utility (MEU)
• Therefore, rational behavior depends on the relative importance of various goals and the likelihood that they will be achieved.
– Probability theory provides us with a model of likelihood
• a numerical degree of belief between 0 (certainly false) and 1 (certainly true)
• Decision theory
– Decision theory = probability theory + utility theory

Basic Probabilistic Inference

Random Variable
• A random variable describes an event whose outcome has some degree of uncertainty
– R = Is it raining?
– T = Is it hot or cold?
– D = How long will it take to drive to work?
– L = Where is the Wumpus?
• Random variables have domains
– R in {true, false} (often written as {+r, -r})
– T in {hot, cold}
– D in [0, ∞)
– L in possible locations, maybe {(1,1), (1,2), …}

Probability and Probability Distribution
• Probability
– The likelihood that an outcome happens
– Every probability is between 0 and 1 (inclusive)
– P(die takes value 6) = 1/6
• Probability distribution of a discrete random variable A
– The probability of each outcome of a discrete random variable A
– Can often be estimated from relative frequencies as the number of samples n → ∞
– The probabilities for a variable A have to sum to 1

T     P
hot   0.5
cold  0.5

W       P
sun     0.6
rain    0.1
fog     0.3
meteor  0.0

Joint Distributions
• A joint distribution over a set of random variables specifies a real number for each assignment (or outcome):
– Must obey: every entry is non-negative, and all entries sum to 1

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
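To make the operations on the following slides concrete, here is a minimal Python sketch (not from the slides) representing the table above as a dict from outcome tuples to probabilities:

    # Joint distribution P(T, W) as a dict from outcome tuples to probabilities.
    P_TW = {
        ('hot', 'sun'): 0.4,
        ('hot', 'rain'): 0.1,
        ('cold', 'sun'): 0.2,
        ('cold', 'rain'): 0.3,
    }

    # A distribution must be non-negative and sum to 1.
    assert all(p >= 0 for p in P_TW.values())
    assert abs(sum(P_TW.values()) - 1.0) < 1e-9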

Events
• An event is a set E of outcomes
• From a joint distribution, we can
calculate the probability of any event
– Probability that it’s hot AND sunny?
– Probability that it’s hot?
– Probability that it’s hot OR sunny?
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
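As a sketch of the idea (assuming the dict representation introduced earlier), the probability of an event is just the sum over the outcomes it contains:

    # P(E) = sum of P(outcome) over the outcomes in the event set E.
    P_TW = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
            ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

    def prob_event(joint, event):
        return sum(p for outcome, p in joint.items() if outcome in event)

    print(prob_event(P_TW, {('hot', 'sun')}))                    # hot AND sunny: 0.4
    print(prob_event(P_TW, {('hot', 'sun'), ('hot', 'rain')}))   # hot: 0.5
    print(prob_event(P_TW, {('hot', 'sun'), ('hot', 'rain'),
                            ('cold', 'sun')}))                   # hot OR sunny: 0.7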

Quiz: Events
• P(+x, +y) ?
• P(+x) ?
• P(-y OR +x) ?
X    Y    P
+x   +y   0.2
+x   -y   0.3
-x   +y   0.4
-x   -y   0.1

Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): Combine collapsed rows by adding
Joint P(T, W):
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

P(T):
T     P
hot   0.5
cold  0.5

P(W):
W     P
sun   0.6
rain  0.4
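A hedged sketch of marginalization in the same dict representation (the function name and positional indexing are my own, not from the slides):

    from collections import defaultdict

    P_TW = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
            ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

    def marginal(joint, var_index):
        # Sum out every variable except the one at var_index.
        out = defaultdict(float)
        for outcome, p in joint.items():
            out[outcome[var_index]] += p
        return dict(out)

    print(marginal(P_TW, 0))   # {'hot': 0.5, 'cold': 0.5}  = P(T)
    print(marginal(P_TW, 1))   # {'sun': 0.6, 'rain': 0.4}  = P(W)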

Quiz: Marginal Distributions
X    P
+x   ?
-x   ?

X    Y    P
+x   +y   0.2
+x   -y   0.3
-x   +y   0.4
-x   -y   0.1

Y    P
+y   ?
-y   ?


Conditional Probabilities
A simple relation between joint and conditional probabilities:

P(a | b) = P(a, b) / P(b)

– In fact, this is taken as the definition of a conditional probability
– P(a) is called the prior probability; P(a | b) is called the posterior probability

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

Normalization Trick
Joint P(T, W):
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

SELECT the entries consistent with the evidence (T = cold), then NORMALIZE the selection so it sums to one, giving P(W | cold):
W     P
sun   0.4
rain  0.6
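The trick in code form, as a sketch: select the entries consistent with the evidence, then divide by their total (the helper below is illustrative, not from the slides):

    P_TW = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
            ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

    def condition(joint, var_index, value):
        # SELECT the entries consistent with the evidence ...
        selected = {o: p for o, p in joint.items() if o[var_index] == value}
        # ... then NORMALIZE by their sum.
        z = sum(selected.values())
        return {o: p / z for o, p in selected.items()}

    # P(W | T=cold): the 'cold' rows (0.2, 0.3) renormalize to (0.4, 0.6).
    print(condition(P_TW, 0, 'cold'))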

Quiz: Conditional Probabilities
• P(+x | +y) ?
• P(-x | +y) ?
X    Y    P
+x   +y   0.2
+x   -y   0.3
-x   +y   0.4
-x   -y   0.1

• P(-y | +x) ?

Probabilistic Inference
• Probabilistic inference: compute a desired probability from other known probabilities (e.g. conditional from joint)
• We generally compute conditional probabilities
– P(on time | no reported accidents) = 0.90
– These represent the agent’s beliefs given the evidence
• Probabilities change with new evidence:
– P(on time | no accidents, 5 a.m.) = 0.95
– P(on time | no accidents, 5 a.m., raining) = 0.80
– Observing new evidence causes beliefs to be updated

Inference by Enumeration
• General case:
– Evidence variables: E1 … Ek = e1 … ek
– Query* variable: Q
– Hidden variables: H1 … Hr
(together, these are all the variables)
– We want: P(Q | e1 … ek)
§ Step 1: Select the entries consistent with the evidence
§ Step 2: Sum out H to get the joint of the query and the evidence:
P(Q, e1 … ek) = Σ over h1 … hr of P(Q, h1 … hr, e1 … ek)
§ Step 3: Normalize: Z = Σq P(q, e1 … ek), then P(Q | e1 … ek) = (1/Z) × P(Q, e1 … ek)
* Works fine with multiple query variables, too


Inference by Enumeration
• P(W | winter, hot)?

Full joint P(S, T, W):
S       T     W     P
summer  hot   sun   0.30
summer  hot   rain  0.05
summer  cold  sun   0.10
summer  cold  rain  0.05
winter  hot   sun   0.10
winter  hot   rain  0.05
winter  cold  sun   0.15
winter  cold  rain  0.20

– Step 1: select the entries consistent with the evidence:
S       T    W     P
winter  hot  sun   0.10
winter  hot  rain  0.05

– Step 2: obtain the joint probability P(W, winter, hot) (no hidden variables remain)
– Step 3: obtain Z using
Z = P(sun, winter, hot) + P(rain, winter, hot) = 0.10 + 0.05 = 0.15
– Calculate P(W | winter, hot) = P(W, winter, hot) / Z:
S       T    W     P(W | winter, hot)
winter  hot  sun   0.10/0.15 = 0.67
winter  hot  rain  0.05/0.15 = 0.33

Inference by Enumeration
• P(W | winter)?
– Step 1: select the entries consistent with the evidence (S = winter)

S       T     W     P
summer  hot   sun   0.30
summer  hot   rain  0.05
summer  cold  sun   0.10
summer  cold  rain  0.05
winter  hot   sun   0.10
winter  hot   rain  0.05
winter  cold  sun   0.15
winter  cold  rain  0.20

Inference by Enumeration
• P(W | winter)?
– Step 1: select the entries consistent with the evidence:

S       T     W     P
winter  hot   sun   0.10
winter  hot   rain  0.05
winter  cold  sun   0.15
winter  cold  rain  0.20

– Step 2: the hidden variable is T; sum it out

Inference by Enumeration
• P(W | winter)?
– Step 1: select the entries consistent with the evidence
– Step 2: the hidden variable is T; sum it out:

S       W     P
winter  sun   0.25
winter  rain  0.25

– Step 3: obtain Z using
Z = P(sun, winter) + P(rain, winter) = 0.25 + 0.25 = 0.50
– Calculate P(W | winter) = P(W, winter) / Z:

S       W     P(W | winter)
winter  sun   0.5
winter  rain  0.5

Inference by Enumeration
• Can you calculate P(W)?
S       T     W     P
summer  hot   sun   0.30
summer  hot   rain  0.05
summer  cold  sun   0.10
summer  cold  rain  0.05
winter  hot   sun   0.10
winter  hot   rain  0.05
winter  cold  sun   0.15
winter  cold  rain  0.20
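Putting the three steps together, a sketch of inference by enumeration over the seasons table (the evidence is given as a map from variable position to value; this interface is my own, not from the slides):

    from collections import defaultdict

    # The joint P(S, T, W) from the slide, with variable order (S, T, W).
    P_STW = {
        ('summer', 'hot', 'sun'): 0.30, ('summer', 'hot', 'rain'): 0.05,
        ('summer', 'cold', 'sun'): 0.10, ('summer', 'cold', 'rain'): 0.05,
        ('winter', 'hot', 'sun'): 0.10, ('winter', 'hot', 'rain'): 0.05,
        ('winter', 'cold', 'sun'): 0.15, ('winter', 'cold', 'rain'): 0.20,
    }

    def enumerate_inference(joint, query_index, evidence):
        totals = defaultdict(float)
        for outcome, p in joint.items():
            # Step 1: select entries consistent with the evidence.
            if all(outcome[i] == v for i, v in evidence.items()):
                # Step 2: sum out the hidden variables.
                totals[outcome[query_index]] += p
        # Step 3: normalize.
        z = sum(totals.values())
        return {value: p / z for value, p in totals.items()}

    print(enumerate_inference(P_STW, 2, {0: 'winter', 1: 'hot'}))  # sun 0.67, rain 0.33
    print(enumerate_inference(P_STW, 2, {0: 'winter'}))            # sun 0.5, rain 0.5
    print(enumerate_inference(P_STW, 2, {}))                       # P(W): sun 0.65, rain 0.35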

Inference by Enumeration
§ Obvious problems:
§ Worst-case time complexity O(d^n), for n variables with domain size d
§ Space complexity O(d^n) to store the joint distribution

The Product Rule
• Sometimes we have conditional distributions but want the joint:
P(x, y) = P(x | y) P(y)

The Product Rule
• Example:
P(W):
W     P
sun   0.8
rain  0.2

P(D | W):
D     W     P
wet   sun   0.1
dry   sun   0.9
wet   rain  0.7
dry   rain  0.3

P(D, W) = P(D | W) P(W):
D     W     P
wet   sun   0.08
dry   sun   0.72
wet   rain  0.14
dry   rain  0.06
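A sketch of the same computation in code, multiplying each conditional entry by the corresponding prior:

    P_W = {'sun': 0.8, 'rain': 0.2}                           # P(W)
    P_D_given_W = {('wet', 'sun'): 0.1, ('dry', 'sun'): 0.9,  # P(D | W)
                   ('wet', 'rain'): 0.7, ('dry', 'rain'): 0.3}

    # Product rule: P(D, W) = P(D | W) P(W)
    P_DW = {(d, w): P_D_given_W[(d, w)] * P_W[w] for (d, w) in P_D_given_W}
    print(P_DW)   # wet/sun 0.08, dry/sun 0.72, wet/rain 0.14, dry/rain 0.06
                  # (up to floating-point rounding)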

Bayes’ Rule
• Two ways to factor a joint distribution over two variables:
P(x, y) = P(x | y) P(y) = P(y | x) P(x)
• Dividing, we get:
P(x | y) = P(y | x) P(x) / P(y)
• Why is this at all helpful?
– Lets us build one conditional from its reverse
– Often one conditional is tricky but the other one is simple
– Foundation of many systems we’ll see later (e.g. ASR, MT)
• In the running for most important AI equation!

Inference with Bayes’ Rule
• Example: diagnostic probability from causal probability:
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
• Example:
– M: meningitis, S: stiff neck
– Example givens: P(+m) = 0.0001, P(+s | +m) = 0.8, P(+s | -m) = 0.01
P(+m | +s) = P(+s | +m) P(+m) / P(+s)
           = P(+s | +m) P(+m) / [P(+s | +m) P(+m) + P(+s | -m) P(-m)]
           = (0.8 × 0.0001) / (0.8 × 0.0001 + 0.01 × 0.9999) ≈ 0.008
– Note: the posterior probability of meningitis is still very small
– Note: you should still get stiff necks checked out! Why?
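The meningitis numbers worked through in code, a direct transcription of the slide's arithmetic:

    p_m = 0.0001            # prior P(+m)
    p_s_given_m = 0.8       # P(+s | +m)
    p_s_given_not_m = 0.01  # P(+s | -m)

    # Total probability of the evidence: P(+s).
    p_s = p_s_given_m * p_m + p_s_given_not_m * (1 - p_m)

    # Bayes' rule: P(+m | +s) = P(+s | +m) P(+m) / P(+s).
    print(p_s_given_m * p_m / p_s)   # ~0.0079: the posterior is still very small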

Quiz: Bayes’ Rule
• Given:
P(D | W):
D     W     P
wet   sun   0.1
dry   sun   0.9
wet   rain  0.7
dry   rain  0.3

P(W):
W     P
sun   0.8
rain  0.2

• What is P(W | dry) ?

The Chain Rule
• More generally, we can always write any joint distribution as an incremental product of conditional distributions:
P(x1, x2, …, xn) = Π over i of P(xi | x1, …, xi-1)
e.g., P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
• Why is this always true?

Markov Chain

Independence
• Two variables are independent in a joint distribution if P(X, Y) = P(X) P(Y) for all values. We write X ⊥ Y.
– Says the joint distribution factors into a product of two simple ones
– Usually variables aren’t independent!
• Can use independence as a modeling assumption
– Independence can be a simplifying assumption
– Empirical joint distributions: at best “close” to independent
– What could we assume for {Weather, Traffic, Cavity}?

Example: Independence
• N fair, independent coin flips:

X1:        X2:        …   Xn:
H   0.5    H   0.5        H   0.5
T   0.5    T   0.5        T   0.5

Example: Independence?
P(T):
T     P
hot   0.5
cold  0.5

P(W):
W     P
sun   0.6
rain  0.4

Observed joint P(T, W):
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

P2(T, W) = P(T) P(W):
T     W     P
hot   sun   0.3
hot   rain  0.2
cold  sun   0.3
cold  rain  0.2

Since P(T, W) ≠ P(T) P(W), T and W are not independent here.

Conditional Independence
• Unconditional (absolute) independence very rare
• Conditional independence is our most basic and robust form of
knowledge about uncertain environments.
• X is conditionally independent of Y given Z if and only if:
P(x, y | z) = P(x | z) P(y | z) for all x, y, z
or, equivalently, if and only if
P(x | z, y) = P(x | z) for all x, y, z

Conditional Independence
• P(Toothache, Cavity, Catch)
• If I have a cavity, the probability that the probe catches in
it doesn’t depend on whether I have a toothache:
– P(+catch | +toothache, +cavity) = P(+catch | +cavity)
• The same independence holds if I don’t have a cavity:
– P(+catch | +toothache, -cavity) = P(+catch | -cavity)
• Catch is conditionally independent of Toothache given Cavity:
– P(Catch | Toothache, Cavity) = P(Catch | Cavity)
§ Equivalent statements:
§ P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
§ P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
§ One can be derived from the others easily

Reasoning over Time or Space
• Often, we want to reason about a sequence of observations
– Speech recognition
– Robot localization
– User attention
– Medical monitoring
• Need to introduce time (or space) into our models

First/Second order Markov chains
First-order Markov chain: P(Xt | X0:t-1) = P(Xt | Xt-1)
Second-order Markov chain: P(Xt | X0:t-1) = P(Xt | Xt-2, Xt-1)

Markov Models
– Value of X at a given time is called the state
X1 → X2 → X3 → X4
– Parameters: called transition probabilities or dynamics, specify how the state evolves over time (also, initial state probabilities)
– Stationarity assumption: transition probabilities the same at all times

Joint Distribution of a Markov Model
X1 → X2 → X3 → X4
– Joint distribution:
P(X1, X2, X3, X4) = P(X1) P(X2 | X1) P(X3 | X2) P(X4 | X3)
– More generally:
P(X1, X2, …, XT) = P(X1) P(X2 | X1) P(X3 | X2) … P(XT | XT-1)
                 = P(X1) Π over t = 2…T of P(Xt | Xt-1)
– Questions to be resolved:
• Does this indeed define a joint distribution?
• Can every joint distribution be factored this way, or are we making some assumptions about the joint distribution by using this factorization?


Chain Rule and Markov Models
X1 → X2 → X3 → X4
From the chain rule, every joint distribution over X1, X2, X3, X4 can be written as:
P(X1, X2, X3, X4) = P(X1) P(X2 | X1) P(X3 | X1, X2) P(X4 | X1, X2, X3)
Assuming that X3 ⊥ X1 | X2 and X4 ⊥ X1, X2 | X3
results in the expression posited on the previous slide:
P(X1, X2, X3, X4) = P(X1) P(X2 | X1) P(X3 | X2) P(X4 | X3)


Chain Rule and Markov Models
X1 → X2 → X3 → X4
From the chain rule, every joint distribution over X1, X2, …, XT can be written as:
P(X1, X2, …, XT) = P(X1) Π over t = 2…T of P(Xt | X1, X2, …, Xt-1)
Assuming that for all t: Xt ⊥ X1, …, Xt-2 | Xt-1
gives us the expression posited on the earlier slide:
P(X1, X2, …, XT) = P(X1) Π over t = 2…T of P(Xt | Xt-1)

Implied Conditional Independencies
X1 → X2 → X3 → X4
• We assumed: X3 ⊥ X1 | X2 and X4 ⊥ X1, X2 | X3
• Do we also have X1 ⊥ X3, X4 | X2?
– Yes!
– Proof:
P(X1 | X2, X3, X4) = P(X1, X2, X3, X4) / P(X2, X3, X4)
= P(X1) P(X2 | X1) P(X3 | X2) P(X4 | X3) / Σ over x1 of [ P(x1) P(X2 | x1) P(X3 | X2) P(X4 | X3) ]
= P(X1, X2) / P(X2)
= P(X1 | X2)

Markov Models Recap
• Explicit assumption for all t: Xt ⊥ X1, …, Xt-2 | Xt-1
• Consequence: the joint distribution can be written as:
P(X1, X2, …, XT) = P(X1) P(X2 | X1) P(X3 | X2) … P(XT | XT-1)
                 = P(X1) Π over t = 2…T of P(Xt | Xt-1)
• Implied conditional independencies: (try to prove this!)
– Past variables are independent of future variables given the present,
i.e., if t1 > t2 > t3 then: Xt1 ⊥ Xt3 | Xt2
• Additional explicit assumption: P(Xt | Xt-1) is the same for all t

Example Markov Chain: Weather
• States: X = {rain, sun}
§ Initial distribution: 1.0 sun
§ CPT P(Xt | Xt-1):

Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7

[Two diagrams on the original slide show the same CPT in two other ways: a state-transition graph (sun→sun 0.9, sun→rain 0.1, rain→sun 0.3, rain→rain 0.7) and the equivalent trellis.]

Example Markov Chain: Weather
• Initial distribution: 1.0 sun
• What is the probability distribution after one step?
P(X2) = ⟨0.9 sun, 0.1 rain⟩

Mini-Forward Algorithm
• Question: What’s P(X) on some day t?
X1 → X2 → X3 → X4
P(xt) = Σ over xt-1 of P(xt-1, xt)
      = Σ over xt-1 of P(xt | xt-1) P(xt-1)
(forward simulation)
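A sketch of the mini-forward algorithm on the weather chain, using a dict keyed by (previous state, next state) to mirror the CPT above:

    T = {('sun', 'sun'): 0.9, ('sun', 'rain'): 0.1,
         ('rain', 'sun'): 0.3, ('rain', 'rain'): 0.7}
    STATES = ('sun', 'rain')

    def forward_step(p):
        # P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})
        return {x: sum(T[(prev, x)] * p[prev] for prev in STATES) for x in STATES}

    p = {'sun': 1.0, 'rain': 0.0}   # initial distribution: 1.0 sun
    for t in range(2, 6):
        p = forward_step(p)
        print(t, p)   # step 2: sun 0.9; then 0.84, 0.804, ... toward 0.75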

Example Run of Mini-Forward Algorithm
§ From an initial observation of sun
§ From an initial observation of rain
§ From yet another initial distribution P(X1): …
[Figure: the belief distributions P(X1), P(X2), P(X3), P(X4), …, P(X∞) for each starting point; all converge to the same P(X∞).]


Stationary Distributions
For most chains:
– The influence of the initial distribution gets less and less over time.
– The distribution we end up in is independent of the initial distribution.
§ Stationary distribution:
§ The distribution we end up with is called the stationary distribution P∞ of the chain
§ It satisfies
P∞(X) = P∞+1(X) = Σ over x of P(X | x) P∞(x)

Example: Stationary Distributions
• Question: What’s P(X) at time t = infinity?
X1 → X2 → X3 → X4

Xt-1  Xt    P(Xt | Xt-1)
sun   sun   0.9
sun   rain  0.1
rain  sun   0.3
rain  rain  0.7

P∞(sun) = P(sun | sun) P∞(sun) + P(sun | rain) P∞(rain)
P∞(rain) = P(rain | sun) P∞(sun) + P(rain | rain) P∞(rain)

P∞(sun) = 0.9 P∞(sun) + 0.3 P∞(rain)
P∞(rain) = 0.1 P∞(sun) + 0.7 P∞(rain)

So P∞(sun) = 3 P∞(rain), i.e., P∞(rain) = 1/3 P∞(sun)
Also: P∞(sun) + P∞(rain) = 1

P∞(sun) = 3/4, P∞(rain) = 1/4
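The same fixed point can be reached numerically: repeatedly applying the transition model from any starting distribution converges to the stationary distribution (a sketch reusing the dict-style CPT):

    T = {('sun', 'sun'): 0.9, ('sun', 'rain'): 0.1,
         ('rain', 'sun'): 0.3, ('rain', 'rain'): 0.7}
    STATES = ('sun', 'rain')

    p = {'sun': 0.5, 'rain': 0.5}   # any starting distribution works
    for _ in range(100):
        p = {x: sum(T[(prev, x)] * p[prev] for prev in STATES) for x in STATES}
    print(p)   # converges to sun 0.75, rain 0.25 — the slide's 3/4 and 1/4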

Application of Stationary Distribution: Web Link Analysis
• PageRank over a web graph
– Each web page is a state
– Initial distribution: uniform over pages
– Transitions:
• With prob. c, jump uniformly to a random page (dotted lines, not all shown)
• With prob. 1-c, follow a random outlink (solid lines)
• Stationary distribution
– Will spend more time on highly reachable pages
– E.g. many ways to get to the Acrobat Reader download page
– Somewhat robust to link spam
– Google 1.0 returned the set of pages containing all your keywords in decreasing rank; now all search engines use link analysis along with many other factors (rank is actually getting less important over time)
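A hedged sketch of the idea: power iteration over a toy three-page link graph (the graph and the jump probability c = 0.15 are made up for illustration; real PageRank runs over the web graph):

    links = {'A': ['B'], 'B': ['A', 'C'], 'C': ['A']}   # hypothetical outlinks
    pages = list(links)
    c = 0.15   # probability of a uniform random jump

    rank = {p: 1.0 / len(pages) for p in pages}          # uniform initial distribution
    for _ in range(100):
        new = {p: c / len(pages) for p in pages}         # random-jump mass
        for p in pages:
            for q in links[p]:
                new[q] += (1 - c) * rank[p] / len(links[p])   # follow a random outlink
        rank = new
    print(rank)   # highly reachable pages accumulate more stationary mass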

Hidden Markov Model

Hidden Markov Models
• Markov chains are not so useful for most agents
– “Conditioned on the present, the past and future are independent” is often not true
– Need observations to update your beliefs
• Hidden Markov models (HMMs)
– Underlying Markov chain over states X
– You observe outputs (effects) at each time step

X1 → X2 → X3 → X4 → X5
E1   E2   E3   E4   E5   (one observation below each state)

Hidden Markov Model
An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable.
[Figure: the state variables are hidden; the evidence variables are observable.]

States and observations
View the world as a series of snapshots, or time slices.
Xt: the set of state variables at time t; unobservable.
Et: the set of observable evidence variables.
The observation at time t is Et = et.

Transition model & sensor model
• Given the state and evidence variables for a problem, we must specify how the state evolves (the transition model) and how the evidence variables get their values (the sensor model).
• Transition model: it rained yesterday; will it rain today or not?
• Sensor model: will the director take an umbrella if it rains?

Transition model & Markov Model
The transition model specifies the probability distribution of the current state given the previous values: P(Xt | X0:t-1).
Note: t is unbounded.
Markov assumption: the current state depends on only a finite, fixed number of previous states.
Processes satisfying this assumption are called Markov chains.

Example: Umbrella & Rain
A security guard living in an underground installation wants to know whether it is raining today by observing whether the director comes in with an umbrella.
Ut: evidence variable; the umbrella appears or does not appear.
Rt: state variable; rain or sun.

Sensor Model
The sensor model specifies the probability distribution of the evidence variables given the current state and the previous values.
Sensor Markov assumption: P(Et | X0:t, E0:t-1) = P(Et | Xt)

Example: Weather HMM
Raint-1 → Raint → Raint+1    (transition model P(Xt | Xt-1))
Umbrellat-1, Umbrellat, Umbrellat+1    (sensor model P(Et | Xt); one observation below each state)

• An HMM is defined by:
– Initial distribution: P(X1)
– Transitions: P(Xt | Xt-1)
– Emissions: P(Et | Xt)

Xt   Xt+1   P(Xt+1 | Xt)
+r   +r     0.7
+r   -r     0.3
-r   +r     0.3
-r   -r     0.7

Xt   Et   P(Et | Xt)
+r   +u   0.9
+r   -u   0.1
-r   +u   0.2
-r   -u   0.8

Joint probability distribution
For any time t:
P(X0:t, E1:t) = P(X0) Π over i = 1…t of P(Xi | Xi-1) P(Ei | Xi)
(the P(Xi | Xi-1) factors are the transition model; the P(Ei | Xi) factors are the sensor model)

Joint Distribution of an HMM
X1 → X2 → X3 → … → X5, with E1, E2, E3, …, E5 observed below each state
– Joint distribution:
P(X1, E1, X2, E2, X3, E3) = P(X1) P(E1 | X1) P(X2 | X1) P(E2 | X2) P(X3 | X2) P(E3 | X3)
– More generally:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) Π over t = 2…T of P(Xt | Xt-1) P(Et | Xt)
– Questions to be resolved:
• Does this indeed define a joint distribution?
• Can every joint distribution be factored this way, or are we making some assumptions about the joint distribution by using this factorization?
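A sketch scoring one concrete state/evidence sequence with this factorization, using the umbrella CPTs from the previous slide (the uniform initial distribution is taken from the example later in the deck):

    P_init = {'+r': 0.5, '-r': 0.5}                     # assumed uniform start
    P_trans = {('+r', '+r'): 0.7, ('+r', '-r'): 0.3,    # P(Xt | Xt-1)
               ('-r', '+r'): 0.3, ('-r', '-r'): 0.7}
    P_emit = {('+r', '+u'): 0.9, ('+r', '-u'): 0.1,     # P(Et | Xt)
              ('-r', '+u'): 0.2, ('-r', '-u'): 0.8}

    def joint_prob(states, observations):
        # P(X1) P(E1|X1) * product over t >= 2 of P(Xt|Xt-1) P(Et|Xt)
        p = P_init[states[0]] * P_emit[(states[0], observations[0])]
        for t in range(1, len(states)):
            p *= P_trans[(states[t - 1], states[t])] * P_emit[(states[t], observations[t])]
        return p

    print(joint_prob(['+r', '+r'], ['+u', '+u']))   # 0.5 * 0.9 * 0.7 * 0.9 = 0.2835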

Chain Rule and HMMs
X1 → X2 → X3, with E1, E2, E3 observed below each state
From the chain rule, every joint distribution over X1, E1, X2, E2, X3, E3 can be written as:
P(X1, E1, X2, E2, X3, E3) = P(X1) P(E1 | X1) P(X2 | X1, E1) P(E2 | X1, E1, X2) P(X3 | X1, E1, X2, E2) P(E3 | X1, E1, X2, E2, X3)
Assuming that:
X2 ⊥ E1 | X1,  E2 ⊥ X1, E1 | X2,  X3 ⊥ X1, E1, E2 | X2,  E3 ⊥ X1, E1, X2, E2 | X3
gives us the expression posited on the previous slide:
P(X1, E1, X2, E2, X3, E3) = P(X1) P(E1 | X1) P(X2 | X1) P(E2 | X2) P(X3 | X2) P(E3 | X3)

Chain Rule and HMMs
X1 → X2 → X3, with E1, E2, E3 observed below each state
From the chain rule, every joint distribution over X1, E1, …, XT, ET can be written as:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) Π over t = 2…T of P(Xt | X1, E1, …, Xt-1, Et-1) P(Et | X1, E1, …, Xt-1, Et-1, Xt)
Assuming that for all t:
– the state is independent of all past states and all past evidence given the previous state, i.e.:
Xt ⊥ X1, E1, …, Xt-2, Et-2, Et-1 | Xt-1
– the evidence is independent of all past states and all past evidence given the current state, i.e.:
Et ⊥ X1, E1, …, Xt-2, Et-2, Xt-1, Et-1 | Xt
gives us the expression posited on the earlier slide:
P(X1, E1, …, XT, ET) = P(X1) P(E1 | X1) Π over t = 2…T of P(Xt | Xt-1) P(Et | Xt)

Inference on HMMs: Four Inference Tasks
• Filtering: identify the current state probability distribution given all evidence to date.
• Prediction
• Smoothing
• Most likely explanation

Filtering
Given the result of filtering up to time t, the agent can directly compute the result for t+1 from the new evidence e(t+1).

Filtering: Proof
[Derivation shown as figures on the original slide.]

Filtering
P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σ over xt of P(Xt+1 | xt) P(xt | e1:t)
(sensor model × transition model, then normalize)


Example: Umbrella & Rain
Given t=0, P(rain) = P(sun) = 0.5; the director takes an umbrella at t=1 and t=2. Let’s calculate P(rain1 | umbrella1) and P(rain2 | umbrella1, umbrella2).

Transition model:
                today rain   today sun
yesterday rain  0.7          0.3
yesterday sun   0.3          0.7

Sensor model:
      take umbrella   no umbrella
rain  0.9             0.1
sun   0.2             0.8

Example: Umbrella & Rain
81 © Daqing He

Example: Umbrella & Rain
82 © Daqing He
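A sketch of the forward (filtering) computation that reproduces those numbers: predict with the transition model, weight by the sensor model, and normalize at each step:

    P_trans = {('rain', 'rain'): 0.7, ('rain', 'sun'): 0.3,
               ('sun', 'rain'): 0.3, ('sun', 'sun'): 0.7}
    P_emit = {('rain', '+u'): 0.9, ('rain', '-u'): 0.1,
              ('sun', '+u'): 0.2, ('sun', '-u'): 0.8}
    STATES = ('rain', 'sun')

    def filter_step(belief, evidence):
        # Predict: push the belief through the transition model.
        predicted = {x: sum(P_trans[(prev, x)] * belief[prev] for prev in STATES)
                     for x in STATES}
        # Update: weight by the sensor model, then normalize.
        weighted = {x: P_emit[(x, evidence)] * predicted[x] for x in STATES}
        z = sum(weighted.values())
        return {x: p / z for x, p in weighted.items()}

    belief = {'rain': 0.5, 'sun': 0.5}   # P at t=0, as given
    for e in ['+u', '+u']:
        belief = filter_step(belief, e)
        print(belief)
    # Day 1: P(rain1 | u1) ≈ 0.818;  Day 2: P(rain2 | u1, u2) ≈ 0.883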

Filtering & Prediction
Filtering: to identify the current state probability distribution given all evidence to date.
Prediction: to identify a future state probability distribution given all evidence to date.

Prediction
Can be seen as filtering without the addition of new evidence:
P(Xt+k+1 | e1:t) = Σ over xt+k of P(Xt+k+1 | xt+k) P(xt+k | e1:t)

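A self-contained sketch of prediction on the umbrella model: keep applying the transition model with no evidence updates.

    P_trans = {('rain', 'rain'): 0.7, ('rain', 'sun'): 0.3,
               ('sun', 'rain'): 0.3, ('sun', 'sun'): 0.7}
    STATES = ('rain', 'sun')

    def predict(belief, k):
        # k transition-model steps with no evidence updates.
        for _ in range(k):
            belief = {x: sum(P_trans[(prev, x)] * belief[prev] for prev in STATES)
                      for x in STATES}
        return belief

    print(predict({'rain': 0.883, 'sun': 0.117}, 5))
    # With no new umbrella observations, the belief decays toward the
    # chain's stationary distribution (0.5 / 0.5 here, by symmetry).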

Summary
• Probability gives us a model for handling uncertainty
• Conditional probability enables us to perform inference based on evidence
• Time can be introduced as a factor for representing sequences of observations and states
– Directly modeling the transition of states => Markov Chain
– Inference based on observed evidence generated from hidden states => Hidden Markov Model

Slides for your information

Inference in temporal models
Learn the model: we will learn the transition and sensor models with EM in a future class.
Four inference tasks with the learned models: (1) Filtering, (2) Prediction, (3) Smoothing, (4) Most likely explanation.

Smoothing
Compute the probability distribution of a past state, given all evidence up to the present.
For example, given umbrella observations up to today, what is the probability that it rained last Wednesday?

forward-backward algorithm
[The algorithm’s equations were shown as figures on the original slides: the smoothed estimate combines a forward filtering pass with a backward pass over the later evidence.]
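Since those figures did not survive extraction, here is a hedged sketch of the forward-backward idea, reusing filter_step, P_trans, P_emit, and STATES from the filtering sketch earlier: the smoothed belief at step k is proportional to the forward message times a backward message accumulated from the evidence after k.

    def backward_step(b, evidence):
        # b(x) accumulates P(e_{k+1:t} | X_k = x), swept from t down to k.
        return {x: sum(P_trans[(x, nxt)] * P_emit[(nxt, evidence)] * b[nxt]
                       for nxt in STATES) for x in STATES}

    def smooth(prior, evidence_seq, k):
        f = prior                              # forward pass up to step k
        for e in evidence_seq[:k + 1]:
            f = filter_step(f, e)
        b = {x: 1.0 for x in STATES}           # backward pass from t down to k+1
        for e in reversed(evidence_seq[k + 1:]):
            b = backward_step(b, e)
        z = sum(f[x] * b[x] for x in STATES)   # combine and normalize
        return {x: f[x] * b[x] / z for x in STATES}

    print(smooth({'rain': 0.5, 'sun': 0.5}, ['+u', '+u'], 0))
    # P(rain1 | u1, u2) ≈ 0.883: day 2's umbrella sharpens the day-1 estimate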

Example: Umbrella & Rain
[The worked smoothing example was a figure: smoothing gives P(rain1 | u1, u2) ≈ 0.883, sharper than the filtered estimate P(rain1 | u1) ≈ 0.818.]


Most likely explanation
Given a sequence of observations, find the sequence of states that is most likely to have generated the observations.

Most likely explanation
A forward procedure:
m_t(x) = P(e_t | x) · max over x_{t-1} of [ P(x | x_{t-1}) m_{t-1}(x_{t-1}) ]
(sensor model × transition model × best score from the last step)
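A self-contained sketch of this recursion on the umbrella model (the standard Viterbi algorithm with back-pointers; treating the prior directly as P(X1) is my simplification):

    P_trans = {('rain', 'rain'): 0.7, ('rain', 'sun'): 0.3,
               ('sun', 'rain'): 0.3, ('sun', 'sun'): 0.7}
    P_emit = {('rain', '+u'): 0.9, ('rain', '-u'): 0.1,
              ('sun', '+u'): 0.2, ('sun', '-u'): 0.8}
    STATES = ('rain', 'sun')

    def viterbi(prior, evidence_seq):
        # m[x] = probability of the best state sequence ending in x.
        m = {x: prior[x] * P_emit[(x, evidence_seq[0])] for x in STATES}
        pointers = []
        for e in evidence_seq[1:]:
            back, new_m = {}, {}
            for x in STATES:
                prev = max(STATES, key=lambda xp: P_trans[(xp, x)] * m[xp])
                back[x] = prev
                new_m[x] = P_emit[(x, e)] * P_trans[(prev, x)] * m[prev]
            pointers.append(back)
            m = new_m
        # Follow the back-pointers from the best final state.
        path = [max(STATES, key=lambda x: m[x])]
        for back in reversed(pointers):
            path.append(back[path[-1]])
        return list(reversed(path))

    print(viterbi({'rain': 0.5, 'sun': 0.5}, ['+u', '+u', '-u', '+u', '+u']))
    # ['rain', 'rain', 'sun', 'rain', 'rain']: the no-umbrella day is best
    # explained by a break in the rain.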