MAST30001 Stochastic Modelling
Lecturer: Mark Holmes
Stochastic modelling
We study a random experiment in the context of a probability space (Ω, F, P). Here,
- Ω is the sample space: the set of all possible outcomes of our random experiment,
- F is a sigma-field: a collection (with certain properties) of subsets of Ω. We view these as the events we can see or measure,
- P is a probability measure: a function (with certain properties) defined on the elements of F.
The sample space Ω
The (nonempty) set of possible outcomes for the random experiment. Examples:
- coin tossing: {H, T}; {(H,H), (H,T), (T,H), (T,T)}; the set {H,T}^N of all infinite sequences of Hs and Ts.
- battery lifetime: [0, ∞).
- animal population: Z⁺.
- queue length over time: the set of piecewise-constant functions from [0, ∞) to Z⁺.
- infection status: {susceptible, infected, immune}^n (if n is the number of individuals).
- Facebook friend network: the set of simple networks with number of vertices equal to the number of users; edges connect friends.

Elements of Ω are often written as ω.
Review of basic notions of set theory
- A ⊂ B: A is a subset of B; if A occurs then B occurs.
- A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B} = B ∪ A. Union of sets (events): at least one occurs.
- A_1 ∪ A_2 ∪ ··· ∪ A_n = ⋃_{i=1}^n A_i.
- A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B} = B ∩ A. Intersection of sets (events): all occur.
- A_1 ∩ A_2 ∩ ··· ∩ A_n = ⋂_{i=1}^n A_i.
- A^c = {ω ∈ Ω : ω ∉ A}. Complement of a set/event: the event doesn't occur.
- ∅: the empty set or impossible event.
- A and B are disjoint (or mutually exclusive) if A ∩ B = ∅.
The sigma-field F
A sigma-field F on Ω is a collection of subsets of Ω such that
- Ω ∈ F, and
- if A ∈ F then A^c ∈ F, and
- if A_1, A_2, ... ∈ F then ⋃_{i=1}^∞ A_i ∈ F.

For countable sample spaces, F is typically the set of all subsets.
Example: toss a coin once; F = {∅, {H}, {T}, {H,T}}.
For uncountable sample spaces, the situation is more complicated – see later.
The probability measure P
A probability measure P on (Ω, F) is a set function from F to [0, 1] satisfying
P1. P(A) ≥ 0 for all A ∈ F [probabilities measure long-run frequencies or degrees of certainty],
P2. P(Ω) = 1 [there is a 100% chance that something happens],
P3. Countable additivity: if A_1, A_2, ... are disjoint events in F, then P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) [think about it in terms of frequencies].
How do we specify P?
The modelling process consists of
- defining the values of P(A) for some 'basic events' A ∈ F,
- deriving P(B) for the other, more complicated 'unknown' events B ∈ F from the axioms above.

Example: toss a fair coin 1000 times. Any particular sequence (of length 1000) of H's and T's has chance 2^{−1000}.
- What is the chance there are more than 600 H's in the sequence?
- What is the chance that the first time the proportion of heads exceeds the proportion of tails occurs after toss 20?
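The first question can be answered exactly, since each of the 2^1000 sequences is equally likely; a minimal sketch in Python (exact integer arithmetic):

```python
from fractions import Fraction
from math import comb

# P(more than 600 H's in 1000 tosses of a fair coin):
# count the favourable sequences and divide by 2^1000
p = Fraction(sum(comb(1000, k) for k in range(601, 1001)), 2**1000)
print(float(p))  # a very small tail probability
```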
Properties of P
- P(∅) = 0.
- P(A^c) = 1 − P(A).
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
- ...
- "Continuity":
  P(⋃_{i=1}^∞ A_i) = lim_{n→∞} P(⋃_{i=1}^n A_i),
  P(⋂_{i=1}^∞ A_i) = lim_{n→∞} P(⋂_{i=1}^n A_i).
Conditional probability
Let A, B ∈ F be events with P(B) > 0. Supposing we know that B occurred, how likely is A given that information? That is, what is the conditional probability P(A|B)?
We define
P(A|B) = P(A ∩ B) / P(B).
For a frequency interpretation, consider the situation where we have 10 tickets, numbered 1, 2, ..., 10, and we choose one ticket uniformly at random. Then Ω = {1, 2, ..., 10}.
Let A = {ω ∈ Ω : ω is even} = {2, 4, 6, 8, 10},
and B = {ω ∈ Ω : ω is a multiple of 3} = {3, 6, 9}.
Then P(A) = 5/10, P(B) = 3/10, P(A ∩ B) = P({6}) = 1/10.
According to the definition, P(A|B) = (1/10)/(3/10) = 1/3. This is nothing but the proportion of outcomes in B that are also in A.
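This calculation is easy to verify by enumeration; a minimal sketch:

```python
from fractions import Fraction

# Ticket example: Ω = {1, ..., 10}, one ticket chosen uniformly at random
omega = range(1, 11)
A = {w for w in omega if w % 2 == 0}  # even tickets
B = {w for w in omega if w % 3 == 0}  # multiples of 3

p_B = Fraction(len(B), 10)
p_AB = Fraction(len(A & B), 10)
print(p_AB / p_B)  # 1/3, matching P(A|B) = P(A ∩ B)/P(B)
```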
Example:
Tickets are drawn consecutively and without replacement from a box of tickets numbered 1, ..., 10. What is the chance that the second ticket is even-numbered, given that the first is
- even?
- labelled 3?
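A sketch of the brute-force answer, enumerating all 90 equally likely ordered draws (the helper `cond_prob` is ours, not standard):

```python
from fractions import Fraction
from itertools import permutations

# All ordered draws of two distinct tickets from {1, ..., 10}, equally likely
pairs = list(permutations(range(1, 11), 2))

def cond_prob(event, given):
    sub = [p for p in pairs if given(p)]
    return Fraction(sum(1 for p in sub if event(p)), len(sub))

second_even = lambda p: p[1] % 2 == 0
print(cond_prob(second_even, lambda p: p[0] % 2 == 0))  # 4/9
print(cond_prob(second_even, lambda p: p[0] == 3))      # 5/9
```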
Law of total probability
Let B_1, B_2, ..., B_n be disjoint events with P(B_i) > 0 and A ⊂ ⋃_{j=1}^n B_j. Then
P(A) = Σ_{j=1}^n P(A|B_j) P(B_j).
If also P(A) > 0, then
P(B_j|A) = P(B_j ∩ A) / P(A) = P(A|B_j) P(B_j) / Σ_{k=1}^n P(A|B_k) P(B_k).
Example:
A disease affects 1/1000 newborns, and shortly after birth each baby is screened for the disease using a cheap test that has a 2% false-positive rate (the test has no false negatives). If a baby tests positive, what is the chance it has the disease?
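Bayes' formula (previous slide) gives the answer directly; a sketch in exact arithmetic:

```python
from fractions import Fraction

# Bayes' rule for the screening example
p_disease = Fraction(1, 1000)
p_pos_given_disease = 1                 # no false negatives
p_pos_given_healthy = Fraction(2, 100)  # 2% false-positive rate

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 50/1049 ≈ 0.0477: still quite unlikely
```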
Independent events
Events A and B are said to be independent if
P(A ∩ B) = P(A) P(B).
If P(B) ≠ 0 then this is the same as P(A|B) = P(A).
Events A_1, ..., A_n are independent if, for each subset {i_1, ..., i_k} of {1, ..., n},
P(A_{i_1} ∩ ··· ∩ A_{i_k}) = P(A_{i_1}) × ··· × P(A_{i_k}).
This is a stronger requirement than each pair of events being independent.
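A standard example showing the requirement really is stronger: two fair coin tosses, with C the event that the tosses agree. A sketch by enumeration:

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses; all four outcomes equally likely
omega = list(product("HT", repeat=2))

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"   # first toss heads
B = lambda w: w[1] == "H"   # second toss heads
C = lambda w: w[0] == w[1]  # tosses agree

# Each pair is independent ...
for E, F in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: E(w) and F(w)) == prob(E) * prob(F)

# ... but the triple is not: P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C)
print(prob(lambda w: A(w) and B(w) and C(w)))  # 1/4
```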
A disconcerting example
- Let S be the circle of radius 1.
- We say two points on S are in the same family if you can get from one to the other by taking steps of arclength 1 around the circle.
- Each family chooses a single member to be its head.
- If X is a point chosen uniformly at random from the circle, what is the chance that X is the head of its family?
A disconcerting example
- A = {X is the head of its family}.
- A_i = {X is i steps clockwise from its family head}.
- B_i = {X is i steps counterclockwise from its family head}.
- By uniformity, we should have P(A) = P(A_i) = P(B_i), BUT
- by the law of total probability:
  1 = P(A) + Σ_{i=1}^∞ (P(A_i) + P(B_i)) !

The issue is that the set A is not one we can measure, so it should not be included in F.
These kinds of issues are technical to resolve and are dealt with in later probability or analysis subjects, which use measure theory.
Random variables
A random variable on a probability space (Ω, F, P) is a function X : Ω → R such that
{X ≤ b} := {ω ∈ Ω : X(ω) ≤ b} ∈ F for every b ∈ R.
(Often we want to talk about the probabilities that random variables take values in (−∞, b] = {x ∈ R : x ≤ b}. When we write P(X ≤ b), we mean the probability of the set {ω : X(ω) ≤ b}. In order for this to make sense, we need this set to be in F!)
(In measure theory, such a function is called a Borel-measurable function, after Émile Borel (1871–1956).)
Because F is a sigma-field, this tells us that events like {X > b}, {a < X < b}, etc. are also in F.
In this course we will also allow our random variables to take the value +∞ (or −∞). So in general they will be functions from Ω to R̄ = R ∪ {−∞, ∞}.
Indicator random variables
The most important examples of random variables are the indicator random variables.
Let A ∈ F be an event. Then the function 1_A : Ω → R (or I_A : Ω → R) given by
1_A(ω) = 1 if ω ∈ A, and 0 otherwise,
is called the indicator of A.
Exercise: show that 1_A is a random variable.
Distribution Functions
The function F_X(t) = P(X ≤ t) = P({ω : X(ω) ≤ t}), which maps R to [0, 1], is called the distribution function of the random variable X.
Any distribution function F
F1. is non-decreasing,
F2. satisfies F(x) → 0 as x → −∞ and F(x) → 1 as x → ∞,
F3. is 'right-continuous', that is, lim_{h→0+} F(t + h) = F(t) for all t.
Exercise: if (Ω, F, P) is a probability space and A ∈ F, find the distribution function of 1_A.
Distribution Functions
We say that a distribution function F is
- discrete if F only grows in jumps, i.e. there exist a countable set A ⊂ R and a function f mapping A to [0, 1] such that F(x) = Σ_{y∈A: y≤x} f(y),
- absolutely continuous if there exists a function f mapping R to R⁺ such that F(t) = ∫_{−∞}^t f(u) du.

A random variable X has a discrete distribution if F_X is discrete, and an (absolutely) continuous distribution if F_X is absolutely continuous.
The function f is called the probability mass function in the discrete case, and the probability density function in the absolutely continuous case. Note that there are distribution functions that are not even mixtures of the above (see e.g. the Cantor function). In other words, there are random variables whose distributions are not mixtures of discrete and absolutely continuous distributions!
Examples of distributions
- Examples of discrete random variables: binomial, Poisson, geometric, negative binomial, discrete uniform.
  http://en.wikipedia.org/wiki/Category:Discrete_distributions
- Examples of continuous random variables: normal, exponential, gamma, beta, uniform on an interval (a, b).
  http://en.wikipedia.org/wiki/Category:Continuous_distributions
Random Vectors
A random vector X = (X_1, ..., X_d) is a vector of random variables on the same probability space.
The distribution function F_X of X is
F_X(t) = P(X_1 ≤ t_1, ..., X_d ≤ t_d), t = (t_1, ..., t_d) ∈ R^d.
One can also write this as
F_X(t) = P(⋂_{i=1}^d {X_i ≤ t_i}).
It follows that
P(s_1 < X_1 ≤ t_1, s_2 < X_2 ≤ t_2) = F(t_1, t_2) − F(s_1, t_2) − F(t_1, s_2) + F(s_1, s_2).
Independent Random Variables
The random variables X_1, ..., X_d are called independent if
F_X(t) = F_{X_1}(t_1) × ··· × F_{X_d}(t_d) for all t = (t_1, ..., t_d).
This turns out to be equivalent to the statement that the events {X_1 ∈ I_1}, ..., {X_d ∈ I_d} are independent for all intervals I_1, ..., I_d.
If the random variables are all discrete, or the random vector is absolutely continuous (the latter is stronger than each of its coordinates being absolutely continuous), then this is equivalent to the relevant mass/density function f_X factorising as
f_X(t) = f_{X_1}(t_1) ··· f_{X_d}(t_d).
Expectation
The expectation (or expected value) of a random variable X is
E[X] = ∫_{−∞}^∞ x dF_X(x).
The integral on the right-hand side is a Lebesgue-Stieltjes integral. It can be evaluated as
E[X] = Σ_{i=1}^∞ x_i P(X = x_i) if X is discrete, and
E[X] = ∫_{−∞}^∞ x f_X(x) dx if X is absolutely continuous.
In second year, we required that the integral be absolutely convergent. In this course we will allow the expectation to be infinite, provided that we never get into a situation where we have '∞ − ∞'.
Expectation of g(X )
If X is a random variable, then for a measurable (nice) function g, Y = g(X) is a random variable, so by definition
E[Y] = ∫_{−∞}^∞ y dF_Y(y).
The law of the unconscious statistician says that also
E[Y] = E[g(X)] = ∫_{−∞}^∞ g(x) dF_X(x).
Properties of Expectation
- E[aX + bY] = a E[X] + b E[Y].
- If P(X ≤ Y) = 1, then E[X] ≤ E[Y].
- If P(X = c) = 1, then E[X] = c.
- ...
- If 0 ≤ X_n ↑ X pointwise in ω, then E[X_n] ↑ E[X].
- If X_i ≥ 0, then E[Σ_{i=1}^∞ X_i] = Σ_{i=1}^∞ E[X_i].
- If X ≥ 0, then E[X] = ∫_0^∞ P(X > t) dt = ∫_0^∞ P(X ≥ t) dt.
Moments
- The kth moment of X is E[X^k].
- The kth central moment of X is E[(X − E[X])^k].
- The variance Var(X) of X is the second central moment, which equals E[X²] − (E[X])².
- Var(aX + b) = a² Var(X).
- The covariance Cov(X, Y) of X and Y is E[(X − E[X])(Y − E[Y])].
- If X and Y have finite means and are independent, then E[XY] = E[X] E[Y].
- Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
Conditioning on random variables
Let X, Y be random variables on (Ω, F, P), and A ∈ F.
The conditional probability of the event A given a random variable X is written as P(A|X). The conditional expectation of Y given a random variable X is written as E[Y|X].
The official definitions are technical, but you should think of these as follows:
- P(A|X) is a function of X that takes the value P(A|X = x) on the event that X = x.
- E[Y|X] is a function of X that takes the value E[Y|X = x] on the event that X = x.

As X is a random variable, these two functions of X are also random variables.
Conditioning on random variables – technicality
A random variable has the property that {X ≤ x} ∈ F for every x ∈ R. It could be, however, that F contains a lot more than events regarding X. Let G_X denote the smallest σ-field on Ω such that X is a random variable on (Ω, G_X, P). Suppose that E[|Y|] < ∞. Then E[Y|X] is defined to be any random variable Z on (Ω, G_X, P) such that E[Z 1_B] = E[Y 1_B] for each B ∈ G_X.
- There is typically more than one random variable Z that satisfies this definition. Note that if Z and Z′ both satisfy this definition, then P(Z = Z′) = 1, so we generally don't care which version of E[Y|X] we are working with.
- P(A|X) is defined to be E[1_A|X].

Example: let D ∈ F, and let X = 1_D. Suppose that Y is a random variable with E[|Y|] < ∞. Then the random variable
E[Y|D] 1_D + E[Y|D^c] 1_{D^c}
satisfies the definition of E[Y|X].
Conditioning on discrete random variables - the punchline
If X has a discrete distribution (taking values x_1, x_2, ...) and E[|Y|] < ∞, then
E[Y|X] = Σ_i E[Y|X = x_i] 1_{X = x_i},
i.e. the right-hand side satisfies the definition of E[Y|X].
This is nothing but the statement that if η(x) = E[Y|X = x], then η(X) satisfies the definition of E[Y|X].
Conditional Distribution
The conditional distribution function F_{Y|X}(y|X) of Y evaluated at the real number y is given by P(Y ≤ y|X).
Properties of Conditional Expectation
The following hold with probability 1:
- Linearity: E[aY_1 + bY_2|X] = a E[Y_1|X] + b E[Y_2|X].
- Monotonicity: if P(Y_1 ≤ Y_2) = 1, then E[Y_1|X] ≤ E[Y_2|X].
- E[c|X] = c.
- E[E[Y|X]] = E[Y].
- For any nice (i.e. Borel-measurable) function g, E[g(X) Y|X] = g(X) E[Y|X].
- E[Y|X] is the function of X that is closest to Y in the mean-square sense. This means that E[(g(X) − Y)²] is minimised when g(X) = E[Y|X] (see Borovkov, page 57).
Exercise
Let Ω = {a, b, c, d}, and let F contain all subsets of Ω.
Let P be the probability measure satisfying
P({a}) = 1/2, P({b}) = P({c}) = 1/8, and P({d}) = 1/4.
Define random variables
Y(ω) = 1 if ω = a or b, and 0 if ω = c or d;
X(ω) = 2 if ω = a or c, and 5 if ω = b or d.
Compute E[X], E[X|Y] and E[E[X|Y]].
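The exercise can be checked by direct enumeration; a sketch in exact arithmetic (it also confirms E[E[X|Y]] = E[X]):

```python
from fractions import Fraction

# The four-point probability space from the exercise
P = {"a": Fraction(1, 2), "b": Fraction(1, 8),
     "c": Fraction(1, 8), "d": Fraction(1, 4)}
Y = {"a": 1, "b": 1, "c": 0, "d": 0}
X = {"a": 2, "b": 5, "c": 2, "d": 5}

E_X = sum(P[w] * X[w] for w in P)

def E_X_given(y):
    # E[X | Y = y]: average X over the event {Y = y}
    p_y = sum(P[w] for w in P if Y[w] == y)
    return sum(P[w] * X[w] for w in P if Y[w] == y) / p_y

# E[E[X|Y]] averages E[X|Y = y] over the distribution of Y
E_EXY = sum(E_X_given(y) * sum(P[w] for w in P if Y[w] == y) for y in (0, 1))

print(E_X, E_X_given(1), E_X_given(0), E_EXY)  # 25/8 13/5 4 25/8
```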
Example:
Suppose that the number of individuals M entering a bank in a given day has a Poisson(λ) distribution.
Suppose that individuals entering the bank each hold an Australian passport with probability p, independently of each other and of M.
Let N denote the number of individuals holding an Australian passport who enter the bank during that day.
- What is the distribution of N, given that M = m?
- Find E[N|M = m].
- Give an expression for E[N|M], simplifying where possible.
- Compute E[N].
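The standard answers (N given M = m is Binomial(m, p), so E[N|M] = pM and E[N] = λp) can be checked numerically; a sketch with the illustrative values λ = 4, p = 0.3:

```python
from math import comb, exp, factorial

lam, p = 4.0, 0.3  # illustrative values of λ and p

def poisson_pmf(k, mu):
    return exp(-mu) * mu**k / factorial(k)

def pmf_N(n, terms=120):
    # P(N = n) = Σ_m P(M = m) · P(Binomial(m, p) = n), truncated far in the tail
    return sum(poisson_pmf(m, lam) * comb(m, n) * p**n * (1 - p)**(m - n)
               for m in range(n, terms))

# Thinning a Poisson(λ) count with probability p gives a Poisson(λp) count
for n in range(10):
    assert abs(pmf_N(n) - poisson_pmf(n, lam * p)) < 1e-12

E_N = sum(n * pmf_N(n) for n in range(100))
print(round(E_N, 3))  # ≈ λp = 1.2
```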
Law of Large Numbers (Borovkov §2.9)
The Law of Large Numbers (LLN) states that if X_1, X_2, ... are independent and identically distributed with mean µ, then with probability 1,
X̄_n := (1/n) Σ_{j=1}^n X_j → µ
as n → ∞.
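A quick simulation sketch (Uniform(0, 1) summands, so µ = 1/2; the seed and sample size are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
xs = [random.random() for _ in range(n)]  # i.i.d. Uniform(0, 1), mean 1/2

xbar = sum(xs) / n
print(abs(xbar - 0.5))  # small: the sample mean is close to µ = 1/2
```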
Central Limit Theorem (Borovkov §2.9)
The Central Limit Theorem (CLT) states that if X_1, X_2, ... are independent and identically distributed with mean µ and variance σ² ∈ (0, ∞), then for any x,
P( (X̄_n − µ)/(σ/√n) ≤ x ) → Φ(x) ≡ ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt
as n → ∞.
That is, a suitably scaled deviation from the mean approaches a standard normal distribution as n → ∞.
(Note that, writing Z_n := (X̄_n − µ)/(σ/√n), this becomes
F_{Z_n}(x) → F_Z(x)
for each x, where Z ∼ N(0, 1).)
Limit Theorems (Borovkov §2.9)
The Poisson Limit Theorem states that if X_1, X_2, ... are independent Bernoulli random variables with P(X_i = 1) = 1 − P(X_i = 0) = p_i, then X_1 + X_2 + ··· + X_n is well approximated by a Poisson random variable with parameter λ_n = p_1 + ··· + p_n.
Specifically, with W_n = X_1 + X_2 + ··· + X_n, for each x ∈ R,
F_{W_n}(x) ≈ F_{Y_n}(x),
where Y_n ∼ Poisson(λ_n).
(There is, in fact, a bound on the accuracy of this approximation:
|P(W_n ∈ B) − P(Y_n ∈ B)| ≤ (Σ_{i=1}^n p_i²) / max(1, λ_n).)
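For equal p_i the bound is easy to check numerically; a sketch with the illustrative values n = 50, p_i = 0.02 (so λ_n = 1):

```python
from math import comb, exp, factorial

# W_n ~ Binomial(50, 0.02) (equal p_i), Y_n ~ Poisson(1)
n, p = 50, 0.02
lam = n * p

binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
poiss = [exp(-lam) * lam**k / factorial(k) for k in range(n + 1)]

# sup_B |P(W ∈ B) − P(Y ∈ B)| equals half the L1 distance between the pmfs
tv = 0.5 * sum(abs(b - q) for b, q in zip(binom, poiss))
bound = n * p**2 / max(1, lam)
print(tv <= bound)  # True: the approximation error respects the bound
```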
Example
Suppose there are three ethnic groups, A (20%), B (30%) and C
(50%), living in a city with a large population. Suppose 0.5%, 1%
and 1.5% of people in A, B and C respectively are over 200cm tall.
Suppose that we select at random 50, 50, and 200 people from A,
B, and C respectively. What is the probability that at least four of
the 300 people will be over 200cm tall?
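Here λ_n = 50(0.005) + 50(0.01) + 200(0.015) = 3.75, so the Poisson limit theorem suggests the approximation P(at least 4) ≈ 1 − P(Poisson(3.75) ≤ 3); a sketch:

```python
from math import exp, factorial

# Poisson approximation: the number of tall people is a sum of 300
# independent Bernoulli variables, with λ = Σ p_i = 3.75
lam = 50 * 0.005 + 50 * 0.01 + 200 * 0.015

p_at_least_4 = 1 - exp(-lam) * sum(lam**k / factorial(k) for k in range(4))
print(round(p_at_least_4, 4))  # ≈ 0.516
```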
Stochastic Processes (Borovkov §2.10)
A collection of random variables {X_t, t ∈ I} (or {X(t), t ∈ I}, or (X_t)_{t∈I}, ...) on a common probability space (Ω, F, P) is called a stochastic process. The index variable t is often called 'time'.
- If I = {0, 1, 2, ...} or {..., −2, −1, 0, 1, 2, ...}, the process is a discrete-time process.
- If I = R or [0, ∞), the process is a continuous-time process.
Examples of Stochastic Processes
Standard Brownian Motion is a very special Gaussian process (X_t)_{t∈[0,∞)} with X_0 = 0 such that
- for each t > s ≥ 0, X_t − X_s ∼ N(0, t − s),
- for any 0 ≤ s_1 < t_1 ≤ s_2 < ··· ≤ s_k < t_k, the increments X_{t_1} − X_{s_1}, ..., X_{t_k} − X_{s_k} are independent,
- (X_t)_{t≥0} is continuous (with probability 1) as a function of t.

[Figure: a simulated Brownian motion sample path.]
Examples of Stochastic Processes
If, e.g., X_t is the number of sales of an item up to time t, then (X_t)_{t≥0} is called a counting process.
Examples of Stochastic Processes
Xt is the number of people in a queue at time t.
Interpretations
- If both ω and t are fixed, then X_t(ω) is a real number.
- For a fixed t, the function X_t : Ω → R is a random variable.
- For a fixed ω, we can think of t as a variable, and X_·(ω) : I → R as a function (a realization, trajectory, or sample path).
- If we allow ω to vary, we get a collection of trajectories.
Interpretations
If X_t is a counting process:
- For fixed ω and t, X_t(ω) is a non-negative integer.
- For fixed ω, X_·(ω) is a non-decreasing step function of t.
- For fixed t, X_t is a non-negative integer-valued random variable.
- For s < t, X_t − X_s is the number of events that have occurred in the interval (s, t].

If X_t is the number of people in a queue at time t, then {X_t : t ≥ 0} is a stochastic process where, for each t, X_t is a non-negative integer-valued random variable, but it is NOT a counting process because, for fixed ω, X_t(ω) can decrease.
Finite-Dimensional Distributions
Knowing just the one-dimensional distributions (i.e. the distribution of X_t for each t) is not enough to describe a stochastic process.
To specify the complete distribution of a stochastic process (X_t)_{t∈I}, we need to know the finite-dimensional distributions, i.e. the family of joint distribution functions F_{t_1,t_2,...,t_k}(x_1, ..., x_k) of X_{t_1}, ..., X_{t_k} for all k ≥ 1 and t_1, ..., t_k ∈ I.
Discrete-Time Markov Chains
We are frequently interested in applications where we have a sequence X_1, X_2, ... of outputs (which we model as random variables) in discrete time. For example:
- DNA: A (adenine), C (cytosine), G (guanine), T (thymine).
- Texts: X_j takes values in some alphabet, for example {A, B, ..., Z, a, ...}.
- Developing and testing compression software.
- Cryptology: codes, encoding and decoding.
- Attributing manuscripts.
Independence?
Is it reasonable to assume that neighbouring letters are independent?
No. E.g. in English texts:
- a is very common, but aa is very rare.
- q is virtually always followed by u (and then another vowel).
The Markov Property
The Markov property embodies a natural first generalisation of the independence assumption. It assumes a kind of one-step dependence or memory. Specifically, for I = Z⁺ = {0, 1, 2, ...} (and discrete-valued) processes, the Markov property takes the form
P(X_{n+1} = y | X_n = x_n, X_{n−1} = x_{n−1}, ..., X_0 = x_0) = P(X_{n+1} = y | X_n = x_n)
for each n ∈ Z⁺.
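A simulation illustrates the idea: in a Markov chain, the one-step transition frequencies depend only on the current state. The two-state chain below, with P(next = 1 | current = 0) = 0.3 and P(next = 1 | current = 1) = 0.8, is made up for illustration:

```python
import random

random.seed(1)  # reproducible run

# Made-up two-state chain on {0, 1}
p_next_1 = {0: 0.3, 1: 0.8}  # P(next state = 1 | current state)

def step(state):
    return 1 if random.random() < p_next_1[state] else 0

# Simulate a long path and estimate P(X_{n+1} = 1 | X_n = 0) empirically
path = [0]
for _ in range(200_000):
    path.append(step(path[-1]))

from_0 = [(s, t) for s, t in zip(path, path[1:]) if s == 0]
est = sum(t for _, t in from_0) / len(from_0)
print(round(est, 2))  # close to 0.3, regardless of the earlier history
```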
Discrete stopping times
Let (X_t)_{t∈Z⁺} be a sequence of random variables. A random variable T taking values in Z⁺ ∪ {+∞} is called a stopping time with respect to (X_t)_{t∈Z⁺} if for each t ∈ Z⁺ there exists a non-random function h_t such that
1_{T≤t} = h_t(X_0, ..., X_t).
This says that we can determine whether or not T ≤ t knowing only the values of X_0, ..., X_t (i.e. knowing about the past and the present, but without knowing the future).
Example: first hitting time. For a sequence (X_t)_{t∈Z⁺}, let T(x) = inf{t ∈ Z⁺ : X_t = x} denote the first time that the sequence is equal to x. Then
1_{T(x)≤t} = Σ_{m=0}^t 1_{T(x)=m} = Σ_{m=0}^t 1_{X_m = x} Π_{i=0}^{m−1} 1_{X_i ≠ x},
so T(x) is a stopping time.
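The displayed identity can be verified by brute force over all short 0/1 paths; a sketch:

```python
from itertools import product

# Verify 1{T(x) <= t} = Σ_m 1{X_m = x} Π_{i<m} 1{X_i != x} on all 0/1 paths
x, t = 1, 4
for path in product([0, 1], repeat=t + 1):
    # direct definition: T(x) is the first index where the path equals x
    hit_times = [m for m, v in enumerate(path) if v == x]
    lhs = 1 if hit_times and hit_times[0] <= t else 0

    # the sum-of-products expression from the slide
    rhs = sum(
        (1 if path[m] == x else 0)
        * (1 if all(path[i] != x for i in range(m)) else 0)
        for m in range(t + 1)
    )
    assert lhs == rhs
print("identity verified on all paths of length", t + 1)
```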
Examples:
We have seen above that, for a sequence (X_t)_{t∈Z⁺}, the first hitting times T(x) = inf{t ∈ Z⁺ : X_t = x} are stopping times. The following are also stopping times:
- First strictly positive hitting times: T′(x) = inf{t ∈ N : X_t = x}.
- ith hitting times: T_1(x) = T(x), T_i(x) = inf{t > T_{i−1}(x) : X_t = x}.
- The maximum or minimum of two stopping times.
- A non-random time.

Something like T − 1 is not in general a stopping time. E.g. for a Bernoulli sequence, if T is the first time we see a 1 in the sequence, then T − 1 is not a stopping time. Why?
Continuous stopping times:
A stopping time for a continuous-time process (X_t)_{t≥0} is a random variable T (with values in [0, ∞]) such that for each t < ∞ there is a non-random (measurable) function h_t such that
1_{T≤t} = h_t((X_u)_{u∈[0,t]}).
As in the discrete setting, hitting times etc. are stopping times.