
MAST30001 Stochastic Modelling

Lecturer: Mark Holmes

1

Stochastic modelling

We study a random experiment in the context of a Probability
Space (Ω,F ,P). Here,
I Ω is the sample space – the set of all possible outcomes of our

random experiment,

I F is a sigma-field – a collection (with certain properties) of
subsets of Ω. We view these as events we can see or measure,

I P is a probability measure – a function (with certain
properties) defined on the elements of F .

2

The sample space Ω

The (nonempty) set of possible outcomes for the random
experiment. Examples:

I coin tossing: {H,T}, {(H,H), (H,T ), (T ,H), (T ,T )}, or the
set {H,T}^N of all infinite sequences of Hs and Ts.

I battery lifetime: [0,∞).
I animal population: Z+.
I queue length over time: the set of piecewise-constant
functions from [0,∞) to Z+.
I infection status: {susceptible, infected, immune}^n (if n is the
number of individuals).
I Facebook friend network: the set of simple networks whose
number of vertices equals the number of users; edges connect
friends.

Elements of Ω are often written as ω.

3

Review of basic notions of set theory

I A ⊂ B.
I A is a subset of B; if A occurs then B occurs.

I A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B} = B ∪ A.
I Union of sets (events): at least one occurs.
I A1 ∪ A2 ∪ · · · ∪ An = ⋃_{i=1}^n Ai .

I A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B} = B ∩ A.
I Intersection of sets (events): all occur.
I A1 ∩ A2 ∩ · · · ∩ An = ⋂_{i=1}^n Ai .

I Ac = {ω ∈ Ω : ω ∉ A}.
I Complement of a set/event: the event doesn’t occur.

I ∅: the empty set, or impossible event.
I A and B are disjoint (or mutually exclusive) if A ∩ B = ∅.

4

The sigma-field F

A sigma-field F on Ω is a collection of subsets of Ω such that
I Ω ∈ F , and
I if A ∈ F then Ac ∈ F , and
I if A1,A2, · · · ∈ F then ⋃_{i=1}^∞ Ai ∈ F .

I For countable sample spaces, F is typically the set of all
subsets.
Example: Toss a coin once, F = {∅, {H}, {T}, {H,T}}.
I For uncountable sample spaces, the situation is more
complicated – see later.
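As a quick sanity check, the coin-toss example can be verified mechanically; this is just an illustrative Python sketch (the names Omega and F are ours):

```python
# Check directly that F = {∅, {H}, {T}, {H,T}} satisfies the sigma-field
# axioms: Omega is in F, F is closed under complements, and closed under
# unions (finite unions suffice here since F itself is finite).
Omega = frozenset("HT")
F = {frozenset(), frozenset("H"), frozenset("T"), Omega}

assert Omega in F                              # contains the sample space
assert all(Omega - A in F for A in F)          # closed under complement
assert all(A | B in F for A in F for B in F)   # closed under union
print("F is a sigma-field on", sorted(Omega))
```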

5

The probability measure P

A probability measure P on (Ω,F) is a set function from F to
[0, 1] satisfying

P1. P(A) ≥ 0 for all A ∈ F [probabilities measure long run %’s or
certainty]

P2. P(Ω) = 1 [There is a 100% chance something happens]
P3. Countable additivity: if A1,A2, · · · are disjoint events in F ,
then P(⋃_{i=1}^∞ Ai ) = ∑_{i=1}^∞ P(Ai ) [Think about it in
terms of frequencies]

6

How do we specify P?

The modelling process consists of

I defining the values of P(A) for some ‘basic events’ A ∈ F ,
I deriving P(B) for the other ‘unknown’, more complicated
events B ∈ F from the axioms above.
Example: Toss a fair coin 1000 times. Any particular sequence (of
length 1000) of H’s and T’s has chance 2^−1000.

I What is the chance there are more than 600 H’s in the
sequence?

I What is the chance the first time the proportion of heads
exceeds the proportion of tails occurs after toss 20?
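Neither answer is on the slide; as a sketch, the first probability can be computed exactly from the binomial distribution, and the second estimated by Monte Carlo (the trial count and seed below are arbitrary choices of ours):

```python
from math import comb
import random

# Exact: P(more than 600 heads) = sum_{k=601}^{1000} C(1000,k) / 2^1000.
p_many_heads = sum(comb(1000, k) for k in range(601, 1001)) / 2**1000
print(f"P(more than 600 heads) = {p_many_heads:.2e}")  # on the order of 1e-10

# Monte Carlo sketch for the second question: when does the number of heads
# first strictly exceed the number of tails?
random.seed(0)

def first_lead_after_20(n_tosses=1000):
    lead = 0                                   # (#heads - #tails) so far
    for t in range(1, n_tosses + 1):
        lead += 1 if random.random() < 0.5 else -1
        if lead > 0:
            return t > 20                      # first lead: was it after toss 20?
    return True                                # no lead within n_tosses

trials = 20_000
frac = sum(first_lead_after_20() for _ in range(trials)) / trials
print(f"estimated P(first lead after toss 20) = {frac:.3f}")
```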

7

Properties of P

I P(∅) = 0.
I P(Ac) = 1− P(A).
I P(A ∪ B) = P(A) + P(B)− P(A ∩ B)
I . . .

I “Continuity”:

P(⋃_{i=1}^∞ Ai ) = lim_{n→∞} P(⋃_{i=1}^n Ai )

P(⋂_{i=1}^∞ Ai ) = lim_{n→∞} P(⋂_{i=1}^n Ai )

8

Conditional probability

Let A,B ∈ F be events with P(B) > 0. Supposing we know that
B occurred, how likely is A given that information? That is, what
is the conditional probability P(A|B)?
We define

P(A|B) = P(A ∩ B) / P(B).

For a frequency interpretation, consider the situation where we
have 10 tickets, numbered 1, 2, . . . , 10, and we choose one ticket
uniformly at random. Then Ω = {1, 2, . . . , 10}.
Let A = {ω ∈ Ω : ω is even} = {2, 4, 6, 8, 10},
and B = {ω ∈ Ω : ω is a multiple of 3} = {3, 6, 9}.
Then P(A) = 5/10, P(B) = 3/10, P(A ∩ B) = P({6}) = 1/10.
According to the definition, P(A|B) = (1/10)/(3/10) = 1/3. This
is nothing but the proportion of outcomes in B that are also in A.
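The same numbers can be recovered by brute-force enumeration of Ω; a small Python sketch (variable names ours):

```python
# Tickets 1..10 chosen uniformly at random; compute P(A|B) = P(A ∩ B)/P(B).
omega = range(1, 11)
A = {w for w in omega if w % 2 == 0}        # even ticket
B = {w for w in omega if w % 3 == 0}        # multiple of 3
p = lambda event: len(event) / 10           # uniform measure

p_A_given_B = p(A & B) / p(B)
print(p_A_given_B)  # ≈ 0.3333, i.e. 1/3: the proportion of B also in A
```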

9

Example:

Tickets are drawn consecutively and without replacement from a
box of tickets numbered 1, . . . , 10. What is the chance the second
ticket is even numbered given the first is

I even?

I labelled 3?

10

Law of total probability

Let B1,B2, · · · ,Bn be disjoint events with P(Bi ) > 0 and
A ⊂ ⋃_{j=1}^n Bj . Then

P(A) = ∑_{j=1}^n P(A|Bj)P(Bj).

If also P(A) > 0 then

P(Bj |A) = P(Bj ∩ A) / P(A) = P(A|Bj)P(Bj) / ∑_{k=1}^n P(A|Bk)P(Bk).

11

Example:

A disease affects 1/1000 newborns and shortly after birth a baby is
screened for this disease using a cheap test that has a 2% false
positive rate (the test has no false negatives). If the baby tests
positive, what is the chance it has the disease?
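Working this through with the law of total probability and Bayes’ rule (a sketch; the numbers are exactly those in the problem statement):

```python
# Condition on disease status: B1 = {diseased}, B2 = {healthy}, A = {tests positive}.
p_disease = 1 / 1000
p_pos_given_disease = 1.0      # the test has no false negatives
p_pos_given_healthy = 0.02     # 2% false-positive rate

# Law of total probability, then Bayes' rule.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.4f}")  # 0.0477: under 5%, despite the positive test
```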

12

Independent events

Events A and B are said to be independent if
P(A ∩ B) = P(A)P(B).
If P(B) ≠ 0 then this is the same as P(A|B) = P(A).
Events A1, · · · ,An are independent if, for each subset {i1, …, ik} of
{1, …, n},

P(Ai1 ∩ · · · ∩ Aik ) = P(Ai1)× · · · × P(Aik ).

This is a stronger requirement than each pair of events being
independent.
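A classic two-coin illustration (ours, not from the slides) of why the requirement is strictly stronger: three events that are pairwise independent but not independent.

```python
from itertools import product

# Toss two fair coins: 4 equally likely outcomes.
omega = set(product("HT", repeat=2))
p = lambda E: len(E) / len(omega)

A = {w for w in omega if w[0] == "H"}    # first toss is a head
B = {w for w in omega if w[1] == "H"}    # second toss is a head
C = {w for w in omega if w[0] == w[1]}   # the two tosses agree

# Every pair factorises...
assert p(A & B) == p(A) * p(B)
assert p(A & C) == p(A) * p(C)
assert p(B & C) == p(B) * p(C)
# ...but the triple does not: P(A ∩ B ∩ C) = 1/4, not 1/8.
print(p(A & B & C), p(A) * p(B) * p(C))  # 0.25 0.125
```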

13

A disconcerting example

I Let S be the circle of radius 1.

I We say two points on S are in the same family if you can get
from one to the other by taking steps of arclength 1 around
the circle.

I Each family chooses a single member to be head.

I If X is a point chosen uniformly at random from the circle,
what is the chance X is the head of its family?

14

A disconcerting example

I A = {X is head of its family}.
I Ai = {X is i steps clockwise from its family head}.
I Bi = {X is i steps counterclockwise from its family head}.
I By uniformity, should have P(A) = P(Ai ) = P(Bi ), BUT
I law of total probability:

1 = P(A) + ∑_{i=1}^∞ (P(Ai ) + P(Bi )) !

The issue is that the set A is not one we can measure, so it should
not be included in F .

These kinds of issues are technical to resolve and are dealt with in
later probability or analysis subjects which use measure theory.

15

Random variables

A random variable on a probability space (Ω,F ,P) is a function
X : Ω → R such that

{X ≤ b} := {ω ∈ Ω : X (ω) ≤ b} ∈ F for every b ∈ R.

(Often we want to talk about the probabilities that random
variables take values in (−∞, b] = {x ∈ R : x ≤ b}. When we write
P(X ≤ b), we mean the probability of the set {ω : X (ω) ≤ b}. In
order for this to make sense, we need this set to be in F !)

(In measure theory, such a function is called Borel measurable,
after Émile Borel (1871–1956).)

Because F is a sigma-field, this tells us that things like {X > b},
{a < X < b} etc. are also in F .

In this course we will also allow our random variables to take the
value +∞ (or −∞). So in general they will be functions from Ω to
R̄ = R ∪ {−∞,∞}.

16

Indicator random variables

The most important examples of random variables are the
indicator random variables: Let A ∈ F be an event. Then the
function 1A : Ω → R (or IA : Ω → R) given by

1A(ω) = 1 if ω ∈ A, and 0 otherwise,

is called the indicator of A.
Exercise: show that 1A is a random variable.

17

Distribution Functions

The function FX (t) = P(X ≤ t) = P({ω : X (ω) ≤ t}) that maps
R to [0, 1] is called the distribution function of the random
variable X .
Any distribution function F
F1. is non-decreasing,
F2. is such that F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞,
F3. is ‘right-continuous’, that is lim_{h→0+} F (t + h) = F (t) for all t.
Exercise: If (Ω,F ,P) is a probability space and A ∈ F , find the
distribution function of 1A.

18

Distribution Functions

We say that a distribution function F is
I discrete if F only grows in jumps, i.e. there exists a countable
set A ⊂ R and a function f mapping A to [0, 1] such that
F (x) = ∑_{y∈A:y≤x} f (y).
I absolutely continuous if there exists a function f that maps R
to R+ such that F (t) = ∫_{−∞}^t f (u)du.
A random variable X has a discrete distribution if FX is discrete,
and an (absolutely) continuous distribution if FX is absolutely
continuous. The function f is called the probability mass function
in the discrete case, and the probability density function in the
absolutely continuous case.
Note that there are distribution functions that are not even
mixtures of the above (see e.g. the Cantor function). In other
words, there are random variables whose distributions are not
mixtures of discrete and absolutely continuous distributions!
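Returning to the exercise above: the distribution function of 1A is a pure-jump (discrete) distribution function. A sketch with an assumed value p = P(A) (the choice p = 0.3 is arbitrary):

```python
# F(t) = P(1_A <= t): 0 for t < 0, then jumps by P(A^c) at 0 and by P(A) at 1.
def indicator_cdf(t, p):
    if t < 0:
        return 0.0
    if t < 1:
        return 1.0 - p      # P(1_A <= t) = P(A^c) for 0 <= t < 1
    return 1.0

p = 0.3
for t in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(t, indicator_cdf(t, p))
# Non-decreasing, right-continuous, tends to 0 and 1: F1-F3 hold.
```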
19

Examples of distributions

I Examples of discrete random variables: binomial, Poisson,
geometric, negative binomial, discrete uniform
http://en.wikipedia.org/wiki/Category:Discrete_distributions
I Examples of continuous random variables: normal, exponential,
gamma, beta, uniform on an interval (a, b)
http://en.wikipedia.org/wiki/Category:Continuous_distributions

20

Random Vectors

A random vector X = (X1, ...,Xd) is a vector of random variables
on the same probability space.
The distribution function FX of X is

FX(t) = P(X1 ≤ t1, · · · ,Xd ≤ td), t = (t1, · · · , td) ∈ Rd .

One can also write this as FX(t) = P(⋂_{i=1}^d {Xi ≤ ti}).
It follows that

P(s1 < X1 ≤ t1, s2 < X2 ≤ t2)
= F (t1, t2) − F (s1, t2) − F (t1, s2) + F (s1, s2).

21

Independent Random Variables

The random variables X1, · · · ,Xd are called independent if

FX(t) = FX1(t1) × · · · × FXd (td) for all t = (t1, · · · , td).

This turns out to be equivalent to the statement that the events
{X1 ∈ I1}, · · · , {Xd ∈ Id} are independent for all intervals
I1, . . . , Id .
If the random variables are all discrete, or the random vector is
absolutely continuous (the latter is stronger than each of its
coordinates being absolutely continuous), then this is equivalent to
the relevant mass/density function fX factorising as

fX(t) = fX1(t1) · · · fXd (td).

22

Expectation

The expectation (or expected value) of a random variable X is

E[X ] = ∫_{−∞}^∞ x dFX (x).

The integral on the right hand side is a Lebesgue-Stieltjes integral.
It can be evaluated as

E[X ] = ∑_{i=1}^∞ xi P(X = xi ) if X is discrete, and
E[X ] = ∫_{−∞}^∞ x fX (x)dx if X is absolutely continuous.

In second year, we required that the integral be absolutely
convergent.
In this course we will allow the expectation to be infinite, provided
that we never get in a situation where we have ‘∞−∞’.

23

Expectation of g(X )

If X is a random variable then for a measurable (nice) function g ,
Y = g(X ) is a random variable, so by definition

E[Y ] = ∫_{−∞}^∞ y dFY (y).

The law of the unconscious statistician says that also

E[Y ] = E[g(X )] = ∫_{−∞}^∞ g(x) dFX (x).

24

Properties of Expectation

I E[aX + bY ] = aE[X ] + bE[Y ].
I If P(X ≤ Y ) = 1, then E[X ] ≤ E[Y ].
I If P(X = c) = 1, then E[X ] = c .
I ...
I If 0 ≤ Xn ↑ X pointwise in ω then E[Xn] ↑ E[X ].
I If Xi ≥ 0 then E[∑_{i=1}^∞ Xi ] = ∑_{i=1}^∞ E[Xi ].
I If X ≥ 0 then E[X ] = ∫_0^∞ P(X > t)dt = ∫_0^∞ P(X ≥ t)dt.
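The tail formula in the last bullet is easy to check numerically; a sketch for X ~ Exp(2), where P(X > t) = e^−2t and E[X] = 1/2 (rate, step size and cutoff are arbitrary choices):

```python
from math import exp

# Approximate E[X] = ∫_0^∞ P(X > t) dt for X ~ Exponential(rate) by a
# left Riemann sum on [0, upper]; the truncated tail is negligible here.
rate = 2.0
dt = 1e-4
upper = 20.0
integral = sum(exp(-rate * k * dt) for k in range(int(upper / dt))) * dt
print(integral)  # ≈ 0.5 = 1/rate = E[X]
```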

25

Moments

I The kth moment of X is E[X^k ].
I The kth central moment of X is E[(X − E[X ])^k ].
I The variance Var(X ) of X is the second central moment:
Var(X ) = E[(X − E[X ])^2] = E[X^2] − (E[X ])^2.
I Var(aX + b) = a^2 Var(X ).

I The covariance Cov(X ,Y ) of X and Y is
E[(X − E[X ])(Y − E[Y ])].

I If X and Y have finite means and are independent, then
E[XY ] = E[X ]E[Y ].

I Var(X + Y ) = Var(X ) + Var(Y ) + 2Cov(X ,Y ).
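The last identity can be verified on any small joint distribution; the 2×2 joint pmf below is made up purely for illustration:

```python
# Joint pmf of (X, Y) on {0,1}^2 (values chosen arbitrarily, summing to 1).
pmf = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}

E = lambda g: sum(g(x, y) * p for (x, y), p in pmf.items())
EX, EY = E(lambda x, y: x), E(lambda x, y: y)

var_X = E(lambda x, y: (x - EX) ** 2)
var_Y = E(lambda x, y: (y - EY) ** 2)
cov = E(lambda x, y: (x - EX) * (y - EY))
var_sum = E(lambda x, y: (x + y - EX - EY) ** 2)   # Var(X + Y) directly

assert abs(var_sum - (var_X + var_Y + 2 * cov)) < 1e-12
print(var_X, var_Y, cov, var_sum)
```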

26

Conditioning on random variables

Let X ,Y be random variables on (Ω,F ,P), and A ∈ F .
The conditional probability of event A given a random variable X
is written as P(A|X ). The conditional expectation of Y given a
random variable X is written as E[Y |X ].
The official definitions are technical, but you should think of these
as follows:

I P(A|X ) is a function of X that takes the value P(A|X = x) on
the event that X = x .

I E[Y |X ] is a function of X that takes the value E[Y |X = x ]
on the event that X = x .

As X is a random variable, these two functions of X are also
random variables.
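A toy illustration (ours, not from the slides) of E[Y |X ] as a random variable: roll a fair die, let Y be the face value and X = 1{face ≥ 4}.

```python
from fractions import Fraction

omega = range(1, 7)
P = Fraction(1, 6)                 # uniform probability of each outcome
X = lambda w: int(w >= 4)
Y = lambda w: w

def E_Y_given_X_eq(x):
    # E[Y | X = x]: average Y over the event {X = x}.
    ws = [w for w in omega if X(w) == x]
    return Fraction(sum(Y(w) for w in ws), len(ws))

E_Y = sum(Y(w) * P for w in omega)
# E[Y|X] takes the value E[Y|X = x] on {X = x}; averaging it over omega
# recovers E[Y] (the tower property E[E[Y|X]] = E[Y]).
tower = sum(E_Y_given_X_eq(X(w)) * P for w in omega)
print(E_Y_given_X_eq(0), E_Y_given_X_eq(1), E_Y, tower)  # 2 5 7/2 7/2
```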

27

Conditioning on random variables – technicality
A random variable has the property that {X ≤ x} ∈ F for every
x ∈ R. It could be however that F contains a lot more stuff than
events regarding X . Let GX denote the smallest σ-field on Ω such
that X is a random variable on (Ω,GX ,P). Suppose that
E[|Y |] < ∞. Then E[Y |X ] is defined to be any random variable Z
on (Ω,GX ,P) such that

E[Z1B ] = E[Y1B ] for each B ∈ GX .

I There is typically more than one random variable Z that
satisfies this definition. Note that if Z and Z′ both satisfy this
definition then P(Z = Z′) = 1, so we generally don’t care
which version of E[Y |X ] we are working with.
I P(A|X ) is defined to be E[1A|X ].
Example: let D ∈ F , and let X = 1D . Suppose that Y is a random
variable with E[|Y |] < ∞. Then the random variable

E[Y |D]1D + E[Y |Dc ]1Dc

satisfies the definition of E[Y |X ].

28

Conditioning on discrete random variables - the punchline

If X has a discrete distribution (taking values x1, x2, . . . ) and
E[|Y |] < ∞, then

E[Y |X ] = ∑_i E[Y |X = xi ] 1{X=xi},

i.e. the right hand side satisfies the definition of E[Y |X ].
This is nothing but the statement that if η(x) = E[Y |X = x ], then
η(X ) satisfies the definition of E[Y |X ].

29

Conditional Distribution

The conditional distribution function FY |X (y |X ) of Y evaluated at
the real number y is given by P(Y ≤ y |X ).

30

Properties of Conditional Expectation

The following hold with probability 1:
I Linearity: E[aY1 + bY2|X ] = aE[Y1|X ] + bE[Y2|X ],
I Monotonicity: If P(Y1 ≤ Y2) = 1, then E[Y1|X ] ≤ E[Y2|X ],
I E[c|X ] = c ,
I E[E[Y |X ]] = E[Y ],
I For any nice (i.e. Borel measurable) function g ,
E[g(X )Y |X ] = g(X )E[Y |X ],
I E[Y |X ] is the function of X that is closest to Y in the mean
square sense. This means that E[(g(X ) − Y )^2] is minimised
when g(X ) = E[Y |X ] (see Borovkov, page 57).

31

Exercise

Let Ω = {a, b, c, d}, and let F contain all subsets of Ω. Let P be
the probability measure satisfying P({a}) = 1/2,
P({b}) = P({c}) = 1/8 and P({d}) = 1/4.
Define random variables

Y (ω) = 1 if ω = a or b, and 0 if ω = c or d;
X (ω) = 2 if ω = a or c, and 5 if ω = b or d.

Compute E[X ], E[X |Y ] and E[E[X |Y ]].

32

Example:

Suppose that the number of individuals M entering a bank in a
given day has a Poisson(λ) distribution.
Suppose that individuals entering the bank each hold an
Australian passport with probability p, independently of each other
and of M. Let N denote the number of individuals holding an
Australian passport who enter the bank during that day.
I What is the distribution of N, given that M = m?
I Find E[N|M = m].
I Give an expression for E[N|M], simplifying where possible.
I Compute E[N].

33

Law of Large Numbers (Borovkov §2.9)

The Law of Large Numbers (LLN) states that if X1, X2, · · · are
independent and identically-distributed with mean µ, then with
probability 1

X̄n := (1/n) ∑_{j=1}^n Xj → µ as n → ∞.

34

Central Limit Theorem (Borovkov §2.9)

The Central Limit Theorem (CLT) states that if X1, X2, · · · are
independent and identically-distributed with mean µ and variance
σ^2 ∈ (0,∞), then for any x ,

P( (X̄n − µ)/(σ/√n) ≤ x ) → Φ(x) ≡ ∫_{−∞}^x (1/√(2π)) e^{−t^2/2} dt

as n → ∞. That is, a suitably-scaled variation from the mean
approaches a standard normal distribution as n → ∞.
(Note that writing Zn := (X̄n − µ)/(σ/√n), this becomes
FZn(x) → FZ (x) for each x , where Z ∼ N (0, 1).)

35

Limit Theorems (Borovkov §2.9)

The Poisson Limit Theorem states that if X1, X2, · · · are
independent Bernoulli random variables with
P(Xi = 1) = 1 − P(Xi = 0) = pi , then X1 + X2 + · · · + Xn is
well-approximated by a Poisson random variable with parameter
λn = p1 + · · · + pn. Specifically, with Wn = X1 + X2 + · · · + Xn,
then, for each x ∈ R,

FWn(x) ≈ FYn(x) where Yn ∼ Poisson(λn).

(There is, in fact, a bound on the accuracy of this approximation:

|P(Wn ∈ B) − P(Yn ∈ B)| ≤ (∑_{i=1}^n pi^2) / max(1, λn).)

36

Example

Suppose there are three ethnic groups, A (20%), B (30%) and C
(50%), living in a city with a large population. Suppose 0.5%, 1%
and 1.5% of people in A, B and C respectively are over 200cm tall.
Suppose that we select at random 50, 50, and 200 people from A,
B, and C respectively. What is the probability that at least four of
the 300 people will be over 200cm tall?
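One route into this example is the Poisson limit theorem above: the 300 people give 300 independent indicator variables, with λn = 50(0.005) + 50(0.01) + 200(0.015) = 3.75, so the number of tall people is approximately Poisson(3.75). A sketch:

```python
from math import exp, factorial

# Poisson approximation to the number of people over 200cm in the sample.
lam = 50 * 0.005 + 50 * 0.01 + 200 * 0.015   # = 3.75

poisson_pmf = lambda k: exp(-lam) * lam**k / factorial(k)
p_at_least_4 = 1 - sum(poisson_pmf(k) for k in range(4))
print(f"P(at least 4 tall) ≈ {p_at_least_4:.3f}")  # just over 1/2
```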
37

Stochastic Processes (Borovkov §2.10)

A collection of random variables {Xt , t ∈ I} (or {X (t), t ∈ I}, or
(Xt)t∈I ...) on a common probability space (Ω,F ,P) is called a
stochastic process. The index variable t is often called ‘time’.
I If I = {0, 1, 2, · · · } or {· · · ,−2,−1, 0, 1, 2, · · · }, the process is
a discrete time process.
I If I = R or [0,∞), the process is a continuous time process.

38

Examples of Stochastic Processes

Standard Brownian Motion is a very special Gaussian process
(Xt)t∈[0,∞) where X0 = 0,
I For each t > s ≥ 0, Xt − Xs ∼ N (0, t − s),
I For any 0 ≤ s1 < t1 ≤ s2 < · · · ≤ sk < tk , the increments
Xt1 − Xs1 , . . . ,Xtk − Xsk are independent.
I (Xt)t≥0 is continuous (with probability 1) as a function of t.
[Figure: a simulated Brownian motion sample path.]

39

Examples of Stochastic Processes

If e.g. Xt is the number of sales of an item up to time t, then
(Xt)t≥0 is called a counting process.

40

Examples of Stochastic Processes

Xt is the number of people in a queue at time t.

41

Interpretations

I If both ω and t are fixed, then Xt(ω) is a real number.
I For a fixed t, the function Xt : Ω → R is a random variable.
I For a fixed ω, we can think of t as a variable, and
X·(ω) : I → R as a function (realization, trajectory, sample
path).
I If we allow ω to vary, we get a collection of trajectories.

42

Interpretations

If Xt is a counting process:
I For fixed ω and t, Xt(ω) is a non-negative integer.
I For fixed ω, X·(ω) is a non-decreasing step function of t.
I For fixed t, Xt is a non-negative integer-valued random
variable.
I For s < t, Xt − Xs is the number of events that have occurred
in the interval (s, t].
If Xt is the number of people in a queue at time t, then
{Xt : t ≥ 0} is a stochastic process where, for each t, Xt is a
non-negative integer-valued random variable, but it is NOT a
counting process because, for fixed ω, Xt(ω) can decrease.

43

Finite-Dimensional Distributions

Knowing just the one-dimensional distributions (i.e. the
distribution of Xt for each t) is not enough to describe a
stochastic process.
To specify the complete distribution of a stochastic process
(Xt)t∈I , we need to know the finite-dimensional distributions, i.e.
the family of joint distribution functions Ft1,t2,··· ,tk (x1, · · · , xk) of
Xt1 , · · · ,Xtk for all k ≥ 1 and t1, · · · , tk ∈ I .

44

Discrete-Time Markov Chains

We are frequently interested in applications where we have a
sequence X1,X2, · · · of outputs (which we model as random
variables) in discrete time.
For example,
I DNA: A (adenine), C (cytosine), G (guanine), T (thymine).
I Texts: Xj takes values in some alphabet, for example
{A,B, · · · ,Z , a, · · · }.
I Developing and testing compression software.
I Cryptology: codes, encoding and decoding.
I Attributing manuscripts.

45

Independence?

Is it reasonable to assume that neighbouring letters are
independent? No, e.g. in English texts:
I a is very common, but aa is very rare.
I q is virtually always followed by u (and then another vowel).

46

The Markov Property

The Markov property embodies a natural first generalisation of the
independence assumption. It assumes a kind of one-step
dependence or memory. Specifically, for I = Z+ = {0, 1, 2, . . . }
(and discrete-valued) processes the Markov property takes the form

P(Xn+1 = y |Xn = xn,Xn−1 = xn−1, · · · ,X0 = x0)
= P(Xn+1 = y |Xn = xn)

for each n ∈ Z+.

47

Discrete stopping times

Let (Xt)t∈Z+ be a sequence of random variables. A random
variable T taking values in Z+ ∪ {+∞} is called a stopping time
with respect to (Xt)t∈Z+ if for each t ∈ Z+, there exists a
non-random function ht such that

1{T≤t} = ht(X0, . . . ,Xt).

This says that we can determine whether or not T ≤ t, knowing
only the values of X0, . . . ,Xt (i.e. knowing about the past and the
present but without knowing the future).
Example: First hitting time. For a sequence (Xt)t∈Z+ , let
T (x) = inf{t ∈ Z+ : Xt = x} denote the first time that the
sequence is equal to x . Then

1{T (x)≤t} = ∑_{m=0}^t 1{T (x)=m} = ∑_{m=0}^t 1{Xm=x} ∏_{i=0}^{m−1} 1{Xi≠x},

so T (x) is a stopping time.

48

Examples:

We have seen above that for a sequence (Xt)t∈Z+ , the first hitting
times T (x) = inf{t ∈ Z+ : Xt = x} are stopping times. The
following are also stopping times:
I First strictly positive hitting times:
T ′(x) = inf{t ∈ N : Xt = x}.
I ith hitting times: T1(x) = T (x),
Ti (x) = inf{t > Ti−1(x) : Xt = x}.

I The maximum or minimum of two stopping times.

I A non-random time.

Something like T − 1 is not in general a stopping time. E.g. for a
Bernoulli sequence, if T is the first time we see a 1 in the sequence
then T − 1 is not a stopping time. Why?
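The first-hitting-time identity can be checked mechanically on simulated paths; here a simple random walk serves purely as an illustrative sequence:

```python
import random

random.seed(1)

def sample_path(n):
    # A simple random walk started at 0 (an arbitrary example sequence).
    path = [0]
    for _ in range(n):
        path.append(path[-1] + random.choice((-1, 1)))
    return path

def hit_by(path, x, t):
    # Direct evaluation of 1{T(x) <= t}: has the path hit x by time t?
    return int(any(path[m] == x for m in range(t + 1)))

def hit_by_formula(path, x, t):
    # The slide's expression: sum_m 1{X_m = x} * prod_{i<m} 1{X_i != x}.
    total = 0
    for m in range(t + 1):
        term = int(path[m] == x)
        for i in range(m):
            term *= int(path[i] != x)
        total += term
    return total

for _ in range(200):
    path = sample_path(30)
    for x in (-2, 1, 3):
        for t in (0, 5, 30):
            assert hit_by(path, x, t) == hit_by_formula(path, x, t)
print("identity verified on simulated paths")
```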

49

Continuous stopping times:

A stopping time for a continuous-time process (Xt)t≥0 is a random
variable T (with values in [0,∞]) such that for each t < ∞ there
is a non-random (measurable) function ht such that

1{T≤t} = ht((Xu)u∈[0,t]).

As in the discrete setting, hitting times etc. are stopping times.

50