Chapter 2
Probability and Statistics Review
In this chapter we briefly review, without proofs, some definitions and concepts in probability and statistics. Many introductory and more advanced texts can be recommended for review and reference. On introductory probability see, e.g., Bean [26], Ghahramani [124], or Ross [249]. Mathematical statistics and probability books at an advanced undergraduate or first-year graduate level include, e.g., DeGroot and Schervish [69], Freund (Miller and Miller) [209], Hogg, McKean, and Craig [150], or Larsen and Marx [177]. Casella and Berger [40] or Bain and Engelhardt [21] are somewhat more advanced. Durrett [80] is a graduate probability text. Lehmann [180] and Lehmann and Casella [181] are graduate texts in statistical inference.
2.1 Random Variables and Probability Distribution and Density Functions
The cumulative distribution function (cdf) of a random variable X is FX defined by
F_X(x) = P(X ≤ x),  x ∈ R.
In this book P(·) denotes the probability of its argument. We will omit the subscript X and write F (x) if it is clear in context. The cdf has the following properties:
1. F_X is non-decreasing.
2. F_X is right-continuous; that is,
   \lim_{ε → 0^+} F_X(x + ε) = F_X(x),  for all x ∈ R.
3. \lim_{x → −∞} F_X(x) = 0 and \lim_{x → ∞} F_X(x) = 1.
A random variable X is continuous if FX is a continuous function. A random variable X is discrete if FX is a step function.
Discrete distributions can be specified by the probability mass function (pmf) p_X(x) = P(X = x). The discontinuities in the cdf are at the points where the pmf is positive, and p_X(x) = F_X(x) − F_X(x^−).
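As a quick numerical illustration in R (not from the text; the Poisson(2) choice is arbitrary), the pmf equals the jump in the cdf at each support point:

```r
# Check p(x) = F(x) - F(x-) for a Poisson(2) distribution
lambda <- 2
x <- 0:6
pmf  <- dpois(x, lambda)                          # p(x)
jump <- ppois(x, lambda) - ppois(x - 1, lambda)   # F(x) - F(x-)
all.equal(pmf, jump)                              # TRUE
```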
If X is discrete, the cdf of X is
F_X(x) = P(X ≤ x) = \sum_{\{k ≤ x:\, p_X(k) > 0\}} p_X(k).
Continuous distributions do not have positive probability mass at any single point. For continuous random variables X the probability density function (pdf) or density of X is f_X(x) = F_X'(x), provided that F_X is differentiable, and by the fundamental theorem of calculus
F_X(x) = P(X ≤ x) = \int_{−∞}^{x} f_X(t)\, dt.
The joint density of continuous random variables X and Y is f_{X,Y}(x, y), and the cdf of (X, Y) is
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = \int_{−∞}^{y} \int_{−∞}^{x} f_{X,Y}(s, t)\, ds\, dt.
The marginal probability densities of X and Y are given by
f_X(x) = \int_{−∞}^{∞} f_{X,Y}(x, y)\, dy;  f_Y(y) = \int_{−∞}^{∞} f_{X,Y}(x, y)\, dx.
The corresponding formulas for discrete random variables are similar, with sums replacing the integrals. In the remainder of this chapter, for simplicity fX (x) denotes either the pdf (if X is continuous) or the pmf (if X is discrete) of X.
The set of points {x : f_X(x) > 0} is the support set of the random variable X. Similarly, the bivariate distribution of (X, Y) is supported on the set {(x, y) : f_{X,Y}(x, y) > 0}.
Expectation, Variance, and Moments
The mean of a random variable X is the expected value or mathematical expectation of the variable, denoted E[X]. If X is continuous with density f, then the expected value of X is
E[X] = \int_{−∞}^{∞} x f(x)\, dx.
If X is discrete with pmf f(x), then
E[X] = \sum_{\{x:\, f_X(x) > 0\}} x f(x).
(The integrals and sums above are not necessarily finite. We implicitly assume that E|X| < ∞ whenever E[X] appears in formulas below.)
The expected value of a function g(X) of a continuous random variable X with pdf f is defined by
E[g(X)] = \int_{−∞}^{∞} g(x) f(x)\, dx.
Let μ_X = E[X]. Then μ_X is also called the first moment of X. The rth moment of X is E[X^r]. Hence if X is continuous,
E[X^r] = \int_{−∞}^{∞} x^r f_X(x)\, dx.
The variance of X is the second central moment, Var(X) = E[(X − E[X])^2]. The identity E[(X − E[X])^2] = E[X^2] − (E[X])^2 provides an equivalent formula for the variance,
Var(X) = E[X^2] − (E[X])^2 = E[X^2] − μ_X^2.
The variance of X is also denoted by σ_X^2. The square root of the variance is the standard deviation. The reciprocal of the variance is the precision.
The expected value of the product of continuous random variables X and Y with joint pdf f_{X,Y} is
E[XY] = \int_{−∞}^{∞} \int_{−∞}^{∞} x y f_{X,Y}(x, y)\, dx\, dy.
The covariance of X and Y is defined by
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − E[X]E[Y] = E[XY] − μ_X μ_Y.
The covariance of X and Y is also denoted by σ_{XY}. Note that Cov(X, X) = Var(X). The product-moment correlation is
ρ(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{σ_{XY}}{σ_X σ_Y}.
Correlation can also be written as
ρ(X, Y) = E\left[\left(\frac{X − μ_X}{σ_X}\right)\left(\frac{Y − μ_Y}{σ_Y}\right)\right].
Two variables X and Y are uncorrelated if ρ(X, Y) = 0.
Conditional Probability and Independence
In classical probability, the conditional probability of an event A given that event B has occurred is
P(A|B) = \frac{P(AB)}{P(B)},
where AB = A ∩ B is the intersection of events A and B. Events A and B are independent if P(AB) = P(A)P(B); otherwise they are dependent. The joint probability that both A and B occur can be written
P(AB) = P(A|B)P(B) = P(B|A)P(A).
If random variables X and Y have joint density f_{X,Y}(x, y), then the conditional density of X given Y = y is
f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.
Similarly, the conditional density of Y given X = x is
f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}.
Thus, the joint density of (X, Y) can be written
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y) = f_{Y|X=x}(y) f_X(x).
Independence
The random variables X and Y are independent if and only if
f_{X,Y}(x, y) = f_X(x) f_Y(y)
for all x and y; or equivalently, if and only if F_{X,Y}(x, y) = F_X(x) F_Y(y) for all x and y.
The random variables X_1, \ldots, X_d are independent if and only if the joint pdf f of X_1, \ldots, X_d is equal to the product of the marginal density functions. That is, X_1, \ldots, X_d are independent if and only if
f(x_1, \ldots, x_d) = \prod_{j=1}^{d} f_j(x_j)
for all x = (x_1, \ldots, x_d)^T in R^d, where f_j(x_j) is the marginal density (or marginal pmf) of X_j.
The variables {X1 , . . . , Xn } are a random sample from a distribution FX
if X_1, \ldots, X_n are independently and identically distributed with distribution F_X. In this case the joint density of {X_1, \ldots, X_n} is
f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i).
If X and Y are independent, then Cov(X, Y) = 0 and ρ(X, Y) = 0. However, the converse is not true; uncorrelated variables are not necessarily independent. The converse is true in an important special case: if X and Y are jointly normally distributed, then Cov(X, Y) = 0 implies independence.
Properties of Expected Value and Variance
Suppose that X and Y are random variables, and a and b are constants. Then the following properties hold (provided the moments exist).
1. E[aX + b] = aE[X] + b.
2. E[X + Y] = E[X] + E[Y].
3. If X and Y are independent, E[XY] = E[X]E[Y].
4. Var(b) = 0.
5. Var(aX + b) = a^2 Var(X).
6. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
7. If X and Y are independent, Var(X + Y) = Var(X) + Var(Y).
If {X_1, \ldots, X_n} are independent and identically distributed (iid) we have
E[X_1 + \cdots + X_n] = nμ_X,  Var(X_1 + \cdots + X_n) = nσ_X^2,
so the sample mean \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i has expected value μ_X and variance σ_X^2 / n. (Apply properties 2, 7, and 5 above.)
The conditional expected value of X given Y = y is
E[X | Y = y] = \int_{−∞}^{∞} x f_{X|Y=y}(x)\, dx,
if F_{X|Y=y}(x) is continuous.
Two important results are the conditional expectation rule and the conditional variance formula:
E[X] = E[E[X|Y]]  (2.1)
Var(X) = E[Var(X|Y)] + Var(E[X|Y]).  (2.2)
See, for example, Ross [248, Ch. 3] for a proof of (2.1) and (2.2) and many applications.
2.2 Some Discrete Distributions
Some important discrete distributions are the "counting distributions." The counting distributions are used to model the frequency of events and waiting time for events in discrete time, for example. Three important counting distributions are the binomial (and Bernoulli), negative binomial (and geometric), and Poisson.
Several discrete distributions including the binomial, geometric, and negative binomial distributions can be formulated in terms of the outcomes of Bernoulli trials. A Bernoulli experiment has exactly two possible outcomes, "success" or "failure." A Bernoulli random variable X has the probability mass function
P(X = 1) = p,  P(X = 0) = 1 − p,
where p is the probability of success. It is easy to check that E[X] = p and V ar(X) = p(1 − p). A sequence of Bernoulli trials is a sequence of outcomes X1, X2, . . . of iid Bernoulli experiments.
Binomial and Multinomial Distribution
Suppose that X records the number of successes in n iid Bernoulli trials with success probability p. Then X has the Binomial(n, p) distribution [abbreviated X ∼ Bin(n, p)] with
P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x} = \frac{n!}{x!(n − x)!} p^x (1 − p)^{n−x},  x = 0, 1, \ldots, n.
The mean and variance formulas are easily derived by observing that the binomial variable is an iid sum of n Bernoulli(p) variables. Therefore,
E[X] = np,  Var(X) = np(1 − p).
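A quick numerical check in R (an illustration, not from the text; n = 10 and p = 0.3 are arbitrary) confirms that the factorial formula agrees with dbinom and recovers the moments above:

```r
# Binomial(n = 10, p = 0.3) pmf: explicit formula vs. dbinom
n <- 10; p <- 0.3; x <- 0:n
pmf <- factorial(n) / (factorial(x) * factorial(n - x)) * p^x * (1 - p)^(n - x)
all.equal(pmf, dbinom(x, n, p))       # TRUE
sum(x * pmf)                          # E[X] = np = 3
sum((x - n * p)^2 * pmf)              # Var(X) = np(1-p) = 2.1
```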
A binomial distribution is a special case of a multinomial distribution. Suppose that there are k + 1 mutually exclusive and exhaustive events A_1, \ldots, A_{k+1} that can occur on any trial of an experiment, and each event occurs with probability P(A_j) = p_j, j = 1, \ldots, k + 1. Let X_j record the number of times that event A_j occurs in n independent and identical trials of the experiment. Then X = (X_1, \ldots, X_k) has the multinomial distribution with joint pdf
f(x_1, \ldots, x_k) = \frac{n!}{x_1!\, x_2! \cdots x_{k+1}!}\, p_1^{x_1} p_2^{x_2} \cdots p_{k+1}^{x_{k+1}},  0 ≤ x_j ≤ n,  (2.3)
where x_{k+1} = n − \sum_{j=1}^{k} x_j.
Geometric Distribution
Consider a sequence of Bernoulli trials, with success probability p. Let the random variable X record the number of failures until the first success is observed. Then
P(X = x) = p(1 − p)^x,  x = 0, 1, 2, \ldots.  (2.4)
A random variable X with pmf (2.4) has the Geometric(p) distribution [abbreviated X ∼ Geom(p)]. If X ∼ Geom(p), then the cdf of X is
F_X(x) = P(X ≤ x) = 1 − (1 − p)^{⌊x⌋ + 1},  x ≥ 0,
and otherwise F_X(x) = 0. The mean and variance of X are given by
E[X] = \frac{1 − p}{p};  Var[X] = \frac{1 − p}{p^2}.

Alternative formulation of Geometric distribution
The geometric distribution is sometimes formulated by letting Y be defined as the number of trials until the first success. Then Y = X + 1, where X is the random variable defined above with pmf (2.4). Under this model, we have P(Y = y) = p(1 − p)^{y−1}, y = 1, 2, \ldots, and
E[Y] = E[X + 1] = \frac{1 − p}{p} + 1 = \frac{1}{p};
Var[Y] = Var[X + 1] = Var[X] = \frac{1 − p}{p^2}.
However, as a counting distribution, or frequency model, the first formulation (2.4) given above is usually applied, because frequency models typically must include the possibility of a zero count.
Negative Binomial Distribution
The negative binomial frequency model applies in the same setting as a geometric model, except that the variable of interest is the number of failures until the rth success. Suppose that exactly X failures occur before the rth success. If X = x, then the rth success occurs on the (x + r)th trial. In the first x + r − 1 trials, there are r − 1 successes and x failures. This can happen in
\binom{x + r − 1}{r − 1} = \binom{x + r − 1}{x}
ways, and each way has probability p^r q^x, where q = 1 − p. The probability mass function of the random variable X is given by
P(X = x) = \binom{x + r − 1}{r − 1} p^r q^x,  x = 0, 1, 2, \ldots.  (2.5)
The negative binomial distribution is defined for r > 0 and 0 < p < 1 as
follows. The random variable X has a negative binomial distribution with
parameters (r, p) if
P(X = x) = \frac{Γ(x + r)}{Γ(r)\,Γ(x + 1)}\, p^r q^x,  x = 0, 1, 2, \ldots,  (2.6)
where Γ(·) is the complete gamma function defined in (2.8). Note that (2.5) and (2.6) are equivalent when r is a positive integer. If X has pmf (2.6) we will write X ∼ NegBin(r, p). The special case NegBin(r = 1, p) is the Geom(p) distribution.
Suppose that X ∼ NegBin(r, p), where r is a positive integer. Then X is the iid sum of r Geom(p) variables. Therefore, the mean and variance of X, given by
E[X] = r\,\frac{1 − p}{p},  Var[X] = r\,\frac{1 − p}{p^2},
are simply r times the mean and variance of the Geom(p) variable in (2.4). These formulas are also valid for all r > 0.
Note that, like the geometric random variable, there is an alternative formulation of the negative binomial model that counts the number of trials until the rth success.
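A short check in R (an illustration, not from the text; r = 2.5 and p = 0.4 are arbitrary): R's dnbinom uses the same "failures before the rth success" convention as (2.6), so it matches the gamma-function formula, including non-integer r.

```r
# NegBin(r, p) pmf from (2.6) compared with dnbinom
r <- 2.5; p <- 0.4; x <- 0:8; q <- 1 - p
pmf <- gamma(x + r) / (gamma(r) * gamma(x + 1)) * p^r * q^x
all.equal(pmf, dnbinom(x, size = r, prob = p))   # TRUE
c(mean = r * q / p, var = r * q / p^2)           # moments for comparison
```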
Poisson Distribution
A random variable X has a Poisson distribution with parameter λ > 0 if the pmf of X is
p(x) = \frac{e^{−λ} λ^x}{x!},  x = 0, 1, 2, \ldots.
If X ∼ Poisson(λ), then
E[X] = λ;  Var(X) = λ.
A useful recursive formula for the pmf is
p(x + 1) = p(x)\, \frac{λ}{x + 1},  x = 0, 1, 2, \ldots.
The Poisson distribution has many important properties and applications (see,
e.g., [133, 164, 250]).
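The recursion gives a convenient way to compute Poisson probabilities without evaluating large factorials. A minimal sketch in R (λ = 4 is an arbitrary choice):

```r
# Compute Poisson(lambda = 4) probabilities p(0), ..., p(10) by recursion
lambda <- 4
p <- numeric(11)
p[1] <- exp(-lambda)                        # p(0)
for (x in 0:9)
  p[x + 2] <- p[x + 1] * lambda / (x + 1)   # p(x+1) = p(x) * lambda / (x+1)
all.equal(p, dpois(0:10, lambda))           # TRUE
```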
Examples
Example 2.1 (Geometric cdf). The cdf of the geometric distribution with success probability p can be derived as follows. If q = 1 − p, then at the points x = 0, 1, 2, \ldots the cdf of X is given by
P(X ≤ x) = \sum_{k=0}^{x} p q^k = p(1 + q + q^2 + \cdots + q^x) = \frac{p(1 − q^{x+1})}{1 − q} = 1 − q^{x+1}.
Alternately, P(X ≤ x) = 1 − P(X ≥ x + 1) = 1 − P(first x + 1 trials are failures) = 1 − q^{x+1}. ⋄
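This derivation is easy to confirm numerically (a small check, not from the text; p = 0.25 is arbitrary). R's pgeom uses the same "number of failures" parameterization as (2.4):

```r
# Check P(X <= x) = 1 - q^(x+1) for the Geometric(p) distribution
p <- 0.25; q <- 1 - p; x <- 0:10
all.equal(pgeom(x, p), 1 - q^(x + 1))   # TRUE
```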
Example 2.2 (Mean of the Poisson distribution). If X ∼ Poisson(λ), then
E[X] = \sum_{x=0}^{∞} x\, \frac{e^{−λ} λ^x}{x!} = λ \sum_{x=1}^{∞} \frac{e^{−λ} λ^{x−1}}{(x − 1)!} = λ \sum_{x=0}^{∞} \frac{e^{−λ} λ^x}{x!} = λ.
The last equality follows because the summand is the Poisson pmf and the total probability must sum to 1. ⋄
2.3 Some Continuous Distributions
Normal Distribution
The normal distribution with mean μ and variance σ^2 [abbreviated N(μ, σ^2)] is the continuous distribution with pdf
f(x) = \frac{1}{\sqrt{2π}\,σ} \exp\left\{ −\frac{1}{2}\left(\frac{x − μ}{σ}\right)^2 \right\},  −∞ < x < ∞.
The standard normal distribution N(0, 1) has zero mean and unit variance, and the standard normal cdf is
Φ(z) = \int_{−∞}^{z} \frac{1}{\sqrt{2π}}\, e^{−t^2/2}\, dt,  −∞ < z < ∞.
The normal distribution has several important properties. We summarize some of these properties, without proof. For more properties and characterizations see [162, Ch. 13], [221], or [285].
A linear transformation of a normal variable is also normally distributed. If X ∼ N(μ, σ^2), then the distribution of Y = aX + b is N(aμ + b, a^2σ^2). It follows that if X ∼ N(μ, σ^2), then
Z = \frac{X − μ}{σ} ∼ N(0, 1).
Linear combinations of normal variables are normal; if X_1, \ldots, X_k are independent, X_i ∼ N(μ_i, σ_i^2), and a_1, \ldots, a_k are constants, then
Y = a_1 X_1 + \cdots + a_k X_k
is normally distributed with mean μ = \sum_{i=1}^{k} a_i μ_i and variance σ^2 = \sum_{i=1}^{k} a_i^2 σ_i^2. Therefore, if X_1, \ldots, X_n is a random sample (X_1, \ldots, X_n are iid) from a N(μ, σ^2) distribution, the sum Y = X_1 + \cdots + X_n is normally distributed with E[Y] = nμ and Var(Y) = nσ^2. It follows that the sample mean \bar{X} = Y/n has the N(μ, σ^2/n) distribution if the sampled distribution is normal. (In case the sampled distribution is not normal, but the sample size is large, the Central Limit Theorem implies that the distribution of Y is approximately normal. See Section 2.5.)
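A quick simulation sketch in R (illustrative only; the coefficients and parameters are arbitrary) checking the mean and variance of a linear combination:

```r
# Y = 2*X1 - 3*X2 with X1 ~ N(1, 4), X2 ~ N(0, 1): mean = 2, variance = 4*4 + 9*1 = 25
set.seed(1)
y <- 2 * rnorm(1e5, mean = 1, sd = 2) - 3 * rnorm(1e5, mean = 0, sd = 1)
c(mean(y), var(y))     # approximately 2 and 25
```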
Gamma and Exponential Distributions
A random variable X has a gamma distribution with shape parameter r > 0 and rate parameter λ > 0 if the pdf of X is
f(x) = \frac{λ^r}{Γ(r)}\, x^{r−1} e^{−λx},  x ≥ 0,  (2.7)
where Γ(r) is the complete gamma function, defined by
Γ(r) = \int_{0}^{∞} t^{r−1} e^{−t}\, dt,  r ≠ 0, −1, −2, \ldots.  (2.8)
Recall that Γ(n) = (n − 1)! for positive integers n.
The notation X ∼ Gamma(r, λ) indicates that X has the density (2.7), with shape r and rate λ. If X ∼ Gamma(r, λ), then
E[X] = \frac{r}{λ};  Var(X) = \frac{r}{λ^2}.
Gamma distributions can also be parameterized by the scale parameter θ = 1/λ instead of the rate parameter λ. In terms of (r, θ) the mean is rθ and the variance is rθ^2. An important special case of the gamma distribution is r = 1, which is the exponential distribution with rate parameter λ. The Exponential(λ) pdf is
f(x) = λe^{−λx},  x ≥ 0.
If X is exponentially distributed with rate λ [abbreviated X ∼ Exp(λ)], then
E[X] = \frac{1}{λ};  Var(X) = \frac{1}{λ^2}.
It can be shown that the sum of iid exponentials has a gamma distribution. If X_1, \ldots, X_r are iid with the Exp(λ) distribution, then Y = X_1 + \cdots + X_r has the Gamma(r, λ) distribution.
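A small simulation sketch in R (illustrative; r = 3 and λ = 2 are arbitrary) comparing the simulated sum of exponentials with the Gamma(r, λ) distribution:

```r
# Sum of r = 3 iid Exp(rate = 2) variables compared with Gamma(shape = 3, rate = 2)
set.seed(1)
r <- 3; lambda <- 2
y <- replicate(1e4, sum(rexp(r, rate = lambda)))
c(mean(y), var(y))              # approximately r/lambda = 1.5 and r/lambda^2 = 0.75
ks.test(y, "pgamma", shape = r, rate = lambda)$p.value   # large p-value expected
```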
Chisquare and t
The Chisquare distribution with ν degrees of freedom is denoted by χ^2(ν). The pdf of a χ^2(ν) random variable X is
f(x) = \frac{1}{Γ(ν/2)\, 2^{ν/2}}\, x^{(ν/2)−1} e^{−x/2},  x ≥ 0,  ν = 1, 2, \ldots.
Note that χ^2(ν) is a special case of the gamma distribution, with shape parameter ν/2 and rate parameter 1/2. The square of a standard normal variable has the χ^2(1) distribution. If Z_1, \ldots, Z_ν are iid standard normal, then Z_1^2 + \cdots + Z_ν^2 ∼ χ^2(ν). If X ∼ χ^2(ν_1) and Y ∼ χ^2(ν_2) are independent, then X + Y ∼ χ^2(ν_1 + ν_2). If X ∼ χ^2(ν), then
E[X] = ν,  Var(X) = 2ν.
The Student's t distribution [270] is defined as follows. Let Z ∼ N(0, 1) and V ∼ χ^2(ν). If Z and V are independent, then the distribution of
T = \frac{Z}{\sqrt{V/ν}}
has the Student's t distribution with ν degrees of freedom, denoted t(ν). The density of a t(ν) random variable X is given by
f(x) = \frac{Γ\left(\frac{ν+1}{2}\right)}{\sqrt{νπ}\, Γ\left(\frac{ν}{2}\right)} \left(1 + \frac{x^2}{ν}\right)^{−(ν+1)/2},  x ∈ R,  ν = 1, 2, \ldots.
The mean and variance of X ∼ t(ν) are given by
E[X] = 0,  ν > 1;  Var(X) = \frac{ν}{ν − 2},  ν > 2.
In the special case ν = 1, the t(1) distribution is the standard Cauchy distribution. For small ν the t distribution has "heavy tails" compared to the normal distribution. For large ν, the t(ν) distribution is approximately normal, and t(ν) converges in distribution to standard normal as ν → ∞.
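A brief check in R (illustrative; ν = 5 is arbitrary) that the density formula matches dt, and that the tails are heavier than the standard normal:

```r
# t(5) density from the formula above vs. dt, and a tail comparison with N(0,1)
nu <- 5; x <- seq(-4, 4, by = 0.5)
f <- gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) * (1 + x^2 / nu)^(-(nu + 1) / 2)
all.equal(f, dt(x, df = nu))                  # TRUE
c(t5 = pt(-3, df = nu), normal = pnorm(-3))   # t(5) puts more probability in the tail
```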
Beta and Uniform Distributions
A random variable X with density function
f(x) = \frac{Γ(α + β)}{Γ(α)Γ(β)}\, x^{α−1} (1 − x)^{β−1},  0 ≤ x ≤ 1,  α > 0,  β > 0,  (2.9)
has the Beta(α, β) distribution. The constant in the beta density is the reciprocal of the beta function, defined by
B(α, β) = \int_{0}^{1} t^{α−1} (1 − t)^{β−1}\, dt = \frac{Γ(α)Γ(β)}{Γ(α + β)}.
The continuous uniform distribution on (0, 1), or Uniform(0,1), is the special case Beta(1,1).
The parameters α and β are shape parameters. When α = β the distribution is symmetric about 1/2. When α ≠ β the distribution is skewed, with the direction and amount of skewness depending on the shape parameters. The mean and variance are
E[X] = \frac{α}{α + β};  Var(X) = \frac{αβ}{(α + β)^2 (α + β + 1)}.
If X ∼ Uniform(0,1) = Beta(1,1), then E[X] = 1/2 and Var(X) = 1/12.
In Bayesian analysis, a beta distribution is often chosen to model the distribution of a probability parameter, such as the probability of success in Bernoulli trials or a binomial experiment.
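A short illustration in R (the shape parameters are arbitrary and not from the text) checking the beta mean and variance formulas by numerical integration:

```r
# Beta(alpha = 2, beta = 5): compare the moment formulas with numerical integration
a <- 2; b <- 5
m <- integrate(function(x) x * dbeta(x, a, b), 0, 1)$value
v <- integrate(function(x) (x - m)^2 * dbeta(x, a, b), 0, 1)$value
c(m, a / (a + b))                          # both 0.2857...
c(v, a * b / ((a + b)^2 * (a + b + 1)))    # both 0.02551...
```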
Lognormal Distribution
A random variable X has the Lognormal(μ, σ^2) distribution [abbreviated X ∼ LogN(μ, σ^2)] if X = e^Y, where Y ∼ N(μ, σ^2). That is, log X ∼ N(μ, σ^2). The lognormal density function is
f_X(x) = \frac{1}{x\sqrt{2π}\,σ}\, e^{−(\log x − μ)^2 / (2σ^2)},  x > 0.
The cdf can be evaluated by the normal cdf of log X ∼ N(μ, σ^2), so the cdf of X ∼ LogN(μ, σ^2) is given by
F_X(x) = Φ\left(\frac{\log x − μ}{σ}\right),  x > 0.
The moments are
E[X^r] = E[e^{rY}] = \exp\left(rμ + \frac{1}{2} r^2 σ^2\right),  r > 0.  (2.10)
The mean and variance are
E[X] = e^{μ + σ^2/2},  Var(X) = e^{2μ + σ^2}(e^{σ^2} − 1).
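A brief numerical sketch in R (parameters chosen arbitrarily, not from the text) comparing the mean implied by (2.10) with simulation:

```r
# Lognormal(mu = 0.5, sigma = 0.8): simulated mean vs. exp(mu + sigma^2/2)
set.seed(1)
mu <- 0.5; sigma <- 0.8
x <- rlnorm(1e5, meanlog = mu, sdlog = sigma)
c(mean(x), exp(mu + sigma^2 / 2))   # both approximately 2.27
```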
Examples
Example 2.3 (Two-parameter exponential cdf). The two-parameter exponential density is
f(x) = λe^{−λ(x − η)},  x ≥ η,  (2.11)
where λ and η are positive constants. Denote the distribution with density function (2.11) by Exp(λ, η). When η = 0 the density (2.11) is exponential with rate λ.
The cdf of the two-parameter exponential distribution is given by
F(x) = \int_{η}^{x} λe^{−λ(t − η)}\, dt = \int_{0}^{x − η} λe^{−λu}\, du = 1 − e^{−λ(x − η)},  x ≥ η.
In the special case η = 0 we have the cdf of the Exp(λ) distribution,
F(x) = 1 − e^{−λx},  x ≥ 0.
⋄
Example 2.4 (Memoryless property of the exponential distribution). The exponential distribution with rate parameter λ has the memoryless property. That is, if X ∼ Exp(λ), then
P(X > s + t | X > s) = P(X > t),  for all s, t ≥ 0.
The cdf of X is F(x) = 1 − \exp(−λx), x ≥ 0 (see Example 2.3). Therefore, for all s, t ≥ 0 we have
P(X > s + t | X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{1 − F(s + t)}{1 − F(s)} = \frac{e^{−λ(s+t)}}{e^{−λs}} = e^{−λt} = 1 − F(t) = P(X > t).
The first equality is simply the definition of conditional probability, P (A|B) = P (AB)/P (B). ⋄
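A quick numerical illustration in R (the values of λ, s, and t are arbitrary):

```r
# Memoryless property: P(X > s + t | X > s) = P(X > t) for X ~ Exp(rate = 2)
lambda <- 2; s <- 0.7; t <- 1.3
lhs <- pexp(s + t, lambda, lower.tail = FALSE) / pexp(s, lambda, lower.tail = FALSE)
rhs <- pexp(t, lambda, lower.tail = FALSE)
all.equal(lhs, rhs)   # TRUE
```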
2.4 Multivariate Normal Distribution
The bivariate normal distribution
Two continuous random variables X and Y have a bivariate normal distribution if the joint density of (X, Y) is the bivariate normal density function, which is given by
f(x, y) = \frac{1}{2πσ_1σ_2\sqrt{1 − ρ^2}} \exp\left\{ −\frac{1}{2(1 − ρ^2)} \left[ \left(\frac{x − μ_1}{σ_1}\right)^2 − 2ρ\left(\frac{x − μ_1}{σ_1}\right)\left(\frac{y − μ_2}{σ_2}\right) + \left(\frac{y − μ_2}{σ_2}\right)^2 \right] \right\},  (2.12)
(x, y) ∈ R^2. The parameters are μ_1 = E[X], μ_2 = E[Y], σ_1^2 = Var(X), σ_2^2 = Var(Y), and ρ = Cor(X, Y). The notation (X, Y) ∼ BVN(μ_1, μ_2, σ_1^2, σ_2^2, ρ) indicates that (X, Y) have the joint pdf (2.12). Some properties of the bivariate normal distribution (2.12) are:
1. The marginal distributions of X and Y are normal; that is, X ∼ N(μ_1, σ_1^2) and Y ∼ N(μ_2, σ_2^2).
2. The conditional distribution of Y given X = x is normal with mean μ_2 + ρ(σ_2/σ_1)(x − μ_1) and variance σ_2^2(1 − ρ^2).
3. The conditional distribution of X given Y = y is normal with mean μ_1 + ρ(σ_1/σ_2)(y − μ_2) and variance σ_1^2(1 − ρ^2).
4. X and Y are independent if and only if ρ = 0.
Suppose (X_1, X_2) ∼ BVN(μ_1, μ_2, σ_1^2, σ_2^2, ρ). Let μ = (μ_1, μ_2)^T and
Σ = [ σ_{11}  σ_{12}
      σ_{21}  σ_{22} ],
where σ_{ij} = Cov(X_i, X_j). Then the bivariate normal pdf (2.12) of (X_1, X_2) can be written in matrix notation as
f(x_1, x_2) = \frac{1}{(2π)|Σ|^{1/2}} \exp\left\{ −\frac{1}{2}(x − μ)^T Σ^{−1} (x − μ) \right\},
where x = (x_1, x_2)^T ∈ R^2.
The multivariate normal distribution
The joint distribution of continuous random variables X_1, \ldots, X_d is multivariate normal or d-variate normal, denoted N_d(μ, Σ), if the joint pdf is given by
f(x_1, \ldots, x_d) = \frac{1}{(2π)^{d/2}|Σ|^{1/2}} \exp\left\{ −\frac{1}{2}(x − μ)^T Σ^{−1} (x − μ) \right\},  (2.13)
where Σ is the d × d nonsingular covariance matrix of (X_1, \ldots, X_d)^T, μ = (μ_1, \ldots, μ_d)^T is the mean vector, and x = (x_1, \ldots, x_d)^T ∈ R^d.
The one-dimensional marginal distributions of a multivariate normal variable are normal with mean μ_i and variance σ_i^2, i = 1, \ldots, d. Here σ_i^2 is the ith entry on the diagonal of Σ. In fact, all of the marginal distributions of a multivariate normal vector are multivariate normal (see, e.g., Tong [288, Sec. 3.3]).
The normal random variables X_1, \ldots, X_d are independent if and only if the covariance matrix Σ is diagonal.
Linear transformations of multivariate normal random vectors are multivariate normal. That is, if C is an m × d matrix and b = (b_1, \ldots, b_m)^T ∈ R^m, then Y = CX + b has the m-dimensional multivariate normal distribution with mean vector Cμ + b and covariance matrix CΣC^T.
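The linear-transformation property suggests one way to generate multivariate normal samples: transform iid standard normals by a matrix square root of Σ. A minimal sketch in R (the mean vector and covariance matrix are arbitrary choices for illustration):

```r
# Generate N_2(mu, Sigma) samples as mu + L %*% z, where L %*% t(L) = Sigma
set.seed(1)
mu <- c(1, -2)
Sigma <- matrix(c(4, 1.5, 1.5, 1), 2, 2)
L <- t(chol(Sigma))                      # lower-triangular factor of Sigma
z <- matrix(rnorm(2 * 1e4), nrow = 2)    # 2 x n matrix of iid N(0,1) variables
x <- mu + L %*% z                        # each column is one N_2(mu, Sigma) draw
rowMeans(x)                              # approximately (1, -2)
cov(t(x))                                # approximately Sigma
```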
Applications and properties of the multivariate normal distribution are covered by Anderson [13] and Mardia et al. [194]. Refer to Tong [288] for properties and characterizations of the bivariate normal and multivariate normal distribution.
2.5 Limit Theorems
Laws of Large Numbers
The Weak Law of Large Numbers (WLLN or LLN) states that the sample mean converges in probability to the population mean. Suppose that X_1, X_2, \ldots are independent and identically distributed (iid), E|X_1| < ∞, and μ = E[X_1]. For each n let \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then \bar{X}_n → μ in probability as n → ∞.
That is, for every ε > 0,
\lim_{n → ∞} P(|\bar{X}_n − μ| < ε) = 1.
For a proof, see Durrett [80].
The Strong Law of Large Numbers (SLLN) states that the sample mean converges almost surely to the population mean μ. Suppose that X_1, X_2, \ldots are pairwise independent and identically distributed, E|X_1| < ∞, and μ = E[X_1]. For each n let \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then \bar{X}_n → μ almost surely as n → ∞. That is, for every ε > 0,
P\left(\lim_{n → ∞} |\bar{X}_n − μ| < ε\right) = 1.
For Etemadi’s proof, see Durrett [80].
Central Limit Theorem
The first version of the Central Limit Theorem was proved by de Moivre in the early 18th century for random samples of Bernoulli variables. The general proof was given independently by Lindeberg and Lévy in the early 1920’s.
Theorem 2.1 (The Central Limit Theorem). If X_1, \ldots, X_n is a random sample from a distribution with mean μ and finite variance σ^2 > 0, then the limiting distribution of
Z_n = \frac{\bar{X} − μ}{σ/\sqrt{n}}
is the standard normal distribution.
See Durrett [80] for the proofs.
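A minimal simulation sketch in R (exponential data and n = 30 are arbitrary choices) illustrating the approximate normality of Z_n:

```r
# Standardized means of Exp(1) samples (mu = 1, sigma = 1), n = 30
set.seed(1)
n <- 30
z <- replicate(1e4, (mean(rexp(n, rate = 1)) - 1) / (1 / sqrt(n)))
c(mean(z), sd(z))      # approximately 0 and 1
qqnorm(z); qqline(z)   # roughly linear: approximately standard normal
```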
2.6 Statistics
Unless otherwise stated, X_1, \ldots, X_n is a random sample from a distribution with cdf F_X(x) = P(X ≤ x), pdf or pmf f_X(x), mean E[X] = μ_X, and variance σ_X^2. The subscript X on F, f, μ, and σ is omitted when it is clear in context. Lowercase letters x_1, \ldots, x_n denote an observed random sample.
A statistic is a function T_n = T(X_1, \ldots, X_n) of a sample. Some examples of statistics are the sample mean, sample variance, etc. The sample mean is
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i,
and the sample variance is
S^2 = \frac{1}{n − 1} \sum_{i=1}^{n} (X_i − \bar{X})^2 = \frac{\sum_{i=1}^{n} X_i^2 − n\bar{X}^2}{n − 1}.
The sample standard deviation is S = \sqrt{S^2}.
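A quick check in R (with arbitrary simulated data) that the two forms of S^2 agree and match var():

```r
# Two equivalent forms of the sample variance compared with var()
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)
n <- length(x)
s2_a <- sum((x - mean(x))^2) / (n - 1)
s2_b <- (sum(x^2) - n * mean(x)^2) / (n - 1)
c(s2_a, s2_b, var(x))   # all equal
```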
The empirical distribution function
An estimate of F(x) = P(X ≤ x) is the proportion of sample points that fall in the interval (−∞, x]. This estimate is called the empirical cumulative distribution function (ecdf) or empirical distribution function (edf). The ecdf of an observed sample x_1, \ldots, x_n is defined by
F_n(x) = 0 if x < x_{(1)};  F_n(x) = i/n if x_{(i)} ≤ x < x_{(i+1)}, i = 1, \ldots, n − 1;  F_n(x) = 1 if x ≥ x_{(n)},
where x_{(1)} ≤ x_{(2)} ≤ \cdots ≤ x_{(n)} is the ordered sample.
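In R the ecdf function returns this step function directly (a small illustration with made-up data, not from the text):

```r
# The ecdf of a small sample, evaluated at a few points
x <- c(3.2, 1.5, 4.8, 2.0, 3.9)
Fn <- ecdf(x)           # a step function
Fn(c(1, 2, 3.5, 5))     # 0.0 0.4 0.6 1.0
```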
2.7 Bayes' Theorem and Bayesian Statistics

For events A and B, Bayes' Theorem gives
P(A|B) = \frac{P(B|A)P(A)}{P(B)}.
Often the Law of Total Probability is applied to compute P(B) in the de- nominator. These formulas follow from the definitions of conditional and joint probability.
For continuous random variables the distributional form of Bayes' Theorem is
f_{X|Y=y}(x) = \frac{f_{Y|X=x}(y) f_X(x)}{f_Y(y)} = \frac{f_{Y|X=x}(y) f_X(x)}{\int_{−∞}^{∞} f_{Y|X=x}(y) f_X(x)\, dx},
where the marginal density of Y is
f_Y(y) = \int_{−∞}^{∞} f_{Y|X=x}(y) f_X(x)\, dx.
For discrete random variables X and Y we can write the distributional form as
f_{X|Y=y}(x) = P(X = x | Y = y) = \frac{P(Y = y | X = x) P(X = x)}{\sum_x P(Y = y | X = x) P(X = x)}.
These formulas follow from the definitions of conditional and joint probability.
Bayesian Statistics
In the frequentist approach to statistics, the parameters of a distribution are considered to be fixed but unknown constants. The Bayesian approach views the unknown parameters of a distribution as random variables. Thus,
in Bayesian analysis, probabilities can be computed for parameters as well as the sample statistics.
Bayes’ Theorem allows one to revise his/her prior belief about an unknown parameter based on observed data. The prior belief reflects the relative weights that one assigns to the possible values for the parameters. Suppose that X has the density f(x|θ). The conditional density of θ given the sample observations x1 , . . . , xn is called the posterior density, defined by
f_{θ|x}(θ) = \frac{f(x_1, \ldots, x_n | θ) f_θ(θ)}{\int f(x_1, \ldots, x_n | θ) f_θ(θ)\, dθ},
where f_θ(θ) is the pdf of the prior distribution of θ. The posterior distribution summarizes our modified beliefs about the unknown parameters, taking into account the data that has been observed. Then one is interested in computing posterior quantities such as posterior means, posterior modes, posterior standard deviations, etc.
Note that any constant in the likelihood function cancels out of the posterior density. The basic relation is
posterior ∝ prior × likelihood,
which describes the shape of the posterior density up to a multiplicative constant. Often the evaluation of the constant is difficult and the integral cannot be obtained in closed form. However, Monte Carlo methods are available that do not require the evaluation of the constant in order to sample from the posterior distribution and estimate posterior quantities of interest. See, e.g., [49, 107, 110, 125, 240] on development of Markov Chain Monte Carlo sampling.
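As a concrete illustration (a standard conjugate example, not taken from this text): with a Beta(a, b) prior on a binomial success probability θ and x successes observed in n trials, the posterior is Beta(a + x, b + n − x), so posterior quantities can be evaluated directly in R:

```r
# Beta prior + binomial likelihood => Beta posterior (conjugate example)
a <- 2; b <- 2           # prior Beta(2, 2)
n <- 20; x <- 14         # observed data: 14 successes in 20 trials
post_a <- a + x; post_b <- b + n - x
post_a / (post_a + post_b)                 # posterior mean of theta
qbeta(c(0.025, 0.975), post_a, post_b)     # a 95% posterior interval for theta
```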
Readers are referred to Lee [179] for an introductory presentation of Bayesian statistics. Albert [5] is a good introduction to computational Bayesian methods with R. A textbook covering probability and mathematical statistics from both a classical and Bayesian perspective at an advanced undergraduate level is DeGroot and Schervish [69].
2.8 Markov Chains
In this section we briefly review discrete time, discrete state space Markov chains. A basic understanding of Markov chains is necessary background for Chapter 11 on Markov Chain Monte Carlo methods. Readers are referred to Ross [251] for an excellent introduction to Markov chains.
A Markov chain is a stochastic process {Xt} indexed by time t ≥ 0. Our goal is to generate a chain by simulation, so we consider discrete time Markov chains. The time index will be the nonnegative integers, so that the process
starts in state X0 and makes successive transitions to X1, X2, . . . , Xt, . . . . The set of possible values of Xt is the state space.
Suppose that the state space of a Markov chain is finite or countable. Without loss of generality, we can suppose that the states are 0, 1, 2, . . . . The sequence {Xt|t ≥ 0} is a Markov chain if
P(X_{t+1} = j | X_0 = i_0, X_1 = i_1, \ldots, X_{t−1} = i_{t−1}, X_t = i) = P(X_{t+1} = j | X_t = i),
for all pairs of states (i,j), t ≥ 0. In other words, the transition probability depends only on the current state, and not on the past.
If the state space is finite, the transition probabilities P(X_{t+1} | X_t) can be represented by a transition matrix P = (p_{ij}), where the entry p_{ij} is the probability that the chain makes a transition to state j in one step starting from state i. The probability that the chain moves from state i to state j in k steps is p_{ij}^{(k)}, and the Chapman-Kolmogorov equations (see, e.g., [251, Ch. 4]) provide that the k-step transition probabilities are the entries of the matrix P^k. That is, P^{(k)} = (p_{ij}^{(k)}) = P^k, the kth power of the transition matrix.
A Markov chain is irreducible if all states communicate with all other states: given that the chain is in state i, there is a positive probability that the chain can enter state j in finite time, for all pairs of states (i,j). A state i is recurrent if the chain returns to i with probability 1; otherwise state i is transient. If the expected time until the chain returns to i is finite, then i is nonnull or positive recurrent. The period of a state i is the greatest common divisor of the lengths of paths starting and ending at i. In an irreducible chain, the periods of all states are equal, and the chain is aperiodic if the states all have period 1. Positive recurrent, aperiodic states are ergodic. In a finite-state Markov chain all recurrent states are positive recurrent.
In an irreducible, ergodic Markov chain the transition probabilities converge to a stationary distribution π on the state space, independent of the initial state of the chain.
In a finite-state Markov chain, irreducibility and aperiodicity imply that for all states j
π_j = \lim_{n → ∞} p_{ij}^{(n)}
exists and is independent of the initial state i. The probability distribution π = {π_j} is called the stationary distribution, and π is the unique nonnegative solution to the system of equations
π_j = \sum_{i=0}^{∞} π_i p_{ij},  j ≥ 0;  \sum_{j=0}^{∞} π_j = 1.  (2.17)
We can interpret πj as the (limiting) proportion of time that the chain is in state j.
Example 2.7 (Finite state Markov chain). Ross [251] gives the following example of a Markov chain model for mutations of DNA. A DNA nucleotide has four possible values. For each unit of time, the model specifies that the nucleotide changes with probability 3α, for some 0 < α < 1/3. If it does change, then it is equally likely to change to any of the other three values. Thus p_{ii} = 1 − 3α and p_{ij} = 3α/3 = α, i ≠ j. If we number the states 1 to 4, the transition matrix is
P = [ 1−3α   α     α     α
       α    1−3α   α     α
       α     α    1−3α   α
       α     α     α    1−3α ],  (2.18)
where p_{ij} = P_{i,j} is the probability of a mutation from state i to state j. The ith row of a transition matrix is the conditional probability distribution P(X_{n+1} = j | X_n = i), j = 1, 2, 3, 4, of a transition to state j given that the process is currently in state i. Thus each row must sum to 1 (the matrix is row stochastic). This matrix happens to be doubly stochastic because the columns also sum to 1, but in general a transition matrix need only be row stochastic.
Suppose that α = 0.1. Then the two-step and the 16-step transition matrices are
P^2 = [ 0.52  0.16  0.16  0.16
        0.16  0.52  0.16  0.16
        0.16  0.16  0.52  0.16
        0.16  0.16  0.16  0.52 ],
P^16 ≐ [ 0.2626  0.2458  0.2458  0.2458
         0.2458  0.2626  0.2458  0.2458
         0.2458  0.2458  0.2626  0.2458
         0.2458  0.2458  0.2458  0.2626 ].
The three-step transition matrix is P^2 P = P^3, etc. The probability p_{14}^{(2)} of transition from state 1 to state 4 in two steps is P^2_{1,4} = 0.16, and the probability that the process returns to state 2 from state 2 in 16 steps is p_{22}^{(16)} = P^{16}_{2,2} = 0.2626.
All entries of P are positive, hence all states communicate; the chain is irreducible and ergodic. The transition probabilities in every row are converging to the same stationary distribution π on the four states. The stationary distribution is the solution of equations (2.17); in this case π(i) = 1/4, i = 1, 2, 3, 4. (In this example, it can be shown that the limiting probabilities do not depend on α: P^n_{ii} = \frac{1}{4} + \frac{3}{4}(1 − 4α)^n → \frac{1}{4} as n → ∞.) ⋄
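Matrix powers of P are easy to examine in R (a small sketch using α = 0.1 as in the example; matpow below is a hypothetical helper written here for illustration, not a base R function):

```r
# n-step transition matrices for the DNA mutation chain with alpha = 0.1
alpha <- 0.1
P <- matrix(alpha, 4, 4); diag(P) <- 1 - 3 * alpha
matpow <- function(M, n) Reduce(`%*%`, replicate(n, M, simplify = FALSE))
matpow(P, 2)            # diagonal 0.52, off-diagonal 0.16, as in the example
diag(matpow(P, 50))     # each entry is close to the stationary probability 1/4
```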
Example 2.8 (Random walk). An example of a discrete-time Markov chain with an infinite state space is the random walk. The state space is the set of all integers, and the transition probabilities are
p_{i,i+1} = p,  p_{i,i−1} = 1 − p,  i = 0, ±1, ±2, \ldots,
p_{i,j} = 0,  j ∉ {i − 1, i + 1}.
In the random walk model, at each transition a step of unit length is taken at random to the right with probability p or left with probability 1 − p. The state of the process at time n is the current location of the walker at time n. Another interpretation considers the gambler who bets $1 on a sequence of Bernoulli(p) trials and wins or loses $1 at each transition; if X0 = 0, the state of the process at time n is his gain or loss after n trials.
In the random walk model all states communicate, so the chain is irreducible. All states have period 2. For example, it is impossible to return to state 0 starting from 0 in an odd number of steps. The probability of a return to 0 from state 0 in exactly 2n steps is
p_{00}^{(2n)} = \binom{2n}{n} p^n (1 − p)^n = \frac{(2n)!}{n!\, n!} (p(1 − p))^n.
It can be shown that \sum_{n=1}^{∞} p_{00}^{(2n)} < ∞ if and only if p ≠ 1/2. Thus, the expected number of visits to 0 is finite if and only if p ≠ 1/2. Recurrence and transience are class properties, hence the chain is recurrent if and only if p = 1/2 and otherwise all states are transient. When p = 1/2 the process is called a symmetric random walk. The symmetric random walk is discussed in Example 4.5. ⋄
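A random walk is simple to simulate in R (a minimal sketch; p = 0.5 gives the symmetric walk):

```r
# Simulate 1000 steps of a random walk started at 0
set.seed(1)
p <- 0.5
steps <- sample(c(1, -1), size = 1000, replace = TRUE, prob = c(p, 1 - p))
walk <- c(0, cumsum(steps))   # states X_0, X_1, ..., X_1000
sum(walk == 0)                # number of visits to 0 along this path
```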