ECONOMETRICS I ECON GR5411
Lecture 3 – More on Probability Distribution and Large Sample Distribution Theory
by
Seyhan Erden, Columbia University, MA in Economics
Joint and Marginal Probability Functions:
If $X$ and $Y$ have discrete distributions, the function $f(x, y) = \Pr(X = x, Y = y)$ is called the joint probability distribution of $(X, Y)$.
The marginal probability functions of $X$ and $Y$ are given by
$$f_X(x) = \Pr(X = x) = \sum_y f(x, y)$$
and
$$f_Y(y) = \Pr(Y = y) = \sum_x f(x, y).$$
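As a quick numerical sketch (my own illustration, not part of the lecture, assuming numpy is available), the snippet below builds a small hypothetical joint pmf and recovers both marginals by summing over the other variable; the probabilities are made up for the example.

```python
import numpy as np

# Hypothetical joint pmf f(x, y) for X in {0, 1} (rows) and Y in {0, 1, 2} (columns).
# The entries are illustrative and sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

f_x = joint.sum(axis=1)  # marginal of X: sum over y
f_y = joint.sum(axis=0)  # marginal of Y: sum over x

print("f_X(x):", f_x)    # [0.40, 0.60]
print("f_Y(y):", f_y)    # [0.35, 0.35, 0.30]
```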
Joint and Marginal Probability Density Functions:
If $X$ and $Y$ have continuous distributions, then the marginal pdfs of $X$ and $Y$ are given by
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$$
and
$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx.$$
Joint Distributions:
The joint density function for two RVs $X$ and $Y$, denoted by $f(x, y)$, is defined through
$$\Pr(a \le X \le b,\; c \le Y \le d) =
\begin{cases}
\displaystyle\sum_{a \le x \le b}\; \sum_{c \le y \le d} f(x, y) & \text{if $x$ and $y$ are discrete} \\[1.5ex]
\displaystyle\int_a^b \int_c^d f(x, y)\, dy\, dx & \text{if $x$ and $y$ are continuous.}
\end{cases}$$
Conditional Probability Functions:
If $(X, Y)$ is a discrete random vector, then for any $x$ such that $\Pr(X = x) = f_X(x) > 0$, the conditional probability function of $Y$ given that $X = x$ is a function of $y$ defined by
$$f(y \mid x) = \Pr(Y = y \mid X = x) = \frac{f(x, y)}{f_X(x)},$$
and $f(x \mid y)$ is defined similarly for any $y$ such that $\Pr(Y = y) = f_Y(y) > 0$:
$$f(x \mid y) = \Pr(X = x \mid Y = y) = \frac{f(x, y)}{f_Y(y)}.$$
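Continuing the hypothetical joint pmf from the earlier sketch, a minimal illustration (assuming every $f_X(x) > 0$) of computing $f(y \mid x) = f(x, y)/f_X(x)$ row by row:

```python
import numpy as np

# Same made-up joint pmf as before: rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

f_x = joint.sum(axis=1)                   # marginal of X
f_y_given_x = joint / f_x[:, np.newaxis]  # divide each row by f_X(x)

print(f_y_given_x)                # conditional pmf of Y given X = x
print(f_y_given_x.sum(axis=1))    # each row sums to 1, as a pmf should
```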
Conditional Probability Functions:
If $(X, Y)$ is a continuous random vector, then for any $x$ such that $f_X(x) > 0$, the conditional pdf of $Y$ given that $X = x$ is a function of $y$ defined by
$$f(y \mid x) = \frac{f(x, y)}{f_X(x)},$$
and $f(x \mid y)$ is defined similarly:
$$f(x \mid y) = \frac{f(x, y)}{f_Y(y)}.$$
Independent RVs:
$X$ and $Y$ are called independent RVs if for every $(x, y)$
$$f(x, y) = f_X(x)\, f_Y(y).$$
Hence, $X$ and $Y$ are independent iff
$$f(y \mid x) = f_Y(y) \quad\text{and}\quad f(x \mid y) = f_X(x).$$
Equivalently, $X$ and $Y$ are independent iff
$$f(x, y) = g(x)\, h(y)$$
for some functions $g$ and $h$ ($g$ and $h$ are not necessarily pdfs here).
Next in probability theory review:
1. Multivariate Distributions
2. Law of Large Numbers
3. Multivariate Central Limit Theorem
Multivariate Distributions
Under the multivariate distributions topic we collect various definitions and facts about distributions of vectors of random variables.
We start by defining the mean and the variance of the $m$-dimensional random variable $V$.
The 1st and 2nd moments of an $m \times 1$ vector of random variables, $V = (V_1, V_2, \dots, V_m)'$, are summarized by its mean vector and covariance matrix.
Because $V$ is a vector, the vector of its means (the mean vector) is $E(V) = \mu_V$. The $i$th element of the mean vector is the mean of the $i$th element of $V$.
The Covariance Matrix
The covariance matrix of $V$ is the matrix with the variances $\mathrm{var}(V_i)$, $i = 1, \dots, m$, along the diagonal and the covariances $\mathrm{cov}(V_i, V_j)$ as the $(i, j)$ off-diagonal elements. In matrix form, the covariance matrix is
$$\Sigma_V = E\!\left[(V - \mu_V)(V - \mu_V)'\right] =
\begin{pmatrix}
\mathrm{var}(V_1) & \cdots & \mathrm{cov}(V_1, V_m) \\
\vdots & \ddots & \vdots \\
\mathrm{cov}(V_m, V_1) & \cdots & \mathrm{var}(V_m)
\end{pmatrix}.$$
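As an illustration (my own, not from the slides), the sketch below approximates $\Sigma_V = E[(V - \mu_V)(V - \mu_V)']$ by its sample analogue for a made-up 3-dimensional random vector whose true covariance is known by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 3

# Draw V with a known (hypothetical) covariance structure: V = A z, z standard normal,
# so the population covariance is A A'.
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
V = rng.standard_normal((n, m)) @ A.T           # rows are draws of V

mu_hat = V.mean(axis=0)
Sigma_hat = (V - mu_hat).T @ (V - mu_hat) / n   # sample analogue of E[(V - mu)(V - mu)']

print(np.round(Sigma_hat, 2))
print(np.round(A @ A.T, 2))                     # population covariance, for comparison
```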
Multivariate Normal Distribution
The $m \times 1$ vector random variable $V$ has a multivariate normal distribution with mean $\mu_V$ and covariance matrix $\Sigma_V$ if it has the joint probability density function
$$f(V) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_V|^{1/2}} \exp\!\left(-\frac{1}{2}(V - \mu_V)' \Sigma_V^{-1} (V - \mu_V)\right),$$
where $|\Sigma_V|$ is the determinant of the covariance matrix $\Sigma_V$. The multivariate normal distribution is denoted by $N(\mu_V, \Sigma_V)$.
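A minimal check (illustration only, assuming scipy is available; the mean vector and covariance matrix are arbitrary) that evaluating the density formula directly agrees with scipy's multivariate normal pdf:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
v = np.array([0.5, -1.0])

m = len(mu)
diff = v - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
pdf_formula = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))

print(pdf_formula)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(v))  # should match the formula
```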
An important fact about the multivariate normal distribution is that if two jointly normally distributed random variables are uncorrelated (or, equivalently, have a block diagonal covariance matrix), then they are independently distributed. That is, let $V_1$ and $V_2$ be jointly normally distributed random variables with respective dimensions $m_1 \times 1$ and $m_2 \times 1$. Then if
$$\mathrm{cov}(V_1, V_2) = E\!\left[(V_1 - \mu_{V_1})(V_2 - \mu_{V_2})'\right] = 0_{m_1 \times m_2},$$
then $V_1$ and $V_2$ are independent.
If the $V_i$ are i.i.d. $N(0, \sigma_V^2)$, then $\Sigma_V = \sigma_V^2 I_m$ and the multivariate normal density simplifies to the product of $m$ univariate normal densities.
Distribution of Linear Combinations and Quadratic Forms of Normal Random Variables
Linear combinations of multivariate normal random variables are themselves normally distributed and certain quadratic forms of multivariate normal random variables have a chi-squared distribution.
Let $V$ be an $m \times 1$ random variable distributed $N(\mu_V, \Sigma_V)$, let $A$ and $B$ be nonrandom $a \times m$ and $b \times m$ matrices, and let $d$ be a nonrandom $a \times 1$ vector. Then
$$d + AV \sim N(d + A\mu_V,\; A\Sigma_V A')$$
and
$$\mathrm{cov}(AV, BV) = A\Sigma_V B'.$$
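A simulation sketch of the first claim, $d + AV \sim N(d + A\mu_V, A\Sigma_V A')$; the mean vector, covariance matrix, $A$, and $d$ below are arbitrary values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

mu_V = np.array([1.0, 2.0, 3.0])
Sigma_V = np.array([[1.0, 0.3, 0.0],
                    [0.3, 2.0, 0.4],
                    [0.0, 0.4, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])      # a x m with a = 2, m = 3
d = np.array([0.5, -0.5])

V = rng.multivariate_normal(mu_V, Sigma_V, size=n)   # draws of V
W = d + V @ A.T                                      # draws of d + A V

print(np.round(W.mean(axis=0), 2), np.round(d + A @ mu_V, 2))  # simulated vs. theoretical mean
print(np.round(np.cov(W.T), 2))                                # simulated covariance of d + A V
print(np.round(A @ Sigma_V @ A.T, 2))                          # theoretical A Sigma_V A'
```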
If $A\Sigma_V B' = 0$ (note that this is a matrix of size $a \times b$), then $AV$ and $BV$ are independently distributed; and
$$(V - \mu_V)'\, \Sigma_V^{-1}\, (V - \mu_V) \sim \chi^2_m.$$
Let $U$ be an $m$-dimensional multivariate standard normal random variable with distribution $N(0, I_m)$. If $C$ is symmetric and idempotent, then
$$U'CU \sim \chi^2_r, \quad\text{where } r = \mathrm{rank}(C).$$
Proof: If $C$ is symmetric and idempotent, then it can be written as $C = AA'$, where $A$ is $m \times r$ with $A'A = I_r$. Then
$$A'U \sim N(A'\,0,\; A'I_m A) = N(0, I_r),$$
hence
$$U'CU = U'AA'U = (A'U)'(A'U) \sim \chi^2_r.$$
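A simulation sketch of the result just proved (my own illustration), using a hypothetical projection matrix $C = X(X'X)^{-1}X'$, which is symmetric and idempotent with rank equal to the number of columns of $X$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
m, r, n_sims = 10, 3, 100_000

# Symmetric idempotent C: projection onto the column space of a random m x r matrix X.
X = rng.standard_normal((m, r))
C = X @ np.linalg.inv(X.T @ X) @ X.T        # rank(C) = r

U = rng.standard_normal((n_sims, m))        # rows are draws of U ~ N(0, I_m)
q = np.einsum('ij,jk,ik->i', U, C, U)       # quadratic forms U' C U

# A chi-square(r) has mean r and variance 2r; compare with the simulated moments.
print(q.mean(), q.var())
print(chi2(df=r).mean(), chi2(df=r).var())
```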
Recall Markov Inequality:
Let $X$ be a r.v. and let $g(x)$ be a non-negative function.
Then for any $\epsilon > 0$,
$$\Pr\big(g(X) \ge \epsilon\big) \le \frac{E\big[g(X)\big]}{\epsilon}.$$
Proof: note that
$$E\big[g(X)\big] = \int_{-\infty}^{\infty} g(x) f(x)\, dx
\;\ge\; \int_{\{x:\, g(x) \ge \epsilon\}} g(x) f(x)\, dx
\;\ge\; \epsilon \int_{\{x:\, g(x) \ge \epsilon\}} f(x)\, dx
\;=\; \epsilon\, \Pr\big(g(X) \ge \epsilon\big).$$
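A quick numerical check of Markov's inequality, with an arbitrary choice of distribution (exponential), function $g(X) = X^2$, and threshold; this is just a sketch, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)   # draws of X; take g(X) = X**2 >= 0

eps = 4.0
lhs = np.mean(x**2 >= eps)          # estimated Pr(g(X) >= eps)
rhs = np.mean(x**2) / eps           # estimated E[g(X)] / eps

print(lhs, rhs, lhs <= rhs)          # the Markov bound should hold (up to simulation noise)
```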
Recall Chebyshev’s Inequality:
Let
$$g(X) = \frac{(X - \mu)^2}{\sigma^2},$$
where $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$.
Then for any $\epsilon \equiv t^2 > 0$,
$$\Pr\!\left(\frac{(X - \mu)^2}{\sigma^2} \ge t^2\right) \le \frac{1}{t^2}.$$
Hence,
$$\Pr\big(|X - \mu| \ge \sigma t\big) \le \frac{1}{t^2}$$
or
$$\Pr\big(|X - \mu| \ge t\big) \le \frac{\sigma^2}{t^2}.$$
So Chebyshev's inequality says: for example, if $t = 3$, then
$$\Pr\big(|X - \mu| \ge 3\sigma\big) \le \frac{1}{9}.$$
This holds for every distribution with finite variance. It is weaker than the "68-95-99.7" rule of the normal distribution, so the bound is conservative for some distributions. But it holds for every distribution with finite variance.
Chebyshev guarantees only that at least 75% of values lie within two standard deviations of the mean and at least 89% within three standard deviations.
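A sketch (illustration of my own, with arbitrary distribution choices) comparing the Chebyshev bound with the actual tail probability for a normal and an exponential distribution, which shows how conservative the bound can be:

```python
import numpy as np

rng = np.random.default_rng(4)
t = 3.0
samples = {
    "normal(0,1)":    rng.standard_normal(1_000_000),
    "exponential(1)": rng.exponential(1.0, size=1_000_000),
}

for name, x in samples.items():
    mu, sigma = x.mean(), x.std()
    tail = np.mean(np.abs(x - mu) >= t * sigma)   # actual Pr(|X - mu| >= t*sigma)
    print(f"{name:15s} tail = {tail:.4f}   Chebyshev bound = {1 / t**2:.4f}")
```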
Properties of Sample Mean:
- Let $X_1, \dots, X_n$ be a random sample from a distribution with mean $\mu$ and variance $\sigma^2$.
- Let $\bar{X}_n$ denote the sample mean.
- What are the mean and variance of $\bar{X}_n$?
- Can you bound $\Pr\big(|\bar{X}_n - \mu| \ge t\big)$ for every $t > 0$?
Convergence in Probability:
A sequence of RVs $Z_n$ is said to converge in probability to a constant $b$ if for every $\varepsilon > 0$
$$\Pr\big(|Z_n - b| \ge \varepsilon\big) \to 0 \quad\text{as } n \to \infty,$$
or
$$\Pr\big(|Z_n - b| < \varepsilon\big) \to 1 \quad\text{as } n \to \infty.$$
Convergence in probability is denoted by $Z_n \xrightarrow{p} b$ or $Z_n - b \xrightarrow{p} 0$ as $n \to \infty$.
Law of Large Numbers (𝐿𝐿𝑁)
If $Y_i$ are i.i.d., $E(Y_i) = \mu_Y$ and $\mathrm{Var}(Y_i) < \infty$, then
$$\bar{Y} \xrightarrow{p} \mu_Y.$$
As $n$ increases, the sampling distribution of $\bar{Y}$ concentrates around the population mean $\mu_Y$:
1. One feature of the sampling distribution is that the variance of $\bar{Y}$ decreases as $n$ increases.
2. Another feature is that the probability that $\bar{Y}$ falls outside $\pm\delta$ of $\mu_Y$ vanishes as $n$ increases.
Proof of Law of Large Numbers (𝐿𝐿𝑁)
The link between the variance of $\bar{Y}$ and the probability that $\bar{Y}$ is within $\pm\delta$ of $\mu_Y$ is provided by Chebyshev's inequality, which can be written as follows (using current notation):
$$\Pr\big(|\bar{Y} - \mu_Y| \ge \delta\big) \le \frac{\mathrm{var}(\bar{Y})}{\delta^2}$$
for any positive constant $\delta$.
Because $Y_i$ are i.i.d. with variance $\sigma_Y^2$, $\mathrm{var}(\bar{Y}) = \sigma_Y^2/n$; thus,
Proof of Law of Large Numbers (cont'd)
Thus, for any $\delta > 0$,
$$\frac{\mathrm{var}(\bar{Y})}{\delta^2} = \frac{\sigma_Y^2}{n\delta^2} \to 0.$$
Then,
$$\Pr\big(|\bar{Y} - \mu_Y| \ge \delta\big) \to 0$$
for every $\delta > 0$, proving the LLN.
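A simulation sketch of the LLN (my own illustration): sample means from an i.i.d. exponential distribution, chosen arbitrarily, concentrate around $\mu_Y$ as $n$ grows, so the probability of falling outside $\pm\delta$ of the mean shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_Y, delta, reps = 2.0, 0.1, 2_000

for n in (10, 100, 1_000, 10_000):
    # reps independent samples of size n, each reduced to its sample mean
    ybar = rng.exponential(scale=mu_Y, size=(reps, n)).mean(axis=1)
    prob_outside = np.mean(np.abs(ybar - mu_Y) >= delta)
    print(f"n = {n:6d}   Pr(|Ybar - mu| >= {delta}) = {prob_outside:.3f}")
```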
Convergence in Distribution:
If the distributions of a sequence of random variables converge to a limit as $n \to \infty$, then the sequence of random variables is said to converge in distribution.
The CLT (Central Limit Theorem) says that the standardized sample average converges in distribution to a normal random variable.
Convergence in Distribution:
Let $F_1, F_2, \dots, F_n, \dots$ be a sequence of CDFs (Cumulative Distribution Functions) corresponding to a sequence of random variables $S_1, S_2, \dots, S_n, \dots$
For example:
$$S_n = \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}}.$$
Then $S_n$ is said to converge in distribution to $S$ if the distribution functions $F_n$ converge to $F$ (the distribution of $S$):
$$S_n \xrightarrow{d} S \quad\text{iff}\quad \lim_{n \to \infty} F_n = F.$$
This means pointwise convergence of $F_n(x)$ at all points where $F(x)$ is continuous.
The distribution $F$ is called the asymptotic distribution of $S_n$.
Convergence in Probability vs. Convergence in Distribution:
If $S_n \xrightarrow{p} \mu$, then $S_n$ becomes close to $\mu$ with high probability as $n$ increases.
In contrast, if $S_n \xrightarrow{d} S$, then the distribution of $S_n$ becomes close to the distribution of $S$ as $n$ increases.
The Central Limit Theorem (Lindeberg-Levy):
If $Y_i$ are i.i.d. with $\mu_Y = E(Y_i)$ and $\sigma_Y^2 = \mathrm{Var}(Y_i)$, where $0 < \sigma_Y^2 < \infty$, and $\sigma_{\bar{Y}} = \sigma_Y/\sqrt{n}$, then the asymptotic distribution of $(\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}$ is $N(0, 1)$.
Since $\sigma_{\bar{Y}} = \sigma_Y/\sqrt{n}$, the CLT can also be expressed as: the distribution of $\sqrt{n}\,(\bar{Y} - \mu_Y)$ converges to $N(0, \sigma_Y^2)$. Conventional shorthand for this limit is
$$\sqrt{n}\,(\bar{Y} - \mu_Y) \xrightarrow{d} N(0, \sigma_Y^2) \quad\text{as } n \to \infty.$$
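A simulation sketch of the CLT (illustration only; the skewed exponential distribution and sample size are arbitrary choices): the standardized sample mean is compared with standard normal quantiles, assuming scipy is available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
mu_Y, sigma_Y, n = 1.0, 1.0, 500               # exponential(1) has mean 1 and sd 1

Y = rng.exponential(scale=1.0, size=(50_000, n))
Z = np.sqrt(n) * (Y.mean(axis=1) - mu_Y) / sigma_Y   # standardized sample means

for q in (0.05, 0.5, 0.95):
    print(f"quantile {q}: simulated {np.quantile(Z, q):+.3f}   N(0,1) {norm.ppf(q):+.3f}")
```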
Slutsky’s Theorem:
Slutsky’s theorem combines consistency and convergence in distribution.
Suppose that $a_n \xrightarrow{p} a$, where $a$ is a constant, and $S_n \xrightarrow{d} S$. Then
$$a_n + S_n \xrightarrow{d} a + S,$$
$$a_n S_n \xrightarrow{d} aS,$$
and, if $a \ne 0$,
$$\frac{S_n}{a_n} \xrightarrow{d} \frac{S}{a}.$$
Continuous Mapping Theorem:
Continuous mapping theorem concerns the asymptotic properties of a continuous function, 𝑔, of a sequence of random variables, 𝑆â.
The theorem has two parts.
The first is that if $S_n$ converges in probability to a constant $a$, then $g(S_n)$ converges in probability to $g(a)$.
The second is that if $S_n$ converges in distribution to a random variable $S$, then $g(S_n)$ converges in distribution to $g(S)$.
Continuous Mapping Theorem:
That is, if $g$ is a continuous function, then
1. If $S_n \xrightarrow{p} a$, then $g(S_n) \xrightarrow{p} g(a)$.
2. If $S_n \xrightarrow{d} S$, then $g(S_n) \xrightarrow{d} g(S)$.
As an example of 1: if $s_Y^2 \xrightarrow{p} \sigma_Y^2$, then $s_Y \xrightarrow{p} \sigma_Y$, where
$$s_Y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
is the sample variance.
As an example of 2: suppose that $S_n \xrightarrow{d} Z$, where $Z$ is a standard normal random variable, and let $g(S_n) = S_n^2$.
Because $g$ is continuous, the continuous mapping theorem applies and
$$g(S_n) \xrightarrow{d} g(Z),$$
that is,
$$S_n^2 \xrightarrow{d} Z^2.$$
In other words, the distribution of $S_n^2$ converges to the distribution of a squared standard normal random variable, which has a chi-squared distribution with one degree of freedom; that is,
$$S_n^2 \xrightarrow{d} \chi^2_1.$$
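A simulation sketch of this example (my own illustration, assuming scipy is available): take $S_n$ to be the standardized sample mean of uniform draws, which is approximately standard normal by the CLT, square it, and compare with $\chi^2_1$ quantiles.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 500

Y = rng.uniform(0, 1, size=(50_000, n))                        # arbitrary i.i.d. draws
S_n = np.sqrt(n) * (Y.mean(axis=1) - 0.5) / np.sqrt(1 / 12)    # approx N(0,1) for large n
W = S_n ** 2                                                   # g(S_n) = S_n^2

for q in (0.5, 0.9, 0.95):
    print(f"quantile {q}: simulated {np.quantile(W, q):.3f}   chi2(1) {chi2(df=1).ppf(q):.3f}")
```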
CLT for Multivariate Case:
If $\mathbf{x}_1, \dots, \mathbf{x}_n$ are a random sample from a multivariate distribution with finite mean vector $\boldsymbol{\mu}$ and finite positive definite covariance matrix $\mathbf{Q}$, then
$$\sqrt{n}\,(\bar{\mathbf{x}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \mathbf{Q}),$$
where
$$\bar{\mathbf{x}}_n = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i.$$
EXAMPLES: Unbiased vs. Consistent:
Some examples: let $Y_i \sim \text{i.i.d.}(\mu, \sigma^2)$.
1. Estimator 1: $\hat{\mu}_1 = \bar{Y}$. This estimator is both unbiased and consistent. It is unbiased because
$$E(\hat{\mu}_1) = E\!\left(\frac{1}{n}\sum_i Y_i\right) = \frac{1}{n}\sum_i E(Y_i) = \frac{1}{n}(n\mu) = \mu.$$
It is consistent because
$$\Pr\big(|\hat{\mu}_1 - \mu| > \delta\big) \le \frac{E\big[(\bar{Y} - \mu)^2\big]}{\delta^2} = \frac{\mathrm{Var}(\bar{Y})}{\delta^2}.$$
Unbiased vs. Consistent:
It is consistent because
$$\Pr\big(|\hat{\mu}_1 - \mu| > \delta\big) \le \frac{E\big[(\bar{Y} - \mu)^2\big]}{\delta^2} = \frac{\mathrm{Var}(\bar{Y})}{\delta^2}.$$
Note that
$$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_i Y_i\right) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$
and
$$\frac{\sigma^2}{n} \to 0 \quad\text{as } n \to \infty.$$
Thus, $\Pr\big(|\hat{\mu}_1 - \mu| > \delta\big) \to 0$ as $n \to \infty$.
Hence, $\hat{\mu}_1$ is consistent.
Unbiased vs. Consistent:
Some examples: let $Y_i \sim \text{i.i.d.}(\mu, \sigma^2)$.
2. Estimator 2: $\hat{\mu}_2 = Y_1$ (the first observation). This estimator is unbiased but inconsistent. It is unbiased because
$$E(\hat{\mu}_2) = E(Y_1) = \mu.$$
It is inconsistent because
$$\Pr\big(|\hat{\mu}_2 - \mu| > \delta\big) \le \frac{E\big[(Y_1 - \mu)^2\big]}{\delta^2} = \frac{\sigma^2}{\delta^2},$$
and the distribution of $\hat{\mu}_2 - \mu$ does not change as $n \to \infty$, so the probability of falling outside $\pm\delta$ does not tend to zero. Why? The estimator only uses the information in one observation.
Unbiased vs. Consistent:
Some examples: let $Y_i \sim \text{i.i.d.}(\mu, \sigma^2)$.
3. Estimator 3: $\hat{\mu}_3 = \bar{Y} + \frac{1}{n}$. This estimator is biased but consistent.
It is biased because
$$E(\hat{\mu}_3) = E\!\left(\bar{Y} + \frac{1}{n}\right) = \mu + \frac{1}{n} \ne \mu.$$
It is consistent because
$$\bar{Y} \xrightarrow{p} \mu \quad\text{and}\quad \frac{1}{n} \to 0 \quad\text{as } n \to \infty.$$
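A simulation sketch contrasting the three estimators above ($\hat{\mu}_1 = \bar{Y}$, $\hat{\mu}_2 = Y_1$, $\hat{\mu}_3 = \bar{Y} + 1/n$); the normal distribution and the parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, reps = 5.0, 2.0, 20_000

for n in (10, 1_000):
    Y = rng.normal(mu, sigma, size=(reps, n))
    mu1 = Y.mean(axis=1)            # sample mean: unbiased and consistent
    mu2 = Y[:, 0]                   # first observation: unbiased but inconsistent
    mu3 = Y.mean(axis=1) + 1 / n    # biased (by 1/n) but consistent
    for name, est in (("mu1", mu1), ("mu2", mu2), ("mu3", mu3)):
        print(f"n={n:5d} {name}: mean={est.mean():.3f}  "
              f"Pr(|est - mu| > 0.5)={np.mean(np.abs(est - mu) > 0.5):.3f}")
```

As $n$ grows, the estimated mean of $\hat{\mu}_2$ stays at $\mu$ but its spread does not shrink, while $\hat{\mu}_3$ carries a small bias of $1/n$ that vanishes together with its spread.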