
Econometrics (M.Sc.)
Prof. Dr. Dominik Liebl
2021-02-07

Contents

Preface
   Organization of the Course
   Literature
1 Review: Probability and Statistics
   1.1 Probability Theory
   1.2 Random Variables
2 Review: Simple Linear Regression
   2.1 The Simple Linear Regression Model
   2.2 Ordinary Least Squares Estimation
   2.3 Properties of the OLS Estimator
3 Multiple Linear Regression
   3.1 Assumptions
   3.2 Deriving the Expression of the OLS Estimator
   3.3 Some Quantities of Interest
   3.4 Method of Moments Estimator
   3.5 Unbiasedness of $\hat{\beta}|X$
   3.6 Variance of $\hat{\beta}|X$
   3.7 The Gauss-Markov Theorem
4 Small Sample Inference
   4.1 Hypothesis Tests about Multiple Parameters
   4.2 Tests about One Parameter
   4.3 Test Theory
   4.4 Type II Error and Power
   4.5 p-Value
   4.6 Confidence Intervals
   4.7 Practice: Small Sample Inference
5 Large Sample Inference
   5.1 Tools for Asymptotic Statistics
   5.2 Asymptotics under the Classic Regression Model
   5.3 Robust Confidence Intervals
   5.4 Practice: Large Sample Inference
6 Maximum Likelihood
   6.1 Likelihood Principle
   6.2 Properties of Maximum Likelihood Estimators
   6.3 The (Log-)Likelihood Function
   6.4 Optimization: Non-Analytical Solutions
   6.5 OLS-Estimation as ML-Estimation
   6.6 Variance of ML-Estimators $\hat{\beta}_{ML}$ and $s^2_{ML}$
   6.7 Consistency of $\hat{\beta}_{ML}$ and $s^2_{ML}$
   6.8 Asymptotic Theory of Maximum-Likelihood Estimators
7 Instrumental Variables Regression
   7.1 The IV Estimator with a Single Regressor and a Single Instrument
   7.2 The General IV Regression Model
   7.3 Checking Instrument Validity
   7.4 Application to the Demand for Cigarettes
Bibliography

Preface
Organization of the Course
• Etherpad: Throughout the course, I will use the following etherpad for collecting important links and further information (Zoom-meeting link etc.): https://etherpad.wikimedia.org/p/8e2yF6X6FCbqep2AxW1b
• eWhiteboard: Besides this script, I will use an eWhiteboard during the lecture. This material will be saved as pdf-files and provided as accompanying lecture materials.
• Lecture Materials: You can find all lecture materials at eCampus or directly at sciebo: https://uni-bonn.sciebo.de/s/iOtkEDajrqWKT8v
Literature
• A guide to modern econometrics, by M. Verbeek
• Introduction to econometrics, by J. Stock and M.W. Watson
– E-Book: https://bonnus.ulb.uni-bonn.de/SummonRecord/FETCH-bonn_catalog_45089983
• Econometric theory and methods, by R. Davidson and J.G. MacKinnon
• A primer in econometric theory, by J. Stachurski
• Econometrics, by F. Hayashi

Chapter 1
Review: Probability and Statistics
1.1 Probability Theory
Probability is the mathematical language for quantifying uncertainty. We can apply probability theory to a diverse set of problems, from coin flipping to the analysis of econometric problems. The starting point is to specify the sample space, that is, the set of possible outcomes.
1.1.1 Sample Spaces and Events
The sample space $\Omega$ is the set of possible outcomes of an experiment. Points $\omega$ in $\Omega$ are called sample outcomes or realizations. Events are subsets of $\Omega$.
Example: If we toss a coin twice then $\Omega = \{HH, HT, TH, TT\}$. The event that the first toss is heads is $A = \{HH, HT\}$.
Example: Let $\omega$ be the outcome of a measurement of some physical quantity, for example, temperature. Then $\Omega = \mathbb{R} = (-\infty, \infty)$. The event that the measurement is larger than 10 but less than or equal to 23 is $A = (10, 23]$.
Example: If we toss a coin forever, then the sample space is the infinite set
$$\Omega = \{\omega = (\omega_1, \omega_2, \omega_3, \dots) \,:\, \omega_i \in \{H, T\}\}.$$
Let $A$ be the event that the first head appears on the third toss. Then
$$A = \{(\omega_1, \omega_2, \omega_3, \dots) \,:\, \omega_1 = T, \omega_2 = T, \omega_3 = H, \omega_i \in \{H, T\} \text{ for } i > 3\}.$$
Given an event $A$, let $A^c = \{\omega \in \Omega : \omega \notin A\}$ denote the complement of $A$. Informally, $A^c$ can be read as "not $A$." The complement of $\Omega$ is the empty set $\emptyset$. The union of events $A$ and $B$ is defined as
$$A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B \text{ or } \omega \in \text{both}\},$$
which can be thought of as "$A$ or $B$." If $A_1, A_2, \dots$ is a sequence of sets then
$$\bigcup_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for at least one } i\}.$$
The intersection of $A$ and $B$ is defined as
$$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\},$$
which reads as "$A$ and $B$." Sometimes $A \cap B$ is also written shortly as $AB$. If $A_1, A_2, \dots$ is a sequence of sets then
$$\bigcap_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for all } i\}.$$
If every element of $A$ is also contained in $B$ we write $A \subset B$ or, equivalently, $B \supset A$. If $A$ is a finite set, let $|A|$ denote the number of elements in $A$. We say that $A_1, A_2, \dots$ are disjoint or mutually exclusive if $A_i \cap A_j = \emptyset$ whenever $i \ne j$. For example, $A_1 = [0,1), A_2 = [1,2), A_3 = [2,3), \dots$ are disjoint. A partition of $\Omega$ is a sequence of disjoint sets $A_1, A_2, \dots$ such that
$$\bigcup_{i=1}^{\infty} A_i = \Omega.$$

Summary: Sample space and events

$\Omega$           sample space
$\omega$           outcome
$A$                event (subset of $\Omega$)
$|A|$              number of points in $A$ (if $A$ is finite)
$A^c$              complement of $A$ (not $A$)
$A \cup B$         union ($A$ or $B$)
$A \cap B$         intersection ($A$ and $B$); short notation: $AB$
$A \subset B$      set inclusion ($A$ is a subset of or equal to $B$)
$\emptyset$        null event (always false)
$\Omega$           true event (always true)
1.1.2 Probability
We want to assign a real number $P(A)$ to every event $A$, called the probability of $A$. We also call $P$ a probability distribution or a probability measure. To qualify as a probability, $P$ has to satisfy three axioms. That is, a function $P$ that assigns a real number $P(A) \in [0,1]$ to each event $A$ is a probability distribution or a probability measure if it satisfies the following three axioms:
Axiom 1: $P(A) \ge 0$ for every $A$
Axiom 2: $P(\Omega) = 1$
Axiom 3: If $A_1, A_2, \dots$ are disjoint then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
Note: It is not always possible to assign a probability to every event $A$ if the sample space is large, such as, for instance, the whole real line, $\Omega = \mathbb{R}$. In case of $\Omega = \mathbb{R}$ strange things can happen. There are pathological sets that simply break down the mathematics. An example of these pathological sets, also known as non-measurable sets because they literally can't be measured (i.e. we cannot assign probabilities to them), are the Vitali sets. Therefore, in such cases like $\Omega = \mathbb{R}$, we assign probabilities to a limited class of sets called a $\sigma$-field or $\sigma$-algebra. For $\Omega = \mathbb{R}$, the canonical $\sigma$-algebra is the Borel $\sigma$-algebra. The Borel $\sigma$-algebra on $\mathbb{R}$ is generated by the collection of all open subsets of $\mathbb{R}$.
One can derive many properties of P from the axioms. Here are a few:
• $P(\emptyset) = 0$
• $A \subset B \Rightarrow P(A) \le P(B)$
• $0 \le P(A) \le 1$
• $P(A^c) = 1 - P(A)$
• $A \cap B = \emptyset \Rightarrow P(A \cup B) = P(A) + P(B)$
A less obvious property is the following: for any events $A$ and $B$ we have that
$$P(A \cup B) = P(A) + P(B) - P(AB).$$
Example. Two coin tosses. Let H1 be the event that heads occurs on toss 1 and let H2 be the event that heads occurs on toss 2. If all outcomes are equally likely, that is, P({H1,H2}) = P({H1,T2}) = P({T1,H2}) = P ({T1, T2}) = 1/4, then
$$P(H_1 \cup H_2) = P(H_1) + P(H_2) - P(H_1 H_2) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}.$$

Probabilities as frequencies. One can interpret $P(A)$ in terms of frequencies. That is, $P(A)$ is the (infinitely) long run proportion of times that $A$ is true in repetitions. For example, if we say that the probability of heads is $1/2$, i.e. $P(H) = 1/2$, we mean that if we flip the coin many times then the proportion of times we get heads tends to $1/2$ as the number of tosses increases. An infinitely long, unpredictable sequence of tosses whose limiting proportion tends to a constant is an idealization, much like the idea of a straight line in geometry.
The following R code approximates the probability $P(H) = 1/2$ using 1, 10 and 100,000 (pseudo) random coin flips:
set.seed(869)
## 1 (fair) coin-flip:
results <- sample(x = c("H", "T"), size = 1)
## Relative frequency of "H" in 1 coin-flips
length(results[results=="H"])/1
#> [1] 1
## 10 (fair) coin-flips:
results <- sample(x = c("H", "T"), size = 10, replace = TRUE)
## Relative frequency of "H" in 10 coin-flips
length(results[results=="H"])/10
#> [1] 0.3
## 100000 (fair) coin-flips:
results <- sample(x = c("H", "T"), size = 100000, replace = T)
## Relative frequency of "H" in 100000 coin-flips
length(results[results=="H"])/100000
#> [1] 0.50189

1.1.3 Independent Events
If we flip a fair coin twice, then the probability of two heads is $\frac{1}{2} \times \frac{1}{2}$. We multiply the probabilities because we regard the two tosses as independent. Two events $A$ and $B$ are called independent if
P(AB) = P(A)P(B).
Or more generally, a whole set of events $\{A_i : i \in I\}$ is independent if
$$P\left(\bigcap_{i \in J} A_i\right) = \prod_{i \in J} P(A_i)$$
for every finite subset $J$ of $I$, where $I$ denotes the not necessarily finite index set (e.g. $I = \{1, 2, \dots\}$).
Independence can arise in two distinct ways. Sometimes, we explicitly assume that two events are independent. For example, in tossing a coin twice, we usually assume the tosses are independent which reflects the fact that the coin has no memory of the first toss.
In other instances, we derive independence by verifying that the definition of independence $P(AB) = P(A)P(B)$ holds. For example, in tossing a fair die once, let $A = \{2, 4, 6\}$ be the event of observing an even number and let $B = \{1, 2, 3, 4\}$ be the event of observing no 5 and no 6. Then $A \cap B = \{2, 4\}$ is the event of observing either a 2 or a 4. Are the events $A$ and $B$ independent?
$$P(AB) = \frac{2}{6} = P(A)P(B) = \frac{1}{2} \cdot \frac{2}{3},$$
and so $A$ and $B$ are independent. In this case, we didn't assume that $A$ and $B$ are independent; it just turned out that they were.
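This kind of calculation is easy to double-check by enumerating the six equally likely outcomes. The following short R sketch is not part of the original example; the event definitions simply mirror the ones above:

## Enumerate the sample space of one fair die roll
omega <- 1:6
A <- c(2, 4, 6)      # even number
B <- c(1, 2, 3, 4)   # neither 5 nor 6
P <- function(E) length(E) / length(omega)  # equally likely outcomes
P(intersect(A, B))   # P(AB)     = 2/6
P(A) * P(B)          # P(A)P(B)  = (1/2)*(2/3) = 2/6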

Cautionary Notes. Suppose that $A$ and $B$ are disjoint events (i.e. $AB = \emptyset$), each with positive probability (i.e. $P(A) > 0$ and $P(B) > 0$). Can they be independent? No! This follows since $P(AB) = P(\emptyset) = 0 \ne P(A)P(B) > 0$.
Except in this special case, there is no way to judge (in-)dependence by looking at the sets in a Venn diagram.
Summary: Independence
1. A and B are independent if P (AB) = P (A)P (B).
2. Independence is sometimes assumed and sometimes derived.
3. Disjoint events with positive probability are not independent.
1.1.4 Conditional Probability
If P(B) > 0 then the conditional probability of A given B is
$$P(A \mid B) = \frac{P(AB)}{P(B)}.$$
Think of P(A | B) as the fraction of times A occurs among those in which B occurs. Here are some facts about conditional probabilities:
• The rules of probability apply to events on the left of the bar "|". That is, for any fixed $B$ such that $P(B) > 0$, $P(\cdot \mid B)$ is a probability, i.e. it satisfies the three axioms of probability: $P(A \mid B) \ge 0$, $P(\Omega \mid B) = 1$, and if $A_1, A_2, \dots$ are disjoint then $P\left(\bigcup_{i=1}^{\infty} A_i \mid B\right) = \sum_{i=1}^{\infty} P(A_i \mid B)$.
• But it is generally not true that $P(A \mid B \cup C) = P(A \mid B) + P(A \mid C)$.

In general it is also not the case that $P(A \mid B) = P(B \mid A)$. People get this confused all the time. For example, the probability of spots given you have measles is 1, but the probability that you have measles given that you have spots is not 1. In this case, the difference between $P(A \mid B)$ and $P(B \mid A)$ is obvious, but there are cases where it is less obvious. This mistake is made often enough in legal cases that it is sometimes called the "prosecutor's fallacy".
Example. A medical test for a disease $D$ has outcomes $+$ and $-$. The joint probabilities are:

           $D$       $D^c$     Total
  $+$      0.0081    0.0900    0.0981
  $-$      0.0009    0.9010    0.9019
  Total    0.0090    0.9910    1
From the definition of conditional probability, we have that:
• Sensitivity of the test:
$$P(+ \mid D) = P(+ \cap D)/P(D) = 0.0081/(0.0081 + 0.0009) = 0.9$$
• Specificity of the test:
$$P(- \mid D^c) = P(- \cap D^c)/P(D^c) = 0.9010/(0.9010 + 0.0900) \approx 0.9$$
Apparently, the test is fairly accurate. Sick people yield a positive test result 90 percent of the time and healthy people yield a negative test result about 90 percent of the time. Suppose you go for a test and get a positive result. What is the probability you have the disease? Most people answer $0.90 = 90\%$. The correct answer is $P(D \mid +) = P(+ \cap D)/P(+) = 0.0081/(0.0081 + 0.0900) = 0.08$. The lesson here is that you need to compute the answer numerically. Don't trust your intuition.
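The numbers above can be reproduced directly from the joint probability table. The following R sketch is added for illustration; the matrix simply stores the four joint probabilities from the table:

## Joint probabilities from the table above
joint <- matrix(c(0.0081, 0.0900,
                  0.0009, 0.9010),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("+", "-"), c("D", "Dc")))
joint["+", "D"] / sum(joint[, "D"])    # sensitivity  P(+ | D)   = 0.9
joint["-", "Dc"] / sum(joint[, "Dc"])  # specificity  P(- | Dc) ~ 0.909
joint["+", "D"] / sum(joint["+", ])    # P(D | +)                ~ 0.083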

If A and B are independent events then
$$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$$
So another interpretation of independence is that knowing B doesn’t
change the probability of A.
From the definition of conditional probability we can write
$$P(AB) = P(A \mid B)P(B) \quad\text{and also}\quad P(AB) = P(B \mid A)P(A).$$
Often, these formulas give us a convenient way to compute $P(AB)$ when $A$ and $B$ are not independent.
Note, sometimes P(AB) is written as P(A,B).
Example. Draw two cards from a deck, without replacement. Let $A$ be the event that the first draw is the Ace of Clubs and let $B$ be the event that the second draw is the Queen of Diamonds. Then $P(A, B) = P(A)P(B \mid A) = (1/52) \times (1/51)$.
Summary: Conditional Probability
1. If P(B) > 0 then P(A | B) = P(AB)/P(B)
2. P(· | B) satisfies the axioms of probability, for fixed B. In general,
P (A | ·) does not satisfy the axioms of probability, for fixed A.
3. In general, $P(A \mid B) \ne P(B \mid A)$.
4. A and B are independent if and only if P(A | B) = P(A).

1.2 Random Variables
Statistics and econometrics are concerned with data. How do we link sample spaces, events and probabilities to data? The link is provided by the concept of a random variable. A real-valued random variable is a mapping $X : \Omega \to \mathbb{R}$ that assigns a real number $X(\omega) \in \mathbb{R}$ to each outcome $\omega$.
At a certain point in most statistics/econometrics courses, the sample space, $\Omega$, is rarely mentioned and we work directly with random variables. But you should keep in mind that the sample space is really there, lurking in the background.
Example. Flip a coin ten times. Let $X(\omega)$ be the number of heads in the sequence $\omega$. For example, if $\omega = HHTHHTHHTT$ then $X(\omega) = 6$.
Example. Let $\Omega = \{(x, y) : x^2 + y^2 \le 1\}$ be the unit disc. Consider drawing a point "at random" from $\Omega$. A typical outcome is then of the form $\omega = (x, y)$. Some examples of random variables are $X(\omega) = x$, $Y(\omega) = y$, $Z(\omega) = x + y$, $W(\omega) = \sqrt{x^2 + y^2}$.
Given a real-valued random variable $X \in \mathbb{R}$ and a subset $A$ of the real line ($A \subset \mathbb{R}$), define $X^{-1}(A) = \{\omega \in \Omega : X(\omega) \in A\}$. This allows us to link the probabilities on the random variable $X$, i.e. the probabilities we are usually working with, to the underlying probabilities on the events, i.e. the probabilities lurking in the background.
Example. Flip a coin twice and let $X$ be the number of heads. Then $P_X(X = 0) = P(\{TT\}) = 1/4$, $P_X(X = 1) = P(\{HT, TH\}) = 1/2$ and $P_X(X = 2) = P(\{HH\}) = 1/4$. Thus, the events and their associated probability distribution, $P$, and the random variable $X$ and its distribution, $P_X$, can be summarized as follows:

  $\omega$   $P(\{\omega\})$   $X(\omega)$
  TT         1/4               0
  TH         1/4               1
  HT         1/4               1
  HH         1/4               2

  $x$   $P_X(X = x)$
  0     1/4
  1     1/2
  2     1/4

Here, $P_X$ is not the same probability function as $P$ because $P$ maps from the sample space events, $\omega$, to $[0,1]$, while $P_X$ maps from the random-variable events, $X(\omega)$, to $[0,1]$. We will typically forget about the sample space and just think of the random variable as an experiment with real-valued (possibly multivariate) outcomes. We will therefore write $P(X = x_k)$ instead of $P_X(X = x_k)$ to simplify the notation.
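To connect this example with the simulation approach used earlier, the following R sketch (added here for illustration) approximates the distribution $P_X$ of the number of heads in two fair coin flips; the relative frequencies should be close to $1/4$, $1/2$, $1/4$:

set.seed(1)
n_sim <- 100000
## Simulate two fair coin flips n_sim times; X = number of heads
X <- replicate(n_sim, sum(sample(c(0, 1), size = 2, replace = TRUE)))
table(X) / n_sim   # relative frequencies of x = 0, 1, 2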
1.2.1 Univariate Distribution and Probability Functions
1.2.1.1 Cumulative Distribution Function
The cumulative distribution function (cdf)
$$F_X : \mathbb{R} \to [0,1]$$
of a real-valued random variable $X \in \mathbb{R}$ is defined by
$$F_X(x) = P(X \le x).$$
You might wonder why we bother to define the cdf. The reason is that it effectively contains all the information about the random variable. Indeed, let $X \in \mathbb{R}$ have cdf $F$ and let $Y \in \mathbb{R}$ have cdf $G$. If $F(x) = G(x)$ for all $x \in \mathbb{R}$, then $P(X \in A) = P(Y \in A)$ for all $A \subset \mathbb{R}$. In order to denote that two random variables, here $X$ and $Y$, have the same distribution, one can write shortly $X \overset{d}{=} Y$.
Caution: Equality in distribution, $X \overset{d}{=} Y$, does generally not mean equality in realizations, that is, $X \overset{d}{=} Y \not\Rightarrow X(\omega) = Y(\omega)$ for all $\omega \in \Omega$.
The defining properties of a cdf. A function $F$ mapping the real line to $[0,1]$, short $F : \mathbb{R} \to [0,1]$, is called a cdf for some probability measure $P$ if and only if it satisfies the following three properties:
1. $F$ is non-decreasing, i.e. $x_1 < x_2$ implies that $F(x_1) \le F(x_2)$.
2. $F$ is normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
3. $F$ is right-continuous, i.e. $F(x) = F(x^+)$ for all $x$, where $F(x^+) = \lim_{y \to x,\, y > x} F(y)$.
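As a small illustration of how a cdf encodes probabilities (this example is an addition, using a normal random variable and R's pnorm() as the cdf), $P(a < X \le b)$ can be computed as $F(b) - F(a)$ and checked against simulated relative frequencies:

a <- -1; b <- 1
## P(a < X <= b) via the cdf: F(b) - F(a)
pnorm(b) - pnorm(a)
## Monte Carlo check using simulated realizations
set.seed(1)
x <- rnorm(100000)
mean(x > a & x <= b)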
Alternatively to cumulative distribution functions, one can use probability (mass) functions in order to describe the probability law of discrete random variables and density functions in order to describe the probability law of continuous random variables.
1.2.1.2 Probability Functions for Discrete Random Variables.
A random variable $X$ is discrete if it takes only countably many values $X \in \{x_1, x_2, \dots\}$.
For instance, $X \in \{1, 2, 3\}$ or $X \in \{2, 4, 6, \dots\}$ or $X \in \mathbb{Z}$ or $X \in \mathbb{Q}$.

We define the probability function or probability mass function (pmf) for X by
$$f_X(x) = P(X = x) \quad\text{for all } x \in \{x_1, x_2, \dots\}.$$
1.2.1.3 Density Functions for Continuous Random Variables.
A random variable $X$ is continuous if there exists a function $f_X$ such that
1. $f_X(x) \ge 0$ for all $x$,
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, and
3. $P(a < X < b) = \int_a^b f_X(x)\,dx$ for every $a \le b$.

• The $k$th, $k > 0$, moment is given by
$$\mu_k = E\left[X^k\right] = \int_{-\infty}^{+\infty} x^k f_X(x)\,dx.$$

• The $k$th, $k > 1$, central moment is given by
$$\mu_k^c = E\left[(X - E[X])^k\right] = \int_{-\infty}^{+\infty} (x - \mu)^k f_X(x)\,dx,$$
where $\mu = E(X)$.
Note. Moments determine the tail of a distribution (but not much else); see Lindsay and Basak (2000). Roughly: the more moments a distribution has, the faster its tails converge to zero. Distributions with compact support (e.g. the uniform distribution $U[a,b]$) have infinitely many moments. The Normal distribution also has infinitely many moments – even though it does not have compact support, since $\phi(x) > 0$ for all $x \in \mathbb{R}$.
1.2.6.1 Law of Total Expectation
As long as we do not fix the values of the conditioning variables, $X_2, \dots, X_d$, they are random variables. Consequently, the conditional mean is generally itself a random variable,
$$E(X_1 | X_2, \dots, X_d) = \int x_1 f(x_1 \mid X_2, \dots, X_d)\,dx_1.$$
Note that $f(x_1 \mid X_2, \dots, X_d)$ is just a transformation of the random variables $X_2, \dots, X_d$. So we can easily compute the unconditional mean $E(X_1)$ by taking the mean of $E(X_1 | X_2, \dots, X_d)$ as follows,
$$E\big(E(X_1 | X_2, \dots, X_d)\big) = \int \cdots \int \left(\int x_1 f(x_1 \mid x_2, \dots, x_d)\,dx_1\right) f_{X_2, \dots, X_d}(x_2, \dots, x_d)\,dx_2 \dots dx_d$$
$$= \int x_1 \left(\int \cdots \int f(x_1, x_2, \dots, x_d)\,dx_2 \dots dx_d\right) dx_1 = \int x_1 f_{X_1}(x_1)\,dx_1 = E(X_1).$$

The result that $E\big(E(X_1 | X_2, \dots, X_d)\big) = E(X_1)$ is called the law of total expectation or the law of iterated expectations.
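A small simulation can make this law concrete. The following R sketch is purely illustrative (the data generating process and variable names are chosen only for this example); it averages the conditional means $E(X_1 | X_2 = x_2)$ with weights $P(X_2 = x_2)$ and compares the result with the overall mean of $X_1$:

set.seed(1)
n  <- 100000
X2 <- sample(1:3, size = n, replace = TRUE)  # discrete conditioning variable
X1 <- 2 * X2 + rnorm(n)                      # X1 depends on X2
cond_means <- tapply(X1, X2, mean)           # E(X1 | X2 = x2) for x2 = 1, 2, 3
weights    <- table(X2) / n                  # P(X2 = x2)
sum(cond_means * weights)                    # E(E(X1 | X2))
mean(X1)                                     # E(X1)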
1.2.7 Independent Random Variables
Random variables $X_1, \dots, X_d$ are mutually independent if for all $x = (x_1, \dots, x_d)'$ it is true that
$$F(x_1, \dots, x_d) = F_1(x_1) \cdot F_2(x_2) \cdot \ldots \cdot F_d(x_d)$$
$$f(x_1, \dots, x_d) = f_1(x_1) \cdot f_2(x_2) \cdot \ldots \cdot f_d(x_d).$$
The following holds true:
• Two real-valued random variables $X$ and $Y$ are independent from each other if and only if the marginal density of $X$ equals the conditional density of $X$ given $Y = y$ for all $y \in \mathbb{R}$,
$$f_X(x) = f_{X|Y}(x \mid y) \quad\text{for all } y \in \mathbb{R}.$$
Of course, the same statement applies to the marginal density of $Y$ given $X = x$ for all $x \in \mathbb{R}$. That is, $X$ and $Y$ are two independent real-valued random variables if and only if $f_Y(y) = f_{Y|X}(y \mid x)$ for all $x \in \mathbb{R}$.
• If a real-valued random variable $X$ is independent from a real-valued random variable $Y$, then the conditional mean of $X$ given $Y = y$ equals the unconditional mean of $X$ for all $y \in \mathbb{R}$ (i.e. the regression function becomes a constant),
$$E(X | Y = y) = E(X) \quad\text{for all } y \in \mathbb{R}.$$
Of course, the same statement applies to the conditional mean of $Y$ given $X = x$ for all $x \in \mathbb{R}$; i.e., if $X$ and $Y$ are two independent random variables, then $E(Y | X = x) = E(Y)$ for all $x \in \mathbb{R}$.

Cautionary note. The properties that $E(X | Y = y) = E(X)$ for all $y \in \mathbb{R}$ or that $E(Y | X = x) = E(Y)$ for all $x \in \mathbb{R}$ do not imply that $Y$ and $X$ are independent.
1.2.8 I.I.D. Samples
Tradition dictates that the sample size is denoted by the natural number $n \in \{1, 2, \dots\}$. A random sample is a collection $X = (X_1, \dots, X_n)$ of random variables $X_1, \dots, X_n$. If $X_1, \dots, X_n$ are all independent from each other and if each random variable has the same marginal distribution, we say that the random sample
X = (X1, . . . , Xn) is i.i.d. (independent and identically distributed).
1.2.9 Some Important Discrete Random Variables
1.2.9.1 The Discrete Uniform Distribution
Let $k > 1$ be a given integer. Suppose that $X$ has probability mass function given by
$$f(x) = \begin{cases} 1/k & \text{for } x = 1, \dots, k \\ 0 & \text{otherwise.} \end{cases}$$
We say that $X$ has a uniform distribution on $\{1, \dots, k\}$.
set.seed(51)
## Set the parameter k
k <- 10
## Draw one realization from the discrete uniform distribution
sample(x = 1:k, size = 1, replace = TRUE)
#> [1] 7

1.2.9.2 The Bernoulli Distribution
Let $X$ represent a possibly unfair coin flip. Then $P(X = 1) = p$ and $P(X = 0) = 1 - p$ for some $p \in [0,1]$. We say that $X$ has a Bernoulli distribution, written $X \sim \text{Bernoulli}(p)$. The probability function is $f(x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$.
set.seed(51)
## Set the parameter p
p <- 0.25
## Draw n realizations from the Bernoulli distribution
n <- 5
sample(x = c(0,1), size = n, prob = c(1-p, p), replace=TRUE)
#> [1] 1 0 0 1 0
## Alternatively:
## (Bernoulli(p) equals Binomial(1,p))
rbinom(n = n, size = 1, prob = p)
#> [1] 1 1 0 1 0
1.2.9.3 The Binomial Distribution
Suppose we have a coin which falls heads with probability p for some p œ [0, 1]. Flip the coin n times and let X be the number of heads (or successes). Assume that the tosses are independent. Let f(x) = P(X = x) be the mass function.
It can be shown that
$$f(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & \text{for } x = 0, \dots, n \\ 0 & \text{otherwise.} \end{cases}$$
A random variable with this mass function is called a binomial random variable and we write $X \sim \text{Binomial}(n, p)$. If $X_1 \sim \text{Binomial}(n_1, p)$ and $X_2 \sim \text{Binomial}(n_2, p)$ and if $X_1$ and $X_2$ are independent, then $X_1 + X_2 \sim \text{Binomial}(n_1 + n_2, p)$.
set.seed(51)
## Set the parameters n and p
size <- 10  # number of trials
p <- 0.25   # prob of success
## Draw n realizations from the binomial distribution:
n <- 5
rbinom(n = n, size = size, prob = p)
#> [1] 4 1 2 6 1
1.2.10 Some Important Continuous Random Variables
1.2.10.1 The Uniform Distribution
$X$ has a Uniform$(a, b)$ distribution, written $X \sim \text{Uniform}(a, b)$, if
$$f(x) = \begin{cases} \frac{1}{b-a} & \text{for } x \in [a,b] \\ 0 & \text{otherwise,} \end{cases}$$
where $a < b$. The distribution function is
$$F(x) = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & x \in [a,b] \\ 1 & x > b. \end{cases}$$

## Drawing from the uniform distribution:
n <- 10
a <- 0
b <- 1
runif(n = n, min = a, max = b)
#> [1] 0.83442365 0.75138318 0.40601047 0.97101998 0.11233151
#> [6] 0.50750617 0.69714201 0.17104008 0.25448233 0.01813812
1.2.10.2 The Normal (or Gaussian) Distribution
$X$ has a Normal (or Gaussian) distribution with parameters $\mu$ and $\sigma$, denoted by $X \sim N(\mu, \sigma^2)$, if
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}, \quad x \in \mathbb{R},$$
where $\mu \in \mathbb{R}$ and $\sigma > 0$. Later we shall see that $\mu$ is the "center" (or mean) of the distribution and $\sigma$ is the "spread" (or standard deviation) of the distribution. The Normal plays an important role in probability and statistics. Many phenomena in nature have approximately Normal distributions. The Central Limit Theorem gives a special role to the Normal distribution by stating that the distribution of averages of random variables can be approximated by a Normal distribution.
We say that $X$ has a standard Normal distribution if $\mu = 0$ and $\sigma = 1$. Tradition dictates that a standard Normal random variable is denoted by $Z$. The PDF and CDF of a standard Normal are denoted by $\phi(z)$ and $\Phi(z)$. There is no closed-form expression for $\Phi$. Here are some useful facts:
(i) If $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$ (see the numerical check below).
(ii) If $Z \sim N(0, 1)$ then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$.
(iii) If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, \dots, n$, are independent, then
$$\sum_{i=1}^n X_i \sim N\left(\sum_{i=1}^n \mu_i,\ \sum_{i=1}^n \sigma_i^2\right).$$
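Property (i) can be checked numerically; the following two lines (an added illustration with arbitrary example values) compare the cdf of $X \sim N(\mu, \sigma^2)$ with the standard normal cdf evaluated at the standardized value:

mu <- 2; sigma <- 3; x <- 4
pnorm(x, mean = mu, sd = sigma)   # P(X <= x) for X ~ N(mu, sigma^2)
pnorm((x - mu) / sigma)           # P(Z <= (x - mu)/sigma) for Z ~ N(0,1)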

The following R code plots the standard Normal density function:
# draw a plot of the N(0,1) PDF
curve(dnorm(x),
xlim = c(-3.5, 3.5),
      ylab = "Density",
      main = "Standard Normal Density Function")
This is how you can draw realizations of pseudo-random Normal variables:
## Drawing from the normal distribution:
n <- 12
mu <- 0
sigma <- 1
rnorm(n = n, mean = mu, sd = sigma)
#> [1]  0.085602504 -0.695791615 -1.364410561 -0.183503290
#> [5] -1.675347076  0.007303551  0.346965187  0.037914318
#> [9]  0.881345676 -0.882815597 -0.883560071 -0.795629557
An extension of the normal distribution in a univariate setting is the multivariate normal distribution. Let $X = (X_1, \dots, X_k)'$ be a $k$-dimensional normal variable, short $X \sim N_k(\mu, \Sigma)$, with mean vector $E(X) = \mu \in \mathbb{R}^k$ and covariance matrix $\operatorname{Cov}(X) = \Sigma$. The joint density function or probability density function (pdf) of the $k$-dimensional multivariate normal distribution is
$$f_X(x_1, \dots, x_k) = \frac{\exp\left(-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right)}{\sqrt{(2\pi)^k |\Sigma|}},$$
where $|\Sigma|$ denotes the determinant of $\Sigma$. For $k = 2$ we have the bivariate pdf of two random normal variables, $X$ and $Y$ say,
$$g_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho_{XY}^2}} \cdot \exp\left\{-\frac{1}{2(1-\rho_{XY}^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho_{XY}\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]\right\}.$$
Let's consider the special case where $X$ and $Y$ are independent standard normal random variables with densities $f_X(x)$ and $f_Y(y)$. We then have the parameters $\sigma_X = \sigma_Y = 1$, $\mu_X = \mu_Y = 0$ (due to marginal standard normality) and correlation $\rho_{XY} = 0$ (due to independence). The joint density of $X$ and $Y$ then becomes
$$g_{X,Y}(x, y) = f_X(x) f_Y(y) = \frac{1}{2\pi} \cdot \exp\left\{-\frac{1}{2}\left[x^2 + y^2\right]\right\}.$$
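This simplification can be verified numerically. The following R sketch (added for illustration) implements the general bivariate density from above as a small helper function g_xy() — the function name and the evaluation point are of course only illustrative — and compares it, for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$ and $\rho_{XY} = 0$, with the product of the two marginal densities:

## General bivariate normal density (parameters as defined above)
g_xy <- function(x, y, mu_x = 0, mu_y = 0, s_x = 1, s_y = 1, rho = 0) {
  z <- ((x - mu_x)/s_x)^2 -
       2*rho*((x - mu_x)/s_x)*((y - mu_y)/s_y) +
       ((y - mu_y)/s_y)^2
  1 / (2*pi*s_x*s_y*sqrt(1 - rho^2)) * exp(-z / (2*(1 - rho^2)))
}
g_xy(0.5, -1)             # joint density at (0.5, -1)
dnorm(0.5) * dnorm(-1)    # product of the marginal densities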
1.2.10.3 The Chi-Squared Distribution
The chi-squared distribution is another distribution relevant in econometrics. It is often needed when testing special types of hypotheses frequently encountered when dealing with regression models.
The sum of $M$ squared independent standard normally distributed random variables $Z_1, \dots, Z_M$ follows a chi-squared distribution with $M$ degrees of freedom:
$$Z_1^2 + \dots + Z_M^2 = \sum_{m=1}^M Z_m^2 \sim \chi^2_M.$$
A $\chi^2$ distributed random variable with $M$ degrees of freedom has expectation $M$, mode at $M - 2$ for $M \ge 2$, and variance $2M$.
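These moments are easy to check by simulation. The following R sketch (illustrative values for M and the number of draws) generates $\chi^2_M$ realizations as sums of $M$ squared standard normals and compares the sample mean and variance with $M$ and $2M$:

set.seed(1)
M     <- 3
n_sim <- 100000
## chi-squared draws as column sums of M squared standard normals
chisq_draws <- colSums(matrix(rnorm(M * n_sim), nrow = M)^2)
mean(chisq_draws)   # close to M = 3
var(chisq_draws)    # close to 2 * M = 6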
Using the code below, we can display the pdf and the cumulative distribution function (cdf) of a $\chi^2_3$ random variable in a single plot. This is achieved by setting the argument add = TRUE in the second call of curve(). Further, we adjust the limits of both axes using xlim and ylim and choose different colors to make both functions better distinguishable. The plot is completed by adding a legend with help of legend().
# plot the PDF
curve(dchisq(x, df = 3),
      xlim = c(0, 10),
      ylim = c(0, 1),
      col = "blue",
      ylab = "",
      main = "pdf and cdf of Chi-Squared Distribution, M = 3")
# add the CDF to the plot
curve(pchisq(x, df = 3),
      xlim = c(0, 10),
      add = TRUE,
      col = "red")
# add a legend to the plot
legend("topleft", c("PDF", "CDF"),
       col = c("blue", "red"), lty = c(1, 1))
Since the outcomes of a $\chi^2_M$ distributed random variable are always positive, the support of the related PDF and CDF is $\mathbb{R}_{\ge 0}$.
As expectation and variance depend (solely!) on the degrees of freedom, the distribution's shape changes drastically if we vary the number of squared standard normals that are summed up. This relation is often depicted by overlaying densities for different $M$; see the Wikipedia article.
We reproduce this here by plotting the density of the $\chi^2_1$ distribution on the interval $[0, 15]$ with curve(). In the next step, we loop over degrees of freedom $M = 2, \dots, 7$ and add a density curve for each $M$ to the plot. We also adjust the line color for each iteration of the loop by setting col = M. At last, we add a legend that displays degrees of freedom and the associated colors.

# plot the density for M=1
curve(dchisq(x, df = 1),
      xlim = c(0, 15),
      xlab = "x",
      ylab = "Density",
      main = "Chi-Square Distributed Random Variables")
# add densities for M=2,...,7 to the plot using a for() loop
for (M in 2:7) {
  curve(dchisq(x, df = M),
        xlim = c(0, 15),
        add = T,
        col = M)
}
# add a legend
legend("topright", as.character(1:7),
       col = 1:7, lty = 1,
       title = "D.F.")

Increasing the degrees of freedom shifts the distribution to the right (the mode becomes larger) and increases the dispersion (the distribution’s variance grows).
1.2.10.4 The Student t Distribution
Let $Z$ be a standard normal random variable, $W$ a $\chi^2_\nu$ random variable, and further assume that $Z$ and $W$ are independent. Then it holds that
$$\frac{Z}{\sqrt{W/\nu}} =: X \sim t_\nu$$
and $X$ follows a Student $t$ distribution (or simply $t$ distribution) with $\nu$ degrees of freedom.
The shape of a $t_\nu$ distribution depends on $\nu$. $t$ distributions are symmetric, bell-shaped and look similar to a normal distribution, especially when $\nu$ is large. This is not a coincidence: for a sufficiently large $\nu$, the $t_\nu$ distribution can be approximated by the standard normal distribution. This approximation works reasonably well for $\nu \ge 30$.
A $t_\nu$ distributed random variable $X$ has an expectation if $\nu > 1$ and it has a variance if $\nu > 2$:
$$E(X) = 0, \quad \nu > 1, \qquad \operatorname{Var}(X) = \frac{\nu}{\nu - 2}, \quad \nu > 2.$$
Let us plot some $t$ distributions with different degrees of freedom $\nu$ and compare them to the standard normal distribution.
# plot the standard normal density
curve(dnorm(x),
      xlim = c(-4, 4),
      xlab = "x",
      lty = 2,
      ylab = "Density",
      main = "Densities of t Distributions")
# plot the t density for M=2
curve(dt(x, df = 2),
      xlim = c(-4, 4),
      col = 2,
      add = T)
# plot the t density for M=4
curve(dt(x, df = 4),
      xlim = c(-4, 4),
      col = 3,
      add = T)
# plot the t density for M=25
curve(dt(x, df = 25),
      xlim = c(-4, 4),
      col = 4,
      add = T)
# add a legend
legend("topright",
       c("N(0, 1)", "M=2", "M=4", "M=25"),
       col = 1:4,
       lty = c(2, 1, 1, 1))
The plot illustrates that as the degrees of freedom increase, the shape of the $t$ distribution comes closer to that of a standard normal bell curve. Already for $\nu = 25$ we find little difference to the standard normal density.

If $\nu$ is small, we find the distribution to have heavier tails than a standard normal.
1.2.10.5 Cauchy Distribution
The Cauchy distribution is a special case of the $t$ distribution corresponding to $\nu = 1$. The density is
$$f(x) = \frac{1}{\pi(1 + x^2)}.$$
For the Cauchy distribution, the expectation does not exist – that is, it has no mean. Let's try to compute the mean of a Cauchy distribution and see what goes wrong. Its mean should be
$$\mu = E(X) = \int_{-\infty}^{\infty} \frac{x}{\pi(1 + x^2)}\,dx.$$
In order for this improper integral to exist, we need both integrals $\int_{-\infty}^0$ and $\int_0^{\infty}$ to be finite. Let's look at the second integral.
$$\int_0^{\infty} \frac{x}{\pi(1 + x^2)}\,dx = \frac{1}{2\pi}\log\left(1 + x^2\right)\Big|_0^{\infty} = \infty$$
Similarly, the other integral, $\int_{-\infty}^0$, is $-\infty$. Since they are not both finite, the integral $\int_{-\infty}^{\infty}$ doesn't exist. In other words, $\infty - \infty$ is not a number. Thus, the Cauchy distribution has no mean.
What this means in practice is that if you take a sample $x_1, x_2, \dots, x_n$ from the Cauchy distribution, then the average $\bar{x}$ does not tend to a particular number. Instead, every so often you will get such a huge number, either positive or negative, that the average is overwhelmed by it.
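This erratic behavior of the sample average can be seen in a short simulation. The following R sketch (added as an illustration) plots the running mean of Cauchy draws next to that of standard normal draws; the former keeps jumping while the latter settles near zero:

set.seed(1)
n <- 10000
running_mean_cauchy <- cumsum(rcauchy(n)) / (1:n)
running_mean_normal <- cumsum(rnorm(n))   / (1:n)
plot(running_mean_cauchy, type = "l", xlab = "n", ylab = "Running mean")
lines(running_mean_normal, col = "red")
legend("topright", c("Cauchy", "N(0,1)"), col = c("black", "red"), lty = 1)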

1.2.10.6 The F Distribution
Another ratio of random variables important to econometricians is the ratio of two independent $\chi^2$ distributed random variables that are divided by their degrees of freedom $M$ and $n$. The quantity
$$\frac{W/M}{V/n} \sim F_{M,n} \quad\text{with}\quad W \sim \chi^2_M,\ V \sim \chi^2_n$$
follows an $F$ distribution with numerator degrees of freedom $M$ and denominator degrees of freedom $n$, denoted $F_{M,n}$. The distribution was first derived by George Snedecor but was named in honor of Sir Ronald Fisher.
By definition, the support of both PDF and CDF of an $F_{M,n}$ distributed random variable is $\mathbb{R}_{\ge 0}$.
Say we have an $F$ distributed random variable $Y$ with numerator degrees of freedom 3 and denominator degrees of freedom 14 and we are interested in $P(Y \ge 2)$. This can be computed with help of the function pf(). By setting the argument lower.tail to FALSE we ensure that R computes $1 - P(Y \le 2)$, i.e., the probability mass in the tail right of 2.
We can visualize this probability by drawing a line plot of the related density and adding a color shading with polygon().
pf(2, df1 = 3, df2 = 14, lower.tail = F)
#> [1] 0.1603538
# define coordinate vectors for vertices of the polygon
x <- c(2, seq(2, 10, 0.01), 10)
y <- c(0, df(seq(2, 10, 0.01), 3, 14), 0)
# draw density of F_{3, 14}
curve(df(x, 3, 14),
      ylim = c(0, 0.8),
      xlim = c(0, 10),
      ylab = "Density",
      main = "Density Function")
# draw the polygon
polygon(x, y, col = "orange")

The $F$ distribution is related to many other distributions. An important special case encountered in econometrics arises if the denominator degrees of freedom are large, such that the $F_{M,n}$ distribution can be approximated by the $F_{M,\infty}$ distribution, which turns out to be simply the distribution of a $\chi^2_M$ random variable divided by its degrees of freedom $M$, i.e.
$$W/M \sim F_{M,\infty} \quad\text{with}\quad W \sim \chi^2_M.$$

Chapter 2
Review: Simple Linear Regression

2.1 The Simple Linear Regression Model

The Data-Generating Process
We assume some outcome, $Y_i \in \mathbb{R}$, and the explanatory variable, $X_i \in \mathbb{R}$, are related to each other via the following linear model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad (2.1)$$
where $i = 1, \dots, n$ indexes the individual observations and $n$ denotes the sample size.
• Equation (2.1) describes the data generating process that generates the individual observations that we observe in the data.
• The parameters $\beta_0 \in \mathbb{R}$ and $\beta_1 \in \mathbb{R}$ are fixed (deterministic) parameters and their unknown values are the objects of our inquiry.
• The explanatory variables, $X_i$, may be either fixed (deterministic) or random ("fixed/deterministic or random design"). While deterministic $X_i$ remain fixed in repeated samples, random $X_i$ have different realizations in repeated samples. In case of random designs, we assume that each $X_i$ is drawn from the same distribution and that the individual $X_i$'s are independent from each other. In this case we say that the $X_i$'s are independent and identically distributed (i.i.d.).
Examples:
Fixed Design (FD): $X_i$ = amount of money paid out to individual $i$ in an economic laboratory experiment.
Random Design (RD): $X_i$ = years of education of the randomly sampled individual $i$.
• The error terms $\varepsilon_1, \dots, \varepsilon_n$ are random variables representing the non-systematic and/or unobserved influences on $Y_i$.
• The dependent variable $Y_i$ is always random since the error terms $\varepsilon_i$ are assumed random, for all $i = 1, \dots, n$.

2.1.1 Assumptions About the Error Term
2.1.1.1 The Distribution of $\varepsilon_i$
A typical distributional assumption on the error terms is the i.i.d. assumption. That is, each $\varepsilon_i$ is drawn from the same (but unknown) distribution and two error terms $\varepsilon_i$ and $\varepsilon_j$ are independent of each other for all $i \ne j$. This is the assumption typically adopted when randomly sampling data from a large population. Sometimes we need to be even more specific, for instance, by assuming that the error terms $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. normal, i.e.
$$\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2) \quad\text{for all } i = 1, \dots, n.$$
Note. The above i.i.d. assumptions are not as restrictive as they may seem at first sight. They allow that the conditional variances are heteroscedastic, i.e. $\operatorname{Var}(\varepsilon_i | X_i) = \sigma_i^2$ with different variance terms $\sigma_i^2$ across $i = 1, \dots, n$. The literature often distinguishes between the following general scenarios for the conditional variances of the error terms:
Homoscedastic error terms: The conditional variances $\operatorname{Var}(\varepsilon_i | X_i = x_i) = \sigma^2$ are equal to some constant $\sigma^2 > 0$ for every possible realization $x_i$ of $X_i$. (Due to the i.i.d. setting we then additionally have that $\operatorname{Var}(\varepsilon_i | X_i = x_i) = \sigma^2$ for all $i = 1, \dots, n$.)
Heteroscedastic error terms: The conditional variances $\operatorname{Var}(\varepsilon_i | X_i = x_i) = \sigma^2(x_i)$ are equal to a non-constant variance function $\sigma^2(x_i)$, which is a function of the realization $x_i$ of $X_i$. Example: $\operatorname{Var}(\varepsilon_i | X_i) = \frac{1}{12}(X_i)^2$.
2.1.1.2 The Exogeneity Assumption $E[\varepsilon_i | X_i] = 0$
Since the explanatory variables $X_i$ are random variables under the random design, we need to think about possible dependencies between $X_1, \dots, X_n$ and the error terms
$$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i, \quad i = 1, \dots, n.$$
An assumption of central importance is that the conditional mean of $\varepsilon_i$ given $X_i$ is zero, i.e.
$$E[\varepsilon_i | X_i] = 0, \quad\text{for all } i = 1, \dots, n.$$
In other words, on average, the non-systematic determinants of $Y_i$ are zero – no matter the value of $X_i$. That is, this assumption is a statement about the relationship between $\varepsilon_i$ and $X_i$. It turns out that the assumption $E[\varepsilon_i | X_i] = 0$ is one of the most important assumptions we make in econometrics. When it is violated it causes fundamental econometric problems. Unfortunately, it is generally impossible to check this assumption since we do not observe realizations of the error terms $\varepsilon_1, \dots, \varepsilon_n$.
Terminology: The exogeneity assumption $E[\varepsilon_i | X_i] = 0$, for all $i = 1, \dots, n$, is also known as the orthogonality assumption.
Implications of the exogeneity assumption:
• The assumption that $E[\varepsilon_i | X_i] = 0$, for all $i = 1, \dots, n$, implies that the correlation between $X_i$ and $\varepsilon_i$ is zero for all $i = 1, \dots, n$, although the converse is not true.
• The assumption that $E[\varepsilon_i | X_i] = 0$, for all $i = 1, \dots, n$, also implies that $E[\varepsilon_i] = 0$ for all $i = 1, \dots, n$.
Note: Under a fixed design, we simply assume that $E[\varepsilon_i] = 0$ for all $i = 1, \dots, n$.
2.1.2 The Population Regression Line
We have now settled all assumptions regarding the simple linear regression model (see Chapter 3 for a concise overview). This allows us to think about the relationship between $Y_i$ and $X_i$ under the regime of the model assumptions. For instance, note that we can write
$$E[Y_i | X_i] = E[\beta_0 + \beta_1 X_i + \varepsilon_i | X_i] = \beta_0 + \beta_1 X_i + E[\varepsilon_i | X_i].$$
Then, by our exogeneity assumption, $E[\varepsilon_i | X_i] = 0$, we have that
$$E[Y_i | X_i] = \beta_0 + \beta_1 X_i.$$
This conditional expectation of $Y_i$ given $X_i$ defines the population regression line – we do not observe this object, but want to estimate it.
2.1.3 Terminology: Estimates versus Estimators
Our goal is to devise some way to estimate the unknown parameters $\beta_0 \in \mathbb{R}$ and $\beta_1 \in \mathbb{R}$ which define the population regression line. We call our method or formula for estimating these parameters an estimator. An estimator is a function of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$ and, therefore, is itself a random variable that (generally) has different realizations in repeated samples. The result of applying our estimator to specific data, i.e., to a given observed realization $((Y_1^{obs}, X_1^{obs}), \dots, (Y_n^{obs}, X_n^{obs}))$ of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, is called an estimate.
Caution: We will usually not use different notations for random variables/estimators and their realizations/estimates, since often both points of view are needed to tell the story.
2.2 Ordinary Least Squares Estimation
There are many ways to estimate the parameters $\beta_0$ and $\beta_1$ of the simple linear regression model in (2.1). In the following, we will derive the Ordinary Least Squares (OLS) estimator using a "mechanical" approach. In the next chapter, we will use an alternative approach ("method of moments") which provides a perspective that is particularly interesting/useful in econometrics. You will see that the properties of the estimators for $\beta_0$ and $\beta_1$ crucially depend on our assumptions regarding the error terms $\varepsilon_i$ and the explanatory variables $X_i$.

2.2.1 Terminology: Sample Regression Line, Prediction and Residuals
Let us define some necessary terms. Call $\hat\beta_0$ our estimate of $\beta_0$ and $\hat\beta_1$ our estimate of $\beta_1$. Now, define the predicted value, $\hat Y_i$, of the dependent variable, $Y_i$, to be
$$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i. \qquad (2.2)$$
This is just the prediction of the dependent variable, $Y_i$, given the value of $X_i$ and the estimates $\hat\beta_0$ and $\hat\beta_1$. Equation (2.2) defines the sample regression line.
Define the residual, $\hat\varepsilon_i$, as the difference between the observed value, $Y_i$, and the predicted value, $\hat Y_i$:
$$\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i.$$
The residual, $\hat\varepsilon_i$, is the vertical distance between the observed value, $Y_i$, and the sample regression line, i.e., the prediction, $\hat Y_i$, of $Y_i$.
We must make an important distinction between the residuals, $\hat\varepsilon_i$, and the errors $\varepsilon_i$:
$$\hat\varepsilon_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i \quad\text{(computable)}$$
$$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i \quad\text{(unobservable)}$$
Because $\beta_0$ and $\beta_1$ are unknown, we can never know the value of the error terms $\varepsilon_i$. However, because we actually come up with the estimates $\hat\beta_0$ and $\hat\beta_1$, and because we observe $X_i$, we can calculate the residual $\hat\varepsilon_i$ for each observation. This distinction is important to keep in mind.
Note that we can write
$$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i = Y_i - E[Y_i | X_i].$$
But for this to make sense, we needed to impose the exogeneity assumption that $E[\varepsilon_i | X_i] = 0$, since only then can we identify the population regression line $\beta_0 + \beta_1 X_i$ using the conditional mean of $Y_i$ given $X_i$.
2.2.2 Deriving the Expression of the OLS Estimator
The method of ordinary least squares (OLS) estimation has a long history and was first described by Legendre in 1805 – although Karl Friedrich Gauss claimed to use OLS since 1795. In 1809 Gauss published his work on OLS which extended the work of Legendre.
The idea of OLS is to choose parameter values that minimize the sum of squared residuals for a given data set. These minimizing parameters are then the estimates of the unknown population parameters. It turns
out that the OLS estimator is equipped with several desirable properties. In a sense, OLS is a purely "mechanical" method. We will see that it is equivalent to an alternative estimation method called the method of moments, which has a more profound econometric motivation, given a certain set of (moment) assumptions.
Deriving the OLS estimate algebraically. Our objective is to find the parameter values $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of squared residuals, $S_n(b_0, b_1)$, for a given sample (i.e., for given data) of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, where
$$S_n(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2 = \sum_{i=1}^n \left(Y_i^2 - 2 b_0 Y_i - 2 b_1 Y_i X_i + b_0^2 + 2 b_0 b_1 X_i + b_1^2 X_i^2\right).$$
Now partially differentiate the last line with respect to $b_0$ and $b_1$, respectively:
$$\frac{\partial S_n(b_0, b_1)}{\partial b_0} = \sum_{i=1}^n (-2 Y_i + 2 b_0 + 2 b_1 X_i)$$
$$\frac{\partial S_n(b_0, b_1)}{\partial b_1} = \sum_{i=1}^n \left(-2 Y_i X_i + 2 b_0 X_i + 2 b_1 X_i^2\right)$$
Next, we want to find the minimizing arguments
$$(\hat\beta_0, \hat\beta_1)' = \arg\min_{(b_0, b_1) \in \mathbb{R}^2} S_n(b_0, b_1).$$
For this we set the two partial derivatives equal to zero, which gives us two equations that fully determine the values $\hat\beta_0$ and $\hat\beta_1$:
$$n\hat\beta_0 - \sum_{i=1}^n Y_i + \hat\beta_1 \sum_{i=1}^n X_i = 0$$
$$\sum_{i=1}^n \left(-Y_i X_i + \hat\beta_0 X_i + \hat\beta_1 X_i^2\right) = 0$$
These two equations are known as the least squares normal equations. It is easy to see from the first normal equation that the OLS estimator of $\beta_0$ is
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X. \qquad (2.3)$$

Substituting $\hat\beta_0$ into the second normal equation gives
$$0 = \sum_{i=1}^n \left(-Y_i X_i + (\bar Y - \hat\beta_1 \bar X) X_i + \hat\beta_1 X_i^2\right) = \sum_{i=1}^n \left(-X_i (Y_i - \bar Y) + \hat\beta_1 X_i (X_i - \bar X)\right) = -\left(\sum_{i=1}^n X_i (Y_i - \bar Y)\right) + \hat\beta_1 \left(\sum_{i=1}^n X_i (X_i - \bar X)\right).$$
Solving for $\hat\beta_1$ gives
$$\hat\beta_1 = \frac{\sum_{i=1}^n (Y_i - \bar Y) X_i}{\sum_{i=1}^n (X_i - \bar X) X_i} = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)} = \frac{\sum_{i=1}^n (X_i - \bar X) Y_i}{\sum_{i=1}^n (X_i - \bar X)^2}.$$
The last two lines follow from the "useful result" that will be discussed in the exercises of this chapter.
Here is some R code for computing $\hat\beta_0$ and $\hat\beta_1$ for a given realization of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, i.e. for a given (well, here simulated) data set.
n <- 25 # sample size
## simulate data
set.seed(3)
X <- runif(n, min = 1, max = 10)
error <- rnorm(n, mean = 0, sd = 5)
beta0 <- 1
beta1 <- 2
Y <- beta0 + beta1 * X + error
## save simulated data as data frame
data_sim <- data.frame("Y" = Y, "X" = X)
## OLS fit
lm_obj <- lm(Y ~ X, data = data_sim)
## Plot
par(family = "serif")
plot(x = data_sim$X, y = data_sim$Y, main = "", axes = FALSE,
     pch = 16, cex = 0.8, xlab = "X", ylab = "Y")
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 2)
abline(lm_obj, lty = 2, lwd = 1.3, col = "darkorange")
abline(a = beta0, b = beta1, lwd = 1.3, col = "darkblue")
legend("topleft", col = c("darkorange", "darkblue"),
       legend = c("Sample Regression Line", "Population Regression Line"),
       lwd = 1.3, lty = c(2, 1), bty = "n")
## Estimates
coef(lm_obj)
#> (Intercept)           X
#>   -1.561339    2.512011
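The closed-form expressions for $\hat\beta_0$ and $\hat\beta_1$ derived above can also be evaluated directly and compared with the lm() output; a minimal sketch, reusing the simulated objects X, Y and lm_obj from the code above:

## OLS estimates computed from the closed-form expressions
beta1_hat <- sum((X - mean(X)) * Y) / sum((X - mean(X))^2)
beta0_hat <- mean(Y) - beta1_hat * mean(X)
c(beta0_hat, beta1_hat)   # identical to coef(lm_obj)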
Interpretation of the Results. The coefficients have the usual intercept and slope interpretation. That is, for the unknown parameters $\beta_0$ and $\beta_1$ we have that
$$\frac{\partial E[Y|X]}{\partial X} = \beta_1 \quad\text{with}\quad E[Y|X] = \beta_0 + \beta_1 X.$$
That is, $\beta_1$ is the true (unknown) marginal effect of a one unit change in $X$ on $Y$. Therefore, $\hat\beta_1$ is the estimated marginal effect of a one unit change in $X$ on $Y$:
$$\frac{\partial \widehat{E}[Y|X]}{\partial X} = \hat\beta_1 \quad\text{with}\quad \widehat{E}[Y|X] = \hat\beta_0 + \hat\beta_1 X.$$

2.2.3 Behavior of the OLS Estimates for Resampled Data (conditionally on $X_i$)
Usually, we only observe the estimates $\hat\beta_0$ and $\hat\beta_1$ computed for a given data set. However, in order to understand the statistical properties of the estimators $\hat\beta_0$ and $\hat\beta_1$, we need to view them as random variables which yield different realizations in repeated samples generated from (2.1) conditionally on $X_1, \dots, X_n$. This allows us to think about questions like:
• "Is the estimator able to estimate the unknown parameter value correctly on average (conditionally on a given set of $X_1, \dots, X_n$)?"
• "Are the estimation results more precise if we have more data?"
A first idea about the statistical properties of the estimators $\hat\beta_0$ and $\hat\beta_1$ can be gained using Monte Carlo simulations as follows.
## Sample sizes
n_small <- 10  # small sample size
n_large <- 100 # large sample size
## True parameter values
beta0 <- 1
beta1 <- 1
## Generate explanatory variables (random design)
X_n_small <- runif(n_small, min = 1, max = 10)
X_n_large <- runif(n_large, min = 1, max = 10)
## Monte-Carlo (MC) Simulation
## 1. Generate data
## 2. Compute and store estimates
## Repeat steps 1. and 2. many times
set.seed(3)
## Number of Monte Carlo repetitions
## How many samples to draw from the models
rep <- 1000
## Containers to store the lm-results
n_small_list <- vector(mode = "list", length = rep)
n_large_list <- vector(mode = "list", length = rep)
for(r in 1:rep){
  ## Sampling from the model conditionally on X_n_small
  error_n_small <- rnorm(n_small, mean = 0, sd = 5)
  Y_n_small <- beta0 + beta1 * X_n_small + error_n_small
  n_small_list[[r]] <- lm(Y_n_small ~ X_n_small)
  ## Sampling from the model conditionally on X_n_large
  error_n_large <- rnorm(n_large, mean = 0, sd = 5)
  Y_n_large <- beta0 + beta1 * X_n_large + error_n_large
  n_large_list[[r]] <- lm(Y_n_large ~ X_n_large)
}
## Reading out the parameter estimates
beta0_estimates_n_small <- rep(NA, rep)
beta1_estimates_n_small <- rep(NA, rep)
beta0_estimates_n_large <- rep(NA, rep)
beta1_estimates_n_large <- rep(NA, rep)
for(r in 1:rep){
  beta0_estimates_n_small[r] <- n_small_list[[r]]$coefficients[1]
  beta1_estimates_n_small[r] <- n_small_list[[r]]$coefficients[2]
  beta0_estimates_n_large[r] <- n_large_list[[r]]$coefficients[1]
  beta1_estimates_n_large[r] <- n_large_list[[r]]$coefficients[2]
}

Now, we have produced realizations of the estimators $\hat\beta_0|X$ and $\hat\beta_1|X$ conditionally on
$$X = \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}$$
and we have saved these realizations in beta0_estimates_n_small, beta1_estimates_n_small, beta0_estimates_n_large, and beta1_estimates_n_large. This allows us to visualize the behavior of the OLS estimates for the repeatedly sampled data (conditionally on $X_i$).

## Plotting the results
library("scales") # alpha() produces transparent colors
## Define a common y-axis range
y_range <- range(beta0_estimates_n_small, beta1_estimates_n_small)*1.1
## Generate the plot
par(family = "serif") # Serif fonts
## Layout of plotting area
layout(matrix(c(1:6), 2, 3, byrow = TRUE), widths = c(3,1,1))
## Plot 1
plot(x=0, y=0, axes=FALSE, xlab="X", ylab="Y", type="n",
     xlim=c(1,10), ylim=c(-5,35), main="Small Sample (n=10)")
axis(1, tick = FALSE); axis(2, tick = FALSE, las = 2)
for(r in 1:rep){
  abline(n_small_list[[r]], lty=2, lwd = 1.3, col="darkorange")
}
abline(a = beta0, b = beta1, lwd=1.3, col="darkblue")
legend("topleft", col=c("darkorange", "darkblue"),
       legend=c("Sample regression lines from\nrepeated samples (cond. on X)",
                "Population regression line"),
       lwd=1.3, lty=c(2,1), bty="n")
on X)", "Population regression line"), lwd=1.3, lty=c(2,1), bty="n") ## Plot 2 plot(x=rep(0,rep), y=beta0_estimates_n_small, axes=FALSE, xlab="", ylab="", pch=19, cex=1.2, ylim=y_range, main=expression(hat(beta)[0]~|~X), col=alpha("red",0.2)) points(x = 0, y=beta0, pch="-", cex = 1.2, col="black") text(x=0, y=beta0, labels = expression(beta[0]), pos = 4) ## Plot 3 plot(x=rep(0,rep), y=beta1_estimates_n_small, axes=FALSE, xlab="", ylab="", pch=19, cex=1.2, ylim=y_range, main=expression(hat(beta)[1]~|~X), col=alpha("red",0.2)) points(x = 0, y=beta1, pch="-", cex = 1.2, col="black") text(x=0, y=beta1, labels = expression(beta[1]), pos = 4) ## Plot 4 plot(x=0, y=0, axes=FALSE, xlab="X", ylab="Y", type="n", xlim=c(1,10), ylim=c(-5,35), main="Large Sample (n=100)") axis(1, tick = FALSE); axis(2, tick = FALSE, las = 2) for(r in 1:rep){ abline(n_large_list[[r]], lty=2, lwd = 1.3, col="darkorange") } abline(a = beta0, b = beta1, lwd=1.3, col="darkblue") ## Plot 5 plot(x=rep(0,rep), y=beta0_estimates_n_large, axes=FALSE, xlab="", ylab="", pch=19, cex=1.2, ylim=y_range, 57 main=expression(hat(beta)[0]~|~X), col=alpha("red",0.2)) points(x = 0, y=beta0, pch="-", cex = 1.2, col="black") text(x=0, y=beta0, labels = expression(beta[0]), pos = 4) ## Plot 6 plot(x=rep(0,rep), y=beta1_estimates_n_large, axes=FALSE, xlab="", ylab="", pch=19, cex=1.2, ylim=y_range, main=expression(hat(beta)[1]~|~X), col=alpha("red",0.2)) points(x=0, y=beta1, pch="-", cex = 1.2, col="black") text(x=0, y=beta1, labels = expression(beta[1]), pos = 4) 30 20 10 0 30 20 10 0 Small Sample (n=10) Sample regression lines from repeated samples (cond. on X) Population regression line 2 4 6 8 10 X Large Sample (n=100) 2 4 6 8 10 X β^0 | X β^1 | X −β0 −β1 YY This are promising plots: 58 β^0 | X −β0 −β1 β^1 | X • The realizations of —ˆ |X and —ˆ |X are scattered around the true (un- 01 known) parameter values —0 and —1 for both small and large samples. • The realizations of —ˆ |X and —ˆ |X concentrate more and more around 01 the true (unknown) parameter values —0 and —1 as the sample size increases. However, this was only a simulation for one specific data generating process. Such a Monte Carlo simulation does not allow us to generalize these properties. Next we use theoretical arguments to show that these properties also hold in general. 2.3 Properties of the OLS Estimator Note that we derived the OLS estimator for a given realization ((y1, x1), . . . , (yn, xn)) of the random sample ((Y1, X1), . . . , (Yn, Xn)). So, why should OLS be a “good” estimator in general? Does it behave well when we consider all possible realization of the OLS estimator for conditional resamplings, given X1, . . . , Xn, from all possible data generating processes described by (2.1)? In the following we consider the two questions: • “Are the conditional means of the estimators —ˆ and —ˆ , given X, equal to the true parameter values —0 and —1?” 0 1 • “Are the conditional variances of the estimators —ˆ and —ˆ , given X, become small as the smaple size n becomes large?”0 1 59 2.3.1 Mean and Bias of the OLS Estimator Because the estimator —ˆ=a—0b=cqn (X≠X ̄)Yd QˆR Qc Y ̄≠—ˆX ̄ Rd 1 (2.4) —ˆai=1i ib 1 q ni = 1 ( X i ≠ X ̄ ) 2 is a function of the random variables in the random sample ((Y1, X1), . . . , (Yn, Xn)), it is itself a random variable. Therefore, —ˆ has a probability distribution, and we can talk sensibly about its expectation E(—ˆ). Note that the estimator —ˆ in (2.4) contains both the Yi’s and the Xi’s. 
Therefore, under random designs, we need to be specific about which mean (or variance) of the estimator we want to consider. For the random design, we consider conditional means (or variances) given the design matrix
$$X = \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}.$$
Using the conditional mean, $E[\hat\beta|X]$, we can define the conditional bias: $\operatorname{Bias}(\hat\beta|X) = E[\hat\beta|X] - \beta$. For fixed designs, conditioning on $X$ is not false but superfluous, since $X$ remains fixed anyway in repeated samples and therefore $\operatorname{Bias}(\hat\beta|X) = \operatorname{Bias}(\hat\beta) = E[\hat\beta] - \beta$.

Unbiasedness of OLS. The OLS estimator is called unbiased if its conditional expectation is equal to the true parameter, i.e. if
$$E[\hat\beta|X] = \beta, \quad\text{i.e.}\quad \operatorname{Bias}(\hat\beta|X) = 0.$$
That is, the conditional mean of the random estimator equals the true parameter value which we want to estimate. Put another way, unbiasedness means that the conditional distribution of the estimator is centered on the true value of the parameter. Of course, the probability of a single estimate being exactly equal to the truth in any particular sample is zero if the error terms $\varepsilon_i$ (and therefore the $Y_i$) are continuous random variables.

To show that $\hat\beta_1$ is conditionally unbiased, we need to show that its conditional expectation is equal to $\beta_1$:
$$E[\hat\beta_1|X] = E\left[\frac{\sum_{i=1}^n (X_i - \bar X) Y_i}{\sum_{i=1}^n (X_i - \bar X)^2}\,\Big|\,X\right] = E\left[\frac{\sum_{i=1}^n (X_i - \bar X)(\beta_0 + \beta_1 X_i + \varepsilon_i)}{\sum_{i=1}^n (X_i - \bar X)^2}\,\Big|\,X\right].$$
In case of a deterministic design, $X$ is fixed in repeated samples, which makes the conditioning on $X$ superfluous since, for instance, $\sum_{i=1}^n (X_i - \bar X)^2$ is then a constant, which allows us to bring it outside of the expectation; the same applies to all other terms that involve only deterministic quantities. So, the conditioning on $X$ is actually only necessary under the random design, where it allows us to bring all terms containing only $X$ variables (or deterministic quantities) outside of the expectation. However, interpretations then must consider the conditioning on $X$.
So, we really only need to concern ourselves with the numerator, which we can write as:
$$\beta_0 \sum_{i=1}^n (X_i - \bar X) + \beta_1 \sum_{i=1}^n (X_i - \bar X) X_i + E\left[\sum_{i=1}^n (X_i - \bar X)\varepsilon_i \,\Big|\, X\right].$$
For the first two terms, note that
$$\beta_0 \sum_{i=1}^n (X_i - \bar X) = \beta_0 \left[\left(\sum_{i=1}^n X_i\right) - n\bar X\right] = 0$$
$$\beta_1 \sum_{i=1}^n (X_i - \bar X) X_i = \beta_1 \sum_{i=1}^n (X_i - \bar X)^2,$$
where the latter result follows again from the "useful result" that will be discussed in the exercises of this chapter. So we are left with
$$E[\hat\beta_1|X] = \beta_1 + \frac{1}{\sum_{i=1}^n (X_i - \bar X)^2}\, E\left[\sum_{i=1}^n (X_i - \bar X)\varepsilon_i \,\Big|\, X\right].$$
Clearly, then, we want the last term here to be zero, so that $\hat\beta_1$ is conditionally unbiased. The last term is zero if the factor $E\left[\sum_{i=1}^n (X_i - \bar X)\varepsilon_i \,|\, X\right] = 0$. For fixed/random designs we have the following expressions:
Fixed design: $E\left[\sum_{i=1}^n (X_i - \bar X)\varepsilon_i \,|\, X\right] = \sum_{i=1}^n (X_i - \bar X) E[\varepsilon_i]$
Random design: $E\left[\sum_{i=1}^n (X_i - \bar X)\varepsilon_i \,|\, X\right] = \sum_{i=1}^n (X_i - \bar X) E[\varepsilon_i|X_i]$
That is, the OLS estimator, $\hat\beta_1$, is (conditionally) unbiased if the exogeneity assumption $E[\varepsilon_i|X_i] = 0$ holds (or if $E[\varepsilon_i] = 0$ in case of the fixed design). We return, therefore, to the importance of the exogeneity assumption $E[\varepsilon_i|X_i] = 0$, since if this assumption does not hold, the estimator is biased and remains biased even asymptotically as $n \to \infty$.
The unbiasedness of $\hat\beta_1$ implies that $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$ is also an unbiased estimator, since
$$E[\hat\beta_0|X] = E\left[\bar Y - \hat\beta_1 \bar X \,|\, X\right] = E\left[\frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1 X_i + \varepsilon_i) - \hat\beta_1 \bar X \,\Big|\, X\right] = \beta_0 + \beta_1 \bar X + \frac{1}{n}\sum_{i=1}^n E[\varepsilon_i|X_i] - E[\hat\beta_1|X]\bar X = \beta_0,$$
where the last line follows from the exogeneity assumption $E[\varepsilon_i|X] = 0$ and the unbiasedness of $\hat\beta_1$.
Note that the exogeneity assumption, $E[\varepsilon_i|X] = 0$, is the only distributional assumption on the error terms (and the $X_i$'s) that is needed to show that the OLS estimator is unbiased. In fact, OLS remains unbiased if the error terms are heteroscedastic and/or autocorrelated. However, under the more restrictive assumption that the error terms are homoscedastic and not autocorrelated, one can show that the OLS estimator is the most efficient (lowest variance) estimator within the family of all unbiased estimators for the regression parameters $\beta_0$ and $\beta_1$. This result is known as the Gauss-Markov Theorem and we will consider this theorem in the context of the multiple regression model in more detail (see Chapter 3).

2.3.2 Variance of the OLS Estimator
Remember: An unbiased estimator can still be a very imprecise estimator if it has a large variance. In order to derive the variance of the OLS estimator one needs more than only the exogeneity assumption. One gets different variance expressions for different distributional assumptions on the error terms $\varepsilon_1, \dots, \varepsilon_n$. As we consider this in more detail for the multiple regression model, we consider in the following only the most simple case, namely, conditionally homoscedastic i.i.d. error terms. Under this simple scenario it can be shown that
$$\operatorname{Var}(\hat\beta_0|X) = \frac{1}{n}\sigma^2 \cdot \frac{n^{-1}\sum_{i=1}^n X_i^2}{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2} \qquad (2.5)$$
$$\operatorname{Var}(\hat\beta_1|X) = \frac{1}{n}\sigma^2 \cdot \frac{1}{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}, \qquad (2.6)$$
assuming that $\operatorname{Var}(\varepsilon_i|X) = \sigma^2$. That is, equations (2.5) and (2.6) are only sensible for the case of homoscedastic error terms. From equations (2.5) and (2.6) it follows that $\operatorname{Var}(\hat\beta_0|X)$ and $\operatorname{Var}(\hat\beta_1|X)$ eventually become small as $n \to \infty$, which is good since this means that the estimators become more precise as the sample size $n$ increases.
Note that the variance expressions in (2.5) and (2.6) are not really useful in practice since usually we do not know the variance of the error terms $\sigma^2$. Under the i.i.d. assumption we can replace the unknown $\sigma^2$ by plugging in reasonable estimates such as
$$s^2_{ML} = \frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2 \quad\text{or}\quad s^2_{UB} = \frac{1}{n-2}\sum_{i=1}^n \hat\varepsilon_i^2,$$
where $s^2_{UB}$ is the unbiased estimator and $s^2_{ML}$ the maximum likelihood estimator of $\sigma^2$. This leads to the practically relevant variance estimators
$$\widehat{\operatorname{Var}}(\hat\beta_0|X) = \frac{1}{n}s^2 \cdot \frac{n^{-1}\sum_{i=1}^n X_i^2}{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}$$
$$\widehat{\operatorname{Var}}(\hat\beta_1|X) = \frac{1}{n}s^2 \cdot \frac{1}{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2},$$
where $s^2 = s^2_{ML}$ or $s^2 = s^2_{UB}$.

2.3.3 Consistency of the OLS Estimator
Now we have everything together in order to show that the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ are consistent estimators of the unknown parameter values $\beta_0$ and $\beta_1$. Since the OLS estimator is unbiased, we do not need to bother with a possible bias problem—provided the underlying assumptions of the simple linear regression model are true. Moreover, since the variances of the estimators $\hat\beta_0 \equiv \hat\beta_{0n}$ and $\hat\beta_1 \equiv \hat\beta_{1n}$ converge to zero as $n \to \infty$, we have that the Mean Squared Error (MSE) converges to zero, i.e.
$$\operatorname{MSE}(\hat\beta_{jn}|X) = E\left[(\hat\beta_{jn} - \beta_j)^2 | X\right] = \underbrace{\left(\operatorname{Bias}(\hat\beta_{jn}|X)\right)^2}_{=0} + \operatorname{Var}(\hat\beta_{jn}|X) \to 0 \quad\text{as } n \to \infty \quad\text{for } j = 0, 1.$$
This means that the estimators $\hat\beta_{0n}$ and $\hat\beta_{1n}$ converge in mean-square as $n \to \infty$. Fortunately, mean-square convergence implies convergence in probability, or in short $\hat\beta_{jn} \to_p \beta_j$ as $n \to \infty$ for $j = 0, 1$.¹ The latter result is exactly what we mean by saying that an estimator is (weakly) consistent.
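The estimated variance formulas from Section 2.3.2 can be checked against the standard lm() output. The following R sketch is an added illustration that reuses the simulated data and the object lm_obj from Section 2.2.2 and plugs $s^2_{UB}$ into equation (2.6); the result should coincide with the corresponding diagonal entry of vcov():

## Estimated Var(beta1_hat | X) from equation (2.6) with s^2 = s^2_UB,
## using the simulated data (n, X) and lm_obj from Section 2.2.2
s2_UB <- sum(residuals(lm_obj)^2) / (n - 2)
s2_UB / sum((X - mean(X))^2)   # formula-based variance estimate for the slope
vcov(lm_obj)[2, 2]             # corresponding entry from lm()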
Let θ̂ be some estimator of a univariate parameter θ; then we can decompose the MSE of θ̂ as follows:
\[
\begin{aligned}
\mathrm{MSE}(\hat{\theta}) &= E\big[(\hat{\theta}-\theta)^2\big]\\
&= E\Big[\big((\hat{\theta}-E(\hat{\theta})) + \underbrace{(E(\hat{\theta})-\theta)}_{\text{deterministic}}\big)^2\Big]\\
&= E\big[(\hat{\theta}-E(\hat{\theta}))^2\big] + \underbrace{E\big[2(\hat{\theta}-E(\hat{\theta}))(E(\hat{\theta})-\theta)\big]}_{=0,\ \text{since } E(\hat{\theta})-E(\hat{\theta})=0} + \big(E(\hat{\theta})-\theta\big)^2\\
&= \mathrm{Var}(\hat{\theta}) + \big(\mathrm{Bias}(\hat{\theta})\big)^2.
\end{aligned}
\]

Chapter 3

Multiple Linear Regression

Preamble. In the following we focus on the case of random designs X, since this is the more relevant case in econometrics. This point of view requires us to consider conditional means and variances "given X". That is, if we were able to resample from the model, we would do so by fixing the in-principle random explanatory variable X.

3.1 Assumptions

The multiple linear regression model is defined by the following assumptions:

Assumption 1: The Linear Model Assumption (Data Generating Process)
\[
Y_i = \sum_{k=1}^K \beta_k X_{ik} + \varepsilon_i, \quad i=1,\dots,n. \tag{3.1}
\]
Usually, a constant (intercept) is included; in this case X_{i1} = 1 for all i. In the following we will always assume that X_{i1} = 1 for all i, unless otherwise stated. It is convenient to write equation (3.1) using matrix notation,
\[
Y_i = \underset{(1\times K)}{X_i'}\,\underset{(K\times 1)}{\beta} + \varepsilon_i, \quad i=1,\dots,n,
\]
where X_i = (X_{i1},…,X_{iK})′ and β = (β_1,…,β_K)′. Stacking all individual rows i leads to
\[
\underset{(n\times 1)}{Y} = \underset{(n\times K)}{X}\,\underset{(K\times 1)}{\beta} + \underset{(n\times 1)}{\varepsilon}, \tag{3.2}
\]
where
\[
Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_n \end{pmatrix},\quad
X = \begin{pmatrix} X_{11} & \dots & X_{1K}\\ \vdots & \ddots & \vdots\\ X_{n1} & \dots & X_{nK} \end{pmatrix},\quad\text{and}\quad
\varepsilon = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_n \end{pmatrix}.
\]
As already mentioned above, we focus on the random design case. Moreover, we adopt the classical sampling assumption for analyzing data that was randomly sampled from a large population: we assume that the random variables (Y_i, X_{i1},…,X_{iK}) are i.i.d. across observations i = 1,…,n.

Assumption 2: Exogeneity
\[
E(\varepsilon_i\,|\,X_i) = 0, \quad i=1,\dots,n.
\]
This assumption demands that the mean of the random error term ε_i is zero irrespective of the realizations of X_i. Note that together with the random sampling assumption (in Assumption 1) this assumption even implies strict exogeneity, E(ε_i|X) = 0, since we have independence across i = 1,…,n. Under strict exogeneity, the mean of ε_i is zero irrespective of the realizations of X_1,…,X_n.

Assumption 3: Rank Condition (no perfect multicollinearity)
\[
\mathrm{rank}(X) = K \quad\text{a.s.}
\]
This assumption demands that the event of one explanatory variable being linearly dependent on the others occurs with a probability equal to zero. (This is the literal translation of the "almost surely (a.s.)" concept.) The assumption implies that n ≥ K. This assumption is a bit dicey and its violation belongs to the classic problems in applied econometrics (keywords: multicollinearity, dummy variable trap, variance inflation). The violation of this assumption harms any economic interpretation, as we cannot disentangle the explanatory variables' individual effects on Y. Therefore, this assumption is also often called an identification assumption.

Assumption 4: Error distribution. Depending on the context (i.e., parameter estimation vs. hypothesis testing and small n vs. large n) there are different, more or less restrictive assumptions. Some of the most common ones are the following:

• i.i.d. Normal: ε_i ~ i.i.d. N(0, σ²) for all i = 1,…,n.

• i.i.d.: The error terms ε_1,…,ε_n are i.i.d.

• Spherical errors ("Gauss-Markov assumptions"): The error terms ε_1,…,ε_n might have different distributions, but their variances and covariances are such that
\[
E(\varepsilon_i^2\,|\,X_i) = \sigma^2 > 0 \ \text{ for all } i=1,\dots,n
\quad\text{and}\quad
E(\varepsilon_i\varepsilon_j\,|\,X) = 0, \ i\neq j.
\]
Thus, here one assumes that, for a given realization of X, the error process is uncorrelated (i.e. Cov(ε_i, ε_j|X) = E(ε_iε_j|X) = 0 for all i ≠ j) and homoscedastic (i.e. Var(ε_i|X) = σ² for all i).
Remember. The above i.i.d. assumptions are not as restrictive as they may seem at first sight. They allow the conditional variances to be heteroscedastic, i.e. Var(ε_i|X_i) = σ_i² with different variance terms σ_i² across i = 1,…,n, where σ_i may be a function of X_i and/or of deterministic quantities.
Technical Note. When we write that E(ε_i²|X) = σ², we implicitly assume that the second moment of ε_i exists and that it is finite.
3.1.1 Some Implications of the Exogeneity Assumption
(a) If E(ε_i|X_i) = 0 for all i = 1,…,n, then the unconditional mean of the error term is zero, i.e.
\[
E(\varepsilon_i) = 0, \quad i=1,\dots,n.
\]
Proof: Using the Law of Total Expectations (i.e., E[E(Z|X)] = E(Z)) we can rewrite E(ε_i) as E(ε_i) = E[E(ε_i|X_i)] for all i = 1,…,n. But the exogeneity assumption yields E[E(ε_i|X_i)] = E[0] = 0 for all i = 1,…,n, which completes the proof. ∎
(b) Exogeneity is sometimes also called orthogonality; the reason is the following. Generally, two random variables X and Y are said to be orthogonal if their cross moment is zero, i.e. if E(XY) = 0. Under exogeneity, i.e. if E(ε_i|X_i) = 0, the regressors and the error term are orthogonal to each other, i.e.
\[
E(X_{ik}\varepsilon_i) = 0 \quad\text{for all } i=1,\dots,n \text{ and } k=1,\dots,K.
\]
Proof:
\[
\begin{aligned}
E(X_{ik}\varepsilon_i) &= E\big(E(X_{ik}\varepsilon_i\,|\,X_{ik})\big) && \text{(by the Law of Total Expectations)}\\
&= E\big(X_{ik}\,E(\varepsilon_i\,|\,X_{ik})\big) && \text{(by the linearity of cond. expectations)}
\end{aligned}
\]
Now, to show that E(X_{ik}ε_i) = 0, we need to show that E(ε_i|X_{ik}) = 0, which is done in the following. Since X_{ik} is an element of X_i, a slightly more sophisticated use of the Law of Total Expectations (i.e., E(Y|X) = E(E(Y|X, Z)|X)) implies that E(ε_i|X_{ik}) = E(E(ε_i|X_i)|X_{ik}). So, the exogeneity assumption E(ε_i|X_i) = 0 yields
\[
E(\varepsilon_i\,|\,X_{ik}) = E\big(\underbrace{E(\varepsilon_i\,|\,X_i)}_{=0}\,\big|\,X_{ik}\big) = E(0\,|\,X_{ik}) = 0.
\]
I.e., we have that E(ε_i|X_{ik}) = 0, which allows us to conclude that E(X_{ik}ε_i) = E(X_{ik}E(ε_i|X_{ik})) = E(X_{ik}·0) = 0. ∎
(c) Because the mean of the error term is zero (E(ε_i) = 0 for all i; see point (a)), it follows that the orthogonality property (E(X_{ik}ε_i) = 0) is equivalent to a zero-correlation property. I.e., if E(ε_i|X_i) = 0, then
\[
\mathrm{Cov}(\varepsilon_i, X_{ik}) = 0 \quad\text{for all } i=1,\dots,n \text{ and } k=1,\dots,K.
\]
Proof:
\[
\begin{aligned}
\mathrm{Cov}(\varepsilon_i, X_{ik}) &= E(X_{ik}\varepsilon_i) - E(X_{ik})E(\varepsilon_i) && \text{(def. of Cov)}\\
&= E(X_{ik}\varepsilon_i) && \text{(by point (a): } E(\varepsilon_i)=0\text{)}\\
&= 0 && \text{(by the orthogonality result in point (b))} \ \ \blacksquare
\end{aligned}
\]
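The following short simulation is not part of the original script; it is a minimal sketch (with made-up names n, X, eps) illustrating points (b) and (c): under exogeneity E(ε_i|X_i) = 0, the sample analogues of E(X_ik ε_i) and Cov(X_ik, ε_i) are close to zero, even when the errors are heteroscedastic.

## Minimal simulation sketch (assumed setup, not from the original script):
set.seed(1)
n   <- 10000
X   <- runif(n, 0, 10)             # a regressor
eps <- rnorm(n, mean = 0, sd = X)  # heteroscedastic, but E(eps | X) = 0
mean(X * eps)                      # sample analogue of E(X*eps); approx. 0
cov(X, eps)                        # sample covariance; approx. 0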
3.2 Deriving the Expression of the OLS Estimator
Similar to Section 2.2.2, we can derive the expression for the OLS estimator β̂ = (β̂_1,…,β̂_K)′ ∈ R^K as the vector-valued minimizing argument of the sum of squared residuals, S_n(b) with b ∈ R^K, for a given sample ((Y_1, X_1),…,(Y_n, X_n)). In matrix terms this is
\[
S_n(b) = (Y - Xb)'(Y - Xb) = Y'Y - 2Y'Xb + b'X'Xb.
\]
To find the minimizing argument
\[
\hat{\beta} = \arg\min_{b\in\mathbb{R}^K} S_n(b)
\]
we compute all partial derivatives
\[
\frac{\partial S_n(b)}{\partial b} = -2\,(X'Y - X'Xb) \qquad (K\times 1)
\]
and set them equal to zero, which leads to K linear equations (the "normal equations") in K unknowns. This system of equations defines the OLS estimates, β̂, for a given data set:
\[
-2\,\big(X'Y - X'X\hat{\beta}\big) = \underset{(K\times 1)}{0}.
\]
From our rank assumption (Assumption 3) it follows that X′X is an invertible matrix, which allows us to solve the equation system by
\[
\underset{(K\times 1)}{\hat{\beta}} = (X'X)^{-1}X'Y.
\]
The following code computes the estimate β̂ for a given realization (Y, X) of the random sample (Y, X).
# Some given data
X_1 <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
X_2 <- c(66, 62, 64, 61, 63, 70, 68, 62, 68, 66)
Y   <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)
dataset <- cbind.data.frame(X_1, X_2, Y)
## Compute the OLS estimation
my.lm <- lm(Y ~ X_1 + X_2, data = dataset)
## Plot sample regression surface
library("scatterplot3d") # library for 3d plots
plot3d <- scatterplot3d(x = X_1, y = X_2, z = Y,
                        angle = 33, scale.y = 0.8, pch = 16, color = "red",
                        main = "OLS Regression Surface")
plot3d$plane3d(my.lm, lty.box = "solid", col = gray(.5), draw_polygon = TRUE)

[Figure: OLS regression surface for the regression of Y on X_1 and X_2]

3.3 Some Quantities of Interest

Predicted values and residuals.

• The (OLS) predicted values: Ŷ_i = X_i′β̂. In matrix notation: Ŷ = X(X′X)^{-1}X′Y = P_X Y.

• The (OLS) residuals: ε̂_i = Y_i − Ŷ_i. In matrix notation: ε̂ = Y − Ŷ = (I_n − X(X′X)^{-1}X′)Y = M_X Y.

Projection matrices. The matrix P_X = X(X′X)^{-1}X′ is the (n×n) projection matrix that projects any vector (from R^n) into the column space spanned by the column vectors of X, and M_X = I_n − X(X′X)^{-1}X′ = I_n − P_X is the associated (n×n) orthogonal projection matrix that projects any vector (from R^n) into the vector space that is orthogonal to the space spanned by X. The projection matrices P_X and M_X have some nice properties:

(a) P_X and M_X are symmetric, i.e. P_X = P_X′ and M_X = M_X′.

(b) P_X and M_X are idempotent, i.e. P_X P_X = P_X and M_X M_X = M_X.

(c) Moreover, X′P_X = X′, P_X X = X, X′M_X = 0, M_X X = 0, and P_X M_X = 0.

All of these properties follow directly from the definitions of P_X and M_X (check it out). Using these properties one can show that the residual vector ε̂ = (ε̂_1,…,ε̂_n)′ is orthogonal to each of the column vectors in X, i.e.
\[
X'\hat{\varepsilon} = X'M_X Y = \underset{(K\times 1)}{0} \quad\text{(since } X'M_X = 0\text{)}. \tag{3.3}
\]
Note that, in the case with intercept, (3.3) implies that Σ_{i=1}^n ε̂_i = 0. Moreover, equation (3.3) also implies that the residual vector ε̂ is orthogonal to the vector of predicted values, since
\[
X'\hat{\varepsilon} = 0 \ \Rightarrow\ \hat{\beta}'X'\hat{\varepsilon} = \hat{\beta}'0 \ \Leftrightarrow\ \hat{Y}'\hat{\varepsilon} = 0.
\]
Another insight from equation (3.3) is that the vector ε̂ has to satisfy K linear restrictions (X′ε̂ = 0), which means it loses K degrees of freedom. Consequently, the vector of residuals ε̂ has only n − K so-called degrees of freedom. This loss of K degrees of freedom also appears in the definition of the unbiased variance estimator
\[
s^2_{UB} = \frac{1}{n-K}\sum_{i=1}^n \hat{\varepsilon}_i^2. \tag{3.4}
\]

Variance decomposition. A further useful result that can be shown using the properties of P_X and M_X is that Y′Y = Ŷ′Ŷ + ε̂′ε̂, i.e.
\[
Y'Y = (\hat{Y}+\hat{\varepsilon})'(\hat{Y}+\hat{\varepsilon})
= (P_X Y + M_X Y)'(P_X Y + M_X Y)
= Y'P_X'P_X Y + Y'M_X'M_X Y + 0
= \hat{Y}'\hat{Y} + \hat{\varepsilon}'\hat{\varepsilon}. \tag{3.5}
\]
Decomposition (3.5) is the basis for the well-known variance decomposition result for OLS regressions. For the OLS regression of the linear model (3.1) with intercept, the total sample variance of the dependent variable Y_1,…,Y_n can be decomposed as follows:
\[
\underbrace{\frac{1}{n}\sum_{i=1}^n (Y_i-\bar{Y})^2}_{\text{total sample variance}}
= \underbrace{\frac{1}{n}\sum_{i=1}^n (\hat{Y}_i-\bar{\hat{Y}})^2}_{\text{explained sample variance}}
+ \underbrace{\frac{1}{n}\sum_{i=1}^n \hat{\varepsilon}_i^2}_{\text{unexplained sample variance}}, \tag{3.6}
\]
where Ȳ = (1/n)Σ_{i=1}^n Y_i and likewise for the mean of the predicted values, Ȳ̂.

Proof of (3.6):

• As a consequence of (3.3) we have, for regressions with intercept, that Σ_{i=1}^n ε̂_i = 0. Hence, from Y_i = Ŷ_i + ε̂_i it follows that
\[
\frac{1}{n}\sum_{i=1}^n Y_i = \frac{1}{n}\sum_{i=1}^n \hat{Y}_i + \frac{1}{n}\sum_{i=1}^n \hat{\varepsilon}_i,
\quad\text{i.e.}\quad \bar{Y} = \bar{\hat{Y}} + 0. \tag{3.7}
\]

• From (3.5) we know that Y′Y = Ŷ′Ŷ + ε̂′ε̂, i.e.
\[
\begin{aligned}
Y'Y &= \hat{Y}'\hat{Y} + \hat{\varepsilon}'\hat{\varepsilon}\\
Y'Y - n\bar{Y}^2 &= \hat{Y}'\hat{Y} - n\bar{\hat{Y}}^2 + \hat{\varepsilon}'\hat{\varepsilon} && \big(\text{by (3.7), } \bar{Y}=\bar{\hat{Y}}\big)\\
\sum_{i=1}^n Y_i^2 - n\bar{Y}^2 &= \sum_{i=1}^n \hat{Y}_i^2 - n\bar{\hat{Y}}^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2\\
\sum_{i=1}^n (Y_i-\bar{Y})^2 &= \sum_{i=1}^n (\hat{Y}_i-\bar{\hat{Y}})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2. \ \ \blacksquare
\end{aligned}
\]

Coefficient of determination R². The larger the proportion of the explained variance, the better the fit of the model. This motivates the definition of the so-called R² coefficient of determination:
\[
R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i-\bar{\hat{Y}})^2}{\sum_{i=1}^n (Y_i-\bar{Y})^2}
= 1 - \frac{\sum_{i=1}^n \hat{\varepsilon}_i^2}{\sum_{i=1}^n (Y_i-\bar{Y})^2}.
\]
Obviously, we have that 0 ≤ R² ≤ 1. The closer R² lies to 1, the better the fit of the model to the observed data. However, a high/low R² does not mean a validation/falsification of the estimated model. Any relation (i.e., model assumption) needs a plausible explanation from relevant economic theory. The most often criticized disadvantage of the R² is that additional regressors (relevant or not) will always increase the R². Here is an example of the problem.

set.seed(123)
n     <- 100                     # Sample size
X     <- runif(n, 0, 10)         # Relevant X variable
X_ir  <- runif(n, 5, 20)         # Irrelevant X variable
error <- rt(n, df = 10)*10       # True error
Y     <- 1 + 5*X + error         # Y variable
lm1 <- summary(lm(Y ~ X))        # Correct OLS regression
lm2 <- summary(lm(Y ~ X + X_ir)) # OLS regression with X_ir
lm1$r.squared < lm2$r.squared
#> [1] TRUE
So, R² increases here even though X_ir is a completely irrelevant explanatory variable. Because of this, the R² cannot be used as a criterion for model selection. Possible solutions are given by penalized criteria such as the so-called adjusted R², defined as
\[
\bar{R}^2 = 1 - \frac{\frac{1}{n-K}\sum_{i=1}^n \hat{\varepsilon}_i^2}{\frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar{Y})^2} \le R^2.
\]
The adjustment is in terms of the degrees of freedom n − K.
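The following sketch is not part of the original script. It assumes that the objects X_1, X_2, Y and my.lm from the code in Section 3.2 are still in the workspace and recomputes the OLS estimate, the fitted values, the residuals and the (adjusted) R² directly from the matrix formulas, comparing them with the lm() output.

## Sketch (assumed to run after the code of Section 3.2):
X        <- cbind(1, X_1, X_2)                   # design matrix with intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y     # (X'X)^{-1} X'Y
cbind(beta_hat, coef(my.lm))                     # should coincide

P_X   <- X %*% solve(t(X) %*% X) %*% t(X)        # projection matrix
Y_hat <- P_X %*% Y                               # fitted values
e_hat <- Y - Y_hat                               # residuals
n <- length(Y); K <- ncol(X)
R2     <- 1 - sum(e_hat^2) / sum((Y - mean(Y))^2)
R2_adj <- 1 - (sum(e_hat^2)/(n - K)) / (sum((Y - mean(Y))^2)/(n - 1))
c(R2,     summary(my.lm)$r.squared)              # should coincide
c(R2_adj, summary(my.lm)$adj.r.squared)          # should coincide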
3.4 Method of Moments Estimator
The method of moments estimator exploits the exogeneity assumption E(ε_i|X_i) = 0 for all i = 1,…,n (Assumption 2). Remember that E(ε_i|X_i) = 0 implies that E(X_{ik}ε_i) = 0 for all i = 1,…,n and all k = 1,…,K. The fundamental idea behind "method of moments estimation" is to use the sample analogues of these population moment restrictions E(X_{ik}ε_i) = 0, k = 1,…,K, for deriving the estimator:
\[
\underbrace{\left.\begin{aligned}
E(\varepsilon_i) &= 0\\
E(X_{i2}\varepsilon_i) &= 0\\
&\ \ \vdots\\
E(X_{iK}\varepsilon_i) &= 0
\end{aligned}\right\}\Leftrightarrow E(X_i\varepsilon_i)=\underset{(K\times1)}{0}}_{K\ \text{population moment restrictions}}
\qquad
\underbrace{\left.\begin{aligned}
\tfrac{1}{n}\textstyle\sum_{i=1}^n \hat\varepsilon_i &= 0\\
\tfrac{1}{n}\textstyle\sum_{i=1}^n X_{i2}\hat\varepsilon_i &= 0\\
&\ \ \vdots\\
\tfrac{1}{n}\textstyle\sum_{i=1}^n X_{iK}\hat\varepsilon_i &= 0
\end{aligned}\right\}\Leftrightarrow \tfrac{1}{n}\textstyle\sum_{i=1}^n X_i\hat\varepsilon_i=\underset{(K\times1)}{0}}_{K\ \text{sample moment restrictions}}
\]
Under our set of assumptions (Assumptions 1-4), the sample means n^{-1}Σ_{i=1}^n X_i ε̂_i are consistent estimators of the population means E(X_i ε_i). The idea is now to find values β̂_1,…,β̂_K which lead to residuals ε̂_i = Y_i − Σ_{k=1}^K β̂_k X_{ik} that fulfill the above sample moment restrictions. This should in principle be possible, since we have a linear system of K equations, (1/n)Σ_{i=1}^n X_i ε̂_i = 0, in K unknowns β̂ = (β̂_1,…,β̂_K)′. Solving the equation system yields
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n X_i\hat\varepsilon_i &= \underset{(K\times1)}{0}\\
\frac{1}{n}\sum_{i=1}^n X_i\big(Y_i - X_i'\hat\beta\big) &= \underset{(K\times1)}{0}\\
\frac{1}{n}\sum_{i=1}^n X_iY_i - \frac{1}{n}\sum_{i=1}^n X_iX_i'\hat\beta &= \underset{(K\times1)}{0}\\
\frac{1}{n}\sum_{i=1}^n X_iX_i'\hat\beta &= \frac{1}{n}\sum_{i=1}^n X_iY_i\\
\hat\beta &= \Big(\frac{1}{n}\sum_{i=1}^n X_iX_i'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n X_iY_i\\
\hat\beta &= (X'X)^{-1}X'Y,
\end{aligned}
\]
which equals the OLS estimator of β, although we now used a different approach to derive the estimator.
Once again we see the importance of the exogeneity assumption E(ε_i|X_i) = 0, which we used here as the starting point for the derivation of the method of moments estimator. However, unlike when deriving the OLS estimator as the estimator that minimizes the sum of squared residuals, here we derived the estimator from the exogeneity assumption. The method of moments is a very general method, which usually has good properties. We will return to the method of moments several times throughout the semester.
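A quick numerical check of the method-of-moments logic, not part of the original script: at the OLS solution, the K sample moment conditions (1/n)Σ X_i ε̂_i = 0 hold up to numerical precision. The sketch again assumes the objects X_1, X_2, Y and my.lm from Section 3.2; the name X_mat is introduced here only for illustration.

## Sketch (assumed setup from Section 3.2):
X_mat <- cbind(1, X_1, X_2)          # regressor matrix incl. intercept
e_hat <- residuals(my.lm)            # OLS residuals
t(X_mat) %*% e_hat / length(e_hat)   # sample moment conditions; numerically zero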
3.5 Unbiasedness of β̂|X
Once again, but now using matrix algebra, we can show that the OLS (or likewise the method of moments) estimator β̂ = (X′X)^{-1}X′Y is unbiased:
\[
\begin{aligned}
E[\hat\beta\,|\,X] &= E\big[(X'X)^{-1}X'Y\,|\,X\big]\\
&= E\big[(X'X)^{-1}X'(X\beta+\varepsilon)\,|\,X\big]\\
&= E\big[(X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon\,|\,X\big]\\
&= \beta + E\big[(X'X)^{-1}X'\varepsilon\,|\,X\big]\\
&= \beta + (X'X)^{-1}X'\underbrace{E[\varepsilon\,|\,X]}_{=0} = \beta
\quad\Longleftrightarrow\quad \mathrm{Bias}[\hat\beta\,|\,X]=0.
\end{aligned}
\]
Note. This result only requires the strict exogeneity assumption E(ε|X) = 0, which follows from our Assumption 2 (i.e. E(ε_i|X_i) = 0 for all i) together with Assumption 1 (i.e. (Y_i, X_i) is i.i.d. across i = 1,…,n). In particular, we did not need to assume homoscedasticity (the result also holds for autocorrelated error terms).

3.6 Variance of β̂|X
The conditional variance of β̂ given X is
\[
\begin{aligned}
\mathrm{Var}(\hat\beta\,|\,X) &= E\big[(\hat\beta-\beta)(\hat\beta-\beta)'\,|\,X\big]\\
&= E\Big[\big((X'X)^{-1}X'\varepsilon\big)\big((X'X)^{-1}X'\varepsilon\big)'\,\big|\,X\Big]\\
&= E\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\,|\,X\big]\\
&= (X'X)^{-1}X'\underbrace{E[\varepsilon\varepsilon'\,|\,X]}_{(n\times n)}X(X'X)^{-1}\\
&= (X'X)^{-1}X'\,\mathrm{Var}(\varepsilon\,|\,X)\,X(X'X)^{-1},
\end{aligned}
\]
which is a (K×K)-dimensional matrix. In the above derivations we used, first, that
\[
\hat\beta - \beta = (X'X)^{-1}X'Y - \beta = (X'X)^{-1}X'(X\beta+\varepsilon) - \beta = \beta + (X'X)^{-1}X'\varepsilon - \beta = (X'X)^{-1}X'\varepsilon
\]
and, second, that E[εε′|X] = Var(ε|X), since E(ε|X) = 0 under our assumptions. The above expression is the general version of Var(β̂|X), which can be further simplified using specific assumptions on the distribution of the error term ε:

• In case of spherical errors ("Gauss-Markov assumptions"), i.e. no heteroscedasticity and no autocorrelation, we have that Var(ε|X) = σ²I_n, such that
\[
\underset{(K\times K)}{\mathrm{Var}(\hat\beta\,|\,X)} = \sigma^2(X'X)^{-1}.
\]

• In case of conditional heteroscedasticity, we have that
\[
\mathrm{Var}(\varepsilon\,|\,X) = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0\\ 0 & \sigma_2^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix} = \mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2),
\]
where the variances σ_i² may be functions of X_i, i.e. σ_i² ≡ σ_i²(X_i). So, under conditional heteroscedasticity we have the following "sandwich" form:
\[
\mathrm{Var}(\hat\beta\,|\,X) = (X'X)^{-1}X'\,\mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2)\,X(X'X)^{-1}.
\]
Practical estimation of the standard errors. The diagonal elements of the (K×K) matrix Var(β̂|X) are the variances of the single estimators β̂_k with k = 1,…,K,
\[
\mathrm{Var}(\hat\beta_k\,|\,X) = \big[(X'X)^{-1}X'\,\mathrm{Var}(\varepsilon\,|\,X)\,X(X'X)^{-1}\big]_{kk} \quad\text{(generally)}
\]
\[
\phantom{\mathrm{Var}(\hat\beta_k\,|\,X)} = \sigma^2\big[(X'X)^{-1}\big]_{kk}, \quad\text{(spherical errors)}
\]
where [A]_{kl} denotes the klth element of A, i.e. the element in the kth row and lth column of A. Taking square roots leads to the standard errors of the estimators β̂_k|X:
\[
\mathrm{SE}(\hat\beta_k\,|\,X) = \Big(\big[(X'X)^{-1}X'\,\mathrm{Var}(\varepsilon\,|\,X)\,X(X'X)^{-1}\big]_{kk}\Big)^{1/2} \quad\text{(generally)}
\]
\[
\phantom{\mathrm{SE}(\hat\beta_k\,|\,X)} = \Big(\sigma^2\big[(X'X)^{-1}\big]_{kk}\Big)^{1/2}. \quad\text{(spherical errors)}
\]
Of course, the above expressions for Var(β̂_k|X) and SE(β̂_k|X) are generally useless in practice, since we typically do not know the (n×n) variance matrix Var(ε|X) or σ², but need to estimate them from the data. So, we typically need to work with
\[
\widehat{\mathrm{Var}}(\hat\beta_k\,|\,X) = \big[(X'X)^{-1}X'\,\widehat{\mathrm{Var}}(\varepsilon\,|\,X)\,X(X'X)^{-1}\big]_{kk} \quad\text{(generally)}
\]
\[
\phantom{\widehat{\mathrm{Var}}(\hat\beta_k\,|\,X)} = \hat\sigma^2\big[(X'X)^{-1}\big]_{kk} \quad\text{(spherical errors)}
\]
and
\[
\widehat{\mathrm{SE}}(\hat\beta_k\,|\,X) = \Big(\big[(X'X)^{-1}X'\,\widehat{\mathrm{Var}}(\varepsilon\,|\,X)\,X(X'X)^{-1}\big]_{kk}\Big)^{1/2} \quad\text{(generally)}
\]
\[
\phantom{\widehat{\mathrm{SE}}(\hat\beta_k\,|\,X)} = \Big(\hat\sigma^2\big[(X'X)^{-1}\big]_{kk}\Big)^{1/2}. \quad\text{(spherical errors)}
\]
For the case of spherical errors, we already know a possible estimator, namely σ̂² = s²_UB = (n−K)^{-1}Σ_{i=1}^n ε̂_i². However, finding a reasonable estimator Var-hat(ε|X) = diag(v̂_1,…,v̂_n) for the general heteroscedastic case is a little more tricky. The econometric literature knows the following heteroscedasticity-consistent (HC) robust approaches:
\[
\begin{aligned}
\text{HC0:}\quad & \hat v_i = \hat\varepsilon_i^2\\
\text{HC1:}\quad & \hat v_i = \frac{n}{n-K}\,\hat\varepsilon_i^2\\
\text{HC2:}\quad & \hat v_i = \frac{\hat\varepsilon_i^2}{1-h_i}\\
\text{HC3:}\quad & \hat v_i = \frac{\hat\varepsilon_i^2}{(1-h_i)^2} \qquad (\Leftarrow \text{most often used})\\
\text{HC4:}\quad & \hat v_i = \frac{\hat\varepsilon_i^2}{(1-h_i)^{\delta_i}}
\end{aligned}
\]
where h_i = X_i′(X′X)^{-1}X_i = [P_X]_{ii} is the ith diagonal element of the projection matrix P_X, h̄ = n^{-1}Σ_{i=1}^n h_i, and δ_i = min{4, h_i/h̄}.
Side Note. The statistic 1/n ≤ h_i ≤ 1 is called the "leverage" of X_i, where (by construction) the average leverage is n^{-1}Σ_{i=1}^n h_i = K/n. Observations X_i with leverage statistics h_i that greatly exceed the average leverage value K/n are referred to as "high-leverage" observations. High-leverage observations potentially have a large influence on the estimation result. Typically, high-leverage observations X_i distort the estimation results if the corresponding value of the error term ε_i is unusually large (an "outlier").
The estimator HC0 was suggested in the econometrics literature by White (1980) and is justified by asymptotic arguments. The estimators HC1, HC2 and HC3 were suggested by MacKinnon and White (1985) to improve the performance in small samples. A more extensive study of small sample behavior was carried out by Long and Ervin (2000), who arrive at the conclusion that HC3 provides the best performance in small samples, as it inflates the ε̂_i values, which is thought to adjust for the "over-influence" of observations with large leverage values h_i. Cribari-Neto (2004) suggested the estimator HC4 to further improve small sample performance, especially in the presence of influential observations (large h_i values).
The following R code shows how to compute HC robust variance/standard
error estimators:
set.seed(2)
n <- 100
K <- 3
X <- matrix(runif(n*(K-1), 2, 10), n, K-1)
X <- cbind(1, X)
beta <- c(1, 5, 5)
# heteroscedastic errors:
sigma <- abs(X[,2] + X[,3])^1.5
error <- rnorm(n, mean = 0, sd = sigma)
Y <- beta[1]*X[,1] + beta[2]*X[,2] + beta[3]*X[,3] + error
##
lm_fit <- lm(Y ~ X - 1)
## Caution! By default R computes the standard errors
## assuming homoscedastic errors. This can lead to
## false inferences under heteroscedastic errors.
summary(lm_fit)$coefficients
#>     Estimate Std. Error    t value    Pr(>|t|)
#> X1 -3.818392  17.797787 -0.2145431 0.830573969
#> X2  5.474779   2.084915  2.6259006 0.010043024
#> X3  5.566453   2.011848  2.7668360 0.006778811

library("sandwich") # HC robust variance estimation
library("lmtest")
## Robust estimation of the variance of \hat{\beta}:
Var_beta_hat_robust <- sandwich::vcovHC(lm_fit, type = "HC3")
Var_beta_hat_robust
#>           X1         X2         X3
#> X1 389.83293 -39.198061 -38.026861
#> X2 -39.19806   5.763624   2.260672
#> X3 -38.02686   2.260672   5.437287

## Corresponding regression-output:
lmtest::coeftest(lm_fit, vcov = Var_beta_hat_robust)
#>
#> t test of coefficients:
#>
#>    Estimate Std. Error t value Pr(>|t|)
#> X1  -3.8184    19.7442 -0.1934  0.84706
#> X2   5.4748     2.4008  2.2804  0.02477 *
#> X3   5.5665     2.3318  2.3872  0.01892 *
#> ---
#> Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Observe that the HC robust variance estimation leads to larger variances than the classic variance estimation for homoscedastic errors. This is typically, but not always, the case.
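The leverage values h_i discussed in the side note above can be read off directly from a fitted lm() object. The following lines are a small sketch, not part of the original script, applied to the lm_fit object created in the code above.

## Sketch (assuming lm_fit from the code above is in the workspace):
h <- hatvalues(lm_fit)   # h_i = [P_X]_ii, the leverage values
mean(h)                  # equals K/n by construction (here 3/100)
which(h > 2 * mean(h))   # one common rule of thumb for flagging high leverage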
3.7 The Gauss-Markov Theorem
The Gauss-Markov Theorem. Let us assume that Assumptions 1-4 hold with spherical errors, i.e. with E(εε′|X) = σ²I_n. Then the Gauss-Markov theorem states that, of all linear and unbiased estimators, the OLS (or method of moments) estimator β̂ = (X′X)^{-1}X′Y has the smallest variance, in a matrix sense. That is, for any alternative linear and unbiased estimator β̃ we have that
\[
\mathrm{Var}(\tilde\beta\,|\,X) \ge \mathrm{Var}(\hat\beta\,|\,X) \ \ (\text{"in the matrix sense"})
\quad\Longleftrightarrow\quad
\mathrm{Var}(\tilde\beta\,|\,X) - \mathrm{Var}(\hat\beta\,|\,X) = D,
\]
where D is a positive semidefinite (K×K) matrix, i.e. a′Da ≥ 0 for any K-dimensional vector a ∈ R^K. Observe that this implies Var(β̃_k|X) ≥ Var(β̂_k|X) for any k = 1,…,K.
Proof of the Gauss-Markov Theorem. Since β̃ is assumed to be linear in Y, we can write
\[
\tilde\beta = CY,
\]
where C is some (K×n) matrix, which is a function of X and/or nonrandom components. Adding a (K×n) zero matrix 0 yields
\[
\tilde\beta = \Big(\underbrace{C - (X'X)^{-1}X'}_{=0\ \text{added}} + (X'X)^{-1}X'\Big)Y.
\]

Let now D = C − (X′X)^{-1}X′; then
\[
\begin{aligned}
\tilde\beta &= \big(D + (X'X)^{-1}X'\big)Y\\
&= DY + (X'X)^{-1}X'Y\\
&= D(X\beta+\varepsilon) + (X'X)^{-1}X'Y\\
&= DX\beta + D\varepsilon + \hat\beta, \qquad (3.8)
\end{aligned}
\]
so that
\[
E(\tilde\beta\,|\,X) = E(DX\beta\,|\,X) + E(D\varepsilon\,|\,X) + E(\hat\beta\,|\,X) = DX\beta + 0 + \beta. \qquad (3.9)
\]
Since β̃ is (by assumption) unbiased, we have that E(β̃|X) = β. Therefore, (3.9) implies that DX = 0_(K×K), since we must have E(β̃|X) = DXβ + 0 + β = β. Plugging DX = 0 into (3.8) yields
\[
\begin{aligned}
\tilde\beta &= D\varepsilon + \hat\beta\\
\tilde\beta - \beta &= D\varepsilon + (\hat\beta-\beta)\\
\tilde\beta - \beta &= D\varepsilon + (X'X)^{-1}X'\varepsilon\\
\tilde\beta - \beta &= \big(D + (X'X)^{-1}X'\big)\varepsilon, \qquad (3.10)
\end{aligned}
\]
where we used that
\[
\hat\beta - \beta = (X'X)^{-1}X'Y - \beta = (X'X)^{-1}X'(X\beta+\varepsilon) - \beta = (X'X)^{-1}X'\varepsilon.
\]
So,
\[
\begin{aligned}
\mathrm{Var}(\tilde\beta\,|\,X) &= \mathrm{Var}(\tilde\beta-\beta\,|\,X) && (\text{since } \beta \text{ is not random})\\
&= \mathrm{Var}\big((D+(X'X)^{-1}X')\varepsilon\,|\,X\big) && (\text{from equation (3.10)})\\
&= \big(D+(X'X)^{-1}X'\big)\,\mathrm{Var}(\varepsilon\,|\,X)\,\big(D'+X(X'X)^{-1}\big)\\
&= \sigma^2\big(D+(X'X)^{-1}X'\big)I_n\big(D'+X(X'X)^{-1}\big)\\
&= \sigma^2\big(DD' + (X'X)^{-1}\big) && (\text{using that } DX=0)\\
&\ge \sigma^2(X'X)^{-1} && (\text{since } DD' \text{ is pos. semidef.})\\
&= \mathrm{Var}(\hat\beta\,|\,X).
\end{aligned}
\]
Showing that DD′ is really positive semidefinite: a′DD′a = (D′a)′(D′a) = ã′ã ≥ 0, where ã = D′a is an n-dimensional column vector.
Note. To reiterate: the unbiasedness of β̂ did not depend on any assumptions about the distribution of ε, except that E(ε|X) = 0, which follows from our Assumption 2 together with the i.i.d. assumption in Assumption 1. Once we additionally impose the assumption of spherical errors, E(εε′|X) = σ²I_n, we can show that β̂ has the smallest variance of all linear unbiased estimators.
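The following Monte Carlo sketch is not part of the original script; all settings (n, beta, the weight matrix W, the error standard deviation) are illustrative assumptions. It compares OLS with an alternative linear unbiased estimator β̃ = (X′WX)^{-1}X′WY that uses an arbitrary fixed weight matrix W: under spherical errors both estimators are unbiased, but OLS should show the smaller variance, as the Gauss-Markov theorem predicts.

## Simulation sketch (assumed settings, not from the original script):
set.seed(1)
n <- 50; beta <- c(1, 2)
X <- cbind(1, runif(n, 0, 10))                    # design held fixed across replications
W <- diag(runif(n, 0.2, 5))                       # arbitrary (wrong) weights
A_ols <- solve(t(X) %*% X) %*% t(X)               # OLS:         beta_hat   = A_ols %*% Y
A_alt <- solve(t(X) %*% W %*% X) %*% t(X) %*% W   # alternative: beta_tilde = A_alt %*% Y
est <- replicate(5000, {
  Y <- X %*% beta + rnorm(n, sd = 3)              # spherical errors
  c(A_ols %*% Y, A_alt %*% Y)
})
rowMeans(est)      # all entries close to c(1, 2, 1, 2): both estimators unbiased
apply(est, 1, var) # the OLS entries (first two) should be the smallest (Gauss-Markov)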

Chapter 4
Small Sample Inference
Note on small sample sizes. Sometimes cases with n < 30 are referred to as "small sample cases", since one often hopes that the central limit theorem helps out once n ≥ 30. However, this is a very dangerous rule of thumb! A sufficiently large n that allows us to rely on the central limit theorem depends on many different distributional aspects. In this chapter, we consider the case where we cannot rely on the central limit theorem.

Exact inference. This chapter considers exact inference using the multiple linear regression model. By exact we mean correct distributions for each sample size n. That is, no asymptotic (large n → ∞) arguments will be used.

Assumptions. Recall that we, in general, did not impose a complete distributional assumption on ε in Assumption 4; the i.i.d. normal case in Assumption 4 was only one possible option. However, to do exact inference, the normality assumption on the error terms is not a mere option, but a necessity. So for this chapter we assume that Assumptions 1-3 from Chapter 3 hold and that additionally the following assumption holds:

Assumption 4*: Error distribution. For small sample cases, we assume that the error terms are i.i.d. normal conditionally on X_i, i.e. ε_i|X_i ~ i.i.d. N(0, σ²) for all i = 1,…,n, with spherical errors conditionally on X. That is,
\[
\varepsilon\,|\,X \sim N\big(0,\ \sigma^2 I_n\big),
\]
where ε = (ε_1,…,ε_n)′.

Normality of β̂. Under Assumptions 1-4* it can be shown that
\[
\hat\beta_n\,|\,X \sim N\big(\beta,\ \mathrm{Var}(\hat\beta_n\,|\,X)\big), \tag{4.1}
\]
where Var(β̂_n|X) = σ²(X′X)^{-1}. This result follows from noting that β̂_n = (X′X)^{-1}X′Y = β + (X′X)^{-1}X′ε and because (X′X)^{-1}X′ε is just a linear combination of the normally distributed error terms ε, which, therefore, is again normally distributed, conditionally on X. Note that (4.1) is the exact small sample distribution of β̂_n|X and fully relies on the conditional normality assumption, Assumption 4*.

Note. The subscript n in β̂_n is here only to emphasize that the distribution of β̂_n depends on n; we will, however, often simply write β̂.

4.1 Hypothesis Tests about Multiple Parameters

Let us consider the following system of q-many null hypotheses:
\[
H_0:\ \underset{(q\times K)}{R}\ \underset{(K\times1)}{\beta} - \underset{(q\times1)}{r} = \underset{(q\times1)}{0},
\]
where the (q×K) matrix R and the q-vector r = (r_1,…,r_q)′ are chosen by the statistician to specify her/his null hypothesis about the unknown true parameter vector β. To make sure that there are no redundant equations, it is required that rank(R) = q. We must also specify the alternative against which we are testing the null hypothesis, for instance
\[
H_A:\ R\beta - r \neq 0.
\]
The above multiple parameter hypotheses also cover the special case of single parameter hypotheses; for instance, by setting R = (0, 1, 0,…,0) and r = 0 one gets
\[
H_0:\ \beta_k = 0 \qquad H_A:\ \beta_k \neq 0.
\]

4.1.1 The Test Statistic and its Null Distribution

Under our assumptions (Assumptions 1 to 4*), we have that
\[
(R\hat\beta_n - r)\,|\,X \sim N\big(R\beta - r,\ R\,\mathrm{Var}(\hat\beta_n\,|\,X)\,R'\big).
\]
So, the realizations of (Rβ̂_n − r)|X will scatter around the unknown (Rβ − r) in a non-systematic, Gaussian fashion. Therefore, if the null hypothesis is correct (i.e., (Rβ − r) = 0), the realizations of (Rβ̂_n − r)|X scatter around the q-vector 0. If, however, the alternative hypothesis is correct (i.e., (Rβ − r) = a ≠ 0), the realizations of (Rβ̂_n − r)|X scatter around the q-vector a ≠ 0. So, under the alternative hypothesis, there will be a systematic location-shift of the q-dimensional random variable Rβ̂_n|X away from r, which we try to detect using statistical hypothesis testing.

The fact that (Rβ̂_n − r) ∈ R^q is a q-dimensional random variable makes it a little bothersome to use as a test statistic. Fortunately, we can turn (Rβ̂_n − r) into a scalar-valued test statistic using the following quadratic form:
\[
W = \underset{(1\times q)}{(R\hat\beta_n - r)'}\ \underset{(q\times q)}{\big[R\,\mathrm{Var}(\hat\beta_n\,|\,X)\,R'\big]^{-1}}\ \underset{(q\times1)}{(R\hat\beta_n - r)}.
\]
Note that the test statistic W is simply measuring the distance (a weighted L2-distance) between the two q-vectors Rβ̂_n and r. Moreover, under the null hypothesis (i.e., if the null hypothesis is true), W is just a sum of q-many independent squared standard normal random variables. Therefore, under the null hypothesis, W is χ²(q) distributed with q degrees of freedom (see Section 1.2.10.3),
\[
W = (R\hat\beta_n - r)'\big[R\,\mathrm{Var}(\hat\beta_n\,|\,X)\,R'\big]^{-1}(R\hat\beta_n - r) \overset{H_0}{\sim} \chi^2(q).
\]
Usually, we do not know Var(β̂_n|X) and, therefore, have to estimate this quantity. Unfortunately, the general heteroscedasticity-consistent robust estimators Var-hat_HC0(β̂_n|X),…,Var-hat_HC4(β̂_n|X) from Chapter 3 are only asymptotically justified and, therefore, inappropriate for exact small sample inference. For truly exact finite sample inference, we need a variance estimator for which we can derive the exact small sample distribution. Therefore, we assume in Assumption 4* spherical errors (i.e., Var(ε|X) = σ²I_n), which yields Var(β̂_n|X) = σ²(X′X)^{-1}, where σ² can be estimated by s²_UB = (n−K)^{-1}Σ_{i=1}^n ε̂_i². From the normality assumption in Assumption 4*, it follows then that σ^{-2}(n−K)s²_UB ~ χ²(n−K). This then leads to the following exact null distribution of
\[
F = (R\hat\beta_n - r)'\big[R\big(s^2_{UB}(X'X)^{-1}\big)R'\big]^{-1}(R\hat\beta_n - r)/q \ \overset{H_0}{\sim}\ F(q,\ n-K), \tag{4.2}
\]
where F(q, n−K) denotes the F-distribution with q numerator and n−K denominator degrees of freedom (see, for instance, Hayashi, 2000, Ch. 1). By contrast to W, F is a practically useful test statistic, and we can use observed values F_obs to measure the distance of our estimate Rβ̂_n from the value r. Observed values F_obs that are unusually large under the null hypothesis lead to a rejection of the null hypothesis. The null distribution F(q, n−K) of F is used to judge what is unusually large under the null hypothesis.

The F distribution. The F distribution is a ratio of two χ² distributions. It has two parameters: the numerator degrees of freedom and the denominator degrees of freedom. Each combination of the parameters yields a different F distribution. See Section 1.2.10.6 for more information on the F statistic.

[Figure: densities of the F distribution for various degrees of freedom, e.g. F(3,5), F(3,15), F(3,500) and F(1,30), F(3,30), F(15,30)]

4.2 Tests about One Parameter

For testing a hypothesis about only one parameter β_k, with k = 1,…,K,
\[
H_0:\ \beta_k = r \qquad H_A:\ \beta_k \neq r,
\]
the (q×K = 1×K) matrix R equals a row vector of zeros but with a one as the kth element (e.g., for k = 2, R = (0,1,0,…,0)), such that F in (4.2) simplifies to
\[
\frac{\big(\hat\beta_k - r\big)^2}{\widehat{\mathrm{Var}}(\hat\beta_k\,|\,X)} \ \overset{H_0}{\sim}\ F(1,\ n-K),
\]
where Var-hat(β̂_k|X) = R(s²(X′X)^{-1})R′. Taking square roots yields
\[
t = \frac{\hat\beta_k - r}{\widehat{\mathrm{SE}}(\hat\beta_k\,|\,X)} \ \overset{H_0}{\sim}\ t_{(n-K)}.
\]
Thus the t-distribution with n − K degrees of freedom is the appropriate reference metric for judging how "far away" our estimates are from the hypothetical parameter value r under the null hypothesis.

Note. All commonly used statistical software packages report t-tests for the null hypothesis H_0: β_k = 0, i.e. with r = 0. This means testing the null hypothesis that X_k has "no (linear) effect" on Y.
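The t statistic for H_0: β_k = 0 reported by summary(lm) can easily be recomputed by hand from the coefficient estimate and its classical standard error. The following sketch is not part of the original script and uses made-up toy data (x, y, fit).

## Sketch (assumed toy data, not from the original script):
set.seed(1)
n <- 20
x <- runif(n)
y <- 1 + 2*x + rnorm(n)
fit  <- lm(y ~ x)
b    <- coef(fit)[2]
se   <- sqrt(diag(vcov(fit)))[2]  # classical SE: sqrt(s^2 * [(X'X)^{-1}]_kk)
tobs <- b / se
c(tobs, summary(fit)$coefficients[2, "t value"])  # identical by construction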
The following plot illustrates that, as the degrees of freedom increase, the shape of the t distribution comes closer to that of a standard normal bell curve. Already for 25 degrees of freedom we find little difference to the standard normal density. For small degrees of freedom, the t distribution has heavier tails than the standard normal.

[Figure: densities of the standard normal N(0,1) and of t distributions with 2 and 25 degrees of freedom]

4.3 Test Theory

4.3.1 Significance Level

To actually test the null hypothesis (e.g., H_0: Rβ − r = 0 or H_0: β_k = 0), we need a decision rule on when we will reject and not reject the null hypothesis. This amounts to deciding on a probability with which we are comfortable rejecting the null hypothesis when it is in fact true (Type I error or α error). The probability of such a Type I error shall be bounded from above by a (small) significance level α, that is,
\[
P(\text{reject } H_0 \mid H_0 \text{ is true}) = P(\text{Type I Error}) = \alpha.
\]
For a given significance level (e.g., α = 0.05) and a given alternative hypothesis, we can divide the range of all possible values of the test statistic (i.e., R, since both t ∈ R and F ∈ R) into a rejection region and a non-rejection region by using certain quantiles, called critical values, of the test statistic's distribution under the null. We can do this because the test statistics t and F have known distributions under the null hypothesis (t ~H0 t(n−K) and F ~H0 F(q, n−K)); indeed, under Assumption 4*, we know the exact null distributions for every sample size n. Having decided on the rejection and non-rejection regions, it is a simple matter of seeing where the observed (obs) sample values t_obs or F_obs of the statistics t or F lie: either in the rejection or in the non-rejection region.

Non-conservative versus conservative tests. Since the test statistics F and t are continuous random variables of which we know the exact distributions (under Assumptions 1-4*), we can find critical values such that
\[
P(\text{Type I Error}) = \alpha.
\]
We call such tests "non-conservative", since the probability of a Type I error equals the significance level α. Test statistics with P(Type I Error) < α are called conservative test statistics; they lead to valid inferences, but will detect a violation of the null hypothesis less often than a non-conservative test. A test statistic with P(Type I Error) > α leads to invalid inferences!
4.3.2 Critical Value for the F-Test
The critical value c_α > 0 defines the rejection region, ]c_α, ∞[, and the non-rejection region, ]0, c_α], which divide the test-statistic space (here R_+, since F ∈ R_+) for a given significance level α, such that
\[
P(\text{Type I Error}) = P_{H_0}\big(F \in\ ]c_\alpha,\infty[\ \big) = \alpha,
\]
where c_α is the (1 − α) quantile of the F-distribution with (q, n−K) degrees of freedom, and where P_{H_0} means that we compute the probability under the assumption that H_0 is true.

[Figure: density of the F(9,120) distribution; the non-rejection region (1 − α = 95%) lies to the left of the critical value c_α = 1.9588 and the rejection region (α = 5%) to the right]
The rejection region. The rejection region describes a range of values of the test statistic F which we rarely see if the null hypothesis is true (in at most α·100% of cases). If the observed value of the test statistic, F_obs, falls in this region, we reject the null hypothesis, and hereby accept Type I errors in at most α·100% of cases.

The non-rejection region. The non-rejection region describes a range of values of the test statistic F which we expect to see (in (1 − α)·100% of cases) if the null hypothesis is true. If the observed value of the test statistic, F_obs, falls in this region, we do not reject the null hypothesis.

Caution: Not rejecting the null hypothesis does not mean that we can conclude that the null hypothesis is true. We only had no sufficiently strong evidence against the null hypothesis. A violation of the null hypothesis, for instance Rβ − r = a ≠ 0, may simply be too small to stand out from the estimation errors (measured by the standard error) in β̂_k.

Reading the F-Table. Fortunately, you do not need to read old-school distribution tables to find the critical value c_α, but can simply use R:
df1   <- 9    # numerator df
df2   <- 120  # denominator df
alpha <- 0.05 # significance level
## Critical value c_alpha (= (1-alpha) quantile):
c_alpha <- qf(p = 1-alpha, df1 = df1, df2 = df2)
c_alpha
#> [1] 1.958763
Changing the significance level from α = 0.05 to α = 0.01 makes the critical value c_α larger and, therefore, the rejection region smaller (fewer Type I errors):

alpha <- 0.01
## Critical value c_alpha (= (1-alpha) quantile):
c_alpha <- qf(p = 1-alpha, df1 = df1, df2 = df2)
c_alpha
#> [1] 2.558574
4.3.3 Critical Value(s) for the t-Test
In case of the t-test, we need to differentiate between two-sided and one-sided testing.

Two-Sided t-Test. Two-sided hypothesis:
\[
H_0:\ \beta_k = r \qquad H_A:\ \beta_k \neq r
\]
In case of a two-sided t-test, we reject the null hypothesis if the observed realization of the t-statistic, t_obs, is "far away" from zero in either direction (i.e., if β̂_k is sufficiently smaller or sufficiently greater than r). The corresponding two-sided critical values are denoted by −c_{α/2} < 0 and c_{α/2} > 0, where c_{α/2} is the (1 − α/2) quantile of the t-distribution with (n − K) degrees of freedom. These critical values define the following rejection and non-rejection regions:

rejection region: ]−∞, −c_{α/2}[ ∪ ]c_{α/2}, ∞[

non-rejection region: [−c_{α/2}, c_{α/2}].

For this rejection region it holds that
\[
P(\text{Type I Error}) = P_{H_0}\big(t \in\ ]-\infty,-c_{\alpha/2}[\ \cup\ ]c_{\alpha/2},\infty[\ \big) = \alpha.
\]
[Figure: density of the t distribution with 12 degrees of freedom; 95% of the probability mass lies between the critical values −c_{α/2} = −2.18 and c_{α/2} = 2.18, with 5% in total in the two tails]
One-Sided t-Test. One-sided hypothesis:
\[
H_0:\ \beta_k = r \qquad H_A:\ \beta_k > r \ \ (\text{or } H_A:\ \beta_k < r)
\]
In case of a one-sided t-test, we reject the null if t_obs is sufficiently "far away" from zero in the relevant direction of H_A. The corresponding critical value is either −c_α (for H_A: β_k < r) or c_α (for H_A: β_k > r), where c_α is the (1 − α) quantile of the t-distribution with (n − K) degrees of freedom. The critical value c_α defines the following rejection and non-rejection regions:

For H_0: β_k = r versus H_A: β_k < r:

rejection region: ]−∞, −c_α[

non-rejection region: [−c_α, ∞[

such that
\[
P(\text{Type I Error}) = P_{H_0}\big(t \in\ ]-\infty, -c_\alpha[\ \big) = \alpha.
\]

For H_0: β_k = r versus H_A: β_k > r:

rejection region: ]c_α, ∞[

non-rejection region: ]−∞, c_α]

such that
\[
P(\text{Type I Error}) = P_{H_0}\big(t \in\ ]c_\alpha, \infty[\ \big) = \alpha.
\]
[Figure: densities of the t distribution with 12 degrees of freedom; left panel: rejection region (5%) below −c_α = −1.78; right panel: rejection region (5%) above c_α = 1.78]

Reading the t-Table. Fortunately, you do not need to read old-school distribution tables to find the critical values c_{α/2} or c_α, but you can simply use R:

df    <- 16   # degrees of freedom
alpha <- 0.05 # significance level
## One-sided critical value (= (1-alpha) quantile):
c_oneSided <- qt(p = 1-alpha, df = df)
c_oneSided
#> [1] 1.745884
## Two-sided critical value (= (1-alpha/2) quantile):
c_twoSided <- qt(p = 1-alpha/2, df = df)
## lower critical value
-c_twoSided
#> [1] -2.119905
## upper critical value
c_twoSided
#> [1] 2.119905
4.4 Type II Error and Power
A Type II error is the mistake of not rejecting the null hypothesis when in fact it should have been rejected. The probability of making a Type II error equals one minus the probability of correctly rejecting the null hypothesis (the "power"). For instance, in the case of using the t-test to test the null hypothesis H_0: β_k = 0 versus the one-sided alternative hypothesis H_A: β_k > 0, we have that
\[
P(\text{Type II Error}) = P_{H_A}\Big(t \in \underbrace{]-\infty, c_\alpha]}_{\text{non-rejection region}}\Big)
= 1 - \underbrace{P_{H_A}\Big(t \in \underbrace{]c_\alpha,\infty[}_{\text{rejection region}}\Big)}_{\text{"Power"}},
\]

where P_{H_A} means that we compute the probability under the assumption that H_A is true.

There is a trade-off between the probability of making a Type I error and the probability of making a Type II error: a lower significance level α decreases P(Type I Error), but necessarily increases P(Type II Error), and vice versa. Ideally, we would have some sense of the costs of making each of these errors, and would choose our significance level to minimize these total costs. However, the costs are often difficult to know. Moreover, the probability of making a Type II error is usually impossible to compute, since we usually do not know the true distribution of β̂_k under the alternative hypothesis.

For illustration purposes, however, consider the case of a t-test for the one-sided hypothesis
\[
H_0:\ \beta_k = 0 \qquad H_A:\ \beta_k > 0,
\]
where the true (usually unknown) parameter value is β_k = 3 and where the true (usually also unknown) standard error is SE(β̂_k|X) = (σ²[(X′X)^{-1}]_{kk})^{1/2} = 1.5. The advantage here is that we can derive the distribution of the t-test statistic even under the alternative hypothesis. Note that the distribution of the t-test statistic becomes here a standard normal distribution, since we assume SE(β̂_k|X) = (σ²[(X′X)^{-1}]_{kk})^{1/2} = 1.5 to be a known (deterministic) quantity. (This completely unrealistic assumption is only used for illustrative purposes!)

Distribution "under the null hypothesis" (i.e., if β_k = 0 were true):
\[
t = \frac{\hat\beta_k}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}} \ \overset{H_0}{\sim}\ N(0,1)
\]
Distribution under the alternative hypothesis for the true parameter value (i.e., for the actual β_k = 3):
\[
t = \frac{\hat\beta_k}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}
= \frac{\hat\beta_k + 3 - 3}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}
= \underbrace{\frac{\hat\beta_k - 3}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}}_{\sim N(0,1)}
+ \underbrace{\frac{3}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}}_{=\Delta\ \text{(mean-shift)}}
\ \overset{H_A}{\sim}\ N(\Delta, 1)
\]
So, the mean-shift Δ depends on the value of (σ²[(X′X)^{-1}]_{kk})^{1/2} and the difference between the actual parameter value (β_k = 3) and the hypothetical parameter value under the null hypothesis (here 0). The following graphic illustrates the probability of a Type II error and the power for the case where (σ²[(X′X)^{-1}]_{kk})^{1/2} = 1.5, such that Δ = 3/1.5 = 2.
[Figure: null distribution N(0,1) with one-sided critical value c_α = 1.64 and alternative distribution N(2,1); the area of N(2,1) below c_α is P(Type II Error), the area above c_α is the power]
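For the stylized example above (known standard error, so the test statistic is exactly normal), the power of the one-sided 5% test can be computed directly as a function of the mean-shift Δ. The following lines are a small sketch, not part of the original script.

## Sketch for the stylized example above (assumed, not from the original script):
alpha   <- 0.05
c_alpha <- qnorm(1 - alpha)             # one-sided critical value, approx. 1.64
Delta   <- seq(0, 5, by = 0.5)
power   <- 1 - pnorm(c_alpha - Delta)   # P(t > c_alpha) under N(Delta, 1)
round(cbind(Delta, power, typeII = 1 - power), 3)
## For Delta = 2 (the case illustrated above) the power is about 0.64.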

4.5 p-Value
The p-value of a test statistic is the significance level we would obtain if we took the sample value of the observed test statistic, Fobs or tobs, as the border between the rejection and non-rejection regions.
F-test: p = P_{H_0}(F ≥ F_obs)

t-test (H_A: β_k ≠ r): p = 2 · min{P_{H_0}(t ≤ t_obs), P_{H_0}(t ≥ t_obs)}

t-test (H_A: β_k > r): p = P_{H_0}(t ≥ t_obs)

t-test (H_A: β_k < r): p = P_{H_0}(t ≤ t_obs)
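In R, such p-values can be computed directly from the null distributions with pt() and pf(). The following sketch is not part of the original script; the observed values and degrees of freedom are purely illustrative.

## Sketch (hypothetical observed values, not from the original script):
tobs <- 2.3; df <- 12                 # hypothetical observed t and its df
2 * (1 - pt(abs(tobs), df = df))      # two-sided p-value
1 - pt(tobs, df = df)                 # one-sided p-value for H_A: beta_k > r
Fobs <- 4.6; q <- 2; n <- 10; K <- 3  # hypothetical observed F
1 - pf(Fobs, df1 = q, df2 = n - K)    # F-test p-value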
4.7 Practice: Small Sample Inference

The practice data sets test_data and test_data_new used below are generated with the function myDataGenerator(); holding the design X fixed and redrawing only the error terms produces new Y values, conditionally on X.

round(head(test_data_new, 3), 2) # New Y, conditionally on X
#>        Y X_1  X_2   X_3
#> 1  97.98   1 6.94 18.49
#> 2 100.03   1 6.55 19.06
#> 3  75.38   1 3.23 16.51
## compare
round(head(test_data, 3), 2)
#>        Y X_1  X_2   X_3
#> 1  99.87   1 6.94 18.49
#> 2  96.31   1 6.55 19.06
#> 3  83.14   1 3.23 16.51
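The definition of myDataGenerator() itself is not reproduced in this excerpt. The following is only a hedged sketch of one possible implementation that is consistent with how the function is called in the code below (arguments n, beta, and an optional fixed design X); the assumed regressor ranges and the default error standard deviation are illustrative assumptions, not the original code.

## Hedged sketch of a possible myDataGenerator() (assumptions, not the original):
myDataGenerator <- function(n, beta, X = NULL, sigma = 3) {
  if (is.null(X)) {
    X <- cbind(X_1 = rep(1, n),         # intercept
               X_2 = runif(n, 2, 10),   # assumed range
               X_3 = runif(n, 15, 20))  # assumed range
  }
  eps <- rnorm(n, mean = 0, sd = sigma) # i.i.d. normal errors (Assumption 4*)
  Y   <- as.numeric(X %*% beta + eps)
  data.frame(Y = Y, X_1 = X[, 1], X_2 = X[, 2], X_3 = X[, 3])
}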
4.7.1 Normally Distributed β̂|X

The above data generating process fulfills our regularity assumptions, Assumptions 1-4*. So, by theory, the estimators β̂_k|X should be normally distributed conditionally on X,
\[
\hat\beta_k\,|\,X \sim N\big(\beta_k,\ \sigma^2[(X'X)^{-1}]_{kk}\big),
\]
where [(X′X)^{-1}]_{kk} denotes the element in the kth row and kth column of the matrix (X′X)^{-1}. Let us check the distribution by means of a Monte Carlo simulation for the case of β̂_2|X with a small sample size of n = 10.
set.seed(123)
n         <- 10       # a small sample size
beta_true <- c(2,3,4) # true beta vector
sigma     <- 3        # true sd of the error term

## Lets generate a data set from our data generating process
mydata <- myDataGenerator(n = n, beta = beta_true)
X_cond <- cbind(mydata$X_1, mydata$X_2, mydata$X_3)

## True mean and variance of the true normal distribution
## of beta_hat_2|X:
# true mean
beta_true_2 <- beta_true[2]
# true variance
var_true_beta_2 <- sigma^2 * diag(solve(t(X_cond) %*% X_cond))[2]

## Lets generate 5000 realizations from beta_hat_2
## conditionally on X and check whether their
## distribution is close to the true normal distribution
rep <- 5000 # MC replications
beta_hat_2 <- rep(NA, times=rep)
##
for(r in 1:rep){
  MC_data <- myDataGenerator(n = n,
                             beta = beta_true,
                             X = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  beta_hat_2[r] <- coef(lm_obj)[2]
}

## Compare
## True beta_2 versus average of beta_hat_2 estimates
beta_true_2
#> [1] 3
round(mean(beta_hat_2), 4)
#> [1] 3.0091
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates
round(var_true_beta_2, 4)
#> [1] 0.416
round(var(beta_hat_2), 4)
#> [1] 0.4235

## True normal distribution of beta_hat_2 versus
## empirical density of beta_hat_2 estimates
library("scales")
curve(expr = dnorm(x, mean = beta_true_2, sd = sqrt(var_true_beta_2)),
      xlab = "", ylab = "", col = gray(.2), lwd = 3, lty = 1,
      xlim = range(beta_hat_2), ylim = c(0, 1.1))
lines(density(beta_hat_2, bw = bw.SJ(beta_hat_2)),
      col = alpha("blue", .5), lwd = 3)
legend("topleft", lty = c(1,1), lwd = c(3,3),
       col = c(gray(.2), alpha("blue", .5)), bty = "n", legend =
         c(expression(
           "Theoretical Gaussian Density of"~hat(beta)[2]~"|"~X),
           expression(
           "Empirical Density Estimation based on MC realizations from"~
             hat(beta)[2]~"|"~X)))

[Figure: theoretical Gaussian density of β̂_2|X together with the empirical density estimate based on the Monte Carlo realizations of β̂_2|X]
Great! The nonparametric density estimate (computed via density()) from the simulated realizations of β̂_2|X indicates that β̂_2|X is really normally distributed, as described by our theoretical result in Equation (4.1).
However, what would happen if we sampled unconditionally on X? How would the distribution of β̂_2 then look?
set.seed(123)
## Lets generate 5000 realizations from beta_hat_2
## WITHOUT conditioning on X
beta_hat_2_uncond <- rep(NA, times=rep)
##
for(r in 1:rep){
  MC_data <- myDataGenerator(n = n, beta = beta_true)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  beta_hat_2_uncond[r] <- coef(lm_obj)[2]
}
## Compare
## True beta_2 versus average of beta_hat_2 estimates
beta_true_2
#> [1] 3
round(mean(beta_hat_2_uncond), 4)
#> [1] 2.9973
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates
round(var_true_beta_2, 4)
#> [1] 0.416
round(var(beta_hat_2_uncond), 4)
#> [1] 0.2521

## Plot
curve(expr = dnorm(x, mean = beta_true_2, sd = sqrt(var_true_beta_2)),
      xlab = "", ylab = "", col = gray(.2), lwd = 3, lty = 1,
      xlim = range(beta_hat_2_uncond), ylim = c(0, 1.1))
lines(density(beta_hat_2_uncond, bw = bw.SJ(beta_hat_2_uncond)),
      col = alpha("blue", .5), lwd = 3)
legend("topleft", lty = c(1,1), lwd = c(3,3),
       col = c(gray(.2), alpha("blue", .5)), bty = "n", legend =
         c(expression(
           "Theoretical Gaussian Density of"~hat(beta)[2]~"|"~X),
           expression(
           "Empirical Density Estimation based on MC realizations from"~
             hat(beta)[2])))

[Figure: theoretical Gaussian density of β̂_2|X together with the empirical density estimate based on unconditional Monte Carlo realizations of β̂_2]

Not good. Since we do not condition on X, the realizations of X affect the distribution of β̂_2, and our theoretical Gaussian distribution result in Equation (4.1) does not apply anymore.
4.7.2 Testing Multiple Parameters
In the following, we do inference about multiple parameters. We test
\[
H_0:\ \beta_2 = 3 \ \text{ and } \ \beta_3 = 4
\]
versus
\[
H_A:\ \beta_2 \neq 3 \ \text{ and/or } \ \beta_3 \neq 4.
\]
Or equivalently,
\[
H_0:\ R\beta - r = 0 \qquad H_A:\ R\beta - r \neq 0,
\]
where
\[
R = \begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}
\quad\text{and}\quad
r = \begin{pmatrix} 3\\ 4 \end{pmatrix}.
\]
The following R code can be used to test this hypothesis:
suppressMessages(library("car")) # for linearHypothesis()
# ?linearHypothesis

## Estimate the linear regression model parameters
lm_obj <- lm(Y ~ X_2 + X_3, data = mydata)

## Option 1:
car::linearHypothesis(model = lm_obj,
                      hypothesis.matrix = c("X_2=3", "X_3=4"))
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 4
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#>   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
#> 1      9 87.285
#> 2      7 37.599  2    49.686 4.6252 0.05246 .
#> ---
#> Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

## Option 2:
R <- rbind(c(0,1,0),
           c(0,0,1))
car::linearHypothesis(model = lm_obj,
                      hypothesis.matrix = R, rhs = c(3,4))
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 4
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#>   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
#> 1      9 87.285
#> 2      7 37.599  2    49.686 4.6252 0.05246 .
#> ---
#> Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
Not surprisingly, we cannot reject the null hypothesis at a significance level of, for instance, α = 0.05, since we actually test the true null hypothesis. However, in repeated samples we should nevertheless observe α·100% Type I errors (false rejections of H_0). Let's check this using the following Monte Carlo simulation:
## Lets generate 5000 F-test decisions and check
## whether the empirical rate of type I errors is
## close to the theoretical significance level.
rep <- 5000 # MC replications
F_test_pvalues <- rep(NA, times=rep)
##
for(r in 1:rep){
  ## generate new MC_data conditionally on X_cond
  MC_data <- myDataGenerator(n = n,
                             beta = beta_true,
                             X = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  ## save the p-value
  p <- linearHypothesis(lm_obj, c("X_2=3", "X_3=4"))$`Pr(>F)`[2]
  F_test_pvalues[r] <- p
}
##
signif_level <- 0.05
rejections <- F_test_pvalues[F_test_pvalues < signif_level]
round(length(rejections)/rep, 3)
#> [1] 0.05
##
signif_level <- 0.01
rejections <- F_test_pvalues[F_test_pvalues < signif_level]
round(length(rejections)/rep, 3)
#> [1] 0.009
Note that this is actually a very strong result. First, it means that we correctly control the Type I error rate, since the Type I error rate is not larger than the chosen significance level α. Second, it means that the test is not conservative (i.e. very efficient), since the Type I error rate is approximately equal to the chosen significance level α. (In fact, if we increased the number of Monte Carlo repetitions, the empirical Type I error rate would converge to the selected significance level α due to the law of large numbers.)
Next, we check how well the F test detects certain violations of the null hypothesis. We do this by using the same data generating process, but by testing the following incorrect null hypothesis:
\[
H_0:\ \beta_2 = 4 \ \text{ and } \ \beta_3 = 4
\qquad
H_A:\ \beta_2 \neq 4 \ \text{ and/or } \ \beta_3 \neq 4
\]
set.seed(321)
rep <- 5000 # MC replications
F_test_pvalues <- rep(NA, times=rep)
##
for(r in 1:rep){
  ## generate new MC_data conditionally on X_cond
  MC_data <- myDataGenerator(n = n,
                             beta = beta_true,
                             X = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  ## save p-values of all rep-many tests
  F_test_pvalues[r] <- linearHypothesis(lm_obj,
                         c("X_2=4","X_3=4"))$`Pr(>F)`[2]
}
##
signif_level <- 0.05
rejections <- F_test_pvalues[F_test_pvalues < signif_level]
length(rejections)/rep
#> [1] 0.3924
Indeed, we can now reject the (false) null hypothesis in approximately 39% of all resamplings from the true data generating process. Caution: This also means that we are not able to "see" the violation of the null hypothesis in 100% − 39% = 61% of cases. Therefore, we can never use an insignificant test result (p-value ≥ α) as a justification to accept the null hypothesis.
Moreover, note that the F test is not informative about which part of the null hypothesis (β_2 = 4 and/or β_3 = 4) is violated; we only get the information that at least one of the multiple parameter hypotheses is violated:

car::linearHypothesis(lm_obj, c("X_2=4", "X_3=4"))
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 4
#> X_3 = 4
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#>   Res.Df     RSS Df Sum of Sq      F   Pr(>F)
#> 1      9 141.245
#> 2      7  32.572  2    108.67 11.677 0.005889 **
#> ---
#> Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
4.7.3 Duality of Confidence Intervals and Hypothesis Tests
Confidence intervals can be computed using R as follows:

signif_level <- 0.05
## 95% CI for beta_2
confint(lm_obj, parm = "X_2", level = 1 - signif_level)
#>         2.5%    97.5%
#> X_2 1.370315 3.563536
## 95% CI for beta_3
confint(lm_obj, parm = "X_3", level = 1 - signif_level)
#>         2.5%    97.5%
#> X_3 3.195389 4.695134
We can use these two-sided confidence intervals to do hypothesis tests. For instance, when testing the null hypothesis
\[
H_0:\ \beta_3 = 4 \quad\text{versus}\quad H_A:\ \beta_3 \neq 4,
\]
we can check whether the confidence interval CI_{1−α} contains the hypothetical value 4 or not. If 4 ∈ CI_{1−α}, we cannot reject the null hypothesis. If 4 ∉ CI_{1−α}, we reject the null hypothesis.
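The confidence interval behind confint() is simply β̂_k plus/minus a t quantile times the classical standard error. The following lines are a small sketch, not part of the original script; they assume that lm_obj is the fitted model used in the confint() calls above.

## Sketch (assuming lm_obj from above is in the workspace):
alpha <- 0.05
b  <- coef(lm_obj)["X_3"]
se <- sqrt(diag(vcov(lm_obj)))["X_3"]
df <- df.residual(lm_obj)                     # n - K
b + c(-1, 1) * qt(1 - alpha/2, df = df) * se  # same as confint(lm_obj, "X_3")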
If Assumptions 1-4* hold true, then CI_{1−α} is an exact confidence interval. That is, the corresponding test falsely rejects a true null hypothesis in only α·100% of resamplings. Let us check this in the following R code:
## Lets generate 5000 CIs
set.seed(123)
signif_level <- 0.05
rep <- 5000 # MC replications
confint_m <- matrix(NA, nrow=2, ncol=rep)
##
for(r in 1:rep){
  ## generate new MC_data conditionally on X_cond
  MC_data <- myDataGenerator(n = n,
                             beta = beta_true,
                             X = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  ## save the confidence interval
  CI <- confint(lm_obj, parm="X_2", level=1-signif_level)
  confint_m[,r] <- CI
}
##
inside_CI <- confint_m[1,] <= beta_true_2 & beta_true_2 <= confint_m[2,]
## CI-lower, CI-upper, beta_true_2 inside?
head(cbind(t(confint_m), inside_CI))
#>                          inside_CI
#> [1,] 0.8555396 3.639738         1
#> [2,] 0.9143542 3.270731         1
#> [3,] 1.9336526 4.984167         1
#> [4,] 1.9985874 3.812695         1
#> [5,] 3.0108642 5.621791         0
#> [6,] 2.0967675 4.716398         1
round(length(inside_CI[inside_CI == FALSE])/rep, 2)
#> [1] 0.05
nCIs <- 100
plot(x=0, y=0, type="n", xlim=c(0,nCIs), ylim=range(confint_m[,1:nCIs]),
     ylab="", xlab="Resamplings", main="Confidence Intervals")
for(r in 1:nCIs){
  if(inside_CI[r]==TRUE){
    lines(x=c(r,r), y=c(confint_m[1,r], confint_m[2,r]),
          lwd=2, col=gray(.5,.5))
  }else{
    lines(x=c(r,r), y=c(confint_m[1,r], confint_m[2,r]),
          lwd=2, col="darkred")
  }
}
axis(4, at=beta_true_2, labels = expression(beta[2]))
abline(h=beta_true_2)

[Figure: the first 100 simulated confidence intervals for β_2; intervals that miss the true value β_2 = 3 are drawn in dark red]

Chapter 5

Large Sample Inference

5.1 Tools for Asymptotic Statistics

5.1.1 Modes of Convergence

In the following we will discuss the four most important convergence concepts for sequences of random variables (z_1, z_2,…,z_n), shortly denoted by {z_n}. Non-random scalars (or vectors or matrices) will be denoted by Greek letters such as α.

Four Important Modes of Convergence

1. Convergence in Probability: A sequence of random scalars {z_n} converges in probability to a constant (non-random) α if, for any (arbitrarily small) ε > 0,
\[
\lim_{n\to\infty} P(|z_n - \alpha| > \varepsilon) = 0.
\]
We write: plim_{n→∞} z_n = α, or z_n →_p α. Convergence in probability of a sequence of random vectors (or matrices) {z_n} to a constant vector (or matrix) α requires element-wise convergence in probability.

2. Almost Sure Convergence: A sequence of random scalars {zn} con- verges almost surely to a constant (non-random) – if
P 3 lim zn = –4 = 1. næŒ
a.s.
We write: zn ≠æ –. Almost sure convergence of a sequence of random vectors
(or matrices) {zn} to a constant vector (or matrix) – requires element-wise almost sure convergence.
Note. Almost sure convergence is (usually) rather hard to derive, since the probability is about an event concerning an infinite sequence. Fortunately, however, there are established strong laws of large numbers that we can use for showing almost sure convergence.
3. Convergence in Mean Square: A sequence of random scalars {zn} converges in mean square (or in quadratic mean) to a constant (non- random) – if
lim E 1(zn ≠ –)22 = 0 næŒ
ˆ
E ((zn ≠ –)2) is termed the mean squared error: MSE(zn) = E ((zn ≠ –)2). Mean square convergence of a sequence of random vectors (or matrices) {zn} to a deterministic vector (or matrix) – requires element-wise mean square convergence.
4. Convergence in Distribution: Let Fn be the cumulative distribution function (cdf) of zn and F the cdf of z. A sequence of random scalars {zn} converges in distribution to a random scalar z if for all t such that F (t) is continuous at t,
lim Fn(t) = F (t). næŒ 124
m.s.
Wewrite: zn≠æ–. Ifzn isanestimator(e.g.,zn =—k,n)theexpression

d
We write: zn ≠æ z and call F the asymptotic (or limit) distribution of da
zn. Sometimes you will see statements like zn ≠æN(0,1) or zn ≥ N(0,1), d
which should be read as zn ≠æz, where z ≥ N(0,1).
Note. A stochastic sequence {z_n} can also converge in distribution to a deterministic scalar α. In this case α is treated as a degenerate random variable with cdf
\[
F^{\alpha}(t) = \begin{cases} 1 & \text{if } t \ge \alpha\\ 0 & \text{if } t < \alpha. \end{cases}
\]

4′. Multivariate Convergence in Distribution: Let z_n, z ∈ R^K be K-dimensional random variables; then
\[
z_n \overset{d}{\to} z \quad\text{if and only if}\quad \lambda'z_n \overset{d}{\to} \lambda'z \ \text{ for any } \lambda \in \mathbb{R}^K.
\]
This statement is known as the Cramér-Wold device. It is needed since element-wise convergence in distribution does generally not imply convergence of the joint distribution of z_n to the joint distribution of z, except if all elements are independent from each other.

Relations among Modes of Convergence

Lemma 5.1.1. Relationship among the four modes of convergence:

(i) z_n →_{m.s.} α ⇒ z_n →_p α.

(ii) z_n →_{a.s.} α ⇒ z_n →_p α.

(iii) z_n →_d α ⇔ z_n →_p α. I.e., if the limiting random variable is a constant (i.e., a degenerate random variable), then convergence in distribution is equivalent to convergence in probability.

Proofs can be found, e.g., here: https://www.statlect.com/asymptotic-theory/relations-among-modes-of-convergence

5.1.2 Continuous Mapping Theorem (CMT)

Lemma 5.1.2. Preservation of convergence for continuous transformations (or "continuous mapping theorem (CMT)"): Suppose {z_n} is a stochastic sequence of random scalars, vectors, or matrices and that a(·) is a continuous function that does not depend on n. Then

(i) z_n →_p α ⇒ a(z_n) →_p a(α)

(ii) z_n →_{a.s.} α ⇒ a(z_n) →_{a.s.} a(α)

(iii) z_n →_d α ⇒ a(z_n) →_d a(α)

A proof can be found, e.g., in Asymptotic Statistics, van der Vaart (1998), Theorem 2.3, or here: https://www.statlect.com/asymptotic-theory/continuous-mapping-theorem

Note. The CMT does not hold for m.s.-convergence, except for the case where a(·) is linear.

Examples. As a consequence of the CMT (Lemma 5.1.2), the usual arithmetic operations preserve convergence in probability (equivalently for almost sure convergence and convergence in distribution):

If x_n →_p β and y_n →_p γ, then x_n + y_n →_p β + γ.

If x_n →_p β and y_n →_p γ, then x_n · y_n →_p β · γ.

If x_n →_p β and y_n →_p γ, then x_n / y_n →_p β/γ, provided that γ ≠ 0.

If X_n′X_n →_p Σ_{X′X}, then (X_n′X_n)^{-1} →_p Σ_{X′X}^{-1}, provided Σ_{X′X} is a non-singular matrix.

5.1.3 Slutsky Theorem

The following results are concerned with combinations of convergence in probability and convergence in distribution. These are particularly important for the derivation of the asymptotic distribution of estimators.

Lemma 5.1.3 (Slutsky Theorem). Let x_n and y_n denote sequences of random scalars or vectors and let A_n denote a sequence of random matrices. Moreover, α and A are deterministic limits of appropriate dimensions and x is a random limit of appropriate dimension.

(i) If x_n →_d x and y_n →_p α, then x_n + y_n →_d x + α.

(ii) If x_n →_d x and y_n →_p 0, then x_n′y_n →_p 0.

(iii) If x_n →_d x and A_n →_p A, then A_n x_n →_d A x, where it is assumed that A_n and x_n are "conformable" (i.e., the matrix and vector dimensions fit to each other).

Important special case: If x_n →_d N(0, Σ) and A_n →_p A, then A_n x_n →_d N(0, AΣA′).

Proofs can be found, e.g., in Asymptotic Statistics, van der Vaart (1998), Theorem 2.8, or here: https://www.statlect.com/asymptotic-theory/Slutsky-theorem

Note. Sometimes, only parts (i) and (ii) of Lemma 5.1.3 are called "Slutsky's theorem."

5.1.4 Law of Large Numbers (LLN) and Central Limit Theorem (CLT)

So far, we discussed the definitions of the four most important convergence modes, their relations among each other, and basic theorems (CMT and Slutsky) about functionals of stochastic sequences. However, we still lack tools that allow us to actually show that a stochastic sequence converges (in one of the discussed modes) to some limit.
In the following we consider the stochastic sequences {z ̄ } of sample means z ̄ = n≠1 qni=1 z , where z , i = 1,...,n, are (scalar, vector, or matrix-valued) nii random variables. Remember: the sample mean z ̄ is an estimator of the deterministic population mean μ. Weak LLNs, strong LLqNs, and CLTs tell us conditions under which arithmetic means z ̄ = n≠1 ni=1 z converge: ni p n n z ̄ ≠æ μ (weakLLN) a.s. z ̄ ≠æ μ (strongLLN) Ônd2 n(z ̄≠μ)≠æN(0,‡) (CLT) n n In the following we introduce the most well-known versions of the weak, the strong LLN, and the CLT. Theorem 5.1.4 (Weak LLN (Chebychev)). If limE(z ̄)=μ and limVar(z ̄)=0 then z ̄ ≠æμ næŒn næŒn n Proof can be found, for instance, here: https://www.statlect.com/asymptotic- theory/law-of-large-numbers 128 p Theorem 5.1.5 (Strong LLN (Kolmogorov)). If{z}is an iid sequence and E(z)=μ then z ̄ ≠æμ a.s. iin Proof can be found, e.g., in Linear Statistical Inference and Its Applications, Rao (1973), pp. 112-114. Note. The weak and the strong LLN for random vectors follow from requiring element-by-element convergence. Theorem 5.1.6 (CLT (Lindeberg-Levy)). If {zi} is an iid sequence and E(zi) = μ and Var(zi) = ‡2 then Ôd2 n(z ̄≠μ)≠æN(0,‡) as næŒ Proof can be found, e.g., in Asymptotic Statistics, van der Vaart (1998), Theorem 2.17. The Lindeberg-Levy CLT for K-dimensional random vectors follows from our above discussion on “Multivariate Convergence in Distribution.” From thisweknowthatifz ̄ œRK andμœRK,then n n ÔdÔd n(z ̄ ≠μ)≠æN(0,) ... n(⁄Õz ̄ ≠⁄Õμ)≠æN(0,⁄Õ⁄), for any ⁄ œ RK. That is, to apply the Lindeberg-Levy CLT (Theorem 5.1.6) to multivari- ate (e.g., K-dimensional) stochastic sequences, we need to check whether the univariate stochastic sequence {⁄Õzi} is i.i.d. with E(⁄Õzi) = ⁄Õμ and Var(⁄Õzi) = ⁄Õ⁄ for any ⁄ œ RK. This is the case if the multivariate (K- dimensional) stochastic sequence {zi} is an i.i.d. sequence with E(zi) = μ and Var(zi) = . 129 nn Note. The LLNs and the CLT are stated with respect to sequences of sample means {z ̄ }; i.e., the simplest estimators you probably can think of. We will see, however, that this is all we need in order to analyze also more complicated estimators such as the OLS estimator. 5.1.5 Estimators as a Sequences of Random Variables Our concepts above readily apply to general scalar-valued (univariate) or vector-valued (K-dimensional) estimators, say ◊ˆ œ RK, that are computed from i.i.d. random samples. n (Weak) Consistency: We say that an estimator ◊ˆ is (weakly) consistent n for ◊ if Asymptotic Bias: The asymptotic bias of an estimator ◊ˆ of some true ˆp ◊n≠æ◊ as næŒ parameter ◊ is defined as: n ABias(◊ˆ ) = lim E(◊ˆ ) ≠ ◊ n næŒ n If ABias(◊ˆ ) = 0, then ◊ˆ is called an asymptotically unbiased. n Asymptotic Normality: A consistent estimator ◊ˆ is asymptotically nor- mal distributed if n Ô ˆ n(◊n ≠◊)≠æN(0,) as næŒ d n Var(Ôn(◊ˆ ≠ ◊)) = lim Var(Ôn◊ˆ ) = as called the næŒ n næŒ n where lim asymptotic variance of Ôn(◊ˆ ≠ ◊). n 130 ÔˆpÔ n-consistent: Consistent estimators ◊n ≠æ ◊ are called n-consistent if Ô If additionally the random vector z is normal distributed, then ◊ˆ is often called consistent and asymptotically normal. n 5.2 Asymptotics under the Classic Regression Model Given the above introduced machinery on asymptotic concepts and results, we can now proof that the OLS estimators —ˆ and s2UB applied to the classic regression model (defined by Assumptions 1-4) are consistent and asymptot- ically normal distributed estimators as n æ Œ. 
That is, we can drop the unrealistic normality and spherical errors assumption (Assumption 4ú), but still use the usual test statistics (t-test, F-test); as long as the sample size n is “large.” However, before we can formally state the asymptotic properties, we first need to adjust our “data generating process” assumption (Assumption 1) such that we can apply Kolmogorov’s strong LLN and Lindeberg-Levy’s CLT. Second, we need to adjust the rank assumption (Assumption 3), such that the full column rank of X is guaranteed for the limiting case as n æ Œ, too. Assumptions 2 and 4 from Chapter 3 are assumed to hold. Assumption 1ú: Data Generating Process for Asymptotics As- sumption 1 applies, but additionally we assume that (Xi,Ái) (or equivalently (Yi, Xi)) is i.i.d. for all i = 1, . . . , n, with existing and finite second moments for Xi and fourth moments for Ái. (Note: The fourth moment of Ái is actually only needed for Theorem 5.2.4; for the rest two moments are sucient.) 131 ˆ n(◊n≠◊)≠æz as næŒ d Note. The above adjustment of Assumption 1 is far less restrictive than assuming that the error-terms are normally distributed (as it’s necessary for small samples). Assumption 3ú: Rank Condition for Asymptotics The (K ◊ K) limiting matrix X X Õ = E(S Õ iÕ ) = E(X X ) in has full rank K; i.e., X X p Õ this assumption does not assume that SX X X X X X i n≠1XÕX = S p XÕX ≠æXÕX as næŒ therefore, is nonsingular and invertible. (Note, ≠æ hold, it assumes that the limiting matrix is of full rank K.) Note. The crucial part of Assumption 3ú is that the limit matrix has full rank K. The convergence in probability statement (S p already from Assumption 1ú. XÕX ≠æXÕX) follows Note. Assumption 3ú implies the existence and finiteness of the first two moments of Xi (even without Assumption 1ú). Theorem 5.2.1 (Consistency of S≠1 ). Under Assumption 1ú and 3ú we have that XÕX A1XÕXB≠1 =S≠1 p ≠1 n XÕX ≠æ XÕX as næŒ Proof is done in the lecture. Theorem 5.2.2 (Consistency of —ˆ). Under Assumption 1ú, 2 and 3ú we have that ˆp —n≠æ— as næŒ Proof is done in the lecture. Ô Furthermore, we can show that the appropriately scaled (by n) sampling error —ˆ ≠ — of the OLS estimator is asymptotically normal distributed. Õ Õ , but if this convergence 132 Theorem 5.2.3 (Sampling error limiting normality (the classic case)). For simplicity, we consider here the special case of spherical error (Var(Á|X) = ‡2In). Under Assumption 1ú, 2 and 3ú we then have that Ôˆ d 1 2≠12 n(—n≠—)≠æN 0,‡XÕX as næŒ In principle, we can derive the usual test statistics from the latter result. we need , where the consistency of the former estimator is provided by Prop. 5.2.1 and the consistency of s2UB is provided by the following result. Theorem 5.2.4 (Consistency of s2UB). ct.com/fundamentals-of-statistics/OLS-estimator-properties 5.2.1 The Case of Heteroscedasticity Theorem 5.2.3 can also be stated and proofed for conditionally heteroscedastic error terms. In this case one gets Proof is done in the lecture. Though, as long as we do not know (we usually don’t) ‡2 and X X to plug-in the (consistent!) 
estimators S≠1 Õ 2 Õ 2p2 sUB≠æ‡ as næŒ Proof is skipped, but a detailed proof can be found here: https://www.statle Ô ˆ d 1 ≠1 2 Õ ≠1 2 n(—n ≠—)≠æN 0,XÕXE(Ái XiXi)XÕX and s XX UB (5.1) (i.e., the asymptotic variance of n(— ≠ —)) is where≠1 2 Õ ≠1 as næŒ Ô ˆ E(Á XX) XXiiiXX n Õ Õ usually unknown and needs to be estimated from the data by ≠1E„(Á2XXÕ)S≠1p≠1 2 Õ≠1 SXÕX i i i XÕX ≠æXÕXE(Ái XiXi)XÕX 133 as næŒ, where E„(Á2i XiXiÕ) denotes some consistent estimator of E(Á2 XiXiÕ) such as one of the followingt choices: HC0: E„(Á2iXiXiÕ)=1ÿn ˆÁ2iXiXiÕ n i=1 HC1: E„(Á2i XiXiÕ) = 1 ÿn n ˆÁ2i XiXiÕ ni=1 n≠K HC2:E„(Á2XXÕ)=1ÿn ˆÁ2i XXÕ i i i ni=11≠hi i i HC3: E„(Á2iXiXiÕ)=1ÿn ˆÁ2i 2XiXiÕ (ΩMostoftenused) ni=1 (1≠hi) HC4: E„(Á2 XiXÕ) = 1 ÿn ˆÁ2i XiXÕ i i n i=1 (1 ≠ hi)”i i (These are the heteroscedasticity-consistent robust estimators from Chapter 3.6. Note. In order to show that any of the above versions (HC0-4) of E„(Á2i XiXiÕ) is a consistent estimator of E(Á2i XiXiÕ) we actually need to assume that the explanatory variables in X have finite fourth moments (see Hayashi, 2000, Chapter 2.5). So, for this, we would need to make our Assumption 1ú more restrictive (only two moments are assumed for X). 5.2.2 Hypothesis Testing and Confidence Intervals From our asymptotic results under the classic regression model (Assumptions 1ú, 2, 3ú, and 4) we get the following results important for testing statistical hypothesis. 134 5.2.2.1 Robust Hypothesis Testing: Multiple Parameters Let us reconsider the following system of q-many null hypotheses: H0 : R — ≠ r = 0 , (q◊K)(K◊1) (q◊1) (q◊1) wherethe(q◊K)matrixRandtheq-vectorr=(r1,...,rq)Õ arechosenby the statistician to specify her/his null hypothesis about the unknown true parameter vector —. To make sure that there are no redundant equations, it is required that rank(R) = q. By contrast to the multiple parameter tests for small samples (see Chapter 4.1), we can work here with a heterosedasticity robust test statistic which is applicable for heteroscedastic error terms: W = n(R—ˆ ≠ r)Õ[R S≠1 E„(Á2 X XÕ)S≠1 RÕ]≠1(R—ˆ ≠ r) æ0 ‰2(q) (5.2) as n æ Œ. The price to pay is that the distribution of the test statistic under the null hypothesis is only valid asymptotically for large n. That is, the critical values taken from the asymptotic distribution will be useful only for “large” samples sizes. In case of homoscedastic error terms, one can nXXiiiXXnd substitute S≠1 „ 2 i Õ ≠1 2 ≠1 Õ Õ E(Á XX)S XX i iXX UBXX . Õ Õ bys S Õ Finite-sample correction. In order to improve the finite-sample perfor- mance of this test, one usually uses the Fq,n≠K distribution with q and n ≠ K degrees of freedoms instead of the ‰2(q) distribution. Asymptotically (n æ Œ) Fq,n≠K is equivalent to ‰2(q). However, for finite sample sizes n (i.e., the practically relevant case) Fq,n≠K leads to larger critical values which helps to account for the estimation errors in S≠1 „ 2 i Õ ≠1 2 ≠1 ) (orins S which are otherwise neglected by the pure asymptotic perspective. 135 E(Á XX)S XXiiXX UBXX Õ Õ Õ H 5.2.2.2 Robust Hypothesis Testing: Single Parameters Let us reconsider the case of hypotheses about only one parameter —k, with k = 1,...,K We can selecting the kth diagonal element of the test-statistic in (5.2) and taking the square root yields H0: —k=r HA : —k ”= r Ô n 3 —ˆ ≠ r 4 k t= This t test statistic allows for heteroscedastic error terms. In case of ho- æ N(0,1). ÚËS≠1 „ 2 Õ ≠1 È H0d E(Á XX)S XX iiiXXkk Õ Õ moscedastic error terms, one can substitute [S≠1 „ 2 i Õ ≠1 kk Finite-sample correction. 
In order to improve the finite-sample perfor- mance of this t test, one usually uses the t(n≠K) distribution with n ≠ K degrees of freedoms instead of the N (0, 1) distribution. Asymptotically (n æ Œ) t(n≠K) is equivalent to N(0,1). However, for finite sample sizes n (i.e., the practically relevant case) tn≠K leads to larger critical values s2[S≠1 Õ ]. Õ Õ UB XXkk which helps to account for the estimation errors in [S≠1 „ 2 i Õ ≠1 kk perspective. Õ UB XXkk ] ] ) which are otherwise neglected by the pure asymptotic (or in s2 [S≠1 X X i i X X 5.3 Robust Confidence Intervals Following the derivations in Chapter 4.6, but using the expression for the ro- bust standard errors, we get the following heteroscedasticity robust (random) 136 E(Á XX)S XX i iXX ] by Õ Õ E(Á XX)S (1 ≠ –) · 100% confidence interval CI =C—ˆ ±t Ún≠1[S≠1 E„(Á2XXÕ)S≠1 ] D. 1≠– k 1≠–,n≠K XÕX i i i XÕX kk Here, the coverage probability is an asymptotic coverage probability with P(—k œCI1≠–)æ1≠–asnæŒ. 5.4 Practice: Large Sample Inference Let’s apply the above asymptotic inference methods using R. As in Chapter 4.7 we, first, program a function myDataGenerator() which allows us to generate data from the following model, i.e., from the following fully specified data generating process: Yi =—1 +—2Xi2 +—3Xi3 +Ái, — = (—1,—2,—3)Õ = (2,3,4)Õ Xi2 ≥U[≠4,4] Xi3 ≥U[≠5,5] Ái ≥ U[≠0.5|Xi2|,0.5|Xi2|], i=1,...,n where (Yi, Xi) is assumed i.i.d. across i = 1, . . . , n. Note that, by contrast to Chapter 4.7, the error terms are not Gaussian and conditionally het- eroscedastic since Var(Á |X ) = 1 X2 (the unconditional variance is given by Var(Á ) = 4). i i 12 i2 i9 Moreover, by contrast to Chapter 4.7, we here do not need to sample new realizations of Y1, . . . , Yn conditionally on a given data matrix X. (The finite sample inference results in Chapter 4.7 are conditionally on X; however, the large sample results are unconditional results.) Therefore, this option is now removed from the myDataGenerator() R-function. 137 ## Function to generate artificial data myDataGenerator <- function(n, beta){ ## X <- cbind(rep(1, n), runif(n, -4, 4), ## runif(n, -5, 5)) eps <- runif(n, -.5 * abs(X[,2]), +.5 * abs(X[,2])) Y <-X%*%beta+eps data <- data.frame("Y"=Y, ## return(data) } "X_1"=X[,1], "X_2"=X[,2], "X_3"=X[,3]) 5.4.1 Normally Distributed —ˆ for n æ Œ The above data generating process fulfills our regulatory assumptions As- sumption 1ú, 2, 3ú, and 4. So, by theory, the estimators —ˆ should be normal k distributed for large sample sizes n – unconditionally on X and even for heteroscedastic error terms. Ôn3—ˆ≠—4æN10,Ë≠1E(Á2XXÕ)≠1È 2 k k d XÕX iiiXÕXkk Or: For our above specified data generating process, we have —ˆ æ N1— , n≠1 Ë≠1 E(Á2XXÕ)≠1 È 2 kdk XÕXiiiXÕXkk • From the assumed distributions of Xi2 and Xi3 we have that (we used that (i) E(X2) = V (X) if X has mean zero and (ii) that the variance 138 of uniform distributed random variables 1 (b ≠ a)2): 12 Qc10 0RdQc100Rd =E(S )=E(XXÕ)=c0 E(X2) 0 d=c0 16 0d XÕX XÕX i i a i2 2 b a 3 25b 0 0 E(Xi3) 0 0 3 • Moreover, E(Á2XXÕ) = E(XXÕE(Á2|X)) = E1XXÕ1 1 X222 such that iii iiii ii12i2 QE(Á2)0 0RQ400R 2Õci22 dc964d E(Ái XiXi) = ca 0 E (Ái Xi2) 0 db = ca0 15 0 db the case of —ˆ with a largish sample size of n = 100. 
2 0 0 E(Á2X2) 0 0 100 i i3 Let’s check the distribution by means of a Monte Carlo simulation for 27 set.seed(123) n <- 100 # a largish sample size beta_true <- c(2,3,4) # true data vector ## Mean and variance of the true asymptotic ## normal distribution of beta_hat_2: # true mean beta_true_2 <- beta_true[2] # true variance var_true_beta_2 <- (solve(diag(c(1, 16/3, 25/3))) %*% diag(c(4/9, 64/15, 100/27))%*% solve(diag(c(1, 16/3, 25/3))))[2,2]/n ## Lets generate 5000 realizations from beta_hat_2 ## conditionally on X and check whether their 139 ## distribution is close to the true normal distribution rep <- 5000 # MC replications beta_hat_2 <- rep(NA, times=rep) ## for(r in 1:rep){ MC_data <- myDataGenerator(n = n, beta = beta_true) lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data) beta_hat_2[r] <- coef(lm_obj)[2] } ## Compare: ## True beta_2 versus average of beta_hat_2 estimates beta_true_2 #> [1] 3
round(mean(beta_hat_2), 3)
#> [1] 3
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates round(var_true_beta_2, 5)
#> [1] 0.0015
round(var(beta_hat_2), 5)
#> [1] 0.00147
## True normal distribution of beta_hat_2 versus ## empirical density of beta_hat_2 estimates library(“scales”)
curve(expr = dnorm(x, mean = beta_true_2,
sd=sqrt(var_true_beta_2)), xlab=””,ylab=””, col=gray(.2), lwd=3, lty=1,
140

xlim=range(beta_hat_2),ylim=c(0,14.1),main=paste0(“n=”,n)) lines(density(beta_hat_2, bw = bw.SJ(beta_hat_2)),
col=alpha(“blue”,.5), lwd=3) legend(“topleft”, lty=c(1,1), lwd=c(3,3),
col=c(gray(.2), alpha(“blue”,.5)), bty=”n”, legend= c(expression(
“Theoretical (Asymptotic) Gaussian Density of”~hat(beta)[2]), expression(
“Empirical Density Estimation based on MC realizations from”~ hat(beta)[2])))
n=100
Theoretical (Asymptotic) Gaussian Density of β^
Empirical Density Estimation based on MC realizations from β2
2^
2.90 2.95 3.00 3.05 3.10
Great! The nonparametric density estimation (estimated via density()) computed from the simulated realizations of —ˆ is indicating that —ˆ is really
141
22
0 2 4 6 8 10 12 14

normally distributed as described by our theoretical result in Theorem 5.2.3 (homoscedastic case) and in Equation (5.1) (heteroscedastic case).
However, is the asymptotic distribution of —ˆ also usable for (very) small samples like n = 5? Let’s check that: 2
set.seed(123)
n <- 5 # a small sample size beta_true <- c(2,3,4) # true data vector ## Mean and variance of the true asymptotic ## normal distribution of beta_hat_2: # true mean beta_true_2 <- beta_true[2] # true variance var_true_beta_2 <- (solve(diag(c(1, 16/3, 25/3)))%*% diag(c(4/9, 64/15, 100/27))%*% solve(diag(c(1, 16/3, 25/3))))[2,2]/n ## Lets generate 5000 realizations from beta_hat_2 ## conditionally on X and check whether their ## distribution is close to the true normal distribution rep <- 5000 # MC replications beta_hat_2 <- rep(NA, times=rep) ## for(r in 1:rep){ MC_data <- myDataGenerator(n = n, beta = beta_true) lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data) beta_hat_2[r] <- coef(lm_obj)[2] } 142 ## Compare: ## True beta_2 versus average of beta_hat_2 estimates beta_true_2 #> [1] 3
round(mean(beta_hat_2), 3)
#> [1] 2.996
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates round(var_true_beta_2, 5)
#> [1] 0.03
round(var(beta_hat_2), 5)
#> [1] 0.05621
## True normal distribution of beta_hat_2 versus ## empirical density of beta_hat_2 estimates library(“scales”)
curve(expr = dnorm(x, mean = beta_true_2,
sd=sqrt(var_true_beta_2)), xlab=””,ylab=””, col=gray(.2), lwd=3, lty=1,
xlim=c(2,4), ylim=c(0,3),main=paste0(“n=”,n)) lines(density(beta_hat_2, bw = bw.SJ(beta_hat_2)),
col=alpha(“blue”,.5), lwd=3) legend(“topleft”, lty=c(1,1), lwd=c(3,3),
col=c(gray(.2), alpha(“blue”,.5)), bty=”n”, legend= c(expression(
“Theoretical (Asymptotic) Gaussian Density of”~hat(beta)[2]), expression(
“Empirical Density Estimation based on MC realizations from”~ hat(beta)[2])))
143

n=5
Theoretical (Asymptotic) Gaussian Density of β^
Empirical Density Estimation based on MC realizations from β2
2^
2.0 2.5 3.0 3.5 4.0
Not good. The actual distribution has substantially fatter tails. That is, if we would use the quantiles of the asymptotic distribution, we would falsely reject the null-hypothesis too often (probability of type I errors would be larger than the significance level). But asymptotic are kicking in pretty fast here: things become much more reliable already for n = 10.
5.4.2 Testing Multiple and Single Parameters
In the following, we do inference about multiple parameters. We test H0:—2=3 and —3=5
versus HA :—2 ”= 3 and/or —3 ”= 5. 144
0.0 0.5 1.0 1.5 2.0 2.5 3.0

Or equivalently where
H0 :R— ≠ r = 0 HA :R—≠r”=0,
R=Qa0 1 0Rb and r=Qa3Rb. 001 5
The following R code can be used to test this hypothesis:
suppressMessages(library(“car”)) # for linearHyothesis() # ?linearHypothesis
library(“sandwich”)
## Generate data
MC_data <- myDataGenerator(n = 100, beta = beta_true) ## Estimate the linear regression model parameters lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data) vcovHC3_mat <- sandwich::vcovHC(lm_obj, type="HC3") ## Option 1: car::linearHypothesis(model = lm_obj, hypothesis.matrix = c("X_2=3", "X_3=5"), vcov=vcovHC3_mat) #> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
145

#> X_3 = 5
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#>1 99
#> 2 97 2 1150.4 < 2.2e-16 *** #> —
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
## Option 2:
R <- rbind(c(0,1,0), c(0,0,1)) car::linearHypothesis(model = lm_obj, hypothesis.matrix = R, rhs = c(3,5), vcov=vcovHC3_mat) #> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 5
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#> Note: Coefficient covariance matrix supplied.
146
1

#>
#> Res.Df Df F Pr(>F)
#>1 99
#> 2 97 2 1150.4 < 2.2e-16 *** #> —
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
The p-value is very small and allows us to reject the (false) null-hypothesis at any of the usual significance levels.
Next, we do inference about a single parameter. We test H0 :—3 =5
versus HA :—3 ”= 5.
1
# Load libraries
library(“lmtest”) # for coeftest() library(“sandwich”) # for vcovHC()
## Generate data
n <- 100 MC_data <- myDataGenerator(n = n, beta = beta_true) ## Estimate the linear regression model parameters lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data) ## Robust t test ## Robust standard error for \hat{\beta}_3: SE_rob <- sqrt(vcovHC(lm_obj, type = "HC3")[3,3]) 147 ## hypothetical (H0) value of \beta_3: beta_3_H0 <- 5 ## estimate for beta_3: beta_3_hat <- coef(lm_obj)[3] ## robust t-test statistic t_test_stat <- (beta_3_hat - beta_3_H0)/SE_rob ## p-value K <- coef(lm_obj) ## p_value <- 2 * min( pt(q = t_test_stat, df = n - K), 1- pt(q = t_test_stat, df = n - K)) #> [1] 1.548442e-65
p_value
Again, the p-value is very small and allows us to reject the (false) null- hypothesis at any of the usual significance levels.
148

Chapter 6
Maximum Likelikhood
6.1 Likelihood Principle
The basic idea behind maximum likelihood estimation is very simple: find the distribution parameters for which it is most likely that the distribution has generated the data we actually observed. We must therefore be very specific about the process that generated the data. This is a trade o – by imposing a fair amount of structure on the data, we get in return a very desirable estimator. The question always remains, however, whether we have made the right decision about the specific distribution/density function.
6.2 Properties of Maximum Likelihood Estimators
Why do we like maximum likelihood as an estimation method? The answer is that: A maximum likelihood estimator ◊ˆ of some parameter ◊ œ R is
• Consistent: plim (◊ˆ ) = ◊ næŒ n
• Asymptotically Normally Distributed: Ôn(◊ˆ ≠ ◊) ≥a N (0, ‡2) n
149

• Asymptotically Ecient: For any other consistent estimator ◊ ̃ , ‡ ̃ 2 Ø ‡ 2 . n
Thus, if we have the right assumptions, maximum likelihood estimators are very appealing.
Simple Example: Coin Flipping (Bernoulli Trial) In the following
we will consider the simple example of coin flips. Call ◊ the probability that
we get a head ◊ = P (Coin = HEAD) which implies that the probability that
we get a tail is 1 ≠ ◊ = P (Coin = TAIL). We don’t know the probability ◊
and our goal is to estimate it using an i.i.d. sample of size n, i.e. (X1, . . . , Xn) i.i.d
with Xi ≥ B(◊) for all i = 1,…,n. A given realization of this random sample is then (x1,…,xi,…,xn) and consists of xi = 1 values for “head” andxj =0for“tail”,i”=j=1,…,n. Intotalwehave0ÆhÆnmany heads and 0 Æ n ≠ h Æ n many tails.
6.3 The (Log-)Likelihood Function
How do we combine information from n observations to estimate ◊?
If we assume that all of the observations are drawn from same distribution and are independent, then joint probability of observing h heads and n ≠ h
tails in the n coin flips that we actually observed, given ◊, is: h n≠h
L ( ◊ ) = ◊Ÿ ( 1 ≠ ◊ )
nx 1≠x
= ◊i(1≠◊) i i=1
where xi = 1 forHEADin ith coin flip and xi = 0 forTAILin ith coin flip. The function L is called the likelihood function.
150

In general, when the observations are identically and independently dis-
tributed (i.i.d):
Ÿn
L(◊) =
f(xi|◊)
i=1
where f(xi|◊) is the density function of the random variable Xi evaluated at the realization Xi = xi; the parameter ◊ denotes the density function
parameter(s).
Our goal is to choose a value for ◊ such that the value of the likelihood
function is at a maximum, i.e. we choose the value of the parameter(s) that maximize the “probability” or better the likelihood of observing the data that we actually observed. That is:
◊ˆ = arg max L(◊). ◊
ˆ defines the maximum likelihood (ML) parameter estimator ◊.
Usually it’s easier to work with sums rather than products, so we can apply a monotonic transformation (taking ln) to the likelihood which yields to the log-likelihood function:
̧(◊) = lnL(◊) = ÿn lnf(xi|◊).
i=1
̧(◊)=ÿn (xiln(◊)+(1≠xi)ln(1≠◊))
i=1
In this case, we can analytically solve for the value of ◊ that maximizes
the log likelihood (and hence also the likelihood):
In our coin flipping example:
d ̧ ÿn A 1 1 B d◊ = xi◊ ≠(1≠xi)1≠◊
i=1 =h≠n≠h
◊ 1≠◊ 151

Setting the above expression to zero and solving gives us our ML estimator
(MLE):
◊ˆ =h ML n
6.4 Optimization: Non-Analytical Solutions
Usually we are not so fortunate as to have a closed-form analytical solution for the MLE and must rely on the computer to find the maximizing arguments of the (log-)likelihood function. Various methods exist for finding the maximum (or minimum) of a function. General idea: (i) start at some value for parameters in the parameter space (i.e., in the space of all possible parameter- values), (ii) search across that space until values of parameters are found that yield a derivative of the log likelihood that is zero (or arbitrarily close to zero).
6.4.1 Newton-Raphson Optimization
One of the most-used methods for optimization is the Newton-Raphson method (or a variant of it). The Newton-Raphson method relies on Taylor- series approximations of the likelihood.
First-order and second-order Taylor-series approximations of a function value f(◊+h) are using the function value f(◊) and derivatives of f evaluated at ◊:
First-order: f (◊ + h) ¥ f (◊) + f Õ(◊)h
Second-order: ¥ f (◊) + f Õ(◊)h + (1/2)f ÕÕ(◊)h2
for small values of h.
Aim: Suppose we want to find the value of ◊ that maximizes a twice- dierentiable function f(◊).
152

Implementation-Idea: The Taylor-series approximation gives then f(◊ + h) ¥ f(◊) + fÕ(◊)h + (1/2)fÕÕ(◊)h2
which implies
ˆf(◊ + h) ¥ fÕ(◊) + fÕÕ(◊)h. ˆh
Therefore, the first-order condition for the value of h that maximizes the Taylor-series expansion f(◊) + fÕ(◊)h + (1/2)fÕÕ(◊)h2 is
0 = fÕ(◊) + fÕÕ(◊)hˆ, hˆ = ≠ f Õ ( ◊ ) .
giving
That is, in order to increase the value of f(◊) one shall substitute ◊ by
fÕÕ(◊)
◊ + hˆ = ◊ ≠ f Õ ( ◊ )
The Newton Raphson optimization algorithm uses this insight as following. We first must provide a starting value, s, for ◊0 = s and, second, decide on some (small) convergence criterion, t, e.g. t = 10≠10, for the first derivative. Then the Newton Raphson optimization algorithm is given by:
let ◊0 = s let i = 0 while
|SfÕ(◊i)| > t
U let i = i + 1
let◊=◊ ≠fÕ(◊i≠1) i i≠1 fÕÕ(◊i≠1)
do
let ◊ˆ = ◊i
fÕÕ(◊)
ˆ return (◊)
153

Repetition i ◊ˆ ̧Õ(◊ˆ ) ̧Õ(◊ˆ )/ ̧ÕÕ(◊ˆ ) iiii
0 0.40 1 0.16 2 0.19 3 0.19 4 0.19 5 0.20
≠4.16 1.48
2.15 · 10≠1 5.43 · 10≠3 3.53 · 10≠6 1.50 · 10≠12
2.40 · 10≠1 ≠3.32 · 10≠2 ≠6.55 · 10≠3 ≠1.73 · 10≠4 ≠1.13 · 10≠7 ≠4.81 · 10≠14
Table 6.1: Result of applying the Newton Raphson optimaization algorithm to our coin flipping example for given data h = 1 with sample size n = 5.
Note. For problems that are globally concave, the starting value s doesn’t matter. For more complex problems, however, the Newton-Raphson algorithm can get stuck into a local maximum. In such cases, it is usually a good idea to try multiple starting values.
Newton-Raphson Algorithm: Example Let’s return to our earlier coin-
flipping example, with only one head h = 1 for a sample size of n = 5. We
already know that ◊ˆ = h = 1 = 0.2, but let’s apply the Newton-Raphson ML n 5
Algorithm. Recall that
d ̧ = h ≠ n ≠ h
d◊ ◊ 1≠◊ d2 ̧=≠h≠ n≠h
d◊2 ◊2 (1≠◊)2
We have h = 1 and n = 5. Choosing t = 10≠10 as our convergence criterion and ◊0 = 0.4 as the starting value, allows us to run the algorithm which gives us the results shown in Table 6.1.
154

6.5 OLS-Estimation as ML-Estimation
Now let’s return to our linear model
Y =X—+Á
To apply ML-estimation, we must make a distributional assumption about Á
such as, for instance:
Á ≥ N(0,‡2In)
That’s Assumption 4ú from Chapter 4; we could have chosen also another distributional assumption for Á here, but we would need to specify it correctly. That is, we are imposing here much more structure on the data than needed with the OLS estimator (in the context of large sample inference).
The multivariate density for Á = (Á1, . . . , Án)Õ is then f(Á) = 1 e≠(1/2‡2)(ÁÕ Á).
(2fi‡2)n/2
Noting that Á = Y ≠ X—, we get the log likelihood
̧(—,‡2)=≠nln(2fi)≠nln(‡2)≠ 1 (Y ≠X—)Õ(Y ≠X—) 2 2 2‡2
with K unknown parameters — = (—1, . . . , —K )Õ and ‡2 (scalar). Taking derivatives gives
ˆ ̧ = ≠ 1 (≠XÕY + XÕX—) ˆ— ‡2
ˆ ̧ =≠ n +C1(Y ≠X—)Õ(Y ≠X—)D ˆ‡2 2‡2 2
155
1 2 (‡2)

So, we have K + 1 equations and K + 1 unknowns. Setting equal to zero and solving gives
—ˆ = (XÕX)≠1XÕY
ML 1 ˆ ˆ 1ÿn
s2ML = n(Y ≠ X—ML)Õ(Y ≠ X—ML) = n ˆÁ2i i
Thus, the MLE of the linear model, —ˆ , is the same as the OLS estimator, —ˆ. ˆ ML
Moreover, since the ML estimator —M L is here equivalent to the OLS estimator (same formula, same mean, same variance) we can use the whole inference
machinery (t-test, F -test, confidence intervals) from the last chapters.
As it is needed for the next chapter, we also give here the second derivatives of the log-likelihood function L as well as the expressions of minus one times
the mean of the second derivatives of the log-likelihood function L: ˆ2 ̧ = ≠ 1 (XÕX)
ˆ—ˆ— ‡2
∆ (≠1)·EQaˆ2 ̧Rb=1E(XÕX)
ˆ—ˆ— ‡2
ˆ2 ̧ = n2≠[(Y≠X—)Õ(Y≠X—)] 13
ˆ‡2ˆ‡2 2 (‡2) (‡2) n q ni = 1 Á 2i
=2‡4≠ ‡6q
∆ ( ≠ 1 ) · E Qa ˆ 2 ̧ Rb = ≠ n + E [ ni = 1 Á 2i ]
ˆ‡2ˆ‡2 2‡4 ‡6
= ≠ n + n‡2 = n
2‡4 ‡6 2‡4 156

ˆ2 ̧ = ˆ2L =≠XÕ(Y≠X—)=XÕÁ ˆ—ˆ‡2 ˆ‡2ˆ— ‡4 ‡4
∆ (≠1)·EQa ˆ2L Rb=E(XÕÁ)=E[E(XÕÁ|X)] ˆ‡2ˆ— ‡4 ‡4
= E[XÕE(Á |X)] = 0 ‡4
6.6 Variance of ML-Estimators —ˆ and s2 ML ML
The variance of an MLE is given by the inverse of the Fisher information matrix. The Fisher information matrix is defined as minus one times the expected value matrix of second derivatives of the log-likelihood function. For the OLS case, the Fisher information matrix is
IQ — R=S 1E(XÕX) 0 T=S n Õ 0 T,
2‡4
abU‡2
‡2 0n 0n
VU‡2XX V 2‡4
where we used that E(XÕX) = E(qni=1 XiXiÕ) = nE(XiXiÕ) = nXÕX. The upper left element of the Fisher information matrix is easily shown, but the derivationofthelowerrightelementisrathertedious.1 So,takingtheinverse of the Fisher information matrix gives the variance-covariance matrix of the
estimators —ˆ ML
Given this result, it is easy to see that Var(—ˆ ) æ 0 and Var(s2
n æ Œ. ML ML
and s2 ML
VarQ—ˆR=S‡2≠1 0T aMLb UnXÕX V,
s 2M L 0 2 ‡ 4 n
1See https://www.statlect.com/fundamentals-of-statistics/linear-regression-maximum- likelihood for more details.
157
) æ 0 as

6.7 Consistency of —ˆ and s2 ML ML
If E[Á|X] = 0 (strict exogeneity, follows from our Assumptions 1 and 2), then the bias of —ˆ is zero since E[—ˆ ] = —
E[—ˆ ] = E[(XÕX)≠1XÕ(X— + Á)]
ML
ML
= E[E[(XÕX)≠1XÕ(X— + Á)|X]]
= E[E[(XÕX)≠1XÕX—|X]] + E[E[(XÕX)≠1XÕ Á |X]] = E[E[—|X]] + E[(XÕX)≠1XÕE[Á |X]]
= — + E[(XÕX)≠1XÕE[Á |X]]
=—
…E[—ˆ ]≠—=Bias(—ˆ )=0 ML ML
Of course, from this it also follows that the squared bias is equal to zero Bias2(—ˆ ) = 0. This implies that the mean square error (MSE) of the ML
MLˆ ˆ estimator —ML equals the variance of the ML estimator —ML:
MSE(—ˆ )= E[(—ˆ ≠—)2]=Var(—ˆ ) æ0 as næŒ. ML ̧ ML ̊ ̇ ML ̋
MSE(—ˆML)=Var(—ˆML) since —ˆML is unbiased.
Since convergence in mean square implies convergence in probability, we have
established that the ML-estimator —ˆ is a (weakly) consistent estimator of ML

—ˆ æ—asnæŒ. ML p
Moreover, one can also show that s2ML is a biased but asymptotically unbiased estimator, that is Bias2(s2ML) æ 0 as n æ Œ. Together with the result that Var(s2ML) æ 0 as n æ Œ we have that
MSE(s2ML) = E[(s2ML ≠ ‡2)2] = Bias2(s2ML) + Var(s2ML) æ 0 as n æ Œ. Again, since convergence in mean square implies convergence in probability,
we have established that the ML-estimator s2ML is a (weakly) consistent 158

estimator of ‡2
In practice, however, one usually works with the unbiased (and consistent)
s2MLæp‡2 as næŒ.
alternatives2UB = 1 qni=1ˆÁ2i eventhoughonecanshowthatMSE(s2ML)< 2 n≠K MSE(‡ˆUB) for suciently large n. 6.8 Asymptotic Theory of Maximum-Likelihood Estimators So far, we only considered consistency of the ML-estimators. In the following, we consider the asymptotic distribution of ML-estimators. We only consider the simplest situation: Assume an i.i.d. sample X1,...,Xn, and suppose that the distribution of Xi possesses a density f(x|◊). The true parameter ◊ œ R is unknown. (Example: density of an exponential distribution f(x|◊) = ◊ exp(≠◊x)) • Likelihood function: • Log-likelihood function: • The maximum-likelihood estimator ◊ˆ maximizes ̧ (◊) nn It can generally be shown that ◊ˆ is a consistent estimator of ◊. Derivation n of the asymptotic distribution relies on a Taylor expansion (around ◊) of the 159 Ÿn i=1 Ln(◊) = ̧n(◊) = lnL(◊) = ÿn lnf(Xi|◊) f(Xi|◊) i=1 derivative ̧Õn(·) of the log-likelihood function. By the so called “Mean Value Theorem”, we then know that for some  between ◊ˆ and ◊ we have ̧Õ (◊ˆ ) = ̧Õ (◊) + ̧ÕÕ( )(◊ˆ ≠ ◊) (Mean Value Theorem) nnnnnn Since ◊ˆ maximizes the log-Likelihood function it follows that ̧Õ (◊ˆ ) = 0. nˆˆnn This implies (since ̧Õ (◊n) = ̧Õ (◊) ≠ ̧ÕÕ(Ân)(◊n ≠ ◊)) that nnn ̧Õ (◊) = ≠ ̧ÕÕ( )(◊ˆ ≠ ◊). n nnn (6.1) Note that necessarily sŒ f(x|◊)dx = 1 for all possible values of the true ≠Œ parameter ◊. Therefore, s Œ ˆ f(x|◊)dx = 0 and s Œ ˆ2 f(x|◊)dx = 0. ≠Œ ˆ◊ ≠Œ ˆ◊2 Using this, we now show that the average nn 1ÿn ˆlnf(Xi|◊) n i=1 ˆ◊ is asymptotically normal. For the mean of 1 qni=1 ˆ lnf(Xi|◊) one gets: n ˆ◊ A 1 B 1 Q ÿn ˆ R n ⁄ Œ ˆ f ( x | ◊ ) E ̧Õ(◊) = E lnf(X|◊) = ˆ◊ f(x|◊)dx=0. nn nqai=1ˆ◊ibn≠Œf(x|◊) For the variance of 1 ni=1 ˆ lnf(Xi|◊) one gets: n ˆ◊ A1 B n A ˆ B 1 QQ ˆ f(X|◊)R2R 1 Var ̧Õ(◊)=Var lnf(X|◊)=Ecaaˆ◊ ibdb=J(◊) nn n2 ˆ◊ i n f(Xi|◊) n ̧ ̊ ̇ ̋ Moreover, the average 1 qni=1 ˆ lnf(Xi|◊) is taken over i.i.d.~random vari- n ˆ◊ =:J (◊) ables ˆ ln f (Xi|◊). So, ˆ◊ we can apply the Lindeberg-L’evy central limit theorem from which it follows that 1 ̧ Õ ( ◊ˆ ) = ̧ Õ ( ◊ˆ ) æ N ( 0 , 1 ) 160 nnn nn Ò1J(◊) ÒnJ(◊) n d Thus (using (6.1)) we also have ≠ ̧ÕÕ(Ân)(◊ˆ ≠◊)æ N(0,1) (6.2) Ònnd n·J(◊) Further analysis requires to study the term ̧ÕÕ(Ân). We begin this with studying the mean of n 1 ̧ÕÕ(◊)=1ÿn ˆ2 lnf(Xi|◊)=1ÿn ˆAˆlnf(Xi|◊)B nn ni=1ˆ◊ˆ◊ ni=1ˆ◊ ˆ◊ =1ÿn ˆQˆf(Xi|◊)R a ˆ◊ b n i=1 ˆ◊ f(Xi|◊) = 1 ÿn Qc3 ˆ2 f(Xi|◊)4 f(Xi|◊) ≠ ˆ f(Xi|◊) ˆ f(Xi|◊)Rd c ˆ◊ˆ◊ ˆ◊ ˆ◊ d n i=1 a (f(Xi|◊))2 Taking the mean of 1 ̧ÕÕ(◊) yields: b nn 1 n Q ˆ2 f(X|◊) Q ˆ f(X|◊)R2R Ecaˆ◊2 i ≠aˆ◊ i b db E( ̧ÕÕ(◊))= n n nn n f(Xi|◊) f(Xi|◊) QQ ˆ f(X |◊)R2R = 0 ≠ E ca a ˆ ◊ i b db = ≠ J ( ◊ ) f(Xi|◊) Taking the variance of 1 ̧ÕÕ(◊) yields: VarA1 ̧ÕÕ(◊)B= 1n VarQa ˆ2 lnf(X|◊)Rb æ0 as næŒ nn n2 ˆ◊ˆ◊ i ̧ ̊ ̇ ̋ =some fixed, deterministic number So, 1 ̧ÕÕ (◊) is an unbiased estimator for ≠J (◊) and thus EQaA1 ̧ÕÕ(◊)≠(≠J(◊))B2Rb =VarA1 ̧ÕÕ(◊)B æ0 nn nn 161 nn as næŒ That is 1 ̧ÕÕ(◊) is mean square consistent nn 1 ̧ÕÕ(◊) æm.s. ≠J(◊) as n æ Œ nn which implies that 1 ̧ÕÕ(◊) is also (weakly) consistent nn 1 ̧ÕÕ(◊) æp ≠J(◊) as n æ Œ nn since mean square convergence implies convergence in probability. Remember: We wanted to study ̧ÕÕ(Ân) in (6.2) not 1 ̧ÕÕ(◊), but we are nˆnn actually close now. We know that the ML estimator ◊n is (weakly) consistent, i.e., ◊ˆ æ ◊. From this it follows that also  æ ◊ since  is a value npnpp between ◊ˆ and ◊ (Mean Value Theorem). Therefore, we have that also n 1 ̧ÕÕ(Ân) æp ≠J(◊) as n æ Œ. nn Multiplying by ≠1/ÒJ (◊) yields ≠1 ̧ÕÕ( ) ≠ ̧ÕÕ( ) Ò nnn=n≠1/2 nnæJ(◊). 
ÒJ(◊) Òn·J(◊) p Rewriting the quotient in (6.2) a little bit yields ≠ ̧ÕÕ( ) ˆ ≠ ̧ÕÕ( ) Ò n n (◊n ≠◊)=n≠1/2Ò n n ˆ ·n1/2(◊n ≠◊). ÒJ(◊)n1/2(◊ˆ ≠◊)æ N(0,1), nd 162 n·J(◊) n·J(◊) ̧ ̊ ̇ ̋ æpÔJ(◊) Thus we can conclude that (using (6.2)): or equivalently (◊ˆ≠◊)æNQa0, 1 Rb n d nJ(◊) with nJ(◊) = ≠E( ̧ÕÕ(◊)) = I(◊) which is the asymptotic normality result n we aimed, where I(◊) is called the “Fisher information”. The above arguments can easily be generalized to multidimensional pa- rameter vectors ◊ œ RK . In this case, J (◊) becomes a K ◊ K matrix, and ◊ˆ ≠ ◊ æ N A 0 , 1 J ( ◊ ) ≠ 1 B , ndKn where nJ (◊) = ≠E( ̧ÕÕ(◊)) = I(◊) is called “Fisher information matrix”. n 163 Chapter 7 Instrumental Variables Regression Regression models may suer from problems like omitted variables, mea- surement errors and simultaneous causality. If so, the error term Ái is correlated with the regressor, Xik say, and the corresponding coecient of interest, —k, is estimated inconsistently. If one is lucky, one can add, for instance, the omitted variables to the regression to mitigate the risk of biased estimations (“omitted variables bias”). However, if omitted vari- ables cannot be measured or are not available for other reasons, multiple regression cannot solve the problem. The same issue arises if there is “simul- taneous causality”. When causality runs from X to Y and vice versa (e.g. if Y = Demanded quantity of a good and X = Price of this good), there will be an estimation bias that cannot be corrected for by multiple regression. A general technique for obtaining a consistent estimator of the coecient of interest is instrumental variables (IV) regression. In this chapter we focus on the IV regression tool called two-stage least squares (TSLS). The first sections briefly recap the general mechanics and assumptions of IV regression and show how to perform TSLS estimation using R. Next, IV regression is used for estimating the elasticity of the demand for cigarettes — a classical example where multiple regression fails to do the job because of simultaneous causality. 165 7.1 The IV Estimator with a Single Regressor and a Single Instrument Consider the simple regression model Yi =—1 +—2Xi +Ái , i=1,...,n, (7.1) where the error term Ái is correlated with the regressor Xi (X is the called “endogenous”). In this case Assumption 2 is violated, that is, strict exogeneity and orthogonality between Xi and Ái do not hold. Therefore, OLS estimation (also maximum likelihood and methods of moments estimation) is inconsis- tent for the true —2. In the most simple case, IV regression uses a single instrumental variable Zi to obtain a consistent estimator for —2. Conditions for valid instruments: Zi must satisfy two conditions to be a valid instrument: 1. Instrument relevance condition: Xi and its instrument Zi must be correlated: flZ,X ”= 0. 2. Instrument exogeneity condition: E(Ái |Zi) = 0. As a consequence: The instrument Zi must not be correlated with the error term Ái: flZ,Á = 0. 7.1.1 The Two-Stage Least Squares Estimator As can be guessed from its name, TSLS proceeds in two stages. In the first stage, the variation in the endogenous regressor Xi is decomposed into a “problem-free” component that is explained by the (exogenous) instrument Zi and a problematic component that is correlated with the error Ái. The second stage uses the problem-free component of the variation in Xi to estimate —2. 
166 The first stage regression model is Xi = fi0 + fi1Zi + ‹i, where fi0 + fi1Zi is the component of Xi that is explained by Zi while ‹i is the component (an error term) that cannot be explained by Zi and exhibits correlation with Ái. Using the OLS estimates fi‚0 and fi‚1 we obtain predicted values X„i = fiˆ0 + fiˆ1Zi, i = 1, . . . , n. If Zi is a valid instrument, the X„i are problem-free in the sense that X„i is exogenous in a regression of Yi on X„i which is done in the second stage regression. The second stage produces —‚TSLS and —‚TSLS, the TSLS estimates of —1 and —2. For the case of a single instrument one can show that the TSLS estimator of —2 is which is nothing but the ratio of the sample covariance between Zi and Yi to the sample covariance between Zi and Xi. The estimator in (7.2) is a consistent estimator for —2 in (7.1) under the assumption that Zi is a valid instrument. The CLT implies that the 1 qn —‚TSLS = sZY = n≠1q i=1(Yi ≠ Y )(Zi ≠ Z) , (7.2) 12 2 s 1 ni = 1 ( X ≠ X ) ( Z ≠ Z ) ZX n≠1 i i distribution of —‚TSLS can be approximated by a normal distribution if the 2 sample size n is large. This allows us to use t-statistics, F -statistics, confidence intervals, etc. 7.1.2 Application: Demand For Cigarettes (1/2) The relation between the demand for and the price of commodities is a simple yet widespread problem in economics. Health economics is concerned with the study of how health-aecting behavior of individuals is influenced by the health-care system and regulation policy. Probably the most prominent 167 example in public policy debates is smoking as it is related to many illnesses and negative externalities. It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily. The question is by how much taxes must be increased to reach a certain reduction in cigarette consumption. Economists use elasticities to answer this kind of question. Since the price elasticity for the demand of cigarettes is unknown, it must be estimated. A simple OLS regression of log quantity on log price cannot be used to estimate the eect of interest since there is simultaneous causality between demand and supply. Instead, IV regression can be used. We use the data set CigarettesSW which comes with the package AER. It is a panel data set that contains observations on cigarette consumption and several economic indicators for all 48 continental federal states of the U.S. from 1985 to 1995. In the following, however, we consider data for the cross section of states in 1995 only – that is, we transform the panel data to a cross-sectional data set. We start by loading the package, attaching the data set. An overview about summary statistics for each of the variables is returned by summary(CigarettesSW). Use ?CigarettesSW for a detailed description of the variables. #> state year cpi population
#> AL : 2 1985:48 Min. :1.076 Min. : 478447
# load the data set and get an overview
library(“AER”) data(“CigarettesSW”) summary(CigarettesSW)
#> AR :2 #> AZ :2 #> CA :2 #> CO :2
1995:48
1st Qu.:1.076
Median :1.300
Mean :1.300
3rd Qu.:1.524
168
1st Qu.: 1622606
Median : 3697472
Mean : 5168866
3rd Qu.: 5901500

#> CT : 2 Max. :1.524 Max. :31493524
#> (Other):84
#> packs income tax
#> Min. : 49.27 Min. : 6887097 Min. :18.00
#> 1st Qu.: 92.45 1st Qu.: 25520384 1st Qu.:31.00
#> Median :110.16 Median : 61661644 Median :37.00
#> Mean :109.18 Mean : 99878736 Mean :42.68
#> 3rd Qu.:123.52 3rd Qu.:127313964 3rd Qu.:50.88
#> Max. :197.99 Max. :771470144 Max. :99.00
#>
#> price
#> Min. : 84.97
#> 1st Qu.:102.71
#> Median :137.72
#> Mean :143.45 Mean : 48.33
#> 3rd Qu.:176.15 3rd Qu.: 59.48
#> Max. :240.85 Max. :112.63
#>
taxs
Min. : 21.27
1st Qu.: 34.77
Median : 41.05
We are interested in estimating —2 in
log(Qcigarettes)=— +— log(Pcigarettes)+Á,
(7.3) where Qcigarettes is the number of cigarette packs per capita sold and P cigarettes
i12ii ii
is the after-tax average real price per pack of cigarettes in state i = 1, . . . , n = 48. The instrumental variable we are going to use for instrumenting the
endogenous regressor log(Pcigarettes) is SalesTax, the portion of taxes on i
cigarettes arising from the general sales tax. SalesTax is measured in dollars per pack. The idea is that:
1. SalesTax is a relevant instrument as it is included in the after-tax average price per pack.
169

2. Also, it is plausible that SalesTax is exogenous since the sales tax does not influence quantity sold directly but indirectly through the price.
In the following, we perform some transformations in order to obtain
deflated cross section data for the year 1995. We also compute the sample
correlation between the sales tax and price per pack. The sample correlation
is a consistent estimator of the population correlation. The estimate of
approximately 0.614 indicates that SalesTax and Pcigarettes exhibit positive i
correlation which meets our expectations: higher sales taxes lead to higher prices. However, a correlation analysis like this is not sucient for checking whether the instrument is relevant. We will later come back to the issue of checking whether an instrument is relevant and exogenous; see Chapter 7.3.
# compute real per capita prices
CigarettesSW$rprice <- with(CigarettesSW, price / cpi) # compute the sales tax CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi) # check the correlation between sales tax and price cor(CigarettesSW$salestax, CigarettesSW$price) #> [1] 0.6141228
# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995") The first stage regression is log(P cigarettes) = fi + fi SalesT ax + ‹ . i01ii We estimate this model in R using lm(). In the second stage we run a cigarettes \cigarettes regression of log(Q ) on log(P ) to obtain —‚TSLS and —‚TSLS. ii12 170 # perform the first stage regression cig_s1 <- lm(log(rprice) ~ salestax, data = c1995) coeftest(cig_s1, vcov = vcovHC, type = "HC1") #>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.6165463 0.0289177 159.6444 < 2.2e-16 *** #> salestax 0.0307289 0.0048354 6.3549 8.489e-08 ***
#> —
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
The first stage regression is
\cigarettes
log(Pi ) = 4.62 + 0.031 SalesT axi
which predicts the relation between sales tax price per cigarettes to be positive. How much of the observed variation in log(Pcigarettes) is explained by the instrument SalesTax? This can be answered by looking at the regression’s R2 which states that about 47% of the variation in after tax prices is explained by the variation of the sales tax across states.
\cigarettes
We next store log(Pi ), the fitted values obtained by the first stage
regression cig_s1, in the variable lcigp_pred. 171
(0.03) (0.005)
1
# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared #> [1] 0.4709961

# store the predicted values
lcigp_pred <- cig_s1$fitted.values Next, we run the second stage regression which gives us the TSLS estimates we seek. # run the stage 2 regression cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred) coeftest(cig_s2, vcov = vcovHC, type = "HC1") #>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.7199 1.5971 6.0859 2.153e-07 ***
#> lcigp_pred -1.0836 0.3337 -3.2472 0.002178 **
#> —
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
Thus estimating the model (7.3) using TSLS yields \cigarettes \cigarettes
log(Qi ) = 9.72 + 1.08log(Pi ), (7.4) (1.60) (0.33)
The function ivreg() from the package AER carries out TSLS procedure automatically. It is used similarly as lm(). Instruments can be added to the usual specification of the regression formula using a vertical bar separating the model equation from the instruments. Thus, for the regression at hand the correct formula is log(packs) ~ log(rprice) | salestax.
172
# perform TSLS using ivreg()
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, 1 data = c1995) coeftest(cig_ivreg, vcov = vcovHC, type = "HC1") #>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.71988 1.52832 6.3598 8.346e-08 ***
#> log(rprice) -1.08359 0.31892 -3.3977 0.001411 **
#> —
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
We find that the coecient estimates coincide for both approaches.
Two Notes on the Computation of TSLS Standard Errors:
1. We have demonstrated that running the individual regressions for each stage of TSLS using lm() leads to the same coecient estimates as when using ivreg(). However, the standard errors reported for the second-stage regression, e.g., by coeftest() or summary(), are invalid: neither adjusts for using predictions from the first-stage regression as regressors in the second-stage regression. Fortunately, ivreg() performs the necessary adjustment automatically. This is another advantage over manual step-by-step estimation which we have done above for demonstrating the mechanics of the procedure.
2. Just like in multiple regression it is important to compute heteroskedasticity-robust standard errors as we have done above using vcovHC().
The TSLS estimate for —2 in (7.4) suggests that an increase in cigarette prices by one percent reduces cigarette consumption by roughly 1.08 percent-
173
1

age points, which is fairly elastic. However, we should keep in mind that this estimate might not be trustworthy even though we used IV estimation: there still might be a bias due to omitted variables. Thus a multiple IV regression approach is needed to reduce the risk of omitted variable biases.
7.2 The General IV Regression Model
The simple IV regression model is easily extended to a multiple regression model which we refer to as the general IV regression model. In this model we distinguish between four types of variables: the dependent variable, included exogenous variables, included endogenous variables and instrumental variables:
Yi =—1 +—2X2i +···+—KXKi +—K+1W1i +···+—K+rWri +Ái, (7.5) with i = 1, . . . , n is the general instrumental variables regression model where
• Yi is the dependent variable
• —1, . . . , —K+r are K + r unknown regression coecients
• X2i, . . . , XKi are K ≠ 1 endogenous regressors
• W1i, . . . , Wri are r exogenous regressors which are uncorrelated with Ái • Ái is the error term
• Z1i, . . . , Zmi are m instrumental variables
The coecients are overidentified if m > (K ≠ 1). If m < (K ≠ 1), the coecients are underidentified and when m = (K ≠ 1) they are exactly 174 identified. For estimation of the IV regression model we require exact identification or overidentification. Estimating regression models with TSLS using multiple instruments by means of ivreg() is straightforward. There are, however, some subtleties in correctly specifying the regression formula. Assume that you want to estimate the model Yi =—1 +—2X2i +—3X3i +W1i +Ái where X2i and X3i are endogenous regressors that shall be instrumented by Z1i, Z2i and Z3i, and where W1i is an exogenous regressor. Say the corresponding data is available in a data.frame with column names y, x2, x3, w1, z1, z2 and z3. It might be tempting to specify the argument formula in your call of ivreg()asy~x2+x3+w1|z1+z2+z3which is, however, wrong. As explained in the documentation of ivreg() (see ?ivreg), it is necessary to list all exogenous variables as instruments too, that is joining them by +’s on the right of the vertical bar: y ~ x2 + x3 + w1 | w1 + z1 + z2 + z3, where w1 is “instrumenting itself”. Similarly to the simple IV regression model, the general IV model (7.5) can be estimated using the two-stage least squares estimator: • First-stage regression(s): Run an OLS regression for each of the endogenous variables (X2i,...,XKi) on all instrumental variables (Z1i,...,Zmi), all exoge- nous variables (W1i, . . . , Wri) and an intercept. Compute the fitted values (X„2i, . . . , X„Ki). • Second-stage regression: Regress the dependent variable on the predicted values of all endogenous regressors, all exogenous variables and an intercept using OLS. This gives —‚TSLS,...,—‚TSLS, the TSLS estimates of the model coecients. 1 K+r 175 In the general IV regression model, the instrument relevance and in- strument exogeneity assumptions are equivalent to the case of the simple regression model with a single endogenous regressor and only one instrument. That is, for Z1i, . . . , Zmi to be a set of valid instruments, the following two conditions must be fulfilled: 1. Instrument Relevance If there are K ≠ 1 endogenous variables, r exogenous variables and mØK≠1instrumentsandtheX„ú,...,X„ú arethepredictedvalues 2i Ki from the K ≠ 1 population first stage regressions, it must hold that (X„ú ,...,X„ú ,W1i,...,Wri,1) 2i Ki are not perfectly multicollinear, where “1” denotes the constant regressor (intercept) which equals 1 for all observations. Explanations: Let’s say there is only one endogenous regressor Xi. If all the instruments Z1i,...,Zmi are irrelevant, all the X„iú are just the mean of X such that there is perfect multicollinearity with the constant intercept 1. 2. Instrument Exogeneity E(Ái |Z1i, . . . , Zim) = 0. Consequently, all m instruments must be uncorrelated with the error term, flZ1,Á = 0,...,flZm,Á = 0. One can show that if the IV regression assumptions hold, the TSLS estimator in (7.5) is consistent and normally distributed when the sample size n is large. That is, if we have valid instruments, we obtain valid statistical inference using t-tests, F-tests and confidence intervals for the model coecients. 176 7.2.1 Application: Demand for Cigarettes (2/2) The estimated elasticity of the demand for cigarettes in (7.1) is 1.08. Although (7.1) was estimated using IV regression it is plausible that this IV estimate is biased. 
The TSLS estimator is inconsistent for the true —2 if the instrument (here: the real sales tax per pack) is invalid, i.e., if the instrument correlates with the error term. This is likely to be the case here since there are economic factors, like state income, which impact the demand for cigarettes and correlate with the sales tax. States with high personal income tend to generate tax revenues by income taxes and less by sales taxes. Consequently, state income should be included in the regression model. log(Qcigarettes)=— +— log(Pcigarettes)+— log(income)+Á (7.6) i12i3ii Before estimating (7.6) using ivreg() we define income as real per capita income rincome and append it to the data set CigarettesSW. # add real income to the dataset (cpi: consumer price index) CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi) c1995 <- subset(CigarettesSW, year == "1995") # estimate the model cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax, data = c1995) coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1") #>
#> t test of coefficients:
177

#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.43066
#> log(rprice) -1.14338
#> log(rincome) 0.21452
#> —
1.25939 7.4883 1.935e-09 ***
0.37230 -3.0711 0.003611 **
0.31175 0.6881 0.494917
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
We obtain
\cigarettes cigarettes
log(Qi ) = 9.42 ≠ 1.14 log(Pi ) + 0.21 log(incomei). (7.7)
(1.26) (0.37) (0.31)
In the following we add the cigarette-specific taxes (cigtaxi) as a further
instrumental variable and estimate again using TSLS.
1
# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi) c1995 <- subset(CigarettesSW, year == "1995") # estimate the model cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax, data = c1995) coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1") #>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
178

#> (Intercept) 9.89496
#> log(rprice) -1.27742
#> log(rincome) 0.28040
#> —
0.95922 10.3157 1.947e-13 ***
0.24961 -5.1177 6.211e-06 ***
0.25389 1.1044 0.2753
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
Using the two instruments salestaxi and cigtaxi we have m = 2 for
one endogenous regressor – so the coecient on the endogenous regressor
log(Pcigarettes) is overidentified. The TSLS estimate of (7.6) is i
\cigarettes cigarettes
log(Qi ) = 9.89 ≠ 1.28 log(Pi ) + 0.28 log(incomei). (7.8)
Should we trust the estimates presented in (7.7) or rather rely on (7.8)? The estimates obtained using both instruments are more precise since in (7.8) all standard errors reported are smaller than in (7.7). In fact, the standard error for the estimate of the demand elasticity is only two thirds of the standard error when the sales tax is the only instrument used. This is due to more information being used in estimation (7.8). If the instruments are valid, (7.8) can be considered more reliable.
However, without insights regarding the validity of the instruments it is not sensible to make such a statement. This stresses why checking instrument validity is essential. Chapter 7.3 briefly discusses guidelines in checking instrument validity and presents approaches that allow to test for instrument relevance and exogeneity under certain conditions. These are then used in an application to the demand for cigarettes in Chapter 7.4.
179
(0.96) (0.25) (0.25)
1

7.3 Checking Instrument Validity 7.3.1 Instrument Relevance
Instruments that explain little variation in the endogenous regressor Xi are called weak instruments. Weak instruments provide little information about the variation in Xi that is exploited by IV regression to estimate the eect of interest: the estimate of the coecient on the endogenous regressor is estimated inaccurately. Moreover, weak instruments cause the distribution of the estimator to deviate considerably from a normal distribution even in large samples such that the usual methods for obtaining inference about the true coecient on Xi may produce wrong results.
A Rule of Thumb for Checking for Weak Instruments: Consider the case of a single endogenous regressor Xi and m instruments Z1i, . . . , Zmi. If the coecients on all instruments in the population first-stage regression of a TSLS estimation are zero, the instruments do not explain any of the variation in the Xi which clearly violates assumption that instruments must be relevant. Although the latter case is unlikely to be encountered in practice, we should ask ourselves to what extent the assumption of instrument relevance should be fulfilled. While this is hard to answer for general IV regression, in the case of a single endogenous regressor Xi one may use the following rule of thumb: Compute the F-statistic which corresponds to the hypothesis that the coecients on Z1i, . . . , Zmi are all zero in the first-stage regression. If the F-statistic is less than 10, the instruments are “weak” such that the TSLS estimate of the coecient on Xi is probably biased and no valid statistical inference about its true value can be made.
This rule of thumb is easily implemented in R. Run the first-stage re- gression using lm() and subsequently compute the heteroskedasticity-robust F -statistic by means of linearHypothesis(). This is part of the application to the demand for cigarettes discussed in Chapter 7.4.
If Instruments are Weak
There are several ways to proceed if instruments are weak:

1. Discard the weak instruments and/or find stronger instruments. While the former is only an option if the unknown coefficients remain identified when the weak instruments are discarded, the latter can be very difficult and may even require a redesign of the whole study.

2. Stick with the weak instruments but use methods that improve upon TSLS in this scenario, for example limited information maximum likelihood estimation (beyond the scope of this course).

3. Use tests and confidence sets that remain valid under weak instruments, such as the Anderson-Rubin test; a sketch of the idea follows below.
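As a rough illustration of the Anderson-Rubin idea for a single endogenous regressor: to test H0: beta = b0, regress Yi − b0 Xi on the exogenous regressors and the instruments and test whether the instrument coefficients are jointly zero; collecting all values b0 that are not rejected yields a confidence set whose coverage does not deteriorate when instruments are weak. The sketch below is not part of the original script and reuses the difference variables constructed in the application in Chapter 7.4 (packsdiff, pricediff, incomediff, salestaxdiff, cigtaxdiff).

# sketch of an Anderson-Rubin 95% confidence set obtained by test inversion
library(car)       # linearHypothesis()
library(sandwich)  # vcovHC()

ar_pvalue <- function(b0) {
  # under H0: beta = b0 the instruments must not explain y - b0 * x
  y0  <- packsdiff - b0 * pricediff
  reg <- lm(y0 ~ incomediff + salestaxdiff + cigtaxdiff)
  linearHypothesis(reg, c("salestaxdiff = 0", "cigtaxdiff = 0"),
                   vcov = vcovHC(reg, type = "HC1"))[["Pr(>F)"]][2]
}

# all candidate values of beta that the AR test does not reject at the 5% level
grid <- seq(-3, 1, by = 0.01)
# endpoints of the retained set (here an interval; with very weak instruments
# the set need not be an interval and may even be unbounded)
range(grid[sapply(grid, ar_pvalue) > 0.05])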
7.3.2 Instrument Exogeneity
If there is correlation between an instrument and the error term, IV regression is not consistent. The overidentifying restrictions test (also called the J-test) is an approach to test the hypothesis that additional instruments are exogenous. For the J-test to be applicable there need to be more instruments than endogenous regressors.
The J-Statistic (or Sargan-Hansen test): Take $\hat{\varepsilon}_i^{TSLS}$, $i = 1, \dots, n$, the residuals of the TSLS estimation of the general IV regression model (7.5). Run the OLS regression
$$
\hat{\varepsilon}_i^{TSLS} = \delta_0 + \delta_1 Z_{1i} + \cdots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \cdots + \delta_{m+r} W_{ri} + e_i \tag{7.9}
$$
and test the joint hypothesis
$$
H_0: \delta_1 = \cdots = \delta_m = 0,
$$

which states that all instruments are exogenous. This can be done using the corresponding F-statistic by computing
$$
J = mF.
$$
This test is the overidentifying restrictions test and the statistic is called the J-statistic, with
$$
J \overset{d}{\longrightarrow} \chi^2_{m-(K-1)} \quad \text{under } H_0 \text{ as } n \to \infty
$$
under the assumption of homoskedasticity. The degrees of freedom m − (K − 1) state the degree of overidentification, since this is the number of instruments m minus the number of endogenous regressors K − 1.
It is important to note that the J-statistic is only $\chi^2_{m-(K-1)}$ distributed when the error term $\varepsilon_i$ in the regression (7.9) is homoskedastic. A discussion of the heteroskedasticity-robust J-statistic is beyond the scope of this chapter. The application in the next section shows how to apply the J-test using linearHypothesis().
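As a small numerical sketch (the numbers here are made up for illustration and are not taken from the original script): with m = 2 instruments, K − 1 = 1 endogenous regressor and a homoskedasticity-only F-statistic of, say, 2.47 from the auxiliary regression (7.9), the J-statistic and its p-value would be computed as follows.

# illustrative computation of the J-statistic (made-up F-statistic)
m <- 2            # number of instruments
K_minus_1 <- 1    # number of endogenous regressors
F_stat <- 2.47    # hypothetical homoskedasticity-only F-statistic from (7.9)

J <- m * F_stat                                    # J = m * F
pchisq(J, df = m - K_minus_1, lower.tail = FALSE)  # p-value from chi^2_{m-(K-1)}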
7.4 Application to the Demand for Cigarettes
Are the general sales tax and the cigarette-specific tax valid instruments? If not, TSLS is not helpful for estimating the demand elasticity for cigarettes discussed in Chapter 7.2. As discussed in Chapter 7.1, both variables are likely to be relevant, but whether they are exogenous is a different question.
One can argue that cigarette-specific taxes could be endogenous because there might be state-specific historical factors, such as the economic importance of the tobacco farming and cigarette production industries, that lobby for low cigarette-specific taxes. Since it is plausible that tobacco-growing states have higher rates of smoking than others, this would lead to endogeneity of cigarette-specific taxes. If we had data on the size of the tobacco and cigarette
industry, we could solve this potential issue by including the information in the regression. Unfortunately, this is not the case.
However, since the role of the tobacco and cigarette industry is a factor that can be assumed to differ across states but not over time, we may exploit the panel structure of CigarettesSW. Alternatively, a (non-panel) regression using data on changes between two time periods eliminates such state-specific and time-invariant effects. Next, we consider such changes in variables between 1985 and 1995. That is, we are interested in estimating the long-run
elasticity of the demand for cigarettes.
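To see why differencing removes such effects, consider the following stylized sketch (not part of the original derivation): suppose log demand in state $i$ and year $t$ contains an unobserved state-specific, time-invariant component $\alpha_i$,
$$
\log(Q_{i,t}^{cigarettes}) = \alpha_i + \beta_2 \log(P_{i,t}^{cigarettes}) + \beta_3 \log(income_{i,t}) + u_{i,t}.
$$
Taking the difference between 1995 and 1985 gives
$$
\log(Q_{i,1995}^{cigarettes}) - \log(Q_{i,1985}^{cigarettes}) = \beta_2 \left[ \log(P_{i,1995}^{cigarettes}) - \log(P_{i,1985}^{cigarettes}) \right] + \beta_3 \left[ \log(income_{i,1995}) - \log(income_{i,1985}) \right] + (u_{i,1995} - u_{i,1985}),
$$
so $\alpha_i$ drops out and any state-specific, time-invariant confounder is eliminated; an aggregate effect common to all states would survive the differencing and is captured by the intercept $\beta_1$ in (7.10) below.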
The model to be estimated by TSLS using the general sales tax and the cigarette-specific sales tax as instruments hence is
$$
\log(Q_{i,1995}^{cigarettes}) - \log(Q_{i,1985}^{cigarettes}) = \beta_1 + \beta_2 \left[ \log(P_{i,1995}^{cigarettes}) - \log(P_{i,1985}^{cigarettes}) \right] + \beta_3 \left[ \log(income_{i,1995}) - \log(income_{i,1985}) \right] + \varepsilon_i. \tag{7.10}
$$
We first create differences from 1985 to 1995 for the dependent variable, the regressors and both instruments.
# subset data for year 1985
c1985 <- subset(CigarettesSW, year == "1985")

# define differences in variables
packsdiff <- log(c1995$packs) - log(c1985$packs)

pricediff <- log(c1995$price/c1995$cpi) - log(c1985$price/c1985$cpi)

incomediff <- log(c1995$income/c1995$population/c1995$cpi) -
  log(c1985$income/c1985$population/c1985$cpi)

salestaxdiff <- (c1995$taxs - c1995$tax)/c1995$cpi -
  (c1985$taxs - c1985$tax)/c1985$cpi

cigtaxdiff <- c1995$tax/c1995$cpi - c1985$tax/c1985$cpi

We now perform three different IV estimations of (7.10) using ivreg():

1. TSLS using only the difference in the sales taxes between 1985 and 1995 as the instrument.

2. TSLS using only the difference in the cigarette-specific sales taxes between 1985 and 1995 as the instrument.

3. TSLS using both the difference in the sales taxes and the difference in the cigarette-specific sales taxes between 1985 and 1995 as instruments.

# estimate the three models
cig_ivreg_diff1 <- ivreg(packsdiff ~ pricediff + incomediff |
                           incomediff + salestaxdiff)

cig_ivreg_diff2 <- ivreg(packsdiff ~ pricediff + incomediff |
                           incomediff + cigtaxdiff)

cig_ivreg_diff3 <- ivreg(packsdiff ~ pricediff + incomediff |
                           incomediff + salestaxdiff + cigtaxdiff)

As usual we use coeftest() in conjunction with vcovHC() to obtain robust coefficient summaries for all models.

# robust coefficient summary for 1.
coeftest(cig_ivreg_diff1, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.117962 0.068217 -1.7292 0.09062 .
#> pricediff -0.938014 0.207502 -4.5205 4.454e-05 ***
#> incomediff 0.525970 0.339494 1.5493 0.12832
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# robust coefficient summary for 2.
coeftest(cig_ivreg_diff2, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.017049 0.067217 -0.2536 0.8009
#> pricediff -1.342515 0.228661 -5.8712 4.848e-07 ***
#> incomediff 0.428146 0.298718 1.4333 0.1587
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# robust coefficient summary for 3.
coeftest(cig_ivreg_diff3, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.052003 0.062488 -0.8322 0.4097
#> pricediff -1.202403 0.196943 -6.1053 2.178e-07 ***
#> incomediff 0.462030 0.309341 1.4936 0.1423
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We proceed by generating a tabulated summary of the estimation results using stargazer().

library(stargazer)

# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(cig_ivreg_diff1, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff2, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff3, type = "HC1"))))

# generate table
stargazer(cig_ivreg_diff1, cig_ivreg_diff2, cig_ivreg_diff3,
          header = FALSE,
          type = "latex",
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("IV: salestax", "IV: cigtax",
                            "IVs: salestax, cigtax"),
          dep.var.labels.include = FALSE,
          dep.var.caption =
            "Dependent Variable: 1985-1995 Difference in Log per Pack Price",
          se = rob_se)

Table 7.1: TSLS Estimates of the Long-Term Elasticity of the Demand for Cigarettes using Panel Data

Dependent variable: 1985-1995 difference in log price per pack

                                IV: salestax   IV: cigtax   IVs: salestax, cigtax
                                    (1)            (2)              (3)
pricediff                        -0.938***      -1.343***        -1.202***
                                 (0.208)        (0.229)          (0.197)
incomediff                        0.526          0.428            0.462
                                 (0.339)        (0.299)          (0.309)
Constant                         -0.118*        -0.017           -0.052
                                 (0.068)        (0.067)          (0.062)
Observations                       48             48               48
R2                                0.550          0.520            0.547
Adjusted R2                       0.530          0.498            0.526
Residual Std. Error (df = 45)     0.091          0.094            0.091

Table 7.1 reports negative estimates of the coefficient on pricediff that are quite different in magnitude. Which one should we trust? This hinges on the validity of the instruments used. To assess this we compute F-statistics for the first-stage regressions of all three models to check instrument relevance.

# first-stage regressions
mod_relevance1 <- lm(pricediff ~ salestaxdiff + incomediff)
mod_relevance2 <- lm(pricediff ~ cigtaxdiff + incomediff)
mod_relevance3 <- lm(pricediff ~ incomediff + salestaxdiff +
                       cigtaxdiff)

# check instrument relevance for model (1)
linearHypothesis(mod_relevance1,
                 "salestaxdiff = 0",
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> salestaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: pricediff ~ salestaxdiff + incomediff
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1     46
#> 2 45 1 28.445 3.009e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check instrument relevance for model (2)
linearHypothesis(mod_relevance2,
                 "cigtaxdiff = 0",
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> cigtaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: pricediff ~ cigtaxdiff + incomediff
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1     46
#> 2 45 1 98.034 7.09e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check instrument relevance for model (3)
linearHypothesis(mod_relevance3,
                 c("salestaxdiff = 0", "cigtaxdiff = 0"),
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> salestaxdiff = 0
#> cigtaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: pricediff ~ incomediff + salestaxdiff + cigtaxdiff
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1     46
#> 2 44 2 76.916 4.339e-15 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All F-statistics are larger than 10, so the rule of thumb for detecting weak instruments suggests that the instruments are not weak.
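For convenience, the three heteroskedasticity-robust first-stage F-statistics can also be collected in a single vector. This is a small sketch that reuses the objects created above; the extraction via the F column of the linearHypothesis() output is an assumption about its return value.

# collect the robust first-stage F-statistics (sketch; reuses objects from above)
f_stats <- c(
  salestax = linearHypothesis(mod_relevance1, "salestaxdiff = 0",
                              vcov = vcovHC(mod_relevance1, type = "HC1"))[["F"]][2],
  cigtax   = linearHypothesis(mod_relevance2, "cigtaxdiff = 0",
                              vcov = vcovHC(mod_relevance2, type = "HC1"))[["F"]][2],
  both     = linearHypothesis(mod_relevance3,
                              c("salestaxdiff = 0", "cigtaxdiff = 0"),
                              vcov = vcovHC(mod_relevance3, type = "HC1"))[["F"]][2]
)
f_stats > 10  # TRUE for all three: none is flagged as weak by the rule of thumb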
Next, we conduct the overidentifying restrictions test for model three, which is the only model where the coefficient on the difference in log prices is overidentified (m = 2, K − 1 = 1) such that the J-statistic can be computed. To do this we take the residuals stored in cig_ivreg_diff3 and regress them on both instruments and the presumably exogenous regressor incomediff. We again use linearHypothesis() to test whether the coefficients on both instruments are zero, which is necessary for the exogeneity assumption to hold. Note that with test = "Chisq" we obtain a chi-squared distributed test statistic instead of an F-statistic.
# compute the J-statistic
cig_iv_OR <- lm(residuals(cig_ivreg_diff3) ~ incomediff +
                  salestaxdiff + cigtaxdiff)

cig_OR_test <- linearHypothesis(cig_iv_OR,
                                c("salestaxdiff = 0",
                                  "cigtaxdiff = 0"),
                                test = "Chisq")
cig_OR_test
#> Linear hypothesis test
#>
#> Hypothesis:
#> salestaxdiff = 0
#> cigtaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: residuals(cig_ivreg_diff3) ~ incomediff + salestaxdiff + cigtaxdiff
#>
#>   Res.Df     RSS Df Sum of Sq Chisq Pr(>Chisq)
#> 1     46 0.37472
#> 2     44 0.33695  2  0.037769 4.932    0.08492 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Caution: In this case the p-value reported by linearHypothesis() is wrong because the degrees of freedom are set to 2. This differs from the degree of overidentification (m − (K − 1) = 2 − (2 − 1) = 1), so the J-statistic is $\chi^2_1$ distributed instead of following a $\chi^2_2$ distribution as assumed by default by linearHypothesis(). We may compute the correct p-value using pchisq().

# compute correct p-value for J-statistic
pchisq(cig_OR_test[2, 5], df = 1, lower.tail = FALSE)
#> [1] 0.02636406

Since this value is smaller than 0.05, we reject the hypothesis that both instruments are exogenous at the 5% level. This means one of the following:

1. The sales tax is an invalid instrument for the per-pack price.

2. The cigarette-specific sales tax is an invalid instrument for the per-pack price.

3. Both instruments are invalid.

One can argue that the assumption of instrument exogeneity is more likely to hold for the general sales tax, so the IV estimate of the long-run elasticity of demand for cigarettes that we consider most trustworthy is −0.94, the TSLS estimate obtained using the general sales tax as the only instrument.

The interpretation of this estimate is that, over a 10-year period, a one percent increase in the average price per pack is expected to decrease consumption by about 0.94 percent. This suggests that, in the long run, price increases can reduce cigarette consumption considerably.
