Econometrics (M.Sc.)
Prof. Dr. Dominik Liebl
2021-02-07

Contents

Preface
  Organization of the Course
  Literature

1 Review: Probability and Statistics
  1.1 Probability Theory
  1.2 Random Variables

2 Review: Simple Linear Regression
  2.1 The Simple Linear Regression Model
  2.2 Ordinary Least Squares Estimation
  2.3 Properties of the OLS Estimator

3 Multiple Linear Regression
  3.1 Assumptions
  3.2 Deriving the Expression of the OLS Estimator
  3.3 Some Quantities of Interest
  3.4 Method of Moments Estimator
  3.5 Unbiasedness of β̂|X
  3.6 Variance of β̂|X
  3.7 The Gauss-Markov Theorem

4 Small Sample Inference
  4.1 Hypothesis Tests about Multiple Parameters
  4.2 Tests about One Parameter
  4.3 Test Theory
  4.4 Type II Error and Power
  4.5 p-Value
  4.6 Confidence Intervals
  4.7 Practice: Small Sample Inference

5 Large Sample Inference
  5.1 Tools for Asymptotic Statistics
  5.2 Asymptotics under the Classic Regression Model
  5.3 Robust Confidence Intervals
  5.4 Practice: Large Sample Inference

6 Maximum Likelihood
  6.1 Likelihood Principle
  6.2 Properties of Maximum Likelihood Estimators
  6.3 The (Log-)Likelihood Function
  6.4 Optimization: Non-Analytical Solutions
  6.5 OLS-Estimation as ML-Estimation
  6.6 Variance of ML-Estimators β̂_ML and s²_ML
  6.7 Consistency of β̂_ML and s²_ML
  6.8 Asymptotic Theory of Maximum-Likelihood Estimators

7 Instrumental Variables Regression
  7.1 The IV Estimator with a Single Regressor and a Single Instrument
  7.2 The General IV Regression Model
  7.3 Checking Instrument Validity
  7.4 Application to the Demand for Cigarettes

Bibliography
Preface
Organization of the Course
• Etherpad: Throughout the course, I will use the following etherpad for collecting important links and further information (Zoom-meeting link etc.): https://etherpad.wikimedia.org/p/8e2yF6X6FCbqep2AxW1b
• eWhiteboard: Besides this script, I will use an eWhiteboard during the lecture. This material will be saved as PDF files and provided as accompanying lecture materials.
• Lecture Materials: You can find all lecture materials at eCampus or directly at sciebo: https://uni-bonn.sciebo.de/s/iOtkEDajrqWKT8v
Literature
• A guide to modern econometrics, by M. Verbeek
• Introduction to econometrics, by J. Stock and M.W. Watson
– E-Book: https://bonnus.ulb.uni-bonn.de/SummonRecord/FETCH-bonn_catalog_45089983
• Econometric theory and methods, by R. Davidson and J.G. MacKinnon
• A primer in econometric theory, by J. Stachurski
• Econometrics, by F. Hayashi
Chapter 1
Review: Probability and Statistics
1.1 Probability Theory
Probability is the mathematical language for quantifying uncertainty. We can apply probability theory to a diverse set of problems, from coin flipping to the analysis of econometric problems. The starting point is to specify the sample space, that is, the set of possible outcomes.
1.1.1 Sample Spaces and Events
The sample space Ω is the set of possible outcomes of an experiment. Points ω in Ω are called sample outcomes or realizations. Events are subsets of Ω.

Example: If we toss a coin twice then Ω = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.
Example: Let ω be the outcome of a measurement of some physical quantity, for example, temperature. Then Ω = ℝ = (−∞, ∞). The event that the measurement is larger than 10 but less than or equal to 23 is A = (10, 23].
Example: If we toss a coin forever then the sample space is the infinite set Ω = {ω = (ω₁, ω₂, ω₃, …) | ωᵢ ∈ {H, T}}. Let A be the event that the first head appears on the third toss. Then A = {(ω₁, ω₂, ω₃, …) | ω₁ = T, ω₂ = T, ω₃ = H, ωᵢ ∈ {H, T} for i > 3}.
Given an event A, let Aᶜ = {ω ∈ Ω : ω ∉ A} denote the complement of A. Informally, Aᶜ can be read as "not A." The complement of Ω is the empty set ∅. The union of events A and B is defined as

A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B or ω ∈ both}

which can be thought of as "A or B." If A₁, A₂, … is a sequence of sets then

∪_{i=1}^∞ Aᵢ = {ω ∈ Ω : ω ∈ Aᵢ for at least one i}.

The intersection of A and B is defined as

A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}

which reads as "A and B." Sometimes A ∩ B is also written shortly as AB. If A₁, A₂, … is a sequence of sets then

∩_{i=1}^∞ Aᵢ = {ω ∈ Ω : ω ∈ Aᵢ for all i}.

If every element of A is also contained in B we write A ⊂ B or, equivalently, B ⊃ A. If A is a finite set, let |A| denote the number of elements in A. We say that A₁, A₂, … are disjoint or mutually exclusive if Aᵢ ∩ Aⱼ = ∅ whenever i ≠ j. For example, A₁ = [0, 1), A₂ = [1, 2), A₃ = [2, 3), … are disjoint. A partition of Ω is a sequence of disjoint sets A₁, A₂, … such that

∪_{i=1}^∞ Aᵢ = Ω.
Summary: Sample space and events

  Ω        sample space
  ω        outcome
  A        event (subset of Ω)
  |A|      number of points in A (if A is finite)
  Aᶜ       complement of A (not A)
  A ∪ B    union (A or B)
  A ∩ B    intersection (A and B); short notation: AB
  A ⊂ B    set inclusion (A is a subset of or equal to B)
  ∅        null event (always false)
  Ω        true event (always true)
1.1.2 Probability
We want to assign a real number P(A) to every event A, called the probability of A. We also call P a probability distribution or a probability measure. To qualify as a probability, P has to satisfy three axioms. That is, a function P that assigns a real number P(A) ∈ [0, 1] to each event A is a probability distribution or a probability measure if it satisfies the following three axioms:

Axiom 1: P(A) ≥ 0 for every A
Axiom 2: P(Ω) = 1
Axiom 3: If A₁, A₂, … are disjoint then

P(∪_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ P(Aᵢ).
Note: It is not always possible to assign a probability to every event A if the sample space is large, such as, for instance, the whole real line, Ω = ℝ. In the case of Ω = ℝ strange things can happen: there are pathological sets that simply break down the mathematics. An example of these pathological sets, also known as non-measurable sets because they literally cannot be measured (i.e. we cannot assign probabilities to them), are the Vitali sets. Therefore, in cases like Ω = ℝ, we assign probabilities to a limited class of sets called a σ-field or σ-algebra. For Ω = ℝ, the canonical σ-algebra is the Borel σ-algebra, which is generated by the collection of all open subsets of ℝ.
One can derive many properties of P from the axioms. Here are a few:
• P(∅) = 0
• A ⊂ B ⇒ P(A) ≤ P(B)
• 0 ≤ P(A) ≤ 1
• P(Aᶜ) = 1 − P(A)
• A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B)
A less obvious property is given in the following: For any events A and B we have that,
P(A ∪ B) = P(A) + P(B) − P(AB).
Example. Two coin tosses. Let H₁ be the event that heads occurs on toss 1 and let H₂ be the event that heads occurs on toss 2. If all outcomes are equally likely, that is, P({H₁, H₂}) = P({H₁, T₂}) = P({T₁, H₂}) = P({T₁, T₂}) = 1/4, then

P(H₁ ∪ H₂) = P(H₁) + P(H₂) − P(H₁H₂) = 1/2 + 1/2 − 1/4 = 3/4.
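This addition-rule calculation can be checked in R by enumerating the four equally likely outcomes (the event definitions and the helper function P below are our own illustration, not from the script):

```r
## All four equally likely outcomes of two coin tosses:
outcomes <- c("HH", "HT", "TH", "TT")

## H1: heads on toss 1; H2: heads on toss 2
H1 <- c("HH", "HT")
H2 <- c("HH", "TH")

## Probability of an event = proportion of the (uniform) sample space:
P <- function(A) length(A) / length(outcomes)

## Direct computation of P(H1 u H2) ...
P(union(H1, H2))                      #> [1] 0.75
## ... agrees with P(H1) + P(H2) - P(H1 n H2):
P(H1) + P(H2) - P(intersect(H1, H2))  #> [1] 0.75
```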
Probabilities as frequencies. One can interpret P(A) in terms of frequencies. That is, P(A) is the (infinitely) long-run proportion of times that A is true in repetitions. For example, if we say that the probability of heads is 1/2, i.e. P(H) = 1/2, we mean that if we flip the coin many times then the proportion of times we get heads tends to 1/2 as the number of tosses increases. An infinitely long, unpredictable sequence of tosses whose limiting proportion tends to a constant is an idealization, much like the idea of a straight line in geometry.
The following R code approximates the probability P(H) = 1/2 using 1, 10, and 100,000 (pseudo) random coin flips:
set.seed(869)

## 1 (fair) coin-flip:
results <- sample(x = c("H", "T"), size = 1)
## Relative frequency of "H" in 1 coin-flip:
length(results[results == "H"]) / 1
#> [1] 1

## 10 (fair) coin-flips:
results <- sample(x = c("H", "T"), size = 10, replace = TRUE)
## Relative frequency of "H" in 10 coin-flips:
length(results[results == "H"]) / 10
#> [1] 0.3

## 100000 (fair) coin-flips:
results <- sample(x = c("H", "T"), size = 100000, replace = TRUE)
## Relative frequency of "H" in 100000 coin-flips:
length(results[results == "H"]) / 100000
#> [1] 0.50189
1.1.3 Independent Events
If we flip a fair coin twice, then the probability of two heads is (1/2) × (1/2). We multiply the probabilities because we regard the two tosses as independent. Two events A and B are called independent if
P(AB) = P(A)P(B).
Or more generally, a whole set of events {Aᵢ | i ∈ I} is independent if

P(∩_{i∈J} Aᵢ) = Π_{i∈J} P(Aᵢ)
for every finite subset J of I, where I denotes the not necessarily finite index set (e.g. I = {1,2,…}).
Independence can arise in two distinct ways. Sometimes, we explicitly assume that two events are independent. For example, in tossing a coin twice, we usually assume the tosses are independent which reflects the fact that the coin has no memory of the first toss.
In other instances, we derive independence by verifying that the definition of independence, P(AB) = P(A)P(B), holds. For example, in tossing a fair die once, let A = {2, 4, 6} be the event of observing an even number and let B = {1, 2, 3, 4} be the event of observing no 5 and no 6. Then A ∩ B = {2, 4} is the event of observing either a 2 or a 4. Are the events A and B independent? We have

P(AB) = 2/6 = P(A)P(B) = (1/2) · (2/3)

and so A and B are independent. In this case, we didn't assume that A and B are independent; it just turned out that they were.
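This derivation of independence can be replicated in R by treating the events as subsets of the six equally likely die outcomes (a small sketch; the helper function P is our own):

```r
## Fair die: six equally likely outcomes
die <- 1:6
P <- function(A) length(A) / length(die)

A <- c(2, 4, 6)     # even number
B <- c(1, 2, 3, 4)  # no 5 and no 6

## Definition of independence: P(AB) = P(A) * P(B)
P(intersect(A, B))  #> [1] 0.3333333
P(A) * P(B)         #> [1] 0.3333333
```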
Cautionary Notes. Suppose that A and B are disjoint events (i.e. AB = ÿ), each with positive probability (i.e. P(A) > 0 and P(B) > 0).
Can they be independent? No! This follows since P(AB) = P(∅) = 0 ≠ P(A)P(B) > 0.
Except in this special case, there is no way to judge (in-)dependence by looking at the sets in a Venn diagram.
Summary: Independence
1. A and B are independent if P (AB) = P (A)P (B).
2. Independence is sometimes assumed and sometimes derived.
3. Disjoint events with positive probability are not independent.
1.1.4 Conditional Probability
If P(B) > 0 then the conditional probability of A given B is
P(A | B) = P(AB) / P(B).
Think of P(A | B) as the fraction of times A occurs among those in which B occurs. Here are some facts about conditional probabilities:
• The rules of probability apply to events on the left of the bar "|". That is, for any fixed B such that P(B) > 0, P(· | B) is a probability, i.e. it satisfies the three axioms of probability: P(A | B) ≥ 0, P(Ω | B) = 1, and if A₁, A₂, … are disjoint then P(∪_{i=1}^∞ Aᵢ | B) = Σ_{i=1}^∞ P(Aᵢ | B).
• But it is generally not true that P(A | B ∪ C) = P(A | B) + P(A | C).
In general it is also not the case that P(A | B) = P(B | A). People get this confused all the time. For example, the probability of spots given you have measles is 1, but the probability that you have measles given that you have spots is not 1. In this case, the difference between P(A | B) and P(B | A) is obvious, but there are cases where it is less obvious. This mistake is made often enough in legal cases that it is sometimes called the "prosecutor's fallacy".
Example. A medical test for a disease D has outcomes + and −. The probabilities are:

        D        Dᶜ
  +     .0081    .0900    .0981
  −     .0009    .9010    .9019
        .0090    .9910    1
From the definition of conditional probability, we have that:
• Sensitivity of the test:
P(+ | D) = P(+ ∩ D)/P(D) = 0.0081/(0.0081 + 0.0009) = 0.9
• Specificity of the test:
P(− | Dᶜ) = P(− ∩ Dᶜ)/P(Dᶜ) = 0.9010/(0.9010 + 0.0900) ≈ 0.9
Apparently, the test is fairly accurate. Sick people yield a positive test result 90 percent of the time and healthy people yield a negative test result about 90 percent of the time. Suppose you go for a test and get a positive result. What is the probability you have the disease? Most people answer 0.90 = 90%. The correct answer is P(D | +) = P(+ ∩ D)/P(+) = 0.0081/(0.0081 + 0.0900) ≈ 0.08. The lesson here is that you need to compute the answer numerically. Don't trust your intuition.
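The three conditional probabilities of this example can be reproduced in R directly from the joint probabilities in the table (the variable names below are our own):

```r
## Joint probabilities from the table:
p_pos_D  <- 0.0081  # P(+ and D)
p_neg_D  <- 0.0009  # P(- and D)
p_pos_Dc <- 0.0900  # P(+ and Dc)
p_neg_Dc <- 0.9010  # P(- and Dc)

## Sensitivity P(+ | D):
p_pos_D / (p_pos_D + p_neg_D)     #> [1] 0.9
## Specificity P(- | Dc):
p_neg_Dc / (p_neg_Dc + p_pos_Dc)  #> [1] 0.9091827
## P(D | +), the probability of disease given a positive test:
p_pos_D / (p_pos_D + p_pos_Dc)    #> [1] 0.08256881
```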
If A and B are independent events then

P(A | B) = P(AB)/P(B) = P(A)P(B)/P(B) = P(A).

So another interpretation of independence is that knowing B doesn't change the probability of A.
From the definition of conditional probability we can write

P(AB) = P(A | B)P(B) and also P(AB) = P(B | A)P(A).

Often, these formulas give us a convenient way to compute P(AB) when A and B are not independent.
Note, sometimes P(AB) is written as P(A,B).
Example. Draw two cards from a deck, without replacement. Let A be the event that the first draw is the Ace of Clubs and let B be the event that the second draw is the Queen of Diamonds. Then P(A, B) = P(A)P(B | A) = (1/52) × (1/51).
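The multiplication rule can be checked in R, both exactly and by simulating many pairs of draws without replacement (the card labelling below is our own convention):

```r
set.seed(123)

## Exact probability via the multiplication rule:
(1 / 52) * (1 / 51)  #> [1] 0.0003770739

## Monte Carlo check: label the cards 1..52 and, by convention,
## let card 1 be the Ace of Clubs and card 2 the Queen of Diamonds.
## sample() without replace = TRUE draws without replacement.
n <- 1e5
hits <- replicate(n, all(sample(1:52, size = 2) == c(1, 2)))
mean(hits)  # close to 0.000377, subject to simulation noise
```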
Summary: Conditional Probability
1. If P(B) > 0 then P(A | B) = P(AB)/P(B)
2. P(· | B) satisfies the axioms of probability, for fixed B. In general, P(A | ·) does not satisfy the axioms of probability, for fixed A.
3. In general, P(A | B) ≠ P(B | A).
4. A and B are independent if and only if P(A | B) = P(A).
1.2 Random Variables
Statistics and econometrics are concerned with data. How do we link sample spaces, events and probabilities to data? The link is provided by the concept of a random variable. A real-valued random variable is a mapping X : Ω → ℝ that assigns a real number X(ω) ∈ ℝ to each outcome ω.

At a certain point in most statistics/econometrics courses, the sample space, Ω, is rarely mentioned and we work directly with random variables. But you should keep in mind that the sample space is really there, lurking in the background.
Example. Flip a coin ten times. Let X(ω) be the number of heads in the sequence ω. For example, if ω = HHTHHTHHTT then X(ω) = 6.
Example. Let Ω = {(x, y) | x² + y² ≤ 1} be the unit disc. Consider drawing a point "at random" from Ω. A typical outcome is then of the form ω = (x, y). Some examples of random variables are X(ω) = x, Y(ω) = y, Z(ω) = x + y, W(ω) = √(x² + y²).

Given a real-valued random variable X ∈ ℝ and a subset A of the real line (A ⊂ ℝ), define X⁻¹(A) = {ω ∈ Ω | X(ω) ∈ A}. This allows us to link the probabilities on the random variable X, i.e. the probabilities we are usually working with, to the underlying probabilities on the events, i.e. the probabilities lurking in the background.
Example. Flip a coin twice and let X be the number of heads. Then PX(X = 0) = P({TT}) = 1/4, PX(X = 1) = P({HT, TH}) = 1/2 and PX(X = 2) = P({HH}) = 1/4. Thus, the events and their associated probability distribution, P, and the random variable X and its distribution, PX, can be summarized as follows:

  ω     P({ω})    X(ω)
  TT    1/4       0
  TH    1/4       1
  HT    1/4       1
  HH    1/4       2

  x     PX(X = x)
  0     1/4
  1     1/2
  2     1/4

Here, PX is not the same probability function as P because P maps from the sample-space events, ω, to [0, 1], while PX maps from the random-variable events, X(ω), to [0, 1]. We will typically forget about the sample space and just think of the random variable as an experiment with real-valued (possibly multivariate) outcomes. We will therefore write P(X = xₖ) instead of PX(X = xₖ) to simplify the notation.
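The passage from the sample space to the distribution of X can be made concrete in R; the helper function X below (counting the H's in an outcome string) is our own illustration:

```r
## Sample space of two coin tosses:
omega <- c("TT", "TH", "HT", "HH")

## Random variable X(w) = number of heads in w:
X <- function(w) nchar(gsub("T", "", w))

## X maps each outcome to a real number:
sapply(omega, X)
#> TT TH HT HH
#>  0  1  1  2

## Distribution of X under equally likely outcomes:
table(sapply(omega, X)) / length(omega)
#>    0    1    2
#> 0.25 0.50 0.25
```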
1.2.1 Univariate Distribution and Probability Functions
1.2.1.1 Cumulative Distribution Function

The cumulative distribution function (cdf)

FX : ℝ → [0, 1]

of a real-valued random variable X ∈ ℝ is defined by
FX(x) = P(X Æ x).
You might wonder why we bother to define the cdf. The reason is that it effectively contains all the information about the random variable. Indeed, let X ∈ ℝ have cdf F and let Y ∈ ℝ have cdf G. If F(x) = G(x) for all x ∈ ℝ then P(X ∈ A) = P(Y ∈ A) for all A ⊂ ℝ. In order to denote that two random variables, here X and Y, have the same distribution, one can write shortly X =d Y.

Caution: Equality in distribution, X =d Y, does generally not mean equality in realizations; that is, X =d Y does not imply X(ω) = Y(ω) for all ω ∈ Ω.
The defining properties of a cdf. A function F mapping the real line to [0, 1], short F : ℝ → [0, 1], is called a cdf for some probability measure P if and only if it satisfies the following three properties:

1. F is non-decreasing, i.e. x₁ < x₂ implies that F(x₁) ≤ F(x₂).
2. F is normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.
3. F is right-continuous, i.e. F(x) = F(x⁺) for all x, where F(x⁺) = lim_{y→x, y>x} F(y).
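These three properties can be inspected for a concrete cdf. For X = number of heads in two fair coin tosses (the earlier example), the cdf is available in R as the step function pbinom(·, size = 2, prob = 0.5):

```r
## cdf F(x) = P(X <= x) for X = number of heads in two fair tosses:
FX <- function(x) pbinom(x, size = 2, prob = 0.5)

## Non-decreasing, 0 towards -Inf, 1 towards +Inf, and right-continuous
## (the jump at x = 0 is already included in F(0)):
FX(c(-1, 0, 0.5, 1, 2, 10))
#> [1] 0.00 0.25 0.25 0.75 1.00 1.00
```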
As an alternative to cumulative distribution functions, one can use probability (mass) functions to describe the probability law of discrete random variables and density functions to describe the probability law of continuous random variables.
1.2.1.2 Probability Functions for Discrete Random Variables.
A random variable X is discrete if it takes only countably many values X ∈ {x₁, x₂, …}.

For instance, X ∈ {1, 2, 3} or X ∈ {2, 4, 6, …} or X ∈ ℤ or X ∈ ℚ.
We define the probability function or probability mass function (pmf) for X by
fX(x) = P(X = x) for all x ∈ {x₁, x₂, …}.
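For instance, for X = number of heads in two fair coin tosses, the pmf is available in R as dbinom, and its values sum to one over the possible values {0, 1, 2}:

```r
## pmf fX(x) = P(X = x) for X = number of heads in two fair tosses:
fX <- function(x) dbinom(x, size = 2, prob = 0.5)

fX(0:2)       #> [1] 0.25 0.50 0.25
sum(fX(0:2))  #> [1] 1
```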
1.2.1.3 Density Functions for Continuous Random Variables.
A random variable X is continuous if there exists a function fX such that

1. fX(x) ≥ 0 for all x,
2. ∫_{−∞}^{∞} fX(x) dx = 1, and
3. P(a