OLS and the Conditional Expectation Function
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
Week One
January 11th and 12th, 2021
1/84
This Week
Course Details
Basic housekeeping
Course tools: Menti, R, and RStudio
Introduction to tidy data
OLS and the Conditional Expectation Function
Review and properties of the CEF
Review, implementation, and value of OLS
2/84
Course Details: Contact
Lecturer: Chris Hansman
Email: chansman@imperial.ac.uk
Office: 53 Prince’s Gate, 5.01b
Phone: +44 (0)20 7594 1044
TA: Davide Benedetti
Email: d.benedetti@imperial.ac.uk
3/84
Course Details: Assessment
Two assignments
Assignment 1 (25%)
Assigned Tuesday of Week 3
Due by 4pm on Tuesday of Week 5
Assignment 2 (25%)
Assigned Tuesday of Week 6
Due by 5:30pm Tuesday of Week 8
Final Exam (50%)
4/84
Course Details: Tentative Office Hours and Tutorials
Tentative office hours
Tuesdays from 17:30-18:30
Or by appointment
Formal tutorials will begin in Week 2
Davide will be available this week to help with R/RStudio
5/84
Course Details: Mentimeter
On your phone (or computer) go to Menti.com
6/84
Course Details: R and R-Studio
Make sure you have the most up-to-date version of R: https://cloud.r-project.org/
And an up-to-date version of RStudio:
https://www.rstudio.com/products/rstudio/download/
7/84
Course Details: In Class Exercises
Throughout the module we’ll regularly do hands-on exercises
Let’s start with a quick example:
On the Insendi course page find the data: ols_basics.csv
5 variables: Y, X, Y_sin, Y_2, Y_nl
Load the data into R, and run an OLS regression of Y on X. What is the coefficient on X?
8/84
Course Details: Projects in R-Studio
For those with R-Studio set up:
Open R-Studio and select File ⇒ New Project ⇒ New Directory ⇒
New Project
Name the directory “EF lecture 1” and locate it somewhere convenient
Each coursework should be completed in a unique project folder
9/84
Course Details: R set up
Download all data files from the hub and place them in EF lecture 1:
s_p_price.csv
ols_basics.csv
ames_testing.csv
ames_training.csv
10/84
Course Details: The Tidyverse
The majority of the coding we do will utilize the tidyverse
The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying design philosophy, grammar, and data structures.
For an excellent introduction and overview:
Hadley Wickham’s R for Data Science: https://r4ds.had.co.nz/
install.packages("tidyverse")
library(tidyverse)
11/84
Course Details: Tidy Data
The tidyverse is structured around tidy datasets
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell
For the theory underlying tidy data:
http://www.jstatsoft.org/v59/i10/paper
12/84
An Example of Tidy Data
13/84
An Example of Non-Tidy Data
14/84
Fixing An Observation Scattered Across Rows: spread()
tidy2 <- table2 %>%
  spread(key = "type", value = "count")
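Since table2 is one of tidyr’s built-in example tables, this runs as-is once the tidyverse is loaded (a minimal sketch):

library(tidyverse)
table2   # "type" holds variable names (cases, population); "count" holds their values
tidy2 <- table2 %>%
  spread(key = "type", value = "count")
tidy2    # cases and population now each have their own column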
15/84
Another Example of Non-Tidy Data
16/84
Fixing Columns as Values: gather()
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
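Again a minimal sketch, using tidyr’s built-in table4a:

library(tidyverse)
table4a   # the years 1999 and 2000 appear as column names, i.e. values stored as columns
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4a    # one row per country-year, with a year column and a cases column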
17/84
Introducing the Pipe: %>%
You’ll notice that both of these operations utilize a “pipe”: %>%
A tool for clearly expressing a sequence of multiple operations
Can help make code easy to read and understand
Consider evaluating the following: x = √(log(e⁹))
Could write it as:
x <- sqrt(log(exp(9)))
Or with pipes:
x <- 9 %>%
exp() %>%
log() %>%
sqrt()
18/84
This Week: Two Parts
(1) Introduction to the conditional expectation function (CEF)
Why is the CEF a useful (and widely used) summary of the relationship between variables Y and X
(2) Ordinary Least Squares and the CEF
Review, implementation, and the utility of OLS
19/84
Part 1: The Conditional Expectation Function
Overview
Key takeaway: useful tool for describing the relationship between
variables Y and X
Why: (at least) three nice properties:
1. Law of iterated expectations
2. CEF decomposition property
3. CEF prediction property
20/84
Review: Expectation of a Random Variable Y
Suppose Y is a random variable with a finite number of outcomes y1,y2,···yk occurring with probability p1,p2,···pk:
The expectation of Y is:
E[Y] = ∑_{i=1}^{k} yi pi
For example: if Y is the value of a (fair) dice roll:
E[Y] = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6) = 3.5
Suppose Y is a (continuous) random variable whose CDF F(y) admits density f(y)
The expectation of Y is:
E[Y] = ∫ y f(y) dy
This is just a number!
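A quick check of the dice example in R:

# E[Y] for a fair die: sum of outcomes times probabilities
sum(1:6 * rep(1/6, 6))   # 3.5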
21/84
The Conditional Expectation Function (CEF)
We are often interested in the relationship between some outcome Y and a variable (or set of variables) X
A useful summary is the conditional expectation function: E[Y|X]
Gives the expectation of Y when X takes any particular value
Formally, if fy(·|X) is the conditional density of Y|X:
E[Y|X] = ∫ z fy(z|X) dz
E[Y|X] is a random variable itself: a function of the random X
Can think of it as E[Y|X]=h(X)
Alternatively, evaluate it at particular values: for example X = 0.5
E[Y|X =0.5] is just a number!
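As an illustration (not from the slides), a minimal sketch of estimating E[Y|X] by averaging Y within values of X, using made-up data and the tidyverse:

library(tidyverse)
set.seed(1)
# Made-up data: X takes a few discrete values, Y depends on X plus noise
df <- tibble(X = sample(1:5, 1000, replace = TRUE),
             Y = 2 * X + rnorm(1000))
# Estimate the CEF E[Y|X] by averaging Y at each value of X
df %>%
  group_by(X) %>%
  summarise(E_Y_given_X = mean(Y))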
22/84
Unconditional Expectation of Height for Adults: E[H]
23/84
[Figure: distribution of adult height (inches), 54 to 78]
Unconditional Expectation of Height for Adults: E[H]
E[H]=67.5 In.
25/84
Conditional Expectation of Height by Age: E[H|Age]
[Figure: E[H|Age] for ages 0 to 40, with E[H|Age=5], E[H|Age=10], E[H|Age=15], E[H|Age=20], E[H|Age=25], E[H|Age=30], E[H|Age=35], and E[H|Age=40] marked; height axis 30 to 80 inches]
26/84
Why the Conditional Expectation Function?
E[Y|X] is not the only function that relates Y to X
For example, consider the 95th percentile of Y given X: P95(Y|X)
[Figure: adult height distribution (54 to 78 inches) with P95[H|G=Male] and P95[H|G=Female] marked]
But E[Y|X] has a bunch of nice properties
27/84
Property 1: The Law of Iterated Expectations
E_X[E[Y|X]] = E[Y]
Example: let Y be yearly wages for MSc graduates
E[Y]=£1,000,900
Two values for X : {RMFE, Other}
Say 10% of MSc students are RMFE, 90% in other programs
E[Y|X = RMFE] = £10,000,000
E[Y|X = Other] = £1,000
The expectation works like always (just over E[Y|X] instead of X):
E[E[Y|X]] = E[Y|X = RMFE]×P[X = RMFE] + E[Y|X = Other]×P[X = Other]
          = £10,000,000 × 0.1 + £1,000 × 0.9 = £1,000,900
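Checking the arithmetic in R:

# E[E[Y|X]] = E[Y|RMFE] x P(RMFE) + E[Y|Other] x P(Other)
10000000 * 0.1 + 1000 * 0.9   # 1000900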
28/84
Property 1: The Law of Iterated Expectations
E[E[Y|X]]=E[Y]
Not true, for example, for the 95th percentile: E[P95[Y|X]] ≠ P95[Y]
29/84
Property 2: The CEF Decomposition Property
Any random variable Y can be broken down into two pieces:
Y = E[Y|X] + ε
Where the residual ε has the following properties:
(i) E[ε|X] = 0 (“mean independence”)
(ii) ε uncorrelated with any function of X
Intuitively this property says we can break down Y into two parts:
(i) The part of Y “explained by” X: E[Y|X]
    This is the (potentially) useful part when predicting Y with X
(ii) The part of Y unrelated to X: ε
30/84
Property 2: Proof
Y =E[Y|X]+ε
(i) E[ε|X] = 0 (“mean independence”)
ε = Y − E[Y|X]
⇒ E[ε|X] = E[Y − E[Y|X] | X] = E[Y|X] − E[Y|X] = 0
(ii) ε uncorrelated with any function of X
Cov(ε, h(X)) = E[h(X)ε] − E[h(X)]E[ε]        (E[ε] = 0: how come?)
             = E[h(X)ε]
             = E[E[h(X)ε|X]]                  (iterated expectations)
             = E[h(X)E[ε|X]] = E[h(X)·0] = 0
31/84
Property 3: The CEF Prediction Property
Out of any function of X, E[Y|X] is the best predictor of Y
In other words, E[Y|X] is the “closest” function to Y on average
What do we mean by closest?
Consider any function of X, say m(X)
m(X) is close to Y if the difference (or “error”) is small: Y − m(X)
Close is about magnitude, treat positive/negative the same…
m(X) is also close to Y if the squared error is small: (Y − m(X))²
E[Y|X] is the closest, in this sense, in expectation:
E[Y|X] = argmin_{m(X)} E[(Y − m(X))²]
“Minimum mean squared error”
32/84
Property 3: Proof (Just for Fun)
Out of any function of X, E[Y|X] is the best predictor of Y:
E[Y|X] = argmin_{m(X)} E[(Y − m(X))²]
To see this, note:
(Y − m(X))² = ([Y − E[Y|X]] + [E[Y|X] − m(X)])²
            = [Y − E[Y|X]]² + [E[Y|X] − m(X)]²
              + 2·[E[Y|X] − m(X)]·[Y − E[Y|X]]
                  (call this h(X))   (this is ε)
⇒ E[(Y − m(X))²] = E[(Y − E[Y|X])²] + E[(E[Y|X] − m(X))²] + 2E[h(X)·ε]
                    (unrelated to m(X))   (min. when m(X) = E[Y|X])   (= 0)
33/84
Summary: Why We Care About Conditional Expectation Functions
Useful tool for describing relationship between Y and X
Several nice properties
Most statistical tests come down to comparing E[Y|X] at certain X
Classic example: experiments
34/84
Part 2: Ordinary Least Squares
Linear regression is arguably the most popular modeling approach across every field in the social sciences
Transparent, robust, relatively easy to understand
Provides a basis for more advanced empirical methods
Extremely useful when summarizing data
Plenty of focus on the technical aspects of OLS last term
Focus today on an applied perspective
35/84
Review of OLS in Three Parts
1. Overview
Intuition and Review of Population and Sample Regression Algebra
Connection With Conditional Expectation Function
Estimating a Linear Regression in R
2. An Example: Predicting Home Prices
3. Rounding Out Some Details
Scaling and Implementation
36/84
OLS Part 1: Overview
37/84
[Figure: empty axes, Y against X]
OLS Estimator Fits a Line Through the Data
[Figure: the fitted line β0^OLS + β1^OLS X in the (X, Y) plane]
37/84
A Line Through the Data: Example in R
[Scatter plot of Y against X from the example data (X roughly −3 to 3, Y roughly −5 to 10)]
38/84
A Line Through the Data: Example in R
[The same scatter plot of Y against X as on the previous slide]
39/84
How Do We Choose Which Line?
40/84
[Figure: a candidate line β0 + β1X in the (X, Y) plane]
One Data Point
40/84
[Figure: a single observation at xi relative to the line β0 + β1X]
vi: Observation i’s Deviation from β0 + β1xi
40/84
[Figure: the vertical gap vi between the observation and the line value β0 + β1xi]
One Data Point
40/84
[Figure: the observation written as yi = β0 + β1xi + vi]
Choosing the Regression Line
For any line β0 + β1X, the data point (yi, xi) may be written as:
yi = β0 + β1xi + vi
vi will be big if β0 + β1xi is “far” from yi
vi will be small if β0 + β1xi is “close” to yi
We refer to vi as the residual
41/84
Choosing the (Population) Regression Line
yi = β0 + β1xi + vi
An OLS regression is simply choosing the β0^OLS, β1^OLS that make vi as “small” as possible on average
How do we define “small”?
Want to treat positive/negative the same: consider vi²
Choose β0^OLS, β1^OLS to minimize:
E[vi²] = E[(yi − β0 − β1xi)²]
42/84
(Population) Regression Anatomy
{β0^OLS, β1^OLS} = argmin_{β0, β1} E[(yi − β0 − β1xi)²]
In this simple case with only one xi, β1^OLS has an intuitive definition:
β1^OLS = Cov(yi, xi) / Var(xi)
β0^OLS = ȳ − β1^OLS x̄
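A minimal check in R, assuming the exercise file ols_basics.csv has been read in (e.g. with read_csv) and has columns Y and X (the column names are an assumption):

# Slope and intercept from the regression-anatomy formulas
b1 <- cov(ols_basics$Y, ols_basics$X) / var(ols_basics$X)
b0 <- mean(ols_basics$Y) - b1 * mean(ols_basics$X)
c(b0, b1)
# Should match coef(lm(Y ~ X, data = ols_basics))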
43/84
Regression Anatomy (Matrix Notation)
yi = β0 + β1xi + vi
You will often see more concise matrix notation:
β = [β0, β1]′  (2×1),    Xi = [1, xi]′  (2×1)
yi = Xi′β + vi
This lets us write the OLS coefficients as:
β^OLS = argmin_β E[(yi − Xi′β)²]
⇒ β^OLS = E[Xi Xi′]^{-1} E[Xi yi]
44/84
(Sample) Regression Anatomy
β^OLS = argmin_β E[(yi − Xi′β)²]
β^OLS = E[Xi Xi′]^{-1} E[Xi yi]
Usually we do not explicitly know these expectations, so compute sample analogues:
β̂^OLS = argmin_β (1/N) ∑_{i=1}^{N} (yi − Xi′β)²
⇒ β̂^OLS = (X′X)^{-1}(X′Y)
Where X (N×2) and Y (N×1) stack the observations:

      1  x1          y1
      1  x2          y2
X  =  ⋮  ⋮     Y  =  ⋮
      1  xN          yN
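The sample formula can be verified directly, under the same assumptions about ols_basics as above (a sketch):

# Build the stacked matrix and vector, then apply (X'X)^{-1} X'Y
Xmat <- cbind(1, ols_basics$X)   # N x 2: a column of ones and the regressor
yvec <- ols_basics$Y             # N x 1 vector of outcomes
solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% yvec
# Should match coef(lm(Y ~ X, data = ols_basics))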
45/84
This Should (Hopefully) Look Familiar
RSS(b) = ∑_{i=1}^{N} (yi − Xi′b)²
46/84
Estimating a Linear Regression in R
Simple command to estimate OLS
ols_v1 <- lm(Y ~ X, data = ols_basics)
And to display results:
summary(ols_v1)
47/84
A Line Through the Data: Example in R
[Scatter plot of Y against X with the fitted OLS line overlaid]
Intercept looks something like 1; slope approximately 2?
48/84
OLS in R
49/84
Recall, for comparison
50/84
Regression and the Conditional Expectation Function
Why is linear regression so popular?
Simplest way to estimate (or approximate) conditional expectations!
Three simple results
OLS perfectly captures CEF if CEF is Linear
OLS generates best linear approximation to the CEF if not
OLS perfectly captures CEF with binary (dummy) regressors
51/84
Regression captures CEF if CEF is Linear
Take the special case of a linear conditional expectation function:
E[yi|Xi] = Xi′β
Then OLS captures E[yi|Xi]
[Scatter plot: when the CEF is linear, the OLS line and E[Y|X] coincide]
52/84
Conditional Expectation Function Often Non-Linear
[Scatter plot of Y_nl against X: the relationship is clearly non-linear]
53/84
OLS Provides Best Linear Approximation to CEF
[Scatter plot of Y_nl against X with the OLS line: a linear approximation to the non-linear CEF]
54/84
OLS Provides Best Linear Approximation to CEF
In most contexts, OLS will not precisely tell us E[yi|Xi]
But captures key features of E[yi|Xi]
In many contexts this approximation is preferred
Even if more complex techniques provide a better fit of the data...
Transparent/simple to estimate/easy to digest
55/84
Simple Conditional Expectation with a Dummy Variable
There is one more important context in which OLS perfectly captures E[yi|Xi]
Dummy variables
A Dummy Variable is a binary variable that takes the value of 1 if
some condition is met and 0 otherwise
For example, suppose we have stock data, and want to create a variable that indicates whether a given stock is classified in the IT sector:
Security               Price      Sector               IT
3M Company             $247.95    Industrials          0
Activision Blizzard    $70.74     Information Tech.    1
Aetna Inc              $187.00    Health Care          0
Apple Inc.             $178.36    Information Tech.    1
Bank of America        $31.66     Financials           0
...                    ...        ...                  ...
Xerox Corp.            $31.75     Information Tech.    1
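A sketch of building such a dummy in R; the file, column names, and the exact sector label are assumptions:

library(tidyverse)
s_p_price <- read_csv("s_p_price.csv")   # assumed columns include price and sector
s_p_price <- s_p_price %>%
  mutate(IT = as.numeric(sector == "Information Technology"))  # label assumed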
56/84
Simple Conditional Expectation with a Dummy Variable
Only two possible values of the conditional expectation function:
E[Pricei|ITi = 1] or E[Pricei|ITi = 0]
If we run the regression:
Pricei = β0^OLS + β1^OLS ITi + vi
β0^OLS and β1^OLS allow us to recover both!
β0^OLS = E[Pricei|ITi = 0]
    Expected price for non-IT stocks
β1^OLS = E[Pricei|ITi = 1] − E[Pricei|ITi = 0]
    Expected price difference for IT stocks
Aside: True both in the population and in-sample
57/84
OLS and Conditional Expectations: In Practice
[Bar chart: average share price (USD, 0 to 100 scale) for non-IT and IT stocks, with β0^OLS and β1^OLS marked]
Average for Non-IT Stocks: $92.090    Average for IT Stocks: $113.475
58/84
Regressions with Dummy Variables
Suppose we regress price on a constant and an IT dummy
And we recover β̂0^OLS and β̂1^OLS
What will the value of β̂1^OLS be?
Vote: Go to menti.com
Recall:
Average for Non-IT stocks: $92.090
Average for IT stocks: $113.475
59/84
Implementing Regression with Dummy Variables
It is useful to note that this is simply a mechanical feature of our estimator
Letting yi = pricei, xi = ITi, and Nx be the number of IT observations, our OLS estimates are:

β̂^OLS = (X′X)^{-1}(X′Y) = [ β̂0^OLS , β̂1^OLS ]′

where

β̂0^OLS = ( ∑_{i s.t. xi=0} yi ) / (N − Nx)
    Average price for non-IT stocks

β̂1^OLS = ( ∑_{i s.t. xi=1} yi ) / Nx − ( ∑_{i s.t. xi=0} yi ) / (N − Nx)
    Average price difference for IT stocks
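A quick check of this equivalence in R, under the same assumptions about s_p_price and the IT dummy as above:

# Group means of price by the IT dummy
mean_non_it <- mean(s_p_price$price[s_p_price$IT == 0])
mean_it     <- mean(s_p_price$price[s_p_price$IT == 1])
c(mean_non_it, mean_it - mean_non_it)
# Should equal coef(lm(price ~ IT, data = s_p_price))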
60/84
Implementing Regressions with Categorical Variables
What if we are interested in comparing all 11 GICS sectors?
Create dummy variables for each sector, omitting 1
Let’s call them D1i, ···, D10i
pricei = β0 + δ1D1i + ··· + δ10D10i + vi
Regress pricei on a constant and those 10 dummy variables
In other words, Xi = [1 D1i ··· D10i]′, or

        1  0  ···  1  0
        1  1  ···  0  0
        1  0  ···  0  0
X  =    1  0  ···  1  0
        1  0  ···  0  1
        ⋮  ⋮        ⋮  ⋮
        1  1  ···  0  0
61/84
Average Share Price by Sector for Some S&P Stocks
[Bar chart: average share price (USD, 0 to 100 scale) by sector (Cons. Discret., Cons. Staples, Energy, Financials, Health Care, Industrials, IT, Materials, Real Estate, Telecom, Utilities), with β0^OLS and δ1^OLS through δ10^OLS marked]
62/84
Implementing Regressions with Dummy Variables
β̂0^OLS (coef. on the constant) is the mean for the omitted category:
In this case “Consumer Discretionary”
The coefficient on each dummy variable (e.g. δ̂k^OLS) is the difference between β̂0^OLS and the conditional mean for that category
Key point: If you are only interested in categorical variables...
You can perfectly capture the full CEF in a single regression
For example:
E[pricei | sectori = consumer staples] = β0^OLS + δ1^OLS
E[pricei | sectori = energy] = β0^OLS + δ2^OLS
...
63/84
Very Simple to Implement in R
R has a trick to estimate regressions with categorical variables:
ols_sector <- lm(price ~ as.factor(sector), data = s_p_price)
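If you want a different omitted category, one option is to relevel the factor first (a sketch; the sector label is an assumption):

# Make "Information Technology" the omitted (reference) category instead
s_p_price <- s_p_price %>%
  mutate(sector = relevel(as.factor(sector), ref = "Information Technology"))
ols_sector <- lm(price ~ sector, data = s_p_price)
summary(ols_sector)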
64/84
Why Do We Leave Out One Category? Flashback:
65/84
Why Do We Leave Out One Category? Flashback:
66/84
Why Do We Leave Out One Category?
X has full column rank when all of its columns are linearly independent
Suppose we had a dataset of 6 stocks from two sectors: e.g. Consumer Discretionary and IT
And suppose we include dummies for both sectors:

      1  0  1
      1  1  0
      1  0  1
X  =  1  0  1
      1  1  0
      1  1  0
Are the columns of X linearly independent?
67/84
Why Do We Leave Out One Category?
Perhaps a more intuitive explanation: suppose we include all sectors:
pricei = β0 + δ1D1i + ··· + δ10D10i + δ11D11i + vi
Then the interpretation of
β0^OLS = E[pricei | D1i = 0, ···, D10i = 0, D11i = 0]
e.g. expected price for stocks that belong to no sector: nonsensical
Not specific to this example, true for any categorical variable
Forgetting to omit a category sometimes called the “dummy variable
trap”
An alternative: If you omit the constant from a regression, you can include all categories
68/84
OLS Part 2: A Predictive Model
Suppose we see 1500 observations of some outcome yi
Example: residential real estate prices
We have a few characteristics of the homes
E.g. square feet / year built
Want to build a model that helps us predict yi out of sample
I.e. the price of some other home
69/84
We Are Given 100 Observations of yi
[Scatter plot: the observed outcomes yi plotted by observation number, roughly between −20 and 20]
70/84
How Well Can We Predict Out-of-Sample Outcomes (yi^oos)
[Scatter plot: out-of-sample outcomes yi^oos by observation number]
71/84
Our Best Prediction (ŷi^oos)
[Scatter plot: our model’s predictions ŷi^oos by observation number]
71/84
Prediction vs reality (ŷi^oos vs. yi^oos)
[Scatter plot: out-of-sample outcomes and predictions plotted together]
71/84
A Good Model Has Small Distance (yi^oos − ŷi^oos)²
[Scatter plot: outcomes and predictions; a good model keeps the squared gaps small]
72/84
Measure of Fit: Out of Sample Mean Squared Error
MSE^oos = (1/N^oos) ∑_i (yi^oos − ŷi^oos)²
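In R, with illustrative vectors standing in for the out-of-sample outcomes and predictions:

# Out-of-sample MSE: average squared gap between outcomes and predictions
y_oos     <- c(10, 12, 9, 15)    # illustrative out-of-sample outcomes
y_hat_oos <- c(11, 12, 10, 13)   # illustrative predictions
mean((y_oos - y_hat_oos)^2)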
73/84
Example Using Ames Housing Data
Predict log prices with year home is built
Predict log prices with year home is built and square footage
Menti: What is MSE^oos in the more complex model?
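A sketch of the workflow; the column names (log_price, year_built, sqft) are guesses and should be checked against the files:

library(tidyverse)
ames_training <- read_csv("ames_training.csv")
ames_testing  <- read_csv("ames_testing.csv")

# Fit the two candidate models on the training sample
m1 <- lm(log_price ~ year_built, data = ames_training)
m2 <- lm(log_price ~ year_built + sqft, data = ames_training)

# Out-of-sample MSE for each model on the testing sample
mse1 <- mean((ames_testing$log_price - predict(m1, newdata = ames_testing))^2)
mse2 <- mean((ames_testing$log_price - predict(m2, newdata = ames_testing))^2)
c(mse1, mse2)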
74/84
A Few Practical Details When Using OLS
Dummies and Continuous Variables
Scaling Coefficients
Data Transformations
75/84
Dummy and Continuous variables
Suppose we want to combine dummy and continuous variables
Consider the impact of education on wages
Let Yi be wages, Xi be years of education
Let Dmale,i be a dummy variable equal to 1 for males, 0 otherwise
Yi = β0 + β1Xi + δMale DMale,i + vi
76/84
Education and Wages with a Male Dummy
[Figure: wages (Yi) against years of education (Xi), with two parallel lines: Yi = β0^OLS + β1^OLS Xi for non-males and Yi = β0^OLS + β1^OLS Xi + δMale^OLS for males]
77/84
Dummy and Continuous variables
Similar interpretation of dummies as before, with one caveat
β0^OLS is the mean of the omitted category (non-males) when Xi = 0
δMale^OLS is the difference in wages for males when Xi = 0
OLS coefficients can be interpreted as (differences in) means with continuous variables set to 0
Sometimes referred to as group- or category-specific intercepts
Works with many dummies:
e.g. different “intercepts” for each sector
78/84
Scaling Variables: Independent Variables
yi = β0 + β1xi + vi
yi wages (in $), and xi years of education
Suppose we want to change the units of xi?
E.g. convert to months of education:
xi^months = xi × 12
β1 will simply scale accordingly:
yi = β0 + (β1/12) xi^months + vi
Intercept, R², statistical significance unchanged
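A quick sketch with made-up numbers, checking that rescaling the regressor rescales only its coefficient:

set.seed(1)
educ_years  <- rnorm(500, mean = 14, sd = 2)             # years of education (made up)
wages       <- 5000 + 3000 * educ_years + rnorm(500, sd = 4000)
educ_months <- educ_years * 12

coef(lm(wages ~ educ_years))    # some intercept and slope b1
coef(lm(wages ~ educ_months))   # same intercept, slope b1 / 12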
79/84
Scaling Variables: Dependent Variable
yi = β0 + β1xi + vi
yi wages (in $), and xi years of education
Suppose we want to change the units of yi?
E.g. convert $ to 1000s of $:
yi^1000 = yi / 1000
β0, β1, vi will scale accordingly:
yi^1000 = β0/1000 + (β1/1000) xi + vi/1000
Again R², statistical significance unchanged
80/84
Percent vs. Percentage Point Change
Percent change is proportionate (or relative) change:
((x1 − x0) / x0) × 100
Percentage point change is a raw change in percentages
For example: consider the unemployment rate (in %)
If unemployment goes from 10% to 11%:
1 percentage point change
((11 − 10) / 10) × 100 = 10% change
Take care to distinguish between them
81/84
Quadratics and Higher Order Polynomials
Can often do a better job of approximating the CEF using higher order polynomials, for example:
yi = β0 + β1xi + β2xi² + vi
Downside: the relationship between Xi and Yi is harder to summarize:
∂yi/∂xi = β1 + 2β2xi
Changes for different values of Xi
Tradeoff between quality of approximation and simplicity
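A sketch of fitting a quadratic in R, assuming ols_basics has columns Y_nl and X:

quad_fit <- lm(Y_nl ~ X + I(X^2), data = ols_basics)
summary(quad_fit)
# The implied marginal effect of X is beta1 + 2 * beta2 * X, so it varies with X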
82/84
This Week
(1) Introduction to the conditional expectation function (CEF)
Why is the CEF a useful (and widely used) summary of the relationship between variables Y and X
(2) Ordinary Least Squares and the CEF
Review, implementation, and the utility of OLS
83/84
Next Week
The Basics of Causal Inference
84/84