Review
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
March 8-9, 2021
1/102
Topic 1: OLS and the Conditional Expectation Function
Consider a random variable yi and (a vector of) variables Xi. Which of the following is false?
(a) Xi′βOLS provides the best predictor of yi out of any function of Xi
(b) Xi′βOLS is the best linear approximation of E[yi|Xi]
(c) yi −E[yi|Xi] is uncorrelated with Xi
2/102
Topic 1: OLS and the Conditional Expectation Function
1. A review of the conditional expectation function and its properties
2. The relationship between OLS and the CEF
3/102
Topic 1 – Part 1: The Conditional Expectation Function (CEF)
We are often interested in the relationship between some outcome yi and a variable (or set of variables) Xi
A useful summary is the conditional expectation function: E[yi|Xi]
Gives the mean of yi when Xi takes any particular value
Formally, if fy(·|Xi) is the conditional p.d.f. of yi|Xi:
E[yi|Xi] = ∫ z fy(z|Xi) dz
E[yi|Xi] is a random variable itself: a function of the random Xi
Can think of it as E[yi|Xi] = h(Xi)
Alternatively, evaluate it at particular values: for example Xi = 0.5
E [yi |Xi = 0.5] is just a number!
4/102
Topic 1 – Part 1: The Conditional Expectation Function: E[Y|X]
[Figure: the CEF of height (inches) given age, plotted for ages 0 to 40, with points marking E[H|Age=5], E[H|Age=10], ..., E[H|Age=40]]
5/102
Topic 1 – Part 1: Three Useful Properties of E[Y|X]
(i) The law of iterated expectations (LIE): E[E[yi|Xi]] = E[yi]
(ii) The CEF Decomposition Property:
Any random variable yi can be broken down into two pieces
yi = E[yi|Xi]+εi
Where the residual εi has the following properties:
(a) E[εi|Xi] = 0 ("mean independence")
(b) εi uncorrelated with any function of Xi
(iii) Out of any function of Xi, E[yi|Xi] is the best predictor of yi:
E[yi|Xi] = arg min_{m(Xi)} E[(yi − m(Xi))²]
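A minimal numpy sketch, on made-up data with a discrete Xi, illustrating the three properties above (LIE, mean-independent residual, best predictor). All variable names and parameter values here are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Simulated data with a discrete X so the CEF can be estimated by group means
rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 5, size=n)            # X takes values 0,...,4
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)   # true CEF: E[y|X] = 2 + 1.5 X

cef = np.array([y[x == v].mean() for v in range(5)])  # h(X) = E[y|X] by group means
h_x = cef[x]

# (i) LIE: E[E[y|X]] equals E[y]
print(h_x.mean(), y.mean())

# (ii) Decomposition: eps = y - E[y|X] has (approximately) zero mean at every X
eps = y - h_x
print([round(eps[x == v].mean(), 3) for v in range(5)])

# (iii) Best predictor: the CEF beats any other function of X in MSE,
#       e.g. a deliberately wrong predictor m(X) = 2 + 1.2 X
print(np.mean((y - h_x) ** 2), np.mean((y - (2.0 + 1.2 * x)) ** 2))
```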
6/102
Topic 1 – Part 1 summary: Why We Care About Conditional Expectation Functions
Useful tool for describing relationship between yi and Xi
Several nice properties
Most statistical tests come down to comparing E[yi|Xi] at certain Xi
Classic example: experiments
7/102
Topic 1 – Part 2: Ordinary Least Squares
Linear regression is arguably the most popular modeling approach across every field in the social sciences
Transparent, robust, relatively easy to understand
Provides a basis for more advanced empirical methods
Extremely useful when summarizing data
Plenty of focus on the technical aspects of OLS last term
Focus here on an applied perspective
8/102
Topic 1 – Part 2: OLS Estimator Fits a Line Through the Data
[Scatter plot: simulated (X, Y) data, with X roughly between −2 and 2 and Y roughly between −5 and 10, illustrating the OLS line fitted through the cloud of points]
9/102
Topic 1 – Part 2: Choosing the (Population) Regression Line
yi =β0+β1xi+vi
An OLS regression simply chooses the β0^OLS, β1^OLS that make vi as "small" as possible on average
How do we define "small"?
Want to treat positive/negative the same: consider vi²
Choose β0^OLS, β1^OLS to minimize:
E[vi²] = E[(yi − β0 − β1 xi)²]
10/102
Topic 1 – Part 2: Regression and The CEF
Given yi and Xi, the population regression coefficient is: βOLS = E[Xi Xi′]^(−1) E[Xi yi]
A useful time to note: you should remember the OLS estimator: β̂OLS = (X′X)^(−1) X′Y
With just one xi:
β̂OLS = Ĉov(xi, yi) / V̂ar(xi)
11/102
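A quick numpy check, on simulated data, that the two formulas above agree: the matrix expression (X′X)⁻¹X′Y and, with a single regressor, the sample covariance over the sample variance. The data-generating values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])           # include a constant
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y

slope_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(beta_hat)      # [intercept, slope]
print(slope_cov)     # matches beta_hat[1]
```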
Topic 1 – Part 2: Regression and the Conditional Expectation Function
Why is linear regression so popular?
Simplest way to estimate (or approximate) conditional expectations!
Three simple results
OLS perfectly captures CEF if CEF is Linear
OLS generates best linear approximation to the CEF if not
OLS perfectly captures CEF with binary (dummy) regressors
12/102
Topic 1 – Part 2: Regression captures CEF if CEF is Linear
Take the special case of a linear conditional expectation function:
E[yi|Xi] = Xi′β
Then OLS captures E[yi|Xi]
[Scatter plot: simulated data with a linear CEF; the fitted OLS line coincides with E[Y|X]; X roughly between −2 and 2, Y roughly between −5 and 10]
13/102
Topic 1 – Part 2: OLS Provides Best Linear Approximation to CEF
[Scatter plot: simulated data with a nonlinear CEF (Y_nl plotted against X); the OLS line is the best linear approximation to the CEF; X roughly between −2 and 2]
14/102
Topic 1 – Part 2: Implementing Regressions with Categorical Variables
What if we are interested in comparing all 11 GICS sectors?
Create dummy variables for each sector, omitting one
Let's call them D1i, ···, D10i
pricei = β0 + δ1 D1i + ··· + δ10 D10i + vi
In other words Xi = [1 D1i ··· D10i]′, or

X = [ 1 0 ··· 1 0
      1 1 ··· 0 0
      1 0 ··· 0 0
      1 0 ··· 1 0
      1 0 ··· 0 1
      .  .  ··   .  .
      1 1 ··· 0 0 ]

Regress pricei on a constant and those 10 dummy variables
15/102
Topic 1 – Part 2: Average Share Price by Sector for Some S&P Stocks
16/102
Topic 1 – Part 2: Implementing Regressions with Dummy Variables
β̂0^OLS (the coefficient on the constant) is the mean for the omitted category:
In this case "Consumer Discretionary"
The coefficient on each dummy variable (e.g. δ̂k^OLS) is the difference between the conditional mean for that category and β̂0^OLS
Key point: If you are only interested in categorical variables…
You can perfectly capture the full CEF in a single regression
For example:
E[pricei | sectori = consumer staples] = β0^OLS + δ1^OLS
E[pricei | sectori = energy] = β0^OLS + δ2^OLS
...
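A small sketch, using made-up sector labels and prices, showing that a regression on a constant plus sector dummies reproduces the conditional means exactly. The sector names, sample size and price levels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
sectors = rng.choice(["consumer_disc", "staples", "energy"], size=3_000)
price = np.where(sectors == "consumer_disc", 50,
         np.where(sectors == "staples", 65, 80)) + rng.normal(0, 5, 3_000)

d_staples = (sectors == "staples").astype(float)
d_energy = (sectors == "energy").astype(float)
X = np.column_stack([np.ones(len(price)), d_staples, d_energy])  # omit consumer_disc
b0, d1, d2 = np.linalg.lstsq(X, price, rcond=None)[0]

print(b0, price[sectors == "consumer_disc"].mean())   # constant = omitted-category mean
print(b0 + d1, price[sectors == "staples"].mean())    # β0 + δ1 = E[price | staples]
print(b0 + d2, price[sectors == "energy"].mean())     # β0 + δ2 = E[price | energy]
```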
17/102
Topic 1 – Part 2 Summary: Regression and the Conditional Expectation Function
Why is linear regression so popular?
Simplest way to estimate (or approximate) conditional expectations!
Three simple results
OLS perfectly captures CEF if CEF is Linear
OLS generates best linear approximation to the CEF if not
OLS perfectly captures CEF with binary (dummy) regressors
18/102
Topic 1: OLS and the Conditional Expectation Function
1. A review of the conditional expectation function and its properties
2. The relationship between OLS and the CEF
19/102
Topic 2: Causality and Regression
1. The potential outcomes framework
2. Causal effects with OLS
20/102
Topic 2: Causality and Regression
Suppose wages (yi ) are determined by:
yi =β0+β1xi+γai+ei
and we see years of schooling (xi) but not ability (ai), with Corr(xi, ai) > 0 and Corr(yi, ai) > 0
We estimate:
yi = β0 + β1 xi + vi
And recover:
β1^OLS = β1 + γ δ1^OLS
where the second term (γ δ1^OLS) is the bias
Is our estimated β1^OLS larger or smaller than β1?
21/102
Topic 2 – Part 1: The Potential Outcomes Framework
Ideally, how would we find the impact of candy on evaluations (yi)?
Imagine we had access to two parallel universes and could observe:
The exact same student (i)
At the exact same time
In one universe they receive candy—in the other they do not
And suppose we could see the student's evaluations in both worlds
Define the variables we would like to see, for each individual i:
yi1 = evaluation with candy
yi0 = evaluation without candy
22/102
Topic 2 – Part 1: The Potential Outcomes Framework
If we could see both yi1 and yi0, the impact would be easy to find:
The causal effect or treatment effect for individual i is defined as
yi1 − yi0
Would answer our question—but we never see both yi1 and yi0!
Some people call this the "fundamental problem of causal inference"
Intuition: there are two "potential" worlds out there
The treatment variable Di decides which one we see:
yi = yi1 if Di = 1
yi = yi0 if Di = 0
23/102
Topic 2 – Part 1: So What Do Differences in Conditional Means Tell You?
E[yi1|Di = 1] − E[yi0|Di = 0]
  = E[yi1|Di = 1] − E[yi0|Di = 1]   (Average Treatment Effect for the Treated Group)
  + E[yi0|Di = 1] − E[yi0|Di = 0]   (Selection Effect)
  ≠ E[yi1] − E[yi0]   (Average Treatment Effect)

So our estimate could be different from the average effect of treatment E[yi1] − E[yi0] for two reasons:
(1) The morning section might have given better reviews anyway:
  E[yi0|Di = 1] − E[yi0|Di = 0] > 0   (Selection Effect)
(2) Candy matters more in the morning:
  E[yi1|Di = 1] − E[yi0|Di = 1]   (ATT)   ≠   E[yi1] − E[yi0]   (ATE)
24/102
Topic 2 – Part 2: Causality and Regression
yi =β0+β1xi+vi
Regression coefficient captures causal effect (β1^OLS = β1) if:
E[vi|xi] = E[vi]
Fails any time corr(xi, vi) ≠ 0
An aside: we have used similar notation for 3 different things:
1. β1: the causal effect on yi of a 1 unit change in xi
2. β1^OLS = Cov(xi, yi) / Var(xi): the population regression coefficient
3. β̂1^OLS = Ĉov(xi, yi) / V̂ar(xi): the sample regression coefficient
25/102
Topic 2 – Part 2: Omitted Variables Bias
So if we have:
yi = β0 + β1 xi + vi
What will the regression of yi on xi give us?
Recall that the regression coefficient is β1^OLS = Cov(yi, xi) / Var(xi):
β1^OLS = Cov(yi, xi) / Var(xi)
        = β1 + Cov(vi, xi) / Var(xi)
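A small simulation, with hypothetical parameter values, of the omitted-variables formula above: the short regression of yi on xi recovers β1 plus Cov(vi, xi)/Var(xi).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
a = rng.normal(size=n)                    # unobserved "ability"
x = 0.5 * a + rng.normal(size=n)          # schooling, positively correlated with ability
beta1, gamma = 2.0, 1.0
y = 1.0 + beta1 * x + gamma * a + rng.normal(size=n)

v = y - (1.0 + beta1 * x)                 # composite error: gamma*a + e
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
bias = np.cov(x, v, ddof=1)[0, 1] / np.var(x, ddof=1)
print(slope)           # > beta1 here, since Cov(x, a) > 0 and gamma > 0
print(beta1 + bias)    # matches the short-regression slope
```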
26/102
Topic 2: Causality and Regression
1. The potential outcomes framework 2. Causal effects with OLS
27/102
Menti Break
The average coursework grade in the Morning class is 68
The average coursework grade in the Afternoon class is 75
Suppose we run the following regression:
Courseworki = β0 + β1 Afternooni + vi
What is the value of β0?
(a) 68
(b) 75
(c) 7
28/102
Topic 3: Panel Data and Diff-in-Diff
1. Panel Data, First Difference Regression and Fixed Effects
2. Difference-in-Difference
29/102
Topic 3: Panel Data and Diff-in-Diff
Panel data consists of observations of the same N units in T different periods
If the data contains variables x and y, we write them (xit, yit)
for i = 1, ···, N
i denotes the unit, e.g. Microsoft or Apple
and t = 1, ···, T
t denotes the time period, e.g. September or October
30/102
Topic 3 – Part 1: Panel Data and Omitted Variables
Let's reconsider our omitted variables problem: yit = β0 + β1 xit + γ ai + eit
Suppose we see xit and yit but not ai
Suppose Corr(xit, eit) = 0 but Corr(ai, xit) ≠ 0
Note that we are assuming ai doesn’t depend on t
31/102
Topic 3 – Part 1: First Difference Regression
yit = β0 + β1 xit + vit,  where vit = γ ai + eit
Suppose we see two time periods t = {1, 2} for each i
We can write our two time periods as:
yi,1 = β0 +β1xi,1 +γai +ei,1
yi,2 = β0 +β1xi,2 +γai +ei,2
Taking changes (differences) gets rid of fixed omitted variables
∆yi,2−1 = β1∆xi,2−1 +∆ei,2−1
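A minimal sketch, on simulated two-period data with hypothetical parameters, showing that the first-difference regression removes the fixed omitted variable while the levels regression stays biased.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta1, gamma = 5_000, 1.5, 2.0
a = rng.normal(size=n)                     # fixed omitted variable a_i
x1 = a + rng.normal(size=n)                # x correlated with a in both periods
x2 = a + rng.normal(size=n)
y1 = 1.0 + beta1 * x1 + gamma * a + rng.normal(size=n)
y2 = 1.0 + beta1 * x2 + gamma * a + rng.normal(size=n)

dy, dx = y2 - y1, x2 - x1
print(np.cov(dx, dy, ddof=1)[0, 1] / np.var(dx, ddof=1))   # ≈ beta1
print(np.cov(x1, y1, ddof=1)[0, 1] / np.var(x1, ddof=1))   # levels regression: biased upward
```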
32/102
Topic 3 – Part 1: Fixed Effects Regression
yit =β0+β1xit+γai+eit
An alternative approach:
Let's define δi = γ ai and rewrite:
yit = β0 + β1 xit + δi + eit
So yit is determined by:
(i) The baseline intercept β0
(ii) The effect of xit
(iii) An individual specific change in the intercept: δi
Intuition behind fixed effects: let's just estimate δi
33/102
Topic 3 – Part 1: Fixed Effects Regression – Implementation
yit = β0 + β1 xit + Σ_{j=1}^{N−1} δj Dji + eit
Note that we've left out DN
β0^OLS is interpreted as the intercept for individual N:
β0^OLS = E[yit | xit = 0, i = N]
and for all other i (e.g. i = 1):
δ1 = E[yit | xit = 0, i = 1] − β0
This should look familiar
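A sketch of the dummy-variable implementation on simulated panel data (all parameter values assumed for illustration): a dummy for every unit except one, plus a constant, recovers β1 even though ai is omitted and correlated with xit.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, beta1 = 200, 5, 1.5
a = rng.normal(size=N)
ids = np.repeat(np.arange(N), T)
x = a[ids] + rng.normal(size=N * T)
y = 1.0 + beta1 * x + 2.0 * a[ids] + rng.normal(size=N * T)

# Dummies for units 0..N-2; the last unit is the omitted category
D = (ids[:, None] == np.arange(N - 1)[None, :]).astype(float)
X = np.column_stack([np.ones(N * T), x, D])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
print(coefs[1])   # ≈ beta1; coefs[0] is the intercept for the omitted unit
```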
34/102
Topic 3 – Part 2: Difference-in-Difference
We are interested in the impact of some treatment on outcome Yi
Suppose we have a treated group and a control group
Let Di be a dummy equal to 1 if i belongs to the treatment group
And suppose we see both groups before and after the treatment occurs
Let Aftert be a dummy equal to 1 if time t is after the treatment date
Yit = β0 + β1 Di × Aftert + β2 Di + β3 Aftert + vit
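A minimal simulated example (hypothetical effect sizes) of the diff-in-diff regression above: the coefficient on Di × Aftert matches the difference of before/after differences across groups.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
treated = rng.integers(0, 2, n)        # D_i
after = rng.integers(0, 2, n)          # After_t
# common time trend (1.0), group level difference (3.0), true treatment effect (2.0)
y = 5.0 + 1.0 * after + 3.0 * treated + 2.0 * treated * after + rng.normal(size=n)

X = np.column_stack([np.ones(n), treated * after, treated, after])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1])   # ≈ 2.0, the coefficient on D_i x After_t

# Equivalent "difference of differences" of group means
did = ((y[(treated == 1) & (after == 1)].mean() - y[(treated == 1) & (after == 0)].mean())
       - (y[(treated == 0) & (after == 1)].mean() - y[(treated == 0) & (after == 0)].mean()))
print(did)
```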
35/102
Topic 3 – Part 2: Diff-in-Diff Graphically
36/102
Topic 3 – Part 2: When Does Diff-in-Diff Identify A Causal Effect
As usual, we need
E[vit|Di,Aftert] = E[vit]
What does this mean intuitively?
Parallel trends assumption: In the absence of any reform the
average change in leverage would have been the same in the treatment and control groups
In other words: trends in both groups are similar
37/102
Topic 3 – Part 2: Parallel Trends
Parallel trends does not require that there is no trend in leverage
Just that it is the same between groups
Does not require that the levels be the same in the two groups
What does it look like when the parallel trends assumption fails?
38/102
Topic 3 – Part 2: When Parallel Trends Fails
[Figure: leverage by month for the Treatment (Delaware) and Control (Non-Delaware) groups, before and after the treatment date, illustrating a failure of parallel trends]
39/102
Topic 3: Panel Data and Diff-in-Diff
1. Panel Data, First Difference Regression and Fixed Effects
2. Difference-in-Difference
40/102
Topic 4: Regularization
1. Basics of Ridge, LASSO and Elastic Net
2. How to choose hyperparameter λ: cross-validation
41/102
Topic 4 – Part 1: The Basics of Elastic Net
[Scatter plot: 100 simulated observations of the outcome variable, plotted against observation number; outcomes range roughly from −20 to 20]
42/102
Topic 4 – Part 1: How Well Can We Predict Out-of-Sample Outcomes (yi^oos)
[Scatter plot: the 100 out-of-sample outcomes together with their predictions, plotted against observation number]
43/102
Topic 4 – Part 1: A Good Model Has Small Distance (yi^oos − ŷi^oos)²
[Scatter plot: out-of-sample outcomes and model predictions plotted against observation number; a good model keeps the squared gaps small]
44/102
Topic 4 – Part 1: Solution to OLS drawbacks – Regularization
With 100 observations, OLS doesn't do very well
Solution: regularization
LASSO / RIDGE / Elastic Net
Simplest version of elastic net (nests LASSO and RIDGE):
β̂^elastic = arg min_β (1/N) Σ_{i=1}^N (yi − β0 − β1 x1i − ··· − βK xKi)² + λ [ α Σ_{k=1}^K |βk| + (1 − α) Σ_{k=1}^K βk² ]
For α = 1, is this LASSO or RIDGE?
45/102
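A short illustration of fitting LASSO and elastic net with scikit-learn on simulated data (50 regressors, only 5 of them relevant, values assumed). Note that sklearn's `alpha` plays the role of λ and `l1_ratio` the role of α on the slide above, and its objective differs from the slide's by scaling constants.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(7)
n, K = 100, 50
X = rng.normal(size=(n, K))
beta_true = np.zeros(K)
beta_true[:5] = [5, 4, 3, 2, 1]          # only 5 of 50 regressors matter
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.2).fit(X, y)                       # pure L1 penalty
enet = ElasticNet(alpha=0.2, l1_ratio=0.5).fit(X, y)     # mix of L1 and L2 penalties
print((lasso.coef_ != 0).sum())          # LASSO sets many coefficients exactly to zero
print((enet.coef_ != 0).sum())
```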
Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=0.2)
[Figure: estimated LASSO coefficients for each of the 50 X variables at λ = 0.2; estimates range roughly from −1 to 5]
46/102
Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=1)
[Figure: estimated LASSO coefficients for each of the 50 X variables at λ = 1]
47/102
Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=3)
[Figure: estimated LASSO coefficients for each of the 50 X variables at λ = 3]
48/102
Topic 4 – Part 1: LASSO Coefficients For All λ
[Figure: LASSO coefficient paths plotted against log λ; the numbers along the top (49, 49, 46, 35, 21, 10, 3) count the coefficients remaining non-zero as λ grows]
49/102
Topic 4 – Part 2: How to Choose λ – k-fold Cross Validation
Partition the sample into k equal folds
The default for R is k = 10
For our sample, this means 10 folds with 10 observations each
Cross-validation proceeds in several steps:
1. Choose k−1 folds (9 folds in our example, with 10 observations each)
2. Run LASSO on these 90 observations
3. Find β̂^lasso(λ) for all 100 λ
4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
5. Repeat for all 10 possible combinations of k−1 folds
This provides 10 estimates of MSE(λ) for each λ
Can construct means and standard deviations of MSE(λ) for each λ
Choose λ that gives small mean MSE(λ)
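A sketch of the same procedure using scikit-learn's `LassoCV` on simulated data (data-generating values assumed): it fits the LASSO path on each set of k−1 folds, computes MSE(λ) on the held-out fold, and reports the λ with the smallest mean cross-validated MSE.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
n, K = 100, 50
X = rng.normal(size=(n, K))
beta_true = np.zeros(K)
beta_true[:5] = [5, 4, 3, 2, 1]
y = X @ beta_true + rng.normal(size=n)

cv_lasso = LassoCV(cv=10, n_alphas=100).fit(X, y)   # 10-fold CV over a grid of 100 lambdas
print(cv_lasso.alpha_)              # lambda minimising the mean cross-validated MSE
print(cv_lasso.mse_path_.shape)     # (n_alphas, n_folds): one MSE per lambda per fold
```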
50/102
Topic 4 – Part 2: Cross-Validated Mean Squared Errors for all λ
[Figure: mean cross-validated MSE (with error bars) plotted against log(λ), MSE ranging roughly from 30 to 90; the top axis counts the non-zero coefficients, falling from 49 towards 1 as λ grows]
51/102
Topic 4: Regularization
1. Basics of Ridge, LASSO and Elastic Net
2. How to choose hyperparameter λ: cross-validation
52/102
Topic 5: Observed Factor Models
Suppose xt is a vector of asset returns, and B is a matrix of factor loadings
Which has higher dimension: (a) B
(b) Σx =Cov(xt)
53/102
Topic 5: Observed Factor Models
1. General Framing of Linear Factor Models
2. Single Index Model and the CAPM
3. Multi-Factor Models: Fama-French, Macroeconomic Factors
4. Barra approach
54/102
Topic 5 – Part 1: Linear Factor Models
Assume that returns xit are driven by K common factors: xi,t = αi +β1,if1,t +β2,if2,t +···+βK,ifK,t +εit
ft = (f1,t, f2,t, ···, fK,t)′ is the set of common factors
These are the same for all assets (constant over i)
But change over time (different for t, t+1)
Each ft has dimension (K × 1)
βi = (β1,i,β2,i,··· ,βK,i)′ is the set of factor loadings
K different parameters for each asset
But constant over time (same for all t)
Fixed, specific relationship between asset i and factor k
55/102
Topic 5 – Part 1: Linear Factor Model
xt =α+Bft+εt
Summary of Parameters
α: (m×1) intercepts for m assets
B: (m × K) loadings (βik) on K factors for m assets
μf: (K × 1) vector of means for K factors
Ωf: (K × K) variance-covariance matrix of factors
Ψ: (m × m) diagonal matrix of asset specific variances
Given our assumptions xt is m-variate covariance stationary with:
E[xt|ft] = α + B ft
Cov[xt|ft] = Ψ
E[xt] = μx = α + B μf
Cov[xt] = Σx = B Ωf B′ + Ψ
56/102
Topic 5 – Part 2: The Index Model: First Pass
xi =αi1T +Rmβi +εi
Estimate OLS regression on time-series version of our factor specification
One regression for each asset i
Recover two parameters α̂i and β̂i for each asset i
Ω̂f is just the sample variance of the observed factor
Estimate residuals:
ε̂i = xi − α̂i 1T − Rm β̂i
Use these to estimate asset specific variances (for each i):
σ̂i² = ε̂i′ ε̂i / (T − 2)
With these, can compute:
Σ̂x = B̂ Ω̂f B̂′ + Ψ̂
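A numpy sketch of the first-pass steps above on simulated returns (factor, betas and noise levels are all assumed values): one time-series regression per asset, residual variances, and the implied covariance matrix B̂ Ω̂f B̂′ + Ψ̂.

```python
import numpy as np

rng = np.random.default_rng(9)
T, m = 500, 4
r_m = rng.normal(0.01, 0.04, T)                        # observed factor (market excess return)
beta_true = np.array([0.5, 0.8, 1.0, 1.4])
X_assets = 0.002 + np.outer(r_m, beta_true) + rng.normal(0, 0.02, (T, m))   # T x m returns

Z = np.column_stack([np.ones(T), r_m])                 # [1_T, R_m]
coefs = np.linalg.lstsq(Z, X_assets, rcond=None)[0]    # 2 x m: alphas (row 0), betas (row 1)
alpha_hat, beta_hat = coefs[0], coefs[1]
resid = X_assets - Z @ coefs
sigma2_hat = (resid ** 2).sum(axis=0) / (T - 2)        # asset-specific variances

omega_f = r_m.var(ddof=1)                              # sample variance of the factor
Sigma_x = np.outer(beta_hat, beta_hat) * omega_f + np.diag(sigma2_hat)
print(beta_hat)
print(np.round(Sigma_x, 5))
```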
57/102
Topic 5 – Part 2: The Index Model/Testing CAPM: Second Pass
x̄i = γ0 + γ1 β̂i + γ2 σ̂i + ηi
CAPM tests: expected excess return should be determined only by systematic risk (β)
1. γ0 = 0, or the average α is 0
2. γ2 = 0 (idiosyncratic risk shouldn't be priced)
3. γ1 = R̄m
58/102
Topic 5 – Part 3: Fama-French Three Factor Model
Recall our general linear factor model:
xi,t = αi +β1,if1,t +β2,if2,t +···+βK,ifK,t +εit
Fama-French is just another version of this with three particular factors:
xi,t = αi +β1,if1,t +β2,if2,t +β3,if3,t +εit
The factors are:
1. f1,t = Rmt: proxy for excess market return—same as before
2. f2,t = SMBt: size factor
3. f3,t = HMLt: value factor
Can use same two-pass methodology to recover parameters
Should understand what the covariance matrix Σx looks like
59/102
Topic 5 – Part 3: Macroeconomic Factors
An alternative approach uses key macro variables as factors
For example, Chen, Roll, and Ross use:
IP: Growth rate in industrial production
EI: Changes in expected inflation
UI: Unexpected inflation
CG: Unexpected changes in risk premiums
GB: Unexpected changes in term premia
In this case, our general model becomes:
xi,t = αi + βR,i Rm,t + βIP,i IPt + βEI,i EIt + βUI,i UIt + βCG,i CGt + βGB,i GBt + εit
Can use two-pass procedure to estimate βˆs, evaluate the model
Like before, can use estimated βˆs, asset specific variances, and factor covariances to estimate asset covariance matrix
60/102
Topic 5 – Part 4: BARRA approach
x̃t = B ft + εt
Assume that B is known
This looks just like the standard OLS matrix notation
And we can estimate our ft like always:
f̂t^OLS = (B′B)^(−1) B′ x̃t
A bit weird conceptually because the role of the βs flips
But no technical difference…
Except heteroskedasticity
Estimate with GLS
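A minimal sketch of the Barra-style cross-sectional step on made-up data: with loadings B taken as known, estimate the factor realisation at date t by regressing the cross-section of returns on B, first by OLS and then by a GLS version weighted by (assumed) asset-specific variances.

```python
import numpy as np

rng = np.random.default_rng(10)
m, K = 100, 3
B = rng.normal(size=(m, K))                  # "known" loadings / characteristics
f_true = np.array([0.02, -0.01, 0.005])      # factor realisation at date t
sig2 = rng.uniform(0.01, 0.05, m) ** 2       # heteroskedastic asset-specific variances
x_t = B @ f_true + rng.normal(0, np.sqrt(sig2))   # cross-section of returns at date t

f_ols = np.linalg.solve(B.T @ B, B.T @ x_t)            # (B'B)^{-1} B' x_t
psi_inv = np.diag(1 / sig2)                            # GLS weights
f_gls = np.linalg.solve(B.T @ psi_inv @ B, B.T @ psi_inv @ x_t)
print(f_ols, f_gls)
```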
61/102
Topic 5: Observed Factor Models
1. General Framing of Linear Factor Models
2. Single Index Model and the CAPM
3. Multi-Factor Models: Fama-French, Macroeconomic Factors
4. Barra approach
62/102
Topic 6: Principal Components Analysis
Suppose the covariance matrix of two asset returns is given by:
[ 1 0
  0 3 ]
What fraction of the total variance in asset returns will be explained by the first principal component?
(a) 1/3
(b) 3/4
(c) 1/4
63/102
Topic 6: Principal Components Analysis
1. Basics of Eigenvectors and Eigenvalues
2. PCA
64/102
Topic 6 – Part 1: Basics of Eigenvalues and Eigenvectors
Consider the square n×n matrix A.
An eigenvalue λi of A is a (1×1) scalar:
The corresponding eigenvector v⃗i is an (n × 1) vector
where λi, v⃗i satisfy:
A⃗vi =λi⃗vi
⃗vi are the special vectors that A only stretches
λi is the stretching factor
Won't ask you to compute these without giving you the formulas
Except maybe in the diagonal case…
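A quick numerical check of A v⃗ = λ v⃗ with numpy, using the 2×2 example from the next slide; the diagonal case at the end shows the eigenvalues are just the diagonal entries.

```python
import numpy as np

A = np.array([[5.0, 0.0],
              [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues are 5 and 3 (order not guaranteed)
print(eigvals)

v1 = eigvecs[:, 0]
print(A @ v1, eigvals[0] * v1)        # A only stretches v1 by the factor lambda_1

# Diagonal case: eigenvalues are the diagonal entries themselves
print(np.linalg.eig(np.diag([3.0, 1.0]))[0])
```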
65/102
Topic 6 – Part 1: For Some Vectors v⃗, Matrix A Only Stretches
Let's say
A = [ 5 0
      2 3 ]  and  v⃗1 = (1, 1)′  ⇒  A v⃗1 = (5, 5)′ = 5 v⃗1
[Figure: the vectors v1 = (1 1)′ and A v1 = (5 5)′ plotted in the plane]
66/102
Topic 6 – Part 1: Eigenvalues of Σx with Uncorrelated Data
Σx = [ 3 0
       0 1 ]
What are the eigenvalues and eigenvectors of Σx?
With uncorrelated assets eigenvalues are just the variances of each asset return!
Eigenvectors:
v1 = (1, 0)′,  v2 = (0, 1)′
Note that the first eigenvector points in the direction of the largest variance
We sometimes write the eigenvectors together as a matrix:
Γ = (v1 v2) = [ 1 0
                0 1 ]
67/102
Topic 6 – Part 1: Eigenvectors are Γ = (v1 v2) = (1/√2) [ 1 −1
                                                          1  1 ]
[Figure: scatter of (xa, xb) with Cov(x) = Σx = [2 1; 1 2], both axes from −10 to 10; arrows mark the eigenvector directions v1 = (1 1)′ and v2 = (−1 1)′]
67/102
Topic 6 – Part 1: Eigenvectors of Σx with Correlated Data
Σx = [ 2 1
       1 2 ]
Γ = (v1 v2) = (1/√2) [ 1 −1
                       1  1 ]
Just as with uncorrelated data, first eigenvector finds the direction with the most variability
Second eigenvector points in the direction that explains the maximum amount of the remaining variance
Note that the two are perpendicular (because Σx is symmetric)
This is the geometric implication of the fact that they are orthogonal:
vi′vj =0
The fact that they are orthogonal also implies:
Γ′ = Γ−1
68/102
Topic 6 – Part 2: Principal Components Analysis
xt is (m × 1), with E[xt] = α and Cov(xt) = Σx
Define the principal components variables as: p = Γ′(xt − α), also (m × 1)
Γ is the ordered matrix of eigenvectors
The proportion of the total variance of xt that is explained by eigenvalue λi is simply:
λi / Σ_{j=1}^m λj
69/102
Topic 6 – Part 2: Principal Components Analysis
Our principal components variables provide a transformation of the data into variables that are:
Uncorrelated (orthogonal)
Ordered by how much of the total variance they explain (size of eigenvalue)
What if we have many m, but the first few (2, 5, 20) principal components explain most of the variation?
Idea: Use these as "factors"
Dimension reduction!
70/102
Topic 6 – Part 2: Principal Components Analysis
Note that because Γ′ = Γ^(−1):
xt = α + Γ p
We can also partition Γ into the first K < m eigenvectors and the remaining m − K:
Γ = [Γ1 Γ2]
Partition p into its first K elements and the remaining m − K:
p = ( p1
      p2 )
We can then write:
xt = α + Γ1 p1 + Γ2 p2
This looks just like a factor model:
xt = α + B ft + εt
But
Cov(εt) = Ψ = Γ2 Λ2 Γ2′
71/102
Topic 6 - Part 2: Implementing Principal Components Analysis
xt = α + Γ1 p1 + Γ2 p2
Recall the sample covariance matrix:
Σ̂x = (1/T) X̃ X̃′
Calculate this, and perform the eigendecomposition (using a computer):
Σ̂x = Γ Λ Γ′
We now have everything we need to compute the sample principal components at each t:
P = [p1 p2 ··· pT] = Γ′ X̃,  which is (m × T)
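A numpy sketch of these implementation steps on simulated returns driven by one common component (all data-generating values assumed): demean, form the sample covariance matrix, eigendecompose, and compute the principal components series.

```python
import numpy as np

rng = np.random.default_rng(11)
T, m = 1_000, 5
common = rng.normal(size=T)
X = np.column_stack([common * (i + 1) + rng.normal(size=T) for i in range(m)])  # T x m returns

X_tilde = (X - X.mean(axis=0)).T                 # m x T demeaned data
Sigma_hat = X_tilde @ X_tilde.T / T              # sample covariance matrix
lam, Gamma = np.linalg.eigh(Sigma_hat)           # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(lam)[::-1]                    # reorder from largest to smallest
lam, Gamma = lam[order], Gamma[:, order]

P = Gamma.T @ X_tilde                            # m x T principal components series
print(lam / lam.sum())                           # share of total variance per component
print(np.round(np.cov(P), 3))                    # components are (nearly) uncorrelated
```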
72/102
Topic 6 - Part 2: xˆt : Predicted Yields from First Two Components
[Figure: the nine constant-maturity Treasury yield series (CMT 3-month, 6-month, 1-, 2-, 3-, 5-, 7-, 10- and 20-year), with yields between roughly 0 and 7, as predicted from the first two principal components over the sample period]
73/102
Topic 6: Principal Components Analysis
1. Basics of Eigenvectors and Eigenvalues
2. PCA
74/102
Topic 7: Limited Dependent Variables
True or false: It is never ok to use an OLS regression when the outcome variable is binary
(a) True (b) False
75/102
Topic 7: Limited Dependent Variables
1. Binary Dependent Variables
2. Censoring and Truncation
76/102
Topic 7 - Part 1: Linear Probability Models
P(Yi = 1|Xi) = β0 +β1xi
The linear probability model has a bunch of advantages:
1. Just OLS with a binary Yi—estimation of β̂OLS is the same
2. Simple interpretation of βOLS
3. Can use all the techniques we've seen: IV, difference-in-differences, etc.
Because of this simplicity, lots of applied research just uses linear probability models
But a few downsides...
Predicted probabilities above one
Constant effects of xi
Topic 7 - Part 1: Two Common Alternatives to Linear Probability Models
P(yi = 1|xi) = G(β0 + β1 xi)
In practice, mostly use two choices of G(·):
1. Probit: the standard normal CDF
G(z) = Φ(z) = ∫_{−∞}^{z} φ(v) dv,  where φ(v) = (2π)^(−1/2) exp(−v²/2)
2. Logit: the logistic function
G(z) = Λ(z) = exp(z) / (1 + exp(z))
78/102
Topic 7 - Part 1: Probit
[Figure: fitted probit probability of passing, plotted against Assignment 1 score (0 to 100)]
79/102
Topic 7 - Part 1: Why Does the Probit Approach Make Sense?
You should be able to derive the probit from a latent variable model:
yi* = β0 + β1 xi + vi
yi = 1 if yi* ≥ 0
yi = 0 if yi* < 0
P(yi = 1) = P(yi* > 0)
Probit approach: assume vi is distributed standard normal
vi | xi ∼ N(0, 1)
P(yi* > 0 | xi) = P(β0 + β1 xi + vi > 0 | xi)
               = P(vi > −(β0 + β1 xi) | xi)
               = 1 − Φ(−(β0 + β1 xi))
               = Φ(β0 + β1 xi)
80/102
Topic 7 – Part 1: The Effect of a Change in xi for the Probit
P(yi = 1|xi) = Φ(β0 + β1 xi)
Taking derivatives:
∂P(yi = 1|xi)/∂xi = β1 Φ′(β0 + β1 xi)
The derivative of the standard normal CDF is just the PDF: Φ′(z) = φ(z), so
∂P(yi = 1|xi)/∂xi = β1 φ(β0 + β1 xi)
Obviously should be able to do this for more complicated functions
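A small scipy sketch of the probit marginal effect at a particular x, using illustrative made-up coefficients (β0 = −2, β1 = 0.05), checked against a numerical derivative.

```python
import numpy as np
from scipy.stats import norm

beta0, beta1 = -2.0, 0.05    # hypothetical probit coefficients

def prob_pass(x):
    return norm.cdf(beta0 + beta1 * x)

def marginal_effect(x):
    # dP(y=1|x)/dx = beta1 * phi(beta0 + beta1 * x)
    return beta1 * norm.pdf(beta0 + beta1 * x)

x0 = 50.0
print(prob_pass(x0), marginal_effect(x0))

eps = 1e-5   # numerical derivative as a sanity check
print((prob_pass(x0 + eps) - prob_pass(x0 - eps)) / (2 * eps))
```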
81/102
Topic 7 – Part 1: Deriving the Log-likelihood
Given data Y, define the likelihood function:
L(β0, β1) = f(Y|X; β0, β1)
          = Π_{i=1}^n [Φ(β0 + β1 xi)]^{yi} [1 − Φ(β0 + β1 xi)]^{(1−yi)}
Take the log-likelihood:
l(β0, β1) = log(L(β0, β1))
          = log(f(Y|X; β0, β1))
          = Σ_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))
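A sketch that codes up this log-likelihood and maximises it numerically on simulated data (true values β0 = −1, β1 = 2 are assumptions for the example).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(12)
n = 5_000
x = rng.normal(size=n)
y = (-1.0 + 2.0 * x + rng.normal(size=n) > 0).astype(float)   # latent-variable DGP

def neg_loglik(params):
    b0, b1 = params
    p = norm.cdf(b0 + b1 * x)
    p = np.clip(p, 1e-10, 1 - 1e-10)     # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)   # ≈ [-1, 2]
```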
82/102
Topic 7 – Part 2: Censoring and Truncation
An extremely common data issue is censoring:
We only observe Yi if it is below (or above) some threshold
We see Xi either way
Example: Income is often top-coded
That is, we might only see whether income is > £100,000
Formally, we might be interested in Yi , but see:
Wi =min(Yi,ci) where ci is a censoring value
Similar to censoring is truncation
We don’t observe anything if Yi is above some threshold
e.g.: we only have data for those with incomes below £100,000
83/102
Topic 7 – Part 2: Censored Regression
yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi | xi, ci ∼ N(0, σ²)
So in general:
f(yi = y | xi, ci) = 1{y ≥ ci} [1 − Φ((ci − β0 − β1 xi)/σ)] + 1{y < ci} (1/σ) φ((y − β0 − β1 xi)/σ)
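A sketch of the censored-regression (Tobit-style) likelihood implied by the density above, maximised on simulated data with censoring from above at a common threshold c; the threshold and true parameters (β0 = 1, β1 = 2, σ = 1) are assumptions for the example.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(13)
n, c = 5_000, 3.0
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(size=n)
y = np.minimum(y_star, c)                     # we only observe the censored value

def neg_loglik(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                 # keep sigma positive
    z = (y - b0 - b1 * x) / sigma
    censored = y >= c
    ll_unc = norm.logpdf(z[~censored]) - np.log(sigma)        # density part for y < c
    ll_cen = norm.logsf((c - b0 - b1 * x[censored]) / sigma)   # P(y* >= c) for censored obs
    return -(ll_unc.sum() + ll_cen.sum())

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(res.x[0], res.x[1], np.exp(res.x[2]))   # ≈ 1, 2, 1
```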