
Review
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
March 8-9, 2021
1/102

Topic 1: OLS and the Conditional Expectation Function
- Consider random variable yi and (variables) Xi
- Which of the following is false?
  (a) Xi'β^OLS provides the best predictor of yi out of any function of Xi
  (b) Xi'β^OLS is the best linear approximation of E[yi|Xi]
  (c) yi − E[yi|Xi] is uncorrelated with Xi
2/102

Topic 1: OLS and the Conditional Expectation Function
1. A review of the conditional expectation function and its properties
2. The relationship between OLS and the CEF
3/102

Topic 1 – Part 1: The Conditional Expectation Function (CEF)
- We are often interested in the relationship between some outcome yi and a variable (or set of variables) Xi
- A useful summary is the conditional expectation function: E[yi|Xi]
- Gives the mean of yi when Xi takes any particular value
- Formally, if fy(·|Xi) is the conditional p.d.f. of yi|Xi:

  E[yi|Xi] = ∫ z fy(z|Xi) dz

- E[yi|Xi] is a random variable itself: a function of the random Xi
- Can think of it as E[yi|Xi] = h(Xi)
- Alternatively, evaluate it at particular values: for example Xi = 0.5
- E[yi|Xi = 0.5] is just a number!
4/102

Topic 1 – Part 1: The Conditional Expectation Function: E[Y|X]
[Figure: E[Height|Age] marked at ages 5, 10, ..., 40; Height (inches) on the vertical axis, Age on the horizontal axis]
5/102

Topic 1 – Part 1: Three Useful Properties of E[Y|X]
(i) The law of iterated expectations (LIE): E[E[yi|Xi]] = E[yi]
(ii) The CEF Decomposition Property:
- Any random variable yi can be broken down into two pieces

  yi = E[yi|Xi] + εi

- Where the residual εi has the following properties:
  (a) E[εi|Xi] = 0 ("mean independence")
  (b) εi uncorrelated with any function of Xi

(iii) Out of any function of Xi, E[yi|Xi] is the best predictor of yi:

  E[yi|Xi] = arg min_{m(Xi)} E[(yi − m(Xi))²]
6/102

Topic 1 – Part 1 summary: Why We Care About Conditional Expectation Functions
- Useful tool for describing the relationship between yi and Xi
- Several nice properties
- Most statistical tests come down to comparing E[yi|Xi] at certain values of Xi
- Classic example: experiments
7/102

Topic 1 – Part 2: Ordinary Least Squares
- Linear regression is arguably the most popular modeling approach across every field in the social sciences
- Transparent, robust, relatively easy to understand
- Provides a basis for more advanced empirical methods
- Extremely useful when summarizing data
- Plenty of focus on the technical aspects of OLS last term
- Focus here on an applied perspective
8/102

Topic 1 – Part 2: OLS Estimator Fits a Line Through the Data
[Figure: scatter plot of Y against X with the fitted OLS regression line]
9/102

Topic 1 – Part 2: Choosing the (Population) Regression Line
  yi = β0 + β1 xi + vi

- An OLS regression is simply choosing the β0^OLS, β1^OLS that make vi as "small" as possible on average
- How do we define "small"?
- Want to treat positive/negative the same: consider vi²
- Choose β0^OLS, β1^OLS to minimize:

  E[vi²] = E[(yi − β0 − β1 xi)²]
10/102

Topic 1 – Part 2: Regression and The CEF
- Given yi and Xi, the population regression coefficient is:

  β^OLS = E[Xi Xi']⁻¹ E[Xi yi]

- A useful time to note: you should remember the OLS estimator:

  β̂^OLS = (X'X)⁻¹ X'Y

- With just one xi:

  β̂^OLS = Cov(xi, yi) / Var(xi)   (sample covariance over sample variance)
11/102
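[Illustrative sketch, not from the slides: a minimal numpy example with simulated data (the data-generating values are made up) showing that the matrix formula and the covariance/variance ratio give the same slope.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)            # "true" slope is 2

# Matrix formula: beta_hat = (X'X)^{-1} X'Y, with a constant in the first column
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Single-regressor formula: slope = sample Cov(x, y) / sample Var(x)
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(beta_hat)     # [intercept, slope]
print(beta1_hat)    # identical to beta_hat[1]
```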

Topic 1 – Part 2: Regression and the Conditional Expectation Function
- Why is linear regression so popular?
- Simplest way to estimate (or approximate) conditional expectations!
- Three simple results:
  - OLS perfectly captures the CEF if the CEF is linear
  - OLS generates the best linear approximation to the CEF if not
  - OLS perfectly captures the CEF with binary (dummy) regressors
12/102

Topic 1 – Part 2: Regression captures CEF if CEF is Linear
- Take the special case of a linear conditional expectation function:

  E[yi|Xi] = Xi'β

- Then OLS captures E[yi|Xi]
[Figure: scatter plot of Y against X; with a linear CEF the fitted OLS line captures it exactly]
13/102

Topic 1 – Part 2: OLS Provides Best Linear Approximation to CEF
[Figure: scatter plot of Y_nl against X with the OLS line as the best linear approximation to a nonlinear CEF]
14/102

Topic 1 – Part 2: Implementing Regressions with Categorical Variables

- What if we are interested in comparing all 11 GICS sectors?
- Create dummy variables for each sector, omitting one
- Let's call them D1i, ···, D10i

  pricei = β0 + δ1 D1i + ··· + δ10 D10i + vi

- Regress pricei on a constant and those 10 dummy variables
- In other words Xi = [1 D1i ··· D10i]' or

  X = | 1 0 ··· 1 0 |
      | 1 1 ··· 0 0 |
      | 1 0 ··· 0 0 |
      | 1 0 ··· 1 0 |
      | 1 0 ··· 0 1 |
      | . .  .  . . |
      | 1 1 ··· 0 0 |
15/102

Topic 1 – Part 2: Average Share Price by Sector for Some S&P Stocks
16/102

Topic 1 – Part 2: Implementing Regressions with Dummy Variables
- β̂0^OLS (the coefficient on the constant) is the mean for the omitted category:
  - In this case "Consumer Discretionary"
- The coefficient on each dummy variable (e.g. δ̂k^OLS) is the difference between β̂0^OLS and the conditional mean for that category
- Key point: If you are only interested in categorical variables...
- You can perfectly capture the full CEF in a single regression
- For example:

  E[pricei | sectori = consumer staples] = β0^OLS + δ1^OLS
  E[pricei | sectori = energy] = β0^OLS + δ2^OLS
  ...
17/102
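[Illustrative sketch, not from the slides: made-up sector labels and prices (not the S&P data shown above). It checks that a constant plus dummies reproduces the group means exactly.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sectors = rng.choice(["Consumer Discretionary", "Energy", "Financials"], size=300)
price = 50 + 10 * (sectors == "Energy") + 25 * (sectors == "Financials") + rng.normal(0, 5, 300)
df = pd.DataFrame({"price": price, "sector": sectors})

# Dummies for every sector except the omitted category ("Consumer Discretionary")
D = pd.get_dummies(df["sector"], drop_first=True).astype(float)
X = np.column_stack([np.ones(len(df)), D.to_numpy()])
coef = np.linalg.lstsq(X, df["price"].to_numpy(), rcond=None)[0]

print(dict(zip(["const"] + list(D.columns), coef.round(2))))
print(df.groupby("sector")["price"].mean().round(2))
# const = mean price in the omitted sector; const + delta_k = mean price in sector k
```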

Topic 1 – Part 2 Summary: Regression and the Conditional Expectation Function
- Why is linear regression so popular?
- Simplest way to estimate (or approximate) conditional expectations!
- Three simple results:
  - OLS perfectly captures the CEF if the CEF is linear
  - OLS generates the best linear approximation to the CEF if not
  - OLS perfectly captures the CEF with binary (dummy) regressors
18/102

Topic 1: OLS and the Conditional Expectation Function
1. A review of the conditional expectation function and its properties
2. The relationship between OLS and the CEF
19/102

Topic 2: Causality and Regression
1. The potential outcomes framework
2. Causal effects with OLS
20/102

Topic 2: Causality and Regression
- Suppose wages (yi) are determined by:

  yi = β0 + β1 xi + γ ai + ei

- and we see years of schooling (xi) but not ability (ai), with Corr(xi, ai) > 0 and Corr(yi, ai) > 0
- We estimate:

  yi = β0 + β1 xi + vi

- And recover

  β1^OLS = β1 + γ δ1^OLS,   where γ δ1^OLS is the bias term

- Is our estimated β1^OLS larger or smaller than β1?
21/102

Topic 2 – Part 1: The Potential Outcomes Framework
- Ideally, how would we find the impact of candy on evaluations (yi)?
- Imagine we had access to two parallel universes and could observe
  - The exact same student (i)
  - At the exact same time
  - In one universe they receive candy—in the other they do not
- And suppose we could see the student's evaluations in both worlds
- Define the variables we would like to see: for each individual i:

  yi1 = evaluation with candy
  yi0 = evaluation without candy
22/102

Topic 2 – Part 1: The Potential Outcomes Framework
- If we could see both yi1 and yi0, the impact would be easy to find:
- The causal effect or treatment effect for individual i is defined as

  yi1 − yi0

- Would answer our question—but we never see both yi1 and yi0!
- Some people call this the "fundamental problem of causal inference"
- Intuition: there are two "potential" worlds out there
- The treatment variable Di decides which one we see:

  yi = yi1 if Di = 1
  yi = yi0 if Di = 0
23/102

Topic 2 – Part 1: So What Do Differences in Conditional Means Tell You?
  E[yi1|Di = 1] − E[yi0|Di = 0]
      = E[yi1|Di = 1] − E[yi0|Di = 1]     (Average Treatment Effect for the Treated Group)
      + E[yi0|Di = 1] − E[yi0|Di = 0]     (Selection Effect)

  and in general this is ≠ E[yi1] − E[yi0]     (Average Treatment Effect)

- So our estimate could be different from the average effect of treatment E[yi1] − E[yi0] for two reasons:
  (1) The morning section might have given better reviews anyway:

      E[yi0|Di = 1] − E[yi0|Di = 0] > 0     (Selection Effect)

  (2) Candy matters more in the morning:

      E[yi1|Di = 1] − E[yi0|Di = 1]  (Average Treatment Effect for the Treated Group)  ≠  E[yi1] − E[yi0]  (Average Treatment Effect)
24/102

Topic 2 – Part 2: Causality and Regression
yi =β0+β1xi+vi
- The regression coefficient captures the causal effect (β1^OLS = β1) if:

  E[vi|xi] = E[vi]

- Fails any time Corr(xi, vi) ≠ 0
- An aside: we have used similar notation for 3 different things:
  1. β1: the causal effect on yi of a 1 unit change in xi
  2. β1^OLS = Cov(xi, yi)/Var(xi): the population regression coefficient
  3. β̂1^OLS = Ĉov(xi, yi)/V̂ar(xi): the sample regression coefficient (sample covariance over sample variance)
25/102

Topic 2 – Part 2: Omitted Variables Bias
- So if we have:

  yi = β0 + β1 xi + vi

- What will the regression of yi on xi give us?
- Recall that the regression coefficient is β1^OLS = Cov(yi, xi)/Var(xi):

  β1^OLS = Cov(yi, xi)/Var(xi) = β1 + Cov(vi, xi)/Var(xi)
26/102
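[Illustrative sketch, not from the slides: a small simulation of the bias formula above; the coefficient values (β1 = 0.5, γ = 0.7) are arbitrary choices for illustration.]

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
a = rng.normal(size=n)                        # unobserved ability
x = 0.8 * a + rng.normal(size=n)              # schooling, correlated with ability
y = 1.0 + 0.5 * x + 0.7 * a + rng.normal(size=n)   # true beta1 = 0.5, gamma = 0.7

# "Short" regression of y on x alone: slope = Cov(y, x) / Var(x)
beta1_short = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

# Bias term from the slide: Cov(v, x) / Var(x), with v the omitted part gamma*a + e
v = y - 1.0 - 0.5 * x
print(beta1_short)                                   # > 0.5: biased upward
print(0.5 + np.cov(v, x)[0, 1] / np.var(x, ddof=1))  # matches beta1_short
```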

Topic 2: Causality and Regression
1. The potential outcomes framework
2. Causal effects with OLS
27/102

Menti Break
- The average coursework grade in the Morning class is 68
- The average coursework grade in the Afternoon class is 75
- Suppose we run the following regression:

  Courseworki = β0 + β1 Afternooni + vi

- What is the value of β0?
(a) 68 (b) 75 (c) 7
28/102

Topic 3: Panel Data and Diff-in-Diff
1. Panel Data, First Difference Regression and Fixed Effects
2. Difference-in-Difference
29/102

Topic 3: Panel Data and Diff-in-Diff
- Panel data consists of observations of the same n units in T different periods
- If the data contains variables x and y, we write them (xit, yit)
- for i = 1, ···, N
  - i denotes the unit, e.g. Microsoft or Apple
- and t = 1, ···, T
  - t denotes the time period, e.g. September or October
30/102

Topic 3 – Part 1: Panel Data and Omitted Variables
- Let's reconsider our omitted variables problem:

  yit = β0 + β1 xit + γ ai + eit

- Suppose we see xit and yit but not ai
- Suppose Corr(xit, eit) = 0 but Corr(ai, xit) ≠ 0
- Note that we are assuming ai doesn't depend on t
31/102

Topic 3 – Part 1: First Difference Regression
  yit = β0 + β1 xit + vit,   where vit = γ ai + eit

- Suppose we see two time periods t = {1, 2} for each i
- We can write our two time periods as:

  yi,1 = β0 + β1 xi,1 + γ ai + ei,1
  yi,2 = β0 + β1 xi,2 + γ ai + ei,2

- Taking changes (differences) gets rid of fixed omitted variables:

  Δyi,2−1 = β1 Δxi,2−1 + Δei,2−1
32/102
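[Illustrative sketch, not from the slides: simulated two-period data with parameter values assumed for illustration, showing that first-differencing removes the bias from the fixed omitted variable.]

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
a = rng.normal(size=n)                        # unit effect, constant over t
x1 = a + rng.normal(size=n)                   # x correlated with a in both periods
x2 = a + rng.normal(size=n)
y1 = 2.0 + 1.5 * x1 + 3.0 * a + rng.normal(size=n)   # true beta1 = 1.5
y2 = 2.0 + 1.5 * x2 + 3.0 * a + rng.normal(size=n)

# Pooled OLS in levels is biased because a is omitted
b_levels = np.cov(np.r_[y1, y2], np.r_[x1, x2])[0, 1] / np.var(np.r_[x1, x2], ddof=1)

# First-difference regression: dy = beta1 * dx + de (a drops out)
dy, dx = y2 - y1, x2 - x1
b_fd = np.cov(dy, dx)[0, 1] / np.var(dx, ddof=1)

print(b_levels)   # noticeably above 1.5
print(b_fd)       # close to 1.5
```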

Topic 3 – Part 1: Fixed Effects Regression
yit =β0+β1xit+γai+eit
- An alternative approach:
- Let's define δi = γ ai and rewrite:

  yit = β0 + β1 xit + δi + eit

- So yit is determined by:
  (i) The baseline intercept β0
  (ii) The effect of xit
  (iii) An individual-specific change in the intercept: δi
- Intuition behind fixed effects: let's just estimate the δi
33/102

Topic 3 – Part 1: Fixed Effects Regression – Implementation
  yit = β0 + β1 xit + Σ_{i=1}^{N−1} δi Di + eit

- Note that we've left out DN
- β0^OLS is interpreted as the intercept for individual N:

  β0^OLS = E[yit | xit = 0, i = N]

- and for all other i (e.g. i = 1):

  δ1 = E[yit | xit = 0, i = 1] − β0

- This should look familiar
34/102
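[Illustrative sketch, not from the slides: the dummy-variable (least-squares dummy variable) implementation on simulated panel data. Note the omitted category here is individual 0 rather than N, which only changes which intercept is the baseline.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
N, T = 200, 5
i = np.repeat(np.arange(N), T)
a = rng.normal(size=N)[i]                         # individual effect a_i, repeated over t
x = a + rng.normal(size=N * T)
y = 1.0 + 2.0 * x + 1.5 * a + rng.normal(size=N * T)   # true beta1 = 2

# Constant, x, and N-1 individual dummies (individual 0 omitted)
D = pd.get_dummies(pd.Series(i), drop_first=True).to_numpy(dtype=float)
X = np.column_stack([np.ones(N * T), x, D])
coef = np.linalg.lstsq(X, y, rcond=None)[0]

print(coef[1])    # close to 2 once the delta_i soak up the a_i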

Topic 3 – Part 2: Difference-in-Difference
- We are interested in the impact of some treatment on outcome Yit
- Suppose we have a treated group and a control group
- Let Di be a dummy equal to 1 if i belongs to the treatment group
- And suppose we see both groups before and after the treatment occurs
- Let Aftert be a dummy equal to 1 if time t is after the treatment date

  Yit = β0 + β1 Di × Aftert + β2 Di + β3 Aftert + vit
35/102
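[Illustrative sketch, not from the slides: the regression above on simulated data where parallel trends holds by construction; all parameter values are made up. The coefficient on Di × Aftert matches the 2×2 difference of group-period means.]

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
D = rng.integers(0, 2, n)          # 1 = treated group
After = rng.integers(0, 2, n)      # 1 = post-treatment period
# DGP: group gap 1.0, common time trend 0.5, treatment effect 2.0
y = 3.0 + 2.0 * D * After + 1.0 * D + 0.5 * After + rng.normal(size=n)

X = np.column_stack([np.ones(n), D * After, D, After])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[1])     # diff-in-diff estimate of the treatment effect, ~2.0

# Same number from the 2x2 table of group-period means
means = {(d, a): y[(D == d) & (After == a)].mean() for d in (0, 1) for a in (0, 1)}
print((means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0]))
```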

Topic 3 – Part 2: Diff-in-Diff Graphically
36/102

Topic 3 – Part 2: When Does Diff-in-Diff Identify A Causal Effect
- As usual, we need

  E[vit|Di, Aftert] = E[vit]

- What does this mean intuitively?
- Parallel trends assumption: In the absence of any reform, the average change in leverage would have been the same in the treatment and control groups
- In other words: trends in both groups are similar
37/102

Topic 3 – Part 2: Parallel Trends
- Parallel trends does not require that there is no trend in leverage
- Just that it is the same between groups
- Does not require that the levels be the same in the two groups
- What does it look like when the parallel trends assumption fails?
38/102

Topic 3 – Part 2: When Parallel Trends Fails
[Figure: leverage over time (Before/After) for the Treatment (Delaware) and Control (Non-Delaware) groups; the groups trend differently, so parallel trends fails]
39/102

Topic 3: Panel Data and Diff-in-Diff
1. Panel Data, First Difference Regression and Fixed Effects
2. Difference-in-Difference
40/102

Topic 4: Regularization
1. Basics of Ridge, LASSO and Elastic Net
2. How to choose hyperparameter λ: cross-validation
41/102

Topic 4 – Part 1: The Basics of Elastic Net
[Figure: outcome values for 100 observations plotted against observation number]
42/102

Topic 4 – Part 1: How Well Can We Predict Out-of-Sample Outcomes (yi^oos)?

[Figure: out-of-sample outcomes and predictions plotted against observation number]
43/102

Topic 4 – Part 1: A Good Model Has Small Distance (yi^oos − ŷi^oos)²

[Figure: out-of-sample outcomes and predictions plotted against observation number]
44/102

Topic 4 – Part 1: Solution to OLS drawbacks – Regularization
􏰒 With 100 Observations OLS Doesn’t do Very Well 􏰒 Solution: regularization
􏰒 LASSO/RIDGE/Elastic Net
􏰒 Simplest version of elastic net (nests LASSO and RIDGE):
β Ni=1 k=1 k=1 􏰒 For α =1 is this LASSO or RIDGE?
ˆelastic 􏰦1N 2 􏰋K K2􏰌􏰧 β =argmin ∑(yi−β0−β1x1i···−βKxKi) +λ α ∑|βk|+(1−α)∑βk
45/102
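[Illustrative sketch, not from the slides, using scikit-learn (an assumed tool choice) on made-up sparse data. Note that sklearn's penalty scaling differs slightly from the formula above, and its l1_ratio plays the role of α.]

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(6)
n, K = 100, 50
X = rng.normal(size=(n, K))
beta_true = np.r_[np.array([5.0, 4.0, 3.0, 2.0, 1.0]), np.zeros(K - 5)]   # sparse truth
y = X @ beta_true + rng.normal(size=n)

# l1_ratio = 1 corresponds to LASSO (pure L1), l1_ratio = 0 to ridge (pure L2)
lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=0.2).fit(X, y)
enet = ElasticNet(alpha=0.2, l1_ratio=0.5).fit(X, y)

print((lasso.coef_ == 0).sum())   # LASSO sets many coefficients exactly to zero
print((ridge.coef_ == 0).sum())   # ridge shrinks but almost never hits zero
print(enet.coef_[:6].round(2))
```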

Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=0.2)
[Figure: LASSO coefficient estimates for the 50 X variables at λ = 0.2]
46/102

Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=1)
[Figure: LASSO coefficient estimates for the 50 X variables at λ = 1]
47/102

Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=3)
[Figure: LASSO coefficient estimates for the 50 X variables at λ = 3]
48/102

Topic 4 – Part 1: LASSO Coefficients For All λ
[Figure: LASSO coefficient paths plotted against log Lambda; the numbers along the top axis give the number of non-zero coefficients at each λ]
49/102

Topic 4 – Part 2: How to Choose λ – k-fold Cross Validation
- Partition the sample into k equal folds
- The default for R is k = 10
- For our sample, this means 10 folds with 10 observations each
- Cross-validation proceeds in several steps:
  1. Choose k−1 folds (9 folds in our example, with 10 observations each)
  2. Run LASSO on these 90 observations
  3. Find β̂^lasso(λ) for all 100 λ
  4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
  5. Repeat for all 10 possible combinations of k−1 folds
- This provides 10 estimates of MSE(λ) for each λ
- Can construct means and standard deviations of MSE(λ) for each λ
- Choose λ that gives a small mean MSE(λ)
50/102
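[Illustrative sketch, not from the slides: the same procedure via scikit-learn's LassoCV (an assumed implementation choice; the slide describes the generic algorithm) on made-up data.]

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, K = 100, 50
X = rng.normal(size=(n, K))
beta_true = np.r_[np.array([5.0, 4.0, 3.0, 2.0, 1.0]), np.zeros(K - 5)]
y = X @ beta_true + rng.normal(size=n)

# 10-fold cross-validation over a grid of 100 lambda values (sklearn calls them alphas)
cv_lasso = LassoCV(cv=10, n_alphas=100).fit(X, y)

print(cv_lasso.alpha_)                        # lambda with the smallest mean CV MSE
print(cv_lasso.mse_path_.shape)               # (100 lambdas, 10 folds) of MSE(lambda)
print(cv_lasso.mse_path_.mean(axis=1).min())  # the minimised mean CV error
```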

Topic 4 – Part 2: Cross-Validated Mean Squared Errors for All λ

[Figure: mean cross-validated MSE plotted against log(Lambda); the numbers along the top axis give the number of non-zero coefficients at each λ]
51/102

Topic 4: Regularization
1. Basics of Ridge, LASSO and Elastic Net
2. How to choose hyperparameter λ: cross-validation
52/102

Topic 5: Observed Factor Models
- Suppose xt is a vector of asset returns, and B is a matrix of factor loadings
- Which has higher dimension?
  (a) B
  (b) Σx = Cov(xt)
53/102

Topic 5: Observed Factor Models
1. General Framing of Linear Factor Models
2. Single Index Model and the CAPM
3. Multi-Factor Models
   - Fama-French
   - Macroeconomic Factors
4. Barra approach
54/102

Topic 5 – Part 1: Linear Factor Models
- Assume that returns xi,t are driven by K common factors:

  xi,t = αi + β1,i f1,t + β2,i f2,t + ··· + βK,i fK,t + εit

- ft = (f1,t, f2,t, ···, fK,t)' is the set of common factors
  - These are the same for all assets (constant over i)
  - But change over time (different for t, t+1)
  - Each ft has dimension (K × 1)
- βi = (β1,i, β2,i, ···, βK,i)' is the set of factor loadings
  - K different parameters for each asset
  - But constant over time (same for all t)
  - Fixed, specific relationship between asset i and factor k
55/102

Topic 5 – Part 1: Linear Factor Model
xt =α+Bft+εt
- Summary of Parameters
  - α: (m × 1) intercepts for m assets
  - B: (m × K) loadings (βik) on K factors for m assets
  - μf: (K × 1) vector of means for K factors
  - Ωf: (K × K) variance-covariance matrix of factors
  - Ψ: (m × m) diagonal matrix of asset-specific variances
- Given our assumptions xt is m-variate covariance stationary with:

  E[xt|ft] = α + B ft
  Cov[xt|ft] = Ψ
  E[xt] = μx = α + B μf
  Cov[xt] = Σx = B Ωf B' + Ψ
56/102

Topic 5 – Part 2: The Index Model: First Pass
  xi = αi 1T + Rm βi + εi

- Estimate OLS regression on the time-series version of our factor specification
  - One regression for each asset i
  - Recover two parameters α̂i and β̂i for each asset i
  - Ω̂f is just the sample variance of the observed factor
- Estimate residuals:

  ε̂i = xi − α̂i 1T − Rm β̂i

- Use these to estimate asset-specific variances (for each i):

  σ̂i² = ε̂i'ε̂i / (T − 2)

- With these, can compute:

  Σ̂x = B̂ Ω̂f B̂' + Ψ̂
57/102
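[Illustrative sketch, not from the slides: a first-pass estimation on simulated single-factor returns (the factor values and loadings are made up), followed by the implied covariance matrix.]

```python
import numpy as np

rng = np.random.default_rng(8)
T, m = 250, 5
Rm = 0.005 + 0.04 * rng.normal(size=T)                   # observed market factor
beta_true = np.array([0.5, 0.8, 1.0, 1.2, 1.5])
x = 0.001 + Rm[:, None] * beta_true + 0.02 * rng.normal(size=(T, m))   # m asset returns

# First pass: one time-series OLS per asset on [1, Rm]
F = np.column_stack([np.ones(T), Rm])
coefs = np.linalg.lstsq(F, x, rcond=None)[0]             # 2 x m: rows are alpha_i, beta_i
alpha_hat, beta_hat = coefs[0], coefs[1]

resid = x - F @ coefs
sigma2_hat = (resid ** 2).sum(axis=0) / (T - 2)          # asset-specific variances
Psi_hat = np.diag(sigma2_hat)

# Implied covariance matrix: Sigma_x = B Omega_f B' + Psi
Omega_f = np.var(Rm, ddof=1)
Sigma_x_hat = np.outer(beta_hat, beta_hat) * Omega_f + Psi_hat

print(beta_hat.round(2))                      # close to the true loadings
print(np.diag(Sigma_x_hat).round(5))          # implied variances...
print(np.var(x, axis=0, ddof=1).round(5))     # ...close to the sample variances
```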

Topic 5 – Part 2: The Index Model/Testing CAPM: Second Pass

  x̄i = γ0 + γ1 β̂i + γ2 σ̂i + ηi

- CAPM tests: expected excess return should be determined only by systematic risk (β)
  1. γ0 = 0, or the average α is 0
  2. γ2 = 0 (idiosyncratic risk shouldn't be priced)
  3. γ1 = R̄m
58/102

Topic 5 – Part 3: Fama-French Three Factor Model
- Recall our general linear factor model:

  xi,t = αi + β1,i f1,t + β2,i f2,t + ··· + βK,i fK,t + εit

- Fama-French is just another version of this with three particular factors:

  xi,t = αi + β1,i f1,t + β2,i f2,t + β3,i f3,t + εit

- The factors are:
  1. f1,t = Rm,t: proxy for excess market return—same as before
  2. f2,t = SMBt: size factor
  3. f3,t = HMLt: value factor
- Can use the same two-pass methodology to recover parameters
- Should understand what the covariance matrix Σx looks like
59/102

Topic 5 – Part 3: Macroeconomic Factors
- An alternative approach uses key macro variables as factors
- For example, Chen, Roll, and Ross use:
  - IP: Growth rate in industrial production
  - EI: Changes in expected inflation
  - UI: Unexpected inflation
  - CG: Unexpected changes in risk premiums
  - GB: Unexpected changes in term premia
- In this case, our general model becomes:

  xi,t = αi + βR,i Rm,t + βIP,i IPt + βEI,i EIt + βUI,i UIt + βCG,i CGt + βGB,i GBt + εit

- Can use the two-pass procedure to estimate the β̂s and evaluate the model
- Like before, can use the estimated β̂s, asset-specific variances, and factor covariances to estimate the asset covariance matrix
60/102

Topic 5 – Part 4: BARRA approach

  x̃t = B ft + εt

- Assume that B is known
- This looks just like the standard OLS matrix notation
- And we can estimate our ft like always:

  f̂t^OLS = (B'B)⁻¹ B' x̃t

- A bit weird conceptually because the role of the βs flips
- But no technical difference...
- Except heteroskedasticity
- Estimate with GLS
61/102
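[Illustrative sketch, not from the slides: cross-sectional OLS and GLS estimates of ft for a single period, with made-up loadings and heteroskedastic asset-specific variances. Here the true variances are used as GLS weights purely for illustration; in practice they would be estimated.]

```python
import numpy as np

rng = np.random.default_rng(9)
m, K = 50, 3
B = rng.normal(size=(m, K))                    # characteristic loadings, treated as known
f_true = np.array([0.01, -0.02, 0.005])        # factor returns for one period t
sigma2 = rng.uniform(0.01, 0.09, size=m)       # heteroskedastic asset-specific variances
x_t = B @ f_true + np.sqrt(sigma2) * rng.normal(size=m)

# OLS cross-sectional estimate: f_hat = (B'B)^{-1} B' x_t
f_ols = np.linalg.solve(B.T @ B, B.T @ x_t)

# GLS: weight each asset by 1 / sigma_i^2
W = np.diag(1.0 / sigma2)
f_gls = np.linalg.solve(B.T @ W @ B, B.T @ W @ x_t)

print(f_ols.round(4))
print(f_gls.round(4))   # typically closer to f_true when variances differ a lot
```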

Topic 5: Observed Factor Models
1. General Framing of Linear Factor Models
2. Single Index Model and the CAPM
3. Multi-Factor Models
   - Fama-French
   - Macroeconomic Factors
4. Barra approach
62/102

Topic 6: Principal Components Analysis
- Suppose the covariance matrix of two asset returns is given by:

  | 1 0 |
  | 0 3 |

- What fraction of the total variance in asset returns will be explained by the first principal component?
  (a) 1/3
  (b) 3/4
  (c) 1/4
63/102

Topic 6: Principal Components Analysis
1. Basics of Eigenvectors and Eigenvalues
2. PCA
64/102

Topic 6 – Part 1: Basics of Eigenvalues and Eigenvectors
- Consider the square n × n matrix A
- An eigenvalue λi of A is a (1 × 1) scalar
- The corresponding eigenvector v⃗i is an (n × 1) vector
- Where λi, v⃗i satisfy:

  A v⃗i = λi v⃗i

- The v⃗i are the special vectors that A only stretches
- λi is the stretching factor
- Won't ask you to compute these without giving you the formulas
- Except maybe in the diagonal case...
65/102
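[Illustrative sketch, not from the slides: a quick numerical check of the definition using the example matrix from the next slide.]

```python
import numpy as np

A = np.array([[5.0, 0.0],
              [2.0, 3.0]])              # the matrix used on the next slide

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are the v_i
print(eigvals)                          # 5 and 3 (order may vary)

v1 = np.array([1.0, 1.0])
print(A @ v1)                           # [5, 5] = 5 * v1: A only stretches v1
```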

Topic 6 – Part 1: For Some Vectors v⃗, Matrix A Only Stretches

- Let's say

  A = | 5 0 |    and    v⃗1 = | 1 |    ⇒    A v⃗1 = | 5 | = 5 v⃗1
      | 2 3 |                 | 1 |                 | 5 |

[Figure: the vectors v1 = (1 1)' and Av1 = (5 5)' plotted in the plane; A stretches v1 by the eigenvalue 5]
66/102

Topic 6 – Part 1: Eigenvalues of Σx with Uncorrelated Data

  Σx = | 3 0 |
       | 0 1 |

- What are the eigenvalues and eigenvectors of Σx?
- With uncorrelated assets the eigenvalues are just the variances of each asset return!
- Eigenvectors:

  v1 = (1, 0)',   v2 = (0, 1)'

- Note that the first eigenvector points in the direction of the largest variance
- We sometimes write the eigenvectors together as a matrix:

  Γ = (v1 v2) = | 1 0 |
                | 0 1 |
67/102

Topic 6 – Part 1: Eigenvectors are Γ = (v1 v2) = [[1/√2, −1/√2], [1/√2, 1/√2]]

  Cov(x) = Σx = | 2 1 |
                | 1 2 |

[Figure: scatter of xb against xa with the eigenvector directions V1 = (1 1)' and V2 = (−1 1)' drawn through the cloud of points]
67/102

Topic 6 – Part 1: Eigenvectors of Σx with Correlated Data

  Σx = | 2 1 |
       | 1 2 |

  Γ = (v1 v2) = | 1/√2  −1/√2 |
                | 1/√2   1/√2 |
- Just as with uncorrelated data, the first eigenvector finds the direction with the most variability
- The second eigenvector points in the direction that explains the maximum amount of the remaining variance
- Note that the two are perpendicular (because Σx is symmetric)
- This is the geometric implication of the fact that they are orthogonal:

  vi'vj = 0

- The fact that they are orthogonal (and scaled to unit length) also implies:

  Γ' = Γ⁻¹
68/102

Topic 6 – Part 2: Principal Components Analysis
- xt is (m × 1), with E[xt] = α and Cov(xt) = Σx
- Define the principal components variables as:

  p = Γ'(xt − α)     (m × 1)

- Γ is the ordered matrix of eigenvectors
- The proportion of the total variance of xt that is explained by the component with eigenvalue λi is simply:

  λi / Σ_{j=1}^{m} λj
69/102

Topic 6 – Part 2: Principal Components Analysis
- Our principal components variables provide a transformation of the data into variables that are:
  - Uncorrelated (orthogonal)
  - Ordered by how much of the total variance they explain (size of eigenvalue)
- What if m is large, but the first few (2, 5, 20) principal components explain most of the variation?
- Idea: Use these as "factors"
- Dimension reduction!
70/102
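[Illustrative sketch, not from the slides: PCA on simulated return data (a two-factor structure is assumed purely for illustration), via an eigendecomposition of the sample covariance matrix.]

```python
import numpy as np

rng = np.random.default_rng(10)
T, m = 500, 6
common = rng.normal(size=(T, 2))                      # two "true" underlying factors
load = rng.normal(size=(2, m))
x = common @ load + 0.3 * rng.normal(size=(T, m))     # T x m panel of returns

x_dm = x - x.mean(axis=0)                             # de-mean each series
Sigma_hat = x_dm.T @ x_dm / T                         # sample covariance matrix

eigvals, Gamma = np.linalg.eigh(Sigma_hat)            # eigh returns ascending eigenvalues
eigvals, Gamma = eigvals[::-1], Gamma[:, ::-1]        # re-order: largest eigenvalue first

p = x_dm @ Gamma                                      # principal-component variables
share = eigvals / eigvals.sum()
print(share.round(3))                                 # first two PCs explain most variance
print(np.cov(p, rowvar=False).round(3))               # PCs are (nearly) uncorrelated
```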

Topic 6 – Part 2: Principal Components Analysis
- Note that because Γ' = Γ⁻¹:

  xt = α + Γp

- We can also partition Γ into the first K < m eigenvectors and the remaining m − K:

  Γ = [Γ1 Γ2]

- Partition p into its first K elements and the remaining m − K:

  p = (p1, p2)'

- We can then write:

  xt = α + Γ1 p1 + Γ2 p2

- This looks just like a factor model:

  xt = α + B ft + εt

- But

  Cov(εt) = Ψ = Γ2 Λ2 Γ2'

71/102

Topic 6 - Part 2: Implementing Principal Components Analysis

  xt = α + Γ1 p1 + Γ2 p2

- Recall the sample covariance matrix:

  Σ̂x = (1/T) X̃ X̃'

- Calculate this, and perform the eigendecomposition (using a computer):

  Σ̂x = Γ Λ Γ'

- We now have everything needed to compute the sample principal components at each t:

  P = [p1 p2 ··· pT] = Γ' X̃     (m × T)

72/102

Topic 6 - Part 2: x̂t: Predicted Yields from First Two Components

[Figure: fitted yields from the first two principal components for constant-maturity Treasury series (3 months through 20 years) over the sample period]

73/102

Topic 6: Principal Components Analysis

1. Basics of Eigenvectors and Eigenvalues
2. PCA

74/102

Topic 7: Limited Dependent Variables

- True or false: It is never ok to use an OLS regression when the outcome variable is binary
  (a) True
  (b) False

75/102

Topic 7: Limited Dependent Variables

1. Binary dependent variables
2. Censoring and truncation

76/102

Topic 7 - Part 1: Linear Probability Models

  P(Yi = 1|Xi) = β0 + β1 xi

- The linear probability model has a bunch of advantages:
  1. Just OLS with a binary Yi—estimation of β^OLS is the same
  2. Simple interpretation of β1^OLS
  3. Can use all the techniques we've seen: IV, difference-in-difference, etc.
- Because of this simplicity, lots of applied research just uses linear probability models
- But a few downsides...
  - Predicted probabilities above one
  - Constant effects of xi

77/102

Topic 7 - Part 1: Two Common Alternatives to Linear Probability Models

  P(yi = 1|xi) = G(β0 + β1 xi)

- In practice, mostly use two choices of G(·):
  1. Probit: the standard normal CDF

     G(z) = Φ(z) = ∫_{−∞}^{z} φ(v) dv,   where φ(v) = (2π)^{−1/2} exp(−v²/2)

  2. Logit: the logistic function

     G(z) = Λ(z) = exp(z) / (1 + exp(z))

78/102

Topic 7 - Part 1: Probit

[Figure: fitted probit probability of passing plotted against Assignment 1 score (0 to 100)]

79/102

Topic 7 - Part 1: Why Does the Probit Approach Make Sense?

- You should be able to derive the probit from a latent variable model:

  yi* = β0 + β1 xi + vi

  yi = 1 if yi* ≥ 0,   yi = 0 if yi* < 0

  P(yi = 1) = P(yi* > 0)

- Probit approach: assume vi is distributed standard normal:

  vi|xi ∼ N(0, 1)

  P(yi* > 0|xi) = P(β0 + β1 xi + vi > 0|xi) = P(vi > −(β0 + β1 xi)|xi)
               = 1 − Φ(−(β0 + β1 xi))
               = Φ(β0 + β1 xi)
80/102

Topic 7 – Part 1: The Effect of a Change in xi for the Probit
  P(yi = 1|xi) = Φ(β0 + β1 xi)

- Taking derivatives:

  ∂P(yi = 1|xi)/∂xi = β1 Φ'(β0 + β1 xi)

- The derivative of the standard normal CDF is just the PDF: Φ'(z) = φ(z)
- So:

  ∂P(yi = 1|xi)/∂xi = β1 φ(β0 + β1 xi)
􏰒 Obviously should be able to do this for more complicated functions
81/102

Topic 7 – Part 1: Deriving the Log-likelihood
- Given data Y, define the likelihood function:

  L(β0, β1) = f(Y|X; β0, β1)
            = ∏_{i=1}^{n} [Φ(β0 + β1 xi)]^{yi} [1 − Φ(β0 + β1 xi)]^{(1−yi)}

- Take the log-likelihood:

  l(β0, β1) = log(L(β0, β1)) = log(f(Y|X; β0, β1))
            = ∑_{i=1}^{n} yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))
82/102
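[Illustrative sketch, not from the slides: numerically maximizing exactly this log-likelihood on simulated data (the true parameter values are made up), then computing the marginal effect from the previous slide.]

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(11)
n = 2_000
x = rng.normal(size=n)
y_star = -0.5 + 1.0 * x + rng.normal(size=n)      # latent index, true (b0, b1) = (-0.5, 1)
y = (y_star >= 0).astype(float)

def neg_loglik(b):
    p = norm.cdf(b[0] + b[1] * x)
    p = np.clip(p, 1e-10, 1 - 1e-10)              # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
b0_hat, b1_hat = res.x
print(res.x)                                      # close to (-0.5, 1.0)

# Marginal effect of x evaluated at its mean: beta1 * phi(b0 + b1 * x_bar)
print(b1_hat * norm.pdf(b0_hat + b1_hat * x.mean()))
```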

Topic 7 – Part 2: Censoring and Truncation
- An extremely common data issue is censoring:
  - We only observe Yi if it is below (or above) some threshold
  - We see Xi either way
- Example: Income is often top-coded
  - That is, we might only see whether income is > £100,000
- Formally, we might be interested in Yi, but see:

  Wi = min(Yi, ci)   where ci is a censoring value

- Similar to censoring is truncation
  - We don't observe anything if Yi is above some threshold
  - e.g.: we only have data for those with incomes below £100,000
83/102

Topic 7 – Part 2: Censored Regression
  yi* = β0 + β1 xi + vi
  yi = min(yi*, ci)
  vi|xi, ci ∼ N(0, σ²)

- So in general:

  f(yi = y|xi, ci) = 1{y ≥ ci} [1 − Φ((ci − β0 − β1 xi)/σ)] + 1{y < ci} (1/σ) φ((y − β0 − β1 xi)/σ)
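[Illustrative sketch, not from the slides: estimating the censored model by maximizing the log of the density above on simulated data; the censoring point and parameter values are made up.]

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(12)
n = 2_000
x = rng.normal(size=n)
c = 1.0                                             # common censoring point
y_star = 0.5 + 1.0 * x + rng.normal(size=n)         # true (b0, b1, sigma) = (0.5, 1, 1)
y = np.minimum(y_star, c)                           # we only see min(y*, c)
censored = (y >= c)                                 # observationally, y equals c when censored

def neg_loglik(theta):
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)                       # keep sigma positive
    z = (y - b0 - b1 * x) / sigma
    zc = (c - b0 - b1 * x) / sigma
    ll = np.where(censored,
                  norm.logsf(zc),                   # log P(y* >= c): censored observations
                  norm.logpdf(z) - np.log(sigma))   # log density for uncensored observations
    return -ll.sum()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0, 0.0]), method="BFGS")
print(res.x[:2], np.exp(res.x[2]))                  # close to (0.5, 1.0) and 1.0
```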