Limited Dependent Variables
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
February 21-22, 2022
Today: Four Parts
1. Writing and minimizing functions in R
2. Binary dependent variables
3. Implementing a probit in R via maximum likelihood
4. Censoring and truncation
Part 1: Simple Functions in R
Often valuable to create our own functions in R
May want to simplify code
Automate a common task/prevent mistakes
Plot or optimize a function
Simple syntax in R for user-written functions
Two key components:
Arguments
Body
Creating functions in R
function_name <- function(arguments){
  body
}
Write a simple function that adds two inputs x and y
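For example, a minimal sketch (the name add_xy is just illustrative):
add_xy <- function(x, y) {
  x + y  # the last evaluated expression is the return value
}
add_xy(2, 3)  # returns 5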
Write the function f(x) = (x − 1)^2
What x minimizes this function? How do we find it in R?
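One way to check in R, as a minimal sketch (the search interval is an arbitrary choice):
f <- function(x) {
  (x - 1)^2
}
# optimize() performs one-dimensional minimization over an interval
optimize(f, interval = c(-10, 10))  # minimum at x = 1, where f(x) = 0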
Rosenbrock’s Banana Function
The Rosenbrock Banana Function is given by:
f(x1, x2) = (1 − x1)^2 + 100(x2 − x1^2)^2
What values of x1 and x2 minimize this function?
Please find this using optim in R with starting values (−1.2,1)
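A minimal sketch (banana is just an illustrative name; optim defaults to Nelder-Mead):
banana <- function(b) {
  (1 - b[1])^2 + 100*(b[2] - b[1]^2)^2
}
fit <- optim(par = c(-1.2, 1), fn = banana)
fit$par  # should be close to (1, 1)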
Part 2: Binary Dependent Variables
1. Review: Bernoulli distribution
2. Linear probability model and limitations
3. Introducing the probit and logit
4. Deriving the probit from a latent variable
5. Partial effects
Bernoulli Distribution
We are interested in an event that has two possible outcomes
Call them success and failure, but they could describe:
Heads vs. tails in a coin flip
Chelsea wins next match
Pound rises against dollar
Define:
Y = 1 if Success
Y = 0 if Failure
Y is often called a Bernoulli trial
Say the probability of success is p, and the probability of failure is (1 − p)
So the PMF of Y can be written as:
P(Y = y) = p^y (1 − p)^(1−y)
Bernoulli Distribution
Say p = 0.2:
So then we can write the probabilities of both values of y as:
P(Y = y) = (0.2)^y (0.8)^(1−y)
P(Y = 1) = (0.2)^1 (0.8)^(1−1) = 0.2
P(Y = 0) = (0.2)^0 (0.8)^(1−0) = 0.8
Binary Dependent Variables
yi =β0+β1xi+vi
So far, focused on cases in which yi is continuous
What about when yi is binary?
That is, yi is either 1 or 0
For example: yi represents employment, passing vs. failing this course, etc...
Put any concerns about causality aside for a moment:
Assume E[vi | xi] = 0
How do we interpret β1?
A Look at a Continuous Outcome
[Figure: scatter plot of a continuous outcome with the fitted OLS line β0^OLS + β1^OLS X]
A Look at a Binary Outcome
[Figure: pass/fail outcomes (0 or 1) against Assignment 1 Score (0-100); vertical axis is the probability of passing]
Binary Dependent Variables
yi =β0+β1xi+vi
With a continuous yi , we interpreted β1 as a slope:
Change in yi for a one unit change in xi
This doesn’t make much sense when yi is binary
Say yi is employment, xi is years of schooling, and β1 = 0.1
What does it mean for a year of schooling to increase your employment by 0.1?
Solution: think in probabilities
Linear Probability Models
yi = β0 + β1 xi + vi
When E[vi | xi] = E[vi] = 0 we have:
E[yi | xi] = β0 + β1 xi
But if yi is binary:
E[yi | xi] = P(yi = 1 | xi)
So we can think of our regression as:
P(yi = 1|xi) = β0 +β1xi
β1 is the change in probability of “success” (yi = 1) for a one unit change in xi
Linear Probability Models
P(yi = 1|xi) = β0 +β1xi
Basic idea: probability of success is a linear function of xi
Examples:
1. Bankruptcy:
P(Bankruptcyi = 1 | Leveragei) = β0 + β1 Leveragei
Probability of bankruptcy increases linearly with leverage
2. Mortgage Denial:
P(MortgageDeniali = 1 | Incomei) = β0 + β1 Incomei
Probability of denial decreases linearly in income
Linear Probability Models
[Figure: pass/fail outcomes (0/1) against Assignment 1 Score (0-100), with the fitted linear probability line overlaid]
Linear Probability Models
P(yi = 1|xi) = β0 +β1xi
The linear probability model has a bunch of advantages:
1. Just OLS with a binary yi: estimation of β^OLS is the same
2. Simple interpretation of β^OLS
3. Can use all the techniques we've seen: difference-in-differences, etc.
4. Can include many xi:
P(yi = 1 | Xi) = β0 + Xi′β
Because of this simplicity, lots of applied research just uses linear
probability models
But a few downsides...
Linear Probability Models: Downsides
Downside 1: Nonsense predictions
P(MortgageDeniali = 1 | Incomei) = β0 + β1 Incomei
Suppose we estimate this and recover β0^OLS = 1, β1^OLS = −0.1
Income is measured in £10,000s
What is the predicted probability of denial for an individual with an income of £50,000?
What is the predicted probability of denial for an individual with an income of £110,000?
What about £1,000,000?
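A quick check of these predictions (a sketch; p_denial is an illustrative name, income in £10,000s):
p_denial <- function(income_10k) 1 - 0.1*income_10k
p_denial(5)    # £50,000: predicted probability 0.5
p_denial(11)   # £110,000: predicted probability -0.1 (nonsense)
p_denial(100)  # £1,000,000: predicted probability -9 (nonsense)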
Linear Probability Models
[Figure: the fitted linear probability line extended over the full range of Assignment 1 Scores; predicted probabilities fall below 0 and rise above 1]
Linear Probability Models: Downsides
Downside 2: Constant Effects
MortgageDeniali = β0 + β1 Incomei + vi
β0^OLS = 1, β1^OLS = −0.1
Income is measured in £10,000s
Probability of denial declines by 0.1 when income increases from 50,000 to 60,000
Seems reasonable
Probability of denial declines by 0.1 when income increases from 1,050,000 to 1,060,000
Probably less realistic
Alternatives to Linear Probability Models
Simplest problem with P(yi = 1 | xi) = β0 + β1 xi:
Predicts P(yi = 1 | xi) > 1 for high values of β0 + β1 xi
Predicts P(yi = 1 | xi) < 0 for low values of β0 + β1 xi
Solution:
P(yi = 1 | xi) = G(β0 + β1 xi)
where 0 ≤ G(z) ≤ 1 for all z
For the probit, G(·) is the standard normal CDF Φ(·); for the logit, it is the logistic CDF Λ(·)
Normal Density
[Figure: standard normal density; P(X ≤ x) = Φ(x) is the area under the density to the left of x]
Why Does the Probit Approach Make Sense?
yi = 1 if yi* ≥ 0
yi = 0 if yi* < 0
⇒ P(yi = 1) = P(yi* ≥ 0)
So let's plug in yi* = β0 + β1 xi + vi (with vi ~ N(0, 1)) and figure out the probabilities:
P(yi* ≥ 0 | xi) = P(β0 + β1 xi + vi ≥ 0)
= P(vi ≥ −(β0 + β1 xi))
= 1 − Φ(−(β0 + β1 xi))
= Φ(β0 + β1 xi)
Which is exactly the probit:
P(yi = 1|xi) = Φ(β0 +β1xi)
What About the Logit?
yi* = β0 + β1 xi + vi
The logit can actually be derived the same way
Assuming vi follows a standard logistic distribution
Instead of a standard normal distribution
A more awkward/uncommon distribution, but still symmetric around 0
All the math/interpretation is the same, just using Λ(z) instead of Φ(z)
Primary benefit is computational/analytic convenience
The Effect of a Change in Xi
In OLS/Linear probability model interpreting coefficients was easy:
β1 is the impact of a one unit change in xi
This interpretation checks out formally:
P(yi = 1 | xi) = β0 + β1 xi
Taking derivatives:
∂P(yi = 1 | xi)/∂xi = β1
Things are a little less clean with probit/logit
Can’t interpret β1 as the impact of a one unit change in xi anymore!
The Effect of a Change in xi
P(yi = 1|xi) = G(β0 +β1xi)
Taking derivatives:
∂P(yi = 1 | xi)/∂xi = β1 G′(β0 + β1 xi)
The impact of xi is now non-linear
Downside: harder to interpret
Upside: no longer have the same effect when, e.g. income goes from
50,000 to 60,000 as when income goes from 1,050,000 to 1,060,000
For any set of values xi, β1G′(β0 +β1xi) is pretty easy to compute
The Effect of a Change in xi for the Probit
P(yi = 1 | xi) = Φ(β0 + β1 xi)
Taking derivatives:
∂P(yi = 1 | xi)/∂xi = β1 Φ′(β0 + β1 xi)
The derivative of the standard normal CDF is just the PDF:
Φ′(z) = φ(z)
So:
∂P(yi = 1 | xi)/∂xi = β1 φ(β0 + β1 xi)
Practical Points: The Effect of a Change in xi
∂P(yi = 1 | xi)/∂xi = β1 φ(β0 + β1 xi)
Because the impact of xi is non-linear, it can be tough to answer "what is the impact of xi on yi?" with a single number
A few approaches:
1. Choose an important value of xi: e.g. the mean x̄
∂P(yi = 1 | x̄)/∂xi = β1 φ(β0 + β1 x̄)
This is called the partial effect at the average
Practical Points: The Effect of a Change in xi
∂P(yi = 1 | xi)/∂xi = β1 φ(β0 + β1 xi)
Because the impact of xi is non-linear, it can be tough to answer "what is the impact of xi on yi?" with a single number
A few approaches:
2. Take the average over all observed values of xi:
(1/n) ∑_{i=1}^n ∂P(yi = 1 | xi)/∂xi = (1/n) ∑_{i=1}^n β1 φ(β0 + β1 xi)
This is called (confusingly) the average partial effect
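Both are easy to compute in R. A sketch, assuming estimated coefficients b0 and b1 and a data vector x_i (hypothetical names):
b0 <- 0.2; b1 <- 0.5                   # illustrative estimates
pea <- b1 * dnorm(b0 + b1*mean(x_i))   # partial effect at the average
ape <- mean(b1 * dnorm(b0 + b1*x_i))   # average partial effect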
Practical Points: The Effect of a Change in xi
P(yi = 1 | xi) = Φ(β0 + β1 xi)
If xi is a dummy variable, it makes sense to avoid all the calculus and simply compute:
P(yi = 1 | xi = 1) − P(yi = 1 | xi = 0) = Φ(β0 + β1) − Φ(β0)
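In R this is one line (b0, b1 again hypothetical estimates):
pnorm(b0 + b1) - pnorm(b0)  # effect of switching the dummy from 0 to 1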
Practical Points: No Problem with Many Xi
So far we have only seen one xi
P(yi = 1|xi) = Φ(β0 +β1xi)
This can easily be extended to many Xi :
P(yi = 1 | Xi) = Φ(β0 + β1 x1i + β2 x2i + ··· + βk xki)
Intuition behind the latent variable approach remains the same:
yi* = β0 + β1 x1i + β2 x2i + ··· + βk xki + vi
yi = 1 if yi* ≥ 0
yi = 0 if yi* < 0
Practical Points: No Problem with Many Xi
P(yi = 1 | Xi) = Φ(β0 + β1 x1i + β2 x2i + ··· + βk xki)
However, this does make partial effects a bit more complicated:
∂P(yi = 1 | Xi)/∂x2i = β2 φ(β0 + β1 x1i + β2 x2i + ··· + βk xki)
What about partial effects with a more complicated function of xi?
P(yi = 1 | xi) = Φ(β0 + β1 xi + β2 xi^2 + ··· + βk ln(xi))
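A sketch of the answer, via the chain rule (differentiate inside Φ):
∂P(yi = 1 | xi)/∂xi = φ(β0 + β1 xi + β2 xi^2 + ··· + βk ln(xi)) × (β1 + 2β2 xi + ··· + βk / xi)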
Implementation of Probit by MLE
Suppose we have n independent observations of (yi, xi)
Where yi is binary
And suppose we have a probit specification:
P(yi = 1 | xi) = Φ(β0 + β1 xi)
This means that for each i:
P(yi = 0|xi) = 1−Φ(β0 +β1xi)
In other words, P (yi = y |xi ) is Bernoulli!
P(yi = y | xi) = [Φ(β0 + β1 xi)]^y [1 − Φ(β0 + β1 xi)]^(1−y)
Implementation of Probit by MLE
Often we write this pdf as a function of the unknown parameters:
P(yi | xi; β0, β1) = [Φ(β0 + β1 xi)]^(yi) [1 − Φ(β0 + β1 xi)]^(1−yi)
What is the joint density of two independent observations i and j?
f(yi, yj | xi, xj; β0, β1) = P(yi | xi; β0, β1) × P(yj | xj; β0, β1)
= [Φ(β0 + β1 xi)]^(yi) [1 − Φ(β0 + β1 xi)]^(1−yi) × [Φ(β0 + β1 xj)]^(yj) [1 − Φ(β0 + β1 xj)]^(1−yj)
Implementation of Probit by MLE
Often we write this pdf as a function of the unknown parameters:
P(yi | xi; β0, β1) = [Φ(β0 + β1 xi)]^(yi) [1 − Φ(β0 + β1 xi)]^(1−yi)
And what is the joint density of all n independent observations?
f(Y | X; β0, β1) = ∏_{i=1}^n P(yi | xi; β0, β1)
= ∏_{i=1}^n [Φ(β0 + β1 xi)]^(yi) [1 − Φ(β0 + β1 xi)]^(1−yi)
Implementation of Probit by MLE
f(Y | X; β0, β1) = ∏_{i=1}^n [Φ(β0 + β1 xi)]^(yi) [1 − Φ(β0 + β1 xi)]^(1−yi)
Implementation of Probit by MLE
Given data Y , define the likelihood function:
L(β0, β1) = f(Y | X; β0, β1)
= ∏_{i=1}^n [Φ(β0 + β1 xi)]^(yi) [1 − Φ(β0 + β1 xi)]^(1−yi)
Take the log-likelihood:
l(β0, β1) = log(L(β0, β1))
= log(f(Y | X; β0, β1))
= ∑_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))
Implementation of Probit by MLE
We then have
(β̂0^MLE, β̂1^MLE) = argmax_{(β0, β1)} l(β0, β1)
Intuition: values of β0,β1 that make the observed data most likely
Log-likelihood is a Nice Concave Function
[Figure: the log-likelihood l(β0, β1) plotted against β1; a smooth concave curve with a clear maximum]
Part 3: Implementation of Probit by MLE in R
It turns out this log-likelihood is globally concave in (β0, β1)
A pretty easy problem for a computer
Standard optimization packages will typically converge relatively easily
Let's try it in R
Implementation of Probit in R
Let's start by simulating some example data
We will use the latent variable approach:
yi* = β0 + β1 xi + vi
yi = 1 if yi* ≥ 0
yi = 0 if yi* < 0
To start, let's define some parameters
And choose n = 10000
n <- 10000
beta_0 <- 0.2
beta_1 <- 0.5
Simulating Data in R
yi* = β0 + β1 xi + vi
To generate yi∗ we need to simulate xi and vi
We will use the function rnorm(n)
Simulates n draws from a normal random variable
x_i <- rnorm(n)
v_i <- rnorm(n)
Aside: We've simulated both xi and vi as normal
The probit only assumes vi is normal
Could have chosen xi to be uniform or some other distribution
Simulating Data in R
With xi and vi in hand, we can generate yi* and yi:
yi* = β0 + β1 xi + vi
yi = 1 if yi* ≥ 0
yi = 0 if yi* < 0
To do this in R:
y_i_star <- beta_0 + beta_1*x_i + v_i
y_i <- as.numeric(y_i_star >= 0)
Writing the Likelihood Function in R
Now, recall the log-likelihood
l(β0, β1) = ∑_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))
In R, the function Φ(·) is pnorm(·)
We will define the vector beta = (β0, β1)
beta[1] refers to β0
beta[2] refers to β1
Writing the Likelihood Function in R
l(β0, β1) = ∑_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))
Easy to break into a few steps
To capture log(Φ(β0 + β1 xi)):
l_1 <- log(pnorm(beta[1] + beta[2]*x_i))
To capture log(1 − Φ(β0 + β1 xi)):
l_2 <- log(1 - pnorm(beta[1] + beta[2]*x_i))
To capture the whole function l(β0, β1):
sum(y_i*l_1 + (1 - y_i)*l_2)
sum(y_i*l_1 + (1 - y_i)*l_2) is just a function of β0, β1
[Figure: the log-likelihood plotted as a function of β1, holding β0 = 0.2 fixed]
Minimizing the Negative Likelihood
So we have our R function, which we will call probit_loglik
We just need to find the β0, β1 that maximize this
Unfortunately, most software is set up to minimize
Easy solution:
(β̂0^MLE, β̂1^MLE) = argmax_{(β0, β1)} l(β0, β1) = argmin_{(β0, β1)} −l(β0, β1)
So we just define probit_loglik to be −l(β0, β1), as sketched below
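Putting the pieces together, a minimal sketch (using the x_i and y_i simulated above):
probit_loglik <- function(beta) {
  l_1 <- log(pnorm(beta[1] + beta[2]*x_i))
  l_2 <- log(1 - pnorm(beta[1] + beta[2]*x_i))
  -sum(y_i*l_1 + (1 - y_i)*l_2)  # negative log-likelihood, for minimization
}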
Maximum Likelihood Estimation
To find the β0, β1 that maximize the likelihood, use the function:
optim(par = ·, fn = ·)
This finds the parameters (par) that minimize a function (fn)
It takes two key arguments:
par: starting guesses for the parameters to estimate
fn: the function to minimize
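For example (the starting values c(0, 0) are an arbitrary choice):
fit <- optim(par = c(0, 0), fn = probit_loglik)
fit$par  # should be close to the true (0.2, 0.5)
# Cross-check with R's built-in probit:
# glm(y_i ~ x_i, family = binomial(link = "probit"))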
Menti: Estimate a Logit via Maximum Likelihood
On the hub, you’ll find data: logit data.csv
Data is identical to the yi, xi before, except vi is simulated from a standard logistic distribution
Everything in the likelihood is identical, except instead of Φ(z) we have:
Λ(z) = exp(z) / (1 + exp(z))
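One way to set this up, a minimal sketch (plogis is R's logistic CDF Λ; x_i, y_i are the loaded data):
logit_loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2]*x_i)
  -sum(y_i*log(p) + (1 - y_i)*log(1 - p))
}
# optim(par = c(0, 0), fn = logit_loglik)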
Part 4: Censoring and Truncation
So far we have focused on binary dependent variables
Two other common ways in which yi may be limited are:
Censoring
Truncation
The censored regression model and likelihood
An extremely common data issue is censoring:
We only observe yi∗ if it is below (or above) some threshold
We see xi either way
Example: Income is often top-coded
That is, we might only see whether income is > £100,000
Formally, we might be interested in yi*, but see:
yi = min(yi*, ci)
where ci is a censoring value
An Example of Censored Data
Truncation
Similar to censoring is truncation
We don’t observe anything if yi∗ is above some threshold
e.g.: we only have data for those with incomes below £100,000
An Example of Truncated Data: No one over £100,000
Terminology: Left vs. Right
If we only see yi* when it is above some threshold, it is left censored
We still see other variables xi regardless
If we only see yi* when it is below some threshold, it is right censored
We still see other variables xi regardless
If we only see the observation when yi* is above some threshold ⇒ left truncated
Not able to see other variables xi in this case
If we only see the observation when yi* is below some threshold ⇒ right truncated
Not able to see other variables xi in this case
Censored Regression
Suppose there is some underlying outcome
yi* = β0 + β1 xi + vi
Again depends on observable xi
And unobservable vi
We only see continuous yi* if it is above/below some threshold:
Hours worked: yi = max(0, yi*)
Income: yi = min(£100,000, yi*)
And assume: vi ~ N(0, σ²), with vi ⊥ xi
Censored Regression
yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi | xi, ci ~ N(0, σ²)
What is P(yi = ci | xi)?
P(yi = ci | xi) = P(yi* ≥ ci | xi)
= P(vi ≥ ci − β0 − β1 xi)
= 1 − Φ((ci − β0 − β1 xi) / σ)
Censored Regression
yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi | xi, ci ~ N(0, σ²)
For yi < ci, what is f(yi | xi)?
f(yi | xi) = (1/σ) φ((yi − β0 − β1 xi) / σ)
Censored Regression
yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi | xi, ci ~ N(0, σ²)
So in general:
f(yi | xi, ci) = 1{yi ≥ ci} [1 − Φ((ci − β0 − β1 xi) / σ)] + 1{yi < ci} (1/σ) φ((yi − β0 − β1 xi) / σ)
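As with the probit, this density maps directly into a log-likelihood that optim can minimize. A sketch, assuming data vectors y_i, x_i and censoring values c_i (hypothetical names), with σ parameterized as exp(theta[3]) to keep it positive:
censored_loglik <- function(theta) {
  b0 <- theta[1]; b1 <- theta[2]; sigma <- exp(theta[3])
  xb <- b0 + b1*x_i
  ll <- ifelse(y_i >= c_i,
               pnorm((c_i - xb)/sigma, lower.tail = FALSE, log.p = TRUE),  # censored: log(1 - Phi)
               dnorm((y_i - xb)/sigma, log = TRUE) - log(sigma))           # uncensored: log density
  -sum(ll)
}
# optim(par = c(0, 0, 0), fn = censored_loglik)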