
Limited Dependent Variables
Chris Hansman
Empirical Finance: Methods and Applications
Imperial College Business School
February 21-22, 2022


Today: Four Parts
1. Writing and minimizing functions in R
2. Binary dependent variables
3. Implementing a probit in R via maximum likelihood
4. Censoring and truncation

Part 1: Simple Functions in R
- Often valuable to create our own functions in R
  - May want to simplify code
  - Automate a common task / prevent mistakes
  - Plot or optimize a function
- Simple syntax in R for user-written functions
- Two key components:
  - Arguments
  - Body
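For instance, a minimal user-written function showing the two components (the function name and behaviour here are purely illustrative):

    # Arguments: x and y. Body: everything between the braces.
    add_two <- function(x, y) {
      x + y
    }
    add_two(2, 3)   # returns 5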

Creating functions in R
    function_name <- function(arguments) {
      body
    }

- Write a simple function that adds two inputs x and y
- Write the function f(x) = (x − 1)^2
- What x minimizes this function? How do we find it in R?

Rosenbrock's Banana Function

- The Rosenbrock Banana Function is given by:
  f(x1, x2) = (1 − x1)^2 + 100(x2 − x1^2)^2
- What values of x1 and x2 minimize this function?
- Please find this using optim in R with starting values (−1.2, 1)

Part 2: Binary Dependent Variables

1. Review: Bernoulli distribution
2. Linear probability model and limitations
3. Introducing the probit and logit
4. Deriving the probit from a latent variable
5. Partial effects

Bernoulli Distribution

- We are interested in an event that has two possible outcomes
- Call them success and failure, but could describe:
  - Heads vs. tails in a coin flip
  - Chelsea wins next match
  - Pound rises against dollar
- Y = 1 if Success, 0 if Failure
- Y is often called a Bernoulli trial
- Say the probability of success is p, probability of failure is (1 − p)
- So the PMF of Y can be written as:
  P(Y = y) = p^y (1 − p)^(1−y)

Bernoulli Distribution

- Say p = 0.2
- So then we can write the probabilities of both values of y as:
  P(Y = y) = (0.2)^y (0.8)^(1−y)
  P(Y = 1) = (0.2)^1 (0.8)^(1−1) = 0.2
  P(Y = 0) = (0.2)^0 (0.8)^(1−0) = 0.8

Binary Dependent Variables

yi = β0 + β1xi + vi

- So far, focused on cases in which yi is continuous
- What about when yi is binary?
  - That is, yi is either 1 or 0
  - For example: yi represents employment, passing vs. failing this course, etc.
- Put any concerns about causality aside for a moment:
  - Assume E[vi|Xi] = 0
- How do we interpret β1?

A Look at a Continuous Outcome

A Look at a Continuous Outcome

[Figure: same data with the fitted line β0^OLS + β1^OLS X]

A Look at a Binary Outcome

[Figure: Probability of Passing (0 to 1.5) against Assignment 1 Score (0 to 100)]

Binary Dependent Variables

yi = β0 + β1xi + vi

- With a continuous yi, we interpreted β1 as a slope:
  - Change in yi for a one unit change in xi
- This doesn't make much sense when yi is binary
  - Say yi is employment, xi is years of schooling, β1 = 0.1
  - What does it mean for a year of schooling to increase your employment by 0.1?
- Solution: think in probabilities

Linear Probability Models

yi = β0 + β1xi + vi

- When E[vi|xi] = E[vi] = 0 we have:
  E[yi|xi] = β0 + β1xi
- But if yi is binary:
  E[yi|xi] = P(yi = 1|xi)
- So we can think of our regression as:
  P(yi = 1|xi) = β0 + β1xi
- β1 is the change in probability of "success" (yi = 1) for a one unit change in xi

Linear Probability Models

P(yi = 1|xi) = β0 + β1xi

- Basic idea: probability of success is a linear function of xi
- Examples:
  1. Bankruptcy: P(Bankruptcyi = 1|Leveragei) = β0 + β1 Leveragei
     - Probability of bankruptcy increases linearly with leverage
  2. Mortgage Denial: P(MortgageDeniali = 1|Incomei) = β0 + β1 Incomei
     - Probability of denial decreases linearly in income

Linear Probability Models

[Figure: Probability of Passing (0 to 1.5) against Assignment 1 Score (0 to 100)]

Linear Probability Models

[Figure: Probability of Passing (0 to 1.5) against Assignment 1 Score (0 to 100)]

Linear Probability Models

P(yi = 1|xi) = β0 + β1xi

- The linear probability model has a bunch of advantages:
  1. Just OLS with a binary yi: estimation of β^OLS is the same
  2. Simple interpretation of β^OLS
  3. Can use all the techniques we've seen: difference-in-differences, etc.
  4. Can include many Xi: P(yi = 1|Xi) = β0 + Xi′β
- Because of this simplicity, lots of applied research just uses linear probability models
- But a few downsides...
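Before turning to those downsides, a minimal sketch of advantage 1: estimating an LPM is just lm() with a binary outcome. The data and variable names below are simulated purely for illustration:

    # Simulated illustration: pass (0/1) as a linear function of a score
    set.seed(1)
    n     <- 1000
    score <- runif(n, 0, 100)
    pass  <- as.numeric(runif(n) < 0.2 + 0.006 * score)  # true probabilities stay in [0, 1]

    lpm <- lm(pass ~ score)        # OLS with a binary dependent variable
    summary(lpm)$coefficients      # slope = change in P(pass = 1) per unit of score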
Linear Probability Models: Downsides

- Downside 1: Nonsense predictions
  P(MortgageDeniali = 1|Incomei) = β0 + β1 Incomei
- Suppose we estimate this and recover β0^OLS = 1, β1^OLS = −0.1
- Income is measured in 10k
- What is the predicted probability of denial for an individual with an income of 50k?
- What is the predicted probability of denial for an individual with an income of 110k?
- What about 1,000,000?

Linear Probability Models

[Figure: Probability of Passing (0 to 1.5) against Assignment 1 Score (0 to 100)]

Linear Probability Models: Downsides

- Downside 2: Constant effects
  MortgageDeniali = β0 + β1 Incomei + vi
- β0^OLS = 1, β1^OLS = −0.1
- Income is measured in 10k
- Probability of denial declines by 0.1 when income increases from 50,000 to 60,000
  - Seems reasonable
- Probability of denial declines by 0.1 when income increases from 1,050,000 to 1,060,000
  - Probably less realistic

Alternatives to Linear Probability Models

- Simplest problem with P(yi = 1|xi) = β0 + β1xi:
  - Predicts P(yi|xi) > 1 for high values of β0 + β1xi
  - Predicts P(yi|xi) < 0 for low values of β0 + β1xi
- Solution: P(yi = 1|xi) = G(β0 + β1xi)
- Where G(·) satisfies 0 ≤ G(z) ≤ 1 for every z
- Probit: G(·) is the standard normal CDF Φ(·)
- Logit: G(·) is the logistic CDF Λ(·)

Deriving the Probit from a Latent Variable

- Suppose there is a latent (unobserved) variable:
  yi* = β0 + β1xi + vi, with vi ~ N(0, 1)
- And we only observe:
  yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0
- A useful fact about the standard normal distribution (symmetric around 0):
  P(X ≤ z) = Φ(z) and P(X > −z) = 1 − Φ(−z) = Φ(z)

Normal Density
[Figure: standard normal density, with the regions P(X ≤ z) and P(X > −z) marked]

Why Does the Probit Approach Make Sense?
- Recall the latent variable setup:
  yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0
  ⇒ P(yi = 1) = P(yi* ≥ 0)
- So let's plug in for yi* and figure out the probabilities:
  P(yi* ≥ 0|xi) = P(β0 + β1xi + vi ≥ 0)
                = P(vi ≥ −(β0 + β1xi))
                = 1 − Φ(−(β0 + β1xi))
                = Φ(β0 + β1xi)
- Which is exactly the probit:
  P(yi = 1|xi) = Φ(β0 + β1xi)
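The symmetry step 1 − Φ(−z) = Φ(z) is easy to check numerically in R (the value of z below is arbitrary):

    z <- 0.7
    1 - pnorm(-z)   # 0.7580363
    pnorm(z)        # 0.7580363, identical by symmetry of the standard normal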

What About the Logit?
yi* = β0 + β1xi + vi

- The logit can actually be derived the same way
- Assuming vi follows a standard logistic distribution
  - Instead of a standard normal distribution
- More awkward/uncommon distribution, but still symmetric around 0
- All the math/interpretation is the same, just using Λ(z) instead of Φ(z)
- Primary benefit is computational/analytic convenience
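For reference, both CDFs are built into R, so swapping Φ for Λ is a one-line change (a small illustrative check; the grid of z values is arbitrary):

    z <- seq(-3, 3, by = 1.5)
    pnorm(z)                 # probit link: Phi(z), standard normal CDF
    plogis(z)                # logit link: Lambda(z) = exp(z) / (1 + exp(z))
    exp(z) / (1 + exp(z))    # matches plogis(z)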

The Effect of a Change in Xi
- In the OLS / linear probability model, interpreting coefficients was easy:
  - β1 is the impact of a one unit change in xi
- This interpretation checks out formally
- Taking derivatives:
  P(yi = 1|xi) = β0 + β1xi
  ∂P(yi = 1|xi) / ∂xi = β1
- Things are a little less clean with probit/logit
- Can't interpret β1 as the impact of a one unit change in xi anymore!

The Effect of a Change in xi
P(yi = 1|xi) = G(β0 +β1xi)
- Taking derivatives:
  ∂P(yi = 1|xi) / ∂xi = β1 G′(β0 + β1xi)
- The impact of xi is now non-linear
- Downside: harder to interpret
- Upside: no longer have the same effect when, e.g., income goes from
  50,000 to 60,000 as when income goes from 1,050,000 to 1,060,000
- For any set of values xi, β1 G′(β0 + β1xi) is pretty easy to compute

The Effect of a Change in xi for the Probit

P(yi = 1|xi) = Φ(β0 + β1xi)

- Taking derivatives:
  ∂P(yi = 1|xi) / ∂xi = β1 Φ′(β0 + β1xi)
- The derivative of the standard normal CDF is just the PDF:
  Φ′(z) = φ(z)
- So:
  ∂P(yi = 1|xi) / ∂xi = β1 φ(β0 + β1xi)
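In R, φ(·) is dnorm(·), so this partial effect is one line at any value of xi. A small sketch with illustrative parameter values, showing how the effect shrinks away from the middle of the distribution:

    beta_0 <- 0.2; beta_1 <- 0.5
    x_vals <- c(-4, 0, 4)
    beta_1 * dnorm(beta_0 + beta_1 * x_vals)   # approx. 0.039, 0.196, 0.018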

Practical Points: The Effect of a Change in xi
∂P(yi = 1|xi) / ∂xi = β1 φ(β0 + β1xi)

- Because the impact of xi is non-linear, it can be tough to answer
  "what is the impact of xi on yi?" with a single number
- A few approaches:
  1. Choose an important value of xi, e.g. the mean x̄:
     ∂P(yi = 1|x̄) / ∂xi = β1 φ(β0 + β1x̄)
     - This is called the partial effect at the average

Practical Points: The Effect of a Change in xi
∂P(yi = 1|xi) / ∂xi = β1 φ(β0 + β1xi)

- Because the impact of xi is non-linear, it can be tough to answer
  "what is the impact of xi on yi?" with a single number
- A few approaches:
  2. Take the average over all observed values of xi:
     (1/n) Σ_{i=1}^n ∂P(yi = 1|xi) / ∂xi = (1/n) Σ_{i=1}^n β1 φ(β0 + β1xi)
     - This is called (confusingly) the average partial effect
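Both summaries are easy to compute once β0 and β1 are estimated. A small sketch with illustrative parameter values (x_i stands in for any observed regressor vector):

    beta_0 <- 0.2; beta_1 <- 0.5
    x_i    <- rnorm(1000)                                 # stand-in for the observed regressor

    pea <- beta_1 * dnorm(beta_0 + beta_1 * mean(x_i))    # partial effect at the average
    ape <- mean(beta_1 * dnorm(beta_0 + beta_1 * x_i))    # average partial effect
    c(PEA = pea, APE = ape)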

Practical Points: The Effect of a Change in xi
P(yi = 1|xi) = Φ(β0 +β1xi)
- If xi is a dummy variable, it makes sense to avoid all the calculus and simply
  compare the two predicted probabilities directly:
  P(yi = 1|xi = 1) − P(yi = 1|xi = 0) = Φ(β0 + β1) − Φ(β0)
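In R this is just a difference of two pnorm calls (illustrative parameter values):

    beta_0 <- 0.2; beta_1 <- 0.5
    pnorm(beta_0 + beta_1) - pnorm(beta_0)   # about 0.18: effect of switching the dummy from 0 to 1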

Practical Points: No Problem with Many Xi
- So far we have only seen one xi:
  P(yi = 1|xi) = Φ(β0 + β1xi)
- This can easily be extended to many Xi:
  P(yi = 1|Xi) = Φ(β0 + β1x1i + β2x2i + ··· + βkxki)
- Intuition behind the latent variable approach remains the same:
  yi* = β0 + β1x1i + β2x2i + ··· + βkxki + vi
  yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0

Practical Points: No Problem with Many Xi

P(yi = 1|Xi) = Φ(β0 + β1x1i + β2x2i + ··· + βkxki)

- However, this does make partial effects a bit more complicated:
  ∂P(yi = 1|Xi) / ∂x2i = β2 φ(β0 + β1x1i + β2x2i + ··· + βkxki)
- What about partial effects with a more complicated function of Xi?
  P(yi = 1|xi) = Φ(β0 + β1xi + β2xi^2 + ··· + βk ln(xi))

Implementation: Flashback

Implementation of Probit by MLE

- Suppose we have n independent observations of (yi, xi)
  - Where yi is binary
- And suppose we have a probit specification:
  P(yi = 1|xi) = Φ(β0 + β1xi)
- This means that for each i:
  P(yi = 0|xi) = 1 − Φ(β0 + β1xi)
- In other words, P(yi = y|xi) is Bernoulli!
  P(yi = y|xi) = [Φ(β0 + β1xi)]^y [1 − Φ(β0 + β1xi)]^(1−y)

Implementation of Probit by MLE

- Often we write this pdf as a function of the unknown parameters:
  P(yi|xi; β0, β1) = [Φ(β0 + β1xi)]^yi [1 − Φ(β0 + β1xi)]^(1−yi)
- What is the joint density of two independent observations i and j?
  f(yi, yj|xi, xj; β0, β1) = P(yi|xi; β0, β1) × P(yj|xj; β0, β1)
                           = [Φ(β0 + β1xi)]^yi [1 − Φ(β0 + β1xi)]^(1−yi)
                             × [Φ(β0 + β1xj)]^yj [1 − Φ(β0 + β1xj)]^(1−yj)

Implementation of Probit by MLE

- Often we write this pdf as a function of the unknown parameters:
  P(yi|xi; β0, β1) = [Φ(β0 + β1xi)]^yi [1 − Φ(β0 + β1xi)]^(1−yi)
- And what is the joint density of all n independent observations?
  f(Y|X; β0, β1) = Π_{i=1}^n P(yi|xi; β0, β1)
                 = Π_{i=1}^n [Φ(β0 + β1xi)]^yi [1 − Φ(β0 + β1xi)]^(1−yi)

Implementation of Probit by MLE

- Given data Y, define the likelihood function:
  L(β0, β1) = f(Y|X; β0, β1)
            = Π_{i=1}^n [Φ(β0 + β1xi)]^yi [1 − Φ(β0 + β1xi)]^(1−yi)
- Take the log-likelihood:
  l(β0, β1) = log(L(β0, β1)) = log(f(Y|X; β0, β1))
            = Σ_{i=1}^n yi log(Φ(β0 + β1xi)) + (1 − yi) log(1 − Φ(β0 + β1xi))

Implementation of Probit by MLE

- We then have:
  (β̂0^MLE, β̂1^MLE) = arg max over (β0, β1) of l(β0, β1)
- Intuition: values of β0, β1 that make the observed data most likely

Log-likelihood is a Nice Concave Function

[Figure: plot of the log-likelihood, a single-peaked concave function]

Part 3: Implementation of Probit by MLE in R

- It turns out this log-likelihood is globally concave in (β0, β1)
- Pretty easy problem for a computer
- Standard optimization packages will typically converge relatively easily
- Let's try in R

Implementation of Probit in R

- Let's start by simulating some example data
- We will use the latent variable approach:
  yi* = β0 + β1xi + vi
  yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0
- To start, let's define some parameters
- And choose n = 10000
  beta_0 <- 0.2
  beta_1 <- 0.5

Simulating Data in R

yi* = β0 + β1xi + vi

- To generate yi* we need to simulate xi and vi
- We will use the function rnorm(n)
  - Simulates n draws from a standard normal random variable
  x_i <- rnorm(n)
  v_i <- rnorm(n)
- Aside: We've simulated both xi and vi as normal
  - Probit only assumes vi normal
  - Could have chosen xi to be uniform or some other distribution

Simulating Data in R

- With xi and vi in hand, we can generate yi* and yi:
  yi* = β0 + β1xi + vi
  yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0
- To do this in R:
  y_i_star <- beta_0 + beta_1*x_i + v_i
  y_i <- y_i_star > 0
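Putting the simulation fragments above together, a self-contained sketch (the seed and the conversion of y_i to 0/1 are additions for reproducibility; everything else follows the slides):

    set.seed(123)                 # assumed, for reproducibility
    n      <- 10000
    beta_0 <- 0.2
    beta_1 <- 0.5

    x_i <- rnorm(n)               # regressor (need not be normal)
    v_i <- rnorm(n)               # latent error, standard normal => probit

    y_i_star <- beta_0 + beta_1 * x_i + v_i    # latent variable
    y_i      <- as.numeric(y_i_star >= 0)      # the slides use > 0; with a continuous error the difference is immaterial

    mean(y_i)                     # share of "successes" in the simulated sample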

Writing the Likelihood Function in R
- Now, recall the log-likelihood:
  l(β0, β1) = Σ_{i=1}^n yi log(Φ(β0 + β1xi)) + (1 − yi) log(1 − Φ(β0 + β1xi))
- In R, the function Φ(·) is pnorm(·)
- We will define beta = [β0, β1]
  - beta[1] refers to β0
  - beta[2] refers to β1
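The next slides break this sum into pieces and then minimize its negative with optim. Assembled in one place, a minimal sketch might look like the following (it uses the simulated x_i and y_i from above; the starting values c(0, 0) are an assumption):

    # Negative log-likelihood of the probit, as a function of beta = c(beta0, beta1)
    probit_loglik <- function(beta) {
      p <- pnorm(beta[1] + beta[2] * x_i)
      -sum(y_i * log(p) + (1 - y_i) * log(1 - p))   # negated because optim() minimizes
    }

    fit <- optim(par = c(0, 0), fn = probit_loglik)  # starting guesses are an assumption
    fit$par   # should be close to (0.2, 0.5), the true values used in the simulation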

Writing the Likelihood Function in R
l(β0, β1) = Σ_{i=1}^n yi log(Φ(β0 + β1xi)) + (1 − yi) log(1 − Φ(β0 + β1xi))

- Easy to break into a few steps
- To capture log(Φ(β0 + β1xi)):
  l_1 <- log(pnorm(beta[1] + beta[2]*x_i))
- To capture log(1 − Φ(β0 + β1xi)):
  l_2 <- log(1 - pnorm(beta[1] + beta[2]*x_i))
- To capture the whole function l(β0, β1):
  sum(y_i*l_1 + (1 - y_i)*l_2)
- sum(y_i*l_1 + (1 - y_i)*l_2) is just a function of β0, β1
- Here I've just plotted it holding β0 = 0.2
  [Figure: log-likelihood plotted against β1]

Minimizing the Negative Likelihood

- So we have our R function, which we will call probit_loglik
- We just need to find the β0, β1 that maximize this
- Unfortunately, most software is set up to minimize
- Easy solution:
  (β̂0^MLE, β̂1^MLE) = arg max over (β0, β1) of l(β0, β1) = arg min over (β0, β1) of −l(β0, β1)
- So we just define probit_loglik to be −l(β0, β1)

Maximum Likelihood Estimation

- To find the β0, β1 that maximize the likelihood, use the function:
  optim(par = ·, fn = ·)
- This finds the parameters (par) that minimize a function (fn)
- Takes two arguments:
  - par: starting guesses for the parameters to estimate
  - fn: what function to minimize

Menti: Estimate a Logit via Maximum Likelihood

- On the hub, you'll find data: logit_data.csv
- Data is identical to yi, xi before, except vi is simulated from a standard logistic distribution
- Everything is identical in the likelihood, except instead of Φ(z), we have:
  Λ(z) = exp(z) / (1 + exp(z))

Part 4: Censoring and Truncation

- So far we have focused on binary dependent variables
- Two other common ways in which yi may be limited are:
  - Censoring
  - Truncation
- The censored regression model and likelihood

Censoring

- An extremely common data issue is censoring:
  - We only observe yi* if it is below (or above) some threshold
  - We see xi either way
- Example: Income is often top-coded
  - That is, we might only see whether income is > £100,000
- Formally, we might be interested in yi*, but see:
  yi = min(yi*, ci)
  where ci is a censoring value

An Example of Censored Data

Truncation
- Similar to censoring is truncation
- We don't observe anything if yi* is above some threshold
  - e.g.: we only have data for those with incomes below £100,000

An Example of Truncated Data: No one over £100,000

Terminology: Left vs. Right
- If we only see yi* when it is above some threshold, it is left censored
  - We still see other variables xi regardless
- If we only see yi* when it is below some threshold, it is right censored
  - We still see other variables xi regardless
- If we only see the observation when yi* is above some threshold ⇒ left truncated
  - Not able to see other variables xi in this case
- If we only see the observation when yi* is below some threshold ⇒ right truncated
  - Not able to see other variables xi in this case

Censored Regression
- Suppose there is some underlying outcome:
  yi* = β0 + β1xi + vi
- Again depends on observable xi
- And unobservable vi
- We only see continuous yi* if it is above/below some threshold:
  - Hours worked: yi = max(0, yi*)
  - Income: yi = min(£100,000, yi*)
- And assume: vi ~ N(0, σ²), with vi ⊥ xi

Censored Regression
yi* = β0 + β1xi + vi
yi = min(yi*, ci)
vi|xi, ci ~ N(0, σ²)

- What is P(yi = ci|xi)?
  P(yi = ci|xi) = P(yi* ≥ ci|xi)
                = P(vi/σ ≥ (ci − β0 − β1xi)/σ)
                = 1 − Φ((ci − β0 − β1xi)/σ)

Censored Regression
yi* = β0 + β1xi + vi
yi = min(yi*, ci)
vi|xi, ci ~ N(0, σ²)

- For yi < ci, what is f(yi|xi)?
  f(yi|xi) = (1/σ) φ((yi − β0 − β1xi)/σ)

Censored Regression

yi* = β0 + β1xi + vi
yi = min(yi*, ci)
vi|xi, ci ~ N(0, σ²)

- So in general:
  f(yi|xi, ci) = 1{yi ≥ ci} [1 − Φ((ci − β0 − β1xi)/σ)] + 1{yi < ci} (1/σ) φ((yi − β0 − β1xi)/σ)
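By analogy with the probit code in Part 3, this density can be turned into a log-likelihood and handed to optim. The sketch below is an assumed implementation rather than code from the slides: it takes a common right-censoring value c for all observations and parameterizes the standard deviation as exp(log_sigma) so the optimizer can search over an unrestricted scale.

    # Negative log-likelihood for the censored regression model,
    # with right-censoring at a known constant c: y_i = min(y_i_star, c)
    censored_loglik <- function(theta, y_i, x_i, c) {
      beta_0 <- theta[1]
      beta_1 <- theta[2]
      sigma  <- exp(theta[3])                     # keeps sigma > 0 during optimization
      index  <- beta_0 + beta_1 * x_i

      censored  <- (y_i >= c)                     # observations stuck at the threshold
      ll_cens   <- log(1 - pnorm((c - index) / sigma))
      ll_uncens <- dnorm((y_i - index) / sigma, log = TRUE) - log(sigma)

      -sum(ifelse(censored, ll_cens, ll_uncens))  # negated because optim() minimizes
    }

    # Hypothetical usage, assuming simulated y_i, x_i and a known threshold c:
    # fit <- optim(par = c(0, 0, 0), fn = censored_loglik, y_i = y_i, x_i = x_i, c = c)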