EC994: Classification Errors and "Optimal" Decision Rules

Logistic Regression

Logistic Regression
- Logistic regression is a discriminative learning method that models P(Y = y|X) directly.


- Classification decisions are then usually made based on decision boundaries, which are estimated using various different methods.
- (Plain) logistic regression estimates linear decision boundaries.

Linear Decision Boundary Visualized

[Figure: scatter plot of the two predictors B3 and B4 with the estimated linear decision boundary separating the two classes (shown in three build-up steps in the original slides).]
Why Linear Regression Won't Work

- We want to obtain estimates of P(Y = y|X). Linear regression may not be a good idea, because it can easily produce fitted values that imply probabilities > 1 or < 0.
- Nevertheless, linear probability models of the form P(Y = c|X) = \beta_0 + \sum_{k=1}^{p} \beta_k X_k are very popular in empirical economics papers.

Logistic Regression

- Logistic regression takes the form P(Y = c|X) = h(\beta_0 + \sum_{k=1}^{p} \beta_k X_k), where h(z) = \frac{e^z}{1 + e^z} is the sigmoid function.
- Note that as z \to -\infty, h(z) \to 0; at z = 0, h(z) = 1/2; and as z \to \infty, h(z) \to 1.

Logistic Regression

Substituting:

P(Y = 1|X) = \frac{e^{\beta_0 + \sum_{k=1}^{p} \beta_k X_k}}{1 + e^{\beta_0 + \sum_{k=1}^{p} \beta_k X_k}}

[Figure: the sigmoid function h(z) plotted for z between -10 and 10; it ranges between 0 and 1.]

Logistic Regression... is linear regression in disguise

Suppose you call P(Y = 1|X) = p(X); then the logistic function becomes

p(X) = \frac{e^{\beta_0 + \sum_{k=1}^{p} \beta_k X_k}}{1 + e^{\beta_0 + \sum_{k=1}^{p} \beta_k X_k}} = \frac{e^{X\beta}}{1 + e^{X\beta}}

You can rearrange this as

\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \sum_{k=1}^{p} \beta_k X_k} = e^{X\beta}

This looks almost like a linear model, and the LHS has an intuitive interpretation: the numerator is P(Y = 1|X), while the denominator is P(Y = 0|X). This ratio is called the "odds". Taking logs:

\log\left(\frac{p(X)}{1 - p(X)}\right) = X\beta

Estimating Logistic Regression using MLE

The coefficient vector \beta is unknown. We will estimate it by maximizing the likelihood function → Maximum Likelihood. We assume that the individual pairs of observations are independent of one another.

L(\beta|X) = \prod_{i: y_i = 1} p(x_i) \prod_{j: y_j = 0} (1 - p(x_j))

This is the joint likelihood of observing, out of a sample of N observations, a subset I with y_i = 1 and a complementary subset J with y_j = 0.

Estimating Logistic Regression using MLE

In the binary case, since y_i \in \{0, 1\}, we can rewrite the likelihood of an individual observation y_i as

p(x_i'\beta)^{y_i} (1 - p(x_i'\beta))^{1 - y_i}

The joint likelihood becomes:

L(\beta|X) = \prod_{i=1}^{n} p(x_i'\beta)^{y_i} (1 - p(x_i'\beta))^{1 - y_i}

Rewriting the Log Likelihood Function

Taking logs:

\log(L(\beta|X)) = \sum_{i=1}^{n} y_i \log(p(x_i'\beta)) + (1 - y_i) \log(1 - p(x_i'\beta))

You want to find a vector \beta that maximizes the above expression. Since the function is concave, a maximum will satisfy a first order condition:

\frac{\partial \log(L(\beta|X))}{\partial \beta} = \sum_{i=1}^{n} (y_i - p(x_i'\beta)) x_i = 0

The reason why the FOC is relatively easy to derive is that the derivative of the logistic function is simple. One can show that h'(z) = h(z)(1 - h(z)). We do a homework on this!

A simple example: Logistic Regression

Let's look at the example that we have plotted earlier already. Here, we fit a logistic regression of a dummy variable on two variables (B3 and B4) and an intercept.

[Figure: scatter of B3 against B4 with the fitted linear decision boundary.]

A simple example: Logistic Regression

The output in R looks as follows:

> summary(glm.fit)
glm(formula = Forested ~ B3 + B4, family = binomial(link = logit),
data = DF.plot)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.600 -0.330 0.195 0.324 2.715
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.146      0.289    3.97  7.3e-05 ***
B3            -2.398      0.348   -6.90  5.3e-12 ***
B4            -0.997      0.239   -4.08  4.5e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 269.20 on 199 degrees of freedom
Residual deviance: 107.48 on 197 degrees of freedom
AIC: 113.5
Number of Fisher Scoring iterations: 6
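
As a check on the FOC derived above, the same estimates can be recovered by maximizing the log-likelihood directly rather than calling glm(). A minimal sketch (not the lecture's code), assuming a data frame DF.plot with columns Forested (coded 0/1), B3 and B4 as in the output above:

# Direct maximum likelihood for the logistic model, as a cross-check on glm().
X <- cbind(1, DF.plot$B3, DF.plot$B4)        # design matrix with an intercept
y <- as.numeric(DF.plot$Forested)

negloglik <- function(beta) {
  p <- 1 / (1 + exp(-(X %*% beta)))          # sigmoid h(X beta)
  -sum(y * log(p) + (1 - y) * log(1 - p))    # minus the log-likelihood
}

fit <- optim(par = rep(0, 3), fn = negloglik, method = "BFGS")
fit$par                                      # should be close to coef(glm.fit)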

A simple example: Logistic Regression and MAP Rule
- The model is estimated by Maximum Likelihood on the log-odds scale, so we have
  \log\left(\frac{p(X)}{1 - p(X)}\right) = \hat\beta_0 + X_1 \hat\beta_1 + X_2 \hat\beta_2
- with \hat\beta_0 = 1.146, \hat\beta_1 = -2.398 and \hat\beta_2 = -0.997.
- We minimize the Bayes error rate if we follow the MAP decision rule.
- MAP decision rule: set \hat{y}_i = 1 if \hat{P}(y_i = 1|x_i) > 1/2.
- Now \log(0.5/(1 - 0.5)) = 0, so the MAP decision rule translates into
  \hat\beta_0 + X_1 \hat\beta_1 + X_2 \hat\beta_2 > 0

A simple example: Logistic Regression and Decision Boundary
We have seen that the MAP decision rule translates to

\hat\beta_0 + X_1 \hat\beta_1 + X_2 \hat\beta_2 > 0

The decision boundary is the set of points at which the log-odds are exactly zero, so we can rewrite it as an equation:

X_2 = -\frac{\hat\beta_0}{\hat\beta_2} - \frac{\hat\beta_1}{\hat\beta_2} X_1

Which in this case is:

X_2 = \frac{1.146}{0.997} - \frac{2.398}{0.997} X_1

which naturally we can plot, since it is a straight line with an intercept and a slope.
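
A minimal sketch of how this boundary could be overlaid on the scatter plot (illustrative, not the lecture's code), assuming the data frame DF.plot and the fitted object glm.fit from above, with X1 = B3 and X2 = B4:

# Scatter of the two predictors with the implied linear decision boundary.
b <- coef(glm.fit)                                  # (Intercept), B3, B4
plot(DF.plot$B3, DF.plot$B4,
     col = ifelse(DF.plot$Forested == 1, "darkgreen", "orange"),
     xlab = "B3", ylab = "B4")
abline(a = -b["(Intercept)"] / b["B4"],             # boundary intercept
       b = -b["B3"] / b["B4"],                      # boundary slope
       lwd = 2)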

An Example with Real Data: Mortgage Applications
- We start with an example where |C| = c = 2, i.e. a binary problem. For this purpose, we use data from the US Home Mortgage Disclosure Act (HMDA).
- The HMDA requires certain financial institutions to provide data on all mortgage applications and the decisions on these applications, along with some characteristics.
- It is a pretty big database: in 2012, 7,400 institutions reported a total of 18.7 million HMDA records.
- We want to see which patterns predict whether a mortgage {Defaulted, No default}. This can be expressed as a dummy variable, where

Y = \begin{cases} 1 & \text{if Defaulted} \\ 0 & \text{if No default} \end{cases}

Summary statistics
> head(MORTGAGE[,c("Default","Leverage","Minority","ApplicantIncome","LoanAmount","Female",
                   "LoanPurpose","Occupancy"),with=F])
   Default Leverage Minority ApplicantIncome LoanAmount Female   LoanPurpose      Occupancy
1:     ...     2.54    FALSE              61        155      0   Refinancing Owner-occupied
2:     ...     1.50    FALSE              40         60      1 Home purchase Owner-occupied
3:     ...     2.29    FALSE             136        311      0   Refinancing Owner-occupied
4:     ...     1.39    FALSE              31         43      1 Home purchase Owner-occupied
5:     ...     3.99    FALSE              90        359      1   Refinancing Owner-occupied
6:     ...     5.62    FALSE              53        298      0   Refinancing Owner-occupied

> summary(MORTGAGE[,c("Default","Leverage","Minority","ApplicantIncome","LoanAmount","Female",
                      "LoanPurpose","Occupancy"),with=F])
  Default          Leverage       Minority       ApplicantIncome   LoanAmount       Female
 Mode :logical   Min.   :0.01   Mode :logical   Min.   :  1      Min.   :  1    Min.   :0.000
 FALSE:102481    1st Qu.:1.51   FALSE:98041     1st Qu.: 46      1st Qu.: 97    1st Qu.:0.000
 TRUE :16354     Median :2.27   TRUE :20794     Median : 70      Median :151    Median :0.000
                 Mean   :2.43                   Mean   : 78      Mean   :170    Mean   :0.298
                 3rd Qu.:3.18                   3rd Qu.:103      3rd Qu.:228    3rd Qu.:1.000
                 Max.   :9.96                   Max.   :199      Max.   :499    Max.   :1.000

            LoanPurpose                  Occupancy
 Home improvement: 5912   Non Owner-occupied:  8002
 Home purchase   :43119   Owner-occupied    :110833
 Refinancing     :69804

Some graphs…

Estimating Logistic Regression using MLE
> glm.fit <- glm(Default ~ Leverage + Minority + ApplicantIncome + Female + Occupancy + LoanPurpose,
+                data = MORTGAGE, family = binomial(link = logit))
> summary(glm.fit)

glm(formula = Default ~ Leverage + Minority + ApplicantIncome +
    Female + Occupancy + LoanPurpose, family = binomial(link = logit),
    data = MORTGAGE)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -1.394  -0.570  -0.499  -0.427   2.529

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)               -0.203883   0.046629   -4.37  1.2e-05 ***
Leverage                   0.048997   0.006889    7.11  1.1e-12 ***
MinorityTRUE               0.332573   0.021017   15.82  < 2e-16 ***
ApplicantIncome           -0.006041   0.000248  -24.36  < 2e-16 ***
Female                     0.115838   0.018498    6.26  3.8e-10 ***
OccupancyOwner-occupied   -0.520638   0.031582  -16.49  < 2e-16 ***
LoanPurposeHome purchase  -1.260043   0.035230  -35.77  < 2e-16 ***
LoanPurposeRefinancing    -0.838062   0.033596  -24.94  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 95215  on 118834  degrees of freedom
Residual deviance: 92564  on 118827  degrees of freedom
AIC: 92580

Number of Fisher Scoring iterations: 4

Interpreting the Output...

- The reported coefficients tell you the marginal effect of a change in some X_i on the log-odds ratio.
- Hence, you can interpret the signs, but the coefficients are not the marginal effects in terms of probabilities.
- The marginal effects on the probabilities are not constant.
- The effect on the odds of a 1-unit increase in Leverage is exp(0.049) ≈ 1.05, meaning the odds increase by around 5%.
- Note: for small x, e^x ≈ 1 + x.
- Let's look at the range of the predicted probabilities for a more saturated model:

> predpr <- predict(glm.fit, type = c("response"))
> summary(predpr)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.041 0.103 0.127 0.138 0.159 0.621
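
To read the estimates on the odds scale directly, one can exponentiate the coefficients; a small sketch, assuming the fitted object glm.fit from above:

# Coefficients on the odds scale (odds ratios per one-unit change).
exp(coef(glm.fit))
# e.g. Leverage: exp(0.049) is about 1.05, i.e. roughly 5% higher odds of
# default per additional unit of leverage, holding the other covariates fixed.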

Nonlinear Decision Boundaries for Logistic Regression
- Logistic regression as we have formulated it will always generate linear decision boundaries.
- However, you can easily make the decision boundary non-linear by allowing e.g. for polynomials (see the sketch after this list). For example:

  \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_2^2

- As with linear regression, you can use cross validation / the validation set approach to determine the optimal model complexity in delivering "best performance".
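
A minimal sketch of how polynomial terms could be added in R (illustrative, not the lecture's code), assuming the MORTGAGE data from above:

# Quadratic terms in Leverage and ApplicantIncome make the decision boundary
# nonlinear in the original predictors.
glm.poly <- glm(Default ~ poly(Leverage, 2) + poly(ApplicantIncome, 2) +
                  Minority + Female + Occupancy + LoanPurpose,
                data = MORTGAGE, family = binomial(link = logit))
summary(glm.poly)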

Nonlinear Decision Boundaries for Logistic Regression

[Figures: examples of fitted nonlinear (polynomial) decision boundaries.]

Model Selection Approaches
- Further, best subset selection and shrinkage methods naturally apply as well!
- But rather than optimizing the RSS subject to a constraint, you now maximize the log likelihood minus a penalty, e.g. with a lasso-type penalty (see the sketch below):

  \sum_{i=1}^{n} \left[ y_i \log(p(x_i'\beta)) + (1 - y_i) \log(1 - p(x_i'\beta)) \right] - \lambda \sum_{k=1}^{p} |\beta_k|
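
One way to implement penalized logistic regression in R is the glmnet package (an illustration, not necessarily the lecture's choice); a minimal sketch assuming the MORTGAGE data from above:

# Cross-validated lasso (alpha = 1) on the penalized binomial log-likelihood.
library(glmnet)
x <- model.matrix(Default ~ Leverage + Minority + ApplicantIncome + Female +
                    Occupancy + LoanPurpose, data = MORTGAGE)[, -1]
y <- as.numeric(MORTGAGE$Default)
cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv.fit, s = "lambda.min")     # coefficients at the CV-selected penalty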

EC994: Classification Errors and "Optimal" Decision Rules

Classification Error

Test and Training Error
- As with numeric responses, we are interested in making accurate decisions in terms of assigning labels.
- We have argued that the intuitive Maximum A Posteriori (MAP) decision rule is optimal, i.e. it minimizes the overall error rate:

  \hat{Y} = \arg\max_{y \in C} \hat{P}(Y = y|X)

- However, as with regression, we need to worry about overfitting the training data.
- In contrast to regression, we can better distinguish the types of errors we make.

Two Types of Errors
Important: the MAP decision rule minimizes the overall error rate. But this may come at the expense of high (low) type 1 versus low (high) type 2 error rates!

Precision, Recall and Accuracy
- Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

Precision, Recall and Accuracy
- Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  Also called "sensitivity".

Precision, Recall and Accuracy
- Specificity = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}

Precision, Recall and Accuracy
- Accuracy = \frac{\text{True Positives} + \text{True Negatives}}{\text{All Cases}}
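
These four definitions map directly onto the cells of a 2x2 confusion table; a minimal sketch of a helper that computes them from the counts (illustrative, with the counts supplied by hand):

# Metrics from the cells of a 2x2 confusion table.
class_metrics <- function(TP, FP, FN, TN) {
  c(precision   = TP / (TP + FP),
    recall      = TP / (TP + FN),     # also called sensitivity
    specificity = TN / (TN + FP),
    accuracy    = (TP + TN) / (TP + FP + FN + TN))
}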

For Reference: Many synonyms.
Figure: Synonyms for Type 1, Type 2 errors taken from Hastie et al, 2014.

MAP Decision Rule and Error Types
- Suppose you were to suggest to screen only bad loan applications intensively, i.e. those whose predicted probability of default is above a threshold c̄.
- The MAP decision rule says that the overall error rate is minimal if you set c̄ = 1/2.
- Let's see how we would do in this case.

Classification and MAP Rule: How are we doing?
The MAP decision rule says that the overall error rate is minimal if you set c̄ = 1/2.

> Defaulted=as.character(MORTGAGE[test]$Defaulted)
> glm.probs=predict(glm.fit,MORTGAGE[test],type="response")
> glm.pred=rep("Regular check",length(glm.probs))
> glm.pred[glm.probs>.5]="Intense check"
> addmargins(table(glm.pred,Defaulted))
               Defaulted
glm.pred        Default Non default  Sum
  Intense check       6           7   13
  Regular check     123         864  987
  Sum               129         871 1000

How do we do?

- Accuracy = (6 + 864)/1000 = 87% correctly classified.
- Precision = 6/(6 + 7) = 46%.
- Recall = 6/(6 + 123) ≈ 5%.
- Specificity = 864/871 = 99%.

We get really high accuracy (and specificity) because most mortgages do not default!
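
These numbers can be reproduced for any cutoff with a small helper function (a sketch, assuming glm.probs and Defaulted as constructed above):

# Confusion-table metrics for an arbitrary cutoff cbar.
evaluate_cutoff <- function(cbar, probs = glm.probs, actual = Defaulted) {
  pred <- ifelse(probs > cbar, "Intense check", "Regular check")
  tab  <- table(pred, actual)
  TP <- tab["Intense check", "Default"];  FP <- tab["Intense check", "Non default"]
  FN <- tab["Regular check", "Default"];  TN <- tab["Regular check", "Non default"]
  c(accuracy    = (TP + TN) / sum(tab),
    precision   = TP / (TP + FP),
    recall      = TP / (TP + FN),
    specificity = TN / (TN + FP))
}
evaluate_cutoff(0.5)
evaluate_cutoff(0.2)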

Not following MAP Rule…
Suppose you set c̄ = 0.2; out of our test sample of 1000 loan applications:

> Defaulted=as.character(MORTGAGE[test]$Defaulted)
> glm.probs=predict(glm.fit,MORTGAGE[test],type="response")
> glm.pred=rep("Regular check",length(glm.probs))
> glm.pred[glm.probs>.2]="Intense check"
> addmargins(table(glm.pred,Defaulted))
               Defaulted
glm.pred        Default Non default  Sum
  Intense check      46         119  165
  Regular check      83         752  835
  Sum               129         871 1000

How do we do?

- Accuracy = (46 + 752)/1000, or almost 80% correctly classified.
- Precision = 46/(46 + 119) = 28%.
- Recall = 46/(46 + 83) = 36%.
- Specificity = 752/871 = 86%.

We are trading off true positives against true negatives.

Visualizing Accuracy, Recall (Sensitivity) and Specificity

[Figure: accuracy, recall (sensitivity) and specificity plotted as functions of the cutoff c̄.]
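
Curves like these can be generated by sweeping the cutoff and recording each metric; a minimal sketch, reusing the hypothetical evaluate_cutoff() helper defined earlier (assumes glm.probs and Defaulted from above):

# Sweep cutoffs and plot accuracy, sensitivity and specificity.
cbars   <- seq(0.05, 0.6, by = 0.01)
metrics <- t(sapply(cbars, evaluate_cutoff))
matplot(cbars, metrics[, c("accuracy", "recall", "specificity")],
        type = "l", lty = 1, xlab = "cutoff", ylab = "rate")
legend("bottomright", legend = c("Accuracy", "Recall (Sensitivity)", "Specificity"),
       col = 1:3, lty = 1)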

Alternative costs
Suppose we again classify with the MAP cutoff:

> glm.pred[glm.probs>.5]="Intense check"
> addmargins(table(glm.pred,Defaulted))

Suppose our costs for false positives versus false negatives are

                                        Actual outcome y_i
                                        Default        Non-default
Predicted outcome   Intense check       \gamma_{1,1}   \gamma_{1,2}
\hat{y}_i           Regular check       \gamma_{2,1}   \gamma_{2,2}

What are the expected misclassification costs?

C(y_i | \hat{y}_i = \text{Intense check}) = \gamma_{1,1} \frac{TP}{TP + FP} + \gamma_{1,2} \frac{FP}{TP + FP}

C(y_i | \hat{y}_i = \text{Regular check}) = \gamma_{2,1} \frac{FN}{FN + TN} + \gamma_{2,2} \frac{TN}{FN + TN}

Alternative costs
What are the expected misclassification costs?

C = p(\hat{y}_i = \text{Intense check}) \times C(y_i | \hat{y}_i = \text{Intense check})
  + p(\hat{y}_i = \text{Regular check}) \times C(y_i | \hat{y}_i = \text{Regular check})

Suppose \gamma_{1,1} = \gamma_{2,2} = 0. We can derive conditions on \gamma_{1,2} and \gamma_{2,1} that give us the lowest expected misclassification cost. Note that the FP/FN counts are a function of the underlying cutoff c̄, so there is a different cost function for every different value of c̄.
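
With \gamma_{1,1} = \gamma_{2,2} = 0 the expected cost collapses to (\gamma_{1,2} FP + \gamma_{2,1} FN)/N, which can be minimized over c̄ numerically. A small sketch with hypothetical cost values (assumes glm.probs and Defaulted from above):

# Expected misclassification cost as a function of the cutoff cbar,
# with gamma_{1,1} = gamma_{2,2} = 0 and hypothetical costs g12 (false positive)
# and g21 (false negative).
expected_cost <- function(cbar, g12 = 1, g21 = 5) {
  pred <- glm.probs > cbar                        # TRUE = "Intense check"
  FP <- sum(pred  & Defaulted == "Non default")   # checked but would not default
  FN <- sum(!pred & Defaulted == "Default")       # defaulters we fail to check
  (g12 * FP + g21 * FN) / length(glm.probs)
}
cbars <- seq(0.05, 0.6, by = 0.01)
costs <- sapply(cbars, expected_cost)
cbars[which.min(costs)]                           # cutoff with lowest expected cost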

Trade off between Sensitivity and Specificity

- As we increase c̄, there is a trade off between the type 1 and type 2 errors that occur.
- For very low c̄, a lot of loans are assigned to be intensively checked, resulting in many false positives (loans that would not have defaulted being intensively checked), but relatively few false negatives: high sensitivity and low specificity.
- As we increase c̄, fewer loans are intensively checked; this reduces the false positive cases but increases the false negative cases (we fail to scrutinize high default probability cases): low sensitivity and high specificity.
- Overall, since most loans do not default (87.1% in the test sample), the overall increase in accuracy as we increase c̄ is driven by specificity.

ROC Curve to Visualize the Trade Off between Sensitivity and Specificity

- The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
- It comes from communications theory and stands for "receiver operating characteristics".
- It plots Sensitivity against Specificity as we vary c̄ (without showing c̄ itself).
- The ideal ROC curve sits in the top left corner (100% specificity and 100% sensitivity).
- The 45 degree line is the classifier that assigns observations to classes in a random fashion (e.g. a coin toss).
- The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC); see the sketch below.
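
One way to draw the ROC curve and compute the AUC in R is the pROC package (an illustration, not necessarily the lecture's code; assumes glm.probs and Defaulted from above):

# ROC curve and AUC for the mortgage default model.
library(pROC)
roc.obj <- roc(response = Defaulted, predictor = glm.probs,
               levels = c("Non default", "Default"))
plot(roc.obj)      # specificity on the x-axis, sensitivity on the y-axis
auc(roc.obj)       # area under the ROC curve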

ROC Curve in our example
[Figure: ROC curve for the mortgage default model, with Specificity on the horizontal axis and Sensitivity on the vertical axis.]

Choice of Decision Rule depends on cost
- The aim of this exercise was to highlight that the choice of decision criterion, i.e. the cutoff c̄, need not be dictated by the MAP decision rule that guarantees, on average, the best prediction performance.
- The question really is what the associated costs are of the different classification outcomes, i.e. of false positives versus false negatives.
- The optimal choice of c̄ would take these potentially different costs into account and may give solutions that are far away from 1/2.
