Using Survival Analysis to build Credit Scoring Models

Traditional approach to credit scoring
Data on past customers

Define ‘good’ & ‘bad’ by performance over a given time horizon (i.e. a fixed outcome period).
Build a model to identify which characteristics best separate the goods from the bads.
Logistic Regression

log(pi / (1 - pi)) = b0 + b1X1 + … + bnXn

where pi is the probability that applicant i is good and X1, …, Xn are the characteristics of applicant i.

The line Age+Income=50 separates the goods from the bads.
Arbitrary: if the time horizon is T, default at T-1 is bad, default at T+1 is good (or at least indeterminate).
Lose information: indeterminates are left out, and those who fail at 3 months are classified the same as those who fail at T-1 months.
Competing risks ignored: those who “attrite” during the outcome period are left out of default scorecard building, and vice versa.
Lecture 4b Survival analysis

The standard approach to building a credit scorecard (a system of rules) is to take a sample of past customers – their application forms and subsequent credit history.
Then define which history is “good” and which is “bad”; for example, three consecutive missed payments is a usual definition of default (i.e. “bad”).
Build a model to identify which characteristics (taken from the application form) best separate the goods from the bads.
A simplistic example: if we consider only two characteristics, Age and Income, then the line Age+Income=50 separates the two groups.
The industry standard is logistic regression.
We try to find the linear combination of characteristics that best explains the probability of being good.
This is then used on a new applicant to predict the probability of being “good”. If this predicted probability is high, the application is approved and credit is granted.
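The prediction step can be sketched as follows. This is a minimal illustration: the coefficients, intercept and the choice of the two characteristics (age, income) are hypothetical, not fitted values from the lecture.

```python
import math

def prob_good(x, coeffs, intercept):
    """Logistic scorecard: the log-odds are linear in the characteristics,
    so p(good) = 1 / (1 + exp(-(b0 + b.x)))."""
    z = intercept + sum(b * xi for b, xi in zip(coeffs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical applicant: age 35, income 30k; hypothetical coefficients.
p = prob_good([35, 30], coeffs=[0.02, 0.03], intercept=-1.0)  # ~0.65
```

If `p` exceeds the lender's cutoff, the application is approved.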

4b.1 Why ask “when”, not “if”

“If” takes a time period and asks whether, within this time period,
the consumer will default / will not default
the consumer will leave the company / will stay with the company
the consumer will make a new purchase / will not make a new purchase
Asking instead “when” events happen – default, leaving, purchase –
gives a handle on profit, as profit depends on the time until certain events occur (default, switching lenders)
does not require any choice of time horizon, so no arbitrariness or loss of information
uses the data on everyone, so no loss of information
can also deal with censored data, so censored records need not be discarded
allows competing-risks models, so default, purchase and attrition models can be built on the same data.

Until recently, the definition of good and bad in such models was simply whether the customer defaulted or not, but in the last few years lending organisations have wanted models that enable them to choose the customers most profitable to the organisation.
This requires a number of changes to the models, and one of these is that it now becomes important to know not just whether a customer will default but how long before they do.
It may be worth taking on someone who is likely to default if it is also likely to be a long time before they do so: the interest earned may then exceed the losses due to default.
Another factor affecting profit is customers closing their accounts early (paying off the loan early) by switching to another lender or for other reasons. Again, one wants to know: how long before they switch?

Censoring Mechanism
[Figure: loan histories plotted against months on books, showing accounts censored because the account was closed, because the history was truncated at the end-of-sample date, or because it was both truncated and started after the start of the sample.]

We have two types of events in our analysis: default and censoring. People who did not default between Jan 2001 and Dec 2004 – for example, those who closed their accounts during this period or whose accounts were truncated at Dec 2004 (the end of the observation period) – are treated as censored.

4b.2 Using Survival Analysis
Survival analysis – the analysis of lifetime data in the presence of censoring.
Lifetime T – length of time before a loan defaults (or is repaid early, or a purchase is made).
Standard ways of describing the randomness of T are:
distribution function F(t), where F(t) = Prob{T ≤ t}
(S(t) = 1 - F(t) is the survivor function)
density function f(t), where Prob{t ≤ T ≤ t+δt} = f(t)δt
hazard function h(t) = f(t)/(1 - F(t)), so h(t)δt = Prob{t ≤ T ≤ t+δt | T ≥ t}
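These three descriptions are consistent with one another. A quick numerical check, using a hypothetical exponential lifetime with rate 0.1 per month (for the exponential distribution the hazard is constant):

```python
import math

lam = 0.1  # hypothetical monthly rate, for illustration only

def F(t): return 1 - math.exp(-lam * t)    # distribution function
def f(t): return lam * math.exp(-lam * t)  # density
def S(t): return 1 - F(t)                  # survivor function
def h(t): return f(t) / S(t)               # hazard = f / (1 - F)

# For the exponential lifetime, h(t) = lam at every age t.
```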

So, the question now is how long customers survive before they default or pay off early. This is similar to asking how long deteriorating systems survive before failure, which is exactly the question survival analysis deals with.
It is therefore reasonable to consider how survival analysis techniques that have proven to be very successful in medical and reliability studies can be used to build practical credit scoring models.

Survival analysis is the area of statistics concerned with the analysis of lifetime data.
Traditional lifetime data come from medical studies, where a number of terminally ill patients are monitored and their times to death recorded. Lifetime data thus consist of times to some event of interest.
Often the event of interest cannot be observed: the hospital may lose contact with a patient, so the time of death is known to be after the last contact date but not exactly when. This is called censored data. It still contains information that can be used, and survival analysis allows it to be incorporated into the model. In the consumer credit context this corresponds to a customer who never defaults or never pays off early: we do not observe the event of interest, but we still want to include the customer's lifetime in our analysis.
Sometimes one wishes to compare two sub-populations (e.g subjected to two different treatments) – by comparing survival times. In our case we compare survival times of customers with different application characteristics.

Hazard Function
T – r.v. representing failure time (time to default/early pay-off)
Hazard function

Here are some notation and definitions.
Let T be the random variable representing time until failure, i.e. time until default or early pay-off.
Then one way to describe the distribution of T is the hazard function, defined as
the probability that an individual fails at time t, conditional on having survived to that time
(or, the probability that an individual pays off early at time t, conditional on having stayed on the books up to that time).
It can also be thought of as the age-specific rate of failure.
Here are some examples of hazard functions.
This is the hazard rate for humans: the risk of death is high at a very young age, flattens out at about 8-10 years, then rises slowly after about 30 and more steeply after 45.
This is the hazard rate for default. We have to ignore the first 3 months, because the definition of default is 3 months delinquent. The default rate is highest at the start of the loan and then decreases with time.

Two big results of survival analysis
1. Kaplan–Meier estimation of the survival function (1950s)

defaults at months 2,3,3,4,5
Censored data at months 1,3,4,4,5

2. Cox proportional hazards model (1972)

Month                     1     2     3     4     5
No. at start of month     10    9     8     5     2
No. defaulting in month   0     1     2     1     1
h(t)                      0     1/9   1/4   1/5   1/2
S(t) = S(t-1)(1-h(t))     1     8/9   2/3   8/15  4/15
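The table above can be reproduced with a short Kaplan-Meier calculation. This is a sketch using exact fractions so the results match the slide's values:

```python
from fractions import Fraction

def kaplan_meier(defaults, censored, months):
    """Kaplan-Meier: at each month t, h(t) = d_t / n_t, where n_t is the
    number still at risk at the start of the month, and
    S(t) = S(t-1) * (1 - h(t))."""
    n = len(defaults) + len(censored)   # everyone at risk in month 1
    S, surv = Fraction(1), []
    for t in range(1, months + 1):
        d = defaults.count(t)
        h = Fraction(d, n) if n else Fraction(0)
        S *= (1 - h)
        surv.append(S)
        n -= d + censored.count(t)      # leave the risk set after month t
    return surv

# Data from the slide: defaults at months 2,3,3,4,5; censored at 1,3,4,4,5.
surv = kaplan_meier(defaults=[2, 3, 3, 4, 5], censored=[1, 3, 4, 4, 5],
                    months=5)
# surv -> [1, 8/9, 2/3, 8/15, 4/15], matching the S(t) row of the table.
```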

4b.3 Proportional hazards (PH) and
accelerated life (AL) models
Explanatory variables allow for heterogeneity of the population.
Proportional hazards models and accelerated life models connect explanatory variables to failure times in survival analysis.
Let x = (x1, x2, …, xN) be the explanatory variables.
Accelerated life models assume
S(t) = S0(ψ(x)t) or h(t) = ψ(x)h0(ψ(x)t)
where typically ψ(x) = exp(b1x1 + b2x2 + … + bNxN) = exp(b·x).
S0, h0 are the baseline survivor/hazard functions, and the x's speed up or slow down the ‘ageing’ of the system.
Proportional hazards models assume
h(t) = ψ(x)h0(t) = exp(b·x)h0(t)
Explanatory variables have a multiplier effect on the baseline hazard rate.
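A small numerical sketch of the difference between the two model classes, using a hypothetical Weibull baseline S0(t) = exp(-t²) and a hypothetical covariate effect ψ(x) = 2 (for an exponential baseline the two models would coincide, so a Weibull makes the contrast visible):

```python
import math

def S0(t):
    """Hypothetical Weibull baseline survivor function, shape 2."""
    return math.exp(-t ** 2)

psi = 2.0  # hypothetical psi(x) = exp(b.x)

def S_al(t):
    """Accelerated life: time itself is rescaled by psi."""
    return S0(psi * t)

def S_ph(t):
    """Proportional hazards: h = psi*h0 implies S = S0 ** psi."""
    return S0(t) ** psi

# At t = 1: S_al(1) = exp(-4) but S_ph(1) = exp(-2) -- the models differ.
```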

Cox Proportional Hazards Model
(Non-parametric approach)
Cox proportional hazards model:
h(t, x) = h0(t)exp(b·x)
where b is a vector of coefficients,
h0(t) is the unknown baseline hazard (x = 0),
h(t, x) is the hazard for an individual with characteristics x.
So s(x) = -(b1x1 + b2x2 + … + bnxn)
acts like a scorecard (the minus sign ensures that a higher score means a better loan).
Cox showed that one can estimate b without any knowledge of h0(t), using only the ranks of the failure and censored times.
If times are discrete, so there are ‘lots of ties’, an approximation is needed in the maximum likelihood estimator.
So let T be the r.v. representing failure time (time to default/early pay-off) and x the vector of covariates.

Suppose now that on each individual one or more further measurements are available, so we have a vector of covariates x (the application data), and we want to assess the relationship between the distribution of failure time and these covariates.
For example, this graph shows default hazards for refinance (red line) and non-refinance (black line); the hazard rate is much higher for those on refinance. Other covariates influence the hazard rate in a similar fashion: they either decrease or increase it.
Cox proposed the following model to assess these relationships:
the hazard for an individual at time t with covariates x is some baseline hazard multiplied by a function of the covariates,
where beta is a vector of unknown parameters and h0 is the baseline hazard, an unknown function giving the hazard for the standard set of conditions, when all covariates equal 0. For example, if we have only one binary covariate (refinance vs non-refinance), the lower hazard when the covariate equals 0, i.e. for non-refinance, is the baseline hazard.
It is called the proportional hazards model because the assumption is that the hazard of an individual with application characteristics x is proportional to the baseline hazard.
Cox showed one can estimate beta without any knowledge of h0(t), just by using the ranks of the failure and censored times.

Hazard function and survival function

For the proportional hazards model, since h(t, x) = h0(t)exp(b·x), the survivor function satisfies S(t, x) = S0(t)^exp(b·x).
So if p is the probability of a borrower with characteristics x being Good after a fixed time t*, then p = PG(t*, x) = S0(t*)^exp(b·x).
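Under proportional hazards the cumulative hazard scales by exp(b·x), so S(t, x) = S0(t)^exp(b·x) and p can be computed directly from the baseline survival at t*. A sketch, with hypothetical baseline-survival and score values:

```python
import math

def prob_good_ph(baseline_survival_at_t_star, score):
    """P(T > t* | x) = S0(t*) ** exp(b.x) under proportional hazards."""
    return baseline_survival_at_t_star ** math.exp(score)

# Hypothetical: baseline survival 0.95 at t* = 12 months, score b.x = 0.5.
p = prob_good_ph(0.95, 0.5)   # ~0.92: a positive score lowers P(Good)
```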

Comparison of logistic regression
and survival analysis
Logistic regression:
take a performance horizon of t*, and p = PG(t*, x).

Proportional hazards:
can estimate PG(t, x) for any t and x;
consider p = PG(t*, x).

Building a credit scorecard for estimating when customers default using proportional hazards
Take a sample of past customers with their application and bureau characteristics (as usual).
For each, record the time of default or the time at which the history was censored (no further info in sample / customer left the lender).
Coarse-classify variables without using a time horizon.
Run statistical tests to validate the model.
Check the need for time-dependent variables.

Coarse-classifying using PH approach
Split the variable into n binary variables (each covering a category or, for a continuous variable, a range containing 1/n-th of the population).
Apply the PH model with these binary variables as characteristics.
Chart the parameter estimates.
Choose splits based on similarity of the parameter estimates.
Note: it is important to do the splits separately for every type of failure. Here are estimates for default (left) and early repayment (right).

We have tried different approaches and found one which seems to be the most informative and easiest to use. I will explain with an example using our favourite covariate, Purpose of the Loan.
We fit Cox's proportional hazards model to 23 binary variables representing 23 different purposes.
Then we chart the parameter estimates, as shown in these two graphs: the left one for early repayment, the right one for default.
Now we group those purposes that have similar parameter estimates. For example, purposes 6, 4, 10 and 19 form the highest-risk group for default.
Note how different the overall picture is for default and early repayment. This tells us that it is very important to do the splits separately for every type of failure.
We should not use splits derived for early repayment to predict default.
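The grouping step can be sketched as follows. The coefficient values and the gap threshold are hypothetical, chosen only to illustrate merging categories whose estimates lie close together:

```python
def group_categories(estimates, gap=0.15):
    """Coarse-classify: sort categories by their Cox coefficient estimate
    and start a new group whenever the jump to the next estimate
    exceeds `gap`."""
    ordered = sorted(estimates.items(), key=lambda kv: kv[1])
    groups, current = [], [ordered[0][0]]
    for (_, prev_b), (k, b) in zip(ordered, ordered[1:]):
        if b - prev_b > gap:
            groups.append(current)
            current = []
        current.append(k)
    groups.append(current)
    return groups

# Hypothetical default-model estimates for a few loan purposes:
betas = {"purpose_4": 0.61, "purpose_6": 0.58, "purpose_10": 0.55,
         "purpose_19": 0.52, "purpose_1": 0.10, "purpose_2": 0.05}
groups = group_categories(betas)
# -> purposes 19, 10, 6, 4 merge into one high-risk group; 1 and 2 into another
```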

Comparing Logistic Regression Credit Scorecard and Proportional Hazards Credit Scorecard for estimating default risk
Two definitions of “bad”

1. Defaulted on loan in first 12 months
2. still repaying after 12 months but defaulted in the next twelve months.

Two separate LR models, one for each definition.
One PH model predicting time to default.
So the LRs should be best, as each is designed for its specific definition of bad.
Compare models performance using ROC curves

Now we will see how the survival model performs when used for the traditional purpose – classifying applicants into two groups, good and bad – compared with the industry standard, logistic regression.
We used data from a major UK financial institution. It consisted of application information on 50,000 loans covering about three years, plus whatever happened to each loan during the observation time: closed, paid off, paid off early, defaulted or still going.
Here we are concerned only with predicting default.
Two definitions of Bad were constructed:
1st – bads are those who defaulted in the first 12 months;
2nd – bads are those who are still repaying after 12 months but defaulted in the next 12 months.
Two separate LR models were built, one for each of these definitions, and
one PH model was fitted to the times until default (which gives an ordering of relative likelihood to default).
This PH model is then measured under the two criteria (two definitions).
Models were built on a training sample and tested on a holdout.
We compared models using ROC curves, because ROC curves let us compare models independently of the cutoff.
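The cutoff-free comparison rests on the fact that the area under the ROC curve equals the probability that a randomly chosen good outscores a randomly chosen bad. A minimal sketch, with hypothetical holdout scores:

```python
def auc(scores_bad, scores_good):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the fraction of (good, bad) pairs where the good scores higher
    (ties count half)."""
    wins = sum((g > b) + 0.5 * (g == b)
               for g in scores_good for b in scores_bad)
    return wins / (len(scores_good) * len(scores_bad))

# Hypothetical holdout scores (higher score = more likely good):
a = auc(scores_bad=[0.2, 0.4, 0.5], scores_good=[0.5, 0.7, 0.9])
# -> 8.5/9: the model ranks goods above bads in ~94% of pairs
```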

ROC curves for PH and LR predicting default
PH vs LR 1
(1st 12 mths)
PH vs LR 2
(2nd 12 mths)

4b.4 Time Dependent Covariates
In PH (as in standard scorecards) the relative importance of characteristics is the same over all time.
This may not be the case in reality, and one may want the score for an attribute to change the longer the loan lasts.
An example is refinancing as the purpose of a loan.
Define x1 = 1 if the purpose is refinance, 0 otherwise.

Cox's model:
h(t, x1) = h0(t)exp(b1x1)
so exp(b1) is the relative hazard of refinance to any other reason, independent of time on books.
Define x2 = x1t, so Cox's model becomes
h(t, x1, x2) = h0(t)exp(b1x1 + b2x2) = h0(t)exp((b1 + b2t)x1)
and the relative hazard of refinance to others, exp(b1 + b2t), depends on time.

Let's return to the model formulation to show one extension of the model and see how it works on an example.
Suppose we have just one covariate:
x1 = 1 if the purpose of the loan is refinance, 0 otherwise.
Cox's model says the hazard of the customer at time t is a baseline hazard multiplied by some function of the covariate value.
If it is not refinance, x1 = 0 and the hazard equals the baseline hazard.
If x1 = 1, exp(beta1*x1) = exp(beta1); this is called the relative hazard. Notice it is independent of time: no matter how long you stay on the books, you will always be considered more likely to default compared with other loans.
This is questionable. Do risky people go bad early on? Does it matter that you have stayed on long enough and proved to be a good customer?
To check this, define a variable x2 = x1*t (the interaction of x1, the refinance indicator, with time) and add it to the model.
It allows the simplest possible interaction: a linear decrease or increase over time.
Notice that now the relative hazard of refinancers to others is exp(beta1 + beta2*t) – it depends on time.

PH model – no time-by-characteristic interaction

The hazard for a customer on refinance is 1.17 times that for other customers, at all times.

PH model with time-by-characteristic interaction

At t = 1 the hazard for a customer on refinance is 1.36 times that for others.
At t = 36 the hazard for a customer on refinance is 0.96 times that for others.

Fitting Cox's model with no time dependency gives the estimate beta1 = 0.157.
The hazard for a customer on refinance is e^0.157 ≈ 1.17 times that for others.
Add the time-dependent covariate and refit the model.
Now beta1 = 0.32 and beta2 = -0.01, and the relative hazard is time-dependent: at t = 1 it is e^0.31, and at t = 18, when you have stayed on the books for 18 months and haven't defaulted, your relative hazard is e^(0.32 - 0.01*18) = e^0.14.
So the second coefficient takes out, little by little, the effect of the characteristic.
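Using the fitted values quoted above (beta1 = 0.32, beta2 = -0.01), the time-dependent relative hazard can be computed directly:

```python
import math

def relative_hazard(t, b1=0.32, b2=-0.01):
    """Relative hazard of refinance vs other purposes when the model
    includes the interaction x2 = x1 * t: exp(b1 + b2 * t)."""
    return math.exp(b1 + b2 * t)

early = relative_hazard(1)    # ~1.36: riskier early on
late = relative_hazard(36)    # ~0.96: slightly less risky by 36 months
```

By about month 32 the relative hazard passes through 1: having stayed on the books that long cancels out the refinance penalty.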

4b.5 Attrition or Early Repayment Scoring using Proportional hazards
The methodology is exactly the same as for credit scoring, only “failure” is now “early repayment”, and default is a form of censoring.
Again, coarse-classify using the PH approach on this “failure”.
How does it classify compared with LR?
An example on personal loans gives some interesting insights into early repayment.
Two definitions of “bad”

1. paid off the loan early in first 12 months
2. still repaying after 12 months but paid off early in the next twelve months.
Again two separate LR models for each definition.
One PH model predicting time to early pay-off.
Compare models performance using ROC curves

Now we will see how the survival model performs on this task, again compared with logistic regression.
We used the same data from a major UK financial institution: application information on 50,000 loans covering about three years, plus whatever happened to each loan during the observation time (closed, paid off, paid off early, defaulted or still going).
Here we are concerned only with predicting early pay-off.
Two definitions of Bad were constructed:
1st – bads are those who paid off earlier than agreed in the first 12 months;
2nd – bads are those who are still repaying after 12 months but paid off early in the next 12 months.
Two separate LR models were built, one for each of these definitions, and
one PH model was fitted to the times until early pay-off (which gives an ordering of relative likelihood to pay off early).
This PH model is then measured under the two criteria (two definitions).
Models were built on a training sample and tested on a holdout.
We compared models using ROC curves, because ROC curves let us compare models independently of the cutoff.

ROC curves for PH and LR predicting attrition
(early repayment)
PH vs LR 1
(1st 12 mths)
PH vs LR 2
(2nd 12 mths)

These are ROC curves for PH (solid line) and LR (dotted line) for early pay-off in the first 12 months and the next 12 months.
The curves cross over (PH is more sensitive than LR at the beginning, then this is reversed), so we would obtain different results depending on the cutoff – this justifies the use of ROC curves.
You can see that in the first year LR and PH are very close in performance.
But in the second year LR clearly outperforms PH regression.
This is not surprising, since two separate LRs were built to fit each definition while only one PH model predicting time to early pay-off was used.
The two LRs can be seen as one LR segmented by time (at the 12-month point, so that the second uses only part of the data). So one way of improving the PH model's predictive power is segmentation by some time-related variable, and the logical candidate is the term of the loan.

Loans of different lengths may behave quite differently: the factors affecting the early repayment rate are not necessarily the same for shorter 12-month loans as for longer 2-3 year loans.
So let us now compare segmented and non-segmented models.

Early repayment hazard rate vs age of account

Here is the hazard rate for early repayment.
Different lines represent different terms of the loan: 6-month loans, 1-year loans, 2-year loans, etc.

If we do not segment, we average over these hazards, which look quite different, and hence we lose some information.
Now let's look at these differences by transforming the time axis.

Early repayment hazard rate vs time to loan maturity
This suggests that time to maturity is the important element in early repayment.
Segmenting by term of the loan means age and time to maturity are directly connected.
