MANG 2043 – Analytics for Marketing
MAT012 – Credit Risk Scoring
Lecture 2a
This Lecture’s Learning Contents
Pre-processing data
Introduction to Scorecards
To begin with…
Types of variables
Visualise data
Variables used
Sample selection
Missing values
Outlier detection and treatment
Binning of variables
Recoding categorical variables
Segmentation
Define the target variable
Variables Used and Types of Variables
Variables used
Application variables: age, occupation, years at address, years at bank, number of children, number of credit cards, amount on current account(s), whether a telephone number was given in the application…
Credit bureau variables: information on County Court Judgements (CCJs), number of defaults in the last 6 months, number of checks made…
Performance variables: average balance, maximum credit, total debt, weighted averages, number of transactions…
Are the variables legal to use? Data Protection Act, Sex Discrimination Act, Race Relations Act
Variable types (see Lecture 1)
Select a sample
Is sample of applicants similar to current applicants?
Time of sample:
How far back do I need to trace my sample?
Trade-off: more data vs. more recent data
Sample size: large enough to produce convincing results?
Are there enough Bads?
Is the sample chosen randomly or rationally?
What operating rules or overrides were in operation during the sampling period?
Are there system or business reasons for having scorecards for different populations?
What is the economic environment during the sample period?
How long a performance window should I choose in order for the bad rate to stabilise?
Missing values
Why are there missing values?
Non-applicable (e.g. default date not known for non-defaulters)
Refusal to disclose (e.g. income or other confidential data)
Error when merging data (e.g. typos in name or ID)
Recording issues (e.g. machine or algorithm errors)
How would you manage these missing values?
Imputation procedures
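A minimal imputation sketch in Python (pandas), on a toy DataFrame; the column names and fill rules are illustrative assumptions, not a prescribed method:

```python
import pandas as pd

# Toy data with missing values (column names are illustrative).
df = pd.DataFrame({
    "income": [21000, None, 16000, 36000, None, 28000],
    "occupation": ["clerk", "nurse", None, "driver", "clerk", "clerk"],
})

# Numeric variable: impute with the median (robust to outliers).
df["income"] = df["income"].fillna(df["income"].median())

# Categorical variable: keep 'missing' as its own category, since the
# fact that a value is missing can itself be predictive.
df["occupation"] = df["occupation"].fillna("missing")
print(df)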
Outlier detection
Extreme or unusual observations
Univariate outlier detection methods
Approach I: use the mean and standard deviation
Compute the range [μ − 3σ, μ + 3σ]; any value outside this range is defined as an outlier
Equivalently, use the Z score zᵢ = (xᵢ − μ)/σ and flag observations with |zᵢ| > 3
Approach II: use the first quartile (Q₁), the third quartile (Q₃) and the interquartile range (IQR = Q₃ − Q₁)
Outliers are defined as anything outside the range [Q₁ − 1.5·IQR, Q₃ + 1.5·IQR]
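A sketch of both univariate approaches, assuming the six-income toy sample used later in this lecture and the conventional 3σ and 1.5·IQR cut-offs:

```python
import numpy as np

x = np.array([21000, 18000, 16000, 36000, 60000, 28000], dtype=float)

# Approach I: flag values outside [mu - 3*sigma, mu + 3*sigma],
# equivalently |z| > 3.
mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma
print(x[np.abs(z) > 3])            # empty for this small sample

# Approach II: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])   # flags 60000
```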
Detect multivariate outliers
Mahalanobis distance
D² = (xᵢ − μ)ᵀ Σ⁻¹ (xᵢ − μ)
μ is the vector of means and Σ is the covariance matrix
Calculate the distance for every observation
Sort the distances; the largest values are outlier candidates
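A sketch of the Mahalanobis computation with NumPy, on hypothetical two-variable data (the observations are invented for illustration):

```python
import numpy as np

# Hypothetical data: rows are observations, columns are two variables.
X = np.array([[25, 1200], [32, 1500], [41, 1800],
              [29, 1300], [80, 9000]], dtype=float)

mu = X.mean(axis=0)                               # vector of means
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance

diff = X - mu
# D^2 = (x_i - mu)^T Sigma^{-1} (x_i - mu), for every observation i
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Sort by distance: the largest D^2 values are outlier candidates.
print(np.argsort(d2)[::-1])
```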
Clustering Methods
Look for elements that fall outside the clusters
Regression methods
Fit regression line and look for points with large errors
Residual plots
Dealing with outliers
Treat invalid outliers as missing
For valid outliers, one common approach is to truncate (winsorise) them:
For anything greater than the upper limit (e.g. μ + 3σ), replace by that upper limit
For anything less than the lower limit (e.g. μ − 3σ), replace by that lower limit
Sometimes, if there are not many outliers and they are very extreme, we might simply treat them as 'missing'.
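A minimal truncation sketch on the same toy sample; the IQR limits are used here so the effect is visible, but mean ± 3σ limits work identically:

```python
import numpy as np

x = np.array([21000, 18000, 16000, 36000, 60000, 28000], dtype=float)

# Truncation limits: here the IQR rule; mean +/- 3 std devs works the same way.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# np.clip replaces anything above `upper` by `upper`, below `lower` by `lower`.
x_truncated = np.clip(x, lower, upper)
print(x_truncated)   # 60000 is pulled back to the upper limit
```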
Coarse classification & grouping
Split variables into a relatively small number of classes (groups must be of reasonable size; this is also called classing, categorisation, binning, grouping…)
The motivation for categorical characteristics
To handle lots of variables with many attributes
To handle attributes whose sample sizes are too small for robust estimation
Not all variables are continuous: e.g. residential status is categorical; number of children is discrete. We need to put their attributes into bins or classes.
For some continuous variables we also need coarse classification, because their relationship with risk may not be monotone.
Coarse classification allows such non-linear effects to be captured. Typical approaches: equal-interval binning, equal-frequency binning, chi-squared analysis…
Example: Default risk varying with age
After binning, all variables are categorical
Since default risk is non-linear in continuous variables such as age, we can turn them into categorical groups;
E.g. splitting ages into groups: 18-21, 22-28, 29-36, 37-59, 60+
How to bin variables
Assume we have a sample of six annual incomes: 21000, 18000, 16000, 36000, 60000, 28000
Equal interval binning
Bin width = 5000
(15000, 20000]: 16000, 18000
(20000, 25000]: 21000; …
Equal frequency binning
Bin 1: 16000, 18000, 21000
Bin 2: 28000, 36000, 60000
*None of the above takes bad rates into account
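Both schemes can be reproduced on the six-income sample with pandas; pd.cut and pd.qcut are the standard pandas functions for equal-interval and equal-frequency binning respectively:

```python
import pandas as pd

income = pd.Series([21000, 18000, 16000, 36000, 60000, 28000])

# Equal-interval binning: fixed bin width of 5000.
edges = list(range(15000, 65001, 5000))
equal_interval = pd.cut(income, bins=edges)

# Equal-frequency binning: two bins of three observations each.
equal_frequency = pd.qcut(income, q=2)

print(equal_interval.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```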
Chi-squared method
This measures how likely it is that there is no difference in the good:bad ratio across classes. Define:
n: the number of observations in the sample dataset
nᵢ: the number of observations of the variable in class i, where i = 1, …, k
gᵢ and bᵢ: the number of 'goods' and 'bads' for the variable in class i, respectively
g and b: the number of 'goods' and 'bads' in the whole sample dataset
*You can refer to this website http://documentation.statsoft.com/STATISTICAHelp.aspx?path=WeightofEvidence/WeightofEvidenceWOEIntroductoryOverview
Chi-squared method
The expected values for gᵢ and bᵢ, assuming no difference between classes, are ĝᵢ = nᵢ·g/n and b̂ᵢ = nᵢ·b/n
The Chi-square value, which has k − 1 degrees of freedom, is
χ² = Σᵢ [(gᵢ − ĝᵢ)²/ĝᵢ + (bᵢ − b̂ᵢ)²/b̂ᵢ], summing over classes i = 1, …, k
Compare the Chi-squared values between pairs of splits and choose the split with the largest Chi-squared value.
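A sketch of the statistic as defined above, applied to invented good/bad counts for two candidate splits (the counts are hypothetical, chosen only to sum to 10,000 like the practice question below):

```python
import numpy as np

def chi_squared(goods, bads):
    """Chi-squared value for a candidate split; goods[i]/bads[i] are the
    observed counts in class i."""
    goods = np.asarray(goods, dtype=float)
    bads = np.asarray(bads, dtype=float)
    n_i = goods + bads                      # n_i: observations per class
    g, b = goods.sum(), bads.sum()          # totals g and b
    n = g + b
    exp_g = n_i * g / n                     # expected goods per class
    exp_b = n_i * b / n                     # expected bads per class
    return (((goods - exp_g) ** 2) / exp_g
            + ((bads - exp_b) ** 2) / exp_b).sum()

# Invented counts for two candidate splits of residential status:
print(chi_squared([5000, 2000, 1000], [800, 900, 300]))  # Owners/Renters/Others
print(chi_squared([5000, 1500, 1500], [800, 700, 500]))  # Owners/With parents/Others
```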
Chi-squared method
Practice question: 10,000 people in a residential area were asked, 'What is your residential status?'
Calculate the Chi-squared values for these two splits:
Owners, Renters and Others
Owners, With Parents and Others
For better data description and understanding of the sample, which split would you choose?
Weight of Evidence
Measuring risks represented by each attribute/class category
The higher the weight of evidence (in favour of being ‘good’), the lower the risk for that category
If we adopt the same notation as in the Chi-squared method, for attribute (class) i of a variable, the weight of evidence is calculated as:
WoEᵢ = ln[(gᵢ/g) / (bᵢ/b)]
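A minimal WoE computation on hypothetical good/bad counts per attribute (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical good/bad counts for the attributes (classes) of one variable.
goods = np.array([5000.0, 2000.0, 1000.0])   # g_i
bads  = np.array([800.0, 900.0, 300.0])      # b_i
g, b = goods.sum(), bads.sum()               # totals g and b

# WoE_i = ln( (g_i/g) / (b_i/b) ); positive values indicate lower risk.
woe = np.log((goods / g) / (bads / b))
print(woe)
```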
Information value/statistics
This relates to the entropy measure that appears in information theory
It shows how different gᵢ/g and bᵢ/b are across the attribute values i that the variable takes:
IV = Σᵢ (gᵢ/g − bᵢ/b) · WoEᵢ
Rule of thumb
<0.02: unpredictive
0.02-0.1: weak
0.1-0.3: medium
0.3+: strong
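The IV follows directly from the WoE values; a sketch using the same hypothetical counts as above:

```python
import numpy as np

# Same hypothetical counts as the WoE sketch above.
goods = np.array([5000.0, 2000.0, 1000.0])
bads  = np.array([800.0, 900.0, 300.0])
g, b = goods.sum(), bads.sum()
woe = np.log((goods / g) / (bads / b))

# IV = sum_i (g_i/g - b_i/b) * WoE_i
iv = ((goods / g) - (bads / b)) @ woe
print(iv)   # about 0.22 here: 'medium' on the rule of thumb
```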
Using binary variables (modifying variables)
Usually, after coarse-classifying a variable, we recode it into dummy/binary variables
Assume we use a simple regression model for this analysis; the model takes the form
y = β₀ + β₁D₁ + β₂D₂ + … + βₘDₘ + ε
where D₁, …, Dₘ are the dummy variables for the classes (one class is omitted as the reference category)
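A sketch of the recoding step with pandas, using the age groups from the earlier example; dropping the first group as the reference category is the assumed convention here:

```python
import pandas as pd

# Coarse-classified age groups from the earlier example.
df = pd.DataFrame({"age_group": ["18-21", "22-28", "29-36", "37-59", "60+"]})

# One dummy per class; drop_first=True omits one class as the
# reference category to avoid perfect collinearity in the regression.
dummies = pd.get_dummies(df["age_group"], prefix="age", drop_first=True)
print(dummies)
```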
Segmentation
We can split variables and have different scorecards for different subgroups. The reasons for this are:
Strategic: banks might want to adopt special strategies to specific segments of customers (e.g. a scorecard for first time buyers)
Operational: e.g. new vs. existing customers
Interactions between variables (relationships that differ across subgroups)
To conduct segmentation, one can rely on:
Experience
Statistical approach (e.g. clustering algorithms)
Splitting into different segments
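A minimal statistical-segmentation sketch with scikit-learn's KMeans, on made-up applicant features; feature scaling is omitted for brevity but would matter in practice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up applicant features: [age, income].
X = np.array([[22, 18000], [25, 21000], [45, 60000],
              [50, 55000], [30, 28000]], dtype=float)

# Cluster applicants into segments; a separate scorecard would then
# be built for each segment.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)
```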
What are ‘good’ and ‘bad’?
Choose an outcome period: normally 12 months on from either
the observation point (behavioural score), or
initial acceptance (credit score)
Defining ‘bad’
Account has been 30/60/90 days overdue during that period
Charge off/write off
Claim over £ amount
Basel II: 90 days overdue, but this can be changed by national regulators
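A sketch of labelling 'bad' from days overdue, on hypothetical account records; the 90-day threshold is one modelling choice among the 30/60/90 options above:

```python
import pandas as pd

# Hypothetical outcome-period records: worst delinquency per account.
accounts = pd.DataFrame({"max_days_overdue": [0, 15, 45, 95, 120]})

# 'Bad' = ever 90+ days overdue during the outcome window.
accounts["bad"] = (accounts["max_days_overdue"] >= 90).astype(int)
print(accounts)
```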
What have we learnt
Preparing data before building a scorecard
Types of variables
Missing values
Coarse classification
Segmentation
‘good’ and ‘bad’
More information about missing values:
http://www2.sas.com/proceedings/sugi30/113-30.pdf