
MANG 2043 – Analytics for Marketing

MAT012 – Credit Risk Scoring


This Lecture’s Learning Contents
Classification methods in credit scoring
Divergence
Decision tree
Linear programming
Measuring scorecard performance

Assessing, monitoring and updating scorecards
(measuring the difference between distributions)
Divergence: difference in expectations of weights of evidence
Mahalanobis Distance (briefly covered last time)
Kolmogorov-Smirnov statistic: difference in distribution functions
ROC curves: comparison of distribution functions
Gini coefficient / Somers' D concordance statistic
Confusion Matrix

Divergence
Introduced by Kullback (the continuous version of the Information Value).
Let f(s|G) and f(s|B) be the density functions of the scores of the goods (G) and the bads (B) in a scorecard. Divergence is then defined by

D = ∫ (f(s|G) − f(s|B)) log[f(s|G)/f(s|B)] ds = ∫ (f(s|G) − f(s|B)) w(s) ds

where w(s) = log[f(s|G)/f(s|B)] is the weight of evidence at score s.
D ≥ 0, and D = 0 if and only if f(s|G) = f(s|B).
D → ∞ as the overlap between the scores of goods and bads disappears.
In practice one can only calculate the divergence by splitting the scores into bands. If band i contains gi of the nG goods and bi of the nB bads, the banded divergence is the Information Value

IV = Σi (gi/nG − bi/nB) log[(gi/nG)/(bi/nB)]
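As a quick illustration of the banded formula, here is a minimal Python sketch (the band counts are invented, not the lecture's):

```python
import numpy as np

def information_value(goods, bads):
    """Banded divergence (Information Value) from counts of goods and bads per score band."""
    goods = np.asarray(goods, dtype=float)
    bads = np.asarray(bads, dtype=float)
    pg = goods / goods.sum()      # share of all goods falling in each band
    pb = bads / bads.sum()        # share of all bads falling in each band
    woe = np.log(pg / pb)         # weight of evidence per band
    return np.sum((pg - pb) * woe)

# Hypothetical band counts, lowest score band first
goods = [100, 300, 500, 600]
bads = [200, 150, 100, 50]
print(round(information_value(goods, bads), 3))
```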

Building scorecard by maximising divergence
Assume the score is a linear combination of the attribute scores, s(x) = c·x.
We need to maximise the divergence. The likelihood functions f(x|G) and f(x|B) can be obtained empirically from the sample of past borrowers being used.
For any choice of attribute scores c we define the corresponding score distributions

f(s|G,c) = Σ{x: c·x = s} f(x|G)   and   f(s|B,c) = Σ{x: c·x = s} f(x|B)

then find the maximum divergence value as c varies, i.e.

Maxc D(c) = Maxc ∫ (f(s|G,c) − f(s|B,c)) log[f(s|G,c)/f(s|B,c)] ds
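A rough sketch of this idea (not the lecture's own procedure): score each case as s = c·x, band the scores, and search numerically for the attribute scores c that maximise the banded divergence. The data, the number of bands and the smoothing are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical data: three dummy-coded attributes for 1500 goods and 500 bads
X_good = rng.integers(0, 2, size=(1500, 3)).astype(float)
X_bad = rng.integers(0, 2, size=(500, 3)).astype(float)
X_bad[:, 0] *= (rng.random(500) < 0.4)   # make attribute 0 rarer among the bads

def banded_divergence(c, n_bands=10):
    """Divergence of the score s = c.x, estimated on equal-width score bands."""
    s_good, s_bad = X_good @ c, X_bad @ c
    lo = min(s_good.min(), s_bad.min())
    hi = max(s_good.max(), s_bad.max())
    if hi == lo:                          # degenerate choice of c: no spread in scores
        return 0.0
    edges = np.linspace(lo, hi, n_bands + 1)
    g, _ = np.histogram(s_good, bins=edges)
    b, _ = np.histogram(s_bad, bins=edges)
    pg = (g + 0.5) / (g + 0.5).sum()      # light smoothing so empty bands do not blow up
    pb = (b + 0.5) / (b + 0.5).sum()
    return float(np.sum((pg - pb) * np.log(pg / pb)))

# Search for attribute scores c that maximise the (banded) divergence
res = minimize(lambda c: -banded_divergence(c), x0=np.ones(3), method="Nelder-Mead")
print("attribute scores:", np.round(res.x, 2), "divergence:", round(-res.fun, 3))
```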
Methods which group rather than score
Methods like classification trees, expert systems and neural nets end up with "scorecards" that classify applicants into groups rather than giving a scorecard which adds up a score for each answer.
The main approach is the classification tree.
It was developed at about the same time in statistics and in computer science, so it is also called the recursive partitioning algorithm.
It splits A, the set of application answers, into two subsets, depending on the answer to one question, so that the two subsets are as different as possible.
Take each subset and repeat the process until one decides to stop.
Each terminal node is classified as good (in AG) or bad (in AB).
A classification tree depends on:
Splitting rule – how to choose the best daughter subsets
Stopping rule – when one decides a node is a terminal node
Assigning rule – which category (good or bad) to give each terminal node
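To make the splitting / stopping / assigning rules concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on invented applicant data. Note that scikit-learn splits on impurity (Gini or entropy) rather than the Kolmogorov-Smirnov or chi-square rules discussed below; the feature names and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 2000

# Hypothetical applicant characteristics
age = rng.integers(18, 70, n)
income = rng.normal(30000, 8000, n)
owner = rng.integers(0, 2, n)                    # 1 = home owner
X = np.column_stack([age, income, owner])

# Invented good (1) / bad (0) outcome loosely related to the characteristics
p_good = 1 / (1 + np.exp(-(0.03 * (age - 30) + 0.00005 * (income - 30000) + 0.8 * owner)))
y = (rng.random(n) < p_good).astype(int)

tree = DecisionTreeClassifier(
    criterion="gini",                     # splitting rule: impurity-based split quality
    min_samples_leaf=int(0.01 * n),       # stopping rule: no leaf smaller than 1% of the sample
    max_depth=4,
)
tree.fit(X, y)                            # assigning rule: each leaf predicts its majority class
print(export_text(tree, feature_names=["age", "income", "owner"]))
```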

Classification tree: credit risk example
[Figure: example tree splitting on characteristics such as residential status, years at bank and employment – not reproduced]
Rules in classification trees
Assigning Rule
Normally assign to the class which is the largest in that node. Sometimes, if D is the default cost and L is the lost profit, assign to good if the good:bad ratio > D/L.
Stopping rule
Stop either if the subset is too small (say < 1% of the population) or if the difference between the daughter subsets is too small (under the splitting rule).
Really it is a stopping and pruning rule, as one always has to cut back some of the nodes. Do this by using a second sample (not used in building the tree).

Pruning decision trees
Split the data into a training sample and a validation sample.
Use the training sample to grow the tree.
Use the validation sample to decide on the optimal size of the tree.
Two approaches:
Grow the tree, monitor the error on the validation set and stop growing when the latter starts to increase; or
Grow the full tree, and prune it retrospectively using the validation set.

Splitting rules: Kolmogorov-Smirnov rule
Maximise |p(L|B) − p(L|G)|.
Think of the daughters as L (left) and R (right); p(L|B) is the proportion of the bads in the original set who fall in the left daughter (p(L|G) similarly).

Residential status    Owner   Tenant   With parents
No. of goods           1020      400             80
No. of bads             180      200            120
Good:bad odds         5.6:1      2:1         0.67:1

Split 1: L = with parents, R = owner + tenant:
p(L|B) = 120/500, p(L|G) = 80/1500, KS = |(120/500) − (80/1500)| = 0.187
Split 2: L = with parents + tenant, R = owner:
p(L|B) = 320/500, p(L|G) = 480/1500, KS = |(320/500) − (480/1500)| = 0.32
Choose split 2.
Note: with just two categories, |p(R|B) − p(R|G)| = |1 − p(L|B) − (1 − p(L|G))| = |p(L|B) − p(L|G)|, so working with R gives the same answer.

Splitting rules: chi-square rule
Use the same table and the same daughters L and R. To get the expected numbers, assume the split of goods/bads is the same in L and R: n(G)/N = 1500/2000 = 0.75 and n(B)/N = 500/2000 = 0.25, so E(LG) = 0.75 n(L), E(LB) = 0.25 n(L), etc. The chi-square statistic sums (observed − expected)²/expected over the four cells.
Split 1: L = with parents, R = owner + tenant: n(L) = 200, n(R) = 1800,
chi-square = (70)²(1/150 + 1/50 + 1/1350 + 1/450) ≈ 145
Split 2: L = with parents + tenant, R = owner: n(L) = 800, n(R) = 1200,
chi-square = (120)²(1/200 + 1/600 + 1/300 + 1/900) = 160
Choose split 2 (the larger difference). Both calculations are checked in a code sketch below.

Random forest
Since 2000, the random forest extension of classification trees has proved popular.
Build lots of trees, each using a subset of the data and a subset of the characteristics.
For each new case, classify it under each tree and choose the class which the majority of the trees choose.
This ensemble idea can be used for the other classification approaches, but so far it has really only been tried on trees.
The assumption is that some trees are picking up local effects.

Nearest neighbour approach
There are other classification methods which make no assumptions about the underlying population; the most popular is the nearest neighbour method.
A metric is given to the space of application answers, so that the distance from application form answers x1 to application form answers x2 is d(x1, x2).
An applicant's distance to the answers in a training sample is calculated and the k nearest neighbours are identified; classify the applicant as good if a majority of these k are good (a minimal implementation is sketched below).
Examples (figures not reproduced): a graph of a simple scorecard on age and income, comparing a linear discriminant with the nearest 3 neighbours under a Euclidean metric, and a different metric with the 5 nearest neighbours, as k varies.

Dimension reduction
The computational resource needed can be reduced by reducing the dimension:
reduce the number of original features, e.g. omit features that are insignificant in the linear discriminant; or
use principal components analysis to replace the original features by a smaller number of variables formed from linear combinations of the original features with maximum variance.
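As a check on the splitting-rule arithmetic above, a short sketch (numpy assumed) that reproduces the KS and chi-square values for the two candidate splits:

```python
import numpy as np

# Goods and bads by residential status: owner, tenant, with parents
goods = np.array([1020, 400, 80])
bads = np.array([180, 200, 120])

def ks_split(left):
    """|p(L|B) - p(L|G)| for a boolean mask selecting the left daughter's categories."""
    return abs(bads[left].sum() / bads.sum() - goods[left].sum() / goods.sum())

def chi2_split(left):
    """Chi-square statistic comparing observed and expected goods/bads in each daughter."""
    chi2 = 0.0
    for mask in (left, ~left):
        n = goods[mask].sum() + bads[mask].sum()
        for obs, total in ((goods[mask].sum(), goods.sum()), (bads[mask].sum(), bads.sum())):
            exp = n * total / (goods.sum() + bads.sum())
            chi2 += (obs - exp) ** 2 / exp
    return chi2

split1 = np.array([False, False, True])   # L = with parents
split2 = np.array([False, True, True])    # L = tenant + with parents
print(ks_split(split1), ks_split(split2))       # ~0.187 and 0.32
print(chi2_split(split1), chi2_split(split2))   # ~145 and 160
```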
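And a minimal sketch of the nearest neighbour classifier itself, using scikit-learn on invented age/income data; the scaling, the metric and the value of k are the kinds of choices discussed above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)

# Invented applicants described by (age, income); 1 = good, 0 = bad
X = np.column_stack([rng.integers(18, 70, 500), rng.normal(30000, 8000, 500)])
y = (rng.random(500) < 1 / (1 + np.exp(-0.05 * (X[:, 0] - 35)))).astype(int)

# 3 nearest neighbours under a (scaled) Euclidean metric; scaling matters because
# income and age are on very different scales
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_scaled, y)

applicant = (np.array([[40, 32000]]) - X.mean(axis=0)) / X.std(axis=0)
print("classified as good" if knn.predict(applicant)[0] == 1 else "classified as bad")
```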
Choice of metric
There are some modern, complex algorithms that can learn good metrics.
For example, choose a vector of weights w to produce good classification on the basis of a weighted metric of the form

dw(x1, x2)² = Σj wj (x1j − x2j)²

Two popular algorithms (too complicated to go into here) are:
Large margin nearest neighbour analysis
Neighbourhood components analysis

Mahalanobis Distance
If we assume f(s|G) and f(s|B) are normal with a common variance, N(μG, σ²) and N(μB, σ²) respectively, the divergence reduces to D = (μG − μB)²/σ².
This is (the square of) the Mahalanobis distance between the two mean scores of the scorecard, M = |μG − μB|/σ.
This is what discriminant analysis maximises.

Kolmogorov-Smirnov Statistic
Not a difference in expectations but the difference in the distribution functions F(s|G) and F(s|B): the KS statistic is the maximum difference, KS = max over s of |F(s|G) − F(s|B)|.

Confusion Matrix
Example of Confusion Matrix (19,117 cases):

                 Predicted good   Predicted bad   Total
Actual good           13798            4436       18234
Actual bad              118             765         883
Total                 13916            5201       19117

Accuracy = (13798 + 765)/19117 = 0.76
Error rate = (4436 + 118)/19117 = 0.24
Sensitivity = 13798/18234 = 0.76 (proportion of goods correctly classified)
Specificity = 765/883 = 0.87 (proportion of bads correctly classified)

ROC (Receiver Operating Characteristics) curve
A small scorecard example and its ROC curve (worked example and figures not reproduced): the ROC curve plots F(s|B) against F(s|G) as the cut-off score s varies.

Gini coefficient
Twice the area between the ROC curve and the diagonal, i.e. Gini = 2 × AUROC − 1.

Concordant & Discordant Pairs
Take every (good, bad) pair of cases: the pair is concordant if the good has the higher score and discordant if the bad has the higher score.
Example to calculate the C & D pairs (worked example not reproduced).

Somers' D – Concordance statistic
Somers' D = (number of concordant pairs − number of discordant pairs) / (total number of good–bad pairs); ignoring ties, this equals the Gini coefficient.

Cumulative Accuracy Profile (CAP) Curve
The Cumulative Gains Chart for the example (figure not reproduced) plots the cumulative proportion of bads caught, F(s|B), against the cumulative proportion of all cases, F(s).

Lift chart
Plot F(s|B)/F(s) against F(s) for various scores s.
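A compact sketch of several of these performance measures (accuracy, sensitivity/specificity, the KS statistic, AUROC and the Gini coefficient) computed from a set of scores; scikit-learn and invented scores are assumed, not the confusion-matrix example above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(3)

# Invented scores: goods tend to score higher than bads (1 = good, 0 = bad)
y = np.concatenate([np.ones(1500), np.zeros(500)]).astype(int)
scores = np.concatenate([rng.normal(600, 50, 1500), rng.normal(540, 50, 500)])

cutoff = 560                      # accept if score >= cutoff
pred = (scores >= cutoff).astype(int)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()   # bads are the "negative" class here
print("accuracy   ", (tp + tn) / len(y))
print("sensitivity", tp / (tp + fn))   # proportion of goods accepted
print("specificity", tn / (tn + fp))   # proportion of bads rejected

# KS statistic: maximum gap between the score distribution functions of bads and goods
grid = np.sort(scores)
F_bad = np.array([(scores[y == 0] <= s).mean() for s in grid])
F_good = np.array([(scores[y == 1] <= s).mean() for s in grid])
print("KS         ", np.max(F_bad - F_good))

auc = roc_auc_score(y, scores)
print("AUROC      ", auc, " Gini:", 2 * auc - 1)
```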
Linear programming
Assume nG goods labelled i = 1, 2, …, nG and nB bads labelled i = nG + 1, …, nG + nB.
Require weights wj, j = 1, 2, …, p, and a cut-off value c such that
For goods: w1 xi1 + w2 xi2 + … + wp xip ≥ c
For bads: w1 xi1 + w2 xi2 + … + wp xip ≤ c
Such weights rarely exist exactly, so introduce non-negative error variables ai, one per case, and solve the linear programme
Minimise a1 + a2 + … + a(nG+nB)
subject to
w1 xi1 + … + wp xip ≥ c − ai for the goods i = 1, …, nG
w1 xi1 + … + wp xip ≤ c + ai for the bads i = nG + 1, …, nG + nB
ai ≥ 0 for all i.
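A sketch of this linear programme using scipy.optimize.linprog on invented data. The variable ordering, the tiny data set and the unit gap used to rule out the trivial all-zero solution are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)

# Invented applicants: p = 2 characteristics; goods tend to have larger values
p, nG, nB = 2, 30, 10
X_good = rng.normal([1.0, 1.0], 0.5, size=(nG, p))
X_bad = rng.normal([0.0, 0.0], 0.5, size=(nB, p))

# Decision variables: [w1..wp, c, a1..a(nG+nB)]; minimise the sum of the errors a_i
n = nG + nB
cost = np.concatenate([np.zeros(p + 1), np.ones(n)])

# Goods:  w.x_i >= c - a_i      ->  -w.x_i + c - a_i <= 0
A_good = np.hstack([-X_good, np.ones((nG, 1)), -np.eye(n)[:nG]])
# Bads:   w.x_i <= c - 1 + a_i  ->   w.x_i - c - a_i <= -1
# (the gap of 1 is an assumed normalisation to rule out the trivial w = 0 solution)
A_bad = np.hstack([X_bad, -np.ones((nB, 1)), -np.eye(n)[nG:]])
A_ub = np.vstack([A_good, A_bad])
b_ub = np.concatenate([np.zeros(nG), -np.ones(nB)])

bounds = [(None, None)] * (p + 1) + [(0, None)] * n   # w, c free; a_i >= 0
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, c = res.x[:p], res.x[p]
print("weights:", np.round(w, 2), " cut-off:", round(c, 2), " total error:", round(res.fun, 3))
```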

Example of non-linear relationship

% of goods        phone   no phone   total
owner                90         30      86
rent furnished       60         45      50