University of Cardiff
MAT012 Credit Risk Scoring (2016/17)
Copyright By PowCoder代写 加微信 powcoder
Lab Session 1: How scorecards are Developed- using Excel
Task 1: Understand data set.
Open Excel and then open the data set entitled germanexcelstart2010
This is a data set consisting of 1000 cases of applicants for credit together with details of 20 of their characteristics. 700 of the applicants turned out to be Good ( ‘1’ in the Good column) and 300 were Bad ( ‘1’ in the Bad column). There is a data dictionary at the end of this handout but the data you have has been simplified by removing A and the characteristic number from the attributes. Thus status of existing checking account has values 1, 2, 3 and 4 not characteristic A11, A12, A13 and A14. Check that you understand the meaning of the data set which is provided in Sheet 1.
The course structure looks at how to apply specific scorecard methodologies before looking at the practical details of preparing the data including coarse classifying. So in this first lab class we will use data that has already been put in the appropriate form with all the variables being binary (0 or 1) values. In the second workshop we will look at how to coarse classify the variables and so obtain such binary variables
Task 2: Using linear regression to build a scorecard
Go to sheet 3 where we have the data corresponding to binning six of the variables – checking, duration, purpose, married, telephone and foreign, together with the Good column. There are 3 variables for checking 1,2,3, yet there was a fourth value available 4.
Why have we left the corresponding binary variable out?
We only need N-1 variables to represent N states (see Week 2 slides, MAT012_Lecture_2, slide 21). Checking is either 1,2, 3 or 4 – so we need 3 binary variables to represent the possibilities.
Duration also has 3 variables. What is the variable that has been left out in its case?
Looking at the binary variables, we have Duration 12- (representing Duration <= 12), Duration 12-36 (representing 12 < Duration <= 36), Duration 36-48 (representing 36 < Duration <= 48). So the variable that has been left out is Duration 48+ (representing Duration > 48).
Similarly what are the five variables that arise from the purpose characteristic?
Purpose0 -> Purpose = 0
Purpose1 -> Purpose = 1
Purpose2 -> Purpose = 2
Purpose34 -> Purpose = 3 or Purpose = 4
Purpose59 -> Purpose = 5 or Purpose = 9
What sort of loans will have a 0 in all five cases?
A zero for all these binary variables means Purpose = 6,8 or X
Do you understand what a 1 in the married, telephone and foreign variables means ( One has to check the original data to see)
Now we will apply linear regression to get a scorecard where we are regressing the Good variable on the characteristics discussed above.
First go to Options in the File Menu. Then click on Add Ins and click on Go by Manage Excel Add ins. Tick Analysis ToolPak and Solver and click on OK.
Then if we go to the Data menu one should see Data Analysis and Solver in the Analysis section on the far right.
Click on Data Analysis. Highlight Regression and click OK
Either by highlighting the appropriate region or by typing ,
type into Input Y range O1:O1001
type into Input X Range A1:N1001
Tick the labels box
Tick the output region and enter in A1005 (so output will be at end of data set).
Click on OK.
The regression results should now appear from cell A1005 onwards.
It suggests that the scorecard should be as follows where I have multiplied each coefficient by 100 and rounded them.
Score = 59 -35 *Check1-23*Check2-11*Check3+28 * Duration(12-)+16 Duration(12-36) +21 *Purpose1+9*Purpose2+14 Purpose(3or4)+5*Purpose(5or9) -6*Married+4*telephone +21* Foreigner
Why has “Duration 36-48” and “Purpose 0” disappeared?
Do the FOUR values for checking account make sense (Where is the fourth?)
We can chat about that in the Lab session
The R2 value is around 0.18. What do you think of the scorecard in the light of this? (Remember you are fitting a line to data where all the dependent values are 0 or 1.)
We can construct the scores for all the cases as follows
Highlight B1022 to B1035 and Copy in the Home menu
Then Highlight A1002 to N1002
Go to Paste in the Home menu and then click on Paste Special and click on values and on transpose.
Now we have the scores for each attribute in row 1002.
Go to P1 and type in score
In cell P2 type in =100*($B$1021+SUMPRODUCT(A2:N2,A$1002:N$1002))
This should give the score ( multiplied by 100 for the first data point of 70.
Copy P2 all the way down the column to P1001.
One should have scores varying from 21 to 129.
Task 3: Building scorecard by linear programming using Solver in Excel
The linear programme we outlined has as many constraint as there are cases and the standard version of Solver only allows one to solve linear programming problems with 100 constraints. Thus we will cut down the number of cases to 60 Goods and 30 Bads.
Go to Sheet 4 where this set is available
Solver solves the linear programming problems by defining some of the cells as where the values for the variables in the LP should be. It tries different values in these cells until it gets the solution that maximises or minimises the objective.
One set of variables are the coefficient values of the attributes in the final score, and we will put these scores c(i) in row 92.
So we will start by puting 0s in the cells from A92 to N92.
In fact we will try a variant of the linear programme where we want all the Bads to have score of 60 or less and all the Goods to have scores of 100 or more. ( So we have a gap between the two groups as is the case in Support Vector machines).
For the Bads ( the first 30 cases ) we want
c(A)*x(A)+c(B)x(B)+…C(N)*x(N) –Error 60
where again the Error is a variable in the LP.
So we will make Column P the column where the errors are ( Again we start with them all 0).; Column Q will be where the calculation c(A)*x(A)+c(B)x(B)+…C(N)*x(N) –Error is made and column R is where we put the 60.
Similarly for the remaining 60 Goods we want
c(A)*x(A)+c(B)x(B)+…C(N)*x(N) +Error 100
So again we will make Column P the column where the errors are ( all 0 initially ).; Column Q will be where the calculation c(A)*x(A)+c(B)x(B)+…C(N)*x(N) +Error is made and column R is where we put the 100.
So type in the following
In Cell P1 type Error
In Cell P2 type 0 (zero)
Copy P2 down to P91.
In cell P92 type =sum(P2:P91)
( This is our objective function as we will seek to minimise this sum of the errors
In Cell Q1 type Calculation
In Cell Q2 type = SUMPRODUCT(A2:N2, A$92:N$92)+(-1+2*O2)*P2
Copy Q2 down to Q91
( Notice the -1+2*O” means the value is -1 for the Bads and +1 for the Goods)
In Cell R1 type Boundary value
In R2 type 60
Copy R2 down to R31.
In R32 type 100
Copy R32 down to R91
Now we are in a position to use solver.
Go to Solver in the Data menu
( If Solver is not there, then click on Add ons in options in the File menu. Click on Go on the Excel Add ins and then tick the solver choice. Click on OK).
When you go back to Tools , Solver should be there then)
Click on Solver
When the dialogue box appears
On set target cell type or highlight P92
Click on the minimise button
Either by typing in or clicking on the appropriate cells, put in the” by changing cells” box
A92:N92, P2:P91
Using the add button put in the following constraints
Q2:Q31 <= R2:R31
Q32:Q91 => R32:R91
P2:P91=> 0
Choose simplex LP in the solving method.
Make sure the box “ Make unconstrained variables non negative is NOT ticked. (we are happy for some of the scores to be negative)
Click on Solve..
Click on OK and then solve.
The solution should have the sum of the errors as 868 and the scorecard looks like
Score = -24*Check1-16*Check2+24 *Check3 + 108* Duration(12-)+84 *Duration(12-36) +52*Duration(36-48) + 40*Purpose1 +40*Purpose2 +24*Purpose(3or4) +40*Purpose(5or9) -8*Married—8*Telephone +16*Foreign
How are these score compared to the linear regression model? Remember it is the relative differences that are important?
Task 4: Sensitivity Analysis for the Linear programming scorecard
What happens if we say Bads should be less than 80 and Goods more than 100?
Is that a better scorecard?
Description of the German credit dataset.
Number of Attributes german: 20 (7 numerical, 13 categorical)
Number of Attributes german.numer: 24 (24 numerical)
Attribute description
Attribute 1: (qualitative)
Status of existing checking account
A11 : … < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative)
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation – does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attibute 6: (qualitative)
Savings account/bonds
A61 : … < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
A171 : unemployed/ unskilled – non-resident
A172 : unskilled – resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
foreign worker
A201 : yes
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com