University of Cardiff
MAT012 Credit Risk Scoring
Copyright By PowCoder代写 加微信 powcoder
Lab Session 1: How scorecards are Developed- using SAS
1. Understand data set.
Open SAS, Click on . Then go to File( Import Data and click on Local Computer. Find the Excel spreadsheet Germanregressionstart and then Click on Next. Highlight Sheet 1 and tick on “first row of range contains field names”. The click on Next twice more and then Finish. Click on the final icon made to open the data set.
This is a data set consisting of 1000 cases of applicants for credit together with details of 20 of their characteristics. 700 of the applicants turned out to be Good ( 1 in the Good column) and 300 were Bad ( 1 in the Bad column). There is a data dictionary at the end of this handout but the data you have has been simplified by removing A and the characteristic number from the attributes. Thus the data in existing checking account has values 1, 2, 3 and 4 not characteristic A11, A12, A13 and A14. Check that you understand the meaning of the data set which is provided in Sheet 1.
The course structure looks at how to apply specific scorecard methodologies before looking at the practical details of preparing the data including coarse classifying. So in this first lab class we will use data that has already been put in the appropriate form with all the variables being binary (0 or 1) values. In the second workshop we will look at how to coarse classify the variables and so obtain such binary variables
2. Using logistic regression to build a scorecard
Again go to Go to File( Import Data and then click on Local Computer and again find and open the Excel spreadsheet Germanregressionstart but this time choose Sheet 2. Check that you again have “first row of range contains field names” ticked.. Then go through Next, next and Finish and by clicking on the icon open the data.. Note this set has only six of the variables in it – checking, duration, purpose, married, telephone and foreign,- but they are all in binary form together with the Good variable. There are 3 variables for checking 1,2,3, yet there was a fourth value available 4.
Why have we left the corresponding binary variable out?
Duration also has 3 variables. What is the variable that has been left out in its case?
Similarly what are the five variables that arise from the purpose characteristic . What sort of loans will have a 0 in all five cases?
Do you understand what a 1 in the married, telephone and foreign variables means ( One has to check the original data to see)
Now we will apply logistic regression to get a scorecard where we are regressing the Good variable on the characteristics discussed above.
Go to Task( Regression(Logistic …
In Data on the screen that now opens assign the variables to their tasks by using the arrow that clicks across
Good ( Dependent Variable
All other variables( Quantitative variables
( You can highlight them all by using Ctrl at the start of the group and keeping Ctrl and Shift down when you click on the far end of the group)
Then Model ( Response( make sure it is binary response type, logit type of model and change “fit model to level” to be 1 ( So the score will be high for the goods who have a 1 in the Good value)
Model (Effects. Highlight all and click on Main
Plots(tick show plots for regression analysis and click on custom list of plots. When the list of plots appear tick ROC curve
Predictions(Original sample (tick) (Predictions (tick)
Then click Run. The output should look like the following.
Analysis of Maximum Likelihood Estimates
Pr > ChiSq
checking 3
Duration 12-
Duration 12-36
duration 36-48
purpose 34
purpose 59
The Somers D ( which is also called the GINI coefficient) is 0.54 which is very good.
However you cans see that Duration 36-48, purpose0, purpose 59 and telephone are not really adding anything to the scorecard. Why do you know that? We also question the use of duration 12-36 and purpose 2 but leave them in for now.
To remove the four variables, in the Process Flow diagram, click on the logistic regression and then right click and choose modify logistic regression. When the data window appears click on Model(Effect and then on Telephone and click on remove effect. Repeat that to remove Duration36-48, purpose0 and purpose59
Then make sure that in Plots you choose ROC Curve and tick ROC curve and in Predictions make sure Original sample and predictions is ticked
Then Run and allow the programme to replace the original results.
If you encounter difficulty you can always start again from the data imported icon and use task( regression( logistic and then only choose the relevant variables (i.e. not telephone etc) as quantitative variables
( You should get results like this
Analysis of Maximum Likelihood Estimates
Pr > ChiSq
checking 3
Duration 12-
Duration 12-36
purpose 34
This suggests that all the variables in the scorecard are now relevant. The Gini has in fact dropped a little to 0.535 but that is not a worry as the scorecard is probably more robust and not over fitting. Have a look at the other statistics and see if you understand some of them.
The Scorecard we would end up with could be represented as
Score = 82 -201Checking1 -143 Checking 2 -87Checking 3 +140Duration 12
+77Duration12-36 +124Purpose1 +41purpose2 +69purpose{3,4} -36Married +150Foreign.
Do you agree?
Does it make sense in each of the variables. In particular does it make sense for checking?,
3. Building a scorecard by linear regression and testing it on a holdout sample
Now we will build a scorecard using linear regression not logistic regression and also find out how to see how it performs on a different data set – a holdout sample – to the one it was built . This is because the scorecard will always perform better on the set it was built on than on any other one because it has picked up some of the special features of that set that do not exist elsewhere. We use the same data as before but we have now split it into two sets – 700 borrowers which make up the training set ( Sheet 3) and 300 borrowers who make up the holdout sample ( Sheet 4).
We will also make this scorecard build into a project ( We could do that for the logistic regression as well) and use process flow screen to see the development of the project flow
Go again to File ( (. Import Data and then click on Local Computer and again find Germanregressionstart. Click on Next then Click on Sheet 3 and go through the rest of the windows and click on Finish. Remember what name the data sets is given by clicking on the icon Import Data and check the log (It will be SASUSER.germanregression start and some number).
Go to Task( Regression(Linear
In Data assign the variables to their tasks by using the arrow that clicks across
Good ( Dependent Variable
All other variables are explanatory variables
Click on Model and choose stepwise selection (This means one start with no variables in the regression and at each step introduces the variable that makes the biggest impact or drops out the one with the least impact if they are above/below some prescribed values/ Keep these values for entry and exit at the 0.15 suggested.
Statistics (correlation of estimates (tick) confidence limit of estimates (tick; leave as 95%)
Prediction (Original sample (tick); Predictions (tick)
You should end up after nine iterations with the following regression where R2 = 0.17
Type II SS
checking 3
Duration 12-
duration 36-48
Notice that you do not have to worry about dropping characteristics out because they are not relevant as the stepwise procedure (or the forwards or backwards one) would do that automatically. A corresponding scorecard would be
Score = 87 -33 *Check1-27*Check2-10*Check3+12 * Duration(12-) -13 Duration(36-48) -10 *Purpose0+ 15*Purpose1 -7*Married+4*telephone +21* Foreigner
Compare this with the scorecard obtained in Task 2.They are not the same but for individual characteristics do you get the same sort of pattern?
Now let us see what the results on the holdout sample are.
Click on the View Tab to get Process Flow diagram
File(Import Data(Local Computer. Find Germanregressionstart. Open it and choose Sheet 4 clicking through Next to the Finish. Remember what name this data sets is given; Click on Import Data and check the log ( It will be SASUSER.germanregressionstart and some number).
Click on the Linear icon in the Process flow diagram. This will take you to the Task Roles tables you had before and so will recalculate the same scorecard. Go to Prediction and make sure original data is not ticked but additional data is ticked. Browse through the data sets and choose the one with the germanregressionstart number corresponding to Sheet 4. Click on display all plots and on show predictions
If it asks do you want to replace the original results say No.
When the results come up you should have the same parameter estimates as before. You can check by going back and for between the two outputs.
Click on Process Flown and then click on the new Linear regression icon which looks like a table ( the one that is connected to the both data sets). This is the results for the extra 300 cases you did not build the model on. Go to output and check there are 300 cases. If you look at the last three columns you will see Good, Flag ( ignore) and Predicted Good. The latter is the score ( i.e. the prediction of that case being Good). It should start with
0.9144741217
0.5617142138
0.6960097216
0.5962368774
0.5236957807
0.6854915045
0.3625292558
0.6444213441
Next week we will see how to use Excel to draw the ROC curves and calculate the Gini coefficient for this data.
To save your work go to File and choose save project as. Save it somewhere in the Local computer you can access later.
Description of the German credit dataset.
Number of Attributes german: 20 (7 numerical, 13 categorical)
Number of Attributes german.numer: 24 (24 numerical)
Attribute description
Attribute 1: (qualitative)
Status of existing checking account
A11 : … < 0 DM A12 : 0 <= ... < 200 DM A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative)
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation – does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attibute 6: (qualitative)
Savings account/bonds
A61 : … < 100 DM A62 : 100 <= ... < 500 DM A63 : 500 <= ... < 1000 DM A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year A73 : 1 <= ... < 4 years A74 : 4 <= ... < 7 years A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
A171 : unemployed/ unskilled – non-resident
A172 : unskilled – resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
foreign worker
A201 : yes
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com