University of Cardiff
MAT012 Credit Risk Scoring (2016/17)
Copyright By PowCoder代写 加微信 powcoder
Lab Session 1: How scorecards are Developed- using Excel
Task 1: Understand data set.
Open Excel and then open the data set entitled germanexcelstart2010
This is a data set consisting of 1000 cases of applicants for credit together with details of 20 of their characteristics. 700 of the applicants turned out to be Good ( ‘1’ in the Good column) and 300 were Bad ( ‘1’ in the Bad column). There is a data dictionary at the end of this handout but the data you have has been simplified by removing A and the characteristic number from the attributes. Thus status of existing checking account has values 1, 2, 3 and 4 not characteristic A11, A12, A13 and A14. Check that you understand the meaning of the data set which is provided in Sheet 1.
这是一个数据集,包括1000个申请信贷的案例,以及20个特征的详细信息。其中700名申请人表现良好(“1”在“良好”一栏),300名表现不佳(“1”在“糟糕”一栏)。在本讲义的末尾有一个数据字典,但是您已经通过从属性中删除a和特征数简化了数据。因此,现有支票账户的状态值checking为1、2、3和4,而不是特征A11、A12、A13和A14。检查您是否理解表1中提供的数据集的含义。
Checking:支票账户的状态值
Duration:持续时间
Amount:存款金额
Telephone:是否有电话号码
Foreign:1:本国人2:外国人
Bad:申请人表现不佳
Good:申请人表现良好
The course structure looks at how to apply specific scorecard methodologies before looking at the practical details of preparing the data including coarse classifying. So in this first lab class we will use data that has already been put in the appropriate form with all the variables being binary (0 or 1) values. In the second workshop we will look at how to coarse classify the variables and so obtain such binary variables
课程结构着眼于如何应用特定的记分卡方法,然后再看准备数据的实际细节,包括粗分类。因此,在第一堂实验课中,我们将使用已经以适当形式放入的数据,所有变量都是二进制(0或1)值。在第二个专题中,我们将研究如何对变量进行粗分类,从而获得这样的二元变量
Task 2: Using linear regression to build a scorecard
Go to sheet 3 where we have the data corresponding to binning six of the variables – checking, duration, purpose, married, telephone and foreign, together with the Good column. There are 3 variables for checking 1,2,3, yet there was a fourth value available 4.
转到表3,这里我们有与6个变量对应的数据——检查、持续时间、目的、结婚、电话和外国,以及“好”列。有3个变量用于检查1、2、3,但还有第四个值可用。
Why have we left the corresponding binary variable out?
Duration also has 3 variables. What is the variable that has been left out in its case?
Similarly what are the five variables that arise from the purpose characteristic?.
What sort of loans will have a 0 in all five cases?
为什么我们把对应的二进制变量漏掉了?
持续时间也有3个变量。在这个例子中被忽略的变量是什么?
同样,从目的特性中产生的五个变量是什么?
哪种贷款在这五种情况下都是0 ?
Do you understand what a 1 in the married, telephone and foreign variables means ( One has to check the original data to see)
你明白“已婚”、“电话”和“外来变量”中的“1”是什么意思吗?(要查看原始数据)
Now we will apply linear regression to get a scorecard where we are regressing the Good variable on the characteristics discussed above.
现在,我们将应用线性回归得到一个记分卡,在那里,我们正在回归良好的变量在上面讨论的特征。
First go to Options in the File Menu. Then click on Add Ins and click on Go by Manage Excel Add ins. Tick Analysis ToolPak and Solver and click on OK.
Then if we go to the Data menu one should see Data Analysis and Solver in the Analysis section on the far right .
Click on Data Analysis. Highlight Regression and click OK
首先转到文件菜单中的选项。然后点击Add Ins,点击Go by Manage Excel Add Ins。勾选分析工具和解决方案,然后点击确定。
然后,如果我们转到数据菜单,应该会在最右边的分析部分看到数据分析和求解程序。
点击数据分析。突出显示Regression并单击OKEither by highlighting the appropriate region or by typing ,
type into Input Y range O1:O1001
type into Input X Range A1:N1001
Tick the labels box
Tick the output region and enter in A1005 (so output will be at end of data set).
Click on OK.
The regression results should now appear from cell A1005 onwards.
It suggests that the scorecard should be as follows where I have multiplied each coefficient by 100 and rounded them.
Score = 59 -35 *Check1-23*Check2-11*Check3+28 * Duration(12-)+16 Duration(12-36) +21 *Purpose1+9*Purpose2+14 Purpose(3or4)+5*Purpose(5or9) -6*Married+4*telephone +21* Foreigner
Why has Duration(36-48) and Purpose0) disappeared?
Do the FOUR values for checking account make sense ( Where is the fourth?)
为什么持续性(36-48)和目的性(0)消失了?
支票账户的四个值有意义吗(第四个值在哪里?)
The R2 value is around 0.18.What do you think of the scorecard in the light of this? (Remember you are fitting a line to data where all the dependent values are 0 or 1.)
R2的值大约是0.18。你觉得这样的记分卡怎么样?(请记住,您正在将一行拟合到所有相关值为0或1的数据中。)
We can construct the scores for all the cases as follows
Highlight B1022 to B1035 and Copy in the Home menu
Then Highlight A1002 to N1002
Go to Paste in the Home menu and then click on Paste Special and click on values and on transpose.
Now we have the scores for each attribute in row 1002.
Go to P1 and type in score
In cell P2 type in =100*($B$1021+SUMPRODUCT(A2:N2,A$1002:N$1002))
This should give the score ( multiplied by 100 for the first data point of 70.
Copy P2 all the way down the column to P1001.
One should have scores varying from 21 to 129.
我们可以构建所有情况下的得分如下
高亮B1022到B1035并复制到主菜单中
然后突出显示A1002到N1002
转到粘贴在主菜单,然后点击粘贴特殊,点击值和转置。
现在我们有了第1002行中每个属性的得分。
在P2类型的细胞中,In =100*($B$1021+SUMPRODUCT(A2:N2,A$1002:N$1002))
这应该会得到分数(第一个数据点为70乘以100)。
将P2一直复制到P1001。
一个人的分数应该从21到129不等。
Task 3: Building scorecard by linear programming using Solver in Excel
The linear programme we outlined has as many constraint as there are cases and the standard version of Solver only allows one to solve linear programming problems with 100 constraints. Thus we will cut down the number of cases to 60 Goods and 30 Bads.
任务3:使用Excel中的求解器,通过线性规划建立记分卡
我们概述的线性规划有尽可能多的约束情况和标准版本的求解器只允许一个解决100个约束的线性规划问题。这样的话,我们将把箱数减少到60件货物,30件货物。
Go to Sheet 4 where this set is available
Solver solves the linear programming problems by defining some of the cells as where the values for the variables in the LP should be. It tries different values in these cells until it gets the solution that maximises or minimises the objective.
One set of variables are the coefficient values of the attributes in the final score, and we will put these scores c(i) in row 92.
So we will start by puting 0s in the cells from A92 to N92.
In fact we will try a variant of the linear programme where we want all the Bads to have score of 60 or less and all the Goods to have scores of 100 or more. ( So we have a gap between the two groups as is the case in Support Vector machines).
转到工作表4,这一套是可用的
Solver通过定义一些单元格来解决线性规划问题,这些单元格是LP中变量的值所在的位置。它在这些细胞中尝试不同的值,直到得到最大化或最小化目标的解决方案。
一组变量是最终得分中属性的系数值,我们将把这些得分c(i)放在第92行。
我们将从在A92到N92的单元格中输入0开始。
事实上,我们将尝试一个线性程序的变体,我们希望所有的坏事都得到60分或更少,而所有的好事都得到100分或更多。(所以我们在这两组之间有一个差距,就像支持向量机一样)。
For the Bads ( the first 30 cases ) we want
c(A)*x(A)+c(B)x(B)+…C(N)*x(N) –Error ( 60
where again the Error is a variable in the LP.
So we will make Column P the column where the errors are ( Again we start with them all 0).; Column Q will be where the calculation c(A)*x(A)+c(B)x(B)+…C(N)*x(N) –Error is made and column R is where we put the 60.
Similarly for the remaining 60 Goods we want
c(A)*x(A)+c(B)x(B)+…C(N)*x(N) +Error( 100
So again we will make Column P the column where the errors are ( all 0 initially ).; Column Q will be where the calculation c(A)*x(A)+c(B)x(B)+…C(N)*x(N) +Error is made and column R is where we put the 100.
So type in the following
In Cell P1 type Error
In Cell P2 type 0 (zero)
Copy P2 down to P91.
In cell P92 type =sum(P2:P91)
( This is our objective function as we will seek to minimise this sum of the errors
In Cell Q1 type Calculation
In Cell Q2 type = SUMPRODUCT(A2:N2, A$92:N$92)+(-1+2*O2)*P2
Copy Q2 down to Q91
( Notice the -1+2*O” means the value is -1 for the Bads and +1 for the Goods)
In Cell R1 type Boundary value
In R2 type 60
Copy R2 down to R31.
In R32 type 100
Copy R32 down to R91
Now we are in a position to use solver.
Go to Solver in the Data menu
( If Solver is not there, then click on Add ons in options in the File menu. Click on Go on the Excel Add ins and then tick the solver choice. Click on OK).
When you go back to Tools , Solver should be there then)
Click on Solver
When the dialogue box appears
On set target cell type or highlight P92
Click on the minimise button
Either by typing in or clicking on the appropriate cells, put in the” by changing cells” box
A92:N92, P2:P91
Using the add button put in the following constraints
Q2:Q31 <= R2:R31
Q32:Q91 => R32:R91
P2:P91=> 0
Choose simplex LP in the solving method.
Make sure the box “ Make unconstrained variables non negative is NOT ticked. (we are happy for some of the scores to be negative)
Click on Solve..
Click on OK and then solve.
The solution should have the sum of the errors as 868 and the scorecard looks like
Score = -24*Check1-16*Check2+24 *Check3 + 108* Duration(12-)+84 *Duration(12-36) +52*Duration(36-48) + 40*Purpose1 +40*Purpose2 +24*Purpose(3or4) +40*Purpose(5or9) -8*Married—8*Telephone +16*Foreign
How are these score compared to the linear regression model? Remember it is the relative differences that are important?
Task 4: Sensitivity Analysis for the Linear programming scorecard
What happens if we say Bads should be less than 80 and Goods more than 100?
Is that a better scorecard?
Description of the German credit dataset.
Number of Attributes german: 20 (7 numerical, 13 categorical)
Number of Attributes german.numer: 24 (24 numerical)
Attribute description
Attribute 1: (qualitative)
Status of existing checking account
A11 : … < 0 DM A12 : 0 <= ... < 200 DM A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative)
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation – does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attibute 6: (qualitative)
Savings account/bonds
A61 : … < 100 DM A62 : 100 <= ... < 500 DM A63 : 500 <= ... < 1000 DM A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year A73 : 1 <= ... < 4 years A74 : 4 <= ... < 7 years A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
A171 : unemployed/ unskilled – non-resident
A172 : unskilled – resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
foreign worker
A201 : yes
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com