PART 1 – Quantitative Credit Scoring (85%)

Section I – Variable Mapping
Go to the following link to download the dataset:
https://www.dropbox.com/s/mkoxbevzyvj1j9q/Credit_data_RSM6305.txt?dl=0
PS: The first column is just index values; ignore it and do not count it as one of the features!
You have been provided a credit dataset with 20 different borrower attributes (7 numerical and 13 categorical). Descriptions of the attributes are as follows:
Attribute 1: (qualitative) Status of existing checking account
Attribute 2: (numerical) Duration in month
Attribute 3: (qualitative) Credit history
Attribute 4: (qualitative) Purpose
o A40 : car (new)
o A41 : car (used)
o A42 : furniture/equipment
o A43 : radio/television
o A44 : domestic appliances
o A45 : repairs
o A46 : education
o A47 : (vacation – does not exist?)
o A48 : retraining
o A49 : business
o A410 : others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
o A61 : … < 100 DM
o A62 : 100 <= … < 500 DM
o A63 : 500 <= … < 1000 DM
o A64 : … >= 1000 DM
o A65 : unknown/ no savings account
Attribute 7: (qualitative) Present employment since
o A71 : unemployed
o A72 : … < 1 year
o A73 : 1 <= … < 4 years
o A74 : 4 <= … < 7 years
o A75 : … >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex
o A91 : male : divorced/separated
o A92 : female : divorced/separated/married
o A93 : male : single
o A94 : male : married/widowed
o A95 : female : single
Attribute 10: (qualitative) Other debtors / guarantors
o A101 : none
o A102 : co-applicant
o A103 : guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property
o A121 : real estate
o A122 : if not A121 : building society savings agreement/life insurance
o A123 : if not A121/A122 : car or other, not in attribute 6
o A124 : unknown / no property
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans
o A141 : bank
o A142 : stores
o A143 : none
Attribute 15: (qualitative) Housing
o A151 : rent
o A152 : own
o A153 : for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job
o A171 : unemployed/ unskilled – non-resident
o A172 : unskilled – resident
o A173 : skilled employee / official
o A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone
o A191 : none
o A192 : yes, registered under the customer's name
Attribute 20: (qualitative) foreign worker
o A201 : yes
o A202 : no
The 21st column denotes the credit status of the borrower, with 1 being good and 2 being bad.
Your first task will be to transform the provided dataset to the respective categorical variables based on the attributes mentioned above.
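A minimal sketch of the mapping step in Python using pandas. The column names and the two lookup tables shown are illustrative; in practice you would build one dictionary per categorical attribute from the lists above, and read the real file with `pd.read_csv("Credit_data_RSM6305.txt", sep=r"\s+", header=None)`, dropping the first (index) column.

```python
import pandas as pd

# Code -> label lookups for two of the attributes (extend with the remaining codes)
purpose_map = {
    "A40": "car (new)", "A41": "car (used)", "A42": "furniture/equipment",
    "A43": "radio/television", "A44": "domestic appliances", "A45": "repairs",
    "A46": "education", "A48": "retraining", "A49": "business", "A410": "others",
}
housing_map = {"A151": "rent", "A152": "own", "A153": "for free"}

# Tiny inline sample standing in for the downloaded file
df = pd.DataFrame({"purpose": ["A40", "A43"], "housing": ["A152", "A151"]})
df["purpose"] = df["purpose"].map(purpose_map)   # replace codes with labels
df["housing"] = df["housing"].map(housing_map)
print(df)
```

`Series.map` leaves unmatched codes as NaN, which conveniently surfaces any missing or invalid category values for the imputation step in Section II.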
Section II – Exploratory Data Analysis & Wrangling
Your next task will be to conduct exploratory data analysis.
1) Observe the variables by plotting histograms and box plots (for continuous variables) and frequency tables (bar plots) for categorical variables. Conduct proper outlier detection (if any) and use the tools taught in class to treat outliers. Present your analysis.
2) You may notice that the categorical variables contain missing values in some instances. It is advisable to conduct proper data imputation instead of dropping the rows entirely, which would result in a loss of information. Use your judgement to conduct proper imputation. Keep in mind that not all the variables are of the same type (some continuous and some categorical), so use proper judgement and methods to treat these two types differently when imputing. Present your analysis.
3) Conduct cross tabulation of all the categorical predictors with the credit status of the borrower. For this you need to create a cross contingency table. For example, if we take a categorical variable 𝑋 with 3 categories 𝑥1, 𝑥2 and 𝑥3, the cross contingency table will take the following form:
Credit Status | 𝑥1                        | 𝑥2                        | 𝑥3                        | Row Total
0             | # of 𝑥1 in 0 (and % of 0) | # of 𝑥2 in 0 (and % of 0) | # of 𝑥3 in 0 (and % of 0) | Total 0
1             | # of 𝑥1 in 1 (and % of 1) | # of 𝑥2 in 1 (and % of 1) | # of 𝑥3 in 1 (and % of 1) | Total 1
Column Total  | Total 𝑥1                  | Total 𝑥2                  | Total 𝑥3                  | Grand total
For each table, present an outline of the analysis.
4) Using the cross contingency table, perform a chi-square test to examine the dependence of each categorical variable on the credit status of the borrower. Take note of the variables that have a statistically significant dependence with the response variable.
5) For the continuous variables, present the necessary descriptive statistics. Make sure to standardize the continuous variables before moving to estimation. A correlation matrix of all the variables (categorical and continuous) will also give a good idea of the dependence structure of the dataset.
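Items 4 and 5 can be prototyped with a short, self-contained sketch; the 2×3 table below is made-up data, not the credit dataset, and the function names are my own:

```python
from statistics import mean, pstdev

def chi2_stat(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c
    contingency table (list of rows), using expected counts row*col/n."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(row_tot))
        for j in range(len(col_tot))
    )
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof

def standardize(xs):
    """Z-score a list of observations: subtract the mean, divide by the sd."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

# Illustrative 2x3 table: credit status (rows 0/1) vs. categories x1, x2, x3
stat, dof = chi2_stat([[30, 50, 20], [10, 40, 50]])
# stat is ~23.97 with 2 df, well above the 5% critical value of 5.991,
# so this illustrative variable would be flagged as dependent on credit status
```

Compare the statistic against the chi-square critical value for the computed degrees of freedom (or use `scipy.stats.chi2_contingency` to get a p-value directly).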
Section III – Estimation
Before moving to the estimation phase, it is important not to use the full data for estimation. Conduct a 70:30 split: randomly sample 70% of the data as the training set and keep the remaining 30% as the test set.
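The split can be sketched as follows; the row count `n` is illustrative, so replace it with the number of rows in your dataset:

```python
import random

random.seed(6305)                     # fix the seed so the split is reproducible
n = 1000                              # illustrative row count
idx = list(range(n))
train_idx = sorted(random.sample(idx, k=int(0.7 * n)))   # 70% without replacement
test_idx = [i for i in idx if i not in set(train_idx)]   # the remaining 30%
```

Index the mapped dataset by `train_idx`/`test_idx` so every model in this section is fit on the same training rows.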
1) Start by estimating a logistic regression using all the categorical predictors found significant in the chi-square tests, plus all numeric variables. At every iteration, remove the insignificant variables and re-estimate until all remaining variables are significant at the 5% level. Briefly explain the final chosen variables and the signs of the coefficients.
a. For the logistic regression, you must build your own function, including the logistic distribution function, the log likelihood, and the optimization process. Take help from Appendix A3 of the textbook to build your own function for Newton's method, since you need the Hessian matrix to calculate the relevant regression statistics.
b. Plot the ROC curve for the in-sample prediction. Use built-in packages/functions to extract the ROC curve. Perform a Kolmogorov–Smirnov (KS) test on the possible true positive rates and false positive rates for each cutoff value. Pick the appropriate cutoff value based on the KS test, and use this cutoff value for the out-of-sample prediction. (This will be important in Section IV for calculating the Brier score and conducting the HL test.)
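A sketch of what parts 1a and 1b ask for, run on synthetic data rather than the credit dataset: Newton's method applied to the logistic log likelihood (the Hessian is returned, since its inverse gives the coefficient covariance needed for test statistics), followed by choosing the cutoff that maximizes TPR − FPR, i.e. the KS distance. Function names are my own; adapt the details to the textbook's Appendix A3 notation.

```python
import numpy as np

def sigmoid(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + np.exp(-z))

def logit_newton(X, y, tol=1e-8, max_iter=50):
    """Maximize the logistic log likelihood by Newton's method.
    Returns the coefficient vector (intercept first) and the Hessian."""
    Xd = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    beta = np.zeros(Xd.shape[1])
    H = None
    for _ in range(max_iter):
        p = sigmoid(Xd @ beta)
        grad = Xd.T @ (y - p)                       # score vector
        W = p * (1.0 - p)
        H = -(Xd * W[:, None]).T @ Xd               # Hessian of the log likelihood
        step = np.linalg.solve(H, grad)             # Newton step: H^{-1} grad
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta, H

def ks_cutoff(p_hat, y):
    """Pick the cutoff maximizing TPR - FPR (the KS distance)."""
    best_c, best_d = 0.5, -1.0
    for c in np.unique(p_hat):
        pred = p_hat >= c
        tpr = pred[y == 1].mean()
        fpr = pred[y == 0].mean()
        if tpr - fpr > best_d:
            best_c, best_d = c, tpr - fpr
    return best_c, best_d

# Synthetic demo: one predictor with true coefficients (-0.5, 2.0)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < sigmoid(-0.5 + 2.0 * x)).astype(float)
beta, H = logit_newton(x.reshape(-1, 1), y)
p_hat = sigmoid(beta[0] + beta[1] * x)
cutoff, ks_dist = ks_cutoff(p_hat, y)
```

Freeze the chosen `cutoff` from the training data and reuse it unchanged when classifying the test set.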
2) Estimate a stepwise logistic regression model. Present your output results. Briefly explain the signs of the coefficients and their significance.
- Take help from the slides to get an idea of how to approach the algorithm for the stepwise functions. Feel free to implement only one method, either backward or forward.
- Perform the KS test as usual to obtain the optimal cutoff value. Use this cutoff value for out-of-sample prediction.
3) Estimate a decision tree on the given dataset. Present your results.
- For this question, feel free to leverage libraries or built-in functions for estimation.
- Apply cost complexity pruning to the large tree obtained in order to obtain a sequence of sub-trees. Conduct a K-fold cross validation.
- Find the complexity parameter for which the cross-validated error rate is minimum.
- Prune the tree using this complexity parameter. This way, you will obtain the final tree.1
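One possible shape for the pruning workflow, using scikit-learn (the assignment permits built-in functions here) on synthetic stand-in data; `ccp_alpha` is scikit-learn's name for the cost-complexity parameter:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the (preprocessed) credit data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1. Grow a large tree and extract the cost-complexity pruning path
big_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)

# 2. K-fold cross-validate each candidate alpha (each alpha = one sub-tree)
cv_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                    X, y, cv=5).mean()
    for a in path.ccp_alphas
]

# 3. Keep the alpha with the best cross-validated accuracy and refit
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```

The same three steps carry over to `rpart` in R, where `printcp`/`prune` play the roles of the pruning path and the refit.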
4) Estimate a random forest on the given dataset. Set the forest size to 1000 trees and the number of variables evaluated per node to 5. For more information on the methodology and how to estimate it in R, go over this document: https://www.dropbox.com/s/higfw6xd1drchq1/TreeBasedMethods.pdf?dl=0
- For this question, feel free to leverage libraries or built-in functions for estimation.
- Present the necessary results.
- Plot the variables based on their importance. Do the results make sense?
Section IV – Performance Validation
1) For each model estimated in Section III, predict the default probability using the test set.
2) Build a function for the Brier score2 and calculate the score for each prediction result.
3) Build a function for the Hosmer–Lemeshow test3 and use it to test each prediction result.
4) Plot the ROC curve for each method, and calculate the AUC for each out-of-sample prediction.
5) Which model would you recommend as your final model?
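Hand-rolled versions of the three metrics (items 2–4) might look like this; the grouping rule in the HL function (equal-size groups after sorting by predicted probability) is one common convention, and the toy probabilities at the bottom are made up for illustration:

```python
def brier_score(p, y):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

def hosmer_lemeshow(p, y, g=10):
    """HL statistic: sort by predicted probability, split into g groups,
    compare observed vs. expected events; ~chi-square with g-2 df."""
    pairs = sorted(zip(p, y))
    stat = 0.0
    for k in range(g):
        grp = pairs[k * len(pairs) // g:(k + 1) * len(pairs) // g]
        if not grp:
            continue
        n_k = len(grp)
        obs = sum(yi for _, yi in grp)          # observed events in the group
        exp = sum(pi for pi, _ in grp)          # expected events in the group
        pbar = exp / n_k
        if 0 < pbar < 1:
            stat += (obs - exp) ** 2 / (n_k * pbar * (1 - pbar))
    return stat, g - 2

def auc(p, y):
    """AUC as the probability a random event outranks a random non-event
    (Mann-Whitney form), counting ties as half."""
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy out-of-sample predictions and realized outcomes
p = [0.9, 0.8, 0.3, 0.1]
y = [1, 1, 0, 0]
bs = brier_score(p, y)                  # 0.0375
hl_stat, hl_df = hosmer_lemeshow(p, y, g=4)
auc_val = auc(p, y)                     # 1.0: perfect ranking
```

Lower Brier scores and higher AUCs are better; for the HL test, a statistic small relative to the chi-square critical value at `g - 2` degrees of freedom indicates good calibration.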
1 For a brief on K-fold cross validation, please see: https://eight2late.wordpress.com/2016/02/16/a-gentle-introduction-to-decision-trees-using-r/
2 For a brief on the Brier score, see: https://en.wikipedia.org/wiki/Brier_score
3 A good primer: http://thestatsgeek.com/2014/02/16/the-hosmer-lemeshow-goodness-of-fit-test-for-logistic-regression/
PART 2 – Fundamental Credit Analysis (15%)
You are an analyst for a bond hedge fund, a credit analyst, or someone working in a pension fund (private asset classes' investment division or Investment Risk). The goal of this part of the project is to get you familiar with some concepts and approaches for credit analysis, especially what is known as fundamental credit analysis (some places call it the internal rating approach).
— To begin, pick four ‘comparable’ companies within the same industry that have the same rating and have bonds outstanding, preferably with approximately the same time to maturity.
— To make your life easier, all bonds should be denominated in USD/CAD, and issued by firms domiciled either in Canada or the US.
— You can use the Bloomberg SRCH function to search for bonds and firms. See these links for details:
http://asklib.library.hbs.edu/faq/47426
http://www.investopedia.com/university/corporate-bonds-bloomberg-active-trader/

Further, given the industry you choose, create internal ratings of the firms in your sample. I have shared several Moody's reports by industry that discuss the important factors for your credit analysis. You can find them here: https://www.dropbox.com/sh/i7ju97wsk72ctbu/AACH76AEOSkHWxD1JwyAtOxBa?dl=0. You can follow Moody's methodology. You can also complement the analysis using the 5 steps of credit analysis mentioned in class to guide your thinking. Basically, I am looking for fundamental analysis.
o The attached Moody's documents should help you understand which ratios are more important than others in each industry.
o Augment your credit/internal rating of the firms by adding recovery analysis, so that you have a more informed internal rating that accounts for both the likelihood of default and the severity of default (if you think the security's value at the time of default is not represented in the rating methodology you are following).
o You can also compute the Z and Ohlson scores. They are helpful for making comparisons across companies, even though you will notice that your internal ratings may dominate those measures. Comment on that.
o Overall, consider factors that affect the potential default probability (cash flow, capacity to repay, and covenant strictness with respect to the other firms you chose or to the industry).
o Consider factors that affect the potential recovery rate (collateral, tangible assets, priority, other senior or junior debt in the capital structure, etc.).
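For the Z-score mentioned above, a short sketch of the original (1968) Altman formulation for publicly traded manufacturers; the coefficients and zone thresholds are the standard published ones, while the input figures below are purely illustrative and not taken from any real issuer:

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Original (1968) Altman Z-score for public manufacturing firms.
    Z > 2.99 is the 'safe' zone, 1.81-2.99 the 'grey' zone, Z < 1.81 'distress'."""
    return (1.2 * working_capital / total_assets
            + 1.4 * retained_earnings / total_assets
            + 3.3 * ebit / total_assets
            + 0.6 * market_value_equity / total_liabilities
            + 1.0 * sales / total_assets)

# Illustrative figures (not a real issuer), in $mm
z = altman_z(working_capital=150, retained_earnings=300, ebit=150,
             market_value_equity=900, sales=1100, total_assets=1000,
             total_liabilities=600)   # ~3.10, just inside the 'safe' zone
```

Compute the score from each firm's most recent filings and compare the resulting ranking against your internal ratings when commenting on where the measures agree or diverge.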
- What can you learn from the above information about the differences in bond yields of these firms?
- Based on your assigned internal ratings, how do you assess these companies relative to the ratings provided in Bloomberg? Explain the rationale for the deviation between your internal rating methodology and that of the rating agencies.