UNIVERSITY OF CARDIFF MAT012 Credit Risk Scoring Assignment 2019/20
This assignment forms 100% of your assessment for this module. There are two parts to this assessment.
Part A contains THREE short essay-based questions and counts for 50% of the final mark.
Part B contains FIVE tasks to establish a scorecard using the given dataset and counts for 50% of the final mark. You may use Excel, SAS, R or Python to assist in the scorecard preparation.
You must answer ALL questions.
Submission must be made by 3pm on Friday 20th March via Learning Central, and instructions will follow shortly on how to do this. You will need to submit a single file containing answers to all questions; any spreadsheet analysis, workings or coding necessary can be shown in an Appendix in that file. Only the submitted file will be marked.
PART A
1. Critically examine what needs to be considered when developing a credit risk scoring model.
[20 marks]
2. Explain how, in theory, Cox's proportional hazards model for survival analysis can be used for constructing a scorecard. Comment on the relative popularity of Cox's PH model versus logistic regression in scorecard construction.
[15 marks]
3. Provide a brief literature review on the use of Markov models in credit risk modelling, with a particular focus on those used in credit risk scoring.
[15 marks]
PART B
The dataset underpinning the analysis here is that used in the lab sessions during lectures. It has been uploaded as a spreadsheet named 'German' together with the data dictionary 'German data dictionary' describing each attribute. You will recall that the dataset consists of data for 1000 applicants along with a variable that says whether they were subsequently Good or Bad from a credit perspective.
1. Split the dataset into two subsets as follows:
Subset 1: the applicants with Duration <= 12 months
Subset 2: the applicants with Duration > 12 months
Clean the subsets if necessary.
[5 marks]
2. For each subset, establish a training set and a validation set. Explain:
a. what principle you have used to decide on these;
b. why both training and validation sets are needed;
c. any issues encountered during the splitting exercise.
[5 marks]
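For orientation only, the two splitting steps could be sketched in Python as below. This is a minimal illustration, not a required approach: the field names `Duration` and `Good`, the 70/30 ratio, the stratification by outcome, and the toy records are all assumptions made for the example and are not taken from the spreadsheet or the data dictionary.

```python
import random

def split_by_duration(rows, threshold=12):
    """Return (short, long_): rows with Duration <= threshold, and the rest."""
    short = [r for r in rows if r["Duration"] <= threshold]
    long_ = [r for r in rows if r["Duration"] > threshold]
    return short, long_

def stratified_split(rows, label_key="Good", train_frac=0.7, seed=0):
    """Shuffle within each outcome class, then cut each class at train_frac,
    so the Good/Bad mix is roughly the same in both sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = [], []
    for label in sorted({r[label_key] for r in rows}):
        group = [r for r in rows if r[label_key] == label]
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

# Toy records standing in for the real data (values are made up).
rows = [{"Duration": d, "Good": int(i % 3 != 0)}
        for i, d in enumerate([6, 12, 18, 24, 36] * 20)]

short, long_ = split_by_duration(rows)
train, valid = stratified_split(short)
print(len(short), len(long_), len(train), len(valid))
```

Stratifying by outcome is one possible principle among several; a simple random split or an out-of-time split would need a different helper.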
3. For each training set choose four variables which are suitable for building a scorecard. For each training set the chosen variables must include (i) at least one variable that is continuous before binning; (ii) at least one categorical variable with more than two categories, so you can see whether categories can be combined.
Explain the rationale behind your choice of variables (using supporting statistics, e.g. chi-square). Should you be unable to choose variables satisfying the above criteria, explain the problem you have encountered and the compromise you have made in the variable selection.
[10 marks]
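As an illustration of the kind of supporting statistic meant here, the Pearson chi-square statistic for one categorical variable against the Good/Bad outcome could be computed as in the sketch below. The function name `chi_square` and the toy variable with categories "A" and "B" are hypothetical; the statistic would still need to be compared with a chi-square critical value for the stated degrees of freedom (3.841 at the 5% level for one degree of freedom).

```python
from collections import Counter

def chi_square(categories, outcomes):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table of category against outcome."""
    n = len(categories)
    observed = Counter(zip(categories, outcomes))
    row_tot = Counter(categories)
    col_tot = Counter(outcomes)
    stat = 0.0
    for c in row_tot:
        for o in col_tot:
            # Expected count if category and outcome were independent.
            expected = row_tot[c] * col_tot[o] / n
            stat += (observed[(c, o)] - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof

# Toy data (made up): category A has 40 Good / 10 Bad, B has 20 Good / 30 Bad.
cats = ["A"] * 50 + ["B"] * 50
outs = ["Good"] * 40 + ["Bad"] * 10 + ["Good"] * 20 + ["Bad"] * 30
stat, dof = chi_square(cats, outs)
print(round(stat, 3), dof)
```

The same function can be applied before and after merging two categories to see how much discriminatory power the coarser grouping retains.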
4. Use the binary variables obtained from the coarse classification in the above exercise to build two scorecards for each training set (so, two scorecards for those applicants with Duration <= 12 months; another two for those with Duration > 12 months), one using linear regression and one using logistic regression.
Note that the file you submit should include, in the Appendix, a table that gives the binary variables you used, together with the coefficients for those variables calculated in each regression.
[15 marks]
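To make the contrast between the two regressions concrete, the sketch below fits both to a single binary indicator using plain gradient descent. This is an assumption-laden toy: in practice you would use the regression routines in Excel, SAS, R or Python rather than this hand-written loop, and the function `fit`, the learning rate, the step count and the data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, logistic, lr=0.5, steps=5000):
    """Gradient descent on half the mean squared error (linear) or the
    mean log-loss (logistic). Returns [intercept, coef_1, ..., coef_k]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            # Both losses give a gradient of the form (prediction - y) * x.
            err = (sigmoid(z) if logistic else z) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Toy data (made up): indicator = 1 has 8 Good / 2 Bad, indicator = 0 has 4 Good / 6 Bad.
X = [[1]] * 10 + [[0]] * 10
y = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6

linear_coefs = fit(X, y, logistic=False)
logistic_coefs = fit(X, y, logistic=True)
print(linear_coefs, logistic_coefs)
```

With one indicator, the linear coefficients reproduce the Good rates directly (intercept plus coefficient is the Good rate when the indicator is 1), while the logistic coefficients reproduce the log-odds, which is the scale on which scorecard points are usually expressed.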
5. Derive ROC curves for all scorecards using the validation set applicable to each, showing in detail how sensitivity and specificity have been calculated. Estimate the Gini coefficient and KS values for each. Explain and comment on your results.
[15 marks]
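The calculations asked for here could be laid out as in the following sketch. It assumes, purely for illustration, that a higher score means more likely Good, that Good (label 1) is treated as the positive class, that Gini is taken as 2 x AUC - 1 with the area found by the trapezium rule, and that KS is the largest gap between the true positive rate and the false positive rate over all cut-offs. The function names and the six toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """For each distinct cut-off, classify score >= cut-off as Good.
    Returns a list of (cutoff, sensitivity, specificity)."""
    pos = sum(labels)            # number of Goods
    neg = len(labels) - pos      # number of Bads
    points = []
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < cut and l == 0)
        points.append((cut, tp / pos, tn / neg))
    return points

def gini_and_ks(points):
    """Gini = 2 * AUC - 1 (trapezium rule); KS = max |TPR - FPR|."""
    # ROC curve as (1 - specificity, sensitivity), starting from the origin.
    curve = [(0.0, 0.0)] + [(1 - spec, sens) for _, sens, spec in points]
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(curve, curve[1:]))
    ks = max(abs(y - x) for x, y in curve)
    return 2 * auc - 1, ks

# Toy validation scores (made up): 1 = Good, 0 = Bad.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
gini, ks = gini_and_ks(points)
for cut, sens, spec in points:
    print(cut, round(sens, 3), round(spec, 3))
print(round(gini, 4), round(ks, 4))
```

If Bad is taken as the positive class instead, sensitivity and specificity swap roles, but the Gini coefficient and KS value are unchanged.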