The dataset underpinning the analysis here is that used in the lab sessions during lectures. It
has been uploaded as a spreadsheet named ‘German’ together with the data dictionary
‘German data dictionary’ describing each attribute. You will recall that the dataset consists
of data for 1000 applicants along with a variable that says whether they were subsequently
Good or Bad from a credit perspective.
1. Split the dataset into two subsets as follows:
Subset 1: the applicants with Duration <= 12 months
Subset 2: the applicants with Duration > 12 months
Clean the subsets if necessary.
2. For each subset, establish a training set and validation set. Explain:
a. what principle you have used to decide on these;
b. why both training and validation sets are needed;
c. any issues encountered during the splitting exercise.
3. For each training set, choose four variables which are suitable for building a
scorecard. For each training set, the chosen variables must include (i) at least one
variable that is continuous before binning; (ii) at least one categorical variable with
more than two categories, so you can see whether categories can be combined.
Explain the rationale behind your choice of variables, using supporting statistics (e.g.
chi-square). If you are unable to choose variables satisfying the above criteria,
explain the problem you encountered and the compromise you adopted in the
variable selection.
[10 marks]
4. Use the binary variables obtained from the coarse classification in the above
exercise to build two scorecards for each training set (so, two scorecards for those
applicants with Duration <= 12 months and another two for those with Duration > 12
months), one using linear regression and one using logistic regression.
Note that the file you submit should include, in the Appendix, a table that gives the
binary variables you used, together with the coefficients for those variables
calculated in each regression.
[15 marks]
5. Derive ROC curves for all scorecards using the validation set applicable to each,
showing in detail how sensitivity and specificity have been calculated. Estimate the
Gini coefficient and KS values for each. Explain and comment on your results.
[15 marks]
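
For reference, the quantities named in task 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the function names `roc_points` and `gini_and_ks` are hypothetical, the six scores are invented toy values (not taken from the German dataset), a higher score is assumed to mean "more likely Good", and an applicant is assumed to be classified Good when the score is at or above the cutoff. Sensitivity is taken as the share of actual Goods classified Good, and specificity as the share of actual Bads classified Bad.

```python
import numpy as np


def roc_points(scores, is_good):
    """Return cutoffs with the sensitivity and specificity at each cutoff.

    Assumption: an applicant is predicted Good when score >= cutoff.
    """
    scores = np.asarray(scores, dtype=float)
    is_good = np.asarray(is_good, dtype=bool)
    # Start above the highest score (nobody predicted Good) and step down
    # through each distinct score until everybody is predicted Good.
    cutoffs = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    sens, spec = [], []
    for c in cutoffs:
        pred_good = scores >= c
        tp = np.sum(pred_good & is_good)      # Goods predicted Good
        fn = np.sum(~pred_good & is_good)     # Goods predicted Bad
        tn = np.sum(~pred_good & ~is_good)    # Bads predicted Bad
        fp = np.sum(pred_good & ~is_good)     # Bads predicted Good
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return cutoffs, np.array(sens), np.array(spec)


def gini_and_ks(sens, spec):
    """Area under the ROC curve (trapezium rule), Gini and KS."""
    fpr = 1.0 - spec                          # x-axis of the ROC curve
    auc = np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2.0)
    gini = 2.0 * auc - 1.0                    # Gini = 2 * AUC - 1
    ks = np.max(np.abs(sens - fpr))           # largest gap between the two rates
    return auc, gini, ks


# Toy validation sample (hypothetical values): 3 Goods and 3 Bads.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_is_good = [1, 1, 0, 1, 0, 0]

cutoffs, sens, spec = roc_points(toy_scores, toy_is_good)
auc, gini, ks = gini_and_ks(sens, spec)
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}, KS = {ks:.3f}")
```

Plotting `sens` against `1 - spec` gives the ROC curve. For a real scorecard, the same functions would be applied to the validation-set scores from each regression, and the table of cutoffs, sensitivities and specificities is what shows the calculation in detail.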
