Preliminary Information
• In your report do not just replicate the process followed during the workshops! The objective of the workshops is to introduce you to the different techniques discussed during the lectures, and not to give you a roadmap on how to answer the coursework.
• Assessment: In the coursework you will be assessed based on:
1. Your ability to correctly use the tools that we covered in the course
2. Your ability to draw the correct conclusions from these tools
3. Your ability to address the questions posed in the coursework based on an intelligent interpretation of the evidence provided in the previous two steps. (Consult the CRISP-DM process described in Chapter 1 of Guide to Intelligent Data Analysis)
• You will not be assessed on your capability to use R or any other software. For this reason don’t include screenshots from any software or any other information about commands you used, or options you set, or how to draw a figure etc. You will be simply wasting valuable space.
You are free to use any software you like to do the coursework. However, the fact that the software you chose lacks a particular capability covered in the workshops is not an acceptable excuse for failing to complete a task.
• Page limits: Your report must be submitted as a PDF file that does not exceed 12 pages, with at least 11 point typeface. This limit is strict and it includes appendices (which I strongly recommend that you don’t use). If your report exceeds the page limit I will simply stop reading at the end of page 12 and not take into account anything from the remaining pages in the assessment.
• Plagiarism: This is an individual piece of assessment, and you should ensure that your report reflects your own work exclusively.
All reports go through automated software to detect plagiarism from a variety of sources (including past and current students’ reports as well as online resources, conference and journal publications etc.). The consequences of plagiarism are very serious.
Problem Description / Project Objectives
A bank wants you to develop a credit scoring model to classify applications for unsecured loans. You have been provided with a sample of observations which contain information about past bank customers. (The dataset provided to each student is unique.) The description of the variables in this dataset is provided in the next section.
The bank is primarily interested in understanding the main factors that influence repayment behaviour, so that it can exploit this knowledge to improve future decisions. The bank faces a trade-off: accepting applicants for loans lets it retain its market share and increase its profit through interest payments, but giving loans to customers who default on their debt incurs losses. The bank managers are interested in the following questions:
• What is the best way for the bank to use a statistical model to achieve the following goals:
– Accept the maximum number of good customers if at least 85% of bad customers are correctly identified.
– Accept at least 70% of good customers while rejecting as many bad customers as possible.
• If the previous two goals were not specified, which statistical model would you recommend, and why? Compare this model to the ones recommended in the previous question, and discuss similarities and differences.
• How many and which are the most important variables that determine the repayment behaviour of mortgage customers? (Do these differ depending on the objective, and/or the classification method used?)
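The two goals in the first question can be read as constrained choices of a cut-off on the model's predicted probability of BAD=1. The sketch below is only one possible reading, not a prescribed method: the scores and labels are made-up illustrative values, the function names are hypothetical, and an applicant is assumed to be rejected when the score is at or above the cut-off.

```python
from typing import List, Optional, Tuple


def rates_at_threshold(scores: List[float], labels: List[int],
                       threshold: float) -> Tuple[float, float]:
    """Return (fraction of good customers accepted, fraction of bad customers rejected).

    Assumption: an applicant is rejected when the predicted probability
    of BAD=1 is at or above the threshold, and accepted otherwise.
    """
    good = [s for s, y in zip(scores, labels) if y == 0]
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good_accepted = sum(s < threshold for s in good) / len(good)
    bad_rejected = sum(s >= threshold for s in bad) / len(bad)
    return good_accepted, bad_rejected


def best_threshold(scores: List[float], labels: List[int],
                   min_bad_rejected: Optional[float] = None,
                   min_good_accepted: Optional[float] = None
                   ) -> Optional[Tuple[float, float, float]]:
    """Pick the threshold that satisfies one constraint and maximises the other rate.

    Returns (threshold, good_accepted, bad_rejected), or None if no
    candidate threshold satisfies the constraint.
    """
    if (min_bad_rejected is None) == (min_good_accepted is None):
        raise ValueError("Specify exactly one of the two constraints")
    # Every observed score is a candidate; the last candidate accepts everyone.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best = None
    for t in candidates:
        ga, br = rates_at_threshold(scores, labels, t)
        if min_bad_rejected is not None:
            feasible, objective = br >= min_bad_rejected, ga
        else:
            feasible, objective = ga >= min_good_accepted, br
        if feasible and (best is None or objective > best[0]):
            best = (objective, t, ga, br)
    return None if best is None else best[1:]


# Hypothetical predicted probabilities of BAD=1 (not taken from the coursework data).
good_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.30, 0.45, 0.60]
bad_scores = [0.25, 0.40, 0.50, 0.60, 0.65, 0.75, 0.80, 0.85, 0.90, 0.95]
scores = good_scores + bad_scores
labels = [0] * len(good_scores) + [1] * len(bad_scores)

# Goal 1: accept as many good customers as possible while rejecting >= 85% of bad ones.
goal_1 = best_threshold(scores, labels, min_bad_rejected=0.85)
# Goal 2: reject as many bad customers as possible while accepting >= 70% of good ones.
goal_2 = best_threshold(scores, labels, min_good_accepted=0.70)
print("Goal 1 (threshold, good accepted, bad rejected):", goal_1)
print("Goal 2 (threshold, good accepted, bad rejected):", goal_2)
```

On this toy sample the two goals select different cut-offs, which shows that the same fitted model can be used in more than one way. In practice the cut-off would normally be chosen on data that was not used to fit the model.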
Data Description
You are provided with a sample of observations which contain information about past bank customers. The dataset provided to each student is unique. The main variables in this dataset are described in Table 1. The class variable (i.e. the variable we want to predict) is called BAD. In addition to the variables described in the table, the dataset you were provided with contains 9 more variables. Each of these is named with the prefix M_ followed by the name of one of the main variables: for example, M_MORTDUE or M_DEBTINC. All the M variables are binary (i.e. take values in {0, 1}). They were created because the original dataset contained a large number of missing values. For each variable that had missing values in the original data (e.g. MORTDUE), the missing values were replaced, and a binary variable (M_MORTDUE) was created that indicates whether the value of the variable was missing in the original dataset (M_MORTDUE=1) or not (M_MORTDUE=0). In other words, the value of a variable like DEBTINC is the actual, observed value when M_DEBTINC=0. When M_DEBTINC=1, the value of DEBTINC has been predicted (and therefore does not correspond to the actual value of this variable for that customer). You don’t know which method was used to replace these missing values.
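As a small illustration of how an indicator variable can be read alongside its main variable, the sketch below compares the proportion of BAD=1 customers between rows where the value was originally observed and rows where it was replaced. The rows are invented for illustration only, the helper name is hypothetical, and the indicator column is written here as M_DEBTINC, which is an assumption about the exact column name in your file.

```python
from typing import Dict, List


def default_rate_by_flag(rows: List[dict], flag: str) -> Dict[int, float]:
    """Proportion of BAD=1 among rows with flag=0 and among rows with flag=1."""
    counts = {0: [0, 0], 1: [0, 0]}  # flag value -> [number of BAD=1, number of rows]
    for row in rows:
        counts[row[flag]][0] += row["BAD"]
        counts[row[flag]][1] += 1
    return {value: n_bad / n_rows
            for value, (n_bad, n_rows) in counts.items() if n_rows > 0}


# Made-up rows (not from the coursework dataset). When M_DEBTINC is 1,
# the DEBTINC value is a replacement, not an observed measurement.
rows = [
    {"BAD": 1, "DEBTINC": 41.2, "M_DEBTINC": 0},
    {"BAD": 0, "DEBTINC": 33.5, "M_DEBTINC": 0},
    {"BAD": 0, "DEBTINC": 28.9, "M_DEBTINC": 0},
    {"BAD": 0, "DEBTINC": 36.0, "M_DEBTINC": 0},
    {"BAD": 0, "DEBTINC": 30.1, "M_DEBTINC": 0},
    {"BAD": 1, "DEBTINC": 34.0, "M_DEBTINC": 1},
    {"BAD": 1, "DEBTINC": 34.0, "M_DEBTINC": 1},
    {"BAD": 0, "DEBTINC": 34.0, "M_DEBTINC": 1},
]

rates = default_rate_by_flag(rows, "M_DEBTINC")
# Values that were actually observed, i.e. excluding replaced ones.
observed_debtinc = [r["DEBTINC"] for r in rows if r["M_DEBTINC"] == 0]

print("Default rate when DEBTINC was observed:", rates[0])
print("Default rate when DEBTINC was missing:", rates[1])
print("Number of observed DEBTINC values:", len(observed_debtinc))
```

A large gap between the two rates on a real sample would suggest that missingness itself carries information about repayment behaviour. Whether such a gap is meaningful depends on the sample size, which a toy example like this cannot show.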
Name      Type        Description
BAD       Binary      1=applicant defaulted on loan or seriously delinquent, 0=applicant paid loan
LOAN      Continuous  Amount of the loan request
MORTDUE   Continuous  Amount due on existing mortgage
VALUE     Continuous  Value of current property
REASON    Nominal     Not Provided; DebtCon=debt consolidation; HomeImp=home improvement
JOB       Nominal     Occupational categories
YOJ       Continuous  Years at present job
DEROG     Continuous  Number of major derogatory reports
DEBTINC   Continuous  Debt-to-income ratio
CLAGE     Continuous  Age of oldest credit line in months
NINQ      Continuous  Number of recent credit inquiries
CLNO      Continuous  Number of credit lines
DELINQ    Continuous  Number of delinquent credit lines
Table 1: Description of main variables in training dataset
Tasks
• Exploratory Data Analysis (40 marks).
In particular, consider each variable and answer the following questions:
– Does this variable appear to be important for the task at hand? (After discussing each variable separately provide a ranking of the importance of all explanatory variables.) Support your claims with appropriate visualisations that document whether and how important each variable is.
– Are different variables related, and which variables convey information similar to that provided in other variable(s)?
– Do you find evidence of “outliers” or other issues with data quality (e.g. incorrect observations)?
– For which variables is the fact that specific values were missing in the original dataset informative, and what are the implications of this?
• Statistical Modelling (60 marks)
– What is the appropriate performance measure for this application and why? Relate this to the project objectives.
– For the two types of classifiers (logistic regression and decision trees), discuss the different settings you used and why you considered these important. (Consider the choice of variable selection method as part of this question also.)
– For each classification method develop one or a few candidate models that you think are promising before providing a final recommendation of the most appropriate model (for each question in the project objectives section). You do not need to discuss every model you tried in detail, but you must include the results for the important steps in the process that led you to the final recommendations. I am particularly interested in understanding the steps you followed and the justification for these. (Refer to the CRISP data mining process discussed during the lectures and in Chapter 1 of the Guide to Intelligent Data Analysis).
– Comment on the generalisation performance of the model(s) you recommend for each type of classifier.
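Generalisation performance is usually judged on observations that were not used to fit the model. One common way to set such observations aside, sketched below under stated assumptions, is a stratified hold-out split that keeps the proportion of BAD=1 cases the same in both parts. The function name, the 30% test fraction, the seed and the label counts are all illustrative choices, not requirements of the coursework.

```python
import random
from typing import List, Tuple


def stratified_split(labels: List[int], test_fraction: float = 0.3,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split row indices into train and test parts, class by class.

    Shuffling within each class keeps the share of BAD=1 rows
    (approximately) equal in the training and test parts.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical class labels: 80 good customers (0) and 20 bad customers (1).
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=0)

train_bad_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_bad_share = sum(labels[i] for i in test_idx) / len(test_idx)
print("Training rows:", len(train_idx), "share of BAD=1:", train_bad_share)
print("Test rows:", len(test_idx), "share of BAD=1:", test_bad_share)
```

A model would then be fitted on the training rows only, and its performance on the test rows compared with its performance on the training rows. A large gap between the two is one sign of overfitting.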
The coursework requires you to write a report explaining your findings. This means that you need to explain each figure, table or number you include in the report. In other words, including a relevant figure without explaining the conclusions you draw from it will get you no marks.
• You do not need to write an executive summary, or include a cover page or a table of contents.
• You do need to include at the end of your coursework a Conclusions section which summarises your findings and clearly answers the questions posed in the project objectives section. In this section I also recommend discussing the relative advantages and limitations of the two types of classifiers for the problem at hand.
Report Assessment
Your coursework will not be evaluated on the quality of the final model alone, or on whether you got a particular answer right. You will be primarily assessed on whether you are able to correctly justify the steps you took to complete the assignment. In other words, your report needs to document that you are able to intelligently analyse the provided data, that you draw correct conclusions from what you observe, and that these conclusions lead you either to the next logical step of the data mining process, or to the revision of decisions made in previous steps of the analysis. (Refer to the flowchart of data mining stages we covered in the first lectures and in particular to the feedback loops.)
Therefore, don’t simply present the conclusions or results of your analysis and expect to get a high mark. Reports that don’t document the steps followed and the reasons why these were chosen will receive minimal marks, even if the final answer is sensible. Explain your reasoning clearly and in good English. Don’t provide a list of bullet points or unstructured sentences. Similarly, don’t include figures or any other output from R that you don’t comment on or explain in the text. I will not assume that you know how to interpret these correctly.