STAT2604 Final Project
Course: STAT2604 Introduction to R Programming and Elementary Data Analysis
Total marks: 100
Due date: 23:59, Dec. 15, 2021
Each student should work on the project independently. Please pack your Rmd
source file together with the output PDF file into one compressed file, and submit
only that compressed file to Moodle. Name the compressed file in the format
(Name)(UID) P. Your final submitted PDF file should be at most 10 pages
(including all code, figures and tables).
(Note: The marking of this project will mainly focus on your thoughts and reasoning in solving
the problems.)
1 Problems and Objectives
A bank is now facing a trade-off between accepting customers so that it retains its share in
the mortgage loan market and incurring losses due to providing loans to customers who might
default. The bank managers are interested in the following questions:
1. What proportion of good customers can be granted loans while ensuring that at most
a proportion α of the bad customers are wrongly identified as good? (α = 5%, 1%, 0.5%)
2. What are the top 3 most important explanatory variables that affect whether a customer
is good or bad?
Your task is to undertake a thorough investigation of the dataset provided by the bank,
which contains information about past bank customers.
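One possible reading of the first question, sketched under assumptions: if a fitted model gives each customer a score (higher meaning more likely to be good), choose the cutoff that lets through at most a proportion α of the bad customers, then report the share of good customers above that cutoff. The scores below are simulated, and the names `good_accept_rate`, `score` and `is_good` are hypothetical; none of them come from the project data.

```r
# Hypothetical helper: share of good customers accepted when at most
# `alpha` of the bad customers are wrongly accepted.
good_accept_rate <- function(score, is_good, alpha) {
  bad_scores <- score[!is_good]
  # A customer is accepted when score > cutoff. The type-1 quantile is an
  # order statistic of the bad scores, so the share of bad customers
  # strictly above it cannot exceed alpha.
  cutoff <- quantile(bad_scores, probs = 1 - alpha, type = 1, names = FALSE)
  c(cutoff    = cutoff,
    bad_rate  = mean(bad_scores > cutoff),
    good_rate = mean(score[is_good] > cutoff))
}

# Simulated scores (an assumption, for illustration only).
set.seed(1)
is_good <- rep(c(TRUE, FALSE), times = c(1400, 600))
score   <- rnorm(2000, mean = ifelse(is_good, 1, 0))

alphas <- c(0.05, 0.01, 0.005)
res <- sapply(alphas, function(a) good_accept_rate(score, is_good, a))
colnames(res) <- paste0("alpha=", alphas)
round(res, 3)
```

In a real analysis the scores would be predicted on the testing set, so that the reported rates are not flattered by the data used for fitting.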
2 Data Description
This dataset contains all the relevant information about 2000 mortgage loan customers. In total,
there are 14 explanatory variables and the class label variable indicating whether a customer is
good or bad. A bad customer is defined as one who missed three or more payments
during the first year of the mortgage.
Table: Data Description
Variable                        Description
ID                              Customer ID
Annual Income                   Annual Gross Income in $s
Credit History                  Loan applications in past five years
Credit Cards                    Credit cards currently held
Amount                          Loan amount
Number of Dependants            Number of family members that rely on the customer
Employment                      1 Other
                                2 Self Employment
                                3 Part time
                                4 Full time private sector
                                5 Full time public sector
Installment Percentage          Monthly installment as percentage of monthly gross earnings
Time at Current Employment      In years
Time at Address                 In years
Age                             In years
Delayed or Missed Payments      0 No missed/delayed payments over last 3 years
                                1 Delayed payments only over last 3 years
                                2 Missed payments over last 3 years
Residential Status              Rent
                                Own
                                Live with Family
Existing Credits                Additional lines of credits
Area indicator                  Location of branch receiving application
Good Customer / Bad Customer    Yes
(Target variable)               No
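Several variables in the table are integer codes rather than quantities, so in R they are usually better stored as factors than as numbers. A minimal sketch, assuming column names `Employment` and `Delayed_Missed` (the names in the actual data file may differ) and using a toy data frame in place of the real data:

```r
# Toy data frame standing in for the real file; column names are assumptions.
toy <- data.frame(
  Employment     = c(4, 2, 5, 1, 3),
  Delayed_Missed = c(0, 2, 1, 0, 0)
)

# Map the integer codes from the data description onto labelled factors.
toy$Employment <- factor(
  toy$Employment,
  levels = 1:5,
  labels = c("Other", "Self Employment", "Part time",
             "Full time private sector", "Full time public sector")
)
toy$Delayed_Missed <- factor(
  toy$Delayed_Missed,
  levels = 0:2,
  labels = c("None", "Delayed only", "Missed")
)

str(toy)
table(toy$Employment)
```

Treating these columns as factors keeps summaries and models from reading the codes as if, say, employment type 4 were twice employment type 2.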
3 Tasks
Write a report that contains your R code, answers and detailed explanations to the questions
below. The report should be a PDF file (producing a PDF from R Markdown requires a TeX
distribution to be installed first).
Exploratory Data Analysis:
1. Briefly summarize the data with descriptive statistics. Report the one or two most
interesting findings from your summary statistics.
2. Suggest the top 3 most important explanatory variables that can predict whether a
customer is good or bad. Support your claims with appropriate visualizations. (20 marks)
3. Are different variables related, and which variables contain information that is provided
in other variable(s)? (10 marks)
4. Do you find evidence of any outliers or other data quality issues (e.g., incorrect
observations)? If there are any, find an appropriate way to handle them. (30 marks)
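For items 3 and 4, one common starting point (among several) is a correlation matrix of the numeric variables together with the 1.5 x IQR rule for flagging values that deserve a closer look. The sketch below runs on simulated data with one planted impossible value; the column names and the helper `flag_iqr` are hypothetical, and a flagged value is only a candidate for inspection, not automatically an error.

```r
set.seed(2604)
n   <- 200
age <- round(runif(n, 21, 65))
# Simulated stand-in for a few numeric columns; not the project data.
toy <- data.frame(
  Age           = age,
  Time_At_Job   = pmax(0, round((age - 21) * runif(n))),
  Annual_Income = round(rlnorm(n, meanlog = 10.5, sdlog = 0.4))
)
toy$Age[5] <- 250  # planted data-entry error

# Pairwise correlations: large absolute values hint at overlapping information.
round(cor(toy), 2)

# Hypothetical helper: TRUE where x lies outside [Q1 - k*IQR, Q3 + k*IQR].
flag_iqr <- function(x, k = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Row indices flagged in each column.
flagged <- lapply(toy, function(col) which(flag_iqr(col)))
flagged$Age
```

Range checks based on domain knowledge (for example, ages far outside a plausible human lifespan) complement the IQR rule, which can also flag legitimate values in skewed variables such as income.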
Statistical Modeling:
1. Split the data into training and testing sets. Choose a model to fit the training set. Tune
the model if there are any tuning parameters. (10 marks)
2. Choose an appropriate evaluation measure based on the project objective, and explain why
you chose this measure and how to compute it on the testing set. Give your
answer to the first question in Problems and Objectives. (20 marks)
3. Based on your fitted model, find the top 3 most important explanatory variables. Do they
agree with your earlier suggestion? Provide your final choice of the top 3 variables and
comment on their contributions. (10 marks)
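The modeling steps above could be sketched as follows, assuming logistic regression as the model (the project leaves the choice open) and a 70/30 split. The data are simulated, the variable names are hypothetical, and ranking predictors by the absolute z statistic is only a rough importance proxy that is meaningful here because the simulated predictors share a common scale.

```r
set.seed(2604)
n <- 500
# Simulated stand-in data; variable names and effect sizes are assumptions.
toy <- data.frame(
  income = rnorm(n),
  age    = rnorm(n),
  noise  = rnorm(n)
)
toy$good <- rbinom(n, 1, plogis(0.5 + 1.5 * toy$income + 0.5 * toy$age))

# 70/30 split into training and testing sets.
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression fitted on the training set only.
fit <- glm(good ~ income + age + noise, data = train, family = binomial)

# Predicted probability of being a good customer on the held-out data.
test$p_good <- predict(fit, newdata = test, type = "response")

# Rough importance proxy: absolute z statistics, intercept dropped.
z <- summary(fit)$coefficients[-1, "z value"]
sort(abs(z), decreasing = TRUE)
```

A model with tuning parameters (for example a penalized regression or a tree-based method) would add a tuning step, such as cross-validation on the training set, before the final evaluation on the testing set.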