You are expected to analyze the data from a well-known FinTech company in Hong Kong, which lends money to loan applicants. Once the company receives an application, its decision engine automatically classifies the application as “approve”, “reject”, or “manual review.” In the case of “manual review,” reviewers make a subjective judgement to reclassify the case as either “approve” or “reject.”
Download the data from the following link:
http://www.dropbox.com/sh/62m5jr0t4vpbeyp/AAASXqr3ZUlC71b3FYDxmLJJa?dl=0
Field description:

Field                 Description
id                    Loan application id
loan_amount           Requested amount in HKD
tenor                 Requested repayment period in months
age
month_of_service      Employment period of current job
residential_status    Own, Rent, Others
monthly_repayment     Monthly repayment for other existing loans
monthly_income        Average monthly income for last three months
self-employed
bankrupted            Whether the applicant has a record of bankruptcy
housewife
currently_employed    Whether the applicant is employed in a full-time job
channel               Loan application channel
language              tc (traditional Chinese), EN (English)
manual_review
approved
manual_approved
credit score          The higher, the better
friends_facebook      No. of Facebook friends (0 indicates either no friends or that the account was not provided)
time_application      Time of the day when the application was submitted
location              Location where the application was submitted
default               Whether the repayment was overdue as of June 2017
NOTE: Use all available resources to solve the problems. You can find solutions to most of the coding problems on the Internet. If you get stuck, Google it.
Q1. Load the data into R. How many variables and observations are in the data? How many applicants are currently employed? How many of the currently employed are self-employed?
Q2. What is the average monthly income of the whole sample? What is the average monthly income of the currently employed?
Q3. Generate the histogram of “loan_amount.” Can you find any interesting patterns in the graph? Can you guess why the graph has such a shape?
Q4. Replace the value of “friends_facebook” with NA if the value is 0. What is the average number of Facebook friends of those who have provided their Facebook account?
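The recoding step in Q4 can be sketched as follows. This is only a minimal illustration on a made-up toy data frame (the name `loans` and all values are assumptions, not the real data); it shows logical indexing to set zeros to NA and `na.rm = TRUE` to average over the remaining values.

```r
# Toy data standing in for the real file; all values are made up.
loans <- data.frame(
  id = 1:6,
  friends_facebook = c(0, 120, 0, 340, 80, 0)
)

# Recode 0 as NA, since 0 means "no friends or account not provided".
loans$friends_facebook[loans$friends_facebook == 0] <- NA

# Average over the applicants with a non-missing friend count.
mean_friends <- mean(loans$friends_facebook, na.rm = TRUE)
print(mean_friends)
```

Without `na.rm = TRUE`, `mean()` would return NA as soon as one value is missing.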
Q5. Generate the scatterplot of “month_of_service” and “credit_score”. Can you find any relationship between them? What about “monthly_income” and “credit_score”? Confirm the relationship with the correlation tests.
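One possible way to approach Q5 is a base-R scatterplot followed by `cor.test()`, which by default runs a Pearson correlation test. The sketch below uses simulated values (the vectors and the built-in relationship between them are assumptions for illustration only), so its output says nothing about the real data.

```r
# Simulated stand-ins for the two columns; not the real data.
set.seed(7)
month_of_service <- rexp(150, rate = 1 / 36)
credit_score <- 550 + 0.8 * month_of_service + rnorm(150, sd = 30)

# Scatterplot of the two variables.
plot(month_of_service, credit_score,
     xlab = "month_of_service", ylab = "credit_score")

# Pearson correlation test (the default method of cor.test).
ct <- cor.test(month_of_service, credit_score)
print(ct$estimate)
print(ct$p.value)
```

If the variables look strongly skewed, `method = "spearman"` is an alternative worth considering.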
Q6. Make a new variable, named “automatic_approved,” which has the value “t” if approved by the decision engine, “f” if rejected by the decision engine, and “NA” if reviewed manually. How many cases are approved or rejected by the decision engine? How many are classified as “manual review”?
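The variable described in Q6 could be built with nested `ifelse()` calls, as sketched below. The sketch assumes that “manual_review” and “approved” are coded as 0/1; the field description does not say so, so check the actual coding in the data first. The toy data frame is made up.

```r
# Toy data; the 0/1 coding of both columns is an assumption.
loans <- data.frame(
  manual_review = c(0, 0, 1, 1, 0),
  approved      = c(1, 0, 1, 0, 1)
)

# NA for manually reviewed cases, otherwise "t"/"f" from the engine's decision.
loans$automatic_approved <- ifelse(
  loans$manual_review == 1,
  NA_character_,
  ifelse(loans$approved == 1, "t", "f")
)

# Count each outcome, including the NA (manual review) cases.
counts <- table(loans$automatic_approved, useNA = "ifany")
print(counts)
```

`useNA = "ifany"` matters here, because `table()` drops NA values by default.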
Q7. Compare the automatically approved cases and the automatically rejected cases. Conduct statistical tests on variables available in the dataset to answer the following subquestions.
1) Are they different in “loan_amount”?
2) Are they different in “tenor”?
3) Are they different in “age”?
4) Are they different in “month_of_service”?
5) Are they different in “residential_status”?
6) Are they different in “monthly_income”?
7) Are they different in “bankrupted”?
8) Are they different in “currently_employed”?
9) Are they different in “channel”?
10) Are they different in “language”?
11) Are they different in “credit_score”?
12) Are they different in “friends_facebook”?
13) Are they different in “location_application”?
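For the comparisons in Q7, one common choice (an assumption here, not a requirement of the assignment) is a two-sample t-test for numeric variables and a chi-squared test of independence for categorical ones. The sketch below runs both on simulated data; the data frame and its values are made up for illustration.

```r
# Simulated data; group means and categories are made up.
set.seed(1)
n <- 200
loans <- data.frame(
  automatic_approved = factor(rep(c("t", "f"), each = n / 2)),
  loan_amount = c(rnorm(n / 2, mean = 50000, sd = 10000),
                  rnorm(n / 2, mean = 60000, sd = 10000)),
  residential_status = sample(c("Own", "Rent", "Others"), n, replace = TRUE)
)

# Keep only cases decided by the engine (drop manual reviews coded as NA).
auto <- loans[!is.na(loans$automatic_approved), ]

# Numeric variable: Welch two-sample t-test (the default of t.test).
tt <- t.test(loan_amount ~ automatic_approved, data = auto)

# Categorical variable: chi-squared test on the cross-tabulation.
chi <- chisq.test(table(auto$automatic_approved, auto$residential_status))

print(tt$p.value)
print(chi$p.value)
```

The same two calls can be repeated for the other variables in the list, choosing the test by variable type.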
Q8. Make a new variable, named “automatic_approved_dummy,” which has the value of 1 if automatic_approved = t, and 0 otherwise. Develop a regression model of “approval by the decision engine” using the DV of “automatic_approved_dummy.” Include all relevant independent variables in the model.
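Because the DV in Q8 is binary, a logistic regression via `glm(..., family = binomial)` is one reasonable modelling choice; the assignment itself only asks for “a regression model,” so this is a sketch under that assumption. The data are simulated, and the two predictors shown are placeholders for whichever variables you judge relevant.

```r
# Simulated data; the relationship to credit_score is built in for illustration.
set.seed(42)
n <- 300
credit_score <- rnorm(n, mean = 600, sd = 50)
loan_amount <- runif(n, min = 10000, max = 100000)
engine_approved <- rbinom(n, size = 1, prob = plogis((credit_score - 600) / 25))

automatic_approved <- ifelse(engine_approved == 1, "t", "f")
automatic_approved[sample(n, 30)] <- NA  # pretend these were manually reviewed
loans <- data.frame(credit_score, loan_amount, automatic_approved)

# Dummy: 1 if automatic_approved is "t", 0 otherwise (NA cases become 0).
loans$automatic_approved_dummy <- as.integer(
  !is.na(loans$automatic_approved) & loans$automatic_approved == "t"
)

# Logistic regression of the dummy on the chosen predictors.
fit <- glm(automatic_approved_dummy ~ credit_score + loan_amount,
           data = loans, family = binomial)
print(summary(fit))
```

Note that “0 otherwise” puts the manually reviewed cases in the 0 group together with automatic rejections, which affects how the coefficients should be read.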
Q9. Based on the analysis results above, describe the logic the decision engine appears to use when it decides “approve.”
Guideline
Submit 1) your answer sheet, 2) R-code used for the analysis,