FM 9528 – Banking Analytics Coursework 3
Coursework 3 – Deep Learning
In this coursework we will continue our study of mortgages in the US, but now we will analyze results at a zipcode level. The question that we want to answer is “can satellite images help our modelling process?”. For this, you are given two datasets:
A. Aggregated variables at zipcode level, measured at the origination of the mortgages. This dataset includes a BinaryDefault variable, which separates the zipcodes deemed to be high risk from those deemed to be low risk. The available variables are:
a. Fico: Average FICO score of the area.
b. mi_pct cnt_units: Percentage of mortgages with insurance.
c. Cltv, ltv: Average LTV and CLTV of area.
d. cnt_borr: Number of borrowers in area.
e. occpy_sts_S: Percentage of users with occupancy status Single Home (S) in area.
f. channel_C: Percentage of cases with channel C in area.
g. channel_T: Percentage of cases with channel T in area.
h. prop_type_MH: Percentage of cases with property type MH in area.
i. prop_type_PU: Percentage of cases with property type PU in area.
j. loan_purpose_N: Percentage of cases with purpose of the loan type N in area.
k. Area_Number: Zipcode, coded to a meaningless area number. Non-predictive.
l. BinaryDefault: Whether the area is a high-risk area (highest 30% default rate nationwide) or not. Target variable.
B. Samples of satellite images[1] under different conditions for the different zipcodes. They amount to approximately 2 GB of data.
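The download step described in footnote 1 could be sketched roughly as follows. This is only an illustration: the helper name `fetch_images` and the target path are hypothetical, the format of the downloaded file is not stated in this handout, and the Drive mount shown in the comments applies only when running inside Google Colab.

```python
from pathlib import Path

# File id taken from the link given in footnote 1.
FILE_ID = "1k7QmTjzk4hFrAnO_x8YvndoWyZ2H_LZw"
URL = f"https://drive.google.com/uc?id={FILE_ID}"


def fetch_images(target: Path) -> Path:
    """Download the image data to `target` unless it is already there.

    `target` is a hypothetical location chosen by you, for example a
    folder inside your mounted Google Drive.
    """
    if target.exists():
        # Already downloaded in an earlier session: skip the ~2 GB transfer.
        return target
    import gdown  # imported lazily so the check above works without gdown

    target.parent.mkdir(parents=True, exist_ok=True)
    gdown.download(URL, str(target), quiet=False)
    return target


# In Colab, mount Drive first so the file persists between sessions:
#   from google.colab import drive
#   drive.mount("/content/drive")
#   fetch_images(Path("/content/drive/MyDrive/fm9528/images_download"))
```

Checking for the file before calling `gdown` is what avoids repeating the download every time the notebook restarts.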
In this coursework, you will develop a multimodal deep learning model for this problem and compare it against alternative models, using what you have learned in the lectures. With this information, the datasets, and your knowledge from the course, answer the following questions:
1. (10%) Identify the train / test sample at area level (i.e. some areas for test and some for train that are in the Data folder with images) and create a logistic regression model that, using only the structured data, can predict whether an area is high risk or low risk. Discuss the performance of the model, the most important variables of your model, and the rationale of your decisions and outputs.
2. (30%) Choose from TensorFlow Hub a model that is adequate for your problem[2]. Explain the model in detail by researching the literature. What layers does it use? Why those layers? Discuss your choice. Fine-tune a deep learning model able to predict high and low risk zones. Explain what parameters you used to train it (optimizer, trainable layers, learning rate, etc.), and your choice of architecture for the dense and output layers. Is it able to find meaningful patterns using only the images? Why do you think this is?
[1] The data is available for direct download from Google Colab (using gdown). The link is https://drive.google.com/uc?id=1k7QmTjzk4hFrAnO_x8YvndoWyZ2H_LZw. I advise you to download it to your Google Drive folder and mount it from there, so that you do not have to download it every time you need to work on it.
[2] Any StateModel model that gives you feature vectors or feature classification without the classifier head could work.
3. (30%) Combine the structured input and the images, plus the pretrained model, to create a multimodal deep learning model that takes both inputs into account in a single neural network. Use the Keras Model API to create this model. Discuss the reasons for your choice of architecture and parameter decisions, report the AUC scores of all models and compare their performance. Do the satellite images help? How does the performance of the model compare against the models you previously trained? Why do you think this happens?
4. (20%) Discuss the ethical, legal, and other challenges of using satellite images in the context of credit risk. Discuss, with sources, the following questions: What are the potential sources of bias in satellite images? Is this reflected in your model? What are the ethical implications of your findings? What potential legal ramifications can exist? Conclude by giving your opinion on the use of satellite images for the purposes of credit risk analytics.
The remaining 10% corresponds to formatting and presentation according to the rubric.
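As a starting point for questions 1 and 3, the sketch below shows one possible way to hold out whole areas for testing and to wire a two-input network with the Keras Model API. It is an illustration under stated assumptions, not a reference solution: the function names, layer sizes, image size, and learning rate are placeholders chosen for this example, and the small convolutional image branch only stands in for the pretrained model you select in question 2.

```python
import numpy as np
import tensorflow as tf


def split_by_area(area_ids, test_frac=0.3, seed=0):
    """Return boolean (train, test) masks such that no area is in both."""
    rng = np.random.default_rng(seed)
    areas = rng.permutation(np.unique(area_ids))
    n_test = max(1, int(round(test_frac * len(areas))))
    is_test = np.isin(area_ids, areas[:n_test])
    return ~is_test, is_test


def build_multimodal(n_features, image_shape=(64, 64, 3)):
    """Two-input binary classifier: structured variables plus one image."""
    tab_in = tf.keras.Input(shape=(n_features,), name="structured")
    img_in = tf.keras.Input(shape=image_shape, name="image")

    # Placeholder image branch (assumption): replace with the pretrained
    # feature extractor chosen in question 2.
    x = tf.keras.layers.Conv2D(8, 3, activation="relu")(img_in)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)

    # Small dense branch for the structured variables.
    t = tf.keras.layers.Dense(16, activation="relu")(tab_in)

    # Merge both branches and map to a single high-risk probability.
    h = tf.keras.layers.Concatenate()([x, t])
    h = tf.keras.layers.Dense(16, activation="relu")(h)
    out = tf.keras.layers.Dense(1, activation="sigmoid", name="high_risk")(h)

    model = tf.keras.Model(inputs=[tab_in, img_in], outputs=out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model
```

Splitting on Area_Number rather than on individual images keeps every image of an area on the same side of the split, so the test AUC is measured on areas the model has never seen.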
Conditions of the coursework
Software: You must use Python to run the numerical calculations over your portfolio. A copy of your Jupyter notebook must be attached to the coursework as an appendix in readable format, and a link to the notebook (either Colab or direct download from a cloud location) must also be included. Instructions on how to export to PDF can be found here: https://stackoverflow.com/questions/52588552/google-co-laboratory-notebook-pdf-download. The notebook text MUST be machine readable (so no screenshots of the notebook, please); otherwise a 25% discount will apply.
Word Limit: 2000 words. A margin of +/-10% either side of the word count (that is, 1,800 to 2,200 words) is deemed to be acceptable. Any text that exceeds the additional 10% will not attract any marks. The relevant word count includes items such as cover page, executive summary, title page, table of contents, tables, figures, in-text citations and section headings, if used. The relevant word count excludes your list of references and any appendices at the end of your coursework submission (including the code).
You should always include the word count (from Microsoft Word, not Turnitin), at the end of your coursework submission, before your list of references.
Title/Cover Page: You must include a title/cover page that includes: your Student ID, Course Code, Assignment Title, Word Count. This assignment will be marked anonymously. Please ensure that your name does not appear on any part of your assignment; otherwise a discount will be applied.
Submission Deadline: December 18th, 23:59. This deadline is final and cannot be modified!
Turnitin Submission: The assignment MUST be submitted electronically via OWL. All required papers may be subject to submission for textual similarity review to the commercial plagiarism detection software under license to the University for the detection of plagiarism. All papers submitted for such checking will be included as source documents in the reference database for the purpose of detecting plagiarism of papers subsequently submitted to the system. Use of the service is subject to the licensing agreement, currently between The University of Western Ontario and Turnitin.com (http://www.turnitin.com).
Late Submission: Late submissions are possible up to two days after the deadline. There is a linear 10% penalty per day of late submission (Final mark = Original mark - 10% * days late), subtracted directly from the final mark. Submissions more than two days after the deadline are not accepted and will be considered a non-submission.