Final_2021B_Undergrad
Final Exam – SS 3850G / CS 4414B – Undergraduate students¶
Student ID: XXXXXXXXX (XX / 100)¶
General comments¶
This Final integrates knowledge and skills acquired during the whole semester. You are allowed to use any document and source on your computer and look up documents on the internet. You are NOT allowed to share documents, or communicate in any other way with people inside or outside the class during the exam. To finish the exam in the alloted 3 hrs, you will have to work efficiently. Read the entirety of each question carefully. You need to be signed into the Final Zoom session during the entire exam with your video on and pointed at yourself.
You need to submit the final by the due date (17:00) on OWL in the Test & Quizzes / Final section where you downloaded the data set and notebook. Late submission will be scored with 0 pts, unless you have received special accommodations. To avoid technical difficulties, start your submission no later than five to ten minutes before the deadline. To be sure, you can also submit multiple versions – only the latest version will be graded.
Most questions demand a written answer – answer these in full English sentences.
For your Figures, ensure that all axes are labeled in an informative way.
Ensure that your code runs correctly by choosing “Kernel -> Restart and Run All” before submitting.
Additional Guidance¶
If at any point you are asking yourself “are we supposed to…”, then write your assumptions clearly in your exam and proceed according to those assumptions.
Good luck!
## Preliminaries
# Sets up the environment by importing
# pandas, numpy, matplotlib, seaborn, sklearn, scipy.
# No other packages are allowed in solving the final.
import pandas as pd
import numpy as np
# Models and metrics
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_curve, roc_auc_score, mean_squared_error
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
# Plotting
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Credit scoring is one of the most common applications of statistical modeling / data science techniques. Predicting whether a potential borrower will repay their obligations or will not do so (called default) is one of the key activities in personal and small business lending.
During this exam, you will work with a sample of granted loans taken from a local bank, as part of a financial competition that ran back in 2013. The company sponsoring it made available the following variables:
SeriousDlqin2yrs (binary, target variable): 1 if the borrower experienced 90 days past due delinquency or worse (default), 0 otherwise.
RevolvingUtilizationOfUnsecuredLines (percentage): Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits.
age (integer): Age of borrower in years.
NumberOfTime30-59DaysPastDueNotWorse (integer): Number of times borrower has been 30-59 days past due but no worse in the last 2 years.
DebtRatio (percentage): Monthly debt payments, alimony, living costs divided by monthly gross income.
NumberOfOpenCreditLinesAndLoans (integer): Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards).
NumberOfTimes90DaysLate (integer): Number of times borrower has been 90 days or more past due.
NumberRealEstateLoansOrLines (integer): Number of mortgage and real estate loans including home equity lines of credit.
NumberOfTime60-89DaysPastDueNotWorse (integer): Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
NumberOfDependents (integer): Number of dependents in family excluding themselves (spouse, children etc.)
You get a sample of 1000 cases with a 50% default rate. The cases are stored in the attached csv file (gsc_sample.csv).
With this information, execute the following tasks using your knowledge from the course.
# Uncomment this line if using cloud installation (Colab or others)
# !gdown https://drive.google.com/uc?id=1_9tztEp7v1wBJTH91xZpTS_QDgdp4mN0
# Read the data
gsc_sample = pd.read_csv('gsc_sample.csv')
gsc_sample.describe()
Task 1 (30 pts)¶
Before we start working on a predictive model for whether somebody will default on a loan or not, in Task 1 we will first build a model to predict typical monthly income. Reliable income information is notoriously difficult to obtain, as people may have different sources of income, so income prediction models hold real value. Ultimately, this model may be useful for spotting whether somebody is in the typical income bracket. One of the main predictors of income is age, so we focus on this variable and later consider some additional variables.
Question 1.1 (5pts)¶
Generate a bivariate scatter plot of age (x-axis) and Monthly Income (y-axis). You should be able to see two extreme observations. [1pt]
Exclude the two observations from the data set and regenerate your bivariate scatter plot. [1pt]
Written answer: Is the distribution of Monthly income symmetric or skewed? [1pt]
Written answer: Given the presence of outliers and the shape of the distribution of the target variable, would you prefer a L2-loss or an L1-loss for your regression model? How will the prediction of each of these models differ? Which technique will give a prediction that is closer to the median income? [2pts]
# Your code here. You can add as many cells as you want.
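One possible sketch is shown below. It assumes the income column is named MonthlyIncome and that the two extreme observations are simply the two largest incomes; verify both assumptions against your own plot.
# Scatter plot of age vs. monthly income on the full sample
fig, ax = plt.subplots()
ax.scatter(gsc_sample['age'], gsc_sample['MonthlyIncome'])
ax.set_xlabel('Age (years)')
ax.set_ylabel('Monthly income')
ax.set_title('Age vs. monthly income (all cases)')
plt.show()

# Drop the two largest incomes (assumed to be the two extreme observations)
gsc_reduced = gsc_sample.drop(gsc_sample['MonthlyIncome'].nlargest(2).index)

# Regenerate the scatter plot on the reduced sample
fig, ax = plt.subplots()
ax.scatter(gsc_reduced['age'], gsc_reduced['MonthlyIncome'])
ax.set_xlabel('Age (years)')
ax.set_ylabel('Monthly income')
ax.set_title('Age vs. monthly income (two extreme cases removed)')
plt.show()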
Your written answers here
Question 1.2 (13pts)¶
As a solution to the distribution of Monthly income, you chose the following practical solution:
Exclude the two highest earning cases.
Transform the target variable from income to the natural logarithm of the monthly income.
Implement these two steps. All subsequent questions in this task will be done over this reduced and transformed data set.
To build and evaluate a baseline model, take the following steps:
Split the data into equal-sized training and test sets (500 observations each). Use a random_state of 1.
Build a model that predicts the log income as a quadratic function of age. To get full points, implement the feature construction and model in a pipeline.
Fit the model using a squared-error loss (L2).
Plot the training data and the fit of the model.
Calculate and report the mean squared error for the test set.
Using the central limit theorem or the bootstrap, calculate and report the 95% confidence interval of the mean test error.
# Your code here. You can add as many cells as you want.
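A minimal sketch of one way to do this, assuming the income column is MonthlyIncome (with strictly positive values) and that gsc_reduced is the reduced sample from Question 1.1:
# Log-transform the target on the reduced sample
data = gsc_reduced.copy()
data['log_income'] = np.log(data['MonthlyIncome'])  # assumes all incomes > 0

# Equal-sized train / test split, random_state=1
train, test = train_test_split(data, test_size=0.5, random_state=1)

# Quadratic model of age, with feature construction and model in one pipeline
quad_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('linreg', LinearRegression())   # ordinary least squares = L2 loss
])
quad_model.fit(train[['age']], train['log_income'])

# Plot the training data and the fitted curve
age_grid = pd.DataFrame({'age': np.linspace(train['age'].min(), train['age'].max(), 200)})
fig, ax = plt.subplots()
ax.scatter(train['age'], train['log_income'], alpha=0.5, label='training data')
ax.plot(age_grid['age'], quad_model.predict(age_grid), color='red', label='quadratic fit')
ax.set_xlabel('Age (years)')
ax.set_ylabel('log(monthly income)')
ax.legend()
plt.show()

# Test MSE and a CLT-based 95% confidence interval for the mean test error
sq_err = (test['log_income'] - quad_model.predict(test[['age']])) ** 2
mse = sq_err.mean()
se = sq_err.std(ddof=1) / np.sqrt(len(sq_err))
print(f'Test MSE: {mse:.3f}, 95% CI: [{mse - 1.96 * se:.3f}, {mse + 1.96 * se:.3f}]')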
Question 1.3 (6pts)¶
Now increase the model complexity by using a 5th-order polynomial on age to predict the log monthly income.
As in Question 1.2, plot the fit on the training data and report the mean squared error on the test data.
Written answer: Does a 5th-order polynomial offer a better model than the quadratic model in Question 1.2? Which model would you prefer? Justify your answer.
# Your code here. You can add as many cells as you want.
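A sketch that mirrors Question 1.2, assuming the train and test splits defined there are still in scope:
# 5th-order polynomial of age in a pipeline
quint_model = Pipeline([
    ('poly', PolynomialFeatures(degree=5, include_bias=False)),
    ('linreg', LinearRegression())
])
quint_model.fit(train[['age']], train['log_income'])

# Plot the training data and the fitted curve
age_grid = pd.DataFrame({'age': np.linspace(train['age'].min(), train['age'].max(), 200)})
fig, ax = plt.subplots()
ax.scatter(train['age'], train['log_income'], alpha=0.5, label='training data')
ax.plot(age_grid['age'], quint_model.predict(age_grid), color='red', label='5th-order fit')
ax.set_xlabel('Age (years)')
ax.set_ylabel('log(monthly income)')
ax.legend()
plt.show()

# Test mean squared error
print('Test MSE:', mean_squared_error(test['log_income'], quint_model.predict(test[['age']])))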
Your written answer here
Question 1.4 (6pts)¶
Your boss tells you that he only wants to use your predictive algorithm if the predicted mean squared test error on a completely novel data set is lower than 0.33 and the 95% confidence interval does not include that value. Since you have not achieved this so far, you give your code and the training and test data sets to a colleague of yours (Carl). He uses your code to play around with the number of polynomial terms for age and NumberOfDependents, as well as with the number and form of interaction terms. Finally, he finds a model including 25 features that minimizes the test error. He then conducts a bootstrap analysis to obtain a 95% confidence interval on that test error. In his final report he writes:
The final model has a predicted test error on unseen data of 0.29. The 95% confidence interval is 0.26 – 0.32, and the new algorithm should therefore meet the required criterion of producing a squared error of lower than 0.33 on novel data with high certainty.
Written answer: What is the problem with Carl’s approach and statement? What would you have to change in the model fitting / selection / evaluation procedure to fix the problem?
Your answer here
Task 2: Tree-based ensemble (40 pts)¶
Now we will begin to model default. We will model it using a tree-based ensemble. After careful consideration, you have decided to use an XGBoost model.
Question 2.1: XGB vs Random Forest (5 pts)¶
Written answer: Why do you think XGBoost is a better alternative than Random Forests for this particular database? Answer in terms of the number of cases, the number of variables, and the properties of each model.
Your answer here
Question 2.2: Finding the best XGB model (20 pts)¶
One of your colleagues has previously done an analysis of the best parameters that can be used for the model, and has limited the choice to four potential configurations: a max_depth parameter of 3, a learning_rate value of either 0.01 or 0.1, and an n_estimators value (number of trees) of either 50 or 200. Your colleague used a random_state seed value of 20212004 everywhere possible. All the other parameters can be set at the values appropriate for a binary model seen in the course.
You will now determine which of these configurations is the best for your work. For this:
a. Starting from the original data (i.e. not using the output from Task 1), create a train / test split leaving 300 cases in the test set. (2 pts)
b. Create a parameter grid that can test the values that you need to test. (5 pts)
c. Run a grid-search using this configuration and get the values of the best parameters. Use the whole training set you created in (a) for this search; do not create a smaller sample. Show the values of the best parameters. (Hint: As you will be using the whole training set, you can get the best estimator directly from the GridSearchCV object by setting the option refit=True. The best estimator is then stored in the GRID_SEARCH_OBJECT.best_estimator_ property, where GRID_SEARCH_OBJECT is the name of your GridSearchCV object.) (13 pts)
# Your code here. You can add as many cells as you want.
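A possible sketch is given below. It assumes the 20212004 seed is also used for the train/test split, uses ROC AUC as the grid-search scoring metric, and leaves the remaining XGBClassifier parameters at their defaults; adjust these to match the settings used in the course.
# Features and binary target
X = gsc_sample.drop(columns=['SeriousDlqin2yrs'])
y = gsc_sample['SeriousDlqin2yrs']

# Train / test split with 300 cases in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=300, random_state=20212004)

# Parameter grid for the four candidate configurations
param_grid = {
    'max_depth': [3],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [50, 200],
}

# Grid search over the whole training set; refit=True keeps the best estimator
xgb = XGBClassifier(objective='binary:logistic', random_state=20212004)
grid_search = GridSearchCV(xgb, param_grid, scoring='roc_auc', cv=5, refit=True)
grid_search.fit(X_train, y_train)

print('Best parameters:', grid_search.best_params_)
best_xgb = grid_search.best_estimator_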
Question 2.3: Evaluating the model (10 pts)¶
Now that you have a model, you must check how well it works and evaluate it in the context of this problem. Perform the following tasks:
a. Apply the best model to the test set to obtain the probability of default for each element. (2pt)
b. Create a ROC curve plot that shows the value of the AUC you obtained. Written answer: What can you say about the model performance? (4pts)
c. Create a variable importance plot showing which variables contribute the most to the model prediction. (4pts)
# Apply to the test set
# Plot the ROC curve and show the AUC
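A sketch for parts (a) and (b), assuming best_xgb and the test split (X_test, y_test) from Question 2.2 are still in scope:
# Probability of default for each test case
prob_default = best_xgb.predict_proba(X_test)[:, 1]

# ROC curve and AUC on the test set
fpr, tpr, _ = roc_curve(y_test, prob_default)
auc = roc_auc_score(y_test, prob_default)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f'XGBoost (AUC = {auc:.3f})')
ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.set_title('ROC curve on the test set')
ax.legend()
plt.show()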
Your written answer here
# Create the variable importance plot
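A sketch for part (c), again assuming best_xgb and X_train from Question 2.2; it uses the default feature_importances_ attribute of the fitted model.
# Bar plot of feature importances from the fitted XGBoost model
importances = pd.Series(best_xgb.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh')
plt.xlabel('Feature importance')
plt.title('XGBoost variable importance')
plt.show()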
Question 2.4: Fairness (5 pts)¶
Your colleague Carl says that you probably neglected to include the full zipcode as a predictive variable in the model. You know this is a bad idea due to fairness concerns. Explain to your colleague what the issues are with including such a variable when making real-world decisions, in the context of model fairness. In this context, explain what we would need to see in the model decisions to achieve Demographic Parity, Equalized Opportunities, and Equalized Odds for the protected characteristics (gender, ethnicity, religion, etc.).
(Hint: Remember that the full zipcode allows identifying particular neighborhoods, and those neighborhoods can have specific ethnicities and gender compositions)
Your answer here
Task 3: Unsupervised learning (30 pts)¶
After you have corrected him twice, Carl now argues that you should not construct just one model but several, as you probably have several disjoint clusters in your data. Your colleague says running a clustering model will clearly show this.
To test your colleague's idea, you propose to run a K-Means model and to use dimensionality reduction to plot the resulting analysis.
Question 3.1 Data normalization (5 pts)¶
Explain why it is a good idea to normalize the data for a K-Means clustering process. Then fit a MinMaxScaler over your full dataset gsc_sample, excluding the target variable. Transform the data and create a new DataFrame containing the transformed variables.
Your written answer here
# Your code here. You can add as many cells as you want.
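A minimal sketch, assuming the target column is SeriousDlqin2yrs and all other columns are used as predictors:
# MinMax-scale all predictors (target excluded) into a new DataFrame
X_full = gsc_sample.drop(columns=['SeriousDlqin2yrs'])

scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_full),
                        columns=X_full.columns, index=X_full.index)
X_scaled.describe()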
Question 3.2 K-Means Clustering (15 pts)¶
Your colleague has mentioned they think there are between three and five clusters in the data. You decide to run a silhouette analysis over the data to answer this question. Create a silhouette plot for 3, 4 and 5 clusters and calculate the corresponding silhouette scores. Use a random seed of 20210420 for your cluster functions.
Written answer: According to the silhouette analysis, do you think 3, 4 or 5 clusters provide a better description of the data? Why?
# Your code here. You can add as many cells as you want.
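A sketch of the silhouette analysis, assuming X_scaled is the normalized DataFrame from Question 3.1; the silhouette-plot layout follows the standard sklearn example.
# Silhouette scores and plots for 3, 4 and 5 clusters
for n_clusters in [3, 4, 5]:
    km = KMeans(n_clusters=n_clusters, random_state=20210420)
    labels = km.fit_predict(X_scaled)

    score = silhouette_score(X_scaled, labels)
    sample_vals = silhouette_samples(X_scaled, labels)
    print(f'{n_clusters} clusters: average silhouette score = {score:.3f}')

    # One horizontal band of sorted silhouette values per cluster
    fig, ax = plt.subplots()
    y_lower = 10
    for i in range(n_clusters):
        vals = np.sort(sample_vals[labels == i])
        y_upper = y_lower + len(vals)
        colour = cm.nipy_spectral(float(i) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, vals,
                         facecolor=colour, edgecolor=colour, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * len(vals), str(i))
        y_lower = y_upper + 10
    ax.axvline(score, color='red', linestyle='--')
    ax.set_xlabel('Silhouette coefficient')
    ax.set_ylabel('Cluster')
    ax.set_title(f'Silhouette plot for {n_clusters} clusters')
    plt.show()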
Your written answer here
Question 3.3: Visualizing your cluster model (10 pts)¶
After running the above analysis you have decided on a certain number of clusters. However, you are not convinced these clusters are actually well-defined groups. You decide to visualize the clusters to identify whether there are any real groups at all.
a. Run a PCA model over the normalized data keeping two components. Show the percentage of explained variance of these two components. (5 pts)
b. Create a scatterplot using these two components as axes of the plot, colouring the points of the plot depending on which cluster they belong to. (2 pts)
c. Written answer: Do you think there are real, distinct clusters in the data? Why or why not? To achieve a good prediction, do you think you need to create a separate predictive model for each cluster? Or would a single predictive model for the entire data set be enough? (3 pts)
# Your code here. You can add as many cells as you want.
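A sketch, assuming X_scaled from Question 3.1 and a chosen cluster count of 3 (replace with the number you selected in Question 3.2):
# Two-component PCA on the scaled data
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print('Explained variance ratio:', pca.explained_variance_ratio_)

# Cluster labels for the chosen number of clusters (3 assumed here)
km = KMeans(n_clusters=3, random_state=20210420)
labels = km.fit_predict(X_scaled)

# Scatter plot in PCA space, coloured by cluster membership
fig, ax = plt.subplots()
scatter = ax.scatter(components[:, 0], components[:, 1], c=labels, cmap='viridis', alpha=0.6)
ax.set_xlabel('Principal component 1')
ax.set_ylabel('Principal component 2')
ax.set_title('Cases in PCA space, coloured by K-Means cluster')
ax.legend(*scatter.legend_elements(), title='Cluster')
plt.show()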
Your written answer here