
SOLUTION_Final_2021B_PG

Final Exam – SS 3850G / CS 4414B¶
Student ID: XXXXXXXXX (XX / 100)¶


General comments¶
This Final integrates knowledge and skills acquired during the whole semester. You are allowed to use any document and source on your computer and look up documents on the internet. You are NOT allowed to share documents, or communicate in any other way with people inside or outside the class during the exam. To finish the exam in the allotted 3 hrs, you will have to work efficiently. Read the entirety of each question carefully. You need to be signed into the Final Zoom session during the entire exam with your video on and pointed at yourself.

You need to submit the final by the due date (17:00) on OWL in the Test & Quizzes / Final section where you downloaded the data set and notebook. Late submission will be scored with 0 pts, unless you have received special accommodations. To avoid technical difficulties, start your submission five to ten minutes before the deadline at the latest. To be sure, you can also submit multiple versions – only the latest version will be graded.

Most questions demand a written answer – answer these in full English sentences.

For your Figures, ensure that all axes are labeled in an informative way.

Ensure that your code runs correctly by choosing “Kernel -> Restart and Run All” before submitting.

Additional Guidance¶
If at any point you are asking yourself “are we supposed to…”, then write your assumptions clearly in your exam and proceed according to those assumptions.

Good luck!

# Install SHAP and yellowbrick if needed
# !pip install shap yellowbrick

## Preliminaries
# Sets up the environment by importing
# pandas, numpy, matplotlib, seaborn, sklearn, scipy.
# No other packages are allowed in solving the final.

import pandas as pd
import numpy as np

# Models and metrics
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_curve, roc_auc_score, mean_squared_error
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# tensorflow
import tensorflow.keras as keras
from tensorflow.keras import models
from tensorflow.keras import layers

# Clustering
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_samples, silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.cluster.elbow import kelbow_visualizer

import shap

# Plotting
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Credit scoring is one of the most common applications of statistical modeling / data science techniques. Predicting whether a potential borrower will repay their obligations or will not do so (called default) is one of the key activities in personal and small business lending.

During this exam, you will work with a sample of granted loans taken from a local bank, as part of a financial competition that ran back in 2013. The company sponsoring it made available the following variables:

SeriousDlqin2yrs (binary, target variable): 1 if the borrower experienced 90 days past due delinquency or worse (default), 0 otherwise.
RevolvingUtilizationOfUnsecuredLines (percentage): Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits.
age (integer): Age of borrower in years.
NumberOfTime30-59DaysPastDueNotWorse (integer): Number of times borrower has been 30-59 days past due but no worse in the last 2 years.
DebtRatio (percentage): Monthly debt payments, alimony, living costs divided by monthly gross income.
NumberOfOpenCreditLinesAndLoans (integer): Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards).
NumberOfTimes90DaysLate (integer): Number of times borrower has been 90 days or more past due.
NumberRealEstateLoansOrLines (integer): Number of mortgage and real estate loans including home equity lines of credit.
NumberOfTime60-89DaysPastDueNotWorse (integer): Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
NumberOfDependents (integer): Number of dependents in family excluding themselves (spouse, children etc.)

You get a sample of 1000 cases with a 50% default rate. The cases are stored in the attached csv file (gsc_sample.csv)

With this information, execute the following tasks using your knowledge from the course.

# Uncomment this line if using cloud installation (Colab or others)
# !gdown https://drive.google.com/uc?id=1_9tztEp7v1wBJTH91xZpTS_QDgdp4mN0

# Read the data
gsc_sample = pd.read_csv('gsc_sample.csv')
gsc_sample.describe()

SeriousDlqin2yrs age NumberOfTime3059DaysPastDueNotWorse MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime6089DaysPastDueNotWorse NumberOfDependents RevolvingUtilizationOfUnsecuredLines DebtRatio
count 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000
mean 0.50000 49.632000 0.757000 6744.346000 8.670000 0.42700 1.026000 0.330000 0.92200 0.468928 0.365071
std 0.50025 13.829166 3.303462 11396.666095 5.065645 3.22041 1.152672 3.170986 1.18715 0.403894 0.325817
min 0.00000 22.000000 0.000000 1100.000000 0.000000 0.00000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 0.00000 39.000000 0.000000 3457.750000 5.000000 0.00000 0.000000 0.000000 0.00000 0.080844 0.145849
50% 0.50000 49.000000 0.000000 5180.000000 8.000000 0.00000 1.000000 0.000000 0.00000 0.374469 0.291640
75% 1.00000 59.000000 1.000000 8000.000000 11.000000 0.00000 2.000000 0.000000 2.00000 0.877077 0.489728
max 1.00000 92.000000 98.000000 250000.000000 30.000000 98.00000 9.000000 98.000000 6.00000 2.297612 2.639328
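
The brief states that the sample contains 1000 cases with a 50% default rate. A quick sanity check of those two facts (an illustrative extra cell, not part of the original exam scaffold) could look like this:

# Sanity check: sample size and default rate (illustrative only)
print(gsc_sample.shape)                                              # expect (1000, 11)
print(gsc_sample['SeriousDlqin2yrs'].value_counts(normalize=True))   # expect 0.5 / 0.5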

Task 1 (30 pts UG, 37 points PG)¶
Before we start working on a predictive model for whether somebody will default on a loan or not, in Task 1 we will first build a model to predict the typical monthly income. Income is notoriously difficult to obtain, as people may have different sources of income, so it holds value to create income prediction models. Ultimately this model may be useful in spotting whether somebody is in the typical income bracket. One of the main predictors of income is age, so we focus on this variable first and later consider some additional variables.

Question 1.1 (5pts)¶
Generate a bivariate scatter plot of age (x-axis) and Monthly Income (y-axis). You should be able to see two extreme observations. [1pt]
Exclude the two observations from the data set and regenerate your bivariate scatter plot. [1pt]
Written answer: Is the distribution of Monthly income symmetric or skewed? [1pt]
Written answer: Given the presence of outliers and the shape of the distribution of the target variable, would you prefer a L2-loss or an L1-loss for your regression model? How will the prediction of each of these models differ? Which technique will give a prediction that is closer to the median income? [2pts]
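
To illustrate the last written sub-question, here is a minimal sketch (not part of the exam scaffold) contrasting an L2 fit with an L1 fit of MonthlyIncome on age. It assumes scikit-learn >= 1.0 (for QuantileRegressor) and scipy >= 1.6 (for the 'highs' solver); using age as the single predictor and alpha=0 are illustrative choices.

# Illustrative contrast of L2 (mean) vs L1 (median) regression of MonthlyIncome on age
# (sketch; assumes scikit-learn >= 1.0 and scipy >= 1.6)
from sklearn.linear_model import QuantileRegressor

X_age = np.c_[gsc_sample.age]
y_inc = gsc_sample.MonthlyIncome
l2_fit = LinearRegression().fit(X_age, y_inc)                                   # squared error -> conditional mean
l1_fit = QuantileRegressor(quantile=0.5, alpha=0, solver='highs').fit(X_age, y_inc)  # absolute error -> conditional median
print('L2 prediction at age 40:', l2_fit.predict([[40]])[0])
print('L1 prediction at age 40:', l1_fit.predict([[40]])[0])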

# Produce the scatter plot
sns.jointplot(x=gsc_sample.age, y=gsc_sample.MonthlyIncome)
plt.show()

# Exclude outliers and replot
D = gsc_sample[gsc_sample.MonthlyIncome < 100000]
sns.jointplot(x=D.age, y=D.MonthlyIncome)
plt.show()

Your written answers here

Written answer: The distribution of Monthly income is not symmetric; it is strongly right-skewed (the mean lies well above the median). A regression using an L1 loss would be more robust against the influence of outliers and the influence of the right tail of the distribution. The L2-loss model predicts the conditional mean of income, whereas the L1-loss model fits a line through the conditional median, so the L1-loss model gives a prediction that is closer to the median income.

Question 1.2 (13pts)¶
As a solution to the pesky distribution of Monthly income, you chose the practical solution of excluding the two highest-earning cases.
Transform the target variable from income to the natural logarithm of the monthly income.
All subsequent questions in this task will be done over this reduced and transformed data set.

To build and evaluate a baseline model take the following steps:
Split the data into an equal-sized training and test set (500 observations each). Use a random_state of 1.
Build a model that predicts the log income as a quadratic function of age. To get full points, implement the feature construction and model in a pipeline.
Fit the model using a squared-error loss (L2).
Plot the training data and the fitted model.
Calculate and report the mean-squared error over the test set.
Using the Central Limit Theorem or bootstrap, calculate and report the 95% confidence interval of the test error.

# Exclude the outliers and transform the Monthly income variable [1pt]
D = gsc_sample[gsc_sample.MonthlyIncome < 100000].copy()
D['logIncome'] = np.log(D.MonthlyIncome)

# Split the data: 1pt
X = np.c_[D.age]
y = D.logIncome
[Xtrain, Xtest, ytrain, ytest] = train_test_split(X, y, test_size=500, random_state=1)

# Define and fit the model (2 pts for pipeline, 1 pt for otherwise correct fit)
model = Pipeline([('features', PolynomialFeatures(degree=2)), ('GLM', LinearRegression())])
model.fit(Xtrain, ytrain)

# Plot training data and prediction: 3pts
age = np.linspace(18, 90, 30).reshape((-1, 1))
yp = model.predict(age)
ax = plt.subplot()
ax.scatter(Xtrain, ytrain)
ax.plot(age, yp)
plt.xlabel('Age [years]')
plt.ylabel('Log of Monthly income')

# Get the mean-squared test error: 3pts
ypred = model.predict(Xtest)
res2 = (ytest - ypred)**2
Testerror = np.mean(res2)

# Get standard error and CI: 3pts
sem = np.std(res2) / np.sqrt(res2.shape[0])
CI = [Testerror - 1.96 * sem, Testerror + 1.96 * sem]
print(f'The mean test error is {Testerror:.4f}')
print(f'The 95% Confidence interval is {CI[0]:.2f} - {CI[1]:.2f}')
plt.show()

The mean test error is 0.3275
The 95% Confidence interval is 0.29 - 0.37
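
The confidence interval above is based on the Central Limit Theorem applied to the per-observation squared errors. Since the question also allows a bootstrap, here is a minimal sketch of the non-parametric bootstrap alternative; it reuses res2 from the cell above, and the seed and the choice of 2000 resamples are illustrative assumptions rather than part of the original solution.

# Bootstrap 95% CI for the mean squared test error (sketch; reuses res2 from above)
err = np.asarray(res2)
rng = np.random.default_rng(1)                 # illustrative seed
boot_means = np.array([rng.choice(err, size=err.shape[0], replace=True).mean()
                       for _ in range(2000)])  # 2000 resamples, arbitrary choice
ci_boot = np.percentile(boot_means, [2.5, 97.5])
print(f'Bootstrap 95% CI for the test error: {ci_boot[0]:.2f} - {ci_boot[1]:.2f}')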
Question 1.3 (6pts)¶
Now increase the model complexity by using a 5th-order polynomial on age to predict the monthly income. As in Question 1.2, plot the fit on the training data and report the mean squared error on the test data.
Written answer: Does a 5th-order polynomial offer a better model than the quadratic model in Question 1.2? Which model would you prefer? Justify your answer.

# Define and fit the model [1pt]
model2 = Pipeline([('features', PolynomialFeatures(degree=5)), ('GLM', LinearRegression())])
model2.fit(Xtrain, ytrain)

# Plot training data and prediction [1pt]
age = np.linspace(18, 90, 30).reshape((-1, 1))
yp = model2.predict(age)
ax = plt.subplot()
ax.scatter(Xtrain, ytrain)
ax.plot(age, yp)
plt.xlabel('Age [years]')
plt.ylabel('Log of Monthly income')

# Get the mean-squared test error [2pts]
ypred = model2.predict(Xtest)
res2 = (ytest - ypred)**2
Testerror = np.mean(res2)
print(f'The mean test error is {Testerror:.4f}')
plt.show()

The mean test error is 0.3272

Your written answer here

Written answer: The test error for the new model is slightly lower, which by itself suggests a better model. [2pts] However, given that this test error lies within the confidence interval of the simpler model, I would prefer the simpler (quadratic) model. [2pts]

Question 1.4 (PG only, 7pts)¶
Now we want to also take the number of dependents into account in the prediction of log income.
Build a model that relies on the 5th-order polynomial for age and the 3rd-order polynomial of the number of dependents.
In the model do NOT include any interaction terms of the two features.
To get full points, implement the feature generation within the model pipeline (ColumnTransformer can be useful, or you can write your own Transformer).
Report the test error and decide whether the new model provides an improvement over the last one.

# Add the new variable: 1pt
X = np.c_[D.age, D.NumberOfDependents]
y = D.logIncome
[Xtrain, Xtest, ytrain, ytest] = train_test_split(X, y, test_size=0.5, random_state=1)

# Define and fit the model (5 pts for pipeline, 3 pts for otherwise correct fit)
trans = ColumnTransformer([('agePoly', PolynomialFeatures(5), [0]),
                           ('dependentPoly', PolynomialFeatures(3), [1])])
model3 = Pipeline([('trans', trans), ('GLM', LinearRegression())])
model3.fit(Xtrain, ytrain)

# Get the mean-squared test error (1pt)
ypred = model3.predict(Xtest)
res2 = (ytest - ypred)**2
Testerror = np.mean(res2)
print(f'The mean test error is {Testerror:.4f}')

The mean test error is 0.3158
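
The question mentions that a hand-written transformer is an acceptable alternative to the ColumnTransformer used above. As a hedged illustration only (the class name and its arguments are made up, not part of the original solution), a minimal per-column polynomial transformer without interaction terms could look like this:

from sklearn.base import BaseEstimator, TransformerMixin

class PerColumnPolynomial(BaseEstimator, TransformerMixin):
    # Hypothetical helper: expands column 0 (age) up to degree deg_age and
    # column 1 (dependents) up to degree deg_dep, with no interaction terms.
    def __init__(self, deg_age=5, deg_dep=3):
        self.deg_age = deg_age
        self.deg_dep = deg_dep

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        age_terms = np.column_stack([X[:, 0]**p for p in range(self.deg_age + 1)])
        dep_terms = np.column_stack([X[:, 1]**p for p in range(1, self.deg_dep + 1)])
        return np.hstack([age_terms, dep_terms])

# Usage mirrors the pipeline above
model3b = Pipeline([('trans', PerColumnPolynomial()), ('GLM', LinearRegression())])
model3b.fit(Xtrain, ytrain)
print(f'Custom-transformer test error: {np.mean((ytest - model3b.predict(Xtest))**2):.4f}')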
Question 1.5 (6pts)¶
Your boss tells you that he only wants to use your predictive algorithm if the predicted mean squared test error on a completely novel data set is lower than 0.33 and the 95% confidence interval does not include that value. Since you have not achieved this so far, you give your code and the training and test data sets to a colleague of yours (Carl). He uses your code to play around with the number of polynomial terms for age and NumberOfDependents, as well as the number and form of interaction terms. Finally he finds a model including 25 features that minimizes the test error. He then conducts a bootstrap analysis to obtain a 95% confidence interval on that test error. In his final report he writes:

The final model has a predicted test error on unseen data of 0.29. The 95% confidence interval is 0.26 - 0.32 and the new algorithm should therefore meet the required criterion of producing a squared error lower than 0.33 on novel data with high certainty.

Written answer: What is the problem with Carl's approach and statement? What would you have to change in the model fitting / selection / evaluation procedure to fix the problem?

Your answer here

Written answer: The problem is that Carl conducts the model selection using the test set. Out of all feasible models, he selects the one with the lowest test error, so the test error and the corresponding CI are no longer an unbiased estimate of the test error on a new data set. Had he only considered a single model, the test error would have been a valid estimator. [3pts] To fix this problem, I should have only allowed Carl to make his decision based on the training set (for example using cross-validation), and then evaluated the test error for that one winning model. [3pts]
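
To make the suggested fix concrete, here is a minimal sketch (not part of the original solution) of selecting the polynomial degree by cross-validation on the training set only, touching the test set exactly once for the winning model. The degree grid, the restriction to age as the single predictor, and the 5-fold setting are illustrative assumptions.

# Model selection on the training set only (illustrative sketch)
X_age_train, X_age_test = Xtrain[:, [0]], Xtest[:, [0]]   # age column only, for simplicity
degree_grid = {'features__degree': [1, 2, 3, 4, 5]}       # illustrative candidate degrees
cv_search = GridSearchCV(Pipeline([('features', PolynomialFeatures()),
                                   ('GLM', LinearRegression())]),
                         degree_grid, cv=5, scoring='neg_mean_squared_error')
cv_search.fit(X_age_train, ytrain)
best_model = cv_search.best_estimator_
# The test set is used exactly once, for the single winning model
test_mse = np.mean((ytest - best_model.predict(X_age_test))**2)
print(f'Selected degree: {cv_search.best_params_["features__degree"]}, test MSE: {test_mse:.4f}')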
Task 2: Tree-based ensemble (40 pts)¶
Now we will begin to model default. We will model it using a tree-based ensemble. After careful consideration, you have decided to use an XGBoosting model to create it.

Question 2.1: XGB vs Random Forest (5 pts)¶
Written answer: Why do you think XGBoosting is a better alternative than Random Forests for this particular database? Answer in terms of the number of cases, the number of variables and the properties of each model.

Your answer here

In general, both models use the diversity introduced by sampling to search a large model space. However, Random Forests is much more stringent in how much diversity it can extract from the data, as it requires sampling variables and cases simultaneously and learning more complex trees from each such division. XGBoosting, on the other hand, usually samples only cases and learns from the down-weighted errors (as defined by the learning rate) of the previous steps. This has been empirically shown to work better on smaller samples with limited variables, such as the one we have available in this exam. So, XGB is at first look the better alternative for this problem.

Question 2.2: Finding the best XGB model (20 pts)¶
One of your colleagues has previously done an analysis of the best parameters that can be used for the model, and has limited the choice to four potential configurations: a max_depth parameter of 3, a learning_rate value of either 0.01 or 0.1, and an n_estimators (number of trees) of either 50 or 200 trees. Your colleague used a random_state seed value of 20212004 everywhere possible. All the other parameters can be set at the values appropriate for a binary model seen in the course. You will now determine which of these configurations is the best for your work. For this:

a. Starting from the original data (i.e. not using the output from Task 1), create a train / test split leaving 300 cases in the test set.
b. Create a parameter grid that can test the values that you need to test.
c. Run a grid search using this configuration and get the values of the best parameters. Use the whole training set you created in a for this search, do not create a smaller sample. Show the values of the best parameters. (Hint: As you will be using the whole training set, you can get the best estimator directly from the GridSearchCV object by setting the option refit=True. The best estimator is then stored in the GRID_SEARCH_OBJECT.best_estimator_ property, where GRID_SEARCH_OBJECT is the name of your GridSearchCV object.)

# Create train / test split (2 pts)
x_train, x_test, y_train, y_test = train_test_split(gsc_sample.drop(columns='SeriousDlqin2yrs'),
                                                    gsc_sample['SeriousDlqin2yrs'],
                                                    test_size=300,
                                                    random_state=20210420)

# Create the parameter grid (5 pts)
param_grid = dict({'n_estimators': [50, 200],
                   'max_depth': [3],            # This one can be left out and directly set in the model.
                   'learning_rate': [0.01, 0.1]
                   })

# Create the GridSearchCV object (10 pts)
# Define the XGB model
XGB_Bankloan = XGBClassifier(max_depth=3,                  # Depth of each tree
                             learning_rate=0.1,            # How much to shrink error in each subsequent training. Trade-off with no. estimators.
                             n_estimators=100,             # How many trees to use; the more the better, but decrease learning rate if many are used.
                             verbosity=1,                  # Whether to print more information or not.
                             objective='binary:logistic',  # Type of target variable.
                             booster='gbtree',             # What to boost. Trees in this case.
                             n_jobs=2,                     # Parallel jobs to run. Set to your number of processors.
                             gamma=0.001,                  # Minimum loss reduction required to make a further partition on a leaf node of the tree. (Controls growth!)
                             subsample=0.632,              # Subsample ratio. Can be set lower.
                             colsample_bytree=1,           # Subsample ratio of columns when constructing each tree.
                             colsample_bylevel=1,          # Subsample ratio of columns when constructing each level. 0.33 is similar to random forest.
                             colsample_bynode=1,           # Subsample ratio of columns when constructing each split.
                             reg_alpha=1,                  # Regularizer for first fit. alpha = 1, lambda = 0 is LASSO.
                             reg_lambda=0,                 # Regularizer for first fit.
                             scale_pos_weight=1,           # Balancing of positive and negative weights.
                             base_score=0.5,               # Global bias. Set to the average of the target rate.