Consider the dataset “D1.2 Credit card defaults.csv” (described in C1.2). This dataset contains
information about credit card consumers, in particular, their default behavior. Correspondingly,
the key variable in the dataset is “defaultpaymentnextmonth” (call this variable “y”), a
dichotomous variable that indicates whether a customer defaulted on his/her debt. There are 23
other variables that can be used to predict this outcome. For simplicity, we will refer to the set
containing all these variables as “X”.
Using this data, perform the following tasks:
1. [3 points] Generate a random training/validation index that implements a 70/30 split.
Use a random seed of your choice.
2. [7 points] Estimate two logistic specifications that allow you to generate out-of-sample
predictions of y. Take the following points into account:
You choose the variables X that enter each model specification. These variables X
can be continuous or categorical. Make sure continuous and categorical variables
are entered appropriately into the models.
Specify model 1 as the simplest of the two. This model must include at least 5
explanatory variables.
Specify model 2 as the richer/more flexible of the two. Control flexibility through
the set of X variables used. Include at least one variable interaction. [An interaction
of two variables, x1 and x2, would be x3 = x1*x2.]
3. [2 points] Do any of your models exhibit signs of overfitting? Explain.
4. [3 points] Provide a discussion of which of the two models you would prefer for the purpose
of identifying consumers who will default in the future. If needed, make assumptions.
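
As a rough illustration of tasks 1 to 3, the sketch below runs the workflow on synthetic data rather than the actual file. Everything except “y” is an assumption: the column names (limit_bal, age, bill_amt, pay_amt, pay_status, education) are hypothetical stand-ins, NumPy and scikit-learn are one possible toolset, and AUC is only one of several reasonable metrics for comparing in-sample and out-of-sample fit. A very large C is used so that scikit-learn's default L2 penalty is negligible, approximating a plain logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)  # random seed of your choice

# Synthetic stand-in for the real data; all column names are hypothetical.
n = 5000
limit_bal = rng.uniform(10, 500, n)                 # continuous
age = rng.integers(21, 70, n).astype(float)         # continuous
bill_amt = rng.uniform(0, 300, n)                   # continuous
pay_amt = rng.uniform(0, 50, n)                     # continuous
pay_status = rng.integers(0, 3, n).astype(float)    # treated as numeric
education = rng.integers(1, 4, n)                   # categorical: levels 1, 2, 3

# Simulated dichotomous outcome y (1 = default).
logit = -1.5 - 0.004 * limit_bal + 0.8 * pay_status + 0.3 * (education == 3)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Task 1: random 70/30 training/validation index.
perm = rng.permutation(n)
n_train = int(0.7 * n)
train_idx, valid_idx = perm[:n_train], perm[n_train:]

# Categorical variables enter as dummies, with level 1 as the baseline.
edu_dummies = np.column_stack([education == 2, education == 3]).astype(float)

# Task 2, model 1: the simpler specification, 5 explanatory variables.
X1 = np.column_stack([limit_bal, age, bill_amt, pay_amt, pay_status])

# Task 2, model 2: richer specification with dummies, a squared term,
# and one interaction (x3 = x1 * x2).
X2 = np.column_stack([X1, edu_dummies, age ** 2, limit_bal * pay_status])

def fit_and_score(X):
    """Fit on training rows only; return (training AUC, validation AUC)."""
    model = make_pipeline(
        StandardScaler(),                            # scaling learned from training rows
        LogisticRegression(C=1e6, max_iter=1000),    # large C: near-zero penalty
    )
    model.fit(X[train_idx], y[train_idx])
    p_train = model.predict_proba(X[train_idx])[:, 1]
    p_valid = model.predict_proba(X[valid_idx])[:, 1]  # out-of-sample predictions
    return roc_auc_score(y[train_idx], p_train), roc_auc_score(y[valid_idx], p_valid)

# Task 3: a training AUC well above the validation AUC hints at overfitting.
results = {"model 1": fit_and_score(X1), "model 2": fit_and_score(X2)}
for name, (auc_train, auc_valid) in results.items():
    print(f"{name}: train AUC = {auc_train:.3f}, validation AUC = {auc_valid:.3f}")
```

For task 4, the validation figures are the relevant ones, since the stated purpose is identifying future defaulters. Depending on the assumed costs of missed defaults versus false alarms, a threshold-based metric such as recall may be more informative than AUC.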