Preliminary
On the Moodle page of the course, in the Workshops section you will find a directory called Datasets & Source Code. The directory contains the HMEQ dataset (both in its original form and after the missing values for the categorical variables REASON and JOB have been identified). It also contains a file of functions that I designed for these workshops, my_Rfunctions.R. This file is updated weekly, so download the most recent version every week.
1. Download the datasets into the R directory you are using for the course
2. Set the working directory in RStudio to that directory
3. Source the file my_Rfunctions.R by typing:
> source("my_Rfunctions.R")
This command will load every required package. If a required package is not available it will download and install
it. If the command terminates with an error please let me know.
4. Load the dataset:
> Data <- read.csv("clean_hmeq.csv")
We next assign observations to the training and test sets. To ensure reproducibility set the seed to 1, as in previous workshops.
> set.seed(1)
> train <- sample(1:nrow(Data), floor(0.7*nrow(Data)))
> Data.train <- Data[train,]
> Data.test <- Data[-train,]
Identifying Missing Values
Missing values pose a challenging problem in real-world datasets. If the proportion of missing data is very small, then excluding the few samples with missing features may be the best strategy in order not to bias the analysis, but this is true only if the values are Missing Completely at Random (MCAR). While quick fixes such as mean substitution may yield decent performance in some cases, such simple approaches introduce bias into the data. For instance, mean substitution leaves the mean unchanged (which is desirable) but decreases the variance, which may be undesirable. Furthermore, mean substitution implicitly assumes that the missing values are MCAR. This assumption may or may not be correct, but in any case you need to be aware of what you are doing and of its implications. It is important to understand that whenever we substitute (also called impute) missing values we insert inaccurate information into the dataset, since the values we insert will almost certainly not be identical to the real values, which are not available.
To estimate the proportion of missing values for each variable in the dataset you can use the pMiss() function (defined in the file my_Rfunctions.R). To measure the proportion of missing values for a variable we call the function with the appropriate argument:
> pMiss(Data$NINQ)
> pMiss(Data$CLNO)
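Before moving on, here is a quick illustration of the earlier point about mean substitution. The vector below is a hypothetical toy example (not part of the HMEQ data): the mean is unchanged after substitution, but the variance shrinks.
# Toy illustration of the effect of mean substitution
> x <- c(2, 4, NA, 8, NA, 6)
> x.imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
> mean(x, na.rm = TRUE); mean(x.imp)   # the mean is unchanged
> var(x, na.rm = TRUE); var(x.imp)     # the variance of the imputed vector is smaller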
It is clearly inefficient to manually call the function for each variable of a data frame. Automating repetitive tasks is something computers are very good at, so we should do this automatically. One way is to write a for loop.
1. First we define a numerical vector of length equal to the number of columns of Data. This vector will store the
proportion of missing values that we will compute in the next step.
# vector with numeric elements of length ncol(Data)
> v = vector(length=ncol(Data), mode = "numeric")
2. Then we use a for loop, to iterate over all the variables (columns) in the dataset and estimate the proportion of
missing values for each.
# Execute pMiss function for each column, Data[,i]
> for (i in 1:ncol(Data)) { v[i] = pMiss(Data[,i]) }
Instead of using a for loop as before it is possible to use the R function apply(). This function applies to a data frame
(first argument) a function (third argument) either along the rows (when the second argument=1) or along the columns
(when the second argument=2).
> v = apply(Data,2,pMiss)
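Since apply() returns a named vector (one element per column of Data), a convenient way to see at a glance which variables are most affected is to sort it:
# Sort the proportions of missing values in decreasing order
> sort(v, decreasing = TRUE)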
To find the number of observations containing a missing value you can use either a for loop or apply, as follows:
1. Note that we now need to iterate over the rows of Data, and not the columns.
> count=0;
> for (i in 1:nrow(Data)) { count = count + anyNA(Data[i,]) }
2. The command below computes the number of observations with missing values by summing the output from the
apply() function.
> sum( apply(Data,1,anyNA) )
To obtain a proportion we simply divide by the number of observations, i.e. the number of rows:
> sum( apply(Data,1,anyNA) ) / nrow(Data)
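Equivalently, since the proportion of TRUE values in a logical vector is just its mean, the same quantity can be computed as:
# Proportion of observations with at least one missing value
> mean( apply(Data, 1, anyNA) )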
Create Indicator Variable for Missing Values
As discussed in the lectures, to account for non-ignorable missing values, and in particular for the case where the presence of a missing value is informative about the class to which an observation belongs (recall what we discussed in previous workshops about missing values for the DEBTINC variable), it is recommended to introduce a binary variable that indicates whether a value is missing or not. The function na_indicator() is provided for this purpose. The function takes as input a data frame and returns a data frame containing all the original variables, as well as an indicator (binary) variable for each of the variables that contain missing values. The name of each indicator variable is M_{original variable name}.
> Data.ind <- na_indicator(Data)
Inspect the new data frame to understand what the above function did. For instance,
> str(Data.ind)
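To make clearer what na_indicator() produces, here is a minimal sketch of a function with the same behaviour. This is an illustration only; the actual implementation in my_Rfunctions.R may differ.
# Hypothetical sketch: add a 0/1 indicator column M_<name> for every
# variable that contains missing values
> make_na_indicator <- function(df) {
+   for (v in names(df)) {
+     if (anyNA(df[[v]])) df[[paste0("M_", v)]] <- as.integer(is.na(df[[v]]))
+   }
+   df
+ }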
Multiple Imputation of Missing Values
The mice package provides advanced methods to impute (i.e. replace) missing values with plausible values, using a rich variety of techniques. MICE is designed for the Missing at Random (MAR) case, in which the probability that a value is missing depends on the other observed variables, so that one can attempt to predict the missing values of a variable from the other variables. It imputes data on a variable-by-variable basis by specifying an imputation model per variable. The default is to use the same imputation method for all the variables with missing values.
# Library is loaded automatically when you source the file my_Rfunctions.R
> library(mice)
# Enables a better understanding of the pattern of missing data
> md.pattern(Data)
The output of this function (see Fig. 6) is interpreted as:
• 3364 observations have no missing values
Figure 6: Output from md.pattern()
• 166 observations have missing values for MORTDUE only …
• 3 observations have missing values for both VALUE and MORTDUE
• In general, the number of missing values appears in the last column of the output, the variables for which missing
values occur are indicated with a 0 in the corresponding row, and the associated number of observations is the first
element in each row.
• The final row reports the number of observations that have a missing value for each variable. In our case the table
is so large that only part of it is printed so you can’t see the last row. To overcome this, try:
# The tail() function prints the last lines of output
> tail(md.pattern(Data))
A very informative visualisation of the presence and co-occurrence of missing values is offered by the aggr() function
(which requires the VIM package).
# This should be loaded when you source the file my_Rfunctions.R
> library(VIM)
> mice_plot <- aggr(Data, numbers=TRUE, sortVars=TRUE)
Missing Value Imputation
To impute missing values use the mice() function:
# The parameters specified are the default values (except for m)
> Data.imp <- mice(Data, m=1, method = "pmm", seed = 10)
Let's explain the parameters used in mice():
• m refers to the number of imputed datasets that will be created. Here we use 1.
• method refers to the imputation method used. Here we chose Predictive Mean Matching (pmm), which is the recommended (default) method in the package. There are many other methods available in the package:
# Returns list of imputation methods available
> methods(mice)
More detailed information about each method can be found in the documentation of this function
> ?mice
• seed sets the random number seed, which ensures reproducibility of results. That is, if you execute the same command at any other point using the same seed, you will get exactly the same predictions for the missing values.
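If you want to inspect the imputed values themselves, they are stored in the imp component of the object returned by mice(), one data frame per variable. For example (DEBTINC is chosen here only as an illustration):
# Imputed values proposed for DEBTINC (one column per imputed dataset)
> head(Data.imp$imp$DEBTINC)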
After imputing the missing values it is useful to inspect the distribution of the imputed values and compare it with the distribution of the observed (non-missing) values. You can construct scatterplots using xyplot() and density plots using densityplot(), both from the lattice package.
# Should be loaded after you source the file my_Rfunctions.R
> library(lattice)
In the scatterplot shown in Fig. 7 I set the plotting symbol, pch=18, and a transparency level, alpha=0.4, to make it easier to distinguish which observations are imputed (purple) and which are observed (blue). (Values of alpha close to zero make points almost fully transparent, while a value of 1 makes them solid. This option is useful when you make scatterplots of large datasets, where without transparency the entire scatterplot is black.)
# Scatterplot of imputed data
> xyplot(Data.imp, DEBTINC ~ MORTDUE + VALUE, pch=18, alpha = 0.4, auto.key=TRUE)
Figure 7: Indicative scatterplot of observed and imputed values for DEBTINC against MORTDUE and VALUE
We want the distribution of the imputed points (purple) to be similar to that of the observed data (blue). In this case,
the imputed values can be considered plausible. The density plot in Fig. 8 is obtained through:
> densityplot(Data.imp)
You can assess the difference between imputing missing values using a model and the simpler approach of using the mean or the mode by comparing the distributions of the imputed values produced by the two approaches. It is very important to understand that when you impute using the mean, the mode, or in any other way that does not involve the other variables, you are effectively assuming that values are Missing Completely At Random (and not Missing At Random), because in this case no attempt is made to predict the missing values using the other variables.
# Replace missing values using the mean
> Data.imp2 <- mice(Data, m=1, method = "mean", seed = 10)
You will get a warning saying that the mean cannot be used to impute categorical variables. This makes sense, and if you want to do this correctly you need to specify that these variables should be imputed using the mode. (Recall that you can set the imputation method separately for each variable with missing values.)
Figure 8: Distribution of imputed and existing values
Use the previous plots to compare the distribution of the imputed values with that of the observed values.
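If you do want to impute different variables with different methods, mice() accepts a vector with one method per column. The sketch below starts from the method vector of the earlier run and overrides a few entries; the particular choices are illustrative only, and it assumes REASON and JOB are stored as factors.
# Start from the methods used in the earlier run and override selected columns
> meth <- Data.imp$method
> meth[c("REASON", "JOB")] <- "polyreg"   # polytomous regression for factors
> meth["MORTDUE"] <- "mean"               # mean imputation for one numeric column
> Data.imp3 <- mice(Data, m=1, method = meth, seed = 10)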
As discussed in the lectures, using a constant to replace missing values is not a recommended approach, so we will use the original multiple imputation results (Data.imp) to replace the missing values. This is done through the complete() function.
# Missing values have been replaced with the imputed values in the first dataset
> Data.comp <- complete(Data.imp,1)
Verify that there are no missing values in the new dataset, Data.comp:
> apply(Data.comp,2,anyNA)
To combine missing value imputation with the creation of the missing value indicator variables, we can either repeat all
the above steps but giving as input to the mice() function the data frame Data.ind (which contains the missing value
indicators), or we can simply add the indicator variables from Data.ind as additional columns to Data.imp. The latter
is a lot easier, and can be achieved using the function cbind() which “binds” data frames column-wise.
First verify that the missing value indicator variables are stored in columns 14 to 24 in data frame Data.ind.
> str(Data.ind)
> names(Data.ind)
> names(Data.ind)[14:24]
Next bind together the data frame Data.comp (containing the dataset with imputed values) and the columns with missing
value indicators from Data.ind
> Data.all <- cbind(Data.comp, Data.ind[ ,14:24])
Make sure that everything is correct by inspecting the last data frame you created.
> str(Data.all)
> apply(Data.all, 2, anyNA)
To avoid having to repeat all the above steps (even if they are in a script) every time you want to analyse/ mine this
dataset you can store the complete dataset into a new CSV file. The function write.csv() is the analogue of read.csv()
for storing data frames in comma separated format.
> write.csv(Data.all, file = "comp_hmeq.csv", row.names = FALSE)
The option row.names=FALSE tells R not to store the row names. Since our observations are not named, storing row names would create an additional column containing just the row number of each observation.
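If you want to confirm the file was written correctly, you can read it back in and check that no missing values remain (Data.check is just a temporary name used for this check):
# Optional check: read the stored file back and look for missing values
> Data.check <- read.csv("comp_hmeq.csv")
> apply(Data.check, 2, anyNA)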
Logistic Regression Model and Variable Selection
From now on whenever we build logistic regression models we will be using the dataset with imputed missing values and
missing value indicator variables. I therefore overwrite the previous data frame Data with the data frame I just created.
(You can always recover the original dataset by loading the data in the file clean_hmeq.csv).
> Data <- Data.all
> Data.train <- Data[train,]
> Data.test <- Data[-train,]
Stepwise Variable Selection using Akaike’s Information Criterion
Stepwise selection starts off with a model containing only the intercept. At each step the variable that most improves the chosen model selection criterion is added. After introducing the new variable the method considers whether any of the previously inserted variables no longer contributes appreciably to the model selection criterion; if such a variable is found it is removed from the model. Each step of stepwise selection therefore performs one step of forward selection and one step of backward selection.
In logistic regression we will use Akaike's Information Criterion (AIC) to attempt to strike the optimal balance between model complexity and model fit (on the training set!). AIC is defined as
AIC = 2k − 2 ln(L),
where k is the number of explanatory variables (not coefficients) in the model, and L is the value of the likelihood function for the model (hence ln(L) is the log-likelihood). The "best" model minimises the AIC.
Stepwise selection is implemented by the function step():
# Logistic regression with no predictors, only intercept
> reg0 <- glm(BAD ~ 1, data = Data.train, family = binomial)
# Logistic regression containing all variables in Data
# (note: ~ . is shorthand notation for all variables in the data frame except the target)
> reg1 <- glm(BAD ~ ., data = Data.train, family = binomial)
Calling the step() function requires specifying:
1. the initial model: reg0
2. scope: a formula containing the full range of variables that are considered
3. direction: whether we want "forward", "backward" or "both" (which means stepwise) selection
> LRStep <- step(reg0, scope=formula(reg1), direction="both")
To see the variables in the final model, and summary information do:
> formula(LRStep)
> summary(LRStep)
The resulting model includes a large number of variables as shown in Fig. 9.
The order in which the variables were inserted into the model is reflected in the order in which the coefficients appear in the summary table (see Fig. 9). You can verify this by inspecting the ANOVA table, which also reports how the AIC changed at each step:
> LRStep$anova
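As an optional sanity check you can also recompute the AIC of the selected model directly from its log-likelihood; note that R's AIC() takes k to be the number of estimated parameters (coefficients, including the intercept).
# Recompute AIC from the log-likelihood of the selected model
> ll <- as.numeric(logLik(LRStep))
> k <- attr(logLik(LRStep), "df")
> 2*k - 2*ll      # should match the value reported below and by step()
> AIC(LRStep)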
Figure 9: Summary of model selected through stepwise selection using AIC
Note from the output of the summary() function that there are five coefficients associated with the variable JOB. This is because JOB is a nominal variable (factor) that takes 6 different values, and hence 5 binary variables are used to encode it. For such variables it is not necessary that the p-values associated with the estimated coefficients of all the binary variables are close to zero for the variable to be included. In other words, if one or more of the binary variables corresponding to a categorical variable are included, then the rest are automatically included irrespective of whether they are significant.
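You can see the dummy encoding R uses for JOB directly; model.matrix() shows the binary columns that represent the factor (a small illustration):
# The 6 levels of JOB and the 5 dummy variables used to encode them
> levels(factor(Data.train$JOB))
> head(model.matrix(~ JOB, data = Data.train))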
Although in the workshop we will go on to discuss other aspects of model building, when you are doing this to actually solve a problem this is a stage at which you need to think very carefully about what your findings imply for the task at hand. Recall that the order in which variables are included in the model is an indication of their relative importance.
The Bayesian Information Criterion (BIC) is similar to the AIC, the only difference being the multiplier of the term k. In particular,
BIC = ln(n)k − 2 ln(L),
where n is the number of observations used to estimate the model. Since for moderate sample sizes ln(n) > 2, the BIC typically yields more parsimonious (i.e. smaller) models. To perform variable selection with this criterion you need to alter only one setting in the step() function.
> LRStep2 <- step(reg0, scope=formula(reg1), direction="both", k=log(nrow(Data.train)))
Use BIC to select variables and compare the results with those obtained using AIC for model selection. Using the material from the previous workshops you should be able to assess their performance (on either the training or test set) using ROC curves as well as the area under the ROC curve (AUC) criterion. We will not repeat this here.
Cross Validation
Next we consider K-fold cross-validation (CV) for model selection. K-fold cross-validation first splits the training set into K non-overlapping and exhaustive partitions (called folds). Each partition is used as a test set for a model that is estimated using the remaining K − 1 partitions as a training set. By repeating this process K times, each time using a different fold for assessing performance, every fold is used once as a test set, and consequently each observation is used once to assess performance and K − 1 times to estimate model parameters. An estimate of the generalisation performance is provided by the average performance over the K repetitions. We will use the typical setting K = 10.
To perform 10-fold cross-validation I need to assign each observation from the training set to one of 10 folds. To this end, I will create a vector my_folds containing the index of the fold to which each observation is assigned. Since I will use 10 folds the index will be an integer between 1 and 10, and these integers must be allocated randomly to the observations. I will use the sample() function, as I did when I selected the observations to allocate to the training set. This time, however, I want to sample as many numbers from 1 to 10 as there are observations in the training set, and because more than one observation will clearly be allocated to the same fold I have to sample with replacement.
# To ensure reproducibility
> set.seed(2)
# Sample with replacement nrow(Data.train) integers in range [1,10]
> my_folds <- sample(1:10, nrow(Data.train), replace = TRUE)
Typically the misclassification error is used as the performance measure in CV, but this is not necessary, and any performance measure that is deemed appropriate can be used instead. Because in our case the costs of different types of errors are not symmetric, and moreover one class is much more frequent than the other, using the misclassification error is not advisable. Instead we will use the Area Under the ROC curve (AUC). The function that performs cross-validation using the AUC is called crossValid() and is defined in the file my_Rfunctions.R. The crossValid() function takes the following
arguments:
1. Data: is a data frame containing all the variables in consideration
2. y: name of target variable (in our case “BAD”). The default is to consider the first column as the target.
3. folds: Allocation of rows to each of the K folds. This option is useful if you want to reproduce results by using
the same cross-validation partitions. The default option is for this to be empty (so that the function specifies the
partitions)
4. K: Instead of specifying the folds you can specify the number of partitions to be used. (The default value is 10)
5. nvar: Maximum number of terms to include in the model. (The default behaviour is to consider all variables in
Data)
We next call the main function:
# Call cross validation with pre-specified folds
> out <- crossValid(Data.train, y="BAD", folds = my_folds)
The function prints at each stage the number of estimated coefficients, the variable added at each step, and the average AUC over the K folds (see Fig. 10). Notice that CV is much more time-consuming than the information criteria you used
previously.
In addition to estimating the AUC for models with different numbers of parameters, crossValid() also provides a recommendation for the best model. It is important to understand that the proposed, "best", model is not the model with the highest average AUC. To improve the stability of cross-validation it has been argued that the recommended ("best") model should be the model with the smallest number of predictors that achieves an average performance within one standard deviation of the model with the highest average AUC. In particular, if m* is the highest average AUC achieved by any model, and s* is the standard deviation of the AUC of the model achieving m*, then the recommended model is the one with the fewest parameters whose average AUC is greater than or equal to m* − s*. This approach is known as the One Standard Deviation Rule, and is used to improve the stability of CV.
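The rule is easy to apply by hand. Below is a minimal sketch, assuming auc is a matrix with one row per fold and one column per model size (this mirrors what crossValid() computes internally; the variable names are illustrative):
# One standard deviation rule on a fold-by-model matrix of AUC values
> avg <- colMeans(auc)                    # average AUC per model size
> best <- which.max(avg)                  # model with highest average AUC
> threshold <- avg[best] - sd(auc[, best])
> chosen <- min(which(avg >= threshold))  # smallest model within one SD of the best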
The output of the crossValid() function is a list containing the following:
Figure 10: Output from crossValid()
1. BestModel The best model identified using the one-standard deviation rule just described.
2. Perf A matrix containing the AUC on each fold for the best model at each step
3. Models Contains the specification of the best model at each step
To inspect the recommended model we can use the summary() function (see Fig. 11). Note that the order in which
the variables appear is the order in which they were inserted in the model. The same holds for the output of the coef()
function below.
> summary(out$BestModel)
# Coefficients of best model
> coef(out$BestModel)
Rather than looking only at the recommended model, it is instructive to inspect how performance is affected as the model size increases. To this end, use the function plotCVauc(), also contained in my_Rfunctions.R. The input for this
function must be the output of the crossValid() function.
> plotCVauc(out)
The horizontal axis of the resulting figure (see Fig. 12) is the number of predictors (independent variables) in each model
(0 refers to the model containing only the intercept). The vertical axis reports AUC: The dot for each number of predictors
represents the average AUC over the K folds, while the blue line represents ± one standard deviation from the mean. The
selected model is indicated with a vertical line, and as we saw previously it contains 8 predictors. The horizontal line is
the performance threshold given by the highest average AUC attainable minus one standard deviation. In this case the
highest average AUC is achieved by a model containing 19 predictors.
Fig. 12 indicates that the first few predictors increase the AUC very substantially, but further additions bring only marginal improvement. Indeed, with more than 8 or 10 predictors the difference in average AUC is extremely small. Although barely noticeable, average performance actually deteriorates as we approach the maximum number of predictors.
Cross-validation can be sensitive to the specific partition used. You can repeat this experiment changing the allocation of observations to folds and see what happens. You can also try different values for K and assess whether variability increases or not.
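For example, the following sketch re-runs cross-validation letting crossValid() create its own partition, this time with K = 5 (any value of K can be substituted):
# Repeat CV with a different random partition and K = 5 folds
> set.seed(3)
> out.b <- crossValid(Data.train, y = "BAD", K = 5)
> plotCVauc(out.b)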
Figure 11: Summary of best model from cross validation (order of coe�cients reflects order of inclusion in model)
Figure 12: AUC of models selected through 10-fold cross validation.