
Wine taste prediction¶
XXXXXX / XXXXXXXX
15 / 09 / 2016
(Kernel: IRKernel, Anaconda R-Essentials)
Additional Packages: pROC, glmnet

Contents¶

• Introduction
• Read the Data (A)
• Exploratory Data Analysis (B)
• Stepwise Selection by Cross-Validation (C)
• $L_2$ and $L_1$ Regularisation (D)
• Discussion (E)
• Conclusion
• References

Introduction¶

The files “Wine_training.csv” (training set) and “Wine_testing.csv” (test set) contain data relating to wine: values for a binary response (quality: “good” or “bad”) and for eleven potential predictors: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Different model selection methods are used — in particular, forward stepwise selection by cross-validation, $L_2$ regularisation, and $L_1$ regularisation — to determine the best logistic regression model for the given data, that is, the most predictive model as measured by the area under the receiver operating characteristic (ROC) curve (AUC) for predictions on the test set.

A — Read the Data¶

Read the training and testing sets (in “Wine_training.csv” and “Wine_testing.csv”). Abbreviate the variable (column) names.
In [1]:
# Read the training set (in “Wine_training.csv”).
d <- read.csv("Wine_training.csv") # Read the testing set (in "Wine_testing.csv"). d_test <- read.csv("Wine_testing.csv") In [2]: # Define the original variable (column) names. names_ <- names(d) # Abbreviate each variable (column) name to its first two letters (training). colnames(d) <- tolower(substr(colnames(d), 1, 2)) # Abbreviate each variable (column) name to its first two letters (test). colnames(d_test) <- tolower(substr(colnames(d_test), 1, 2)) # Return the original and abbreviated variable (column) names. names_ <- data.frame(cbind(names_, colnames(d))) colnames(names_) <- NULL names_ 1 fixed.acidity fi 2 volatile.acidity vo 3 citric.acid ci 4 residual.sugar re 5 chlorides ch 6 free.sulfur.dioxide fr 7 total.sulfur.dioxide to 8 density de 9 pH ph 10 sulphates su 11 alcohol al 12 quality qu B — Exploratory Data Analysis¶ There are eleven potential predictors, all continuous, and a single, binary, response: quality (qu). Out of 3,283 observations, 2,178 observations (approximately 66%) are classified as "good", and 1,105 observations (approximately 34%) are classified as "bad". Summary statistics, and histograms and box plots for each of the (potential) predictors (by response), are shown below. The main observations are: • the distribution of values for "good" and "bad" observations is only noticeably different for alcohol (al); • for most predictors, there are at least some outliers by conventional definition (as shown on the box plots); • the distribution of values (for both "good" and "bad" observations) appears to be: ▪ approximately normal for fixed acidity (fi), total sulfur dioxide (to), density (de), ph (ph), and sulphates (su); ▪ skewed for volatile acidity (vo), citric acid (ci), residual sugar (re), chlorides (ch) [strongly skewed, with a large number of outliers], free sulfur dioxide (fr), and alcohol (al) [the distribution of values for "good" observations is skewed, the distribution of values for "bad" observations is more symmetrical]; • each of residual sugar (re) and density (de) appear to contain an extreme outlier value; • there appears to be a reasonably strong (linear) correlation between some of the predictors, for example, between density (de) and residual sugar (re), between alcohol (al) and density (de), between total sulfur dioxide (to) and free sulfur dioxide (fr), and between density (de) and total sulfur dioxide (to); and • the range and magnitude of values for each of the predictors will be relevant for the interpretation of the coefficients for any logistic regression model. It is unecessary (and, although not shown here, demonstrably ineffectual) to transform the values for any of the predictors. That is, the logistic regression model does not assume, for example, that the distribution of values for any of the predictors is normal (James et al, 2015, pp 151; Hastie et al, 2009, pp 127-128). Neither does it appear to make any practical difference to remove observations with even the most extreme outlier values. In [3]: # Return the head of the data. head(d) fi vo ci re ch fr to de ph su al qu 1 7 0.27 0.36 20.7 0.045 45 170 1.001 3 0.45 8.8 good 2 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3 0.49 9.5 good 3 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 good 4 8.1 0.28 0.4 6.9 0.05 30 97 0.9951 3.26 0.44 10.1 good 5 6.2 0.32 0.16 7 0.045 30 136 0.9949 3.18 0.47 9.6 good 6 8.1 0.27 0.41 1.45 0.033 11 63 0.9908 2.99 0.56 12 bad In [4]: # Return summary statistics for each of the variables. summary(d) fi vo ci re Min. : 3.800 Min. :0.0800 Min. 
:0.0000 Min. : 0.600 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700 Median : 6.800 Median :0.2600 Median :0.3200 Median : 4.800 Mean : 6.852 Mean :0.2803 Mean :0.3357 Mean : 6.031 3rd Qu.: 7.300 3rd Qu.:0.3275 3rd Qu.:0.3900 3rd Qu.: 9.000 Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800 ch fr to de Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:107.0 1st Qu.:0.9917 Median :0.04300 Median : 33.00 Median :133.0 Median :0.9936 Mean :0.04604 Mean : 35.05 Mean :137.7 Mean :0.9938 3rd Qu.:0.05000 3rd Qu.: 45.00 3rd Qu.:166.0 3rd Qu.:0.9958 Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390 ph su al qu Min. :2.720 Min. :0.2200 Min. : 8.40 bad :1105 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 good:2178 Median :3.180 Median :0.4800 Median :10.40 Mean :3.195 Mean :0.4904 Mean :10.58 3rd Qu.:3.290 3rd Qu.:0.5500 3rd Qu.:11.40 Max. :3.820 Max. :1.0800 Max. :14.00 In [5]: # Return the correlations between each of the predictors, # in descending order of magnitude. cor_ <- as.data.frame.table(cor(d[ , -ncol(d)])) colnames(cor_) <- c("a", "b", "cor") cor_ <- cor_[cor_$a != cor_$b, ] cor_ <- cor_[order(abs(cor_$cor), decreasing = TRUE), ] cor_ <- cor_[seq(1, nrow(cor_), 2), ] cor_$cor <- round(cor_$cor, 2) rownames(cor_) <- 1:nrow(cor_) print(cor_[1:10, ]) a b cor 1 de re 0.82 2 al de -0.76 3 to fr 0.63 4 de to 0.53 5 al to -0.45 6 ph fi -0.43 7 al re -0.40 8 to re 0.40 9 al ch -0.36 10 ci fi 0.30 In [6]: # Define a two-row by two-column plotting area. par(mfrow = c(2, 2)) # Plot a histogram and box plot for each of the predictors, # by response ("good" or "bad"). for (x in colnames(d[-ncol(d)])) { min_d <- min(d[ , x]) max_d <- max(d[ , x]) b <- seq(min_d, max_d, length.out = 20) hist(d[ , x][d$qu == "good"], col = rgb(0, 1, 0, 0.35), breaks = b, main = paste("Histogram: ", x, sep = ""), xlab = x) hist(d[ , x][d$qu == "bad"], col = rgb(1, 0, 0, 0.35), breaks = b, add = TRUE) mtext(c("Good", "Bad"), adj = c(0.25, 0.75), col = c("green", "red")) boxplot(d[ , x] ~ d$qu, main = paste("Boxplot: ", x, sep = "")) }       C — Stepwise Selection by Cross-Validation¶ C.1 — Load Package pROC¶ Load the package pROC, which provides for: • computing and plotting ROC curves; • calculating AUC; and • estimating the variance (and standard error) of AUC. In [7]: # Load the package *pROC*. library(pROC) Type 'citation("pROC")' for a citation. Attaching package: 'pROC' The following objects are masked from 'package:stats': cov, smooth, var C.2 — Define Functions¶ C.2.1 — Function: Create k Cross-Validation Folds¶ Define a function to create k cross-validation folds. Note that this method produces cross-validation folds of the same size (or as close to the same size as possible), by randomly "shuffling" n + 1 repetitions of the values 1 to k (where n is the number of observations per cross-validatoin fold). This is in contrast to the method used by James et al (2015, p 250), which produces randomly-sized cross-validation folds. In [8]: # Create *k* cross-validation folds for the data *d*. # Parameters: # d: data (data frame) # k: number of folds (default 10) # Returns: # id: vector of values, from 1 to k, # indicating the relevant fold for each row of *d* fold <- function(d, k = 10) { # The number of rows in *d*. r <- nrow(d) # The number of rows per fold. n <- r %/% k # Minimally "over-sample" from the values 1 to *k*, # by "shuffling" *n* + 1 repetitions of the values 1 to *k*, # and "take" *r* values (to account for an uneven division of *r*). 
id <- sample(rep(1:k, n + 1))[1:r] return(id) } C.2.2 — Function: Cross-Validation¶ Define a function to cross-validate AUC for specified predictors (for a logistic regression model) in relation to given data. Although not shown here, for the purpose of implementing the so-called one-standard-error rule (James et al, 2015, p 214), various methods can be used for determining cross-validated AUC and, in particular, the standard error for cross-validated AUC. Cross-validated AUC can be calculated as the mean of the AUC for each of the k cross-validation folds. The standard error can be estimated based on these k values (including via bootstrapping). This appears to be the method used by, for example, the cv.glmnet function in the package glmnet (see Friedman et al, 2016, p 5). However, given random variation between different cross-validation folds, for a value of k of ten (that is, ten-fold cross-validation, producing ten AUC values, one for each of the ten cross-validation folds), the estimated standard error is "unstable" (that is, there is a large amount of variation in the estimated standard error for different cross-validation "splits"). As such, a different approach has been used. Instead of averaging the AUC values for each of the k cross-validation folds, the predictions (and "true" class values) for each of the k cross-validation folds are accumulated (the predictions for each of the folds is independent to the extent that the test sets are disjoint, although the training sets overlap, but this is the same whether AUC values are averaged or accumulated), and a single ROC curve (and corresponding AUC) is calculated for the predictions for all k cross-validation folds (as to generating ROC curves by ranking a set of predictions: Fawcett, 2005, pp 863, 866). (Although not shown here, as you would expect, unless a given model is "unstable", the single AUC value for the predictions for all k cross-validation folds, and the average of each of k AUC values for the predictions of each of k cross-validation folds, are extremely similar.) The standard error for the single AUC value is calculated using the so-called DeLong method, as implemented in the var.roc function of the package pROC (see Robin et al, 2016, pp 6, 78-82). The standard error for cross-validated AUC calculated in this way is consistent with the standard error calculated on the basis of AUC values for each of k cross-validated folds but, crucially, is much more stable (there is considerably less variation in the estimated standard error for different cross-validation "splits"). In [9]: # Cross-validate AUC for the predictors specified in *f*, # for a logistic regression model, for the data *d*. # Parameters: # d: data (data frame) # f: formula specifying the predictors # k: number of folds (default 10) # s: random seed (default 1) # Returns: # list: # - AUC # - AUC standard error # - ROC curve (roc object) xv <- function(d, f, k = 10, s = 1) { # Set the random seed. set.seed(s) # Create *k* cross-validation folds for the data *d*. id <- fold(d, k) # Initialise the set of all prediction probabilities, # for all *k* cross-validation folds. p_ <- NULL # Initialise the set of all "true" class values, # for all *k* cross-validation folds. t_ <- NULL # For each of *k* folds: for (i in 1:k) { # Define the training set. tr <- d[i != id, ] # Define the test set. te <- d[i == id, ] # Fit a logistic regression model for the specified predictors. m <- glm(f, tr, family = binomial) # Apply the model to the test set. 
p <- predict(m, te, "response") # Accumulate the prediction probabilities. p_ <- c(p_, p) # Accumulate the "true" class values. t_ <- c(t_, as.vector(te$qu)) } # Compute the ROC curve for the set of all prediction probabilities. r <- roc(t_, p_) # Compute AUC. a <- pROC::auc(r) # Compute AUC standard error. se <- sqrt(var(r)) return(list("AUC" = a, "SE" = se, "ROC" = r)) } C.2.3 — Function: Stepwise Variable Selection by Cross-Validation¶ Define a function to perform forward stepwise variable selection using cross-validated AUC. This function uses cross-validated AUC for variable selection at each step, in accordance with the following procedure: • Until all predictors are included in the current model: ▪ For each predictor not yet included in the current model: ◦ Calculate cross-validated AUC for the current model plus the given predictor. ▪ Add the predictor corresponding to the largest increase in AUC (versus the current model) to the current model.
This method provides for both variable selection and best-single-model selection: the step corresponding to the highest AUC score represents the best subset of predictors (within the limitations of a stepwise procedure) and, equivalently, the single best model (as cross-validated AUC is used as the stepwise metric, all possible models for each step are tested within the stepwise procedure itself). Note that this is similar to the method suggested by Charpentier (2015).
This is in contrast to the method suggested by James et al (2015, p 250), which involves best-subset selection within each of k cross-validation folds, calculating prediction error for each subset within each fold, and taking the average error for models of each size over the k folds. That method produces models with different subsets of predictors for each fold. As such, although it might indicate the best average number of predictors, it does not identify a particular subset of predictors.

In [10]:
# Perform forward stepwise selection using cross-validated AUC,
# for a logistic regression model, for the data *d*.
# Parameters:
# d: data (data frame)
# Returns:
# best (data frame):
# - step number
# - predictor name
# - AUC
# - AUC standard error
step_xv <- function(d) { # The set of predictors. x <- colnames(d[1:(ncol(d) - 1)]) # The response. y <- colnames(d[ncol(d)]) # The number of predictors. n <- length(x) # Initialise the return value (data frame). best <- data.frame("Step" = 1:n, "Var" = "", "AUC" = 0, "SE" = 0, stringsAsFactors = FALSE) # For each of *n* predictors: for (i in 1:n) { # Start a new plot. plot.new() plot.window(xlim = c(1, 0), ylim = c(0, 1)) grid() box() abline(1, -1, col = "blue", lty = 2) l <- seq(0,1,0.2) axis(1, at = rev(l), labels = format(l, 2)) axis(2) title(main = "ROC Curves", xlab = "FPR", ylab = "TPR") # The number of predictors not yet in the model. m <- length(x) # For each of *m* predictors not yet in the model: for (j in 1:m) { # Define the formula for the current model plus the predictor. f <- paste(y, paste(c(best$Var[1:i], x[j]), collapse = " + "), sep = " ~ ") f <- formula(f) # Cross-validate AUC for the candidate model. xv_j <- xv(d, f) # Plot the ROC curve. plot(xv_j$ROC, col = "grey", lwd = 1, add = TRUE) # If the AUC is larger than the current largest AUC, # update the best candidate model. if (xv_j$AUC > best$AUC[i]) {
best$AUC[i] <- xv_j$AUC best$SE[i] <- xv_j$SE best_r <- xv_j$ROC best_j <- j } } # Add the best predictor to the current model. best$Var[i] <- x[best_j] # Remove the best predictor, # from the set of predictors not yet in the model. x <- x[-best_j] # Plot the ROC curve for the new model. plot(best_r, col = "red", lwd = 1, add = TRUE) text(0, 0, paste("AUC: ", round(best$AUC[i], 6), sep = ""), pos = 2, col = "red") mtext(paste("Step ", i, ": Add *", best$Var[i], "*", sep = "")) } return(best) } C.3 — Perform Stepwise Variable Selection by Cross-Validation¶ Perform forward stepwise selection by cross-validated AUC. AUC represents the area under the ROC curve which, in turn, is a function of the true-positive rate (a.k.a. sensitivity) and false-positive rate (a.k.a. 1 - specificity or, equivalently, 1 - true-negative rate) for the predictions of a model. In other words, AUC represents the predictive performance a model, "balancing" the accuracy of the model in terms of correctly identifying positive obeservations (in this case, observations classified as "good"), and the accuracy of the model in terms of correctly identifying negative observations (in this case, observations classified as "bad"). An AUC value close to 0.5 (corresponding to an ROC curve close to the diagonal) represents random predictions. An AUC value close to 1 indicates close-to-perfect accuracy in terms of identifying both positive and negative observations. (Fawcett, 2005, pp 862-864, 868-869; James et al, 2015, pp 145-149). A plot for each step in the stepwise selection procedure, below, shows: • in red, the ROC curve for the predictor corresponding to the largest increase in AUC (versus the current model); • in grey, the ROC curves for the other predictors considered in that step. Also show below is a table listing the predictor added at each step, and the corresponding AUC and standard error, as well as a plot showing the change in AUC for each step (with the standard error). As noted above, in each case, the ROC curves (and corresponding AUC values) represent the accumulated predictions over all k cross-validation folds, and the so-called DeLong method has been used to estimate the AUC standard error. It is evident that: • the predictor alcohol (al) corresponds to the largest increase in AUC (versus the null model); and • the predictor volatile acidity (vo) corresponds to the largest increase in AUC (versus the model including only al). However, from the third step onwards, the increase in AUC for each model (versus the previous model) is marginal. This is self-evident from the ROC curves for each step: from the third step onwards, the differences between candidate models are imperceptible. It is also foreshadowed in the ROC curves for the first step, from which it is evident that many of the predictors, on their own, produce close-to-marginal ROC curves (that is, the ROC curves are close to the diagonal). The AUC value increases from step one to step nine (that is, a nine-predictor model), after which it decreases (although the differences are small). In other words, a nine-predictor model produces the highest AUC value. However, it is clear from the results of the stepwise procedure that, even for a nine-predictor model, most of the predictors contribute very little to the AUC value. 
Following the so-called one-standard-error rule (James et al, 2015, p 214), the model with the fewest number of predictors, but with an AUC value within one standard error of the AUC value of the model with the highest AUC value, is a four-predictor model including, in particular, the predictors alcohol (al), volatile acidity (vo), residual sugar (re), and fixed acidity (fi). Note that, as mentioned above, as cross-validated AUC has been used as the stepwise metric, the stepwise procedure has already selected the single best model (within the limitations of a stepwise procedure), as each possible model for each step has already been considered (in terms of cross-validated AUC) within the stepwise procedure itself. Note also that, strictly speaking, these results are dependent on (among other things), the particular cross-validation folds. However, although not shown here, for the given training set, the results appear to be similar if not the same for different cross-validation folds (and under repeated cross-validation, which is simply too slow when implemented with the present functions to be demonstrated here). (The random seed for the cross-validation folds can be set as an argument (e.g. "s = 1") of the call to the xv function within the step_xv function.) In [11]: # Define a two-column (with equally-sized rows) plotting area. par(mfrow = c(2, 2)) # Perform stepwise selection by cross-validated AUC, # for a logistic regression model, for the data *d*. st <- step_xv(d)    In [12]: # Return the results of stepwise selection. st Step Var AUC SE 1 1 al 0.754460690824327 0.00857372937846022 2 2 vo 0.795524558626163 0.00805564612229573 3 3 re 0.80181951144518 0.00798859400081263 4 4 fi 0.805252442150838 0.00789712471948007 5 5 su 0.808067511810827 0.00782127353974585 6 6 fr 0.809750736488704 0.00778937940378813 7 7 to 0.809896579950056 0.00778716388377009 8 8 ph 0.810053642139204 0.0077938153241844 9 9 de 0.810975239852245 0.00776713919796375 10 10 ci 0.810863052574282 0.00776850119219252 11 11 ch 0.810605852851842 0.00777120447031768 In [13]: # Return the model (step) with the highest AUC value. st[st$AUC == max(st$AUC), ] Step Var AUC SE 9 9 de 0.810975239852245 0.00776713919796375 In [14]: # Return the models (steps) within one standard error of the AUC value, # of the model with the the highest AUC value. st[st$AUC >= max(st$AUC) – st$SE[st$AUC == max(st$AUC)], ]

   Step Var AUC               SE
4  4    fi  0.805252442150838 0.00789712471948007
5  5    su  0.808067511810827 0.00782127353974585
6  6    fr  0.809750736488704 0.00778937940378813
7  7    to  0.809896579950056 0.00778716388377009
8  8    ph  0.810053642139204 0.0077938153241844
9  9    de  0.810975239852245 0.00776713919796375
10 10   ci  0.810863052574282 0.00776850119219252
11 11   ch  0.810605852851842 0.00777120447031768
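As a cross-check on the one-standard-error rule discussed above, the following is a minimal sketch (not part of the original procedure) of programmatically selecting the smallest step whose cross-validated AUC lies within one standard error of the maximum AUC, assuming the st data frame returned by step_xv; the object name one_se_step is illustrative only.
# Smallest step (model size) whose AUC is within one SE of the best AUC.
best_auc <- max(st$AUC)
se_best <- st$SE[which.max(st$AUC)]
one_se_step <- min(st$Step[st$AUC >= best_auc - se_best])
one_se_step
# Given the values tabulated above, this returns 4 (the four-predictor model).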
In [15]:
# Calculate the minimum and maximum y (AUC) values.
min_y <- min(st$AUC) - max(st$SE)
max_y <- max(st$AUC) + max(st$SE)
# Plot the AUC values for each step.
plot(st$AUC, xlab = "Variable", ylab = "AUC", main = "AUC by Variable",
     xaxt = "n", ylim = c(min_y, max_y))
grid()
mtext("In Order of Addition to the Model")
axis(1, at = 1:11, labels = st$Var)
lines(st$AUC, col = "grey", lty = 2)
# Show bars (modified arrows) for the AUC standard error for each step.
# Source: .
arrows(st$Step, st$AUC - st$SE, st$Step, st$AUC + st$SE, code = 3, angle = 90,
       length = 0.05, col = "red")

C.4 — Fit the Stepwise Model¶

Fit a logistic regression model on the whole training set, using the variables identified by forward stepwise selection by cross-validated AUC (that is, the single best model identified by the stepwise procedure), in particular:
• alcohol (al);
• volatile acidity (vo);
• residual sugar (re); and
• fixed acidity (fi).
In [16]:
# Define the vector of predictors for the stepwise model.
x_st <- st$Var[1:4] # Define the formula for the stepwise model. f_st <- paste("qu", paste(x_st, collapse = " + "), sep = " ~ ") # Fit the stepwise model using the whole training set. m_st <- glm(f_st, d, family = binomial) C.5 — Interpretation of the Stepwise Model¶ Relevant statistics for the stepwise model are shown below. The stepwise model is statistically significant: the proportion of deviance explained by the model (being the difference between the null deviance and the residual deviance) corresponds to a p value of effectively zero. There is a statistically-significant relationship between at least one of the predictors and the response pursuant to the model (see IDRE (2); R Documentation: anova.glm). All four of the predictors are statistically significant: for each of the four predictors, the relatively low magnitude of the standard error relative to the relatively large magnitude of the coefficient corresonds to a large z statistic (and an equivalently low p value, i.e., the probability of observing a coefficient of the given magnitude were the null hypothesis true). In other words, for each of the four predictors, the null hypothesis (that the "true" coefficient for the predictor is zero) is rejected (see James et al, 2015, p 134). Further, the relatively higher z statistic and relatively lower p values (and relatively larger decrease in deviance) for alcohol (al) and volatile acidity (vo), relative to residual sugar (re) and fixed acidity (fi), is consistent with the relative increase in AUC for each of these varaibles in the stepwise selection procedure. The coefficients represent the change in the log odds of the response for a one-unit change in the predictors, such that exponentiating the coefficients gives the change in the odds of the response for a one-unit change in the predictors (see IDRE (1); IDRE (2); see also James et al, 2015, pp 132-137). In particular, for the stepwise model, the odds of a classification of "good" are: • increased by a factor of approximately 3 ($e^{1.0887}$) for a one-unit increase in alcohol (al); • decreased by a factor of approximately 1,000 ($e^{-6.9741}$) for a one-unit increase in volatile acidity (vo) (but note the small magnitude and limited range of values of volatile acidity (vo)); • increased by approximately 6% ($e^{0.0625}$) for a one-unit increase in residual sugar (re); and • decreased by approximately 20% ($e^{-0.2460}$) for a one-unit increase in fixed acidity (fi). In each case, these changes have to be interpreted in the context of the range and magnitude of the values for the relevant predictor. Additionally, the ANOVA table for the model shows that the "order" of the predictors (as determined by forward stepwise selection) corresponds to the "order" of the predictors in terms of reduction in deviance. In other words, successive predictors account for (exponentially) decreasing reduction in deviance. This can be compared to the result of conventional forward stepwise selection (using the whole training set) by, for example, BIC, which produces a six-predictor model with the same first four predictors in the same order. In [17]: # Return the *p* value for the model. 1 - pchisq(m_st$null.deviance - m_st$deviance, m_st$df.null - m_st$df.residual) 0 In [18]: # Return the summary statistics for the model. summary(m_st) Call: glm(formula = f_st, family = binomial, data = d) Deviance Residuals: Min 1Q Median 3Q Max -3.1762 -0.8953 0.4309 0.7935 2.3890 Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.325750 0.615586 -11.900 < 2e-16 *** al 1.088730 0.049285 22.091 < 2e-16 *** vo -6.974085 0.477879 -14.594 < 2e-16 *** re 0.062544 0.009288 6.734 1.65e-11 *** fi -0.246003 0.049640 -4.956 7.20e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 4194.0 on 3282 degrees of freedom Residual deviance: 3267.6 on 3278 degrees of freedom AIC: 3277.6 Number of Fisher Scoring iterations: 5 In [19]: # Return the ANOVA table for the model. format(round(anova(m_st, test = "Chisq"), digits = 4), 4) Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA       NA 3282 4193.984 NA
al    1 631.5361 3281 3562.447  0
vo    1 226.3404 3280 3336.107  0
re    1  43.5915 3279 3292.515  0
fi    1  24.8840 3278 3267.631  0
In [20]:
# Return the coefficients and exponentiated coefficients for the model.
co_step <- round(summary(m_st)$coefficients[ , 1][-1], 4) co_step round(exp(co_step), 6) al 1.0887 vo -6.9741 re 0.0625 fi -0.246 al 2.97041 vo 0.000936 re 1.064494 fi 0.781922 In [21]: # Define a minimal logistic regression model (intercept only). m_min <- glm(qu ~ 1, d, family = binomial) # Define a maximal logistic regression model (all predictors). f_max <- formula(glm(qu ~ ., d, family = binomial)) # Perform forward stepwise variable selection by BIC. m_st_BIC <- step(m_min, scope = f_max, direction = "forward", k = log(nrow(d)), trace = 0) # Return the ANOVA table for the model. format(round(anova(m_st_BIC, test = "Chisq"), 4), 4) Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA       NA 3282 4193.984    NA
al    1 631.5361 3281 3562.447 0.000
vo    1 226.3404 3280 3336.107 0.000
re    1  43.5915 3279 3292.515 0.000
fi    1  24.8840 3278 3267.631 0.000
su    1  18.7294 3277 3248.902 0.000
fr    1  10.7705 3276 3238.132 0.001

C.6 — Performance of the Stepwise Model¶

Apply the stepwise model to the test set. A confusion matrix and ROC curve for the predictions of the stepwise model, in relation to the test set, are shown below.
For the stepwise model, in relation to the test set:
• predictive accuracy is approximately 0.7 (equivalently, prediction error is approximately 0.3); and
• AUC is approximately 0.75, and the ROC curve is relatively “smooth” and deviates meaningfully from the diagonal, both indicating meaningful predictive performance.
More specifically:
• sensitivity (a.k.a. recall, true-positive rate), being the proportion of positive values (i.e. “good”) in the test set correctly identified by the model, is approximately 0.85; and
• specificity (a.k.a. true-negative rate), being the proportion of negative values (i.e. “bad”) in the test set correctly identified by the model, is approximately 0.43 (less than 50%).
(See James et al, 2015, pp 145-149; Fawcett, 2005, pp 862).
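The accuracy, sensitivity, and specificity figures above follow directly from the confusion-matrix counts reported in the In [23] output below; the following is a minimal arithmetic sketch (the count variables are illustrative, taken from that output).
# Confusion-matrix counts for the stepwise model on the test set (from In [23]).
TP <- 371; FN <- 64; TN <- 104; FP <- 139
accuracy <- (TP + TN) / (TP + TN + FP + FN) # approx. 0.70
sensitivity <- TP / (TP + FN) # approx. 0.85 (recall, true-positive rate)
specificity <- TN / (TN + FP) # approx. 0.43 (true-negative rate)
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 2)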
In [22]:
# Apply the stepwise model to the test set.
p_st <- predict(m_st, d_test, "response")
In [23]:
# Return the confusion matrix for the predictions of the stepwise model.
mat <- table(ifelse(p_st > 0.5, "good", "bad"), d_test$qu)
mat <- data.frame(mat) colnames(mat) <- c("Pred", "True", "Count") rownames(mat) <- c("TN", "FP", "FN", "TP") mat["Proportion"] <- round(mat["Count"] / sum(mat["Count"]), 2) mat Pred True Count Proportion TN bad bad 104 0.15 FP good bad 139 0.21 FN bad good 64 0.09 TP good good 371 0.55 In [24]: # The ROC curve for the predictions of the four-variable model, # in relation to the test set. r_st <- roc(as.vector(d_test$qu), p_st) # Plot the ROC curve for the model in relation to the test set. plot.new() plot.window(xlim = c(1, 0), ylim = c(0, 1)) grid() box() abline(1, -1, col = "blue", lty = 2) l <- seq(0,1,0.2) axis(1, at = rev(l), labels = format(l, 2)) axis(2) title(main = "ROC Curve: Test Set Predictions", xlab = "FPR", ylab = "TPR") lines(r_st, lwd = 1) text(0, 0, paste("AUC: ", round(pROC::auc(r_st), 6), sep = ""), pos = 2) mtext("Stepwise Model")  D — $L_2$ and $L_1$ Regularisation¶ D.1 — Load Package glmnet¶ Load the package glmnet, which provides for: • $L_2$ and $L_1$ regularisation; and • the optimisation of the tuning parameter $\lambda$ by cross-validation. In [25]: # Load the package *glmnet*. library(glmnet) Loading required package: Matrix Loading required package: foreach Loaded glmnet 2.0-5 Attaching package: 'glmnet' The following object is masked from 'package:pROC': auc D.2 — Define Model Matrices and $\lambda$ Grid¶ Define: • model matrices for the training and test sets (for use with the glmnet, cv.glmnet, and predict functions of the package glmnet); and • a "grid" of values for the tuning parameter $\lambda$ (to be optimised by cross-validation, separately for each of the $L_2$ and $L_1$ shrinkage penalties). In [26]: # Define a model matrix for the training set (remove the intercept: redundant). x_tr <- model.matrix(qu ~ ., d)[ , -1] # Define a model matrix for the test set (remove the intercept: redundant). x_te <- model.matrix(qu ~ ., d_test)[ , -1] In [27]: # Define a "grid" of lambda values, being 10^*s*, # where *s* is a sequence of *n* values from *a* to *b*. a <- -5 b <- 5 n <- 100 s <- seq(a, b, (b - a) / n) g <- 10^s D.3 — $L_2$ Regularisation (Ridge Penalty)¶ D.3.1 — Optimise $\lambda$ for the $L_2$ Shrinkage Penalty by Cross-Validation¶ Find the value of the tuning parameter $\lambda$ (between $10^{-5}$ and $10^{5}$) for the $L_2$ shrinkage penalty which optimises (maximises) average AUC, by cross-validation. Plots for average AUC (cross-validated) and coefficient magnitude (not cross-validated) by $\lambda$ value are shown below. It is evident that average AUC is more or less unchanged for low values of $\lambda$. The $L_2$ penalty does not reduce the coefficients for any of the predictors to zero, meaning that the model is still functional even for relatively high values of $\lambda$ (average AUC does not drop below approximately 0.75, even for relatively high values of $\lambda$). The value of $\lambda$ which maximises average AUC is effectively zero (that is, no penalty). Following the so-called one-standard-error rule, the optimal value of $\lambda$, being the largest value of $\lambda$ corresponding to an average AUC within one standard error of the maximum average AUC, is approximately 0.1259. In [28]: # Set the random seed. set.seed(1) # Cross-validate lambda for the L2 shrinkage penalty, # for a logistic regression model, for the training set, # optimising (maximising) the average AUC value. 
m_L2_xv <- cv.glmnet(x_tr, d$qu, alpha = 0, lambda = g, family = "binomial", type.measure = "auc") # Define the largest lambda value corresponding to an average AUC value, # within 1 standard error of the maximum average AUC value. lambda_L2 <- m_L2_xv$lambda.1se In [29]: # Plot cross-validated average AUC value by lambda value (log). plot(m_L2_xv) title(bquote(atop(bold("Av. AUC by Lambda (Log)")), "")) # Plot coefficient magnitude by lambda value (log). m_temp_L2 <- glmnet(x_tr, d$qu, alpha = 0, family = "binomial", lower = -25) plot(m_temp_L2, xvar = "lambda") title(bquote(atop(bold("Coefficients by Lambda (Log)"), "")))   In [30]: # Return the lambda value corresponding to the maximum average AUC value. round(m_L2_xv$lambda.min, 6) # Return the largest lambda value corresponding to an average AUC value, # within 1 standard error of the maximum average AUC value. round(m_L2_xv$lambda.1se, 6) 0.000501 0.125893 In [31]: # Return the coefficients for: # - the lambda value for the maximum AUC # - the within-1-SE lambda value co_L2_xv <- cbind(coef(m_L2_xv, "lambda.min")[ , 1], coef(m_L2_xv, "lambda.1se")[ , 1])[-1, ] co_L2_xv <- co_L2_xv[order(abs(co_L2_xv[ , 1]), decreasing = TRUE), ] colnames(co_L2_xv) <- c("Max. AUC", "1-SE") round(co_L2_xv, 4) Max. AUC 1-SE de -148.0327 -67.4367 vo -6.4043 -3.0537 su 1.9349 0.9039 ph 1.1263 0.5949 al 0.9112 0.4066 ch -0.6587 -5.3794 ci 0.3541 0.3442 re 0.1129 0.0208 fi -0.0801 -0.0845 fr 0.0111 0.0060 to -0.0018 -0.0023 D.3.2 — Fit the $L_2$ Model¶ Fit a logistic regression model using the $L_2$ shrinkage penalty, with the optimised value for the tuning parameter $\lambda$, using the whole training set. In [32]: # Fit a logistic regression model using the L2 shrinkage penalty, # with the optimised lambda value, using the whole training set. m_L2 <- glmnet(x_tr, d$qu, alpha = 0, lambda = lambda_L2, family = "binomial") D.3.3 — Interpretation of the $L_2$ Model¶ The $L_2$ model is statistically significant (there is a statistically-significant relationship between at least one of the predictors and the response, pursuant to the model). The proportion of deviance explained by the model (being the difference between the null deviance and the residual deviance) corresponds to a p value of effectively zero (see IDRE (2); R Documentation: anova.glm). The $L_2$ penalty does not reduce the coefficients for any of the predictors to zero. The coefficients represent the change in the log odds of the response for a one-unit change in the predictors, such that exponentiating the coefficients gives the change in the odds of the response for a one-unit change in the predictors (see IDRE (1); IDRE (2); see also James et al, 2015, pp 132-137). 
In particular, for the $L_2$ model, the odds of a classification of "good" are: • decreased, in essence, to zero ($e^{-67}$) for a one-unit increase in density (de) (but note the limited range of values for de); • decreased by a factor of approximately 200 ($e^{-5.375}$) for a one-unit increase in chlorides (ch) (but note the limited range of values for chlorides (ch)); • decreased by a factor of approximately 20 ($e^{-3.0536}$) for a one-unit increase in volatile acidity (vo); • increased by a factor of approximately 2.5 ($e^{0.904}$) for a one-unit increase in sulphates (su); • increased by a factor of approximately 1.8 ($e^{0.5951}$) for a one-unit increase in ph (ph); • increased by a factor of approximately 1.5 ($e^{0.4067}$) for a one-unit increase in alcohol (al); • increased by a factor of approximately 1.4 ($e^{0.3447}$) for a one-unit increase in citric acid (ci); • decreased by approximately 10% ($e^{-0.0845}$) for a one-unit increase in fixed acidity (fi); • increased by approximately 2% ($e^{0.0208}$) for a one-unit increase in residual sugar (re); • increased by approximately 1% ($e^{0.006}$) for a one-unit increase in free sulfur dioxide (fr); and • decreased by less than 1% ($e^{-0.0023}$) for a one-unit increase in total sulfur dioxide (to). In each case, these changes have to be interpreted in the context of the range and magnitude of the values for the relevant predictor. In [33]: # Return the *p* value for the model. 1 - pchisq(m_L2$dev.ratio * m_L2$nulldev, m_L2$df) 0 In [34]: # Return the coefficients for the L2 model, # in decreasing order of magnitude. co_L2 <- coef(m_L2)[-1, 1] co_L2 <- co_L2[order(abs(co_L2), decreasing = TRUE)] round(co_L2, 4) de -67.4495 ch -5.375 vo -3.0536 su 0.904 ph 0.5951 al 0.4067 ci 0.3447 fi -0.0845 re 0.0208 fr 0.006 to -0.0023 In [35]: # Return the exponentiated coefficients for the L2 model. round(exp(co_L2), 6) de 0 ch 0.004631 vo 0.047187 su 2.469439 ph 1.813154 al 1.501881 ci 1.411503 fi 0.919016 re 1.021039 fr 1.006001 to 0.997721 D.3.4 — Performance of the $L_2$ Model¶ Apply the $L_2$ model to the test set. A confusion matrix and ROC curve for the predicitons of the $L_2$ model in relation to the test set are shown below. For the $L_2$ model, in relation to the test set: • predictive accuracy is approximately 0.71 (equivalently, prediction error is approximately 0.29); and • AUC is approximately 0.76, and the ROC curve is relatively "smooth" and deviates meaningfully from the diagonal, both indicating meaningful predictive performance. More specifically: • the sensitivity (a.k.a. recall, true-positive rate), being the proportion of positive values (i.e. "good") in the test set correctly identified by the model, is approximately 0.93; and • the specificity (a.k.a. true-negative rate), being the proportion of negative values (i.e. "bad") in the test set correctly identified by the model, is approximately 0.30 (less than 50%). (See James et al, 2015, pp 145-149; Fawcett, 2005, pp 862). In [36]: # Apply the L2 model to the test set. p_L2 <- predict(m_L2, newx = x_te, type = "response") # Compute the ROC curve for the L2 model in relation to the test set. r_L2 <- roc(as.vector(d_test$qu), as.vector(p_L2)) In [37]: # Return the confusion matrix for the predictions of the L2 model, # in relation to the test set. mat <- table(ifelse(p_L2 > 0.5, “good”, “bad”), d_test$qu)
mat <- data.frame(mat) colnames(mat) <- c("Pred", "True", "Count") rownames(mat) <- c("TN", "FP", "FN", "TP") # mat <- mat[c("TP", "FP", "TN", "FN"), ] mat["Proportion"] <- round(mat["Count"] / sum(mat["Count"]), 2) mat Pred True Count Proportion TN bad bad 73 0.11 FP good bad 170 0.25 FN bad good 30 0.04 TP good good 405 0.6 In [38]: # Plot the ROC curve for the L2 model in relation to the test set. plot.new() plot.window(xlim = c(1, 0), ylim = c(0, 1)) grid() box() abline(1, -1, col = "blue", lty = 2) l <- seq(0,1,0.2) axis(1, at = rev(l), labels = format(l, 2)) axis(2) title(main = "ROC Curve: Test Set Predictions", xlab = "FPR", ylab = "TPR") # plot(r_L2, lwd = 1, add = TRUE) lines(r_L2, lwd = 1) text(0, 0, paste("AUC: ", round(pROC::auc(r_L2), 6), sep = ""), pos = 2) mtext(expression(L[2] ~ "Model" ~ (lambda ~ "= 0.1259")))  D.4 — $L_1$ Regularisation (Lasso Penalty)¶ D.4.1 — Optimise $\lambda$ for the $L_1$ Shrinkage Penalty by Cross-Validation¶ Find the value of the tuning parameter $\lambda$ (between $10^{-5}$ and $10^5$) for the $L_1$ shrinkage penalty which optimises (maximises) average AUC, by cross-validation. Plots for average AUC (cross-validated) and coefficient magnitude (not cross-validated) by $\lambda$ value are shown below. As for the $L_2$ penalty, average AUC is more or less unchanged for low values of $\lambda$ for the $L_1$ penalty. In contrast to the $L_2$ penalty, large values of $\lambda$ for the $L_1$ penalty reduce each of the coefficients to zero (producing, in effect, a null model), resulting in average AUC values close to 0.5 (representing random predictions). The value of $\lambda$ which maximises average AUC is effectively zero (that is, no penalty). Following the one-standard-error rule, the optimal value of $\lambda$, being the largest value of $\lambda$ corresponding to an average AUC within one standard error of the maximum average AUC, is 0.019953. The optimal value of $\lambda$ reduces the coefficients for the predictors citric acid (ci), chlorides (ch), total sulfur dioxide (to), and density (de) to zero. This is in contrast to the $L_2$ penalty, for which the predictors ch and de (effectively excluded from the model under the $L_1$ penalty) are the two predictors with the largest coefficients (-5.375 and -67.4495). In [39]: # Set the random seed. set.seed(1) # Cross-validate lambda for the L1 shrinkage penalty, # for a logistic regression model, for the training set, # optimising (maximising) the average AUC value. m_L1_xv <- cv.glmnet(x_tr, d$qu, alpha = 1, lambda = g, family = "binomial", type.measure = "auc") # Define the largest lambda value corresponding to an average AUC value, # within 1 standard error of the maximum average AUC value. lambda_L1 <- m_L1_xv$lambda.1se In [40]: # Plot cross-validated average AUC value by lambda value (log). plot(m_L1_xv) title(bquote(atop(bold("Av. AUC by Lambda (Log)"), ""))) mtext("Cross-Validated", cex = 0.75) # Plot coefficient magnitude by lambda value (log). m_temp_L1 <- glmnet(x_tr, d$qu, alpha = 1, family = "binomial", lower = -25) plot(m_temp_L1, xvar = "lambda") title(bquote(atop(bold("Coefficients by Lambda (Log)"), ""))) mtext("Not Cross-Validated", cex = 0.75)   In [41]: # Return the lambda value corresponding to the maximum average AUC value. round(m_L1_xv$lambda.min, 6) # Return the largest lambda value corresponding to the largest average AUC value, # within 1 standard error of the maximum average AUC value. 
round(m_L1_xv$lambda.1se, 6) 5e-05 0.019953 In [42]: # Return the coefficients for: # - the lambda value for the maximum AUC # - the within-1-SE lambda value co_L1_xv <- cbind(coef(m_L1_xv, "lambda.min")[ , 1], coef(m_L1_xv, "lambda.1se")[ , 1])[-1, ] co_L1_xv <- co_L1_xv[order(abs(co_L1_xv[ , 1]), decreasing = TRUE), ] colnames(co_L1_xv) <- c("Max. AUC", "1-SE") round(co_L1_xv, 4) Max. AUC 1-SE de -151.5906 0.0000 vo -6.4459 -4.8959 su 1.9460 0.4577 ph 1.1398 0.0589 al 0.9150 0.8397 ch -0.5384 0.0000 ci 0.3485 0.0000 re 0.1151 0.0144 fi -0.0775 -0.0819 fr 0.0110 0.0032 to -0.0018 0.0000 D.4.2 — Fit the $L_1$ Model¶ Fit a logistic regression model using the $L_1$ shrinkage penalty, with the optimised value for the tuning parameter $\lambda$, using the whole training set. In [43]: # Fit a logistic regression model using the L1 shrinkage penalty, # with the optimised lambda value, using the whole training set. m_L1 <- glmnet(x_tr, d$qu, lambda = lambda_L1, alpha = 1, family = "binomial") D.4.3 — Interpretation of the $L_1$ Model¶ The $L_1$ model is statistically significant (there is a statistically-significant relationship between at least one of the predictors and the response, pursuant to the model). The proportion of deviance explained by the model (being the difference between the null deviance and the residual deviance) corresponds to a p value of effectively zero (see IDRE (2); R Documentation: anova.glm). The $L_1$ penalty reduces the coefficients for the predictors chlorides (ch), citric acid (ci), density (de), and total sulfur dioxide (to) to zero (effectively excluding these predictors from the model). The coefficients for the "remaining" predictors represent the change in the log odds of the response for a one-unit change in the predictors, such that exponentiating the coefficients gives the change in the odds of the response for a one-unit change in the predictors (see IDRE (1); IDRE (2); see also James et al, 2015, pp 132-137). In particular, for the $L_1$ model, the odds of a classification of "good" are: • decreased by a factor of approximately 133 ($e^{-4.896}$) for a one-unit increase in volatile acidity (vo) (but note the range and magnitude of values for volatile acidity (vo)); • increased by a factor of approximately 2.3 ($e^{0.8397}$) for a one-unit increase in alcohol (al); • increased by a factor of approximately 1.6 ($e^{0.4577}$) for a one-unit increase in sulphates (su); • decreased by approximately 8% ($e^{-0.0819}$) for a one-unit increase in fixed acidity (fi); • increased by approximately 6% ($e^{0.0589}$) for a one-unit increase in ph (ph); • increased by approximately 1% ($e^{0.0144}$) for a one-unit increase in residual sugar (re); and • increased by less than 1% ($e^{0.0032}$) for a one-unit increase in free sulfur dioxide (fr). In each case, these changes have to be interpreted in the context of the range and magnitude of the values for the relevant predictor. In [44]: # Return the *p* value for the model. 1 - pchisq(m_L1$dev.ratio * m_L1$nulldev, m_L1$df) 0 In [45]: # Return the coefficients for the L1 model, # in decreasing order of magnitude. co_L1 <- coef(m_L1)[-1, 1] co_L1 <- co_L1[order(abs(co_L1), decreasing = TRUE)] round(co_L1, 4) vo -4.896 al 0.8397 su 0.4577 fi -0.0819 ph 0.0589 re 0.0144 fr 0.0032 ci 0 ch 0 to 0 de 0 In [46]: # Return the exponentiated coefficients for the L1 model. 
round(exp(co_L1), 6)[1:7] vo 0.007476 al 2.315772 su 1.580462 fi 0.921401 ph 1.060624 re 1.014482 fr 1.003162 D.4.4 — Performance of the $L_1$ Model¶ Apply the $L_1$ model to the test set. A confusion matrix and ROC curve for the predicitons of the $L_1$ model in relation to the test set are shown below. For the $L_1$ model, in relation to the test set: • predictive accuracy is approximately 0.71 (equivalently, prediction error is approximately 0.29); and • AUC is approximately 0.75, and the ROC curve is relatively "smooth" and deviates meaningfully from the diagonal, both indicating meaningful predictive performance. More specifically: • sensitivity (a.k.a. recall, true-positive rate), being the proportion of positive values (i.e. "good") in the test set correctly identified by the model, is approximately 0.88; and • specificity (a.k.a. true-negative rate), being the proportion of negative values (i.e. "bad") in the test set correctly identified by the model, is approximately 0.39 (less than 50%). (See James et al, 2015, pp 145-149; Fawcett, 2005, pp 862). In [47]: # Apply the L_1 model to the test set. p_L1 <- predict(m_L1, newx = x_te, type = "response") # Compute the ROC curve for the L_1 model in relation to the test set. r_L1 <- roc(as.vector(d_test$qu), as.vector(p_L1)) In [48]: # Return the confusion matrix for the predictions of the L_1 model, # in relation to the test set. mat <- table(ifelse(p_L1 > 0.5, “good”, “bad”), d_test$qu)
mat <- data.frame(mat) colnames(mat) <- c("Pred", "True", "Count") rownames(mat) <- c("TN", "FP", "FN", "TP") # mat <- mat[c("TP", "FP", "TN", "FN"), ] mat["Proportion"] <- round(mat["Count"] / sum(mat["Count"]), 2) mat Pred True Count Proportion TN bad bad 95 0.14 FP good bad 148 0.22 FN bad good 52 0.08 TP good good 383 0.56 In [49]: # Plot the ROC curve for the L_1 model in relation to the test set. plot.new() plot.window(xlim = c(1, 0), ylim = c(0, 1)) grid() box() abline(1, -1, col = "blue", lty = 2) l <- seq(0,1,0.2) axis(1, at = rev(l), labels = format(l, 2)) axis(2) title(main = "ROC Curve: Test Set Predictions", xlab = "FPR", ylab = "TPR") # plot(r_L2, lwd = 1, add = TRUE) lines(r_L1, lwd = 1) text(0, 0, paste("AUC: ", round(pROC::auc(r_L1), 6), sep = ""), pos = 2) mtext(expression(L[2] ~ "Model" ~ (lambda ~ "= 0.0200")))  E — Discussion¶ E.1 — Predictive Performance¶ The ROC curves for each of the models (in relation to the test set) are shown together below. The predictive performance of each of the models (in relation to the test set) is very similar. Each of the models achieves predictive accuracy of approximately 0.7 (or, equivalently, prediction error of approximately 0.3). Likewise, each of the models achieves AUC of approximately 0.75. Further, the "shape" of the ROC curve for each of the models is almost the same (in each case, evidently deviating from the diagonal, corresponding to AUC values greater than 0.5, and indicating meaningful predictive performance). The predictive performance of the models can be distinguished (to some extent) in terms of sensitivity (true-positive rate) and specificity (true-negative rate). For the stepwise model, sensitivity is approximately 0.85 and specificity is approximately 0.43. For the $L_1$ model, sensitivity is approximately 0.88 and specificity is approximately 0.39 (that is, relative to the stepwise model, sensitivity is higher and specificity is lower). For the $L_2$ model, sensitivity is approximately 0.93 and specificity is approximately 0.3 (that is, relative to the $L_1$ model, sensitivity is higher and specificity is lower). (Presumably, the higher sensitivity of the $L_2$ model corresponds to the very slight "lift" in the ROC curve for false-positive-rate values between approximately 0.5 and 0.75, as compared to the ROC curves for the stepwise and $L_1$ models.) In other words, broadly speaking, each of the models performs well in terms of correctly classifying "good" observations, and each of the models performs poorly in terms of correctly classifying "bad" observations (less than 50% of "bad" observations are correctly classified by each model). The higher sensitivities of the $L_1$ and $L_2$ models compared to the stepwise model are balanced by lower specificities. None of the models clearly outperforms the other models. Empirically, however, in terms of "pure" predictive performance, as measured by AUC, the $L_2$ model is the best model (although the difference is very small). E.2 — Bias vs Variance (Overfitting)¶ Each of the model-selection methods (forward stepwise selection by cross-validation, $L_2$ regularisation, and $L_1$ regularisation), appears to have produced a broadly similar result in terms of constraining overfitting. The stepwise and $L_1$ models, in terms of the number of predictors, the particular predictors, and the magnitude of the coefficients for the predictors, are reasonably similar. The $L_2$ model is inherently different, in that all of the predictors are included in the model. 
Additionally, the magnitude of the coefficients for the predictors are markedly different for the $L_2$ model as compared to either the stepwise or $L_1$ models. Nevertheless, as detailed above, the predictive performance of the models (in relation to the test set) is very similar, and the standard error for the AUC for each of the models (shown with the ROC curves, below) is very similar. It is interesting, however, that each of the model-selection methods significantly overestimated the "true" performance of the model. That is, each of the methods produced a cross-validated AUC (or average AUC, in the case of $L_2$ and $L_1$ regularisation) of approximately 0.8, significantly above the "true" AUC of approximately 0.75 for each model. Note, though, that the test-set AUC for each of the models is close to the cross-validated AUC (from the stepwise selection procedure) for a model including only al, which suggests that al is by far the most "meaningful" predictor (and, equivalently, that all or almost all of the other predictors induce as much noise as additional predictive performance). E.3 — Methodology and Interpretability¶ It is difficult to meaningfully distinguish the models in terms of predictive performance or the control of overfitting. However, there are axiomatic differences between the methods used to produce the models. In particular, forward stepwise selection and $L_1$ regularisation perform variable selection, whereas $L_2$ regularisation does not. As a result, in this instance, the stepwise and $L_1$ models are, arguably, more easily interpreted (and more explanatory of the data and relationships between the response and the predictors), as they include fewer predictors. Further, forward stepwise selection is iterative (involving the comparison of models with different subsets of predictors), whereas both $L_2$ and $L_1$ regularisation work on all of the predictors (James et al, 2015, pp 215, 218-219, 223). The coefficients of the predictors for each of the models are shown together below. As noted above, the changes to the odds (of a classification of "good") for a one-unit change for each of the predictors has to be interpreted in the context of the range and magnitude of the values for the relevant predictor. In this way, the problem of interpretability for the $L_2$ model becomes clear. That is, first, the interpretation of the model requires the interpretation of all of the predictors in the context of the range and magnitude of their values, which, as all of the predictors are included in the model, is a much more complex task for the $L_2$ model (versus the $L_1$ model, let alone versus the stepwise model). Second, the meaning of the coefficients for the model, in practical terms, is obfuscated by large-magnitude coefficients for predictors which are of little interest (that is, for example, density (de) and chlorides (ch)): the same predictors are completely excluded from the other models (and the prima facie "significance" of the predictors in terms of the magnitude of their coefficients is negated by the small magnitude and range of values for the relevant predictors). In this sense, the stepwise and $L_1$ models are greatly preferable. In [50]: # Return the coefficients for each of the stepwise, L1, and L2 models, # in decreasing order of magnitude. 
co_all <- merge(co_step, cbind(co_L1, co_L2), by = "row.names", all = TRUE) rownames(co_all) <- co_all[ , 1] co_all <- co_all[ , -1] colnames(co_all) <- c("Step", "L1", "L2") co_all <- co_all[order(abs(co_all["Step"]), abs(co_all["L1"]), abs(co_all["L2"]), decreasing = TRUE), ] print(round(co_all, 4)) Step L1 L2 vo -6.9741 -4.8960 -67.4495 al 1.0887 0.8397 -5.3750 fi -0.2460 -0.0819 0.9040 re 0.0625 0.0144 0.4067 su NA 0.4577 -3.0536 ph NA 0.0589 0.5951 fr NA 0.0032 0.3447 ci NA 0.0000 -0.0845 ch NA 0.0000 0.0208 to NA 0.0000 0.0060 de NA 0.0000 -0.0023 In [51]: # Plot the ROC curves for each of the stepwise, L2, and L1 models, # in relation to the test set. plot.new() plot.window(xlim = c(1, 0), ylim = c(0, 1)) grid() box() abline(1, -1, col = "blue", lty = 2) l <- seq(0,1,0.2) axis(1, at = rev(l), labels = format(l, 2)) axis(2) title(main = "ROC Curves: Test Set Predictions", xlab = "FPR", ylab = "TPR") lines(r_st, col = "red", lwd = 1) lines(r_L2, col = "darkgreen", lwd = 1) lines(r_L1, col = "blue", lwd = 1) text(0, 0.2, paste("Stepwise AUC: ", round(pROC::auc(r_st), 4), ", SE: ", round(sqrt(var(r_st)), 5), sep = ""), pos = 2, col = "red") text(0, 0.1, paste("L2 AUC: ", round(pROC::auc(r_L2), 4), ", SE: ", round(sqrt(var(r_L2)), 5), sep = ""), pos = 2, col = "darkgreen") text(0, 0, paste("L1 AUC: ", round(pROC::auc(r_L1), 4), ", SE: ", round(sqrt(var(r_L1)), 5), sep = ""), pos = 2, col = "blue") mtext(c("Stepwise Model", expression(L[2] ~ "Model"), expression(L[1] ~ "Model")), adj = c(0.20,0.50,0.80), col = c("red", "darkgreen", "blue"))  Conclusion¶ For the given data, stepwise selection by cross-validated AUC, $L_2$ regularisation, and $L_1$ regularisation, produce models with very similar predictive performance (measured by AUC in relation to the test set). Strictly speaking, the predictive performance of the $L_2$ model is highest, and in that sense it is the "best" model out of the three. However, its higher sensitivity relative to the two other models is balanced by a lower specificity (which is true also of the $L_1$ model relative to the stepwise model) and, in any case, the difference in predictive performance is very slight. Additionally, for practical purposes, the interpretability of the stepwise and $L_1$ models is significantly more straightforward. In a sense, the $L_1$ model combines the advantages of the other models: the $L_1$ regularisation method performs variable selection, as for the stepwise model, increasing interpretability and, additionally, the $L_1$ regularisation method works on the full set of predictors, as for the $L_2$ regularisation method, rather than involving the iterative comparison of different subsets of potential predictors. For these reasons, in practice, the $L_1$ model is preferrable, although solely in terms of predictive performance, the $L_2$ model is the "best" model. References¶ • Charpentier, A. (2015). Variable Selection Using Cross-Validation (and Other Techniques).
Retrieved from: http://freakonometrics.hypotheses.org/19925
• Fawcett, T. (2005). An Introduction to ROC Analysis. Pattern Recognition Letters, 27, 861-874. Retrieved from: https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf
• Friedman, J., Hastie, T., Simon, N., Tibshirani, R. (2016). R Package: glmnet. Documentation: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
• Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Retrieved from: http://statweb.stanford.edu/~tibs/ElemStatLearn/
• James, G., Witten, D., Hastie, T., Tibshirani, R. (2015). An Introduction to Statistical Learning with Applications in R. Retrieved from: http://www-bcf.usc.edu/~gareth/ISL/
• Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J., Müller, M. (2016). R Package: pROC. Documentation: https://cran.r-project.org/web/packages/pROC/pROC.pdf
• UCLA Institute for Digital Research and Education. (n.d.). FAQ: How Do I Interpret Odds Ratios in Logistic Regression? Retrieved from: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm
• UCLA Institute for Digital Research and Education. (n.d.). R Data Analysis Examples: Logit Regression. Retrieved from: http://www.ats.ucla.edu/stat/r/dae/logit.htm