Data mining
Institute of statistics and econometrics
(University of Kiel)
June 1, 2020
Contents
Preliminaries

1 Statistical learning
  1.1 From statistics to statistical learning
  1.2 Supervised learning
  1.3 Unsupervised learning

2 Supervised learning: some background
  2.1 Error quantification
  2.2 Learning for prediction
  2.3 Learning with many features

3 Linear prediction and classification
  3.1 Prediction with linear regression
  3.2 Classification

4 Model assessment and selection
  4.1 Subset selection in linear regression
  4.2 Comparing accuracy
  4.3 Validation and cross-validation

5 Regression with large p
  5.1 Improving linear regression
  5.2 Shrinkage methods
  5.3 Dimension reduction methods

6 Nonlinear methods
  6.1 Polynomial regression
  6.2 Step functions
  6.3 Splines
  6.4 Local regression
  6.5 Generalized additive models
  6.6 Artificial neural networks
  6.7 Tree-based methods

7 Ensemble methods
  7.1 Bagging and the bootstrap
  7.2 Boosting
  7.3 Model averaging

8 Unsupervised learning
  8.1 Leaving the supervised framework
  8.2 Principal Components Analysis
  8.3 Cluster analysis
Preliminaries
By data mining we understand the process of extracting information from (large amounts of) data. Let's add a small warning: data mining used to be taken (and still is, in certain disciplines) in a negative sense, describing misuses of statistics where one looks for as many findings as possible until one finds something. (Think of tossing coins until you get heads 10 times in a row.)
One purpose of this course is to make sure you know how to avoid this, while of course still looking for patterns, trends and dependencies in data coming from various sources. Specifically, we cover statistical and statistical learning methods designed for analyzing large and complex data sets and relations. For illustrations, we use the standard statistical programming and data analysis language R, which is free and open-source; it is also quite popular outside academia. (Python is some competition, especially for machine learning tasks, but it is less developed on the statistical side.)
Before we get there, let's give some structure to data mining processes. We build on the so-called CRISP-DM standard. The Cross Industry Standard Process for Data Mining is intended to ease (and ease checking) the implementation of data mining projects in practice. It specifies six steps: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. In practice they're all equally important, even if this course will have a statistical focus.
In Business understanding, one sets concrete goals and requirements for the data mining process. What problem does the management want to tackle? Are there constraints? What impact would possible solutions have? Assess the situation (resources), project costs and benefits, and define the goals of the data mining project. Outcome: concrete tasks and a (rough) road map. By the way, although statisticians are (usually) able to focus on the numbers, knowledge of where the data come from is of great help.
In Data understanding, follow-up questions pop up: What data are available? Are they informative enough to allow completing the tasks decided above? What concrete data problems are there? Can these be overcome? (There may be some iterations between business and data understanding if required data is not available.)
In Data preparation, one generates the data set(s) to be used in the modelling stage: clean data (remove inaccuracies, outliers, mistakes), merge data sets from different sources, and transform variables if necessary. Make sure the data are tidy. Tidy data are nicely structured:

• Each variable must have its own column (or row);
• Each observation must have its own row (or column, respectively);
• Each value must have its own cell.

These may be intuitive requirements, but recall that data may have many sources, and tidiness needs to be imposed when setting up the "final" data set. Data preparation is of huge importance in practice (even if not very rewarding intellectually).
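As a small illustration of the tidy-data requirements, here is a minimal R sketch using the tidyr package; the data frame and its column names are invented for the example.

```r
# Untidy data: one column per year, so the variable "year" is spread
# across column headers. pivot_longer() gives each observation its own row.
library(tidyr)

untidy <- data.frame(
  country = c("A", "B"),
  "2019"  = c(100, 150),
  "2020"  = c(110, 160),
  check.names = FALSE
)

tidy <- pivot_longer(untidy, cols = c("2019", "2020"),
                     names_to = "year", values_to = "value")
tidy   # one row per (country, year) combination, one value per cell
```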
In Modelling, one uses suitable methods (from classical statistics, statistical learning, machine learning, etc.) to obtain information (be exploratory in a sense), or to set up predictions based on the available data set.
In Evaluation, one must, after setting up candidate models and predictions, move on to compare model performance. (For that, one needs error estimates, by the way, which is something well-studied in statistics.) It is also the time to match modelling outcomes with the goals of the data mining project and assess the practical utility of the modelling results. (There may be some iterations between modelling and evaluation if the data mining goals are not achieved.)
In Deployment, set up a project summary and report to management (or whoever initiated the project). One must ensure that the data analysis in particular is readable, reproducible, and maintainable. (This is, for instance, where a programming language like R has decisive advantages over Excel; you would be surprised how much data and how many data analyses in enterprises are available only as Excel sheets, sometimes enhanced with VBA scripts. SQL knowledge may help if you have to work with databases.)
Our focus is on the modelling and (partly) the evaluation stage, with emphasis on statistics and statistical learning.
1 Statistical learning
1.1 From statistics to statistical learning
Classical statistics has a well-defined flow of operations (going all the way back to Fisher):

1. Study the problem;
2. Set up a statistical model;
3. Fit the model, i.e. estimate (a few) parameters;
4. Check assumptions; if the model is not "good" enough, return to step 2;
5. Use the fitted model for explanation and prediction (causality is also an important aspect).
Statistical learning (supervised or not) has a focus on more complex models and data sets, but inherits statistical thinking and tools. Importantly, we still assume that the data $Y_i \in \mathbb{R}$ and $X_i \in \mathbb{R}^p$ are a (random) sample from some population. Choosing the appropriate degree of model complexity is an essential task; nonparametric statistics come closest to statistical learning from this point of view. (More about this later on.) Finally, machine learning focuses entirely on results (and often has the luxury of resorting to huge, even generated, data sets).
Strictly speaking, data most likely do not meet the assumptions of parametric models anyway; but if you have a lot of information, you may relax the standard parametric models, which typically turn out to be too simple/restrictive when sticking to the classical workflow. So the workflow changes a bit for "modern" data analysis:
1. Look at data (preliminary analyses, visualization);
2. Propose (flexible) model;
3. Select so-called tunable parameters and fit model;
4. Assess model.
5. Make (high quality) predictions for new data (but interest in causal statements is growing).
Statistics                      Learning                 Meaning
Population/Statistical model    no explicit use          Sets of assumptions about the joint
                                                         distribution of Y and X
Estimation                      Learning                 Using data to make an informed guess
                                                         about an unknown quantity
Prediction/Classification       Supervised learning      Predicting (discrete) outcomes for Y
                                                         when only X is available
–                               Predictor/Classifier     A mapping from X to outcomes of Y
Covariates/Predictors           Features/Inputs          The Xs
Variable of interest            Target/Response          The Y s
Clustering/Latent components    Unsupervised learning    Putting data into groups (using statistical
                                                         models or not), no specific target variable

Table 1.1: Dictionary Statistics–Statistical Learning
1.2 Supervised learning
Quite often, one is forced to make an educated guess as to what the outcome of some variable Y might be, given a data point (Y, X) for which Y is either unobserved or withheld. In supervised learning, we are focusing on prediction and classification (where classification is nothing else than predicting discrete-valued target variables). If Y and X are not statistically independent, the dependence may be exploited to set up predictions or classifications. In practical terms, what we need is a predictor function, say f(x), such that
$$\hat Y = f(X)$$

is a prediction (i.e. an educated guess) of the outcome of Y. For classification, f takes discrete values (the predicted labels), and we call it a classifier; at times, we may denote it by C(x).

Even conditionally on X = x, Y is random (for both prediction and classification), while $\hat Y$ is just a number, so prediction/classification mistakes are unavoidable. The point is to make as few prediction/classification errors as possible. (This implies certain properties of f, which we discuss separately.)
In some situations, we may specify f from prior knowledge. In many other situations, we
must learn about f from available data. In supervised learning, values (or labels) of Yi
are paired with Xi, and we have n such training cases which allow us to learn a suitable
f. If e.g. f is a regression function, this amounts to estimating a regression model from
the available sample. This approach only makes sense if the data point to be predicted or classified comes from the same population as the data used to train $\hat f$!
We know from Advanced Statistics II that estimates are never 100% precise in finite samples. Working with large and complex data sets and models does not improve on this.
To see why, let's look at the instructive case of polynomial regression. Polynomials can approximate unknown smooth functions arbitrarily well (the Weierstrass approximation theorem), so it may be tempting to include many powers to have a flexible model. Flexibility of the fitted model is "good": there's no need to search for suitable functional forms. But there is such a thing as too much of it: John von Neumann is frequently cited as saying "with four parameters I can fit an elephant, with five I can make him wiggle his trunk". So we still need to carefully choose the models we work with. (Low-degree) polynomial regression is still a useful tool, since it is relatively interpretable. For instance, model averaging (combining predictions from different models) usually outperforms a linear prediction, but it does not immediately clarify what role a particular predictor plays. To be fair, some are happy with good predictive performance alone, but, in general, there is a trade-off between performance and interpretability, as we'll see during this course.
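To see the trade-off in numbers, here is a small R simulation (a sketch; the data-generating process is invented for illustration): polynomials of increasing degree are fitted to the same training sample, and the mean squared error is evaluated both in-sample and on fresh data from the same population.

```r
set.seed(1)
n <- 50
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)          # smooth truth plus noise
x_new <- runif(1000, -2, 2)               # fresh data, same population
y_new <- sin(x_new) + rnorm(1000, sd = 0.3)

for (d in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, d))
  mse_train <- mean(residuals(fit)^2)
  mse_new   <- mean((y_new - predict(fit, data.frame(x = x_new)))^2)
  cat(sprintf("degree %2d: in-sample MSE %.3f, new-data MSE %.3f\n",
              d, mse_train, mse_new))
}
# the in-sample MSE always falls with d; the new-data MSE eventually rises
```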
1.3 Unsupervised learning
In unsupervised learning, one does not focus on predicting or classifying conditional on feature values.
The statistical view: Set up a statistical model with (few) latent variables; this allows us to understand how the data depend on one another, and also allows us to represent the data in lower dimensionality. One could say that supervised learning focuses on the conditional distribution of Y given X, while unsupervised learning deals with the unconditional distribution of X alone.
The learning view: one usually focuses directly (and less model-based) on the two main tasks, i.e. find clusters of “similar” observations and reduce data dimensionality. We focus on clustering for this course, but we’ll get to know some dimensionality reduction techniques as well.
Clustering is nothing else than a problem of missing labels. Let $Y_i \in \{0, 1\}$, $X_i \in \mathbb{R}^p$, but the Ys are latent (unobserved). Pooling observations leads to a mixture distribution for $X_i$. Contrary to classification, you do not observe $Y_i$, so no supervised training is available. The task is to group observations with similar values of $X_i$. (As a side effect, you get estimates of the latent $Y_i$ for each observation.) Statistical approaches may make use of the mixture distribution, specifying the model more concretely (e.g. a normal distribution of $X_i$ given $Y_i$) and estimating parameters. Learning approaches are a bit more agnostic; they may use the same algorithms, but don't necessarily interpret them in a statistical key.
2 Supervised learning: some background

2.1 Error quantification
Conditionally on X, Y has a distribution and its outcomes (continuous or discrete) are random. At the same time, $\hat Y$ is a deterministic function of X, and therefore prediction errors (the term forecasting, used with time series, is in principle equivalent) as well as classification errors will appear by construction. To minimize the (impact of the) errors, we need to quantify them.
Let us discuss prediction first. For continuous variables, the probability of Y = Yˆ is zero, so a good prediction lies as close as possible to Y on average. We shall measure “closeness” with the help of so-called loss functions.
Definition 2.1 A loss function L(·) is a quasi-convex function minimized at 0.
In other words, L is increasing for positive arguments and decreasing for negative arguments.
The most common loss function is the squared error loss (MSE). But asymmetric loss functions may also be encountered, say asymmetric linear or asymmetric quadratic. Either way, we pick the prediction that minimizes the expected loss:
$$\hat Y = \arg\min_y E_{Y|X}\left(L(Y - y)\right),$$

where the index specifies with respect to which distribution the expectation is taken.
Different losses imply different predictions for the same prediction problem! We therefore need to understand which functional of the joint distribution is the optimal prediction.
Proposition 2.1 For the squared error loss, the optimal (conditional) prediction is the mean of Y given X, $\hat Y = E(Y|X)$.
So for prediction we typically focus on conditional means, i.e. the regression function, and take as predictor function
$$f(x) = E(Y|X = x).$$
Of course, for a new data point X to be "predicted"/classified, we take $\hat Y = f(X)$. (Most authors, textbooks, and blogs deal directly with the conditional mean, often without stating this explicitly; recall that this means MSE-optimal predictions, but not necessarily, say, MAE (mean absolute error) optimal ones.) The implicit assumption is that the data points for which predictions are to be provided do come from the population we're discussing and whose predictor function is f. It makes no sense to use a particular conditional expectation, say, if it does not match the prediction problem at hand.
The Mean Absolute Deviation is sometimes an alternative to the MSE, usually because of its robustness to outliers; but beware of the computational burden.
Proposition 2.2 For the absolute deviation loss, the optimal predictor is the conditional
median (which we assume unique).
Analogously, asymmetric linear losses lead to conditional quantiles. Let

$$L(u) = \begin{cases} -au & u < 0 \\ bu & u \ge 0 \end{cases}$$

for $a, b > 0$. Then,

$$f(x) = F_{Y|X=x}^{-1}\left(\frac{b}{a+b}\right).$$

Of course, $\hat Y = f(X)$ for new data X. Note again that X must come from the population characterized by the quantile function f.
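A quick numerical check of this result in R (a self-contained sketch): under the asymmetric linear loss with, say, a = 1 and b = 3, brute-force minimization of the average loss should recover the b/(a + b) = 0.75 quantile.

```r
set.seed(42)
y <- rnorm(1e5)      # stand-in for draws from the conditional distribution
a <- 1; b <- 3
avg_loss <- function(m) mean(ifelse(y - m < 0, -a * (y - m), b * (y - m)))
opt <- optimize(avg_loss, interval = c(-3, 3))
c(minimizer = opt$minimum, q75 = unname(quantile(y, b / (a + b))))
# both are close to qnorm(0.75), about 0.674
```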
In general, we get a specific optimal prediction for each loss L.
Definition 2.2 (L-Measure of location) The functional

$$\mu_{Y|X}^L = \arg\min_m E_{Y|X}\left(L(Y - m)\right)$$

is the conditional L-measure of location of Y.

The optimal prediction $\hat Y$ given X under L is therefore (by definition) the conditional L-measure of location of Y, and the predictor function is $f(x) = \mu_{Y|X=x}^L$.
Each loss function leads to a different optimal predictor; e.g. $\mu_{Y|X}^L = E(Y|X)$ only for symmetric distributions and symmetric L.
Moving on to classification, let's look at the binary case, with two possible outcomes, say 0 (or No) and 1 (or Yes). (In machine learning, some rather choose −1 and 1. Also, one often speaks of a condition being present, i.e. 1 or Yes, as opposed to not being given, i.e. 0 or No.) Here the distribution of Y is Bernoulli, so what matters is the conditional probability that Y = 1 given X.
The so-called confusion matrix summarizes the decision possibilities, and for each combination we postulate a certain loss. In the binary case, we have in general

            Ŷ = 0     Ŷ = 1
  Y = 0     l_{0,0}   l_{0,1}
  Y = 1     l_{1,0}   l_{1,1}
The most common loss function for binary classification is the 0/1 loss,

            Ŷ = 0     Ŷ = 1
  Y = 0       0         1
  Y = 1       1         0
and we’ll work with this one quite often (it is just as popular for classification as the MSE loss is for prediction).
Proposition 2.3 The optimal binary classifier under 0/1 loss sets $\hat Y = 1$ if $P(Y = 1|X) > P(Y = 0|X)$ and $\hat Y = 0$ if $P(Y = 0|X) > P(Y = 1|X)$.
I.e. the optimal classifier picks the class with the larger conditional probability of occurrence. This extends to several classes, as long as one uses the 0/1 loss.
For binary classification, we may equivalently state
“Set Yˆ = 1 if conditional probability of Y = 1 exceeds 1/2”.
Other loss functions simply imply another threshold for the conditional class probability: we set $\hat Y = 1$ if the conditional probability of Y = 1 exceeds a threshold τ, where τ depends on the losses in the confusion matrix. If, for instance, you fear misclassifying 0s as 1s more than you fear misclassifying 1s as 0s, you would choose a suitable τ > 1/2.
Especially in machine learning, it is customary to work with error rates directly rather than with losses implied by (mis)classifications. (The rates often refer to the sample rather than to population quantities, but the following definitions apply in both cases.) We need some terminology to define them.
                            Y = No             Y = Yes
  Ŷ = No (Predicted No)     True Negatives     False Negatives
  Ŷ = Yes (Predicted Yes)   False Positives    True Positives
  Total                     Actual No          Actual Yes
The following error rates have names following a Yes/No classification situation, and we stick to this convention here:
• Accuracy: (True Positives + True Negatives)/Total (How often is the classifier correct?)
• Misclassification/Error rate: (False Positives + False Negatives)/Total (How often is the classifier wrong?)
• True Positive Rate, Sensitivity or Recall: True Positives/Actual Yes (How often is the classifier correct when truth is yes?)
• True Negative Rate or Specificity: True Negatives/Actual No (How often is the classifier correct when truth is no?)
• False Positive Rate: False Positives/Actual No (How often is the classifier wrong when truth is no?)
• Precision: True Positives/Predicted Yes (How often is the classifier correct when the prediction is yes?)

• Prevalence: Actual Yes/Total (How often does yes occur in the population or the sample?)
(In population terms, these have direct interpretations as probabilities – either marginal or conditional; take some time to see which they are.)
Accuracy is important, but can be misleading if prevalence is either very high or very low. Sensitivity and specificity should both be high, but there's a trade-off. This trade-off is controlled by the threshold (say τ) for classification based on the conditional class probability: a lower τ means more $\hat Y = 1$ (higher sensitivity), but more of them will be false positives (lower specificity); a higher τ means fewer $\hat Y = 1$ and thus fewer false positives (higher specificity), but at the same time lower sensitivity. The "optimal" τ depends on the loss function (recall, τ = 1/2 for the 0/1 loss).
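The following R sketch computes the confusion matrix and the rates above for simulated class probabilities at the threshold τ = 0.5; all numbers are purely illustrative.

```r
set.seed(7)
n    <- 1000
p    <- runif(n)                 # conditional probabilities P(Y = 1 | X)
y    <- rbinom(n, 1, p)          # realized labels
tau  <- 0.5
yhat <- as.integer(p > tau)      # classify Yes if the probability exceeds tau

tab <- table(Predicted = yhat, Actual = y)
TN <- tab["0", "0"]; FP <- tab["1", "0"]
FN <- tab["0", "1"]; TP <- tab["1", "1"]

c(accuracy    = (TP + TN) / n,
  sensitivity = TP / (TP + FN),  # true positive rate
  specificity = TN / (TN + FP),  # true negative rate
  precision   = TP / (TP + FP),
  prevalence  = (TP + FN) / n)
```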
Plotting sensitivity (TPR) against specificity (TNR) for all τ ∈ (0, 1) gives the so-called Receiver Operating Characteristic (ROC) curve. (In fact, for historical reasons, TPR is plotted against FPR = 1 − TNR.)
The ROC is informative about the classification performance of a model (population or fitted) for the conditional class probability. A pure chance classifier (i.e. Yˆ independent of Y ) has TPR = FPR. So the further the ROC is away from the diagonal, the better! The “Area Under the Curve” quantifies this: 0 < AUC < 1, with AUC = 1/2 standing for pure chance.
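Continuing the simulated example above (vectors p and y), here is a base-R sketch that traces the ROC curve by sweeping τ and approximates the AUC by trapezoidal integration:

```r
taus <- seq(0, 1, by = 0.01)
tpr  <- sapply(taus, function(t) mean(p > t & y == 1) / mean(y == 1))
fpr  <- sapply(taus, function(t) mean(p > t & y == 0) / mean(y == 0))

plot(fpr, tpr, type = "l", xlab = "FPR = 1 - TNR", ylab = "TPR")
abline(0, 1, lty = 2)            # pure-chance classifier: TPR = FPR

# trapezoidal AUC; rev() because fpr decreases as tau increases
auc <- sum(diff(rev(fpr)) * (head(rev(tpr), -1) + tail(rev(tpr), -1)) / 2)
auc
```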
2.2 Learning for prediction
Recall classical statistics and say you have a parametric regression model,

$$Y = f(X, \theta) + \varepsilon,$$

where $E(\varepsilon|X) = 0$. Estimate using the squared error loss,

$$\hat\theta = \arg\min_{\theta^*} \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - f(X_i, \theta^*)\right)^2.$$

Proposition 2.4 Regularity conditions assumed, $\hat\theta \overset{p}{\to} \theta$.
Since $E(Y|X) = f(X, \theta)$, plug in $\hat\theta$ to obtain

$$\hat f(x) = f\left(x, \hat\theta\right) \quad\text{and}\quad \hat Y = f\left(X, \hat\theta\right).$$

Consistency of $\hat\theta$ implies that $\hat f(x) \overset{p}{\to} f(x)$ and $\hat Y \overset{p}{\to} E(Y|X)$.
The essential assumption that we make here is that the data and the new data points for which a prediction is required come from the same population. The prediction errors at a (new) data point (Y, X) are

$$Y - \hat Y = \varepsilon - \left(f\left(X, \hat\theta\right) - f(X, \theta)\right) =: \hat\varepsilon.$$

Since $\hat\theta \overset{p}{\to} \theta$, we may linearize f in θ (at the given X) and obtain

$$\hat\varepsilon = \varepsilon - \left(\hat\theta - \theta\right)'\nabla f + O_p\left(\tfrac{1}{n}\right),$$

where ∇f is the gradient of f at the population θ and the given X. We thus have estimation error and irreducible error. In classical statistics, the estimation error can be made arbitrarily small if there are enough data.
It is important to realize that you get what you fit! Say you use a different loss function,

$$\hat\theta = \arg\min_{\theta^*} \frac{1}{n}\sum_{i=1}^{n} L\left(Y_i - f(X_i, \theta^*)\right).$$
This is known as empirical risk minimization (or estimation under the relevant loss).
Proposition 2.5 Regularity conditions assumed, $f\left(x, \hat\theta\right) \overset{p}{\to} \mu_{Y|X=x}^L$.
E.g. fitting under MAE/MAD (mean absolute error/deviation) gives the (fitted) conditional median as predictor function. So you should typically fit predictive models using the same loss function which is used to evaluate the errors; ignoring this may lead to additional (perhaps avoidable) error components.
But let’s return to nonlinear LS. Say now you want to fit the same parametric f (x, θ), but the actual regression function is different, say E(Y|X) = m(X). (After all, your assumptions about the population could be wrong.)
Proposition 2.6 Let $\theta_p = \arg\min_{\theta^*} E_X\left(f(X, \theta^*) - m(X)\right)^2$ (we call $\theta_p$ the pseudo-true value). Regularity conditions assumed, $\hat\theta \overset{p}{\to} \theta_p$.
Regarding prediction errors,

$$Y - \hat Y \approx \varepsilon + \left(m(X) - f(X, \theta_p)\right) - \left(\hat\theta - \theta_p\right)'\nabla f,$$

where ∇f is the gradient of f at X for the pseudo-true $\theta_p$. We thus have model bias as an additional source of error. Note that the bias depends on the marginal distribution of X and on the point where we make the forecast. The same phenomenon also arises when estimating misspecified models under other loss functions; since the effect is not just as easily quantified there, we focus mainly on the MSE.
To avoid model bias, let's resort to some more flexible classes of models. In fact, some parametric families of functions are universal approximators, i.e., as dim θ → ∞, they are able to approximate smooth functions arbitrarily well. Some examples include polynomials, Fourier series, splines, but also regression trees and neural networks.
So choosing a "large" dim θ in such models gives enough flexibility that the model accommodates arbitrary unknown regression functions. The analogous finding holds for classifiers.
However, there is a price to pay. The problem is that

$$\operatorname{Var}(\hat\varepsilon) \approx \operatorname{Var}(\varepsilon) + \nabla f'\operatorname{Cov}\left(\hat\theta\right)\nabla f,$$

where

$$\nabla f'\operatorname{Cov}\left(\hat\theta\right)\nabla f \le \|\nabla f\|^2 \left\|\operatorname{Cov}\left(\hat\theta\right)\right\|.$$

In the worst-case scenario, $\left\|\operatorname{Cov}\left(\hat\theta\right)\right\| = O(\dim\theta)$! In other words, you reduce bias at the cost of increasing the variance. The bias-variance trade-off is making yet another appearance.
One minimizes prediction MSE by balancing bias and estimation error. This is a critical step, since both influence prediction MSE, and can be seen as a model selection issue. In statistical/machine learning there are special names for cases where the one or the other component drives the MSE:
• Bias dominates: underfitting (oversmoothing)
• Variance dominates: overfitting (undersmoothing)
The name overfitting comes from the fact that, when the variance dominates, it can happen in the worst case that $\hat Y_i \approx Y_i$ for the fitted data! Clearly, this leads to bad forecasts (bad generalization, in learning lingo). Choosing the right amount of smoothing is essential to get good predictive performance.
2.3 Learning with many features
Intuitively, the more features you have the better, e.g. because the chances increase that a powerful predictor is among them. You need, however, enough data to be able to incorporate the extra features in your model. Clearly, additional features (and correspondingly parameters) for the same amount of data increase estimation variance. This is not much of a problem in classical parametric setups, but becomes one when you deal with flexible models.
The curse of dimensionality is a generic term used to describe situations where it is very costly to exploit additional features. It originated in dynamic optimization, where the cost of adding one dimension to a grid search increases exponentially with the number of dimensions. Its exact form depends on the model, but the source is the combination of high dimensionality with high flexibility.
A very intuitive interpretation appears in so-called local regression. With f the conditional mean of Y, take as fitted predictor function the local average

$$\hat f(x) = \frac{\sum_{i=1}^{n} Y_i \cdot I(\|X_i - x\| \le \varepsilon)}{\sum_{i=1}^{n} I(\|X_i - x\| \le \varepsilon)}.$$

If $Y_i = f(X_i) + \varepsilon_i$ with f the regression function (assumed differentiable),

$$f(X_i) = f(x) + (X_i - x)'\nabla f + o(\varepsilon),$$

such that

$$\hat f(x) \approx f(x) + \frac{\sum_{i=1}^{n} \varepsilon_i \cdot I(\|X_i - x\| \le \varepsilon)}{\sum_{i=1}^{n} I(\|X_i - x\| \le \varepsilon)} + \text{Bias},$$

where Bias = O(ε) and the variance of the random term is inversely proportional to the average number of data points for which $\|X_i - x\| \le \varepsilon$, i.e. data points close to the x of interest.
Yet, in higher-dimensional spaces, the average distance between the elements of a random sample increases with the number of dimensions. So if p is large, there will be only very few data points within a radius ε of x! You would then need to increase ε to match the estimation variance of a lower-dimensional problem, which in turn increases bias, often a lot. The bottom line is that, for flexible models, the MSE increases by construction with the number of dimensions when keeping the amount of sample information fixed; to outbalance this, you need exponentially more data. (Ok, this is just a rough approximation to get you scared.)
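A tiny R simulation of this effect (a sketch with invented numbers): the share of uniformly drawn sample points inside an ε-ball around the origin collapses as p grows, which is precisely the sample-size problem of the local average above.

```r
set.seed(123)
n   <- 10000
eps <- 0.5
for (p in c(1, 2, 5, 10)) {
  X <- matrix(runif(n * p, -1, 1), nrow = n)   # n points on [-1, 1]^p
  in_ball <- sqrt(rowSums(X^2)) <= eps
  cat(sprintf("p = %2d: %5d of %d points within radius %.1f of the origin\n",
              p, sum(in_ball), n, eps))
}
```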
Parametric models (with low dim θ) don’t have this problem, as they tend to have a global, relatively “rigid” shape. E.g. linear regression learns (more or less) the same about slope coefficients from a data point, no matter where it lies.
Classification models also suffer from the curse of dimensionality, just like unsupervised learning has its dimensionality issues as well.
Imagine a clustering task with n data points and p = n features. Say for simplicity that the n features are binary 0/1, and Xi,j = 1 for i = j only. Unless you pay attention, most algorithms will immediately find n different clusters.
For all these cases, dimensionality reduction may help deal with the curse. And so does model selection.
3 Linear prediction and classification

3.1 Prediction with linear regression
For any data, there are some simple questions we might ask: Is there a relationship between Y and X? How strong is the relationship? Which feature contributes to movements in the target? How accurately can we predict? Is the relationship linear?
Linear regression is a simple approach to supervised learning and prediction. It assumes that the dependence of Y on X1,X2,...,Xp is linear. True regression functions are quite plausibly never linear! But: Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically. And we may build on it to obtain more complex models.
Basics
To answer such questions, let's begin by assuming a simple regression model,

$$Y = \beta_0 + \beta_1 X + \varepsilon,$$

where $\beta_0$ and $\beta_1$ are the unknown intercept and slope, and ε is the error term.
Given some estimates $\hat\beta_0$ and $\hat\beta_1$ for the model coefficients, we predict the target variable using

$$\hat Y = \hat\beta_0 + \hat\beta_1 X.$$

Estimation of the parameters is done by least squares.
The standard error of an estimator reflects how it varies under repeated sampling. Standard errors help us assess the accuracy of the coefficient estimates and can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form $\hat\beta_1 \pm 2\cdot SE(\hat\beta_1)$, where $SE(\hat\beta_1)$ is the standard error of $\hat\beta_1$, i.e. its estimated standard deviation. That is, there is approximately a 95% chance that the interval

$$\left[\hat\beta_1 - 2\cdot SE(\hat\beta_1),\ \hat\beta_1 + 2\cdot SE(\hat\beta_1)\right]$$

will contain the true value of $\beta_1$ (under a scenario where we got repeated samples like the present sample).
Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis

$H_0$: There is no relationship between X and Y

versus the alternative hypothesis

$H_A$: There is some relationship between X and Y.

This corresponds to testing

$$H_0: \beta_1 = 0 \quad\text{versus}\quad H_A: \beta_1 \ne 0,$$

since if $\beta_1 = 0$ then the model reduces to $Y = \beta_0 + \varepsilon$, and X does not predict Y (linearly). To test this null hypothesis, we compute a t-statistic, given by

$$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}.$$
Under textbook assumptions (not to be over-trusted), this will have a t-distribution with n − 2 degrees of freedom, assuming $\beta_1 = 0$, deterministic regressors and Gaussian errors. Using statistical software, it is easy to compute the probability of observing a value equal to |t| or larger; we call this the p-value.
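In R, all of the above is delivered by lm(); a minimal sketch on simulated data (the true coefficients are of course arbitrary):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

fit <- lm(y ~ x)
summary(fit)     # estimates, standard errors, t-statistics, p-values
confint(fit)     # exact t-based confidence intervals for the coefficients
predict(fit, newdata = data.frame(x = 1))   # prediction at x = 1
```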
To assess the overall predictive accuracy of the model, we compute the Residual Standard Error,

$$RSE = \sqrt{\frac{1}{n-2} RSS}, \quad\text{with}\quad RSS = \sum_{i=1}^{n} (Y_i - \hat Y_i)^2.$$

The coefficient of determination $R^2$ (R-squared) or fraction of variance explained is

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS},$$

where $TSS = \sum_{i=1}^{n} (Y_i - \bar Y)^2$ is the total sum of squares. This quantifies overall model fit and is scale-free (unlike the RSE). It can be shown in this simple linear regression setting that $R^2 = r^2$, where r is the correlation between X and Y:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar X)^2}\ \sqrt{\sum_{i=1}^{n} (Y_i - \bar Y)^2}}.$$
To consider more inputs/regressors, our model is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon.$$
We interpret βj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed.
The ideal scenario is when the predictors are uncorrelated — a balanced design. This is typically not available with observational data.
Under a balanced design, each coefficient can be estimated and tested separately. Interpretations such as "a unit change in $X_j$ is associated with a $\beta_j$ change in Y, while all the other variables stay fixed" are straightforward. Correlations among predictors cause problems: the variance of all coefficients tends to increase, sometimes dramatically, and interpretations become hazardous, since when $X_j$ changes, everything else changes. Claims of causality should be avoided for observational data. (See any econometrics course.)
Anyway, given estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$, we can make predictions using the formula

$$\hat Y = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_p X_p.$$

We estimate $\beta_0, \beta_1, \ldots, \beta_p$ as the values that minimize the sum of squared residuals RSS; these are the multiple least squares regression coefficient estimates.
Some important questions: Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response? Do all the predictors help to explain Y, or is only a subset of the predictors useful? How well does the model fit the data? Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
For the first question, we can use the F-statistic

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,\,n-p-1}.$$
Selecting which regressors belong in the model is an important topic, related to selecting the
optimal degree of complexity of the model; we’ll return to this in due time.
Some predictors are not quantitative but are qualitative, taking a discrete set of values. These are also called categorical predictors or factor variables. We use dummy variables to deal with them. In the simplest case where they enter the model additively, the interpretation is essentially that of a different intercept for groups with different dummies.
Adding some flexibility
The linear model assumes that the direct effect of a unit change in each predictor (indirect effects due to co-movement of predictors notwithstanding) is constant. Should this effect depend on the level of another predictor, one speaks of interactions. (A marketing term, synergy, may perhaps be more illuminating.) Interactions of the predictors $X_i$ and $X_j$ are easily introduced in a linear regression model by means of a bit of feature engineering: concretely, one adds the product $X_i X_j$ as a predictor or feature.
In practice, it may happen that an interaction term is highly significant, but the associated main effects are not. The hierarchy principle says: If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. The rationale for this principle is that interactions are hard to interpret in a model without main effects — their meaning is changed. Specifically, the interaction terms also contain main effects if the model has no main effect terms.
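A short R sketch of interactions on simulated data: in a model formula, x1 * x2 expands to both main effects plus the product term, in line with the hierarchy principle, while x1:x2 alone would add only the product.

```r
set.seed(2)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- 1 + x1 + 0.5 * x2 + 0.8 * x1 * x2 + rnorm(200)

coef(lm(y ~ x1 * x2))    # intercept, x1, x2, and the interaction x1:x2
```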
One may also have interactions between qualitative and quantitative variables. Non-linearity may be captured by polynomial regression as seen before. We shall discuss several approaches to nonlinearity in the chapter on nonlinear models.
3.2 Classification
Given a feature vector X and a qualitative response Y taking values in the set $\mathcal{C}$, the classification task is to build a function C(x) that takes as input the feature vector X and predicts the value of Y; i.e. $\hat Y = C(X) \in \mathcal{C}$.
Take a binary classification task (Yes/No, Positive/Negative or analogous). Suppose that we code Y = 0 if No/Negative and Y = 1 if Yes/Positive,

$$Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes.} \end{cases}$$
Can we simply perform a linear regression of Y on X and classify as Yes if Yˆ > 0.5, say?
In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis which we discuss later. Since in the population E(Y |X = x) = P(Y = 1|X = x), we might think that regression is perfect for this task. However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more suitable, especially if the conditional class probabilities are also of interest.
Moreover, suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.
$$Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}$$

This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure. Linear regression is definitely not appropriate here. Multi-class logistic regression or, again, discriminant analysis are more appropriate. Should there indeed be an ordering, then use an ordered logit.
Logistic regression
Let's write $p(x) = P(Y = 1|X = x)$ for short and consider using balance to predict default. Logistic regression uses the functional form

$$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$

It is easy to see that no matter what values $\beta_0, \beta_1$ or X take, p(X) will lie between 0 and 1. A bit of rearrangement gives

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X.$$
This monotone transformation is called the log odds or logit transformation of p(X). The quantity

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$

is called the odds and can take values between 0 and ∞. Values of the odds close to 0 and ∞ indicate very low and very high probabilities of default, respectively. On average, 1 in 5 people with odds of 1/4 will default, since p(X) = 0.2 implies odds of $\frac{0.2}{1 - 0.2} = \frac{1}{4}$. Likewise, on average nine out of every ten people with odds of 9 will default, since p(X) = 0.9 implies odds of $\frac{0.9}{1 - 0.9} = 9$.
Estimation of logit regressions is conducted by Maximum Likelihood. We seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat p(X_i)$ of default for each individual, using

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},$$

corresponds as closely as possible to the individual's observed default status. This is achieved via the likelihood function

$$L(\beta_0, \beta_1) = \prod_{i:\,Y_i = 1} p(X_i) \prod_{i:\,Y_i = 0} \left(1 - p(X_i)\right).$$

We pick $\hat\beta_0$ and $\hat\beta_1$ to maximize the likelihood of the observed data. Most statistical packages can fit linear logistic regression models by maximum likelihood.
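In R, logistic regression is fitted by glm(); a minimal sketch on simulated data (family = binomial selects the logit link by default):

```r
set.seed(3)
x <- rnorm(500)
p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))   # true conditional probabilities
y <- rbinom(500, 1, p)

fit  <- glm(y ~ x, family = binomial)
summary(fit)$coefficients                   # estimates, SEs, z-statistics
phat <- predict(fit, type = "response")     # fitted probabilities p(x)
table(Predicted = phat > 0.5, Actual = y)   # confusion matrix at tau = 0.5
```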
Suppose now we have not only one but p predictors $X_1, \ldots, X_p$ to predict a binary response Y. We can easily generalize the logistic regression with one predictor as follows:

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p,$$

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}.$$
Just as before we can estimate the coefficients β0, β1, . . . , βp using the maximum likelihood method.
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes where Y may take values in (say) {1, . . . , K}. One version has the symmetric form
$$P(Y = k|X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_{1l} X_1 + \cdots + \beta_{pl} X_p}}.$$
There is a linear function for each class. (Some cancellation is possible, and only K − 1 linear functions are needed as in 2-class logistic regression.) Multi-class logistic regression is also referred to as multinomial regression. We skip details here and move to a more direct technique.
Discriminant analysis
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem. If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model. Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.
The idea is to model the distribution of X in each of the classes separately, and then use Bayes' theorem to obtain P(Y|X).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis. (However, this approach is quite general, and other distributions can be used as well.) We will focus on normal distributions.
Recall:

$$P(Y = k|X = x) = \frac{P(X = x|Y = k)\cdot P(Y = k)}{P(X = x)}.$$

One writes this slightly differently for discriminant analysis (Bayes' theorem again):

$$P(Y = k|X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$

where

$$f_k(x) = P(X = x|Y = k)$$

is the density for x in class k, and $\pi_k = P(Y = k)$ is the marginal or prior probability for class k. Here we will use normal densities for these, separately in each class.
We classify a new data point according to which density is highest. When the priors are different, we take them into account as well and compare $\pi_k f_k(x)$; unequal priors shift the decision boundary towards the less likely class.
Linear Discriminant Analysis when p = 1. The Gaussian density has the form

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}.$$

Here $\mu_k$ is the mean, and $\sigma_k^2$ the variance (in class k). We will assume that all the $\sigma_k = \sigma$ are the same. Plugging this into Bayes' formula, we get a rather complex expression for $p_k(x) = P(Y = k|X = x)$:

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu_l}{\sigma}\right)^2}}.$$

To classify at the value X = x, we need to see which of the $p_k(x)$ is largest. Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score:

$$\delta_k(x) = x\cdot\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(\pi_k).$$

Note that $\delta_k(x)$ is a linear function of x. If there are K = 2 classes and $\pi_1 = \pi_2 = 0.5$, then one can see that the decision boundary is at

$$x = \frac{\mu_1 + \mu_2}{2}.$$
Let’s take an example with μ1 = −1.5,μ2 = 1.5,π1 = π2 = 0.5, and σ2 = 1. Typically we don’t know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule.
$$\hat\pi_k = \frac{n_k}{n}, \qquad \hat\mu_k = \frac{1}{n_k}\sum_{i:\,y_i = k} x_i,$$

$$\hat\sigma^2 = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i:\,y_i = k} (x_i - \hat\mu_k)^2 = \sum_{k=1}^{K} \frac{n_k - 1}{n - K}\cdot\hat\sigma_k^2,$$

where $\hat\sigma_k^2 = \frac{1}{n_k - 1}\sum_{i:\,y_i = k}(x_i - \hat\mu_k)^2$ is the usual formula for the estimated variance in the kth class. These estimates are then plugged into the discriminant function.
Linear Discriminant Analysis when p > 1: now

$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)},$$

which leads to

$$\delta_k(x) = x'\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \ln\pi_k,$$

which, despite its complex form, is linear in x.
When there are K classes, linear discriminant analysis can be viewed exactly in a K − 1 dimensional plot. It essentially classifies to the closest centroid. Even when K > 3, we can find the “best” 2-dimensional plane for visualizing the discriminant rule.
Once we have estimates $\hat\delta_k(x)$, we can turn these into estimates for class probabilities:

$$\hat P(Y = k|X = x) = \frac{e^{\hat\delta_k(x)}}{\sum_{l=1}^{K} e^{\hat\delta_l(x)}}.$$

So classifying to the largest $\hat\delta_k(x)$ amounts to classifying to the class for which $\hat P(Y = k|X = x)$ is largest. When K = 2, we classify to class 2 if $\hat P(Y = 2|X = x) \ge 0.5$, else to class 1.
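In R, LDA is available as lda() in the MASS package (shipped with standard R installations); a sketch reusing μ1 = −1.5, μ2 = 1.5, π1 = π2 = 0.5 from the example above:

```r
library(MASS)
set.seed(4)
n <- 200
y <- rbinom(n, 1, 0.5)                            # two classes, equal priors
x <- rnorm(n, mean = ifelse(y == 1, 1.5, -1.5))   # common unit variance

fit <- lda(factor(y) ~ x)
fit$prior                        # estimated pi_k
fit$means                        # estimated mu_k
pred <- predict(fit, data.frame(x = x))
mean(pred$class != y)            # training error rate
```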
When fk(x) are Gaussian densities, with the same covariance matrix Σ in each class, this leads to linear discriminant analysis. By altering the forms for fk(x), we get different classi- fiers. There are many other forms obtained by proposing specific density models for fk(x), including nonparametric approaches.
With Gaussians but different $\Sigma_k$ in each class, we get quadratic discriminant analysis,

$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\ln|\Sigma_k| + \ln\pi_k.$$

Because the $\Sigma_k$ are different, the quadratic terms do not cancel out and we have curved decision boundaries.
With $f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j)$ (a conditional independence model) in each class, we get the so-called naive Bayes classifier; for Gaussian $f_k$ this means the $\Sigma_k$ are diagonal. Naive Bayes assumes that the features are independent within each class. It's quite useful when p is large, where multivariate methods like quadratic discriminant analysis and even LDA break down. For Gaussian naive Bayes (each $\Sigma_k$ diagonal),

$$\delta_k(x) \propto \ln\left(\pi_k \prod_{j=1}^{p} f_{kj}(x_j)\right) = -\frac{1}{2}\sum_{j=1}^{p} \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \ln\pi_k.$$

Naive Bayes can also be used for mixed feature vectors (qualitative and quantitative): if $X_j$ is qualitative, replace $f_{kj}(x_j)$ with a probability mass function (histogram) over the discrete categories. Despite its strong assumptions, naive Bayes often produces good classification results.
Let's compare logistic regression with LDA. For a two-class problem, one can show that for LDA

$$\ln\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \ln\left(\frac{p_1(x)}{p_2(x)}\right) = c_0 + c_1 x_1 + \cdots + c_p x_p.$$
So it has the same form as logistic regression. The difference is in how the parameters are estimated. Logistic regression uses the conditional likelihood based on P(Y |X) (known as discriminative learning). LDA uses the full likelihood based on P(X, Y ) (known as generative learning). Despite these differences, in practice the results are often very similar. Logistic regression can also fit quadratic boundaries like quadratic discriminant analysis, by explicitly including quadratic terms in the model.
4 Model assessment and selection
The temptation is great to just throw in all available features and hope for the best. (Un)Fortunately, it turns out that carefully selecting the features greatly improves forecasting accuracy. We now discuss ways of selecting regressors and evaluating accuracy. We start with the exemplary case of linear regression.
4.1 Subset selection in linear regression
Best subset selection: Let $M_0$ denote the null model containing no predictors. This model simply predicts the sample mean for each observation. For $k = 1, 2, \ldots, p$:

1. Fit all $\binom{p}{k}$ models that contain exactly k predictors.

2. Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here, "best" is defined as having the smallest RSS, or equivalently the largest $R^2$.

3. Select a single best model from among $M_0, \ldots, M_p$. This is not trivial, since $R^2$ is not suitable for the task: it increases simply by adding (possibly irrelevant) regressors. We'll have to find an alternative way of assessing predictive accuracy.
Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression, and also other loss functions. The deviance — minus two times the maximized log-likelihood — plays the role of RSS for a broader class of models.
Stepwise Selection: For computational reasons, best subset selection cannot be applied with very large p. Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
Forward Stepwise Selection: Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
Let $M_0$ denote the null model, which contains no predictors. For $k = 0, \ldots, p - 1$:

1. Consider all p − k models that augment the predictors in $M_k$ with one additional predictor.

2. Choose the best among these p − k models, and call it $M_{k+1}$. Here, "best" is defined as having the smallest RSS, or equivalently the largest $R^2$.

3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
The computational advantage over best subset selection is clear. It is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the p predictors. Why not? (Once a predictor has entered, it stays in; the best model with k + 1 predictors need not contain all predictors of the best model with k.)
Backward Stepwise Selection: Like forward stepwise selection, backward stepwise se- lection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Let $M_p$ denote the full model, which contains all p predictors. For $k = p, p - 1, \ldots, 1$:

1. Consider all k models that contain all but one of the predictors in $M_k$, for a total of k − 1 predictors.

2. Choose the best among these k models, and call it $M_{k-1}$. Here, "best" is defined as having the smallest RSS or the highest $R^2$.

3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
Like forward stepwise selection, the backward selection approach searches through only 1 + p(p+1)/2 models, and so can be applied in settings where p is too large to apply best subset selection. Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors. Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
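As an illustration, base R's step() performs stepwise selection, although it ranks models by AIC rather than by a validated prediction error; the data below are simulated, with only three relevant predictors among ten.

```r
set.seed(5)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1 + X[, 1] - X[, 2] + 0.5 * X[, 3] + rnorm(n)
dat <- data.frame(y, X)

null_fit <- lm(y ~ 1, data = dat)     # M_0
full_fit <- lm(y ~ ., data = dat)     # M_p

# forward stepwise from the null model towards the full model
fwd <- step(null_fit, scope = formula(full_fit),
            direction = "forward", trace = 0)
formula(fwd)
```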
One may consider additional tricks, like
• Targeting/Pre-screening: Only consider predictors that are not statistically inde- pendent of target (in linear regression, this – more or less – amounts to correlation).
• Core predictors: Some predictors may be always included (for non-statistical rea- sons).
• Smart searches: One could use heuristic discrete optimization techniques to find the best model without parsing all $2^p$ of them.
4.2 Comparing accuracy
Let's move on to estimating the test error. We'll consider two approaches. We can directly estimate the test error, using either a validation set approach or a cross-validation approach (see the chapter on resampling). We can indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
The Cp, AIC/BIC, and adjusted R2 techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
First, Mallows' $C_p$:

$$C_p = \frac{1}{n}\left(RSS + 2d\hat\sigma^2\right),$$
where d is the total number of parameters used (note that d may differ from p, e.g. for restricted models) and σˆ2 is an estimate of the variance of the error ε associated with each response measurement. The AIC criterion is defined for a large class of models fit by maximum likelihood:
AIC = −2 ln L + 2d
where L is the maximized value of the likelihood function for the estimated model. The Bayesian IC is given by

$$BIC = -2\ln L + d\ln n.$$
Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Notice that BIC replaces the 2d used by AIC with a $d\ln n$ term, where n is the number of observations. Since $\ln n > 2$ for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $C_p$.
The adjusted $R^2$ statistic is calculated for a least squares model with d variables as

$$\text{adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)},$$
where TSS is the total sum of squares. Unlike Cp, AIC, and BIC, for which a small value indicates a low test error, a large value of adjusted R2 indicates a model with a low test error. Maximizing the adjusted R2 is equivalent to minimizing RSS/(n−d−1). While RSS always decreases as the number of variables in the model increases, RSS/(n − d − 1) may increase or decrease, due to the presence of d in the denominator. Unlike the R2 statistic, the adjusted R2 statistic pays a price for the inclusion of unnecessary variables in the model.
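A short R sketch comparing the criteria on two nested models, continuing the simulated data set dat from the stepwise example above; AIC(), BIC() and the adjusted R² from summary() are directly available for lm fits.

```r
fits <- list(
  small = lm(y ~ x1 + x2 + x3, data = dat),
  full  = lm(y ~ ., data = dat)
)
data.frame(
  AIC    = sapply(fits, AIC),
  BIC    = sapply(fits, BIC),
  adj_R2 = sapply(fits, function(m) summary(m)$adj.r.squared)
)
```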
Let us now understand in more detail why we need alternatives to $R^2$. Take the generic prediction model

$$Y = f(X; \theta) + \varepsilon.$$

This may be linear, polynomial, or other (more to come; we look at prediction here, but the same ideas apply for classification). The predictor function is $f\left(x; \hat\theta\right)$ for some fitted $\hat\theta$. Assuming bias away for simplicity, the prediction error is roughly

$$\hat\varepsilon = \varepsilon - \left(\hat\theta - \theta\right)'\nabla f(X; \theta),$$

where X is a new data point (and thus independent of the training data). One of course aims to minimize $\operatorname{Var}(\hat\varepsilon)$, but you need to measure it.
The variance of the prediction error $\hat\varepsilon$ is

$$\operatorname{Var}(\hat\varepsilon) \approx \operatorname{Var}(\varepsilon) + \operatorname{tr}\left(\operatorname{Cov}\left(\hat\theta\right) E\left(\nabla f\,\nabla f'\right)\right),$$

where independence of X and $\hat\theta$ was used. Of equal interest is the error variance given X,

$$\operatorname{Var}(\hat\varepsilon|X) \approx \operatorname{Var}(\varepsilon) + \nabla f'\operatorname{Cov}\left(\hat\theta\right)\nabla f.$$
Either way, the second term is relatively easy to estimate given training data, while estimating $\operatorname{Var}(\varepsilon)$ is not trivial.
The usual residual variance estimator,

$$\widehat{\operatorname{Var}}(\hat\varepsilon) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - f\left(X_i; \hat\theta\right)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\varepsilon_i - \left(\hat\theta - \theta\right)'\nabla f(X_i; \theta)\right)^2,$$

is an "in-sample" estimator, $\widehat{\operatorname{Var}}(\hat\varepsilon) = \frac{1}{n}RSS$. We call it the training error. Since the RSS can be made arbitrarily small by adding regressors (or, in nonlinear models, increasing the model complexity), the training error rate is a downward-biased estimate of $\operatorname{Var}(\hat\varepsilon)$.
A bit of algebra shows that

$$\widehat{\operatorname{Var}}(\hat\varepsilon) = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 - \left(\hat\theta - \theta\right)'\frac{2}{n}\sum_{i=1}^{n}\nabla f(X_i; \theta)\,\varepsilon_i + \left(\hat\theta - \theta\right)'\frac{1}{n}\sum_{i=1}^{n}\nabla f(X_i; \theta)\,\nabla f(X_i; \theta)'\left(\hat\theta - \theta\right).$$

The third term on the r.h.s. is usually of lower magnitude order; but, since $\hat\theta$ is not independent of the $X_i$, the second has expectation (strictly speaking, this requires some regularity conditions to hold)

$$E\left(-\left(\hat\theta - \theta\right)'\frac{2}{n}\sum_{i=1}^{n}\nabla f(X_i; \theta)\,\varepsilon_i\right) \approx -\frac{2\dim\theta}{n}\operatorname{Var}(\varepsilon).$$
To compare models of different structure, we need (unbiased) estimators of the prediction error variance, since training error variance is not a good indicator of actual predictive performance in general.
First, we can directly estimate the test error, using a so-called validation/test set approach or a cross-validation approach. The idea is to ensure that the $-\frac{2\dim\theta}{n}\operatorname{Var}(\varepsilon)$ term vanishes by forcing X and the training data to be (approximately) independent.
Second, we can indirectly estimate predictive error variances by making an adjustment to the training error to account for the bias due to overfitting. This leads e.g. to Mallows’ Cp in the linear case.
4.3 Validation and cross-validation
One fast way of ensuring independence of X and θˆ in the above derivation is to split the data set into a training subset and a so-called validation subset. For iid sampling, this is very effective. So fit various models on the training data and evaluate their performance on the validation data. Finally, use this information for model selection.
A small remark: the validated prediction error level of a model selected this way can also be a bit downward biased (because the model selection also uses the validation data). So one often uses a third data subset, called test data, to provide a final estimate of the predictive performance after selecting models (and also optimal degree of complexity of those models).
Drawbacks of the validation set approach: The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set. In the validation approach, only
a subset of the observations — those that are included in the training set rather than in the validation set — are used to fit the model. This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set. So let’s be smarter.
K-fold Cross-validation: Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K − 1 parts (combined), and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, …K , and then the results are combined.
Divide the data into K roughly equal-sized parts. Let the K parts be $C_1, C_2, \ldots, C_K$, where $C_k$ denotes the indices of the observations in part k. There are $n_k$ observations in part k: if n is a multiple of K, then $n_k = n/K$. Compute

$$CV_{(K)} = \sum_{k=1}^{K}\frac{n_k}{n}MSE_k,$$

where $MSE_k = \sum_{i\in C_k}(Y_i - \hat Y_i)^2/n_k$, and $\hat Y_i$ is the fit for observation i, obtained from the data with part k removed. Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV). The following (easily computed!) formula holds,

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_i - \hat Y_i}{1 - h_i}\right)^2,$$

where $\hat Y_i$ is the ith fitted value from the original least squares fit, and $h_i$ is the so-called leverage (i.e. the ith diagonal element of the "hat" matrix $X(X'X)^{-1}X'$). This is like the ordinary MSE, except that the ith residual is divided by $1 - h_i$.
LOOCV is often useful, but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance. A better choice is K = 5 or 10. (Larger K leads to more accurate estimates, but if K is close to n/2, computation time goes up.)
Other issues with cross-validation: Since each training set is only (K − 1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. Why? This bias is minimized when K = n (LOOCV), but this estimate has high variance, as noted earlier. K = 5 or K = 10 provides a good compromise for this bias-variance tradeoff.
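A hand-rolled K-fold cross-validation in R (a self-contained sketch on simulated data, K = 5):

```r
set.seed(6)
n   <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

K     <- 5
folds <- sample(rep(1:K, length.out = n))   # random fold membership
mse_k <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x, data = dat[folds != k, ])             # fit without part k
  pred <- predict(fit, newdata = dat[folds == k, , drop = FALSE])
  mean((dat$y[folds == k] - pred)^2)                      # MSE_k
})
weighted.mean(mse_k, w = tabulate(folds))   # CV_(K) = sum (n_k / n) * MSE_k
```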
Cross-validation for classification problems: We divide the data into K roughly equal-sized parts $C_1, C_2, \ldots, C_K$, where $C_k$ denotes the indices of the observations in part k; again, $n_k = n/K$ if n is a multiple of K. Compute

$$CV_K = \sum_{k=1}^{K}\frac{n_k}{n}\mathrm{Err}_k,$$

where $\mathrm{Err}_k = \sum_{i\in C_k} I(Y_i \ne \hat Y_i)/n_k$. The estimated standard deviation of $CV_K$ is

$$\widehat{SE}(CV_K) = \sqrt{\sum_{k=1}^{K}\left(\mathrm{Err}_k - \overline{\mathrm{Err}}\right)^2/(K - 1)}.$$

This is a useful estimate, but strictly speaking not quite valid, since the $\mathrm{Err}_k$ are not independent.
These procedures have an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$ in that they provide a direct estimate of the test error and do not require an estimate of the error variance $\sigma^2$. They can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.
5 Regression with large p

5.1 Improving linear regression
Despite its simplicity, the linear model has distinct advantages in terms of interpretability and quite often shows good predictive performance. Prediction accuracy matters (one determinant is the variance of the estimators, by the way; especially when p is comparable to n, the variance can be quite large), alongside model interpretability (by removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted). Hence we discuss in this lecture some ways in which the simple linear model can be improved by replacing ordinary least squares fitting with some alternative fitting procedures. This is closely related to selecting the features most relevant for prediction. We study four classes of methods:
Subset Selection: We identify a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
We already considered this in model selection so we focus now on:
Shrinkage: We fit a model involving all p predictors, but the estimated coefficients are shrunk towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
Dimension Reduction: We project the p predictors into an M-dimensional subspace, where M < p. The M projections are used as predictors to fit a linear regression model by least squares.
And, last but not least, we can use ensemble methods; see later on.
5.2 Shrinkage methods
If one has p + 1 or fewer observations, a (hyper)plane fits all of them perfectly. Even if you have some more observations, you're often practically still there (multicollinearity works similarly: too many parameters in certain directions).
What if we tried to estimate only the coefficients of regressors with strong signal? After all, there is not enough information to estimate them all; if we have a lot of coefficients, some must be much less important than others, so set those to 0 (or close to it)!
Forward stepwise selection may help: adding regressors stepwise can be combined with different criteria and can handle data with p > n, but one needs to account for multiple testing, and estimates may be biased.
Alternatives are available, fortunately: regularized (linear) regression, for instance. The idea is to make sure that some of the parameter estimates are small or zero. Place a penalty on large values of the slope coefficients, add it to the sum of squared residuals (or likelihood, or . . . ), and then minimize/maximize to get β̂,
$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2 + \lambda f(\bar\beta).$$
Here, f measures the distance between 0 and the slope coefficients β̄ (the intercept does not need regularization). The tuning parameter λ (the so-called shrinkage intensity) controls how close to 0 the estimates of unimportant coefficients should be.
Ridge regression: Recall that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize
$$RSS = \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2.$$
In contrast, the ridge regression coefficient estimates $\hat\beta^R$ are the values that minimize
$$\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2,$$
where λ ≥ 0 is a tuning parameter, to be determined separately.
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, $\lambda \sum_{j} \beta_j^2$, is small when $\beta_1, \dots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. Selecting a good value for λ is critical; cross-validation is usually employed for this.
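As a hedged illustration, ridge regression along these lines can be fit with the glmnet package (alpha = 0 selects the ridge penalty); the predictor matrix x and response y are assumed to exist.

    library(glmnet)
    grid <- 10^seq(4, -2, length.out = 100)              # grid of lambda values
    ridge_fit <- glmnet(x, y, alpha = 0, lambda = grid)  # standardizes predictors by default
    cv_ridge  <- cv.glmnet(x, y, alpha = 0)              # 10-fold CV over lambda
    coef(ridge_fit, s = cv_ridge$lambda.min)             # coefficients at the chosen lambda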
The standard least squares coefficient estimates are scale equivariant. In other words, regardless of how the jth predictor is scaled, $X_j \hat\beta_j$ will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum-of-squared-coefficients term in the penalty part of the ridge regression objective function. Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula $\tilde X_{ij} = X_{ij} / \widehat{SE}(X_j)$ with $\widehat{SE}^2(X_j) = \frac{1}{n} \sum_{i=1}^{n} (X_{ij} - \bar X_j)^2$ as usual.
The Lasso: Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The Lasso is an alternative to ridge regression that overcomes this disadvantage. The Lasso coefficients $\hat\beta^L_\lambda$ minimize the quantity
$$\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|.$$
The lasso uses an $\ell_1$ penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector β is given by $||\beta||_1 = \sum_{j=1}^{p} |\beta_j|$.
As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the l1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. We say that the lasso yields sparse models — that is, models that involve only a small subset of the variables. As in ridge regression, selecting a good value of λ for the lasso is critical; cross-validation is again the method of choice.
Why is it that the Lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? The penalized sums of squares of the Lasso and of ridge regression are actually the Lagrangians of
$$\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$
and
$$\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s,$$
respectively. We note that the optimum must lie somewhere on the boundary $\sum_{j=1}^{p} |\beta_j| = s$ or $\sum_{j=1}^{p} \beta_j^2 = s$ of the region satisfying the restrictions. The big difference is that $\sum_{j=1}^{p} |\beta_j| = s$ has vertices (corners), whereas $\sum_{j=1}^{p} \beta_j^2 = s$ is smooth. This leads to different behavior of the optimum, as the Lasso may pick the "corner" solutions with positive probability.
Neither Ridge regression nor the Lasso will universally dominate the other. In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors. However, the number of predictors that is related to the response
is never known a priori for real data sets. A technique such as cross-validation can be used
in order to determine which approach is better on a particular data set.
The elastic net combines Ridge regression and the Lasso by having both types of penalty terms added to the RSS for estimation. It comes at the price of having to work with two tuning parameters.
Speaking of which: like for subset selection, we require a method to determine which of the models under consideration is best for Ridge regression or the Lasso. That is, we require a method selecting a value for the tuning parameter λ or equivalently, the value of the constraint s. Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error rate for each value of λ. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
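A minimal sketch of this tuning strategy for the Lasso, again using the glmnet package on a hypothetical (x, y):

    library(glmnet)
    cv_lasso <- cv.glmnet(x, y, alpha = 1)    # CV error over an automatic lambda grid
    plot(cv_lasso)                            # error as a function of log(lambda)
    lasso_fit <- glmnet(x, y, alpha = 1)
    coef(lasso_fit, s = cv_lasso$lambda.min)  # sparse vector: many coefficients exactly 0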
5.3 Dimension reduction methods
The methods that we have discussed so far in this chapter have involved fitting linear re- gression models, via least squares or a shrinking approach, using the original predictors, X1, X2, . . . , Xp. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.
Let $Z_1, Z_2, \dots, Z_M$ represent M < p linear combinations of our original p predictors. That is,
$$Z_m = \sum_{j=1}^{p} \phi_{mj} X_j$$
for some constants $\phi_{m1}, \dots, \phi_{mp}$. We can then fit the linear regression model
$$Y_i = \theta_0 + \sum_{m=1}^{M} \theta_m Z_{im} + \varepsilon_i, \quad i = 1, \dots, n,$$
using ordinary least squares. Note that regression coefficients are given by θ0, θ1, . . . , θM . If the constants φm1, . . . , φmp are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
Notice that from the definition
$$\sum_{m=1}^{M} \theta_m Z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{mj} X_{ij} = \sum_{j=1}^{p} \Big( \sum_{m=1}^{M} \theta_m \phi_{mj} \Big) X_{ij} = \sum_{j=1}^{p} \beta_j X_{ij},$$
where
$$\beta_j = \sum_{m=1}^{M} \theta_m \phi_{mj}.$$
Hence the reduced model can be thought of as a special case of the original linear regression model. Dimension reduction serves to constrain the estimated $\beta_j$ coefficients, since now they must take the form $\beta_j = \sum_{m=1}^{M} \theta_m \phi_{mj}$. This can be a win in the bias-variance trade-off.
Principal Components Regression: Here we apply principal components analysis (PCA) (discussed later on in slightly more detail) to define the linear combinations of the predictors, for use in our regression. The first principal component is that (normalized) linear combination of the variables with the largest variance. The second principal component has largest variance, subject to being uncorrelated with the first, and so on. Hence, with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.
This does help, but don’t forget to choose the number of directions M. One may of course apply cross-validation.
Partial Least Squares: PCR identifies linear combinations, or directions, that best repre- sent the predictors X1, . . . , Xp. These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response does not supervise the identification of the principal components. Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
Like PCR, PLS is a dimension reduction method, which first identifies a new set of features $Z_1, \dots, Z_M$ that are linear combinations of the original features, and then fits a linear model via OLS using these M new features. But, unlike PCR, PLS identifies these new features in a supervised way: it makes use of the response Y in order to identify new features that not only approximate the old features well, but also are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
Details of Partial Least Squares: After standardizing the p predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{1j}$ above equal to the coefficient from the simple linear regression of Y onto $X_j$. One can show that this coefficient is proportional to the correlation between Y and $X_j$. Hence, in computing $Z_1 = \sum_{j=1}^{p} \phi_{1j} X_j$, PLS places the highest weight on the variables that are most strongly related to the response. Subsequent directions are found by taking residuals and then repeating the above prescription.
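Both approaches are available in the pls package; the sketch below assumes a data frame dat with response y and uses cross-validation to choose M.

    library(pls)
    pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
    pls_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
    validationplot(pcr_fit, val.type = "MSEP")   # CV error vs. number of components
    predict(pls_fit, newdata = dat, ncomp = 3)   # predictions using M = 3 directions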
6 Nonlinear methods
We used linearity as a starting point rather than truth carved in stone. When a linear approximation is not good enough,1 you could try: polynomials, step functions, splines, local regression, and generalized additive models. They offer a lot of flexibility, without losing (too much of) the ease and interpretability of linear models.
6.1 Polynomial regression
Powers should be familiar; in the simple regression case (p = 1) we have
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \varepsilon_i.$$
Concretely, create new variables $X_1 = X$, $X_2 = X^2$, etc., and then treat this as multiple linear regression. One is not really interested in the coefficients, but rather in the fitted function values at any value $x_0$:
$$\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2 + \hat\beta_3 x_0^3 + \hat\beta_4 x_0^4.$$
Since $\hat f(x_0)$ is a linear function of the $\hat\beta_l$, you can get a simple expression for the pointwise variance $Var(\hat f(x_0))$ at any value $x_0$. We either fix the degree d at some reasonably low value, or else use cross-validation to choose d.
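A minimal R sketch (hypothetical data frame dat with columns x and y): poly() creates the polynomial basis, and predict() returns the pointwise standard errors just mentioned.

    fit <- lm(y ~ poly(x, 4), data = dat)   # degree-4 polynomial regression
    x0  <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 100))
    pr  <- predict(fit, newdata = x0, se.fit = TRUE)
    band <- cbind(pr$fit - 2 * pr$se.fit,   # rough pointwise bands
                  pr$fit + 2 * pr$se.fit)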
For classification, logistic regression follows naturally. For example, we might model
$$P(Y_i > 250 \,|\, X_i) = \frac{\exp(\beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots + \beta_d X_i^d)}{1 + \exp(\beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots + \beta_d X_i^d)}.$$
To get confidence intervals, compute upper and lower bounds on the logit scale (which, recall, is linear), and then transform them to the probability scale.
This can be done separately on several variables: just stack the variables into one matrix and separate out the pieces afterwards (see generalized additive models later). Caveat: polynomials have notoriously bad tail behavior, which is very bad for extrapolation.
1The decision to go nonlinear can be taken using model selection techniques.
6.2 Step functions
Another way of creating transformations of a variable is to cut its range into distinct regions. For instance, if X were age (in years), one could partition the range R(X) into d subsets, e.g. with d = 3,
$$C_1(X) = I(X < 35), \quad C_2(X) = I(35 \le X < 65), \quad C_3(X) = I(X \ge 65),$$
and fit a different model for each subset. In the simple regression case, one predicts for each $C_j$ using the average Y in $C_j$, i.e., one uses the estimated predictor function
$$\hat f(x) = \sum_{j=1}^{d} \frac{\sum_{i=1}^{n} Y_i\, I(X_i \in C_j)\, I(x \in C_j)}{\sum_{i=1}^{n} I(X_i \in C_j)\, I(x \in C_j)}.$$
In spite of this more complex formula, step functions are in fact easy to work with; simply create a series of dummy variables representing each group. It also provides a useful way of creating interactions that are easy to interpret, and allow for different linear functions in each subset.
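In R, cut() produces exactly these dummies; a sketch with the hypothetical age example above:

    dat$age_cat <- cut(dat$age, breaks = c(-Inf, 35, 65, Inf))  # three regions
    fit <- lm(y ~ age_cat, data = dat)  # one dummy per region; fits the region means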
The choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available and should in fact be preferred, but step functions also motivate trees. (We may also analyze the bias-variance trade-off on the basis of step functions.)
6.3 Splines
Instead of a single polynomial in X over its whole domain, we can rather use different polynomials in regions defined by knots. E.g., with a single knot at c,
$$Y_i = \begin{cases} \beta_{01} + \beta_{11} X_i + \beta_{21} X_i^2 + \beta_{31} X_i^3 + \varepsilon_i & \text{if } X_i < c, \\ \beta_{02} + \beta_{12} X_i + \beta_{22} X_i^2 + \beta_{32} X_i^3 + \varepsilon_i & \text{if } X_i \ge c. \end{cases}$$
Imposing continuity (and smoothness) at the knots leads to splines, which can be written in terms of basis functions,
$$Y_i = \beta_0 + \beta_1 b_1(X_i) + \beta_2 b_2(X_i) + \dots + \varepsilon_i,$$
where the $b_k$ are basis functions built from truncated powers. Here the $(\cdot)_+$ means positive part, i.e.
$$(x - \xi_k)_+ = \begin{cases} x - \xi_k & \text{if } x > \xi_k, \\ 0 & \text{otherwise.} \end{cases}$$
A cubic spline with knots at $\xi_k$, k = 1, . . . , K is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot. Again we can represent this model with truncated power basis functions.
A natural cubic spline extrapolates linearly beyond the boundary knots. This adds 4 = 2 × 2 extra constraints (zero 2nd order derivatives), and allows us to put more internal knots for the same flexibility (degrees of freedom of model) as a regular cubic spline.
Concerning knot placement, one popular strategy is to decide K, the number of knots, and then place them at appropriate quantiles of the observed X. A cubic spline with K knots has K + 4 parameters or degrees of freedom. A natural spline with K knots has K degrees of freedom.
Smoothing splines have some advantages in terms of knot placement. Consider this criterion for fitting a smooth function g(x) to some data:
$$\min_{g \in S} \sum_{i=1}^{n} (Y_i - g(X_i))^2 + \lambda \int g''(t)^2 \, dt.$$
The first term is the RSS, and tries to make g(x) match the data at each $X_i$. The second term is a roughness penalty and controls how wiggly g(x) is. It is modulated by the tuning parameter λ ≥ 0. The smaller λ, the more wiggly the function, eventually interpolating the $Y_i$ when λ = 0. As λ → ∞ the function g(x) becomes linear.
The solution to this problem is a natural cubic spline, with a knot at every unique value of xi. The roughness penalty still controls the roughness via λ. Smoothing splines avoid, as
said, the knot-selection issue, leaving a single tuning parameter λ to be chosen. The vector of n fitted values can be written as gˆλ = SλY, where Sλ is a n × n matrix (determined by the xi and λ) and Y stacks Yi (this makes splines a linear smoother, by the way).
The effective degrees of freedom are given by $df_\lambda = \sum_{i=1}^{n} \{S_\lambda\}_{ii}$, and we may specify df rather than λ. Just in case you're resorting to LOOCV, the LOO cross-validated error is given by
$$RSS_{cv}(\lambda) = \sum_{i=1}^{n} \big( Y_i - \hat g_\lambda^{(-i)}(X_i) \big)^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat g_\lambda(X_i)}{1 - \{S_\lambda\}_{ii}} \right)^2.$$
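In base R, smooth.spline implements this; one can fix df or let LOOCV choose λ (x and y are hypothetical vectors here).

    fit_df <- smooth.spline(x, y, df = 6)     # fix the effective degrees of freedom
    fit_cv <- smooth.spline(x, y, cv = TRUE)  # choose lambda by leave-one-out CV
    fit_cv$df                                 # resulting effective df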
6.4 Local regression

Linear regression is quite reliable under some circumstances. What if we fitted linear regressions to neighbourhoods of the point x of interest? This is in fact a local linear approximation motivated by a 1st order Taylor expansion, which can be quite accurate in small neighbourhoods of the expansion point.

With a sliding weight function, we fit separate linear fits over the range of X by weighted least squares, where data point i receives a weight which is a decreasing function of the distance between $X_i$ and x, the point where we want to fit $\hat f(x)$. At each x, the linear fit approximates the tangent to the regression curve f(x), so local linear regression delivers $\hat f$ and an estimate of f′. This may of course be extended to p > 1 predictors.
A very popular variant is so-called K-Nearest Neighbours. This amounts to finding the K data points that are closest to the point of interest (usually in terms of Euclidean distance ∥Xi − x∥, but if p is large you should go for the city-block distance) and predict using their average. Comparing with local regression, the weight function for K-NN is chosen differently at each point such that exactly K data points get nonzero weight, and the weights are equal.
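A hedged sketch using base R's loess for local (here: locally linear) regression on a hypothetical data frame dat:

    fit <- loess(y ~ x, data = dat, span = 0.5, degree = 1)  # local linear fit
    predict(fit, newdata = data.frame(x = c(1, 2, 3)))       # fits at new points

The span parameter plays the role of the neighbourhood size: smaller spans give wigglier, more local fits.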
6.5 Generalized additive models
All these methods have multivariate extensions, but don't forget the curse of dimensionality! GAMs allow for flexible nonlinearities in several variables, but retain the additive structure of linear models:
Yi =β0 +f1(Xi1)+f2(Xi2)+···+fp(Xip)+εi.
One can fit a GAM simply using, e.g. natural splines for each fj. Coefficients are not that
interesting; fitted functions are.
One can use smoothing splines or local regression as well. GAMs are additive, although low-order interactions can be included in a natural way using, e.g. bivariate smoothers or,
perhaps more conveniently, interactions. Fitting is not straightforward, though; one may of course use general-purpose numerical optimization to minimize a sum of squared residuals depending on all parameters of the functions $f_j$, but in practice the so-called backfitting algorithm is more robust. This consists of fitting one function $f_j$ at a time using
$$Y_i - \hat\beta_0 - \hat f_1(X_{i1}) - \dots - \hat f_{j-1}(X_{i,j-1}) - \hat f_{j+1}(X_{i,j+1}) - \dots - \hat f_p(X_{ip})$$
as dependent variable, where the $\hat f$'s stand for the current estimates of the other f's, and iterating until convergence. To use GAMs for classification, simply write
$$\ln \frac{p(x)}{1 - p(x)} = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p).$$
Of course, fitting is done by maximizing the resulting Bernoulli likelihood rather than the RSS.
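A minimal sketch with the gam package (the column names in dat are made up): s() requests a smoothing spline term and lo() a local regression term.

    library(gam)
    fit <- gam(y ~ s(x1, df = 4) + lo(x2, span = 0.5) + x3, data = dat)
    plot(fit, se = TRUE)  # one fitted function per term
    # classification version: add family = binomial and use a binary response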
6.6 Artificial neural networks
(Artificial) neural networks are, at the origin (mid 20th century), mathematical models of biological neural networks. They turned out to be highly flexible nonlinear regression and classification tools. Neural networks are represented as interconnected artificial neurons that can perform simple operations, i.e. process information. Each neuron transforms the input it receives through a so-called activation function2 to deliver a specific output. The outcome of each neuron is then passed on to other neurons as input. The approach is connectionist in the sense that complexity is dealt with by means of complex-structured connections of simple units.
Computational demands of training a neural network (i.e. fitting a model; see below) are quite high, which is why they were not very popular for some 50 years. In the past decade, however, the cost of computation has decreased so much that there is a revival in the use of neural networks (buzzword: deep learning). There is an extremely large number of types of neural networks available in the literature, each developed for a specific purpose (a network architecture designed for a regression problem will not perform well for character recognition tasks, for instance).
Typically, neurons are grouped in layers, such that the neurons of a layer do not communicate directly and are only connected to neurons from other layers. Also typically, one layer only takes data as input and one layer only serves to deliver the output of the network, while the other layers are not connected directly to the “exterior” and are thus “hidden”.
2The name comes from neurology: a (real-life) neuron is activated, i.e. sends an output signal, as a function of its input.
One of the most popular models for prediction is the so-called feedforward multilayer perceptron. For the univariate case of relevance here, it consists of first-layer neurons whose role is to forward the signal (values of the features/regressors) to the hidden layer(s), where each neuron transforms its input through a specific activation function and passes the result on to the output neuron. (The activations of the input-layer neurons are usually linear.) Each connection has a certain weight attached to it, and each neuron's input is the weighted mean of the signals "transported" by the connections leading to it.
A schematic representation for the case with one hidden layer is given below. In this simple example, there are no loops; the input-layer and hidden-layer neurons without input induce an intercept in the function.
Summing up, the feedforward multilayer perceptron with one hidden layer of H hidden neurons is in fact nothing else than a nonlinear regression function of the form
$$\tilde f_n(x, \tilde\theta) = \tilde\theta_{2,0,1} + \sum_{j=1}^{H} \tilde\theta_{2,1,j}\, \sigma_j\big( \tilde\theta_{1,0,j} + \tilde\theta_{1,1,j}\, x \big),$$
with the vector $\tilde\theta$ gathering all weights $\tilde\theta_{i,j,k}$, and $\sigma_j$ the activation functions of the hidden-layer neurons. The notation $\tilde f_n$ serves to show that this is intended to be just an approximation of the population regression curve.3
Note that we took a linear activation for the output neuron in the above example. This is standard for regression problems; for classification problems, where the NN is supposed to learn a conditional class probability, the output activation needs to give values between 0 and 1. Then, a min-max normalization, or, better yet, an S-shaped function like the logistic function, are called for.
It is imperative that the activations of the hidden-layer(s) neurons are nonlinear. Classical activation functions are the hyperbolic tangent or the familiar logistic function, which are smooth but computationally somewhat demanding. Very large networks use the ReLU (Rectified Linear Unit) activation function, σ(x) = max{0, x}.
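To fix ideas, here is a small R function computing the forward pass of the one-hidden-layer network above; the weight layout (a 2 x H matrix w1 of hidden-layer intercepts and slopes, a vector w2 of H + 1 output weights) is just one possible convention.

    mlp_forward <- function(x, w1, w2, sigma = tanh) {
      a <- sigma(sweep(outer(x, w1[2, ]), 2, w1[1, ], "+"))  # n x H hidden activations
      drop(w2[1] + a %*% w2[-1])                             # linear output activation
    }
    # e.g. H = 4 hidden neurons with random weights:
    set.seed(42)
    f <- mlp_forward(seq(-4, 4, by = 0.1), matrix(rnorm(8), 2, 4), rnorm(5))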
3While this is true for all models in this chapter, it is most obvious for artificial neural networks.
The feedforward structure with one hidden layer is easy to understand, but it can be shown that already this simple structure can approximate a regression function f arbitrarily well if we let the number of neurons in the hidden layer go to infinity. Even with a relatively low number of hidden neurons, a great deal of functions can be obtained; see below for some examples of stepwise increasing complexity of a feedforward network with weights picked at random: 2, 4, 6 (first row) and 8, 10, and 12 (second row) hidden neurons.
[Figure: example functions $\tilde f_n(x)$ on x ∈ [−4, 4], generated by one-hidden-layer feedforward networks with randomly drawn weights and 2, 4, 6, 8, 10, and 12 hidden neurons.]
Fitting a regression model (or training the network, in the specific jargon) is typically done by nonlinear least squares:
$$\hat{\tilde\theta} = \arg\min_{\tilde\theta} \sum_{i=1}^{n} \Big( Y_i - \tilde f_n(X_i, \tilde\theta) \Big)^2;$$
in theory, this is not complicated, but in practice, where θ ̃ may have many entries, this optimization is a critical step. (The deep learning revolution, i.e. networks with many hidden layers and many neurons per layer, was only possible when cheap computers and data became available in large numbers.) The RSS as a function of parameter values (the weights) is typically quite bumpy and has many local minima. Shrinkage is not an uncommon strategy, as are various optimization tricks such as the use of stochastic gradient descent algorithms.
For use with classification, one essentially sets up a nonlinear logit model by replacing the linear activation in the output layer by a mapping onto [0, 1], and using ML estimation.
Neural networks are very flexible; for a multilayer perceptron, the number of hidden layers and of neurons per layer serve as tuning parameters, and have to be chosen very carefully – with cross-validation the easiest solution. Note that the complexity is not reliably captured by the number of weights only; let’s just say that the model degrees of freedom depend non-linearly on dim θ.
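For a single-hidden-layer network, the nnet package offers a standard implementation; a hedged sketch (dat is hypothetical, and decay is the shrinkage penalty on the weights):

    library(nnet)
    fit <- nnet(y ~ ., data = dat, size = 5,  # 5 hidden neurons
                linout = TRUE,                # linear output for regression
                decay = 0.01, maxit = 500)    # weight decay and iteration cap
    predict(fit, newdata = dat)

The tuning parameters size and decay would be chosen by cross-validation, as discussed.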
And a final remark: usually, large amounts of data are required for reliable training. That’s why neural networks are very successful in speech recognition or OCR: there you may ge- nerate training data with little cost. Where gathering or generating data is costly, you may find that alternative methods may perform better in terms of the bias-variance trade-off.
6.7 Tree-based methods
We now discuss tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods. We first consider regression problems, and then move on to classification.
Tree-based methods are simple and useful for interpretation. However they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy. Hence we also discuss bagging, random forests, and boosting to improve the performance of the basic tree by building ensembles of trees. These methods grow multiple trees which are then combined to yield a single consensus prediction. Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretation.
A tree is constructed sequentially by splitting subsets of the predictor space suitably. Each split concerns one predictor only (but not always the same one). Each split is called a node. At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk.
Terminology for Trees: In keeping with the tree analogy, the regions R1, R2, . . . are known as terminal nodes or leaves. Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. The points along the tree where the predictor space is split are referred to as internal nodes. Overall, the tree stratifies or segments the predictor space into several regions; in each region, we predict using the average outcome in that region.
Trees are surely an over-simplification, but compared to a regression model, a tree is easy to display, interpret and explain.
We divide the predictor space — that is, the set of possible values for X1, X2, . . . , Xp — into J distinct and non-overlapping regions, R1, R2, . . . , RJ . For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.
In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes R1, . . . , RJ that minimize the RSS, given by
$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat y_{R_j})^2,$$
where $\hat y_{R_j}$ is the mean response for the training observations within the jth box.
Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
We first select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X|Xj < s} and {X|Xj ≥ s} leads to the greatest possible reduction in RSS. Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. Why? A smaller tree with fewer splits (that is, fewer regions R1, . . . , RJ ) might lead to lower variance and better interpretation at the cost of a little bias. One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy will result in smaller trees, but is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split — that is, a split that leads to a large reduction in RSS later on.
Pruning offers a way out: A better strategy is to grow a very large tree T0, and then prune it back (chop off branches) in order to obtain a subtree. Most often, cost complexity pruning, also known as weakest link pruning, is used to do this. We consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ T0 such that
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha |T|$$
is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and $\hat y_{R_m}$ is the mean of the training observations in $R_m$. The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data. We select an optimal value α̂ using cross-validation. We then return to the full data set and obtain the subtree corresponding to α̂.
Summary: tree algorithm
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. For each k = 1, . . . , K:
a) Repeat Steps 1 and 2 on the (K − 1)/K fraction of the training data, excluding the kth fold.
b) Evaluate the mean squared prediction error on the data in the left-out kth fold,
as a function of α.
4. Average the results, and pick α to minimize the average error.
5. Return the subtree from Step 2 that corresponds to the chosen value of α.
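The tree package follows this algorithm closely; a minimal regression-tree sketch on a hypothetical dat:

    library(tree)
    t0   <- tree(y ~ ., data = dat)         # grow a large tree
    cv_t <- cv.tree(t0)                     # cost-complexity cross-validation
    best <- cv_t$size[which.min(cv_t$dev)]  # tree size with smallest CV deviance
    pruned <- prune.tree(t0, best = best)   # prune back to that size
    plot(pruned); text(pruned)

(For classification trees one would use cv.tree(..., FUN = prune.misclass).)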
Binary outputs
A classification tree is very similar to a regression tree, except that we use it to predict a qualitative response rather than a quantitative one. For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. Just as in the regression setting, we use recursive binary splitting to grow a classification tree.
44
6 Nonlinear methods
Details of classification trees: In the classification setting, RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate E. This is simply the fraction of the training observations in that region that do not belong to the most common class:
$$E_m = 1 - \max_k (\hat p_{mk}).$$
Here pˆmk represents the proportion of training observations in the mth region that are from the kth class. However classification error is not always sufficiently sensitive for tree-growing, and in practice two other measures are preferable.
The Gini index is defined by
$$G = \sum_{k=1}^{K} \hat p_{mk} (1 - \hat p_{mk}),$$
a measure of total variance across the K classes. The Gini index takes on a small value if all of the pˆmk’s are close to zero or one. For this reason the Gini index is referred to as a measure of node purity — a small value indicates that a node contains predominantly observations from a single class. An alternative to the Gini index is cross-entropy, given by
$$D = - \sum_{k=1}^{K} \hat p_{mk} \ln \hat p_{mk}.$$
It turns out that the Gini index and the cross-entropy are very similar numerically.
Trees are very easy to explain (and we argued that interpretable learning outcomes are important in practice). Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters. Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small). Trees can easily handle qualitative predictors without the need to create dummy variables.
Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen here. But so-called ensemble methods, which combine models and predictions in various ways, can help here (and for other baseline models) a lot.
7 Ensemble methods

7.1 Bagging and the bootstrap
The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. E.g., it can provide an estimate of the standard error of a coefficient or a confidence interval for that coefficient. (Obviously, it does not serve the same purpose as CV.)
Suppose that we're interested in the distribution properties of some estimator α̂. If we knew the population distribution, we could resort to Monte Carlo simulation to approximate the characteristics of the sampling distribution of interest for any given sample size n. (Asymptotics often help with n → ∞, but approximations based on limiting distributions may be bad in finite samples.) The precision of the MC approximation only depends on how many MC replications one is willing to draw from the population.
Now back to the real world: The Monte Carlo procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population. However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples. Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement. Each of these "bootstrap data sets" is created by sampling with replacement, and is the same size as our original dataset. The idea is that the data have an empirical distribution that is "close" to that of the population. As a result some observations may appear more than once in a given bootstrap data set and some not at all.
In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought. For example, if the data is a time series, we cannot simply sample the observations with replacement. We can instead create blocks of consecutive observations, and sample those with replacement. Then we paste together sampled blocks to obtain a bootstrap data set.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly
useful and frequently used in the context of decision trees. Recall that given a set of n independent observations Z1,...,Zn, each with variance σ2 the variance of the mean Z of the observations is given by σ2/n. In other words, averaging a set of observations reduces variance. Of course, this is not practical because we generally do not have access to multiple training sets.
Instead, we can bootstrap, by taking repeated samples from the (single) training data set. In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get fˆ∗b(x), the prediction at a point x. We then average all the predictions to obtain
$$\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x).$$
The above prescription applies immediately to regression trees. For classification trees, we modify the procedure a bit: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
Out-of-Bag Error Estimation: It turns out that there is a very straightforward way to estimate the test error of a bagged model. Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation, which we average. This estimate is essentially the LOO cross-validation error for bagging, if B is large.
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p — that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.
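Both bagging and random forests are available via the randomForest package; with p predictors, mtry = p gives bagging and mtry ≈ √p a random forest (dat and p are hypothetical here).

    library(randomForest)
    bag <- randomForest(y ~ ., data = dat, mtry = p, importance = TRUE)  # bagging
    rf  <- randomForest(y ~ ., data = dat, mtry = floor(sqrt(p)))        # random forest
    rf                   # printed summary includes the OOB error estimate
    varImpPlot(bag)      # variable importance measures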
7.2 Boosting
Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The basic idea is to combine many so-called weak learners (some simple method doing better than guessing, but not too well); the purpose is to create a strong learner this way.
We only discuss boosting for regression trees. Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting for classification is similar in spirit to boosting for regression, but is a bit more complex and we ignore it.
A boosting algorithm for regression trees:

1. Set $\hat f(x) = 0$ and $r_i = y_i$ for all i in the training set.

2. For b = 1, 2, . . . , B, repeat:

a) Fit a tree $\hat f^b$ with d splits (d + 1 terminal nodes) to the training data (X, r).

b) Update $\hat f$ by adding in a shrunken version of the new tree: $\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^b(x)$.

c) Update the residuals, $r_i \leftarrow r_i - \lambda \hat f^b(x_i)$.

3. Output the boosted model,
$$\hat f(x) = \sum_{b=1}^{B} \lambda \hat f^b(x).$$
What is the idea behind this procedure? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, we slowly improve fˆ in areas where it does not perform well. The shrinkage parameter λ
slows the process down even further, allowing more and different shaped trees to attack the residuals.
Tuning parameters for boosting
1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
2. The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. A very small λ can require using a very large value of B in order to achieve good performance.
3. The number of splits d in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.
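These three choices map directly onto the arguments of the gbm package; a hedged sketch for a regression problem (dat hypothetical):

    library(gbm)
    boost <- gbm(y ~ ., data = dat, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 1,  # B and d
                 shrinkage = 0.01, cv.folds = 5)         # lambda; CV to choose B
    best_B <- gbm.perf(boost, method = "cv")             # CV-optimal number of trees
    predict(boost, newdata = dat, n.trees = best_B)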
The last aspect of interest for us here is how to measure the importance of predictors. For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor. Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees.
7.3 Model averaging
Let’s go Bayesian for a moment. Let ζ be a quantity of interest (e.g. forecast function), and Z the training data. For each candidate model Mm, m = 1,...,M, use the Bayes theorem to obtain the posterior
P(ζ|Z,Mm) ∝ p(ζ)L(ζ;Z,Mm),
where the prior p(ζ) may be model-specific as well. Then, we have the “overall” posterior
$$P(\zeta|Z) = \sum_{m=1}^{M} P(\zeta|Z, M_m)\, P(M_m|Z),$$ which averages over models.
For this posterior we have e.g. the posterior mean
$$E(\zeta|Z) = \sum_{m=1}^{M} E(\zeta|Z, M_m)\, P(M_m|Z).$$
This gives an average of the individual forecasts, since E(ζ|Z,Mm) is just the forecast for model Mm! Moreover, the individual weights are proportional to the posterior probability that model Mm is the right one.
We may compute $P(M_m|Z)$ using the Bayes theorem as well; one may alternatively use
$$P(M_m|Z) \approx \frac{\exp\big(-\tfrac{1}{2} BIC(M_m)\big)}{\sum_{l=1}^{M} \exp\big(-\tfrac{1}{2} BIC(M_l)\big)}.$$
The above averaging estimator can be used without the Bayes background. We may approach model averaging as an MSE optimality problem, i.e. determine optimal weights in the average forecast
$$\hat f(x) = \sum_{m=1}^{M} w_m \hat f_m(x)$$
over M different models. To this end, we must minimize
$$E\Big( Y_i - \hat f(X_i) \Big)^2,$$
or pick the $w_m$ as the regression coefficients of a linear regression of Y on the $\hat f_m(X)$. This is never worse than any single forecast, by the way, but unstable if M is large (one may pick $w_m = 1/M$ in this case).
There are other approaches, e.g. complete subset regressions. Simple idea: combine the forecasts from all possible (linear) regression models that keep complexity fixed. Model complexity is measured by the number of included features; this trades off the bias and variance of the forecast errors. Thus, we may pick a number of regressors (much) smaller than n, but at the same time allow for many model configurations.
When it comes to combining classifications, one usually takes a majority vote, but weighting schemes are not unheard of.
8 Unsupervised learning

8.1 Leaving the supervised framework
We now focus on unsupervised learning, where we observe only the features X1, X2, . . . , Xp. We are not interested in prediction, because, e.g., we do not have an associated response variable Y. Rather, the goal of unsupervised learning is to discover interesting things about the measurements: Can we discover subgroups among the variables or among the observations? Is there an informative way to visualize the data?
Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response. But techniques for unsupervised learning are of growing importance in a number of fields: Stocks that have similar behavior, groups of shoppers characterized by their browsing and purchase histories, etc. Also, it is often easier to obtain unlabeled data — from a lab instrument or a computer — than labeled data, which can require human intervention.
We discuss two methods: principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and clustering, a broad class of methods for discovering unknown subgroups in data.
8.2 Principal Components Analysis
PCA produces a low-dimensional representation of a dataset. The idea is to find a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. We can then rank the principal components via their variance. Apart from producing derived variables for use in supervised learning problems (recall PC regression), PCA may also serve as a tool for data visualization.
The first principal component of a set of features $X_1, X_2, \dots, X_p$ is the normalized linear combination of the features
$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p \quad \text{s.t.} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.$$
We refer to the elements $\phi_{11}, \dots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector $\phi_1 = (\phi_{11}\ \phi_{21}\ \dots\ \phi_{p1})'$. (We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.)
Let us get a concrete idea about how this is obtained. Suppose we have an n × p data set X. Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero). We then look for the linear combination of the sample feature values of the form
$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1} x_{ip} \quad (8.1)$$
for i = 1, . . . , n that has largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$. Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any values of $\phi_{j1}$). Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$. Plugging in (8.1), the first principal component loading vector solves the optimization problem
$$\max_{\phi_{11}, \dots, \phi_{p1}} \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.$$
This problem can be solved via a singular-value decomposition of the matrix X, a standard technique in linear algebra. We refer to Z1 as the first principal component, with realized values z11,...,zn1.
The loading vector φ1 with elements φ11, φ21, . . . , φp1 defines a direction in feature space along which the data vary the most. If we project the n data points x1,...,xn onto this direction, the projected values are the principal component scores z11, . . . , zn1 themselves.
The second principal component is the linear combination of $X_1, \dots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$. The second principal component scores $z_{12}, z_{22}, \dots, z_{n2}$ take the form
$$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \dots + \phi_{p2} x_{ip},$$
where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \dots, \phi_{p2}$. It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. And so on.
The principal component directions $\phi_1, \phi_2, \phi_3, \dots$ are the ordered sequence of right singular vectors of the matrix X (equivalently, eigenvectors of X′X), and the variances of the components are 1/n times the squares of the singular values. There are at most min(n − 1, p) principal components.
Another interpretation of Principal Components is that PCA finds the hyperplane closest to the observations. Concretely, the first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness). The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component. For instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance.
The scaling of the variables matters; if the variables are in different units, scaling each to have standard deviation equal to one is recommended.
To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one. The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as
$$\sum_{j=1}^{p} Var(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,$$
and the variance explained by the mth principal component is
$$Var(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.$$
It can be shown that
$$\sum_{j=1}^{p} Var(X_j) = \sum_{m=1}^{M} Var(Z_m), \quad \text{with } M = \min(n - 1, p).$$
Therefore, the PVE of the mth principal component is given by the positive quantity between 0 and 1,
$$PVE_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$
The PVEs sum to one. We sometimes display the cumulative PVEs.
The last and perhaps practically most interesting question refers to how many principal components we should use. No simple answer to this question, as cross-validation is not available for this purpose. The so-called “scree plot” can be used as a guide: we look for an “elbow” in the PVEs. There are more statistical ways of choosing the number of principal components, but they require a statistical model (a so-called factor model) in the background.
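In R, prcomp performs PCA and makes the PVEs and the scree plot immediate (X a hypothetical numeric matrix):

    pr  <- prcomp(X, scale. = TRUE)    # center, scale, then PCA
    pr$rotation[, 1]                   # loading vector phi_1
    pve <- pr$sdev^2 / sum(pr$sdev^2)  # proportion of variance explained
    plot(pve, type = "b")              # scree plot; look for the elbow
    plot(cumsum(pve), type = "b")      # cumulative PVE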
8.3 Cluster analysis
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other. To make this concrete, we must define what it means for two or more observations to be similar or different. This is often a domain-specific consideration that must be made based on knowledge of the data being studied.
PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance – while clustering looks for homogeneous subgroups among the observations.
Suppose we have access to a large number of measurements (e.g. median household income, occupation, distance from nearest urban area, and so forth) for a large number of people. Our goal is to perform market segmentation by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product. The task of performing market segmentation amounts to clustering the people in the data set.
Not surprisingly, there are more than just one clustering methods. In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
Details of K-means clustering: Let $C_1, \dots, C_K$ denote a partition of the available observations into clusters, $C_1 \cup C_2 \cup \dots \cup C_K = \{1, \dots, n\}$ and $C_k \cap C_{k'} = \emptyset$ for all $k \ne k'$. For instance, if the ith observation is in the kth cluster, then $i \in C_k$.
The idea behind K-means clustering is that a good clustering is one for which the within- cluster variation is as small as possible. The within-cluster variation for cluster Ck is a measure WCV(Ck) of the amount by which the observations within a cluster differ from each other. Hence we want to solve the problem
$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} WCV(C_k). \quad (8.2)$$
To define within-cluster variation, we typically use Euclidean distance,
$$WCV(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2, \quad (8.3)$$
where |Ck| denotes the number of observations in the kth cluster.
Combining (8.2) and (8.3) gives the optimization problem that defines K-means clustering,
$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2. \quad (8.4)$$
It's tempting of course to try all possible partitions. But this is time-consuming, and heuristics do better. So consider a greedy version,
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
a) For each of the K clusters, compute the cluster centroid. (The kth cluster centroid
is the vector of the p feature means for the observations in the kth cluster.)
b) Assign each observation to the cluster whose centroid is closest (where closest is
defined using Euclidean distance).
This algorithm is guaranteed to decrease the value of the objective (8.4) at each step, since
$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar x_{kj})^2,$$
where $\bar x_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature j in cluster $C_k$. However, it is not guaranteed to give the global minimum, and clustering results may depend on the starting point.
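This is exactly what kmeans in base R implements; nstart repeats the algorithm from several random assignments to mitigate the local-minimum issue (X hypothetical).

    set.seed(4)
    km <- kmeans(X, centers = 3, nstart = 20)  # K = 3, 20 random starts
    km$cluster                                 # cluster assignment per observation
    km$tot.withinss                            # objective (8.4) at the solution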
Hierarchical Clustering is applicable when you don’t know how many clusters you want to end up with (K-means clustering requires us to pre-specify the number of clusters K, which can be a disadvantage). In this section, we describe bottom-up or agglomerative clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.
The algorithm is as follows:
1. Start with each point in its own cluster.
2. Identify the closest two clusters and merge them.
3. Repeat step 2 until all points are in a single cluster.
The heights in a dendrogram serve as an indication of (dis)similarity (the linkage value at which clusters merge). This helps in selecting the number of clusters.
Note that it matters how we measure linkage:

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
Note also that we do not always have to resort to Euclidean distance. An alternative is correlation-based distance which considers two observations to be similar if their features are highly correlated. This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations.
Finally, scaling of the variables matters, like for PCA.
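A minimal sketch of agglomerative clustering in base R, covering scaling, the linkage choices above, and the correlation-based alternative (X hypothetical):

    Xs <- scale(X)                               # scaling matters, as noted
    hc <- hclust(dist(Xs), method = "complete")  # also "single", "average", "centroid"
    plot(hc)                                     # dendrogram
    cutree(hc, k = 3)                            # cut to obtain 3 clusters
    d_cor <- as.dist(1 - cor(t(Xs)))             # correlation-based distance
    hc2 <- hclust(d_cor, method = "average")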