STAT318/462 — Data Mining
Dr G ́abor Erd ́elyi
University of Canterbury, Christchurch,
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,1 / 1
Assessing Model Accuracy
0 20 40 60 80 100 2 5 10 20
X Flexibility
f (X ) is black, MSETr is grey, MSETe is red, and three models are shown in the other colours (yellow is linear, blue and green are splines).
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,2 / 1
In this example, the regression function is close to being linear. Hence, the linear model (lowest flexibility) has performed well. In contrast, the green model (the most flexible spine) has over-fitted the training data (the testing MSE is high and the training error is less than the irreducible error).
Y
2 4 6 8 10 12
Mean Squared Error
0.0 0.5 1.0 1.5 2.0 2.5
Assessing Model Accuracy
0 20 40 60 80 100 2 5 10 20
X Flexibility
f (X ) is black, MSETr is grey, MSETe is red, and three models are shown in the other colours (yellow is linear, blue and green are splines).
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,3 / 1
In this example, the linear model is under-fitting the training data, failing to capture the structure of non-linear regression function (the testing MSE is high and the training MSE is also high). The splines have both performed well in this example, with the blue spline being the best (simpler (less complex) model and less over- fitting).
−10 0 10 20
Y
Mean Squared Error
0 5 10 15 20
Bias-variance trade-off
Suppose we fit a model to some training data and let (x0,y0) be a test observation.
If the true model is
with regression function f (x) = E (Y |X = x), then the expected test MSE at
X = x0 is
E[y0 −fˆ(x0)]2 =V(fˆ(x0))+[Bias(fˆ(x0))]2 +V(ε). Typically, as the flexibility of fˆ increases, its bias decreases and its variance
increases — a bias-variance trade-off.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,4 / 1
The expected test MSE refers to the average test MSE that we would obtain if we
repeatedly estimated f using a large number of different training data sets (from
the same population). For notational convenience let f (x ) = f and fˆ(x ) = fˆ , 0000
Y =f(X)+ε,
then
E[y −fˆ(x )]2 = E[f +ε−fˆ]2 0000
= E([f −fˆ]2 +2ε(f −fˆ)+ε2) 00 00
= E(f2 −2f fˆ +fˆ2 +2εf −2εfˆ +ε2) 000000
= f2 −2f E(fˆ)+E(fˆ2)+V(ε) 0000
= f2 −2f E(fˆ)+V(fˆ)+(E(fˆ))2 +V(ε) 00000
= V(fˆ)+[E(fˆ)−f ]2 +V(ε) 000
V(fˆ)+[Bias(fˆ)]2 +V(ε) 00
Note: E(fˆ2) = V(fˆ) + (E(fˆ))2 by definition and 2E(εfˆ) = 2E(ε)E(fˆ) = 0 00000
because of independence.
Bias-variance trade-off
MSE Bias Var
2 5 10 20
Flexibility
2 5 10 20
Flexibility
2 5 10 20
Flexibility
The bias-variance decomposition for slides 23,24 and 25. The vertical dashed line indicates the flexibility level corresponding to the smallest test mean-squared error.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,5 / 1
The general take home message here is that as model flexibility increases, the bias tends to decrease and the variance tends to increase.
• The variance of a statistical learning method, V(fˆ(x0)), measures how much fˆ(x0) is likely to change if different training data sets (from the same
population) are used to fit fˆ. Flexible methods fit the training data hard and hence the model can change a lot when different training data are used
V (fˆ(x0)) is large.
• The bias, E (fˆ(x0)) − f (x0), of a statistical learning method comes from trying to approximate a complicated real world problem using a relatively simple function. The bias of flexible models tends to be small because fˆ(x0) ≈ y0 over many training data sets so that
E(fˆ(x0))≈E(Y|X =x)=f(x0).
0.0 0.5 1.0 1.5 2.0 2.5
0.0 0.5 1.0 1.5 2.0 2.5
0 5 10 15 20
Bias-variance trade-off
High Bias Low Variance
Low Bias High Variance
Testing Sample
Training Sample
Low Flexibility High
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining
,6 / 1
The figure is important and should be memorised. Our goal is to find statistical learning methods (for our particular problem) that lie near the minimum of the characteristic U shaped testing error curve.
Prediction Error
Classification Problems
Suppose we have a qualitative response Y (classification problem) that takes values from a finite (unordered) set
and p different predictors,
C = {y1,y2,…,yk},
X = (X1,X2,…,Xp).
Our goal is to build a classifier f (X ) that assigns a class label from C to an unclassified X.
We need to quantify uncertainty in each classification and understand how different predictors affect classifications.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,7 / 1
Assessing Classification Accuracy
The most common approach to assess accuracy is the error rate (zero-one loss function).
Suppose we fit a classifier fˆ(x) to some training data Tr = {xi,yi}ni=1. From this set we compute the training error rate:
1 n
Tr n i i
I(y̸=yˆ),
where I(.) is an indicator variable that equals one if the condition is true (in this
case, if y ̸= yˆ ) and zero otherwise. ii
Suppose we have unseen test data Te = {xi,yi}mi=1. From this set we compute
the testing error rate:
ER =
i=1
1 m
Te m i i
I(y̸=yˆ).
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,8 / 1
Once again we are interested in classifiers with relatively low testing error rates (it’s easy to fit a classifier with low (or zero in some cases) training error).
Revision material (for next slide): Let Ω denote the sample space. If Pr(B) > 0, then the conditional probability of A given B is
Pr(A|B)= Pr(A∩B). Pr(B )
You can think of Pr(A|B) as the fraction of times A occurs among those in which B occurs. Another useful result is the law of total probability:
k
Pr(B)=Pr(B|Ai)Pr(Ai), whereA1,…,Ak isapartitionofΩ(∪ki=1Ai =ΩandAj ∩Ai =∅forj̸=i).
i=1
Bayes Theorem: Let A1,…,Ak be a partition of Ω such that Pr(Ai) > 0 for each i. If Pr(B) > 0,
ER =
i=1
then
Pr(B|Ai )Pr(Ai ) Pr(B|Ai )Pr(Ai ) Pr(Ai |B) = Pr(B) = kj=1 Pr(B|Aj )Pr(Aj )
Bayes theorem allows us to compute Pr(A|B) if we know Pr(B|A).
Is there an ideal classifier?
Just as in the regression setting, there can be multiple classifications at X = x. The ideal f (X ) at X = x is
f(x)=yk if Pr(Y =yk|X =x)=maxPr(Y =yj|X =x) yj ∈C
and is called Bayes classifier. That is, assign each observation to its most likely class given its predictor values.
Bayes classifier is ideal in the sense that it minimizes the expected prediction error with a zero-one loss function.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,9 / 1
Let gk(x) = g(x|Y = k) be the PDF (density) for class k and let πk = Pr(Y = k) be the probability of being in class k (i πk = 1). Then, from Bayes theorem we
have that
gk (x )πk Pr(Y =k|X =x)= g(x)π .
jjj
Hence, Bayes classifier is equivalent to
f (x) = argmaxjgj(x)πj,
where argmaxj means ‘find the j (class) that maximizes this expression’. Hence, having gk(x) is almost equivalent to having the quantity Pr(Y = k|X = x). For a two-class classification problem with Y ∈ {0, 1}, the Bayes decision boundary is the set of points x that satisfy
π0g0(x)=π1g1(x) or Pr(Y =0|X =x)=Pr(Y =1|X =x)=0.5.
That is, the set of points for which there is a 50-50 chance of being in either class. We usually don’t know gk(x) or πk so we’ll have to estimate them (later in the course), but we’ll go through an example on the next slide assuming we do.
Bayes Classifier
oo oooo
o
o oo o o
oo
oo o oooo
oo o
o o o ooo
ooooooo oo
oooo ooo o oooooo o
oo oo oo o o ooooooo
oooooo oooo
oo o o o o ooo o ooo oo o
o oo oooooooo o
o o o ooooo
oo o o oooooo
oo
o
oo oo o
ooo ooo
o ooooooo
oo ooo
oo oooo ooo oo
ooo o o o oo o
o oooo oo o
o
o
o
o
X1
A simulated data set consisting of two groups and the Bayes decision boundary.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,10 / 1
If an unclassified point is in the blue region, it is classified blue. This is a two-class classification problem, so the decision boundary is the set of points x for which
Pr(Y = blue|X = x) = 0.5 (equivalent to π0g0(x) = π1g1(x)) Example: Consider a two-class classification problem with Y ∈ {0, 1} and f0 (x ) = Normal(0, 1),
f1(x) = Normal(1,1), π0 = 0.7 and π1 = 0.3. The Normal density function is
2 1 1 2 f(x)=Normal(μ,σ)=√2πσexp −2σ2(x−μ) .
oo
ooo
X2
Bayes Classifier
Unfortunately, for real-world problems, we do not know the conditional distribution of Y given X so it is impossible to compute Bayes classifier.
We can estimate the conditional distribution using nearest neighbour averaging (as we did for regression), provided we have enough local information.
The simplest method is the k-nearest neighbours (KNN) classifier.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,11 / 1
KNN Classifier
Goal: Classify a test observation x0 to a class from C = {y1,y2,…,yN}. Steps:
1 Choose a positive integer k.
2 Find the k nearest training observations to x0. Call this set of observations N0.
3 Forj=1,2,…,N,estimatetheprobabilityofbeinginclassyj using
Pr(Y =yj|X =x0)=Pj = 1 I(yi =yj).
k
i :xi ∈N0
4 Classify x0 to class yi, where Pi ≥ Pj for j = 1,2,…,N (the majority class in N0).
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,12 / 1
We can measure ‘closeness’ using a variety of methods (more about this later in the course). For now, we can measure closeness using Euclidean distance.
KNN Classifier
oo
oo oo
oooo oo
oo
oooo oo
oo oo
KNN decision boundary when k = 3 is used.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,13 / 1
The estimated probability of x (the test observation) being blue is 2/3, so we classify x blue. The right panel shows the kNN decision boundary for k = 3 (the set of points for which the probability of being blue (or orange) is 0.5). The shaded regions show how the predictor space is classified.
KNN Classifier
KNN: K=10
oo oooo
o
o oo o o
oo
oo o oooo
oo o
o o o ooo o oooooooo
oooo ooo o oooooo o
oo oo oo o o ooooooo
oo ooooo oo oo o o o o
ooo o ooo oo o o oo
oooooooo o o o o ooooo
o
o
o
o
oo o o oooooo
oo
o
oo oo o
oo ooo o
oooo o
oo oooo o oo
oo oo oooo
ooo oo ooo o o o oo o
o oooo oo o
X1
k = 10 gives a very good approximation to the Bayes decision boundary.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,14 / 1
In this example kNN with k = 10 has performed extremely well. There is little difference between the Bayes decision boundary (the best we can do) and the kNN decision boundary.
oo
X2
KNN Classifier
KNN: K=1 KNN: K=100
oo oo ooo ooo
oo
oo oo oo
oooo oooo oo
o oooo o o o oooo o
o
oo oo ooo o oo oo ooo o
ooooooo o ooooooo o
o ooo ooo o o oooo o o
o ooo ooo o oooo o
o o
o oo
ooooooo ooooooo
o oooooooo
oooo oo o o oo o o
o o ooo
ooo ooo
oo oo oooo o
oo oo oooo o
ooo o ooo oo o o
o
oo o
ooo o ooo oo o
o o oo o
o o oooo o oooo
o o oooo o o o oooo
ooo oo ooo ooo
ooo o o oo o o
ooo
oo ooo
ooo ooo oooooo
oo oo oo oo
oo o oooo oo o oooo
oo
ooo ooo
o oo o oo
o o o ooo o o o o ooo o oo oo
oo
o o o o o o o o oo o oo o
oooo oo o oo o
oo
oo oo oo
Voronoi Tessellation (k = 1) over-fit vs. k = 100 under-fit.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,15 / 1
To fit a kNN model, the parameter k needs to be chosen by the user. We can choose the ‘best k’ using test data (if it’s available), where we choose the k value that minimizes the testing error rate. A better approach is called cross-validation, which is covered later in the course (and in Chapter 5 of the course text).
o oo
o
ooo o o oo o o
KNN Classifier
Training Errors Test Errors
0.01 0.02 0.05 0.10 0.20 0.50 1.00 1/K
The KNN training error rate decreases as the number of neighbours decreases and the characteristic U shape is clear in the testing error rate.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,16 / 1
The horizontal dashed line is Bayes error (the testing error of Bayes classifier applied to this problem, which is the best we can do). We plot 1/k (instead of k) so that we have increasing flexibility on the x-axis of our plot (k = 1 is the most flexible kNN method).
• If we choose k too large, we get large testing and training errors (under-fitting the training data). That is, low flexibility, high bias and low variance. Relatively large values of k may fail to capture highly non-linear decision boundaries.
• If we choose k too small, we get low training error and high testing error (over-fitting the training data). That is, high flexibility, low bias and high variance. Relatively small values of k tend to produce an overly complicated decision boundary because the boundary follows the training data too closely.
Despite its simplicity, kNN classification performs extremely well in a variety of settings, for example, character recognition.
Error Rate
0.00 0.05 0.10 0.15 0.20
Classification Problems
Just as in the regression setting, localized classification methods start to break down as dimension increases (curse of dimensionality).
We can make an assumption about the functional form of the decision boundary
e.g. linearly separable;
or make an assumption about class distributions
e.g. normally distributed.
We will consider a variety of parametric and non-parametric classification methods in this course.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,17 / 1