Data mining
Prof. Dr. Matei Demetrescu Summer 2020
Statistics and Econometrics (CAU Kiel)
Today’s outline
Basics of prediction and classification
1 Error quantification
2 Learning for prediction
3 Learning with many features
4 Up next
Error quantification
Outline
1 Error quantification
2 Learning for prediction
3 Learning with many features
4 Up next
There will be errors
Conditionally on X, Y has a distribution and its outcomes (continuous or discrete) are random.
[Figure: a sample from the joint distribution of (X, Y) and the conditional densities of Y given X = −1, 0, 1, 2.]
At the same time, Yˆ is a deterministic function of X and therefore prediction (or classification) errors will appear by construction.
To minimize the (impact of the) errors, we need to quantify them.
Let us discuss prediction first
For continuous variables, the probability of Y = Ŷ is zero, so a good prediction lies as close as possible to Y on average.
We shall measure “closeness” with the help of so-called loss functions.
Definition
A loss function L(·) is a quasi-convex function minimized at 0.
In other words, L is increasing for positive arguments and decreasing for
negative arguments.
The most common loss function is the squared error loss (whose expectation is the MSE).
But asymmetric loss functions may also be encountered, say asymmetric linear or asymmetric quadratic.
Decision theory
[Figure: asymmetric linear and asymmetric quadratic loss functions together with the implied optimal predictions for standard normal Y.]
We pick the prediction that minimizes the expected loss!
$$\hat{Y} = \arg\min_{y} \mathrm{E}\left(L(Y - y)\right).$$
Different losses imply different predictions
for the same prediction problem!
What is the optimal prediction?
Proposition
For the squared error loss, the optimal unconditional prediction is the mean of Y , Yˆ = E (Y ).
This clearly extends to conditional distributions and expectations.
So for prediction we typically focus on conditional means, i.e. the regression function, and take as predictor function
$$f(x) = \mathrm{E}(Y \mid X = x).$$
Of course, for a new data point X, we take Yˆ = f(X).
Most authors, textbooks, and blogs deal directly with the conditional mean without stating this explicitly.
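As a quick numerical check (my own illustration, not from the slides), the following Python sketch evaluates the average squared error loss over a grid of constant predictions; the minimizer is essentially the sample mean:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)   # draws of Y

# Average squared error loss for a grid of constant predictions
candidates = np.linspace(0.0, 4.0, 401)
avg_loss = [np.mean((y - c) ** 2) for c in candidates]

best = candidates[np.argmin(avg_loss)]
print(f"loss-minimizing prediction: {best:.3f}")
print(f"sample mean:                {y.mean():.3f}")   # essentially the same value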
Other loss functions
Mean Absolute Deviation is sometimes an alternative to MSE.¹
Proposition
For the absolute deviation loss, the optimal predictor is the median (which we assume unique).
Analogously, asymmetric linear loss leads to conditional quantiles.
Let
$$L(u) = \begin{cases} -a\,u, & u < 0 \\ \;\;\: b\,u, & u \geq 0 \end{cases} \qquad \text{for } a, b > 0.$$
Then,
$$f(x) = F_{Y|X=x}^{-1}\left(\frac{b}{a+b}\right).$$
Of course, Yˆ = f(X) for new data X.
¹ Usually because of robustness to outliers; but beware of the computational burden.
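A small sketch of my own (with the arbitrary choices a = 1, b = 3) confirming numerically that minimizing the asymmetric linear loss picks out the b/(a + b) quantile:

import numpy as np

def asym_linear_loss(u, a=1.0, b=3.0):
    # L(u) = -a*u for u < 0 and b*u for u >= 0
    return np.where(u < 0, -a * u, b * u)

rng = np.random.default_rng(1)
y = rng.normal(size=50_000)                       # standard normal Y
a, b = 1.0, 3.0

candidates = np.linspace(-2.0, 2.0, 801)
avg_loss = [asym_linear_loss(y - c, a, b).mean() for c in candidates]
best = candidates[np.argmin(avg_loss)]

print(f"loss-minimizing prediction: {best:.3f}")
print(f"b/(a+b) sample quantile:    {np.quantile(y, b / (a + b)):.3f}")   # about 0.67 here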
Measures of location and predictions
In general, we get a specific optimal prediction for each loss L.
Definition (L-measure of location)
The functional $\mu_Y^L = \arg\min_m \mathrm{E}\left(L(Y - m)\right)$ is the L-measure of location of Y.
The optimal prediction Ŷ given X under L is the conditional L-measure of location of Y, $\mu_{Y|X}^L$, and the predictor function is $f(x) = \mu_{Y|X=x}^L$.
Changing the loss function changes the prediction problem!
E.g. $\mu_{Y|X}^L = \mathrm{E}(Y|X)$ only for symmetric distributions and symmetric L.
Classification
Let’s look at the binary case, with 2 possible outcomes, say 0 and 1.²
[Figure: binary outcomes Y (Low/High) plotted against X.]
Here the distribution of Y is Bernoulli, so what matters is the conditional probability that Y = 1 given X.
² In machine learning, some choose −1 and 1 instead.
Classification errors
The so-called confusion matrix summarizes the error possibilities:

            Ŷ = 0      Ŷ = 1
  Y = 0     l_{0,0}    l_{0,1}
  Y = 1     l_{1,0}    l_{1,1}
Like for prediction, you could have any cost attached to errors.
The most common loss function for binary classification is the 0/1 loss (we’ll start with this one):

            Ŷ = 0      Ŷ = 1
  Y = 0       0          1
  Y = 1       1          0
Optimal classifier
Proposition
The optimal binary classifier under 0/1 loss sets Ŷ = 1 if P(Y = 1) > P(Y = 0) and Ŷ = 0 if P(Y = 0) > P(Y = 1).
I.e. the optimal classifier picks the class with the larger (conditional) probability of occurrence.
This extends to several classes, as long as one uses the 0/1 loss.
For binary classification, we may equivalently state
“Set Yˆ = 1 if conditional probability of Y = 1 exceeds 1/2”.
Other loss functions simply imply another threshold for the conditional class probability.
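A minimal sketch of such a classifier, assuming for illustration that the conditional class probability P(Y = 1|X = x) is a logistic function of x (the function names are my own):

import numpy as np

def p_one_given_x(x):
    # assumed conditional probability P(Y = 1 | X = x): a logistic curve
    return 1.0 / (1.0 + np.exp(-x))

def bayes_classify(x, threshold=0.5):
    # optimal classifier under 0/1 loss: predict 1 iff P(Y = 1 | X = x) > threshold
    return (p_one_given_x(x) > threshold).astype(int)

x_new = np.array([-2.0, -0.1, 0.3, 1.5])
print(bayes_classify(x_new))   # -> [0 0 1 1]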
Errors and the confusion matrix
Especially in machine learning, it is customary to work with error rates directly rather than with losses.³
We need some terminology to define them.
                Ŷ = No               Ŷ = Yes
  Y = No        True Negatives       False Positives      Actual No
  Y = Yes       False Negatives      True Positives       Actual Yes
                Predicted No         Predicted Yes        Total
(These error rates are typically given in a yes/no classification situation, and we follow this convention.)
³ The rates often refer to the sample rather than to population quantities, but the following definitions apply in both cases.
Error rates
Accuracy: (True Positives + True Negatives)/Total (How often is the classifier correct?)
Misclassification/Error rate: (False Positives + False Negatives)/Total (How often is the classifier wrong?)
True Positive Rate, Sensitivity or Recall: True Positives/Actual Yes (How often is the classifier correct when truth is yes?)
True Negative Rate or Specificity: True Negatives/Actual No (How often is the classifier correct when truth is no?)
False Positive Rate: False Positives/Actual No (How often is the classifier wrong when truth is no?)
Precision: True Positives/Predicted Yes
(How often is the classifier correct when prediction is yes?)
Prevalence: Actual Yes/Total
(How often does yes occur in the population or the sample?)
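These rates are simple ratios of confusion-matrix counts; a small helper function (a sketch of my own, hypothetical names) could compute them as follows:

def classification_rates(tp, fp, fn, tn):
    # error rates computed from confusion-matrix counts ("yes" = positive class)
    total = tp + fp + fn + tn
    actual_yes, actual_no = tp + fn, tn + fp
    predicted_yes = tp + fp
    return {
        "accuracy":    (tp + tn) / total,
        "error rate":  (fp + fn) / total,
        "sensitivity": tp / actual_yes,      # true positive rate / recall
        "specificity": tn / actual_no,       # true negative rate
        "fpr":         fp / actual_no,       # false positive rate
        "precision":   tp / predicted_yes,
        "prevalence":  actual_yes / total,
    }

print(classification_rates(tp=40, fp=10, fn=5, tn=45))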
The Receiver Operating Characteristic (ROC) curve
Accuracy is important, but can be misleading if prevalence is either very high or very low.
Sensitivity and specificity should be high, but there’s a trade-off:
This trade-off is controlled by the threshold (say τ) for classification based on conditional class probability.
lower τ means more Ŷ = 1 (higher sensitivity)
… but more of them will be false positives (lower specificity);
higher τ means fewer Ŷ = 1 and thus fewer false positives (higher specificity)
… but at the same time lower sensitivity.
The “optimal” τ depends on the loss function (τ = 1/2 for 0/1 loss).
Plotting sensitivity (TPR) against specificity (TNR) for all τ ∈ (0, 1) gives the so-called Receiver Operating Characteristic (ROC) curve.
(In fact, for historical reasons, TPR is plotted against FPR = 1 − TNR.)
ROC & AUC
The ROC is informative about the classification performance of a model (population or fitted) for the conditional class probability.
A pure chance classifier (i.e. Yˆ independent of Y ) has TPR = FPR.
So the further the ROC is away from the gray diagonal, the better!
The “Area Under the Curve” quantifies this:
0 < AUC < 1
AUC = 1/2 for pure chance.
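A sketch of how the ROC curve and the AUC can be traced out by sweeping the threshold τ, assuming predicted conditional class probabilities and labels are available (toy data, my own code and names):

import numpy as np

def roc_curve_and_auc(p_hat, y, n_thresholds=501):
    # TPR and FPR over a grid of thresholds tau, plus the trapezoidal AUC
    taus = np.linspace(1.0, 0.0, n_thresholds)
    tpr = np.array([np.mean(p_hat[y == 1] > t) for t in taus])
    fpr = np.array([np.mean(p_hat[y == 0] > t) for t in taus])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # area under TPR vs FPR
    return fpr, tpr, auc

# toy data: Y is Bernoulli with a logistic conditional probability in X
rng = np.random.default_rng(2)
x = rng.normal(size=5_000)
p = 1.0 / (1.0 + np.exp(-2.0 * x))
y = rng.binomial(1, p)

fpr, tpr, auc = roc_curve_and_auc(p, y)   # here: classify with the true probabilities
print(f"AUC: {auc:.3f}")                  # clearly above 1/2 (pure chance)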
[Figure: ROC curve, True Positive Rate plotted against False Positive Rate, with the diagonal of a pure chance classifier.]
Learning for prediction
Outline
1 Error quantification
2 Learning for prediction
3 Learning with many features
4 Up next
Recall classical statistics
Say you have a parametric regression model,
$$Y = f(X, \theta) + \varepsilon,$$
where E(ε|X) = 0. Estimate using squared error loss,
$$\hat{\theta} = \arg\min_{\theta^*} \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - f(X_i, \theta^*) \right)^2.$$
Proposition
Regularity conditions assumed, $\hat{\theta} \to_p \theta$.
Since E(Y|X) = f(X, θ), plug in $\hat{\theta}$ to obtain
$$\hat{f}(x) = f\big(x, \hat{\theta}\big) \qquad \text{and} \qquad \hat{Y} = f\big(X, \hat{\theta}\big).$$
Consistency of $\hat{\theta}$ implies that $\hat{f}(x) \to_p f(x)$ and $\hat{Y} \to_p \mathrm{E}(Y|X)$.
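To make the plug-in idea concrete, here is a sketch of my own (with an assumed exponential-type f and simulated data) that minimizes the empirical squared error loss over θ and then predicts at a new point:

import numpy as np
from scipy.optimize import minimize

def f(x, theta):
    # assumed parametric regression function f(x, theta)
    return theta[0] + theta[1] * np.exp(theta[2] * x)

rng = np.random.default_rng(3)
theta_true = np.array([1.0, 2.0, -0.5])
x = rng.uniform(0.0, 3.0, size=500)
y = f(x, theta_true) + rng.normal(scale=0.3, size=x.size)   # E(eps | X) = 0

# estimate theta by minimizing the empirical squared error loss
emp_loss = lambda th: np.mean((y - f(x, th)) ** 2)
theta_hat = minimize(emp_loss, x0=[0.0, 1.0, -1.0],
                     method="Nelder-Mead", options={"maxiter": 5_000}).x

x_new = 1.5                                     # plug-in prediction at a new point
print("theta_hat:", np.round(theta_hat, 2))     # close to theta_true up to sampling noise
print("Y_hat at x = 1.5:", round(f(x_new, theta_hat), 3))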
Prediction errors
The prediction errors at a (new) data point (Y,X) are
$$Y - \hat{Y} = \varepsilon - \left( f\big(X, \hat{\theta}\big) - f(X, \theta) \right).$$
Since $\hat{\theta} \to_p \theta$, we may linearize f in θ (at the given X) and obtain
$$\hat{\varepsilon} = \varepsilon - \big(\hat{\theta} - \theta\big)' \nabla f + O_p\!\left(\tfrac{1}{n}\right),$$
where ∇f is the gradient of f at the population θ and the given X. We thus have estimation error and irreducible error.
The estimation error can be made arbitrarily small if there are enough data.
Misspecification
Say you want to fit the same parametric f(x,θ), but the actual regression function is different, say E (Y |X ) = m(X ).
Proposition
Let $\theta_p = \arg\min_{\theta^*} \mathrm{E}\left( \left( f(X, \theta^*) - m(X) \right)^2 \right)$ (we call $\theta_p$ the pseudo-true value). Regularity conditions assumed, $\hat{\theta} \to_p \theta_p$.
Regarding prediction errors,
$$Y - \hat{Y} \approx \varepsilon + \left( m(X) - f(X, \theta_p) \right) - \big(\hat{\theta} - \theta_p\big)' \nabla f,$$
where ∇f is the gradient of f at X for the pseudo-true $\theta_p$.
We thus have model bias as an additional source of error.
Example: Fitting linear regression instead of quadratic
[Figure: data generated from a quadratic regression function with fitted linear regressions, for two different marginal distributions of X.]
Note that the bias depends on the marginal distribution of X!
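A sketch in the same spirit (my own code, with an assumed quadratic m and two assumed uniform designs for X): fitting the misspecified linear model under different marginal distributions of X yields visibly different pseudo-true lines:

import numpy as np

rng = np.random.default_rng(4)
m = lambda x: 1.0 + x ** 2                  # assumed true regression function (quadratic)

def fit_linear(x, y):
    # OLS fit of the misspecified linear model b0 + b1 * x
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

n = 100_000
for name, x in [("X ~ U(0, 1)", rng.uniform(0.0, 1.0, n)),
                ("X ~ U(1, 2)", rng.uniform(1.0, 2.0, n))]:
    y = m(x) + rng.normal(scale=0.1, size=n)
    b0, b1 = fit_linear(x, y)
    print(f"{name}: fitted line {b0:.2f} + {b1:.2f} x")
# The two (pseudo-true) lines differ, so the model bias m(X) - f(X, theta_p)
# depends on where the marginal distribution of X puts its mass.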
Sideline remark: Empirical risk minimization
It is important to realize that you get what you fit! Say you use a different loss function,
$$\hat{\theta} = \arg\min_{\theta^*} \frac{1}{n} \sum_{i=1}^{n} L\left( Y_i - f(X_i, \theta^*) \right).$$
Proposition
Regularity conditions assumed, $f\big(x, \hat{\theta}\big) \to_p \mu_{Y|X=x}^L$.
E.g. fitting under MAD gives the (fitted) conditional median as predictor.
So you should typically fit predictive models using the same loss function that is used to evaluate the errors.
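A small illustration of “you get what you fit” (my own sketch, with skewed simulated errors): minimizing the empirical risk under the absolute loss and under the squared loss gives different fitted lines:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 2.0, size=2_000)
eps = rng.exponential(1.0, size=x.size) - 1.0     # skewed errors: mean 0, median ln(2) - 1 < 0
y = 0.5 + 1.0 * x + eps

f = lambda x, th: th[0] + th[1] * x

def fit(loss):
    # empirical risk minimization under the given loss function
    objective = lambda th: np.mean(loss(y - f(x, th)))
    return minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead").x

theta_sq  = fit(lambda u: u ** 2)   # squared loss
theta_abs = fit(np.abs)             # absolute loss

print("squared loss fit :", np.round(theta_sq, 2))    # intercept near 0.5 (conditional mean)
print("absolute loss fit:", np.round(theta_abs, 2))   # intercept shifted by median(eps) = ln(2) - 1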
Flexibility I
Some parametric families of functions are universal approximators. I.e., as dim θ → ∞,
Polynomials, Fourier series, splines, and also
regression trees and neural networks,
can approximate smooth functions arbitrarily well.
So choosing a “large” dim θ gives the model enough flexibility to accommodate arbitrary unknown regression functions.
The analogous finding holds for classifiers.
Free lunch anybody?
Flexibility II
The problem is that
$$\mathrm{Var}(\hat{\varepsilon}) \approx \mathrm{Var}(\varepsilon) + \nabla f' \, \mathrm{Cov}\big(\hat{\theta}\big) \, \nabla f,$$
where
$$\nabla f' \, \mathrm{Cov}\big(\hat{\theta}\big) \, \nabla f \leq \|\nabla f\|^2 \, \big\|\mathrm{Cov}\big(\hat{\theta}\big)\big\|.$$
In the worst-case scenario, $\big\|\mathrm{Cov}\big(\hat{\theta}\big)\big\| = O(\dim \theta)$!
You buy bias reduction by increasing the variance!
The bias-variance trade-off is really everywhere...
Towards model selection
One minimizes prediction MSE by balancing bias and estimation error. This is a critical step, since both influence the prediction MSE, and it can be seen as a model selection issue.
In statistical learning we have new names:
Bias dominates: underfitting (oversmoothing)
Variance dominates: overfitting (undersmoothing)
Choosing the right amount of smoothing is sometimes method-specific, so we deal with it on suitable occasions.
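For a first flavour of under- and overfitting (a self-contained sketch of my own, not from the slides): polynomials of increasing degree fitted to a noisy sine curve, compared by training and test MSE:

import numpy as np

rng = np.random.default_rng(6)
f_true = lambda x: np.sin(2 * np.pi * x)

def simulate(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, f_true(x) + rng.normal(scale=0.4, size=n)

x_tr, y_tr = simulate(30)        # small training sample
x_te, y_te = simulate(2_000)     # large test sample

for degree in (1, 3, 10, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)              # least-squares polynomial fit
    mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
# Typically: low degree -> both errors large (bias dominates, underfitting);
# high degree -> training error keeps shrinking while test error deteriorates (overfitting).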
Learning with many features
Outline
1 Error quantification
2 Learning for prediction
3 Learning with many features
4 Up next
More information is a good thing, right?
Intuitively,
The more features you have, the better!
E.g. chances increase that a powerful predictor is among them.
However, you need enough data to be able to incorporate the extra features in your model.
Clearly, additional features (and correspondingly parameters) for the same amount of data increase estimation variance.
This is not much of a problem in classical parametric setups,
... but becomes one when you deal with flexible models.
The curse of dimensionality
The curse of dimensionality is a generic term used to describe situations where it is very costly to exploit additional features.
It originated in dynamic optimization, where adding one more dimension to a grid search multiplies its cost, so that the total cost increases exponentially with the number of dimensions.
Its exact form depends on the model, but the source is the combination of high dimensionality with high flexibility.
Example: local regression
With f the conditional mean of Y, take as fitted predictor function the local average
$$\hat{f}(x) = \frac{\sum_{i=1}^{n} Y_i \cdot I(\|X_i - x\| \leq \varepsilon)}{\sum_{i=1}^{n} I(\|X_i - x\| \leq \varepsilon)}.$$
If $Y_i = f(X_i) + \varepsilon_i$ with f the regression function, then
$$f(X_i) = f(x) + (X_i - x)' \nabla f + o(\varepsilon),$$
such that
$$\hat{f}(x) \approx f(x) + \frac{\sum_{i=1}^{n} \varepsilon_i \cdot I(\|X_i - x\| \leq \varepsilon)}{\sum_{i=1}^{n} I(\|X_i - x\| \leq \varepsilon)} + \text{Bias},$$
where Bias = O(ε) and the variance of the random term is inversely proportional to the average number of data points with $\|X_i - x\| \leq \varepsilon$.
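A direct sketch of this local average in code (one-dimensional X and an assumed sine regression function, my own choices):

import numpy as np

def local_average(x0, x, y, eps):
    # average of y over observations with |x_i - x0| <= eps
    mask = np.abs(x - x0) <= eps
    return y[mask].mean() if mask.any() else np.nan   # NaN if no neighbours in the window

rng = np.random.default_rng(7)
f = lambda x: np.sin(3 * x)                           # assumed regression function
x = rng.uniform(0.0, 2.0, size=1_000)
y = f(x) + rng.normal(scale=0.3, size=x.size)

for eps in (0.02, 0.1, 0.5):
    print(f"eps = {eps:4.2f}: f_hat(1.0) = {local_average(1.0, x, y, eps): .3f}"
          f"   (true f(1.0) = {f(1.0):.3f})")
# Small eps: few neighbours, noisy estimate (variance); large eps: smoother but biased.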
Local isn’t local in high dimensions I
One nasty thing about higher-dimensional space is that the average distance between the elements of a random sample increases with the number of dimensions.
[Figure: uniform random samples, one vs. two dimensions.]
Local isn’t local in high dimensions II
So if p is large,
There will be only very few data points within a radius ε of x!
You would then need to increase ε to match the estimation variance of a lower-dimensional problem,
... which increases bias.
Parametric models (with low dim θ)
don’t have this problem, as they
tend to have a global, relatively “rigid” shape.
E.g. linear regression learns (more or less) the same about slope coefficients from a data point, no matter where it lies.
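A quick numerical sketch of the phenomenon (my own): for uniform samples on the unit cube, the average distance to the nearest neighbour grows with the dimension p:

import numpy as np

rng = np.random.default_rng(8)
n = 300

for p in (1, 2, 5, 10, 50):
    X = rng.uniform(size=(n, p))                       # n points uniform on [0, 1]^p
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(d, np.inf)                        # ignore distance to oneself
    print(f"p = {p:2d}: average nearest-neighbour distance {d.min(axis=1).mean():.3f}")
# A ball of fixed radius eps around x contains fewer and fewer points as p grows.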
The curse is not only for prediction!
Classification models also suffer from the curse of dimensionality.
And unsupervised learning has its dimensionality issues as well.
Imagine a clustering task with n data points and p = n features.
Say for simplicity that the n features are binary 0/1,
... and X_{i,j} = 1 for i = j only.
Unless you pay attention, most algorithms will immediately find n different clusters.
In all these cases, dimensionality reduction may help deal with the curse. (And so does model selection.)
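A tiny sketch of my own of the pathological case above: with X the n × n identity matrix, all pairwise distances equal √2, so a distance-based clustering algorithm has nothing to group on:

import numpy as np

n = 10
X = np.eye(n)                                          # X[i, j] = 1 only for i == j

d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
off_diag = d[~np.eye(n, dtype=bool)]
print(np.unique(np.round(off_diag, 10)))               # every pair is exactly sqrt(2) apart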
Up next
Outline
1 Error quantification
2 Learning for prediction
3 Learning with many features
4 Up next
Coming up
Linear prediction