Outline
- Model Selection
- Occam's Razor
- Quantifying Model Complexity
- Popper's Prediction Strength
- Cross-Validation
- Dataset Bias and the 'Clever Hans' Effect
- Bias-Variance Analysis
1/32
Learning a Model of the Data
Given some observations from an unknown true data distribution p(x, y), we would like to find a model θ such that some distance D(p, p_θ) is minimized. For regression tasks, we can consider the simpler objective:
    min_θ ∫ ( E_p̂[y|x] − f_θ(x) )² · p̂(x) · dx
where p̂ is some guess of the true p, e.g. the empirical distribution.

(Figure: 1. observations → 2. fit a model → 3. make predictions)
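As a minimal illustration of this pipeline (a sketch with assumed synthetic data and an assumed quadratic model family, not part of the original slides), the integral can be approximated under the empirical distribution and f_θ fit by least squares:

import numpy as np

# Hypothetical observations from an assumed true distribution p(x, y)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = np.sin(x) + rng.normal(scale=0.2, size=100)

# Assumed model family: f_theta(x) = theta_0 + theta_1*x + theta_2*x^2
X = np.stack([np.ones_like(x), x, x**2], axis=1)

# With p replaced by the empirical distribution p_hat, the objective becomes
# the mean of squared residuals, minimized here in closed form by least squares.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Make predictions on new inputs
x_new = np.linspace(-2, 2, 5)
print(np.stack([np.ones_like(x_new), x_new, x_new**2], axis=1) @ theta)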
2/32
Model Selection
Questions:
1. Among models that correctly predict the data, which one should be retained?
2. Should we always choose a model that perfectly fits the data?
3/32
Occam’s Razor
William of Ockham (1287–1347)
“Entia non sunt multiplicanda praeter necessitatem”
English translation
“Entities must not be multiplied beyond necessity.”
4/32
Interpreting Occam’s Razor
What advantages do we expect from a model based on few assumptions?
- If two models correctly predict the data, the one that makes fewer assumptions should be preferred because simplicity is desirable in itself.
- If two models correctly predict the data, the one that makes fewer assumptions should be preferred because it is likely to have lower generalization error.
Further reading:
Domingos (1998), Occam's Two Razors: The Sharp and the Blunt.
5/32
Occam’s Razor for Model Selection
(Figure: training error as a function of the number of assumptions; models with unnecessarily many assumptions should be discarded according to Occam's razor.)
6/32
How to Quantify “Few Assumptions”?
Many possibilities:

- Number of free parameters of the model
- Minimum description length (MDL)
- ...
- Size of function class (structural risk minimization)
- VC-Dimension (next week)
7/32
Number of Parameters of the Model
Constant classifier:

    g(x) = C                                  [C: 1 number]                   →  O(1) parameters

Nearest mean classifier:

    g(x) = x^T (m1 − m2) + C                  [m1, m2: 2d numbers; C: 1]      →  O(d) parameters

Fisher vector classifier (on top of k PCA components):

    g(x) = PCA(x)^T S_W^{-1} (m1 − m2) + C    [PCA: k·d; S_W^{-1}: k²; m1, m2: 2k; C: 1]  →  O(k·d) parameters
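To make these counts concrete, here is a minimal sketch (my own illustration on assumed synthetic data) of the nearest mean classifier, whose learned parameters are exactly the two class means plus one bias:

import numpy as np

def fit_nearest_mean(X1, X2):
    # Parameters of the model: two class means (2d numbers) and one bias (1 number)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    w = m1 - m2                              # direction, O(d) parameters
    C = -0.5 * (m1 + m2) @ w                 # bias placing the boundary midway between the means
    return w, C

def predict(x, w, C):
    # g(x) = x^T (m1 - m2) + C ; positive -> class 1, negative -> class 2
    return np.sign(x @ w + C)

# Hypothetical data: two Gaussian blobs in d = 5 dimensions
rng = np.random.default_rng(0)
X1 = rng.normal(loc=+1.0, size=(50, 5))
X2 = rng.normal(loc=-1.0, size=(50, 5))
w, C = fit_nearest_mean(X1, X2)
print(predict(X1[:3], w, C), predict(X2[:3], w, C))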
8/32
Number of Parameters of the Model
Counter-example

- The two-parameter model g(x) = a·sin(ωx) can fit almost any finite dataset in ℝ by setting very large values for ω.
- However, it is also clear that the model will not generalize well.

By only counting the number of parameters in the model, we have not specified the range of values the parameter ω is allowed to take in practice.
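A small demonstration of this phenomenon (a sketch of the classical shattering construction for the sine model, with an arbitrary illustrative labeling; the specific points and encoding are not from the slides): by hiding the desired outputs in the digits of ω, the single frequency parameter reproduces any labeling of the points x_i = 10^(-i).

import numpy as np

# sign(sin(w*x)) can realize ANY labeling of the points x_i = 10^(-i)
# by encoding the labels in the digits of w, even though w is a single number.
n = 6
x = 10.0 ** -np.arange(1, n + 1)
y = np.array([+1, -1, -1, +1, -1, +1])        # an arbitrary target labeling

c = (1 - y) // 2                              # 0 where y = +1, 1 where y = -1
w = np.pi * (np.sum(c * 10.0 ** np.arange(1, n + 1)) + 1)

print(np.sign(np.sin(w * x)))                 # reproduces y exactly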
9/32
Structural Risk Minimization Vapnik and Chervonenkis (1974)
Idea: Structure your space of solutions into a nesting of increasingly large regions. If two solutions fit the data, prefer the solution that belongs to the smaller region.
Example:
Assuming ℝ^d is the whole solution space for θ, build a sequence of real numbers C_1 < C_2 < · · · < C_N, and create a nested collection (S_1, S_2, ..., S_N) of regions, where

    S_n = { θ ∈ ℝ^d : ‖θ‖² ≤ C_n }
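A minimal sketch of this selection scheme (my own illustration; the data, the fit tolerance, and the projected-gradient solver are all assumptions): fit a norm-constrained least-squares model in each region S_n and keep the smallest region whose solution still fits the data.

import numpy as np

def constrained_least_squares(X, y, C, lr=0.01, steps=5000):
    # Projected gradient descent on the mean squared error, subject to ||theta||^2 <= C
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * 2 * X.T @ (X @ theta - y) / len(y)
        norm2 = theta @ theta
        if norm2 > C:                                   # project back onto the region S_n
            theta *= np.sqrt(C / norm2)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.array([1.5, -2.0] + [0.0] * 8) + rng.normal(scale=0.1, size=50)

tolerance = 0.05                                        # assumed "fits the data" threshold
for C in [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]:               # C_1 < C_2 < ... < C_N
    theta = constrained_least_squares(X, y, C)
    err = np.mean((X @ theta - y) ** 2)
    if err <= tolerance:
        print(f"smallest sufficient region: C = {C}, training MSE = {err:.4f}")
        break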
10/32
Connection to Regularization
Example (cont.): For each C_n, we optimize the objective

    min_θ  Σ_{i=1}^N ℓ(y_i, f_θ(x_i))        s.t.  ‖θ‖² < C_n
           (= E_empirical)

and discard solutions with index n larger than necessary to fit the data.
This objective can be equivalently formulated as:
    min_θ  Σ_{i=1}^N ℓ(y_i, f_θ(x_i))  +  λ_n ‖θ‖²
           (= E_empirical)                (= E_reg)

with appropriate (λ_n)_n. This objective is known in various contexts as L2 regularization, ridge regression, large margin, weight decay, etc.
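As an illustration of the penalized form (a sketch with assumed data and λ values, not tied to any particular library), the closed-form ridge solution shows how increasing λ_n shrinks the solution norm, mirroring the shrinking regions S_n:

import numpy as np

def ridge(X, y, lam):
    # Closed-form minimizer of  sum_i (y_i - x_i^T theta)^2 + lam * ||theta||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=50)

# Larger lambda corresponds to a smaller constraint region C_n: the solution norm shrinks.
for lam in [0.0, 1.0, 10.0, 100.0]:
    theta = ridge(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||theta||^2 = {theta @ theta:.3f}")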
11/32
From Occam’s Razor to Popper
Occam’s Razor
“Entities must not be multiplied beyond necessity.”
Falsifiability/prediction strength (S. Hawking, after K. Popper)
“[a good model] must accurately describe a large class of ob- servations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.”
In other words, the model with the lowest generalization error is preferable.
12/32
From Occam’s Razor to Popper
(Figure: test error as a function of the number of assumptions; the best model according to Popper minimizes the test error, while models with unnecessarily many assumptions should be discarded according to Occam's razor.)
13/32
From Occam’s Razor to Popper
(Figure: a second test-error curve over the number of assumptions, contrasting the model that should be discarded according to Occam's razor with the best model according to Popper.)
14/32
The Holdout Selection Procedure
Idea: Predict out-of-sample error by splitting the data randomly into two parts, one for training and one for estimating the error of the model (see the sketch after the figure).
(Figure: the dataset is split into training data and holdout data; each candidate model is trained on the training data, its error is estimated on the holdout data, and the model with the smallest holdout error is selected via argmin.)
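A minimal sketch of the holdout procedure (the candidate models, the 75/25 split, and the data are assumptions made for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# Random split into training data and holdout data
perm = rng.permutation(len(x))
train, hold = perm[:150], perm[150:]

def fit_poly(deg):                     # candidate models: polynomials of degree deg
    return np.polyfit(x[train], y[train], deg)

def holdout_error(coeffs):
    return np.mean((np.polyval(coeffs, x[hold]) - y[hold]) ** 2)

degrees = [1, 3, 5, 7, 9]
errors = [holdout_error(fit_poly(d)) for d in degrees]
best = degrees[int(np.argmin(errors))]             # argmin over candidate models
print(dict(zip(degrees, np.round(errors, 3))), "-> selected degree:", best)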
15/32
Cross-Validation (k-Fold Procedure)
To obtain better error estimates without consuming too much data, the error estimation process can be repeated over different splits and averaged (see the sketch after the figure):
(Figure: the dataset is divided into a k-partition; the holdout selection procedure is repeated with each of the k folds serving once as holdout data while training on the remaining folds, and the resulting error estimates are averaged.)
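A hand-rolled sketch of the k-fold procedure (k, the candidate models, and the data are assumptions; library cross-validation utilities perform the same bookkeeping):

import numpy as np

def kfold_error(x, y, fit, predict, k=5):
    # Average holdout error over k folds: each fold serves once as holdout data.
    idx = np.random.default_rng(0).permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        hold = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errs.append(np.mean((predict(model, x[hold]) - y[hold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=120)
y = np.sin(x) + rng.normal(scale=0.3, size=120)

for deg in [1, 3, 5, 9]:                      # candidate models
    err = kfold_error(x, y,
                      fit=lambda xt, yt, d=deg: np.polyfit(xt, yt, d),
                      predict=np.polyval)
    print(f"degree {deg}: 5-fold CV error = {err:.3f}")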
16/32
The Cross-Validation Procedure
Advantages:

- The model can now be selected directly based on simulated future observations.

Limitations:

- For a small number of folds k, the training data is reduced significantly, which may lead to less accurate models. For k large, the procedure becomes computationally costly.
- This technique assumes that the available data is representative of the future observations (not always true!).
17/32
The Problem of Dataset Bias
(Figure: training data and test data, compared for a Fisher discriminant and a nearest mean classifier.)

This effect can cause the cross-validation procedure to fail, even when we have enough training data.
18/32
The Problem of Dataset Bias
Observation: The classifier has exploited a spurious correlation between images of the class 'horse' and the presence of a copyright tag in the bottom-left corner of the horse images. Cross-validation does not help here because the spurious correlation is also present in the validation set.
Further reading: Lapuschkin et al. (2019) Unmasking Clever Hans predictors and assessing what machines really learn
19/32
Part II. Bias-Variance Analysis of ML Models
20/32
Machine Learning Models
Machine learning models are learned from the data to approximate some truth θ.
A learning machine can be abstracted as a function that maps a dataset D to an estimator θ̂ of the truth θ.
21/32
ML Models and Prediction Error
A good learning machine is one that produces an estimator θ̂ close to the truth θ. Closeness to the truth can be measured by some error function, e.g. the square error:

    Error(θ̂) = (θ − θ̂)²
22/32
Bias, Variance, and MSE of an Estimator
Parametric estimation:
- θ is a value in ℝ^h
- θ̂ is a function of the data D = {X_1, ..., X_N}, where the X_i are random variables producing the data points.

Statistics of the estimator:

    Bias(θ̂) = E[θ̂ − θ]          (measures the deviation of the estimator's mean from the truth)
    Var(θ̂)  = E[(θ̂ − E[θ̂])²]    (measures the scatter around the estimator's mean)
    MSE(θ̂)  = E[(θ̂ − θ)²]       (measures the prediction error)

Note: for θ ∈ ℝ^h, we use the notation θ² = θ^T θ.
23/32
Visualizing Bias and Variance
(Figure: estimates of a parameter scattered around the true parameter, illustrating the combinations of low/high bias and low/high variance.)
24/32
Bias-Variance Decomposition
    Bias(θ̂) = E[θ̂ − θ],    Var(θ̂) = E[(θ̂ − E[θ̂])²],    MSE(θ̂) = E[(θ̂ − θ)²]

We can show that MSE(θ̂) = Bias(θ̂)² + Var(θ̂).
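The identity follows from a standard expansion around E[θ̂], written out here for completeness (in the vector case θ² denotes θ^T θ as above):

\begin{aligned}
\mathrm{MSE}(\hat\theta)
  &= \mathbb{E}\big[(\hat\theta - \theta)^2\big]
   = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)^2\big] \\
  &= \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\mathrm{Var}(\hat\theta)}
   + \underbrace{\big(\mathbb{E}[\hat\theta] - \theta\big)^2}_{\mathrm{Bias}(\hat\theta)^2}
   + 2\,\big(\mathbb{E}[\hat\theta] - \theta\big)\,
     \underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}
   = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta).
\end{aligned}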
25/32
Visualizing Bias and Variance
(Figure: prediction error as a function of model complexity, decomposed into bias and variance; low-variance/high-bias models lie at low complexity, low-bias/high-variance models at high complexity.)
26/32
Example: Parameters of a Gaussian
27/32
Example: Parameters of a Gaussian
The 'natural' estimator of the mean, μ̂ = (1/N) Σ_{i=1}^N X_i, decomposes into

    Bias(μ̂) = 0   and   Var(μ̂) = σ²/N.
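Both statements follow directly from linearity of expectation and independence (assuming the X_i are i.i.d. with mean μ and variance σ²):

\mathbb{E}[\hat\mu] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[X_i] = \mu
\;\Rightarrow\; \mathrm{Bias}(\hat\mu) = 0,
\qquad
\mathrm{Var}(\hat\mu) = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}(X_i) = \frac{\sigma^2}{N}.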
28/32
The James-Stein Estimator
29/32
Estimator of Functions
30/32
Bias-Variance Analysis of the Function Estimator (locally)
31/32
Summary
- Occam's Razor: Given two models with the same training error, the simpler one should be preferred.
- Popper's View: How to make sure that a model predicts well? By testing it on out-of-sample data. Out-of-sample data can be simulated by applying a k-fold cross-validation procedure.
- Bias-Variance Decomposition: The error of a predictive model can be decomposed into bias and variance. The best models often result from a tradeoff between the two terms.
32/32