
WiSe 2021/22
Machine Learning 1/1-X
Lecture 5 Model Selection


- Model Selection
- Occam's Razor
- Quantifying Model Complexity
- Popper's Prediction Strength
- Holdout / Cross-Validation
- Limits of Holdout / Cross-Validation, the 'Clever Hans' Effect
- Bias-Variance Analysis

Learning a Model of the Data
Assume we have a few labeled examples
(x1, y1), ..., (xN, yN)
from some unknown true data distribution p(x, y). We would like to learn a model f(x) that not only predicts well on x1, ..., xN but also on future data points drawn from p(x).
[Figure: 1. observations → 2. fit a model → 3. make predictions]

Model Selection
Questions:
1. Among models that correctly predict the data, which one should be retained?
2. Should we always choose a model that perfectly fits the data?

Occam’s Razor
William of Ockham (1287–1347)
“Entia non sunt multiplicanda praeter necessitatem”
English translation
"Entities must not be multiplied beyond necessity."
Machine learning reformulation:
"Model complexity must not be increased beyond what is necessary to correctly predict the data."

Occam’s Razor for Model Selection
[Figure: training error as a function of complexity; models more complex than necessary to correctly predict the data should be discarded according to Occam's razor.]

Two Interpretations of Occam’s Razor
What do we gain from restricting model complexity?
1. If two models correctly predict the data, the least complex one should
be preferred because simplicity is desirable in itself.
2. If two models correctly predict the data, the least complex one should
be preferred because it is more likely to correctly predict new data points.
In this lecture, we focus on (2).
Further reading: Domingos (1998), Occam's Two Razors: The Sharp and the Blunt.

Quantifying Complexity
Quantifying the complexity of a model is highly non-trivial, and many proposals have been made:
1. Counting the parameters of the model
2. Size of the function class, aka. structural risk minimization (SRM)
3. Bayesian information criterion (BIC)
4. Minimum description length (MDL)
5. Smoothness / Lipschitzness
In today's lecture, we discuss (1) and (2).

Approach 1: Counting the Parameters
- If several models predict the data sufficiently well, prefer the one with the fewest parameters.
Examples of models:

- Constant classifier: g(x) = C
  (1 parameter: C)
- Means difference: g(x) = x⊤(m1 − m2) + C
  (parameters: the two class means m1, m2 ∈ R^d and the offset C)
- PCA + Fisher: g(x) = PCA(x)⊤ S⁻¹ (m1 − m2) + C
  (k·d parameters for the PCA projection, k² for S⁻¹, 2k for the two means, 1 for C)

Approach 1: Counting the Parameters
Counter-example:
- The model g(x) = a sin(ωx) has only two parameters but can fit almost any finite dataset in R. This can be achieved by setting ω very large.
- However, the resulting high-frequency sinusoid will not work well at predicting new data points (see the sketch below).
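A minimal numerical sketch of this effect (hypothetical data, NumPy only): the frequency ω of g(x) = a·sin(ωx) is grid-searched to fit a handful of random training points, and the same model is then evaluated on fresh points from the same distribution. Allowing a larger range for ω tends to drive the training error down, while the test error does not improve.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "dataset": inputs in [0, 1], targets unrelated to the inputs.
x_train, y_train = rng.uniform(0, 1, 4), rng.uniform(-1, 1, 4)
x_test,  y_test  = rng.uniform(0, 1, 100), rng.uniform(-1, 1, 100)

def fit_sine(x, y, omega_max, n_grid=200_000):
    """Grid-search omega; for each omega the best amplitude a has a
    closed form (least squares on the single feature sin(omega * x))."""
    omegas = np.linspace(1.0, omega_max, n_grid)
    S = np.sin(np.outer(omegas, x))                       # shape (n_grid, n_points)
    a = (S @ y) / np.maximum((S ** 2).sum(axis=1), 1e-12)
    train_mse = ((a[:, None] * S - y) ** 2).mean(axis=1)
    best = np.argmin(train_mse)
    return a[best], omegas[best]

for omega_max in (10, 1_000, 100_000):
    a, omega = fit_sine(x_train, y_train, omega_max)
    mse = lambda x, y: ((a * np.sin(omega * x) - y) ** 2).mean()
    print(f"omega_max={omega_max:>7}: train MSE={mse(x_train, y_train):.4f}, "
          f"test MSE={mse(x_test, y_test):.4f}")
```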

Approach 1: Counting the Parameters
Another counter-example:
- The model g(x) = (1/K) Σ_{k=1}^K α_k tanh((x − θ_k)/α_k) has a large number of parameters (2·K) but is much more rigid than the sinusoid (e.g. the function never exceeds a slope of 1, and can only increase); a numerical check follows below.
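The rigidity claim can be checked numerically; the following sketch (randomly chosen α_k and θ_k, assumed positive scales, NumPy only) estimates the slope of g on a fine grid and confirms it stays within (0, 1], no matter how many parameters the model has.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50                                        # 2*K parameters in total
alpha = rng.uniform(0.1, 5.0, K)              # scales (assumed positive here)
theta = rng.uniform(-10.0, 10.0, K)           # offsets

def g(x):
    # g(x) = (1/K) * sum_k alpha_k * tanh((x - theta_k) / alpha_k)
    return np.mean(alpha * np.tanh((x[:, None] - theta) / alpha), axis=1)

x = np.linspace(-20, 20, 100_001)
slope = np.diff(g(x)) / np.diff(x)            # numerical derivative

print("min slope:", slope.min())              # > 0  -> g can only increase
print("max slope:", slope.max())              # <= 1 -> g never exceeds a slope of 1
```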

Approach 2: Structural Risk Minimization (SRM)
Structural risk minimization (Vapnik and Chervonenkis, 1974) is another approach to measure complexity and perform model selection.
- Structure the space of solutions into a nesting of increasingly large regions.
- If two solutions fit the data, prefer the solution that also belongs to the smaller regions.
Particular choices of ··· ⊆ S_{n−1} ⊆ S_n ⊆ S_{n+1} ⊆ ··· lead to upper bounds on the generalization error of a model (cf. next lecture).

Approach 2: Structural Risk Minimization (SRM)
Example:
- Assume you would like to build an estimator μ̂ of the mean μ ∈ R^d of your distribution, based on some iid. sample x1, ..., xN ∈ R^d.
- To apply the SRM principle, we first create a nested sequence
  ··· ⊆ S_{n−1} ⊆ S_n ⊆ S_{n+1} ⊆ ···, e.g. where
  S_n = {θ ∈ R^d : ∥θ∥ ≤ C_n}

Approach 2: Structural Risk Minimization (SRM)
Example (cont.):
- The maximum likelihood estimator belonging to the set S_n can be found by solving

      μ̂ = arg min_θ Σ_{i=1}^N ∥θ − x_i∥²   s.t.   θ ∈ S_n

  (shown in the figure as a red dot for various choices of S_n).
- Choosing the μ̂ associated with smaller sets S_n leads to better estimators of the true parameter μ (cf. the James-Stein estimator discussed later). A small numerical sketch follows.
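A small sketch of this constrained estimator (hypothetical data, NumPy only): since Σ_i ∥θ − x_i∥² is minimized over the ball ∥θ∥ ≤ C_n by projecting the sample mean onto that ball, the SRM estimator is simply a shrunk version of the empirical mean.

```python
import numpy as np

def srm_mean(X, C):
    """Argmin of sum_i ||theta - x_i||^2 subject to ||theta|| <= C.
    The unconstrained minimizer is the sample mean; the constraint is
    handled by projecting it onto the ball of radius C."""
    mean = X.mean(axis=0)
    norm = np.linalg.norm(mean)
    return mean if norm <= C else (C / norm) * mean

rng = np.random.default_rng(0)
mu = np.array([3.0, 4.0])                      # true mean (hypothetical)
X = rng.normal(mu, 1.0, size=(20, 2))          # iid sample x_1, ..., x_N

for C in (1.0, 3.0, 10.0):                     # increasingly large sets S_n
    print(f"C = {C:5.1f}  ->  mu_hat = {srm_mean(X, C)}")
```

As C_n grows, the constraint eventually becomes inactive and μ̂ reduces to the plain sample mean.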

From Occam's Razor to Popper's Razor
“Entities must not be multiplied beyond necessity.”
Falsifiability/prediction strength (S. Hawking, after K. Popper)
"[A good model] must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations."
In other words, the model with the lowest generalization error is preferable.

From Occam’s Razor to Popper
[Figure: test error as a function of complexity, marking the best model according to Popper and the region that should be discarded according to Occam's razor.]

From Occam’s Razor to Popper
[Figure: a second test-error curve over complexity, again marking the best model according to Popper and the region that should be discarded according to Occam's razor.]

The Holdout Selection Procedure
Idea: Predict the out-of-sample error by randomly splitting the dataset into two parts: one for training, and one (the holdout data) for estimating the error of the model.
[Figure: the dataset is split into training data and holdout data; each candidate model is trained on the former and its error is measured on the latter.]
Problem: The more data we use to accurately estimate the prediction error of each model, the less data is available for training an accurate model in the first place.

Cross-Validation (k-Fold Procedure)
Idea: Retain more data for training, and make up for the lower amount of holdout data by repeating the model selection procedure over multiple data splits (and averaging the selected models):
[Figure: the dataset is split into a k-partition; the holdout-based model selection is repeated k times, each partition serving once as the holdout data.]
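A minimal NumPy sketch of the k-fold procedure (hypothetical data and candidate models): each candidate is scored by its average holdout error over the k splits, and the candidate with the lowest average is selected.

```python
import numpy as np

def kfold_cv_error(x, y, fit, predict, k=5):
    """Average holdout mean squared error of one model over k splits."""
    idx = np.random.default_rng(0).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        holdout = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = fit(x[train], y[train])
        errors.append(np.mean((predict(x[holdout], params) - y[holdout]) ** 2))
    return np.mean(errors)

# Candidate models: polynomials of increasing degree (increasing complexity).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.2 * rng.normal(size=60)   # assumed ground truth + noise

cv = {d: kfold_cv_error(x, y,
                        fit=lambda xs, ys, d=d: np.polyfit(xs, ys, d),
                        predict=lambda xs, p: np.polyval(p, xs))
      for d in (1, 3, 6, 9)}

for d, err in cv.items():
    print(f"degree {d}: CV error = {err:.3f}")
print("selected degree:", min(cv, key=cv.get))
```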

Holdout / Cross-Validation
Advantages:
- The model can now be selected directly based on simulated future observations (implements Popper's principle for model selection).
Limitations:
- For a small number of folds k, the training data is reduced significantly, which may lead to less accurate models. For large k, the procedure becomes computationally costly.
- This technique assumes that the available data is representative of the future observations (not always true!).

Limits of Holdout / Cross-Validation
Example: PASCAL VOC 2007 dataset
- Standard benchmark for image classification in the early 2010s.
- Large collection of images annotated with bounding boxes associated with twenty different object categories (airplane, boat, horse, person, ...).
- Holdout / cross-validation is commonly used on this dataset for model selection and evaluation. But does a low predicted error guarantee that the model will truly perform well?

Limits of Holdout / Cross-Validation
- Explainable AI (presented later this semester) highlights the features (e.g. pixels) used by the model to support its decision.
- The explanation reveals that the classifier at hand exploits a spurious correlation between the class 'horse' and the presence of a copyright tag in the corner of the horse images (aka. the 'Clever Hans' effect).
- The spurious correlation only exists in the dataset, not in the real world.
- Holdout / cross-validation cannot detect this flawed strategy and can result in poor model selection.
Further reading: Lapuschkin et al., Unmasking Clever Hans predictors and assessing what machines really learn, Nature Communications, 2019.

Part II. Bias-Variance Analysis of ML Models

Machine Learning Models
Machine learning models are learned from the data to approximate some truth θ.
A learning machine can be abstracted as a function that maps a dataset D to an estimator θ̂ of the truth θ.

ML Models and Prediction Error
A good learning machine is one that produces an estimator θ̂ close to the truth θ. Closeness to the truth can be measured by some error function, e.g. the squared error:
Error(θ̂) = (θ − θ̂)².

Bias, Variance, and MSE of an Estimator
Parametric estimation:
- θ is a value in R^h
- θ̂ is a function of the data D = {X1, ..., XN}, where the Xi are random variables producing the data points.
Statistics of the estimator:
Bias(θ̂) = E[θ̂ − θ]          (measures the expected deviation of the estimator from the truth)
Var(θ̂)  = E[(θ̂ − E[θ̂])²]    (measures the scatter of the estimator around its mean)
MSE(θ̂)  = E[(θ̂ − θ)²]       (measures the prediction error)
Note: for θ ∈ R^h, we use the notation θ² = θ⊤θ.

Visualizing Bias and Variance
[Figure: parameter estimates scattered around the true parameter, illustrating the four combinations of low/high bias and low/high variance.]

Bias-Variance Decomposition
Bias(θ̂) = E[θ̂ − θ],  Var(θ̂) = E[(θ̂ − E[θ̂])²],  MSE(θ̂) = E[(θ̂ − θ)²]

Exercise: Show that MSE(θ̂) = Bias(θ̂)² + Var(θ̂).

MSE(θ̂) = E[(θ̂ − θ)²]
        = E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²]
        = E[(θ̂ − E[θ̂])² + (E[θ̂] − θ)² + 2 (θ̂ − E[θ̂]) · (E[θ̂] − θ)]
        = E[(θ̂ − E[θ̂])²] + (E[θ̂ − θ])² + 2 E[θ̂ − E[θ̂]] · (E[θ̂] − θ)
        = Var(θ̂) + Bias(θ̂)² + 0
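A quick Monte Carlo check of this identity (assumed setup, NumPy only): datasets are drawn repeatedly from a Gaussian with known mean, a deliberately biased shrinkage estimator θ̂ = λ·(sample mean) is computed on each, and the empirical MSE is compared with Bias² + Var.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, lam = 2.0, 1.0, 10, 0.8        # true parameter, noise, shrinkage factor
R = 200_000                                     # number of simulated datasets

# One estimate per simulated dataset: theta_hat = lam * sample mean.
theta_hat = lam * rng.normal(theta, sigma, size=(R, N)).mean(axis=1)

bias = theta_hat.mean() - theta
var = theta_hat.var()
mse = ((theta_hat - theta) ** 2).mean()

print(f"MSE          = {mse:.5f}")
print(f"Bias^2 + Var = {bias**2 + var:.5f}")    # matches up to Monte Carlo noise
```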

Visualizing Bias and Variance
[Figure: prediction error as a function of model complexity, decomposed into bias and variance; low complexity gives low variance / high bias, high complexity gives low bias / high variance.]

Example: Parameters of a Gaussian

Example: Parameters of a Gaussian
Exercise: Show that the bias and variance of the mean estimator μ̂ = (1/N) Σ_{i=1}^N X_i are given by

    Bias(μ̂) = 0   and   Var(μ̂) = σ²/N.

Bias(μ̂) = E[μ̂ − μ]
        = E[(1/N) Σ_{i=1}^N X_i] − μ
        = (1/N) Σ_{i=1}^N E[X_i] − μ
        = (1/N) Σ_{i=1}^N μ − μ
        = 0

Var(μ̂) = Var[(1/N) Σ_{i=1}^N X_i]
       = (1/N²) Var[Σ_{i=1}^N X_i]
       = (1/N²) Σ_{i=1}^N Var[X_i]      (by independence of the X_i)
       = (1/N²) · N σ²
       = σ²/N

The James-Stein Estimator
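A brief sketch of the estimator named above (standard James-Stein shrinkage with a positive-part correction; assumed Gaussian data, NumPy only): the sample mean of a d-dimensional Gaussian (d ≥ 3) is shrunk towards the origin by a data-dependent factor, which lowers the mean squared error relative to the plain sample mean and illustrates the SRM intuition from the earlier example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, sigma = 10, 5, 1.0                        # dimension, sample size, known noise level
mu = rng.normal(0, 1, d)                        # true mean (hypothetical)
R = 50_000                                      # Monte Carlo repetitions

mse_mle, mse_js = 0.0, 0.0
for _ in range(R):
    X = rng.normal(mu, sigma, size=(N, d))
    xbar = X.mean(axis=0)                       # maximum likelihood estimator of mu
    # James-Stein (positive-part variant): shrink the sample mean towards 0, valid for d >= 3.
    shrink = max(0.0, 1.0 - (d - 2) * (sigma**2 / N) / np.dot(xbar, xbar))
    js = shrink * xbar
    mse_mle += np.sum((xbar - mu) ** 2) / R
    mse_js += np.sum((js - mu) ** 2) / R

print(f"MSE(sample mean)  = {mse_mle:.4f}")     # ~ d * sigma^2 / N
print(f"MSE(James-Stein)  = {mse_js:.4f}")      # smaller on average
```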

Estimator of Functions

Bias-Variance Analysis of the Function Estimator (locally)

Summary
- Occam's Razor: Given two models that have sufficiently low training error, the simpler one should be preferred.
- Measuring Complexity: Counting the parameters can be misleading. Structural risk minimization (SRM) is more reliable in practice.
- Popper's View: How to make sure that a model predicts well? By testing it on out-of-sample data.
- Holdout and Cross-Validation: Common practical procedures to simulate out-of-sample prediction behavior. They have some limitations (e.g. they assume that the current dataset is representative of the true distribution).
- Bias-Variance Decomposition: The error of a predictive model can be decomposed into bias and variance. The best models can often be interpreted as striking a good tradeoff between the two terms.
