
Outline
• Model Selection
• Occam's Razor
• Quantifying Model Complexity
• Popper's Prediction Strength
• Cross-Validation
• Dataset Bias and the 'Clever Hans' Effect
• Bias-Variance Analysis

Learning a Model of the Data
Assuming some observations from some unknown true data distribution p(x, y), we would like to find a model θ such that some distance D(p, p_θ) is minimized. For regression tasks, we can consider the simpler objective
$$\min_\theta \int \big(\mathbb{E}_{\hat p}[y\,|\,x] - f_\theta(x)\big)^2 \,\hat p(x)\, dx$$
where p̂ is some guess of the true p, e.g. the empirical distribution.
[Figure: 1. observations, 2. fit a model, 3. make predictions.]
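To make the objective concrete, here is a minimal sketch of empirical risk minimization for regression. The data-generating process and the polynomial model class below are assumptions made for illustration; they are not specified in the slides.

```python
import numpy as np

# Synthetic stand-in for the unknown p(x, y): a noisy sine wave (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)

# Model class (an assumption): polynomials of degree 5; theta are the coefficients of f_theta.
degree = 5
X = np.vander(x, degree + 1)                 # feature expansion [x^5, ..., x, 1]

# Empirical risk minimization: least-squares solution of min_theta sum_i (y_i - f_theta(x_i))^2,
# i.e. the objective above with p-hat taken to be the empirical distribution.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Make predictions at new inputs.
x_new = np.linspace(0, 1, 5)
print(np.vander(x_new, degree + 1) @ theta)
```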

Model Selection
Questions:
1. Among models that correctly predict the data, which one should be retained?
2. Should we always choose a model that perfectly fits the data?

Occam’s Razor
William of Ockham (1287–1347)
“Entia non sunt multiplicanda praeter necessitatem”
English translation
“Entities must not be multiplied beyond necessity.”

Interpreting Occam’s Razor
What advantages do we expect from a model based on few assumptions?
• If two models correctly predict the data, the one that makes fewer assumptions should be preferred because simplicity is desirable in itself.
• If two models correctly predict the data, the one that makes fewer assumptions should be preferred because it is likely to have lower generalization error.
Further reading:
Domingos (1998), Occam's Two Razors: The Sharp and the Blunt.

Occam’s Razor for Model Selection
[Figure: training error as a function of the number of assumptions; the region of models that should be discarded according to Occam's razor is marked.]

How to Quantify “Few Assumptions”?
Many possibilities:
• Number of free parameters of the model
• Minimum description length (MDL)
• Size of function class (structural risk minimization)
• VC dimension (next week)
• ...

Number of Parameters of the Model
Constant classifier (O(1) parameters):
$$g(x) = \underbrace{C}_{1}$$

Nearest mean classifier (O(d) parameters):
$$g(x) = x^\top \underbrace{(m_1 - m_2)}_{2d} + \underbrace{C}_{1}$$

Fisher vector classifier, built on top of k PCA components (O(k·d) parameters):
$$g(x) = \underbrace{\mathrm{PCA}(x)^\top}_{k \cdot d}\, \underbrace{S_W^{-1}}_{k^2}\, \underbrace{(m_1 - m_2)}_{2k} + \underbrace{C}_{1}$$
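As an illustration of the parameter counts above, here is a minimal sketch of the nearest mean classifier. The synthetic two-class data and the particular choice of the offset C are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                         # input dimension
X1 = rng.normal(loc=+1.0, size=(100, d))       # samples of class 1 (synthetic, for illustration)
X2 = rng.normal(loc=-1.0, size=(100, d))       # samples of class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class means: 2*d parameters
C = -0.5 * (m1 @ m1 - m2 @ m2)                 # offset so that g compares distances to the means: 1 parameter

def g(x):
    """Nearest mean classifier: g(x) > 0 means x is closer to m1 than to m2."""
    return x @ (m1 - m2) + C

print("parameter count:", 2 * d + 1)           # O(d), as stated above
print(g(rng.normal(loc=+1.0, size=d)))         # a point near the class-1 mean: typically positive
```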

Number of Parameters of the Model
Counter-example
• The two-parameter model g(x) = a·sin(ωx) can fit almost any finite dataset in ℝ by setting very large values for ω.
• However, it is also clear that the model will not generalize well.
By only counting the number of parameters of the model, we have not specified the range of values the parameter ω is allowed to take in practice.
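One concrete way to see the first bullet is a sign-matching experiment: the sketch below (the random points, binary targets, and the grid over ω are assumptions made for illustration) searches for a frequency ω such that the sign of sin(ωx) matches arbitrary ±1 labels on ten random points. With a = 1, the two-parameter model then fits this dataset perfectly, even though nothing about the fit will generalize.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)                 # 10 random inputs (assumed for illustration)
y = rng.choice([-1.0, 1.0], size=10)           # arbitrary binary targets

# Scan a large range of frequencies and check whether sign(sin(omega * x)) matches every target.
omegas = np.linspace(1.0, 1e5, 500_000)
signs = np.sign(np.sin(np.outer(omegas, x)))   # shape (500000, 10)
hits = np.where((signs == y).all(axis=1))[0]

if hits.size > 0:
    print(f"found {hits.size} frequencies; e.g. omega = {omegas[hits[0]]:.1f} fits all {len(x)} points")
else:
    print("no fitting frequency in the scanned range (try a larger range)")
```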

Structural Risk Minimization (Vapnik and Chervonenkis, 1974)
Idea: Structure your space of solutions into a nesting of increasingly larger regions. If two solutions fit the data, prefer the solution that belongs to the smaller region.

Example: Assuming ℝ^d is the whole solution space for θ, build a sequence of real numbers C_1 < C_2 < · · · < C_N and create a nested collection (S_1, S_2, ..., S_N) of regions, where
$$S_n = \{\theta \in \mathbb{R}^d : \|\theta\|^2 \le C_n\}.$$

Connection to Regularization
Example (cont.): We optimize, for multiple C_n, the objective
$$\min_\theta \; \underbrace{\sum_{i=1}^N \ell(y_i, f_\theta(x_i))}_{E_{\text{empirical}}} \quad \text{s.t.} \quad \|\theta\|^2 \le C_n$$
and discard solutions with index n larger than necessary to fit the data. This objective can be equivalently formulated as
$$\min_\theta \; \underbrace{\sum_{i=1}^N \ell(y_i, f_\theta(x_i))}_{E_{\text{empirical}}} + \underbrace{\lambda_n \|\theta\|^2}_{E_{\text{reg}}}$$
with appropriate (λ_n)_n. This objective is known in various contexts as L2 regularization, ridge regression, large margin, weight decay, etc.

From Occam's Razor to Popper
Occam's Razor: "Entities must not be multiplied beyond necessity."
Falsifiability / prediction strength (S. Hawking, after K. Popper): "[a good model] must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations."
In other words, the model with the lowest generalization error is preferable.
[Figures: test error as a function of the number of assumptions, marking the best model according to Popper and the models that should be discarded according to Occam's razor.]

The Holdout Selection Procedure
Idea: Predict out-of-sample error by splitting the data randomly in two parts (one for training, and one for estimating the error of the model).
[Figure: the dataset is split into training data and holdout data; each candidate model is trained on the training data, its error is estimated on the holdout data, and the model with the lowest holdout error (argmin) is retained.]

Cross-Validation (k-Fold Procedure)
To improve error estimates without consuming too much data, the process of error estimation can be improved by computing an average over different splits:
[Figure: the dataset is partitioned into k folds; each candidate model is trained on k-1 folds and evaluated on the held-out fold, the resulting errors are averaged, and model selection is based on the average.]

The Cross-Validation Procedure
Advantages:
• The model can now be selected directly based on simulated future observations.
Limitations:
• For a small number of folds k, the training data is reduced significantly, which may lead to less accurate models. For large k, the procedure becomes computationally costly.
• This technique assumes that the available data is representative of future observations (not always true!).

The Problem of Dataset Bias
[Figure: training data and test data with the decision boundaries of a Fisher discriminant and a nearest mean classifier.]
This effect can cause the cross-validation procedure to not work well, even when we have enough training data.

The Problem of Dataset Bias
Observation: The classifier has exploited a spurious correlation between images of the class "horse" and the presence of a copyright tag in the bottom-left corner of the horse images. Cross-validation does not help here, because the spurious correlation is also present in the validation set.
Further reading:
Lapuschkin et al. (2019), Unmasking Clever Hans Predictors and Assessing What Machines Really Learn.

Part II. Bias-Variance Analysis of ML Models

Machine Learning Models
Machine learning models are learned from the data to approximate some truth θ. A learning machine can be abstracted as a function that maps a dataset D to an estimator θ̂ of the truth θ.

ML Models and Prediction Error
A good learning machine is one that produces an estimator θ̂ close to the truth θ. Closeness to the truth can be measured by some error function, e.g. the squared error: Error(θ̂) = (θ − θ̂)².

Bias, Variance, and MSE of an Estimator
Parametric estimation:
• θ is a value in ℝ^h.
• θ̂ is a function of the data D = {X_1, ..., X_N}, where the X_i are random variables producing the data points.
Statistics of the estimator:
$$\mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta - \theta] \qquad \text{(measures the expected deviation of the estimator from the truth)}$$
$$\mathrm{Var}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big] \qquad \text{(measures the scatter of the estimator around its own mean)}$$
$$\mathrm{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big] \qquad \text{(measures the prediction error)}$$
Note: for θ ∈ ℝ^h, we use the notation θ² = θᵀθ.

Visualizing Bias and Variance
[Figure: parameter estimators scattered around the true parameter, illustrating the combinations of low/high bias and low/high variance.]

Bias-Variance Decomposition
Recalling the definitions above, we can show that
$$\mathrm{MSE}(\hat\theta) = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)$$
(write θ̂ − θ = (θ̂ − E[θ̂]) + (E[θ̂] − θ), expand the square, and note that the cross term has zero expectation).

Visualizing Bias and Variance
[Figure: prediction error as a function of model complexity, decomposed into a bias term (dominant for low-variance/high-bias models) and a variance term (dominant for low-bias/high-variance models).]

Example: Parameters of a Gaussian
The 'natural' estimator of the mean, μ̂ = (1/N) Σ_{i=1}^N X_i, decomposes to
$$\mathrm{Bias}(\hat\mu) = 0 \qquad \text{and} \qquad \mathrm{Var}(\hat\mu) = \sigma^2 / N.$$

The James-Stein Estimator

Estimator of Functions

Bias-Variance Analysis of the Function Estimator (locally)

Summary
• Occam's Razor: Given two models with the same training error, the simpler one should be preferred.
• Popper's View: How can we make sure that a model predicts well? By testing it on out-of-sample data. Out-of-sample data can be simulated by applying a k-fold cross-validation procedure.
• Bias-Variance Decomposition: The error of a predictive model can be decomposed into bias and variance. The best models often result from a tradeoff between the two terms.
the square error: Error(θˆ) = (θ − θˆ)2. 22/32 Bias, Variance, and MSE of an Estimator Parametric estimation: � θisavalueinRh � θ�is a function of the data D = {X1,...,XN}, where Xi are random variables producing the data points. Statistics of the estimator: �� Bias(θ) = E[θ − θ] ���2 (measures expected deviation of the mean) (measures scatter around estimator of mean) (measures prediction error) Var(θ) = E[(θ − E[θ]) ] ��2 MSE(θ) = E[(θ − θ) ] Note: for θ ∈ Rh, we use the notation θ2 = θ�θ. 23/32 Visualizing Bias and Variance low bias high bias True parameter Parameter estimator 24/32 high variance low variance Bias-Variance Decomposition �����2��2 Bias(θ) = E[θ − θ], Var(θ) = E[(θ − E[θ]) ], MSE(θ) = E[(θ − θ) ] ��2� We can show that MSE(θ) = Bias(θ) + Var(θ) . 25/32 Visualizing Bias and Variance prediction error low variance/ high bias error bias variance low bias/ high variance model complexity 26/32 Example: Parameters of a Gaussian 27/32 Example: Parameters of a Gaussian The ‘natural’ estimator of mean μ� = 1 �N Xi decomposes to N i=1 Bias(μ�) = 0 and Var(μ�) = σ2/N . 28/32 The James-Stein Estimator 29/32 Estimator of Functions 30/32 Bias-Variance Analysis of the Function Estimator (locally) 31/32 Summary � Occam’s Razor: Given two models with the same training error, the simpler one should be preferred. � Popper’s View: How to make sure that a model predicts well? By testing it on out-of-sample data. Out-of-sample data can be simulated by applying a k-fold cross-validation procedure. � Bias-Variance Decomposition: The error of a predictive model can be decomposed into bias and variance. Best models often results from some tradeoff between the two terms. 32/32