Lecture 18: Bias-variance and decision theory
CS 189 (CDSS offering)
2022/03/02
Today’s lecture
• Last time: how do we learn the best models with minimal generalization error?
• We need to make sure we are neither overfitting nor underfitting
• Today: how do we understand generalization error?
• One way is the classical bias-variance decomposition (for regression models)
• Also: for classification, how does the problem setting affect the decision rule and the generalization error?
• This is covered by the area of decision theory
The bias-variance decomposition
A probabilistic model for continuous outputs
let's focus on regression, where the outputs $y \in \mathbb{R}$ are real values; we are given $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

we assume the data was sampled according to $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, for some true function $f$

how do we define a model that outputs a distribution over $y \mid x$? one option: $p_\theta(y \mid x) = \mathcal{N}(f_\theta(x), \sigma^2)$

the negative log likelihood loss is
$-\sum_{i=1}^N \log p_\theta(y_i \mid x_i) = \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f_\theta(x_i))^2 + \text{const w.r.t. } \theta$
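Under this Gaussian model, minimizing the negative log likelihood is the same as minimizing squared error, since the remaining terms do not depend on $\theta$. A minimal numerical check (the targets, predictions, and $\sigma$ below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 5, 1.0                  # assumed fixed noise scale
y = rng.normal(size=n)             # observed targets (made-up data)
pred = rng.normal(size=n)          # stand-in for model predictions f_theta(x_i)

# negative log likelihood under y_i ~ N(f_theta(x_i), sigma^2)
nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2)
             + (y - pred) ** 2 / (2 * sigma**2))

# squared-error term plus a constant that does not depend on theta
sq_err = np.sum((y - pred) ** 2) / (2 * sigma**2)
const = n * 0.5 * np.log(2 * np.pi * sigma**2)

print(np.isclose(nll, sq_err + const))  # the two losses differ only by const
```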
Intuition: bias and variance
• Since we assume the training data was randomly sampled, we can ask the question: how does our model change for different training sets?
• If the model is overfitting, it will learn a different function for each training set
• If the model is underfitting, it learns similar functions, even if we combine all the training sets together — and all the learned functions are bad
[figure: fits across several training sets; an overfit model learns a wildly different function for each dataset, while an underfit model learns similar but consistently poor functions]
The bias-variance decomposition (“tradeoff”)
let $\hat\theta(\mathcal{D})$ be the MLE for $\mathcal{D}$

let's take a look at the expected error for a test point $x_\star$, where the expectation is over different training datasets $\mathcal{D}$ (and the parameters that would be learned)

let $\bar{f}(x)$ be the expected prediction for $x$, where the expectation is again over the different training datasets

writing $y_\star = f(x_\star) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and adding and subtracting $\bar{f}(x_\star)$:

$\mathbb{E}\big[(f_{\hat\theta(\mathcal{D})}(x_\star) - y_\star)^2\big] = \mathbb{E}\big[\big((f_{\hat\theta(\mathcal{D})}(x_\star) - \bar{f}(x_\star)) + (\bar{f}(x_\star) - y_\star)\big)^2\big]$

the cross term vanishes because $\mathbb{E}[f_{\hat\theta(\mathcal{D})}(x_\star)] = \bar{f}(x_\star)$ and $\epsilon$ is independent of $\mathcal{D}$, leaving

$\mathbb{E}\big[(f_{\hat\theta(\mathcal{D})}(x_\star) - y_\star)^2\big] = (\bar{f}(x_\star) - f(x_\star))^2 + \mathbb{E}\big[(f_{\hat\theta(\mathcal{D})}(x_\star) - \bar{f}(x_\star))^2\big] + \sigma^2$
The bias-variance decomposition
So: $\mathbb{E}\big[(f_{\hat\theta(\mathcal{D})}(x_\star) - y_\star)^2\big] = (\bar{f}(x_\star) - f(x_\star))^2 + \mathbb{E}\big[(f_{\hat\theta(\mathcal{D})}(x_\star) - \bar{f}(x_\star))^2\big] + \sigma^2$
• The first term is called Bias² — how wrong is the model in expectation, regardless of the dataset it is trained on?
• The second term is Variance — regardless of the true function f, how much does the model change based on the training dataset?
• The last term is irreducible error — i.e., the noise in the data process itself
• We used to think about this decomposition as a “tradeoff” — with the rise of deep learning, this no longer seems to always be the case
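The decomposition can be checked empirically by resampling training sets. A sketch, with a made-up "true" function, polynomial least squares as the model, and Monte Carlo estimates of each term (the function, noise level, and degree are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                  # noise std in y = f(x) + eps

def f(x):                                    # made-up "true" function
    return np.sin(2 * np.pi * x)

def fit_predict(x_star, degree=3, n=30):
    # sample one training set D, fit by least squares, predict at x_star
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x_star)

x_star = 0.25
preds = np.array([fit_predict(x_star) for _ in range(3000)])

bias2 = (preds.mean() - f(x_star)) ** 2      # (f_bar(x*) - f(x*))^2
variance = preds.var()                       # E[(f_hat(x*) - f_bar(x*))^2]
decomposed = bias2 + variance + sigma**2

# direct Monte Carlo estimate of E[(f_hat(x*) - y*)^2] with fresh noisy targets
y_star = f(x_star) + rng.normal(0, sigma, size=preds.shape)
direct = np.mean((preds - y_star) ** 2)

print(decomposed, direct)                    # the two should roughly agree
```

Increasing the polynomial degree shifts error from the bias² term into the variance term, which is the classical "tradeoff" picture.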
A brief excursion into decision theory
Decision theory for classification
• In general, x doesn’t uniquely specify y, and our predictions are guesses
• Remember: a classifier is a combination of a model and a decision rule
• Decision theory is a rich field which asks questions such as:
• What is the best decision rule?
• When is the best decision rule not the obvious one?
• Are there some guesses that should not be made?
• Without going into much detail, let’s take a brief look at some of these questions
Being Bayesian
the Bayes optimal decision rule minimizes the Bayes risk (probability of being wrong): predict $\hat{y}(x) = \arg\max_y \, p_\theta(y \mid x)$
when might this decision rule not minimize loss?
Press X to doubt
sometimes, it may make sense to abstain entirely from making a prediction, e.g., only predicting when $\max_y p_\theta(y \mid x)$ clears a confidence threshold
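A sketch of the argmax decision rule with an abstain option; the posteriors and the 0.5 threshold below are made up for illustration:

```python
import numpy as np

def decide(posterior, abstain_below=0.5):
    """Bayes optimal rule: pick the most probable class, but
    abstain (return None) when even the best guess is too unlikely."""
    best = int(np.argmax(posterior))
    return best if posterior[best] >= abstain_below else None

# hypothetical p(y | x) over 3 classes for two test points
confident = np.array([0.7, 0.2, 0.1])
uncertain = np.array([0.4, 0.35, 0.25])

print(decide(confident), decide(uncertain))  # -> 0 None
```

With the threshold at 0, this reduces to the plain Bayes optimal rule; raising it trades coverage for accuracy on the predictions that are made.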