Bias-Variance Trade-Off
Bias vs Variance Tradeoff
- In the previous lecture we saw what happens to the test MSE as model complexity increases. The U-shape that emerged can be derived theoretically.
- The expected test MSE at any point $X$ can be decomposed as:
$$E\big[(y - \hat{f}(X|D))^2\big] = \mathrm{Var}[\hat{f}(X|D)] + \mathrm{Bias}[\hat{f}(X|D)]^2 + \mathrm{Var}(\epsilon)$$
where $\mathrm{Bias}[\hat{f}(X|D)] = E[f(X) - \hat{f}(X|D)]$
- where $D = \{(X_1, y_1), \dots, (X_n, y_n)\}$ denotes the training dataset on which $\hat{f}(X|D)$ has been learnt.
Bias vs Variance Tradeoff
- Remember that we assume the true model can be written as $y = f(X) + \epsilon$
- with $E(\epsilon) = 0$ and $f(\cdot)$ being the true population model, which is non-stochastic and independent of the training data $D$, hence
- $E(f) = f$ and $E(y) = E(f + \epsilon) = E(f) + E(\epsilon) = f$
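- Similarly, remember the definition of the variance of a random variable $Z$:
$$\mathrm{Var}(Z) = E(Z^2) - E(Z)^2$$
- This implies, using $E(\epsilon) = 0$ in the last step,
$$\mathrm{Var}(y) = E[(y - E(y))^2] = E[(y - f)^2] = E[(f + \epsilon - f)^2] = E[\epsilon^2] = \mathrm{Var}(\epsilon)$$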
- The trained $\hat{f}$ and the $\epsilon$ in the validation sample are independent. Independence implies zero covariance, and hence
$$\mathrm{Cov}(\hat{f}, \epsilon) = E(\hat{f}\,\epsilon) - E(\hat{f})\,E(\epsilon) = E(\hat{f}\,\epsilon) = 0$$
- Using this, we can formally show that the expected MSE on a validation sample decomposes into variance, squared bias, and irreducible error.
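The two building blocks above can be checked numerically. The following is a minimal sketch under assumed settings that are not part of the lecture: a hypothetical true model f(x) = sin(2πx), Gaussian noise with σ = 0.3, and a cubic polynomial fitted with `numpy.polyfit` as the learning method.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, x0, n, reps = 0.3, 0.5, 30, 5000   # illustrative values (assumptions)

def f(x):
    # hypothetical "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

# Var(y) at a fixed X = x0 equals Var(eps), because f(x0) is a constant
y0 = f(x0) + rng.normal(0, sigma, size=reps)
var_y = y0.var()

# f_hat(x0) from independent training sets, paired with fresh validation noise
f_hat = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    f_hat[r] = np.polyval(np.polyfit(x, y, 3), x0)
eps_val = rng.normal(0, sigma, reps)      # noise in the validation sample
cov = np.cov(f_hat, eps_val)[0, 1]

print(f"Var(y) = {var_y:.4f}  vs  sigma^2 = {sigma**2:.4f}")
print(f"Cov(f_hat, eps) = {cov:.5f}")
```

Both printed quantities are simulation estimates, so they should be close to, not exactly equal to, σ² and zero.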
Bias vs Variance Tradeoff Proof [advanced]
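Show that:
$$E\big[(y - \hat{f}(X|D))^2\big] = \mathrm{Var}[\hat{f}(X|D)] + \mathrm{Bias}[\hat{f}(X|D)]^2 + \mathrm{Var}(\epsilon)$$
To get rid of the indices, let $f = f(X)$ and $\hat{f} = \hat{f}(X|D)$.
$$
\begin{aligned}
E\big[(y - \hat{f})^2\big]
&= E[y^2 + \hat{f}^2 - 2y\hat{f}] && (1)\\
&= E[y^2] + E[\hat{f}^2] - E[2y\hat{f}] && (2)\\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat{f}] + E[\hat{f}]^2 - E[2y\hat{f}] && (3)\\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat{f}] + E[\hat{f}]^2 - 2E[(f + \epsilon)\hat{f}] && (4)\\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat{f}] + E[\hat{f}]^2 - 2E[f\hat{f}] - 2E[\epsilon\hat{f}] && (5)\\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat{f}] + E[y]^2 + E[\hat{f}]^2 - 2fE[\hat{f}] && (6)\\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat{f}] + \big[f^2 - 2fE[\hat{f}] + E[\hat{f}]^2\big] && (7)\\
&= \mathrm{Var}[\epsilon] + \mathrm{Var}[\hat{f}] + (f - E[\hat{f}])^2 && (8)\\
&= \underbrace{(f - E[\hat{f}])^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}[\hat{f}]}_{\text{Variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{Irreducible error}} && (9)
\end{aligned}
$$
Step (3) uses $E(Z^2) = \mathrm{Var}(Z) + E(Z)^2$; step (6) uses $E[\epsilon\hat{f}] = 0$ and the fact that $f$ is non-stochastic; step (7) uses $E[y] = f$; step (8) uses $\mathrm{Var}[y] = \mathrm{Var}[\epsilon]$.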
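The decomposition can also be verified by simulation at a single point. This is a sketch under assumed settings (hypothetical true model f(x) = sin(2πx), σ = 0.3, a quadratic polynomial fit, evaluation point x0 = 0.25), none of which come from the lecture. It draws many training sets, records the prediction at x0, and compares the simulated test MSE with the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, reps, x0, degree = 0.3, 30, 5000, 0.25, 2   # assumptions

def f(x):
    # hypothetical "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

# f_hat(x0 | D) for many independently drawn training sets D
f_hat = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    f_hat[r] = np.polyval(np.polyfit(x, y, degree), x0)

# fresh validation outcomes at x0, independent of the training sets
y0 = f(x0) + rng.normal(0, sigma, reps)

mse = np.mean((y0 - f_hat) ** 2)          # left-hand side
bias_sq = (f(x0) - f_hat.mean()) ** 2     # (f - E[f_hat])^2
variance = f_hat.var()                    # Var[f_hat]
noise = sigma ** 2                        # Var[eps]

print(f"test MSE              = {mse:.4f}")
print(f"bias^2 + var + noise  = {bias_sq + variance + noise:.4f}")
```

The two printed numbers agree only up to Monte Carlo error; increasing `reps` tightens the match.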
Bias vs Variance Intuition
$$E\big[(y - \hat{f})^2\big] = \underbrace{(f - E[\hat{f}])^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}[\hat{f}]}_{\text{Variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{Irreducible error}}$$
Look at the individual elements here: $\mathrm{Var}(\epsilon)$
… is a constant.
It remains unchanged for different $\hat{f}$'s.
It represents the lower bound on the attainable test error, since both other terms are non-negative.
Minimizing the test error requires finding an $\hat{f}$ that minimizes the sum of squared bias and variance.
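A short sketch of why Var(ε) acts as a floor: even a predictor equal to the true f (which is never available in practice) has test MSE equal to σ². The true model, the noise level, and the shifted predictor below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma = 0.3                                   # assumed noise standard deviation

def f(x):
    # hypothetical "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, 100_000)
y = f(x) + rng.normal(0, sigma, x.size)

mse_oracle = np.mean((y - f(x)) ** 2)             # predictor equal to the true f
mse_shifted = np.mean((y - (f(x) + 0.2)) ** 2)    # predictor with constant bias 0.2

print(f"oracle MSE  = {mse_oracle:.4f}  (sigma^2 = {sigma**2:.4f})")
print(f"shifted MSE = {mse_shifted:.4f}")
```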
Bias vs Variance Intuition
$$E\big[(y - \hat{f})^2\big] = \underbrace{(f - E[\hat{f}])^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}[\hat{f}]}_{\text{Variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{Irreducible error}}$$
Look at the individual elements here: $\mathrm{Var}[\hat{f}]$
… refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set.
Since the training data are used to fit the statistical learning method, different training data sets produce different $\hat{f}$. Ideally, the estimate for $f$ should not vary too much between training sets.
Different methods have different variances: more flexible methods generally have larger variance, while less flexible ones (e.g. linear regression) have lower variance.
This variance term pushes up the test MSE for highly flexible specifications.
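The effect of flexibility on variance can be illustrated by refitting on many training sets and looking at how much the prediction at one point moves. This sketch assumes a hypothetical true model f(x) = sin(2πx), σ = 0.3, and polynomial fits of degree 1 and 7. For simplicity the inputs are held fixed on a grid, so the training sets differ only in their noise draws.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, reps, x0 = 0.3, 30, 2000, 0.9    # illustrative values (assumptions)
x = np.linspace(0, 1, n)                   # inputs held fixed for simplicity

def f(x):
    # hypothetical "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

def predictions_at_x0(degree):
    """f_hat(x0) across `reps` training sets that differ in their noise draws."""
    preds = np.empty(reps)
    for r in range(reps):
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds

var_linear = predictions_at_x0(1).var()     # less flexible fit
var_flexible = predictions_at_x0(7).var()   # more flexible fit

print(f"Var[f_hat(x0)], degree 1: {var_linear:.4f}")
print(f"Var[f_hat(x0)], degree 7: {var_flexible:.4f}")
```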
Bias vs Variance Intuition
$$E\big[(y - \hat{f})^2\big] = \underbrace{(f - E[\hat{f}])^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}[\hat{f}]}_{\text{Variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{Irreducible error}}$$
Looking at the Bias:
… an approximate model that leaves out relevant factors systematically introduces errors, e.g. by not allowing for more complex interactions between the variables $X_i$.
E.g. a linear model may be inadequate if the true relationship is non-linear, introducing significant bias.
This is akin to the idea of omitted variable bias in regression, which causes the true effect of some variable to be under- or overstated, thus distorting the predictive power of that variable.
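The bias of an overly rigid model can be estimated in the same way, by averaging predictions over many training sets and comparing the average with the truth. The sketch assumes a hypothetical non-linear true model f(x) = sin(2πx), σ = 0.3, and compares a linear fit with a degree-5 polynomial at x0 = 0.25.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, reps, x0 = 0.3, 30, 2000, 0.25   # illustrative values (assumptions)

def f(x):
    # hypothetical non-linear "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

def bias_at_x0(degree):
    """Monte Carlo estimate of Bias = f(x0) - E[f_hat(x0)]."""
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    return f(x0) - preds.mean()

bias_linear = bias_at_x0(1)      # linear model on a non-linear truth
bias_flexible = bias_at_x0(5)    # flexible enough to track the sine shape

print(f"bias, degree 1: {bias_linear:.3f}")
print(f"bias, degree 5: {bias_flexible:.3f}")
```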
Bias vs Variance Intuition
Figure: Bias-Variance tradeoff illustrated: U-shape due to increasing variance at high levels of model flexibility. Taken from Hastie et al., 2013.
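A U-shaped curve of this kind can be reproduced in a small simulation. The sketch below is not the setup behind the figure. It assumes a hypothetical true model f(x) = sin(2πx), σ = 0.3, training sets of size 30, and polynomial degree as the measure of flexibility, and it averages the test MSE over many training and test sets.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, n_test, reps = 0.3, 30, 200, 300   # illustrative values (assumptions)
degrees = np.arange(1, 9)                    # flexibility = polynomial degree

def f(x):
    # hypothetical "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

def sample(size):
    x = rng.uniform(0, 1, size)
    return x, f(x) + rng.normal(0, sigma, size)

test_mse = np.zeros(len(degrees))
for _ in range(reps):
    x_tr, y_tr = sample(n)          # one training set
    x_te, y_te = sample(n_test)     # one independent test set
    for j, d in enumerate(degrees):
        coefs = np.polyfit(x_tr, y_tr, d)
        test_mse[j] += np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
test_mse /= reps                    # average over repetitions

best = int(np.argmin(test_mse))
for d, m in zip(degrees, test_mse):
    print(f"degree {d}: average test MSE = {m:.3f}")
print("degree with lowest average test MSE:", degrees[best])
```

Under these assumptions the average test MSE is high for the rigid fits, lowest at an intermediate degree, and rises again for the most flexible fits.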
Bias vs Variance Intuition
- It is important to recognize that test MSE and training MSE are themselves random variables, each with a mean and a variance.
- The training- and validation-set approach constructs a single realization of both test MSE and training MSE.
- This means that, on specific training and validation sets, the error curves may by chance not follow the theoretically expected U-shape.
- Cross-validation or resampling methods allow you to construct multiple training and test error curves.
- We will see this in the interactive visualization.
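As a rough illustration of the last two points, k-fold cross-validation can be written by hand so that each fold yields its own validation error curve. The sketch assumes a hypothetical true model f(x) = sin(2πx), σ = 0.3, 100 observations, 5 folds, and polynomial degree as the flexibility parameter.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n, k = 0.3, 100, 5                  # illustrative values (assumptions)
degrees = np.arange(1, 7)

def f(x):
    # hypothetical "true" model, chosen only for illustration
    return np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, n)
y = f(x) + rng.normal(0, sigma, n)

# split shuffled indices into k folds; each fold serves once as validation set
folds = np.array_split(rng.permutation(n), k)
curves = np.empty((k, len(degrees)))       # one validation error curve per fold
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
    for j, d in enumerate(degrees):
        coefs = np.polyfit(x[train_idx], y[train_idx], d)
        resid = y[val_idx] - np.polyval(coefs, x[val_idx])
        curves[i, j] = np.mean(resid ** 2)

cv_curve = curves.mean(axis=0)             # averaged cross-validation curve
print("best degree per fold:     ", degrees[curves.argmin(axis=1)])
print("best degree by CV average:", degrees[cv_curve.argmin()])
```

Each row of `curves` is one realization of the validation error curve. The rows can disagree on the best degree, which is why the averaged curve is usually the more reliable guide.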