4 Bias-Variance Tradeoff
Probability studies uncertainty. Probability theory and statistics are often presented together, but they concern different aspects of uncertainty. In probability we posit a model of some process, where the underlying uncertainty is captured by random variables, and we use the rules of probability to derive what happens. In statistics we observe that something has happened and try to figure out the underlying model that explains the observations. In this sense,
machine learning is close to statistics in its goal of constructing a model that adequately represents the process that generated the data. Probability plays two roles in machine learning: (1) the laws of probability tell us why learning algorithms work; (2) probability and statistics are used to analyze the behavior of learning algorithms. In this section, we review basic probability (see §4.1–4.4).
Recall that, for a prediction function, generalization is the ability to perform well on the test set (see §1.4). Let (x, y) be a random vector on a sample space X × Y with joint probability distribution P, and let f : X → Y be a prediction function. To measure the predictive ability of f, we introduce the generalization error of f, defined by the expectation

R_exp(f) = E_{(x,y)∼P}[L(y, f(x))] = ∫_{X×Y} (y − f(x))^2 P(x, y) dx dy.   (4.1)

Let D be a dataset in X and f_D a prediction function learned from D. We also consider

R_exp,D(f_D) = E_{(x,y)∼P, D}[L(y, f_D(x))].   (4.2)

For this setting, we shall prove the bias-variance tradeoff formula (see §4.5):

R_exp,D(f_D) = noise + V(f_D) + bias^2(f_D),   (4.3)

where the noise term is independent of D and f_D, and hence irreducible; V(f_D) is the variance of f_D; and bias(f_D) is the bias of f_D. This formula relates the model f_D's ability to minimize bias and variance, and it has important consequences.
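To see formula (4.3) in action before the proof, here is a small Monte Carlo sketch (in Python, with illustrative names; the MATLAB used later in this section would work equally well). We assume data y = c + ε with Gaussian noise of variance σ^2, and take f_D to be the sample mean of the n training labels, so the noise term is σ^2, the variance of f_D is σ^2/n, and the bias is zero:

```python
import random

random.seed(0)

# Ground truth: y = c + eps, eps ~ N(0, sigma^2); the best predictor is the constant c.
c, sigma, n = 2.0, 1.0, 10      # n = size of each training set D (illustrative values)
trials = 50_000                 # number of independent datasets D

# Estimator: f_D is the sample mean of the n training labels.
preds = []
for _ in range(trials):
    D = [c + random.gauss(0.0, sigma) for _ in range(n)]
    preds.append(sum(D) / n)

mean_pred = sum(preds) / trials
variance = sum((p - mean_pred) ** 2 for p in preds) / trials   # V(f_D), approx sigma^2/n
bias2 = (mean_pred - c) ** 2                                   # squared bias, approx 0
noise = sigma ** 2                                             # irreducible error E[eps^2]

# Left-hand side of (4.3): expected squared error on a fresh test point.
test_err = 0.0
for p in preds:
    y = c + random.gauss(0.0, sigma)
    test_err += (y - p) ** 2
test_err /= trials

print(test_err, noise + variance + bias2)   # both approx sigma^2 + sigma^2/n = 1.1
```

Up to Monte Carlo error, the measured test error matches noise + variance + bias^2, as (4.3) asserts.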
4.1 Sample space, probability space and random variable
In probability theory, an experiment or a trial is a procedure that can be infinitely repeated and has a well-defined set of possible outcomes, known as the sample space. An experiment is said to be random if it has more than one possible outcome, and is said to be deterministic if it has only one.
Sample Space Let us define the following.
• An outcome is a possible result of an experiment.
• The set S of all outcomes of a random experiment is called a sample space.
• Each outcome is also called a sample point.
• A subset of sample points is called an event of the sample space. We denote by A the collection of all events.
• The event with no sample point is called the empty event. An event that includes all possible sample points is called the whole event.
[Example]
1. What is the sample space for an experiment in which we select a rat at random from a cage and determine its sex?
The sample space of this experiment is
S = {M,F}
where M denotes the male rat and F denotes the female rat.
2. What is the sample space for an experiment in which we roll a pair of dice, one red and one
green?
The sample space S for this experiment is given by
S = {(x, y) | x, y integers, 1 ≤ x ≤ 6, 1 ≤ y ≤ 6},
where integer x represents the number rolled on red die and integer y denotes the number
rolled on green die.
3. Describe the sample space of rolling a die and interpret the event {1, 2}. The sample space of this experiment is
S = {1,2,3,4,5,6}. The event {1, 2} means getting either a 1 or a 2.
Let us continue to give some definitions.
• The event that at least one of the events A and B occurs is called the union of events and denoted by A ∪ B.
• The event that both occur simultaneously is called the intersection of events and denoted by A∩B.
• If A ∩ B = ∅, the events A and B are called disjoint events.
• For events A, B and C, the following distributive laws hold:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).
• The event that event A does not occur is called the complementary event of A, and is denoted by A^c.
• For events A,B and C, the following De Morgan’s laws hold:
(A ∪ B)^c = A^c ∩ B^c, (A ∩ B)^c = A^c ∪ B^c.
Definition of probability Let S be a sample space and A the set of all events of S. A probability on S is a function Pr : A → [0, 1] such that

(1) 0 ≤ Pr(A) ≤ 1 for any event A ∈ A.

(2) Pr(S) = 1.

(3) The probability of a union of disjoint events equals the sum of their individual probabilities. In other words, for any sequence A1, A2, … ∈ A with Ai ∩ Aj = ∅ for all i ≠ j,

Pr(∪_j A_j) = Σ_j Pr(A_j).
We denote Pr(A, B) := Pr(A ∩ B), and Pr(A or B) := Pr(A ∪ B).
[Example]
1. If a fair coin is tossed twice, what is the probability of getting at least one head?
The sample space of this experiment is S = {HH, HT, TH, TT}. The event A = {at least one head} = {HH, HT, TH}. The probability of A is the sum of the probabilities of its elementary events. Since Pr({HH}) = 1/4, Pr({HT}) = 1/4 and Pr({TH}) = 1/4, we get

Pr(A) = Pr(HH) + Pr(HT) + Pr(TH) = 1/4 + 1/4 + 1/4 = 3/4.
2. Consider rolling a single 6-sided die, so S = {1, 2, 3, 4, 5, 6}. Take the event A = {1, 3, 5}. Since Pr({k}) = 1/6 for k = 1, 2, 3, 4, 5, 6, the probability of rolling an odd number is

Pr(A) = Pr({1}) + Pr({3}) + Pr({5}) = 1/6 + 1/6 + 1/6 = 1/2.
In general, for events A and B, the following additive law holds:
Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B), (4.4)
and
Pr(A ∪ B ∪ C) = Pr(A) + Pr(B) + Pr(C) − Pr(A ∩ B) − Pr(A ∩ C) − Pr(B ∩ C) + Pr(A ∩ B ∩ C).
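As a sanity check of the additive law (4.4) and its three-event version, one can enumerate the red/green dice sample space from the earlier example. A small Python sketch using exact rational arithmetic (the events chosen are illustrative):

```python
from fractions import Fraction

# Pr on the red/green dice sample space: every outcome has probability 1/36.
S = [(x, y) for x in range(1, 7) for y in range(1, 7)]

def Pr(E):
    return Fraction(len(E), len(S))

A = {s for s in S if s[0] == 6}           # red die shows 6
B = {s for s in S if s[1] == 6}           # green die shows 6
C = {s for s in S if s[0] + s[1] == 7}    # sum is 7

# Additive law (4.4)
assert Pr(A | B) == Pr(A) + Pr(B) - Pr(A & B)

# Three-event inclusion-exclusion
lhs = Pr(A | B | C)
rhs = (Pr(A) + Pr(B) + Pr(C)
       - Pr(A & B) - Pr(A & C) - Pr(B & C)
       + Pr(A & B & C))
assert lhs == rhs
print(Pr(A | B))   # 11/36
```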
Random variable In many random experiments, the elements of the sample space are not necessarily numbers. One would like to 'mathematize' the outcomes of the sample space. This mathematization, or quantification, is achieved through the notion of random variables.
Consider a random experiment whose sample space is S. A random variable X is a function from the sample space S into the set of real numbers R,

X : S → R_X ⊂ R, s ↦ X(s),

such that for each interval I in R, the set X^{−1}(I) = {s ∈ S | X(s) ∈ I} is an event in S. The set R_X := {x ∈ R | x = X(s), s ∈ S} is called the space of the random variable X.
[Example]
1. A constant function X : S → R_X = {a} is a random variable.
2. Consider the coin tossing experiment. Construct a random variable X for this experiment. Let

S = {Head, Tail}

and define

X(Head) = 0, X(Tail) = 1.

Then X is a random variable for the coin tossing experiment. The space of this random variable is R_X = {0, 1}.
3. Consider an experiment in which a coin is tossed ten times. The sample space of this experiment is

S = {s | s is a sequence of 10 heads or tails}.

The cardinality of S is 2^10. Let X : S → R be a function from the sample space S into the set of reals R defined as follows:

X(s) = number of heads in sequence s.

Then X is a random variable. For example, X(HHTTTHTTHH) = 5. The space of this random variable is R_X = {0, 1, 2, …, 10}.

Probability distribution For a random variable X, there are two cases:

(i) X takes countably many values.

(ii) X takes values in a continuum.

In case (i), X is called a discrete random variable; in case (ii), X is called a continuous random variable.

Associated with the random variable is a function that measures probability,

F(x) := Pr(X ≤ x),

which is called the cumulative distribution function (cdf), or simply the distribution.

Random vectors Let X1, …, Xp be random variables on the same probability space (S, A, Pr):

X_j : (S, A, Pr) → R.

The vector X := (X1, …, Xp)^T is called a random vector. The joint distribution of a random vector is uniquely described by its multivariate cumulative distribution function

F(x1, …, xp) = Pr(X1 ≤ x1, …, Xp ≤ xp), ∀(x1, …, xp)^T ∈ R^p.

4.2 Discrete random variables
Discrete random variables If X is a discrete random variable, the function defined by

f(x) = Pr(X = x), ∀x ∈ R_X,

is called the probability density function (pdf) 33, which satisfies

f(x) ≥ 0, ∀x ∈ R_X, and Σ_{x∈R_X} f(x) = 1.

It is a description of how likely a random variable is to take on each of its possible states. The cumulative distribution function F(x) can then be calculated in terms of f(x):

F(x) = Pr(X ≤ x) = Σ_{b≤x, b∈R_X} f(b).
The cumulative distribution function F(x) satisfies:
• F(−∞)=0,
• F(∞)=1,
• F (x) is an increasing function, i.e., F (x1) ≤ F (x2) whenever x1 < x2.
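For instance, for a fair die the cdf can be accumulated directly from the pdf, and the properties above can be verified by enumeration. A Python sketch with exact fractions:

```python
from fractions import Fraction

# pdf (pmf) of a fair die and its cdf F(x) = sum of f(b) over b <= x.
R_X = [1, 2, 3, 4, 5, 6]
f = {x: Fraction(1, 6) for x in R_X}

def F(x):
    return sum(f[b] for b in R_X if b <= x)

assert F(0) == 0      # below the smallest value, nothing has accumulated
assert F(6) == 1      # F(infinity) = 1
assert all(F(a) <= F(b) for a, b in zip(R_X, R_X[1:]))   # F is increasing
print(F(3))           # 1/2
```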
[Examples]
33It is also called a probability mass function.
1. A Bernoulli trial is a random experiment with precisely two possible outcomes, which we conveniently call 'failure' (F) and 'success' (S). We can define a random variable from the sample space {S, F} into the set of real numbers by

X(F) = 0, X(S) = 1.

The probability density function of this random variable is

f(0) = Pr(X = 0) = 1 − p, f(1) = Pr(X = 1) = p,

for some number 0 < p < 1. We may rewrite the sample space as {0, 1} and the pdf as

f(x) = p^x (1 − p)^{1−x}, x = 0, 1.

The random variable X is called the Bernoulli random variable. The expectation and variance (for the definitions, see § 3.4) are given by

E[X] := Σ_{x∈R_X} x f(x) = p, and V[X] = E[X^2] − (E[X])^2 = Σ_{x∈R_X} x^2 f(x) − p^2 = p(1 − p).

We denote the Bernoulli random variable by BER(p).
2. Consider a fixed number n of mutually independent Bernoulli trials, each with the same probability of success p, where 0 < p < 1. A random variable X is called a binomial random variable if it represents the total number of successes in the n independent Bernoulli trials. The sample space of each trial is {0, 1}, and the space of the random variable X is {0, 1, 2, …, n}. The probability density function of X is defined as

f(x) = Pr(X = x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, 2, …, n,

where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient. The expectation and variance (for the definitions, see § 3.4) are given by

E[X] = np, and V[X] = np(1 − p).

We denote a binomial random variable with parameters n and p by BIN(n, p).
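The stated values of E[X] and V[X] for BIN(n, p) (with BER(p) as the case n = 1) can be verified by exact summation over the pmf. A Python sketch with illustrative parameters n = 10, p = 3/10:

```python
from fractions import Fraction
from math import comb

# Exact check of E[X] = np and V[X] = np(1-p) for BIN(n, p),
# using the pmf f(x) = C(n, x) p^x (1-p)^(n-x) stated above.
n, p = 10, Fraction(3, 10)

f = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
assert sum(f.values()) == 1            # the pmf sums to 1 exactly

E = sum(x * f[x] for x in f)
E2 = sum(x * x * f[x] for x in f)
V = E2 - E * E                         # V[X] = E[X^2] - (E[X])^2

assert E == n * p                      # 3
assert V == n * p * (1 - p)            # 21/10
print(E, V)
```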
Discrete random vectors A random vector X = (X1, …, Xp)^T is called discrete if each X_j is a discrete random variable.

A real-valued function f of the variables X1, …, Xp is a joint probability density function of discrete random variables X1, X2, …, Xp (with range space R_X) if

(a) f(x1, …, xp) ≥ 0 for all (x1, …, xp) ∈ R_X,

(b) Σ_{x1} ⋯ Σ_{xp} f(x1, …, xp) = 1.

Then the joint distribution F(x1, …, xp) of the random vector X can be calculated in terms of the joint density function:

F(x1, …, xp) = Σ_{b_j ≤ x_j, 1 ≤ j ≤ p} f(b1, …, bp).
Marginalisation Given a joint probability density function f(x, y), the density function of a single variable is given by

f(x) = Σ_y f(x, y).   (4.5)

Here f(x) is called a marginal of the joint probability distribution f(x, y). The process of computing a marginal from a joint distribution is called marginalisation. More generally, one has

f(x1, …, x_{i−1}, x_{i+1}, …, x_n) = Σ_{x_i} f(x1, …, x_n).   (4.6)

Geometrically, in the discrete case, formulas (4.5) and (4.6) are analogous to computing the volume of a box by adding up the volumes of its rectangular slices.
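As a concrete instance of (4.5), take the joint pmf of two fair dice; marginalising out y recovers the uniform pmf of a single die. A Python sketch:

```python
from fractions import Fraction

# Joint pmf f(x, y) of two fair dice: x = first die, y = second die.
f = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

# Marginalisation (4.5): f(x) = sum over y of f(x, y)
def marginal_x(x):
    return sum(f[(x, y)] for y in range(1, 7))

assert all(marginal_x(x) == Fraction(1, 6) for x in range(1, 7))
assert sum(marginal_x(x) for x in range(1, 7)) == 1
print(marginal_x(1))   # 1/6
```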
Independent We say that n discrete random variables X1, …, Xn are (mutually) independent if

F_{X1,…,Xn}(x1, …, xn) = F_{X1}(x1) ⋯ F_{Xn}(xn),   (4.7)

where F_{Xj} is the cumulative distribution function of Xj, and F_{X1,…,Xn} is the joint cumulative distribution function of X1, …, Xn. Note that this factorization is the definition of independence; the variables are, in addition, identically distributed if all the F_{Xj} coincide.
4.3 Continuous random variables
Continuous random variables If X is a continuous random variable, a function f satisfying

∫_a^b f(x) dx = Pr(a ≤ X ≤ b), ∀a ≤ b,

is called the probability density function (pdf), which satisfies

f(x) ≥ 0, ∀x, and ∫_{−∞}^{∞} f(x) dx = 1.

Every random variable is characterized through its probability density function. The cumulative distribution function F(b) can be calculated by

F(b) = Pr(X ≤ b) = ∫_{−∞}^{b} f(x) dx.   (4.8)

Conversely, if F(x) is the cumulative distribution function of a continuous random variable X, the probability density function f(x) of X is the derivative of F(x):

(d/dx) F(x) = f(x).

Let X be a continuous random variable whose cdf is F(x). Then

• Pr(X ≤ b) = Pr(X < b) = F(b),

• Pr(X = x) = 0,

• Pr(a ≤ X ≤ b) = Pr(a < X < b) = F(b) − F(a).
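The identity Pr(a ≤ X ≤ b) = F(b) − F(a) can be checked numerically for, say, the exponential density f(x) = e^{−x} on x ≥ 0 (an illustrative choice with closed-form cdf F(x) = 1 − e^{−x}), comparing a midpoint Riemann sum against the cdf:

```python
import math

# Exponential density f(x) = exp(-x) for x >= 0, with cdf F(x) = 1 - exp(-x).
def f(x):
    return math.exp(-x)

def F(x):
    return 1.0 - math.exp(-x)

def integral(a, b, steps=100_000):
    # midpoint Riemann sum approximating the integral of f over [a, b]
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

a, b = 0.5, 2.0
assert abs(integral(a, b) - (F(b) - F(a))) < 1e-8   # Pr(a <= X <= b) = F(b) - F(a)
assert abs(integral(0.0, 50.0) - 1.0) < 1e-6        # total probability is 1
print(F(b) - F(a))
```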
[Example] (Normal distribution) Consider the density f(x) = (1/(σ√(2π))) exp(−(x − μ)^2/(2σ^2)). It is indeed a pdf:

∫_{−∞}^{∞} f(x) dx = (1/(σ√(2π))) ∫_{−∞}^{∞} exp(−(x − μ)^2/(2σ^2)) dx = (σ√2/(σ√(2π))) ∫_{−∞}^{∞} exp(−r^2) dr = 1.

Here we changed the variable from x to r = (x − μ)/(σ√2). The expectation and variance (for the definition, see § 3.4) are given by E[X] = μ and V[X] = σ^2.

Continuous random vector A random vector X = (X1, …, Xp)^T is called continuous if there exists an integrable function f(x1, …, xp) ≥ 0 such that

F(x) = F(x1, …, xp) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xp} f(x1, …, xp) dx1 ⋯ dxp,

where x = (x1, …, xp)^T, f is the probability density function (pdf) of X and F is the cumulative distribution function (cdf) of X.

A real-valued function f of the variables X1, …, Xp is a joint probability density function of continuous random variables X1, …, Xp (with range space R_X) if

(a) f(x1, …, xp) ≥ 0 for all (x1, …, xp) ∈ R_X,

(b) ∫_{R_X} f(x1, …, xp) dx1 ⋯ dxp = 1.
[Example] (Multivariate normal distribution) The multivariate normal (or Gaussian) distribution of a random vector X ∈ R^p has the following pdf:

f(x) = (1/((2π)^{p/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ)),   (4.10)

where x = (x1, …, xp)^T ∈ R^p, and the parameters are μ ∈ R^p and Σ ∈ R^{p×p}, with Σ positive definite. This distribution is denoted by X = (X1, …, Xp)^T ∼ N_p(μ, Σ).

Independent We say that n continuous random variables X1, …, Xn are (mutually) independent if

F_{X1,…,Xn}(x1, …, xn) = F_{X1}(x1) ⋯ F_{Xn}(xn),   (4.11)

where F_{Xj} is the cumulative distribution function of Xj, and F_{X1,…,Xn} is the joint cumulative distribution function of X1, …, Xn.
4.4 Expectation, mean, median, and variances
Expectation Let X be a random variable with probability density function f. Let g be a function of X. Let us define the average value of g under the probability distribution f, called the expectation of g, as follows, where the average is weighted by the relative probabilities at values of X.
The expectation of a function g of a discrete random variable X is given by 34

E[g(X)] := E_{X∼f}[g(X)] = Σ_{x∈R_X} g(x) f(x),

and the expected value of a function g : R → R of a continuous random variable X is given by

E[g(X)] := E_{X∼f}[g(X)] = ∫_{R_X} g(x) f(x) dx.
In particular, we have defined the expectation E[X] for a random variable X. E[X] gives the center (or the average, or the mean-value) of the distribution of X.35
For a constant c, we have the following properties:

• E[c] = c,

• E[X + c] = E[X] + c,

• E[cX] = c E[X].
Median The median is defined as the value b such that

Pr(X ≤ b) = 1/2.

More generally, for 0 ≤ α ≤ 1, the α-quantile is defined as the value b such that

Pr(X ≤ b) = α.
34 If the probability distribution is F, we also denote E[g(X)] = E_{X∼F}[g(X)].
35 By the law of large numbers, the average value of the variable converges to E[X] as the number of repetitions approaches infinity.
Variance The variance of a random variable X, denoted by V[X] or var[X], is defined by

V[X] := E[(X − E[X])^2].

More generally, V[g(X)] = E[(g(X) − E[g(X)])^2] gives a measure of how much the values of a function g of a random variable X vary around their expected value. When V[g(X)] is low, the values of g(X) cluster near their expected value. In particular, we have

V[X] = E[X^2 − 2X E[X] + (E[X])^2] = E[X^2] − 2 E[X] · E[X] + (E[X])^2 = E[X^2] − (E[X])^2.   (4.12)

Hence

E[X^2] = V[X] + (E[X])^2.

We also have the following properties:

• V[c] = 0,

• V[X + c] = V[X],

• V[cX] = c^2 V[X].
Variance V [X] is the expectation of the squared deviation of a random variable X from its mean E[X]. Informally, it measures how far a set of (random) numbers are spread out from their average value.
The standard deviation is defined as
D[X] := √(V[X]).
Conventionally, the variance and the standard deviation are denoted by σ^2 and σ.

[Example] Let S be the random variable with pdf given by the following table:

s      1    2    3    4
f(s)   0.1  0.1  0.5  0.3

Then

E[S] = 0.1 × 1 + 0.1 × 2 + 0.5 × 3 + 0.3 × 4 = 3,

and

V[S] = 0.1 · (1 − 3)^2 + 0.1 · (2 − 3)^2 + 0.5 · (3 − 3)^2 + 0.3 · (4 − 3)^2 = 0.8.

Expectation and Covariance For multivariate random variables X1, …, Xn, we define the expectation of X = (X1, …, Xn)^T by

E[X] := (E[X1], …, E[Xn])^T.

The covariance matrix of X, Cov[X], is defined by

Cov[X] = E[(X − E[X])(X − E[X])^T].

The covariance matrix has the covariance value Cov[Xi, Xj] as its (i, j)-th element, given by

(Cov(X))_{i,j} = Cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])].

The covariance matrix Cov(X) is symmetric and positive semi-definite, and it tells us something about the spread of the data.

In particular, the covariance between two univariate random variables X, Y ∈ R is given by

Cov[X, Y] := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].

The covariance of a variable with itself, Cov[X, X], is the variance V[X]. The square root of the variance is called the standard deviation D[X].

For two random vectors X = (X1, …, Xp)^T, Y = (Y1, …, Yp)^T, a matrix A and a vector b, the following statements hold:

1. E[AX + b] = A E[X] + b.

2. E[X + Y] = E[X] + E[Y].

3. Cov[AX + b] = A Cov[X] A^T.

4. Cov[X + Y] = Cov[X] + Cov[Y], if X and Y are stochastically independent.

5. The matrix Cov[X] is non-negative definite.
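The worked example above can be reproduced with exact rational arithmetic, which also confirms the identity E[X^2] = V[X] + (E[X])^2. A Python sketch:

```python
from fractions import Fraction

# The pmf of S from the table above.
f = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(5, 10), 4: Fraction(3, 10)}
assert sum(f.values()) == 1

E = sum(s * p for s, p in f.items())
V = sum((s - E) ** 2 * p for s, p in f.items())
E2 = sum(s * s * p for s, p in f.items())

assert E == 3                    # E[S] = 3
assert V == Fraction(8, 10)      # V[S] = 0.8
assert E2 == V + E ** 2          # E[X^2] = V[X] + (E[X])^2
print(E, V)
```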
Steiner rule As a generalization of (4.12), we have:

Theorem 4.1 (Steiner's rule) Let X = (X1, …, Xp)^T be a random vector. Then

E[(X − b)(X − b)^T] = Cov[X] + (b − E[X])(b − E[X])^T, ∀b ∈ R^p.   (4.13)

Proof: Let μ = E[X]. Note that E[(X − μ)(μ − b)^T] = E[X − μ](μ − b)^T = (μ − μ)(μ − b)^T = 0. Hence

E[(X − b)(X − b)^T] = E[(X − μ + μ − b)(X − μ + μ − b)^T]
= E[(X − μ)(X − μ)^T] + E[(μ − b)(μ − b)^T] + E[(X − μ)(μ − b)^T] + E[(μ − b)(X − μ)^T]
= E[(X − μ)(X − μ)^T] + (μ − b)(μ − b)^T
= Cov[X] + (μ − b)(μ − b)^T = Cov[X] + (b − E[X])(b − E[X])^T. □
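The one-dimensional case of Theorem 4.1, E[(X − b)^2] = V[X] + (b − E[X])^2, can be verified exactly for a small pmf and several choices of b (the pmf below is an arbitrary illustration):

```python
from fractions import Fraction

# One-dimensional instance of Steiner's rule (4.13):
# E[(X - b)^2] = V[X] + (b - E[X])^2 for every b.
f = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}   # an arbitrary pmf

E = sum(x * p for x, p in f.items())
V = sum((x - E) ** 2 * p for x, p in f.items())

for b in [Fraction(-2), Fraction(0), Fraction(5, 4), Fraction(7)]:
    lhs = sum((x - b) ** 2 * p for x, p in f.items())
    assert lhs == V + (b - E) ** 2
print(E, V)
```

Taking b = 0 recovers (4.12) in the form E[X^2] = V[X] + (E[X])^2.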
Plot pdf

[Example] Consider the Gaussian distribution:

N(x; μ, σ^2) = (1/(σ√(2π))) exp(−(x − μ)^2/(2σ^2)).

If μ = 0.3 and σ = 0.87, we can plot its probability density function by the following MATLAB code:

mu = 0.3;
sigma = 0.87;
pd = makedist('Normal', 'mu', mu, 'sigma', sigma);
x = -3:.1:3;
pdf_normal = pdf(pd, x);
plot(x, pdf_normal, 'LineWidth', 2)

Figure 4.1: Plot of the pdf of the Gaussian distribution.
[Example] The probability density function (pdf) of the logistic distribution is

f(x; μ, σ) = exp((x − μ)/σ) / (σ (1 + exp((x − μ)/σ))^2), ∀ −∞ < x < ∞.
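As with the Gaussian example, one can check numerically that this logistic density integrates to 1 and peaks at x = μ with value 1/(4σ). A Python sketch (truncating the integral to [−40, 40], where the omitted tail mass is negligible):

```python
import math

# Logistic pdf with location mu and scale sigma, as written above.
def f(x, mu=0.0, sigma=1.0):
    e = math.exp((x - mu) / sigma)
    return e / (sigma * (1.0 + e) ** 2)

def integral(a, b, steps=200_000):
    # midpoint Riemann sum approximating the integral of f over [a, b]
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

assert abs(integral(-40.0, 40.0) - 1.0) < 1e-6    # total probability is 1
assert abs(f(0.3, mu=0.3) - 0.25) < 1e-12         # peak value 1/(4*sigma) at x = mu
print(integral(-40.0, 40.0))
```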