CS计算机代考程序代写 scheme Bayesian flex algorithm 2

2
Probability Distributions
In Chapter 1, we emphasized the central role played by probability theory in the solution of pattern recognition problems. We turn now to an exploration of some particular examples of probability distributions and their properties. As well as be- ing of great interest in their own right, these distributions can form building blocks for more complex models and will be used extensively throughout the book. The distributions introduced in this chapter will also serve another important purpose, namely to provide us with the opportunity to discuss some key statistical concepts, such as Bayesian inference, in the context of simple models before we encounter them in more complex situations in later chapters.
One role for the distributions discussed in this chapter is to model the prob- ability distribution p(x) of a random variable x, given a finite set x1, . . . , xN of observations. This problem is known as density estimation. For the purposes of this chapter, we shall assume that the data points are independent and identically distributed. It should be emphasized that the problem of density estimation is fun-
67

68 2. PROBABILITY DISTRIBUTIONS
damentally ill-posed, because there are infinitely many probability distributions that could have given rise to the observed finite data set. Indeed, any distribution p(x) that is nonzero at each of the data points x1, . . . , xN is a potential candidate. The issue of choosing an appropriate distribution relates to the problem of model selec- tion that has already been encountered in the context of polynomial curve fitting in Chapter 1 and that is a central issue in pattern recognition.
We begin by considering the binomial and multinomial distributions for discrete random variables and the Gaussian distribution for continuous random variables. These are specific examples of parametric distributions, so-called because they are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian for example. To apply such models to the problem of density estimation, we need a procedure for determining suitable values for the parameters, given an observed data set. In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes’ theorem to compute the corresponding posterior distribution given the observed data.
We shall see that an important role is played by conjugate priors, that lead to posterior distributions having the same functional form as the prior, and that there- fore lead to a greatly simplified Bayesian analysis. For example, the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution, while the conjugate prior for the mean of a Gaussian is another Gaussian. All of these distributions are examples of the exponential family of distributions, which possess a number of important properties, and which will be discussed in some detail.
One limitation of the parametric approach is that it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application. An alternative approach is given by nonparametric density estimation methods in which the form of the distribution typically depends on the size of the data set. Such models still contain parameters, but these control the model complexity rather than the form of the distribution. We end this chapter by considering three nonparametric methods based respectively on histograms, nearest-neighbours, and kernels.
2.1. Binary Variables
We begin by considering a single binary random variable x ∈ {0, 1}. For example, x might describe the outcome of flipping a coin, with x = 1 representing ‘heads’, and x = 0 representing ‘tails’. We can imagine that this is a damaged coin so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the parameter μ so that
p(x = 1|μ) = μ (2.1)

Exercise 2.1
2.1. Binary Variables 69 where 0 à μ à 1, from which it follows that p(x = 0|μ) = 1 − μ. The probability
distribution over x can therefore be written in the form
Bern(x|μ) = μx(1 − μ)1−x (2.2)
which is known as the Bernoulli distribution. It is easily verified that this distribution is normalized and that it has mean and variance given by
E[x] = μ (2.3) var[x] = μ(1 − μ). (2.4)
Now suppose we have a data set D = {x1,…,xN} of observed values of x. We can construct the likelihood function, which is a function of μ, on the assumption that the observations are drawn independently from p(x|μ), so that
Section 2.4
At this point, it is worth noting that the#log likelihood function depends on the N observations xn only through their sum n xn. This sum provides an example of a sufficient statistic for the data under this distribution, and we shall study the impor- tant role of sufficient statistics in some detail. If we set the derivative of ln p(D|μ) with respect to μ equal to zero, we obtain the maximum likelihood estimator
!N p(D|μ) =
n=1
p(xn|μ) =
!N n=1
μxn (1 − μ)1−xn . (2.5)
In a frequentist setting, we can estimate a value for μ by maximizing the likelihood function, or equivalently by maximizing the logarithm of the likelihood. In the case of the Bernoulli distribution, the log likelihood function is given by
“N
ln p(D|μ) =
n=1
ln p(xn|μ) =
“N n=1
{xn ln μ + (1 − xn) ln(1 − μ)} . (2.6)
Jacob Bernoulli 1654–1705
Jacob Bernoulli, also known as Jacques or James Bernoulli, was a Swiss mathematician and was the first of many in the Bernoulli family to pursue a career in science and mathematics. Although compelled
1 “N
μML = N xn (2.7)
n=1
his time, including Boyle and Hooke in England. When he returned to Switzerland, he taught mechanics and became Professor of Mathematics at Basel in 1687. Unfortunately, rivalry between Jacob and his younger brother Johann turned an initially productive collabora- tion into a bitter and public dispute. Jacob’s most sig- nificant contributions to mathematics appeared in The Art of Conjecture published in 1713, eight years after his death, which deals with topics in probability the- ory including what has become known as the Bernoulli distribution.
to study philosophy and theology against his will by his parents, he travelled extensively after graduating in order to meet with many of the leading scientists of

70
2. PROBABILITY DISTRIBUTIONS
Figure 2.1
Histogram plot of the binomial dis- tribution (2.9) as a function of m for N =10andμ=0.25.
0.3 0.2 0.1
0 0 1 2 3 4 5 6 7 8 9 10
m
which is also known as the sample mean. If we denote the number of observations of x = 1 (heads) within this data set by m, then we can write (2.7) in the form
μML = Nm (2.8)
so that the probability of landing heads is given, in this maximum likelihood frame- work, by the fraction of observations of heads in the data set.
Now suppose we flip a coin, say, 3 times and happen to observe 3 heads. Then N = m = 3 and μML = 1. In this case, the maximum likelihood result would predict that all future observations should give heads. Common sense tells us that this is unreasonable, and in fact this is an extreme example of the over-fitting associ- ated with maximum likelihood. We shall see shortly how to arrive at more sensible conclusions through the introduction of a prior distribution over μ.
We can also work out the distribution of the number m of observations of x = 1, given that the data set has size N. This is called the binomial distribution, and from (2.5) we see that it is proportional to μm(1 − μ)N−m. In order to obtain the normalization coefficient we note that out of N coin flips, we have to add up all of the possible ways of obtaining m heads, so that the binomial distribution can be written
where
Bin(m|N,μ)= N μm(1−μ)N−m m
N≡ N!
m (N − m)!m!
(2.9) (2.10)
Exercise 2.3
is the number of ways of choosing m objects out of a total of N identical objects. Figure 2.1 shows a plot of the binomial distribution for N = 10 and μ = 0.25.
The mean and variance of the binomial distribution can be found by using the result of Exercise 1.10, which shows that for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances. Because m = x1 + . . . + xN , and for each observation the mean and variance are

Exercise 2.4
var[m]≡
These results can also be proved directly using calculus.
2.1.1 The beta distribution
Exercise 2.5
Exercise 2.6
We have seen in (2.8) that the maximum likelihood setting for the parameter μ in the Bernoulli distribution, and hence in the binomial distribution, is given by the fraction of the observations in the data set having x = 1. As we have already noted, this can give severely over-fitted results for small data sets. In order to develop a Bayesian treatment for this problem, we need to introduce a prior distribution p(μ) over the parameter μ. Here we consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties. To motivate this prior, we note that the likelihood function takes the form of the product of factors of the form μx(1 − μ)1−x. If we choose a prior to be proportional to powers of μ and (1 − μ), then the posterior distribution, which is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior. This property is called conjugacy and we will see several examples of it later in this chapter. We therefore choose a prior, called the beta distribution, given by
Beta(μ|a,b)= Γ(a+b)μa−1(1−μ)b−1 (2.13) Γ(a)Γ(b)
where Γ(x) is the gamma function defined by (1.141), and the coefficient in (2.13) ensures that the beta distribution is normalized, so that
“1
Beta(μ|a, b) dμ = 1. (2.14) 0
given by (2.3) and (2.4), respectively, we have
!N m=0
mBin(m|N,μ) = Nμ (m−E[m])2Bin(m|N,μ) = Nμ(1−μ).
(2.11) (2.12)
E[m] ≡
!N m=0
The mean and variance of the beta distribution are given by
E[μ] = a (2.15)
2.1. Binary Variables
71
a+b
var[μ] = ab . (2.16)
(a+b)2(a+b+1)
The parameters a and b are often called hyperparameters because they control the distribution of the parameter μ. Figure 2.2 shows plots of the beta distribution for various values of the hyperparameters.
The posterior distribution of μ is now obtained by multiplying the beta prior (2.13) by the binomial likelihood function (2.9) and normalizing. Keeping only the factors that depend on μ, we see that this posterior distribution has the form
p(μ|m, l, a, b) ∝ μm+a−1(1 − μ)l+b−1 (2.17)

72 2. PROBABILITY DISTRIBUTIONS
3
2
1
00 0.5μ1
a=1 b=1
3
2
1
00 0.5μ1
a=8 b=4
3
2
1
00 0.5μ1
a = 0.1 b = 0.1
3
2
1
00 0.5μ1
a=2 b=3
Figure 2.2 Plots of the beta distribution Beta(μ|a, b) given by (2.13) as a function of μ for various values of the hyperparameters a and b.
where l = N − m, and therefore corresponds to the number of ‘tails’ in the coin example. We see that (2.17) has the same functional dependence on μ as the prior distribution, reflecting the conjugacy properties of the prior with respect to the like- lihood function. Indeed, it is simply another beta distribution, and its normalization coefficient can therefore be obtained by comparison with (2.13) to give
p(μ|m,l,a,b)= Γ(m+a+l+b)μm+a−1(1−μ)l+b−1. (2.18) Γ(m + a)Γ(l + b)
We see that the effect of observing a data set of m observations of x = 1 and l observations of x = 0 has been to increase the value of a by m, and the value of b by l, in going from the prior distribution to the posterior distribution. This allows us to provide a simple interpretation of the hyperparameters a and b in the prior as an effective number of observations of x = 1 and x = 0, respectively. Note that a and b need not be integers. Furthermore, the posterior distribution can act as the prior if we subsequently observe additional data. To see this, we can imagine taking observations one at a time and after each observation updating the current posterior

2 1
00 0.5 1
μ
2 1
00 0.5 1
μ
2 1
00 0.5 1
μ
prior
likelihood function
posterior
Figure 2.3 Illustration of one step of sequential Bayesian inference. The prior is given by a beta distribution with parameters a = 2, b = 2, and the likelihood function, given by (2.9) with N = m = 1, corresponds to a single observation of x = 1, so that the posterior is given by a beta distribution with parameters a = 3, b = 2.
Section 2.3.5
distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. At each stage, the posterior is a beta distribution with some total number of (prior and actual) observed values for x = 1 and x = 0 given by the parameters a and b. Incorporation of an additional observation of x = 1 simply corresponds to incrementing the value of a by 1, whereas for an observation of x = 0 we increment b by 1. Figure 2.3 illustrates one step in this process.
We see that this sequential approach to learning arises naturally when we adopt a Bayesian viewpoint. It is independent of the choice of prior and of the likelihood function and depends only on the assumption of i.i.d. data. Sequential methods make use of observations one at a time, or in small batches, and then discard them before the next observations are used. They can be used, for example, in real-time learning scenarios where a steady stream of data is arriving, and predictions must be made before all of the data is seen. Because they do not require the whole data set to be stored or loaded into memory, sequential methods are also useful for large data sets. Maximum likelihood methods can also be cast into a sequential framework.
If our goal is to predict, as best we can, the outcome of the next trial, then we must evaluate the predictive distribution of x, given the observed data set D. From the sum and product rules of probability, this takes the form
!1 !1
p(x = 1|D) = p(x = 1|μ)p(μ|D) dμ = μp(μ|D) dμ = E[μ|D]. (2.19)
00
Using the result (2.18) for the posterior distribution p(μ|D), together with the result (2.15) for the mean of the beta distribution, we obtain
p(x = 1|D) = m + a (2.20) m+a+l+b
which has a simple interpretation as the total fraction of observations (both real ob- servations and fictitious prior observations) that correspond to x = 1. Note that in the limit of an infinitely large data set m, l → ∞ the result (2.20) reduces to the maximum likelihood result (2.8). As we shall see, it is a very general property that the Bayesian and maximum likelihood results will agree in the limit of an infinitely
2.1. Binary Variables 73

74 2. PROBABILITY DISTRIBUTIONS
Exercise 2.7
large data set. For a finite data set, the posterior mean for μ always lies between the prior mean and the maximum likelihood estimate for μ corresponding to the relative frequencies of events given by (2.7).
From Figure 2.2, we see that as the number of observations increases, so the posterior distribution becomes more sharply peaked. This can also be seen from the result (2.16) for the variance of the beta distribution, in which we see that the variance goes to zero for a → ∞ or b → ∞. In fact, we might wonder whether it is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.
To address this, we can take a frequentist view of Bayesian learning and show that, on average, such a property does indeed hold. Consider a general Bayesian inference problem for a parameter θ for which we have observed a data set D, de- scribed by the joint distribution p(θ, D). The following result
Exercise 2.8
2.2.
is equal to the prior mean of θ. Similarly, we can show that
varθ [θ] = ED [varθ [θ|D]] + varD [Eθ [θ|D]] . (2.24)
The term on the left-hand side of (2.24) is the prior variance of θ. On the right- hand side, the first term is the average posterior variance of θ, and the second term measures the variance in the posterior mean of θ. Because this variance is a positive quantity, this result shows that, on average, the posterior variance of θ is smaller than the prior variance. The reduction in variance is greater if the variance in the posterior mean is greater. Note, however, that this result only holds on average, and that for a particular observed data set it is possible for the posterior variance to be larger than the prior variance.
Multinomial Variables
Binary variables can be used to describe quantities that can take one of two possible values. Often, however, we encounter discrete variables that can take on one of K possible mutually exclusive states. Although there are various alternative ways to express such variables, we shall see shortly that a particularly convenient represen- tation is the 1-of-K scheme in which the variable is represented by a K-dimensional vector x in which one of the elements xk equals 1, and all remaining elements equal
where
Eθ[θ] = ED [Eθ[θ|D]]
Eθ[θ] ≡ ! p(θ)θ dθ
ED [Eθ [θ|D]] ≡ ! ! θp(θ|D) dθ
(2.21)
(2.22)
p(D) dD
says that the posterior mean of θ, averaged over the distribution generating the data,
(2.23)

Section 2.4 Appendix E
which represent the number of observations of xk = 1. These are called the sufficient statistics for this distribution.
In order to find the maximum likelihood solution for μ, we need to maximize ln p(D|μ) with respect to μk taking account of the constraint that the μk must sum to one. This can be achieved using a Lagrange multiplier λ and maximizing
0. So, for instance if we have a variable that can take K = 6 states and a particular observation of the variable happens to correspond to the state where x3 = 1, then x
will be represented by
Note that such vectors satisfy !Kk=1 xk = 1. If we denote the probability of xk = 1
x = (0,0,1,0,0,0)T. (2.25) by the parameter μk, then the distribution of x is given
p(x|μ) =
μxk (2.26)
wher!e μ = (μ1, . . . , μK )T, and the parameters μk are constrained to satisfy μk  0 and k μk = 1, because they represent probabilities. The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. It is easily seen that the distribution is normalized
# #K
p(x|μ) = μk = 1 (2.27)
x k=1
and that #
E[x|μ] = p(x|μ)x = (μ1,…,μM)T = μ. (2.28)
x
Now consider a data set D of N independent observations x1, . . . , xN . The corresponding likelihood function takes the form
“N “K p(D|μ) =
n=1 k=1
We see that the likelihood function depends on the N data points only through the
K quantities
#
mk = xnk (2.30) n
#K k=1
mklnμk+λ
#K k=1
μk−1 .
(2.31)
(2.32)
2.2. Multinomial Variables 75
“K
k
k=1
“K ( P x n k ) “K
μxnk = μ n = μmk . (2.29)
Setting the derivative of (2.31) with respect to μk to zero, we obtain μk = −mk/λ.
kkk k=1 k=1

76 2. PROBABILITY DISTRIBUTIONS
Exercise 2.9
The normalized form for this distribution is by
W!e can solve for the Lagrange multiplier λ by substituting (2.32) into the constraint k μk = 1 to give λ = −N. Thus we obtain the maximum likelihood solution in
the form
which is the fraction of the N observations for which xk = 1.
We can consider the joint distribution of the quantities m1 , . . . , mK , conditioned
on the parameters μ and on the total number N of observations. From (2.29) this takes the form
N $K
Mult(m1,m2,…,mK|μ,N) = μmk (2.34) m1m2 …mK k
k=1
which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning N objects into K groups of size m1 , . . . , mK and is
given by
Note that the variables mk are subject to the constraint
%K
mk = N. 2.2.1 The Dirichlet distribution
μML = mk (2.33) kN
N=N!. (2.35) m1m2 …mK m1!m2!…mK!
k=1
We now introduce a family of prior distributions for the parameters {μk} of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by
k=1
p(μ|α) ∝
μαk−1 (2.37)
where 0 à μk à 1 and !k μk = 1. Here α1,…,αK are the parameters of the distribution, and α denotes (α1,…,αK)T. Note that, because of the summation constraint, the distribution over the space of the {μk} is confined to a simplex of dimensionality K − 1, as illustrated for K = 3 in Figure 2.4.
Dir(μ|α) = 0
$K
k
Γ(α) $K
k=1
μαk−1 Γ(α1)···Γ(αK) k
(2.38)
(2.36)
which is called the Dirichlet distribution. Here Γ(x) is the gamma function defined by (1.141) while
%K k=1
α0 =
αk. (2.39)

Figure 2.4
The Dirichlet distribution over three variables μ1,μ2,μ3 is confined to a simplex (a bounded linear manifold) of the form shownP, as a consequence of the constraints 0àμk à1and kμk =1.
μ3
μ2
Plots of the Dirichlet distribution over the simplex, for various settings of the param- eters αk, are shown in Figure 2.5.
Multiplying the prior (2.38) by the likelihood function (2.34), we obtain the posterior distribution for the parameters {μk} in the form
k=1
p(μ|D,α) ∝ p(D|μ)p(μ|α) ∝
μαk+mk−1. (2.40)
We see that the posterior distribution again takes the form of a Dirichlet distribution, confirming that the Dirichlet is indeed a conjugate prior for the multinomial. This allows us to determine the normalization coefficient by comparison with (2.38) so that
p(μ|D, α) = Dir(μ|α + m) !K
= Γ(α0 +N) μαk+mk−1 (2.41)
Γ(α1 +m1)···Γ(αK +mK) k k=1
where we have denoted m = (m1, . . . , mK )T. As for the case of the binomial distribution with its beta prior, we can interpret the parameters αk of the Dirichlet prior as an effective number of observations of xk = 1.
Note that two-state quantities can either be represented as binary variables and
2.2. Multinomial Variables 77
!K
k
μ1
Lejeune Dirichlet 1805–1859
Johann Peter Gustav Lejeune Dirichlet was a modest and re- served mathematician who made contributions in number theory, me- chanics, and astronomy, and who gave the first rigorous analysis of
Fourier series. His family originated from Richelet in Belgium, and the name Lejeune Dirichlet comes
from ‘le jeune de Richelet’ (the young person from Richelet). Dirichlet’s first paper, which was published in 1825, brought him instant fame. It concerned Fer- mat’s last theorem, which claims that there are no positive integer solutions to xn + yn = zn for n > 2. Dirichlet gave a partial proof for the case n = 5, which was sent to Legendre for review and who in turn com- pleted the proof. Later, Dirichlet gave a complete proof for n = 14, although a full proof of Fermat’s last theo- rem for arbitrary n had to wait until the work of Andrew Wiles in the closing years of the 20th century.

78 2. PROBABILITY DISTRIBUTIONS
Figure 2.5 Plots of the Dirichlet distribution over three variables, where the two horizontal axes are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the density. Here {αk} = 0.1 on the left plot, {αk } = 1 in the centre plot, and {αk } = 10 in the right plot.
Section 1.6 Exercise 2.14
modelled using the binomial distribution (2.9) or as 1-of-2 variables and modelled using the multinomial distribution (2.34) with K = 2.
2.3. The Gaussian Distribution
The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form
N(x|μ,σ2)= 1 exp − 1 (x−μ)2 (2.42) (2πσ2)1/2 2σ2
where μ is the mean and σ2 is the variance. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form
N (x|μ, Σ) = 1 1 exp − 1 (x − μ)T Σ−1 (x − μ) (2.43) (2π)D/2 |Σ|1/2 2
where μ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
The Gaussian distribution arises in many different contexts and can be motivated from a variety of different perspectives. For example, we have already seen that for a single real variable, the distribution that maximizes the entropy is the Gaussian. This property applies also to the multivariate Gaussian.
Another situation in which the Gaussian distribution arises is when we consider the sum of multiple random variables. The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increas- ingly Gaussian as the number of terms in the sum increases (Walker, 1969). We can

3 2 1
00 0.5 1
3 2 1
00 0.5 1
N=1
3 2 1
00 0.5 1
N=2
N = 10
Figure 2.6 Histogram plots of the mean of N uniformly distributed numbers for various values of N. We observe that as N increases, the distribution tends towards a Gaussian.
Appendix C
illustrate this by considering N variables x1 , . . . , xN each of which has a uniform distribution over the interval [0, 1] and then considering the distribution of the mean (x1 + · · · + xN )/N . For large N , this distribution tends to a Gaussian, as illustrated in Figure 2.6. In practice, the convergence to a Gaussian as N increases can be very rapid. One consequence of this result is that the binomial distribution (2.9), which is a distribution over m defined by the sum of N observations of the random binary variable x, will tend to a Gaussian as N → ∞ (see Figure 2.1 for the case of N = 10).
The Gaussian distribution has many important analytical properties, and we shall consider several of these in detail. As a result, this section will be rather more tech- nically involved than some of the earlier sections, and will require familiarity with various matrix identities. However, we strongly encourage the reader to become pro- ficient in manipulating Gaussian distributions using the techniques presented here as this will prove invaluable in understanding the more complex models presented in later chapters.
We begin by considering the geometrical form of the Gaussian distribution. The
2.3. The Gaussian Distribution 79
Carl Friedrich Gauss 1777–1855
It is said that when Gauss went to elementary school at age 7, his teacher Bu ̈ttner, trying to keep the class occupied, asked the pupils to sum the integers from 1 to 100. To the teacher’s amazement, Gauss
ematician and scientist with a reputation for being a hard-working perfectionist. One of his many contribu- tions was to show that least squares can be derived under the assumption of normally distributed errors. He also created an early formulation of non-Euclidean geometry (a self-consistent geometrical theory that vi- olates the axioms of Euclid) but was reluctant to dis- cuss it openly for fear that his reputation might suffer if it were seen that he believed in such a geometry. At one point, Gauss was asked to conduct a geodetic survey of the state of Hanover, which led to his for- mulation of the normal distribution, now also known as the Gaussian. After his death, a study of his di- aries revealed that he had discovered several impor- tant mathematical results years or even decades be- fore they were published by others.
arrived at the answer in a matter of moments by noting that the sum can be represented as 50 pairs (1 + 100, 2 + 99, etc.) each of which added to 101, giving the an- swer 5,050. It is now believed that the problem which was actually set was of the same form but somewhat harder in that the sequence had a larger starting value and a larger increment. Gauss was a German math-

80 2. PROBABILITY DISTRIBUTIONS
Exercise 2.17
Exercise 2.18
functional dependence of the Gaussian on x is through the quadratic form
∆2 = (x − μ)TΣ−1(x − μ) (2.44)
which appears in the exponent. The quantity ∆ is called the Mahalanobis distance from μ to x and reduces to the Euclidean distance when Σ is the identity matrix. The Gaussian distribution will be constant on surfaces in x-space for which this quadratic form is constant.
First of all, we note that the matrix Σ can be taken to be symmetric, without loss of generality, because any antisymmetric component would disappear from the exponent. Now consider the eigenvector equation for the covariance matrix
Σui = λiui (2.45) where i = 1, . . . , D. Because Σ is a real, symmetric matrix its eigenvalues will be
real, and its eigenvectors can be chosen to form an orthonormal set, so that
uTi uj = Iij (2.46)
where Iij is the i, j element of the identity matrix and satisfies
I = 1, ifi=j (2.47)
Exercise 2.19
tors in the form
λiuiuTi
and similarly the inverse covariance matrix Σ−1 can be expressed as
Σ−1=”D 1uiuTi. i=1 λi
Substituting (2.49) into (2.44), the quadratic form becomes
ij 0, otherwise.
The covariance matrix Σ can be expressed as an expansion in terms of its eigenvec-
where we have defined
∆2=”D yi2 i=1 λi
yi = uTi (x − μ).
Σ =
“D i=1
(2.48)
(2.49)
(2.50) (2.51)
We can interpret {yi} as a new coordinate system defined by the orthonormal vectors ui that are shifted and rotated with respect to the original xi coordinates. Forming the vector y = (y1,…,yD)T, we have
y = U(x − μ) (2.52)

Figure 2.7
The red curve shows the ellip- x2 tical surface of constant proba- bility density for a Gaussian in
a two-dimensional space x = (x1,x2) on which the density
is exp(−1/2) of its value at x = μ. The major axes of the ellipse are defined by the eigenvectors ui of the covari- ance matrix, with correspond- ing eigenvalues λi.
u2
u1
y2
μ
2.3. The Gaussian Distribution
81
λ1/2 2
y1
λ1/2 1
x1
Appendix C
where U is a matrix whose rows are given by uTi . From (2.46) it follows that U is an orthogonal matrix, i.e., it satisfies UUT = I, and hence also UTU = I, where I is the identity matrix.
The quadratic form, and hence the Gaussian density, will be constant on surfaces for which (2.51) is constant. If all of the eigenvalues λi are positive, then these surfaces represent ellipsoids, with their centres at μ and their axes oriented along ui,
and with scaling factors in the directions of the axes given by λ1/2, as illustrated in Figure 2.7. i
For the Gaussian distribution to be well defined, it is necessary for all of the eigenvalues λi of the covariance matrix to be strictly positive, otherwise the dis- tribution cannot be properly normalized. A matrix whose eigenvalues are strictly positive is said to be positive definite. In Chapter 12, we will encounter Gaussian distributions for which one or more of the eigenvalues are zero, in which case the distribution is singular and is confined to a subspace of lower dimensionality. If all of the eigenvalues are nonnegative, then the covariance matrix is said to be positive semidefinite.
Now consider the form of the Gaussian distribution in the new coordinate system defined by the yi. In going from the x to the y coordinate system, we have a Jacobian matrix J with elements given by
Jij = ∂xi = Uji (2.53) ∂yj
where Uji are the elements of the matrix UT. Using the orthonormality property of the matrix U, we see that the square of the determinant of the Jacobian matrix is
|J|2 = àUTà2 = àUTà |U| = àUTUà = |I| = 1 (2.54) and hence |J| = 1. Also, the determinant |Σ| of the covariance matrix can be written

82 2. PROBABILITY DISTRIBUTIONS
as the product of its eigenvalues, and hence
j=1
!D 1 yj2 p(y) = p(x)|J| = (2πλ )1/2 exp −2λ
j=1j j
λ1/2.
Thus in the yj coordinate system, the Gaussian distribution takes the form
|Σ|1/2 =
(2.55)
(2.56)
!D
j
which is the product of D independent univariate Gaussian distributions. The eigen- vectors therefore define a new set of shifted and rotated coordinates with respect to which the joint probability distribution factorizes into a product of independent distributions. The integral of the distribution in the y coordinate system is then
$ !D$∞ 1 yj2
p(y)dy = (2πλ )1/2 exp −2λ dyj = 1 (2.57)
j=1−∞j j
where we have used the result (1.48) for the normalization of the univariate Gaussian. This confirms that the multivariate Gaussian (2.43) is indeed normalized.
We now look at the moments of the Gaussian distribution and thereby provide an interpretation of the parameters μ and Σ. The expectation of x under the Gaussian distribution is given by
1 1$ 1 T−1
E[x] = (2π)D/2 |Σ|1/2 $ exp −2(x−μ) Σ (x−μ) xdx
= 1 1 exp −1zTΣ−1z (z+μ)dz (2.58) (2π)D/2 |Σ|1/2 2
where we have changed variables using z = x − μ. We now note that the exponent is an even function of the components of z and, because the integrals over these are taken over the range (−∞, ∞), the term in z in the factor (z + μ) will vanish by symmetry. Thus
E[x] = μ (2.59)
and so we refer to μ as the mean of the Gaussian distribution.
We now consider second order moments of the Gaussian. In the univariate case,
we considered the second order moment given by E[x2]. For the multivariate Gaus- sian, there are D2 second order moments given by E[xixj], which we can group together to form the matrix E[xxT]. This matrix can be written as
T11$1T−1 T E[xx]= $ exp − (x−μ)Σ (x−μ) xx dx
(2π)D/2 |Σ|1/2 2
= 1 1 (2π)D/2 |Σ|1/2
exp −1zTΣ−1z (z+μ)(z+μ)T dz 2

(2π)D/2 |Σ|1/2 2 ”
= =
1 1!D!D
!Dy2
uiuTj exp− k yiyjdy
(2.61)
2.3. The Gaussian Distribution 83
where again we have changed variables using z = x − μ. Note that the cross-terms involving μzT and μTz will again vanish by symmetry. The term μμT is constant and can be taken outside the integral, which itself is unity because the Gaussian distribution is normalized. Consider the term involving zzT. Again, we can make use of the eigenvector expansion of the covariance matrix given by (2.45), together with the completeness of the set of eigenvectors, to write
!D j=1
where yj = uTj z, which gives
1 1 ” exp −1zTΣ−1z zzTdz
z =
yjuj (2.60)
Exercise 2.21
where we have made use of the eigenvector equation (2.45), together with the fact that the integral on the right-hand side of the middle line vanishes by symmetry unless i = j, and in the final line we have made use of the results (1.50) and (2.55), together with (2.48). Thus we have
E[xxT] = μμT + Σ. (2.62)
For single random variables, we subtracted the mean before taking second mo- ments in order to define a variance. Similarly, in the multivariate case it is again convenient to subtract off the mean, giving rise to the covariance of a random vector x defined by
cov[x] = E (x − E[x])(x − E[x])T . (2.63) For the specific case of a Gaussian distribution, we can make use of E[x] = μ,
together with the result (2.62), to give
cov[x] = Σ. (2.64)
Because the parameter matrix Σ governs the covariance of x under the Gaussian distribution, it is called the covariance matrix.
Although the Gaussian distribution (2.43) is widely used as a density model, it suffers from some significant limitations. Consider the number of free parameters in the distribution. A general symmetric covariance matrix Σ will have D(D + 1)/2 independent parameters, and there are another D independent parameters in μ, giv- ing D(D + 3)/2 parameters in total. For large D, the total number of parameters
(2π)D/2 |Σ|1/2
2λk k=1
!D i=1
i=1 j=1 uiuTi λi = Σ

84 2. PROBABILITY DISTRIBUTIONS
Figure 2.8 Contours of constant probability density for a Gaussian distribution in two dimensions in which the covariance matrix is (a) of general form, (b) diagonal, in which the elliptical contours are aligned with the coordinate axes, and (c) proportional to the identity matrix, in which the contours are concentric circles.
x2
x2 x2
x1 x1 x1 (a) (b) (c)
Section 8.3
Section 13.3
therefore grows quadratically with D, and the computational task of manipulating and inverting large matrices can become prohibitive. One way to address this prob- lem is to use restricted forms of the covariance matrix. If we consider covariance matrices that are diagonal, so that Σ = diag(σi2), we then have a total of 2D inde- pendent parameters in the density model. The corresponding contours of constant density are given by axis-aligned ellipsoids. We could further restrict the covariance matrix to be proportional to the identity matrix, Σ = σ2I, known as an isotropic co- variance, giving D + 1 independent parameters in the model and spherical surfaces of constant density. The three possibilities of general, diagonal, and isotropic covari- ancematricesareillustratedinFigure2.8. Unfortunately,whereassuchapproaches limit the number of degrees of freedom in the distribution and make inversion of the covariance matrix a much faster operation, they also greatly restrict the form of the probability density and limit its ability to capture interesting correlations in the data.
A further limitation of the Gaussian distribution is that it is intrinsically uni- modal (i.e., has a single maximum) and so is unable to provide a good approximation to multimodal distributions. Thus the Gaussian distribution can be both too flexible, in the sense of having too many parameters, while also being too limited in the range of distributions that it can adequately represent. We will see later that the introduc- tion of latent variables, also called hidden variables or unobserved variables, allows both of these problems to be addressed. In particular, a rich family of multimodal distributions is obtained by introducing discrete latent variables leading to mixtures of Gaussians, as discussed in Section 2.3.9. Similarly, the introduction of continuous latent variables, as described in Chapter 12, leads to models in which the number of free parameters can be controlled independently of the dimensionality D of the data space while still allowing the model to capture the dominant correlations in the data set. Indeed, these two approaches can be combined and further extended to derive a very rich set of hierarchical models that can be adapted to a broad range of prac- tical applications. For instance, the Gaussian version of the Markov random field, which is widely used as a probabilistic model of images, is a Gaussian distribution over the joint space of pixel intensities but rendered tractable through the imposition of considerable structure reflecting the spatial organization of the pixels. Similarly, the linear dynamical system, used to model time series data for applications such as tracking, is also a joint Gaussian distribution over a potentially large number of observed and latent variables and again is tractable due to the structure imposed on the distribution. A powerful framework for expressing the form and properties of

Exercise 2.22
which is known as the precision matrix. In fact, we shall see that some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision. We therefore also introduce the partitioned form of the precision matrix
Λ= Λaa Λab (2.69) Λba Λbb
corresponding to the partitioning (2.65) of the vector x. Because the inverse of a symmetric matrix is also symmetric, we see that Λaa and Λbb are symmetric, while ΛTab = Λba. It should be stressed at this point that, for instance, Λaa is not simply given by the inverse of Σaa. In fact, we shall shortly examine the relation between the inverse of a partitioned matrix and the inverses of its partitions.
Let us begin by finding an expression for the conditional distribution p(xa|xb). From the product rule of probability, we see that this conditional distribution can be
2.3. The Gaussian Distribution 85 such complex distributions is that of probabilistic graphical models, which will form
the subject of Chapter 8.
2.3.1 Conditional Gaussian distributions
An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
Consider first the case of conditional distributions. Suppose x is a D-dimensional vector with Gaussian distribution N (x|μ, Σ) and that we partition x into two dis- joint subsets xa and xb. Without loss of generality, we can take xa to form the first M components of x, with xb comprising the remaining D − M components, so that
We also define corresponding partitions of the mean vector μ given by μ= μa
μb and of the covariance matrix Σ given by
ance matrix
Λ ≡ Σ−1 (2.68)
x= xa . xb
(2.65)
(2.66)
(2.67)
Σ= Σaa Σab . Σba Σbb
Note that the symmetry ΣT = Σ of the covariance matrix implies that Σaa and Σbb are symmetric, while Σba = ΣTab.
In many situations, it will be convenient to work with the inverse of the covari-

86 2. PROBABILITY DISTRIBUTIONS
evaluated from the joint distribution p(x) = p(xa,xb) simply by fixing xb to the observed value and normalizing the resulting expression to obtain a valid probability distribution over xa. Instead of performing this normalization explicitly, we can obtain the solution more efficiently by considering the quadratic form in the exponent of the Gaussian distribution given by (2.44) and then reinstating the normalization coefficient at the end of the calculation. If we make use of the partitioning (2.65), (2.66), and (2.69), we obtain
−21(x − μ)TΣ−1(x − μ) =
−21(xa − μa)TΛaa(xa − μa) − 21(xa − μa)TΛab(xb − μb)
−12(xb − μb)TΛba(xa − μa) − 21(xb − μb)TΛbb(xb − μb). (2.70)
We see that as a function of xa, this is again a quadratic form, and hence the cor- responding conditional distribution p(xa|xb) will be Gaussian. Because this distri- bution is completely characterized by its mean and its covariance, our goal will be to identify expressions for the mean and covariance of p(xa|xb) by inspection of (2.70).
This is an example of a rather common operation associated with Gaussian distributions, sometimes called ‘completing the square’, in which we are given a quadratic form defining the exponent terms in a Gaussian distribution, and we need to determine the corresponding mean and covariance. Such problems can be solved straightforwardly by noting that the exponent in a general Gaussian distribution N (x|μ, Σ) can be written
−21(x − μ)TΣ−1(x − μ) = −21xTΣ−1x + xTΣ−1μ + const (2.71)
where ‘const’ denotes terms which are independent of x, and we have made use of the symmetry of Σ. Thus if we take our general quadratic form and express it in the form given by the right-hand side of (2.71), then we can immediately equate the matrix of coefficients entering the second order term in x to the inverse covariance matrix Σ−1 and the coefficient of the linear term in x to Σ−1μ, from which we can obtain μ.
Now let us apply this procedure to the conditional Gaussian distribution p(xa|xb) for which the quadratic form in the exponent is given by (2.70). We will denote the mean and covariance of this distribution by μa|b and Σa|b, respectively. Consider the functional dependence of (2.70) on xa in which xb is regarded as a constant. If we pick out all terms that are second order in xa, we have
−21xTa Λaaxa (2.72)
from which we can immediately conclude that the covariance (inverse precision) of
p(xa|xb) is given by
Σ = Λ−1. (2.73) a|b aa

Exercise 2.24
where we have made use of (2.73).
The results (2.73) and (2.75) are expressed in terms of the partitioned precision
matrix of the original joint distribution p(xa, xb). We can also express these results in terms of the corresponding partitioned covariance matrix. To do this, we make use of the following identity for the inverse of a partitioned matrix
A B −1 M −MBD−1
C D = −D−1CM D−1 + D−1CMBD−1 (2.76)
where we have defined
The quantity M−1 is known as the Schur complement of the matrix on the left-hand
Section 8.1.4
From these we obtain the following expressions for the mean and covariance of the conditional distribution p(xa|xb)
μ = μ +Σ Σ−1(x −μ) (2.81) a|b aabbbbb
Σa|b = Σaa − ΣabΣ−1Σba. (2.82) bb
Comparing (2.73) and (2.82), we see that the conditional distribution p(xa|xb) takes a simpler form when expressed in terms of the partitioned precision matrix than when it is expressed in terms of the partitioned covariance matrix. Note that the mean of the conditional distribution p(xa|xb), given by (2.81), is a linear function of xb and that the covariance, given by (2.82), is independent of xa. This represents an example of a linear-Gaussian model.
2.3. The Gaussian Distribution 87
Now consider all of the terms in (2.70) that are linear in xa
xTa {Λaaμa − Λab(xb − μb)} (2.74)
where we have used ΛTba = Λab. From our discussion of the general form (2.71), the coefficient of xa in this expression must equal Σ−1μa|b and hence
a|b
μa|b = Σa|b {Λaaμa − Λab(xb − μb)}
= μ −Λ−1Λ (x −μ) a aa ab b b
(2.75)
M = (A − BD−1C)−1. (2.77) side of (2.76) with respect to the submatrix D. Using the definition
ba bb
and making use of (2.76), we have
ba bb
Σ Σ −1 Λ Λ Σaa Σab = Λaa Λab
(2.78)
(2.79) (2.80)
Λaa = (Σaa − ΣabΣ−1Σba)−1 bb
Λab = −(Σaa − ΣabΣ−1Σba)−1ΣabΣ−1. bb bb

88 2. PROBABILITY DISTRIBUTIONS
2.3.2 Marginal Gaussian distributions
We have seen that if a joint distribution p(xa,xb) is Gaussian, then the condi- tional distribution p(xa|xb) will again be Gaussian. Now we turn to a discussion of the marginal distribution given by
p(xa)=! p(xa,xb)dxb (2.83)
which, as we shall see, is also Gaussian. Once again, our strategy for evaluating this distribution efficiently will be to focus on the quadratic form in the exponent of the joint distribution and thereby to identify the mean and covariance of the marginal distribution p(xa).
The quadratic form for the joint distribution can be expressed, using the par- titioned precision matrix, in the form (2.70). Because our goal is to integrate out xb, this is most easily achieved by first considering the terms involving xb and then completing the square in order to facilitate integration. Picking out just those terms that involve xb, we have
−1xTΛ x +xT m = −1(x −Λ−1m)TΛ (x −Λ−1m)+1mTΛ−1m (2.84) 2bbbbb 2bbb bbbbb 2bb
where we have defined
m = Λbbμb − Λba(xa − μa). (2.85)
We see that the dependence on xb has been cast into the standard quadratic form of a Gaussian distribution corresponding to the first term on the right-hand side of (2.84), plus a term that does not depend on xb (but that does depend on xa). Thus, when we take the exponential of this quadratic form, we see that the integration over xb required by (2.83) will take the form
!
This integration is easily performed by noting that it is the integral over an unnor- malized Gaussian, and so the result will be the reciprocal of the normalization co- efficient. We know from the form of the normalized Gaussian given by (2.43), that this coefficient is independent of the mean and depends only on the determinant of the covariance matrix. Thus, by completing the square with respect to xb, we can integrate out xb and the only term remaining from the contributions on the left-hand side of (2.84) that depends on xa is the last term on the right-hand side of (2.84) in which m is given by (2.85). Combining this term with the remaining terms from
exp −12(xb − Λ−1m)TΛbb(xb − Λ−1m) dxb. (2.86) bb bb

(2.70) that depend on xa, we obtain
21 [Λbbμb − Λba(xa − μa)]T Λ−1 [Λbbμb − Λba(xa − μa)]
2.3. The Gaussian Distribution 89
bb
−21xTa Λaaxa + xTa (Λaaμa + Λabμb) + const
= −1xT(Λ −Λ Λ−1Λ )x 2aaa abbbbaa
+xT(Λ − Λ Λ−1Λ )−1μ + const (2.87) a aa ab bb ba a
where ‘const’ denotes quantities independent of xa. Again, by comparison with (2.71), we see that the covariance of the marginal distribution of p(xa) is given by
Σa = (Λaa − ΛabΛ−1Λba)−1. (2.88) bb
Similarly, the mean is given by
Σa(Λaa − ΛabΛ−1Λba)μa = μa (2.89)
where we have used (2.88). The covariance in (2.88) is expressed in terms of the partitioned precision matrix given by (2.69). We can rewrite this in terms of the corresponding partitioning of the covariance matrix given by (2.67), as we did for the conditional distribution. These partitioned matrices are related by
(2.90)
(2.91)
Λ Λ −1 Σ Σ
Λaa Λab ba bb
= Σaa Σab ba bb
Making use of (2.76), we then have
Λaa − ΛabΛ−1Λba −1 = Σaa. bb
Thus we obtain the intuitively satisfying result that the marginal distribution p(xa) has mean and covariance given by
E[xa ] = μa (2.92) cov[xa] = Σaa. (2.93)
We see that for a marginal distribution, the mean and covariance are most simply ex- pressed in terms of the partitioned covariance matrix, in contrast to the conditional distribution for which the partitioned precision matrix gives rise to simpler expres- sions.
Our results for the marginal and conditional distributions of a partitioned Gaus- sian are summarized below.
Partitioned Gaussians
Given a joint Gaussian distribution N (x|μ, Σ) with Λ ≡ Σ−1 and
x= xa , μ= μa (2.94) xb μb
bb

90 2. PROBABILITY DISTRIBUTIONS
1
xb
0.5
00 0.5xa 1
xb = 0.7
p(xa,xb)
10
5
00 0.5xa 1
p(xa|xb = 0.7)
p(xa)
Figure 2.9 The plot on the left shows the contours of a Gaussian distribution p(xa,xb) over two variables, and the plot on the right shows the marginal distribution p(xa) (blue curve) and the conditional distribution p(xa|xb) for xb = 0.7 (red curve).
Σ= Σaa Σab , Λ= Λaa Λab . Σba Σbb Λba Λbb
Conditional distribution:
p(x |x ) = N(x|μ ,Λ−1) a b a|b aa
μ = μ −Λ−1Λ (x −μ). a|b aaaabbb
Marginal distribution:
p(xa) = N (xa|μa, Σaa).
We illustrate the idea of conditional and marginal distributions associated with
a multivariate Gaussian using an example involving two variables in Figure 2.9.
2.3.3 Bayes’ theorem for Gaussian variables
In Sections 2.3.1 and 2.3.2, we considered a Gaussian p(x) in which we parti- tioned the vector x into two subvectors x = (xa, xb) and then found expressions for the conditional distribution p(xa|xb) and the marginal distribution p(xa). We noted that the mean of the conditional distribution p(xa|xb) was a linear function of xb. Here we shall suppose that we are given a Gaussian marginal distribution p(x) and a Gaussian conditional distribution p(y|x) in which p(y|x) has a mean that is a linear function of x, and a covariance which is independent of x. This is an example of
(2.95)
(2.96) (2.97)
(2.98)

Exercise 2.29
2.3. The Gaussian Distribution 91
a linear Gaussian model (Roweis and Ghahramani, 1999), which we shall study in greater generality in Section 8.1.4. We wish to find the marginal distribution p(y) and the conditional distribution p(x|y). This is a problem that will arise frequently in subsequent chapters, and it will prove convenient to derive the general results here.
We shall take the marginal and conditional distributions to be
p(x) = N x|μ, Λ−1 (2.99)
p(y|x) = N y|Ax + b, L−1 (2.100)
where μ, A, and b are parameters governing the means, and Λ and L are precision matrices. If x has dimensionality M and y has dimensionality D, then the matrix A has size D × M .
First we find an expression for the joint distribution over x and y. To do this, we define
z= yx and then consider the log of the joint distribution
ln p(z) = ln p(x) + ln p(y|x)
= −21(x − μ)TΛ(x − μ)
−12(y − Ax − b)TL(y − Ax − b) + const
(2.101)
(2.102)
where ‘const’ denotes terms independent of x and y. As before, we see that this is a quadratic function of the components of z, and hence p(z) is Gaussian distribution. To find the precision of this Gaussian, we consider the second order terms in (2.102), which can be written as
−21xT(Λ + ATLA)x − 21yTLy + 12yTLAx + 21xTATLy 1 x T Λ+ATLA −ATL x 1
given by
Λ + ATLA −ATL
R= −LA L . (2.104)
The covariance matrix is found by taking the inverse of the precision, which can be done using the matrix inversion formula (2.76) to give
Λ−1 Λ−1 AT
cov[z] = R−1 = AΛ−1 L−1 + AΛ−1AT . (2.105)
= −2 y −LA L y =−2zTRz (2.103) and so the Gaussian distribution over z has precision (inverse covariance) matrix

92 2. PROBABILITY DISTRIBUTIONS
Similarly, we can find the mean of the Gaussian distribution over z by identify-
Exercise 2.30
Section 2.3
Λμ − ATLb E[z] = R−1 Lb .
Making use of (2.105), we then obtain
E[z] = μ . Aμ+b
(2.107)
(2.108)
Section 2.3
Next we find an expression for the marginal distribution p(y) in which we have marginalized over x. Recall that the marginal distribution over a subset of the com- ponents of a Gaussian random vector takes a particularly simple form when ex- pressed in terms of the partitioned covariance matrix. Specifically, its mean and covariance are given by (2.92) and (2.93), respectively. Making use of (2.105) and (2.108) we see that the mean and covariance of the marginal distribution p(y) are given by
E[y] = Aμ + b (2.109) cov[y] = L−1 + AΛ−1AT. (2.110)
A special case of this result is when A = I, in which case it reduces to the convolu- tion of two Gaussians, for which we see that the mean of the convolution is the sum of the mean of the two Gaussians, and the covariance of the convolution is the sum of their covariances.
Finally, we seek an expression for the conditional p(x|y). Recall that the results for the conditional distribution are most easily expressed in terms of the partitioned precision matrix, using (2.73) and (2.75). Applying these results to (2.105) and (2.108) we see that the conditional distribution p(x|y) has mean and covariance given by
E[x|y] = (Λ + ATLA)−1 ATL(y − b) + Λμ (2.111) cov[x|y] = (Λ + ATLA)−1. (2.112)
The evaluation of this conditional can be seen as an example of Bayes’ theorem. We can interpret the distribution p(x) as a prior distribution over x. If the variable y is observed, then the conditional distribution p(x|y) represents the corresponding posterior distribution over x. Having found the marginal and conditional distribu- tions, we effectively expressed the joint distribution p(z) = p(x)p(y|x) in the form p(x|y)p(y). These results are summarized below.
ing the linear terms in (2.102), which are given by
x T Λμ−ATLb
xTΛμ − xTATLb + yTLb = y Lb . (2.106)
Using our earlier result (2.71) obtained by completing the square over the quadratic form of a multivariate Gaussian, we find that the mean of z is given by

Appendix C
xnxTn . (2.119) (C.19), the derivative of the log likelihood with respect to μ is given by
2.3. The Gaussian Distribution 93 Marginal and Conditional Gaussians
Given a marginal Gaussian distribution for x and a conditional Gaussian distri- bution for y given x in the form
p(x) = N (x|μ, Λ−1) (2.113) p(y|x) = N (y|Ax + b, L−1) (2.114)
the marginal distribution of y and the conditional distribution of x given y are given by
where
estimate of the mean given by
p(y) = N (y|Aμ + b, L−1 + AΛ−1AT) p(x|y) = N (x|Σ{ATL(y − b) + Λμ}, Σ)
(2.115) (2.116)
(2.117)
Σ = (Λ + ATLA)−1. 2.3.4 Maximum likelihood for the Gaussian
Given a data set X = (x1, . . . , xN )T in which the observations {xn} are as- sumed to be drawn independently from a multivariate Gaussian distribution, we can estimate the parameters of the distribution by maximum likelihood. The log likeli- hood function is given by
ND N 1!N
lnp(X|μ,Σ) = − 2 ln(2π)− 2 ln|Σ|−2 (xn−μ)TΣ−1(xn−μ). (2.118)
n=1
By simple rearrangement, we see that the likelihood function depends on the data set
only through the two quantities
∂ !N
∂μ ln p(X|μ, Σ) = Σ−1(xn − μ) (2.120)
n=1
and setting this derivative to zero, we obtain the solution for the maximum likelihood
!N n=1
!N n=1
xn,
These are known as the sufficient statistics for the Gaussian distribution. Using
1 !N
μML = N xn (2.121)
n=1

94 2. PROBABILITY DISTRIBUTIONS
Exercise 2.34
which is the mean of the observed set of data points. The maximization of (2.118) with respect to Σ is rather more involved. The simplest approach is to ignore the symmetry constraint and show that the resulting solution is symmetric as required. Alternative derivations of this result, which impose the symmetry and positive defi- niteness constraints explicitly, can be found in Magnus and Neudecker (1999). The result is as expected and takes the form
1 !N
ΣML = N (xn − μML)(xn − μML)T (2.122)
n=1
which involves μML because this is the result of a joint maximization with respect to μ and Σ. Note that the solution (2.121) for μML does not depend on ΣML, and so we can first evaluate μML and then use this to evaluate ΣML.
If we evaluate the expectations of the maximum likelihood solutions under the true distribution, we obtain the following results
E[μML ] = μ (2.123) E[ΣML] = N−1Σ. (2.124)
N
We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased. We can correct this bias by defining a different estimator Σ” given by
1 !N
Σ” = N − 1 (xn − μML)(xn − μML)T. (2.125)
n=1
Clearly from (2.122) and (2.124), the expectation of Σ” is equal to Σ.
2.3.5 Sequential estimation
Our discussion of the maximum likelihood solution for the parameters of a Gaus- sian distribution provides a convenient opportunity to give a more general discussion of the topic of sequential estimation for maximum likelihood. Sequential methods allow data points to be processed one at a time and then discarded and are important for on-line applications, and also where large data sets are involved so that batch processing of all data points at once is infeasible.
Consider the result (2.121) for the maximum likelihood estimator of the mean
μ , which we will denote by μ(N) when it is based on N observations. If we ML ML
Exercise 2.35

Figure 2.10
A schematic illustration of two correlated ran- dom variables z and θ, together with the regression function f(θ) given by the con- ditional expectation E[z|θ]. The Robbins- Monro algorithm provides a general sequen- tial procedure for finding the root θ⋆ of such functions.
z
dissect out the contribution from the final data point xN , we obtain (N) 1!N
1 1N!−1 =NxN+N xn
n=1
= 1xN+N−1μ(N−1)
N NML
= μ(N−1)+1(xN−μ(N−1)).
μML = N xn n=1
2.3. The Gaussian Distribution 95
f(θ)
θ
θ⋆
(2.126) This result has a nice interpretation, as follows. After observing N − 1 data points
ML N ML
we have estimated μ by μ(N −1) . We now observe data point xN , and we obtain our
revised estimate μ(N) by moving the old estimate a small amount, proportional to ML
ML
1/N, in the direction of the ‘error signal’ (xN −μ(N−1)). Note that, as N increases, ML
so the contribution from successive data points gets smaller.
The result (2.126) will clearly give the same answer as the batch result (2.121)
because the two formulae are equivalent. However, we will not always be able to de- rive a sequential algorithm by this route, and so we seek a more general formulation of sequential learning, which leads us to the Robbins-Monro algorithm. Consider a pair of random variables θ and z governed by a joint distribution p(z, θ). The con- ditional expectation of z given θ defines a deterministic function f(θ) that is given by ”
f (θ) ≡ E[z|θ] = zp(z|θ) dz (2.127)
and is illustrated schematically in Figure 2.10. Functions defined in this way are called regression functions.
Our goal is to find the root θ⋆ at which f(θ⋆) = 0. If we had a large data set of observations of z and θ, then we could model the regression function directly and then obtain an estimate of its root. Suppose, however, that we observe values of z one at a time and we wish to find a corresponding sequential estimation scheme for θ⋆. The following general procedure for solving such problems was given by

96 2. PROBABILITY DISTRIBUTIONS
Robbins and Monro (1951). We shall assume that the conditional variance of z is
finite so that
E (z−f)2 |θ <∞ (2.128) and we shall also, without loss of generality, consider the case where f(θ) > 0 for θ > θ⋆ and f(θ) < 0 for θ < θ⋆, as is the case in Figure 2.10. The Robbins-Monro procedure then defines a sequence of successive estimates of the root θ⋆ given by θ(N) = θ(N−1) + aN−1z(θ(N−1)) (2.129) where z(θ(N)) is an observed value of z when θ takes the value θ(N). The coefficients {aN } represent a sequence of positive numbers that satisfy the conditions lim aN = 0 N →#∞ ∞ aN = ∞ N=1 #∞ a2N < ∞. N=1 (2.130) (2.131) (2.132) It can then be shown (Robbins and Monro, 1951; Fukunaga, 1990) that the sequence of estimates given by (2.129) does indeed converge to the root with probability one. Note that the first condition (2.130) ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second con- dition (2.131) is required to ensure that the algorithm does not converge short of the root, and the third condition (2.132) is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence. Now let us consider how a general maximum likelihood problem can be solved sequentially using the Robbins-Monro algorithm. By definition, the maximum like- lihood solution θML is a stationary point of the log likelihood function and hence satisfies #N à ∂ 1 lnp(xn|θ) à = 0. (2.133) ∂θ Nn=1 àθML Exchanging the derivative and the summation, and taking the limit N → ∞ we have lim 1 #N ∂ lnp(x |θ)=E ∂ lnp(x|θ) (2.134) and so we see that finding the maximum likelihood solution corresponds to find- ing the root of a regression function. We can therefore apply the Robbins-Monro procedure, which now takes the form θ(N) =θ(N−1) +a ∂ lnp(x |θ(N−1)). (2.135) N−1 ∂θ(N−1) N N→∞Nn=1∂θ n x ∂θ Figure 2.11 In the case of a Gaussian distribution, with θ corresponding to the mean μ, the regression function illustrated in Figure 2.10 takes the form of a straight line, as shown in red. In this case, the random variable z corresponds to the derivative of the log likelihood function and is given by (x − μML)/σ2, and its expectation that defines the regression function is a straight line given by (μ − μML)/σ2. The root of the regres- sion function corresponds to the maximum like- lihood estimator μML. z μ 2.3. The Gaussian Distribution 97 As a specific example, we consider once again the sequential estimation of the mean of a Gaussian distribution, in which case the parameter θ(N) is the estimate μ(N) of the mean of the Gaussian, and the random variable z is given by ML z= ∂ lnp(x|μML,σ2)= 1(x−μML). (2.136) ∂μML σ2 Thus the distribution of z is Gaussian with mean μ − μML, as illustrated in Fig- ure 2.11. Substituting (2.136) into (2.135), we obtain the univariate form of (2.126), provided we choose the coefficients aN to have the form aN = σ2/N. Note that although we have focussed on the case of a single variable, the same technique, together with the same restrictions (2.130)–(2.132) on the coefficients aN , apply equally to the multivariate case (Blum, 1965). 2.3.6 Bayesian inference for the Gaussian The maximum likelihood framework gave point estimates for the parameters μ and Σ. Now we develop a Bayesian treatment by introducing prior distributions over these parameters. Let us begin with a simple example in which we consider a single Gaussian random variable x. We shall suppose that the variance σ2 is known, and we consider the task of inferring the mean μ given a set of N observations X = {x1, . . . , xN }. The likelihood function, that is the probability of the observed data given μ, viewed as a function of μ, is given by !N 1 1#N p(X|μ) = p(xn|μ) = (2πσ2)N/2 exp −2σ2 (xn − μ)2 . (2.137) n=1 n=1 Again we emphasize that the likelihood function p(X|μ) is not a probability distri- bution over μ and is not normalized. We see that the likelihood function takes the form of the exponential of a quad- ratic form in μ. Thus if we choose a prior p(μ) given by a Gaussian, it will be a ML p(z|μ) μ 98 2. PROBABILITY DISTRIBUTIONS Exercise 2.38 conjugate distribution for this likelihood function because the corresponding poste- rior will be a product of two exponentials of quadratic functions of μ and hence will also be Gaussian. We therefore take our prior distribution to be p(μ) = N μ|μ0, σ02 (2.138) and the posterior distribution is given by p(μ|X) ∝ p(X|μ)p(μ). (2.139) Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by where p(μ|X) = N σ2 μ|μN , σN2 σN2 σ02 σ2 1 #N (2.140) (2.141) (2.142) μN = 1 = 1+N Nσ2 μ0 + 0 μML N σ 02 + σ 2 N σ 02 + σ 2 in which μML is the maximum likelihood solution for μ given by the sample mean μML = N xn. (2.143) n=1 It is worth spending a moment studying the form of the posterior mean and variance. First of all, we note that the mean of the posterior distribution given by (2.141) is a compromise between the prior mean μ0 and the maximum likelihood solution μML. If the number of observed data points N = 0, then (2.141) reduces to the prior mean as expected. For N → ∞, the posterior mean is given by the maximum likelihood solution. Similarly, consider the result (2.142) for the variance of the posterior distribution. We see that this is most naturally expressed in terms of the inverse variance, which is called the precision. Furthermore, the precisions are additive, so that the precision of the posterior is given by the precision of the prior plus one contribution of the data precision from each of the observed data points. As we increase the number of observed data points, the precision steadily increases, corresponding to a posterior distribution with steadily decreasing variance. With no observed data points, we have the prior variance, whereas if the number of data points N → ∞, the variance σN2 goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution. We therefore see that the maximum likelihood result of a point estimate for μ given by (2.143) is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations. Note also that for finite N , if we take the limit σ02 → ∞ in which the prior has infinite variance then the posterior mean (2.141) reduces to the maximum likelihood result, while from (2.142) the posterior variance is given by σN2 = σ2/N. Exercise 2.40 Section 2.3.5 Figure 2.12 Illustration of Bayesian inference for the mean μ of a Gaussian distri- bution, in which the variance is as- sumed to be known. The curves show the prior distribution over μ (the curve labelled N = 0), which in this case is itself Gaussian, along with the posterior distribution given by (2.140) for increasing numbers N of data points. The data points are generated from a Gaussian of mean 0.8 and variance 0.1, and the prior is chosen to have mean 0. In both the prior and the likelihood function, the variance is set to the true value. We illustrate our analysis of Bayesian inference for the mean of a Gaussian distribution in Figure 2.12. The generalization of this result to the case of a D- dimensional Gaussian random variable x with known covariance and unknown mean is straightforward. We have already seen how the maximum likelihood expression for the mean of a Gaussian can be re-cast as a sequential update formula in which the mean after observing N data points was expressed in terms of the mean after observing N − 1 data points together with the contribution from data point xN . In fact, the Bayesian paradigm leads very naturally to a sequential view of the inference problem. To see this in the context of the inference of the mean of a Gaussian, we write the posterior distribution with the contribution from the final data point xN separated out so that N"−1 n=1 The term in square brackets is (up to a normalization coefficient) just the posterior distribution after observing N − 1 data points. We see that this can be viewed as a prior distribution, which is combined using Bayes’ theorem with the likelihood function associated with data point xN to arrive at the posterior distribution after observing N data points. This sequential view of Bayesian inference is very general and applies to any problem in which the observed data are assumed to be independent and identically distributed. So far, we have assumed that the variance of the Gaussian distribution over the data is known and our goal is to infer the mean. Now let us suppose that the mean is known and we wish to infer the variance. Again, our calculations will be greatly simplified if we choose a conjugate form for the prior distribution. It turns out to be most convenient to work with the precision λ ≡ 1/σ2. The likelihood function for λ takes the form "N p(X|λ) = N (xn|μ, λ−1) ∝ λN/2 exp λ %N − 2 (xn − μ)2 . (2.145) n=1 n=1 p(μ|D) ∝ p(μ) p(xn|μ) p(xN |μ). (2.144) 2.3. The Gaussian Distribution 99 5 N=0 N = 10 N=2 N=1 0 −1 0 1 100 2. PROBABILITY DISTRIBUTIONS 2 1 00 λ1 2 Figure 2.13 a and b. Plot of the gamma distribution Gam(λ|a, b) defined by (2.146) for various values of the parameters The corresponding conjugate prior should therefore be proportional to the product of a power of λ and the exponential of a linear function of λ. This corresponds to the gamma distribution which is defined by Gam(λ|a, b) = 1 baλa−1 exp(−bλ). (2.146) Γ(a) Here Γ(a) is the gamma function that is defined by (1.141) and that ensures that (2.146) is correctly normalized. The gamma distribution has a finite integral if a > 0, and the distribution itself is finite if a  1. It is plotted, for various values of a and b, in Figure 2.13. The mean and variance of the gamma distribution are given by
Exercise 2.41 Exercise 2.42
2 1
00 λ1 2
2 1
00 λ1 2
a = 0.1 b = 0.1
a=1 b=1
a=4 b=6
E[λ] = ab (2.147) var[λ] = a . (2.148)
b2
Consider a prior distribution Gam(λ|a0,b0). If we multiply by the likelihood
function (2.145), then we obtain a posterior distribution
λ “N
p(λ|X) ∝ λa0−1λN/2 exp −b0λ − 2 (xn − μ)2 (2.149)
n=1
which we recognize as a gamma distribution of the form Gam(λ|aN , bN ) where
aN =a0+N2
1 “N
bN = b0 + 2
where σM2 L is the maximum likelihood estimator of the variance. Note that in (2.149) there is no need to keep track of the normalization constants in the prior and the likelihood function because, if required, the correct coefficient can be found at the end using the normalized form (2.146) for the gamma distribution.
n=1
(2.150) (xn − μ)2 = b0 + 2 σM2 L (2.151)
N

Section 2.2
From (2.150), we see that the effect of observing N data points is to increase the value of the coefficient a by N/2. Thus we can interpret the parameter a0 in the prior in terms of 2a0 ‘effective’ prior observations. Similarly, from (2.151) we see that the N data points contribute NσM2 L/2 to the parameter b, where σM2 L is the variance, and so we can interpret the parameter b0 in the prior as arising from the 2a0 ‘effective’ prior observations having variance 2b0/(2a0) = b0/a0. Recall that we made an analogous interpretation for the Dirichlet prior. These distributions are examples of the exponential family, and we shall see that the interpretation of a conjugate prior in terms of effective fictitious data points is a general one for the exponential family of distributions.
Instead of working with the precision, we can consider the variance itself. The conjugate prior in this case is called the inverse gamma distribution, although we shall not discuss this further because we will find it more convenient to work with the precision.
Now suppose that both the mean and the precision are unknown. To find a conjugate prior, we consider the dependence of the likelihood function on μ and λ
λμ2 p(μ,λ)∝ λ1/2exp − 2
βλ
= exp − 2 (μ−c/β)2
β
2.3. The Gaussian Distribution 101
!Nλ1/2 λ p(X|μ,λ)= 2π exp −2(xn −μ)2
n=1
λμ2N )N λ)N
∝ λ1/2 exp − 2 exp λμ xn − 2 x2n . n=1 n=1
(2.152)
We now wish to identify a prior distribution p(μ, λ) that has the same functional dependence on μ and λ as the likelihood function and that should therefore take the form
where c, d, and β are constants. Since we can always write p(μ, λ) = p(μ|λ)p(λ), we can find p(μ|λ) and p(λ) by inspection. In particular, we see that p(μ|λ) is a Gaussian whose precision is a linear function of λ and that p(λ) is a gamma distri- bution, so that the normalized prior takes the form
p(μ, λ) = N (μ|μ0, (βλ)−1)Gam(λ|a, b) (2.154)
where we have defined new constants given by μ0 = c/β, a = 1 + β/2, b = d−c2/2β. The distribution (2.154) is called the normal-gamma or Gaussian-gamma distribution and is plotted in Figure 2.14. Note that this is not simply the product of an independent Gaussian prior over μ and a gamma prior over λ, because the precision of μ is a linear function of λ. Even if we chose a prior in which μ and λ were independent, the posterior distribution would exhibit a coupling between the precision of μ and the value of λ.
exp{cλμ−dλ}
λβ/2 exp − d− 2β λ (2.153)
c2

102
2. PROBABILITY DISTRIBUTIONS
Exercise 2.45
Figure 2.14
Contour plot of the normal-gamma distribution (2.154) for parameter valuesμ0 =0,β=2,a=5and b = 6.
2
λ1
0
−2 0 2
μ
In the case of the multivariate Gaussian distribution N x|μ,Λ−1 dimensional variable x, the conjugate prior distribution for the mean μ, assuming the precision is known, is again a Gaussian. For known mean and unknown precision matrix Λ, the conjugate prior is the Wishart distribution given by
W(Λ|W,ν) = B|Λ|(ν−D−1)/2 exp −21Tr(W−1Λ) (2.155) where ν is called the number of degrees of freedom of the distribution, W is a D × D
scale matrix, and Tr(·) denotes the trace. The normalization constant B is given by &D ν+1−i −1
B(W, ν) = |W|−ν/2 2νD/2 πD(D−1)/4 Γ 2 . (2.156) i=1
for a D-
Section 2.3.6 Exercise 2.46
Again, it is also possible to define a conjugate prior over the covariance matrix itself, rather than over the precision matrix, which leads to the inverse Wishart distribu- tion, although we shall not discuss this further. If both the mean and the precision are unknown, then, following a similar line of reasoning to the univariate case, the conjugate prior is given by
p(μ, Λ|μ0, β, W, ν) = N (μ|μ0, (βΛ)−1) W(Λ|W, ν) (2.157) which is known as the normal-Wishart or Gaussian-Wishart distribution.
2.3.7 Student’s t-distribution
We have seen that the conjugate prior for the precision of a Gaussian is given by a gamma distribution. If we have a univariate Gaussian N (x|μ, τ −1 ) together with a Gamma prior Gam(τ |a, b) and we integrate out the precision, we obtain the marginal distribution of x in the form

Figure 2.15
Plot of Student’s t-distribution (2.159) for μ = 0 and λ = 1 for various values of ν. The limit ν → ∞ corresponds to a Gaussian distribution with mean μ and precision λ.
!∞
2.3. The Gaussian Distribution
103
0.5 0.4 0.3 0.2 0.1
0
−5 0 5
ν→∞ ν = 1.0 ν = 0.1
p(x|μ, a, b)
where we have made the change of variable z = τ[b + (x − μ)2/2]. By convention we define new parameters given by ν = 2a and λ = a/b, in terms of which the distribution p(x|μ, a, b) takes the form
= ! N (x|μ, τ −1 )Gam(τ |a, b) dτ 0
(2.158)

= ∞ bae(−bτ)τa−1
0 Γ(a) 2π 2
τ 1/2 exp
ba 1 1/2 (x − μ)2 −a−1/2
= Γ(a) 2π b+ 2
−τ (x − μ)2 Γ(a+1/2)
Γ(ν/2 + 1/2) λ 1/2 λ(x − μ)2 −ν/2−1/2 St(x|μ, λ, ν) = Γ(ν/2) πν 1 + ν
(2.159)
Exercise 2.47
which is known as Student’s t-distribution. The parameter λ is sometimes called the precision of the t-distribution, even though it is not in general equal to the inverse of the variance. The parameter ν is called the degrees of freedom, and its effect is illustrated in Figure 2.15. For the particular case of ν = 1, the t-distribution reduces to the Cauchy distribution, while in the limit ν → ∞ the t-distribution St(x|μ, λ, ν) becomes a Gaussian N (x|μ, λ−1 ) with mean μ and precision λ.
From (2.158), we see that Student’s t-distribution is obtained by adding up an infinite number of Gaussian distributions having the same mean but different preci- sions. This can be interpreted as an infinite mixture of Gaussians (Gaussian mixtures will be discussed in detail in Section 2.3.9. The result is a distribution that in gen- eral has longer ‘tails’ than a Gaussian, as was seen in Figure 2.15. This gives the t- distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers. The robustness of the t-distribution is illustrated in Figure 2.16, which compares the maximum likelihood solutions for a Gaussian and a t-distribution. Note that the max- imum likelihood solution for the t-distribution can be found using the expectation- maximization (EM) algorithm. Here we see that the effect of a small number of
Exercise 12.24

104 2. PROBABILITY DISTRIBUTIONS
0.5 0.4 0.3 0.2 0.1
0
−5 0 5 10
(a)
0.5 0.4 0.3 0.2 0.1
0
−5 0 5 10
(b)
Figure 2.16 Illustration of the robustness of Student’s t-distribution compared to a Gaussian. (a) Histogram distribution of 30 data points drawn from a Gaussian distribution, together with the maximum likelihood fit ob- tained from a t-distribution (red curve) and a Gaussian (green curve, largely hidden by the red curve). Because the t-distribution contains the Gaussian as a special case it gives almost the same solution as the Gaussian. (b) The same data set but with three additional outlying data points showing how the Gaussian (green curve) is strongly distorted by the outliers, whereas the t-distribution (red curve) is relatively unaffected.
outliers is much less significant for the t-distribution than for the Gaussian. Outliers can arise in practical applications either because the process that generates the data corresponds to a distribution having a heavy tail or simply through mislabelled data. Robustness is also an important property for regression problems. Unsurprisingly, the least squares approach to regression does not exhibit robustness, because it cor- responds to maximum likelihood under a (conditional) Gaussian distribution. By basing a regression model on a heavy-tailed distribution such as a t-distribution, we obtain a more robust model.
If we go back to (2.158) and substitute the alternative parameters ν = 2a, λ = a/b, and η = τ b/a, we see that the t-distribution can be written in the form
St(x|μ, λ, ν) = ! ∞ N x|μ, (ηλ)−1 Gam(η|ν/2, ν/2) dη. (2.160) 0
We can then generalize this to a multivariate Gaussian N (x|μ, Λ) to obtain the cor- responding multivariate Student’s t-distribution in the form
St(x|μ, Λ, ν) = ! ∞ N (x|μ, (ηΛ)−1)Gam(η|ν/2, ν/2) dη. (2.161) 0
Using the same technique as for the univariate case, we can evaluate this integral to Exercise 2.48 give

Exercise 2.49
properties
with corresponding results for the univariate case.
2.3.8 Periodic variables
Γ(D/2 + ν/2) |Λ|1/2 ∆2 −D/2−ν/2 St(x|μ, Λ, ν) = Γ(ν/2) (πν)D/2 1 + ν
2.3. The Gaussian Distribution 105
(2.162) where D is the dimensionality of x, and ∆2 is the squared Mahalanobis distance
defined by
∆2 = (x − μ)TΛ(x − μ). (2.163) This is the multivariate form of Student’s t-distribution and satisfies the following
E[x] = μ, if ν>1 cov[x] = ν Λ−1, if ν>2
(2.164) (2.165)
(2.166)
(ν − 2) mode[x] = μ
Although Gaussian distributions are of great practical significance, both in their own right and as building blocks for more complex probabilistic models, there are situations in which they are inappropriate as density models for continuous vari- ables. One important case, which arises in practical applications, is that of periodic variables.
An example of a periodic variable would be the wind direction at a particular geographical location. We might, for instance, measure values of wind direction on a number of days and wish to summarize this using a parametric distribution. Another example is calendar time, where we may be interested in modelling quantities that are believed to be periodic over 24 hours or over an annual cycle. Such quantities can conveniently be represented using an angular (polar) coordinate 0 à θ < 2π. We might be tempted to treat periodic variables by choosing some direction as the origin and then applying a conventional distribution such as the Gaussian. Such an approach, however, would give results that were strongly dependent on the arbitrary choice of origin. Suppose, for instance, that we have two observations at θ1 = 1◦ and θ2 = 359◦, and we model them using a standard univariate Gaussian distribution. If we choose the origin at 0◦, then the sample mean of this data set will be 180◦ with standard deviation 179◦, whereas if we choose the origin at 180◦, then the mean will be 0◦ and the standard deviation will be 1◦. We clearly need to develop a special approach for the treatment of periodic variables. Let us consider the problem of evaluating the mean of a set of observations D = {θ1,...,θN} of a periodic variable. From now on, we shall assume that θ is measured in radians. We have already seen that the simple average (θ1 +· · ·+θN )/N will be strongly coordinate dependent. To find an invariant measure of the mean, we note that the observations can be viewed as points on the unit circle and can therefore bedescribedinsteadbytwo-dimensionalunitvectorsx1,...,xN where∥xn∥=1 for n = 1,...,N, as illustrated in Figure 2.17. We can average the vectors {xn} 106 2. PROBABILITY DISTRIBUTIONS Figure 2.17 Illustration of the representation of val- ues θn of a periodic variable as two- dimensional vectors xn living on the unit circle. Also shown is the average x of those vectors. x2 x x4 3 x ̄ r ̄ x2 θ ̄ x1 x1 instead to give 1 !N x = N xn n=1 (2.167) and then find the corresponding angle θ of this average. Clearly, this definition will ensure that the location of the mean is independent of the origin of the angular coor- dinate. Note that x will typically lie inside the unit circle. The Cartesian coordinates of the observations are given by xn = (cos θn, sin θn), and we can write the Carte- sian coordinates of the sample mean in the form x = (r cos θ, r sin θ). Substituting into (2.167) and equating the x1 and x2 components then gives 1 !N give # sinθn θ = tan−1 #n cos θ . (2.169) rcosθ = N n=1 1 !N rsinθ = N sinθn. (2.168) n=1 cosθn, Taking the ratio, and using the identity tan θ = sin θ/ cos θ, we can solve for θ to nn Shortly, we shall see how this result arises naturally as the maximum likelihood estimator for an appropriately defined distribution over a periodic variable. We now consider a periodic generalization of the Gaussian called the von Mises distribution. Here we shall limit our attention to univariate distributions, although periodic distributions can also be found over hyperspheres of arbitrary dimension. For an extensive discussion of periodic distributions, see Mardia and Jupp (2000). By convention, we will consider distributions p(θ) that have period 2π. Any probability density p(θ) defined over θ must not only be nonnegative and integrate Figure 2.18 The von Mises distribution can be derived by considering a two-dimensional Gaussian of the form (2.173), whose density contours are shown in blue and conditioning on the unit circle shown in red. x2 p(x) x1 − 1 2σ2 = = (rcosθ−r0 cosθ0)2 +(rsinθ−r0 sinθ0)2 − 1 1+r02 −2r0 cosθcosθ0 −2r0 sinθsinθ0 r0 cos(θ − θ0) + const σ2 2σ2 ! 2π 0 2.3. The Gaussian Distribution 107 r=1 to one, but it must also be periodic. Thus p(θ) must satisfy the three conditions p(θ)  0 p(θ) dθ = 1 p(θ + 2π) = p(θ). (2.170) (2.171) (2.172) From (2.172), it follows that p(θ + M2π) = p(θ) for any integer M. We can easily obtain a Gaussian-like distribution that satisfies these three prop- erties as follows. Consider a Gaussian distribution over two variables x = (x1 , x2 ) having mean μ = (μ1, μ2) and a covariance matrix Σ = σ2I where I is the 2 × 2 identity matrix, so that 1 (x1 −μ1)2 +(x2 −μ2)2 p(x1,x2)=2πσ2exp − 2σ2 . (2.173) The contours of constant p(x) are circles, as illustrated in Figure 2.18. Now suppose we consider the value of this distribution along a circle of fixed radius. Then by con- struction this distribution will be periodic, although it will not be normalized. We can determine the form of this distribution by transforming from Cartesian coordinates (x1, x2) to polar coordinates (r, θ) so that x1 = rcosθ, x2 = rsinθ. (2.174) We also map the mean μ into polar coordinates by writing μ1 = r0 cosθ0, μ2 = r0 sinθ0. (2.175) Next we substitute these transformations into the two-dimensional Gaussian distribu- tion (2.173), and then condition on the unit circle r = 1, noting that we are interested only in the dependence on θ. Focussing on the exponent in the Gaussian distribution we have (2.176) 108 2. PROBABILITY DISTRIBUTIONS m = 5, θ0 = π/4 m = 1, θ0 = 3π/4 3π/4 π/4 0 2π m = 5, θ0 = π/4 m = 1, θ0 = 3π/4 Figure 2.19 The von Mises distribution plotted for two different parameter values, shown as a Cartesian plot on the left and as the corresponding polar plot on the right. Exercise 2.51 where ‘const’ denotes terms independent of θ, and we have made use of the following trigonometrical identities cos2 A + sin2 A = 1 (2.177) cosAcosB+sinAsinB = cos(A−B). (2.178) If we now define m = r0/σ2, we obtain our final expression for the distribution of p(θ) along the unit circle r = 1 in the form p(θ|θ0, m) = 1 exp {m cos(θ − θ0)} (2.179) 2πI0(m) which is called the von Mises distribution, or the circular normal. Here the param- eter θ0 corresponds to the mean of the distribution, while m, which is known as the concentration parameter, is analogous to the inverse variance (precision) for the Gaussian. The normalization coefficient in (2.179) is expressed in terms of I0(m), which is the zeroth-order Bessel function of the first kind (Abramowitz and Stegun, 1965) and is defined by 1 ! 2π I0(m) = 2π exp {m cos θ} dθ. (2.180) 0 For large m, the distribution becomes approximately Gaussian. The von Mises dis- tribution is plotted in Figure 2.19, and the function I0(m) is plotted in Figure 2.20. Now consider the maximum likelihood estimators for the parameters θ0 and m for the von Mises distribution. The log likelihood function is given by "N n=1 Exercise 2.52 ln p(D|θ0, m) = −N ln(2π) − N ln I0(m) + m cos(θn − θ0). (2.181) Exercise 2.53 fromwhichweobtain # sinθn θML = tan−1 #n cos θ (2.184) nn 2.3. The Gaussian Distribution 109 3000 2000 I0(m) 1000 00 5 10 m 1 A(m) 0.5 00 5 10 m Figure 2.20 (2.186). Plot of the Bessel function I0(m) defined by (2.180), together with the function A(m) defined by Setting the derivative with respect to θ0 equal to zero gives !N n=1 To solve for θ0, we make use of the trigonometric identity sin(A − B) = cos B sin A − cos A sin B sin(θn − θ0) = 0. (2.182) (2.183) 0 which we recognize as the result (2.169) obtained earlier for the mean of the obser- vations viewed in a two-dimensional Cartesian space. Similarly, maximizing (2.181) with respect to m, and making use of I0′ (m) = I1(m) (Abramowitz and Stegun, 1965), we have 1 !N N n=1 A(m) = where we have substituted for the maximum likelihood solution for θML (recalling that we are performing a joint optimization over θ and m), and we have defined A(m) = I1(m). (2.186) I0 (m) The function A(m) is plotted in Figure 2.20. Making use of the trigonometric iden- cos(θ − θML) (2.185) n0 0 tity (2.178), we can write (2.185) in the form 1 !N 1 !N A(m ) = cosθ cosθML − sinθ sinθML. (2.187) ML N n 0 N n 0 n=1 n=1 110 2. PROBABILITY DISTRIBUTIONS Figure 2.21 Plots of the ‘old faith- ful’ data in which the blue curves show contours of constant proba- bility density. On the left is a single Gaussian distribution which has been fitted to the data us- ing maximum likelihood. Note that this distribution fails to capture the two clumps in the data and indeed places much of its probability mass in the central region between the clumps where the data are relatively sparse. On the right the distribution is given by a linear combination of two Gaussians which has been fitted to the data by maximum likelihood using techniques discussed Chap- ter 9, and which gives a better rep- resentation of the data. 100 80 60 401 2 3 4 5 6 100 80 60 401 2 3 4 5 6 Appendix A The right-hand side of (2.187) is easily evaluated, and the function A(m) can be inverted numerically. For completeness, we mention briefly some alternative techniques for the con- struction of periodic distributions. The simplest approach is to use a histogram of observations in which the angular coordinate is divided into fixed bins. This has the virtue of simplicity and flexibility but also suffers from significant limitations, as we shall see when we discuss histogram methods in more detail in Section 2.5. Another approach starts, like the von Mises distribution, from a Gaussian distribution over a Euclidean space but now marginalizes onto the unit circle rather than conditioning (Mardia and Jupp, 2000). However, this leads to more complex forms of distribution and will not be discussed further. Finally, any valid distribution over the real axis (such as a Gaussian) can be turned into a periodic distribution by mapping succes- sive intervals of width 2π onto the periodic variable (0, 2π), which corresponds to ‘wrapping’ the real axis around unit circle. Again, the resulting distribution is more complex to handle than the von Mises distribution. One limitation of the von Mises distribution is that it is unimodal. By forming mixtures of von Mises distributions, we obtain a flexible framework for modelling periodic variables that can handle multimodality. For an example of a machine learn- ing application that makes use of von Mises distributions, see Lawrence et al. (2002), and for extensions to modelling conditional densities for regression problems, see Bishop and Nabney (1996). 2.3.9 Mixtures of Gaussians While the Gaussian distribution has some important analytical properties, it suf- fers from significant limitations when it comes to modelling real data sets. Consider the example shown in Figure 2.21. This is known as the ‘Old Faithful’ data set, and comprises 272 measurements of the eruption of the Old Faithful geyser at Yel- lowstone National Park in the USA. Each measurement comprises the duration of Section 9.3.3 Figure 2.22 Example of a Gaussian mixture distribution p(x) in one dimension showing three Gaussians (each scaled by a coefficient) in blue and their sum in red. 2.3. The Gaussian Distribution 111 the eruption in minutes (horizontal axis) and the time in minutes to the next erup- tion (vertical axis). We see that the data set forms two dominant clumps, and that a simple Gaussian distribution is unable to capture this structure, whereas a linear superposition of two Gaussians gives a better characterization of the data set. Such superpositions, formed by taking linear combinations of more basic dis- tributions such as Gaussians, can be formulated as probabilistic models known as mixture distributions (McLachlan and Basford, 1988; McLachlan and Peel, 2000). In Figure 2.22 we see that a linear combination of Gaussians can give rise to very complex densities. By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy. We therefore consider a superposition of K Gaussian densities of the form !K k=1 which is called a mixture of Gaussians. Each Gaussian density N (x|μk , Σk ) is called a component of the mixture and has its own mean μk and covariance Σk. Contour and surface plots for a Gaussian mixture having 3 components are shown in Figure 2.23. In this section we shall consider Gaussian components to illustrate the frame- work of mixture models. More generally, mixture models can comprise linear com- binations of other distributions. For instance, in Section 9.3.3 we shall consider mixtures of Bernoulli distributions as an example of a mixture model for discrete variables. The parameters πk in (2.188) are called mixing coefficients. If we integrate both sides of (2.188) with respect to x, and note that both p(x) and the individual Gaussian components are normalized, we obtain !K πk = 1. (2.189) k=1 Also, the requirement that p(x)  0, together with N(x|μk,Σk)  0, implies πk  0 for all k. Combining this with the condition (2.189) we obtain 0 à πk à 1. (2.190) p(x) = πkN(x|μk,Σk) (2.188) x 112 2. PROBABILITY DISTRIBUTIONS 1 (a) 0.5 1 (b) 0 0.5 1 0.2 0.5 00 0 0.5 1 0.5 0.3 Figure 2.23 Illustration of a mixture of 3 Gaussians in a two-dimensional space. (a) Contours of constant density for each of the mixture components, in which the 3 components are denoted red, blue and green, and the values of the mixing coefficients are shown below each component. (b) Contours of the marginal probability density p(x) of the mixture distribution. (c) A surface plot of the distribution p(x). We therefore see that the mixing coefficients satisfy the requirements to be probabil- ities. From the sum and product rules, the marginal density is given by !K k=1 which is equivalent to (2.188) in which we can view πk = p(k) as the prior prob- ability of picking the kth component, and the density N(x|μk,Σk) = p(x|k) as the probability of x conditioned on k. As we shall see in later chapters, an impor- tant role is played by the posterior probabilities p(k|x), which are also known as responsibilities. From Bayes’ theorem these are given by p(x) = p(k)p(x|k) (2.191) γk (x) ≡ p(k|x) p(k)p(x|k) = "l p(l)p(x|l) πkN(x|μk,Σk) = "l πlN(x|μl,Σl). (2.192) We shall discuss the probabilistic interpretation of the mixture distribution in greater detail in Chapter 9. The form of the Gaussian mixture distribution is governed by the parameters π, μ and Σ, where we have used the notation π ≡ {π1,...,πK}, μ ≡ {μ1,...,μK} and Σ ≡ {Σ1,...ΣK}. One way to set the values of these parameters is to use maximum likelihood. From (2.188) the log of the likelihood function is given by !N lnp(X|π,μ,Σ) = n=1 ln !K k=1 πkN(xn|μk,Σk) (2.193) where X = {x1 , . . . , xN }. We immediately see that the situation is now much more complex than with a single Gaussian, due to the presence of the summation over k inside the logarithm. As a result, the maximum likelihood solution for the parameters no longer has a closed-form analytical solution. One approach to maxi- mizing the likelihood function is to use iterative numerical optimization techniques (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008). Alterna- tively we can employ a powerful framework called expectation maximization, which will be discussed at length in Chapter 9. 2.4. The Exponential Family The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distri- butions called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994). Members of the exponential family have many important properties in com- mon, and it is illuminating to discuss these properties in some generality. The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form p(x|η) = h(x)g(η) exp ηTu(x) (2.194) where x may be scalar or vector, and may be discrete or continuous. Here η are called the natural parameters of the distribution, and u(x) is some function of x. The function g(η) can be interpreted as the coefficient that ensures that the distribu- tion is normalized and therefore satisfies g(η) # h(x) exp ηTu(x) dx = 1 (2.195) where the integration is replaced by summation if x is a discrete variable. We begin by taking some examples of the distributions introduced earlier in the chapter and showing that they are indeed members of the exponential family. Consider first the Bernoulli distribution p(x|μ) = Bern(x|μ) = μx(1 − μ)1−x. Expressing the right-hand side as the exponential of the logarithm, we have (2.196) (2.197) (2.198) 2.4. The Exponential Family 113 p(x|μ) = exp{xlnμ+(1−x)ln(1−μ)} = (1−μ)exp ln μ x . 1−μ Comparison with (2.194) allows us to identify η = ln μ 1−μ 114 2. PROBABILITY DISTRIBUTIONS which we can solve for μ to give μ = σ(η), where σ(η) = 1 (2.199) 1 + exp(−η) is called the logistic sigmoid function. Thus we can write the Bernoulli distribution using the standard representation (2.194) in the form p(x|η) = σ(−η) exp(ηx) (2.200) where we have used 1 − σ(η) = σ(−η), which is easily proved from (2.199). Com- parison with (2.194) shows that g(η) = σ(−η). Next consider the multinomial distribution that, for a single observation x, takes the form xk ln μk (2.204) where x = (x1, . . . , xN )T. Again, we can write this in the standard representation (2.194) so that p(x|η) = exp(ηTx) (2.205) where ηk = ln μk, and we have defined η = (η1, . . . , ηM )T. Again, comparing with (2.194) we have !M p(x|μ) = μxk = exp k #M k=1 u(x) = x h(x) = 1 (2.201) (2.202) (2.203) k=1 u(x) = x h(x) = 1 g(η) = 1. (2.206) (2.207) (2.208) Note that the parameters ηk are not independent because the parameters μk are sub- ject to the constraint #M μk = 1 (2.209) k=1 so that, given any M − 1 of the parameters μk , the value of the remaining parameter is fixed. In some circumstances, it will be convenient to remove this constraint by expressing the distribution in terms of only M − 1 parameters. This can be achieved by using the relationship (2.209) to eliminate μM by expressing it in terms of the remaining {μk } where k = 1, . . . , M − 1, thereby leaving M − 1 parameters. Note that these remaining parameters are still subject to the constraints M#−1 k=1 0àμk à1, μk à1. (2.210) 2.4. The Exponential Family 115 Making use of the constraint (2.209), the multinomial distribution in this representa- tion then becomes "M k=1 M−1 M−1 M−1 = exp "xklnμk+ 1−"xk ln 1−"μk k=1 k=1 k=1 μk 1 − &M−1 μ exp xk ln μk We now identify = exp "xln k +ln 1−"μ k . (2.211) M−1 M−1 k=1 j=1 j μk k=1 ln 1−& μ =ηk (2.212) jj which we can solve for μk by first summing both sides over k and then rearranging and back-substituting to give exp(ηk ) μk = 1 + &j exp(ηj). (2.213) This is called the softmax function, or the normalized exponential. In this represen- tation, the multinomial distribution therefore takes the form −1 k=1 M"−1 This is the standard form of the exponential family, with parameter vector η = p(x|η) = 1 + (η1,...,ηM−1)T inwhich u(x) = x h(x) = 1 −1 g(η) = 1+ exp(ηk) . Finally, let us consider the Gaussian distribution. For the univariate Gaussian, we have p(x|μ,σ2) = 1 exp − 1 (x−μ)2 (2.218) (2πσ2)1/2 2σ2 = 1 exp −1x2+μx− 1μ2 (2.219) (2πσ2)1/2 2σ2 σ2 2σ2 exp(ηk) exp(ηTx). (2.214) M"−1 k=1 (2.215) (2.216) (2.217) 116 2. PROBABILITY DISTRIBUTIONS which, after some simple rearrangement, can be cast in the standard exponential Exercise 2.57 family form (2.194) with Exercise 2.58 Note that the covariance of u(x) can be expressed in terms of the second derivatives of g(η), and similarly for higher order moments. Thus, provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation. Now consider a set of independent identically distributed data denoted by X = {x1 , . . . , xn }, for which the likelihood function is given by *N ηT u(xn) . (2.227) n=1 'N p(X|η) = η = μ/σ2 −1/2σ2 (2.220) (2.221) (2.222) (2.223) Let us now consider the problem of estimating the parameter vector η in the gen- eral exponential family distribution (2.194) using the technique of maximum likeli- hood. Taking the gradient of both sides of (2.195) with respect to η, we have ∇g(η)# h(#x)exp ηTu(x) dx + g(η) h(x) exp ηTu(x) u(x) dx = 0. Rearranging, and making use again of (2.195) then gives − 1 ∇g(η) = g(η) # h(x) exp ηTu(x) u(x) dx = E[u(x)] g(η) where we have used (2.194). We therefore obtain the result −∇ ln g(η) = E[u(x)]. u(x) = x x2 h(x) = (2π)−1/2 g(η) = (−2η2)1/2 exp η2 1 . 4η2 2.4.1 Maximum likelihood and sufficient statistics g(η)N exp condition to be satisfied by the maximum likelihood estimator ηML h(xn) Setting the gradient of lnp(X|η) with respect to η to zero, we get the following n=1 1 *N −∇ ln g(ηML) = N u(xn) (2.228) n=1 (2.224) (2.225) (2.226) 2.4. The Exponential Family 117 which can in principle be solved to obtain ηML. We see that th!e solution for the maximum likelihood estimator depends on the data only through n u(xn), which is therefore called the sufficient statistic of the distribution (2.194). We do not need to store the entire data set itself but only the value of the sufficient statistic. For the Bernoulli distribution, for example, the function u(x) is given just by x and so we need only keep the sum of the data points {xn}, whereas for the Gaussian u(x) = (x, x2)T, and so we should keep both the sum of {xn} and the sum of {x2n}. If we consider the limit N → ∞, then the right-hand side of (2.228) becomes E[u(x)], and so by comparing with (2.226) we see that in this limit ηML will equal the true value η. In fact, this sufficiency property holds also for Bayesian inference, although we shall defer discussion of this until Chapter 8 when we have equipped ourselves with the tools of graphical models and can thereby gain a deeper insight into these important concepts. 2.4.2 Conjugate priors We have already encountered the concept of a conjugate prior several times, for example in the context of the Bernoulli distribution (for which the conjugate prior is the beta distribution) or the Gaussian (where the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution). In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family (2.194), there exists a conjugate prior that can be written in the form p(η|χ, ν) = f(χ, ν)g(η)ν exp νηTχ (2.229) where f(χ,ν) is a normalization coefficient, and g(η) is the same function as ap- pears in (2.194). To see that this is indeed conjugate, let us multiply the prior (2.229) by the likelihood function (2.227) to obtain the posterior distribution, up to a nor- malization coefficient, in the form &N n=1 This again takes the same functional form as the prior (2.229), confirming conjugacy. Furthermore, we see that the parameter ν can be interpreted as a effective number of pseudo-observations in the prior, each of which has a value for the sufficient statistic u(x) given by χ. 2.4.3 Noninformative priors In some applications of probabilistic inference, we may have prior knowledge that can be conveniently expressed through the prior distribution. For example, if the prior assigns zero probability to some value of variable, then the posterior dis- tribution will necessarily also assign zero probability to that value, irrespective of p(η|X, χ, ν) ∝ g(η)ν+N exp ηT u(xn) + νχ . (2.230) 118 2. PROBABILITY DISTRIBUTIONS any subsequent observations of data. In many cases, however, we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a noninformative prior, which is intended to have as little influ- ence on the posterior distribution as possible (Jeffries, 1946; Box and Tao, 1973; Bernardo and Smith, 1994). This is sometimes referred to as ‘letting the data speak for themselves’. If we have a distribution p(x|λ) governed by a parameter λ, we might be tempted to propose a prior distribution p(λ) = const as a suitable prior. If λ is a discrete variable with K states, this simply amounts to setting the prior probability of each state to 1/K. In the case of continuous parameters, however, there are two potential difficulties with this approach. The first is that, if the domain of λ is unbounded, this prior distribution cannot be correctly normalized because the integral over λ diverges. Such priors are called improper. In practice, improper priors can often be used provided the corresponding posterior distribution is proper, i.e., that it can be correctly normalized. For instance, if we put a uniform prior distribution over the mean of a Gaussian, then the posterior distribution for the mean, once we have observed at least one data point, will be proper. A second difficulty arises from the transformation behaviour of a probability density under a nonlinear change of variables, given by (1.27). If a function h(λ) is constant, and we change variables to λ = η2, then h(η) = h(η2) will also be constant. However, if we choose the density pλ(λ) to be constant, then the density of η will be given, from (1.27), by pη(η) = pλ(λ)à dλà = pλ(η2)2η ∝ η (2.231) dη and so the density over η will not be constant. This issue does not arise when we use maximum likelihood, because the likelihood function p(x|λ) is a simple function of λ and so we are free to use any convenient parameterization. If, however, we are to choose a prior distribution that is constant, we must take care to use an appropriate representation for the parameters. Here we consider two simple examples of noninformative priors (Berger, 1985). First of all, if a density takes the form p(x|μ) = f (x − μ) (2.232) then the parameter μ is known as a location parameter. This family of densities exhibits translation invariance because if we shift x by a constant to give x = x + c, then p(x|μ) = f (x − μ) (2.233) where we have defined μ = μ + c. Thus the density takes the same form in the new variable as in the original one, and so the density is independent of the choice of origin. We would like to choose a prior distribution that reflects this translation invariance property, and so we choose a prior that assigns equal probability mass to Exercise 2.59 2.4. The Exponential Family 119 an interval A à μ à B as to the shifted interval A − c à μ à B − c. This implies ! B ! B−c ! B p(μ) dμ = p(μ) dμ = p(μ − c) dμ (2.234) A A−c A and because this must hold for all choices of A and B, we have p(μ − c) = p(μ) (2.235) which implies that p(μ) is constant. An example of a location parameter would be the mean μ of a Gaussian distribution. As we have seen, the conjugate prior distri- bution for μ in this case is a Gaussian p(μ|μ0, σ02) = N (μ|μ0, σ02), and we obtain a noninformative prior by taking the limit σ02 → ∞. Indeed, from (2.141) and (2.142) we see that this gives a posterior distribution over μ in which the contributions from the prior vanish. As a second example, consider a density of the form p(x|σ)=σ1f σx (2.236) where σ > 0. Note that this will be a normalized density provided f(x) is correctly normalized. The parameter σ is known as a scale parameter, and the density exhibits scale invariance because if we scale x by a constant to give x = cx, then
p(x|σ) = σ1 f σx (2.237)
where we have defined σ = cσ. This transformation corresponds to a change of scale, for example from meters to kilometers if x is a length, and we would like to choose a prior distribution that reflects this scale invariance. If we consider an interval A à σ à B, and a scaled interval A/c à σ à B/c, then the prior should assign equal probability mass to these two intervals. Thus we have
!B !B/c !B 1 1 p(σ)dσ = p(σ)dσ = p cσ c dσ
A A/c A
and because this must hold for choices of A and B, we have p(σ) = p 1cσ 1c
(2.238)
(2.239)
and hence p(σ) ∝ 1/σ. Note that again this is an improper prior because the integral of the distribution over 0 à σ à ∞ is divergent. It is sometimes also convenient to think of the prior distribution for a scale parameter in terms of the density of the log of the parameter. Using the transformation rule (1.27) for densities we see that p(ln σ) = const. Thus, for this prior there is the same probability mass in the range 1 à σ à 10 as in the range 10 à σ à 100 and in 100 à σ à 1000.

120
2. PROBABILITY DISTRIBUTIONS
Section 2.3
An example of a scale parameter would be the standard deviation σ of a Gaussian distribution, after we have taken account of the location parameter μ, because
N (x|μ, σ2) ∝ σ−1 exp −(“x/σ)2 (2.240)
where “x = x − μ. As discussed earlier, it is often more convenient to work in terms of the precision λ = 1/σ2 rather than σ itself. Using the transformation rule for densities, we see that a distribution p(σ) ∝ 1/σ corresponds to a distribution over λ of the form p(λ) ∝ 1/λ. We have seen that the conjugate prior for λ was the gamma distribution Gam(λ|a0 , b0 ) given by (2.146). The noninformative prior is obtained as the special case a0 = b0 = 0. Again, if we examine the results (2.150) and (2.151) for the posterior distribution of λ, we see that for a0 = b0 = 0, the posterior depends only on terms arising from the data and not from the prior.
2.5. Nonparametric Methods
Throughout this chapter, we have focussed on the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set. This is called the parametric approach to density modelling. An important limitation of this approach is that the chosen density might be a poor model of the distribution that generates the data, which can result in poor predictive performance. For instance, if the process that generates the data is multimodal, then this aspect of the distribution can never be captured by a Gaussian, which is necessarily unimodal.
In this final section, we consider some nonparametric approaches to density es- timation that make few assumptions about the form of the distribution. Here we shall focus mainly on simple frequentist methods. The reader should be aware, however, that nonparametric Bayesian methods are attracting increasing interest (Walker et al., 1999; Neal, 2000; Mu ̈ller and Quintana, 2004; Teh et al., 2006).
Let us start with a discussion of histogram methods for density estimation, which we have already encountered in the context of marginal and conditional distributions in Figure 1.11 and in the context of the central limit theorem in Figure 2.6. Here we explore the properties of histogram density models in more detail, focussing on the case of a single continuous variable x. Standard histograms simply partition x into distinct bins of width ∆i and then count the number ni of observations of x falling in bin i. In order to turn this count into a normalized probability density, we simply divide by the total number N of observations and by the width ∆i of the bins to obtain probability values for each bin given by
pi = ni (2.241) N∆i
for which it is easily seen that $ p(x) dx = 1. This gives a model for the density p(x) that is constant over the width of each bin, and often the bins are chosen to have the same width ∆i = ∆.

Figure 2.24
An illustration of the histogram approach to density estimation, in which a data set of 50 data points is generated from the distribution shown by the green curve. Histogram density estimates, based on (2.241), with a common bin width ∆ are shown for various values of ∆.
In Figure 2.24, we show an example of histogram density estimation. Here the data is drawn from the distribution, corresponding to the green curve, which is formed from a mixture of two Gaussians. Also shown are three examples of his- togram density estimates corresponding to three different choices for the bin width ∆. We see that when ∆ is very small (top figure), the resulting density model is very spiky, with a lot of structure that is not present in the underlying distribution that generated the data set. Conversely, if ∆ is too large (bottom figure) then the result is a model that is too smooth and that consequently fails to capture the bimodal prop- erty of the green curve. The best results are obtained for some intermediate value of ∆ (middle figure). In principle, a histogram density model is also dependent on the choice of edge location for the bins, though this is typically much less significant than the value of ∆.
Note that the histogram method has the property (unlike the methods to be dis- cussed shortly) that, once the histogram has been computed, the data set itself can be discarded, which can be advantageous if the data set is large. Also, the histogram approach is easily applied if the data points are arriving sequentially.
In practice, the histogram technique can be useful for obtaining a quick visual- ization of data in one or two dimensions but is unsuited to most density estimation applications. One obvious problem is that the estimated density has discontinuities that are due to the bin edges rather than any property of the underlying distribution that generated the data. Another major limitation of the histogram approach is its scaling with dimensionality. If we divide each variable in a D-dimensional space into M bins, then the total number of bins will be MD. This exponential scaling with D is an example of the curse of dimensionality. In a space of high dimensional- ity, the quantity of data needed to provide meaningful estimates of local probability density would be prohibitive.
The histogram approach to density estimation does, however, teach us two im- portant lessons. First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Note that the concept of locality requires that we assume some form of dis- tance measure, and here we have been assuming Euclidean distance. For histograms,
2.5. Nonparametric Methods 121
5
00 0.5 1 5
00 0.5 1 5
00 0.5 1
∆ = 0.04
∆ = 0.08
∆ = 0.25
Section 1.4

122
2. PROBABILITY DISTRIBUTIONS
Section 2.1
Now suppose that we have collected a data set comprising N observations drawn from p(x). Because each data point has a probability P of falling within R, the total number K of points that lie inside R will be distributed according to the binomial distribution
this neighbourhood property was defined by the bins, and there is a natural ‘smooth- ing’ parameter describing the spatial extent of the local region, in this case the bin width. Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results. This is reminiscent of the choice of model complexity in polynomial curve fitting discussed in Chapter 1 where the degree M of the polynomial, or alternatively the value α of the regularization parameter, was optimal for some intermediate value, neither too large nor too small. Armed with these insights, we turn now to a discussion of two widely used nonparametric tech- niques for density estimation, kernel estimators and nearest neighbours, which have better scaling with dimensionality than the simple histogram model.
2.5.1 Kernel density estimators
Let us suppose that observations are being drawn from some unknown probabil- ity density p(x) in some D-dimensional space, which we shall take to be Euclidean, and we wish to estimate the value of p(x). From our earlier discussion of locality, let us consider some small region R co!ntaining x. The probability mass associated with this region is given by
P = p(x)dx. (2.242) R
Bin(K|N,P) = N! PK(1−P)1−K. (2.243) K!(N − K)!
Using (2.11), we see that the mean fraction of points falling inside the region is E[K/N] = P, and similarly using (2.12) we see that the variance around this mean is var[K/N ] = P (1 − P )/N . For large N , this distribution will be sharply peaked around the mean and so
K ≃ NP. (2.244) If, however, we also assume that the region R is sufficiently small that the probability
density p(x) is roughly constant over the region, then we have
P ≃ p(x)V (2.245)
where V is the volume of R. Combining (2.244) and (2.245), we obtain our density
estimate in the form
p(x) = K . (2.246) NV
Note that the validity of (2.246) depends on two contradictory assumptions, namely that the region R be sufficiently small that the density is approximately constant over the region and yet sufficiently large (in relation to the value of that density) that the number K of points falling inside the region is sufficient for the binomial distribution to be sharply peaked.

2.5. Nonparametric Methods 123
We can exploit the result (2.246) in two different ways. Either we can fix K and determine the value of V from the data, which gives rise to the K-nearest-neighbour technique discussed shortly, or we can fix V and determine K from the data, giv- ing rise to the kernel approach. It can be shown that both the K-nearest-neighbour density estimator and the kernel density estimator converge to the true probability density in the limit N → ∞ provided V shrinks suitably with N , and K grows with N (Duda and Hart, 1973).
We begin by discussing the kernel method in detail, and to start with we take the region R to be a small hypercube centred on the point x at which we wish to determine the probability density. In order to count the number K of points falling within this region, it is convenient to define the following function
k(u) = 1, |ui| à 1/2, i = 1,…,D, (2.247) 0, otherwise
which represents a unit cube centred on the origin. The function k(u) is an example of a kernel function, and in this context is also called a Parzen window. From (2.247), the quantity k((x − xn)/h) will be one if the data point xn lies inside a cube of side h centred on x, and zero otherwise. The total number of data points lying inside this cube will therefore be
“N x−xn
K= k h . (2.248)
n=1
Substituting this expression into (2.246) then gives the following result for the esti-
mated density at x
p(x)=1″N 1k x−xn (2.249) N n=1 hD h
where we have used V = hD for the volume of a hypercube of side h in D di- mensions. Using the symmetry of the function k(u), we can now re-interpret this equation, not as a single cube centred on x but as the sum over N cubes centred on the N data points xn.
As it stands, the kernel density estimator (2.249) will suffer from one of the same problems that the histogram method suffered from, namely the presence of artificial discontinuities, in this case at the boundaries of the cubes. We can obtain a smoother density model if we choose a smoother kernel function, and a common choice is the Gaussian, which gives rise to the following kernel density model
p(x)= 1 “N 1 exp −∥x−xn∥2 (2.250) N (2πh2)1/2 2h2
n=1
where h represents the standard deviation of the Gaussian components. Thus our density model is obtained by placing a Gaussian over each data point and then adding up the contributions over the whole data set, and then dividing by N so that the den- sity is correctly normalized. In Figure 2.25, we apply the model (2.250) to the data

124
2. PROBABILITY DISTRIBUTIONS
Figure 2.25
Illustration of the kernel density model (2.250) applied to the same data set used to demonstrate the histogram approach in Figure 2.24. We see that h acts as a smoothing parameter and that if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), then the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out. The best den- sity model is obtained for some intermedi- ate value of h (middle panel).
set used earlier to demonstrate the histogram technique. We see that, as expected, the parameter h plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small h and over-smoothing at large h. Again, the optimization of h is a problem in model complexity, analogous to the choice of bin width in histogram density estimation, or the degree of the polynomial used in curve fitting.
We can choose any other kernel function k(u) in (2.249) subject to the condi- tions
k(u)  0, (2.251) k(u) du = 1 (2.252)
which ensure that the resulting probability distribution is nonnegative everywhere and integrates to one. The class of density model given by (2.249) is called a kernel density estimator, or Parzen estimator. It has a great merit that there is no compu- tation involved in the ‘training’ phase because this simply requires storage of the training set. However, this is also one of its great weaknesses because the computa- tional cost of evaluating the density grows linearly with the size of the data set.
2.5.2 Nearest-neighbour methods
One of the difficulties with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels. In regions of high data density, a large value of h may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing h may lead to noisy estimates elsewhere in data space where the density is smaller. Thus the optimal choice for h may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.
We therefore return to our general result (2.246) for local density estimation, and instead of fixing V and determining the value of K from the data, we consider a fixed value of K and use the data to find an appropriate value for V . To do this, we consider a small sphere centred on the point x at which we wish to estimate the
5
00 0.5 1 5
00 0.5 1 5
00 0.5 1
h = 0.005
h = 0.07
h = 0.2
!

Figure 2.26
Illustration of K -nearest-neighbour den- sity estimation using the same data set as in Figures 2.25 and 2.24. We see that the parameter K governs the degree of smoothing, so that a small value of K leads to a very noisy density model (top panel), whereas a large value (bot- tom panel) smoothes out the bimodal na- ture of the true distribution (shown by the green curve) from which the data set was generated.
density p(x), and we allow the radius of the sphere to grow until it contains precisely K data points. The estimate of the density p(x) is then given by (2.246) with V set to the volume of the resulting sphere. This technique is known as K nearest neighbours and is illustrated in Figure 2.26, for various choices of the parameter K, using the same data set as used in Figure 2.24 and Figure 2.25. We see that the value of K now governs the degree of smoothing and that again there is an optimum choice for K that is neither too large nor too small. Note that the model produced by K nearest neighbours is not a true density model because the integral over all space diverges.
We close this chapter by showing how the K-nearest-neighbour technique for density estimation can be extended to the problem of classification. To do this, we apply the K-nearest-neighbour density estimation technique to each class separately and then make use of Bayes’ theorem. Let us suppose that we!have a data set com- prising Nk points in class Ck with N points in total, so that k Nk = N. If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points irrespective of their class. Suppose this sphere has volume V and contains Kk points from class Ck. Then (2.246) provides an estimate of the density associated with each class
p(x|Ck) = Kk . NkV
Similarly, the unconditional density is given by
(2.253)
(2.254)
(2.255)
while the class priors are given by
p(x) = K NV
p(Ck) = Nk . N
2.5. Nonparametric Methods 125
5
00 0.5 1 5
00 0.5 1 5
00 0.5 1
K=1
K=5
K = 30
Exercise 2.61
We can now combine (2.253), (2.254), and (2.255) using Bayes’ theorem to obtain the posterior probability of class membership
p(Ck|x) = p(x|Ck)p(Ck) = Kk . (2.256) p(x) K

126 2. PROBABILITY DISTRIBUTIONS
Figure 2.27 (a) In the K -nearest- x2 neighbour classifier, a new point, shown by the black diamond, is clas- sified according to the majority class membership of the K closest train-
ing data points, in this case K = 3. (b) In the nearest-neighbour (K = 1) approach to classification, the resulting decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes.
x2
(a)
(b)
If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of Kk/K. Thus to classify a new point, we identify the K nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of K = 1 is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27.
In Figure 2.28, we show the results of applying the K-nearest-neighbour algo- rithm to the oil flow data, introduced in Chapter 1, for various values of K. As expected, we see that K controls the degree of smoothing, so that small K produces many small regions of each class, whereas large K leads to fewer larger regions.
Plot of 200 data points from the oil data set showing values of x6 plotted against x7, where the red, green, and blue points correspond to the ‘laminar’, ‘annular’, and ‘homogeneous’ classes, respectively. Also shown are the classifications of the input space given by the K-nearest-neighbour algorithm for various values of K.
x1
x1
2
x7
1
K=1
00 1×62
2
x7
1
K=3
00 1×62
2
x7
1
K = 31
00 1×62
Figure 2.28

Exercises 2.1
(⋆) www erties
Verify that the Bernoulli distribution (2.2) satisfies the following prop-
2.2
(⋆ ⋆) The form of the Bernoulli distribution given by (2.2) is not symmetric be- tween the two values of x. In some situations, it will be more convenient to use an equivalent formulation for which x ∈ {−1, 1}, in which case the distribution can be written
An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier, i.e., one that uses the true class distributions (Cover and Hart, 1967) .
As discussed so far, both the K-nearest-neighbour method, and the kernel den- sity estimator, require the entire training data set to be stored, leading to expensive computation if the data set is large. This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set. Nevertheless, these nonparametric methods are still severely limited. On the other hand, we have seen that simple parametric models are very restricted in terms of the forms of distribution that they can represent. We therefore need to find density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set, and we shall see in subsequent chapters how to achieve this.
var[x] = μ(1 − μ).
Show that the entropy H[x] of a Bernoulli distributed random binary variable x is
given by
1 − μ (1−x)/2 1 + μ (1+x)/2 p(x|μ) = 2 2
(2.261)
where μ ∈ [−1, 1]. Show that the distribution (2.261) is normalized, and evaluate its mean, variance, and entropy.
2.3 (⋆ ⋆) www In this exercise, we prove that the binomial distribution (2.9) is nor- malized. First use the definition (2.10) of the number of combinations of m identical objects chosen from a total of N to show that
N+ N =N+1. (2.262) mm−1m
Exercises 127
!1 x=0
p(x|μ) = 1 E[x] = μ
(2.257)
H[x] = −μ ln μ − (1 − μ) ln(1 − μ). (2.260)
(2.258) (2.259)

128 2. PROBABILITY DISTRIBUTIONS
Use this result to prove by induction the following result
(1+x)N=!N Nxm (2.263) m=0 m
which is known as the binomial theorem, and which is valid for all real values of x. Finally, show that the binomial distribution is normalized, so that
then making use of the binomial theorem.
2.4 (⋆ ⋆) Show that the mean of the binomial distribution is given by (2.11). To do this, differentiate both sides of the normalization condition (2.264) with respect to μ and then rearrange to obtain an expression for the mean of n. Similarly, by differentiating (2.264) twice with respect to μ and making use of the result (2.11) for the mean of the binomial distribution prove the result (2.12) for the variance of the binomial.
2.5 (⋆ ⋆) www In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so that (2.14) holds. This is equivalent to showing that
(2.265)
(2.266)
!N
m=0 m
N μm(1 − μ)N−m = 1 (2.264) which can be done by first pulling out a factor (1 − μ)N out of the summation and
$ 1 Γ(a)Γ(b)
μa−1(1 − μ)b−1 dμ = Γ(a + b) . From the definition (1.141) of the gamma function, we have
0
Γ(a)Γ(b) = $ ∞ exp(−x)xa−1 dx $ ∞ exp(−y)yb−1 dy. 00
Use this expression to prove (2.265) as follows. First bring the integral over y inside the integrand of the integral over x, next make the change of variable t = y + x where x is fixed, then interchange the order of the x and t integrations, and finally make the change of variable x = tμ where t is fixed.
2.6 (⋆) Make use of the result (2.265) to show that the mean, variance, and mode of the beta distribution (2.13) are given respectively by
E[μ] = var[μ] = mode[μ] =
a a+b
ab (a+b)2(a+b+1)
a − 1 . a+b−2
(2.267) (2.268) (2.269)

2.7 (⋆ ⋆) Consider a binomial random variable x given by (2.9), with prior distribution for μ given by the beta distribution (2.13), and suppose we have observed m occur- rences of x = 1 and l occurrences of x = 0. Show that the posterior mean value of x lies between the prior mean and the maximum likelihood estimate for μ. To do this, show that the posterior mean can be written as λ times the prior mean plus (1 − λ) times the maximum likelihood estimate, where 0 à λ à 1. This illustrates the con- cept of the posterior distribution being a compromise between the prior distribution and the maximum likelihood solution.
2.8 (⋆) Consider two variables x and y with joint distribution p(x, y). Prove the follow- ing two results
E[x] = Ey [Ex[x|y]] (2.270) var[x] = Ey [varx[x|y]] + vary [Ex[x|y]] . (2.271)
Here Ex[x|y] denotes the expectation of x under the conditional distribution p(x|y), with a similar notation for the conditional variance.
2.9 (⋆ ⋆ ⋆) www . In this exercise, we prove the normalization of the Dirichlet dis- tribution (2.38) using induction. We have already shown in Exercise 2.5 that the beta distribution, which is a special case of the Dirichlet for M = 2, is normalized. We now assume that the Dirichlet distribution is normalized for !M − 1 variables and prove that it is normalized for M variables. To do this, consider the Dirichlet distribution over M variables, and take account of the constraint Mk=1 μk = 1 by eliminating μM , so that the Dirichlet is written
(2.272)
M−1 pM(μ1,…,μM−1)=CM “μαk−1 1−$μj
αM −1 k=1 j=1
and our goal is to find an expression for CM . To do this, integrate over μM −1 , taking care over the limits of integration, and then make a change of variable so that this integral has limits 0 and 1. By assuming the correct result for CM −1 and making use of (2.265), derive the expression for CM .
2.10 (⋆ ⋆) Using the property Γ(x + 1) = xΓ(x) of the gamma function, derive the following results for the mean, variance, and covariance of the Dirichlet distribution given by (2.38)
E[μj] = αj α0
(2.273)
(2.274) (2.275)
var[μj] cov[μjμl]
= αj(α0 −αj) α02(α0 +1)
= − αjαl α02(α0 +1)
,
j ̸= l
k
M−1
Exercises 129
where α0 is defined by (2.39).

130 2. PROBABILITY DISTRIBUTIONS
2.11 (⋆) www By expressing the expectation of ln μj under the Dirichlet distribution
(2.38) as a derivative with respect to αj , show that E[lnμj] = ψ(αj) − ψ(α0)
where α0 is given by (2.39) and
ψ(a)≡ d lnΓ(a) da
is the digamma function.
2.12 (⋆) The uniform distribution for a continuous variable x is defined by
U(x|a,b)= 1 , aàxàb. b−a
(2.276)
(2.277)
(2.278)
Verify that this distribution is normalized, and find expressions for its mean and variance.
2.13 (⋆⋆) Evaluate the Kullback-Leibler divergence (1.113) between two Gaussians p(x) = N (x|μ, Σ) and q(x) = N (x|m, L).
2.14 (⋆ ⋆) www This exercise demonstrates that the multivariate distribution with max- imum entropy, for a given covariance, is a Gaussian. The entropy of a distribution
p(x) is given by
!
H[x] = − p(x) ln p(x) dx. (2.279)
We wish to maximize H[x] over all distributions p(x) subject to the constraints that p(x) be normalized and that it have a specific mean and covariance, so that
! ! !
p(x) dx = 1
p(x)x dx = μ
p(x)(x − μ)(x − μ)T dx = Σ.
(2.280) (2.281) (2.282)
By performing a variational maximization of (2.279) and using Lagrange multipliers to enforce the constraints (2.280), (2.281), and (2.282), show that the maximum likelihood distribution is given by the Gaussian (2.43).
2.15 (⋆ ⋆) Show that the entropy of the multivariate Gaussian N (x|μ, Σ) is given by H[x] = 21 ln |Σ| + D2 (1 + ln(2π)) (2.283)
where D is the dimensionality of x.

2.16 (⋆ ⋆ ⋆) www Consider two random variables x1 and x2 having Gaussian distri- butions with means μ1,μ2 and precisions τ1, τ2 respectively. Derive an expression for the differential entropy of the variable x = x1 + x2. To do this, first find the distribution of x by using the relation
p(x) = ! ∞ p(x|x2)p(x2) dx2 (2.284) −∞
and completing the square in the exponent. Then observe that this represents the convolution of two Gaussian distributions, which itself will be Gaussian, and finally make use of the result (1.110) for the entropy of the univariate Gaussian.
2.17 (⋆) www Consider the multivariate Gaussian distribution given by (2.43). By writing the precision matrix (inverse covariance matrix) Σ−1 as the sum of a sym- metric and an anti-symmetric matrix, show that the anti-symmetric term does not appear in the exponent of the Gaussian, and hence that the precision matrix may be taken to be symmetric without loss of generality. Because the inverse of a symmetric matrix is also symmetric (see Exercise 2.22), it follows that the covariance matrix may also be chosen to be symmetric without loss of generality.
2.18 (⋆ ⋆ ⋆) Consider a real, symmetric matrix Σ whose eigenvalue equation is given by (2.45). By taking the complex conjugate of this equation and subtracting the original equation, and then forming the inner product with eigenvector ui, show that the eigenvalues λi are real. Similarly, use the symmetry property of Σ to show that two eigenvectors ui and uj will be orthogonal provided λj ̸= λi. Finally, show that without loss of generality, the set of eigenvectors can be chosen to be orthonormal, so that they satisfy (2.46), even if some of the eigenvalues are zero.
2.19 (⋆ ⋆) Show that a real, symmetric matrix Σ having the eigenvector equation (2.45) can be expressed as an expansion in the eigenvectors, with coefficients given by the eigenvalues, of the form (2.48). Similarly, show that the inverse matrix Σ−1 has a representation of the form (2.49).
2.20 (⋆ ⋆) www A positive definite matrix Σ can be defined as one for which the
Exercises 131
quadratic form
aT Σa (2.285)
is positive for any real value of the vector a. Show that a necessary and sufficient condition for Σ to be positive definite is that all of the eigenvalues λi of Σ, defined by (2.45), are positive.
2.21 (⋆) Show that a real, symmetric matrix of size D × D has D(D + 1)/2 independent parameters.
2.22 (⋆) www Show that the inverse of a symmetric matrix is itself symmetric.
2.23 (⋆ ⋆) By diagonalizing the coordinate system using the eigenvector expansion (2.45), show that the volume contained within the hyperellipsoid corresponding to a constant

132 2. PROBABILITY DISTRIBUTIONS
Mahalanobis distance ∆ is given by
VD |Σ|1/2 ∆D (2.286)
where VD is the volume of the unit sphere in D dimensions, and the Mahalanobis distance is defined by (2.44).
2.24 (⋆ ⋆) www Prove the identity (2.76) by multiplying both sides by the matrix
AB CD
and making use of the definition (2.77).
(2.287)
2.25 (⋆ ⋆) In Sections 2.3.1 and 2.3.2, we considered the conditional and marginal distri- butions for a multivariate Gaussian. More generally, we can consider a partitioning of the components of x into three groups xa, xb, and xc, with a corresponding par- titioning of the mean vector μ and of the covariance matrix Σ in the form
μa Σaa Σab Σac
μ= μb , Σ= Σba Σbb Σbc . (2.288)
μc Σca Σcb Σcc
By making use of the results of Section 2.3, find an expression for the conditional
distribution p(xa|xb) in which xc has been marginalized out.
2.26 (⋆⋆) A very useful result from linear algebra is the Woodbury matrix inversion
formula given by
(A + BCD)−1 = A−1 − A−1B(C−1 + DA−1B)−1DA−1. (2.289)
By multiplying both sides by (A + BCD) prove the correctness of this result.
2.27 (⋆) Let x and z be two independent random vectors, so that p(x, z) = p(x)p(z). Show that the mean of their sum y = x+z is given by the sum of the means of each of the variable separately. Similarly, show that the covariance matrix of y is given by the sum of the covariance matrices of x and z. Confirm that this result agrees with that of Exercise 1.10.
2.28 (⋆ ⋆ ⋆) www Consider a joint distribution over the variable
z= yx (2.290)
whose mean and covariance are given by (2.108) and (2.105) respectively. By mak- ing use of the results (2.92) and (2.93) show that the marginal distribution p(x) is given (2.99). Similarly, by making use of the results (2.81) and (2.82) show that the conditional distribution p(y|x) is given by (2.100).

Exercises 133
2.29 (⋆ ⋆) Using the partitioned matrix inversion formula (2.76), show that the inverse of
the precision matrix (2.104) is given by the covariance matrix (2.105).
2.30 (⋆) By starting from (2.107) and making use of the result (2.105), verify the result (2.108).
2.31 (⋆⋆) Consider two multidimensional random vectors x and z having Gaussian distributions p(x) = N(x|μx,Σx) and p(z) = N(z|μz,Σz) respectively, together with their sum y = x+z. Use the results (2.109) and (2.110) to find an expression for the marginal distribution p(y) by considering the linear-Gaussian model comprising the product of the marginal distribution p(x) and the conditional distribution p(y|x).
2.32 (⋆⋆⋆) www This exercise and the next provide practice at manipulating the quadratic forms that arise in linear-Gaussian models, as well as giving an indepen- dent check of results derived in the main text. Consider a joint distribution p(x, y) defined by the marginal and conditional distributions given by (2.99) and (2.100). By examining the quadratic form in the exponent of the joint distribution, and using the technique of ‘completing the square’ discussed in Section 2.3, find expressions for the mean and covariance of the marginal distribution p(y) in which the variable x has been integrated out. To do this, make use of the Woodbury matrix inversion formula (2.289). Verify that these results agree with (2.109) and (2.110) obtained using the results of Chapter 2.
2.33 (⋆ ⋆ ⋆) Consider the same joint distribution as in Exercise 2.32, but now use the technique of completing the square to find expressions for the mean and covariance of the conditional distribution p(x|y). Again, verify that these agree with the corre- sponding expressions (2.111) and (2.112).
2.34 (⋆ ⋆) www To find the maximum likelihood solution for the covariance matrix of a multivariate Gaussian, we need to maximize the log likelihood function (2.118) with respect to Σ, noting that the covariance matrix must be symmetric and positive definite. Here we proceed by ignoring these constraints and doing a straightforward maximization. Using the results (C.21), (C.26), and (C.28) from Appendix C, show that the covariance matrix Σ that maximizes the log likelihood function (2.118) is given by the sample covariance (2.122). We note that the final result is necessarily symmetric and positive definite (provided the sample covariance is nonsingular).
2.35 (⋆ ⋆) Use the result (2.59) to prove (2.62). Now, using the results (2.59), and (2.62),
show that
E[xnxm] = μμT + InmΣ (2.291)
where xn denotes a data point sampled from a Gaussian distribution with mean μ and covariance Σ, and Inm denotes the (n, m) element of the identity matrix. Hence prove the result (2.124).
2.36 (⋆ ⋆) www Using an analogous procedure to that used to obtain (2.126), derive an expression for the sequential estimation of the variance of a univariate Gaussian

134 2. PROBABILITY DISTRIBUTIONS
distribution, by starting with the maximum likelihood expression
1 !N
σM2 L = N (xn − μ)2. (2.292)
n=1
Verify that substituting the expression for a Gaussian distribution into the Robbins- Monro sequential estimation formula (2.135) gives a result of the same form, and hence obtain an expression for the corresponding coefficients aN .
2.37 (⋆ ⋆) Using an analogous procedure to that used to obtain (2.126), derive an ex- pression for the sequential estimation of the covariance of a multivariate Gaussian distribution, by starting with the maximum likelihood expression (2.122). Verify that substituting the expression for a Gaussian distribution into the Robbins-Monro se- quential estimation formula (2.135) gives a result of the same form, and hence obtain an expression for the corresponding coefficients aN .
2.38 (⋆) Use the technique of completing the square for the quadratic form in the expo- nent to derive the results (2.141) and (2.142).
2.39 (⋆⋆) Starting from the results (2.141) and (2.142) for the posterior distribution of the mean of a Gaussian random variable, dissect out the contributions from the first N − 1 data points and hence obtain expressions for the sequential update of μN and σN2 . Now derive the same results starting from the posterior distribution p(μ|x1,…,xN−1) = N(μ|μN−1,σN2 −1) and multiplying by the likelihood func- tion p(xN |μ) = N (xN |μ, σ2) and then completing the square and normalizing to obtain the posterior distribution after N observations.
2.40 (⋆ ⋆) www Consider a D-dimensional Gaussian random variable x with distribu- tion N (x|μ, Σ) in which the covariance Σ is known and for which we wish to infer the mean μ from a set of observations X = {x1 , . . . , xN }. Given a prior distribution p(μ) = N(μ|μ0,Σ0), find the corresponding posterior distribution p(μ|X).
2.41 (⋆) Use the definition of the gamma function (1.141) to show that the gamma dis- tribution (2.146) is normalized.
2.42 (⋆ ⋆) Evaluate the mean, variance, and mode of the gamma distribution (2.146).
2.43 (⋆) The following distribution
p(x|σ2,q) = q exp −|x|q (2.293) 2(2σ2)1/qΓ(1/q) 2σ2
is a generalization of the univariate Gaussian distribution. Show that this distribution is normalized so that $ ∞ p(x|σ2, q) dx = 1 (2.294)
−∞
and that it reduces to the Gaussian when q = 2. Consider a regression model in which the target variable is given by t = y(x, w) + ε and ε is a random noise

variable drawn from the distribution (2.293). Show that the log likelihood function over w and σ2, for an observed data set of input vectors X = {x1,…,xN} and corresponding target variables t = (t1 , . . . , tN )T , is given by
1!N N
lnp(t|X,w,σ2)=−2σ2 |y(xn,w)−tn|q − q ln(2σ2)+const (2.295)
n=1
where ‘const’ denotes terms independent of both w and σ2. Note that, as a function
of w, this is the Lq error function considered in Section 1.5.5.
2.44 (⋆ ⋆) Consider a univariate Gaussian distribution N (x|μ, τ −1 ) having conjugate Gaussian-gamma prior given by (2.154), and a data set x = {x1 , . . . , xN } of i.i.d. observations. Show that the posterior distribution is also a Gaussian-gamma distri- bution of the same functional form as the prior, and write down expressions for the parameters of this posterior distribution.
2.45 (⋆) Verify that the Wishart distribution defined by (2.155) is indeed a conjugate prior for the precision matrix of a multivariate Gaussian.
2.46 (⋆) www Verify that evaluating the integral in (2.158) leads to the result (2.159).
2.47 (⋆) www Show that in the limit ν → ∞, the t-distribution (2.159) becomes a Gaussian. Hint: ignore the normalization coefficient, and simply look at the depen- dence on x.
2.48 (⋆) By following analogous steps to those used to derive the univariate Student’s t-distribution (2.159), verify the result (2.162) for the multivariate form of the Stu- dent’s t-distribution, by marginalizing over the variable η in (2.161). Using the definition (2.161), show by exchanging integration variables that the multivariate t-distribution is correctly normalized.
2.49 (⋆ ⋆) By using the definition (2.161) of the multivariate Student’s t-distribution as a convolution of a Gaussian with a gamma distribution, verify the properties (2.164), (2.165), and (2.166) for the multivariate t-distribution defined by (2.162).
2.50 (⋆) Show that in the limit ν → ∞, the multivariate Student’s t-distribution (2.162) reduces to a Gaussian with mean μ and precision Λ.
2.51 (⋆) www The various trigonometric identities used in the discussion of periodic variables in this chapter can be proven easily from the relation
Exercises 135
exp(iA) = cos A + i sin A
in which i is the square root of minus one. By considering the identity
exp(iA) exp(−iA) = 1 prove the result (2.177). Similarly, using the identity
cos(A − B) = R exp{i(A − B)}
(2.296) (2.297) (2.298)

136 2. PROBABILITY DISTRIBUTIONS
where R denotes the real part, prove (2.178). Finally, by using sin(A − B) =
I exp{i(A − B)}, where I denotes the imaginary part, prove the result (2.183).
2.52 (⋆⋆) For large m, the von Mises distribution (2.179) becomes sharply peaked around the mode θ0 . By defining ξ = m1/2 (θ − θ0 ) and making the Taylor ex- pansion of the cosine function given by
cosα=1−α2 +O(α4) (2.299) 2
show that as m → ∞, the von Mises distribution tends to a Gaussian.
2.53 (⋆) Using the trigonometric identity (2.183), show that solution of (2.182) for θ0 is
given by (2.184).
2.54 (⋆) By computing first and second derivatives of the von Mises distribution (2.179), and using I0(m) > 0 for m > 0, show that the maximum of the distribution occurs when θ = θ0 and that the minimum occurs when θ = θ0 + π (mod 2π).
2.55 (⋆) By making use of the result (2.168), together with (2.184) and the trigonometric identity (2.178), show that the maximum likelihood solution mML for the concentra- tion of the von Mises distribution satisfies A(mML) = r where r is the radius of the mean of the observations viewed as unit vectors in the two-dimensional Euclidean plane, as illustrated in Figure 2.17.
2.56 (⋆ ⋆) www Express the beta distribution (2.13), the gamma distribution (2.146), and the von Mises distribution (2.179) as members of the exponential family (2.194) and thereby identify their natural parameters.
2.57 (⋆) Verify that the multivariate Gaussian distribution can be cast in exponential family form (2.194) and derive expressions for η, u(x), h(x) and g(η) analogous to (2.220)–(2.223).
2.58 (⋆) The result (2.226) showed that the negative gradient of ln g(η) for the exponen- tial family is given by the expectation of u(x). By taking the second derivatives of (2.195), show that
−∇∇ ln g(η) = E[u(x)u(x)T] − E[u(x)]E[u(x)T] = cov[u(x)]. (2.300)
2.59 (⋆) By changing variables using y = x/σ, show that the density (2.236) will be
correctly normalized, provided f(x) is correctly normalized.
2.60 (⋆ ⋆) www Consider a histogram-like density model in which the space x is di- vided into fixed regions for which the density p(x) takes the constant value hi over the ith region, and that the volume of region i is denoted ∆i. Suppose we have a set of N observations of x such that ni of these observations fall in region i. Using a Lagrange multiplier to enforce the normalization constraint on the density, derive an expression for the maximum likelihood estimator for the {hi}.
2.61 (⋆) Show that the K-nearest-neighbour density model defines an improper distribu- tion whose integral over all space is divergent.