代写 R C math scala C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml

C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml
Appendix A
Mathematical Background
A.1 Joint, Marginal and Conditional Probability
Let the n (discrete or continuous) random variables y1 , . . . , yn have a joint probability p(y1,…,yn), or p(y) for short.1 Technically, one ought to distin- guish between probabilities (for discrete variables) and probability densities for continuous variables. Throughout the book we commonly use the term “prob- ability” to refer to both. Let us partition the variables in y into two groups, yA and yB, where A and B are two disjoint sets whose union is the set {1,…,n}, so that p(y) = p(yA,yB). Each group may contain one or more variables.
The marginal probability of yA is given by 􏰖
joint probability
marginal probability
independence conditional probability
p(yA) =
p(yA,yB)dyB. (A.1)
The integral is replaced by a sum if the variables are discrete valued. Notice that if the set A contains more than one variable, then the marginal probability is itself a joint probability—whether it is referred to as one or the other depends on the context. If the joint distribution is equal to the product of the marginals, then the variables are said to be independent, otherwise they are dependent.
The conditional probability function is defined as
p(yA|yB) = p(yA,yB), (A.2)
p(yB )
defined for p(yB) > 0, as it is not meaningful to condition on an impossible event. If yA and yB are independent, then the marginal p(yA) and the condi- tional p(yA|yB) are equal.
1One can deal with more general cases where the density function does not exist by using the distribution function.

C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml
200
Bayes’ rule
Mathematical Background Using the definitions of both p(yA|yB) and p(yB|yA) we obtain Bayes’
theorem
p(yA|yB) = p(yA)p(yB|yA). (A.3) p(yB )
Gaussian definition
Since conditional distributions are themselves probabilities, one can use all of the above also when further conditioning on other variables. For example, in supervised learning, one often conditions on the inputs throughout, which would lead e.g. to a version of Bayes’ rule with additional conditioning on X in all four probabilities in eq. (A.3); see eq. (2.5) for an example of this.
A.2 Gaussian Identities
The multivariate Gaussian (or Normal) distribution has a joint probability den- sity given by
p(x|m,Σ) =(2π)−D/2|Σ|−1/2exp􏰀−1(x−m)⊤Σ−1(x−m)􏰁, (A.4) 2
where m is the mean vector (of length D) and Σ is the (symmetric, positive definite) covariance matrix (of size D × D). As a shorthand we write x ∼ N (m, Σ).
Let x and y be jointly Gaussian random vectors
􏰌x􏰍 􏰊􏰌μx􏰍 􏰌A C􏰍􏰋 􏰎􏰌μx􏰍 􏰌A ̃ C ̃􏰍−1􏰏
y∼Nμ,C⊤B=Nμ,C ̃⊤B ̃ ,(A.5) yy
then the marginal distribution of x and the conditional distribution of x given y are
x ∼ N(μx,A), and x|y ∼ N􏰀μx +CB−1(y−μy), A−CB−1C⊤􏰁
or x|y ∼ N􏰀μx −A ̃−1C ̃(y−μy), A ̃−1􏰁. (A.6)
See, e.g. von Mises [1964, sec. 9.3], and eqs. (A.11 – A.13).
The product of two Gaussians gives another (un-normalized) Gaussian
N(x|a,A)N(x|b,B) =Z−1N(x|c,C) (A.7) where c = C(A−1a + B−1b) and C = (A−1 + B−1)−1.
Notice that the resulting Gaussian has a precision (inverse variance) equal to the sum of the precisions and a mean equal to the convex sum of the means, weighted by the precisions. The normalizing constant looks itself like a Gaussian (in a or b)
Z−1 = (2π)−D/2|A + B|−1/2 exp 􏰀− 1 (a − b)⊤(A + B)−1(a − b)􏰁. (A.8) 2
To prove eq. (A.7) simply write out the (lengthy) expressions by introducing eq. (A.4) and eq. (A.8) into eq. (A.7), and expand the terms inside the exp to
conditioning and marginalizing
products

C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml
A.3 Matrix Identities
verify equality. Hint: it may be helpful to expand C using the matrix inversion lemma, eq. (A.9), C = (A−1+B−1)−1 = A−A(A+B)−1A = B−B(A+B)−1B.
To generate samples x ∼ N(m,K) with arbitrary mean m and covariance matrix K using a scalar Gaussian generator (which is readily available in many programming environments) we proceed as follows: first, compute the Cholesky decomposition (also known as the “matrix square root”) L of the positive def- inite symmetric covariance matrix K = LL⊤, where L is a lower triangular matrix, see section A.4. Then generate u ∼ N(0,I) by multiple separate calls to the scalar Gaussian generator. Compute x = m + Lu, which has the desired distribution with mean m and covariance LE[uu⊤]L⊤ = LL⊤ = K (by the independence of the elements of u).
In practice it may be necessary to add a small multiple of the identity matrix εI to the covariance matrix for numerical reasons. This is because the eigenvalues of the matrix K can decay very rapidly (see section 4.3.1 for a closely related analytical result) and without this stabilization the Cholesky decomposition fails. The effect on the generated samples is to add additional independent noise of variance ε. From the context ε can usually be chosen to have inconsequential effects on the samples, while ensuring numerical stability.
A.3 Matrix Identities
The matrix inversion lemma, also known as the Woodbury, Sherman & Morri- son formula (see e.g. Press et al. [1992, p. 75]) states that
(Z + UWV ⊤)−1 = Z−1 − Z−1U(W−1 + V ⊤Z−1U)−1V ⊤Z−1, (A.9)
assuming the relevant inverses all exist. Here Z is n×n, W is m×m and U and V are both of size n×m; consequently if Z−1 is known, and a low rank (i.e. m < n) perturbation is made to Z as in left hand side of eq. (A.9), considerable speedup can be achieved. A similar equation exists for determinants |Z + UWV ⊤| = |Z| |W| |W−1 + V ⊤Z−1U|. (A.10) Let the invertible n × n matrix A and its inverse A−1 be partitioned into 􏰊PQ􏰋 −1 􏰊P ̃Q ̃􏰋 A = R S , A = R ̃ S ̃ , (A.11) where P and P ̃ are n1 × n1 matrices and S and S ̃ are n2 × n2 matrices with n = n1 + n2. The submatrices of A−1 are given in Press et al. [1992, p. 77] as 201 generating multivariate Gaussian samples P ̃=P−1+P−1QMRP−1  whereM = (S−RP−1Q)−1, R = −MRP  S ̃ = M  Q ̃ = −P−1QM ̃ −1 (A.12) matrix inversion lemma determinants inversion of a partitioned matrix C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml 202 Mathematical Background derivative of inverse derivative of log determinant A.3.1 Matrix Derivatives Derivatives of the elements of an inverse matrix: ∂ K−1 = −K−1 ∂K K−1, ∂θ positive definite symmetric matrix we have ∂ log|K| = tr􏰀K−1∂K􏰁. A.3.2 Matrix Norms The Frobenius norm ∥A∥F of a n1 × n2 matrix A is defined as n1 n2 ∥A∥2F = 􏰅 􏰅 |aij |2 = tr(AA⊤ ), i=1 j=1 [Golub and Van Loan, 1989, p. 56]. A.4 Cholesky Decomposition or equivalently P ̃ = N Q ̃ = −NQS−1 R ̃ = −S−1RN S ̃ = S−1 +S−1RNQS−1   where N = (P − QS−1R)−1.    (A.13) (A.14) where ∂K is a matrix of elementwise derivatives. For the log determinant of a ∂θ ∂θ solving linear systems computational cost The Cholesky decomposition of a symmetric, positive definite matrix A decom- poses A into a product of a lower triangular matrix L and its transpose LL⊤ = A, (A.17) where L is called the Cholesky factor. The Cholesky decomposition is useful for solving linear systems with symmetric, positive definite coefficient matrix A. To solve Ax = b for x, first solve the triangular system Ly = b by forward substitution and then the triangular system L⊤x = y by back substitution. Using the backslash operator, we write the solution as x = L⊤\(L\b), where the notation A\b is the vector x which solves Ax = b. Both the forward and backward substitution steps require n2/2 operations, when A is of size n × n. The computation of the Cholesky factor L is considered numerically extremely stable and takes time n3/6, so it is the method of choice when it can be applied. ∂θ ∂θ (A.15) (A.16) C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml A.5 Entropy and Kullback-Leibler Divergence Note also that the determinant of a positive definite symmetric matrix can be calculated efficiently by nn |A| = 􏰕L2ii, or log|A| = 2􏰅logLii, (A.18) i=1 i=1 where L is the Cholesky factor from A. A.5 Entropy and Kullback-Leibler Divergence The entropy H[p(x)] of a distribution p(x) is a non-negative measure of the amount of “uncertainty” in the distribution, and is defined as 􏰖 203 determinant H[p(x)] = − p(x)logp(x)dx. (A.19) The integral is substituted by a sum for discrete variables. Entropy is measured in bits if the log is to the base 2 and in nats in the case of the natural log. The entropy of a Gaussian in D dimensions, measured in nats is H[N (μ, Σ)] = 1 log |Σ| + D (log 2πe). (A.20) 22 The Kullback-Leibler (KL) divergence (or relative entropy) KL(p||q) be- tween two distributions p(x) and q(x) is defined as entropy 􏰖 p(x) p(x) log q(x) dx. (A.21) KL(p||q) = It is easy to show that KL(p||q) ≥ 0, with equality if p = q (almost everywhere). For the case of two Bernoulli random variables p and q this reduces to KLBer(p||q) = plog p +(1−p)log (1−p), (A.22) where we use p and q both as the name and the parameter of the Bernoulli distributions. For two Gaussian distributions N (μ0 , Σ0 ) and N (μ1 , Σ1 ) we have [Kullback, 1959, sec. 9.1] divergence of Bernoulli random variables divergence of Gaussians minimizing KL(p||q) divergence leads to moment matching KL(N ||N ) = 1 log|Σ Σ−1|+ 01210 q (1−q) 1 trΣ−1􏰀(μ −μ )(μ −μ )⊤ +Σ −Σ 􏰁. (A.23) 21010101 Consider a general distribution p(x) on RD and a Gaussian distribution q(x) = N(μ,Σ). Then KL(p||q) = 􏰖 1(x−μ)⊤Σ−1(x−μ)p(x)dx+ 2 1 log|Σ|+ D log2π+ 22 􏰖 (A.24) p(x)logp(x)dx. C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml 204 Mathematical Background Equation (A.24) can be minimized w.r.t. μ and Σ by differentiating w.r.t. these parameters and setting the resulting expressions to zero. The optimal q is the one that matches the first and second moments of p. The KL divergence can be viewed as the extra number of nats needed on average to code data generated from a source p(x) under the distribution q(x) as opposed to p(x). A.6 Limits The limit of a rational quadratic is a squared exponential x2 −α x2 lim 􏰀1+ 􏰁 = exp􏰀− 􏰁. (A.25) finite measure probability measure Lebesgue measure Here we sketch some definitions concerning measure and integration; fuller treatments can be found e.g. in Doob [1994] and Bartle [1995]. Let Ω be the set of all possible outcomes of an experiment. For example, for a D-dimensional real-valued variable, Ω = RD. Let F be a σ-field of subsets of Ω which contains all the events in whose occurrences we may be interested.2 Then μ is a countably additive measure if it is real and non-negative and for all mutually disjoint sets A1, A2, . . . ∈ F we have ∞∞ μ􏰀􏰦Ai􏰁 = 􏰅μ(Ai). (A.26) i=1 i=1 If μ(Ω) < ∞ then μ is called a finite measure and if μ(Ω) = 1 it is called a probability measure. The Lebesgue measure defines a uniform measure over subsets of Euclidean space. Here an appropriate σ-algebra is the Borel σ-algebra BD, where B is the σ-algebra generated by the open subsets of R. For example on the line R the Lebesgue measure of the interval (a, b) is b − a. We now restrict Ω to be RD and wish to give meaning to integration of a function f : RD → R with respect to a measure μ 􏰖 We assume that f is measurable, i.e. that for any Borel-measurable set A ∈ R, f−1(A) ∈ BD. There are two cases that will interest us (i) when μ is the Lebesgue measure and (ii) when μ is a probability measure. For the first case expression (A.27) reduces to the usual integral notation 􏰔 f(x)dx. 2The restriction to a σ-field of subsets is important technically to avoid paradoxes such as the Banach-Tarski paradox. Informally, we can think of the σ-field as restricting consideration to “reasonable” subsets. α→∞2α 2 A.7 Measure and Integration f (x) dμ(x). (A.27) C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml A.8 Fourier Transforms For a probability measure μ on x, the non-negative function p(x) is called the density of the measure if for all A ∈ BD we have 􏰖 205 μ(A) = If such a density exists it is uniquely determined almost everywhere, i.e. except for sets with measure zero. Not all probability measures have densities—only distributions that assign zero probability to individual points in x-space can have densities.3 If p(x) exists then we have 􏰖􏰖 f(x)dμ(x) = f(x)p(x)dx. (A.29) If μ does not have a density expression (A.27) still has meaning by the standard construction of the Lebesgue integral. For Ω = RD the probability measure μ can be related to the distribution function F : RD → [0,1] which is defined as F(z) = μ(x1 ≤ z1,...xD ≤ zD). The distribution function is more general than the density as it is always defined for a given probability measure. A simple example of a random variable which has a distribution function but no density is obtained by the following construction: a coin is tossed and with probability p it comes up heads; if it comes up heads x is chosen from U(0,1) (the uniform distribution on [0,1]), otherwise (with probability 1−p) x is set to 1/2. This distribution has a “point mass” (or atom) at x = 1/2. A.7.1 Lp Spaces LetμbeameasureonaninputsetX. Forsomefunctionf :X →Rand (A.30) (A.31) 1 ≤ p < ∞, we define ∥f ∥Lp (X ,μ) 􏰪 |f (x)| dμ(x) , 􏰂􏰖 p 􏰃1/p if the integral exists. For p = ∞ we define ̃ 2πis·x f(x) = f(s)e −∞ ds, ̃ f(s) = −2πis·x f(x)e dx, (A.32) ∥f ∥L∞ (X ,μ) = ess sup |f (x)|, x∈X where ess sup denotes the essential supremum, i.e. the smallest number that upper bounds |f(x)| almost everywhere. The function space Lp(X,μ) is defined for any p in 1 ≤ p ≤ ∞ as the space of functions for which ∥f∥Lp(X,μ) < ∞. A.8 Fourier Transforms For sufficiently well-behaved functions on RD we have 􏰖∞ 􏰖∞ A p(x)dx. (A.28) −∞ “point mass” example 3A measure μ has a density if and only if it is absolutely continuous with respect to Lebesgue measure on RD, i.e. every set that has Lebesgue measure zero also has μ-measure zero. C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. ⃝c 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml 206 Mathematical Background ̃ where f(s) is called the Fourier transform of f(x), see e.g. Bracewell [1986]. We refer to the equation on the left as the synthesis equation, and the equation on the right as the analysis equation. There are other conventions for Fourier transforms, particularly those involving ω = 2πs. However, this tends to de- stroy symmetry between the analysis and synthesis equations so we use the definitions given above. Here we have defined Fourier transforms for f(x) being a function on RD. For related transforms for periodic functions, functions defined on the integer lattice and on the regular N-polygon see section B.1. A.9 Convexity Below we state some definitions and properties of convex sets and functions taken from Boyd and Vandenberghe [2004]. A set C is convex if the line segment between any two points in C lies in C, i.e. if for any x1, x2 ∈ C and for any θ with 0 ≤ θ ≤ 1, we have θx1 + (1 − θ)x2 ∈ C. (A.33) Afunctionf :X →Risconvex ifitsdomainX isaconvexsetandifforall x1, x2 ∈X andθwith0≤θ≤1,wehave: f(θx1 + (1 − θ)x2) ≤ θf(x1) + (1 − θ)f(x2), (A.34) where X is a (possibly improper) subset of RD. f is concave if −f is convex. A function f is convex if and only if its domain X is a convex set and its Hessian is positive semidefinite for all x ∈ X . convex sets convex function