
Gaussian Processes Part I: The Linear Algebra of Inference
Philipp Hennig
MLSS 2013, 29 August 2013
Max Planck Institute for Intelligent Systems, Department of Empirical Inference
Tübingen, Germany

Carl Friedrich Gauss (1777–1855)
Paying Tolls with a Bell

f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)

The Gaussian distribution
Multivariate Form
\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)

x, \mu \in \mathbb{R}^N, \qquad \Sigma \in \mathbb{R}^{N \times N}

\Sigma is positive semidefinite, i.e. v^\top \Sigma v \ge 0 for all v \in \mathbb{R}^N
(\Sigma Hermitian, all eigenvalues \ge 0)

Why Gaussian?
an experiment
nothing in the real world is Gaussian (except sums of i.i.d. variables)
but nothing in the real world is linear either!
Gaussians are for inference what linear maps are for algebra.

Closure Under Multiplication
multiple Gaussian factors form a Gaussian
\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B)
\qquad C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b)
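A minimal MATLAB check of the product formula for scalar Gaussians; this sketch is my addition (it is not on the slides) and the numbers are arbitrary.

% numerical check of the Gaussian product formula (1D case)
a = 1.0;  A = 2.0;          % first factor  N(x; a, A)
b = -0.5; B = 0.5;          % second factor N(x; b, B)

C = 1 / (1/A + 1/B);        % C = (A^-1 + B^-1)^-1
c = C * (a/A + b/B);        % c = C (A^-1 a + B^-1 b)

x   = linspace(-5, 5, 1001)';
nrm = @(x, m, v) exp(-0.5 * (x - m).^2 / v) / sqrt(2*pi*v);

lhs = nrm(x, a, A) .* nrm(x, b, B);
rhs = nrm(x, c, C) .* nrm(a, b, A + B);   % second factor is constant in x
max(abs(lhs - rhs))                       % agrees to machine precision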



Closure under Linear Maps
Linear Maps of Gaussians are Gaussians
p(z) = \mathcal{N}(z; \mu, \Sigma) \quad\Longrightarrow\quad p(Az) = \mathcal{N}(Az; A\mu, A\Sigma A^\top)

Here: A = (1, 0.5)

Closure under Marginalization
projections of Gaussians are Gaussian
projection with A = (1 \;\; 0):

\int \mathcal{N}\!\left(\begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right) dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})

this is the sum rule: \int p(x, y)\, dy = \int p(y \mid x)\, p(x)\, dy = p(x)

so every finite-dimensional Gaussian is a marginal of infinitely many more

Closure under Conditioning
cuts through Gaussians are Gaussians
p(x \mid y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\!\left(x;\; \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)

this is the product rule

so Gaussians are closed under the rules of probability
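A small MATLAB sketch of these two closure properties; it is my addition (not part of the slides) and the numbers are arbitrary.

% joint Gaussian over (x, y), partitioned into blocks
mu    = [1; 2];
Sigma = [4.0, 1.2;
         1.2, 2.0];

% marginal of x (sum rule): just read off the corresponding block
mx  = mu(1);  Sxx = Sigma(1,1);

% conditional of x given an observation y0 (product rule)
y0  = 3;
mxy = mu(1) + Sigma(1,2) / Sigma(2,2) * (y0 - mu(2));     % mu_x + S_xy S_yy^-1 (y - mu_y)
vxy = Sigma(1,1) - Sigma(1,2) / Sigma(2,2) * Sigma(2,1);  % S_xx - S_xy S_yy^-1 S_yx

fprintf('p(x)        = N(%.2f, %.2f)\n', mx, Sxx);
fprintf('p(x | y=%g) = N(%.2f, %.2f)\n', y0, mxy, vxy);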

Bayesian Inference
explaining away

prior:
p(x) = \mathcal{N}(x; \mu, \Sigma) = \mathcal{N}\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \begin{pmatrix} 1 \\ 0.5 \end{pmatrix}, \begin{pmatrix} 3^2 & 0 \\ 0 & 3^2 \end{pmatrix}\right)

likelihood:
p(y \mid x, \sigma) = \mathcal{N}(y; Ax, \sigma^2) = \mathcal{N}(6;\; x_1 + 0.6\, x_2,\; \sigma^2)

posterior:
p(x \mid y) \propto p(x)\, p(y \mid x)
p(x \mid y) = \mathcal{N}\!\left(x;\; \mu + \Sigma A^\top (A\Sigma A^\top + \sigma^2)^{-1}(y - A\mu),\; \Sigma - \Sigma A^\top (A\Sigma A^\top + \sigma^2)^{-1} A\Sigma\right)
            = \mathcal{N}\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \begin{pmatrix} 3.9 \\ 2.3 \end{pmatrix}, \begin{pmatrix} 3.4 & -3.4 \\ -3.4 & 7.0 \end{pmatrix}\right)
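The following MATLAB sketch (my addition) computes this posterior with the linear-Gaussian formulas above. The noise level sigma is not stated on the slide, so the value below is an assumption, chosen so that the result comes out close to the numbers quoted.

% linear-Gaussian posterior for the explaining-away example
mu    = [1; 0.5];
Sigma = 3^2 * eye(2);
A     = [1, 0.6];                 % observation map: y = x1 + 0.6*x2 + noise
y     = 6;
sigma = 1.5;                      % noise stddev (assumed, not given above)

G     = A * Sigma * A' + sigma^2; % scalar innovation covariance
mpost = mu + Sigma * A' * ((y - A * mu) / G);
Spost = Sigma - Sigma * A' * (A * Sigma) / G;

disp(mpost), disp(Spost)          % note the negative off-diagonal: explaining away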

What can we do with this?
linear regression
given y \in \mathbb{R}^N and a likelihood p(y \mid f), what is f?

[figure: noisy observations y over inputs x \in [-8, 8]]

A prior
over linear functions
f(x) = w_1 + w_2 x = \phi_x^\top w, \qquad \phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix}

p(w) = \mathcal{N}(w; \mu, \Sigma) \quad\Longrightarrow\quad p(f) = \mathcal{N}(f; \phi_x^\top \mu,\; \phi_x^\top \Sigma\, \phi_x)


The posterior
over linear functions
p(y \mid w, X) = \mathcal{N}(y; \Phi_X^\top w, \sigma^2 I)

p(w \mid y, X) = \mathcal{N}\!\left(w;\; \mu + \Sigma\Phi_X (\Phi_X^\top \Sigma \Phi_X + \sigma^2 I)^{-1} (y - \Phi_X^\top \mu),\; \Sigma - \Sigma\Phi_X (\Phi_X^\top \Sigma \Phi_X + \sigma^2 I)^{-1} \Phi_X^\top \Sigma\right)

p(f_x \mid y, X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top \mu + \phi_x^\top \Sigma\Phi_X (\Phi_X^\top \Sigma \Phi_X + \sigma^2 I)^{-1} (y - \Phi_X^\top \mu),\; \phi_x^\top \Sigma \phi_x - \phi_x^\top \Sigma\Phi_X (\Phi_X^\top \Sigma \Phi_X + \sigma^2 I)^{-1} \Phi_X^\top \Sigma \phi_x\right)

% prior on w
F     = 2;                                   % number of features
phi   = @(a) bsxfun(@power, a, 0:F-1);       % features of x
mu    = zeros(F,1);                          % p(w) = N(mu, Sigma)
Sigma = eye(F);

% prior on f(x) at the test points
n     = 100;  x = linspace(-6, 6, n)';
phix  = phi(x);
m     = phix * mu;                           % p(f_x) = N(m, kxx)
kxx   = phix * Sigma * phix';
s     = bsxfun(@plus, m, chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
stdpi = sqrt(diag(kxx));                     % marginal stddev, for plotting

load('data.mat');  N = length(Y);            % gives Y, X, sigma

% prior on Y = f_X + noise
phiX  = phi(X);                              % features of the data
M     = phiX * mu;                           % p(f_X) = N(M, kXX)
kXX   = phiX * Sigma * phiX';
G     = kXX + sigma^2 * eye(N);              % p(Y) = N(M, kXX + sigma^2 I)
R     = chol(G);                             % most expensive step: O(N^3)
kxX   = phix * Sigma * phiX';                % cov(f_x, f_X) = kxX
A     = kxX / R;                             % precompute for reuse

% posterior:  p(f_x | Y) = N(m + kxX (kXX + sigma^2 I)^-1 (Y - M),
%                            kxx - kxX (kXX + sigma^2 I)^-1 kXx)
mpost = m + A * (R' \ (Y - M));
vpost = kxx - A * A';
spost = bsxfun(@plus, mpost, chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % samples
stdpo = sqrt(diag(vpost));                   % marginal stddev, for plotting
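A plotting snippet for the listing above; this is my addition (it is not on the slide) and it assumes the variables x, X, Y, mpost, stdpo and spost defined there.

% visualize posterior mean, ~95% marginal credible band, and samples
figure; hold on;
fill([x; flipud(x)], [mpost + 2*stdpo; flipud(mpost - 2*stdpo)], ...
     [0.85 0.85 0.95], 'EdgeColor', 'none');   % marginal credible band
plot(x, mpost, 'LineWidth', 2);                % posterior mean
plot(x, spost, ':');                           % posterior samples
plot(X, Y, 'k.', 'MarkerSize', 12);            % data
xlabel('x'); ylabel('y');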

A More Realistic Dataset
General Linear Regression
f(x) = \phi_x^\top w \quad ?

[figure: a dataset over x \in [-8, 8] that a straight line clearly cannot fit]

f(x) = w_1 + w_2 x = \phi_x^\top w, \qquad \phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix}


Cubic Regression
phi = @(a) bsxfun(@power, a, 0:3);

f(x) = \phi_x^\top w, \qquad \phi_x = (1,\; x,\; x^2,\; x^3)^\top


Septic Regression?
phi = @(a) bsxfun(@power, a, 0:7);

f(x) = \phi_x^\top w, \qquad \phi_x = (1,\; x,\; x^2,\; \ldots,\; x^7)^\top


Fourier Regression
phi = @(a) 2 * [cos(bsxfun(@times, a/8, 0:8)), sin(bsxfun(@times, a/8, 1:8))];

\phi_x = (\cos x,\; \cos 2x,\; \cos 3x,\; \ldots,\; \sin x,\; \sin 2x,\; \ldots)^\top


Step Regression
phi = @(a) 1 - 2 * bsxfun(@lt, a, linspace(-8, 8, 16));

\phi_x = \bigl(1 - 2\,\mathbb{I}(x < c_1),\; 1 - 2\,\mathbb{I}(x < c_2),\; \ldots\bigr)^\top, \qquad c_l \text{ equally spaced in } [-8, 8]


Another Kind of Step Regression
phi = @(a) bsxfun(@gt, a, linspace(-8, 8, 16));

\phi_x = \bigl(\mathbb{I}(x > c_1),\; \mathbb{I}(x > c_2),\; \ldots\bigr)^\top, \qquad c_l \text{ equally spaced in } [-8, 8]


V Regression
phi = @(a) bsxfun(@minus, abs(bsxfun(@minus, a, linspace(-8, 8, 16))), linspace(-8, 8, 16));

\phi_x = \bigl(|x - c_1| - c_1,\; |x - c_2| - c_2,\; \ldots\bigr)^\top, \qquad c_l \text{ equally spaced in } [-8, 8]


Legendre Regression
phi = @(a) bsxfun(@times, legendre(13, a/8)', 0.15 .^ (0:13));

\phi_x = \bigl(b_0 P_0(x),\; b_1 P_1(x),\; \ldots,\; b_{13} P_{13}(x)\bigr)^\top, \qquad P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n}\,(x^2 - 1)^n


Eiffel Tower Regression
phi = @(a) exp(-abs(bsxfun(@minus, a, -8:1:8)));

\phi_x = \bigl(e^{-|x + 8|},\; e^{-|x + 7|},\; e^{-|x + 6|},\; \ldots\bigr)^\top


Bell Curve Regression
phi = @(a) exp(-0.5 * bsxfun(@minus, a, -8:1:8).^2);

\phi_x = \bigl(e^{-\frac{1}{2}(x + 8)^2},\; e^{-\frac{1}{2}(x + 7)^2},\; e^{-\frac{1}{2}(x + 6)^2},\; \ldots\bigr)^\top


Multiple Inputs
all of this works in multiple dimensions, too

\phi_l : \mathbb{R}^N \to \mathbb{R}, \qquad f : \mathbb{R}^N \to \mathbb{R}


Multiple Outputs
slightly more confusing, but no algebraic problem
\phi : \mathbb{R} \to \mathbb{R}^M, \qquad f : \mathbb{R} \to \mathbb{R}^M, \qquad \operatorname{cov}\bigl(f_i(t), f_j(t')\bigr) = \sum_l \phi_{l,i}(t)\, \phi_{l,j}(t')

\bigl(f_1(t_1), \ldots, f_1(t_N),\; f_2(t_1), \ldots, f_2(t_N),\; \ldots,\; f_M(t_1), \ldots, f_M(t_N)\bigr)
are just some covarying Gaussian variables; this requires careful matrix algebra

Multiple Outputs
learning paths


How many features should we use?
let's look at that algebra again

p(f_x \mid y, X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top \mu + \phi_x^\top \Sigma\Phi_X (\Phi_X^\top \Sigma \Phi_X + \sigma^2 I)^{-1} (y - \Phi_X^\top \mu),\; \phi_x^\top \Sigma \phi_x - \phi_x^\top \Sigma\Phi_X (\Phi_X^\top \Sigma \Phi_X + \sigma^2 I)^{-1} \Phi_X^\top \Sigma \phi_x\right)

there is no "lonely" \phi in there; all objects involving \phi are of the form
    \phi_x^\top \mu =: m(x)                     the mean function
    \phi_x^\top \Sigma\, \phi_{x'} =: k(x, x')  the kernel
once these are known, the cost of inference is independent of the number of features

remember the code:
M   = phiX * mu;               % p(f_X) = N(M, kXX)
m   = phix * mu;               % p(f_x) = N(m, kxx)
kXX = phiX * Sigma * phiX';
kxx = phix * Sigma * phix';
kxX = phix * Sigma * phiX';    % cov(f_x, f_X) = kxX
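To see this point in code, here is a small MATLAB comparison (my addition, with made-up toy data): the weight-space posterior pushed through the features and the function-space posterior built only from m(·) and k(·,·) give the same numbers.

% weight-space vs. function-space posterior mean: same answer, different algebra
F = 4;  phi = @(a) bsxfun(@power, a, 0:F-1);       % some features (example)
mu = zeros(F,1);  Sigma = eye(F);  sigma = 0.1;    % toy prior and noise

X = [-2; -1; 0.5; 2];  Y = sin(X);                 % toy data
x = linspace(-3, 3, 50)';

PhiX = phi(X);  phix = phi(x);
G    = PhiX * Sigma * PhiX' + sigma^2 * eye(length(Y));

% weight space: posterior over w, then map through the features
w_post = mu + Sigma * PhiX' * (G \ (Y - PhiX * mu));
f_w    = phix * w_post;

% function space: only the mean function and the kernel appear
m   = phix * mu;   M = PhiX * mu;
kxX = phix * Sigma * PhiX';
f_k = m + kxX * (G \ (Y - M));

max(abs(f_w - f_k))    % agrees to machine precision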


% prior
F   = 2;                                     % number of features
phi = @(a) bsxfun(@power, a, 0:F-1);
k   = @(a,b) phi(a) * phi(b)';               % kernel
mu  = @(a) zeros(size(a,1), 1);              % mean function

% belief on f(x) at the test points
n     = 100;  x = linspace(-6, 6, n)';
m     = mu(x);
kxx   = k(x, x);                             % p(f_x) = N(m, kxx)
s     = bsxfun(@plus, m, chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
stdpi = sqrt(diag(kxx));                     % marginal stddev, for plotting

load('data.mat');  N = length(Y);            % gives Y, X, sigma

% belief on Y = f_X + noise
M     = mu(X);                               % p(f_X) = N(M, kXX)
kXX   = k(X, X);
G     = kXX + sigma^2 * eye(N);              % p(Y) = N(M, kXX + sigma^2 I)
R     = chol(G);                             % most expensive step: O(N^3)
kxX   = k(x, X);                             % cov(f_x, f_X) = kxX
A     = kxX / R;                             % precompute for reuse

% posterior:  p(f_x | Y) = N(m + kxX (kXX + sigma^2 I)^-1 (Y - M),
%                            kxx - kxX (kXX + sigma^2 I)^-1 kXx)
mpost = m + A * (R' \ (Y - M));
vpost = kxx - A * A';
spost = bsxfun(@plus, mpost, chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % samples
stdpo = sqrt(diag(vpost));                   % marginal stddev, for plotting

Features are cheap, so let's use a lot
an example [DJC MacKay, 1998]

For simplicity, fix \Sigma = \sigma^2 \frac{c_{\max} - c_{\min}}{F} I. The elements of \phi_X^\top \Sigma\, \phi_X are

k(x_i, x_j) = \sigma^2 \frac{c_{\max} - c_{\min}}{F} \sum_{l=1}^{F} \phi_l(x_i)\, \phi_l(x_j)

with the features (code: phi = @(a) exp(-0.5 * bsxfun(@minus, a, -8:1:8).^2 ./ lambda.^2);)

\phi_l(x) = \exp\left(-\frac{(x - c_l)^2}{2\lambda^2}\right)

k(x_i, x_j) = \sigma^2 \frac{c_{\max} - c_{\min}}{F} \sum_{l=1}^{F} \exp\left(-\frac{(x_i - c_l)^2}{2\lambda^2}\right) \exp\left(-\frac{(x_j - c_l)^2}{2\lambda^2}\right)
            = \sigma^2 \frac{c_{\max} - c_{\min}}{F} \sum_{l=1}^{F} \exp\left(-\frac{(x_i - x_j)^2}{4\lambda^2}\right) \exp\left(-\frac{\bigl(c_l - \frac{1}{2}(x_i + x_j)\bigr)^2}{\lambda^2}\right)

Features are cheap, so let's use a lot
an example [DJC MacKay, 1998]

k(x_i, x_j) = \sigma^2 \frac{c_{\max} - c_{\min}}{F} \sum_{l=1}^{F} \exp\left(-\frac{(x_i - x_j)^2}{4\lambda^2}\right) \exp\left(-\frac{\bigl(c_l - \frac{1}{2}(x_i + x_j)\bigr)^2}{\lambda^2}\right)

now increase F, such that the spacing of the feature centres c_l in [c_{\min}, c_{\max}] becomes infinitesimal:

k(x_i, x_j) = \sigma^2 \int_{c_{\min}}^{c_{\max}} \exp\left(-\frac{(x_i - x_j)^2}{4\lambda^2}\right) \exp\left(-\frac{\bigl(c - \frac{1}{2}(x_i + x_j)\bigr)^2}{\lambda^2}\right) dc

let c_{\min} \to -\infty, \; c_{\max} \to \infty:

k(x_i, x_j) = \sigma^2 \sqrt{\pi \lambda^2}\; \exp\left(-\frac{(x_i - x_j)^2}{4\lambda^2}\right)
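A quick MATLAB check of this limit (my addition); the constants are arbitrary, and the wide integration range stands in for the infinite limit.

% finite sum of Gaussian-bump features vs. the closed-form limit kernel
sigma2 = 1;  lambda = 1;
cmin = -20;  cmax = 20;
xi = 0.3;  xj = -0.8;

kLimit = sigma2 * sqrt(pi * lambda^2) * exp(-(xi - xj)^2 / (4 * lambda^2));

for F = [10, 100, 1000]
    c  = linspace(cmin, cmax, F);
    ph = @(x) exp(-0.5 * (x - c).^2 / lambda^2);
    kF = sigma2 * (cmax - cmin) / F * sum(ph(xi) .* ph(xj));
    fprintf('F = %4d:  k_F = %.6f   (limit: %.6f)\n', F, kF, kLimit);
end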

Exponentiated Squares
phi = @(a) exp(-0.5 * bsxfun(@minus, a, linspace(-8, 8, 10)).^2 ./ ell.^2);

Exponentiated Squares
phi = @(a) exp(-0.5 * bsxfun(@minus, a, linspace(-8, 8, 30)).^2 ./ ell.^2);

Exponentiated Squares
k = @(a,b) 5 * exp(-0.25 * bsxfun(@minus, a, b').^2);

a.k.a. radial basis function, squared-exponential kernel


What just happened?
kernelization: on to infinitely many features

Definition
A function k : \mathbb{X} \times \mathbb{X} \to \mathbb{R} is a Mercer kernel if, for any finite collection X = [x_1, \ldots, x_N], the matrix k_{XX} \in \mathbb{R}^{N \times N} with elements k_{XX,ij} = k(x_i, x_j) is positive semidefinite.

Lemma
Any kernel that can be written as
k(x, x') = \int \phi_l(x)\, \phi_l(x')\, dl
is a Mercer kernel (assuming the integral is over a positive measure).

Proof: for any X \in \mathbb{X}^N and v \in \mathbb{R}^N,
v^\top k_{XX}\, v = \int \left(\sum_i v_i \phi_l(x_i)\right) \left(\sum_j v_j \phi_l(x_j)\right) dl = \int \left(\sum_i v_i \phi_l(x_i)\right)^{2} dl \ge 0
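A small MATLAB sanity check of the definition (my addition): build a kernel matrix from the squared-exponential kernel above and inspect its spectrum.

% kernel matrices of a Mercer kernel are positive semidefinite (up to round-off)
k = @(a,b) exp(-0.25 * bsxfun(@minus, a, b').^2);   % squared-exponential kernel
X = randn(50, 1);                                   % arbitrary input locations
K = k(X, X);

lambda_min = min(eig((K + K') / 2));                % symmetrize against round-off
fprintf('smallest eigenvalue: %.2e\n', lambda_min); % >= 0 up to numerical error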

What just happened?
Gaussian process priors
Definition
Let \mu : \mathbb{X} \to \mathbb{R} be any function and k : \mathbb{X} \times \mathbb{X} \to \mathbb{R} a Mercer kernel. A Gaussian process p(f) = \mathcal{GP}(f; \mu, k) is a probability distribution over the function f : \mathbb{X} \to \mathbb{R} such that every finite restriction to function values f_X = [f(x_1), \ldots, f(x_N)] is a Gaussian distribution p(f_X) = \mathcal{N}(f_X; \mu_X, k_{XX}).

Those step functions
phi = @(a) bsxfun(@gt, a, linspace(-8, 8, 5)) / sqrt(5);

Those step functions
phi = @(a) bsxfun(@gt, a, linspace(-8, 8, 20)) / sqrt(20);

Those step functions
phi = @(a) bsxfun(@gt, a, linspace(-8, 8, 100)) / sqrt(100);

Those step functions
k = @(a,b) theta^2 * bsxfun(@min, a + 8, b' + 8) / 16;

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \int_{c_{\min}} \mathbb{I}(x_i > c)\, \mathbb{I}(x_j > c)\, dc = \min(x_i, x_j) - c_{\min}

a.k.a. the Wiener process
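A short MATLAB sketch (my addition) drawing samples from a GP with this min kernel; the sample paths look like Brownian motion, as the Wiener-process connection suggests.

% samples from the Wiener-process kernel k(a,b) = theta^2 * min(a - cmin, b - cmin)
theta = 1;  cmin = -8;
x = linspace(-8 + 1e-3, 8, 200)';                  % start just to the right of cmin
K = theta^2 * bsxfun(@min, x - cmin, (x - cmin)');

L = chol(K + 1e-10 * eye(length(x)))';             % jitter for numerical stability
paths = L * randn(length(x), 5);                   % five sample paths, pinned near 0 at cmin
plot(x, paths);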

Those step functions
k = @(a,b) theta^2 * bsxfun(@min, a + 8, b' + 8) / 16;

[factor graph over latent function values f_1, \ldots, f_5 with observations y_1, \ldots, y_5]

Those other stepfunctions
phi = @(a) 1 - 2 * bsxfun(@lt, a, linspace(-8, 8, 5));    % Wahba, 1990

Those other stepfunctions
phi = @(a) 1 - 2 * bsxfun(@lt, a, linspace(-8, 8, 20));   % Wahba, 1990

Those other stepfunctions
phi = @(a) 1 - 2 * bsxfun(@lt, a, linspace(-8, 8, 100));  % Wahba, 1990

Those other stepfunctions
k = @(a,b) 1/c - (2/c) * abs(bsxfun(@minus, a, b')) / 16;   % Wahba, 1990

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \frac{1}{b} \int_0^1 \bigl(1 - 2\,\mathbb{I}(x_i < c)\bigr)\bigl(1 - 2\,\mathbb{I}(x_j < c)\bigr)\, dc = \frac{1}{b}\bigl(1 - 2\,|x_i - x_j|\bigr)

a.k.a. linear splines


Those linear features
Wahba, 1990
phi = @(a) bsxfun(@minus, abs(bsxfun(@minus, a, linspace(-8, 8, 5))), linspace(-8, 8, 5));

Those linear features
Wahba, 1990
phi = @(a) bsxfun(@minus, abs(bsxfun(@minus, a, linspace(-8, 8, 20))), linspace(-8, 8, 20));

Those linear features
Wahba, 1990
phi = @(a) bsxfun(@minus, abs(bsxfun(@minus, a, linspace(-8, 8, 100))), linspace(-8, 8, 100));

Those linear features
Wahba, 1990
k = @(a,b) theta^2 * (1 + (1/c) * bsxfun(@times, a + 8, b' + 8) / 16 ...
    + (c/3) * (abs(bsxfun(@minus, a, b') / 16).^3 ...
    + bsxfun(@plus, ((a + 8) / 16).^3, ((b' + 8) / 16).^3)));   % Wahba, 1990

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = 1 + x_i x_j + \frac{1}{b} \int_0^1 (x_i - c)_+\, (x_j - c)_+\, dc

a.k.a. cubic splines


Exponentially suppressed polynomials
phi = @(a) bsxfun(@times, bsxfun(@power, a/9, 0:1), c.^(0:1));   % Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{1} b^l\, x_i^l\, x_j^l, \qquad 0 < b < 1, \quad -1 < x_i, x_j < 1

Exponentially suppressed polynomials
phi = @(a) bsxfun(@times, bsxfun(@power, a/9, 0:2), c.^(0:2));   % Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{2} b^l\, x_i^l\, x_j^l, \qquad 0 < b < 1, \quad -1 < x_i, x_j < 1

Exponentially suppressed polynomials
phi = @(a) bsxfun(@times, bsxfun(@power, a/9, 0:10), c.^(0:10));   % Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{10} b^l\, x_i^l\, x_j^l, \qquad 0 < b < 1, \quad -1 < x_i, x_j < 1

Exponentially suppressed polynomials
k = @(a,b) theta.^2 ./ (1 - c * bsxfun(@times, a/8, b'/8));   % Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{\infty} b^l\, x_i^l\, x_j^l = \frac{1}{1 - b\, x_i x_j}, \qquad 0 < b < 1, \quad -1 < x_i, x_j < 1


Exponentially decaying periodic features
phi = @(a) [bsxfun(@times, cos(bsxfun(@times, a/8, 0:2)), c.^(0:2)), ...
            bsxfun(@times, sin(bsxfun(@times, a/8, 1:2)), c.^(1:2))];
T. Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{2} b^l \bigl(\cos(2\pi l x_i)\cos(2\pi l x_j) + \sin(2\pi l x_i)\sin(2\pi l x_j)\bigr), \qquad 0 < b < 1

Exponentially decaying periodic features
phi = @(a) [bsxfun(@times, cos(bsxfun(@times, a/8, 0:20)), c.^(0:20)), ...
            bsxfun(@times, sin(bsxfun(@times, a/8, 1:20)), c.^(1:20))];
T. Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{20} b^l \bigl(\cos(2\pi l x_i)\cos(2\pi l x_j) + \sin(2\pi l x_i)\sin(2\pi l x_j)\bigr), \qquad 0 < b < 1

Exponentially decaying periodic features
phi = @(a) [bsxfun(@times, cos(bsxfun(@times, a/8, 0:50)), c.^(0:50)), ...
            bsxfun(@times, sin(bsxfun(@times, a/8, 1:50)), c.^(1:50))];
T. Minka, 2000

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{50} b^l \bigl(\cos(2\pi l x_i)\cos(2\pi l x_j) + \sin(2\pi l x_i)\sin(2\pi l x_j)\bigr), \qquad 0 < b < 1

Exponentially decaying periodic features
T. Minka, 2000
k = @(a,b) theta.^2 .* 2/pi .* asin(2 * cr * cu * bsxfun(@times, a, b') ./ ...
    sqrt(bsxfun(@times, 1 + 2 * cr * cu * a.^2, 1 + 2 * cr * cu * b'.^2)));

\operatorname{cov}\bigl(f(x_i), f(x_j)\bigr) = \sum_{l=0}^{\infty} b^l \bigl(\cos(2\pi l x_i)\cos(2\pi l x_j) + \sin(2\pi l x_i)\sin(2\pi l x_j)\bigr) = \frac{1 - b \cos\bigl(2\pi (x_i - x_j)\bigr)}{1 - 2 b \cos\bigl(2\pi (x_i - x_j)\bigr) + b^2}, \qquad 0 < b < 1


White Noise
the limit of block functions
\lim \int \mathbb{I}(x_i \in c)\, \mathbb{I}(x_j \in c)\, dc = 0 \quad \text{for } x_i \neq x_j

but we're cheating a little: the height of the blocks goes to 0!
white noise is a concept, more than a proper limit
if you make no assumptions, you learn nothing


That gcd kernel
k = @(a,b) gcd(bsxfun(@times, a, ones(size(b'))), bsxfun(@times, ones(size(a)), b'));


Summary
Gaussians are closed under
  linear projection / marginalization (the sum rule)
  linear restriction / conditioning (the product rule)
they provide the linear algebra of inference
combine them with nonlinear features \phi and you get nonlinear regression
in fact, the number of features can be infinite: nonparametric Gaussian process regression

Tomorrow:
  so what are kernels? What is the set of kernels?
  how should we design GP models?
  how powerful are those models?

Bibliography

D.J.C. MacKay
Introduction to Gaussian Processes
in C.M. Bishop (ed.), Neural Networks and Machine Learning, Springer, 1998

C.E. Rasmussen & C.K.I. Williams
Gaussian Processes for Machine Learning
MIT Press, 2006

T. Minka
Deriving Quadrature Rules from Gaussian Processes
Technical Report, 2000

G. Wahba
Spline Models for Observational Data
SIAM CBMS-NSF Regional Conference Series in Applied Mathematics, 1990