Expectation propagation as a way of life
Andrew Gelman, Aki Vehtari, Pasi Jylänki, Christian Robert, Nicolas Chopin, John P. Cunningham

15 Dec 2014
Abstract
We revisit expectation propagation (EP) as a prototype for scalable algorithms that partition big datasets into many parts and analyze each part in parallel to perform inference of shared parameters. The algorithm should be particularly efficient for hierarchical models, for which the EP algorithm works on the shared parameters (hyperparameters) of the model.
The central idea of EP is to work at each step with a tilted distribution that combines the likelihood for a part of the data with the cavity distribution, which is the approximate model for the prior and all other parts of the data. EP iteratively approximates the moments of the tilted distributions and incorporates those approximations into a global posterior approximation. As such, EP can be used to divide the computation for large models into manageable sizes. The computation for each partition can be made parallel with occasional exchanging of information between processes through the global posterior approximation. Moments of multivariate tilted distributions can be approximated in various ways, including MCMC, Laplace approximations, and importance sampling.
Keywords: Bayesian computation, big data, data partitioning, expectation propagation, hierarchical models, Stan, statistical computing
1. Background: Bayesian computation via data partitioning
Various divide-and-conquer algorithms have been proposed for fitting statistical models to large datasets. The basic idea is to partition the data y into K pieces, y_1, ..., y_K, each with likelihood p(y_k|θ), then analyze each part of the likelihood separately, and finally combine the K pieces to perform inference (typically approximately) for θ. In addition to the appeal of such algorithms for parallel or online computation, there is a statistical intuition that they should work well from the law of large numbers: if the K pieces are a random sample of the data, or are independent realizations from a single probability model, then we can think of the log likelihood as the sum of K independent random variables.
In Bayesian inference, one must also decide what to do with the prior distribution, p(θ). Perhaps the most direct approach is to multiply each factor of the likelihood by p(θ)^{1/K} and analyze the K pieces separately as p(y_k|θ)p(θ)^{1/K}, such that the product of the K pieces is the full posterior density. A disadvantage of this approach is that it can lead to instability in the calculation of the fractional posterior distributions. Consider the many settings in statistics and machine learning where an informative prior is needed for regularization, that is, where the likelihood alone does not allow good estimation of θ (with corresponding computational difficulties; for some examples in our
We thank David Blei and Ole Winther for helpful comments, David Schiminovich, Peizhe Li, and Swupnil Sahai for the astronomy example, and the National Science Foundation, Institute for Education Sciences, and Moore and Sloan Foundations for partial support of this research.
Department of Statistics, Columbia University, New York. Visiting Université Paris Dauphine and ENSAE. Aalto University, Espoo, Finland.
Donders Institute for Brain, Cognition, and Behavior, Radboud University Nijmegen, Netherlands.
Université Paris Dauphine and University of Warwick.
Université Paris Dauphine and ENSAE.
Department of Statistics, Columbia University, New York.
arXiv:1412.4869v1 [stat.CO] 16 Dec 2014

own research, see Gelman, Bois, and Jiang, 1996, and Gelman, Jakulin, et al., 2008, and also, in the likelihood-free context, Barthelmé and Chopin, 2014). In these problems, p(θ)^{1/K} might not provide enough regularization when K is large. Another approach is to include the full prior distribution in each of the K independent inferences, but then this can lead to computational instability when dividing by p(θ)^{K−1} at the end.
In any case, a challenge in all these algorithms is how to combine the K inferences. If the model is Gaussian, one can simply calculate the posterior mean as the precision-weighted average of the K separate posterior means, and the posterior precision as the sum of the K separate posterior precisions. Minsker et al. (2014) find a median to work well. More generally, though, we will be computing inferences using Monte Carlo simulation, and there is no general way to combine K batches of simulations to get a combined posterior. Indeed, one of our early attempts at data splitting was a bit disappointing, as the gains in computation time were small compared to all the effort required to combine the inferences (Huang and Gelman, 2005).
Various approaches have been recently proposed to combine inferences from partitioned data; see Ahn, Korattikara, and Welling (2012), Gershman, Hoffman, and Blei (2012), Korattikara, Chen, and Welling (2013), Hoffman et al. (2013), Wang and Blei (2013), Scott et al. (2013), Wang and Dunson (2013), Neiswanger et al. (2013), and Wang (2014). These different approaches rely on different approximations or assumptions, but all of them are based on analyzing each piece of the data alone, not using any information from the other K − 1 pieces.
The contribution of the present paper is to suggest the general relevance of the expectation propagation (EP) framework, leveraging the idea of a cavity distribution, which approximates the influence of inference from all other K − 1 pieces of the data, as a prior in the inference step for each partition. Like EP in general, our approach has the disadvantages of being an approximation and requiring a sequence of iterations which are not always stable. The advantage of our approach is that its regularization should induce faster and more stable calculation for each partition. Our approach can be viewed either as a stochastic implementation of EP or, as we prefer, as an application of EP ideas to Bayesian data partitioning.
We present the basic ideas in section 2, but we anticipate that the greatest gains from this approach will come in hierarchical models, as we discuss in section 3. Section 4 considers options in carrying out the algorithm, and in section 5 we demonstrate on a hierarchical logistic regression example. In section 6 we consider various generalizations and connections to other methods, and an appendix contains some details of implementation.
2. A general framework for EP-like algorithms based on iterative tilted approximations
Expectation propagation (EP) is a fast and parallelizable method of distributional approximation via data partitioning. As generally presented, EP is an iterative approach to approximately minimizing the Kullback-Leibler divergence between the true posterior distribution p(θ|y) and a distribution g(θ) from a tractable family, typically the multivariate normal. The goal, as is the goal of all approximate inference, is to construct g(θ) to approximate well the target, p(θ|y) ∝ p(θ) ∏_{k=1}^K p(y_k|θ).
As we see it, though, the key idea of EP is not specifically about Kullback-Leibler divergence or the evaluation of expectations, but rather that each step works with a tilted distribution that includes the likelihood for the kth partition of the data along with the kth cavity distribution, which represents the approximate posterior for the other K − 1 pieces. The basic steps are as follows:
1. Partitioning. Split the data y into K pieces, y_1, ..., y_K, each with likelihood p(y_k|θ).

2. Initialization. Choose initial site approximations g_k(θ) from some restricted family (for example, multivariate normal distributions in θ). Let the initial approximation to the posterior density be g(θ) ∝ p(θ) ∏_{k=1}^K g_k(θ).

3. EP-like iteration. For k = 1, ..., K (in serial or parallel):

(a) Compute the cavity distribution, g_{-k}(θ) ∝ g(θ)/g_k(θ).

(b) Form the tilted distribution, g_{\k}(θ) ∝ p(y_k|θ) g_{-k}(θ).

(c) Construct an updated site approximation g_k^new(θ) such that g_k^new(θ) g_{-k}(θ) approximates g_{\k}(θ).

(d) If parallel, set g_k(θ) to g_k^new(θ), and a new approximate distribution g(θ) ∝ p(θ) ∏_{k=1}^K g_k(θ) will be formed and redistributed after the K site updates. If serial, update the global approximation g(θ) to g_k^new(θ) g_{-k}(θ).

4. Termination. Repeat step 3 until convergence of the approximate posterior distribution g(θ).

Figure 1: Sketch illustrating the benefits of expectation propagation (EP) ideas in Bayesian computation. In this simple example, the parameter space θ has two dimensions, and the data have been split into five pieces. Each oval represents a contour of the likelihood p(y_k|θ) provided by a single partition of the data. A simple parallel computation of each piece separately would be inefficient because it would require the inference for each partition to cover its entire oval. By combining with the cavity distribution g_{-k}(θ) in a manner inspired by EP, we can devote most of our computational effort to the area of overlap.
The benefits of this algorithm arise because each site g_k(θ) comes from a restricted family whose complexity is determined by the number of parameters in the model, not by the sample size; this is less expensive than carrying around the full likelihood, which in general would require computation time proportional to the size of the data. Furthermore, if the parametric approximation is multivariate normal, many of the above steps become analytical, with steps 3a, 3b, and 3d requiring only simple linear algebra. Accordingly, EP tends to be applied to specific high-dimensional problems where computational cost is an issue, notably for Gaussian processes (Rasmussen and Williams, 2006, Jylänki, Vanhatalo, and Vehtari, 2011, Cunningham, Hennig, and Lacoste-Julien, 2011, and Vanhatalo et al., 2013), and efforts are made to keep the algorithm stable as well as fast.
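To make this linear algebra concrete, here is a minimal Python sketch of one EP-like pass with multivariate normal sites stored as natural parameters (a precision matrix Q and a linear term r). The routine `tilted_moments`, the damping factor, and all variable names are our own illustrative assumptions; `tilted_moments` stands in for whatever tilted-distribution approximation (Laplace, quadrature, MCMC) is used in step 3c.

```python
import numpy as np

def ep_like_pass(r_sites, Q_sites, r_prior, Q_prior, tilted_moments, damping=0.5):
    """One parallel-style pass of an EP-like iteration with Gaussian sites.

    r_sites, Q_sites: lists of natural parameters (r_k, Q_k) for the K sites.
    r_prior, Q_prior: natural parameters of the (Gaussian) prior.
    tilted_moments(k, r_cav, Q_cav): user-supplied routine returning the mean and
        covariance of the k-th tilted distribution, i.e. of p(y_k | theta) times
        the cavity N(Q_cav^{-1} r_cav, Q_cav^{-1}).
    """
    K = len(r_sites)
    # Global approximation: product of the prior and all site approximations.
    Q = Q_prior + sum(Q_sites)
    r = r_prior + sum(r_sites)

    new_r_sites, new_Q_sites = [], []
    for k in range(K):
        # Step 3a: cavity distribution (remove site k from the global approximation).
        Q_cav = Q - Q_sites[k]
        r_cav = r - r_sites[k]
        # Steps 3b-3c: approximate the tilted distribution's moments ...
        mu_tilt, Sigma_tilt = tilted_moments(k, r_cav, Q_cav)
        # ... and convert them back to natural parameters.
        Q_tilt = np.linalg.inv(Sigma_tilt)
        r_tilt = Q_tilt @ mu_tilt
        # New site = tilted minus cavity, with damping for stability.
        new_Q_sites.append((1 - damping) * Q_sites[k] + damping * (Q_tilt - Q_cav))
        new_r_sites.append((1 - damping) * r_sites[k] + damping * (r_tilt - r_cav))

    # Step 3d (parallel flavor): rebuild the global approximation once per pass.
    Q = Q_prior + sum(new_Q_sites)
    r = r_prior + sum(new_r_sites)
    return new_r_sites, new_Q_sites, r, Q
```

In the serial variant, the global (r, Q) would instead be refreshed immediately after each site update rather than once per pass.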
Figure 1 illustrates the general idea. Here the data have been divided into five pieces, each of which has a somewhat awkward likelihood function. The most direct parallel partitioning approach would be to analyze each of the pieces separately and then combine these inferences at the end, to yield a posterior distribution in the overlap. In contrast to data-partitioning algorithms, each step of an EP-based approach combines the likelihood of each partition with the cavity distribution representing the rest of the available information across the other K − 1 pieces and the prior. In Figure 1, an EP approach should allow the computations to be concentrated in the area of overlap rather than wasting computation in the outskirts of the individual likelihood distributions.

Figure 2: Example of a step of an EP algorithm in a simple one-dimensional example, illustrating the stability of the computation even when part of the likelihood is far from Gaussian. The figure shows the likelihood factor p(y_k|θ), the cavity distribution g_{-k}(θ), and the resulting tilted distribution p(y_k|θ)g_{-k}(θ). When performing inference on the likelihood factor p(y_k|θ), the algorithm uses the cavity distribution g_{-k}(θ) as a prior.
Figure 2 illustrates the construction of the tilted distribution g_{\k}(θ) (step 3b of the algorithm) and demonstrates the critically important regularization attained by using the cavity distribution g_{-k}(θ) as a prior: because the cavity distribution carries information about the posterior inference from all other K − 1 data pieces, any computation done to approximate the tilted distribution (step 3c) will focus on areas of greater posterior mass.
The remaining conceptual and computational complexity lies in step 3c, the fitting of the updated local approximation g_k(θ) such that g_k(θ)g_{-k}(θ) approximates the tilted distribution g_{\k}(θ) ∝ p(y_k|θ)g_{-k}(θ). This step is in most settings the crux of EP, in terms of computation, accuracy, and stability of the iterations. As such, here we consider a variety of choices for forming tilted approximations, beyond the standard choices in the EP literature. We call any method of this general iterative approach an EP-like algorithm. Thus we can think of the EP-like algorithms described in this paper as a set of rough versions of EP, or perhaps more fruitfully, as a statistical and computational improvement upon partitioning algorithms that analyze each part of the data separately.
3. EP-like algorithm for hierarchical models
The exciting idea for hierarchical models is that the local parameters can be partitioned, and the convergence of the EP-like algorithm only needs to happen on the shared parameters. Suppose a hierarchical model has local parameters α_1, ..., α_K and shared parameters φ. All these can be vectors, with each α_k applying to the model for the data piece y_k, and with φ including shared parameters ("fixed effects") of the data model and hyperparameters as well. This structure is displayed in Figure 3.
3.1. An EP-like algorithm on the shared parameters
We will define an EP-like algorithm on the vector of shared parameters, φ, so that the kth approximating step, g_k(φ), combines the local information at site k with the cavity distribution g_{-k}(φ), which approximates the other information in the model relevant to this updating. The EP-like algorithm,

Figure 3: Model structure for the hierarchical EP-like algorithm (a graph with the shared parameters φ at the top, local parameters α_1, α_2, ..., α_K, and data pieces y_1, y_2, ..., y_K). In each step k, inference is based on the local model, p(y_k|α_k, φ)p(α_k|φ), multiplied by the cavity distribution, g_{-k}(φ). Computation on this tilted posterior gives a distributional approximation on (α_k, φ), or simulation draws of (α_k, φ); in either case, we just use the inference for φ to update the local approximation, g_k(φ). The algorithm has potentially large efficiency gains because, in each of the K steps, both the sample size and the number of parameters scale like 1/K.
taking the steps of the algorithm in the previous section and altering them appropriately, becomes:
1. Partitioning. Partition the data y into K pieces, y_1, ..., y_K, each with its local model p(y_k|α_k, φ)p(α_k|φ). The goal is to construct an approximation g(φ) for the marginal posterior distribution of the shared parameters φ. From this, we can also get an approximate joint distribution as described in step 4 below.

2. Initialization. Choose initial approximations g_k(φ) from some restricted family (for example, multivariate normal distributions in φ). Define g(φ) ∝ p(φ) ∏_{k=1}^K g_k(φ).

3. EP-like iteration. For k = 1, ..., K (in serial or parallel):

(a) Compute the cavity distribution, g_{-k}(φ) ∝ g(φ)/g_k(φ).

(b) Form the local tilted distribution, g_{\k}(α_k, φ) ∝ p(y_k|α_k, φ)p(α_k|φ)g_{-k}(φ), and determine its marginal distribution, g_{\k}(φ); this can be computed analytically if g_{\k}(α_k, φ) is a joint normal distribution or approximated using the sampled values of φ if g_{\k}(α_k, φ) is computed via simulation.

(c) Construct an updated site approximation g_k^new(φ) such that g_k^new(φ)g_{-k}(φ) approximates g_{\k}(φ).

(d) If parallel, set g_k(φ) to g_k^new(φ), and a new approximation g(φ) ∝ p(φ) ∏_{k=1}^K g_k(φ) will be formed and redistributed after the K site updates. If serial, update the global approximation g(φ) to g_k^new(φ)g_{-k}(φ).

4. Termination. Repeat step 3 until convergence of the approximate posterior distribution g(φ). Given this approximation, define an approximate joint distribution g(α_1, ..., α_K, φ) ∝ g(φ) ∏_{k=1}^K p(y_k|α_k, φ) p(α_k|φ).
The computational advantage of this algorithm is that the local parameters are partitioned. For example, suppose we have a model with 100 data points in each of 3000 groups, 2 local parameters per group (a varying slope and intercept) and, say, 20 shared parameters (including fixed effects and hyperparameters). If we then divide the problem into K = 3000 pieces, we have reduced a 300,000 × 6020 problem to 3000 parallel 100 × 22 problems. To the extent that computation costs are proportional to sample size multiplied by the number of parameters, this is a big win.

3.2. Example: a hierarchical nonlinear regression
We illustrate with an example of a hierarchical regression model from a current research project in astronomy.
Researchers are interested in the relation between fluxes of two different bands of electromagnetic radiation, which are measured as y_i and x_i, respectively, at a large number of locations i. The sky is divided into grid boxes j = 1, ..., J, and in each grid box j we want to fit a nonlinear regression with likelihood p(y_j | a_j, γ), where y_j is the vector of data points in grid box j, a_j is a vector of parameters (for example, regression coefficients) corresponding to grid box j, and γ is some set of shared parameters (for example, regression coefficients, scale parameters, and shape parameters) that are constant in the model and do not vary spatially. The local parameter vectors a_j are modeled as independent draws from a multivariate normal distribution, a_j ~ N(M, S). Finally, for computational reasons the problem is partitioned into K < J pieces, so that each piece contains some number of grid boxes. For example, we might have 10^9 data points and J = 10^6 grid boxes, with the problem divided into K = 10^3 pieces.
In the notation of section 3.1, the set of shared parameters is φ = (γ, M, S) and, within each piece k, the set of local parameters is α_k = {a_j, for all j in piece k}. Performing full Bayes or even computing a Bayesian point estimate on this model would be costly if the number of data points and the number of grid boxes are large. Our EP formulation breaks down the problem into K smaller problems, each with only 1/K as many data points and roughly 1/K as many parameters.
We assume a multivariate normal approximate distribution on φ (actually, we would transform the parameters to be unconstrained, thus working on the scale of the logarithms of any scale parameters and an appropriately transformed Cholesky decomposition of the covariance matrix S; see Stan Development Team, 2014, section 53.9). We write the approximate distribution as:

\[
g(\phi) = \prod_{k=1}^{K} g_k(\phi) = \prod_{k=1}^{K} \mathrm{N}(\phi \mid \mu_k, \Sigma_k).
\]

The μ_k's and Σ_k's are updated during the steps of the EP-like algorithm until convergence.
For simplicity we assume a flat prior on the unconstrained transformed components of φ, and we correspondingly set g_0(φ) to a constant; Appendix A describes an algorithm with informative priors. We start with weak approximations such as setting, for each k, μ_k = 0 and Σ_k = 10²K·I, so that the initial setting for g(φ) is normal with mean 0 and scale equal to 10 times the identity matrix.
The EP-like iterations (step 3 of the algorithm in section 3.1) then proceed as follows:

3. For k = 1, ..., K (in serial or parallel):

(a) The cavity distribution is g_{-k}(φ) ∝ ∏_{k'≠k} N(φ | μ_{k'}, Σ_{k'}) = N(φ | μ_{-k}, Σ_{-k}), where

\[
\Sigma_{-k}^{-1} = \sum_{k' \neq k} \Sigma_{k'}^{-1},
\qquad
\mu_{-k} = \Sigma_{-k} \sum_{k' \neq k} \Sigma_{k'}^{-1} \mu_{k'}.
\]

(b) The tilted distribution is

\[
g_{\backslash k}(\alpha_k, \phi) \propto g_{-k}(\phi)\, p(\alpha_k, y_k \mid \phi)
= \mathrm{N}(\phi \mid \mu_{-k}, \Sigma_{-k}) \prod_{j \in \text{piece } k} \mathrm{N}(a_j \mid M, S)\, p(y_j \mid a_j, \gamma).
\]

We can feed this distribution into Stan: it is simply the likelihood for the data in piece k, the priors for the local parameters a_j within that piece, and a normal prior on the shared parameters φ determined by the cavity distribution. We can then directly approximate g_{\k}(α_k, φ) by a normal distribution (perhaps using a Laplace approximation, perhaps using some quadrature or importance sampling to better locate the mean and variance of this tilted distribution), or else simply take some number of posterior draws. Or a more efficient computation may be possible using importance weighting of earlier sets of simulations at this node, as we discuss in section 4.6. In any case we can get a normal approximation to the marginal tilted distribution of the shared parameters, g_{\k}(φ). Call this approximation N(φ | μ_{\k}, Σ_{\k}).
(c) The updated approximate site distribution is g_k^new(φ) = N(φ | μ_k^new, Σ_k^new), set so that g_k^new(φ)g_{-k}(φ) approximates g_{\k}(φ). It would be most natural to do this by matching moments, but it is possible that the differencing step could result in a non-positive-definite covariance matrix Σ_k^new, so some more complicated step might be needed, as discussed in section 4.7.

(d) We can then perform the appropriate parallel or serial update.
As this example illustrates, the core computation is the inference on the local tilted distribution in step (b), and this is where we are taking advantage of the partitioning, both in the reduction of the data and in the reduced dimensionality of the computation, as all that is required is inference on the shared parameters φ and the local parameters α_k, with the other K − 1 sets of local parameters being irrelevant in this step.
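As a concrete illustration of the simulation-based version of step (b), the conversion from tilted-distribution draws to the normal approximation N(φ | μ_{\k}, Σ_{\k}) could look like the following Python sketch; the draws are assumed to come from Stan or any other sampler run on the tilted posterior, and only their φ components are used.

```python
import numpy as np

def tilted_phi_moments(phi_draws):
    """Normal approximation to the marginal tilted distribution of the shared
    parameters phi, given posterior draws of phi from the tilted distribution.

    phi_draws: array of shape (n_draws, d) containing only the phi components
               of the (alpha_k, phi) draws; the local parameters are discarded.
    """
    mu = phi_draws.mean(axis=0)
    # rowvar=False: rows are draws, columns are parameters.
    Sigma = np.cov(phi_draws, rowvar=False)
    return mu, Sigma

# Example with synthetic draws standing in for sampler output:
rng = np.random.default_rng(0)
fake_draws = rng.normal(size=(2000, 3))
mu_tilt, Sigma_tilt = tilted_phi_moments(fake_draws)
```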
3.3. Posterior inference for the approximate joint distribution
Once convergence has been reached for the approximate distribution g(φ), we approximate the joint posterior distribution by g(α_1, ..., α_K, φ) = g(φ) ∏_{k=1}^K p(y_k|α_k, φ)p(α_k|φ). We can work with this expression in one of two ways, making use of the ability to simulate the model in Stan (or otherwise) at each node.
First, if all that is required are the separate marginal posterior distributions for each α_k vector, we can take these directly from the simulations or approximations performed in step 3b of the algorithm, using the last iteration, which can be assumed to be at approximate convergence. This will give simulations of α_k or an approximation to p(α_k | y).
These separate simulations will not be in phase, however, in the sense that different nodes will be simulating different values of . To get draws from the approximate joint distribution of all the parameters, one can first take some number of draws of the shared parameters from the EP approximation, g, and then, for each draw, run K parallel processes of Stan to perform inference for each k, conditional on the sampled value of . This computation is potentially expensivefor example, to perform it using 100 random draws of would require 100 separate Stan runsbut, on the plus side, each run should converge fast because it is conditional on the hyperparameters of the model. In addition, it may ultimately be possible to use adiabatic Monte Carlo Betancourt, 2014 to perform this ensemble of simulations more efficiently.
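A sketch of this two-stage simulation scheme in Python; `sample_local` is a hypothetical stand-in for a per-partition inference (for example, a short Stan run for α_k conditional on φ and y_k), and is not part of the paper's implementation.

```python
import numpy as np

def joint_draws(mu, Sigma, sample_local, K, n_draws=100, seed=1):
    """Draw from the approximate joint distribution of (alpha_1..alpha_K, phi).

    mu, Sigma: mean and covariance of the EP approximation g(phi).
    sample_local(k, phi): returns one draw of alpha_k from p(alpha_k | phi, y_k),
        e.g. by running a short MCMC conditional on the sampled phi.
    """
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        phi = rng.multivariate_normal(mu, Sigma)
        # The K local inferences are independent given phi and can run in parallel.
        alphas = [sample_local(k, phi) for k in range(K)]
        draws.append((phi, alphas))
    return draws
```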
4. Open issues in implementation
4.1. Algorithmic considerations
Implementing an EP-like algorithm involves several decisions:

Algorithmic blocking. This includes parallelization and partitioning of the likelihood. In the present paper we assume a model with independent data points conditional on its parameters, so the likelihood can be factored. The number of subsets K will be driven by computational considerations. If K is too high, the Gaussian approximation will not be so accurate. However, if K is low, then the computational gains will be small. For large problems it could make sense to choose K iteratively, for example setting it to a high value and then decreasing it if the approximation seems too poor. Due to the structure of modern computer memory, the computation using small blocks may get additional speedup if most of the memory accesses can be made using fast but small cache memory.
The parametric form of the approximating distributions g_k(θ). We will use the multivariate normal. For simplicity we will also assume that the prior distribution p_0(θ) is multivariate normal, as is the case in many practical applications, sometimes after proper reparameterization (such as with constrained parameter spaces). Otherwise, one may treat the prior as an extra site which will also be iteratively approximated by some Gaussian density g_0(θ). In that case, some extra care is required regarding the initialization of g_0(θ).
The starting distributions g_k(θ). We will use a broad but proper distribution factored into K equal parts, for example setting each g_k(θ) = N(0, KA²·I), where A is some large value (for example, if the elements of θ are roughly scaled to be of order 1, we might set A = 10).
The algorithm to perform inference on the tilted distribution. We will consider three options: deterministic mode-based approximation in section 4.3, deterministic improvements of mode-based approximations in section 4.4, and Bayesian simulation in section 4.5.
The approximation to the tilted distribution. For mode-based approximations, we can calculate the first two moments of the fitted normal or other approximate density; see section 4.4. Or if the tilted distribution is computed by Monte Carlo methods, we can compute the first two moments of the simulation, possibly with some regularization if θ is of high dimension; see section 4.5.
The division by the cavity distribution in step 3c can yield a non-positive-definite variance matrix, which would correspond to a meaningless update for g_k. We can stabilize this by setting negative eigenvalues to small positive values or perhaps using some more clever method. We discuss this in section 4.7.
In a hierarchical setting, the model can be fit using the nested EP approach (Riihimäki et al., 2013). Assuming that the likelihood term associated with each data partition is approximated with a local Gaussian site function g_k(α_k, φ), the marginal approximations for φ and all α_k could be estimated without forming the potentially high-dimensional joint approximation of all unknowns. At each iteration, first the K local site approximations could be updated in parallel with a fixed marginal approximation g(φ). Then g(φ) could be refreshed by marginalizing over each α_k separately using the new site approximations and combining the marginalized site approximations. The parallel EP approximations for each data partition correspond to the inner EP approximations, and inference for g(φ) corresponds to the outer EP.
4.2. Approximating the tilted distribution
In standard EP, the tilted distribution approximation in step 3c is a moment matching problem: when using the multivariate normal family, one chooses the site g_k so that the first and second moments of g_k(θ)g_{-k}(θ) match those of the intractable tilted distribution g_{\k}(θ). When used for Gaussian processes, this approach has the particular advantage that the tilted distribution g_{\k}(θ) can typically be set up as a univariate distribution over only a single dimension in θ. This implicit dimension reduction implies that the tilted distribution approximation can be performed analytically (e.g., Minka, 2001) or relatively quickly using one-dimensional quadrature (e.g., Zoeter and Heskes, 2005). In higher dimensions, quadrature gets computationally more expensive or, with a reduced number of evaluation points, the accuracy of the moment computations gets worse. Seeger (2004) estimated the tilted moments in multiclass classification using multidimensional quadratures. Without the possibility of dimension reduction in the more general case, there is no easy way to compute or approximate the integrals needed for the required moments over the full parameter space. Accordingly, black-box EP would seem to be impossible.
To move towards a black-box EP-like algorithm, we can change this moment matching choice and instead match the mode or use numerical simulations. The resulting EP-like algorithms critically preserve the essential idea that the local pieces of data are analyzed at each step in the context of a full posterior approximation. These alternative choices for a tilted approximation generally make EP less stable, and we consider methods for fast and accurate approximations and for stabilizing the computations in EP updates.
Smola et al. (2004) proposed Laplace propagation, where moment matching is replaced with a Laplace approximation, so that the tilted mean is replaced with the mode of the tilted distribution and the tilted covariance with the inverse Hessian of the log density at the tilted mode. The proof presented by Smola et al. (2004) suggests that a fixed point of Laplace propagation corresponds to a local mode of the joint model and hence also to one possible Laplace approximation. Therefore, with the Laplace approximation, a message passing algorithm based on local approximations corresponds to the global solution. Smola et al. (2004) were able to get useful results with tilted distributions in several hundred dimensions. Riihimäki et al. (2013) presented a nested EP method, where moments of the multivariate tilted distribution are also estimated using EP. For certain model types the inner EP can be computed efficiently.
In this paper, we also consider MCMC computation of the moments, which we suspect will give inaccurate moment estimates, but may work better than, or as a supplement to, Laplace approximation for skewed distributions. We also propose an importance sampling scheme to allow stable estimates based on only a moderate number of simulation draws.
When the moment computations are not accurate, EP may have stability issues, as discussed by Jylänki et al. (2011). Even with one-dimensional tilted distributions, moment computations are more challenging if the tilted distribution is multimodal or has long tails. Fractional EP (Minka, 2004) is an extension of EP which can be used to improve the robustness of the algorithm when the approximation family is not flexible enough (Minka, 2005) or when the propagation of information is difficult due to vague prior information (Seeger, 2008). Fractional EP can be viewed as a method for minimizing the α-divergence, with α = 1 corresponding to the Kullback-Leibler divergence used in EP, α = 0 corresponding to the reverse Kullback-Leibler divergence usually used in variational Bayes, and α = 0.5 corresponding to the Hellinger distance. Ideas from fractional EP might help to stabilize EP-like algorithms that use approximate moments, as α-divergence with α < 1 is less sensitive to errors in the tails of the approximation.
Section 4.1 details these and other considerations for tilted approximations in our EP-like algorithm framework. While the tilted approximation is the key step in any EP-like algorithm, there are two other canonical issues that must be considered in any EP-like approach: lack of convergence guarantees and the possible computational instability of the iterations themselves. Section 4.1 also considers approaches to handle these instabilities.

4.3. Normal approximation based on deterministic computations
The simplest EP-like algorithms are deterministic and at each step construct an approximation of the tilted distribution around its mode. As Figure 2 illustrates, the tilted distribution can have a well-identified mode even if the factor of the likelihood does not.
The most basic approximation is obtained by, at each step, taking the approximation of the tilted distribution to be the multivariate normal distribution centered at the mode of g_{\k}(θ), with covariance matrix equal to the inverse of the negative second derivative matrix of log g_{\k}(θ) at the mode. This corresponds to the Laplace approximation (Smola et al., 2004). We assume any parameters with restricted range have been transformed to unrestricted spaces (in Stan, this is done via logarithms of all-positive parameters, logits of interval-constrained parameters, and special transforms for simplex constraints and covariance matrices; see Stan Development Team, 2014, section 53.9), so the gradient of g_{\k}(θ) will necessarily be zero at any mode.
The presence of the cavity distribution as a prior (as illustrated in Figure 2) gives two computational advantages to this algorithm. First, we can use the prior mean as a starting point for the algorithm; second, the use of the prior ensures that at least one mode of the tilted distribution will exist.
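As an illustration, a bare-bones version of this mode-based step could be written as follows in Python; the log-likelihood of partition k is passed in as a function, the parameters are assumed to be already transformed to an unconstrained space, and the curvature is taken from the optimizer's approximate inverse Hessian rather than an exact or automatic derivative, which a careful implementation would prefer.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_tilted(log_lik_k, mu_cav, Sigma_cav):
    """Laplace approximation to the tilted distribution
    proportional to p(y_k | theta) * N(theta | mu_cav, Sigma_cav)."""
    Q_cav = np.linalg.inv(Sigma_cav)

    def neg_log_tilted(theta):
        resid = theta - mu_cav
        return -log_lik_k(theta) + 0.5 * resid @ Q_cav @ resid

    # The cavity mean is a natural starting point (see the text above).
    opt = minimize(neg_log_tilted, x0=mu_cav, method="BFGS")
    mu_tilt = opt.x
    # BFGS supplies an approximate inverse Hessian; a production version would
    # compute the exact Hessian of neg_log_tilted at the mode instead.
    Sigma_tilt = opt.hess_inv
    return mu_tilt, Sigma_tilt
```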
4.4. Split-normal and split-t approximations
After computing the mode and the curvature at the mode, we can evaluate the tilted distribution at a finite number of points around the mode and use this to construct a better approximation that captures aspects of asymmetry and long tails in the posterior distribution. Possible approximate families include the multivariate split-normal (Geweke, 1989, Villani and Larsson, 2006), split-t, or wedge-gamma (Gelman et al., 2014) distributions.
We are not talking about changing the family of approximate distributions g; we would still keep these as multivariate normal. Rather, we would use an adaptively-constructed parametric approximation, possibly further improved by importance sampling (Geweke, 1989) or CCD integration (Rue et al., 2009), to get a better approximation to the mean and covariance matrix of the tilted distribution to use in constructing g_k in step 3c of the algorithm.
4.5. Estimating moments using simulation
A different approach is, at each step, to use simulations (for example, Hamiltonian Monte Carlo using Stan) to approximate the tilted distribution and then use these to set the mean and covariance of the approximation in step 3c. As above, the advantage of the EP-like algorithm here is that the computation only uses a fraction 1/K of the data, along with a simple multivariate normal prior that comes from the cavity distribution. Similar to other inference methods, such as stochastic variational inference (Hoffman et al., 2013), which take steps based on stochastic estimates, EP update steps need unbiased estimates. When working with the normal approximation, we then need unbiased estimates of the mean and precision in (1). The variance of the estimates can be reduced, while preserving unbiasedness, by using control variates.
4.6. Reusing simulations with importance weighting
In serial or parallel EP, samples from previous iterations can be reused as starting points for either Markov chains or in importance sampling. We discuss briefly the latter.
Assume we have obtained, at iteration t and for node k, a set of posterior simulation draws θ^s_{t,k}, s = 1, ..., S_{t,k}, from the tilted distribution g^t_{\k}, possibly with weights w^s_{t,k}; take w^s_{t,k} = 1 for an unweighted sample. To progress to node k+1, reweight these simulations as w^s_{t,k+1} = w^s_{t,k} · g^t_{\(k+1)}(θ^s_{t,k}) / g^t_{\k}(θ^s_{t,k}). If the effective sample size (ESS) of the new weights,

\[
\mathrm{ESS} = \frac{\left(\frac{1}{S}\sum_{s=1}^{S} w^s_{t,k+1}\right)^2}{\frac{1}{S}\sum_{s=1}^{S} \left(w^s_{t,k+1}\right)^2},
\]

is large enough, keep this sample, setting θ^s_{t,k+1} = θ^s_{t,k}. Otherwise throw it away, simulate new θ^s_{t,k+1} from g^t_{\(k+1)}, and reset the weights w^s_{t,k+1} to 1. This basic approach was used in the EP-ABC algorithm of Barthelmé and Chopin (2014).
Elaborations could be considered. Instead of throwing away a sample with too low an ESS, one could move the draws through several MCMC steps, in the spirit of sequential Monte Carlo (Del Moral et al., 2006).
Another approach, which can be used in serial or parallel EP, is adaptive multiple importance sampling (Cornuet et al., 2012), which would make it possible to recycle the simulations from previous iterations.
Even the basic strategy should provide important savings when EP is close to convergence. Then changes in the tilted distribution become small, and as a result the variance of the importance weights should be small as well. In practice, this means that the last EP iterations should essentially come for free.
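A sketch of the reweighting step in Python; `log_p_old` and `log_p_new` are assumed to evaluate the (unnormalized) log densities of the old and new tilted distributions, and the ESS threshold is an arbitrary choice of ours.

```python
import numpy as np

def reweight_draws(draws, log_w, log_p_old, log_p_new, ess_threshold=0.5):
    """Reuse draws from one tilted distribution as an importance sample for the
    next one (section 4.6).

    draws:      array (S, d) of stored draws of the shared parameters.
    log_w:      array (S,) of current log importance weights (zeros if unweighted).
    log_p_old:  function returning the log density (up to a constant) of the
                tilted distribution the draws were generated from, over all draws.
    log_p_new:  same, for the new tilted distribution we want to target.
    """
    log_w_new = log_w + log_p_new(draws) - log_p_old(draws)
    w = np.exp(log_w_new - log_w_new.max())          # stabilize before exponentiating
    ess = (w.mean() ** 2) / np.mean(w ** 2)          # normalized ESS in (0, 1]
    if ess >= ess_threshold:
        return log_w_new, True                       # keep the recycled sample
    return np.zeros(len(draws)), False               # discard; fresh draws needed
```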
4.7. Keeping the covariance matrix positive definite
When working with the normal approximation, step 3c of the EP-like algorithm can be conveniently written in terms of the natural parameters of the exponential family:

\[
\left(\Sigma_k^{\rm new}\right)^{-1}\mu_k^{\rm new} = \left(\Sigma^{\rm new}\right)^{-1}\mu^{\rm new} - \Sigma_{-k}^{-1}\mu_{-k},
\qquad
\left(\Sigma_k^{\rm new}\right)^{-1} = \left(\Sigma^{\rm new}\right)^{-1} - \Sigma_{-k}^{-1},
\tag{1}
\]

where μ^new and Σ^new are the approximate mean and covariance matrix of the tilted distribution, and these are being used to determine μ_k^new and Σ_k^new, the mean and variance of the updated site distribution g_k. The difficulty is that the difference between two positive-definite matrices is not itself necessarily positive definite. There are various tricks in the literature to handle this problem. One idea is to first perform the subtraction in the second line above, then do an eigendecomposition, keeping the eigenvectors of (Σ_k^new)^{-1} but taking any negative eigenvalues and replacing them with small positive numbers, as in the SoftAbs map of Betancourt (2013). It is not clear whether some similar adjustment would be needed for the updating of Σ_k^{-1}μ_k or whether the top line above would work as written.
Is it possible to take the positive-definiteness restriction into account in a more principled manner? If our measure of the precision matrix of the tilted distribution is noisy, due to Monte Carlo error, it would make sense to include this noise in our inference and estimate μ_k^new and Σ_k^new statistically via a measurement-error model. The noisy precision of the tilted distribution is the sum of the cavity precision, the precision of the pseudo-observation, and a noise term. Can we infer from this a better estimate of the precision of the pseudo-observation? This would itself be a small Bayesian inference problem, but of low enough dimensionality that it should not represent a large part of the computation cost compared to the other steps of the algorithm.
A simple and effective alternative is to damp each step so that the resulting cavity covariance remains positive definite after each update. This does not change the fixed points of the EP algorithm, and adjusts only the step sizes during the iterations. If the updates are done in a parallel fashion, it is possible to derive a limit on the damping factor or step size so that the cavity covariances at all sites remain positive definite. The downside is that this requires determining one eigendecomposition at each site at each parallel update. Based on our experiments, this takes negligible time compared to the tilted distribution moment computations.
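As one concrete possibility (a sketch of the eigenvalue-clipping idea, not the paper's implementation), the site update with a positive-definiteness repair could look like this in Python; `min_eig` is an arbitrary small constant.

```python
import numpy as np

def site_update_psd(mu_tilt, Sigma_tilt, r_cav, Q_cav, min_eig=1e-6):
    """Compute new site natural parameters (r_k, Q_k) from tilted moments,
    clipping negative eigenvalues so the site precision stays positive
    definite (cf. the SoftAbs idea of Betancourt, 2013)."""
    Q_tilt = np.linalg.inv(Sigma_tilt)
    r_tilt = Q_tilt @ mu_tilt

    Q_site = Q_tilt - Q_cav                 # may fail to be positive definite
    eigval, eigvec = np.linalg.eigh(Q_site)
    eigval = np.maximum(eigval, min_eig)    # replace negative eigenvalues
    Q_site = eigvec @ np.diag(eigval) @ eigvec.T

    r_site = r_tilt - r_cav                 # left unadjusted, as in the text
    return r_site, Q_site
```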
4.8. The computational opportunity of parallel EP-like algorithms
We have claimed that EP-like algorithms offer computational gains for large inference problems by splitting the data into pieces and performing inference on each of those pieces in parallel, occasionally sharing information between the pieces. Here we detail those benefits specifically.
Consider the simple, non-hierarchical implementation (section 2) with a multivariate normal approximating family. We assume that we have K + 1 parallel processing units: one central processor that maintains the global posterior approximation g(θ), and K worker units on which inference can be computed for each of the K factors of the likelihood. Furthermore, we assume a network transmission cost of c per parameter. Let N be the number of data points and let D be the number of parameters, that is, the length of the vector θ. Finally, we define h(n, d) as the computational cost of approximating the tilted distribution (a task which may, for example, be performed by running MCMC) with n data points and d parameters.
Each step of the algorithm then incurs the following costs:
1. Partitioning. This loading and caching step will in general have immaterial cost.
2. Initialization. The central unit initializes the site approximations g_k(θ), which by construction are multivariate normal. In the general case each of the K sites will require D + D(D+1)/2 scalar quantities corresponding to the mean and covariance. Thus the central unit bears an initialization cost of O(KD²).
3. EP-like iteration. Let m be the number of iterations over all K sites. Empirically m is typically a manageable quantity; however, numerical instabilities tend to increase this number. In parallel EP, damped updates are often used to avoid oscillation (van Gerven et al., 2009).
(a) Computing the cavity distribution. Owing to our multivariate normal approximating family, this step involves only a simple rank-D matrix operation per site, costing O(D³) with a small constant; see Cunningham, Hennig, and Lacoste-Julien (2011). One key choice is whether to perform this step at the central unit or at the worker units. If we compute the cavity distributions at each worker unit, the central unit must first transmit the full posterior to all K worker units, costing O(cND²) for cost c per network operation. In parallel, the cavity computations then incur a total cost of O(D³). On the other hand, small D implies that central cavity computations are preferred, requiring O(KD³) to construct K cavity distributions centrally, with a smaller subsequent distribution cost of O(cKD²). Accordingly, the total cost per EP iteration is O(min(cND² + D³, cKD² + KD³)). We presume any computation constant is much smaller than the network transmission constant c, and thus in the small-D regime this step should be borne by the central unit, a choice strengthened by the presumed burden of step 3c on the worker units.

(b) Forming the tilted distribution. This conceptual step bears no cost.

(c) Fitting an updated local approximation g_k(θ). We must estimate O(D²) parameters. More critical in the large-data setting is the cost of computing the log-likelihoods. In the best case, for example if the likelihoods belong to the same exponential family, we need only calculate a statistic on the data, with cost O(N/K). In some rare cases the desired moment calculation will be analytically tractable, which results in a minimum cost of O(D² + N/K). Absent analytical moments, we might choose a modal approximation (e.g., Laplace propagation), which may typically incur an O(D³) term. More common still, MCMC or another quadrature approach over the D-dimensional site parameter space will be more costly still: h(N/K, D) ≫ D². Furthermore, a likelihood more complicated than the exponential family, especially a multimodal p(y_k|θ) such as a mixture likelihood, will significantly influence numerical integration. Accordingly, in that common case, this step costs O(h(N/K, D)) ≥ O(D² + N/K). Critically, our setup parallelizes this computation, and thus the factor K is absent.

(d) Return the updated g_k(θ) to the central processor. This cost repeats the cost and consideration of step 3a.

(e) Update the global approximation g(θ). In usual parallel EP, g(θ) is updated once after all site updates. However, if the cost h of approximating the posterior distribution is variable across worker units (for example, in an MCMC scheme), the central unit could update g(θ) whenever possible or according to a schedule.
Considering only the dominating terms, across all these steps and the m EP iterations, we have the total cost of our parallel, EP-like algorithm:

\[
O\bigl(m\bigl[\min(cND^2 + D^3,\; cKD^2 + KD^3) + h(N/K, D)\bigr]\bigr).
\]

This cost contains a term due to Gaussian operations and a term due to parallelized tilted approximations. By comparison, consider first the cost of a non-parallel EP-like algorithm:

\[
O\bigl(m\bigl[ND^3 + N\,h(N/K, D)\bigr]\bigr).
\]

Second, consider the cost of full numerical quadrature, with no EP-like partitioning:

\[
O\bigl(h(N, D)\bigr).
\]

With these three expressions, we can immediately see the computational benefits of our scheme. In many cases, numerical integration will be by far the most costly operation, and will depend superlinearly on its arguments. Thus, our parallel EP-like scheme will dominate. As the total data size N grows large, our scheme becomes essential. When data is particularly big (e.g., N = 10^9), our scheme will dominate even in the rare case that h is its minimal O(D² + N/K) (see step 3c above).
5. Experiments
5.1. Implementation in Stan
We code our computations in R and Stan. Whether using point estimation for Laplace approximations or HMC for simulations from the tilted distribution, we write a Stan program that includes one portion of the likelihood and an expression for the cavity distribution. We then run this model repeatedly with different inputs for each subset k. This ensures that only one part of the likelihood gets computed at a time by the separate processes, but it does have the cost that separate Stan code is needed to implement the EP computations. In future software development we would like to be able to take an existing Stan model and merely overlay a factorization so that the EP-like algorithm could be applied directly to the model.

Figure 4: For simulated logistic regression with 50 parameters and N = 2500 data points in J = 50 groups divided into K = 50 pieces: (a) the left plots show the convergence of the EP approximation for the shared parameters (mean and standard deviation); (b) the right plot compares the EP inferences for the shared parameters (estimated values ± 3 sigmas) to their true values. Convergence of the algorithm is nearly instantaneous, illustrating the effectiveness of dispersed computation in this example.
We use Stan to compute step 3b of the EP-like algorithm, the inference for the tilted distribution for each process. Currently we perform the other steps in Matlab, Python, or R; code will be made available at https://github.com/gelman/epstan. In parallel EP, we pass the normal approximations g_k back and forth between the master node and the K separate nodes.
Currently Stan performs adaptation each time it runs, but future versions should allow restarting from the previous state, which will speed up computation substantially once the EP algorithm starts to converge. We should also be able to approximate the expectations more efficiently using importance sampling.
5.2. Hierarchical logistic regression
We test the algorithm with a simple hierarchical logistic regression,

\[
y_i \sim \mathrm{Bernoulli}\!\left(\mathrm{logit}^{-1}(\alpha_{j(i)} + x_i \beta)\right),
\qquad
\alpha_j \sim \mathrm{N}(0, \tau^2),
\]

where β is a vector of common coefficients and each group has its own intercept α_j with a hierarchical prior. The data were simulated using x_{ij} ~ N(0, 1), α_j ~ N(0, 2²), and β ~ N(0, 1).
We first perform a simulation experiment with 50 regression predictors, the number of groups J = 50, the number of data points in each group N_j = 50, and thus a total sample size of N = 2500. For simplicity we partition the data as one piece per group; that is, K = J. The Stan runs had 50 warmup and 200 iterations per chain.
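For readers who want to reproduce a dataset of this shape, here is a Python sketch of the simulation just described; the function and variable names are ours, not taken from the paper's code.

```python
import numpy as np

def simulate_hierarchical_logistic(J=50, n_per_group=50, n_pred=50, seed=0):
    """Simulate data from the hierarchical logistic model of section 5.2:
    y_i ~ Bernoulli(logit^{-1}(alpha_{j(i)} + x_i beta)), alpha_j ~ N(0, 2^2).
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, 1.0, size=n_pred)          # shared coefficients
    alpha = rng.normal(0.0, 2.0, size=J)              # group intercepts
    group = np.repeat(np.arange(J), n_per_group)      # group index j(i)
    x = rng.normal(size=(J * n_per_group, n_pred))
    eta = x @ beta + alpha[group]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return x, y, group, beta, alpha
```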
If we were to use a simple scheme of data splitting and separate inferences, without using the cavity distribution as an effective prior distribution at each step, the computation would be problematic: with only 50 data points per group, each of the local posterior distributions would be very wide (as sketched in Figure 1). The EP-like framework, in which at each step the cavity distribution is used as a prior, keeps computations more stable and focused.

Figure 5: For simulated logistic regression with 50 parameters and N = 50,000 data points in J = 1000 groups divided into K = 50 pieces: (a) the left plots show the convergence of the EP approximation for the shared parameters (mean and standard deviation); (b) the right plot compares the EP inferences for the shared parameters (estimated values ± 3 sigmas) to their true values. With such a large dataset the normal approximation is no problem; we implemented a simulation of this size to begin to check the scalability of the algorithm.
Figure 4 shows the results: convergence is essentially immediate, and the algorithm ran fast. We have not yet performed runtime comparisons because our implementation has many places where we can increase its speed, but it is reassuring that the computations work well in this very simple and direct MCMC implementation of our algorithm.
In addition to checking the algorithm's convergence (left plots in Figure 4), we also check the coverage of posterior inferences from the simulations (right plot). If we were performing full Bayesian inference this coverage check would not be necessary (except possibly for catching bugs), but because the EP-like algorithm is approximate, it makes sense to check coverage in this fake-data simulation. As the graph shows, the coverage is fine, with all 50 estimates within 3 standard errors of the true values, the vast majority within 2 standard errors, and about two-thirds within 1 standard error of the truth. For a fuller evaluation of the algorithm it would be a good idea to study this sort of coverage more thoroughly; here we just wanted to get a rough sense that the results made sense and that the approximating distribution was not clearly too narrow or too broad.
Figure 5 shows results for a new set of simulations, this time again with 50 regression predictors and K = 50 partitions, but with J = 1000 groups and N_j = 50 observations per group, so that the data are partitioned via cluster sampling into 50 partitions with 20 groups and 1000 data points in each partition. By increasing the size of the problem, we are improving the normal approximation but also beginning to test the scalability of the algorithm. We plan to do more systematic scalability testing once we have increased the efficiency of some of our computational steps.
The experiments show that the algorithm works and is stable even with the simplest moment computation using MCMC. All the above experiments were run on a desktop computer without any actual parallelization. The additional modifications suggested in section 4 should further speed up the moment computations and the convergence.

6. Discussion
In this paper we have presented a framework for using the EP idea of cavity and tilted distributions to perform Bayesian inference on partitioned datasets. Particularly in the case of hierarchical models, this structure can allow efficient dispersed computation for big models fit to big data.
6.1. Marginal likelihood
Although not the focus of this work, we mention in passing that EP also offers at no extra cost an approximation of the marginal likelihood, p(y) = ∫ p_0(θ)p(y|θ)dθ. This quantity is often used in model choice.
To this end, associate to each approximating site g_k a constant Z_k, and write the global approximation as:

\[
g(\theta) = p_0(\theta) \prod_{k=1}^{K} \frac{1}{Z_k}\, g_k(\theta).
\]

Consider the Gaussian case, for the sake of simplicity, so that g_k(θ) = exp(−½ θᵀQ_kθ + r_kᵀθ) under natural parameterization, and denote by Φ(r_k, Q_k) the corresponding (log) normalizing constant:

\[
\Phi(r_k, Q_k) = \log \int e^{-\frac{1}{2}\theta^{\rm T} Q_k\theta + r_k^{\rm T}\theta}\, d\theta
= -\frac{1}{2}\left(\log\left|\frac{Q_k}{2\pi}\right| - r_k^{\rm T} Q_k^{-1} r_k\right).
\]

Simple calculations (Seeger, 2005) then lead to the following formula for the update of Z_k at site k,

\[
\log Z_k = \log \tilde{Z}_k - \Phi(r, Q) + \Phi(r_{-k}, Q_{-k}),
\]

where Z̃_k is the normalizing constant of the tilted distribution g_{\k}, (r, Q) is the natural parameter of g, r = Σ_{k=1}^K r_k, Q = Σ_{k=1}^K Q_k, r_{-k} = Σ_{j≠k} r_j, and Q_{-k} = Σ_{j≠k} Q_j. In the deterministic approaches we have discussed for approximating moments of g_{\k}, it is straightforward to obtain an approximation of the normalizing constant; when simulation is used, some extra effort may be required, as in Chib (1995).
Finally, after completion of EP, one should return the following quantity,

\[
\sum_{k=1}^{K} \log Z_k + \Phi(r, Q) - \Phi(r_0, Q_0),
\]

as the EP approximation of log p(y), where (r_0, Q_0) is the natural parameter of the prior.

6.2. Different families of approximate distributions
We can place the EP approximation, the tilted distributions, and the target distribution on different rungs of a ladder:
g(θ) = p_0(θ) ∏_{k=1}^K g_k(θ), the EP approximation;
For any k: g_{\k}(θ) = g(θ) p(y_k|θ) / g_k(θ), the tilted distribution;
For any k_1, k_2: g_{\k_1,k_2}(θ) = g(θ) p(y_{k_1}|θ)p(y_{k_2}|θ) / (g_{k_1}(θ)g_{k_2}(θ)), which we might call the tilted² distribution;
For any k_1, k_2, k_3: g_{\k_1,k_2,k_3}(θ) = g(θ) p(y_{k_1}|θ)p(y_{k_2}|θ)p(y_{k_3}|θ) / (g_{k_1}(θ)g_{k_2}(θ)g_{k_3}(θ)), the tilted³ distribution;
...
p(θ|y) ∝ p_0(θ) ∏_{k=1}^K p(y_k|θ), the target distribution, which is also the tilted^K distribution.
Instead of independent groups, tree structures could also be used (Minka, 2001). As presented, the goal of an EP-like algorithm is to determine g. But at each step we get simulations from all the g_{\k}'s, so we could try to combine these inferences in some way (for example, following the ideas of Wang and Dunson, 2013). Even something as simple as mixing the simulation draws from the tilted distributions could be a reasonable improvement on the EP approximation. One could then go further, for example at convergence computing simulations from some of the tilted² distributions.
Another direction is to compare the EP approximation with the tilted distribution, for example by computing a Kullback-Leibler divergence or looking at the distribution of importance weights. Again, we can compute all these densities analytically, we have simulations from the tilted distributions, and we can trivially draw simulations from the EP approximation, so all these things are possible.
6.3. Connections to other approaches
The EP-like algorithm can be combined with other approaches to data partitioning. In the present paper we have focused on the construction of the approximate densities g_k with the goal of simply multiplying them together to get the final approximation g(θ) = p_0(θ) ∏_{k=1}^K g_k(θ), but one could instead think of the cavity distributions g_{-k} at the final iteration as separate priors, and then follow the ideas of Wang and Dunson (2013).
Data partitioning is an extremely active research area right now, with several different black-box algorithms being proposed by various research groups. We are sure that different methods will be more effective in different problems. The present paper has two roles: we are presenting a particular black-box algorithm for distributional approximation, and we are suggesting a general strategy for approaching data-partitioning problems. We anticipate that great progress could be made by using EP ideas to regularize existing algorithms.
References
Ahn, S., Korattikara, A., and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning.
Barthelmé, S., and Chopin, N. (2014). Expectation propagation for likelihood-free inference. Journal of the American Statistical Association 109, 315-333.
Betancourt, M. (2013). A general metric for Riemannian manifold Hamiltonian Monte Carlo. http://arxiv.org/abs/1212.4693
Betancourt, M. (2014). Adiabatic Monte Carlo. http://arxiv.org/pdf/1405.3489.pdf
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 1313-1321.
Cornuet, J. M., Marin, J. M., Mira, A., and Robert, C. P. (2012). Adaptive multiple importance sampling. Scandinavian Journal of Statistics 39, 798-812.
Cseke, B., and Heskes, T. (2011). Approximate marginals in latent Gaussian models. Journal of Machine Learning Research 12, 417-454.
Cunningham, J. P., Hennig, P., and Lacoste-Julien, S. (2011). Gaussian probabilities and expectation propagation. http://arxiv.org/abs/1111.6832
Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B 68, 411-436.
Gelman, A., Bois, F. Y., and Jiang, J. (1996). Physiological pharmacokinetic analysis using population modeling and informative prior distributions. Journal of the American Statistical Association 91, 1400-1412.
Gelman, A., Carpenter, B., Betancourt, M., Brubaker, M., and Vehtari, A. (2014). Computationally efficient maximum likelihood, penalized maximum likelihood, and hierarchical modeling using Stan. Technical report, Department of Statistics, Columbia University.
Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics 2, 1360-1383.
Gershman, S., Hoffman, M., and Blei, D. (2012). Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57, 1317-1339.
Girolami, M., and Zhong, M. (2007). Data integration for classification problems employing Gaussian process priors. In Advances in Neural Information Processing Systems 19, ed. B. Schölkopf, J. Platt, and T. Hoffman, 465-472.
Hoffman, M., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research 14, 1303-1347.
Huang, Z., and Gelman, A. (2005). Sampling for Bayesian computation with large datasets. Technical report, Department of Statistics, Columbia University.
Jylänki, P., Vanhatalo, J., and Vehtari, A. (2011). Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research 12, 3227-3257.
Korattikara, A., Chen, Y., and Welling, M. (2013). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, ed. J. Breese and D. Koller, 362-369.
Minka, T. (2004). Power EP. Technical report, Microsoft Research, Cambridge.
Minka, T. (2005). Divergence measures and message passing. Technical report, Microsoft Research, Cambridge.
Minsker, S., Srivastava, S., Lin, L., and Dunson, D. B. (2014). Robust and scalable Bayes via a median of subset posterior measures. Technical report, Department of Statistical Science, Duke University. http://arxiv.org/pdf/1403.2660.pdf
Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780.
Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, Mass.: MIT Press.
Riihimäki, J., Jylänki, P., and Vehtari, A. (2013). Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood. Journal of Machine Learning Research 14, 75-109.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society B 71, 319-392.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2013). Bayes and big data: The consensus Monte Carlo algorithm. In Bayes 250. http://research.google.com/pubs/pub41849.html
Seeger, M. (2005). Expectation propagation for exponential families. Technical report, Max Planck Institute for Biological Cybernetics, Tübingen, Germany.
Seeger, M. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research 9, 759-813.
Seeger, M., and Jordan, M. (2004). Sparse Gaussian process classification with multiple classes. Technical report, University of California, Berkeley.
Smola, A., Vishwanathan, V., and Eskin, E. (2004). Laplace propagation. In Advances in Neural Information Processing Systems 16, ed. S. Thrun, L. Saul, and B. Schölkopf.
Stan Development Team (2014). Stan modeling language: User's guide and reference manual, version 2.5.0. http://mc-stan.org
van Gerven, M., Cseke, B., Oostenveld, R., and Heskes, T. (2009). Bayesian source localization with the multivariate Laplace prior. In Advances in Neural Information Processing Systems 22, ed. Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1901-1909.
Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14, 1005-1009.
Villani, M., and Larsson, R. (2006). The multivariate split normal distribution and asymmetric principal components analysis. Communications in Statistics: Theory and Methods 35, 1123-1140.
Wang, C., and Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research 14, 899-925.
Wang, X. (2014). Parallel MCMC via Weirstrass sampler. Xi'an's Og, 3 Jan. http://xianblog.wordpress.com/2014/01/03/parallel-mcmc-via-weirstrass-sampler-a-reply-by-xiangyu-wang/
Wang, X., and Dunson, D. B. (2013). Parallel MCMC via Weierstrass sampler. http://arxiv.org/pdf/1312.4605v1.pdf
Zoeter, O., and Heskes, T. (2005). Gaussian quadrature based expectation propagation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, ed. Z. Ghahramani and R. Cowell, 445-452.
19

A. Distributed parallel algorithms
We have pitched this article at a fairly abstract level because we are interested in the general idea of using EP-like algorithms to perform parallel Bayesian computation using data splitting. Ultimately, though, these algorithms are motivated by speed and scalability concerns, so efficient implementation matters. In this appendix we describe the steps we took to implement efficiently the simulations described in section 5.
A.1. Distributed parallel EP-like algorithm
In this appendix we give a practical algorithm description that is suitable for implementing the general EP-like algorithms of sections 2 and 3 in a numerically stable manner. We assume that the posterior distribution,

p(θ | y) = Z^{-1} ∏_{k=1}^K p(y_k | θ) p(θ),

is approximated by

g(θ) = Z_EP^{-1} ∏_{k=1}^K [Z_k^{-1} g(θ; r_k, Q_k)] Z_0^{-1} g(θ; r_0, Q_0) = N(θ | μ, Σ),

where the site approximations g_k(θ) = Z_k^{-1} g(θ; r_k, Q_k) and the prior p(θ) = N(θ | μ_0, Σ_0) = Z_0^{-1} g(θ; r_0, Q_0) are written using the following definitions:

g(θ; r, Q) = exp(-½ θ'Qθ + r'θ)
Φ(r, Q) = log ∫ g(θ; r, Q) dθ = -½ log |Q/(2π)| + ½ r'Q^{-1} r.   (2)

The natural parameters and the normalization of the prior p(θ) are given by r_0 = Σ_0^{-1} μ_0, Q_0 = Σ_0^{-1}, and log Z_0 = Φ(r_0, Q_0). The natural exponential parameters of the posterior approximation g(θ) are obtained by multiplying the prior and the site approximations together, which gives

Q = Σ^{-1} = ∑_{k=1}^K Q_k + Q_0   and   r = Σ^{-1} μ = ∑_{k=1}^K r_k + r_0.   (3)

The approximate posterior mean μ = Q^{-1} r and covariance Σ = Q^{-1} can be computed using a Cholesky or eigenvalue decomposition of Q. One possibility is to initialize the site approximations to g_k ∝ 1 by choosing r_k = 0 and Q_k = 0 for k = 1, ..., K, which is equivalent to initializing g(θ) to the prior, that is, μ = μ_0 and Σ = Σ_0.
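To make the bookkeeping in equations (2)-(3) concrete, here is a minimal NumPy sketch; it is our own illustration, not code from the paper, and the function names (log_norm_const, combine_sites, moments_from_natural) are ours.

    import numpy as np

    def log_norm_const(r, Q):
        # Phi(r, Q) = -0.5*log|Q/(2*pi)| + 0.5 * r' Q^{-1} r, the log-normalizer
        # of the unnormalized Gaussian g(theta; r, Q) = exp(-0.5 theta'Q theta + r'theta).
        d = r.shape[0]
        L = np.linalg.cholesky(Q)                 # raises if Q is not positive definite
        half_logdet = np.sum(np.log(np.diag(L)))  # 0.5 * log|Q|
        alpha = np.linalg.solve(Q, r)             # Q^{-1} r
        return -half_logdet + 0.5 * d * np.log(2 * np.pi) + 0.5 * r @ alpha

    def combine_sites(r_sites, Q_sites, r0, Q0):
        # Equation (3): Q = sum_k Q_k + Q_0, r = sum_k r_k + r_0.
        return r0 + sum(r_sites), Q0 + sum(Q_sites)

    def moments_from_natural(r, Q):
        # mu = Q^{-1} r, Sigma = Q^{-1}; np.linalg.inv is used for clarity,
        # a Cholesky-based solve is preferable numerically.
        Sigma = np.linalg.inv(Q)
        return Sigma @ r, Sigma

    # Initialization: g_k = 1, i.e. r_k = 0 and Q_k = 0, so g equals the prior.
    d, K = 3, 4
    mu0, Sigma0 = np.zeros(d), np.eye(d)
    Q0 = np.linalg.inv(Sigma0)
    r0 = Q0 @ mu0
    r_sites = [np.zeros(d) for _ in range(K)]
    Q_sites = [np.zeros((d, d)) for _ in range(K)]
    r, Q = combine_sites(r_sites, Q_sites, r0, Q0)
    mu, Sigma = moments_from_natural(r, Q)        # equals (mu0, Sigma0) here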
We propose to distribute the cavity and tilted moment computations and the site parameter updates to K different computing units. The posterior update is done at the central computing node in a parallel fashion. First, the site parameter updates are initialized to zero, {Δr_k = 0, ΔQ_k = 0}_{k=1}^K, and then the following steps are repeated until convergence:

1. In parallel at each node: Compute the updated site parameters with damping level δ ∈ (0, 1]:

Q_k^new = Q_k + δ ΔQ_k
r_k^new = r_k + δ Δr_k.
2. At the central node: Compute the natural parameters of g^new(θ) as

Q^new = ∑_{k=1}^K Q_k^new + Q_0
r^new = ∑_{k=1}^K r_k^new + r_0.
3. In parallel at each node: Determine the cavity distributions g_{-k}(θ) = N(θ | μ_{-k}, Σ_{-k}) for all k = 1, ..., K:

Q_{-k} = Q^new - η Q_k^new
r_{-k} = r^new - η r_k^new,

where η ∈ (0, 1].
4. In parallel at each node: If Q_{-k} is not positive definite for any k, go back to step 1 and decrease δ. Otherwise, accept the new state by setting r = r^new, Q = Q^new, and {Q_k = Q_k^new, r_k = r_k^new}_{k=1}^K, and continue to step 5.
5. In parallel at each node: Determine the natural parameters r̃_k = Σ̃_k^{-1} μ̃_k and Q̃_k = Σ̃_k^{-1} of the tilted distribution g̃_k(θ) using either MCMC or Laplace's method. The tilted distribution is given by

g̃_k(θ) = Z̃_k^{-1} p(y_k | θ) N(θ | Q_{-k}^{-1} r_{-k}, Q_{-k}^{-1})
∝ p(y_k | θ) exp(-½ θ' Q_{-k} θ + r_{-k}' θ),

which can be efficiently sampled and differentiated using

log g̃_k(θ) = log p(y_k | θ) - ½ θ' Q_{-k} θ + r_{-k}' θ + const.
Key properties of the different approximation methods:
MCMC: It is easy to compute μ̃_k and Σ̃_k from a set of samples, and Σ̃_k should be symmetric and positive definite if enough samples are used. However, computing the precision matrix Q̃_k = Σ̃_k^{-1} requires an O(d^3) Cholesky or eigenvalue decomposition. Could this be done simultaneously within the SoftAbs stabilization step? In addition, there should be literature on estimating sparse precision matrices from samples, which could be beneficial when Σ̃_k is large.
Laplace's method: Gradient-based methods can be used to determine the mode of the tilted distribution efficiently. Once a local mode θ̂ is found, the natural parameters can be computed as

Q̃_k = -∇² log g̃_k(θ̂) = -∇² log p(y_k | θ̂) + Q_{-k},   r̃_k = Q̃_k θ̂.

If θ̂ is a local mode, Q̃_k should be symmetric and positive definite.
6. In parallel at each node: If Q̃_k is positive definite, compute the undamped site parameter updates resulting from the moment consistency conditions Q̃_k = Q_{-k} + η Q_k^new and r̃_k = r_{-k} + η r_k^new:

ΔQ_k = Q_k^new - Q_k = η^{-1}(Q̃_k - Q_{-k}) - Q_k
Δr_k = r_k^new - r_k = η^{-1}(r̃_k - r_{-k}) - r_k.

If Q̃_k is not positive definite, there are at least two options: discard the update by setting ΔQ_k = 0 and Δr_k = 0, or use the SoftAbs map to improve the conditioning of Q̃_k and compute the parameter updates with the modified Q̃_k.
In the latter approach the natural location parameter of the tilted distribution can be recomputed as r̃_k = Q̃_k μ̃_k using the original tilted mean μ̃_k and the modified precision matrix Q̃_k, which preserves the tilted mean μ̃_k but changes the tilted covariance estimate Σ̃_k.
Steps 1-6 are repeated until all the tilted distributions are consistent with the approximate posterior, that is, r̃_k = r and Q̃_k = Q for k = 1, ..., K. Step 4 is done to ensure that the posterior approximation g(θ) and the cavity distributions g_{-k}(θ) remain well defined at all times. Step 4 is potentially time consuming because it involves checking the conditioning of all the cavity precision matrices {Q_{-k}}_{k=1}^K. A cheaper alternative could be to skip the step and apply more damping, which we expect to work well if the tilted distributions related to the different data pieces are not very different or multimodal.
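The following Python sketch simulates one pass of steps 1-6 serially; it is our own illustration of the bookkeeping, not the paper's reference implementation, and it assumes a user-supplied routine tilted_moments(k, r_cav, Q_cav) that returns the natural parameters of the k-th tilted distribution (from, e.g., MCMC sample moments or a Laplace fit). In an actual deployment the per-k work would run on separate computing nodes.

    import numpy as np

    def is_pd(A):
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False

    def ep_iteration(r_sites, Q_sites, dr, dQ, r0, Q0, tilted_moments,
                     delta=0.5, eta=1.0):
        K = len(r_sites)
        while True:
            # Step 1: damped site-parameter proposals.
            r_new = [r_sites[k] + delta * dr[k] for k in range(K)]
            Q_new = [Q_sites[k] + delta * dQ[k] for k in range(K)]
            # Step 2: global natural parameters (a plain sum at the central node).
            Q_glob = Q0 + sum(Q_new)
            r_glob = r0 + sum(r_new)
            # Step 3: cavity distributions with fraction eta.
            Q_cav = [Q_glob - eta * Q_new[k] for k in range(K)]
            r_cav = [r_glob - eta * r_new[k] for k in range(K)]
            # Step 4: accept only if every cavity precision is positive definite.
            if all(is_pd(Qc) for Qc in Q_cav):
                break
            delta *= 0.5                    # otherwise increase damping and retry
        r_sites, Q_sites = r_new, Q_new
        # Steps 5-6: tilted natural parameters and undamped site updates.
        for k in range(K):
            r_tilt, Q_tilt = tilted_moments(k, r_cav[k], Q_cav[k])
            if is_pd(Q_tilt):
                dQ[k] = (Q_tilt - Q_cav[k]) / eta - Q_sites[k]
                dr[k] = (r_tilt - r_cav[k]) / eta - r_sites[k]
            else:                           # discard (or stabilize, e.g. SoftAbs)
                dQ[k] = np.zeros_like(Q_tilt)
                dr[k] = np.zeros_like(r_tilt)
        return r_sites, Q_sites, dr, dQ, r_glob, Q_glob

As a convenient test case, note that for Gaussian likelihood factors p(y_k | θ) = N(y_k | θ, I) the tilted natural parameters are exactly Q_cav + I and r_cav + y_k, so tilted_moments can be written in closed form and the loop can be checked against the exact posterior.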
Advantages of the approach include:
The central node does not need to compute O(d^3) matrix decompositions in step 2. It only needs to sum together the natural site parameters to obtain the posterior approximation in exponential form, and pass this to the individual computing nodes that can make the subtractions to form the cavity parameters.
The tilted moments can be determined by sampling directly from the unnormalized tilted distributions or by using Laplace's method. This requires only cheap function and gradient evaluations and can be applied to a wide variety of models.
After convergence, the final posterior approximation could be formed by mixing the draws from the different tilted distributions, because these should be consistent with each other and with g(θ). This sample-based approximation could also capture potential skewness in p(θ | y), because it resembles the EP-based marginal improvements described by Cseke and Heskes (2011).
Limitations of the approach include:
The tilted covariance matrices can be easily computed from samples, but these have to be inverted to obtain the site precision matrices. This inversion could be done by each node after sampling, but if there are more efficient ways to approximate sparse precision matrices directly from samples, the scheme could be feasible even for a large number of unknowns: the algorithm would not require O(d^3) matrix decompositions, and all the required posterior marginals could be summarized using samples.
Estimating the marginal likelihood is more challenging, because determining the normalization constants Z̃_k requires multivariate integrations. Approaches such as annealed importance sampling could be used if marginal likelihood estimates are required.
With Laplace's method, approximating Z̃_k is straightforward, but the quality of the marginal likelihood approximation is not likely to be very good with skewed posterior distributions. The Laplace marginal likelihood estimate is not generally well calibrated with the approximate predictive distributions in terms of hyperparameter estimation. Therefore, it would make sense to integrate over the hyperparameters within the EP framework.
A.2. Efficient algorithms when dimension reduction is possible
Here we summarize a version of the EP algorithm of section A.1 for the special case in which the non-Gaussian likelihood terms p(y_k | θ) depend on θ only through low-dimensional linearly transformed random variables,

z_k = U_k θ,   (4)

that is, p(y_k | θ) = p(y_k | z_k) for each partition k. The posterior distribution is now given by

p(θ | y) = Z^{-1} ∏_{k=1}^K p(y_k | U_k θ) p(θ),

and we approximate it by

g(θ) = Z_EP^{-1} ∏_{k=1}^K [Z_k^{-1} g(U_k θ; r_k, Q_k)] Z_0^{-1} g(θ; r_0, Q_0) = N(θ | μ, Σ).

The natural parameters of g(θ) are obtained by multiplying the site approximations and the prior, which gives

Q = Σ^{-1} = ∑_{k=1}^K U_k' Q_k U_k + Q_0   and   r = Σ^{-1} μ = ∑_{k=1}^K U_k' r_k + r_0.   (5)

The approximate posterior mean μ = Q^{-1} r and covariance Σ = Q^{-1} can be computed using a Cholesky or eigenvalue decomposition of Q, or a series of K rank-d updates. One possibility is to initialize the site approximations to g_k ∝ 1 by setting r_k = 0 and Q_k = 0 for k = 1, ..., K, which is equivalent to initializing g(θ) to the prior, that is, μ = μ_0 and Σ = Σ_0.
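As an illustration of equation (5), the global natural parameters can be accumulated with K low-rank updates. The sketch below is ours (not the paper's code); it assumes each U_k is stored as a small matrix with one row per component of z_k, while each site stays in its low-dimensional parameterization (r_k, Q_k).

    import numpy as np

    def global_natural_parameters(U, r_sites, Q_sites, r0, Q0):
        # Equation (5): Q = sum_k U_k' Q_k U_k + Q_0,  r = sum_k U_k' r_k + r_0.
        Q = Q0.copy()
        r = r0.copy()
        for Uk, rk, Qk in zip(U, r_sites, Q_sites):
            Q += Uk.T @ Qk @ Uk      # low-rank update of the full precision matrix
            r += Uk.T @ rk
        return r, Q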
If each U_k has d rows, so that z_k is d-dimensional, then the cavity computations and the site parameter updates require only rank-d matrix computations, and determining the moments of the kth tilted distribution g̃_k requires only d-dimensional numerical integrations. In the following we outline how this algorithm can be parallelized using m computing units.
We propose to distribute the cavity and tilted moment computations to m different computing units by dividing the model terms into m non-intersecting subsets S_j so that ∪_{j=1}^m S_j = {1, ..., K}. The posterior updates are done at the central computing node in a parallel fashion. First, the site parameter updates are initialized to zero, {Δr_k = 0, ΔQ_k = 0}_{k=1}^K, and then the following steps are repeated until convergence:
1. Distribute the parameters {r_k, Q_k, U_k}_{k∈S_j} and the site parameter updates {Δr_k, ΔQ_k}_{k∈S_j} to each computing node j = 1, ..., m and compute intermediate natural parameters {Q̂_j, r̂_j}_{j=1}^m with damping level δ ∈ (0, 1]:

(a) Compute the updated site parameters for k ∈ S_j:

Q_k^new = Q_k + δ ΔQ_k
r_k^new = r_k + δ Δr_k.

(b) Compute the natural parameters of the jth batch:

Q̂_j = ∑_{k∈S_j} U_k' Q_k^new U_k
r̂_j = ∑_{k∈S_j} U_k' r_k^new.
2. At the central node, compute the natural parameters of g^new(θ) as

Q^new = ∑_{j=1}^m Q̂_j + Q_0
r^new = ∑_{j=1}^m r̂_j + r_0,

and determine the posterior mean μ^new = (Q^new)^{-1} r^new and covariance Σ^new = (Q^new)^{-1} using a Cholesky or eigenvalue decomposition.
3. If Q^new is not positive definite, go back to step 1 and decrease δ. Otherwise, continue to step 4.
4. Distribute μ^new, Σ^new, and {r_k^new, Q_k^new, U_k}_{k∈S_j} to each computing node j = 1, ..., m, and determine the cavity distributions g_{-k}(z_k) = N(z_k | m_{-k}, V_{-k}) of the transformed random variables z_k = U_k θ for all k ∈ S_j:

Q_{-k} = V_{-k}^{-1} = V_k^{-1} - η Q_k^new
r_{-k} = V_{-k}^{-1} m_{-k} = V_k^{-1} m_k - η r_k^new,

where m_k = U_k μ^new and V_k = U_k Σ^new U_k' are the moments of the approximate marginal distribution g(z_k) = N(m_k, V_k), and η ∈ (0, 1]. Save c_k = Φ(r_{-k}, Q_{-k}) - Φ(V_k^{-1} m_k, V_k^{-1}) for computing the marginal likelihood as described in section 6.1.
5. If V_{-k} is not positive definite for any k, go back to step 1 and decrease δ. Otherwise, accept the new state by setting r = r^new, Q = Q^new, μ = μ^new, Σ = Σ^new, and {Q_k = Q_k^new, r_k = r_k^new}_{k=1}^K, and continue to step 6.
6. Distribute the parameters {m_{-k}, V_{-k}, r_k, Q_k, U_k}_{k∈S_j} to each computing node j = 1, ..., m and determine the site parameter updates {Δr_k, ΔQ_k}_{k∈S_j} using the following steps:

(a) Compute the moments Z̃_k, m̃_k = E(z_k), and Ṽ_k = var(z_k) of the tilted distribution g̃_k(z_k) (recall that z_k = U_k θ as defined in (4)) either analytically or using numerical quadrature, depending on the functional form of the exact site term p(y_k | U_k θ):

g̃_k(z_k) = Z̃_k^{-1} p(y_k | z_k) N(z_k | m_{-k}, V_{-k}) ≈ N(z_k | m̃_k, Ṽ_k),

where Z̃_k = ∫ p(y_k | z_k) N(z_k | m_{-k}, V_{-k}) dz_k. Save Z̃_k for computing the marginal likelihood as described in section 6.1.
(b) If Z̃_k > 0 and Ṽ_k is positive definite, compute the undamped site parameter updates resulting from the moment consistency conditions Ṽ_k^{-1} = V_{-k}^{-1} + η Q_k^new and Ṽ_k^{-1} m̃_k = V_{-k}^{-1} m_{-k} + η r_k^new:

ΔQ_k = Q_k^new - Q_k = η^{-1}(Ṽ_k^{-1} - V_{-k}^{-1}) - Q_k
Δr_k = r_k^new - r_k = η^{-1}(Ṽ_k^{-1} m̃_k - V_{-k}^{-1} m_{-k}) - r_k.

If Z̃_k = 0 or Ṽ_k is not positive definite, discard the update by setting ΔQ_k = 0 and Δr_k = 0.
Steps 1-6 are repeated until all the tilted distributions are consistent with the approximate posterior, that is, m̃_k = m_k and Ṽ_k = V_k for k = 1, ..., K. Steps 3 and 5 are done to ensure that the posterior approximation g(θ) and the cavity distributions g_{-k}(z_k) remain well defined at all times. In practice we expect that these numerical stability checks do not require any additional computations if a suitable damping factor δ is chosen. An additional way to stabilize the computations is to apply more damping to site updates with ΔQ_k < 0, because only this kind of precision decrease can lead to negative cavity distributions.
Without the stability checks, the algorithm can be simplified so that fewer parameter transfers between the central node and the computing nodes are needed per iteration. The algorithm could be further streamlined by doing the posterior updates at steps 1-5 incrementally, one batch at a time.
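To make step 6 concrete in the simplest case, here is a sketch of computing the tilted moments of a one-dimensional site by Gauss-Hermite quadrature and turning them into an undamped site update with η = 1. The code is our own illustration: the probit likelihood, the cavity values, and the old site parameters are hypothetical choices used only for demonstration.

    import numpy as np
    from scipy.special import ndtr                    # standard normal CDF

    def tilted_moments_1d(lik, m_cav, v_cav, n_quad=32):
        # Moments of g_tilt(z) proportional to lik(z) * N(z | m_cav, v_cav),
        # using Gauss-Hermite(e) quadrature with weight exp(-x^2 / 2).
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)
        z = m_cav + np.sqrt(v_cav) * x
        f = lik(z) * w
        Z = f.sum() / w.sum()                         # normalization Z_tilde
        m = (f * z).sum() / f.sum()                   # tilted mean
        v = (f * (z - m) ** 2).sum() / f.sum()        # tilted variance
        return Z, m, v

    m_cav, v_cav = 0.3, 1.5                           # hypothetical cavity moments
    Q_site_old, r_site_old = 0.2, 0.1                 # hypothetical current site parameters
    Z, m_tilt, v_tilt = tilted_moments_1d(lambda z: ndtr(z), m_cav, v_cav)
    if Z > 0 and v_tilt > 0:
        # Undamped update as in step 6(b), with eta = 1.
        dQ = (1.0 / v_tilt - 1.0 / v_cav) - Q_site_old
        dr = (m_tilt / v_tilt - m_cav / v_cav) - r_site_old
    else:
        dQ, dr = 0.0, 0.0                             # discard the update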
Advantages of the approach include:
If z_k is one-dimensional (U_k has a single row), only one-dimensional integrations are required to determine the site parameters. Furthermore, with certain likelihoods, the conditional moments with respect to some components of z_k are analytical, which can be used to lower the dimensionality of the required integrations. This goes against the general black-box approach of this paper but could be relevant for difficult special cases.
The cavity computations and parameter updates are computationally cheap if d is small. In addition, the required computations can be distributed to the different computing nodes in a parallel EP framework.
Limitations of the approach include:
The model terms need to depend on low-dimensional linear transformations of the unknowns, z_k = U_k θ. For example, generalized linear models and Gaussian processes fall into this category.
Different types of model or likelihood terms require specific implementations. For example, probit and finite Gaussian mixture likelihoods can be handled analytically, whereas Poisson and Student-t likelihoods require quadrature. For a black-box implementation we might prefer to use numerical quadrature for all these problems.
The central node needs to compute the global posterior covariance at step 2, which scales as O(d^3) and can be costly with a large number of unknowns. Independence assumptions or multilevel designs as proposed in section 3 can be used to improve the scaling.
A.3. Parallel EP implementation for hierarchical models when approximations are formed also for the local parameters
In this section we describe a distributed EP algorithm that uses the hierarchical model structure from section 3 for efficient parallel computations in inference problems where a Gaussian approximation is required also for the local parameters. Such cases arise, for example, when certain data pieces share some of the local parameters but we do not wish to form the potentially high-dimensional joint approximation of φ and all the local parameters α_1, ..., α_K.
We assume independent Gaussian priors for φ and α_1, ..., α_K and approximate the posterior distribution

p(α, φ | y) = Z^{-1} N(φ | B_0^{-1} b_0, B_0^{-1}) ∏_{k=1}^K p(y_k | α_k, φ) N(α_k | A_0^{-1} a_0, A_0^{-1})

by

g(α, φ) = Z_EP^{-1} Z_0^{-1} g(φ; b_0, B_0) ∏_{k=1}^K Z_k^{-1} g(α_k, φ; r_k, Q_k) g(α_k; a_0, A_0),   (6)

where the site approximations and the prior terms are written using (2), and Z_0 is the normalization term of the joint prior p(α, φ). To derive the EP updates for the hierarchical model, we divide the site location vector r_k and the site precision matrix Q_k into blocks corresponding to α_k and φ as

r_k = [a_k; b_k]   and   Q_k = [A_k, C_k; C_k', B_k].

The marginal approximation for the shared parameters φ can be obtained by integrating over the local parameters α_k in the joint approximation (6) as

g(φ) = N(φ | μ_φ, Σ_φ) ∝ g(φ; b_0, B_0) ∏_{k=1}^K ∫ g(α_k, φ; r_k, Q_k) g(α_k; a_0, A_0) dα_k ∝ g(φ; b, B),

where the parameters of g(φ) are given by

b = Σ_φ^{-1} μ_φ = ∑_{k=1}^K (b_k - C_k'(A_k + A_0)^{-1}(a_k + a_0)) + b_0
B = Σ_φ^{-1} = ∑_{k=1}^K (B_k - C_k'(A_k + A_0)^{-1} C_k) + B_0.   (7)
k1
In the EP update related to data piece yk we need to consider only the joint marginal approximation
of k and , which can be written as
gk, gk, rk, Qkgkk, , 8
where the kth cavity distribution is defined as
gkk, gka0, A0gbk, Bk
with natural parameters
bk bj CjAj A01aj a0b0 bbk CkAk A01ak a0 j i
Bk Bj CjAj A01CjB0 BBk CkAk A01Ck. 9 j i
The cavity distribution gk k , factorizes between k and , and that the marginal cav ity of the local parameters k depends only on the prior pk. The dependence on the other
26

local parameters is incorporated in the marginal cavity gk gbk,Bk. This property results from the factorized prior between and 1, . . . , K , and it enables computationally efficient lowerdimensional matrix computations. The marginal approximation gk Nk,k can be obtained by integrating over in 8, which gives
k A0 Ak CkBk Bk1Ck 1

a0 ak CkBk Bk1bk bk. 10 kk
The marginal approximations g(φ) and {g(α_k)}_{k=1}^K can be computed efficiently without actually forming the potentially high-dimensional joint approximation g(α_1, ..., α_K, φ). After convergence, we can summarize the coefficients and compute the predictions for each group k = 1, ..., K using the marginal distributions (7) and (10).
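The following sketch (ours, with hypothetical helper names) shows how equations (7), (9), and (10) translate into matrix code when the block site parameters are stored as lists a, b, A, B, C, with a_k of dimension dim(α_k), b_k of dimension dim(φ), and C_k of shape dim(α_k) x dim(φ).

    import numpy as np

    def phi_marginal(a, b, A, B, C, a0, b0, A0, B0):
        # Equation (7): accumulate b = sum_k (b_k - C_k'(A_k+A_0)^{-1}(a_k+a_0)) + b_0
        #               and       B = sum_k (B_k - C_k'(A_k+A_0)^{-1} C_k)       + B_0.
        b_phi, B_phi, contrib = b0.copy(), B0.copy(), []
        for ak, bk, Ak, Bk, Ck in zip(a, b, A, B, C):
            M = np.linalg.solve(Ak + A0, np.column_stack([ak + a0, Ck]))
            bk_bar = bk - Ck.T @ M[:, 0]          # per-site contribution to b
            Bk_bar = Bk - Ck.T @ M[:, 1:]         # per-site contribution to B
            contrib.append((bk_bar, Bk_bar))
            b_phi += bk_bar
            B_phi += Bk_bar
        return b_phi, B_phi, contrib

    def cavity_and_alpha_marginal(k, b_phi, B_phi, contrib, a, b, A, B, C, a0, A0):
        # Equation (9): remove the k-th contribution from the phi marginal.
        bk_bar, Bk_bar = contrib[k]
        b_cav, B_cav = b_phi - bk_bar, B_phi - Bk_bar
        # Equation (10): marginal moments of alpha_k.
        S = np.linalg.inv(B_cav + B[k])
        Sigma_alpha = np.linalg.inv(A0 + A[k] - C[k] @ S @ C[k].T)
        mu_alpha = Sigma_alpha @ (a0 + a[k] - C[k] @ S @ (b_cav + b[k]))
        return (b_cav, B_cav), (mu_alpha, Sigma_alpha)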
Approximations (7) and (10) can be determined by first initializing the site parameters and the parameter updates to zero, that is, {a_k = 0, b_k = 0, A_k = 0, B_k = 0, C_k = 0}_{k=1}^K and {Δa_k = 0, Δb_k = 0, ΔA_k = 0, ΔB_k = 0, ΔC_k = 0}_{k=1}^K, and then iterating the following steps until convergence:
1. Distribute the current site parameters a_k, A_k, b_k, B_k, C_k together with the parameter updates Δa_k, ΔA_k, Δb_k, ΔB_k, ΔC_k to the corresponding computing node k = 1, ..., K, and compute new parameter values with damping level δ ∈ (0, 1]:

a_k^new = a_k + δ Δa_k,   b_k^new = b_k + δ Δb_k
A_k^new = A_k + δ ΔA_k,   B_k^new = B_k + δ ΔB_k
C_k^new = C_k + δ ΔC_k.

Compute also the auxiliary variables V_k = (A_k^new + A_0)^{-1} using, e.g., K parallel Cholesky decompositions. If V_k is not positive definite, i.e., the Cholesky decomposition fails, decrease δ and recompute the updates. Otherwise, compute the auxiliary parameters

b̂_k = b_k^new - C_k^new' V_k (a_k^new + a_0)
B̂_k = B_k^new - C_k^new' V_k C_k^new.
2. At the central node, compute the natural parameters of g^new(φ) = N(φ | μ^new, Σ^new) as

b^new = (Σ^new)^{-1} μ^new = ∑_{k=1}^K b̂_k + b_0
B^new = (Σ^new)^{-1} = ∑_{k=1}^K B̂_k + B_0.

3. Distribute the parameters b^new, B^new and {b̂_k, B̂_k} to the respective computing nodes k = 1, ..., K, and determine the parameters of the cavity distributions

g_{-k}(α_k, φ) = Z_{-k}^{-1} g(α_k; a_0, A_0) g(φ; b_{-k}, B_{-k}),

where log Z_{-k} = Φ(a_{-k}, A_{-k}) + Φ(b_{-k}, B_{-k}) and

a_{-k} = a_0,   A_{-k} = A_0
b_{-k} = b^new - b̂_k,   B_{-k} = B^new - B̂_k.   (11)
4. If B_{-k} is not positive definite for any k, go back to step 1 and decrease δ. Another option is to skip the updates only for the sites k whose B_{-k} is not positive definite, but ill-conditioned cavity distributions and approximate covariance matrices may still emerge at subsequent iterations. Otherwise, accept the new state by setting b = b^new, B = B^new, and {a_k = a_k^new, A_k = A_k^new, b_k = b_k^new, B_k = B_k^new, C_k = C_k^new}_{k=1}^K, and continue to the next step. Save the normalization terms log Z_{-k} for computing the marginal likelihood as described in section 6.1.

5. Distribute the parameters {a_{-k}, A_{-k}, b_{-k}, B_{-k}} to the corresponding computing nodes k = 1, ..., K and determine the site parameter updates by the following steps:

(a) Determine the normalization term Z̃_k and the moments μ̃_k = E(α_k, φ) and Σ̃_k = cov(α_k, φ) of the tilted distribution g̃_k(α_k, φ) using either an inner EP algorithm or MCMC, depending on the functional form of the likelihood term p(y_k | α_k, φ):

g̃_k(α_k, φ) = Z̃_k^{-1} p(y_k | α_k, φ) Z_{-k}^{-1} g(α_k; a_0, A_0) g(φ; b_{-k}, B_{-k}) ≈ N(α_k, φ | μ̃_k, Σ̃_k),

where Z̃_k = ∫∫ p(y_k | α_k, φ) g_{-k}(α_k, φ) dα_k dφ. For the site updates only the natural parameters of the tilted distribution need to be determined:

r̃_k = Σ̃_k^{-1} μ̃_k = [ã_k; b̃_k]
Q̃_k = Σ̃_k^{-1} = [Ã_k, C̃_k; C̃_k', B̃_k].

(b) If Z̃_k > 0 and Q̃_k is positive definite, compute the undamped site parameter updates resulting from the moment consistency conditions r̃_k = r_{-k} + r_k^new and Q̃_k = Q_{-k} + Q_k^new:

Δa_k = ã_k - a_{-k} - a_k,   Δb_k = b̃_k - b_{-k} - b_k
ΔA_k = Ã_k - A_{-k} - A_k,   ΔB_k = B̃_k - B_{-k} - B_k
ΔC_k = C̃_k - C_k.

Save the following quantity for computing the marginal likelihood as described in section 6.1:

c_k = log Z̃_k + Φ(r̃_k - r_k, Q̃_k - Q_k),   where   r̃_k - r_k = [ã_k - a_k; b̃_k - b_k]   and   Q̃_k - Q_k = [Ã_k - A_k, C̃_k - C_k; (C̃_k - C_k)', B̃_k - B_k].

If Z̃_k = 0 or Q̃_k is not positive definite, discard the update by setting Δa_k = 0, Δb_k = 0, ΔA_k = 0, ΔB_k = 0, and ΔC_k = 0 for that particular data piece k.
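For step 5(a) with MCMC, the tilted natural parameters can be estimated from draws of (α_k, φ) obtained by sampling the unnormalized tilted density. The sketch below is our own bookkeeping illustration: it assumes the draws are already available and numerous enough that the sample covariance is well conditioned, and it converts the sample moments into the block quantities used in step 5(b).

    import numpy as np

    def tilted_site_update(samples_alpha, samples_phi,
                           a_k, b_k, A_k, B_k, C_k, a0, A0, b_cav, B_cav):
        d_a = samples_alpha.shape[1]
        joint = np.hstack([samples_alpha, samples_phi])   # draws of (alpha_k, phi)
        mu = joint.mean(axis=0)
        Sigma = np.cov(joint, rowvar=False)
        Q_tilt = np.linalg.inv(Sigma)                     # requires enough draws
        r_tilt = Q_tilt @ mu
        # Split the tilted natural parameters into blocks (alpha_k first, then phi).
        a_t, b_t = r_tilt[:d_a], r_tilt[d_a:]
        A_t, C_t, B_t = Q_tilt[:d_a, :d_a], Q_tilt[:d_a, d_a:], Q_tilt[d_a:, d_a:]
        # Undamped updates from the moment-consistency conditions in step 5(b),
        # using a_cav = a0 and A_cav = A0 for the cavity of alpha_k.
        da, db = a_t - a0 - a_k, b_t - b_cav - b_k
        dA, dB = A_t - A0 - A_k, B_t - B_cav - B_k
        dC = C_t - C_k
        return da, db, dA, dB, dC

In a full implementation one would also check that Q_tilt is positive definite and otherwise discard the update, as in step 5(b).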
A.4. Determining tilted moments using inner EP approximations when dimension reduction is possible
We can use an inner EP algorithm to determine the natural parameters r̃_k and Q̃_k of the tilted distributions g̃_k(α_k, φ) in step 5(a) of the hierarchical EP algorithm of section A.3, if the likelihood terms related to each data piece can be factored into simple terms that depend only on low-dimensional linearly transformed random variables z_{k,j} = U_{k,j} (α_k, φ), that is,

p(y_k | α_k, φ) = ∏_{j=1}^{n_k} p(y_{k,j} | U_{k,j} (α_k, φ)),

where n_k is the number of observations in batch k. The EP algorithm with dimension reduction from section A.2 can be readily applied for this purpose. Since the tilted moment approximations in step 5(a) of the hierarchical algorithm are already run in parallel at the different computing nodes, the parallelization steps of section A.2 can be omitted when it is used to form the inner EP approximations.
We can also derive closed-form expressions for the parameters {a_k, b_k, A_k, B_k, C_k}_{k=1}^K of the approximation (6) in terms of the site parameters of the inner EP algorithms. First, we write the approximation to the tilted distribution as

g̃_k(α_k, φ) ≈ Z̃_k^{-1} Z_{-k}^{-1} p(y_k | α_k, φ) g(α_k; a_0, A_0) g(φ; b_{-k}, B_{-k})
≈ Z̃_k^{-1} Z_{-k}^{-1} ∏_{j=1}^{n_k} Z̃_{k,j} g(U_{k,j} (α_k, φ); r̃_{k,j}, Q̃_{k,j}) g(α_k; a_0, A_0) g(φ; b_{-k}, B_{-k}),

where r̃_{k,j} and Q̃_{k,j} are the site parameters of the inner EP approximation. If we write the transformation as U_{k,j} (α_k, φ) = u_{k,j}' α_k + v_{k,j}' φ, where U_{k,j} = [u_{k,j}', v_{k,j}'], we can write the outer EP parameters as

a_k + Δa_k = ∑_j u_{k,j} r̃_{k,j},   b_k + Δb_k = ∑_j v_{k,j} r̃_{k,j}
A_k + ΔA_k = ∑_j u_{k,j} Q̃_{k,j} u_{k,j}',   B_k + ΔB_k = ∑_j v_{k,j} Q̃_{k,j} v_{k,j}'
C_k + ΔC_k = ∑_j u_{k,j} Q̃_{k,j} v_{k,j}'.

For example, in the case of a linear model, u_{k,j} are the input variables associated with the local coefficients α_k, and v_{k,j} are the input variables corresponding to the shared coefficients φ.

With this representation, we can interpret the hierarchical EP algorithm with K inner EP approximations also as a single EP algorithm with site parameters r̃_{k,j} and Q̃_{k,j} that are updated in a parallel fashion for K groups of site terms. After each successful update at step 4 we store only the parameters r̃_{k,j} and Q̃_{k,j}, and we can also equivalently apply damping directly to these parameters at step 1. In fact, it is more efficient to initialize each tilted moment approximation at step 5(a) to the inner EP parameters from the previous iteration instead of starting from a zero initialization. This framework is similar to the nested EP algorithm for the multinomial probit model described by Riihimaki et al. (2013). However, when applied to the potentially high-dimensional hierarchical setting, the computational benefits become more evident compared with the GP classification case studied in that paper.
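The mapping above from inner-EP site parameters to the outer block parameters amounts to a set of weighted outer products. Here is a compact sketch (ours) for the case of scalar z_{k,j}, so that each inner site has scalar parameters r̃_{k,j} and Q̃_{k,j}:

    import numpy as np

    def outer_sites_from_inner(u, v, r_in, Q_in):
        # u: (n_k, dim_alpha), v: (n_k, dim_phi); r_in, Q_in: (n_k,) inner site parameters.
        a_k = u.T @ r_in                        # sum_j u_kj r_kj
        b_k = v.T @ r_in                        # sum_j v_kj r_kj
        A_k = (u * Q_in[:, None]).T @ u         # sum_j Q_kj u_kj u_kj'
        B_k = (v * Q_in[:, None]).T @ v         # sum_j Q_kj v_kj v_kj'
        C_k = (u * Q_in[:, None]).T @ v         # sum_j Q_kj u_kj v_kj'
        return a_k, b_k, A_k, B_k, C_k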