代写 C algorithm Scheme math scala statistic network Bayesian theory Automatica 47 (2011) 39–49

Automatica 47 (2011) 39–49
Contents lists available at ScienceDirect
Automatica
journal homepage: www.elsevier.com/locate/automatica
System identification of nonlinear state-space models✩ Thomas B. Schön a,∗, Adrian Wills b, Brett Ninness b
a Division of Automatic Control, Linköping University, SE-581 83 Linköping, Sweden
b School of Electrical Engineering and Computer Science, University of Newcastle, Callaghan, NSW 2308, Australia
article info
Article history:
Received 26 June 2009
Received in revised form
14 April 2010
Accepted 24 August 2010
Available online 23 November 2010
Keywords:
System identification
Nonlinear models
Dynamic systems
Monte Carlo method
Smoothing filters
Expectation maximisation algorithm Particle methods
1. Introduction
The significance and difficulty of estimating nonlinear systems are widely recognised (Ljung, 2008; Ljung & Vicino, 2005; Ljungbode, 2003). As a result, there is very large and active research effort directed towards the problem. A key aspect of this activity is that it generally focuses on specific system classes such as those described by Volterra kernels (Bendat, 1990), neural networks (Narendra & Parthasarathy, 1990), nonlinear ARMAX (NARMAX) (Leontaritis & Billings, 1985), and Hammerstein–Wiener (Rangan, Wolodkin, & Poolla, 1995) structures, to name just some examples. In relation to this, the paper here considers Maximum Likelihood (ML) estimation of the parameters specifying a relatively general class of nonlinear systems that can be represented in state-space form.
✩ This work was supported by: the strategic research center MOVIII, funded by the Swedish Foundation for Strategic Research (SSF) and CADICS, a Linneaus Center funded by the Swedish Research Council; and the Australian Reseach Council. Parts of this paper were presented at the 14th IFAC Symposium on System Identification, March 29–31, 2006, Newcastle, Australia and at the 17th IFAC World Congress, July 6–11, 2008, Seoul, South Korea. This paper was recommended for publication in revised form by Associate Editor Wolfgang Scherrer under the direction of Editor Torsten Söderström.
∗ Corresponding author. Tel.: +46 13 281373; fax: +46 13 139282.
E-mail addresses: schon@isy.liu.se (T.B. Schön), Adrian.Wills@newcastle.edu.au
(A. Wills), Brett.Ninness@newcastle.edu.au (B. Ninness).
0005-1098/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.automatica.2010.10.013
© 2010 Elsevier Ltd. All rights reserved.
Of course, the use of an ML approach (for example, with regard to linear dynamic systems) is common, and it is customary to employ a gradient based search technique such as a damped Gauss–Newton method to actually compute estimates (Ljung, 1999; Söderström & Stoica, 1989). This requires the computation of a cost Jacobian which typically necessitates implementing one filter derived (in the linear case) from a Kalman filter, for each parameter that is to be estimated. An alternative, recently explored in Gibson, Wills, and Ninness (2005) in the context of bilinear systems, is to employ the Expectation Maximisation algorithm (Dempster, Laird, & Rubin, 1977) for the computation of ML estimates.
Unlike gradient based search, which is applicable to maximisa- tion of any differentiable cost function, EM methods are only appli- cable to maximisation of likelihood functions. However, a dividend of this specialisation is that while some gradient calculations may be necessary, the gradient of the likelihood function is not required, which will prove to be very important in this paper. In addition to this advantage, EM methods are widely recognised for their nu- merical stability (Lange, 1995).
Given these recommendations, this paper develops and demon- strates an EM-based approach to nonlinear system identification. This will require the computation of smoothed state estimates that, in the linear case, could be found by standard linear smooth- ing methods (Gibson et al., 2005). In the fairly general nonlin- ear (and possibly non-Gaussian) context considered in this work we propose a ‘‘particle based’’ approach whereby approximations of the required smoothed state estimates are approximated by
abstract
This paper is concerned with the parameter estimation of a general class of nonlinear dynamic systems in state-space form. More specifically, a Maximum Likelihood (ML) framework is employed and an Expectation Maximisation (EM) algorithm is derived to compute these ML estimates. The Expectation (E) step involves solving a nonlinear state estimation problem, where the smoothed estimates of the states are required. This problem lends itself perfectly to the particle smoother, which provides arbitrarily good estimates. The maximisation (M) step is solved using standard techniques from numerical optimisation theory. Simulation examples demonstrate the efficacy of our proposed solution.

40 T.B. Schön et al. / Automatica 47 (2011) 39–49
Monte Carlo based empirical averages (Doucet, de Freitas, & Gor- don, 2001).
It is important to acknowledge that there is a very significant body of previous work on the problems addressed here. Many approaches using various suboptimal nonlinear filters (such as the extended Kalman filter) to approximate the cost Jacobian have been proposed (Bohlin, 2006; Graebe, 1990; Kristensen, Madsen, & Jorgensen, 2004). Additionally, there has been significant work (Andrieu, Doucet, Singh, & Tadić, 2004; Doucet & Tadić, 2003; Poyiadjis, Doucet, & Singh, 2005) investigating the employment of particle filters to compute the Jacobians necessary for a gradient based search approach.
There has also been previous work on various approximate EM-based approaches. Several authors have considered using suboptimal solutions to the associated nonlinear smoothing problem, typically using an extended Kalman smoother (Duncan & Gyöngy, 2006; Ghaharamani & Roweis, 1999; Goodwin & Agüero, 2005; Roweis & Ghaharamani, 2001).
As already mentioned, this paper is considering particle based approaches in order to solve the involved nonlinear smoothing problem. This idea has been partially reported by the authors in two earlier conference publications (Schön, Wills, & Ninness, 2006; Wills, Schön, & Ninness, 2008).
An interesting extension, handling the case of missing data, is addressed in Gopaluni (2008). Furthermore, in Kim and Stoffer (2008), the authors introduce an EM algorithm using a particle smoother, similar to the algorithm we propose here, but tailored to stochastic volatility models. The survey paper by Andrieu et al. (2004) is one of the earliest papers to note the possiblility of EM- based methods employing particle smoothing methods.
2. Problem formulation
This paper considers the problem of identifying the parameters θ for certain members of the following nonlinear state-space model structure
members of this class where pθ(xt+1 | xt) and pθ(yt | xt) can be explicitly expressed and evaluated.
The problem addressed here is the formation of an estimate θ of the parameter vector θ based on N measurements UN = [u1,…,uN],YN = [y1,…,yN] of observed system input–output responses. Concerning the notation, sometimes we will make use of Yt:N , which is used to denote [yt , . . . , yN ]. However, as defined above, for brevity we denote Y1:N simply as YN . Hence, it is here implicitly assumed that the index starts at 1.
One approach is to employ the general prediction error (PE) framework (Ljung, 1999) to deliver θ according to
θ =argminV(θ), (3) θ∈Θ
with cost function V (θ ) of the form
xt+1 = ft(xt,ut,vt,θ), yt = ht(xt,ut,et,θ).
Here, xt ∈ Rnx denotes the state variable, with ut
Rny denoting (respectively) observed input and output responses. Furthermore, θ ∈ Rnθ is a vector of (unknown) parameters that specifies the mappings ft (·) and ht (·) which may be nonlinear and time-varying. Finally, vt and et represent mutually independent vector i.i.d. processes described by probability density functions (pdf’s) pv(·) and pe(·). These are assumed to be of known form (e.g., Gaussian) but parameterized (e.g., mean and variance) by values that can be absorbed into θ for estimation if they are unknown.
Due to the random components vt and et , the model (1) can also be represented via the stochastic description
xt+1 ∼pθ(xt+1 |xt), (2a)
is the mean square optimal one-step ahead predictor of yt based on the model (1). The function l(·) is an arbitrary and user-chosen positive function.
This PE solution has its roots in the Maximum Likelihood (ML) approach, which involves maximising the joint density (likelihood) pθ (YN ) of the observations:
θ = argmax pθ(y1,…,yN). (6) θ∈Θ
To compute this, Bayes’ rule may be used to decompose the joint density according to
N
pθ(y1,…,yN) = pθ(y1)∏pθ(yt|Yt−1). (7)
t=2
Accordingly, since the logarithm is a monotonic function, the
maximisation problem (6) is equivalent to the minimisation problem
θ = arg min −Lθ (YN ), (8) θ∈Θ
where Lθ (YN ) is the log-likelihood
N Lθ(YN)􏰃logpθ(YN)=logpθ(y1)+−logpθ(yt |Yt−1). (9)
t=2
The PE and ML approaches both enjoy well understood theoretical properties including strong consistency, asymptotic normality, and in some situations asymptotic efficiency. They are therefore both attractive solutions, but there are two important challenges to their implementation.
First, both methods require knowledge of the prediction density pθ(yt | Yt−1). In the linear and Gaussian case, a Kalman filter can be employed. In the nonlinear case (1) an alternate solution must be found.
Second, the optimisation problems (3) or (8) must be solved. Typically, the costs V(θ) or Lθ(YN) are differentiable, and this is exploited by employing a gradient based search method to compute the estimate (Ljung, 1999). Unfortunately, these costs will generally possess multiple local minima that can complicate this approach.
3. Prediction density computation
Turning to the first challenge of computing the prediction density, note that by the law of total probability and the Markov
yt ∼ pθ(yt | xt),
(2b)
where pθ (xt +1 | xt ) is the pdf describing the dynamics for given values of xt,ut and θ, and pθ(yt | xt) is the pdf describing the measurements. As is common practise, in (2) the same symbol pθ is used for different pdf’s that depend on θ, with the argument to the pdf denoting what is intended. Furthermore, note that we have, for brevity, dispensed with the input signal ut in the notation (2). However, everything we derive throughout this paper is valid also if an input signal is present.
The formulation (1) and its alternative formulation (2) capture a relatively broad class of nonlinear systems and we consider the
∈ Rnu and yt ∈
(1a) (1b)
N
V(θ) = −l(εt(θ)),
t=1
εt(θ) = yt −yt|t−1(θ) (4)
andwithΘ⊆Rnθ denotingacompactsetofpermissiblevaluesof the unknown parameter θ. Here,

yt|t−1(θ) = Eθ {yt | Yt−1} =
yt pθ (yt | Yt−1) dyt (5)

nature of (2) pθ(yt | Yt−1) =
with respect to both the observed data YN and the missing data XN . Underlying this strategy is an assumption that maximising the ‘‘complete’’ log-likelihood Lθ (XN , YN ) is easier than maximising the incomplete one Lθ (YN ).
As a concrete example, if the model structure (1) was linear and time-invariant, then knowledge of the state xt would allow system matrices A, B, C , D to be estimated by simple linear regression. See Gibson and Ninness (2005) for more detail, and McLachlan and Krishnan (2008b) for further examples.
The EM algorithm then copes with XN being unavailable by forming an approximation Q (θ , θk ) of Lθ (XN , YN ). The approxima- tion used is the minimum variance estimate of Lθ (XN , YN ) given the observed available data YN , and an assumption θk of the true parameter value. This minimum variance estimate is given by the conditional mean (Anderson & Moore, 1979)

pθ(yt | xt)pθ(xt
where xt is the state of the underlying dynamic system. Further-
pθ(yt |xt)pθ(xt |Yt−1) pθ(yt |Yt−1)
(10) more, using the Markov property of (2) and Bayes’ rule we obtain
. (11) Finally, by the law of total probability and the Markov nature of (2)
pθ(xt |Yt)=
| Yt−1)dxt,
T.B. Schön et al. / Automatica 47 (2011) 39–49 41
pθ(xt+1 | Yt) =

pθ(xt+1 | xt)pθ(xt | Yt)dxt. (12)
Together, (11), (10) are known as the ‘‘measurement update’’ and (12) the ‘‘time update’’ equations, which provide a recursive formulation of the required prediction density pθ (yt | Yt −1 ) as well as the predicted and filtered state densities pθ (xt | Yt −1 ), pθ (xt | Yt ).
In the linear and Gaussian case, the associated integrals have closed form solutions which lead to the Kalman filter (Kalman, 1960). In general though, they do not. Therefore, while in principle (10)–(12) provide a solution to the computation of V(θ) or Lθ (YN ), there is a remaining obstacle of numerically evaluating the required nx-dimensional integrals.
In what follows, the recently popular methods of sequential importance resampling (SIR, or particle filtering) will be employed to address this problem.
However, there is a remaining difficulty which is related to the second challenge mentioned at the end of Section 2. Namely, if gradient based search is to be employed to compute the estimate θ,thennotonlyispθ(yt |Yt−1)required,butalsoitsderivative
∂ pθ(yt |Yt−1). (13) ∂θ
Unfortunately the SIR technique does not lend itself to the simple computation of this derivative. One approach to deal with this is to simply numerically evaluate the necessary derivative based on differencing. Another is to employ a search method that does not require gradient information. Here, there exist several possibilities, such as Nelder–Mead simplex methods or annealing approaches (Salamon, Sibani, & Frost, 2002; Wright, 1996).
Q(θ,θk) 􏰃 Eθk{Lθ(XN,YN) | YN} ∫
= Lθ(XN,YN)pθk(XN|YN)dXN.
(15a)
(15b)
The utility of this approach depends on the relationship between Lθ (YN ) and the approximation Q (θ , θk ) of Lθ (XN , YN ). This may be examined by using the definition of conditional probability to write
logpθ(XN,YN)=logpθ(XN |YN)+logpθ(YN). (16)
Taking the conditional mean Eθk {· | YN } of both sides then yields ∫
Q(θ,θk)=Lθ(YN)+ logpθ(XN|YN)pθk(XN|YN)dXN. (17) Therefore
Lθ(YN)−Lθk(YN) = Q(θ,θk)−Q(θk,θk)

∫ pθ(XN|YN)
log pθ (XN|YN)pθk(XN|YN)dXN
k
pθ(XN|YN)
1 − pθk(XN|YN) pθk(XN|YN)dXN = 0,
+
∫ pθk (XN |YN )
log pθ(XN|YN)pθk(XN|YN)dXN. (18)
The rightmost integral in (18) is the Kullback–Leibler divergence metric which is non-negative. This follows directly upon noting that since for x ≥ 0, − log x ≥ 1 − x
This paper explores a further possibility which is known as
the Expectation Maximisation (EM) algorithm, and is directed at ≥
∫ 
computing an ML estimate. Instead of using the smoothness of Lθ , it is capable of employing an alternative feature. Namely, the fact that Lθ is the logarithm of a probability density pθ (YN ), which has unit area for all values of θ. How the EM algorithm is capable of utilising this simple fact to deliver an alternate search procedure is now profiled.
4. The expectation maximisation algorithm
Like gradient based search, the EM algorithm is an iterative procedure that at the k’th step seeks a value θk such that the likelihood is increased in that Lθk (YN ) > Lθk−1 (YN ). Again like gradient based search, an approximate model of Lθ (YN ) is employed to achieve this. However, unlike gradient based search, the model is capable of guaranteeing increases in Lθ (YN ).
The essence of the EM algorithm (Dempster et al., 1977; McLachlan & Krishnan, 2008a) is the postulation of a ‘‘missing’’ datasetXN = {x1,…,xN}.Inthispaper,itwillbetakenasthestate sequence in the model structure (1), but other choices are possible, and it can be considered a design variable. The key idea is then to consider the joint likelihood function
Lθ(XN,YN) = logpθ(XN,YN), (14)
(19) | YN ) is of
where the equality to zero is due to the fact that pθ (XN
unit area for any value of θ. As a consequence of this simple fact
Lθ (YN ) − Lθk (YN ) ≥ Q (θ , θk ) − Q (θk , θk ). (20)
This delivers the key to the EM algorithm. Namely, choosing θ so that Q(θ,θk) > Q(θk,θk) implies that the log-likelihood is also increased in that Lθ (YN ) > Lθk (YN ). The EM algorithm exploits this to deliver a sequence of values θk, k = 1, 2, . . . designed to be increasingly good approximations of the ML estimate (6) via the following strategy.
Algorithm 1 (EM Algorithm).
(1) Set k = 0 and initialise θk such that Lθ (YN ) is finite;
k
(2) (Expectation (E) step):
Calculate: Q (θ , θk ); (21)
(3) (Maximisation (M) step):
Compute: θk+1 =argmax Q(θ,θk); (22)
θ∈Θ
(4) If not converged, update k → k + 1 and return to step 2.

42 T.B. Schön et al. / Automatica 47 (2011) 39–49
The termination decision in step 4 is performed using a standard criterion such as the relative increase of Lθ(YN) or the relative increase of Q (θ , θk ) falling below a pre-defined threshold (Dennis & Schnabel, 1983).
The first challenge in implementing the EM algorithm is the computation of Q (θ , θk ) according to the definition (15a). To address this, note that via Bayes’ rule and the Markov property associated with the model structure (1)
(measurable) function g
1 −M ∫ M g(xi) ≈ E {g(x)} =
i=1
g(x)π(x) dx, (26)
Lθ(XN,YN) = logpθ(YN|XN)+logpθ(XN) N−1
= logpθ(x1)+−logpθ(xt+1|xt) t=1
N
+ −logp (y |x ).
t=1
When the model structure (1) is linear and the stochastic
components vt and et are Gaussian the log pθ terms are either linear or quadratic functions of the state xt . Taking the conditional expectation (15a) in order to compute Q(θ,θk) is then simply achieved by invoking a modification of a standard Kalman smoother (Gibson & Ninness, 2005; Jazwinski, 1970).
In the more general setting of this paper, the situation is more complicated and requires an alternative approach. To develop it, application of the conditional expectation operator Eθk {· | YN } to both sides of (23) yields
with equality (with probability one) in the limit as M → ∞. Certainly, for some special cases such as the Gaussian density, random number generator constructions are well known. Denote by q(x) the density for which such a random variable generator is available, and denote by x ̃i ∼ q(x ̃) a realisation drawn using this
generator.
A realisation xj ∼ π(x) that is distributed according to the
target density π (x) is then achieved by choosing the j’th realisation xj to be equal to the value x ̃i with a certain probability w(x ̃i). More
I1 =
I2 = −
I3 =
log pθ (xt+1|xt )pθk (xt+1, xt |YN ) dxt dxt+1, logpθ(yt|xt)pθk(xt|YN)dxt.
log pθ (x1)pθk (x1|YN ) dx1,
N−1 ∫∫ t=1
−N ∫
θtt
(23)
specifically, for j = 1, . . . , M, a realisation xj randomly according to
is selected as x ̃i (27)
(28)
P(xj = x ̃i) = 1 w(x ̃i) κ
where
π ( x ̃ i ) w(x ̃i) = ,
κ =
−M i=1
w(x ̃i).
Q(θ,θk)=I1 +I2 +I3, where

(24)
(25a) (25b)
(25c)
This step is known as ‘‘resampling’’, and the random assignment is done in an independent fashion. The assignment rule (27) works, since by the independence, the probability that as a result xj takes on the value x ̃i is the probability q(x ̃i) that x ̃i was realised, times the probability w(x ̃i) that xj is then assigned this value. Hence, with x ̃i viewed as a continuous variable, rather than one from a discrete set {x ̃1,…,x ̃M}
P(xj = x ̃i) ∝ q(x ̃i) π(x ̃i) = π(x ̃i), (29) q ( x ̃ i )
so that xj is a realisation from the required density π(x).
The challenge in achieving this is clearly the specification of a density q(x) from which it is both feasible to generate realisations {x ̃i}, and for which the ratio w(x) in (28) can be computed. To
address this, consider the following selections:
π(xt) = pθ(xt | Yt), q(x ̃t) = pθ(x ̃t | xt−1). (30)
Thischoiceofproposaldensityqisfeasiblesincearealisationx ̃it ∼ pθ (x ̃t | xt−1) may be obtained by simply generating a realisation vti ∼ pv , and substituting it, a given xt −1 , a measured ut and model- implied θ into ft in (1a) in order to deliver a realisation x ̃it .
Furthermore, if xt−1 used in (30) is a realisation distributed as xt −1 ∼ pθ (xt −1 | Yt −1 ) then the unconditional proposal density q is given by the law of total probability as
q ( x ̃ i )
t=1
Computing Q(θ,θk) therefore requires knowledge of densities such as pθk (xt |YN ) and pθk (xt +1 , xt |YN ) associated with a nonlinear smoothing problem. Additionally, integrals with respect to these must be evaluated. Outside the linear case, there is no hope of any analytical solution to these challenges. This paper therefore takes the approach of evaluating (25a)–(25c) numerically.
5. Computing state estimates
The quantities I , I , I in (25) that determine Q (θ , θ ) depend 123 k
primarily on evaluating the smoothed density pθk (xt | YN ) and expectations with respect to it.
To perform these computations, this paper employs sequential importance resampling (SIR) methods. These are often discussed under the informal title of ‘‘particle filters’’, and the main ideas underlying them date back half a century (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Metropolis & Ulam, 1949). However, it was not until 1993 that the first working particle filter was discovered by Gordon, Salmond, and Smith (1993). As will be detailed, this approach first requires dealing with the
q(x ̃t)=

pθ(x ̃t |xt−1)pθ(xt−1 |Yt−1)dxt−1 (31)
filtered density pθ (xt by examining this.
5.1. Particle filtering
| Yt ), and hence the discussion will begin
and hence by the time update equation (12)
q(x ̃t)=pθ(x ̃t |Yt−1). (32)
As a result, the ratio w = π/q implied by the choice (30) can be expressed as
w(x ̃i ) = pθ(x ̃it | Yt) = pθ(x ̃it | Yt) = pθ(yt | x ̃it) (33) t q(x ̃it) pθ(x ̃it | Yt−1) pθ(yt | Yt−1)
where the measurement update equation (11) is used in progress- ing to the last equality.
According to the model (1), the numerator in this expression is simply the pdf of gt(xt,ut,et,θ) for given x ̃it,ut,θ and hence computable. Additionally, the denominator in (33) is independent
The essential idea is to evaluate integrals by a randomised approach that employs the strong law of large numbers (SLLN). For example, if it is possible to build a random number generator that delivers (suitably uncorrelated) realisations {xi} with respect to a given target probability density π (x), then by the SLLN, for a given

of x ̃it , and hence simply a normalising constant to ensure unit total probability so that
(34)
This analysis suggests a recursive technique of taking realisations xit−1 ∼ pθ (xt−1 | Yt−1), using them to generate candidate x ̃it via the proposal (30), and then resampling them using the density (34) to deliver realisations xit ∼ pθ (xt | Yt ). Such an approach is known as sequential importance resampling (SIR) or, more informally, the realisations {xjt }, {x ̃it } are known as particles, and the method is known as particle filtering.
Algorithm 2 (Basic Particle Filter).
(1) Initialize particles, {xi0 }Mi=1 ∼ pθ (x0 ) and set t = 1.
(2) Predict the particles by drawing M i.i.d. samples according to x ̃it ∼ pθ(x ̃t|xit−1), i = 1,…,M. (35)
Recognising that while resampling is necessary, it need not be done at each time step t, and recognising the possibility for alternatives to the choice (32) for the proposal density have lead to a range of different particle filtering methods (Doucet et al., 2001). All deliver values {wti }, {x ̃it }, {xit } such that arbitrary integrals with respect to a target density pθ (xt | Yt ) can be approximately computed via sums such as (26) and (38a).
A mathematical abstraction, which is a useful way of encapsu- lating this deliverable, is the discrete Dirac delta approximation of pθ(xt |Yt)givenby
1M
w(x ̃i)= p (y |x ̃i), κ=−p (y |x ̃i).
tκθtt θtt i=1
T.B. Schön et al. / Automatica 47 (2011) 39–49 43
(3) Compute the importance weights {wti }Mi=1 ,
Underlying this abstraction is the understanding that substituting pθ for pθ delivers finite sum approximations to integrals involving
pθ.
5.2. Particle smoother
The stochastic sampling approach for computing expectations with respect to the filtered density pθ (xt | Yt ) can be extended
wi 􏰃 w(x ̃i ) = pθ(yt|x ̃it) , ttM
∑pθ(yt|x ̃jt) j=1
i = 1,…,M.
(36)
to accommodate the smoothed density pθ (xt abstraction just introduced of
| YN ). The same (40)
pθ(xt |Yt)≈pθ(xt |Yt)=
−M i=1
wtiδ(xt −x ̃it). (39)
(4) For each j = 1, . . . , M draw a new particle xjt with replacement (resample) according to,
P(xjt = x ̃it) = wti, i = 1,…,M. (37)
(5) If t < N increment t → t + 1 and return to step 2, otherwise terminate. It is important to note that a key feature of the resampling step (37) is that it takes an independent sequence {x ̃it } and delivers a dependent one {xit }. Unfortunately, this will degrade the accuracy of approximations such as (26), since by the fundamental theory underpinning the SLLN, the rate of convergence of the sum to the integral decreases as the correlation in {xit } increases (Ninness, 2000). To address this, note that the proposal values {x ̃it } are by construction independent, but distributed as x ̃it ∼ pθ (x ̃t | Yt−1). Using them, and again appealing to the law of large numbers M pθ(xt |YN)≈pθ(xt |YN)=−wi δ(xt −x ̃i) i=1 will be used to encapsulate the resulting importance sampling approximations. To achieve this, note that using the definition of conditional probability several times pθ (xt = pθ(Yt+1:N = pθ (Yt+1:N | xt+1,YN) = pθ(xt | xt+1,Yt,Yt+1:N), pθ (xt , xt+1, Yt , Yt+1:N ) pθ (xt+1, Yt , Yt+1:N ) (41a) (41b) (41c) (41d) (41e) (41f) where the last equality follows from the fact that given xt+1, by the Markov property of the model (1) there is no further information about Yt+1:N available in xt and hence pθ (Yt+1:N | xt , xt+1, Yt ) = pθ(Yt+1:N |xt+1,Yt). Consequently, via the law of total probability and Bayes’ rule ∫ pθ(xt | YN) = pθ(xt | xt+1,Yt)pθ(xt+1 | YN)dxt+1 (42a) ∫ p (x | x )p (x | Y ) = θ t+1 t θ t t pθ(xt+1 |YN)dxt+1 (42b) dxt+1. (42c) t|N t = | xt,xt+1,Yt)pθ(xt,xt+1,Yt) pθ (xt+1, Yt , Yt+1:N ) | xt , xt+1, Yt )pθ (xt | xt+1, Yt )pθ (xt+1, Yt ) pθ (xt+1, Yt , Yt+1:N ) | xt,xt+1,Yt)pθ(xt | xt+1,Yt) pθ(Yt+1:N | xt+1,Yt) g(x ̃t)pθ(x ̃t g(x ̃t)pθ(x ̃t | Yt−1)pθ(x ̃t pθ(Yt+1:N = pθ (xt | xt+1, Yt ), 1−M∫ = M g(x ̃it)w(x ̃it) ≈ = = g(x ̃t )w(x ̃t )pθ (x ̃t | Yt−1) dx ̃t (38a) (38b) i=1 ∫ ∫ p θ ( x ̃ t | Y t ) | Yt−1)dx ̃t | Yt)dx ̃t = Eθ{g(x ̃t) | Yt}(38c) where the transition from (38a) to (38b) follows by (33). Note that the expectation in (38c) is identical to that in (26) with π(xt) = p (x | Y ). However, since the sum in (38a) involves independent θit t i {x ̃t } rather than the dependent {xt } used in (26), it will generally be a more accurate approximation to the expectation. As a result it is preferable to use the left hand side of (38a) rather than the right hand side of (26). The former, due to use of the ‘‘weights’’ {w(x ̃it )} is an example of what is known as ‘‘importance sampling’’ (Ripley, 1987). This explains the middle term in the SIR name given to Algorithm 2. Of course, this suggests that the resampling step (37) is not essential, and one could simplify Algorithm 2 by removing it and simply propagating the weights {wti} for a set of particles {xit} whose positions are fixed. Unfortunately this extreme does not work over time since the resampling is critical to being able to track movementsinthetargetdensitypθ(xt |Yt). pθ(xt+1 |Yt) = pθ(xt |Yt) expresses the smoothing density pθ (xt | YN ) in terms of ∫ pθ(xt+1 | xt)pθ(xt+1 | YN) pθ(xt+1 |Yt) This the filtered density pθ (xt | Yt ) times an xt dependent integral. To compute this integral, note first that again by the law of total probability, the denominator of the integrand can be written as ∫ pθ(xt+1 | Yt) = pθ(xt+1 | xt)pθ(xt | Yt)dxt. (43) 44 T.B. Schön et al. / Automatica 47 (2011) 39–49 As explained in the previous section, the particle filter (39) may be used to compute this via importance sampling according to M pθ(xt+1 | Yt) ≈ −wti pθ(xt+1 | x ̃it). (44) i=1 To complete the integral computation, note that for the particular case of t = N, the smoothing density and the filtering density are the same, and hence the weights in (40) may be initialised as 6. The E step: computing Q (θ, θk ) These importance sampling approaches will now be employed in order to compute approximations to the terms I1,I2 and I3 in (25) that determine Q(θ,θk) via (24). Beginning with I1 and I3, the particle smoother representation (46) achieved by Algorithm 3 directly provides the importance sampling approximations wi = wi and likewise the particles x ̃i are identical. Working N|NN N i=1 backwards in time t then, we assume an importance sampling approximation (40) is available at time t + 1, and use it and (44) to compute the integral in (42c) as  I ≈I 􏰃−−wi logp(y|x ̃i). ∫ pθ(xt+1 | xt)pθ(xt+1 | YN) dxt+1 pθ(xt+1 |Yt) −M wk pθ(x ̃k |xt) ≈ t+1|N t+1 . k=1∑Mi k i w t p θ ( x ̃ t + 1 | x ̃ t ) 33 t|Nθtt t=1 i=1 k=1 k−iki t M vt 􏰃 wtpθ(x ̃t+1|x ̃t). (50) M − I≈I􏰃 wi logp(x ̃i), (48a) (48b) A vital point is that when forming these approximations, the weights {wi } are computed by Algorithms 2 and 3 run with t|N respect to the model structure (1), (2) parameterised by θk. Evaluating I2 given by (25b) is less straightforward, due to it depending on the joint density pθ (xt +1 , xt |YN ). Nevertheless, using the particle filtering representation (39) together with the smoothing representation (46a) leads to the following importance sampling approximation. Lemma 6.1. The quantity I2 defined in (25b) may be computed by an importance sampling approximation I2 based on the particle filtering and smoothing representations (39), (44) that is given by |x ̃it), (49) 11 1|Nθ1 NM i=1 The remaining pθ (xt | Yt ) term in (42c) may be represented by the particle filter (39) so that the smoothed density pθ (xt represented by −M p (x |Y )≈p (x |Y )= wi δ(x −x ̃i), | YN ) is (46a) (46b) (46c)  θtNθtN t|Ntt i=1 N−1 M M I2 ≈I2 􏰃−−−wij logpθ(x ̃j wi t|N = wi −wk t pθ(x ̃t+1|x ̃t), vk Mki  t|N t+1 t=1 i=1 j=1 (45) t+1|N ij where the weights wt|N are given by wiwj p (x ̃j |x ̃i) wij = t t+1|N θk t+1 t . t|N M ∑wtlpθk(x ̃jt+1 |x ̃lt) l=1 Proof. First, by the definition of conditional probability i=1 These developments can be summarised by the following particle smoothing algorithm. Algorithm 3 (Basic Particle Smoother). (1) Run the particle filter (Algorithm 2) and store the predicted particles {x ̃it }Mi=1 and their weights {wti }Mi=1, for t = 1, . . . , N. (2) Initialise the smoothed weights to be the terminal filtered weights {wti } at time t = N, pθ (xt+1, xt |YN ) = pθ (xt |xt+1, YN )pθ (xt+1|YN ). (51) Furthermore, by (41a)–(41f) pθ (xt |xt+1, YN ) = pθ (xt |xt+1, Yt ). (52) Substituting (52) into (51) and using Bayes’ rule in conjunction with the Markov property of the model (1) delivers (53a) (53b) Therefore, the particle filter and smoother representations (39), (46a) may be used to deliver an importance sampling approxima- tion to I2 according to ∫∫ wi =wi , i=1,...,M N|N N and set t = N − 1. (3) Compute the smoothed weights {wi weights {wti }Mi=1 and particles {x ̃it , x ̃it+1}Mi=1 (47) using the filtered via the formulae pθ (xt+1, xt |YN ) = pθ (xt |xt+1, Yt )pθ (xt+1|YN ) pθ (xt+1|xt ) pθ (xt |Yt ) }M t|N i=1 = pθ (xt+1|Yt ) pθ (xt+1|YN ). (46b), (46c). (4)Updatet → t−1.Ift > 0returntostep3,otherwise
terminate.
Like the particle filter Algorithm 2, this particle smoother is not new (Doucet, Godsill, & Andrieu, 2000). Its derivation is presented here so that the reader can fully appreciate the rationale and approximating steps that underly it. This is important since they are key aspects underlying the novel estimation methods derived here.
Note also that there are alternatives to this algorithm for providing stochastic sampling approximations to functions of the smoothed state densities (Bresler, 1986; Briers, Doucet, & Maskell, 2010; Fearnhead, Wyncoll, & Tawn, 2008; Godsill, Doucet, & West, 2004; Pillonetto & Bell, 2008). The new estimation methods developed in this paper are compatible with any method the user chooses to employ, provided it is compatible with the approximation format embodied by (40). The results presented in this paper used the method just presented as Algorithm 3.
logpθ(xt+1|xt)pθk(xt+1,xt|YN)dxt dxt+1
∫ pθk(xt+1|YN)∫ = pθ (xt+1|Yt)
k
× pθk (xt |Yt ) dxt

logpθ(xt+1|xt)pθk(xt+1|xt) dxt+1
M ∫ pθ (xt+1|YN) ≈ − wti k
log pθ (xt+1|x ̃it )pθk (xt+1|x ̃it ) dxt+1
i=1
M M ≈−−wiwj
p(x ̃j |x ̃i)
θk t+1 t logp (x ̃j |x ̃i).
pθk (xt +1 |Yt )
i=1 j=1
t t+1|N
p (x ̃j |Y ) θ t+1 t θk t+1 t

Finally, the law of total probability in combination with the particle filter (39) provides an importance sampling approximation to the denominator term given by
tolerance. Commonly this is judged via the gradient itself according to a test such as |pTj gj| ≤ ε for some user specified ε > 0.
In relation to this, it is important to appreciate that it is in fact not necessary to find a global maximiser of Q (θ , θk ). All that is necessary is to find a value θk+1 for which Q (θk+1, θk) > Q (θk, θk) since via (20) this will guarantee that L(θk+1) > L(θk). Hence, the resulting iteration θk+1 will be a better approximation than θk of the maximum likelihood estimate (8).
8. Final identification algorithm
The developments of the previous sections are now sum- marised in a formal definition of the EM-based algorithm this paper has derived for nonlinear system identification.
Algorithm 4 (Particle EM Algorithm).
(1) Set k = 0 and initialise θk such that Lθk (Y ) is finite. (2) (Expectation (E) Step):
(a) Run Algorithms 2 and 3 in order to obtain the particle filter (39) and particle smoother (46a) representations.



pθk (x ̃jt+1|xt )pθk (xt |Yt ) dxt ≈−wtlpθk(x ̃jt+1|x ̃lt). 􏰂
Again, all weights and particles in this approximation are computed by Algorithms 2 and 3 run with respect to the model structure (1), (2) parametrised by θk.
Using these importance sampling approaches, the function Q(θ,θk) given by (24), (25) may be approximately computed as
QM (θ , θk ) defined by
QM (θ, θk) =I1 +I2 +I3, (55)
pθk (x ̃jt+1|Yt ) =
(54a)
(54b)
M l=1
where I1 , I2 and I3 are given by (48a), (49) and (48b), respectively. Furthermore, the quality of this approximation can be made arbitrarily good as the number M of particles is increased.
T.B. Schön et al. / Automatica 47 (2011) 39–49 45
7. TheMstep:maximisationofQM(θ,θk) 
approximation QM (θ , θk ) is maximised with respect to θ in order to compute a new iterate θk+1 of the maximum likelihood estimate.
k+1k kk εforsomeuserchosenε > 0.Ifsatisfiedupdatek → k+1
In certain cases, such as when the nonlinearities ft and ht in the model structure (1) are linear in the parameter vector θ, it is possible to maximise QM (θ , θk ) using closed-form expressions. An example of this will be discussed in Section 10.
In general however, a closed form maximiser will not be available. In these situations, this paper proposes a gradient based search technique. For this purpose, note that via (55), (48) and (49) the gradient of Q (θ , θk ) with respect to θ is simply computable via
and return to step 2, otherwise terminate.
It is worth emphasising a point made earlier, that while the authors have found the simple particle and smoothing Algorithms 2 and 3 to be effective, the user is free to substitute alternatives if desired, provided the results they offer are compatible with the representations (39), (46a).
It is natural to question the computational requirements of this
proposed algorithm. Some specific comments relating to this will
be made in the example section following. More generally, it is
possible to identify the computation of I given by (49) and its 2
gradient (56c) as a dominating component of both the E and M steps. As is evident, it requires O(NM 2 ) floating point operations.
This indicates that the computing load is sensitive to the number M of particles employed. Balancing this, the experience of the authors has been that useful results can be achieved without requiring M to be prohibitively large. The following simulation section will provide an example illustrating this point with M = 100, and 1000 iterations of Algorithm 4 requiring approximately one minute of processor time on a standard desktop computing platform.
9. Convergence
It is natural to question the convergence properties of this iterative parameter estimation procedure. These will derive from the general EM Algorithm 1 on which it is based, for which the most fundamental convergence property is as follows.
If the EM algorithm terminates at a point θk+1 because it is a stationary point of Q (θ , θk ), then it is also a stationary point of the log-likelihood L(θ). Otherwise, the likelihood is increased in progressing from θk to θk+1 .
Lemma 9.1. Let θk+1 be generated from θk by an iteration of the EM Algorithm (21), (22). Then
L(θ ) ≥ L(θ ) ∀k = 0,1,2,…. (60) k+1 k
∂ ∂ ˆI 1 ∂ ˆI 2 ∂ ˆI 3 QM(θ,θk)= + + ,
(56a) (56b)

With an approximation QM(θ,θk) of the function Q(θ,θk)
Compute: θk+1 =argmax QM(θ,θk) (59) θ∈Θ
required in the E step (21) of the EM Algorithm 1 available,
attention now turns to the M step (22). This requires that the 
explicitly if possible, otherwise according to (57).
(4) Check the non-termination condition Q (θ , θ )−Q (θ , θ ) >
(b) Use this information together with (48a), (48b) and (49) to  
Calculate: QM (θ, θk) = I1 + I2 + I3. (58) (3) (Maximisation (M) Step):
∂θ ∂θ ∂θ ∂θ
∂ˆI1 ∂θ
−M ∂logpθ(x ̃i ) = wi 1,
1|N ∂θ
N−1MM ji
i=1
−−− ij ∂logpθ(x ̃t+1|x ̃t)
∂ˆI2
∂θ= wt|N ∂θ , (56c)
t=1 i=1 j=1 NMi
∂ˆI3 =−−wi ∂logpθ(yt|x ̃t).
(56d)
∂θ
t=1 i=1
t|N ∂θ
With this gradient available, there are a wide variety of algorithms that can be employed to develop a sequence of iterates θ = β0 , β1 , . . . that terminate at a value β∗ which seeks to maximise QM (θ , θk ).
A common theme in these approaches is that after initialisation with β0 = θk, the iterations are updated according to
∂  βj+1 =βj +αjpj, pj =Hjgj, gj = QM(θ,θk)
. (57)
Here Hj is a positive definite matrix that is used to deliver a search direction pj by modifying the gradient direction. The scalar
term αj is a step length that is chosen to ensure that QM(βj + αj pj , θk ) ≥ QM (βj , θk ). The search typically terminates when
incremental increases in QM (β , θk ) fall below a user specified

∂θ θ=β
j

46 T.B. Schön et al. / Automatica 47 (2011) 39–49
Furthermore, equality holds in this expression if and only if both Q (θk+1, θk) = Q (θk, θk),
and
pθk+1(XN | YN) = pθk(XN
(61)
(62)
10.1. Linear Gaussian system
The first example to be considered is the following simple linear time series
| YN),
hold for almost all (with respect to Lebesgue measure) XN . Proof. See Theorem 5.1 in Gibson and Ninness (2005).
xt+1 =axt +vt [vt] [0] [q 0] y=cx+e e ∼N 0,0 r
(66a)
(66b)
An important point is that the proof of this result only depends on Q (θk+1 , θk ) ≥ Q (θk , θk ) being non-decreasing at each iteration. It does not require that θk+1 be a maximiser of Q (θ , θk ).
This provides an important theoretical underpinning for the EM method foundation of Algorithm 4 developed here. Its application
is complicated by the fact that only an approximation QM(θ,θk) of Q (θ , θk ) is available. However, this approximation is arbitrarily accurate for a sufficiently large number M of particles.
pθ(yt | xt) < ∞, pθ(xt+1 | xt) < ∞, E|Q(θ,θk)|4 | YN < ∞, (64) holdforallθ,θk ∈Θ.Thenwithprobabilityone (2008). 􏰂 Together, Lemmas 9.1 and 9.2, do not establish convergence of Algorithm 4, and are not meant to imply it. Indeed, one drawback of the EM algorithm is that except under restrictive assumptions (such as convex likelihood), it is not possible to establish convergence of the iterates {θk}, even when exact computation of the E-step is possible (McLachlan & Krishnan, 2008b; Wu, 1993). The point of Lemma 9.1 is to establish that any algorithmic test that Q(θ,θk) has not decreased (such as step (4) of Algorithm 4) guarantees a non-decrease of L(θ). Hence EM is capable of matching the guaranteed non cost-decreasing property of gradient based search. Of course, this depends on the accuracy with which Q(θ,θk) can be calculated. The point of Lemma 9.2 is to establish that the particle based approximant QM (θ , θk ) used in this paper is an arbitrarily accurate approximation of Q(θ,θk). Hence Lemma 9.2 establishes a scientific basis for employing QM (θ , θk ). 10. Numerical illustrations In this section the utility and performance of the new Algorithm 4 are demonstrated on two simulation examples. The first is a linear time-invariant Gaussian system. This is profiled since an exact solution for the expectation step can be computed using the Kalman smoother (Gibson & Ninness, 2005). Comparing the results obtained by employing both this and the particle based approximations used in Algorithm 4 therefore allow the effect of the particle approximation on estimation accuracy to be judged. The performance of Algorithm 4 on a second example involving a well studied and challenging nonlinear system is then illustrated. The estimation problem is to determine just the θ = a parameter on the basis of the observations YN. Using EM methods it is straightforward to also estimate the c , q and r parameters as well (Gibson & Ninness, 2005). However, this example concentrates on a simpler case in order to focus attention on the effect of the particle filter/smoother approximations employed in Algorithm 4. Lemma 9.2. Consider the function Q (θ , θk ) defined by (24)–(25c)   where d is a constant term that is independent of a and ψ(·) and γ (·) are defined as N−1 M M ψ(ak)=−−−wij x ̃j x ̃it, (68a) t=1 i=1 j=1 NM γ(a)=−−wi (x ̃i)2. (68b) k t|N t t=1 i=1 Since QM (a, ak ) in (67) is quadratic in a, it is straightforward to solve the M step in closed form as ψ(ak) ak+1 = γ(ak). (69) Furthermore, in this linear Gaussian situation Q(θ,θk) can be computed exactly using a modified Kalman smoother (Gibson & Ninness, 2005). In this case, the exact Q (a, ak ) is again of the quadratic form (67) after straightforward re-definitions of ψ and γ , so the ‘‘exact’’ M step also has the closed form solution (69). This ‘‘exact EM’’ solution can then be profiled versus the new particle filter/smoother based EM method (67)–(69) of this paper in order to assess the effect of the approximations implied by the particle approach. This comparison was made by conducting a Monte Carlo study over 1000 different realisations of data YN with N = 100. For each realisation, ML estimates a were computed using the exact EM solution provided by Gibson and Ninness (2005), and via the approximate EM method of Algorithm 4. The latter was done for and its SIR approximation QM (θ , θk ) defined by (48a)–(49) and (55) which is based on M particles. Suppose that ∀θ,θk ∈ Θ. (65) Proof. By application of Corollary 6.1 in Hu, Schön, and Ljung lim QM(θ,θk) = Q(θ,θk), M→∞  􏰂 with the true parameters given by θ⋆ = a⋆, c⋆, q⋆, r⋆ = [0.9, 0.5, 0.1, 0.01] .  More specifically, via Algorithm 4, a particle based approxima- tion Q (a, a ) can be expressed as (63) t|N t+1 tttt Mk QM (a, ak) = −γ (ak)a2 + 2ψ(ak)a + d, (67)  two cases of M = 10 and M = 500 particles. In all cases, the initial ⋆  realisations, a point is plotted with x co-ordinate the likelihood value L(a) achieved by 100 iterations of the exact EM method, and y co-ordinate the value achieved by 100 iterations of Algorithm 4. Clearly, if both approaches produced the same estimate, all the points plotted in this manner should lie on the solid y = x line shown in Fig. 1. For the case of M = 500 particles, where the points are plotted with a cross ‘x’, this is very close to being the case. This illustrates that with sufficient number of particles, the use of the approximation QM in Algorithm 4 can have negligible detrimental effect on the final estimate produced. Also plotted in Fig. 1 using an ‘o’ symbol, are the results obtained using only M = 10 particles. Despite this being what could be considered a very small number of particles, there is still generally reasonable, and often good agreement between the associated approximate and exact estimation results. value a0 was set to the true value a . The results are shown in Fig. 1. There, for each of the 1000 Fig. 1. Comparison of the likelihood values for the final estimates after 100 iterations of the exact EM method and the particle EM method given in Algorithm 4 using both M = 10 and M = 500 particles. 10.2. A nonlinear and non-Gaussian system A more challenging situation is now considered that involves the following nonlinear and time-varying system xt+1 =axt +b xt +ccos(1.2t)+vt, 1 + x 2t yt =dx2t +et, [vt] [0] [q 0] e∼N0,0r t where the true parameters in this case are θ⋆ = a⋆, b⋆, c⋆, d⋆, q⋆, r⋆ = [0.5, 25, 8, 0.05, 0, 0.1] . (70a) (70b) (70c) (71) Fig. 2. maxima. These could reasonably be expected to create significant difficulties for iterative search methods, such as gradient based search schemes for maximising Lθ (Y ). However, in this simplified case, the EM-based method of this paper seems quite robust against capture in these local maxima. For example, the trajectory of the parameter estimates over 100 iterations of Algorithm 4 and over 100 different length N = 100 data realisations, and 100 random initialisations for the b parameter, with the q parameter initialised at q = 0.001 are shown in Fig. 3. Here, M = 50 particles were employed, and in all cases, an effective estimate of the true parameter value b = 25 was obtained. The means whereby the EM-based Algorithm 4 achieves this are illustrated by profiling the function Q (θ , θk ) initialised at [b0, q0] = [40, 0.001] for k = 1, 10 and 100 as the dotted, dash–dotted and dashed lines, respectively. Clearly, in each case the Q (θ , θk ) function is a much more straightforward maximisation problem than that of the log-likelihood Lθ (Y ). Furthermore, by virtue of the essential property (20), at each iteration directions of increasing Q (θ , θk ) can be seen to coincide with directions of increasing Lθ (Y ). As a result, difficulties associated with the local maxima of Lθ (Y ) are avoided. To study this further, the trajectory of EM-based estimates θk = [bk , qk ]T for this example is plotted in relation to the two dimensional log-likelihood surface Lθ (Y ) in Fig. 4. Clearly, the iterates have taken a path circumventing the highly irregular ‘‘slice’’ at q = 0 illustrated in Fig. 2. As a result, the bulk of them lie in much better behaved regions of the likelihood surface. This type of behaviour with associated robustness to get captured in local minima is widely recognised and associated with the EM algorithm in the statistics literature (McLachlan & Krishnan, 2008b). Within this literature, there are broad explanations for this This has been chosen due to it being acknowledged as a challenging estimation problem in several previous studies in the area (Doucet et al., 2000; Godsill et al., 2004; Gordon et al., 1993). To test the effectiveness of Algorithm 4 in this situation, a Monte Carlo study was again performed using 104 different data realisations YN of length N = 100. For each of these cases, an estimate θ was computed using 1000 iterations of Algorithm 4 with initialisation θ0 being chosen randomly, but such that each entry of θ0 laid in an interval equal to 50% of the corresponding entry in the true parameter vector θ ⋆ . In all cases M = 100 particles were used. Using these choices, each computation of θ using Algorithm 4 took 58 s to complete on a 3 GHz quad-core Xeon running Mac OS 10.5. The results of this Monte Carlo examination are provided in Table 1, where the rightmost column gives the sample mean of the parameter estimate across the Monte Carlo trials plus/minus the sample standard deviation. Note that 8 of the 104 trials were not included in these calculations due to capture in local minima, which was defined according to the relative error test |(θi − θi⋆ )/(θi⋆ )| > 0.1 for any i’th component. Considering the random initialisation and the small number of censored trials, the results in Table 1 are considered to be successful.
It is instructive to further examine the nature of both this estimation problem and the EM-based solution. For this purpose consider the situation where only the b and q parameters are to be estimated. In this case, the log-likelihood Lθ (Y ) as a function of b with q = q⋆ = 0 is shown as the solid line in Fig. 2. Clearly the log- likelihood exhibits quite erratic behaviour with very many local
T.B. Schön et al. / Automatica 47 (2011) 39–49 47

Table 1
True and estimated parameter values for (70); mean value and standard deviations are shown for the estimates based on 104 Monte Carlo runs.
Parameter
a b c d q r
True Estimated
0.5 0.50 ± 0.0019 25.0 25.0 ± 0.99 8.0 7.99 ± 0.13
0.05 0.05 ± 0.0026
0. 7.78 × 10−5 ± 7.6 × 10−5 0.1 0.106 ± 0.015
The true log-likelihood function is shown as a function of the b parameter. Superimposed onto this plot are three instances of the Q (θ , θk ) function, defined in (15a). Clearly, as the EM algorithm evolves, then locally around the global maximiser, the approximation Q (θ , θk ) resembles the log-likelihood Lθ (Y ) more closely.

48
T.B. Schön et al. / Automatica 47 (2011) 39–49
Fig. 3.
criterion was employed, mainly due to the general statistical ef- ficiency of such an approach. This problem is then solved using the expectation maximisation algorithm, which in turn required a non- linear smoothing problem to be solved. This was handled using a particle smoothing algorithm. Finally, the utility and performance of the new algorithm were demonstrated using two simulation ex- amples.
References
Anderson, B. D. O., & Moore, J. B. (1979). Optimal filtering. New Jersey: Prentice-Hall. Andrieu, C., Doucet, A., Singh, S. S., & Tadić, V. B. (2004). Particle methods for change detection, system identification, and contol. Proceedings of the IEEE,
92(3), 423–438.
Bendat, J. S. (1990). Nonlinear system analysis and identification from random data.
Wiley Interscience.
Bohlin, T. (2006). Practical grey-box process identification: theory and applications.
Springer.
Bresler, Y. (1986). Two-filter formulae for discrete-time non-linear Bayesian
smoothing. International Journal of Control, 43(2), 629–641.
Briers, M., Doucet, A., & Maskell, S. R. (2010). Smoothing algorithms for state-space
models. Annals of the Institute of Statistical Mathematics, 62(1), 61–89. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1),
1–38.
Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained
optimization and nonlinear equations. Prentice Hall.
Doucet, A., de Freitas, N., & Gordon, N. (Eds.). (2001). Sequential Monte Carlo methods
in practice. Springer Verlag.
Doucet, A., Godsill, S. J., & Andrieu, C. (2000). On sequential Monte Carlo sampling
methods for Bayesian filtering. Statistics and Computing, 10(3), 197–208. Doucet, A., & Tadić, V. B. (2003). Parameter estimation in general state-space models using particle methods. Annals of the Institute of Statistical Mathematics, 55,
409–422.
Duncan, S., & Gyöngy, M. (2006). Using the EM algorithm to estimate the disease
parameters for smallpox in 17th century London. In Proceedings of the IEEE international conference on control applications. Munich, Germany. October (pp. 3312–3317).
Fearnhead, P., Wyncoll, D., & Tawn, J. (2008). A sequential smoothing algorithm with linear computational cost. Technical report. Department of Mathematics and Statistics, Lancaster University, Lancaster, UK. May.
Ghaharamani, Z., & Roweis, S. T. (1999). Learning nonlinear dynamical systems using an EM algorithm. In Advances in neural information processing systems: Vol. 11 (pp. 599–605). MIT Press.
Gibson, S., & Ninness, B. (2005). Robust maximum-likelihood estimation of multivariable dynamic systems. Automatica, 41(10), 1667–1682.
Gibson, S., Wills, A., & Ninness, B. (2005). Maximum-likelihood parameter estimation of bilinear systems. IEEE Transactions on Automatic Control, 50(10), 1581–1596.
Godsill, S. J., Doucet, A., & West, M. (2004). Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99(465), 156–168. Goodwin, G. C., & Agüero, J. C. (2005). Approximate EM algorithms for parameter
and state estimation in nonlinear stochastic models. In Proceedings of the 44th IEEE conference on decision and control (CDC) and the European control conference. ECC. Seville, Spain. December (pp. 368–373).
Gopaluni, R. B. (2008). A particle filter approach to identification of nonlinear processes under missing observations. The Canadian Journal of Chemical Engineering, 86(6), 1081–1092.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). A novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings on Radar and Signal Processing, 140, 107–113.
Graebe, S. (1990). Theory and implementation of gray box identification. Ph.D. thesis. Royal Institute of Technology. Stockholm, Sweden. June.
Hu, X.-L., Schön, T. B., & Ljung, L. (2008). A basic convergence result for particle filtering. IEEE Transactions on Signal Processing, 56(4), 1337–1348.
Jazwinski, A. H. (1970). Mathematics in science and engineering, Stochastic processes and filtering theory. New York, USA: Academic Press.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45.
Kim, J., & Stoffer, D. S. (2008). Fitting stochastic volatility models in the presence of irregular sampling via particle methods and the EM algorithm. Journal of Time Series Analysis, 29(5), 811–833.
Kristensen, N. R., Madsen, H., & Jorgensen, S. B. (2004). Parameter estimation in stochastic grey-box models. Automatica, 40(2), 225–237.
Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, 57(2), 425–437.
Leontaritis, I. J., & Billings, S. A. (1985). Input–output parametric models for non- linear systems. part II: stochastic non-linear systems. International Journal of Control, 41(2), 329–344.
Ljung, L. (1999). System sciences series, System identification, theory for the user (second ed.) Upper Saddle River, NJ, USA: Prentice Hall.
Ljung, L. (2008). Perspectives on system identification. In Plenary talk at the 17th IFAC world congress. Seoul, Korea. July 6–11.
Top: parameter b estimates as a function of iteration number (horizontal line indicates the true parameter value at b = 25). Bottom: log10(q) parameter estimates as a function of iteration number.
Fig. 4. The log-likelihood is here plotted as a function of the two parameters b and q. Overlaying this are the parameter estimates θk = [bk , qk ]T produced by Algorithm 4.
advantage, such as the fact that (20) implies that Q(θ,θk) forms a global approximation to the log-likelihood Lθ (Y ) as opposed to the local approximations that are implicit to gradient based search schemes. However, a detailed understanding of this phenomenon is an important open research question deserving further study.
A further intriguing feature of the EM algorithm is that while
(20) implies that local maxima of Lθ (Y ) may be fixed points of the
algorithm, there may be further fixed points. For example, in the
situation just studied where the true q⋆ = 0, if the EM algorithm
is initialised with q0 = 0, then all iterations θk will be equal to θ0,
regardlessofwhattheotherentriesinθ are. 0
This occurs because with vt = 0 in (1a), the smoothing step delivers state estimates completely consistent with θ0 (a deterministic simulation arises in the sampling (2a)), and hence the maximisation step then delivers back re-estimates that reflect this, and hence are unchanged. While this is easily avoided by always initialising q0 as non-zero, a full understanding of this aspect and the question of further fixed points are also worthy of further study.
11. Conclusion
The contribution in this paper is a novel algorithm for identify- ing the unknown parameters in general stochastic nonlinear state- space models. To formulate the problem a maximum likelihood

Ljung, L., & Vicino, A. (Eds.). (2005). System identification: linear vs. nonlinear. IEEE Transactions on Automatic Control (special issue).
Bode lecture: challenges of nonlinear system identification. December 2003. McLachlan, G., & Krishnan, T. (2008a). Whiley series in probability and statistics, The
EM algorithm and extensions (2nd ed.) New York, USA: John Wiley & Sons. McLachlan, G., & Krishnan, T. (2008b). The EM algorithm and extensions (2nd ed.)
John Wiley and Sons.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).
Equations of state calculations by fast computing machine. Journal of Chemical
Physics, 21(6), 1087–1092.
Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American
Statistical Association, 44(247), 335–341.
Narendra, K., & Parthasarathy, K. (1990). Identification and control of dynamical
systems using neural networks. IEEE Transactions on Neural Networks, 1,
4–27.
Ninness, B. (2000). Strong laws of large numbers under weak assumptions with
application. IEEE Transactions on Automatic Control, 45(11), 2117–2122. Pillonetto, G., & Bell, B. M. (2008). Optimal smoothing of non-linear dynamic
systems via Monte Carlo Markov chains. Automatica, 44(7), 1676–1685. Poyiadjis, G., Doucet, A., & Singh, S. S. (2005). Maximum likelihhood parameter estimation in general state-space models using particle methods. In Proceedings
of the American statistical association. Minneapolis, USA. August.
Rangan, S., Wolodkin, G., & Poolla, K. (1995). New results for Hammerstein system identification. In Proceedings of the 34th IEEE conference on decision and control.
New Orleans, USA. December (pp. 697–702).
Ripley, B. D. (1987). Stochastic simulation. Wiley.
Roweis, S. T., & Ghaharamani, Z. (2001). Kalman filtering and neural networks.
In S. Haykin (Ed.), Learning nonlinear dynamical systems using the expectation
maximization algorithm (pp. 175–216). John Wiley & Sons (Chapter 6). Salamon, P., Sibani, P., & Frost, R. (2002). Facts, conjectures, and improvements fir
simulated annealing. Philadelphia: SIAM.
Schön, T. B., Wills, A., & Ninness, B. (2006). Maximum likelihood nonlinear system
estimation. In Proceedings of the 14th IFAC symposium on system identification.
SYSID. Newcastle, Australia. March (pp. 1003–1008).
Söderström, T., & Stoica, P. (1989). System identification. In Systems and control
engineering. Prentice Hall.
Wills, A. G., Schön, T. B., & Ninness, B. (2008). Parameter estimation for discrete-
time nonlinear systems using EM. In Proceedings of the 17th IFAC world congress.
Seoul, South Korea. July.
Wright, M. H. (1996). Direct search methods: once scorned, now respectable. In
Numerical analysis 1995. Dundee, 1995. Longman, Harlow (pp. 191–208). Wu, C. (1983). The convergence properties of the EM algorithm.
Thomas B. Schön was born in Sweden in 1977. He received the B.Sc. degree in Business Administration and Economics in Feb. 2001, the M.Sc. degree in Applied Physics and Electrical Engineering in Feb. 2001 and the Ph.D. degree in Automatic Control in Feb. 2006, all from Linköping University, Linköping, Sweden. He has held visiting positions at the University of Cambridge (UK) and the University of Newcastle (Australia). His research interests are mainly within the areas of system identification, sensor fusion and signal processing, with applications to the automotive and aerospace industry. He
is currently an Associate Professor at Linköping University.
Adrian Wills was born in Orange, N.S.W., Australia and received his B.E. (Elec.) and Ph.D. degrees from The University of Newcastle, Australia (Callaghan Campus) in May 1999 and May 2003, respectively. Since then he has held a postdoctoral research position at Newcastle, where the focus of his research has been in the area of system identification, and model predictive control. Between April and September 2007 he held a visiting position at Linköping University, Linköping, Sweden.
Brett Ninness was born in 1963 in Singleton, Australia and received his BE, ME and Ph.D. degrees in Electrical Engineering from the University of Newcastle, Australia in 1986, 1991 and 1994 respectively. He has stayed with the School of Electrical Engineering and Computer Science at the University of Newcastle since 1993, where he is currently a Professor.
His research interests are in the areas of system identification and stochastic signal processing, in which he has authored approximately one hundred papers in journals and conference proceedings. He has served on
the editorial boards of Automatica, IEEE Transactions on Automatic Control and is currently Editor in Chief for IEE Control Theory and Applications.
Together with Håkan Hjalmarsson he jointly organised the 14th IFAC Symposium on System Identification in Newcastle, Australia in 2006. Further details of his professional activities are available at http://sigpromu.org/brett.
T.B. Schön et al. / Automatica 47 (2011) 39–49 49