11. The Unscented Kalman Filter
Change log: 2020-03-05: (1) Corrected matrix names R and Q to W and V ; (2) corrected summation index for sample mean/variance.
In the previous chapter we discussed a first approach to applying the Kalman filter to nonlinear systems. There, the approach was to apply the Kalman filter to the first-order approximation of the nonlinear system equations. In this chapter we will present an alternative approach, which uses a specially selected set of points to approximate the random variables. These points are then transformed through the (full) nonlinear functions, and from the transformed points the relevant statistics are recovered. We will see that the resulting Unscented Kalman Filter (UKF) is often easier to implement than the EKF, and often provides better performance.
11.1 Yet another derivation of the Kalman filter gain
Given a random variable $x$ and an observation of that variable $z$, corrupted by measurement noise $w$ as
\[
z = h(x, w)
\]
where $w$ is independent of $x$, and we know:
\[
E[x] = \hat{x}_p, \quad \mathrm{Var}[x] = P_p, \quad E[w] = 0, \quad \mathrm{Var}[w] = W.
\]
We derived the Kalman filter gain $K$ as that linear update rule that minimizes the trace of the estimation error's variance, where the update estimate is given by
\[
\hat{x}_m = \hat{x}_p + K\,(z - \hat{z}) \tag{11.1}
\]
where $z$ is the observation, and $\hat{z}$ is the expected measurement.
For linear measurements, with $z = Hx + w$, the gain is computed as
\[
K = P_p H^T \left( H P_p H^T + W \right)^{-1}.
\]
This can be rewritten as follows, introducing the cross-covariances $P_{xz}$ and $P_{zz}$ as
\begin{align}
P_{xz} &:= E\left[(x - \hat{x}_p)(z - \hat{z})^T\right] \\
&= E\left[(x - \hat{x}_p)(Hx + w - H\hat{x}_p)^T\right] \\
&= E\left[(x - \hat{x}_p)(x - \hat{x}_p)^T\right] H^T + E\left[(x - \hat{x}_p)\, w^T\right] \\
&= P_p H^T \quad \text{(by independence of $w$ and $x$)} \\
P_{zz} &:= E\left[(z - \hat{z})(z - \hat{z})^T\right] \\
&= E\left[(Hx + w - H\hat{x}_p)(Hx + w - H\hat{x}_p)^T\right] \\
&= H E\left[(x - \hat{x}_p)(x - \hat{x}_p)^T\right] H^T + E\left[w w^T\right] + H E\left[(x - \hat{x}_p)\, w^T\right] + E\left[w\, (x - \hat{x}_p)^T\right] H^T \\
&= H P_p H^T + W
\end{align}
This allows us to write
\[
K = P_{xz} P_{zz}^{-1}. \tag{11.2}
\]
Note that the covariance update can then also be rewritten, as
\begin{align}
P_m = (I - KH)\, P_p &= P_p - K H P_p \\
&= P_p - P_{xz} P_{zz}^{-1} P_{xz}^T \\
&= P_p - P_{xz} P_{zz}^{-1} P_{zz} P_{zz}^{-1} P_{xz}^T \\
&= P_p - K P_{zz} K^T \tag{11.3}
\end{align}
We can now use (11.1), (11.2), and (11.3) to do the Kalman filter update if we have an alternative method to compute the expected measurement zˆ and the statistics Pxz and Pzz. This would then not require the linearization H.
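For concreteness, here is a minimal NumPy sketch of the update (11.1)–(11.3) (function and variable names are our own); note that only $\hat{z}$, $P_{xz}$, and $P_{zz}$ are needed, however they were obtained:

```python
import numpy as np

def kf_update(x_p, P_p, z, z_hat, P_xz, P_zz):
    """Measurement update via (11.1)-(11.3); no linearization H required."""
    K = P_xz @ np.linalg.inv(P_zz)     # K = P_xz P_zz^{-1}, eq. (11.2)
    x_m = x_p + K @ (z - z_hat)        # eq. (11.1)
    P_m = P_p - K @ P_zz @ K.T         # eq. (11.3)
    return x_m, P_m
```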
11.1.1 Nonlinear systems
Application of this Kalman filter gain to nonlinear systems requires the computation of the statistics Pxz and Pzz for nonlinear functions h. We have seen that this can be done
11 – 2
using the change of variables formula, though it typically is intractable to compute (and furthermore, using the change of variables formula requires the full PDF of the random variables – often we only have access to their mean and covariance).
For the extended Kalman filter, we computed $P_{xz}$ and $P_{zz}$ through a first-order Taylor series expansion of the measurement model, and used $\hat{z} = h(\hat{x}_p, 0)$ for the expected measurement. The unscented filter is simply an alternative approach to computing these quantities that does not require an approximation of the function $h(\cdot, \cdot)$.
11.2 The Unscented Transform
The unscented transform is a method for approximating the mean and variance of a random variable transformed through a nonlinear function. Before we state the transform, we provide a motivating example.
11.2.1 Motivating example: polar to Cartesian coordinates
Consider a radar that measures the bearing θ and range r to a target (that is, its position in polar coordinates). The bearing and range measurements are corrupted by independent Gaussian noises – a representative sensor may have 2 cm standard deviation on the range, and 15° = π/12 rad standard deviation on the bearing.
We may be interested in the target's position in the Cartesian plane, that is, its position x = (x1, x2) computed from the measurements as
\[
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = g(\xi) = \begin{pmatrix} r \cos\theta \\ r \sin\theta \end{pmatrix}
\]
where ξ = (r, θ). The problem is now to compute the mean and variance of x, given knowledge of r and θ.
For this example, the true target is located at x1 = 1 and x2 = 0.
Sample-based approximation
One approach is to generate samples of $r$ and $\theta$, transform the samples through $g(\cdot)$, and then compute the resulting sample statistics. Let $\check{\xi}_i$ be samples drawn from a distribution so that the sample mean and variance match the underlying variable's mean and variance:
\[
\lim_{N\to\infty} \frac{1}{N} \sum_{i=0}^{N-1} \check{\xi}_i = E[\xi], \qquad
\lim_{N\to\infty} \frac{1}{N-1} \sum_{i=0}^{N-1} \left(\check{\xi}_i - E[\xi]\right)\left(\check{\xi}_i - E[\xi]\right)^T = \mathrm{Var}[\xi]
\]
For each sample $\check{\xi}_i$ we compute a transformed sample $\check{x}_i = g(\check{\xi}_i)$. Under the law of large numbers (to be made precise later), we can then say
\[
\lim_{N\to\infty} \frac{1}{N} \sum_{i=0}^{N-1} \check{x}_i = E[x], \qquad
\lim_{N\to\infty} \frac{1}{N-1} \sum_{i=0}^{N-1} \left(\check{x}_i - E[x]\right)\left(\check{x}_i - E[x]\right)^T = \mathrm{Var}[x]
\]
That is, we can approximate the mean and variance of x from the samples xˇi. Note that, for finite N, these approximations are themselves random variables (as they are composed of random variables).
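A minimal NumPy sketch of this sampling approach, using the example's noise parameters (all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Samples of xi = (r, theta): range ~ N(1, 0.02^2), bearing ~ N(0, (pi/12)^2)
r = rng.normal(1.0, 0.02, N)
theta = rng.normal(0.0, np.pi / 12, N)

# Transform each sample through g and compute the sample statistics
x = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
mean_x = x.mean(axis=0)              # approximates E[x]
cov_x = np.cov(x, rowvar=False)      # approximates Var[x], 1/(N-1) normalization
```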
Below, we show the position solutions for N = 1000 range and bearing measurements, for a true target at the position (x1,x2) = (1,0). On the left, the samples are shown in the (r,θ) plane, and on the right the transformed samples in (x1,x2). We note the crescent shape of the position samples.
[Figure: left, the $N = 1000$ samples in the $(r, \theta)$ plane (range $r$ [m] vs. bearing $\theta$ [deg]); right, the transformed samples in the $(x_1, x_2)$ plane [m], showing the crescent shape.]
Though this is a good way of approximating the statistics of the transformed variable, sampling many points is computationally expensive, and the resulting estimates are themselves random.
First-order approximation
We can try to approximate the mean and variance of x1 and x2 using a first-order approach, similar to what we did for the EKF.
\[
x = g(\xi) = g(\hat{\xi}) + \frac{\partial g(\hat{\xi})}{\partial \xi}\left(\xi - \hat{\xi}\right) + \dots \tag{11.4}
\]
With $g(\xi) = (r\cos\theta,\, r\sin\theta)$ and the linearization point $\hat{\theta} = 0$, $\hat{r} = 1$:
\begin{align}
A &:= \frac{\partial g(\hat{\xi})}{\partial \xi} = \begin{pmatrix} \cos\hat{\theta} & -\hat{r}\sin\hat{\theta} \\ \sin\hat{\theta} & \hat{r}\cos\hat{\theta} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{11.5} \\
E[x] &\approx g(\hat{\xi}) = \begin{pmatrix} \hat{r}\cos\hat{\theta} \\ \hat{r}\sin\hat{\theta} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \tag{11.6} \\
\mathrm{Var}[x] &\approx A \,\mathrm{Var}[\xi]\, A^T = \begin{pmatrix} 0.02^2 & 0 \\ 0 & (\pi/12)^2 \end{pmatrix} \tag{11.7}
\end{align}
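These first-order statistics are straightforward to compute numerically; a minimal NumPy sketch (names our own):

```python
import numpy as np

# Linearization point and input covariance
r_hat, th_hat = 1.0, 0.0
P_xi = np.diag([0.02**2, (np.pi / 12)**2])

# Jacobian A of g(r, theta) = (r cos(theta), r sin(theta)), eq. (11.5)
A = np.array([[np.cos(th_hat), -r_hat * np.sin(th_hat)],
              [np.sin(th_hat),  r_hat * np.cos(th_hat)]])

mean_x = np.array([r_hat * np.cos(th_hat), r_hat * np.sin(th_hat)])  # (11.6)
cov_x = A @ P_xi @ A.T                                               # (11.7)
```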
We can now compare these predicted statistics to the sample statistics for the samples from the distribution; this is shown in the figure below:
[Figure: the sample cloud with the first-order mean and covariance ellipse in the $(x_1, x_2)$ plane [m]; legend: samples, first-order.]

We note two properties of the approximation:
• the mean is incorrect – the sample mean lies more than one estimated standard deviation away from the estimated mean, and
• the variance in x1 is significantly under-estimated.
Therefore, in this case, the first-order approximation fails to capture important information of the problem. The unscented transform, below, presents an alternative method for computing the mean and variance of the transformed variables. It will be shown to perform much better for this example.
11.2.2 The unscented transform
The unscented transform∗ proceeds by selecting a deterministic set of points to represent the mean and variance of a random variable. These points are then transformed through the nonlinear function in question, and the mean and variance of the resulting points are used as an approximation for the mean and variance of the true random variable transformed through the nonlinear function. The intuition is that it is easier to approximate a distribution than to approximate a nonlinear function.
Note that we will select only a small set of points (unlike the example before), and we will select them deterministically.
Here, we will only present the most basic point selection scheme. We will call these points “sigma-points”.
Let $x \in \mathbb{R}^n$ and $y = g(x) \in \mathbb{R}^m$, with the mean and variance of $x$ known:
\[
E[x] = \mu_x, \qquad \mathrm{Var}[x] = P_x.
\]
We define a set of $2n$ sigma-points $s_{x,i}$ for $i \in \{0, 1, 2, \dots, 2n-1\}$, computed as below, for $i \in \{0, 1, \dots, n-1\}$:
\[
s_{x,i} = \mu_x + \left(\sqrt{n P_x}\right)_i, \qquad
s_{x,n+i} = \mu_x - \left(\sqrt{n P_x}\right)_i
\]
where $\sqrt{A}$ is a matrix square root satisfying $A = \sqrt{A}\sqrt{A}^T$ for $A = A^T > 0$ (one way to compute this is with the Cholesky decomposition) and $(A)_i$ represents the $i$'th column of the matrix $A$. These sigma-points lie on the covariance contour of $x$.

∗About the name (by Jeffrey Uhlmann, from http://ethw.org/First-Hand:The_Unscented_Transform):

Initially I only referred to it as the "new filter." Needing a more specific name, people in my lab began referring to it as the "Uhlmann filter," which obviously isn't a name that I could use, so I had to come up with an official term. One evening everyone else in the lab was at the Royal Opera House, and as I was working I noticed someone's deodorant on a desk. The word "unscented" caught my eye as the perfect technical term. At first people in the lab thought it was absurd – which is okay because absurdity is my guiding principle – and that it wouldn't catch on. My claim was that people simply accept technical terms as technical terms: for example, does anyone think about why a tree is called a tree?
We now define the transformed sigma-points $s_{y,i}$ as
\[
s_{y,i} := g(s_{x,i}) \quad \text{for } i \in \{0, 1, 2, \dots, 2n-1\}
\]
from which we compute an approximation for the mean and variance of $y$:
\begin{align}
E[y] &\approx \mu_y := \frac{1}{2n}\sum_{i=0}^{2n-1} s_{y,i} \tag{11.8} \\
\mathrm{Var}[y] &\approx P_y := \frac{1}{2n}\sum_{i=0}^{2n-1} \left(s_{y,i} - \mu_y\right)\left(s_{y,i} - \mu_y\right)^T \tag{11.9}
\end{align}
Remarks:
• This allows for easy ‘rapid-prototyping’: we don’t need to explicitly compute the derivatives (or any other properties) of the function g, so it is easy to evaluate changes in g.
• This requires computing the matrix square root of an n × n matrix, which may be computationally expensive, with complexity O(n³). However, the matrix multiplications for the first-order approach have the same complexity.
• The function g need not be differentiable (or even continuous) for this approximation.
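The transform is compact enough to implement directly. Below is a minimal NumPy sketch of the sigma-point selection and of (11.8)–(11.9) (all names are our own):

```python
import numpy as np

def unscented_transform(g, mu_x, P_x):
    """Basic unscented transform with 2n equally weighted sigma-points.

    g: function mapping an n-vector to an m-vector.
    Returns (mu_y, P_y) approximating E[g(x)], Var[g(x)].
    """
    n = mu_x.size
    # Matrix square root of n*P_x via Cholesky: n*P_x = S S^T
    S = np.linalg.cholesky(n * P_x)
    # Sigma-points: mu_x plus/minus the columns of S
    s_x = np.vstack([mu_x + S.T, mu_x - S.T])   # shape (2n, n)
    # Transform each sigma-point through the full nonlinearity
    s_y = np.array([g(s) for s in s_x])         # shape (2n, m)
    # Equally weighted sample mean and variance, eqs. (11.8)-(11.9)
    mu_y = s_y.mean(axis=0)
    d = s_y - mu_y
    P_y = d.T @ d / (2 * n)
    return mu_y, P_y
```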
11.2.3 Revisiting the example
We now apply the unscented transform to the example of Section 11.2.1. Recall $\xi = (r, \theta)$ with $E[\xi] = \mu_\xi = (1, 0)$ and $\mathrm{Var}[\xi] = P_\xi = \mathrm{diag}\left(0.02^2, (\pi/12)^2\right)$. We defined the corresponding Cartesian position as $x = (x_1, x_2) = g(\xi) = (r\cos\theta, r\sin\theta)$.

We first compute the $2n = 4$ sigma-points $s_{\xi,i}$, for which we require the matrix square root of $2P_\xi$ (easy here): $\sqrt{2P_\xi} = \mathrm{diag}\left(0.02\sqrt{2},\ \tfrac{\pi}{12}\sqrt{2}\right)$. This gives
\[
s_{\xi,0} = (1.028\dots, 0), \quad s_{\xi,1} = (1, 0.37\dots), \quad s_{\xi,2} = (0.971\dots, 0), \quad s_{\xi,3} = (1, -0.37\dots).
\]
Each sigma-point is transformed to generate a corresponding $s_{x,i}$:
\[
s_{x,0} = g(s_{\xi,0}) = (1.028\dots, 0), \quad s_{x,1} = g(s_{\xi,1}) = (0.932\dots, 0.361\dots),
\]
\[
s_{x,2} = g(s_{\xi,2}) = (0.971\dots, 0), \quad s_{x,3} = g(s_{\xi,3}) = (0.932\dots, -0.361\dots),
\]
and we can compute the mean and variance approximations as
\begin{align}
E[x] \approx \mu_x &:= \frac{1}{4}\sum_{i=0}^{3} s_{x,i} = \begin{pmatrix} 0.966\dots \\ 0 \end{pmatrix} \tag{11.10} \\
\mathrm{Var}[x] \approx P_x &:= \frac{1}{4}\sum_{i=0}^{3} \left(s_{x,i} - \mu_x\right)\left(s_{x,i} - \mu_x\right)^T = \begin{pmatrix} 0.0015\dots & 0 \\ 0 & 0.065\dots \end{pmatrix} \tag{11.11}
\end{align}
The resulting estimate is shown in the figure below, where it is compared to the first-order estimate, and the empirical sample mean and variance. We note that the unscented transform mean estimate is extremely close to the empirical mean, and that the variance ellipse matches the empirical sample variance very closely. It is clear that the unscented transform yields a much better approximation of the true statistics than the first-order method.
[Figure: the sample cloud with the first-order and unscented mean/covariance estimates in the $(x_1, x_2)$ plane [m]; legend: samples, first-order, unscented.]
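Using the `unscented_transform` sketch from Section 11.2.2, the numbers above can be reproduced directly (again, names are our own):

```python
import numpy as np

# g maps (r, theta) to Cartesian coordinates
g = lambda xi: np.array([xi[0] * np.cos(xi[1]), xi[0] * np.sin(xi[1])])
mu_xi = np.array([1.0, 0.0])
P_xi = np.diag([0.02**2, (np.pi / 12)**2])

mu_x, P_x = unscented_transform(g, mu_xi, P_xi)
# mu_x ~ (0.966, 0); P_x ~ diag(0.0015, 0.065), matching (11.10)-(11.11)
```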
11.2.4 Properties of the Unscented transform
Consider the transformation y = g(x), where we assume that g is analytic. We will apply the unscented transform to approximate properties of y, which we will compare to the first-order approximation (as used in the EKF). For this analysis, we will assume both x and y are scalar, though the analysis generalizes to higher dimensions.
We introduce the Taylor expansion for $g$, about a point $\bar{x}$:
\begin{align}
g(x) &= \sum_{i=0}^{\infty} \frac{1}{i!} \frac{\partial^i g(\bar{x})}{\partial x^i} (x - \bar{x})^i \\
&= g(\bar{x}) + \frac{\partial g(\bar{x})}{\partial x}(x - \bar{x}) + \frac{1}{2}\frac{\partial^2 g(\bar{x})}{\partial x^2}(x - \bar{x})^2 + O(x^3)
\end{align}
The exact statistics of $y$ can be computed from this series, exploiting linearity, as below. We use $\mu_x = E[x]$ as the linearization point:
\begin{align}
E[y] = E[g(x)] &= E[g(\mu_x)] + E\left[\frac{\partial g(\mu_x)}{\partial x}(x - \mu_x)\right] + E\left[\frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}(x - \mu_x)^2\right] + \dots \\
&= g(\mu_x) + \frac{\partial g(\mu_x)}{\partial x} E[(x - \mu_x)] + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2} E\left[(x - \mu_x)^2\right] + \dots \\
&= g(\mu_x) + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}\mathrm{Var}[x] + E\left[O(x^3)\right] \tag{11.12}
\end{align}
We will compute the variance as $\mathrm{Var}[y] = \mathrm{Var}[g(x)] = E\left[g(x)^2\right] - E[g(x)]^2$. We compute these components as:
\begin{align}
E\left[g(x)^2\right] &= E\left[\left(\sum_{i=0}^{\infty} \frac{1}{i!} \frac{\partial^i g(\mu_x)}{\partial x^i}(x - \mu_x)^i\right)\left(\sum_{j=0}^{\infty} \frac{1}{j!} \frac{\partial^j g(\mu_x)}{\partial x^j}(x - \mu_x)^j\right)\right] \\
&= g(\mu_x)^2 + E\left[2 g(\mu_x)\frac{\partial g(\mu_x)}{\partial x}(x - \mu_x)\right] + E\left[\left(\frac{\partial g(\mu_x)}{\partial x}(x - \mu_x)\right)^2\right] \\
&\quad + E\left[2 g(\mu_x)\frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}(x - \mu_x)^2\right] + E\left[O(x^3)\right] \\
&= g(\mu_x)^2 + \left(\frac{\partial g(\mu_x)}{\partial x}\right)^2 \mathrm{Var}[x] + g(\mu_x)\frac{\partial^2 g(\mu_x)}{\partial x^2}\mathrm{Var}[x] + E\left[O(x^3)\right] \tag{11.13} \\[2mm]
E[g(x)]^2 &= g(\mu_x)^2 + 2 g(\mu_x)\frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2} E\left[(x - \mu_x)^2\right] + E\left[O(x^3)\right] \\
&= g(\mu_x)^2 + g(\mu_x)\frac{\partial^2 g(\mu_x)}{\partial x^2}\mathrm{Var}[x] + E\left[O(x^3)\right] \tag{11.14}
\end{align}
And therefore:
\[
\mathrm{Var}[y] = \mathrm{Var}[g(x)] = E\left[g(x)^2\right] - E[g(x)]^2 = \left(\frac{\partial g(\mu_x)}{\partial x}\right)^2 \mathrm{Var}[x] + E\left[O(x^3)\right] \tag{11.15}
\]
We can compare this to the result of the unscented transform. The system is scalar (i.e. $n = 1$), so we will have two points:
\[
s_{x,0} := \mu_x + \Delta, \qquad s_{x,1} := \mu_x - \Delta, \qquad \text{where } \Delta := \sqrt{\mathrm{Var}[x]}
\]
These points are transformed through $g$, to yield the points $s_{y,i}$, expressed through the series expansion as:
\begin{align}
s_{y,0} := g(s_{x,0}) &= g(\mu_x) + \frac{\partial g(\mu_x)}{\partial x}(s_{x,0} - \mu_x) + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}(s_{x,0} - \mu_x)^2 + O\left((s_{x,0} - \mu_x)^3\right) \\
&= g(\mu_x) + \frac{\partial g(\mu_x)}{\partial x}\Delta + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}\Delta^2 + O(\Delta^3) \\
s_{y,1} &= g(\mu_x) - \frac{\partial g(\mu_x)}{\partial x}\Delta + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}\Delta^2 + O(\Delta^3)
\end{align}
The mean approximation of the unscented transform, which we will here call $\mu_y$, is:
\begin{align}
\mu_y = \frac{1}{2}\sum_{i=0}^{1} s_{y,i} &= g(\mu_x) + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}\Delta^2 + O(\Delta^3) \\
&= g(\mu_x) + \frac{1}{2}\frac{\partial^2 g(\mu_x)}{\partial x^2}\mathrm{Var}[x] + O(\Delta^3) \tag{11.16}
\end{align}
Comparing now (11.12) and (11.16) we see that the unscented transform approximates the mean correctly up to second order. Recall that linearization, as was done for the EKF, was correct only to first order.

The unscented transform's variance approximation, which we will here call $P_y$, is:
\begin{align}
P_y := \frac{1}{2}\sum_{i=0}^{1} \left(s_{y,i} - \mu_y\right)^2 &= \frac{1}{2}\left(\frac{\partial g(\mu_x)}{\partial x}\Delta + O(\Delta^3)\right)^2 + \frac{1}{2}\left(-\frac{\partial g(\mu_x)}{\partial x}\Delta + O(\Delta^3)\right)^2 \\
&= \left(\frac{\partial g(\mu_x)}{\partial x}\right)^2\Delta^2 + O(\Delta^4) = \left(\frac{\partial g(\mu_x)}{\partial x}\right)^2\mathrm{Var}[x] + O(\Delta^4) \tag{11.17}
\end{align}
Comparing (11.15) and (11.17) we notice that the variance estimate is correct to second order – this is the same order as for a first-order approximation of $g$. Even though the errors are on the same order, the errors are not equal, and the unscented transform typically outperforms a first-order approximation, as the unscented approximation includes some approximation of higher order terms (unlike a first-order approximation).
Remarks:
• The errors are on the order of the expectation of the random variable cubed. This makes sense: we only had information about the variable squared (its variance) – we could not possibly do any better without more information.
• These errors may still be arbitrarily large. You can generate distributions for which the errors will be unbounded for both the first-order and unscented transformation approximations.
• If the function g is affine in x, then the estimates from the unscented transform are exact (as is the first order approximation), for any distribution of x.
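A quick numerical illustration of the mean result (our own example, with $g(x) = x^2$, $\mu_x = 1$, $\mathrm{Var}[x] = 0.25$): for a quadratic $g$ the Taylor series terminates after the second-order term, so the unscented mean recovers the exact mean, while the first-order mean misses the curvature term.

```python
import numpy as np

g = lambda x: x**2
mu, var = 1.0, 0.25

# First-order (EKF-style) mean g(mu): misses the curvature term
mean_first_order = g(mu)                           # = 1.0

# Unscented mean, n = 1: two sigma-points at mu +/- sqrt(var)
d = np.sqrt(var)
mean_unscented = 0.5 * (g(mu + d) + g(mu - d))     # = 1.25

# Exact: E[x^2] = mu^2 + var = 1.25 for any distribution with this
# mean and variance, matching the second-order term in (11.12)/(11.16)
```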
11.2.5 Extensions of the unscented transform
This is the simplest version of the unscented transform. Specifically, we have 2n sigma-points, and we've weighted them all equally (each weight is 1/(2n)). This can be extended as follows:
• The most common scheme involves augmenting this set of points with an additional sigma-point, equal to the mean (a sketch of one such weighted scheme follows this list).
• The points need not be weighted equally – as long as the weights sum up to one.
• There are also schemes that have fewer points (as few as n + 1), which are more computationally efficient, but do not provide as accurate an estimate.
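As an illustration of the first two extensions, here is a sketch of one common weighted scheme, Julier's κ-parameterization, with 2n + 1 points (a mean point plus the symmetric set) and weights that sum to one; κ = 0 recovers the basic scheme above (all names are our own):

```python
import numpy as np

def unscented_transform_kappa(g, mu_x, P_x, kappa=0.0):
    """UT variant with an extra sigma-point at the mean: 2n+1 points,
    weights kappa/(n+kappa) and 1/(2(n+kappa)), summing to one."""
    n = mu_x.size
    S = np.linalg.cholesky((n + kappa) * P_x)
    s_x = np.vstack([mu_x, mu_x + S.T, mu_x - S.T])   # (2n+1, n)
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)                        # mean-point weight
    s_y = np.array([g(s) for s in s_x])
    mu_y = w @ s_y                                    # weighted mean
    d = s_y - mu_y
    P_y = (w[:, None] * d).T @ d                      # weighted variance
    return mu_y, P_y
```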
11.3 The Unscented Kalman Filter
The unscented filter is the application of the unscented transform to the nonlinear system dynamics, to approximate the prior mean and variance, and to the nonlinear measurement model, to compute the measurement statistics from which we compute the Kalman filter gain using the standard equations. We will distinguish between two cases: systems where the noise is purely additive, and systems where the noise enters in a nonlinear fashion.
11.3.1 Additive noise
We consider systems of the following form:
\begin{align}
x(k) &= q_{k-1}\left(x(k-1)\right) + v(k-1) \tag{11.18} \\
z(k) &= h_k\left(x(k)\right) + w(k) \tag{11.19}
\end{align}
\[
E[x(0)] = x_0, \quad \mathrm{Var}[x(0)] = P_0, \qquad
E[v(k-1)] = 0, \quad \mathrm{Var}[v(k-1)] = V(k-1), \qquad
E[w(k)] = 0, \quad \mathrm{Var}[w(k)] = W(k)
\]
for k = 1, 2, . . . , and where x(0), {v(·)}, and {w(·)} are mutually independent. The state is n dimensional.
We note the following properties, by independence of the noise:
\begin{align}
E[x(k)] &= E\left[q_{k-1}(x(k-1))\right] \\
\mathrm{Var}[x(k)] &= \mathrm{Var}\left[q_{k-1}(x(k-1))\right] + V(k-1) \\
E[z(k)] &= E\left[h_k(x(k))\right] \\
\mathrm{Var}[z(k)] &= \mathrm{Var}\left[h_k(x(k))\right] + W(k)
\end{align}
We will now use the unscented transform to approximate these expectations and variances.

Initialization: $\hat{x}_m(0) = x_0$, $P_m(0) = P_0$.
Step 1 (S1): Prior update/Prediction step
Generate $2n$ sigma-points, for $i \in \{0, 1, \dots, n-1\}$:
\[
s_{x_m(k-1),i} = \hat{x}_m(k-1) + \left(\sqrt{n P_m(k-1)}\right)_i, \qquad
s_{x_m(k-1),n+i} = \hat{x}_m(k-1) - \left(\sqrt{n P_m(k-1)}\right)_i
\]
Using these, compute the prior sigma-points:
\[
s_{x_p(k),i} = q_{k-1}\left(s_{x_m(k-1),i}\right) \quad \text{for } i \in \{0, 1, \dots, 2n-1\}
\]
Compute the prior statistics (note the additive noise):
\begin{align}
\hat{x}_p(k) &= \frac{1}{2n}\sum_{i=0}^{2n-1} s_{x_p(k),i} \\
P_p(k) &= \frac{1}{2n}\sum_{i=0}^{2n-1} \left(s_{x_p(k),i} - \hat{x}_p(k)\right)\left(s_{x_p(k),i} - \hat{x}_p(k)\right)^T + V(k-1)
\end{align}
Step 2 (S2): A posteriori update/Measurement update step
Using the sigma points from Step 1, compute sigma-points for the measurements:
\[
s_{z(k),i} = h_k\left(s_{x_p(k),i}\right) \quad \text{for } i \in \{0, 1, \dots, 2n-1\}
\]
and from that the expected measurement and the associated covariance $P_{zz}(k)$:
\begin{align}
\hat{z}(k) &= \frac{1}{2n}\sum_{i=0}^{2n-1} s_{z(k),i} \\
P_{zz}(k) &= \frac{1}{2n}\sum_{i=0}^{2n-1} \left(s_{z(k),i} - \hat{z}(k)\right)\left(s_{z(k),i} - \hat{z}(k)\right)^T + W(k)
\end{align}
We also compute the cross covariance Pxz(k):
\[
P_{xz}(k) = \frac{1}{2n}\sum_{i=0}^{2n-1} \left(s_{x_p(k),i} - \hat{x}_p(k)\right)\left(s_{z(k),i} - \hat{z}(k)\right)^T
\]
From this, we apply the Kalman filter gain:
\begin{align}
K(k) &= P_{xz}(k)\, P_{zz}(k)^{-1} \\
\hat{x}_m(k) &:= \hat{x}_p(k) + K(k)\left(z(k) - \hat{z}(k)\right) \\
P_m(k) &= P_p(k) - K(k)\, P_{zz}(k)\, K(k)^T
\end{align}
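Putting Steps 1 and 2 together, one full time step of the additive-noise UKF might look as follows in NumPy (a minimal sketch; function and variable names are our own):

```python
import numpy as np

def ukf_step_additive(x_m, P_m, z, q, h, V, W):
    """One UKF step for x(k) = q(x(k-1)) + v, z(k) = h(x(k)) + w.

    x_m, P_m: posterior mean/variance at k-1; V, W: noise variances.
    Returns the posterior mean/variance at k.
    """
    n = x_m.size
    # S1: sigma-points from the previous posterior
    S = np.linalg.cholesky(n * P_m)
    s_xm = np.vstack([x_m + S.T, x_m - S.T])        # (2n, n)
    # Prior sigma-points and prior statistics (additive noise: add V)
    s_xp = np.array([q(s) for s in s_xm])
    x_p = s_xp.mean(axis=0)
    dx = s_xp - x_p
    P_p = dx.T @ dx / (2 * n) + V
    # S2: measurement sigma-points and statistics (additive noise: add W)
    s_z = np.array([h(s) for s in s_xp])
    z_hat = s_z.mean(axis=0)
    dz = s_z - z_hat
    P_zz = dz.T @ dz / (2 * n) + W
    P_xz = dx.T @ dz / (2 * n)
    # Kalman gain and update
    K = P_xz @ np.linalg.inv(P_zz)
    return x_p + K @ (z - z_hat), P_p - K @ P_zz @ K.T
```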
11.3.2 Nonlinear noise
The UKF can also straightforwardly deal with more general nonlinear systems:
\begin{align}
x(k) &= q_{k-1}\left(x(k-1), v(k-1)\right) \tag{11.20} \\
z(k) &= h_k\left(x(k), w(k)\right) \tag{11.21}
\end{align}
\[
E[x(0)] = x_0, \quad \mathrm{Var}[x(0)] = P_0, \qquad
E[v(k-1)] = 0, \quad \mathrm{Var}[v(k-1)] = V(k-1), \qquad
E[w(k)] = 0, \quad \mathrm{Var}[w(k)] = W(k)
\]
for k = 1, 2, . . . , and where x(0), {v(·)}, and {w(·)} are mutually independent. The state is n dimensional.
We will need an augmented state vector, $\xi = (x, v, w) \in \mathbb{R}^{n_\xi}$, with $n_\xi = n + \dim(v) + \dim(w)$. The augmented system functions are defined as
\begin{align}
\bar{q}_{k-1}(\xi) &:= \left(q_{k-1}(x, v),\, v,\, w\right) \\
\bar{h}_k(\xi) &:= h_k(x, w)
\end{align}
The following algorithm is a generalization of that of the previous section, but will be more computationally expensive.
Initialization: $\hat{x}_m(0) = x_0$, $P_m(0) = P_0$.
Step 1 (S1): Prior update/Prediction step
Let
\begin{align}
\hat{\xi}_m(k-1) &:= \left(\hat{x}_m(k-1),\, 0,\, 0\right) \\
\mathrm{Var}[\xi_m(k-1)] &:= \mathrm{diag}\left(P_m(k-1),\, V(k-1),\, W(k)\right)
\end{align}
Generate $2n_\xi$ sigma-points, for $i \in \{0, 1, \dots, n_\xi - 1\}$:
\[
s_{\xi_m(k-1),i} = \hat{\xi}_m(k-1) + \left(\sqrt{n_\xi\, \mathrm{Var}[\xi_m(k-1)]}\right)_i, \qquad
s_{\xi_m(k-1),n_\xi+i} = \hat{\xi}_m(k-1) - \left(\sqrt{n_\xi\, \mathrm{Var}[\xi_m(k-1)]}\right)_i
\]
Note that, because the variance is block-diagonal, the computation of its square root can be done block-wise.
Using these, and our augmented dynamics equation, compute the prior sigma-points:
\[
s_{\xi_p(k),i} := \bar{q}_{k-1}\left(s_{\xi_m(k-1),i}\right) =: \left(s_{x_p(k),i},\, *,\, *\right) \quad \text{for } i \in \{0, 1, \dots, 2n_\xi - 1\}
\]
Compute the prior statistics:
\begin{align}
\hat{x}_p(k) &= \frac{1}{2n_\xi}\sum_{i=0}^{2n_\xi-1} s_{x_p(k),i} \\
P_p(k) &= \frac{1}{2n_\xi}\sum_{i=0}^{2n_\xi-1} \left(s_{x_p(k),i} - \hat{x}_p(k)\right)\left(s_{x_p(k),i} - \hat{x}_p(k)\right)^T
\end{align}
Step 2 (S2): A posteriori update/Measurement update step
Using the sigma points from Step 1, compute sigma-points for the measurements:
\[
s_{z(k),i} = \bar{h}_k\left(s_{\xi_p(k),i}\right) \quad \text{for } i \in \{0, 1, \dots, 2n_\xi - 1\}
\]
and from that the expected measurement and the associated covariance $P_{zz}(k)$:
\begin{align}
\hat{z}(k) &= \frac{1}{2n_\xi}\sum_{i=0}^{2n_\xi-1} s_{z(k),i} \\
P_{zz}(k) &= \frac{1}{2n_\xi}\sum_{i=0}^{2n_\xi-1} \left(s_{z(k),i} - \hat{z}(k)\right)\left(s_{z(k),i} - \hat{z}(k)\right)^T
\end{align}
We also compute the cross covariance $P_{xz}(k)$:
\[
P_{xz}(k) = \frac{1}{2n_\xi}\sum_{i=0}^{2n_\xi-1} \left(s_{x_p(k),i} - \hat{x}_p(k)\right)\left(s_{z(k),i} - \hat{z}(k)\right)^T
\]
From this, we apply the Kalman filter gain:
\begin{align}
K(k) &= P_{xz}(k)\, P_{zz}(k)^{-1} \\
\hat{x}_m(k) &:= \hat{x}_p(k) + K(k)\left(z(k) - \hat{z}(k)\right) \\
P_m(k) &= P_p(k) - K(k)\, P_{zz}(k)\, K(k)^T
\end{align}
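A minimal NumPy sketch of one full step with the augmented state (names our own; it assumes `q(x, v)` and `h(x, w)` take the noise as an explicit argument):

```python
import numpy as np

def ukf_step_augmented(x_m, P_m, z, q, h, V, W):
    """One UKF step for x(k) = q(x(k-1), v), z(k) = h(x(k), w),
    using the augmented state xi = (x, v, w)."""
    n, nv, nw = x_m.size, V.shape[0], W.shape[0]
    n_xi = n + nv + nw
    # Augmented mean and block-diagonal variance
    xi_m = np.concatenate([x_m, np.zeros(nv), np.zeros(nw)])
    P_xi = np.zeros((n_xi, n_xi))
    P_xi[:n, :n], P_xi[n:n+nv, n:n+nv], P_xi[n+nv:, n+nv:] = P_m, V, W
    # Sigma-points for the augmented state
    S = np.linalg.cholesky(n_xi * P_xi)
    s_xi = np.vstack([xi_m + S.T, xi_m - S.T])      # (2*n_xi, n_xi)
    # Prior sigma-points: push state and process-noise parts through q
    s_xp = np.array([q(s[:n], s[n:n+nv]) for s in s_xi])
    x_p = s_xp.mean(axis=0)
    dx = s_xp - x_p
    P_p = dx.T @ dx / (2 * n_xi)
    # Measurement sigma-points: use the measurement-noise part of each point
    s_z = np.array([h(s_xp[i], s_xi[i][n+nv:]) for i in range(2 * n_xi)])
    z_hat = s_z.mean(axis=0)
    dz = s_z - z_hat
    P_zz = dz.T @ dz / (2 * n_xi)
    P_xz = dx.T @ dz / (2 * n_xi)
    K = P_xz @ np.linalg.inv(P_zz)
    return x_p + K @ (z - z_hat), P_p - K @ P_zz @ K.T
```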
11.3.3 Remarks
• At the beginning of step 2 (for either additive or nonlinear noise), we can compute new sigma-points using $\hat{x}_p$ and $P_p$. This will be more accurate than what we have described above, but will require more computation.
• Notice that Section 11.3.2 allows us to straightforwardly handle the situation where v(k−1) and w(k−1) are not independent.
• If either the measurement model or dynamics equation is linear, we don’t need to use the unscented transform for the corresponding statistics.
• If only one of the process noise and measurement noise enters in a nonlinear way, we can adapt the equations to capture this with the minimum number of sigma-points.
11.4 Revisiting the pathological example
We revisit the pathological example from the EKF notes. Recall the system dynamics:
\begin{align}
x(k) &= q\left(x(k-1), v(k-1)\right) = 2\arctan\left(x(k-1) + v(k-1)\right) \\
z(k) &= h\left(x(k), w(k)\right) = x(k) + w(k)
\end{align}
and $x(0) \sim \mathcal{N}(x_0, 1)$, $v(k-1) \sim \mathcal{N}(0, 0.1)$, and $w(k) \sim \mathcal{N}(0, 10)$. We notice that the measurement is affine in the state and noise, and that the dynamics are smooth. The process uncertainty is small, and the measurement uncertainty large.
The dynamics are illustrated below:
[Figure: the dynamics map $q(x(k-1), 0) = 2\arctan(x(k-1))$ and the identity line, plotted over $x(k-1) \in [-5, 5]$; the curves intersect at the three equilibria.]
The deterministic system (when we set v(k) = 0) has three equilibria, at x(k) = 0 and x(k) ≈ ±2.33. The equilibrium at zero is unstable, the other two are stable.
We implement the UKF algorithm for nonlinear noise, and run it many times to get a feeling for the estimator’s performance. We can compare the performance for x0 = 4, and x0 = 0.
The performance is roughly the same as we had for the EKF. For x0 = 4, the UKF performs well, but for x0 = 0 it does not. Usually, it’s OK, but sometimes the true state goes to one equilibrium, and the estimator to the other. Note
• this is true even though we can measure the state,
• it doesn’t matter how long we run the filter, it doesn’t correct itself, and • there is no model mismatch.
Just like the EKF, the UKF has a danger of diverging.
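A simulation sketch of this experiment, reusing `ukf_step_augmented` from Section 11.3.2 (all scaffolding names are our own); running it over many seeds occasionally shows the true state and the estimate settling at opposite equilibria for x0 = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
q = lambda x, v: 2 * np.arctan(x + v)     # dynamics
h = lambda x, w: x + w                    # measurement
V, W = np.array([[0.1]]), np.array([[10.0]])

x0 = 0.0                                  # try x0 = 4.0 as well
x_true = np.array([rng.normal(x0, 1.0)])
x_m, P_m = np.array([x0]), np.array([[1.0]])

for k in range(200):
    x_true = q(x_true, rng.normal(0, np.sqrt(0.1), 1))
    z = h(x_true, rng.normal(0, np.sqrt(10.0), 1))
    x_m, P_m = ukf_step_augmented(x_m, P_m, z, q, h, V, W)
# Compare x_true and x_m: for x0 = 0 they sometimes end up at
# different equilibria, and the filter does not recover.
```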
11.5 Further reading
An excellent paper is S. J. Julier and J. K. Uhlmann, “Unscented filtering and nonlinear estimation,” Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, 2004 (written by the creators of the Unscented Kalman Filter), which was also the primary source for this lecture. It is very accessible, and provides in-depth discussion.