6. The Kalman Filter as best linear estimator
In this chapter we derive the Kalman Filter as a solution to a least-squares estimation problem for a dynamic system. We will make very few assumptions about our system: primarily, that the noise is white and that the system is linear.
We will show the connection between the Kalman Filter and least squares estimation, by looking at the recursive least squares solution to estimating a constant value, and the batched weighted least squares estimate of a dynamic system.
6.1 Weighted Least Squares (WLS)
Let x ∈ Rn be some (vector-valued) quantity of interest, and z ∈ Rm be a large number of measurements of x, so that m > n. The measurements are generated according to the model
$$
z = Hx + w \qquad (6.1)
$$
where H ∈ R^{m×n} is a known matrix, and w ∈ R^m is a zero-mean random variable with known variance W ∈ R^{m×m}.
Note that this is an under-determined problem, because there are more unknowns (x and w) than there are equations (z). One way to resolve this is to solve for the value of w that minimizes the Mahalanobis distance∗, defined as
$$
(w - E[w])^T W^{-1} (w - E[w]).
$$
∗See https://en.wikipedia.org/wiki/Mahalanobis_distance.
Note that this distance is always nonnegative, and when W = σ_w²I it simplifies to the squared Euclidean norm of w scaled by σ_w⁻². Since w is zero-mean and has variance W, the best guess ŵ for its value is the one that minimizes the Mahalanobis distance:
$$
\hat{w} := \arg\min_{w}\; w^T W^{-1} w.
$$
Substituting w = z − Hx̂ allows us to express this distance as a function of the estimate x̂. The best estimate for x is then the weighted least squares estimate:
$$
\begin{aligned}
\hat{x}_{\mathrm{WLS}} &:= \arg\min_{\hat{x}}\;(z - H\hat{x})^T W^{-1}(z - H\hat{x})\\
&= (H^T W^{-1} H)^{-1} H^T W^{-1} z.
\end{aligned}
\qquad (6.2)
$$
Example 6.1: Let x represent the value of a work of art, which we have been asked to insure (and thus need to compute the value of). We invite several different, independent appraisers to give us their estimate of its value: z(k) for appraiser k.
From historical data of art auctions, we know the variance associated with each appraiser, and can model the appraisers as:
z(k) = x + w(k)
with E[w(k)] = 0 and Var[w(k)] = σk². Given the following values (note that we provide the standard deviation σk for each appraiser), compute the WLS estimate for the value x:

k   z(k)    σk
1   $1.2M   $300k
2   $1.6M   $600k
3   $0.9M   $400k
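As a sanity check, (6.2) can be evaluated numerically for this example. The following is a minimal sketch in Python/NumPy (ours, not part of the course material); H is a column of ones because every appraiser measures the same scalar x, and the numbers are taken from the table above.

```python
import numpy as np

# Appraisal data from the table (values in millions of dollars).
z = np.array([1.2, 1.6, 0.9])        # appraisals z(k)
sigma = np.array([0.3, 0.6, 0.4])    # standard deviations sigma_k ($300k, $600k, $400k)

H = np.ones((3, 1))                  # each appraiser measures x directly: z(k) = x + w(k)
W = np.diag(sigma**2)                # measurement noise variance

# Weighted least squares estimate, equation (6.2): (H^T W^-1 H)^-1 H^T W^-1 z
Winv = np.linalg.inv(W)
x_hat = np.linalg.solve(H.T @ Winv @ H, H.T @ Winv @ z)
print(x_hat)  # roughly 1.16, i.e. about $1.16M
```

The more uncertain an appraiser, the smaller its weight 1/σk² in the estimate, so the $600k appraiser influences the result the least.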
6.2 Recursive Least Squares (RLS) Problem
Consider the observation model
z(k) = H(k)x + w(k) with z(k), w(k) ∈ Rm, x ∈ Rn.
• Prior knowledge: the mean and variance of x, μx := E[x] and Px := E[(x − μx)(x − μx)^T] = Var[x], are given.
Notice that we have no process model: x does not change (our knowledge of it changes, however, as we keep getting new measurements).
• Measurement noise: zero-mean with known variance, E [w(k)] = 0, W (k) := Var [w(k)].
• {x, w(1), w(2), . . . } are independent.
• Typically, n > m; that is, at any particular time k there are fewer observations (components of z(k)) than parameters (components of x).
The objective is to compute an estimate xˆ(k) of x from the observations {z(1), z(2), . . . , z(k)} in the least squares sense (minimizing a quadratic error).
Batch least squares
After collecting a whole bunch of data up to time k, we can convert this to a standard weighted least squares problem. Abusing notation, we will incorporate the prior knowledge as a pseudo-measurement at time k = 0: z(0) = μx = Ix + (μx − x), i.e. a linear function of x, corrupted by the zero-mean additive noise w(0) := μx − x with Var[w(0)] = Px.
z(0:k) = H(0:k)x + w(0:k), with

$$
z(0{:}k) := \begin{bmatrix} \mu_x \\ z(1) \\ \vdots \\ z(k) \end{bmatrix}, \qquad
H(0{:}k) := \begin{bmatrix} I \\ H(1) \\ \vdots \\ H(k) \end{bmatrix}, \qquad
w(0{:}k) := \begin{bmatrix} \mu_x - x \\ w(1) \\ \vdots \\ w(k) \end{bmatrix}, \qquad
W(0{:}k) := \begin{bmatrix} P_x & 0 & \cdots & 0 \\ 0 & W(1) & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & W(k) \end{bmatrix}.
$$
This problem matches our WLS problem, so we can immediately apply the solution (6.2):

$$
\begin{aligned}
\hat{x}_{\mathrm{WLS}}(k) &= \arg\min_{\hat{x}}\;\bigl(z(0{:}k) - H(0{:}k)\hat{x}\bigr)^T W(0{:}k)^{-1}\bigl(z(0{:}k) - H(0{:}k)\hat{x}\bigr)\\
&= \bigl(H(0{:}k)^T W(0{:}k)^{-1}H(0{:}k)\bigr)^{-1}H(0{:}k)^T W(0{:}k)^{-1}z(0{:}k).
\end{aligned}
$$
The implementation of standard WLS would mean that we have to solve a full WLS problem (with growing dimensions) at every time step k.
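To make the pseudo-measurement trick concrete, here is a small sketch of the batch construction (our own illustration with made-up scalar data, not from the notes): the prior is stacked on top of the measurements with variance Px, and (6.2) is applied to the stacked quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior knowledge about a scalar x, and k scalar measurements z(i) = x + w(i).
mu_x, P_x = 0.0, 4.0
W_i, k = 0.5, 5
x_true = rng.normal(mu_x, np.sqrt(P_x))
z = x_true + rng.normal(0.0, np.sqrt(W_i), size=k)

# Stack the prior as a pseudo-measurement z(0) = mu_x with noise variance P_x.
z_stack = np.concatenate(([mu_x], z))                          # z(0:k)
H_stack = np.ones((k + 1, 1))                                  # H(0:k): here H(i) = 1 for all i
W_stack = np.diag(np.concatenate(([P_x], np.full(k, W_i))))    # W(0:k), block-diagonal

Winv = np.linalg.inv(W_stack)
x_wls = np.linalg.solve(H_stack.T @ Winv @ H_stack, H_stack.T @ Winv @ z_stack)
print(x_true, x_wls)  # batch estimate using the prior and all k measurements
```

The recursive formulation in the next subsection produces this same estimate without re-solving a growing problem at each step.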
Recursive WLS
Next we show how to build an estimate in real-time; that is, update the current estimate of x recursively when a new measurement z(k) comes in. We assume the following form for the estimate at time k:
$$
\hat{x}(k) = \hat{x}(k{-}1) + K(k)\bigl(z(k) - H(k)\hat{x}(k{-}1)\bigr),
$$
where
xˆ(k) is the estimate of x given all measurements up to time k, {z(1), z(2), . . . , z(k)}, and the prior knowledge; and
K(k) is the gain matrix at time k, the only design variable.
• Why use this form?
  – H(k)x̂(k−1) is the predicted measurement, given the current estimate.
  – z(k) − H(k)x̂(k−1) is the difference between the actual and the predicted measurement.
  – If this difference is zero, don't change the estimate (perfect prediction!).
  – Otherwise, correct the estimate proportionally to the difference.
• We analyze the estimation error e(k) := x − xˆ(k). We have
$$
\begin{aligned}
e(k) &= x - \hat{x}(k{-}1) - K(k)\bigl(H(k)x + w(k) - H(k)\hat{x}(k{-}1)\bigr)\\
&= e(k{-}1) - K(k)H(k)e(k{-}1) - K(k)w(k)\\
&= \bigl(I - K(k)H(k)\bigr)e(k{-}1) - K(k)w(k),
\end{aligned}
$$
with the mean error evolving as
$$
E[e(k)] = \bigl(I - K(k)H(k)\bigr)E[e(k{-}1)].
$$
If we initialize the estimator with xˆ(0) = μx, then E[e(0)] = 0, and from the above recursion, we get E[e(k)] = 0 for all k. That is, the estimator is unbiased for any choice of K(k).
• We now choose K(k) to minimize the mean squared error†
$$
J(k) := E\bigl[e^T(k)\,e(k)\bigr] = E\bigl[\operatorname{trace}\bigl(e(k)e^T(k)\bigr)\bigr] = \operatorname{trace}\bigl(P(k)\bigr),
$$
where trace(·) is the sum of the diagonal terms, and P(k) := Var[e(k)] (recall that E [e(k)] = 0).
†Why minimize mean squared error, rather than something else? To paraphrase Manfred Morari, “Be- cause it’s not completely stupid, and we can actually solve for it.”
• Substituting the estimation error
$$
e(k) = \bigl(I - K(k)H(k)\bigr)e(k{-}1) - K(k)w(k)
$$

into the definition, we obtain

$$
\begin{aligned}
P(k) &= E\bigl[e(k)e^T(k)\bigr]\\
&= E\Bigl[\bigl(\bigl(I - K(k)H(k)\bigr)e(k{-}1) - K(k)w(k)\bigr)\bigl(\bigl(I - K(k)H(k)\bigr)e(k{-}1) - K(k)w(k)\bigr)^T\Bigr]\\
&= \bigl(I - K(k)H(k)\bigr)E\bigl[e(k{-}1)e^T(k{-}1)\bigr]\bigl(I - K(k)H(k)\bigr)^T + K(k)E\bigl[w(k)w^T(k)\bigr]K^T(k)\\
&\quad - \bigl(I - K(k)H(k)\bigr)E\bigl[e(k{-}1)w^T(k)\bigr]K^T(k) - K(k)E\bigl[w(k)e^T(k{-}1)\bigr]\bigl(I - K(k)H(k)\bigr)^T\\
&= \bigl(I - K(k)H(k)\bigr)P(k{-}1)\bigl(I - K(k)H(k)\bigr)^T + K(k)W(k)K^T(k)\\
&\quad - \bigl(I - K(k)H(k)\bigr)E\bigl[e(k{-}1)w^T(k)\bigr]K^T(k) - K(k)E\bigl[w(k)e^T(k{-}1)\bigr]\bigl(I - K(k)H(k)\bigr)^T.
\end{aligned}
$$

Because e(k−1) and w(k) are independent, E[w(k)e^T(k−1)] = E[w(k)] E[e^T(k−1)]. Since E[w(k)] = 0, the cross terms vanish, and it follows that

$$
P(k) = \bigl(I - K(k)H(k)\bigr)P(k{-}1)\bigl(I - K(k)H(k)\bigr)^T + K(k)W(k)K^T(k).
$$
Note the trade-off here: a large K means that the influence of past uncertainty disappears quickly, but then measurement noise has a stronger influence. We can multiply this out to get
$$
P(k) = K(k)\bigl(H(k)P(k{-}1)H(k)^T + W(k)\bigr)K(k)^T - K(k)H(k)P(k{-}1) - P(k{-}1)H(k)^TK(k)^T + P(k{-}1). \qquad (6.3)
$$
• We want to minimize the trace of this quantity and compute the optimal gain K*, so that J* ≤ J̃ for any other gain K̃ ≠ K*. We rewrite the quadratic equation (6.3), making the following substitutions for simplicity:
$$
A = H(k)P(k{-}1)H(k)^T + W(k), \qquad B = H(k)P(k{-}1), \qquad C = P(k{-}1),
$$
so that we can complete the square to get

$$
\begin{aligned}
P(k) &= KAK^T - KB - B^TK^T + C\\
&= \bigl(K - B^TA^{-1}\bigr)A\bigl(K - B^TA^{-1}\bigr)^T - B^TA^{-1}B + C.
\end{aligned}
$$
Exploiting the linearity of the trace, we can rewrite this as
$$
J = \operatorname{trace}\bigl(P(k)\bigr) = \operatorname{trace}\Bigl(\bigl(K - B^TA^{-1}\bigr)A\bigl(K - B^TA^{-1}\bigr)^T\Bigr) + \operatorname{trace}\bigl(-B^TA^{-1}B + C\bigr).
$$
Since the last term is independent of K, the first term is quadratic in the matrix K, and A > 0, J is minimized exactly when the quadratic term is zero, i.e. when K = B^T A^{-1}. Thus, the optimal gain K(k) is
$$
K(k) = K^* = B^TA^{-1} = P(k{-}1)H(k)^T\bigl(H(k)P(k{-}1)H(k)^T + W(k)\bigr)^{-1}.
$$
This result for the gain K(k) is at the heart of “Kalman filtering”, and it will be central
to the Extended and Unscented Kalman filters.
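The completed-square argument can be spot-checked numerically: for any valid problem data, the gain K* should give a trace(P(k)) no larger than that of any other gain, with P(k) computed from the form that holds for arbitrary K. A small sketch of such a check (ours, with arbitrary dimensions and random data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2

# Random (valid) problem data: P(k-1) and W(k) symmetric positive definite.
Q = rng.normal(size=(n, n))
P_prev = Q @ Q.T + np.eye(n)
R = rng.normal(size=(m, m))
W = R @ R.T + np.eye(m)
H = rng.normal(size=(m, n))

def P_of_K(K):
    # P(k) = (I - K H) P(k-1) (I - K H)^T + K W K^T, valid for any gain K.
    IKH = np.eye(n) - K @ H
    return IKH @ P_prev @ IKH.T + K @ W @ K.T

# Candidate optimal gain K* = P(k-1) H^T (H P(k-1) H^T + W)^-1.
K_star = P_prev @ H.T @ np.linalg.inv(H @ P_prev @ H.T + W)
J_star = np.trace(P_of_K(K_star))

# No perturbed gain should achieve a smaller cost.
for _ in range(1000):
    K_pert = K_star + 0.1 * rng.normal(size=(n, m))
    assert np.trace(P_of_K(K_pert)) >= J_star - 1e-9
print("trace(P(k)) at K*:", J_star)
```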
Summary (RLS algorithm)
Initialization: x̂(0) = μx, P(0) = Px = Var[x].

Recursion:

• Observe: z(k)

• Update:

$$
\begin{aligned}
K(k) &= P(k{-}1)H^T(k)\bigl(H(k)P(k{-}1)H^T(k) + W(k)\bigr)^{-1}\\
\hat{x}(k) &= \hat{x}(k{-}1) + K(k)\bigl(z(k) - H(k)\hat{x}(k{-}1)\bigr)\\
P(k) &= \bigl(I - K(k)H(k)\bigr)P(k{-}1)\bigl(I - K(k)H(k)\bigr)^T + K(k)W(k)K^T(k)
\end{aligned}
$$
The matrices K(k) and P(k) can be pre-computed from the problem data Px, {H(·)}, and
{W (·)}.
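A minimal implementation sketch of this recursion (ours, in Python/NumPy; not the course code). A constant two-dimensional x is estimated from a stream of scalar measurements:

```python
import numpy as np

def rls_update(x_hat, P, z, H, W):
    """One recursive least squares step: returns the updated estimate and variance."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + W)          # gain K(k)
    x_hat = x_hat + K @ (z - H @ x_hat)                   # estimate update
    IKH = np.eye(P.shape[0]) - K @ H
    P = IKH @ P @ IKH.T + K @ W @ K.T                     # variance update
    return x_hat, P

# Example: estimate a constant 2-vector x from scalar measurements z(k) = H(k) x + w(k).
rng = np.random.default_rng(2)
x_true = np.array([1.0, -2.0])
mu_x, P_x = np.zeros(2), 10.0 * np.eye(2)                 # prior: x_hat(0), P(0)
W = np.array([[0.25]])                                    # measurement noise variance

x_hat, P = mu_x, P_x
for k in range(50):
    H = rng.normal(size=(1, 2))                           # a (known) measurement matrix H(k)
    z = H @ x_true + rng.normal(0.0, 0.5, size=1)         # noisy measurement
    x_hat, P = rls_update(x_hat, P, z, H, W)
print(x_true, x_hat)                                      # estimate approaches the true value
```

Here K and P are computed on the fly; as noted above, they could equally well be pre-computed, since they do not depend on the measurements.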
6.3 Dynamic systems
We consider the linear, time-varying system:
$$
\begin{aligned}
x(k) &= A(k{-}1)x(k{-}1) + u(k{-}1) + v(k{-}1), \qquad &(6.4)\\
z(k) &= H(k)x(k) + w(k), \qquad &(6.5)
\end{aligned}
$$

for k = 1, 2, …, where

x(k): state
u(k): known control input
v(k): process noise
z(k): measurement
w(k): sensor noise
The CRVs x(0), {v(·)}, and {w(·)} are mutually independent; the noise sequences are zero-mean, and all means and variances are known:

• E[x(0)] = x0, Var[x(0)] = P0, E[v(k)] = E[w(k)] = 0, Var[v(k)] = V(k), Var[w(k)] = W(k).
For the remainder of the derivation we set u(k) = 0, to simplify equations.
6.3.1 Batched estimation
At time k, we have collected a set of measurements, z(1:k). Given our model, we’d like to estimate x(k), and we’ll do this by storing all problem data. We ignore the input u to reduce clutter, though it is trivial to add back.
The approach will be to compute the initial state, process noise sequence, and measurement noise sequence that best explain the measurement sequence, and are optimal in a least squares sense. Then, given a set of process noise values, and an initial state, we can run through the system dynamics to compute the state.
Unrolling the dynamics back to the initial state,

$$
\begin{aligned}
x(k) &= A(k{-}1)x(k{-}1) + v(k{-}1)\\
&= A(k{-}1)\bigl(A(k{-}2)x(k{-}2) + v(k{-}2)\bigr) + v(k{-}1)\\
&\;\;\vdots\\
&= A(k{-}1,0)\,x(0) + A(k{-}1,1)\,v(0) + A(k{-}1,2)\,v(1) + \ldots + A(k{-}1,k{-}1)\,v(k{-}2) + v(k{-}1),
\end{aligned}
$$

where

$$
A(k,l) := \begin{cases} A(k)A(k{-}1)\cdots A(l) & \text{if } k \ge l,\\ 0 & \text{otherwise.}\end{cases}
$$

Then

$$
x(k) = \begin{bmatrix} A(k{-}1,0) & A(k{-}1,1) & \cdots & A(k{-}1,k{-}1) & I \end{bmatrix}
\begin{bmatrix} x(0) \\ v(0) \\ v(1) \\ \vdots \\ v(k{-}1) \end{bmatrix}. \qquad (6.6)
$$
We can thus think of the process noise becoming part of the state vector, and we define the extended state vector ξ(k) as
$$
\xi(k) := \begin{bmatrix} x(0) \\ v(0) \\ v(1) \\ \vdots \\ v(k{-}1) \end{bmatrix}.
$$
We can similarly rewrite the measurements, z(k) = H(k)x(k) + w(k): each z(k) is an affine combination of the initial state x(0), the process noise sequence up to time k−1, and the measurement noise at time k. As we did before for the initial condition of the system, we add pseudo-measurements for the process noise variables v(·), so that our collection of known quantities about the state at time k is Z(k), defined as
$$
Z(k) := \begin{bmatrix} x_0 \\ 0 \\ \vdots \\ 0 \\ z(1) \\ \vdots \\ z(k) \end{bmatrix} = \mathcal{H}(k)\,\xi(k) + \Omega(k),
$$
wherein the zero elements have the same dimension as v(k), and there are k such zero
elements. We’ve introduced:
Z(k): all the information we know about the state, i.e. the measurements z(1:k) together with our priors (encoded as pseudo-measurements)

𝓗(k): all the (known) problem data (system matrices)

ξ(k): all the unknown random variables affecting the state

Ω(k): all the unknown random variables affecting our knowledge of the state
Explicitly,

$$
\mathcal{H}(k) := \begin{bmatrix}
I & 0 & 0 & \cdots & 0\\
0 & I & 0 & \cdots & 0\\
\vdots & & \ddots & & \vdots\\
0 & 0 & 0 & \cdots & I\\
H(1)A(0,0) & H(1) & 0 & \cdots & 0\\
H(2)A(1,0) & H(2)A(1,1) & H(2) & \cdots & 0\\
\vdots & & & \ddots & \vdots\\
H(k)A(k{-}1,0) & H(k)A(k{-}1,1) & H(k)A(k{-}1,2) & \cdots & H(k)
\end{bmatrix},
\qquad
\Omega(k) := \begin{bmatrix}
x_0 - x(0)\\
-v(0)\\
\vdots\\
-v(k{-}1)\\
w(1)\\
\vdots\\
w(k)
\end{bmatrix}.
$$
Note that the dimensions of each of these quantities increase as time goes on. The variance of the random quantity Ω(k) is block-diagonal (due to the variables' independence), with:
Var[Ω(k)] := diag(P0,V(0),…,V(k−1),W(1),…,W(k))
To compute an estimate of the state x(k), we first compute an estimate of the unknown random variables, ξ(k). We can then compute the matching x(k) using (6.6). This is done with weighted least squares:
$$
\hat{\xi}_{\mathrm{WLS}}(k) = \Bigl(\mathcal{H}(k)^T\,\mathrm{Var}[\Omega(k)]^{-1}\,\mathcal{H}(k)\Bigr)^{-1}\mathcal{H}(k)^T\,\mathrm{Var}[\Omega(k)]^{-1}\,Z(k).
$$
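To make the structure of Z(k), 𝓗(k), and Var[Ω(k)] concrete, here is a small sketch (our own construction, not from the notes) for a scalar system with constant A(k) = a and H(k) = h. It builds the stacked quantities, solves the weighted least squares problem for ξ(k), and then recovers x̂(k) via (6.6):

```python
import numpy as np

rng = np.random.default_rng(3)
a, h = 0.9, 1.0                 # scalar A(k) and H(k)
x0_mean, P0, V, W = 0.0, 1.0, 0.1, 0.5
k = 6

# Simulate the true system x(j) = a x(j-1) + v(j-1), z(j) = h x(j) + w(j).
x = rng.normal(x0_mean, np.sqrt(P0))
v = rng.normal(0.0, np.sqrt(V), size=k)
z = np.zeros(k)
for j in range(k):
    x = a * x + v[j]
    z[j] = h * x + rng.normal(0.0, np.sqrt(W))

# Stacked batch quantities; xi = [x(0), v(0), ..., v(k-1)].
Z = np.concatenate(([x0_mean], np.zeros(k), z))               # prior, noise pseudo-measurements, data
Hbig = np.zeros((2 * k + 1, k + 1))
Hbig[: k + 1, : k + 1] = np.eye(k + 1)                        # pseudo-measurement rows
for m in range(1, k + 1):                                     # row for measurement z(m)
    Hbig[k + m, 0] = h * a**m                                 # coefficient of x(0)
    for j in range(m):
        Hbig[k + m, j + 1] = h * a**(m - 1 - j)               # coefficient of v(j)
R = np.diag(np.concatenate(([P0], np.full(k, V), np.full(k, W))))  # Var[Omega(k)]

# Weighted least squares for xi, then roll the dynamics forward, equation (6.6).
Rinv = np.linalg.inv(R)
xi_hat = np.linalg.solve(Hbig.T @ Rinv @ Hbig, Hbig.T @ Rinv @ Z)
x_hat = a**k * xi_hat[0] + sum(a**(k - 1 - j) * xi_hat[j + 1] for j in range(k))
print(x, x_hat)   # true x(k) vs. batch least squares estimate
```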
Plugging the estimate ξ̂WLS(k) into (6.6) gives us the least squares estimate x̂WLS(k). We note:
• The least squares solution is valid for any noise distributions (specifically, we never
assumed normality).
• The problem data grows with each time step, as more measurements come in. This has two effects:
– We need an ever-increasing amount of memory to store the problem data.
– The solution of the least squares problem becomes more computationally expensive as time goes on.
• At each time step, we're also updating our estimate of the process noise.

6.3.2 Recursive estimation: the Kalman Filter
We have already seen how to take a weighted least squares problem, and convert it to a recursive problem, when there were no dynamics. The Kalman Filter is the application of this idea (specifically, the gain matrix K(k) that minimizes the mean squared error), to the dynamic estimation problem we just described.
The philosophy is as follows: given a measurement z(k) = H(k)x(k) + w(k), a model of the noise acting on this measurement (the variance W(k) of w(k)), and some prior knowledge of the system state at time k (mean x̂p(k) and variance Pp(k)), we can apply the Kalman filter gain to compute the posterior mean and variance (x̂m(k) and Pm(k)):
$$
\begin{aligned}
K(k) &= P_p(k)H^T(k)\bigl(H(k)P_p(k)H^T(k) + W(k)\bigr)^{-1}\\
\hat{x}_m(k) &= \hat{x}_p(k) + K(k)\bigl(z(k) - H(k)\hat{x}_p(k)\bigr)\\
P_m(k) &= \bigl(I - K(k)H(k)\bigr)P_p(k)\bigl(I - K(k)H(k)\bigr)^T + K(k)W(k)K^T(k)
\end{aligned}
$$
The time update (computing x̂p(k) and Pp(k) from x̂m(k−1) and Pm(k−1)) is straightforward, since the system is linear:

$$
\begin{aligned}
\hat{x}_p(k) &= A(k{-}1)\hat{x}_m(k{-}1)\\
P_p(k) &= A(k{-}1)P_m(k{-}1)A^T(k{-}1) + V(k{-}1)
\end{aligned}
$$
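Putting the measurement update and the time update together, a minimal Kalman filter step might look as follows (a sketch under the assumptions above, with u(k) = 0; Python/NumPy, not the course code):

```python
import numpy as np

def kf_predict(x_m, P_m, A, V):
    """Time update: propagate the posterior at time k-1 through the dynamics."""
    x_p = A @ x_m
    P_p = A @ P_m @ A.T + V
    return x_p, P_p

def kf_update(x_p, P_p, z, H, W):
    """Measurement update: apply the Kalman filter gain to the prior at time k."""
    K = P_p @ H.T @ np.linalg.inv(H @ P_p @ H.T + W)
    x_m = x_p + K @ (z - H @ x_p)
    IKH = np.eye(P_p.shape[0]) - K @ H
    P_m = IKH @ P_p @ IKH.T + K @ W @ K.T   # variance update, valid for any gain
    return x_m, P_m
```

One filter iteration is a kf_predict followed by a kf_update; with a known input, the mean prediction simply gains an additional + u(k−1) term.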
6.3.3 Discussion
We have now derived the Kalman filter for a very general class of systems. Note that we never assumed that any variables have a particular distribution (specifically, they need not be Gaussian); instead, we exploited the linearity of the system to solve a least-squares problem. We will see that the equations derived here are the same as those we would obtain if we assumed that all random quantities have Gaussian distributions.
Thus, we see a first interpretation of the Kalman Filter. Amongst the class of linear and unbiased estimators, the Kalman Filter minimizes the mean squared estimation error.
In coming lectures we will use the Kalman Filter gain, as computed here, to estimate the state of nonlinear systems. We will see that the Extended and Unscented Kalman
filters effectively reduce to different methods of computing the transformations of random variables through nonlinear functions.
6.4 Examples

6.4.1 Rocket ship
Consider a rocketship, moving in a straight line, powered by a (noisy) rocket engine, and able to measure the distance from itself to its starting position. We assume that the mass of the rocketship is constant‡.
The rocketship is acted upon by a propulsion force. The force command, uf, can change only at discrete times k. The rocket engine is not perfect, and the actual force is corrupted by an additive noise vf (which also only changes at the discrete times k).
We do not know the PDF of the noises, we know only that they are zero-mean, and we know their variance. When simulating the rocketship, we will assume all quantities have uniform distributions.
‡after all, this is not a class on orbital mechanics.
System equations
Let x(k) := (p(k), s(k)), u(k) := (uf (k)/2, uf (k)), and v(k) := (vf (k)/2, vf (k)). Combining all equations:
$$
x(k) = \underbrace{\begin{bmatrix} 1 & 1\\ 0 & 1 \end{bmatrix}}_{A} x(k{-}1) + u(k{-}1) + v(k{-}1), \qquad v(k{-}1) \sim \mathcal{N}(0, V), \quad V = \begin{bmatrix} 0.025 & 0.05\\ 0.05 & 0.1 \end{bmatrix},
$$

$$
z(k) = \underbrace{\begin{bmatrix} 1 & 0 \end{bmatrix}}_{H} x(k) + w(k), \qquad w(k) \sim \mathcal{N}(0, W), \quad W = 0.5.
$$
A, H , V , W , and initial mean and variance are all we need to implement the KF.
Simulations
(the script to simulate this example is provided on the class website.)
Works! Eye-ball test: no appreciable difference from the normal-noise case (i.e. the KF is a good choice).
It also works if we draw the noise from binary distributions.
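The course script is not reproduced here; the following is a rough sketch of such a simulation (ours). The initial variance and the thrust command are made up for illustration, the uniform noise is scaled so that its variance matches V and W, and the filter itself only ever uses A, H, V, W and the initial mean and variance:

```python
import numpy as np

rng = np.random.default_rng(4)

# Rocketship model, as above.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
V = np.array([[0.025, 0.05], [0.05, 0.1]])   # Var[v(k)], with v(k) = (v_f(k)/2, v_f(k))
W = np.array([[0.5]])                        # Var[w(k)]
x0_mean, P0 = np.zeros(2), np.eye(2)         # assumed initial mean and variance

# Uniform noise scaled to the required variances: Var[U(-c, c)] = c^2 / 3.
c_v = np.sqrt(3 * 0.1)                       # v_f(k) ~ U(-c_v, c_v) has variance 0.1
c_w = np.sqrt(3 * 0.5)                       # w(k)   ~ U(-c_w, c_w) has variance 0.5

x_true = rng.uniform(-np.sqrt(3), np.sqrt(3), size=2)   # uniform, zero mean, unit variance
x_hat, P = x0_mean.copy(), P0.copy()
for k in range(1, 51):
    u_f = 0.1                                            # some known thrust command
    v_f = rng.uniform(-c_v, c_v)
    x_true = A @ x_true + np.array([u_f / 2, u_f]) + np.array([v_f / 2, v_f])
    z = H @ x_true + rng.uniform(-c_w, c_w, size=1)
    # Kalman filter: time update (with the known input), then measurement update.
    x_hat = A @ x_hat + np.array([u_f / 2, u_f])
    P = A @ P @ A.T + V
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + W)
    x_hat = x_hat + K @ (z - H @ x_hat)
    IKH = np.eye(2) - K @ H
    P = IKH @ P @ IKH.T + K @ W @ K.T
print("true state:", x_true, " estimate:", x_hat)
```

Swapping the uniform draws for Gaussian or binary ones changes only the simulation, not the filter.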
6.4.2 What about nonlinear estimators?
We showed that the Kalman filter was the best amongst the class of unbiased, linear estimators. Can we do better?
Consider the following simple, one-dimensional, linear system:

$$
\begin{aligned}
x(k) &= x(k{-}1)\\
z(k) &= x(k) + w(k)
\end{aligned}
$$

where w(k) is uniformly distributed on [−1, 1] (i.e. w(k) ∈ [−1, 1] with PDF f(w(k)) = 0.5), x(0) ∼ N(0, 1), and all random variables are independent.
Kalman Filter implementation
We notice that the system is linear, and matches the form for which we designed the Kalman Filter: A(k) = 1, H(k) = 1. There is no process noise, so V (k) = 0, and we can
compute

$$
E[w(k)] = 0, \qquad W(k) = \mathrm{Var}[w(k)] = \tfrac{1}{3}.
$$
We now have all of the necessary problem data, and can compute the Kalman Filter as presented.
A recursive nonlinear estimator
We define a nonlinear estimator as follows, consisting of three variables l(k), u(k), and x̂NL(k):
Initialization: Let l(0) = ∞, u(0) = −∞, and x̂NL(0) = 0.

Recursion: At time k, we get the measurement z(k). Then:
$$
\begin{aligned}
l(k) &= \min\{l(k{-}1),\, z(k)\}\\
u(k) &= \max\{u(k{-}1),\, z(k)\}\\
\hat{x}_{NL}(k) &= \tfrac{1}{2}\bigl(l(k) + u(k)\bigr)
\end{aligned}
$$
Note that this estimator is intuitive: keep track of the smallest & largest measurement observed, and the estimate is their average. Clearly, this is a nonlinear estimator.
Simulation
(the script used in the lecture is made available on the course website.)
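A minimal sketch of such a comparison (ours, not the course script); the exact error values will depend on the noise realization:

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.normal(0.0, 1.0)            # x(0) ~ N(0, 1); the state never changes
x_kf, P = 0.0, 1.0                  # Kalman filter: A = H = 1, V = 0
W = 1.0 / 3.0                       # Var[w(k)] for w uniform on [-1, 1]
l, u = np.inf, -np.inf              # nonlinear estimator state

for k in range(1, 31):
    z = x + rng.uniform(-1.0, 1.0)  # z(k) = x(k) + w(k)

    # Kalman filter update (scalar case; no time update since there is no process noise).
    K = P / (P + W)
    x_kf = x_kf + K * (z - x_kf)
    P = (1 - K) ** 2 * P + K ** 2 * W

    # Nonlinear estimator: midpoint of the smallest and largest measurement so far.
    l, u = min(l, z), max(u, z)
    x_nl = 0.5 * (l + u)

    print(k, abs(x - x_kf), abs(x - x_nl))   # estimation errors of the two estimators
```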
We notice that the Kalman Filter does a reasonable job of estimating x. After 23 steps, the estimation error drops below 0.2.
However, we also notice that the nonlinear estimator converges much more quickly: after 12 steps, the error has dropped below 0.2, and the estimation error appears to monotonically decrease (for the Kalman Filter, it occasionally increases).
This simple example demonstrates that, even for simple linear systems, there may be nonlinear estimators that outperform the Kalman Filter.
What if the noise is Gaussian? Can you come up with an example linear system, with Gaussian, independent noises, where a nonlinear estimator outperforms the Kalman filter?