1. Introduction to stochastic systems
This course aims to discuss the following topics, in particular:
• The basics of stochastic systems, including
– basics of random variables
– Bayes' rule
– estimates from distributions
• Estimation
– the Kalman filter as
∗ Bayesian estimator
∗ best linear estimator
– applying the Kalman filter to nonlinear systems
∗ Extended Kalman filter (EKF)
∗ Unscented Kalman filter (UKF)
– the particle filter
• Control of uncertain systems
– the separation principle, and certainty equivalence
– Linear Quadratic Gaussian control
Additional topics may be covered at the end of the semester as time allows.
The objective of this course is to understand how to interact with, and ultimately control, an uncertain system. A large part of that is the use of appropriate state estimation tools.
We will look, specifically, at estimating the state of a system using computationally efficient methods that rely on a model of the system (with the model typically following from physical considerations).
u(k): known input
z(k): measured output
v(k): process noise
w(k): sensor noise
x(k): internal state
x̂(k): estimate of internal state
Systems being considered
Nonlinear, discrete-time system:
x(k) = q_{k−1}(x(k−1), u(k−1), v(k−1)),   k = 1, 2, …
z(k) = h_k(x(k), w(k))
where x(0), {v(·)}, {w(·)} have a probabilistic description.
Note: this is a very general model, but we are restricting to discrete time.
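As a concrete, purely illustrative instance of this model class, here is a minimal Python sketch; the specific functions q and h, the noise levels, and the scalar state are assumptions for illustration, not part of the course material:

```python
import numpy as np

rng = np.random.default_rng(0)

def q(x, u, v):
    # Assumed, illustrative nonlinear dynamics with process noise v.
    return 0.9 * np.sin(x) + u + v

def h(x, w):
    # Assumed, illustrative nonlinear measurement with sensor noise w.
    return x**2 + w

x = rng.normal()                 # x(0) has a probabilistic description
for k in range(1, 11):
    v = 0.1 * rng.normal()       # process noise v(k-1)
    w = 0.05 * rng.normal()      # sensor noise w(k)
    u = 0.0                      # known input u(k-1)
    x = q(x, u, v)               # state update: x(k) = q(x(k-1), u(k-1), v(k-1))
    z = h(x, w)                  # measured output: z(k) = h(x(k), w(k))
    print(f"k={k:2d}  x(k)={x:+.3f}  z(k)={z:+.3f}")
```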
Separation principle and certainty equivalence
We will consider estimation and control separately. We will cover the reasons why this is OK to do (beyond it being convenient). Our controllers will be static, state-feedback controllers.
Focus on recursive estimator algorithms
The estimator has its own state: x̂(k), the estimate of x(k) at time k. It will compute x̂(k) from x̂(k−1), u(k−1), z(k), and model knowledge (the dynamic model and the probabilistic noise models). There is no need to keep track of the complete time history {u(·)}, {z(·)}. Recursive rather than batch.
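A minimal runnable sketch of this recursion for an assumed scalar linear system, with a fixed, hand-picked correction gain L (choosing this gain well is precisely what the Kalman filter, covered later, does):

```python
import numpy as np

# Assumed scalar linear system: x(k) = a x(k-1) + v(k-1), z(k) = c x(k) + w(k).
rng = np.random.default_rng(1)
a, c = 0.95, 1.0
L = 0.5            # fixed, hand-picked correction gain (a placeholder, not an optimal choice)

x, x_hat = 1.0, 0.0            # true (hidden) state, and the estimator's own state x_hat(0)
for k in range(1, 21):
    x = a * x + 0.1 * rng.normal()                   # true state evolves (hidden)
    z = c * x + 0.2 * rng.normal()                   # measurement z(k) arrives
    x_hat_pred = a * x_hat                           # predict from model + previous estimate
    x_hat = x_hat_pred + L * (z - c * x_hat_pred)    # correct with the current measurement
print(f"final estimation error: {abs(x - x_hat):.3f}")
```

Note that only x_hat is carried forward between steps; the past inputs and measurements are discarded.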
Applications
• Generally, estimation is the dual of control. State feedback control: given state x(k) determine an input u(k). So one class of estimation problems is any state feedback control problem where the state x(k) is not available. This is a very large class of problems.
• Estimation without closing the loop: state estimates of interest in their own right (for example, system health monitoring, fault diagnosis, aircraft localization based on radar measurements, economic development, medical health monitoring).
Resulting algorithms
Will adopt a probabilistic approach.
• Underlying technique: Bayesian Filtering
• Linear system and Gaussian distributions: Kalman Filter
• Nonlinear system and (approximately) Gaussian distributions: Extended Kalman Filter, Unscented Kalman Filter
• Nonlinear system or non-Gaussian (especially multi-modal) distributions: Particle Filter
The Kalman Filter is a special case where we have analytical solutions. Generally, we face a trade-off: tractability vs. accuracy.
References for stochastic state estimation
The lecture notes for part I of this course are adapted from notes originally by Prof. Raffaello D’Andrea, ETH Zurich. The notes have also been used (in part) to teach ME233 in Spring 2017 and 2018.
Some good textbooks on state estimation are:
• D. Simon, Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John Wiley & Sons, 2006: covers a lot of ground, and is the primary reference for state estimation in this course.
• Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with applications to tracking and navigation: theory algorithms and software. John Wiley & Sons, 2004: Focuses on navigation / tracking problems. A good reference, practical, contains "real-world wisdoms" as well (and some groan-worthy humor).
• B. D. Anderson and J. B. Moore, Optimal filtering. Dover Publications, 2005: originally published 1979, lacks some new developments, but well-written and thorough. Inexpensive ($20 online).
References for control
In part 3 of this course we start considering how to control systems. The primary reference for this part is D. P. Bertsekas, Dynamic programming and optimal control (3rd edition). Athena Scientific, Belmont, MA, 2005, vol. 1 – this book is a model in clarity and conciseness, and an excellent reference to have.
2. Random variables and probability review
Engineering approach to probability: rigorous, but not the most general. We will not go into measure theory, for example, or formally into the Kolmogorov axioms.
2.1 Fundamental terms
2.1.1 Probability
First, we define the concept of a probability. Let A be some event or outcome (e.g. we toss a coin, and it comes up heads). We define the probability function as a function that maps an event to a probability: Prob(A).
The following are the three axiomatic properties of the function:
(1) The probability that the event has an outcome is 1, i.e.
Prob(something happens) = 1. (2.1)
– This really just means that we have to define the events that we're interested in so that they cover all possible outcomes.
(2) The probability associated with any event is non-negative, i.e.
Prob(A) ≥ 0 for any event A. (2.2)
(3) If A1 and A2 are disjoint events, then
Prob(A1 or A2) = Prob(A1) + Prob(A2). (2.3)
– By “disjoint events” we mean that events A1 and A2 cannot simultaneously both be true. For example:
∗ Rolling a fair single die: the die cannot both come up as a 1 and a 2, so Prob(roll 1 or 2) = Prob(roll 1) + Prob(roll 2)
∗ However, because it is possible to simultaneously roll an even number and roll a 2:
Prob(roll 2 or roll an even number) ≠ Prob(roll 2) + Prob(roll an even number)
From the above we can derive the following (already useful) things (can you derive them from the axioms?):
• Let Aᶜ be the complement of the event A. Then Prob(A) + Prob(Aᶜ) = 1.
– The complement Aᶜ is the event that A does not occur, i.e. if A is the event "I rolled a 1 on the die" then Aᶜ is the event "I did not roll a 1 on the die".
– The statement makes intuitive sense: the event "A or Aᶜ" covers all outcomes, e.g. I either did, or did not, roll a 1 on the die.
• The probability of no outcome occurring is zero:
Prob(no outcome) = 0
• When events A and B are not disjoint, we can compute the probability of either happening as follows:
Prob(A or B) = Prob(A) + Prob(B) − Prob(A and B)
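These rules are easy to sanity-check by brute-force enumeration; a small Python sketch using the die events from above:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]   # fair die: each outcome has probability 1/6

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o == 2            # event "roll a 2"
B = lambda o: o % 2 == 0        # event "roll an even number"

# A and B are not disjoint, so probabilities do not simply add:
lhs = prob(lambda o: A(o) or B(o))
rhs = prob(A) + prob(B) - prob(lambda o: A(o) and B(o))
assert lhs == rhs == Fraction(1, 2)
```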
2.1.2 Random variables
Our events are often things that take on numeric values, such as the number of people in a room, the outcome of rolling a die, the temperature of a machine, or the voltage potential over a terminal. Alternatively, we may assign numeric values to events, e.g. assign an integer value to each race driver in a car race.
A random variable x has an associated outcome space X. We will distinguish between discrete (integer-valued) and continuous (real-valued) random variables.
For example, the discrete random variable x ∈ X = {1,2,3,4,5,6} models rolling a die. The continuous random variable y ∈ Y = [0, 330] models the volume of remaining liquid (in ml) in a soda can.
We can now reason about probabilities using random variables, by defining events using the random variables:
• Prob(x ∈ {2, 4, 6}) is the probability of rolling an even number on a die.
• Prob(y > 165) is the probability that the soda can is more than half full.
Note that the previous discussion of probabilities still holds, i.e. Prob(x ∈ X ) = 1, etc. We will formalize the random variables after the following illustrative example.
2.1.3 Probability density
The probability density measures the likelihood of a random variable falling “near” a specific value. We will write this as fx(a) – the probability density of random variable x at the specific value a.
For discrete random variables, the density is simply the probability of the random variable taking a certain value, i.e.
fx(a) := Prob(x = a) for x discrete.
For continuous random variables, this is slightly more complex:
fx(a) := (d/da) Prob(x < a) for x continuous.
The name “probability density” comes from the intuition of spreading “probability mass” over the space X. There is a lot of probability mass near points of high density, and less mass where the density is lower.
Probability densities will be our primary modelling and analysis tool.
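As a small numeric illustration of the continuous definition (the density used here, fx(a) = 2a on [0, 1], is an assumed example): differentiating Prob(x < a) recovers fx(a).

```python
import numpy as np

# Assumed example on X = [0, 1]: fx(a) = 2a, so Prob(x < a) = a**2.
def cdf(a):
    return np.clip(a, 0.0, 1.0) ** 2

a, da = 0.6, 1e-6
density = (cdf(a + da) - cdf(a)) / da   # numeric d/da Prob(x < a)
print(density)                          # approximately fx(0.6) = 1.2
```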
2.2 Example from historical data
An intuitive approach to thinking about probabilities: the probability of an event is related to the relative frequency with which it occurs. We will use this example to introduce two probability rules: marginalization (sum rule) and conditioning (product rule).
• A power utility takes hourly data, and keeps track of the current weather and power consumption. The weather w is categorized as being either ‘cool’, ‘comfortable’, or ‘hot’ (w ∈ {1, 2, 3}), and the power consumption p as being either ‘low’, ‘average’, or ‘high’ (p ∈ {1, 2, 3}).
In particular, out of N observations:
nwp(i,j): number of periods with weather i and consumption j
nw(i): number of periods with weather i
np(j): number of periods with consumption j
• Define
fwp(i,j) := nwp(i,j)/N, the likelihood of weather i and consumption j
fw(i) := nw(i)/N, the likelihood of weather i
fp(j) := np(j)/N, the likelihood of consumption j
Note that fw(i) ≥ 0, and Σᵢ fw(i) = Σᵢ nw(i)/N = N/N = 1. Similarly for fp(j).
• We notice a few things:
nw(i) = Σⱼ nwp(i, j), all the periods with weather i
np(j) = Σᵢ nwp(i, j), all the periods with consumption j
Therefore fw(i) = Σⱼ fwp(i, j) and fp(j) = Σᵢ fwp(i, j).
Called the marginalization, or sum rule.
• Define
fw|p(i, j) := nwp(i, j)/np(j): the likelihood of weather i given that consumption is j
fp|w(j, i) := nwp(i, j)/nw(i): the likelihood of consumption j given weather i
Then
fwp(i, j) = nwp(i, j)/N = (nwp(i, j)/nw(i)) · (nw(i)/N) = fp|w(j, i) fw(i)
          = (nwp(i, j)/np(j)) · (np(j)/N) = fw|p(i, j) fp(j)
Called the conditioning, or product rule.
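A short sketch of this bookkeeping, with an assumed (made-up) table of counts nwp(i, j):

```python
import numpy as np

# Assumed (made-up) counts n_wp(i, j): rows i = weather, columns j = consumption.
n_wp = np.array([[200,  50,  10],
                 [ 60, 300,  80],
                 [ 10,  90, 200]])
N = n_wp.sum()

f_wp = n_wp / N                    # joint likelihood f_wp(i, j)
f_w = f_wp.sum(axis=1)             # sum rule: marginal over consumption
f_p = f_wp.sum(axis=0)             # sum rule: marginal over weather
f_p_given_w = n_wp / n_wp.sum(axis=1, keepdims=True)   # f_p|w(j, i); each row i sums to 1

# product rule check: f_wp(i, j) = f_p|w(j, i) * f_w(i)
assert np.allclose(f_wp, f_p_given_w * f_w[:, None])
```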
This is a frequentist approach to probability: it is intuitive, with probability as the relative frequency of an event in a large number of trials. This is a good way to think about probability for repeating processes such as tossing coins, rolling dice, and other phenomena where the physical process is essentially random and can occur many times.
Compare to Bayesian approach: probability is about beliefs and uncertainty. Measure of the state of knowledge. Applicable also to situations that happen only once, e.g. likelihood that man-made emissions are responsible for climate change. No inherent randomness necessary (just uncertainty).
We note that we can also condition probabilities: given events A and B, we can define the probability of event A conditioned on the event B as
Prob(A|B) = Prob(A and B) / Prob(B)
2.3 Discrete Random Variables (DRV)
Formalize the motivating example. Let x be a discrete random variable (DRV).
• X : the set of all possible outcomes, subset of the integers {..., −1, 0, 1, ...}.
• fx(·): the probability density function (PDF), a real-valued function that satisfies
1. fx(x̄) ≥ 0 ∀ x̄ ∈ X
2. Σ_{x̄∈X} fx(x̄) = 1
• fx(·) and X define the discrete random variable x.
• The PDF can be used to define the notion of probability: the probability that a random variable x is equal to some value x̄ ∈ X is fx(x̄). This is written as Prob(x = x̄) = fx(x̄).
• In order to simplify notation, we often use x to denote a DRV and a specific value the DRV can take. So for example, we will write Prob(x) = fx(x). Furthermore, if it is clear from context which DRV we are talking about, we simply write f(x) instead of fx(x). While this is convenient, it may be confusing at first. If so, use the more cumbersome notation until you are comfortable with the shorthand notation.
Examples
• X = {1,2,3,4,5,6}, f(x) = 1/6 ∀ x ∈ X, captures a fair die. The formal, longhand notation corresponding to this is fx(x̄) = 1/6 ∀ x̄ ∈ X.
• The flipping of a coin: X = {0, 1}. The coin bias is h, where 0 ≤ h ≤ 1.
f(x) = 1 − h for x = 0 (i.e. "tails"), h for x = 1 (i.e. "heads")
Multiple Discrete Random Variables
• Joint PDF.
– Let x and y be two DRVs. The joint PDF satisfies:
1. fxy(x, y) ≥ 0 ∀ x ∈ X, ∀ y ∈ Y
2. Σ_{x∈X} Σ_{y∈Y} fxy(x, y) = 1
– Example: X = Y = {1,2,3,4,5,6}, fxy(x, y) = 1/36, captures the outcome of two fair dice.
– Generalizes to an arbitrary number of random variables.
– Short form: f(x,y).
• Marginalization, or Sum Rule axiom:
Given fxy(x, y), define fx(x) := Σ_{y∈Y} fxy(x, y).
– This is a definition: fx(x) is fully defined by fxy(x, y). (Recall weather & consumption example.)
• Conditioning, or Product Rule axiom:
Given fxy(x, y), define fx|y(x, y) := fxy(x, y) / fy(y) (when fy(y) ≠ 0).
– This is a definition. fx|y(x, y) can be thought of as a function of x, with y fixed. It is easy to verify that it is a valid PDF in x. "Given y, what is the probability of x?" (Recall weather & consumption example.)
– Alternate, more expressive notation: fx|y(x|y).
– Short form: f(x|y)
– Usually written as f(x, y) = f(x|y) f(y) = f(y|x) f(x).
• We can combine these to give us our first theorem, the Total Probability Theorem:
fx(x) = Σ_{y∈Y} fx|y(x|y) fy(y).
A weighted sum of probabilities.
• Multi-variable generalizations: Sometimes x is used to denote a collection (or vector) of random variables x = (x1, x2, ..., xN). So when we write f(x) we implicitly mean f(x1, x2, ..., xN).
Marginalization: f(x) = Σ_{y∈Y} f(x, y) is short form for
f(x1, x2, ..., xN) = Σ_{(y1,...,yL)∈Y} f(x1, ..., xN, y1, ..., yL)
Note: the probability remains a scalar.
Conditioning: Similarly, f(x, y) = f(x|y) f(y) applies to collections of random variables.

2.4 Continuous Random Variables (CRV)
Very similar to discrete random variables.
• X is a subset of the real line, X ⊆ R (for example, X = [0, 1] or X = R).
• fx(·), the PDF, is a real-valued function satisfying:
1. fx(x) ≥ 0 ∀ x ∈ X
2. ∫_X fx(x) dx = 1
3. fx(x) is bounded and piecewise continuous
– Stronger than necessary, but will keep you out of trouble. No delta functions, things that go to infinity, etc. Adequate for most problems you will encounter.
• Relation to probability: doesn’t make sense to say that the probability of x is fx(x). Look at the following limiting process: consider the integers {1,2,...,N} divided by N; that is, the numbers {1/N,2/N,...,N/N}, which are in the interval [0,1]. Assume that all are of equal probability 1/N. As N goes to infinity, the probability of any specific value i/N goes to 0.
• Recall the relationship to probability: fx(a) := (d/da) Prob(x < a)
so that by invoking the fundamental theorem of calculus, the probability of being in an interval can be computed by integrating the PDF over that interval (the amount of “probability mass” in that “volume”):
Prob(x < a) = ∫_{−∞}^{a} fx(x̄) dx̄
Prob(x ∈ [a, b]) = Prob(x < b) − Prob(x < a) = ∫_{a}^{b} fx(x̄) dx̄

Change of variables: let y be a CRV with known PDF fy(·), and let x = g(y), where g is differentiable and strictly monotonic, i.e. either dg(y)/dy > 0 for all y or dg(y)/dy < 0 for all y. Assume that dg(y)/dy > 0 for all y; the other case is similar. We claim that
fx(x) = fy(y) / (dg(y)/dy)
For multivariable x and y, this can be generalized to
fx(x) = fy(h(x)) |det(∂g/∂y (h(x)))|⁻¹
where det(·) is the determinant, under the conditions that det(∂g/∂y (y)) ≠ 0 for all y, and that x = g(y) has a unique solution for y in terms of x, given by y = h(x). We will revisit the multivariable formula later.
Proof (single variable only, assuming dg(y)/dy > 0):
• Prob(y ∈ [ȳ, ȳ+∆y]) = ∫_{ȳ}^{ȳ+∆y} fy(y) dy → fy(ȳ)∆y as ∆y → 0
• Let x̄ = g(ȳ). Then g(ȳ+∆y) → g(ȳ) + (dg(ȳ)/dy)∆y as ∆y → 0, or g(ȳ+∆y) = x̄ + ∆x, where ∆x := (dg(ȳ)/dy)∆y
• Prob(x ∈ [x̄, x̄+∆x]) = Prob(y ∈ [ȳ, ȳ+∆y]) → fx(x̄)∆x = fx(x̄)(dg(ȳ)/dy)∆y = fy(ȳ)∆y as ∆x → 0 and ∆y → 0.
Therefore, by letting ∆y → 0, and hence ∆x → 0:
fx(x̄) = fy(ȳ) / (dg(ȳ)/dy).
This gives us another way for generating samples of a PDF:
• Given a desired PDF f̂x(x), and a method for sampling fy(y), find a function x = g(y) such that
dg(y)/dy = fy(y) / f̂x(g(y)).
Equivalently, solve the differential equation dx/dy = fy(y)/f̂x(x).
• Example: Let y be uniformly distributed on [0, 1]. Then fy(y) = 1. Solve
dx/dy = 1/f̂x(x) ⇒ dy = f̂x(x) dx ⇒ y = ∫_{−∞}^{x} f̂x(x̄) dx̄ = F̂x(x).
Note that this is algorithm A3 from before.
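A minimal sketch of this sampling recipe, assuming an exponential target PDF f̂x(x) = λ e^{−λx} for x ≥ 0 (chosen because y = F̂x(x) inverts in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                          # assumed rate of the target exponential PDF

y = rng.uniform(size=100_000)      # samples of fy: uniform on [0, 1)
x = -np.log(1.0 - y) / lam         # x = g(y) solves y = Fx(x) = 1 - exp(-lam * x)

print(x.mean())                    # approximately the exponential's mean, 1/lam = 0.5
```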
Exercise 2.5: Let x be a DRV capturing the roll of a fair die in a game of chance, with x ∈ {1,2,3,4,5,6} and fx(x) = 1/6.
Let y capture the monetary value associated with a certain outcome of the roll, where:
Die x    1  2  3  4  5  6
Value y  0  0  0  5  5  100
Compute the PDF of y.

Exercise 2.6: Let x be a continuous random variable, x ∈ X = [1, 2], and with probability density function f(x) = 2x/3. Define y = g(x) = x³. Fully characterize the probability density function of y.
2.9 Notable probability density functions
We will briefly review some common and important probability density functions (PDFs), which will show up later in the course.
2.9.1 Bernoulli distribution
Describes the outcome of a binary event, x ∈ X = {0, 1}. Has a single parameter, ρ, and the PDF has the form
f(x) = ρ if x = 0, 1 − ρ else.
We note the following: E[x] = 1 − ρ
Note that E[x] need not lie in X.
Example: flipping a coin, occurrence of a component failure.
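A quick sketch sampling this PDF from a uniform random variable, using the convention above (f(0) = ρ, so E[x] = 1 − ρ); the value ρ = 0.3 is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.3                               # assumed parameter: Prob(x = 0)

u = rng.uniform(size=100_000)
x = (u >= rho).astype(int)              # x = 0 with probability rho, x = 1 otherwise

print(x.mean())                         # approximately E[x] = 1 - rho = 0.7
```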
2.9.2 Uniform distribution
Defined for both continuous and discrete random variables. Characterized by the variable’s set of outcomes: x ∈ X
f(x) = c ∀ x ∈ X.
where c can be computed from the requirement that the PDF sum to one (or integrate if
CRV).
Note that, for vector-valued x, the form is the same: e.g. for three uniformly distributed
CRVs x, y, z ∈ [0, 1]: f(x, y, z) = 1.
For continuous random variables with X = [xmin, xmax]: E[x] = (xmin + xmax)/2
Example: often used to represent ignorance, all outcomes equally likely. What happens if X = [−a, a] and a → ∞?
2.9.3 Normal distribution
Also known as the Gaussian distribution. Has two parameters, μ and σ². For a single, scalar RV, we have that μ and σ are scalar, and
f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
We will often write this simply as x ∼ N(μ, σ²).
For vector-valued x ∈ Rᴺ, we have μ ∈ Rᴺ and Σ ∈ Rᴺˣᴺ with Σ = Σᵀ > 0. Then we write x ∼ N(μ, Σ) to mean
f(x) = (1/((2π)^{N/2} √(det(Σ)))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Note that the normal distribution
• is positive everywhere: f(x) > 0 ∀ x ∈ RN
• has mean E[x] = μ and variance Var[x] = Σ, or for scalar Var[x] = σ².
• is symmetric about μ: fx(μ − x) = fx(μ + x)
This is a very important distribution, extremely widely used.
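A short sketch evaluating the vector-valued PDF above and drawing samples from it; the particular μ and Σ are assumed for illustration:

```python
import numpy as np

mu = np.array([1.0, -1.0])              # assumed mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # assumed covariance (symmetric, positive definite)

def normal_pdf(x, mu, Sigma):
    # f(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2 pi)^{N/2} sqrt(det Sigma))
    N = len(mu)
    d = x - mu
    norm = (2.0 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=100_000)
print(normal_pdf(mu, mu, Sigma))        # density at the mean (the peak)
print(np.cov(samples.T))                # sample covariance, approximately Sigma
```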
Theoretical motivation
Let ui be identically distributed, independent random variables. Furthermore, let ui have bounded moments: E[uiʲ] is bounded for any integer j. Then define xk as follows:
xk := (1/√k) Σ_{i=1}^{k} ui   (2.5)
Then, by the central limit theorem as k → ∞, xk → N (·, ·). This is illustrated in the figure below for k ∈ {1, 2, 8}, and ui uniformly distributed. The bars show the normalized histograms over one million samples, and the line shows a normal distribution with the same mean and variance as the sample.
[Figure: normalized histograms of x1 (sum over 1), x2 (sum over 2), and x8 (sum over 8) for uniformly distributed ui, each overlaid with a normal PDF of matching mean and variance.]
Note that, even for small k, xk looks like a normal distribution (by the eyeball test). This also works for u having a distribution other than uniform.
That is, the central limit theorem states that (under mild assumptions, such as bounded variance) the sum of a set of independent random variables approaches a normal distribution as the number of summands tends to infinity. Many sensors, for example, have noise that is composed of many components (e.g. thermal effects), such that the sensor noise may be expected to be normally distributed. Similarly, the temperature in Berkeley at 12:00 on 1 January of every year will approximate a normal distribution (it depends on many additive effects).
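The histogram experiment is straightforward to reproduce; a sketch assuming zero-mean uniform summands on [−0.5, 0.5], using the third and fourth standardized moments (0 and 3 for a normal) as a crude normality check:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

for k in (1, 2, 8):
    u = rng.uniform(-0.5, 0.5, size=(n_samples, k))  # i.i.d. summands with bounded moments
    x_k = u.sum(axis=1) / np.sqrt(k)                 # x_k as in (2.5)
    z = (x_k - x_k.mean()) / x_k.std()               # standardize
    # For a normal, E[z^3] = 0 and E[z^4] = 3; a single uniform has E[z^4] = 1.8.
    print(k, (z**3).mean(), (z**4).mean())
```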
Pragmatic motivation
There are many pragmatic reasons to use (assume) Gaussian variables. They make many algorithms easier, as we will see in coming lectures. For example, we will see that Gaussian variables are closed under addition.
Caveat
The normal distribution has very rapidly decaying tails, meaning that values far away from the mean are extremely unlikely. Although it is easy to work with, and you can prove nice theorems, it can be inappropriate (dangerous) to assume normality, as it may lead to underestimating the likelihood of extreme outcomes. See the 2008 financial crisis.
2.9.4 Cauchy distribution
A kind of continuous random variable, motivated by the physics of light rays. Useful here as an example of a poorly behaved distribution. Specifically, x ∈ X = R, with
f(x) = 1 / (π(1 + x²)).
This is shown in the figure below. Note that it satisfies our assumptions from the definition of a PDF (f(x) ≥ 0; it integrates to 1; and f(x) is bounded and piecewise continuous).
[Figure: the Cauchy PDF f(x) compared with the standard normal N(0, 1).]
Note that the distribution is symmetric about zero, f(x) = f(−x). Nonetheless the expected value does not exist:
E[x] = ∫_{−∞}^{∞} x f(x) dx = ∫_{−∞}^{∞} x/(π(1 + x²)) dx = (1/(2π)) ln(1 + x²) |_{−∞}^{∞}
which is not defined (however, the mode = median = 0). We can also compute its second moment, E[x²]:
E[x²] = ∫_{−∞}^{∞} x² f(x) dx   (law of the unconscious statistician)
      = ∫_{−∞}^{∞} x²/(π(1 + x²)) dx = (1/π)(x − arctan x) |_{−∞}^{∞} = ∞
That is, the Cauchy distribution has infinite second moment, which means that the central limit theorem does not apply, as illustrated below. We note no apparent difference between the shape of the sample distribution for a single Cauchy variable (x1) and the sum of a large number of variables (x100), except that the peak gets lower. A normal distribution with equal mean and variance is again plotted as a solid line, but the variance is so large as to make the PDF almost zero everywhere. The central limit theorem fails because the Cauchy distribution does not have finite variance.
[Figure: normalized histograms of x1 (sum over 1), x10 (sum over 10), and x100 (sum over 100) for Cauchy-distributed ui, with a matching normal PDF overlaid.]
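A sketch illustrating the failure, assuming standard Cauchy samples: as k grows, the sample median of xk stays put while the sample variance never settles.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

for k in (1, 10, 100):
    u = rng.standard_cauchy(size=(n_samples, k))
    x_k = u.sum(axis=1) / np.sqrt(k)
    # The sample median stays near 0, but the sample variance never settles down:
    print(k, np.median(x_k), x_k.var())
```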