
5. Extracting estimates from probability density functions
We have seen how to manipulate probability density functions, especially how to use Bayes’ rule. The PDF of a variable, by definition, captures all information about that quantity, but we often need to produce an estimate xˆ of a random variable x, e.g. when we’re doing feedback control (typically u = −F xˆ). Typically, xˆ is of the same dimension as x.
Here, we discuss different ways to extract estimates from PDFs. We focus on CRVs, but the concepts carry over to DRVs.
We will focus on producing an estimate xˆ of a random variable x, given observation(s) z that are related to x.
Our estimates will all be optimal, that is, defined as the result of some optimization.
5.1 Maximum Likelihood (ML)
Often used when x ∈ X is an unknown (constant) parameter with no known probabilistic description. Given observation z and observation model f(z|x), compute x that makes the observation z most likely:
$$\hat{x}^{\text{ML}} := \arg\max_{x \in \mathcal{X}} f(z|x).$$
In this context, f(z|x) as a function of x is often called the likelihood function.
Note: we read the definition above as “the maximum likelihood estimator is that value of x which maximizes the value of the PDF of z given x”. This may initially be confusing: “PDF of z given x” sounds like we are “given” an x. Of course, we are not given an x;
instead we make an observation (say $\bar{z}$, which is a number), and then find the x that maximizes the function $f_{z|x}(\bar{z}|x)$. Now, $f_{z|x}(\bar{z}|x)$ is a function of only one variable (x), which is the one we search over.
Example
Consider two measurements of $x \in \mathbb{R}$:
$$z_1 = x + w_1, \qquad z_2 = x + w_2,$$
where w1 and w2 are two normally distributed, independent CRVs with zero mean and
unit variance; that is,
$$f(w_i) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{w_i^2}{2}\right)$$
(Shorthand: $w_i \sim \mathcal{N}(0, 1)$.)
Since z1 and z2 are conditionally independent given x (given x, z1 only depends on w1, z2 only depends on w2, and w1 and w2 are independent),
$$f(z_1, z_2|x) = f(z_1|x)\, f(z_2|x), \qquad f(z_i|x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(z_i - x)^2}{2}\right),$$
and, therefore,
$$f(z_1, z_2|x) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}\left[(z_1 - x)^2 + (z_2 - x)^2\right]\right).$$
Differentiating with respect to x and setting the derivative to 0 yields
$$(z_1 - \hat{x}) + (z_2 - \hat{x}) = 0 \quad\Rightarrow\quad \hat{x} = \frac{z_1 + z_2}{2}.$$
That is, for this example, the ML estimate is the average of the measurements. (A numerical check of this result is sketched after the list of variations below.)
Variations:
• $w_i \sim \mathcal{N}(0, \sigma_i^2)$, independent.
• $w_1$, $w_2$ uniformly distributed, independent.
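As a quick numerical check of the Gaussian case above, the following sketch (an added illustration; the measurement values and grid are arbitrary choices) maximizes the likelihood $f(z_1, z_2|x)$ over a grid of candidate x values and compares the maximizer to the closed-form average:

import numpy as np

# Two noisy measurements of the same scalar x (illustrative values).
z1, z2 = 2.3, 1.7

# Log-likelihood of f(z1, z2 | x) for unit-variance Gaussian noise,
# evaluated on a grid of candidate x values (constants dropped).
x_grid = np.linspace(-5.0, 5.0, 100001)
log_lik = -0.5 * ((z1 - x_grid) ** 2 + (z2 - x_grid) ** 2)

x_ml_grid = x_grid[np.argmax(log_lik)]   # numerical maximizer
x_ml_closed = 0.5 * (z1 + z2)            # closed-form ML estimate: the average

print(x_ml_grid, x_ml_closed)            # both are (approximately) 2.0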

Example (Generalization)
We generalize the previous example to m measurements, and an n-dimensional state:
$$z = Hx + w \quad\text{with}\quad z, w \in \mathbb{R}^m, \quad x \in \mathbb{R}^n, \quad m > n, \quad w_i \sim \mathcal{N}(0, 1) \text{ independent},$$
$$H = \begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_m \end{bmatrix}, \qquad H_i = \begin{bmatrix} h_{i1} & \dots & h_{in} \end{bmatrix}, \quad h_{ij} \in \mathbb{R},$$
and H is assumed to have full column rank.
• To compute f(z|x) from z = Hx + w and the PDF of w, we use a multivariable version of the change of variables formula for CRVs. One option is to proceed as in the previous example: argue that the $z_i$ are conditionally independent given x, and then compute $f(z_i|x)$ using the scalar change of variables formula. Instead, an alternative approach is presented here, which is also useful in other settings (for example, when the noise is not independent).
Multivariable change of variables for CRVs
Let g be a function mapping $y \in \mathbb{R}^n$ to $x \in \mathbb{R}^n$, $x = g(y)$, and assume that the determinant of the Jacobian matrix $\frac{\partial g}{\partial y}$ is nonzero for all y; that is,
$$\det\left(\frac{\partial g}{\partial y}(y)\right) = \det \begin{bmatrix} \frac{\partial g_1(y)}{\partial y_1} & \dots & \frac{\partial g_1(y)}{\partial y_n} \\ \vdots & & \vdots \\ \frac{\partial g_n(y)}{\partial y_1} & \dots & \frac{\partial g_n(y)}{\partial y_n} \end{bmatrix} \neq 0 \quad \text{for all } y,$$
where $\det(\cdot)$ is the determinant. Furthermore, assume that $x = g(y)$ has a unique solution for y in terms of x, say $y = h(x)$. Then:
$$f_x(x) = f_y(h(x)) \left|\det\left(\frac{\partial g}{\partial y}(h(x))\right)\right|^{-1}.$$
Note that the change of variables formula for a scalar CRV (see lecture #2) can be recovered from this.
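As an added aside (a brief worked step, not part of the original derivation), the scalar case follows by taking $n = 1$, where the Jacobian reduces to the ordinary derivative $g'(y)$:
$$f_x(x) = f_y(h(x))\, \bigl|g'(h(x))\bigr|^{-1} = \frac{f_y(h(x))}{|g'(h(x))|},$$
which is the scalar change of variables formula from lecture #2 (for strictly monotonic g).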
• We apply the multivariable change of variables formula with $z = g(x, w) = Hx + w$ and $w = h(z, x) = z - Hx$. Since $\det\left(\frac{\partial g}{\partial w}\right) = 1$, we obtain
$$\begin{aligned}
f(z|x) &= f_w(z - Hx) && \text{(by change of variables)} \\
&= f_{w_1}(z_1 - H_1 x) \cdot \ldots \cdot f_{w_m}(z_m - H_m x) && \text{(by independence of the } w_i\text{)} \\
&\propto \exp\left(-\tfrac{1}{2}\left[(z_1 - H_1 x)^2 + \ldots + (z_m - H_m x)^2\right]\right),
\end{aligned}$$
where ∝ denotes proportionality.
• Differentiating the above expression with respect to $x_j$ and setting the derivative to 0 gives:
$$(z_1 - H_1\hat{x})\, h_{1j} + (z_2 - H_2\hat{x})\, h_{2j} + \ldots + (z_m - H_m\hat{x})\, h_{mj} = 0, \qquad j = 1, \ldots, n,$$
$$\begin{bmatrix} h_{1j} & h_{2j} & \dots & h_{mj} \end{bmatrix} (z - H\hat{x}) = 0, \qquad j = 1, \ldots, n,$$
and, combining the equations for all j, we get
$$H^T(z - H\hat{x}) = 0, \qquad H^T H\, \hat{x} = H^T z, \qquad \text{and, finally,} \qquad \hat{x} = (H^T H)^{-1} H^T z$$
($H^T H$ is invertible by the full column rank assumption).
The obtained solution is the least squares (LS) solution. We can thus give least squares a statistical interpretation: it is the maximum likelihood estimate when the errors are independent, zero mean, equal variance, and normally distributed.
Recall the “standard” LS interpretation, minimizing a quadratic error:
$$\varepsilon(\hat{x}) := z - H\hat{x}, \qquad \hat{x}^{\text{LS}} := \arg\min_{\hat{x}} \varepsilon^T(\hat{x})\, \varepsilon(\hat{x}) = (H^T H)^{-1} H^T z.$$
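To illustrate this numerically, the following added sketch (dimensions, random seed, and data are arbitrary) builds an overdetermined system $z = Hx + w$ with unit-variance Gaussian noise and checks that the normal-equations solution matches numpy's least squares routine:

import numpy as np

rng = np.random.default_rng(0)

m, n = 20, 3                              # m measurements, n-dimensional state, m > n
H = rng.standard_normal((m, n))           # full column rank with probability 1
x_true = rng.standard_normal(n)
z = H @ x_true + rng.standard_normal(m)   # w_i ~ N(0, 1), independent

# ML estimate = least squares solution, via the normal equations.
x_ml = np.linalg.solve(H.T @ H, H.T @ z)

# The same solution via numpy's least squares solver.
x_ls, *_ = np.linalg.lstsq(H, z, rcond=None)

print(np.allclose(x_ml, x_ls))            # True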

Variations:
• $w_i \sim \mathcal{N}(0, \sigma_i^2)$ results in weighted least squares.
• $w \sim \mathcal{N}(0, \Sigma)$, correlated noise (the resulting estimator is sketched below).
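For the correlated-noise case, a standard result (stated here for reference; it is not derived in these notes) is that maximizing $f(z|x) = f_w(z - Hx)$ with $w \sim \mathcal{N}(0, \Sigma)$ gives the weighted (generalized) least squares estimate
$$\hat{x} = (H^T \Sigma^{-1} H)^{-1} H^T \Sigma^{-1} z,$$
which reduces to weighted least squares with weights $1/\sigma_i^2$ when $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$, and to the unweighted solution above when $\Sigma = I$.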
Why is ML not always a good thing to do?
The maximum may be very sensitive to small changes in the distribution. For example, consider a PDF that is generated from data, so that the PDF's shape is itself somewhat noisy (i.e. jagged). In the example below, the ML estimate is $\hat{x}^{\text{ML}} = x_1$, but for robustness you may actually prefer $x_2$: small variations (shifts) in $f(z|x)$ cause $x_1$ to have likelihood zero, whereas the likelihood of $x_2$ remains the same.

You may have prior knowledge about x, specifically its PDF f(x): then we use the maximum a posteriori estimate (below).
5.2 Maximum a posteriori (MAP)
We can use the MAP estimate when x is a random variable with a known prior PDF, i.e. where we knew something about x before we received the measurement z. We already know what to do (Bayes’ rule):
$$f(x|z) = \frac{f(z|x)\, f(x)}{f(z)}$$
$$\hat{x}^{\text{MAP}} := \arg\max_{x \in \mathcal{X}} f(x|z) = \arg\max_{x \in \mathcal{X}} f(z|x)\, f(x)$$
Which choice of parameters is the most likely one, given the observations and the prior knowledge about x?
Remarks:
• If f(x) is constant, then $\hat{x}^{\text{MAP}} = \hat{x}^{\text{ML}}$; that is, if all values of x are a priori equally likely, then the two estimates coincide.
• As for ML, we are maximizing a function over x, so the same robustness criticism as mentioned above may apply.
• Thus, maximum likelihood and MAP are two sides of the same coin, and have very similar properties.

Example
Consider the scalar observation
$$z = x + w \quad\text{with}\quad w \sim \mathcal{N}(0, 1), \quad x \sim \mathcal{N}(\bar{x}, \sigma_x^2), \quad \text{and } x \text{ and } w \text{ independent}.$$
Then
$$f(x) \propto \exp\left(-\frac{1}{2}\frac{(x - \bar{x})^2}{\sigma_x^2}\right) \qquad\text{and}\qquad f(z|x) \propto \exp\left(-\frac{1}{2}(z - x)^2\right).$$
By differentiating f(x|z) with respect to x, setting to 0, and solving for xˆ, we get:
$$\hat{x} = \frac{1}{1 + \sigma_x^2}\,\bar{x} + \frac{\sigma_x^2}{1 + \sigma_x^2}\, z, \quad \text{a weighted sum}.$$
Notice the following special cases:
$$\sigma_x^2 = 0: \quad \hat{x} = \bar{x} \quad \text{(maximum of the prior)}$$
$$\sigma_x^2 \to \infty: \quad \hat{x} = z \quad \text{(ML)}$$
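As an added numerical check of this example (the prior parameters and the measurement are arbitrary illustrative values), the sketch below maximizes the unnormalized posterior $f(z|x)\, f(x)$ on a grid and compares the maximizer to the weighted-sum formula:

import numpy as np

x_bar, sigma_x2 = 1.0, 4.0   # prior mean and variance (illustrative)
z = 3.0                      # observed measurement (illustrative)

# Unnormalized log-posterior: log f(z|x) + log f(x), constants dropped.
x_grid = np.linspace(-10.0, 10.0, 200001)
log_post = -0.5 * (z - x_grid) ** 2 - 0.5 * (x_grid - x_bar) ** 2 / sigma_x2

x_map_grid = x_grid[np.argmax(log_post)]
x_map_closed = x_bar / (1 + sigma_x2) + sigma_x2 / (1 + sigma_x2) * z

print(x_map_grid, x_map_closed)   # both are (approximately) 2.6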
5.3 Minimum Mean Squared Error (MMSE)
Define estimation error e := xˆ − x, a random variable.
The MMSE is the a posteriori estimate that minimizes the mean squared estimation error:
$$\begin{aligned}
\hat{x}^{\text{MMSE}} &:= \arg\min_{\hat{x} \in \mathbb{R}^N} E\left[(\hat{x} - x)^T(\hat{x} - x) \mid z\right] \\
&= \arg\min_{\hat{x} \in \mathbb{R}^N} \left(\hat{x}^T\hat{x} - 2\,\hat{x}^T E[x|z] + E\left[x^T x \mid z\right]\right)
\end{aligned}$$
Differentiate the quantity to be minimized with respect to $\hat{x}$, and set the derivative to 0 to solve:
$$\hat{x}^{\text{MMSE}} = E[x|z].$$
The MMSE estimate is simply the expected value of x conditioned on z.
Remarks:
• Compare this with the MAP estimate: the MAP is the maximum of the posterior PDF f(x|z), while the MMSE is the mean of f(x|z) – this is illustrated below for an example asymmetric PDF.

• There is no guarantee that $\hat{x}^{\text{MMSE}}$ is in $\mathcal{X}$: the minimization is over $\mathbb{R}^N$. Sometimes, we may prefer to enforce $\hat{x} \in \mathcal{X}$ (e.g. for a discrete random variable). For example, for a DRV with sample space $\mathcal{X}$, one may define
$$\hat{x}^{\text{MMSE2}} := \arg\min_{\hat{x} \in \mathcal{X}} E\left[(\hat{x} - x)^T(\hat{x} - x) \mid z\right]$$
in order to ensure that $\hat{x}^{\text{MMSE2}} \in \mathcal{X}$, while $\hat{x}^{\text{MMSE}}$ as defined above is not necessarily in $\mathcal{X}$.
Example
Consider the bimodal distribution f(x|z) below. The MMSE estimate is $\hat{x}^{\text{MMSE}} = x_1$. But the probability of x taking any value in a small neighborhood $[x_1 - \Delta x,\, x_1 + \Delta x]$ around $x_1$ is actually zero: $\Pr(x \in [x_1 - \Delta x,\, x_1 + \Delta x] \mid z) \approx 2 f_{x|z}(x_1|z)\, \Delta x = 0$ (for small $\Delta x$). In the sense of “likelihood,” this might be considered unsatisfactory. Nonetheless, $x_1$ is the estimate that minimizes the mean squared error. A numerical illustration of this effect is sketched below.
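The following added sketch (the bimodal density is an arbitrary Gaussian mixture chosen for illustration) computes the conditional mean and the posterior maximum on a grid; the mean falls between the two modes, exactly where the density is essentially zero:

import numpy as np

# Bimodal posterior f(x|z): an equal-weight mixture of two unit-variance
# Gaussians centered at -4 and +4 (an arbitrary illustrative choice).
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-0.5 * (x + 4.0) ** 2) + 0.5 * np.exp(-0.5 * (x - 4.0) ** 2)
f /= f.sum() * dx                       # normalize to a valid PDF on the grid

x_mmse = (x * f).sum() * dx             # conditional mean: approximately 0
x_map = x[np.argmax(f)]                 # posterior maximum: near -4 (or +4 by symmetry)
density_at_mmse = f[np.argmin(np.abs(x - x_mmse))]

print(x_mmse, x_map, density_at_mmse)   # the mean lies where the density is ~0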
5.4 Minimum Mean Absolute Error (MMAE)
Similarly to the MMSE estimate, the MMAE estimate minimizes the mean absolute error:
$$\hat{x}^{\text{MMAE}} := \arg\min_{\hat{x} \in \mathbb{R}^N} E\left[\,|\hat{x} - x| \mid z\,\right]$$
Derivation (only for scalar x):

We can compute the expectation as follows:

$$\begin{aligned}
E\left[\,|\hat{x} - x| \mid z\,\right] &= \int_{-\infty}^{\infty} |\hat{x} - x|\, f(x|z)\, dx \\
&= \int_{-\infty}^{\hat{x}} |\hat{x} - x|\, f(x|z)\, dx + \int_{\hat{x}}^{\infty} |\hat{x} - x|\, f(x|z)\, dx \\
&= \int_{-\infty}^{\hat{x}} (\hat{x} - x)\, f(x|z)\, dx - \int_{\hat{x}}^{\infty} (\hat{x} - x)\, f(x|z)\, dx
\end{aligned}$$
To find the minimizer, take the derivative with respect to $\hat{x}$. We need the following fact (the Leibniz integral rule):
$$\frac{d}{dy}\left(\int_{a(y)}^{b(y)} c(x, y)\, dx\right) = \int_{a(y)}^{b(y)} \frac{\partial c}{\partial y}\, dx + c(b(y), y)\,\frac{db}{dy}(y) - c(a(y), y)\,\frac{da}{dy}(y)$$
Then
$$0 = \left.\frac{d}{d\hat{x}}\, E\left[\,|\hat{x} - x| \mid z\,\right]\right|_{\hat{x} = \hat{x}^{\text{MMAE}}} = \int_{-\infty}^{\hat{x}^{\text{MMAE}}} f(x|z)\, dx - \int_{\hat{x}^{\text{MMAE}}}^{\infty} f(x|z)\, dx$$
$$= \Pr\left(x \le \hat{x}^{\text{MMAE}} \mid z\right) - \Pr\left(x > \hat{x}^{\text{MMAE}} \mid z\right)$$
Or,
$$\Pr\left(x \le \hat{x}^{\text{MMAE}} \mid z\right) = \Pr\left(x > \hat{x}^{\text{MMAE}} \mid z\right) = \frac{1}{2}.$$
From this it follows that $\hat{x}^{\text{MMAE}}$ is the median of the distribution, i.e. the value of x for which half the probability mass lies on either side. Note that this may be hard to compute.
The MMAE is often considered to have desirable robustness properties. Example: consider a sensor that has a 10% probability of a decoding error (in which case the output is much larger than the truth), and otherwise has a normally distributed error, with the PDF shown in the figure below. Since we know that the outliers (decoding errors) are not informative, the MMAE does a better job (by the eye-ball test) than either the MMSE or the MAP estimate. A numerical sketch of this comparison is given below.
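The following added sketch illustrates this kind of comparison (the mixture parameters are arbitrary illustrative choices): with 10% probability the output comes from a far-away outlier component, and the median (MMAE) is pulled far less than the mean (MMSE):

import numpy as np

# Density on a grid: 90% of the mass from N(0, 1) "good" readings,
# 10% from N(50, 1) outliers due to decoding errors (illustrative values).
x = np.linspace(-20.0, 80.0, 400001)
dx = x[1] - x[0]
f = (0.9 * np.exp(-0.5 * x ** 2) + 0.1 * np.exp(-0.5 * (x - 50.0) ** 2)) / np.sqrt(2 * np.pi)
f /= f.sum() * dx                          # normalize on the grid

x_mmse = (x * f).sum() * dx                # mean: pulled toward the outliers (~5)
x_map = x[np.argmax(f)]                    # mode: ~0
cdf = np.cumsum(f) * dx
x_mmae = x[np.searchsorted(cdf, 0.5)]      # median: slightly above 0, barely affected

print(x_mmse, x_map, x_mmae)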

5.5 Which estimator to use?
Generally speaking, which type of estimate (ML, MAP, MMSE, etc.) you should use depends on the application (what is it that you want to minimize/maximize/achieve?).
Furthermore, some estimators have desirable computational properties. For example, it is usually much easier to reason about the mean of a random variable than about its median.
In many cases, we simply use the mean as the estimate; this corresponds to the minimum mean squared error estimator. It is important to understand what that means, and when this may give unexpected results.