STAT 513/413: Lecture 21 What is it good for
(An overview of most common statistical optimization tasks, with special attention to maximum likelihood)
What is most often optimized in statistics?
For instance, regression problems – this is a very common application. Something like
$\sum_i \rho(y_i - g_i(\theta)) = \sum_i \rho(y_i - g(x_i, \theta)) \to \min_\theta !$
– where ρ(u) can be something like ρ(u) = u² (the easiest one)
– and g(x_i, θ) something like $x_{i1}\theta_1 + \cdots + x_{ip}\theta_p$ (the easiest one too)
The form of ρ may come from the distribution of y_i
Regression problems can be an instance of the following methodology, based on looking for the parameters that maximize likelihood
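To make the objective concrete, here is a minimal numerical sketch (my own illustration, not part of the lecture): it minimizes $\sum_i \rho(y_i - g(x_i,\theta))$ with the easiest choices ρ(u) = u² and a linear g, handing the sum to a general-purpose optimizer; the data and all names are made up.

```python
# A minimal sketch (not from the lecture): minimize sum_i rho(y_i - g(x_i, theta))
# with rho(u) = u^2 and linear g(x_i, theta) = x_i1*theta_1 + x_i2*theta_2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # hypothetical design matrix
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=50)

def rho(u):
    return u**2                                   # the "easiest" choice of rho

def objective(theta):
    return np.sum(rho(y - X @ theta))             # sum_i rho(y_i - g(x_i, theta))

fit = minimize(objective, x0=np.zeros(2))
print(fit.x)                                      # close to the least-squares solution
```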
Maximum likelihood
Suppose that Yi are independent and have distribution fi(yi,θ)
Then the joint density of y1, y2, . . . , yn is f1(y1, θ)f2(y2, θ) . . . fn(yn, θ)
Maximum likelihood: a good estimate of θ is the one that maximizes likelihood – the joint density of the Yi’s, viewed for the observed yi’s as a function of θ
$f_1(y_1, \theta)\, f_2(y_2, \theta) \cdots f_n(y_n, \theta) \to \max_\theta !$
This optimization problem is equivalent to
$-\log f_1(y_1, \theta) - \log f_2(y_2, \theta) - \cdots - \log f_n(y_n, \theta) \to \min_\theta !$
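A quick numerical check of this equivalence (my own illustration; the N(θ, 1) density and the toy data are made up): the θ that maximizes the product of densities is the same θ that minimizes the sum of negative logs.

```python
# Checking that maximizing the likelihood and minimizing the negative loglikelihood
# pick the same theta; here f(y, theta) is the N(theta, 1) density (toy example).
import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])                   # toy data
grid = np.linspace(-1, 3, 2001)                            # candidate values of theta

likelihood = np.array([np.prod(norm.pdf(y, loc=t, scale=1.0)) for t in grid])
neg_loglik = np.array([np.sum(-norm.logpdf(y, loc=t, scale=1.0)) for t in grid])

print(grid[np.argmax(likelihood)], grid[np.argmin(neg_loglik)])   # the same value, ~ mean(y)
```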
An example: location-scale model
The data, y1, y2, …, yn come as realizations of Y1, Y2, …, Yn, which are independent and all have the same distribution with density
$f(y, \theta) = f(y; \mu, \sigma) = \frac{1}{\sigma}\,\varphi\!\left(\frac{y - \mu}{\sigma}\right)$ with μ arbitrary, but σ > 0
That is, Yi is related to some “standard” distribution whose density is φ(y) = f(y; 0, 1)
To obtain Yi, we take Ui from this “standard” distribution, multiply it by σ and then shift it by μ
Or, we can view it also in the opposite direction: when μ is subtracted from Yi and then the result is divided by σ, we obtain a quantity with a “standard” distribution
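A tiny simulation of this construction (my own sketch; the choice of N(0, 1) as the “standard” distribution and all numbers are arbitrary):

```python
# The location-scale construction: take U from the "standard" distribution,
# multiply by sigma and shift by mu; standardizing (Y - mu)/sigma goes back.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 3.0, 2.0
U = rng.standard_normal(100_000)     # here the "standard" distribution is N(0, 1)
Y = mu + sigma * U                   # shift and scale
Z = (Y - mu) / sigma                 # back on the "standard" scale
print(Y.mean(), Y.std(), Z.mean(), Z.std())   # roughly 3, 2, 0, 1
```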
Instance: normal model
A well-known instance of this model is when the “standard” distribution is N(0,1), standard normal distribution
$f(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$
The negative loglikelihood, to be minimized, is then
$\sum_{i=1}^{n} -\log f(y_i; \mu, \sigma) = n \log\sqrt{2\pi} + n \log\sigma + \sum_{i=1}^{n} \frac{(y_i - \mu)^2}{2\sigma^2}$
and we obviously care only about the last two terms, as the first one is irrelevant regarding the parameters: it is constant with respect to them
Differentiation yields here simple equations that can typically be solved in closed form
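For this model the closed form is the sample mean and the root of the mean squared deviation; here is a quick check against a general-purpose minimizer (my own sketch, with made-up data; σ is parameterized through its logarithm to keep it positive).

```python
# Checking the closed-form normal MLE against a general-purpose minimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=3.0, size=200)
n = len(y)

def negloglik(par):
    mu, log_sigma = par                       # optimize log(sigma) so that sigma > 0
    sigma = np.exp(log_sigma)
    return n * np.log(sigma) + np.sum((y - mu)**2) / (2 * sigma**2)

fit = minimize(negloglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

print(mu_hat, y.mean())                                   # both ~ sample mean
print(sigma_hat, np.sqrt(np.mean((y - y.mean())**2)))     # both ~ root mean squared deviation
```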
Instance: Cauchy model
Another instance takes the Cauchy distribution for the “standard” one:
$f(y; \mu, \sigma) = \frac{1}{\pi\sigma\left(1 + \left(\frac{y - \mu}{\sigma}\right)^2\right)}$
The relevant negative loglikelihood to be minimized (dropping the constant $n\log\pi$) is
$\sum_{i=1}^{n} -\log f(y_i; \mu, \sigma) = n \log\sigma + \sum_{i=1}^{n} \log\!\left(1 + \left(\frac{y_i - \mu}{\sigma}\right)^2\right)$
Taking derivatives in μ and σ, and setting them to zero we obtain
Likelihood equations
$\sum_{i=1}^{n} \frac{2\,\frac{y_i - \mu}{\sigma}\left(-\frac{1}{\sigma}\right)}{1 + \left(\frac{y_i - \mu}{\sigma}\right)^2} = 0 \quad\text{that is}\quad \sum_{i=1}^{n} \frac{y_i - \mu}{\sigma^2 + (y_i - \mu)^2} = 0$
and
$\frac{n}{\sigma} + \sum_{i=1}^{n} \frac{2\,\frac{y_i - \mu}{\sigma}\left(-\frac{y_i - \mu}{\sigma^2}\right)}{1 + \left(\frac{y_i - \mu}{\sigma}\right)^2} = 0 \quad\text{that is}\quad \sum_{i=1}^{n} \frac{(y_i - \mu)^2}{\sigma^2 + (y_i - \mu)^2} = \frac{n}{2}$
We have to solve this system somehow: for instance, by the Newton method (but then we have to calculate three more derivatives)
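Alternatively, we can hand the negative loglikelihood itself to a general-purpose optimizer; a minimal sketch (my own, not from the slides), with made-up data and σ parameterized through its logarithm so that it stays positive:

```python
# Minimizing the Cauchy negative loglikelihood
#   n*log(sigma) + sum log(1 + ((y_i - mu)/sigma)^2).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
y = 10.0 + 2.0 * rng.standard_cauchy(size=500)     # Cauchy data with mu = 10, sigma = 2
n = len(y)

def negloglik(par):
    mu, log_sigma = par
    sigma = np.exp(log_sigma)                       # keeps sigma > 0
    return n * np.log(sigma) + np.sum(np.log1p(((y - mu) / sigma)**2))

start = np.array([np.median(y), np.log(np.median(np.abs(y - np.median(y))))])
fit = minimize(negloglik, x0=start)
print(fit.x[0], np.exp(fit.x[1]))                   # estimates of mu and sigma
```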
Maximum likelihood in a very simplified setting
Just assume that Y has distribution f = f(y, θ). If we are taking a derivative, it is in θ:
$\dot{f} = \dot{f}(y, \theta) = \frac{\partial f(y, \theta)}{\partial\theta} \qquad \ddot{f} = \ddot{f}(y, \theta) = \frac{\partial^2 f(y, \theta)}{\partial\theta^2}$
If we are taking an expected value, it is in y: for instance,
$E(-\log f) = E(-\log f(Y, \theta)) = \int (-\log f(y, \theta))\, f(y, \theta)\, dy = -\int f(y, \theta) \log f(y, \theta)\, dy$
Another example: $E(Y) = \int y\, f(y, \theta)\, dy$
The crucial tricks
We need some mathematical regularity conditions – but let us suppose they are all in place… Now some tricks: we start with
$\int f(y, \theta)\, dy = 1$ (f(y, θ) is a probability density, right?)
and then we take a derivative, in θ, of both sides
$\int \dot{f}(y, \theta)\, dy = \frac{\partial 1}{\partial\theta} = 0$
and then we can do it again
$\int \ddot{f}(y, \theta)\, dy = \frac{\partial 0}{\partial\theta} = 0$
Log-likelihood
The important quantity, as we have seen above, is also
$\ell(y, \theta) = \ell = -\log f = -\log f(y, \theta)$
Again, the derivatives are in θ:
$\dot{\ell} = \frac{\partial}{\partial\theta}(-\log f) = -\frac{\dot{f}}{f}$
And expected values in y (or Y, as needed):
$E(\dot{\ell}) = E\!\left(-\frac{\dot{f}}{f}\right) = -\int \frac{\dot{f}}{f}\, f\, dy = -\int \dot{f}(y, \theta)\, dy = 0$
so that then we have an important quantity (Fisher information)
$I(\theta) = \operatorname{Var}(\dot{\ell}) = E(\dot{\ell}^{\,2}) - (E(\dot{\ell}))^2 = E(\dot{\ell}^{\,2}) = E\!\left(\left(-\frac{\dot{f}}{f}\right)^{\!2}\right) = E\!\left(\frac{\dot{f}^{\,2}}{f^2}\right)$
A longer, but important calculation
$E(\ddot{\ell}) = -E\!\left(\frac{\ddot{f} f - \dot{f}\dot{f}}{f^2}\right) = -E\!\left(\frac{\ddot{f}}{f} - \frac{\dot{f}^{\,2}}{f^2}\right) = E\!\left(\frac{\dot{f}^{\,2}}{f^2}\right) - E\!\left(\frac{\ddot{f}}{f}\right) = I(\theta) - \int \frac{\ddot{f}}{f}\, f\, dy = I(\theta) - \int \ddot{f}\, dy = I(\theta)$
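These identities are easy to check by simulation; a small sketch (my own, not from the slides) for the N(θ, 1) model, where $\ell = -\log f$ gives $\dot{\ell} = -(y - \theta)$ and $\ddot{\ell} = 1$, so I(θ) = 1:

```python
# Simulation check that E(l_dot) = 0 and Var(l_dot) = E(l_ddot) = I(theta)
# for the N(theta, 1) model: l = -log f, l_dot = -(y - theta), l_ddot = 1.
import numpy as np

rng = np.random.default_rng(4)
theta = 2.0
y = rng.normal(loc=theta, scale=1.0, size=1_000_000)

l_dot = -(y - theta)                  # derivative of -log f(y, theta) in theta
l_ddot = np.ones_like(y)              # second derivative is constant here

print(l_dot.mean())                   # ~ 0
print(l_dot.var(), l_ddot.mean())     # both ~ 1 = I(theta)
```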
And now the Newton method
We want to find $\ell \to \min_\theta !$, which amounts to solving $\dot{\ell} = 0$
Newton method: $\ddot{\ell}\,(\theta_{k+1} - \theta_k) = -\dot{\ell}$
We are fine with $\dot{\ell}$, but $\ddot{\ell}$ is way too many derivatives to calculate: we approximate $\ddot{\ell}$ by its expected value $E(\ddot{\ell})$ – (Fisher) scoring
Yeah, but what is θ? Newton method, more precisely:
$\ddot{\ell}(y, \theta_k)(\theta_{k+1} - \theta_k) = -\dot{\ell}(y, \theta_k)$
after the approximation
$I(\theta_k)(\theta_{k+1} - \theta_k) = -\dot{\ell}(y, \theta_k)$
OK, but instead of $\ddot{\ell}(y, \theta_k)$ we have to calculate $I(\theta_k)$ – is that any better?
Well, often it is: note that after all, $I(\theta_k) = E(\dot{\ell}^{\,2})$, so at least we do not have to evaluate the second derivative
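As an illustration (my own sketch, not from the slides): Fisher scoring for the Cauchy location μ with σ known, using the standard fact that the total Fisher information for μ is n/(2σ²); that value is assumed here, not derived in the lecture.

```python
# Fisher scoring for the Cauchy location mu with sigma known.
# Score of the negative loglikelihood: l_dot = -sum 2*(y_i - mu)/(sigma^2 + (y_i - mu)^2);
# total Fisher information: I(mu) = n/(2*sigma^2) (standard fact, assumed here).
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0
y = 4.0 + sigma * rng.standard_cauchy(size=200)
n = len(y)

mu = np.median(y)                      # reasonable starting value
for _ in range(100):
    l_dot = -np.sum(2 * (y - mu) / (sigma**2 + (y - mu)**2))
    info = n / (2 * sigma**2)
    step = -l_dot / info               # I(mu_k) * (mu_{k+1} - mu_k) = -l_dot(y, mu_k)
    mu += step
    if abs(step) < 1e-10:
        break
print(mu)                              # maximum likelihood estimate of mu
```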
Nonlinear least-squares in a simplified setting
Consider the problem
$\sum_i (y_i - g(x_i, \theta))^2 \to \min_\theta !$
equivalent to
$\sum_i \tfrac{1}{2} (y_i - g(x_i, \theta))^2 \to \min_\theta !$
(this 1/2 is added for purely aesthetic reasons, as will be seen later)
However, θ is not a vector $\theta = (\theta_1, \ldots, \theta_p)^T$ now – we would have to use partial derivatives, gradients, Hessians and all that – but θ is only a single number, so we can again write $g_i = g(x_i, \theta)$, with $\dot{g}_i$ and $\ddot{g}_i$ being the corresponding partial derivatives in θ
The problem to solve is
$\ell = \sum_i \ell_i = \sum_i \tfrac{1}{2} (y_i - g_i)^2 \to \min_\theta !$
that is, $\ell_i(y_i, \theta) = \tfrac{1}{2}\,(y_i - g(x_i, \theta))^2$
Well… are there any distributions so that the $\ell_i(y_i, \theta)$ above come as $-\log f_i(y_i, \theta)$, where $f_i(y_i, \theta)$ is a density of $Y_i$? Turns out that yes (…), but we have to worry about that later, if at all; so far, we just have to assume that $E(Y_i) = g_i$
We are to solve
$\dot{\ell} = \sum_i \dot{\ell}_i = \sum_i (y_i - g_i)(-\dot{g}_i) = 0$
Let us use the rules: we take $\ddot{\ell}$
$\ddot{\ell} = \sum_i \bigl((-\dot{g}_i)(-\dot{g}_i) + (y_i - g_i)(-\ddot{g}_i)\bigr) = \sum_i \dot{g}_i^{\,2} - \sum_i (y_i - g_i)\,\ddot{g}_i$
and replace it by its expected value, which is in fact – note: the expected value is still in $y_i$ (or $Y_i$) –
$E(\ddot{\ell}) = \sum_i E(\dot{g}_i^{\,2}) - \sum_i E\bigl((Y_i - g_i)\,\ddot{g}_i\bigr) = \sum_i \dot{g}_i^{\,2} - \sum_i E(Y_i - g_i)\,\ddot{g}_i = \sum_i \dot{g}_i^{\,2}$
The expected value is in $y_i$, and $g_i = g(x_i, \theta)$ does not contain any $y_i$ – it is a constant with respect to $y_i$, and hence E(const) = const
So the Gauss-Newton method amounts to iterating along the rule
$\sum_i \dot{g}_i^{\,2}\,(\theta_{k+1} - \theta_k) = -\sum_i (y_i - g_i)(-\dot{g}_i) = \sum_i (y_i - g_i)\,\dot{g}_i$
with $g_i$ and $\dot{g}_i$ on both sides evaluated at $\theta_k$
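A minimal sketch of this scalar rule (my own, for a made-up model g(x, θ) = exp(−θx), so that $\dot{g}_i = -x_i e^{-\theta x_i}$):

```python
# Scalar Gauss-Newton: sum g_dot_i^2 * (theta_{k+1} - theta_k) = sum (y_i - g_i) * g_dot_i,
# here for the made-up model g(x, theta) = exp(-theta * x).
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0.1, 5.0, 60)
theta_true = 0.8
y = np.exp(-theta_true * x) + rng.normal(scale=0.02, size=x.size)

theta = 0.3                                   # starting value
for _ in range(50):
    g = np.exp(-theta * x)                    # g_i at the current theta
    g_dot = -x * np.exp(-theta * x)           # derivative of g_i in theta
    step = np.sum((y - g) * g_dot) / np.sum(g_dot**2)
    theta += step
    if abs(step) < 1e-12:
        break
print(theta)                                  # close to theta_true
```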
General expression
The problem is
$\sum_i (y_i - g(x_i, \theta))^2 \to \min_\theta !$
where $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^T$
We calculate the sum of gradients (the sum of p × 1 vectors)
$b(\theta^k) = \sum_i \nabla \ell_i(y_i, \theta^k) = -\sum_i \bigl(y_i - g(x_i, \theta^k)\bigr)\,\nabla g(x_i, \theta^k)$ (note: the superscript indexes the iteration)
and the sum of p × p matrices (Fisher information matrix)
$A(\theta^k) = \sum_i \nabla g(x_i, \theta^k)\, \nabla g(x_i, \theta^k)^T$ (note: the subscript i indexes the observations)
The Gauss-Newton iteration $\theta^{k+1}$ solves the system
$A(\theta^k)(\theta^{k+1} - \theta^k) = -b(\theta^k)$
(and also the Fisher information matrix at the solution is then ready – which is a path to evaluating the variance of the solution…)
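A sketch of the general iteration (my own, for a made-up two-parameter model $g(x, \theta) = \theta_1 e^{-\theta_2 x}$; the gradient ∇g is coded by hand and each step solves the p × p linear system):

```python
# General Gauss-Newton for the made-up model g(x, theta) = theta1 * exp(-theta2 * x):
#   A(theta^k) = sum_i grad_i grad_i^T,   b(theta^k) = -sum_i (y_i - g_i) grad_i,
#   and each step solves A(theta^k) (theta^{k+1} - theta^k) = -b(theta^k).
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 4.0, 80)
theta_true = np.array([2.0, 0.7])
y = theta_true[0] * np.exp(-theta_true[1] * x) + rng.normal(scale=0.05, size=x.size)

theta = np.array([1.0, 0.2])                              # starting value
for _ in range(100):
    g = theta[0] * np.exp(-theta[1] * x)                  # g(x_i, theta^k)
    grad = np.column_stack([np.exp(-theta[1] * x),        # d g / d theta_1
                            -theta[0] * x * np.exp(-theta[1] * x)])  # d g / d theta_2
    A = grad.T @ grad                                     # sum of grad_i grad_i^T   (p x p)
    b = -grad.T @ (y - g)                                 # -sum of (y_i - g_i) grad_i  (p x 1)
    step = np.linalg.solve(A, -b)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-12:
        break
print(theta)                                              # close to theta_true
# A at the solution plays the role of the Fisher information matrix mentioned above
```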