Probability Density Functions
Australian National University
6.1
Density and Likelihood Functions
First, suppose X is an (absolutely continuous) random variable
Everything we will need to know about X is summarised by its
probability density function (pdf) f(x)
That is, given f(x) we can compute E[X], Var(X), E[sin(X)], E[X^20], etc.
Examples
Univariate Normal, N(µ, σ²):

f(x; µ, σ²) = (1/(2πσ²))^{1/2} exp( −(1/(2σ²)) (x − µ)² )
            = (2π)^{−1/2} |σ²|^{−1/2} exp( −½ (x − µ)(σ²)^{−1}(x − µ) )

Multivariate Normal, N(µ, Σ):

f(x; µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp( −½ (x − µ)′Σ^{−1}(x − µ) )
Normalisation: ∫ f(x) dx = 1
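As a quick numerical illustration (my own sketch, not from the slides), one can check the normalisation of, say, the standard normal pdf:

% Numerically check that the N(0,1) pdf integrates to one (illustrative sketch)
f = @(x) (2*pi)^(-1/2) * exp(-x.^2/2);   % standard normal density
disp(integral(f, -Inf, Inf))             % should print 1, up to numerical tolerance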
Joint Distribution of Normal Random Variables
Suppose we have n iid random variables X1, . . . , Xn with Xi ∼ N(µ, σ²),
and we wish to derive their joint pdf f(x)
As X1, . . . , Xn are iid:
f(x) = ∏_{i=1}^n f(xi; µ, σ²)
     = ∏_{i=1}^n (1/(2πσ²))^{1/2} exp( −(1/(2σ²)) (xi − µ)² )
     = (1/(2πσ²))^{n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (xi − µ)² )
Note: the joint pdf factorises into the product of the marginals because the Xi are independent, and the exponentials then combine since e^a e^b = e^{a+b}.
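Purely as a numerical sanity check (my own sketch, not from the slides), the lines below evaluate the product of the univariate normal pdfs and the closed-form joint expression at an arbitrary point; the two agree up to rounding.

% Hypothetical check: product of N(mu, sig2) pdfs equals the closed-form joint pdf
mu = 1; sig2 = 2; n = 5;
x = randn(n,1);                                          % arbitrary evaluation point
f_i = (2*pi*sig2)^(-1/2) * exp(-(x - mu).^2/(2*sig2));   % the n univariate pdfs
joint_prod = prod(f_i);                                  % product over i
joint_closed = (2*pi*sig2)^(-n/2) * exp(-sum((x - mu).^2)/(2*sig2));
disp(abs(joint_prod - joint_closed))                     % ~ 0 up to rounding error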
Joint Distributions of Normals are Multivariate Normals
Quick Exercise: The joint distribution of iid normals from the previous
slide is a multivariate normal.
Joint normal:
(1/(2πσ²))^{n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (xi − µ)² )
Multivariate Normal:
(2π)^{−n/2} |Σ|^{−1/2} exp( −½ (x − µ)′Σ^{−1}(x − µ) )
Sketch of the exercise: take µ = µ·1_n (the n-vector with every entry µ) and Σ = σ²·I_n. Then |Σ| = (σ²)^n and Σ^{−1} = (1/σ²)I_n, so

(2π)^{−n/2} |σ²I_n|^{−1/2} exp( −½ (x − µ1_n)′(σ²I_n)^{−1}(x − µ1_n) )
  = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (xi − µ)² ),

which is exactly the joint pdf from the previous slide.
Yet More Distributions
Distribution    Notation       pdf                                                               Support
Bernoulli       Ber(p)         p^x (1 − p)^{1−x}                                                 {0, 1}
Uniform         U(a, b)        1/(b − a)                                                         (a, b)
Gamma           G(a, b)        (b^a / Γ(a)) x^{a−1} e^{−bx}                                      R_+
Univariate t    t(ν, µ, σ²)    Γ((ν+1)/2) / (σ√(νπ) Γ(ν/2)) · (1 + (1/ν)((x−µ)/σ)²)^{−(ν+1)/2}   R
Properties of Determinants
A Brief Mathematical Interlude
Australian National University
6.2
Digression: Properties of Determinants
What is the determinant?
Multiplying a row/column is fine; also multiplying the whole matrix
Swapping rows/columns is ok
Loves triangular matrices
Plays well with inverses and transposes
Plays well with matrix multiplication
For a 2×2 matrix,

det([a b; c d]) = ad − bc,

and we also write det(A) = |A|. The work needed to compute a determinant grows quickly with the size of the matrix.

Multiplying a single row (or column) by k multiplies the determinant by k:

det([ka kb; c d]) = kad − kbc = k det(A).

Multiplying the whole matrix by k multiplies the determinant by k once per row, so det(kA) = k² det(A) for a 2×2 matrix and det(kA) = k^n det(A) for an n×n matrix.

Swapping the two rows flips the sign:

det([c d; a b]) = cb − ad = −(ad − bc),

and likewise swapping two columns multiplies the determinant by −1.

For a triangular matrix the determinant is just the product of the diagonal entries. Expanding a general determinant by cofactors, a 3×3 determinant costs more than 3 times as much as a 2×2, and an n×n determinant costs more than n times as much as an (n−1)×(n−1).

det(A^{−1}) = 1/det(A), det(A′) = det(A), and det(AB) = det(A) det(B); but det(A + B) is nothing in particular.
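A quick numerical check of these facts (my own sketch, not from the notes):

% Verify the determinant properties on a random matrix (illustrative sketch)
rng(1); n = 4; k = 3;
A = randn(n); B = randn(n);
disp(det(k*A) - k^n*det(A))          % scaling the whole matrix: det(kA) = k^n det(A)
disp(det(A') - det(A))               % transpose leaves the determinant unchanged
disp(det(inv(A)) - 1/det(A))         % inverse: det(A^-1) = 1/det(A)
disp(det(A*B) - det(A)*det(B))       % products: det(AB) = det(A) det(B)
T = triu(A);                         % a triangular matrix
disp(det(T) - prod(diag(T)))         % det of a triangular matrix = product of its diagonal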
Likelihood Functions
General Theory
Australian National University
6.3
Likelihood Function
A very important concept in statistics
It describes, in a precise manner, our information about the
parameters of the model, given the observed data
Let f(x | θ) denote the (multivariate) pdf for the sample
X = (X1, . . . , Xn) with unknown parameter vector θ.
That is, f(x | θ) gives the probability (density) of observing the sample x if the
parameter were indeed θ.
Likelihood Function: Definition
Given that x is observed, the likelihood function is defined as

L(θ | x) = f(x | θ)

So the likelihood is just the pdf of the data, but viewed as a function
of the parameter vector θ
In a sense it is the same function, just viewed from a new perspective
Likelihood Function: Definition
By definition, the likelihood function is a pdf in x. Therefore

∫ L(θ | x) dx = ∫ f(x | θ) dx = 1

But L(θ | x) is not a pdf with respect to θ
That is, in general

∫ L(θ | x) dθ ≠ 1
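For example (my own illustration, not from the slides): with a single Bernoulli observation x = 1, the likelihood is L(p | x) = p^x (1 − p)^{1−x} = p, and ∫_0^1 p dp = 1/2 ≠ 1.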
Log-likelihood
For basically all purposes it is easier to work with the logarithm of the
likelihood function
Naturally, this is called the log-likelihood function
ℓ(θ | x) = log L(θ | x)
Why is this ok to do?
Why might we want to do this?
Taking logs is fine because log is strictly increasing: we will mostly be using the likelihood by maximising it, i.e. looking for the argmax over θ, and the argmax is unchanged by a monotone transformation. Working with sums rather than products also makes the maximisation easier.
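A small illustration (my own sketch, not from the slides): for a N(µ, 1) sample, the likelihood and the log-likelihood are maximised at the same value of µ on a grid.

% Likelihood and log-likelihood share the same argmax (illustrative sketch)
rng(1); n = 20; x = 3 + randn(n,1);                 % data with true mu = 3, sigma^2 = 1
mu_grid = linspace(1, 5, 401);
L = zeros(size(mu_grid)); ell = zeros(size(mu_grid));
for j = 1:numel(mu_grid)
    L(j)   = prod((2*pi)^(-1/2) * exp(-(x - mu_grid(j)).^2/2));   % likelihood
    ell(j) = sum(-0.5*log(2*pi) - (x - mu_grid(j)).^2/2);         % log-likelihood
end
[~, i1] = max(L); [~, i2] = max(ell);
disp([mu_grid(i1), mu_grid(i2)])                    % the two argmaxes coincide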
Likelihood Functions
Log-likelihood of the AR(1) Process
Australian National University
6.3(a)
Log-likelihood for the AR(1) Process
Recall the AR(1) process:
y_t = ρ y_{t−1} + ε_t,   ε_t ∼ N(0, σ²)

Given y_{t−1} and the parameters ρ, σ², we know y_t ∼ N(ρ y_{t−1}, σ²)
So the pdf of y_t is

f(y_t | y_{t−1}, ρ, σ²) = (1/(2πσ²))^{1/2} exp( −(1/(2σ²)) (y_t − ρ y_{t−1})² )
Log-likelihood for the AR(1) Process
The joint density is then

f(y_1, . . . , y_T | y_0, ρ, σ²) = ∏_{t=1}^T f(y_t | y_{t−1}, ρ, σ²)

So the log-likelihood is

ℓ(ρ, σ² | y) = log ∏_{t=1}^T f(y_t | y_{t−1}, ρ, σ²)
             = ∑_{t=1}^T log f(y_t | y_{t−1}, ρ, σ²)
             = −(T/2) log(2πσ²) − (1/(2σ²)) ∑_{t=1}^T (y_t − ρ y_{t−1})²
(The product expands as f(y_1 | y_0, ρ, σ²) · f(y_2 | y_1, ρ, σ²) · · · f(y_T | y_{T−1}, ρ, σ²), conditioning on the initial value y_0 throughout.)
The same log-likelihood can be obtained by stacking the AR(1) in matrix form. Let L be the T×T lag matrix (ones on the first sub-diagonal, zeros elsewhere) and set y_0 = 0, so that

y = ρLy + ε,   ε ∼ N(0, σ²I_T).

We want the distribution of y. Rearranging, (I − ρL)y = ε, so y = (I − ρL)^{−1}ε and

y ∼ N(0, σ²(M_ρ′M_ρ)^{−1}),   where M_ρ = I − ρL.

The log-likelihood is therefore

ℓ(ρ, σ² | y) = −(T/2) log(2π) − ½ log|σ²(M_ρ′M_ρ)^{−1}| − ½ y′(σ²(M_ρ′M_ρ)^{−1})^{−1} y
             = −(T/2) log(2πσ²) − (1/(2σ²)) y′M_ρ′M_ρ y
             = −(T/2) log(2πσ²) − (1/(2σ²)) ∑_{t=1}^T (y_t − ρ y_{t−1})²,

the same expression as before. Here |σ²(M_ρ′M_ρ)^{−1}| = σ^{2T}/(det M_ρ)² = σ^{2T}, because M_ρ = I − ρL is lower triangular with ones on the diagonal, so det M_ρ = 1; and (M_ρ y)_t = y_t − ρ y_{t−1}, so y′M_ρ′M_ρ y = ∑_t (y_t − ρ y_{t−1})².
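A short sketch (mine, following the notation above) that evaluates the AR(1) log-likelihood both via the sum of squares and via the matrix form, to confirm they agree:

% AR(1) log-likelihood, two equivalent ways (illustrative sketch)
rng(1); T = 100; rho = 0.7; sig2 = 1;
y = zeros(T,1); y(1) = sqrt(sig2)*randn;            % y_0 = 0 convention
for t = 2:T
    y(t) = rho*y(t-1) + sqrt(sig2)*randn;
end
ylag = [0; y(1:T-1)];                                % (Ly)_t = y_{t-1}, with y_0 = 0
ell_sum = -T/2*log(2*pi*sig2) - sum((y - rho*ylag).^2)/(2*sig2);
M = speye(T) - rho*spdiags(ones(T,1), -1, T, T);     % M_rho = I - rho*L
ell_mat = -T/2*log(2*pi*sig2) - (y'*(M'*M)*y)/(2*sig2);
disp([ell_sum, ell_mat])                             % identical up to rounding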
Likelihood Functions
Log-likelihood of the MA(1) Process
Australian National University
6.3(b)
Log-likelihood for the MA(1) Process
Recall the MA(1) process:
y_t = ε_t + θ ε_{t−1},   ε_t ∼ N(0, σ²)

We can write this in matrix form as

( y_1 )   ( 1  0  0  ···  0 ) ( ε_1 )
( y_2 )   ( θ  1  0  ···  0 ) ( ε_2 )
( y_3 ) = ( 0  θ  1  ···  0 ) ( ε_3 )
(  ⋮  )   (       ⋱   ⋱     ) (  ⋮  )
( y_T )   ( 0  0  ···  θ  1 ) ( ε_T )
(Row by row: y_1 = ε_1 + θε_0, y_2 = ε_2 + θε_1, . . . , y_T = ε_T + θε_{T−1}, with the ε_t iid; setting ε_0 = 0 gives the lower-triangular matrix above. In matrix notation, y = Γ_θ ε, where Γ_θ denotes that matrix.)
Log-likelihood for the MA(1) Process
Therefore

(y | θ) ∼ N(0, σ² Γ_θ Γ_θ′)

So the log-likelihood is

ℓ(θ, σ² | y) = log( (2π)^{−T/2} |Σ|^{−1/2} exp( −½ (y − µ)′Σ^{−1}(y − µ) ) )
             = −(T/2) log(2πσ²) − ½ log|Γ_θΓ_θ′| − (1/(2σ²)) y′(Γ_θΓ_θ′)^{−1} y
(Since ε ∼ N(0, σ²I_T) and y = Γ_θ ε, we have y ∼ N(0, Γ_θ(σ²I_T)Γ_θ′) = N(0, σ²Γ_θΓ_θ′). In the log-likelihood, µ = 0 and Σ = σ²Γ_θΓ_θ′, so |Σ| = σ^{2T}|Γ_θΓ_θ′| and Σ^{−1} = (1/σ²)(Γ_θΓ_θ′)^{−1}, which gives the last line above.)
Log-likelihood for the MA(1) Process
As an example, generate T = 100 data points according to the
MA(1) model with θ = 0.8 and σ² = 1
Assume that σ² = 1 is known
So the log-likelihood will be a function only of θ
Evaluate the log-likelihood over a grid of θ's
MATLAB code
T = 100; theta_true = 0.8;            % sigma^2 = 1 assumed known (as on the previous slide)
e = randn(T,1);                       % data generation added so the snippet runs on its own
y = e + theta_true*[0; e(1:T-1)];     % MA(1) data, eps_0 = 0
ngrid = 300;
theta_grid = linspace(.2,1,ngrid);
%% construct A and B so that Gam = A + theta*B
A = speye(T);
B = spdiags(ones(T,1),-1,T,T);
ell = zeros(ngrid,1);
for i=1:ngrid
    theta = theta_grid(i);
    Gam = A + B*theta;
    Gam2 = Gam*Gam';
    ell(i) = -T/2*log(2*pi) - .5*log(det(Gam2)) ...
             - .5*y'*(Gam2\y);
end
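To inspect the result one might add (my own addition, not on the slides):

plot(theta_grid, ell); xlabel('\theta'); ylabel('log-likelihood');
[~, imax] = max(ell);
theta_hat = theta_grid(imax)          % grid maximiser, should be near the true 0.8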
Figure: The log-likelihood ℓ(θ | y) for the MA(1) model with σ² = 1
Maximum Likelihood Estimation
General Theory
Australian National University
6.4
Maximum Likelihood Estimator
Problem: given the likelihood L(θ | y) with observed sample y, we
want the “best” guess for θ
Solution: use the value of θ for which the observed sample y is most
likely
Call this the maximum likelihood estimator
Maximum Likelihood Estimator: Definition
The method of maximum likelihood estimation formalises this idea
into a parametric maximisation problem
The maximum likelihood estimator for θ is defined as

θ̂_MLE = argmax_{θ ∈ Θ} L(θ | y) = argmax_{θ ∈ Θ} ℓ(θ | y)

The θ̂_MLE is found by maximising the likelihood (or indeed the
log-likelihood)
Properties of MLE
Theorem – Properties of MLE
Let y be an observed sample with log-likelihood ℓ(θ | y). Then, under
appropriate regularity conditions, the MLE θ̂_MLE is consistent and is
asymptotically normal,

θ̂_MLE ∼ N(θ, I(θ)^{−1})

where I(θ) = −E[H(θ; Y)] is the Fisher information matrix, and

H(θ; y) = ∂²ℓ(θ | y) / (∂θ ∂θ′)

is the Hessian matrix.
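As an illustration (my own sketch, not from the slides): for an iid N(µ, σ²) sample with σ² known, the MLE of µ is the sample mean and the Fisher information is I(µ) = n/σ², so √n(µ̂ − µ) should be approximately N(0, σ²). A quick Monte Carlo check:

% Monte Carlo check of the asymptotic normality of the MLE (illustrative sketch)
rng(1); n = 200; mu = 2; sig2 = 4; nrep = 5000;
mu_hat = zeros(nrep,1);
for r = 1:nrep
    x = mu + sqrt(sig2)*randn(n,1);
    mu_hat(r) = mean(x);                 % MLE of mu when sigma^2 is known
end
disp(var(sqrt(n)*(mu_hat - mu)))         % should be close to sig2 = 4, i.e. n*I(mu)^{-1}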
Maximum Likelihood Estimation
Maximising Log-likelihood of the Linear Regression Model
Australian National University
6.4(a)
Log-likelihood for the Linear Regression Model
Recall the linear regression model:
y = Xβ + ε,   ε ∼ N(0, σ² I_T)

The joint density of y is N(Xβ, σ² I_T)
The log-likelihood is given by

ℓ(β, σ² | y) = −½ log|2πσ² I_T| − ½ (y − Xβ)′(σ² I_T)^{−1}(y − Xβ)
             = −(T/2) log(2πσ²) − (1/(2σ²)) (y − Xβ)′(y − Xβ)
MLE for Linear Regression Model
Log-likelihood for the linear regression is
ℓ(β, σ² | y) = −(T/2) log(2πσ²) − (1/(2σ²)) (y − Xβ)′(y − Xβ)
             = −(T/2) log(2πσ²) − (1/(2σ²)) (β′X′Xβ − 2y′Xβ + y′y)

The first-order conditions

∂ℓ(β, σ² | y)/∂β  = −(1/(2σ²)) (2β′X′X − 2y′X) = 0
∂ℓ(β, σ² | y)/∂σ² = −T/(2σ²) + (1/(2(σ²)²)) (y − Xβ)′(y − Xβ) = 0
Working: expanding (y − Xβ)′(y − Xβ) = y′y + β′X′Xβ − y′Xβ − β′X′y, and noting that β′X′y = y′Xβ (a scalar equals its own transpose), gives the expanded form of ℓ above. Setting ∂ℓ/∂β = 0 gives β′X′X = y′X, so β̂ = (X′X)^{−1}X′y; setting ∂ℓ/∂σ² = 0 and rearranging gives σ̂² = (1/T)(y − Xβ̂)′(y − Xβ̂).
MLE for Linear Regression Model
Solving this system of equations for β and σ²:

β̂ = (X′X)^{−1}X′y
σ̂² = (1/T) (y − Xβ̂)′(y − Xβ̂)

So θ̂ = (β̂, σ̂²) is a critical point
To show it is indeed a local max we would need to show the Hessian
at θ̂ is negative definite, but let's not.
(Substituting β̂: σ̂² = (1/T)(y − X(X′X)^{−1}X′y)′(y − X(X′X)^{−1}X′y).)
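A small sketch (mine, under the model above) computing β̂ and σ̂² on simulated data:

% MLE for the linear regression model on simulated data (illustrative sketch)
rng(1); T = 200; k = 3;
X = [ones(T,1) randn(T,k-1)];            % design matrix with an intercept
beta = [1; 0.5; -2]; sig2 = 1.5;
y = X*beta + sqrt(sig2)*randn(T,1);
beta_hat = (X'*X)\(X'*y);                % beta_hat = (X'X)^{-1} X'y
resid = y - X*beta_hat;
sig2_hat = (resid'*resid)/T;             % sigma^2_hat = (1/T)(y - X beta_hat)'(y - X beta_hat)
disp(beta_hat'); disp(sig2_hat)          % should be close to the true values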