MLE
The usual representation we come across is a probability density function: $P(X = x \mid \theta)$.

But what if we know that $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$, but we don’t know $\mu$?

We can set up a likelihood equation, $P(x \mid \mu, \sigma)$, and find the value of $\mu$ that maximizes it.
These slides are derived from course notes at http://www.cs.cmu.edu/~tom/10601_sp08/slides/recitation-mle-nb.pdf
MLE OF MU
Since the $x_i$’s are independent and from the same distribution,

$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2)$$

so the likelihood is

$$L(x) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \prod_{i=1}^{n} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Taking the log likelihood (we get to do this since log is monotonic) and removing some constants:

$$\log(L(x)) = l(x) \propto \sum_{i=1}^{n} -\frac{(x_i - \mu)^2}{2}$$
CALCULUS
We can take the derivative with respect to $\mu$ and set it equal to zero to maximize:

$$\frac{dl(x)}{d\mu} = \frac{d}{d\mu}\left(-\sum_{i=1}^{n} \frac{x_i^2 - 2x_i\mu + \mu^2}{2}\right) = \sum_{i=1}^{n} (x_i - \mu)$$

$$\sum_{i=1}^{n} (x_i - \mu) = 0 \;\Rightarrow\; n\mu = \sum_{i=1}^{n} x_i \;\Rightarrow\; \mu = \frac{\sum_{i=1}^{n} x_i}{n}$$
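To make this concrete, here is a minimal numerical sketch in Python (NumPy/SciPy; the simulated data and the use of scipy.optimize.minimize_scalar are illustrative choices, not part of the slides) confirming that the $\mu$ maximizing the Gaussian log-likelihood is the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=1000)   # x_1..x_n ~ N(mu=5, sigma^2)

def neg_log_likelihood(mu):
    # Negative of l(x) = sum_i -(x_i - mu)^2 / (2 sigma^2), constants dropped
    return np.sum((x - mu) ** 2) / (2 * sigma ** 2)

result = minimize_scalar(neg_log_likelihood)
print("MLE via optimizer:", result.x)
print("Sample mean:      ", x.mean())   # the two should agree
```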
Maximum a posteriori (MAP)
What if you have some ideas about your parameter?
We can use Bayes’ Rule:
$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)} = \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(\theta, x)} = \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(x \mid \theta)\,P(\theta)}$$
Maximum a posteriori (MAP)
This is just maximizing the numerator, since the
denominator is a normalizing constant.
This assumes we know the prior distribution.
$$\operatorname{argmax}_{\theta} P(\theta \mid x) = \operatorname{argmax}_{\theta} \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(x \mid \theta)\,P(\theta)} = \operatorname{argmax}_{\theta} P(x \mid \theta)\,P(\theta)$$
Computing MAP
Analytical Solution
Numerical Optimization: gradient descent
EM Algorithm
Monte Carlo optimization, such as simulated annealing
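As a hedged illustration of the first two routes (analytical vs. numerical), here is a minimal Python sketch; the Gaussian prior $N(\mu_0, \tau^2)$ on $\mu$, the toy data, and the variable names are assumptions made for this example only:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma = 2.0                       # known likelihood std dev
mu0, tau = 0.0, 1.0               # assumed prior: mu ~ N(mu0, tau^2)
x = rng.normal(loc=3.0, scale=sigma, size=20)
n = len(x)

def neg_log_posterior(mu):
    # -(log likelihood + log prior), constants dropped
    return np.sum((x - mu) ** 2) / (2 * sigma ** 2) + (mu - mu0) ** 2 / (2 * tau ** 2)

numerical_map = minimize_scalar(neg_log_posterior).x

# Analytical (conjugate) solution for a Gaussian prior on a Gaussian mean
analytical_map = (mu0 / tau ** 2 + x.sum() / sigma ** 2) / (1 / tau ** 2 + n / sigma ** 2)

print(numerical_map, analytical_map)   # should agree closely
```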
WHAT WE CAN DO NOW
MAP is the foundation for Naive Bayes classifiers.
Here, we’re assuming our data are drawn from two
“classes”. We have a bunch of data where we know the
class, and want to be able to predict P(class|data-point).
So, we use empirical probabilities
In NB, we also make the assumption that the features are conditionally independent given the class.
$$\text{prediction} = \operatorname{argmax}_{c} P(C = c \mid X = x) \approx \operatorname{argmax}_{c} \hat{P}(X = x \mid C = c)\,\hat{P}(C = c)$$
SPAM FILTERING
Suppose we wanted to build a spam filter. To use the
“bag of words” approach, assuming that n words in an
email are conditionally independent, we’d get:
$$\hat{P}(\text{spam} \mid w) \propto \left(\prod_{i=1}^{n} \hat{P}(w_i \mid \text{spam})\right)\hat{P}(\text{spam})$$

$$\hat{P}(\neg\text{spam} \mid w) \propto \left(\prod_{i=1}^{n} \hat{P}(w_i \mid \neg\text{spam})\right)\hat{P}(\neg\text{spam})$$

Whichever one’s bigger wins!
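A minimal bag-of-words Naive Bayes sketch in Python; the toy corpus, the add-one (Laplace) smoothing, and all names are illustrative assumptions, not something the slides specify:

```python
import math
from collections import Counter

# Toy labeled corpus (hypothetical data, just to illustrate the counts).
spam_docs = [["win", "money", "now"], ["free", "money", "offer"]]
ham_docs  = [["meeting", "tomorrow", "now"], ["project", "report", "offer"]]

def train(docs):
    counts = Counter(w for doc in docs for w in doc)
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam_docs)
ham_counts,  ham_total  = train(ham_docs)
vocab = set(spam_counts) | set(ham_counts)
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))

def log_score(words, counts, total, prior):
    # log P(class) + sum_i log P(w_i | class), with add-one (Laplace) smoothing
    score = math.log(prior)
    for w in words:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(words):
    s = log_score(words, spam_counts, spam_total, p_spam)
    h = log_score(words, ham_counts,  ham_total,  1 - p_spam)
    return "spam" if s > h else "not spam"   # whichever one's bigger wins

print(classify(["free", "money", "now"]))
```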
ECE 750 T17
Bayes’ Classifier

$$p(w_j \mid x) = \frac{p(x \mid w_j)\,p(w_j)}{p(x)}$$

Here $p(x \mid w_j)$ is the likelihood (class-conditional), $p(w_j)$ is the prior, $p(w_j \mid x)$ is the posterior, and $p(x)$ is the evidence.

For 2 classes, the evidence (normalization term) is

$$p(x) = \sum_{j=1}^{2} p(x \mid w_j)\,p(w_j)$$
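A small numerical sketch of this rule (Python/NumPy); the two 1-D Gaussian class-conditionals and the prior values are made-up assumptions for illustration:

```python
import numpy as np

priors = np.array([0.6, 0.4])          # p(w1), p(w2)
means = np.array([0.0, 2.0])
stds = np.array([1.0, 1.0])

def class_conditional(x):
    # p(x | w_j) for each class under the assumed Gaussian model
    return np.exp(-(x - means) ** 2 / (2 * stds ** 2)) / (np.sqrt(2 * np.pi) * stds)

def posterior(x):
    joint = class_conditional(x) * priors   # p(x | w_j) p(w_j)
    return joint / joint.sum()              # divide by the evidence p(x)

p = posterior(1.2)
print(p, p.argmax())   # posterior over the two classes (sums to 1) and the MAP class index
```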
Max a Posteriori Classifier

Decide $w_1$ if $p(w_1 \mid x) > p(w_2 \mid x)$; otherwise decide $w_2$.

Assumes $p(x \mid w_j)$ and $p(w_j)$ are known.

The likelihood is a density w.r.t. $x$ (area under the curve is 1); the posterior is a mass function w.r.t. $w$ (it sums to 1 over the classes).

Priors come from background knowledge.
Likelihood comes from observation.
Max a Posteriori Classifier (MAP)

$$p(\text{error} \mid x) = \begin{cases} p(w_1 \mid x) & \text{if we decide } w_2 \\ p(w_2 \mid x) & \text{if we decide } w_1 \end{cases}$$

MAP minimizes $p(\text{error})$.

Generalization
1. More than one feature: $x = [x_1, x_2, \ldots, x_d]$
2. More than 2 classes: $\{w_1, w_2, \ldots, w_c\}$
3. Allowing actions rather than just deciding on a class.
4. Dealing with situations where misclassifications for some classes are more costly than for others.
Generalization
o Use of more than one feature (use a feature vector x)
o Use of more than 2 classes (use $w_j$, $j = 1, \ldots, c$)
o Allowing actions rather than just deciding on classes (possible rejection)
o Dealing with different misclassification costs
o Introducing a loss function which is more general than the probability of error
Loss Function
o Allowing actions: accepting a class or rejecting
o The loss function states how costly each action taken is

Let $\{w_1, w_2, \ldots, w_c\}$ be the set of $c$ classes.
Let $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the set of possible actions.
Let $\lambda(\alpha_i \mid w_j)$ be the loss incurred for taking action $\alpha_i$ when the state is $w_j$.
Loss Function

The expected loss of taking action $\alpha_i$ is the conditional risk

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid w_j)\,P(w_j \mid x)$$

The overall risk $R$ is the sum of $R(\alpha_i \mid x)$ over all $x$, so minimizing $R$ is equivalent to minimizing $R(\alpha_i \mid x)$ for each $x$: select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum.

Bayes Risk = best performance that can be achieved.
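A minimal sketch of picking the minimum-conditional-risk action (Python/NumPy); the 2-class, 3-action loss matrix, including a reject action, is a hypothetical choice, not from the slides:

```python
import numpy as np

# loss[i, j] = lambda(alpha_i | w_j), for actions: decide w1, decide w2, reject
loss = np.array([
    [0.00, 1.00],    # alpha_1: decide w1
    [1.00, 0.00],    # alpha_2: decide w2
    [0.25, 0.25],    # alpha_3: reject
])

def min_risk_action(posterior):
    # Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | w_j) P(w_j | x)
    risks = loss @ posterior
    return int(np.argmin(risks)), risks

action, risks = min_risk_action(np.array([0.55, 0.45]))
print(action, risks)   # near-even posteriors make the reject action attractive
```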
2-Class Case

Conditional risk:

$$R(\alpha_1 \mid x) = \lambda_{11}\,p(w_1 \mid x) + \lambda_{12}\,p(w_2 \mid x)$$
$$R(\alpha_2 \mid x) = \lambda_{21}\,p(w_1 \mid x) + \lambda_{22}\,p(w_2 \mid x)$$

where $\alpha_1$ means deciding $w_1$, $\alpha_2$ means deciding $w_2$, and $\lambda_{ij} = \lambda(\alpha_i \mid w_j)$.
2-Class Case

Min-risk decision rule: if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, take action $\alpha_1$ (decide $w_1$).

Equivalently, decide $w_1$ if

$$(\lambda_{21} - \lambda_{11})\,p(x \mid w_1)\,p(w_1) > (\lambda_{12} - \lambda_{22})\,p(x \mid w_2)\,p(w_2),$$

otherwise decide $w_2$.
2-Class Case

Or, in terms of the likelihood ratio (the right-hand side is independent of $x$): if

$$\frac{p(x \mid w_1)}{p(x \mid w_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{p(w_2)}{p(w_1)}$$

then take action $\alpha_1$ (decide $w_1$); else take action $\alpha_2$ (decide $w_2$).
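A minimal likelihood-ratio-test sketch (Python/NumPy); the Gaussian class-conditionals, priors, and loss values are assumed for illustration:

```python
import numpy as np

priors = {"w1": 0.6, "w2": 0.4}
params = {"w1": (0.0, 1.0), "w2": (2.0, 1.0)}     # assumed (mean, std) per class
lam = np.array([[0.0, 2.0],                       # deciding w1 when the true class is w2 costs 2
                [1.0, 0.0]])                      # deciding w2 when the true class is w1 costs 1

def density(x, mean, std):
    return np.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)

def decide(x):
    ratio = density(x, *params["w1"]) / density(x, *params["w2"])
    # Threshold (lambda_12 - lambda_22) / (lambda_21 - lambda_11) * p(w2) / p(w1)
    threshold = ((lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0])) * (priors["w2"] / priors["w1"])
    return "w1" if ratio > threshold else "w2"

print(decide(0.8), decide(1.5))
```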
Bayes’ Classifier
Learning priors and conditionals
o If we have training data, we can learn the prior and the conditional from the training data (relative frequencies).
o Alternatively, we may assume distributions and learn their parameters using MLE or other methods.
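A minimal sketch of the frequency-counting route (Python); the toy labeled data and the single discrete feature are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Toy labeled training data (hypothetical): (feature value, class label) pairs.
data = [("red", "apple"), ("red", "apple"), ("green", "apple"),
        ("yellow", "banana"), ("yellow", "banana"), ("green", "banana")]

class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)
for value, label in data:
    feature_counts[label][value] += 1

# Priors p(w_j) and conditionals p(x | w_j) as relative frequencies
priors = {c: n / len(data) for c, n in class_counts.items()}
conditionals = {c: {v: n / class_counts[c] for v, n in cnt.items()}
                for c, cnt in feature_counts.items()}

print(priors)
print(conditionals)
```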
Naive Bayes

$$P(w \mid x) = \frac{P(x \mid w)\,P(w)}{P(x)}$$

Learning = estimating $P(x \mid w)$ and $P(w)$.
Classification = computing $P(w \mid x_{\text{new}})$.
Naive Bayes

Assumes $x = (f_1, \ldots, f_d)$ is discrete-valued and that $f_i$ and $f_j$ are conditionally independent given $w$ for all $i \neq j$:

$$P(f_1, \ldots, f_d \mid w) = \prod_{i=1}^{d} P(f_i \mid w)$$
o Bayes Classifier: optimal in the sense that, for given prior probabilities, loss function, and class-conditional densities, no other decision rule has a lower risk (the expected value of the loss function, for example the probability of error).
o In practice the class-conditional densities are estimated using parametric and non-parametric methods.
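As a hedged sketch of the parametric route, the following Python/NumPy snippet fits a Gaussian class-conditional density per class by MLE on toy data (the data, the Gaussian assumption, and all names are illustrative, not from the slides) and plugs the estimates into the MAP decision:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy training data: two classes with different (unknown to the classifier) means
x1 = rng.normal(0.0, 1.0, size=200)   # samples labeled w1
x2 = rng.normal(2.5, 1.0, size=300)   # samples labeled w2

# Parametric step: MLE of a Gaussian class-conditional for each class
params = {"w1": (x1.mean(), x1.std()), "w2": (x2.mean(), x2.std())}
priors = {"w1": len(x1) / (len(x1) + len(x2)), "w2": len(x2) / (len(x1) + len(x2))}

def gaussian(x, mean, std):
    return np.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)

def classify(x):
    scores = {c: gaussian(x, *params[c]) * priors[c] for c in params}
    return max(scores, key=scores.get)   # MAP decision with the estimated densities

print(classify(0.3), classify(2.0))
```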