
MLE

The usual representation we come across is a probability density function: $P(X = x \mid \theta)$

But what if we know that $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$, but we don't know $\mu$?

We can set up a likelihood equation, $P(x \mid \mu, \sigma)$, and find the value of $\mu$ that maximizes it.

These slides derived from course notes on http://www.cs.cmu.edu/~tom/10601_sp08/slides/recitation-mle-nb.pdf

MLE OF MU

Since the x's are independent and from the same distribution,

$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2)$$

$$L(x) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \prod_{i=1}^{n} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Taking the log likelihood (we get to do this since log is monotonic) and removing some constants:

$$\log(L(x)) = l(x) \propto \sum_{i=1}^{n} -(x_i - \mu)^2$$

CALCULUS

We can take the derivative of this value with respect to $\mu$ and set it equal to zero, to maximize.

$$\frac{dl(x)}{d\mu} = \frac{d}{d\mu}\left(-\sum_{i=1}^{n}\left(x_i^2 - 2x_i\mu + \mu^2\right)\right) \propto \sum_{i=1}^{n}(x_i - \mu)$$

$$\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Rightarrow\; n\mu = \sum_{i=1}^{n} x_i \;\Rightarrow\; \mu = \frac{\sum_{i=1}^{n} x_i}{n}$$
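
As a quick numerical check (my own sketch, not from the slides): the code below evaluates the Gaussian log-likelihood over a grid of candidate values of $\mu$ for some synthetic data and confirms that the grid maximizer matches the sample mean derived above. The data, the true parameters, and the grid range are all made-up values.

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                       # treated as known
x = rng.normal(loc=5.0, scale=sigma, size=1000)   # synthetic data

def log_likelihood(mu):
    # log L(mu) = sum_i log N(x_i | mu, sigma^2); constants kept for completeness
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

mu_grid = np.linspace(3.0, 7.0, 2001)
ll = [log_likelihood(mu) for mu in mu_grid]

print("grid argmax of log-likelihood:", mu_grid[np.argmax(ll)])
print("sample mean (closed-form MLE):", x.mean())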

Maximum a posteriori (MAP)

What if you have some ideas about your parameter?

We can use Bayes’ Rule:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)} = \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(\theta, x)} = \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(x \mid \theta)\,P(\theta)}$$

Maximum a posteriori (MAP)

This is just maximizing the numerator, since the
denominator is a normalizing constant.

This assumes we know the prior distribution.

$$\operatorname{argmax}_{\theta} P(\theta \mid x) = \operatorname{argmax}_{\theta} \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(x \mid \theta)\,P(\theta)} = \operatorname{argmax}_{\theta} P(x \mid \theta)\,P(\theta)$$

Computing MAP

Analytical Solution

Numerical Optimization: gradient descent

EM Algorithm

Monte Carlo Optimization, such as Simulated Annealing
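
As a small illustration of the numerical route (a minimal sketch, not from the slides): with an assumed Gaussian prior $\mu \sim N(\mu_0, \tau^2)$ on the mean and known $\sigma$, the unnormalized log-posterior $\log P(x \mid \mu) + \log P(\mu)$ can be maximized by a simple grid search. The conjugate closed-form answer is printed alongside for comparison; all numbers are made up for the example.

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=50)   # synthetic data

mu0, tau = 0.0, 1.0                             # assumed prior: mu ~ N(mu0, tau^2)

def log_posterior(mu):
    # log P(x | mu) + log P(mu), dropping the normalizing constant P(x)
    log_lik = np.sum(-(x - mu)**2 / (2 * sigma**2))
    log_prior = -(mu - mu0)**2 / (2 * tau**2)
    return log_lik + log_prior

mu_grid = np.linspace(-2.0, 8.0, 4001)
mu_map_grid = mu_grid[np.argmax([log_posterior(m) for m in mu_grid])]

# Closed form for this conjugate Gaussian-Gaussian case
n = len(x)
mu_map_exact = (tau**2 * x.sum() + sigma**2 * mu0) / (n * tau**2 + sigma**2)

print("MAP (grid search):", mu_map_grid)
print("MAP (closed form):", mu_map_exact)
print("MLE (sample mean):", x.mean())   # the prior pulls the MAP toward mu0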

WHAT WE CAN DO NOW

MAP is the foundation for Naive Bayes classifiers.

Here, we’re assuming our data are drawn from two
“classes”. We have a bunch of data where we know the
class, and want to be able to predict P(class|data-point).

So, we use empirical probabilities (the hatted estimates $\hat{P}$ in the formula below).

In NB, we also make the assumption that the features
are conditionally independent.

$$\text{prediction} = \operatorname{argmax}_{c} P(C = c \mid X = x) \approx \operatorname{argmax}_{c} \hat{P}(X = x \mid C = c)\,\hat{P}(C = c)$$

SPAM FILTERING

Suppose we wanted to build a spam filter. To use the
“bag of words” approach, assuming that n words in an
email are conditionally independent, we’d get:

$$\hat{P}(\text{spam} \mid w) \propto \hat{P}(\text{spam}) \prod_{i=1}^{n} \hat{P}(w_i \mid \text{spam})$$

$$\hat{P}(\neg\text{spam} \mid w) \propto \hat{P}(\neg\text{spam}) \prod_{i=1}^{n} \hat{P}(w_i \mid \neg\text{spam})$$

Whichever one's bigger wins!
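
A toy sketch of this bag-of-words classifier (my own example, with hypothetical training data; Laplace smoothing is added so unseen words don't zero out the product, which the slides do not cover):

import math
from collections import Counter

train = [
    ("win money now", "spam"),
    ("limited offer win prize", "spam"),
    ("meeting schedule attached", "ham"),
    ("project meeting notes", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def log_score(text, c, alpha=1.0):
    # log P_hat(c) + sum_i log P_hat(w_i | c), with Laplace smoothing alpha
    total = sum(word_counts[c].values())
    score = math.log(class_counts[c] / len(train))
    for w in text.split():
        score += math.log((word_counts[c][w] + alpha) /
                          (total + alpha * len(vocab)))
    return score

msg = "win a prize at the meeting"
print(max(class_counts, key=lambda c: log_score(msg, c)))   # bigger score wins

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.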

ECE 750 T17

Bayes’ Classifier

$$p(w_j \mid x) = \frac{p(x \mid w_j)\,p(w_j)}{p(x)}$$

where $p(w_j \mid x)$ is the posterior, $p(x \mid w_j)$ is the likelihood (or conditional prior), $p(w_j)$ is the prior, and $p(x)$ is the evidence.

For 2 classes, the evidence is the normalization term

$$p(x) = \sum_{j=1}^{2} p(x \mid w_j)\,p(w_j)$$
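
A minimal sketch of this posterior computation (my own example), assuming two classes with made-up priors and Gaussian class-conditional densities:

import math

def gauss_pdf(x, mu, sigma):
    # class-conditional density p(x | w_j), assumed Gaussian here
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

priors = {"w1": 0.7, "w2": 0.3}                 # assumed priors p(w_j)
params = {"w1": (0.0, 1.0), "w2": (2.0, 1.0)}   # assumed (mu, sigma) per class

def posterior(x):
    # p(w_j | x) = p(x | w_j) p(w_j) / p(x), with p(x) as the normalization term
    joint = {c: gauss_pdf(x, *params[c]) * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}

print(posterior(1.0))   # the two posteriors sum to 1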

Max a Posteriori Classifier

Decide $w_1$ if $p(w_1 \mid x) > p(w_2 \mid x)$; otherwise decide $w_2$.

Assumes $p(x \mid w_j)$ and $p(w_j)$ are known.

The posterior $p(w \mid x)$ is a probability mass function w.r.t. $w$: $p(w_1 \mid x) + p(w_2 \mid x) = 1$.

The likelihood $p(x \mid w)$ is a density w.r.t. $x$ (area under the curve is 1).

Priors come from background knowledge; the likelihood comes from observation.

Max a Posteriori Classifier (MAP)

$$p(\text{error} \mid x) = \begin{cases} p(w_1 \mid x) & \text{if we decide } w_2 \\ p(w_2 \mid x) & \text{if we decide } w_1 \end{cases}$$

MAP minimizes $p(\text{error})$.

Generalization

1. More than one feature: use a feature vector $x = [x_1\; x_2\; \ldots\; x_d]$.
2. More than 2 classes: $\{w_1, w_2, \ldots, w_c\}$.
3. Allowing actions rather than just deciding on a class.
4. How to deal with situations where misclassifications for some classes are more costly than for others.

Generalization

o Use of more than one feature (use feature vector $x$)

o Use of more than 2 classes (use $w_j$, $j = 1, \ldots, c$)

o Allowing actions rather than just deciding on classes (possible rejection)

o Dealing with different misclassification costs

o Introducing a loss function which is more general than the probability of error

Loss Function

o Allowing actions: accepting the action or rejecting

o The loss function states how costly each action taken is

Let $\{w_1, w_2, \ldots, w_c\}$ be the set of $c$ classes.

Let $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the set of possible actions.

Let $\lambda(\alpha_i \mid w_j)$ be the loss incurred for taking action $\alpha_i$ when the state is $w_j$.

Loss Function

Overall risk $R$ = sum of $R(\alpha_i \mid x)$ over all $x$, $i = 1, \ldots, a$.

$R(\alpha_i \mid x)$ is the expected loss (conditional risk) of taking action $\alpha_i$.

Minimizing $R$ $\equiv$ minimizing $R(\alpha_i \mid x)$ for all $x$, $i = 1, \ldots, a$: select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum.

Bayes Risk = best performance that can be achieved.

Conditional risk:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid w_j)\,p(w_j \mid x)$$

2-Class Case

Conditional risk:

$$R(\alpha_1 \mid x) = \lambda_{11}\,p(w_1 \mid x) + \lambda_{12}\,p(w_2 \mid x)$$

$$R(\alpha_2 \mid x) = \lambda_{21}\,p(w_1 \mid x) + \lambda_{22}\,p(w_2 \mid x)$$

$\alpha_1$: deciding $w_1$; $\alpha_2$: deciding $w_2$; $\lambda_{ij} = \lambda(\alpha_i \mid w_j)$.
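
A small sketch of these two conditional risks (my own example), using a hypothetical loss matrix $\lambda_{ij}$ and made-up posteriors; note how an asymmetric loss can overturn the plain MAP decision:

lam = [[0.0, 10.0],   # lambda_11, lambda_12  (costs when taking alpha_1: decide w1)
       [1.0,  0.0]]   # lambda_21, lambda_22  (costs when taking alpha_2: decide w2)

def conditional_risks(post_w1, post_w2):
    # R(alpha_i | x) = sum_j lambda(alpha_i | w_j) p(w_j | x)
    r1 = lam[0][0] * post_w1 + lam[0][1] * post_w2
    r2 = lam[1][0] * post_w1 + lam[1][1] * post_w2
    return r1, r2

r1, r2 = conditional_risks(post_w1=0.8, post_w2=0.2)
decision = "w1" if r1 < r2 else "w2"   # min-risk rule: take the action with smaller risk
print(r1, r2, decision)                # decides w2 here despite p(w1|x) being larger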

2-Class Case

$\alpha_1$: the action "decide $w_1$".

Decide $w_1$ if

$$(\lambda_{21} - \lambda_{11})\,p(x \mid w_1)\,p(w_1) > (\lambda_{12} - \lambda_{22})\,p(x \mid w_2)\,p(w_2)$$

otherwise decide $w_2$.

Min Risk Decision Rule
If $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, take action $\alpha_1$ (decide $w_1$); otherwise take action $\alpha_2$ (decide $w_2$).

2-Class Case

Take action $\alpha_1$ (decide $w_1$) if the likelihood ratio

$$\frac{p(x \mid w_1)}{p(x \mid w_2)} > \frac{(\lambda_{12} - \lambda_{22})\,p(w_2)}{(\lambda_{21} - \lambda_{11})\,p(w_1)}$$

else take action $\alpha_2$ (decide $w_2$). The threshold on the right-hand side is independent of $x$.

Bayes' Classifier

Learning priors and conditionals:

o If we have training data, we can learn the prior and the conditional from the training data (frequentist probabilities).

o We may assume distributions and learn their parameters using MLE or other methods.

Naive Bayes

Learning = estimating $P(x \mid w)$ and $P(w)$.

Classification: compute $P(w \mid x_{new})$ via Bayes' rule, $P(w \mid x) = \dfrac{P(x \mid w)\,P(w)}{P(x)}$.

Naive Bayes Assumes

$x = (f_1, \ldots, f_d)$, with discrete-valued features, and

$$P(f_1, \ldots, f_d \mid w) = \prod_{i} P(f_i \mid w)$$

i.e., $f_i$ and $f_j$ are conditionally independent given $w$, for all $i \neq j$.

o Bayes Classifier: optimum in the sense of probability of error; for given prior probabilities, loss function and class-conditional densities, no other decision rule will have a lower risk (expected value of the loss function, for example, probability of error).

o In practice the class-conditional densities are estimated using parametric and non-parametric methods.
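
A short sketch of the likelihood-ratio rule above (my own example), assuming Gaussian class-conditional densities and made-up priors and losses; the threshold is computed once and does not depend on $x$:

import math

def gauss_pdf(x, mu, sigma):
    # assumed Gaussian class-conditional density p(x | w_j)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

p_w1, p_w2 = 0.6, 0.4                                  # assumed priors
lam = {"11": 0.0, "12": 5.0, "21": 1.0, "22": 0.0}     # assumed losses lambda_ij

# Threshold from the min-risk rule; it does not involve x.
threshold = (lam["12"] - lam["22"]) * p_w2 / ((lam["21"] - lam["11"]) * p_w1)

def decide(x):
    ratio = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 2.0, 1.0)   # p(x|w1) / p(x|w2)
    return "w1" if ratio > threshold else "w2"

for x in (-1.0, 1.0, 3.0):
    print(x, decide(x))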