Lecture 4:
Maximum likelihood estimation (2) CS 189 (CDSS offering)
2022/01/26
Today’s lecture
• Last lecture, we introduced the principle of maximum likelihood estimation (MLE) for general statistical inference
• Today, we continue the discussion of MLE, discuss its favorable consistency property, and see its relationship to information theoretic concepts
• We will see how MLE applies more specifically in the machine learning context
• And we will start our discussion (to be continued next lecture and beyond) on
linear regression from the MLE perspective
MLE in the limit of infinite data
what happens as the number of data points N → ∞?

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \frac{1}{N} \sum_{i=1}^N \log p_\theta(x_i)$$

remember that, generally speaking, this kind of sample average is what motivates Monte Carlo estimation:

$$\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i), \quad x_i \sim p \qquad \text{(more accurate with larger } N\text{)}$$

an attractive property of MLE is that it is consistent: given large enough N, if the true data distribution $p_{\theta^*}$ is in our model family, then $\hat{\theta}_{\text{MLE}}$ will equal $\theta^*$
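To make consistency concrete, here is a minimal sketch (not from the slides) using a Gaussian with unknown mean, where the MLE is just the sample mean; the "true" parameter values and the seed are made up for the demo.

```python
import numpy as np

# Hypothetical example: MLE for the mean of a Gaussian with known variance.
# The MLE is the sample mean (a Monte Carlo estimate of E[x]), and it gets
# closer to the true parameter as N grows -- this is consistency.
rng = np.random.default_rng(0)
true_mean, true_std = 3.0, 2.0   # assumed "true" parameters for this demo

for N in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(true_mean, true_std, size=N)
    theta_mle = x.mean()   # maximizer of the Gaussian log-likelihood in the mean
    print(f"N = {N:>9d}   theta_MLE = {theta_mle:.4f}   "
          f"|error| = {abs(theta_mle - true_mean):.4f}")
```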
MLE and information theory
recall the definition of cross-entropy: $H(p, q) = \mathbb{E}_{x \sim p}[-\log q(x)]$

let's plug in $p_{\theta^*}$ (the true data distribution) for p and some $p_\theta$ for q:

$$H(p_{\theta^*}, p_\theta) = \mathbb{E}_{x \sim p_{\theta^*}}[-\log p_\theta(x)] \approx -\frac{1}{N} \sum_{i=1}^N \log p_\theta(x_i)$$
maximizing likelihood is approximately equivalent to minimizing this cross-entropy!
also approximately equivalent to minimizing this KL divergence:
$$D_{\text{KL}}(p_{\theta^*} \,\|\, p_\theta) = H(p_{\theta^*}, p_\theta) - H(p_{\theta^*}), \qquad H(p_{\theta^*}) \text{ is constant w.r.t. } \theta$$
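A small numerical sketch of this relationship (my own illustration, with made-up discrete distributions): the average negative log-likelihood of samples drawn from $p_{\theta^*}$ approximates the cross-entropy, which is the KL divergence plus the θ-independent entropy of $p_{\theta^*}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete example: "true" distribution and a model distribution
p_star  = np.array([0.5, 0.3, 0.2])   # true data distribution (made up)
p_theta = np.array([0.4, 0.4, 0.2])   # some model distribution (made up)

# Exact quantities
cross_entropy = -(p_star * np.log(p_theta)).sum()   # H(p*, p_theta)
entropy       = -(p_star * np.log(p_star)).sum()    # H(p*), constant w.r.t. theta
kl            = cross_entropy - entropy             # D_KL(p* || p_theta)

# Monte Carlo estimate from samples: average negative log-likelihood
N = 100_000
samples = rng.choice(len(p_star), size=N, p=p_star)
avg_nll = -np.log(p_theta[samples]).mean()

print(f"H(p*, p_theta) = {cross_entropy:.4f},  MC estimate = {avg_nll:.4f}")
print(f"D_KL(p* || p_theta) = {kl:.4f} = cross-entropy - entropy")
```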
What about MLE for regression/classification?
given data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, assume a set (family) of distributions on $(x, y)$:

$$x, y \sim p(x)\, p_\theta(y \mid x)$$

the parameters $\theta$ only dictate the conditional distribution of y given x!
the objective/definition remains the same:
$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^N p_\theta(x_i, y_i) = \arg\max_\theta \sum_{i=1}^N \left[ \log p(x_i) + \log p_\theta(y_i \mid x_i) \right] = \arg\max_\theta \sum_{i=1}^N \log p_\theta(y_i \mid x_i)$$

since $\log p(x_i)$ does not depend on $\theta$
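As a sketch of why only the conditional term matters, here is a toy conditional log-likelihood for a hypothetical logistic-regression model (my own example; the model and variable names are not from the slides). Note that $p(x)$ never appears, because it does not depend on the parameters.

```python
import numpy as np

def conditional_log_likelihood(w, X, y):
    """Sum_i log p_w(y_i | x_i) for a hypothetical logistic-regression model,
    where p_w(y = 1 | x) = sigmoid(w . x).  The input distribution p(x) is
    absent, since it does not depend on the parameters w."""
    logits = X @ w
    # log p(y|x) = y * logit - log(1 + exp(logit)), written in a stable form
    return np.sum(y * logits - np.logaddexp(0.0, logits))

# toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.5).astype(float)
print(conditional_log_likelihood(np.zeros(3), X, y))   # equals 100 * log(0.5)
```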
Example: “least squares” linear regression
we are given data = {(x1, y1), …, (xN, yN)}
assume that the output given the input is generated i.i.d. as
$$y_i \mid x_i \sim \mathcal{N}(w^\top x_i + b,\; \sigma^2), \qquad \theta = (w, b)$$
we don’t need to specify the input distribution since it does not depend on $\theta$
the objective is:
$$\hat{\theta}_{\text{MLE}} = (\hat{w}, \hat{b}) = \arg\max_{w, b} \sum_{i=1}^N \log \mathcal{N}(y_i;\; w^\top x_i + b,\; \sigma^2)$$

where the arg max is taken over all possible (w, b)
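A minimal numeric sketch of this objective (my own, with made-up data and a fixed σ): evaluate $\sum_i \log \mathcal{N}(y_i;\, w^\top x_i + b,\, \sigma^2)$ for candidate parameters and see that parameters close to the data-generating ones score higher.

```python
import numpy as np

def gaussian_cond_log_lik(w, b, sigma, X, y):
    """Sum_i log N(y_i ; w^T x_i + b, sigma^2) -- the MLE objective for this model."""
    mu = X @ w + b
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2))

# made-up data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + 0.3 + 0.1 * rng.normal(size=50)

print(gaussian_cond_log_lik(np.array([1.5, -0.5]), 0.3, 0.1, X, y))  # near-true params: high
print(gaussian_cond_log_lik(np.zeros(2), 0.0, 0.1, X, y))            # bad params: much lower
```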
From MLE to least squares linear regression
to fold b into w, add a 1 to the end of each $x_i$; now let's do some algebra on the MLE objective for this setup:

$$\hat{w} = \arg\max_w \sum_{i=1}^N \log \mathcal{N}(y_i;\; w^\top x_i,\; \sigma^2) = \arg\min_w \sum_{i=1}^N (w^\top x_i - y_i)^2$$

this looks like an $\ell_2$ norm squared... $w^\top x_i - y_i$ is the i-th element of what vector?

stacking the $x_i^\top$ as the rows of a matrix $X$, we have $(Xw)_i = w^\top x_i$, so

$$\hat{w} = \arg\min_w \|Xw - y\|_2^2$$
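A quick sketch (my illustration, with made-up data and variable names) of the "fold b into w" trick and the ℓ2 rewrite: append a 1 to each $x_i$, stack them into $X$, and the summed squared residuals equal $\|Xw - y\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))
y = rng.normal(size=50)

# append a 1 to the end of each x_i so that w^T x_i includes the bias b
X = np.hstack([X_raw, np.ones((50, 1))])
w = rng.normal(size=3)   # last entry of w plays the role of b

sum_of_squares = np.sum((X @ w - y) ** 2)        # sum_i (w^T x_i - y_i)^2
norm_squared   = np.linalg.norm(X @ w - y) ** 2  # ||Xw - y||_2^2
print(np.isclose(sum_of_squares, norm_squared))  # True: same objective
```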
If time: solving least squares linear regression
$$\hat{w} = \arg\min_w \|Xw - y\|_2^2 = \arg\min_w (Xw - y)^\top (Xw - y) = \arg\min_w \; w^\top X^\top X w - 2 y^\top X w + y^\top y$$

taking the gradient with respect to w and setting it to 0:

$$2 X^\top X w - 2 X^\top y = 0 \quad \Longrightarrow \quad \hat{w} = (X^\top X)^{-1} X^\top y \quad \text{(assuming } X^\top X \text{ is invertible)}$$
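A sketch of the closed-form solution in code (assuming $X^\top X$ is invertible; the data and names are mine): solve the normal equations $X^\top X w = X^\top y$ and cross-check against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])  # features plus bias column
w_true = np.array([2.0, -1.0, 0.5, 0.3])
y = X @ w_true + 0.05 * rng.normal(size=100)

# Normal equations: w_hat = (X^T X)^{-1} X^T y  (solve, rather than invert, for stability)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))   # True
print(w_hat)                         # close to w_true
```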
MLE typically requires (iterative) optimization
• For the examples we have seen thus far, the MLE can be obtained in closed form
• We will see many more examples in this class for which this is not the case
• In such cases, we typically rely instead on iterative optimization, i.e., continually refining our “guess” of the MLE until we are satisfied
• One can take entire series of classes on iterative optimization, e.g., EE 127/227
• We will start our (more limited) discussion of this topic in a couple of weeks