Lecture 7:
Maximum a posteriori estimation CS 189 (CDSS offering)
2022/02/02
Today’s lecture
• Today, we introduce the concept of maximum a posteriori (MAP) estimation, and we will compare and contrast it with MLE
• We will see how both MLE and MAP estimation can be thought of as optimizing a loss function, a concept we have mostly neglected thus far
• From this perspective, MAP estimation can be thought of as adding a regularizer to the loss function, whereas MLE does not have one
• We will provide some intuition for why MAP estimation may be used over MLE, from the perspective of overfitting
Loss functions and objectives
• In machine learning, we often formulate the learning objective as finding parameters θ that minimize a loss function
• There are different ways we may choose to denote this:
• ℓ(y, ŷ): the loss incurred if the ground truth is y and the model predicts ŷ
• ℓ(θ; x, y): the loss incurred by the model f_θ for the data point (x, y)
• Note that ℓ(θ; x, y) can be written as ℓ(y, f_θ(x)) using the first convention (see the short sketch after this list)
• How would we equate MLE to minimizing a loss function?
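Before answering, here is a tiny sketch of the two conventions above (my own illustration, using squared error and a linear model f_θ(x) = θ·x; none of these names come from the lecture):

```python
import numpy as np

def loss_pred(y, y_hat):
    # ell(y, y_hat): loss in terms of the ground truth and the prediction
    return (y - y_hat) ** 2

def loss_params(theta, x, y):
    # ell(theta; x, y) = ell(y, f_theta(x)) for the linear model f_theta(x) = theta . x
    return loss_pred(y, np.dot(theta, x))
```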
From MLE to a loss function
we are given a dataset 𝒟 = {(x1, y1), …, (xN, yN)}
our objective is to find the parameters that maximize the log likelihood of the data
θ_MLE = argmax_θ Σᵢ log p_θ(yᵢ | xᵢ)
this leads, naturally, to the negative log likelihood loss function:
ℓ(θ; xᵢ, yᵢ) = −log p_θ(yᵢ | xᵢ),  so that  θ_MLE = argmin_θ Σᵢ ℓ(θ; xᵢ, yᵢ)
(usually, we divide by N to work with average loss rather than summed loss; a code sketch follows)
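As a concrete sketch (my own, not part of the lecture; the binary logistic model and all names here are illustrative assumptions), MLE can be written as minimizing the average negative log likelihood by gradient descent:

```python
import numpy as np

def avg_nll(theta, X, y):
    """Average negative log likelihood -(1/N) sum_i log p_theta(y_i | x_i)
    for the logistic model p_theta(y=1 | x) = sigmoid(theta . x), y in {0,1}.
    Uses the stable identity -log p(y|x) = log(1 + e^z) - y*z with z = theta . x."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def mle_gradient_descent(X, y, lr=0.1, steps=2000):
    """Minimize the average NLL; its minimizer is the MLE."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= lr * (X.T @ (probs - y)) / len(y)  # gradient of avg_nll
    return theta
```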
MLE and overfitting
• Suppose, as a thought exercise, that my model can represent any function
• That is, for any function f, I can find a corresponding θ ∈ Θ such that f_θ = f
• Then, there may be infinitely many parameters θ that are the MLE
• We don’t even need an infinitely expressive model for this to happen
MLE and overfitting
• MLE has no notion of which of these parameters might be better than others
• So it very well could return the θ that corresponds to the middle function
• Intuitively, though, something about the function on the right seems more… right
• We often have these types of prior beliefs about which functions are more likely than others — can we somehow encode this into our problem setup?
A prior on our parameters
• Idea: treat the parameters θ as a random variable and put a probability distribution on it!
• Then, p(θ) is exactly what we were asking for: it is a prior distribution that encodes which functions are more likely than others
• We have to specify p(θ) ourselves, and doing so can be tricky
• How might we define p(θ) to encode our beliefs about the previous example?
A prior for smoother functions
• Intuitively, the function on the right seems best (and the middle function worst) based on the smoothness of the functions
• That is, how quickly the function changes as the input changes
• What p(θ) makes smooth f_θ more likely and non-smooth f_θ less likely?
• Roughly speaking: as the magnitude of θ increases, p(θ) should decrease (one concrete check follows this list)
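One concrete way to see this (my own check, for the linear case f_θ(x) = θᵀx; the lecture's pictured functions presumably come from a more expressive model):

```latex
% For a linear model f_\theta(x) = \theta^\top x, the Cauchy-Schwarz
% inequality bounds how fast the function can change:
\[
  |f_\theta(x) - f_\theta(x')| = |\theta^\top (x - x')|
    \le \|\theta\| \, \|x - x'\|
\]
% so \|\theta\| is a Lipschitz constant for f_\theta: small-magnitude
% parameters can only produce slowly varying (smooth) functions.
```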
A MVG prior distribution
• How do we get a prior where p(θ) decreases as the magnitude of θ increases?
• By far the most commonly used prior is p(θ) = 𝒩(θ; 0, σ²I)
• Because the prior is zero-centered, as θ moves away from zero, its probability density p(θ) goes down
• This places greater probability on parameters θ with small magnitude entries
• The functions f_θ corresponding to such θ are often smoother (a numerical sanity check follows this list)
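A quick numerical sanity check (a sketch of my own; the variable names are illustrative), showing that the log density of this prior depends only on ‖θ‖ and falls off quadratically with it:

```python
import numpy as np

def log_prior(theta, sigma2=1.0):
    """Log density of N(theta; 0, sigma2 * I):
    log p(theta) = -||theta||^2 / (2 sigma2) + const."""
    d = len(theta)
    const = -0.5 * d * np.log(2.0 * np.pi * sigma2)
    return const - np.dot(theta, theta) / (2.0 * sigma2)

# The density decreases as the magnitude ||theta|| grows:
for scale in [0.0, 1.0, 2.0, 4.0]:
    theta = scale * np.ones(3) / np.sqrt(3)   # vector with ||theta|| = scale
    print(scale, log_prior(theta))
```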
From the prior to the posterior
now that we are treating θ as a random variable, we can take a Bayesian approach to estimating it given a dataset 𝒟
p(θ | 𝒟) = p(𝒟 | θ) p(θ) / p(𝒟),  where  p(𝒟 | θ) = Πᵢ p(xᵢ) p_θ(yᵢ | xᵢ)

p(𝒟) is a constant w.r.t. θ, so sometimes we just write p(θ | 𝒟) ∝ p(𝒟 | θ) p(θ)
Maximum a posteriori (MAP) estimation
let’s do some more manipulations on this expression:
log p(θ | 𝒟) = log p(𝒟 | θ) + log p(θ) − log p(𝒟)
             = Σᵢ [log p(xᵢ) + log p_θ(yᵢ | xᵢ)] + log p(θ) − log p(𝒟)
The MAP estimate θ_MAP maximizes the log posterior probability:

θ_MAP = argmax_θ log p(θ | 𝒟) = argmax_θ Σᵢ log p_θ(yᵢ | xᵢ) + log p(θ)
      = argmin_θ −Σᵢ log p_θ(yᵢ | xᵢ) − log p(θ)

(the terms log p(xᵢ) and log p(𝒟) do not depend on θ and drop out; −log p(θ) is the extra term compared to MLE)
Completing the earlier example
what is the MAP estimate if p(θ) = 𝒩(θ; 0, σ²I)?

θ_MAP = argmin_θ −Σᵢ log p_θ(yᵢ | xᵢ) − log p(θ)
      = argmin_θ −Σᵢ log p_θ(yᵢ | xᵢ) + ‖θ‖² / (2σ²)   (dropping constants)
this is an example of a regularizer: typically, something we add to the loss function that doesn't depend on the data. Here, the regularizer ‖θ‖²/(2σ²) encourages minimizing the magnitude of θ. Regularizers are crucial for combating overfitting (a code sketch follows).
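Continuing the earlier logistic-regression sketch (again my own illustration; sigma2 and the model are assumptions, not the lecture's worked example), the MAP objective is the same average NLL plus the regularizer induced by the Gaussian prior:

```python
import numpy as np

def map_objective(theta, X, y, sigma2=1.0):
    """Average NLL for the logistic model plus the L2 regularizer from a
    N(0, sigma2 * I) prior. Since we divide the NLL by N, the regularizer
    ||theta||^2 / (2 sigma2) is divided by N too, leaving the argmin unchanged."""
    z = X @ theta
    avg_nll = np.mean(np.logaddexp(0.0, z) - y * z)
    reg = np.dot(theta, theta) / (2.0 * sigma2 * len(y))
    return avg_nll + reg
```

Smaller sigma2 means a stronger prior and heavier regularization, pulling θ_MAP toward zero; as sigma2 grows, the prior flattens out and θ_MAP approaches θ_MLE.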