Bayesian Methods
CS542 Machine Learning
Bayesian Methods
• Previously, we derived cost functions from maximum likelihood and then added regularization terms to them
• Can we derive regularization directly from probabilistic principles?
• Yes! Use Bayesian methods
Bayesian Methods
Motivation
Problem with Maximum Likelihood: Bias
• ML estimates are biased
• Especially a problem for a small number of samples, or high input dimensionality
• Suppose we sample 2, 3, 6 points from the same dataset and use ML to fit regression parameters
[Figure: fitted curves $h(x, \theta_{ML})$ for each sample size]
Problem with Maximum Likelihood: Overfitting
• ML estimates cannot be used to choose the complexity of the model
  – E.g., suppose we want to estimate the number of basis functions K
  – Choose K = 1? Or K = 15?
• ML will always choose the K that best fits the training data (in this case, K = 15)
[Figure: fitted curve $h(x, \theta_{ML})$ on the training data]
• Solution: use a Bayesian method. Define a prior distribution over the parameters (this results in regularization).
Bayesian vs. Frequentist
• Frequentist: maximize the data likelihood
$$p(D \mid \text{model}) = p(D \mid \theta)$$
• Bayesian: treat $\theta$ as a random variable, maximize the posterior
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
• $p(D \mid \theta)$ is the data likelihood, $p(\theta)$ is the prior over the model parameters
Bayesian Method
• Treat $\theta$ as a random variable, maximize the posterior
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
• Likelihood $p(D \mid \theta)$ is the same as before, as in Maximum Likelihood
• Prior $p(\theta)$ is a new distribution we model; it specifies which parameters are more likely a priori, before seeing any data
• $p(D)$ does not depend on $\theta$, so it is constant when choosing the $\theta$ with the highest posterior probability
Prior over Model Parameters: Intuition
• Will he score? Observations: Score! Score! Miss Score!
• Your estimate of $\theta = p(\text{score})$?
• Now add prior information: the player is LeBron James
• Your estimate of $\theta = p(\text{score})$?
• Prior $p(\theta)$ reflects prior knowledge, e.g., $\theta \approx 1$
Prior Distribution
Prior distributions $p(\theta)$ are probability distributions over model parameters, based on a priori knowledge about the parameters.
Prior distributions are independent of the observed data.
Coin Toss Example
What is the probability of heads (θ)?
Beta Prior for θ
$$P(\theta) = \mathrm{Beta}(\alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
[Figures: Beta prior densities for an uninformative choice of $(\alpha, \beta)$ and for an informative choice]
Coin Toss Experiment
• $n = 10$ coin tosses
• $y = 4$ heads observed
Likelihood Function for the Data
$$P(y \mid \theta) = \mathrm{Binomial}(n, \theta) = \binom{n}{y}\, \theta^{y} (1 - \theta)^{(n - y)}$$
Prior and Likelihood
[Figure: prior and likelihood plotted over $\theta$]
Posterior Distribution
$$\mathrm{Posterior} \propto \mathrm{Prior} \times \mathrm{Likelihood}: \qquad P(\theta \mid y) \propto P(\theta)\, P(y \mid \theta) = \mathrm{Beta}(\alpha, \beta) \times \mathrm{Binomial}(n, \theta)$$
$$\Rightarrow\quad P(\theta \mid y) = \mathrm{Beta}(y + \alpha,\; n - y + \beta)$$
This is why we chose the Beta distribution as our prior: the posterior is also a Beta distribution. The Beta is a conjugate prior for the Binomial likelihood.
Posterior Distribution
[Figures: posterior for the coin-toss data under an informative prior and under an uninformative prior]
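To make the conjugate update concrete, here is a minimal Python sketch using scipy.stats with the n = 10, y = 4 coin-toss data from this example; the informative prior values α = β = 30 are chosen purely for illustration and are not from the slides.

```python
import numpy as np
from scipy import stats

n, y = 10, 4  # coin-toss data: 10 tosses, 4 heads

# Conjugate Beta-Binomial update: Beta(alpha, beta) prior -> Beta(y + alpha, n - y + beta) posterior
def posterior(alpha, beta, n, y):
    return stats.beta(y + alpha, n - y + beta)

uninformative = posterior(1, 1, n, y)     # Beta(1, 1) prior is flat over [0, 1]
informative   = posterior(30, 30, n, y)   # illustrative prior concentrated near theta = 0.5

for name, post in [("uninformative", uninformative), ("informative", informative)]:
    mean = post.mean()
    lo, hi = post.interval(0.95)          # central 95% credible interval
    print(f"{name:14s} posterior mean = {mean:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

With the flat prior the posterior mean stays close to the sample frequency 0.4, while the informative prior pulls it toward 0.5, which is exactly the regularizing effect the slides describe.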
Bayesian Linear Regression
Bayesian Linear Regression
Let's now apply the Bayesian method to linear regression.
To do that, we must treat the parameter $\theta$ as a random variable and design a prior over it.
First, let's review maximum likelihood for linear regression.
ML for Linear Regression
• Likelihood function: each target is the model output plus Gaussian noise,
$$t = y + \epsilon = h(x) + \epsilon, \qquad \epsilon \sim N(\epsilon \mid 0, \beta^{-1}),$$
where $\beta = \frac{1}{\sigma^2}$ and $h(x) = \theta^T x$.
[Figure: regression curve $h(x)$ with the Gaussian $p(t \mid x, \theta, \beta)$ shown around $h(x^{(i)})$ at a point $x^{(i)}$]
• Probability of one data point:
$$p(t^{(i)} \mid x^{(i)}, \theta, \beta) = N(t^{(i)} \mid \theta^T x^{(i)},\; \beta^{-1})$$
• Maximum likelihood solution:
$$\theta_{ML} = \arg\max_{\theta}\, p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta), \qquad \beta_{ML} = \arg\max_{\beta}\, p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$$
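To make the ML solution concrete, here is a minimal numpy sketch (the synthetic data and all variable names are illustrative, not from the slides) that computes $\theta_{ML}$ via the least-squares closed form and estimates the noise precision $\beta_{ML}$ from the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: t = theta^T x + Gaussian noise (illustrative)
N = 50
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])  # bias feature + input
theta_true = np.array([0.5, -1.0])
beta_true = 25.0                                           # noise precision (variance 0.04)
t = X @ theta_true + rng.normal(0, 1 / np.sqrt(beta_true), N)

# Maximum likelihood estimate of theta: least-squares solution
theta_ml, *_ = np.linalg.lstsq(X, t, rcond=None)

# ML estimate of the noise precision: inverse of the mean squared residual
residuals = t - X @ theta_ml
beta_ml = 1.0 / np.mean(residuals ** 2)

print("theta_ML =", theta_ml, " beta_ML =", beta_ml)
```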
What is 𝛽𝛽 useful for?
• Recall: we assumed the observations $t$ are Gaussian given $h(x)$
• $\beta$ allows us to write down a distribution over $t$ given a new $x$, called the predictive distribution:
$$p(t \mid x, \theta_{ML}, \beta_{ML}) = N\!\left(t \mid \theta_{ML}^T x,\; \beta_{ML}^{-1}\right)$$
[Figure: Gaussian predictive density over $t$ centered at $\theta_{ML}^T x$]
• $\beta_{ML}^{-1}$ is the variance of this distribution
Predictive Distribution
Given a new input point x, we can now compute a distribution over the output t:
$$p(t \mid x, \theta_{ML}, \beta_{ML}) = N\!\left(t \mid \theta_{ML}^T x,\; \beta_{ML}^{-1}\right)$$
[Figure: predictive distribution compared with the true hypothesis]
Slide credit: Bishop
Define a distribution over parameters
• Define the prior distribution over $\boldsymbol{\theta}$ as
$$p(\boldsymbol{\theta}) = N(\boldsymbol{\theta} \mid \boldsymbol{m}_0, \boldsymbol{S}_0)$$
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions¹ gives the posterior
$$p(\boldsymbol{\theta} \mid \boldsymbol{t}) = N(\boldsymbol{\theta} \mid \boldsymbol{m}_N, \boldsymbol{S}_N)$$
• where
$$\boldsymbol{m}_N = \boldsymbol{S}_N \left( \boldsymbol{S}_0^{-1} \boldsymbol{m}_0 + \beta \boldsymbol{X}^T \boldsymbol{t} \right), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta \boldsymbol{X}^T \boldsymbol{X}$$
¹ see Bishop 2.3.3
A common choice for prior
• A common choice for the prior is
$$p(\boldsymbol{\theta}) = N(\boldsymbol{\theta} \mid \boldsymbol{0},\; \alpha^{-1}\boldsymbol{I})$$
• for which
$$\boldsymbol{m}_N = \beta \boldsymbol{S}_N \boldsymbol{X}^T \boldsymbol{t}, \qquad \boldsymbol{S}_N^{-1} = \alpha \boldsymbol{I} + \beta \boldsymbol{X}^T \boldsymbol{X}$$
Slide credit: Bishop
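The posterior parameters above are only a few lines of linear algebra. The sketch below reuses the illustrative X, t, and beta_ml from the earlier ML example and assumes a prior precision alpha = 2.0; both choices are placeholders for illustration, not values from the slides.

```python
# Posterior for Bayesian linear regression with prior p(theta) = N(0, alpha^{-1} I)
alpha = 2.0                      # assumed prior precision (hyperparameter)
beta = beta_ml                   # noise precision; reusing the ML estimate from above

D = X.shape[1]
S_N_inv = alpha * np.eye(D) + beta * X.T @ X     # S_N^{-1} = alpha*I + beta*X^T X
S_N = np.linalg.inv(S_N_inv)                     # posterior covariance
m_N = beta * S_N @ X.T @ t                       # posterior mean, m_N = beta * S_N * X^T t

print("posterior mean m_N =", m_N)
print("posterior covariance S_N =\n", S_N)
```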
Intuition: prefer θ to be simple
• For a linear regression model, $h(x) = \theta^T x$. What do we mean by $\theta$ being simple?
• Namely, put a prior on $\theta$ which captures our belief that $\theta$ is around zero, resulting in a simple model for prediction:
$$p(\boldsymbol{\theta}) = N(\boldsymbol{\theta} \mid \boldsymbol{0},\; \alpha^{-1}\boldsymbol{I})$$
• The Bayesian way of thinking is to regard $\theta$ as a random variable, and to use the observed data $D$ to update our prior belief about $\theta$.
Bayesian Linear Regression Example
• 0 data points observed: [Figure: prior over $(\theta_0, \theta_1)$ and functions sampled from it in data space]
• 1 data point observed: [Figure: likelihood over $(\theta_0, \theta_1)$, posterior (true $\boldsymbol{\theta}$ marked), and sampled functions in data space]
• 2 data points observed: [Figure: likelihood, posterior, and sampled functions in data space]
• 20 data points observed: [Figure: likelihood, posterior, and sampled functions in data space]
Slide credit: Bishop
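The progression in these figures can be reproduced with a short simulation. Below is a minimal sketch (synthetic data and hyperparameter values are illustrative, chosen in the spirit of Bishop's example, not taken from the slides) that performs the sequential posterior update and prints the posterior mean after 0, 1, 2, and 20 observed points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup in the spirit of Bishop's example: t = theta_0 + theta_1 * x + noise
theta_true = np.array([-0.3, 0.5])
alpha, beta = 2.0, 25.0

S_inv = alpha * np.eye(2)     # prior precision: p(theta) = N(0, alpha^{-1} I)
b = np.zeros(2)               # accumulates beta * X^T t

print("after 0 points: m_N =", np.linalg.solve(S_inv, b))

for n in range(1, 21):
    x = rng.uniform(-1, 1)
    phi = np.array([1.0, x])                            # features [1, x] for the line
    t = theta_true @ phi + rng.normal(0, 1 / np.sqrt(beta))

    S_inv += beta * np.outer(phi, phi)                  # S_N^{-1} = alpha*I + beta*X^T X
    b += beta * phi * t                                 # beta * X^T t
    if n in (1, 2, 20):
        print(f"after {n} points: m_N =", np.linalg.solve(S_inv, b))
```

As more points arrive, the posterior mean moves from the prior mean (zero) toward the true parameters, mirroring the tightening posterior shown in the figures.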
Bayesian Linear Regression
Prediction
Prediction
• Now that we have a Bayesian model, how do we use it to make predictions for new data points?
• One way is to maximize the posterior to get an estimate $\boldsymbol{\theta}^*$
• Then, plug $\boldsymbol{\theta}^*$ into the predictive distribution
• This is known as the maximum a posteriori (MAP) estimate
[Figure: likelihood, posterior, and data space panels]
Maximum A Posteriori (MAP)
• Output the parameter that maximizes its posterior distribution given the data:
$$\boldsymbol{\theta}_{MAP} = \arg\max_{\theta}\, p(\boldsymbol{\theta} \mid \boldsymbol{t})$$
• Recall: for our prior $p(\boldsymbol{\theta}) = N(\boldsymbol{\theta} \mid \boldsymbol{0}, \alpha^{-1}\boldsymbol{I})$, the posterior is $p(\boldsymbol{\theta} \mid \boldsymbol{t}) = N(\boldsymbol{\theta} \mid \boldsymbol{m}_N, \boldsymbol{S}_N)$, where $\boldsymbol{m}_N = \beta \boldsymbol{S}_N \boldsymbol{X}^T \boldsymbol{t}$ and $\boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\boldsymbol{X}^T\boldsymbol{X}$. Therefore,
$$\boldsymbol{\theta}_{MAP} = \arg\max_{\theta}\, p(\boldsymbol{\theta} \mid \boldsymbol{t}) = \left( \boldsymbol{X}^T\boldsymbol{X} + \tfrac{\alpha}{\beta}\boldsymbol{I} \right)^{-1} \boldsymbol{X}^T \boldsymbol{t}$$
• This is the same as the solution to regularized regression with a $\|\boldsymbol{\theta}\|^2$ penalty term.
• Note, this is the mode of the posterior distribution.
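A quick numerical sanity check of this equivalence, as a sketch reusing the illustrative X, t, D, alpha, and beta from the code above: the MAP estimate (the posterior mean) matches the ridge-regression closed form with λ = α/β.

```python
# MAP estimate = posterior mean m_N = beta * S_N * X^T t
theta_map = beta * np.linalg.solve(alpha * np.eye(D) + beta * X.T @ X, X.T @ t)

# Ridge-regression closed form with lambda = alpha / beta
lam = alpha / beta
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

print(np.allclose(theta_map, theta_ridge))   # True: the two solutions coincide
```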
Bayesian Regression
Connection to Regularized Linear Regression
Maximizing the posterior leads to a regularized cost function:
$$\boldsymbol{\theta}_{MAP} = \arg\min_{\theta}\; \frac{1}{2\beta^{-1}} \sum_{n} \left( \boldsymbol{\theta}^T \boldsymbol{x}_n - y_n \right)^2 + \frac{1}{2\alpha^{-1}} \sum_{d} \theta_d^2,$$
where $\beta^{-1}$ is the noise variance and $\alpha^{-1}$ is the prior variance parameter.
We can rewrite this optimization in the same form as the regularized linear regression cost:
$$L(\boldsymbol{\theta}) = \sum_{n} \left( \boldsymbol{\theta}^T \boldsymbol{x}_n - y_n \right)^2 + \lambda \|\boldsymbol{\theta}\|^2,$$
where $\lambda = \beta^{-1}/\alpha^{-1} = \alpha/\beta$ corresponds to the regularization hyperparameter.
• Intuitively, as $\lambda \to +\infty$, then $\beta^{-1} \gg \alpha^{-1}$. That is, the noise variance is far greater than what our prior allows for $\theta$. In this case, our prior would be more accurate than what the data can tell us, so we get a simple model, where $\boldsymbol{\theta}_{MAP} \to \boldsymbol{0}$.
• If $\lambda \to 0$, then $\beta^{-1} \ll \alpha^{-1}$, and we trust our data more, so the MAP solution approaches the maximum likelihood solution, i.e., $\boldsymbol{\theta}_{MAP} \to \boldsymbol{\theta}_{ML}$.
Effect of lambda
Bayesian Predictive Distribution
Maximum A Posteriori (MAP)
• Output the parameter that maximizes its posterior distribution given the data
$$\boldsymbol{\theta}_{MAP} = \arg\max_{\theta}\, p(\boldsymbol{\theta} \mid \boldsymbol{t})$$
• Note, this is the mode of the distribution
• However, sometimes we may want to hedge our bets and average (integrate) over all possible parameters, e.g. if the posterior is multi-modal
Bayesian Predictive Distribution
• Predict $t$ for new values of $x$ by integrating over $\boldsymbol{\theta}$:
$$p(t \mid x, \boldsymbol{t}, \alpha, \beta) = \int p(t \mid \boldsymbol{\theta}, \beta)\, p(\boldsymbol{\theta} \mid \boldsymbol{t}, \alpha, \beta)\, d\boldsymbol{\theta} = N\!\left(t \mid \boldsymbol{m}_N^T x,\; \sigma_N^2(x)\right)$$
• where
$$\sigma_N^2(x) = \frac{1}{\beta} + x^T \boldsymbol{S}_N\, x$$
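As a sketch of how this is computed, reusing the illustrative m_N, S_N, and beta from the earlier posterior example (and the same assumed [1, x] feature map), the function below returns the predictive mean and variance for a new input; the variance adds the posterior uncertainty $x^T S_N x$ to the noise variance $1/\beta$.

```python
def predictive(x_new, m_N, S_N, beta):
    """Bayesian predictive distribution N(t | m_N^T x, sigma_N^2(x)) for one input."""
    phi = np.array([1.0, x_new])                 # same [1, x] features as before
    mean = m_N @ phi                             # predictive mean m_N^T x
    var = 1.0 / beta + phi @ S_N @ phi           # sigma_N^2(x) = 1/beta + x^T S_N x
    return mean, var

mean, var = predictive(0.3, m_N, S_N, beta)
print(f"predictive mean = {mean:.3f}, std = {np.sqrt(var):.3f}")
```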
What does it look like?
• Compare to Maximum Likelihood:
$$p(t \mid x, \boldsymbol{x}, \boldsymbol{t}) = N\!\left(t \mid \boldsymbol{m}_N^T x,\; \sigma_N^2(x)\right) \qquad \text{vs.} \qquad p(t \mid x, \theta_{ML}, \beta_{ML}) = N\!\left(t \mid \theta_{ML}^T x,\; \beta_{ML}^{-1}\right)$$
[Figure: Bayesian predictive distribution with input-dependent variance vs. the ML predictive distribution with constant variance]
Slide credit: Bishop