Announcements
• Reminder: ps4 self-grading form out, due Friday 10/30
• pset 5 out Thursday 10/29, due 11/5 (1 week)
• Midterm grades will go up by Monday (don't discuss it yet)
• My Thursday office hours moved to 11am
• Lab this week – probabilistic models, ipython notebook examples
Bayesian Methods
CS542 Machine Learning
Bayesian Methods
• Before, we derived cost functions from maximum likelihood, then added regularization terms to these cost functions
• Can we derive regularization directly from probabilistic principles?
• Yes! Use Bayesian methods
Bayesian Methods
Motivation
Problem with Maximum Likelihood: Bias
• ML estimates are biased
• Especially a problem for a small number of samples, or high input dimensionality
• Suppose we sample 2, 3, and 6 points from the same dataset and use ML to fit regression parameters
[Figure: three panels plotting t vs. x, each showing the ML fit h(x, θ_ML) for the 2-, 3-, and 6-point samples]
Problem with Maximum Likelihood: Overfitting
• ML estimates cannot be used to choose the complexity of the model
  – E.g. suppose we want to estimate the number of basis functions K
  – Choose K=1? Or K=15?
• ML will always choose the K that best fits the training data (in this case, K=15); see the sketch below
• Solution: use a Bayesian method – define a prior distribution over the parameters (results in regularization)
[Figure: plot of t vs. x showing the ML fit h(x, θ_ML)]
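To make this concrete, here is a minimal Python sketch (NumPy assumed available) in which maximum likelihood fits polynomials of increasing degree K, standing in for K basis functions; the synthetic data, noise level, and K values are illustrative choices, not from the slides. The training error never increases with K, so choosing K by training fit alone always favors the most complex model.

# ML always prefers the most flexible model: the training error of a degree-K
# polynomial fit never increases with K.
# The synthetic data and the K values below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for K in [1, 3, 9]:
    coeffs = np.polyfit(x, t, deg=K)                 # least-squares (ML) polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - t) ** 2)
    print(f"K={K}: training MSE = {train_mse:.4f}")  # decreases as K grows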
Bayesian vs. Frequentist
Frequentist: maximize the data likelihood
    p(D | model) = p(D | θ)
Bayesian: treat θ as a random variable, maximize the posterior
    p(θ | D) = p(D | θ) p(θ) / p(D)
p(D | θ) is the data likelihood, p(θ) is the prior over the model parameters
Bayesian Method
Treat θ as a random variable, maximize the posterior
    p(θ | D) = p(D | θ) p(θ) / p(D)
Likelihood p(D | θ) is the same as before, as in Maximum Likelihood.
Prior p(θ) is a new distribution we model; it specifies which parameters are more likely a priori, before seeing any data.
p(D) does not depend on θ, so it is constant when choosing the θ with the highest posterior probability.
Prior over Model Parameters
Intuition: Will he score?
• Observed shots: Score! Score! Miss! Score!
• Your estimate of θ = p(score)?
• Prior information: player = LeBron James
• Prior p(θ) reflects prior knowledge, e.g., θ ≈ 1
Prior Distribution
Prior distributions p(θ) are probability distributions over model parameters, based on a priori knowledge about the parameters.
Prior distributions are independent of the observed data.
Coin Toss Example
What is the probability of heads (θ)?
Beta Prior for θ
    P(θ) = Beta(α, β) = [Γ(α+β) / (Γ(α)Γ(β))] · θ^(α−1) (1−θ)^(β−1)
Uninformative Prior
    P(θ) = Beta(α, β) with small, equal parameters, e.g. α = β = 1, which is uniform over θ
Informative Prior
    P(θ) = Beta(α, β) with larger α and β, which concentrates the prior around its mean α/(α+β)
Coin Toss Experiment
• n = 10 coin tosses
• y = 4 heads observed
Likelihood Function for the Data
    P(y | θ) = Binomial(n, θ) = (n choose y) · θ^y (1−θ)^(n−y)
Prior and Likelihood
    Prior P(θ) = Beta(α, β), likelihood P(y | θ) = Binomial(n, θ)
Posterior Distribution
    Posterior ∝ Prior × Likelihood:  P(θ | y) ∝ P(θ) P(y | θ)
    P(θ | y) ∝ Beta(α, β) × Binomial(n, θ) = Beta(y + α, n − y + β)
This is why we chose the Beta distribution as our prior: the posterior is also a Beta distribution. The Beta is the conjugate prior for the binomial likelihood.
Posterior Distribution
Effect of Informative Prior
Effect of Uninformative Prior
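A minimal Python sketch of this update (SciPy assumed available), contrasting an uninformative and an informative prior for the coin-toss data above; the specific prior hyperparameters are illustrative choices.

# Beta-Binomial conjugate update for the coin-toss example: n = 10 tosses, y = 4 heads.
# The prior hyperparameters below are illustrative choices.
from scipy import stats

n, y = 10, 4

priors = {
    "uninformative": (1.0, 1.0),    # Beta(1, 1) = uniform over theta
    "informative":   (20.0, 20.0),  # strong prior belief that theta is near 0.5
}

for name, (a, b) in priors.items():
    post = stats.beta(y + a, n - y + b)              # posterior = Beta(y + alpha, n - y + beta)
    mode = (y + a - 1) / (n + a + b - 2)             # posterior mode (MAP estimate)
    print(f"{name:13s}: posterior mean {post.mean():.3f}, posterior mode {mode:.3f}")

print(f"maximum likelihood estimate: {y / n:.3f}")   # ignores the prior entirely

The informative prior pulls the estimate toward 0.5, while the uninformative prior leaves it close to the ML estimate of 0.4.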
Bayesian Linear Regression
Bayesian Linear Regression
Let's now apply the Bayesian method to linear regression.
To do that, we must treat the parameter θ as a random variable and design a prior over it.
First, review maximum likelihood for linear regression.
ML for Linear Regression
    t = y + ε = h(x) + ε,  with noise ε ∼ N(ε | 0, β⁻¹),
    where β = 1/σ² and h(x) = θᵀx
Probability of one data point:
    p(t | x, θ, β) = N(t | h(x), β⁻¹)
[Figure: the Gaussian p(t | x, θ, β) centered at h(x^(i)) for a single input x^(i)]
Maximum likelihood solution
    θ_ML = argmax_θ p(𝒕 | 𝒙, θ, β)
    β_ML = argmax_β p(𝒕 | 𝒙, θ, β)
where p(𝒕 | 𝒙, θ, β) is the likelihood function.
What is β useful for?
• Recall: we assumed the observations t are Gaussian given h(x)
• β allows us to write down a distribution over t given a new x, called the predictive distribution:
    p(t | x, θ_ML, β_ML) = N(t | θ_MLᵀ x, β_ML⁻¹)
• β_ML⁻¹ is the variance of this distribution
Predictive Distribution
Given a new input point x, we can now compute a distribution over the output t:
    p(t | x, θ_ML, β_ML) = N(t | θ_MLᵀ x, β_ML⁻¹)
[Figure: the ML predictive distribution compared with the true hypothesis]
Slide credit: Bishop
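A minimal NumPy sketch of the ML fit and its predictive distribution; the synthetic data (true weights, noise level, sample size) and the test input are illustrative values, not from the slides.

# Maximum-likelihood linear regression and its predictive distribution
# p(t | x, theta_ML, beta_ML) = N(t | theta_ML^T x, beta_ML^{-1}).
# All data below is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
N = 30
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])   # design matrix with a bias column
t = X @ np.array([0.5, -1.0]) + rng.normal(scale=0.2, size=N)

theta_ml, *_ = np.linalg.lstsq(X, t, rcond=None)            # least-squares = ML estimate
beta_ml = 1.0 / np.mean((t - X @ theta_ml) ** 2)            # ML noise precision (1 / variance)

x_new = np.array([1.0, 0.3])                                # a new input point
print(f"predictive mean {x_new @ theta_ml:.3f}, predictive std {np.sqrt(1.0 / beta_ml):.3f}")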
Define a distribution over parameters
• Define the prior distribution over 𝜽 as
    p(𝜽) = N(𝜽 | m₀, S₀)
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions¹ gives the posterior
    p(𝜽 | 𝒕) = N(𝜽 | m_N, S_N)
  where
    m_N = S_N (S₀⁻¹ m₀ + β 𝑿ᵀ𝒕)
    S_N⁻¹ = S₀⁻¹ + β 𝑿ᵀ𝑿
¹ see Bishop 2.3.3
A common choice for prior
• A common choice for the prior is
    p(𝜽) = N(𝜽 | 𝟎, α⁻¹ I)
• for which
    m_N = β S_N 𝑿ᵀ𝒕
    S_N⁻¹ = α I + β 𝑿ᵀ𝑿
[Figure: contours of this prior in (θ₀, θ₁) space]
Slide credit: Bishop
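A short NumPy sketch of computing this posterior with the zero-mean prior; α, β, and the synthetic training data are illustrative choices.

# Posterior over the weights for Bayesian linear regression with prior N(theta | 0, alpha^{-1} I):
#   S_N^{-1} = alpha I + beta X^T X,   m_N = beta S_N X^T t.
# alpha, beta, and the synthetic data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, alpha, beta = 20, 2.0, 25.0                      # prior precision alpha, noise precision beta
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
t = X @ np.array([-0.3, 0.5]) + rng.normal(scale=1 / np.sqrt(beta), size=N)

S_N = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ t

print("posterior mean m_N:", np.round(m_N, 3))
print("posterior covariance S_N:\n", np.round(S_N, 4))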
Intuition: prefer θ to be simple
For a linear regression model θᵀx, what do we mean by θ being simple?
Namely, we put a prior on θ that captures our belief that θ is around zero, i.e., resulting in a simple model for prediction:
    p(𝜽) = N(𝜽 | 𝟎, α⁻¹ I)
[Figure: contours of this zero-mean prior in (θ₀, θ₁) space]
This Bayesian way of thinking is to regard θ as a random variable, and we will use the observed data D to update our prior belief about θ.
Bayesian Linear Regression Example
Bayesian Linear Regression Example: 0 data points observed
[Figure: the prior over (θ₀, θ₁) (left) and functions sampled from it in data space (right)]
Slide credit: Bishop
Bayesian Linear Regression Example: 1 data point observed
[Figure: likelihood and posterior over (θ₀, θ₁), with the true 𝜽 marked, and sampled functions in data space]
Slide credit: Bishop
Bayesian Linear Regression Example: 2 data points observed
[Figure: likelihood, posterior over (θ₀, θ₁), and sampled functions in data space]
Slide credit: Bishop
Bayesian Linear Regression Example: 20 data points observed
[Figure: likelihood, posterior over (θ₀, θ₁), and sampled functions in data space]
Slide credit: Bishop
Bayesian Linear Regression
Prediction
Prediction
• Now that we have a Bayesian model, how do we use it to make predictions for new data points?
[Figure: likelihood, posterior, and data-space panels from the previous example]
Prediction
• One way is to maximize the posterior to get an estimate 𝜽*
• Then, plug 𝜽* into the predictive distribution
• This is known as the maximum a posteriori (MAP) estimate
[Figure: likelihood, posterior, and data-space panels]
Maximum A Posteriori (MAP)
Output the parameter that maximizes its posterior distribution given the data:
    𝜽_MAP = argmax_𝜽 p(𝜽 | 𝒕)
Recall: for our prior p(𝜽) = N(𝜽 | 𝟎, α⁻¹ I), the posterior is p(𝜽 | 𝒕) = N(𝜽 | m_N, S_N),
where m_N = β S_N 𝑿ᵀ𝒕 and S_N⁻¹ = α I + β 𝑿ᵀ𝑿.
Therefore,
    𝜽_MAP = argmax_𝜽 p(𝜽 | 𝒕) = m_N = (𝑿ᵀ𝑿 + (α/β) I)⁻¹ 𝑿ᵀ𝒕
This is the same as the solution to regularized regression with a ‖𝜽‖² penalty term. Note that this is the mode of the posterior distribution.
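A quick numerical check of this equivalence (α, β, and the synthetic data are illustrative): the ridge-style closed form and the posterior mean m_N agree.

# Check that theta_MAP = (X^T X + (alpha/beta) I)^{-1} X^T t equals the posterior mean
# m_N = beta S_N X^T t. alpha, beta, and the synthetic data are illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, alpha, beta = 20, 2.0, 25.0
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
t = X @ np.array([-0.3, 0.5]) + rng.normal(scale=1 / np.sqrt(beta), size=N)

theta_map = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(2), X.T @ t)
m_N = beta * np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X) @ X.T @ t

print("theta_MAP (ridge form):", np.round(theta_map, 6))
print("m_N (posterior mean):  ", np.round(m_N, 6))          # identical up to numerical error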
Bayesian Regression
Connection to Regularized Linear Regression
Maximizing the posterior leads to a regularized cost function
Taking the negative log of the posterior p(𝜽 | 𝒕) ∝ p(𝒕 | 𝜽) p(𝜽) and dropping constants gives
    𝜽_MAP = argmin_𝜽 [ (β/2) Σ_n (𝜽ᵀx_n − t_n)² + (α/2) Σ_d θ_d² ]
where β⁻¹ is the noise variance and α⁻¹ is the prior variance.
Maximizing the posterior leads to a regularized cost function
• We can re-write the optimization in the same form as the regularized linear regression cost:
    L(𝜽) = Σ_n (𝜽ᵀx_n − t_n)² + λ‖𝜽‖²
  where λ = α/β, the ratio of the noise variance β⁻¹ to the prior variance α⁻¹, corresponds to the regularization hyperparameter.
• Intuitively, as λ → +∞, the variance of the noise is far greater than what our prior allows for 𝜽. In this case our prior is more accurate than what the data can tell us, so we get a simple model, with 𝜽_MAP → 0.
• If λ → 0, the noise variance is much smaller than the prior variance, and we trust our data more, so the MAP solution approaches the maximum likelihood solution, i.e., 𝜽_MAP → 𝜽_ML.
Effect of lambda
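A minimal sketch of this effect on synthetic data (the dataset and the λ values are illustrative): as λ grows, 𝜽_MAP shrinks toward zero; as λ → 0, it approaches 𝜽_ML.

# Effect of lambda on theta_MAP: shrinks toward 0 as lambda grows, approaches theta_ML
# as lambda -> 0. The synthetic data and lambda values are illustrative.
import numpy as np

rng = np.random.default_rng(3)
N = 20
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
t = X @ np.array([-0.3, 0.5]) + rng.normal(scale=0.2, size=N)

theta_ml, *_ = np.linalg.lstsq(X, t, rcond=None)
print("theta_ML:", np.round(theta_ml, 3))
for lam in [0.0, 0.1, 10.0, 1e6]:
    theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ t)
    print(f"lambda = {lam:>9g}: theta_MAP = {np.round(theta_map, 3)}")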
Bayesian Predictive Distribution
Maximum A Posteriori (MAP)
• Output the parameter that maximizes its posterior distribution given the data
    𝜽_MAP = argmax_𝜽 p(𝜽 | 𝒕)
• Note, this is the mode of the distribution
• However, sometimes we may want to hedge our bets and average (integrate) over all possible parameters, e.g. if the posterior is multi-modal
Bayesian Predictive Distribution
• Predict t for new values of x by integrating over 𝜽:
    p(t | x, 𝒕, α, β) = ∫ p(t | 𝜽, β) p(𝜽 | 𝒕, α, β) d𝜽 = N(t | m_Nᵀ x, σ_N²(x))
• where
    σ_N²(x) = 1/β + xᵀ S_N x
What does it look like?
Compare to Maximum Likelihood:
    Bayesian:            p(t | x, 𝒙, 𝒕) = N(t | m_Nᵀ x, σ_N²(x))
    Maximum Likelihood:  p(t | x, θ_ML, β_ML) = N(t | θ_MLᵀ x, β_ML⁻¹)
Slide credit: Bishop
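A final sketch contrasting the two predictive distributions on a small synthetic dataset (the data, hyperparameters, and query points are illustrative): the Bayesian predictive variance grows away from the training inputs, while the ML variance stays constant.

# Bayesian predictive distribution N(t | m_N^T x, 1/beta + x^T S_N x) vs. the ML predictive
# distribution N(t | theta_ML^T x, beta_ML^{-1}). Data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)
N, alpha, beta = 5, 2.0, 25.0                        # few points, so the difference is visible
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
t = X @ np.array([-0.3, 0.5]) + rng.normal(scale=1 / np.sqrt(beta), size=N)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ t
theta_ml, *_ = np.linalg.lstsq(X, t, rcond=None)
beta_ml = 1.0 / np.mean((t - X @ theta_ml) ** 2)

for x1 in [0.0, 0.9, 2.0]:                           # the last query is far from the data
    x = np.array([1.0, x1])
    bayes_std = np.sqrt(1.0 / beta + x @ S_N @ x)    # grows away from the training inputs
    ml_std = np.sqrt(1.0 / beta_ml)                  # constant, regardless of x
    print(f"x1 = {x1:3.1f}: Bayes mean {x @ m_N:6.3f} std {bayes_std:.3f} | "
          f"ML mean {x @ theta_ml:6.3f} std {ml_std:.3f}")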
Next Class
Support Vector Machines I
maximum margin methods; support vector machines; primal vs. dual SVM formulation; hinge loss vs. cross-entropy loss
Reading: Bishop Ch 7.1.1-7.1.2