Lecture 14. Bayesian regression
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Uncertainty not captured by point estimates
• Bayesian approach preserves uncertainty
• Sequential Bayesian updating
• Conjugate prior (Normal-Normal)
• Using posterior for Bayesian predictions on test
Training == optimisation (?)
Stages of learning & inference:
• Formulate model
• Fit parameters to data
• Make prediction
$\hat{\mathbf{w}}$ referred to as a 'point estimate'
Bayesian Alternative
Nothing special about $\hat{\mathbf{w}}$ … use more than one value?
• Formulate model
• Consider the space of likely parameters – those that fit the training data well
• Make 'expected' prediction
Uncertainty
From small training sets, we rarely have complete confidence in any models learned. Can we quantify the uncertainty, and use it in making predictions?
Regression Revisited
• Learn model from data
∗ minimise error residuals by choosing weights:
$\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
• But… how confident are we
∗ in $\hat{\mathbf{w}}$?
∗ in the predictions?
Linear regression: $y = w_0 + w_1 x$ (here y = humidity, x = temperature)
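As a concrete illustration, a minimal NumPy sketch of computing this least-squares point estimate; the temperature/humidity data here is synthetic, made up purely for illustration.

```python
import numpy as np

# Synthetic temperature (x) / humidity (y) data, for illustration only
rng = np.random.default_rng(0)
temperature = rng.uniform(5, 35, size=20)
humidity = 80 - 1.5 * temperature + rng.normal(0, 4, size=20)

# Design matrix with a bias column, then w_hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones_like(temperature), temperature])
w_hat = np.linalg.solve(X.T @ X, X.T @ humidity)
print("point estimate (w0, w1):", w_hat)
```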
Do we trust point estimate $\hat{\mathbf{w}}$?
• How stable is learning?
∗ how much uncertainty in the parameter estimate?
∗ $\hat{\mathbf{w}}$ highly sensitive to noise
∗ more informative if the neg. log-likelihood objective is highly peaked
• Formalised as the Fisher Information matrix
∗ E[2nd deriv. of NLL]
∗ measures curvature of the objective about $\hat{\mathbf{w}}$
Figure: Rogers and Girolami p81
The Bayesian View
Retain and model all unknowns (e.g., uncertainty over parameters) and use this information when making inferences.
A Bayesian View
• Could we reason over all parameters that are consistent with the data?
∗ weights with a better fit to the training data should be more probable than others
∗ make predictions with all these weights, scaled by their probability
• This is the idea underlying Bayesian inference
Uncertainty over parameters
• Many reasonable solutions to the objective
∗ why select just one?
• Reason under all possible parameter values
∗ weighted by their posterior probability
• More robust predictions
∗ less sensitive to overfitting, particularly with small training sets
∗ can give rise to a more expressive model class (Bayesian logistic regression becomes non-linear!)
Frequentist vs Bayesian “divide”
• Frequentist: learning using point estimates, regularisation, p-values …
∗ backed by sophisticated theory on simplifying assumptions
∗ mostly simpler algorithms, characterises much practical machine learning research
• Bayesian: maintain uncertainty, marginalise (sum) out unknowns during inference
∗ some theory
∗ often more complex algorithms, but not always
∗ often (not always) more computationally expensive
Bayesian Regression
Application of Bayesian inference to linear regression, using Normal prior over w
Revisiting Linear Regression
• Recall probabilistic formulation of linear regression:
$p(y \,|\, \mathbf{x}, \mathbf{w}) = \text{Normal}(y \,|\, \mathbf{x}'\mathbf{w}, \sigma^2)$, with Normal prior $p(\mathbf{w}) = \text{Normal}(\mathbf{w} \,|\, \mathbf{0}, \gamma^2 \mathbf{I}_D)$
($\mathbf{I}_D$ = D × D identity matrix)
• Bayes rule: $p(\mathbf{w} \,|\, \mathbf{X}, \mathbf{y}) \propto p(\mathbf{y} \,|\, \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$
• Gives rise to a penalised objective (ridge regression): maximising the log posterior amounts to minimising $\sum_i (y_i - \mathbf{x}_i'\mathbf{w})^2 + \lambda \|\mathbf{w}\|_2^2$, with $\lambda = \sigma^2/\gamma^2$
(point estimate taken here, avoids computing the marginal likelihood term)
Bayesian Linear Regression
• Rewind one step, consider the full posterior:
$p(\mathbf{w} \,|\, \mathbf{X}, \mathbf{y}, \sigma^2) = \dfrac{p(\mathbf{y} \,|\, \mathbf{X}, \mathbf{w}, \sigma^2)\, p(\mathbf{w})}{p(\mathbf{y} \,|\, \mathbf{X})}$
(here we assume the noise variance $\sigma^2$ is known)
• Can we compute the denominator (marginal likelihood or evidence)?
∗ if so, we can use the full posterior, not just its mode
Bayesian Linear Regression (cont)
• We have two Normal distributions
∗ Normal likelihood × Normal prior
• Their product is also a Normal distribution
∗ conjugate prior: when the product of likelihood × prior results in the same distribution family as the prior
∗ evidence can be computed easily using the normalising constant of the Normal distribution
closed form solution for posterior!
Bayesian Linear Regression (cont)
$p(\mathbf{w} \,|\, \mathbf{X}, \mathbf{y}, \sigma^2) = \text{Normal}(\mathbf{w} \,|\, \mathbf{w}_N, \mathbf{V}_N)$
Note that the mean (and mode) $\mathbf{w}_N$ is the MAP solution from before, where
$\mathbf{w}_N = \frac{1}{\sigma^2}\mathbf{V}_N \mathbf{X}'\mathbf{y}$ and $\mathbf{V}_N = \sigma^2 \left(\mathbf{X}'\mathbf{X} + \frac{\sigma^2}{\gamma^2}\mathbf{I}_D\right)^{-1}$
Advanced: verify by expressing product of two Normals, gathering exponents together and ‘completing the square’ to express as squared exponential (i.e., Normal distribution).
Slide 2.21 – completing square
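A minimal NumPy sketch of this closed-form posterior, assuming known noise variance σ² and prior $\mathbf{w} \sim \text{Normal}(\mathbf{0}, \gamma^2\mathbf{I}_D)$; the function and parameter names are illustrative, not from the slides.

```python
import numpy as np

def posterior(X, y, sigma_sq=1.0, gamma_sq=1.0):
    """Return (w_N, V_N) of the Normal posterior over weights."""
    D = X.shape[1]
    # V_N = sigma^2 (X'X + (sigma^2 / gamma^2) I_D)^{-1}
    V_N = sigma_sq * np.linalg.inv(X.T @ X + (sigma_sq / gamma_sq) * np.eye(D))
    # w_N = (1 / sigma^2) V_N X'y  -- coincides with the MAP / ridge estimate
    w_N = V_N @ X.T @ y / sigma_sq
    return w_N, V_N
```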
Bayesian Linear Regression example
Step 1: select prior, here spherical about 0
Step 2: observe training data
Step 3: formulate posterior, from prior & likelihood
Samples from posterior
Sequential Bayesian Updating
• Can formulate the posterior $p(\mathbf{w} \,|\, \mathbf{X}, \mathbf{y})$ for a given dataset
• What happens as we see more and more data?
1. Start from prior
2. See new labelled datapoint
3. Compute posterior
4. The posterior now takes role of prior & repeat from step 2
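A sketch of this loop under the same Normal-Normal assumptions, processing one labelled point at a time in precision (inverse-covariance) form so that the posterior after each point literally becomes the prior for the next; names are illustrative.

```python
import numpy as np

def sequential_update(data_stream, D, sigma_sq=1.0, gamma_sq=1.0):
    """Yield (w_N, V_N) after each labelled point (x, y) in data_stream."""
    precision = np.eye(D) / gamma_sq   # prior precision V_0^{-1}
    shift = np.zeros(D)                # accumulates X'y / sigma^2
    for x, y in data_stream:
        precision += np.outer(x, x) / sigma_sq   # likelihood sharpens the posterior
        shift += x * y / sigma_sq
        V_N = np.linalg.inv(precision)
        w_N = V_N @ shift
        yield w_N, V_N                 # current posterior = prior for the next point
```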
Sequential Bayesian Updating
• Initially know little, many regression lines licensed
• Likelihood constrains possible weights such that the regression is close to the observed points
• Posterior becomes more refined/peaked as more data is introduced
• Approaches a point mass
Bishop Fig 3.7, p155
Stages of Training
1. Decide on model formulation & prior
2. Compute posterior over parameters, p(w|X,y)

MAP:
3. Find mode for w
4. Use to make prediction on test

approx. Bayes:
3. Sample many w
4. Use to make ensemble average prediction on test

exact Bayes:
3. Use all w to make expected prediction on test
Prediction with uncertain w
• Could predict using sampled regression curves
∗ sample S parameters, $\mathbf{w}^{(s)}$, $s \in \{1, \dots, S\}$
∗ for each sample compute prediction $y_*^{(s)}$ at test point $\mathbf{x}_*$
∗ compute the mean (and var.) over these predictions
∗ this process is known as Monte Carlo integration
• For Bayesian regression there's a simpler solution
∗ integration can be done analytically, for
$p(\hat{y}_* \,|\, \mathbf{X}, \mathbf{y}, \mathbf{x}_*, \sigma^2) = \int p(\mathbf{w} \,|\, \mathbf{X}, \mathbf{y}, \sigma^2)\, p(y_* \,|\, \mathbf{x}_*, \mathbf{w}, \sigma^2)\, d\mathbf{w}$
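A sketch of the Monte Carlo route, assuming the posterior mean $\mathbf{w}_N$ and covariance $\mathbf{V}_N$ have already been computed (e.g. by the posterior sketch earlier):

```python
import numpy as np

def mc_predict(w_N, V_N, x_star, S=1000, seed=0):
    """Monte Carlo mean and variance of the prediction at test point x_star."""
    rng = np.random.default_rng(seed)
    w_samples = rng.multivariate_normal(w_N, V_N, size=S)  # S draws of w from posterior
    preds = w_samples @ x_star                              # y*(s) = x*' w(s) for each draw
    # Variance here is the spread over regression curves; add sigma^2 for observation noise
    return preds.mean(), preds.var()
```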
Prediction (cont.)
• Pleasant properties of the Gaussian distribution mean the integration is tractable:
$p(y_* \,|\, \mathbf{X}, \mathbf{y}, \mathbf{x}_*, \sigma^2) = \text{Normal}\!\left(y_* \,|\, \mathbf{x}_*'\mathbf{w}_N,\; \sigma^2 + \mathbf{x}_*'\mathbf{V}_N\mathbf{x}_*\right)$
∗ additive variance based on how well x* matches the training data
∗ cf. MLE/MAP estimate, where the variance is a fixed constant
($\mathbf{w}_N$ and $\mathbf{V}_N$ defined in the posterior when fitting Bayesian linear regression)
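A matching sketch of the closed form, under the same assumptions ($\mathbf{w}_N$, $\mathbf{V}_N$ from the posterior, known σ²):

```python
import numpy as np

def predict(w_N, V_N, x_star, sigma_sq=1.0):
    """Closed-form predictive mean and variance at test point x_star."""
    mean = x_star @ w_N
    var = sigma_sq + x_star @ V_N @ x_star   # variance grows away from the training data
    return mean, var
```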
Bayesian Prediction example
Figure: samples from the posterior, alongside MLE (blue) and MAP (green) point estimates with fixed variance. Data: y = x sin(x); model: cubic. Predictive variance is higher further from the data points.
Caveats
• Assumptions
∗ known data noise parameter, σ²
∗ data was drawn from the model distribution
• In real settings, σ² is unknown
∗ it has its own conjugate prior: Normal likelihood × Inverse-Gamma prior results in an Inverse-Gamma posterior
∗ closed-form predictive distribution, with Student-t likelihood
(see Murphy, 7.6.3)
Summary
• Uncertainty not captured by point estimates (MLE, MAP)
• Bayesian approach preserves uncertainty
∗ care about predictions NOT parameters
∗ choose prior over parameters, then model posterior
• New concepts:
∗ sequential Bayesian updating
∗ conjugate prior (Normal-Normal)
• Using posterior for Bayesian predictions on test
• Next time: Bayesian classification, then PGMs