University of Toronto
STA 314: Statistical Methods for Machine Learning I
Lecture 11 – Bayesian Linear Regression, Probabilistic PCA
The final exam does not cover this lecture or next week’s lecture.
We continue our theme of probabilistic models for continuous variables.
▶ Probabilistic interpretation of linear regression
▶ Probabilistic interpretation of PCA
(Optional) Bayesian model selection.
Completing the Square for Gaussians
First, we’re going to review a very powerful technique that will let us figure out the distribution of Gaussian random variables.
It’s a multivariate generalization of completing the square. The density of N(μ, Σ) satisfies:
log p(x) = −(1/2)(x − μ)⊤Σ−1(x − μ) + const = −(1/2) x⊤Σ−1x + x⊤Σ−1μ + const
Thus, if we know w is Gaussian with unknown mean and covariance, and we also know that
log p(w) = −(1/2) w⊤Aw + w⊤b + const
for A positive definite, then we know that
w ∼ N (A−1b, A−1)
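As a quick numerical sanity check of this identity, the following sketch (not from the slides; the names A, b, etc. are illustrative) builds a random positive definite A and vector b and verifies that the quadratic form matches the log-density of N(A−1b, A−1) up to an additive constant.

```python
# If log p(w) = -1/2 w^T A w + w^T b + const with A positive definite, then
# w ~ N(A^{-1} b, A^{-1}); constants cancel when comparing differences.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 3
M = rng.standard_normal((D, D))
A = M @ M.T + D * np.eye(D)               # positive definite matrix
b = rng.standard_normal(D)

mu, Sigma = np.linalg.solve(A, b), np.linalg.inv(A)

def quad(w):                               # -1/2 w^T A w + w^T b (no constant)
    return -0.5 * w @ A @ w + w @ b

w1, w2 = rng.standard_normal(D), rng.standard_normal(D)
print(np.isclose(quad(w1) - quad(w2),
                 multivariate_normal.logpdf(w1, mu, Sigma)
                 - multivariate_normal.logpdf(w2, mu, Sigma)))   # True
```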
Bayesian Linear Regression
We’re going to be Bayesian about the parameters of the model.
▶ This is in contrast with naïve Bayes and GDA: in those cases, we used Bayes’ rule to infer the class, but used point estimates of the parameters.
▶ By inferring a posterior distribution over the parameters, the model can know what it doesn’t know.
How can uncertainty in the predictions help us?
▶ Smooth out the predictions by averaging over lots of plausible explanations (just like ensembles!)
▶ Assign confidences to predictions
▶ Make more robust decisions
Recap: Linear Regression
Given a training set of inputs and targets {(x(i), t(i))}_{i=1}^N
Linear model:
y = w⊤ψ(x)
Vectorized, we have the design matrix X in input space and Ψ in feature space,
Ψ = ⎡ − ψ(x(1)) − ⎤
    ⎢ − ψ(x(2)) − ⎥
    ⎢      ⋮      ⎥
    ⎣ − ψ(x(N)) − ⎦
and predictions y = Ψw.
Recap: Linear Regression
Squared error loss: L(y, t) = (1/2) ∥y − t∥2
L2 regularization: φ(w) = (λ/2) ∥w∥2
Solution 1: solve analytically by setting the gradient to 0
w = (Ψ⊤Ψ + λI)−1Ψ⊤t
Solution 2: solve approximately using gradient descent
w ← (1 − αλ)w − αΨ⊤(y − t)
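A rough NumPy sketch of both solutions (Psi, t, lam, alpha are placeholder names, not the course’s code):

```python
import numpy as np

def ridge_closed_form(Psi, t, lam):
    """Solution 1: w = (Psi^T Psi + lam I)^{-1} Psi^T t."""
    M = Psi.shape[1]
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(M), Psi.T @ t)

def ridge_gradient_descent(Psi, t, lam, alpha=1e-3, n_iters=50_000):
    """Solution 2: iterate w <- (1 - alpha*lam) w - alpha Psi^T (y - t)."""
    w = np.zeros(Psi.shape[1])
    for _ in range(n_iters):
        y = Psi @ w                                   # current predictions
        w = (1 - alpha * lam) * w - alpha * Psi.T @ (y - t)
    return w
```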
Linear Regression as Maximum Likelihood
We can give linear regression a probabilistic interpretation by assuming a Gaussian noise model:
t∣x∼N(w⊤ψ(x), σ2)
Linear regression is just maximum likelihood under this model:
(1/N) ∑_{i=1}^N log p(t(i) ∣ x(i); w) = (1/N) ∑_{i=1}^N log N(t(i); w⊤ψ(x(i)), σ2)
= (1/N) ∑_{i=1}^N log [ (1/(√(2π) σ)) exp(−(t(i) − w⊤ψ(x(i)))2 / (2σ2)) ]
= const − (1/(2Nσ2)) ∑_{i=1}^N (t(i) − w⊤ψ(x(i)))2
Regularized Linear Regression as MAP Estimation
We can view an L2 regularizer as MAP inference with a Gaussian prior. Recall MAP inference:
arg max_w log p(w ∣ D) = arg max_w [log p(w) + log p(D ∣ w)]
We just derived the likelihood term log p(D ∣ w):
log p(D ∣ w) = const − (1/(2Nσ2)) ∑_{i=1}^N (t(i) − w⊤ψ(x(i)))2
Assume a Gaussian prior, w ∼ N(m, S):
log p(w) = log N(w; m, S)
= log [ (1/((2π)^{D/2} ∣S∣^{1/2})) exp(−(1/2)(w − m)⊤S−1(w − m)) ]
= −(1/2)(w − m)⊤S−1(w − m) + const
Commonly, m = 0 and S = ηI, so
log p(w) = −(1/(2η)) ∥w∥2 + const.
This is just L2 regularization!
Recap: Full Bayesian Inference
Recall: full Bayesian inference makes predictions by averaging over all likely explanations under the posterior distribution.
Compute posterior using Bayes’ Rule:
p(w ∣ D) ∝ p(w)p(D ∣ w)
Make predictions using the posterior predictive distribution:
p(t ∣ x, D) = ∫ p(w ∣ D) p(t ∣ x, w) dw
Doing this lets us quantify our uncertainty.
Bayesian Linear Regression
Prior distribution: w ∼ N(0, S)
Likelihood: t ∣ x, w ∼ N(w⊤ψ(x), σ2)
Assuming fixed/known S and σ2 is a big assumption. More on this later.
Bayesian Linear Regression
Bayesian linear regression considers various plausible explanations for how the data were generated.
It makes predictions using all possible regression weights, weighted by their posterior probability.
Here are samples from the prior p(w) and posteriors p(w ∣ D)
Bayesian Linear Regression: Posterior
Deriving the posterior distribution:
log p(w ∣ D) = log p(w) + log p(D ∣ w) + const
= −(1/2) w⊤S−1w − (1/(2σ2)) ∥Ψw − t∥2 + const
= −(1/2) w⊤S−1w − (1/(2σ2)) (w⊤Ψ⊤Ψw − 2t⊤Ψw + t⊤t) + const
= −(1/2) w⊤(σ−2Ψ⊤Ψ + S−1)w + (1/σ2) t⊤Ψw + const   (complete the square!)
Thus w∣D ∼ N(μ,Σ) where
μ = σ−2ΣΨ⊤t
Σ = (σ−2Ψ⊤Ψ + S−1)−1
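A minimal NumPy sketch of this posterior computation (Psi, t, S, sigma2 are assumed names for the feature matrix, targets, prior covariance, and noise variance):

```python
import numpy as np

def blr_posterior(Psi, t, S, sigma2):
    """Posterior w | D ~ N(mu, Sigma) for Bayesian linear regression."""
    # Sigma = (sigma^{-2} Psi^T Psi + S^{-1})^{-1}
    Sigma = np.linalg.inv(Psi.T @ Psi / sigma2 + np.linalg.inv(S))
    # mu = sigma^{-2} Sigma Psi^T t
    mu = Sigma @ Psi.T @ t / sigma2
    return mu, Sigma
```

With S = λ−1I and σ = 1, the returned mean reduces to the earlier ridge solution (Ψ⊤Ψ + λI)−1Ψ⊤t.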
Bayesian Linear Regression: Posterior
Since a Gaussian prior leads to a Gaussian posterior, this means the Gaussian distribution is the conjugate prior for linear regression!
Compare μ with the closed-form solution for regularized linear regression:
w = (Ψ⊤Ψ + λI)−1Ψ⊤t
This is the mean of the posterior, assuming that S = λ−1I and σ = 1.
λ−1 is the variance of the prior. As this becomes infinite, the mean of the posterior converges to the maximum likelihood solution.
Bayesian Linear Regression
— Bishop, Pattern Recognition and Machine Learning
Bayesian Linear Regression
Example with radial basis function (RBF) features
ψj(x) = exp(−(x − μj)2 / (2s2))
— Bishop, Pattern Recognition and Machine Learning
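For concreteness, a small sketch of this RBF feature map (the centers and width below are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

def rbf_features(x, centers, s):
    """Map 1-D inputs x (shape (N,)) to an (N, J) matrix of RBF features."""
    x = np.asarray(x, dtype=float)[:, None]        # shape (N, 1)
    centers = np.asarray(centers, dtype=float)     # shape (J,)
    return np.exp(-(x - centers) ** 2 / (2 * s ** 2))

# e.g. 9 evenly spaced centers on [0, 1] with width s = 0.1
Psi = rbf_features(np.linspace(0, 1, 100), centers=np.linspace(0, 1, 9), s=0.1)
```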
Bayesian Linear Regression
Functions sampled from the posterior:
— Bishop, Pattern Recognition and Machine Learning
Bayesian Linear Regression
The posterior gives us a distribution over the parameter space, but if we want to make predictions, the natural choice is to use the posterior predictive distribution.
Posterior predictive distribution:
p(t ∣ x, D) = ∫ p(t ∣ x, w) p(w ∣ D) dw,  where p(t ∣ x, w) = N(t; w⊤ψ(x), σ2) and p(w ∣ D) = N(w; μ, Σ)
Another interpretation: t = w⊤ψ(x) + ε, where ε ∼ N(0, σ2) is independent of w ∣ D ∼ N(μ, Σ).
Bayesian Linear Regression
Another interpretation: t = w⊤ψ(x) + ε, where ε ∼ N(0, σ2) is independent of w ∣ D ∼ N(μ, Σ).
By the rules for linear combinations of Gaussian random variables, t ∣ x, D is Gaussian with parameters
μ_pred = μ⊤ψ(x)
σ2_pred = ψ(x)⊤Σψ(x) + σ2
Hence, the posterior predictive distribution is N(t; μ_pred, σ2_pred).
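A minimal sketch of these predictive moments (reusing the assumed names mu, Sigma for the posterior of w and sigma2 for the noise variance):

```python
import numpy as np

def blr_predict(psi_x, mu, Sigma, sigma2):
    """Return (mean, variance) of t | x, D for a single feature vector psi_x."""
    mean_pred = psi_x @ mu                        # mu^T psi(x)
    var_pred = psi_x @ Sigma @ psi_x + sigma2     # psi(x)^T Sigma psi(x) + sigma^2
    return mean_pred, var_pred
```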
Bayesian Linear Regression
Here we visualize confidence intervals based on the posterior predictive mean and variance at each point:
Overview: Probabilistic PCA
The formulation of PCA that we saw earlier in the course was motivated heuristically.
We will show that it can be expressed as the maximum likelihood estimate of a certain probabilistic model.
Recall: PCA
Data set {x(i)}_{i=1}^N
Each input vector x(i) ∈ RD is approximated as μ̂ + Uz(i),
x(i) ≈ x̃(i) = μ̂ + Uz(i)
where μ̂ = (1/N) ∑i x(i) is the data mean, U ∈ RD×K is the orthogonal basis for the principal subspace, and z(i) ∈ RK is the code vector
z(i) = U⊤(x(i) − μ̂)
U is chosen to minimize the reconstruction error
U∗ = argmin_U ∑i ∥x(i) − μ̂ − UU⊤(x(i) − μ̂)∥2
Probabilistic PCA
To formulate probabilistic PCA, let’s start with a latent variable model.
Similar to the Gaussian mixture model, but we will assume continuous, Gaussian latents:
z ∼ N(0,I)
x ∣ z ∼ N (Wz + μ, σ2I)
Note: this is a naive Bayes model, because p(x ∣ z) factorizes with respect to the dimensions of x.
What sort of data does this model produce?
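One way to see what data the model produces is to sample from it by ancestral sampling: draw z ∼ N(0, I), then x ∣ z ∼ N(Wz + μ, σ2I). A minimal sketch (the parameters below are arbitrary illustrations, not fitted values):

```python
import numpy as np

def sample_ppca(W, mu, sigma2, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    D, K = W.shape
    Z = rng.standard_normal((n_samples, K))                       # latent coords
    eps = np.sqrt(sigma2) * rng.standard_normal((n_samples, D))   # isotropic noise
    return Z @ W.T + mu + eps                                     # observed points

X = sample_ppca(W=np.array([[2.0], [1.0]]), mu=np.array([5.0, -1.0]),
                sigma2=0.1, n_samples=500)
# With D = 2 and K = 1, the samples scatter tightly around the line through mu
# in the direction of W's single column.
```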
Probabilistic PCA
z is a random coordinate within the affine space centered at μ and spanned by the columns of W.
To get the random variable x, we sample a standard normal z and then add a small amount of isotropic noise to Wz + μ.
Probabilistic PCA
— Bishop, Pattern Recognition and Machine Learning
Probabilistic PCA : Maximum Likelihood
To perform maximum likelihood in this model, we need to maximize the following:
max_{W,μ,σ2} log p(x ∣ W, μ, σ2) = max_{W,μ,σ2} log ∫ p(x ∣ z, W, μ, σ2) p(z) dz
This was hard for the Gaussian mixture model, but in this case it’s easy.
p(x ∣ W, μ, σ2) will be Gaussian (confirm this), so we just need to compute E[x] and Cov[x].
Probabilistic PCA : Maximum Likelihood
E[x] = E[Wz + μ + ε] = μ
Cov[x] = E[(Wz + ε)(Wz + ε)⊤]
= E[Wzz⊤W⊤] + E[εε⊤]
= WW⊤ + σ2I
Probabilistic PCA : Maximum Likelihood
Thus, the log-likelihood of the data under this model is given by
−(ND/2) log(2π) − (N/2) log∣C∣ − (1/2) ∑_{i=1}^N (x(i) − μ)⊤C−1(x(i) − μ)
where C = WW⊤ + σ2I.
It’s a bit involved to derive the maximum likelihood solution, so we will skip it, but Tipping and Bishop (Probabilistic PCA, 1999) show that this is maximized at the following stationary points.
Probabilistic PCA : Maximum Likelihood
μ̂_MLE = (1/N) ∑_{i=1}^N x(i)
Probabilistic PCA : Maximum Likelihood
Ŵ_MLE = Û_MLE (L̂_MLE − σ̂2_MLE I)^{1/2} R
where UˆMLE is the matrix whose columns are the K unit eigenvectors of the empirical covariance matrix Σˆ that have the largest eigenvalues, LˆMLE ∈ RK×K is the diagonal matrix whose elements are the corresponding eigenvalues, and R is any orthogonal matrix.
Probabilistic PCA : Maximum Likelihood
σ̂2_MLE = (1/(D − K)) ∑_{i=K+1}^{D} λi
where λi is the ith largest eigenvalue of the empirical covariance matrix Σ̂ of the data. In other words, σ̂2_MLE is the average variance in the discarded directions.
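Taken together, these estimates can be computed directly from an eigendecomposition of the empirical covariance. A rough sketch (taking R = I; illustrative, not a reference implementation):

```python
import numpy as np

def ppca_mle(X, K):
    """Closed-form PPCA maximum likelihood fit (Tipping & Bishop, 1999), with R = I.
    X has shape (N, D); returns (W, mu, sigma2)."""
    mu = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)     # empirical covariance
    evals, evecs = np.linalg.eigh(Sigma_hat)           # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]         # sort descending
    U, L = evecs[:, :K], evals[:K]                     # top-K eigenpairs
    sigma2 = evals[K:].mean()                          # average discarded variance
    W = U * np.sqrt(np.maximum(L - sigma2, 0.0))       # U (L - sigma^2 I)^{1/2}
    return W, mu, sigma2
```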
Probabilistic PCA : Maximum Likelihood
That seems complex. To get an intuition about how this model behaves when it is fit to data, let’s consider the MLE density.
Recall that the marginal distribution of x in our fitted model is Gaussian with mean μ̂_MLE and covariance
Ŵ_MLE Ŵ_MLE⊤ + σ̂2_MLE I = Û_MLE (L̂_MLE − σ̂2_MLE I) Û_MLE⊤ + σ̂2_MLE I
The covariance gives us a nice intuition about the type of model this forms.
Probabilistic PCA : Maximum Likelihood
Consider centering the data and checking the variance along one of the unit eigenvectors ui , which are the eigenvectors forming the columns of UˆMLE:
Var(ui⊤(x − μ̂_MLE)) = ui⊤ Cov[x] ui
= ui⊤ Û_MLE (L̂_MLE − σ̂2_MLE I) Û_MLE⊤ ui + σ̂2_MLE
= λi − σ̂2_MLE + σ̂2_MLE = λi
Now, consider centering the data and checking the variance along any unit vector orthogonal to the subspace spanned by UˆMLE:
Var(u⊤(x − μ̂_MLE)) = u⊤ Û_MLE (L̂_MLE − σ̂2_MLE I) Û_MLE⊤ u + σ̂2_MLE
= σ̂2_MLE
In other words, the model captures the variance along the principal axes and approximates the variance in all remaining directions with a single variance.
This is probably easier to visualize after implementing it; see the sketch below.
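For example, a quick numerical check (made-up data, illustrative only) that the fitted covariance ŴŴ⊤ + σ̂2I reproduces the variance along the top-K eigenvectors and uses a single averaged variance elsewhere:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3])

K = 2
Sigma_hat = np.cov(X, rowvar=False, bias=True)
evals, evecs = np.linalg.eigh(Sigma_hat)          # ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]        # sort descending
sigma2 = evals[K:].mean()                         # average discarded variance
W = evecs[:, :K] * np.sqrt(evals[:K] - sigma2)    # U (L - sigma^2 I)^{1/2}, R = I

C = W @ W.T + sigma2 * np.eye(5)                  # model covariance
u_top, u_rest = evecs[:, 0], evecs[:, -1]
print(u_top @ C @ u_top, evals[0])                # ~ equal: top variance kept
print(u_rest @ C @ u_rest, sigma2)                # ~ equal: the rest is averaged
```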
How does it relate to PCA?
The posterior mean is given by
E[z ∣ x] = (Ŵ_MLE⊤ Ŵ_MLE + σ̂2_MLE I)−1 Ŵ_MLE⊤ (x − μ̂_MLE)
So, if we don’t fit σ2 and instead take it to 0, we get
E[z ∣ x] → (Ŵ_MLE⊤ Ŵ_MLE)−1 Ŵ_MLE⊤ (x − μ̂_MLE)
Can show that this is a projection onto the affine space spanned by the columns of Û_MLE.
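A small sketch of this posterior mean (the names follow the earlier sketches and are assumptions, with R = I):

```python
import numpy as np

def ppca_posterior_mean(x, W, mu, sigma2):
    """E[z | x] = (W^T W + sigma^2 I)^{-1} W^T (x - mu)."""
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)
    return np.linalg.solve(M, W.T @ (x - mu))

# As sigma2 -> 0 with W = U L^{1/2} (R = I), this tends to L^{-1/2} U^T (x - mu):
# the PCA code U^T (x - mu), with each coordinate rescaled by its eigenvalue.
```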
Why Probabilistic PCA (PPCA)?
Fitting a full-covariance Gaussian model of data requires D(D + 1)/2 + D parameters. With PPCA we model only the K most significant correlations, and this only requires O(D) parameters as long as K is small.
Basis of the Bayesian treatment of PCA, which gives us a Bayesian method for determining the dimensionality of the principal subspace (i.e. K).
Existence of likelihood functions allows direct comparison with other probabilistic models.
Can use PPCA as a class-conditional density (as in GDA) to reduce the requirement to fit and store O(D2) parameters.
Recap: Gaussian models that we covered
Gaussian discriminant analysis.
▶ Gaussian class-conditional generative model p(x ∣ t) used for classification.
Gaussian mixture model.
▶ Gaussian latent variable model p(x) = ∑z p(x, z) used for clustering.
Bayesian linear regression.
▶ Gaussian discriminative model p(t ∣ x) used for regression with a Bayesian analysis for the weights.
Probabilistic PCA.
▶ Gaussian latent variable model p(x) = ∫z p(x, z) used for dimensionality reduction.
Optional material: Bayesian model selection
Occam’s Razor (optional)
Consider selecting models from a Bayesian perspective.
Related to Occam’s Razor: “Entities should not be multiplied beyond necessity.”
▶ Named after the 14th century British theologian William of Occam
Huge number of attempts to formalize it mathematically
▶ See Domingos, 1999, “The role of Occam’s Razor in knowledge discovery” for a skeptical overview. https://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf
Common misinterpretation: your prior should favor simple explanations
Better interpretation: by averaging over many hypotheses, Bayesian model selection naturally prefers simpler models.
Occam’s Razor (optional)
Suppose you have a finite set of models, or hypotheses {Hi}_{i=1}^M (e.g. polynomials of different degrees)
Posterior inference over models (Bayes’ Rule):
p(Hi ∣ D) ∝ p(Hi) p(D ∣ Hi)
where p(Hi) is the prior and p(D ∣ Hi) is the evidence.
The evidence is also called marginal likelihood since it requires marginalizing out the parameters:
p(D∣Hi) = ∫ p(w∣Hi)p(D∣w,Hi)dw
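For Bayesian linear regression in particular, this integral is available in closed form: with prior w ∼ N(0, S) and noise variance σ2, the targets are jointly Gaussian, t ∼ N(0, ΨSΨ⊤ + σ2I). A sketch with illustrative names (not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(Psi, t, S, sigma2):
    """Log marginal likelihood of targets t under the Bayesian linear model."""
    N = Psi.shape[0]
    cov = Psi @ S @ Psi.T + sigma2 * np.eye(N)
    return multivariate_normal.logpdf(t, mean=np.zeros(N), cov=cov)

# Comparing log_evidence across feature sets (e.g. polynomial degrees) is one
# concrete instance of Bayesian model selection.
```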
Occam’s Razor (optional)
The prior p(Hi) is typically uniform, so we can compare models based on the marginal likelihood.
Bayesian model selection:
H∗ = argmax_H p(D ∣ H)
What types of models does this procedure prefer?
Occam’s Razor (optional)
Suppose M1, M2, and M3 denote a linear, quadratic, and cubic model. M3 is capable of explaining more datasets than M1.
But its distribution over D must integrate to 1, so it must assign lower probability to each of the datasets it can explain.
— Bishop, Pattern Recognition and Machine Learning
Occam’s Razor (optional)
How does the evidence penalize complex models?
[Figure: a sharply peaked posterior of width ∆wposterior around wMAP, compared with a broad prior of width ∆wprior]
Approximating the integral for w ∈ R:
p(D ∣ Hi) = ∫ p(D ∣ w, Hi) p(w ∣ Hi) dw
≃ p(D ∣ wMAP, Hi) · (∆wposterior / ∆wprior)
The first factor is the best-fit likelihood; the second is the Occam factor.
Occam’s Razor (optional)
Let’s investigate
log p(D ∣ Hi) = log p(D ∣ wMAP, Hi) + log(∆wposterior / ∆wprior)
First term represents fit to the data given the most probable parameter values.
Second term is a penalty, because it is negative (∆wposterior < ∆wprior).
Thus if the posterior is very peaked and confident about the data, this penalty term will be very negative.
Occam’s Razor (optional)
For w ∈ RM we have
log p(D ∣ Hi) ≃ log p(D ∣ wMAP, Hi) + M log(∆wposterior / ∆wprior)
So the more parameters we have, the higher the penalty. Optimal model complexity is determined by a tradeoff.
In Bayesian model selection, we naturally prefer simpler models that model the data well.