CS5487 Problem Set 3
Bayesian Parameter Estimation

Antoni Chan
Department of Computer Science
City University of Hong Kong


Misc. Math

Problem 3.1 Gamma function

The gamma function is defined as

    \Gamma(x) = \int_0^\infty u^{x-1} e^{-u} \, du.          (3.1)

Use integration by parts to prove the relation \Gamma(x+1) = x\Gamma(x). Also show that \Gamma(1) = 1. Hence, \Gamma(x+1) = x! when x is a non-negative integer, and we can think of the gamma function as an extension of the factorial to positive real numbers.
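A quick numerical sanity check of these identities (a Python sketch using SciPy's gamma implementation; it does not replace the integration-by-parts proof):

    # Check Gamma(1) = 1, the recurrence Gamma(x+1) = x*Gamma(x), and the
    # factorial connection Gamma(n+1) = n!  (numerical check only).
    import numpy as np
    from math import factorial
    from scipy.special import gamma

    print(gamma(1.0))                                   # 1.0
    for x in [0.5, 1.7, 3.0, 4.2]:
        print(x, np.isclose(gamma(x + 1), x * gamma(x)))
    print([(gamma(n + 1), factorial(n)) for n in range(6)])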
………
MAP Estimation
Problem 3.2 MAP for the exponential density
Let x have an exponential density,

    p(x|\theta) = \begin{cases} \theta e^{-\theta x}, & x \ge 0 \\ 0, & \text{otherwise.} \end{cases}          (3.2)

Assume that the prior distribution of \theta is also exponential,

    p(\theta) = \begin{cases} \lambda e^{-\lambda\theta}, & \theta \ge 0 \\ 0, & \text{otherwise.} \end{cases}          (3.3)

(a) Given a set of i.i.d. samples {x_1, ..., x_N}, show that the MAP estimate for \theta is

    \hat{\theta}_{MAP} = \frac{1}{\bar{x} + \lambda/N},          (3.4)

where \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i.

Now suppose we change the parameterization of the density to \alpha = 1/\theta,

    p(x|\alpha) = \begin{cases} \frac{1}{\alpha} e^{-x/\alpha}, & x \ge 0 \\ 0, & \text{otherwise.} \end{cases}          (3.5)

We might naively assume that the MAP estimator for \alpha is \hat{\alpha} = \bar{x} + \lambda/N. We will show that this is not true!
(b) First, we need to obtain the prior distribution of \alpha that is equivalent to (3.3). This is not as simple as plugging in \theta = 1/\alpha, since \theta is a random variable (we have to preserve the normalization of the pdf). Recall the following rule for obtaining the distribution of a transformation of a random variable: if \theta \sim p_\theta(\theta) and \alpha = g(\theta) is a transformation of the r.v. \theta, then p(\alpha) = \left| \frac{\partial\alpha}{\partial\theta} \right|^{-1} p_\theta(g^{-1}(\alpha)). Use this rule to obtain the prior of \alpha,

    p(\alpha) = \begin{cases} \frac{\lambda}{\alpha^2} e^{-\lambda/\alpha}, & \alpha \ge 0 \\ 0, & \text{otherwise.} \end{cases}          (3.6)

(c) Show that the MAP estimate for \alpha is

    \hat{\alpha}_{MAP} = \frac{N\bar{x} + \lambda}{N + 2}.          (3.7)

Note: this demonstrates that the MAP estimator is not invariant to all transformations!
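The two estimates can be checked numerically by maximizing each log-posterior on a grid. The Python sketch below assumes an arbitrary value for \lambda and simulated data; it only verifies (3.4) and (3.7) and is not the requested derivation:

    # Grid-search check of the MAP estimates (3.4) and (3.7).
    import numpy as np

    rng = np.random.default_rng(0)
    lam, N = 2.0, 20
    x = rng.exponential(scale=1.0 / 1.5, size=N)      # samples with theta = 1.5
    xbar = x.mean()

    # log p(x|theta) + log p(theta), up to constants
    theta = np.linspace(1e-3, 10, 200001)
    logpost_theta = N * np.log(theta) - theta * x.sum() - lam * theta
    print("grid argmax (theta):", theta[np.argmax(logpost_theta)],
          " formula (3.4):", 1.0 / (xbar + lam / N))

    # log p(x|alpha) + log p(alpha), using the transformed prior (3.6)
    alpha = np.linspace(1e-3, 10, 200001)
    logpost_alpha = -N * np.log(alpha) - x.sum() / alpha - 2 * np.log(alpha) - lam / alpha
    print("grid argmax (alpha):", alpha[np.argmax(logpost_alpha)],
          " formula (3.7):", (N * xbar + lam) / (N + 2))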
………
Problem 3.3 Invariance of MAP to linear transformations
Prove that MAP estimators are invariant to linear transformations.
(a) Let p(x|\theta) be a distribution with scalar parameter \theta, which has prior distribution p(\theta). Let the new parameter be \phi = a\theta + b, which is a linear transformation of \theta (and a \ne 0). Show that \hat{\phi}_{MAP} = a\hat{\theta}_{MAP} + b.
(b) Extend this result to vector parameters, i.e., show that for \phi = A\theta + b, where \theta \in \mathbb{R}^d and A is invertible, the MAP estimator is \hat{\phi}_{MAP} = A\hat{\theta}_{MAP} + b.
………
Bayesian Parameter Estimation
Problem 3.4 Bayesian estimation for the Gaussian mean
Let x be a r.v. with a Gaussian distribution, with mean \mu and known variance \sigma^2, and place a Gaussian prior on the mean,

    p(x|\mu) = N(x|\mu, \sigma^2),          (3.8)
    p(\mu) = N(\mu|\mu_0, \sigma_0^2).          (3.9)

Show that the posterior distribution of \mu, given samples D = {x_1, ..., x_n}, is

    p(\mu|D) = N(\mu|\hat{\mu}_n, \hat{\sigma}_n^2),          (3.10)

    \hat{\mu}_n = \frac{n\sigma_0^2 \hat{\mu}_{ML} + \sigma^2 \mu_0}{\sigma^2 + n\sigma_0^2},
    \hat{\sigma}_n^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n\sigma_0^2},

where \hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^n x_i.
Note: the prior and posterior of the mean are both Gaussian distributions. In other words, conditioning on the data yields an updated posterior, which “reproduces” from the original prior (it has the same form). This is an example of a conjugate prior. See Problem 3.9 for more about conjugate priors.
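A short numerical check of (3.10) in Python: since the posterior must be proportional to p(D|\mu) p(\mu), the difference of the log densities should be constant in \mu. The prior and noise values below are arbitrary example choices:

    # Verify that N(mu | mu_n, var_n) from (3.10) matches p(D|mu) p(mu)
    # up to a constant that does not depend on mu.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    sigma, mu0, sigma0, n = 2.0, 0.0, 3.0, 15
    x = rng.normal(loc=1.2, scale=sigma, size=n)
    mu_ml = x.mean()

    mu_n = (n * sigma0**2 * mu_ml + sigma**2 * mu0) / (sigma**2 + n * sigma0**2)
    var_n = (sigma**2 * sigma0**2) / (sigma**2 + n * sigma0**2)

    for mu in [-1.0, 0.5, 2.0]:
        diff = (norm.logpdf(mu, mu_n, np.sqrt(var_n))
                - norm.logpdf(x, mu, sigma).sum() - norm.logpdf(mu, mu0, sigma0))
        print(mu, diff)        # same value for every mu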
………
Problem 3.5 Bayesian estimation for the precision of a Gaussian
In this problem we will look at estimating the precision of a Gaussian. (The precision is the inverse of the variance.) Let x be a r.v. with a Gaussian distribution,

    p(x|\lambda) = N(x|\mu, \lambda^{-1}),          (3.11)

where \lambda is the precision and \mu is the known mean.

A Gamma distribution is a distribution over positive real numbers (\lambda > 0),

    Gam(\lambda|a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} e^{-b\lambda},          (3.12)

where a > 0 is called the shape parameter, and b > 0 is the inverse scale parameter (also called the rate parameter). \Gamma(x) is the gamma function (see Problem 3.1, above). The mean and variance of the Gamma distribution are

    E[\lambda] = \frac{a}{b},   var(\lambda) = \frac{a}{b^2}.          (3.13)

(a) Let the prior of \lambda be a Gamma distribution, p(\lambda) = Gam(\lambda|a_0, b_0). Given a dataset D = {x_1, ..., x_n}, show that the posterior of \lambda is also a Gamma distribution,

    p(\lambda|D) = Gam(\lambda|\hat{a}_n, \hat{b}_n),          (3.14)
    \hat{a}_n = a_0 + \frac{n}{2},   \hat{b}_n = b_0 + \frac{n}{2}\sigma^2_{ML},          (3.15)

where \sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 is the ML estimate of the variance. What is the interpretation of the posterior mean E[\lambda|D] in terms of adding "virtual" samples to an equivalent ML estimate?
(b) Using the same Gamma prior, show that the MAP estimate of \lambda is

    \hat{\lambda}_{MAP} = \frac{2(a_0 - 1) + n}{2b_0 + n\sigma^2_{ML}}.          (3.16)

How does the MAP estimate behave when n is small and when n \to \infty?
(c) Now assume a noninformative (and improper) prior, p(\lambda) \propto \frac{1}{\lambda}. Show that the posterior of \lambda is

    p(\lambda|D) = Gam(\lambda|\tilde{a}_n, \tilde{b}_n),          (3.17)
    \tilde{a}_n = \frac{n}{2},   \tilde{b}_n = \frac{n}{2}\sigma^2_{ML},          (3.18)

and that the MAP estimate is

    \tilde{\lambda}_{MAP} = \frac{n - 2}{n\sigma^2_{ML}}.          (3.19)
(d) Discuss the relationship between these estimators, including similarities and differences. How are the estimators related? What are the intuitive interpretations of the posterior means in terms of adding "virtual" samples to an equivalent ML estimate? What is the regularization effect?
(e) So far we have only looked at the estimate of the precision parameter \lambda. Now let's consider the predictive distribution of a novel x. Show that the predictive distribution

    p(x|D) = \int p(x|\lambda) p(\lambda|D) \, d\lambda,          (3.20)

where p(\lambda|D) = Gam(\lambda|a, b) and p(x|\lambda) = N(x|\mu, \lambda^{-1}), can be written as a Student's t distribution,

    p(x|D) = \frac{\Gamma(a + \frac{1}{2})}{\Gamma(a)\sqrt{2\pi b}} \left( 1 + \frac{1}{2b}(x - \mu)^2 \right)^{-(a + \frac{1}{2})}.          (3.21)

(Hint: use Problem 3.1.) The Student's t distribution is a "fat" version of the Gaussian with longer tails, and hence can represent outliers better. The mean and variance are given by (for a > 1)

    E[x|D] = \mu,   var(x|D) = \frac{b}{a - 1}.          (3.22)

Calculate the variance of the predictive distribution using the values of {\hat{a}_n, \hat{b}_n} from the posterior distribution in (a). What is the interpretation in terms of "virtual" samples added to an ML estimate? Plot the predictive distribution for a few values of {a_0, b_0, n, \sigma^2_{ML}}. How does the distribution behave when a_0 is large or when n is large?
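A Python sketch for the plotting request in (e): evaluate (3.21) with the posterior parameters \hat{a}_n, \hat{b}_n from part (a), and print the predictive variance (3.22). The hyperparameter values are example choices only:

    # Plot the Student's t predictive (3.21) for a few hyperparameter settings.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.special import gammaln

    def predictive(x, mu, a, b):
        logp = (gammaln(a + 0.5) - gammaln(a) - 0.5 * np.log(2 * np.pi * b)
                - (a + 0.5) * np.log1p((x - mu)**2 / (2 * b)))
        return np.exp(logp)

    mu, var_ml = 0.0, 1.0
    xs = np.linspace(-6, 6, 500)
    for a0, b0, n in [(1, 1, 2), (1, 1, 50), (20, 20, 2)]:
        a_n, b_n = a0 + n / 2, b0 + n * var_ml / 2      # posterior from part (a)
        plt.plot(xs, predictive(xs, mu, a_n, b_n), label=f"a0={a0}, b0={b0}, n={n}")
        print("predictive variance:", b_n / (a_n - 1))  # (3.22)
    plt.xlabel("x"); plt.ylabel("p(x|D)"); plt.legend(); plt.show()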
………
Problem 3.6 Bayesian estimation for a multivariate Gaussian
Let x be a vector r.v. with a multivariate Gaussian distribution of dimension d, and place a prior on the vector mean \mu (assume that \Sigma is known),

    p(x|\mu) = N(x|\mu, \Sigma),   p(\mu) = N(\mu|\mu_0, \Sigma_0).          (3.23)

(a) Given a set of i.i.d. samples D = {x_1, ..., x_n}, show that the posterior distribution of \mu is

    p(\mu|D) = N(\mu|\hat{\mu}_n, \hat{\Sigma}_n),          (3.24)
    \hat{\mu}_n = \Sigma_0 (\Sigma_0 + \tfrac{1}{n}\Sigma)^{-1} \hat{\mu}_{ML} + \tfrac{1}{n}\Sigma (\Sigma_0 + \tfrac{1}{n}\Sigma)^{-1} \mu_0,          (3.25)
    \hat{\Sigma}_n^{-1} = n\Sigma^{-1} + \Sigma_0^{-1},          (3.26)

where \hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^n x_i. Hint: note that p(\mu|D) \propto p(D|\mu) p(\mu), then examine the exponent and rewrite it into a quadratic form by completing the square.

(b) Show that (3.25) can be rewritten as

    \hat{\mu}_n = A\hat{\mu}_{ML} + (I - A)\mu_0,          (3.27)
    A = \Sigma_0 (\Sigma_0 + \tfrac{1}{n}\Sigma)^{-1}.          (3.28)

What is the intuitive interpretation of this Bayes estimator \hat{\mu}_n?

(c) Show that the predictive distribution is

    p(x_*|D) = N(x_*|\hat{\mu}_n, \hat{\Sigma}_n + \Sigma).

Hint: use Problem 1.9.
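As in Problem 3.4, the stated posterior can be checked numerically in Python: the log of N(\mu|\hat{\mu}_n, \hat{\Sigma}_n), built from (3.26)-(3.28), should differ from \log p(D|\mu) + \log p(\mu) by a constant in \mu. The covariance matrices below are arbitrary example choices:

    # Check (3.24)-(3.28) against Bayes' rule up to a mu-independent constant.
    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    rng = np.random.default_rng(2)
    d, n = 3, 10
    Sigma = np.diag([1.0, 2.0, 0.5])
    Sigma0 = 0.7 * np.eye(d)
    mu0 = np.array([1.0, -1.0, 0.0])
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    mu_ml = X.mean(axis=0)

    A = Sigma0 @ np.linalg.inv(Sigma0 + Sigma / n)                             # (3.28)
    mu_n = A @ mu_ml + (np.eye(d) - A) @ mu0                                   # (3.27)
    Sigma_n = np.linalg.inv(n * np.linalg.inv(Sigma) + np.linalg.inv(Sigma0))  # (3.26)

    for mu in rng.normal(size=(3, d)):
        diff = (mvn.logpdf(mu, mu_n, Sigma_n)
                - mvn.logpdf(X, mu, Sigma).sum() - mvn.logpdf(mu, mu0, Sigma0))
        print(diff)     # constant across the test points mu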
………
Problem 3.7 Bayesian estimation for a Bernoulli distribution
In this problem we will consider Bayesian estimation and prediction for a Bernoulli r.v. Let x be a r.v. with a Bernoulli distribution,
    p(x|\pi) = \pi^x (1 - \pi)^{1-x},          (3.29)

where \pi = P(x = 1) is the parameter.

(a) Let D = {x_1, ..., x_n} be a set of samples. Show that

    p(D|\pi) = \pi^s (1 - \pi)^{n-s},          (3.30)

where s = \sum_{i=1}^n x_i is the sum of the samples (sufficient statistic).

(b) Assume a uniform prior over \pi. Use the identity

    \int_0^1 \pi^m (1 - \pi)^n \, d\pi = \frac{m! \, n!}{(m + n + 1)!},          (3.31)

to show that

    p(\pi|D) = \frac{(n + 1)!}{s!(n - s)!} \pi^s (1 - \pi)^{n-s}.          (3.32)

Plot this density for n = 1 for each value of s.

(c) Given the posterior in (3.32), show that the predictive distribution is

    p(x|D) = \left( \frac{s + 1}{n + 2} \right)^x \left( 1 - \frac{s + 1}{n + 2} \right)^{1-x}.          (3.33)

What is the effective Bayesian estimate of \pi? What is the intuitive explanation in terms of "virtual" samples added to an equivalent MLE estimate?
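A Python sketch for the plotting request in part (b): the posterior (3.32) for n = 1 and each value of s:

    # Plot p(pi|D) from (3.32) for n = 1, s = 0 and s = 1.
    import numpy as np
    import matplotlib.pyplot as plt
    from math import factorial

    def posterior(pi, n, s):
        return factorial(n + 1) / (factorial(s) * factorial(n - s)) \
               * pi**s * (1 - pi)**(n - s)

    pi = np.linspace(0, 1, 200)
    n = 1
    for s in range(n + 1):
        plt.plot(pi, posterior(pi, n, s), label=f"n={n}, s={s}")
    plt.xlabel("pi"); plt.ylabel("p(pi|D)"); plt.legend(); plt.show()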
(d) What is the ML estimate for \pi? What is the MAP estimate for \pi using the uniform prior? Do you see any advantage in favoring one of these estimates over the others? How does that relate to the uniform prior that was assumed for \pi?

(e) Consider two other priors on \pi,

    p_1(\pi) = 2\pi,          (3.34)
    p_0(\pi) = 2 - 2\pi,          (3.35)

for 0 \le \pi \le 1. The prior p_1(\pi) favors a r.v. biased towards 1, while p_0 favors a r.v. biased towards 0. Calculate the MAP estimates using these two priors. What is the effective estimate of \pi using these Bayesian estimates? What is the intuitive explanation in terms of "virtual" samples?
………
Problem 3.8 Bayesian estimation for a multinomial distribution
Consider Problem 2.7 in the previous problem set. Let x be a r.v. such that p(x = j) = \pi_j, j \in \{1, ..., K\}. Given N independent observations from x, define the random vector c = [c_1, ..., c_K]^T, where c_j is the number of times that the observed value is j (i.e., c is the histogram of the sample of observations). We have seen that c has a multinomial distribution,

    p(c|\pi) = \frac{N!}{\prod_{i=1}^K c_i!} \prod_{j=1}^K \pi_j^{c_j},          (3.36)

where \pi = [\pi_1, ..., \pi_K]^T is the probability of each bin. In this problem we are going to compute MAP estimates for this model. Notice that the parameters \pi are probabilities and, therefore, not every prior is acceptable here (we need \pi_j > 0 and \sum_j \pi_j = 1 for the prior to be valid). One distribution over vectors \pi that satisfies this constraint is the Dirichlet distribution,

    p(\pi) = \frac{\Gamma(\sum_{j=1}^K u_j)}{\prod_{j=1}^K \Gamma(u_j)} \prod_{j=1}^K \pi_j^{u_j - 1},

where the u_j are a set of hyperparameters (parameters of the prior) and \Gamma(x) is the gamma function (see Problem 3.1).
(a) Derive the MAP estimator for the parameters \pi using the Dirichlet prior.

(b) Compare this estimator with the ML estimator derived in Problem 2.7. In terms of the ML solution, what is the use of this prior equivalent to?

(c) What is the effect of the prior as the number of samples N increases? Does this make intuitive sense?
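A closed-form answer to part (a) can be checked by maximizing the log-posterior numerically over the probability simplex. The counts and hyperparameters below are arbitrary example values (Python, using SciPy's constrained optimizer):

    # Numerically maximize log p(c|pi) + log p(pi) subject to the simplex
    # constraints, to compare against a derived MAP formula.
    import numpy as np
    from scipy.optimize import minimize

    c = np.array([5.0, 2.0, 1.0])        # example histogram (N = 8, K = 3)
    u = np.array([3.0, 2.0, 2.0])        # example Dirichlet hyperparameters

    def neg_log_post(pi):
        # log-likelihood (3.36) plus log Dirichlet prior, dropping constants
        return -(c @ np.log(pi) + (u - 1.0) @ np.log(pi))

    cons = ({'type': 'eq', 'fun': lambda pi: pi.sum() - 1.0},)
    res = minimize(neg_log_post, x0=np.ones(3) / 3,
                   bounds=[(1e-9, 1.0)] * 3, constraints=cons)
    print("numerical MAP:", res.x)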
………
Problem 3.9 Exponential family and conjugate priors
In this problem we explore the exponential family and conjugate priors. The exponential family is the family of densities of the form
    p(x|\theta) = f(x) g(\theta) e^{\phi(\theta)^T u(x)},          (3.37)

where \theta is the parameter, and f(x), g(\theta), \phi(\theta), and u(x) are fixed functions for a particular probability distribution. g(\theta) is the normalization constant (also called the partition function),

    g(\theta) = \left( \int f(x) e^{\phi(\theta)^T u(x)} \, dx \right)^{-1}.          (3.38)
Given a likelihood p(x|\theta) and a prior p(\theta), when the posterior distribution p(\theta|D) is of the same type as the prior distribution (e.g., both are Gaussians), we call the prior distribution a conjugate prior to the likelihood p(x|\theta). In other words, conditioning on the data yields an updated posterior, which "reproduces" from the original prior (it has the same form). We have already seen an example of a conjugate prior in Problem 3.4, where a Gaussian prior on the mean of a Gaussian likelihood yields a Gaussian posterior of the mean given the data.

In this problem, we will consider the conjugate priors for other likelihood densities.
(a) Show that, for a density in the exponential family, the likelihood of data D = {x_1, ..., x_n} is

    p(D|\theta) \propto \left( \prod_{i=1}^n f(x_i) \right) \exp\left\{ \phi(\theta)^T \sum_{i=1}^n u(x_i) \right\}.          (3.39)

What is the normalization constant? Denote s = \sum_{i=1}^n u(x_i) as the sufficient statistic.
(b) It has been shown that, apart from certain irregular cases, the exponential family is the only family of distributions for which there is a conjugate prior. Show that

    p(\theta|\eta, \nu) = \frac{g(\theta)^\eta e^{\phi(\theta)^T \nu}}{\int g(\theta)^\eta e^{\phi(\theta)^T \nu} \, d\theta}          (3.40)

is a conjugate prior for the exponential family, where {\eta, \nu} are the parameters of the prior distribution. Compute the posterior distribution p(\theta|D). Using the definition of the sufficient statistic s, compare the posterior with the prior density. What is the result of "propagating" the prior through the likelihood function?
(c) Consider the table below. For each row: i) show that the likelihood function in the right column belongs to the exponential family; ii) show that the prior in the left column is a conjugate prior for that likelihood function; iii) compute the posterior p(\theta|D); and iv) interpret the meaning of the sufficient statistic and the "propagation" discussed in (b).
    Prior p(\theta) | Likelihood | p(D|\theta)
    Beta:  p(\theta|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} | Bernoulli | \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i}
    Gamma: p(\theta|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta\theta} | Poisson | \prod_{i=1}^n \frac{\theta^{x_i} e^{-\theta}}{x_i!}
    Gamma: p(\theta|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta\theta} | Exponential | \prod_{i=1}^n \theta e^{-\theta x_i}
    Gamma: p(\theta|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta\theta} | Normal (\theta = 1/\sigma^2) | \prod_{i=1}^n \sqrt{\frac{\theta}{2\pi}} \exp\{-\frac{\theta}{2}(x_i - \mu)^2\}

    Table 1: In the case of the normal distribution, \mu is assumed known; the parameter is the precision \theta = 1/\sigma^2.

Note: use the general form of p(\theta|D) you derived in (b)!
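One way to check conjugacy numerically for the first row (Beta prior, Bernoulli likelihood): compute the unnormalized log-posterior on a grid and fit it with the assumed Beta form (\alpha'-1)\log\theta + (\beta'-1)\log(1-\theta) + const; the fit should be essentially exact. All values below are example choices (Python sketch):

    # Fit the grid log-posterior with a Beta-shaped function to confirm
    # numerically that the Beta prior is conjugate to the Bernoulli likelihood.
    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(3)
    a0, b0, n = 2.0, 3.0, 25
    x = (rng.random(n) < 0.7).astype(float)
    s = x.sum()

    theta = np.linspace(0.01, 0.99, 99)
    log_post = (beta.logpdf(theta, a0, b0)                          # log prior
                + s * np.log(theta) + (n - s) * np.log(1 - theta))  # log p(D|theta)

    A = np.column_stack([np.log(theta), np.log(1 - theta), np.ones_like(theta)])
    coef, *_ = np.linalg.lstsq(A, log_post, rcond=None)
    print("fitted Beta parameters:", coef[0] + 1, coef[1] + 1)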
(d) Repeat the steps of (c) for the distributions of Problem 3.8, i.e., the multinomial as the likelihood function and the Dirichlet as the prior.
………
Bayesian Regression
Problem 3.10 Bayesian regression with Gaussian prior
In the last problem set, we showed that various forms of linear regression by the method of least squares are really just particular cases of ML estimation under the model
    y = \Phi^T \theta + \epsilon,          (3.41)

where \theta = [\theta_1, ..., \theta_D]^T is the parameter vector, y = [y_1, ..., y_n]^T is the vector of outputs, {x_1, ..., x_n} is the set of corresponding inputs, \phi(x_i) is a feature transformation, with

    \Phi = [\phi(x_1), ..., \phi(x_n)],          (3.42)

and \epsilon = [\epsilon_1, ..., \epsilon_n]^T is a normal random process, \epsilon \sim N(0, \Sigma), with some covariance matrix \Sigma.

It seems only natural to consider the Bayesian extension of this model. For this, we simply extend the model with a Gaussian prior on the parameters,

    p(\theta) = N(\theta|0, \Gamma),

where \Gamma is the prior covariance matrix. We will first derive a general result (for generic covariance matrices \Sigma and \Gamma), and then show how it relates to other methods.
(a) Given a training set D = {(x_1, y_1), ..., (x_n, y_n)}, show that the posterior distribution is

    p(\theta|D) = N(\theta|\hat{\mu}_\theta, \hat{\Sigma}_\theta),          (3.43)
    \hat{\mu}_\theta = (\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T)^{-1} \Phi \Sigma^{-1} y,          (3.44)
    \hat{\Sigma}_\theta = (\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T)^{-1},          (3.45)

where \hat{\mu}_\theta is the posterior mean and \hat{\Sigma}_\theta is the posterior covariance. Do not assume any specific form for the covariance matrices \Sigma and \Gamma. Hint: complete the square (Problem 1.10).
(b) Consider the MAP estimate,

    \hat{\theta}_{MAP} = \arg\max_\theta p(\theta|D).          (3.46)

How does it differ from the least-squares estimate and the weighted least-squares estimate? What is the role of the terms that were not present before? Is there any advantage in setting them to anything other than zero?
(c) Consider the case where we assume the prior covariance and observation noise are both i.i.d., \Gamma = \alpha I and \Sigma = \sigma^2 I. Show that the MAP estimate under these assumptions is

    \hat{\theta} = (\Phi \Phi^T + \lambda I)^{-1} \Phi y,          (3.47)

for some \lambda \ge 0. Show that (3.47) is also the solution to the regularized least-squares problem,

    \hat{\theta} = \arg\min_\theta \| y - \Phi^T \theta \|^2 + \lambda \|\theta\|^2.          (3.48)

In various fields, this formulation is also called ridge regression, Tikhonov regression, shrinkage (as in shrinking the weights in the parameter vector), or weight decay.
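A quick numerical check in Python that (3.47) is indeed the minimizer of (3.48); the data and the value of \lambda are example choices, and the relation between \lambda, \alpha, and \sigma^2 is left to the derivation:

    # Compare the closed-form ridge solution (3.47) with a numerical
    # minimizer of the regularized least-squares objective (3.48).
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    D, n, lam = 4, 30, 0.5
    Phi = rng.normal(size=(D, n))                     # columns are phi(x_i)
    y = Phi.T @ rng.normal(size=D) + 0.1 * rng.normal(size=n)

    theta_closed = np.linalg.solve(Phi @ Phi.T + lam * np.eye(D), Phi @ y)   # (3.47)

    obj = lambda th: np.sum((y - Phi.T @ th)**2) + lam * np.sum(th**2)       # (3.48)
    theta_num = minimize(obj, np.zeros(D)).x

    print(np.allclose(theta_closed, theta_num, atol=1e-3))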
Now let's consider the fully Bayesian version of regression, with the same assumptions as in (c), i.e., \Gamma = \alpha I and \Sigma = \sigma^2 I. This formulation is the linear version of Gaussian process regression.

(d) What happens to the mean and covariance of the posterior p(\theta|D) for different values of \alpha and \sigma^2 (e.g., \alpha = 0, \alpha \to \infty, \sigma^2 = 0)?
(e) Given a novel input x_*, show that the predictive distribution of f_* = f(x_*, \theta) is

    p(f_*|x_*, D) = N(f_*|\hat{\mu}_*, \hat{\sigma}_*^2),          (3.49)
    \hat{\mu}_* = \phi(x_*)^T \hat{\mu}_\theta,          (3.50)
    \hat{\sigma}_*^2 = \phi(x_*)^T \hat{\Sigma}_\theta \phi(x_*).          (3.51)

(Hint: see Problem 1.1.) Assuming the same observation noise \sigma^2 as the training set, show that the predictive distribution of y_* is

    p(y_*|x_*, D) = \int p(y_*|x_*, \theta) p(\theta|D) \, d\theta = N(y_*|\hat{\mu}_*, \sigma^2 + \hat{\sigma}_*^2).          (3.52)

Hint: note that p(y_*|x_*, \theta) depends on \theta only through f_* = \phi(x_*)^T \theta. Hence, we can rewrite the integral over \theta as an integral over f_*, while replacing p(\theta|D) with p(f_*|D).
This is the linear version of Gaussian process regression. We will see how to derive the nonlinear (kernel) version in a later problem set.
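A Python sketch of the full pipeline for part (e) with \Gamma = \alpha I and \Sigma = \sigma^2 I: compute the posterior via (3.44)-(3.45), then the predictive moments (3.50)-(3.52). The polynomial feature map and all constants are example choices:

    # Linear Bayesian regression: posterior over theta and predictive at x*.
    import numpy as np

    rng = np.random.default_rng(5)
    alpha, sigma2, n = 2.0, 0.1, 20
    phi = lambda x: np.array([1.0, x, x**2])               # example feature map
    xs = rng.uniform(-1, 1, n)
    Phi = np.stack([phi(x) for x in xs], axis=1)           # D x n
    y = Phi.T @ np.array([0.5, -1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=n)

    D = Phi.shape[0]
    Sigma_post = np.linalg.inv(np.eye(D) / alpha + Phi @ Phi.T / sigma2)   # (3.45)
    mu_post = Sigma_post @ Phi @ y / sigma2                                # (3.44)

    x_star = 0.3
    f_mean = phi(x_star) @ mu_post                                         # (3.50)
    f_var = phi(x_star) @ Sigma_post @ phi(x_star)                         # (3.51)
    print("p(y*|x*,D) = N(%.3f, %.3f)" % (f_mean, sigma2 + f_var))         # (3.52)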
………
Problem 3.11 Estimating hyperparameters with maximum marginal likelihood
In the Bayesian linear regression from Problem 3.10, there are two hyperparameters, \alpha and \sigma^2, that parameterize the prior and likelihood covariance matrices, \Gamma = \alpha I and \Sigma = \sigma^2 I. The observation likelihood (of the vector of observations) and the prior are

    p(y|\theta, X, \sigma^2) = N(y|\Phi^T \theta, \sigma^2 I),          (3.53)
    p(\theta|\alpha) = N(\theta|0, \alpha I),          (3.54)

where the dependence on the hyperparameters is now made explicit.

One way to estimate these hyperparameters is to maximize the marginal likelihood of the training data,

    \{\hat{\alpha}, \hat{\sigma}^2\} = \arg\max_{\alpha, \sigma^2} p(y|X, \sigma^2, \alpha) = \arg\max_{\alpha, \sigma^2} \log p(y|X, \sigma^2, \alpha),          (3.55)

where the marginal likelihood (also called the evidence) is the likelihood of the training data, averaged over all possible parameter values \theta,

    p(y|X, \sigma^2, \alpha) = \int p(y|\theta, X, \sigma^2) p(\theta|\alpha) \, d\theta.          (3.56)

In some sense, (3.55) is selecting the hyperparameters that can best explain the data for all plausible values of the parameter \theta (according to the prior).
(a) Show that the marginal log-likelihood for the above model of Bayesian linear regression is

    \log p(y|X, \sigma^2, \alpha) = -\frac{1}{2\sigma^2} \| y - \Phi^T \hat{\mu}_\theta \|^2 - \frac{1}{2\alpha} \|\hat{\mu}_\theta\|^2 - \frac{1}{2} \log \left| \frac{1}{\alpha} I + \frac{1}{\sigma^2} \Phi \Phi^T \right|
                                    - \frac{n}{2} \log \sigma^2 - \frac{D}{2} \log \alpha - \frac{n}{2} \log(2\pi).          (3.57)

Hint: complete the square.
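The expression (3.57) can be checked numerically against a direct evaluation of the marginal Gaussian N(y|0, \alpha\Phi^T\Phi + \sigma^2 I) implied by (3.56); this does not replace the complete-the-square derivation. All values below are example choices (Python sketch):

    # Compare (3.57) with a direct evaluation of the evidence.
    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    rng = np.random.default_rng(6)
    D, n, alpha, sigma2 = 3, 12, 1.5, 0.2
    Phi = rng.normal(size=(D, n))
    y = rng.normal(size=n)

    # posterior mean of theta (Problem 3.10 with Gamma = alpha*I, Sigma = sigma^2*I)
    A = np.eye(D) / alpha + Phi @ Phi.T / sigma2
    mu = np.linalg.solve(A, Phi @ y / sigma2)

    lhs = (-np.sum((y - Phi.T @ mu)**2) / (2 * sigma2) - mu @ mu / (2 * alpha)
           - 0.5 * np.linalg.slogdet(A)[1] - n / 2 * np.log(sigma2)
           - D / 2 * np.log(alpha) - n / 2 * np.log(2 * np.pi))            # (3.57)
    rhs = mvn.logpdf(y, np.zeros(n), alpha * Phi.T @ Phi + sigma2 * np.eye(n))
    print(lhs, rhs)      # the two values should agree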
(b) What is the effect of each term in (3.57) when changing the hyperparameters \alpha and \sigma^2? In particular, which term is the data-fitting penalty? Which are complexity penalties?
Estimating hyperparameters by maximizing the marginal likelihood is also known as evidence approximation, type-II maximum likelihood, generalized maximum likelihood, or empirical Bayes.
………

Problem 3.12 L1-regularized least-squares (LASSO)
In this problem we will consider a different form of regularized least-squares, which uses the L1-norm to regularize the weights (parameter vector). This is called LASSO (for "least absolute shrinkage and selection operator"), and is a widely studied regression problem. The problem setup is the same as in the previous regression problems, except that we now include a regularization term on the weights based on the L1-norm,

    \hat{\theta} = \arg\min_\theta \frac{1}{2} \| y - \Phi^T \theta \|^2 + \lambda \|\theta\|_1,          (3.58)

where \|\theta\|_1 = \sum_{i=1}^D |\theta_i| is the L1-norm, and \lambda is a hyperparameter that controls its influence. The effect of the L1-norm is to force some of the weights to zero, leading to a parameter vector that is sparse (has few non-zero entries). For example, this could be useful for automatically selecting a subset of features from \phi(x_i), or for controlling the complexity of a polynomial function (by forcing some weights to 0).
(a) Rewrite (3.58) as an MAP estimation problem. What is the prior distribution assumed by LASSO? Plot the Gaussian prior (from the previous problem) and the LASSO prior. How does this explain why LASSO prefers the weights to be close to zero?
There is no closed-form solution to (3.58), and hence an iterative method is needed to perform the optimization. Next, we will rewrite (3.58) into an equivalent quadratic programming (QP) problem, which can be plugged into a standard solver (e.g., quadprog in MATLAB).
(b) First, we rewrite \theta as the difference between two vectors with non-negative entries,

    \theta = \theta^+ - \theta^-,          (3.59)
    \theta^+ \ge 0,   \theta^- \ge 0.          (3.60)

The original optimization problem (3.58) can now be rewritten as

    \hat{\theta} = \arg\min_{\theta^+, \theta^-} \frac{1}{2} \| y - \Phi^T (\theta^+ - \theta^-) \|^2 + \lambda \sum_i |\theta_i^+ - \theta_i^-|,   s.t. \theta^+ \ge 0, \theta^- \ge 0.          (3.61)

Using a bit of optimization theory "magic", we can rewrite (3.61) as

    \hat{\theta} = \arg\min_{\theta^+, \theta^-} \frac{1}{2} \| y - \Phi^T (\theta^+ - \theta^-) \|^2 + \lambda \sum_i (\theta_i^+ + \theta_i^-),   s.t. \theta^+ \ge 0, \theta^- \ge 0.          (3.62)

Why is the optimization problem in (3.62) equivalent to that of (3.61)? Hint: at the optimum, what can we say about the values of the pair \{\theta_i^+, \theta_i^-\}?
(c) Finally, define x = \begin{bmatrix} \theta^+ \\ \theta^- \end{bmatrix}. Show that (3.62) can be rewritten in the standard form of a quadratic program,

    \min_x \frac{1}{2} x^T H x + f^T x   s.t. x \ge 0,          (3.63)

where

    H = \begin{bmatrix} \Phi\Phi^T & -\Phi\Phi^T \\ -\Phi\Phi^T & \Phi\Phi^T \end{bmatrix},   f = \lambda \mathbf{1} - \begin{bmatrix} \Phi y \\ -\Phi y \end{bmatrix},          (3.64)

and \mathbf{1} is the vector of ones. Now we can use a standard QP solver!

Note: there are many customized algorithms for estimating the weights of LASSO, but this is perhaps the easiest to implement, since we can use quadprog in MATLAB.
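A Python counterpart of the quadprog setup (a sketch): build H and f as in (3.64) and solve the nonnegativity-constrained QP. SciPy has no direct quadprog equivalent, so bound-constrained L-BFGS-B is used here as a stand-in solver; the data and \lambda are example values:

    # Solve the LASSO QP (3.63)-(3.64) and recover theta = theta+ - theta-.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    D, n, lam = 5, 40, 2.0
    Phi = rng.normal(size=(D, n))
    theta_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])       # sparse weights
    y = Phi.T @ theta_true + 0.05 * rng.normal(size=n)

    G = Phi @ Phi.T
    H = np.block([[G, -G], [-G, G]])
    f = lam * np.ones(2 * D) - np.concatenate([Phi @ y, -Phi @ y])

    qp_obj = lambda x: 0.5 * x @ H @ x + f @ x
    res = minimize(qp_obj, np.zeros(2 * D), jac=lambda x: H @ x + f,
                   method="L-BFGS-B", bounds=[(0, None)] * (2 * D))
    theta_hat = res.x[:D] - res.x[D:]
    print(np.round(theta_hat, 3))   # approximately sparse, with some shrinkage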
………