CS5487 Problem Set 6
Bayes Decision Theory
Antoni Chan, Department of Computer Science, City University of Hong Kong
Bayes Decision Theory
Problem 6.1 BDR with unbalanced loss function
Consider a two-class problem with $y \in \{0, 1\}$ and measurement $x$, with associated prior distribution $p(y)$ and class-conditional densities $p(x|y)$.
(a) Consider the loss function
$$L(g(x), y) = \begin{cases} 0, & g(x) = y \\ \ell_0, & y = 0 \text{ and } g(x) = 1 \\ \ell_1, & y = 1 \text{ and } g(x) = 0. \end{cases} \tag{6.1}$$
In other words, the loss for misclassification is different for each class. When might this type of loss function be useful? Can you give a real-world example?
(b) Derive the Bayes decision rule (BDR) for y. Write the BDR as a log-likelihood ratio test. What is the threshold?
(c) Explain how the loss values $\ell_0$ and $\ell_1$ influence the threshold.
………
Problem 6.2 BDR for regression
In this problem, we will consider the Bayes decision rule for regression. Suppose we have a regression problem, where $y \in \mathbb{R}$ is the output, $x \in \mathbb{R}^d$ is the input, and we have already learned the distribution $p(y|x)$, which maps the input $x$ to a distribution of outputs $y$. The goal is to select the optimal output $y$ for a given $x$.
(a) Consider the squared-loss function, $L(g(x), y) = (g(x) - y)^2$. Show that the BDR is to decide the conditional mean of $p(y|x)$, or $g^*(x) = \mathbb{E}[y|x]$. In other words, show that $g^*(x)$ minimizes the conditional risk $R(x) = \int L(g(x), y)\, p(y|x)\, dy$.
(b) One generalization of the squared-loss function is the Minkowski loss,
$$L_q(g(x), y) = |g(x) - y|^q. \tag{6.2}$$
Plot the loss function $L_q$ versus $(g(x) - y)$ for values of $q \in \{0.2, 1, 2, 10\}$ and $q \to 0$. Comment on the effect of using different loss functions. (A plotting sketch is given after part (d).)
(c) Show that the BDR for q = 1 is to select the conditional median of p(y|x).
(d) Show that the BDR for $q \to 0$ is to select the conditional mode of $p(y|x)$.
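The following is a small plotting sketch for part (b) (an illustration, not part of the problem statement); it assumes numpy and matplotlib are available and simply overlays $L_q$ for the requested values of $q$, together with the $q \to 0$ limit.
```python
# Minimal sketch: plot the Minkowski loss L_q versus the error e = g(x) - y.
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-2, 2, 401)                 # e = g(x) - y
for q in [0.2, 1, 2, 10]:
    plt.plot(e, np.abs(e) ** q, label=f"q = {q}")
# As q -> 0, |e|^q approaches 1 for e != 0 and 0 at e = 0 (a 0-1 style loss on the error).
plt.plot(e, (np.abs(e) > 0).astype(float), "k--", label="q -> 0 (limit)")
plt.xlabel("g(x) - y"); plt.ylabel("L_q"); plt.ylim(0, 3); plt.legend(); plt.show()
```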
………
Bayes Decision Rule and 0-1 Loss function
Problem 6.3 Noisy channel with unequal priors
In this problem, you will derive the BDR for a noisy channel with unequal priors (an example in lecture). Given a bit $y \in \{0, 1\}$, the transmitter sends a signal of $\mu_0$ for $y = 0$, and $\mu_1$ for $y = 1$. The channel has additive zero-mean Gaussian noise with variance $\sigma^2$, and hence the class-conditional densities for the measurement $x \in \mathbb{R}$ are
$$p(x|y = 0) = \mathcal{N}(x|\mu_0, \sigma^2), \qquad p(x|y = 1) = \mathcal{N}(x|\mu_1, \sigma^2). \tag{6.3}$$
Let the prior probabilities of the transmitted bits be $p(y = 1) = \pi_1$ and $p(y = 0) = \pi_0$. The goal is to recover the bit $y \in \{0, 1\}$ after receiving a noisy measurement $x$.
(a) Show that the Bayes decision rule (BDR) using the 0-1 loss function is given by
$$y^* = \begin{cases} 0, & x < \dfrac{\mu_1 + \mu_0}{2} + \dfrac{\sigma^2}{\mu_1 - \mu_0} \log \dfrac{\pi_0}{\pi_1} \\ 1, & \text{otherwise.} \end{cases} \tag{6.4}$$
(b) Explain the intuitive effect of each term in the above BDR.
(c) Let $\mu_0 = -1$, $\mu_1 = 1$, and $\sigma^2 = 1$. Plot the change in the threshold for different values of $\pi_1$.
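A minimal sketch for part (c), assuming the threshold is the one appearing in (6.4) and using the stated values $\mu_0 = -1$, $\mu_1 = 1$, $\sigma^2 = 1$:
```python
# Sketch: plot the decision threshold from (6.4) as a function of the prior pi_1.
import numpy as np
import matplotlib.pyplot as plt

mu0, mu1, var = -1.0, 1.0, 1.0
pi1 = np.linspace(0.01, 0.99, 99)
pi0 = 1.0 - pi1
thresh = (mu1 + mu0) / 2 + var / (mu1 - mu0) * np.log(pi0 / pi1)

plt.plot(pi1, thresh)
plt.xlabel("pi_1"); plt.ylabel("decision threshold")
plt.show()
```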
.........
Problem 6.4 Coin Tossing
In this problem we will consider the traditional probability scenario of coin tossing. However, we will consider two variations. First, the coin is not fair. Denoting by s the outcome of the coin toss (either H or T) we have
$$p(s = H) = \alpha, \qquad \alpha \in [0, 1]. \tag{6.5}$$
Second, you do not observe the coin directly but have to rely on a friend who reports the outcome of the toss. Unfortunately, your friend is unreliable: he will sometimes report heads when the outcome was tails, and vice versa. Denoting the report by r we have
$$p(r = T|s = H) = \theta_1, \tag{6.6}$$
$$p(r = H|s = T) = \theta_2, \tag{6.7}$$
where $\theta_1, \theta_2 \in [0, 1]$. Your job is, given the report from your friend, to guess the outcome of the toss.
(a) Given that your friend reports heads, what is the optimal decision function in the minimum probability of error sense? That is, when should you guess heads, and when should you guess tails?
(b) Consider the case $\theta_1 = \theta_2$. Can you give an intuitive interpretation to the rule derived in (a)?
(c) You figured out that if you ask your friend to report the outcome of the toss various times, he will produce reports that are statistically independent. You then decide to ask him to report the outcome n times, in the hope that this will reduce the uncertainty. (Note: there is still only one coin toss, but the outcome gets reported n times). What is the new minimum probability of error decision rule?
(d) Consider the case $\theta_1 = \theta_2$ and assume that the report sequence is all heads. Can you give an intuitive interpretation to the rule derived in (c)?
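For a sanity check of whatever rule you derive in (c)-(d), the Monte Carlo sketch below simulates the repeated reports and estimates the error probability of a simple count-based guess; the parameter values and the candidate rule are illustrative assumptions, not the derived BDR.
```python
# Monte Carlo sketch: simulate the unreliable reports and estimate the error probability
# of a candidate decision rule. alpha, theta1, theta2, n, and the rule are illustrative.
import numpy as np

rng = np.random.default_rng(0)
alpha, theta1, theta2, n, trials = 0.7, 0.2, 0.2, 5, 100_000

s = rng.random(trials) < alpha                      # True = the toss is heads
flip = np.where(s[:, None],                         # report errors depend on the true outcome
                rng.random((trials, n)) < theta1,   # heads reported as tails w.p. theta1
                rng.random((trials, n)) < theta2)   # tails reported as heads w.p. theta2
reports_heads = np.where(s[:, None], ~flip, flip)   # True = friend reports heads

# Candidate rule (illustrative): guess heads iff at least half of the n reports are heads.
guess = reports_heads.sum(axis=1) >= n / 2
print("estimated error probability:", np.mean(guess != s))
```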
.........
Problem 6.5 Naive Bayes and discrete variables
For high-dimensional observation spaces, it might be difficult to learn a joint density over the space (e.g., if not enough data is available). One common assumption is to use a "Naive Bayes" model, where we assume that the individual features (dimensions) are conditionally independent given the class,
$$p(x|y = j) = \prod_{i=1}^{d} p(x_i|y = j), \tag{6.8}$$
where $x = [x_1, \cdots, x_d]^T$ is the observation vector, and $x_i$ is an individual feature. While the features are conditionally independent given the class, the features are still dependent in the overall distribution of observations $p(x)$ (similar to a GMM with diagonal covariance matrices).
Let the vector x be a collection of d binary-valued features, i.e.
$$x = [x_1, \cdots, x_d]^T, \qquad x_i \in \{0, 1\}. \tag{6.9}$$
Assume there are $C$ classes, with class variable $y \in \{1, \cdots, C\}$ and prior distribution $p(y = j) = \pi_j$. Now define
$$p_{ij} = p(x_i = 1|y = j), \quad \forall i, j, \tag{6.10}$$
with the features $\{x_i\}$ being conditionally independent given class $y = j$ (this conditional independence assumption is the "Naive Bayes" assumption). The goal is to recover the class $y$ given a measurement $x$.
(a) Interpret in words the meaning of $p_{ij}$.
(b) Show that the class-conditional distributions can be written as
$$p(x|y = j) = \prod_{i=1}^{d} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i}. \tag{6.11}$$
(c) Show that the minimum probability of error is achieved by the following decision rule: Decide $y = j$ if $g_j(x) \geq g_k(x)$ for all $k$, where
$$g_j(x) = \sum_{i=1}^{d} x_i \log \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \log(1 - p_{ij}) + \log \pi_j. \tag{6.12}$$
(d) Now assume there are two classes, $C = 2$, and for convenience let $p_i = p_{i1}$ and $q_i = p_{i2}$. Using the above result, show that the BDR is: Decide $y = 1$ if $g(x) > 0$, and $y = 2$ otherwise, where
$$g(x) = \sum_{i=1}^{d} \left[ x_i \log \frac{p_i}{q_i} + (1 - x_i) \log \frac{1 - p_i}{1 - q_i} \right] + \log \frac{\pi_1}{\pi_2}. \tag{6.13}$$
(e) Show that $g(x)$ can be rewritten as
$$g(x) = \sum_{i=1}^{d} w_i x_i + w_0, \tag{6.14}$$
where
$$w_i = \log \frac{p_i(1 - q_i)}{q_i(1 - p_i)}, \qquad w_0 = \sum_{i=1}^{d} \log \frac{1 - p_i}{1 - q_i} + \log \frac{\pi_1}{\pi_2}. \tag{6.15}$$
This demonstrates that the BDR in this case is a simple linear classifier, where each feature "votes" for the class using its weight $w_i$. What is the interpretation of the formulas for the weights $w_i$ and the offset $w_0$?
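As a quick numerical check of (6.13)-(6.15), the sketch below evaluates the log-posterior ratio directly from the naive Bayes model and compares it with the linear form $\sum_i w_i x_i + w_0$; the random parameters are illustrative, not from the problem.
```python
# Sketch: verify that the linear form (6.14)-(6.15) agrees with the direct log-posterior ratio.
import numpy as np

rng = np.random.default_rng(1)
d = 6
p, q = rng.uniform(0.05, 0.95, d), rng.uniform(0.05, 0.95, d)  # p_i = p(x_i=1|y=1), q_i = p(x_i=1|y=2)
pi1, pi2 = 0.4, 0.6
x = rng.integers(0, 2, d)

# Direct log-posterior ratio: log p(y=1|x) - log p(y=2|x)
log_ratio = (np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
             - np.sum(x * np.log(q) + (1 - x) * np.log(1 - q))
             + np.log(pi1 / pi2))

# Linear form g(x) = sum_i w_i x_i + w_0 with weights from (6.15)
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(pi1 / pi2)
print(log_ratio, w @ x + w0)   # the two numbers should match
```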
………
Gaussian Classifiers
Problem 6.6 Gaussian classifier with common covariance
In this problem, we will derive the BDR for Gaussian classifiers with a common covariance, and interpret the resulting decision boundaries. Let $y \in \{1, \ldots, C\}$ be the classes with prior probabilities $p(y = j) = \pi_j$, and $x \in \mathbb{R}^d$ be the measurement with class-conditional densities that are Gaussian with a shared covariance, $p(x|y = j) = \mathcal{N}(x|\mu_j, \Sigma)$.
(a) Show that the BDR using the 0-1 loss function is:
$$g^*(x) = \arg\max_j g_j(x), \tag{6.16}$$
where the $g_j(x)$ for each class is a linear function of $x$,
$$g_j(x) = w_j^T x + b_j, \tag{6.17}$$
$$w_j = \Sigma^{-1}\mu_j, \qquad b_j = -\tfrac{1}{2}\mu_j^T \Sigma^{-1} \mu_j + \log \pi_j. \tag{6.18}$$
(b) For two classes i and j ($i \neq j$), show that the decision boundary between the two classes (i.e., $g_i(x) = g_j(x)$) is described by a hyperplane,
$$w^T x + b = 0, \tag{6.19}$$
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad b = -\tfrac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1}(\mu_i - \mu_j) + \log \frac{\pi_i}{\pi_j}. \tag{6.20}$$
(c) Finally, show that the hyperplane in (6.19) can be rewritten in the form
$$w^T(x - x_0) = 0, \tag{6.21}$$
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{\mu_i + \mu_j}{2} - \frac{(\mu_i - \mu_j)}{\|\mu_i - \mu_j\|^2_{\Sigma}} \log \frac{\pi_i}{\pi_j}. \tag{6.22}$$
What is the interpretation of $w$ and $x_0$? What is the effect of the priors $\{\pi_i, \pi_j\}$ on $x_0$?
………
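As an optional numerical companion to Problem 6.6, the sketch below builds the two linear discriminants from (6.17)-(6.18) with made-up parameters and checks that the point $x_0$ in (6.22) lies on the decision boundary; it assumes $\|\mu_i - \mu_j\|^2_{\Sigma}$ denotes the Mahalanobis norm $(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)$.
```python
# Sketch with illustrative parameters: check that g_i(x0) = g_j(x0) for x0 from (6.22).
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_i, mu_j = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
pi_i, pi_j = 0.3, 0.7
P = np.linalg.inv(Sigma)

def g(x, mu, pi):
    # linear discriminant g_j(x) = w_j^T x + b_j with w_j, b_j from (6.18)
    return (P @ mu) @ x - 0.5 * mu @ P @ mu + np.log(pi)

diff = mu_i - mu_j
x0 = (mu_i + mu_j) / 2 - diff / (diff @ P @ diff) * np.log(pi_i / pi_j)
print(g(x0, mu_i, pi_i), g(x0, mu_j, pi_j))   # should be (numerically) equal
```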
Problem 6.7 Gaussian classifier with arbitrary covariances
In this problem, we will derive the BDR for Gaussian classifiers with arbitrary covariances, and interpret the resulting decision boundaries. Consider the same setup as Problem 6.6, but with class-conditional densities having individual covariance matrices, $p(x|y = j) = \mathcal{N}(x|\mu_j, \Sigma_j)$.
(a) Show that the BDR using the 0-1 loss function is:
$$g^*(x) = \arg\max_j g_j(x), \tag{6.23}$$
where the $g_j(x)$ for each class is a quadratic function of $x$,
$$g_j(x) = x^T A_j x + w_j^T x + b_j, \tag{6.24}$$
$$A_j = -\tfrac{1}{2}\Sigma_j^{-1}, \qquad w_j = \Sigma_j^{-1}\mu_j, \qquad b_j = -\tfrac{1}{2}\mu_j^T \Sigma_j^{-1} \mu_j - \tfrac{1}{2}\log|\Sigma_j| + \log \pi_j. \tag{6.25}$$
(b) For two classes i and j ($i \neq j$), the decision boundary between the two classes, i.e., $g_i(x) = g_j(x)$, is either a hyperplane, sphere, ellipsoid, two hyperplanes, parabola, or hyperbola (i.e., a conic section). What parameters of the Gaussian class-conditional densities $\{\mu_i, \Sigma_i, \mu_j, \Sigma_j\}$ will give each form?
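To visualize part (b), the sketch below contours the zero level set of $g_i(x) - g_j(x)$ using the quadratic discriminants (6.24)-(6.25) for two illustrative 2-D Gaussians; changing the covariances changes which conic section appears.
```python
# Sketch (illustrative parameters): the zero contour of g_i - g_j is the decision boundary.
import numpy as np
import matplotlib.pyplot as plt

def g(X, mu, Sigma, pi):
    # quadratic discriminant from (6.24)-(6.25), evaluated on an array of points X (..., 2)
    P = np.linalg.inv(Sigma)
    A = -0.5 * P
    w = P @ mu
    b = -0.5 * mu @ P @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi)
    return np.einsum("...i,ij,...j->...", X, A, X) + X @ w + b

mu_i, Sigma_i, pi_i = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5
mu_j, Sigma_j, pi_j = np.array([2.0, 0.0]), np.array([[3.0, 0.0], [0.0, 0.5]]), 0.5

xs, ys = np.meshgrid(np.linspace(-4, 6, 300), np.linspace(-4, 4, 300))
X = np.stack([xs, ys], axis=-1)
plt.contour(xs, ys, g(X, mu_i, Sigma_i, pi_i) - g(X, mu_j, Sigma_j, pi_j), levels=[0.0])
plt.gca().set_aspect("equal"); plt.show()
```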
………
Problem 6.8 Posterior distribution of binary Gaussian classifier
In the previous two problems, we have looked at the log-based version of the BDR with the 0-1 loss function. It is also interesting to look at the original definition of the BDR that is based on posterior distributions,
$$g^*(x) = \arg\max_j p(y = j|x). \tag{6.26}$$
Consider a two-class problem, $y \in \{0, 1\}$, with priors $p(y = 1) = \pi_1$ and $p(y = 0) = \pi_0$. The observation is $x \in \mathbb{R}$, with Gaussian class-conditional densities with equal variance, $p(x|y = j) = \mathcal{N}(x|\mu_j, \sigma^2)$.
(a) Show that the posterior distribution $p(y|x)$ can be written as
$$p(y = 1|x) = \frac{1}{1 + e^{-f(x)}}, \qquad p(y = 0|x) = \frac{1}{1 + e^{f(x)}}, \tag{6.27}$$
where
$$f(x) = \log p(x|y = 1) - \log p(x|y = 0) + \log \frac{\pi_1}{\pi_0}. \tag{6.28}$$
(b) For the Gaussian CCDs assumed above, show that
$$f(x) = \frac{1}{\sigma^2}(\mu_1 - \mu_0)x - \frac{1}{2\sigma^2}(\mu_1^2 - \mu_0^2) + \log \frac{\pi_1}{\pi_0}. \tag{6.29}$$
Hence, the posterior probability of class 1 given $x$ is a sigmoid function of the form
$$\frac{1}{1 + e^{-(ax + b)}}.$$
(c) What is the decision boundary for this classifier?
(d) Plot $p(y = 1|x)$ when $\mu_0 = -1$, $\mu_1 = 1$, and $\sigma^2 = 1$, for different values of $\pi_1$. Also plot the class-conditional densities.
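A plotting sketch for part (d), assuming the stated parameter values and that scipy and matplotlib are available:
```python
# Sketch: plot p(y=1|x) from (6.27)-(6.28) for several priors, plus the class-conditionals.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-5, 5, 500)
mu0, mu1, sigma = -1.0, 1.0, 1.0
for pi1 in [0.1, 0.3, 0.5, 0.7, 0.9]:
    f = np.log(norm.pdf(x, mu1, sigma)) - np.log(norm.pdf(x, mu0, sigma)) + np.log(pi1 / (1 - pi1))
    plt.plot(x, 1 / (1 + np.exp(-f)), label=f"pi_1 = {pi1}")   # posterior p(y=1|x)
plt.plot(x, norm.pdf(x, mu0, sigma), "k--", label="p(x|y=0)")
plt.plot(x, norm.pdf(x, mu1, sigma), "k:", label="p(x|y=1)")
plt.xlabel("x"); plt.legend(); plt.show()
```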
………
Error Bounds
Problem 6.9 Error bounds for classification with Gaussians
In this problem, we will derive error bounds for classification using Gaussian class-conditionals. Consider a two-class problem, with $y \in \{0, 1\}$, priors $p(y = 1) = \pi$ and $p(y = 0) = 1 - \pi$, and conditional densities $p(x|y) = \mathcal{N}(x|\mu_y, \Sigma_y)$.
For the 0-1 loss function, the BDR is to select the class with the highest posterior probability (or minimum probability of error),
$$y^* = \arg\min_j\, [1 - p(y = j|x)] = \arg\max_j\, p(y = j|x). \tag{6.30}$$
Hence, for a given $x$, the probability of error using the BDR is
$$p(\mathrm{error}|x) = \min[p(0|x), p(1|x)]. \tag{6.31}$$
The total probability of error is then obtained by taking the expectation over $x$,
$$p(\mathrm{error}) = \int p(x)\, p(\mathrm{error}|x)\, dx. \tag{6.32}$$
(This is also the Bayes risk.) Because of the discontinuous nature of the decision regions, the integral in (6.32) is almost always difficult to calculate. For a few dimensions, numerical integration methods could be used, but these become less viable for high-dimensional spaces.
In this problem, we will derive a bound on the probability of error, given the above assumptions.
(a) First, verify that the following inequality is true:
$$\min[a, b] \leq a^{\beta} b^{1-\beta}, \quad \text{for } a, b \geq 0 \text{ and } 0 \leq \beta \leq 1. \tag{6.33}$$
Hint: without loss of generality, consider the case $a < b$.
(b) Apply the bound in (6.33) to (6.32) to show that
$$p(\mathrm{error}) \leq (1 - \pi)^{\beta}\, \pi^{1-\beta}\, e^{-k(\beta)}, \tag{6.34}$$
where
$$k(\beta) = -\log \int p(x|0)^{\beta}\, p(x|1)^{1-\beta}\, dx. \tag{6.35}$$
This is called the Chernoff bound. After finding a closed-form expression for $k(\beta)$, we could plug in our parameters and find the value of $\beta$ that minimizes the upper bound $e^{-k(\beta)}$, as illustrated in the figure below. (You don't need to do this.)
[Figure: plot of the upper bound $e^{-k(\beta)}$ versus $\beta$, showing the Chernoff bound (curve minimum at $\beta^*$) and the Bhattacharyya bound (value at $\beta = 0.5$).]
FIGURE 2.18. The Chernoff error bound is never looser than the Bhattacharyya bound. For this example, the Chernoff bound happens to be at β* = 0.66, and is slightly tighter than the Bhattacharyya bound (β = 0.5). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Alternatively, we could simply fix $\beta$, which yields a looser bound. For $\beta = 1/2$, this is called the Bhattacharyya bound.
(c) Consider the case when $\beta = 1/2$, which is the Bhattacharyya bound,
$$p(\mathrm{error}) \leq \sqrt{\pi(1 - \pi)}\, e^{-k(\frac{1}{2})}. \tag{6.36}$$
Show that the exponent term is
$$k(\tfrac{1}{2}) = \frac{1}{8}(\mu_1 - \mu_0)^T \left[\frac{\Sigma_1 + \Sigma_0}{2}\right]^{-1} (\mu_1 - \mu_0) + \frac{1}{2} \log \frac{\left|\frac{\Sigma_1 + \Sigma_0}{2}\right|}{\sqrt{|\Sigma_1||\Sigma_0|}}. \tag{6.37}$$
Hint: note that $p(x|0)^{1/2}\, p(x|1)^{1/2}$ can be rewritten as a scaled Gaussian, and apply Problem 1.9.
(d) Describe an intuitive explanation for each of the terms in (6.36) and (6.37). When is the probability of error large? When is it low? What happens when $\Sigma_1 = \Sigma_0$?
(e) Consider the noisy channel in Problem 6.3. Compute the Bhattacharyya bound when $\pi = 0.5$. Compare the bound to an estimate of the probability of error using numerical integration.
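A sketch for part (e), assuming the Problem 6.3 setup with $\mu_0 = -1$, $\mu_1 = 1$, $\sigma^2 = 1$ (illustrative values) and $\pi = 0.5$; it computes the Bhattacharyya bound from (6.36)-(6.37) and compares it with numerical integration of (6.32):
```python
# Sketch: Bhattacharyya bound vs. numerically integrated probability of error (1-D, equal variances).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu0, mu1, sigma, pi = -1.0, 1.0, 1.0, 0.5

# For equal variances the log-determinant term in (6.37) vanishes,
# leaving k(1/2) = (mu1 - mu0)^2 / (8 sigma^2).
k_half = (mu1 - mu0) ** 2 / (8 * sigma ** 2)
bound = np.sqrt(pi * (1 - pi)) * np.exp(-k_half)

# Numerical integration of p(error) = \int min[pi_0 p(x|0), pi_1 p(x|1)] dx
integrand = lambda x: min((1 - pi) * norm.pdf(x, mu0, sigma), pi * norm.pdf(x, mu1, sigma))
p_error, _ = quad(integrand, -20, 20)

print("Bhattacharyya bound:", bound, " numerical p(error):", p_error)
```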
.........