CS5487 Problem Set 6
Bayes Decision Theory
Antoni Chan
Department of Computer Science
City University of Hong Kong
Bayes Decision Theory
Problem 6.1 BDR with unbalanced loss function
Consider a two-class problem with y ∈ {0, 1} and measurement x, with associated prior distribution
p(y) and class-conditional densities p(x|y).
(a) Consider the loss function:

L(g(x), y) =
    0,   if g(x) = y,
    ℓ₀,  if y = 0 and g(x) = 1,
    ℓ₁,  if y = 1 and g(x) = 0.
In other words, the loss for misclassification is different for each class. When might this type
of loss function be useful? Can you give a real-world example?
(b) Derive the Bayes decision rule (BDR) for y. Write the BDR as a log-likelihood ratio test. What
is the threshold?
(c) Explain how the loss values ℓ₀ and ℓ₁ influence the threshold.
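A minimal numerical sketch for parts (b) and (c), assuming Gaussian class-conditionals and using the threshold log(π₀ℓ₀/(π₁ℓ₁)) on the log-likelihood ratio that follows from the ℓ₀/ℓ₁ labelling above; all parameter values below are illustrative, not part of the problem:

```python
import numpy as np
from scipy.stats import norm

# Illustrative (assumed) setup: Gaussian class-conditionals and priors.
mu0, mu1, var = 0.0, 2.0, 1.0
pi0, pi1 = 0.5, 0.5
l0, l1 = 1.0, 5.0   # losses for misclassifying y=0 and y=1, respectively

def bdr(x):
    """Log-likelihood-ratio test: decide 1 when the ratio exceeds the threshold."""
    llr = norm.logpdf(x, mu1, np.sqrt(var)) - norm.logpdf(x, mu0, np.sqrt(var))
    thresh = np.log(pi0 / pi1) + np.log(l0 / l1)
    return (llr > thresh).astype(int)

x = np.linspace(-3, 5, 9)
print(bdr(x))   # raising l1 relative to l0 lowers the threshold, favouring class 1
```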
. . . . . . . . .
Problem 6.2 BDR for regression
In this problem, we will consider the Bayes decision rule for regression. Suppose we have a regression
problem, where y ∈ ℝ is the output, x ∈ ℝᵈ is the input, and we have already learned the
distribution p(y|x), which maps the input x to a distribution of outputs y. The goal is to select
the optimal output y for a given x.
(a) Consider the squared-loss function, L(g(x), y) = (g(x) − y)². Show that the BDR is to decide the conditional mean of p(y|x), or g*(x) = E[y|x]. In other words, show that g*(x) minimizes the conditional risk

R(x) = ∫ L(g(x), y) p(y|x) dy.
(b) One generalization of the squared-loss function is the Minkowski loss,

L_q(g(x), y) = |g(x) − y|^q. (6.2)

Plot the loss function L_q versus (g(x) − y) for values of q ∈ {0.2, 1, 2, 10} and q → 0 (a plotting sketch is given after part (d)). Comment on the effect of using different loss functions.
(c) Show that the BDR for q = 1 is to select the conditional median of p(y|x).
(d) Show that the BDR for q → 0 is to select the conditional mode of p(y|x).
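A minimal plotting sketch for part (b); the q → 0 curve is approximated with a small value of q (an assumption made purely for visualization):

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-2, 2, 401)          # e = g(x) - y
for q in [0.2, 1, 2, 10]:
    plt.plot(e, np.abs(e) ** q, label=f"q = {q}")

# q -> 0 limit: the loss tends to 0 at e = 0 and 1 elsewhere; approximate with a tiny q.
plt.plot(e, np.abs(e) ** 0.05, "--", label="q -> 0 (approx.)")

plt.xlabel("g(x) - y")
plt.ylabel("Minkowski loss |g(x) - y|^q")
plt.legend()
plt.show()
```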
. . . . . . . . .
Bayes Decision Rule and 0-1 Loss function
Problem 6.3 Noisy channel with unequal priors
In this problem, you will derive the BDR for a noisy channel with unequal priors (an example in
lecture). Given a bit y ∈ {0, 1}, the transmitter sends a signal of µ₀ for y = 0, and µ₁ for y = 1. The channel has additive zero-mean Gaussian noise with variance σ², and hence the class-conditional densities for the measurement x ∈ ℝ are

p(x|y = 0) = N(x|µ₀, σ²),   p(x|y = 1) = N(x|µ₁, σ²). (6.3)

Let the prior probabilities of the transmitted bits be p(y = 1) = π₁ and p(y = 0) = π₀. The goal is to recover the bit y ∈ {0, 1} after receiving a noisy measurement x.
(a) Show that the Bayes decision rule (BDR) using the 0-1 loss function is given by
g*(x) =
    0,  if x < (µ₁ + µ₀)/2 + σ²/(µ₁ − µ₀) log(π₀/π₁),
    1,  otherwise,   (6.4)

(assuming µ₁ > µ₀).

(b) Explain the intuitive effect of each term in the above BDR.

(c) Let µ₁ = 1 and σ² = 1. Plot the change in the threshold for different values of π₁.

. . . . . . . . .

Problem 6.4 Coin Tossing

In this problem we will consider the traditional probability scenario of coin tossing. However, we will consider two variations. First, the coin is not fair. Denoting by s the outcome of the coin toss (either H or T), we have

p(s = H) = α,   α ∈ [0, 1]. (6.5)

Second, you do not observe the coin directly but have to rely on a friend who reports the outcome of the toss. Unfortunately, your friend is unreliable: he will sometimes report heads when the outcome was tails and vice-versa. Denoting the report by r, we have

p(r = T | s = H) = θ₁, (6.6)
p(r = H | s = T) = θ₂,   θ₁, θ₂ ∈ [0, 1]. (6.7)

Your job is, given the report from your friend, to guess the outcome of the toss.

(a) Given that your friend reports heads, what is the optimal decision function in the minimum probability of error sense? That is, when should you guess heads, and when should you guess tails?

(b) Consider the case θ₁ = θ₂. Can you give an intuitive interpretation of the rule derived in (a)?

(c) You figured out that if you ask your friend to report the outcome of the toss various times, he will produce reports that are statistically independent. You then decide to ask him to report the outcome n times, in the hope that this will reduce the uncertainty. (Note: there is still only one coin toss, but the outcome gets reported n times.) What is the new minimum probability of error decision rule?

(d) Consider the case θ₁ = θ₂ and assume that the report sequence is all heads. Can you give an intuitive interpretation of the rule derived in (c)?

. . . . . . . . .

Problem 6.5 Naive Bayes and discrete variables

For high-dimensional observation spaces, it might be difficult to learn a joint density over the space (e.g., if not enough data is available). One common assumption is to use a “Naive Bayes” model, where we assume that the individual features (dimensions) are conditionally independent given the class,

p(x|y = j) = ∏_{i=1}^d p(xᵢ|y = j), (6.8)

where x = [x₁, · · · , x_d]ᵀ is the observation vector, and xᵢ is an individual feature. While the features are conditionally independent given the class, the features are still dependent in the overall distribution of observations p(x) (similar to a GMM with diagonal covariance matrices).

Let the vector x be a collection of d binary-valued features, i.e.,

x₁, · · · , x_d ∈ {0, 1}. (6.9)

Assume there are C classes, with class variable y ∈ {1, · · · , C} and prior distribution p(y = j) = πⱼ. Now define

pᵢⱼ = p(xᵢ = 1|y = j),   ∀i, j, (6.10)

with the features {xᵢ} being conditionally independent given class y = j (this conditional independence assumption is the “Naive Bayes” assumption). The goal is to recover the class y given a measurement x.

(a) Interpret in words the meaning of pᵢⱼ.

(b) Show that the class-conditional distributions can be written as

p(x|y = j) = ∏_{i=1}^d pᵢⱼ^{xᵢ} (1 − pᵢⱼ)^{1−xᵢ}. (6.11)

(c) Show that the minimum probability of error is achieved by the following decision rule: Decide y = j if gⱼ(x) ≥ g_k(x) for all k, where

gⱼ(x) = ∑_{i=1}^d [ xᵢ log pᵢⱼ + (1 − xᵢ) log(1 − pᵢⱼ) ] + log πⱼ. (6.12)

(d) Now assume there are two classes, C = 2, and for convenience let pᵢ = pᵢ₁ and qᵢ = pᵢ₂. Using the above result, show that the BDR is: Decide y = 1 if g(x) > 0, and y = 2 otherwise, where

g(x) = ∑_{i=1}^d [ xᵢ log (pᵢ/qᵢ) + (1 − xᵢ) log ((1 − pᵢ)/(1 − qᵢ)) ] + log (π₁/π₂). (6.13)
(e) Show that g(x) can be rewritten as

g(x) = ∑_{i=1}^d wᵢ xᵢ + w₀, (6.14)

where

wᵢ = log [ pᵢ(1 − qᵢ) / (qᵢ(1 − pᵢ)) ],   w₀ = ∑_{i=1}^d log [(1 − pᵢ)/(1 − qᵢ)] + log (π₁/π₂). (6.15)

This demonstrates that the BDR in this case is a simple linear classifier, where each feature xᵢ “votes” for the class using its weight wᵢ. What is the interpretation of the formulas for the weights wᵢ and the offset w₀?
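To make the linear form in (6.14)–(6.15) concrete, here is a minimal sketch; the parameters pᵢ, qᵢ and the priors are made-up values for illustration only:

```python
import numpy as np

# Made-up parameters for a 2-class Naive Bayes model with d = 3 binary features.
p = np.array([0.8, 0.6, 0.3])   # p_i = p(x_i = 1 | y = 1)
q = np.array([0.2, 0.5, 0.7])   # q_i = p(x_i = 1 | y = 2)
pi1, pi2 = 0.5, 0.5

# Weights and offset of the equivalent linear classifier g(x) = w . x + w0.
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(pi1 / pi2)

def decide(x):
    """Decide y = 1 if g(x) > 0, else y = 2."""
    return 1 if w @ x + w0 > 0 else 2

print(decide(np.array([1, 1, 0])))   # features that favour class 1
print(decide(np.array([0, 0, 1])))   # features that favour class 2
```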
. . . . . . . . .
Gaussian Classifiers
Problem 6.6 Gaussian classifier with common covariance
In this problem, we will derive the BDR for Gaussian classifiers with a common covariance, and
interpret the resulting decision boundaries. Let y ∈ {1, . . . , C} be the classes with prior probabilities p(y = j) = πⱼ, and x ∈ ℝᵈ be the measurement, with class-conditional densities that are Gaussian with a shared covariance, p(x|y = j) = N(x|µⱼ, Σ).
(a) Show that the BDR using the 0-1 loss function is:

g(x)* = argmaxⱼ gⱼ(x), (6.16)

where the gⱼ(x) for each class is a linear function of x,

gⱼ(x) = wⱼᵀx + bⱼ, (6.17)

with

wⱼ = Σ⁻¹µⱼ,   bⱼ = −½ µⱼᵀΣ⁻¹µⱼ + log πⱼ. (6.18)
(b) For two classes i and j (i ≠ j), show that the decision boundary between the two classes (i.e., gᵢ(x) = gⱼ(x)) is described by a hyperplane,

wᵀx + b = 0, (6.19)

where

w = Σ⁻¹(µᵢ − µⱼ),   b = −½ (µᵢᵀΣ⁻¹µᵢ − µⱼᵀΣ⁻¹µⱼ) + log (πᵢ/πⱼ). (6.20)
(c) Finally, show that the hyperplane in (6.19) can be rewritten in the form

wᵀ(x − x₀) = 0, (6.21)

where

w = Σ⁻¹(µᵢ − µⱼ),   x₀ = ½(µᵢ + µⱼ) − (µᵢ − µⱼ) log(πᵢ/πⱼ) / [(µᵢ − µⱼ)ᵀΣ⁻¹(µᵢ − µⱼ)]. (6.22)

What is the interpretation of w and x₀? What is the effect of the priors {πⱼ}?
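A minimal numerical sketch of the linear discriminants gⱼ(x) in part (a); the means, shared covariance, and priors are made-up values for illustration only:

```python
import numpy as np

# Made-up 2D example: C = 3 classes with a shared covariance.
mus = [np.array([0., 0.]), np.array([2., 0.]), np.array([0., 2.])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.5, 0.3, 0.2]
Sigma_inv = np.linalg.inv(Sigma)

def g(x, j):
    """Linear discriminant g_j(x) = w_j . x + b_j with shared covariance."""
    w = Sigma_inv @ mus[j]
    b = -0.5 * mus[j] @ Sigma_inv @ mus[j] + np.log(priors[j])
    return w @ x + b

x = np.array([1.0, 1.0])
print(max(range(3), key=lambda j: g(x, j)))   # BDR: pick the class with largest g_j(x)
```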
. . . . . . . . .
Problem 6.7 Gaussian classifier with arbitrary covariances
In this problem, we will derive the BDR for Gaussian classifiers with arbitrary covariances, and
interpret the resulting decision boundaries. Consider the same setup as Problem 6.6, but with class
conditional densities with individual covariance matrices, p(x|y = j) = N(x|µⱼ, Σⱼ).
(a) Show that the BDR using the 0-1 loss function is:

g(x)* = argmaxⱼ gⱼ(x), (6.23)

where the gⱼ(x) for each class is a quadratic function of x,

gⱼ(x) = −½ (x − µⱼ)ᵀΣⱼ⁻¹(x − µⱼ) − ½ log |Σⱼ| + log πⱼ, (6.24)

which can be expanded as

gⱼ(x) = xᵀWⱼx + wⱼᵀx + bⱼ,  with  Wⱼ = −½ Σⱼ⁻¹,  wⱼ = Σⱼ⁻¹µⱼ,  bⱼ = −½ µⱼᵀΣⱼ⁻¹µⱼ − ½ log |Σⱼ| + log πⱼ. (6.25)
(b) For two classes i and j (i ≠ j), the decision boundary between the two classes, i.e., gᵢ(x) = gⱼ(x), is either a hyperplane, sphere, ellipsoid, two hyperplanes, parabola, or hyperbola (i.e., a conic section). What parameters of the Gaussian class-conditional densities {µⱼ, Σⱼ} will give each form?
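A minimal sketch of the quadratic discriminant in part (a), with made-up means and per-class covariances; changing the relative shapes of the Σⱼ is a quick way to explore the boundary types asked about in (b):

```python
import numpy as np

# Made-up 2D example: two classes with different covariances.
mus = [np.array([0., 0.]), np.array([2., 0.])]
Sigmas = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.5], [0.5, 0.5]])]
priors = [0.5, 0.5]

def g(x, j):
    """Quadratic discriminant: log N(x | mu_j, Sigma_j) + log pi_j (up to a constant)."""
    d = x - mus[j]
    Sinv = np.linalg.inv(Sigmas[j])
    return (-0.5 * d @ Sinv @ d
            - 0.5 * np.log(np.linalg.det(Sigmas[j]))
            + np.log(priors[j]))

x = np.array([1.0, 0.5])
print(0 if g(x, 0) > g(x, 1) else 1)   # BDR decision for the test point
```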
. . . . . . . . .
Problem 6.8 Posterior distribution of binary Gaussian classifier
In the previous 2 problems, we have looked at the log-based version of the BDR with the 0-1 loss
function. It is also interesting to look at the original definition of the BDR that is based on posterior
distributions,
g(x)* = argmaxⱼ p(y = j|x). (6.26)

Consider a two-class problem, y ∈ {0, 1}, with priors p(y = 1) = π₁ and p(y = 0) = π₀. The observation is x ∈ ℝ, with Gaussian class-conditional densities with equal variance, p(x|y = j) = N(x|µⱼ, σ²).
(a) Show that the posterior distribution p(y|x) can be written as

p(y = 1|x) = 1 / (1 + e^{−f(x)}),   p(y = 0|x) = 1 / (1 + e^{f(x)}), (6.27)

where

f(x) = log p(x|y = 1) − log p(x|y = 0) + log (π₁/π₀). (6.28)

(b) For the Gaussian CCDs assumed above, show that

f(x) = wx + b,  with  w = (µ₁ − µ₀)/σ²,  b = −(µ₁² − µ₀²)/(2σ²) + log (π₁/π₀). (6.29)

Hence, the posterior probability of class 1 given x is a sigmoid function of the form 1/(1 + e^{−(wx+b)}).
(c) What is the decision boundary for this classifier?
(d) Plot p(y = 1|x) when µ₁ = 1 and σ² = 1, for different values of π₁. Also plot the class-conditional densities.
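A minimal plotting sketch for part (d); µ₁ = 1 and σ² = 1 are from the problem, while µ₀ = 0 and the particular set of π₁ values are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu1, var = 1.0, 1.0
mu0 = 0.0                      # assumption: mu0 chosen only for this illustration
x = np.linspace(-4, 5, 500)

for pi1 in [0.1, 0.3, 0.5, 0.7, 0.9]:
    pi0 = 1 - pi1
    f = (norm.logpdf(x, mu1, np.sqrt(var))
         - norm.logpdf(x, mu0, np.sqrt(var))
         + np.log(pi1 / pi0))
    plt.plot(x, 1 / (1 + np.exp(-f)), label=f"pi1 = {pi1}")

# Also show the class-conditional densities.
plt.plot(x, norm.pdf(x, mu0, np.sqrt(var)), "k--", label="p(x|y=0)")
plt.plot(x, norm.pdf(x, mu1, np.sqrt(var)), "k:", label="p(x|y=1)")
plt.xlabel("x"); plt.ylabel("p(y=1|x)"); plt.legend(); plt.show()
```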
. . . . . . . . .
Error Bounds
Problem 6.9 Error bounds for classification with Gaussians
In this problem, we will derive error bounds for classification using Gaussian class-conditionals.
Consider a two-class problem, with y ∈ {0, 1}, priors p(y = 1) = π and p(y = 0) = 1 − π, and conditional densities p(x|y = j) = N(x|µⱼ, Σⱼ).
For the 0-1 loss function, the BDR is to select the class with highest posterior probability
(or minimum probability of error),
y* = argminⱼ [1 − p(y = j|x)] = argmaxⱼ p(y = j|x). (6.30)
Hence, for a given x, the probability of error using the BDR is

p(error|x) = min[p(0|x), p(1|x)]. (6.31)

The total probability of error is then obtained by taking the expectation over x,

p(error) = ∫ p(x) p(error|x) dx. (6.32)
(This is also the Bayes risk). Because of the discontinuous nature of the decision regions, the integral
in (6.32) is almost always difficult to calculate. For a few dimensions, numerical integration methods
could be used, but these become less viable for high dimensional spaces.
In this problem, we will derive a bound on the probability of error, given the above assumptions.

(a) First, verify that the following inequality is true:
min[a, b] ≤ a^β b^{1−β},  for a, b ≥ 0 and 0 ≤ β ≤ 1. (6.33)
Hint: without loss of generality, consider the case a < b.

(b) Apply the bound in (6.33) to (6.32) to show that

p(error) ≤ (1 − π)^β π^{1−β} e^{−k(β)}, (6.34)

where

k(β) = − log ∫ p(x|0)^β p(x|1)^{1−β} dx. (6.35)

This is called the Chernoff bound. After finding a closed-form expression for k(β), we could plug in our parameters and find a value of β that minimizes the upper bound e^{−k(β)}, as illustrated in the figure below. (You don’t need to do this.)

[Figure 2.18 from R. Duda, P. Hart, and D. Stork, Pattern Classification, © 2001 John Wiley & Sons: the Chernoff error bound is never looser than the Bhattacharyya bound; in this example the Chernoff bound occurs at β* = 0.66 and is slightly tighter than the Bhattacharyya bound (β = 0.5).]

Alternatively, we could simply fix β, which yields a looser bound. For β = 1/2, this is called the Bhattacharyya bound.

(c) Consider the case when β = 1/2, which is the Bhattacharyya bound,

p(error) ≤ √(π(1 − π)) e^{−k(1/2)}. (6.36)

Show that the exponent term is

k(1/2) = (1/8) (µ₁ − µ₀)ᵀ [ (Σ₀ + Σ₁)/2 ]⁻¹ (µ₁ − µ₀) + (1/2) log ( |(Σ₀ + Σ₁)/2| / √(|Σ₀||Σ₁|) ). (6.37)

Hint: note that p(x|0)^{1/2} p(x|1)^{1/2} can be rewritten as a scaled Gaussian, and apply Problem 1.9.

(d) Describe an intuitive explanation for each of the terms in (6.36) and (6.37). When is the probability of error large? When is it low? What happens when Σ₀ = Σ₁?

(e) Consider the noisy channel in Problem 6.3. Compute the Bhattacharyya bound when π = 0.5. Compare the bound to an estimate of the probability of error using numerical integration.

. . . . . . . . .
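A minimal numerical sketch for part (e); the channel parameters µ₀, µ₁, σ² are left symbolic in Problem 6.3, so the values below are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed channel parameters (the problem leaves them symbolic).
mu0, mu1, sigma2 = 0.0, 2.0, 1.0
pi = 0.5
sigma = np.sqrt(sigma2)

# Bhattacharyya bound: sqrt(pi*(1-pi)) * exp(-k(1/2)); for equal covariances
# k(1/2) reduces to (mu1 - mu0)^2 / (8 * sigma^2).
k_half = (mu1 - mu0) ** 2 / (8 * sigma2)
bound = np.sqrt(pi * (1 - pi)) * np.exp(-k_half)

# Numerically integrate the true error: integral of min[pi0*p(x|0), pi1*p(x|1)] dx.
integrand = lambda x: min((1 - pi) * norm.pdf(x, mu0, sigma),
                          pi * norm.pdf(x, mu1, sigma))
p_error, _ = quad(integrand, -10, 12)

print(f"Bhattacharyya bound: {bound:.4f}")
print(f"numerical p(error):  {p_error:.4f}")
```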