CSE475 Midterm 1, Due: 11:59 pm, Friday 03/05/2021
Please note that you are encouraged to typeset your solutions using either LaTeX or Microsoft Word and produce a PDF file for submission. Alternatively, you can scan your handwritten solutions and produce a PDF file. Unreadable/illegible solutions will not be graded. You need to submit an electronic version (in PDF form) on Canvas. You should name your file using the format CSE475-Midterm1-LastName-FirstName.
1. [10pts] True-or-False questions. Please answer either “T” (True) or “F” (False) for the following statements.
[T] The probability density function of a continuous random variable can be greater than 1.
[F] The decision boundary of the K Nearest Neighbors classifier is always linear.
[T] Different distance measures (such as Euclidean distance, L1 distance, etc.) lead to different decision boundaries for the K-nearest neighbor classifier.
[F] Classification accuracy can be used to evaluate a Linear Regression model.
Grading: 2.5pts for each question.
2. [10pts] For the Naïve Bayes algorithm, how many parameters are needed for a binary classification problem with 4 categorical features that can take 3 possible values each [example: Price – $, $$, $$$]?
Hint: The parameters of the Naïve Bayes algorithm are the likelihoods and the class prior probability, which are estimated from training data.
Solution: There are
2 (two classes) × 2 (two free parameters for each feature and each class, since each feature takes 3 values whose probabilities sum to 1) × 4 (four features) + 1 (one free class prior for a binary problem) = 17 parameters, which are needed for the Naïve Bayes algorithm.
Grading: 2pts for stating that one parameter is needed for the class prior. 8pts for stating that 16 parameters are needed for likelihoods. 3pts of partial credit for stating that 26 parameters are needed (the count when redundant parameters are not removed).
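Remark: as a quick sanity check of this counting argument, here is a minimal Python sketch (the helper name and its interface are illustrative, not part of the problem):

```python
# Hypothetical helper that counts the minimal number of Naive Bayes parameters
# for categorical features (an illustrative sketch, not required by the problem).
def naive_bayes_param_count(n_classes, values_per_feature):
    # Class prior: n_classes - 1 free parameters (the priors sum to 1).
    prior = n_classes - 1
    # Likelihoods: for each class and feature, (n_values - 1) free parameters.
    likelihoods = sum(n_classes * (v - 1) for v in values_per_feature)
    return prior + likelihoods

# Binary classification, 4 features, 3 possible values each.
print(naive_bayes_param_count(2, [3, 3, 3, 3]))  # -> 17
```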
3. [15pts] A student is late for class. It could be because the bus arrived late, or the student overslept. The reason for oversleeping could be that the alarm did not ring. There are four random variables involved in this problem: 𝐴 (Alarm ring), 𝑂 (Overslept), 𝐵 (Bus Late), 𝐿 (Late for Class). Each of them can take the value 1 (corresponding to “Yes”) or 0 (corresponding to “No”).
[1] Construct a Bayesian Network (BN) for the above problem with the four random variables as nodes (please plot the BN).
[2] Please factorize the joint probability 𝑃(𝐴, 𝑂, 𝐵, 𝐿) using the structure of the BN constructed in part [1].
[3] We want to estimate the joint probability 𝑃(𝐴, 𝑂, 𝐵, 𝐿) for all possible values of 𝐴, 𝑂, 𝐵, 𝐿. Please list all the parameters that are needed. Hint: Parameters of a BN are prior probabilities and conditional probabilities. Please list a minimum set of necessary parameters which do not include parameter(s) that can be deduced from other parameters.
[4] Write an expression using the parameters from part [3] that represents the probability of reaching the class on time if the student overslept.
Solution: [1] The BN has the four random variables as nodes, with directed edges 𝐴 → 𝑂, 𝑂 → 𝐿, and 𝐵 → 𝐿. [2pts]
[2] 𝑃(𝐴, 𝑂, 𝐵, 𝐿) = 𝑃(𝐴)𝑃(𝐵)𝑃(𝑂|𝐴)𝑃(𝐿|𝐵, 𝑂). [3pts]
[3] Because 𝑃(𝐴, 𝑂, 𝐵, 𝐿) = 𝑃(𝐴)𝑃(𝐵)𝑃(𝑂|𝐴)𝑃(𝐿|𝐵, 𝑂) by part [2], all the parameters needed for estimating the joint probability 𝑃(𝐴, 𝑂, 𝐵, 𝐿) for all possible values of 𝐴, 𝑂, 𝐵, 𝐿 are as follows:
𝑃(𝐴 = 1), 𝑃(𝐵 = 1), 𝑃(𝑂 = 1|𝐴 = 1), 𝑃(𝑂 = 1|𝐴 = 0), 𝑃(𝐿 = 1|𝐵 = 1, 𝑂 = 1),
𝑃(𝐿 = 1|𝐵 = 1, 𝑂 = 0), 𝑃(𝐿 = 1|𝐵 = 0, 𝑂 = 1), 𝑃(𝐿 = 1|𝐵 = 0, 𝑂 = 0). [0.5pt for each parameter, 4pts in total]
[4] The probability of reaching the class on time if the student overslept is 𝑃(𝐿 = 0|𝑂 = 1). We have
𝑃(𝐿 = 0|𝑂 = 1) = 1 − 𝑃(𝐿 = 1|𝑂 = 1) = 1 − 𝑃(𝐿 = 1, 𝑂 = 1)/𝑃(𝑂 = 1). [1pt]
For 𝑃(𝐿 = 1, 𝑂 = 1)/𝑃(𝑂 = 1), the denominator is
𝑃(𝑂 = 1) = 𝑃(𝑂 = 1, 𝐴 = 1) + 𝑃(𝑂 = 1, 𝐴 = 0) = 𝑃(𝑂 = 1|𝐴 = 1)𝑃(𝐴 = 1) + 𝑃(𝑂=1|𝐴=0)(1−𝑃(𝐴=1)) [1pt]
The numerator is
𝑃(𝐿 = 1, 𝑂 = 1) = 𝑃(𝐴 = 1, 𝑂 = 1, 𝐵 = 1, 𝐿 = 1) + 𝑃(𝐴 = 1, 𝑂 = 1, 𝐵 = 0, 𝐿 = 1) + 𝑃(𝐴 = 0, 𝑂 = 1, 𝐵 = 1, 𝐿 = 1) + 𝑃(𝐴 = 0, 𝑂 = 1, 𝐵 = 0, 𝐿 = 1)
= 𝑃(𝐴 = 1)𝑃(𝐵 = 1)𝑃(𝑂 = 1|𝐴 = 1)𝑃(𝐿 = 1|𝐵 = 1, 𝑂 = 1) + 𝑃(𝐴 = 1)(1 − 𝑃(𝐵 = 1))𝑃(𝑂 = 1|𝐴 = 1)𝑃(𝐿 = 1|𝐵 = 0, 𝑂 = 1) + (1 − 𝑃(𝐴 = 1))𝑃(𝐵 = 1)𝑃(𝑂 = 1|𝐴 = 0)𝑃(𝐿 = 1|𝐵 = 1, 𝑂 = 1) + (1 − 𝑃(𝐴 = 1))(1 − 𝑃(𝐵 = 1))𝑃(𝑂 = 1|𝐴 = 0)𝑃(𝐿 = 1|𝐵 = 0, 𝑂 = 1)
[3pts] Therefore,
𝑃(𝐿 = 0|𝑂 = 1) = 1 − 𝑃(𝐿 = 1, 𝑂 = 1)/𝑃(𝑂 = 1)
= 1 − {𝑃(𝐴 = 1)𝑃(𝐵 = 1)𝑃(𝑂 = 1|𝐴 = 1)𝑃(𝐿 = 1|𝐵 = 1, 𝑂 = 1) + 𝑃(𝐴 = 1)(1 − 𝑃(𝐵 = 1))𝑃(𝑂 = 1|𝐴 = 1)𝑃(𝐿 = 1|𝐵 = 0, 𝑂 = 1) + (1 − 𝑃(𝐴 = 1))𝑃(𝐵 = 1)𝑃(𝑂 = 1|𝐴 = 0)𝑃(𝐿 = 1|𝐵 = 1, 𝑂 = 1) + (1 − 𝑃(𝐴 = 1))(1 − 𝑃(𝐵 = 1))𝑃(𝑂 = 1|𝐴 = 0)𝑃(𝐿 = 1|𝐵 = 0, 𝑂 = 1)} / {𝑃(𝑂 = 1|𝐴 = 1)𝑃(𝐴 = 1) + 𝑃(𝑂 = 1|𝐴 = 0)(1 − 𝑃(𝐴 = 1))}
[1pt]
Grading: the full credit for part [1-4] is 2pts, 3pts, 4pts, and 6pts respectively.
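Remark: the expression in part [4] can be checked numerically. The sketch below plugs in made-up values for the parameters listed in part [3] (all numbers are illustrative assumptions) and evaluates 𝑃(𝐿 = 0|𝑂 = 1):

```python
# Illustrative parameter values (assumptions, not given in the problem).
P_A1 = 0.9          # P(A = 1): alarm rings
P_B1 = 0.2          # P(B = 1): bus late
P_O1_A1 = 0.05      # P(O = 1 | A = 1)
P_O1_A0 = 0.8       # P(O = 1 | A = 0)
P_L1 = {            # P(L = 1 | B = b, O = o), keyed by (b, o)
    (1, 1): 0.95, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.05,
}

# Numerator P(L = 1, O = 1): sum the factorized joint over A and B.
num = sum(
    (P_A1 if a else 1 - P_A1)
    * (P_B1 if b else 1 - P_B1)
    * (P_O1_A1 if a else P_O1_A0)
    * P_L1[(b, 1)]
    for a in (0, 1) for b in (0, 1)
)
# Denominator P(O = 1).
den = P_O1_A1 * P_A1 + P_O1_A0 * (1 - P_A1)

print(1 - num / den)  # P(L = 0 | O = 1), here 0.25 for these assumed numbers
```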
4. [10pts] Consider a one-dimensional two-class classification problem, with the class- conditional PDFs for the two classes (𝜔1 and 𝜔2) being the normal densities 𝑁(𝜇1, 𝜎2) and 𝑁(𝜇2, 𝜎2), respectively. Note that these two PDFs have the same variance 𝜎2. 𝑁(𝜇, 𝜎2) denotes the normal density defined by
𝑝(𝑥|𝜇, 𝜎) = (1/(√(2𝜋) 𝜎)) exp{−(𝑥 − 𝜇)²/(2𝜎²)}.
The losses are defined in the same way as in the lecture, and the definition is recapped below
for your convenience: 𝜆𝑖𝑗 is the loss incurred when classifying a sample as 𝜔𝑖 while the true class label is 𝜔𝑗. We let 𝜇1 = −1, 𝜇2 = 1, 𝜎² = 0.5 for the following questions.
[1] If we do minimum-error classification with simple 0-1 losses (i.e., 𝜆11 = 𝜆22 = 0, 𝜆21 = 𝜆12 = 1), the minimum-error-rate classification (by Bayes decision rule) will decide class label 𝜔1 for feature 𝑥 such that 𝑥 < 𝑎, and this is the Bayesian decision boundary. What is the value of 𝑎 if the classes have equal prior probabilities (𝑃(𝜔1) = 𝑃(𝜔2) = 1/2)?
[2] If the classes have equal prior probabilities (𝑃(𝜔1) = 𝑃(𝜔2) = 1/2), but the losses are given by 𝜆11 = 𝜆22 = 0, 𝜆12 = 1, 𝜆21 = 2, the minimum-error-rate classification (by Bayes decision rule) still decides class label 𝜔1 for feature 𝑥 such that 𝑥 < 𝑎. What is the value of 𝑎?
[3] If we use the same losses as part [2]: 𝜆11 = 𝜆22 = 0, 𝜆12 = 1, 𝜆21 = 2, but change the prior to 𝑃(𝜔1) = 1/3, the minimum-error-rate classification (by Bayes decision rule) still decides class label 𝜔1 for feature 𝑥 such that 𝑥 < 𝑎. What is the value of 𝑎?
Solution: [1]
𝑅(𝛼1|𝑥) < 𝑅(𝛼2|𝑥) is equivalent to
𝜆11𝑃(𝜔1|𝑥) + 𝜆12𝑃(𝜔2|𝑥) < 𝜆21𝑃(𝜔1|𝑥) + 𝜆22𝑃(𝜔2|𝑥) [1pt], and it follows that
𝜆12𝑃(𝜔2|𝑥) < 𝜆21𝑃(𝜔1|𝑥)
⇔ 𝜆12𝑃(𝑥|𝜔2)𝑃(𝜔2) < 𝜆21𝑃(𝑥|𝜔1)𝑃(𝜔1)
⇔ 𝜆12 × (1/(√(2𝜋) 𝜎)) exp{−(𝑥 − 𝜇2)²/(2𝜎²)} × 𝑃(𝜔2) < 𝜆21 × (1/(√(2𝜋) 𝜎)) exp{−(𝑥 − 𝜇1)²/(2𝜎²)} × 𝑃(𝜔1) [1pt]
⇔ exp{−(𝑥 − 𝜇2)²/(2𝜎²) + (𝑥 − 𝜇1)²/(2𝜎²)} < 𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2))
⇔ −(𝑥 − 𝜇2)²/(2𝜎²) + (𝑥 − 𝜇1)²/(2𝜎²) < ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2)))
⇔ 2𝑥(𝜇2 − 𝜇1) − 𝜇2² + 𝜇1² < 2𝜎² ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2)))
Substituting 𝜇1, 𝜇2 and 𝜎² with the given values, we have
⇔ 𝑥 < ((2 × 0.5)/4) ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2))) = (1/4) ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2)))   (1)
[1pt]
For part [1], substituting the losses and prior probabilities for the given values in inequality (1), we have 𝑥 < (1/4) ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2))) = (1/4) ln((1 × 1/2)/(1 × 1/2)) = (1/4) ln 1 = 0. Therefore, the minimum-error-rate classification will decide class label 𝜔1 for feature 𝑥 such that 𝑥 < 0, and 𝑎 = 0. [1pt]
[2] Substituting the losses and prior probabilities for the given values in inequality (1), we have 𝑥 < (1/4) ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2))) = (1/4) ln((2 × 1/2)/(1 × 1/2)) = (ln 2)/4. Therefore, the minimum-error-rate classification will decide class label 𝜔1 for feature 𝑥 such that 𝑥 < (ln 2)/4, and 𝑎 = (ln 2)/4 ≈ 0.173. [3pts]
[3] Substituting the losses and prior probabilities for the given values in inequality (1), we have 𝑥 < (1/4) ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2))) = (1/4) ln((2 × 1/3)/(1 × 2/3)) = (1/4) ln 1 = 0. Therefore, the minimum-error-rate classification will decide class label 𝜔1 for feature 𝑥 such that 𝑥 < 0, and 𝑎 = 0. [3pts]
Grading: the full credit for part [1-3] is 4pts, 3pts, and 3pts respectively.
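Remark: the boundary in inequality (1), 2𝑥(𝜇2 − 𝜇1) − 𝜇2² + 𝜇1² = 2𝜎² ln(𝜆21𝑃(𝜔1)/(𝜆12𝑃(𝜔2))), can be solved for 𝑥 and evaluated for the three settings with a few lines of Python (a sketch; the function name is just illustrative):

```python
import math

def bayes_boundary(mu1, mu2, sigma2, prior1, lam12, lam21):
    # Solve 2x(mu2 - mu1) - mu2^2 + mu1^2 = 2*sigma2*ln(lam21*P(w1)/(lam12*P(w2)))
    # for x (valid when mu2 > mu1, as in this problem).
    prior2 = 1 - prior1
    rhs = 2 * sigma2 * math.log(lam21 * prior1 / (lam12 * prior2))
    return (rhs + mu2**2 - mu1**2) / (2 * (mu2 - mu1))

# Part [1]: 0-1 losses, equal priors.
print(bayes_boundary(-1, 1, 0.5, 0.5, 1, 1))    # 0.0
# Part [2]: lambda21 = 2, equal priors.
print(bayes_boundary(-1, 1, 0.5, 0.5, 1, 2))    # ln(2)/4 ~ 0.173
# Part [3]: lambda21 = 2, P(w1) = 1/3.
print(bayes_boundary(-1, 1, 0.5, 1 / 3, 1, 2))  # 0.0
```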
5. [10pts] Consider a one-dimensional two-class classification problem with class-conditional PDFs and class prior probabilities as follows:
𝑝(𝑥|𝜔1) = 4𝑥³ for 𝑥 ∈ [0,1] and 0 otherwise; 𝑝(𝑥|𝜔2) = 6𝑥 − 6𝑥² for 𝑥 ∈ [0,1] and 0 otherwise;
𝑃(𝜔1) = 9/10 (and thus 𝑃(𝜔2) = 1/10). 𝜔1, 𝜔2 are the two class labels.
[1] Specify the condition on 𝑥 (in terms of 𝑥 < 𝑎 or 𝑥 > 𝑎 for a specific 𝑎) such that the minimum-error-rate classification (by Bayes decision rule) will decide class label 𝜔1 for feature 𝑥.
[2] What is the average error using the Bayes decision rule? Note that the average error should be computed by 𝑃(error) = E[𝑃(error|𝑥)] = ∫ 𝑃(error|𝑥) 𝑝(𝑥) d𝑥.
Solution:
[1] The minimum-error-rate classification decides 𝜔1 if 𝑃(𝜔1|𝑥) > 𝑃(𝜔2|𝑥) [1pt]. This is equivalent to 𝑝(𝑥|𝜔1)𝑃(𝜔1) > 𝑝(𝑥|𝜔2)𝑃(𝜔2), or 4𝑥³ ∙ 9/10 > (6𝑥 − 6𝑥²) ∙ 1/10 [1pt]. It follows that 6𝑥³ + 𝑥² − 𝑥 > 0, or
𝑥(6𝑥² + 𝑥 − 1) > 0   (1)
Note that 𝑝(𝑥) = 𝑝(𝑥|𝜔1)𝑃(𝜔1) + 𝑝(𝑥|𝜔2)𝑃(𝜔2), so 𝑝(𝑥) = 0 if 𝑥 ∉ [0,1], and we only need to consider the case 𝑥 ∈ [0,1] in inequality (1). To make (1) hold, 𝑥 ≠ 0, so 𝑥 > 0. Then (1) is equivalent to 6𝑥² + 𝑥 − 1 > 0, or (3𝑥 − 1)(2𝑥 + 1) > 0. [1pt] It follows that 3𝑥 − 1 > 0 ⇒ 𝑥 > 1/3.
Therefore, the minimum-error-rate classification decides 𝜔1 if 𝑥 > 1/3 [1pt].
[2] By part [1], the Bayes decision rule decides 𝜔1 if 𝑥 > 1/3, and it decides 𝜔2 if 𝑥 < 1/3. When it decides 𝜔1, the probability of error is 𝑃(𝑒𝑟𝑟𝑜𝑟|𝑥) = 𝑃(𝜔2|𝑥) [1pt]. If it decides 𝜔2, the probability of error is 𝑃(𝑒𝑟𝑟𝑜𝑟|𝑥) = 𝑃(𝜔1|𝑥) [1pt].
Also, 𝑝(𝑥) = 𝑝(𝑥|𝜔1)𝑃(𝜔1) + 𝑝(𝑥|𝜔2)𝑃(𝜔2), so 𝑝(𝑥) = 0 if 𝑥 ∉ [0,1]. The average error is
𝑃(𝑒𝑟𝑟𝑜𝑟) = ∫_{−∞}^{∞} 𝑃(𝑒𝑟𝑟𝑜𝑟|𝑥)𝑝(𝑥) d𝑥 = ∫_0^1 𝑃(𝑒𝑟𝑟𝑜𝑟|𝑥)𝑝(𝑥) d𝑥 = ∫_0^{1/3} 𝑃(𝑒𝑟𝑟𝑜𝑟|𝑥)𝑝(𝑥) d𝑥 + ∫_{1/3}^1 𝑃(𝑒𝑟𝑟𝑜𝑟|𝑥)𝑝(𝑥) d𝑥
= ∫_0^{1/3} 𝑝(𝑥|𝜔1)𝑃(𝜔1) d𝑥 + ∫_{1/3}^1 𝑝(𝑥|𝜔2)𝑃(𝜔2) d𝑥 = ∫_0^{1/3} 4𝑥³ × 9/10 d𝑥 + ∫_{1/3}^1 (6𝑥 − 6𝑥²) × 1/10 d𝑥
= ∫_0^{1/3} (18𝑥³/5) d𝑥 + (3/5) ∫_{1/3}^1 (𝑥 − 𝑥²) d𝑥 = 1/90 + (3/5) × (1/6 − 7/162) = 1/90 + 2/27 = 23/270 ≈ 0.085
[4pts]
Grading: the full credit for part [1] is 4pts. The full credit for part [2] is 6pts.
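Remark: both the boundary 𝑎 = 1/3 and the average error 23/270 can be verified numerically, e.g. with the following sketch (the grid size is an arbitrary choice):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)             # support of both densities
dx = x[1] - x[0]
p1, p2 = 4 * x**3, 6 * x - 6 * x**2            # p(x|w1), p(x|w2)
P1, P2 = 9 / 10, 1 / 10                        # class priors

decide_w1 = p1 * P1 > p2 * P2                  # Bayes decision rule
print(x[decide_w1].min())                      # ~ 1/3, the boundary a

# P(error) = integral over [0, 1] of min( p(x|w1)P(w1), p(x|w2)P(w2) )
p_error = np.sum(np.minimum(p1 * P1, p2 * P2)) * dx
print(p_error)                                 # ~ 23/270 ~ 0.085
```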
6. [10pts] Given three random variables 𝐴, 𝐵, 𝐶, suppose that 𝑃(𝐴|𝐵) = 𝑃(𝐴|𝐶). Is it always true that 𝑃(𝐵|𝐴) = 𝑃(𝐶|𝐴)? If it is always true, please derive it. Otherwise, please give a counter-example.
Solution: It is not always true that 𝑃(𝐵|𝐴) = 𝑃(𝐶|𝐴). To see this, suppose 𝑃(𝐵|𝐴) = 𝑃(𝐶|𝐴), and let 𝑃(𝐴) ≠ 0. Then
𝑃(𝐵|𝐴) = 𝑃(𝐶|𝐴) ⇒ 𝑃(𝐴, 𝐵)/𝑃(𝐴) = 𝑃(𝐴, 𝐶)/𝑃(𝐴) ⇒ 𝑃(𝐴, 𝐵) = 𝑃(𝐴, 𝐶) ⟹ 𝑃(𝐴|𝐵)𝑃(𝐵) = 𝑃(𝐴|𝐶)𝑃(𝐶)
When 𝑃(𝐴|𝐵) = 𝑃(𝐴|𝐶) ≠ 0, we have 𝑃(𝐵) = 𝑃(𝐶). But it is not always true that 𝑃(𝐵) = 𝑃(𝐶), so it is not always true that 𝑃(𝐵|𝐴) = 𝑃(𝐶|𝐴). Any case in which 𝑃(𝐵) ≠ 𝑃(𝐶) while 𝑃(𝐴|𝐵) = 𝑃(𝐴|𝐶) holds (e.g., 𝑃(𝐵) = 0.1, 𝑃(𝐶) = 0.2, with 𝐴 independent of both 𝐵 and 𝐶) can serve as a counter-example.
Grading: give partial credit to valid argument with statement that 𝑃(𝐵) = 𝑃(𝐶) does not always hold.
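Remark: one concrete counter-example along these lines can be verified by enumeration. The sketch below assumes 𝐴, 𝐵, 𝐶 are mutually independent binary variables with 𝑃(𝐴 = 1) = 0.5, 𝑃(𝐵 = 1) = 0.1, 𝑃(𝐶 = 1) = 0.2 (an assumed joint distribution, chosen only for illustration):

```python
from itertools import product

# Assumption for the counter-example: A, B, C mutually independent.
pA, pB, pC = 0.5, 0.1, 0.2
joint = {(a, b, c): (pA if a else 1 - pA) * (pB if b else 1 - pB) * (pC if c else 1 - pC)
         for a, b, c in product((0, 1), repeat=3)}

def prob(pred):
    # P(event) by summing the joint over outcomes where pred holds.
    return sum(p for abc, p in joint.items() if pred(*abc))

# Premise: P(A=1 | B=1) == P(A=1 | C=1)
pA_given_B = prob(lambda a, b, c: a == 1 and b == 1) / prob(lambda a, b, c: b == 1)
pA_given_C = prob(lambda a, b, c: a == 1 and c == 1) / prob(lambda a, b, c: c == 1)
print(pA_given_B, pA_given_C)   # both ~ 0.5, so the premise holds

# Conclusion fails: P(B=1 | A=1) != P(C=1 | A=1)
pB_given_A = prob(lambda a, b, c: b == 1 and a == 1) / prob(lambda a, b, c: a == 1)
pC_given_A = prob(lambda a, b, c: c == 1 and a == 1) / prob(lambda a, b, c: a == 1)
print(pB_given_A, pC_given_A)   # ~ 0.1 vs ~ 0.2
```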
7. [10pts] Consider the following simple Bayesian Network (BN) of three random variables 𝐴, 𝐵, 𝐶, in which 𝐵 and 𝐶 are the parents of 𝐴 (𝐵 → 𝐴 ← 𝐶). Is it always true that 𝐵, 𝐶 are independent, i.e., 𝑃(𝐵, 𝐶) = 𝑃(𝐵)𝑃(𝐶)? If it is always true, please derive it. Otherwise, please give a counter-example.
Solution: It is always true that 𝐵, 𝐶 are independent, i.e., 𝑃(𝐵, 𝐶) = 𝑃(𝐵)𝑃(𝐶). To see this, the factorization of the joint probability is 𝑃(𝐵, 𝐶, 𝐴) = 𝑃(𝐵)𝑃(𝐶)𝑃(𝐴|𝐵, 𝐶) by the given BN. Summing both sides of this equality, we have
∑𝐴 𝑃(𝐵, 𝐶, 𝐴) = ∑𝐴 𝑃(𝐵)𝑃(𝐶)𝑃(𝐴|𝐵, 𝐶) = 𝑃(𝐵)𝑃(𝐶) ∑𝐴 𝑃(𝐴|𝐵, 𝐶) (1)
where ∑𝐴 𝑃(𝐵, 𝐶, 𝐴) stands for the sum of 𝑃(𝐵, 𝐶, 𝐴) over all possible values of 𝐴. For example, if 𝐴 is a binary random variable taking values in {0,1}, then ∑𝐴 𝑃(𝐵, 𝐶, 𝐴) = 𝑃(𝐵, 𝐶, 𝐴 = 0) + 𝑃(𝐵, 𝐶, 𝐴 = 1). We have ∑𝐴 𝑃(𝐴|𝐵, 𝐶) = 1 and ∑𝐴 𝑃(𝐵, 𝐶, 𝐴) = 𝑃(𝐵, 𝐶), so equation (1) is equivalent to 𝑃(𝐵, 𝐶) = 𝑃(𝐵)𝑃(𝐶), indicating that 𝐵, 𝐶 are independent.
Grading: give partial credit to a valid argument.
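Remark: the independence of 𝐵 and 𝐶 can also be checked numerically for arbitrary CPTs of the BN 𝐵 → 𝐴 ← 𝐶 (the randomly drawn parameter values below are assumptions used only for the check):

```python
import itertools
import random

random.seed(0)
# Arbitrary parameters for the BN B -> A <- C (assumed values, drawn at random).
pB, pC = random.random(), random.random()
pA_given = {(b, c): random.random() for b, c in itertools.product((0, 1), repeat=2)}

def joint(a, b, c):
    # P(B, C, A) = P(B) P(C) P(A | B, C), per the BN factorization.
    pb = pB if b else 1 - pB
    pc = pC if c else 1 - pC
    pa = pA_given[(b, c)] if a else 1 - pA_given[(b, c)]
    return pb * pc * pa

for b, c in itertools.product((0, 1), repeat=2):
    p_bc = sum(joint(a, b, c) for a in (0, 1))   # marginalize out A
    p_b = pB if b else 1 - pB
    p_c = pC if c else 1 - pC
    print(abs(p_bc - p_b * p_c) < 1e-12)         # True for every (b, c)
```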
8. [15pts] Consider the following simple Bayesian Network (BN), where all the nodes are assumed to be binary random variables and each of them can take the value of either 1 or 0. In this BN, Battery Old and Battery Charging are the parents of Battery Fine, which in turn is the parent of both Music and Lights. Suppose we use the following abbreviations: 𝑜 = Battery Old, 𝑐 = Battery Charging, 𝑓 = Battery Fine, 𝑚 = Music, 𝑙 = Lights.
Find an expression for the probability that the lights are on given that the battery is old, i.e.,
𝑃(𝑙 = 1|𝑜 = 1), using only the following parameters (probabilities):
𝑃(𝑜 = 1); 𝑃(𝑐 = 1); 𝑃(𝑓 = 1|𝑜 = 0, 𝑐 = 0); 𝑃(𝑓 = 1|𝑜 = 0, 𝑐 = 1); 𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 0);
𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 1); 𝑃(𝑙 = 1|𝑓 = 0); 𝑃(𝑙 = 1|𝑓 = 1); 𝑃(𝑚 = 1|𝑓 = 0); 𝑃(𝑚 = 1|𝑓 = 1). Please show how this expression is derived. (Hint: it is fine if some given parameters are not
used.)
Solution:
𝑃(𝑙 = 1|𝑜 = 1) = 𝑃(𝑙 = 1, 𝑜 = 1)/𝑃(𝑜 = 1). [2pts] According to the BN, the joint probability is factorized as
𝑃(𝑜, 𝑐, 𝑓, 𝑙, 𝑚) = 𝑃(𝑜)𝑃(𝑐)𝑃(𝑓|𝑜, 𝑐)𝑃(𝑙|𝑓)𝑃(𝑚|𝑓). [3pts]
The numerator is
𝑃(𝑙 = 1, 𝑜 = 1) = ∑_{𝑐,𝑓,𝑚} 𝑃(𝑜 = 1, 𝑐, 𝑓, 𝑙 = 1, 𝑚)
= ∑_{𝑐,𝑓,𝑚} 𝑃(𝑜 = 1)𝑃(𝑐)𝑃(𝑓|𝑜 = 1, 𝑐)𝑃(𝑙 = 1|𝑓)𝑃(𝑚|𝑓)
= ∑_{𝑐,𝑓} 𝑃(𝑜 = 1)𝑃(𝑐)𝑃(𝑓|𝑜 = 1, 𝑐)𝑃(𝑙 = 1|𝑓)
[4pts] So that
𝑃(𝑙 = 1|𝑜 = 1) = 𝑃(𝑙 = 1, 𝑜 = 1)/𝑃(𝑜 = 1) = [∑_{𝑐,𝑓} 𝑃(𝑜 = 1)𝑃(𝑐)𝑃(𝑓|𝑜 = 1, 𝑐)𝑃(𝑙 = 1|𝑓)]/𝑃(𝑜 = 1)
= ∑_{𝑐,𝑓} 𝑃(𝑐)𝑃(𝑓|𝑜 = 1, 𝑐)𝑃(𝑙 = 1|𝑓)
= 𝑃(𝑐 = 0)𝑃(𝑓 = 0|𝑜 = 1, 𝑐 = 0)𝑃(𝑙 = 1|𝑓 = 0) + 𝑃(𝑐 = 0)𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 0)𝑃(𝑙 = 1|𝑓 = 1) + 𝑃(𝑐 = 1)𝑃(𝑓 = 0|𝑜 = 1, 𝑐 = 1)𝑃(𝑙 = 1|𝑓 = 0) + 𝑃(𝑐 = 1)𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 1)𝑃(𝑙 = 1|𝑓 = 1)
= (1 − 𝑃(𝑐 = 1))(1 − 𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 0))𝑃(𝑙 = 1|𝑓 = 0) + (1 − 𝑃(𝑐 = 1))𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 0)𝑃(𝑙 = 1|𝑓 = 1) + 𝑃(𝑐 = 1)(1 − 𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 1))𝑃(𝑙 = 1|𝑓 = 0) + 𝑃(𝑐 = 1)𝑃(𝑓 = 1|𝑜 = 1, 𝑐 = 1)𝑃(𝑙 = 1|𝑓 = 1)
[6pts]
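Remark: the derived expression can be sanity-checked numerically by plugging in made-up values for the listed parameters (all numbers below are illustrative assumptions):

```python
# Illustrative parameter values (assumptions, not given in the problem).
P_c1 = 0.4                     # P(c = 1)
P_f1 = {(1, 0): 0.3,           # P(f = 1 | o = 1, c = 0)
        (1, 1): 0.7}           # P(f = 1 | o = 1, c = 1)
P_l1 = {0: 0.05, 1: 0.9}       # P(l = 1 | f)

# P(l = 1 | o = 1) = sum over c, f of P(c) P(f | o = 1, c) P(l = 1 | f)
p = 0.0
for c in (0, 1):
    pc = P_c1 if c else 1 - P_c1
    for f in (0, 1):
        pf = P_f1[(1, c)] if f else 1 - P_f1[(1, c)]
        p += pc * pf * P_l1[f]
print(p)                       # P(l = 1 | o = 1) for these assumed numbers
```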
9. [10pts] Consider the following data of 100 patients who are classified as healthy or sick, where “Count” indicates the number of patients with corresponding features.
Using the above data and the Naïve Bayes algorithm, calculate the probability of a patient being sick if
[1] Headache = T and Fever = F
[2] Headache = F and Fever = T.
Hint: the Naïve Bayes algorithm assumes that the Naïve Bayes Assumption holds, i.e. the two features, Fever and Headache, are conditionally independent given Sick or Healthy.
Solution:
By counting, we have
𝑃(Sick) = 0.3, 𝑃(Headache = T|Sick) = 0.7, 𝑃(Fever = T|Sick) = 2/3, 𝑃(Headache = T|Healthy) = 9/70, 𝑃(Fever = T|Healthy) = 1/14.
[0.4pt for each probability, 2pts in total]
[1]
𝑃(Sick|Headache = T, Fever = F) = 𝑃(Headache = T, Fever = F|Sick)𝑃(Sick)/𝑃(Headache = T, Fever = F) = 𝑃(Headache = T|Sick)𝑃(Fever = F|Sick)𝑃(Sick)/𝑃(Headache = T, Fever = F)
The numerator is
𝑃(Headache = T|Sick)𝑃(Fever = F|Sick)𝑃(Sick) = 7/10 × 1/3 × 3/10 [1pt]
The denominator is
𝑃(Headache = T, Fever = F) = 𝑃(Headache = T, Fever = F|Sick)𝑃(Sick) + 𝑃(Headache = T, Fever = F|Healthy)𝑃(Healthy) = 𝑃(Headache = T|Sick)𝑃(Fever = F|Sick)𝑃(Sick) + 𝑃(Headache = T|Healthy)𝑃(Fever = F|Healthy)𝑃(Healthy) = 7/10 × 1/3 × 3/10 + 9/70 × 13/14 × 7/10 [2pts]
So that
𝑃(Sick|Headache = T, Fever = F) = (7/10 × 1/3 × 3/10)/(7/10 × 1/3 × 3/10 + 9/70 × 13/14 × 7/10) = 98/215 ≈ 0.456 [1pt]
[2]
𝑃(Sick|Headache = F, Fever = T) = 𝑃(Headache = F, Fever = T|Sick)𝑃(Sick)/𝑃(Headache = F, Fever = T) = 𝑃(Headache = F|Sick)𝑃(Fever = T|Sick)𝑃(Sick)/𝑃(Headache = F, Fever = T)
The numerator is
𝑃(Headache = F|Sick)𝑃(Fever = T|Sick)𝑃(Sick) = 3/10 × 2/3 × 3/10 [1pt]
The denominator is
𝑃(Headache = F, Fever = T) = 𝑃(Headache = F, Fever = T|Sick)𝑃(Sick) + 𝑃(Headache = F, Fever = T|Healthy)𝑃(Healthy) = 𝑃(Headache = F|Sick)𝑃(Fever = T|Sick)𝑃(Sick) + 𝑃(Headache = F|Healthy)𝑃(Fever = T|Healthy)𝑃(Healthy) = 3/10 × 2/3 × 3/10 + 61/70 × 1/14 × 7/10 [2pts]
So that
𝑃(Sick|Headache = F, Fever = T) = (3/10 × 2/3 × 3/10)/(3/10 × 2/3 × 3/10 + 61/70 × 1/14 × 7/10) = 84/145 ≈ 0.579 [1pt]
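Remark: the two posteriors can be reproduced from the probabilities estimated above with a short Python sketch (exact fractions are used only to match the hand computation):

```python
from fractions import Fraction as F

# Probabilities estimated from the data (taken from the solution above).
P_sick = F(3, 10)
P_headache_T = {"Sick": F(7, 10), "Healthy": F(9, 70)}
P_fever_T = {"Sick": F(2, 3), "Healthy": F(1, 14)}

def posterior_sick(headache, fever):
    # Naive Bayes: the two features are conditionally independent given the class.
    def lik(cls):
        ph = P_headache_T[cls] if headache else 1 - P_headache_T[cls]
        pf = P_fever_T[cls] if fever else 1 - P_fever_T[cls]
        prior = P_sick if cls == "Sick" else 1 - P_sick
        return ph * pf * prior
    return lik("Sick") / (lik("Sick") + lik("Healthy"))

print(posterior_sick(True, False))   # 98/215 ~ 0.456  (part [1])
print(posterior_sick(False, True))   # 84/145 ~ 0.579  (part [2])
```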