CM3112 Artificial Intelligence
Probability theory: Naive Bayes
Steven Schockaert
SchockaertS1@cardiff.ac.uk
School of Computer Science & Informatics Cardiff University
Bayesian inference
Often we want to evaluate a probability of the form P(cause | effect), e.g. finding out what the probability is that you have some disease, given that you are experiencing particular symptoms
Such probabilities are usually difficult to estimate
‣ What percentage of people with headaches have the flu?
Probabilities of the form P(effect | cause) are usually simpler to estimate
‣ Take a sufficiently large random sample of people with the flu
‣ Check how many of them have a headache
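A minimal Python sketch of this kind of estimate, using made-up sample data (the list and its values are purely illustrative):

```python
# Estimating P(headache | flu) as a relative frequency in a
# (hypothetical) random sample of people who have the flu.
sample_with_flu = [
    {"headache": True}, {"headache": False}, {"headache": True},
    {"headache": True}, {"headache": False},
]

with_headache = sum(1 for person in sample_with_flu if person["headache"])
p_headache_given_flu = with_headache / len(sample_with_flu)
print(p_headache_given_flu)  # 0.6 for this toy sample
```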
Bayesian inference

Bayes' rule allows us to use knowledge about P(effect|cause) to evaluate P(cause|effect):

P(cause|effect) = P(effect|cause) · P(cause) / P(effect)

P(cause) is called the prior probability; it is the probability of a particular cause, in the absence of any evidence
P(effect|cause) is called the likelihood of cause, given the available evidence (i.e. the effects that are witnessed)
P(cause|effect) is called the posterior probability of cause, after taking into account the available evidence
Example

Suppose that 80% of people with meningitis have a stiff neck. The probability that somebody has meningitis is 0.01%. Moreover, the probability that somebody has a stiff neck is 10%. What is the probability that you have meningitis if you have a stiff neck?

We have the following information:

P(stiff-neck|meningitis) = 0.8
P(meningitis) = 0.0001
P(stiff-neck) = 0.1

Using Bayes' rule, we find:

P(meningitis|stiff-neck) = P(stiff-neck|meningitis) · P(meningitis) / P(stiff-neck) = (0.8 · 0.0001) / 0.1 = 0.0008
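The same calculation as a short Python sketch, using the numbers from the example above:

```python
# Bayes' rule applied to the meningitis example.
p_stiffneck_given_meningitis = 0.8
p_meningitis = 0.0001
p_stiffneck = 0.1

p_meningitis_given_stiffneck = p_stiffneck_given_meningitis * p_meningitis / p_stiffneck
print(p_meningitis_given_stiffneck)  # 0.0008 (up to floating-point rounding)
```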
Bayesian inference

The value P(effect) is sometimes difficult to estimate directly; we can ignore it if we add a renormalisation constant α:

P(cause|effect) = α · P(effect|cause) · P(cause)
P(¬cause|effect) = α · P(effect|¬cause) · P(¬cause)

The value α can then be found as follows:

α = 1 / P(effect) = 1 / (P(effect|cause) · P(cause) + P(effect|¬cause) · P(¬cause))

Bayesian inference

Alternatively, we may evaluate:

P(cause_1|effect) = α · P(effect|cause_1) · P(cause_1)
…
P(cause_n|effect) = α · P(effect|cause_n) · P(cause_n)

If we know that the possibilities cause_1, …, cause_n are exhaustive and pairwise disjoint (i.e. in any case exactly one of these causes is true), we have

α = 1 / (P(effect|cause_1) · P(cause_1) + … + P(effect|cause_n) · P(cause_n))

If we are only interested in identifying the most plausible cause, we don't need to worry about the value of α
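A minimal Python sketch of the renormalisation trick. The causes, effects and probabilities below are hypothetical; the point is only to show how α is recovered from the unnormalised scores when the causes are exhaustive and pairwise disjoint:

```python
# Unnormalised score of each cause: P(effect|cause_i) * P(cause_i).
# With exhaustive, pairwise disjoint causes, alpha = 1 / (sum of the scores).
priors = {"cause1": 0.7, "cause2": 0.3}        # hypothetical P(cause_i)
likelihoods = {"cause1": 0.2, "cause2": 0.9}   # hypothetical P(effect|cause_i)

scores = {c: likelihoods[c] * priors[c] for c in priors}
alpha = 1 / sum(scores.values())               # alpha = 1 / P(effect)
posteriors = {c: alpha * s for c, s in scores.items()}
print(posteriors)                              # the posteriors sum to 1
```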
Naive Bayes assumption

Two events A and B are conditionally independent given the event C if

P(A, B|C) = P(A|C) · P(B|C)

In such a case, the following property holds (exercise: show this):

P(A|B, C) = P(A|C)

Naive Bayes assumption

Assume that the different effects are conditionally independent, given the cause:

P(cause|effect_1, …, effect_n) = α · P(effect_1|cause) · … · P(effect_n|cause) · P(cause)

This assumption is called the Naive Bayes assumption because it is usually a simplifying assumption, rather than a genuine property of the problem domain

Inference under the Naive Bayes assumption is very efficient: the amount of data that we need is linear in the number of random variables
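A minimal Python sketch of inference under the Naive Bayes assumption. The causes ("flu", "cold"), effects and all probabilities are hypothetical; the sketch only illustrates how the likelihoods of the observed effects are multiplied together with the prior and then renormalised:

```python
# P(cause | e_1, ..., e_n) = alpha * P(e_1|cause) * ... * P(e_n|cause) * P(cause)
priors = {"flu": 0.1, "cold": 0.9}
likelihoods = {                       # P(effect | cause) for each effect
    "flu":  {"headache": 0.9, "fever": 0.8},
    "cold": {"headache": 0.6, "fever": 0.2},
}
observed = ["headache", "fever"]

scores = {}
for cause, prior in priors.items():
    score = prior
    for effect in observed:
        score *= likelihoods[cause][effect]   # effects treated as independent given the cause
    scores[cause] = score

alpha = 1 / sum(scores.values())
posteriors = {cause: alpha * score for cause, score in scores.items()}
print(posteriors)
```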
Example
Assume that 20% of all students get a first class honours degree, 30% get a 2.1, 30% get a 2.2, 10% get a third, and 10% have to repeat the year. The percentage of people with a certain degree classification who like AI, programming and probability theory is given in the following table:
                 likes AI    likes programming    likes probability
First              70%             90%                  60%
Upper second       60%             80%                  40%
Lower second       40%             70%                  20%
Third              30%             60%                  10%
Fail               20%             50%                  10%
What is the probability that you will get a first, given that you like AI and programming, but not probability theory?
Example

P(first|AI, programming, ¬probability)
= α · P(AI|first) · P(programming|first) · P(¬probability|first) · P(first)
= α · 0.7 · 0.9 · 0.4 · 0.2 = α · 0.0504

P(upper-second|AI, programming, ¬probability) = α · 0.6 · 0.8 · 0.6 · 0.3 = α · 0.0864
P(lower-second|AI, programming, ¬probability) = α · 0.4 · 0.7 · 0.8 · 0.3 = α · 0.0672
P(third|AI, programming, ¬probability) = α · 0.3 · 0.6 · 0.9 · 0.1 = α · 0.0162
P(fail|AI, programming, ¬probability) = α · 0.2 · 0.5 · 0.9 · 0.1 = α · 0.009

Example

We find

α · (0.0504 + 0.0864 + 0.0672 + 0.0162 + 0.009) = 1
α = 1 / (0.0504 + 0.0864 + 0.0672 + 0.0162 + 0.009) = 1 / 0.2292 ≈ 4.363

And thus

P(first|AI, programming, ¬probability) = 0.0504 / 0.2292 ≈ 0.2198
P(upper-second|AI, programming, ¬probability) = 0.0864 / 0.2292 ≈ 0.3769
P(lower-second|AI, programming, ¬probability) = 0.0672 / 0.2292 ≈ 0.2931
P(third|AI, programming, ¬probability) = 0.0162 / 0.2292 ≈ 0.0706
P(fail|AI, programming, ¬probability) = 0.009 / 0.2292 ≈ 0.0392
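A short Python sketch that reproduces this worked example, using only the priors and the table given above:

```python
# Naive Bayes for the degree-classification example.
# Evidence: likes AI, likes programming, does NOT like probability theory.
priors = {"first": 0.2, "upper-second": 0.3, "lower-second": 0.3,
          "third": 0.1, "fail": 0.1}
likes = {  # (P(likes AI | class), P(likes programming | class), P(likes probability | class))
    "first":        (0.7, 0.9, 0.6),
    "upper-second": (0.6, 0.8, 0.4),
    "lower-second": (0.4, 0.7, 0.2),
    "third":        (0.3, 0.6, 0.1),
    "fail":         (0.2, 0.5, 0.1),
}

scores = {}
for cls, (p_ai, p_prog, p_prob) in likes.items():
    # P(AI|class) * P(programming|class) * P(not probability|class) * P(class)
    scores[cls] = p_ai * p_prog * (1 - p_prob) * priors[cls]

alpha = 1 / sum(scores.values())          # 1 / 0.2292, roughly 4.363
for cls, score in scores.items():
    print(cls, round(alpha * score, 4))   # matches the posteriors above, up to rounding
```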