COMP3308/3608, Lecture 6a
ARTIFICIAL INTELLIGENCE
Statistical-Based Learning (Naïve Bayes)
Reference: Witten, Frank, Hall and Pal, ch. 4.2, pp. 96-105; Russell and Norvig, ch. 20, pp. 802-810
Irena Koprinska, irena.koprinska@sydney.edu.au, COMP3308/3608 AI, week 6a, 2021
Outline
• Bayes theorem
• Naïve Bayes algorithm
• Naïve Bayes – issues
• Zero probabilities – Laplace correction
• Dealing with missing values
• Dealing with numeric attributes
What is Bayesian Classification?
• Bayesian classifiers are statistical classifiers
• They can predict the class membership probability, i.e. the probability that a given example belongs to a particular class
• They are based on the Bayes Theorem
Thomas Bayes (1702-1761)
Bayes Theorem
• Given a hypothesis H and evidence E bearing on this hypothesis, the probability of H given E is:
• Example: Given are instances of fruits, described by their color and shape. Let:
– E is red and round
– H is the hypothesis that E is an apple
• What are:
• P(H|E)=?
• P(H)=?
• P(E|H)=?
• P(E)=?
P(H|E) = P(E|H) P(H) / P(E)
Bayes Theorem – Example (cont. 1)
• P(H|E) is the probability that E is an apple, given that we have seen that E is red and round
• Called posteriori probability of H conditioned on E
• P(H) is the probability that any given example is an apple,
regardless of how it looks
• Called prior probability of H
• The posteriori probability is based on more information than the prior probability, which is independent of E
Bayes Theorem – Example (cont. 2)
• What is P(E|H)?
• the posteriori probability of E conditioned on H
• the probability that E is red and round, given that we know that E is an apple
• What is P(E)?
• the prior probability of E
• The probability that an example from the fruit data set is red and round
Bayes Theorem for Problem Solving
• Given: A doctor knows that
• Meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50 000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has a stiff neck, what is the probability that he has meningitis?
P(H|E) = P(E|H) P(H) / P(E)
Bayes Theorem for Problem Solving – Answer
• Given: A doctor knows that
  • Meningitis causes stiff neck 50% of the time: P(S|M) = 0.5
  • Prior probability of any patient having meningitis is 1/50 000: P(M) = 1/50 000
  • Prior probability of any patient having stiff neck is 1/20: P(S) = 1/20
• If a patient has a stiff neck, what is the probability that he has meningitis? P(M|S) = ?

P(M|S) = P(S|M) P(M) / P(S) = 0.5 × (1/50 000) / (1/20) = 0.0002
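As a quick illustration (not part of the original slides), here is a minimal Python sketch of this calculation; the function and variable names are illustrative only:

```python
# Minimal sketch of the meningitis example using the Bayes theorem.
# The numbers are taken from the slide; the names are illustrative.

def posterior(p_e_given_h, p_h, p_e):
    """Bayes theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e

p_s_given_m = 0.5      # meningitis causes stiff neck 50% of the time
p_m = 1 / 50000        # prior probability of meningitis
p_s = 1 / 20           # prior probability of stiff neck

print(round(posterior(p_s_given_m, p_m, p_s), 4))   # 0.0002
```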
Naïve Bayes Algorithm
• The Bayes Theorem can be applied to classification tasks => the Naïve Bayes algorithm
• While 1R makes decisions based on a single attribute, Naive Bayes uses all attributes and allows them to make contributions to the decision that are equally important and independent of one another
• Assumptions of the Naïve Bayes algorithm
• 1) Independence assumption – (the values of the) attributes are conditionally independent of each other, given the class (i.e. for each class value)
• 2) Equal importance assumption – all attributes are equally important
• Unrealistic assumptions! => it is called Naive Bayes
• Attributes are dependent on one another
• Attributes are not equally important
• But these assumptions lead to a simple method which works
surprisingly well in practice!
Naive Bayes on the Weather Example
• Given: the weather data
• Task: use Naïve Bayes to predict the class (yes or no) of the new example
  outlook=sunny, temperature=cool, humidity=high, windy=true
• The Bayes Theorem: P(H|E) = P(E|H) P(H) / P(E)
• What are H and E?
  • the evidence E is the new example
  • the hypothesis H is play=yes (and there is another H: play=no)
• How to use the Bayes Theorem for classification?
  • Calculate P(H|E) for each H (class), i.e. P(yes|E) and P(no|E)
  • Compare them and assign E to the class with the highest probability
  • OK, but for P(H|E) we need to calculate P(E), P(H) and P(E|H) – how to do this? From the given data (this is the training phase of the classifier)

The weather data:

outlook    temp.  humidity  windy  play
sunny      hot    high      false  no
sunny      hot    high      true   no
overcast   hot    high      false  yes
rainy      mild   high      false  yes
rainy      cool   normal    false  yes
rainy      cool   normal    true   no
overcast   cool   normal    true   yes
sunny      mild   high      false  no
sunny      cool   normal    false  yes
rainy      mild   normal    false  yes
sunny      mild   normal    true   yes
overcast   mild   high      true   yes
overcast   hot    normal    false  yes
rainy      mild   high      true   no
Naive Bayes on the Weather Example (2)
• We need to calculate and compare P(yes|E) and P(no|E)
where E is: outlook=sunny, temperature=cool, humidity=high, windy=true
1) How to calculate P(E|yes) and P(E|no) ?
Let’s split the evidence E into 4 smaller pieces of evidence:
• E1 = outlook=sunny, E2 = temperature=cool
• E3 = humidity=high, E4 = windy=true
Let’s use the Naïve Bayes’s independence assumption: E1, E2, E3 and E4 are independent given the class. Then, their combined probability is obtained by multiplication of per-attribute probabilities:
P(yes|E) = P(E|yes) P(yes) / P(E)
P(no|E) = P(E|no) P(no) / P(E)

P(E|yes) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes)
P(E|no) = P(E1|no) P(E2|no) P(E3|no) P(E4|no)
Naive Bayes on the Weather Example (3)
• Hence:
P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

P(no|E) = P(E1|no) P(E2|no) P(E3|no) P(E4|no) P(no) / P(E)
• In summary:
• Numerator – the probabilities will be estimated from the data
• Denominator – the two denominators are the same (P(E)) and since we are comparing the two fractions, we can just compare the numerators => there is no need to calculate P(E)
Calculating the Probabilities from the Training Data
E1 = outlook=sunny, E2 = temperature=cool, E3 = humidity=high, E4 = windy=true

P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

(using the weather data table above)

• P(E1|yes)=P(outlook=sunny|yes)=?
• P(E2|yes)=P(temp=cool|yes)=?
• P(E3|yes)=P(humidity=high|yes)=?
• P(E4|yes)=P(windy=true|yes)=?
• P(yes)=?
Calculating the Probabilities from the Training Data
E1 = outlook=sunny, E2 = temperature=cool, E3 = humidity=high, E4 = windy=true

P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

(using the weather data table above)

• P(E1|yes)=P(outlook=sunny|yes) = ?/9 = 2/9
• P(E2|yes)=P(temp=cool|yes)=?
• P(E3|yes)=P(humidity=high|yes)=?
• P(E4|yes)=P(windy=true|yes)=?
• P(yes)=?
Calculating the Probabilities (2)
• Weather data – counts and probabilities:

Counts:
outlook:      sunny  yes 2  no 3   |  overcast  yes 4  no 0   |  rainy  yes 3  no 2
temperature:  hot    yes 2  no 2   |  mild      yes 4  no 2   |  cool   yes 3  no 1
humidity:     high   yes 3  no 4   |  normal    yes 6  no 1
windy:        false  yes 6  no 2   |  true      yes 3  no 3
play:         yes 9  no 5

Probabilities:
outlook:      P(sunny|yes)=2/9  P(sunny|no)=3/5   |  P(overcast|yes)=4/9  P(overcast|no)=0/5  |  P(rainy|yes)=3/9  P(rainy|no)=2/5
temperature:  P(hot|yes)=2/9    P(hot|no)=2/5     |  P(mild|yes)=4/9      P(mild|no)=2/5       |  P(cool|yes)=3/9   P(cool|no)=1/5
humidity:     P(high|yes)=3/9   P(high|no)=4/5    |  P(normal|yes)=6/9    P(normal|no)=1/5
windy:        P(false|yes)=6/9  P(false|no)=2/5   |  P(true|yes)=3/9      P(true|no)=3/5
play:         P(yes)=9/14       P(no)=5/14

• e.g. P(yes)=9/14 is the proportion of days on which play is yes; P(normal|yes)=6/9 is the proportion of days with humidity=normal among the days with play=yes, i.e. the probability of humidity being normal given that play=yes
Calculating the Probabilities (3)

P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

• From the counts and probabilities above:
  P(E1|yes) = P(outlook=sunny|yes) = 2/9
  P(E2|yes) = P(temperature=cool|yes) = 3/9
  P(E3|yes) = P(humidity=high|yes) = 3/9
  P(E4|yes) = P(windy=true|yes) = 3/9
• P(yes) = ? – the probability of a yes without knowing anything about E, i.e. anything about the particular day; the prior probability of yes; P(yes) = 9/14
Final Calculations
• By substituting the respective evidence probabilities:
P(yes|E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0053 / P(E)

• Similarly, calculating P(no|E):

P(no|E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0206 / P(E)

• P(no|E) > P(yes|E)
• => for the new day, play = no is more likely than play = yes
• => the new example is classified as play = no
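As an illustration (not part of the original slides), here is a small Python sketch that reproduces these numbers by counting from the weather table above; all names are illustrative:

```python
# Illustrative sketch of Naive Bayes on the categorical weather data:
# the training phase is just counting, the prediction phase compares
# the numerators of the Bayes theorem for each class.
from collections import Counter, defaultdict

attributes = ["outlook", "temp", "humidity", "windy"]
data = [  # (outlook, temp, humidity, windy, play)
    ("sunny", "hot", "high", "false", "no"), ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"), ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]

# Training phase: class counts and (attribute, class) -> value counts
class_counts = Counter(row[-1] for row in data)
value_counts = defaultdict(Counter)
for *values, cls in data:
    for attr, value in zip(attributes, values):
        value_counts[(attr, cls)][value] += 1

def score(example, cls):
    """Numerator of the Bayes theorem: P(E1|cls) * ... * P(E4|cls) * P(cls)."""
    s = class_counts[cls] / len(data)                  # prior, e.g. P(yes) = 9/14
    for attr, value in example.items():
        s *= value_counts[(attr, cls)][value] / class_counts[cls]
    return s

new_day = {"outlook": "sunny", "temp": "cool", "humidity": "high", "windy": "true"}
for cls in class_counts:
    print(cls, round(score(new_day, cls), 4))          # no 0.0206, yes 0.0053
print("prediction:", max(class_counts, key=lambda c: score(new_day, c)))   # no
```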
Another Example
• Use the NB classifier to solve the following problem:
• Consider a volleyball game between team A and team B
• Team A has won 65% of the time and team B has won 35%
• Among the games won by team A, 30% were when playing on team B’s court
• Among the games won by team B, 75% were when playing at home
• If team B is hosting the next match, which team is most likely to win?
Solution
• host – the team hosting the match {A, B}
• winner – the winner of the match {A, B}
• Using NB, the task is to compute and compare 2 probabilities: P(winner=A|host=B)
P(winner=B|host=B)
P(winner=A|host=B) = P(host=B|winner=A) P(winner=A) / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) P(winner=B) / P(host=B)
Solution (2)
P(winner=A|host=B) = P(host=B|winner=A) P(winner=A) / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) P(winner=B) / P(host=B)
• Do we know these probabilities:
• P(winner=A)= ? //probability that A wins
• P(winner=B)=? //probability that B wins
• P(host=B|winner=A)=? //probability that team B hosted the match, given that team A won
• P(host=B|winner=B)=? //probability that team B hosted the match, given that team B won
Solution (3)
P(winner=A|host=B) = P(host=B|winner=A) P(winner=A) / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) P(winner=B) / P(host=B)
• Do we know these probabilities:
• P(winner=A)= ? //probability that A wins =0.65
• P(winner=B)=? //probability that B wins =0.35
• P(host=B|winner=A)=? //probability that team B hosted the match, given that team A won =0.30
• P(host=B|winner=B)=? //probability that team B hosted the match, given that team B won =0.75
Solution (4)
P(winner=A|host=B) = P(host=B|winner=A) P(winner=A) / P(host=B) = (0.3 × 0.65) / P(host=B) = 0.195 / P(host=B)

P(winner=B|host=B) = P(host=B|winner=B) P(winner=B) / P(host=B) = (0.75 × 0.35) / P(host=B) = 0.2625 / P(host=B)
=>NB predicts team B
i.e. NB predicts that if team B is hosting the next match, then team B is more likely to win
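A tiny Python sketch of this comparison (not from the slides; the dictionaries are illustrative). Since P(host=B) is the same in both fractions, only the numerators need to be compared:

```python
# Compare the Bayes-theorem numerators; P(host=B) cancels out.
p_win = {"A": 0.65, "B": 0.35}                 # priors P(winner)
p_hostB_given_win = {"A": 0.30, "B": 0.75}     # P(host=B | winner)

scores = {t: round(p_hostB_given_win[t] * p_win[t], 4) for t in p_win}
print(scores)                                  # {'A': 0.195, 'B': 0.2625}
print("predicted winner:", max(scores, key=scores.get))   # B
```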
Three More Things About Naïve Bayes
• How to deal with probability values of zero in the numerator?
• How to deal with missing values?
• How to deal with numeric attributes?
Problem – Probability Values of 0
• Suppose that the training data was different: outlook=sunny had always occurred together with play=no (i.e. outlook=sunny had never occurred together with play=yes)
• Then: P(outlook=sunny|yes) = 0

Modified counts and probabilities for outlook and class yes:
  sunny     0   0/9
  overcast  4   4/9
  rainy     3   3/9

P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E) = 0

• => the final probability P(yes|E) = 0, regardless of the other probabilities. This is not good!
• The other probabilities are completely ignored due to the multiplication by 0
• I.e. the prediction for new examples with outlook=sunny will always be no, regardless of the values of the other attributes
A Simple Trick to Avoid This Problem
• Assume that our training data is so large that adding 1 to each count would not make a difference in calculating the probabilities …
• but it will avoid the case of 0 probability
• This is called the Laplace correction or Laplace estimator
“What we know is not much. What we do not know is immense.”
Pierre-Simon Laplace (1749-1827)
Image from http://en.wikipedia.org/wiki/File:Pierre-Simon_Laplace.jpg
Laplace Correction
• Add 1 to the numerator and k to the denominator, where k is the number of attribute values for the given attribute
• Example:
• A dataset with 2000 examples, 2 classes: buy_Mercedes=yes and
buy_Mercedes=no; 1000 examples in each class
• 1 of the attributes is income with 3 values: low, medium and high
• For class buy_Mercedes=yes , there are 0 examples with income=low, 10 with income=medium and 990 with income=high
• Probabilities without the Laplace correction for class yes: 0/1000=0, 10/1000=0.01, 990/1000=0.99
• Probabilities with the Laplace correction: 1/1003=0.001, 11/1003=0.011, 991/1003=0.988
• The adjusted probabilities are close to the original ones, yet the 0 probability value is avoided!
Laplace Correction – Modified Weather Example

Without the Laplace correction (modified counts for class yes: sunny 0, overcast 4, rainy 3):
  P(sunny|yes) = 0/9 → problem
  P(overcast|yes) = 4/9
  P(rainy|yes) = 3/9

With the Laplace correction:
  P(sunny|yes) = (0+1)/(9+3) = 1/12
  P(overcast|yes) = (4+1)/(9+3) = 5/12
  P(rainy|yes) = (3+1)/(9+3) = 4/12

• Assumes that there are 3 more examples from class yes, 1 for each value of outlook
• This results in adding 1 to the numerator and 3 to the denominator of all probabilities
• Ensures that an attribute value which occurs 0 times will receive a nonzero (although small) probability
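As an illustration (not part of the slides), a short Python sketch of the correction applied to these modified outlook counts; the names are illustrative:

```python
# Laplace correction: add 1 to the numerator and k (number of attribute
# values) to the denominator. Counts are from the modified weather example.
from fractions import Fraction

def laplace(count, class_total, k):
    return Fraction(count + 1, class_total + k)

counts_yes = {"sunny": 0, "overcast": 4, "rainy": 3}   # class yes, 9 examples
k = len(counts_yes)                                    # 3 values of outlook
for value, count in counts_yes.items():
    print(value, laplace(count, 9, k))                 # 1/12, 5/12, 4/12
```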
Generalization of the Laplace Correction: M-estimate
• Add a small constant m to each denominator and m·pi to each numerator, where pi is the prior probability of the i-th value of the attribute:

  P(sunny|yes) = (2 + m·p1) / (9 + m)
  P(overcast|yes) = (4 + m·p2) / (9 + m)
  P(rainy|yes) = (3 + m·p3) / (9 + m)

• Note that p1+p2+…+pn = 1, n – number of attribute values
• Advantage of using prior probabilities – it is rigorous
• Disadvantage – computationally expensive to estimate prior probabilities
• Large m – the prior probabilities are very important compared with the new evidence coming in from the training data; small m – they are less important
• Typically we assume that each attribute value is equally probable, i.e. p1=p2=…=pn = 1/n
• The Laplace correction is a special case of the m-estimate, where p1=p2=…=pn = 1/n and m=n. Thus, 1 is added to the numerator and n to the denominator.
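A minimal sketch of the m-estimate in Python (not from the slides; the function name and the chosen m are illustrative). With p1=p2=p3=1/3 and m=3 it reduces to the Laplace correction:

```python
# m-estimate: (count + m*p) / (class_total + m), where p is the prior
# probability of the attribute value and m is a small constant.
def m_estimate(count, class_total, m, p):
    return (count + m * p) / (class_total + m)

# Original outlook counts for class yes: sunny 2, overcast 4, rainy 3 (9 yes examples)
for value, count in {"sunny": 2, "overcast": 4, "rainy": 3}.items():
    print(value, round(m_estimate(count, 9, m=3, p=1 / 3), 3))
    # 0.25, 0.417, 0.333 = 3/12, 5/12, 4/12, the same as the Laplace correction
```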
Handling Missing Values – Easy
• Missing attribute value in the new example – do not include this attribute
  • e.g. outlook=?, temperature=cool, humidity=high, windy=true
  • Then:
    P(yes|E) = (3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0238 / P(E)
    P(no|E) = (1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0343 / P(E)
  • outlook is not included. Compare these results with the previous results!
  • As one of the fractions is missing, the probabilities are higher but the comparison is fair – there is a missing fraction in both cases
• Missing attribute value in a training example – do not include this value in the counts
  • Calculate the probabilities based on the number of values that actually occur (are not missing) and not on the total number of training examples
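A brief Python illustration (not from the slides) of how the missing attribute is simply left out of the product:

```python
# outlook=? is skipped; the remaining fractions are taken from the slides.
p_yes = (3/9) * (3/9) * (3/9) * (9/14)   # temp=cool, humidity=high, windy=true | yes
p_no = (1/5) * (4/5) * (3/5) * (5/14)    # temp=cool, humidity=high, windy=true | no
print(round(p_yes, 4), round(p_no, 4))   # 0.0238 0.0343 -> still predicts no
```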
Handling Numeric Attributes
• Consider the version of the weather data in which temperature and humidity are numeric attributes
• We would like to classify the following new example: outlook=sunny, temperature=66, humidity=90, windy=true
• Question: How to calculate
P(temperature=66|yes)=?, P(humidity=90|yes)=?
P(temperature=66|no)=?, P(humidity=90|no)=?
Using Probability Density Function
• Answer: By assuming that numerical values have a normal (Gaussian, bell curve) probability distribution and using the probability density function
• For a normal distribution with mean μ and standard deviation σ, the probability density function is:
Image from http://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg
f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))
More on Probability Density Functions
• What is the meaning of the probability density function of a continuous random variable?
• closely related to probability but not exactly the probability (e.g. the probability that x is exactly 66 is 0)
• = the probability that the value falls in a small interval (x − ε/2, x + ε/2) is approximately ε·f(x)
f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))
Calculating Probabilities Using the Probability Density Function

f(temperature=66|yes) = (1 / (6.2√(2π))) e^(−(66−73)² / (2·6.2²)) = 0.034
  (73 – mean of temperature for class yes; 6.2 – standard deviation of temperature for class yes)

f(humidity=90|yes) = 0.0221

P(yes|E) = (2/9 × 0.034 × 0.0221 × 3/9 × 9/14) / P(E) = 0.000036 / P(E)
P(no|E) = (3/5 × 0.0291 × 0.038 × 3/5 × 5/14) / P(E) = 0.000136 / P(E)

=> P(no|E) > P(yes|E) => no play
• Compare with the categorical weather data!
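A small Python sketch (not from the slides) of the density calculation; the mean 73 and standard deviation 6.2 for temperature given play=yes are the values used above:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)**2 / (2*sigma**2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(round(gaussian_pdf(66, 73, 6.2), 3))   # 0.034 = f(temperature=66 | yes)
```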
Mean and Standard Deviation – Reminder
• A reminder how to calculate the mean value μ and standard deviation σ:
  X is a random variable with values x1, x2, …, xn

  μ = (x1 + x2 + … + xn) / n

  σ = sqrt( Σi (xi − μ)² / (n − 1) )
Note that the denominator is n-1 not n
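A short Python sketch (not from the slides) of these formulas. The temperature values below are assumed to be the yes-class temperatures of the standard numeric weather data (Witten et al.); they reproduce the mean 73 and standard deviation 6.2 used earlier:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    """Sample standard deviation; note the n-1 denominator."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]       # assumed yes-class temperatures
print(round(mean(temps_yes), 1), round(std(temps_yes), 1))   # 73.0 6.2
```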
Naive Bayes – Advantages
• Simple approach – the probabilities are easily computed due to the independence assumption
• Clear semantics for representing, using and learning probabilistic knowledge
• Excellent computational complexity
• Requires 1 scan of the training data to calculate all statistics (for both
nominal and continuous attributes assuming normal distribution):
• O(pk), where p is the number of training examples and k is the number of attributes
• In many cases outperforms more sophisticated learning methods => always try the simple method first!
• Robust to isolated noise points as such points are averaged when estimating the conditional probabilities from data
Naive Bayes – Disadvantages
• Correlated attributes reduce the power of Naïve Bayes
• Violation of the independence assumption
• Solution: apply feature selection beforehand to identify and discard correlated (redundant) attributes
• Normal distribution assumption for numeric attributes – many features are not normally distributed – solutions:
• Discretize the data first, i.e. numerical -> nominal attributes
• Use other probability density functions, e.g. Poisson, binomial, gamma, etc.
• Transform the attribute using a suitable transformation into a normally distributed one (sometimes possible)
• Use kernel density estimation – doesn’t assume any particular distribution