
COMP3308/3608, Lecture 6a
ARTIFICIAL INTELLIGENCE
Statistical-Based Learning (Naïve Bayes)
Reference: Witten, Frank, Hall and Pal, ch. 4.2, pp. 96-105; Russell and Norvig, ch. 20, pp. 802-810



• Bayes theorem
• Naïve Bayes algorithm
• Naïve Bayes – issues
• Zero probabilities – Laplace correction
• Dealing with missing values
• Dealing with numeric attributes

What is Bayesian Classification?
• Bayesian classifiers are statistical classifiers
• They can predict the class membership probability, i.e. the probability that a given example belongs to a particular class
• They are based on the Bayes Theorem
Thomas Bayes (1702-1761)

Bayes Theorem
• Given a hypothesis H and evidence E for this hypothesis, the probability of H given E is:
• Example: Given are instances of fruits, described by their color and shape. Let:
– E is red and round
– H is the hypothesis that E is an apple
• What are:
• P(H|E)=?
• P(E|H)=?
P(H|E) = P(E|H)P(H) / P(E)

Bayes Theorem – Example (cont. 1)
• P(H|E) is the probability that E is an apple, given that we have seen that E is red and round
• Called posterior probability of H conditioned on E
• P(H) is the probability that any given example is an apple,
regardless of how it looks
• Called prior probability of H
• The posterior probability is based on more information than the prior probability, which is independent of E

Bayes Theorem – Example (cont. 2)
• What is P(E|H)?
• the posterior probability of E conditioned on H
• the probability that E is red and round, given that we know that E is an apple
• What is P(E)?
• the prior probability of E
• The probability that an example from the fruit data set is red and round

Bayes Theorem for Problem Solving
• Given: A doctor knows that
• Meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50 000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has a stiff neck, what is the probability that he has meningitis?
P(H|E) = P(E|H)P(H) / P(E)

Bayes Theorem for Problem Solving – Answer
• Given: A doctor knows that
• Meningitis causes stiff neck 50% of the time: P(S|M) = 0.5
• Prior probability of any patient having meningitis is 1/50 000: P(M) = 1/50 000
• Prior probability of any patient having stiff neck is 1/20: P(S) = 1/20
• If a patient has a stiff neck, what is the probability that he has
meningitis?
P(M | S) = ?
P(M|S) = P(S|M)P(M) / P(S) = 0.5 × (1/50 000) / (1/20) = 0.0002
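A quick numerical check of this result in Python (a minimal sketch; the variable names are ours, not from the lecture):

# Bayes theorem for the meningitis example
p_s_given_m = 0.5          # P(stiff neck | meningitis)
p_m = 1 / 50000            # prior P(meningitis)
p_s = 1 / 20               # prior P(stiff neck)
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 0.0002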

Naïve Bayes Algorithm
• The Bayes Theorem can be applied for classification tasks = Naïve Bayes algorithm
• While 1R makes decisions based on a single attribute, Naive Bayes uses all attributes and allows them to make contributions to the decision that are equally important and independent of one another
• Assumptions of the Naïve Bayes algorithm
• 1) Independence assumption – (the values of the) attributes are conditionally independent of each other, given the class (i.e. for each class value)
• 2) Equal importance assumption – all attributes are equally important
• Unrealistic assumptions! => it is called Naive Bayes
• Attributes are dependent on one another
• Attributes are not equally important
• But these assumptions lead to a simple method which works
surprisingly well in practice!

Naive Bayes on the Weather Example
• Given: the weather data
• Task: use Naïve Bayes to predict the class (yes or no) of the new example
outlook=sunny, temperature=cool, humidity=high, windy=true
The Bayes Theorem:
• What are H and E?
• the evidence E is the new example
• the hypothesis H is play=yes (and there is another H: play=no)
• How to use the Bayes Theorem for classification?
1) Calculate P(H|E) for each H (class), i.e. P(yes|E) and P(no|E)
2) Compare them and assign E to the class with the highest probability
OK, but for P(H|E) we need to calculate P(E), P(H) and P(E|H) – how to do this? From the given data (this is the training phase of the classifier)
P(H|E) = P(E|H)P(H) / P(E)
outlook temp. humidity windy play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

Naive Bayes on the Weather Example (2)
• We need to calculate and compare P(yes|E) and P(no|E)
outlook=sunny,temperature=cool, humidity=high, windy=true
1) How to calculate P(E|yes) and P(E|no) ?
Let’s split the evidence E into 4 smaller pieces of evidence:
• E1 = outlook=sunny, E2 = temperature=cool
• E3 = humidity=high, E4 = windy=true
Let’s use Naïve Bayes’s independence assumption: E1, E2, E3 and E4 are independent given the class. Then their combined probability is obtained by multiplying the per-attribute probabilities:
P(yes|E) = P(E|yes)P(yes) / P(E)
P(no|E) = P(E|no)P(no) / P(E)
P(E|yes) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes)
P(E|no) = P(E1|no) · P(E2|no) · P(E3|no) · P(E4|no)

Naive Bayes on the Weather Example (3)
P(yes|E) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes) · P(yes) / P(E)
P(no|E) = P(E1|no) · P(E2|no) · P(E3|no) · P(E4|no) · P(no) / P(E)
• In summary:
• Numerator – the probabilities will be estimated from the data
• Denominator – the two denominators are the same (P(E)) and since we are comparing the two fractions, we can just compare the numerators => there is no need to calculate P(E)

Calculating the Probabilities from the Training Data
E1 = outlook=sunny, E2 = temperature=cool E3 = humidity=high, E4 = windy=true
P(yes|E) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes) · P(yes) / P(E)
(weather data table – see above)
• P(E1|yes) = P(outlook=sunny|yes) = ?
• P(E2|yes) = P(temp=cool|yes) = ?
• P(E3|yes) = P(humidity=high|yes) = ?
• P(E4|yes) = P(windy=true|yes) = ?
• P(yes)=?

Calculating the Probabilities from the Training Data
E1 = outlook=sunny, E2 = temperature=cool E3 = humidity=high, E4 = windy=true
• P(E1|yes) = P(outlook=sunny|yes) = ?/9 = 2/9
• P(E2|yes) = P(temp=cool|yes) = ?
• P(E3|yes)=P(humidity=high|yes)=?
• P(E4|yes)=P(windy=true|yes)=?
• P(yes)=?
P(yes|E) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes) · P(yes) / P(E)
(weather data table – see above)

Calculating the Probabilities (2)
Weather data – counts and probabilities (for each attribute value: fraction within class yes, fraction within class no):
outlook: sunny 2/9 3/5, overcast 4/9 0/5, rainy 3/9 2/5
temperature: hot 2/9 2/5, mild 4/9 2/5, cool 3/9 1/5
humidity: high 3/9 4/5, normal 6/9 1/5
windy: false 6/9 2/5, true 3/9 3/5
play: yes 9/14, no 5/14
• P(play=yes) = 9/14 – the proportion of days when play is yes
• P(humidity=normal|yes) = 6/9 – the proportion of days with normal humidity among the days when play is yes, i.e. the probability of humidity being normal given that play=yes

Calculating the Probabilities (3)
P(yes|E) = ?
• P(E1|yes) = P(outlook=sunny|yes) = 2/9
• P(E2|yes) = P(temperature=cool|yes) = 3/9
• P(E3|yes) = P(humidity=high|yes) = 3/9
• P(E4|yes) = P(windy=true|yes) = 3/9
P(yes) =? – the probability of a yes without knowing any E, i.e. anything about the particular day; the prior probability of yes; P(yes) = 9/14
P(yes|E) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes) · P(yes) / P(E)
(the individual probabilities are taken from the counts table above)

Final Calculations
• By substituting the respective evidence probabilities:
P(yes|E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0053 / P(E)
• Similarly, calculating P(no|E):
P(no|E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0206 / P(E)
P(no|E) > P(yes|E)
• => for the new day play = no is more likely than play = yes
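The same calculation can be scripted. Below is a minimal Python sketch (not the lecture's code; the data layout, variable and function names are ours) that estimates the counts from the 14 weather examples and compares the two numerators:

# Naive Bayes by hand on the weather data
from collections import Counter, defaultdict

# (outlook, temperature, humidity, windy) -> play
data = [
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
class_counts = Counter(row[-1] for row in data)
# value_counts[(attribute index, value, class)] = count
value_counts = defaultdict(int)
for row in data:
    for i, v in enumerate(row[:-1]):
        value_counts[(i, v, row[-1])] += 1

def score(example, cls):
    # Numerator of P(cls|example); P(E) cancels when comparing the two classes
    p = class_counts[cls] / len(data)                        # prior P(cls)
    for i, v in enumerate(example):
        p *= value_counts[(i, v, cls)] / class_counts[cls]   # P(Ei|cls)
    return p

new_day = ("sunny", "cool", "high", True)
print(score(new_day, "yes"))  # ~0.0053
print(score(new_day, "no"))   # ~0.0206  -> predict "no"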

Another Example
• Use the NB classifier to solve the following problem:
• Consider a volleyball game between team A and team B
• Team A has won 65% of the time and team B has won 35%
• Among the games won by team A, 30% were when playing on team B’s court
• Among the games won by team B, 75% were when playing at home
• If team B is hosting the next match, which team is most likely to win?

Solution
• host – the team hosting the match {A, B}
• winner – the winner of the match {A, B}
• Using NB, the task is to compute and compare 2 probabilities: P(winner=A|host=B)
P(winner=B|host=B)
P(winner=A|host=B) = P(host=B|winner=A) · P(winner=A) / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) · P(winner=B) / P(host=B)

Solution (2)
P(winner=A|host=B) = P(host=B|winner=A) · P(winner=A) / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) · P(winner=B) / P(host=B)
• Do we know these probabilities:
• P(winner=A)= ? //probability that A wins
• P(winner=B)=? //probability that B wins
• P(host=B|winner=A)=? //probability that team B hosted the match, given that team A won
• P(host=B|winner=B)=? //probability that team B hosted the match, given that team B won

Solution (3)
P(winner=A|host=B) = P(host=B|winner=A) · P(winner=A) / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) · P(winner=B) / P(host=B)
• Do we know these probabilities:
• P(winner=A)= ? //probability that A wins =0.65
• P(winner=B)=? //probability that B wins =0.35
• P(host=B|winner=A)=? //probability that team B hosted the match, given that team A won =0.30
• P(host=B|winner=B)=? //probability that team B hosted the match, given that team B won =0.75

Solution (4)
P(winner=A|host=B) = P(host=B|winner=A) · P(winner=A) / P(host=B) = 0.3 × 0.65 / P(host=B) = 0.195 / P(host=B)
P(winner=B|host=B) = P(host=B|winner=B) · P(winner=B) / P(host=B) = 0.75 × 0.35 / P(host=B) = 0.2625 / P(host=B)
=>NB predicts team B
i.e. NB predicts that if team B is hosting the next match, then team B is more likely to win
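A quick check of the two numerators in Python (the shared denominator P(host=B) cancels when comparing; variable names are ours):

p_win_a, p_win_b = 0.65, 0.35                  # priors: P(winner=A), P(winner=B)
p_hostB_given_a, p_hostB_given_b = 0.30, 0.75  # P(host=B | winner=A), P(host=B | winner=B)
print(p_hostB_given_a * p_win_a)               # 0.195
print(p_hostB_given_b * p_win_b)               # 0.2625 -> predict team B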

Three More Things About Naïve Bayes
• How to deal with probability values of zero in the numerator?
• How to deal with missing values?
• How to deal with numeric attributes?

Problem – Probability Values of 0
• Suppose that the training data was different: outlook=sunny had always occurred together with play=no (i.e.
outlook=sunny had never occurred together with play=yes )
• Then: P(outlook=sunny|yes)=0 and P(outlook=sunny|no)=1
Modified counts for outlook given play=yes: sunny 0, overcast 4, rainy 3 → probabilities: sunny 0/9, overcast 4/9, rainy 3/9
P(yes|E) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes) · P(yes) / P(E)
• => the final probability P(yes|E) = 0, no matter what the other probabilities are
• This is not good!
• The other probabilities are completely ignored due to the multiplication with 0
• => The prediction for new examples with outlook=sunny will always be no, regardless of the values of the other attributes

A Simple Trick to Avoid This Problem
• Assume that our training data is so large that adding 1 to each count would not make a difference in calculating the probabilities …
• but it will avoid the case of 0 probability
• This is called the Laplace correction or Laplace estimator
“What we know is not much. What we do not know is immense.”
Pierre-Simon Laplace (1749-1827)
Image from http://en.wikipedia.org/wiki/File:Pierre-Simon_Laplace.jpg

Laplace Correction
• Add 1 to the numerator and k to the denominator, where k is the number of attribute values for the given attribute
• Example:
• A dataset with 2000 examples, 2 classes: buy_Mercedes=yes and
buy_Mercedes=no; 1000 examples in each class
• 1 of the attributes is income with 3 values: low, medium and high
• For class buy_Mercedes=yes , there are 0 examples with income=low, 10 with income=medium and 990 with income=high
• Probabilities without the Laplace correction for class yes: 0/1000=0, 10/1000=0.01, 990/1000=0.99
• Probabilities with the Laplace correction: 1/1003=0.001, 11/1003=0.011, 991/1003=0.988
• The adjusted probabilities are close to the original ones, yet the 0 probability value is avoided!
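A small Python sketch of this correction for the income attribute (the function name is ours, not from the lecture):

def laplace_probs(counts):
    # Laplace correction: add 1 to each count and k (= number of attribute values) to the total
    k = len(counts)
    total = sum(counts.values()) + k
    return {value: (c + 1) / total for value, c in counts.items()}

income_given_yes = {"low": 0, "medium": 10, "high": 990}
print(laplace_probs(income_given_yes))
# low: 1/1003 = 0.001, medium: 11/1003 = 0.011, high: 991/1003 = 0.988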

Laplace Correction – Modified Weather Example
Modified counts for outlook given play=yes: sunny 0, overcast 4, rainy 3 → probabilities: sunny 0/9, overcast 4/9, rainy 3/9
Without correction:
P(sunny|yes) = 0/9 → problem
P(overcast|yes) = 4/9
P(rainy|yes) = 3/9
With the Laplace correction:
P(sunny|yes) = (0+1)/(9+3) = 1/12
P(overcast|yes) = (4+1)/(9+3) = 5/12
P(rainy|yes) = (3+1)/(9+3) = 4/12
• Assumes that there are 3 more examples from class yes, 1 for each value of outlook
• This results in adding 1 to the numerator and 3 to the denominator of all probabilities
• Ensures that an attribute value which occurs 0 times will receive a nonzero (although small) probability

Generalization of the Laplace Correction: M-estimate
• Add a small constant m to each denominator and m·pi to each numerator, where pi is the prior probability of the i-th value of the attribute:
P(sunny|yes) = (2 + m·p1) / (9 + m)
P(overcast|yes) = (4 + m·p2) / (9 + m)
P(rainy|yes) = (3 + m·p3) / (9 + m)
• Note that p1+p2+…+pn=1, n – number of attribute values
• Advantage of using prior probabilities – it is rigorous
• Disadvantage – computationally expensive to estimate prior probabilities
• Large m – the prior probabilities are very important compared with the new evidence coming in from the training data; small m – less important
• Typically we assume that each attribute value is equally probable, i.e. p1=p2=…=pn =1/n
• The Laplace correction is a special case of the m-estimate, where p1=p2=…=pn=1/n and m=n. Thus, 1 is added to the numerator and n (the number of attribute values) to the denominator.
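A sketch of the m-estimate as a Python function (assuming equal priors pi = 1/n; the names are ours):

def m_estimate(count, class_count, p_prior, m):
    # P(value|class) = (count + m*p_prior) / (class_count + m)
    return (count + m * p_prior) / (class_count + m)

# With p_prior = 1/n and m = n this reduces to the Laplace correction:
n = 3  # outlook has 3 values
print(m_estimate(0, 9, 1/n, n))   # 1/12, same as (0+1)/(9+3)
print(m_estimate(4, 9, 1/n, n))   # 5/12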

Handling Missing Values – Easy
• Missing attribute value in the new example – do not include this attribute
• e.g. outlook=?, temperature=cool, humidity=high, windy=true
• outlook is not included. Compare these results with the previous results!
• As one of the fractions is missing, the probabilities are higher but the comparison is fair – there is a missing fraction in both cases
• Missing attribute value in a training example – do not include this value in the counts
• Calculate the probabilities based on the number of values that actually occur (are not missing) and not on the total number of training examples
P(yes|E) = (3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0238 / P(E)
P(no|E) = (1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0343 / P(E)
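Continuing the earlier weather sketch, a missing attribute in the new example can simply be skipped when multiplying (using None as a hypothetical marker for a missing value; class_counts, value_counts and data are from that sketch):

def score_with_missing(example, cls):
    # Numerator of P(cls|example); attributes whose value is None are omitted
    p = class_counts[cls] / len(data)
    for i, v in enumerate(example):
        if v is None:
            continue
        p *= value_counts[(i, v, cls)] / class_counts[cls]
    return p

print(score_with_missing((None, "cool", "high", True), "yes"))  # ~0.0238
print(score_with_missing((None, "cool", "high", True), "no"))   # ~0.0343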

Handling Numeric Attributes
• We would like to classify the following new example: outlook=sunny, temperature=66, humidity=90, windy=true
• Question: How to calculate
P(temperature=66|yes)=?, P(humidity=90|yes)=?
P(temperature=66|no)=?, P(humidity=90|no)=?

Using Probability Density Function
• Answer: By assuming that numerical values have a normal (Gaussian, bell curve) probability distribution and using the probability density function
• For a normal distribution with mean μ and standard deviation σ, the probability density function is:
Image from http://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg
f(x) = 1/(σ·√(2π)) · e^(−(x−μ)²/(2σ²))

More on Probability Density Functions
• What is the meaning of the probability density function of a continuous random variable?
• closely related to probability but not exactly the probability (e.g. the probability that x is exactly 66 is 0)
• = the probability that the value falls in a small interval (x − ε/2, x + ε/2) is approximately ε·f(x)
f(x) = 1/(σ·√(2π)) · e^(−(x−μ)²/(2σ²))

Calculating Probabilities Using the Probability Density Function
For class yes, temperature has mean μ = 73 and standard deviation σ = 6.2, so:
f(temperature=66|yes) = 1/(6.2·√(2π)) · e^(−(66−73)²/(2·6.2²)) = 0.034
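A sketch of this computation in Python (μ = 73 and σ = 6.2 are the sample mean and standard deviation of the temperature values for the play=yes examples):

import math

def gaussian_pdf(x, mu, sigma):
    # normal probability density function
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_pdf(66, 73, 6.2))  # ~0.034, used as the likelihood of temperature=66 given yes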
