COMP3308/3608, Lecture 6a
ARTIFICIAL INTELLIGENCE
Statistical-Based Learning (Naïve Bayes)
Reference: Witten, Frank, Hall and Pal, ch. 4.2, pp. 96-105; Russell and Norvig, ch. 20, pp. 802-810
• Bayes theorem
• Naïve Bayes algorithm
• Naïve Bayes – issues
• Zero probabilities – Laplace correction
• Dealing with missing values
• Dealing with numeric attributes
What is Bayesian Classification?
• Bayesian classifiers are statistical classifiers
• They can predict the class membership probability, i.e. the probability that a given example belongs to a particular class
• They are based on the Bayes Theorem
Thomas Bayes (1702-1761)
Bayes Theorem
• Given a hypothesis H and evidence E for this hypothesis, the probability of H given E is:
• Example: Given are instances of fruits, described by their color and shape. Let:
– E is red and round
– H is the hypothesis that E is an apple
• What are:
• P(H|E)=?
• P(E|H)=?
P(H|E) = P(E|H)·P(H) / P(E)
Bayes Theorem – Example (cont. 1)
• P(H|E) is the probability that E is an apple, given that we have seen that E is red and round
• Called posterior probability of H conditioned on E
• P(H) is the probability that any given example is an apple,
regardless of how it looks
• Called prior probability of H
• The posterior probability is based on more information than the prior probability, which is independent of E
Bayes Theorem – Example (cont. 2)
• What is P(E|H)?
• the probability of E conditioned on H (the likelihood of the evidence)
• the probability that E is red and round, given that we know that E is an apple
• What is P(E)?
• the prior probability of E
• The probability that an example from the fruit data set is red and round
Bayes Theorem for Problem Solving
• Given: A doctor knows that
• Meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50 000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has a stiff neck, what is the probability that he has meningitis?
P(H|E) = P(E|H)·P(H) / P(E)
Bayes Theorem for Problem Solving – Answer
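Substituting the given values (H = the patient has meningitis, E = the patient has a stiff neck):

P(meningitis|stiff neck) = P(stiff neck|meningitis)·P(meningitis) / P(stiff neck) = (0.5 · 1/50,000) / (1/20) = 0.0002

So even given a stiff neck, the probability of meningitis is only 1 in 5,000.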
Naïve Bayes Algorithm
• The Bayes Theorem can be applied to classification tasks => the Naïve Bayes algorithm
• While 1R makes decisions based on a single attribute, Naive Bayes uses all attributes and allows them to make contributions to the decision that are equally important and independent of one another
• Assumptions of the Naïve Bayes algorithm
• 1) Independence assumption – (the values of the) attributes are conditionally independent of each other, given the class (i.e. for each class value)
• 2) Equal importance assumption – all attributes are equally important
• Unrealistic assumptions! => it is called Naive Bayes
• Attributes are dependent on one another
• Attributes are not equally important
• But these assumptions lead to a simple method which works
surprisingly well in practice!
Naive Bayes on the Weather Example
• Given: the weather data
• Task: use Naïve Bayes to predict the class (yes or no) of the new example
outlook=sunny, temperature=cool, humidity=high, windy=true
The Bayes Theorem:
What are H and E?
• the evidence E is the new example
• the hypothesis H is play=yes (and there is another H: play=no)
How to use the Bayes Theorem for classification?
Calculate P(H|E) for each H (class), i.e. P(yes|E) and P(no|E)
Compare them and assign E to the class with the highest probability
OK, but for P(H|E) we need to calculate P(E), P(H) and P(E|H) – how to do this? From the given data (this is the training phase of the classifier)
P(H|E) = P(E|H)·P(H) / P(E)
outlook   temp.  humidity  windy  play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no
Naive Bayes on the Weather Example (2)
• We need to calculate and compare P(yes|E) and P(no|E)
outlook=sunny,temperature=cool, humidity=high, windy=true
1) How to calculate P(E|yes) and P(E|no) ?
Let’s split the evidence E into 4 smaller pieces of evidence:
• E1 = outlook=sunny, E2 = temperature=cool
• E3 = humidity=high, E4 = windy=true
Let’s use the Naïve Bayes’s independence assumption: E1, E2, E3 and E4 are independent given the class. Then, their combined probability is obtained by multiplication of per-attribute probabilities:
P(yes|E) = P(E|yes)·P(yes) / P(E)
P(no|E) = P(E|no)·P(no) / P(E)

P(E|yes) = P(E1|yes) · P(E2|yes) · P(E3|yes) · P(E4|yes)
P(E|no) = P(E1|no) · P(E2|no) · P(E3|no) · P(E4|no)
Naive Bayes on the Weather Example (3)
P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)
P(no|E) = P(E1|no) P(E2|no) P(E3|no) P(E4|no) P(no) / P(E)
• In summary:
• Numerator – the probabilities will be estimated from the data
• Denominator – the two denominators are the same (P(E)) and since we are comparing the two fractions, we can just compare the numerators => there is no need to calculate P(E)
Calculating the Probabilities from the Training Data
E1 = outlook=sunny, E2 = temperature=cool E3 = humidity=high, E4 = windy=true
P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

(weather data table as above)
• P(E1|yes)=P(outlook=sunny|yes)=? • P(E2|yes)=P(temp=cool|yes)=?
• P(E3|yes)=P(humidity=high|yes)=? • P(E4|yes)=P(windy=true|yes)=?
• P(yes)=?
Calculating the Probabilities from the Training Data
E1 = outlook=sunny, E2 = temperature=cool E3 = humidity=high, E4 = windy=true
• P(E1|yes)=P(outlook=sunny|yes) = ?/9 = 2/9 • P(E2|yes)=P(temp=cool|yes)=?
• P(E3|yes)=P(humidity=high|yes)=?
• P(E4|yes)=P(windy=true|yes)=?
• P(yes)=?
P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)

(weather data table as above)
Calculating the Probabilities (2)

Weather data – counts and probabilities:

             counts        probabilities
             yes   no      yes    no
outlook
  sunny       2     3      2/9    3/5
  overcast    4     0      4/9    0/5
  rainy       3     2      3/9    2/5
temperature
  hot         2     2      2/9    2/5
  mild        4     2      4/9    2/5
  cool        3     1      3/9    1/5
humidity
  high        3     4      3/9    4/5
  normal      6     1      6/9    1/5
windy
  false       6     2      6/9    2/5
  true        3     3      3/9    3/5
play
              9     5      9/14   5/14

9/14 – the proportion of days when play is yes
6/9 – the proportion of days when humidity is normal among the days when play is yes, i.e. the probability of humidity to be normal given that play=yes
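A small Python sketch (not part of the lecture) that derives these counts and conditional probabilities from the 14 training examples:

```python
from collections import Counter, defaultdict

# The 14 weather examples: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

class_counts = Counter(row[-1] for row in data)   # {'yes': 9, 'no': 5}
value_counts = defaultdict(Counter)               # counts per (attribute, class)
for row in data:
    cls = row[-1]
    for attr, value in zip(attributes, row[:-1]):
        value_counts[(attr, cls)][value] += 1

# e.g. P(outlook=sunny|yes) = 2/9 and P(outlook=sunny|no) = 3/5
print(value_counts[("outlook", "yes")]["sunny"] / class_counts["yes"])
print(value_counts[("outlook", "no")]["sunny"] / class_counts["no"])
```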
Calculating the Probabilities (3)
P(yes | E) = ?
P(E1|yes) = P(outlook=sunny|yes) = 2/9
P(E2|yes) = P(temperature=cool|yes) = 3/9
P(E3|yes) = P(humidity=high|yes) = 3/9
P(E4|yes) = P(windy=true|yes) = 3/9
P(yes) =? – the probability of a yes without knowing any E, i.e. anything about the particular day; the prior probability of yes; P(yes) = 9/14
P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)
Final Calculations
• By substituting the respective evidence probabilities:
P(yes|E) = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / P(E) = 0.0053 / P(E)

• Similarly calculating P(no|E):

P(no|E) = (3/5 · 1/5 · 4/5 · 3/5 · 5/14) / P(E) = 0.0206 / P(E)

P(no|E) > P(yes|E)
• => for the new day play = no is more likely than play = yes
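For illustration, a minimal Python sketch of this comparison, using the probabilities read off the counts table (not part of the lecture):

```python
# Numerators for the new day: outlook=sunny, temp=cool, humidity=high, windy=true
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # = 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # = 0.0206

# P(E) cancels out, so compare the numerators directly
print("play =", "yes" if p_yes > p_no else "no")  # -> play = no

# Normalising gives the actual posteriors: P(yes|E) ~ 0.20, P(no|E) ~ 0.80
print(p_yes / (p_yes + p_no), p_no / (p_yes + p_no))
```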
Another Example
• Use the NB classifier to solve the following problem:
• Consider a volleyball game between team A and team B
• Team A has won 65% of the time and team B has won 35%
• Among the games won by team A, 30% were when playing on team B’s court
• Among the games won by team B, 75% were when playing at home
• If team B is hosting the next match, which team is most likely to win?
, COMP3308/3608 AI, week 6a, 2022 18
Three More Things About Naïve Bayes
• How to deal with probability values of zero in the numerator?
• How to deal with missing values?
• How to deal with numeric attributes?
Problem – Probability Values of 0
• Suppose that the training data was different: outlook=sunny had always occurred together with play=no (i.e.
outlook=sunny had never occurred together with play=yes )
• Then: P(outlook=sunny|yes)=0 and P(outlook=sunny|no)=1
outlook counts for class yes: sunny 0, overcast 4, rainy 3 → probabilities: sunny 0/9, overcast 4/9, rainy 3/9
P(yes|E) = P(E1|yes) P(E2|yes) P(E3|yes) P(E4|yes) P(yes) / P(E)
• => the final probability P(yes|E)=0, no matter what the other probabilities are
• This is not good!
• The other probabilities are completely ignored due to the multiplication with 0
• => The prediction for new examples with outlook=sunny will always be no, regardless of the values of the other attributes
A Simple Trick to Avoid This Problem
• Assume that our training data is so large that adding 1 to each count would not make a difference in calculating the probabilities …
• but it will avoid the case of 0 probability
• This is called the Laplace correction or Laplace estimator
“What we know is not much. What we do not know is immense.”
Pierre-Simon Laplace (1749-1827)
Image from http://en.wikipedia.org/wiki/File:Pierre-Simon_Laplace.jpg
Laplace Correction
• Add 1 to the numerator and k to the denominator, where k is the number of attribute values for the given attribute
• Example:
• A dataset with 2000 examples, 2 classes: buy_Mercedes=yes and
buy_Mercedes=no; 1000 examples in each class
• 1 of the attributes is income with 3 values: low, medium and high
• For class buy_Mercedes=yes , there are 0 examples with income=low, 10 with income=medium and 990 with income=high
• Probabilities without the Laplace correction for class yes: 0/1000=0, 10/1000=0.01, 990/1000=0.99
• Probabilities with the Laplace correction: 1/1003=0.001, 11/1003=0.011, 991/1003=0.988
• The adjusted probabilities remain close to the original ones, yet the 0 probability value is avoided!
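A small Python sketch of the correction (hypothetical helper, not from the lecture):

```python
def laplace_prob(count, class_total, k):
    """P(value|class) with the Laplace correction:
    add 1 to the count and k (the number of attribute values) to the denominator."""
    return (count + 1) / (class_total + k)

# income attribute (k=3 values) for class buy_Mercedes=yes (1000 examples)
for value, count in {"low": 0, "medium": 10, "high": 990}.items():
    print(value, round(laplace_prob(count, 1000, 3), 3))
# low 0.001, medium 0.011, high 0.988
```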
Laplace Correction – Modified Weather Example

P(sunny|yes) = 0/9 → problem
P(overcast|yes) = 4/9
P(rainy|yes) = 3/9

Laplace correction:
P(sunny|yes) = (0+1)/(9+3) = 1/12
P(overcast|yes) = (4+1)/(9+3) = 5/12
P(rainy|yes) = (3+1)/(9+3) = 4/12

• Assumes that there are 3 more examples from class yes, 1 for each value of outlook
• This results in adding 1 to the numerator and 3 to the denominator of all probabilities
• Ensures that an attribute value which occurs 0 times will receive a nonzero (although small) probability
Generalization of the Laplace Correction: M-estimate
• Add a small constant m to each denominator and m·pi to each numerator, where pi is the prior probability of the i-th value of the attribute:
P(sunny|yes) = (2 + m·p1) / (9 + m)
P(overcast|yes) = (4 + m·p2) / (9 + m)
P(rainy|yes) = (3 + m·p3) / (9 + m)
• Note that p1+p2+…+pn=1, n – number of attribute values
• Advantage of using prior probabilities – it is rigorous
• Disadvantage – computationally expensive to estimate prior probabilities
• Large m – the prior probabilities are very important compared with the new evidence coming in from the training data; small m – less important
• Typically we assume that each attribute value is equally probable, i.e. p1=p2=…=pn =1/n
• The Laplace correction is a special case of the m-estimate, where p1=p2=…=pn =1/n and m=n. Thus, 1 is added to the numerator and m to the denominator.
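A one-function Python sketch of the m-estimate (hypothetical names; note that m=n with pi=1/n reproduces the Laplace correction):

```python
def m_estimate(count, class_total, m, p):
    """P(value|class) with the m-estimate: (count + m*p) / (class_total + m)."""
    return (count + m * p) / (class_total + m)

# With m = n = 3 attribute values and p = 1/3, this equals the Laplace correction:
print(m_estimate(0, 9, 3, 1/3))   # (0 + 1) / (9 + 3) = 1/12
print((0 + 1) / (9 + 3))          # same value
```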
Handling Missing Values – Easy
• Missing attribute value in the new example – do not include this attribute
• e.g. outlook=?, temperature=cool, humidity=high, windy=true
• outlook is not included. Compare these results with the previous results!
• As one of the fractions is missing, the probabilities are higher but the comparison is fair – there is a missing fraction in both cases
• Missing attribute value in a training example – do not include this value in the counts
• Calculate the probabilities based on the number of values that actually occur (are not missing) and not on the total number of training examples
P(yes|E) = (3/9 · 3/9 · 3/9 · 9/14) / P(E) = 0.0238 / P(E)
P(no|E) = (1/5 · 4/5 · 3/5 · 5/14) / P(E) = 0.0343 / P(E)
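A minimal Python sketch of this rule, representing a missing value as None (hypothetical representation, not from the lecture):

```python
# Conditional probabilities P(value|yes) read off the counts table
probs_yes = {
    "outlook": {"sunny": 2/9, "overcast": 4/9, "rainy": 3/9},
    "temperature": {"hot": 2/9, "mild": 4/9, "cool": 3/9},
    "humidity": {"high": 3/9, "normal": 6/9},
    "windy": {"true": 3/9, "false": 6/9},
}

def numerator(example, cond_probs, prior):
    """Product of P(value|class) over the attributes that are present."""
    p = prior
    for attr, value in example.items():
        if value is not None:          # missing value: leave the attribute out
            p *= cond_probs[attr][value]
    return p

example = {"outlook": None, "temperature": "cool", "humidity": "high", "windy": "true"}
print(numerator(example, probs_yes, 9/14))   # 3/9 * 3/9 * 3/9 * 9/14 = 0.0238
```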
Handling Numeric Attributes
• We would like to classify the following new example: outlook=sunny, temperature=66, humidity=90, windy=true
• Question: How to calculate
P(temperature=66|yes)=?, P(humidity=90|yes)=?
P(temperature=66|no)=?, P(humidity=90|no) ?
Using Probability Density Function
• Answer: By assuming that numerical values have a normal (Gaussian, bell curve) probability distribution and using the probability density function
• For a normal distribution with mean μ and standard deviation σ, the probability density function is:

f(x) = (1 / (σ·√(2π))) · e^(−(x−μ)² / (2σ²))
Image from http://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg
More on Probability Density Functions
• What is the meaning of the probability density function of a continuous random variable?
• closely related to probability but not exactly the probability (e.g. the probability that x is exactly 66 is 0)
• the probability that the value lies in a small interval (x − ε/2, x + ε/2) is approximately ε·f(x)

f(x) = (1 / (σ·√(2π))) · e^(−(x−μ)² / (2σ²))
Calculating Probabilities Using Probability Density Function
f(temperature=66|yes) = (1 / (6.2·√(2π))) · e^(−(66−73)² / (2·6.2²)) = 0.034
(73 – mean, 6.2 – standard deviation of temperature for class yes)

f(humidity=90|yes) = 0.0221

P(yes|E) = (2/9 · 0.034 · 0.0221 · 3/9 · 9/14) / P(E) = 0.000036 / P(E)
P(no|E) = (3/5 · 0.0291 · 0.038 · 3/5 · 5/14) / P(E) = 0.000136 / P(E)

=> P(no|E) > P(yes|E) => no play
• Compare with the categorical weather data!
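A short Python sketch of the density calculation; the humidity parameters (mean 79.1, standard deviation 10.2 for class yes) are the textbook's values:

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) for mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_density(66, 73, 6.2))     # f(temperature=66|yes) ~ 0.034
print(gaussian_density(90, 79.1, 10.2))  # f(humidity=90|yes)    ~ 0.0221
```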
Mean and Standard Deviation – Reminder
• A reminder how to calculate the mean value μ and standard deviation σ:
X is a random variable with values x1, x2, …, xn

μ = (x1 + x2 + … + xn) / n

σ = √( ((x1−μ)² + (x2−μ)² + … + (xn−μ)²) / (n−1) )

Note that the denominator of σ is n−1, not n
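For example, using Python's statistics module (whose stdev divides by n−1), applied to the textbook's temperature values for class yes:

```python
import statistics

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]  # temperature when play=yes (textbook data)
print(statistics.mean(temps_yes))    # 73
print(statistics.stdev(temps_yes))   # ~6.16, rounded to 6.2 above; divides by n-1
```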
Naive Bayes – Advantages
• Simple approach – the probabilities are easily computed due to the independence assumption
• Clear semantics for representing, using and learning probabilistic knowledge
• Excellent computational complexity
• Requires 1 scan of the training data to calculate all statistics (for both nominal and continuous attributes, assuming normal distribution)
• Complexity: O(pk), where p – number of training examples, k – number of attributes
• In many cases outperforms more sophisticated learning methods => always try the simple method first!
• Robust to isolated noise points as such points are averaged when estimating the conditional probabilities from data
Naive Bayes – Disadvantages
• Correlated attributes reduce the power of Naïve Bayes
• Violation of the independence assumption
• Solution: apply feature selection beforehand to identify and discard correlated (redundant) attributes
• Normal distribution assumption for numeric attributes – many features are not normally distributed – solutions:
• Discretize the data first, i.e. numerical -> nominal attributes
• Use other probability density functions, e.g. Poisson, binomial, gamma, etc.
• Transform the attribute using a suitable transformation into a normally distributed one (sometimes possible)
• Use kernel density estimation – doesn’t assume any particular distribution