Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 6: Probability-based Learning (Sections 6.1, 6.2, and 6.3)
by John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy
Big Idea
Fundamentals
  Bayes' Theorem
  Bayesian Prediction
  Conditional Independence and Factorization
Standard Approach: The Naive Bayes' Classifier
  A Worked Example
Summary
Big Idea
Figure: A game of find the lady.

Figure: A game of find the lady: (a) the cards dealt face down on a table; and (b) the initial likelihoods of the queen ending up in each position (left, center, or right).

Figure: A game of find the lady: (a) the cards dealt face down on a table; and (b) a revised set of likelihoods for the position of the queen based on evidence collected.

Figure: A game of find the lady: (a) the set of cards after the wind blows over the one on the right; and (b) the revised likelihoods for the position of the queen based on this new evidence.

Figure: A game of find the lady: the final positions of the cards in the game.
We can use estimates of likelihoods to determine the most likely prediction that should be made.
More importantly, we revise these predictions as we collect data and whenever extra evidence becomes available.
Fundamentals
Table: A simple dataset for MENINGITIS diagnosis with descriptive features that describe the presence or absence of three common symptoms of the disease: HEADACHE, FEVER, and VOMITING.
ID  HEADACHE  FEVER  VOMITING  MENINGITIS
1   true      true   false     false
2   false     true   false     false
3   true      false  true      false
4   true      false  true      false
5   false     true   false     true
6   true      false  true      false
7   true      false  true      false
8   true      false  true      true
9   false     true   false     false
10  true      false  true      true
A probability function, P(), returns the probability of a feature taking a specific value.
A joint probability refers to the probability of an assignment of specific values to multiple different features.
A conditional probability refers to the probability of one feature taking a specific value given that we already know the value of a different feature.
A probability distribution is a data structure that describes the probability of each possible value a feature can take. The sum of a probability distribution must equal 1.0.
A joint probability distribution is a probability distribution over more than one feature assignment and is written as a multi-dimensional matrix in which each cell lists the probability of a particular combination of feature values being assigned.
The sum of all the cells in a joint probability distribution must be 1.0.
P(H,F,V,M) =
  P(h,f,v,m),      P(¬h,f,v,m),
  P(h,f,v,¬m),     P(¬h,f,v,¬m),
  P(h,f,¬v,m),     P(¬h,f,¬v,m),
  P(h,f,¬v,¬m),    P(¬h,f,¬v,¬m),
  P(h,¬f,v,m),     P(¬h,¬f,v,m),
  P(h,¬f,v,¬m),    P(¬h,¬f,v,¬m),
  P(h,¬f,¬v,m),    P(¬h,¬f,¬v,m),
  P(h,¬f,¬v,¬m),   P(¬h,¬f,¬v,¬m)
Given a joint probability distribution, we can compute the probability of any event in the domain that it covers by summing over the cells in the distribution where that event is true.
Calculating probabilities in this way is known as summing out.
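As a minimal sketch of summing out (with made-up probability values, not taken from the meningitis dataset), consider a joint distribution over two binary features stored as a Python dictionary:

```python
# Summing out: compute P(headache) from a joint distribution P(H, F)
# by summing every cell in which HEADACHE is true.
# The probability values here are made up for illustration.
joint = {
    (True, True): 0.1,    # P(h, f)
    (True, False): 0.4,   # P(h, ¬f)
    (False, True): 0.2,   # P(¬h, f)
    (False, False): 0.3,  # P(¬h, ¬f)
}

assert abs(sum(joint.values()) - 1.0) < 1e-9  # a distribution must sum to 1.0

p_headache = sum(p for (h, f), p in joint.items() if h)
print(p_headache)  # 0.5
```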
Bayes’ Theorem
P(X|Y) = P(Y|X)P(X) / P(Y)
After a yearly checkup, a doctor informs their patient that he has both bad news and good news. The bad news is that the patient has tested positive for a serious disease and that the test that the doctor has used is 99% accurate (i.e., the probability of testing positive when a patient has the disease is 0.99, as is the probability of testing negative when a patient does not have the disease). The good news, however, is that the disease is extremely rare, striking only 1 in 10,000 people.
What is the actual probability that the patient has the disease?
Why is the rarity of the disease good news given that the patient has tested positive for it?
P(d|t) = P(t|d)P(d) / P(t)

P(t) = P(t|d)P(d) + P(t|¬d)P(¬d)
     = (0.99 × 0.0001) + (0.01 × 0.9999) = 0.0101

P(d|t) = (0.99 × 0.0001) / 0.0101 = 0.0098

So the actual probability that the patient has the disease is less than 1%: the prior is so small that it overwhelms the accuracy of the test, which is why the rarity of the disease is good news.
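The same calculation can be checked in a few lines of Python; the variable names are ours, but the numbers are exactly those of the example:

```python
# The disease-test example: Bayes' Theorem with the divisor expanded
# using the Theorem of Total Probability.
p_d = 0.0001            # prior: the disease strikes 1 in 10,000 people
p_t_given_d = 0.99      # P(test positive | disease)
p_t_given_not_d = 0.01  # P(test positive | no disease)

p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)  # P(t) = 0.0101
p_d_given_t = p_t_given_d * p_d / p_t                  # P(d|t)

print(round(p_t, 4), round(p_d_given_t, 4))  # 0.0101 0.0098
```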
Deriving Bayes' Theorem

P(Y|X)P(X) = P(X|Y)P(Y)

P(X|Y)P(Y) = P(Y|X)P(X)

P(X|Y)P(Y) / P(Y) = P(Y|X)P(X) / P(Y)

P(X|Y) = P(Y|X)P(X) / P(Y)
The divisor is the prior probability of the evidence. This division functions as a normalization constant, which ensures that:

0 ≤ P(X|Y) ≤ 1

Σ_i P(X_i|Y) = 1.0
We can calculate this divisor directly from the dataset:

P(Y) = |{rows where Y is the case}| / |{rows in the dataset}|

Or, we can use the Theorem of Total Probability to calculate this divisor:

P(Y) = Σ_i P(Y|X_i)P(X_i)    (1)
Generalized Bayes’ Theorem
P(t = l | q[1],...,q[m]) = P(q[1],...,q[m] | t = l) × P(t = l) / P(q[1],...,q[m])
Chain Rule
P(q[1],...,q[m]) = P(q[1]) × P(q[2]|q[1]) × ··· × P(q[m]|q[m−1],...,q[2],q[1])

To apply the chain rule to a conditional probability we just add the conditioning term to each term in the expression:

P(q[1],...,q[m] | t = l) = P(q[1]|t = l) × P(q[2]|q[1], t = l) × ··· × P(q[m]|q[m−1],...,q[2],q[1], t = l)
Returning to the meningitis dataset, consider the following query instance:

HEADACHE  FEVER  VOMITING  MENINGITIS
true      false  true      ?
P(M|h,¬f,v) = ?

In terms of Bayes' Theorem, this problem can be stated as:

P(M|h,¬f,v) = P(h,¬f,v|M) × P(M) / P(h,¬f,v)

There are two values in the domain of the MENINGITIS feature, 'true' and 'false', so we have to do this calculation twice.
We will do the calculation for m first. To carry out this calculation we need to know the following probabilities: P(m), P(h,¬f,v), and P(h,¬f,v | m).
We can calculate the required probabilities directly from the data. For example, we can calculate P(m) and P(h,¬f,v) as follows:

P(m) = |{d5, d8, d10}| / |{d1, d2, d3, d4, d5, d6, d7, d8, d9, d10}| = 3/10 = 0.3

P(h,¬f,v) = |{d3, d4, d6, d7, d8, d10}| / |{d1, d2, d3, d4, d5, d6, d7, d8, d9, d10}| = 6/10 = 0.6
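These counting-based estimates are easy to check in code. The sketch below encodes the dataset rows d1 to d10 as (HEADACHE, FEVER, VOMITING, MENINGITIS) tuples, exactly as in the table above:

```python
# The meningitis dataset, one (HEADACHE, FEVER, VOMITING, MENINGITIS)
# tuple per instance d1..d10.
data = [
    (True, True, False, False),   # d1
    (False, True, False, False),  # d2
    (True, False, True, False),   # d3
    (True, False, True, False),   # d4
    (False, True, False, True),   # d5
    (True, False, True, False),   # d6
    (True, False, True, False),   # d7
    (True, False, True, True),    # d8
    (False, True, False, False),  # d9
    (True, False, True, True),    # d10
]

# P(m): fraction of rows where MENINGITIS is true.
p_m = sum(1 for h, f, v, m in data if m) / len(data)
# P(h, ¬f, v): fraction of rows matching the evidence.
p_h_notf_v = sum(1 for h, f, v, m in data if h and not f and v) / len(data)
print(p_m, p_h_notf_v)  # 0.3 0.6
```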
However, as an exercise, we will use the chain rule to calculate:

P(h,¬f,v | m) = ?
Using the chain rule:

P(h,¬f,v | m) = P(h|m) × P(¬f | h,m) × P(v | ¬f,h,m)
             = |{d8, d10}| / |{d5, d8, d10}| × |{d8, d10}| / |{d8, d10}| × |{d8, d10}| / |{d8, d10}|
             = 2/3 × 2/2 × 2/2 = 0.6666
So the calculation of P(m|h,¬f,v) is:

P(m|h,¬f,v) = (P(h|m) × P(¬f|h,m) × P(v|¬f,h,m) × P(m)) / P(h,¬f,v)
            = (0.6666 × 0.3) / 0.6 = 0.3333
The corresponding calculation for P(¬m|h,¬f,v) is:

P(¬m|h,¬f,v) = (P(h,¬f,v | ¬m) × P(¬m)) / P(h,¬f,v)
             = (P(h|¬m) × P(¬f | h,¬m) × P(v | ¬f,h,¬m) × P(¬m)) / P(h,¬f,v)
             = (0.7143 × 0.8 × 1.0 × 0.7) / 0.6 = 0.6667
P(m|h,¬f,v) = 0.3333
P(¬m|h,¬f,v) = 0.6667

These calculations tell us that it is twice as probable that the patient does not have meningitis as it is that they do, even though the patient is suffering from a headache and is vomiting!
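We can sanity-check this result directly by counting, reusing the `data` list defined earlier: among the rows matching the evidence (h, ¬f, v), what fraction has MENINGITIS = true?

```python
# Direct check: P(m | h, ¬f, v) = count(h, ¬f, v, m) / count(h, ¬f, v).
matching = [row for row in data if row[0] and not row[1] and row[2]]
p_m_given_evidence = sum(1 for *_, m in matching if m) / len(matching)
print(round(p_m_given_evidence, 4))  # 0.3333 (agrees with the chain-rule result)
```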
The Paradox of the False Positive
The mistake of forgetting to factor in the prior gives rise to the paradox of the false positive, which states that in order to make predictions about a rare event, the model has to be as accurate as the event is rare, or there is a significant chance of false positive predictions (i.e., predicting the event when it is not the case).
Maximum a posteriori prediction (MAP)

Bayesian MAP Prediction Model

M_MAP(q) = argmax_{l ∈ levels(t)} P(t = l | q[1],...,q[m])
         = argmax_{l ∈ levels(t)} P(q[1],...,q[m] | t = l) × P(t = l) / P(q[1],...,q[m])

Bayesian MAP Prediction Model (without normalization)

M_MAP(q) = argmax_{l ∈ levels(t)} P(q[1],...,q[m] | t = l) × P(t = l)
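A minimal sketch of the unnormalized MAP rule (our own illustration, not code from the book): score each target level by the likelihood of the query times the prior, and return the level with the highest score. The `likelihood` and `prior` arguments are hypothetical stand-ins for whatever estimates a concrete model supplies.

```python
# Unnormalized Bayesian MAP prediction: argmax over target levels of
# P(q | t = l) * P(t = l).
def map_prediction(q, levels, likelihood, prior):
    return max(levels, key=lambda l: likelihood(q, l) * prior(l))

# Toy usage with the likelihoods from the worked example:
# P(h,¬f,v | m) = 0.6666 and P(h,¬f,v | ¬m) = 0.7143 * 0.8 * 1.0 = 0.5714.
prediction = map_prediction(
    q=("h", "not f", "v"),
    levels=[True, False],
    likelihood=lambda q, l: 0.6666 if l else 0.5714,
    prior=lambda l: 0.3 if l else 0.7,
)
print(prediction)  # False, since 0.5714 * 0.7 > 0.6666 * 0.3
```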
Now consider a second query instance:

HEADACHE  FEVER  VOMITING  MENINGITIS
true      true   false     ?

P(m | h,f,¬v) = ?    P(¬m | h,f,¬v) = ?
P(m | h,f,¬v) = (P(h|m) × P(f | h,m) × P(¬v | f,h,m) × P(m)) / P(h,f,¬v)
             = (0.6666 × 0 × 0 × 0.3) / 0.1 = 0
P(¬m | h,f,¬v) = (P(h|¬m) × P(f | h,¬m) × P(¬v | f,h,¬m) × P(¬m)) / P(h,f,¬v)
              = (0.7143 × 0.2 × 1.0 × 0.7) / 0.1 = 1.0
P(m | h,f,¬v) = 0
P(¬m | h,f,¬v) = 1.0
There is something odd about these results!
Curse of Dimensionality

As the number of descriptive features grows, the number of potential conditioning events grows exponentially. Consequently, the size of the dataset must grow exponentially as each new descriptive feature is added, to ensure that for any conditional probability there are enough instances in the training dataset matching the conditions for the resulting probability estimate to be reasonable.
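The meningitis dataset illustrates the problem directly; reusing the `data` list from earlier, each extra conditioning feature shrinks the set of matching training rows until nothing is left:

```python
# Each extra conditioning event leaves fewer matching instances,
# so the probability estimates rest on less and less data.
m_rows = [r for r in data if r[3]]        # MENINGITIS = true: 3 rows
m_h = [r for r in m_rows if r[0]]         # ... and HEADACHE = true: 2 rows
m_h_f = [r for r in m_h if r[1]]          # ... and FEVER = true: 0 rows!
print(len(m_rows), len(m_h), len(m_h_f))  # 3 2 0
```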
The probability of a patient who has a headache and a fever having meningitis should be greater than zero!
Our dataset is not large enough → our model is over-fitting to the training data.
The concepts of conditional independence and factorization can help us overcome this flaw of our current approach.
If knowledge of one event has no effect on the probability of another event, and vice versa, then the two events are independent of each other.
If two events X and Y are independent, then:

P(X|Y) = P(X)

P(X,Y) = P(X) × P(Y)
Recall that when two events are dependent, these rules are:

P(X|Y) = P(X,Y) / P(Y)

P(X,Y) = P(X|Y) × P(Y) = P(Y|X) × P(X)
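These rules are easy to verify numerically on the meningitis dataset (the `data` list from earlier), taking X to be HEADACHE = true and Y to be MENINGITIS = true:

```python
# Checking P(X,Y) = P(X|Y) * P(Y) by counting.
n = len(data)
p_m = sum(1 for *_, m in data if m) / n                   # P(Y) = 0.3
p_h_and_m = sum(1 for h, _, _, m in data if h and m) / n  # P(X,Y) = 0.2
p_h_given_m = p_h_and_m / p_m                             # P(X|Y) = 0.6667
print(round(p_h_given_m * p_m, 4) == round(p_h_and_m, 4))  # True
```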
Full independence between events is quite rare.
A more common phenomenon is that two, or more, events may be independent if we know that a third event has happened.
This is known as conditional independence.
For two events, X and Y, that are conditionally independent given knowledge of a third event, here Z, the definitions of conditional probability and of the probability of a joint event are:

P(X|Y,Z) = P(X|Z)

P(X,Y|Z) = P(X|Z) × P(Y|Z)

For comparison, when X and Y are dependent:

P(X|Y) = P(X,Y) / P(Y)

P(X,Y) = P(X|Y) × P(Y) = P(Y|X) × P(X)

and when X and Y are independent:

P(X|Y) = P(X)

P(X,Y) = P(X) × P(Y)
If the event t = l causes the events q[1],...,q[m] to happen, then the events q[1],...,q[m] are conditionally independent of each other given knowledge of t = l, and the chain rule definition can be simplified as follows:

P(q[1],...,q[m] | t = l) = P(q[1]|t = l) × P(q[2]|t = l) × ··· × P(q[m]|t = l)
                         = ∏_{i=1}^{m} P(q[i] | t = l)
Using this, we can simplify the calculations in Bayes' Theorem, under the assumption of conditional independence between the descriptive features given the level l of the target feature:

P(t = l | q[1],...,q[m]) = (∏_{i=1}^{m} P(q[i] | t = l)) × P(t = l) / P(q[1],...,q[m])
Without conditional independence:

P(X,Y,Z,W) = P(W) × P(X|W) × P(Y|X,W) × P(Z|Y,X,W)

With conditional independence (X, Y, and Z conditionally independent given W):

P(X,Y,Z,W) = P(W) × P(X|W) × P(Y|W) × P(Z|W)

where the four terms are Factor 1 = P(W), Factor 2 = P(X|W), Factor 3 = P(Y|W), and Factor 4 = P(Z|W).
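A quick parameter count shows why this factorization matters. For four binary events, the full joint distribution has 2^4 = 16 cells, while the factorization needs only the entries stored in the four factors (1 for P(W) and 2 each for the three conditionals, matching the factor lists below); the gap grows exponentially as more features are added:

```python
# Storage needed for four binary events: full joint vs. factorization.
full_joint_cells = 2 ** 4         # 16 cells in P(X,Y,Z,W)
factored_entries = 1 + 2 + 2 + 2  # <P(W)> plus <P(.|w), P(.|¬w)> per conditional
print(full_joint_cells, factored_entries)  # 16 7
```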
The joint probability distribution for the meningitis dataset.
P(H,F,V,M) =
  P(h,f,v,m),      P(¬h,f,v,m),
  P(h,f,v,¬m),     P(¬h,f,v,¬m),
  P(h,f,¬v,m),     P(¬h,f,¬v,m),
  P(h,f,¬v,¬m),    P(¬h,f,¬v,¬m),
  P(h,¬f,v,m),     P(¬h,¬f,v,m),
  P(h,¬f,v,¬m),    P(¬h,¬f,v,¬m),
  P(h,¬f,¬v,m),    P(¬h,¬f,¬v,m),
  P(h,¬f,¬v,¬m),   P(¬h,¬f,¬v,¬m)
Assuming the descriptive features are conditionally independent of each other given MENINGITIS, we only need to store four factors:

Factor 1: < P(M) >
Factor 2: < P(h|m), P(h|¬m) >
Factor 3: < P(f|m), P(f|¬m) >
Factor 4: < P(v|m), P(v|¬m) >

P(H,F,V,M) = P(M) × P(H|M) × P(F|M) × P(V|M)
Calculate the factors from the data:

Factor 1: < P(M) >
Factor 2: < P(h|m), P(h|¬m) >
Factor 3: < P(f|m), P(f|¬m) >
Factor 4: < P(v|m), P(v|¬m) >
Factor 1: < P(m) = 0.3 >
Factor 2: < P(h|m) = 0.6666, P(h|¬m) = 0.7143 >
Factor 3: < P(f|m) = 0.3333, P(f|¬m) = 0.4286 >
Factor 4: < P(v|m) = 0.6666, P(v|¬m) = 0.5714 >
Using the factors above, calculate the probability of MENINGITIS = 'true' for the following query:

HEADACHE  FEVER  VOMITING  MENINGITIS
true      true   false     ?
P(m|h,f,¬v) = (P(h|m) × P(f|m) × P(¬v|m) × P(m)) / Σ_i (P(h|M_i) × P(f|M_i) × P(¬v|M_i) × P(M_i))
            = (0.6666 × 0.3333 × 0.3333 × 0.3) / ((0.6666 × 0.3333 × 0.3333 × 0.3) + (0.7143 × 0.4286 × 0.4286 × 0.7))
            = 0.1948
Using the factors above, calculate the probability of MENINGITIS = 'false' for the same query:

HEADACHE  FEVER  VOMITING  MENINGITIS
true      true   false     ?
P(¬m|h,f,¬v) = (P(h|¬m) × P(f|¬m) × P(¬v|¬m) × P(¬m)) / Σ_i (P(h|M_i) × P(f|M_i) × P(¬v|M_i) × P(M_i))
             = (0.7143 × 0.4286 × 0.4286 × 0.7) / ((0.6666 × 0.3333 × 0.3333 × 0.3) + (0.7143 × 0.4286 × 0.4286 × 0.7))
             = 0.8052
P(m|h,f,¬v) = 0.1948
P(¬m|h,f,¬v) = 0.8052

As before, the MAP prediction would be MENINGITIS = 'false'.

This time, however, the posterior probabilities are not as extreme!
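As a quick check, the normalized posteriors can be reproduced from the rounded factor values above (a small arithmetic sketch):

```python
# Naive Bayes posteriors for the query (h, f, ¬v), normalized over
# both target levels. Uses the rounded factor values from above.
score_m = 0.6666 * 0.3333 * (1 - 0.6666) * 0.3      # P(h|m)P(f|m)P(¬v|m)P(m)
score_not_m = 0.7143 * 0.4286 * (1 - 0.5714) * 0.7  # P(h|¬m)P(f|¬m)P(¬v|¬m)P(¬m)
z = score_m + score_not_m
print(round(score_m / z, 4), round(score_not_m / z, 4))  # 0.1948 0.8052
```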
Standard Approach: The Naive Bayes’ Classifier
Naive Bayes’ Classifier
M(q) = argmax_{l ∈ levels(t)} (∏_{i=1}^{m} P(q[i] | t = l)) × P(t = l)
Naive Bayes’ is simple to train!
1. Calculate the priors for each of the target levels.
2. Calculate the conditional probabilities for each feature given each target level.
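Putting the two steps together, here is a minimal naive Bayes trainer and classifier for Boolean features (our own sketch, reusing the `data` list from earlier; a full implementation would also handle categorical and continuous features):

```python
# Step 1 and 2: estimate priors and per-feature conditional probabilities;
# prediction is then the argmax of prior * product of conditionals.
def train_naive_bayes(rows):
    """rows: (feature_1, ..., feature_m, target) tuples of Booleans."""
    levels = {row[-1] for row in rows}
    priors, conditionals = {}, {}
    for l in levels:
        matching = [row for row in rows if row[-1] == l]
        priors[l] = len(matching) / len(rows)  # P(t = l)
        conditionals[l] = [                    # P(q[i] is true | t = l)
            sum(row[i] for row in matching) / len(matching)
            for i in range(len(rows[0]) - 1)
        ]
    return priors, conditionals

def predict(q, priors, conditionals):
    def score(l):
        s = priors[l]
        for i, value in enumerate(q):
            p_true = conditionals[l][i]
            s *= p_true if value else (1 - p_true)
        return s
    return max(priors, key=score)

priors, conditionals = train_naive_bayes(data)
print(predict((True, True, False), priors, conditionals))  # False
```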
Table: A dataset from a loan application fraud detection domain, with descriptive features CREDIT HISTORY, GUARANTOR/COAPPLICANT, and ACCOMMODATION, and target feature FRAUD (20 instances).