
Data Mining (EECS 4412)
Bayesian Classification
Parke Godfrey
EECS
Lassonde School of Engineering York University

Thanks to
Professor Aijun An
for creation & use of these slides.
2

Outline
1. Introduction
2. Bayes Theorem
3. Naïve Bayes Classifier
4. Bayesian Belief Networks
3

Introduction
Goal:
Determine the most probable hypothesis (class)
E.g., given a new instance x, what is its most probable classification?
Probabilistic learning & prediction:
Estimate explicit probabilities for all hypotheses (classes)
Predict multiple hypotheses, weighted by their probabilities
Can combine prior knowledge (such as prior probabilities, probability distributions, causal relationships between variables in belief networks) with observed data
4

Introduction (Cont’d)
Incremental learning:
Each training example can incrementally increase/decrease the probability that a hypothesis is correct.
Flexible in handling inconsistency
Provides a Standard:
provides a standard of optimal decision making against which other methods can be measured
5

Bayes Theorem
P(h | x) = P(x | h) P(h) / P(x)
P(h) = prior probability of hypothesis h
P(x) = probability that example x is observed
P(h | x) = posterior probability of h given x
P(x | h) = conditional probability of x given h
(often called the likelihood of h given x)
6

Finding the Maximum a Posteriori Hypothesis
P(h | x) = P(x | h) P(h) / P(x)
Goal: Find the most probable hypothesis h from a set H of
candidate hypotheses, given an example x.
The most probable hypothesis is called maximum a posteriori (MAP) hypothesis hMAP:
h_MAP(x) = argmax_{h ∈ H} P(h | x)
         = argmax_{h ∈ H} P(x | h) P(h) / P(x)
         = argmax_{h ∈ H} P(x | h) P(h)
(P(x) is constant for all hypotheses)
If we assume P(hi) = P(hj) for all i, j (classes are equally likely), we can simplify further and choose the maximum likelihood (ML) hypothesis:
h_ML(x) = argmax_{h ∈ H} P(x | h)
7

Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases in which the disease is actually present,
The test returns a correct negative result in only 97% of the cases in which the disease is not present.
Furthermore, 0.008 of the entire population has this cancer.
From this information:
P(cancer) = 0.008        P(¬cancer) = 0.992
P(+ | cancer) = 0.98     P(- | cancer) = 0.02
P(+ | ¬cancer) = 0.03    P(- | ¬cancer) = 0.97
Our goal is to find the maximum between:
P(cancer | +) and P(¬cancer | +)
8
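With these numbers, the MAP comparison can be carried out directly; a minimal sketch (the rates are taken from the slide above):

# MAP hypothesis for the cancer-test example.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer          # 0.992
p_pos_given_cancer = 0.98            # correct-positive rate of the test
p_pos_given_not_cancer = 1 - 0.97    # 0.03: false-positive rate

# Unnormalized posteriors P(h | +) ∝ P(+ | h) P(h); P(+) cancels in the argmax.
score_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 = 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 = 0.02976

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map)  # "not cancer"

Despite the positive test, the small prior P(cancer) makes ¬cancer the MAP hypothesis here.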

Learning Probabilities from Data
Suppose we do not know the probabilities used in the example in the last slide.
But we are given a set of data.
In order to conduct the reasoning, i.e., to find the MAP hypothesis hMAP, we can estimate the probabilities used in the reasoning from the data.
Suppose there are k possible hypotheses (i.e., classes): h1, h2, …, hk
We need to estimate:
P(h1), P(h2), …, P(hk),
P(x|h1), P(x|h2), …, P(x|hk) for each possible instance x,
in order to find:
h_MAP(x) = argmax_{hi ∈ H} P(x | hi) P(hi)
9

Practical Problem with Finding MAP Hypothesis
Suppose instance x is described by attribute values ⟨x1, x2, …, xn⟩ and there is a set C of classes: c1, c2, …, cm.
c_MAP(x) = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
         = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
         = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)
Given a data set with many attributes, it is infeasible to estimate P(x1, x2, …, xn | cj) for all possible values of x unless we have a very, very large set of training data. It is also computationally expensive.
10

Naïve Bayes Classifier
Naïve assumption: values of attributes are conditionally independent given a class
P(x1, x2, …, xn | cj) = ∏_i P(xi | cj)
which gives:
c_NB(x) = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)
        = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
Probabilities can be estimated from the training data.
11

Estimating Probabilities
Estimate P(cj):
P(cj) = (# of training examples of class cj) / (total # of training examples)
Estimate P(xi | cj) for each attribute value xi of attribute Ai and each class cj
If attribute Ai is categorical,
P(xi | cj) = (# of training examples of class cj with value xi for Ai) / (# of training examples of class cj)
12

Estimating Probabilities
If attribute Ai is continuous, can assume normal distribution,
P(xi | cj) = (1 / (√(2π) σ_cj)) e^( −(xi − μ_cj)² / (2 σ_cj²) )
where μ_cj and σ_cj are the mean and standard deviation of the values of Ai for training examples of class cj:
σ_cj² = (1 / (n − 1)) Σ_{xi ∈ cj} (xi − μ_cj)²
13
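A minimal sketch of this computation for one continuous attribute; the attribute values below are invented for illustration:

import math

def gaussian_likelihood(x, mu, sigma):
    """P(x | c) under a normal distribution with the class's mean and std dev."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical values of attribute Ai for the training examples of one class:
values = [36.5, 37.0, 36.8, 37.2, 36.9]
n = len(values)
mu = sum(values) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))  # sample std dev, as on the slide

print(gaussian_likelihood(37.1, mu, sigma))  # density used as P(xi | cj) in the product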

Naïve Bayes Algorithm
Naïve Bayes Learning (from examples)
For each class cj
    P̂(cj) ← estimate P(cj)
    For each value xi of each attribute
        P̂(xi | cj) ← estimate P(xi | cj)
Classifying a new instance x
c_NB(x) = argmax_{cj ∈ C} P̂(cj) ∏_{xi ∈ x} P̂(xi | cj)
14
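A minimal Python sketch of this learning and classification loop, using the buys_computer training data from the following slides (plain relative-frequency estimates, no smoothing; the helper names are illustrative):

from collections import Counter, defaultdict

# Training data from the next slide: (age, income, student, credit_rating) -> buys_computer
ATTRS = ["age", "income", "student", "credit_rating"]
DATA = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

# Learning: estimate P(cj) and P(xi | cj) by relative frequencies.
class_counts = Counter(row[-1] for row in DATA)
n = len(DATA)
prior = {c: cnt / n for c, cnt in class_counts.items()}

cond_counts = defaultdict(Counter)  # (class, attribute) -> Counter over attribute values
for *xs, c in DATA:
    for attr, value in zip(ATTRS, xs):
        cond_counts[(c, attr)][value] += 1

def p_value_given_class(attr, value, c):
    return cond_counts[(c, attr)][value] / class_counts[c]

# Classification: c_NB(x) = argmax_c P(c) * prod_i P(xi | c)
def classify(x):
    scores = {}
    for c in class_counts:
        score = prior[c]
        for attr, value in zip(ATTRS, x):
            score *= p_value_given_class(attr, value, c)
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = classify(("<=30", "medium", "yes", "fair"))
print(label, scores)  # "yes"; scores ≈ {"yes": 0.028, "no": 0.007}, as in the worked example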

Example
Training dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Classes: c1:buys_computer=‘yes’ c2:buys_computer=‘no’
Classify new example:
x = (age <= 30, income = medium, student = yes, credit_rating = fair)
15

Example (cont’d)
Learning: compute P(ci)
P(buys_computer = “yes”) = 9/14
P(buys_computer = “no”) = 5/14
Compute P(xi | ci) for each class and each attribute-value pair:
P(age <= 30 | buys_computer = “yes”) = 2/9 = 0.222
P(age <= 30 | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
⋮
16

Example (cont’d)
Classification: classify x = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(x | ci):
P(x | buys_computer = “yes”)
= P(age <= 30 | buys_computer = yes) × P(income = medium | buys_computer = yes)
  × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(x | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(ci | x) ∝ P(x | ci) × P(ci):
P(buys_computer = “yes” | x) ∝ P(x | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(buys_computer = “no” | x) ∝ P(x | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
x belongs to class “buys_computer = yes”
17

Naïve Bayesian Classifier: Comments
Advantages:
Easy to implement
Good results obtained in most of the cases
Disadvantages:
Assumes class-conditional independence of attributes, therefore loss of accuracy
In practice, dependencies exist among attributes
For example, headache and body temperature are dependent attributes in a flu dataset.
Such dependencies cannot be modeled by the Naïve Bayesian classifier.
How to deal with these dependencies? Bayesian Belief Networks
18

Bayesian Belief Networks
The Naïve Bayes assumption of conditional independence is too restrictive.
But estimating P(x1, x2, …, xn | c) is intractable without some such assumption...
Bayesian belief networks provide an intermediate approach which allows dependencies among attributes but assumes conditional independence among subsets of attributes.
19

Bayesian Belief Networks
A graphical model of causal relationships. Two components:
A directed acyclic graph (DAG): represents dependencies among variables (attributes)
[Figure: example DAG over nodes A, B, C, D, E, F]
• Nodes: variables (including the class attribute)
• Links: dependencies (e.g., A depends on E)
• Parents: immediate predecessors. E.g., A and B are the parents of C; B is the parent of D
• Descendant: X is a descendant of Y if there is a directed path from Y to X
• Acyclic: the graph has no loops or cycles
• Conditional independence:
  • Assumption: each variable is conditionally independent of its nondescendants given its parents
  • Definition: X is conditionally independent of Y given Z iff P(X | Y, Z) = P(X | Z)
  • E.g., C is conditionally independent of D given A and B; thus P(C | A, B, D) = P(C | A, B)
A conditional probability table (CPT) for each variable X: specifies the conditional probability distribution P(X | Parents(X))
20

Example of CPT
Suppose each variable is binary (takes two values: X and ¬X)
[Figure: CPT table for Campfire, given its parents Storm and BusTourGroup]
There is a conditional probability table (CPT) for each variable
21

Inference Rule in Bayesian Networks
The joint probability of any tuple (x1, ..., xn) corresponding to the variables or attributes (X1, ..., Xn) is computed by
P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | Parents(Xi))
Example:
P(¬S, B, ¬L, C, ¬T, F)
= P(¬S) × P(B) × P(¬L | ¬S) × P(C | ¬S, B) × P(¬T | ¬L) × P(F | ¬L, ¬S, C)
22

Inference in Bayesian Networks
A Bayesian network can be used to infer the (probabilities of) values of one or more network variables, given observed values of others.
Example: Given Storm = 0, BusTourGroup = 1, Lightning = 0, Campfire = 1, Thunder = 0, we want to know ForestFire = ?
Compute two probabilities:
(1) P(F | ¬S, B, ¬L, C, ¬T) = P(F | ¬L, ¬S, C)
(2) P(¬F | ¬S, B, ¬L, C, ¬T) = P(¬F | ¬L, ¬S, C)
ForestFire = True if (1) > (2)
23
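A minimal sketch of this factorization for the Storm/BusTourGroup/Lightning/Campfire/Thunder/ForestFire network; every CPT number below is invented for illustration, since the slides do not give the actual tables:

# Hypothetical CPTs. Parents follow the factorization on the slide:
# Lightning <- Storm; Campfire <- Storm, BusTourGroup; Thunder <- Lightning;
# ForestFire <- Storm, Lightning, Campfire. All probabilities are made up.
P_S = {True: 0.2, False: 0.8}
P_B = {True: 0.5, False: 0.5}
P_L_given_S = {True: 0.6, False: 0.05}                    # key: value of Storm
P_C_given_SB = {(True, True): 0.4, (True, False): 0.1,    # key: (Storm, BusTourGroup)
                (False, True): 0.8, (False, False): 0.2}
P_T_given_L = {True: 0.95, False: 0.1}                    # key: value of Lightning
P_F_given_SLC = {}                                        # key: (Storm, Lightning, Campfire)
for s in (True, False):
    for l in (True, False):
        for c in (True, False):
            P_F_given_SLC[(s, l, c)] = 0.05 + 0.3 * l + 0.2 * c + 0.1 * s  # invented

def joint(s, b, l, c, t, f):
    """P(s, b, l, c, t, f) = product over nodes of P(node | parents(node))."""
    p = P_S[s] * P_B[b]
    p *= P_L_given_S[s] if l else 1 - P_L_given_S[s]
    p *= P_C_given_SB[(s, b)] if c else 1 - P_C_given_SB[(s, b)]
    p *= P_T_given_L[l] if t else 1 - P_T_given_L[l]
    p *= P_F_given_SLC[(s, l, c)] if f else 1 - P_F_given_SLC[(s, l, c)]
    return p

# The tuple from the inference-rule slide: P(¬S, B, ¬L, C, ¬T, F)
print(joint(s=False, b=True, l=False, c=True, t=False, f=True))

# The ForestFire query from the slide above, via the joint:
num_true = joint(False, True, False, True, False, True)
num_false = joint(False, True, False, True, False, False)
print(num_true / (num_true + num_false))  # = P(F | ¬L, ¬S, C); the other factors cancel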

Inference in Bayesian Networks
Another example:
Given Storm=1, Campfire=0, ForestFire=1, what is the probability distribution of Thunder?
Compute two probabilities:
(1) P(T | S, ¬C, F) = P(T, L | S, ¬C, F) + P(T, ¬L | S, ¬C, F)
    = P(T | L, S, ¬C, F) P(L | S, ¬C, F) + P(T | ¬L, S, ¬C, F) P(¬L | S, ¬C, F)
    = P(T | L) P(L | S, ¬C, F) + P(T | ¬L) P(¬L | S, ¬C, F)
where
P(L | S, ¬C, F) = P(L, F | S, ¬C) / P(F | S, ¬C)
    = P(F | L, S, ¬C) P(L | S, ¬C) / ( P(F, L | S, ¬C) + P(F, ¬L | S, ¬C) )
    = P(F | L, S, ¬C) P(L | S) / ( P(F, L | S, ¬C) + P(F, ¬L | S, ¬C) )
    = P(F | L, S, ¬C) P(L | S) / ( P(F | L, S, ¬C) P(L | S, ¬C) + P(F | ¬L, S, ¬C) P(¬L | S, ¬C) )
    = P(F | L, S, ¬C) P(L | S) / ( P(F | L, S, ¬C) P(L | S) + P(F | ¬L, S, ¬C) P(¬L | S) )
and similarly
P(¬L | S, ¬C, F) = P(F | ¬L, S, ¬C) P(¬L | S) / ( P(F | L, S, ¬C) P(L | S) + P(F | ¬L, S, ¬C) P(¬L | S) )
(2) P(¬T | S, ¬C, F) can be calculated similarly.
Thunder = True if (1) > (2)
24
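The derivation above maps directly to code once the relevant CPT entries are known; the values below are invented placeholders, not from the slides:

# Hypothetical CPT entries (invented for illustration).
p_T_given_L, p_T_given_notL = 0.95, 0.10     # P(T | L), P(T | ¬L)
p_L_given_S = 0.60                           # P(L | S); Lightning's only parent is Storm
p_F_given_L_S_notC = 0.45                    # P(F | L, S, ¬C)
p_F_given_notL_S_notC = 0.15                 # P(F | ¬L, S, ¬C)

# P(L | S, ¬C, F), following the derivation above.
denom = p_F_given_L_S_notC * p_L_given_S + p_F_given_notL_S_notC * (1 - p_L_given_S)
p_L_given_evidence = p_F_given_L_S_notC * p_L_given_S / denom
p_notL_given_evidence = 1 - p_L_given_evidence

# (1) P(T | S, ¬C, F) = P(T | L) P(L | S, ¬C, F) + P(T | ¬L) P(¬L | S, ¬C, F)
p_T = p_T_given_L * p_L_given_evidence + p_T_given_notL * p_notL_given_evidence
# (2) P(¬T | S, ¬C, F)
p_notT = 1 - p_T

print("Thunder =", p_T > p_notT, p_T)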

Learning of Bayesian Networks
Several scenarios of this learning task:
Network structure might be known or unknown.
Training examples might provide values of all network variables, or just some.
Scenario 1: If structure known and observe all variables:
Then it is as easy as training a Naïve Bayes classifier.
Learn only CPTs (estimate the conditional probabilities from training data)
25
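For this scenario, each CPT entry is simply a conditional relative frequency in the training data; a minimal sketch with invented tuples for the Storm, BusTourGroup, and Campfire variables:

from collections import Counter

# Fully observed training tuples: (Storm, BusTourGroup, Campfire); values invented for illustration.
data = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 1, 1),
        (0, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]

# Estimate the CPT entry P(Campfire=1 | Storm=s, BusTourGroup=b) by counting.
parent_counts = Counter((s, b) for s, b, _ in data)
joint_counts = Counter((s, b) for s, b, c in data if c == 1)

cpt = {sb: joint_counts[sb] / cnt for sb, cnt in parent_counts.items()}
print(cpt)  # e.g. P(C=1 | S=1, B=1) = 2/3 with these made-up tuples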

Learning of Bayesian Networks
Scenario 2: Suppose structure known, variables partially observable
For example, observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire…
Similar to training a neural network with hidden units. In fact, we can learn the network's conditional probability tables using a gradient ascent method!
Scenario 3: When structure unknown
Use heuristic search or a constraint-based technique to search through potential structures
E.g., the K2 algorithm
26

Summary: Bayesian Belief Networks
Combine prior knowledge with observed data
Intermediate approach that allows both dependencies and conditional independencies
Other issues
Extend from categorical to real-valued variables
Parameterized distributions instead of tables
More effective inference and learning methods
…
27

Next Class
Neural Networks
28