Bayesian Learning
COSC 2673-2793 | SEMESTER 1 2021 (COMPUTATIONAL) MACHINE LEARNING
Revision – Probability
P(A | B) = P(A ∩ B) / P(B)

P(A ∩ B) = P(A | B) · P(B)
P(A ∩ B) = P(B | A) · P(A)
Bayes’ Theorem
P(A | B) = P(B | A) · P(A) / P(B)

P(A | B) = P(B | A) · P(A) / ( P(B | A) · P(A) + P(B | Ā) · P(Ā) )
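As a quick sanity check, here is the second form evaluated numerically in Python; the probabilities below are invented purely for illustration.

```python
# Bayes' theorem with P(B) expanded via the law of total probability.
# All numbers below are made up for illustration only.
p_a = 0.01               # P(A)
p_b_given_a = 0.95       # P(B | A)
p_b_given_not_a = 0.05   # P(B | A-bar)

# Denominator: P(B) = P(B|A)·P(A) + P(B|A-bar)·P(A-bar)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A | B) = P(B | A) · P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.4f}")   # ≈ 0.1610
```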
Joint probability distribution
P(M, H, R)
Dataset: Aus Census

M  H  R  Probability
0  0  0  0.253
0  0  1  0.024
0  1  0  0.042
0  1  1  0.012
1  0  0  0.331
1  0  1  0.097
1  1  0  0.134
1  1  1  0.106
ID    M  H  R
1     1  1  1
2     1  0  1
3     0  0  1
4     1  0  0
5     0  1  0
6     0  0  0
7     0  0  1
…     …  …  …
1000  0  1  1
M – Male (1) or Female (0)
H – Works over 40 hours per week (1) or not (0)
R – Rich (1) or Poor (0)
Probability & ML
What do conditional probabilities have to do with ML?

Instead of learning a deterministic function Y = f(X), we can model the conditional distribution P(Y | X), or with multiple features:

P(Y | X₁, X₂, ⋯, Xₘ)

Can we estimate this probability?
Probability & ML
From the joint distribution P(M, H, R) above, we can derive the conditional distribution:

P(R | M, H)

M  H  P(R = 1 | M, H)
0  0  0.09
0  1  0.21
1  0  0.23
1  1  0.38
With 24 binary features, the joint distribution table would need 2²⁴ = 16,777,216 entries.
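A minimal sketch of how such a conditional table is derived mechanically from the joint table above. The joint probabilities are copied from the census table; the printed values may differ slightly from the slide's rounded figures.

```python
# Derive P(R=1 | M, H) from the joint distribution P(M, H, R).
joint = {  # (M, H, R) -> probability, from the census table above
    (0, 0, 0): 0.253, (0, 0, 1): 0.024,
    (0, 1, 0): 0.042, (0, 1, 1): 0.012,
    (1, 0, 0): 0.331, (1, 0, 1): 0.097,
    (1, 1, 0): 0.134, (1, 1, 1): 0.106,
}

for m in (0, 1):
    for h in (0, 1):
        p_mh = joint[(m, h, 0)] + joint[(m, h, 1)]  # marginal P(M=m, H=h)
        print(f"P(R=1 | M={m}, H={h}) = {joint[(m, h, 1)] / p_mh:.2f}")
# e.g. P(R=1 | M=0, H=0) ≈ 0.09
```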
Bayes Rule
P(R | M, H) = P(M, H | R) · P(R) / P(M, H)

Naïve Bayes assumption:

P(M, H | R) = P(M | R) · P(H | R)

P(X₁, X₂, ⋯, Xₘ | Y) = ∏ᵢ₌₁ᵐ P(Xᵢ | Y)

Conditional independence: https://www.youtube.com/watch?v=TAyA-rjmesQ
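The assumption is exactly that: an assumption. Reusing the `joint` dictionary from the earlier sketch, we can check how far it is from holding in the census table:

```python
# Compare P(M=1, H=1 | R=1) against P(M=1 | R=1) * P(H=1 | R=1).
# Under the Naive Bayes assumption the two quantities would be equal.
p_r1 = sum(p for (m, h, r), p in joint.items() if r == 1)

p_mh_r1 = joint[(1, 1, 1)] / p_r1
p_m_r1 = sum(p for (m, h, r), p in joint.items() if m == 1 and r == 1) / p_r1
p_h_r1 = sum(p for (m, h, r), p in joint.items() if h == 1 and r == 1) / p_r1

print(p_mh_r1, p_m_r1 * p_h_r1)   # ≈ 0.44 vs ≈ 0.42: close, but not equal
```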
Naïve Bayes
P(Y | X₁, X₂, ⋯, Xₘ) ∝ ∏ᵢ₌₁ᵐ P(Xᵢ | Y) · P(Y)

Decision rule:

y⋆ ← argmax over yₖ of  ∏ᵢ₌₁ᵐ P(Xᵢ⋆ | Y = yₖ) · P(Y = yₖ)
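A minimal sketch of the decision rule, assuming the priors and per-feature conditionals have already been estimated. Log-probabilities are summed rather than multiplying raw probabilities, which avoids numeric underflow without changing the argmax.

```python
import math

def nb_decision(x, priors, conditionals):
    """Naive Bayes decision rule (a sketch).

    x            -- tuple of feature values (x1*, ..., xm*)
    priors       -- {y_k: P(Y = y_k)}
    conditionals -- {y_k: [{value: P(X_i = value | Y = y_k)} for each i]}
    """
    best_y, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)                        # log P(Y = y_k)
        for i, xi in enumerate(x):
            score += math.log(conditionals[y][i][xi])  # + log P(X_i | Y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y                                      # y*
```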
What if we have continuous Attributes?

P(Y | X₁, X₂, ⋯, Xₘ) ∝ ∏ᵢ₌₁ᵐ P(Xᵢ | Y) · P(Y)

• Assume each Xᵢ follows a Gaussian distribution within each class.
• Gaussian Naïve Bayes:

P(Xᵢ = x | Y = yₖ) = 1 / √(2π σᵢₖ²) · exp( −½ ((x − μᵢₖ) / σᵢₖ)² )
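A sketch of the class-conditional density above; μᵢₖ and σᵢₖ would be the sample mean and standard deviation of feature Xᵢ within class yₖ. (In practice, scikit-learn's GaussianNB implements this end to end.)

```python
import math

def gaussian_likelihood(x, mu_ik, sigma_ik):
    """P(X_i = x | Y = y_k) under the Gaussian Naive Bayes model."""
    coef = 1.0 / math.sqrt(2.0 * math.pi * sigma_ik ** 2)
    return coef * math.exp(-0.5 * ((x - mu_ik) / sigma_ik) ** 2)
```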
Naïve Bayes Example

Sailing Data Set Example
Categorical target class with 2 values
◦ Is it good to sail today?
3 features (all categorical), 17 instances
Probabilities to estimate from the data, e.g.:
P(S = y), P(S = n)
P(O = r | S = y), P(O = s | S = n)
Naïve Bayes Classifier

y⋆ ← argmax over yₖ of  ∏ᵢ₌₁ᵐ P(Xᵢ⋆ | Y = yₖ) · P(Y = yₖ)

Where do the probabilities come from?
◦ The training data set!
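A sketch of frequency-based estimation. The rows below are invented toy data in the shape of the sailing example (outlook, company, boat size → sail?); they are NOT the actual 17-instance data set from the slides.

```python
from collections import Counter, defaultdict

# Invented toy data: (outlook, company, boat size) -> sail? (y/n)
data = [
    (("sunny", "big", "small"), "y"),
    (("sunny", "med", "big"),   "y"),
    (("rainy", "med", "small"), "n"),
    (("sunny", "no",  "big"),   "y"),
    (("rainy", "big", "small"), "n"),
]

class_counts = Counter(y for _, y in data)
priors = {y: c / len(data) for y, c in class_counts.items()}   # P(Y = y_k)

# cond[(i, value, y)] ~ P(X_i = value | Y = y), by relative frequency
cond = defaultdict(float)
for x, y in data:
    for i, v in enumerate(x):
        cond[(i, v, y)] += 1 / class_counts[y]
```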
Sailing Data Set Example

Will I sail if: sunny, med company, small boat?

y⋆ ← argmax over yₖ of  ∏ᵢ₌₁ᵐ P(Xᵢ⋆ | Y = yₖ) · P(Y = yₖ)
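Scoring that query with the toy estimates from the previous sketch (unseen feature values get probability zero here; real implementations would apply smoothing, e.g. Laplace):

```python
import math

query = ("sunny", "med", "small")
scores = {}
for y in priors:
    s = math.log(priors[y])
    for i, v in enumerate(query):
        p = cond.get((i, v, y), 0.0)
        s += math.log(p) if p > 0 else -math.inf
    scores[y] = s

y_star = max(scores, key=scores.get)   # -> "y" on this toy data
```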
Strengths of the Naïve Bayes Classifier

Bayesian learning does not explicitly “search” for a hypothesis or a classification.
◦ The “best” classification is calculated directly
◦ This makes it comparatively computationally efficient
◦ Scales well to large data sets
◦ Individual conditionally independent probabilities can all be pre-computed (or cached)
◦ The probabilities capture noise and errors in the data set, though not explicitly
Limitations of the Naïve Bayes Classifier

Limitations stem from the same point:
◦ In real-world problems, the features/attributes are not independent
− Though naïve Bayes classifiers still work well experimentally in practice
◦ Can have a tendency to overfit the data set
− It computes how likely the data set is to be generated by the hypothesis
− It is difficult, if not impossible, to estimate P(h) and P(D) in continuous systems