Bayesian Learning
Prof. Dan A. Simovici
UMB
Outline
1 Probability Spaces
2 Random Variables
3 Conditional Probabilities
4 ML and Conditional Probabilities
5 Bayes Theorem and Concept Learning
Probability Spaces
Suppose that (Ω, E, P) is a probability space, where E is a family of subsets of Ω known as events and P is a probability. The elements of Ω are called elementary events.
In many cases, E consists of all subsets of Ω, and we will make this assumption unless a special statement says otherwise.
Probability Spaces
Example
Rolling two dice is described by a finite probability space that consists of 36 elementary events: (1, 1), (1, 2), . . . , (6, 6).
[Figure: the 36 outcomes arranged in a 6 × 6 grid, one axis per die, each labeled 1–6.]
Probability Spaces
An event in the previous example is a subset of {1, …, 6} × {1, …, 6}, that is, a subset of the set of pairs {(u, v) | 1 ≤ u ≤ 6, 1 ≤ v ≤ 6}.
Example
throws that show the same number on both dice:
S = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}
throws such that the sum of the numbers is at least 8:

B = {(2,6), (3,5), (3,6), (4,4), (4,5), (4,6), (5,3), (5,4), (5,5), (5,6), (6,2), (6,3), (6,4), (6,5), (6,6)}
Probability Spaces
Note that Ω consists of 36 elementary events, so there are 2^36 ≈ 6.9 × 10^10 events in this very simple probability space.
The probability of an event V in this context is the number P(V) given by

P(V) = |V| / |Ω|.
Example
We have P(S) = 6/36 = 1/6 and P(B) = 15/36 = 5/12.
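These values can be double-checked by enumerating the 36 elementary events. A minimal Python sketch (the names omega, prob, S, and B are illustrative, not part of the slides):

from fractions import Fraction

# All 36 elementary events of rolling two dice.
omega = [(u, v) for u in range(1, 7) for v in range(1, 7)]

def prob(event):
    """P(V) = |V| / |Omega| for the uniform (classical) probability."""
    return Fraction(len(event), len(omega))

# S: both dice show the same number.
S = [(u, v) for (u, v) in omega if u == v]
# B: the sum of the numbers is at least 8.
B = [(u, v) for (u, v) in omega if u + v >= 8]

print(prob(S))  # 1/6
print(prob(B))  # 5/12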
Random Variables
Informally, the Borel sets of R are the sets that can be constructed from open or closed sets by repeatedly taking countable unions and intersections.
Definition
Let (Ω, E, P) be a probability space. A function X : Ω −→ R is a random variable if X^{-1}(U) ∈ E for every Borel subset U of R.
Random Variables
Definition
A simple random variable is defined by a table

X : ( x1   x2   ···   xn
      p1   p2   ···   pn ),

where x1, …, xn are the values assumed by X and pi = P(X = xi) for 1 ≤ i ≤ n. We always have p1 + ··· + pn = 1.
The expected value of X is
E[X]=x1p1 +x2p2 +···+xnpn.
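A quick Python sketch of this computation, assuming the table is stored as value–probability pairs (the fair-die example below is only an illustration):

from fractions import Fraction

# A simple random variable as a list of (value, probability) pairs.
# Example: the number shown by a fair die.
X = [(x, Fraction(1, 6)) for x in range(1, 7)]

assert sum(p for _, p in X) == 1   # p1 + ... + pn = 1

# E[X] = x1*p1 + x2*p2 + ... + xn*pn
expected_value = sum(x * p for x, p in X)
print(expected_value)  # 7/2, i.e., 3.5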
Random Variables
Example
A random variable X whose distribution is

X : ( 0   1
      q   p ),

where p + q = 1, is said to have a Bernoulli distribution with parameter p. Note that E[X] = p and var(X) = pq.
Random Variables
Example
Let p, q ∈ [0, 1] be two numbers such that p + q = 1. Consider the random variable defined by

X : ( 0     1                ···   k                  ···   n
      q^n   C(n,1)q^(n-1)p   ···   C(n,k)q^(n-k)p^k   ···   p^n ),

where C(n, k) denotes the binomial coefficient. We refer to a random variable with this distribution as a binomial random variable. Note that

q^n + C(n,1)q^(n-1)p + ··· + C(n,k)q^(n-k)p^k + ··· + p^n = (q + p)^n = 1.
Random Variables
Example cont’d
The expectation of a binomial variable is E[X] = np.
The variance of a random variable X is
var(X) = E[(X − E[X])^2] = E[X^2] − (E[X])^2.
In the case of a binomial variable the variance is var(X) = npq.
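A short Python check of the identities E[X] = np and var(X) = npq for one choice of n and p; the helper binomial_table is an illustrative name, and the table is built from the distribution on the previous slide:

from math import comb

def binomial_table(n, p):
    """Return the list of (k, P(X = k)) pairs for a binomial random variable."""
    q = 1 - p
    return [(k, comb(n, k) * q ** (n - k) * p ** k) for k in range(n + 1)]

n, p, q = 10, 0.3, 0.7
table = binomial_table(n, p)

total = sum(pk for _, pk in table)                     # (q + p)^n = 1
mean = sum(k * pk for k, pk in table)                  # E[X] = n*p
var = sum(k * k * pk for k, pk in table) - mean ** 2   # E[X^2] - (E[X])^2

print(round(total, 10), round(mean, 10), round(var, 10))  # 1.0, 3.0 (= np), 2.1 (= npq)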
Random Variables
The Characteristic Function of an Event
If A is an event, then the function 1_A : Ω −→ {0, 1} defined by

1_A(ω) = 1 if ω ∈ A,
         0 otherwise,

is a random variable with distribution

1_A : ( 0          1
        1 − P(A)   P(A) ).

Note that E(1_A) = P(A) and var(1_A) = P(A)(1 − P(A)).
Random Variables
The event A ∧ B takes place when both A and B occur; the event A ∨ B takes place when at least one of A and B occurs.
Example
The event S ∧ B takes place when throwing the dice results in a pair of numbers (n, n) whose sum is at least 8; it consists of the pairs

(4, 4), (5, 5), (6, 6).

Therefore, P(S ∧ B) = 3/36 = 1/12.
Conditional Probabilities
Definition
If B is an event such that P(B) > 0, one can define the probability of an event A conditioned on B as

P(A|B) = P(A ∧ B) / P(B).
Example
The probability of the event S conditioned on B is

P(S|B) = P(S ∧ B) / P(B) = (1/12) / (5/12) = 1/5,

and

P(B|S) = P(S ∧ B) / P(S) = (1/12) / (1/6) = 1/2.
Conditional Probabilities
Definition
Two events A, B are independent if P(A ∧ B) = P(A)P(B). If A, B are independent events and P(A), P(B) > 0, then

P(A|B) = P(A ∧ B) / P(B) = P(A)P(B) / P(B) = P(A)

and

P(B|A) = P(B ∧ A) / P(A) = P(B)P(A) / P(A) = P(B).
Note that B and S are not independent, because

P(B ∧ S) = 1/12, while P(B)P(S) = (5/12)(1/6) = 5/72 ≠ 1/12.
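The conditional probabilities and the independence test for S and B can be verified by enumeration. A self-contained Python sketch (names are illustrative):

from fractions import Fraction

omega = [(u, v) for u in range(1, 7) for v in range(1, 7)]

def prob(event):
    return Fraction(len(event), len(omega))

S = [w for w in omega if w[0] == w[1]]        # same number on both dice
B = [w for w in omega if w[0] + w[1] >= 8]    # sum at least 8
S_and_B = [w for w in omega if w in S and w in B]

# Conditional probabilities P(S|B) and P(B|S).
print(prob(S_and_B) / prob(B))   # 1/5
print(prob(S_and_B) / prob(S))   # 1/2

# Independence test: is P(S and B) equal to P(S) * P(B)?
print(prob(S_and_B) == prob(S) * prob(B))   # False: 1/12 != 5/72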
Conditional Probabilities
The product rule (from which Bayes' theorem follows):
P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A).
The sum rule:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B).
The total probability rule: if A1, …, An are mutually exclusive and ∑_{i=1}^{n} P(Ai) = 1, then

P(B) = ∑_{i=1}^{n} P(B|Ai)P(Ai).
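A Python sanity check of the three rules on the dice events S and B; the partition A1, A2 used for the total probability rule (first die at most 3 versus greater than 3) is chosen purely for illustration:

from fractions import Fraction

omega = [(u, v) for u in range(1, 7) for v in range(1, 7)]

def prob(event):
    return Fraction(len(event), len(omega))

S = [w for w in omega if w[0] == w[1]]
B = [w for w in omega if w[0] + w[1] >= 8]
both = [w for w in omega if w in S and w in B]
either = [w for w in omega if w in S or w in B]

# Product rule: P(S and B) = P(S|B)P(B) = P(B|S)P(S).
assert prob(both) == (prob(both) / prob(B)) * prob(B) == (prob(both) / prob(S)) * prob(S)

# Sum rule: P(S or B) = P(S) + P(B) - P(S and B).
assert prob(either) == prob(S) + prob(B) - prob(both)

# Total probability rule with the partition A1 = {first die <= 3}, A2 = {first die > 3}.
A1 = [w for w in omega if w[0] <= 3]
A2 = [w for w in omega if w[0] > 3]
pB_given_A1 = Fraction(len([w for w in A1 if w in B]), len(A1))   # P(B|A1)
pB_given_A2 = Fraction(len([w for w in A2 if w in B]), len(A2))   # P(B|A2)
assert prob(B) == pB_given_A1 * prob(A1) + pB_given_A2 * prob(A2)

print("all three rules verified")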
ML and Conditional Probabilities
In ML we are often interested in determining the best hypothesis from some space H given the observed data S.
“Best” means, in this context, the most probable hypothesis given:
the data S, and
any initial knowledge of prior probabilities of hypotheses in H.
ML and Conditional Probabilities
“Prior probabilities” (or a priori probabilities) mean probabilities of hypotheses before seeing the data S.
“Posterior probabilities” mean probabilities of hypotheses after seeing the data S.
If no prior knowledge exists, all hypotheses are assigned the same prior probability. In ML we are interested in computing the probability P(h|S) that h holds given the observed training data S.
ML and Conditional Probabilities
Bayes’ Theorem in ML
For a sample S and a hypothesis h we have

P(h|S) = P(S|h)P(h) / P(S).
Note that:
P(h|S) increases with P(h) and with P(S|h).
P(h|S) decreases as P(S) increases, because the more probable it is that S will be observed regardless of h, the less evidence S provides in support of h.
ML and Conditional Probabilities
Learning Scenario
Consider some set of candidate hypotheses H and seek the most probable hypothesis given the observed data S.
Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis and is denoted hMAP.
hMAP is given by

hMAP = argmax_{h∈H} P(h|S)
     = argmax_{h∈H} P(S|h)P(h) / P(S)
     = argmax_{h∈H} P(S|h)P(h),

because P(S) is a constant.
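The derivation translates directly into code: to obtain hMAP it suffices to maximize P(S|h)P(h) over H. A minimal Python sketch with a made-up two-hypothesis coin example (all names and numbers are illustrative, not part of the slides):

def map_hypothesis(hypotheses, prior, likelihood, data):
    """hMAP = argmax_{h in H} P(data|h) * P(h); the constant P(data) can be dropped."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior[h])

# Toy example: two hypotheses about a coin, observed data = 3 heads in 3 tosses.
H = ["fair", "biased"]
prior = {"fair": 0.5, "biased": 0.5}
head_prob = {"fair": 0.5, "biased": 0.9}

def likelihood(data, h):
    # data is a list of outcomes "H" or "T", assumed independent given h.
    p = head_prob[h]
    result = 1.0
    for outcome in data:
        result *= p if outcome == "H" else (1 - p)
    return result

print(map_hypothesis(H, prior, likelihood, ["H", "H", "H"]))  # "biased"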
ML and Conditional Probabilities
Maximum Likelihood Hypothesis
In some cases we assume that every hypothesis of H is a priori equally probable, that is, P(hi) = P(hj) for all hi, hj ∈ H.
Now,

hMAP = argmax_{h∈H} P(S|h).

P(S|h) is known as the likelihood of S given h, and the hypothesis that maximizes it is called the maximum likelihood hypothesis.
ML and Conditional Probabilities
Example
A medical diagnosis problem:
The hypothesis space contains two hypotheses:
h0: patient has no cancer;
h1: patient has cancer.
An imperfect diagnostic test that has two outcomes: ⊕ (positive) and ⊖ (negative).
P(⊕|h1) = 0.98    P(⊕|h0) = 0.03
P(⊖|h1) = 0.02    P(⊖|h0) = 0.97
Prior knowledge: only 0.8% of the population has cancer; 99.2% does not.
ML and Conditional Probabilities
Example (cont’d)
The test returns ⊕. Should we conclude that the patient has cancer? The MAP hypothesis is obtained as hMAP = argmax_{h∈H} P(S|h)P(h):

P(⊕|h1)P(h1) = 0.98 · 0.008 = 0.0078,
P(⊕|h0)P(h0) = 0.03 · 0.992 = 0.0298.

The MAP hypothesis is h0: the diagnosis is that the patient has no cancer.
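The same computation in Python, together with the normalized posteriors, which show that even after a positive test P(h1|⊕) is only about 0.21 (the dictionary names are illustrative):

# Priors and likelihoods from the example.
prior = {"h0": 0.992, "h1": 0.008}   # no cancer / cancer
p_pos = {"h0": 0.03, "h1": 0.98}     # P(+ | h)

# Unnormalized scores P(+|h) * P(h); their argmax is the MAP hypothesis.
scores = {h: p_pos[h] * prior[h] for h in prior}
print({h: round(s, 4) for h, s in scores.items()})   # {'h0': 0.0298, 'h1': 0.0078}

h_map = max(scores, key=scores.get)
print(h_map)   # 'h0' -- the MAP hypothesis is "no cancer"

# Normalizing by P(+) gives the posterior probabilities.
p_plus = sum(scores.values())
posterior = {h: scores[h] / p_plus for h in scores}
print(round(posterior["h1"], 3))   # ~0.21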
Bayes Theorem and Concept Learning
Brute-Force Bayes Concept Learning
For each hypothesis h ∈ H calculate the posterior probability

P(h|S) = P(S|h)P(h) / P(S).

Output the hypothesis hMAP with hMAP = argmax_{h∈H} P(h|S).
Bayes Theorem and Concept Learning
Assumptions for the Brute-Force Bayes Concept Learning:
The training data is S = {(x1, y1), …, (xm, ym)}, where yi = f(xi) for 1 ≤ i ≤ m, and it is noise-free.
The target hypothesis (the function f) is contained in H.
We have no a priori reason to believe that any hypothesis is more probable than any other.
Bayes Theorem and Concept Learning
Consequences
P(h) = 1/|H| for every h ∈ H;
The probability of S given h is 1 if S is consistent with h and 0 otherwise:

P(S|h) = 1 if yi = h(xi) for 1 ≤ i ≤ m,
         0 otherwise.
Bayes Theorem and Concept Learning
Let VS_{H,S} be the subset of hypotheses of H that are consistent with S.
If S is inconsistent with h, then

P(h|S) = 0 · P(h) / P(S) = 0.

If S is consistent with h, then

P(h|S) = (1 · (1/|H|)) / P(S) = (1 · (1/|H|)) / (|VS_{H,S}| / |H|) = 1 / |VS_{H,S}|.
Bayes Theorem and Concept Learning
Since the hypotheses are mutually exclusive (that is, P(hi ∧ hj) = 0 if i ≠ j), by the total probability law:

P(S) = ∑_{hi ∈ H} P(S|hi)P(hi)
     = ∑_{h ∈ VS_{H,S}} 1 · (1/|H|) + ∑_{h ∉ VS_{H,S}} 0 · (1/|H|)
     = ∑_{h ∈ VS_{H,S}} 1/|H|
     = |VS_{H,S}| / |H|.
Note that under this setting every consistent hypothesis is a MAP hypothesis.
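A small Python sketch of brute-force Bayes concept learning on a made-up hypothesis space of threshold functions; it checks that every hypothesis consistent with S receives posterior 1/|VS_{H,S}| and every other hypothesis receives posterior 0 (the hypothesis space and data are illustrative only):

from fractions import Fraction

# Hypothesis space H: h_t(x) = 1 iff x >= t, for thresholds t = 0, ..., 10.
thresholds = list(range(11))

def h(t, x):
    return 1 if x >= t else 0

# Noise-free training data S = [(x_i, y_i)], here generated by the target h_4.
S = [(2, 0), (3, 0), (5, 1), (7, 1)]

prior = Fraction(1, len(thresholds))    # P(h) = 1/|H|

def likelihood(t):                      # P(S|h): 1 if h is consistent with S, else 0
    return 1 if all(h(t, x) == y for x, y in S) else 0

p_S = sum(likelihood(t) * prior for t in thresholds)        # P(S) = |VS|/|H|
posterior = {t: likelihood(t) * prior / p_S for t in thresholds}

version_space = [t for t in thresholds if likelihood(t) == 1]
print(version_space)                              # [4, 5] -- consistent thresholds
print(p_S)                                        # 2/11
print({t: posterior[t] for t in version_space})   # each posterior equals 1/2 = 1/|VS|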