EECS 70 Discrete Mathematics and Probability Theory Fall 2020
Note 14 Conditional Probability, Independence, and Combinations of Events
One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten T's, this does not make it any more likely that the next flip will be H. It still has exactly a 50% chance of being H.
By contrast, suppose that while dealing cards, the first ten cards are all red (hearts or diamonds). What is the chance that the next card is red? We started with exactly 26 red cards and 26 black cards. But after dealing the first ten cards, we know that the deck has 16 red cards and 26 black cards. So the chance that the next card is red is 16/42. So unlike the case of coin flips, the chance of drawing a red card is no longer independent of the previous cards that were dealt. This is the phenomenon we will explore in this note on conditional probability.
1 Conditional Probability
Let's consider an example with a smaller sample space. Suppose we toss m = 4 labeled balls into n = 3 labeled bins; this is a uniform sample space with 3^4 = 81 sample points. From the previous note, we already know that the probability that the first bin is empty is (1 − 1/3)^4 = (2/3)^4 = 16/81. What is the probability of this event given that the second bin is empty? Let A denote the event that the first bin is empty, and B the event that the second bin is empty. In the language of conditional probability, we wish to compute the probability P[A|B], which we read as "the conditional probability of A given B."
How should we compute P[A|B]? Since event B is guaranteed to happen, we need to look not at the whole sample space Ω, but at the smaller sample space consisting only of the sample points in B. In terms of the picture below, we are no longer looking at the large oval, but only the oval labeled B:
[Figure: the sample space Ω drawn as a large oval, containing overlapping ovals for the events A and B.]
What should be the probability of each sample point ω ∈ B given that the event B occurs? If they all simply inherited their probabilities from Ω, then the sum of these probabilities would be ∑_{ω∈B} P[ω] = P[B], which in general is less than 1. So, to get the correct normalization, we need to scale the probability of each sample point by 1/P[B]. That is, for each sample point ω ∈ B, the new probability becomes

P[ω|B] = P[ω] / P[B].
Now it is clear how to compute P[A|B]: namely, we just sum up these scaled probabilities over all sample
points that lie in both A and B:
P[A|B] = ∑_{ω∈A∩B} P[ω|B] = ∑_{ω∈A∩B} P[ω]/P[B] = P[A∩B] / P[B].
Definition 14.1 (Conditional Probability). For events A,B ⊆ Ω in the same probability space such that
P[B] > 0, the conditional probability of A given B is
P[A|B] = P[A∩B] / P[B].
Returning to our example of balls and bins, to compute P[A|B] we need to figure out P[A∩B]. But A∩B is the event that both the first two bins are empty, i.e., all four balls fall in the third bin. So P[A∩B] = 1/81 (why?). Therefore,

P[A|B] = P[A∩B] / P[B] = (1/81) / (16/81) = 1/16.

Not surprisingly, P[A|B] = 1/16 = 0.0625 is quite a bit less than P[A] = 16/81 ≈ 0.1975; knowing that bin 2 is empty makes it significantly less likely that bin 1 will be empty.
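As a sanity check, here is a minimal Python sketch (our own illustration, not part of the original note; the function and variable names are made up) that throws 4 balls into 3 bins many times and estimates P[A] and P[A|B]. The conditional estimate should come out close to 1/16 = 0.0625.

```python
import random

def estimate_balls_and_bins(trials=200_000):
    """Estimate P[A] and P[A|B], where A = 'bin 1 empty' and B = 'bin 2 empty'."""
    count_A = count_B = count_AB = 0
    for _ in range(trials):
        bins = [random.randrange(3) for _ in range(4)]  # throw 4 balls into bins 0, 1, 2
        A = 0 not in bins   # bin 1 (index 0) is empty
        B = 1 not in bins   # bin 2 (index 1) is empty
        count_A += A
        count_B += B
        count_AB += A and B
    print("P[A]   ≈", count_A / trials)      # exact value: 16/81 ≈ 0.1975
    print("P[A|B] ≈", count_AB / count_B)    # exact value: 1/16  = 0.0625

estimate_balls_and_bins()
```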
Example: Card Dealing
Let's apply the ideas discussed above to compute the probability that, when dealing 2 cards, the second card is an ace given that the first card is an ace. Let B be the event that the first card is an ace, and let A be the event that the second card is an ace.
To compute P[A|B], we need to figure out P[A ∩ B]. This is the probability that both cards are aces. Note that there are 52 · 51 sample points in the sample space, since each sample point is a sequence of two cards. A sample point is in A∩B if both cards are aces. This can happen in 4·3 = 12 ways.
Since each sample point is equally likely, P[A∩B] = 12/(52·51), while P[B], the probability of drawing an ace in the first trial, is 4/52. Therefore,

P[A|B] = P[A∩B] / P[B] = 3/51,

which is less than P[A] = 4/52 = 1/13 (see Exercise 1). Hence, if the first card is an ace, it makes it less likely that the second card is also an ace.
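The same answer can be checked numerically. Below is a small Python sketch (our own, not from the note) that repeatedly deals two cards from a 52-card deck and estimates P[A|B]; the estimate should approach 3/51 ≈ 0.0588.

```python
import random

def estimate_second_ace_given_first(trials=200_000):
    """Estimate P[second card is an ace | first card is an ace]."""
    # Represent the deck by ranks only; 'A' marks the four aces.
    deck = ['A'] * 4 + ['x'] * 48
    first_ace = both_aces = 0
    for _ in range(trials):
        c1, c2 = random.sample(deck, 2)   # deal two distinct cards without replacement
        if c1 == 'A':
            first_ace += 1
            both_aces += (c2 == 'A')
    print("P[A|B] ≈", both_aces / first_ace)   # exact value: 3/51 ≈ 0.0588

estimate_second_ace_given_first()
```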
2 Bayesian Inference
Now that we have introduced the notion of conditional probability, we can see how it is used in real world settings. Conditional probability is at the heart of a subject called Bayesian inference, used extensively in fields such as machine learning, communications and signal processing. Bayesian inference is a way to update knowledge after making an observation. For example, we may have an estimate of the probability of a given event A. After event B occurs, we can update this estimate to P[A|B]. In this interpretation, P[A] can be thought of as a prior probability: our assessment of the likelihood of an event of interest, A, before making an observation. It reflects our prior knowledge. P[A|B] can be interpreted as the posterior probability of A after the observation. It reflects our updated knowledge.
Here is an example of where we can apply such a technique. A pharmaceutical company is marketing a new test for a certain medical disorder. According to clinical trials, the test has the following properties:
1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called “false negatives”).
2. When applied to a healthy person, the test comes up negative in 80% of cases, and positive in 20% (these are called “false positives”).
Suppose that the incidence of the disorder in the US population is 5%; this is our prior knowledge. When a random person is tested and the test comes up positive, how can we update the probability that the person has the disorder? (Note that this is presumably not the same as the simple probability that a random person in the population has the disorder, which is just 1/20.) The implicit probability space here is the entire US population with equal probabilities.
The sample space here consists of all people in the US — denote their number by N (so N ≈ 325 million). Let A be the event that a person chosen at random is affected, and B be the event that a person chosen at random tests positive. Now we can rewrite the information above as:
• P[A] = 0.05, (5% of the U.S. population is affected)
• P[B|A] = 0.9, (90% of the affected people test positive)
• P[B|Ā] = 0.2, (20% of healthy people test positive)
We want to calculate P[A|B]. We can proceed as follows:
P[A|B] = P[A∩B] / P[B] = P[B|A]P[A] / P[B].   (1)

We obtained the second equality above by applying the definition of conditional probability:

P[B|A] = P[A∩B] / P[A].
To evaluate (1), we need to compute P[B]. This is the probability that a random person tests positive. To compute this, we can sum two values: the probability P[Ā∩B] that a healthy person tests positive and the probability P[A∩B] that an affected person tests positive. We can sum because the events A∩B and Ā∩B do not intersect:
[Figure: the event B drawn as an oval overlapping A, partitioned into the two disjoint pieces A∩B and Ā∩B.]
By again applying the definition of conditional probability, we have:
P[B] = P[A∩B] + P[Ā∩B] = P[B|A]P[A] + P[B|Ā]P[Ā]   (2)
     = P[B|A]P[A] + P[B|Ā](1 − P[A]).

Combining (1) and (2), we have expressed P[A|B] in terms of P[A], P[B|A] and P[B|Ā]:

P[A|B] = P[B|A]P[A] / (P[B|A]P[A] + P[B|Ā](1 − P[A])).   (3)
By plugging in the values written above, we obtain P[A|B] = 9/47 ≈ 0.19.
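To make the arithmetic concrete, here is a short Python sketch (ours, with made-up function names) that evaluates equation (3) for the numbers above and also checks it by simulating a large population; both should give roughly 9/47 ≈ 0.19.

```python
import random

def posterior(p_A, p_B_given_A, p_B_given_notA):
    """Equation (3): P[A|B] from the prior and the two test characteristics."""
    p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # Total Probability Rule, eq. (2)
    return p_B_given_A * p_A / p_B                          # Bayes' Rule, eq. (1)

print("formula:", posterior(0.05, 0.9, 0.2))   # 9/47 ≈ 0.1915

# Simulation check: draw people at random and look only at those who test positive.
def simulate(trials=200_000):
    affected_and_positive = positive = 0
    for _ in range(trials):
        affected = random.random() < 0.05
        tests_positive = random.random() < (0.9 if affected else 0.2)
        positive += tests_positive
        affected_and_positive += affected and tests_positive
    return affected_and_positive / positive

print("simulation:", simulate())
```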
Equation (3) is useful for many inference problems. We are given P[A], which is the (unconditional) probability that the event of interest, A, happens. We are given P[B|A] and P[B|Ā], which quantify how noisy the observation is. (If P[B|A] = 1 and P[B|Ā] = 0, for example, the observation is completely noiseless.) Now we want to calculate P[A|B], the probability that the event of interest happens given that we made the observation. Equation (3) allows us to do just that.
Of course, (1), (2) and (3) are derived from the basic axioms of probability and the definition of conditional probability, and are therefore true with or without the above Bayesian inference interpretation. However, this interpretation is very useful when we apply probability theory to study inference problems.
3 Bayes’ Rule and Total Probability Rule
Equations (1) and (2) are very useful in their own right. The former is called Bayes’ Rule and the latter is called the Total Probability Rule. Bayes’ Rule is useful when one wants to calculate P[A|B] but one is given P[B|A] instead, i.e., it allows us to “flip” things around.
The Total Probability Rule is an application of the strategy of "dividing into cases." There are two possibilities: either an event A happens or A does not happen. If A happens, the probability that B happens is P[B|A]. If A does not happen, the probability that B happens is P[B|Ā]. If we know or can easily calculate these two probabilities and also P[A], then the Total Probability Rule yields the probability of event B.
3.1 Examples
Tennis Match
You are about to play a tennis match against a randomly chosen opponent and you wish to calculate your probability of winning. You know your opponent will be one of two people, X or Y . If person X is chosen, you will win with probability 0.7. If person Y is chosen, you will win with probability 0.3. Your opponent is chosen by flipping a biased coin such that the probability of choosing X is 0.6.
Let’s first determine which events we are interested in. Let A be the event that you win. Let BX be the event that person X is chosen, and let BY be the event that person Y is chosen. We wish to calculate P[A]. Here is what we know so far:
• P[A|BX] = 0.7, (if person X is chosen, you win with probability 0.7)
• P[A|BY] = 0.3, (if person Y is chosen, you win with probability 0.3)
• P[BX] = 0.6, (person X is chosen with probability 0.6)
• P[BY] = 0.4, (person Y is chosen with probability 0.4)
By using the Total Probability Rule, we have P[A] = P[A|BX ]P[BX ] + P[A|BY ]P[BY ], and plugging in the known values gives P[A] = (0.7 × 0.6) + (0.3 × 0.4) = 0.54.
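A quick Python sketch of this computation (ours, not from the note): it evaluates the Total Probability Rule for the tennis match and confirms the answer with a simulation.

```python
import random

# Total Probability Rule: P[A] = P[A|BX] P[BX] + P[A|BY] P[BY]
p_win = 0.7 * 0.6 + 0.3 * 0.4
print("formula:", p_win)   # 0.54

def simulate(trials=100_000):
    wins = 0
    for _ in range(trials):
        opponent_is_X = random.random() < 0.6   # choose the opponent
        p = 0.7 if opponent_is_X else 0.3       # your win probability against that opponent
        wins += random.random() < p
    return wins / trials

print("simulation:", simulate())
```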
Balls and Bins
Imagine we have two bins containing black and white balls: Bin 1 contains 2 white balls and 3 black balls, while Bin 2 contains 1 white ball and 1 black ball.
Suppose one of the two bins is chosen with equal probability and a ball is drawn from the chosen bin uniformly at random. What is the probability that we picked Bin 1 given that a white ball was drawn, i.e., P[Bin 1 | White]?
Is the answer 2/3, since we know that there are a total of three white balls, two of which are in Bin 1? This reasoning is incorrect. Instead, we should scale the probability of each sample point appropriately: since each bin is chosen with probability 1/2, each of the five outcomes in Bin 1 has probability (1/2) × (1/5) = 1/10, while each of the two outcomes in Bin 2 has probability (1/2) × (1/2) = 1/4.
The sample space Ω thus consists of the outcomes in event B1 (corresponding to Bin 1) and event B2 (corresponding to Bin 2), i.e., Ω = B1 ∪ B2. We can use the definition of conditional probability to see that

P[Bin 1 | White] = P[Bin 1 ∩ White] / P[White] = (1/10 + 1/10) / (1/10 + 1/10 + 1/4) = (2/10) / (9/20) = 4/9.
Let us try to achieve this probability using Bayes' Rule. To apply Bayes' Rule, we need to compute P[White | Bin 1], P[Bin 1], and P[White]. Here, P[White | Bin 1] is the chance that we pick a white ball given that we picked Bin 1, which is 2/5. P[Bin 1] = 1/2, as given in the description of the problem. Finally, P[White] can be computed using the Total Probability Rule:

P[White] = P[White | Bin 1] × P[Bin 1] + P[White | Bin 2] × P[Bin 2] = (2/5) × (1/2) + (1/2) × (1/2) = 9/20.
Observe that we can apply the Total Probability Rule here because the event that we pick Bin 1 is the complement of the event that we pick Bin 2. Finally, upon plugging the above values into Bayes' Rule, we obtain the probability that we picked Bin 1 given that we picked a white ball:

P[Bin 1 | White] = ((2/5) × (1/2)) / (9/20) = (2/10) / (9/20) = 4/9.
All we have done above is to combine Bayes' Rule and the Total Probability Rule; this is also how we obtained (3). Equivalently, we could have plugged the appropriate values into (3).
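For readers who like to check such answers empirically, here is a short Python sketch (ours; the bin contents are the ones assumed above): it simulates choosing a bin and drawing a ball, then estimates P[Bin 1 | White], which should be close to 4/9 ≈ 0.444.

```python
import random

BIN1 = ['w', 'w', 'b', 'b', 'b']   # assumed contents: 2 white, 3 black
BIN2 = ['w', 'b']                  # assumed contents: 1 white, 1 black

def estimate(trials=200_000):
    white = white_and_bin1 = 0
    for _ in range(trials):
        picked_bin1 = random.random() < 0.5                  # choose a bin with equal probability
        ball = random.choice(BIN1 if picked_bin1 else BIN2)  # draw uniformly from the chosen bin
        if ball == 'w':
            white += 1
            white_and_bin1 += picked_bin1
    return white_and_bin1 / white

print("P[Bin 1 | White] ≈", estimate())   # exact value: 4/9 ≈ 0.444
```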
3.2 Generalization
We now consider Bayes’ Rule and the Total Probability Rule in a more general context. First, we define a partition of an event as follows.
Definition 14.2 (Partition of an event). We say that an event A is partitioned into n events A1,…,An if

1. A = A1 ∪ A2 ∪ ··· ∪ An, and
2. Ai ∩ Aj = ∅ for all i ≠ j (i.e., A1,…,An are mutually exclusive).

In other words, each outcome in A belongs to exactly one of the events A1,…,An.
Now, let A1,…,An be a partition of the sample space Ω. Then, the Total Probability Rule for any event B is

P[B] = ∑_{i=1}^n P[B∩Ai] = ∑_{i=1}^n P[B|Ai] P[Ai],   (4)

while Bayes' Rule, assuming P[B] ≠ 0, is given by

P[Ai|B] = P[B|Ai] P[Ai] / P[B] = P[B|Ai] P[Ai] / (∑_{j=1}^n P[B|Aj] P[Aj]),   (5)
where the second equality follows from the Total Probability Rule.
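The rules (4) and (5) translate directly into code. Below is a minimal Python sketch (ours; the function names are made up) that takes prior probabilities P[Ai] over a partition and the conditional probabilities P[B|Ai], and returns P[B] and the posterior distribution P[Ai|B].

```python
def total_probability(priors, likelihoods):
    """Equation (4): P[B] = sum over i of P[B|Ai] P[Ai]."""
    return sum(p * l for p, l in zip(priors, likelihoods))

def bayes_posteriors(priors, likelihoods):
    """Equation (5): P[Ai|B] = P[B|Ai] P[Ai] / P[B] for each i."""
    p_B = total_probability(priors, likelihoods)
    return [p * l / p_B for p, l in zip(priors, likelihoods)]

# Example: the medical test from Section 2, viewed as the two-event partition {affected, healthy}.
priors = [0.05, 0.95]        # P[affected], P[healthy]
likelihoods = [0.9, 0.2]     # P[positive | affected], P[positive | healthy]
print(total_probability(priors, likelihoods))   # 0.235
print(bayes_posteriors(priors, likelihoods))    # [9/47 ≈ 0.1915, 38/47 ≈ 0.8085]
```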
4 Combinations of Events
In most applications of probability in Computer Science, we are interested in things like P[⋂_{i=1}^n Ai] and P[⋃_{i=1}^n Ai], where the Ai are simple events (i.e., we know or can easily compute P[Ai]). The intersection ⋂_i Ai corresponds to the logical AND of the events Ai, while the union ⋃_i Ai corresponds to their logical OR. As an example, if Ai denotes the event that a failure of type i happens in a certain system, then ⋃_i Ai is the event that the system fails.
In general, computing the probabilities of such combinations can be very difficult. In this section, we discuss some situations where it can be done. Let’s start with independent events, for which intersections are quite simple to compute.
4.1 Independent Events
Definition 14.3 (Independence). Two events A,B in the same probability space are said to be independent if P[A∩B] = P[A]×P[B].
The intuition behind this definition is the following. Suppose that P[B] > 0. Then we have

P[A|B] = P[A∩B] / P[B] = P[A] × P[B] / P[B] = P[A].
Thus independence has the natural meaning that “the probability of A is not affected by whether or not B occurs.” (By a symmetrical argument, we also have P[B|A] = P[B] provided P[A] > 0.) For events A, B such that P[B] > 0, the condition P[A|B] = P[A] is actually equivalent to the definition of independence.
Several of our previously mentioned random experiments consist of independent events. For example, if we flip a coin twice, the event of obtaining heads in the first trial is independent of the event of obtaining heads in the second trial. The same applies for two rolls of a die; the outcomes of each trial are independent.
The above definition generalizes to any finite set of events:
Definition 14.4 (Mutual independence). Events A1,…,An are said to be mutually independent if for every subset I ⊆ {1,…,n} with size |I| ≥ 2,

P[⋂_{i∈I} Ai] = ∏_{i∈I} P[Ai].   (6)
An equivalent definition of mutual independence is as follows.
Definition 14.5 (Mutual independence). Events A1,…,An are said to be mutually independent if for all Bi ∈ {Ai, Āi}, i = 1,…,n,

P[B1 ∩ ··· ∩ Bn] = ∏_{i=1}^n P[Bi].   (7)

Remarks.

1. In Definition 14.4, (6) needs to hold for every subset I of {1,…,n} with size |I| ≥ 2. The cases of |I| = 0 and |I| = 1 are omitted, as they impose no constraints.
2. Note that (6) imposes 2^n − n − 1 constraints on the probability distribution, while (7) imposes 2^n constraints. It turns out that exactly n + 1 of the constraints in (7) are actually redundant.
For mutually independent events A1,…,An, it is not hard to check from the definition of conditional probability that, for any 1 ≤ i ≤ n and any subset I ⊆ {1,…,n}\{i}, we have

P[Ai | ⋂_{j∈I} Aj] = P[Ai].
Note that the independence of every pair of events (so-called pairwise independence) does not necessarily imply mutual independence. As illustrated in the following example, it is possible to construct three events A,B,C such that each pair is independent but the triple A,B,C is not mutually independent.
Example: Pairwise Independent but Not Mutually Independent
Suppose you toss a fair coin twice and let A be the event that the first flip is H and B be the event that the second flip is H. Now let C be the event that both flips are the same (i.e., both H's or both T's). Of course A and B are independent. What is more interesting is that so are A and C: given that the first toss came up H, the chance of the second flip being the same as the first is still 1/2. Another way of saying this is that P[A∩C] = P[A]P[C] = 1/4 since A∩C is the event that the first flip is H and the second is also H. By the
same reasoning B and C are also independent. On the other hand, A, B and C are not mutually independent. For example, if we are given that A and B occurred, then the probability that C occurs is 1. So, even though A, B and C are not mutually independent, every pair of them is independent. In other words, A, B and C are pairwise independent but not mutually independent.
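The example can be verified by brute force over the four equally likely outcomes of two coin flips. The Python sketch below (ours, not from the note) checks each pairwise product condition and the three-way condition.

```python
from itertools import product

# Sample space: all sequences of two fair coin flips, each with probability 1/4.
omega = list(product('HT', repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform distribution."""
    return len(event) / len(omega)

A = {w for w in omega if w[0] == 'H'}    # first flip is heads
B = {w for w in omega if w[1] == 'H'}    # second flip is heads
C = {w for w in omega if w[0] == w[1]}   # both flips are the same

for name, (X, Y) in [("A,B", (A, B)), ("A,C", (A, C)), ("B,C", (B, C))]:
    print(name, "independent:", prob(X & Y) == prob(X) * prob(Y))   # all True

print("A,B,C mutually independent:",
      prob(A & B & C) == prob(A) * prob(B) * prob(C))               # False: 1/4 != 1/8
```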
4.2 Intersections of Events
Computing the probability of an intersection of mutually independent events is easy; it follows from the definition. We simply multiply the probabilities of each event. How do we compute the probability of an intersection for events that are not mutually independent? From the definition of conditional probability, we immediately have the following Product Rule (sometimes also called the chain rule) for computing the probability of an intersection of events.
Theorem 14.1 (Product Rule). For any events A, B, we have
P[A ∩ B] = P[A]P[B|A].
More generally, for any events A1,…,An,
P[⋂_{i=1}^n Ai] = P[A1] × P[A2|A1] × P[A3|A1∩A2] × ··· × P[An | ⋂_{i=1}^{n−1} Ai].
Proof. The first assertion follows directly from the definition of P[B|A] (and is in fact a special case of the
second assertion with n = 2).
To prove the second assertion, we will use induction on n (the number of events). The base case is n = 1, and corresponds to the statement that P[A] = P[A], which is trivially true. For an arbitrary n > 1, assume (the inductive hypothesis) that
P[⋂_{i=1}^{n−1} Ai] = P[A1] × P[A2|A1] × ··· × P[An−1 | ⋂_{i=1}^{n−2} Ai].

Now we can apply the definition of conditional probability to the two events An and ⋂_{i=1}^{n−1} Ai to deduce that

P[⋂_{i=1}^n Ai] = P[An ∩ (⋂_{i=1}^{n−1} Ai)] = P[An | ⋂_{i=1}^{n−1} Ai] × P[⋂_{i=1}^{n−1} Ai]
               = P[An | ⋂_{i=1}^{n−1} Ai] × P[A1] × P[A2|A1] × ··· × P[An−1 | ⋂_{i=1}^{n−2} Ai],
where in the last line we have used the inductive hypothesis. This completes the proof by induction.
The Product Rule is particularly useful when we can view our sample space as a sequence of choices. The next few examples illustrate this point.
Example: Coin Tosses
Toss a fair coin three times. Let A be the event that all three tosses are heads. Then A = A1 ∩A2 ∩A3, where Ai is the event that the ith toss comes up heads. We have
P[A] = P[A1] × P[A2|A1] × P[A3|A1 ∩ A2]
= P[A1] × P[A2] × P[A3]
= 1/2 × 1/2 × 1/2 = 1/8.

The second line here follows from the fact that the tosses are mutually independent. Of course, we already know that P[A] = 1/8 from our definition of the probability space in the previous note. Another way of looking
at this calculation is that it justifies our definition of the probability space, and shows that it was consistent with assuming that the coin flips are mutually independent.
If the coin is biased with heads probability p, we get, again using independence, P[A] = P[A1] × P[A2] × P[A3] = p^3.

More generally, the probability of any sequence of n tosses containing k heads and n − k tails is p^k (1 − p)^(n−k). This is in fact the reason we defined the probability space this way: we defined the sample point probabilities so that the coin tosses would behave independently.
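As a small illustration (ours, not from the note), the Python sketch below computes the probability of a specific sequence of biased coin tosses by multiplying the per-toss probabilities, exactly as the Product Rule with independent events prescribes.

```python
def sequence_probability(sequence, p):
    """Probability of an exact toss sequence ('H'/'T') for a coin with heads probability p."""
    prob = 1.0
    for toss in sequence:
        prob *= p if toss == 'H' else (1 - p)   # independence: multiply the individual probabilities
    return prob

print(sequence_probability("HHH", 0.5))    # 1/8 = 0.125
print(sequence_probability("HHTHT", 0.6))  # 0.6^3 * 0.4^2 ≈ 0.03456
```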
Example: Monty Hall Revisited
Recall the Monty Hall problem from the previous note: there are three doors and the probability that the prize is behind any given door is 1/3. There are goats behind the other two doors. The contestant picks a door randomly, and the host opens one of the other two doors, revealing a goat. How do we calculate intersections in this setting? For example, what is the probability that the contestant chooses door 1, the prize is behind door 2, and the host chooses door 3?
Let Ci be the event that the contestant chooses door i, let Pi be the event that the prize is behind door i, and let Hi be the event that the host chooses door i. Then, by the Product Rule,
P[C1 ∩ P2 ∩ H3] = P[C1] × P[P2|C1] × P[H3|C1 ∩ P2].
The probability of C1 is 1/3, since the contestant is choosing the door at random. The probability of P2 given C1 is still 1/3 since they are independent. The probability of the host choosing door 3 given events C1 and P2 is 1; the host cannot choose door 1 since the contestant has already chosen it, and the host cannot choose door 2 since the host must reveal a goat (and not the prize). Therefore,
P[C1 ∩ P2 ∩ H3] = 1/3 × 1/3 × 1 = 1/9.

Observe that we needed conditional probability in this setting; had we simply multiplied the probabilities of each event, we would have obtained 1/27, since the probability of H3 is also 1/3 (can you figure out why?).
What if we changed the situation, and instead asked for the probability that the contestant chooses door 1, the prize is behind door 1, and the host chooses door 3? We can use the same technique as above, but our final answer will be different. This is left as an exercise (see Exercise 4).
Now, noting that P[H3|C1] = 1/2 (see Exercise 5) and P[C1 ∩ H3] = P[C1] P[H3|C1] = 1/3 × 1/2 = 1/6, we obtain

P[P2 | C1 ∩ H3] = P[C1 ∩ P2 ∩ H3] / P[C1 ∩ H3] = (1/9) / (1/6) = 2/3,

which formally justifies the intuitive answer described in the previous note.
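This posterior of 2/3 is easy to confirm empirically. The following Python sketch (ours) simulates the game and estimates P[P2 | C1 ∩ H3] by conditioning on the runs in which the contestant picked door 1 and the host opened door 3.

```python
import random

def estimate_monty(trials=300_000):
    conditioned = prize_behind_2 = 0
    for _ in range(trials):
        prize = random.randrange(1, 4)     # prize placed uniformly behind door 1, 2, or 3
        choice = random.randrange(1, 4)    # contestant picks a door uniformly at random
        # Host opens a door that is neither the contestant's choice nor the prize door.
        host = random.choice([d for d in (1, 2, 3) if d != choice and d != prize])
        if choice == 1 and host == 3:      # condition on the event C1 ∩ H3
            conditioned += 1
            prize_behind_2 += (prize == 2)
    return prize_behind_2 / conditioned

print("P[P2 | C1 ∩ H3] ≈", estimate_monty())   # exact value: 2/3
```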
Example: Poker Hands

Let's use the Product Rule to compute the probability of a flush in a different way. This is equal to 4 × P[A], where A is the event that we get a Hearts flush. Intuitively, this should be clear since there are 4 suits; we'll see
why this is formally true in the next section. We can write A = ⋂_{i=1}^5 Ai, where Ai is the event that the ith card we pick is a Heart. So we have

P[A] = P[A1] × P[A2|A1] × ··· × P[A5 | ⋂_{i=1}^4 Ai].

Clearly P[A1] = 13/52 = 1/4. What about P[A2|A1]? Well, since we are conditioning on A1 (the first card is a Heart), there are only 51 remaining possibilities for the second card, 12 of which are Hearts. So P[A2|A1] = 12/51. Similarly, P[A3|A1 ∩ A2] = 11/50, and so on. So we get

4 × P[A] = 4 × 13/52 × 12/51 × 11/50 × 10/49 × 9/48,
which is exactly the same fraction we computed in the previous note.
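Here is a brief Python sketch (ours, not from the note) that evaluates this product exactly and compares it with a direct counting expression, 4·C(13,5)/C(52,5); the two agree.

```python
from math import comb
from fractions import Fraction

# Product Rule: multiply conditional probabilities of drawing Hearts one card at a time.
product_rule = 4 * Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50) \
                 * Fraction(10, 49) * Fraction(9, 48)

# Counting: choose 5 of the 13 cards in one suit, times 4 suits, over all 5-card hands.
counting = Fraction(4 * comb(13, 5), comb(52, 5))

print(product_rule, "≈", float(product_rule))   # 33/16660 ≈ 0.00198
print(product_rule == counting)                 # True
```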
So now we have two methods of computing probabilities in many of our sample spaces. It is useful to keep these different methods around, both as a check on your answers and because in some cases one of the methods is easier to use than the other.
4.3 Unions of Events
You are in Las Vegas, and you spy a new game with the following rules. You pick a number between 1 and 6. Then three dice are thrown. You win if and only if your number comes up on at least one of the dice.
The casino claims that your odds of winning are 50%, using the following argument. Let A be the event that you win. We can write A = A1 ∪ A2 ∪ A3, where Ai is the event that your number comes up on die i. Clearly P[Ai] = 1/6 for each i. Therefore, they claim

P[A] = P[A1 ∪ A2 ∪ A3] = P[A1] + P[A2] + P[A3] = 3 × 1/6 = 1/2.
Is this calculation correct? Well, suppose instead that the casino rolled six dice, and again you win if and only if your number comes up at least once. Then the analogous calculation would say that you win with probability 6 × 1/6 = 1, i.e., certainly! The situation becomes even more ridiculous when the number of dice gets bigger than 6.
The problem is that the events Ai are not disjoint: i.e., there are some sample points that lie in more than one of the Ai. (We could get really lucky and our number could come up on two of the dice, or all three.) So, if we add up the P[Ai], we are counting some sample points more than once.
Fortunately, in Note 11 we learned about the Principle of Inclusion-Exclusion, which allows us to deal with this kind of situation. In the proof of Theorem 11.3, it was shown that every ω ∉ A1 ∪ ··· ∪ An is not counted by the Inclusion-Exclusion formula, while every ω ∈ A1 ∪ ··· ∪ An is counted exactly once by the formula. Hence, rather than summing (with appropriate signs) the cardinalities of the sets ⋂_{i∈S} Ai in the Inclusion-Exclusion formula, if we instead sum their probabilities P[⋂_{i∈S} Ai], we obtain the following useful formula for computing P[A1 ∪ ··· ∪ An]:
Theorem 14.2 (Inclusion-Exclusion). Let A1,…,An be events in some probability space, where n ≥ 2. Then, we have

P[A1 ∪ ··· ∪ An] = ∑_{k=1}^n (−1)^{k−1} ∑_{S⊆{1,…,n}: |S|=k} P[⋂_{i∈S} Ai].   (8)

The right hand side of (8) can be written as

P[⋃_{i=1}^n Ai] = ∑_{i} P[Ai] − ∑_{i<j} P[Ai ∩ Aj] + ∑_{i<j<k} P[Ai ∩ Aj ∩ Ak] − ··· + (−1)^{n−1} P[A1 ∩ A2 ∩ ··· ∩ An].
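Applied to the Las Vegas dice game above, where the dice rolls are independent, inclusion-exclusion gives P[A1 ∪ A2 ∪ A3] = 3 × 1/6 − 3 × 1/36 + 1/216 = 91/216 ≈ 0.42, noticeably less than the casino's claimed 1/2. The Python sketch below (ours, not from the note) evaluates this and cross-checks it against the complement 1 − (5/6)^3.

```python
from fractions import Fraction

# Inclusion-Exclusion for the three-dice game: the dice are independent,
# so P[Ai] = 1/6, P[Ai ∩ Aj] = 1/36, and P[A1 ∩ A2 ∩ A3] = 1/216.
p = Fraction(1, 6)
incl_excl = 3 * p - 3 * p**2 + p**3
print(incl_excl, "≈", float(incl_excl))   # 91/216 ≈ 0.4213

# Cross-check via the complement: you lose only if all three dice miss your number.
complement = 1 - Fraction(5, 6) ** 3
print(incl_excl == complement)            # True
```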