EECS 70 Discrete Mathematics and Probability Theory Fall 2020
Random Variables: Distribution and Expectation
Note 15
Recall our setup of a probabilistic experiment as a procedure of drawing a sample from a set of possible
values, and assigning a probability for each possible outcome of the experiment. For example, if we toss a
fair coin n times, then there are 2^n possible outcomes, each of which is equally likely and has probability 1/2^n.
Now suppose we want to make a measurement in our experiment. For example, we can ask what is the number of heads in n coin tosses; call this number X. Of course, X is not a fixed number, but it depends on the actual sequence of coin flips that we obtain. For example, if n = 4 and we observe the outcome ω = HTHH, then X = 3; whereas if we observe the outcome ω = HTHT, then X = 2. In this example of n coin tosses, we only know that X is an integer between 0 and n, but we do not know its exact value until we observe which outcome of the n coin flips is realized and count how many heads there are. Because every possible outcome is assigned a probability, the value X also carries with it a probability for each possible value it can take. The table below lists all the possible values X can take in the example of n = 4 coin tosses, along with their respective probabilities.
Such a value X that depends on the outcome of the probabilistic experiment is called a random variable (abbreviated r.v.). As we see from the example above, a random variable X typically does not have a definitive value, but instead only has a probability distribution over the set of possible values X can take, which is why it is called random. So the question "What is the number of heads in n coin tosses?" does not exactly make sense because the answer X is a random variable. But the question "What is the typical number of heads in n coin tosses?" makes sense: it is asking what is the average value of X (the number of heads) if we repeat the experiment of tossing n coins multiple times. This average value is called the expectation of X, and is one of the most useful summary quantities (also called statistics) of an experiment.
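The probabilities in the n = 4 table described above can be reproduced by brute-force enumeration. Here is a short Python sketch (an illustration of ours, not part of the original note); it enumerates all 16 outcomes and tallies the number of heads.

```python
# Enumerate all 2^4 outcomes of 4 fair coin tosses and tally the
# number of heads; each outcome has probability 1/16.
from itertools import product
from collections import Counter

outcomes = ["".join(w) for w in product("HT", repeat=4)]
counts = Counter(w.count("H") for w in outcomes)
dist = {a: counts[a] / 16 for a in range(5)}

# dist maps each value a of X to P[X = a]: 1/16, 4/16, 6/16, 4/16, 1/16.
```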
1 Random Variables
Before we formalize the above notions, let us consider another example to reinforce our conceptual understanding of a random variable.
Example: Fixed Points of Permutations
Question: Suppose we collect the homeworks of n students, randomly shuffle them, and return them to the
students. How many students receive their own homework?
outcomes ω                          | value of X (# heads) | probability of occurring
TTTT                                | 0                    | 1/16
HTTT, THTT, TTHT, TTTH              | 1                    | 4/16
HHTT, HTHT, HTTH, THHT, THTH, TTHH  | 2                    | 6/16
HHHT, HHTH, HTHH, THHH              | 3                    | 4/16
HHHH                                | 4                    | 1/16
Here the probability space consists of all n! permutations of the homeworks, each with equal probability 1/n!.
If we label the homeworks as 1,2,…,n, then each sample point is a permutation π = (π1,…,πn) where πi is the homework that is returned to the i-th student.
As in the coin flipping case above, our question does not have a simple numerical answer (such as 4), because the number depends on the particular permutation we choose (i.e., on the sample point). Let us call the number of fixed points Xn, which is a random variable. Recall the table in an earlier note that lists the value taken by X3 for different permutations of {1,2,3}.
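For n = 3, the distribution of X3 can be recomputed by enumerating all 3! = 6 permutations. A short Python sketch (illustrative, not from the note):

```python
# Count fixed points (π_i = i) over all permutations of {0, 1, 2}
# and convert counts into exact probabilities.
from itertools import permutations
from collections import Counter
from fractions import Fraction

counts = Counter(sum(pi[i] == i for i in range(3)) for pi in permutations(range(3)))
dist = {a: Fraction(c, 6) for a, c in counts.items()}

# dist gives P[X3 = 0] = 1/3, P[X3 = 1] = 1/2, P[X3 = 3] = 1/6;
# note that X3 = 2 is impossible.
```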
Formal Definition of a Random Variable
We now formalize the concepts discussed above.
Definition 15.1 (Random Variable). A random variable X on a sample space Ω is a function X : Ω → R
that assigns to each sample point ω ∈ Ω a real number X(ω).
Until further notice, we will restrict our attention to random variables that are discrete, i.e., they take values in a range that is finite or countably infinite. This means even though we define X to map Ω to R, the actual set of values {X(ω): ω ∈ Ω} that X takes is a discrete subset of R.
A random variable can be visualized in general by the picture in Figure 1.¹ Note that the term “random variable” is really something of a misnomer: it is a function so there is nothing random about it and it is definitely not a variable! What is random is which sample point of the experiment is realized and hence the value that the random variable maps the sample point to.
Figure 1: Visualization of how a random variable is defined on the sample space.
¹The figures in this note are inspired by figures in Chapter 2 of Introduction to Probability by D. Bertsekas and J. Tsitsiklis.
2 Probability Distribution
When we introduced the basic probability space in an earlier note, we defined two things:
1. The sample space Ω consisting of all the possible outcomes (sample points) of the experiment.

2. The probability of each of the sample points.
Analogously, there are two important things about any random variable:
1. The set of values that it can take.
2. The probabilities with which it takes on the values.
Since a random variable is defined on a probability space, we can calculate these probabilities given the probabilities of the sample points. Let a be any number in the range of a random variable X. Then the set
{ω ∈ Ω : X(ω) = a}
is an event in the sample space (simply because it is a subset of Ω). We usually abbreviate this event to simply “X = a”. Since X = a is an event, we can talk about its probability, P[X = a]. The collection of these probabilities, for all possible values of a, is known as the distribution of the random variable X.
Definition 15.2 (Distribution). The distribution of a discrete random variable X is the collection of values {(a,P[X = a]) : a ∈ A }, where A is the set of all possible values taken by X.
Thus, the distribution of the random variable X in our permutation example above is:

P[X = 0] = 1/3,  P[X = 1] = 1/2,  P[X = 3] = 1/6,

and P[X = a] = 0 for all other values of a.
The distribution of a random variable can be visualized as a bar diagram, shown in Figure 2. The x-axis represents the values that the random variable can assume. The height of the bar at a value a is the probability P[X = a]. Each of these probabilities can be computed by looking at the probability of the corresponding event in the sample space.
Note that the collection of events X = a, for a ∈ A, satisfies two important properties:

• Any two events X = a1 and X = a2 with a1 ≠ a2 are disjoint.

• The union of all these events is equal to the entire sample space Ω.
The collection of events thus forms a partition of the sample space (see Figure 2). Both properties follow directly from the fact that X is a function defined on Ω, i.e., X assigns a unique value to each and every possible sample point in Ω. As a consequence, the sum of the probabilities P[X = a] over all possible values of a is exactly equal to 1. So when we sum up the probabilities of the events X = a, we are really summing up the probabilities of all the sample points.
Figure 2: Visualization of how the distribution of a random variable is defined.
Bernoulli Distribution
A simple yet very useful probability distribution is the Bernoulli distribution of a random variable which takes values in {0, 1}:

P[X = i] = p,      if i = 1,
           1 − p,  if i = 0,
where 0 ≤ p ≤ 1. We say that X is distributed as a Bernoulli random variable with parameter p, and write X ∼ Bernoulli(p).
Binomial Distribution
Let us return to our coin tossing example above, where we defined our random variable X to be the number of heads. More formally, consider the random experiment consisting of n independent tosses of a biased coin that shows H with probability p. Each sample point ω is a sequence of tosses, and X(ω) is defined to be the number of heads in ω. For example, when n = 3, X(THH) = 2.
To compute the distribution of X, we first enumerate the possible values that X can take. They are simply
0,1,…,n. Then we compute the probability of each event X = i for i = 0,1,…,n. The probability of the
event X = i is the sum of the probabilities of all the sample points with exactly i heads (for example, if
n = 3 and i = 2, there would be three such sample points {HHT,HTH,THH}). Any such sample point has
probability p^i (1 − p)^(n−i), since the coin flips are independent. There are exactly (n choose i) of these sample points. Hence,

P[X = i] = (n choose i) p^i (1 − p)^(n−i),  for i = 0, 1, ..., n.   (1)

This distribution, called the binomial distribution, is one of the most important distributions in probability.
A random variable with this distribution is called a binomial random variable, and we write X ∼ Bin(n, p),
where n denotes the number of trials and p the probability of success (observing an H in the example). An
example of a binomial distribution is shown in Figure 3. Notice that due to the properties of X mentioned above, it must be the case that ∑_{i=0}^{n} P[X = i] = 1, which implies that ∑_{i=0}^{n} (n choose i) p^i (1 − p)^(n−i) = 1. This provides a probabilistic proof of the Binomial Theorem from an earlier note where we saw it combinatorially, for a = p and b = 1 − p.
Figure 3: The binomial distributions for two choices of (n, p).
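Equation (1) and the sum-to-1 property can be checked numerically. Here is a small Python sketch (the function name binom_pmf and the parameters n = 10, p = 0.3 are our own choices for illustration):

```python
from math import comb

def binom_pmf(n, p, i):
    # P[X = i] for X ~ Bin(n, p), following equation (1).
    return comb(n, i) * p**i * (1 - p)**(n - i)

# The probabilities sum to 1: a numerical instance of the Binomial Theorem
# with a = p and b = 1 - p.
n, p = 10, 0.3
total = sum(binom_pmf(n, p, i) for i in range(n + 1))
```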
Although we define the binomial distribution in terms of an experiment involving tossing coins, this distribution is useful for modeling many real-world problems. Consider for example the error correction problem studied earlier. Recall that we wanted to encode n packets into n + k packets such that the recipient can reconstruct the original n packets from any n packets received. But in practice, the number of packet losses is random, so how do we choose k, the amount of redundancy? If we model each packet getting lost with probability p and the losses are independent, then if we transmit n + k packets, the number of packets received is a random variable X with binomial distribution: X ∼ Bin(n + k, 1 − p) (we are tossing a coin n + k times, and each coin turns out to be a head (packet received) with probability 1 − p). So the probability of successfully decoding the original data is:
P[X ≥ n] = ∑_{i=n}^{n+k} P[X = i] = ∑_{i=n}^{n+k} (n+k choose i) (1 − p)^i p^(n+k−i).
Given fixed n and p, we can choose k such that this probability is no less than, say, 0.99.
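This choice of k can be computed by a simple search. A Python sketch follows (the values n = 100 and loss probability p = 0.05 are made-up parameters for illustration):

```python
from math import comb

def decode_prob(n, k, p):
    # P[X >= n] for X ~ Bin(n + k, 1 - p): at least n of the n + k
    # transmitted packets are received.
    q = 1 - p
    return sum(comb(n + k, i) * q**i * p**(n + k - i) for i in range(n, n + k + 1))

def smallest_k(n, p, target=0.99):
    # Smallest redundancy k whose decoding probability reaches `target`.
    k = 0
    while decode_prob(n, k, p) < target:
        k += 1
    return k

k = smallest_k(100, 0.05)  # hypothetical parameters
```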
Hypergeometric Distribution
Consider an urn containing N = B + W balls, where B balls are black and W are white. Suppose you randomly sample n ≤ N balls from the urn with replacement, and let X denote the number of black balls in your sample. What is the probability distribution of X? Since the probability of seeing a black ball is B/N for each draw, independently of all other draws, X follows the binomial distribution Bin(n, p), where p = B/N.
What if you randomly sample n ≤ N balls from the urn without replacement? In this case, the probability of seeing a black ball in the i-th draw depends on the colors of the i − 1 balls already drawn; that is, unlike
in the case of sampling with replacement, you don’t have independence. The probability distribution of the number Y of black balls in this setting can be found as follows. Consider a sequence ω of n draws where the first k balls are black and the next n − k balls are white. The probability of this particular sequence of draws can be computed as
P[ω] = (B/N) × ((B−1)/(N−1)) × ··· × ((B−k+1)/(N−k+1)) × (W/(N−k)) × ((W−1)/(N−k−1)) × ··· × ((W−(n−k)+1)/(N−n+1))   (2)

     = [B!/(B−k)!] × [W!/(W−(n−k))!] × [(N−n)!/N!]

     = [B!/(B−k)!] × [(N−B)!/(N−B−(n−k))!] × [(N−n)!/N!].

Furthermore, any other sequence ω′ consisting of k black balls and n − k white balls has exactly the same probability as ω: if we carry out a similar calculation as in (2) for the sequence of draws in ω′, the numerators appearing in P[ω′] will be some permutation of the numerators in the first line of (2), while the denominators will be exactly the same as in (2). Since there are (n choose k) distinct sequences of length n consisting of k black balls and n − k white balls, we obtain

P[Y = k] = (n choose k) × [B!/(B−k)!] × [(N−B)!/(N−B−(n−k))!] × [(N−n)!/N!]
         = (B choose k) (N−B choose n−k) / (N choose n),

for k ∈ {0, 1, ..., n}. (Note that (m choose j) = 0 if j > m, so P[Y = k] ≠ 0 only if max(0, n+B−N) ≤ k ≤ min(n, B).)
This probability distribution is called the hypergeometric distribution with parameters N,B,n, and we write
Y ∼ Hypergeometric(N,B,n).
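The hypergeometric formula can be cross-checked against direct enumeration, since every n-subset of the urn is equally likely. A short Python sketch (the small parameters N = 5, B = 3, n = 2 are chosen for illustration):

```python
from math import comb
from fractions import Fraction
from itertools import combinations

def hypergeom_pmf(N, B, n, k):
    # P[Y = k] for Y ~ Hypergeometric(N, B, n);
    # math.comb(m, j) returns 0 when j > m, matching the note's convention.
    return Fraction(comb(B, k) * comb(N - B, n - k), comb(N, n))

# Brute-force check for an urn with N = 5 balls (B = 3 black), n = 2 draws:
# enumerate every 2-subset and compare the fraction with k black balls.
N, B, n = 5, 3, 2
balls = ["black"] * B + ["white"] * (N - B)
for k in range(n + 1):
    hits = sum(s.count("black") == k for s in combinations(balls, n))
    assert hypergeom_pmf(N, B, n, k) == Fraction(hits, comb(N, n))
```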
3 Multiple Random Variables and Independence
Often one is interested in multiple random variables on the same sample space. Consider, for example, the sample space of flipping three coins. One could define many random variables: for example, a random variable X indicating the number of heads in the sequence of coin tosses, or a random variable Y indicating the number of tails, or a random variable Z indicating whether the first toss is H or not. Note that for each sample point, every random variable has a specific value: e.g., for ω = HTT, we have X(ω) = 1, Y(ω) = 2, and Z(ω) = 1.
The concept of a distribution can then be extended to probabilities for the combination of values for multiple random variables.
Definition 15.3. The joint distribution for two discrete random variables X and Y is the collection of values {((a, b), P[X = a, Y = b]) : a ∈ A, b ∈ B}, where A is the set of all possible values taken by X and B is the set of all possible values taken by Y.
When given a joint distribution for X and Y , the distribution P[X = a] for X is called the marginal distribution for X , and can be found by “summing” over the values of Y . That is,
P[X = a] = ∑_{b∈B} P[X = a, Y = b].
The marginal distribution for Y is analogous, as is the notion of a joint distribution for any number of random variables.
A joint distribution over random variables X1,…,Xn (for example, Xi could be the value of the i-th roll of a sequence of n die rolls) is P[X1 = a1,…,Xn = an], where ai ∈ Ai and Ai is the set of possible values for Xi. The marginal distribution for Xi is simply the distribution for Xi and can be obtained by summing over all the possible values of the other variables, but in some cases can be derived more simply. We proceed to one such case.
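As an illustration of marginalization (an example of our own, not from the note): roll two fair dice, let X be the first roll and Y the larger of the two rolls. The joint distribution of (X, Y) is not uniform, yet summing over the values of Y recovers the uniform marginal of X.

```python
from fractions import Fraction

# Build the joint distribution P[X = a, Y = b] over the 36 equally
# likely outcomes of two fair dice.
joint = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        key = (d1, max(d1, d2))
        joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

def marginal_X(a):
    # P[X = a] = sum over b of P[X = a, Y = b].
    return sum(p for (x, y), p in joint.items() if x == a)

# Each marginal_X(a) equals 1/6, as expected for a fair die.
```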
Independence for random variables is defined in an analogous fashion to independence for events:
Definition 15.4 (Independence). Random variables X and Y on the same probability space are said to be independent if the events X = a and Y = b are independent for all values a, b. Equivalently, the joint distribution of independent r.v.’s decomposes as
P[X = a, Y = b] = P[X = a] P[Y = b],  ∀ a, b.

Mutual independence of more than two r.v.’s is defined similarly.
A very important example of independent random variables are indicator random variables for independent events. If Ii denotes the indicator r.v. for the i-th toss of a coin being H, then I1, ..., In are mutually independent random variables. This example motivates the commonly used phrase “independent and identically distributed (i.i.d.) set of random variables.” In this example, {I1, ..., In} is a set of i.i.d. indicator random variables.
4 Expectation
The distribution of a r.v. contains all the probabilistic information about the r.v. In most applications, however, the complete distribution of a r.v. is very hard to calculate. For example, consider the homework permutation example with n = 20. In principle, we would have to enumerate 20! ≈ 2.4 × 10^18 sample points, compute the value of X at each one, and count the number of points at which X takes on each of its possible values (though in practice we could streamline this calculation a bit)! Moreover, even when we can compute the complete distribution of a r.v., it is often not very informative.
For these reasons, we seek to summarize the distribution into a more compact, convenient form that is also easier to compute. The most widely used such form is the expectation (or mean or average) of the r.v.
Definition 15.5 (Expectation). The expectation of a discrete random variable X is defined as
E[X] = ∑_{a∈A} a × P[X = a],   (3)

where the sum is over all possible values taken by the r.v.
(Technical Note. Expectation is well defined provided that the sum on the right hand side of (3) is absolutely convergent, i.e., ∑_{a∈A} |a| × P[X = a] < ∞. There are random variables for which expectations do not exist.)
For our simpler permutation example with only 3 students, the expectation is

E[X] = (0 × 1/3) + (1 × 1/2) + (3 × 1/6) = 0 + 1/2 + 1/2 = 1.
That is, the expected number of fixed points in a permutation of three items is exactly 1.
The expectation can be seen in some sense as a “typical” value of the r.v. (though note that E[X] may not actually be a value that X can take). The question of how typical the expectation is for a given r.v. is a very important one that we shall return to in a later lecture.
Here is a physical interpretation of the expectation of a random variable: imagine carving out a wooden cutout figure of the probability distribution as in Figure 4. Then the expected value of the distribution is the balance point (directly below the center of gravity) of this object.
Figure 4: Physical interpretation of expected value as the balance point.
4.1 Examples
1. Single die. Throw a fair die once and let X be the number that comes up. Then X takes on values 1, 2, ..., 6, each with probability 1/6, so

E[X] = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 21/6 = 7/2.

Note that X never actually takes on its expected value 7/2.
2. Two dice. Throw two fair dice and let X be the sum of their scores. Then the distribution of X is

a        | 2    | 3    | 4    | 5   | 6    | 7   | 8    | 9   | 10   | 11   | 12
P[X = a] | 1/36 | 1/18 | 1/12 | 1/9 | 5/36 | 1/6 | 5/36 | 1/9 | 1/12 | 1/18 | 1/36

The expectation is therefore

E[X] = (2 × 1/36) + (3 × 1/18) + (4 × 1/12) + ··· + (12 × 1/36) = 7.
3. Roulette. A roulette wheel is spun (recall that a roulette wheel has 38 slots: the numbers 1, 2, ..., 36, half of which are red and half black, plus 0 and 00, which are green). You bet $1 on Black. If a black number comes up, you receive your stake plus $1; otherwise you lose your stake. Let X be your net winnings in one game. Then X can take on the values +1 and −1, and P[X = 1] = 18/38, P[X = −1] = 20/38. Thus,

E[X] = (1 × 18/38) + (−1 × 20/38) = −1/19;

i.e., you expect to lose about a nickel per game. Notice how the zeros tip the balance in favor of the casino!
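Both expectations above can be verified mechanically from the definition of expectation; a short Python sketch:

```python
from fractions import Fraction

# Two dice: average the sum over all 36 equally likely outcomes.
E_two_dice = sum(Fraction(d1 + d2, 36) for d1 in range(1, 7) for d2 in range(1, 7))

# Roulette: +1 with probability 18/38, -1 with probability 20/38.
E_roulette = 1 * Fraction(18, 38) + (-1) * Fraction(20, 38)

# E_two_dice == 7 and E_roulette == -1/19, matching the calculations above.
```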
4.2 Linearity of Expectation
So far, we have computed expectations by brute force: i.e., we have written down the whole distribution and then added up the contributions for all possible values of the r.v. The real power of expectations is that in many real-life examples they can be computed much more easily using a simple shortcut. The shortcut is the following:
Theorem 15.1. For any two random variables X and Y on the same probability space, we have E[X +Y] = E[X]+E[Y].
Also, for any constant c, we have
E[cX] = cE[X].
Proof. We first rewrite the definition of expectation in a more convenient form. Recall from Definition 15.5 that

E[X] = ∑_{a∈A} a × P[X = a].
Consider a particular term a × P[X = a] in the above sum. Notice that P[X = a], by definition, is the sum of P[ω] over those sample points ω for which X(ω) = a. Furthermore, we know that every sample point ω ∈ Ω is in exactly one of these events X = a. This means we can write out the above definition in a more long-winded form as
E[X] = ∑_{ω∈Ω} X(ω) × P[ω].   (4)
This equivalent definition of expectation will make the present proof much easier (though it is usually less convenient for actual calculations). Applying (4) to E[X + Y] gives:
E[X + Y] = ∑_{ω∈Ω} (X + Y)(ω) × P[ω]
         = ∑_{ω∈Ω} (X(ω) + Y(ω)) × P[ω]
         = ∑_{ω∈Ω} X(ω) × P[ω] + ∑_{ω∈Ω} Y(ω) × P[ω]
         = E[X] + E[Y].

In the last step, we used (4) twice.
This completes the proof of the first equality. The proof of the second equality is much simpler and is left
as an exercise.
Theorem 15.1 is very powerful: it says that the expectation of a sum of r.v.’s is the sum of their expectations, with no assumptions about the r.v.’s. We can use Theorem 15.1 to conclude things like E[3X − 5Y] = 3 E[X] − 5 E[Y], regardless of whether or not X and Y are independent. This important property is known as linearity of expectation.
Important caveat: Theorem 15.1 does not say that E[XY] = E[X] E[Y], or that E[1/X] = 1/E[X], etc. These claims
are not true in general. It is only sums and differences and constant multiples of random variables that behave so nicely.
4.3 Applications of Linearity of Expectation
Now let us see some examples of Theorem 15.1 in action.
1. Two dice again. Here is a much less painful way of computing E[X], where X is the sum of the scores of the two dice. Note that X = Y1 + Y2, where Yi is the score on die i. We know from example 1 in Section 4.1 that E[Y1] = E[Y2] = 7/2. So, by Theorem 15.1, we have E[X] = E[Y1] + E[Y2] = 7.
2. More roulette. Suppose we play the roulette game mentioned in Section 4.1 n ≥ 1 times. Let Xn be our net winnings. Then Xn = Y1 + Y2 + ··· + Yn, where Yi is our net winnings in the i-th play. We know from earlier that E[Yi] = −1/19 for each i. Therefore, by Theorem 15.1, E[Xn] = E[Y1] + E[Y2] + ··· + E[Yn] = −n/19. For n = 1000, E[Xn] = −1000/19 ≈ −53, so if you play 1000 games, you expect to lose about $53.
3. Fixed points of permutations. Let us return to the homework permutation example with an arbitrary number n of students. Let Xn denote the number of students who receive their own homework after shuffling (or equivalently, the number of fixed points). To take advantage of Theorem 15.1, we need to write Xn as a sum of simpler r.v.’s. Since Xn counts the number of times something happens, we can write it as a sum using the following useful trick:
Xn = I1 + I2 + ··· + In,  where  Ii = 1 if student i gets their own homework, and Ii = 0 otherwise.   (5)
[You should think about this equation for a moment. Remember that all the Ii’s are random variables. What does an equation involving random variables mean? What we mean is that, at every sample point ω, we have Xn(ω) = I1(ω)+I2(ω)+···+In(ω). Why is this true?]
A Bernoulli random variable such as Ii is called an indicator random variable of the corresponding event (in this case, the event that student i gets their own homework). For indicator r.v.’s, the expectation is particularly easy to calculate. Specifically,

E[Ii] = (0 × P[Ii = 0]) + (1 × P[Ii = 1]) = P[Ii = 1].

In our case, we have
P[Ii = 1] = P[student i gets their own homework] = 1/n.
We can now apply Theorem 15.1 to (5), yielding

E[Xn] = E[I1] + E[I2] + ··· + E[In] = n × (1/n) = 1.
So, we see that the expected number of students who get their own homeworks in a class of size n is 1. That is, the expected number of fixed points in a random permutation of n items is always 1, regardless of n.
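This surprising fact (E[Xn] = 1 for every n) can be confirmed by exhaustive enumeration for small n; a Python sketch:

```python
# E[X_n] computed by brute force over all n! permutations,
# using exact rational arithmetic.
from itertools import permutations
from fractions import Fraction
from math import factorial

def expected_fixed_points(n):
    total = sum(sum(pi[i] == i for i in range(n)) for pi in permutations(range(n)))
    return Fraction(total, factorial(n))

# Matches the linearity-of-expectation argument: exactly 1 for every n.
results = [expected_fixed_points(n) for n in range(1, 7)]
```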
4. Coin tosses. Toss a fair coin n ≥ 1 times. Let the r.v. Xn be the number of heads observed. As in the previous example, to take advantage of Theorem 15.1 we write
Xn = I1 +I2 +···+In,
where Ii is the indicator r.v. of the event that the i-th toss is H. Since the coin is fair, we have E[Ii] = P[Ii = 1] = P[i-th toss is H] = 1/2.
Using Theorem 15.1, we therefore get

E[Xn] = ∑_{i=1}^{n} 1/2 = n/2.
In n tosses of a biased coin that shows H with probability p, E[Xn] = np. (Check this.) So the expectation of a binomial r.v. X ∼ Bin(n, p) is equal to np. Note that it would have been harder to reach the same conclusion by computing this directly from the definition of expectation shown in (3).
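The claim E[X] = np can also be checked numerically against the definition of expectation; a Python sketch (the parameter pairs below are arbitrary test values):

```python
from math import comb

def binomial_mean(n, p):
    # E[X] = sum over i of i * P[X = i] for X ~ Bin(n, p).
    return sum(i * comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1))

# Agrees with n * p up to floating-point error for any choice of (n, p).
checks = [(10, 0.5), (20, 0.3), (7, 0.9)]
```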
5. Balls and bins. Throw m balls into n bins. Let the r.v. X be the number of balls that land in the first bin. Then X behaves exactly like the number of heads in m tosses of a biased coin with P[H] = 1/n (why?). So, from the previous example, we get E[X] = m/n. In the special case m = n, the expected number of balls in any bin is 1. If we wanted to compute this directly from the distribution of X, we would get into a messy calculation involving binomial coefficients.
Here is another example on the same sample space. Let the r.v. Yn be the number of empty bins. The distribution of Yn is horrible to contemplate: to get a feel for this, you might like to write it down for m = n = 3 (i.e., 3 balls, 3 bins). However, computing the expectation E[Yn] is easy using Theorem 15.1. As in the previous two examples, we write
Yn = I1 + I2 + ··· + In,   (6)

where Ii is the indicator r.v. of the event “bin i is empty”. The expectation of Ii is easy to find:
E[Ii] = P[Ii = 1] = P[bin i is empty] = (1 − 1/n)^m,
as discussed earlier. Applying Theorem 15.1 to (6), we therefore obtain
E[Yn] = ∑_{i=1}^{n} E[Ii] = n (1 − 1/n)^m,
a simple formula, quite easily derived. Let us see how it behaves in the special case m = n (same number of balls as bins). In this case we get E[Yn] = n(1 − 1/n)^n. Now the quantity (1 − 1/n)^n can be approximated (for large enough values of n) by the number 1/e.² So we see that, for large n,

E[Yn] ≈ n/e ≈ 0.368n.
The bottom line is that, if we throw (say) 1000 balls into 1000 bins, the expected number of empty bins is about 368.
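A quick numerical check of the formula n(1 − 1/n)^m and of the 1/e approximation, in Python:

```python
from math import exp

def expected_empty_bins(m, n):
    # E[Y_n] = n * (1 - 1/n)^m, from the indicator computation above.
    return n * (1 - 1 / n) ** m

# For m = n = 1000 the exact value is about 367.7, close to 1000/e ≈ 367.9.
val = expected_empty_bins(1000, 1000)
```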
²More generally, it is a standard fact that for any constant c,

(1 + c/n)^n → e^c  as n → ∞.

We just used this fact in the special case c = −1. The approximation is actually very good even for quite small values of n. (Try it yourself!) E.g., for n = 20 we already get (1 − 1/20)^20 ≈ 0.358, which is very close to 1/e ≈ 0.368. The approximation gets better and better for larger n.