IEOR 4404 Lecture 2: Probability Review

Outline

Probability space

Conditioning and independence

Random variables – discrete and continuous

Expectation (and variance)

Some well-known random variables (next time)


Probability Space

Need to specify the following three objects:

Sample space (usually denoted by Ω) is the set of all possible
outcomes (we usually denote outcomes by ω).

▶ Ex. 1: the sample space of a single coin toss is {H, T}.
▶ Ex. 2: the sample space of three consecutive tosses is
{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
▶ Ex. 3: the sample space of a horse race of 7 horses numbered
1, 2, . . . , 7 is the set of all possible orderings of the set {1, 2, . . . , 7}
(assuming no ties). There are altogether 7! possible outcomes.

Events (usually denoted by A, B, etc.) are subsets of the sample
space Ω.

▶ Subsets may be “complicated” but their descriptions are often succinct.
▶ Ex. 4: In Ex. 2, the event that “the second coin toss is Heads” is the
subset {HHH, THH, HHT, THT}.
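
The counting claim in Ex. 3 can be checked directly by enumeration. A minimal sketch (mine, not from the slides), listing all orderings of 7 horses:

```python
# Count the horse-race sample space of Ex. 3: all orderings of {1, ..., 7}.
from itertools import permutations
from math import factorial

omega = list(permutations(range(1, 8)))  # every possible finishing order
assert len(omega) == factorial(7) == 5040
```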


Probability Space

Sample space (usually denoted by Ω) is the set of all possible
outcomes (we usually denote outcomes by ω).

Events (usually denoted by A, B, etc.) are subsets of the sample
space Ω.

Probability (usually denoted by P) assigns numeric values to events
so that the following three properties are satisfied:

▶ 0 ≤ P(A) ≤ 1 for all events A.
▶ P(Ω) = 1.
▶ Whenever A1, A2, . . . are disjoint events, i.e., no two of them
have any common element, the probability of their union is the
sum of their respective probabilities:

P(A1 ∪ A2 ∪ · · · ) = P(A1) + P(A2) + · · · .

Some useful properties: P(Ac) = 1 − P(A);
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Ex. 5: to model an unbiased coin toss, Ω = {H, T}, the events are
∅, {H}, {T}, {H, T}, and P(H) = P(T) = 1/2.
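
These axioms and the two derived properties can be verified mechanically on a finite space. A sketch (mine) on the three-toss sample space of Ex. 2, with equally likely outcomes:

```python
# Check the probability axioms, inclusion-exclusion, and the complement
# rule on the three-toss sample space of Ex. 2 (equally likely outcomes).
from itertools import product

omega = ["".join(t) for t in product("HT", repeat=3)]  # Ex. 2 sample space
P = lambda E: len(E) / len(omega)                      # equally likely outcomes

A = {w for w in omega if w[1] == "H"}  # Ex. 4: "second toss is Heads"
B = {w for w in omega if w[0] == "T"}  # "first toss is Tails"

assert 0 <= P(A) <= 1                      # axiom 1
assert P(set(omega)) == 1                  # axiom 2: P(Omega) = 1
assert P(A | B) == P(A) + P(B) - P(A & B)  # P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(set(omega) - A) == 1 - P(A)       # P(A^c) = 1 - P(A)
```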


Conditioning and Independence

Definition 1 (Conditional probability). For two events A and B with
P(B) > 0, the conditional probability of A given B, denoted by
P(A | B), is defined to be

P(A | B) = P(A ∩ B) / P(B).

Theorem 2 (Law of total probability). Let B1, B2, . . . be disjoint
events whose union is Ω. Then

P(A) = ∑i P(A ∩ Bi) = ∑i P(A | Bi) P(Bi).

Theorem 3 (Bayes’ theorem).

P(A | B) = [P(B | A) / P(B)] · P(A).

(Bayes’ theorem gives a simple formula for how we update our beliefs
(about A) given observations (B).)
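
Theorems 2 and 3 can be exercised on a small numeric example. A hedged sketch (the two-urn numbers below are my own, not from the slides): B1, B2 are which urn is chosen, A is drawing a red ball.

```python
# Numeric check of the law of total probability and Bayes' theorem.
# B1 = "urn 1 chosen", B2 = "urn 2 chosen" (disjoint, union is Omega);
# A = "red ball drawn". All probability values are illustrative.
P_B = {1: 0.5, 2: 0.5}          # P(B1), P(B2)
P_A_given_B = {1: 0.3, 2: 0.8}  # P(A | B1), P(A | B2)

# Law of total probability: P(A) = sum_i P(A | Bi) P(Bi)
P_A = sum(P_A_given_B[i] * P_B[i] for i in P_B)
assert abs(P_A - 0.55) < 1e-12

# Bayes' theorem: P(B1 | A) = P(A | B1) P(B1) / P(A)
P_B1_given_A = P_A_given_B[1] * P_B[1] / P_A
assert abs(P_B1_given_A - 3 / 11) < 1e-12
```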


Conditioning and Independence

Definition 4 (Independence of events). Two events A and B are
independent if

P(A ∩ B) = P(A)P(B).

Note: equivalently, A and B are independent if P(A | B) = P(A)
(when P(B) > 0).

Common misconceptions about independence and conditioning:

▶ Independence is NOT the same as disjointness: if A and B are disjoint,
then A ∩ B = ∅ and P(A ∩ B) = 0, so disjoint events can be
independent only when P(A) = 0 or P(B) = 0.
▶ P(A | B) ≠ P(B | A) in general.
▶ P(A | Bc) ≠ 1 − P(A | B) in general.
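
The independence-versus-disjointness distinction is easy to see on two fair dice. A sketch (my own example, not from the slides), using exact rational arithmetic:

```python
# Contrast independence with disjointness on two fair six-sided dice.
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))   # all 36 outcomes
P = lambda E: Fraction(len(E), len(omega))     # equally likely outcomes

A = {w for w in omega if w[0] == 6}   # first die shows 6
B = {w for w in omega if w[1] == 6}   # second die shows 6
C = {w for w in omega if w[0] <= 3}   # first die at most 3 (disjoint from A)

assert P(A & B) == P(A) * P(B)        # A and B are independent
assert A & C == set()                 # A and C are disjoint ...
assert P(A & C) != P(A) * P(C)        # ... and therefore NOT independent
```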


Conditioning: the Boy-or-Girl Paradox

Q1: Mr. Jones has two children. The older child is a girl. What is the
probability that both children are girls?

▶ Assumptions: each child is a boy or a girl with equal probability 1/2,
independently of the other.
▶ Sample space: (boy, boy), (boy, girl), (girl, boy), (girl, girl).
▶ Probability: 1/4 for each outcome.
▶ Event A that “both children are girls” is {(girl, girl)}.
▶ Event B that “the older child is a girl” is {(girl, boy), (girl, girl)}.
Note that A is a subset of B.
▶ P(A | B) = P(A ∩ B)/P(B) = P(A)/P(B) = (1/4)/(1/2) = 1/2.

Q2: Mr. Smith has two children. At least one of them is a boy. What
is the probability that both children are boys?


Conditioning: the Boy-or-Girl Paradox

Q2: Mr. Smith has two children. At least one of them is a boy. What
is the probability that both children are boys?

▶ Here the event C that “both children are boys” is {(boy, boy)}.
▶ The event D that “at least one of them is a boy” is
{(boy, boy), (girl, boy), (boy, girl)}.
▶ We can again compute P(C | D) = P(C)/P(D) = (1/4)/(3/4) = 1/3.
▶ The conditioning event affects the posterior probability.

Revisit Q2: how on earth do I know at least one of the children is a
boy?

▶ Consider the following scenario. Both children are playing in the
backyard. One of them is hiding in the woods, and I see the other one
in the open, who turns out to be a boy. I can now conclude that at
least one of the children is a boy.
▶ What is the probability that both children are boys?
▶ Convince yourself that the conditional probability is 1/2 in this case!
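
A Monte Carlo sketch (mine) of the two readings of Q2, under the slide's assumptions plus one extra modeling assumption for the revisited scenario: the child I happen to see is a uniformly random one of the two, independently of sex.

```python
# Simulate the two conditioning models of Q2: conditioning on the event
# "at least one boy" (answer 1/3) versus conditioning on observing a
# uniformly chosen child who turns out to be a boy (answer 1/2).
import random
random.seed(0)

n = 200_000
both = cond = 0        # Q2 as stated
both_seen = seen = 0   # revisited scenario

for _ in range(n):
    kids = [random.choice("BG"), random.choice("BG")]
    if "B" in kids:                    # condition on "at least one boy"
        cond += 1
        both += kids == ["B", "B"]
    observed = random.choice(kids)     # the child seen in the open
    if observed == "B":                # condition on the observed child
        seen += 1
        both_seen += kids == ["B", "B"]

print(both / cond)        # close to 1/3
print(both_seen / seen)   # close to 1/2
```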


Random variables

A random variable can be thought of as numeric values associated
with the outcomes of a probability space.

▶ E.g., for a single coin toss, we can assign the value 0 to the outcome
“Heads” and 1 to “Tails”. This assignment is a random variable (r.v.).
▶ If we assign 0 to “Heads” and 2 to “Tails”, this is a different r.v.

“Definition” 5. A random variable X is a function/map from Ω to
R. The function F : R → [0, 1] defined by

F(x) = P(X ≤ x)

is called the cumulative distribution function (cdf) of X.

Remarks. 1) This is not a rigorous definition but is sufficient for our
purposes.
2) X, Y, Z are common symbols used to denote r.v.
3) Functions of r.v. are also r.v.
4) Cdfs are non-decreasing functions with limx→−∞ F(x) = 0 and
limx→∞ F(x) = 1.
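
The "function from Ω to R" view can be made concrete for the 0/1 coin-toss r.v. in the first bullet. A sketch (mine) computing its cdf from the definition:

```python
# The coin-toss random variable as a literal map Omega -> R, and its cdf
# F(x) = P(X <= x) computed by summing over outcomes.
omega = ["H", "T"]                  # sample space, equally likely outcomes
X = lambda w: 0 if w == "H" else 1  # r.v.: Heads -> 0, Tails -> 1

def F(x):
    """cdf of X: the probability mass of outcomes w with X(w) <= x."""
    return sum(1 for w in omega if X(w) <= x) / len(omega)

assert F(-1) == 0.0   # F tends to 0 on the left
assert F(0) == 0.5    # P(X <= 0) = P({H}) = 1/2
assert F(0.7) == 0.5  # flat between the two jump points
assert F(1) == 1.0    # F tends to 1 on the right
```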


Random variables

Discrete random variables. A r.v. X is discrete if it takes on a
countable number of possible values, in which case we can enumerate
them as x1, x2, . . .. Its probability mass function (p.m.f.) p(·) is
defined by

p(x) = P(X = x).

Continuous random variables. A r.v. X is continuous if there
exists a nonnegative function f such that for every x ∈ R,

F(x) = P(X ≤ x) = ∫ from −∞ to x of f(y) dy.

f is the probability density function (p.d.f.) of X.
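
Both definitions can be sanity-checked numerically. A sketch (my own examples: a fair-die p.m.f. and the density f(y) = 2y on [0, 1]), approximating the cdf integral with a midpoint rule:

```python
# Check that a p.m.f. sums to 1, and recover a cdf from a p.d.f. by
# numerically integrating the density f(y) = 2y on [0, 1].
p = {x: 1 / 6 for x in range(1, 7)}      # p.m.f. of a fair die
assert abs(sum(p.values()) - 1) < 1e-12  # a p.m.f. must sum to 1

f = lambda y: 2 * y if 0 <= y <= 1 else 0.0  # a p.d.f. on [0, 1]

def F(x, steps=100_000):
    """F(x) = integral of f from -infinity to x (midpoint rule)."""
    if x <= 0:
        return 0.0
    h = x / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h

assert abs(F(1) - 1) < 1e-6      # total probability is 1
assert abs(F(0.5) - 0.25) < 1e-6 # P(X <= 1/2) = (1/2)^2
```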


Random variables

An example of a r.v. X that is neither discrete nor continuous.

Suppose X takes values in [0, 1] ∪ {2}.
I For x < 0, F (x) = 0.
I For x ∈ [0, 1], F (x) = P(X ≤ x) = x/2.
I For x = 2, P(X = x) = 1/2 (so for x ∈ (1, 2), P(X ≤ x) = 1/2; and
for x ≥ 2, P(X ≤ x) = 1).

The cdf plot of X : [F rises linearly from 0 to 0.5 on [0, 1], stays flat at
0.5 on (1, 2), and jumps to 1 at x = 2.]

Random variables

Independence of r.v. Two r.v. X and Y are independent if for all real
values x and y,

P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y).

Expectation. The expectation of a discrete r.v. X is

E[X ] = ∑_x x p(x).

The expectation of a continuous r.v. Y with pdf f is

E[Y ] = ∫_{−∞}^{∞} y f (y) dy.

Expectation of functions of r.v. For a discrete r.v. X :
E[g(X )] = ∑_x g(x)p(x). For a continuous r.v. Y :
E[g(Y )] = ∫_{−∞}^{∞} g(y)f (y) dy.

Random variables

Useful identities and facts:

I Var(X ) = E[X^2] − (E[X ])^2;
I E[aX ] = aE[X ]; Var(aX ) = a^2 Var(X ).
I E[X + Y ] = E[X ] + E[Y ];
Var(X + Y ) = Var(X ) + Var(Y ) + 2Cov(X, Y ).
I If r.v. X and Y are independent, then E[XY ] = E[X ]E[Y ] and
Var(X + Y ) = Var(X ) + Var(Y ).
I For an event A, we can define its indicator function (a r.v.!) 1A by
1A = 1 if event A happens and 1A = 0 if event A doesn't happen.
Then E[1A] = P(A).

Discrete Random Variables

Bernoulli random variables

Binomial random variables

Geometric random variables

Negative binomial random variables

Poisson random variables

Bernoulli Random Variables

Take only two possible values, 0 and 1. A Bernoulli r.v. X with
parameter p is defined by

P(X = 0) = 1 − p; P(X = 1) = p.

Sometimes also called a Bernoulli trial.

An example (coin toss):
I Sample space Ω = {H,T}.
I X (H) = 1, X (T ) = 0.
I P(X = 1) = p.

Denoted by Bern(p).

Binomial Random Variables

A binomial r.v. X with parameters (n, p) is defined by

p(k) = P(X = k) = (n choose k) p^k (1 − p)^(n−k), k = 0, 1, 2, . . . , n.

Recall (n choose k) = n!/(k!(n−k)!) is the number of ways to pick k objects
from n homogeneous items.

Interpretation: number of successes in n i.i.d Bernoulli trials with
parameter p.

Sample space: the set of all 0–1 sequences of length n.
I X counts the number of 1s in the sequence (the number of
successes/Heads among all trials).

Denoted by Bin(n, p).
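The interpretation of Bin(n, p) as the number of successes in n i.i.d Bernoulli(p) trials can be checked by simulation. A standard-library Python sketch; n = 8, p = 0.4, and k = 3 are illustrative choices.

```python
import random
from math import comb

rng = random.Random(1)
n, p, N = 8, 0.4, 100_000
# One Bin(n, p) draw = number of successes among n independent Bernoulli(p) trials.
samples = [sum(1 for _ in range(n) if rng.random() < p) for _ in range(N)]

k = 3
emp = sum(1 for x in samples if x == k) / N        # empirical P(X = 3)
exact = comb(n, k) * p**k * (1 - p) ** (n - k)     # Bin(8, 0.4) pmf at k = 3
```

With 100,000 draws the empirical frequency agrees with the pmf to within a few tenths of a percent.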

Geometric Random Variables

A Geometric random variable X with parameter p is defined by

P(X = k) = p(1 − p)^(k−1), k = 1, 2, 3, . . . .

Interpretation: number of i.i.d Bernoulli trials that need to be
performed to see a success.

I If number is k, have k − 1 failures (each with probability 1− p) and a
success on the kth trial (w.p. p).

Sample space: set of all infinite 0− 1 sequences.
I Think of 1 as a success.
I E.g., geometric r.v. X maps the sequence (0, 0, 0, 0, 1, 1, 0, 1, 1, 1, . . .)

to the number 5.

Sometimes people define a geometric r.v. by
P(X = k) = p(1 − p)^k, k = 0, 1, 2, . . ., but we don't use this one.
Memoryless property of geometric r.v.:

P(X > k + m | X > m) = P(X > k).

Denoted by Geom(p).


Geometric Random Variables

Proof of the memoryless property.

Let X ∼ Geom(p). Then P(X > k) = (1 − p)^k.

I Can compute directly.
I Or use the interpretation in terms of Bernoulli trials: {X > k} is the
event that there is no success in the first k trials – probability (1 − p)^k.

We also have

P(X > k + m | X > m) = P({X > k + m} ∩ {X > m}) / P(X > m)
                     = P(X > k + m) / P(X > m)
                     = (1 − p)^(k+m) / (1 − p)^m
                     = (1 − p)^k
                     = P(X > k).
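The memoryless property can also be observed empirically. A Python sketch using only the standard library; the parameters p = 0.3, k = 2, m = 3 below are arbitrary choices.

```python
import random

def geom_sample(p, rng):
    """Count Bernoulli(p) trials up to and including the first success: Geom(p)."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(2)
p, k, m, N = 0.3, 2, 3, 200_000
samples = [geom_sample(p, rng) for _ in range(N)]

# Conditional tail P(X > k + m | X > m) vs. unconditional tail P(X > k) = (1-p)^k.
past_m = [x for x in samples if x > m]
cond = sum(1 for x in past_m if x > k + m) / len(past_m)
uncond = sum(1 for x in samples if x > k) / N
```

Both estimates land near (1 − 0.3)^2 = 0.49, matching the identity just proved.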


Negative Binomial Random Variables

A negative binomial r.v. X with parameters (r, p) is defined by

p(k) = (k−1 choose r−1) p^r (1 − p)^(k−r), k = r, r + 1, . . . .

Interpretation: consider an infinite sequence of i.i.d Bernoulli trials. X
is the number of trials required to see r successes.

I Consider the event {X = k}.
I It's exactly the event that (a) the kth trial is a success; and (b) there
are r − 1 successes among the first k − 1 trials.
I (a) and (b) are independent events, and P((a)) = p;
P((b)) = (k−1 choose r−1) p^(r−1) (1 − p)^(k−r).

Denoted by NB(r, p).

Suppose Y1, Y2, . . . , Yr are i.i.d. Geom(p) r.v.; then
X = Y1 + . . . + Yr is NB(r, p).
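The last fact — a sum of r i.i.d. Geom(p) r.v. is NB(r, p) — can be checked by simulation. A Python sketch; r = 4, p = 0.5, and k = 6 are illustrative choices.

```python
import random
from math import comb

def geom_sample(p, rng):
    """Trials up to and including the first success in i.i.d. Bernoulli(p) trials."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(3)
r, p, N = 4, 0.5, 100_000
# Each draw: sum of r independent Geom(p) samples; should be NB(r, p).
samples = [sum(geom_sample(p, rng) for _ in range(r)) for _ in range(N)]

k = 6
emp = sum(1 for x in samples if x == k) / N
exact = comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)  # NB(4, 0.5) pmf at k = 6
mean = sum(samples) / N                                 # should be near r/p = 8
```

The empirical frequency at k = 6 matches the pmf value 10/64 = 0.15625, and the sample mean matches r/p.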


Poisson Random Variables

A Poisson r.v. X with mean λ is defined by

p(k) = P(X = k) = e^(−λ) λ^k / k!, k = 0, 1, 2, . . . .

Denoted by Pois(λ).

Can approximate a binomial r.v. with a Poisson r.v.

I Consider a binomial r.v. with parameters (n, p).
I Suppose λ = np is fixed. As n → ∞ and p → 0,

(n choose k) p^k (1 − p)^(n−k) → e^(−λ) λ^k / k!.


Mean and Variance of These Discrete RV

Distribution   Mean   Variance

Bern(p)        p      p(1 − p)
Bin(n, p)      np     np(1 − p)
Geom(p)        1/p    (1 − p)/p^2
NB(r, p)       r/p    r(1 − p)/p^2
Pois(λ)        λ      λ
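Any row of the table can be verified directly from the definition of expectation over the pmf. A sketch for the Bin(n, p) row; n = 10, p = 0.3 are illustrative choices.

```python
from math import comb

n, p = 10, 0.3
# Full pmf of Bin(10, 0.3) on its support {0, 1, ..., 10}.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Mean and variance computed straight from E[X] = sum k p(k), Var = E[X^2] - E[X]^2.
mean = sum(k * pk for k, pk in enumerate(pmf))
second = sum(k * k * pk for k, pk in enumerate(pmf))
var = second - mean**2
# Table entries: mean = np = 3.0, variance = np(1-p) = 2.1.
```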

Continuous Random variables

Uniform random variables

Exponential random variables

Normal/Gaussian random variables

Uniform Random Variables

A uniform r.v. X on the interval [a, b] has density function

f (x) = 1/(b − a) if x ∈ [a, b]; 0 otherwise.

Cdf is

F (x) = (x − a)/(b − a) for x ∈ [a, b]; 1 for x ≥ b; 0 for x < a.

Denoted by Unif (a, b).

Whether end-points are open/closed does not really matter.

If X is a uniform r.v., then cX + d is also a uniform r.v. (for c ≠ 0).

Exponential Random Variables

An exponential r.v. X with parameter λ is defined by the density
function

f (x) = λ e^(−λx) for x ≥ 0; 0 for x < 0.

Cdf is F (x) = 1 − e^(−λx) if x ≥ 0, and F (x) = 0 if x < 0.

Denoted by Exp(λ).

Properties:
I If X ∼ Exp(λ), then cX ∼ Exp(λ/c), c > 0.
I Memoryless property: P(X > x + y | X > y) = P(X > x). This also
characterizes the class of exponential r.v.
I Consider the sum of S i.i.d Exp(λ) r.v. X1, . . . , XS, where
S ∼ Geom(p) is independent from the Xi. Then ∑_{i=1}^{S} Xi ∼ Exp(pλ).
I Xi independent Exp(λi), i = 1, 2, . . . , n. Then
min{X1, . . . , Xn} ∼ Exp(λ1 + λ2 + . . . + λn).
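Two facts can be checked together by simulation: exponentials can be generated from uniforms by inverting the cdf, and the minimum of independent Exp(λi) draws should behave like Exp(λ1 + · · · + λn). A Python sketch; the rates (1, 2, 3) are arbitrary choices.

```python
import random
from math import log

def exp_sample(lam, rng):
    """Inverse-transform sampling: if U ~ Unif(0, 1) then -ln(1 - U)/lam ~ Exp(lam)."""
    return -log(1.0 - rng.random()) / lam

rng = random.Random(4)
lams = (1.0, 2.0, 3.0)
N = 200_000
# Minimum of independent Exp(1), Exp(2), Exp(3) draws: should be Exp(1 + 2 + 3) = Exp(6).
mins = [min(exp_sample(lam, rng) for lam in lams) for _ in range(N)]

mean_min = sum(mins) / N                       # Exp(6) has mean 1/6
tail = sum(1 for v in mins if v > 0.2) / N     # P(min > 0.2) = e^(-6 * 0.2)
```

Using 1 − U instead of U avoids log(0), since Python's random() can return 0.0 but never 1.0.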


Gaussian/Normal Random Variables

A Gaussian r.v. X with mean µ and variance σ^2 has density function

f (x) = (1/(√(2π) σ)) exp(−(x − µ)^2/(2σ^2)), x ∈ R.

Usually denoted by N(µ, σ^2).

N(0, 1) is often called the standard normal; its cdf is denoted by Φ(x).

If X ∼ N(0, 1), then σX + µ ∼ N(µ, σ^2).

χ^2-distribution: if Z1, · · · , Zk are i.i.d standard normals, then
Z1^2 + · · · + Zk^2 is said to have a χ^2-distribution with k degrees of
freedom.

I Fact: Z1^2 + Z2^2 ∼ Exp(1/2).
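The fact that Z1^2 + Z2^2 is Exp(1/2) can be checked by simulation. A Python sketch using the standard library's normal generator; the sample size and seed are arbitrary choices.

```python
import random

rng = random.Random(5)
N = 200_000
# Sum of squares of two i.i.d. standard normals; claimed to be Exp(1/2).
vals = [rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2 for _ in range(N)]

mean_val = sum(vals) / N                     # Exp(1/2) has mean 1/(1/2) = 2
tail = sum(1 for v in vals if v > 2.0) / N   # P(X > 2) = e^(-(1/2)*2) = e^(-1)
```

Both the mean and the tail probability match the Exp(1/2) values.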


Gaussian/Normal Random Variables

Plot of the pdf of N(µ, σ²)
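The original slide shows the familiar bell curve; in lieu of the figure, a minimal sketch evaluating the density formula above (the helper name `normal_pdf` is my own) illustrates its two defining features: the peak at x = µ with height 1/(σ√(2π)), and symmetry about µ.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu, sigma = 1.0, 2.0
print(normal_pdf(mu, mu, sigma))  # peak height, equals 1/(sigma*sqrt(2*pi))
print(normal_pdf(mu - 1, mu, sigma) == normal_pdf(mu + 1, mu, sigma))  # True
```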

Mean and Variance of These Continuous RV

Distribution   Mean        Variance
Unif(a, b)     (a + b)/2   (b − a)²/12
Exp(λ)         1/λ         1/λ²
N(µ, σ²)       µ           σ²
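Each row of the table can be checked by comparing sample moments to the formulas (a sketch with arbitrary parameter values and sample size):

```python
import random
import statistics

random.seed(1)
n = 200_000
a, b, lam = 2.0, 6.0, 0.5

u = [random.uniform(a, b) for _ in range(n)]      # Unif(a, b) draws
e = [random.expovariate(lam) for _ in range(n)]   # Exp(lambda) draws

# Unif(a, b): mean (a+b)/2 = 4, variance (b-a)^2/12 = 4/3
print(round(statistics.mean(u), 1), round(statistics.pvariance(u), 1))
# Exp(lambda): mean 1/lambda = 2, variance 1/lambda^2 = 4
print(round(statistics.mean(e), 1), round(statistics.pvariance(e), 1))
```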

Limit Theorems

The (strong) law of large numbers (SLLN). X₁, X₂, . . . are i.i.d. r.v.
with finite mean µ. Then with probability 1,

    X̄ₙ = (X₁ + X₂ + · · · + Xₙ)/n → µ.

SLLN does not tell us how the empirical average X̄ₙ fluctuates
around the true mean, µ.

Central limit theorem (CLT). X₁, X₂, . . . are i.i.d. r.v. with finite
mean µ and finite variance σ². Then for all x ∈ ℝ,

    P( (√n/σ)[X̄ₙ − µ] < x ) → Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy,  as n → ∞.

Fluctuation of X̄ₙ around µ is roughly O(1/√n); X̄ₙ ≈ N(µ, σ²/n).

Conditional Expectation (and Cond. Variance)

Suppose X and Y are both discrete r.v.

Recall conditional probability:

    P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y).
∑ₓ P(X = x | Y = y) = ∑ₓ P(X = x, Y = y)/P(Y = y) = P(Y = y)/P(Y = y) = 1

Can define conditional expectation of X given Y = y:

    E[X | Y = y] = ∑ₓ x P(X = x | Y = y)

Note that the mapping y → E[X | Y = y] is a function of r.v. Y

The conditional expectation of X given Y is the random variable
ω → E[X | Y = Y(ω)]

Conditional Expectation (and Cond. Variance)

Theorem. E[E[X | Y]] = E[X].

Proof of theorem.
E[E[X | Y]] = ∑_y E[X | Y = y] P(Y = y)
            = ∑_y { ∑ₓ x P(X = x | Y = y) } P(Y = y)
            = ∑_y { ∑ₓ x P(X = x, Y = y)/P(Y = y) } P(Y = y)
            = ∑_y ∑ₓ x P(X = x, Y = y)
            = ∑ₓ x { ∑_y P(X = x, Y = y) }
            = ∑ₓ x P(X = x) = E[X].

Conditional Expectation (and Cond. Variance)

Conditional Variance. Var(X | Y) = E[(X − E[X | Y])² | Y]

I Can show Var(X | Y) = E[X² | Y] − (E[X | Y])²

Proposition. Var(X) = E[Var(X | Y)] + Var(E[X | Y]).

Proof of proposition.
LHS = E[X²] − (E[X])²
    = E[E[X² | Y]] − (E[E[X | Y]])²
    = E[Var(X | Y) + (E[X | Y])²] − (E[E[X | Y]])²
    = E[Var(X | Y)] + { E[(E[X | Y])²] − (E[E[X | Y]])² }
    = E[Var(X | Y)] + Var(E[X | Y]) = RHS.
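Both identities — the tower property E[E[X | Y]] = E[X] and the variance decomposition Var(X) = E[Var(X | Y)] + Var(E[X | Y]) — can be checked exactly on a small discrete joint pmf. The pmf and the helper names below are arbitrary illustrations, not from the lecture:

```python
# A small joint pmf for (X, Y); the values are an arbitrary example
pmf = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (2, 1): 0.4}

def p_y(y):
    """Marginal P(Y = y)."""
    return sum(p for (_, yy), p in pmf.items() if yy == y)

def cond_exp(f, y):
    """E[f(X) | Y = y] for the discrete pmf above."""
    return sum(f(x) * p / p_y(y) for (x, yy), p in pmf.items() if yy == y)

ys = {y for _, y in pmf}

# Tower property: E[E[X|Y]] = E[X]
e_x = sum(x * p for (x, _), p in pmf.items())
tower = sum(cond_exp(lambda x: x, y) * p_y(y) for y in ys)
print(abs(tower - e_x) < 1e-9)  # True

# Variance decomposition: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
var_x = sum(x * x * p for (x, _), p in pmf.items()) - e_x ** 2
cond_var = {y: cond_exp(lambda x: x * x, y) - cond_exp(lambda x: x, y) ** 2
            for y in ys}
e_cond_var = sum(cond_var[y] * p_y(y) for y in ys)
var_cond_exp = sum(cond_exp(lambda x: x, y) ** 2 * p_y(y) for y in ys) - tower ** 2
print(abs(var_x - (e_cond_var + var_cond_exp)) < 1e-9)  # True
```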