Probability Review
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
Today’s Objectives
• Basic Concepts in Set Theory and Probability Theory
• Kolmogorov’s Axioms
• Random Variable
• Conditional Probability
• Bayes’ Rule
• Expectation and Variance
• Appendix: Two Popular Inequalities
Have you ever made a
decision by flipping a coin?
Head
Tail
Basic Probability Concepts (1/2)
Experiment
In probability, an experiment is any
process that can be repeated in which
the results are uncertain.
Elementary/Simple Event
A simple event is any single outcome
from a probability experiment. Each
simple event is denoted 𝜔𝑖 , 𝑖 = 1,2,⋯.
Example
If the experiment consists of tossing a
coin, two outcomes, head and tail, are
the elementary events, therefore,
𝜔1 = H, 𝜔2 = T.
Basic Probability Concepts (2/2)
Sample Space
The set Ω = {𝜔1, 𝜔2, ⋯ } of all possible
outcomes of a particular experiment is
called the sample space for the
experiment.
Example
If the experiment consists of flipping a
coin, the sample space contains two
outcomes: head and tail; Ω = {H, T}
Event
An event 𝐸 is any collection of possible
outcomes of an experiment, that is, any
subset of the sample space Ω.
Example
If the experiment consists of throwing a
die, the sample space contains six
outcomes, Ω = {𝜔1, ⋯ , 𝜔6}. Then, the
event that the outcome is greater than
four is a subset of the sample
space, i.e., 𝐴 = {𝜔5, 𝜔6} ⊂ Ω.
Notations in Set Theory and Probability Theory
Notation Set theory Probability theory
Ω Collection of objects Sample space
𝜔 ∈ Ω Member of Ω Elementary event, outcome
𝐴 ⊂ Ω Subset of Ω Event that some outcome in 𝐴 occurs
𝐴ᶜ Complement Event that no outcome in 𝐴 occurs
𝐴 ∩ 𝐵 Intersection Both 𝐴 and 𝐵
𝐴 ∪ 𝐵 Union Either 𝐴 or 𝐵
𝐴 ∖ 𝐵 Difference 𝐴 but not 𝐵
∅ Empty set Impossible event
Set Theory Basics
• 𝐴 ⊂ 𝐵 ⇔ 𝑥 ∈ 𝐴 ⇒ 𝑥 ∈ 𝐵
• 𝐴 = 𝐵 ⇔ 𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴
• 𝐴 ∪ 𝐵 = {𝑥: 𝑥 ∈ 𝐴 or 𝑥 ∈ 𝐵}
• 𝐴 ∩ 𝐵 = {𝑥: 𝑥 ∈ 𝐴 and 𝑥 ∈ 𝐵}
• 𝐴ᶜ = {𝑥: 𝑥 ∉ 𝐴}
• Commutativity
𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴
𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴
• Associativity
(𝐴 ∪ 𝐵) ∪ 𝐶 = 𝐴 ∪ (𝐵 ∪ 𝐶)
(𝐴 ∩ 𝐵) ∩ 𝐶 = 𝐴 ∩ (𝐵 ∩ 𝐶)
• Distributive Laws
𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶)
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶)
• DeMorgan’s Laws
(𝐴 ∪ 𝐵)ᶜ = 𝐴ᶜ ∩ 𝐵ᶜ
(𝐴 ∩ 𝐵)ᶜ = 𝐴ᶜ ∪ 𝐵ᶜ
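These identities are easy to check on finite sets. Below is a minimal sketch using Python's built-in set type; the universe Ω and the sets 𝐴, 𝐵, 𝐶 are made-up examples, not taken from the slides.

omega = set(range(1, 11))                   # a small illustrative universe, Ω = {1, ..., 10}
A, B, C = {1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7}
comp = lambda S: omega - S                  # complement taken within Ω

print(A | (B & C) == (A | B) & (A | C))     # distributive law  -> True
print(comp(A | B) == comp(A) & comp(B))     # De Morgan's law   -> True
print(comp(A & B) == comp(A) | comp(B))     # De Morgan's law   -> True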
𝜎-Algebra/Field
Let Ω be a nonempty set, and let ℱ be a collection of subsets of Ω. We say
that ℱ is a 𝝈-algebra/field if and only if:
1) The empty set belongs to ℱ, that is, ∅ ∈ ℱ
2) Whenever a set 𝐸 ∈ ℱ, its complement 𝐸ᶜ ∈ ℱ
3) Whenever a sequence of sets 𝐸1, 𝐸2, … ∈ ℱ, then ⋃_{𝑖=1}^{∞} 𝐸𝑖 ∈ ℱ
The pair (Ω, ℱ) is called a measurable space.
Examples of 𝜎-field
1) If the experiment consists of just one flip of a fair coin, then the outcomes
are either heads or tails: Ω = {𝐻, 𝑇}. The 𝜎-algebra ℱ = 2^Ω contains 4
events: ∅, {𝐻}, {𝑇}, {𝐻, 𝑇}. Therefore, ℱ = {∅, {𝐻}, {𝑇}, {𝐻, 𝑇}}.
2) The smallest 𝜎-field associated with Ω is the collection ℱ = {∅, Ω}.
3) If 𝐴 is any subset of Ω, then ℱ = {∅, 𝐴, 𝐴ᶜ, Ω} is a 𝜎-field.
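As a quick check of example 1, the power set 2^Ω can be enumerated programmatically. This is a minimal sketch using Python's itertools; the variable names are my own.

from itertools import chain, combinations

omega = {"H", "T"}                           # sample space for one coin flip
# enumerate every subset of Ω, i.e. the sigma-algebra F = 2^Ω
sigma_field = [set(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]
print(sigma_field)                           # [set(), {'H'}, {'T'}, {'H', 'T'}]  (4 events)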
Probability Measure (Kolmogorov’s Axioms)
Let Ω be a nonempty set, and let ℱ be a 𝜎-algebra of
subsets of Ω. A probability measure ℙ: ℱ → [0,1] is
a function that, to every set 𝐸 ∈ ℱ, assigns a number
in [0,1], which satisfies the following conditions:
1) If Ω is the sample space, then ℙ(Ω) = 1.
2) For all 𝐸 ∈ ℱ, ℙ(𝐸) ≥ 0.
3) For mutually exclusive events 𝐸1, 𝐸2, …
(i.e., 𝐸𝑖 ∩ 𝐸𝑗 = ∅ if 𝑖 ≠ 𝑗),
ℙ(⋃_{𝑖=1}^{∞} 𝐸𝑖) = ∑_{𝑖=1}^{∞} ℙ(𝐸𝑖).
The triple (Ω, ℱ, ℙ) is called a probability space.
Andrey Kolmogorov
The probability of an event 𝑨,
denoted by ℙ(𝐴), is the likelihood
of that event occurring.
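For a finite sample space the axioms can be checked directly. Below is a minimal sketch for a fair six-sided die, assuming the uniform assignment of 1/6 per outcome (the classical approach introduced next); the events 𝐸1, 𝐸2 are my own examples.

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E), 6)            # uniform measure: each outcome has probability 1/6

E1, E2 = {1, 2}, {5, 6}                      # mutually exclusive events: E1 ∩ E2 = ∅
print(P(omega) == 1)                         # axiom 1: P(Ω) = 1                        -> True
print(P(E1) >= 0 and P(E2) >= 0)             # axiom 2: non-negativity                  -> True
print(P(E1 | E2) == P(E1) + P(E2))           # axiom 3: additivity for disjoint events  -> True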
Methods of Assigning Probabilities
• Classical approach
• Relative-frequency approach
• Subjective approach
Classical Approach
If an experiment has 𝑛 simple
outcomes, this method assigns
a probability of 1/𝑛 to each outcome. In
other words, each outcome is
assumed to have an equal probability
of occurrence. This method is also
called the equally likely outcomes approach.
Example
Roll of a die: Ω = {1, 2, ⋯, 6}. Each
simple event has a 1/6 chance of occurring.
Relative-Frequency Approach
Probabilities are assigned on the basis of experimentation or historical data. Let
𝐴 be an event and assume that you have performed the same experiment 𝑛
times so that 𝑛 is the number of times 𝐴 could have occurred. Further, let #(𝐴)
be the number of times that 𝐴 did occur. Then, ℙ(𝐴) is defined by the relative
frequency #(𝐴)/𝑛 as follows:
ℙ(𝐴) = lim_{𝑛→∞} #(𝐴)/𝑛.
Please Be Aware …
It is not physically feasible to repeat
an experiment an infinite number of
times.
Two sets of 𝑛 experiments may result
in two different ratios. However, we
expect the discrepancy to converge to
0 for large 𝑛. Hence, for large 𝑛, the
ratio #(𝐴)/𝑛 may be taken as a
reasonable approximation for ℙ(𝐴).
Example
Roll the given die 1000 times and
suppose the number of times the
outcome 1 is observed is 145. Thus,
𝐴 = {1}, #(𝐴) = 145, and 𝑛 = 1000.
Therefore, we say that ℙ(𝐴) is
approximately equal to
145/1000 = 0.145
(For comparison, the classical approach gives ℙ(𝐴) = 1/6 ≈ 0.167.)
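A short simulation illustrates the relative-frequency idea; this is a minimal sketch, with an arbitrary seed and sample sizes of my own choosing.

import random

random.seed(0)
for n in [100, 1_000, 100_000]:
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, rolls.count(1) / n)             # relative frequency #(A)/n for A = {1}; tends to 1/6 ≈ 0.167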
Subjective Approach
In the subjective approach, we define probability as the degree of belief that
we hold in the occurrence of an event. Thus, judgment is used as the basis
for assigning probabilities.
Exercise: A fair coin is tossed 5 times. What is the probability
of getting at least three heads on consecutive tosses?
3 consecutive heads:
ℙ({HHHTT, THHHT, TTHHH, HTHHH, HHHTH}) = 5 × (1/2)^5 = 5/32.
4 consecutive heads:
ℙ({HHHHT, THHHH}) = 2 × (1/2)^5 = 2/32.
5 consecutive heads:
ℙ({HHHHH}) = (1/2)^5 = 1/32.
Hence, we have
ℙ(At least three consecutive heads) = 5/32 + 2/32 + 1/32 = 8/32 = 1/4.
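The result can be double-checked by brute force over all 2^5 = 32 equally likely sequences; a minimal sketch (the substring test for a run of heads is my own shortcut):

from itertools import product

seqs = ["".join(s) for s in product("HT", repeat=5)]      # all 32 equally likely outcomes
at_least_3_in_a_row = [s for s in seqs if "HHH" in s]     # contains a run of 3+ consecutive heads
print(len(at_least_3_in_a_row) / len(seqs))               # 8/32 = 0.25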
Random Variable
A random variable 𝑋:Ω → ℝ is a real-valued
function that assigns a number to each outcome
in the sample space of a random experiment.
[Figure: a random variable 𝑋 maps each outcome 𝜔 in the sample space Ω to a real number in ℝ; for a set 𝐵 ⊆ ℝ, the event {𝑋 ∈ 𝐵} = {𝜔 ∈ Ω: 𝑋(𝜔) ∈ 𝐵}.]
Probability Distribution Function
The probability distribution function of a random variable 𝑋, denoted by 𝔽𝑋(⋅),
is defined to be the function 𝔽𝑋: ℝ → [0,1] which assigns
𝔽𝑋(𝑥) = 𝔽(𝑥) = ℙ(𝑋 ≤ 𝑥) = ℙ({𝜔 ∈ Ω: 𝑋(𝜔) ≤ 𝑥}),
for every 𝑥 ∈ ℝ.
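As a concrete illustration, the distribution function of a fair die can be evaluated pointwise; a minimal sketch assuming the uniform pmf from the classical approach. The printed values illustrate the properties listed on the next slide.

prob = {k: 1/6 for k in range(1, 7)}          # pmf of a fair die

def F(x):
    """F_X(x) = P(X <= x), a right-continuous step function."""
    return sum(p for k, p in prob.items() if k <= x)

for x in [-1, 0.5, 1, 3.5, 6, 10]:
    print(x, round(F(x), 3))                  # 0.0, 0.0, 0.167, 0.5, 1.0, 1.0 (non-decreasing)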
Properties of 𝔽𝑋(⋅)
• lim_{𝑥→−∞} 𝔽𝑋(𝑥) = 0 and lim_{𝑥→+∞} 𝔽𝑋(𝑥) = 1
• 𝔽𝑋(𝑎) ≤ 𝔽𝑋(𝑏) for 𝑎 < 𝑏 (monotone and non-decreasing)
• 𝔽𝑋(⋅) is continuous from the right: lim_{ℎ→0⁺} 𝔽𝑋(𝑥 + ℎ) = 𝔽𝑋(𝑥)
Probability Mass Function and Density Function
The probability mass function 𝑓(⋅) of a discrete RV 𝑋 is
𝑓(𝑥) = ℙ(𝑋 = 𝑥).
For a continuous RV 𝑋, a function 𝑓: ℝ → [0, ∞) is a density function if and only if:
• 𝑓(𝑥) ≥ 0 for all 𝑥;
• ∫_{−∞}^{∞} 𝑓(𝑥) d𝑥 = 1.
Joint Probability Distribution and Marginal Distribution
The joint probability distribution of 𝑋 and 𝑌 is defined as
𝔽(𝑥, 𝑦) = ℙ(𝑋 ≤ 𝑥, 𝑌 ≤ 𝑦).
Example
Suppose you flip 2 coins together; what is the probability of two heads?
The individual marginal distribution of 𝑋 is
𝔽𝑋(𝑥) = ℙ(𝑋 ≤ 𝑥, 𝑌 ≤ ∞).
Example
Suppose you flip 2 coins together; what is the probability that coin 1 is a head?
Conditional Probability
The conditional probability of event 𝐸𝑖 given event 𝐸𝑗 is defined as
ℙ(𝐸𝑖 | 𝐸𝑗) = ℙ(𝐸𝑖, 𝐸𝑗) / ℙ(𝐸𝑗), or ℙ(𝐸𝑖 ∩ 𝐸𝑗) / ℙ(𝐸𝑗),
where ℙ(𝐸𝑗) > 0.
If 𝐸𝑖 and 𝐸𝑗 are independent, i.e., ℙ(𝐸𝑖 ∩ 𝐸𝑗) = ℙ(𝐸𝑖)ℙ(𝐸𝑗), then
ℙ(𝐸𝑖 | 𝐸𝑗) = ℙ(𝐸𝑖)ℙ(𝐸𝑗) / ℙ(𝐸𝑗) = ℙ(𝐸𝑖).
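A minimal sketch of the definition, enumerating the sample space of two fair dice; the particular events 𝐸𝑖 and 𝐸𝑗 below are my own illustrations, not from the slides.

from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice: 36 equally likely outcomes
P = lambda E: len(E) / len(omega)              # classical probability of an event (a set of outcomes)

Ei = {w for w in omega if w[0] + w[1] >= 10}   # E_i: the sum is at least 10
Ej = {w for w in omega if w[0] == 6}           # E_j: the first die shows 6

print(P(Ei & Ej) / P(Ej))                      # P(E_i | E_j) = P(E_i ∩ E_j) / P(E_j) = 0.5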
Bayes’ Rule
Let the events 𝐸1, 𝐸2, … be a partition of the sample space Ω such that ℙ(𝐸𝑖) > 0 for
𝑖 = 1, 2, ⋯, and let 𝐵 be any event such that ℙ(𝐵) > 0. Then, for 𝑖 = 1, ⋯,
ℙ(𝐸𝑖 | 𝐵) = ℙ(𝐸𝑖) ℙ(𝐵 | 𝐸𝑖) / ∑_{𝑗=1}^{∞} ℙ(𝐸𝑗) ℙ(𝐵 | 𝐸𝑗).
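A minimal sketch of the formula for a finite partition; the priors and likelihoods are made-up numbers purely for illustration.

def bayes(priors, likelihoods, i):
    """Return P(E_i | B) from priors P(E_j) and likelihoods P(B | E_j)."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))   # P(B) by total probability
    return priors[i] * likelihoods[i] / evidence

priors = [0.5, 0.3, 0.2]               # P(E_1), P(E_2), P(E_3) -- a partition of the sample space
likelihoods = [0.1, 0.4, 0.8]          # P(B | E_1), P(B | E_2), P(B | E_3)
print(bayes(priors, likelihoods, 2))   # P(E_3 | B) ≈ 0.485

The same function reproduces the Monty Hall answer on the next slides with priors (1/3, 1/3, 1/3) and likelihoods (1/2, 1, 0).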
Pick up a door!
Behind one of the doors is a cool prize!
Behind the others, also cool prizes!
Suppose you pick door No. 1.
The host, who knows what’s behind the doors, opens another door, say No. 3,
which has a pig. He then says to you, “Do you want to pick door No. 2?” Is it to
your advantage to switch your choice?
Let 𝐷1, 𝐷2, 𝐷3 stand for the event that
the car is behind door No.1, No. 2,
No. 3, respectively. Thus
ℙ(𝐷1 ∪ 𝐷2 ∪ 𝐷3) = 1.
Since the car is randomly placed
behind one of the doors, then
ℙ(𝐷1) = ℙ(𝐷2) = ℙ(𝐷3) = 1/3.
[Figure: doors 1, 2, 3 with ℙ(𝐷1) = 1/3 and ℙ(𝐷2 ∪ 𝐷3) = 2/3.]
The host NEVER reveals a door that has
the car behind it, but randomly chooses
which door to open if neither has the car
behind it. Let 𝐻3 stand for the event that
the host reveals a pig behind door No. 3.
The problem boils down to calculating
ℙ(𝐷1 ∣ 𝐻3). If ℙ(𝐷1 ∣ 𝐻3) < 1/2, you should switch doors.
ℙ(𝐷1 ∣ 𝐻3) = ℙ(𝐻3 ∣ 𝐷1)ℙ(𝐷1) / [ℙ(𝐻3 ∣ 𝐷1)ℙ(𝐷1) + ℙ(𝐻3 ∣ 𝐷2)ℙ(𝐷2) + ℙ(𝐻3 ∣ 𝐷3)ℙ(𝐷3)]
= (1/2 × 1/3) / (1/2 × 1/3 + 1 × 1/3 + 0 × 1/3) = 1/3.
Since ℙ(𝐷1 ∣ 𝐻3) = 1/3 < 1/2 (equivalently, ℙ(𝐷2 ∣ 𝐻3) = 2/3), you should switch.
Expectation
Expectation, expected value, or mean of a RV 𝑋, denoted by 𝔼(𝑋), is the average value of 𝑋 in a large number of experiments:
𝔼(𝑋) = ∑_𝑖 𝑥𝑖 𝑓(𝑥𝑖), if 𝑋 is discrete,
𝔼(𝑋) = ∫ 𝑥 𝑓(𝑥) d𝑥, if 𝑋 is continuous.
Variance
Variance measures how much 𝑋 varies around the expected value. If 𝜇 ≔ 𝔼(𝑋), the variance is defined as
𝕍(𝑋) = 𝔼[(𝑋 − 𝜇)²] = ∑_𝑖 (𝑥𝑖 − 𝜇)² 𝑓(𝑥𝑖), if 𝑋 is discrete,
𝕍(𝑋) = 𝔼[(𝑋 − 𝜇)²] = ∫ (𝑥 − 𝜇)² 𝑓(𝑥) d𝑥, if 𝑋 is continuous.
Important relationship of variance and expectation:
𝕍(𝑋) = 𝔼(𝑋²) − [𝔼(𝑋)]².
Covariance
If 𝑋 and 𝑌 are random variables, then their covariance is
Cov(𝑋, 𝑌) = 𝔼[(𝑋 − 𝜇𝑋)(𝑌 − 𝜇𝑌)].
The correlation between 𝑋 and 𝑌 is then given by
𝜌(𝑋, 𝑌) = Cov(𝑋, 𝑌) / (𝜎𝑋𝜎𝑌).
Expectation of A Function of A Random Variable
Let 𝑋 be a random variable and 𝑔(⋅): ℝ → ℝ be a function. The expectation or expected value of the function 𝑔(⋅) of the random variable 𝑋, denoted by 𝔼(𝑔(𝑋)), is defined by
𝔼(𝑔(𝑋)) = ∑_𝑖 𝑔(𝑥𝑖) 𝑓(𝑥𝑖), if 𝑋 is discrete,
𝔼(𝑔(𝑋)) = ∫ 𝑔(𝑥) 𝑓(𝑥) d𝑥, if 𝑋 is continuous.
Properties of Expectations
• 𝔼(𝑐) = 𝑐 for a constant 𝑐 ∈ ℝ
• 𝔼(𝑐 𝑔(𝑋)) = 𝑐 𝔼(𝑔(𝑋))
• 𝔼(𝑐1 𝑔1(𝑋) + 𝑐2 𝑔2(𝑋)) = 𝑐1 𝔼(𝑔1(𝑋)) + 𝑐2 𝔼(𝑔2(𝑋))
• 𝔼(𝑔1(𝑋)) ≤ 𝔼(𝑔2(𝑋)) if 𝑔1(𝑥) ≤ 𝑔2(𝑥) for all 𝑥 ∈ ℝ
Summary
• Basic Concepts in Set Theory and Probability Theory
• Kolmogorov’s Axioms
• Random Variable
• Conditional Probability
• Bayes’ Rule
• Expectation and Variance
• Appendix: Two Popular Inequalities
Thank You!
bchen@lincoln.ac.uk
Appendix: Two Popular Inequalities
For some basic concepts, please check your Math for Computing slides. Thanks!
Chebyshev’s Inequality
Let 𝑋 be a random variable and 𝑔(⋅) a non-negative function, then
ℙ(𝑔(𝑋) ≥ 𝑘) ≤ 𝔼(𝑔(𝑋)) / 𝑘, ∀𝑘 > 0.
Chebyshev’s inequality guarantees
that in any data sample or probability
distribution, “nearly all” values are
close to the mean.
Corollary of Chebyshev’s Inequality
If 𝑋 is a random variable with finite variance, then
ℙ(|𝑋 − 𝜇𝑋| ≥ 𝑟𝜎𝑋) = ℙ((𝑋 − 𝜇𝑋)² ≥ 𝑟²𝜎𝑋²) ≤ 1/𝑟².
Proof. Take 𝑔(𝑋) = (𝑋 − 𝜇𝑋)² and 𝑘 = 𝑟²𝜎𝑋².
Example
Suppose we randomly select a journal article from a source with an average of
1000 words per article, with a standard deviation of 200 words. We can then
infer that the probability that it has between 600 and 1400 words (i.e., within 𝑟 = 2
standard deviations of the mean) must be at least 75%. [Wiki]
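A minimal sketch checking this bound by simulation, assuming (purely for illustration) that word counts are normally distributed with the stated mean and standard deviation.

import random

random.seed(1)
mu, sigma, r = 1000, 200, 2
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

within = sum(abs(x - mu) < r * sigma for x in samples) / len(samples)
print(within)                           # ≈ 0.95 for the normal case, comfortably above the 75% guarantee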
Jensen’s Inequality
For a random variable 𝑋 with mean 𝔼(𝑋) and a convex continuous function 𝑔(⋅),
𝔼(𝑔(𝑋)) ≥ 𝑔(𝔼(𝑋)).
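A minimal sketch illustrating the inequality for a fair die with the convex function 𝑔(𝑥) = 𝑥²; note that here 𝔼(𝑔(𝑋)) − 𝑔(𝔼(𝑋)) is exactly the variance, which is non-negative.

xs = range(1, 7)                         # outcomes of a fair die, each with probability 1/6
E_X = sum(x / 6 for x in xs)             # E(X) = 3.5
E_gX = sum(x**2 / 6 for x in xs)         # E(g(X)) = E(X^2) ≈ 15.17
print(E_gX >= E_X**2)                    # True: E(g(X)) >= g(E(X)) = 12.25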