
Probability Review

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Today’s Objectives

• Basic Concepts in Set Theory and Probability Theory

• Kolmogorov’s Axioms

• Random Variable

• Conditional Probability

• Bayes’ Rule

• Expectation and Variance

• Appendix: Two Popular Inequalities

Have you ever made a decision by flipping a coin?

Head

Tail

Basic Probability Concepts (1/2)

Experiment

In probability, an experiment is any

process that can be repeated in which

the results are uncertain.

Elementary/Simple Event

A simple event is any single outcome

from a probability experiment. Each

simple event is denoted 𝜔𝑖 , 𝑖 = 1,2,⋯.

Example

If the experiment consists of tossing a

coin, two outcomes, head and tail, are

the elementary events, therefore,

𝜔1 = H, 𝜔2 = T.

Basic Probability Concepts (2/2)

Sample Space

The set Ω = {𝜔1, 𝜔2, ⋯ } of all possible
outcomes of a particular experiment is

called the sample space for the

experiment.

Example

If the experiment consists of flipping a

coin, the sample space contains two

outcomes: head and tail; Ω = {H, T}

Event

An event 𝐸 is any collection of possible
outcomes of an experiment, that is, any
subset of the sample space Ω.

Example

If the experiment consists of throwing a die, the sample space contains six
outcomes, Ω = {𝜔1, ⋯ , 𝜔6}. Then the event that the outcome is greater than
four is a subset of the sample space, i.e., 𝐴 = {𝜔5, 𝜔6} ⊂ Ω.
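
For readers who like to experiment, here is a tiny sketch (not from the slides) expressing this die example with Python sets:

```python
# A minimal sketch (not from the slides): the die example as Python sets.
omega = {1, 2, 3, 4, 5, 6}           # sample space: one label per elementary event
A = {w for w in omega if w > 4}      # event "outcome is greater than four"

print(A)                  # {5, 6}
print(A.issubset(omega))  # True: an event is a subset of the sample space
```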

Notations in Set Theory and Probability Theory

Notation | Set theory | Probability theory

Ω | Collection of objects | Sample space
𝜔 ∈ Ω | Member of Ω | Elementary event, outcome
𝐴 ⊂ Ω | Subset of Ω | Event that some outcome in 𝐴 occurs
𝐴𝑐 | Complement of 𝐴 | Event that no outcome in 𝐴 occurs
𝐴 ∩ 𝐵 | Intersection | Both 𝐴 and 𝐵
𝐴 ∪ 𝐵 | Union | Either 𝐴 or 𝐵
𝐴 ∖ 𝐵 | Difference | 𝐴 but not 𝐵
∅ | Empty set | Impossible event

Set Theory Basics

• 𝐴 ⊂ 𝐵 ⇔ 𝑥 ∈ 𝐴 ⇒ 𝑥 ∈ 𝐵

• 𝐴 = 𝐵 ⇔ 𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴

• 𝐴 ∪ 𝐵 = {𝑥: 𝑥 ∈ 𝐴 or 𝑥 ∈ 𝐵}

• 𝐴 ∩ 𝐵 = {𝑥: 𝑥 ∈ 𝐴 and 𝑥 ∈ 𝐵}

• 𝐴𝑐 = {𝑥: 𝑥 ∉ 𝐴}

• Commutativity
𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴
𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴

• Associativity
(𝐴 ∪ 𝐵) ∪ 𝐶 = 𝐴 ∪ (𝐵 ∪ 𝐶)
(𝐴 ∩ 𝐵) ∩ 𝐶 = 𝐴 ∩ (𝐵 ∩ 𝐶)

• Distributive Laws
𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶)
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶)

• DeMorgan’s Laws
(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵𝑐
(𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵𝑐
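
These identities are easy to sanity-check with Python's built-in set type; the sets below are arbitrary examples chosen for illustration:

```python
# A small check of the distributive and De Morgan laws with Python sets
# (illustrative sketch; the universe Omega below is just an example).
Omega = set(range(10))
A, B, C = {1, 2, 3}, {3, 4, 5}, {5, 6}

def complement(E):
    return Omega - E

assert A & (B | C) == (A & B) | (A & C)                      # distributive law
assert A | (B & C) == (A | B) & (A | C)                      # distributive law
assert complement(A | B) == complement(A) & complement(B)    # De Morgan
assert complement(A & B) == complement(A) | complement(B)    # De Morgan
print("All set identities hold for this example.")
```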

𝜎-Algebra/Field

Let Ω be a nonempty set, and let ℱ be a collection of subsets of Ω. We say
that ℱ is a 𝝈-algebra/field if and only if:

1) The empty set belongs to ℱ, that is, ∅ ∈ ℱ

2) Whenever a set 𝐸 ∈ ℱ, its complement 𝐸𝑐 ∈ ℱ

3) Whenever a sequence of sets 𝐸1, 𝐸2, … ∈ ℱ, then their union 𝐸1 ∪ 𝐸2 ∪ ⋯ ∈ ℱ

The pair (Ω, ℱ) is called a measurable space.

Examples of 𝜎-field

1) If the experiment consists of just one flip of a fair coin, then the outcomes

are either heads or tails: Ω = {𝐻, 𝑇}. The 𝜎-algebra ℱ = 2^Ω (the power set of Ω) contains 4
events: ∅, {𝐻}, {𝑇}, {𝐻, 𝑇}. Therefore, ℱ = {∅, {𝐻}, {𝑇}, {𝐻, 𝑇}}.

2) The smallest 𝜎-field associated with Ω is the collection ℱ = {∅, Ω}

3) If 𝐴 is any subset of Ω, then ℱ = {∅, 𝐴, 𝐴𝑐 , Ω} is a 𝜎-field
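
A small sketch (my own illustration, not from the slides) that builds ℱ = 2^Ω for the coin experiment and checks the three defining properties:

```python
# Sketch: the power set of Omega = {H, T} as a sigma-algebra.
from itertools import chain, combinations

Omega = frozenset({"H", "T"})
# F = 2^Omega: all subsets of Omega, each stored as a frozenset
F = {frozenset(s) for s in chain.from_iterable(
        combinations(Omega, r) for r in range(len(Omega) + 1))}

print(sorted(tuple(sorted(E)) for E in F))   # [(), ('H',), ('H', 'T'), ('T',)]

# Check the three defining properties (finite unions suffice for this finite example).
assert frozenset() in F                               # 1) empty set belongs to F
assert all(Omega - E in F for E in F)                 # 2) closed under complement
assert all(E1 | E2 in F for E1 in F for E2 in F)      # 3) closed under union
```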

Probability Measure (Kolmogorov’s Axioms)

Let Ω be a nonempty set, and let ℱ be a 𝜎-algebra of
subsets of Ω. A probability measure ℙ: ℱ → [0,1] is
a function that, to every set 𝐸 ∈ ℱ, assigns a number
in [0,1], which satisfies the following conditions:

1) If Ω is the sample space, then ℙ Ω = 1.

2) For all 𝐸 ∈ ℱ, ℙ(𝐸) ≥ 0.

3) For mutually exclusive events 𝐸1, 𝐸2, … (i.e., 𝐸𝑖 ∩ 𝐸𝑗 = ∅ if 𝑖 ≠ 𝑗),

ℙ(𝐸1 ∪ 𝐸2 ∪ ⋯ ) = ℙ(𝐸1) + ℙ(𝐸2) + ⋯

The triple (Ω, ℱ, ℙ) is called a probability space.

Andrey Kolmogorov

The probability of an event 𝑨,
denoted by ℙ(𝐴), is the likelihood
of that event occurring.
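
As a quick illustration of the axioms (a sketch of my own, not part of the slides), a fair die can be modelled by the counting measure ℙ(𝐸) = |𝐸|/|Ω|; the names P, Omega, E1, E2 below are just illustrative:

```python
# A minimal probability measure on a fair die, defined on events (subsets of Omega).
from fractions import Fraction

Omega = frozenset(range(1, 7))

def P(E):
    """Probability measure for a fair die: P(E) = |E| / |Omega|."""
    return Fraction(len(E & Omega), len(Omega))

E1, E2 = frozenset({1, 2}), frozenset({5, 6})        # mutually exclusive events

assert P(Omega) == 1                                 # axiom 1
assert all(P(frozenset({w})) >= 0 for w in Omega)    # axiom 2
assert P(E1 | E2) == P(E1) + P(E2)                   # axiom 3 (finite additivity)
print(P(E1 | E2))                                    # 2/3
```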

Methods of Assigning Probabilities

• Classical approach

• Relative-frequency approach

• Subjective approach

Classical Approach

If an experiment has 𝑛 simple outcomes, this method assigns a probability of
1/𝑛 to each outcome. In other words, each outcome is assumed to have an
equal probability of occurrence. This method is also called the equally
likely (a priori) approach.

Example

Roll of a die: Ω = {1, 2, ⋯ , 6}. Each simple event has a 1/6 chance of occurring.

Relative-Frequency Approach

Probabilities are assigned on the basis of experimentation or historical data. Let

𝐴 be an event and assume that you have performed the same experiment 𝑛
times so that 𝑛 is the number of times 𝐴 could have occurred. Further, let #(𝐴)
be the number of times that 𝐴 did occur. Then, ℙ(𝐴) is defined by the relative
frequency #(𝐴)/𝑛 as follows:

ℙ(𝐴) = lim𝑛→∞ #(𝐴)/𝑛.

Please Be Aware …

It is not physically feasible to repeat

an experiment an infinite number of

times.

Two sets of 𝑛 experiments may result
in two different ratios. However, we

expect the discrepancy to converge to

0 for large 𝑛. Hence, for large 𝑛, the
ratio #(𝐴)/𝑛 may be taken as a
reasonable approximation for ℙ(𝐴).

Example

Roll the given die 1000 times and

suppose the number of times the

outcome 1 is observed is 145. Thus,

𝐴 = {1}, #(𝐴) = 145, and 𝑛 = 1000.
Therefore, we say that ℙ(𝐴) is
approximately equal to

145/1000 = 0.145.

For comparison, the classical approach gives ℙ(𝐴) = 1/6 ≈ 0.167.
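
The convergence of the relative frequency towards the classical value is easy to watch in a short simulation (an illustrative sketch; the exact counts vary from run to run):

```python
# Sketch of the relative-frequency idea: roll a fair die n times and
# compare the observed frequency of the outcome 1 with the classical 1/6.
import random

random.seed(0)
n = 1000
rolls = [random.randint(1, 6) for _ in range(n)]
count_A = sum(1 for r in rolls if r == 1)   # #(A), with A = {1}

print(count_A / n)   # relative frequency; should be close to 1/6
print(1 / 6)         # classical probability, approximately 0.167
```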

Subjective Approach

In the subjective approach, we define probability as the degree of belief that

we hold in the occurrence of an event. Thus, judgment is used as the basis

for assigning probabilities.

Exercise: A fair coin is tossed 5 times. What is the probability
of getting at least three consecutive heads?

3 consecutive heads:

ℙ({HHHTT, THHHT, TTHHH, HTHHH, HHHTH}) = 5 × (1/2)^5 = 5/32.

4 consecutive heads:

ℙ({HHHHT, THHHH}) = 2 × (1/2)^5 = 2/32.

5 consecutive heads:

ℙ({HHHHH}) = (1/2)^5 = 1/32.

Hence, we have

ℙ(at least three consecutive heads) = 5/32 + 2/32 + 1/32 = 8/32 = 1/4.
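
The answer can be double-checked by brute force, since there are only 2^5 = 32 equally likely sequences (an illustrative sketch, not part of the slides):

```python
# Brute-force check of the exercise: enumerate all 2^5 equally likely sequences
# and count those containing at least 3 consecutive heads ("HHH" as a substring).
from itertools import product
from fractions import Fraction

sequences = list(product("HT", repeat=5))
favourable = [s for s in sequences if "HHH" in "".join(s)]

print(len(favourable), len(sequences))            # 8 32
print(Fraction(len(favourable), len(sequences)))  # 1/4
```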

Random Variable

A random variable 𝑋: Ω → ℝ is a real-valued function that assigns a number to each outcome
in the sample space of a random experiment.

[Figure: 𝑋 maps the sample space Ω to the real line ℝ; for a set 𝐵 ⊆ ℝ, the event
{𝑋 ∈ 𝐵} = {𝜔 ∈ Ω: 𝑋(𝜔) ∈ 𝐵}.]

Probability Distribution Function

The probability distribution function of a random variable 𝑋, denoted by 𝔽𝑋(⋅),
is defined to be the function 𝔽𝑋: ℝ → [0, 1] which assigns

𝔽𝑋(𝑥) = 𝔽(𝑥) = ℙ(𝑋 ≤ 𝑥) = ℙ({𝜔 ∈ Ω: 𝑋(𝜔) ≤ 𝑥}),

for every 𝑥 ∈ ℝ.

Properties of 𝔽𝑋(⋅)

• lim𝑥→−∞ 𝔽𝑋(𝑥) = 0 and lim𝑥→+∞ 𝔽𝑋(𝑥) = 1

• 𝔽𝑋(𝑎) ≤ 𝔽𝑋(𝑏) for 𝑎 < 𝑏 (monotone and non-decreasing)

• 𝔽𝑋(⋅) is continuous from the right: limℎ→0+ 𝔽𝑋(𝑥 + ℎ) = 𝔽𝑋(𝑥)

Probability Mass Function and Density Function

The probability mass function 𝑓(∙) of a discrete RV 𝑋 is

𝑓(𝑥) = ℙ(𝑋 = 𝑥).

For a continuous RV 𝑋, a function 𝑓: ℝ → [0, ∞) is a density function if and only if:

• 𝑓(𝑥) ≥ 0 for all 𝑥;

• ∫−∞∞ 𝑓(𝑥) d𝑥 = 1.

Joint Probability Distribution and Marginal Distribution

The joint probability distribution of 𝑋 and 𝑌 is defined as

𝔽(𝑥, 𝑦) = ℙ(𝑋 ≤ 𝑥, 𝑌 ≤ 𝑦).

Example: if you flip 2 coins together, what is the probability of two heads?

The individual marginal distribution of 𝑋 is

𝔽𝑋(𝑥) = ℙ(𝑋 ≤ 𝑥, 𝑌 ≤ ∞).

Example: if you flip 2 coins together, what is the probability that coin 1 is a head?

Conditional Probability

The conditional probability of event 𝐸𝑖 given event 𝐸𝑗 is defined as

ℙ(𝐸𝑖 | 𝐸𝑗) = ℙ(𝐸𝑖, 𝐸𝑗) / ℙ(𝐸𝑗) = ℙ(𝐸𝑖 ∩ 𝐸𝑗) / ℙ(𝐸𝑗),

where ℙ(𝐸𝑗) > 0.

If 𝐸𝑖 and 𝐸𝑗 are independent, i.e., ℙ(𝐸𝑖 ∩ 𝐸𝑗) = ℙ(𝐸𝑖)ℙ(𝐸𝑗), then

ℙ(𝐸𝑖 | 𝐸𝑗) = ℙ(𝐸𝑖)ℙ(𝐸𝑗) / ℙ(𝐸𝑗) = ℙ(𝐸𝑖).
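
A short sketch (my own example, not from the slides) that applies the definition of conditional probability, and shows that conditioning on an independent event leaves the probability unchanged, using two fair dice:

```python
# Conditional probability and independence on two fair dice (36 equally likely outcomes).
from fractions import Fraction
from itertools import product

Omega = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(sum(1 for w in Omega if event(w)), len(Omega))

def P_given(Ei, Ej):
    """P(Ei | Ej) = P(Ei and Ej) / P(Ej)."""
    return P(lambda w: Ei(w) and Ej(w)) / P(Ej)

sum_is_10plus = lambda w: w[0] + w[1] >= 10
first_is_6    = lambda w: w[0] == 6
second_is_odd = lambda w: w[1] % 2 == 1

print(P_given(sum_is_10plus, first_is_6))                     # 1/2
# Independent events: conditioning does not change the probability.
print(P(second_is_odd), P_given(second_is_odd, first_is_6))   # 1/2 1/2
```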

Bayes’ Rule

Let the events 𝐸1, 𝐸2, … be a partition of the sample space Ω such that ℙ(𝐸𝑖) > 0 for
𝑖 = 1, 2, ⋯, and let 𝐵 be any event such that ℙ(𝐵) > 0. Then, for 𝑖 = 1, ⋯,

ℙ(𝐸𝑖 | 𝐵) = ℙ(𝐸𝑖) ℙ(𝐵 | 𝐸𝑖) / ∑𝑗 ℙ(𝐸𝑗) ℙ(𝐵 | 𝐸𝑗).
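
Bayes' rule is straightforward to code for a finite partition. The helper below is a sketch of my own; the function name and the three-door numbers (which anticipate the example on the next slides) are purely illustrative:

```python
# A generic sketch of Bayes' rule over a finite partition E_1, ..., E_n.
def bayes_posterior(priors, likelihoods):
    """priors[i] = P(E_i), likelihoods[i] = P(B | E_i); returns the list of P(E_i | B)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                  # P(B), by the law of total probability
    return [j / total for j in joint]

# Illustration: three doors, prize placed uniformly at random;
# likelihoods are P(host opens door 3 | prize behind door i).
priors = [1/3, 1/3, 1/3]
likelihoods = [1/2, 1, 0]
print(bayes_posterior(priors, likelihoods))   # [0.333..., 0.666..., 0.0]
```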

Pick a door!

1 2 3

Behind one of the doors is a cool prize: a car! Behind the others, also cool prizes: pigs!

Suppose you pick door No. 1.

1 2 3

The host, who knows what’s behind the doors, opens another door, say No. 3,

which has a pig. He then says to you, “Do you want to pick door No. 2?” Is it to

your advantage to switch your choice?

1 2

Let 𝐷1, 𝐷2, 𝐷3 stand for the event that
the car is behind door No.1, No. 2,

No. 3, respectively. Thus

ℙ(𝐷1 ∪ 𝐷2 ∪ 𝐷3) = 1.

Since the car is randomly placed behind one of the doors,

ℙ(𝐷1) = ℙ(𝐷2) = ℙ(𝐷3) = 1/3.

[Figure: the three doors, annotated with probability 1/3 on the chosen door and 2/3 on the other two doors together.]

The host NEVER reveals a door that has

the car behind it, but randomly chooses

which door to open if neither has the car

behind it. Let 𝐻3 stand for the event that
the host reveals a pig behind door No. 3.

The problem boils down to calculating

ℙ(𝐷1 ∣ 𝐻3). If ℙ(𝐷1 ∣ 𝐻3) < 1/2, you should switch doors.

ℙ(𝐷1 ∣ 𝐻3) = ℙ(𝐻3 ∣ 𝐷1) ℙ(𝐷1) / [ℙ(𝐻3 ∣ 𝐷1) ℙ(𝐷1) + ℙ(𝐻3 ∣ 𝐷2) ℙ(𝐷2) + ℙ(𝐻3 ∣ 𝐷3) ℙ(𝐷3)]
            = (1/2 × 1/3) / (1/2 × 1/3 + 1 × 1/3 + 0 × 1/3)
            = 1/3.

Since ℙ(𝐷2 ∣ 𝐻3) = 2/3 > 1/2, it is to your advantage to switch to door No. 2.

Expectation

Expectation, expected value, or mean of a RV 𝑋, denoted by 𝔼(𝑋), is the average value of 𝑋 in a large number of experiments:

𝔼(𝑋) = ∑𝑖 𝑥𝑖 𝑓(𝑥𝑖), if 𝑋 is discrete,
𝔼(𝑋) = ∫ 𝑥 𝑓(𝑥) d𝑥, if 𝑋 is continuous.

Variance

Variance measures how much 𝑋 varies around the expected value. If 𝜇 ≔ 𝔼(𝑋), the variance is defined as

𝕍(𝑋) = 𝔼[(𝑋 − 𝜇)²] = ∑𝑖 (𝑥𝑖 − 𝜇)² 𝑓(𝑥𝑖), if 𝑋 is discrete,
𝕍(𝑋) = 𝔼[(𝑋 − 𝜇)²] = ∫ (𝑥 − 𝜇)² 𝑓(𝑥) d𝑥, if 𝑋 is continuous.

Important relationship between variance and expectation:

𝕍(𝑋) = 𝔼(𝑋²) − [𝔼(𝑋)]².

Covariance

If 𝑋 and 𝑌 are random variables, then their covariance is

Cov(𝑋, 𝑌) = 𝔼[(𝑋 − 𝜇𝑋)(𝑌 − 𝜇𝑌)].

The correlation between 𝑋 and 𝑌 is then given by

𝜌(𝑋, 𝑌) = Cov(𝑋, 𝑌) / (𝜎𝑋 𝜎𝑌).

Expectation of A Function of A Random Variable

Let 𝑋 be a random variable and 𝑔(⋅): ℝ → ℝ be a function. The expectation or expected value of the function 𝑔(⋅) of the random variable 𝑋, denoted by 𝔼(𝑔(𝑋)), is defined by

𝔼(𝑔(𝑋)) = ∑𝑖 𝑔(𝑥𝑖) 𝑓(𝑥𝑖), if 𝑋 is discrete,
𝔼(𝑔(𝑋)) = ∫ 𝑔(𝑥) 𝑓(𝑥) d𝑥, if 𝑋 is continuous.

Properties of Expectations

• 𝔼(𝑐) = 𝑐 for a constant 𝑐 ∈ ℝ

• 𝔼(𝑐 𝑔(𝑋)) = 𝑐 𝔼(𝑔(𝑋))

• 𝔼(𝑐1 𝑔1(𝑋) + 𝑐2 𝑔2(𝑋)) = 𝑐1 𝔼(𝑔1(𝑋)) + 𝑐2 𝔼(𝑔2(𝑋))

• 𝔼(𝑔1(𝑋)) ≤ 𝔼(𝑔2(𝑋)) if 𝑔1(𝑥) ≤ 𝑔2(𝑥) for all 𝑥 ∈ ℝ

Summary

• Basic Concepts in Set Theory and Probability Theory

• Kolmogorov’s Axioms

• Random Variable

• Conditional Probability

• Bayes’ Rule

• Expectation and Variance

• Appendix: Two Popular Inequalities

Thank You!

bchen@lincoln.ac.uk

Appendix: Two Popular Inequalities

For some basic concepts, please check your Math for Computing slides. Thanks!

Chebyshev’s Inequality

Let 𝑋 be a random variable and 𝑔(⋅) a non-negative function, then

ℙ(𝑔(𝑋) ≥ 𝑘) ≤ 𝔼(𝑔(𝑋)) / 𝑘, for all 𝑘 > 0.

Chebyshev’s inequality guarantees

that in any data sample or probability

distribution, “nearly all” values are

close to the mean.

Corollary of Chebyshev’s Inequality

If 𝑋 is a random variable with finite variance, then

ℙ(|𝑋 − 𝜇𝑋| ≥ 𝑟𝜎𝑋) = ℙ((𝑋 − 𝜇𝑋)² ≥ 𝑟²𝜎𝑋²) ≤ 1/𝑟².

Proof. Take 𝑔(𝑋) = (𝑋 − 𝜇𝑋)² and 𝑘 = 𝑟²𝜎𝑋² in Chebyshev’s inequality.

Example

Suppose we randomly select a journal article from a source with an average of
1000 words per article, with a standard deviation of 200 words. We can then
infer that the probability that it has between 600 and 1400 words (i.e. within k = 2
standard deviations of the mean) must be at least 75%. [Wiki]
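
A quick numerical sketch (not from the slides) comparing the empirical tail probability with the 1/r² bound; the exponential distribution used here is an arbitrary choice:

```python
# Numerical check of the Chebyshev corollary (simulation; values vary slightly by run):
# compare the empirical P(|X - mu| >= r*sigma) with the bound 1/r^2.
import random, statistics

random.seed(1)
samples = [random.expovariate(1.0) for _ in range(100_000)]   # example distribution
mu = statistics.fmean(samples)
sigma = statistics.pstdev(samples)

for r in (2, 3):
    empirical = sum(1 for x in samples if abs(x - mu) >= r * sigma) / len(samples)
    print(r, round(empirical, 4), "<=", round(1 / r**2, 4))
```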

Jensen’s Inequality

For a random variable 𝑋 with mean 𝔼(𝑋) and a convex continuous function 𝑔(⋅),

𝔼(𝑔(𝑋)) ≥ 𝑔(𝔼(𝑋)).
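
A one-minute numerical check (my own sketch) with the convex function g(x) = x², for which Jensen's inequality reduces to 𝔼(𝑋²) ≥ [𝔼(𝑋)]², the gap being the variance:

```python
# Numerical illustration of Jensen's inequality with g(x) = x**2.
import random, statistics

random.seed(2)
X = [random.uniform(0, 10) for _ in range(100_000)]   # example distribution

E_X = statistics.fmean(X)
E_gX = statistics.fmean([x**2 for x in X])            # E[g(X)]

print(E_gX, ">=", E_X**2)   # roughly 33.3 >= 25.0 for Uniform(0, 10)
print(E_gX >= E_X**2)       # True
```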