
Probability Review for Machine Learning
University of Toronto
Slides adapted from Erdogdu and Zemel


Motivation
Uncertainty arises through noisy measurements, variability between samples, and the finite size of data sets.
Probability provides a consistent framework for the quantification and manipulation of uncertainty.

Sample Space
Sample space Ω is the set of all possible outcomes of an experiment.
Observations ω ∈ Ω are points in the space also called sample outcomes, realizations, or elements.
Events E ⊂ Ω are subsets of the sample space.
Example: flip a coin twice.
Sample space (all outcomes): Ω = {HH, HT, TH, TT}
Observation: ω = HT is a valid sample outcome since ω ∈ Ω
Event (both flips the same): E = {HH, TT} is a valid event since E ⊂ Ω
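As a quick illustration (not part of the original slides), a minimal Python sketch that enumerates this sample space and checks the observation and event above:

```python
from itertools import product

# Sample space for two coin flips: all length-2 sequences over {H, T}.
omega = {"".join(flips) for flips in product("HT", repeat=2)}
print(omega)                     # {'HH', 'HT', 'TH', 'TT'}

observation = "HT"
event_both_same = {"HH", "TT"}

print(observation in omega)      # True: a valid sample outcome
print(event_both_same <= omega)  # True: a valid event (subset of the sample space)
```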

Probability
The probability of an event E, P(E), satisfies three axioms:
1: P(E) ≥ 0 for every E
2: P(Ω) = 1
3: If E1, E2, ... are disjoint, then
P(⋃_{i=1}^{∞} E_i) = Σ_{i=1}^{∞} P(E_i)

Joint and Conditional Probabilities
Joint Probability of A and B is denoted P (A, B). Conditional Probability of A given B is denoted P(A|B).
Joint: p(A, B) = p(A ∩ B)
Conditional: p(A|B) = p(A ∩ B) / p(B)
p(A, B) = p(A|B)p(B) = p(B|A)p(A)
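A minimal sketch (the joint table below is hypothetical, chosen for illustration) showing how a conditional probability is computed from a joint distribution:

```python
# Hypothetical joint distribution P(A, B) over two binary events.
joint = {
    (True, True): 0.45,
    (True, False): 0.15,
    (False, True): 0.15,
    (False, False): 0.25,
}

# Marginal P(B), then conditional P(A | B) = P(A, B) / P(B).
p_b = sum(p for (a, b), p in joint.items() if b)
p_a_given_b = joint[(True, True)] / p_b
print(p_b, p_a_given_b)  # 0.6 0.75
```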

Conditional Example
Probability of passing the midterm is 60% and probability of passing both the final and the midterm is 45%.
What is the probability of passing the final given the student passed the midterm?
P(F|M) = P(M, F)/P(M) = 0.45/0.60 = 0.75

Independence
Events A and B are independent if P(A, B) = P(A)P(B).
Independent: A = first toss is heads; B = second toss is heads;
P(A, B) = 0.5 × 0.5 = P(A)P(B)
Not independent: A = first toss is heads; B = first toss is heads;
P(A, B) = 0.5 ≠ P(A)P(B) = 0.25
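A short simulation sketch (assuming NumPy; not from the slides) that checks these two cases empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
first = rng.integers(0, 2, size=n)   # 1 = heads on the first toss
second = rng.integers(0, 2, size=n)  # 1 = heads on the second toss

# Independent: A = first toss is heads, B = second toss is heads.
p_joint = np.mean((first == 1) & (second == 1))
print(p_joint, np.mean(first == 1) * np.mean(second == 1))  # both ~= 0.25

# Not independent: A and B are the same event (first toss is heads).
p_joint_same = np.mean((first == 1) & (first == 1))
print(p_joint_same, np.mean(first == 1) ** 2)               # ~= 0.5 vs ~= 0.25
```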

Independence
Events A and B are conditionally independent given C if P(A, B|C) = P(A|C)P(B|C).
Consider two coins²: a regular coin and a coin which always comes up heads.
A = the first toss is heads
B = the second toss is heads
C = the regular coin is used
D = the biased coin is used
Then A and B are conditionally independent given C and given D.
² www.probabilitycourse.com/chapter1/1_4_4_conditional_independence.php
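A possible simulation of this setup (assuming NumPy; the sampling scheme is my own sketch of the two-coin experiment, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# For each trial, pick the regular coin (heads with prob 0.5) or the
# biased coin (always heads), then toss the chosen coin twice.
use_regular = rng.integers(0, 2, size=n).astype(bool)
p_heads = np.where(use_regular, 0.5, 1.0)
first = rng.random(n) < p_heads    # A: first toss is heads
second = rng.random(n) < p_heads   # B: second toss is heads

# Conditional independence given C (the regular coin was used):
c = use_regular
lhs = np.mean(first[c] & second[c])           # P(A, B | C)
rhs = np.mean(first[c]) * np.mean(second[c])  # P(A | C) P(B | C)
print(lhs, rhs)                               # both ~= 0.25
```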

Independence
Events A and B are conditionally independent given C if P(A, B|C) = P(A|C)P(B|C).
Now consider a coin whose second toss comes up heads if the first toss was heads, and tails otherwise.
A = the first toss is heads
B = the second toss is heads
E = this biased coin is used
Then A and B are conditionally dependent given E.

Marginalization and Law of Total Probability
Law of Total Probability³
P(A) = Σ_B P(A, B) = Σ_B P(A|B)P(B)
³ www.probabilitycourse.com/chapter1/1_4_2_total_probability.php

Bayes’ Rule
Bayes’ Rule:
P(A|B) = P(B|A)P(A) / P(B)
P(θ|x) = P(x|θ)P(θ) / P(x)
Posterior = (Likelihood × Prior) / Evidence
Posterior ∝ Likelihood × Prior

Bayes’ Example
Suppose you have tested positive for a disease. What is the probability you actually have the disease?
This depends on the prior probability of the disease:
P(T=1 | D=1) = 0.95 (likelihood)
P(T=1 | D=0) = 0.10 (likelihood)
P(D=1) = 0.1 (prior)
So P(D=1 | T=1) = ?

Bayes’ Example
Suppose you have tested positive for a disease. What is the probability you actually have the disease?
P(T=1 | D=1) = 0.95 (true positive)
P(T=1 | D=0) = 0.10 (false positive)
P(D=1) = 0.1 (prior)
So P(D=1 | T=1) = ? Use Bayes’ Rule:
P(T=1) = P(T=1 | D=1)P(D=1) + P(T=1 | D=0)P(D=0) = 0.95 × 0.1 + 0.10 × 0.90 = 0.185
P(D=1 | T=1) = P(T=1 | D=1)P(D=1) / P(T=1) = (0.95 × 0.1) / 0.185 ≈ 0.51
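The same calculation as a small Python snippet (numbers taken from the slide):

```python
# Bayes' rule for the disease-testing example above.
p_t_given_d1 = 0.95  # true positive rate, P(T=1 | D=1)
p_t_given_d0 = 0.10  # false positive rate, P(T=1 | D=0)
p_d1 = 0.10          # prior, P(D=1)

# Evidence via the law of total probability.
p_t = p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1)  # 0.185

# Posterior P(D=1 | T=1).
print(p_t_given_d1 * p_d1 / p_t)  # ~= 0.514
```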

Random Variable
How do we connect sample spaces and events to data?
A random variable is a mapping which assigns a real number X(ω) to each observed outcome ω ∈ Ω
For example, let’s flip a coin 10 times. X(ω) counts the number of heads we observe in our sequence. If ω = HHTHTHHTHT then X(ω) = 6. We often shorten this and refer to the random variable X.
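A one-line illustration of this mapping in Python (the helper named X is hypothetical, mirroring the notation above):

```python
# The random variable X maps an outcome (a flip sequence) to a real number.
def X(omega: str) -> int:
    """Number of heads in the observed sequence."""
    return omega.count("H")

print(X("HHTHTHHTHT"))  # 6
```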

Expectations
From our example, we see that X does not have a fixed value, but rather a distribution of values it can take. It is natural to ask questions about this distribution, such as “What is the average number of heads in 10 coin tosses?”
This average value is called the expectation and denoted E[X]. It is defined as
E[X] = Σ_{a∈A} a × P[X = a]
where A is the set of all possible values X(ω) can take.

Expectation Practice
What is the expected value of a fair die? X = value of roll
E[X] = Σ_{a∈{1,...,6}} (1/6) × a = (1/6) × (1 + 2 + ... + 6) = 21/6 = 7/2 = 3.5
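A quick check of this value, both exactly and by simulation (assuming NumPy; not from the slides):

```python
import numpy as np

# Exact expectation of a fair die: sum over a of a * P(X = a).
exact = sum(a * (1 / 6) for a in range(1, 7))
print(exact)  # 3.5

# Monte Carlo check.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)
print(rolls.mean())  # ~= 3.5
```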

Linearity of expectations
There are two powerful properties of expectations:
1. E[X + Y] = E[X] + E[Y]. This holds even if the random variables are dependent.
2. E[cX] = cE[X], where c is a constant.
Note we cannot say anything in general about E[XY].
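A simulation sketch (assuming NumPy) illustrating both points: linearity holds even with fully dependent variables, while E[XY] generally does not factor:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)
y = x  # Y is fully dependent on X (Y = X)

# Linearity holds regardless of dependence: E[X + Y] = E[X] + E[Y].
print(np.mean(x + y), np.mean(x) + np.mean(y))  # both ~= 7.0

# But E[XY] need not equal E[X] E[Y] when X and Y are dependent.
print(np.mean(x * y), np.mean(x) * np.mean(y))  # ~= 15.2 vs ~= 12.25
```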

Expectation Practice
What is the expected value of the sum of two dice?
X1 = value of roll 1
X2 = value of roll 2
E[X1 + X2] = E[X1] + E[X2] = 7/2 + 7/2 = 7
(compare this to computing 2 × 1/36 + 3 × 2/36 + ... directly)

Expectation Practice 2
Suppose there are n students in class, and they each complete an assignment. We hand back assignments randomly. What is the expected number of students that receive the correct assignment? When n = 3? In general?
X = number of students that get their own assignment back
Xi = 1 if student i gets their own assignment back, 0 otherwise
Each assignment is handed back uniformly at random, so E[Xi] = P(student i gets their own) = 1/n.
E[X] = E[X1 + X2 + ... + Xn]
= E[X1] + E[X2] + ... + E[Xn]
= (1/n) × n = 1
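A simulation sketch of the hand-back experiment (assuming NumPy; the helper expected_matches is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_matches(n: int, trials: int = 50_000) -> float:
    """Average number of students who get their own assignment back."""
    counts = [
        np.sum(rng.permutation(n) == np.arange(n)) for _ in range(trials)
    ]
    return float(np.mean(counts))

print(expected_matches(3))    # ~= 1.0
print(expected_matches(100))  # ~= 1.0, regardless of n
```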

Variance
Knowing the expectation can only tell us so much. We use another quantity to describe how far a random variable tends to be from its expected value. It is defined as follows for a random variable X with E[X] = μ:
Var[X] = E[(X − μ)²]
The variance can be simplified as:
E[(X − μ)²] = E[X² − 2μX + μ²]
= E[X²] − 2μE[X] + μ²
= E[X²] − 2μ² + μ²
= E[X²] − μ²
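A numerical sanity check of the two equivalent variance formulas (assuming NumPy; fair-die samples are used as a stand-in distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000).astype(float)  # fair-die samples
mu = x.mean()

# Both forms of the variance agree: E[(X - mu)^2] and E[X^2] - mu^2.
print(np.mean((x - mu) ** 2))     # ~= 2.92 (exact value is 35/12)
print(np.mean(x ** 2) - mu ** 2)  # same value
```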

Variance Properties
Constants get squared:
Var[cX] = c² Var[X]
For independent random variables X and Y, we have
E[XY] = E[X]E[Y]
Var[X + Y] = Var[X] + Var[Y]
The quantity we encounter during the proof, E[XY] − E[X]E[Y], is called the covariance. It is 0 when X and Y are independent. Q: can it be 0 when X and Y are not independent?
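One standard counterexample, sketched in code (assuming NumPy; the specific choice of X uniform on {-1, 0, 1} with Y = X² is mine, not from the slides), shows the answer is yes: the covariance can be 0 even when X and Y are dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1, 0, 1], size=200_000)  # X uniform on {-1, 0, 1}
y = x ** 2                                # Y is a deterministic function of X

# Covariance E[XY] - E[X]E[Y] is ~0, yet X and Y are clearly dependent:
print(np.mean(x * y) - np.mean(x) * np.mean(y))  # ~= 0
print(np.mean((x == 1) & (y == 1)),              # P(X=1, Y=1)  ~= 1/3
      np.mean(x == 1) * np.mean(y == 1))         # P(X=1)P(Y=1) ~= 2/9
```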


Variance Practice
Consider a particle that starts at position 0. At each time step, the particle moves one step to the left or one step to the right with equal probability. What is the variance of the particle's position after n time steps?
X = X1 + X2 + ... + Xn
Each Xi is +1 or -1 with equal probability, so E[Xi] = 0 and Var(Xi) = 1.
Var(X) = Σ_{i=1}^{n} Var(Xi) = n
The expected squared distance from 0 is therefore n.
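A simulation sketch of the random walk (assuming NumPy; n = 50 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000

# Each step is +1 or -1 with equal probability; X is the position after n steps.
steps = rng.choice([-1, 1], size=(trials, n))
positions = steps.sum(axis=1)

print(positions.var())  # ~= n = 50
```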

Discrete and Continuous Random Variables
Discrete Random Variables
Take countably many values, e.g., the number of heads
Distribution defined by a probability mass function (PMF)
Marginalization: p(x) = Σ_y p(x, y)
Continuous Random Variables
Take uncountably many values, e.g., the time to complete a task
Distribution defined by a probability density function (PDF)
Marginalization: p(x) = ∫ p(x, y) dy
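A small sketch of discrete marginalization (assuming NumPy; the joint table is hypothetical):

```python
import numpy as np

# Hypothetical joint PMF p(x, y): rows index x in {0, 1}, columns y in {0, 1, 2}.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.15, 0.15],
])

# Marginalize out y: p(x) = sum_y p(x, y).
p_x = joint.sum(axis=1)
print(p_x)        # [0.4 0.6]
print(p_x.sum())  # 1.0
```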

i.i.d. Random Variables
Random variables are said to be independent and identically distributed (i.i.d.) if they are sampled from the same probability distribution and are mutually independent.
This is a common assumption for observations. For example, coin flips are assumed to be i.i.d.

Probability Distribution Statistics
Mean: first moment, μ
E[x] = Σ_i x_i p(x_i)   (univariate discrete r.v.)
E[x] = ∫ x p(x) dx   (univariate continuous r.v.)
Variance: second (central) moment, σ²
Var[x] = E[(x − μ)²] = E[x²] − E[x]²
Var[x] = Σ_i (x_i − μ)² p(x_i)   (univariate discrete r.v.)
Var[x] = ∫ (x − μ)² p(x) dx   (univariate continuous r.v.)
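A numerical illustration of the continuous formulas (assuming NumPy; Uniform(0, 1) is my example distribution, with exact mean 1/2 and variance 1/12):

```python
import numpy as np

# X ~ Uniform(0, 1), so p(x) = 1 on [0, 1]; exact mean is 1/2, variance 1/12.
x = np.linspace(0.0, 1.0, 100_001)
p = np.ones_like(x)

mu = np.trapz(x * p, x)               # E[x] = integral of x p(x) dx
var = np.trapz((x - mu) ** 2 * p, x)  # integral of (x - mu)^2 p(x) dx
print(mu, var)                        # ~= 0.5, ~= 0.0833
```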
