COMP3223: A quick review/introduction to probability theory
Srinandan Dasmahapatra
November 8, 2020
Reinterpret regression and classification probabilistically
Softmax regression: Predict high probability of correct label c for data point x
For $y_c = 1$, high $p(c \mid \mathbf{x})$ makes the loss $-y_c \ln p(c \mid \mathbf{x})$ low, where $p(c \mid \mathbf{x}) \propto \exp(\mathbf{w}_c^\top \mathbf{x})$ (softmax). Achieved by setting $\mathbf{w}_c$ large – overconfidence/overfitting, regularisation needed
Linear regression: Predict output $\hat y$ given input $\mathbf{x}$ to make $r^2 = (y - \hat y)^2$ small
Given family of functions $\hat y = f(\mathbf{x}; \mathbf{w})$
Lowering $r^2$ achieved by complex $f$ with $\|\mathbf{w}\|^2$ large – overfitting, fitting noise in data, regularisation needed
Classification already in probabilistic language
Interpret regression as finding a model $f(\cdot\,; \mathbf{w})$ that makes large-$r^2$ predictions improbable
Regularisation by weight penalty viewed as imposing improbability of complex (large $\|\mathbf{w}\|^2$) models even before data is seen
Outline: mostly about probability and statistics
Basic probability theory and statistics
Bivariate statistics (covariance) related to regression
Bivariate continuous distributions
Use linear dependence between two Gaussian random variables to motivate the form of the bivariate Gaussian distribution.
Basic definitions from probability theory: random variable, event/sample space
The set of all events is $\Omega$. Probability of event $A \subset \Omega$ is $P(A) \in [0, 1]$, $P(\Omega) = 1$.
$X$ denotes the random variable, $x$ a value (specific event). Notation: $P(X = x) = P_X(x) = P(x)$.
Probability mass function (pmf) $P(X = x)$: quantifies how likely each possible outcome is.
$$P(A) = P(x \in A) = \sum_{x \in A} P(x)$$
Joint distribution $P(A = a_i, B = b_j)$ is the probability that both events $a_i$ and $b_j$ occur.
If events $A, B$ independent, $P(A = a_i, B = b_j) = P(A = a_i)P(B = b_j)$: joint factorises into product of marginals.
Conditional probability $P(A = a_i \mid B = b_j)$ is the probability that event $a_i$ occurs given that event $b_j$ has occurred: information update.
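A minimal numerical sketch of these definitions (the two-dice setup and variable names are invented for illustration): the joint pmf of two independent fair dice factorises into the product of marginals, sums to one, and event probabilities are sums of $P(x)$ over the event.

```python
import itertools

# Two independent fair dice (hypothetical example)
P_A = {a: 1/6 for a in range(1, 7)}
P_B = {b: 1/6 for b in range(1, 7)}

# Independence: joint factorises into the product of marginals
P_joint = {(a, b): P_A[a] * P_B[b] for a, b in itertools.product(P_A, P_B)}

# P(Omega) = 1: all probabilities sum to one
assert abs(sum(P_joint.values()) - 1.0) < 1e-12

# P(A) = sum_{x in A} P(x), e.g. the event "doubles"
print(sum(P_joint[(a, b)] for (a, b) in P_joint if a == b))  # 1/6 ≈ 0.1667
```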
Bayes’ rule for inference and inverse problems
Given data $X$ and a set $\mathcal{H}$ of hypotheses $h_i \in \mathcal{H}$ that explain the data: what is the probability that a given observation $X$ was generated by some $h_j$? Equality of the two ways of expressing the joint in terms of conditionals,
$$P(A = a, B = b) = P(B = b \mid A = a)\,P(A = a) = P(A = a \mid B = b)\,P(B = b),$$
leads to Bayes' rule:
$$P(B = b \mid A = a) = \frac{P(A = a \mid B = b)\,P(B = b)}{P(A = a)}.$$
$P(X \mid h_j)$ for each $h_j \in \mathcal{H}$ is known; a generative mechanism: $h_j \to X$.
Inverse problem: given data $X$, find $P(h_i \mid X)$.
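A small numerical sketch of this inversion (the hypotheses, prior and likelihoods are made up for illustration): given a prior $P(h)$ and a known generative likelihood $P(X \mid h)$, Bayes' rule recovers the posterior $P(h \mid X)$.

```python
# Hypothetical two-hypothesis example: which die generated the observed roll?
prior = {"fair": 0.5, "loaded": 0.5}       # P(h)
likelihood = {"fair": 1/6, "loaded": 1/2}  # P(X = 6 | h)

# Evidence P(X = 6) = sum_h P(X = 6 | h) P(h)  (marginalisation)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' rule: P(h | X = 6) = P(X = 6 | h) P(h) / P(X = 6)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # {'fair': 0.25, 'loaded': 0.75}
```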
Expectation and variance characterise mean value of random variable and its dispersion.
$x_i \sim P(X)$: $x_i$, $i = 1, \ldots, N$ random values drawn from $P(X)$.
Example: die rolls, $x_i \in \{1, 2, 3, 4, 5, 6\}$.
Example: individual $i$ will infect $x_i$ others, $x_i \in \{1, 2, 3, \ldots\}$.
Collect data: $X := \{x_1, x_2, \ldots, x_N\}$; mean of sample $\langle X \rangle = \frac{1}{N}\sum_{n=1}^N x_n$.
If $P(X)$ known, Expectation: $E(X) = \sum_x P(X = x)\,x$ (population property). $E(X) = EX$ (notation).
Expectation of a function of a random variable:
$$E(f(X)) = \sum_x P(X = x)\,f(x)$$
Moments = expectation of powers of $X$: $M_k = E X^k$.
Variance: average (squared) fluctuation from the mean:
$$\mathrm{Var}(X) = E(X - EX)^2 = EX^2 - (EX)^2 = M_2 - M_1^2$$
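A quick sketch (simulated die rolls, not course code) checking that sample moments approximate the population values $EX = 3.5$ and $\mathrm{Var}(X) = M_2 - M_1^2 \approx 2.92$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)  # x_i ~ P(X): N simulated die rolls

M1 = x.mean()                         # sample estimate of M_1 = EX
M2 = (x ** 2).mean()                  # sample estimate of M_2 = EX^2
print(M1, M2 - M1 ** 2)               # ≈ 3.5 and ≈ 91/6 - 3.5^2 ≈ 2.9167
```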
Bivariate distributions characterise systems of 2 observables.
Joint distribution: $P(X = x, Y = y)$, a list of probabilities of all possible pairs of observations.
Marginal distribution: $P(X = x) = \sum_y P(X = x, Y = y)$.
Conditional distribution: $P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)}$.
$X \mid Y$ has distribution $P(X \mid Y)$, a lookup-table of all possible $P(X = x \mid Y = y)$.
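Marginalisation and conditioning are literally sums over this lookup table; a hypothetical $2 \times 2$ joint (values invented) illustrates both:

```python
# Hypothetical joint pmf P(X = x, Y = y) as a lookup table
P = {(0, 0): 0.3, (0, 1): 0.1,
     (1, 0): 0.2, (1, 1): 0.4}

# Marginals: P(X = x) = sum_y P(X = x, Y = y), similarly for Y
P_X = {x: sum(p for (xx, yy), p in P.items() if xx == x) for x in (0, 1)}
P_Y = {y: sum(p for (xx, yy), p in P.items() if yy == y) for y in (0, 1)}

# Conditional lookup table: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
P_X_given_Y = {(x, y): P[(x, y)] / P_Y[y] for (x, y) in P}

print(P_X)                  # {0: 0.4, 1: 0.6}
print(P_X_given_Y[(0, 1)])  # 0.1 / 0.5 = 0.2
```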
Statistics of multivariate distributions:
Conditional distributions are just distributions, which have a (conditional) mean or variance.
$E(X \mid Y = y) = f(y)$, a function of $y$: "for each value of $Y$, what is the average value of $X$?"
$$E(X, Y) = \sum_{x,y} P(X = x, Y = y)\,(x, y) = (E(X), E(Y))$$
Covariance is the expected value of the product of fluctuations (deviations from mean):
$$\mathrm{Cov}(X, Y) = E((X - EX)(Y - EY)) = EXY - EX\,EY, \qquad \mathrm{Var}(X) = \mathrm{Cov}(X, X)$$
In a finite sample, $\langle (X, Y) \rangle = \frac{1}{N}\sum_{n=1}^N (x_n, y_n)$.
Sample covariance $\sigma_{XY} = \frac{1}{N}\sum_{n=1}^N (x_n - \langle X \rangle)(y_n - \langle Y \rangle)$.
Slope of regression line:
$$w_1 = \frac{\sigma_{XY}}{\sigma_{XX}}.$$
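A sketch tying these formulas together on synthetic data (all parameters invented): estimate $\sigma_{XY}$ and $\sigma_{XX}$ from a sample and recover the regression slope $w_1 = \sigma_{XY}/\sigma_{XX}$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=10_000)
y = 1.5 * x + rng.normal(0.0, 0.5, size=10_000)  # true slope 1.5 (invented)

# Sample covariance and variance (deviations from the sample means)
sigma_XY = ((x - x.mean()) * (y - y.mean())).mean()
sigma_XX = ((x - x.mean()) ** 2).mean()

w1 = sigma_XY / sigma_XX  # slope of the regression line
print(w1)                 # ≈ 1.5
```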
From linear regression – minimise $(\tilde y_n - w_1 \tilde x_n)^2$ (figure)

From linear regression – covariance as dot product (figure)
Continuous random variables
A random variable X is continuous if its sample space X is uncountable. In this case, P(X = x) = 0 for each x.
If $p_X(x)$ is a probability density function for $X$, then
$$P(a < X < b) = \int_a^b p_X(x)\,dx, \qquad E(X) = \int_x x \cdot p(x)\,dx.$$
Univariate Gaussian (Normal), N(μ, σ)
Pdf of Gaussian:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$$
Statistics – population mean $\mu$, variance $\sigma^2$:
$$E(X) = \int_{-\infty}^{\infty} x\,p(x)\,dx = \mu, \qquad E(X^2) = \int_{-\infty}^{\infty} x^2 p(x)\,dx = \mu^2 + \sigma^2,$$
$$\mathrm{Var}(X) = E(X^2) - (EX)^2 = \sigma^2.$$
Standard normal $N(0, 1)$ has mean $0$ and $\sigma = 1$: $p(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$, with
$$\int_{-\infty}^{\infty} p(z)\,dz = 1, \qquad \int_{-\infty}^{\infty} z\,p(z)\,dz = 0, \qquad \int_{-\infty}^{\infty} z^2 p(z)\,dz = 1.$$
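These population integrals can be checked numerically; a minimal sketch using scipy's quadrature (the parameter values are invented for the check):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.0, 2.0  # invented parameters for the check

def p(x):
    """Gaussian pdf N(mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

print(quad(p, -np.inf, np.inf)[0])                      # 1.0 (normalisation)
print(quad(lambda x: x * p(x), -np.inf, np.inf)[0])     # E(X) = mu = 1.0
print(quad(lambda x: x**2 * p(x), -np.inf, np.inf)[0])  # E(X^2) = mu^2 + sigma^2 = 5.0
```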
Bivariate continuous distributions: Marginalisation, Conditioning and Independence
$p_{X,Y}(x, y)$, joint probability density function of $X$ and $Y$:
$$\int_x \int_y p(x, y)\,dx\,dy = 1$$
Marginal distribution: $p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy$.
Conditional distribution:
$$p(x \mid y)\,dx = \frac{p(x, y)\,dx\,dy}{p(y)\,dy}$$
Independence: $X$ and $Y$ are independent if $p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$.
Two dimensional Gaussian distributions
The distribution on the left has mean $\mu = (2, 1)^T$ and covariance matrix
$$\Sigma = \begin{pmatrix} 2 & -2 \\ -2 & 4 \end{pmatrix}.$$
The dark lines are for the conditional distributions $P(Y \mid X = 3.0)$ and $P(X \mid Y = 2.6)$. Notice that they are both Gaussian distributions.
Changing the covariance matrix of the Gaussian – contour plots (figure)
Covariance matrix of X, Y linearly dependent Gaussian random variables
Let $n_x \sim N(0, \sigma_x)$, $n_y \sim N(0, \sigma_y)$ be two independent Gaussian random variables. Introduce two r.v.'s $X = n_x$ and $Y = aX + n_y$, $a$ real; $EX = 0$, $EY = 0$.
Compute components of covariance matrix $\Sigma = \mathrm{Cov}(X, Y)$: $EX^2$, $EXY$ and $EY^2$. Use $\mathrm{Var}\,N(0, \sigma) = \sigma^2$.
$$EX^2 = E n_x^2 = \sigma_x^2$$
$$EXY = E\,n_x(a n_x + n_y) = a E n_x^2 + E n_x n_y = a\sigma_x^2 \quad (E n_x n_y = 0 \text{ by independence})$$
$$EY^2 = E(a^2 n_x^2 + 2a n_x n_y + n_y^2) = a^2\sigma_x^2 + \sigma_y^2$$
Assembling all terms for $\Sigma$ and noting $\Sigma^{-1}$ for reference:
$$\Sigma = \begin{pmatrix} \sigma_x^2 & a\sigma_x^2 \\ a\sigma_x^2 & a^2\sigma_x^2 + \sigma_y^2 \end{pmatrix}, \qquad \Sigma^{-1} = \begin{pmatrix} \frac{1}{\sigma_x^2} + \frac{a^2}{\sigma_y^2} & -\frac{a}{\sigma_y^2} \\ -\frac{a}{\sigma_y^2} & \frac{1}{\sigma_y^2} \end{pmatrix}$$
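A quick simulation sketch confirming the assembled $\Sigma$ (using the numbers $a = -1/3$, $\sigma_x^2 = 6$, $\sigma_y^2 = 1/3$ from the example on the next slide):

```python
import numpy as np

rng = np.random.default_rng(3)
a = -1/3
sigma_x, sigma_y = np.sqrt(6), np.sqrt(1/3)

n_x = rng.normal(0.0, sigma_x, size=500_000)
n_y = rng.normal(0.0, sigma_y, size=500_000)
X, Y = n_x, a * n_x + n_y       # linearly dependent pair

print(np.cov(X, Y, bias=True))  # ≈ [[6, -2], [-2, 1]]
```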
Example of 2-dimensional Gaussian distribution
Given mean and covariance matrix of 2D Gaussian:
$$N\left(x, y;\; \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{pmatrix} 6 & -2 \\ -2 & 1 \end{pmatrix}\right),$$
compare cov. mat. with that of $Y = aX + n_y$, $X = n_x$:
$$\Sigma = \begin{pmatrix} \sigma_x^2 & a\sigma_x^2 \\ a\sigma_x^2 & a^2\sigma_x^2 + \sigma_y^2 \end{pmatrix} \implies \sigma_x^2 = 6, \quad a = -1/3, \quad \sigma_y^2 = 1/3.$$
Note negative slope, narrower distribution for $y$.
How to set contour lines – lines of equal probability (equal height)? Express the exponential in the Gaussian as $e^{Q(x,y)}$.
Locus of pairs (x, y) so that Q(x, y) = constant. (Called level sets.)
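A sketch of plotting these level sets for the example above (matplotlib; the grid ranges are chosen arbitrarily):

```python
import numpy as np
import matplotlib.pyplot as plt

mu = np.array([2.0, 1.0])
Sigma = np.array([[6.0, -2.0], [-2.0, 1.0]])
Lam = np.linalg.inv(Sigma)  # precision matrix

xs, ys = np.meshgrid(np.linspace(-4, 8, 200), np.linspace(-3, 5, 200))
d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)
Q = -0.5 * np.einsum('...i,ij,...j->...', d, Lam, d)  # exponent Q(x, y)

plt.contour(xs, ys, Q)      # level sets Q(x, y) = const
plt.xlabel('x'); plt.ylabel('y')
plt.show()
```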
Obtain quadratic form Q(x, y) from inverse covariance matrix for X = nx, Y = aX + ny
Joint distribution (since $n_x$, $n_y$ are independent):
$$p(X = x, Y = y) = \frac{1}{Z_x} \exp\left(-\frac{1}{2}\left(\frac{x}{\sigma_x}\right)^2\right) \cdot \frac{1}{Z_y} \exp\left(-\frac{1}{2}\left(\frac{y - ax}{\sigma_y}\right)^2\right)$$
Consider the exponent in the joint distribution:
$$-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{(y - ax)^2}{\sigma_y^2}\right) = -\frac{1}{2}\left(\left(\frac{1}{\sigma_x^2} + \frac{a^2}{\sigma_y^2}\right)x^2 - \frac{2a}{\sigma_y^2}\,xy + \frac{1}{\sigma_y^2}\,y^2\right)$$
The exponential $e^{Q(x,y)}$ has quadratic form $Q(x, y) = -\frac{1}{2}(x\;\, y)\,\Lambda \binom{x}{y}$:
$$Q(x, y) = -\frac{1}{2}(x\;\, y) \begin{pmatrix} \frac{1}{\sigma_x^2} + \frac{a^2}{\sigma_y^2} & -\frac{a}{\sigma_y^2} \\ -\frac{a}{\sigma_y^2} & \frac{1}{\sigma_y^2} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \Lambda \text{ turns out} = \Sigma^{-1}.$$
Explicit form for 2-dimensional Gaussian distribution
To explicitly write the term in the exponent of
$$N\left(x, y;\; \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 6 & -2 \\ -2 & 1 \end{bmatrix}\right)$$
as $-\frac{1}{2}(\mathbf{x} - \mu)^T \Lambda (\mathbf{x} - \mu)$ (precision matrix $\Lambda = \Sigma^{-1}$):
$$\frac{1}{Z} \exp\left(-\frac{1}{2 \cdot 2}\begin{bmatrix} x - 2, & y - 1 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & 6 \end{bmatrix} \begin{bmatrix} x - 2 \\ y - 1 \end{bmatrix}\right),$$
where the inverse of the covariance matrix has been inserted and its determinant $= 2$ is in the denominator. This evaluates to
$$\exp\left(-\frac{x^2}{4} - xy + 2x - \frac{3y^2}{2} + 5y - \frac{9}{2}\right).$$
The normalisation factor is $1/(2\sqrt{2}\,\pi)$.
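The expansion of the exponent can be verified symbolically; a minimal sketch with sympy:

```python
import sympy as sp

x, y = sp.symbols('x y')
d = sp.Matrix([x - 2, y - 1])          # x - mu
Lam = sp.Matrix([[1, 2], [2, 6]]) / 2  # Sigma^{-1}, using det(Sigma) = 2

Q = sp.expand(-(d.T * Lam * d)[0] / 2)  # exponent -1/2 (x - mu)^T Lam (x - mu)
print(Q)  # -x**2/4 - x*y + 2*x - 3*y**2/2 + 5*y - 9/2
```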
Conditionals on contour plot
These are contour plots of the (same) Gaussian pdf with mean $\mu = (2, 1)^T$ and covariance matrix
$$\Sigma = \begin{pmatrix} 2 & -2 \\ -2 & 4 \end{pmatrix}.$$
Notice the negative correlation between deviations from the means for $X$ and $Y$.
The light bands are for the conditional distributions $P(Y \mid X = 3.0)$ and $P(X \mid Y = 2.6)$. They are shown on the right and top, and they display a narrower distribution than the 2-d version.
General form for Gaussian distributions
A $p$-dimensional random variable $X = (X_1, \ldots, X_p)$ has probability $p(X)\prod_{i=1}^p dX_i$ of lying in a volume element, with probability density function given by a multivariate Gaussian distribution specified by its mean $\mu$ and covariance matrix $\Sigma$:
$$p(X) = p(X; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1}(X - \mu)\right),$$
where $|\cdot|$ is the determinant. Equivalently, if $X = x$, a value taken by the random variable $X$ distributed as a (multivariate) normal, we write
$$x \sim N(\mu, \Sigma).$$
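One standard way to draw $x \sim N(\mu, \Sigma)$, sketched below via the Cholesky factor (numpy's `rng.multivariate_normal` does this directly; the $\mu$ and $\Sigma$ values reuse the earlier example):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([2.0, 1.0])
Sigma = np.array([[6.0, -2.0], [-2.0, 1.0]])

# If L L^T = Sigma and z ~ N(0, I), then x = mu + L z ~ N(mu, Sigma)
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((100_000, 2))
x = mu + z @ L.T

print(x.mean(axis=0))          # ≈ mu
print(np.cov(x.T, bias=True))  # ≈ Sigma
```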
Summary and looking ahead
Recast the task of reducing a model's loss as that of reducing the improbability of the predictive model
Formalise probabilistic modelling
Relate statistics of distributions and samples to linear regression
Interpret 2-dimensional Gaussians
Lab 3 and Chapter 2 of FCML address maximum likelihood estimation
Next steps: regularisation to control growth of learned weight norms
Bayesian: priors shape expectations of data modelling in absence of data (domain understanding)