Data Mining and Machine Learning
Statistical Modelling (1) Peter Jančovič
Slide 1
Objectives
Review basic statistical modelling
Review the notions of probability distribution and probability density function (PDF)
Gaussian PDF
Multivariate Gaussian PDFs
Parameter estimation for Gaussian PDFs
Slide 2
Discrete random variables
Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}
Suppose that y_1, y_2, …, y_T are samples of the random variable Y
If c_m is the number of times that y_n = x_m, then an estimate of the probability that y_n takes the value x_m is given by:

P(x_m) = P(y_n = x_m) \approx \frac{c_m}{T}
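This counting estimate can be sketched in a few lines of Python (a minimal illustration; the function name `estimate_probs` is mine, not from the slides):

```python
from collections import Counter

def estimate_probs(samples, values):
    """Estimate P(x_m) as c_m / T, where c_m counts how often x_m
    appears among the T samples."""
    counts = Counter(samples)
    T = len(samples)
    return {x: counts.get(x, 0) / T for x in values}

# 5 samples drawn from the discrete set {"a", "b", "c", "d"}
probs = estimate_probs(["a", "b", "a", "a", "c"], ["a", "b", "c", "d"])
print(probs)  # "a" occurs 3 times out of 5, so P("a") = 0.6
```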
Slide 3
Continuous Random Variables
In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space
Counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
…but generalisations of this approach are applicable to continuous variables – non-parametric methods
Slide 4
Continuous Random Variables
An alternative is to use a parametric model
Probabilities are defined by a small set of parameters
A familiar example is the normal, or Gaussian, model
A (scalar/univariate) Gaussian probability density function (PDF) is defined by two parameters – its mean and variance
For a multivariate Gaussian PDF defined on a vector space, μ is the mean vector and Σ is the covariance matrix
Slide 5
Gaussian PDF
‘Standard’ 1-dimensional Gaussian PDF:
– mean = 0
– variance = 1
[Figure: the standard Gaussian bell curve plotted for −5 ≤ x ≤ 5, with the shaded area between a and b indicating P(a ≤ x ≤ b)]
Slide 6
Gaussian PDF
For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

p(x) = p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The factor \frac{1}{\sqrt{2\pi\sigma^2}} is a constant that ensures the area under the curve is 1; the exponential term defines the ‘bell’ shape
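The density above is straightforward to evaluate directly (a small sketch; the function name `gaussian_pdf` is mine):

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density: exp(-(x-mean)^2 / (2*var)) / sqrt(2*pi*var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Peak of the standard Gaussian (mean 0, variance 1) is 1/sqrt(2*pi) ≈ 0.399
print(gaussian_pdf(0.0, 0.0, 1.0))
```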
Slide 7
Standard Deviation
Standard deviation is the square root of the variance. For a Gaussian PDF:
– 68% of the area under the curve lies within one standard deviation (s.d.) of the mean
– 95% of the area under the curve lies within two standard deviations of the mean
– 99.7% of the area under the curve lies within three standard deviations of the mean
Slide 8
Standard Deviation
With s denoting the standard deviation:

P(\mu - s \le x \le \mu + s) \approx 0.68
P(\mu - 2s \le x \le \mu + 2s) \approx 0.95
P(\mu - 3s \le x \le \mu + 3s) \approx 0.997
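These coverage probabilities follow from the Gaussian CDF: the probability of lying within k standard deviations of the mean is erf(k/√2). A quick check (the function name is mine):

```python
import math

def prob_within_k_sd(k):
    """Probability that a Gaussian variable lies within k standard
    deviations of its mean: erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within_k_sd(k), 4))  # 0.6827, 0.9545, 0.9973
```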
Slide 9
Multivariate Gaussian PDFs
A (univariate) Gaussian PDF assumes the random variable takes scalar values
In the case where the random variable takes N-dimensional vector values, the corresponding PDF is called a multivariate Gaussian PDF and is given by:

p(x) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - m)^T \Sigma^{-1} (x - m)\right)
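The multivariate density can be implemented directly from this formula (a sketch using numpy; the function name `mvn_pdf` is mine):

```python
import numpy as np

def mvn_pdf(x, m, cov):
    """N-dimensional Gaussian density:
    (2*pi)^(-N/2) * |cov|^(-1/2) * exp(-0.5 * (x-m)^T cov^{-1} (x-m))."""
    x, m = np.asarray(x, float), np.asarray(m, float)
    cov = np.asarray(cov, float)
    N = len(m)
    d = x - m
    norm = (2 * np.pi) ** (-N / 2) / np.sqrt(np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.solve(cov, d))

# At the mean of a 2-D standard Gaussian the density is 1/(2*pi)
print(mvn_pdf([0, 0], [0, 0], np.eye(2)))
```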
Slide 10
Visualising multivariate Gaussian PDFs
Slide 11
Example
If \Sigma = \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix}, the standard deviations in the ‘x’ and ‘y’ directions are 3 and 2, respectively, and the 1 s.d. contour is an ellipse:
Slide 12
Example 2:
Now suppose

\Sigma = \begin{pmatrix} 7.75 & 2.17 \\ 2.17 & 5.25 \end{pmatrix}, \qquad m = \begin{pmatrix} 2 \\ 4 \end{pmatrix}

Calculate the eigenvalue decomposition of Σ:

\Sigma = U D U^T = \begin{pmatrix} \frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \frac{1}{2} & \frac{\sqrt{3}}{2} \end{pmatrix} \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix} \begin{pmatrix} \frac{\sqrt{3}}{2} & \frac{1}{2} \\ -\frac{1}{2} & \frac{\sqrt{3}}{2} \end{pmatrix}
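This decomposition can be checked numerically with numpy (an illustration, not part of the slides; the off-diagonal entry 2.17 is the slide's rounding of 5√3/4 ≈ 2.165):

```python
import numpy as np

# Covariance matrix from Example 2, built with the exact off-diagonal value
off = 5 * np.sqrt(3) / 4
cov = np.array([[7.75, off],
                [off, 5.25]])

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors
evals, evecs = np.linalg.eigh(cov)
print(evals)   # ≈ [4. 9.] — the variances 2^2 and 3^2

# The eigenvector for eigenvalue 9 makes a 30-degree angle with the x axis
# (taken modulo 180 to remove the eigenvector's sign ambiguity)
angle = np.degrees(np.arctan2(evecs[1, 1], evecs[0, 1])) % 180
print(angle)   # 30.0
```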
Slide 13
Example 2 (continued)
Note U is a rotation through 30°
Hence the one standard deviation contour is the same as in the previous example, but rotated through 30° and translated by m = (2, 4)^T
Slide 14
Example 2 (continued)
Slide 15
Fitting a Gaussian PDF to Data
Suppose y = y1,…,yt,…,yT is a set of T data values
For a Gaussian PDF p with mean μ and variance σ², define:

p(y \mid \mu, \sigma) = \prod_{t=1}^{T} p(y_t \mid \mu, \sigma)

How do we choose μ and σ to maximise p(y | μ, σ)?
Slide 16
Fitting a Gaussian PDF to Data
[Figure: two Gaussian curves fitted to the same data sample, plotted for −5 ≤ x ≤ 5 – left: good fit; right: poor fit]
Slide 17
Maximum Likelihood Estimation
The ‘best fitting’ Gaussian maximises p(y | μ, σ)
Terminology:
– p(y | μ, σ), as a function of y, is the probability (density) of y
– p(y | μ, σ), as a function of μ, σ, is the likelihood of μ, σ
Maximising p(y | μ, σ) with respect to μ, σ is called Maximum Likelihood (ML) estimation of μ, σ
Slide 18
ML estimation of μ, σ
Intuitively:
– The maximum likelihood estimate of μ should be the average value of y_1, …, y_T (the sample mean)
– The maximum likelihood estimate of σ² should be the variance of y_1, …, y_T (the sample variance)
This is true: p(y | μ, σ) is maximised by setting:

\mu = \frac{1}{T}\sum_{t=1}^{T} y_t, \qquad \sigma^2 = \frac{1}{T}\sum_{t=1}^{T} (y_t - \mu)^2
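These two estimates amount to one pass over the data (a minimal sketch; the function name `ml_gaussian_fit` is mine):

```python
import numpy as np

def ml_gaussian_fit(y):
    """Maximum likelihood estimates for a 1-D Gaussian:
    the sample mean and the (biased, divide-by-T) sample variance."""
    y = np.asarray(y, float)
    mu = y.mean()
    var = ((y - mu) ** 2).mean()   # divide by T, not T-1
    return mu, var

mu, var = ml_gaussian_fit([1.0, 2.0, 3.0, 4.0])
print(mu, var)  # 2.5 1.25
```

Note the ML variance divides by T; the unbiased estimator divides by T − 1, but ML estimation gives the 1/T form.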
Slide 19
Proof
First note that maximising p(y) is the same as maximising log(p(y)):

\log p(y \mid \mu, \sigma) = \log \prod_{t=1}^{T} p(y_t \mid \mu, \sigma) = \sum_{t=1}^{T} \log p(y_t \mid \mu, \sigma)

Also

\log p(y_t \mid \mu, \sigma) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_t - \mu)^2}{2\sigma^2}

At a maximum:

0 = \frac{\partial \log p(y \mid \mu, \sigma)}{\partial \mu} = \sum_{t=1}^{T} \frac{y_t - \mu}{\sigma^2}

So \sum_{t=1}^{T} y_t = T\mu, i.e. \mu = \frac{1}{T}\sum_{t=1}^{T} y_t
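The stationarity condition can be checked numerically: perturbing μ away from the sample mean always lowers the log-likelihood (an illustrative check; the helper name `log_likelihood` is mine):

```python
import math

def log_likelihood(y, mu, var):
    """log p(y | mu, var) = sum_t [ -0.5*log(2*pi*var) - (y_t - mu)^2 / (2*var) ]"""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (yt - mu) ** 2 / (2 * var) for yt in y)

y = [1.0, 2.0, 3.0, 4.0]
mu_ml = sum(y) / len(y)           # sample mean, 2.5
best = log_likelihood(y, mu_ml, 1.0)
for eps in (-0.5, -0.1, 0.1, 0.5):
    # any perturbation of the mean gives a strictly smaller log-likelihood
    assert log_likelihood(y, mu_ml + eps, 1.0) < best
print("sample mean maximises the log-likelihood")
```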
Slide 20
Multi-modal distributions
In practice the distributions of many naturally occurring phenomena do not follow the simple bell- shaped Gaussian curve
For example, if the data arises from several different sources, there may be several distinct peaks (e.g. distribution of heights of adults)
These peaks are the modes of the distribution and the distribution is called multi-modal
Slide 21
Summary
Reviewed basic statistical modelling, probability distribution, probability density function
Gaussian PDFs
Multivariate Gaussian PDFs
Maximum likelihood (ML) parameter estimation
In the next session we will introduce Gaussian mixture PDFs (GMMs) and ML parameter estimation for GMMs
Slide 22