Principal Components Analysis: Part 1
Hamid Dehghani, School of Computer Science, University of Birmingham
April 2021
Outline
• By the end of this series you will
– learn about the relationships between the dimensions of multi-dimensional data
– understand Principal Components Analysis (PCA)
– apply it for data compression
Covariance
• Variance and Covariance
– measure of the “spread” of a set of points around their centre of mass (mean)
– Variance
• measure of the deviation from the mean for points in one dimension e.g. heights
– Covariance
• measure of how much each of the dimensions vary from the mean with respect to each other
• Covariance
– measured between 2 dimensions to see if there is a relationship between the 2 dimensions, e.g. number of hours studied and marks obtained

\mathrm{cov}(x_1, x_2) = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_{1i} - \bar{x}_1\right)\left(x_{2i} - \bar{x}_2\right)
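A minimal Python sketch of the formula above (a hand-rolled version for illustration; the hours/marks numbers are made up):

def covariance(x1, x2):
    # sample covariance of two equal-length sequences of observations
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    return sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)

# e.g. hours studied vs marks obtained (illustrative data)
hours = [9, 15, 25, 14, 10, 18]
marks = [39, 56, 93, 61, 50, 75]
print(covariance(hours, marks))   # a positive value: the two tend to rise together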
Covariance
• For a 3-dimensional data set (x,y,z)
– measure the covariance between
• x and y dimensions,
• y and z dimensions,
• x and z dimensions.
– Measuring the covariance between
• x and x
• y and y
• z and z
– gives you the variance of the x, y and z dimensions respectively
Covariance Matrix
• Representing Covariance between dimensions as a matrix e.g. for 3 dimensions:
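The 3 × 3 matrix referred to here (reconstructed from the description in this slide) is:

C = \begin{pmatrix}
\mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\
\mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\
\mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z)
\end{pmatrix}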
• Diagonal is the variances of x, y and z
• cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal
• N-dimensional data will result in an N×N covariance matrix
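A short NumPy sketch of building such a matrix (the data values are illustrative; rows of the array are the dimensions x, y and z):

import numpy as np

# rows = dimensions (x, y, z), columns = observations
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.1, 3.9, 6.2, 8.0],
                 [8.0, 6.1, 3.9, 2.2]])

C = np.cov(data)            # 3 x 3 covariance matrix (N x N for N dimensions)
print(np.diag(C))           # variances of x, y and z
print(np.allclose(C, C.T))  # True: symmetric about the diagonal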
Covariance
• What is the interpretation of covariance calculations?
– e.g.: 2 dimensional data set
• x: number of hours studied for a subject
• y: marks obtained in that subject
– suppose the covariance value is 104.53
• What does this value mean?
• The exact value is not as important as its sign.
Covariance examples
• A positive value of covariance
– Both dimensions increase or decrease together e.g. as the number of hours studied increases, the marks in that subject increase.
• A negative value
– while one increases the other decreases, or vice versa, e.g. hours awake vs performance in the Final Assessment.
• If covariance is zero:
– the two dimensions are uncorrelated, i.e. there is no linear relationship between them, e.g. heights of students vs the marks obtained in the Quiz
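A small NumPy illustration of the three cases, with synthetic data (the variable names and numbers are made up; with random data the last covariance will only be approximately zero):

import numpy as np

rng = np.random.default_rng(0)
hours = rng.uniform(0, 30, 1000)
marks = 3 * hours + rng.normal(0, 5, 1000)        # rises with hours  -> positive covariance
awake = rng.uniform(0, 40, 1000)
perf = 100 - 2 * awake + rng.normal(0, 5, 1000)   # falls as awake rises -> negative covariance
height = rng.uniform(150, 200, 1000)              # unrelated to marks -> covariance near zero

print(np.cov(hours, marks)[0, 1])   # positive
print(np.cov(awake, perf)[0, 1])    # negative
print(np.cov(height, marks)[0, 1])  # small in magnitude compared with the other two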
Why is it interesting?
• Why bother with calculating covariance when we could just plot the 2 values to see their relationship?
– Covariance calculations are used to find relationships between dimensions in high dimensional data sets (usually greater than 3) where visualization is difficult.
Principal components analysis (PCA)
• PCA is a technique that can be used to simplify a dataset
– A linear transformation that chooses a new coordinate system for the data set such that:
– the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),
– the second greatest variance on the second axis,
– and so on.
• PCA can be used to reduce dimensionality by eliminating the later (non-substantial) principal components.
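A minimal NumPy sketch of this idea via eigendecomposition of the covariance matrix (one common way to compute PCA; the function name and interface below are illustrative):

import numpy as np

def pca(X, k):
    # X: (n_samples, n_dims) data matrix; returns the data in the first k principal components
    Xc = X - X.mean(axis=0)               # centre each dimension on its mean
    C = np.cov(Xc, rowvar=False)          # covariance matrix of the dimensions
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: for symmetric matrices, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]     # sort directions by decreasing variance
    components = eigvecs[:, order[:k]]    # keep the k directions of greatest variance
    return Xc @ components                # coordinates on the new (reduced) axes

# e.g. reduce 3-D data to 2-D:  X2 = pca(X, 2)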
PCA: Simple example
• Consider the following 3D points in space (x,y,z)
– P1 = [1 2 3]
– P2 = [2 4 6]
– P3 = [4 8 12]
– P4 = [3 6 9]
– P5 = [5 10 15]
– P6 = [6 12 18]
• To store these points in memory, we need
– 18 = 3 × 6 bytes (3 coordinates for each of the 6 points)
PCA: Simple example
• But
– P1 = [1 2 3] = P1 ⨉ 1
– P2 = [2 4 6] = P1 ⨉ 2
– P3 = [4 8 12] = P1 ⨉ 4
– P4 = [3 6 9] = P1 ⨉ 3
– P5 = [5 10 15] = P1 ⨉ 5
– P6 = [6 12 18] = P1 ⨉ 6
• All the points are related geometrically: they are all the same point, scaled by a factor
• They can be stored using only 9 bytes
– Store one point (3 bytes) + the multiplying constants (6 bytes)
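A NumPy sketch of that storage scheme, using the six points above:

import numpy as np

P = np.array([[1, 2, 3],
              [2, 4, 6],
              [4, 8, 12],
              [3, 6, 9],
              [5, 10, 15],
              [6, 12, 18]])

base = P[0]                              # store one point: [1 2 3]  (3 values)
scales = P @ base / (base @ base)        # one multiplier per point: 1, 2, 4, 3, 5, 6
reconstructed = np.outer(scales, base)   # rebuilds all six points exactly
print(np.array_equal(reconstructed, P))  # True: 3 + 6 = 9 stored values instead of 18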
PCA: Simple example
• Viewing the points in 3D
• In this example
– all the points happen to lie on a line:
– a 1D subspace of the original 3D space
PCA: Simple example
• Viewing the points in 3D
– Consider a new coordinate system where one of the axes is along the direction of the line
– In this coordinate system, every point has only one non-zero coordinate: we only need to store the direction of the line and the nonzero coordinate for each of the points.
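A sketch of this new coordinate system for the example points, taking the unit vector along the line as the single new axis:

import numpy as np

P = np.array([[1, 2, 3], [2, 4, 6], [4, 8, 12],
              [3, 6, 9], [5, 10, 15], [6, 12, 18]], dtype=float)

u = np.array([1.0, 2.0, 3.0])
u /= np.linalg.norm(u)                       # unit direction of the line

coords = P @ u                               # the single non-zero coordinate of each point
print(coords)                                # six numbers, one per point
print(np.allclose(np.outer(coords, u), P))   # True: direction + one coordinate per point recovers everything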
Principal Component Analysis (PCA)
• Given a set of points, how do we know if they can be compressed like in the previous example?
• The answer is to look at the correlation between the coordinates (dimensions) of the points
• The tool for doing this is called PCA
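As a sketch of that check, the eigenvalues of the covariance matrix of the example points reveal how many directions actually carry variance:

import numpy as np

P = np.array([[1, 2, 3], [2, 4, 6], [4, 8, 12],
              [3, 6, 9], [5, 10, 15], [6, 12, 18]], dtype=float)

C = np.cov(P, rowvar=False)    # 3 x 3 covariance matrix of the x, y, z dimensions
eigvals, _ = np.linalg.eigh(C)
print(eigvals)                 # only one eigenvalue is (essentially) non-zero:
                               # all the variance lies along a single direction,
                               # so the points can be compressed to 1 dimension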