F21_APS1070_Tutorial_3
PCA – Tutorial 3¶
Principal Component Analysis¶
As you get deeper in the world of data science, you’ll learn that in practice it’s very uncommon to work with datasets that are 2 or 3 dimensional, and so can be plotted directly. We’re now going to look at dimensionality reduction: a category of unsupervised algorithms which attempt to collapse high-dimensional datasets into a low-dimensional space.
As suggested above, one reason to do this is to aid visualization. However, that’s far from the only reason dimensionality reduction is useful! These techniques also allow us to filter noise, extract useful features, and accomplish much more.
Let’s dive into PCA with the Iris dataset.
PCA – Overview¶
Standardize the data.
Obtain the eigenvectors and eigenvalues from the covariance matrix (or correlation matrix), or perform Singular Value Decomposition.
Sort eigenvalues in descending order and choose the 𝑘 eigenvectors that correspond to the 𝑘 largest eigenvalues, where 𝑘 is the dimensionality of the new feature subspace (𝑘 is less than the original dimensionality).
Construct the projection matrix 𝐖 from the selected 𝑘 eigenvectors.
Transform the original dataset 𝐗 via 𝐖 to obtain a 𝑘-dimensional feature subspace 𝐘 (see the minimal sketch below).
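Before working through the Iris example, here is a minimal NumPy sketch of these five steps on a toy random matrix; all variable names here are illustrative and not part of the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # toy data: 100 samples, 4 features
k = 2                           # dimensionality of the new subspace

# 1. Standardize the data.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigendecomposition of the covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_std.T))

# 3. Sort eigenvalues (and their eigenvectors) in descending order.
order = np.argsort(-eig_vals)
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 4. Build the projection matrix W from the k leading eigenvectors.
W = eig_vecs[:, :k]

# 5. Transform X to obtain the k-dimensional representation Y.
Y = X_std @ W
print(Y.shape)   # (100, 2)
```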
PCA – Iris dataset¶
What’s that flower?
For the following tutorial, we will be working with the famous “Iris” dataset that has been deposited on the UCI machine learning repository
(https://archive.ics.uci.edu/ml/datasets/Iris).
The iris dataset contains measurements for 150 iris flowers from three different species.
The three classes in the Iris dataset are:
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)
And the four features in the Iris dataset are:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
df = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/aps1070-2019/datasets/master/iris.data',
                 header=None)
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
# split data table into data X and class labels y
X = df.iloc[:,0:4].values
y = df.iloc[:,4].values
labels = set(y)
Exploratory Data Analysis¶
Let’s explore a bit.
fig, axes = plt.subplots(2, 2, figsize=(12,12))
noOfCols = X.shape[1]
# iterate over each column (feature), and plot in a separate sub-plot.
for col in range(noOfCols):
    # plot data for different labels for the chosen column (feature).
    for label in labels:
        axes.flat[col].hist(X[y==label, col], alpha=0.5, label=label)
    axes.flat[col].legend(loc='upper right')
    axes.flat[col].set(xlabel=df.columns[col], ylabel='count')
As one can see, no single feature can, on its own, predict the class of the flower.
Standardizing Data¶
Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially, if it was measured on different scales.
Although all features in the Iris dataset were measured in centimeters, let us continue with the transformation of the data to unit scale (mean=0 and variance=1), which is a requirement for the optimal performance of many machine learning algorithms.
X_std = StandardScaler().fit_transform(X)
X[:, 0].mean(), X_std[:, 0].mean()
X[:, 0].var(), X_std[:, 0].var()
Eigendecomposition – Computing Eigenvectors and Eigenvalues¶
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA: The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.
The classic approach to PCA is to perform the eigendecomposition on the covariance matrix Σ, which is a 𝑑×𝑑 matrix where each element represents the covariance between two features.
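In symbols, with $X_{\text{std}}$ the standardized $n \times d$ data matrix, the covariance matrix we eigendecompose is

$$\Sigma = \frac{1}{n-1} X_{\text{std}}^{\top} X_{\text{std}}, \qquad \Sigma\, v_j = \lambda_j v_j \quad (j = 1, \dots, d),$$

where each eigenvector $v_j$ is a principal component and $\lambda_j$ is the variance of the data along that direction.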
Using Covariance Matrix¶
n, m = X_std.shape
# Compute covariance matrix
C = np.dot(X_std.T, X_std) / (n-1)
# or C = np.cov(X_std.T)
# Eigen decomposition
eigenValues, eigenVectors = np.linalg.eigh(C)
print("Eig Vec:\n", eigenVectors, "\nEig Val:\n", eigenValues)
Sort based on eigenValues¶
Sort in decreasing order of eigenValues.
Note that np.linalg.eigh returns the eigenvalues in ascending order, so this sorting step is needed to put the largest eigenvalues (and their eigenvectors) first.
args = (-eigenValues).argsort()
eigenValues = eigenValues[args]
eigenVectors = eigenVectors[:, args]
Explained Variance¶
eigValSum = sum(eigenValues)
expVar = [eigV/eigValSum*100 for eigV in eigenValues]
cumExpVar = np.cumsum(expVar)
plt.bar(range(4), expVar, label='Explained Variance')
plt.plot(cumExpVar, 'r-o', label='Cumulative Explained Variance')
plt.legend()
plt.show()
How many eigenValues are needed to explain more than 95% of variance?
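One way to answer this programmatically (a small addition, assuming the cumExpVar array from the cell above) is:

```python
# Smallest number of components whose cumulative explained variance exceeds 95%.
k95 = int(np.argmax(cumExpVar > 95)) + 1
print(k95, "components explain", round(cumExpVar[k95 - 1], 2), "% of the variance")
```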
Projections¶
Since the first 2 eigenVectors are enough to explain more than 95% of the variance, we’ll create the projection matrix using only those 2 eigenVectors.
PC_count = 2 # Number of PCs
x_feature = 0 # Plot x-axis
y_feature = 1 # Plot y_axis
# Original Data
for label in labels:
    plt.plot(X_std[y==label, x_feature], X_std[y==label, y_feature], 'o', label=label)
plt.legend(loc='upper right')
W = eigenVectors[:, 0:PC_count]
projX = np.dot(X_std, W)
X_std.shape, W.shape , projX.shape
## Projection
for label in labels:
    plt.plot(projX[y==label, x_feature], projX[y==label, y_feature], 'o', label=label)
plt.legend(loc='upper right')
# Reconstruction
ReconX = np.dot(projX, W.T)
plt.figure()
plt.title("Original Data")
for label in labels:
    plt.plot(X_std[y==label, x_feature], X_std[y==label, y_feature], 'o', label=label)
plt.legend(loc='upper right')
plt.figure()
plt.title("Reconstructed Data")
for label in labels:
    plt.plot(ReconX[y==label, x_feature], ReconX[y==label, y_feature], 'o', label=label)
plt.legend(loc='upper right')
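As a quick sanity check (not part of the original notebook), we can quantify how much information the 2-component reconstruction loses, assuming X_std and ReconX from the cells above:

```python
# Mean squared reconstruction error, relative to the total variance of the standardized data.
mse = np.mean((X_std - ReconX) ** 2)
print("Reconstruction MSE:", mse)
# Fraction of variance lost; should be under ~0.05 since the first 2 PCs explain >95%.
print("Fraction of variance lost:", mse / np.mean(X_std ** 2))
```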
What has PCA helped us achieve here?
Visualization: easier visualization of all 3 classes
Classification: a flower of unknown class can be plotted here, and then classified visually or with an algorithm such as k-NN
We have used PCA on numerical data. But can it be used on image data?
Summarizing PCA:¶
Assume we have a dataset with 1000 samples and 5000 features for each sample: X: (1000, 5000). Total elements: 5,000,000
We apply PCA and get W: (5000, 5000)
We determine only 10 PCs are enough: W_op: (5000, 10)
We apply projection: X @ W_op = Proj (1000, 10)
Now the data is summarized in Proj: (1000, 10) and W_op: (5000, 10). Total elements: 60,000! Compression ratio: 5,000,000 / (60,000 + StandardScaler parameters) ≈ 80x
Reconstruction: Proj @ W_op.T = Recon (1000, 5000)
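The shape bookkeeping above can be verified with a quick sketch on random data (the values are random; W_op here is just a random matrix standing in for the 10 leading eigenvectors):

```python
import numpy as np

X = np.random.randn(1000, 5000)        # 1000 samples, 5000 features
W_op = np.random.randn(5000, 10)       # stand-in for the 10 leading eigenvectors
Proj = X @ W_op                        # (1000, 10)
Recon = Proj @ W_op.T                  # (1000, 5000)
stored = Proj.size + W_op.size         # 10,000 + 50,000 = 60,000 elements
print(Proj.shape, Recon.shape, round(X.size / stored, 1))   # ~83x fewer stored elements
```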
Eigenfaces¶
Let’s face the Eigen.
Eigenfaces is the name given to a set of eigenvectors when they are used in the computer vision problem of human face recognition. The approach of using eigenfaces for recognition was developed by Sirovich and Kirby (1987) and used by Matthew Turk and Alex Pentland in face classification.
Eigenfaces refers to an appearance-based approach to face recognition that seeks to capture the variation in a collection of face images and use this information to encode and compare images of individual faces in a holistic (as opposed to a parts-based or feature-based) manner.
The motivation of Eigenfaces is twofold:
Extract the relevant facial information, which may or may not be directly related to human intuition of face features such as the eyes, nose, and lips. One way to do so is to capture the statistical variation between face images.
Represent face images efficiently. To reduce the computation and space complexity, each face image can be represented using a small number of parameters.
Images are downloaded from Labeled Faces in the Wild.
!pip install wget
import os
import math
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import wget
# Download and unzip dataset.
filename = wget.download('https://github.com/aps1070-2019/datasets/raw/master/lfw-a.tgz', 'lfw-a.tgz')
!tar -xvzf "{filename}"
# constants
IMAGE_DIR = 'lfw'
DEFAULT_SIZE = [250, 250]
# Reads images from filesystem and returns Array of images and imageNames.
def readImages(imagePath = IMAGE_DIR, defaultSize = DEFAULT_SIZE):
    images = []
    imageNames = []
    imageDirs = [image for image in os.listdir(imagePath) if not image.startswith('.')]
    for imageDir in imageDirs:
        dirPath = os.path.join(imagePath, imageDir)
        dirImageNames = [image for image in os.listdir(dirPath) if not image.startswith('.')]
        for imageName in dirImageNames:
            image = Image.open(os.path.join(dirPath, imageName))
            image = image.convert("L")  # L stands for Luminance: converts image to grayscale
            if (defaultSize is not None):
                image = image.resize(defaultSize, Image.ANTIALIAS)  # resize image
            images.append(np.asarray(image, dtype = np.uint8))
            imageNames.append(imageDir)
    return [images, imageNames]
[X, y] = readImages()
type(X), len(X)
type(X[0]), X[0].shape
type(y), len(y)
print(X[40], "\n", y[40])
Exploratory Data Analysis¶
Check whether this image from the dataset has the correct label.
print('Image name is: ', y[40])
plt.imshow(X[40], cmap=plt.cm.gray)
plt.show()
print(X[40][100:120,100:110])
plt.imshow(X[40][100:120,100:120] ,cmap=plt.cm.gray )
Creating a mean face from all dataset faces.
def asRowMatrix(X):
    if len(X) == 0: return np.array([])
    rowMatrix = np.empty((0, X[0].size), dtype = X[0].dtype)
    for img in X:
        rowMatrix = np.vstack((rowMatrix, np.asarray(img).reshape(1, -1)))
    return rowMatrix
XMat = asRowMatrix(X);
meanImage = np.reshape(XMat.mean(axis=0), X[0].shape)
plt.imshow(meanImage, cmap=plt.cm.gray)
plt.title('Mean Face')
plt.show()
XMat.shape
Eigendecomposition – Computing Eigenvectors and Eigenvalues¶
Using Covariance Matrix¶
The image below explains the PCA code used in this section:
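The trick the pca function below relies on can also be stated without the image: if $u$ is an eigenvector of the small $n \times n$ matrix $\frac{1}{n-1} X X^{\top}$ with eigenvalue $\lambda$, then left-multiplying by $X^{\top}$ gives

$$\frac{1}{n-1} X^{\top} X \,(X^{\top} u) = \lambda \,(X^{\top} u),$$

so $X^{\top} u$ is an eigenvector of the full $d \times d$ covariance matrix $\frac{1}{n-1} X^{\top} X$ with the same eigenvalue. Since $d = 62500$ pixels but $n$ is only the number of images, this turns a huge $62500 \times 62500$ eigendecomposition into a small $n \times n$ one; the recovered eigenvectors are then normalised to unit length in the code.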
# The code below is computationally expensive (a 62500 x 62500 covariance matrix), so it is left commented out.
# C = np.dot(XMat.T, XMat) # covariance matrix
# eigenValues, eigenVectors = np.linalg.eigh(C)
def getBasisCountThatPreservesVariance(eigenValues, variance=0.98):
    for idx, cumulativeSum in enumerate(np.cumsum(eigenValues) / np.sum(eigenValues)):
        if cumulativeSum > variance:
            return idx
def pca(X, y):
    n, d = X.shape
    mu = X.mean(axis=0)
    X = X - mu  # centering the data
    # Direct approach (commented out): the d x d covariance matrix is far too large here.
    # C = np.dot(X.T, X) / (n-1)
    # eigenValues, eigenVectors = np.linalg.eigh(C)
    # Trick: eigendecompose the much smaller n x n matrix instead,
    # then map its eigenvectors back to d dimensions with X.T.
    C = np.dot(X, X.T) / (n-1)  # covariance matrix (n x n)
    eigenValues, eigenVectors = np.linalg.eigh(C)
    eigenVectors = np.dot(X.T, eigenVectors)
    for i in range(n):
        eigenVectors[:, i] = eigenVectors[:, i] / np.linalg.norm(eigenVectors[:, i])
    print("Dim of Full Eigen Vectors", eigenVectors.shape)
    # sort eigenVectors in descending order by their eigenValue
    idx = np.argsort(-eigenValues)
    eigenValues = eigenValues[idx]
    eigenVectors = eigenVectors[:, idx]
    # select based on numOfBasis
    numOfBasis = getBasisCountThatPreservesVariance(eigenValues)
    print('Number of useful eigenBasis are: ', numOfBasis)
    eigenValues = eigenValues[0:numOfBasis].copy()
    eigenVectors = eigenVectors[:, 0:numOfBasis].copy()
    return eigenValues, eigenVectors, mu
eigenValues, eigenVectors, mean = pca(XMat, y)
eigenValues.shape , eigenVectors.shape
EigenFaces¶
What were the dimensions of an eigenVector in the Iris example?
An array of size equal to the number of features: 4.
What will the dimensions of an eigenVector be in this example?
An array of size equal to the number of features (pixels): 62500.
eigenVectors[:, 0].shape
Since the dimensions of an eigenVector are the same as those of a vectorised image, an eigenVector can itself be displayed in image format.
This is called an eigenFace.
# show the first eigenFace
plt.imshow(eigenVectors[:, 0].reshape(-1, 250), cmap = plt.cm.gray)
# print the first 6 eigen faces
COUNT = 6
ROWS = math.ceil(COUNT/3)
fig = plt.figure(figsize=(12, ROWS * 4))
for i in range(0, COUNT):
    plt.subplot(ROWS, 3, i+1)
    plt.imshow(eigenVectors[:, i].reshape(-1, 250), cmap = plt.cm.gray)
    plt.title('#{}'.format(i+1))
By only using the first few eigenFaces:
How would you get a face with white hair?
How about a face with black hair?
Have a good look at eigenFace number 5.
Good luck sleeping tonight.
Plot the next 6 eigenFaces.
Do you observe any difference between the first 6 and the second 6?
Projections¶
Now, we will reconstruct an image from the dataset using eigenFaces (eigenVectors).
IMAGE_IDX = 10 # image idx in dataset
# actual image
plt.imshow(X[IMAGE_IDX], cmap=plt.cm.gray)
plt.show()
def project(W, X, mu):
    return np.dot(X - mu, W)

def reconstruct(W, Y, mu):
    return np.dot(Y, W.T) + mu
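In terms of the earlier notation, these two helpers compute the projection and the reconstruction

$$Y = (X - \mu)\, W, \qquad \hat{X} = Y\, W^{\top} + \mu,$$

the same operations as in the Iris example, except that the mean face $\mu$ is subtracted before projecting and added back afterwards.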
# create reconstructed images
COUNT = 6 # count of first eigenVectors used to reconstruct the image
reconImages = []
for numEvs in range(1, COUNT+1):
    P = project(eigenVectors[:, 0:numEvs], X[IMAGE_IDX].reshape(1, -1), mean)
    R = reconstruct(eigenVectors[:, 0:numEvs], P, mean)
    reconImages.append(R.reshape(X[0].shape))
# plot reconstructed images
ROWS = math.ceil(COUNT/3)
fig = plt.figure(figsize=(12, ROWS * 4))
for i in range(0, COUNT):
    plt.subplot(ROWS, 3, i+1)
    plt.imshow(reconImages[i], cmap = plt.cm.gray)
    plt.title('#{}'.format(i+1))
# create reconstructed images
numEvsSet = [1, 10 , 100, 200, 400, 500, 535] # these no. of eigenVectors will be used to reconstruct the image.
COUNT = len(numEvsSet)
reconImages = []
for numEvs in numEvsSet:
    P = project(eigenVectors[:, 0:numEvs], X[IMAGE_IDX].reshape(1, -1), mean)
    R = reconstruct(eigenVectors[:, 0:numEvs], P, mean)
    reconImages.append(R.reshape(X[0].shape))
# plot reconstructed images
ROWS = math.ceil(COUNT/3)
fig = plt.figure(figsize=(12, ROWS * 4))
for i in range(0, COUNT):
    plt.subplot(ROWS, 3, i+1)
    plt.imshow(reconImages[i], cmap = plt.cm.gray)
    plt.title("Reconstruction: " + str(numEvsSet[i]) + " Components")
Let’s create an animation of reconstruction!¶
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from google.colab import files
fig = plt.figure()
ims = []
for numEvs in range(0, eigenVectors.shape[1]):
    if numEvs % 50 == 0:
        print("Progress: %.2f " % (numEvs * 100 / eigenVectors.shape[1]), "%")
    title = plt.text(125.5, 0.85, "", bbox={'facecolor': 'w', 'alpha': 1, 'pad': 5},
                     ha="center")
    P = project(eigenVectors[:, 0:numEvs], X[IMAGE_IDX].reshape(1, -1), mean)
    R = reconstruct(eigenVectors[:, 0:numEvs], P, mean)
    reconImage = R.reshape(X[0].shape)
    title.set_text("Reconstruction: " + str(numEvs) + " Components")
    im = plt.imshow(reconImage, cmap = plt.cm.gray)
    ims.append([im, title])
ani = animation.ArtistAnimation(fig, ims, interval=50, blit=True,
                                repeat_delay=1000)
ani.save('dynamic_images.mp4')
plt.show()
files.download(‘dynamic_images.mp4’)