COMP3670/6670 Assignment 4 – GMM
Enter Your Student ID:
Your Name:
Submit: You can write your answers in this file and submit a single Jupyter Notebook file (.ipynb) on Wattle. Rename this file with your student number as 'uXXXXXXX.ipynb'. Alternatively, you can write only your programming answers in this file and submit two files: 'uXXXXXXX.ipynb' for programming and 'uXXXXXXX.pdf' for theory. Please submit them separately rather than as a zip file.
Enter Discussion Partner IDs Below:
•
•
•
Programming Section
• Task 1 = 20%
• Task 2 = 20%
• Task 3 = 20%
• Task 4 = 10%
• Task 5 = 10%
• Task 6 = 20%
PROGRAMMING SECTION
For all of the following, program the solution yourself. Don’t just call a library function that does the whole question for you, or you’ll get zero (no, that doesn’t mean you can’t use any library functions, but it does mean that you have to show you understand how to compute the answer yourself).
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D #This is for 3d scatter plots.
import math
import random
import functools
We’re going to implement an algorithm to model data with a mixture of Gaussians.
Remember the simplified EM algorithm in assignment 2 for k-means? Well, we’re going to implement something similar.
The added complexity is that your program must also correctly estimate the covariance matrices of the Gaussian components involved (3-dimensional here), as well as their means.
Not only this, but the "distance" between each of your data points (a row $x_i$ of $X$) and a Gaussian $k$ is the probability of that point being generated by that Gaussian, $p(x_i \mid k)$.
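To make this concrete before you start, the cell below is a small, self-contained illustration (toy values, not part of the assignment skeleton) of this probabilistic "distance": it evaluates the density of a single point under a single Gaussian using scipy.stats.multivariate_normal, the same helper suggested in Task 2.
In [ ]:
import numpy as np
from scipy.stats import multivariate_normal  # not in the imports above; assumed available

# Toy parameters for one Gaussian k (illustrative values only).
mu_k = np.zeros(2)
sigma_k = np.array([[1.0, 0.3],
                    [0.3, 2.0]])  # symmetric positive semidefinite

x_i = np.array([0.5, -1.0])       # a toy data point

# p(x_i | k): the "probabilistic distance" between the point and Gaussian k.
print(multivariate_normal.pdf(x_i, mean=mu_k, cov=sigma_k))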
Run the below cell to load the data we’ll be using.
In [ ]:
X = np.load("./data.npy")
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:, 2])
plt.show()
First off, some definitions:
$X$ is a dataset containing 800 data points. Each data point has 4 dimensions. The first three dimensions can be represented by a three-dimensional Gaussian Mixture Model. The fourth dimension contains categorical information for that data point.
You only need the first three dimensions of the data in Tasks 1 through 5. The fourth dimension is for Task 6 only.
$K$ is the total number of Gaussians. This is just like the $K$ mean vectors you had for k-means in assignment 2. Each Gaussian is named $k$, where $k$ is the unique number associated with that Gaussian; for example, the first Gaussian is $k = 1$. Each $k$ has a mean and covariance matrix associated with it, which together define a Gaussian distribution: a formula that can be used to generate data points, or to compute the probability that a given data point was generated by that Gaussian (look up generative models if you're curious to learn more). For sampling, you can use np.random.multivariate_normal().
$\Sigma$ is a list of covariance matrices $\Sigma_k \in \mathbb{R}^{m \times m}$, which are symmetric positive semidefinite matrices.
$\mu$ is a list of means, each one $\mu_k \in \mathbb{R}^{m}$ associated with gaussian $k$.
$N$ is the total number of datapoints.
$responsibilities \in [0, 1]^{N \times K}$ is a matrix. Every column $k$ of $responsibilities$ is associated with the $k^{th}$ Gaussian. Each element $r_{ik}$ of the $k^{th}$ column is the responsibility that Gaussian $k$ takes for the $i^{th}$ datapoint $x_i$ (the $i^{th}$ row of $X$): the posterior probability that $x_i$ was generated by Gaussian $k$.
$N_k$ is the sum of the $k^{th}$ column of $responsibilities$. In other words, there is one $N_k$ for each gaussian. $N_k = \sum_i r_{ik}$.
$\pi$ is a list of probabilities, one associated with each Gaussian. $\pi_k$ is the probability of Gaussian $k$: $\pi_k = \frac{N_k}{N}$.
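The cell below is a small illustration (toy numbers, not the assignment data) of how $N_k$ and $\pi$ follow from a $responsibilities$ matrix, matching the definitions above.
In [ ]:
import numpy as np

# A toy responsibilities matrix: N = 3 points, K = 2 Gaussians, each row sums to 1.
responsibilities_toy = np.array([[0.9, 0.1],
                                 [0.2, 0.8],
                                 [0.5, 0.5]])
N_toy = responsibilities_toy.shape[0]

N_k = responsibilities_toy.sum(axis=0)  # one N_k per Gaussian: N_k = sum_i r_ik
pi_toy = N_k / N_toy                    # pi_k = N_k / N
print(N_k, pi_toy)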
TASK 1: Complete the below function to initialise your parameters. You will need to generate:
1. A list $\mu$ of $K$ mean vectors, each of which is in $\mathbb{R}^{m}$.
2. A list $\Sigma$ of $K$ covariance matrices in $\mathbb{R}^{m \times m}$. Remember, covariance matrices must be symmetric positive semidefinite.
3. A list $\pi$ of $K$ probabilities $\pi_k$. They should be initialised at $\frac{1}{K}$ (uniformly distributed at first).
Do not hard code parameters. You should generate them with some randomness, and your code must work for any $m$, $K$ and $N$.
You need to intelligently select $\mu_k$, just like you did with k-means. Poorly initialised parameters may result in an entirely broken EM algorithm.
You only need the first three dimensions of the data in Tasks 1 through 5. The fourth dimension is for Task 6 only.
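As an illustration of the "symmetric positive semidefinite" requirement (one possible approach, not necessarily the initialisation you must use), the product $A A^\top$ of any random matrix $A$ is always symmetric positive semidefinite:
In [ ]:
import numpy as np

m = 3
A = np.random.randn(m, m)
random_cov = A @ A.T  # A A^T is symmetric positive semidefinite by construction

print(np.allclose(random_cov, random_cov.T))              # True: symmetric
print(np.all(np.linalg.eigvalsh(random_cov) >= -1e-10))   # True: non-negative eigenvalues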
In [ ]:
def initialise_parameters(X, K):
    m = 3
    N = X.shape[0]
    sigma = np.zeros((K, m, m))
    mu = np.zeros((K, m))
    pi = np.zeros(K)
    # YOUR CODE HERE
    return sigma, mu, pi

K = 4
sigma, mu, pi = initialise_parameters(X, K)
print('\nSigma: \n', sigma)
print('\nMu: \n', mu)
print('\nPi: \n', pi)
$E\_step$ computes the matrix $responsibilities \in [0, 1]^{N \times K}$, where $N$ is the number of data points and $K$ is the number of Gaussians you're attempting to cluster the data with. Each Gaussian is associated with a column of $responsibilities$. Each row corresponds to a data point $x_i$, and the entry in column $k$ of that row contains the posterior probability that $x_i$ came from Gaussian $k$, signifying the extent to which $x_i$ has been assigned to the Gaussian associated with that column.
TASK 2: Implement $E\_step(\pi, \mu, \Sigma, X) = responsibilities$, which updates the $responsibilities$ matrix.
Remember, which Gaussian a datapoint is assigned to depends on the probability of that datapoint under that Gaussian.
You cannot use euclidean distance like you did with the k-means algorithm in Assignment 2.
HINT:
• Each element of responsibilities $r_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$, where $x_i$ is the $i^{th}$ row of $X$.
• Use $multivariate\_normal.pdf(x, \mu_k, \Sigma_k)$ from scipy.stats.
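As a reference for the shape of this computation, here is a minimal sketch of an E-step (it assumes scipy.stats is available and that $\pi$, $\mu$, $\Sigma$ are stored as arrays as in the skeleton below; treat it as one possible approach, not the prescribed solution).
In [ ]:
from scipy.stats import multivariate_normal

def E_step_sketch(pi, mu, sigma, X):
    # Assumes X has the same number of columns as each mu[k] (here, the first three dimensions).
    N, K = X.shape[0], pi.shape[0]
    r = np.zeros((N, K))
    for k in range(K):
        # Numerator of r_ik: pi_k * N(x_i | mu_k, Sigma_k), evaluated for every row of X at once.
        r[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
    # Normalise each row so it sums to 1 (the denominator in the hint above).
    r /= r.sum(axis=1, keepdims=True)
    return r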
In [ ]:
def E_step(pi, mu, sigma, X):
    N = X.shape[0]
    K = pi.shape[0]   # number of Gaussians, taken from the length of pi
    m = 3
    r = np.zeros((N, K))
    # YOUR CODE HERE
    return r

responsibilities = E_step(pi, mu, sigma, X)
print(responsibilities)
TASK 3: Implement $M\_step(responsibilities, X) = \mu, \Sigma, \pi$, which returns the updated means and covariances for all $K$ Gaussians, along with the priors $\pi$.
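For reference, the standard M-step updates for a Gaussian mixture (with $N_k = \sum_i r_{ik}$ as defined earlier) are:
$$\mu_k = \frac{1}{N_k}\sum_{i} r_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_{i} r_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}.$$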
In [ ]:
def M_step(r, X):
    K = r.shape[1]
    N = X.shape[0]
    m = 3
    mu = np.zeros((K, m))
    sigma = np.zeros((K, m, m))
    pi = np.zeros(K)
    # YOUR CODE HERE
    return mu, sigma, pi

mu, sigma, pi = M_step(responsibilities, X)
print('\nSigma: \n', sigma)
print('\nMu: \n', mu)
print('\nPi: \n', pi)
TASK 4: Implement $classify(\pi, \mu, \Sigma, x) = k$ which takes an unknown example $x \in \mathbb{R}^{m}$, the means $\mu$ and covariance matrices $\Sigma$ and the priors $\pi$, and returns $k$, the number of the gaussian which maximises the probability of $x$.
In other words, return ${\arg\max}_k \left(p(x \mid \mu_k, \Sigma_k)\,\pi_k\right)$.
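A minimal sketch of one way to implement this (assuming scipy.stats, and treating it as a reference rather than the required code):
In [ ]:
from scipy.stats import multivariate_normal

def classify_sketch(pi, mu, sigma, x):
    K = pi.shape[0]
    # Score each Gaussian by p(x | mu_k, Sigma_k) * pi_k and return the index of the best one.
    scores = [pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=sigma[k]) for k in range(K)]
    return int(np.argmax(scores))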
In [ ]:
def classify(pi, mu, sigma, x):
    # YOUR CODE HERE
    return -1

print(classify(pi, mu, sigma, X[270, :3]))
TASK 5: Implement $EM(X, K, iterations) = \mu, \Sigma, \pi$ which:
1. takes a dataset $X \in \mathbb{R}^{N \times m}$, $K$, an integer indicating how many Gaussians will be used to cluster the data, and $iterations$, the number of iterations to be performed.
2. uses all of the functions you completed above to initialise parameters and find the optimal means $\mu$, covariances $\Sigma$ and priors $\pi$ to cluster the data points (a Gaussian mixture model).
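The overall structure is just an initialisation followed by alternating E and M steps. A minimal sketch (assuming your Task 1–3 functions behave as specified, and dropping the fourth, categorical column as instructed) could look like this:
In [ ]:
def EM_sketch(X, K, iterations):
    X3 = X[:, :3]                                  # Tasks 1-5 use only the first three dimensions
    sigma, mu, pi = initialise_parameters(X3, K)   # Task 1
    for _ in range(iterations):
        r = E_step(pi, mu, sigma, X3)              # Task 2: update responsibilities
        mu, sigma, pi = M_step(r, X3)              # Task 3: update parameters
    return mu, sigma, pi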
In [ ]:
def EM(X, K, iterations):
    # YOUR CODE HERE
    m = 3
    mu = np.zeros((K, m))
    sigma = np.zeros((K, m, m))
    pi = np.zeros(K)
    return mu, sigma, pi

#Test code. Leave it aloooooone!
iterations = 30
K = 4
mu, sigma, pi = EM(X, K, iterations)
print('\nSigma: \n', sigma)
print('\nMu: \n', mu)
print('\nPi: \n', pi)
def allocator(pi, mu, sigma, X, k):
    N = X.shape[0]
    cluster = []
    for ix in range(N):
        prospective_k = classify(pi, mu, sigma, X[ix, :])
        if prospective_k == k:
            cluster.append(X[ix, :])
    return np.asarray(cluster)

colours = ['r', 'g', 'b', 'y']
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')
for k in range(K):
    cluster = allocator(pi, mu, sigma, X[:, :3], k)
    ax.scatter(cluster[:,0], cluster[:,1], cluster[:, 2], c=colours[k])
plt.show()
Now we start to use the fourth dimension, which indicates the category of each data point.
The possible values for the fourth dimension are 1, 2, 3, 4, 5.
Task 6: Compute the Conditional Distribution. Use the function from Task 4, and the $\mu, \pi, \Sigma$ from Task 5.
Let $K$ stand for the classification result of the first three dimensions, and $Y$ stand for the value of the fourth dimension.
Possible $k$ values are 0, 1, 2, 3. Possible $y$ values are 1, 2, 3, 4, 5.
Compute the conditional distribution $p(Y = y \mid K = k)$.
Hint: Use Bayes' theorem.
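A minimal sketch of one possible approach (an empirical estimate consistent with the hint: among the points classified as $k$, count the fraction whose fourth dimension equals $y$):
In [ ]:
def conditional_distribution_sketch(X, y, k, mu, pi, sigma):
    # Classify every point using only its first three dimensions (Task 4's function).
    labels = np.array([classify(pi, mu, sigma, X[i, :3]) for i in range(X.shape[0])])
    in_cluster = (labels == k)
    if in_cluster.sum() == 0:
        return 0.0
    # p(Y = y | K = k) estimated as the fraction of cluster k whose fourth dimension equals y.
    return np.mean(X[in_cluster, 3] == y)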
In [ ]:
def conditional_distribution(X, y, k, mu, pi, sigma):
    # YOUR CODE HERE
    pass

conditional_distribution(X, 1, 0, mu, pi, sigma)
In [ ]: