Inf2B Coursework 2
Submission due: 4pm, Friday 5th April 2019 Hiroshi Shimodaira and JinHong Lu
1 Outline
(Ver. 1.0)
The coursework consists of two tasks, Task 1 – PCA and clustering, Task 2 – classification with K-NN and Bayes classification with Gaussian distributions, in which we use a data set of handwritten characters (digits).
You are required to submit (i) two reports, one for each Task, (ii) code, and (iii) results of exper- iments if specified, using the electronic submission command. Details are given in the corresponding task sections below. Some of the code and results of experiments submitted will be checked with an automated marking system in the DICE computing environment, so that it is essential that you follow the syntax of function and file format specified. No marks will be given if it does not meet the specifications. Efficiency of code and programming style (e.g. comments, indentation, and variable names) count. Those pieces of code that do not run or that do not finish in approximately five minutes on a standard DICE machine will not be marked. (This excludes Task 1.5.) This coursework is out of 100 marks and forms 12.5% of your final Inf2b grade.
This coursework is individual coursework – group work is forbidden. You should work alone to complete the coursework. You are not allowed to show any written materials, the data provided to you, results of your experiments, or code to anyone else. This includes posting your coursework to the internet and making it accessible to other people not only during the coursework period, but also after that. Never copy-and-paste material of other people (including those available on the internet) into your coursework and edit it. You can, however, use the code provided in the lecture notes, slides, and labs of this course, excluding some functions described later. High-level discussion that is not directly related to this coursework is fine.
Please note that assessed work is subject to University regulations on academic misconduct:
http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct
For late coursework and extension requests, see the page: http://web.inf.ed.ac.uk/infweb/student-services/ ito/admin/coursework-projects/late-coursework-extension-requests
Note that any extension request must be made to the ITO, and not to the lecturer.
Programming: WritecodeinMatlab(R2018a)/OctaveorPython(version2.7)+Numpy+Scipy.Your code should run on standard DICE machines without the need of any additional software. There are some functions that you should write the code by yourself rather than using those of standard libraries available. See section 4 for details.
This document assumes programming in Matlab. For Python, put all the specified functions into a single file for each Task, so that task1.py for Task 1, and task2.py for Task 2. Output data should be stored in Matlab’s MAT format.
2 Data
The coursework employs the MNIST handwritten digits data set http://yann.lecun.com/exdb/ mnist. Each character image is represented as 28-by-28 pixels in gray scale, being stored as a row vector of 784 elements (28 × 28 = 784). A subset of the original MNIST data set is considered in this coursework.
You data set is stored in your coursework-data directory (denoted as YourDataDir hereafter) : /afs/inf.ed.ac.uk/group/teaching/inf2b/cwk2/d/UUN/
where UUN denotes your UUN (DICE login name). In the directory, you will find the following four files:
1
3 Task specifications
2
File name
‘trn-images-idx3-ubyte’ ‘trn-labels-idx1-ubyte’ ‘tst-images-idx3-ubyte’ ‘tst-labels-idx1-ubyte’
Matlab variable (Class) Xtrn[Ntrn,784] (uint8) Ytrn[Ntrn,1] (uint8) Xtst[Ntst,784] (uint8) Ytst[Ntst,1] (uint8)
Description
training samples
class labels of training samples test samples
class labels of test samples
where Ntrn and Ntst denote the total number of training samples and test samples, respectively, which are smaller than the original ones. The file formats are the same as the original ones.
A sample code to load the data files will be provided in the following directory.
/afs/inf.ed.ac.uk/group/teaching/inf2b/cwk2/matlab codes/
There are ten classes, C1,…,C10, (C1 corresponds to digit ’0’, C10 to ’9’), and each class has about 4200 training samples and 300 test samples on the average, respectively, but exact number of samples may be different among different classes.
Each pixel value is represented as an unsigned byte integer (uint8) with the range in [0, 255]. Note that, after you load the data in your program, you should at first convert the image data to floating point (double) numbers. Additionally, you should divide the numbers by 255.0 so that the maximum value is less than or equal to 1.0.
A class label is represented as an 8-bit unsigned integer (uint8) number between 0 and 9, where 0 denotes ’0’ and 9 ’9’, respectively.
3 Task specifications
Task1 – PCA and Clustering [50 marks] Task 1.1
(a)
[5 marks]
(b)
Task 1.2
(a)
(b)
Write a Matlab function that displays the images of the first ten samples for each class using montage() function, so that each figure shows the ten samples for class Ck, where k = 1, . . . , 10. Save your code as ‘task1 1.m’. Note that file names and function names are case sensitive. The syntax of the function should be as follows.
function task1 1(X, Y)
where
X M-by-D data matrix (of floating-point numbers in double-precision format, which is the default in Matlab), where M is the number of samples, and D is the the number of elements in a sample. Note that each sample is represented as a row vector rather than a column vector.
Y M-by-1 label vector (of uint8) for X. Y(i) is the class number of X(i,:).
Run task1 1(Xtrn, Ytrn), and save the image (hardcopy) for each class as
‘task1 1 imgs classK.pdf’, where K = 1, . . . , 10. (e.g. task1 1 imgs class3.pdf is a hard- copy for C3 in PDF format.) In your report, show the ten hardcopies.
[5 marks]
Write a Matlab function that calculates mean vector (i.e. average pattern) for each class, and display the all average patterns in a single graph using montage() function. Save your function as ‘task1 2.m’. The syntax of the function should be as follows.
function M = task1 2(X, Y)
where X and Y are the same as in Task 1.1.
M K-by-D mean vector matrix (double), where K is the number of
classes, and D is the same as in Task 1.1.
Run M = task1 2(Xtrn,Ytrn);, save M as ‘task1 2 M.mat’, save the hardcopy of the figure drawn as ‘task1 2 imgs.pdf’. In your report, show the hardcopy.
3 Task specifications 3
Task 1.3
(a)
[7 marks] Write a Matlab function that computes the principal components of a data set, and save it
as ‘comp pca.m’. The syntax of the function is as follows: function [EVecs, EVals] = comp pca(X)
X is the same format as in Task 1.1. EVecs are the eigenvectors (stored in a D-by-D matrix EVecs), and EVals are the corresponding eigen values {λi}Di=1 (stored in a D-by-1 vector ). The eigenvalues should be sorted in descending order, so that λ1 is the largest and λD is the smallest, and i’th column of EVecs should hold the eigenvector that corresponds to λi. Eigenvectors are not unique by definition in terms of scale (length) and sign, but we make them unique in this coursework by putting the following additional constraints, which your comp pca() should satisfy.
• The first element of each eigenvector is non-negative. If it is not the case, i.e. if the first element is negative, multiply -1 to the eigenvector (i.e. v ← −v) so that it gets the opposite direction.
• Each eigenvector is a unit vector, i.e. ∥v∥ = 1, where v denotes an eigenvector. As far as you use Matlab’s eig() or Python’s numpy.linalg.eig(), you do not need to care about this, since either function ensures unit vectors.
Write a Matlab function task1 3() that
• carries out PCA on the given data,
• plots cumulative variance,
• finds the minimum number of PCA dimensions to cover 70%, 80%, 90%, 95% of the
total variance (in task1 3()),
and save it as ‘task1 3.m’. The syntax of the function should be as follows.
function [EVecs, EVals, CumVar, MinDims] = task1 3(X) Where EVecs EVals the same as in part (a),
Cumvar D-by-1 vector (double) of cumulative variance.
MinDims 4-by-1 vector (int32) showing the minimum number of PCA di-
mensions shown above.
Run [EVecs, EVals, CumVar, MinDims] = task1 3(Xtrn); save the graph as
‘task1 3 graph.phd’, and save EVecs, Evals, CumVar, MinDims as ‘task1 4 evcs.mat’, ‘task1 4 evlas.mat’, ‘task1 4 cumvar.mat’, and ‘task1 4 mindims.mat’ respectively.
[5 marks]
Write a Matlab function that displays the images of the first ten principal axes of PCA using montage().
function task1 4(EVecs)
where EVecs is the same format as in Task 1.3.
Run task1 4(EVecs) and save the image as ‘task1 4 imgs.pdf’, where EVecs is the eigen- vector matrix obtained in Task 1.3.
[6 marks]
Write a Matlab function that carries out the k-means clustering, and save it as my kMeansClustering.m. The syntax of the function should be as follows.
function [C, idx, SSE] = my kMeansClustering(X, k, initialCentres, maxIter)
where X is the same format as in Task 1.1.
(b)
(c)
Task 1.4
(a)
(b)
Task 1.5
(a)
3 Task specifications
4
(b)
(c)
Task 1.6
(a)
(b)
Task 1.7
(a)
Write a Matlab function that calls the k-mean clustering function for each k in a vector Ks, using the first k samples in the data set X as the initial cluster centres, and saves the returned C, idx, and SSE as ‘task1 5 c k.mat’, ‘task1 5 idx k.mat’, and ‘task1 5 sse k.mat’, re- spectively. Save your code as ‘task1 5.m’. The syntax of the the function should be as follows.
function task1 5(X, Ks)
where X is the same as in Task 1.1, Ks is a L-by-1 vector of the numbers of nearest neigh-
bours.
Run task1 5(Xtrn, Ks) with Ks = [1,2,3,4,5,7,10, 15, 20]. In your report, shows a graph
of SSE vs k, and shows the time taken for each k.
Write a Matlab function that displays the image of each cluster centre using montage() function. The syntax of the function should be as follows.
function task1 6(MAT ClusterCentres)
where MAT ClusterCentre is a MAT file that contains C, the cluster centre matrix shown
in Task 1.5.
Run task1 6(’task1 5 c k.mat’) for k in Ks, and save each image as ‘task1 6 k imgs.pdf’.
[7 marks]
Write a Matlab function that visualises cluster regions on the 2D-PCA plane, where the position of the plane is specified by a point vector. The plotting range should be m ± 5σ, where m is the mean and σ is the standard deviation of the data on the corresponding PCA axis. Save your code as ‘task1 7.m’. The syntax of the function should be as follows.
function DMAP = task1 7(MAT ClusterCentres, MAT evecs, MAT evals, MAT pos, nbins)
where the first three arguments are MAT files of cluster centres, eigenvectors, and eigenval- ues. MAT pos is a D-by-1 vector (double) to specify the position of the 2D-PCA plane. We assume that nbins by nbins square grid points are used to plot the cluster regions. The output, DMAP is a nbin-by-nbin matrix (uint8), each element represents the cluster number that the point belongs to.
run Dmap = task1 7(’task1 5 c k.mat’, ’task1 4 evecs.mat’, ’task1 4 evals.mat’, ’task1 2 M.mat’, 200); for k = 1, 2, 3, 5, 10, save Dmap as ‘task1 7 dmap k.mat’, and save the figure of cluster regions as ‘task1 7 imgs k.pdf’.
In your report, describe the method implemented for visualising the regions.
k initialCentres
maxIter C
idx
SSE
The target number of clusters
Initial cluster centres in the k-by-D matrix, where each row represents a cluster centre.
Maximum number of iterations (optional)
Cluster centres in the k-by-D matrix.
Cluster indices in the N -by-1 vector, where idx[i] indicates the cluster number (starting from 1) of i-th data point. The sum squared error (the one shown in slide 21 for Lec- ture 3) for each iteration in the (L+1)-by-1 vector, where the first element is the SSE for the initial cluster centres, and L is the number of iterations done.
(b)
(c)
Task 1.8
and findings in your report, and save your code as ‘task1 8.m’.
[10 marks] This is a mini research project, in which you are asked to investigate the k-means clustering in terms of initial cluster centres, i.e. how different initial cluster centres result in different cluster centres, for which employ SSE to measure the clustering performance. Report your experiments
3 Task specifications 5
Task 2 – Classification [50 marks]
In the following classification experiments, assume a uniform prior distribution over class, and use the maximum likelihood estimation (MLE) to estimate model parameters.
Task 2.1
(a)
(b)
[7 marks]
Write a Matlab function for the classification with k-NN, save it as ‘run knn classifier.m’. The syntax of the function should be as follows.
function [Ypreds] = run knn classifier(Xtrn, Ytrn, Xtst, Ks)
Write a Matlab function that creates a confusion matrix, and save the code as ‘comp confmat.m’.
The syntax of the function should be as follows.
function [CM, acc] = comp confmat(Ytrues, Ypreds)
where
Ytrues N-by-1 vector of ground truth (target) class labels
Ypreds N-by-1 vector of predicted class labels
CM K-by-K confusion matrix, where CM(i,j) is the number of samples
whose target is the i’th class that was classified as j. K is the
number of classes, which is 26 for the data set.
acc A scalar variable representing the accuracy (i.e. correct classifica-
tion rate) with the range in [0, 1].
Write the Matlab function, and save it as ‘task2 1.m’. The syntax of the function is
function task2 1(Xtrain, Ytrain, Xtest, Ytest, Ks)
The specifications of the script are as follows.
• Runs a classification experiment on the data set, calling the classification function (run knn classifier).
• Measures the user time taken for the classification experiment, and display the time (in seconds) to the standard output (i.e. display).
• Saves the confusion matrix for each k to a matrix variable cm, and save it with the file name ‘task2 1 cmk.mat’, where k denotes the number of nearest neighbours (i.e. k) specified in Ks.
(c)
(d)
Task 2.2
(a)
(b)
Task 2.3
(a)
Run task2 1(Xtrn, Ytrn, Xtst, Ytst, Ks) with Ks = [1,3,5,10,20]’. In your re- port, report the information shown on the display.
[6 marks]
Write a Matlab function that visualise decision regions of k-NN on a 2D PCA plane, where the position of the plane is specified by a point vector. Use the same specifications for plotting as in Task 1.7.
function Dmap = task2 2(X, k, MAT evecs, MAT evals, MAT pos, nbins)
Run Dmap = task2 2(Xtrn, k, ’task1 4 evecs.mat’, ’task1 4 evals.mat’, ’task1 2 M.mat’, 200); for k = 1, 3, save Dmap as ‘task2 2 dmap k.map’ and save the figure of decision regions
as ‘task2 2 imgs k.pdf’.
[7 marks]
Write a Matlab function that transforms data to the ones in 2D PCA space and plots a contour of Gaussian distribution for each class. Save your code as ‘task2 3.m’. The syntax of the function should be as follows.
• Displays the k
N Nerrs acc
following information (to the standard output). The number of nearest neighbours
The number of test samples
The number of wrongly classified test samples Accuracy (i.e. correct classification rate)
3 Task specifications 6
(b)
Task 2.4
(a)
(b)
Task 2.5
(a)
function task2 3(X, Y)
The specifications of the function are as follows.
• Transform X to the data in the 2D space spanned by the first two principal components.
• Estimate the parameters (mean vector and covariance matrix) of Gaussian distribution
for each class in the 2D space.
• On a single graph, plot a contour of the distribution for each class using plot() function.
Do not use functions for plotting contours such as contour(). The lengths of longest / shortest axis of an ellipse should be proportional to the standard deviation for the axis. (Please note that that contours of different distributions plotted by this method do not necessary give the set of points of the same pdf value.)
Run task2 3(Xtrn, Ytrn) and save the figure in ‘task2 3 img.pdf’
[5 marks]
Write a Matlab function that calculates correlation r12 on 2D-PCA for each class and for all the classes (i.e. whole data)
function [Corrs] = task2 4(Xtrain, Ytrain)
where Output is a 11-by-1 vector (double) – Output(k) holds the correlation for class k
(k = 1, . . . , 10), and Output(11) holds the correlation for the whole data. Run task2 4(Xtrn, Ytrn) and save Corrs as ‘task2 4 corrs.mat’
[6 marks]
Write a Matlab function for the classification with a single Gaussian distribution per class, and save the code as ‘run gaussian classifiers.m’. The syntax of the function should be as follows.
[Ypreds, Ms, Covs] = run gaussian classifiers(Xtrain, Ytran, Xtest, epsilon)
where Xtrn, Ytrn, Xtst, and Ypreds are the same as those in Task1. epsilon is a scalar (double) for the regularisation of covariance matrix described in Lecture 8, in which we add a small positive number (ε) to the diagonal elements of covariance matrix, i.e. Σ ← Σ+εI, where I is the identity matrix.
Ms K-by-D matrix of mean vectors, where Ms(k,:) is the sample mean vector for class k.
Covs K-by-D-by-D 3D array of covariance matrices, where Cov(k,:,:) is the covariance matrix (after the regularisation) for class k.
Write a Matlab function for a classification experiment, and save the code as ‘task2 5.m’. The syntax of the function is as follows.
function task2 5(Xtrain, Ytrain, Xtest, Ytest, epsilon)
The specifications of the function are as follows.
• Calls the classification function with epsilon=0.01.
• Measures the user time taken for the classification experiment, and display the time
(in seconds) to the standard output.
• Obtains the confusion matrix, stores it to a matrix variable cm, and saves it with the
file name ‘task2 5 cm.mat’.
• Saves the mean vector and covariance matrix for Class 10, i.e, Ms(10,:) and Covs(10,:,:),
to files with the file names ‘task2 5 m10.mat’ and ‘task2 5 cov10.mat’, respectively.
• Displays the following information (to the standard output).
N The number of test samples
Nerrs The number of wrongly classified test samples acc Accuracy (i.e. correct classification rate)
Run task2 5(Xtrn, Ytrn, Xtst, Ytst, 0.01);. In your report, report the information you got.
(b)
(c)
3 Task specifications 7
Task 2.6
(a)
(b)
Task 2.7
for Gaussian classifiers.
[6 marks]
Write a Matlab function that visualises decision regions of the Gaussian classifiers on a 2D-PCA plane.
function DMAP = task2 6(X, Y, epsilon, MAT evecs, MAT evals, MAT pos, nbins) where MAT evecs, MAT evals, MAT pos, and nbins are the same as in Task 1.7.
Run Dmap = task2 6(Xtrn, Ytrn, 0.01, ’task1 4 evecs.mat’, ’task1 4 evals.mat’, ’task1 2 M.mat’, 200);, save Dmap as ‘task2 6 dmap.mat’, and save the plotting in ‘task2 6 img.pdf’.
[5 marks] This task aims to investigate the effect of amount of training data on classification performance
(a)
(b)
Task 2.8
In this task, you
Write a Matlab function that run an experiment using a subset of training data, and save your code as ‘task2 7.m’. The syntax of the function is as follows.
function [CM, acc] = task2 7(Xtrain, Ytrain, Xtest, Ytest, epsilon, ratio) where ration specifies the ratio of training data to use. If it is 0.9, use the first 90% of
samples in Xtrain.
Run task2 7(Xtrn, Ytrn, Xtst, Ytst, 0.01, ratio) for ration = 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, and save CM as ‘task2 7 cm R.mat’, where R = * 100, so that ‘task2 7 cm 90.mat’ for ration=0.9. In report, report the result.
[8 marks]
(a) Write a Matlab function that applies k-means clustering to each class to obtain multiple Gaussian classifier per class, and carries out classification. Save the function as ‘run mgc.m’ The syntax of the function is as follows.
function [Ypreds, MMs, MCovs] = run mgcs(Xtrain, Ytrain, Xtest, epsilon, L)
where L denotes the number of clusters per class. Ms L*K-by-D matrix of mean vectors.
Covs (L*K)-by-D-by-D 3D array of covariance matrices.
(b) Write a Matlab function that runs an experiment, and save the code as ‘task2 8.m’. The
syntax of the function is as follows.
function task2 8(Xtrain, Ytrain, Xtest, Ytest, epsilon, L)
The specifications of the function are as follows.
• Calls the classification function with epsilon=0.01.
• Measures the user time taken for the classification experiment, and display the time
(in seconds) to the standard output.
• Obtains the confusion matrix, stores it to a matrix variable cm, and saves it with the
file name ‘task2 8 cm L.mat’.
• Saves the mean vector and covariance matrix for Class 1, i.e, Ms(1:L,:) and Covs(1:L,:,:),
to files with the file names ‘task2 8 gL m1.mat’ and ‘task2 8 gL cov1.mat’, respec-
tively.
• Displays the following information (to the standard output).
N The number of test samples
Nerrs The number of wrongly classified test samples acc Accuracy (i.e. correct classification rate)
(c) Run run mgcs(Xtrn, Ytrn, Xtst, Ytst, 0.01, L); for L=2,3. In your report, report the information you obtained.
4 Functions that are not allowed to use 8
4 Functions that are not allowed to use
Since one of the objectives of this coursework is to understand and implement basic algorithms for machine learning, you are not allowed to use those functions in standard libraries listed below. You should write the code by yourself using the basic operations of arithmetic for scalars, vectors, and matrices. If it is the case, use a different function name from the original one in standard libraries
(e.g. MyCov() for cov() as shown in the table below). You may, purposes, i.e. to check your code.
however, use them for comparison
Suggested name to implement MySqDist()
MyMean()
MyCov()
run knn classifier()
my kMeansClustering() comp confmat()
Description of function
Pairwise (squared) Euclidean distance Compute the mean
Compute the covariance matrix Compute Gaussian probability densities K-NN classification
K-means clustering
Compute confusion matrix
Other utilities for classification
You may use those functions or operations:
Description
Sum function Cumulative sum Square root function Exponential function Logarithmic function Matrix transpose Matrix inverse Determinant
Log determinant Eigen values/vectors Sort
Sample mode Vectorisation helpers
Typical names pdist2() mean()
cov() mvnpdf() fitcknn() kmeans() confusion()
e, exp()
log(), ln() transpose(), ‘
inv()
det()
logdet()
eig()
sort()
mode()
bsxfun(), arrayfun()
Typical names sum() cumsum() sqrt()
· · ·
You should submit your work electronically via the DICE submit command by the deadline. No submission of printed document is required.
Since marking for each task will be done separately, you should prepare separate reports for the two tasks, and save your report files in PDF format and name them ‘report task1.pdf’ and ‘report task2.pdf’. Remember to place your student number and the task name prominently at the top of each report. Do not indicate your name anywhere. Your report should be concise and brief for each task.
Create a directory named LearnCW, copy all of the requested files to the the directory, but do NOT put the data set files in it.
A checklist will be available from the coursework web page. Submit your coursework from a DICE machine using:
submit inf2b cw2 LearnCW
(NB: the list is not exhaustive)
5 Submission
available in Inf2b cwk2 directory