Inf2B Coursework (Ver. 0.9)
Submission due: 4pm, Friday 3rd April 2020
Hiroshi Shimodaira
1 Outline
The coursework consists of two tasks, Task 1 – data analysis and classification with multivariate Gaussian classifiers, Task 2 – neural networks.
You are required to submit (i) two reports, one for each Task, (ii) code, and (iii) results of experiments if specified, using the electronic submission command. Details are given in the corresponding task sections below. Some of the code and results of experiments submitted will be checked with an automated marking system in the DICE computing environment, so it is essential that you follow the specified function syntax and file formats. No marks will be given for submissions that do not meet the specifications. Some helper tools to check your files and function template files will be provided. Please check the following coursework web-page frequently to see any updates.
https://www.inf.ed.ac.uk/teaching/courses/inf2b/coursework/cwk.html
Efficiency of code and programming style (e.g. comments, indentation, and variable names) count. Those pieces of code that do not run or that do not finish in approximately five minutes on a standard DICE machine will not be marked. This coursework is out of 100 marks and forms 25% of your final Inf2b grade.
This coursework is individual coursework – group work is forbidden. You should work alone to complete the coursework. You are not allowed to show any written materials, the data provided to you, results of your experiments, or code to anyone else. This includes posting your coursework to the internet and making it accessible to other people not only during the coursework period, but also after that. Never copy-and-paste material of other people (including those available on the internet) into your coursework and edit it. You can, however, use the code provided in the lecture notes, slides, and labs of this course, excluding some functions described later. High-level discussion that is not directly related to this coursework is fine.
Please note that assessed work is subject to University regulations on academic misconduct:
http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct
For late coursework and extension requests, see the page:
http://web.inf.ed.ac.uk/infweb/student-services/ito/admin/coursework-projects/late-coursework-extension-requests
Note that any extension request must be made to the ITO, and not to the lecturer.
Programming: Write code in Matlab (R2018a)/Octave or Python (version 3.6) + Numpy + Scipy + Matplotlib. Your code should run on standard DICE machines without the need for any additional software. There are some functions that you should implement yourself rather than using those available in standard libraries. See section 4 for details.
This document assumes programming in Matlab. For Python, put all the specified functions into a single file for each Task, i.e. task1.py for Task 1 and task2.py for Task 2. Output data should be stored in Matlab’s MAT binary format.
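For Python users, one possible way to write and read MAT files is SciPy's savemat() and loadmat(). The following is only a minimal sketch: the helper names save_mat() and load_mat() and the file name 'example_S.mat' are hypothetical, and it assumes that the variable name stored inside the file should match the Matlab variable name.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat


def save_mat(path, name, value):
    """Store one array in a MAT file under the given variable name."""
    savemat(path, {name: np.asarray(value)})


def load_mat(path, name):
    """Read one variable back from a MAT file as a NumPy array."""
    return loadmat(path)[name]


# Round trip through a temporary file (hypothetical file name).
demo_path = os.path.join(tempfile.mkdtemp(), "example_S.mat")
demo_S = np.array([[2.0, 0.5], [0.5, 1.0]])
save_mat(demo_path, "S", demo_S)
restored_S = load_mat(demo_path, "S")
```

Note that savemat() stores one-dimensional arrays as row vectors by default, so a vector that is specified as N-by-1 would need to be reshaped to shape (N, 1) before saving.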
2 Data
2.1 Data for Task 1
The coursework employs the Anuran Calls (MFCCs) Data Set introduced by J. Colonna et al.¹
Your data set file, ‘dset.mat’, which is a subset of the original data set, should be found in your coursework data directory (denoted as YourDataDir hereafter):
/afs/inf.ed.ac.uk/group/teaching/inf2b/cwk/d/UUN/
where UUN denotes your UUN (DICE login name).
¹ https://doi.org/10.1007/978-3-319-46307-0_13
You can use Matlab’s load() function to load the data set in the following manner:
load(pathname);
where pathname denotes the absolute pathname of your data set file. Once you load the data set, you will find the following variables.
Matlab variable (Class)       Description
X[N,D] (double)               feature vectors
Y_family[N,1] (int32)         family class labels
Y_genus[N,1] (int32)          genus class labels
Y_species[N,1] (int32)        species class labels
list_family[4,1] (cell)       family class names
list_genus[8,1] (cell)        genus class names
list_species[10,1] (cell)     species class names
where N and D denote the total number of samples and the dimension of the feature vector (D=24), respectively.
Among the three different levels of taxonomic rank provided in the original data set, we use ‘species’ in the coursework. There are ten different species, so that the number of classes for classification is ten, i.e., C = 10. The variable, Y_species(i), contains the integer number that corresponds to the species of the i-th sample, whose feature vector is X(i,:). Hereafter, Y denotes Y_species.
The variable, list_species, holds the list of species names.
The following table shows the number of samples for each species in the original data set, which may be different from the number of samples in your data set.
Family           Genus          Species                 # of samples
Leptodactylidae  Leptodactylus  Leptodactylus fuscus    222
                 Adenomera      Adenomera andreae       496
                                Adenomera hylaedactyla  3049
Hylidae          Dendropsophus  Hyla minuta             229
                 Scinax         Scinax ruber            96
                 Osteocephalus  Osteocephalus oophagus  96
                 Hypsiboas      Hypsiboas cinerascens   429
                                Hypsiboas cordobae      702
Bufonidae        Rhinella       Rhinella granulosa      135
Dendrobatidae    Ameerega       Ameerega trivittata     544
The data set has not been split into two sets for training and testing. You need to split the data set according to the instructions described later.
2.2 Data for Task 2
The data for Task 2 is stored in the plain-text file named ‘task2_data.txt’ in YourDataDir. For details, see the Task 2 specifications.
3 Task specifications
Task 1 – Anuran-Call analysis and classification [50 marks]
Task 1.1 [5 marks] (a) Write a Matlab function task1_1() that
• calculates the covariance matrix, S, and correlation matrix, R, for the whole data set X, using maximum likelihood estimation (MLE),
• saves S as ‘t1_S.mat’,
• saves R as ‘t1_R.mat’.
Save the code as ‘task1_1.m’. Note that, hereafter, function and file names are case sensitive, and your code should save output files in the current working directory. The syntax of the function should be as follows.
function task1_1(X, Y)
where
X N-by-D matrix of feature vectors (of floating-point numbers in double-precision format, which is the default in Matlab), where N is the number of samples, and D is the number of elements in a sample. Note that each sample is represented as a row vector rather than a column vector.
Y N-by-1 label vector (of int32) for X. Y(i) is the class number of X(i,:).
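The MLE covariance differs from the usual unbiased estimate only in that it divides by N rather than N − 1, and the correlation matrix follows by dividing each covariance by the product of the two standard deviations. A minimal Python sketch under those definitions is given below; the function names my_mean(), my_cov() and my_corr() are illustrative only, the data are synthetic, and file saving is omitted.

```python
import numpy as np


def my_mean(X):
    """Column-wise sample mean, written with sum() rather than mean()."""
    return np.sum(X, axis=0) / X.shape[0]


def my_cov(X):
    """MLE covariance: normalise by N, not N - 1."""
    centred = X - my_mean(X)
    return centred.T @ centred / X.shape[0]


def my_corr(S):
    """Correlation matrix R[i, j] = S[i, j] / (sigma_i * sigma_j)."""
    sigma = np.sqrt(np.diag(S))
    return S / np.outer(sigma, sigma)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # synthetic stand-in for the real X
S_demo = my_cov(X_demo)
R_demo = my_corr(S_demo)
```

Library routines such as numpy.cov(..., bias=True) can then be used purely as a cross-check of the hand-written version.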
(b) Run the following:
task1_1(X, Y);
Make sure that the two output files are created properly. It is a good idea to write a script to run the above.
Task 1.2 [5 marks] Look into the correlation matrix, R, you obtained, and describe your findings in your report, using graphs.
Task 1.3 [10 marks] (a) Write a Matlab function task1_3() that
• calculates the eigenvectors, EVecs, and eigenvalues, EVals, of a covariance matrix, and calculates the cumulative variance, Cumvar,
• finds the minimum number of PCA dimensions needed to cover each of 70%, 80%, 90%, and 95% of the total variance, and stores the values in a vector MinDims,
• saves the eigenvectors to a file named ‘t1_EVecs.mat’,
• saves the eigenvalues to a file named ‘t1_EVals.mat’,
• saves the cumulative variances to a file named ‘t1_Cumvar.mat’,
• saves the minimum numbers of dimensions, MinDims, to a file named ‘t1_MinDims.mat’.
Save the function as ‘task1_3.m’.
The syntax of the function should be as follows.
function task1_3(Cov)
where Cov is a D-by-D covariance matrix (double). The specifications of the variables are as follows.
EVecs    D-by-D matrix (in double)
EVals    D-by-1 vector (in double)
Cumvar   D-by-1 vector (in double)
MinDims  4-by-1 vector (in int32)
The eigenvalues should be sorted in descending order, so that λ1 is the largest and λD is the smallest, and the i-th column of EVecs should hold the eigenvector that corresponds to λi. Eigenvectors are not unique by definition in terms of scale (length) and sign, but we make them unique in this coursework by imposing the following additional constraints, which your program should enforce.
• The first element of each eigenvector is non-negative. If this is not the case, i.e. if the first element is negative, multiply the eigenvector by −1 (i.e. v ← −v) so that it points in the opposite direction.
• Each eigenvector is a unit vector, i.e. ∥v∥ = 1, where v denotes an eigenvector. As long as you use Matlab’s eig() or Python’s numpy.linalg.eig(), you do not need to care about this, since either function ensures unit vectors.
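The sorting, sign convention, and minimum-dimension search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses numpy.linalg.eigh() (a variant of eig() for symmetric matrices) rather than the eig() mentioned in the text, the name pca_summary() is hypothetical, the covariance matrix is a toy example, and file saving is omitted.

```python
import numpy as np


def pca_summary(cov, targets=(0.70, 0.80, 0.90, 0.95)):
    """Return EVecs, EVals, Cumvar and MinDims for a covariance matrix."""
    # eigh() is meant for symmetric matrices: it returns real eigenvalues
    # in ascending order and unit-length eigenvectors in the columns.
    evals, evecs = np.linalg.eigh(cov)

    # Reorder so that the largest eigenvalue comes first.
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Sign convention: flip any eigenvector whose first element is negative.
    flip = evecs[0, :] < 0
    evecs[:, flip] = -evecs[:, flip]

    cumvar = np.cumsum(evals)
    ratio = cumvar / cumvar[-1]
    # Smallest number of leading components that reaches each target.
    min_dims = np.array(
        [int(np.argmax(ratio >= t)) + 1 for t in targets], dtype=np.int32
    )
    return evecs, evals, cumvar, min_dims


cov_demo = np.diag([1.8, 13.0, 0.8, 4.4])  # toy covariance, not real data
EVecs, EVals, Cumvar, MinDims = pca_summary(cov_demo)
```

The vectors returned here are one-dimensional; they would need reshaping to D-by-1 and 4-by-1 columns to match the specification before being saved.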
(b) Run the following:
task1_3(S);
In your report, show a graph of cumulative variance.
(c) Plot all data on a 2D-PCA plane, clarifying data of different classes, and show the graph in your report.
Task 1.4 [25 marks] (a) Write a Matlab function task1_mgc_cv() that carries out a classification experiment with multivariate Gaussian classifiers, using k-fold cross validation, and save the code as ‘task1_mgc_cv.m’. The syntax of the function is as follows.
function task1_mgc_cv(X, Y, CovKind, epsilon, Kfolds)
where CovKind is the type of covariance matrix – 1 for full covariance matrix, 2 for diagonal covariance matrix, and 3 for shared covariance matrix, epsilon is a scalar (double) for the regularisation of covariance matrix described in Lecture 8, in which we add a small positive number (ε) to the diagonal elements of covariance matrix, i.e. Σ ← Σ + εI, where I is the identity matrix, Kfolds is the number of folds (partitions) in k-fold cross validation. Assume a uniform prior distribution over class, and use MLE for the estimation of model parameters.
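As an illustration of the three covariance types and the regularisation Σ ← Σ + εI, here is a minimal Python sketch. The function name prepare_covs() is hypothetical. Two points are assumptions rather than part of the specification: the shared covariance is taken as the unweighted average of the per-class covariance matrices, and the regularisation is applied after the diagonal or shared matrix has been formed.

```python
import numpy as np


def prepare_covs(class_covs, cov_kind, epsilon):
    """Turn per-class MLE covariances into the matrices used for classification.

    cov_kind: 1 = full, 2 = diagonal, 3 = shared.
    """
    covs = np.array(class_covs, dtype=float)  # shape (C, D, D), copied
    C, D, _ = covs.shape
    if cov_kind == 2:
        # Keep only the variances on the diagonal.
        covs = covs * np.eye(D)
    elif cov_kind == 3:
        # ASSUMPTION: shared covariance = unweighted average over classes.
        shared = np.sum(covs, axis=0) / C
        covs = np.repeat(shared[None, :, :], C, axis=0)
    # Regularisation: Sigma <- Sigma + epsilon * I for every class.
    return covs + epsilon * np.eye(D)
```

For example, prepare_covs([S1, S2], 2, 0.01) would return the two diagonal matrices with 0.01 added to each variance, where S1 and S2 stand for per-class covariance estimates.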
At first, the function should split the data set into Kfolds partitions for cross validation, whose information is stored in an N-by-1 vector, PMap, where PMap(i) holds the partition number that the i-th sample is assigned to, and save it to a file named ‘t1_mgc…’.
For each fold, p, the function should
• estimate the mean vector and covariance matrix for each class from the samples that do not belong to partition p,
• save the mean vectors (Ms) to ‘t1_mgc…Ms.mat’,
• save the covariance matrices (Covs) to ‘t1_mgc…ck…’,
• carry out a classification experiment using the samples of partition p, and save the confusion matrix (CM) to ‘t1_mgc…ck…’,
• calculate the final confusion matrix (where each element is a relative frequency) and save it to ‘t1_mgc…’.
Details of partitioning algorithm for k-fold cross validation and the variables to save will be specified in a separate sheet.
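The classification step within one fold can be sketched in Python as below: each test sample is assigned to the class with the largest Gaussian log-likelihood, which under a uniform prior is equivalent to the largest posterior. This is a sketch only. The function names are illustrative, the data are toy values, class labels are assumed to run from 1 to C, and the convention that rows of the confusion matrix index the true class is an assumption (the separate sheet defines the actual variables). The partitioning itself is not shown.

```python
import numpy as np


def gaussian_log_density(X, mean, cov):
    """Log of the multivariate Gaussian density for every row of X."""
    D = X.shape[1]
    diff = X - mean
    # slogdet() returns (sign, log|cov|) and avoids overflow in det().
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    return -0.5 * (maha + logdet + D * np.log(2.0 * np.pi))


def classify(X, means, covs):
    """Pick the class with the largest log-likelihood (uniform priors)."""
    scores = np.stack(
        [gaussian_log_density(X, m, c) for m, c in zip(means, covs)], axis=1
    )
    return np.argmax(scores, axis=1) + 1  # class labels start at 1


def confusion_matrix(y_true, y_pred, n_classes):
    """CM[i, j] counts samples of true class i+1 predicted as class j+1."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t - 1, p - 1] += 1
    return cm


# Toy example with two well-separated classes.
means_demo = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
covs_demo = [np.eye(2), np.eye(2)]
X_test_demo = np.array([[0.1, -0.2], [9.5, 10.3], [0.0, 1.0]])
pred_demo = classify(X_test_demo, means_demo, covs_demo)
cm_demo = confusion_matrix([1, 2, 2], pred_demo, 2)
accuracy_demo = np.trace(cm_demo) / np.sum(cm_demo)
```

Working with log densities rather than densities avoids numerical underflow when D is as large as 24.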
(b) Run the function with epsilon=0.01 and Kfolds=5 for each CovKind=1,2,3, and report the accuracy (correct classification rate) in your report.
Task 1.5 [5 marks] Using CovKind=1 (i.e. full covariance), investigate how the classification accuracy changes with respect to the regularisation parameter, epsilon. Plot a graph and describe your findings in your report.
Task 2 – Neural networks [50 marks]
In this task, you implement neural networks for binary classification problems, in which the input feature is represented as a two-dimensional vector (x_1, x_2)^T. We assume that decision regions are defined with polygon(s), whose specifications are given in the polygon specification file ‘task2_data.txt’² in YourDataDir. The file is a plain-text file, in which each line specifies the name of the polygon and the coordinates of its vertices {(x_{p1}, x_{p2})}, p = 1,…,P, where P is the number of vertices. The following is an example of the file.
Polygon A: -1 -0.5 6 1.25 6 6.25 1 6
Polygon B: 2.5 3 3.5 3 3.5 3.5 2.5 3.5
² You are not allowed to show this file of yours to anyone else.
where two polygons, Polygon A and Polygon B, are defined. In each line, the first two numbers (e.g. -1 and -0.5 for Polygon A) from the left specify the coordinates (x_{11}, x_{12}) of the first vertex, followed by the coordinates (x_{21}, x_{22}) of the second vertex, and so on. You will see that each polygon has four vertices, meaning a quadrangle in this case.
Task 2.1 [3 marks] Consider a single neuron with a unit function, whose output is defined as y(x) = h(w^T x), where h(a) is a step function such that h(a) = 1 if a > 0, and h(a) = 0 otherwise³. Implement this neuron as a Matlab function:
function [Y] = task2_hNeuron(W, X)
where X is an N-by-D data matrix (double), W is a (D+1)-by-1 weight matrix (double), and Y is an N-by-1 output vector (double). Save the function as ‘task2_hNeuron.m’.
Note that this function can take more than one input vector stored in a matrix X, where each input vector is represented as a row vector rather than a column one, and gives corresponding output as a vector Y.
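A vectorised step neuron of this kind might look as follows in Python. This is a sketch, not the required implementation: the name h_neuron() is illustrative, and it assumes that the first element of W is the bias weight w0, so that a constant 1 is prepended to every input vector.

```python
import numpy as np


def h_neuron(W, X):
    """Step-function neuron: y = 1 if w0 + w1*x1 + ... + wD*xD > 0, else 0."""
    W = np.asarray(W, dtype=float).reshape(-1)
    X = np.asarray(X, dtype=float)
    # ASSUMPTION: W[0] is the bias weight, so a constant 1 is prepended.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    activation = X_aug @ W
    # Strict inequality: an activation of exactly 0 gives output 0.
    return (activation > 0).astype(float).reshape(-1, 1)
```

With W = (−1, 1, 1), for instance, the neuron outputs 1 only for inputs with x1 + x2 > 1; a point lying exactly on the line x1 + x2 = 1 gives 0.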
Task 2.2 [3 marks]
Similar to task2_hNeuron() above, but consider another neuron which employs the logistic sigmoid function g(a) = 1 / (1 + exp(−a)). Implement this neuron as a Matlab function:
function [Y] = task2_sNeuron(W, X)
and save it as ‘task2_sNeuron.m’.
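The sigmoid neuron differs from the step neuron only in its output function. A minimal Python sketch follows, under the same assumption that the first element of W is the bias weight; the name s_neuron() is illustrative.

```python
import numpy as np


def s_neuron(W, X):
    """Logistic-sigmoid neuron: y = 1 / (1 + exp(-(w0 + w1*x1 + ... + wD*xD)))."""
    W = np.asarray(W, dtype=float).reshape(-1)
    X = np.asarray(X, dtype=float)
    # ASSUMPTION: W[0] is the bias weight, so a constant 1 is prepended.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    activation = X_aug @ W
    return (1.0 / (1.0 + np.exp(-activation))).reshape(-1, 1)
```

Unlike the step neuron, the output here is a real number strictly between 0 and 1, and equals 0.5 exactly on the line where the activation is zero.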
Task 2.3 [8 marks] Find the structure (i.e. connection of neurons) and weights of the neural network that classifies the inside and periphery of Polygon A as Class 1 (i.e. y(x) = 1), and the outside as Class 0 (i.e. y(x) = 0), where each neuron is modelled with task2_hNeuron().
This task is meant for you to work using pen and paper (and calculator), but it is also fine to write a piece of code to find the weights. If you do so, save the script or function as ‘task2_find_hNN_A_weights.m’.
Let w_{ji}^{(l)} denote the weight of neuron j in layer l from neuron i in layer l−1⁴. Normalise your weights in such a way that max_i |w_{ji}^{(l)}| = 1. Write the weights in a plain text file ‘task2_hNN_A_weights.txt’ in the following format.
Write each w_{ji}^{(l)} on a separate line, for l = 1,…, j = 1,…, and i = 0,1,…, so that the first line contains w_{10}^{(1)}, followed by w_{11}^{(1)} and w_{12}^{(1)} on the second and third lines, respectively. The format of each line should be as follows:
W(l,j,i) : value
where value is the weight. The first line should look like this:
W(1,1,0) : 0.35
Spaces are only allowed just before and after “:”, and none in other places.
In your report, show the structure of the network and explain how you found the weights.
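One possible construction for a convex polygon, sketched in Python below, uses one hidden neuron per edge that fires when a point lies strictly outside that edge, and an output neuron that fires only when no hidden neuron does. Because h(a) = 1 requires a > 0, testing for "strictly outside" is what lets points on the periphery fall into Class 1. Everything here is an assumption for illustration: the function names are hypothetical, the vertices are taken from the example file (your own polygon will differ), the vertices are assumed to be listed counter-clockwise around a convex polygon, and W[0] is assumed to be the bias weight.

```python
import numpy as np


def h_neuron(W, X):
    """Step neuron with W[0] as the bias weight (assumed layout)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return (X_aug @ np.asarray(W, dtype=float) > 0).astype(float)


def edge_weights(vertices):
    """One 'outside' half-plane neuron per edge of a convex, CCW polygon."""
    V = np.asarray(vertices, dtype=float)
    weights = []
    for p, q in zip(V, np.roll(V, -1, axis=0)):
        dx, dy = q - p
        # The cross product (q - p) x (x - p) is negative strictly outside
        # the edge, so the neuron uses its negation as the activation.
        w = np.array([dx * p[1] - dy * p[0], dy, -dx])
        weights.append(w / np.max(np.abs(w)))  # normalise: max |w| = 1
    return np.array(weights)


def polygon_network(vertices, X):
    """Layer 1: outside test per edge; layer 2: fires when none of them do."""
    X = np.asarray(X, dtype=float)
    W1 = edge_weights(vertices)
    hidden = np.column_stack([h_neuron(w, X) for w in W1])
    # Output neuron: 0.5 - sum(hidden) > 0 only if every hidden unit is 0.
    W2 = np.concatenate([[0.5], -np.ones(len(W1))])
    return h_neuron(W2, hidden)


def format_weights(layer, W):
    """Lines in the 'W(l,j,i) : value' format for one layer's weight rows."""
    return [
        f"W({layer},{j},{i}) : {float(w)}"
        for j, row in enumerate(W, start=1)
        for i, w in enumerate(row)
    ]


# Polygon A from the example file (hypothetical; your own file will differ).
polygon_a = [(-1, -0.5), (6, 1.25), (6, 6.25), (1, 6)]
lines_demo = format_weights(1, edge_weights(polygon_a))
```

After normalisation the weights are no longer exact in floating point, so points lying exactly on the periphery may be classified either way; this is worth checking separately for your own polygon.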
Task 2.4 [5 marks] Implement the neural network above as a function:
function [Y] = task2_hNN_A(X)
and save it as ‘task2_hNN_A.m’, where X and Y follow the same format as shown in Task 2.1.
³ NB: The step function defined here is slightly different from the one in the lectures.
⁴ The input layer where input data are fed is regarded as layer 0 (zero). The output node of a single-layer neural network is in layer 1.
Task 2.5 [4 marks] Using task2_hNN_A(), write a script that plots the decision regions in a 2D space, and save the code as ‘task2_plot_regions_hNN_A.m’. Save the graph as a PDF file named ‘t2_regions_hNN_A.pdf’.
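A common way to plot decision regions is to evaluate the classifier on a dense grid of points and colour the grid by the predicted class. The Python sketch below illustrates the idea. The names decision_grid(), plot_regions() and unit_disc() are hypothetical, the axis ranges and colours are arbitrary choices, and the stand-in classifier (a unit disc) is used only to exercise the grid code; in practice it would be replaced by the network function.

```python
import numpy as np


def decision_grid(classifier, xlim, ylim, steps=200):
    """Evaluate a classifier on a regular grid covering xlim x ylim."""
    xs = np.linspace(xlim[0], xlim[1], steps)
    ys = np.linspace(ylim[0], ylim[1], steps)
    xx, yy = np.meshgrid(xs, ys)
    points = np.column_stack([xx.ravel(), yy.ravel()])
    labels = np.asarray(classifier(points), dtype=float).reshape(xx.shape)
    return xx, yy, labels


def plot_regions(classifier, xlim, ylim, pdf_path, steps=200):
    """Draw the two decision regions and save the figure as a PDF."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    xx, yy, labels = decision_grid(classifier, xlim, ylim, steps)
    fig, ax = plt.subplots()
    ax.contourf(xx, yy, labels, levels=[-0.5, 0.5, 1.5],
                colors=["white", "skyblue"])
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    fig.savefig(pdf_path)
    plt.close(fig)


def unit_disc(points):
    """Stand-in classifier: Class 1 inside or on the unit circle."""
    return (np.sum(points ** 2, axis=1) <= 1.0).astype(float)


xx_demo, yy_demo, labels_demo = decision_grid(unit_disc, (-2, 2), (-2, 2), steps=5)
```

A finer grid gives smoother region boundaries at the cost of more classifier evaluations.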
Task 2.6 [6 marks] We now consider the decision regions formed with Polygon A and Polygon B, whose classification rule is shown below:
Class 1: A ∩ B̄
Class 0: Ā ∪ B
where A and B denote the inside and periphery of the corresponding polygon, and B̄ denotes the complement of B.
Implement the corresponding neural network as a function:
function [Y] = task2_hNN_AB(X)
and save it as ‘task2_hNN_AB.m’. Note that each neuron should be modelled with task2_hNeuron().
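Given two sub-networks whose 0/1 outputs indicate membership of A and of B, the rule A ∩ B̄ can be realised by a single additional step neuron. The Python sketch below shows one choice of weights; the function name and_not() and the particular weights (−0.5, 1, −1) are illustrative assumptions, not the required solution.

```python
import numpy as np


def and_not(in_a, in_b):
    """Step neuron computing a AND (NOT b) for 0/1 inputs."""
    in_a = np.asarray(in_a, dtype=float)
    in_b = np.asarray(in_b, dtype=float)
    # Weights (w0, w1, w2) = (-0.5, 1, -1): positive only for a = 1, b = 0.
    activation = -0.5 + 1.0 * in_a - 1.0 * in_b
    return (activation > 0).astype(float)
```

The activation is 0.5 for (a, b) = (1, 0) and at most −0.5 for the other three input combinations, so only points in A and not in B are assigned to Class 1.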
Task 2.7 [4 marks] Using task2_hNN_AB(), write a script that plots the decision regions in a 2D space, and save the code as ‘t2_plot_regions_hNN_AB.m’. Save the graph as a PDF file named ‘t2_regions_hNN_AB.pdf’.
Task 2.8 [5 marks] We now consider another network task2_sNN_AB() obtained by replacing all nodes of task2_hNeuron() with those of task2_sNeuron() in task2_hNN_AB(), so that each neuron is now modelled with task2_sNeuron(). Implement the neural network as a function:
function [Y] = task2_sNN_AB(X)
and save it as ‘task2_sNN_AB.m’. Note that you will need to modify the weights to approximate the decision regions properly.
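One reason the weights need modifying is that a sigmoid only resembles a step function when its input is large in magnitude. Multiplying all weights of a neuron by a common gain leaves the position of its decision boundary unchanged but sharpens the transition, as the Python sketch below illustrates. The gain values and the names sigmoid() and scaled_outputs() are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def scaled_outputs(activations, gain):
    """Sigmoid outputs after multiplying every weight (hence a) by `gain`."""
    return sigmoid(gain * np.asarray(activations, dtype=float))


# Activations of a neuron for four points near its decision boundary.
a_demo = np.array([-0.5, -0.05, 0.05, 0.5])
soft = scaled_outputs(a_demo, gain=1.0)     # all outputs stay close to 0.5
sharp = scaled_outputs(a_demo, gain=100.0)  # outputs close to 0 or 1
```

Whatever the gain, a point with activation exactly 0 still produces 0.5, so the treatment of the periphery and the thresholds used in later layers need separate consideration.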
Task 2.9 [4 marks] Using task2_sNN_AB(), write a script that plots the decision regions in a 2D space, and save the code as ‘task2_plot_regions_sNN_AB.m’. Save the graph as a PDF file named ‘t2_regions_sNN_AB.pdf’.
Task 2.10 [8 marks] Investigate and discuss the decision regions for task2_sNN_AB(), clarifying how and why they are different from those for task2_hNN_AB().
4 Functions that are not allowed to use
Since one of the objectives of this coursework is to understand and implement basic algorithms for machine learning, you are not allowed to use the standard-library functions listed below. You should write the code yourself using the basic arithmetic operations for scalars, vectors, and matrices. When you do so, use a function name different from the original one in standard libraries (e.g. MyCov() for cov(), as shown in the table below). You may, however, use the library functions for comparison purposes, i.e. to check your code.
Description of function                 Typical names  Suggested name to implement
Pairwise (squared) Euclidean distance   pdist2()       MySqDist()
Compute the mean                        mean()         MyMean()
Compute the covariance matrix           cov()          MyCov()
Compute Gaussian probability densities  mvnpdf()
K-NN classification                     fitcknn()      run_knn_classifier()
K-means clustering                      kmeans()       my_kMeansClustering()
Compute confusion matrix                confusion()    comp_confmat()
Other utilities for classification      · · ·
(NB: the list is not exhaustive)
You may use the following functions or operations:
Description            Typical names
Sum function           sum()
Cumulative sum         cumsum()
Square root function   sqrt()
Exponential function   e, exp()
Logarithmic function   log(), ln()
Matrix transpose       transpose(), '
Matrix inverse         inv()
Determinant            det()
Log determinant        logdet()
Eigen values/vectors   eig()
Sort                   sort()
Sample mode            mode()
Vectorisation helpers  bsxfun(), arrayfun()
5 Submission
You should submit your work electronically via the DICE submit command by the deadline. No submission of printed documents is required.
Since marking for each task will be done separately, you should prepare separate reports for the two tasks, save your report files in PDF format, and name them ‘report_task1.pdf’ and ‘report_task2.pdf’. Remember to place your student number and the task name prominently at the top of each report. Do not indicate your name anywhere. Your report should be concise and brief for each task.
Create a directory named LearnCW and copy all of the requested files to the directory, but do NOT put the data set files in it.
A checklist will be available from the coursework web page. Submit your coursework from a DICE machine using:
submit inf2b cw1 LearnCW
available in Inf2b cwk directory