代写 C++ algorithm Scheme html statistic Assignment 4: Linear Algebra library – Armadillo

Assignment 4: Linear Algebra library – Armadillo
June 1, 2017
1 Background
In this assignment, we will introduce you to the Armadillo linear algebra library. This last assignment will serve as a recap through the course and is a test for you to see whether you are able to write programs with C++ on your own. In contrast to previous assignments, we are not going to present you with a header file with functions to implement, instead you are asked to plan and write your program freely.
Your task is to use Armadilllo to implement two well known statistical al- gorithms for classification: Nearest Neighbours and Logistic Regression. Both algorithms assign a label to an unseen point. For example, if the point is an im- age of a lung patient, the algorithms will decide whether the patient is suffering from an illness or not. For simplicity, we will only require binary classification, i.e. the labels to be assigned are either 1 or -1.
2 Exercise 1: Nearest neighbour classification
The nearest neighbour algorithm is a simple algorithm that assigns a label using a majority voting scheme. A previously unseen point is compared to a set of points with known labels. The algorithm picks the k points with closest distance and assigns the label of the majority of points. More formally, given a set of points {(x1, yi), . . . , (xN , yN )} and a number of neighbours, k we assign a label y to a new point x as
• Compute distances di = ∥xi − x∥
• Create a set of label-distance pairs P with Pi = (yi,di), i = 1,…,N. • Sort P ascending by di( i.e. afterwards P1 is the pair with smallest di) • Assign to y the label which appears most often among P1, . . . , Pk.
Your program implemented in the source file “NearestNeighbours.cpp” has to do the following:
1

3

• •
Read “dataX.dat”, “dataXtest.dat” and “dataY.dat” from disk. “dataX.dat” contains the points xi in its rows. They are stored as numbers separated by a white space (i.e. tab or space). “dataY.dat” contains the label yi, one in each row. You can assume that the only labels appearing are 1 or -1. “dataXtest.dat” contains test points.
For each point in “dataXtest.dat”, assign the apropriate label using the nearest neighbour algorithm with k = 5.
Write a new file “NN.dat” containing the assigned labels in the same format as “dataY.dat”.
Exercise 2: Logistic Regression
Logistic regression assigns the label y to a point x using a previously chosen linear function f(x) = wTx. To find the label, we use y = sign(f(x)) (i.e. if f(x) is positive, the label is 1, else the label is -1). We find w by minimizing the function
1 􏰀N
log(1 + exp(−yiwT xi))
For this assignment, we use an algorithms called gradient descent to find the
L(w) = N
optimal w. This algorithms uses the gradient of L(w):
i=1
∂ 1􏰀N 1
estimate using
is close enough to the optimal solution. We consider our choice of w to be close
enough when the above update role does not change w very much any more, i.e.
when ∥ ∂ L(w)∥ < ε, for some chosen tolerance ε. ∂w Your program implemented in the source file "LogisticRegression.cpp" has to do the following: • Read "dataX.dat", "dataXtest.dat" and "dataY.dat" from disk (same as Exercise 1). • Find the optimal w using the rules above. The final solution should obey ε = 10−7 • For each point in "dataXtest.dat", assign the apropriate label using the assignment rule y = sign(f(x)). • Write a new file "LogReg.dat" containing the assigned labels in the same format as "dataY.dat". 2 ∂wL(w) = −N For an existing initial guess of w (for example w = 0), we can find a better yi 1 + exp(yiwT xi)xi w ← w − α · ∂ L(w) , i=1 ∂w where 0 < α < 1 is a chosen learning rate. We can repeat this, until our solution Hints: • The documentation of Armadillo is at http://arma.sourceforge.net/ docs.html. • It might be helpful to implement file input and output in a common file. • The version of Armadillo provided is configured to be stand-alone, which means that some advanced operations: solving systems of equations or eigenvalues are not available. However this should not be relevant for the assignment. • A test for whether you implemented the algorithms correctly is to copy the points from "dataX.dat" into "dataXtest.dat". In this case, you should expect that almost all of the labels assigned by the algorithm should be the same as in "dataY.dat". 3