Copyright By PowCoder代写 加微信 powcoder
COMP24112 Lab 2: News Article Classification by k-NN¶
1. Task description¶
You will work on a news article classification task.
The provided dataset includes a total of 800 articles taken from Reuters newswire.
They belong to 4 classes: “earn” (0), “crude” (1), “trade” (2) and “interest” (3).
There are 200 articles per class.
Each article is characterised by word occurrences.
The list of used words is called a vocabulary.
In our dataset, the vocabulary includes a total of 6428 words.
2. Preparation¶
First we need to import the data.
Run the below cell to load the data using NumPy.
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse
data, labels, class_names, vocabulary = np.load(“ReutersNews_4Classes_sparse.npy”, allow_pickle=True)
A Note on Sparsity¶
Most documents only contain a small subset of the vocabulary, resulting in a very sparse data matrix.
To handle the sparsity, in this exercise data is represented as a scipy.sparse.csr_matrix, which can store sparse matrices efficiently while still allowing efficient row-based indexing.
You can learn more about csr_matrix and other ways of dealing with sparse matrices at https://docs.scipy.org/doc/scipy/reference/sparse.html.
Note, however, that data is not a normal NumPy array.
While most operations will be the same as with a normal dense array, you cannot use a sparse matrix to index another matrix.
If you need to do this, either first convert the matrix to a NumPy array with the toarray() method, or use methods specifically designed to work with sparse matrices.
print(data[41,:]) # A sparse row vector; the output will be the non-zero indices and their values.
print(data[41,:].toarray()) # Convert back to a NumPy array. Note that the result is a (1, 6428) matrix, not a vector.
# print(vocabulary[data[41,:] > 0]) # Can’t index vocabulary with a sparse matrix.
rows, columns, values = scipy.sparse.find(data[41,:]) # Find the non-zero entries in the 42nd document.
print(vocabulary[columns]) # Prints the words present in the 42nd document.
To see the full vocabulary, you can run
print(“, “.join(vocabulary))
You can see how many times article $i$ contains word $j$ using
i, j = 40, 2
print(data[i,j])
You can see which class the $i$th article belongs to using
print(labels[i])
For instance, by running
print(“Occurrences:”, data[0,10])
print(“Class:”, class_names[labels[0]])
print(“Word:”, vocabulary[10])
you can see that the 11th word appears twice in the first document, the first document belongs to the class “earn”, and the 11th word is “shareholder”.
The following function randomly selects a subset of the data.
def sample_indices(labels, *num_per_class):
Returns randomly selected indices. It will return the specified number of indices for each class.
indices = []
for cls, num in enumerate(num_per_class):
cls_indices = np.where(labels == cls)[0]
indices.extend(np.random.choice(cls_indices, size=num, replace=False))
return np.array(indices)
For instance, to get one sample from the first class, two from the second, three from the third, and four from the fourth, you can run:
indices = sample_indices(labels, 1, 2, 3, 4)
print(“Returned indices:”, indices)
print(“Samples:”, data[indices])
print(“Corresponding classes:”, labels[indices])
3. k-NN Implementation (4 Marks, Normal)¶
Now, you will need to implement a k-NN classifier by filling the code below.
This function should support two types of distance measures: Euclidean distance and cosine distance (defined as 1 – cosine similarity). It should take a set of training samples, a user-specified neighbour number, a distance option, and features of a set of testing samples as the input.
It should return the predicted classes for the input set of testing samples.
In order to get 4 marks, you are asked to implement the k-NN classifier from scrach without relying on any machine learning library, particularly the distance calculation. But you are allowed to research NumPy functions relating to sorting. If you decide to use existing distance implementation from libraries, e.g., sklearn.metrics.pairwise_distances imported as cdist, you can get at most 3 marks.
Your implementation must NOT make use of Python loops over individual samples or features.
You should use functions that operate on whole matrices, as this will be much faster than looping in Python.
Each experiment below is expected to take no more than 2 minutes to run.
import scipy.stats
def knn_classify(test_samples, training_data, training_labels, metric=”euclidean”, k=1):
Performs k-nearest neighbour classification on the provided samples,
given training data and the corresponding labels.
test_samples: An m x d matrix of m samples to classify, each with d features.
training_data: An n x d matrix consisting of n training samples, each with d features.
training_labels: A vector of size n, where training_labels[i] is the label of training_data[i].
metric: The metric to use for calculating distances between samples.
k: The number of nearest neighbours to use for classification.
Returns: A vector of size m, where out[i] is the predicted class of test_samples[i].
# Calculate an m x n distance matrix.
pairwise_distance = …
# Find the k nearest neighbours of each samples as an m x k matrix of indices.
nearest_neighbours = …
# Look up the classes corresponding to each index.
nearest_labels = …
# Return the most frequent class on each row.
# Note: Ensure that the returned vector does not contain any empty dimensions.
# You may find the squeeze method useful here.
return …
4. Experiments (12 Marks in Total)¶
Use your k-NN function to perform the following experiments.
Experiment 1 (4 Marks, Easy)¶
Randomly select 80 articles per class for training, and use the remaining articles for testing.
Fix a neighbour number setting as you see fit. Perform k-NN classification using the Euclidean distance and test it.
Repeat this process 20 times (trials).
Calculate the mean and standard deviation of the testing accuracies. Print out the mean and standard deviation.
# Your code goes here
Use the same neighbour number, but use the cosine distance instead of the Euclidean distance.
Repeat the same experiment.
Print out the mean and standard deviation.
# Your code goes here
Explain in your report which distance measure gives better performance and analyse the reason.
Experiment 2 (4 Marks, Easy)¶
Using the distance measure that you found performs better in Experiment 1.
Randomly select 80 articles per class for training, and use the remaining articles for testing. Perform k-NN classification with the neighbour number $k$ varying from 1 to 50.
For each values of $k$, repeat the training process by 20 trials and record the average training error rates and standard deviation.
Do the same for testing errors.
# Your code goes here
Produce an error bar plot showing the training error rate for each $k$ here:
# Your code goes here
Produce your testing error bar plot here:
# Your code goes here
Remember that all graphs should have axis labels and a title.
Discuss in your report the difference between the training and testing accuracies and its indication. Analyse in your report the effect of $k$ based on this experiment.
Experiment 3 (4 Marks, Normal)¶
Compare three 5-NN classifiers using cosine distance. In order to get 4 marks, you should implement the confusion matrix calculation from scrach yourself. If you decide to use existing implementation for confusion matrix calculation, you can get at most 3 marks.
First, randomly select 100 articles per class and keep these as your testing samples. Set all the remaining articles as the training set.
# Your code goes here
Then do the following:
(1) Train the first classifier using the traning set.
Compute the confusion matrix for the 4 classes using the testing samples.
Print out the numbers of the training and testing samples belonging to each class, the $2\times 2$ confusion matrix for each of the 4 classes, and the overall accuracy of the classifier.
# Your code goes here
(2) Randomly remove 95 training articles from class 1 (“crude”) of the training set.
Train the second classifier using the reduced training samples.
Compute the confusion matrix for the 4 classes using the testing samples.
Print out the numbers of the training and testing samples belonging to each class, the $2\times 2$ confusion matrix for each of the 4 classes, and the overall accuracy of the classifier.
# Your code goes here
(3) Randomly remove 95 training articles from each class, and do it for all the classes of the training set.
Train the third classifier using the new training data.
Compute the confusion matrix for the 4 classes using the testing samples.
Print out the numbers of the training and testing samples belonging to each class, the $2\times 2$ confusion matrix for each of the 4 classes, and the overall accuracy of the classifier.
# Your code goes here
Repeat the whole thing a few times. Based on the observed results, state in your report which of the three classifiers performs the worst, and explain in your report the reason.
5. Result Analysis (4 Marks in Total)¶
Analysis 1 (2 Marks, Normal)¶
Choose a training-testing trial in Experiment 2 for k=1. Observe the testing error of this 1-NN, and estimate the interval where its true error lies with 90% probability. Explain in your report how you compute it.
Analysis 2 (2 Marks, Normal)¶
The following function Get_p_value() is provided to obtain $p$ according to $z_p$. Use this function to perform Analysis 2.
# run this cell first
def Get_p_value(zp):
return round(1 – scipy.stats.norm.sf(abs(zp))*2,2)
# Use this cell to compare the output value of function Get_p_value with
# the table provided in your lecture notes (e.g., Slide 12, Chapter3C.pdf)
print(‘zp = 0.67, p = ‘, Get_p_value(0.67))
print(‘zp = 1, p = ‘, Get_p_value(1))
print(‘zp = 1.64, p = ‘, Get_p_value(1.64))
print(‘zp = 2.58, p = ‘, Get_p_value(2.58))
# you can alert the input zp value and re-run this cell to help you to calculate the corresponding p.
print(‘p = ‘, Get_p_value(0.43))
# you can change 0.43 to any zp value you obtained.
Choose a training-testing trial in Experiment 2 for k=45. Observe the testing error of this 45-NN. Compare it with the 1-NN in Analysis 1. Which one has higher testing sample error? Estimate the probability that it also has higher true error. Explain your answer and how you compute it in the report.
6. Hyperparameter Selection (5 Marks, Hard)¶
Use your k-NN function with cosine distance. Design an appropriate and complete machine learning experiment, which should include the training, hyper-parameter selection and evaluation stages. In this case, your hyperparameter will be $k$. You can choose from the random subsampling, k-fold CV and LOO approaches for hyperparameter selection. In order to get 5 marks, you should implement this from scrach without using readily implemented data-split functions provided in existing libraris. If you decide to use existing implementation on data splitting, model selection and/or evaluation, you can get at most 3 marks. Explain in the report your experiment design, data splitting strategy, and the obtained results, also justify your design from theory side with the machine leanring knowlwdge learned.
# Your code goes here.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com