Lab2_Classification1_Qs
Classification 1 – distance measures and Nearest Neighbour classification¶
Introduction¶
In this notebook we will expand on some of the concepts of classification, starting with an experiment on distance measures for data, then looking into the $k$-Nearest Neighbour algorithm. Please try to understand the code and follow the instructions marked in purple.
1) Distance measures for high-dimensional data
Algorithms such as $k$-Nearest Neighbour are conceptually very simple: to predict the class of an unlabelled query data point, we compare it against all of the labelled data points in our training set and predict that the query has the same class as the most similar point(s). So all we need is a way of measuring similarity. The well-known Euclidean distance measure would seem to be a good choice. However, while we are very familiar with Euclidean distance in two and three dimensions, there is a problem in high dimensions. What is this problem?
Pairwise distances in high-dimensional spaces¶
**Answer**: in high-dimensional spaces everything is far away from everything else, and so pairwise distances become uninformative.
But what does this actually mean? There is a mathematical argument showing that this is true, but an alternative approach is simply to simulate what happens. One approach is to randomly generate $N$ points inside a $d$-dimensional cube centred on zero, i.e. $[-0.5, 0.5]^{d}$. We then calculate the pairwise distances among the $N$ points, and for every point we compute the ratio of its minimum distance to its maximum distance to all of the other points. The mean of these ratios summarises how much spread there is in the pairwise distances at that dimensionality (a ratio close to 1 means the nearest and farthest points are almost equally far away). We run the simulation from 1 dimension up to around 500 dimensions and plot the ratios as a line chart using the matplotlib library.
You should use the numpy library for this, and in particular the linear algebra methods to calculate distances such as the L2 norm.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
def run_d_n(dim, N_pts, L):
    # simulate N_pts points uniformly inside the cube [-0.5, 0.5]^dim
    pts = np.random.rand(N_pts, dim) - 0.5
    ratio_list = []
    for i in range(N_pts):
        # indices of all points except point i itself
        selected_pts = np.array([j for j in range(N_pts) if j != i])
        # L-norm distances (L=2 for Euclidean, L=1 for Manhattan) from point i to the others
        dist = np.linalg.norm(pts[selected_pts] - pts[i], L, axis=1)
        # ratio of the minimum to the maximum distance for this point
        ratio = np.min(dist) / np.max(dist)
        ratio_list.append(ratio)
    # return the mean ratio over all points
    return np.mean(ratio_list)
# Initialise N_pts, the number of points we simulate
# (the exact value is a free choice; 100 keeps the simulation fast)
N_pts = 100
# Set l=2 to calculate the L2 distance
l = 2
# Set the range of dimensions we simulate
check_dim = range(1, 550, 50)
# Calculate the mean ratio for each dimension
ratio_list = [run_d_n(dim, N_pts, l) for dim in check_dim]
# Plot the ratio with its corresponding dimension
plt.plot(check_dim,ratio_list)
plt.ylabel("Mean ratio of min/max pairwise distances")
plt.xlabel("Number of dimensions")
plt.title("Effect of increasing dimensionality on pairwise distances")
plt.xticks(np.arange(0, 600, step=100))
plt.show()
**Question:** how can this plot be interpreted? How else could you visualize this effect? (One possible alternative is sketched below.)
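One possible alternative visualisation, given as a sketch that reuses the numpy and matplotlib imports above: plot histograms of all pairwise distances at a few dimensionalities, normalised by the largest distance, and watch the distributions concentrate as the dimension grows. The specific values of N_pts and the dimensions are just example choices.
# Sketch: histograms of normalised pairwise distances at a few dimensionalities
N_pts = 100
for dim in [2, 50, 500]:
    pts = np.random.rand(N_pts, dim) - 0.5
    diffs = pts[:, None, :] - pts[None, :, :]      # (N_pts, N_pts, dim) differences
    dists = np.linalg.norm(diffs, axis=-1)         # full pairwise distance matrix
    upper = dists[np.triu_indices(N_pts, k=1)]     # each pair counted once
    plt.hist(upper / upper.max(), bins=30, alpha=0.5, label="d=%d" % dim)
plt.xlabel("Pairwise distance (normalised by max)")
plt.ylabel("Count")
plt.legend()
plt.show()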
2) Implement Nearest Neighbour from scratch¶
The following will give some practice in implementing a simple classifier, the $k$-Nearest Neighbour ($k$NN) algorithm, effectively writing a small $k$NN package from scratch. Most machine learning methods involve two main steps, namely training (fitting a model to the training data) and prediction (running the model on input data to generate output). However, since the $k$NN algorithm has no explicit model-building step, we only need to implement the prediction step.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
Data Creation¶
mean_01 = np.array([1, 0.5])
cov_01 = np.array([[1, 0.1], [0.1, 1.2]])
mean_02 = np.array([4, 5])
cov_02 = np.array([[1, 0.1], [0.1, 1.2]])
dist_01 = np.random.multivariate_normal(mean_01, cov_01, 500)
dist_02 = np.random.multivariate_normal(mean_02, cov_02, 500)
print(dist_01.shape, dist_02.shape)
(500, 2) (500, 2)
We have created two 2-dimensional normal distributions of data points with the same covariance but different means.
Plotting the created Data¶
What does the data look like? Notice the two distinct clusters that are formed.
plt.figure(0)
plt.xlim(-5, 10)
plt.ylim(-5, 10)
plt.scatter(dist_01[:, 0], dist_01[:, 1])
plt.scatter(dist_02[:, 0], dist_02[:, 1])#, color='red')
plt.show()
Let us now represent the data in a tabular way, with the class label in the last column: dist_01 gets label 0 and dist_02 gets label 1.
r = dist_01.shape[0] + dist_02.shape[0]
c = dist_01.shape[1] + 1
data = np.zeros((r, c))
print(data.shape)
data[:dist_01.shape[0], :2] = dist_01
data[dist_01.shape[0]:, :2] = dist_02
data[dist_01.shape[0]:, -1] = 1.0
print(data.mean(axis=0))
[2.53984132 2.775286 0.5 ]
Now shuffle the data and check by printing the first 10 rows.
np.random.shuffle(data)
print(data[:10])
[[ 4.50331247 4.61112561 1. ]
[ 4.9844128 5.45535834 1. ]
[ 1.93894564 -0.32547993 0. ]
[ 2.82070904 0.06931775 0. ]
[ 2.7820279 4.12175889 1. ]
[ 4.38145217 4.77332416 1. ]
[ 1.68223559 -0.45712006 0. ]
[ 3.20582081 5.27428178 1. ]
[ 2.03199265 1.05406107 0. ]
[ 3.19377429 5.95478885 1. ]]
Implementation. Next, we implement our KNN algorithm. There are many ways to do this, but a basic approach will require a pairwise distance measure for instances, and a way to take a “training” dataset of classified instances and make a prediction for a “test” data instance. Here is a top-level outline.
Please add an implementation. Use the Manhattan (L1) distance.
def distance(x1, x2):
    # your code here: return the distance between points x1 and x2
    pass

def knn(X_train, y_train, xt, k=7):
    # your code here: return the predicted label for the query point xt,
    # using the k nearest points in X_train (labels in y_train)
    pass
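If you want something to compare your own solution against, here is a minimal sketch of one possible implementation (Manhattan/L1 distance, majority vote over the $k$ nearest training points); it is not the only correct answer.
# A minimal sketch of one possible implementation
def distance(x1, x2):
    # Manhattan (L1) distance between two points
    return np.sum(np.abs(x1 - x2))

def knn(X_train, y_train, xt, k=7):
    # distances from the query point xt to every training point
    dists = np.array([distance(x, xt) for x in X_train])
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote over the labels of the k nearest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]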
Now check to see if we can make a prediction.
test_point = np.array([8, -4])
# Un-comment the line below and check if it comes out as 0.0
# print(knn(data[:, :2], data[:, -1], test_point))
Create a train and test split of the data¶
np.random.shuffle(data)
split = int(0.75 * data.shape[0])
# print split
train_data_X = data[:split, :2]
train_data_y = data[:split, -1]
test_data_X = data[split:, :2]
test_data_y = data[split:, -1]
print(train_data_X.shape, train_data_y.shape)
print(test_data_X.shape, test_data_y.shape)
(750, 2) (750,)
(250, 2) (250,)
**Implementation.** Next, please add an implementation to run our KNN classifier on all the test data and get the results.
# kx is the number of neighbours
def get_acc(kx):
    # your code here: predict every point in test_data_X with knn() and
    # return the accuracy against test_data_y
    pass
#print(get_acc(7))
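As a point of comparison, here is a minimal sketch of one way to do this; it assumes the ```distance()``` and ```knn()``` functions above have been implemented.
# Sketch: accuracy (as a percentage) of knn() over the whole test split
def get_acc(kx):
    preds = np.array([knn(train_data_X, train_data_y, xt, k=kx)
                      for xt in test_data_X])
    return 100.0 * np.mean(preds == test_data_y)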
What accuracy did you get? You should get around 99 percent on this dataset.
Now let's try real data: MNIST¶
import pandas as pd
import datetime
Of course, MNIST is image data, but here we are using a CSV version where we can view the pixels as numbers (each row has the pixel data for an image of a digit, and the first column is the class of the digit, i.e., 0-9).
df = pd.read_csv('train.csv')
df.head()
5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 … 0.608 0.609 0.610 0.611 0.612 0.613 0.614 0.615 0.616 0.617
0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0
1 4 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0
3 9 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0
4 2 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0
5 rows × 785 columns
Since the dataset is quite big, we will just use a subset.
data = df.values[:2000]
print (data.shape)
(2000, 785)
Make a train/test split of the data.
split = int(0.8 * data.shape[0])
X_train = data[:split, 1:]
X_test = data[split:, 1:]
y_train = data[:split, 0]
y_test = data[split:, 0]
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(1600, 784) (1600,)
(400, 784) (400,)
Let us just check that our data really does represent images.
plt.figure(0)
plt.imshow(X_train[91].reshape((28, 28)), cmap='gray', interpolation='none')
print (y_train[91])
plt.show()
**Implementation.** Now code another ```get_acc()``` and try different values of $k$ on our dataset.
def get_acc(kx):
    # your code here: predict every image in X_test with knn() and return
    # the accuracy against y_test
    pass
#print (get_acc(kx=))
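Again, a minimal sketch for comparison; it reuses ```knn()``` from earlier, and since brute-force $k$NN on raw pixels is slow you may want to evaluate on a slice of ```X_test``` first.
# Sketch: accuracy (as a percentage) of knn() on the MNIST test split
def get_acc(kx):
    # classifying all 400 test images is slow; slice X_test / y_test
    # for a quicker sanity check if needed
    preds = np.array([knn(X_train, y_train, xt, k=kx) for xt in X_test])
    return 100.0 * np.mean(preds == y_test)

# for k in [1, 3, 5, 7, 9]:
#     print(k, get_acc(k))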
Try different values of $k$ and observe how the results change. Which $k$ gives the best result, and why?
Now explore different metrics: precision, recall, F1 score and the confusion matrix. Use ```sklearn.neighbors.KNeighborsClassifier``` to obtain predictions and ```sklearn.metrics``` to do the calculations.
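Here is a sketch of the scikit-learn workflow for this; the choice of ```n_neighbors=7``` and macro averaging are just example settings, not required ones.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# MNIST has 10 classes, so the per-class scores are averaged (macro averaging here)
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:   ", recall_score(y_test, y_pred, average='macro'))
print("F1:       ", f1_score(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))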