week11-2021-sem2
COMP20008 2021S2 workshop week 11¶
Chi Squared Feature Selection¶
The following code implements the example on Slide 19 of the Experimental Design lecture.
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
data = pd.DataFrame(np.array([[1,1,1],[1,0,1],[0,1,0],[0,0,0]]), columns=['a1','a2','c'])
features=data[['a1','a2']]
class_label = data['c']
#Contingency table of the class label against feature a1
cont_table = pd.crosstab(class_label,features['a1'])
chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)
print('Chi2 value: ',chi2_val)
if(p<0.05) :
    print('Null hypothesis rejected, p value: ', p)
else :
    print('Null hypothesis accepted, p value: ', p)
Chi2 value: 4.0
Null hypothesis rejected, p value: 0.04550026389635857
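To see where the value 4.0 comes from, the statistic can also be computed by hand from the observed and expected counts. The sketch below assumes the contingency table that `pd.crosstab` produces for `c` against `a1` in the four-row example (two objects with c=0, a1=0 and two with c=1, a1=1); the variable names are illustrative only.

```python
import numpy as np

# Observed counts: rows are class c (0, 1), columns are feature a1 (0, 1).
# Assumed to match the crosstab of the four-row example above.
observed = np.array([[2, 0],
                     [0, 2]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected count under independence: (row total * column total) / n
expected = row_totals @ col_totals / n

# Chi-squared statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(chi2_manual)  # 4.0
```

Every expected count is 1 here, so each of the four cells contributes (O - E)^2 / E = 1, which agrees with the value returned by `chi2_contingency` with `correction=False`.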
Question 2¶
Adapt the example above to calculate the Chi2 values for each feature in Question 1. Ensure that the results agree with your answer to Question 1.
import scipy.stats as stats
from scipy.stats import chi2_contingency
data = pd.DataFrame(np.array([[1,0,1],[1,1,1],[1,1,1],[1,0,0],[1,1,1],[0,0,0],[0,0,0],[0,0,0],[1,1,0],[0,0,0]]), columns=['A','B','Class'])
features=data[['A','B']]
class_label = data['Class']
for feature in ['A','B'] :
    #Contingency table of the class label against the current feature
    cont_table = pd.crosstab(class_label,features[feature])
    chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)
    print('Chi2 value for feature', feature,': ',chi2_val)
    if(p<0.05) :
        print('Null hypothesis rejected for feature', feature, 'p value:', p)
    else :
        print('Null hypothesis accepted for feature', feature, 'p value:', p)
Chi2 value for feature A : 4.444444444444445
Null hypothesis rejected for feature A p value: 0.03501498101966245
Chi2 value for feature B : 3.4027777777777777
Null hypothesis accepted for feature B p value: 0.0650867264927665
Experimental Evaluation¶
K-fold cross validation is important to ensure that the results we report are reliable, and not merely the result of a 'lucky' train/test split. Below is an example of K-fold cross validation applied to the World Development Index dataset.
Note also the process in the loop below: we start with the train/test split, then fit the scaling and imputation on the training set only and apply the fitted transformations to the testing set. This ensures that we don't violate the train/test split by letting information from the testing set influence how the model is built.
world= pd.read_csv('world_org.csv')
life = pd.read_csv('life.csv')
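#Note: set_index returns a new DataFrame, so these two calls leave world and life unchanged
world.set_index('Country Code')
life.set_index('Country Code')
#merge() with no 'on' argument joins on the columns the two frames share
all_data = world.merge(life)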
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
##get just the features
data=all_data.iloc[:, 5:-1]
##get just the class labels
classlabel=all_data['Life expectancy at birth (years)']
#Number of folds (the ten accuracy scores printed below correspond to k = 10)
k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)
acc_score = []
for train_index, test_index in kf.split(data):
    #Perform the split for this fold
    X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = classlabel.iloc[train_index], classlabel.iloc[test_index]
    #Scale the data
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train=scaler.transform(X_train)
    X_test=scaler.transform(X_test)
    #Impute missing values via mean imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)
    #Train k-nn classifier
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    #Predict result
    y_pred=knn.predict(X_test)
    acc_score.append(accuracy_score(y_test, y_pred))
print(acc_score)
#Display average of accuracy scores
avg_acc_score = sum(acc_score)/k
print(avg_acc_score)
[0.7368421052631579, 0.8421052631578947, 0.8421052631578947, 0.7222222222222222, 0.8333333333333334, 0.6666666666666666, 0.8333333333333334, 0.6111111111111112, 0.7222222222222222, 0.4444444444444444]
0.7254385964912282
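The fit-on-training, apply-to-testing pattern used in the loop above can also be expressed with a scikit-learn pipeline, which refits every preprocessing step inside each fold automatically. The following is a minimal sketch under stated assumptions: the data is synthetic (not the workshop dataset), and the 5% missing-value rate, the five folds and all variable names are chosen purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 60 objects, 4 features, binary class
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the values

# Scaling, imputation and k-NN chained together; each step is fitted on
# the training part of a fold only, then applied to the held-out part
model = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy='mean'),
    KNeighborsClassifier(n_neighbors=5),
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(scores)
print(scores.mean())
```

Writing the steps as a pipeline makes it harder to accidentally fit the scaler or imputer on the full dataset before splitting.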
Question 3¶
Experiment with different values of k and the random_state parameter. What might be an optimal k value in this case? How could we further improve the reliability of our results?
Principal components analysis¶
Principal components analysis can be used for transforming data into a different (lower dimensional) representation. This is particularly useful for visualisation, computational efficiency and removing noisy data.
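As a rough illustration of what the transformation does, PCA can be written as centring the data and projecting it onto the leading right singular vectors of the centred matrix. The sketch below uses synthetic data and illustrative variable names (both are assumptions) and checks the result against scikit-learn's `PCA`, up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 100 objects, 3 correlated features
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(100, 3)) @ mixing

# Step 1: centre each feature
Xc = X - X.mean(axis=0)

# Step 2: singular value decomposition of the centred data;
# the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 3: project onto the first two directions
scores_manual = Xc @ Vt[:2].T

# Proportion of variance captured by each component
explained = S ** 2 / np.sum(S ** 2)

# Compare with scikit-learn (components may differ in sign)
pca = PCA(n_components=2)
scores_sklearn = pca.fit_transform(X)
print(np.allclose(np.abs(scores_manual), np.abs(scores_sklearn), atol=1e-6))
print(explained[:2], pca.explained_variance_ratio_)
```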
The Python scikit-learn package (sklearn) contains functions which can be used for PCA. Consider the example below, which introduces PCA to the previous task.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.decomposition import PCA
##get just the features
data=all_data.iloc[:, 5:-1]
##get just the class labels
classlabel=all_data['Life expectancy at birth (years)']
#Number of folds (the ten accuracy scores printed below correspond to k = 10)
k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)
acc_score = []
for train_index, test_index in kf.split(data):
    #Perform the split for this fold
    X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = classlabel.iloc[train_index], classlabel.iloc[test_index]
    #Scale the data
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train=scaler.transform(X_train)
    X_test=scaler.transform(X_test)
    #Impute missing values via mean imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)
    #Perform PCA (fitted on the training set, applied to the testing set)
    pca = PCA(n_components=5)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    #Train k-nn classifier
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    #Predict result
    y_pred=knn.predict(X_test)
    acc_score.append(accuracy_score(y_test, y_pred))
print(acc_score)
#Display average of accuracy scores
avg_acc_score = sum(acc_score)/k
print(avg_acc_score)
[0.631578947368421, 0.7894736842105263, 0.6842105263157895, 0.8333333333333334, 0.7777777777777778, 0.6666666666666666, 0.8333333333333334, 0.7777777777777778, 0.7777777777777778, 0.5]
0.7271929824561403
Question 4¶
Experiment with different numbers of components. What gives the best result? How could you decide on the appropriate number of principal components to use?
Visualisation using PCA¶
Consider the example below of applying PCA on the iris dataset.
iris= pd.read_csv('iris.csv',dtype=None) ###read in data
iris2=iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns
SepalLength SepalWidth PetalLength PetalWidth
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
150 rows × 4 columns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
##########################################################
#######Example of performing PCA on Iris dataset and visualising####################
##########################################################
sklearn_pca = PCA(n_components=2) #we want just the first two PCs
iris_sklearn = sklearn_pca.fit_transform(iris2)
print("Variance explained by each PC",sklearn_pca.explained_variance_ratio_) #print out the amount of variance explained by each PC
#set up the colour scheme
palette = ['blue','green','red']
colors=iris.Name.replace(to_replace=iris.Name.unique(),value=palette).tolist()
#plot the objects along the first two principal components, using the colour scheme
plt.scatter(iris_sklearn[:,0],iris_sklearn[:,1],s=60,c=colors) #plot the PC's in 2D - s marker size
plt.xlabel('1st Principal Component', fontsize=28)
plt.ylabel('2nd Principal Component', fontsize=28)
plt.show()
Variance explained by each PC [0.92461621 0.05301557]
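The per-component ratios printed above can be accumulated to see how much of the total variance the first few components retain together. The sketch below is a hedged illustration on synthetic data; the 95% threshold, the data-generating process and the variable names are all assumptions, and a cumulative-variance threshold is only one of several possible heuristics.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 6 observed features driven by 2 latent factors plus noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

# Fit PCA with all components kept so the ratios sum to 1
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # illustrative choice, not a recommendation from the workshop
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep)
```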
Question 5¶
What can you observe about the clustering behavior of the iris dataset from the plot above? What other techniques could you use to help visualise the clustering behavior?
VAT - Visual Assessment for Clustering Tendency¶
We've already seen the VAT algorithm for visualising the clustering tendency of a dataset. Below is Python code for VAT. You can treat it as a black box (not worrying about the internal coding details): a function which can be used to execute VAT on an input dataset.
import numpy as np
import math,random
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def VAT(R):
    """
    VAT algorithm adapted from matlab version:
    http://www.ece.mtu.edu/~thavens/code/VAT.m

    Args:
        R (n*n double): Dissimilarity data input
        R (n*D double): vector input (R is converted to sq. Euclidean distance)
    Returns:
        RV (n*n double): VAT-reordered dissimilarity data
        C (n int): Connection indexes of MST in [0,n)
        I (n int): Reordered indexes of R, the input data in [0,n)
    """
    R = np.array(R)
    N, M = R.shape
    #If the input is not square, treat it as vector data and build the dissimilarity matrix
    if N != M:
        R = squareform(pdist(R))
    J = list(range(0, N))
    #Start from an object involved in the largest dissimilarity
    y = np.max(R, axis=0)
    i = np.argmax(R, axis=0)
    j = np.argmax(y)
    y = np.max(y)
    I = i[j]
    del J[I]
    #Add the object closest to the starting object
    y = np.min(R[I,J], axis=0)
    j = np.argmin(R[I,J], axis=0)
    I = [I, J[j]]
    J = [e for e in J if e != J[j]]
    C = [1,1]
    #Repeatedly add the unvisited object closest to any visited object
    for r in range(2, N-1):
        y = np.min(R[I,:][:,J], axis=0)
        i = np.argmin(R[I,:][:,J], axis=0)
        j = np.argmin(y)
        y = np.min(y)
        I.extend([J[j]])
        J = [e for e in J if e != J[j]]
        C.extend([i[j]])
    #Handle the final remaining object
    y = np.min(R[I,:][:,J], axis=0)
    i = np.argmin(R[I,:][:,J], axis=0)
    I.extend(J)
    C.extend(i)
    RI = list(range(N))
    for idx, val in enumerate(I):
        RI[val] = idx
    #Reorder rows and columns of the dissimilarity matrix
    RV = R[I,:][:,I]
    return RV.tolist(), C, I
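To get an intuition for what the reordering achieves without reading the function above in detail, here is a simplified sketch of the same idea on synthetic data. It is not the VAT function itself: `prim_order` is a hypothetical helper written for this illustration, and the two well-separated blobs, their sizes and the random seed are assumptions. Starting from an object involved in the largest dissimilarity, it repeatedly visits the unvisited object closest to any object already visited, so objects from the same cluster end up next to each other.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def prim_order(D):
    """Return a visiting order that always adds the closest unvisited object."""
    n = D.shape[0]
    # Start from one end of the largest pairwise dissimilarity
    start = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [start]
    remaining = [idx for idx in range(n) if idx != start]
    while remaining:
        # Distance from each remaining object to its nearest visited object
        nearest = D[np.ix_(order, remaining)].min(axis=0)
        nxt = remaining[int(np.argmin(nearest))]
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Two well-separated synthetic clusters, stored in shuffled order (assumption)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
                    rng.normal(5, 0.3, size=(10, 2))])
perm = rng.permutation(20)
X = points[perm]
labels = (perm >= 10).astype(int)  # which cluster each shuffled row came from

D = squareform(pdist(X))
order = prim_order(D)
ordered_labels = labels[order]
print(ordered_labels)

# Reordered dissimilarity matrix: small values gather in blocks on the diagonal
D_reordered = D[np.ix_(order, order)]
```

After reordering, each cluster occupies one contiguous run of rows and columns, which is what produces the dark diagonal blocks in a VAT heatmap.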
Visualising iris dataset using VAT¶
We will first recreate the visualisations of the iris dataset used in lectures (lecture 7). Info about the iris dataset is here. First a heatmap of the raw iris dataset is displayed. Secondly a randomly ordered dissimilarity matrix for the objects in iris is shown - notice the lack of structure. Thirdly the VAT visualisation is produced. The heatmap function from the seaborn package is employed as a convenient tool for plotting heatmaps.
Below is an example of the VAT algorithm applied to the same iris dataset
import seaborn as sns
##########################################################
#######Read in the dataset###############
##########################################################
iris= pd.read_csv('iris.csv',dtype=None) ###read in data
iris2=iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns
####Draw heatmap of raw Iris matrix#######
#sns.heatmap(iris2,cmap='viridis',xticklabels=True,yticklabels=False)
#plt.show()
####Visualise the dissimilarity matrix for Iris using a heatmap (without applying VAT)####
iris3=iris2.copy().values
np.random.shuffle(iris3) ####randomise the order of rows (objects)
sq = squareform(pdist(iris3)) ###compute the dissimilarity matrix
ax=sns.heatmap(sq,cmap='viridis',xticklabels=False,yticklabels=False)
ax.set(xlabel='Objects', ylabel='Objects')
plt.show()
#####Apply VAT Algorithm to Iris dataset and visualise using heatmap########
RV, C, I = VAT(iris2)
x=sns.heatmap(RV,cmap='viridis',xticklabels=False,yticklabels=False)
x.set(xlabel='Objects', ylabel='Objects')
plt.show()
Question 6¶
Plot the VAT heatmap for the iris data. How many clusters does the VAT visualisation reveal? How does this compare to the PCA scatterplot?
If you get time: Practicing VAT and PCA¶
You will now practise using the Australian crabs dataset from this file. This data describes 200 crabs collected from Fremantle, Western Australia. There are two species of crabs, blue and orange, and each species includes both males and females. There are 5 features:
FL - frontal lip
RW - rear width
CL - carapace length
CW - carapace width
BD - body depth
The first four of these are visualised as follows:
Question 7¶
Adapt the iris example to produce a VAT heatmap of the Australian crabs dataset. How many clusters are there?