week11-2021-sem2-answers
COMP20008 2021S1 workshop week 11¶
Chi Squared Feature Selection¶
The following code implements the example in Slide 19 of the Experimental design lecture
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
data = pd.DataFrame(np.array([[1,1,1],[1,0,1],[0,1,0],[0,0,0]]), columns=['a1','a2','c'])
features = data[['a1','a2']]
class_label = data['c']
cont_table = pd.crosstab(class_label, features['a1'])
chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)
print('Chi2 value: ', chi2_val)
if p < 0.05:
    print('Null hypothesis rejected, p value: ', p)
else:
    print('Null hypothesis accepted, p value: ', p)
Chi2 value: 4.0
Null hypothesis rejected, p value: 0.04550026389635857
Question 2)¶
Adapt the example above to calculate the Chi2 values for each feature in Question 1. Ensure that the results agree with your answer to Question 1.
import scipy.stats as stats
from scipy.stats import chi2_contingency
data = pd.DataFrame(np.array([[1,0,1],[1,1,1],[1,1,1],[1,0,0],[1,1,1],[0,0,0],[0,0,0],[0,0,0],[1,1,0],[0,0,0]]), columns=['A','B','Class'])
features = data[['A','B']]
class_label = data['Class']
for feature in ['A','B']:
    cont_table = pd.crosstab(class_label, features[feature])
    chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)
    print('Chi2 value for feature', feature, ': ', chi2_val)
    if p < 0.05:
        print('Null hypothesis rejected for feature', feature, 'p value:', p)
    else:
        print('Null hypothesis accepted for feature', feature, 'p value:', p)
Chi2 value for feature A : 4.444444444444445
Null hypothesis rejected for feature A p value: 0.03501498101966245
Chi2 value for feature B : 3.4027777777777777
Null hypothesis accepted for feature B p value: 0.0650867264927665
Experimental Evaluation¶
K-fold cross validation is important to ensure that the results we report are reliable, and not merely the result of a 'lucky' train/test split. Below is an example of K-fold cross validation applied to the World Development Index dataset.
Note also the process in the loop below: we start by performing the train/test split for the fold, then fit steps such as scaling and imputation on the training set only and apply the fitted transformations to the test set. This is important so that we don't violate the train/test split by using knowledge of the test set when building our model.
world= pd.read_csv('world_org.csv', encoding="iso-8859-1")
life = pd.read_csv('life.csv')
# merge on the column the two DataFrames have in common ('Country Code');
# there is no need to set the index first for merge() to work
all_data = world.merge(life)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
##get just the features
data = all_data.iloc[:, 5:-1]
##get just the class labels
classlabel = all_data['Life expectancy at birth (years)']
#number of folds (matches the 10 accuracy values printed below)
k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)
acc_score = []
for train_index, test_index in kf.split(data):
    #Perform the split for this fold
    X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = classlabel[train_index], classlabel[test_index]
    #Scale the data
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    #Impute missing values via mean imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)
    #Train k-nn classifier
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    #Predict result
    y_pred = knn.predict(X_test)
    acc_score.append(accuracy_score(y_test, y_pred))

print(acc_score)
#Display average of accuracy scores
avg_acc_score = sum(acc_score)/k
print(avg_acc_score)
[0.7368421052631579, 0.8421052631578947, 0.8421052631578947, 0.7222222222222222, 0.8333333333333334, 0.6666666666666666, 0.8333333333333334, 0.6111111111111112, 0.7222222222222222, 0.4444444444444444]
0.7254385964912282
Question 3¶
Experiment with different values of k and the random_state parameter. What might be an optimal k value in this case? How could we further improve the reliability of our results?
Answer: Generally we would expect larger values of k to give more accurate results, as each model is trained on a larger amount of data; values like 2 or 3 are unlikely to be as effective. Higher values of k require more processing time, however. Note that there is still a lot of variability present, which you can see by changing the random_state parameter. Note also that we are NOT trying to choose the value of k which happens to give us the highest accuracy (e.g. 12 is not a better choice than 13 just because the average accuracy is higher); that is simply random behaviour and won't be replicated on other datasets.
To further improve the reliability of our results we could use a repeated k-fold cross validator (sklearn.model_selection.RepeatedKFold), which applies k-fold cross validation several times using the same k but different random shuffles, and averages over all of the resulting folds. A minimal sketch is shown below.
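For example, a rough sketch of this (reusing the data and classlabel variables from the cells above; the choice of 5 folds and 3 repeats here is arbitrary and not part of the original workshop solution):
import numpy as np
from sklearn import neighbors, preprocessing
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedKFold

#5 folds, repeated 3 times with different shuffles = 15 accuracy estimates
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
acc_score = []
for train_index, test_index in rkf.split(data):
    X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = classlabel[train_index], classlabel[test_index]
    #fit scaling and imputation on the training folds only, as before
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train, X_test = imp.fit_transform(X_train), imp.transform(X_test)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    acc_score.append(accuracy_score(y_test, knn.predict(X_test)))

#averaging over n_splits * n_repeats folds smooths out 'lucky' or 'unlucky' splits
print(sum(acc_score) / len(acc_score))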
Principal components analysis¶
Principal components analysis can be used for transforming data into a different (lower dimensional) representation. This is particularly useful for visualisation, computational efficiency and removing noisy data.
The Python scikit-learn package (sklearn) contains functions which can be used for PCA. Consider the example below, which introduces PCA into the previous task.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.decomposition import PCA
##get just the features
data = all_data.iloc[:, 5:-1]
##get just the class labels
classlabel = all_data['Life expectancy at birth (years)']
#number of folds
k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)
acc_score = []
for train_index, test_index in kf.split(data):
    #Perform the split for this fold
    X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = classlabel[train_index], classlabel[test_index]
    #Scale the data
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    #Impute missing values via mean imputation
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train = imp.fit_transform(X_train)
    X_test = imp.transform(X_test)
    #Perform PCA, fitting on the training set only
    pca = PCA(n_components=2)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    #Train k-nn classifier
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    #Predict result
    y_pred = knn.predict(X_test)
    acc_score.append(accuracy_score(y_test, y_pred))

print(acc_score)
#Display average of accuracy scores
avg_acc_score = sum(acc_score)/k
print(avg_acc_score)
[0.7368421052631579, 0.6842105263157895, 0.631578947368421, 0.8333333333333334, 0.7222222222222222, 0.6666666666666666, 0.8888888888888888, 0.6666666666666666, 0.7777777777777778, 0.5555555555555556]
0.7163742690058479
Question 4¶
Experiment with different numbers of components. What gives the best result? How could you decide on the appropriate number of principal components to use?
Answer: It's possible that we might want to use a fixed number of components (e.g. 2 or 3) for visualisation purposes, but for model building it's useful to look at the variance explained by the principal components we've used so far (the explained_variance_ratio_ attribute of a fitted PCA object), as in the next example. If the vast majority of the variance has been explained by the first x principal components, then x is probably a good number to use.
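For example, a rough sketch of this idea (reusing the data matrix from the cells above; scaling and imputing the full dataset here is just for exploration, and the 95% threshold is an arbitrary illustrative choice):
import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

#scale and impute the full feature matrix, then fit PCA with all components
X = preprocessing.StandardScaler().fit_transform(data)
X = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(X)
pca_full = PCA().fit(X)

#cumulative proportion of variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

#smallest number of components explaining at least 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print('Components needed for 95% of variance:', n_components)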
Visualisation using PCA¶
Consider the example below of applying PCA on the iris dataset.
iris = pd.read_csv('iris.csv', dtype=None) ###read in data
iris2 = iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns
iris2 #display the feature columns
SepalLength SepalWidth PetalLength PetalWidth
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
150 rows × 4 columns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
##########################################################
#######Example of performing PCA on Iris dataset and visualising####################
##########################################################
sklearn_pca = PCA(n_components=2) #we want just the first two PCs
iris_sklearn = sklearn_pca.fit_transform(iris2)
print("Variance explained by each PC",sklearn_pca.explained_variance_ratio_) #print out the amount of variance explained by each PC
#set up the colour scheme
palette = ['blue','green','red']
colors=iris.Name.replace(to_replace=iris.Name.unique(),value=palette).tolist()
#plot the objects along the first two principal components, using the colour scheme
plt.scatter(iris_sklearn[:,0],iris_sklearn[:,1],s=60,c=colors) #plot the PC's in 2D - s marker size
plt.xlabel('1st Principal Component', fontsize=28)
plt.ylabel('2nd Principal Component', fontsize=28)
plt.show()
Variance explained by each PC [0.92461621 0.05301557]
Question 5)¶
What can you observe about the clustering behavior of the iris dataset from the plot above? What other techniques could you use to help visualise the clustering behavior?
Answer: We can see that two of the three classes are very close together, while the other is more clearly separated. Scatter plot matrices could be useful, as would VAT (sketched below).
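As an illustration of the scatter plot matrix idea, here is a minimal sketch using seaborn's pairplot, reusing the iris DataFrame read in above (the column names are those used in this notebook's iris.csv):
import seaborn as sns
import matplotlib.pyplot as plt

#scatter plot matrix of the four iris features, coloured by class
sns.pairplot(iris, vars=["SepalLength","SepalWidth","PetalLength","PetalWidth"], hue="Name")
plt.show()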
VAT - Visual Assessment for Clustering Tendency¶
We've already seen the VAT algorithm for visualising the clustering tendency of a dataset. Below is python code for VAT. You can treat it as a black box (not worrying about the internal coding details) - a function which can be used to execute VAT on an input dataset.
import numpy as np
import math,random
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def VAT(R):
    """
    VAT algorithm adapted from matlab version:
    http://www.ece.mtu.edu/~thavens/code/VAT.m

    Args:
        R (n*n double): Dissimilarity data input
        R (n*D double): vector input (R is converted to sq. Euclidean distance)
    Returns:
        RV (n*n double): VAT-reordered dissimilarity data
        C (n int): Connection indexes of MST in [0,n)
        I (n int): Reordered indexes of R, the input data in [0,n)
    """
    R = np.array(R)
    N, M = R.shape
    if N != M:
        R = squareform(pdist(R))

    J = list(range(0, N))

    y = np.max(R, axis=0)
    i = np.argmax(R, axis=0)
    j = np.argmax(y)
    y = np.max(y)

    I = i[j]
    del J[I]

    y = np.min(R[I,J], axis=0)
    j = np.argmin(R[I,J], axis=0)

    I = [I, J[j]]
    J = [e for e in J if e != J[j]]

    C = [1, 1]
    for r in range(2, N-1):
        y = np.min(R[I,:][:,J], axis=0)
        i = np.argmin(R[I,:][:,J], axis=0)
        j = np.argmin(y)
        y = np.min(y)
        I.extend([J[j]])
        J = [e for e in J if e != J[j]]
        C.extend([i[j]])

    y = np.min(R[I,:][:,J], axis=0)
    i = np.argmin(R[I,:][:,J], axis=0)

    I.extend(J)
    C.extend(i)

    RI = list(range(N))
    for idx, val in enumerate(I):
        RI[val] = idx

    RV = R[I,:][:,I]

    return RV.tolist(), C, I
Visualising the iris dataset using VAT¶
We will first recreate the visualisations of the iris dataset used in lectures (lecture 7). First, a heatmap of the raw iris data can be displayed (this step is commented out in the code below). Secondly, a randomly ordered dissimilarity matrix for the objects in iris is shown; notice the lack of structure. Thirdly, the VAT visualisation is produced. The heatmap function from the seaborn package is employed as a convenient tool for plotting heatmaps.
Below is an example of the VAT algorithm applied to the same iris dataset
iris = pd.read_csv('iris.csv', dtype=None) ###read in data
iris2 = iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns
iris2 #display the feature columns
SepalLength SepalWidth PetalLength PetalWidth
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
150 rows × 4 columns
import seaborn as sns
##########################################################
#######Read in the dataset###############
##########################################################
iris= pd.read_csv('iris.csv',dtype=None) ###read in data
iris2=iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns
####Draw heatmap of raw Iris matrix#######
#sns.heatmap(iris2,cmap='viridis',xticklabels=True,yticklabels=False)
#plt.show()
####Visualise the dissimilarity matrix for Iris using a heatmap (without applying VAT)####
iris3=iris2.copy().values
np.random.shuffle(iris3) ####randomise the order of rows (objects)
sq = squareform(pdist(iris3)) ###compute the dissimilarity matrix
ax=sns.heatmap(sq,cmap='viridis',xticklabels=False,yticklabels=False)
ax.set(xlabel='Objects', ylabel='Objects')
plt.show()
#####Apply VAT Algorithm to Iris dataset and visualise using heatmap########
RV, C, I = VAT(iris2)
x=sns.heatmap(RV,cmap='viridis',xticklabels=False,yticklabels=False)
x.set(xlabel='Objects', ylabel='Objects')
plt.show()
Question 6)¶
Plot the VAT heatmap for the iris data. How many clusters does the VAT visualisation reveal? How does this compare to the PCA scatterplot?
Answer: Two clear clusters are visible on the diagonal. There are three classes (species) of iris, so it is at first surprising that we can see only two clusters. However, two of the classes (virginica and versicolor) are very close together and could arguably be treated as one cluster. Setosa is well separated from these. This largely matches what we saw in the PCA plot earlier.
If you get time: Practicing VAT and PCA¶
You will now practice using the australian crabs dataset (australian-crabs.csv). This data describes 200 crabs collected from Fremantle, Western Australia. There are two species of crabs: blue and orange. Within each species there are males and females. There are 5 features:
FL - frontal lip
RW - rear width
CL - carapace length
CW - carapace width
BD - body depth
A scatter plot matrix of the first four of these features is shown in the original notebook (figure not included here).
Question 7)¶
Adapt the iris example to produce a VAT heatmap of the australian crabs dataset. How many clusters are there?
###Answer 7
crabsall = pd.read_csv('australian-crabs.csv')
print(crabsall.shape)
crabsall.head(5)
species sex index FL RW CL CW BD
0 Blue Male 1 8.1 6.7 16.1 19.0 7.0
1 Blue Male 2 8.8 7.7 18.1 20.8 7.4
2 Blue Male 3 9.2 7.8 19.0 22.4 7.7
3 Blue Male 4 9.6 7.9 20.1 23.1 8.2
4 Blue Male 5 9.8 8.0 20.3 23.0 8.2
crabsall = pd.read_csv('australian-crabs.csv')
crabs = crabsall[['FL','RW','CL','CW','BD']]
crabs_std = crabs #no standardisation applied here; the raw feature values are used
RV, C, I = VAT(crabs_std)
x = sns.heatmap(RV, cmap='viridis', xticklabels=False, yticklabels=False)
x.set(xlabel='Objects', ylabel='Objects')
plt.show()
#We can see two large clusters and one small cluster on the diagonal.
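As a possible extension (not part of the original answer): since the variable name crabs_std suggests standardisation was intended, here is a sketch of scaling the features to zero mean and unit variance before running VAT, which can change how clearly the clusters separate.
from sklearn import preprocessing

#standardise each feature before computing the dissimilarity matrix used by VAT
crabs_scaled = preprocessing.StandardScaler().fit_transform(crabs)
RV, C, I = VAT(crabs_scaled)
x = sns.heatmap(RV, cmap='viridis', xticklabels=False, yticklabels=False)
x.set(xlabel='Objects', ylabel='Objects')
plt.show()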