程序代写 COMP20008 2021S1 workshop week 11¶

week11-2021-sem2-answers

COMP20008 2021S1 workshop week 11¶

Copyright By PowCoder代写 加微信 powcoder

Chi Squared Feature Selection¶
The following code implements the example in Slide 19 of the Experimental design lecture

import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency

data = pd.DataFrame(np.array([[1,1,1],[1,0,1],[0,1,0],[0,0,0]]), columns=[‘a1′,’a2′,’c’])
features=data[[‘a1′,’a2’]]
class_label = data[‘c’]
cont_table = pd.crosstab(class_label,features[‘a1’])
chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False)
print(‘Chi2 value: ‘,chi2_val)
if(p<0.05) : print('Null hypothesis rejected, p value: ', p) print('Null hypothesis accepted, p value: ', p) Chi2 value: 4.0 Null hypothesis rejected, p value: 0.04550026389635857 Question 2)¶ Adapt the example above to calculate the Chi2 values for each feature in Question 1. Ensure that the results agree with your answer to Question 1. import scipy.stats as stats from scipy.stats import chi2_contingency data = pd.DataFrame(np.array([[1,0,1],[1,1,1],[1,1,1],[1,0,0],[1,1,1],[0,0,0],[0,0,0],[0,0,0],[1,1,0],[0,0,0]]), columns=['A','B','Class']) features=data[['A','B']] class_label = data['Class'] for feature in ['A','B'] : cont_table = pd.crosstab(class_label,features[feature]) chi2_val, p, dof, expected = stats.chi2_contingency(cont_table.values, correction=False) print('Chi2 value for feature', feature,': ',chi2_val) if(p<0.05) : print('Null hypothesis rejected for feature', feature, 'p value:', p) print('Null hypothesis accepted for feature', feature, 'p value:', p) Chi2 value for feature A : 4.444444444444445 Null hypothesis rejected for feature A p value: 0.03501498101966245 Chi2 value for feature B : 3.4027777777777777 Null hypothesis accepted for feature B p value: 0.0650867264927665 Experimental Evaluation¶ K-fold cross validation is important to ensure that the results we report are reliable, and not merely the result of a 'lucky' test_train split. Below is an example of K-fold cross validation applied to the World Development Index dataset. Note also the process in the loop below - we've started by doing the test_train split then performed other functions like scaling and imputation on the training set and applied the results to the testing set. This is important to ensure that we don't violate the test_train split and apply our understanding of the testing set when building our model. world= pd.read_csv('world_org.csv', encoding="iso-8859-1") life = pd.read_csv('life.csv') world.set_index('Country Code') life.set_index('Country Code') all_data = world.merge(life) import pandas as pd import matplotlib.pyplot as plt from sklearn import neighbors from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn import preprocessing from sklearn.impute import SimpleImputer from sklearn.model_selection import KFold ##get just the features data=all_data.iloc[:, 5:-1] ##get just the class labels classlabel=all_data['Life expectancy at birth (years)'] kf = KFold(n_splits=k, shuffle=True, random_state=42) acc_score = [] for train_index, test_index in kf.split(data): #Perform the split for this fold X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :] y_train, y_test = classlabel[train_index], classlabel[test_index] #Scale the data scaler = preprocessing.StandardScaler().fit(X_train) X_train=scaler.transform(X_train) X_test=scaler.transform(X_test) #Impute missing values via mean imputation imp = SimpleImputer(missing_values=np.nan, strategy='mean') X_train = imp.fit_transform(X_train) X_test = imp.transform(X_test) #Train k-nn classifier knn = neighbors.KNeighborsClassifier(n_neighbors=5) knn.fit(X_train, y_train) #Predict result y_pred=knn.predict(X_test) acc_score.append(accuracy_score(y_test, y_pred)) print(acc_score) #Display average of accuracy scores avg_acc_score = sum(acc_score)/k print(avg_acc_score) [0.7368421052631579, 0.8421052631578947, 0.8421052631578947, 0.7222222222222222, 0.8333333333333334, 0.6666666666666666, 0.8333333333333334, 0.6111111111111112, 0.7222222222222222, 0.4444444444444444] 0.7254385964912282 Question 3¶ Experiment with different values of k and the random_state parameter. What might be an optimal k value in this case? How could we further improve the reliability of our results? Answer: Generally we would expect larger values of k to give more accurate results as a larger amount of data is being trained on. Values like 2 or 3 are unlikely to be as effective. Higher values of k require more processing time, however. Note that there is a lot of variability still present - which you can see by changing the random_state parameter. Note also that we're NOT trying to choose the value of k which happens to give us the highest accuracy (e.g. 12 is not a better choice than 13 just because the average accuracy is higher) - that's simply random behavior and won't be replicated on other datasets. To further improve the accuracy we could use a Repeated k-Fold cross validator (sklearn.RepeatedKFold), which applies cross validation several times using the same k with different random_states. Principal components analysis¶ Principal components analysis can be used for transforming data into a different (lower dimensional) representation. This is particularly useful for visualisation, computational efficiency and removing noisy data. The python sci-kit learn package (sklearn) contains functions which can be used for PCA. Consider the example below of introducing PCA to the previous task import pandas as pd import matplotlib.pyplot as plt from sklearn import neighbors from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn import preprocessing from sklearn.impute import SimpleImputer from sklearn.model_selection import KFold from sklearn.decomposition import PCA ##get just the features data=all_data.iloc[:, 5:-1] ##get just the class labels classlabel=all_data['Life expectancy at birth (years)'] kf = KFold(n_splits=k, shuffle=True, random_state=42) acc_score = [] for train_index, test_index in kf.split(data): #Perform the split for this fold X_train, X_test = data.iloc[train_index, :], data.iloc[test_index, :] y_train, y_test = classlabel[train_index], classlabel[test_index] #Scale the data scaler = preprocessing.StandardScaler().fit(X_train) X_train=scaler.transform(X_train) X_test=scaler.transform(X_test) #Impute missing values via mean imputation imp = SimpleImputer(missing_values=np.nan, strategy='mean') X_train = imp.fit_transform(X_train) X_test = imp.transform(X_test) #Perform PCA pca = PCA(n_components=2) X_train = pca.fit_transform(X_train) X_test = pca.transform(X_test) #Train k-nn classifier knn = neighbors.KNeighborsClassifier(n_neighbors=5) knn.fit(X_train, y_train) #Predict result y_pred=knn.predict(X_test) acc_score.append(accuracy_score(y_test, y_pred)) print(acc_score) #Display average of accuracy scores avg_acc_score = sum(acc_score)/k print(avg_acc_score) [0.7368421052631579, 0.6842105263157895, 0.631578947368421, 0.8333333333333334, 0.7222222222222222, 0.6666666666666666, 0.8888888888888888, 0.6666666666666666, 0.7777777777777778, 0.5555555555555556] 0.7163742690058479 Question 4¶ Experiment with different numbers of components. What gives the best result? How could you decide on the appropriate number of principal components to use? Answer: It's possible that we might want to use a fixed number of components (e.g. 2 or 3) for visualisation purposes, but for model building it's useful to look at the variance explained by the principal components we've used so far (sklearn_pca.explained_varianceratio), as in the next example. If the vast majority of the variance has been explained by the first x principal components, then x is probably a good number to use. Visualisation using PCA¶ Consider the example below of applying PCA on the iris dataset. iris= pd.read_csv('iris.csv',dtype=None) ###read in data iris2=iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns SepalLength SepalWidth PetalLength PetalWidth 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 ... ... ... ... ... 145 6.7 3.0 5.2 2.3 146 6.3 2.5 5.0 1.9 147 6.5 3.0 5.2 2.0 148 6.2 3.4 5.4 2.3 149 5.9 3.0 5.1 1.8 150 rows × 4 columns import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA ########################################################## #######Example of performing PCA on Iris dataset and visualising#################### ########################################################## sklearn_pca = PCA(n_components=2) #we want just the first two PCs iris_sklearn = sklearn_pca.fit_transform(iris2) print("Variance explained by each PC",sklearn_pca.explained_variance_ratio_) #print out the amount of variance explained by each PC #set up the colour scheme palette=palette = ['blue','green','red'] colors=iris.Name.replace(to_replace=iris.Name.unique(),value=palette).tolist() #plot the objects along the first two principal components, using the colour scheme plt.scatter(iris_sklearn[:,0],iris_sklearn[:,1],s=60,c=colors) #plot the PC's in 2D - s marker size plt.xlabel('1st Principal Component', fontsize=28) plt.ylabel('2nd Principal Component', fontsize=28) plt.show() Variance explained by each PC [0.92461621 0.05301557] Question 5)¶ What can you observe about the clustering behavior of the iris dataset from the plot above? What other techniques could you use to help visualise the clustering behavior? We can see two of the three classes are very close together while the other is more clearly separated. Scatter plot matrices could be useful, as would VAT. VAT - Visual Assessment for Clustering Tendency¶ We've already seen the VAT algorithm for visualising the clustering tendency of a dataset. Below is python code for VAT. You can treat it as a black box (not worrying about the internal coding details) - a function which can be used to execute VAT on an input dataset. import numpy as np import math,random from scipy.spatial.distance import pdist, squareform import pandas as pd import matplotlib.pyplot as plt %matplotlib inline def VAT(R): VAT algorithm adapted from matlab version: http://www.ece.mtu.edu/~thavens/code/VAT.m R (n*n double): Dissimilarity data input R (n*D double): vector input (R is converted to sq. Euclidean distance) RV (n*n double): VAT-reordered dissimilarity data C (n int): Connection indexes of MST in [0,n) I (n int): Reordered indexes of R, the input data in [0,n) R = np.array(R) N, M = R.shape if N != M: R = squareform(pdist(R)) J = list(range(0, N)) y = np.max(R, axis=0) i = np.argmax(R, axis=0) j = np.argmax(y) y = np.max(y) y = np.min(R[I,J], axis=0) j = np.argmin(R[I,J], axis=0) I = [I, J[j]] J = [e for e in J if e != J[j]] for r in range(2, N-1): y = np.min(R[I,:][:,J], axis=0) i = np.argmin(R[I,:][:,J], axis=0) j = np.argmin(y) y = np.min(y) I.extend([J[j]]) J = [e for e in J if e != J[j]] C.extend([i[j]]) y = np.min(R[I,:][:,J], axis=0) i = np.argmin(R[I,:][:,J], axis=0) I.extend(J) C.extend(i) RI = list(range(N)) for idx, val in enumerate(I): RI[val] = idx RV = R[I,:][:,I] return RV.tolist(), C, I Visualising iris datset using VAT¶ We will first recreate the visualisations of the iris dataset used in lectures (lecture 7). Info about the iris dataset is here. First a heatmap of the raw iris dataset is displayed. Secondly a randomly ordered dissimilarity matrix for the objects in iris is shown - notice the lack of structure. Thirdly the VAT visualisation is produced. The heatmap function from the seaborn package is employed as a convenient tool for plotting heatmaps. Below is an example of the VAT algorithm applied to the same iris dataset iris= pd.read_csv('iris.csv',dtype=None) ###read in data iris2=iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns SepalLength SepalWidth PetalLength PetalWidth 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 ... ... ... ... ... 145 6.7 3.0 5.2 2.3 146 6.3 2.5 5.0 1.9 147 6.5 3.0 5.2 2.0 148 6.2 3.4 5.4 2.3 149 5.9 3.0 5.1 1.8 150 rows × 4 columns import seaborn as sns ########################################################## #######Read in the datset############### ########################################################## iris= pd.read_csv('iris.csv',dtype=None) ###read in data iris2=iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]] #retain a copy with only these columns ####Draw heatmap of raw Iris matrix#######j #sns.heatmap(iris2,cmap='viridis',xticklabels=True,yticklabels=False) #plt.show() ####Visualise the dissimilarity matrix for Iris using a heatmap (without applying VAT)#### iris3=iris2.copy().values np.random.shuffle(iris3) ####randomise the order of rows (objects) sq = squareform(pdist(iris3)) ###compute the dissimilarity matrix ax=sns.heatmap(sq,cmap='viridis',xticklabels=False,yticklabels=False) ax.set(xlabel='Objects', ylabel='Objects') plt.show() #####Apply VAT Algorithm to Iris dataset and visualise using heatmap######## RV, C, I = VAT(iris2) x=sns.heatmap(RV,cmap='viridis',xticklabels=False,yticklabels=False) x.set(xlabel='Objects', ylabel='Objects') plt.show() Question 6)¶ Plot VAT heatmap for iris data and tell how many clusters does the VAT visualisation reveal? How does this compare to the PCA scatterplot? Two clear clusters on the diagonal. There are three clusters (types of) iris, so it is at first surprising we can see only two clusters. However, two of the clusters (virginica and versicolor) are very close together and could arguably be one cluster. Setosa is well separated from these. This largely matches what we've seen in the PCA plot before If you get time: Practicing VAT and PCA¶ You will now practice using the australian crabs dataset from this file. This data describes 200 crabs collected from Fremantle Western Australia. There are two species of crabs - blue and orange. Within each species there are male and female. There are 5 features: FL - frontal lip RW - rear width CL - carapace length CW - carapace width BD - body depth The first four of these are visualised as follows: Question 7)¶ Adapt the iris example to produce a VAT heatmap of the australian crabs dataset. How many clusters are there? ###Answer 7 crabsall = pd.read_csv('australian-crabs.csv') print(crabsall.shape) crabsall.head(5) species sex index FL RW CL CW BD 0 Blue Male 1 8.1 6.7 16.1 19.0 7.0 1 Blue Male 2 8.8 7.7 18.1 20.8 7.4 2 Blue Male 3 9.2 7.8 19.0 22.4 7.7 3 Blue Male 4 9.6 7.9 20.1 23.1 8.2 4 Blue Male 5 9.8 8.0 20.3 23.0 8.2 crabsall = pd.read_csv('australian-crabs.csv') crabs=crabsall[['FL','RW','CL','CW','BD']] crabs_std=crabs RV, R, I = VAT(crabs_std) x = sns.heatmap(RV, cmap='viridis', xticklabels=False, yticklabels=False) x.set(xlabel='Objects', ylabel='Objects') plt.show() #can see two large clusters and one small cluster on the diagonal. 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com