COMP9417 18s1 Assignment 1: Applying Machine Learning¶
Last revision: Mon Mar 19 19:17:25 AEDT 2018
The aim of this assignment is to enable you to apply different machine learning algorithms implemented in the Python scikit-learn machine learning library on a variety of datasets and answer questions based on your analysis and interpretation of the empirical
results, using your knowledge of machine learning.
After completing this assignment you will be able to:
set up replicated $k$-fold cross-validation experiments to obtain average performance measures of algorithms on datasets
compare the performance of different algorithms against a baseline and each other
aggregate comparative performance scores for algorithms over a range of different datasets
propose properties of algorithms and their parameters, or datasets, which may lead to performance differences being observed
suggest reasons for actual observed performance differences in terms of properties of algorithms, parameter settings or datasets
apply methods for data transformations and parameter search and evaluate their effects on the performance of algorithms
There are a total of 20 marks available.
Each mark is worth 0.5 course marks, i.e., assignment marks will be scaled
to a course mark out of 10 to contribute to the course total.
Deadline: 23:59:59, Monday April 2, 2018.
Submission will be via the CSE give system (see below).
Late penalties: one mark will be deducted from the total for each day late, up to a total of five days. If six or more days late, no marks will be given.
Recall the guidance regarding plagiarism in the course introduction: this applies to this assignment and if evidence of plagiarism is detected it may result in penalties ranging from loss of marks to suspension.
Format of the questions¶
There are 4 questions in this assignment. Each question has two parts: the Python code which must be run to generate the output results on the given datasets, and the responses you give in the file answers.txt on your analysis and interpretation of the results produced by running these learning algorithms for the question. Marks are given for both parts: submitting correct output from the code, and giving correct responses. For each question, you will need to save the output results from running the code to a separate plain text file. There will also be a plain text file containing the questions which you will need to edit to specify your answers. These files will form your submission.
In summary, your submission will comprise a total of 5 (five) files which should be named as follows:
q1.out
q2.out
q3.out
q4.out
answers.txt
Please note: files in any format other than plain text cannot be accepted.
Submit your files using give. On a CSE Linux machine, type the following on the command-line:
$ give cs9417 ass1 q1.out q2.out q3.out q4.out answers.txt
Alternatively, you can submit using the web-based interface to give.
Datasets¶
You can download the datasets required for the assignment here.
Note: you will need to
Question 1¶
For this question the objective is to run two different learning algorithms on a range of different sample sizes taken from the same training set to assess the effect of training sample size on error. You will use the nearest neighbour classifier and the decision tree classifier to generate two different sets of “learning curves” on 8 real-world datasets:
anneal.arff
audiology.arff
autos.arff
credit-a.arff
hypothyroid.arff
letter.arff
microarray.arff
vote.arff
Running the classifiers [2 marks]¶
You will run the following code section, and save the results to a plain text file “q1.out”. You will also need to write your own code to compute the error reduction for question 1(b).
The output of the code section is two tables, which show the percentage classification error for the nearest neighbour and the decision tree algorithms respectively. The first column contains the result of the baseline classifier, which simply predicts the majority class. From the second column on, the results are obtained by running the nearest neighbour or decision tree algorithm on $10\%$, $25\%$, $50\%$, $75\%$, and $100\%$ of the data. Standard deviations are shown in brackets, and an asterisk indicates that the result is significantly different from the baseline (Welch's t-test, $p < 0.05$).
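As a rough illustration of how one cell of such a table could be produced, the sketch below runs replicated 10-fold cross-validation and compares a 1-nearest-neighbour classifier with the majority-class baseline. It is only a sketch under stated assumptions: the data come from `make_classification` rather than the ARFF files, and the helper name `replicated_cv_error` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def replicated_cv_error(model, X, y, seeds, n_splits=10):
    """Return one mean error per replicate of shuffled k-fold cross-validation."""
    errors = []
    for seed in seeds:
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        accuracy = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
        errors.append(1.0 - accuracy.mean())
    return np.array(errors)


# Synthetic stand-in for one dataset (assumption: not one of the ARFF files).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
seeds = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31]

baseline = replicated_cv_error(DummyClassifier(strategy="most_frequent"), X, y, seeds)
nearest = replicated_cv_error(KNeighborsClassifier(n_neighbors=1), X, y, seeds)

# Welch's t-test (unequal variances) between the two sets of replicate errors.
_, p_value = ttest_ind(baseline, nearest, equal_var=False)
flag = "*" if p_value < 0.05 else " "
print("baseline {:.2%} | 1-NN {:.2%} ({:.2%}) {}".format(
    baseline.mean(), nearest.mean(), nearest.std(), flag))
```

Each replicate reshuffles the folds with a different seed, so the standard deviation reflects variation across fold assignments rather than across individual folds.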
Result interpretation [6 marks]¶
Answer these questions in the file called answers.txt. Your answers must be based on the results you saved in “q1.out”. Please note: the goal of these questions is to attempt to explain why you think the results you obtained are as they are.
1(a). [2 marks] Refer to answers.txt.
1(b). [4 marks] For each algorithm over all of the datasets, find the average change in error when moving from the default prediction to learning from 10% of the training set as follows.
Let the error of the baseline be $err_0$ and the error on 10% of the training set be $err_{10}$.
For each algorithm, calculate the percentage reduction in error relative to the default on each dataset as:
\begin{equation*}
\frac{err_0 - err_{10}}{err_0} \times 100.
\end{equation*}
Now repeat exactly the same process by comparing the two classifiers over all of the datasets, learning from $100\%$ of the training set, compared to default. Organise your results by grouping them into a 2 by 2 table, something like this:
Mean error reduction relative to default
Algorithm         | After 10% training | After 100% training
Nearest Neighbour | Your result        | Your result
Decision Tree     | Your result        | Your result
The entries from this table should be inserted into the correct places in your file answers.txt.
Once you have done this, complete the rest of the answers for question 1 in your file answers.txt.
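The calculation for question 1(b) could be sketched as follows. The error values are made-up placeholders, not real results, and the helper name `mean_error_reduction` is hypothetical. The sketch assumes that "relative to the default" means dividing by the baseline error.

```python
import numpy as np

# Hypothetical percentage errors for three datasets (placeholders, NOT real
# results): column 0 is the baseline, column 1 is 10%, column 2 is 100%.
errors = np.array([
    [40.0, 30.0, 20.0],
    [50.0, 25.0, 10.0],
    [20.0, 18.0, 15.0],
])


def mean_error_reduction(err_base, err_model):
    """Mean over datasets of (err_base - err_model) / err_base * 100."""
    err_base = np.asarray(err_base, dtype=float)
    err_model = np.asarray(err_model, dtype=float)
    return float(np.mean((err_base - err_model) / err_base * 100.0))


after_10 = mean_error_reduction(errors[:, 0], errors[:, 1])
after_100 = mean_error_reduction(errors[:, 0], errors[:, 2])
print("After 10%: {:.2f}  After 100%: {:.2f}".format(after_10, after_100))
```

The reduction is computed per dataset first and averaged afterwards, so a dataset with a small baseline error carries the same weight as one with a large baseline error.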
In [1]:
# code for question 1
import arff
import numpy as np
from itertools import product
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score, KFold
from sklearn.utils import resample
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import ttest_ind
seeds = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37]
score_list = []
for fname in ["anneal.arff", "audiology.arff", "autos.arff", "credit-a.arff",
              "hypothyroid.arff", "letter.arff", "microarray.arff", "vote.arff"]:
    # load nominal values as integer codes; the steps below rely on this
    with open(fname, 'r') as f:
        dataset = arff.load(f, encode_nominal = True)
    data = np.array(dataset['data'])
    X = np.array(data)[:, :-1]
    Y = np.array(data)[:, -1]
    # turn unknown/none/? into a separate value
    for i, j in product(range(len(data)), range(len(data[0]) - 1)):
        if X[i, j] is None:
            X[i, j] = len(dataset['attributes'][j][1])
    # a hack to turn negative categories positive for autos.arff
    for i in range(Y.shape[0]):
        if Y[i] < 0:
            Y[i] += 7
    # identify and extract categorical/non-categorical features
    # (numeric attributes have a string type, nominal ones a list of values)
    categorical, non_categorical = [], []
    for i in range(len(dataset['attributes']) - 1):
        if isinstance(dataset['attributes'][i][1], str):
            non_categorical.append(X[:, i])
        else:
            categorical.append(X[:, i])
    categorical = np.array(categorical).T
    non_categorical = np.array(non_categorical).T
    if categorical.shape[0] == 0:
        transformed_X = non_categorical
    else:
        # encode categorical features (one-hot, converted to a dense array)
        encoder = OneHotEncoder(dtype = np.int32,
                                handle_unknown = 'error')
        encoder.fit(categorical)
        categorical = encoder.transform(categorical).toarray()
        if non_categorical.shape[0] == 0:
            transformed_X = categorical
        else:
            transformed_X = np.concatenate((categorical, non_categorical), axis = 1)
    # concatenate the feature array and the labels for resampling purpose
    Y = np.array([Y], dtype = int)
    input_data = np.concatenate((transformed_X, Y.T), axis = 1)
    # build the models: 1 baseline, 5 nearest neighbour runs, 5 decision tree runs
    models = [DummyClassifier(strategy = 'most_frequent')] \
             + [KNeighborsClassifier(n_neighbors = 1, algorithm = "brute")] * 5 \
             + [DecisionTreeClassifier()] * 5
    # resample and run cross validation
    portion = [1.0, 0.1, 0.25, 0.5, 0.75, 1.0, 0.1, 0.25, 0.5, 0.75, 1.0]
    sample, scores = [None] * 11, [None] * 11
    for i in range(11):
        sample[i] = resample(input_data,
                             replace = False,
                             n_samples = int(portion[i] * input_data.shape[0]),
                             random_state = seeds[i])
        # 10 replicates of 10-fold cross validation, each with a different shuffle
        score = [None] * 10
        for j in range(10):
            score[j] = np.mean(cross_val_score(models[i],
                                               sample[i][:, :-1],
                                               sample[i][:, -1].astype(int),
                                               scoring = 'accuracy',
                                               cv = KFold(10, shuffle = True, random_state = seeds[j])))
        scores[i] = score
    # store error (1 - accuracy) per model and replicate
    score_list.append((fname[:-5], 1 - np.array(scores)))
# print the results
header = ["{:^123}".format("Nearest Neighbour Results") + '\n' + '-' * 123 + '\n' + \
          "{:^15} | {:^10} | {:^16} | {:^16} | {:^16} | {:^16} | {:^16}" \
          .format("Dataset", "Baseline", "10%", "25%", "50%", "75%", "100%"),
          "{:^123}".format("Decision Tree Results") + '\n' + '-' * 123 + '\n' + \
          "{:^15} | {:^10} | {:^16} | {:^16} | {:^16} | {:^16} | {:^16}" \
          .format("Dataset", "Baseline", "10%", "25%", "50%", "75%", "100%")]
# index of the first result column for each algorithm in the scores array
offset = [1, 6]
for k in range(2):
    print(header[k])
    for i in range(8):
        scores = score_list[i][1]
        # Welch's t-test of each sample size against the baseline
        p_value = [None] * 5
        for j in range(5):
            _, p_value[j] = ttest_ind(scores[0], scores[j + offset[k]], equal_var = False)
        print("{:<15} | {:>10.2%}".format(score_list[i][0], np.mean(scores[0])), end = '')
        for j in range(5):
            print(" | {:>6.2%} ({:>5.2%}) {}".format(np.mean(scores[j + offset[k]]),
                                                     np.std(scores[j + offset[k]]),
                                                     '*' if p_value[j] < 0.05 else ' '), end = '')
        print()
    print()
Question 2¶
Dealing with noisy data is a key issue in machine learning. Unfortunately, even algorithms that have noise-handling mechanisms built-in, like decision trees, can overfit noisy data, unless their “overfitting avoidance” or regularization parameters are set properly.
The datasets you will be using have had various amounts of “class noise” added
by randomly changing the actual class value to a different one for a
specified percentage of the training data.
Here we will specify three arbitrarily chosen levels of noise: low
($20\%$), medium ($50\%$) and high ($80\%$).
The learning algorithm must try to “see through” this noise and learn
the best model it can, which is then evaluated on test data without
added noise to evaluate how well it has avoided fitting the noise.
We will also let the algorithm do a limited search using cross-validation
for the best over-fitting avoidance parameter settings on each training set.
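The way class noise is added can be sketched as below. This mirrors the approach used in the `preprocess` function of the code for this question, but it is only an illustration: the helper name `add_class_noise` is hypothetical, and the labels are assumed to be integer-encoded as 0 to `n_classes - 1`.

```python
import numpy as np


def add_class_noise(labels, n_classes, noise_ratio, seed=0):
    """Change the class of the first noise_ratio fraction of labels to a different one."""
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)
    n_noisy = int(len(labels) * noise_ratio)
    # Offsets in 1..n_classes-1 guarantee the new label differs from the old one.
    offsets = rng.randint(1, n_classes, size=n_noisy)
    offsets = np.concatenate((offsets, np.zeros(len(labels) - n_noisy, dtype=int)))
    return (labels + offsets) % n_classes


clean = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
noisy = add_class_noise(clean, n_classes=3, noise_ratio=0.5)
print("changed:", int(np.sum(noisy != clean)), "of", len(clean))
```

Because every offset is non-zero modulo the number of classes, a noise level of 50% changes exactly half of the labels rather than roughly half.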
Running the classifiers [2 marks]¶
You will run the following code section, and save the results to a plain text file “q2.out”.
The output of the code section is a table, which shows the percentage accuracy of classification for the decision tree algorithm. The first column contains the result of the “Default” classifier, which is the decision tree algorithm with default parameter settings run on each of the datasets with $50\%$ noise added. From the second column on, each column shows the results of running the decision tree algorithm with $0\%$, $20\%$, $50\%$ and $80\%$ noise added to each of the datasets. The value in parentheses is the result of a grid search applied to determine the best value for a basic parameter of the decision tree algorithm, namely min_samples_leaf, i.e., the minimum number of examples that can be used to make a prediction in the tree, on that dataset.
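A minimal sketch of the search described here, assuming synthetic two-class data from `make_classification` in place of the assignment datasets: noise is added to the training labels only, `GridSearchCV` picks `min_samples_leaf` by 10-fold cross-validation on the noisy training set, and accuracy is measured on clean test labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not one of the assignment datasets).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Flip 20% of the training labels; the test labels stay clean.
rng = np.random.RandomState(0)
flip = rng.choice(len(y_train), size=int(0.2 * len(y_train)), replace=False)
y_noisy = y_train.copy()
y_noisy[flip] = 1 - y_noisy[flip]

# Candidate values 2, 7, 12, 17, 22, 27, as in the assignment code.
param_grid = {"min_samples_leaf": np.arange(2, 30, 5)}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X_train, y_noisy)

best_leaf = int(search.best_params_["min_samples_leaf"])
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("{:>6.2%} ({:>2})".format(test_accuracy, best_leaf))
```

The printed pair has the same shape as one cell of the results table: test accuracy followed by the selected `min_samples_leaf` in parentheses.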
Result interpretation [3 marks]¶
Answer these questions in the file called answers.txt. Your answers must be based on the results you saved in “q2.out”.
2(a). [2 marks] Refer to answers.txt.
2(b). [1 mark] Refer to answers.txt.
In [18]:
# code for question 2
import arff, numpy as np
from sklearn import tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import random
In [19]:
from sklearn.compose import ColumnTransformer
# fixed random seed
np.random.seed(0)
random.seed(0)
def label_enc(labels):
    le = preprocessing.LabelEncoder()
    le.fit(labels)
    return le
def features_encoders(features, categorical_features):
    # label-encode every column to integers
    n_samples, n_features = features.shape
    label_encoders = [preprocessing.LabelEncoder() for _ in range(n_features)]
    X_int = np.zeros_like(features, dtype=int)
    for i in range(n_features):
        feature_i = features[:, i]
        label_encoders[i].fit(feature_i)
        X_int[:, i] = label_encoders[i].transform(feature_i)
    # one-hot encode the columns listed in categorical_features and pass the
    # other columns through unchanged; sparse_threshold=0 returns a dense array
    # (ColumnTransformer requires scikit-learn 0.20 or later)
    enc = ColumnTransformer([('onehot', preprocessing.OneHotEncoder(), categorical_features)],
                            remainder='passthrough', sparse_threshold=0)
    return enc.fit(X_int), label_encoders
def feature_transform(features, label_encoders, one_hot_encoder):
    n_samples, n_features = features.shape
    X_int = np.zeros_like(features, dtype=int)
    for i in range(n_features):
        feature_i = features[:, i]
        X_int[:, i] = label_encoders[i].transform(feature_i)
    return one_hot_encoder.transform(X_int)
In [ ]:
def load_data(path):
    with open(path, 'r') as f:
        dataset = arff.load(f)
    data = np.array(dataset['data'])
    attr = dataset['attributes']
    # mask categorical features
    masks = []
    for i in range(len(attr)-1):
        if attr[i][1] != 'REAL':
            masks.append(i)
    return data, masks
def preprocess(data, masks, noise_ratio):
    # split data
    train_data, test_data = train_test_split(data,test_size=0.3,random_state=0)
    # test data
    test_features = test_data[:,0:test_data.shape[1]-1]
    test_labels = test_data[:,test_data.shape[1]-1]
    # training data
    features = train_data[:,0:train_data.shape[1]-1]
    labels = train_data[:,train_data.shape[1]-1]
    classes = list(set(labels))
    # categorical features need to be encoded
    if len(masks):
        one_hot_enc, label_encs = features_encoders(data[:,0:data.shape[1]-1],masks)
        test_features = feature_transform(test_features,label_encs,one_hot_enc)
        features = feature_transform(features,label_encs,one_hot_enc)
    le = label_enc(data[:,data.shape[1]-1])
    labels = le.transform(train_data[:,train_data.shape[1]-1])
    test_labels = le.transform(test_data[:,test_data.shape[1]-1])
    # add noise: shift the first noise_ratio fraction of the training labels
    # by a non-zero offset so that each of them changes to a different class
    np.random.seed(0)
    noise = np.random.randint(len(classes)-1, size=int(len(labels)*noise_ratio))+1
    noise = np.concatenate((noise,np.zeros(len(labels) - len(noise),dtype=int)))
    labels = (labels + noise) % len(classes)
    return features,labels,test_features,test_labels
In [ ]:
# load data
paths = ['balance-scale','primary-tumor',
         'glass','heart-h']
noise = [0,0.2,0.5,0.8]
scores = []
params = []
for path in paths:
    path = 'datasets/' + path + '.arff'
    data, masks = load_data(path)
    score = []
    param = []
    # "Default" column: training on data with 50% noise, without a parameter search
    features, labels, test_features, test_labels = preprocess(data, masks, 0.5)
    default_tree = DecisionTreeClassifier(random_state=0,min_samples_leaf=2, min_impurity_decrease=0)
    default_tree.fit(features, labels)
    tree_preds = default_tree.predict(test_features)
    tree_performance = accuracy_score(test_labels, tree_preds)
    score.append(tree_performance)
    param.append(default_tree.get_params()['min_samples_leaf'])
    # training on data with noise 0%, 20%, 50%, 80%
    for noise_ratio in noise:
        features, labels, test_features, test_labels = preprocess(data, masks, noise_ratio)
        param_grid = {'min_samples_leaf': np.arange(2,30,5)}
        grid_tree = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,cv=10,return_train_score=True)
        grid_tree.fit(features, labels)
        estimator = grid_tree.best_estimator_
        tree_preds = grid_tree.predict(test_features)
        tree_performance = accuracy_score(test_labels, tree_preds)
        score.append(tree_performance)
        param.append(estimator.get_params()['min_samples_leaf'])
    scores.append(score)
    params.append(param)
# print the results
header = "{:^123}".format("Decision Tree Results") + '\n' + '-' * 123 + '\n' + \
         "{:^15} | {:^16} | {:^16} | {:^16} | {:^16} | {:^16} |".format("Dataset", "Default", "0%", "20%", "50%", "80%")
# print result table
print(header)
for i in range(len(scores)):
    print("{:<15}".format(paths[i]), end=' ')
    for j in range(len(params[i])):
        print("| {:>6.2%} ({:>2}) ".format(scores[i][j], params[i][j]), end=' ')
    print('|\n')
print('\n')
Question 3¶
This question involves mining a data-set on California house prices from
census data in the 1990s.
We will be using linear regression to do this since the output is numeric.
Since this problem involves using attribute or feature transformations we
will need to do this using the numpy Python library.
Running the regression [1 mark]¶
In the following code section you are required to train a linear regression model using scikit-learn to fit the dataset. Select “median house value” as the target variable Y and the rest of the features as X. Then perform a 10-fold cross-validation of linear regression on the dataset. Save the intercept and coefficients of the resulting linear regression model, as well as the root mean squared error (RMSE) of the cross-validation, into a plain text file called “q3.out”. For this question the code to save the output to a file has been provided for you.
In [ ]:
# code for question 3
import arff,numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict,cross_val_score
from sklearn import metrics
#--------------Show the attributes--------------
dataset = arff.load(open('houses.arff',"r",encoding = "ISO-8859-1"))
attributes = np.array(dataset['attributes'])
attributes
In [ ]:
#--------------linear regression--------------
regr = linear_model.LinearRegression()
data = np.array(dataset['data'])
houses_X = data[:,1:] #X vector
houses_Y = data[:,0] #Y vector
regr.fit(houses_X, houses_Y)
intercept = regr.intercept_
print('Intercept:\n%.2e' % intercept,end='\n')
print('Coefficients:')
for coef in regr.coef_:
    print('%.2e' % coef,end=" ")
file = open('q3.out','a')
file.write('Intercept:\n%.2e\n' % intercept)
file.write('Coefficients:\n')
for coef in regr.coef_:
    file.write('%.2e' % coef)
    file.write(' ')
file.write('\n')
#--------------10-fold cross validation--------------
predicted = cross_val_predict(regr, houses_X, houses_Y, cv=10)
RMSE = np.sqrt(metrics.mean_squared_error(houses_Y, predicted))
print ('\nRMSE:\n%.2e\n' % RMSE)
file.write('RMSE:\n%.2e\n' % RMSE)
file.close()
Data transformation (feature construction)¶
Now apply a log transform to the class variable “median house value”. Then train a new linear regression model to fit the transformed dataset. Save (append) the intercept and coefficients into your “q3.out” file.
Experiment with at most two more sets of transformations to the variables (including squares, cubes, and logs of ratios), and run linear regression on the transformed data.
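One way such transformations could be built with numpy is sketched below. The data are synthetic and the column names (`income`, `rooms`, `households`) are hypothetical stand-ins, not the attributes of the houses dataset. The choice of transformations is only an example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data: the columns are hypothetical positive quantities.
rng = np.random.RandomState(0)
income = rng.uniform(1.0, 10.0, size=200)
rooms = rng.uniform(500.0, 5000.0, size=200)
households = rng.uniform(100.0, 1000.0, size=200)
price = (20000.0 * income * (rooms / households) ** 0.5
         * np.exp(rng.normal(0.0, 0.1, size=200)))

# Transformed feature set: original, square, cube, and log of a ratio.
X = np.column_stack((income, income ** 2, income ** 3,
                     np.log(rooms / households)))
log_price = np.log(price)  # log transform of the target

predicted = cross_val_predict(LinearRegression(), X, log_price, cv=10)
rmse = np.sqrt(mean_squared_error(log_price, predicted))
print("RMSE on log scale: %.2e" % rmse)
```

Note that an RMSE computed on a log-transformed target is on the log scale, so it cannot be compared directly with an RMSE computed on the untransformed target.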
Result interpretation [3 marks]¶
3(a). [2 marks] Refer to answers.txt.
3(b). [1 mark] Refer to answers.txt.
In [ ]:
#--------------linear regression on transformed dataset--------------
regr = linear_model.LinearRegression()
houses_X = data[:,1:] #X vector
houses_Y = data[:,0] #Y vector
#-----------log transformation------------------
logMedianHousePrice = np.log(houses_Y)
regr.fit(houses_X, logMedianHousePrice)
intercept = regr.intercept_
print('Intercept:\n%.2e' % intercept)
print('Coefficients:')
for coef in regr.coef_:
    print('%.2e' % coef,end=" ")
file = open('q3.out','a')
file.write('Intercept:\n%.2e\n' % intercept)
file.write('Coefficients:\n')
for coef in regr.coef_:
    file.write('%.2e' % coef)
    file.write(' ')
file.write('\n')
#--------------10-fold cross validation--------------
predicted = cross_val_predict(regr, houses_X, logMedianHousePrice, cv=10)
RMSE = np.sqrt(metrics.mean_squared_error(logMedianHousePrice, predicted))
print ('\nRMSE:\n%.2e\n'% RMSE)
file.write('RMSE:\n%.2e\n' % RMSE)
file.close()
Question 4¶
This question involves mining text data, for which machine learning algorithms typically use a transformation into a dataset of “word counts”. In the original dataset each text example is a string of words with a class label, and the sklearn transform converts this to a vector of word counts.
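The word-count transformation can be illustrated on two made-up snippets (hypothetical examples, not taken from the dataset). This is only a small sketch of what `CountVectorizer` produces with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up snippets (hypothetical examples, not from the dataset).
snippets = ["stock market news", "football news and football scores"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(snippets)  # sparse matrix of word counts

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(counts.toarray())
```

Each row is one snippet and each column one word, so the second row holds a 2 in the "football" column because that word occurs twice.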
The dataset contains “snippets”, short sequences of words taken from Google searches, each of which has been labelled with one of 8 classes, referred to as “sections”, such as business, sports, etc. The dataset is provided already split into a training set of $10,060$ snippets and a test set of $2,280$ snippets (for convenience, the combined dataset is also provided as a separate file).
Using a vector representation for text data means that we can use many of the standard classifier learning methods. However, such datasets are typically highly “sparse”, in the sense that for any example (i.e., piece of text) nearly all of its feature values are zero. To tackle this problem, we typically apply methods of feature selection (or dimensionality reduction). In this question you will investigate the effect of using the SelectKBest method to select the $K$ best features (words or tokens in this case) that appear to help classification accuracy.
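As a small-scale sketch of these two ideas, the code below measures the sparsity of a word-count matrix and then keeps the $K$ features with the highest chi-squared score. The corpus and labels are made up for illustration, and $K = 4$ is an arbitrary choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus (hypothetical, not the snippets dataset).
texts = ["stock market shares", "market shares fall", "bank shares rise",
         "football match score", "football team wins", "team score goal"]
labels = ["business", "business", "business", "sports", "sports", "sports"]

X = CountVectorizer().fit_transform(texts)
# Fraction of entries in the count matrix that are zero.
sparsity = 1.0 - X.nnz / float(X.shape[0] * X.shape[1])
print("features: {}, fraction of zero entries: {:.2f}".format(X.shape[1], sparsity))

# Keep the K features with the highest chi-squared score against the labels.
selector = SelectKBest(chi2, k=4)
X_kbest = selector.fit_transform(X, labels)

clf = MultinomialNB(alpha=0.5).fit(X_kbest, labels)
train_accuracy = clf.score(X_kbest, labels)
print("kept {} features, training accuracy {:.2f}".format(
    X_kbest.shape[1], train_accuracy))
```

Even in this toy corpus three quarters of the entries are zero. In a real vocabulary of many thousands of words the fraction is far higher.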
Running the classifier [1 mark]¶
You will run the following code section, and save the results to a plain text file “q4.out”.
The output of the code section is 5 lines, each of which shows the percentage accuracy of classification on the training and test sets for a different amount of feature selection.
The first such line represents the “default”, i.e., using all features. The remaining 4 lines show the effect of learning and predicting on text data where only the top $K$ features are used.
Result interpretation [2 marks]¶
Answer this question in the file called answers.txt. Your answers must be based on the results you saved in “q4.out”.
4. [2 marks] Refer to answers.txt.
In [20]:
# code for question 4
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.feature_selection import SelectKBest, chi2
df_trte = pd.read_csv('./datasets/snippets_all.csv')
df_tr = pd.read_csv('./datasets/snippets_train.csv')
df_te = pd.read_csv('./datasets/snippets_test.csv')
# set up the vocabulary (the global set of "words" or tokens) for training and test datasets
vectorizer = CountVectorizer()
vectorizer.fit(df_trte.snippet)
# apply this vocabulary to transform the text snippets to vectors of word counts
X_train = vectorizer.transform(df_tr.snippet)
X_test = vectorizer.transform(df_te.snippet)
y_train = df_tr.section
y_test = df_te.section
# Debugging
# print("X train: ", X_train.shape)
# print("X test: ", X_test.shape)
# print("Y train: ", y_train.shape)
# print("Y test: ", y_test.shape)
# learn a Naive Bayes classifier on the training set
clf = MultinomialNB(alpha=0.5)
clf.fit(X_train, y_train)
pred_train = clf.predict(X_train)
score_train = metrics.accuracy_score(y_train, pred_train)
pred_test = clf.predict(X_test)
score_test = metrics.accuracy_score(y_test, pred_test)
print("Train/test accuracy using all features: ", score_train, score_test)
('Train/test accuracy using all features: ', 0.979324055666004, 0.80526315789473679)
In [21]:
# Use Chi^2 to select top 10000 features
ch2_10000 = SelectKBest(chi2, k=10000)
ch2_10000.fit(X_train, y_train)
# Project training data onto top 10000 selected features
X_train_kbest_10000 = ch2_10000.transform(X_train)
# Train NB Classifier using top 10000 selected features
clf_kbest_10000 = MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)
clf_kbest_10000.fit(X_train_kbest_10000,y_train)
# Predictive accuracy on training set
pred_train_kbest_10000 = clf_kbest_10000.predict(X_train_kbest_10000)
score_train_kbest_10000 = metrics.accuracy_score(y_train,pred_train_kbest_10000)
# Project test data onto top 10000 selected features
X_test_kbest_10000 = ch2_10000.transform(X_test)
# Predictive accuracy on test set
pred_test_kbest_10000 = clf_kbest_10000.predict(X_test_kbest_10000)
score_test_kbest_10000 = metrics.accuracy_score(y_test,pred_test_kbest_10000)
print("Train/test accuracy for top 10K features", score_train_kbest_10000, score_test_kbest_10000)
# Use Chi^2 to select top 1000 features
ch2_1000 = SelectKBest(chi2, k=1000)
ch2_1000.fit(X_train, y_train)
# Project training data onto top 1000 selected features
X_train_kbest_1000 = ch2_1000.transform(X_train)
# Train NB Classifier using top 1000 selected features
clf_kbest_1000 = MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)
clf_kbest_1000.fit(X_train_kbest_1000,y_train)
# Predictive accuracy on training set
pred_train_kbest_1000 = clf_kbest_1000.predict(X_train_kbest_1000)
score_train_kbest_1000 = metrics.accuracy_score(y_train,pred_train_kbest_1000)
# Project test data onto top 1000 selected features
X_test_kbest_1000 = ch2_1000.transform(X_test)
# Predictive accuracy on test set
pred_test_kbest_1000 = clf_kbest_1000.predict(X_test_kbest_1000)
score_test_kbest_1000 = metrics.accuracy_score(y_test,pred_test_kbest_1000)
print("Train/test accuracy for top 1K features", score_train_kbest_1000, score_test_kbest_1000)
# Use Chi^2 to select top 100 features
ch2_100 = SelectKBest(chi2, k=100)
ch2_100.fit(X_train, y_train)
# Project training data onto top 100 selected features
X_train_kbest_100 = ch2_100.transform(X_train)
# Train NB Classifier using top 100 selected features
clf_kbest_100 = MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)
clf_kbest_100.fit(X_train_kbest_100,y_train)
# Predictive accuracy on training set
pred_train_kbest_100 = clf_kbest_100.predict(X_train_kbest_100)
score_train_kbest_100 = metrics.accuracy_score(y_train,pred_train_kbest_100)
# Project test data onto top 100 selected features
X_test_kbest_100 = ch2_100.transform(X_test)
# Predictive accuracy on test set
pred_test_kbest_100 = clf_kbest_100.predict(X_test_kbest_100)
score_test_kbest_100 = metrics.accuracy_score(y_test,pred_test_kbest_100)
print("Train/test accuracy for top 100 features", score_train_kbest_100, score_test_kbest_100)
# Use Chi^2 to select top 10 features
ch2_10 = SelectKBest(chi2, k=10)
ch2_10.fit(X_train, y_train)
# Project training data onto top 10 selected features
X_train_kbest_10 = ch2_10.transform(X_train)
# Train NB Classifier using top 10 selected features
clf_kbest_10 = MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)
clf_kbest_10.fit(X_train_kbest_10,y_train)
# Predictive accuracy on training set
pred_train_kbest_10 = clf_kbest_10.predict(X_train_kbest_10)
score_train_kbest_10 = metrics.accuracy_score(y_train,pred_train_kbest_10)
# Project test data onto top 10 selected features
X_test_kbest_10 = ch2_10.transform(X_test)
# Predictive accuracy on test set
pred_test_kbest_10 = clf_kbest_10.predict(X_test_kbest_10)
score_test_kbest_10 = metrics.accuracy_score(y_test,pred_test_kbest_10)
print("Train/test accuracy for top 10 features", score_train_kbest_10, score_test_kbest_10)
('Train/test accuracy for top 10K features', 0.97176938369781307, 0.80526315789473679)
('Train/test accuracy for top 1K features', 0.95387673956262431, 0.69429824561403508)
('Train/test accuracy for top 100 features', 0.71878727634194828, 0.4631578947368421)
('Train/test accuracy for top 10 features', 0.4092445328031809, 0.18903508771929825)