comp9417_ass1_spec(1)
COMP9417 18s1 Assignment 1: Applying Machine Learning¶
Last revision: Sat Mar 24 14:04:42 AEDT 2018
The aim of this assignment is to enable you to apply different machine learning algorithms implemented in the Python scikit-learn machine learning library on a variety of datasets and answer questions based on your analysis and interpretation of the empirical
results, using your knowledge of machine learning.
After completing this assignment you will be able to:
set up replicated $k$-fold cross-validation experiments to obtain average
performance measures of algorithms on datasets
compare the performance of different algorithms against a base-line
and each other
aggregate comparative performance scores for algorithms over a range
of different datasets
propose properties of algorithms and their parameters, or datasets, which
may lead to performance differences being observed
suggest reasons for actual observed performance differences in terms of
properties of algorithms, parameter settings or datasets.
apply methods for data transformations and parameter search
and evaluate their effects on the performance of algorithms
There are a total of 20 marks available.
Each is worth 0.5 course mark, i.e., assignment marks will be scaled
to a course mark out of 10 to contribute to the course total.
Deadline: 23:59:59, Monday April 2, 2018.
Submission will be via the CSE give system (see below).
Late penalties: one mark will be deducted from the total for each day late, up to a total of five days. If six or more days late, no marks will be given.
Recall the guidance regarding plagiarism in the course introduction: this applies to this assignment and if evidence of plagiarism is detected it may result in penalties ranging from loss of marks to suspension.
Format of the questions¶
There are 4 questions in this assignment. Each question has two parts: the Python code which must be run to generate the output results on the given datasets, and the responses you give in the file answers.txt on your analysis and interpretation of the results produced by running these learning algorithms for the question. Marks are given for both parts: submitting correct output from the code, and giving correct responses. For each question, you will need to save the output results from running the code to a separate plain text file. There will also be a plain text file containing the questions which you will need to edit to specify your answers. These files will form your submission.
In summary, your submission will comprise a total of 5 (five) files which should be named as follows:
q1.out
q2.out
q3.out
q4.out
answers.txt
Please note: files in any format other than plain text cannot be accepted.
Submit your files using give. On a CSE Linux machine, type the following on the command-line:
$ give cs9417 ass1 q1.out q2.out q3.out q4.out answers.txt
Alternatively, you can submit using the web-based interface to give.
Datasets¶
You can download the datasets required for the assignment here.
Note: you will need to ensure the dataset files are in the same directory from which you are started the notebook.
Please Note: to load datasets from ‘.arff’ formatted files, you will need to have installed the liac-arff package. You can do this using pip at the command-line, as follows:
$ pip install liac-arff
Question 1¶
For this question the objective is to run two different learning algorithms on a range of different sample sizes taken from the same training set to assess the effect of training sample size on error. You will use the nearest neighbour classifier and the decision tree classifier to generate two different sets of “learning curves” on 8 real-world datasets:
anneal.arff
audiology.arff
autos.arff
credit-a.arff
hypothyroid.arff
letter.arff
microarray.arff
vote.arff
Running the classifiers [2 marks]¶
You will run the following code section, and save the results to a plain text file “q1.out”. You will also need to write your own code to compute the error reduction for question 1(b).
The output of the code section are two tables, which represent the percentage error of classification for the nearest neighbour and the decision tree algorithm respectively. The first column contains the result of the baseline classifier, which simply predicts the majority class. From the second column on, the results are obtained by running the nearest neighbour or decision tree algorithms on $10\%$, $25\%$, $50\%$, $75\%$, and $100\%$ of the data. The standard deviation are shown in brackets, and where an asterisk is present, it indicates that the result is significantly different from the baseline.
Result interpretation [6 marks]¶
Answer these questions in the file called answers.txt. Your answers must be based on the results you saved in “q1.out”. Please note: the goal of these questions is to attempt to explain why you think the results you obtained are as they are.
1(a). [2 marks] Refer to answers.txt.
1(b). [4 marks] For each algorithm over all of the datasets, find the average change in error when moving from the default prediction to learning from 10% of the training set as follows.
Let the error on the base line be err0 and the error on 10% of the training set be error10.
For each algorithm, calculate the percentage reduction in error relative to the default on each dataset as:
\begin{equation*}
\frac{err_0 – err_{10}}{err_{0}} \times 100.
\end{equation*}Now repeat exactly the same process by comparing the two classifiers over all of the datasets, learning from $100\%$ of the training set, compared to default. Organise your results by grouping them into a 2 by 2 table, something like this:
Mean error reduction relative to default
Algorithm After 10% training After 100% training
Nearest Neighbour Your result Your result
Decision Tree Your result Your result
The entries from this table should be inserted into the correct places in your file answers.txt.
Once you have done this, complete the rest of the answers for question 1 in your file answers.txt.
In [13]:
# code for question 1
import arff
import numpy as np
from itertools import product
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score, KFold
from sklearn.utils import resample
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import ttest_ind
seeds = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37]
score_list = []
for fname in [“anneal.arff”, “audiology.arff”, “autos.arff”, “credit-a.arff”, \
“hypothyroid.arff”, “letter.arff”, “microarray.arff”, “vote.arff”]:
dataset = arff.load(open(fname), ‘r’)
data = np.array(dataset[‘data’])
X = np.array(data)[:, :-1]
Y = np.array(data)[:, -1]
# turn unknown/none/? into a separate value
for i, j in product(range(len(data)), range(len(data[0]) – 1)):
if X[i, j] is None:
X[i, j] = len(dataset[‘attributes’][j][1])
# a hack to turn negative categories positive for autos.arff
for i in range(Y.shape[0]):
if Y[i] < 0:
Y[i] += 7
# identify and extract categorical/non-categorical features
categorical, non_categorical = [], []
for i in range(len(dataset['attributes']) - 1):
if isinstance(dataset['attributes'][i][1], str):
non_categorical.append(X[:, i])
else:
categorical.append(X[:, i])
categorical = np.array(categorical).T
non_categorical = np.array(non_categorical).T
if categorical.shape[0] == 0:
transformed_X = non_categorical
else:
# encode categorical features
encoder = OneHotEncoder(n_values = 'auto',
categorical_features = 'all',
dtype = np.int32,
sparse = False,
handle_unknown = 'error')
encoder.fit(categorical)
categorical = encoder.transform(categorical)
if non_categorical.shape[0] == 0:
transformed_X = categorical
else:
transformed_X = np.concatenate((categorical, non_categorical), axis = 1)
# concatenate the feature array and the labels for resampling purpose
Y = np.array([Y], dtype = np.int)
input_data = np.concatenate((transformed_X, Y.T), axis = 1)
# build the models
models = [DummyClassifier(strategy = 'most_frequent')] \
+ [KNeighborsClassifier(n_neighbors = 1, algorithm = "brute")] * 5 \
+ [DecisionTreeClassifier()] * 5
# resample and run cross validation
portion = [1.0, 0.1, 0.25, 0.5, 0.75, 1.0, 0.1, 0.25, 0.5, 0.75, 1.0]
sample, scores = [None] * 11, [None] * 11
for i in range(11):
sample[i] = resample(input_data,
replace = False,
n_samples = int(portion[i] * input_data.shape[0]),
random_state = seeds[i])
score = [None] * 10
for j in range(10):
score[j] = np.mean(cross_val_score(models[i],
sample[i][:, :-1],
sample[i][:, -1].astype(np.int),
scoring = 'accuracy',
cv = KFold(10, True, seeds[j])))
scores[i] = score
score_list.append((fname[:-5], 1 - np.array(scores)))
# print the results
header = ["{:^123}".format("Nearest Neighbour Results") + '\n' + '-' * 123 + '\n' + \
"{:^15} | {:^10} | {:^16} | {:^16} | {:^16} | {:^16} | {:^16}" \
.format("Dataset", "Baseline", "10%", "25%", "50%", "75%", "100%"),
"{:^123}".format("Decision Tree Results") + '\n' + '-' * 123 + '\n' + \
"{:^15} | {:^10} | {:^16} | {:^16} | {:^16} | {:^16} | {:^16}" \
.format("Dataset", "Baseline", "10%", "25%", "50%", "75%", "100%")]
offset = [1, 6]
for k in range(2):
print(header[k])
for i in range(8):
scores = score_list[i][1]
p_value = [None] * 5
for j in range(5):
_, p_value[j] = ttest_ind(scores[0], scores[j + offset[k]], equal_var = False)
print("{:<15} | {:>10.2%}”.format(score_list[i][0], np.mean(scores[0])), end = ”)
for j in range(5):
print(” | {:>6.2%} ({:>5.2%}) {}” .format(np.mean(scores[j + offset[k]]),
np.std(scores[j + offset[k]]),
‘*’ if p_value[j] < 0.05 else ' '), end = '')
print()
print()
Nearest Neighbour Results
---------------------------------------------------------------------------------------------------------------------------
Dataset | Baseline | 10% | 25% | 50% | 75% | 100%
anneal | 23.83% | 20.31% (0.94%) * | 18.00% (1.33%) * | 11.14% (0.70%) * | 9.11% (0.37%) * | 7.44% (0.44%) *
audiology | 74.77% | 60.17% (2.17%) * | 42.00% (2.56%) * | 31.85% (2.13%) * | 29.62% (1.78%) * | 26.47% (1.81%) *
autos | 67.35% | 64.50% (1.50%) * | 61.40% (2.21%) * | 65.96% (2.02%) | 52.92% (2.39%) * | 57.37% (0.95%) *
credit-a | 44.49% | 39.98% (1.05%) * | 41.35% (0.99%) * | 32.04% (1.50%) * | 34.63% (0.79%) * | 34.71% (0.73%) *
hypothyroid | 7.71% | 8.27% (0.52%) * | 7.33% (0.18%) * | 4.74% (0.14%) * | 5.01% (0.13%) * | 4.79% (0.10%) *
letter | 96.26% | 16.86% (0.35%) * | 9.61% (0.20%) * | 6.05% (0.08%) * | 4.71% (0.06%) * | 3.93% (0.07%) *
microarray | 50.20% | 59.47% (2.55%) * | 49.58% (2.36%) | 42.45% (0.83%) * | 50.71% (0.95%) | 50.88% (0.60%)
vote | 38.63% | 6.45% (1.01%) * | 10.42% (1.16%) * | 8.26% (0.55%) * | 7.12% (0.19%) * | 7.91% (0.39%) *
Decision Tree Results
---------------------------------------------------------------------------------------------------------------------------
Dataset | Baseline | 10% | 25% | 50% | 75% | 100%
anneal | 23.83% | 8.72% (1.52%) * | 3.75% (0.70%) * | 1.36% (0.44%) * | 1.41% (0.42%) * | 0.69% (0.30%) *
audiology | 74.77% | 62.50% (3.27%) * | 46.33% (4.13%) * | 29.23% (1.86%) * | 22.36% (1.83%) * | 22.08% (1.98%) *
autos | 67.35% | 68.50% (7.43%) | 46.80% (3.69%) * | 33.49% (2.41%) * | 30.17% (3.39%) * | 21.15% (3.40%) *
credit-a | 44.49% | 20.05% (2.52%) * | 13.20% (1.46%) * | 19.70% (1.86%) * | 19.48% (1.55%) * | 18.99% (1.15%) *
hypothyroid | 7.71% | 2.84% (0.54%) * | 1.55% (0.23%) * | 0.67% (0.04%) * | 0.80% (0.12%) * | 0.60% (0.07%) *
letter | 96.26% | 28.79% (0.41%) * | 21.52% (0.32%) * | 16.43% (0.27%) * | 13.33% (0.14%) * | 11.77% (0.15%) *
microarray | 50.20% | 48.77% (3.88%) | 52.57% (3.75%) | 50.31% (1.82%) | 46.52% (2.21%) * | 49.15% (1.79%)
vote | 38.63% | 12.75% (3.03%) * | 5.82% (1.64%) * | 6.95% (0.98%) * | 3.29% (0.58%) * | 5.74% (0.54%) *
Question 2¶
Dealing with noisy data is a key issue in machine learning. Unfortunately, even algorithms that have noise-handling mechanisms built-in, like decision trees, can overfit noisy data, unless their "overfitting avoidance" or regularization parameters are set properly.
The datasets you will be using have had various amounts of "class noise" added
by randomly changing the actual class value to a different one for a
specified percentage of the training data.
Here we will specify three arbitrarily chosen levels of noise: low
($20\%$), medium ($50\%$) and high ($80\%$).
The learning algorithm must try to "see through" this noise and learn
the best model it can, which is then evaluated on test data without
added noise to evaluate how well it has avoided fitting the noise.
We will also let the algorithm do a limited search using cross-validation
for the best over-fitting avoidance parameter settings on each training set.
Running the classifiers [2 marks]¶
You will run the following code section, and save the results to a plain text file "q2.out".
The output of the code section is a table, which represents the percentage accuaracy of classification for the decision tree algorithm. The first column contains the result of the "Default" classifier, which is the decision tree algorithm with default parameter settings running on each of the datasets which have had $50\%$ noise added. From the second column on, in each column the results are obtained by running the decision tree algorithm on $0\%$, $20\%$, $50\%$ and $80\%$ noise added to each of the datasets, and in the parentheses is shown the result of a grid search that has been applied to determine the best value for a basic parameter of the decision tree algorithm, namely min_samples_leaf i.e., the minimum number of examples that can be used to make a prediction in the tree, on that dataset.
Result interpretation [3 marks]¶
Answer these questions in the file called answers.txt. Your answers must be based on the results you saved in "q2.out".
2(a). [2 marks] Refer to answers.txt.
2(b). [1 mark] Refer to answers.txt.
In [1]:
# code for question 2
import arff, numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn import tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import sys
import warnings
In [2]:
# fixed random seed
np.random.seed(1)
def warn(*args, **kwargs):
pass
def label_enc(labels):
le = preprocessing.LabelEncoder()
le.fit(labels)
return le
def features_encoders(features,categorical_features='all'):
n_samples, n_features = features.shape
label_encoders = [preprocessing.LabelEncoder() for _ in range(n_features)]
X_int = np.zeros_like(features, dtype=np.int)
for i in range(n_features):
feature_i = features[:, i]
label_encoders[i].fit(feature_i)
X_int[:, i] = label_encoders[i].transform(feature_i)
enc = preprocessing.OneHotEncoder(categorical_features=categorical_features)
return enc.fit(X_int),label_encoders
def feature_transform(features,label_encoders, one_hot_encoder):
n_samples, n_features = features.shape
X_int = np.zeros_like(features, dtype=np.int)
for i in range(n_features):
feature_i = features[:, i]
X_int[:, i] = label_encoders[i].transform(feature_i)
return one_hot_encoder.transform(X_int).toarray()
warnings.warn = warn
In [3]:
class DataFrameImputer(TransformerMixin):
def fit(self, X, y=None):
self.fill = pd.Series([X[c].value_counts().index[0]
if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
index=X.columns)
return self
def transform(self, X, y=None):
return X.fillna(self.fill)
def load_data(path):
dataset = arff.load(open(path, 'r'))
data = np.array(dataset['data'])
data = pd.DataFrame(data)
data = DataFrameImputer().fit_transform(data).values
attr = dataset['attributes']
# mask categorical features
masks = []
for i in range(len(attr)-1):
if attr[i][1] != 'REAL':
masks.append(i)
return data, masks
def preprocess(data,masks, noise_ratio):
# split data
train_data, test_data = train_test_split(data,test_size=0.3,random_state=0)
# test data
test_features = test_data[:,0:test_data.shape[1]-1]
test_labels = test_data[:,test_data.shape[1]-1]
# training data
features = train_data[:,0:train_data.shape[1]-1]
labels = train_data[:,train_data.shape[1]-1]
classes = list(set(labels))
# categorical features need to be encoded
if len(masks):
one_hot_enc, label_encs = features_encoders(data[:,0:data.shape[1]-1],masks)
test_features = feature_transform(test_features,label_encs,one_hot_enc)
features = feature_transform(features,label_encs,one_hot_enc)
le = label_enc(data[:,data.shape[1]-1])
labels = le.transform(train_data[:,train_data.shape[1]-1])
test_labels = le.transform(test_data[:,test_data.shape[1]-1])
# add noise
np.random.seed(1234)
noise = np.random.randint(len(classes)-1, size=int(len(labels)*noise_ratio))+1
noise = np.concatenate((noise,np.zeros(len(labels) - len(noise),dtype=np.int)))
labels = (labels + noise) % len(classes)
return features,labels,test_features,test_labels
In [4]:
# load data
paths = ['balance-scale','primary-tumor',
'glass','heart-h']
noise = [0,0.2,0.5,0.8]
scores = []
params = []
for path in paths:
score = []
param = []
path += '.arff'
data, masks = load_data(path)
# training on data with %50 noise and default parameters
features, labels, test_features, test_labels = preprocess(data, masks, 0.5)
tree = DecisionTreeClassifier(random_state=0,min_samples_leaf=2, min_impurity_decrease=0)
tree.fit(features, labels)
tree_preds = tree.predict(test_features)
tree_performance = accuracy_score(test_labels, tree_preds)
score.append(tree_performance)
param.append(tree.get_params()['min_samples_leaf'])
# training on data with noise %0, %20, %50, %80
for noise_ratio in noise:
features, labels, test_features, test_labels = preprocess(data, masks, noise_ratio)
param_grid = {'min_samples_leaf': np.arange(2,30,5)}
grid_tree = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,cv=10,return_train_score=True)
grid_tree.fit(features, labels)
estimator = grid_tree.best_estimator_
tree_preds = grid_tree.predict(test_features)
tree_performance = accuracy_score(test_labels, tree_preds)
score.append(tree_performance)
param.append(estimator.get_params()['min_samples_leaf'])
scores.append(score)
params.append(param)
# print the results
header = "{:^123}".format("Decision Tree Results") + '\n' + '-' * 123 + '\n' + \
"{:^15} | {:^16} | {:^16} | {:^16} | {:^16} | {:^16} |".format("Dataset", "Default", "0%", "20%", "50%", "80%")
# print result table
print(header)
for i in range(len(scores)):
#scores = score_list[i][1]
print("{:<16}".format(paths[i]),end="")
for j in range(len(params[i])):
print("| {:>6.2%} ({:>2}) ” .format(scores[i][j],params[i][j]),end=””)
print(‘|\n’)
print(‘\n’)
Decision Tree Results
—————————————————————————————————————————
Dataset | Default | 0% | 20% | 50% | 80% |
balance-scale | 36.70% ( 2) | 76.06% ( 2) | 71.28% (12) | 65.43% (27) | 18.09% (27) |
primary-tumor | 25.49% ( 2) | 37.25% (12) | 42.16% (12) | 43.14% (12) | 26.47% ( 7) |
glass | 44.62% ( 2) | 69.23% ( 7) | 66.15% (22) | 35.38% (17) | 29.23% (17) |
heart-h | 35.96% ( 2) | 67.42% ( 7) | 78.65% (22) | 56.18% (17) | 20.22% (27) |
Question 3¶
This question involves mining a data-set on California house prices from
census data in the 1990s.
We will be using linear regression to do this since the output is numeric.
Since this problem involves using attribute or feature transformations we
will need to do this using the numpy Python library.
Running the regression [1 mark]¶
In this question, in the following code section you will be required to train a linear regression model using scikit learn to fit the dataset. Select “median house value” as the target variable Y and the rest of the features as X. Then perform a 10-fold cross validation of linear regression on the dataset. Save the Intercept and Coefficients of the reuslting Linear regression model, as well as Root mean squared error of cross validation, into a plain text file called “q3.out”. For this question the code to save the output to a file has been provided for you.
In [5]:
# code for question 3
import arff,numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict,cross_val_score
from sklearn import metrics
#————–Show the attributes————–
dataset = arff.load(open(‘houses.arff’,”r”,encoding = “ISO-8859-1”))
attributes = np.array(dataset[‘attributes’])
attributes
Out[5]:
array([[‘median_house_value’, ‘REAL’],
[‘median_income’, ‘REAL’],
[‘housing_median_age’, ‘REAL’],
[‘total_rooms’, ‘REAL’],
[‘total_bedrooms’, ‘REAL’],
[‘population’, ‘REAL’],
[‘households’, ‘REAL’],
[‘latitude’, ‘REAL’],
[‘longitude’, ‘REAL’]], dtype=’