# COMP9417 18s1 Assignment 1: Applying Machine Learning
The aim of this assignment is to enable you to **apply** different machine learning algorithms implemented in the Python [scikit-learn](http://scikit-learn.org/stable/index.html) machine learning library on a variety of datasets and answer questions based on your **analysis** and **interpretation** of the empirical
results, using your knowledge of machine learning.
After completing this assignment you will be able to:
- set up replicated $k$-fold cross-validation experiments to obtain average performance measures of algorithms on datasets
- compare the performance of different algorithms against a baseline and each other
- aggregate comparative performance scores for algorithms over a range of different datasets
- propose properties of algorithms and their parameters, or datasets, which may lead to performance differences being observed
- suggest reasons for actual observed performance differences in terms of properties of algorithms, parameter settings or datasets
- apply methods for data transformations and parameter search and evaluate their effects on the performance of algorithms
There are a total of *20 marks* available.
Each assignment mark is worth *0.5 course marks*, i.e., assignment marks will be scaled
to a **course mark out of 10** to contribute to the course total.
Deadline: 23:59:59, Monday April 2, 2018.
Submission will be via the CSE *give* system (see below).
Late penalties: one mark will be deducted from the total for each day late, up to a total of five days. If six or more days late, no marks will be given.
Recall the guidance regarding plagiarism in the course introduction: this applies to this assignment and if evidence of plagiarism is detected it may result in penalties ranging from loss of marks to suspension.
### Format of the questions
There are 4 questions in this assignment. Each question has two parts: the Python code which must be run to generate the output results on the given datasets, and the responses you give in the file [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt) on your analysis and interpretation of the results produced by running these learning algorithms for the question. Marks are given for both parts: submitting correct output from the code, and giving correct responses. For each question, you will need to save the output results from running the code to a separate plain text file. There will also be a plain text file containing the questions which you will need to edit to specify your answers. These files will form your submission.
In summary, your submission will comprise a total of 5 (five) files which should be named as follows:
```
q1.out
q2.out
q3.out
q4.out
answers.txt
```
Please note: files in any format other than plain text **cannot be accepted**.
Submit your files using `give`. On a CSE Linux machine, type the following on the command-line:
```
$ give cs9417 ass1 q1.out q2.out q3.out q4.out answers.txt
```
Alternatively, you can submit using the web-based interface to `give`.
### Datasets
You can download the datasets required for the assignment [here](http://www.cse.unsw.edu.au/~mike/comp9417/ass1/datasets.zip).
Note: you will need to
## Question 1
For this question the objective is to run two different learning algorithms on a range of different sample sizes taken from the same training set to assess the effect of training sample size on error. You will use the nearest neighbour classifier and the decision tree classifier to generate two different sets of “learning curves” on 8 real-world datasets:
```
anneal.arff
audiology.arff
autos.arff
credit-a.arff
hypothyroid.arff
letter.arff
microarray.arff
vote.arff
```
### Running the classifiers [2 marks]
You will run the following code section, and save the results to a plain text file “q1.out”. You will also need to write your own code to compute the error reduction for question 1(b).
The output of the code section is two tables, which show the percentage classification error for the nearest neighbour and the decision tree algorithms respectively. The first column contains the result of the baseline classifier, which simply predicts the majority class. From the second column on, the results are obtained by running the nearest neighbour or decision tree algorithm on $10\%$, $25\%$, $50\%$, $75\%$, and $100\%$ of the data. Standard deviations are shown in brackets, and an asterisk indicates that the result is significantly different from the baseline.
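To make the structure of these tables concrete, here is a minimal sketch of how one row of such a learning curve could be produced. It is not the assignment's code: it assumes synthetic data from `make_classification` in place of the `.arff` datasets, a single hold-out split in place of replicated cross-validation, and a 1-nearest-neighbour setting, all of which are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the assignment uses the .arff datasets).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def percent_error(model, X_fit, y_fit):
    """Fit on the given training sample; return % error on the held-out test set."""
    model.fit(X_fit, y_fit)
    return 100.0 * (1.0 - model.score(X_test, y_test))


fractions = [0.10, 0.25, 0.50, 0.75, 1.00]
rng = np.random.RandomState(0)

# Baseline: always predict the majority class of the training set.
baseline_error = percent_error(
    DummyClassifier(strategy="most_frequent"), X_train, y_train)

# Hypothetical model settings, chosen only for illustration.
model_factories = {
    "Nearest Neighbour": lambda: KNeighborsClassifier(n_neighbors=1),
    "Decision Tree": lambda: DecisionTreeClassifier(random_state=0),
}

learning_curves = {}
for name, make_model in model_factories.items():
    errors = []
    for frac in fractions:
        # Draw a random subsample of the training set of the requested size.
        n = max(1, int(round(frac * len(y_train))))
        idx = rng.choice(len(y_train), size=n, replace=False)
        errors.append(percent_error(make_model(), X_train[idx], y_train[idx]))
    learning_curves[name] = errors

print(f"baseline: {baseline_error:.2f}")
for name, errors in learning_curves.items():
    print(name, " ".join(f"{e:.2f}" for e in errors))
```

Each printed row corresponds to one line of a table: the baseline error first, then the error at each training fraction.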
### Result interpretation [6 marks]
Answer these questions in the file called [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt). Your answers must be based on the results you saved in “q1.out”. **_Please note_**: the goal of these questions is to attempt to **_explain why_** you think the results you obtained are as they are.
**1(a). [2 marks]** Refer to [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt).
**1(b). [4 marks]** For each algorithm over all of the datasets, find the average change in error when moving from the default prediction to learning from 10% of the training set as follows.
Let the error on the baseline be err<sub>0</sub> and the error on 10% of the training set be err<sub>10</sub>.
For each algorithm, calculate the percentage reduction in error relative to the default on each dataset as:
\begin{equation*}
\frac{err_0 - err_{10}}{err_{10}} \times 100.
\end{equation*}
Now repeat exactly the same process by comparing the two classifiers over all of the datasets, learning from $100\%$ of the training set, compared to default. Organise your results by grouping them into a 2 by 2 table, something like this:
| | Mean error reduction relative to default | |
|---|---|---|
| Algorithm | After 10% training | After 100% training |
| Nearest Neighbour | Your result | Your result |
| Decision Tree | Your result | Your result |
The entries from this table should be inserted into the correct places in your file [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt).
Once you have done this, complete the rest of the answers for question 1 in your file [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt).
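The arithmetic in 1(b) can be sketched as follows. The error values below are made up purely for illustration and are not results from any of the assignment datasets; the function applies the expression exactly as written in 1(b), and the names `percent_error_reduction` and `mean_reduction` are hypothetical helpers.

```python
from statistics import mean


def percent_error_reduction(err_0, err_k):
    """Apply the expression from 1(b): (err_0 - err_k) / err_k * 100.

    err_0 is the baseline error and err_k the error after training on k% of
    the data. Note that err_k must be non-zero.
    """
    return (err_0 - err_k) / err_k * 100.0


def mean_reduction(errors, column):
    """Average the per-dataset reductions for one training-set size."""
    return mean(percent_error_reduction(row["err_0"], row[column])
                for row in errors.values())


# Made-up percentage errors for one algorithm on two hypothetical datasets.
errors = {
    "dataset_a": {"err_0": 40.0, "err_10": 20.0, "err_100": 10.0},
    "dataset_b": {"err_0": 30.0, "err_10": 25.0, "err_100": 15.0},
}

after_10 = mean_reduction(errors, "err_10")
after_100 = mean_reduction(errors, "err_100")
print(f"After 10% training: {after_10:.1f}")
print(f"After 100% training: {after_100:.1f}")
```

Running the same two calls once per algorithm gives the four entries of the 2 by 2 table.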
## Question 2
Dealing with noisy data is a key issue in machine learning. Unfortunately, even algorithms that have noise-handling mechanisms built-in, like decision trees, can overfit noisy data, unless their “overfitting avoidance” or *regularization* parameters are set properly.
The datasets you will be using have had various amounts of “class noise” added
by randomly changing the actual class value to a different one for a
specified percentage of the training data.
Here we will specify three arbitrarily chosen levels of noise: low
($20\%$), medium ($50\%$) and high ($80\%$).
The learning algorithm must try to “see through” this noise and learn
the best model it can, which is then evaluated on test data *without*
added noise to evaluate how well it has avoided fitting the noise.
We will also let the algorithm do a limited search using cross-validation
for the best *over-fitting avoidance* parameter settings on each training set.
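As a rough illustration of what adding class noise means, the sketch below changes the class value of a specified fraction of examples to a different class. The function name `add_class_noise` and the toy labels are assumptions for this example; the assignment's own code may generate noise differently.

```python
import random


def add_class_noise(labels, noise_fraction, seed=0):
    """Return a copy of labels with a fraction changed to a *different* class.

    Requires at least two distinct classes in labels.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    n_noisy = int(round(noise_fraction * len(labels)))
    # Pick distinct positions to corrupt, then replace each label with
    # a randomly chosen class other than the true one.
    for i in rng.sample(range(len(labels)), n_noisy):
        alternatives = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


# Toy labels, purely illustrative.
labels = ["yes", "no"] * 50
noisy_labels = add_class_noise(labels, 0.20)
changed = sum(a != b for a, b in zip(labels, noisy_labels))
print(changed)  # 20 of the 100 labels differ from the originals
```

With only two classes, every corrupted label is simply flipped, so at the $50\%$ level half of the training labels are wrong.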
### Running the classifiers [2 marks]
You will run the following code section, and save the results to a plain text file “q2.out”.
The output of the code section is a table showing the percentage classification accuracy of the decision tree algorithm. The first column contains the result of the “Default” classifier, which is the decision tree algorithm with default parameter settings run on each of the datasets with $50\%$ noise added. From the second column on, each column shows the results of running the decision tree algorithm on the datasets with $0\%$, $20\%$, $50\%$ and $80\%$ noise added. The value in parentheses is the result of a [grid search](http://en.wikipedia.org/wiki/Hyperparameter_optimization) applied to determine the best value on that dataset for a basic parameter of the decision tree algorithm, namely [min_samples_leaf](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), i.e., the minimum number of examples that can be used to make a prediction in the tree.
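A cross-validated grid search over `min_samples_leaf` might look roughly like the sketch below. The candidate values, the synthetic data and the 5-fold setting are all assumptions made for illustration, and unlike the assignment the synthetic noise here (`flip_y`) affects the test split as well as the training split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (an assumption, not an assignment dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Hypothetical grid of candidate values for min_samples_leaf.
param_grid = {"min_samples_leaf": [1, 2, 5, 10, 20, 50]}

# 5-fold cross-validation on the training set picks the best value.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Decision tree with default settings, for comparison.
default_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

best_leaf = search.best_params_["min_samples_leaf"]
default_accuracy = 100.0 * default_tree.score(X_test, y_test)
tuned_accuracy = 100.0 * search.score(X_test, y_test)
print(f"default: {default_accuracy:.2f}%")
print(f"tuned:   {tuned_accuracy:.2f}% (min_samples_leaf={best_leaf})")
```

Larger values of `min_samples_leaf` force each leaf to cover more examples, which limits how closely the tree can fit individual noisy labels.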
### Result interpretation [3 marks]
Answer these questions in the file called [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt). Your answers must be based on the results you saved in “q2.out”.
**2(a). [2 marks]** Refer to [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt).
**2(b). [1 mark]** Refer to [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt).
## Question 3
This question involves mining a dataset on California house prices from
census data in the 1990s.
We will be using linear regression to do this since the output is numeric.
Since this problem involves using attribute or feature transformations we
will need to do this using the [*numpy*](http://www.numpy.org/) Python library.
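As a hedged illustration of a feature transformation with numpy, the sketch below fits a least-squares line to a raw feature and to its logarithm. The data are synthetic, the variable names `income` and `price` are hypothetical, and the log transform is only one possible choice; nothing here is taken from the assignment's dataset or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in which the target depends on the log of the feature.
income = rng.uniform(1.0, 10.0, size=200)
price = 50.0 * np.log(income) + 20.0 + rng.normal(0.0, 0.5, size=200)


def fit_linear(feature, target):
    """Least-squares fit of target = slope * feature + intercept."""
    design = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef  # array([slope, intercept])


def rmse(feature, target, coef):
    """Root mean squared error of the fitted line."""
    predictions = coef[0] * feature + coef[1]
    return float(np.sqrt(np.mean((target - predictions) ** 2)))


raw_coef = fit_linear(income, price)
log_coef = fit_linear(np.log(income), price)

raw_rmse = rmse(income, price, raw_coef)
log_rmse = rmse(np.log(income), price, log_coef)
print(f"RMSE with raw feature: {raw_rmse:.3f}")
print(f"RMSE with log feature: {log_rmse:.3f}")
```

The model stays linear in its parameters in both cases; only the input representation changes.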
## Question 4
This question involves mining text data, for which machine learning algorithms typically use a transformation into a dataset of “word counts”. In the original dataset each text example is a string of words with a class label, and the sklearn transform converts this to a vector of word counts.
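The word-count transformation can be sketched with scikit-learn's `CountVectorizer`. This is an assumption about which transform is meant, and the three snippets are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented snippets, standing in for the real search snippets.
snippets = [
    "stock market news today",
    "football match today",
    "market prices rise",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(snippets)  # sparse matrix: one row per snippet

# vocabulary_ maps each word to its column index in the count matrix.
dense = counts.toarray()
market_column = vectorizer.vocabulary_["market"]
print(counts.shape)
print(dense[:, market_column])  # how often "market" occurs in each snippet
```

Each distinct word becomes one feature, so the number of columns grows with the vocabulary rather than with the length of any single snippet.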
The dataset contains “snippets”, short sequences of words taken from Google searches, each of which has been labelled with one of 8 classes, referred to as “sections”, such as business, sports, etc. The dataset is provided already split into a training set of $10,060$ snippets and a test set of $2,280$ snippets (for convenience, the combined dataset is also provided as a separate file).
Using a vector representation for text data means that we can use many of the standard classifier learning methods. However, such datasets are typically highly “sparse”, in the sense that for any example (i.e., piece of text) nearly all of its feature values are zero. To tackle this problem, we typically apply methods of feature selection (or dimensionality reduction). In this question you will investigate the effect of using the [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) method to select the $K$ best features (words or tokens in this case) that appear to help classification accuracy.
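The sparsity described above, and the effect of keeping only the top $K$ features, can be sketched as follows. The snippets and labels are invented, and both the `chi2` score function and $K = 5$ are assumptions for illustration; the assignment's code may use different settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Invented snippets and class labels, for illustration only.
snippets = [
    "stock market prices rise",
    "bank shares fall on market",
    "company profit and stock news",
    "football match score today",
    "tennis final match result",
    "team wins football cup",
]
labels = ["business"] * 3 + ["sports"] * 3

counts = CountVectorizer().fit_transform(snippets)

# Fraction of entries in the count matrix that are zero.
n_rows, n_features = counts.shape
sparsity = 1.0 - counts.nnz / (n_rows * n_features)

# Keep the K features with the highest chi-squared score against the labels.
selector = SelectKBest(score_func=chi2, k=5)
reduced = selector.fit_transform(counts, labels)

print(f"features before selection: {n_features}")
print(f"fraction of zero entries:  {sparsity:.2f}")
print(f"features after selection:  {reduced.shape[1]}")
```

In practice the selector would be fitted on the training set only and then applied to the test set with `selector.transform`.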
### Running the classifier [1 mark]
You will run the following code section, and save the results to a plain text file “q4.out”.
The output of the code section is 5 lines, each of which shows the percentage classification accuracy on the training and test sets for a different amount of feature selection.
The first line represents the “default”, i.e., using all features. The remaining 4 lines show the effect of learning and predicting on text data where only the top $K$ features are used.
### Result interpretation [2 marks]
Answer this question in the file called [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt). Your answers must be based on the results you saved in “q4.out”.
**4. [2 marks]** Refer to [*answers.txt*](http://www.cse.unsw.edu.au/~cs9417/18s1/ass1/answers.txt).