Assignment 1 – Building and Testing Classifiers in WEKA
Assessment Weight: 20%
Note: This is an individual assignment. While it is expected that students will discuss their ideas with
one another, students need to be aware of their responsibilities in ensuring that they do not
deliberately or inadvertently plagiarize the work of others.
In this assignment you will run a machine learning experiment using Weka. You will generate a model
that predicts the quality of wine based on its chemical attributes. You will train the model on the
supplied training data.
Submission Instructions
Create a document (Word or equivalent) with your answers to the questions given. Be sure you have
answered all the questions.
Marks Breakdown
This assignment consists of seven (7) questions worth 80 marks, divided among three parts:
• Part 1 (Q1) 10 marks
• Part 2 (Q2-4) 30 marks
• Part 3 (Q5-7) 40 marks
Your answers will be marked according to the rubric at the end of this document.
Task #1 – Classification for Wine
The Wine Dataset
The dataset files are provided for you:
• wine_train.arff (labeled training set, 1599 instances)
This dataset is adapted from:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modelling wine preferences by data mining
from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-
9236.
This dataset contains data for variants of the Portuguese “Vinho Verde” wine. For each variant, 11
chemical features were measured. Each of these is a numeric attribute. They are:
• fixed acidity
• volatile acidity
• citric acid
• residual sugar
• chlorides
• free sulfur dioxide
• total sulfur dioxide
• density
• pH
• sulphates
• alcohol
Each variant was tasted by three experts. Their ratings have been combined into a single quality
label: “good” or “bad”. This is therefore a binary classification problem with numeric attributes.
Task 1 – Part 1: Examine/Explore the Data [10 marks]
It is a good idea to inspect your data by hand before running any machine learning experiments, to
ensure that the dataset is in the correct format and that you understand what the dataset contains.
Firstly, view wine_train.arff. You should see something like this:
The files are in ARFF (Attribute-Relation File Format), a text format developed for Weka. At the top of
each file you will see a list of attributes, followed by a data section with rows of comma separated
values, one for each instance.
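A minimal ARFF file has the same overall shape as wine_train.arff: an @relation line, a list of @attribute declarations, and an @data section with one comma-separated row per instance. The snippet below is an abbreviated, hypothetical example (made-up values, and only a few of the 11 attributes; the real file declares all of them):

```
@relation wine

@attribute fixed_acidity numeric
@attribute volatile_acidity numeric
@attribute alcohol numeric
@attribute quality {good,bad}

@data
7.4,0.70,9.4,bad
7.8,0.88,9.8,good
```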
For this assignment you will not need to deal with the ARFF format directly, as Weka will handle
reading and writing ARFF files for you. In future experiments you may have to convert between ARFF
and another data format. (You can close the text editor.)
Another way to view .arff files is using the WEKA ArffViewer tool. Once you start WEKA, you will get a
screen like the following:
From the Tools menu choose ArffViewer. In the window that opens, choose File→Open and open one
of the data files. You should see something like the following:
Here you see the same data as in the text editor, but parsed into a spreadsheet-like format. Although
you will not need the ArffViewer for this assignment, it is a useful tool to know about when working
with Weka. (You can close the ArffViewer window.)
Important Note
You may find that the ARFF files are grayed out, and that the All Files option needs to be selected
from the File Format dropdown menu before the files become selectable. If the ARFF Viewer still
does not read the files properly, it is likely that a .txt extension was appended to the filename
when the files were downloaded. Even if no .txt extension was appended, or an inadvertently added
.txt extension has been removed, the ARFF Viewer may still have trouble reading the files. The
following steps should resolve the issue:
1. View the ARFF in your Web browser by clicking on the link in the instructions or open the
downloaded ARFF file in a text editor.
2. Copy all the text and paste it to a new text file.
3. If you copied the contents from the downloaded ARFF file, do not overwrite the downloaded file
when saving the new file in the next step. Instead, delete the downloaded ARFF file.
4. Save the new text file with a .arff extension, carefully making sure that a .txt extension does not
get appended.
5. Open the newly saved ARFF file in the Weka ARFF Viewer to verify the Viewer can display the
file in the manner illustrated in the image above.
After this initial examination of the data files using a general text editor or the ArffViewer
tool, choose the Explorer interface in Weka. (From the Weka GUI Chooser, click on the Explorer
button to open the Weka Explorer.) The Explorer is the main tool in Weka, and the one you are most
likely to work with when setting up an experiment. For the remainder of this assignment you will
work within the Weka Explorer.
The Explorer should open to the “Preprocess” tab. The Preprocess tab allows you to inspect and
modify your dataset before passing it to a machine learning algorithm. Click on the button that says
“Open file…” and open wine_train.arff. You should see something like this when the dataset is
correctly loaded:
The attributes are listed in the bottom left, and summary statistics for the currently selected attribute
are shown on the right side, along with a histogram. Click on each attribute (or use the down arrow
key to move through them) and look at the corresponding histogram. You will notice that many
numeric attributes have a “hump” shape; this is a common pattern for numeric attributes drawn from
real-world data.
You will also notice that some attributes appear to have outliers on one or both sides of the
distribution. The proper treatment of outliers varies from one experiment to another. For this
assignment you can leave the outliers alone.
Now answer the question below:
Question #1: [10 marks]
Which attributes in the training set do not appear to have a “hump” distribution? Which attributes
appear to have outliers? (Do not worry too much about being precise here. The point is for you to
inspect the data and interpret what you see.)
Based on the histogram, which attribute appears to be the most useful for classifying wine, and why?
Task 1 – Part 2: Classifier Basics [30 marks]
In this section you will train a couple of basic classifiers on the data.
Baseline Classifier
Click on the “Classify” tab. Choose ZeroR as the Classifier if it is not already chosen (it is under the
“rules” subtree when you click on the “Choose” button). When used in a classification problem, ZeroR
simply chooses the majority class. Under “Test options” select “Use training set”, then click the “Start”
button to run the classifier. You should see something like this:
The classifier output pane displays information about the model created by the classifier as well as
the evaluated performance of the model. In the Summary section, the row “Correctly Classified
Instances” reports the accuracy of the model.
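The ZeroR behaviour described above can be sketched in a few lines of Python. This is a toy illustration with made-up labels, not Weka's implementation: ZeroR ignores every attribute and always predicts the majority class.

```python
from collections import Counter

# Hypothetical quality labels for a small sample (the real training set
# has 1599 instances).
labels = ["good", "bad", "good", "good", "bad", "good", "good", "bad"]

# "Train": find the majority class, then predict it for every instance.
majority, count = Counter(labels).most_common(1)[0]
predictions = [majority] * len(labels)

# Accuracy is just the majority class's share of the data.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(majority)              # good
print(round(accuracy, 3))    # 0.625  (5 of the 8 instances are "good")
```

On the real training set, the accuracy ZeroR reports is simply the proportion of the majority quality label.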
Now answer the question below:
Question #2: [10 marks]
What is the accuracy – the percentage of correctly classified instances – achieved by ZeroR when you
run it on the training set? Explain what this number means. How does the accuracy of ZeroR serve as
a helpful baseline for interpreting the performance of other classifiers?
Decision Trees
J48 is the Weka implementation of the C4.5 decision tree algorithm.
Click on the “Choose” button and select J48 under the “trees” section. Notice that the field to the right
of the “Choose” button updates to say “J48 -C 0.25 -M 2”. This is a command-line representation of
the current settings of J48. Click on this field to open up the configuration dialog for J48:
Each classifier has a configuration dialog such as this that shows the parameters of the algorithm as
well as buttons at the top for more information. When you change the settings and close the dialog,
the command line representation updates accordingly. For now we will use the default settings, so hit
“Cancel” to close the dialog.
Under “Test options” select “Use training set”, then click the “Start” button to run the classifier. After
the classifier finishes, scroll up in the output pane. You should see a textual representation of the
generated decision tree.
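The attribute at the root of the tree is the one C4.5 judged most informative: the split with the highest information gain. The sketch below illustrates that criterion on a made-up toy dataset with two hypothetical binary features (this is a simplified version of the criterion; C4.5 itself uses gain ratio and handles numeric thresholds):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Entropy reduction from splitting the labels by a feature's values."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy data: 3 "good" and 3 "bad" wines, two made-up binary features.
labels       = ["good", "good", "bad", "bad", "good", "bad"]
high_alcohol = [True,   True,   False, False, True,   False]  # hypothetical
high_acidity = [True,   False,  True,  False, False,  True]   # hypothetical

print(round(info_gain(high_alcohol, labels), 3))  # 1.0 -> perfect split
print(round(info_gain(high_acidity, labels), 3))  # 0.082 -> weak split
```

In this toy example high_alcohol separates the classes perfectly, so a decision tree would place it at the root, just as J48 places its most informative attribute at the root of the printed tree.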
Now answer the question below:
Question #3: [10 marks]
Using the decision tree that Weka learned from the training set, what is the most informative single
feature for this task, and how does it influence wine quality? Does this match your answer from
Question #1?
Scroll back down and record the percentage of Correctly Classified Instances. Now, under “Test
options”, select “Cross-validation” with 10 folds. Run the classifier again and record the percentage of
Correctly Classified Instances.
In both cases, the final model that is generated is based on all of the training data. The difference is in
how the accuracy of that model is estimated.
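As a rough sketch of what 10-fold cross-validation does, the Python below splits toy labels into 10 folds, "trains" a ZeroR-style majority classifier on each training portion, and averages the held-out accuracies. This is an illustration only: Weka additionally stratifies the folds, which this sketch omits.

```python
import random
from collections import Counter

random.seed(0)
# Toy stand-in for the wine quality column: 60 "good", 40 "bad".
labels = ["good"] * 60 + ["bad"] * 40
random.shuffle(labels)

k = 10
fold_size = len(labels) // k
accuracies = []
for i in range(k):
    # Hold out fold i for testing; train on the other nine folds.
    test = labels[i * fold_size:(i + 1) * fold_size]
    train = labels[:i * fold_size] + labels[(i + 1) * fold_size:]
    majority = Counter(train).most_common(1)[0][0]   # "train" the model
    accuracies.append(sum(y == majority for y in test) / len(test))

mean_acc = sum(accuracies) / k
print(round(mean_acc, 3))   # mean accuracy over the 10 held-out folds
```

Each instance is tested exactly once, by a model that never saw it during training, so the averaged accuracy is a less optimistic estimate than evaluating on the training set itself.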
Now answer the question below:
Question #4: [10 marks]
What is 10-fold cross-validation? What is the main reason for the difference between the percentage
of Correctly Classified Instances when you used the entire training set directly versus when you ran
10-fold cross-validation on the training set? Why is cross-validation important?
Task 1 – Part 3: Build Your Own Classifier [40 marks]
This is the main part of the assignment. Search through the classifiers in Weka and run some of them
on the training set. You may want to try varying some of the classifier parameters as well. Choose
the one you feel is most likely to generalize well to unseen examples. Feel free to use validation
strategies other than 10-fold cross-validation.
When you have built the classifier you want to submit, move on to the following sections.
Saving the Model
To export a classifier model you have built:
1. Right-click on the model in the “Result list” in the bottom left corner of the Classify tab.
2. Select “Save model”.
3. In the dialog that opens, ensure that the File Format is “Model object files”
4. Save the model using the naming convention given in the instructions (e.g. A1-Darth-Vader.model).
Now answer the questions below (Questions #5 and #6):
Question #5: [5 marks]
What is the “command-line” for the model you are submitting? For example, “J48 -C 0.25 -M 2”. What
is the reported accuracy for your model using 10-fold cross-validation?
Do not submit the example model for your answer.
Question #6: [20 marks]
In a few sentences, describe how you chose the model you are submitting. Be sure to mention your
validation strategy and whether you tried varying any of the model parameters.
Building variations of your selected model
Using the classifier model you selected, try varying the model's configuration by setting different
parameter values. For example, if you chose the J48 decision tree classifier, your initial model
might be built using the default configuration (parameter settings). As a variation, you could set
the “unpruned” parameter to “True”. Running this modified model on the same train/test data may
produce different results. Try configuring various parameters yourself and save the resulting
models as different versions of your classifier.
Now answer the question below:
Question #7: [15 marks]
For your selected classifier model, investigate which parameters can be set in Weka and briefly
summarise the role of each parameter.
Based on your configuration testing, summarise your findings about the effect of three different
parameters of your choice on the results of testing the model on the test data.
Marking Rubric
For each of Questions 1-7:

Exemplary (9-10): Answer demonstrates excellent knowledge of machine learning and data science, is
well-written, and very well justified.
Good (7-8): Exhibits aspects of both Exemplary and Satisfactory.
Satisfactory (5-6): Answer demonstrates sound knowledge of machine learning and data science and
provides justification.
Limited (3-4): Exhibits aspects of both Satisfactory and Very Limited.
Very Limited (0-2): Answer demonstrates flawed knowledge of machine learning and/or provides
incoherent justification, or the answer is absent or negligible.