COMP30018/COMP90049 Knowledge Technologies, Semester 1 2015
Project 2: How do you check the weather?

Due: 5:00pm, Friday, 29 May, 2015 (but see Late Submission Policy)
Submission Mechanism: PDF to Turnitin; code and system outputs on the CIS servers (where appropriate)
Submission Materials: Written report in PDF, as for Project 1; code and system outputs as necessary
Assessment Criteria: Creativity, Critical Analysis, Soundness, Report Quality

Introduction

Melbourne’s weather is a fickle thing. My solution is to “wear layers”, so that I’m prepared for a cold, rainy day, or indeed a warm, sunny one — but otherwise, I simply hope for the best!

Others, though, plan ahead, by carefully monitoring the weather forecast — sometimes even in lieu of looking outside! But then the question arises: “which weather forecast?” Depending on your source, the weather prediction can be highly variable.

Presumably, many people check the weather using the ubiquitous app that is installed on their smartphone by default. Another popular choice here is “BOM”, the Australian Bureau of Meteorology1. Still others check the news, or listen to the radio in their car, and of course, some hardly check the weather forecast at all.

Your task in this project will be to examine a data set of how people check the weather, along with a number of marginally relevant attributes, by using a sophisticated machine-learning toolkit. The goal will be to acquire some meaningful knowledge to answer the question: “How do you check the weather?”

Overview

The aim of this project is to acquire knowledge about the predictive capacity of various personal data on a particular feature: the mechanism, if any, used to check the weather. This aims to reinforce several concepts in machine learning, including the use and critique of different models, handling different types of data, and making conclusions based on observed results. Your objectives will be to attempt some machine learning approaches, identify some interesting patterns in the data, leverage your knowledge of certain machine learning criteria, and write a report based on your observations.

The technical side of this project is limited: you are highly encouraged to use appropriate machine learning packages in your exploration, which means that you can (almost) “solve” the problem by plugging the relevant data files into an application, and reading off a set of numbers.

However, acquiring knowledge will potentially be quite difficult. Once again, the main focus of our evaluation will be the report detailing the knowledge that you have taken away from your attempts. A strong implementation with a poor report will not receive a good mark.

Data

The main data set will be in the form of 938 survey responses, collected via SurveyMonkey2, (presumably) from people residing in the U.S.A. This data has been collected by Walter Hickey of FiveThirtyEight3; it is described, along with some shallow analysis, in:

1 http://www.bom.gov.au
2 http://www.surveymonkey.com
3 http://www.fivethirtyeight.com


Hickey, Walter. (2015) Where People Go To Check The Weather, FiveThirtyEight, accessed 8 May 2015, <http://fivethirtyeight.com/datalab/weather-forecast-news-app-habits/>.

The data set, in raw form, can be obtained from the following GitHub repository:

https://github.com/fivethirtyeight/data/tree/master/weather-check

We will also be providing the data in a format suitable for Weka processing. (More on Weka below.)

By using this data — whether from GitHub, or the version that we have pre-processed — you must correctly attribute it. This means (at least) citing the article above.

The fields — survey questions — are described in the links above; briefly, they comprise the following:

• Does the survey respondent typically check a daily weather report?
• How does the survey respondent typically check the weather?
• Would they use a smartwatch?
• Age
• Gender
• Household Income
• US Region

Note that you might need a little inference to make sense of the survey questions (and possibly the answers), as they were designed for an American audience. None of the fields, however, should be completely unfamiliar to you.

In addition to this, we will attempt to construct a test set from a different audience — namely, you, the students taking this subject. We have set up a survey which is (almost) exactly the same as the original survey. It is also available from SurveyMonkey, via the following link:

     https://www.surveymonkey.com/s/LTTN2XG

After a week or so, we will collate the responses, and publish them as a test set where you can check the performance of your machine learning system(s). We politely request that as many people as possible take this survey; the larger the test set, the more meaningful the observations will be4.

Machine Learning

Systems

Various machine learning techniques are discussed in this subject (Naive Bayes, Decision Trees, Support Vector Machines, Association Rules, etc.); many more exist. Developing your own machine learner is unlikely to be a good use of your time: instead, you are strongly encouraged to make use of machine learning software in your attempts at this project.

One convenient framework for this is Weka: http://www.cs.waikato.ac.nz/ml/weka/. Weka is a machine learning package with many classifiers, feature selection methods, evaluation metrics, and other machine learning concepts readily implemented and reasonably accessible. After downloading and unarchiving the packages (and compiling, if necessary), the Graphical User Interface will let you start experimenting immediately.

Weka is dauntingly large: you will probably not understand all of its functionality, options, and output based on the concepts covered in this subject. The good news is that most of it will not be necessary to be successful in this project. A good place to start is the Weka wiki (http://weka.wikispaces.com/), in particular, the primer (http://weka.wikispaces.com/Primer) and the Frequently Asked Questions. If you use Weka, please do not bombard the developers or mailing list with questions — the LMS Discussion Forum should be your first port of call. We will post some basic usage notes and links to some introductory materials on the LMS Discussion Forum.

4 The main difference between this survey and the original is that we have included Prefer not to answer to numerous questions, to help protect your anonymity, if you wish.

Some people may not like Weka. Other good packages are available (for example, PyML (http://pyml.sourceforge.net/) or scikit-learn (http://scikit-learn.org/)). One caveat is that you will probably need to transform the data into the correct syntax (usually csv or arff) for your given package to process it correctly; we will upload suitable versions of the dataset, if requests are made on the Discussion Forum.
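
For instance, a minimal sketch of the scikit-learn route (using a simple holdout split; see Phases below) might look like the following. The file name weather-check.csv and the class column name check_how are hypothetical, and should be adjusted to match the version of the data you actually use.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Hypothetical file and column names; adjust to the pre-processed data you use.
    df = pd.read_csv("weather-check.csv")
    y = df["check_how"]                                  # class: how the weather is checked
    X = pd.get_dummies(df.drop(columns=["check_how"]))   # one-hot encode the categorical attributes

    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = MultinomialNB().fit(X_train, y_train)
    print("holdout accuracy:", accuracy_score(y_hold, clf.predict(X_hold)))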

Alternatively, you might try implementing certain components yourself. This will probably be time-consuming, but might give you finer control over certain subtle aspects.

Modes

There are two main modes by which you can approach this problem: classification and association rule mining.

In classification, you will treat the dataset as a number of (marginally relevant) attributes that you can use to help predict the “class” — “How does the survey respondent check the weather?” This is probably the more natural approach, and you can use some of the (numerous) classification algorithms covered in this subject, or otherwise, to classify the instances (for example, Naive Bayes, Decision Trees, k-Nearest Neighbour, 1-R, etc.). It is recommended to use methods that you understand (at least conceptually), otherwise it will be difficult to assess why the methods work or not, and your discussion will be limited.
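
To make the simplest of these concrete, here is a small sketch of a 1-R learner over categorical attributes, written directly in Python; the instance representation (a list of attribute-to-value dictionaries plus a parallel list of class labels) is purely an assumption for illustration.

    from collections import Counter, defaultdict

    def one_r(instances, labels, attributes):
        """1-R sketch: choose the single attribute whose value -> majority-class
        rule makes the fewest errors on the training data. Missing values ('?')
        are treated here as just another attribute value (one possible choice)."""
        best = None
        for attr in attributes:
            by_value = defaultdict(Counter)          # attribute value -> class counts
            for inst, label in zip(instances, labels):
                by_value[inst.get(attr, "?")][label] += 1
            rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
            errors = sum(label != rule[inst.get(attr, "?")]
                         for inst, label in zip(instances, labels))
            if best is None or errors < best[0]:
                best = (errors, attr, rule)
        return best  # (training errors, chosen attribute, value -> class rule)

A 0-R baseline (always predicting the most frequent class) is the natural point of comparison for whichever learners you end up reporting.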

In association rule mining, you will attempt to find meaningful patterns within the data. Typically, this will mean that you will forgo the main idea of the project (“How do you check the weather?”), and instead try to find some knowledge inherent in the data. Many of the patterns that you can mine will not actually be interesting, so it may or may not be a fruitful avenue for research; but it is possible to take this approach as well.
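
If you do take this route, the quantities you will typically report are support and confidence; the sketch below computes them for a single candidate rule, with hypothetical attribute and value names used purely for illustration.

    def support_confidence(instances, antecedent, consequent):
        """Support and confidence of the rule antecedent -> consequent, where
        both are dicts of attribute: value pairs and instances is a list of dicts."""
        matches_a = [inst for inst in instances
                     if all(inst.get(k) == v for k, v in antecedent.items())]
        matches_both = [inst for inst in matches_a
                        if all(inst.get(k) == v for k, v in consequent.items())]
        support = len(matches_both) / len(instances)
        confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
        return support, confidence

    # Hypothetical attribute/value names, purely for illustration:
    # support_confidence(data, {"Age": "18 - 29"}, {"check_how": "A specific website or app"})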

Phases

The main idea of (supervised) classification is to build a prediction mechanism based on one dataset (the “training set”) and predict unseen instances (the “test set”). You can thus use the data given above to “train” your models, and then classify the “test instances” from students in the subject.

The problem with this is that the test labels are, by definition, unseen — so you will have no idea whether the model is actually working well! In Weka, you can solve this problem by evaluating the model’s goodness (according to some evaluation metrics) with respect to the training set itself. There are three main mechanisms for doing this:

• Testing on the training set (“Use training set”). This introduces terrible biases to your model and makes it very difficult to generalise. (Why?) Please don’t do this.
  • By holding out some portion of the training data, pretending it is unseen, building the model on the remainder of the training data, and evaluating by comparing the predictions to the known labels (“Percentage split”). This is a valid method.
  • By doing the above, but using each portion of the data as a test split, and building the model on the remainder of the data, and iterating (“Cross-validation”). Although this takes more time than the “holdout” strategy above, it has less variance, and should be reasonable on a data set of this size. This is the recommended method.

You can then evaluate your model(s), based on the method you have chosen above, according to some evaluation metric(s) — for example: accuracy; precision, recall, and f-score of the various classes; area under a ROC curve, and so on. (Weka reports all of these values in its output.) They can then be included in your report. Again, you should choose evaluation metrics that allow you to sensibly discuss your observations in terms of the knowledge that you have acquired.
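
Continuing the earlier scikit-learn sketch (and assuming the same X and y), a cross-validated accuracy and a per-class breakdown might be obtained along these lines; the choice of a decision tree here is arbitrary.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import accuracy_score, classification_report

    # X, y as in the earlier sketch (one-hot encoded attributes, class labels).
    clf = DecisionTreeClassifier(random_state=0)
    pred = cross_val_predict(clf, X, y, cv=10)   # each instance is predicted by a model that did not see it

    print("10-fold cross-validated accuracy:", accuracy_score(y, pred))
    print(classification_report(y, pred))        # precision, recall, f-score per class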


COMP90049 students will also be required to run their models on the provided test data (and include these outputs with their submission). You will not be able to evaluate these directly, but there is potentially some meaningful analysis you can do in estimating how and why your models will perform as they do in classifying this unseen data. For COMP30018 students, this is optional.
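
One possible way of producing those outputs, continuing the same scikit-learn sketch, is shown below; the file names weather-check-test.csv and nb-test-predictions.txt are hypothetical placeholders.

    import pandas as pd
    from sklearn.naive_bayes import MultinomialNB

    # Train on the full training data, then label the unseen test instances.
    # X, y as in the earlier sketch; file names are hypothetical placeholders.
    test = pd.read_csv("weather-check-test.csv")
    X_unseen = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)  # align attribute columns

    model = MultinomialNB().fit(X, y)
    pd.Series(model.predict(X_unseen)).to_csv("nb-test-predictions.txt", index=False, header=False)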

In an association rule mining system, there are no separate training and test phases. You can examine the rules generated directly from the given data set, along with their supports and confidences. This should allow you to make observations directly on the data set.

COMP90049 students will also be required to consider the rules generated from the test data set — whether and why the rules from the given “training” data will generalise to the unseen data (of students from this class). This is optional for COMP30018 students.

Other Important Data Considerations

Missing Field Values

It is (frequently) the case that some of the data has not been recorded, is unavailable, or is unreliable; in these cases, the value has been recorded as ?5. Dealing with this problem is an entire sub-field of machine learning. Since we are using real-world data, this is a problem that you will need to deal with. Here are some suggestions:

• Use an algorithm that handles missing values natively. For example, many Naive Bayes implementations just ignore missing fields from the test item, and estimate probabilities based on whatever training data is available.
• Treat ? as an attribute value. Many decision tree implementations do this. There tends to be some awkwardness in your machine learner if the given entry is usually numeric (because ? is not a number).
• Make a “best guess” at what the data would have been if it had been recorded. There are two typical ways to do this:
  – Choose the mean (or median) from the training data. This is typically what support vector machine implementations do. It makes sense for normally distributed data (age), less for uniformly distributed data, even less for categorical data (what’s the “mean” of New England and Pacific?).
  – Choose the most frequently occurring value (mode), or some other sensible frequent value based on your knowledge of the data. For example, if do-check is not recorded, it was probably Yes.
• Remove the record in question, and only deal with records where all of the fields have been recorded (so that the data is more reliable). This may not be feasible on this data set, because you only have a few hundred instances.
• Remove the attribute in question, and only use attributes where the values are always recorded. This may not be feasible on this data set, because you only have a few attributes.

    You should discuss your treatment of this problem in your report. (Perhaps only briefly if you take a straightforward solution.)
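
As one small illustration of the “best guess” option above, a mode-based fill in pandas takes only a few lines (assuming the DataFrame df from the earlier sketches; this is one simple strategy, not necessarily the best one).

    import numpy as np

    # Replace '?' markers with each column's most frequent recorded value (the mode).
    df = df.replace("?", np.nan)
    for col in df.columns:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Alternatively, to keep only the fully recorded instances instead:
    # df = df.replace("?", np.nan).dropna()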

    Attribute types

The class that we wish to predict, “How do you typically check the weather?”, is categorical, with 8 different possibilities. All of the attributes have also been recorded as categorical; however, there are two points of interest:

    5We have also converted Prefer not to answer responses to ?, which may or may not be a sensible thing to do.


• The “A specific website or app…” attribute is best construed as a string, not a category. However, most classifiers can’t handle string-type attributes natively. The most obvious way of dealing with this problem is to remove the attribute, but it’s possible that it contains some usable (if not useful) information. You might try to exhaustively state the possibilities in the Weka header, or manage the information in some other way.
• The “Age” and “Household Income” attributes seem more like ordinal attributes than categorical. You might try to convert them to some numerical format, to better capture this pattern of the data. Some machine learners (support vector machines, for example) work better with continuous attributes; some (Naive Bayes) prefer categorical attributes.

    You should indicate how you manage these issues in your report.
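
For instance, one way of recoding the ordinal attributes is an explicit mapping onto integers, as sketched below; the column names and category strings are assumptions for illustration and should be checked against the actual data.

    # Map ordinal survey categories onto integers so that their ordering is visible
    # to learners that benefit from numeric attributes. Column names and category
    # strings are assumptions; check them against the actual survey data.
    age_order = ["18 - 29", "30 - 44", "45 - 59", "60+"]
    df["Age"] = df["Age"].map({cat: i for i, cat in enumerate(age_order)})

    income_order = ["$0 to $24,999", "$25,000 to $49,999", "$50,000 to $99,999",
                    "$100,000 to $149,999", "$150,000 to $199,999", "$200,000 and up"]
    df["Household Income"] = df["Household Income"].map(
        {cat: i for i, cat in enumerate(income_order)})

    # Any unmapped value (including a missing '?') becomes NaN, and falls back to
    # whatever missing-value strategy you chose above.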

    Technical Notes

You will not be required to submit scripts which generate the formatted data for your system; any manual tuning (cheating!) will be quite obvious. You should discuss the systems that you used (and the relevant parameter settings, where necessary) in your report. For each of the files you submit (the outputs on the accompanying test data), your README should detail which model and parameters were used to generate it. Technical details of the implementation (system settings, resource limits, etc.) should also be discussed in the README. Performance over the training data and your observations should be included in the report.

    Report

The report should be broadly similar to your report for Project 1, including the templates and sample papers. Carefully chosen tables of results and figures will help you to spend your word count in discussing the relevance of your observations — the length requirement will be 750–1250 words for COMP30018 students, and 1000–1500 words for COMP90049 students.

In your report, you should describe your approach and observations. Only describe points that are relevant to your paper (and not, say, common to everyone’s paper) — in particular, do not discuss the internal structure of your classifiers, unless it is substantially different to the ones we discuss in this subject (and then a primary citation should be adequate), or its internals are important for connecting the theory to your practical observations.

    Remember that your goal here is knowledge — the (detailed) discussion in the sections above is to allow you to set up the problem in a sensible manner, but actually running the software and getting some results will mostly be a trivial matter. Consequently, reports which only describe results of algorithms without corresponding analysis are trivial, and have missed the entire point of the project!

You should, once again, make an effort to cite relevant research articles in your report. Here is an example of one tangentially relevant paper:

    Ron Kohavi. “Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 96), Portland, USA. pp. 202–207.

    In particular, you must attribute the data set to its curators, at least by citing the article above. I will repeat its citation here:

    Hickey, Walter. (2015) Where People Go To Check The Weather, FiveThirtyEight, accessed 8 May 2015, <http://fivethirtyeight.com/datalab/weather-forecast-news-app-habits/>.


I cannot stress the following point strongly enough:

If you do not indicate the source of the data set, you are committing plagiarism. Consequently, I will be forced to give you 0 marks for the paper, and you will probably fail the subject.
Consider yourself warned.

Also note that there is some low-level analysis included in the article above. You may include it if you wish (by citing it correctly!), but it will not be considered as novel analysis for the evaluation criteria.

Assessment

For students taking this subject as COMP30018, the project will be marked out of 15, and is worth 15% of your overall mark for the subject. Note that there is a hurdle requirement on your combined project mark, of 15/30 (of which this project will contribute 15 marks).

For students taking this subject as COMP90049, the project will be marked out of 20, and is worth 20% of your overall mark for the subject. Note that there is a hurdle requirement on your combined project mark, of 20/40 (of which this project will contribute 20 marks).

Late submissions

We will use the standard policy for late submissions — 10% will be deducted per business day (or part thereof) after the deadline, up until 5 days beyond the deadline. Submissions that are more than 5 days late will not be accepted.

However, as we have been a little behind in the subject content, we have released the project a little bit later than usual. Also, the last week of semester is often a stressful time. So, we will be waiving the first day of late penalties, under the condition that if you intend to submit late, you email Jeremy (nj@unimelb.edu.au) before the submission deadline — if you do not contact us by the submission deadline, we will apply the late penalties as usual.

In effect, this means that, as long as you send the email, you can submit without penalty up until 5pm on Monday, 1 June 2015. Note that the regular penalties apply after this, so 5:02pm on 1 June will mean that the project is 2 days late, and a 20% penalty will apply. Note also that I will be applying the late penalties far more stringently than in Project 1, so please plan ahead.

If there are documentable medical or personal circumstances which have taken time away from your project work, you should contact Jeremy via email (nj@unimelb.edu.au) at the earliest possible opportunity (this generally means well before the deadline). We will assess whether special consideration is warranted, and if granted, will scale back the expectations proportionately (e.g. if you have been unable to work on the project for a total of 1 out of 3 weeks, the expectations will be reduced by around 33%). No requests for special consideration will be accepted after the submission deadline.

Note that we do not wish for you to submit later than the (somewhat modified) deadline, because it will negatively affect your ability to prepare for your exams.

Submission

Submission will be similar to Project 1; you will submit a report (in PDF) to Turnitin on the LMS, and copy your code and system outputs, where necessary, to the following directory on the CIS servers (dimefox.eng.unimelb.edu.au or nutmeg):

/home/subjects/comp90049/submission/your-login/

Be warned that computer systems are often heavily loaded near project deadlines, and unexpected network or system downtime can occur. You should plan ahead to avoid unexpected problems at the last minute. System downtime or failure will generally not be considered as grounds for special consideration.


While it is acceptable to discuss the project with other students in general terms (including on the Discussion Forum), excessive collaboration will be considered cheating. This is particularly notable in a project such as this, where analysis will comprise almost all of the evaluation. We will be carefully examining the reports for evidence of collusion, and will invoke the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html) where inappropriate levels of collaboration or plagiarism are deemed to have taken place.

The mark breakdown will roughly be as follows, subject to minor alteration:

         Creativity   Critical Analysis   Soundness   Report Quality   Total
30018    1 mark       8 marks             3 marks     3 marks          15 marks
90049    2 marks      10 marks            4 marks     4 marks          20 marks

Changes/Updates to the Project Specifications

We will use the LMS to advertise any (hopefully small-scale) changes or clarifications to the project specifications. Any addenda made to the project specifications via the LMS will supersede information contained within this document.
