COURSEWORK ASSIGNMENT UNIVERSITY OF EAST ANGLIA School of Computing Sciences
UNIT: CMP-7023B Data Mining
ASSIGNMENT TITLE: Data Mining the Forest Cover database
DATE SET: 26/02/2019
DATE & TIME OF SUBMISSION: 02/05/2019
RETURN DATE: 30/05/2019
ASSIGNMENT VALUE: 65%
SET BY: G. Richards
CHECKED BY: B. de la Iglesia
SIGNED: SIGNED:
Aims:
To obtain an overall view of the complex process of Knowledge Discovery in Databases and understand the need for a methodical approach to KDD.
To explore the tools and algorithms available at each stage of the KDD process.
To gain experience of using KDD software tools on a medium-sized database.
To learn to combine data manipulation and analysis approaches in order to improve the quality of input data.
To present knowledge induced in a format suitable for the target audience and for the particular application.
Learning outcomes
Competence in using KDD software tools in medium to large databases.
Competence in applying relevant techniques at each stage of the KDD process.
Ability to evaluate the suitability of software tools in the context of different data analysis tasks.
Competence in combining data manipulation and analysis approaches in order to improve the quality of input data.
Understanding and identification of problems in input data, such as outliers, missing data, unreliable data and differences in granularity, and identification of an adequate strategy to deal with the problem data.
Presentation of knowledge induced in a format suitable for the target audience and for the particular application.
Assessment criteria
Part 1: 10%
Part 2: 20%
Part 3: 20%
Part 4: 20%
Part 5: 20%
Overall presentation, conclusions and executive summary: 10%
Total: 100%
Description of Assignment
The data used for this coursework is the Forest Cover Dataset, which can be obtained from Blackboard. Download the file covtype.csv; its first line contains the field names.
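Because the first line of the file holds the field names, a first draft of a data dictionary can be generated directly from the header. The following is a minimal sketch using only the Python standard library, not a prescribed method. The function name `build_data_dictionary`, the column names (`Elevation`, `Slope`, `Cover_Type`) and the three sample rows are illustrative assumptions; check the actual names against covtype.csv and covtype.info.

```python
import csv
import io

def build_data_dictionary(lines):
    """Summarise every field of a CSV whose first line holds the field names."""
    reader = csv.DictReader(lines)
    stats = {}
    for name in reader.fieldnames or []:
        stats[name] = {"numeric": True, "missing": 0,
                       "min": None, "max": None, "values": set()}
    for row in reader:
        for name, raw in row.items():
            if name is None:  # extra cells beyond the header: ignored in this sketch
                continue
            info = stats[name]
            value = (raw or "").strip()
            if not value:  # empty cell counts as missing
                info["missing"] += 1
                continue
            info["values"].add(value)
            try:
                number = float(value)
            except ValueError:
                info["numeric"] = False
                continue
            info["min"] = number if info["min"] is None else min(info["min"], number)
            info["max"] = number if info["max"] is None else max(info["max"], number)
    dictionary = {}
    for name, info in stats.items():
        numeric = info["numeric"] and info["min"] is not None
        dictionary[name] = {
            "type": "numeric" if numeric else "text",
            "missing": info["missing"],
            "distinct": len(info["values"]),
            "min": info["min"] if numeric else None,
            "max": info["max"] if numeric else None,
        }
    return dictionary

# Made-up sample rows with ASSUMED column names, for illustration only.
SAMPLE = "Elevation,Slope,Cover_Type\n2596,3,5\n2590,2,5\n2804,,2\n"
dictionary = build_data_dictionary(io.StringIO(SAMPLE))
for field, entry in dictionary.items():
    print(field, entry)
```

For the real file, the same function can be called on an open file handle, for example `with open("covtype.csv", newline="") as f: build_data_dictionary(f)`. The generated summary is only a starting point: field meanings and units still have to be added by hand from covtype.info.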
The information contained in this database relates to the forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data.
The purpose of this project is to see whether information can be gleaned from this database and used by the USFS to identify the type of management required for a particular area. Different types of forest cover require different management, so the USFS wants to derive models for future use that would allow it to establish the forest cover type of a particular area from the measurements of new cells (i.e. a classification problem).
In order to accomplish your task, you need to perform the following operations:
1. Download the database and prepare and present a data dictionary for the data. Note that some information on the fields is available from Blackboard in the file covtype.info.
2. Split the data into a test set and a training set. On the training set, undertake any cleansing or pre-processing you think necessary, explaining clearly what you have done and why you have done it.
3. Use a suitable toolkit to construct a decision tree from the training set to predict the Forest Cover Type. You may use as input any or all of the fields (except of course that describing the Cover Type) and obtain a predicted type. Set any parameters you can to get as “good” a tree as possible for this application. (Note that it may be necessary at this point to revisit the pre-processing stage to improve the quality of the models obtained.) You will need to test the performance of your model on your test set. As part of your final report you will be describing the decisions you have made, the resulting tree, how it has been assessed and its effectiveness for the problem in hand.
4. An alternative approach is to use (supervised) clustering for this task. The idea is to cluster the database using k-means to obtain different clusters for the various types of cover. The Cover Type is not used in the clustering. Can you find any suitable clusters that correspond closely to one or more of the Forest Cover Types?
5. Finally, the USFS is particularly interested in the description of Forest Cover Type 4, corresponding to Cottonwood/Willow, as this type requires special management and treatment. Can you find any patterns of reasonable quality to describe that particular class?
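Whichever toolkit builds the tree and runs k-means, some of the bookkeeping around steps 2, 4 and 5 is independent of the toolkit. The sketch below is one possible approach, not a required one: a stratified train/test split, a cross-tabulation of cluster labels against Cover Type with a purity score, and the confidence and coverage of a candidate rule for one class. All function names, the `Elevation` field, the threshold and the toy values are assumptions made up for illustration; they are not taken from the dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)

def cluster_class_table(cluster_ids, labels):
    """Cross-tabulate cluster membership against the true class."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        table[cluster][label] += 1
    return table

def cluster_purity(table):
    """Fraction of records that belong to the majority class of their cluster."""
    total = sum(sum(counts.values()) for counts in table.values())
    majority = sum(max(counts.values()) for counts in table.values())
    return majority / total

def rule_quality(records, labels, condition, target):
    """Confidence and coverage of the rule 'condition => class == target'.

    confidence: share of records matching the condition that are in the target class
    coverage:   share of target-class records that match the condition
    """
    fired = [lab for rec, lab in zip(records, labels) if condition(rec)]
    hits = sum(1 for lab in fired if lab == target)
    n_target = sum(1 for lab in labels if lab == target)
    confidence = hits / len(fired) if fired else 0.0
    coverage = hits / n_target if n_target else 0.0
    return confidence, coverage

# Toy data, entirely made up: ten cells with a class label and one ASSUMED field.
labels = [1, 1, 1, 1, 2, 2, 2, 2, 4, 4]
elevations = [2900, 3000, 2800, 2100, 2700, 2600, 2500, 2650, 2000, 2150]
records = [{"Elevation": e} for e in elevations]

train_idx, test_idx = stratified_split(labels, test_fraction=0.5, seed=0)

# Pretend these cluster ids came from a k-means run that never saw the labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
table = cluster_class_table(cluster_ids, labels)
purity = cluster_purity(table)

# Hypothetical candidate rule for class 4: "Elevation < 2200".
confidence, coverage = rule_quality(
    records, labels, lambda rec: rec["Elevation"] < 2200, target=4)
print(purity, confidence, coverage)
```

Stratifying the split matters when classes are unevenly represented, since a plain random split could leave very few examples of a rare class in the test set. Reporting both confidence and coverage for a rule guards against patterns that are accurate but describe only a handful of cells.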
Collate all the answers to the above questions in a technical report. Your report should be in the format of a consultancy report to the USFS. It should advise them on the potential of this data for Forest Cover Type prediction/description and on the efficacy of the methods proposed in the study. The consultancy report should not exceed ten pages and should be preceded by a one-page executive summary, which will summarise the work you have undertaken and the conclusions you have reached. You should also include appendices giving evidence of your experimental work, the data dictionary and any other relevant information to present your case. The report must be well written and clearly presented. Students are expected to work independently, and any plagiarism or collusion will be heavily penalised.
Handing in procedure:
Please submit a single PDF containing all parts of the report to the PGT Hub following the advertised procedure for all coursework submissions.
Plagiarism:
Plagiarism is the copying or close paraphrasing of published or unpublished work, including the work of another student, without the use of quotation marks and due acknowledgement. Plagiarism is regarded as a serious offence by the University and all cases will be reported to the Board of Examiners. Work that contains even small fragments of plagiarised material will be penalised.