Information School.
INF6028 Coursework 2020-21
Version date: 15/04/2021
Mining and Visualising a Structured Dataset
1. Introduction
The assessment for INF6028 Data Mining and Visualisation consists of a piece of individual coursework to assess your ability to understand key data mining, analysis and evaluation concepts by carrying out a data mining task and interpreting and communicating your findings. Given a dataset (see Section 2), you should use data exploration techniques to explore the data and then you should choose two supervised data mining techniques available in KNIME to predict certain data values and compare their relative performance (you may wish to compare these with one or more simple rule-based approaches as baselines; these do not count towards the total count of two). You will need to select the most appropriate techniques and justify your choices made at different stages of your workflow.
You should write a 2,000 word structured report (see Section 3) that includes the following headings (more details on how the report will be assessed are provided below):
• Introduction – introduce the prediction problem.
• Data mining theory – provide a theoretical description of your two chosen supervised data mining methods (for example, the classification or regression techniques that you have used) and why they are appropriate to your task.
• Data exploration and preparation – briefly describe the approaches you used in your workflow for feature selection, transformation and normalisation, where appropriate.
• Experimental setup – describe the experimental setup you used in KNIME for each data mining method including hyperparameter tuning of the methods. Describe the evaluation measures that you used and how you handled the data to ensure that the models were not over-trained. Ensure that you provide sufficient information for someone else to repeat your study. For example, you should explain which nodes you have used in KNIME and which parameter settings you used. It may be appropriate to present this data in table form or as an Appendix.
• Results – present the results for each data mining method and compare the performance of the different methods using graphical and tabular methods. What insights can you gain from the models? For example, which are the most important features, and are there any outliers in the predictions?
• Conclusion and reflections – summarise the main findings of your report and reflect on the methods used.
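One common way of handling the data so that models are not over-trained is k-fold cross-validation. The following is a minimal Python sketch of how the rows are divided, given only to illustrate the idea; the coursework itself is carried out with KNIME nodes, and the function name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    # A fixed seed makes the shuffle, and therefore the study, repeatable.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k disjoint folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        # Every fold except the held-out one is used for training.
        train_idx = [row
                     for fold_no, fold in enumerate(folds)
                     if fold_no != held_out
                     for row in fold]
        yield train_idx, test_idx


# Illustrative call on 10 rows: each row is used for testing exactly once.
splits = list(k_fold_indices(10, k=5, seed=42))
```

Each model is trained on `train_idx` and scored on `test_idx`, and the k scores are then averaged, so no row is ever used to evaluate a model that was trained on it.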
Charts, tables, references and appendices are not included in the word count.
This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to pass the module. Submission deadline: 10am Thursday 27th May 2021 (Week 13) via Turnitin. See Section 4 for more general information about Coursework Submission Requirements within the Information School.
2. The Datasets
Choose one of the following datasets which are derived from Kaggle competitions. The datasets are downloadable from Blackboard in the Coursework Brief & Information section. A brief description of the attributes in each dataset is given at the end of this document. Note that in both cases the data are different to the standard Kaggle datasets.
Titanic-derived dataset
The data is split across two files, each of which contains 1204 entries representing 1204 passengers, although it should be noted that the passengers are not necessarily the same in the two files. The two files are titanic_ticket_data.csv and titanic_personal_data.csv.
Note that this data is different to the standard Kaggle Titanic dataset both in terms of the number of entries and the number of attributes.
The aim of this challenge is to build models that predict whether or not a passenger will survive the sinking of the Titanic.
House Prices
The house-prices.csv dataset consists of 1,300 houses characterised by around 80 attributes. The aim of this challenge is to build models that predict the sale price of houses.
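For either prediction task, a trivial baseline gives a reference point against which the two chosen methods can be judged. The sketch below is one possible reading of a "simple baseline" and is not a required method: it predicts the majority class for the classification task and the mean training price for the regression task. All values are invented toy data, not taken from the coursework files, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def majority_class_baseline(train_labels):
    """Return the most frequent label seen in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))


# Classification: always predict the majority class (invented toy labels).
train_survived = [0, 0, 1, 0, 1]
test_survived = [0, 1, 0, 0]
guess = majority_class_baseline(train_survived)
baseline_accuracy = accuracy(test_survived, [guess] * len(test_survived))

# Regression: always predict the mean training price (invented toy prices).
train_prices = [100000.0, 200000.0, 300000.0]
test_prices = [150000.0, 250000.0]
mean_price = sum(train_prices) / len(train_prices)
baseline_rmse = rmse(test_prices, [mean_price] * len(test_prices))
```

A trained model that cannot beat these figures on held-out data has learned little from the attributes.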
3. Report Structure
You are required to produce a structured report that includes all the sections detailed in Table 1. You must state the word count somewhere in the report. As there is a word count limit you should aim to make your writing as concise and informative as possible. The emphasis of the report should be on the clarity, accuracy and quality in communicating your findings.
Table 1: Required content of the structured report.
| Section | Description | Maximum allocated marks |
|---|---|---|
| Structured abstract | This should provide a summary of your report in a structured manner. This is not included in the word count. | Required, but 0 marks |
| Introduction | This section should concisely introduce the data mining task that is addressed in the report. You should indicate the property/data value that you will predict and give a brief overview of the dataset and methods that you will use. | 10 marks |
| Data Mining Theory | This section should provide an overview of your chosen algorithms for predictive data mining from a theoretical aspect. Explain why they are relevant to your prediction problem. Support your rationale by providing references to the literature where the techniques have been applied to similar problems. Include a discussion of the most appropriate methods for evaluating the performance of your chosen data mining methods. | 25 marks |
| Data Exploration and Preparation | This section should provide a brief description of the data and of the approaches used to pre-process the data. You should present an investigation of the attributes (particularly the data value to be predicted) and describe any data cleaning employed, including handling of missing data, data transformations and data aggregations. | 10 marks |
| Experimental Setup | This section should describe the experimental design used for each data mining method. You should describe the process you followed in order to find the best performing model for each method and how you validated these. For example, which KNIME nodes did you use? How did you configure them? Did you use cross-validation or a separate validation set and why? | 20 marks |
| Results and Discussion | Present the results of your data mining including the results of experiments to find the best model for each data mining method. Compare the best performance of the different methods and, if appropriate, consider which attribute contributes most to each model. Discuss the advantages and disadvantages of the chosen data mining methods for your problem, including the interpretability of the methods you have chosen. Which of the chosen methods produced the best model and why? Try to use citations to relevant literature to support the points you make. | 20 marks |
| Conclusion and reflection | Summarise the main findings of the analysis and reflect on the choice of methods for the problem, for example, how might the models be improved with hindsight? Use evidence from the literature to support your arguments. | 15 marks |
| KNIME Workflow(s) | You should submit your KNIME workflow(s) as a single “.knar” file. Note that this can consist of separate workflows but they should all be saved to one file. Include your best setup for each data mining method. | Required, but 0 marks. Note that 5 marks will be deducted if this is not submitted. |
4. Information School Coursework Submission Requirements
It is the student’s responsibility to ensure no aspect of their work is plagiarised or the result of other unfair means. The University’s and Information School’s Advice on unfair means can be found in your Student Handbook, available via http://www.sheffield.ac.uk/is/current
Your assignment has a word count limit. A deduction of 3 marks will be applied for coursework that is 5% or more above or below the word count as specified above or that does not state the word count.
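As a worked example of the word count rule, the sketch below assumes the 2,000 word limit stated in Section 1 and reads "5% or more" as inclusive, so counts of 1,900 or fewer and 2,100 or more are penalised. It also assumes a single 3-mark deduction even if the count is both outside the range and unstated; the brief does not say whether the deductions would combine. The function name is hypothetical.

```python
WORD_LIMIT = 2000      # from Section 1 of this brief
TOLERANCE_PERCENT = 5  # "5% or more above or below"
PENALTY_MARKS = 3


def word_count_penalty(word_count, stated=True):
    """Return the marks deducted under the word count rule (assumed reading)."""
    margin = WORD_LIMIT * TOLERANCE_PERCENT // 100  # 100 words
    lower = WORD_LIMIT - margin                     # 1900
    upper = WORD_LIMIT + margin                     # 2100
    # Assumption: the boundaries themselves count as "5% or more".
    outside = word_count <= lower or word_count >= upper
    return PENALTY_MARKS if (outside or not stated) else 0
```

Under this reading, a report of between 1,901 and 2,099 words that states its word count incurs no deduction.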
It is your responsibility to ensure your coursework is correctly submitted before the deadline. It is highly recommended that you submit well before the deadline. Coursework submitted after 10am on the stated submission date will incur a deduction of 5% of the mark awarded for each working day after the submission date/time, up to a maximum of 5 working days. A ‘working day’ is Monday to Friday (excluding public holidays) and runs from 10am to 10am. Coursework submitted after the maximum period will receive zero marks.
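The late penalty can be illustrated with a short calculation. This sketch assumes the deduction is 5% of the mark awarded per working day and that the daily deductions simply add up; counting the working days late (10am to 10am, excluding weekends and public holidays) is left to the caller. The function name is hypothetical.

```python
MAX_LATE_DAYS = 5
DEDUCTION_PERCENT_PER_DAY = 5


def late_penalised_mark(mark, working_days_late):
    """Apply the late-submission deduction to an awarded mark (assumed reading)."""
    if working_days_late <= 0:
        return float(mark)          # submitted on time, no deduction
    if working_days_late > MAX_LATE_DAYS:
        return 0.0                  # beyond the maximum period: zero marks
    deduction = mark * DEDUCTION_PERCENT_PER_DAY * working_days_late / 100
    return mark - deduction


# Example: an awarded mark of 60 submitted two working days late.
example = late_penalised_mark(60, 2)
```

Under these assumptions an awarded mark of 60 becomes 54 after two working days and 45 after five, which would fall below the pass mark of 50.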
Work submitted electronically, including through Turnitin, should be reviewed to ensure it appears as you intended.
Before the submission deadline, you can submit coursework to Turnitin numerous times. Each submission will overwrite the previous submission. Only your most recent submission will be assessed. However, after the submission deadline, the coursework can only be submitted once.
Details about the submission of work via Turnitin can be found at http://youtu.be/C_wO9vHHheo
If you encounter any problems during the electronic submission of your coursework, you should immediately contact the module coordinator and a member of the Information School Teaching Support Team at is-teaching-support@shef.ac.uk (Julie Priestley, 0114 2222839). This does not negate your responsibility to submit your coursework on time and correctly.
Titanic Dataset
The Titanic data consist of two files that need to be merged.
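The merge itself is done in KNIME for the coursework (for example with a Joiner node), but the idea can be sketched in a few lines of Python. The example below performs an inner join on PassengerId using invented miniature stand-ins for the two files; the rows, the reduced column set and the helper `inner_join` are all hypothetical. Because the passengers are not necessarily the same in the two files, an inner join keeps only those who appear in both.

```python
import csv
import io

# Invented miniature stand-ins for the two coursework files.
ticket_csv = """PassengerId,Survived,Fare,Embarked
1,0,7.25,S
2,1,71.28,C
4,1,53.10,S
"""
personal_csv = """PassengerId,Name,Sex,Age
1,Passenger A,male,22
2,Passenger B,female,38
3,Passenger C,female,26
"""


def inner_join(left_rows, right_rows, key):
    """Join two lists of dict rows, keeping only keys present in both.

    Assumes the key is unique within right_rows.
    """
    right_by_key = {row[key]: row for row in right_rows}
    return [{**left, **right_by_key[left[key]]}
            for left in left_rows
            if left[key] in right_by_key]


tickets = list(csv.DictReader(io.StringIO(ticket_csv)))
people = list(csv.DictReader(io.StringIO(personal_csv)))
merged = inner_join(tickets, people, "PassengerId")
```

In this toy data, passengers 3 and 4 each appear in only one file and are dropped by the inner join. Whether to drop such rows or keep them with missing values (an outer join) is a data preparation choice that should be justified in the report.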
The titanic_ticket_data.csv data consists of the following variables:
PassengerId: the identifier
Survived: the value to predict
Ticket: the Ticket Number
Fare: the passenger fare
Cabin: Cabin number
Embarked: Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton
The personal data file titanic_personal_data.csv consists of the following variables:
PassengerId: the identifier
Name: the name of the passenger
Sex: male or female
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
SibSp: number of siblings/spouses, where family relations are defined as follows:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife
Parch: number of parents/children, where family relations are defined as follows:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore Parch = 0 for them.
Salary: in dollars
Job: job title
House Prices Dataset
The house prices data consists of one file (house-prices.csv) with the following variables:
SalePrice: the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale