MN-M535 Data Mining
Academic Year 2019-20
Module Handbook
Module Co-ordinator:
Dr Karima Dyussekeneva
Office: Bay Campus, School of Management Building, Third Floor, Room 320
Office Hours: Wednesday 3.30 – 4.30 pm; Friday 3.30 – 4.30 pm
Email: k.dyussekeneva@swansea.ac.uk
Teaching Staff:
Dr Karima Dyussekeneva
Office: Bay Campus, School of Management Building, Third Floor, Room 320
Office Hours: Wednesday 3.30 – 4.30 pm; Friday 3.30 – 4.30 pm
Email: k.dyussekeneva@swansea.ac.uk
School of Management
Module Overview
Introduction
The field of Data Mining is still relatively new and in a state of evolution. Data Mining stands at the confluence of the fields of statistics and machine learning (a branch of artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics, for example linear regression, discriminant analysis and principal component analysis. Computer science has brought machine learning techniques, such as trees and neural networks, that are less structured than classical statistical models and more computationally intensive. In addition, the growing field of database management is also part of the Data Mining structure.
Today Data Mining is used in a variety of fields and applications. Enterprises benefit from collecting and analysing their data, hospitals can spot trends and anomalies in their patient records, and search engines can improve ranking and ad placement. The list continues with cybersecurity and computer network intrusion detection, financial and business intelligence, and many more.
This booklet contains:
• an introduction to the module
• lecture and seminar locations
• details of the core textbooks via the reading list
• information on assessment and feedback, including the coursework brief
• an overview of the entire module
Lecture & Seminar Locations
Lectures will take place in Great Hall 011 on Wednesday morning 11.00-13.00.
Seminars will take place in The College 129 on Friday afternoon 14.00-15.00.
Please note: Lecture/seminar times and locations may change in the first two weeks of term. Please check Blackboard announcements and the timetable data displayed on the Intranet for regular updates.
Communication
Lecture notes will be posted on Blackboard along with announcements and any administrative notices.
Learning Outcomes
• Recognise and recall basic data mining concepts and different data mining techniques appropriate for analysing various business problems.
• Decide on and select an appropriate data mining instrument to analyse a relevant business problem, and describe the main functions of the selected technique.
• Solve hypothetical business problems by applying appropriate data mining techniques, and interpret the outputs in the data mining and business contexts.
• Use specialist software such as SPSS and Weka to apply data mining algorithms to tasks such as classification, association and prediction.
• Analyse outputs obtained by different data mining techniques, and inspect relationships between changes in the algorithm parameters and those in the outputs. Compare the performance of various data mining techniques, and balance complexity with accuracy in view of the relevant business problem.
• Present and defend opinions in selecting appropriate data mining techniques for the stated business problem by evaluating the validity of the application outputs according to criteria such as accuracy, computational complexity and strengths and weaknesses of the instruments.
Transferable Skills
• Communication
• Information Technology
• Analytical skills
• Problem-solving
• Widening horizons
• Improving Learning and Performance
Reading Material
The full reading list for this module is available via Blackboard in the ‘Reading List’ folder.
The core textbook for the module is either of the following:
Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Ian H. Witten, Eibe Frank and Mark A. Hall. Elsevier.
Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Pearson.
A core textbook is only a starting point and provides introductory and background information only. Supplemental reading will be identified at each lecture. To achieve high marks in this module, students will need to do the background and supplemental reading as well as conduct their own independent research into the topics identified, for instance by reading academic journals.
Assessment
The assessment for the module is structured as follows:
• 2 x 50% individual coursework reports (project reports on data mining exercises)
Feedback on the coursework will be provided within three calendar weeks of submission. All feedback for the coursework assignments will be provided through GradeMark. Marks will be made available via Grade Centre in Blackboard and your university student portal.
Submission in Welsh
Any written work submitted as part of an assessment or examination may be submitted in Welsh. Work submitted in Welsh will be treated no less favourably than written work submitted in English.
Individual Coursework Assignment
Each coursework assignment for this module is an individual assignment worth 50% of the overall module mark.
Coursework Brief
Coursework 1
Regression models for data mining (Individual project). In this project, you will focus on applying logistic regression to build a classification model, based on the ‘Boston Housing’ SPSS data. The full coursework brief and the data will be provided on Blackboard.
Coursework 2
Classification and prediction models for data mining (Individual project). In this project, you will focus on applying a decision tree to build a classification model, based on the ‘Hepatitis’ WEKA data. The full coursework brief and the data will be provided on Blackboard.
Key marking criteria will include:
• Initiative: originality, innovativeness of answer
• Assignment Structure: clarity of aims, objectives, structure and presentation
• Quality of Writing: Readability and ability to convey key message(s) concisely
• Quality/Scope of Literature Review: Understanding of established knowledge
• Suitability of Literature: Use of suitable sources, focused to answer key research aims
• Literature Analysis: Quality/level of analytical skill demonstrated
• Insightfulness of Analysis: Interest and usefulness of findings, conclusions drawn
• Understanding: Assignment demonstrates students have understood key topics
• Overall Quality of Assignment
Submission
Assignment one must be submitted by 3pm on Monday 2nd of March via Turnitin.
Assignment two must be submitted by 3pm on Monday 6th of April via Turnitin.
Please note:
• The maximum file size that can be uploaded is 20 MB. If your file is larger than this, it is usually because you have included a lot of images. You should either remove some if possible, or convert them to a more efficient format to bring the file size down (e.g. .png or .gif).
Digital Submission of Coursework Instructions
• Log on to Blackboard.
• Access the appropriate Module site.
• Click the Assignment menu button which appears on the left of the screen.
• In this folder you will see a file entitled ‘Student Declaration form’. You need to complete this form and incorporate it as the first page of your coursework (not as a separate file).
• Click Coursework. Please read the statement of originality before you click “submit”. By submitting work you are agreeing to this statement and confirming it to be true.
• Complete the dialogue box with your forename and surname.
• To submit your coursework, locate the correct file on your computer by clicking the “browse” button and enter a title for the coursework (we suggest the module code and your student ID, e.g. MN-M535 123456). Click SUBMIT.
• You will then be asked to check that the document is the one you wish to submit. If so, click “YES, SUBMIT”.
• You will then receive a message saying “paper successfully complete”.
• Blackboard will then send you a confirmation email of submission. Please keep this receipt safe as evidence of your submission.
If you experience any difficulties submitting your work via Turnitin please contact the Student Hub straight away at SoMAssessment@swansea.ac.uk
Notes on Style and Word Count
Assignments are a critical part of the learning experience and development for scholars at Swansea University. Practice will pay dividends when it comes to honing your skills in report and essay writing. Students are therefore encouraged to submit the highest quality work they can to reach their maximum potential. Students with concerns about how to present their work can consult with the Module Co-ordinator for guidance in addition to the notes listed below:
The maximum word limit for the main assignment (excluding the contents page, tables, footnotes, charts, graphs, figures and reference lists, but including in-text references) is 2000 words. The word count must be stated on the assignment cover sheet.
Markers will stop marking once the word limit [or time limit] has been reached, likely leading to a reduced overall mark as key arguments or conclusions will not be included in the marked work.
Students who submit work that is below the word limit will not be penalised. This is because students will not have taken full advantage of the word limit available to them, which in itself may constitute a penalty.
Video, Audio or other Assessment Types
For some assessments, students may be required to submit a video, audio or other digital media item. The University’s overarching privacy policy advises students that the University will collect photographs and video recordings for the purpose of recording lectures, student assessment and examinations. The processing and storage of this information is lawful as it is necessary for the performance of a contract with the student, and will apply to any personal data that we process for the purposes of administering and delivering their course of study.
https://www.swansea.ac.uk/media/Student-Data-Protection-Statement-18-19.pdf
Proof Reading
Please be aware of the university’s Proof Reading policy, which sets out what the university considers to be good academic practice in relation to proof reading. The School of Management allows proof reading, but please be aware of the requirements around this, including keeping an evidence trail relating to any proof reading and whether it is formal or informal. Further information can be found here.
Module Schedule
Week 1 (w/c 27/01)
Topic: Module Introduction
Lecture Contents: Introduction to the course and overview of content.
Seminar Contents: General coursework highlights.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 1
Pang-Ning Tan et al., Introduction to Data Mining, Chapter 1

Week 2 (w/c 3/02)
Topic: Data Input: concepts, instances and attributes
Lecture Contents: This session will introduce styles of learning in data mining. It will look at the types of data attributes and data examples, as well as data pre-processing.
Seminar Contents: Data pre-processing exercise (descriptive analysis, data visualisation, missing values, outliers).
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 2
Pang-Ning Tan et al., Introduction to Data Mining, Chapter 2

Week 3 (w/c 10/02)
Topic: Data Output: knowledge representation (linear models)
Lecture Contents: This session will introduce linear models for data mining, such as linear and logistic regression.
Seminar Contents: Logistic regression exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 4

Week 4 (w/c 17/02)
Topic: Validation and evaluating output
Lecture Contents: This session will introduce validation and evaluation in data mining. Methods such as holdout estimation, cross-validation and bootstrapping will be covered, together with the evaluation of numeric prediction, including error measures and Student’s t-test.
Seminar Contents: Numeric prediction evaluation exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 5

Week 5 (w/c 24/02)
Topic: Simple algorithms: association rules
Lecture Contents: This session will look at rudimentary rules, covering algorithms and association rules.
Seminar Contents: Association rules exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapters 3, 4, 11 (11.7)
Pang-Ning Tan et al., Introduction to Data Mining, Chapter 6

Week 6 (w/c 2/03)
Topic: Decision trees
Lecture Contents: This session will look at decision tree algorithms for data mining.
Seminar Contents: Decision trees exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapters 3, 4
Pang-Ning Tan et al., Introduction to Data Mining, Chapter 4

Week 7 (w/c 9/03)
Topic: Clustering
Lecture Contents: This session will look at cluster analysis: basic concepts and algorithms.
Seminar Contents: Cluster analysis exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapters 4, 6
Pang-Ning Tan et al., Introduction to Data Mining, Chapter 8

Week 8 (w/c 16/03)
Topic: Advanced methods
Lecture Contents: This session will introduce advanced data mining techniques, such as support vector machines and neural networks.
Seminar Contents: SVM and neural networks exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 6

Week 9 (w/c 23/03)
Topic: Ensemble learning
Lecture Contents: This session will look at ensemble learning and combining multiple methods. Algorithms used: bagging, boosting, stacking.
Seminar Contents: Ensemble learning exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 8
Pang-Ning Tan et al., Introduction to Data Mining, Chapter 5

Week 10 (w/c 30/03)
Topic: Data transformations
Lecture Contents: This session will look at attribute selection for data mining, sampling and data calibration.
Seminar Contents: Data transformations exercise.
Key Readings:
Ian H. Witten et al., Practical Machine Learning Tools and Techniques, Chapter 7
Generic Marking Programme
Mark (%): 80-100
Class [Descriptor]: First [Outstanding]
Information and knowledge: Contains all information required, with no errors. Evidence of study beyond the module content.
Application: Answers question fully and completely. Excellent adaptation and application of concepts. No irrelevant material.
Analysis: Ideas expressed logically and coherently. Excellent use of appropriate mathematical / diagrammatic exposition.
Synthesis and context: Excellent integration of ideas and information. Demonstrates outstanding understanding of topic within a wider context.
Evaluation: Shows evidence of significant independent thinking, critical awareness and originality.

Mark (%): 70-79
Class [Descriptor]: First [Excellent]
Information and knowledge: Contains all information required, with no major errors and no or very few minor errors.
Application: Answers question fully and completely. Good adaptation and application of concepts. Little or no irrelevant material.
Analysis: Ideas expressed logically and coherently. Effective use of appropriate mathematical / diagrammatic exposition.
Synthesis and context: Effective integration of ideas and information. Demonstrates substantial understanding of topic within a wider context.
Evaluation: Shows evidence of sound independent thinking and critical awareness.

Mark (%): 60-69
Class [Descriptor]: Upper second [Very good]
Information and knowledge: Contains all or almost all information required, with no major errors and only a few minor errors.
Application: Answers question fully. Some adaptation and application of concepts. Little or no irrelevant material.
Analysis: Ideas generally expressed logically and coherently. Competent use of appropriate mathematical / diagrammatic exposition.
Synthesis and context: Competent integration of ideas and information. Demonstrates some understanding of topic within a wider context.
Evaluation: Shows some evidence of independent thinking and critical awareness.

Mark (%): 50-59
Class [Descriptor]: Lower second [Good]
Information and knowledge: Contains most information required, with no or very few major errors and some minor errors.
Application: Partially answers question. Limited adaptation and application of concepts. Some irrelevant material.
Analysis: Ideas not always expressed logically and coherently. Adequate use of appropriate mathematical / diagrammatic exposition.
Synthesis and context: Limited integration of ideas and information. Demonstrates modest but incomplete understanding of topic and its context.
Evaluation: Shows little evidence of independent thinking and critical awareness.

Mark (%): 40-49
Class [Descriptor]: Third [Satisfactory]
Information and knowledge: Contains basic (core) information required, with some major and minor errors.
Application: Only answers some aspects of question. No adaptation or application of concepts. Some irrelevant material.
Analysis: Ideas rarely expressed logically and coherently. Limited use of appropriate mathematical expressions.
Synthesis and context: Minimal integration of ideas and information. Demonstrates limited understanding of topic and its context.
Evaluation: Shows very little or no evidence of independent thinking and critical awareness.

Mark (%): 30-39
Class [Descriptor]: Fail (Potentially tolerable) [Poor]
Information and knowledge: Contains only a limited amount of information required, with numerous major and minor errors.
Application: Does not answer question. No adaptation or application of concepts. Much irrelevant material.
Analysis: Ideas rarely expressed logically and coherently. Little or no use of appropriate mathematical / diagrammatic exposition.
Synthesis and context: No integration of ideas and information. Demonstrates little understanding of topic and its context.
Evaluation: Shows no evidence of independent thinking or critical awareness.

Mark (%): 0-29
Class [Descriptor]: Fail (Not tolerable) [Very poor]
Information and knowledge: Contains none or almost none of the information required, with many major and minor errors.
Application: Wholly fails to answer question. No adaptation or application of concepts. Largely irrelevant material.
Analysis: Ideas expressed incoherently. No linking of ideas within text. Little or no use of appropriate mathematical / diagrammatic exposition.
Synthesis and context: No integration of ideas and information. Demonstrates no understanding of topic and its context.
Evaluation: Shows no evidence of independent thinking or critical awareness.