ADM3308: Business Data Mining
Data Mining Project Using IBM SPSS Modeler
(Team work)
_____________________________________________________________________________________
Weight: 25% of the final mark. This is a team project (only one submission per team).
_____________________________________________________________________________________

Important Note: Read the following academic integrity statement, type in your full name and student ID, and include a copy in your submission. Electronic submission of this form by the team representative is considered a signature of the document by BOTH members of the team.
Personal Ethics & Academic Integrity Statement
Student name: Student ID:
Student Name: Student ID:
By typing in my name and student ID on this form and submitting it electronically, I am attesting to the fact that I have reviewed not only my own work, but the work of my team member, in its entirety.
I attest to the fact that my own work in this project adheres to the fraud policies as outlined in the Academic Regulations in the University’s Undergraduate Studies Calendar. I further attest that I have knowledge of and have respected the “Beware of Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the best of my knowledge, I also believe that each of my group colleagues has also met the aforementioned requirements and regulations. I understand that if my group assignment is submitted without a completed copy of this Personal Work Statement from each group member, it will be interpreted by the school that the missing student(s) name is confirmation of non-participation of the aforementioned student(s) in the required work.
We, by typing in our names and student IDs on this form and submitting it electronically,
• warrant that the work submitted herein is our own group members’ work and not the work of others
• acknowledge that we have read and understood the University Regulations on Academic Misconduct
• acknowledge that it is a breach of University Regulations to give or receive unauthorized and/or unacknowledged assistance on a graded piece of work

IBM SPSS Modeler is a commercial data mining package from IBM that supports data mining tasks, including predictive and descriptive modeling, through a user-friendly interface. IBM SPSS Modeler is available on the computers in the lab. Tutorials on using it for data mining will be presented in class. Students are also required to consult online resources to learn more about IBM SPSS Modeler. More information on the product is available at
https://www.ibm.com/products/spss-modeler?mhsrc=ibmsearch_a&mhq=Modeler

For this project, you are required to complete two parts:
• Part-1 (100 points): A data mining modelling project using a dataset selected from Table-1.
• Part-2 (30 points): Pre-process and clean the raw dataset provided to you (Unclean-Bank-Data.Xlsx) using IBM SPSS Modeler nodes.

PART-1
(A) Dataset Selection:
Each team must select one of the datasets listed in Table-1 (or one from another recommended repository, with the pre-approval of the professor) and announce it on the “Discussion Board” in the Forum named “Announcing Dataset Selection”. Post your name, your team member’s name, and the dataset selected. If a dataset has already been claimed by a team, as posted on the Forum, it cannot be selected by another team. Therefore, I recommend that you select your dataset and announce it on the Discussion Board as early as possible.
NOTE: You may choose a dataset other than those listed in Table-1 with the professor’s prior approval. If you would like to analyze a dataset not listed in Table-1, please email me the details of the dataset for my review (e.g. the source of the data, the number of records, the number of attributes).

(B) Data Analysis and Model Building:
You are required to import the data, perform pre-processing tasks if needed (such as reformatting the data, normalizing it, and dealing with missing values and outliers), and then carry out two or more modeling tasks such as classification (decision tree, Bayesian, KNN, neural networks, etc.), clustering (K-means, agglomerative), and association rule mining.
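
To make the idea of "pre-process, then model" concrete, here is a minimal Python sketch of min-max normalization followed by a k-nearest-neighbour (KNN) classification. It is only an illustration of the concepts; the project itself must be carried out with IBM SPSS Modeler nodes. The toy records, the `age`/`balance` columns, and the function names are hypothetical and are not taken from any dataset in Table-1.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Rescale every numeric column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the majority label among the k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [age, balance] with a yes/no target.
features = [[25, 500], [30, 700], [45, 5000], [50, 6000], [35, 900], [55, 7000]]
labels = ["no", "no", "yes", "yes", "no", "yes"]

# For brevity, the scaling ranges are computed over all rows, query included.
scaled = min_max_normalize(features)

# The last row is held out and used as the query record.
prediction = knn_predict(scaled[:-1], labels[:-1], scaled[-1], k=3)
print(prediction)
```

Without normalization, the `balance` column would dominate the Euclidean distance simply because its values are much larger than those of `age`, which is one reason normalization is listed among the pre-processing tasks.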

(C) Project Report for Part-1:
Your report for this part of the project should include:
• Explaining the data you selected for your project (attributes, instances, etc.)
• Explaining your pre-processing tasks if any (cleaning, transforming, normalizing, etc.)
• Explaining the data mining modeling techniques you performed on the data (at least two techniques)
• Demonstrating the graphs/tables of the results produced by the techniques
• Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings
• Concluding remarks, your recommendations, actionable discoveries, and future trends/studies you would recommend

Overall, your report for this part of the project should be 15 to 25 pages long (including graphs). Use 12pt Times New Roman font with 1.5 line spacing. Keep a margin of 1” on all sides of the page.

Rubrics for Part-1
Your report for Part-1 of the project will be evaluated as follows:
Components of the Report (Part-1) | Points
Abstract or executive summary | 10
Explanation of the data set, and the pre-processing tasks (if any) to prepare the data | 10
Explanation of at least two data mining tasks you performed on the data. Also, explain why you considered the specific data mining tasks for your dataset | 20
Relevant graphs showing the output results of the techniques you applied | 20
Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings | 10
A conclusion section summarizing your findings, discussing the results, your understanding of the results, your recommendation, and any useful patterns, rules, prediction or future trend you infer from the data | 10
Overall organization of the paper, its soundness and readability, and quality of the presentation | 20
Total (Part-1) | 100

(D) List of Datasets:
Select one of the following datasets, then post a message on the Discussion Board on Brightspace to claim your dataset.
Table-1: List of datasets for Part-1 of the project
Note: These datasets are available at the UCI Machine Learning Repository. For more information, visit http://archive.ics.uci.edu/ml/datasets.html

# | Name | Number of features | Number of Samples | Comments
1 | Waveform Database Generator (version 2) | 40 | 5000 | Use the dataset without Noise
2 | Statlog (Landsat Satellite) | 36 | 6435 | Training and Testing datasets are different
3 | seismic-bumps | 22 | 8124 |
4 | Image Segmentation | 19 | 2310 | Use only the testing dataset
5 | Bank Marketing | 17 | 45211 |
6 | Pen-Based Recognition of Handwriting Digits | 16 | 10992 | Training and Testing datasets are different
7 | Student Performance | 33 | 649 |
8 | Adult | 14 | 48842 | Training and Testing datasets are different
9 | Statlog (Shuttle) | 9 | 58000 |
10 | Abalone | 8 | 4177 |
11 | Nursery | 8 | 12960 |
12 | Yeast | 8 | 1484 |
13 | One-hundred plant species leaves data set | 64 | 1600 | Use just-data_Mar_64.txt
14 | Spambase | 57 | 4601 |
15 | Cardiotocography | 23 | 2126 |
16 | Statlog (German Credit Card) | 20 | 1000 |
17 | Letter Recognition | 16 | 20000 |
18 | EEG Eye State | 15 | 14980 |
19 | Page Blocks Classification | 10 | 5473 |
20 | Contraceptive Method Choice | 9 | 1473 |
21 | Weight lifting exercises monitored | 10 | 39242 | Use the following features: roll_belt, pitch_belt, yaw_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y, (class as output)
22 | Connect-4 | 42 | 67557 |
23 | Mushroom | 22 | 8124 |
24 | Default of credit card clients | 24 | 30000 |
25 | Autism Screening Adult Data Set | 21 | 704 |
26 | Drug consumption (quantified) Data Set | 32 | 1885 |
27 | Polish companies bankruptcy data, Data Set | 64 | 10503 |

PART-2
In this part of the project, all teams will use the dataset Unclean-Bank-Data.Xlsx posted on the “Project Description” page of the course website.
This dataset includes missing values, invalid values, and outliers. You should use the IBM SPSS Modeler nodes to pre-process and clean the data.
Do not remove a record if there is only one missing value in that record. Instead, use the IBM Modeler to fill in the missing value with an algorithm of your choice.
Similarly, do not remove a record if it has only one invalid value. Instead, use the IBM Modeler to fill in the invalid value with an algorithm of your choice.
If you find a record with more than one missing value, or more than one invalid value, then you may either remove the record, or use the IBM Modeler to fill in for the missing or invalid values.
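
The record-level rules above can be sketched in Python as follows. This is only an illustration of the logic, not a substitute for the IBM SPSS Modeler nodes required for the deliverable. The field names `age` and `income`, their validity ranges, and the choice of mean imputation are all assumptions made for this example; the actual columns of Unclean-Bank-Data.Xlsx are not assumed here.

```python
from statistics import mean

# Hypothetical validity rules, one per field (assumed for illustration).
VALID = {
    "age": lambda v: isinstance(v, (int, float)) and 18 <= v <= 100,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def bad_fields(record):
    """Return the fields that are missing (None) or fail their validity rule."""
    return [
        name
        for name, ok in VALID.items()
        if record.get(name) is None or not ok(record[name])
    ]


def clean(records):
    """Impute records with exactly one bad field; drop records with more."""
    # Column means are computed from values that pass validation only.
    means = {
        name: mean(
            r[name] for r in records if r.get(name) is not None and ok(r[name])
        )
        for name, ok in VALID.items()
    }
    cleaned = []
    for record in records:
        bad = bad_fields(record)
        if len(bad) > 1:
            continue  # removal is permitted when more than one value is bad
        fixed = dict(record)
        for name in bad:
            fixed[name] = means[name]  # mean imputation (one possible choice)
        cleaned.append(fixed)
    return cleaned


records = [
    {"age": 30, "income": 40000},
    {"age": None, "income": 60000},  # one missing value: imputed
    {"age": 50, "income": -5},       # one invalid value: imputed
    {"age": 999, "income": None},    # two bad values: dropped
]
result = clean(records)
print(result)
```

Mean imputation is used here only because it is the simplest option; a median, a mode for categorical fields, or a model-based estimate would fit the same structure.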
If you detect outliers, you may then delete the entire record.
You may also want to do other pre-processing tasks such as data normalization, binning data, etc.
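
As an illustration of outlier removal and binning, the following Python sketch flags outliers with the common interquartile-range (IQR) rule and then assigns the remaining values to equal-width bins. Both techniques are choices made for this example, since the assignment does not prescribe a specific method, and the `balance` field and its values are hypothetical. The actual work must still be done with IBM SPSS Modeler nodes.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(records, field):
    """Keep only the records whose value for `field` lies inside the fences."""
    low, high = iqr_bounds([r[field] for r in records])
    return [r for r in records if low <= r[field] <= high]


def equal_width_bin(value, low, high, bins):
    """Map a value in [low, high] to a bin index from 0 to bins - 1."""
    if high == low:
        return 0
    index = int((value - low) / (high - low) * bins)
    return min(index, bins - 1)  # the maximum value falls in the last bin


# Hypothetical balances; 5000 is far from the rest of the data.
data = [{"balance": b} for b in [100, 120, 130, 150, 160, 180, 200, 5000]]
kept = drop_outliers(data, "balance")

lowest = min(r["balance"] for r in kept)
highest = max(r["balance"] for r in kept)
for r in kept:
    r["balance_bin"] = equal_width_bin(r["balance"], lowest, highest, bins=4)
print(kept)
```

Removing outliers before binning matters for equal-width bins: with the value 5000 still present, almost every other record would fall into the first bin.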
Deliverables for Part-2:
1- Include in your project report a short explanation of three different cleaning and pre-processing tasks you applied to the data using IBM SPSS Modeler.
2- Also, include the clean dataset (name it “Clean-Bank-Data.xlsx”) in your submission together with your project report (you may submit everything in one zip file).

Rubrics for Part-2
Your report for Part-2 of the project will be evaluated as follows:

Components of the Report (Part-2) | Points
Explanation of three cleaning and pre-processing tasks applied on the data; explaining the results after you pre-processed the data; including the Clean-Bank-Data.xlsx with your submission. | 3 X 10
Total (Part-2) | 30