22/09/2020 Assessment task 3: Data mining in action
Assessment task 3: Data mining in action
Due: Oct 23 by 23:59
Points: 100
Submitting: a file upload
Available: Sep 14 at 17:00 – Oct 30 at 23:59 (about 2 months)
Scenario
This assignment is a practical data analytics project that follows on from the data exploration you did in Assignment 2.
You will be acting as a data scientist at a consulting company and you need to make a prediction on a dataset. The dataset can be found below.
You need to build classifiers using the techniques covered in the lectures to predict the class attribute. At the very minimum, you need to produce a classifier for each method we have covered. However, you should be able to get a better mark if you explore the problem thoroughly (as you would in industry): preprocessing the data, comparing different methods, choosing their best parameter settings and identifying the best classifier in a principled and explainable way. If you choose to use KNIME and show ‘expert’ use (i.e. exploring multiple classifiers with different settings, choosing the best in a principled way and being able to explain why you built the model the way you did), this will attract a better mark. If you choose to use R or Python to build, optimise and test different models, this will also attract better marks.
You need to write a short report describing how you solved the problem and the results you found. See below for requirements.
You also need to attend a short oral defence of your classifier of around 5 minutes where you show the classifier (e.g. using the KNIME workflow or Python/R code) and answer some questions about it. Details about oral defences will be given by email and in class.
Kaggle Competition
For this assignment, you will use the Kaggle website (kaggle.com) to submit your assignment solution. The report itself will be submitted through Canvas as for the other assignments. Go to this link to sign up to the competition on Kaggle: https://www.kaggle.com/t/288b4a0860154a098e09e34f9ebd44f1. You need to use the link to access the project because it is a private project for students in 32130. Sharing the competition with anyone not relevant to the subject is strictly prohibited. To submit to Kaggle you will need to make a Kaggle login using your UTS email address, and set your display name (in My Profile -> Edit Profile -> Display Name) as UTS_32130_xxxx where xxxx is your student ID. Submissions will not be considered if they don’t meet these criteria.
Datasets
Below you will find 3 datasets: a training dataset for training your model (it contains the target values), an Unknown dataset for testing the model (it does not have the target values – you need to predict them) and a submission sample which shows you what the file submitted to Kaggle should look like. In particular, you will need to set the column names in your submission file correctly – that is, “Row ID” and “Predict-Qualified”.
Assignment3-TrainingData.csv
Assignment3-UnknownData.csv
Assignment3-Kaggle-Submission-Random-Sample.csv
For this dataset, you only have the attribute headings and a brief description of what they mean, which you can find here: Assignment3-Attribute-Description.pdf
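As an illustration of the submission format described above, the following is a minimal Python sketch that writes a file with the two required column names, “Row ID” and “Predict-Qualified”. The helper name write_submission, the example row IDs and the 0/1 label encoding are assumptions made for illustration only; check Assignment3-Kaggle-Submission-Random-Sample.csv for the actual ID and label formats.

```python
import csv
import io


def write_submission(row_ids, predictions, out):
    """Write one prediction per row using the column names required by the brief."""
    if len(row_ids) != len(predictions):
        raise ValueError("row_ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Row ID", "Predict-Qualified"])  # header names from the brief
    for row_id, pred in zip(row_ids, predictions):
        writer.writerow([row_id, pred])


# Hypothetical IDs and labels, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_submission(["Row0", "Row1", "Row2"], [1, 0, 1], buffer)
submission_text = buffer.getvalue()
```

To produce a real file, pass an open file handle instead of the buffer, for example open("submission.csv", "w", newline="").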
Assessment
Assessment is real-time. This means that as soon as you submit the file, Kaggle will assess the performance of your classifier and provide you with the result. You can submit multiple times, but Kaggle has a limit for the number of times you can do this per day.
Do not treat the performance measure reported by Kaggle as your test error in the final competition, and do not optimise to it. This is because Kaggle has two measures: a public measure, which it reports to you, and a private measure, which it keeps hidden. Instead, develop several models and estimate the test error yourself before submitting to Kaggle. Remember that your estimate of test error is just that: an estimate. The actual private measure will probably be a little bit different.
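One common way to estimate the test error yourself is k-fold cross-validation on the training data. The sketch below is a minimal, standard-library-only illustration and not a prescribed method: the majority-class "model" is only a placeholder for whichever classifier you actually build, the toy data is invented, and the misclassification rate is assumed as the error measure (the measure Kaggle uses for this competition may differ).

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds (assumes n >= k)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(X, y, fit, predict, k=5, seed=0):
    """Average misclassification rate over k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / k


# Placeholder model: always predicts the most frequent training label.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, x):
    return model


# Invented toy data: 14 positive and 6 negative examples.
X_toy = [[i] for i in range(20)]
y_toy = [1] * 14 + [0] * 6
cv_error = cross_val_error(X_toy, y_toy, fit_majority, predict_majority, k=5)
```

Comparing this kind of estimate across several candidate models, rather than comparing public Kaggle scores, keeps model selection independent of the leaderboard.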
Classification task
Build a classifier that predicts the “QUALIFIED” attribute.
You can apply different data pre-processing steps and transformations (e.g. grouping values of attributes, converting them to binary, etc.), as long as you explain why you chose them. You may need to split the training set into training, validation and test sets to accurately set the parameters and evaluate the quality of the classifier.
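A split into training, validation and test sets could look like the following minimal sketch. It is stratified, meaning each class is split separately so the three sets keep similar class proportions. The 60/20/20 fractions, the function name stratified_split and the toy labels are assumptions chosen for illustration, not requirements of the assignment.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(round(fractions[0] * len(indices)))
        n_val = int(round(fractions[1] * len(indices)))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Invented toy labels: 50 examples of each class.
labels_toy = [1] * 50 + [0] * 50
train_idx, val_idx, test_idx = stratified_split(labels_toy)
```

The validation set is used to choose parameter settings, and the test set is touched only once at the end to estimate the quality of the chosen classifier.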
You can use KNIME to build classifiers. Feel free to use any other tool such as R, Weka, Python, Orange, scikit-learn or other software. If you do this, though, please explain more about your classifier – and be sure that you are producing valid results! You don’t need to limit yourself to the classifiers we used in class, but if you do use other classifiers you need to describe them in your report and make sure you are producing valid results.
A hint: usually it’s not a case of having a ‘better’ classifier that will produce good results. Rather, it’s a case of identifying or generating good features that can be used to solve the problem.
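The transformations mentioned earlier (grouping values of attributes and converting them to binary) can be sketched in a few lines of Python. The helper names, the bin edges and the category values below are all hypothetical; the real attributes are described in Assignment3-Attribute-Description.pdf, and which transformations help is something you need to investigate and justify yourself.

```python
def group_rare(values, min_count, other="OTHER"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return [v if counts[v] >= min_count else other for v in values]


def bin_numeric(value, edges):
    """Return the index of the bin that value falls into, given sorted edges."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def one_hot(value, categories):
    """Convert a categorical value into a list of 0/1 indicator features."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical values for illustration only.
grouped = group_rare(["A", "A", "B", "C", "A"], min_count=2)
# grouped == ["A", "A", "OTHER", "OTHER", "A"]
bin_index = bin_numeric(35, edges=[18, 30, 50])
# bin_index == 2
indicators = one_hot("B", categories=["A", "B", "OTHER"])
# indicators == [0, 1, 0]
```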
Assignment report and submission
Report
https://canvas.uts.edu.au/courses/15561/assignments/30638
Your report should include the following information:
A description of the data mining problem;
The data preprocessing and transformations you did (if any);
How you went about solving the problem;
Classification techniques used and summary of the results and parameter settings;
The best classifier that you selected – the type, its performance, how it solved the problem (if it makes sense for that type of classifier), and reasons for selecting it;
Reflection: One page reflecting on your learning in assignment 3. What did you learn about data mining and yourself as a result of doing the assignment? How would you approach the problem differently if you were to do it again? The more incisive and thoughtful your reflection is, the better your mark.
The report should be a PDF (preferable) or MS Word doc, with the filename fda_a3_xxxx.pdf or fda_a3_xxxx.docx, where xxxx is your student ID. The report should be around 10-12 pages, in 11 or 12 point Times or Arial font.
Submit your report using the link at the bottom of this page, after the Discussion.
Kaggle
The predictions on the unknown dataset should be submitted as a .csv file to the Kaggle competition here: https://www.kaggle.com/c/data-analytics-uts-s2020-assignment3 (submission page: https://www.kaggle.com/c/data-analytics-uts-s2020-assignment3/submit).
There will be a class prize for the top 3 submissions in the assignment scoreboard on Kaggle.
On average each student will require between 24 and 36 hours to complete this assignment.
Assessment
This assignment is assessed as individual work.
The report contributes up to 30 marks out of the 50. The marking criteria are here: Assignment3-Marking-Criteria-32130.pdf
The oral defence contributes up to 20 marks out of the 50. At the oral defence, students need to explain how they solved the problem and answer questions about their solutions showing the workflow in KNIME or working code in Python, R or other tools.
Students receive either 0, 10, 15 or 20 marks as follows.
Students using baseline classifiers who are able to satisfactorily answer questions about them will receive 10 out of 20.
Students showing an investigation with many classifiers using Python/R/KNIME or other tools, with basic data preprocessing, parameter estimation and model evaluation, will receive 15 marks out of 20.
Students showing an in-depth investigation using Python/R/KNIME – multiple classifiers with valid data preprocessing, parameter estimation and model evaluation – will receive 20 marks out of 20.
Students who fail the oral defence will be permitted to undertake it one more time. If they pass, they will receive a maximum of 10 marks out of 20.
Due Date:
Friday 23 Oct. 2020, 11:59 PM
Relationship to objectives
This assessment task addresses subject learning objectives (SLOs) 3, 4, 5 and 6.
This assessment task contributes to the development of Course Intended Learning Outcomes (CILOs) B.1, C.1, E.1 and F.1.
Return of assignments
Marks for the oral defence will be given at the end of the defence. Marks for the written report will be given within 3 weeks of submission. Feedback on the report will be given only to students who request it. Emails will be sent when marking is complete.
Academic standards and penalties
Please refer to the Subject Outline.
Discussion Box
Please ask questions about your assignment here.