2017/18 COM6012 – Assignment 2

Assignment Brief

Deadline:

11:59PM on Friday 27 April 2018

How and what to submit

Create a .zip file containing two folders, one per exercise, named Exercise1 and Exercise2. Within each folder, include the .sbt file, the .scala files, the .sh files, and the output files you get when you run your code on the HPC. Include a README.txt file that explains what each file in each folder is.

Please also include in the .zip a PDF file with your answers to any questions that ask you to provide analysis or comments on the solutions. Please be concise.

Upload your .zip file to MOLE before the date and time specified above. Name your .zip file as NAME_REGCOD.zip, where NAME is your full name, and REGCOD is your registration code.

Please make sure you are not uploading any dataset. If your file is larger than 2 MB, check whether you are unintentionally including files that were not requested.

Assessment Criteria

Being able to use pipelines, cross-validators and a range of different supervised learning methods for large datasets [10 marks]
Being able to analyse and put in place a suitable course of action to address a large scale data analytics challenge [15 marks]

Late submissions

We follow the Department’s guidelines on late submissions. Please see this link

Use of unfair means

“Any form of unfair means is treated as a serious academic offence and action may be taken under the Discipline Regulations.” (taken from the Handbook for MSc Students). If you are unaware of what constitutes Unfair Means, please carefully read the following link.

General Advice

An old but very powerful engineering principle says: divide and conquer. If you are unable to analyse your datasets out of the box, you can always start with a smaller one and build up from there.

Exercises

Exercise 1 [10 marks]

In this Exercise, you will apply Decision Trees for Classification, Decision Trees for Regression and Logistic Regression over the HIGGS dataset. For each algorithm:

Use pipelines and cross-validation to find the best configuration of parameters and their accuracy. Use the same splits of training and test data when comparing performances between the algorithms (7 marks).
Find the three most relevant features for classification or regression (2 marks).
Provide training times in the cluster when using different cores (1 mark).
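As a starting point for the three items above, here is a minimal sketch of a Spark ML pipeline wrapped in a cross-validator, shown for the Decision Tree classifier only. It is not a model solution. It assumes the HIGGS file is read without a header, so that Spark names the columns _c0 to _c28 with _c0 as the label; the object name, the grid values and the number of folds are illustrative choices, not requirements of this brief.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

object HiggsCvSketch {
  // Assumption: the CSV has no header, so Spark names the columns _c0.._c28;
  // _c0 is taken as the class label and _c1.._c28 as the 28 features.
  val labelCol = "_c0"
  val featureCols: Array[String] = (1 to 28).map(i => s"_c$i").toArray

  // Builds the pipeline (assembler + tree) and wraps it in a cross-validator.
  def buildCrossValidator(numFolds: Int = 3): CrossValidator = {
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
    val tree = new DecisionTreeClassifier()
      .setLabelCol(labelCol)
      .setFeaturesCol("features")
    val pipeline = new Pipeline().setStages(Array(assembler, tree))

    // Illustrative grid only: 2 x 2 = 4 parameter combinations.
    val grid = new ParamGridBuilder()
      .addGrid(tree.maxDepth, Array(5, 10))
      .addGrid(tree.maxBins, Array(16, 32))
      .build()

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(grid)
      .setNumFolds(numFolds)
  }

  // Fits on the training split and returns
  // (test accuracy, names of the three most important features, training seconds).
  def fitAndReport(train: DataFrame, test: DataFrame): (Double, Seq[String], Double) = {
    val cv = buildCrossValidator()
    val start = System.nanoTime()
    val model = cv.fit(train)
    val seconds = (System.nanoTime() - start) / 1e9

    val accuracy = cv.getEvaluator.evaluate(model.transform(test))
    val best = model.bestModel.asInstanceOf[PipelineModel]
    val treeModel = best.stages.last.asInstanceOf[DecisionTreeClassificationModel]
    val top3 = treeModel.featureImportances.toArray
      .zip(featureCols)
      .sortBy { case (importance, _) => -importance }
      .take(3)
      .map { case (_, name) => name }
      .toSeq
    (accuracy, top3, seconds)
  }
}
```

To keep comparisons fair, one option is to split the data once, for example with randomSplit(Array(0.7, 0.3), seed = 42L), and pass the same train and test DataFrames to every algorithm. The split ratio and seed here are arbitrary.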

Do not try to upload the dataset to MOLE when returning your work. It is 2.6 GB.

Exercise 2 [15 marks]

You are hired as a Senior Data Analyst at Intelligent Insurances Co. The company wants to develop a predictive model that uses vehicle characteristics to accurately predict insurance claim payments. Such a model will allow the company to assess the potential risk that a vehicle represents.

The company puts you in charge of coming up with a solution for this problem, and provides you with a historic dataset of previous insurance claims. A more detailed description of the problem and the available historic dataset is here

Disclaimer: Bear in mind that your bosses prefer probabilistic models, since such models provide predictions with uncertainty, so they suggest you use generalised linear models (8 marks). Your bosses also tell you that they would like to have two solutions to the problem, with a justification of why the solutions make sense and a comparison of their performance (7 marks). Your bosses also say the solution needs to be scalable, so they ask you to use Apache Spark to develop it.
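For orientation, a generalised linear model can be configured in Spark ML roughly as sketched below. This is only an illustration of the API and not a recommended solution: the column names "label" and "features" are hypothetical placeholders, the two family/link pairs are arbitrary examples, and the iteration and regularisation values are not tuned.

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

object GlmSketch {
  // Hypothetical column names: "features" is an assembled feature vector and
  // "label" is the numeric claim payment. Adapt both to the actual dataset.
  def buildGlm(family: String, link: String): GeneralizedLinearRegression =
    new GeneralizedLinearRegression()
      .setFamily(family)
      .setLink(link)
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(25)     // illustrative value
      .setRegParam(0.01)  // illustrative value

  // Two example configurations to compare; choosing and justifying
  // the families is part of the exercise.
  val candidates: Seq[GeneralizedLinearRegression] = Seq(
    buildGlm("gaussian", "identity"),
    buildGlm("poisson", "log")
  )
}
```

Each candidate can then be fitted on the same training split and scored on the same held-out split, for instance with RegressionEvaluator, so that the performance comparison is like for like.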

Due to the difficulties that some of you have had using HPC, I’ve decided to allow the following changes for Assignment 2.

For Exercise 1, you are allowed to use a subset of the original dataset. The subset must be no smaller than 10% of the original dataset (a minimum of 1,100,000 instances instead of the original 11,000,000 instances).
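The minimum subset size can be checked with a small helper such as the sketch below. The object and function names are hypothetical. Note that random sampling in Spark (for example Dataset.sample with a fraction) returns only approximately the requested fraction of rows, so it is worth counting the sampled rows and confirming that they meet the minimum.

```scala
object SubsetSize {
  val originalInstances: Long = 11000000L
  val minPercent: Int = 10

  // Smallest permitted subset size, rounded up, using integer arithmetic.
  def minInstances(total: Long, percent: Int): Long =
    (total * percent + 99) / 100

  // True if a sampled subset of `count` rows satisfies the 10% rule.
  def meetsMinimum(count: Long): Boolean =
    count >= minInstances(originalInstances, minPercent)
}
```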
