Big Data Coursework 2016/7 Part 2: Spark Pipelines and Evaluation of Scaling of Algorithms
Topic This coursework part is about implementing and applying Spark Machine Learning Pipelines, and evaluating them with respect to preprocessing, parametrisation, and scaling.
This task is more open-ended than part 1, and aims to encourage you to explore Spark's Pipeline model and to design and evaluate machine learning applications at scale. It is highly recommended to do the coursework in pairs; this will be taken into account when the coursework is marked.
Tasks
a) Select a (not too small) dataset (we provide some, but you can also use your own) and
identify a task (classification, regression, recommendation …). Explain your choice of
dataset and task. (20%)
b) Implement a machine learning pipeline in Spark, including feature extractors,
transformers, and/or selectors. Test that your pipeline it is correctly implemented and
explain your choice of processing steps, learning algorithms, and parameter settings. (25%)
c) Evaluate the performance of your pipeline using training and test set (don’t use CV but
pyspark.ml.tuning.TrainValidationSplit). (20%)
d) Implement a parameter grid (using pyspark.ml.tuning.ParamGridBuilder[source]),
varying at least one feature preprocessing step, one machine learning parameter, and the
training set size. Document the training and test performance and the time taken for training
and testing. Comment on your findings. (35%)