COMP5349 Assignment 1 Tutor Name: Tutor Name Name: Firstname Lastname SID: Student ID Number
Workload: Category and Trending Correlation
The following content is just an example of how you can construct your report for the Spark implementation. It is based on an example in the week 5 lecture slides (30-36) on finding the average rating received by each movie genre. The format could be different, but you must include a diagram and a very brief description.
The computation graph is illustrated in Figure 1.
Figure 1: Spark computation graph for workload 1.
Workload                             Implementation   Programming Language
Category and Trending Correlation    Spark            Python
Controversial Video Identification   MapReduce        Java
Extract the movie id and rating from each input line and output them as a key-value pair.
The sequence of transformations and actions is illustrated in Figure 1. The ratings file is read in and mapped to create a (movie id, rating) pair RDD. Similarly, the movies file is read and mapped to create a (movie id, genre) pair RDD; flatMap is used here because a movie can have multiple genres.
The two RDDs are joined on movie id, then mapped to form a (genre, rating) pair RDD.
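The map, flatMap and join steps above can be sketched in plain Python, without a Spark cluster. This is only an illustration: the line formats, the sample data, and the helper names extract_rating and extract_genres are assumptions, not part of the lecture example.

```python
# Plain-Python stand-in for the map / flatMap / join steps (no Spark needed).
# Assumed line formats (not specified in this report):
#   ratings: "userId,movieId,rating,timestamp"
#   movies:  "movieId,title,genre1|genre2|..."

def extract_rating(line):
    """One ratings line -> one (movie_id, rating) pair (a map step)."""
    _user_id, movie_id, rating, _timestamp = line.split(",")
    return (movie_id, float(rating))

def extract_genres(line):
    """One movies line -> several (movie_id, genre) pairs (a flatMap step)."""
    movie_id, _title, genres = line.split(",")
    return [(movie_id, genre) for genre in genres.split("|")]

# Hypothetical sample input.
ratings_lines = ["1,10,4.0,964982703", "2,10,2.0,964982931", "1,20,5.0,964981247"]
movies_lines = ["10,Movie A,Comedy|Drama", "20,Movie B,Drama"]

movie_ratings = [extract_rating(line) for line in ratings_lines]
movie_genres = [pair for line in movies_lines for pair in extract_genres(line)]

# Inner join on movie_id, keeping only (genre, rating) from each match.
genre_ratings = [
    (genre, rating)
    for rating_movie_id, rating in movie_ratings
    for genre_movie_id, genre in movie_genres
    if rating_movie_id == genre_movie_id
]
print(genre_ratings)
```

In PySpark the same helpers would probably be passed to map and flatMap on the two input RDDs. join then yields (movie id, (rating, genre)) records, which a further map can turn into (genre, rating) pairs.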
The aggregateByKey transformation is then applied, with mergeRating as the sequence function and mergeCombiner as the combiner function. mergeRating updates a per-genre summary, which holds the sum of ratings and the number of ratings, with a new value, while mergeCombiner combines summaries together.
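A minimal sketch of the two functions follows. It assumes the summary is a (sum, count) tuple and simulates two partitions in plain Python. The partition data and the helpers summarise and combine are hypothetical and only mimic what aggregateByKey would do.

```python
from functools import reduce

def mergeRating(summary, rating):
    """Sequence function: fold one rating into a (sum, count) summary."""
    total, count = summary
    return (total + rating, count + 1)

def mergeCombiner(left, right):
    """Combiner function: merge two (sum, count) summaries."""
    return (left[0] + right[0], left[1] + right[1])

ZERO = (0.0, 0)  # assumed zero value for the aggregation

# Two hypothetical partitions of (genre, rating) pairs.
partitions = [
    [("Comedy", 4.0), ("Drama", 4.0), ("Comedy", 2.0)],
    [("Drama", 2.0), ("Drama", 5.0)],
]

def summarise(partition):
    """Apply mergeRating within one partition, keyed by genre."""
    out = {}
    for genre, rating in partition:
        out[genre] = mergeRating(out.get(genre, ZERO), rating)
    return out

def combine(a, b):
    """Apply mergeCombiner across two per-partition results."""
    merged = dict(a)
    for genre, summary in b.items():
        merged[genre] = mergeCombiner(merged.get(genre, ZERO), summary)
    return merged

summaries = reduce(combine, (summarise(p) for p in partitions), {})
averages = {genre: total / count for genre, (total, count) in summaries.items()}
print(averages)
```

With a real (genre, rating) pair RDD, the equivalent call would likely be aggregateByKey((0.0, 0), mergeRating, mergeCombiner), followed by a map that divides each sum by its count.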
Workload: Controversial Video Identification
The following content is just an example of how you can construct your report for the MapReduce implementation. It is based on an example in the week 4 lab; we assume a two-job implementation.
Figure 2: MapReduce computation graph for workload 2.
Add short description here as well!