COMP336 – Big Data Assignment 2
Semester 1, 2019
Macquarie University, Department of Computing
Due: Week 8 (Sunday 5 May)
Demonstration and Marking: Week 9, during the Workshop
Weighting: 20%
In this assignment you will implement MapReduce techniques for processing Big Data. You will build
your assignment on top of Hadoop, an open-source implementation of MapReduce written in Java.
This Assessment Task relates to the following Learning Outcome:
Apply MapReduce techniques to a number of problems that involve Big Data.
Task 1: (15%)
• Dataset: 10,000 tweets, available on iLearn as “tweets.zip”
• MapReduce: Count the number of occurrences of each word in the text of the tweets.
• Write short documentation that briefly describes your implementation:
o What goes in the mapper(s)? Include a flowchart and pseudocode.
o What goes in the reducer(s)? Include a flowchart and pseudocode.
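To illustrate the key-value flow this task asks you to document, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes, so it is not a submission template. The class name WordCountSketch, the method names, and the tokenisation rule (lower-cased runs of letters, digits, or apostrophes) are assumptions made for the example; the handout does not prescribe them.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the word-count data flow (hypothetical names, no Hadoop).
class WordCountSketch {

    // Map phase: emit one (word, 1) pair per token in a tweet's text.
    // Assumption: a token is a lower-cased run of letters, digits, or apostrophes.
    static List<Map.Entry<String, Integer>> map(String tweetText) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : tweetText.toLowerCase().split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum every count that was emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: stands in for the shuffle by grouping the emitted values per key,
    // then calls reduce once per distinct word.
    static Map<String, Integer> run(List<String> tweets) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String tweet : tweets) {
            for (Map.Entry<String, Integer> pair : map(tweet)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In a real Hadoop job the grouping done by run is performed by the framework between the map and reduce phases; only the bodies of map and reduce correspond to code you would write.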
Task 2: (15%)
• Dataset: 10,000 tweets, available on iLearn as “tweets.zip”
• MapReduce: Count the number of tweets for each city in a list of Australian cities.
• Write short documentation that briefly describes your implementation:
o What goes in the mapper(s)? Include a flowchart and pseudocode.
o What goes in the reducer(s)? Include a flowchart and pseudocode.
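One possible shape for this count, sketched in plain Java without Hadoop classes: the mapper emits a city name whenever a tweet's location string mentions it, and the reducer sums the emissions. The class name CityCountSketch, the five cities in CITIES, and the assumption that each tweet carries a free-text location string are all illustrative; check the dataset for the actual field and use the city list you choose to report on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of per-city tweet counting (hypothetical names, no Hadoop).
class CityCountSketch {

    // Assumption: the handout does not fix the city list; these are examples.
    static final List<String> CITIES =
            Arrays.asList("Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide");

    // Map phase: return every listed city mentioned in the location string.
    // Each returned city stands for an emitted (city, 1) pair.
    static List<String> map(String location) {
        List<String> emitted = new ArrayList<>();
        String lower = location.toLowerCase();
        for (String city : CITIES) {
            if (lower.contains(city.toLowerCase())) {
                emitted.add(city);
            }
        }
        return emitted;
    }

    // Driver: the reduce step is a sum of ones per city, folded in with merge.
    static Map<String, Integer> run(List<String> locations) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String location : locations) {
            for (String city : map(location)) {
                counts.merge(city, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Substring matching is the simplest rule but can over-match (a location such as "Perthshire" would count for Perth), so your documentation should state which matching rule you used.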
Copyright © DataAnalyticsResearchGroup @MQ https://data-science-group.github.io/
Task 3: (35%)
• Dataset: 10,000 tweets, available on iLearn as “tweets.zip”
• MapReduce: Implement the Bubble Sort algorithm using MapReduce.
• MapReduce: Implement the Quick Sort algorithm using MapReduce.
• Sort the tweets by their object.id field.
• Write short documentation that briefly describes your implementation:
o How many MapReduce jobs are needed, and why?
o What goes in the mapper(s)? Include a flowchart and pseudocode.
o What goes in the reducer(s)? Include a flowchart and pseudocode.
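The handout does not say how the sort should be split across map and reduce. One possible decomposition, sketched below in plain Java without Hadoop classes, is to range-partition the ids so that each reducer receives a contiguous id range, sort each partition locally with Bubble Sort or Quick Sort, and concatenate the partitions in order. The class name RangeSortSketch, the method names, and the boundaries parameter (which in practice might come from sampling the ids) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: range partition, then a local sort per partition.
class RangeSortSketch {

    // Partition step: pick the reducer whose id range contains this id.
    // boundaries must be ascending; ids below boundaries[0] go to reducer 0,
    // ids at or above the last boundary go to the last reducer.
    static int partition(long id, long[] boundaries) {
        int p = 0;
        while (p < boundaries.length && id >= boundaries[p]) {
            p++;
        }
        return p;
    }

    // Local sort, option 1: Bubble Sort with early exit when no swap occurs.
    static void bubbleSort(long[] a) {
        for (int end = a.length - 1; end > 0; end--) {
            boolean swapped = false;
            for (int i = 0; i < end; i++) {
                if (a[i] > a[i + 1]) {
                    long t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                    swapped = true;
                }
            }
            if (!swapped) break;
        }
    }

    // Local sort, option 2: Quick Sort (Lomuto partition, last element as pivot).
    static void quickSort(long[] a, int lo, int hi) {
        if (lo >= hi) return;
        long pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                long t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        long t = a[i]; a[i] = a[hi]; a[hi] = t;
        quickSort(a, lo, i - 1);
        quickSort(a, i + 1, hi);
    }

    // Driver: bucket the ids by range, bubble-sort each bucket, and
    // concatenate the buckets in partition order.
    static long[] run(long[] ids, long[] boundaries) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long id : ids) {
            buckets.get(partition(id, boundaries)).add(id);
        }
        long[] out = new long[ids.length];
        int pos = 0;
        for (List<Long> bucket : buckets) {
            long[] part = new long[bucket.size()];
            for (int i = 0; i < part.length; i++) {
                part[i] = bucket.get(i);
            }
            bubbleSort(part);
            for (long id : part) {
                out[pos++] = id;
            }
        }
        return out;
    }
}
```

Replacing the bubbleSort(part) call with quickSort(part, 0, part.length - 1) gives the Quick Sort variant. Because the partitions cover disjoint, ordered id ranges, the concatenated output is globally sorted, and whether this counts as one job or several is exactly the question the documentation asks you to answer.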
Task 4: (35%)
• Dataset: 10,000 tweets, available on iLearn as “tweets.zip”
• MapReduce: Implement the TF-IDF algorithm using MapReduce for the term “health” in the text of
the tweets.
• Write short documentation that briefly describes your implementation:
o How many MapReduce jobs are needed, and why?
o What goes in the mapper(s)? Include a flowchart and pseudocode.
o What goes in the reducer(s)? Include a flowchart and pseudocode.
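TF-IDF has several common variants and the handout does not name one. The plain-Java sketch below (no Hadoop classes) assumes each tweet is one document, TF is the term's count divided by the number of tokens in the tweet, and IDF is the natural logarithm of N / df, where N is the number of tweets and df is the number of tweets containing the term. The class name TfIdfSketch, the method names, and the tokenisation rule are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of TF-IDF for a single term (hypothetical names, no Hadoop).
class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the token count.
    // Assumption: term is already lower-case; tokens are lower-cased runs of
    // letters, digits, or apostrophes.
    static double tf(String term, String tweetText) {
        int total = 0;
        int hits = 0;
        for (String token : tweetText.toLowerCase().split("[^a-z0-9']+")) {
            if (token.isEmpty()) continue;
            total++;
            if (token.equals(term)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Inverse document frequency: ln(N / df); 0 when no tweet contains the term.
    static double idf(String term, List<String> tweets) {
        int df = 0;
        for (String tweet : tweets) {
            if (tf(term, tweet) > 0) df++;
        }
        return df == 0 ? 0.0 : Math.log((double) tweets.size() / df);
    }

    // Returns tweet index -> TF-IDF score, only for tweets containing the term.
    static Map<Integer, Double> run(String term, List<String> tweets) {
        double idfValue = idf(term, tweets);
        Map<Integer, Double> scores = new LinkedHashMap<>();
        for (int i = 0; i < tweets.size(); i++) {
            double tfValue = tf(term, tweets.get(i));
            if (tfValue > 0) {
                scores.put(i, tfValue * idfValue);
            }
        }
        return scores;
    }
}
```

In a MapReduce setting the three quantities (per-tweet term counts, the document frequency, and the final product) are often computed in separate passes, which is one way to approach the "how many jobs" question; calling TfIdfSketch.run("health", tweets) mirrors the end result of those passes on a small in-memory list.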
Submission:
Submit a zip file containing:
• Documentation for each task, including the flowchart and pseudocode
• Source code for the mapper(s) and reducer(s)
• Output for each task