Introduction
Big Data (H/SIT) 2018-19
1st Assessed Exercise: HDFS/MapReduce
The goal of this exercise is to familiarize yourselves with the design, implementation and performance testing of Big Data crunching tasks using Hadoop/MapReduce. You will be required to design and implement algorithms for parsing, filtering, projecting, and transforming data, over a relatively large dataset, executing your code on a shared Hadoop cluster.
Dataset and Infrastructure
You will be working on a parsed version of the complete Wikipedia edit history as of January 20081. This is a single large text file (around the 300GB mark in its entirety), in a tagged multi-line format. Each revision history record consists of 14 lines, each starting with a tag and containing a space/tab- delimited series of entries. More specifically, each record contains the following data/tags, one tag per line:
• REVISION: revision metadata, consisting of:
o article_id: a large integer, uniquely identifying each page.
o rev_id: a large number uniquely identifying each revision.
o article_title: a string denoting the page’s title (and the last part of the URL of the
page, hence also uniquely identifying each page).
o timestamp: the exact date and time of the revision, in ISO 8601 format; e.g., 13:45:00
UTC 30 September 2013 becomes 2013-09-12T13:45:00Z, where T separates the date from the time part and Z denotes the time is in UTC. (Note: a class that translates such dates into numerical form is provided in the skeleton code).
o [ip:]username: the name of the user who performed the revision, or her DNS-resolved IP address (e.g., ip:office.dcs.gla.ac.uk) if anonymous.
o user_id: a large number uniquely identifying the user who performed the revision, or her IP address as above if anonymous.
• CATEGORY: list of categories this page is assigned to.
• IMAGE: list of images in the page, each listed as many times as it occurs.
• MAIN, TALK, USER, USER_TALK, OTHER: cross-references to pages in other namespaces.
• EXTERNAL: list of hyperlinks to pages outside Wikipedia.
• TEMPLATE: list of all templates used by the page, each listed as many times as it occurs.
• COMMENT: revision comments as entered by the revision author.
• MINOR: a Boolean flag (0|1) denoting whether the edit was marked as minor by the author.
• TEXTDATA: word count of revision’s plain text.
• An empty line, denoting the end of the current record.
To execute and test your implementations, you will be using Hadoop/MapReduce. The dataset is already stored on HDFS, under the path “/user/enwiki/”. For the sake of time, you will be working on random samples of this dataset; however, your code should be able to process the larger file
1 Source: http://snap.stanford.edu/data/wiki-meta.html
without any major changes, hence you should try to design for scalability and performance. There are several versions of the datasets under said folder:
• enwiki-20080103-sample.txt: A small random sample (~0.05%) of the complete dataset.
• enwiki-20080103-largersample.txt: A relatively larger (1%) random sample of the complete
dataset.
• enwiki-20080103-perftest.txt: A 10% random sample of the complete dataset; this will not
be accessible until later this semester.
Practice tasks
Try to implement MapReduce programs to process the following types of queries over the dataset:
1. Execute a “wordcount” on the whole data file.
2. Execute a “wordcount” but only counting occurrences of article_id’s.
3. Execute a “wordcount” but only counting how many MAIN outlinks a page has.
These tasks are unassessed and are only provided as ideas for you to practice with MapReduce and the datasets. Do NOT submit source code files for the above.
Assessed task
Your main task is to implement a watered-down version of the PageRank algorithm. Looking at the record you have for each revision, you can pull out:
• The article_title from the REVISION line
• The list of titles of pages linked to (i.e., out-links) from the MAIN line
Now, given a set S of such links, in the form:
Your README.txt file should outline (a) your solution (key-value format for mappers and reducers, as well as any other configuration details for your jobs), (b) any interesting aspects of your solution (e.g., assumptions you’ve made, optimisations that you thought of, etc.), as well as (c) any modifications you may have done to the provided project/maven build files. Please note that if your solution produces correct results, but this is only true under assumptions that were not explicitly stated in your README.txt file, there may be a deduction of marks.
Your submission will be tested for plagiarism with a specialised off-the-shelf tool. Plagiarism cases will be dealt on a case-by-case basis, but suffice to say there will be little tolerance.
How this exercise will be marked
Following timely submission on Moodle, the exercise will be given a numerical mark between 0 (no submission) and 20 (perfect in every way). The numerical marks will then be converted to a band (A5, A4, etc.). The marking scheme is as follows:
• 3 marks for the quality and readability of the code; make sure that you use appropriate names for variables/methods/classes and that your source code is properly structured in packages and classes, with common functionality abstracted out.
• 3 marks for documentation in the code and your README.txt file; make sure that you document in your source files, at the very least, the basic steps taken and the key-value pair formats. You can use Javadoc-compliant comments or just plain comments. That said, do not make an essay of your code; use your README file to discuss further details.
• 9 marks for a solution that produces the correct results; partial marks will be awarded if your solution is on the right track but produces incorrect results due to minor errors.
• 5 marks for any optimisations you have come up with, e.g., to reduce the number of jobs and the amount of data read from disk and/or transferred over the network, to improve parallelism, etc.