Introduction
Big Data (H) 2018-19
2nd Assessed Exercise: Apache Spark
The goal of this exercise is to familiarize yourselves with the design, implementation and performance testing of Big Data crunching tasks using Apache Spark. You will be required to design and implement algorithms for parsing, filtering, projecting, and transforming data, over a relatively large dataset, executing your code on a shared cluster.
Dataset and Infrastructure
You will be working on the same parsed version of the complete Wikipedia edit history as of January 2008 that you processed in the 1st assessed exercise. Please refer to the spec of the 1st AE for details of the format and location of the input datasets.
Practice tasks
Please refer to the Spark tutorial slides for sample practice tasks. Also, try to tweak the basic PageRank algorithm provided in the lecture so that it can parse the Wikipedia dump file you used in your previous exercise (the core loop is sketched below for reference). These tasks are unassessed and are only provided as ideas for you to practice with Spark and the datasets. Do NOT submit source code files for the above. However, do reach out to us (lecturer and tutors) with any issues you may have, as solving them at an early stage will only help you when working on the assessed task below.
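For reference, the core loop of the basic algorithm looks roughly as follows in Java (Spark 2.x lambda API). This is a minimal sketch only: it assumes a links JavaPairRDD<String, List<String>> of (title, out-links) and a NUM_ITERATIONS constant, neither of which is part of the provided skeleton, and it uses the usual 0.15/0.85 damping from the standard Spark example; adapt it to the exact version presented in the lecture.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // 'links' and NUM_ITERATIONS are assumed to be defined in your driver.
    JavaPairRDD<String, Double> ranks = links.mapValues(v -> 1.0);
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        JavaPairRDD<String, Double> contribs = links.join(ranks).values()
            .flatMapToPair(linksAndRank -> {
                List<Tuple2<String, Double>> out = new ArrayList<>();
                // Each page donates an equal share of its rank to its out-links.
                for (String neighbour : linksAndRank._1) {
                    out.add(new Tuple2<>(neighbour, linksAndRank._2 / linksAndRank._1.size()));
                }
                return out.iterator();
            });
        ranks = contribs.reduceByKey(Double::sum)
                        .mapValues(sum -> 0.15 + 0.85 * sum);
    }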
Assessed task
Your main task is to implement the same watered-down version of the PageRank algorithm you worked on in the 1st AE. Remember that you need to compute the PageRank scores at a user-defined point in time. That is, the user of your program will supply (through a command line argument) a date Y; your code should then compute the PageRank score only for article revisions valid at time Y. An article revision is deemed valid iff it is the latest revision for that article that predates Y; i.e., iff (a) the revision’s creation date C predates Y (C <= Y); and (b) there is no other revision for the same article with a creation date C’ such that C < C’ <= Y.
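As one illustration of this filter (a sketch, not the required implementation: parsed and cutoff are hypothetical names for the output of your parsing stage, a pair RDD of (title, (creationTimestamp, revisionData)), and the command-line date Y converted to a timestamp), the two conditions map naturally onto a filter followed by a reduceByKey:

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Hypothetical names: 'parsed' and 'cutoff' stand in for your own
    // parsing stage's output and the parsed command-line date Y.
    JavaPairRDD<String, Tuple2<Long, String>> valid = parsed
        .filter(rev -> rev._2._1 <= cutoff)                  // (a) C <= Y
        .reduceByKey((r1, r2) -> r1._1 >= r2._1 ? r1 : r2);  // (b) keep the latest such C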
Your output files should contain one line of text per unique article title, containing the article title and its PageRank score separated by a single space; i.e.:
Article_1 score1
Article_2 score2
Article_3 score3
…
Again, do not worry about globally sorting or merging the output files; this time, however, try to have each individual output file sorted in descending score order (one way to do this is sketched below).
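One way to achieve this (a sketch, assuming your final ranks JavaPairRDD<String, Double> of (title, score) and a hypothetical outputPath) is to swap keys and values, sort on the score, and format each line before writing:

    // Swap to (score, title) so the sort key is the score, then sort in
    // descending order and format each line as "title score".
    JavaRDD<String> output = ranks
        .mapToPair(Tuple2::swap)
        .sortByKey(false)
        .map(t -> t._2 + " " + t._1);
    output.saveAsTextFile(outputPath);

Note that sortByKey range-partitions the data, so each part file ends up internally sorted (the resulting order is in fact global, which is more than required here).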
You are free to implement your code in your favourite programming language, provided that it can interface with Apache Spark (e.g., Java, Scala, Python, etc.). However, please note that the provided Maven build file only supports Java as far as compiled languages are concerned, so you may need to provide your own custom build files if you choose another language (e.g., sbt files for Scala, etc.).
If you use Java/Scala, please consider using Hadoop’s TextInputFormat with a tweaked delimiter (‘\n\n’) to parse the input file. That is, there is no need for you to define a custom input format of your own; however, if you feel inclined to do so, then by all means go ahead.
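A minimal sketch of this in Java, assuming sc is your JavaSparkContext and inputPath points at the dump file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.api.java.JavaRDD;

    Configuration hadoopConf = new Configuration();
    // Use the blank line between records as the record delimiter, so that
    // each record handed to Spark is one complete revision block.
    hadoopConf.set("textinputformat.record.delimiter", "\n\n");
    JavaRDD<String> revisions = sc
        .newAPIHadoopFile(inputPath, TextInputFormat.class,
                          LongWritable.class, Text.class, hadoopConf)
        .map(record -> record._2.toString()); // copy out of Hadoop's reused Text object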
Sample solutions against the sample input files and predefined dates were provided for the 1st AE and are also valid for the 2nd AE. NOTE: While working on your implementations, do all your testing against the smallest of the three files. Once you are confident that your code works correctly and efficiently, move up to the next larger sample. Only run your code against the perftest file if you are 100% certain that there are no bugs or other problems. Along the same lines, beware of the resources you request through your code. Teams that hog the cluster will be “named and shamed”…
What to hand in
Use Moodle to submit a single zip or compressed tar file with your source code. Do NOT submit any class files or other binary files (e.g., jar files)! That is, before creating the submission file, make sure you execute an ‘mvn clean’ in your project’s directory (with Eclipse or any other IDE closed, so as to avoid having the class files rebuilt in the background), and that your directory does not include any source versioning directories/artifacts (e.g., .git, .svn, etc.). Also submit a README.txt file outlining (a) your solution (a discussion of the DAG of transformations that you ended up with, on top of the “standard” PageRank DAG), (b) any interesting aspects of your solution, and (c) any modifications you may have made to the provided project/Maven build files (or instructions to build/package/execute your code, if you haven’t used Java for your implementation).
Your submission will be tested for plagiarism with a specialised off-the-shelf tool. Plagiarism cases will be dealt with on a case-by-case basis, but suffice it to say there will be little tolerance.
How this exercise will be marked
Following timely submission on Moodle, the exercise will be given a numerical mark between 0 (no submission) and 20 (perfect in every way). The numerical marks will then be converted to a band (A5, A4, etc.). The marking scheme is as follows:
• 4 marks for documentation in the code and in your README.txt file. Make sure that you document in your README file, at the very least, the basic steps taken and the reasons why you chose the specific transformations and DAG. Please substantiate and properly justify your design decisions. For your code, you can use Javadoc-compliant comments or just plain comments.
• 2 marks for the quality and readability of the code; make sure that you use appropriate names for variables/methods/classes and that your source code is properly structured in packages and classes, with common functionality abstracted out. Although it may be appealing to put everything on a single line (see, e.g., the WordCount_v0 example), it is better to break the transformations out to improve readability and allow for better documentation of your approach.
• 11 marks for a solution that produces the correct results; partial marks will be awarded if your solution is on the right track but produces incorrect results due to minor errors.
• 3 marks for any optimisations you have come up with, e.g., to reduce the number of transformations, to reduce the amount of data read from disk and/or transferred over the network, or to improve parallelism (without overloading the cluster); one example of such an optimisation is sketched below.
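For instance (a sketch only; parsedLinks is a hypothetical name for the output of your link-extraction stage), co-partitioning and caching the link graph avoids re-shuffling and recomputing it on every PageRank iteration:

    import java.util.List;
    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.storage.StorageLevel;

    // Partition the link graph once, so the per-iteration join with the
    // ranks RDD does not reshuffle it, and cache it since every iteration
    // reuses it unchanged.
    JavaPairRDD<String, List<String>> links = parsedLinks
        .partitionBy(new HashPartitioner(sc.defaultParallelism()))
        .persist(StorageLevel.MEMORY_AND_DISK());

Since mapValues preserves partitioning, a ranks RDD derived from links via mapValues shares its partitioner, so the join does not need to reshuffle either side.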