代写 C algorithm Scheme html Java scala parallel graph network MapReduce Introduction

Introduction
Big Data (H/SIT) 2018-19
1st Assessed Exercise: HDFS/MapReduce
The goal of this exercise is to familiarize yourselves with the design, implementation and performance testing of Big Data crunching tasks using Hadoop/MapReduce. You will be required to design and implement algorithms for parsing, filtering, projecting, and transforming data, over a relatively large dataset, executing your code on a shared Hadoop cluster.
Dataset and Infrastructure
You will be working on a parsed version of the complete Wikipedia edit history as of January 20081. This is a single large text file (around the 300GB mark in its entirety), in a tagged multi-line format. Each revision history record consists of 14 lines, each starting with a tag and containing a space/tab- delimited series of entries. More specifically, each record contains the following data/tags, one tag per line:
• REVISION: revision metadata, consisting of:
o article_id: a large integer, uniquely identifying each page.
o rev_id: a large number uniquely identifying each revision.
o article_title: a string denoting the page’s title (and the last part of the URL of the
page, hence also uniquely identifying each page).
o timestamp: the exact date and time of the revision, in ISO 8601 format; e.g., 13:45:00
UTC 30 September 2013 becomes 2013-09-12T13:45:00Z, where T separates the date from the time part and Z denotes the time is in UTC. (Note: a class that translates such dates into numerical form is provided in the skeleton code).
o [ip:]username: the name of the user who performed the revision, or her DNS-resolved IP address (e.g., ip:office.dcs.gla.ac.uk) if anonymous.
o user_id: a large number uniquely identifying the user who performed the revision, or her IP address as above if anonymous.
• CATEGORY: list of categories this page is assigned to.
• IMAGE: list of images in the page, each listed as many times as it occurs.
• MAIN, TALK, USER, USER_TALK, OTHER: cross-references to pages in other namespaces.
• EXTERNAL: list of hyperlinks to pages outside Wikipedia.
• TEMPLATE: list of all templates used by the page, each listed as many times as it occurs.
• COMMENT: revision comments as entered by the revision author.
• MINOR: a Boolean flag (0|1) denoting whether the edit was marked as minor by the author.
• TEXTDATA: word count of revision’s plain text.
• An empty line, denoting the end of the current record.
To execute and test your implementations, you will be using Hadoop/MapReduce. The dataset is already stored on HDFS, under the path “/user/enwiki/”. For the sake of time, you will be working on random samples of this dataset; however, your code should be able to process the larger file
1 Source: http://snap.stanford.edu/data/wiki-meta.html

without any major changes, hence you should try to design for scalability and performance. There are several versions of the datasets under said folder:
• enwiki-20080103-sample.txt: A small random sample (~0.05%) of the complete dataset.
• enwiki-20080103-largersample.txt: A relatively larger (1%) random sample of the complete
dataset.
• enwiki-20080103-perftest.txt: A 10% random sample of the complete dataset; this will not
be accessible until later this semester.
Practice tasks
Try to implement MapReduce programs to process the following types of queries over the dataset:
1. Execute a “wordcount” on the whole data file.
2. Execute a “wordcount” but only counting occurrences of article_id’s.
3. Execute a “wordcount” but only counting how many MAIN outlinks a page has.
These tasks are unassessed and are only provided as ideas for you to practice with MapReduce and the datasets. Do NOT submit source code files for the above.
Assessed task
Your main task is to implement a watered-down version of the PageRank algorithm. Looking at the record you have for each revision, you can pull out:
• The article_title from the REVISION line
• The list of titles of pages linked to (i.e., out-links) from the MAIN line
Now, given a set S of such links, in the form:, the PageRank score value for any page u can be expressed as:
PR(u)=0.15 + 0.85 * Sum(PR(v)/L(v)), ∀v: ∃(v,u) ∈S, where L(v) is the number of out-links of page v.
The PageRank score is then computed in rounds. For the first round, all pages have a score of 1.0, and assume any given page has a score of p in a subsequent round. On each round, each page v “contributes” p/L(v) to the PageRank of each of the pages it links to. At the end of each round, the PageRank score of a page u is equal to the sum of the contributions of all pages with an out-link to u, times 0.85, plus 0.15 (= 1 – 0.85). The 0.85 factor is called the “damping factor” of PageRank.
You will be computing PageRank scores at a user-defined point in time. That is, the user of your program will supply (through a command line argument) a date Y; your code should then compute the PageRank score only for article revisions valid at time Y. An article revision is deemed valid iff it is the latest revision for that article that predates Y; i.e., iff (a) the revision’s creation date C predates Y and (b) there is no other revision for the same article with a creation date C’ so that C= 1), and (d) the date Y for which the PageRank scores will be computed (in ISO8601 format); i.e., the code should be executable like so:
$ env HADOOP_CLASSPATH=/path/to/your.jar hadoop MainClass \ /path/to/input.txt /path/to/output X Y
Your output files should contain one line of text per result, containing the article title and its PageRank score separated by a single space; i.e.:
Article_1 score1 Article_2 score2 Article_3 score3 …
Do not worry about sorting/merging of the final output files.
Solutions against the sample input files will be provided closer to the submission deadline to avoid distracting you from your coding. While working on your implementations, do all your testing against the smallest of the three files (also included in the VM’s HDFS). Once you are confident that your code works correctly and efficiently, move up to the next larger sample (on the cluster). Only execute your code against the perftest file iff you are 100% certain that there are no bugs or other problems. Along the same lines, beware of the resources you request through your code. Teams that hog the cluster will be “named and shamed”…
What to hand in
Use Moodle to submit a single zip or compressed tar file with the contents of the uog-bigdata folder post cleaning, plus a separate README.txt file.
For the former, do NOT submit any class files or other binary files (e.g., jar files)! That is, before creating the first submission file, make sure you execute ‘mvn clean’ in your project’s directory (and that Eclipse or other IDE is closed beforehand so as to avoid having the class files rebuilt in the background). To aid in testing and assessing of your code, please make sure that:
• Your submission file is named uog-bigdata.tgz or uog-bigdata.zip
• When uncompressed, your files will be in a folder named uog-bigdata
• No binaries are included in the submission, unless you have used a third-party library (whose
use you should document and justify in your report (more on this shortly)
• No version control system-specific directories are included; e.g., if you are using git to track
your code changes, don’t include the .git directory in your submission, etc.

Your README.txt file should outline (a) your solution (key-value format for mappers and reducers, as well as any other configuration details for your jobs), (b) any interesting aspects of your solution (e.g., assumptions you’ve made, optimisations that you thought of, etc.), as well as (c) any modifications you may have done to the provided project/maven build files. Please note that if your solution produces correct results, but this is only true under assumptions that were not explicitly stated in your README.txt file, there may be a deduction of marks.
Your submission will be tested for plagiarism with a specialised off-the-shelf tool. Plagiarism cases will be dealt on a case-by-case basis, but suffice to say there will be little tolerance.
How this exercise will be marked
Following timely submission on Moodle, the exercise will be given a numerical mark between 0 (no submission) and 20 (perfect in every way). The numerical marks will then be converted to a band (A5, A4, etc.). The marking scheme is as follows:
• 3 marks for the quality and readability of the code; make sure that you use appropriate names for variables/methods/classes and that your source code is properly structured in packages and classes, with common functionality abstracted out.
• 3 marks for documentation in the code and your README.txt file; make sure that you document in your source files, at the very least, the basic steps taken and the key-value pair formats. You can use Javadoc-compliant comments or just plain comments. That said, do not make an essay of your code; use your README file to discuss further details.
• 9 marks for a solution that produces the correct results; partial marks will be awarded if your solution is on the right track but produces incorrect results due to minor errors.
• 5 marks for any optimisations you have come up with, e.g., to reduce the number of jobs and the amount of data read from disk and/or transferred over the network, to improve parallelism, etc.

Related Posts