CS435 Introduction to Big Data, Fall 2018 Programming Assignment 3
Programming Assignment 3
Estimating PageRank Values of Wikipedia Articles using Apache Spark
Due: November 6, 2018 Tuesday 5:00PM
Submission: via Canvas, individual submission
Instructor: Sangmi Lee Pallickara
Web page: http://www.cs.colostate.edu/~cs435/Assignments.html
Objectives
The goal of this programming assignment is to enable you to gain experience in:
- Installing and using analytics tools such as HDFS and Apache Spark
- Implementing iterative algorithms to estimate PageRank values of
Wikipedia articles using Wikipedia dump data
- Designing and implementing batch layer computation using Hadoop
MepReduce and HDFS
(1) Overview
In this assignment, you will design and implement a system that calculates PageRank1 values of currently available Wikipedia articles. The PageRank algorithm is used by Google Search to rank web pages in their search engine query results. PageRank measures the importance of web pages. The underlying assumption is that more important web pages are likely to receive more links from other web pages.
In this assignment, you will implement algorithms to estimate PageRank values over the Wikipedia dataset. These results can be used to rank search results within Wikipedia. Most of the Wikipedia articles contain links to other articles or external web contents. We will consider only internal Wikipedia articles to calculate the PageRank. Currently Wikipedia contains more than 4.9 million articles2. Wikipedia publishes data dumps in various formats periodically3. In this assignment, you will perform:
- Estimation of PageRank values under ideal conditions
- Estimation of the PageRank values while considering dead-end articles
- Analysis of the above results: Creating a Wikipedia Bomb
To perform the PageRank algorithm over the Wikipedia dump, you are required to use Spark in an iterative fashion.
1 Larry Page, “PageRank: Bringing Order to the Web” Stanford Digital Library Project, 1997, https://web.archive.org/web/20020506051802/www-diglib.stanford.edu/cgi-bin/WP/get/SIDL- WP-1997-0072?1
2 https://en.wikipedia.org/wiki/Wikipedia:Statistics 3 https://meta.wikimedia.org/wiki/Data_dumps
1/6
CS435 Introduction to Big Data, Fall 2018 Programming Assignment 3
(2) Requirement of Programing Assignment 3
Your implementation of this assignment should provide:
- A sorted list of Wikipedia pages based on their ideal PageRank value in
descending order. Each row should contain the title of the article and its PageRank value. This computation should be performed in your own Spark cluster with 5 machines. You should present results from at least 25 iterations.
- A sorted list (in descending order) of Wikipedia pages based on their PageRank value with taxation. Each row should contain the title of the article and its PageRank value. This computation should be performed on your own Spark cluster with 5 machines. You should present results from at least 25 iterations.
- A Wikipedia Bomb that returns the “Rocky Mountain National Park” wikipedia page for the search key word “surfng” (see 3.C). In order to do that, you should modify the link data fle and show that the “Rocky Mountain National Park” page generates highest PageRank among all of the pages containing “surfng” as part of their titles.
There are many algorithms to estimate PageRank. You should use the algorithms included in this description. Also, you are not allowed to use existing PageRank implementations.
Demonstration of your software should be on the machines in CSB-120. Demonstration will include an interview discussing implementation and design details. Your submission should include your source codes and outputs. Your submission should be via Canvas.
(3) PageRank
The term PageRank comes from Larry Page, a founder of Google. PageRank is a function that assigns a real number to each page in the Web (in our case Wikipedia articles). A page with a higher PageRank value will be considered as a more important page. There are many algorithms to assign PageRank. In this assignment, we will consider two algorithms4: (1) idealized PageRank and (2) taxation based PageRank.
A. Idealized PageRank
Suppose that our Web is a directed graph where pages are the nodes and there is (are) edge(s) from page p1 to page p2 . The edges between nodes represent links between the web pages. Figure 1 depicts the links between 4
4 Jure Leskovec, Anand Rajaraman, and Jef llman, “Chapter 5: Link Analysis”, Mining of Massive Datasets, Cambridge niversity Press, http://infolab.stanford.edu/~ullman/mmds/ch5.pdf
2/6
CS435 Introduction to Big Data, Fall 2018 Programming Assignment 3
pages (A, B, C and D) as a graph. Page A has links to each of the other three pages B, C and D. Page B has links to A and D. Page C has a link only to A. Finally, page D has links to B and C.
A
A
CD
Figure 1. Web graph of example
Suppose a random surfer starts at page A (in Figure 1). The probability that the random surfer will visit each of the pages B, C, and D in his next stop is 1/3. Note that if a page contains a link to itself, it should be represented in the graph as a loop. The random surfer that arrived at page B will likely to visit A with probability of 1⁄2.
This graph can be defned as the transition matrix of the Web (here, Wikipedia), and the transition matrix for the previous example depicted in the Figure 1 is:
0 1/2 1 0
M=1/3 0 0 1/2
B
1/3 0 0 1/2 []
1/3 1/2 0 0
In the above matrix, the order of the pages is the natural one, A, B, C, and D. The frst column shows that a random surfer at A has probability of 1/3 of visiting each of the other pages.
An idealized PageRank supposes that a random surfer may start from any of the n pages with equal probability. The initial vector v0 for our previous example is (1⁄4, 1⁄4, 1⁄4, 1⁄4). After one step, the distribution of the surfer will be Mvo, after two steps it will be M((Mvo) = M2v0, and so on. Multiplying the initial vector v0 by M a total of i times will provide us the distribution of the surfer after i steps.
After enough number of iterations, the vector will show little change at each round. For this assignment your iteration count is as specifed earlier.
3/6
B. Taxation
CS435 Introduction to Big Data, Fall 2018 Programming Assignment 3
Wikipedia contains dead-end articles5. A dead-end article is a page that contains no links to other Wikipedia articles. Since there is no link to go out, once a random surfer visits the dead-end article, this random surfer never leaves the article. This often results in 0 probabilities in many of the pages as the number of iteration increases.
To deal with dead-end articles, the PageRank algorithm modifes the idealized process by which random surfers are assumed to move without following the links. This method is referred to as “taxation”. Taxation allows each random surfer to have a small probability of teleporting to a random page, rather than following an out-link from their current page.
With taxation, the new vector for estimating the PageRank will be,
v’=βMv+(1−β)e/n
where β is a chosen constant, usually in the range of 0.8 to 0.9, e is a vector of all 1’s with the appropriate number of components (i.e. Wikipedia pages), and n is the number of nodes in the Web graph. In this assignment, you should use β as 0.85.
articles. You should perform this processing using Apache Spark. Please ignore all of the possible errors due to the data segmentation by HDFS.
C. Wikipedia Bomb
The term Wikipedia Bomb is derived from the Google bombs for this assignment. The Google bomb refers to the practice of causing a web site to rank highly in web search engine results for unrelated or of-topic search terms by linking heavily6.
In this assignment, you should generate a Wiki Bomb for “surfng” in Wikipedia and the result should be the “Rocky Mountain National Park” page in Wikipedia.
To generate this, you are allowed to modify the link data fle. You should show the results and discuss during the demo.
(4) Programming Environment
A. Requirements
All of the software demonstration of this assignment will be done in CSB120. You should make sure your software works on the machines in CSB120.
5 https://en.wikipedia.org/wiki/Wikipedia:Dead-end_pages 6 https://en.wikipedia.org/wiki/Google_bomb
4/6
CS435 Introduction to Big Data, Fall 2018 Programming Assignment 3
B.Setup your Hadoop cluster
This assignment includes a requirement of setting up your own HDFS and Spark cluster on every node. For writing your intermediate and output data, you should use your own cluster setup.
(5) Dataset
The original dump dataset from Wikipedia is 12.1GB as a bz2 fle, and about 50GB when it is decompressed. It follows an XML format and includes other information that is not relevant to this task. To optimize your work for this assignment, we will use Wikipedia dump7 generated by Henry Haselgrove. Please fnd the instruction to download the data fle from CS120 (instead of the external server) in the link below. If you have any problem with storage capacity, you should contact the Prof. Pallickara immediately.
The dataset contains two fles: links-simple-sorted.zip (323MB) [Link]
titles-orted.zip (28MB) [Link]
These fles contain all links between proper Wikipedia pages. This includes disambiguation pages and redirect pages. It includes only English language Wikipedia.
In links-simple-sorted.txt, there is one line for each page that has links
from it. The format of the lines is:
from1: to11 to12 to13 … from2: to21 to22 to23 … …
where from1 is an integer labeling a page that has links from it, and to11 to12 to13 … are integers labeling all the pages that the page links to. To fnd the page title that corresponds to integer n, just look up the n-th line in the fle titles-sorted.txt, a TF-8 encoded text fle.
(6) Grading
Your submissions will be graded based on the demonstration to instructor in CSB120. All of the items described in (6) will be evaluated. This assignment will account for 10% of your fnal course grade.
- 3 points: Implementation of the Idealized Page Rank algorithm
- 2 points: Implementation of the Taxation-based Page Rank algorithm
- 2 points: Set up the cluster
- 3 points: Analysis task: Including the Wikipedia bomb
7 http://haselgrove.id.au/wikipedia.htm 5/6
CS435 Introduction to Big Data, Fall 2018 Programming Assignment 3
(7) Late Policy
Please check the late policy posted on the course web page.
6/6