COMP9313 21T3 Project 2 (15 marks)
Problem 1. Relative Frequency (8 pts, use Spark RDD):
Given a corpus of documents, compute the relative frequency f(wj|wi), i.e.,
Here, N(., .) indicates the number of times a particular co-occurring term pair is
observed in the corpus. In this problem, we consider that 𝑤𝑖 and 𝑤𝑗 co-occur if
𝑤𝑗 appears after 𝑤𝑖 in the same line.
Input:
The format of input files is like “pg100.txt”. For example, the input could be:
The Works of Shakespeare, by William Shakespeare
Language: English
Output:
Please get the terms from the documents as below:
• Ignore the letter case, i.e., convert all words to lower case first.
• Ignore terms starting with non-alphabetical characters, i.e., only consider
terms starting with “a” to “z”.
• Use the following split function to split the documents into terms:
split(“[\\s*$&#/\”‘\\,.:;?!\\[\\](){}<>~\\-_]+”)
Each line of the output file is in format of “𝑤𝑖 𝑤𝑗 𝑓(𝑤𝑗|𝑤𝑖)”. The results are
sorted by the first term (i.e., 𝑤𝑖) in ascending order (alphabetical), then sorted
by the relative frequency (i.e., 𝑓(𝑤𝑗|𝑤𝑖)) in descending order, and finally sorted
by the second term in ascending order. The relative frequency is of double
precision. Given the above example, the output file is like:
by shakespeare 0.5
by william 0.5
language english 1.0
of shakespeare 0.5
of by 0.25
of william 0.25
shakespeare by 0.3333333333333333
shakespeare shakespeare 0.3333333333333333
shakespeare william 0.3333333333333333
the shakespeare 0.3333333333333333
the by 0.16666666666666666
the of 0.16666666666666666
the william 0.16666666666666666
the works 0.16666666666666666
william shakespeare 1.0
works shakespeare 0.4
works by 0.2
works of 0.2
works william 0.2
Code format:
Name the package as “comp9313.proj2”, the scala file as “Problem1.scala”,
and the object as “Problem1”. Store the result in a text file on disk. Your
program should take two parameters: the input text file and the output folder.
Problem 2 (7 pts, use the GraphX pregel operator):
Given a directed graph, for each vertex, find all the triangles that pass this
vertex. A triangle is a cyclic path that passes three different vertices. For
example, in the below graph, 0->1->4->0 is a triangle.
Input:
Each line is in format of “EdgeId FromNodeId ToNodeId”. Given the above
graph, the input file is like below.
0 0 1
1 0 3
2 1 2
3 1 4
4 1 5
5 2 1
6 2 3
7 3 0
8 3 5
9 4 0
10 4 5
11 5 2
Output:
For each vertex, display all the triangles that pass it. Each line of the output file
contains the vertex id (VertexId) and the triangles (each in format of “VertexId-
>Vertex1->Vertex2”) that pass this vertex, and the format is
“VertexId:Triangle_1;Triangle_2;…;Triangle_k;”. Please order the lines by
VertexId in ascending order numerically. Within each line, sort the triangles on
one vertex according to “Vertex1” first and “Vertex2” second in descending
order numerically. For example, given the above sample tiny graph, the output
file is like:
0:0->1->4;
1:1->5->2;1->4->0;
2:2->3->5;2->1->5;
3:3->5->2;
4:4->0->1;
5:5->2->3;5->2->1;
Code format:
Name the package as “comp9313.proj2”, the scala file as “Problem2.scala”,
and the object as “Problem2”. Store the result in a text file on disk. Your
program should take two parameters: the input graph file and the output folder.
Documentation and code readability
Your source code will be inspected and marked based on readability and ease
of understanding. The documentation (comments of the codes) in your source
code is also important. Below is an indicative marking scheme:
Result correctness: 80%
Efficiency: 10%
Code structure, Readability, and
Documentation: 10%