This exercise aims to get you to apply the order inversion and value-to-key conversion design patterns for MapReduce programming you have learned in Chapter 3.1.
If you have not finished solving the problems in Lab 3, please keep working on them, and then move to Lab 4.
Create a project “Lab4” in Eclipse, and create a package “comp9313.lab4” in this project. Put all your codes written in this week’s lab in this package and keep a copy for yourself after you have finished all problems.
For all problems, using the following code to tokenize a line of document:
Convert all terms to lower case (by using toLowerCase() function). Download the pg100.txt file and put it to HDFS by:
Problem 1. Computer Relative Frequency Using Order Inversion
The problem is to compute the relative frequency f(wj|wi), i.e.,
Here, N(., .) indicates the number of times a particular co-occurring term pair is observed in the corpus. We need the count of the term co-occurrence, divided by the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).
In this problem, we consider the nonsymmetric co-occurrence. That is, wi and wj co- occur if wj appears after wi in the same line, which is defined the same as in Problem 2 of Lab 3.
Create a new class “NSRelativeFreq.java” in the package “comp9313.lab4” and solve this problem. Your output should be in format of (wi, wj, f(wj|wi)), and you need to use DoubleWritable to serialize the value f(wj|wi).
StringTokenizer itr = new StringTokenizer(value.toString(), ” *$&#/\t\n\f\”‘\\,.:;?![](){}<>~-_”);
$ hdfs dfs –mkdir input
$ hdfs dfs –put ~/pg100.txt input
1. Refer to slides 16-21 of Chapter 3.1.
2. The mapper only needs to be modified a little bit. Just emit one more special key for a pair of terms.
The problem is, what is the type of the output key of map()? How to guarantee that the order is correct, and thus the special key can be sent to the reducer first (You can still use Text to wrap a pair of terms)?
3. How to write the codes for the reducer to compute the relative frequency of a pair?
4. Can you use the reducer as the combiner in this problem? How to write a combiner to
aggregate partial results?
5. How to specify the configuration in the main function? (Be careful if your map output
value is different from the reduce output value!)
You can use the results of Problem 2 in Lab 3 to verify the correctness of your output (e.g., check the pairs where the first term is “zounds”).
Question: What will happen if you set the number of reducers more than 1?
Problem 2. Computer Relative Frequency Using Order Inversion (version 2)
You are required to customize a WritableComparable class to solve the problem defined in Problem 3, which is the output key type of the map () function. This is more complicated than using Text to wrap the pair of terms. In order to see the effects of the partitioner, set the number of reducers to 2 by adding the following line in the main function:
job.setNumReduceTasks(2);
Create a new class “NSRelativeFreq2.java” in the package “comp9313.lab4” and solve this problem. Your output should be in format of (wi, wj, f(wj|wi)), and you need to use DoubleWritable to serialize the value f(wj|wi).
1. Please refer to the StringPair class provided.
2. Remember that you need to either override hashCode() in your WritableComparable class
or implement a Partitioner to guarantee the correctness. Refer to slide 55 of Chapter 2.2
and slide 17 of Chapter 3.1. Try both methods if you have time.
3. How to configure the main function (configure your partitioner class if there is one)?
Your results should be the same as that of Problem 1 (stored in two output files because two reducers are specified).
Problem 3. Reverse Graph Edge Directions
The problem is to reverse graph edge directions of the input directed graph and sort the output adjacency list in ascending node order. You can refer to Slide 57 of Chapter 3.2.
Create a new class “ReverseGraphEdge.java” in the package “comp9313.lab4” and solve this problem.
Input files:
In the input file, each line is in format of: “FromNodeId\tToNodeId1 ToNodeId2 ToNodeId3 … ToNodeIdn”. In the above example, the input file is:
The format of the output file is the same as that of the input file. Moreover, the lines are required to be sorted accendingly according to their FromNodeIds. In each line, the ToNodeIds are also required to be sorted in the accending order.
1. Because of the requirement of ascending order of ToNodeIds, you are required to use secondary sort to solve this problem. Please refer to the slides 30-32 of Chapter 3.1 about how to do partitioning, how to use the grouping comparator, and how to define the order for keys.
Solutions of the Problems
I hope that you can finish all problems by yourself, since the hints are already given. All the source codes will be published in the course homepage on Friday next week.
3\t1 2 1\t2 3
1\t3 2\t1 3 3\t1