Set up your own Hadoop 2.7.3 single-node environment on your local machine, and make sure your code can run with Hadoop 2.7.3. The test files (testFiles.zip) can be found on Canvas.
1. In this question, you are required to write a MapReduce program to generate the bag-of-words (BoW) vectors of the given TEN text files on Canvas. In practice, the BoW model is a useful tool for feature generation. After transforming a text file into a "bag of words", we can calculate various measures to characterize the file, e.g., the Squared Euclidean distance in Question 2. For example, supposing the dictionary is {"John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too"}, the BoW vector of a text file with the content "John likes to watch movies. Mary likes movies too." is [1, 2, 1, 1, 2, 0, 0, 0, 1, 1], and the BoW vector of another file with "John also likes to watch football games." is [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]. Now, you should write a MapReduce program similar to the standard WordCount.java.
1) The first step is to extract the key words that exist in the specific word dictionary, where the key words come from the web page: https://en.wikipedia.org/wiki/Most_common_words_in_English. You can find the defined String array in the Appendix (DO NOT adjust the sequence of the words). During this procedure, you should also record the occurrence of each key word.
2) Then generate a vector in 100-dimensional space, i.e., a vector with 100 elements, where each element represents the occurrence of the corresponding word among the 100 most common words. If an occurrence exceeds the maximum integer value in Java, simply use that maximum value to represent it. The output format should be "[file's name \t v1, v2, …, v100]" (the file's name and the vector are separated by a Tab), e.g.,
file_1_name 1, 2, 3, 4, …, 100
file_2_name 5, 2, 9, 6, …, 30
3) Finally, you should output these vectors, as defined above, into a single file, which will be the input file of Question 2. So make sure the format is exactly the same as in the given example.
4) Note: the word identification should be case-insensitive, and anything that is not a letter or a digit should be ignored, i.e., replaced with whitespace. For example, "Man.&" will be identified as "man", and "Harry's" will be identified as "harry" and "s". A minimal sketch of one possible structure is given below.
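The following is a minimal sketch of one possible structure, not a complete solution (the class names Bow, BowMapper, and BowReducer are illustrative, the driver/job configuration is omitted, and the dictionary array is truncated here; use the full array from the Appendix). The mapper tags each matched word with the name of the file it comes from, and the reducer assembles the 100-dimensional vector; the default TextOutputFormat then writes tab-separated "file_name \t v1, v2, …, v100" lines.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Bow {

    // The full array from the Appendix goes here (truncated for brevity).
    private final static String[] top100Word = { "the", "be", "to", /* ... */ "us" };

    public static class BowMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text fileName = new Text();
        private final IntWritable wordIndex = new IntWritable();
        private final List<String> dict = Arrays.asList(top100Word);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Identify which input file this split belongs to.
            fileName.set(((FileSplit) context.getInputSplit()).getPath().getName());
            // Case-insensitive matching; non-alphanumeric characters become whitespace.
            String cleaned = value.toString().toLowerCase().replaceAll("[^a-z0-9]", " ");
            for (String token : cleaned.split("\\s+")) {
                int idx = dict.indexOf(token);
                if (idx >= 0) {                       // keep only the 100 most common words
                    wordIndex.set(idx);
                    context.write(fileName, wordIndex);
                }
            }
        }
    }

    public static class BowReducer extends Reducer<Text, IntWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long[] counts = new long[100];            // occurrence of each dictionary word
            for (IntWritable v : values) {
                counts[v.get()]++;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100; i++) {
                sb.append(Math.min(counts[i], Integer.MAX_VALUE));   // cap as required in step 2)
                if (i < 99) sb.append(", ");
            }
            context.write(key, new Text(sb.toString()));
        }
    }
}

In the (omitted) driver, setting job.setNumReduceTasks(1) makes Hadoop write a single output file, as step 3) requires.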
2. In this question, you are provided with the BoW vector of a query file (see Clarification 1 below). You should use the above result, i.e., the output of Question 1, as the input to calculate the Squared Euclidean distance between each of the generated BoW vectors and the query one, i.e., queryFileBoW (see below). The formula is (supposing v and w are two vectors with 100 elements):
dist = (v[0]-w[0])² + (v[1]-w[1])² + … + (v[99]-w[99])²
Note that each line of the input file contains the file's name and its BoW vector, separated by a Tab (i.e., "\t"). You are required to create a Hadoop program that can process such an input file, calculate the distances, and sort these testing files in descending order of the computed Squared Euclidean distance. For example, the query vector is (for easy demonstration, the dimension is just 3):
query_file 1, 1, 2
And the input file contains:
file_a 1, 2, 3
file_b 2, 4, 6
file_c 1, 3, 5
file_d 1, 2, 3
One form of the output is:
file_b 26
file_c 13
file_d 2
file_a 2
Or
file_b 26
file_c 13
file_a 2
file_d 2
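For instance, the distance of file_b above is (2-1)² + (4-1)² + (6-2)² = 1 + 9 + 16 = 26, and files with equal distances (file_a and file_d) may appear in either order. A minimal sketch of one possible approach is shown below (the class names Dist, DistMapper, DescendingComparator, and DistReducer are illustrative, the query array is truncated, and the driver is omitted; a driver sketch follows the Clarifications). The mapper parses each tab-separated line and emits the squared distance to queryFileBoW as the key; a reversed comparator on LongWritable makes the shuffle sort the keys in descending order, and the reducer writes one "file_name \t distance" line per input file.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Dist {

    // The full array from Clarification 1 goes here (truncated for brevity).
    private static final int[] queryFileBoW = { 576, 82, 287, /* ... */ 18 };

    public static class DistMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like: "file_name \t v1, v2, ..., v100"
            String[] parts = value.toString().split("\t");
            if (parts.length != 2) return;            // skip malformed lines
            String[] elems = parts[1].split(",");
            long dist = 0;
            for (int i = 0; i < elems.length && i < queryFileBoW.length; i++) {
                long diff = Long.parseLong(elems[i].trim()) - queryFileBoW[i];
                dist += diff * diff;
            }
            context.write(new LongWritable(dist), new Text(parts[0]));
        }
    }

    // Sorts the map output keys (distances) in descending order during the shuffle.
    public static class DescendingComparator extends WritableComparator {
        public DescendingComparator() {
            super(LongWritable.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -((LongWritable) a).compareTo((LongWritable) b);
        }
    }

    public static class DistReducer extends Reducer<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Files sharing the same distance are written in arbitrary order, as allowed above.
            for (Text name : values) {
                context.write(new Text(name), key);
            }
        }
    }
}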
Clarifications:
1. For the query BoW vector, you should hardcode it in your program by defining the following array:
private int[] queryFileBoW = { 576, 82, 287, 289, 303, 241, 197, 161, 68, 279, 147, 75, 80, 41, 112, 84, 92, 198, 44, 88, 81, 56, 54, 46, 50, 25, 56, 3, 50, 46, 97, 30, 24, 104, 36, 52, 33, 55, 15, 32, 43, 30, 25, 46, 20, 25, 2, 61, 11, 63, 21, 9, 26, 9, 7, 48, 11, 27, 19, 4, 7, 20, 5, 45, 15, 13, 30, 17, 18, 21, 10, 26, 17, 15, 17, 15, 15, 13, 16, 4, 13, 11, 14, 13, 10, 15, 45, 9, 19, 11, 4, 2, 2, 2, 49, 10, 9, 12, 15, 18 }; // Please do not modify any value in this array.
2. Your program should conform to the MapReduce programming paradigm, and the output should be a single file (see the driver sketch after these clarifications).
3. The input file comes from Question 1 (you can change its name from the default one, i.e., "part-r-00000"), so make sure the format is correct.
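As a hedged sketch of how Clarifications 2 and 3 can be satisfied (the class name DistDriver is illustrative, and the main method can equally live inside Dist.java): a single reduce task makes Hadoop produce exactly one output file, and the input path simply points at the Question 1 output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "squared euclidean distance");
        job.setJarByClass(Dist.class);
        job.setMapperClass(Dist.DistMapper.class);
        job.setReducerClass(Dist.DistReducer.class);
        job.setSortComparatorClass(Dist.DescendingComparator.class);   // descending by distance
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1);                                      // one reducer -> a single output file
        FileInputFormat.addInputPath(job, new Path(args[0]));          // e.g., the Question 1 output file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}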
Submission Instructions
You are required to upload a ZIP file with your Java source code, i.e., two .java files for the two questions respectively (e.g., you can name them Bow.java and Dist.java). Your source code will be read, and we will attempt to test your program on Hadoop 2.7.3, so make sure your program can compile and execute correctly before you submit. Your assignment must be submitted on Canvas with the name "[CS4296/CS5296]-[Your Name]-[Student ID]-Assignment2.zip" (e.g., CS5296-HarryPotter-12345678-Assignment2.zip) before the deadline, i.e., 23:59, April 16, Sunday.
Appendix
Top 100 common words in English
private final static String[] top100Word = { "the", "be", "to", "of", "and", "a", "in", "that", "have", "i", "it", "for", "not", "on", "with", "he", "as", "you", "do", "at", "this", "but", "his", "by", "from", "they", "we", "say", "her", "she", "or", "an", "will", "my", "one", "all", "would", "there", "their", "what", "so", "up", "out", "if", "about", "who", "get", "which", "go", "me", "when", "make", "can", "like", "time", "no", "just", "him", "know", "take", "people", "into", "year", "your", "good", "some", "could", "them", "see", "other", "than", "then", "now", "look", "only", "come", "its", "over", "think", "also", "back", "after", "use", "two", "how", "our", "work", "first", "well", "way", "even", "new", "want", "because", "any", "these", "give", "day", "most", "us" };