Cloud Computing Course Project 2018
Task 1.
Basic Task (20 points)
Without using MapReduce, write a simple program that COUNTs the number of times each word occurs in the text file “NovelTexts.txt” that we prepared for you. Report the time your program takes to complete the task. Discuss possible ways to speed up your program.
Note: for all tasks in this project, words ending with (s|ing|ness|ed|ly) should be treated as identical to their base form.
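As an illustration of the note above, here is a minimal Python sketch of suffix normalization combined with a timed word count. The function names (normalize, count_words, timed_count) are made up for this example, and stripping suffixes with a regular expression is a naive assumption rather than a required method: it will also shorten words such as "this" or "sing", so you may want a stricter rule.

```python
import re
import time
from collections import Counter

# Suffixes named in the note; only one suffix at the end of a word is removed.
SUFFIX = re.compile(r"(ness|ing|ed|ly|s)$")

def normalize(word):
    """Lowercase a word and strip one listed suffix (naive, not a real stemmer)."""
    word = word.lower()
    stripped = SUFFIX.sub("", word)
    # Keep the original word if stripping would leave nothing (e.g. "s", "ed").
    return stripped or word

def count_words(text):
    """Count normalized words in a string."""
    return Counter(normalize(w) for w in re.findall(r"[A-Za-z]+", text))

def timed_count(path):
    """Count words in a file and return (counts, elapsed_seconds)."""
    start = time.perf_counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        counts = count_words(f.read())
    return counts, time.perf_counter() - start
```

Calling timed_count("NovelTexts.txt") would return the counts together with the elapsed wall-clock time, which is the figure the task asks you to report.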
Optional Task (20% bonus)
Prepare a set of English novels (from any group of writers, such as Harry Potter, A Song of Ice and Fire, etc.) as a .txt file. The file should be as large as possible (no smaller than 10MB).
Possible ways of preparing such a file:
- Search the internet for existing files.
- Crawl and download from webpages.
- Other methods you can come up with.
Then repeat the word count task from the Basic Task on the file you prepared yourself. If you do this task, describe in the report the method you used to collect the data. Think about methods to speed up your data collection process.
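One possible way to speed up data collection is to fetch several sources at the same time. The sketch below assumes a thread pool and a caller-supplied fetch function; both collect and fetch are hypothetical names for this example, and how fetch actually obtains a text (download, crawl, local copy) is left to you.

```python
from concurrent.futures import ThreadPoolExecutor

def collect(sources, fetch, out_path, workers=8):
    """Fetch every source concurrently and write the texts to one .txt file.

    `fetch` is a caller-supplied function (hypothetical here) that takes one
    source identifier, such as a URL, and returns its text as a string.
    Returns the total number of UTF-8 bytes of fetched text.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(fetch, sources))  # results keep the input order
    with open(out_path, "w", encoding="utf-8") as out:
        for text in texts:
            out.write(text)
            out.write("\n")
    return sum(len(t.encode("utf-8")) for t in texts)
```

Threads suit this step because fetching is mostly waiting on the network; the returned byte count gives a quick check against the 10MB minimum.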
Task 2. (20 points)
Write a MapReduce program to count the number of times each word occurs in the text files from Task 1.
- Your program should output the number of times each word occurs in the file.
- Your program should record the time it takes to complete the task.
- Compare the time spent in Task 2 with the time spent in Task 1, and with different numbers of VMs, and analyze the possible reasons for the results.
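To make the map, shuffle, and reduce steps concrete, here is a small in-process Python sketch of the word count. It only imitates the MapReduce data flow on one machine and is not a Hadoop job; the function names are invented for this example, and suffix normalization is left out to keep it short.

```python
import re
from collections import defaultdict

def map_phase(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, ones):
    """Reducer: sum the partial counts for one word."""
    return word, sum(ones)

def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, v) for w, v in shuffle(pairs).items())
```

In a real cluster the mapper and reducer run on different VMs and the shuffle moves data over the network, which is one thing to keep in mind when comparing timings.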
Task 3.
Basic Task (20 points)
Write a MapReduce program to sort the words by their number of occurrences in the files you have. You need to sort the result in both ascending and descending order.
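One common approach, sketched below in plain Python, is to swap each (word, count) pair so that the count becomes the key, and then order by that key. The names swap_mapper and sort_by_count are made up for this example, and the tie-breaking rule (alphabetical) is an assumption; in a real MapReduce job the framework's sort between map and reduce would do the ordering, and descending order would typically need a custom comparator.

```python
def swap_mapper(word, count):
    """Mapper: emit (count, word) so the count becomes the sort key."""
    return count, word

def sort_by_count(counts, descending=False):
    """Return (word, count) pairs ordered by count, ties broken alphabetically."""
    keyed = [swap_mapper(w, c) for w, c in counts.items()]
    keyed.sort(key=lambda kv: (-kv[0] if descending else kv[0], kv[1]))
    return [(word, count) for count, word in keyed]
```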
Optional Task (20% bonus)
Without using MapReduce, write a program that SORTs the words by their number of occurrences. Compare the time spent with and without MapReduce and analyze the reasons for the difference.
Task 4. (20 points)
Write a MapReduce program to build an index of the words in the files so that you can quickly find where a word is located. For example, an index could look like this:
Word | Location |
aaa | File 1, line 10; File 2, line 100 … |
bb | File 13, line 12; File 22, line 133 … |
…
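The index described above is usually called an inverted index. Here is a minimal in-process Python sketch: the mapper emits a (word, location) pair for each word on a line, and the reduce step collects and sorts the locations per word. The function names and the in-memory files dictionary are assumptions for illustration, not a prescribed design.

```python
import re
from collections import defaultdict

def index_mapper(file_name, line_no, line):
    """Mapper: emit (word, (file, line)) for each distinct word on a line."""
    for word in set(re.findall(r"[a-z]+", line.lower())):
        yield word, (file_name, line_no)

def build_index(files):
    """files: dict mapping a file name to its list of lines."""
    index = defaultdict(list)
    for file_name, lines in files.items():
        for line_no, line in enumerate(lines, start=1):
            for word, location in index_mapper(file_name, line_no, line):
                index[word].append(location)
    # Reducer step: sort each location list for stable, readable output.
    return {word: sorted(locs) for word, locs in index.items()}

def format_entry(word, locations):
    """Render one index row as 'word | File, line N; File, line M |'."""
    cells = "; ".join(f"{name}, line {n}" for name, n in locations)
    return f"{word} | {cells} |"
```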
Task 5. (20 points)
Build a simple web-based search engine based on the previous results.
- Write a MapReduce program that searches for a text pattern in the prepared dataset. Think about whether to use the index from Task 4, and why.
- Write a simple web interface for input and output.
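As a starting point for the search step, the sketch below shows one way a query could be answered from a word index: look up each query word and keep only the locations where all of them appear. It assumes the index is a dictionary from a word to a list of (file, line) pairs; search is a hypothetical name, and this lookup only handles whole words, so a general text pattern may still require scanning the raw text.

```python
def search(index, query):
    """Return sorted locations where every query word occurs on the same line."""
    words = query.lower().split()
    if not words:
        return []
    # One set of (file, line) locations per query word; unknown words give an empty set.
    postings = [set(index.get(w, ())) for w in words]
    return sorted(set.intersection(*postings))
```

A web interface would pass the text from its input box to a function like this and render the returned locations as the output.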