INFS 5095 – Big Data Basics
Practical Test 1 (SP5 2021)
Due: By 11PM on Sunday 29 August
General instructions
• This exercise is worth 5% of your final grade and it is due no later than 11pm on Sunday 29 August.
• The exercise will be marked out of 10.
• You will need to submit your work via learnonline in zip format.
Assessment task
In this assessment you are required to write two MapReduce programs and run them on Hadoop, with the input file stored in HDFS.
First, create a directory for this assessment called test1 within the /home/cloudera/ directory as we normally have in practicals. From here you should be able to follow the directions of the Week 4 practical to write and run your MapReduce programs.
Our input file (or data) for this assessment will be a dictionary of English words found on GitHub. The data file words.txt can be found in the repository https://github.com/dwyl/english-words/. Copy the file into your input folder and rename it to test_input.txt.
Q1. Write a MapReduce program to determine the frequency of word lengths within an input file.
The program should return how many times each word length appears within the dictionary. For example, in the following list
{‘apple’, ‘banana’, ‘orange’, ‘pear’}
The length of the words is
{5,6,6,4}
And so the output ‘part-00000.txt’ file would look something like
4	1
5	1
6	2
indicating that there is one word with four letters, one word with five letters and two words with six letters.
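The map, sort and reduce steps for this small example can be traced locally with the sketch below. It is only an illustration under stated assumptions: the function names `mapper` and `reducer` are hypothetical, the words are held in a Python list rather than read from standard input, and `sorted` stands in for the shuffle/sort step that Hadoop performs between the two phases. It is not the program from the Week 4 practical.

```python
from itertools import groupby

def mapper(lines):
    """Emit (word_length, 1) for each non-empty line (assumes one word per line)."""
    for line in lines:
        word = line.strip()
        if word:
            # The key is kept as a string, as it would be in text-based output.
            yield str(len(word)), 1

def reducer(pairs):
    """Sum the counts for each key; the pairs must already be sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

words = ["apple", "banana", "orange", "pear"]
shuffled = sorted(mapper(words))  # stands in for Hadoop's shuffle/sort step
result = list(reducer(shuffled))

for key, total in result:
    print(f"{key}\t{total}")
```

For the four example words this prints the three lines shown above: length 4 once, length 5 once and length 6 twice.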
If we follow the computer prac, the word lengths (e.g. 2, 4, 10, 16, 23) are actually being treated as strings, so they will print in the order (10, 16, 2, 23, 4). See if you can get the results to print in correct numerical order. (Hint: It can be achieved in one line in the Linux terminal after the file is output.)
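The ordering problem itself can be reproduced in a few lines of Python. This is a sketch of why the order differs, using the example values from the paragraph above; the variable names are illustrative, and it is not the one-line terminal solution the hint refers to.

```python
# The example values, stored as text rather than as numbers.
lengths_as_text = ["2", "4", "10", "16", "23"]

# Strings are compared character by character, so "10" sorts before "2".
text_order = sorted(lengths_as_text)

# Converting each value to an integer for the comparison restores numeric order.
numeric_order = sorted(lengths_as_text, key=int)

print(text_order)     # ['10', '16', '2', '23', '4']
print(numeric_order)  # ['2', '4', '10', '16', '23']
```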
Q2. Write a MapReduce program to determine the frequency of individual characters within the provided text file.
The program should return how many times each character appears. You don’t need to distinguish between letters, numbers or symbols; any character within the text file should be counted.
For instance
{‘11’, ‘cat’, ‘1hat’}
should output
1	3
a	2
c	1
h	1
t	2
Note: This output should not need sorting. The only numeric keys you should see are the single digits 0-9, which avoids the ordering problem in Question 1.
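To check the expected counts for a small input by hand, a quick local tally such as the sketch below can help. It assumes the words are held in a Python list and uses `collections.Counter` from the standard library; it is only a way of verifying expected output, not a mapper/reducer pair and not a substitute for the MapReduce program.

```python
from collections import Counter

words = ["11", "cat", "1hat"]

# Count every character in every word, with no distinction between
# letters, digits or symbols.
char_counts = Counter(ch for word in words for ch in word)

# Every key is a single character, so plain string ordering is enough.
for ch in sorted(char_counts):
    print(f"{ch}\t{char_counts[ch]}")
```

Because each key is exactly one character long, the string-ordering issue from Question 1 does not arise here.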
Submission instructions
You should submit three files for each question:
– mapperq*.py
– reducerq*.py
– part-00000_q*
where * is replaced by the question number. The output of each MapReduce program will be ‘part-00000.txt’, so just rename them using the above convention once they’re done, and submit all files to learnonline, bundled together into a zip file.
Make sure to comment your code sufficiently, as this will be included in your final mark. Good programmers are good commenters too: your code should be readable by a stranger who wants to use it, or by yourself in a year’s time.
Distribution of marks
All questions MapReduce program code – 2 marks each
All questions output – 2 marks each
All questions code presentation – 1 mark each
Total of 10 marks