Term Project 2:
Map Reduce
Due: 21st May (Mon) ~ 3rd June (Sun), 11:59 PM
Submit on i-Campus
1. Project Introduction
MapReduce is a new paradigm for large-scale parallel data processing introduced by Google engineers in 2004. It is generally used in the domain of databases. In this project, your task is to understand Map Reduce and build a simplified version of it. Furthermore, you will have to correctly understand and implement concurrency for this project. Construct functions that efficiently treat data and work concurrently for the given application wordcount.c. By accomplishing this project, you will be able to understand what MapReduce is and use threads and the related functions to realize concurrency.
2. Implementations 2.0 Environment
- The project should be performed in Ubuntu 16.04 environment.
- You can install Ubuntu or use a virtual machine to carry out your project.
- Evaluation will be carried out in Ubuntu environment with AMD Ryzen Threadripper 1950X
16-Core processor.
- Three files and three folders will be given.
- – Makefile: Use this file to compile wordcount.c with your functions.
- – mapreduce.h: The header file for implementing your functions.
(It is different from the file mapreduce.h provided in the link given in 2.1 Concept.)
- – wordcount.c: The C file to test your functions.
- – input_simple: A simple test set.
- – input_long_balanced: A complex test set that is balanced.
- – input_long_unbalanced: A complex test set that is unbalanced.
2.1 Concept
- The general idea of MapReduce is described at the following link. https://github.com/remzi-arpacidusseau/ostep-projects/tree/master/concurrency- mapreduce
- Use the pthread APIs for this project. (See 2.5 Tips)
- You are expected to describe the concept of MapReduce in your report. 2.2 Things to Do
Your task is to complete the following files which will be compiled with the test file wordcount.c.
SWE3004-41: Operating Systems (Spring 2018)
OS Term Project 2
– single_mapreduce.c: Contains your functions for the single-threaded MapReduce
– multi_mapreduce.c: Contains your functions for the multi-threaded MapReduce
- You need to complete the functions of MR_Emit() and MR_Run().
- When implementing the multi-threaded version, you should use the pthread APIs for
concurrency.
2.3 Compilation
- We provide a Makefile which you can use to compile your program.
- Commands for compiling
$ make all
– To compile wordcount.c for both single- and multi-threaded versions. $ make single_mapreduce
– To compile wordcount.c for the single-threaded version.
$ make multi_mapreduce– To compile wordcount.c for the multi-threaded version.
- The compiling outputs are as follows:
single_mapreduce, multi_mapreduce
- If you run the programs, the result and execution time will be shown.
2.4 Output
The following arguments will be passed as input:
$ ./single_mapreduce {folder name} {# of partitions}
- folder name: the folder we will use to test
- the number of partitions: the number of partitions we will use.
The user applications, single_mapreduce and multi_mapreduce, will print the following:
- Key: words separated by blank characters.
- Count: the number of words corresponding to the given key.
- Execution time: the time for execution.
Output Example (single-threaded version)
- 1) At the start of the project.
- 2) After you have made your own files.
- 3) Compile all versions.
- 4) Files after compilation.
SWE3004-41: Operating Systems (Spring 2018)
OS Term Project 2
5) Print (key, count) pairs and the execution time.
2.5 Tips
- The difference between this project and the original project described in 2.1 Concept:
– When executing the executable files, we will pass only a folder name including all thetarget files instead of passing all the file names.
– You need to give the number of partitions as shown in 2.4 Output. - You do NOT need to implement the MR_DefaultHashPartition() function.
– It is already included in wordcount.c.
– The test program must pass the function pointer of hash function as an argument inMR_Run().
- The number of threads must not exceed the value specified as input in the multi-threaded version.
- Do NOT modify any files that we have provided including Map() and Reduce().
- In output, keys should be sorted in ascending key order (ASCII code value).
- You can refer to the following chapters of your textbook.
– Chapter 26. Concurrency: An Introduction
– Chapter 27. Interlude: Thread API
– Chapter 28. Locks
– Chapter 29. Lock-based Concurrent Data Structures – Chapter 30. Condition Variables - You can also refer to the following pthread API instructions.
– https://www.joinc.co.kr/w/Site/Thread/Beginning/PthreadApiReference – http://www.cs.wm.edu/wmpthreads.html
SWE3004-41: Operating Systems (Spring 2018)
$ ./single_mapreduce folder/file1 folder/file2 folder/file3 // original
$ ./single_mapreduce folder 1000 // project #2
OS Term Project 2
Each student will be ranked according to the processing performance of the test file, and your ranking will affect the final evaluation score.
3. Evaluation Policy
- Implementation (70 points)
– Single-threaded version of MapReduce – Multi-threaded version MapReduce
– Ranking based on performance - Report (30 points)
– Investigation on MapReduce
– Detailed explanation of your implementations4. Submission Guidelines
(20 points) (30 points) (20 points)
(10 points) (20 points)
SWE3004-41: Operating Systems (Spring 2018)
Task |
Single- threaded |
Multi- threaded |
Performance |
Report |
Total |
|
Investigation |
Implementation |
|||||
Points |
20 |
30 |
20 |
10 |
20 |
100 |
- Submit a compressed zip file named [your_student_ID]_proj2.zip.
The zip file should contain three files: [your_student_ID].pdf, single_mapreduce.c, and multi_mapreduce.c. - If the format of your work is wrong, -10 points will be applied.
- Please describe the key parts of your code explicitly in your report.
Do NOT attach the whole source code in your report, just the key parts.
- Submit by uploading your zip file on i-Campus.
If you have trouble uploading it on i-Campus, send it to tk1star2@nate.com. (The title should be [SWE3004-41] Term project 2, [your_student_id], [your_name])
5. Notice
- Your term project must be your original work.
- If you have any questions, describe your problem along with a screenshot and send it to
tk1star2@nate.com, or visit 85465 (Research & Business Center).
(Please include [SWE3004-41] in the title.)
- In the case of cheating, you will FAIL this course.
(If two similar source codes or reports are found, both students will fail this course.)
- There are 15 points delay penalty per day in case of late submission.
Have fun!
OS Term Project 2