并行计算 map reduce: Term Project 2

Term Project 2:

Map Reduce

Due: 21st May (Mon) ~ 3rd June (Sun), 11:59 PM

Submit on i-Campus

1. Project Introduction

MapReduce is a new paradigm for large-scale parallel data processing introduced by Google engineers in 2004. It is generally used in the domain of databases. In this project, your task is to understand Map Reduce and build a simplified version of it. Furthermore, you will have to correctly understand and implement concurrency for this project. Construct functions that efficiently treat data and work concurrently for the given application wordcount.c. By accomplishing this project, you will be able to understand what MapReduce is and use threads and the related functions to realize concurrency.

2. Implementations 2.0 Environment

  •   The project should be performed in Ubuntu 16.04 environment.
  •   You can install Ubuntu or use a virtual machine to carry out your project.
  •   Evaluation will be carried out in Ubuntu environment with AMD Ryzen Threadripper 1950X

    16-Core processor.

  •   Three files and three folders will be given.
    • –  Makefile: Use this file to compile wordcount.c with your functions.
    • –  mapreduce.h: The header file for implementing your functions.

      (It is different from the file mapreduce.h provided in the link given in 2.1 Concept.)

    • –  wordcount.c: The C file to test your functions.
    • –  input_simple: A simple test set.
    • –  input_long_balanced: A complex test set that is balanced.
    • –  input_long_unbalanced: A complex test set that is unbalanced.

      2.1 Concept

  •   The general idea of MapReduce is described at the following link. https://github.com/remzi-arpacidusseau/ostep-projects/tree/master/concurrency- mapreduce
  •   Use the pthread APIs for this project. (See 2.5 Tips)
  •   You are expected to describe the concept of MapReduce in your report. 2.2 Things to Do

     Your task is to complete the following files which will be compiled with the test file wordcount.c.

SWE3004-41: Operating Systems (Spring 2018)

OS Term Project 2

– single_mapreduce.c: Contains your functions for the single-threaded MapReduce

– multi_mapreduce.c: Contains your functions for the multi-threaded MapReduce

  •   You need to complete the functions of MR_Emit() and MR_Run().
  •   When implementing the multi-threaded version, you should use the pthread APIs for

    concurrency.

    2.3 Compilation

  •   We provide a Makefile which you can use to compile your program.
  •   Commands for compiling

    $ make all

    – To compile wordcount.c for both single- and multi-threaded versions. $ make single_mapreduce
    – To compile wordcount.c for the single-threaded version.
    $ make multi_mapreduce

    – To compile wordcount.c for the multi-threaded version.

  •   The compiling outputs are as follows:

    single_mapreduce, multi_mapreduce

  •   If you run the programs, the result and execution time will be shown.

    2.4 Output

    The following arguments will be passed as input:

    $ ./single_mapreduce {folder name} {# of partitions}

  •   folder name: the folder we will use to test
  •   the number of partitions: the number of partitions we will use.

The user applications, single_mapreduce and multi_mapreduce, will print the following:

  •   Key: words separated by blank characters.
  •   Count: the number of words corresponding to the given key.
  •   Execution time: the time for execution.

    Output Example (single-threaded version)

    •   1) At the start of the project.
    •   2) After you have made your own files.
    •   3) Compile all versions.
    •   4) Files after compilation.

SWE3004-41: Operating Systems (Spring 2018)

OS Term Project 2

 5) Print (key, count) pairs and the execution time.

2.5 Tips

  •   The difference between this project and the original project described in 2.1 Concept:
    – When executing the executable files, we will pass only a folder name including all the

    target files instead of passing all the file names.
    – You need to give the number of partitions as shown in 2.4 Output.

  •   You do NOT need to implement the MR_DefaultHashPartition() function.
    – It is already included in wordcount.c.
    – The test program must pass the function pointer of hash function as an argument in

    MR_Run().

  •   The number of threads must not exceed the value specified as input in the multi-threaded version.
  •   Do NOT modify any files that we have provided including Map() and Reduce().
  •   In output, keys should be sorted in ascending key order (ASCII code value).
  •   You can refer to the following chapters of your textbook.

    – Chapter 26. Concurrency: An Introduction
    – Chapter 27. Interlude: Thread API
    – Chapter 28. Locks
    – Chapter 29. Lock-based Concurrent Data Structures – Chapter 30. Condition Variables

  •   You can also refer to the following pthread API instructions.
    – https://www.joinc.co.kr/w/Site/Thread/Beginning/PthreadApiReference – http://www.cs.wm.edu/wmpthreads.html

SWE3004-41: Operating Systems (Spring 2018)

$ ./single_mapreduce folder/file1 folder/file2 folder/file3 // original
$ ./single_mapreduce folder 1000 // project #2

OS Term Project 2

 Each student will be ranked according to the processing performance of the test file, and your ranking will affect the final evaluation score.

3. Evaluation Policy

  •   Implementation (70 points)
    – Single-threaded version of MapReduce – Multi-threaded version MapReduce
    – Ranking based on performance
  •   Report (30 points)
    – Investigation on MapReduce
    – Detailed explanation of your implementations

    4. Submission Guidelines

(20 points) (30 points) (20 points)

(10 points) (20 points)

SWE3004-41: Operating Systems (Spring 2018)

Task

Single- threaded

Multi- threaded

Performance

Report

Total

Investigation

Implementation

Points

20

30

20

10

20

100

  •   Submit a compressed zip file named [your_student_ID]_proj2.zip.
    The zip file should contain three files: [your_student_ID].pdf, single_mapreduce.c, and multi_mapreduce.c.
  •   If the format of your work is wrong, -10 points will be applied.
  •   Please describe the key parts of your code explicitly in your report.

    Do NOT attach the whole source code in your report, just the key parts.

  •   Submit by uploading your zip file on i-Campus.

    If you have trouble uploading it on i-Campus, send it to tk1star2@nate.com. (The title should be [SWE3004-41] Term project 2, [your_student_id], [your_name])

    5. Notice

  •   Your term project must be your original work.
  •   If you have any questions, describe your problem along with a screenshot and send it to

    tk1star2@nate.com, or visit 85465 (Research & Business Center).

    (Please include [SWE3004-41] in the title.)

  •   In the case of cheating, you will FAIL this course.

    (If two similar source codes or reports are found, both students will fail this course.)

  •   There are 15 points delay penalty per day in case of late submission.

    Have fun!

OS Term Project 2