
DATA LEVEL PARALLELISM
DLP Project (DUE: Friday, 6/7/2019, 11:59PM)
05/24/19
Learn the data-parallel programming paradigm and become familiar with CUDA by implementing an algorithm on the GPU.

BACKGROUND
In this project, you will do template matching on a GPU. Template matching is used to find the location of a smaller image within a larger image. A brute-force approach compares the template with every possible image region of the same size, each centered on a different pixel. This approach requires a way to measure the similarity between two images.
The similarity measure used in this project is the correlation coefficient between the template $T$ and an equally sized image region $I$, defined as:

$$r = \frac{\sum_{x,y} \left(T(x,y) - \bar{T}\right)\left(I(x,y) - \bar{I}\right)}{\sqrt{\sum_{x,y} \left(T(x,y) - \bar{T}\right)^2 \,\sum_{x,y} \left(I(x,y) - \bar{I}\right)^2}}$$

where $\bar{T}$ and $\bar{I}$ are the mean pixel values of the template and the region, respectively. In the case of template matching, the mean must therefore be computed once for the template, and once for every candidate image region. Computed values will be between -1 and 1. For this project, we are only concerned with positive correlation, so we want to find the score closest to 1 (the highest score).
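To make the formula concrete, here is a minimal sketch of the per-pixel score computation. The function and parameter names (CorrelationAt, templMean, and so on) are illustrative, not names from the provided source files; marking the function __host__ __device__ lets the same code serve as a CPU reference and as a kernel helper.

#include <math.h>

// Illustrative sketch: correlation score for one candidate region
// centered at (cx, cy). Pixels outside the image contribute 0, as the
// handout specifies. All names here are hypothetical.
__host__ __device__ float CorrelationAt(const float *image, int width,
                                        int height, const float *templ,
                                        int tw, int th, float templMean,
                                        int cx, int cy)
{
    // First pass: mean of the candidate region.
    float imgSum = 0.0f;
    for (int ty = 0; ty < th; ty++) {
        for (int tx = 0; tx < tw; tx++) {
            int x = cx + tx - tw / 2;   // odd template sizes center cleanly
            int y = cy + ty - th / 2;
            if (x >= 0 && x < width && y >= 0 && y < height)
                imgSum += image[y * width + x];  // outside pixels add 0
        }
    }
    float imgMean = imgSum / (tw * th);

    // Second pass: numerator and both variance terms of the formula.
    float num = 0.0f, varImg = 0.0f, varTempl = 0.0f;
    for (int ty = 0; ty < th; ty++) {
        for (int tx = 0; tx < tw; tx++) {
            int x = cx + tx - tw / 2;
            int y = cy + ty - th / 2;
            float p = (x >= 0 && x < width && y >= 0 && y < height)
                          ? image[y * width + x] : 0.0f;
            float di = p - imgMean;
            float dt = templ[ty * tw + tx] - templMean;
            num += di * dt;
            varImg += di * di;
            varTempl += dt * dt;
        }
    }
    // A real implementation should guard against a zero denominator.
    return num / sqrtf(varImg * varTempl);
}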
MATERIALS
The project files are on Canvas as well as in /afs/ece/users/jowens/eec171/DLP. You are provided with the only image we will be working with, input.bmp, which is 1024 x 768 pixels. There are also two template images: template1.bmp is 25 x 49 and template2.bmp is 179 x 91. The template dimensions are odd to keep the code simple. To start off, you are also provided with two source files, bmp.cu and template_matching.cu. The first contains utility code to read and write image files; you should not need to modify it. To compile the program into the executable template_matching, use:
nvcc template_matching.cu bmp.cu -o template_matching
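Once compiled, the program takes the input image and then the template as its two command-line arguments (described below), so a typical run on the provided files would be:

./template_matching input.bmp template1.bmp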
When run, this program reads an image as the first argument and a template as the second argument and tries to find the region in the image that best matches the template. The program uses the metric described previously to compute similarity scores between all possible image regions and the template. The program outputs two images. The first image, corr_image.bmp, is the same size as the input image and contains the correlation score for every pixel. The second image is simply the input image with an added highlight marking the region that produced the highest score. This location is found by iterating through the correlation image and choosing the location with the highest value.

For this project, your task is to modify template_matching.cu and create a GPU implementation of the basic template matching algorithm. In the current code, the main function reads and writes images, sets up the data, and calls the CPU function for template matching. It allocates memory for the input data using cudaMallocManaged, which allows both the CPU and GPU to use the same memory. Therefore, you do not need to add any additional malloc calls to the code, though you are allowed to allocate more memory if you find it necessary. For this project, all images are converted to greyscale, with a floating-point number used to represent each pixel.
TASKS
Create a CUDA kernel and fill in the code for MatchTemplateGPU(), which together should form a basic GPU implementation of a brute-force template matcher. Use the CPU implementation as a reference. Your kernel function (e.g. MatchTemplateKernel()) will contain the algorithm. It will mostly be straightforward, as the only special cases are the pixels at the edges of the image. As shown by the CPU code, pixels that are "outside" will be considered zeroes. MatchTemplateGPU() will likely be short, used mainly for setting up the kernel arguments and launching the kernel. You can choose the configuration of blocks and threads, but you should justify your decisions. You are also free to create other __device__ functions to help organize your code. At the end of MatchTemplateGPU(), or right after it is called, you should place a cudaDeviceSynchronize() to avoid memory errors.
(Figure: processing pixels on an edge. An outside pixel contributes a value of 0.)
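As a starting point, here is a minimal sketch of what the kernel and launcher could look like. The 16x16 block shape is an assumption you would need to justify, the parameter list is illustrative and should match the signature already in template_matching.cu, and CorrelationAt is the hypothetical __host__ __device__ helper sketched earlier.

// Illustrative sketch: one thread computes the score for one pixel.
__global__ void MatchTemplateKernel(const float *image, int width, int height,
                                    const float *templ, int tw, int th,
                                    float templMean, float *corr)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard threads past the edge
    corr[y * width + x] =
        CorrelationAt(image, width, height, templ, tw, th, templMean, x, y);
}

void MatchTemplateGPU(const float *image, int width, int height,
                      const float *templ, int tw, int th,
                      float templMean, float *corr)
{
    dim3 block(16, 16);  // 256 threads per block: an assumed configuration
    dim3 grid((width + block.x - 1) / block.x,    // round up so the grid
              (height + block.y - 1) / block.y);  // covers the whole image
    MatchTemplateKernel<<<grid, block>>>(image, width, height, templ,
                                         tw, th, templMean, corr);
    cudaDeviceSynchronize();  // required here per the handout
}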
Once you have completed your implementation, run your code on a snake machine using the provided input image. Test your program on both templates, and compare the GPU runtime with the CPU runtime. What speedup did you get? To time your GPU code, use nvprof. For this project, we only care about the kernel runtime, not the runtimes of memory transfers between host and device or any other overhead.
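nvprof simply wraps the normal command line, and the kernel's time appears in its summary table. For example:

nvprof ./template_matching input.bmp template1.bmp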
The next step is to add a second GPU implementation that uses shared memory. As before, you will implement a function and the kernel it calls (MatchTemplateGPUShared() and MatchTemplateSharedKernel()). You will likely need to divide the image into tiles that can fit into shared memory. This makes the code more complicated than before, because you must consider pixels at the edge of a tile. There are different ways to handle this issue, but if a needed pixel is not in shared memory, you will have to read it from global memory. Again, at the end of the function call, you should place a cudaDeviceSynchronize(). Once you have completed your shared-memory GPU implementation, compare its runtime on both templates with the basic GPU implementation. Is there a speedup? If there is little to no speedup, what could be the reason?
(Figure: each tile in shared memory, with the test region extending across tiles.)
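Below is a sketch of one possible tiling scheme. The TILE size is an assumption, the names mirror the hypothetical basic kernel above, and the fetch helper illustrates the shared-first, global-fallback rule described in the text.

#define TILE 16  // assumed tile edge; each block stages one TILE x TILE tile

// Illustrative helper: read from shared memory when the pixel lies inside
// this block's tile, fall back to global memory otherwise, and treat
// pixels outside the image as 0.
__device__ float FetchPixel(const float tile[TILE][TILE], const float *image,
                            int width, int height, int x0, int y0,
                            int ix, int iy)
{
    if (ix >= x0 && ix < x0 + TILE && iy >= y0 && iy < y0 + TILE)
        return tile[iy - y0][ix - x0];                    // shared-memory hit
    if (ix >= 0 && ix < width && iy >= 0 && iy < height)
        return image[iy * width + ix];                    // global fallback
    return 0.0f;                                          // outside the image
}

__global__ void MatchTemplateSharedKernel(const float *image, int width,
                                          int height, const float *templ,
                                          int tw, int th, float templMean,
                                          float *corr)
{
    __shared__ float tile[TILE][TILE];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int x0 = blockIdx.x * TILE;  // tile origin in image coordinates
    int y0 = blockIdx.y * TILE;

    // Every thread stages one pixel, then the block synchronizes so no
    // thread reads the tile before it is fully loaded.
    tile[threadIdx.y][threadIdx.x] =
        (x < width && y < height) ? image[y * width + x] : 0.0f;
    __syncthreads();

    if (x >= width || y >= height) return;

    // Same two-pass score as the basic kernel, using FetchPixel throughout.
    float imgSum = 0.0f;
    for (int ty = 0; ty < th; ty++)
        for (int tx = 0; tx < tw; tx++)
            imgSum += FetchPixel(tile, image, width, height, x0, y0,
                                 x + tx - tw / 2, y + ty - th / 2);
    float imgMean = imgSum / (tw * th);

    float num = 0.0f, varImg = 0.0f, varTempl = 0.0f;
    for (int ty = 0; ty < th; ty++) {
        for (int tx = 0; tx < tw; tx++) {
            float di = FetchPixel(tile, image, width, height, x0, y0,
                                  x + tx - tw / 2, y + ty - th / 2) - imgMean;
            float dt = templ[ty * tw + tx] - templMean;
            num += di * dt;
            varImg += di * di;
            varTempl += dt * dt;
        }
    }
    corr[y * width + x] = num / sqrtf(varImg * varTempl);
}

The matching MatchTemplateGPUShared() launcher would look like MatchTemplateGPU() above, with dim3 block(TILE, TILE) and the same cudaDeviceSynchronize() at the end.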
The correlation image produced by each of your GPU implementations should match that of the CPU implementation. There might be minor differences due to floating-point computation, but this is unlikely. GPU programs are highly parallel and can be tricky to debug. Fortunately, printf can be used from within a GPU kernel. A good strategy is to use conditional statements to limit the number of threads that print. The code you are tasked with writing and handing in must be your own.
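For example, guarding the printf so that only one thread fires keeps the output readable (the coordinates here are arbitrary):

// Inside a kernel: only the thread working on pixel (100, 100) prints.
if (x == 100 && y == 100)
    printf("score at (%d, %d) = %f\n", x, y, corr[y * width + x]);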
OTHER RESOURCES
https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
REQUIREMENTS
The correlation images computed by your GPU implementations should match the ones computed by the CPU implementation. Your program should use your shared-memory GPU implementation by default: it can continue running the CPU code, but the correlation image it writes out should be the one computed by your shared-memory GPU implementation.
REPORT
Write a report of at most 2 pages that explains your approach to both implementations and answers the questions posed above.
SUBMISSION
Submit to Canvas: 1) template_matching.cu and 2) your report.
DUE DATE: Friday, June 7 at 11:59PM.