
GPU Programming – Exercise 2: Reductions
1 Introduction
This exercise is about how to perform reductions, in particular dot products of very large vectors. Such operations are very common, but they represent a challenge for many-core systems such as GPUs. As long as sub-tasks can be performed in isolation, it is usually easy to divide up the work, and the resulting speed-ups can be most impressive. However, when information has to be aggregated, you have to rely on inter-core communication schemes that can be much slower. The main objectives of this exercise are to learn about:
• how a task can be divided up into multiple blocks and how results can be aggregated from different blocks into a single output,
• atomic operations (e.g. atomicAdd) applied either on global memory or on shared local memory (a naive sketch using a global atomicAdd follows this list),
• shuffle operations (e.g. __shfl_down_sync) for interaction between threads within a warp.
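To make the first two points concrete, a deliberately naive starting point could let every thread add its product straight into a single global result. The sketch below is only an illustration; the kernel name dotNaive and the launch details are not prescribed by the exercise.

__global__ void dotNaive(const float *A, const float *B, float *result, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < N)
        atomicAdd(result, A[i] * B[i]);              // every thread contends for the same global word
}

With N elements this means N global atomicAdd operations that serialize against each other, which is exactly the cost the rest of the exercise tries to reduce.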
The exercise can be done either on the GPU cluster or on a local machine, assuming that you have a GNU/Linux environment with the CUDA drivers properly installed. The tool nvprof, which is usually found under /usr/local/cuda/bin, is the preferred way to profile the individual kernels. The visual profilers nsight (for newer GPUs) and nvvp (for older GPUs) provide a more in-depth analysis.
The goal of the exercise is to compute the dot product of two large vectors on a GPU. In C++ you could have a function such as the one below.
float dotProduct(float *A, float *B, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        sum += A[i] * B[i];
    return sum;
}

With a loop such as

for (int j = 16; j > 0; j /= 2)
    sum += __shfl_down_sync(0xffffffff, sum, j);
you can add up sum over all threads in a warp in five steps. Afterwards you only let the first thread of each warp update localSum with an atomicAdd. Even though the other threads are then idle, this is better than letting all threads access localSum in sequence. Assuming a block size of 1024 threads, you end up with 32 atomicAdd operations on a shared variable and a single global atomicAdd, instead of the original 1024 global atomicAdd operations.
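To see how these pieces could fit together, here is a sketch of one possible kernel along those lines. The kernel name dotKernel, the grid-stride loop and the assumption that the block size is a multiple of 32 (e.g. 1024) are choices made for the sake of illustration, not requirements of the exercise.

__global__ void dotKernel(const float *A, const float *B, float *result, int N) {
    __shared__ float localSum;                       // one shared accumulator per block
    if (threadIdx.x == 0) localSum = 0.0f;
    __syncthreads();

    // Each thread forms its own partial sum over a grid-stride loop.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
        sum += A[i] * B[i];

    // Reduce within each warp in five shuffle steps (warp size 32).
    for (int j = 16; j > 0; j /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, j);

    // Only the first thread of each warp touches the shared accumulator.
    if (threadIdx.x % 32 == 0)
        atomicAdd(&localSum, sum);
    __syncthreads();

    // A single thread per block performs the one global atomicAdd.
    if (threadIdx.x == 0)
        atomicAdd(result, localSum);
}

On the host you would clear the result before launching, for example with cudaMemset(dResult, 0, sizeof(float)), followed by something like dotKernel<<<numBlocks, 1024>>>(dA, dB, dResult, N); the pointer names here are again placeholders.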
The goal of this exercise was to reduce the relative cost of reductions and synchronization when partial results are aggregated into a final result, something that is very common in practice. It is possible to improve further on the above, for example by using vectors of shared variables, but at some point you become limited by the latency of reading the original vectors from global memory, and there is not much you can do about that.
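One way to read "vectors of shared variables" is to give each warp its own slot in a shared array, so that the shared-memory atomics disappear and only one global atomicAdd per block remains. The sketch below follows that reading; the kernel name and the restriction to block sizes that are multiples of 32 are assumptions, not part of the exercise.

__global__ void dotKernelWarps(const float *A, const float *B, float *result, int N) {
    __shared__ float warpSums[32];                   // one slot per warp (block size at most 1024)

    // Per-thread partial sum over a grid-stride loop.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
        sum += A[i] * B[i];

    // First shuffle round: reduce within each warp.
    for (int j = 16; j > 0; j /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, j);

    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0)
        warpSums[warp] = sum;                        // each warp owns its own slot, so no atomics needed
    __syncthreads();

    // Second shuffle round: let the first warp combine the per-warp sums.
    if (warp == 0) {
        float s = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        for (int j = 16; j > 0; j /= 2)
            s += __shfl_down_sync(0xffffffff, s, j);
        if (lane == 0)
            atomicAdd(result, s);
    }
}

Whether this actually pays off is something the profilers mentioned above can tell you, since the kernel may already be bound by the global-memory reads of A and B.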
