GPU Programming – Exercise 3: Memory optimization
1 Introduction
This exercise is about how to perform image convolutions with relatively large filters applied to multi-megapixel images. From an implementational point of view, this is a challenging problem since such an operation puts a lot of pressure on the memory system, which is true for a lot of problems for which GPUs are used today. Ever since the early days of computing, we have primarily thought of the computational cost of operations, easily forgetting that to do those operations data has to be accessed from memory and that, in many cases, memory is the actual bottleneck. The main objectives in this exercise are to learn about:
• how different kinds of memories are hierarchically organized in modern GPUs,
• how to think about GPU implementations in terms of their memory needs,
• how to find the right balance between memory resources to make an application as efficient as possible.
The exercise can be done either on the GPU cluster or on a local machine, assuming that you have a GNU/Linux environment with the CUDA drivers properly installed. The tool nvprof, usually found under /usr/local/cuda/bin, is the preferred way to profile the individual kernels. The visual profilers nsight (for newer GPUs) and nvvp (for older GPUs) provide a more in-depth analysis.
1.1 Different kinds of memories
Just like a CPU, a GPU has a hierarchy of memories, with slower but larger memories shared between all SMs and smaller but faster memories located next to each SM. Even if the total bandwidth on a GPU is far greater than on a CPU, the available computational resources are greater still. While a load operation can have a latency of hundreds of clock cycles, each SM can have multiple load operations in flight, all running in parallel. Thus, if large amounts of data are needed from memory, you typically benefit from having multiple warps (32 threads each) ready to go, so that loads from different warps can be interleaved, minimizing the total latency of loading from memory. Computations continue until the loaded data is actually needed, and a warp is stalled only if there is nothing else for it to do. If some computations can be done while waiting for the data, these computations essentially come for free and have no impact on performance, while memory access becomes the limiting factor. To optimize a GPU implementation for maximum speed, it is thus important to understand the typical memory hierarchy of modern GPUs.

Here follows a list of the different kinds of memories commonly available on a GPU:
• Global memory – This is the main memory of the GPU. If it is large enough for the whole application, unnecessary and very costly transfers back and forth between the CPU and GPU can be avoided. Preferably, all data should be kept on the GPU and never copied back to the CPU unless it is absolutely necessary. Since allocation of global memory takes time, you often benefit from first making an estimate of the total memory consumption, allocating once and then letting your code divide up the allocated memory (see the allocation sketch after this list). It is also good to reuse the same memory for different purposes once it is no longer needed for its original one. (12 GB/GPU)
• Constant memory – In many cases you have some limited amount of data that never changes, such as the coefficients of some filter. Such data can be declared as constant, which means that while it appears as global memory, it is kept in its own high-speed read-only memory structure available to all SMs (see the row-filter sketch after this list). Even if such data would be handled effectively by the caches without such a declaration, unnecessary pollution of the caches should be avoided when they could be used for better purposes. (64 KB/GPU)
• L2 cache – This is similar to the L2 cache of a CPU, a faster memory shared between all SMs. Frequently accessed global memory will be kept in this cache, greatly improving the overall speed. To ensure the best performance, the total footprint of your kernels should be kept within this cache. Temporary global memory buffers should preferably be reused for different purposes. It might even be beneficial to split up kernels into many small pieces, each consuming less than the available L2 memory. (6 MB/GPU)
• Shared memory – This memory is shared by all threads within a thread block and is typically used for buffering, in particular of global memory that will be used by multiple threads in the block (the row-filter sketch after this list buffers data in this way). It can also be used in between operations within the same kernel. In many cases you benefit from combining many kernels into a larger kernel that performs a sequence of operations. As long as more operations can be done on the same data, you are better off keeping that data in shared memory as part of a larger kernel, while limiting the number of global load operations that you need. (0-100 KB/SM)
• L1/Texture cache – This small memory, which is placed next to each streaming multiprocessor (SM), is used for even faster caching of frequently used global data and for data defined as texture memory. The difference lies in the way spatial locality is exploited for effective caching. Unlike typical caches that assume data structured as 1D sequences, texture memory can exploit locality also in 2D and 3D. Texture memories are good for random-access load operations on regions of planes or volumes, which is common in computer graphics but rare in most other cases. To fully utilize the L1 cache, it is preferable to access memory as sequentially as possible. (28-128 KB/SM)
• Registers – The register file is divided so that each thread has an equal number of registers for temporary storage during computations. This is the fastest available memory and has a great effect on the performance. The number of registers needed depends on the kernel, but the nvcc compiler often does a very good job of limiting the number of registers to a suitable amount by reusing registers and reordering operations. In some highly optimized cases, shared memories can be replaced by vectors of registers, with registers exchanged between threads using shuffle operations. (64K registers/SM)
The numbers in parentheses are the amounts of memory of each kind available on the GeForce RTX 3080 Ti. Note that on more recent GPUs the shared memory and the L1/Texture cache share the same physical memory, which can be divided up in different ways, while on older GPUs these were kept separate. This has simplified optimization, since you need to think less about how to fully utilize both the shared memory and the cache.
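The allocate-once advice for global memory can be illustrated with a small host-side sketch. This is not part of the exercise code; the buffer names and the three-buffer split are assumptions made for the example:

size_t n = (size_t)WIDTH*HEIGHT;                 // pixels per image buffer
float *pool, *d_source, *d_temp, *d_result;
cudaMalloc((void **)&pool, 3*n*sizeof(float));   // one allocation for all buffers
d_source = pool;                                 // divide the pool up in the code
d_temp   = pool + n;
d_result = pool + 2*n;

Once these buffers are no longer needed, the same pool can be reinterpreted and reused for other purposes without further allocations.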
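To make the constant and shared memory descriptions above more concrete, the row-filter sketch below shows how filter coefficients could be placed in constant memory and how a segment of an image row could be buffered in shared memory before being used by all threads in a block. It is only an illustration, not part of the exercise code: the names conv_row, FILTER_RADIUS and BLOCK_SIZE, the clamped border handling and the one-row-segment-per-block layout are assumptions made for the example.

#define FILTER_RADIUS 8                 // assumed filter radius
#define BLOCK_SIZE 256                  // assumed number of threads per block

__constant__ float c_kernel[2*FILTER_RADIUS + 1];   // read-only coefficients, visible to all SMs

// One block filters BLOCK_SIZE consecutive pixels of a single image row.
// Launched as conv_row<<<(width + BLOCK_SIZE - 1)/BLOCK_SIZE, BLOCK_SIZE>>>(row, out, width).
__global__ void conv_row(const float *row, float *out, int width)
{
    // The row segment plus FILTER_RADIUS extra pixels ("apron") on each side,
    // loaded from global memory once and then reused by all threads in the block.
    __shared__ float tile[BLOCK_SIZE + 2*FILTER_RADIUS];

    int x = blockIdx.x*BLOCK_SIZE + threadIdx.x;   // global pixel index
    int t = threadIdx.x + FILTER_RADIUS;           // position within the tile

    tile[t] = row[min(max(x, 0), width - 1)];      // own pixel, clamped at the border
    if (threadIdx.x < FILTER_RADIUS) {             // first FILTER_RADIUS threads load the aprons
        tile[threadIdx.x]    = row[max(x - FILTER_RADIUS, 0)];
        tile[t + BLOCK_SIZE] = row[min(x + BLOCK_SIZE, width - 1)];
    }
    __syncthreads();                               // tile must be complete before anyone reads it

    if (x < width) {
        float sum = 0.0f;
        for (int d = -FILTER_RADIUS; d <= FILTER_RADIUS; d++)
            sum += c_kernel[d + FILTER_RADIUS]*tile[t + d];
        out[x] = sum;
    }
}

The coefficients would be uploaded once from the host with cudaMemcpyToSymbol(c_kernel, h_kernel, (2*FILTER_RADIUS + 1)*sizeof(float)), after which every launch can read them without touching global memory.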
1.2 Memory optimization
Thinking about how memory is accessed and where that memory is kept can have a great effect on performance. In some cases, a highly optimized piece of code can be 100 times faster than a naive translation of existing CPU code. GPU architectures have evolved, though, and modern GPUs have memory structures more similar to those of CPUs, with caches that are much more effective than before. Traditionally, shared memory has been recommended for buffering frequently used global memory, but with the improved caches the relative benefit of such buffering is smaller, in particular if the same shared memory could instead be used internally between different steps of the kernel.
Here are some recommendations, some of which have already been mentioned:
• Balance memory resources – The different kinds of memories described above should be viewed as resources, resources that should be exploited. Do not just look at the limiting resources, but also at those that are underused. Find the right balance between resources, so that each resource is used to its full potential. If the L2 cache is underused, one might for example consider skipping some shared memory buffering, letting the L2 cache deal with the problem, and then use the shared memory for something else. Conversely, if the pressure on the register file is high, one might even consider using shared memory for temporary storage, even if each address is only used by a single thread.
• Try different numbers of threads – Normally, it is recommended to keep the number of threads as high as possible, in order to make sure that many global load operations can be interleaved by having enough warps available. However, this might reduce the number of registers available per thread, which in turn will force the compiler to use cached global memory for temporary storage. You might also have different numbers of warps active in different parts of the kernel. When you load from global memory for buffering in shared memory, you might benefit from having 32-64 warps, but later in the kernel you may let all but a handful of warps idle, while the remaining warps do the number crunching, provided they are able to keep all computational units busy.

• Consider merging of kernels – A general rule is to keep the kernels as large as possible. If you have a sequence of operations, consider merging all those operations into a single kernel, while trying to reuse resources, such as shared memories, as much as possible. When you leave the kernel, what is stored in registers and shared memory is lost. Consider threads as workers that can be recruited for different purposes in different parts of the kernel. How you divide up the work in different parts might vary depending on what operations you do and how many you have.
• Avoid shared memory bank conflicts – Shared memory is organized in 32 banks, and if several threads in the same warp access different addresses in the same bank, the memory operations will be serialized. The only exception is when the threads read the very same address, in which case the value is broadcast. In many cases you benefit from storing information not as arrays-of-structures but as structures-of-arrays, such as the coordinates of a set of 3D points as three separate vectors. A related recommendation is to ensure that the threads in a warp touch as few cache lines as possible during a global memory operation, which is easiest if the threads access chunks of memory in single contiguous pieces. The transpose sketch after this list shows a common way to avoid such bank conflicts.
• Keep the memory footprint low – This is a recommendation that has almost always been true, also for CPUs. As long as you have a cache hierarchy, it is a good idea to help that hierarchy as much as possible by reusing the same address space for different purposes. One way of doing that is to divide an application into chunks, where each chunk has a footprint that is not much larger than the size of the L2 cache. This is usually beneficial, even if it leads to an increase in the total number of computations because exactly the same computation may be repeated in different chunks.
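As an illustration of the bank-conflict recommendation above, the sketch below uses the classic padding trick in a 32×32 tile transpose. It is not part of the exercise code; the kernel name, the tile size and the assumption of 32×32 thread blocks are choices made for the example.

#define TILE 32   // tile side, matching the warp size and an assumed 32x32 thread block

__global__ void transpose(const float *in, float *out, int width, int height)
{
    // The +1 padding changes the row stride from 32 to 33 words, so the
    // column-wise reads below map the 32 threads of a warp to 32 different
    // banks instead of all hitting the same one (a 32-way conflict).
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x*TILE + threadIdx.x;
    int y = blockIdx.y*TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y*width + x];     // coalesced row-wise load
    __syncthreads();

    int xt = blockIdx.y*TILE + threadIdx.x;                   // coordinates in the transposed image
    int yt = blockIdx.x*TILE + threadIdx.y;
    if (xt < height && yt < width)
        out[yt*height + xt] = tile[threadIdx.x][threadIdx.y]; // column-wise read from the tile
}

Without the padding, the 32 threads of a warp would read words whose addresses differ by a multiple of 32, i.e. the same bank, and the read would be serialized into 32 transactions.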
The goal of this exercise is to implement a 2D convolution function that is fast enough to be used for filtering really large images. The code listed below shows a direct implementation of a CUDA kernel for such an operation, which assumes that the filter kernel is separable and symmetric, i.e. the same 1D filter is applied first x-wise and then y-wise. Such filters are very common in image processing. To allow the compiler to optimize as much as possible, the filter length (2*RADIUS + 1) and the image size (WIDTH×HEIGHT) are here defined as constants.
__global__ void conv(float *source, float *kernel, float *result)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    float sumy = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; dy++) {
        int ys = y + dy;
        if (ys >= 0 && ys < HEIGHT) {
            // Apply the 1D filter x-wise along row ys.
            float sumx = 0.0f;
            for (int dx = -RADIUS; dx <= RADIUS; dx++) {
                int xs = x + dx;
                if (xs >= 0 && xs < WIDTH)
                    sumx += kernel[dx + RADIUS]*source[ys*WIDTH + xs];
            }
            // Weight the row result with the same filter y-wise.
            sumy += kernel[dy + RADIUS]*sumx;
        }
    }
    result[y*WIDTH + x] = sumy;
}
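The host-side launch is not shown here. As a minimal sketch, assuming 16×16 thread blocks, that WIDTH and HEIGHT are multiples of 16, and that d_source, d_kernel and d_result are device buffers that have already been allocated with cudaMalloc and filled with cudaMemcpy, the kernel could be invoked as:

dim3 block(16, 16);                                    // 256 threads per block (assumed size)
dim3 grid(WIDTH/block.x, HEIGHT/block.y);              // one thread per output pixel
conv<<<grid, block>>>(d_source, d_kernel, d_result);   // the direct kernel above
cudaDeviceSynchronize();                               // wait for completion before timing or copying back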