Parallelisation approach
Solution 1: assign a thread to each pixel, then reduce every block (size b×b) to its sum. A thread whose coordinates are not multiples of the current stride does nothing. A thread whose coordinates are multiples of the stride on both axes computes the sum of four elements, as in Figure 1. This operation repeats until only one element per block is left, as in Step 4 of Figure 1. The operation works on global indices, so we do not have to worry about the thread block dimensions.
Figure 1
The disadvantage of Solution 1 is that 3/4 of the threads are idle in Step 1 of Figure 1, 15/16 of the threads are idle in Step 2, and so on; the sketch below makes this idle-thread pattern explicit.
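As a concrete illustration, here is a minimal sketch of one Solution 1 reduction step; the kernel name, the stride parameter, and the interleaved-RGB indexing are our assumptions, not the submitted code.
__global__ void solution1_step(unsigned int *ptrTemp, int numrow, int numcol, int stride)
{
    unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
    if (x >= (unsigned int)numcol || y >= (unsigned int)numrow) return;
    // only threads whose (x, y) is a multiple of 2*stride do any work; the rest idle
    if ((x % (2 * stride)) != 0 || (y % (2 * stride)) != 0) return;
    if (x + stride >= (unsigned int)numcol || y + stride >= (unsigned int)numrow) return;
    for (int c = 0; c < 3; c++)   // three interleaved channels
    {
        unsigned int base = (y * numcol + x) * 3 + c;
        ptrTemp[base] += ptrTemp[base + stride * 3]                  // right neighbour
                       + ptrTemp[base + stride * numcol * 3]         // lower neighbour
                       + ptrTemp[base + stride * (numcol + 1) * 3];  // diagonal neighbour
    }
}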
Here is the profiler output for Solution 1 with w = h = 2048.
Figure 2
Because of this shortcoming, we do not optimize Solution 1 further.
Solution 2: compute the sum of four elements with one thread, and repeat step by step until we get the final result.
1. Copy the data into an unsigned int buffer to avoid overflow, using the function cuda_pre.
2. Compute the sum of every 2×2 group of elements, putting each result at the position whose indices are multiples of 2, using the function cuda_2.
3. Compute the sum of every 2×2 group of elements whose x and y indices are multiples of 2, putting each result at the position whose indices are multiples of 4, using the function cuda_2.
4. Compute the sum of every 2×2 group of elements whose x and y indices are multiples of 4, putting each result at the position whose indices are multiples of 8, using the function cuda_2.
5. ……
6. Compute the average of each block and scatter it to the corresponding positions, using the function cuda_after. A host-side driver sketch follows this list.
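To make the step sequence concrete, here is a hypothetical host-side driver; the launch configuration, the cuda_after signature, and the assumption that the block size b is a power of two are ours, not the submitted code.
void run_solution2(unsigned char *dOut, unsigned int *dTemp, unsigned char *dIn, int numrow, int numcol, int b)
{
    dim3 block(16, 16);
    dim3 grid((numcol + block.x - 1) / block.x, (numrow + block.y - 1) / block.y);
    cuda_pre<<<grid, block>>>(dOut, dTemp, dIn, numrow, numcol);       // step 1
    for (int step = 2; step <= b; step *= 2)                           // steps 2-5
    {
        // one thread per surviving position, i.e. one per step*step group
        dim3 g((numcol / step + block.x - 1) / block.x,
               (numrow / step + block.y - 1) / block.y);
        cuda_2<<<g, block>>>(dOut, dTemp, dIn, numrow, numcol, step);
    }
    cuda_after<<<grid, block>>>(dOut, dTemp, dIn, numrow, numcol, b);  // step 6 (assumed signature)
}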
cuda_pre before optimization:
__global__ void cuda_pre(unsigned char *ptrOut, unsigned int *ptrTemp, unsigned char *ptrIn, int numrow, int numcol)
{
    unsigned int tidx = threadIdx.x;
    unsigned int tidy = threadIdx.y;
    unsigned int x = tidx + blockDim.x * blockIdx.x;
    unsigned int y = tidy + blockDim.y * blockIdx.y;
    if (x < numcol && y < numrow)
    {
        // transform the unsigned char to unsigned int
        ptrTemp[(y * numcol + x) * 3 + 0] = ptrIn[(y * numcol + x) * 3 + 0];
        ptrTemp[(y * numcol + x) * 3 + 1] = ptrIn[(y * numcol + x) * 3 + 1];
        ptrTemp[(y * numcol + x) * 3 + 2] = ptrIn[(y * numcol + x) * 3 + 2];
    }
}
cuda_2 (this function can be used to compute the mosaic sums, and can also be used to compute the global sum):
// compute the temporary mosaic image: store the sum of every group of four
// elements at the group's top-left position in ptrTemp
__global__ void cuda_2(unsigned char *ptrOut, unsigned int *ptrTemp, unsigned char *ptrIn, int numrow, int numcol, int step) {
    unsigned int tidx = threadIdx.x;
    unsigned int tidy = threadIdx.y;
    unsigned int x = tidx + blockDim.x * blockIdx.x;
    unsigned int y = tidy + blockDim.y * blockIdx.y;
    //if (x == 0 && y == 0)printf("step=%d\n", step);
    if ((y * step + step / 2 < numrow) && (x * step + step / 2 < numcol)) {
        // body reconstructed: add the right, lower and lower-right partial sums
        // into the top-left one (border cases are analogous; the detail is in the code)
        for (int c = 0; c < 3; c++) {
            unsigned int base = (y * step * numcol + x * step) * 3 + c;
            ptrTemp[base] += ptrTemp[base + (step / 2) * 3]
                           + ptrTemp[base + (step / 2) * numcol * 3]
                           + ptrTemp[base + ((step / 2) * numcol + step / 2) * 3];
        }
    }
}
The use of various GPU memory caches
Constant memory: constant memory is read-only and is used for data that does not change over the course of a kernel execution. When every thread reads the same value, using constant rather than global memory can reduce the required memory bandwidth. In our application every thread reads a different pixel, so it is not suitable.
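For reference, the pattern constant memory does suit is a small read-only value read uniformly by every thread; a hypothetical illustration, not from our code:
__constant__ int cMosaicSize;   // small value read uniformly by every thread
// host side, before launching kernels that read cMosaicSize:
//     int b = 16;
//     cudaMemcpyToSymbol(cMosaicSize, &b, sizeof(int));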
Texture memory: texture memory is also a kind of read-only memory on the device. When all reads in a warp are physically adjacent, using texture memory can reduce memory traffic and increase performance compared to global memory. Because of the extra binding and copying time, we do not use texture memory here.
Shared memory: data stored in shared memory is visible to all threads within the block and lasts for the duration of the block, which lets threads communicate and share data with one another. Our solution executes step by step, and each step consumes results produced by other blocks, so shared memory does not fit the main loop. We guessed that using shared memory in the last step would speed the application up, but the result was frustrating: it was about 20% slower than before. The sketch below shows the pattern we tried.
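For reference, a minimal sketch of the classic shared-memory tree reduction we experimented with, assuming 256-thread one-dimensional blocks; this illustrates the pattern, not the exact kernel we wrote.
__global__ void reduce_shared(unsigned int *in, unsigned int *out, int n)
{
    __shared__ unsigned int sdata[256];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < (unsigned int)n) ? in[i] : 0;   // one element per thread
    __syncthreads();
    // tree reduction: halve the number of active threads each iteration
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];         // one partial sum per block
}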
Optimization Strategies
Solution 2:
1. Compute the sum of every four elements inside the function cuda_pre itself, reducing memory accesses; this removes the first cuda_2 pass, as shown in Figure 5 (compare with Figure 4).
Figure 5
2. Compute the sums using the idea of reduction.
3. Reuse the block sums to compute the global average instead of recomputing it: the global sum is simply the sum of all block sums, so the global average is that total divided by w × h.
4. Use pinned (page-locked) host memory to halve the memory copy time; see the cudaHostRegister notes and sketch below.
cuda_pre after optimization:
// transform the unsigned char to unsigned int, summing each 2x2 group on the way
__global__ void cuda_pre(unsigned char *ptrOut, unsigned int *ptrTemp, unsigned char *ptrIn, int numrow, int numcol)
{
    unsigned int tidx = threadIdx.x;
    unsigned int tidy = threadIdx.y;
    unsigned int x = tidx + blockDim.x * blockIdx.x;
    unsigned int y = tidy + blockDim.y * blockIdx.y;
    if (x < numcol && y < numrow)  // prevent out-of-bounds access
    {
        // four cases when computing the sum, handling the image borders efficiently
        if ((y * 2 + 1 < numrow) && (x * 2 + 1 < numcol)) {
            // the detail is in the code
        } else if ((y * 2 < numrow) && (x * 2 + 1 < numcol)) {
            // the detail is in the code
        } else if ((y * 2 + 1 < numrow) && (x * 2 < numcol)) {
            // the detail is in the code
        } else if ((y * 2 < numrow) && (x * 2 < numcol)) {
            // the detail is in the code
        }
    }
}
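For illustration, our reading of the first (interior) case above, where the whole 2×2 input group lies inside the image; the other three cases simply drop the out-of-range terms. This is a reconstruction, not the submitted code.
__device__ void sum_2x2_full(unsigned int *ptrTemp, const unsigned char *ptrIn, unsigned int x, unsigned int y, int numcol)
{
    // widen four input pixels and write their per-channel sum to the
    // group's top-left position in ptrTemp
    for (int c = 0; c < 3; c++)
        ptrTemp[(y * 2 * numcol + x * 2) * 3 + c] =
              (unsigned int)ptrIn[(y * 2 * numcol + x * 2) * 3 + c]
            + ptrIn[(y * 2 * numcol + x * 2 + 1) * 3 + c]
            + ptrIn[((y * 2 + 1) * numcol + x * 2) * 3 + c]
            + ptrIn[((y * 2 + 1) * numcol + x * 2 + 1) * 3 + c];
}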
Small Tricks
Using the unsigned int type instead of float to improve speed.
Using #pragma unroll to unroll loops.
Using << and >> instead of multiplying or dividing by 2.
Summing groups of four elements that are close to each other, preventing large intermediate values from swallowing small ones (the problem of large numbers eating decimals). An illustrative fragment combining these tricks follows.
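A small hypothetical fragment showing the code-level tricks above together (unsigned int arithmetic, #pragma unroll, shift operations); it is illustrative, not the submitted code.
__device__ unsigned int group_sum(const unsigned int *ptrTemp, unsigned int x, unsigned int y, int numcol, int step)
{
    unsigned int half = step >> 1;   // step / 2 as a shift
    unsigned int base = (y * numcol + x) * 3;
    unsigned int sum = 0;            // integer accumulator, no float rounding
    #pragma unroll                   // fully unroll the 3-channel loop
    for (int c = 0; c < 3; c++)
        sum += ptrTemp[base + c] + ptrTemp[base + half * 3 + c];
    return sum;
}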
__host__ cudaError_t cudaHostRegister ( void* ptr, size_t size, unsigned int flags )
cudaHostRegister: Registers an existing host memory range for use by CUDA.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
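A hypothetical usage sketch, with illustrative buffer names and sizes: pin the already-allocated host image before the host-to-device copy, then unregister it when finished.
#include <stdlib.h>

void copy_with_pinned_host(unsigned char *dIn, int numrow, int numcol)
{
    size_t bytes = (size_t)numrow * numcol * 3;
    unsigned char *hostImg = (unsigned char *)malloc(bytes);
    // ... fill hostImg with the input image ...
    cudaHostRegister(hostImg, bytes, cudaHostRegisterPortable);  // pin the range
    cudaMemcpy(dIn, hostImg, bytes, cudaMemcpyHostToDevice);     // copies faster from pinned memory
    cudaHostUnregister(hostImg);                                 // unpin before freeing
    free(hostImg);
}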
Image size   C    CPU   OPENMP   CUDA   CUDA (pinned memory)
2048×2048    16    39       34     10                      5
2048×2048    32    40       36     10                      5
4096×4096    16   153      141     42                     20
4096×4096    32   157      143     41                     20
Summary
Through the design of this program, I learned CUDA parallel programming and the CUDA memory model, including the advantages, disadvantages, and appropriate application scenarios of each kind of memory. I learned to solve problems step by step and came to understand the reduction algorithm. I also picked up some fundamentals of program design, such as loop unrolling, shift operations, the effect of integer versus floating-point precision on calculations, and the problem of large floating-point numbers eating up decimals.