Parallel implementation (Parallelisation approach…)
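Method 1: assign one thread to each pixel, then reduce each c*c block to a single value. The figure below shows the reduction of one c*c block. All work is done in global memory, and thread blocks are not taken into account.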

Figure 1
The drawback is that only 1/4 of the threads do any work in pass 1 of the figure, only 1/16 in pass 2, and so on.
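
The shrinking utilisation described above can be tabulated with a few lines of host code. This is only an illustrative sketch: the function name active_fraction_per_pass is hypothetical, and it assumes c is a power of two so that each pass halves the block width.

```cpp
#include <cassert>
#include <vector>

// Fraction of the c*c threads of one block that still do work in each pass
// of Method 1 (hypothetical helper, assumes c is a power of two).
std::vector<double> active_fraction_per_pass(int c) {
    assert(c >= 2 && (c & (c - 1)) == 0);
    std::vector<double> fractions;
    double fraction = 1.0;
    for (int width = c; width > 1; width /= 2) {
        fraction /= 4.0;  // each pass merges 2x2 partial results into one
        fractions.push_back(fraction);
    }
    return fractions;
}
```

For c = 4 this returns two passes with fractions 1/4 and 1/16, matching the figures quoted above.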

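Late in the implementation of Method 1, a cross-block problem surfaced: mosaics with large cells were computed incorrectly, and the method was also slow, so it was not developed further.
Method 2: work in steps, reducing 4 values at a time
1. First copy the data to a separately allocated unsigned int buffer (otherwise the sums overflow): function cuda_pre
2. Use 1 thread per 2*2 cell to sum it, storing the result at the original position divisible by 2: function cuda_2
3. Use 1 thread per 4*4 cell to sum it, storing the result at the original position divisible by 4: function cuda_2
4. …….
5. Average the final sums and write the result back out to every corresponding position: cuda_after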
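
The steps above can be sketched as a CPU reference, which is also handy for checking kernel output. This is not the author's code: mosaic_reference is a hypothetical name, the sketch handles a single channel rather than interleaved RGB, and it assumes c is a power of two. Cells that overhang the image edge are averaged over the pixels that actually exist.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Single-channel CPU reference for the stepwise scheme (hypothetical helper).
std::vector<std::uint8_t> mosaic_reference(const std::vector<std::uint8_t>& in,
                                           int rows, int cols, int c) {
    assert(rows > 0 && cols > 0 && c >= 2 && (c & (c - 1)) == 0);
    assert(in.size() == static_cast<std::size_t>(rows) * cols);

    // Step 1 (role of cuda_pre): widen to unsigned int so sums cannot overflow.
    std::vector<unsigned int> tmp(in.begin(), in.end());

    // Steps 2..4 (role of cuda_2): each pass adds four partial sums that lie
    // step/2 apart and stores the result at the position divisible by step.
    for (int step = 2; step <= c; step *= 2) {
        const int half = step / 2;
        for (int y = 0; y < rows; y += step) {
            for (int x = 0; x < cols; x += step) {
                const bool right = x + half < cols;  // right neighbour exists
                const bool down = y + half < rows;   // lower neighbour exists
                unsigned int sum = tmp[y * cols + x];
                if (right) sum += tmp[y * cols + x + half];
                if (down) sum += tmp[(y + half) * cols + x];
                if (right && down) sum += tmp[(y + half) * cols + x + half];
                tmp[y * cols + x] = sum;
            }
        }
    }

    // Step 5 (role of cuda_after): average and spread to every pixel of the cell.
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            const int by = y / c * c;
            const int bx = x / c * c;
            const int h = std::min(c, rows - by);  // cell height inside the image
            const int w = std::min(c, cols - bx);  // cell width inside the image
            out[y * cols + x] =
                static_cast<std::uint8_t>(tmp[by * cols + bx] / (h * w));
        }
    }
    return out;
}
```

Because each pass only reads positions that the previous pass wrote, the reduction can run in place in the temporary buffer, which is what lets one kernel be reused with a different step on every launch.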
cuda_pre, unoptimised:
__global__ void cuda_pre(unsigned char *ptrOut, unsigned int *ptrTemp, unsigned char *ptrIn, int numrow, int numcol)
{
unsigned int tidx = threadIdx.x;
unsigned int tidy = threadIdx.y;
unsigned int x = tidx + blockDim.x*blockIdx.x;
unsigned int y = tidy + blockDim.y*blockIdx.y;

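if (x < numcol && y < numrow) {
    // move each channel to its new storage location
    ptrTemp[(y*numcol + x) * 3 + 0] = ptrIn[(y*numcol + x) * 3 + 0];
    ptrTemp[(y*numcol + x) * 3 + 1] = ptrIn[(y*numcol + x) * 3 + 1];
    ptrTemp[(y*numcol + x) * 3 + 2] = ptrIn[(y*numcol + x) * 3 + 2];
}
}

Figure 2 shows the idea. The left side shows the active threads, where the number is the thread index (11 means that x and y are both 1 in the 2D block layout); the right side shows the data each thread has to process. Each pass launches a quarter as many threads as there are data elements, and each thread sums the values at its positions to produce a partial sum.

cuda_2 (this function serves both the mosaic step and the global sum):
// sum every four elements
__global__ void cuda_2(unsigned char *ptrOut, unsigned int *ptrTemp, unsigned char *ptrIn, int numrow, int numcol, int step)
{
unsigned int tidx = threadIdx.x;
unsigned int tidy = threadIdx.y;
unsigned int x = tidx + blockDim.x*blockIdx.x;
unsigned int y = tidy + blockDim.y*blockIdx.y;
// handle each case separately to avoid out-of-bounds access
if ((y * step + step / 2 […]

[…]
pix->red = &pix[0];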
pix->green = &pix[cols*rows];
pix->blue = &pix[cols*rows*2];

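Memory types (The use of various GPU memory caches…)
Constant memory: suited to broadcast reads and cannot be modified during the computation. Here each data element is used only once per operation, so constant memory is not a good fit.
Texture memory: binding global memory as texture memory can speed up accesses when reads have strong locality, but the data has to be copied into the allocated texture memory. This method sums the data in its first step, after which there is less data than in the original image, so texture memory would only add overhead and is not suitable.
Shared memory: the method is implemented in separate steps, which does not lend itself to shared memory. The final write-back step looked like it might benefit, but in measurements the cuda_after function with shared memory was not faster; it was about 20% slower.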

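Parallel techniques used to improve performance (Any GPU optimisations you have made to improve the performance?…..)
Method 2
1. Perform a simple sum of the data already in the cuda_pre function. This reduces memory traffic and saves one cuda_2 pass, so the next pass starts directly at 4*4 and step 2 of Method 2 above is skipped.

Figure 4
Compare with Figure 3.
2. Sum block by block using the reduction idea
3. The final global sum reuses the intermediate results from the mosaic computation, avoiding repeated work
4. Use pinned (page-locked) memory to halve the memory copy time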
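
Optimisation 1 can be illustrated on the CPU as follows. This is a sketch under assumptions, not the author's kernel: pre_sum_2x2 is a hypothetical name, it handles one channel only, and it assumes the 2*2 sum is stored at the even position of a full-size unsigned int buffer (the source does not show where the optimised cuda_pre writes). The loop indices (x, y) stand in for the CUDA thread indices, and the four branches mirror the four boundary cases in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fused "widen + first 2x2 sum" for one channel (hypothetical CPU stand-in).
std::vector<unsigned int> pre_sum_2x2(const std::vector<std::uint8_t>& in,
                                      int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert(in.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<unsigned int> tmp(in.size(), 0u);
    for (int y = 0; y * 2 < rows; ++y) {
        for (int x = 0; x * 2 < cols; ++x) {
            const int r = y * 2;  // top row of this 2x2 cell
            const int c = x * 2;  // left column of this 2x2 cell
            unsigned int sum = 0;
            if (r + 1 < rows && c + 1 < cols) {
                // full 2x2 cell
                sum = in[r * cols + c] + in[r * cols + c + 1] +
                      in[(r + 1) * cols + c] + in[(r + 1) * cols + c + 1];
            } else if (c + 1 < cols) {
                // last row of an odd-height image: 1x2 cell
                sum = in[r * cols + c] + in[r * cols + c + 1];
            } else if (r + 1 < rows) {
                // last column of an odd-width image: 2x1 cell
                sum = in[r * cols + c] + in[(r + 1) * cols + c];
            } else {
                // bottom-right corner: single pixel
                sum = in[r * cols + c];
            }
            tmp[r * cols + c] = sum;
        }
    }
    return tmp;
}
```

Fusing the sum into the copy means each input byte is read once and only a quarter as many values are written, which is the reduction in memory traffic the list above refers to.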

Optimisation 1 from this section.
cuda_pre, optimised:
// first load the data into integer memory
__global__ void cuda_pre(unsigned char *ptrOut, unsigned int *ptrTemp, unsigned char *ptrIn, int numrow, int numcol)
{
unsigned int tidx = threadIdx.x;
unsigned int tidy = threadIdx.y;
unsigned int x = tidx + blockDim.x*blockIdx.x;
unsigned int y = tidy + blockDim.y*blockIdx.y;

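if (x < numcol && y < numrow) // avoid out-of-bounds access
{
    if ((y * 2 + 1 < numrow) && (x * 2 + 1 < numcol)) { // four cases for the sum, to avoid out-of-bounds access and improve efficiency
        // see the code for details
    } else if ((y * 2 < numrow) && (x * 2 + 1 < numcol)) {
        // see the code for details
    } else if ((y * 2 + 1 < numrow) && (x * 2 < numcol)) {
        // see the code for details
    } else if ((y * 2 < numrow) && (x * 2 < numcol)) {
        // see the code for details
    }
}
}

Other interesting aspects (Any other interesting aspects of the implementation or optimisation.......)
Use integer data to speed up the program (relative to floating point)
Use #pragma unroll for loop unrolling
Use << and >> instead of ×2 and ÷2
Sum in pairs, so that large values do not swallow small ones
Optimisations at this level are hard to see once the memory copy time is included, so timings are listed separately below.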
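
The "large values swallow small ones" effect is a floating-point rounding issue, and a small sketch makes it concrete. The two functions below are hypothetical illustrations, not part of the original program, and the behaviour described assumes IEEE-754 single precision with round-to-nearest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right accumulation: once the running total is large, small terms
// can be rounded away entirely.
float naive_sum(const std::vector<float>& v) {
    float total = 0.0f;
    for (float x : v) total += x;
    return total;
}

// Pairwise (tree) summation: values of similar magnitude are added first,
// the same shape as the 2x2 reduction used in the kernels.
float pairwise_sum(const float* v, std::size_t n) {
    assert(n > 0);
    if (n == 1) return v[0];
    const std::size_t mid = n / 2;
    return pairwise_sum(v, mid) + pairwise_sum(v + mid, n - mid);
}
```

With the input {16777216.0f, 1.0f, 1.0f}, naive_sum returns 16777216 because each 1 is lost against 2^24, while pairwise_sum adds the two small values first and returns 16777218. With integer accumulators, as used in this program, the issue does not arise as long as the sums fit in the integer type.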

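Pinned (page-locked) memory substantially raises the host-device copy bandwidth and speeds up the run. The effect is pronounced, so a before/after comparison for this optimisation is provided.
// cudaHostRegister registers host memory as pinned memory. pix is the host address to register, nuimgcol*nuimgrow * sizeof(Pix) is the size of the region to register, and cudaHostRegisterPortable is the registration flag.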
cudaHostRegister(pix, nuimgcol*nuimgrow * sizeof(Pix), cudaHostRegisterPortable);

Test results:
Size         C     Serial    CPU parallel    CUDA
2048×2048    16    39        34              10
2048×2048    32    40        36              10
4096×4096    16    153       141             42
4096×4096    32    157       143             41

Test results with pinned memory (pinned memory also seems to make the CPU computation faster):
Size         C     Serial    CPU parallel    CUDA
2048×2048    16    -         -               -
2048×2048    32    59.6      32.9            5
4096×4096    16    -         -               -
4096×4096    32    -         -               -

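Summary:
Designing this program was a way to learn CUDA parallel programming and the CUDA memory model, including the trade-offs between the memory types and the scenarios each one suits. It also gave practice in solving a problem step by step and in the idea of reduction, along with some programming basics: loop unrolling, shift operations, the effect of integer versus floating-point precision on a computation, and the floating-point problem of large values swallowing small ones.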

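Note:
While optimising this program, the main bottleneck was the memory copy bandwidth between host and device: copying data back and forth took about 80% of the run time. As a result, the effect of some optimisations is hidden in the total time, and reliable conclusions can only be drawn from profiler analysis. The report includes some of the profiler results.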
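
As a rough check of the copy-bound claim above, here is a back-of-the-envelope estimate. It assumes the copy share is exactly 80% and that pinned memory halves only the copy time; both figures come from this report, while the symbols T (total run time), T' (run time with pinned memory) and f (copy fraction) are introduced here for illustration.

```latex
T' = (1 - f)\,T + \tfrac{1}{2}\,f\,T
   = (1 - 0.8)\,T + 0.4\,T
   = 0.6\,T,
\qquad
\text{speedup} = \frac{T}{T'} \approx 1.67
```

The one pinned-memory measurement reported (2048×2048, C = 32, CUDA time 10 → 5) is a 2× gain, the same order as this estimate.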

Problem with your program:
The total is computed by summing individual pixels, which is not a reasonable approach.
