
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Computer Graphics
Looking at OpenCL Assembly Code
Mike Bailey


opencl.assembly.pptx
mjb – March 27, 2021
How to Extract the OpenCL Assembly Language
size_t size;
status = clGetProgramInfo( Program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &size, NULL );
PrintCLError( status, "clGetProgramInfo (1):" );

// one size and one pointer: this assumes the program was built for a single device
unsigned char * binary = new unsigned char [ size ];
status = clGetProgramInfo( Program, CL_PROGRAM_BINARIES, size, &binary, NULL );
PrintCLError( status, "clGetProgramInfo (2):" );

FILE * fpbin = fopen( CL_BINARY_NAME, "wb" );
if( fpbin == NULL )
{
    fprintf( stderr, "Cannot create '%s'\n", CL_BINARY_NAME );
}
else
{
    fwrite( binary, 1, size, fpbin );    // write before closing the file and freeing the buffer
    fclose( fpbin );
}
delete [ ] binary;
This binary can then be used in a call to clCreateProgramWithBinary( )
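A minimal sketch of the read-back side, assuming the binary was written to a file as shown above. `ReadBinaryFile` is a hypothetical helper, not part of OpenCL, and the names `Context` and `Device` in the comment are assumptions. The clCreateProgramWithBinary( ) call appears only as a comment because it needs a live context and device.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical helper (not part of OpenCL): read a whole file into memory.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char>
ReadBinaryFile( const char * name )
{
    std::vector<unsigned char> bytes;
    FILE * fp = fopen( name, "rb" );
    if( fp == NULL )
        return bytes;

    unsigned char chunk[4096];
    size_t n;
    while( ( n = fread( chunk, 1, sizeof(chunk), fp ) ) > 0 )
        bytes.insert( bytes.end( ), chunk, chunk + n );

    fclose( fp );
    return bytes;
}

// With an existing cl_context Context and cl_device_id Device (assumed names),
// the bytes would then be handed to OpenCL roughly like this:
//
//   std::vector<unsigned char> bin = ReadBinaryFile( CL_BINARY_NAME );
//   size_t length = bin.size( );
//   const unsigned char * ptr = bin.data( );
//   cl_int binStatus;
//   cl_program program = clCreateProgramWithBinary( Context, 1, &Device,
//                            &length, &ptr, &binStatus, &status );
//   status = clBuildProgram( program, 1, &Device, NULL, NULL, NULL );
```

The program object still has to be built with clBuildProgram( ) before kernels can be created from it.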

particles.cl, I
typedef float4 point;
typedef float4 vector;
typedef float4 color;
typedef float4 sphere;

constant float4 G       = (float4) ( 0., -9.8, 0., 0. );
constant float  DT      = 0.1;
constant sphere Sphere1 = (sphere)( -100., -800., 0., 600. );

particles.cl, II
kernel
void
Particle( global point * dPobj, global vector * dVel, global color * dCobj )
{
    int gid = get_global_id( 0 );        // particle #

    point  p = dPobj[gid];
    vector v = dVel[gid];

    point  pp = p + v*DT + .5*DT*DT*G;   // p'
    vector vp = v + G*DT;                // v'

    dPobj[gid] = pp;
    dVel[gid]  = vp;
}
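The update in Particle can be checked on the host with a scalar sketch. This is an illustration only: `Float3`, `AdvancePosition`, and `AdvanceVelocity` are hypothetical names, and the 4th component is dropped because the kernel never uses it.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };    // hypothetical host-side stand-in for the float4 typedefs

const Float3 G  = { 0.f, -9.8f, 0.f };
const float  DT = 0.1f;

// p' = p + v*DT + .5*DT*DT*G
Float3
AdvancePosition( Float3 p, Float3 v )
{
    Float3 pp;
    pp.x = p.x + v.x*DT + .5f*DT*DT*G.x;
    pp.y = p.y + v.y*DT + .5f*DT*DT*G.y;
    pp.z = p.z + v.z*DT + .5f*DT*DT*G.z;
    return pp;
}

// v' = v + G*DT
Float3
AdvanceVelocity( Float3 v )
{
    Float3 vp;
    vp.x = v.x + G.x*DT;
    vp.y = v.y + G.y*DT;
    vp.z = v.z + G.z*DT;
    return vp;
}
```

With these values, only the y components pick up a gravity term: .5*DT*DT*G.y is about -0.049 and G.y*DT is about -0.98.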

particles.cl, III
vector
Bounce( vector in, vector n )            // = "reflect" function
{
    n = normalize( n );
    vector out = in - 2.*n*dot( in.xyz, n.xyz );
    out.w = 0.;
    return out;
}

vector
BounceSphere( point p, vector v, sphere s )
{
    vector n;
    n.xyz = fast_normalize( p.xyz - s.xyz );
    n.w = 0.;
    return Bounce( v, n );
}
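The reflection formula out = in - 2*n*dot(in,n) can be tried on the host with a small sketch. `Vec3`, `Dot`, `Normalize`, and `Reflect` are hypothetical names for illustration, not the deck's code.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };      // hypothetical host-side 3-component vector

float
Dot( Vec3 a, Vec3 b )
{
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

// Assumes v is not the zero vector.
Vec3
Normalize( Vec3 v )
{
    float len = std::sqrt( Dot( v, v ) );
    Vec3 u = { v.x/len, v.y/len, v.z/len };
    return u;
}

// out = in - 2*n*dot(in,n), with n unitized first
Vec3
Reflect( Vec3 in, Vec3 n )
{
    n = Normalize( n );
    float d = Dot( in, n );
    Vec3 out = { in.x - 2.f*d*n.x, in.y - 2.f*d*n.y, in.z - 2.f*d*n.z };
    return out;
}
```

For example, a vector heading down and to the right, ( 1, -1, 0 ), reflected about an upward-pointing normal comes back as ( 1, 1, 0 ): the component along the normal flips and the rest is unchanged.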
NVIDIA OpenCL Assembly Language Sample
FMA = “Fused Multiply-Add”
ld.global.v4.f32 {%f188, %f189, %f190, %f191}, [%r1];   // load dPobj[ gid ]
ld.global.v4.f32 {%f156, %f157, %f158, %f159}, [%r2];   // load dVel[ gid ]

mov.f32     %f17, 0f3DCCCCCD;               // put DT (a constant) → register f17

fma.rn.f32  %f248, %f156, %f17, %f188;      // (p + v*DT).x → f248
fma.rn.f32  %f249, %f157, %f17, %f189;      // (p + v*DT).y → f249
fma.rn.f32  %f250, %f158, %f17, %f190;      // (p + v*DT).z → f250

mov.f32     %f18, 0fBD48B43B;               // .5*G.y*DT*DT (a constant) → f18
mov.f32     %f19, 0f00000000;               // 0., for .x and .z (a constant) → f19

add.f32     %f256, %f248, %f19;             // (p + v*DT).x + 0. → f256
add.f32     %f257, %f249, %f18;             // (p + v*DT).y + .5*G.y*DT*DT → f257
add.f32     %f258, %f250, %f19;             // (p + v*DT).z + 0. → f258

mov.f32     %f20, 0fBF7AE148;               // G.y * DT (a constant) → f20

add.f32     %f264, %f156, %f19;             // v.x + 0. → f264
add.f32     %f265, %f157, %f20;             // v.y + G.y*DT → f265
add.f32     %f266, %f158, %f19;             // v.z + 0. → f266
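The `0f…` operands in the listing are 32-bit IEEE-754 bit patterns written in hexadecimal. A minimal sketch for decoding them on the host, where `PtxBitsToFloat` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret the 32 bits written after "0f" in a PTX literal as an IEEE-754 float.
float
PtxBitsToFloat( uint32_t bits )
{
    float f;
    std::memcpy( &f, &bits, sizeof(f) );    // bit copy, no numeric conversion
    return f;
}
```

Decoding the three non-zero constants gives 0.1 for 0f3DCCCCCD (DT), about -0.049 for 0fBD48B43B (.5*G.y*DT*DT), and about -0.98 for 0fBF7AE148 (G.y*DT), which matches the comments in the listing.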

Fused Multiply-Add
Many scientific and engineering computations take the form:
D = A + (B*C);
A “normal” multiply-add compilation would handle this as:
tmp = B*C; D = A + tmp;
A “fused” multiply-add does it all at once: as soon as the low-order bits of B*C are ready, they are added into the low-order bits of A while the higher-order bits of B*C are still being computed.
Consider a Base 10 example: 789 + ( 123*456 )

        123
      x 456
      -----
        738
       615
      492
    +   789
    -------
     56,877

Can start adding the 9 the moment the 8 is produced!
Note: In the lower bits of the result, “Normal” A+(B*C) ≠ “FMA” A+(B*C)
Something like:
Sum = Sum + (B*C);
would also be suitable to be implemented as an FMA.
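The difference in the low-order bits can be seen with a small host-side sketch. The function names are hypothetical. The `volatile` temporary is there so the product is rounded to float before the add and the compiler cannot fuse the two operations on its own.

```cpp
#include <cassert>
#include <cmath>

// "Normal" multiply-add: the product is rounded to float, then the sum is rounded again.
float
UnfusedMultiplyAdd( float a, float b, float c )
{
    volatile float tmp = b * c;      // volatile forces the intermediate rounding
    return a + tmp;
}

// Fused multiply-add: A + (B*C) is computed exactly and rounded only once.
float
FusedMultiplyAdd( float a, float b, float c )
{
    return std::fma( b, c, a );
}
```

One set of inputs that exposes the difference is B = C = 1 + 2^-12 and A = -(1 + 2^-11). The exact product is 1 + 2^-11 + 2^-24, and its last term is lost when the product is rounded to float, so the unfused version returns 0 while the fused version returns 2^-24.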

Things Learned from Examining OpenCL Assembly Language
• The points, vectors, and colors were typedef’ed as float4’s, but the compiler realized that they were being used only as float3’s and so didn’t bother with the 4th element.
• The floatn’s were not SIMD’ed. (We actually knew this already, since NVIDIA doesn’t support SIMD operations in their GPUs.) There is still an advantage in coding this way, even if just for readability.
• The function calls were all in-lined. (This makes sense: the OpenCL spec says “no recursion”, which implies “no stack”, which would make function calls difficult.)
• Defining G, DT, and Sphere1 as constant memory types was a mistake on my part. It got the correct results, but the compiler didn’t take advantage of them being constants. Changing them to type const threw compiler errors because of their global scope. Changing them to const and moving them into the body of the kernel function Particle did result in good compiler optimizations.
• The sqrt(x²+y²+z²) assembly code is amazingly convoluted. I suspect it is an issue of maintaining highest precision. Use native_sqrt( ), fast_normalize( ), and fast_length( ) when you can. Usually computer graphics doesn’t need the full precision of sqrt( ).
• The compiler did not do a good job with common subexpressions. I had really hoped it would figure out that detecting if a point was in a sphere and determining the unitized surface normal at that point were the same operation, but it didn’t.
• There is a 4-argument Fused-Multiply-Add instruction that performs D = A + (B*C) in one instruction in hardware. The compiler took great advantage of it.
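Since the compiler did not merge the in-sphere test with the surface-normal computation, one option is to share the work by hand. The following is a host-side sketch under assumptions: `Point3`, `Sphere`, and `InsideSphereWithNormal` are hypothetical names, and the deck does not show how its own in-sphere test was written.

```cpp
#include <cassert>
#include <cmath>

struct Point3 { float x, y, z; };
struct Sphere { float x, y, z, r; };     // center + radius, laid out like the float4 sphere

// Hypothetical helper: compute the center-to-point offset and its length once,
// then use them for both the inside test and the unit surface normal.
bool
InsideSphereWithNormal( Point3 p, Sphere s, Point3 * normal )
{
    float dx = p.x - s.x;
    float dy = p.y - s.y;
    float dz = p.z - s.z;
    float len = std::sqrt( dx*dx + dy*dy + dz*dz );

    if( len > 0.f )
    {
        normal->x = dx / len;
        normal->y = dy / len;
        normal->z = dz / len;
    }
    else
    {
        normal->x = normal->y = normal->z = 0.f;    // point is at the center: no defined normal
    }
    return len < s.r;
}
```

The same idea carries over to OpenCL C by computing the offset and its length once and reusing both values.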
