Computer Graphics
Looking at OpenCL Assembly Code
Mike Bailey
opencl.assembly.pptx
Copyright By PowCoder代写 加微信 powcoder
mjb – March 27, 2021
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
How to Extract the OpenCL Assembly Language
Computer Graphics
This binary can then be used in a call to clCreateProgramWithBinary( )
mjb – March 27, 2021
size_t size;
status = clGetProgramInfo( Program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &size, NULL ); PrintCLError( status, “clGetProgramInfo (1):” );
unsigned char * binary = new unsigned char [ size ];
status = clGetProgramInfo( Program, CL_PROGRAM_BINARIES, size, &binary, NULL ); PrintCLError( status, “clGetProgramInfo (2):” );
FILE * fpbin = fopen( CL_BINARY_NAME, “wb” ); if( fpbin == NULL )
fprintf( stderr, “Cannot create ‘%s’\n”, CL_BINARY_NAME ); }
fclose( fpbin ); }
delete [ ] binary;
fwrite( binary, 1, size, fpbin );
particles.cl, I 3
typedef float4 point; typedef float4 vector; typedef float4 color; typedef float4 sphere;
= (float4) ( 0., -9.8, 0., 0. ); = 0.1;
constant float4 G
constant float DT
constant sphere Sphere1 = (sphere)( -100., -800., 0., 600. );
Computer Graphics
mjb – March 27, 2021
particles.cl, II 4
kernel void Particle( {
global point * dPobj, global vector * dVel, global color * dCobj )
int gid = get_global_id( 0 );
point p = dPobj[gid]; vector v = dVel[gid];
point pp = p + v*DT + .5*DT*DT*G; vector vp = v + G*DT;
dPobj[gid] = pp; dVel[gid] = vp;
// particle #
// p’ // v’
Computer Graphics
mjb – March 27, 2021
particles.cl, III 5
Bounce( vector in, vector n ) {
n = normalize( n );
vector out = in – 2. * n * dot( in.xyz, n.xyz ); out.w = 0.;
return out;
BounceSphere( point p, vector v, sphere s ) {
n.xyz = fast_normalize( p.xyz – s.xyz ); n.w = 0.;
return Bounce( in, n );
= “reflect” function
Computer Graphics
mjb – March 27, 2021
NVIDIA OpenCL Assembly Language Sample
FMA = “Fused Multiply-Add”
ld.global.v4.f32 {%f188, %f189, %f190, %f191}, [%r1]; // load dPobj[ gid ] ld.global.v4.f32 {%f156, %f157, %f158, %f159}, [%r2]; // load dVel[ gid ]
fma.rn.f32 fma.rn.f32 fma.rn.f32
mov.f32 mov.f32
add.f32 add.f32 add.f32
add.f32 add.f32 add.f32
%f17, 0f3DCCCCCD;
%f248, %f156, %f17, %f188; %f249, %f157, %f17, %f189; %f250, %f158, %f17, %f190;
%f18, 0fBD48B43B; %f19, 0f00000000;
%f256, %f248, %f19; %f257, %f249, %f18; %f258, %f250, %f19;
%f20, 0fBF7AE148;
%f264, %f156, %f19; %f265, %f157, %f20; %f266, %f158, %f19;
// put DT (a constant) → register f17
// (p + v*DT).x → f248 // (p + v*DT).y → f249 // (p + v*DT).z → f250
//.5*G.y*DT*DT(a constant) → f18 // 0., for .x and .z (a constant) → f19
//(p+v*DT).x+0.→f256 //(p+v*DT).y+.5*G.y*DT*DT→f257 //(p+v*DT).z+0.→f258
// G.y * DT (a constant) → f20
//v.x+0.→f264 //v.y+G.y*DT→f265 //v.z+0.→f266
Computer Graphics
mjb – March 27, 2021
Fused Multiply-Add
Many scientific and engineering computations take the form:
D = A + (B*C);
A “normal” multiply-add compilation would handle this as:
tmp = B*C; D = A + tmp;
A “fused” multiply-add does it all at once, that is, when the low-order bits of B*C are ready, they are immediately added into the low-order bits of A at the same time that the higher-order bits of B*C are being multiplied.
Consider a Base 10 example: 789 + ( 123*456 )
123 x 456 738
+ 789 56,877
Can start adding the 9 the moment the 8 is produced!
Computer Graphics
Note: In the lower bits of the result, “Normal” A+(B*C) ≠ “FMA” A+(B*C)
mjb – March 27, 2021
Something like:
Sum = Sum + (B*C);
would also be suitable to be implemented as an FMA.
Things Learned from Examining OpenCL Assembly Language 8 • The points, vectors, and colors were typedef’ed as float4’s, but the compiler realized that they
were being used only as float3’s and so didn’t bother with the 4th element.
• The floatn’s were not SIMD’ed. (We actually knew this already, since NVIDIA doesn’t support SIMD operations in their GPUs. ) There is still an advantage in coding this way, even if just for readability.
•Thefunctioncallswereallin-lined. (Thismakessense–theOpenCLspecsays“norecursion”, which implies “no stack”, which would make function calls difficult.)
• Me defining G, DT, and Sphere1 as constant memory types was a mistake. It got the correct results, but the compiler didn’t take advantage of them being constants. Changing them to type const threw compiler errors because of their global scope. Changing them to const and moving them into the body of the kernel function Particle did result in good compiler optimizations.
• The sqrt(x2+y2+z2) assembly code is amazingly convoluted. I suspect it is an issue of maintaining highest precision. Use fast_sqrt( ), fast_normalize( ), and fast_length( ) when you can. Usually computer graphics doesn’t need the full precision of sqrt( ).
• The compiler did not do a good job with expressions-in-common. I had really hoped it would figure out that detecting if a point was in a sphere and determining the unitized surface normal at that point were the same operation, but it didn’t.
• There is a 4-argument Fused-Multiply-Add instruction in hardware to perform D = A + (B*C) in
one instruction in hardware. The compiler took great advantage of it.
Computer Graphics
mjb – March 27, 2021
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com