Oregon State University Computer Graphics
OpenGL Compute Shaders
Mike Bailey mjb@cs.oregonstate.edu
Application Invokes the Compute Shader to Modify the OpenGL Buffer Data
compute.shader.pptx
mjb – January 1, 2019
OpenGL Compute Shader – the Basic Idea

Paraphrased from the ARB_compute_shader spec:

Recent graphics hardware has become extremely powerful. A strong desire to harness this power for work that does not fit the traditional graphics pipeline has emerged. To address this, Compute Shaders are a new single-stage program. They are launched in a manner that is essentially stateless. This allows arbitrary workloads to be sent to the graphics hardware with minimal disturbance to the GL state machine.

In most respects, a Compute Shader is identical to all other OpenGL shaders, with similar status, uniforms, and other such properties. It has access to many of the same data as all other shader types, such as textures, image textures, atomic counters, and so on. However, the Compute Shader has no predefined inputs, nor any fixed-function outputs. It cannot be part of a rendering pipeline and its visible side effects are through its actions on shader storage buffers, image textures, and atomic counters.

Why Not Just Use OpenCL Instead?

OpenCL is great! It does a super job of using the GPU for general-purpose data-parallel computing. And, OpenCL is more feature-rich than OpenGL compute shaders. So, why ever use Compute Shaders if you’ve got OpenCL? Here’s what I think:

• OpenCL requires installing a separate driver and separate libraries. While this is not a huge deal, it does take time and effort. When everyone catches up to OpenGL 4.3, Compute Shaders will just “be there” as part of core OpenGL.
• Compute Shaders use the same context as does the OpenGL rendering pipeline. There is no need to acquire and release the context as OpenGL+OpenCL must do.
• Compute Shaders use the GLSL language, something that all OpenGL programmers should already be familiar with (or will be soon).
• I’m assuming that calls to OpenGL compute shaders are more lightweight than calls to OpenCL kernels. (True?) This should result in better performance. (True? How much?)
• Using OpenCL is somewhat cumbersome. It requires a lot of setup (queries, platforms, devices, queues, kernels, etc.). Compute Shaders look to be more convenient. They just kind of flow in with the graphics.

The bottom line is that I will continue to use OpenCL for the big, bad stuff. But, for lighter-weight data-parallel computing that interacts with graphics, I will use the Compute Shaders.

I suspect that a good example of a lighter-weight data-parallel graphics-related application is a particle system. This will be shown in the rest of these notes. I hope I’m right.

[Diagram for “the Basic Idea”: the application invokes OpenGL rendering, which reads the buffer data. One shader program has only a Compute Shader in it; another shader program has pipeline rendering in it.]
If I Know GLSL, What Do I Need to Do Differently to Write a Compute Shader?

Not much:

1. A Compute Shader is created just like any other GLSL shader, except that its type is GL_COMPUTE_SHADER (duh…). You compile it and link it just like any other GLSL shader program.
2. A Compute Shader must be in a shader program all by itself. There cannot be vertex, fragment, etc. shaders in there with it. (why?)
3. A Compute Shader has access to uniform variables and buffer objects, but cannot access any pipeline variables such as attributes or variables from other stages. It stands alone.
4. A Compute Shader needs to declare the number of work-items in each of its work-groups in a special GLSL layout statement.

More information on items 3 and 4 is coming up . . .

Passing Data to the Compute Shader Happens with a Cool New Buffer Type – the Shader Storage Buffer Object

The tricky part is getting data into and out of the Compute Shader. The trickiness comes from the specification phrase: “In most respects, a Compute Shader is identical to all other OpenGL shaders, with similar status, uniforms, and other such properties. It has access to many of the same data as all other shader types, such as textures, image textures, atomic counters, and so on.”

OpenCL programs have access to general arrays of data, and also access to OpenGL arrays of data in the form of buffer objects. Compute Shaders, looking like other shaders, haven’t had direct access to general arrays of data (hacked access, yes; direct access, no). But, because Compute Shaders represent opportunities for massive data-parallel computations, that is exactly what you want them to use.

Thus, OpenGL 4.3 introduced the Shader Storage Buffer Object. This is very cool, and has been needed for a long time!

Shader Storage Buffer Objects are created with arbitrary data (same as other buffer objects), but what is new is that the shaders can read and write them in the same C-like way as they were created, including treating parts of the buffer as an array of structures, which is perfect for data-parallel computing!

[Diagram: a Shader Storage Buffer Object holding arbitrary data, including Arrays of Structures.]
Passing Data to the Compute Shader Happens with a Cool New Buffer Type – the Shader Storage Buffer Object (continued)

And, like other OpenGL buffer types, Shader Storage Buffer Objects can be bound to indexed binding points, making them easy to access from inside the Compute Shaders.

[Diagram labels: OpenGL Context; Display Dest.; Texture0, Texture1, Texture2, Texture3; Buffer0, Buffer1, Buffer2, Buffer3; Shader Storage Buffer Object.]

(Any resemblance this diagram has to a mother sow is accidental, but not entirely inaccurate…)

The Example We Are Going to Use Here is a Particle System

• The Compute Shader moves the particles by recomputing the position and velocity buffers.
• The OpenGL rendering draws the particles by reading the position buffer.
Setting up the Shader Storage Buffer Objects in Your C Program

    #define NUM_PARTICLES       ( 1024*1024 )   // total number of particles to move
    #define WORK_GROUP_SIZE     128             // # work-items per work-group

    struct pos
    {
        float x, y, z, w;       // positions
    };

    struct vel
    {
        float vx, vy, vz, vw;   // velocities
    };

    struct color
    {
        float r, g, b, a;       // colors
    };

    // need to do the following for the position, velocity, and colors of the particles:

    GLuint posSSbo;
    GLuint velSSbo;
    GLuint colSSbo;

Note that .w and .vw are not actually needed. But, by making these structure sizes a multiple of 4 floats, it doesn’t matter if they are declared with the std140 or the std430 qualifier. I think this is a good thing.

Setting up the Shader Storage Buffer Objects in Your C Program (continued)

    glGenBuffers( 1, &posSSbo );
    glBindBuffer( GL_SHADER_STORAGE_BUFFER, posSSbo );
    glBufferData( GL_SHADER_STORAGE_BUFFER, NUM_PARTICLES * sizeof(struct pos), NULL, GL_STATIC_DRAW );

    // the invalidate makes a big difference when re-writing:
    GLbitfield bufMask = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;

    struct pos *points = (struct pos *) glMapBufferRange( GL_SHADER_STORAGE_BUFFER, 0, NUM_PARTICLES * sizeof(struct pos), bufMask );
    for( int i = 0; i < NUM_PARTICLES; i++ )
    {
        points[ i ].x = Ranf( XMIN, XMAX );
        points[ i ].y = Ranf( YMIN, YMAX );
        points[ i ].z = Ranf( ZMIN, ZMAX );
        points[ i ].w = 1.;
    }
    glUnmapBuffer( GL_SHADER_STORAGE_BUFFER );

    glGenBuffers( 1, &velSSbo );
    glBindBuffer( GL_SHADER_STORAGE_BUFFER, velSSbo );
    glBufferData( GL_SHADER_STORAGE_BUFFER, NUM_PARTICLES * sizeof(struct vel), NULL, GL_STATIC_DRAW );

    struct vel *vels = (struct vel *) glMapBufferRange( GL_SHADER_STORAGE_BUFFER, 0, NUM_PARTICLES * sizeof(struct vel), bufMask );
    for( int i = 0; i < NUM_PARTICLES; i++ )
    {
        vels[ i ].vx = Ranf( VXMIN, VXMAX );
        vels[ i ].vy = Ranf( VYMIN, VYMAX );
        vels[ i ].vz = Ranf( VZMIN, VZMAX );
        vels[ i ].vw = 0.;
    }
    glUnmapBuffer( GL_SHADER_STORAGE_BUFFER );
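The setup code calls Ranf( low, high ) without showing it, and it relies on each structure being exactly four floats. Here is a rough host-side sketch of both ideas. The body of Ranf is an assumption (these notes do not define it), and FillPositions, with its lo and hi parameters, is a made-up helper used only for illustration. It fills a plain array the same way the mapped-buffer loop does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pos { float x, y, z, w; };   // same members as in the notes

// four tightly packed floats have the same footprint as a GLSL vec4 array element,
// which is 16 bytes under both std140 and std430
_Static_assert( sizeof(struct pos) == 4 * sizeof(float), "struct pos must be exactly 4 floats" );

// ASSUMPTION: uniform random float between low and high; the notes do not show Ranf's body
float
Ranf( float low, float high )
{
    float t = (float)rand( ) / (float)RAND_MAX;     // 0. to 1.
    return low + t * ( high - low );
}

// hypothetical helper: fills n positions the same way the mapped-buffer loop does
void
FillPositions( struct pos *points, size_t n, float lo, float hi )
{
    assert( points != NULL );
    for( size_t i = 0; i < n; i++ )
    {
        points[ i ].x = Ranf( lo, hi );
        points[ i ].y = Ranf( lo, hi );
        points[ i ].z = Ranf( lo, hi );
        points[ i ].w = 1.f;        // padding, so the struct is 4 floats wide
    }
}
```

In the real program the pointer passed in would be the one returned by glMapBufferRange, and each coordinate would use its own range (XMIN to XMAX, and so on).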
The same would possibly need to be done for the color shader storage buffer object (colSSbo).

The Data Needs to be Divided into Large Quantities Called Work-Groups, each of which is further Divided into Smaller Units Called Work-Items

$$\#\text{WorkGroups} = \frac{\text{GlobalInvocationSize}}{\text{WorkGroupSize}}$$

The Invocation Space can be 1D, 2D, or 3D.

This one is 1D. 20 total items to compute, as 5 Work-Groups of 4 Work-Items each:

$$5 = \frac{20}{4}$$

This one is 2D. 20x12 (=240) total items to compute, as 5x4 Work-Groups of 4x3 Work-Items each:

$$5 \times 4 = \frac{20 \times 12}{4 \times 3}$$
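The work-group arithmetic can be sketched as a small C helper that is applied once per dimension. NumWorkGroups is a hypothetical name, not part of OpenGL. Rounding up is my own addition: the examples in these notes always divide evenly, in which case it gives the same answer as plain division.

```c
#include <assert.h>

// hypothetical helper: number of work-groups needed along one dimension
unsigned int
NumWorkGroups( unsigned int globalInvocationSize, unsigned int workGroupSize )
{
    assert( workGroupSize > 0 );

    // round up, so a global size that is not an exact multiple is still fully covered
    return ( globalInvocationSize + workGroupSize - 1 ) / workGroupSize;
}
```

For the 2D case, call it once with ( 20, 4 ) and once with ( 12, 3 ). If the global size is not an exact multiple of the work-group size, the last work-group has extra work-items, and the shader would need to test its index against the real item count before touching the buffers.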
Running the Compute Shader from the Application

    void glDispatchCompute( num_groups_x, num_groups_y, num_groups_z );

If the problem is 1D, then num_groups_y = 1 and num_groups_z = 1
If the problem is 2D, then num_groups_z = 1

[Diagram: a grid of work-groups with its axes labeled num_groups_x and num_groups_y.]

A Mechanical Equivalent…

[Photo from http://news.cision.com, labeled “Streaming Multiprocessor”, “CUDA Cores”, and “Data”.]

Invoking the Compute Shader in Your C Program

    glBindBufferBase( GL_SHADER_STORAGE_BUFFER, 4, posSSbo );
    glBindBufferBase( GL_SHADER_STORAGE_BUFFER, 5, velSSbo );
    glBindBufferBase( GL_SHADER_STORAGE_BUFFER, 6, colSSbo );

    ...

    glUseProgram( MyComputeShaderProgram );
    glDispatchCompute( NUM_PARTICLES / WORK_GROUP_SIZE, 1, 1 );

    // the barrier bits name how the data will be read next;
    // the position buffer is read next as a vertex array, so that bit is included too:
    glMemoryBarrier( GL_SHADER_STORAGE_BARRIER_BIT | GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT );

    …

    glUseProgram( MyRenderingShaderProgram );
    glBindBuffer( GL_ARRAY_BUFFER, posSSbo );
    glVertexPointer( 4, GL_FLOAT, 0, (void *)0 );
    glEnableClientState( GL_VERTEX_ARRAY );
    glDrawArrays( GL_POINTS, 0, NUM_PARTICLES );
    glDisableClientState( GL_VERTEX_ARRAY );
    glBindBuffer( GL_ARRAY_BUFFER, 0 );

Writing a C++ Class to Handle Everything is Fairly Straightforward

Setup:

    GLSLProgram *Particles = new GLSLProgram( );
    bool valid = Particles->Create( "particles.cs" );
    if( ! valid ) { . . . }

Using:

    Particles->Use( );
    Particles->DispatchCompute( NUM_PARTICLES / WORK_GROUP_SIZE, 1, 1 );
    Render->Use( );
    …
Special Pre-set Variables in the Compute Shader

    in    uvec3  gl_NumWorkGroups ;         // Same numbers as in the glDispatchCompute call
    const uvec3  gl_WorkGroupSize ;         // Same numbers as in the layout local_size_*
    in    uvec3  gl_WorkGroupID ;           // Which workgroup this thread is in
    in    uvec3  gl_LocalInvocationID ;     // Where this thread is in the current workgroup
    in    uvec3  gl_GlobalInvocationID ;    // Where this thread is in all the work items
    in    uint   gl_LocalInvocationIndex ;  // 1D representation of the gl_LocalInvocationID
                                            // (used for indexing into a shared array)

    0 ≤ gl_WorkGroupID       ≤ gl_NumWorkGroups – 1
    0 ≤ gl_LocalInvocationID ≤ gl_WorkGroupSize – 1

    gl_GlobalInvocationID   = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID

    gl_LocalInvocationIndex = gl_LocalInvocationID.z * gl_WorkGroupSize.y * gl_WorkGroupSize.x
                            + gl_LocalInvocationID.y * gl_WorkGroupSize.x
                            + gl_LocalInvocationID.x

The Particle System Compute Shader – Setup

    #version 430 compatibility
    #extension GL_ARB_compute_shader : enable
    #extension GL_ARB_shader_storage_buffer_object : enable

    layout( std140, binding=4 ) buffer Pos
    {
        vec4 Positions[ ];      // array of structures
    };

    layout( std140, binding=5 ) buffer Vel
    {
        vec4 Velocities[ ];     // array of structures
    };

    layout( std140, binding=6 ) buffer Col
    {
        vec4 Colors[ ];         // array of structures
    };

    layout( local_size_x = 128, local_size_y = 1, local_size_z = 1 ) in;

You can use the empty brackets, but only on the last element of the buffer. The actual dimension will be determined for you when OpenGL examines the size of this buffer’s data store.

The Particle System Compute Shader – The Physics

$$p' = p + v \cdot t + \frac{1}{2} G \cdot t^2$$

$$v' = v + G \cdot t$$

    const vec3  G  = vec3( 0., -9.8, 0. );
    const float DT = 0.1;

    …

    uint gid = gl_GlobalInvocationID.x;     // the .y and .z are both 0 in this case (those dimensions have size 1)

    vec3 p = Positions[ gid ].xyz;
    vec3 v = Velocities[ gid ].xyz;

    vec3 pp = p + v*DT + .5*DT*DT*G;
    vec3 vp = v + G*DT;

    Positions[ gid ].xyz = pp;
    Velocities[ gid ].xyz = vp;

The Particle System Compute Shader – How About Introducing a Bounce?

    const vec4 Sphere = vec4( -100., -800., 0., 600. );     // x, y, z, r
                                                            // (could also have passed this in)

    vec3
    Bounce( vec3 vin, vec3 n )
    {
        vec3 vout = reflect( vin, n );
        return vout;
    }

    vec3
    BounceSphere( vec3 p, vec3 v, vec4 s )
    {
        vec3 n = normalize( p - s.xyz );
        return Bounce( v, n );
    }

    bool
    IsInsideSphere( vec3 p, vec4 s )
    {
        float r = length( p - s.xyz );
        return ( r < s.w );
    }

[Diagram: the incoming vector “in” reflects about the surface normal n to give the outgoing vector “out”.]
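To make the relationships among the pre-set variables concrete, here is a small CPU-side sketch in C. The struct uvec3 type and the two function names are stand-ins invented for this example. In a real Compute Shader these values are simply provided by GLSL.

```c
#include <assert.h>

struct uvec3 { unsigned int x, y, z; };     // stand-in for GLSL's uvec3

// gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
struct uvec3
GlobalInvocationID( struct uvec3 workGroupID, struct uvec3 workGroupSize, struct uvec3 localInvocationID )
{
    struct uvec3 g;
    g.x = workGroupID.x * workGroupSize.x + localInvocationID.x;
    g.y = workGroupID.y * workGroupSize.y + localInvocationID.y;
    g.z = workGroupID.z * workGroupSize.z + localInvocationID.z;
    return g;
}

// flattens the 3D local ID into the 1D gl_LocalInvocationIndex
unsigned int
LocalInvocationIndex( struct uvec3 localInvocationID, struct uvec3 workGroupSize )
{
    // 0 <= gl_LocalInvocationID <= gl_WorkGroupSize - 1 in every dimension
    assert( localInvocationID.x < workGroupSize.x );
    assert( localInvocationID.y < workGroupSize.y );
    assert( localInvocationID.z < workGroupSize.z );

    return localInvocationID.z * workGroupSize.y * workGroupSize.x
         + localInvocationID.y * workGroupSize.x
         + localInvocationID.x;
}
```

For example, with a 4x3x1 work-group, the work-item at local ID (3,2,0) in work-group (2,1,0) has global ID (11,5,0) and local index 11, which is the last slot of a 12-element shared array.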
The Particle System Compute Shader – How About Introducing a Bounce? (continued)

$$p' = p + v \cdot t + \frac{1}{2} G \cdot t^2$$

$$v' = v + G \cdot t$$

    uint gid = gl_GlobalInvocationID.x;     // the .y and .z are both 0 in this case (those dimensions have size 1)

    vec3 p = Positions[ gid ].xyz;
    vec3 v = Velocities[ gid ].xyz;

    vec3 pp = p + v*DT + .5*DT*DT*G;
    vec3 vp = v + G*DT;

    if( IsInsideSphere( pp, Sphere ) )
    {
        vp = BounceSphere( p, v, Sphere );
        pp = p + vp*DT + .5*DT*DT*G;
    }

    Positions[ gid ].xyz = pp;
    Velocities[ gid ].xyz = vp;

Graphics Trick Alert: Making the bounce happen from the surface of the sphere is time-consuming. Instead, bounce from the previous position in space. If DT is small enough, nobody will ever know...

The Bouncing Particle System Compute Shader – What Does It Look Like?

[Image]
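One way to check the physics and the bounce by hand is to run the same update for a single particle on the CPU. The sketch below mirrors the shader logic in plain C. The vec3 and vec4 structs, the Add, Scale, Sub and Dot helpers, and StepParticle are names I made up for this example. Unlike the shader, it compares squared distances and divides by dot(n,n) instead of calling length( ) and normalize( ). That gives the same result and avoids needing a square root.

```c
#include <assert.h>

struct vec3 { float x, y, z; };
struct vec4 { float x, y, z, w; };          // sphere: x, y, z, r

static const struct vec3 G  = { 0.f, -9.8f, 0.f };
static const float       DT = 0.1f;

struct vec3 Add( struct vec3 a, struct vec3 b )  { struct vec3 r = { a.x+b.x, a.y+b.y, a.z+b.z }; return r; }
struct vec3 Sub( struct vec3 a, struct vec3 b )  { struct vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }
struct vec3 Scale( struct vec3 a, float s )      { struct vec3 r = { a.x*s, a.y*s, a.z*s };       return r; }
float       Dot( struct vec3 a, struct vec3 b )  { return a.x*b.x + a.y*b.y + a.z*b.z; }

// same result as GLSL's reflect( vin, normalize(n) ), but n does not have to be unit length
struct vec3
Bounce( struct vec3 vin, struct vec3 n )
{
    float nn = Dot( n, n );
    if( nn == 0.f )
        return vin;                         // degenerate normal: leave the velocity alone
    return Sub( vin, Scale( n, 2.f * Dot( vin, n ) / nn ) );
}

struct vec3
BounceSphere( struct vec3 p, struct vec3 v, struct vec4 s )
{
    struct vec3 center = { s.x, s.y, s.z };
    return Bounce( v, Sub( p, center ) );
}

int
IsInsideSphere( struct vec3 p, struct vec4 s )
{
    struct vec3 center = { s.x, s.y, s.z };
    struct vec3 d = Sub( p, center );
    return Dot( d, d ) < s.w * s.w;         // squared distance vs. squared radius
}

// advances one particle by DT, bouncing from its previous position if it would enter the sphere
void
StepParticle( struct vec3 *p, struct vec3 *v, struct vec4 sphere )
{
    struct vec3 pp = Add( Add( *p, Scale( *v, DT ) ), Scale( G, .5f*DT*DT ) );
    struct vec3 vp = Add( *v, Scale( G, DT ) );

    if( IsInsideSphere( pp, sphere ) )
    {
        vp = BounceSphere( *p, *v, sphere );
        pp = Add( Add( *p, Scale( vp, DT ) ), Scale( G, .5f*DT*DT ) );
    }

    *p = pp;
    *v = vp;
}
```

Calling StepParticle in a loop over an array of particles is the serial equivalent of one glDispatchCompute call. The Compute Shader does the same work with one work-item per particle.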
Other Useful Stuff – Copying Global Data to a Local Array Shared by the Entire Work-Group

There are some applications, such as image convolution, where threads within a work-group need to operate on each other’s input or output data. In those cases, it is usually a good idea to create a local shared array that all of the threads in the work-group can access. You do it like this:

    layout( std140, binding=6 ) buffer Col
    {
        vec4 Colors[ ];
    };

    shared vec4 rgba[ gl_WorkGroupSize.x ];

    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    rgba[ lid ] = Colors[ gid ];
    memoryBarrierShared( );     // make the shared-array writes visible to the rest of the work-group
    barrier( );                 // wait until every thread in the work-group has gotten to here

    << operate on the rgba array elements >>

    Colors[ gid ] = rgba[ lid ];
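The copy-in, synchronize, operate, copy-out pattern can be imitated serially in C to show why the synchronization step matters. This is only a CPU emulation under assumptions of my own: ReverseWithinWorkGroups is a made-up example operation, the data is a plain float array rather than vec4 colors, and the work-group size is shrunk to 4 so the result is easy to check by hand.

```c
#include <assert.h>
#include <stddef.h>

#define WORK_GROUP_SIZE 4   // tiny on purpose; the notes use 128

// each "work-group" copies its slice into a local array, then every "work-item"
// reads a *different* work-item's element (here: the mirror-image one)
void
ReverseWithinWorkGroups( float *data, size_t numWorkGroups )
{
    assert( data != NULL );

    for( size_t wg = 0; wg < numWorkGroups; wg++ )
    {
        float shared[ WORK_GROUP_SIZE ];    // plays the role of the GLSL shared array

        // phase 1: every work-item copies its own element, like rgba[ lid ] = Colors[ gid ]
        for( size_t lid = 0; lid < WORK_GROUP_SIZE; lid++ )
        {
            size_t gid = wg * WORK_GROUP_SIZE + lid;
            shared[ lid ] = data[ gid ];
        }

        // finishing the loop above before starting the loop below plays the role of the barrier:
        // no work-item reads the shared array until all of them have written to it

        // phase 2: every work-item writes back a neighbor's value
        for( size_t lid = 0; lid < WORK_GROUP_SIZE; lid++ )
        {
            size_t gid = wg * WORK_GROUP_SIZE + lid;
            data[ gid ] = shared[ WORK_GROUP_SIZE - 1 - lid ];
        }
    }
}
```

On the GPU the two phases run concurrently across the work-items. Without the synchronization between them, a work-item could read a shared element that its neighbor has not yet filled in.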