Dr Michael K Bane, G14, Computer Science, University of Liverpool
https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
25 – HAL
OpenACC for GPUs
Directives for accelerators
Programming Model
• some code on host (the CPU)
• “offload” a “kernel” to the “accelerator”
– offloading possible (in theory) via OpenMP
– can also use
• OpenACC
• OpenCL
• CUDA (proprietary, just for NVIDIA GPUs)
Why Directives for GPUs?
• CUDA is only for NVIDIA GPUs
– lack of portability
– programming via calling a kernel explicitly,
plus function calls to handle data transfer & usage of memory
• Amount of coding
– one directive may have been several lines of CUDA
• Portability over different heterogeneous architectures
– CPU + NVIDIA GPU
– CPU + AMD GPU
– CPU + Xeon Phi (RIP)
– CPU + FPGA (apparently)
OpenMP .v. OpenACC
OpenMP
• 1998 onwards
• offloading since v4.0 (2013)
• CPU & accelerator
• FORTRAN, C, C++, …
• prescriptive
– user explicitly specifies the actions to be
undertaken by the compiler
• slower uptake of new [accelerator] ideas
• but generally greater maturity for CPU
OpenACC
• 2012 onwards
• offloading from the outset
• CPU & accelerator
• FORTRAN, C, C++, …
• descriptive
– user describes (guides) the compiler, but the
compiler decides how/whether to
parallelise
• generally more reactive to new ideas
• maturity for GPU
https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf
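To give the prescriptive/descriptive contrast a concrete flavour, here is a sketch (illustrative, not from these slides) of the same saxpy-style loop offloaded each way; it assumes arrays x, y of length n and a scalar a already exist:

    /* OpenMP 4.x style: the user prescribes the mapping and the worksharing */
    #pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    /* OpenACC: the user describes the parallelism; the compiler decides the mapping */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];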
• OpenMP
– support for GPUs is in the OpenMP standard
– but it is not easy to find an implementation for a given GPU
• OpenACC
– some implementations for GPUs are more readily available to use
OpenACC
– open accelerators
– initially a project driven by Cray + CAPS + NVIDIA + PGI
• (NVIDIA later bought out PGI)
– directives to describe offloading, targets and how to use targets
OpenACC Directives
• https://openacc.org
• #pragma acc [openAccDirective]
• Choice of openAccClauses
per directive
• Key directives
– acc parallel
• “this loop should be parallelised”
– acc kernels
• “may contain parallelism”;
compiler to do its best
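As a minimal sketch of the two (assuming arrays a, b, c of length n; this example is illustrative, not from the slides):

    /* parallel loop: the programmer asserts the loop is safe to parallelise */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* kernels: the compiler analyses the region and parallelises what it can */
    #pragma acc kernels
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }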
“Jacobi”
• Example based on blog by Jeff Larkin, NVIDIA
https://devblogs.nvidia.com/getting-started-openacc/
• Jacobi Method
– Iterative solver
– https://en.wikipedia.org/wiki/Jacobi_method
– Anew set to mean of 4 surrounding points
– ‘stencil’ algorithm
• Parallelism
– 1st loop: Anew[j][i] updates
– 2nd loop: A[j][i] updates
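For context, the surrounding iteration structure in Larkin's example looks roughly as follows (tol and iter_max are the assumed tolerance and iteration cap; the slides below show only the two inner loop nests):

    int iter = 0;
    double error = 1.0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        /* loop nest 1: stencil update of Anew + max-error reduction */
        /* loop nest 2: copy Anew back into A */
        iter++;
    }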
Variant 1: a "parallel loop" directive on each loop nest

#pragma acc parallel loop reduction(max:error)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(A[j][i]-Anew[j][i]));
    }
}
#pragma acc parallel loop
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
    }
}

Variant 2: a single "kernels" region around both loop nests

#pragma acc kernels
{
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(A[j][i]-Anew[j][i]));
        }
    }
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
}

Quick Comparison of Timings
• Barkla modules: compilers/pgi/2018, compilers/pgi/2019-llvm
• Since execution time is short, we use the "interactive GPU" on the "viz02" node
    – a less powerful GPU (Quadro 4000) c.f. the batch GPUs (P100 or V100)
    – no queuing
    – not exclusive access, so we may not see the best possible performance
• We cite the fastest of 3 runs
    – Barkla GPUs are set up "non-persistent": a small overhead loading some libraries
• Example based on blog by Jeff Larkin, NVIDIA
  https://devblogs.nvidia.com/getting-started-openacc/
• Data size for the timed examples is N=500 and 5000 iterations

• EG1: parallel v kernels
    ~mkbane/HPC_DEMOS/Directives_for_Accelerators/OpenACC
    kdiff3 jacobi-parallel.c jacobi-kernels.c
    (see also the Makefile)
    – jacobi.c, no directives:
      icc -O0: 14.4 seconds; pgcc -O0: 7.0 seconds
    – jacobi-kernels.c, OpenACC "kernels" directive:
      pgcc -O0 -Minfo=accel -mp=nonuma -acc -ta=tesla: 7.0 secs
    – jacobi-parallel.c, OpenACC "parallel loop" directives:
      pgcc -O0 -Minfo=accel -mp=nonuma -acc -ta=tesla: 9.9 secs
• Profile by setting PGI_ACC_TIME=1:
    compute: 302+79+167 = 548 msec
    data transfers: 30+49+860+1614 = 2553 msec
    i.e. the cost of data transfers kills any compute acceleration

• All the data is being transferred in each direction every time either kernel is called
    – but this is more than the minimum actually needed:
    – copy A to the device at the start and back at the end (of a 'data' region)
    – create (temp space) Anew directly on the GPU
• So let's do explicit data transfer:
    – a 'data' region around the while loop
    – 'copy(A)' and 'create(Anew)' clauses for this data region
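A sketch of that change (assuming the variable names used earlier; this mirrors the structure of Larkin's blog code):

    #pragma acc data copy(A) create(Anew)
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
    #pragma acc kernels
        {
            for( int j = 1; j < n-1; j++) {
                for( int i = 1; i < m-1; i++ ) {
                    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                        + A[j-1][i] + A[j+1][i]);
                    error = fmax( error, fabs(A[j][i]-Anew[j][i]));
                }
            }
            for( int j = 1; j < n-1; j++) {
                for( int i = 1; i < m-1; i++ ) {
                    A[j][i] = Anew[j][i];
                }
            }
        }
        iter++;
    }

A is now copied to the device once and back once; Anew never leaves the GPU.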
• EG2: kernels directive plus data transfer clauses
    jacobi-kernels+data.c: 0.6 seconds (a factor of 11.7 faster than just 'kernels')
    – before the data region: compute 471 msec, data transfers 2518 msec
    – after the data region: compute 470 msec, data transfers 52 msec

• Data within the loops is re-used a lot:

    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(A[j][i]-Anew[j][i]));
        }
    }

• so "tiling" will help (re-use of cache lines)
• empirically, we see that giving the compiler specific information further improves performance

• EG3: kernels directive plus data transfer clauses plus 'tile' clause
    jacobi-kernels+data+tile.c: 0.5 seconds

• There is more to OpenACC than we cover here, e.g.
    – clauses for #pragma acc loop
    – controlling the direction(s) for copying data between host & device
    – granularity (c.f. the threads and thread blocks we cover shortly for CUDA)

#pragma acc loop clauses
• e.g. to help define the levels of parallelism
    – tile(x,y): may help with cache re-use
    – reduction(…)
• on a GPU:
    – gang(l): number of thread blocks per grid
    – worker(m):
    – vector(n): number of threads per thread block
• on a CPU:
    – gang(l): number of processors
    – worker(m): number of cores
    – vector(n): width of the vector unit (SIMD)
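A hedged sketch of these clauses on the Jacobi loops (the tile, gang and vector numbers below are illustrative placeholders, not tuned values from these slides; note that gang/vector arguments on a loop are only legal inside a 'kernels' region):

    #pragma acc kernels
    {
        /* tile(32,32): strip-mine j and i into 32x32 tiles,
           improving re-use of neighbouring A[][] values */
        #pragma acc loop tile(32,32)
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
            }
        }
        /* on a GPU: request 256 thread blocks of 128 threads each */
        #pragma acc loop gang(256) vector(128)
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
    }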
#pragma acc data
• Use to control (i.e. reduce) data movements
    – copy(x): copy host x to the device at the start & back at the exit (of the region)
    – copyin(x): only copy to the device at the start
    – copyout(x): only copy off the device at the end (of the region)
    – create(y): local device variable => create 'y' only on the device
    – present(z): 'z' already exists on the device (e.g. from a previous kernel)
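As a hedged illustration of how these clauses compose (names and sizes are placeholders): x is input-only, y is output-only, tmp is device-only scratch, and z is assumed to have been left on the device by an earlier region:

    #pragma acc data copyin(x[0:n]) copyout(y[0:n]) create(tmp[0:n]) present(z[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; i++) {
            tmp[i] = x[i] + z[i];   /* tmp exists only on the device */
            y[i]   = 2.0 * tmp[i];  /* y is copied back to the host at region exit */
        }
    }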
OpenACC: Further Reading
• Resources will be added to COMP528 course module page
on CANVAS
– https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Intro_to_OpenACC-JL.pdf
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
https://cgi.csc.liv.ac.uk/~mkbane