
Microsoft PowerPoint – COMP528 HAL25 OpenACC.pptx

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

25 – HAL

OpenACC for GPUs

Directives for accelerators

Programming Model

• some code on host (the CPU)

• “offload” a “kernel” to the “accelerator”
– offloading possible (in theory) via OpenMP

– can also use
• OpenACC

• OpenCL

• CUDA: proprietary, just for NVIDIA GPUs

Why Directives for GPUs?

• CUDA is only for NVIDIA GPUs
– lack of portability

– programming requires explicitly launching kernels,
plus function calls to handle data transfer & memory usage

• Amount of coding
– one directive may replace several lines of CUDA

• Portability over different heterogeneous architectures
– CPU + NVIDIA GPU

– CPU + AMD GPU

– CPU + XeonPhi (RIP)

– CPU + FPGA (apparently)

OpenMP vs OpenACC
OpenMP

• 1998 onwards

• offloading since v4.0 (2013)

• CPU & accelerator

• FORTRAN, C, C++, …

• prescriptive
– user explicitly specifies actions to be
undertaken by the compiler

• slower uptake of new [accelerator]
ideas, but generally greater maturity for CPU

OpenACC

• 2012 onwards

• offloading from the outset

• CPU & accelerator

• FORTRAN, C, C++, …

• descriptive
– user describes (guides) the compiler, but the
compiler decides how/if to implement the
parallelism

• generally more reactive to new ideas

• maturity for GPU

https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf

COMP328/COMP528 (c) mkbane, Univ of Liverpool

• OpenMP
– support for GPUs is in the OpenMP standard

– but it is not easy to find an implementation for a given GPU

• OpenACC
– some GPU implementations are more readily available to use

OpenACC

– "open accelerators"

– initially a project driven by Cray + CAPS + NVIDIA + PGI
• (NVIDIA later bought out PGI)

– directives to describe offloading, targets and how to use the targets

OpenACC Directives
• https://openacc.org

• #pragma acc [openAccDirective]

• Choice of openAccClauses
per directive

• Key directives
– acc parallel

• “this loop should be parallelised”

– acc kernels
• "may contain parallelism";
compiler to do its best

“Jacobi”

• Example based on blog by Jeff Larkin, NVIDIA
https://devblogs.nvidia.com/getting-started-openacc/

• Jacobi Method
– Iterative solver

– https://en.wikipedia.org/wiki/Jacobi_method

– Anew set to mean of 4 surrounding points

– ‘stencil’ algorithm

• Parallelism
– 1st loop: Anew[j][i] updates

– 2nd loop: A[j][i] updates

C
O

M
P
3

2
8

/C
O

M
P
5

2
8

(
c)

m
kb

an
e,

U
n
iv

o
f
L
iv

er
p
o
o
l

#pragma acc parallel loop reduction(max:error)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(A[j][i]-Anew[j][i]));
  }
}

#pragma acc parallel loop
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    A[j][i] = Anew[j][i];
  }
}

#pragma acc kernels
{
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                          + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(A[j][i]-Anew[j][i]));
    }
  }
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
}

Quick Comparison of Timings

• Barkla modules
– compilers/pgi/2018
– compilers/pgi/2019-llvm

• Since the execution time is short, we will use the "interactive GPU" on the "viz02" node
– a less powerful GPU (Quadro 4000) c.f. the batch GPUs (P100 or V100)
– no queuing
– not exclusive access: so we may not see the best possible performance

• We cite the fastest of 3 runs
– the Barkla GPUs are set up "non-persistent": a small overhead loading some libraries

Example based on blog by Jeff Larkin, NVIDIA
https://devblogs.nvidia.com/getting-started-openacc/
Data size for the timed examples is N=500 and 5000 iterations

Quick Comparison of Timings

• EG1: parallel v kernels
~mkbane/HPC_DEMOS/Directives_for_Accelerators/OpenACC
kdiff3 jacobi-parallel.c jacobi-kernels.c

• Makefile

• EG1: parallel v kernels
– jacobi.c (no directives)
icc -O0: 14.4 seconds; pgcc -O0: 7.0 seconds
– jacobi-kernels.c (OpenACC "kernels" directive)
pgcc -O0 -Minfo=accel -mp=nonuma -acc -ta=tesla: 7.0 secs
– jacobi-parallel.c (OpenACC "parallel loop" directives)
pgcc -O0 -Minfo=accel -mp=nonuma -acc -ta=tesla: 9.9 secs
– profile by setting PGI_ACC_TIME=1

• Profiling shows
– compute: 302+79+167 = 548 msec
– data transfers: 30+49+860+1614 = 2553 msec
– i.e. the cost of data transfers kills any compute acceleration

• all the data is being transferred in each direction every time either kernel is called
– but this is more than the minimum actually needed
– copy A at the start and back at the end (of a 'data' region)
– create (temp space) Anew directly on the GPU

• let's do explicit data transfer
– a 'data' region around the while loop
– 'copy(A)' and 'create(Anew)' for this data region

• EG2: kernels directive plus data transfer clauses
– jacobi-kernels+data.c: 0.6 seconds
(a factor of 11.7 times faster than just 'kernels')
– before: compute 471 msec, data transfers 2518 msec
– after: compute 470 msec, data transfers 52 msec

• data within the loops is being re-used a lot

for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(A[j][i]-Anew[j][i]));
  }
}

• so "tiling" will help (re-use of cache lines)

• empirically, we see that giving the compiler specific information further improves performance

• EG3: kernels directive plus data transfer clauses plus 'tile' clause
– jacobi-kernels+data+tile.c: 0.5 seconds

• There is more to OpenACC that we do not cover, e.g. clauses for #pragma acc loop
– control the direction/s for copying data between host & device
– granularity (c.f. the threads and thread blocks we cover shortly for CUDA)

#pragma acc loop clauses

• e.g. to help define the levels of parallelism
– tile(x,y): may help with cache re-use
– reduction(…)

• on a GPU
– gang(l): number of thread blocks per grid
– worker(m):
– vector(n): number of threads per thread block

• on a CPU
– gang(l): number of processors
– worker(m): number of cores
– vector(n): width of the vector unit (SIMD)

#pragma acc data

• Use to control (i.e. reduce) data movements

• copy(x): copy host 'x' to the device at the start & back at the exit (of the region)

• copyin(x): only copy to the device at the start

• copyout(x): only copy off the device at the end (of the region)

• create(y): local device variable => create 'y' only on the device

• present(z): 'z' already exists on the device (e.g. from a previous kernel)

OpenACC: Further Reading

• Resources will be added to COMP528 course module page
on CANVAS

– https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Intro_to_OpenACC-JL.pdf

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane