Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
26 – HAL
OpenMP for GPUs
Directives for accelerators
Programming Model
• some code on host (the CPU)
• “offload” a “kernel” to the “accelerator”
– offloading possible (in theory) via OpenMP
– can also use
• OpenACC
• OpenCL
• CUDA proprietary, just for NVIDIA GPUs
Why Directives for GPUs?
• CUDA is only for NVIDIA GPUs
– lack of portability
– programming via calling a kernel explicitly,
plus function calls to handle data transfer & usage of memory
• Amount of coding
– one directive may replace several lines of CUDA
• Portability over different heterogeneous architectures
– CPU + NVIDIA GPU
– CPU + AMD GPU
– CPU + XeonPhi (RIP)
– CPU + FPGA (apparently)
Programming GPUs
CUDA
• proprietary
• NVIDIA only GPUs
• non-portable
• performant
Directives
• portable
– in theory?
• less coding
• maybe not so performant
• (some extensions to parallelism on
CPUs, Xeon Phis, FPGAs)
COMP328/COMP528 (c) mkbane, Univ of
Liverpool
OpenMP .v. OpenACC
OpenMP
• 1998 onwards
• offloading from v4.0 (2013)
• CPU & accelerator
• FORTRAN, C, C++, …
• prescriptive
– user explicitly specifies actions to be
undertaken by compiler
• slower uptake of new [accelerator] ideas,
but generally greater maturity for CPU
OpenACC
• 2012 onwards
• offloading always
• CPU & accelerator
• FORTRAN, C, C++, …
• descriptive
– user describes (guides) compiler but
compiler makes decision how/if to do
parallelism
• generally more reactive to new ideas
• maturity for GPU
https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf
• OpenACC
• OpenMP 4.x (and 5.0…)
• OpenCL
OpenMP for Accelerators
• #pragma omp target
– defines a region of code to be offloaded to the target device (e.g. a GPU)
– then need to say what happens within that region of code, e.g.
• #pragma omp parallel for
– on CPU: creates threads & spreads iterations over threads
– within ‘target’: runs using threads of GPU
Target Clauses
• device (N)
– run on device #N
• map(A,B)
– ensure A, B vars available on target device
• map(tofrom: C)
– copy C to device, run region on device, copy C back
GPU threads != CPU threads
• OpenMP designed around CPU threads
– high cost of set-up and of synchronisation
• GPUs
– lightweight threads, very low cost of switching
– “thread blocks”
• SO… OpenMP adds the “teams” directive, mapping onto GPU thread blocks
CHADWICK
OpenACC
• parallel | kernels
• copy
• copyin
• copyout
• create
• delete
OpenMP
• target / teams / parallel
• map(tofrom:…)
• map(to:…)
• map(from:…)
• map(alloc:…)
• map(release:…) / map(delete:…)
OpenMP – good for GPUs?
• syntax via examples
• what is in which version
• which compilers support which version
OpenMP versions / GPU support
• v4.0 (2013): support offloading
– Intel v15 & v16
– GCC v4.9.0
• v4.5 (2015): improved support for offloading targets
– Intel v17 onwards
– GCC v6.1 onwards
– Cray CCE 8.7 onwards
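Illustrative compile lines for the compilers above, given as assumptions rather than verified recipes — exact flags depend on how each compiler was installed and which offload back ends were built:

```shell
# Intel (v17+): OpenMP including target directives
#   icc -qopenmp prog.c -o prog
# GCC (v6.1+, only if built with an offload back end, e.g. nvptx):
#   gcc -fopenmp -foffload=nvptx-none prog.c -o prog
# Cray CCE (8.7+):
#   cc -h omp prog.c -o prog
```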
Who supports What?
• Intel make CPUs (and Xeon Phi) but not discrete [compute] GPUs
– Intel compilers support OpenMP but not OpenACC
• NVIDIA (owners of PGI) make GPUs but not CPUs
– PGI compilers support OpenACC
& (only recently) OpenMP (but only CPU ‘target’)
• Cray no longer make chips, more of an “integrator”
– Cray compilers support OpenMP & OpenACC
• There are some options to extend LLVM/clang yourself (“DIY”)
– IBM compilers
• GNU ?
– Possible in v10 (not available on Barkla)
– Can do a ‘chain’ of builds from v6 onwards
(if various support commands are also available)
https://gcc.gnu.org/wiki/Offloading
Conclusion?
• OpenMP for accelerators
– … limited support (Intel for Intel Xeon Phi)
– … clang/LLVM (hand-built or via IBM) for GPUs…
not the most straightforward
• Use OpenACC for ease!
OpenMP for Accelerators: Further Reading
– OpenMP example of Jacobi:
– OpenMP user group presentation (2020):
openmp-for-cs/slides/pdfs at openmp-ug-2020 · UoB-HPC/openmp-for-cs · GitHub
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane