Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
24 – HAL
DIRECTIVES FOR ACCELERATORS
(AN OVERVIEW)
OpenMP for Accelerators
Further Reading
“Accelerators” ??
• GPU or GPGPU
– [general purpose] graphics processing unit
– well suited to accelerating vector-like operations
– a GPU typically sits at the far end of a PCIe link from the CPU
• FPGA
– field programmable gate array
– programmable (& run-time re-configurable) logic gates
– harder to program, but gives good acceleration for streaming-like operations
– an FPGA usually sits at the far end of a PCIe link from the CPU
Programming Model
• some code on host (the CPU)
• “offload” a “kernel” to the “accelerator”
– offloading possible (in theory) via OpenMP
– can also use
• OpenACC
• OpenCL
• CUDA – proprietary, for NVIDIA GPUs only
Programming Model
• directives-based:
– OpenMP
– OpenACC
• programming languages / extensions:
– OpenCL
– CUDA – proprietary, for NVIDIA GPUs only
• OpenMP 4.0
– introduced “off-loading”
• running a region on other hardware (potentially with a different ISA)
– directives determine the region and the “target”
– directives also determine how to run in parallel on the target (see the sketch below)
– targets can be Xeon Phi, GPU, FPGA
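A minimal sketch, in C, of what such an offloaded region can look like (the saxpy-style loop, array names and sizes are illustrative, not taken from the lecture): “target” names the region to offload to the default device, the map clauses describe which data moves to/from the device, and “teams distribute parallel for” says how to run the loop in parallel on the target.

#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* offload this loop to the default device and parallelise it there;
       x is copied to the device, y is copied both ways */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }

    printf("y[0]=%f y[N-1]=%f\n", y[0], y[N-1]);
    return 0;
}

If no device (or no offload support in the compiler) is available, the region simply falls back to executing on the host.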
• OpenACC
– open accelerators
– initially project driven by Cray + CAPS + NVIDIA + PGI
• (NVIDIA later bought out PGI)
– directives to describe offloading, targets, and how to use the targets (see the sketch below)
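For comparison, a hedged sketch of the same kind of loop written with OpenACC (again, the names and loop body are illustrative): “parallel loop” offloads and parallelises the loop, and the data clauses describe the transfers.

#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* offload and parallelise the loop; x copied in, y copied in and out */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }

    printf("y[0]=%f\n", y[0]);
    return 0;
}

With the PGI compilers this is typically built with something like pgcc -acc -Minfo=accel, where the -Minfo=accel report shows how the compiler actually mapped the loop and the data.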
OpenMP .v. OpenACC
OpenMP
• 1998 onwards
• offloading @v4 ~2013
• CPU & accelerator
• FORTRAN, C, C++, …
• prescriptive
– user explicitly specifies actions to be undertaken by the compiler
• slower uptake of new [accelerator] ideas, but generally greater maturity for CPU
OpenACC
• 2012 onwards
• offloading always
• CPU & accelerator
• FORTRAN, C, C++, …
• descriptive
– user describes (guides) the compiler, but the compiler decides how/if to parallelise (see the contrast sketched below)
• generally more responsive to new ideas
• greater maturity for GPU
https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf
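The prescriptive/descriptive distinction shows up directly in the source. A sketch, assuming a simple vector update (names illustrative): with OpenMP the programmer spells out how the loop is to be run on the device, whereas OpenACC's “kernels” construct only asserts that the region is a candidate for offload and leaves the compiler to decide how, or whether, to parallelise it.

#define N 1000000
float x[N], y[N];

void update_openmp(void) {
    /* prescriptive: the directive states exactly how to run the loop on the device */
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += x[i];
}

void update_openacc(void) {
    /* descriptive: the compiler analyses the region and decides how/if to
       parallelise and offload it (clauses can still be added to guide it) */
    #pragma acc kernels
    for (int i = 0; i < N; i++)
        y[i] += x[i];
}

In practice only one set of directives would be enabled for a given build; pragmas the compiler does not recognise are ignored, so the other routine compiles as ordinary serial code.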
Why Directives for GPUs?
• CUDA is only for NVIDIA GPUs
– lack of portability
– programming via calling a kernel explicitly,
plus function calls to handle data transfer & usage of memory
• Amount of coding
– one directive may replace several lines of CUDA (see the sketch after this list)
• Portability over different heterogeneous architectures
– CPU + NVIDIA GPU
– CPU + AMD GPU
– CPU + XeonPhi (RIP)
– CPU + FPGA (apparently)
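To illustrate the coding-effort point, a hedged sketch (the function and variable names are illustrative): a single enclosing “target data” region keeps the arrays resident on the device across two offloaded loops, where with the CUDA runtime API the programmer would write separate calls for device allocation, copies in each direction and deallocation.

#define N 1000000

void two_kernels(const float *x, float *y, float a) {
    /* one data region: x and y stay on the device across both loops,
       replacing the separate allocate / copy-in / copy-out / free calls
       that would be needed with the CUDA runtime API */
    #pragma omp target data map(to: x[0:N]) map(tofrom: y[0:N])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i];

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            y[i] += x[i];
    }   /* y is copied back to the host when the data region ends */
}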
In practice…
• OpenMP
– v4.0 introduced “offloading” to a “target” device (e.g. GPU)
– not widely supported (vendor wars)
Who supports What?
• Intel make CPUs (and Xeon Phi)
but not discrete [compute] GPUs (yet…)
– Intel compilers support OpenMP but not OpenACC
• NVIDIA (owners of PGI) make GPUs but not CPUs
– PGI compilers support OpenACC & (only recently) OpenMP
• Cray no longer make chips; they are more of an “integrator”
– Cray compilers support OpenMP & OpenACC
• OpenMP
– support for GPU is in OpenMP standard
– but not always easy to find an implementation for a given GPU
• OpenACC
– some implementations for GPUs are more readily available to use
OpenMP for Accelerators
Further Reading
• Resources will be added to CANVAS
– OpenMP example of Jacobi (a minimal sketch follows): https://www.slideshare.net/jefflarkin/gtc16-s6510-targeting-gpus-with-openmp-45
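For orientation, a minimal, hedged sketch of one Jacobi sweep offloaded with OpenMP (this is not the code from the linked slides; grid size, initialisation and the outer convergence loop are omitted):

#include <math.h>

#define NX 1024
#define NY 1024

static float A[NX][NY], Anew[NX][NY];

/* one Jacobi sweep on the device, returning the largest change seen */
float jacobi_sweep(void) {
    float err = 0.0f;

    #pragma omp target teams distribute parallel for collapse(2) \
            reduction(max: err) map(to: A) map(tofrom: Anew) map(tofrom: err)
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++) {
            Anew[i][j] = 0.25f * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
            float diff = fabsf(Anew[i][j] - A[i][j]);
            if (diff > err) err = diff;
        }

    return err;   /* caller copies Anew back into A and tests for convergence */
}

A full version would typically wrap this sweep in an outer iteration loop inside a “target data” region, so that A and Anew are transferred only once rather than on every sweep.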