CS402/922 High Performance Computing

Programming Models
aka “Can I have a different choice in languages?”
https://warwick.ac.uk/fac/sci/dcs/teaching/material/cs402/
07/02/2022

Multithreading so far
So many ways of running things in parallel!
• So far we have encountered a few different coding paradigms:
Pthreads/Java Runnables
• Low level coding
• Language specific
• Parallelism over kernels/functions
• Limited to CPUs
• Everything is explicit
OpenMP
• High level statements
• FORTRAN and C/C++
• All done by the compiler, no manual operations
• Only on CPUs (Up to
MPI
• Low level parallelism
• Interfaces for multiple languages (focused on
• Mainly focused on CPUs and distributed
• Everything is explicit

Programming Models
“And now on the catwalk, C++!”
• Key aim for HPC at the moment:
• Implement parallel programs with as little cost (time, developers etc.) as possible
• Build code that will work for the new systems and architectures
• Abstract away the need to hard code parallelisation through different methodologies:
• Libraries → Kokkos, RAJA
• Domain Specific Languages → OPS, OP2
• Compiler Based → OpenMP 5.0, SYCL, OpenACC
• Specialised languages → Occam, Rust, Go

Libraries
• Imported (and often precompiled) code that abstracts a computing concept → parallelisation
• Allows for different backends to be developed, without changing the front end
• Issues can include:
• Much larger binary files
• If not developed well, may require some specialised knowledge
• Extra computation required → can lead to slower performance than hand-optimising

Kokkos
Fancy logo!
• Kokkos → C++ template library
• Developed by Sandia National Laboratories
• Predominantly relies on C++ template metaprogramming
• Able to select the most appropriate data layout based on the underlying architecture
• Can produce OpenMP, CUDA, HIP, SYCL, HPX and C++ threaded implementations

https://github.com/kokkos/kokkos

RAJA
Still a pretty cool logo!
• RAJA → C++ template library
• Developed by Lawrence Livermore National Laboratory
• Predominantly uses C++11 and lambda functions in order to allow for more flexibility when building kernels
• Produce OpenMP, CUDA and HIP implementations, amongst others
https://github.com/LLNL/RAJA

Domain Specific Language (DSL)
Like accents… but not…
• Generalising parallelisation means we can lose performance
• Extra computation required
• Certain assumptions cannot be made
• Fix → Specialise the library for a particular group of problems
• DSLs extend a language with interpretable functions
• Usually involve a separate program that translates the DSL into compilable code
• Allows for the DSL to be converted into multiple different variations

OPS
That mesh is very structured!
• OPS → Oxford Parallel Library for Multi-block Structured-mesh solvers
• Developed by Dr. whilst working at Oxford University, continues to be developed
• Designed specifically to optimise structured meshes (think deqn) → Each cell in the mesh is the same
• Different optimisations and parallelisations have been implemented, including OpenMP, MPI, CUDA, OpenACC and hybrids of these
https://op-dsl.github.io/
https://github.com/OP-DSL/OPS

OP2
That mesh is very unstructured!
• OP2 → Oxford Parallel Library for Unstructured-mesh solvers
• Developed by Dr. whilst working at Oxford University, continues to be developed
• Designed specifically to optimise unstructured meshes
• Each cell may have different shapes and sizes
• Therefore, is harder to generalise and optimise
• Different optimisations and parallelisations have been implemented, including OpenMP, MPI, CUDA, OpenACC and hybrids of these
https://op-dsl.github.io/
https://github.com/OP-DSL/OP2-Common

Compiler Based Parallelisation
Why do all the hard work myself!
• Libraries can be large and complex to build
• DSLs only work with the correct interpreters
• With both libraries and DSLs, the functionality and some level of specialisation has to be written by a developer
• Why not get the compiler to handle a wider range of systems?
• Compiler-based parallelism allows for minimal effort from the
• Calculations done at compile time, less runtime overhead

OpenMP 5.0 onwards
Haven’t we seen this before?
• Extension to the basic OpenMP to allow for offloading
• Split threads into teams
• Define memory spaces for variables (system, high bandwidth, etc.)
• GCC 11 has development (partial) support for OpenMP 5.0, with full support for OpenMP 4.5
• Clang 11.0/LLVM supports GPU offloading, although this is still under development, and defaults to OpenMP semantics
OpenMP 5.2 Spec
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf

OpenACC
NVIDIA’s first attempt at OpenMP 4.0
• Pragma based library, similar to OpenMP
• Supports C, C++ and Fortran
• Works with POWER (IBM) and x86 (Intel, AMD), along with offloading to NVIDIA GPUs
• Predominantly used with the PGI compiler, maintained by NVIDIA
https://www.openacc.org/

SYCL
At least it doesn’t start with ”Open”…
• C++ programming model for OpenCL
• Designed explicitly for performance portability
• Managed by the Khronos Group
• Allows for flexibility and ease, combining both OpenCL offloading and C++ 11/14/17 principles
• OpenCL is actively supported by multiple vendors, including Apple, AMD, ARM, Intel, NVIDIA, IBM
• Compilers developed by Codeplay and Intel (oneAPI)
https://www.khronos.org/sycl/

Programming Languages
Just rewrite the entire thing!
• Each of the other approaches incurs an overhead for utilising it
• Libraries → Larger binary files mean less cache reuse, bigger libraries mean more additional compute
• DSLs → Additional compilation step that may go wrong
• Compiler → Compilers have to be safe so may not optimise as far as
• What if the entire language was built on parallelisation and performance from the ground up?
• C++20/23 onwards is incorporating some of this

Occam
Tangentially named after Occam’s Razor…
• Designed in the 1980s to allow for parallelism through concurrency
• Multiple instructions can be run at the same time
• Originally supported only integers; version 2 (1987) added floating point values
• Still being worked on by researchers from University of Kent (called Occam 𝜋)
Occam 𝜋 Website
http://occam-pi.org/

Rust
isn’t rusty code!
• Developed by Mozilla engineers in 2010
• Designed to be memory efficient → No garbage collection
• Strong type system → Less likely to incur compile/runtime errors and ensures memory and thread safety
• Rapidly growing user base → Lots of help out there
• Slow compile times, can be very hands-on
https://www.rust-lang.org/

Go
🚦 Go! 🚦 Stop! 🚦 Go…
• Developed and released by Google in 2009
• Designed for simple, efficient code → rapid prototyping
• Manages dependencies, has garbage collection and built-in concurrency → Less to worry about
• Used by a lot of organisations
• Fast compilations, slower runtimes than
https://go.dev/

Which one is best?
Errrmmm…
• Comparing different parallelisation methods is difficult
• How does it affect the compile time?
• How does it affect the runtime?
• How does it translate between different hardware/software models?
• What is the overhead/cost of utilising a parallelisation model?
• How easy is it to develop in?
• How easy is it to incorporate into my already-existing code?
• All of this and more has spawned a new area of research, Performance, Portability and Productivity (P3)

What is Performance Portability?
… so not “how fast can I code”?
• “A measurement of an application’s performance efficiency for a given problem that can be executed correctly on all platforms in a given set.”
• Application efficiency.
• Achieved time/Fastest time.
• Architecture efficiency.
• Achieved GFLOPs/Theoretical peak GFLOPs.
• Memory bandwidth efficiency.

What is Performance Portability?
Ooo, an equation, fancy!
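\[
\mathcal{P}(a, p, H) =
\begin{cases}
\dfrac{|H|}{\sum_{i \in H} \dfrac{1}{e_i(a, p)}} & \text{if } i \text{ is supported } \forall i \in H \\
0 & \text{otherwise}
\end{cases}
\]
• H is a set of platforms
• a is the application
• p is the parameters for a
• e is the performance efficiency measure, with e_i(a, p) its value on platform i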

TeaLeaf
Don’t you hate it when you get bits of a mini-app in your cup?
• Linear Heat Conduction Mini-application
• Structured mesh, spatially decomposed using a 5-point stencil
• 4 solvers implemented, Conjugate Gradient (CG) solver used
• Part of the Mantevo project and ECP Proxy App suite, maintained by UK-MAC
• Originally written in Fortran, since converted to C++ by University of Bristol

Systems Used
These do compute things fast!
System: Key Information
• Intel Xeon E5-2660 v4 (Broadwell): 2 processors, each with 14 cores and 2 hyperthreads per core. 2.00GHz
• Intel Xeon Phi 7210 (Knights Landing/KNL): 1 processor with 64 cores and 4 hyperthreads per core. 1.30GHz, Flat memory mode, Quadrant clustering mode
• NVIDIA Tesla P100: 3840 single precision CUDA cores (1920 double precision CUDA cores)

Performance Portability (Architecture)
So… many… numbers!
Each entry is an efficiency (%) with the achieved rate in brackets. Arch. = architecture efficiency, BW = memory bandwidth efficiency.
Platform peaks: Xeon E5-2660 v4 2150.4 GFLOP/s and 86.8 GB/s; KNL 3000 GFLOP/s and 403.3 GB/s; P100 4700 GFLOP/s and 494.0 GB/s.
Row | Xeon E5-2660 v4 Arch. | Xeon E5-2660 v4 BW | KNL Arch. | KNL BW | P100 Arch. | P100 BW
1 | 0.96 (20.72 GFLOP/s) | 60.49 (52.9 GB/s) | 1.52 (45.59 GFLOP/s) | 91.61 (369.5 GB/s) | 2.36 (110.91 GFLOP/s) | 75.70 (373.9 GB/s)
2 | 1.35 (29.06 GFLOP/s) | 89.61 (77.8 GB/s) | 3.39 (101.80 GFLOP/s) | 95.93 (386.9 GB/s) | 2.83 (133.05 GFLOP/s) | 61.21 (302.4 GB/s)
3 | 2.73 (58.60 GFLOP/s) | 64.11 (55.6 GB/s) | 1.57 (47.24 GFLOP/s) | 23.59 (95.1 GB/s) | 5.30 (249.23 GFLOP/s) | 65.86 (325.3 GB/s)
4 | 0.91 (19.65 GFLOP/s) | 53.13 (46.1 GB/s) | 1.60 (48.15 GFLOP/s) | 60.87 (245.5 GB/s) | 1.87 (87.90 GFLOP/s) | 70.63 (348.9 GB/s)
Row labels and the 𝓟(CPU) (%) and 𝓟(CPU&GPU) (%) values are not present in the extracted text.

Performance Portability (Application)
Always running!
Columns: Eff. (Xeon E5-2660 v4) (%), Eff. (KNL) (%), 𝓟(CPU) (%), Eff. (P100) (%), 𝓟(CPU&GPU) (%)
