Dr Michael K Bane, G14, Computer Science, University of Liverpool
https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
17 – HAL
1. Assessment / Assignment
2. Plans
– labs this week
– break for Easter
– exam…
3. Shared Memory Programming with OpenMP
Chip for Japan’s Path to Exascale
Why bother with a single node?
• Performance can be pretty high
• Not everybody has a massive supercomputer with thousands of nodes
• Most people do have a multi-core laptop or desktop
• We still need to obtain a good portion of peak performance on each node
FLOP-wise comparison…
CHADWICK
• Node = 2 * SandyBridge chips @ 2.2 GHz
• A SandyBridge core does 8 FLOPs/cycle => 17.6 GFLOPS per core
• 8 cores per chip => 140.8 GFLOPS/chip
• 2 SandyBridge chips per node => 281.6 GFLOPS/node
To match the A64FX chip (~2700 GFLOPS) we would need ceiling(2700/281.6) = 10 nodes of Chadwick, and would also have to contend with:
• latency of the interconnect
• contention on the interconnect
• 10 nodes being free (not just 1)
Notes:
* A64FX is the chip in Fugaku (#1 supercomputer, June 2020)
* “Chadwick” (photo) is the predecessor to “Barkla”
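For the curious, a tiny C sketch reproducing the arithmetic above; the 2700 GFLOPS figure for the A64FX is taken from the slide's ceiling(2700/281.6) and is assumed to be its approximate double-precision peak:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Chadwick node: 2 x SandyBridge @ 2.2 GHz, 8 FLOPs/cycle, 8 cores/chip */
    double gflops_core = 2.2 * 8.0;            /* 17.6  GFLOPS per core */
    double gflops_chip = gflops_core * 8.0;    /* 140.8 GFLOPS per chip */
    double gflops_node = gflops_chip * 2.0;    /* 281.6 GFLOPS per node */

    double a64fx_gflops = 2700.0;              /* assumed A64FX peak, per the slide */
    printf("Chadwick: %.1f GFLOPS/node\n", gflops_node);
    printf("Nodes needed to match one A64FX: %.0f\n",
           ceil(a64fx_gflops / gflops_node));
    return 0;
}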
Shared v. Distributed Memory
[Diagrams: two nodes, each with Core 0–Core 3 and its own memory (distributed memory), versus a single set of Core 0–Core 3 attached to one shared memory]
How we start our OpenMP code takes a different approach from how we start MPI.
The way OpenMP uses memory is also very different from the MPI approach.
We cover these points shortly… (a minimal startup sketch follows below)
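A minimal sketch of the OpenMP startup model, for contrast with MPI: with MPI, mpirun launches many processes before main() even begins, whereas an OpenMP program starts as one ordinary process and only forks a team of threads when it reaches a parallel region. (This example is illustrative and not from the slides.)

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial part: one thread\n");

    #pragma omp parallel        /* parallel region: fork a team of threads */
    {
        int id = omp_get_thread_num();        /* each thread gets its own id */
        int nthreads = omp_get_num_threads();
        printf("hello from thread %d of %d (all sharing the same memory)\n",
               id, nthreads);
    }                           /* implicit barrier, then join back to one thread */

    printf("serial again: one thread\n");
    return 0;
}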
Shared Memory Programming
• Making use of all cores available
– Speed-up of 16x on Chadwick
– 40 cores on Barkla => 40x should be possible
(e.g. go from model run time of 1.5 days ==> 1 hour!)
– Speed-up of 48x on a64fx
– Speed-up of 2 or 4 on a typical laptop
– Speed-up of maybe 8 or even 32 on a workstation
• If you need more, you can do MPI between the nodes (distributed memory)
and shared-memory programming within each node (see the hybrid sketch below)
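A hedged sketch of that hybrid approach (not from the slides): one MPI process per node, each opening an OpenMP parallel region on its node's cores. MPI_THREAD_FUNNELED is requested because only the master thread makes MPI calls here.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank;
    /* MPI between nodes: request thread support since we mix in OpenMP */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory (OpenMP) parallelism within each node / MPI process */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}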
Shared Memory Programming
• Based on threads (rather than processes)
• Threads are more lightweight
• Context switching can be much less expensive
• So efficiency on shared memory can be better for threads
than processes
• Which parallel paradigm that we have used is based on processes…?
Processes v Threads (for interest, not for exam)
Processes:
• A process is the basic unit of work of the operating system
• The operating system is responsible for every aspect of the operation of a process:
– memory allocation
– scheduling CPU execution time
– assigning I/O devices, files
– etc
Threads:
• In many systems threads are handled at a higher level (priority) than processes, and can be switched without involvement of the OS kernel
• Thread creation, termination and switching are much faster than for processes
• A thread represents a piece of a process that can be executed independently of (in parallel with) other parts of the process
• Like a process, each thread has its own
– program counter value (~ “which instruction to execute next”)
– stack space
• Unlike separate processes, a thread shares with the other threads in the process
– program code
– data
– open files (+ other resources)
(see the sketch below for how this looks in OpenMP)
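A small illustrative OpenMP example (not from the slides) of exactly this split: global data is shared by every thread, while a variable declared inside the parallel region lives on each thread's own stack and is private.

#include <stdio.h>
#include <omp.h>

int shared_counter = 0;              /* global data: shared by all threads */

int main(void) {
    #pragma omp parallel
    {
        int my_id = omp_get_thread_num();    /* on this thread's own stack: private */

        #pragma omp atomic                   /* coordinate updates to shared data */
        shared_counter++;

        #pragma omp barrier                  /* wait until every thread has added 1 */

        #pragma omp single                   /* one thread reports the shared value */
        printf("shared data: counter = %d\n", shared_counter);

        printf("thread %d has its own private my_id\n", my_id);
    }
    return 0;
}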
How to Program Shared Memory…
• UNIX provides “POSIX Threads” (pThreads)
• Windows API provides threads
• In the 1980s–1990s, a whole raft of higher-level threading APIs appeared
• And the winner is: OpenMP
• Thread based
• Shared Memory
• Parallelisation via WORK-SHARING and TASKS
• FORTRAN, C, C++
• Possible to have a single source code for serial, parallel-development & parallel-release builds
– Why is this good? (see the sketch below)
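One answer: the same file compiles serially or in parallel just by toggling a compiler flag, since OpenMP pragmas are ignored by a non-OpenMP build. A minimal illustrative sketch (GCC uses -fopenmp; Intel compilers use -qopenmp):

#include <stdio.h>
#ifdef _OPENMP               /* defined only when compiled with OpenMP enabled */
#include <omp.h>
#endif

int main(void) {
    /* This pragma is simply ignored by a compiler run without OpenMP,
       so one source file serves both serial and parallel builds. */
    #pragma omp parallel
    {
#ifdef _OPENMP
        printf("parallel build: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
#else
        printf("serial build: just the one thread\n");
#endif
    }
    return 0;
}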
OpenMP for COMP528
• C only
• OpenMP version 4.5
– parallel regions
– work-sharing constructs
– data clauses
– synchronisation
– tasks
– accelerators (sort of!)
[Image: NVIDIA Tesla P40 – https://apy-groupe.ca/fr/pny-nvidia-tesla/133-nvidia-tesla-p40.html]
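A minimal sketch (illustrative, not from the slides) tying together a parallel region, a work-sharing loop, data clauses and a synchronisation construct from the list above; tasks and accelerators come later in the module.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int total = 0;

    /* parallel region with explicit data clauses */
    #pragma omp parallel default(none) shared(total)
    {
        int id = omp_get_thread_num();   /* declared inside the region: private */

        /* work-sharing construct: iterations split across the team,
           partial sums combined by the reduction data clause */
        #pragma omp for reduction(+:total)
        for (int i = 0; i < 100; i++) {
            total += 1;
        }
        /* implicit barrier here: the reduction into 'total' is complete */

        #pragma omp single               /* synchronisation: one thread only */
        printf("thread %d reports total = %d\n", id, total);
    }
    return 0;
}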
But be aware of…
• OpenMP v5.0 was announced at SC18 (Nov 2018)
• There are bits of OpenMP 4.5 we won't cover during this course module
• There is a one-stop shop for all things OpenMP: https://www.openmp.org
Example… recall quadrature
• The approximate integral is the sum of the areas under the line
• Each area is approximated by a rectangle whose height is the average of f at the strip edges (i.e. the trapezoidal rule)
• Calculate these areas in parallel – we compare MPI v. OpenMP…
[Figure: strips under the curve y = f(x); each strip runs from x to y = x + h and contributes area ≈ 0.5*h*[f(x) + f(y)]]
#include <stdio.h>
#include <mpi.h>

double func(double);

int main(void) {
    double a=0.0, b=6000.0;
    int num=10000000;                     // num of traps to sum
    double stepsize=(b-a)/(double) num;   // stepsize in x-space
    double x, mySum=0.0, globalSum=0.0;   // x-space, local and global sums
    int i, numPEs, myRank, myNum, myStartIter, myFinishIter;
    double t0, t1;                        // timers

    MPI_Init(NULL, NULL);
    t0 = MPI_Wtime();
    MPI_Comm_size(MPI_COMM_WORLD, &numPEs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* traps per process by varying myStartIter */
    myNum = num/numPEs;
    myStartIter = myRank*myNum;
    if (myRank==numPEs-1) {
        myNum = num - (numPEs-1)*myNum;   // last rank picks up any remainder
    }
    myFinishIter = myStartIter + myNum - 1;

    for (i=myStartIter; i<=myFinishIter; i++) {
        x = a + i*stepsize;               // x-space NOT iterations
        mySum += 0.5*stepsize*(func(x) + func(x+stepsize));
    }

    /* each process passes its sum back to root for the global sum */
    MPI_Reduce(&mySum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (myRank==0) {
        printf("TOTAL SUM: %f\n%d iterations took total wallclock %f milliseconds\n",
               globalSum, num, 1000.0*(t1-t0));
    }
    MPI_Finalize();
}
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double func(double);

int main(void) {
    double a=0.0, b=6000.0;
    int num=10000000;                     // num of traps to sum
    double stepsize=(b-a)/(double) num;   // stepsize in x-space
    double x, sum=0.0;                    // x-space and local summation
    int i;
    double t0, t1;                        // timers

    t0 = omp_get_wtime();
    #pragma omp parallel for default(none) \
        shared(num, a, stepsize) private(i,x) reduction(+:sum)
    for (i=0; i<num; i++) {
        x = a + i*stepsize;               // x-space NOT iterations
        sum += 0.5*stepsize*(func(x) + func(x+stepsize));
    }
    t1 = omp_get_wtime();

    printf("TOTAL SUM: %f\n%d iterations took total wallclock %f milliseconds\n",
           sum, num, 1000.0*(t1-t0));
}
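To try the OpenMP version: build with OpenMP enabled and set the thread count at run time. The compiler flag and file name below are illustrative (GCC shown; Intel compilers use -qopenmp), and func() is never defined on these slides, so the body given here is only a placeholder integrand.

/* Build:   gcc -fopenmp quad_omp.c -o quad_omp
 * Run:     OMP_NUM_THREADS=4 ./quad_omp
 *
 * The slides never show func(); any integrand works, for example: */
double func(double x) {
    return x;        /* placeholder integrand, for illustration only */
}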