Dr Michael K Bane, G14, Computer Science, University of Liverpool
https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
17 – HAL
1. Assessment / Assignment
2. Plans
– labs this week
– break for Easter
– exam…
3. Shared Memory Programming with OpenMP
Chip for Japan’s Path to Exascale
Why bother with a single node?
• Performance can be pretty high
• Not everybody has a massive supercomputer with thousands of nodes
• Most people do have a multi-core laptop or desktop
• We still need to obtain a good portion of peak performance on each node
FLOP-wise comparison…
CHADWICK
• Node = 2 * SandyBridge chips @ 2.2 GHz
• A SandyBridge core does 8 FLOPs/cycle => 17.6 GFLOPS per core
• 8 cores per chip => 140.8 GFLOPS/chip
• 2 SandyBridge chips per node => 281.6 GFLOPS/node
To match the A64FX chip (~2700 GFLOPS) we would need ceiling(2700/281.6) = 10 nodes of Chadwick, and would also have to contend with:
• latency of the interconnect
• contention on the interconnect
• 10 nodes being free (not just 1)
Notes:
* A64FX is the chip in Fugaku (#1 supercomputer, June 2020)
* “Chadwick” (photo) is the predecessor to “Barkla”
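For the curious, a tiny C sketch reproducing the arithmetic above; the 2700 GFLOPS figure for the A64FX is taken from the slide's ceiling(2700/281.6) and is assumed to be its approximate double-precision peak:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Chadwick node: 2 x SandyBridge @ 2.2 GHz, 8 FLOPs/cycle, 8 cores/chip */
    double gflops_core = 2.2 * 8.0;            /* 17.6  GFLOPS per core */
    double gflops_chip = gflops_core * 8.0;    /* 140.8 GFLOPS per chip */
    double gflops_node = gflops_chip * 2.0;    /* 281.6 GFLOPS per node */

    double a64fx_gflops = 2700.0;              /* assumed A64FX peak, per the slide */
    printf("Chadwick: %.1f GFLOPS/node\n", gflops_node);
    printf("Nodes needed to match one A64FX: %.0f\n",
           ceil(a64fx_gflops / gflops_node));
    return 0;
}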
Shared v. Distributed Memory
[Diagrams: two nodes, each with Core 0–Core 3 and its own memory (distributed memory), versus a single set of Core 0–Core 3 attached to one shared memory]
How we start our OpenMP code takes a different approach from how we start MPI.
The way OpenMP uses memory is also very different from the MPI approach.
We cover these points shortly… (a minimal startup sketch follows below)
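A minimal sketch of the OpenMP startup model, for contrast with MPI: with MPI, mpirun launches many processes before main() even begins, whereas an OpenMP program starts as one ordinary process and only forks a team of threads when it reaches a parallel region. (This example is illustrative and not from the slides.)

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial part: one thread\n");

    #pragma omp parallel        /* parallel region: fork a team of threads */
    {
        int id = omp_get_thread_num();        /* each thread gets its own id */
        int nthreads = omp_get_num_threads();
        printf("hello from thread %d of %d (all sharing the same memory)\n",
               id, nthreads);
    }                           /* implicit barrier, then join back to one thread */

    printf("serial again: one thread\n");
    return 0;
}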
Shared Memory Programming
• Making use of all cores available
– Speed-up of 16x on Chadwick
– 40 cores on Barkla => 40x should be possible
(e.g. go from model run time of 1.5 days ==> 1 hour!)
– Speed-up of 48x on a64fx
– Speed-up of 2 or 4 on a typical laptop
– Speed-up of maybe 8 or even 32 on a workstation
• If you need more, you can do MPI between the nodes (distributed memory)
and shared-memory programming within each node (see the hybrid sketch below)
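A hedged sketch of that hybrid approach (not from the slides): one MPI process per node, each opening an OpenMP parallel region on its node's cores. MPI_THREAD_FUNNELED is requested because only the master thread makes MPI calls here.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank;
    /* MPI between nodes: request thread support since we mix in OpenMP */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory (OpenMP) parallelism within each node / MPI process */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}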
Shared Memory Programming
• Based on threads (rather than processes)
• Threads are more lightweight
• Context switching can be much less expensive
• So efficiency on shared memory can be better for threads
than processes
• Which parallel paradigm that we have used is based on processes…?
Processes v Threads (for interest, not for exam)
Processes:
• A process is the basic unit of work of the operating system
• The operating system is responsible for every aspect of the operation of a process:
– memory allocation
– scheduling CPU execution time
– assigning I/O devices, files
– etc
Threads:
• In many systems threads are handled at a higher level (priority) than processes, and can be switched without involvement of the OS kernel
• Thread creation, termination and switching are much faster than for processes
• A thread represents a piece of a process that can be executed independently of (in parallel with) other parts of the process
• Like a process, each thread has its own
– program counter value (~ “which instruction to execute next”)
– stack space
• Unlike separate processes, a thread shares with the other threads in the process
– program code
– data
– open files (+ other resources)
(see the sketch below for how this looks in OpenMP)
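A small illustrative OpenMP example (not from the slides) of exactly this split: global data is shared by every thread, while a variable declared inside the parallel region lives on each thread's own stack and is private.

#include <stdio.h>
#include <omp.h>

int shared_counter = 0;              /* global data: shared by all threads */

int main(void) {
    #pragma omp parallel
    {
        int my_id = omp_get_thread_num();    /* on this thread's own stack: private */

        #pragma omp atomic                   /* coordinate updates to shared data */
        shared_counter++;

        #pragma omp barrier                  /* wait until every thread has added 1 */

        #pragma omp single                   /* one thread reports the shared value */
        printf("shared data: counter = %d\n", shared_counter);

        printf("thread %d has its own private my_id\n", my_id);
    }
    return 0;
}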
How to Program Shared Memory…
• UNIX provides “POSIX Threads” (pThreads)
• Windows API provides threads
• In the 1980s–1990s, a whole raft of higher-level threading APIs appeared
• And the winner is: OpenMP
• Thread based
• Shared Memory
• Parallelisation via WORK-SHARING and TASKS
• FORTRAN, C, C++
• Possible to have a single source code for serial, parallel-development & parallel-release builds
– Why is this good? (see the sketch below)
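One answer: the same file compiles serially or in parallel just by toggling a compiler flag, since OpenMP pragmas are ignored by a non-OpenMP build. A minimal illustrative sketch (GCC uses -fopenmp; Intel compilers use -qopenmp):

#include <stdio.h>
#ifdef _OPENMP               /* defined only when compiled with OpenMP enabled */
#include <omp.h>
#endif

int main(void) {
    /* This pragma is simply ignored by a compiler run without OpenMP,
       so one source file serves both serial and parallel builds. */
    #pragma omp parallel
    {
#ifdef _OPENMP
        printf("parallel build: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
#else
        printf("serial build: just the one thread\n");
#endif
    }
    return 0;
}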
OpenMP for COMP528
• C only
• OpenMP version 4.5
– parallel regions
– work-sharing constructs
– data clauses
– synchronisation
– tasks
– accelerators (sort of!)
[Image: NVIDIA Tesla P40 – https://apy-groupe.ca/fr/pny-nvidia-tesla/133-nvidia-tesla-p40.html]
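A minimal sketch (illustrative, not from the slides) tying together a parallel region, a work-sharing loop, data clauses and a synchronisation construct from the list above; tasks and accelerators come later in the module.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int total = 0;

    /* parallel region with explicit data clauses */
    #pragma omp parallel default(none) shared(total)
    {
        int id = omp_get_thread_num();   /* declared inside the region: private */

        /* work-sharing construct: iterations split across the team,
           partial sums combined by the reduction data clause */
        #pragma omp for reduction(+:total)
        for (int i = 0; i < 100; i++) {
            total += 1;
        }
        /* implicit barrier here: the reduction into 'total' is complete */

        #pragma omp single               /* synchronisation: one thread only */
        printf("thread %d reports total = %d\n", id, total);
    }
    return 0;
}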
But be aware of…
• OpenMP v5.0 was announced at SC18 (Nov 2018)
• There are bits of OpenMP 4.5 we won't cover during this course module
• There is a one-stop shop for all things OpenMP: https://www.openmp.org
Example… recall quadrature
• The approximate integral is the sum of the areas under the line
• Each area is approximated by a rectangle whose height is the average of f at the strip edges (i.e. the trapezoidal rule)
• Calculate these areas in parallel – we compare MPI v. OpenMP…
[Figure: strips under the curve y = f(x); each strip runs from x to y = x + h and contributes area ≈ 0.5*h*[f(x) + f(y)]]
#include <stdio.h>
#include <mpi.h>

double func(double);

int main(void) {
    double a=0.0, b=6000.0;
    int num=10000000;                     // num of traps to sum
    double stepsize=(b-a)/(double) num;   // stepsize in x-space
    double x, mySum=0.0, globalSum=0.0;   // x-space, local and global sums
    int i, numPEs, myRank, myNum, myStartIter, myFinishIter;
    double t0, t1;                        // timers

    MPI_Init(NULL, NULL);
    t0 = MPI_Wtime();
    MPI_Comm_size(MPI_COMM_WORLD, &numPEs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* traps per process by varying myStartIter */
    myNum = num/numPEs;
    myStartIter = myRank*myNum;
    if (myRank==numPEs-1) {
        myNum = num - (numPEs-1)*myNum;   // last rank picks up any remainder
    }
    myFinishIter = myStartIter + myNum - 1;

    for (i=myStartIter; i<=myFinishIter; i++) {
        x = a + i*stepsize;               // x-space NOT iterations
        mySum += 0.5*stepsize*(func(x) + func(x+stepsize));
    }

    /* each process passes its sum back to root for the global sum */
    MPI_Reduce(&mySum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (myRank==0) {
        printf("TOTAL SUM: %f\n%d iterations took total wallclock %f milliseconds\n",
               globalSum, num, 1000.0*(t1-t0));
    }
    MPI_Finalize();
}
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double func(double);

int main(void) {
    double a=0.0, b=6000.0;
    int num=10000000;                     // num of traps to sum
    double stepsize=(b-a)/(double) num;   // stepsize in x-space
    double x, sum=0.0;                    // x-space and local summation
    int i;
    double t0, t1;                        // timers

    t0 = omp_get_wtime();
    #pragma omp parallel for default(none) \
        shared(num, a, stepsize) private(i,x) reduction(+:sum)
    for (i=0; i<num; i++) {
        x = a + i*stepsize;               // x-space NOT iterations
        sum += 0.5*stepsize*(func(x) + func(x+stepsize));
    }
    t1 = omp_get_wtime();

    printf("TOTAL SUM: %f\n%d iterations took total wallclock %f milliseconds\n",
           sum, num, 1000.0*(t1-t0));
}
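To try the OpenMP version: build with OpenMP enabled and set the thread count at run time. The compiler flag and file name below are illustrative (GCC shown; Intel compilers use -qopenmp), and func() is never defined on these slides, so the body given here is only a placeholder integrand.

/* Build:   gcc -fopenmp quad_omp.c -o quad_omp
 * Run:     OMP_NUM_THREADS=4 ./quad_omp
 *
 * The slides never show func(); any integrand works, for example: */
double func(double x) {
    return x;        /* placeholder integrand, for illustration only */
}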