
Microsoft PowerPoint – COMP528 HAL20 OpenMP performance matters.pptx

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

20 – HAL

• Thread based

• Shared Memory

• Fork-Join model

• Parallelisation via
WORK-SHARING and
TASKS

• FORTRAN, C, C++

• Directives +
Environment Variables +
Run Time

• OpenMP version 4.5
– parallel regions
– work sharing constructs
– data clauses
– synchronisation
– tasks
– accelerators (sort of!)
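As a minimal sketch of this model (the loop, the reduction variable and the printed message are illustrative, not from the slides): a directive forks a team of threads, a worksharing construct splits the loop across the team, a run-time library call reports the team size, and the OMP_NUM_THREADS environment variable controls how many threads are forked.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;

    /* fork: the directive creates a team of threads
       (team size set e.g. via OMP_NUM_THREADS) */
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());  /* run-time library call */

        /* work-sharing: iterations divided among the team,
           partial sums combined by the reduction clause */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += i;
    }   /* join: implicit barrier, then only the initial thread continues */

    printf("sum = %.0f\n", sum);
    return 0;
}

Compile with OpenMP enabled (e.g. gcc -fopenmp) and set OMP_NUM_THREADS before running.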


Background Reading
• “Using OpenMP – The Next Step: Affinity, Accelerators,
Tasking and SIMD”, van der Pas et al., MIT Press (2017)
https://ieeexplore-ieee-org.liverpool.idm.oclc.org/xpl/bkabstractplus.jsp?bkn=8169743

– Homework: read Chapter 1 (a nice recap of v2.5 of OpenMP)

• “Using OpenMP: Portable Shared Memory Parallel Programming”,
Chapman et al., MIT Press (2007)
– https://ebookcentral.proquest.com/lib/liverpool/reader.action?docID=3338748&ppg=60

– Based on v2.5 so it does not cover: tasks, accelerators, some other refinements

PERFORMANCE MATTERS FOR
SHARED MEMORY PROGRAMMING

Performant OpenMP

• Granularity

• Load imbalance
– Scheduling

– (and not waiting…)

• First Touch

• Affinity

• False Sharing

Performant OpenMP

• Granularity

Fine grained OpenMP

Coarse grained OpenMP
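A minimal sketch of the contrast (array names and loop bodies are illustrative): fine-grained OpenMP pays a fork/join per loop, while coarse-grained OpenMP keeps one parallel region spanning several worksharing loops.

#include <omp.h>
#define N 1000
double a[N], b[N], c[N];

void fine_grained(void) {
    /* fine grained: a separate parallel region (fork + join) per loop */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] = 2.0 * b[i];

    #pragma omp parallel for
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
}

void coarse_grained(void) {
    /* coarse grained: one parallel region, several worksharing loops
       (less fork/join overhead; implicit barrier after each "omp for") */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++) a[i] = 2.0 * b[i];

        #pragma omp for
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }
}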


Performant OpenMP

• Granularity

• Load imbalance
– Scheduling

– (and not waiting…)

• Without a schedule clause, iterations are
divided into contiguous “blocks”
• For some examples this leads to
load imbalance
• More work => longer time
=> other threads just waiting
(rather than doing something
useful)

• With appropriate scheduling such as
“round robin” (or “cyclic”)
• Could aid load balance
• More equal sharing of work =>
all threads doing something useful
=> all finish quicker
(see the sketch below)
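A sketch of such an imbalanced loop (the inner work is illustrative): the cost of iteration i grows with i, so the default block split leaves the thread holding the last block doing most of the work, whereas a cyclic assignment via schedule(static,1) mixes cheap and expensive iterations on every thread.

#include <omp.h>
#define N 10000
double work[N];

void block_split(void) {
    /* default (block) split: the thread given the largest i values
       does far more work; the others wait at the implicit barrier */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < i; j++)      /* cost grows with i */
            work[i] += 1.0;
}

void cyclic_split(void) {
    /* "round robin" / "cyclic": with P threads, thread 0 gets iterations
       0, P, 2P, ... so each thread gets a mix of cheap and expensive work */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < i; j++)
            work[i] += 1.0;
}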


• We have seen that load imbalance can be
an issue & that the optional “schedule”
clause can help

• schedule(type, chunksize)
– type: static | dynamic | guided | auto | runtime

– chunksize: int (or int expr) – optional

• schedule(runtime) – uses value of env var OMP_SCHEDULE
export OMP_SCHEDULE="guided,10"

Scheduling of Loops

schedule(type, chunksize)
– type: static | dynamic | guided | runtime

– chunksize: int (or int expr) – optional

• (static): iterations divided into ~equal blocks with 1st

block on 1st thread, 2nd block on second thread, …

• (static, N): block of N iterations assigned in round-
robin fashion

• (dynamic): chunks dynamically assigned to threads
as they become free

• (guided): chunks of decreasing size are dynamically
assigned to the threads as they become available.
chunksize is min #iters handed out

• (auto): leave it to the compiler and/or run-time
system to determine what is best, with the
presumption that after a few passes through a
given for loop it will determine the best
scheduling…

[Diagram: iteration-to-thread assignment for
(static), i.e. the default if no explicit schedule,
versus (static,1)]
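A sketch of how these schedule kinds appear on a loop (the loop body is illustrative); schedule(runtime) takes its value from OMP_SCHEDULE as shown earlier.

#include <omp.h>
#define N 100000
double y[N];

void schedule_examples(void) {
    #pragma omp parallel for schedule(static)       /* ~equal contiguous blocks */
    for (int i = 0; i < N; i++) y[i] += 1.0;

    #pragma omp parallel for schedule(static, 10)   /* blocks of 10, round-robin */
    for (int i = 0; i < N; i++) y[i] += 1.0;

    #pragma omp parallel for schedule(dynamic, 10)  /* chunks of 10 grabbed by free threads */
    for (int i = 0; i < N; i++) y[i] += 1.0;

    #pragma omp parallel for schedule(guided, 10)   /* decreasing chunks, at least 10 iters */
    for (int i = 0; i < N; i++) y[i] += 1.0;

    #pragma omp parallel for schedule(runtime)      /* kind read from OMP_SCHEDULE */
    for (int i = 0; i < N; i++) y[i] += 1.0;
}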

Performant OpenMP

• Granularity

• Load imbalance
– Scheduling

– (and not waiting…)

Performance / Load Imbalance / Not waiting

• implicit barriers at
– end of worksharing constructs “omp for”, “omp single” (& others)

• but it is not always necessary to have a barrier

• this can be removed with the “nowait” clause for the
worksharing construct

• it is up to user to ensure program remains correct

nowait examples…

• Consider 2 independent loops within a par region
– independent: we could do either loop first, or second

– so why should second loop wait for first to finish?

• Potential performance enhancement by using “nowait”
– a clause on “omp for”, e.g.
#pragma omp for nowait

#pragma omp parallel default(none) shared(NUM,A,x,y,res) private(i,j,k) private(cksum)
{
   #pragma omp for nowait   /* no barrier needed here: the next loop is independent */
   for (i=0; i<NUM; i++) { /* ... first independent loop ... */ }

   #pragma omp for
   for (j=0; j<NUM; j++) { /* ... second independent loop ... */ }
}

Performant OpenMP

• First Touch
– which thread (=> core) initialises a
var dictates its location in memory

• Consider arrays where the work
happens across >1 thread:

#pragma omp parallel for
for (int i=m; i<n; i++)
   ...
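A minimal sketch of exploiting first touch (the array, its size and the operations are illustrative): initialise the array in parallel with the same distribution as the compute loop, so each element is first touched, and therefore placed, by the thread/core that will later work on it.

#include <stdlib.h>
#include <omp.h>
#define N 10000000

int main(void) {
    double *A = malloc(N * sizeof(double));

    /* first touch: initialise in parallel with the same (static)
       distribution as the compute loop, so pages end up local to
       the thread/core that will use them */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        A[i] = 0.0;

    /* compute loop: same distribution => mostly local memory accesses */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        A[i] = 2.0 * A[i] + 1.0;

    free(A);
    return 0;
}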

Not covered in this course but… “SCOPE”

• It’s all about the dynamic scope
– Can set up parallel region

within one C function

– But use worksharing within
another C function
that is called later

https://computing.llnl.gov/tutorials/openMP/#Scoping
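A minimal sketch of this dynamic scope (function names are illustrative): the parallel region is opened in one C function, and the worksharing directive lives in another function called from inside that region (an "orphaned" directive).

#include <stdio.h>
#include <omp.h>

/* the "omp for" here is orphaned: it binds to whichever
   parallel region is active when this function is called */
void do_work(int n, double *x) {
    #pragma omp for
    for (int i = 0; i < n; i++)
        x[i] = 2.0 * i;
}

int main(void) {
    double x[100];

    #pragma omp parallel      /* parallel region set up in this function... */
    do_work(100, x);          /* ...worksharing happens in the called function */

    printf("x[99] = %f\n", x[99]);
    return 0;
}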

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane