Dr Michael K Bane, G14, Computer Science, University of Liverpool
https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
23 – HAL
OPENMP TASKS (OVERVIEW)
Background Reading
• “Using OpenMP – The Next Step: Affinity, Accelerators,
Tasking and SIMD”, van der Pas et al. MIT Press (2017)
– https://ieeexplore-ieee-
org.liverpool.idm.oclc.org/xpl/bkabstractplus.jsp?bkn=8169743
– Homework: read Chapter 1 (a nice recap of v2.5 of OpenMP)
– NEW homework: read Chapter 3 (tasking);
optional: Chapter 4 (thread affinity)
• “Using OpenMP: Portable Shared Memory Parallel Programming”,
Chapman et al. MIT Press (2007)
https://ebookcentral.proquest.com/lib/liverpool/reader.action?docID=3338748&ppg=60
– v2.5 so does not cover: tasks, accelerators, some refinements
Tasks
• Rather than considering the workflow in an imperative style
– stmt then stmt then stmt
• … and work-sharing across a for loop
• Can we identify specific “tasks”, push them to a queue, and let the
system run the next appropriate task when resources become available?
– closer to data flow
– “appropriate” => the user defines some dependencies; the
run-time system (RTS) runs a task once they are resolved
Tasks / Tasking
• think task (as in a bundle of work), not threads
• can be used to exploit parallelism in workloads that are not
a set of for loops (see the linked-list sketch below)
• also supports ‘while’ loops & recursion, & some dynamic
load balancing
• “task parallelism” (using data flow)
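Since the slide only names the capability, here is a minimal sketch
(not from the course demos) of tasking over a linked list – a
‘while’-style workload that “omp for” cannot express directly; the
node type and process() function are illustrative assumptions.

    #include <omp.h>

    typedef struct node { int data; struct node *next; } node_t;

    void process(node_t *p);               /* assumed per-node work */

    void traverse(node_t *head) {
        #pragma omp parallel
        #pragma omp single                 /* one thread walks the list... */
        {
            node_t *p = head;
            while (p != NULL) {
                #pragma omp task firstprivate(p)  /* ...spawning one task per node */
                process(p);
                p = p->next;
            }
        }   /* implicit barrier: all tasks complete before leaving the region */
    }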
Tasks: Concept
• create task
• add to the task pool
• a little later we may have created more tasks, queued in the pool
• BUT we can also run tasks on available resources
– a task runs on an available thread
– available resource => assign a task, & repeat
– start new tasks when resources become available
• aha, a dependency: the purple task depends on all the yellow tasks
having finished (a sketch with the depend clause follows below)
[animated diagram: over TIME, tasks are created, added to a pool, then
dispatched to available threads as they free up]
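A hedged sketch of how such a dependency can be written with the
depend clause (OpenMP 4.0 onwards); the array and the do_work() /
combine() functions are illustrative, not from the course demos.

    int do_work(int i);      /* illustrative functions */
    void combine(int *x);

    void yellow_then_purple(void) {
        int x[4];
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < 4; i++) {
                #pragma omp task depend(out: x[i])   /* the four "yellow" tasks */
                x[i] = do_work(i);
            }
            #pragma omp task depend(in: x[0], x[1], x[2], x[3])
            combine(x);   /* the "purple" task: deferred until all four are done */
        }
    }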
Outline of syntax
• task creation
– done in a parallel region
– each task is created on a single thread
• often (in textbooks) within a single-threaded region
– master | single (maybe critical, provided the same task is not
created by every thread)
• but can be the individual iterations of a parallelised “omp for” loop

#pragma omp task
{
    /* this structured block becomes the task */
}
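Filling that outline in, a minimal runnable sketch of the textbook
pattern (a parallel region, a single thread creating the tasks, the
whole team executing them):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        #pragma omp single          /* only one thread creates the tasks */
        {
            #pragma omp task
            printf("task A on thread %d\n", omp_get_thread_num());

            #pragma omp task
            printf("task B on thread %d\n", omp_get_thread_num());
        }   /* implicit barrier: both tasks have run by here */
        return 0;
    }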
Running the task
• the run-time decides!
– might be immediate
– might be deferred
• BUT the programmer can use synchronisation to control when a task
must have run by (or rather, where to wait until the task(s) have
run) – see the sketch below
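For example, “#pragma omp taskwait” makes the current task wait for
the child tasks it has created so far; a sketch (step_one/two/three
are illustrative):

    void step_one(void); void step_two(void); void step_three(void);

    void pipeline(void) {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            step_one();
            #pragma omp task
            step_two();
            #pragma omp taskwait   /* both child tasks guaranteed complete here */
            step_three();          /* safe to consume their results */
        }
    }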
Definitions
• task construct – the “#pragma omp task” directive together with its
structured block
• task – the actual instructions (& data) created when a thread runs
(“encounters”) the task construct
– different encounters of the same task construct generate different
tasks (sketch below)
• task region – the code encountered during the execution of a task
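To illustrate the second definition: one task construct in the source,
encountered N times, generates N distinct tasks (a sketch; N and
work_on() are illustrative):

    void work_on(int i);

    void spawn(int N) {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < N; i++) {
            #pragma omp task firstprivate(i)   /* i-th encounter => i-th task */
            work_on(i);
        }
    }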
Two Examples…
• DEMO SUB DIR:
~mkbane/HPC_DEMOS/openmp-tasks
1. Simple, non-recursive: storyTelling.c
– print out adjectives, in any order
– (you should also see what happens if you comment out the
“taskwait” directive)
Storytelling (task example)
• 4 tasks; each can run on a different OpenMP thread
• no ordering of tasks; we don’t know (a priori) which thread a
task will run on
• but all tasks have to complete at the ‘taskwait’ before any
thread can continue past that point (cf. a “barrier”)
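The demo source is not reproduced on the slide; a plausible minimal
sketch of storyTelling.c (the actual adjectives and layout are
assumptions):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            printf("quick ");        /* four independent tasks: */
            #pragma omp task
            printf("brown ");        /* the adjectives can appear */
            #pragma omp task
            printf("sly ");          /* in any order */
            #pragma omp task
            printf("hungry ");

            #pragma omp taskwait     /* comment out and "fox" may print first */
            printf("fox\n");
        }
        return 0;
    }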
Two Examples…
2. Fibonacci series:
fib[i] = 1, 1, 2, 3, 5, 8, 13, …
e.g. fib[6] is the 6th entry, i.e. 8
compile with -DMKB_PRINT=printf to output,
or -DMKB_PRINT=”//” to suppress output
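The demo source is not on the slide; a minimal sketch of a recursive
tasking Fibonacci in the same spirit (the real ex_fib_tasks*.c, and
its use of the MKB_PRINT macro, may differ):

    #include <stdio.h>
    #include <omp.h>

    long fib(int n) {
        long x, y;
        if (n <= 2) return 1;        /* fib[1] = fib[2] = 1, as above */

        #pragma omp task shared(x)   /* each recursive call becomes a task */
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait         /* need both children before summing */
        return x + y;
    }

    int main(void) {
        long f;
        double t0 = omp_get_wtime();
        #pragma omp parallel
        #pragma omp single           /* the root call; tasks fan out from it */
        f = fib(40);
        printf("fib 40 = %ld; time taken:%f on %d threads\n",
               f, omp_get_wtime() - t0, omp_get_max_threads());
        return 0;
    }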
Fibonacci Recursive Tasks Example
PERFORMANCE
3. ex_fib_tasks_LARGE.c for fib(NN=40) so we can time it
(best of 10 runs on an exclusive ‘course’ node (Intel Skylake chips),
with OMP_DYNAMIC & OMP_PROC_BIND set as recommended,
using the Intel compiler with zero optimisation)
fib 40 = 102334155; time taken:19.359815 on 1 threads
fib 40 = 102334155; time taken:93.365958 on 2 threads
fib 40 = 102334155; time taken:63.541324 on 3 threads
fib 40 = 102334155; time taken:48.035683 on 4 threads
fib 40 = 102334155; time taken:25.048191 on 8 threads
fib 40 = 102334155; time taken:17.036860 on 12 threads
WHAT DO YOU THINK OF THE PERFORMANCE?
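One reading of these numbers: adding threads initially makes it much
slower, because fib(40) makes on the order of 10^8 recursive calls,
each spawning tiny tasks, and the creation/scheduling overhead swamps
the useful work; only at 12 threads does it beat the serial run. A
common mitigation (not in the demo; CUTOFF is an illustrative value)
is the ‘final’ clause, which makes small subproblems execute
immediately and serially:

    #define CUTOFF 20                /* illustrative; tune empirically */

    long fib_cutoff(int n) {
        long x, y;
        if (n <= 2) return 1;

        /* below the cutoff, 'final' makes the task (and all its
           descendants) execute immediately on the encountering thread,
           avoiding per-task overhead on tiny subproblems */
        #pragma omp task shared(x) final(n <= CUTOFF)
        x = fib_cutoff(n - 1);
        #pragma omp task shared(y) final(n <= CUTOFF)
        y = fib_cutoff(n - 2);

        #pragma omp taskwait
        return x + y;
    }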
• overheads of tasks can be high
– if “omp parallel for” fits, use it
• can arrange for a task to run on a team of threads
• can nest tasks
• can use tasks for dynamic load balancing of irregular
problems
Tasks
• useful to know
• worth consideration…
• in-depth further reading:
– chapter 3 of “Using OpenMP: The Next Step…”
– https://hpc-forge.cineca.it/files/ScuolaCalcoloParallelo_WebDAV/public/anno-2016/12_Advanced_School/OpenMP_4_tasks.pdf (Cineca)
RECAP
SHARED MEMORY
• Memory on chip
– Faster access
– Limited to that memory
– … and to those nodes
• Programming typically OpenMP
– Directives-based + environment vars + run-time functions
– Incremental changes to code (e.g. loop by loop)
– Portable to single core / non-OpenMP
• Single code base (or use of “stubs”)
DISTRIBUTED MEMORY
• Access memory of another node
– Latency & bandwidth issues
– Which interconnect: IB v. GigE (v. etc.)
– Expandable (memory & nodes)
• Programming 99% always MPI
– Message Passing Interface
– Library calls more intrusive
– Different MPI libs / implementations
– Non-portable to non-MPI (without effort)
OpenMP
• Only for “globally addressable” shared
memory
– (generally) only a single node
• Threads
• Fork-join model: parallel regions
– Work-sharing
– Tasks
• Need to think about
private(x) v. shared(x) – sketch below
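A one-glance reminder of the distinction (a sketch; heavy_calc() is an
illustrative function):

    double heavy_calc(int i);          /* illustrative */

    void fill(double *a, int n) {
        double tmp;
        /* a is shared: threads write disjoint elements of one array;
           tmp is private: each thread needs its own scratch copy */
        #pragma omp parallel for private(tmp) shared(a, n)
        for (int i = 0; i < n; i++) {
            tmp = heavy_calc(i);
            a[i] = 2.0 * tmp;
        }
    }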
MPI
• For distributed memory
– Includes subset of a single node
• Processes
• Each process on a different core
• Need to think message-passing in
order to share information
OpenMP v. MPI on Barkla

                            OpenMP                                  MPI
Intel modules               intel                                   intel intel-mpi
compile (no optimisation)   icc -qopenmp -O0 myOMP.c -o myOMP.exe   mpiicc -O0 myMPI.c -o myMPI.exe
run on 7 cores              export OMP_NUM_THREADS=7                mpirun -np 7 ./myMPI.exe
                            ./myOMP.exe
• OpenMP for Accelerators…