Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
4 – SYNC
This session will be
RECORDED…
This allows you to revise later.
The session will be mainly me
presenting…
but it is good to have interaction.
If anybody wishes not to be
recorded, I can pause the recording.
COMP328/528 (c) univ of liverpool
“Timetable” [updates]
• Lectures & Background
• Much more in CANVAS now
• 01/HAL + 02/sync intro
• 03/HAL Terminology, “Top 500”:
• 04/sync (today) performance considerations, …
• Async lectures via CANVAS (i.e. online)
• loosely based on traditional 30 hours, released in blocks, 1 or 2 per week
• Synchronous online
• Week 3: was to be a chance to meet you, but will be cancelled and replaced with async materials
• Weeks 5, 8: to explain the assessed assignments (including feedback) + some lecture materials
• Fridays, weeks 2, 6, 10, 12: lab solutions, ensuring the materials make sense, Q&A
• Labs
• First week complete…
• Assessment: the first MCQ will be on the theory we’re covering.
Week 4; specific time TBC.
Terminology &
Performance
Considerations
Consider…
• 10 balls of each of 5 colours (i.e. 50 in total): red, yellow,
blue, green, white
mixed randomly on the floor
AIM: to have 5 piles with only one colour in each pile
• 1 person can sort in 60 seconds
• If we repeat it with the same person – how long?
• (1) How would you get 5 people to sort? How long will it take?
• (2) How could you get 3 people to sort? How long?
• (3) What is the ideal number of people to sort the quickest?
– What about, say, 7 people?
– Is 60 good? Is 61 good?
Lessons…
• Parallelisable
• Elements that can be done independently, and thus work can be shared out
• Different ways to parallelise, e.g. for this example:
• Each person takes a colour => global pile of that colour
• Each person sorts their area => global piles
• Each person sorts their area into their own local piles of each colour, then a final step combines
all local piles into global piles
• There are “overheads”
• Communication is needed for each actor to know their role (which colour ball /
which area), and potentially to combine local results into global results
• There can be “load imbalance”
• e.g. 60/7 is non-integer
• More is not always better
• e.g. 60/10 may be okay, but 60/30 may have people running into each other
etc. – “contention”
• Is there any point in having 61 people?
• In practice, one would try different numbers to find the most efficient…
• Expected time (without load imbalance) is work/#people
• Occasionally we can do even better than that!
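The work/#people estimate (and why person 61 adds nothing over person 60) can be sketched in C; the function name and the assumption of equal-sized work units are mine, and overheads and contention are ignored:

```c
#include <assert.h>

/* With `work` equal-sized units shared among `p` people, the finish
   time is set by the busiest person: the integer ceiling of work/p.
   (Illustrative sketch; ignores communication overheads and contention.) */
int finish_time(int work, int p) {
    return (work + p - 1) / p;   /* ceiling division */
}
```

For example, finish_time(60, 5) is 12, but finish_time(60, 7) is 9 rather than the ideal 60/7 ≈ 8.6 (load imbalance), and finish_time(60, 61) is no better than finish_time(60, 60).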
Quick Quiz
• Do you know your computer hardware?
[Image credits: https://www.linkedin.com/in/heather-gong-127394ba; Intel / EPCC; NVIDIA]
[Image credit: Antoine Bercovici – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=29023372]
CPU – Intel, AMD, ARM (as IP)
• 1 to maybe 64 cores, running at 2 to 3 GHz
• Powerful cores: out-of-order, look-ahead. Good for general purpose and generally good
• 1–2 sockets direct on the motherboard

GPU – NVIDIA, AMD
• 15 to 56 “streaming multiprocessors” (SMs), each with 64–128 “CUDA Cores”. Base freq about 1 GHz
• SMs are good for high throughput of vector arithmetic
• AMD produced “fused” CPU & GPU. Until 2016, NVIDIA cards sat at the far end of the PCI-e bus. In 2016, NVIDIA began working with IBM on an on-board solution using “NVLink”

Xeon Phi – Intel
• 60–70 cores
• Low grunt but general-purpose cores
• KNC is PCI-e; KNL (2016) is standalone

FPGA – Altera (Intel), Xilinx
• Fabric to design your own layout – and reconfigurable
• Can use Verilog or VHDL to map a design. MATLAB can also be used. Maxeler uses Java
• Focus needs to be on the data flow

ASIC
• Anton-2 uses a custom ASIC for MD calcs. Very fast but not necessarily low power
• If you’re designing ASICs you needn’t be on this course!
Design for
Parallel Performance
Serial design process…
Understand problem to be solved → Design of the solution → Implementation → Execution

Adding parallel to a serial implementation…
Understand problem to be solved → Design of the solution → Implementation → Determine parallelism of time-consuming parts → Implement parallelism → Execution

Preferred parallel design process…
Understand problem & its parallelism → Design parallel solution → Implement parallel solution → Execution
More efficient design ==> quicker to design, likely to perform and to scale better
Performance by design
• Our task:
– find best possible solution for a given task on a given system
• There is a maximum possible performance for a given combination
of system and problem to be solved:
– so choices made during design and implementation will either lead to the
system meeting this maximum, or
– cause performance to fall below expectation
Performance by design
• There is a maximum possible performance for a given
combination of system and problem to be solved
• However…
1. For a given system (multicore or otherwise) there is no universal
best solution for all problems
2. For a given problem there is no universal best solution
appropriate for all systems
• This is what you usually encounter: ensuring some performance portability
for your parallelised application whichever system it is run on
Example
• Sorting, sequential and parallel
– The best sequential algorithms may not be the best to run on
parallel (multicore) machines
Where is performance lost?
• CPU
• Specialised units (vector, FMA) not used
• Data not arriving fast enough, e.g. from the memory hierarchy
• WORK
• Load imbalance: one thread has more to do => others wait for it
• Lots of communication (compute stalls while waiting…)
• Contention for resources
• More threads (etc.) than actual work to be done
• PARALLEL IMPLEMENTATION
• Overheads of the parallel implementation (spinning up new threads)
• Synchronisation (ensuring all threads are at the same point)
Thinking parallel
• Different aspects:
• Decomposition: how to decompose your problem into concurrent tasks;
• Scaling: how to ensure that there are enough concurrent tasks to keep all
the processor cores busy;
• Correctness: how to ensure that the result is free from errors;
• Abstractions and Patterns: how to choose and utilize algorithms
• ….
Decomposition
How do you find the parallelism?
• Data parallelism
• Task parallelism
• Pipelines
• Mixed solutions
Data parallelism
• Given: a data set and an operation that can be applied
element by element;
• What to do: apply the same task concurrently to each
element
a b c d e f g h
A B C D E F G H
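The lower-case-to-upper-case figure can be sketched as a loop whose iterations are independent, so each element could be handled by a different core. A minimal serial C sketch (the function name is mine, not lab code; a parallel version might split the loop across cores, e.g. with OpenMP):

```c
#include <ctype.h>

/* Data parallelism: the SAME operation applied independently to each
   element. Every iteration is independent of the others, so the loop
   could be split across cores, each core taking a chunk of the data. */
void to_upper_all(char *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] = (char)toupper((unsigned char)data[i]);
}
```

Applied to "abcdefgh" this produces "ABCDEFGH", as in the figure.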
Task parallelism
• Several (different) independent tasks to be
applied to the same data
1 4 9 12 6 14 3
Average Minimum Maximum Geom- mean
7 1 14 5.243
Pipelines
• Special kind of task/data parallelism:
– many independent tasks to be applied to a stream of data;
– Each data item is processed by stages as they passed through;
– Different items can pass through different stages and be processed
at the same time;
Pipelines
[Diagram: data items data1, data2, data3 streaming through task 1 and task 2]
Pipeline
• Different tasks for streaming data
• Dependency between tasks
[Diagram: FOLD LETTER → STUFF LETTER → ADDRESS LABEL]
• 3 cycles to get the 1st completed letter
• 3 letters being processed at once
• Once the pipeline is full, one new letter completes per cycle
• Empty pipeline slots as we run out of data to process
In this example, we completed 3 tasks for each of 7 letters in 9 timesteps:
faster than 1 at a time (would be 21 timesteps), but not quite 3x faster (21/9 = 2.33 < 3)
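The 9-timestep count generalises: a pipeline with s one-cycle stages needs s cycles to produce its first item, then one more cycle per remaining item. A small C sketch (the function names are mine):

```c
/* Cycles for `items` data items to flow through `stages` pipeline
   stages, assuming each stage takes one cycle: `stages` cycles to
   fill the pipeline, then one completed item per further cycle. */
int pipeline_cycles(int stages, int items) {
    return stages + (items - 1);
}

/* Speedup vs. doing all stages serially for each item in turn. */
double pipeline_speedup(int stages, int items) {
    return (double)(stages * items) / pipeline_cycles(stages, items);
}
```

For the letters example, pipeline_cycles(3, 7) is 9 and the speedup is 21/9 ≈ 2.33, short of the ideal 3x because of the fill and drain phases.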
Mixed solutions
• Simple case study:
– Task of folding, stuffing, sealing, addressing, stamping and
mailing letters
– There are 6 people to perform the task
– Several natural solutions and their combinations are possible
Mixed solutions (cont.)
• Pipeline: each person doing exactly one task
(folding, stuffing, sealing, addressing, stamping, mailing),
passing along to the next person
– Fine-grained parallelism: small individual tasks and frequent
interactions
– But what if the addressing task takes longer (say 3x longer)
than the other tasks?
• The person sealing will have to wait, so the person stuffing
will have to wait… AND also the person stamping will be
waiting, so the person mailing will be waiting
• Data parallelism: each person takes a portion of the
envelopes and performs all six operations for their portion
– Coarse-grained parallelism: large individual work and infrequent
interactions
Mixed solutions (cont.)
• Mixed solution:
– Person 1 does folding and stuffing
– Person 2 does sealing
– Persons 3, 4, 5 do addressing – to get it done in the same time as the other tasks
– Person 6 does stamping and mailing
• Neither data parallelism nor task (pipeline) parallelism, but rather a
combination of both
– Parallel codes are real codes: they will be a mix of all the nice theoretical ideas we cover
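Why the 1-1-3-1 split works can be checked with a small sketch: a pipeline's steady-state cycle time is set by its slowest stage, i.e. the largest stage time divided by the number of people sharing that stage. The unit timings below are hypothetical (addressing taking 3x the other tasks):

```c
/* Effective cycle time of a pipeline: the slowest stage dominates.
   time[i] is the work per item at stage i; people[i] workers share it. */
double cycle_time(const double *time, const int *people, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double t = time[i] / people[i];
        if (t > worst) worst = t;
    }
    return worst;
}
```

With one person per stage and stage times {1,1,1,3,1,1}, the cycle time is 3 (everyone waits on addressing). Merging fold+stuff and stamp+mail and putting 3 people on addressing, stages {2,1,3,2} with people {1,1,3,1}, brings it down to 2.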
Exploiting
Parallelism
(Part 1)
Quadrature (integration)
• Problem Def: find integral of f(x) from x=a to x=b
[Plot: f(x) over 0 ≤ x ≤ 3.5, y axis 0 to 12]
Quadrature (integration)
• Problem Def: find integral of f(x) from x=a to x=b
• Approximate integral is
sum of areas under line
• Each area approximated by a rectangle
“quad.c” from
first lab
• Approximate integral is
sum of areas under line
• Each area approximated by a rectangle (all of width “h”)
• Calculation of each area is independent of others,
can be done in any order
• ==> Calculate these in parallel…
[Plot: strips under the curve; each strip spans x to y = x+h with height 0.5*[f(x)+f(y)]]
Specific example…
Integrating from a=1 to b=2
Delta x = 2-1 = 1.0
6 steps
So stepsize “h”= 1/6
Estimate each area as trapezoidal:
1) Average height = 0.5 (f(x)+f(x+h))
2) Width is h
3) Area is height * width
Each trapezoid area is independent
Therefore we calculate them in parallel
And the TOTAL integral is the SUM of
the areas of the 6 trapezoids
This is one way we can distribute the
work between (e.g.) 3 cores
So this should go 3x faster than on 1
core
But there are some overheads – bringing
all the areas together to sum to get
the total integral
[Diagram: strips 0–1 on core 0, strips 2–3 on core 1, strips 4–5 on core 2]
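The scheme above can be sketched in C. This is the idea behind the lab's quad.c, not its actual code, and the function names are mine:

```c
/* Trapezoidal quadrature: integrate f from a to b with n strips of
   width h. Each strip's area is independent of the others, so the
   loop iterations can be shared between cores, with one final sum
   (the overhead of bringing the areas together) to combine them. */
double trap(double (*f)(double), double a, double b, int n) {
    double h = (b - a) / n;
    double total = 0.0;
    for (int i = 0; i < n; i++) {           /* independent strips */
        double x = a + i * h;
        double y = x + h;
        total += 0.5 * (f(x) + f(y)) * h;   /* avg height * width */
    }
    return total;
}

double f_square(double x) { return x * x; }  /* example integrand */
```

With a = 1, b = 2 and n = 6 as in the slide, trap(f_square, 1.0, 2.0, 6) lands within about 0.005 of the exact integral 7/3, slightly overestimating it because f is convex.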
Many ways to distribute work between
resources
You want to
• Share work as evenly as possible
• Minimise overheads
• Do it logically so you understand
• Comment your code!
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane