
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

4 – SYNC

This session will be RECORDED…

This allows you to revise later.

The session will be mainly me presenting…
but it is good to have interaction.

If anybody wishes not to be recorded then I can pause the recording.

“Timetable” [updates]
• Lectures & Background

• Much more in CANVAS now
• 01/HAL + 02/sync intro
• 03/HAL Terminology, “Top 500”
• 04/sync (today) performance considerations, …

• Async lectures via CANVAS (i.e. online)
• loosely based on traditional 30 hours, released in blocks, 1 or 2 per week

• Synchronous online
• Week 3: was to meet you, but will cancel and replace with async materials
• weeks 5, 8: to explain the assessed assignments (including feedback) + some lecture materials
• Fridays: weeks 2, 6, 10, 12: lab solutions, ensuring the materials make sense, Q&A

• Labs
• First week complete…

• Assessment: the first MCQ will be on the theory we’re covering.
Week 4; specific time TBC.

Terminology &
Performance
Considerations

Consider…

• 10 balls of each of 5 colours (i.e. 50 in total): red, yellow,
blue, green, white
mixed randomly on the floor
AIM: to have 5 piles with only one colour in each pile

• 1 person can sort in 60 seconds

• We repeat it – how long?
– When it is the same person

• (1) how would you get 5 people to sort?
– How long will it take?

• (2) how could you get 3 people to sort?
– How long?

• (3) what is the ideal number of people to sort the quickest?
– What about, say, 7 people?
– Is 60 good? Is 61 good?

Lessons…

• Parallelisable
• Elements that can be done independently, and thus work can be shared out
• Different ways to parallelise

e.g. for this example
• Each person takes a colour => global pile of that colour
• Each person sorts their area => global piles
• Each person sorts their area into their own local piles of each colour, then a final step to combine all local piles into global piles (see the sketch after this list)

• There are “overheads”
• Communications needed for each actor to know their role (which colour ball / area) and potentially to combine local results into global results

• There can be “load imbalance”
• e.g. 60/7 is non-integer

• More is not always better
• e.g. 60/10 may be okay but 60/30 may have people running into each other etc – “contention”
• Is there any point in having 61 people?
• In practice, one would try different numbers to find the most efficient…

• Expected time (without load imbalance) is work/#people
• Occasionally we can do even better than that!
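The third strategy above (local piles, then a final combine) is the pattern behind many parallel reductions. A minimal sketch in C, assuming OpenMP (compile with -fopenmp on gcc/clang) and representing each ball by a colour index; all names here are illustrative, not from the lab code:

#include <stdio.h>

#define NCOLOURS 5
#define NBALLS   50

int main(void) {
    int balls[NBALLS];                        /* colour index of each ball */
    for (int i = 0; i < NBALLS; i++)
        balls[i] = i % NCOLOURS;              /* 10 balls of each colour */

    int piles[NCOLOURS] = {0};                /* the global piles */

    #pragma omp parallel
    {
        int local[NCOLOURS] = {0};            /* each person's local piles */

        #pragma omp for                       /* share the floor out */
        for (int i = 0; i < NBALLS; i++)
            local[balls[i]]++;                /* sort own area locally */

        #pragma omp critical                  /* the "combine" overhead:  */
        for (int c = 0; c < NCOLOURS; c++)    /* merge local piles into   */
            piles[c] += local[c];             /* the global piles         */
    }

    for (int c = 0; c < NCOLOURS; c++)
        printf("colour %d: %d balls\n", c, piles[c]);
    return 0;
}

The critical section is the overhead the slide mentions: it serialises the combine step, but that step is cheap (5 additions per person) compared with the sorting itself.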

Quick Quiz

• Do you know your computer hardware?

[Image credits: https://www.linkedin.com/in/heather-gong-127394ba; Intel / EPCC; NVIDIA; Antoine bercovici – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=29023372]

CPU — Intel, AMD, ARM (as IP)
• 1 to maybe 64 cores, running at 2 to 3 GHz
• Powerful cores, out of order, look ahead. Good for general purpose and generally good
• 1-2 sockets direct on the motherboard

GPU — NVIDIA, AMD
• 15 to 56 “streaming multiprocessors” (SMs), each with 64-128 “CUDA cores”. Base freq about 1 GHz (see the device-query sketch after this table)
• SMs are good for high throughput of vector arithmetic
• AMD produced “fused” CPU & GPU. Until 2016, NVIDIA cards were situated at the far end of the PCI-e bus. In 2016, NVIDIA began working with IBM on an on-board solution using “NVLink”

Xeon Phi — Intel
• 60-70 cores
• Low grunt but general purpose cores
• KNC is PCI-e; KNL (2016) is standalone

FPGA — Altera (Intel), Xilinx
• Fabric to design own layout – and reconfigurable
• Can use Verilog or VHDL to map. MATLAB can also be used. Maxeler uses Java
• Focus needs to be on the data flow

ASIC
• Anton-2 uses custom ASIC for MD calcs. Very fast but not necessarily low power
• If you’re designing ASICs you needn’t be on this course!
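To relate the GPU row to your own hardware: the CUDA runtime can report the SM count and base clock. A minimal sketch, assuming the CUDA toolkit is installed (compile with nvcc); clockRate is reported in kHz:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA-capable device found\n");
        return 1;
    }
    /* clockRate is in kHz, hence /1e6 to get GHz */
    printf("%s: %d SMs, base clock %.2f GHz\n",
           prop.name, prop.multiProcessorCount, prop.clockRate / 1.0e6);
    return 0;
}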

Design for
Parallel Performance

Serial design process…

Understand problem to be solved → Design of the solution → Implementation → Execution

Adding parallel to serial implementation

Understand problem to be solved → Design of the solution → Implementation → Determine parallelism of time-consuming parts → Implement parallelism → Execution

Preferred parallel design process…

Understand problem & its parallelism → Design parallel solution → Implement parallel solution → Execution

More efficient design ==> quicker to design, likely to perform and to scale better

Performance by design

• Our task:
– find the best possible solution for a given task on a given system

• There is a maximum possible performance for a given combination
of system and problem to be solved:
– choices made during design and implementation will either let the system meet this maximum, or
– cause performance to fall below expectation

Performance by design

• There is a maximum possible performance for a given combination of system and problem to be solved

• However…
1. For a given system (multicore or otherwise) there is no universal best solution for all problems
2. For a given problem there is no universal best solution appropriate for all systems

• This is what you usually encounter: ensuring some performance portability for your parallelised application, whichever system it is run on

Example

• Sorting: sequential vs parallel
– The best sequential algorithms may not be the best to run on parallel (multicore) machines

Where is performance lost?

• CPU
• Specialised units (vector, FMA) not used (see the sketch after this list)
• Data not arriving sufficiently fast, e.g. memory hierarchy

• WORK
• Load imbalance: one thread has more to do => others wait for it
• Lots of communication (compute stalls while waiting…)
• Contention for resources
• More threads (etc) than actual work to be done

• PARALLEL IMPLEMENTATION
• Overheads of the parallel implementation (spinning up new threads)
• Synchronisation (ensuring all threads are at the same point)
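On the first CPU bullet: whether the vector and FMA units get used often comes down to loop shape and compiler flags. A hedged C sketch (the restrict qualifiers and the -O3 -march=native flags are gcc/clang conventions, given here as an assumption):

/* y[i] = a*x[i] + y[i]: one fused multiply-add (FMA) per element.
   The restrict qualifiers promise the compiler the arrays do not
   overlap, so it can vectorise; compile with e.g. -O3 -march=native */
void axpy(int n, double a, const double *restrict x, double *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}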

Thinking parallel

• Different aspects:
• Decomposition: how to decompose your problem into concurrent tasks;

• Scaling: how to ensure that there are enough concurrent tasks to keep all
the processor cores busy;

• Correctness: how to ensure that the result is free from errors;

• Abstractions and Patterns: how to choose and utilise suitable algorithms and patterns

• ….

Decomposition

How do you find the parallelism?

• Data parallelism

• Task parallelism

• Pipelines

• Mixed solutions

Data parallelism

• Given: a data set and an operation that can be applied
element by element;

• What to do: apply the same task concurrently to each
element

a b c d e f g h

A B C D E F G H
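As a minimal C sketch of this diagram (OpenMP assumed; each iteration is independent, so the loop can be shared across threads — compile with -fopenmp, or without it the same code simply runs serially):

#include <ctype.h>
#include <stdio.h>

int main(void) {
    char data[] = "abcdefgh";          /* the data set a..h */

    /* apply the same operation concurrently to each element */
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        data[i] = toupper(data[i]);    /* a -> A, b -> B, ... */

    printf("%s\n", data);              /* prints ABCDEFGH */
    return 0;
}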

Task parallelism

• Several (different) independent tasks to be
applied to the same data

1 4 9 12 6 14 3

Average  Minimum  Maximum  Geom-mean
7        1        14       5.243
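The same data, four independent tasks. A sketch using OpenMP sections (assumed here; other task constructs would do equally well — each section writes its own variable, so there are no races; compile with -fopenmp -lm):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x[] = {1, 4, 9, 12, 6, 14, 3};
    const int n = 7;
    double sum = 0, min = x[0], max = x[0], logsum = 0;

    /* four different, independent tasks over the same data;
       each section may run on a different thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < n; i++) sum += x[i]; }
        #pragma omp section
        { for (int i = 1; i < n; i++) if (x[i] < min) min = x[i]; }
        #pragma omp section
        { for (int i = 1; i < n; i++) if (x[i] > max) max = x[i]; }
        #pragma omp section
        { for (int i = 0; i < n; i++) logsum += log(x[i]); }
    }

    printf("average=%g min=%g max=%g geomean=%.3f\n",
           sum / n, min, max, exp(logsum / n));   /* 7 1 14 5.243 */
    return 0;
}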

Pipelines

• Special kind of task/data parallelism:
– many independent tasks to be applied to a stream of data;

– Each data item is processed by stages as they passed through;

– Different items can pass through different stages and be processed
at the same time;

Pipelines

[Diagram: data1, data2, data3 streaming through task 1, then task 2]

Pipeline

• Different tasks for
streaming data

• Dependency between
tasks

FOLD LETTER → STUFF LETTER → ADDRESS LABEL

3 cycles to get 1st completed letter
3 letters being processed
Once pipeline full, one new letter per cycle
Empty pipeline slots as we run out of data to process
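A quick check of the timing, as a C sketch: with S dependent stages and N items, a full pipeline needs N + S − 1 timesteps (fill, steady state, drain):

#include <stdio.h>

int main(void) {
    int stages = 3, letters = 7;
    int steps = letters + stages - 1;   /* fill + steady state + drain */
    printf("%d timesteps; speedup %.2fx (ideal %dx)\n",
           steps, (double)(stages * letters) / steps, stages);
    return 0;
}
/* prints: 9 timesteps; speedup 2.33x (ideal 3x) */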

In this example, we completed 3 tasks for each of 7 letters in 9 timesteps:
faster than 1 at a time (which would be 21 timesteps), but not quite 3x faster (21/9 = 2.33 < 3)

Mixed solutions

• Simple case study:
– Task of folding, stuffing, sealing, addressing, stamping and mailing letters
– There are 6 people to perform the task
– Several natural solutions and their combinations are possible

Mixed solutions (cont.)

• Pipeline: each person does exactly one task (folding, stuffing, sealing, addressing, stamping, mailing), passing along to the next person
– Fine-grained parallelism: small individual tasks and frequent interactions
– But what if the addressing task takes longer (say 3x longer) than the other tasks?
• The person sealing will have to wait, so the person stuffing will have to wait… AND the person stamping will be waiting, so the person mailing will be waiting

• Data parallelism: each person takes a portion of the envelopes and performs all six operations for their portion
– Coarse-grained parallelism: large individual work items and infrequent interactions

Mixed solutions (cont.)

• Mixed solution:
– Person 1 does folding and stuffing
– Person 2 does sealing
– Persons 3, 4, 5 do addressing – to get it done in the same time as the other tasks
– Person 6 does stamping and mailing

• Neither data parallelism nor task (pipeline) parallelism, but rather a combination of both
– Parallel codes are real codes: they will be a mix of all the nice theoretical ideas we cover

Exploiting Parallelism (Part 1)

Quadrature (integration)

• Problem Def: find the integral of f(x) from x=a to x=b
• Approximate integral is the sum of areas under the line
• Each area is approximated by a rectangle

[Plot: f(x) over 0 ≤ x ≤ 3.5, with the area under the curve divided into rectangles]

“quad.c” from first lab

• Approximate integral is the sum of areas under the line
• Each area is approximated by a rectangle (all of width “h”)
• Calculation of each area is independent of the others, and can be done in any order
• ==> Calculate these in parallel…

[Plot: one trapezoid slice between x and y = x+h, with average height 0.5*[f(x)+f(y)]]

Specific example…

Integrating from a=1 to b=2

Delta x = 2-1 = 1.0

6 steps
So stepsize “h” = 1/6

Estimate each area as a trapezoid:
1) Average height = 0.5 (f(x)+f(x+h))
2) Width is h
3) Area is height * width

Each trapezoid area is independent
Therefore we calculate in parallel

And the TOTAL integral is the SUM of
the areas of the 6 trapezoids

This is one way we can distribute the
work between (e.g.) 3 cores

So this should go 3x faster than on 1
core

But there are some overheads – bringing
all the areas together to sum to get
the total integral

core 0: areas 0, 1
core 1: areas 2, 3
core 2: areas 4, 5
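quad.c itself is the lab code; the following is a minimal sketch of the scheme just described, assuming OpenMP and a placeholder integrand f(x) = x*x (the lab’s actual integrand may differ). The reduction clause performs the “bringing together” overhead mentioned above:

#include <stdio.h>

double f(double x) { return x * x; }        /* placeholder integrand */

int main(void) {
    const double a = 1.0, b = 2.0;          /* integrate from a=1 to b=2 */
    const int n = 6;                        /* 6 trapezoids */
    const double h = (b - a) / n;           /* stepsize h = 1/6 */
    double total = 0.0;

    /* each trapezoid is independent, so iterations can be shared across
       cores; the reduction sums the partial areas into the total */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        double x = a + i * h;
        total += 0.5 * (f(x) + f(x + h)) * h;   /* average height * width */
    }

    printf("integral approx = %f\n", total);    /* exact value is 7/3 */
    return 0;
}

With 3 threads and a static schedule, each thread gets two consecutive trapezoids, matching the distribution shown above.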

Many ways to distribute work between
resources

You want to
• Share work as evenly as possible
• Minimise overheads
• Do it logically so you understand
• Comment your code!

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane