Motivation
• Parallel computing – several general goals:
  – Introduce advanced high performance and distributed computer architectures and organisations
  – Introduce the basics of parallel programming
• In parallel computing, many operations take place at once, complicating our reasoning about the correctness and performance of programs
Issues in High Performance Computing
• Technology push
• Application driven
• Obtaining performance:
  – Increasing the speed of the processor
  – Parallelizing (explicit parallelism)

Evaluating Interconnection Networks
• Network topologies: static and dynamic
• Message routing
• Arc connectivity
• Bisection width
Parallel Architectures
• Flynn's taxonomy
• Shared-memory multiprocessors: processors communicate together through shared variables
• Distributed-memory multicomputers: processors communicate together through message passing
Memory Hierarchy
• Most unoptimized parallel programs run at less than 10% of the machine's "peak" performance
• Most of that performance is lost in the memory hierarchy (registers, caches and other levels of memory)
• Despite improvements in hardware and compilers, programmers need to get involved to enhance the performance of many applications
Simple model of memory: fast and slow
• m = number of memory elements (words) moved between fast and slow memory
• tm = time per slow memory operation
• f = number of arithmetic operations
• tf = time per arithmetic operation
• q = f / m = average number of flops per slow memory access – computational intensity (key to algorithm efficiency)
• Execution time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
• Larger q means the execution time is closer to the minimum f * tf
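As a hedged worked example (the numbers are illustrative, not from the slides): suppose a slow memory access costs 100 times an arithmetic operation, i.e. tm/tf = 100. Then the time relative to the arithmetic-only minimum f * tf is:

```latex
\frac{T}{f\,t_f} \;=\; 1 + \frac{t_m}{t_f}\cdot\frac{1}{q}
\quad\Rightarrow\quad
q = 4:\ \frac{T}{f\,t_f} = 1 + \tfrac{100}{4} = 26,
\qquad
q = 100:\ \frac{T}{f\,t_f} = 1 + \tfrac{100}{100} = 2
```

So raising the computational intensity from 4 to 100 brings the running time from 26x down to 2x of the arithmetic minimum.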
Improving Computational Intensity
• Blocking (tiling) is a good technique to increase computational intensity
• Loop unrolling, originally used as a technique to reduce branch overheads, also helps by making use of more registers
• The advantages are obvious for simple loops if data held in registers can be reused, raising the computational intensity
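To make the blocking idea concrete, here is a minimal C sketch (illustrative only, not from the slides) of blocked matrix multiplication; the tile size B_SZ is a hypothetical tuning parameter chosen so that a few tiles fit in fast memory:

```c
#include <stddef.h>

/* Blocked (tiled) matrix multiply: C += A * B for n x n matrices.
 * Each tile of A and B is reused many times while it is resident in
 * fast memory, raising the computational intensity q from O(1)
 * (naive triple loop) towards O(B_SZ). */
#define B_SZ 64   /* hypothetical tile size; tune so tiles fit in cache */

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += B_SZ)
        for (size_t jj = 0; jj < n; jj += B_SZ)
            for (size_t kk = 0; kk < n; kk += B_SZ)
                /* multiply one tile pair; the "&& < n" guards the matrix edge */
                for (size_t i = ii; i < ii + B_SZ && i < n; i++)
                    for (size_t j = jj; j < jj + B_SZ && j < n; j++) {
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < kk + B_SZ && k < n; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
```

The only change from the naive loop nest is the loop order and the tiling; the arithmetic is identical, but far fewer words are moved from slow memory per flop.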
Parallel Algorithm Design
• Parallel algorithm design can be divided into two parts:
• The first part considers characteristics of the problem to identify the maximum degree of parallelism
  – Try to find the theoretical potential and applicability
• The second part considers efficient implementation – machine dependent
  – Try to identify the relationship between properties of algorithms and features of architectures
Parallel Algorithm Design – Stages
• Partitioning: divide a large task into multiple smaller ones which can be executed concurrently
  – This stage focuses on recognizing opportunities for parallel execution
• Communication/synchronization: coordinate the execution of concurrent tasks and establish appropriate communication/synchronization structures
• Assignment: reorganize tasks and assign them to processes/threads – machine dependent
Shared Memory Platform
• Shared memory machines are usually small scale (e.g., ~100 cores for large ones)
• Multiple threads created and running concurrently
  – Each having their own local private variables
  – A set of global variables shared by all threads
• Threads communicate implicitly using shared variables, and coordinate explicitly by synchronization on shared variables
• Algorithm design seems relatively easy?
  – Yes, data don't need to be moved explicitly – is data partitioning unnecessary?
  – Not true! Data locality is very important (memory hierarchy)
• Explicit synchronization: possible race conditions, so synchronization is needed to coordinate all threads
• Need coarse-grained tasks to minimize unnecessary synchronization and improve efficiency
• Need to consider ILP & memory hierarchy
• Load balancing – very important
  – Amdahl's law: even a small load imbalance may significantly affect the overall performance
• Data locality – very important
• Need to seriously consider how to optimize the performance on a single processor, or core
  – Increase computational intensity (memory hierarchy)
  – Increase opportunities for ILP
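To make the race-condition and synchronization points concrete, a minimal Pthreads sketch (illustrative only, not from the slides): several threads increment a shared counter, and a mutex is needed to make each update atomic.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static long counter = 0;                          /* global variable shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);                /* without the lock, counter++ is a data race */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    /* prints 4000000; removing the mutex typically prints a smaller, varying value */
    printf("counter = %ld\n", counter);
    return 0;
}
```

Locking on every increment also illustrates why fine-grained synchronization hurts: accumulating into a thread-private variable and combining once at the end would be far cheaper, which is the "coarse-grained tasks" point above.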
Distributed Memory Platform
• On a distributed memory machine, each process has its own local addressing space
• Local variables cannot be accessed by other processes (no shared variables between processes)
• Data must be distributed (partitioned) across the processors
• To exchange data, processes must explicitly communicate with each other using (variants of) send and receive message passing mechanisms
• Of course, the algorithm design needs to seriously consider:
  – How data is partitioned and how tasks (or work) are assigned, which is also related to data partitioning
  – How to minimize the communication overheads
  – Data locality (also to reduce communication overheads – send one large message rather than many small ones)
Load Balancing
• Amdahl's law: even a small amount of load imbalance can bound the maximum speedup
• We need good strategies for partitioning and assigning tasks across the processes/threads to balance the workload
• The quality of task assignment directly affects the overall performance: load imbalance can significantly degrade it
• The strategies can be classified into two categories:
  – Static task assignment (static load balancing)
    • Some problems can be easily decomposed into a number of tasks of equal size with regular data dependency
    • Other problems, e.g., sparse matrices, unstructured meshes, and graphs, are more complicated
  – Dynamic task assignment (also known as job scheduling or dynamic load balancing)
Load Balancing – Graph Partitioning
• Graph partitioning: balancing the workload while minimizing the communication cost
• Graph bisection
• Partitioning with geometric information:
  – Coordinate bisection
  – Inertial bisection
• Partitioning without geometric information:
  – Spectral bisection
• Trade-off between the quality of the solution and the cost of obtaining it – the choice depends on the intended application
Scalability of Parallel Systems
• Scalability can be defined as the ability to continue to achieve good performance with increased problem size and increased processing resources
• Factors affecting the scalability:
  – Architecture related:
    • Number of available processors/cores
    • Communication networks
    • Memory hierarchy
  – Algorithm/program related:
    • Task partitioning (load balancing & task granularity)
    • Synchronization/communication
    • Additional computation
    • Computational intensity
Scalability of Parallel Systems (cont.)
• Performance metrics: serial/parallel total time and total overhead
• Workload scaling models:
  – Problem constrained (PC): fixed-load model, leads to Amdahl's law
  – Time constrained (TC): fixed-time model, leads to Gustafson's law
• Isoefficiency: at what rate must the problem size increase with respect to the number of processing elements to keep the efficiency fixed?
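For reference, the standard forms of these laws (not reproduced from the slides; s denotes the serial fraction, T_o the total parallel overhead, and W the serial work):

```latex
% Amdahl's law (fixed-load / problem-constrained scaling):
S(p) \;=\; \frac{1}{\,s + (1-s)/p\,} \;\le\; \frac{1}{s}

% Gustafson's law (fixed-time / time-constrained scaling):
S(p) \;=\; s + (1-s)\,p

% Efficiency and the isoefficiency relation; K = E/(1-E) is a constant
% for a fixed target efficiency E:
E \;=\; \frac{S(p)}{p} \;=\; \frac{1}{1 + T_o(W,p)/W},
\qquad
W \;=\; K \, T_o(W,p)
```

The isoefficiency function is obtained by solving the last relation for W as a function of p: the slower W must grow, the more scalable the parallel system.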
Pthreads: a standard threads interface for shared-memory computers
• Thread management
  – e.g., creation, join, and passing parameters
• Mutexes
  – e.g., creation, destroy, lock and unlock
• Condition variables
  – e.g., creation, destroy, wait and signal
  – A condition variable is always used in conjunction with a mutex
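To illustrate the "always used in conjunction with a mutex" point, a minimal sketch of the standard wait/signal pattern (identifiers are hypothetical, not from the slides):

```c
#include <pthread.h>
#include <stdio.h>

/* The waiter re-checks the predicate in a loop to cope with spurious wakeups;
 * pthread_cond_wait atomically releases the mutex while waiting and
 * re-acquires it before returning. */
static pthread_mutex_t m    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                 /* the shared predicate */

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    ready = 1;                        /* make the condition true ...  */
    pthread_cond_signal(&cond);       /* ... then wake one waiter     */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&m);
    while (!ready)                    /* predicate checked under the mutex */
        pthread_cond_wait(&cond, &m);
    pthread_mutex_unlock(&m);

    printf("condition satisfied\n");
    pthread_join(t, NULL);
    return 0;
}
```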
MPI: a standard message passing interface for distributed memory machines
• Minimal set of MPI routines: MPI_Send and MPI_Recv
• Collective communication:
  – MPI_Bcast (one to many)
  – MPI_Reduce (many to one)
  – MPI_Allreduce (many to many)
  – MPI_Gather (many to one)
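A minimal MPI sketch (illustrative only, not from the slides) combining the minimal send/receive pair with a collective reduction over a block-partitioned sum:

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank computes a partial sum over its block of 1..N, then the
 * partial sums are combined with MPI_Reduce; for comparison, rank 1
 * also sends its partial sum to rank 0 with MPI_Send / MPI_Recv. */
int main(int argc, char **argv)
{
    int rank, size;
    const long N = 1000000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* block partitioning of the index range across processes */
    long lo = rank * N / size + 1, hi = (rank + 1) * N / size;
    double partial = 0.0, total = 0.0;
    for (long i = lo; i <= hi; i++)
        partial += (double)i;

    /* collective: many-to-one combination of the partial sums */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* point-to-point: rank 1 sends its partial sum to rank 0 */
    if (size > 1) {
        if (rank == 1)
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        else if (rank == 0) {
            double from1;
            MPI_Recv(&from1, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("partial sum from rank 1 = %.0f\n", from1);
        }
    }

    if (rank == 0)
        printf("total = %.0f (expected %.0f)\n", total, (double)N * (N + 1) / 2.0);

    MPI_Finalize();
    return 0;
}
```

Note how the data partitioning (the lo/hi block per rank) and the communication pattern are both explicit, which is exactly the design burden described above for distributed memory platforms.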
GPU (CUDA) Programming
• Multithreading: a kernel function is executed by a number of threads, organized into a grid, blocks, and warps
• Allocate device memory and transfer data between host and device, then launch the kernel function
• Synchronization: barrier and atomic operations
• Memory and locality: registers, shared memory and global memory
  – Memory coalescing: a memory instruction executed across the threads of one warp can be combined into fewer global memory accesses
• In general, for parallel algorithm design and implementation on GPUs we need to consider:
  – Fine-grained parallelism
  – Memory coalescing
  – Effective use of shared memory
  – Control divergence
  – Synchronization overheads
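A minimal CUDA C sketch (illustrative only) showing host/device memory allocation, data transfer, and a kernel whose access pattern coalesces because consecutive threads touch consecutive addresses:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* Kernel: each thread adds one element. Threads of a warp access
 * adjacent words, so the global-memory loads/stores coalesce. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* allocate device memory and copy input data host -> device */
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* launch the kernel on a grid of 256-thread blocks */
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    /* copy the result back and check one element */
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", hc[0]);      /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```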
The Final Exam
• It is a two-hour open book exam, plus 10 min reading and 30 min uploading time
  – Open 17 June at 13:00 till 15:40
• Do NOT treat uploading time as extra writing time
  – The upload time must be used mainly to save and upload your files correctly as per the exam instructions
  – If your time runs out while you are uploading, this is not considered a technical issue
  – Thus manage your time carefully
• Click the "START YOUR EXAM" button, then click the "COMP5426 Final Exam" link, then click "exam script" to download the exam paper
• There are FIVE questions, for a total of 100 marks
  – Must attempt ALL questions
• Please answer ALL questions in one document and submit your answer as a single file
  – You can download an "answer booklet" Word document as a template for your own answer document
  – You MUST include your name and student ID in your answer document
• When uploading, check that you have named your file correctly (e.g., COMP5426_exam_lastname_sid.pdf) and uploaded the correct file
The Final Exam – Preparation
• To prepare for the exam, please carefully read:
  – lecture notes
  – relevant chapters in the textbooks
  – materials on the Canvas unit site
• Review assignments and lab exercises
• Try to understand the basic concepts
• To get good marks, you need sound understanding & problem solving skills
Unit of Study Survey
• Please provide feedback on COMP5426 – comments/suggestions
• Click on the following link: https://student-surveys.sydney.edu.au/stud
• It only takes a few minutes to complete