Microsoft PowerPoint – COMP528 HAL07 practical parallelism#2.pptx

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

7 – HAL

Files from Lecture Demos

• Login to Barkla & load the relevant modules

• Files available within my directory:
~mkbane/HPC_DEMOS

• You cannot edit/create files in my directory, so you need to copy them to your own directory
• Bear in mind that I will update HPC_DEMOS as we go along

• Bear in mind that you will not want to overwrite any edits of your own!

• rsync -ru -L -t --progress ~mkbane/HPC_DEMOS $HOME
this will recursively copy only newer files, resolving any “symbolic links”, maintaining
timestamps, and showing progress, from my HPC_DEMOS to an “HPC_DEMOS” directory
within your home directory

• Typically, for a given topic, I will refer to the “HPC_DEMOS/topic” subdirectory, where there
will be a bunch of files and (ideally) a Makefile

RACE CONDITIONS

Consider the simple summation…

• (presuming we have no rounding error issues)

• If we want to sum up:
SUM = SUM + X(i)
for, say, i=0,…,9, where X(i) has previously been set to i,
using 2 threads, with one thread doing the increments for even values of i and the other
doing the increments for odd values of i

• What is the answer?

• Let’s write an OpenMP program and find out…

• DEMO: HPC_DEMOS/raceCondition
– we will compile & run, and compare serial vs. parallel
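
As a flavour of what the demo covers, here is a minimal sketch (not the actual HPC_DEMOS
source; the array size and thread count are illustrative) of a shared sum updated by two
threads with no synchronisation:

// race.c -- illustrative sketch of the race: sum x[0..9] with 2 threads
// and NO synchronisation, so the parallel result is indeterminate.
// Build with OpenMP enabled, e.g.  gcc -fopenmp race.c  (or icc -qopenmp race.c)
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x[10], sum = 0;
    for (int i = 0; i < 10; i++) x[i] = i;     // x[i] = i, so the serial answer is 45

    #pragma omp parallel for num_threads(2)    // 'sum' is shared by both threads
    for (int i = 0; i < 10; i++) {
        sum = sum + x[i];                      // RACE: unsynchronised read-modify-write
    }

    printf("parallel sum = %d (serial answer: 45)\n", sum);
    return 0;
}

On such a tiny loop the race may not show up on every run; the point is that nothing
guarantees the correct answer.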


Race Condition

• Very common “mistake”

• Occurs when multiple tasks read from and write to the same memory without
proper synchronization

• Depending on the application, this can lead to incorrect results: let’s look at 2 parallel
processes running concurrently

[Diagram: two threads each execute sum = sum + X[i]. Both threads want to load the value
in SUM, both threads update their own value for SUM, and both threads want to update the
same physical memory location, SUM, in RAM (or L2 cache?).]

sum = sum + X[i]
• Presume this breaks down to

ld sum // from RAM to local register
inc sum by X[i]
str sum // back to RAM

• Also presume thread 0 does iterations i=0,2,4,6,8
and thread 1 does iterations i=1,3,5,7,9

• This is what we naively consider happening…

Thread 0             | SUM in RAM at end of step | Thread 1
---------------------+---------------------------+---------------------
ld sum   [sum: 0]    | 0                         |
inc sum  [sum: 0]    |                           |
str sum              | 0                         |
                     |                           | ld sum   [sum: 0]
                     |                           | inc sum  [sum: 1]
                     | 1                         | str sum

sum = sum + X[i]
• Presume this breaks down to

ld sum // from RAM to local register
inc sum by X[i]
str sum // back to RAM

• Also presume thread 0 does iterations i=0,2,4,6,8
and thread 1 does iterations i=1,3,5,7,9

But the two threads may interleave these steps, for example:

Thread 0             | SUM in RAM at end of step | Thread 1
---------------------+---------------------------+---------------------
ld sum   [sum: 0]    | 0                         | ld sum   [sum: 0]
inc sum  [sum: 0]    |                           | inc sum  [sum: 1]
str sum              | 0 or 1?                   | str sum


Continuing this interleaving, with thread 0 doing iterations i=0,2 and thread 1 doing
iterations i=1,3 (so the correct answer would be 6):

Thread 0             | SUM in RAM at end of step | Thread 1
---------------------+---------------------------+---------------------
                     | 0                         | ld sum   [sum: 0]
ld sum   [sum: 0]    |                           |
                     | 0                         | inc sum  [sum: 1]
inc sum  [sum: 0]    |                           |
                     | 1                         | str sum
str sum              | 0                         |
ld sum   [sum: 0]    |                           |
                     |                           | ld sum   [sum: 0]
                     | 0                         | inc sum  [sum: 3]
inc sum  [sum: 2]    |                           |
                     | 3                         | str sum
str sum              | 2                         |

The final value of SUM is 2, not the correct 6: updates have been lost.


Race Conditions

• Multiple threads or processes access one or more memory locations in an unpredictable
order, such that the result is indeterminate
• Not just (e.g.) a SUM

• What if SUM is then used in an IF statement?
• The overall control flow, and thus the model prediction, could vary between runs

Concurrent access and Locks

• Locking is a mechanism for controlling concurrent access

• Acquiring a lock gives a process (or thread) exclusive
access to a data structure (e.g. a variable)

• Locks can be used to ensure the correct (required) behaviour of
multi-threaded programs

• lock(sum)
ld sum
inc sum by X[i]
str sum
unlock(sum)
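
In OpenMP these lock/unlock steps could be written with the runtime lock routines. A minimal
sketch (the variable names are illustrative):

// Sketch: protecting the shared update with an explicit OpenMP lock.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x[10], sum = 0;
    omp_lock_t sum_lock;                  // lock protecting 'sum'
    for (int i = 0; i < 10; i++) x[i] = i;

    omp_init_lock(&sum_lock);

    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 10; i++) {
        omp_set_lock(&sum_lock);          // acquire: the other thread must wait here
        sum = sum + x[i];                 // ld / inc / str now done by one thread at a time
        omp_unset_lock(&sum_lock);        // release
    }

    omp_destroy_lock(&sum_lock);
    printf("sum = %d\n", sum);            // now reliably 45
    return 0;
}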

Thread 0             | SUM in RAM at end of step | Thread 1
---------------------+---------------------------+---------------------
lock(sum)            |                           |
ld sum   [sum: 0]    | 0                         |
inc sum  [sum: 0]    |                           |
str sum              | 0                         |
unlock(sum)          |                           |
                     |                           | lock(sum)
                     |                           | ld sum   [sum: 0]
                     |                           | inc sum  [sum: 1]
                     | 1                         | str sum
                     |                           | unlock(sum)
lock(sum)            |                           |

Thread 1 may try to obtain the lock but it will have to wait until it is released by thread 0.

There is no rule that threads take it fairly in turns to obtain the lock.

For example, with thread 0 doing iterations i=0,2 and thread 1 doing iterations i=1,3, one
possible ordering is:

Thread 0             | SUM | Thread 1
---------------------+-----+---------------------
                     |     | lock(sum)
                     | 0   | ld sum   [sum: 0]
                     |     | inc sum  [sum: 1]
                     |     | str sum
                     | 1   | unlock(sum)
lock(sum)            |     |
ld sum   [sum: 1]    |     |
inc sum  [sum: 1]    |     |
str sum              | 1   |
unlock(sum)          |     |
                     | 1   | lock(sum)
                     |     | ld sum   [sum: 1]
                     |     | inc sum  [sum: 4]
                     | 4   | str sum
                     |     | unlock(sum)
lock(sum)            |     |
ld sum   [sum: 4]    |     |
inc sum  [sum: 6]    |     |
str sum              | 6   |
unlock(sum)          |     |

With the lock in place, no updates are lost and SUM ends up as the correct 6.

Contrast this with the earlier unlocked trace, where thread 0 doing iterations i=0,2 and
thread 1 doing iterations i=1,3 left SUM = 2 instead of the correct 6.

Parallel Patterns for Concurrent Access

• low level locks
– OpenMP: omp_set_lock(…), omp_unset_lock(…)

and others

– fine grained control

– fairly easy to end up with deadlock

• high level functions that implement some form of locks
– routines for OpenMP and for MPI

– control access, but user does not have to worry about deadlock

Parallel Patterns for Concurrent Access

• high level functions
– OpenMP atomic (a single update to a shared variable is performed atomically, one thread at a time)

– OpenMP critical (region executed by one thread at a time)

– OpenMP reduction (summation from all threads to the master thread)

– MPI_Reduce (summation from all processes to the root process)
• other [commutative & associative] maths operations also available

• There now follow 2 demos: one with “critical” and then one with “atomic”, based upon the
previous slide; a sketch of both appears below.
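
As a hedged sketch of what those two variants typically look like (illustrative, not the actual
HPC_DEMOS code):

// Sketch: protecting the shared update with OpenMP 'critical' and 'atomic'.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x[10], sum_critical = 0, sum_atomic = 0;
    for (int i = 0; i < 10; i++) x[i] = i;

    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        #pragma omp critical              // only one thread at a time in this region
        {
            sum_critical += x[i];
        }
    }

    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        #pragma omp atomic                // this single update is performed atomically
        sum_atomic += x[i];
    }

    printf("critical: %d   atomic: %d\n", sum_critical, sum_atomic);
    return 0;
}

Both give the correct answer; atomic is restricted to simple updates of a single memory
location, while critical can guard an arbitrary block of code.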


Integrating from x=1 to x=2 (a=1, b=2)

Estimate each area using the trapezoidal rule:
1) Average height = 0.5 * (f(x) + f(x+h))
2) Width is h
3) Area is average height * width

Each trapezoid's area is independent

Therefore we calculate in parallel
on each of 6 parallel processes

The TOTAL integral
is the SUM of the areas of the 6 trapezoids

Collecting the results from the parallel
processes and adding them together
and handling access (implicit locking)
is a “reduction” pattern

SUM = sum[0] + sum[1] + … sum[5]
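
A sketch of how this might be coded with the OpenMP reduction pattern (using threads rather
than processes; the interval a=1, b=2 and the 6 trapezoids are from the slide, while the
integrand f and the rest are illustrative):

// Sketch: trapezoidal rule for the integral of f(x) from a=1 to b=2,
// with the per-trapezoid areas combined via an OpenMP reduction.
#include <stdio.h>
#include <omp.h>

double f(double x) { return x * x; }        // illustrative integrand

int main(void) {
    const double a = 1.0, b = 2.0;          // integrate from x=1 to x=2
    const int n = 6;                        // 6 trapezoids, as in the slide
    const double h = (b - a) / n;           // width of each trapezoid
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = a + i * h;
        sum += 0.5 * (f(x) + f(x + h)) * h; // average height * width
    }

    printf("integral estimate = %f\n", sum);
    return 0;
}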

[Figure: the interval from a to b divided into 6 trapezoids, numbered 0 to 5, distributed
across the parallel processes; the individual sums are combined on the master (root)
process.]


REDUCTION (detailed)

[Diagram: X = A op B, with A and B supplied by different threads; a second frame shows the
operands swapped, i.e. X = B op A.]

For
X = A op B

A and B arrive from the threads in a non-deterministic order.

We don’t know in which order the
distributed elements will be
combined.

Hence the operation needs to be
commutative & associative.

• Q: which operations would be supported (as a “reduction”)?

Example of “reduction”
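
As a hedged sketch of the distributed-memory counterpart mentioned above, MPI_Reduce
combining partial sums onto the root process (the data decomposition here is illustrative,
not the lecture's own example):

// Sketch: each MPI process computes a partial sum, then MPI_Reduce
// combines them (with MPI_SUM) onto the root process (rank 0).
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // each process sums its share of x[i] = i for i = 0..9 (round-robin split)
    int local = 0;
    for (int i = rank; i < 10; i += nproc) local += i;

    int total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %d\n", total);   // 45, regardless of process count

    MPI_Finalize();
    return 0;
}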

Conclusion

• All of the aspects of the Thinking Parallel approach discussed here are
non-trivial

• We need tools and instruments which support Thinking
Parallel at an appropriate level of abstraction

ANATOMY OF HPC

Physicals Matter

• On a single node
• All cores can access the RAM

• SHARED memory

• (may have local copies in their
caches)

• For SHARED memory we can use
• OpenMP

• pThreads

• MPI

• But HPC is lots of nodes

• Connected by (fast) interconnect

• Memory per node => DISTRIBUTED
memory

• Needs mechanism for a node to
interact with data on another node

• Typically:
• MPI

• “Message Passing Interface”


SHARED MEMORY

• Memory on node
• Faster access

• Limited to that memory

• … and to those cores

• Programming typically OpenMP (or another
threaded model)
• Directives based

• Incremental changes

• Portable to single core / non-OpenMP
• Single code base :-) (see the sketch below)

• Can use MPI too
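
Because OpenMP is directives-based, the same source can be built with or without OpenMP;
a minimal sketch of this single-code-base point (the loop itself is illustrative):

// Sketch: one code base that compiles and runs serially if OpenMP is not
// enabled: the pragma is simply ignored by a non-OpenMP compiler.
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>            // only needed when built with OpenMP
#endif

int main(void) {
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)   // ignored without OpenMP
    for (int i = 0; i < 1000000; i++) {
        total += 1.0 / (i + 1);
    }

#ifdef _OPENMP
    printf("built with OpenMP; up to %d threads\n", omp_get_max_threads());
#else
    printf("built without OpenMP; running serially\n");
#endif
    printf("total = %f\n", total);
    return 0;
}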

DISTRIBUTED MEMORY

• Access memory of another node
• Latency & bandwidth issues
• InfiniBand (IB) vs. Gigabit Ethernet (GigE)
• Expandable (memory & nodes)

• Programming 99% always MPI
• Message Passing Interface
• Library calls
• More intrusive
• Different implementations of MPI standard
• Non-portable to non-MPI (without effort)


Parallel hardware

• Flynn’s taxonomy:
– SISD (Single Instruction, Single Data)

– SIMD (Single Instruction, Multiple Data)

– MISD (Multiple Instruction, Single Data), e.g. where fault tolerance is vital

– MIMD (Multiple Instruction, Multiple Data)

Flynn, Michael J. (September 1972).
“Some Computer Organizations and Their Effectiveness”.
IEEE Transactions on Computers, C-21 (9): 948–960. doi:10.1109/TC.1972.5009071.

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Parallel hardware

• Not for this course
– SISD (Single Instruction, Single Data): serial

– MISD (Multiple Instruction, Single Data), e.g. where fault tolerance is vital

• Of interest
– SIMD (Single Instruction, Multiple Data)

• Vectorisation, some OpenMP

• (SIMT: Single Instruction, Multiple Threads) = SIMD + multithreading
– Particularly for GPUs

– MIMD (Multiple Instruction, Multiple Data)
• Options: shared-memory || distributed-memory || hybrid

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane