Concurrency for Software
Development
Presented by
Dr. Shuaiwen Leon Song
USYD Future System Architecture Lab (FSA)
https://shuaiwen-leon-song.github.io/
Tips for students joining online
– Remember that you are still in a space with other students.
– Mute your microphone when not speaking.
– Use earphones or headphones – the mic is better and you’ll
disturb others less.
– If you have a webcam and feel comfortable doing so, please
switch it on so we can see you!
– Try not to talk over someone else.
– Use the chat function to send messages to the teacher or
classmates during class.
Tips for students learning online
– For tips and guides on learning online and the tools you will use, refer to Learning
while off campus resources in Canvas.
Our team for this class
Shuaiwen Leon Song, Ph.D., Virginia Tech
Assistant Professor, USYD Computer Science
Affiliated Professor, University of Washington Seattle
Affiliated Faculty, Sydney Quantum Academy
Home page: https://shuaiwen-leon-song.github.io/
Builds systems, both hardware and software; enjoys traveling, MMORPG games, and concerts.
TA and Tutors
TA and tutor 1: Tyson Thomas
email: tyson. .au
Tinkerer, painter, and creative coder
TA and tutor 2: Alan Robertson
email: alan. u.au
Quantum computing enthusiast, coding expert, brand new CS PhD student
Why this course?
• Almost all computers today use parallelism
• As software developers, we need to think
about parallelism from the start when we
design algorithms and programs
• High performance in many applications is
critical: we want to use our hardware as
efficiently as possible
Forms of Parallelism
• A commodity cluster computer is composed of
multiple nodes, connected via a network
– Each node is composed of a system board, 1 or more chips,
DRAM, coprocessors/accelerators, etc.
• Each chip contains multiple cores
– Each core has its own L1 cache, but may share higher level
caches with other cores
• Each core can
– Execute multiple instructions simultaneously (instruction-level parallelism)
– Execute a single instruction on multiple pieces of data simultaneously (SIMD parallelism); see the short example below
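To make the last two levels concrete, here is a minimal sketch in C (illustrative only, not the course's own example; "#pragma omp simd" is one way to request vectorization, and an optimizing compiler may vectorize such a loop on its own):

#include <stdio.h>
#define N 1024
int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    /* Independent iterations let the core overlap several of them
       (instruction-level parallelism) and, with SIMD instructions,
       process several floats per instruction. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    printf("c[10] = %f\n", c[10]);
    return 0;
}

Compiled with, e.g., gcc -O2 -fopenmp-simd, this still runs on a single core and a single thread; the parallelism here is entirely inside the core.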
Where can we use the knowledge learned here?
§ Topics in science and engineering include:
o Artificial intelligence
o Climate modeling
o Automotive engineering
o Cryptographic engineering
o Geophysics
o Molecular biology
o Molecular dynamics
o Nuclear physics
o Physical oceanography
o Plasma physics
o Quantum physics
o Quantum chemistry
o Solid state physics
o Structural dynamics
§ One example is modeling the fluid flow in the atmosphere
§ Solving Navier-Stokes equations
§ Computational requirements
§ Climate prediction (50 years in 30 days) → 4.8 TFLOPS (1 T = 10^12)
§ To use in policy negotiations (50 years in 12 hours) → 288 TFLOPS
§ To double the grid resolution, computation increases by 8x to 16x
§ Sustained 1.01 simulation-years-per-day (SYPD) at 3 km resolution (IEEE/ACM Gordon Bell Award 2016)
§ State-of-the-art models require integration of atmosphere, clouds, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry, and more → Very complicated models!
Example: Why Do We Need Ever-Increasing Performance?
• FLOPS = floating-point operations per second
• For example, Intel Xeon Gold 6148 @ 2.4 GHz, 20 cores (server processor)
• Peak performance = 2.4 GHz × 32 DP FLOPS/cycle × 20 cores ≈ 1.5 TFLOPS
• Real applications typically achieve only a small fraction of peak performance, e.g., 10%, or ~0.15 TFLOPS
• 288 TFLOPS / 0.15 TFLOPS ≈ 2,000 very powerful server CPUs (a quick back-of-the-envelope check follows below)
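A minimal sketch of this arithmetic in C (the clock rate, FLOPS/cycle, core count, 10% efficiency, and 288 TFLOPS target are the figures from this slide; everything else is illustrative):

#include <stdio.h>
int main(void) {
    double clock_hz        = 2.4e9;   /* 2.4 GHz */
    double flops_per_cycle = 32.0;    /* double-precision FLOPS per cycle per core */
    int    cores           = 20;
    double peak      = clock_hz * flops_per_cycle * cores;  /* peak FLOPS */
    double sustained = 0.10 * peak;                         /* assume ~10% of peak */
    double target    = 288e12;                              /* 288 TFLOPS requirement */
    printf("Peak:        %.2f TFLOPS\n", peak / 1e12);
    printf("Sustained:   %.2f TFLOPS\n", sustained / 1e12);
    printf("CPUs needed: %.0f\n", target / sustained);
    return 0;
}

This prints a peak of about 1.5 TFLOPS and roughly 1,900 CPUs for the 288 TFLOPS target, consistent with the ~2,000 figure above.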
Real-world use case for concurrency and parallel
computing
Summit: The Most Powerful
Supercomputer in The World
› Summit’s specifications
– Nodes: 4,608 (202,752 cores)
– CPUs: IBM Power9 (2/node)
– GPUs: NVIDIA Volta V100 (6/node)
– Memory: Each node has 512 GB DDR4 memory and 96 GB
HBM2 (high bandwidth memory) addressable by all CPUs and
GPUs
– Network: Mellanox EDR 100G InfiniBand, Non-blocking Fat Tree
› Features for AI techniques, like deep learning
Skylake: Recent Intel Microarchitecture
Intel Xeon Microarchitectures
Tock (new microarchitecture), Tick (shrunk version of the previous one)
Core (2006) – 65 nm
Penryn (2007) – 45 nm
Nehalem (2008) – 45 nm
Westmere (2010) – 32 nm
Sandy Bridge (2011) – 32 nm
Ivy Bridge (2012) – 22 nm
Haswell (2013) – 22 nm
Broadwell (2014) – 14 nm
Skylake (2015) – 14 nm
Kaby Lake (2016) – 14 nm
Cannon Lake (2018) – 10 nm
Ice Lake (2020?) – 10 nm
Tiger Lake (2021?) – 10 nm
The diameter of an atom ranges
from about 0.1 to 0.5
nanometers
Why Multiple Cores?
› CPU clock speed/frequency is stagnating, for several reasons:
– Hard to keep increasing the number of transistors (shrinking transistor sizes are approaching the atomic scale)
– Power limits (the old assumption that "power stays constant as transistor counts increase" no longer applies)
› How to get more computing power? → Increase the number of cores
Intel Xeon Gold 6148, 20 cores
Common Coprocessors or Accelerators
Intel Xeon Phi
Nvidia Tesla V100
Intel FPGA Accelerator (Arria® 10)
[Chart: shares of accelerator types among systems with accelerators — a majority use one type, fewer than 20% another, and about 2% a third]
› 38% of reported systems have accelerators installed
Parallel Programming: Shared Memory vs. Distributed Memory
› Shared memory: multiple threads of a process run on a single node (weeks 4 to 9); a minimal pthread sketch follows this list
– All the data can be accessed by all the threads in the regular way, i.e., as in a serial or sequential program
– Need mechanisms to coordinate cooperation of the threads (e.g., locks)
– Minor point: data may be physically distributed (multiple nodes), but software can be used to make it look "logically shared"
› Distributed memory: multiple processes run on multiple nodes (weeks 10 and 11)
– Processes only have access to data on their own node
– Use a "communication library" (e.g., MPI) to access data on other nodes
– Minor point: processes themselves can have multiple threads
– Minor point: could run multiple processes per node, e.g., one process per CPU core, multiple cores/processes per node
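To make the shared-memory side concrete before the pthread weeks, here is a minimal sketch (the counter example and all names are my own, not from the slides): two threads of one process update a variable they both see, and a mutex is the coordination mechanism mentioned above.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;   /* lives in the process's memory, visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* coordination: one increment at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 2000000 with the lock; unpredictable without it */
    return 0;
}

Build with gcc counter.c -pthread. In the distributed-memory model, the two workers would instead be separate processes, possibly on different nodes, exchanging data via explicit MPI messages rather than through a shared variable.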
Programming models you will encounter in this class
Why do I think this class
is important?
Solve Big Problems!!!
Reconfigurable Computing
Google’s Cloud TPU
Computational Heterogeneity
High-performance computing
Cloud computing & Data Centers
Rapid prototyping
Autonomous systems
More diversity:
• Quantum accelerators
• Neuromorphic chips
• Cryogenic devices
Extreme Heterogeneity
Source: http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis
› Apple AX Chip:
– >40 Specialized IP Blocks
Key Concerns
Programming:
› Too many ISAs and APIs
› Lack of universal programming abstractions
Characterization:
› Performance per processor per kernel per input?
› Diverse processors: accuracy or precision?
› How to address nested heterogeneity?
Runtime:
› OS scheduler: accelerators are seen as I/O devices, not processors
› Communication and synchronization complexity
› Utilization-objectives: performance, power, latency?
FFT(X)
Processor                               ISA / API
Carmel CPU cores                        ARMv8.2 / ARMPL
Volta GPU                               CUDA / cuFFT
PVA (Programmable Vision Accelerator)   OpenCV
DLA (Deep Learning Accelerator)         TensorRT
??                                      ?
Textbooks
•Maurice Herlihy & Nir Shavit, The Art of Multiprocessor Programming
•Calvin Lin & Lawrence Snyder, Principles of Parallel Programming
•OpenMP tutorial: https://computing.llnl.gov/tutorials/openMP/
•Lawrence Livermore National Lab online tutorial on pthread:
•https://computing.llnl.gov/tutorials/pthreads/
Course Topics
•LO1. Understanding the landscape of modern multi-core and many-core computer systems, real-
world examples and how this class can be used in reality
•LO2. Understanding basic system programming: Linux, basic C programming, C system programming, C pointers and memory management, and safety issues with memory management.
•LO3. Understanding the concurrent programming fundamentals, memory hierarchy and
optimizations, and contemporary multi-core hardware
•LO4. Understanding the challenges in developing concurrent data structures and algorithms, and
contemporary multi-core hardware
•LO5. Understanding “process” in the context of concurrent and parallel programming
•LO6. Understanding threads and synchronization; demonstrate ability to synchronize concurrent
threads
•LO7. Demonstrate ability to program using low-level atomic instructions
•LO8. Understand the concept of sharing memory across concurrent executions.
•LO9. Understanding different lock mechanisms and how to use them in parallel programming
•LO10. Understanding shared memory programming in a multi-core environment; able to write safe and efficient multithreading programs using pthreads and OpenMP
•LO11. Understanding the concept of message passing in parallel programming; able to write basic parallel codes on distributed systems using the Message Passing Interface (MPI)
•LO12. Understanding parallel design patterns.
Assessment
Few ways of submission for assessment
•Quizzes will be conducted on Canvas during tutorial time and invigilated by your tutor.
•Homework and assignments will use a combination of Canvas and Git submission.
Please make sure you have signed in to https://github.sydney.edu.au
The homework announcement will outline the requirements for your repo, and you will need to ensure that your final submission is in the master branch.
Reports will need to be submitted on Canvas via Turnitin.
•Details related to the final exam will be announced at a later date.
To elaborate on Git submission: if the assessment has a code component, you will need to create a repository and invite your tutor to it. We will provide additional guidelines when an assignment is announced. These may include:
•Repository name
•Test case format
•Expected visualizations and discussions
•… many more!
What you need in order to succeed in this course
• Desire to learn how to make programs run fast
• Curiosity to investigate performance anomalies
(required for making programs run fast)
• Engage and participate in class discussions and
activities. You also need to bring a laptop
computer for classes and tutorials.
• Familiarity with C programming
• Familiarity with using the Linux command line
• Not afraid of matrix operations!
Some Linux concepts you will need
• In addition to ssh, scp, editing (such as vim, nano), compiling,
moving files, /tmp file system……
• Understanding PATH
• Setting environment variables in general, e.g.,
LD_LIBRARY_PATH
• Writing shell scripts
• Shell startup file, e.g., .bashrc
• Note differences between different shells
• Know how to compile code with optimizations: gcc, mpicc, etc.
• Know simple debugging techniques: GDB
Other Relevant Information
There will be postings from Tyson Thomas and Alan Robertson on:
› How to set up your Linux virtual machine (for pthreads, OpenMP), in tutorials or Ed/Canvas postings;
› How to submit your homework, quizzes, and assignments using the USYD GitHub;
› How to access our free auto-scaling cloud cluster from Amazon: this can be used for programming exercises and assignments, and is especially helpful for the MPI-related content later in the class;
› If there are additional reading materials for a certain topic, we will also post them on Ed/Canvas.
Concurrency and Parallel Programming Essentials:
Parallel Performance Models and Analysis
Why Performance Modeling?
• Use models to choose among different
implementation options
• Use models to determine if you have achieved the best performance possible on your hardware
– Otherwise, you get a timing result, but how do you know if it is any good?
› Predict performance of parallel programs
› Understand barriers to higher performance
Learning Objectives
› General speedup formula
› Amdahl's Law
› Gustafson-Barsis' Law
› Karp-Flatt metric
› Isoefficiency metric
Outline
Strong Scaling vs. Weak Scaling
› Strong scaling concerns the speedup for a fixed problem size with
respect to the number of processors, and is governed by Amdahl’s law.
› Weak scaling concerns the speedup for a scaled problem size with
respect to the number of processors, and is governed by Gustafson’s law.
Speedup = Sequential execution time / Parallel execution time
Speedup Formula
› Inherently sequential computations: σ(n)
› Potentially parallel computations: φ(n)
› Communication operations: κ(n,p)
Execution Time Components
Speedup Expression
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
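A quick worked instance of this bound (illustrative numbers, not from the slides): with σ(n) = 10 s, φ(n) = 90 s, κ(n,p) = 2.5 s and p = 8, ψ(n,p) ≤ (10 + 90) / (10 + 90/8 + 2.5) = 100 / 23.75 ≈ 4.2, well below the ideal speedup of 8 because of the serial and communication terms.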
[Speedup Plot: φ(n)/p falls like a 1/x curve as p grows, while κ(n,p) rises with p; their sum φ(n)/p + κ(n,p) eventually grows, producing the "elbowing out" of the speedup curve]
Relatively Good Scalability on Real Machines
[Figure: measured scaling on real machines; source: …pdc/2018/11/scalability-strong-and-weak-scaling/]
Efficiency = Sequential execution time / (Processors used × Parallel execution time)
Efficiency = Speedup / Processors used (# of cores)
Speedup = Sequential execution time / Parallel execution time
Efficiency
ε(n,p) ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p))
All terms > 0 ⇒ ε(n,p) > 0
Denominator > numerator ⇒ ε(n,p) < 1
Hence 0 ≤ ε(n,p) ≤ 1
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
Let f = σ(n) / (σ(n) + φ(n))   (note: fix n, change p)
Then ψ ≤ 1 / (f + (1 − f)/p)
Amdahl's Law
Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors.
https://en.wikipedia.org/wiki/Parallel_computing
[Figure: Amdahl's Law theoretical speedup]
› 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9
Example 1
› 20% of a program's execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
lim (p→∞) 1 / (0.2 + (1 − 0.2)/p) = 1/0.2 = 5
Example 2
› An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?
Pop Quiz
› A computer animation program generates a feature movie frame-by-frame. Each frame can be generated independently and is output to its own file. If it takes 99 seconds to render a frame and 1 second to output it, how much speedup can be achieved by rendering the movie on 100 processors?
Pop Quiz
› Ignores κ(n,p)
› Overestimates achievable speedup
Limitations of Amdahl's Law
› Typically κ(n,p) has lower complexity than φ(n)/p
› As n increases, φ(n)/p dominates κ(n,p)
› As n increases, speedup increases
Amdahl Effect
[Figure: speedup vs. processors for n = 100, n = 1,000, and n = 10,000]
How to increase speedup?
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
› Treats problem size as a constant (strong scaling)
› Shows how execution time decreases as the number of processors increases
Review of Amdahl's Law
› We often use faster computers to solve larger problem instances
› Let's treat the parallel execution time as a constant and allow problem size to increase with the number of processors
› This is called weak scaling
Another Perspective
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
Let s = σ(n) / (σ(n) + φ(n)/p)   (note: σ(n) + φ(n)/p is fixed)
Then ψ ≤ p + (1 − p)·s
Gustafson-Barsis's Law
› Begin with parallel execution time
› Estimate sequential execution time to solve the same problem
› Problem size is an increasing function of p
› Predicts scaled speedup
Gustafson-Barsis's Law
› An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application?
ψ = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73
Execution on 1 CPU takes 10 times as long… except 9 of the processors do not have to execute the serial code.
Example 1
› What is the maximum fraction of a program's parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?
7 = 8 + (1 − 8)·s ⇒ s ≈ 0.14
Example 2
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
› Amdahl's Law and Gustafson-Barsis' Law ignore κ(n,p)
› They can overestimate speedup or scaled speedup
› Karp and Flatt proposed another metric
The Karp-Flatt Metric
› Takes into account parallel overhead
› Detects other sources of overhead or inefficiency ignored in the speedup model:
– Process startup time
– Process synchronization time
– Imbalanced workload
– Architectural overhead
Experimentally Determined Serial Fraction
e = (1/ψ − 1/p) / (1 − 1/p)
GOAL: figure out the serial fraction and communication overhead just by running the program, without looking at the source code or the underlying parallel algorithm.
Experimentally Determined Serial Fraction
e = f + κ(n,p)·[p/(p − 1)] / (σ(n) + φ(n)) ≈ (σ(n) + κ(n,p)) / (σ(n) + φ(n))
p   2    3    4    5    6    7    8
ψ   1.8  2.5  3.1  3.6  4.0  4.4  4.7
e   0.1  0.1  0.1  0.1  0.1  0.1  0.1
What is the primary reason for a speedup of only 4.7 on 8 CPUs? Since e is constant, a large serial fraction is the primary reason.
Example 1
p   2      3      4      5      6      7      8
ψ   1.9    2.6    3.2    3.7    4.1    4.5    4.7
e   0.070  0.075  0.080  0.085  0.090  0.095  0.100
What is the primary reason for a speedup of only 4.7 on 8 CPUs? Since e is steadily increasing, parallel overhead is the primary reason.
Example 2
› Is this program likely to achieve a speedup of 10 on 12 processors?
p   4      8      12
ψ   3.9    6.5    ?
e   0.008  0.032  0.018 (required for ψ = 10)
Pop Quiz
Answer: No. Reaching ψ = 10 on 12 processors would require e = 0.018, but e has been increasing with p (0.008 → 0.032), so we should expect it to be at least 0.032; that speedup cannot be achieved. (A short numerical check of these formulas follows.)
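The three metrics above are easy to explore numerically. A minimal sketch in C (the inputs reproduce the worked examples on these slides; the function and variable names are my own):

#include <stdio.h>

/* Amdahl: speedup bound for serial fraction f on p processors */
static double amdahl(double f, int p)       { return 1.0 / (f + (1.0 - f) / p); }

/* Gustafson-Barsis: scaled speedup when fraction s of the parallel run is serial */
static double gustafson(double s, int p)    { return p + (1 - p) * s; }

/* Karp-Flatt: experimentally determined serial fraction from a measured speedup psi */
static double karp_flatt(double psi, int p) { return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p); }

int main(void) {
    printf("Amdahl,     f = 0.05, p = 8      : %.2f\n", amdahl(0.05, 8));        /* ~5.9 */
    printf("Amdahl,     f = 0.20, p -> large : %.2f\n", amdahl(0.20, 1000000));  /* -> 5 */
    printf("Gustafson,  s = 0.03, p = 10     : %.2f\n", gustafson(0.03, 10));    /* 9.73 */
    printf("Karp-Flatt, psi = 4.7, p = 8     : %.3f\n", karp_flatt(4.7, 8));     /* ~0.100 */
    printf("Karp-Flatt, psi = 6.5, p = 8     : %.3f\n", karp_flatt(6.5, 8));     /* ~0.033 */
    return 0;
}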
› Parallel system: parallel program executing on a parallel computer
› Scalability of a parallel system: measure of its ability to increase performance as the number of processors increases
› A scalable system maintains efficiency as processors are added
› Isoefficiency: a way to measure scalability
Isoefficiency Metric
› Begin with the speedup formula
› Compute the total amount of overhead
› Assume efficiency remains constant
› Determine the relation between sequential execution time and overhead
Isoefficiency Derivation Steps
Deriving Isoefficiency Relation
ε(n,p) = ψ(n,p)/p ≤ [(σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))] / p
       = (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p))
       = (σ(n) + φ(n)) / (σ(n) + φ(n) + (p − 1)·σ(n) + p·κ(n,p))
       = 1 / (1 + [(p − 1)·σ(n) + p·κ(n,p)] / (σ(n) + φ(n)))
https://www.cse.wustl.edu/~roger/569M.s09/Isoefficiency.pdf
Let T_o(n,p) = (p − 1)·σ(n) + p·κ(n,p)   (total overhead)
and T(n,1) = σ(n) + φ(n); then the formula on the previous slide becomes
ε ≤ 1 / (1 + T_o(n,p)/T(n,1))
Assuming efficiency is held constant, this can be rewritten as the
Isoefficiency Relation: T(n,1) ≥ C·T_o(n,p), where C = ε(n,p) / (1 − ε(n,p))
Deriving Isoefficiency Relation
› The isoefficiency relation can often be simplified as n ≥ f(p)
› Let M(n) denote the memory required for a problem of size n
› M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency
› We call M(f(p))/p the scalability function
Scalability Function
› To maintain efficiency when increasing p, we must increase n
› Maximum problem size is limited by available memory, which is linear in p
› The scalability function shows how memory usage per processor must grow to maintain efficiency
› If the scalability function is a constant, the parallel system is perfectly scalable
Meaning of Scalability Function
[Figure: memory needed per processor vs. number of processors, for scalability functions C, C·log p, C·p, and C·p·log p; curves that stay within the available memory per processor can maintain efficiency, curves that grow beyond it cannot]
Interpreting Scalability Function (M(f(p))/p)
Example 1
Let n>f(p) denote the Isoefficiency relation of a parallel system and
M(n) denote the amount of memory for a problem of size n. Use the
scalability function to rank the following parallel systems from most to
least scalable:
(a). f(p)=p and M(n)=n;
(b). f(p)=p and M(n)=n^2;
(c). f(p)=p*log(p) and M(n)=n;
(d). f(p)=p*log(p) and M(n)=n^2.
Example 1: Answer
Solution:
a. Scalability function = M(f(p))/p = (Cp)/p = C
b. Scalability function = M(f(p))/p = (Cp)^2/p = C^2p
c. Scalability function = M(f(p))/p = (Cp*log(p))/p = C*log(p)
d. Scalability function = M(f(p))/p = (Cp*log(p))^2/p = C^2*p*log^2(p)
From most scalable to least scalable is a > c > b > d.
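A quick numerical check of this ranking (illustrative only, with C = 1 and log taken as log2): evaluate each scalability function M(f(p))/p at growing p and see which ones stay bounded.

#include <stdio.h>
#include <math.h>

int main(void) {
    int ps[] = {8, 64, 512, 4096};
    printf("%6s %8s %12s %10s %18s\n", "p", "(a) C", "(c) C log p", "(b) C^2 p", "(d) C^2 p log^2 p");
    for (int i = 0; i < 4; i++) {
        double p  = ps[i];
        double lg = log2(p);
        printf("%6.0f %8.1f %12.1f %10.1f %18.1f\n", p, 1.0, lg, p, p * lg * lg);
    }
    return 0;
}

Build with gcc scalability.c -lm. Column (a) stays constant, (c) grows slowly, (b) grows linearly, and (d) grows fastest, confirming the ranking a > c > b > d.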
› Sequential algorithm complexity: T(n,1) = Θ(n)
› Parallel algorithm:
– Computational complexity = Θ(n/p)
– Communication complexity = Θ(log p)
› Parallel overhead: T_o(n,p) = Θ(p·log p)
Example 2: Reduction
› The system has good scalability
› Isoefficiency relation: n ≥ C·p·log p
› We ask: to maintain the same level of efficiency, how must n increase when p increases?
› M(n) = n
M(C·p·log p)/p = C·p·log p / p = C·log p
Reduction (continued)
› Sequential time complexity: T(n,1) = Θ(n³)
› Parallel computation time: Θ(n³/p)
› Parallel communication time: Θ(n²·log p)
› Parallel overhead: T_o(n,p) = Θ(p·n²·log p)
Example 3: Floyd’s Algorithm
› Isoefficiency relation: n³ ≥ C·(p·n²·log p) ⇒ n ≥ C·p·log p
› M(n) = n²
› The parallel system has poor scalability:
M(C·p·log p)/p = C²·p²·log²p / p = C²·p·log²p
Floyd’s Algorithm (continued)
› Sequential time complexity per iteration: T(n,1) = Θ(n²)
› Parallel communication complexity per iteration: Θ(n/√p)
› Parallel overhead: T_o(n,p) = Θ(n·√p)
› Space complexity: M(n) = n²
› What is the scalability function?
Example 4: Finite Difference
› Isoefficiency relation: n² ≥ C·n·√p ⇒ n ≥ C·√p
› M(n) = n²
› This algorithm is perfectly scalable:
M(C·√p)/p = C²·p / p = C²
Finite Difference (continued)
› Performance terms
– Speedup
– Efficiency
› Model of speedup
– Serial component
– Parallel component
– Communication component
Summary (1/3)
› What prevents linear speedup?
– Serial operations
– Communication operations
– Process start-up
– Imbalanced workloads
– Architectural limitations (memory)
Summary (2/3)
› Analyzing parallel performance
– Amdahl's Law
– Gustafson-Barsis' Law
– Karp-Flatt metric
– Isoefficiency metric
Summary (3/3)
Backup Slides