
Concurrency for Software Development
Presented by
Dr. Shuaiwen Leon Song
USYD Future System Architecture Lab (FSA) https://shuaiwen-leon-song.github.io/

Tips for students joining online
– Remember that you are still in a space with other students.
– Mute your microphone when not speaking.
– Use earphones or headphones – the mic is better and you’ll disturb others less.
– If you have a webcam and feel comfortable doing so, please switch it on so we can see you!
– Try not to talk over someone else.
– Use the chat function to send messages to the teacher or classmates during class.

Tips for students learning online
– For tips and guides on learning online and the tools you will use, refer to Learning while off campus resources in Canvas.

TA and Tutors
TA and tutor 1:
email:
Tinkerer, painter, and creative coder.

TA and tutor 2:
email: u.au
Quantum computing enthusiast, coding expert, and brand-new CS PhD student.
Our team for this class
Shuaiwen Leon Song, Ph.D.
Assistant Professor, USYD Computer Science; Affiliated Professor, University of Washington Seattle; Affiliated Faculty, Academy
Home page: https://shuaiwen-leon-song.github.io/
Builds systems, hardware and software; enjoys traveling, MMORPG games, and concerts.

Why this course?
• Almost all computers today use parallelism
• As software developers, we need to think about parallelism from the start when we design algorithms and programs
• High performance in many applications is critical: we want to use our hardware as efficiently as possible

Forms of Parallelism
• A commodity cluster computer is composed of multiple nodes, connected via a network
– Each node is composed of a system board, 1 or more chips, DRAM, coprocessors/accelerators, etc.
• Each chip contains multiple cores
– Each core has its own L1 cache, but may share higher-level caches with other cores
• Each core can
– Execute multiple instructions simultaneously (instruction-level parallelism)
– Execute the same instruction on multiple pieces of data simultaneously (SIMD parallelism)
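As a concrete illustration, here is a minimal sketch (my own example, not from the slides) of a loop that a modern compiler may turn into SIMD instructions when optimization is enabled; the function and array names are arbitrary placeholders.

#include <stddef.h>

/* Element-wise addition: with e.g. `gcc -O3 -march=native` the compiler
 * may auto-vectorize this loop, so one instruction operates on several
 * array elements at once (SIMD). Because the iterations are independent,
 * the core's out-of-order hardware can also overlap them (instruction-level
 * parallelism). */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}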

Where can we use the knowledge learned here?
§ Topics in science and engineering include:
o Artificial intelligence
o Climate modeling
o Automotive engineering
o Cryptographic engineering
o Geophysics
o Molecular biology
o Molecular dynamics
o Nuclear physics
o Physical oceanography
o Plasma physics
o Quantum physics
o Quantum chemistry
o Solid state physics
o Structural dynamics

Example: Why We Need Ever-Increasing Performance?
§ One example is modeling the fluid flow in the atmosphere
§ Solving the Navier-Stokes equations
§ Computational requirements
§ Climate prediction (50 years in 30 days) → 4.8 TFLOPS (1 T = 10^12)
§ To use in policy negotiations (50 years in 12 hours) → 288 TFLOPS
§ To double the grid resolution, the computation needs 8x to 16x more work
§ Sustained 1.01 simulation-years-per-day (SYPD) at 3 km resolution (IEEE/ACM Gordon Bell Award 2016)
§ State-of-the-art models require integration of atmosphere, cloud, ocean, sea-ice, and land models, plus possibly the carbon cycle, geochemistry, and more → very complicated models!!
• FLOPS = floating point operations per second
• For example, Intel Xeon Gold 6148 @ 2.4 GHz, 20 cores (server processor)
• Peak performance = 2.4 GHz x 32 DP FLOPS/cycle x 20 cores = 1.5 TFLOPS
• Real applications typically achieve only a small fraction of peak performance, e.g., 10% ~ 0.15 TFLOPS
• 288 TFLOPS / 0.15 TFLOPS ~ 2,000 very powerful server CPUs
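To make the arithmetic above concrete, here is a small C sketch (my own illustration, not part of the slides) that reproduces the peak-performance and CPU-count estimates; the 10% sustained-efficiency figure is the assumption stated above.

#include <stdio.h>

int main(void)
{
    /* Intel Xeon Gold 6148: 2.4 GHz, 32 double-precision FLOPs/cycle, 20 cores */
    double ghz = 2.4, flops_per_cycle = 32.0, cores = 20.0;
    double peak_tflops = ghz * flops_per_cycle * cores / 1000.0;   /* ~1.5 TFLOPS */

    double sustained = 0.10 * peak_tflops;   /* assume ~10% of peak is achieved */
    double target = 288.0;                   /* TFLOPS needed for the policy-negotiation scenario */

    printf("Peak:      %.2f TFLOPS\n", peak_tflops);
    printf("Sustained: %.2f TFLOPS\n", sustained);
    printf("CPUs needed for %.0f TFLOPS: ~%.0f\n", target, target / sustained);
    return 0;
}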

Real-world use case for concurrency and parallel computing

Summit: The Most Powerful Supercomputer in The World
› Summit’s specifications
– Nodes: 4,608 (202,752 cores)
– CPUs: IBM Power9 (2/node)
– GPUs: NVIDIA Volta V100 (6/node)
– Memory: Each node has 512 GB DDR4 memory and 96 GB HBM2 (high bandwidth memory) addressable by all CPUs and GPUs
– Network: Mellanox EDR 100G InfiniBand, non-blocking fat tree
› Features for AI techniques, like deep learning

Skylake: Recent Intel Microarchitecture

Intel's "tick-tock" model:
– Tock: a new microarchitecture
– Tick: a die shrink of the previous microarchitecture
Core (2006) 65 nm
Penryn (2007) 45 nm
Nehalem (2008) 45 nm
Westmere (2010) 32 nm
Sandy Bridge (2011) 32 nm
Ivy Bridge (2012) 22 nm
Haswell (2013) 22 nm
Broadwell (2014) 14 nm
Skylake (2015) 14 nm
Kaby Lake (2016) 14 nm
Cannon Lake (2018) 10 nm
Ice Lake (2020?) 10 nm
Tiger Lake (2021?) 10 nm
For scale: the diameter of an atom ranges from about 0.1 to 0.5 nanometers.

Why Multiple Cores?
› CPU clock speed/frequency has stagnated, for several reasons:
– It is hard to keep shrinking transistors (feature sizes are approaching the atomic scale)
– Power limits (the assumption that power density stays roughly constant as transistors shrink no longer holds)
› How to get more computing power? → Increase the number of cores
Intel Xeon Gold 6148, 20 cores

Common Coprocessors or Accelerators
› 38% of reported systems have accelerators installed
– Intel Xeon Phi: less than 20% of systems with accelerators
– NVIDIA Tesla V100: the majority of systems with accelerators
– Intel FPGA Accelerator (Arria® 10): 2% of systems with accelerators

Programming models you will encounter in this class
Parallel Programming: Shared Memory vs. Distributed Memory
› Shared memory: multiple threads of a process run on a single node (weeks 4 to 9; a minimal Pthreads sketch follows this list)
– All data can be accessed by all threads, just as in a regular serial (sequential) program
– Mechanisms are needed to coordinate the cooperation of the threads (e.g., locks)
– Minor point: data may be physically distributed (multiple nodes), but software can be used to make it look "logically shared"
› Distributed memory: multiple processes run on multiple nodes (weeks 10 and 11)
– Processes only have access to data on their own node
– A "communication library" (e.g., MPI) is used to access data on other nodes
– Minor point: processes themselves can have multiple threads
– Minor point: you could run multiple processes per node, e.g., one process per CPU core, multiple cores/processes per node
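A minimal shared-memory sketch using POSIX threads (my own illustrative example, not course-provided code): all threads read and write the same array directly, and a join is used to coordinate completion.

#include <pthread.h>
#include <stdio.h>

#define N 8
#define NTHREADS 4

static int data[N];   /* shared: visible to every thread of the process */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Each thread fills its own disjoint slice, so no lock is needed here. */
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        data[i] = (int)id;
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);   /* wait for all threads to finish */
    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}

Compile with something like `gcc -pthread shared.c`; by contrast, a distributed-memory version would exchange data explicitly through MPI calls, which we cover later in the semester.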

Why do I think this class is important?
Solve Big Problems!!!

Computational Heterogeneity
• High-performance computing
• Cloud computing & data centers (e.g., Google's Cloud TPU)
• Rapid prototyping
• Autonomous systems
• Reconfigurable computing
More diversity:
• Quantum accelerators
• Neuromorphic chips
• Cryogenic devices

Extreme Heterogeneity
› X Chip:
– >40 specialized IP blocks
Source: http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis

Key Concerns
Example kernel: FFT(x) — which of the chip's processors should run it?
Programming:
› Too many ISAs and APIs
› Lack of universal programming abstractions
Characterization:
› Performance per processor per kernel per input?
› Diverse processors: accuracy or precision?
› How to address nested heterogeneity?
Runtime:
› OS scheduler: sees accelerators as I/O devices, not processors
› Communication and synchronization complexity
› Utilization objectives: performance, power, latency?

Processor                                ISA / API
CPU cores                                ARMv8.2 / ARMPL
Volta GPU                                CUDA / cuFFT
PVA (Programmable Vision Accelerator)    OpenCV
DLA (Deep Learning Accelerator)          TensorRT

Textbooks
• Herlihy, M. and Shavit, N., The Art of Multiprocessor Programming
• Lin, C. and Snyder, L., Principles of Parallel Programming
• OpenMP tutorial: https://computing.llnl.gov/tutorials/openMP/
• LLNL online tutorial on Pthreads: https://computing.llnl.gov/tutorials/pthreads/

Course Topics
• LO1. Understanding the landscape of modern multi-core and many-core computer systems, real-world examples, and how this class can be used in reality
• LO2. Understanding basic system programming: Linux, basic C programming knowledge, C system programming, C pointers and memory management, and safety issues with memory management
• LO3. Understanding concurrent programming fundamentals, memory hierarchy and optimizations, and contemporary multi-core hardware
• LO4. Understanding the challenges in developing concurrent data structures and algorithms on contemporary multi-core hardware
• LO5. Understanding "process" in the context of concurrent and parallel programming
• LO6. Understanding threads and synchronization; demonstrating the ability to synchronize concurrent threads
• LO7. Demonstrating the ability to program using low-level atomic instructions
• LO8. Understanding the concept of sharing memory across concurrent executions
• LO9. Understanding different lock mechanisms and how to use them in parallel programming
• LO10. Understanding shared-memory programming in a multi-core environment; able to write safe and efficient multithreaded programs using Pthreads and OpenMP
• LO11. Understanding the concept of message passing in parallel programming; able to write basic parallel codes on distributed systems using the Message Passing Interface
• LO12. Understanding parallel design patterns

Assessment

A few ways of submitting assessments
• Quizzes will be conducted on Canvas; they will take place during tutorial time and be invigilated by your tutor.
• Homework and assignments will use a combination of Canvas and Git submission.
Please make sure you have signed in to https://github.sydney.edu.au
Each homework announcement will outline the requirements for your repo, and you will need to ensure that your final submission is in the master branch.
Reports will need to be submitted on Canvas via Turnitin.
• Details related to the final exam will be announced at a later date.
To elaborate on Git submission: if the assessment has a code component, you will need to create a repository and invite your tutor to it. We will provide additional guidelines when an assignment is announced. These may include:
• Repository name
• Test case format
• Expected visualizations and discussions
• ... many more!

What you need in order to succeed in this course
• Desire to learn how to make programs run fast
• Curiosity to investigate performance anomalies (required for making programs run fast)
• Engage and participate in class discussions and activities. You also need to bring a laptop computer to classes and tutorials.
• Familiarity with C programming
• Familiarity with using the Linux command line
• Not afraid of matrix operations!

Some Linux concepts you will need
• In addition to ssh, scp, editing (e.g., vim, nano), compiling, moving files, the /tmp file system, ...
• Understanding PATH
• Setting environment variables in general, e.g., LD_LIBRARY_PATH
• Writing shell scripts
• Shell startup files, e.g., .bashrc
• Note the differences between different shells
• Know how to compile code with optimizations: gcc, mpicc, etc. (see the short sketch below)
• Know simple debugging techniques: GDB https://www.youtube.com/watch?v=pJWtXvmGraU
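A minimal sketch of the compile-and-debug workflow above, using a throwaway example program of my own (the file name and the commands in the comments are illustrative, not course requirements):

/* hello_sum.c -- hypothetical example file.
 * Compile with debug info and optimizations:   gcc -g -O2 -o hello_sum hello_sum.c
 * Run under the debugger:                      gdb ./hello_sum
 *   (gdb) break main
 *   (gdb) run
 *   (gdb) next
 *   (gdb) print sum
 */
#include <stdio.h>

int main(void)
{
    int sum = 0;
    for (int i = 1; i <= 10; i++)
        sum += i;                 /* a spot worth stepping through in GDB */
    printf("sum = %d\n", sum);
    return 0;
}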

Other Relevant Information
There will be postings covering:
› How to set up your Linux virtual machine (for Pthreads, OpenMP), in tutorials or Ed/Canvas postings;
› How to submit your homework, quizzes, and assignments using the USYD GitHub;
› How to access our free cloud autoscaling cluster from Amazon: this can be used for programming exercises and assignments, and is especially helpful for the MPI-related content later in the class;
› If there are additional reading materials for a certain topic, we will also post them on Ed/Canvas.

Concurrency and Parallel Programming Essentials:
Parallel Performance Models and Analysis

Why Performance Modeling?
• Use models to choose among different implementation options
• Use models to determine whether you have achieved the best performance possible on your hardware
– Otherwise, you get a timing result, but how do you know if it is any good?

Learning Objectives
• Predict the performance of parallel programs
• Understand barriers to higher performance

Outline
• General speedup formula
• Amdahl's Law
• Gustafson-Barsis' Law
• Karp-Flatt metric
• Isoefficiency metric

Strong Scaling vs. Weak Scaling
› Strong scaling concerns the speedup for a fixed problem size with respect to the number of processors, and is governed by Amdahl’s law.
› Weak scaling concerns the speedup for a scaled problem size with respect to the number of processors, and is governed by Gustafson’s law.

Speedup Formula
Speedup = Sequential execution time / Parallel execution time

Execution Time Components
• Inherently sequential computations: σ(n)
• Potentially parallelizable computations: φ(n)
• Communication operations: κ(n,p)

Speedup Expression
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))

[Figures: φ(n)/p versus p follows a 1/x curve; κ(n,p) increases with p; their sum φ(n)/p + κ(n,p) is plotted against p; the resulting speedup plot shows "elbowing out" as p grows.]

Relatively Good Scalability on Real Machines
https://www.kth.se/blogs/ pdc/2018/11/scalability- strong-and-weak-scaling/

Efficiency
Speedup = Sequential execution time / Parallel execution time
Efficiency = Sequential execution time / (Processors used × Parallel execution time)
Efficiency = Speedup / Processors used
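A quick worked example with illustrative numbers (my own, not from the slides): if a program takes 100 s sequentially and 25 s on 8 processors, then Speedup = 100/25 = 4 and Efficiency = 4/8 = 0.5.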

0 ≤ ε(n,p) ≤ 1
ε(n,p) ≤ (σ(n) + φ(n)) / (pσ(n) + φ(n) + pκ(n,p))
All terms > 0 ⇒ ε(n,p) > 0
Denominator > numerator ⇒ ε(n,p) < 1

Amdahl's Law
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
Let f = σ(n) / (σ(n) + φ(n))
ψ ≤ 1 / (f + (1 − f)/p)
Amdahl's Law is often used in parallel computing to predict the theoretical speedup when using multiple processors. Note: n is fixed while p changes.
[Figure: Amdahl's Law theoretical speedup curves.]

Example 1
• 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9

Example 2
• 20% of a program's execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
lim (p→∞) 1 / (0.2 + (1 − 0.2)/p) = 1/0.2 = 5

Pop Quiz
• An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals that 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?

Pop Quiz
• A computer animation program generates a feature movie frame by frame. Each frame can be generated independently and is output to its own file. If it takes 99 seconds to render a frame and 1 second to output it, how much speedup can be achieved by rendering the movie on 100 processors?

Limitations of Amdahl's Law
• Ignores κ(n,p)
• Overestimates the achievable speedup

Amdahl Effect
• Typically κ(n,p) has lower complexity than φ(n)/p
• As n increases, φ(n)/p dominates κ(n,p)
• As n increases, speedup increases
How to increase speedup?
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
[Figure: speedup versus number of processors for n = 100, n = 1,000, and n = 10,000; larger problem sizes give higher speedup.]

Review of Amdahl's Law
• Treats problem size as a constant (strong scaling)
• Shows how execution time decreases as the number of processors increases

Another Perspective
• We often use faster computers to solve larger problem instances
• Let's treat the parallel execution time as a constant and allow the problem size to increase with the number of processors
• This is called weak scaling

Gustafson-Barsis's Law
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
Let s = σ(n) / (σ(n) + φ(n)/p)   (note: σ(n) + φ(n)/p, the parallel execution time, is fixed)
ψ ≤ p + (1 − p)s

Gustafson-Barsis's Law
• Begin with the parallel execution time
• Estimate the sequential execution time to solve the same problem
• Problem size is an increasing function of p
• Predicts scaled speedup

Example 1
• An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application?
ψ = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73
Execution on 1 CPU takes 10 times as long... except 9 of the 10 processors do not have to execute the serial code.

Example 2
• What is the maximum fraction of a program's parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?
7 = 8 + (1 − 8)s ⇒ s ≈ 0.14

The Karp-Flatt Metric
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
• Amdahl's Law and Gustafson-Barsis' Law ignore κ(n,p)
• They can overestimate speedup or scaled speedup
• Karp and Flatt proposed another metric

Experimentally Determined Serial Fraction
• Takes parallel overhead into account
• Detects other sources of overhead or inefficiency ignored in the speedup model:
– Process startup time
– Process synchronization time
– Imbalanced workload
– Architectural overhead

Experimentally Determined Serial Fraction
e = (1/ψ − 1/p) / (1 − 1/p)
GOAL: figure out the serial fraction and communication overhead just by running the program, without looking at the source code or the underlying parallel algorithm.
e = f + [κ(n,p) / (σ(n) + φ(n))] · [p/(p − 1)] ≈ (σ(n) + κ(n,p)) / (σ(n) + φ(n))

Example 1
p   2    3    4    5    6    7    8
ψ   1.8  2.5  3.1  3.6  4.0  4.4  4.7
e   0.1  0.1  0.1  0.1  0.1  0.1  0.1
What is the primary reason for a speedup of only 4.7 on 8 CPUs? Since e is constant, a large serial fraction is the primary reason.

Example 2
p   2      3      4      5      6      7      8
ψ   1.9    2.6    3.2    3.7    4.1    4.5    4.7
e   0.070  0.075  0.080  0.085  0.090  0.095  0.100
What is the primary reason for a speedup of only 4.7 on 8 CPUs? Since e is steadily increasing, parallel overhead is the primary reason.

Pop Quiz
p   4    8    12
ψ   3.9  6.5  ?
Is this program likely to achieve a speedup of 10 on 12 processors?
Computed values: e ≈ 0.008 at p = 4 and e ≈ 0.032 at p = 8, while a speedup of 10 on 12 processors would correspond to e ≈ 0.018. Since the experimentally determined serial fraction is increasing with p, it will not fall back to 0.018, so a speedup of 10 cannot be achieved.
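To tie these three formulas together, here is a small C sketch (my own illustration; the sample numbers mirror the examples above) that evaluates Amdahl's predicted speedup, the Gustafson-Barsis scaled speedup, and the Karp-Flatt experimentally determined serial fraction.

#include <stdio.h>

/* Amdahl: psi <= 1 / (f + (1 - f)/p), with f the serial fraction of the
 * sequential execution time. */
static double amdahl(double f, double p)        { return 1.0 / (f + (1.0 - f) / p); }

/* Gustafson-Barsis: psi <= p + (1 - p)s, with s the serial fraction of the
 * parallel execution time. */
static double gustafson(double s, double p)     { return p + (1.0 - p) * s; }

/* Karp-Flatt: e = (1/psi - 1/p) / (1 - 1/p), from a measured speedup psi. */
static double karp_flatt(double psi, double p)  { return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p); }

int main(void)
{
    printf("Amdahl, f = 0.05, p = 8:       %.2f\n", amdahl(0.05, 8));       /* ~5.9  */
    printf("Gustafson, s = 0.03, p = 10:   %.2f\n", gustafson(0.03, 10));   /* ~9.73 */
    printf("Karp-Flatt, psi = 4.7, p = 8:  %.3f\n", karp_flatt(4.7, 8));    /* ~0.1  */
    return 0;
}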

Isoefficiency Metric
• Parallel system: a parallel program executing on a parallel computer
• Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
• A scalable system maintains efficiency as processors are added
• Isoefficiency: a way to measure scalability

Isoefficiency Derivation Steps
• Begin with the speedup formula
• Compute the total amount of overhead
• Assume efficiency remains constant
• Determine the relation between sequential execution time and overhead

Deriving Isoefficiency Relation
(Reference: https://www.cse.wustl.edu/~roger/569M.s09/Isoefficiency.pdf)
ε(n,p) = ψ(n,p) / p
       ≤ (σ(n) + φ(n)) / (p · (σ(n) + φ(n)/p + κ(n,p)))
       = (σ(n) + φ(n)) / (pσ(n) + φ(n) + pκ(n,p))
       = (σ(n) + φ(n)) / (σ(n) + φ(n) + (p − 1)σ(n) + pκ(n,p))
       = 1 / (1 + ((p − 1)σ(n) + pκ(n,p)) / (σ(n) + φ(n)))


Deriving Isoefficiency Relation (continued)
Let T₀(n,p) = (p − 1)σ(n) + pκ(n,p) and T(n,1) = σ(n) + φ(n).
Then the formula on the previous slide becomes
ε(n,p) ≤ 1 / (1 + T₀(n,p) / T(n,1))
Assuming the efficiency ε is held constant, this can be rewritten as
T(n,1) ≥ (ε / (1 − ε)) · T₀(n,p) = C · T₀(n,p)   — the Isoefficiency Relation

Scalability Function
• The isoefficiency relation can often be simplified to n ≥ f(p)
• Let M(n) denote the memory required for a problem of size n
• M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency
• We call M(f(p))/p the scalability function

Meaning of Scalability Function
• To maintain efficiency when increasing p, we must increase n
• The maximum problem size is limited by the available memory, which is linear in p
• The scalability function shows how memory usage per processor must grow to maintain efficiency
• If the scalability function is a constant, the parallel system is perfectly scalable

Interpreting Scalability Function (M(f(p))/p)
[Figure: memory needed per processor versus number of processors, with the available memory per processor shown as a horizontal limit. A constant scalability function (C) can maintain efficiency indefinitely; growing functions such as C·log p eventually run up against the memory limit and cannot maintain efficiency beyond that point.]

Example 1
Question 5: Let n ≥ f(p) denote the isoefficiency relation of a parallel system and M(n) denote the amount of memory required for a problem of size n. Use the scalability function to rank the following parallel systems from most to least scalable:
(a) f(p) = p and M(n) = n;
(b) f(p) = p and M(n) = n^2;
(c) f(p) = p·log(p) and M(n) = n;
(d) f(p) = p·log(p) and M(n) = n^2.

Example 1: Answer
Question 5: Let n ≥ f(p) denote the isoefficiency relation of a parallel system and M(n) denote the amount of memory required for a problem of size n. Use the scalability function to rank the following parallel systems from most to least scalable:
(a) f(p) = p and M(n) = n;
(b) f(p) = p and M(n) = n^2;
(c) f(p) = p·log(p) and M(n) = n;
(d) f(p) = p·log(p) and M(n) = n^2.
Solution:
(a) Scalability function = M(f(p))/p = (Cp)/p = C
(b) Scalability function = M(f(p))/p = (Cp)^2/p = C^2·p
(c) Scalability function = M(f(p))/p = (Cp·log p)/p = C·log p
(d) Scalability function = M(f(p))/p = (Cp·log p)^2/p = C^2·p·log^2(p)
From most scalable to least scalable: a > c > b > d.

Example 2: Reduction
• Sequential algorithm complexity: T(n,1) = Θ(n)
• Parallel algorithm
– Computational complexity = Θ(n/p)
– Communication complexity = Θ(log p)
• Parallel overhead: T₀(n,p) = Θ(p log p)

Reduction (continued)
• Isoefficiency relation: n ≥ C·p·log p
• We ask: to maintain the same level of efficiency, how must n increase when p increases?
• M(n) = n
M(C·p·log p)/p = C·p·log p / p = C·log p
• The system has good scalability (a minimal MPI reduction sketch follows)
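A minimal MPI reduction sketch (my own illustration, not from the slides): each process computes a Θ(n/p) partial sum, and MPI_Reduce typically combines the p partial results in Θ(log p) communication steps, which is where the Θ(p log p) total overhead used above comes from.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process sums its own strided slice of 0..n-1 (Theta(n/p) work). */
    const long n = 1000000;
    long local = 0, global = 0;
    for (long i = rank; i < n; i += size)
        local += i;

    /* Combine the partial sums; typical implementations use a tree,
     * i.e., Theta(log p) communication steps. */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %ld\n", global);

    MPI_Finalize();
    return 0;
}

Compile and run with something like `mpicc reduce.c && mpirun -np 4 ./a.out`.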

Example 3: Floyd’s Algorithm
• Sequential time complexity: T(n,1) = Θ(n^3)
• Parallel computation time: Θ(n^3/p)
• Parallel communication time: Θ(n^2·log p)
• Parallel overhead: T₀(n,p) = Θ(p·n^2·log p)

Floyd's Algorithm (continued)
• Isoefficiency relation: n^3 ≥ C·p·n^2·log p ⇒ n ≥ C·p·log p
• M(n) = n^2
M(C·p·log p)/p = C^2·p^2·log^2 p / p = C^2·p·log^2 p
• The parallel system has poor scalability

Example 4: Finite Difference
• Sequential time complexity per iteration: T(n,1) = Θ(n^2)
• Parallel communication complexity per iteration: Θ(n/√p)
• Parallel overhead: T₀(n,p) = Θ(n·√p)
• Space complexity: M(n) = n^2
• What is the scalability function?

Finite Difference (continued)
• Isoefficiency relation: n^2 ≥ C·n·√p ⇒ n ≥ C·√p
• M(n) = n^2
M(C·√p)/p = C^2·p/p = C^2
• This algorithm is perfectly scalable

Summary (1/3)
• Performance terms
– Speedup
– Efficiency
• Model of speedup
– Serial component
– Parallel component
– Communication component

Summary (2/3)
• What prevents linear speedup?
– Serial operations
– Communication operations
– Process start-up
– Imbalanced workloads
– Architectural limitations (memory)

Summary (3/3)
• Analyzing parallel performance
– Amdahl's Law
– Gustafson-Barsis' Law
– Karp-Flatt metric
– Isoefficiency metric

Backup Slides