
COSC 407: Intro to Parallel Computing
Topic 11 – Speedup vs. Efficiency


Introduction
• Speed of execution depends on many factors; one of them is a good algorithm and good code.
• Today:
• Factors that affect execution time
• Then we focus on the factors related to the code (the algorithm)
• Speedup and efficiency
• Amdahl’s and Gustafson’s laws

Computer Performance
How do we measure CPU performance?

Response Time (aka Execution Time) – our focus today
• The time the CPU takes to finish a given task, including all subtasks (e.g., memory access, I/O activities, CPU execution time, etc.).
• Example: how long should I wait for a DB query?
• Unit of measurement: seconds

Throughput – we’ll talk about this later (CUDA)
• The total work finished per unit of time. Here we don’t care about the time spent on each element, only the total work done per second.
• Example: how many client queries can a server respond to per minute?
• Unit of measurement: number of jobs finished per second
You are a store manager and you want to hire employees to work at checkout registers, with a $500K/year budget for salaries. Assume you have two options:
1) Hire 10 slow employees
• Speed: 2 minutes per customer (i.e., each employee takes 2 min/customer)
• Salary: $50K/year each (total is $500K/year)
2) Hire 1 very fast employee
• Speed: 0.3 minutes per customer
• Salary: $500K/year
Q1: Which option would you choose? Assume we want to check out 100 customers per hour. (This is a throughput question.)
Q2: What if you have to make the same decision, but you are the CEO of a big company and you want accountants to handle the taxes of one very important client? (This is an execution-time question.)

CPU Performance
Response time for a program A is split into:
• CPU time: the total time the CPU spends on your program.
• User CPU time: time the CPU spent running A.
• System CPU time: time the CPU spent executing OS routines issued by A.
• Wait time: I/O operations and time sharing with other programs.
IPS = Instructions Per Second
• Measures the approximate number of machine instructions a CPU can execute per second.
• Does not consider other factors that might affect execution time (e.g., wait time: memory access delays, I/O speed).
• Disadvantage: a machine with a higher IPS rating might not run a program any faster than a machine with a lower IPS rating.
• MIPS = Million Instructions Per Second

FLOPS = Floating-point Operations Per Second
• MFLOPS: million FLOPS; GFLOPS: billion FLOPS. Larger FLOPS rates correspond to faster execution times.
• Disadvantage: ignores non-floating-point operations → mainly useful for scientific applications that make heavy use of floating-point calculations.

CPU Time = (CPI × Instruction Count) / Clock Rate = Instruction Count / IPS

• CPI (Clock cycles Per Instruction): the average number of CPU cycles used per instruction of program A.
– CPI depends on the architecture (ISA).
• Instruction Count depends on:
– the architecture (ISA), and
– the code quality (the compiler and the algorithm).

IPS = Clock Rate / CPI
Average MIPS Rating = Clock Rate / (CPI × 10^6)
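As a quick numeric illustration, here is a minimal C sketch that plugs hypothetical values into these formulas (the CPI, instruction count, and clock rate below are made up for the example):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical values, for illustration only */
    double cpi = 2.0;                /* average clock cycles per instruction   */
    double instruction_count = 5e9;  /* instructions executed by program A     */
    double clock_rate = 2.5e9;       /* CPU clock rate in Hz (2.5 GHz)         */

    double ips      = clock_rate / cpi;                        /* instructions per second */
    double cpu_time = (cpi * instruction_count) / clock_rate;  /* seconds                 */
    double mips     = clock_rate / (cpi * 1e6);                /* average MIPS rating     */

    printf("IPS      = %.3e instructions/s\n", ips);
    printf("CPU time = %.2f s\n", cpu_time);
    printf("MIPS     = %.1f\n", mips);
    return 0;
}
```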
Benchmarks
• We need benchmark programs to objectively evaluate performance.
• Benchmarks are programs that capture the instruction mix of a variety of applications.
• Example: SPEC (System Performance Evaluation Cooperative)
• SPEC is an organization, established in 1988, that provides a standardized set of performance benchmarks for computers.
• Its benchmarks are CPU-intensive, real-world applications.
• They provide a good indicator of processor performance and compiler technology.
• More information: http://www.spec.org/

Overhead due to Parallelism
• Parallel runtime includes computation and overhead.
• Overhead includes:
• Thread creation/destruction
• Synchronization
• Communication (exchange of data)
• Waiting time due to:
– Unequal load distribution
– Mutual exclusion (waiting for a shared resource)
– Condition synchronization (one thread waits for another thread’s action that changes a condition)
A rough way to observe part of this overhead is sketched below.
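A minimal sketch, not a rigorous benchmark: assuming OpenMP, timing many empty parallel regions isolates the fork/join cost (the numbers will vary by system and thread count):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int trials = 10000;

    double start = omp_get_wtime();
    for (int i = 0; i < trials; i++) {
        /* Empty parallel region: all measured time is fork/join overhead */
        #pragma omp parallel
        { }
    }
    double elapsed = omp_get_wtime() - start;

    printf("Average fork/join overhead: %.2f microseconds\n",
           1e6 * elapsed / trials);
    return 0;
}
```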
Speedup and Efficiency of Parallel Programs
Speedup S: how much faster our parallel algorithm is compared to the serial algorithm:

S = T_serial / T_parallel

• Theoretically, the maximum speedup is S = number of cores (p).
• When the speedup equals p, the algorithm is said to have linear speedup (no overhead) → very rare!
• In practice, we have overhead: T_parallel = T_serial / p + T_overhead

Parallel efficiency E: how effectively you are using your multiple cores:

E = actual speedup / maximum possible speedup = S / p = T_serial / (p × T_parallel)
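As a concrete sketch (the reduction loop and problem size are just placeholders), the following OpenMP program times a serial and a parallel version of the same loop and reports S and E:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 100000000;  /* placeholder problem size */
    double sum_s = 0.0, sum_p = 0.0;

    /* Serial version */
    double t0 = omp_get_wtime();
    for (long i = 0; i < n; i++)
        sum_s += 1.0 / (double)(i + 1);
    double t_serial = omp_get_wtime() - t0;

    /* Parallel version of the same loop */
    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum_p)
    for (long i = 0; i < n; i++)
        sum_p += 1.0 / (double)(i + 1);
    double t_parallel = omp_get_wtime() - t0;

    int p = omp_get_max_threads();
    double S = t_serial / t_parallel;  /* speedup    */
    double E = S / p;                  /* efficiency */
    printf("p = %d  S = %.2f  E = %.2f  (sums: %.6f, %.6f)\n",
           p, S, E, sum_s, sum_p);
    return 0;
}
```

Compile with OpenMP enabled (e.g., gcc -O2 -fopenmp) and vary the thread count via OMP_NUM_THREADS to see how S and E change with p.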
Question: how much do the number of cores p and the problem size affect S and E?

Experiment: Effect of p and Problem Size on E and S
Observations:
1. As the number of cores p increases:
• The speedup S increases, but non-linearly.
• The efficiency E decreases, due to the increased overhead.
2. As the problem size increases, both E and S increase:
• Overhead is usually small compared to computation in large programs.
Speedup Formula
• Does adding more and more cores always mean better speedup?
• To answer this question, let’s find the speedup for any given program. Note that there will always be a part of your program that cannot be parallelized.

[Figure: execution-time bars for serial, p = 2, p = 4, p = 8, …: the unparallelizable part, (1 − r)·T_serial, stays constant, while the parallelizable part shrinks from r·T_serial to r·T_serial/2, r·T_serial/4, r·T_serial/8, …]

Execution Time = (1 − r)·T_serial + r·T_serial / p

S = T_serial / ((1 − r)·T_serial + r·T_serial/p) = 1 / ((1 − r) + r/p)
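A small sketch that tabulates this formula for a hypothetical parallel fraction r = 0.9 as p grows:

```c
#include <stdio.h>

/* Speedup predicted by S = 1 / ((1 - r) + r/p) */
double speedup(double r, int p) {
    return 1.0 / ((1.0 - r) + r / p);
}

int main(void) {
    double r = 0.9;  /* hypothetical: 90% of the program is parallelizable */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  ->  S = %.2f\n", p, speedup(r, p));
    /* S approaches 1 / (1 - r) = 10 as p grows */
    return 0;
}
```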

Amdahl’s Law
Does adding more and more cores always mean better speedup?
• No: speedup is limited by the portion of unparallelizable code, regardless of the number of cores available:

S = 1 / ((1 − r) + r/p)

• As p → ∞, the maximum speedup is 1 / (1 − r).
Amdahl’s Law: Example
Let’s say we can parallelize 90% of a serial program (r = 0.9). Applying the speedup formula:

S = 1 / ((1 − 0.9) + 0.9/p) = 1 / (0.1 + 0.9/p)

As p → ∞, S → 1/0.1 = 10 (the maximum speedup)!

Q: Without the math formulas, can you tell what the max speedup is? (Hint: the unparallelizable 10% always costs 0.1·T_serial, so the program can never run more than 10× faster.)

Should we give up on parallelism?
• No. Problem size is another factor that should be considered (as we saw before), and there are ‘sweet values’ we should aim for to achieve the best trade-off between efficiency and speedup.

Gustafson’s Law
• Remember when we said that with an increased problem size we can achieve better speedup and efficiency?
• Gustafson’s law shows that speedup can be increased with larger problem sizes.
– Or: with more cores, larger problem sizes can be solved within the same time.
• Amdahl’s law, by contrast, assumes a fixed problem size and shows how limited the speedup is, given the sequential part of the problem.
Gustafson’s Law
Problems with large repetitive DATA SETS can be efficiently parallelized.
• As the program processes more and more data, the portion of the serial workload becomes smaller and smaller:

S = 1 / ((1 − r) + r/p)

• As r → 100%, the max speedup is S → p.

[Figure: execution-time bars for serial, p = 2, p = 4, p = 8, with the unparallelizable part becoming an ever-smaller fraction of the total as the problem grows.]
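A companion sketch to the one above, again with hypothetical numbers: holding p fixed while the parallel fraction r grows (as it would with ever-larger data sets), S approaches p:

```c
#include <stdio.h>

int main(void) {
    int p = 16;  /* hypothetical number of cores */
    double fractions[] = { 0.5, 0.9, 0.99, 0.999, 0.9999 };

    for (int i = 0; i < 5; i++) {
        double r = fractions[i];
        double S = 1.0 / ((1.0 - r) + r / p);  /* S -> p as r -> 1 */
        printf("r = %.4f  ->  S = %.2f\n", r, S);
    }
    return 0;
}
```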

Suppose a car is traveling between two cities and has already spent one hour traveling at 30 km/h (this is the unparallelizable part).

Amdahl’s Law:
• Assume the two cities are 60 km apart (the car has already traveled half the distance). No matter how fast you drive the second half, it is impossible to achieve an average speed higher than 60 km/h before reaching the second city.
– Since it has already taken you 1 hour and the total distance is only 60 km, even going infinitely fast you would only achieve 60 km/h.

Gustafson’s Law:
• Given enough time and distance to travel (the cities are far, far away from each other), the car’s average speed can always eventually exceed 60 km/h, no matter how long or how slowly it has already traveled.
– e.g., the car could achieve an average speed of 90 km/h by driving at 150 km/h for one more hour.
Estimation
• Based on the formula S = 1 / ((1 − r) + r/p), we can estimate the required portion of parallelized code r, given the target speedup S and the number of cores p:

r = (1 − 1/S) / (1 − 1/p)

• Example: assume the required speedup is 10 and we have 20 cores. How much of the program should be parallelizable?

r = (1 − 1/10) / (1 − 1/20) = 0.90 / 0.95 ≈ 0.947

So about 95% of the program must be parallelizable.
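The same estimate in a few lines of C (the values are just those of the example above):

```c
#include <stdio.h>

/* Required parallel fraction r for a target speedup s on p cores,
   from s = 1 / ((1 - r) + r/p) solved for r. */
double required_fraction(double s, double p) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / p);
}

int main(void) {
    printf("r = %.3f\n", required_fraction(10.0, 20.0));  /* ~0.947 */
    return 0;
}
```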

Scalability
• In general, a problem is scalable if it can handle ever-increasing problem sizes.
• Strongly scalable: E stays fixed as we increase p, without increasing the problem size.
– p↑ + problem_size fixed ⇒ E fixed
– We achieve higher speedup with more cores, at a fixed E and a fixed problem size.
• Weakly scalable: E stays fixed as we increase p, provided we also increase the problem size.
– p↑ + problem_size↑ ⇒ E fixed
– The problem size is increased at the same rate as p (see the numeric sketch below).
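A toy numeric sketch with made-up runtimes, showing how one might classify a program’s scaling from measurements:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical measured runtimes (seconds), for illustration only */
    double t_serial   = 8.0;   /* serial time, original problem size */
    double t_p4       = 2.5;   /* parallel time, p = 4, same size    */
    double t_p8       = 1.6;   /* parallel time, p = 8, same size    */
    double t_serial2x = 16.0;  /* serial time, doubled problem size  */
    double t_p8_2x    = 2.5;   /* parallel time, p = 8, doubled size */

    /* Strong-scaling check: fixed problem size, growing p */
    printf("E(p=4) = %.3f, E(p=8) = %.3f\n",
           (t_serial / t_p4) / 4.0, (t_serial / t_p8) / 8.0);
    /* E drops (0.800 -> 0.625): not strongly scalable */

    /* Weak-scaling check: problem size doubled along with p */
    printf("E(p=8, 2x size) = %.3f\n", (t_serial2x / t_p8_2x) / 8.0);
    /* E stays at 0.800: weakly scalable */
    return 0;
}
```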
Summary
• Speedup and efficiency are different.
• Using more cores doesn’t necessarily mean significant speedup.
• Sometimes serial code is the best choice:
– Especially for serial algorithms that do not parallelize well.
– You need to test on your system.
• As the problem size increases, both efficiency and speedup increase:
– Parallel programs are better suited to larger problems.

Conclusion/Up Next
• What we covered today (review these key concepts):
• Speedup and efficiency
• Amdahl’s and Gustafson’s laws
• Up next:
– Intro to GPU programming
• CPU vs. GPU programming
• Latency vs. throughput
– CUDA basics: the hardware layout
