
Performance of Parallel Programs

COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act).


The material in this communication may be subject to copyright under the Act. Any further copying or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.

§ Amdahl’s Law
§ Load Balancing
§ Measuring Performance
§ Sources of Performance Loss
§ Example Case-study
§ Thread pool
§ Reasoning about Performance
§ A closer look at memory

Limits to Performance Scalability
§ Not all programs are “embarrassingly” parallel.
§ Programs have sequential parts and parallel parts.
Sequential part: cannot be parallelized because of data dependencies.
Parallel part: no data dependence, the different loop iterations (Line 5) can be executed in parallel.

4  for (k=0; k < e; k++)
5      M[k] = 1;

Unroll loop: k=0, k=1, k=... — each iteration writes a different element M[k], so there are no dependencies between iterations.

Amdahl’s Law
In 1967, Gene Amdahl, then an IBM computer mainframe architect, stated that “The performance improvement to be gained from some faster mode of execution is limited by the fraction of the time that the faster mode can be used.”
§ “Faster mode of execution” here means program parallelization.
§ The potential speedup is defined by the fraction of the code that can be parallelized.

Example: a 100-second program consisting of a 25-second sequential part, a 50-second parallelizable part, and another 25-second sequential part.
Use 5 threads (T1 ... T5) for the parallelizable part:
sequential part: 25 seconds
parallelizable part: 50 seconds / 5 threads = 10 seconds
sequential part: 25 seconds
new running time: 25 + 10 + 25 = 60 seconds (instead of 100 seconds)

Amdahl’s Law (cont.)
§ Speedup = old running time / new running time = 100 seconds / 60 seconds ≈ 1.67
§ The parallel version is 1.67 times faster than the sequential version.

Amdahl’s Law (cont.)
§ We may use more threads executing in parallel for the parallelizable part, but the sequential part will remain the same (e.g., shrinking the parallel part to 2 seconds still gives 25 + 2 + 25 = 52 seconds).
§ The sequential part of a program limits the speedup that we can achieve!
§ Even if we theoretically could reduce the parallelizable part to 0 seconds, the best possible speedup in this example would be speedup = 100 seconds / 50 seconds = 2.

Amdahl’s Law (cont.)
§ p = fraction of work that can be parallelized.
§ n = the number of threads executing in parallel.

new_running_time = (1 - p) * old_running_time + (p * old_running_time) / n

Speedup = old_running_time / new_running_time = 1 / ((1 - p) + p/n)

§ Observation: if the number of threads goes to infinity (n → ∞), the speedup becomes 1 / (1 - p).
§ Parallel programming pays off for programs which have a large parallelizable part.

Amdahl’s Law (Examples)
§ Example 1: p = 0, an embarrassingly sequential program: speedup = 1, no matter how many threads.
§ Example 2: p = 1, an embarrassingly parallel program: speedup = n.
§ Example 3: p = 0.75, n = 8: speedup = 1 / (0.25 + 0.75/8) ≈ 2.9.
§ Example 4: p = 0.75, n → ∞: speedup = 1 / 0.25 = 4.

Performance Scalability
§ On the previous slides, we assumed that performance directly scales with the number n of threads/cores used for the parallelizable part p.
§ E.g., if we double the number of threads/cores, we expect the execution time to drop in half. This is called linear speedup.
§ However, linear speedup is an “ideal case” that is often not achievable.
[Figure: speedup vs. number of threads/cores, comparing measured speedup curves against the ideal linear speedup and the flat line of a sequential program.]
§ The speedup graph shows that the curves level off as we increase the number of cores.
§ Result of keeping the problem size constant → the amount of work per thread decreases → the overhead for thread creation also becomes more significant.
§ Other reasons are possible as well (memory hierarchy, highly-contended lock, ...).
§ Efficiency = speedup / number of threads/cores (see the sketch below).
§ Ideal efficiency of 1 indicates linear speedup.
§ Efficiency > 1 indicates super-linear speedup.
§ Super-linear speedups are possible due to registers and caches.
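To make the arithmetic concrete, here is a minimal sketch (not from the original slides; the thread counts are illustrative) that evaluates Amdahl’s speedup formula and the corresponding efficiency:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p/n); efficiency = speedup / n. */
static double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.75;                      /* fraction of the work that can be parallelized */
    int threads[] = { 1, 2, 4, 8, 16 };
    for (int i = 0; i < 5; i++) {
        int n = threads[i];
        double s = speedup(p, n);
        printf("p=%.2f n=%2d  speedup=%.2f  efficiency=%.2f\n", p, n, s, s / n);
    }
    printf("limit as n goes to infinity: %.2f\n", 1.0 / (1.0 - p));
    return 0;
}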

§ Amdahl’s Law ✓
§ Load Balancing
§ Measuring Performance
§ Sources of Performance Loss
§ Example Case-study
§ Thread pool
§ Reasoning about Performance
§ A closer look at memory

Granularity of parallelism = frequency of interaction between threads (a small sketch contrasting the two granularities follows the lists below)
§ Fine-grained Parallelism
§ Small amount of computational work between communication / synchronization stages.
§ Low computation to communication ratio.
§ High communication / synchronization overhead.
§ Less opportunity for performance improvement.
§ Coarse-grained Parallelism
§ Large amount of computational work between communication / synchronization stages.
§ High computation to communication ratio.
§ Low communication / synchronization overhead.
§ More opportunity for performance improvement.
§ Harder to load-balance effectively.
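As an illustration (not from the original slides), consider summing an array with several threads. The fine-grained version synchronizes on every element, while the coarse-grained version accumulates locally and synchronizes once per chunk; the function and variable names are made up for this example:

#include <pthread.h>

static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
static long global_sum = 0;

/* Fine-grained: lock/unlock once per element -> low computation-to-communication ratio. */
void fine_grained(const int *chunk, int len) {
    for (int i = 0; i < len; i++) {
        pthread_mutex_lock(&sum_lock);
        global_sum += chunk[i];
        pthread_mutex_unlock(&sum_lock);
    }
}

/* Coarse-grained: compute locally, lock once per chunk -> high computation-to-communication ratio. */
void coarse_grained(const int *chunk, int len) {
    long local = 0;
    for (int i = 0; i < len; i++)
        local += chunk[i];
    pthread_mutex_lock(&sum_lock);
    global_sum += local;
    pthread_mutex_unlock(&sum_lock);
}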

Load Balancing Problem
§ Threads that finish early have to wait for the thread with the largest amount of work to complete.
§ This leads to idle times and lowers processor utilization.
[Figure: threads performing unequal amounts of work between two synchronization points; the early finishers sit idle until the next synchronization.]

Load Balancing
§ Load imbalance: work not evenly assigned to cores.
§ Underutilizes parallelism!
§ The assignment of work, not data, is key.
§ Static assignment = assignment at program writing time
§ cannot be changed at run-time.
§ More prone to imbalance.
§ Dynamic assignment = assignment at run-time.
§ Quantum of work must be large enough to amortize overhead.
§ Example: Threads fetch work from a work queue (a minimal sketch follows below).
[Figure: threads T1 and T2 repeatedly fetching the next item from a shared work queue, shown over several steps.]
§ With flexible allocations, load balance can be solved late in the design/programming cycle.
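A minimal sketch of dynamic assignment (not the slide’s code): workers repeatedly grab the next chunk index from a shared counter protected by a mutex. The chunk count and the process_chunk function are illustrative assumptions.

#include <pthread.h>
#include <stdio.h>

#define NUM_CHUNKS  64   /* total quanta of work (illustrative) */
#define NUM_THREADS 4

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_chunk = 0;

/* Stand-in for real work on one chunk. */
static void process_chunk(int chunk) {
    printf("processing chunk %d\n", chunk);
}

/* Each worker repeatedly fetches the next unassigned chunk until none are left. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&counter_lock);
        int chunk = (next_chunk < NUM_CHUNKS) ? next_chunk++ : -1;
        pthread_mutex_unlock(&counter_lock);
        if (chunk < 0)
            break;                 /* no work left */
        process_chunk(chunk);      /* do the actual work outside the lock */
    }
    return NULL;
}

int main(void) {
    pthread_t tids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}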

§ Amdahl’s Law ✓
§ Load Balancing ✓
§ Measuring Performance
§ Sources of Performance Loss
§ Example Case-study
§ Thread pool
§ Reasoning about Performance
§ A closer look at memory

Measuring Performance
§ Execution time … what’s time?
§ We can measure the time from start to termination of our program.
§ Also called wallclock time.
time ./a.out

real 0m4.159s    (elapsed wallclock time)
user 0m0.008s    (time executed in user mode)
sys  0m0.026s    (time executed in kernel mode; see next slide)
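Wallclock time can also be measured from inside the program; a minimal sketch using the POSIX clock_gettime interface (the work_to_measure loop is a placeholder, not code from the slides):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Placeholder for the code being timed (illustrative busy loop). */
static void work_to_measure(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;
}

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);   /* monotonic wallclock-style timer */
    work_to_measure();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.6f seconds\n", elapsed);
    return 0;
}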

#include <stdio.h>
#include <stdlib.h>

int main() {
    int i, *j, k;
    FILE *f = fopen(…);            // enters the kernel via a system call (counted as sys time)
    j = malloc(255*sizeof(int));   // mostly user-mode work; may request memory from the kernel
    for (k = 0; k < …; k++)
        …                          // pure computation: counted as user time
}

§ Memory hierarchy: CPU registers → L1 cache → L2 cache → main memory.
§ Cache miss: CPU wants to access data which is not in the cache.
• Need to access data in level below, eventually in main memory (slow!).
[Figure: memory hierarchy diagram, with the CPU and caches above the RAM (main memory).]
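To illustrate why misses matter (an illustrative sketch, not from the slides): traversing a matrix along rows follows the memory layout and mostly hits the cache, while traversing it along columns jumps across memory and misses far more often:

#include <stdio.h>

#define N 1024

static double m[N][N];   /* row-major: m[i][0..N-1] are contiguous in memory */

int main(void) {
    double sum = 0.0;

    /* Row-wise: consecutive accesses touch neighbouring addresses -> mostly cache hits. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Column-wise: consecutive accesses are N*sizeof(double) bytes apart -> many cache misses. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%f\n", sum);
    return 0;
}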

Divide into several parts…
§ Solution for several threads: divide the array into several parts.
[Figure: an array of length = 16 divided among t = 4 threads (Thread 0 … Thread 3), each getting one contiguous part.]
What’s wrong?
int length_per_thread = length / t;
int start = id * length_per_thread;
for (i = start; i < start + length_per_thread; i++)
    …
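One thing to check in the split above: if length is not evenly divisible by t, the integer division drops the remainder and the last elements are never processed. A minimal sketch of one possible fix (illustrative, using the same variable names), where the final thread absorbs the leftover elements:

#include <stdio.h>

/* Compute this thread's [start, end) range; the last thread also takes the remainder. */
static void thread_range(int id, int t, int length, int *start, int *end) {
    int length_per_thread = length / t;
    *start = id * length_per_thread;
    *end   = (id == t - 1) ? length : *start + length_per_thread;
}

int main(void) {
    int length = 17, t = 4;               /* 17 is deliberately not divisible by 4 */
    for (int id = 0; id < t; id++) {
        int start, end;
        thread_range(id, t, length, &start, &end);
        printf("thread %d: [%d, %d)\n", id, start, end);
    }
    return 0;
}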
Thread pool
    (work->action)(work->arg);   // run the queued task
    // anything else?
    free(work);
}

threadpool_init(…) {
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);   // man 7 sched
    for ( ; i < max_threads; ++i)
        pthread_create(&tp->threads[i], &attr, threadpool_start, wd);
}
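The fragment above only shows what happens to a single work item. A minimal self-contained sketch of one possible worker loop, assuming a mutex/condition-variable protected linked list of work items (threadpool_start is the name passed to pthread_create above; the work_t type and queue variables are assumptions for illustration, not the code from the slides):

#include <pthread.h>
#include <stdlib.h>

typedef struct work {
    void (*action)(void *);   /* task to run */
    void *arg;                /* its argument */
    struct work *next;
} work_t;

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_nonempty = PTHREAD_COND_INITIALIZER;
static work_t *queue_head = NULL;

/* Each pool thread runs this loop, blocking until work is queued. */
void *threadpool_start(void *arg) {
    (void)arg;   /* the original passes a work-descriptor argument (wd); unused in this sketch */
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_nonempty, &queue_lock);
        work_t *work = queue_head;          /* dequeue one item */
        queue_head = work->next;
        pthread_mutex_unlock(&queue_lock);

        (work->action)(work->arg);          /* run the task */
        free(work);
    }
    return NULL;
}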

Thread pool: design considerations
§ Task size: meant to be small, but how small?
§ Experimentally, find the threshold on an unloaded machine.
§ Dynamically, do a small test in the program! (beware cold/hot cache)
§ Overhead costs
§ additional function calls
§ reformatting parameters so they can be described with one address (pthread)
§ Inside the thread pool there are delays which have nothing to do with the task itself.
