
Intro to Parallel Computing
Topic 9 – Work Sharing (Sections, Scheduling and Ordered Iterations)
COSC 407: Intro to Parallel Computing


Outline (Asynchronous)
Previously:
• Work-sharing: the parallel for construct
• Data dependency
• single and master constructs
Today:
• Sections
• Scheduling loops (static, dynamic, guided, auto)
• Ordered iterations
• Some examples

Work-Sharing Constructs
• Within a parallel region, you can decide how work is distributed among the threads.
for – The for construct is probably the most important work-sharing construct in OpenMP.
single – Assigns a block of code to a single thread.
sections – Divides the work into sections; each section is executed by one thread.
• Implied barrier on exit (unless nowait is used). No implied barrier on entry.
• Because these constructs are used inside a parallel region, they use the existing threads (they do not create new threads).
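For illustration, here is a minimal compilable sketch (not from the slides) that combines single with nowait and a work-shared for inside one parallel region; the thread count and the printed messages are arbitrary choices:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single nowait   // one thread runs this block; nowait removes the implied exit barrier
        printf("setup done by T%d\n", omp_get_thread_num());

        #pragma omp for             // shares iterations among the existing threads (no new threads created)
        for (int i = 0; i < 8; i++)
            printf("T%d handles i=%d\n", omp_get_thread_num(), i);
    }   // implicit barrier at the end of the parallel region
    return 0;
}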
Parallel Sections
• The sections directive allows us to assign different blocks of code (sections) to different threads.
• Each section is executed exactly once. Each thread executes zero or more sections.
• A thread may execute zero sections if (# of sections < # of threads).
• A thread may execute more than one section if it is fast enough and the implementation allows it, and/or if (# of sections > # of threads).
• No order! It is not possible to determine which sections will execute before which, or whether two sections are executed by the same thread.
• Therefore, it is important that none of the later sections depends on the results of the earlier ones.
• There is an implicit barrier at the end of the sections region (but no implicit barrier at the end of each individual section).
• There is no implicit barrier on entry.

Parallel Sections, Syntax
Within a parallel block, use the following syntax:

#pragma omp sections [clause [clause]...]
{
    #pragma omp section
    {
        // code for section 1
    }
    #pragma omp section
    {
        // code for section 2
    }
}

If not within a parallel block, use the combined syntax:

#pragma omp parallel sections [clause [clause]...]
{
    #pragma omp section
    ...
}

The clause could be: private, reduction, nowait.
Sections – Example 1
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            printf("T%d:A \n", omp_get_thread_num());
            printf("T%d:B \n", omp_get_thread_num());
        }
        #pragma omp section
        {
            printf("T%d:C \n", omp_get_thread_num());
            printf("T%d:D \n", omp_get_thread_num());
        }
    }
}

Possible outcomes (each row is one run):
Run 1: T0:C  T3:A  T0:D  T3:B
Run 2: T1:A  T2:C  T2:D  T1:B
Run 3: T0:A  T0:B  T0:C  T0:D
Run 4: T0:A  T0:B  T1:C  T1:D

Sections – Example 1, cont'd

#pragma omp parallel sections
{
    #pragma omp section
    {
        printf("T%d:A \n", omp_get_thread_num());
        printf("T%d:B \n", omp_get_thread_num());
    }
    #pragma omp section
    {
        printf("T%d:C \n", omp_get_thread_num());
        printf("T%d:D \n", omp_get_thread_num());
    }
}

Possible outcomes are the same as before (each row is one run):
Run 1: T0:C  T3:A  T0:D  T3:B
Run 2: T1:A  T2:C  T2:D  T1:B
Run 3: T0:A  T0:B  T0:C  T0:D
Run 4: T0:A  T0:B  T1:C  T1:D
Sections – Example 2

#pragma omp parallel num_threads(4)
{
    int n = omp_get_thread_num();
    printf("T%d:X \n", n);
    #pragma omp sections
    {
        #pragma omp section
        {
            printf("T%d:A \n", n);
            printf("T%d:B \n", n);
        }
        #pragma omp section
        {
            printf("T%d:C \n", n);
            printf("T%d:D \n", n);
        }
    }
    printf("T%d:Y \n", n);
}
printf("Finished");

Possible output:
T1:X T2:X T0:X T1:C T0:A T1:D T0:B T3:X
T1:Y T0:Y T2:Y T3:Y
(Per-thread annotations from the slide: T0: X A B Y; T1: X C D Y; T2: X Y.)

Questions:
1) Is this an error? Shouldn't T3:X appear with the other X's above?
2) Why can all the Y's only appear here, after the sections output?
3) Can T1:X appear here? Can T2:X appear here? (referring to positions marked on the slide)
4) Can "Finished" appear earlier?
Hint: there is no implicit barrier on entry to the sections region, but there is one at its end, and another at the end of the parallel region.

Sections – Example 2, cont'd

#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:X \n", n);
    #pragma omp sections
    {
        #pragma omp section
        {
            printf("T%d:A \n", n);
            printf("T%d:B \n", n);
        }
        #pragma omp section
        {
            printf("T%d:C \n", n);
            printf("T%d:D \n", n);
        }
    }
    printf("T%d:Y \n", n);
}
printf("Finished");
The schedule clause
Determines how loop iterations are distributed over the threads:

schedule( static | dynamic | guided | auto [, chunksize] )

Aim: distribute the workload evenly so that all processors are busy for roughly the same amount of time.
static is the default scheduling policy in most OpenMP implementations.
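As a quick illustration of where the clause goes (a sketch, not from the slides; the guided policy, chunk size 4, and the loop body are arbitrary choices):

#include <stdio.h>
#include <omp.h>

int main(void) {
    long sum = 0;
    // The schedule clause is attached to the loop directive.
    #pragma omp parallel for schedule(guided, 4) reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += i;                  // placeholder per-iteration work
    printf("sum = %ld\n", sum);    // 499500, regardless of the schedule chosen
    return 0;
}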

Thread Scheduling
[Figure omitted: illustration of thread scheduling, from Ruud van der Pas, IWOMP 2010]
Static Scheduling
schedule( static [,chunksize] )
• Before executing the loop, assign the iterations in blocks of size chunksize to the threads in round-robin fashion.
• Default chunksize ≈ num_iterations / num_threads.

Example: twelve iterations (0, 1, ..., 11) and 3 threads:

schedule(static)      T0: 0,1,2,3    T1: 4,5,6,7    T2: 8,9,10,11
schedule(static,4)    T0: 0,1,2,3    T1: 4,5,6,7    T2: 8,9,10,11
schedule(static,2)    T0: 0,1,6,7    T1: 2,3,8,9    T2: 4,5,10,11
schedule(static,1)    T0: 0,3,6,9    T1: 1,4,7,10   T2: 2,5,8,11
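To see the mapping yourself, the following small test (a sketch, not from the slides) uses 3 threads and 12 iterations as in the example above; the printed lines will interleave, but the thread-to-iteration assignment matches schedule(static,2):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel for num_threads(3) schedule(static, 2)
    for (int i = 0; i < 12; i++)
        // Expected assignment: T0 -> 0,1,6,7   T1 -> 2,3,8,9   T2 -> 4,5,10,11
        printf("T%d: i=%d\n", omp_get_thread_num(), i);
    return 0;
}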

Dynamic Scheduling
schedule(dynamic [,chunksize] )
• Iterations are broken up into chunks of chunksize.
• Each thread executes a chunk. Once done, it requests another chunk.
• This keeps going until all iterations are completed. The default chunksize is 1.
#pragma omp parallel for schedule(dynamic,2)
for(int i = 0; i < 8; i++)
    printf("T%d:%d\n", omp_get_thread_num(), i);

Possible outputs (each row is one run):
Run 1: T2:0 T1:4 T2:1 T1:5 T0:2 T3:6 T0:3 T3:7
Run 2: T1:4 T2:0 T0:2 T0:3 T1:5 T0:6 T2:1 T0:7

Be careful of the overhead: once a chunk is finished, the thread needs to receive a new iteration counter. To reduce this overhead, make sure your chunk size is reasonably large.

Guided/Auto Scheduling
schedule( guided | auto [, chunksize] )
• guided is similar to dynamic, BUT it starts with large chunks and then adjusts to smaller chunk sizes if the workload is imbalanced.
• Default:
  • If chunksize is unspecified, chunk sizes decrease down to 1.
  • If chunksize is specified, chunk sizes decrease down to chunksize, but the last chunk might be smaller than chunksize.
• auto: the compiler and/or the run-time system determine the best schedule (OpenMP v3+).

When to Use Which?
Static:
• Use if iterations require roughly the same amount of time.
• It requires little overhead (mostly done at compile time).
Dynamic:
• Use if iterations require different amounts of time.
• Allows processors that finish first to go after other chunks, which balances the workload and keeps all processors busy.
• It requires more overhead than static.
Guided:
• Use when the workload increases as we go to higher iterations, i.e., when initial chunks require less time per iteration than later chunks.
• Like dynamic, there is more overhead at runtime.

Experiment
We want to parallelize this loop:

sum = 0.0;
for(i = 0; i <= n; i++)
    sum += f(i);

• Assume the time required by f(i) is proportional to the value of i.
• In most OpenMP implementations, static scheduling is used by default.
• A better assignment might be dynamic scheduling, i.e., cyclic partitioning of the iterations among the threads.

Experiment, cont'd
To get a feel for how drastically this can affect performance, assume f is:

double f(int i){
    int j, start = i*(i+1)/2, finish = start + i;
    double result = 0.0;
    for(j = start; j <= finish; j++)
        result += sin(j);
    return result;
}

• Note that the time to run the inner for loop depends on i; for example, f(2i) takes roughly twice the time of f(i).

Experiment: Results
n = 10,000
• One thread (serialized): run-time = 3.67 seconds.
• Two threads, static assignment (default): run-time = 2.76 seconds, speedup = 1.33.
• Two threads, dynamic assignment: run-time = 1.84 seconds, speedup = 1.99.

The ordered clause
The assignment of iteration chunks to threads is unspecified, and hence the output is usually unordered. You can use the ordered clause to force certain events to run in the order of the iterations.

#pragma omp for ordered schedule(dynamic)
for(int i = 0; i < 100; i++) {
    f( a[i] );
    #pragma omp ordered
    g( a[i] );
}

• f() is done in any order and in parallel; e.g., f(a[5]) and f(a[6]) may be done in parallel by two threads.
• g() is done strictly in order; e.g., assume a thread finishes f(a[6]) and now wants to proceed to g(a[6]). It has to check whether g(a[5]) is finished. If not, the thread waits until g(a[5]) is finished.
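Because f() and g() above are placeholders, here is a minimal compilable sketch of the same ordered pattern (the array contents and the two functions are made up for illustration, and parallel and for are combined into one directive):

#include <stdio.h>
#include <omp.h>

double f(double x) { return x * x; }         // placeholder work: safe to run in any order
void   g(double y) { printf("%.0f ", y); }   // placeholder output: must run in iteration order

int main(void) {
    double a[100];
    for (int i = 0; i < 100; i++) a[i] = i;

    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < 100; i++) {
        double y = f(a[i]);      // may execute out of order, in parallel
        #pragma omp ordered
        g(y);                    // executes strictly in order of i (prints 0 1 4 9 ...)
    }
    printf("\n");
    return 0;
}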
The ordered clause, cont'd

#pragma omp for ordered schedule(dynamic)
for(int i = 0; i < 100; i++) {
    f( a[i] );          // may be done in any order
    #pragma omp ordered
    g( a[i] );          // must be done in order
}

• The calls outside the ordered block (thin outlines in the slide figure) may be done in any order.
• The calls inside the ordered block (thick outlines in the slide figure) must be done in order.

Rules and Cost
• You can have exactly one ordered block per ordered loop, no less and no more.
• The enclosing for construct must carry the ordered clause.
• Ordering comes at the expense of wasted CPU time (threads waiting for other threads).

Time for (Estimating) PI...
Serial code:

int n = 100000000; // 100 million
double factor = 1, sum = 0;
for(int k = 0; k < n; k++){
    sum += factor/(2*k+1);
    factor = -factor;
}
double pi = 4 * sum;
printf("%f", pi);

Q: Parallelize the code!

First Piece of PI
First attempt... Will this code work?

int n = 100000000; // 100 million
double factor = 1, sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int k = 0; k < n; k++){
    sum += factor/(2*k+1);
    factor = -factor;
}
double pi = 4 * sum;
printf("%f", pi);   // possible output: 4.810167

a) Yes, it works
b) No, there is a data race
c) I am super confused!
Answer: b) — factor = -factor is a loop-carried dependency, and factor is shared by all threads.

Second Piece of PI
Second attempt... How about this solution?

int n = 100000000;
double factor, sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int k = 0; k < n; k++){
    if(k%2==0) factor = 1;
    else factor = -1;
    sum += factor/(2*k+1);
}
double pi = 4 * sum;
printf("%f", pi);   // possible output: 2.699575

a) YES, this is perfect!
b) No, there is still a problem.
c) I am still confused!
Answer: b) — the loop-carried dependency is gone, but factor is still shared, so there is still a data race.

Third Piece of PI
Third attempt... (avoiding the data race) How about this one now?

int n = 100000000;
double factor, sum = 0;
#pragma omp parallel for reduction(+:sum) private(factor)
for(int k = 0; k < n; k++){
    if(k%2==0) factor = 1;
    else factor = -1;
    sum += factor/(2*k+1);
}
double pi = 4 * sum;
printf("%f", pi);   // output: 3.141593

private(factor) ensures factor has private scope (WHY?).

Fourth Piece of PI
The Easy Way....
Experiment: Contender #1
Contender #1: Use the ordered clause to force the iterations to run in order.

int n = 500000000;
double factor = 1, sum = 0;
double time = omp_get_wtime();
#pragma omp parallel for ordered reduction(+:sum)
for(int k = 0; k < n; k++){
    sum += factor/(2*k+1);
    #pragma omp ordered
    factor = -factor;
}
double pi = 4 * sum;
double finish = omp_get_wtime();
printf("pi = %f\ntime = %f ms", pi, 1000*(finish-time));

Output: pi = 3.141593, time = 8150 ms

Experiment: Contender #2
Contender #2: Algorithm redesign (Third Piece of PI).

int n = 500000000;
double factor, sum = 0;
double time = omp_get_wtime();
#pragma omp parallel for reduction(+:sum) private(factor)
for(int k = 0; k < n; k++){
    if(k%2==0) factor = 1;
    else factor = -1;
    sum += factor/(2*k+1);
}
double pi = 4 * sum;
double finish = omp_get_wtime();
printf("pi = %f\ntime = %f ms", pi, 1000*(finish-time));

Output: pi = 3.141593, time = 1202 ms

Comparison
Contender #1 (using the ordered clause): pi = 3.141593, time = 8150 ms
Contender #2 (redesigning the algorithm): pi = 3.141593, time = 1202 ms

Conclusion/Up Next
§ What we covered today (review key concepts):
– Sections
– Scheduling loops (static, dynamic, guided, auto)
– Ordered iterations
– Some examples
§ Up next:
– Some more examples
  • Matrix multiplication
  • Max reduction
– Asides and comments on OpenMP