
Intro to Parallel Computing
Topic 8 – Work Sharing (Parallel For, Single)
COSC 407: Intro to Parallel Computing


Outline (Asynchronous)
Previously:
• Mutual Exclusion (critical, atomic, locks)
• Variable scope (shared, private, firstprivate)
• Reduction, Synchronization (barriers, nowait)
Today:
• Work-sharing: the parallel for construct
• Data dependency
• Single and master constructs

Work-Sharing Constructs
• Within a parallel region, you can decide how work is distributed among the threads:
– for: distributes loop iterations among the threads. The for construct is probably the most important construct in OpenMP.
– single: assigns a block of code to a single thread.
– sections: divides the work into sections; each section is executed by one thread.
• Implied barrier on exit (unless nowait is used). No implied barrier on entry.
• Because they are used inside a parallel region, these constructs use the existing threads (they do not create new threads).
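To see how these constructs cooperate inside one region, here is a minimal sketch (my own illustration, not from the slides) that combines for and single on the same thread team:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, sum = 0;
    #pragma omp parallel num_threads(4)    // one team is created here
    {
        // 'for' splits the iterations among the existing team (no new threads)
        #pragma omp for reduction(+:sum)
        for (i = 0; i < 100; i++)
            sum += i;
        // implied barrier at the end of 'for': the reduction is complete here

        // 'single' runs the block on exactly one thread; the others wait
        #pragma omp single
        printf("sum = %d (printed once)\n", sum);
    }   // implied barrier at the end of the parallel region
    return 0;
}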
Parallel For
§ Loop iterations are distributed across the thread team (at run time)
– Loops must have an integer counter variable whose value is incremented/decremented by a fixed amount each iteration, until a specified bound is reached
§ The loop variable is private (each thread gets its own copy)
– Its value at the end of the loop is undefined (unless declared lastprivate)
– It is recommended that programmers do not rely on OpenMP default rules:
• Explicitly set data-sharing attributes
§ Output/processing order is non-deterministic
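For example, a minimal sketch (my own; the array a is illustrative) showing how lastprivate recovers the counter's final value when the data-sharing attributes are set explicitly:

#include <stdio.h>

int main(void) {
    int i, n = 8, a[8];
    // lastprivate(i): after the loop, i holds the value it would have
    // after a sequential execution of the loop (here, n)
    #pragma omp parallel for lastprivate(i)
    for (i = 0; i < n; i++)
        a[i] = 2 * i;
    printf("after the loop: i = %d, a[7] = %d\n", i, a[7]);   // i == 8 here
    return 0;
}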

Parallel For Loop
• Uses a team of threads to execute the for loop block following the for directive.
• The system parallelizes the for loop by dividing the iterations of the loop among the threads.
Syntax (two options)
(1) #pragma omp for [clause [clause] …]   //does not create new threads
    for (i = start; i < end; incr_expression)   //the test may also use <=, >=, or >
This block must be placed within a parallel region.
Use this syntax if your parallel region includes a for loop as well as other statements.
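A minimal sketch of option (1) (my own illustration): the region contains a statement executed by every thread, plus a work-shared loop:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    #pragma omp parallel num_threads(4)
    {
        printf("T%d ready\n", omp_get_thread_num());   // every thread runs this

        #pragma omp for                                 // reuses the team; no new threads
        for (i = 0; i < 8; i++)
            printf("T%d handles i=%d\n", omp_get_thread_num(), i);
    }   // implied barriers at the end of 'for' and of the region
    return 0;
}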
Parallel For Loop
(2) #pragma omp parallel for [clause [clause] …]   //creates new threads
    for (i = start; i < end; incr_expression)
Use this syntax to create a parallel block that ONLY includes a for loop.
Variable scope: the loop variable of the for-statement immediately following the for directive is private.
• Recall the recommended best practice: declare it as private explicitly.
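A minimal sketch of option (2) (my own illustration), declaring the counter in the loop header so it is unambiguously private:

#include <stdio.h>
#include <omp.h>

int main(void) {
    // the whole parallel region is just this one loop
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 8; i++)   // i declared in the loop: private by construction
        printf("T%d: i=%d\n", omp_get_thread_num(), i);
    return 0;
}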

WITHOUT the Parallel For Directive
• WITHOUT the for directive, EACH thread runs a copy of the WHOLE for loop.
• On 4 threads (T0 to T3) we get 16 print-outs, since each thread executes all 4 iterations concurrently with the other threads.
• We would need to divide up the work manually.

#pragma omp parallel num_threads(4)
{
    int i, n = omp_get_thread_num();
    for (i = 0; i < 4; i++)
        printf("T%d: i=%d\n", n, i);
}

Parallel For Directive
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 4; i++) {
        n = omp_get_thread_num();
        printf("T%d: i=%d\n", n, i);
    }
}
• WITH the for directive, iterations are divided among the threads.
• On 4 threads (T0 to T3) we get 4 print-outs, since each thread executes one iteration concurrently with the other threads.
• That is, regardless of the number of threads, we get exactly the number of print-outs specified by the for loop.
• Note: order is not preserved!

Who Does What?
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 8; i++) {
        n = omp_get_thread_num();
        printf("T%d: i=%d\n", n, i);
    }
}
The number of iterations is divided equally among the threads (this is called static scheduling), and each thread gets its own loop counter:
§ T0: 2 iterations (i = 0, 1)
§ T1: 2 iterations (i = 2, 3)
§ T2: 2 iterations (i = 4, 5)
§ T3: 2 iterations (i = 6, 7)

Who Does What?
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 80; i++) {
        //some work
    }
}
The number of iterations is divided equally among the threads, and each thread gets its own loop counter:
§ T0: 20 iterations (i = 0…19)
§ T1: 20 iterations (i = 20…39)
§ T2: 20 iterations (i = 40…59)
§ T3: 20 iterations (i = 60…79)

Who Does What?
//assuming 4 threads
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 6; i++) {
        n = omp_get_thread_num();
        printf("T%d: i=%d\n", n, i);
    }
}
Note: the extra iterations are assigned to the first few threads:
§ T0 took 2 iterations (i = 0, 1)
§ T1 took 2 iterations (i = 2, 3)
§ T2 took 1 iteration (i = 4)
§ T3 took 1 iteration (i = 5)

More Practical Examples
§ Image processing: converting an RGB image to grayscale.
– Assume the luminosity of each grayscale pixel equals 0.21 * Red + 0.72 * Green + 0.07 * Blue. Then:
#pragma omp parallel for
for (i = 0; i < numPixels; i++)
    gray[i] = (unsigned char)(.21*red[i] + .72*green[i] + .07*blue[i]);
§ 3D rendering:
– Divide the image into blocks and iterate over them.
§ Matrix / array operations:
– Add two matrices, find the transpose, etc.
§ Many other applications that require applying the same action iteratively in a loop
– Refer to the introduction to parallel computing notes.

Area Under the Curve Revisited
The aim is to compute the area S (from a to b) under a curve defined by f(x).
• One way to do this is to divide S into a series of trapezoids, find the area of each trapezoid, and then sum.
• The more trapezoids, the better the approximation.
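To make the formula concrete before parallelizing, here is a tiny worked case (the function and numbers are my own illustration, not from the slides). Take f(x) = x², a = 1, b = 2, n = 2:

    h = (2 - 1) / 2 = 0.5,  x1 = a + 1*h = 1.5
    approx = (f(1) + f(2)) / 2 + f(1.5) = (1 + 4)/2 + 2.25 = 4.75
    area ≈ h * approx = 0.5 * 4.75 = 2.375   (exact value: 7/3 ≈ 2.3333)

Increasing n shrinks the gap between the approximation and the exact integral.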
The Serial Algorithm Revisited
Sum of trapezoid areas = h * [ (f(a) + f(b))/2 + f(x1) + f(x2) + … + f(x(n-1)) ], where xi = a + i*h

h = (b - a) / n;                   //calculate h only once
approx = (f(a) + f(b)) / 2.0;      //first and last terms
for (i = 1; i <= n-1; i++) {       //remaining terms
    xi = a + i * h;
    approx += f(xi);
}
approx = h * approx;               //approximate area

Area Calculation
Given the serial code on the previous slide, provide the parallel version.
• In this version, use the parallel for directive.
• Note that when you use parallel for, it is NOT necessary for n to be evenly divisible by thread_count.

The Parallel Algorithm v.5
Sum of trapezoid areas = h * [ (f(a) + f(b))/2 + f(x1) + f(x2) + … + f(x(n-1)) ]

h = (b - a) / n;                   //calculate h only once
approx = (f(a) + f(b)) / 2.0;      //first and last terms
#pragma omp parallel for reduction(+: approx)
for (i = 1; i <= n-1; i++) {       //remaining terms
    xi = a + i * h;
    approx += f(xi);
}
approx = h * approx;               //approximate area

The Complete Program
int main() {
    double a = 1, b = 2;
    int n = 10, thread_count = 4;
    double global_result = Trap(a, b, n, thread_count);
    printf("Approximate area: %f\n", global_result);
    return 0;
}
double Trap(double a, double b, int n, int thread_count) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;
    int i;
    #pragma omp parallel for num_threads(thread_count) reduction(+: approx)
    for (i = 1; i <= n-1; i++)
        approx += f(a + i*h);
    return h * approx;
}

Previous Version without Parallel For
double global_result = 0, a = 1, b = 2;   //global_result is shared
int n = 12;
#pragma omp parallel num_threads(4) reduction(+:global_result)
global_result += Local_trap(...);
printf("Approximate area: %f\n", global_result);

double Local_trap(double a, double b, int n) {
    double my_approx, my_a, my_b;
    int i, my_n, my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    double h = (b - a) / n;        //declared before its first use below
    my_n = n / thread_count;
    my_a = a + my_rank * my_n * h;
    my_b = my_a + my_n * h;
    my_approx = (f(my_a) + f(my_b)) / 2.0;
    for (i = 1; i <= my_n - 1; i++)
        my_approx += f(my_a + i * h);
    return h * my_approx;          //instead of adding it to global_result
}
Notes on this version:
• We need to define the range for each thread.
• n must be divisible by num_threads.

Nested Loops
#pragma omp for
for (i = 0; i < 5; i++)        //this loop is parallelized
    for (j = 0; j < 10; j++)   //but this one is not
        ...
Nested for loops (in the above example):
• The outer for loop is parallelized: its iterations are divided among the threads.
• The inner for loop is not parallelized: each thread executes all of its iterations.
Variable scope:
• The loop counter of the for-statement immediately following the for-directive is private to each thread.
• Hence, i in the above code is private.
• However, j is not private unless it is declared right after the outer for loop, or it is explicitly declared private.
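A minimal sketch of the safer pattern (my own illustration): declaring both counters inside the parallel loop makes each of them private:

#include <stdio.h>

int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < 5; i++)        // i: private (loop variable of the directive)
        for (int j = 0; j < 10; j++)   // j: private too, declared inside the loop body
            printf("(%d, %d)\n", i, j);
    return 0;
}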
Caution: How and When to Use Parallel For?
Only the following form is legal for a parallelized for statement:

for (index = start; index < end; incr_expression)   //the test may also use <=, >=, or >

Limitations on index:
• Must have integer or pointer type (e.g., it can't be float)
• Can only be modified by the "increment expression" in the for statement
start, end, and incr must:
• Have a compatible type (e.g., if index is a pointer, then incr must have an integer type)
• Not change during execution of the loop

Caution, cont'd
Cannot parallelize:
• while or do-while loops
• Infinite loops: for ( ; ; ) {…}
• Loops that may exit early, e.g., for (i = 0; i < n; i++) { … if (condition) break; … }

Assigning Work to a Single Thread
#pragma omp single
• Within a parallel region, executes a block of code just once, by exactly one thread in the team (not necessarily the master).
• Implied barrier on exit (unless nowait is used).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        printf("Hi from T%d\n", omp_get_thread_num());
        #pragma omp single
        printf("One Hi from T%d\n", omp_get_thread_num());
    }
    return 0;
}
Possible Output
Hi from T2
Hi from T3
One Hi from T2
Hi from T0
Hi from T1
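A common use of single (a minimal sketch of mine, not from the slides) is initializing shared data exactly once, relying on the implied barrier so no thread reads the data too early:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int data = 0;                      // shared
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        data = 42;                     // exactly one thread initializes it
        // implied barrier here: no thread reads 'data' before it is set
        printf("T%d sees data = %d\n", omp_get_thread_num(), data);
    }
    return 0;
}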

Assigning Work to a Single Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp single
    printf("T%d:X\n", n);
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each of the four threads runs its own copy of the code; all threads synchronize at the implied barrier after 'single'.)
Assigning Work to a Single Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp single nowait
    printf("T%d:X\n", n);
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: NO barrier; there is no synchronization after 'single' if nowait is used.)
Possible Output
T1:A
T0:A
T1:X
T0:B
T1:B
T3:A
T3:B
T2:A
T2:B
Finished
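nowait fits cases where the other threads do not depend on the single block's result. A minimal sketch (my own illustration), where one thread writes a log entry while the rest continue:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single nowait
        printf("T%d writes the log entry\n", omp_get_thread_num());  // one thread only
        // no barrier: the other threads move on immediately
        printf("T%d continues working\n", omp_get_thread_num());
    }
    return 0;
}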

Assigning Work to the Master Thread
#pragma omp master
• Within a parallel region, executes a block of code just once, by the master thread of the team.
• Unlike the single directive, the master directive does NOT have an implied barrier on exit.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        printf("Hi from T%d\n", omp_get_thread_num());
        #pragma omp master
        printf("One Hi from T%d\n", omp_get_thread_num());
    }
    return 0;
}
Possible Output
Hi from T2
Hi from T3
Hi from T0
One Hi from T0
Hi from T1
Assigning Work to the Master Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp master
    printf("T%d:X\n", n);
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each thread runs its own copy of the code; NO barrier, and no synchronization after 'master', even without nowait.)
Possible Output
T1:A
T1:B
T3:A
T3:B
T0:A
T0:X
T0:B
T2:A
T2:B
Finished

Assigning Work to the Master Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp master
    printf("T%d:X\n", n);
    #pragma omp barrier        //barrier explicitly added here
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each thread runs its own copy of the code; the explicit barrier after 'master' synchronizes all threads before the B print-outs.)
Possible Output
T3:A
T2:A
T0:A
T0:X
T1:A
T3:B
T2:B
T0:B
T1:B
Finished
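The explicit barrier makes master usable for initialization, much as the implied barrier does for single. A minimal sketch (my own illustration; config is a hypothetical shared variable):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int config = 0;                    // shared
    #pragma omp parallel num_threads(4)
    {
        #pragma omp master
        config = 7;                    // only the master (T0) runs this
        #pragma omp barrier            // needed: 'master' has no implied barrier
        printf("T%d uses config = %d\n", omp_get_thread_num(), config);
    }
    return 0;
}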
Conclusion/Up Next
§ What we covered today (review key concepts):
• Work-sharing: the parallel for construct
• Data dependency
• Single, master constructs
§ Up next:
– Sections
– Scheduling loops (static, dynamic, guided, auto)
– Ordered iterations
– Nested loops
