Topic 8 – Work Sharing (Parallel For, Single)
COSC 407: Intro to Parallel Computing
Outline (Asynchronous)
Previously:
• Mutual exclusion (critical, atomic, locks)
• Variable scope (shared, private, firstprivate)
• Reduction; synchronization (barriers, nowait)
Today:
• Work-sharing: the parallel for construct
• Data dependency
• The single and master constructs
Work-Sharing Constructs
• Within a parallel region, you can decide how work is distributed among the threads (a combined sketch follows below):
– for – probably the most important construct in OpenMP; divides loop iterations among the threads.
– single – assigns a block of code to exactly one thread.
– sections – divides the work into sections, each executed by one thread.
• Implied barrier on exit (unless nowait is used); no implied barrier on entry.
• Because these constructs appear inside a parallel region, they use the existing threads (they do not create new threads).
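A minimal sketch (not from the slides; the printed messages are illustrative) showing all three constructs inside one parallel region:

#include <stdio.h>
#include <omp.h>
int main(void) {
    int a[8];
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for                  // iterations divided among the team
        for (int i = 0; i < 8; i++)
            a[i] = i * i;
        #pragma omp single               // exactly one thread runs this block
        printf("array filled; reported by T%d\n", omp_get_thread_num());
        #pragma omp sections             // each section runs on one thread
        {
            #pragma omp section
            printf("section 1 on T%d\n", omp_get_thread_num());
            #pragma omp section
            printf("section 2 on T%d\n", omp_get_thread_num());
        }
    }   // implied barrier at the end of each construct and of the region
    return 0;
}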
Parallel For
• Loop iterations are distributed across the thread team (at run time)
– Loops must have an integer counter variable whose value is incremented/decremented by a fixed amount each iteration until a specified bound is reached
• The loop variable is implicitly made private (each thread gets its own copy)
– Its value at the end of the loop is undefined (unless declared lastprivate)
– It is recommended that programmers do not rely on OpenMP default rules:
• Explicitly set data-sharing attributes
• Output/processing order is non-deterministic
Parallel For Loop
• Uses the team of threads to execute the for loop immediately following the for directive.
• The system parallelizes the loop by dividing its iterations among the threads.
Syntax (two options)
(1) #pragma omp for [clause [clause]…]   // does not create new threads
    for (i = start; i < end; i += incr) { … }
Use this syntax when your parallel region includes a for loop along with other statements.
Parallel For Loop
(2) #pragma omp parallel for [clause [clause]…]   // creates new threads
    for (i = start; i < end; i += incr) { … }
Use this syntax when the parallel region consists of just the for loop.
Variable scope: the loop variable of the for statement immediately following the for directive is private.
• Recall the recommended best practice: declare it private explicitly.
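The two options side by side, in an assumed sketch (function and array names are illustrative); both loops do the same work:

#include <omp.h>
#define N 100
void scale(double *x) {
    int i;
    // Option (1): for directive inside an existing parallel region
    #pragma omp parallel
    {
        // ... other statements can go here ...
        #pragma omp for
        for (i = 0; i < N; i++)
            x[i] *= 2.0;
    }
    // Option (2): combined parallel for (the region is just the loop)
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        x[i] *= 2.0;
}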
WITHOUT the Parallel For Directive
WITHOUT the for directive, EACH thread runs a copy of the WHOLE for loop:
• On 4 threads (T0 to T3) we get 16 print-outs, since each thread executes all 4 iterations concurrently with the other threads.
• We would need to divide up the work manually.
#pragma omp parallel num_threads(4)
{
    int i, n = omp_get_thread_num();
    for (i = 0; i < 4; i++)
        printf("T%d: i=%d\n", n, i);
}
Possible Output:
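One possible interleaving of the 16 print-outs (illustrative; each thread prints i = 0 to 3 in order, but the threads interleave arbitrarily):

T0: i=0   T2: i=0   T2: i=1   T1: i=0
T0: i=1   T3: i=0   T1: i=1   T1: i=2
T2: i=2   T0: i=2   T3: i=1   T3: i=2
T0: i=3   T1: i=3   T2: i=3   T3: i=3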
Parallel For Directive
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for (int i = 0; i < 4; i++) {
        int n = omp_get_thread_num();   // declared inside the loop, so private
        printf("T%d: i=%d\n", n, i);
    }
}
WITH the for directive, the iterations are divided among the threads:
• On 4 threads (T0 to T3) we get 4 print-outs, since each thread executes one iteration concurrently with the other threads.
• That is, regardless of the number of threads, we get exactly the number of print-outs specified by the for loop.
Note: order is not preserved!
Possible Output:
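One possible interleaving of the 4 print-outs (illustrative):

T2: i=2
T0: i=0
T3: i=3
T1: i=1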
Who Does What?
// assuming 4 threads
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for (int i = 0; i < 8; i++) {
        int n = omp_get_thread_num();
        printf("T%d: i=%d\n", n, i);
    }
}
The number of iterations is divided equally among the threads:
• T0: 2 iterations (i = 0, 1)
• T1: 2 iterations (i = 2, 3)
• T2: 2 iterations (i = 4, 5)
• T3: 2 iterations (i = 6, 7)
This is called static scheduling:
T0: i = 0→1   T1: i = 2→3   T2: i = 4→5   T3: i = 6→7
Each thread gets its own loop counter.
Possible Output:
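One possible interleaving of the 8 print-outs (illustrative; each thread prints its own two iterations in order):

T0: i=0   T1: i=2   T2: i=4   T3: i=6
T1: i=3   T0: i=1   T3: i=7   T2: i=5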
Who Does What?
// assuming 4 threads
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for (int i = 0; i < 80; i++) {
        // some work
    }
}
The number of iterations is divided equally among the threads:
• T0: 20 iterations (i = 0…19)
• T1: 20 iterations (i = 20…39)
• T2: 20 iterations (i = 40…59)
• T3: 20 iterations (i = 60…79)
T0: i = 0→19   T1: i = 20→39   T2: i = 40→59   T3: i = 60→79
Each thread gets its own loop counter.
Who Does What?
// assuming 4 threads
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for (int i = 0; i < 6; i++) {
        int n = omp_get_thread_num();
        printf("T%d: i=%d\n", n, i);
    }
}
Note: when the iteration count does not divide evenly, the extra iterations are assigned to the first few threads:
• T0 took 2 iterations (i = 0, 1)
• T1 took 2 iterations (i = 2, 3)
• T2 took 1 iteration (i = 4)
• T3 took 1 iteration (i = 5)
Possible Output:
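One possible interleaving of the 6 print-outs (illustrative):

T1: i=2   T0: i=0   T2: i=4   T3: i=5   T1: i=3   T0: i=1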
T0: i = 0→1   T1: i = 2→3   T2: i = 4   T3: i = 5
Each thread gets its own loop counter.
More Practical Examples
• Image processing: converting an RGB image to grayscale.
– Assume the luminosity of each grayscale pixel equals 0.21 * Red + 0.72 * Green + 0.07 * Blue. Then:
#pragma omp parallel for
for (i = 0; i < numPixels; i++)
    gray[i] = (unsigned char)(.21*red[i] + .72*green[i] + .07*blue[i]);
• 3D rendering:
– Divide the image into blocks and iterate over them.
• Matrix / array operations:
– Add two matrices, find the transpose, etc.
• Many other applications that apply the same action repeatedly in a loop.
– Refer to the introduction-to-parallel-computing notes.
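A self-contained version of the grayscale loop above (a sketch; the image size and test pixel values are assumptions made for illustration):

#include <stdio.h>
#include <omp.h>

#define NUM_PIXELS 1000000   // assumed image size

int main(void) {
    static unsigned char red[NUM_PIXELS], green[NUM_PIXELS], blue[NUM_PIXELS];
    static unsigned char gray[NUM_PIXELS];
    int i;

    // Fill the channels with arbitrary test values.
    for (i = 0; i < NUM_PIXELS; i++) {
        red[i]   = (unsigned char)(i % 256);
        green[i] = (unsigned char)((i * 7) % 256);
        blue[i]  = (unsigned char)((i * 13) % 256);
    }

    // Each iteration touches only pixel i, so iterations are independent
    // and the loop parallelizes safely.
    #pragma omp parallel for
    for (i = 0; i < NUM_PIXELS; i++)
        gray[i] = (unsigned char)(.21*red[i] + .72*green[i] + .07*blue[i]);

    printf("gray[0]=%d gray[%d]=%d\n", gray[0], NUM_PIXELS-1, gray[NUM_PIXELS-1]);
    return 0;
}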
Area Under the Curve Revisited
The aim is to compute the area S (from a to b) under a curve defined by f(x).
• One way to do this is to divide S into a series of trapezoids, find the area of each trapezoid, and then sum the areas.
• The more trapezoids, the better the approximation.
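For reference, the formula on the next slide follows from summing the individual trapezoid areas. A short derivation, with h = (b-a)/n and x_i = a + ih (so x_0 = a and x_n = b):

S \approx \sum_{i=0}^{n-1} h\,\frac{f(x_i)+f(x_{i+1})}{2}
  = h\left[\frac{f(a)+f(b)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-1})\right]

Each interior point x_1, …, x_{n-1} appears in two adjacent trapezoids, so its two halves combine into a full f(x_i) term.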
The Serial Algorithm Revisited
Sum of trapezoid areas = h * [ (f(a)+f(b))/2 + f(x_1) + f(x_2) + … + f(x_{n-1}) ]
h = (b - a) / n;                 // calculate h only once
approx = (f(a) + f(b)) / 2.0;    // first and last terms
for (i = 1; i <= n-1; i++) {     // remaining terms
    xi = a + i * h;
    approx += f(xi);
}
approx = h * approx;             // approximate area
Area Calculation
Given the serial code on the previous slide, provide the parallel version.
• In this version, use the parallel for directive.
• Note that when you use parallel for, it is NOT necessary for n to be evenly divisible by thread_count.
The Parallel Algorithm v.5
Sum of trapezoid areas = h * [ (f(a)+f(b))/2 + f(x_1) + f(x_2) + … + f(x_{n-1}) ]
h = (b - a) / n;                 // calculate h only once
approx = (f(a) + f(b)) / 2.0;    // first and last terms
#pragma omp parallel for reduction(+: approx)
for (i = 1; i <= n-1; i++) {     // remaining terms
    xi = a + i * h;
    approx += f(xi);
}
approx = h * approx;             // approximate area
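Why reduction(+: approx) is needed: approx would otherwise be shared, and every thread's approx += f(xi) would race with the others. The reduction clause gives each thread a private copy (initialized to 0 for +) and safely combines the partial sums at the end of the loop. Note also that xi must be private; declaring it inside the loop body, or listing it in a private clause, prevents a second race.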
The Complete Program
#include <stdio.h>
#include <omp.h>

double f(double x) { return x * x; }   // example integrand (assumed for illustration)

double Trap(double a, double b, int n, int thread_count) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;
    int i;
    #pragma omp parallel for num_threads(thread_count) reduction(+: approx)
    for (i = 1; i <= n-1; i++)
        approx += f(a + i*h);
    return h * approx;
}

int main() {
    double a = 1, b = 2;
    int n = 10, thread_count = 4;
    double global_result = Trap(a, b, n, thread_count);
    printf("Approximate area: %f\n", global_result);
    return 0;
}
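To build and run (assuming gcc with OpenMP support; the file name is assumed):

gcc -fopenmp trap.c -o trap && ./trap

With the illustrative f(x) = x*x above, the exact area from 1 to 2 is 7/3 ≈ 2.3333; with only n = 10 trapezoids the program prints a slightly larger value (about 2.3350), since the trapezoidal rule overestimates convex functions.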
Previous Version without Parallel For
double global_result = 0, a = 1, b = 2;   // global_result is shared
int n = 12;
#pragma omp parallel num_threads(4) reduction(+: global_result)
global_result += Local_trap(...);
printf("Approximate area: %f\n", global_result);

double Local_trap(double a, double b, int n) {
    double x, my_approx, my_a, my_b;
    int i, my_n, my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    double h = (b - a) / n;        // declared before its first use
    my_n = n / thread_count;
    my_a = a + my_rank * my_n * h;
    my_b = my_a + my_n * h;
    my_approx = (f(my_a) + f(my_b)) / 2.0;
    for (i = 1; i <= my_n - 1; i++)
        my_approx += f(my_a + i * h);
    return h * my_approx;          // returned instead of adding it to global_result directly
}

Notes on this version:
• We need to compute the iteration range for each thread manually.
• n must be divisible by num_threads.
Nested Loops
#pragma omp for
for (i = 0; i < 5; i++)        // this loop is parallelized
    for (j = 0; j < 10; j++)   // but this one is not
        ...
Nested for loops (in the above example):
• The outer for loop is parallelized:
– Iterations are divided among threads.
• The inner for loop is not parallelized:
– Each thread executes all of its inner iterations.
Variable scope:
• The loop counter of the for statement immediately following the for directive is private to each thread.
• Hence i in the above code is private.
• However, j is not private unless it is declared right after the outer for loop, or it is explicitly declared private (see the sketch below).
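Two illustrative ways to make the inner counter private (a sketch; do_work is a hypothetical function standing in for the loop body):

// (a) declare j inside the outer loop body
#pragma omp parallel for
for (int i = 0; i < 5; i++)
    for (int j = 0; j < 10; j++)
        do_work(i, j);             // do_work is hypothetical

// (b) list j in a private clause
int i, j;
#pragma omp parallel for private(j)
for (i = 0; i < 5; i++)
    for (j = 0; j < 10; j++)
        do_work(i, j);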
Caution: How and When to Use Parallel For?
Only the canonical form is legal for a parallelized for statement:
for (index = start; index < end; index += incr)   // (or >, <=, >=, --, ++, -=)
Limitations on index:
• Must have integer or pointer type (e.g., it can't be float)
• Can only be modified by the "increment expression" in the for statement
start, end, and incr must:
• Have a type compatible with index (e.g., if index is a pointer, then incr must have an integer type)
• Not change during execution of the loop
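A few illustrative cases (assumed examples, not from the slides) of loops that do and do not fit the canonical form:

// Legal: canonical form, integer counter, fixed increment
#pragma omp parallel for
for (i = 0; i < n; i += 2) { /* ... */ }

// NOT legal: floating-point counter
// for (x = 0.0; x < 1.0; x += 0.1) { ... }

// NOT legal: counter modified inside the loop body
// for (i = 0; i < n; i++) { if (...) i++; ... }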
Caution, cont'd
Cannot parallelize:
• while or do-while loops
• Unbounded loops, e.g., for ( ; ; ) { … }
• Loops whose iteration count cannot be determined before the loop starts, e.g., a for loop that exits early with a break statement
Assigning Work to a Single Thread
#pragma omp single
• Within a parallel region, executes a block of code just once, by exactly one thread in the team (not necessarily the master).
• Has an implied barrier on exit (unless nowait is used).
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        printf("Hi from T%d\n", omp_get_thread_num());
        #pragma omp single
        printf("One Hi from T%d\n", omp_get_thread_num());
    }
    return 0;
}
Possible Output
Hi from T2
Hi from T3
One Hi from T2
Hi from T0
Hi from T1
Assigning Work to a Single Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp single
    printf("T%d:X\n", n);
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each of the four threads runs int n = …; all print A, one prints X, and synchronization after 'single' — the implied barrier — makes every thread wait before printing B.)
Assigning Work to a Single Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp single nowait
    printf("T%d:X\n", n);
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each of the four threads runs int n = …; NO barrier — there is no synchronization after 'single' when nowait is used.)
Possible Output
T1:A T0:A T1:X T0:B T1:B T3:A T3:B T2:A T2:B Finished
Assigning Work to the Master Thread
#pragma omp master
• Within a parallel region, executes a block of code just once, by the master thread of the team.
• Unlike the single directive, the master directive does NOT have an implied barrier on exit.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        printf("Hi from T%d\n", omp_get_thread_num());
        #pragma omp master
        printf("One Hi from T%d\n", omp_get_thread_num());
    }
    return 0;
}
Possible Output
Hi from T2
Hi from T3
Hi from T0
One Hi from T0
Hi from T1
Assigning Work to the Master Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp master
    printf("T%d:X\n", n);
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each of the four threads runs int n = …; NO barrier — there is no synchronization after 'master', even without nowait.)
Possible Output
T1:A T1:B T3:A T3:B T0:A T0:X T0:B T2:A T2:B Finished
Assigning Work to the Master Thread, cont'd
#pragma omp parallel
{
    int n = omp_get_thread_num();
    printf("T%d:A\n", n);
    #pragma omp master
    printf("T%d:X\n", n);
    #pragma omp barrier       // barrier explicitly added here
    printf("T%d:B\n", n);
}
printf("Finished");
(Diagram: each of the four threads runs int n = …; the explicit barrier after 'master' makes every thread wait before printing B.)
Possible Output
T3:A T2:A T0:A T0:X T1:A T3:B T2:B T0:B T1:B Finished
Conclusion/Up Next
• What we covered today (review key concepts):
– Work-sharing: the parallel for construct
– Data dependency
– Single and master constructs
• Up next:
– Sections
– Scheduling loops (static, dynamic, guided, auto)
– Ordered iterations
– Nested loops