Parallel Programming in C with MPI and OpenMP
Michael J. Quinn

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 18
Combining MPI and OpenMP

Outline
Advantages of using both MPI and OpenMP
Case Study: Conjugate gradient method
Case Study: Jacobi method

C+MPI vs. C+MPI+OpenMP

C + MPI
C + MPI + OpenMP


Why C + MPI + OpenMP Can Execute Faster
Lower communication overhead
More portions of program may be practical to parallelize
May allow more overlap of communications with computations
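To make the intended structure concrete, here is a minimal hybrid skeleton (an illustration of ours, not code from the book): one MPI process per node, each forking OpenMP threads, with only the processes communicating through MPI.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
   int id, p, provided;

   /* Request thread support; FUNNELED is enough when only the
      master thread makes MPI calls */
   MPI_Init_thread (&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);

   /* Each MPI process forks a team of OpenMP threads */
   #pragma omp parallel
   printf ("Process %d of %d, thread %d of %d\n",
           id, p, omp_get_thread_num(), omp_get_num_threads());

   MPI_Finalize();
   return 0;
}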

Case Study: Conjugate Gradient
Conjugate gradient method solves Ax = b
In our program we assume A is dense
Methodology
Start with MPI program
Profile functions to determine where most execution time is spent (e.g., with timers as sketched below)
Tackle most time-intensive function first
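One simple way to obtain such a profile (a sketch of our own; a conventional profiler such as gprof works just as well) is to bracket each candidate function with MPI_Wtime and report the accumulated time. The variable names and the surrounding solver are assumed from the case study.

double t_mv = 0.0, t_max, t_start;   /* time accumulated in matrix_vector_product */
...
t_start = MPI_Wtime();
matrix_vector_product (id, p, n, a, b, c);
t_mv += MPI_Wtime() - t_start;
...
/* after the solver finishes, report the largest total over all processes */
MPI_Reduce (&t_mv, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (!id) printf ("matrix_vector_product: %6.2f s\n", t_max);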

Result of Profiling MPI Program

Percentage of execution time per function:

Function                 1 CPU     8 CPUs
matrix_vector_product   99.55%     97.49%
dot_product              0.19%      1.06%
cg                        0.25%      1.44%

Clearly our focus needs to be on function matrix_vector_product

Code for matrix_vector_product

void matrix_vector_product (int id, int p,
   int n, double **a, double *b, double *c)
{
   int i, j;
   double tmp;   /* Accumulates sum */

   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
      tmp = 0.0;
      for (j = 0; j < n; j++)
         tmp += a[i][j] * b[j];
      piece[i] = tmp;   /* piece[]: this process's block of the result */
   }
   ...                  /* gather the blocks into the replicated vector c */
}
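Since matrix_vector_product accounts for nearly all of the execution time, the natural hybrid step is to parallelize its doubly nested loop with a single OpenMP directive on the outer loop. A sketch of that change (our illustration, not the book's verbatim code): j and tmp must be made private so that each thread keeps its own copies, while the rows handled by the process are divided among its threads.

#pragma omp parallel for private(j, tmp)
   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
      tmp = 0.0;                     /* per-thread accumulator */
      for (j = 0; j < n; j++)
         tmp += a[i][j] * b[j];
      piece[i] = tmp;                /* each thread writes distinct rows: no conflicts */
   }

Because the iterations of the outer loop are independent, no further synchronization is needed.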

Case Study: Jacobi Method
Second case study: a C+MPI program that uses the Jacobi method to find the steady-state solution on a two-dimensional mesh
Mesh rows are distributed among processes; following the same methodology, profiling shows that nearly all execution time is spent in function find_steady_state

Function find_steady_state (1/2)

its = 0;
for (;;) {
   if (id > 0)
      MPI_Send (u[1], N, MPI_DOUBLE, id-1, 0,
         MPI_COMM_WORLD);
   if (id < p-1) {
      MPI_Send (u[my_rows-2], N, MPI_DOUBLE, id+1, 0,
         MPI_COMM_WORLD);
      MPI_Recv (u[my_rows-1], N, MPI_DOUBLE, id+1, 0,
         MPI_COMM_WORLD, &status);
   }
   if (id > 0)
      MPI_Recv (u[0], N, MPI_DOUBLE, id-1, 0,
         MPI_COMM_WORLD, &status);

Function find_steady_state (2/2)
   diff = 0.0;
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++) {
         w[i][j] = (u[i-1][j] + u[i+1][j] +
                    u[i][j-1] + u[i][j+1])/4.0;
         if (fabs(w[i][j] - u[i][j]) > diff)
            diff = fabs(w[i][j] - u[i][j]);
      }
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++)
         u[i][j] = w[i][j];
   MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE,
      MPI_MAX, MPI_COMM_WORLD);
   if (global_diff <= EPSILON) break;
   its++;
}

Making Function Parallel (1/3)
Except for two initializations and a return statement, the function is one big for loop
Cannot execute this for loop in parallel
Not in canonical form
Contains a break statement
Contains calls to MPI functions
Data dependences between iterations

Making Function Parallel (2/3)
Focus on the first for loop indexed by i
How to handle multiple threads testing/updating diff?
Putting the if statement in a critical section would increase overhead and lower speedup
Instead, create a private variable tdiff
Each thread tests tdiff against diff before the call to MPI_Allreduce

Modified Function

diff = 0.0;
#pragma omp parallel private (i, j, tdiff)
{
   tdiff = 0.0;
   #pragma omp for
   for (i = 1; i < my_rows-1; i++)
      ...
   #pragma omp for nowait
   for (i = 1; i < my_rows-1; i++)
      ...
   #pragma omp critical
   if (tdiff > diff) diff = tdiff;
}
MPI_Allreduce (&diff, &global_diff, 1,
   MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
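For reference, here is one way the elided loop bodies could be filled in (a sketch under the assumption that they mirror the sequential code in find_steady_state, with tdiff taking the place of diff inside the first loop):

diff = 0.0;
#pragma omp parallel private (i, j, tdiff)
{
   tdiff = 0.0;
   #pragma omp for
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++) {
         w[i][j] = (u[i-1][j] + u[i+1][j] +
                    u[i][j-1] + u[i][j+1])/4.0;
         if (fabs(w[i][j] - u[i][j]) > tdiff)   /* each thread tracks its own maximum */
            tdiff = fabs(w[i][j] - u[i][j]);
      }
   #pragma omp for nowait
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++)
         u[i][j] = w[i][j];
   #pragma omp critical
   if (tdiff > diff) diff = tdiff;              /* fold per-thread maxima into diff */
}
MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE,
   MPI_MAX, MPI_COMM_WORLD);

The nowait clause lets threads skip the barrier after the copy loop; the implicit barrier at the end of the parallel region still guarantees that every thread's tdiff has been folded into diff before MPI_Allreduce is called.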

Making Function Parallel (3/3)
Focus on second for loop indexed by i
Copies elements of w to corresponding elements of u: no problem with executing in parallel

Benchmarking
Target system: a commodity cluster with four dual-processor nodes
C+MPI program executes on 1, 2, …, 8 CPUs
On 1, 2, 3, 4 CPUs, each process on different node, maximizing memory bandwidth per CPU
C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
Each process has two threads (set as in the sketch below)
C+MPI+OpenMP program executes on 2, 4, 6, 8 threads
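The number of threads per process can be fixed inside the program; a small sketch (ours, not the book's exact code) using omp_set_num_threads:

/* one way to fix two OpenMP threads per MPI process (the count could
   equally come from a command-line argument or OMP_NUM_THREADS);
   call once, before any parallel region */
omp_set_num_threads (2);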

Benchmarking Results

Analysis of Results
Hybrid C+MPI+OpenMP program uniformly faster than C+MPI program
Computation/communication ratio of hybrid program is superior
Number of mesh points updated per communicated boundary element is twice as high per node for the hybrid program
Lower communication overhead leads to 19% better speedup on 8 CPUs

Summary
Many contemporary parallel computers consist of a collection of multiprocessors
On these systems, performance of C+MPI+OpenMP programs can exceed performance of C+MPI programs
OpenMP enables us to take advantage of shared memory to reduce communication overhead
Often, conversion requires addition of relatively few pragmas

[Figure from the "C+MPI vs. C+MPI+OpenMP" slide: (a) C + MPI — every processor P runs its own MPI process, all connected by an interconnection network; (b) C + MPI + OpenMP — each multiprocessor node runs a single MPI process P whose threads t occupy the node's processors, and only the processes communicate through the interconnection network.]