High Performance Computing
Course Notes
Coursework 2
Dr Ligang He
Problem Domain
•
Coursework taken from the field of Computational Fluid Dynamics (CFD)
Fluid dynamics based on three fundamental principles: (i) mass is conserved; (ii) Newton’s second law ;(iii) energy is conserved
Expressed as partial differential equations, showing how velocity and pressure are related, etc. (called governing equations).
Computer Science, University of Warwick
2
Governing Equations
Computer Science, University of Warwick
3
Problem Domain
• Coursework taken from the field of Computational Fluid Dynamics (CFD)
•
Fluid dynamics based on three fundamental principles: (i) mass is conserved; (ii) Newton’s second law ;(iii) energy is conserved
Expressed as partial differential equations, showing how velocity and pressure are related, etc. (called governing equations).
The coordinates and time are independent variables while velocity and pressure are dependent variables
Computational Fluid Dynamics is the science of finding the numerical solution to the governing equations of fluid flow, over the discretized space or time
Computer Science, University of Warwick
4
CFD code
The code in the coursework, called Karman, calculates the velocity and pressure of a 2D flow
The code writes the solution values into a binary file Currently the code is sequential
The purpose of the coursework is to parallelize the code
Computer Science, University of Warwick
5
Data and stencil
The area represented as a 2D Grid (discretize) Calculate one point in each cell
Computer Science, University of Warwick
6
CFD code
jmax
There are two types of cell: fluid (C_F) and obstacle (C_B)
Computer Science, University of Warwick
7
Data and stencil
• The properties of a cell in the grid
• Communication pattern: using a five-point stencil to
calculate a point
Computer Science, University of Warwick
8
Numerical method for solving the governing equations
Initialize each cell value
Check if the solution satisfies the governing equations
If not, generate the new solution based on a stencil of current solutions from neighbouring cells
Iterations are advanced until the termination condition is met
Computer Science, University of Warwick
9
General Steps for using the numerical method to solve the linear equations
Aim: solve AΦ=B
Step 1: Guess a initial solution Φ0
Step 2: If the convergence is reached by checking the residual B-AΦi
Key: when we repeat iterative steps, each step generates a better solution
Computer Science, University of Warwick
10
Successive Over-Relaxation(SOR)
SOR is a method to generate new solutions: it can speed up convergence
For a set of linear equations: AΦ=B
let A=D+U+L, where D, L and U denote the diagonal, strictly lower
triangular, and strictly upper triangular parts of A, respectively
The successive over-relaxation (SOR) iteration is defined by the following
Where values of ω > 1 are used to speedup convergence of a slow-converging process, while values of ω < 1 are often help establish convergence of diverging iterative process
Computer Science, University of Warwick
11
SOR
Successive Over-Relaxation (SOR) method is used to speed up convergence
Computer Science, University of Warwick
12
SOR
• SOR is originally written so that p_new(i, j) depends on
1) a stencil of current solutions from neighbouring cells,
2) p_new(i-1, j), 3) p_new(i, j-1).
for all i imax, j jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1))
P(k) is the solution at the iteration k (P is a matrix [pij])
• This introduces a chain of
dependencies for each cell
• Fine on sequential execution: just calculate values from left to right, and top to bottom
Computer Science, University of Warwick
13
SOR
• SOR is originally written so that calculating p_new(i, j) depends on
1) a stencil of current solutions from neighbouring cells,
2) p_new(i-1, j), 3) p_new(i, j-1).
for all i imax, j jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1))
P(k) is the solution at the iteration k (P is a matrix [pij])
• This introduces a chain of
dependencies for each cell
• Fine on sequential machines: just calculate values from left to right, and top to bottom
Computer Science, University of Warwick
14
SOR
• SOR is originally written so that calculating p_new(i, j) depends on
1) a stencil of current solutions from neighbouring cells,
2) p_new(i-1, j), 3) p_new(i, j-1).
for all i imax, j jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1))
P(k) is the solution at the iteration k (P is a matrix [pij])
• This introduces a chain of
dependencies for each cell
• Fine on sequential machines: just calculate values from left to right, and top to bottom
Computer Science, University of Warwick
15
SOR
• SOR is originally written so that calculating p_new(i, j) depends on
1) a stencil of current solutions from neighbouring cells,
2) p_new(i-1, j), 3) p_new(i, j-1).
for all i imax, j jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1))
P(k) is the solution at the iteration k (P is a matrix [pij])
• This introduces a chain of
dependencies for each cell
• Fine on sequential machines: just calculate values from left to right, and top to bottom
Computer Science, University of Warwick
16
SOR
• This approach forms a “wave front”
• It does not make for a good parallel solution
• The fundamental reasons are
o A chain of dependencies
o The number of cells to be computed varies in each step (unbalanced workload)
Computer Science, University of Warwick
17
Red/Black SOR
Less dependency
Workload can be balanced among the processors
• Another solution that achieves the same result is red/black ordering
• Rewrite the SOR equations so the cells are divided into two populations.
• New red values are calculated first (using current black values)
Red cells:
pij(k+1)=f(pi-1,j(k), pi,j-1(k), pi+1,j(k) , pi,j+1(k))
• Then black is calculated using the just updated red values.
Black cells:
pij(k+1)=f(pi-1,j(k+1), pi,j-1(k+1), pi+1,j(k+1), pi,j+1(k+1)) Note: pi-1,j(k+1), pi,j-1(k+1), ... are red cells
• All red cells can be calculated in parallel,
• Then all black cells can be calculated in
parallel
Computer Science, University of Warwick
18
Decomposition of a grid of cells
• Make each process responsible for updating a block of cells
• A process must send the cells at the edge of its domain to its neighbours, and receive a copy of the edge cells from its neighbours
• You can think whether you implement 1D or 2D decomposition
o 1D is OK for this coursework
Process 1
Process 2
Computer Science, University of Warwick
19
The main loop
for (t = 0.0; t< t_end; t = t + delt)
{
1.Calculate an approximate time-step size by seeing how much movement occurred in the last time-step. The discrete approximation is only stable when the maximum motion < 1 cell per time-step.
2.For each cell, compute a tentative new velocity (f,g) based on the previous (u,v) values. It takes as input the u, v and flag matrices, and updates the f and g matrices.
3.For each cell calculate the RHS of the pressure equation. This uses two f and g values. It takes as input the f, g, and flag matrices and updates the RHS matrix.
4.For the entire pressure matrix, use Red/Black SOR to solve the Poisson equation. This takes a large number iterations of the Red/Black process as shown in the slides. It takes as input the current pressure matrix, flag matrix, and the RHS matrix and outputs a new pressure matrix
Computer Science, University of Warwick
20
The main loop
5. For each cell, update the real (u,v) values based on the pressure matrix and the tentative velocity values (f,g). It takes as input the pressure, f, g and flag matrices and updates the u and v matrices.
6. For each cell that is adjacent to an edge cell, the (u,v) values of the boundary cells are updated (by taking values from their neighbours). It takes as input the u, v and flag matrices and updates the u and v matrices
}
Note that it is not going to be worth parallelizing the whole program, just parts of it.
Computer Science, University of Warwick
21
Coursework 2 – What to Do in General
- You will be given a serial program, called Karman
- You need to parallelize the Karman code and write a report
- 80% of the full mark comes from the parallelization using MPI
- 20% of the mark comes from the parallelization using both MPI and OpenMP (a hybrid approach), i.e., parallelizing the computations in one machine using OpenMP, while parallelizing the computations across machines using MPI
Computer Science, University of Warwick
22
Coursework 2 – Detailed Requirements for Programming
• Parallelize the Karman code, using a pure MPI approach (80% of marks) and a hybrid MPI-OpenMP approach (20% marks)
• Profiling which Karman functions are more time consuming
• Design your decomposition strategy, date exchange strategies between neighboring partitions, the MPI functions (e.g., collective operations)
• For a hybrid approach, add OpenMP directives
Computer Science, University of Warwick
23
Coursework 2 – Detailed Requirements for Programming
•
Benchmarking the execution time (e.g., using MPI_Wtime function) of your parallel code as you
increase the number of processes
and/or change the problem size if it helps you observe the trend
If you develop a hybrid MPI-OpenMP implementation,
benchmark the runtime of the hybrid parallel version
Compare the performance between the hybrid approach and the pure MPI approach
Ensure the simulation results are the same after the parallelization
Contain adequate comments as good programming
practice.
•
• •
• •
•
•
Computer Science, University of Warwick
24
Assignment 2 – Requirements for Report
•
•
Profiling the functions to determine which functions are more time consuming; discuss the profiling results
Discuss your MPI implementation, for example,
your decomposition strategy and load balancing strategy;
your strategy of exchanging the boundary data between neighboring partitions;
MPI functions such as collective operations you used;
Benchmark the change in execution times of your parallel program as you increase the number of processes and/or problem size; present the results in graph or table
Analyze and discuss the performance of your parallel
program
•
•
• •
•
Computer Science, University of Warwick
25
Assignment 2 – Requirements for Report
•
If you develop the hybrid MPI-OpenMP implementation,
Discuss OpenMP implementation
Benchmark the runtime of the hybrid parallel code and present the results
Discuss the performance comparison between the hybrid approach and the pure MPI approach
Writing skills: pay attention to spelling, punctuation and grammar
Up to 6 A4 pages (not a strict up limit)
•
•
• •
•
Computer Science, University of Warwick
26
Assignment 2 – Submission
-
Submit a compressed folder (.zip) to Tabula; the folder should contain
your parallel code (the directory structure used in the project should not change)
your report (pdf format and place your report in the top level directory in the folder)
Deadline: 12 noon, March 18th, 2021
-
- -
Computer Science, University of Warwick
27