CS计算机代考程序代写 chain High Performance Computing

High Performance Computing
Course Notes
Coursework 2
Dr Ligang He

Problem Domain

Coursework taken from the field of Computational Fluid Dynamics (CFD)
 Fluid dynamics based on three fundamental principles: (i) mass is conserved; (ii) Newton’s second law ;(iii) energy is conserved
 Expressed as partial differential equations, showing how velocity and pressure are related, etc. (called governing equations).
Computer Science, University of Warwick
2

Governing Equations
Computer Science, University of Warwick
3

Problem Domain
• Coursework taken from the field of Computational Fluid Dynamics (CFD)

 Fluid dynamics based on three fundamental principles: (i) mass is conserved; (ii) Newton’s second law ;(iii) energy is conserved
 Expressed as partial differential equations, showing how velocity and pressure are related, etc. (called governing equations).
 The coordinates and time are independent variables while velocity and pressure are dependent variables
Computational Fluid Dynamics is the science of finding the numerical solution to the governing equations of fluid flow, over the discretized space or time
Computer Science, University of Warwick
4

CFD code
The code in the coursework, called Karman, calculates the velocity and pressure of a 2D flow
The code writes the solution values into a binary file Currently the code is sequential
The purpose of the coursework is to parallelize the code
Computer Science, University of Warwick
5

Data and stencil
The area represented as a 2D Grid (discretize) Calculate one point in each cell
Computer Science, University of Warwick
6

CFD code
jmax
There are two types of cell: fluid (C_F) and obstacle (C_B)
Computer Science, University of Warwick
7

Data and stencil
• The properties of a cell in the grid
• Communication pattern: using a five-point stencil to
calculate a point
Computer Science, University of Warwick
8

Numerical method for solving the governing equations
Initialize each cell value
Check if the solution satisfies the governing equations
If not, generate the new solution based on a stencil of current solutions from neighbouring cells
Iterations are advanced until the termination condition is met
Computer Science, University of Warwick
9

General Steps for using the numerical method to solve the linear equations
 Aim: solve AΦ=B
 Step 1: Guess a initial solution Φ0
 Step 2: If the convergence is reached by checking the residual B-AΦi=0), Φ(i+1)= f(Φ(i)), go to Step 2 e.g., f(Φ(i))=Φ(i)+(B-AΦ(i))
 Key: when we repeat iterative steps, each step generates a better solution
Computer Science, University of Warwick
10

Successive Over-Relaxation(SOR)
 SOR is a method to generate new solutions: it can speed up convergence
 For a set of linear equations: AΦ=B
 let A=D+U+L, where D, L and U denote the diagonal, strictly lower
triangular, and strictly upper triangular parts of A, respectively
 The successive over-relaxation (SOR) iteration is defined by the following
 Where values of ω > 1 are used to speedup convergence of a slow-converging process, while values of ω < 1 are often help establish convergence of diverging iterative process Computer Science, University of Warwick 11 SOR Successive Over-Relaxation (SOR) method is used to speed up convergence Computer Science, University of Warwick 12 SOR • SOR is originally written so that p_new(i, j) depends on 1) a stencil of current solutions from neighbouring cells, 2) p_new(i-1, j), 3) p_new(i, j-1). for all i  imax, j  jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1)) P(k) is the solution at the iteration k (P is a matrix [pij]) • This introduces a chain of dependencies for each cell • Fine on sequential execution: just calculate values from left to right, and top to bottom Computer Science, University of Warwick 13 SOR • SOR is originally written so that calculating p_new(i, j) depends on 1) a stencil of current solutions from neighbouring cells, 2) p_new(i-1, j), 3) p_new(i, j-1). for all i  imax, j  jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1)) P(k) is the solution at the iteration k (P is a matrix [pij]) • This introduces a chain of dependencies for each cell • Fine on sequential machines: just calculate values from left to right, and top to bottom Computer Science, University of Warwick 14 SOR • SOR is originally written so that calculating p_new(i, j) depends on 1) a stencil of current solutions from neighbouring cells, 2) p_new(i-1, j), 3) p_new(i, j-1). for all i  imax, j  jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1)) P(k) is the solution at the iteration k (P is a matrix [pij]) • This introduces a chain of dependencies for each cell • Fine on sequential machines: just calculate values from left to right, and top to bottom Computer Science, University of Warwick 15 SOR • SOR is originally written so that calculating p_new(i, j) depends on 1) a stencil of current solutions from neighbouring cells, 2) p_new(i-1, j), 3) p_new(i, j-1). for all i  imax, j  jmax, do pij(k+1)=f(P(k), pi-1,j(k+1), pi,j-1(k+1)) P(k) is the solution at the iteration k (P is a matrix [pij]) • This introduces a chain of dependencies for each cell • Fine on sequential machines: just calculate values from left to right, and top to bottom Computer Science, University of Warwick 16 SOR • This approach forms a “wave front” • It does not make for a good parallel solution • The fundamental reasons are o A chain of dependencies o The number of cells to be computed varies in each step (unbalanced workload) Computer Science, University of Warwick 17 Red/Black SOR Less dependency Workload can be balanced among the processors • Another solution that achieves the same result is red/black ordering • Rewrite the SOR equations so the cells are divided into two populations. • New red values are calculated first (using current black values) Red cells: pij(k+1)=f(pi-1,j(k), pi,j-1(k), pi+1,j(k) , pi,j+1(k)) • Then black is calculated using the just updated red values. Black cells: pij(k+1)=f(pi-1,j(k+1), pi,j-1(k+1), pi+1,j(k+1), pi,j+1(k+1)) Note: pi-1,j(k+1), pi,j-1(k+1), ... are red cells • All red cells can be calculated in parallel, • Then all black cells can be calculated in parallel Computer Science, University of Warwick 18 Decomposition of a grid of cells • Make each process responsible for updating a block of cells • A process must send the cells at the edge of its domain to its neighbours, and receive a copy of the edge cells from its neighbours • You can think whether you implement 1D or 2D decomposition o 1D is OK for this coursework Process 1 Process 2 Computer Science, University of Warwick 19 The main loop for (t = 0.0; t< t_end; t = t + delt) { 1.Calculate an approximate time-step size by seeing how much movement occurred in the last time-step. The discrete approximation is only stable when the maximum motion < 1 cell per time-step. 2.For each cell, compute a tentative new velocity (f,g) based on the previous (u,v) values. It takes as input the u, v and flag matrices, and updates the f and g matrices. 3.For each cell calculate the RHS of the pressure equation. This uses two f and g values. It takes as input the f, g, and flag matrices and updates the RHS matrix. 4.For the entire pressure matrix, use Red/Black SOR to solve the Poisson equation. This takes a large number iterations of the Red/Black process as shown in the slides. It takes as input the current pressure matrix, flag matrix, and the RHS matrix and outputs a new pressure matrix Computer Science, University of Warwick 20 The main loop 5. For each cell, update the real (u,v) values based on the pressure matrix and the tentative velocity values (f,g). It takes as input the pressure, f, g and flag matrices and updates the u and v matrices. 6. For each cell that is adjacent to an edge cell, the (u,v) values of the boundary cells are updated (by taking values from their neighbours). It takes as input the u, v and flag matrices and updates the u and v matrices } Note that it is not going to be worth parallelizing the whole program, just parts of it. Computer Science, University of Warwick 21 Coursework 2 – What to Do in General - You will be given a serial program, called Karman - You need to parallelize the Karman code and write a report - 80% of the full mark comes from the parallelization using MPI - 20% of the mark comes from the parallelization using both MPI and OpenMP (a hybrid approach), i.e., parallelizing the computations in one machine using OpenMP, while parallelizing the computations across machines using MPI Computer Science, University of Warwick 22 Coursework 2 – Detailed Requirements for Programming • Parallelize the Karman code, using a pure MPI approach (80% of marks) and a hybrid MPI-OpenMP approach (20% marks) • Profiling which Karman functions are more time consuming • Design your decomposition strategy, date exchange strategies between neighboring partitions, the MPI functions (e.g., collective operations) • For a hybrid approach, add OpenMP directives Computer Science, University of Warwick 23 Coursework 2 – Detailed Requirements for Programming • Benchmarking the execution time (e.g., using MPI_Wtime function) of your parallel code as you increase the number of processes and/or change the problem size if it helps you observe the trend If you develop a hybrid MPI-OpenMP implementation, benchmark the runtime of the hybrid parallel version Compare the performance between the hybrid approach and the pure MPI approach Ensure the simulation results are the same after the parallelization Contain adequate comments as good programming practice. • • • • • • • Computer Science, University of Warwick 24 Assignment 2 – Requirements for Report • • Profiling the functions to determine which functions are more time consuming; discuss the profiling results Discuss your MPI implementation, for example, your decomposition strategy and load balancing strategy; your strategy of exchanging the boundary data between neighboring partitions; MPI functions such as collective operations you used; Benchmark the change in execution times of your parallel program as you increase the number of processes and/or problem size; present the results in graph or table Analyze and discuss the performance of your parallel program • • • • • Computer Science, University of Warwick 25 Assignment 2 – Requirements for Report • If you develop the hybrid MPI-OpenMP implementation, Discuss OpenMP implementation Benchmark the runtime of the hybrid parallel code and present the results Discuss the performance comparison between the hybrid approach and the pure MPI approach Writing skills: pay attention to spelling, punctuation and grammar Up to 6 A4 pages (not a strict up limit) • • • • • Computer Science, University of Warwick 26 Assignment 2 – Submission - Submit a compressed folder (.zip) to Tabula; the folder should contain your parallel code (the directory structure used in the project should not change) your report (pdf format and place your report in the top level directory in the folder) Deadline: 12 noon, March 18th, 2021 - - - Computer Science, University of Warwick 27