COMP528-JAN21 University of Liverpool
COMP528-JAN21 – Lab 4
Coordinator: Fabio Papacchini
Message Passing Interface (MPI) #2
The contents of the labs may be useful in assignments and written assessments.
Login to Barkla (as per LAB01 and LAB02) and obtain today’s lab file/s:
cd
tar -xzf /users/papacchf/COMP528-JAN21/labs/lab04.tgz
cd intro-mpi-2
If you now list the directory (using “ls”) you should see:
mpi_quad-GATHER.c prime-serial.c chkPrime.c func2.c mpi_quad.c prime-bcast-gather-skeleton.c
You are encouraged to follow all the steps in this document. You can email me your solutions
(please write “COMP528 Lab4” in the subject), and I will provide you with feedback on them.
1 MPI point-to-point communications & MPI timers
All code should be compiled in batch, with zero optimisation, and run 3 times.
1. Examine the code “mpi_quad.c”, a parallel quadrature code loosely based on the MPI lectures
to date, and identify
(a) How we have timed segments without using an MPI timer
(b) How we have timed the same segments using an MPI timer
(c) Look closely at the function used for step 1b. How does it differ from most MPI function calls?
(NB: you can use the manual pages for further information.)
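For reference, timing a segment both ways typically looks like the minimal sketch below; the variable names, and the choice of gettimeofday as the non-MPI timer, are assumptions, not necessarily what mpi_quad.c uses:

#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

int main(int argc, char **argv) {
  struct timeval t0, t1;
  MPI_Init(&argc, &argv);

  gettimeofday(&t0, NULL);        /* non-MPI timer: wall clock from sys/time.h */
  double w0 = MPI_Wtime();        /* MPI timer: elapsed seconds as a double    */

  /* ... segment to be timed ... */

  double w1 = MPI_Wtime();
  gettimeofday(&t1, NULL);

  double gtod = (t1.tv_sec - t0.tv_sec) + 1.0e-6 * (t1.tv_usec - t0.tv_usec);
  printf("gettimeofday: %f s   MPI_Wtime: %f s\n", gtod, w1 - w0);

  MPI_Finalize();
  return 0;
}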
2. The function to integrate is in the file func2.c (so it needs to be included on the compile line along
with mpi_quad.c).
3. In batch, load the modules, compile for MPI without optimisation, and run 3 times using a single
MPI process. Take the data from the quickest of these 3 runs to estimate the proportion of time
spent in the serial portion of the code, and use this to estimate the maximum speed-up of
the code, and also how quickly you expect the code to run on 20 and on 40 cores.
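The estimates in step 3 come from Amdahl's law: if a fraction f of the single-process runtime is serial, then on N cores

\[
S(N) = \frac{1}{f + \frac{1-f}{N}},
\qquad
S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{f}.
\]

For example, with a purely illustrative f = 0.05, S(20) ≈ 10.3 and S(40) ≈ 13.6, against a ceiling of S_max = 20; substitute the serial fraction you measure from your quickest run.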
4. Remove any debugging print statements you no longer need and repeat step 3. Explain your
observation.
5. Copy/amend your batch script so that – within the same script – it runs on 1, 2, 3 up to (& including)
40 cores on a single node. Remember that the SLURM variable “$SLURM_NTASKS” gives you the
number of cores for the job, taken from the “-n N” flag to “sbatch” (where N would be 40 if you want
to run on up to and including 40 cores). You can either run your script many times with a different
value of “N” each time, or you can amend your script to read “N” as the maximum number of MPI
processes and then add an appropriate looping mechanism to your script, as sketched below.
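One possible looping mechanism is the following sketch; the module names, compiler wrapper and launcher are assumptions based on the earlier labs, so substitute whatever you used there:

#!/bin/bash -l
# submit with:  sbatch -n 40 run_quad.sh   (script name is illustrative)

module load compilers/intel mpi/intel-mpi   # assumed module names; use those from LAB01/LAB02

mpiicc -O0 mpi_quad.c func2.c -o mpi_quad   # MPI compile, zero optimisation

# loop from 1 MPI process up to the "-n N" given to sbatch
for ((p = 1; p <= SLURM_NTASKS; p++)); do
  mpirun -np $p ./mpi_quad
done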
6. Explain, based on your results, whether the program/solution is numerically stable.
7. Do you get the expected speed-up on 20 and 40 cores? Explain.
8. Tabulate your results from step 5 and plot a speed-up graph.
2 MPI collective communications: MPI_Gather
All code should be compiled in batch, with zero optimisation, and run 3 times.
9. Examine the code “mpi_quad-GATHER.c”, an alternative implementation that performs the same
calculations as “mpi_quad.c”.
10. Identify the MPI “collective communication” – this is a single MPI function that each & every MPI
process calls (and it replaces multiple “point-to-point communications”).
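For reference, gathering one double from every process onto rank 0 looks like the minimal sketch below; the variable names are assumptions, not those used in mpi_quad-GATHER.c:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  double partial = (double) rank;   /* stand-in for this rank's local quadrature sum */
  double sums[64];                  /* receive buffer (assumes <= 64 processes);
                                       only significant on rank 0 */

  /* every process calls the same collective: each sends one double,
     rank 0 receives one double from every process */
  MPI_Gather(&partial, 1, MPI_DOUBLE, sums, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    double total = 0.0;
    for (int i = 0; i < size; i++) total += sums[i];
    printf("total = %f\n", total);
  }
  MPI_Finalize();
  return 0;
}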
11. Copy then amend your batch script from step 5 above, to compile then run this version. Add your
best (of 3) timing results to the table (step 8). What do you observe? How can you explain
your observation?
3 MPI collective communication for “reduction”
All code should be compiled in batch, with zero optimisation, and run 3 times.
12. Recall that the pattern used within the problem – where we take a value from each process and sum
to find the global (across all processes) value – is known as a “reduction”. Can you use the MPI
Forum documentation to determine a more appropriate MPI function than MPI_Gather? i.e. a
function that not only obtains the values but also sums them in the same operation, with the result
being only on the process of rank 0.
13. Copy “mpi_quad-GATHER.c” to a new file (so you can return and re-run again later if required), e.g.
cp mpi_quad-GATHER.c mpi-quad-MeaningfulName.c
14. In your new file
(a) remove the call to MPI_Gather
(b) remove the loop following the call to MPI_Gather (i.e. the loop that forms a sum)
(c) use the information you gained from step 12 and use another MPI collective to replace the logic
you have removed (i.e. to find the total sum on the process of rank 0)
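If the collective you found in step 12 is MPI_Reduce, the replacement logic might look like this minimal sketch (variable names are assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double partial = 1.0;   /* stand-in for this rank's local quadrature sum */
  double total   = 0.0;   /* only meaningful on rank 0 after the reduce    */

  /* one call replaces MPI_Gather plus the summing loop: every rank's
     'partial' is summed and the result is delivered to rank 0 only */
  MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("total = %f\n", total);
  MPI_Finalize();
  return 0;
}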
15. Copy then amend your batch script from above, to compile & run this version. Add your best (of
3) timing results to your table. What do you observe regarding the use of this MPI collective
communication function?
4 “Finding Primes” Example illustrating some MPI Collective
Communications
16. The code in “prime-serial.c” should be fairly clear. It reads in a number, then reports how many
prime numbers there are from 1 to that number, and lists them. Run this and test the output.
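For orientation, the serial logic plausibly amounts to trial division, roughly as in this sketch (prime-serial.c and chkPrime.c may differ in detail):

#include <stdio.h>

/* trial-division primality test; returns 1 if n is prime, 0 otherwise */
int is_prime(int n) {
  if (n < 2) return 0;
  for (int d = 2; d * d <= n; d++)
    if (n % d == 0) return 0;
  return 1;
}

int main(void) {
  int n, count = 0;
  if (scanf("%d", &n) != 1) return 1;   /* read the upper limit */
  for (int i = 1; i <= n; i++)
    if (is_prime(i)) { count++; printf("%d\n", i); }
  printf("%d primes between 1 and %d\n", count, n);
  return 0;
}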
17. The skeleton code in “prime-bcast-gather-skeleton.c” is for you to complete such that it uses MPI to
share out the work of finding the prime numbers. You will need to
(a) Work out why we have several “MPI_Bcast” calls and what each is doing (see the sketch below)
(b) Amend the “i” for loop from i=1 to i
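As a starting point for (a): broadcasting the user's input from rank 0 so that every process knows it typically looks like this minimal sketch (the skeleton's variable names may differ):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, n = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    /* only rank 0 reads the upper limit from the user */
    if (scanf("%d", &n) != 1) n = 0;
  }

  /* every process calls MPI_Bcast: rank 0 sends, all others receive,
     so afterwards every rank holds the same value of n */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("rank %d sees n = %d\n", rank, n);
  MPI_Finalize();
  return 0;
}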