
COMP528-JAN21 University of Liverpool

COMP528-JAN21 – Lab 2

Coordinator: Fabio Papacchini

Introduction to Parallel Execution

Login to Barkla (as per LAB01) and obtain today’s lab file/s:
cd

tar -xzf ~papacchf/COMP528-JAN21/labs/lab02.tgz
cd intro-par

Alternatively, you can perform the following commands.
cd

cp /users/papacchf/COMP528-JAN21/labs/lab02.tgz . (note the space before the last dot)
tar -xzf lab02.tgz

cd intro-par

What this sequence of commands does is:

1. go to your home folder (the cd command)

2. copy the tgz file into your home folder (second command)

3. extract the content of the tgz file into your current folder

4. enter the newly created folder intro-par

If you now list the directory (using “ls”) you should see:
func1.c func2.c mpi_quad.c openmp_quad.c run-serial.sh

You are encouraged to follow all the steps in this document. You can email me your solutions
(please write “COMP528 Lab2” in the subject), and I will provide you with feedback on them.

1. Check that you understand, from previous labs, how to edit (e.g. using “gedit”), how to compile
(e.g. “icc” is the Intel compiler, but you also need the “module” command to load the correct
modules), and how to run a job in batch. On Barkla the batch system is SLURM and the main
command to submit is “sbatch”. If in any doubt, ask the lab tutor/demonstrators.

2. Key SLURM commands are:

sbatch [parameters] scriptName.sh to start a batch job via the named script
squeue -u $USER to list user’s jobs
scancel NNN to cancel batch job number NNN
sinfo to see an overview of batch queues (aka “partitions”)
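
For orientation, a SLURM batch script is just a shell script whose “#SBATCH” comment lines carry the job options. The provided run-*.sh scripts already do this for you; the sketch below is purely illustrative, and its options (job name, partition, time limit) are assumptions rather than a copy of the lab scripts:

#!/bin/bash
#SBATCH -J quad-demo   # job name (illustrative)
#SBATCH -p course      # partition (queue) to submit to
#SBATCH -t 02:30       # wallclock limit of 2 minutes 30 seconds

# everything below runs on the allocated node
echo "running on $(hostname)"
./quad                 # hypothetical executable name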

3. In this lab, you are given 3 codes that perform the same integration – the quadrature example from the
lectures. One is serial, one is parallel via OpenMP and one is parallel via MPI. You will run these on a
varying number of cores and explore how efficient they are. You should also plot the times, speed-ups
and efficiencies, as covered in the HAL05 async lecture (Performance Challenges). A sketch of the
underlying quadrature idea is given below.
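
If the quadrature example itself is hazy, the following minimal serial sketch may help. It applies the midpoint rule to a made-up integrand; the actual lab codes (and func1.c/func2.c) will differ in their details. Compile with something like “icc sketch.c -lm”:

#include <stdio.h>
#include <math.h>

/* hypothetical integrand; the lab's func1.c/func2.c define the real ones */
double f(double x) {
    return sin(x) * sin(x);
}

int main(void) {
    const double a = 0.0, b = 1.0;   /* integration interval (assumed) */
    const long   n = 100000000;      /* number of quadrature strips */
    const double h = (b - a) / n;
    double sum = 0.0;

    /* midpoint rule: sum f at the centre of each strip */
    for (long i = 0; i < n; i++) {
        sum += f(a + (i + 0.5) * h);
    }
    printf("integral approx = %f\n", sum * h);
    return 0;
}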


4. We will cover details of the OpenMP and MPI syntax in coming lectures. There should be sufficient
comments within the codes for you to follow them. This is the (minimum!) level of commenting
expected when you do your assignment later in the semester.

5. You are also given three scripts (one each for serial, OpenMP and MPI) that work with the SLURM
batch system. Each script runs the given version of the code 3 times; you should use the minimum
time recorded. See below for how to use each script. Each outputs some useful information, and you
can also check the example scripts in /opt/apps/Slurm_Examples if you want further information;
there are a number of SLURM batch scripts in that directory, each performing a specific function.

6. Check today’s lab scripts to see

(a) how the serial code is compiled, and how it is run

(b) how the OpenMP code is compiled, and how it knows how many threads to use when it is run

(c) how the MPI code is compiled, and how it knows how many processes to use when it is run

Write these commands down as the first row of a table in your spreadsheet (this will be a useful
reference for later labs); an illustrative guess at what they may look like is given below.
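
As a rough guide only (the authoritative lines are in the scripts themselves, and the exact flags and file names may differ), the compile commands will likely resemble:

icc quad.c func1.c func2.c -o quad                          # serial; source file name assumed
icc -qopenmp openmp_quad.c func1.c func2.c -o openmp_quad   # -qopenmp enables OpenMP in icc
mpiicc mpi_quad.c func1.c func2.c -o mpi_quad               # mpiicc is the Intel MPI wrapper around icc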

7. What level of optimisation is used in the compilations? Why do you think it is set to this level?

8. Run the serial code and record the time taken in the spreadsheet.

9. The OpenMP parallel code is openmp_quad.c. Compare it to the serial code and identify the areas
where parallelism occurs, and any areas where you think there is none; a generic sketch of the key
OpenMP construct is given below.
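
The heart of the OpenMP version will be a parallelised loop; a very common pattern is a “parallel for” with a reduction, sketched below with a placeholder integrand. The actual construct, variables and integrand in openmp_quad.c may well differ:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const long n = 100000000;
    const double h = 1.0 / n;
    double sum = 0.0;

    /* iterations are shared among the threads; reduction(+:sum) gives
       each thread a private partial sum and combines them at the end */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;
        sum += x * x;   /* placeholder integrand */
    }

    printf("max threads %d, integral approx %f\n",
           omp_get_max_threads(), sum * h);
    return 0;
}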

10. To run an OpenMP executable on 4 cores we can use
sbatch -c 4 run-openmp.sh

Try this and check the outputs. (“-c” is essentially the number of cores per task; by default there
is only one task, so it can be thought of as the number of cores requested for the batch job. How
that value typically reaches the OpenMP runtime is sketched below.)
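
How does the “-c” value reach the OpenMP runtime? Check run-openmp.sh for what is actually done; a typical pattern (an assumption here, not a quote from the script) is to copy SLURM’s per-task CPU count into OMP_NUM_THREADS before launching:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}   # fall back to 1 thread if unset
./openmp_quad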

11. Run the OpenMP parallel version on 1, 2, 4, 8, 16, 32 and 40 cores and record the best time for each
in your spreadsheet (bearing in mind that each Barkla node comprises two 20-core Skylake
processors).

12. The MPI parallel code is mpi_quad.c.
Compare it to the serial code and identify the areas where parallelism occurs, and any areas where
you think there is none. Also compare it to the OpenMP code and decide which you find clearer to
follow; a generic sketch of the usual MPI pattern is given below.
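
For orientation, an MPI quadrature typically has each process sum a strided (or blocked) subset of the strips and then combines the partial sums with a reduction. The sketch below is generic, with a placeholder integrand; the details in mpi_quad.c may differ:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    const long n = 100000000;
    const double h = 1.0 / n;
    double local = 0.0, total = 0.0;

    /* each rank handles strips rank, rank+size, rank+2*size, ... */
    for (long i = rank; i < n; i += size) {
        double x = (i + 0.5) * h;
        local += x * x;            /* placeholder integrand */
    }

    /* combine the partial sums onto rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("integral approx = %f\n", total * h);

    MPI_Finalize();
    return 0;
}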

13. To run an MPI executable on 4 cores we could use
sbatch -n 4 run-mpi.sh

NOTE that we now use “-n” rather than “-c”. (“-n” is the advised flag for specifying how many cores
an MPI batch job requires.) By default we use only the single node of the “course” partition of the
SLURM batch system. Try this and check the results; how the requested core count typically reaches
the MPI launcher is sketched below.
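
Inside run-mpi.sh, the launch line typically passes SLURM’s task count on to the MPI launcher. The exact line in the provided script may differ, but a common form is:

mpirun -np $SLURM_NTASKS ./mpi_quad   # SLURM_NTASKS holds the value given to “-n”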

14. Run the MPI parallel version on a varying number of cores and record the best times for each in your
spreadsheet.

15. How do the times for OpenMP and MPI compare?

16. Add new columns to your spreadsheet, one for OpenMP and one for MPI. In these columns calculate
the “parallel speed-up”, defined by

S_p = t_1 / t_p

where t_p is the time spent on p cores and t_1 is the serial time (i.e., on 1 core).
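
The next item also asks about efficiency; recall from HAL05 that parallel efficiency is the speed-up per core, E_p = S_p / p. As a purely illustrative example with made-up timings: if t_1 = 8.0s and t_4 = 2.5s, then S_4 = 8.0/2.5 = 3.2 and E_4 = 3.2/4 = 0.8, i.e. 80% efficiency on 4 cores.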

17. Using the “chart” function of Excel, or any equivalent, plot time vs. #cores, speed-up vs. #cores
and/or efficiency vs. #cores, and state at the bottom of your spreadsheet whether you think there is
any strong scaling (i.e., whether the time for this fixed problem size keeps decreasing as cores are
added).


18. ADVANCED: It is possible for the MPI example to use more than 1 node. To do this you must both
use the “-N” flag to sbatch to say how many nodes, and override the use of “-p course” (see the
run-mpi.sh script), since that partition only has one node; instead you can use the “nodes” partition.
For example
sbatch -n 20 -N 2 -p nodes run-mpi.sh

will provide 20 cores spread over 2 nodes (of the “nodes” partition) for the batch job. If the system is
busy you may have to wait several minutes (or hours) for your job to start; this is why COMP528 has
the “course” partition, available to those on COMP528 (but limited to a single node). It is therefore
good practice to put an upper bound on the wallclock time you require the resources for, using
“-t mm:ss” (where mm:ss is minutes:seconds, e.g., 02:30 is 2 and a half minutes).

19. If you have time in the labs, try running on more than one node and explain how this affects the
timings observed.

20. We shall cover in lectures why OpenMP is limited to one node.

21. In future labs we shall also explore other batch queues (aka partitions) available and how to request
exclusive access to a node.
