
COMP528 Assignment Resits (2018/19)
Dr. Michael K. Bane

Overview
• 4 assignments, each worth 10% of total
  • your letter will indicate which (if any) assignments you are expected to resit
• Resits
  • questions comparable to the originals, testing the same learning outcomes
    • you will get lots of hints and help by going back to the lab work and to previous assignments
  • all codes to be written in C, compiled and benchmarked on Chadwick
  • standards of academic integrity expected (as per the original) – reports may go through “TurnItIn” for automatic checking
  • will be marked on the code & report, for correctness & understanding of the topics
• Submission
  • each assignment as a single zip file (comprising report & code & any scripts, plus any supporting evidence you wish)
  • submission to SAM: 91 for resit#1, 92 for resit#2, 93 for resit#3, 94 for resit#4
  • DEADLINE for all submissions: 10am, Friday 9th August 2019

Assignment 1: MPI

Assignment #1 Resit
• Testing knowledge of
• parallel programming & MPI & timing via a batch system
• TASK: least squares regression – parallelisation using MPI
• https://www.mathsisfun.com/data/least-squares-regression.html
• for a set of discrete points (x[i], y[i]), find the best linear fit y = mx + b using the given equations (next slide; also quoted below) to determine m & b
• write two C codes to determine m and b for a given input set of x,y
i. A serial test code
ii. One using MPI parallelism
• use the Intel compiler and compile with no optimisation ‘-O0’
• time the section of the code (running in batch) that finds m & b, do this on various numbers of MPI processes, and discuss your findings e.g. in terms of speed-up and parallel efficiency (and Amdahl’s Law)
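For reference, the closed-form least-squares estimates given on the linked page are (with N here meaning the number of points actually summed over):

  m = ( N Σxy - Σx Σy ) / ( N Σx² - (Σx)² )
  b = ( Σy - m Σx ) / N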

Assignment #1 Resit
• Remember:
• can parallelise where there is lots of independent work
• MPI is a single code with each process having its own “rank” (useful for splitting up the work)
• MPI provides “reduction” calls e.g. for doing a summation over processes and storing the result on the “root” process (or on all processes)
• MPI provides the MPI_Wtime timing function; the wall-clock time is the difference between two consecutive calls to MPI_Wtime
• N may not be equally divisible by the number of MPI processes (available via the MPI_Comm_size function); a minimal code sketch putting these hints together follows below
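A minimal sketch of how these hints fit together, assuming the four sums Σx, Σy, Σxy and Σx² are accumulated per rank and combined with MPI_Reduce; the function name, the block decomposition and the small test data in main are illustrative only, not a required structure:

  /* sketch: each rank forms partial sums over its own block of points and
     MPI_Reduce combines them on the root rank, which computes m and b */
  #include <mpi.h>
  #include <stdio.h>

  static void fit_line(int npts, const double *x, const double *y)
  {
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* block decomposition; the last rank absorbs the remainder because
         npts may not divide evenly by the number of processes */
      int chunk = npts / nprocs;
      int lo = rank * chunk;
      int hi = (rank == nprocs - 1) ? npts : lo + chunk;

      double t0 = MPI_Wtime();                   /* start of timed section */

      double local[4] = {0.0, 0.0, 0.0, 0.0};    /* sum x, sum y, sum x*y, sum x*x */
      for (int i = lo; i < hi; i++) {
          local[0] += x[i];
          local[1] += y[i];
          local[2] += x[i] * y[i];
          local[3] += x[i] * x[i];
      }

      double sum[4];                             /* combined sums on the root */
      MPI_Reduce(local, sum, 4, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0) {
          double m = (npts * sum[2] - sum[0] * sum[1])
                   / (npts * sum[3] - sum[0] * sum[0]);
          double b = (sum[1] - m * sum[0]) / npts;
          printf("y = %g x + %g   (%g s on %d processes)\n",
                 m, b, MPI_Wtime() - t0, nprocs);
      }
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* small test set with y = 2x + 1, so the fit should give m = 2, b = 1;
         replace with the assignment data for the timed runs */
      enum { NPTS = 8 };
      double x[NPTS], y[NPTS];
      for (int i = 0; i < NPTS; i++) { x[i] = i; y[i] = 2.0 * i + 1.0; }

      fit_line(NPTS, x, y);

      MPI_Finalize();
      return 0;
  }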

Assignment #1 Resit
• Data
• suggestion: use a small set of input data (x,y) to check you are getting the correct answer (serially and for any number of MPI processes); once all is good, use the assignment data below. Remember to use the batch system to undertake your timings for different numbers of MPI processes
• Assignment data:
  • N=100,000
  • x[i] = (float)i/1000.0 for i=1 to i=99,999 (note we start at i=1 and go to N-1)
  • y[i] = sin(x[i]/500.0) / cos(x[i]/499.0 + x[i]) (you will need to include math.h; see the data sketch below)
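A minimal sketch of setting up this data, assuming plain double arrays; the function name is illustrative. For the full problem this would replace the small test set in the MPI sketch above:

  #include <math.h>          /* needed for sin() and cos() */

  #define N 100000

  /* assignment data: points i = 1 .. N-1 (i.e. 99,999 points) */
  void make_data(double *x, double *y)
  {
      for (int i = 1; i < N; i++) {
          x[i] = (float) i / 1000.0;
          y[i] = sin(x[i] / 500.0) / cos(x[i] / 499.0 + x[i]);
      }
  }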

Assignment #1 Resit
• Code
• Submit both serial & MPI code
• Submit any scripts used
• Report: up to 3 pages
• Discussion of your approach & of your results
• Give the commands that you use to
  • compile
  • submit and run your parallel code
• The equation of the best fit straight line
• Marking
• Correctness of codes: 50%
• Explaining/understanding parallel principles & MPI: 25%
• Discussion of results: 25%

Assignment 2: OpenMP

Assignment #2 Resit
• Testing knowledge of
• parallel programming & OpenMP & timing via a batch system
• TASK: least squares regression – parallelisation using OpenMP
• (see Assignment#1 for detailed description)
• for a set of discrete points (x[i], y[i]), find the best linear fit y = mx + b using the given equations (see Assignment #1) to determine m & b
• use the same assignment data as described for Assignment#1 Resit
• write a C code to determine m and b for a given input set of x,y that uses
OpenMP work-sharing constructs to parallelise the work
• use the Intel compiler and compile with no optimisation ‘-O0’
• time the section of the code (running in batch) that finds m & b, do this on various numbers of OpenMP threads, and discuss your findings e.g. in terms of speed-up and parallel efficiency (and Amdahl’s Law)

Assignment #2 Resit
• Remember:
• can parallelise where there is lots of independent work
• OpenMP is a single code with fork-join parallel regions in which each thread has its own thread number; typically you parallelise at the ‘for’ loop level
• OpenMP provides a “reduction” clause e.g. for doing a summation over threads, with the combined result available on the “master” thread after the parallel region
• OpenMP provides the omp_get_wtime timing function; the wall-clock time is the difference between two consecutive calls
• OpenMP loop parallelisation can have different “schedules” which may be useful for irregular work distribution between threads
• You can use compiler flags to have the compiler ignore all OpenMP directives (giving a serial build of the same source); a minimal OpenMP sketch follows below
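A minimal OpenMP sketch, assuming the same sums and assignment data as in Assignment #1; the reduction clause combines the per-thread partial sums and omp_get_wtime brackets the timed section (variable names are illustrative):

  #include <omp.h>
  #include <math.h>
  #include <stdio.h>

  #define N 100000

  int main(void)
  {
      static double x[N], y[N];

      /* assignment data (points i = 1 .. N-1) */
      for (int i = 1; i < N; i++) {
          x[i] = (float) i / 1000.0;
          y[i] = sin(x[i] / 500.0) / cos(x[i] / 499.0 + x[i]);
      }

      int npts = N - 1;
      double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;

      double t0 = omp_get_wtime();               /* start of timed section */

      /* work-sharing 'for' with a reduction over the four sums;
         a schedule clause could be added here if the work were irregular */
      #pragma omp parallel for reduction(+:sx,sy,sxy,sxx)
      for (int i = 1; i < N; i++) {
          sx  += x[i];
          sy  += y[i];
          sxy += x[i] * y[i];
          sxx += x[i] * x[i];
      }

      double m = (npts * sxy - sx * sy) / (npts * sxx - sx * sx);
      double b = (sy - m * sx) / npts;

      printf("y = %g x + %g   (%g s on up to %d threads)\n",
             m, b, omp_get_wtime() - t0, omp_get_max_threads());
      return 0;
  }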

Assignment #2 Resit
• Code
• Submit OpenMP code
• Submit any scripts used
• Report: up to 3 pages
• Discussion of your approach & of your results
• Give the commands that you use to
  • compile
  • submit and run your parallel code
• The equation of the best fit straight line
• Marking
• Correctness of code: 50%
• Explaining/understanding parallel principles & OpenMP: 25%
• Discussion of results: 25%

Assignment 3: GPU Programming

Assignment #3 Resit
• Testing knowledge of
• parallel programming of GPUs
• TASK: discretization using GPU
• Function f(x) = exp(x/3.1) - x*x*x*x*18.0
• You need to discretize this between x=0.0 and x=60.0 and find the minimum using 33M points
• Write a C-based code with an accelerated kernel written in either CUDA or using OpenACC directives; the code should
  • time a serial run comprising setting the values and then finding the minimum (i.e. all on the CPU)
  • time an accelerated run with the values set on the GPU, passed back to the CPU, and the minimum found on the CPU (a minimal CUDA sketch follows below)
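A minimal CUDA sketch of the accelerated path, assuming “33M” means 33 million points and a block size of 256 threads; the kernel and variable names are illustrative, and the timing calls around the accelerated section are omitted for brevity:

  #include <cuda_runtime.h>
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NPOINTS 33000000        /* assumption: "33M" read as 33 million */
  #define XMIN 0.0
  #define XMAX 60.0

  /* evaluate f(x) = exp(x/3.1) - x*x*x*x*18.0 at each discretised point */
  __global__ void eval_f(double *f, int n, double dx)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          double x = XMIN + i * dx;
          f[i] = exp(x / 3.1) - x * x * x * x * 18.0;
      }
  }

  int main(void)
  {
      double dx = (XMAX - XMIN) / (double)(NPOINTS - 1);

      double *f_host = (double *) malloc(NPOINTS * sizeof(double));
      double *f_dev;
      cudaMalloc(&f_dev, NPOINTS * sizeof(double));

      /* accelerated section: set the values on the GPU ... */
      int threads = 256;
      int blocks = (NPOINTS + threads - 1) / threads;
      eval_f<<<blocks, threads>>>(f_dev, NPOINTS, dx);

      /* ... pass them back to the CPU ... */
      cudaMemcpy(f_host, f_dev, NPOINTS * sizeof(double), cudaMemcpyDeviceToHost);

      /* ... and find the minimum on the CPU */
      int imin = 0;
      for (int i = 1; i < NPOINTS; i++)
          if (f_host[i] < f_host[imin]) imin = i;

      printf("min f = %g at x = %g\n", f_host[imin], XMIN + imin * dx);

      cudaFree(f_dev);
      free(f_host);
      return 0;
  }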

Assignment #3 Resit
• Reminder for CUDA
• write C + CUDA kernel in file e.g. myCode.cu (note the .cu suffix)
• compile (on login node):
module load cuda-8.0
nvcc -Xcompiler -fopenmp myCode.cu
• debug running in batch
qrsh -l gputype=tesla,h_rt=00:10:00 -pe smp 1-16 -V -cwd ./a.out
• timing run in batch (hogging all GPU & CPU cores for yourself)
qrsh -l gputype=tesla,exclusive,h_rt=00:10:00 -pe smp 16 -V -cwd ./a.out
• For OpenACC
  • please see the lecture notes; a minimal directive sketch follows below
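For comparison, a minimal OpenACC-style sketch of the same task; the #pragma acc parallel loop directive with a copyout clause is standard OpenACC, but see the lecture notes for the exact compiler and flags on Chadwick (the point count again assumes 33 million):

  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NPOINTS 33000000        /* assumption: "33M" read as 33 million */

  int main(void)
  {
      double dx = 60.0 / (double)(NPOINTS - 1);
      double *f = (double *) malloc(NPOINTS * sizeof(double));

      /* values set on the accelerator; copyout brings f[] back to the host */
      #pragma acc parallel loop copyout(f[0:NPOINTS])
      for (int i = 0; i < NPOINTS; i++) {
          double x = i * dx;
          f[i] = exp(x / 3.1) - x * x * x * x * 18.0;
      }

      /* minimum found on the CPU */
      int imin = 0;
      for (int i = 1; i < NPOINTS; i++)
          if (f[i] < f[imin]) imin = i;

      printf("min f = %g at x = %g\n", f[imin], imin * dx);
      free(f);
      return 0;
  }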

Assignment #3 Resit
• Code
• Submit code and any scripts used
• Report: up to 3 pages
• Discussion of your approach & of your results
  • including the speed ratio of GPU to CPU
  • noting whether you include GPU memory & data-transfer costs (and what effect this would have)
• Give the commands that you use to
  • compile, submit and run your parallel code
• The value of the minimum of f(x[i]) and the value of x[i] at which it occurs
• Marking
• Correctness of code: 40%
• Explaining/understanding parallel principles & GPUs: 30%
• Discussion of results: 30%

Assignment 4: hybrid programming

Assignment #4 Resit
• Testing knowledge of
• parallel programming & hybrid MPI+OpenMP parallelism
• TASK: hybrid MPI+OpenMP parallelisation of galaxy formation
• using the C code “COMP528-assign4-resit.c” provided in Sub-Section “Resit
Assignments” at https://cgi.csc.liv.ac.uk/~mkbane/COMP528/
• add MPI and OpenMP to accelerate the simulation (including, if appropriate, the initialisation); as per the original assignment, use MPI to parallelise at a coarse-grained level (dividing the number of bodies (variable “BODIES”) between the number of processes), with each MPI process then using OpenMP to parallelise its own work
• use the Intel compiler and compile with optimisation flag ‘-O2’
• time the section of the code (running in batch) that simulates the movement of the galaxies, and do this on various numbers of MPI processes & OpenMP threads; a minimal hybrid sketch follows below
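A minimal sketch of the hybrid structure only, since COMP528-assign4-resit.c is not reproduced here: the array names, the placeholder interaction, the time step and the iteration count are illustrative stand-ins (only the variable BODIES is taken from the assignment text). MPI divides the bodies coarsely between processes, each process parallelises its own block with OpenMP, and the updated positions are exchanged with MPI_Allgatherv after each step:

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define BODIES 4096               /* illustrative; use the value from the provided code */

  /* one simulation step for this rank's block [lo, hi) of bodies */
  static void step(double *pos, double *vel, double dt, int lo, int hi)
  {
      double acc[BODIES];

      /* acceleration on each local body from all bodies (placeholder force law) */
      #pragma omp parallel for schedule(static)
      for (int i = lo; i < hi; i++) {
          double a = 0.0;
          for (int j = 0; j < BODIES; j++)
              if (j != i) a += pos[j] - pos[i];
          acc[i] = a;
      }

      /* update this rank's bodies only */
      #pragma omp parallel for schedule(static)
      for (int i = lo; i < hi; i++) {
          vel[i] += dt * acc[i];
          pos[i] += dt * vel[i];
      }
  }

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      static double pos[BODIES], vel[BODIES];  /* initial conditions come from the provided code */

      /* coarse-grained decomposition of the bodies between MPI processes */
      int chunk = BODIES / nprocs;
      int lo = rank * chunk;
      int hi = (rank == nprocs - 1) ? BODIES : lo + chunk;

      /* counts/displacements so every rank ends each step with all positions */
      int *counts = (int *) malloc(nprocs * sizeof(int));
      int *displs = (int *) malloc(nprocs * sizeof(int));
      for (int r = 0; r < nprocs; r++) {
          displs[r] = r * chunk;
          counts[r] = (r == nprocs - 1) ? BODIES - displs[r] : chunk;
      }

      double t0 = MPI_Wtime();                 /* start of timed simulation section */
      for (int it = 0; it < 100; it++) {       /* iteration count illustrative */
          step(pos, vel, 1.0e-3, lo, hi);
          /* exchange updated positions so the next step sees all bodies */
          MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                         pos, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
      }
      if (rank == 0)
          printf("simulation: %g s on %d processes x up to %d threads\n",
                 MPI_Wtime() - t0, nprocs, omp_get_max_threads());

      free(counts); free(displs);
      MPI_Finalize();
      return 0;
  }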

Assignment #4 Resit
• Code – submit MPI+OpenMP code & any scripts used
• Report: up to 3 pages
• Discussion of your approach & of your results
  • how you determined what to parallelise & why you chose the given parallelisation method
  • the results (accuracy, speed-up, parallel efficiency)
  • which combination of MPI/OpenMP you found to be the fastest
• Include a paragraph on what you would need to scale the number of BODIES by 100 orders of magnitude (and keep run time about the same)
  • e.g. is Barkla big enough? is the CPU the only option?
• State the commands that you use to
  • compile, submit, run & time your code to get the timing data presented
• Marking
• Code: 30%
• Explaining/understanding parallel principles used: 25%
• Discussion on scaling by 100 orders of magnitude: 20%
• Discussion of results: 25%

• Good luck
• Ask if you have any questions!
• Michael Bane, G14 Ashton m.k.bane@Liverpool.ac.uk
Skype: https://join.skype.com/invite/m49PHwnmVmo2