#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --partition=broadwell
You must document how you obtained your results, and they must be reproducible. For example, include a shell script in your submission that reproduces your results.
You must use the GCC 10.2.0 compiler to generate your code:
$ module load gcc/10.2.0-fasrc01
You must use the provided get_wtime() timer to collect your time measurements; see the file main.cpp.
You cannot use more software threads than there are physical cores. Example: if the target architecture has 16 cores, you can use at most 16 threads. Document the number of threads you used for the performance result that you report.
The reference frequency for AVX2 instructions used by the Intel E5-2683v4 processor is 2.5 GHz. Make sure you use this frequency when you compute the nominal peak performance of the architecture. The roofline plot and your reported performance should be relative to this frequency.
You cannot use ISPC for this problem.
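The nominal peak mentioned above is commonly estimated as cores × frequency × FMA units × 2 flops per FMA × SIMD lanes. The sketch below only illustrates that arithmetic. The function name is hypothetical, and the core count (32, suggested by the --cpus-per-task=32 setting), the two FMA units per core, and the lane count (4 for double precision, 8 for single precision in a 256-bit AVX2 register) are assumptions you should verify for your node and kernel. Only the 2.5 GHz reference frequency comes from the text.

```cpp
#include <cassert>

// Nominal peak in Gflop/s:
//   cores * GHz * FMA units per core * 2 flops per FMA * SIMD lanes.
// Hypothetical helper; every argument except the 2.5 GHz AVX2 reference
// frequency is an assumption to be checked against the actual hardware.
inline double nominal_peak_gflops(int cores, double ghz,
                                  int fma_units, int simd_lanes)
{
    assert(cores > 0 && fma_units > 0 && simd_lanes > 0 && ghz > 0.0);
    return cores * ghz * fma_units * 2.0 * simd_lanes;
}
```

Under these assumptions, nominal_peak_gflops(32, 2.5, 2, 4) would be the double-precision value to compare against, and nominal_peak_gflops(32, 2.5, 2, 8) the single-precision one.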
Write your compute kernel in the file kernel.cpp. In the main.cpp file you must complete the macros KERNEL_FLOPS and ARCHITECTURE_PEAK, which are the total number of floating-point operations in your kernel implementation and the peak floating-point performance of your target architecture in Gflop/s, respectively. Your derivation of KERNEL_FLOPS must be well documented in your report and in the source code. Writing down just a number without justification is not sufficient; see the file main.cpp for more details. Take care to count the total number of flops in your kernel implementation correctly. We will provide an anonymous ranking of the submitted results. Part of the grade for this problem will depend on the achieved fraction of the peak. Run
$ grep -n TODO *
inside the code/p1 directory for a list of the open TODO hints.
Hint: Make sure that your code uses SIMD (AVX2) instructions.
Hint: Use all cores on the compute node. If the thread imbalance output of the main executable is larger than about 6%, your measurement is inaccurate due to oversubscription of threads to physical cores.
Hint: Compile your code with optimizations such as -O3, and -ffast-math if you use math functions, and possibly other flags. See the Makefile.
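The achieved fraction of the peak referred to in Problem 1 follows from the flop count, the measured time, and the nominal peak. The sketch below shows one way to compute it. All names are hypothetical: wtime_sketch() is only a stand-in built on std::chrono for the provided get_wtime(), whose actual definition is in main.cpp, and the flop count and peak are passed in as plain numbers rather than through the KERNEL_FLOPS and ARCHITECTURE_PEAK macros.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the provided get_wtime(): wall-clock seconds.
inline double wtime_sketch()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(
               clock::now().time_since_epoch()).count();
}

// Achieved performance in Gflop/s from a total flop count and elapsed seconds.
inline double achieved_gflops(double kernel_flops, double seconds)
{
    assert(seconds > 0.0);
    return kernel_flops / (seconds * 1.0e9);
}

// Fraction of the nominal peak (both arguments in Gflop/s).
inline double fraction_of_peak(double gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return gflops / peak_gflops;
}
```

A measurement would take the difference of two timer readings around the kernel call and pass it as seconds. The reported fraction is only meaningful if the flop count matches what the kernel actually executes.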
Problem 2: N-Body Force Computation with ISPC (40 points)
In molecular dynamics (MD) simulations the interaction force between atoms is often modeled using the Lennard-Jones (LJ) potential. Computing interaction forces is necessary to advance the system in time, since the underlying equations of motion are governed by Newton’s law. Computing the gradient of the LJ potential yields a force vector field given by the relation
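$$\mathbf{F}_{LJ}(\mathbf{r}) = -\frac{12\,\varepsilon}{r^{2}} \left[ \left(\frac{r_m}{r}\right)^{12} - \left(\frac{r_m}{r}\right)^{6} \right] \mathbf{r}, \tag{1}$$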
where $\mathbf{r} = \mathbf{x}_j - \mathbf{x}_i$ is the distance vector between atoms $i$ and $j$. The length of the distance vector is denoted by $r = |\mathbf{r}|$ and corresponds to the Euclidean distance. The constants $\varepsilon$ and $r_m$ describe the depth of the potential well and the location of the potential minimum, respectively. Computing an update of all particles in the system is of $O(N^2)$ complexity.¹ Because updating particles is expensive, we aim to optimize the force computation using the SPMD programming model provided by the Intel ISPC compiler. We are only concerned with single-threaded code in this exercise.
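As a consistency check, the force relation can be derived from the LJ potential. The sketch below assumes the common form of the potential with well depth $\varepsilon$ and minimum at $r_m$, and assumes that equation 1 denotes the force acting on atom $i$ with $\mathbf{r} = \mathbf{x}_j - \mathbf{x}_i$. Neither assumption is stated explicitly in the text.

```latex
\begin{align}
V(r) &= \varepsilon \left[ \left(\frac{r_m}{r}\right)^{12}
        - 2 \left(\frac{r_m}{r}\right)^{6} \right], \\
\frac{dV}{dr} &= -\frac{12\,\varepsilon}{r}
        \left[ \left(\frac{r_m}{r}\right)^{12}
        - \left(\frac{r_m}{r}\right)^{6} \right], \\
\mathbf{F}_i &= -\nabla_{\mathbf{x}_i} V
        = \frac{dV}{dr}\,\frac{\mathbf{r}}{r}
        = -\frac{12\,\varepsilon}{r^{2}}
        \left[ \left(\frac{r_m}{r}\right)^{12}
        - \left(\frac{r_m}{r}\right)^{6} \right] \mathbf{r},
        \qquad \text{using } \nabla_{\mathbf{x}_i} r = -\frac{\mathbf{r}}{r}.
\end{align}
```

Only even powers of $r$ appear, since $(r_m/r)^{6} = (r_m^{2}/r^{2})^{3}$, so the force can be evaluated from $r^{2}$ alone without a square root.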
Work in directory: ./code/p2
a) 24 points
Write a scalar baseline kernel that implements equation 1 for the particle-particle interactions. The integrate() function in the application code main.cpp will call the force kernel for each particle and expects it to return the components of the force vector in the arguments Fx, Fy and Fz. The constants $\varepsilon$ and $r_m$ are provided in the auxiliary/constants.h header. You can access them by typing EPS and RM in your code, respectively.
Write your code in double precision and focus on an optimal implementation by avoiding expensive mathematical functions; that is, you are not allowed to use functions in the math.h or cmath headers. Try to avoid if-branches in loop bodies if possible. You can compile this scalar version of the code with make main_scalar and run it with ./main_scalar. You can compile your code with debug flags enabled by using make debug=true.
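The constraints above (no math-library calls, no branches in the loop body) can be illustrated with a minimal sketch. It assumes that equation 1 reads F = -(12 ε / r²) [(r_m/r)^12 - (r_m/r)^6] r with r = x_j - x_i, and that the particle coordinates arrive as separate x, y, z arrays. The function names, the signature, and the constants EPS_SKETCH and RM_SKETCH (stand-ins for EPS and RM) are hypothetical; the real interface is defined by main.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for EPS and RM from auxiliary/constants.h.
constexpr double EPS_SKETCH = 1.0;
constexpr double RM_SKETCH = 1.0;

// Adds the force on atom i caused by one atom at offset (dx,dy,dz) = x_j - x_i.
// Works on r^2 only, so no square root or pow() call is needed.
static inline void lj_pair_sketch(double dx, double dy, double dz,
                                  double &Fx, double &Fy, double &Fz)
{
    const double inv_r2 = 1.0 / (dx * dx + dy * dy + dz * dz);
    const double s2 = RM_SKETCH * RM_SKETCH * inv_r2; // (rm/r)^2
    const double s6 = s2 * s2 * s2;                   // (rm/r)^6
    const double c = -12.0 * EPS_SKETCH * inv_r2 * (s6 * s6 - s6);
    Fx += c * dx;
    Fy += c * dy;
    Fz += c * dz;
}

// Force on atom i from all other atoms (assumed signature).
// The j == i case is skipped by splitting the loop instead of branching.
void lj_force_sketch(const double *x, const double *y, const double *z,
                     std::size_t i, std::size_t n,
                     double &Fx, double &Fy, double &Fz)
{
    assert(i < n);
    Fx = Fy = Fz = 0.0;
    for (std::size_t j = 0; j < i; ++j)
        lj_pair_sketch(x[j] - x[i], y[j] - y[i], z[j] - z[i], Fx, Fy, Fz);
    for (std::size_t j = i + 1; j < n; ++j)
        lj_pair_sketch(x[j] - x[i], y[j] - y[i], z[j] - z[i], Fx, Fy, Fz);
}
```

Splitting the j loop at i is one of several ways to avoid the self-interaction test; whether it is the fastest variant on a given compiler would need to be measured.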
Write your code in the file LJ_force.cpp.
b) 16 points
Exploit data-level parallelism (DLP) by vectorizing the force kernel with ISPC. Your optimized kernel should run on platforms that support AVX2 instructions, so you must use an Intel Broadwell node for AVX2:
$ salloc -N1 -c32 -p broadwell -t
Complete the recipe in the provided Makefile. Write this recipe for an x86-64 target with AVX2 extensions.² You can compile the code with make and run it with ./main. Write your code in double precision.
Write your code in the file LJ_force.ispc. Report the following for the AVX2 force kernel:
1. The speedup you expect
2. The speedup you achieve
3. If your result deviates from your expectation, explain the reason for the deviation.
Hint: The ISPC compiler is installed on the academic cluster in $SHARED_DATA/local/bin.
Hint: Arrays passed to the kernel are in structure of arrays (SoA) format.
Hint: ISPC provides a reduce_add() function that takes a varying value and returns the sum of that value across all of the active program instances.
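The two hints about SoA arrays and reduce_add() can be illustrated with a plain C++ emulation. This is not ISPC code: it only mimics how a gang of program instances keeps one private partial sum per lane (a varying value) and combines them once at the end. The names and the gang width of 4 (the number of doubles in a 256-bit AVX2 register) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr int GANG = 4; // assumption: 4 double-precision lanes per gang

// Horizontal sum over all lanes, mimicking what ISPC's reduce_add()
// returns when every program instance is active.
inline double reduce_add_sketch(const double (&lanes)[GANG])
{
    double sum = 0.0;
    for (int l = 0; l < GANG; ++l)
        sum += lanes[l];
    return sum;
}

// Sums a[0..n) gang-wise: lane l accumulates elements j + l of each chunk.
inline double gang_sum_sketch(const double *a, std::size_t n)
{
    assert(a != nullptr || n == 0);
    double partial[GANG] = {0.0, 0.0, 0.0, 0.0};
    std::size_t j = 0;
    for (; j + GANG <= n; j += GANG)
        for (int l = 0; l < GANG; ++l)
            partial[l] += a[j + l]; // unit-stride loads, which SoA makes possible
    for (int l = 0; j < n; ++j, ++l) // remainder: only the first lanes do work
        partial[l] += a[j];
    return reduce_add_sketch(partial);
}
```

In the force kernel the accumulated quantity would be a per-pair force component rather than a raw array element. Because the lanes sum in a different order than a scalar loop, the result can differ from the scalar baseline in the last bits.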
² As mentioned above, if you want to run on an AMD node you must target the AVX instruction set instead. Your results must be reported for AVX2.