15-418/618, Spring 2019
Assignment 1
Exploring Multi-Core, Instruction-Level, and SIMD Parallelism
Event                  Registered students        Waitlist students
Assigned:              Mon., Jan. 14              Mon., Jan. 14
Due:                   Wed., Jan. 30, 11:59 pm    Wed., Jan. 23, 11:59 pm
Last day to handin:    Sat., Feb. 2               Wed., Jan. 23
Overview
In this assignment you will modify and experiment with code designed to exploit the three main forms of parallelism available on modern processors: the multiple cores that can execute programs independently, the multiple functional units that can operate in parallel, and the SIMD vector units that allow each processor to perform some of its arithmetic and memory operations on vectors of data. (You will see some of the effects of Intel hyperthreading, as well, where each core can execute the instruction streams of two programs simultaneously.)
You will also gain experience measuring and reasoning about the performance of parallel programs, a challenging, but important, skill you will use throughout this class. This assignment involves only a small amount of programming, but a lot of analysis!
This is an individual project. All handins are electronic. Your submission will consist of the code files you have modified, as well as a single document reporting your findings on the 5 problems described below. You may use any document preparation system you choose, but the final result must be stored as a single file in PDF format, named report.pdf. Make sure the report includes your name and Andrew ID. More details on how to submit this information are provided at the end of this document.
Before you begin, please take the time to review the course policy on academic integrity at:
http://www.cs.cmu.edu/~418/academicintegrity.html
Getting started
You will need to run code on the machines in the Gates cluster for this assignment. Host names for these machines are ghcX.ghc.andrew.cmu.edu, where X is between 26 and 46. Each of these machines has an eight-core, 3.2 GHz Intel Core i7 processor (although dynamic frequency scaling can take them to 3.8 GHz when the chip decides it is useful and possible to do so). Each core can execute AVX2 vector instructions, supporting simultaneous execution of the same operation on multiple data values (8 in the case of single-precision data). For the curious, a complete specification for this CPU can be found at ark.intel.com/products/92985/Intel-Xeon-Processor-E5-1660-v4-20M-Cache-3_20-GHz. You can log into these machines in the cluster, or you can reach them via ssh.
We will grade your analysis of code run on the Gates machines; however, for your own interest, you may also want to run these programs on other machines. To do this, you will first need to install the Intel SPMD Program Compiler (ISPC), available at ispc.github.com/. Feel free to include your findings from running code on other machines in your report as well; just be very clear in your report about which machine(s) you used.
ISPC is needed to compile many of the programs used in this assignment. ISPC is currently installed on the Gates machines in the directory /usr/local/depot/ispc-v1.9.1-linux/. You will need to add this directory to your system path.
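For example, assuming you use a bash-style shell, the following command adds ISPC to your path for the current session (add the line to your ~/.bashrc to make it permanent):

    linux> export PATH=$PATH:/usr/local/depot/ispc-v1.9.1-linux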
We will distribute the assignment starter code via a repository hosted on Github. Clone the assignment 1 starter code using:
git clone https://github.com/cmu15418/asst1-s19.git
1 Problem 1: Parallel Fractal Generation Using Pthreads (15 points)
Build and run the code in the prob1_mandelbrot_threads directory of the Assignment 1 code base. This program produces the image file mandelbrot-vV-serial.ppm, where V is the view index. This image is a visualization of a famous set of complex numbers called the Mandelbrot set. As you can see in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to a value in the complex plane, and the brightness of each pixel is proportional to the computational cost of determining whether the value is contained in the Mandelbrot set—white pixels required the maximum (256) number of iterations, dark ones only a few iterations, and colored pixels were somewhere in between. (See function mandel() defined in mandelbrot.cpp.) You can learn more about the definition of the Mandelbrot set at en.wikipedia.org/wiki/Mandelbrot_set.
Use the command-line option “--view V” for V between 0 and 6 to get the different images. You can click the links below to see the different images in a browser. Take the time to do this—the images are quite striking. (View 0 is not shown—it is all white.)
[Images of the six views, with magnifications: View 1 (1×), View 2 (66×), View 3 (50×), View 4 (50×), View 5 (50×), View 6 (500×).]
Your job is to parallelize the computation of the images using Pthreads. The command-line option “--threads T” specifies that the computation is to be partitioned over T threads. In function mandelbrotThread(), located in mandelbrot.cpp, the main application thread creates T−1 additional threads using pthread_create(). It waits for these threads to complete using pthread_join(). Currently, neither the launched threads nor the main thread do any computation, and so the program generates an error message. You should add code to the workerThreadStart() function to accomplish this task. You will not need to use any other Pthread API calls in this assignment.
What you need to do:
1. Modify the code in mandelbrot.cpp to parallelize the Mandelbrot generation using two cores. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors.
2. Extend your code to utilize T threads for T ∈ {2, 4, 8, 16}, partitioning the image generation work into the appropriate number of horizontal blocks. You will need to modify the code in function workerThreadStart() to partition the work over the threads.
Note that the processor only has 8 cores, but each core supports two hyper-threads. Also, the active images have 749 rows (with another row added to detect array overrun), and so you must handle the case where the number of rows is not evenly divisible by the number of threads. In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of cores used for views 0, 1, and 2. Is speedup linear in the number of cores used? In your write-up, hypothesize why this is (or is not) the case.
3. To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). (A sketch of one way to do this appears after this list.) How do your measurements explain the speedup graph you previously created?
4. Modify the mapping of work to threads to improve speedup to at least 8× on view 0 and almost 8× on views 1 and 2 (if you’re close to 8× that’s fine, don’t sweat it). You may not use any synchronization between threads. We expect you to come up with a single work decomposition policy that will work well for all thread counts; hard coding a solution specific to each configuration is not allowed! (Hint: There is a very simple static assignment that will achieve this goal, and no communication/synchronization among threads is necessary.) In your writeup, describe your approach and report the final 16-thread speedup obtained. Also comment on the difference in scaling behavior from 4 to 8 threads versus 8 to 16 threads.
As a bonus (to you, not extra credit), once you have your threaded code running properly, you can run our interactive visualization tool of the Mandelbrot set. You will find this quite addictive. The program is implemented as the file mviz.py in the main assignment directory.¹ Invoke it using, as command-line arguments, the command line you give to run the mandelbrot program. For example, you might give the command
linux> ../mviz.py ./mandelbrot -t 16
When the program starts, it will display the list of single-character keyboard commands you can use to zoom and pan around the set. You will notice the fractal self-similarity property, where common patterns keep occurring as you zoom deeper. You will also find that the speedup you get from threading can greatly improve the interactive response.

¹The visualization program requires the Image and ImageTk packages from the Python PIL library. Unfortunately, these have not yet been installed on the GHC machines. Hopefully, that will be fixed soon.
What you need to turn in:
1. Your report should contain the graphs, analyses, and answers specified above.
2. Your report should describe the decomposition strategy you used to maximize speedup.
3. The archive file you submit will contain your version of the file mandelbrot.cpp. This file should contain the best performing code you created. Any modifications you made should follow good coding conventions, in terms of indenting, variable names, and comments.
Problem 2: Vectorizing Code Using SIMD Intrinsics (20 points)
Take a look at the function clampedExpSerial() in prob2_vecintrin/functions.cpp of the Assignment 1 code base. The function raises values[i] to the integer power given by exponents[i] for all elements of the input array and clamps the resulting values at 4.18. The function computes x^p based on the technique known as exponentiation by squaring. Whereas the usual technique of multiplying together p copies of x requires p − 1 multiplications, iterative squaring requires at most 2 log₂ p multiplications. For p = 1000, exponentiation by squaring requires fewer than 20 multiplications rather than 999. In Problem 2, your job is to vectorize this piece of code so that it can be run on a machine with SIMD vector instructions.
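For reference, exponentiation by squaring in scalar form looks something like the following sketch (a generic illustration of the technique, not the exact clampedExpSerial() code):

    // Compute x^p with O(log p) multiplications by iterative squaring.
    float xpow(float x, int p) {
        float result = 1.0f;
        while (p > 0) {
            if (p & 1)          // low bit of p set: fold the current power of x in
                result *= x;
            x *= x;             // successive squares: x, x^2, x^4, x^8, ...
            p >>= 1;
        }
        return result;
    }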
We won’t ask you to craft an implementation using the SSE or AVX vector intrinsics that map to real vector instructions on modern CPUs. Instead, to make things a little easier, we’re asking you to implement your version using 15-418’s “fake vector intrinsics” defined in CMU418intrin.h. The CMU418intrin library provides you with a set of vector instructions that operate on vector values and/or vector masks. (These functions don’t translate to real vector instructions; instead we simulate these operations for you in our library, and provide feedback that makes for easier debugging.) As an example of using the 15-418 intrinsics, a vectorized version of the abs() function is given in functions.cpp. This example contains some basic vector loads and stores and manipulates mask registers. Note that the abs() example is only a simple example, and in fact the code does not correctly handle all inputs. (We will let you figure out why.) You may wish to read through the comments and function definitions in CMU418intrin.h to know what operations are available to you.
Here are a few hints to help you in your implementation:
1. Every vector instruction is subject to an optional mask parameter. The mask parameter defines which lanes have their output “masked” for this operation. A 0 in the mask indicates a lane is masked, and so its value will not be overwritten by the results of the vector operation. If no mask is specified in the operation, no lanes are masked. (This is equivalent to providing a mask of all ones.) Your solution will need to use multiple mask registers and various mask operations provided in the library.
2. You will find the _cmu418_cntbits() function to be helpful in this problem.
3. You must handle the case where the total number of loop iterations is not a multiple of the SIMD vector width. We suggest you test your code with ./vrun -s 3. You might find _cmu418_init_ones() helpful. (A sketch of masked tail handling appears after this list.)
4. Use command-line option -l to print a log of executed vector instructions at the end. Insert calls to function addUserLog() in your vector code to add customized debugging information to the log.
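To make the mask mechanics concrete, here is a sketch of a masked clamp over an array. It assumes the intrinsic names and argument orders follow the pattern of the abs() example in functions.cpp; check CMU418intrin.h for the exact signatures.

    // Clamp values[0..N) at 4.18f, VECTOR_WIDTH lanes at a time.
    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        // Enable only lanes that fall inside the array; this handles the
        // final partial vector when N is not a multiple of VECTOR_WIDTH.
        int active = (N - i < VECTOR_WIDTH) ? (N - i) : VECTOR_WIDTH;
        __cmu418_mask maskAll = _cmu418_init_ones(active);

        __cmu418_vec_float x, limit;
        _cmu418_vload_float(x, values + i, maskAll);       // x = values[i .. i+W)
        _cmu418_vset_float(limit, 4.18f, maskAll);         // broadcast the clamp bound

        __cmu418_mask maskClamp;
        _cmu418_vgt_float(maskClamp, x, limit, maskAll);   // lanes where x > 4.18
        _cmu418_vset_float(x, 4.18f, maskClamp);           // overwrite only those lanes

        _cmu418_vstore_float(output + i, x, maskAll);      // masked store back
    }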
The output of the program will tell you if your implementation generates correct output. If there are incorrect results, the program will print the first one it finds and print out a table of function inputs and outputs.² Your function’s output is after “output = ”, which should match the results after “gold = ”. The program also prints out a list of statistics describing utilization of the 15-418 fake vector units. You should consider the performance of your implementation to be the value “Total Vector Instructions”. “Vector Utilization” shows the percentage of vector lanes that are enabled.

²If it finds a mismatch in row 749, that indicates that your program overran the image array.
What you need to do:
1. Implement a vectorized version of clampedExpSerial() as the function clampedExpVector() in file functions.cpp. Your implementation should work with any combination of input array size N and vector width W.
2. Run ./vrun -s 10000 and sweep the vector width over the values {2, 4, 8, 16, 32}. Record the resulting vector utilization. You can do this by changing the defined value of VECTOR_WIDTH in CMU418intrin.h and recompiling the code each time. How much does the vector utilization change as W changes? Explain the reason for these changes and the degree of sensitivity the utilization has to the vector width. Explain how the total number of vector instructions varies with W.
3. Extra credit: (1 point) Implement a vectorized version of arraySumSerial() as the function arraySumVector() in file functions.cpp. Your implementation may assume that W is a factor of the input array size N. Whereas the serial implementation has O(N) span, your implementation should have at most O(N/W + log₂ W) span. You may find the hadd and interleave operations useful. (A sketch of the underlying reduction pattern appears below.)
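To see where the O(N/W + log₂ W) span comes from: accumulate the array into a single width-W vector of partial sums (N/W vector adds), then reduce the W lanes pairwise in log₂ W steps. In generic scalar C++ the lane reduction looks like the following conceptual sketch; in your solution the per-lane loop becomes hadd/interleave operations.

    // Reduce W partial sums to one value in log2(W) pairwise-add steps.
    float reduceLanes(float lanes[], int W) {
        for (int stride = W / 2; stride >= 1; stride /= 2)
            for (int j = 0; j < stride; j++)
                lanes[j] = lanes[j] + lanes[j + stride];   // fold upper half into lower
        return lanes[0];
    }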
What you need to turn in:
1. Your report should contain tables giving the vector utilizations and total vector instructions for the different values of W.
2. Your report should contain the analyses and answers to the questions listed above.
3. If you did the extra credit problem, state in the report whether or not your code passed the correctness test.
4. The archive file you submit will contain your version of the file functions.cpp. This file should contain the best performing code you created. Any modifications you made should follow good coding conventions, in terms of indenting, variable names, and comments.
Problem 3: Using Instruction-level Parallelism in Fractal Generation (20 points)
Now that you have seen how SIMD execution can operate on multiple data values simultaneously, we will explore how these principles can speed up a conventional C++ program. In this case, we will exploit the parallel execution capability provided in the multiple, pipelined functional units of an out-of-order processor. This exercise will give us the chance to analyze how well a program makes use of these units and how this limits program performance.
You will want to refer to material in Computer Systems: A Programmer’s Perspective, third edition (CS:APP3e) that you perhaps did not study carefully in whatever version of 15-213 you took. In particular, you will find Section 3.11 (floating-point code) and Sections 5.7–5.10 (out-of-order execution and instruction-level parallelism) to be helpful. Copies of the book are on reserve in the Sorrells Library. Don’t try to get by with an earlier edition of the book.
All of the code for this problem is provided in the directory prob3_mandelbrot_ilp. Your job will be to measure performance and understand it. Compile and run the code on view 0. This is a useful benchmark, since all of the iterations hit the maximum limit of 256, yielding a minimum of overhead. You will find that the reference implementation, as given by the function mandel_ref, requires around 10.7 clock cycles per iteration.
.L123:
# This is the inner loop of mandel_ref
# Parameters are passed to the function as follows:
# %xmm0: c_re
# %xmm1: c_im
# %edi: count
# Before entering the loop, the function sets registers
# to initialize local variables:
# %xmm2: z_re = c_re
# %xmm3: z_im = c_im
# %eax: i = 0
vmulss %xmm2, %xmm2, %xmm4
vmulss %xmm3, %xmm3, %xmm5
vaddss %xmm5, %xmm4, %xmm6
vucomiss .LC0(%rip), %xmm6 # Compare to 4.f
ja .L126
vaddss %xmm2, %xmm2, %xmm2
addl $1, %eax
cmpl %edi, %eax # Set condition codes for jne below
vmulss %xmm3, %xmm2, %xmm3
vsubss %xmm5, %xmm4, %xmm2
vaddss %xmm3, %xmm1, %xmm3
vaddss %xmm2, %xmm0, %xmm2
jne .L123
Figure 1: Assembly code for inner loop of function mandel_ref
Figure 1 shows the assembly code generated for the inner loop of mandel_ref, with some comments added to help you understand how the integer and floating-point registers are used. Add more annotation to this code to describe how it relates to the original C++ code. (A copy of the code is provided in the file mandel_ref_loop.s.) Explain how the expression 2.f * z_re * z_im is implemented.
Create a data-flow diagram indicating the data dependencies formed by successive iterations of the loop, similar to Figure 5.14(b) of CS:APP3e. You need not show the integer operations, since these do not affect the program performance.
The Gates-cluster machines are based on what Intel labels its “Broadwell” microarchitecture. This is very similar to the Haswell microarchitecture described in Section 5.7.2 of CS:APP3e, but with slightly more and faster functional units. You can find out more about the microarchitecture in
www.agner.org/optimize/microarchitecture.pdf
Pages 140–141 of this document describe the functional units. In particular, the processor has two units for floating-point arithmetic. These are fully pipelined, able to start new operations on every clock cycle. For floating point, unit 0 can perform both multiplication and addition, while unit 1 can only perform multiplication. Both operations have a latency of 3 clock cycles. A separate functional unit can perform the floating-point comparison required by the vucomiss instruction.
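As a generic worked example (not the answer for mandel_ref): if the longest loop-carried dependence chain of some loop passed through one multiply and two adds, each with 3-cycle latency, then successive iterations could begin no faster than every 3 + 3 + 3 = 9 cycles, no matter how many functional units the processor has.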
Based on these latencies and the data-flow diagram you have created, determine the latency bound for function mandel_ref. How close does the measured performance come to reaching this bound?
It is difficult to see how to speed up the individual iterations through increased instruction-level parallelism. However, it is possible to increase the level of activity in the loop by processing multiple candidate points. This will enable increasing the throughput of the processing to yield better performance than is dictated by the latency bound.
The file mandelbrot.cpp contains a macro MANDEL_BODY and its instantiations to generate the functions mandel_parU for U ranging from 1 to 5. The idea is to process U points in parallel, using a style of programming similar to SIMD execution. We represent the multiple data with arrays of size U, and write conventional code that uses looping to process all U elements. As with SIMD, the loop exits only when all points have completed their iterations, using a bit vector to determine which elements should be updated. The compiler automatically unrolls all of these loops, and so the generated code is equivalent to what would be obtained by explicitly writing the computation for all U elements. (You can see the generated code in the file mandelbrot.s, generated when you compiled the code for this problem.) The function mandelbrotParallel exploits this capability by computing U adjacent rows of the array in parallel.
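The following sketch conveys the style (a hypothetical reconstruction for illustration, not the actual MANDEL_BODY macro text):

    // Iterate U Mandelbrot points at once, SIMD-style, in plain C++.
    // With U fixed at compile time, the compiler can fully unroll the
    // inner loop, exposing independent operations to the functional units.
    #define U 3
    static void mandel_par(const float c_re[U], const float c_im[U],
                           int maxIters, int iters[U]) {
        float z_re[U], z_im[U];
        unsigned active = (1u << U) - 1;          // bit vector of unfinished points
        for (int u = 0; u < U; u++) {
            z_re[u] = c_re[u];
            z_im[u] = c_im[u];
            iters[u] = 0;
        }
        for (int i = 0; i < maxIters && active != 0; i++) {
            for (int u = 0; u < U; u++) {         // compiler unrolls this loop
                if (!(active & (1u << u)))
                    continue;                     // this point already diverged
                float re2 = z_re[u] * z_re[u];
                float im2 = z_im[u] * z_im[u];
                if (re2 + im2 > 4.f) {
                    active &= ~(1u << u);         // point diverged: stop updating it
                    continue;
                }
                z_im[u] = 2.f * z_re[u] * z_im[u] + c_im[u];
                z_re[u] = re2 - im2 + c_re[u];
                iters[u]++;
            }
        }
    }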
Running the program on view 0, you will see that it reaches a peak performance of around 6.1 clock cycles per iteration for U = 3. (That is, each execution of the loop requires around 18.3 cycles, but it performs an iteration for three points.) That’s an improvement over the original performance, but perhaps not quite as significant as one might hope.
Considering the floating-point operations, what is the highest throughput bound imposed by the functional units? You should consider the bounds imposed by multiplication, addition, and the two in combination. How close does the measured performance come to reaching this bound?
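One generic way to organize this computation (a template under the unit capabilities described above, not the final answer): if each pass through the U-way loop issues M floating-point multiplies and A floating-point adds or subtracts, then

    cycles per loop pass >= max(M / 2, A / 1, (M + A) / 2)

since multiplies can issue to either unit, adds only to unit 0, and the two units together can start at most two floating-point operations per cycle. Count M and A in the generated code for U = 3 and compare the resulting bound against the measured 18.3 cycles per loop.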
If you could modify the generated assembly code, do you think you could lower the throughput bound?

Extra Credit: (1 point) Modify the C++ code for both the reference and the parallel versions to (slightly) improve the performance of the parallel version, while still passing the functionality tests. You may alter the computed function, but only in a way that demonstrates how a better compiler could generate faster code.
As a bonus, you can run the interactive visualization tool mviz.py on the ILP-enhanced version of the Mandelbrot computation with U = 3. Run with the command line:
linux> ../mviz.py ./mandelbrot
What you need to turn in:
1. An annotated version of the assembly code for the main loop of mandel_ref.
2. Data-flow diagram showing the loop-carried dependencies among the floating-point values.
3. An analysis of the latency bound of mandel_ref and how the measured performance compares.
4. An analysis of the throughput bound for the parallel versions, and how the measured performance compares.
5. Ideas for how a compiler could generate code that runs faster on this particular microarchitecture.
6. (Extra credit.) A demonstration of how the compiler could generate code that runs faster.
Problem 4: Parallel Fractal Generation Using ISPC (20 points)
The code for this problem is in the subdirectory prob4_mandelbrot_ispc. Now that you’re comfortable with SIMD execution, we’ll return to parallel Mandelbrot fractal generation. As in Problem 1, Problem 4 computes a Mandelbrot fractal image, but it achieves even greater speedups by utilizing the SIMD execution units within each of the 8 cores. We will also see that the ILP-enhancement techniques used in Problem 3 can, in principle, be applied to ISPC code.
In Problem 1, you parallelized image generation by creating one thread for each processing core in the system. Then, you assigned parts of the computation to each of these concurrently executing threads. Instead of specifying a specific mapping of computations to concurrently executing threads, Problem 4 uses ISPC language constructs to describe independent computations. These computations may be executed in parallel without violating program correctness. In the case of the Mandelbrot image, computing the value of each pixel is an independent computation. With this information, the ISPC compiler and runtime system take on the responsibility of generating a program that utilizes the CPU’s collection of parallel execution resources as efficiently as possible.
4.1 Problem 4, Part 1. A Few ISPC Basics (5 of 20 points)
When reading ISPC code, you must keep in mind that, although the code appears much like C/C++ code, the ISPC execution model differs from that of standard C/C++. In contrast to C, multiple program instances of an ISPC program are always executed in parallel on the CPU’s SIMD execution units. The number of program instances executed simultaneously is determined by the compiler (and chosen specifically for the underlying machine). This number of concurrent instances is available to the ISPC programmer via the built-in variable programCount. ISPC code can reference its own program instance identifier via the built-in programIndex. Thus, a call from C code to an ISPC function can be thought of as spawning a group of concurrent ISPC program instances (referred to in the ISPC documentation as a gang). The gang of instances runs to completion, then control returns back to the calling C code.
As an example, the following program uses a combination of regular C code and ISPC code to add two 1024-element vectors. As discussed in class, since each instance in a gang is independent and performs the exact same program logic, execution can be accelerated via SIMD instructions.
A simple ISPC program is given below. First, the C program, which calls the ISPC-generated code:
————————————————————————
C program code: myprogram.cpp
————————————————————————
const int TOTAL_VALUES = 1024;
float a[TOTAL_VALUES];
float b[TOTAL_VALUES];
float c[TOTAL_VALUES];
// Initialize arrays a and b here. …
sum(TOTAL_VALUES, a, b, c);
// Upon return from sumArrays, result of a + b is stored in c.
The function sum() called by the C code is generated by compiling the following ISPC code:
————————————————————————
ISPC code: myprogram.ispc
————————————————————————
export void sum(uniform int N, uniform float* a, uniform float* b, uniform float* c)
{
    // Assumes programCount divides N evenly.
    for (uniform int i = 0; i < N; i += programCount)
    {
        // Each program instance handles one element per loop iteration.
        c[i + programIndex] = a[i + programIndex] + b[i + programIndex];
    }
}
As a bonus, you can again run the interactive visualization tool mviz.py on the ISPC version of the program:

linux> ../mviz.py ./mandelbrot

or

linux> ../mviz.py ./mandelbrot -t

You will see how the speedup improves interactivity.

What you need to turn in:
1. Your report must contain answers to the questions listed above.
2. Your report should describe the performance gains you get from SIMD, SIMD+ILP, and threaded parallelism.
3. Your answer to the extra credit question, if any.
4. The archive file you submit will contain your version of the file mandelbrot.ispc. This file should contain the best performing code you created. Any modifications you made should follow good coding conventions, in terms of indenting, variable names, and comments.
Problem 5: Iterative Square Root (10 points)
The code for this problem is in the subdirectory prob5_sqrt. Problem 5 concerns the program sqrt, generated from an ISPC program that computes the square root of 20 million random numbers between 0 and 3. For value s, it uses iterative Newton’s method to solve the equation 1/x² = s. This gives a value x ≈ 1/√s. Multiplying x by s gives an approximation to √s.
The value 1.0 is used as the initial guess in this implementation. The graph below shows the number of iterations required for the program to converge to an accurate solution for values in the open interval (0, 3). (The implementation does not converge for inputs outside this range). Notice how the rate of convergence depends on how close the solution is to the initial guess.
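Concretely, Newton’s method applied to f(x) = 1/x² − s gives the update x' = x − f(x)/f'(x) = x(3 − s·x²)/2. A scalar sketch of the computation follows (an illustration of the math only; the convergence test and threshold here are assumptions, so consult the ISPC source for the exact versions):

    #include <cmath>

    // Approximate sqrt(s) for 0 < s < 3 via Newton iterations on 1/x^2 = s.
    float newton_sqrt(float s) {
        float x = 1.0f;                               // initial guess
        while (fabsf(x * x * s - 1.0f) > 1e-5f)       // stop when x is near 1/sqrt(s)
            x = x * (3.0f - s * x * x) * 0.5f;        // Newton update
        return x * s;                                 // x ~ 1/sqrt(s), so x*s ~ sqrt(s)
    }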
[Figure: “Iterations to Convergence by Square Root Function”, plotting iterations to convergence (y-axis, 0 to 40) against input value (x-axis, 0.0 to 3.0).]
What you need to do:
1. Build and run sqrt. Report the ISPC implementation speedup for a single CPU core (no tasks) and when using all cores (with tasks). What is the speedup due to SIMD parallelization? What is the speedup due to multi-core parallelization?
2. Modify the function initGood() in the file data.cpp to generate data that will yield a very high relative speedup of the ISPC implementations. (You may generate any valid data.) Describe why these input data will maximize speedup over the sequential version and report the resulting speedup achieved (for both the with- and without-tasks ISPC implementations). You can test this version with the command-line argument “--data g”. Does your modification improve SIMD speedup? Does it improve multi-core speedup? Please explain why.
3. Modify the function initBad() in the file data.cpp to generate data that will yield a very low (less than 1.0) relative speedup of the ISPC implementations. (You may generate any valid data.) Describe why these input data will cause the SIMD code to have very poor speedup over the sequential version and report the resulting speedup achieved (for both the with- and without-tasks ISPC implementations). You can test this version with the command-line argument “--data b”. Does your modification improve multi-core speedup? Please explain why.
Notes and comments: When running your “very-good-case input”, take a look at what the benefit of multi-core execution is. You might be surprised at how high it is. This is a hyper-threading effect.
What you need to hand in:
1. Provide explanations and answers in your report.
2. The archive file you submit will contain your version of the file data.cpp.
Hand-in Instructions
As mentioned, you should have your report in a file report.pdf in the home directory for the assignment. (Make sure the report includes your name and Andrew ID.) In this directory, execute the command
make handin.tar
This will create an archive file with your report, and some of the files you modified for the different problems. Completing the handin then depends on your status in the class:
• If you are a registered student, you should have a course Autolab account. Go to Autolab at: https://autolab.andrew.cmu.edu/courses/15418-s19
and submit the file handin.tar.
• If you are on the waitlist, you will not have an Autolab account. Instead, you should run the provided program submit.py as follows:
./submit.py -u AndrewID

where AndrewID is your Andrew ID.
In either case, you can submit multiple times, but only your final submission will be graded, and the time stamp of that submission will be used in determining any late penalties.