Mini-Project 2
The goal of this project is to use your understanding of parallel computing resources in a multicore microprocessor to optimize two fully functional applications.
Mini-Project 2 Due Monday, 4/12 @ 11:59PM EST
Mini-Project 2 Review Thursday, 4/15 @ 6:00PM EST
Canvas: https://canvas.cmu.edu/courses/21510/assignments/354207
Handout: https://cmu.box.com/s/eud2ltk4nxqfivtvaprzgldn4ujgtvj8
Gradescope Links:
Matrix Multiply: https://www.gradescope.com/courses/241050/assignments/1102345
K-Means: https://www.gradescope.com/courses/241050/assignments/1102348
Project Writeup: https://www.gradescope.com/courses/241050/assignments/1102352
Mini-Project 2: Grading Criteria
Mini-Project 2: Write-Up
Mini-Project 2 – Setup
Mini-Project 2 – Testing
The time taken for execution on Gradescope can vary quite a bit.
The final performance of your submitted code (the code that exists on Gradescope on Monday 04/12 @ 11:59PM EST) will be evaluated on a GHC machine.
Test your code for accuracy and speed before uploading to Gradescope.
Matrix Multiplication
$ ./matrix_mul -i ../matrix_mul_03.dat -o
K-Means
$ ~/18646_MP1/kmeans/seq_main -o -i kmeans03.dat -n 15
$ mv kmeans03.dat.membership seq_main.kmeans03_n15.membership
$ ./cuda_main -o -i kmeans03.dat -n 15
$ diff -y --suppress-common-lines kmeans03.dat.membership seq_main.kmeans03_n15.membership
18-646 – How to Write Fast Code II 12
Mini-Project 2 – Makefile
The only two files you will submit are “matrix_mul.cu” and “cuda_kmeans.cu”
However!!!
For your REPORT, you can explore the use of different compilation flags and report on their effect!
If you make changes to the Makefile for your fastest performing version, please include a description of the changes you made in your project report!!!
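For example, you might try a higher optimization level or fast math (illustrative nvcc flags, not a prescribed build line; measure and report the effect):
$ nvcc -O3 --use_fast_math -o matrix_mul matrix_mul.cu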
Mini-Project 2 – General Suggestions
- Explore use of local variables and shared memory: local variables, shared memory optimization
- Loop unrolling: increases instruction shuffling and instruction-level parallelism (see the sketch after this list)
- Rectangular threadblock sizes: the number of threads per block defines how many simultaneously executing blocks are available
- More computation per block / use FMAs
- Multiply / add balancing
- Minimize shared-memory-to-register move instructions: every load from global memory to shared memory needs to move through the register file
- Look at the PTX assembly code?
- Improve overlap between memory and computation
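To illustrate the loop-unrolling suggestion, a hypothetical inner-loop fragment from a tiled kernel (As, Bs, BLOCK_SIZE, and sum are assumed names, not from the starter code):

#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
    // unrolling exposes independent instructions to the scheduler;
    // each multiply/add pair can compile to a single FMA
    sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}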
Mini-Project 2 – Matrix Multiply
Shared Memory Optimization
Divide the matrix into blocks; each block is mapped to a single SM of the GPU. Then (a kernel sketch follows the steps):
1. Create shared memory of size [BLOCK_SIZE][BLOCK_SIZE] per thread block.
2. All threads in a thread block load their corresponding elements of the input matrices from DRAM using the global-memory indexing formula.
3. Once all threads have stored their elements in shared memory, sync the threads.
4. Threads then read the elements directly from the thread block’s shared memory and compute the value of the product at their location in the product matrix P.
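A minimal sketch of steps 1-4 as a CUDA kernel, assuming square wSize x wSize row-major matrices (the name matrixMulShared and the parameter names are ours, not the starter code’s):

#define BLOCK_SIZE 16

__global__ void matrixMulShared(const float *M, const float *N, float *P,
                                int wSize) {
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];   // step 1: per-block tiles
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (wSize + BLOCK_SIZE - 1) / BLOCK_SIZE; t++) {
        int mCol = t * BLOCK_SIZE + threadIdx.x;
        int nRow = t * BLOCK_SIZE + threadIdx.y;
        // step 2: each thread loads one element of each input tile from
        // global memory; guards zero-fill past the matrix boundary
        Ms[threadIdx.y][threadIdx.x] =
            (row < wSize && mCol < wSize) ? M[row * wSize + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] =
            (nRow < wSize && col < wSize) ? N[nRow * wSize + col] : 0.0f;
        __syncthreads();                           // step 3: tile is complete

        // step 4: partial dot product, read from shared memory only
        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();   // don't overwrite tiles other threads still read
    }

    if (row < wSize && col < wSize)
        P[row * wSize + col] = sum;
}

Launched with dim3 block(BLOCK_SIZE, BLOCK_SIZE) and dim3 grid((wSize + BLOCK_SIZE - 1) / BLOCK_SIZE, (wSize + BLOCK_SIZE - 1) / BLOCK_SIZE), each tile element is loaded from DRAM once and reused BLOCK_SIZE times from shared memory.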
Mini-Project 2 – Matrix Multiply
1. BLOCK_SIZE = 16
2. Handle boundary conditions (costs around 10-20 GFLOPS)
3. Global DRAM bandwidth is saved.
4. The BLOCK_SIZE value was chosen empirically.
5. Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
6. Memory bandwidth is no longer a limiting factor.
Mini-Project 2 – K-Means
Computing new clusters involves (see the host-side sketch below):
- Copying the membership of every object from device to host
- Iterating through every object to aggregate each cluster’s size and coordinates
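A minimal sketch of this host-side aggregation (the flattened layout and variable names are assumptions, not necessarily the starter code’s):

// membership[] has just been cudaMemcpy'd from device to host
for (int i = 0; i < numObjs; i++) {
    int c = membership[i];                 // cluster assigned to object i
    newClusterSize[c]++;                   // aggregate the cluster's size
    for (int j = 0; j < numCoords; j++)    // aggregate its coordinates
        newClusters[c * numCoords + j] += objects[i * numCoords + j];
}
// each new centroid is then newClusters[c * numCoords + j] / newClusterSize[c]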
Mini-Project 2 – K-Means
Naive Solution
We could perform the computation on the device itself:
- Create a kernel where every thread is responsible for one object
- Update the cluster’s size and coordinates using atomicAdd
However, contention is very high, as every thread is updating the same data structures in global memory (see the sketch below).
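A minimal sketch of this naive kernel (the flattened layout and these names are assumptions):

__global__ void updateClustersNaive(int numObjs, int numCoords,
                                    const float *objects,    // [numObjs x numCoords]
                                    const int *membership,   // [numObjs]
                                    float *newClusters,      // [numClusters x numCoords]
                                    int *newClusterSize) {   // [numClusters]
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numObjs) return;

    int c = membership[i];
    // every thread hammers the same global-memory counters;
    // the atomics serialize, so contention is the bottleneck
    atomicAdd(&newClusterSize[c], 1);
    for (int j = 0; j < numCoords; j++)
        atomicAdd(&newClusters[c * numCoords + j], objects[i * numCoords + j]);
}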
Mini-Project 2 – K-Means
Improved Solution
- Let each thread block have its own copy of the data structures in shared memory: create a cluster buffer and a cluster-size array for every thread block
- Each thread updates the data structures in its own thread block’s shared memory using atomicAdd
- Finally, each block’s copy is written out to device (global) memory
But we still need a way to aggregate the data structures from every thread block into one final data structure.
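Before that aggregation step, here is a sketch of the per-block kernel itself (names and the flattened layout are assumptions; dynamic shared memory holds one cluster buffer plus one size array per block):

__global__ void updateClustersShared(int numObjs, int numCoords, int numClusters,
                                     const float *objects, const int *membership,
                                     float *blockClusters,  // [gridDim.x x numClusters x numCoords]
                                     int *blockSizes) {     // [gridDim.x x numClusters]
    extern __shared__ float smem[];                  // sized at launch
    float *sClusters = smem;                         // numClusters * numCoords floats
    int   *sSizes = (int *)(smem + numClusters * numCoords);  // numClusters ints

    // cooperatively zero this block's private copies
    for (int k = threadIdx.x; k < numClusters * numCoords; k += blockDim.x)
        sClusters[k] = 0.0f;
    for (int k = threadIdx.x; k < numClusters; k += blockDim.x)
        sSizes[k] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numObjs) {
        int c = membership[i];
        atomicAdd(&sSizes[c], 1);   // contention now only within this block
        for (int j = 0; j < numCoords; j++)
            atomicAdd(&sClusters[c * numCoords + j], objects[i * numCoords + j]);
    }
    __syncthreads();

    // write this block's partial results to its slot in global memory
    for (int k = threadIdx.x; k < numClusters * numCoords; k += blockDim.x)
        blockClusters[blockIdx.x * numClusters * numCoords + k] = sClusters[k];
    for (int k = threadIdx.x; k < numClusters; k += blockDim.x)
        blockSizes[blockIdx.x * numClusters + k] = sSizes[k];
}

The third launch parameter supplies the dynamic shared-memory size: numClusters * numCoords * sizeof(float) + numClusters * sizeof(int).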
Mini-Project 2 – K-Means
Improved Solution cont.
- Create another reduction kernel to aggregate the per-thread-block data structures
- This reduction kernel has each thread reduce the data structures produced by one thread block of the previous kernel
- It uses a binary reduction process similar to the one used for the delta reduction
- Cluster sizes and clusters are reduced together in the same kernel (see the sketch below)
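A sketch of that reduction (assuming the number of per-block copies is a power of two and fits within a single thread block; names are ours):

__global__ void reduceBlockClusters(int numSrcBlocks, int numClusters, int numCoords,
                                    float *blockClusters,  // [numSrcBlocks x numClusters x numCoords]
                                    int *blockSizes) {     // [numSrcBlocks x numClusters]
    int t = threadIdx.x;
    int len = numClusters * numCoords;
    // binary reduction: at each step, thread t folds copy (t + s) into
    // copy t, halving the live copies, same shape as the delta reduction
    for (int s = numSrcBlocks / 2; s > 0; s >>= 1) {
        if (t < s) {
            for (int k = 0; k < len; k++)          // cluster coordinates
                blockClusters[t * len + k] += blockClusters[(t + s) * len + k];
            for (int c = 0; c < numClusters; c++)  // cluster sizes, same kernel
                blockSizes[t * numClusters + c] += blockSizes[(t + s) * numClusters + c];
        }
        __syncthreads();
    }
    // copy 0 now holds the totals for all clusters
}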
Mini-Project 2 – K-Means
Some more optimizations
Can we minimize the contention while executing atomicAdd?
- Currently, every thread needs to update the same memory address with atomicAdd at the same time
- Distribute the updates across different memory addresses, so that different threads update different addresses (see the sketch below)
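One way to do this (an illustration only, shown here for the cluster sizes; the coordinate buffers would be replicated the same way): keep several replicas of the counters in shared memory, have each thread pick a replica by its thread index, then fold the replicas together at the end.

#define NUM_COPIES 4
#define MAX_CLUSTERS 32   // assumed compile-time bound on numClusters

__global__ void countMembersReplicated(int numObjs, int numClusters,
                                       const int *membership,
                                       int *blockSizes) {  // [gridDim.x x numClusters]
    __shared__ int sSizes[NUM_COPIES][MAX_CLUSTERS];

    for (int k = threadIdx.x; k < NUM_COPIES * MAX_CLUSTERS; k += blockDim.x)
        ((int *)sSizes)[k] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numObjs)
        // threads that used to collide on one counter are now spread
        // across NUM_COPIES addresses, cutting atomic contention
        atomicAdd(&sSizes[threadIdx.x % NUM_COPIES][membership[i]], 1);
    __syncthreads();

    // fold the replicas and write this block's totals
    for (int c = threadIdx.x; c < numClusters; c += blockDim.x) {
        int total = 0;
        for (int r = 0; r < NUM_COPIES; r++)
            total += sSizes[r][c];
        blockSizes[blockIdx.x * numClusters + c] = total;
    }
}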