18-646 – How to Write Fast Code II
Carnegie Mellon University
Important Dates
Mini-Project 1 – Important Dates:
  Mini-Project One Due: Mon, March 15th @ 11:59PM EST
  Mini-Project One Review: Thu, March 18th @ 3:00PM EST
Term Project – Important Dates:
  Project Proposal Due (1-3 pages): Fri, March 12th @ 11:59PM EST
  Poster Presentation: Tue, May 4th
  Final Project Report (~10 pages): Fri, May 14th
No class on Thursday this week
Final Exam: Friday, May 14th at 1:00-4:00PM EST
Term Project – Proposal
The Project Proposal must contain the following:
1) Title
2) List of team members
3) Description of the problem you want to address
4) Description of how you propose to solve the problem
Platform? Techniques?
5) Outline of your expectations
Expected improvements in throughput compared to a baseline implementation (current or sequential)
6) A timeline with milestones
7) List of any special infrastructure requirements
Submit early!
Mini-Project 1
What we discussed last time:
Fast Platforms: multicore platforms, manycore platforms, cloud platforms
Good Techniques: data structures, algorithms, software architecture
Highlighted the difference between multicore and manycore platforms
Exposing concurrency in k-means; exploiting parallelism by exploring mappings from application to platform
Multicore platforms: OpenMP, performance metrics, roofline model
Manycore platforms: GPUs and CUDA
Outline
Maximizing Memory Throughput
Maximizing Instruction Throughput
Maximizing Scheduling Throughput
Special Optimizations
Maximize Memory Throughput
M1. SoA vs AoS
M2. Memory coalescing
M3. Use of shared memory
M4. Memory bank conflict
M5. Padding
“Highly parallel processors turn compute-limited algorithms into memory-limited algorithms”
[Figure: NVIDIA GTX 280 (SP) roofline model. Compute ceilings: 622 GFLOPS peak, 311 GFLOPS with MUL/ADD imbalance, lower still with instruction-stream divergence. Bandwidth ceilings: shared memory 1244 GB/s, GDDR 141.7 GB/s, PCI Express 2.5 GB/s. M1/M2 target the GDDR ceiling; M3/M4/M5 target the shared-memory ceiling.]
(M1) SOA vs AOS
Struct of Arrays:

typedef struct
{
    float* x;
    float* y;
    float* z;
} Constraints;

Memory layout: xxxyyyzzz

vs. Array of Structs:

typedef struct __align__(16)
{
    float3 position;
} Constraint;

Memory layout: xyzxyzxyz

Important distinction based on application memory access patterns.
Which one is “better”?
SOA vs AOS? It depends…
What is the computation? For each point: d = sqrt(x^2 + y^2 + z^2)
What is the optimization goal?
Example 1: You have a list of coordinates, and you want to find the distance from the origin to all points.
What should be the data structure?
SOA vs AOS? Example 1
What is the computation? For each point: d = sqrt(x^2 + y^2 + z^2)
What is the optimization goal? Minimize memory bandwidth to DRAM.
Example 1: You have a list of coordinates, and you want to find the distance from the origin to all points.
What should be the data structure?

typedef struct
{
    float* x;
    float* y;
    float* z;
} pointIn3D;

OR

typedef struct __align__(16)
{
    float3 position;
} pointIn3D;

Since every point is read, SoA (xxxyyyzzz) is the natural fit: assign consecutive elements of each array to consecutive threads, and every load coalesces with no wasted data. With AoS (xyzxyzxyz) and __align__(16), each 16-byte struct carries 4 padding bytes, so a quarter of the loaded bandwidth is wasted.
SOA vs AOS? Example 2
What is the computation? For a point: d = sqrt(x^2 + y^2 + z^2)
What is the optimization goal? Minimize memory bandwidth to DRAM.
Example 2: Find the distance from the origin to a list of points, estimated to be ~1% of all points.
What should be the data structure?

typedef struct
{
    float* x;
    float* y;
    float* z;
} pointIn3D;

OR

typedef struct __align__(16)
{
    float3 position;
} pointIn3D;

With only ~1% of points touched, AoS (xyzxyzxyz) wins: one aligned 16-byte access fetches a point’s x, y, and z together, and we don’t need to load all the neighboring points that three scattered accesses into the xxxyyyzzz arrays would drag along in their segments.
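A minimal CUDA sketch of the two layouts for Example 1 (kernel and parameter names are hypothetical, not from the course code):

// SoA: consecutive threads read consecutive words of each array (coalesced)
__global__ void dist_soa(const float* x, const float* y, const float* z,
                         float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = sqrtf(x[i]*x[i] + y[i]*y[i] + z[i]*z[i]);
}

// AoS: each thread loads one 16-byte-aligned struct
struct __align__(16) pointIn3D { float3 position; };

__global__ void dist_aos(const pointIn3D* p, float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float3 q = p[i].position;
        d[i] = sqrtf(q.x*q.x + q.y*q.y + q.z*q.z);
    }
}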
(M2) Memory Coalescing
Hardware Constraint: DRAM is accessed in “segments” of 32B/64B/128B
Unused data loaded in a “segment” still takes up valuable bandwidth
Goal: combine multiple memory accesses generated from multiple threads into a single physical transaction
increases effective throughput to DRAM
Rules for maximizing DRAM memory bandwidth:
Possible bus transaction sizes: 32B, 64B, or 128B
Memory segment must be aligned: First address = multiple of segment size
Hardware coalescing for each warp: 32-word wide
Kirk, Hwu, Programming Massively Parallel Processors, Chapter 6.2
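As a sketch of the rule (hypothetical kernels, not from the text):

// Coalesced: thread k of a warp reads word k of an aligned, contiguous run,
// so the warp's 32 four-byte reads fall into a single 128B segment
__global__ void copy_coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Uncoalesced: a stride of 32 words sends each thread of a warp to a
// different 128B segment, so 32 transactions service one warp
__global__ void copy_strided(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[32 * i];
}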
Examples:
Are these coalesced?
[Figure: three warp access patterns. The first two are coalesced (YES), even though one leaves part of the segment’s words unused (wasted); the third is not coalesced (NO).]
When is the Access Coalesced?
“Threads can access any words in any order, including the same words, and a single memory transaction for each segment addressed by a warp”
Detailed protocol:
Find the memory segment that contains the address requested by the active thread with the lowest thread ID:
  32 bytes for 1-byte words
  64 bytes for 2-byte words
  128 bytes for 4-, 8- and 16-byte words
Find all other active threads whose requested address lies in the same segment.
Reduce the transaction size, if possible:
  If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
  If the transaction size is 64 bytes (originally or after reduction from 128 bytes) and only the lower or upper half is used, reduce the transaction size to 32 bytes.
Carry out the transaction and mark the serviced threads as inactive.
Repeat until all threads in the warp are serviced.
(M3) Use of Shared Memory
Take advantage of 9x faster memory bandwidth
Process:
1) Load from DRAM to shared memory
2) Synchronize
3) Perform work on data in shared memory
4) Synchronize
5) Write out results to DRAM
[Figure: the GTX 280 roofline model again; the shared-memory bandwidth ceiling (1244 GB/s, M3) sits roughly 9x above the GDDR ceiling (141.7 GB/s).]
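A minimal sketch of this load/sync/compute/sync/store pattern (names hypothetical; tile size assumed equal to the block size):

#define TILE 256

__global__ void process_tiles(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;       // 1) DRAM -> shared
    __syncthreads();                                  // 2) synchronize

    float v = tile[threadIdx.x];                      // 3) work out of shared
    if (threadIdx.x > 0) v += tile[threadIdx.x - 1];
    __syncthreads();                                  // 4) synchronize

    if (i < n) out[i] = v;                            // 5) shared -> DRAM
}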
Double Buffering
One could double buffer the computation:
Getting a better instruction mix within each thread
Classic software pipelining in ILP compilers

Original:

Loop {
    Load current tile to shared memory
    __syncthreads()
    Compute current tile
    __syncthreads()
}

Double Buffered:

Load first tile from global memory
Loop {
    Deposit current tile to shared memory
    __syncthreads()
    Load next tile from global memory
    Compute current tile
    __syncthreads()
}

Kirk, Hwu, Programming Massively Parallel Processors, Figure 6.13
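A hedged CUDA sketch of the double-buffered loop (names hypothetical; assumes blockDim.x == TILE and a contiguous input of ntiles tiles):

#define TILE 256

__global__ void consume_tiles(const float* in, float* out, int ntiles) {
    __shared__ float buf[TILE];
    int t = threadIdx.x;

    float next = in[t];                     // load first tile into a register
    for (int k = 0; k < ntiles; ++k) {
        buf[t] = next;                      // deposit current tile to shared
        __syncthreads();
        if (k + 1 < ntiles)
            next = in[(k + 1) * TILE + t];  // prefetch next tile; overlaps compute
        float v = buf[t] + (t > 0 ? buf[t - 1] : 0.0f);  // compute current tile
        __syncthreads();                    // all reads done before next overwrite
        out[k * TILE + t] = v;
    }
}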
Using Shared Memory
Two approaches to using shared memory
(1) Declare a fixed-size variable at compile time
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
(2) Define a size to be used at run time
__global__ void mykernel(int a, float *objects) {
extern __shared__ char sharedMemory[];
unsigned char *membershipChanged = (unsigned char *)sharedMemory;
float *clusters = (float *)(sharedMemory + blockDim.x); …
}
// In host code
mykernel <<< nBlks, nThds, shmemByteSize >>> (a, objects);
Remember Matrix Transposition?
One can use shared memory:
Load Matrix B with coalesced memory access into shared memory
“Random access” from shared memory
No longer limited by uncoalesced memory access
Good conceptually, but does it work?
[Figure: the loaded tile in shared memory; row i holds global words i*R+32 through i*R+63. Entries of a column sit 32 words apart in shared memory, so they all map to the same bank.]
(M4) Memory Bank Conflicts
Shared memory has 32 banks
Organized such that successive 32-bit words are assigned to successive banks
Each bank has a bandwidth of 32 bits per two clock cycles (2-cycle latency)
A bank conflict occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank
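For instance (a minimal sketch, not from the slides):

__global__ void stride2(const float* in, float* out) {
    __shared__ float s[64];
    int t = threadIdx.x;          // assume blockDim.x == 32 (one warp)
    s[t] = in[t];
    s[t + 32] = in[t + 32];
    __syncthreads();
    // threads t and t+16 read words 2t and 2t+32, which map to the same
    // bank (2t mod 32): a 2-way bank conflict
    out[t] = s[2 * t];
}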
Examples:
Left: linear addressing with a stride of one 32-bit word: no bank conflict
Middle: linear addressing with a stride of two 32-bit words: 2-way bank conflicts (threads t and t+16 hit the same bank)
Right: linear addressing with a stride of three 32-bit words: no bank conflict
Examples:
Left: Conflict-free access via random permutation.
Middle: Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5.
Right: Conflict-free broadcast access, since all threads access the same word.
How do we make the matrix transposition work?
(M5) Padding Technique
We have seen padding used to align data structures in project 1
Padding can also be used to offset memory bank conflicts
[Figure: left, the unpadded tile, where each row holds words 32 through 63 offset by R and a column access repeatedly hits one bank; right, the same tile with one unused padding word (*) appended to each row, which shifts successive rows by one bank so a column access touches 32 different banks.]
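This is how the shared-memory matrix transposition is made to work: pad the tile’s row length by one word. A sketch (names hypothetical; assumes 32x32 thread blocks):

#define TILE 32

__global__ void transpose(const float* in, float* out, int n) {
    // the +1 pad puts successive rows in different banks, so the
    // column-wise reads below are conflict-free
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swap block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}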
Maximize Memory Throughput
M1. SoA vs AoS
M2. Memory coalescing
M3. Use of shared memory
M4. Memory bank conflict
M5. Padding
[Figure: the GTX 280 roofline model, as before.]
Outline
Maximizing Memory Throughput
Maximizing Instruction Throughput
Maximizing Scheduling Throughput
Special Optimizations
Maximizing Instruction Throughput
(I1) Branch divergence
(I2) Optimize instruction mix
[Figure: the GTX 280 roofline model; I1 (instruction-stream divergence) and I2 (MUL/ADD imbalance) mark the in-core compute ceilings below the 622 GFLOPS peak.]
Examples:
Example 1:

tid = threadIdx.x;
if (a[tid] > 0) {
    x += 1;
} else {
    if (b[tid] > 0) {
        x += 2;
    } else {
        x += 3;
    }
}

Example 2, Original Code:

if (c > 0) {
    x = x*a1 + b1;
    y = y*a1 + b1;
} else {
    x = x*a2 + b2;
    y = y*a2 + b2;
}

Example 2, Optimized Code:

if (c > 0) {
    a = a1;
    b = b1;
} else {
    a = a2;
    b = b2;
}
x = x*a + b;
y = y*a + b;

Optimization: factor out decision variables to get a shorter sequence of divergent code
Tianyi David Han, Tarek S. Abdelrahman, Reducing Branch Divergence in GPU Programs, GPGPU-4 Mar 05-05 2011, Newport Beach, CA USA
(I1) Branch Divergence
At every instruction issue, the SIMT unit selects a warp that is ready to execute
A warp executes one common instruction at a time
Full efficiency is realized when all 32 threads of a warp agree on their path
If threads of a warp diverge via a data-dependent conditional branch:
  the warp serially executes each branch path taken,
  disabling threads that are not on that path;
  when all paths complete, the threads converge back to the same execution path
Branch divergence occurs only within a warp
Maximizing Instruction Throughput
[Figure: the GTX 280 roofline model again, now highlighting I2 (MUL/ADD instruction mix).]
(I2) Optimizing Instruction Mix
Compiler Assisted Loop Unrolling
Provides more instruction level parallelism for the compiler to use
Improves the ability for the compiler to find the instruction mix that increases instructions executed per cycle (IPC)
By default, the compiler unrolls small loops with a known trip count
In CUDA, the #pragma unroll directive can control unrolling of any given loop
It must be placed immediately before the loop and only applies to that loop
Optionally followed by a number
Specifies how many times the loop must be unrolled
Compiler Assisted Loop Unrolling
Example 1:
Loop to be unrolled 5 times
#pragma unroll 5
for (int i = 0; i < n; ++i)
Example 2:
Preventing the compiler from unrolling a loop
#pragma unroll 1
for (int i = 0; i < n; ++i)
Example 3:
#pragma unroll
for (int i = 0; i < n; ++i)
If n is a constant known at compile time, the loop is fully unrolled; if n is a variable, the loop is not unrolled at all
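As a concrete illustration (hypothetical kernel, not from the slides), unrolling a fixed-trip-count reduction:

__global__ void row_sums(const float* m, float* sums, int nrows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nrows) return;
    float s = 0.0f;
    #pragma unroll 8              // replicate the body 8x, exposing
    for (int c = 0; c < 64; ++c)  // independent adds for the scheduler
        s += m[r * 64 + c];
    sums[r] = s;
}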
Outline
Maximizing Memory Throughput
Maximizing Instruction Throughput
Maximizing Scheduling Throughput
Special Optimizations
Maximizing Scheduling Throughput
For a particular application, which kernel is faster?
Kernel 1:
  256 threads/block
  17 registers/thread
  10KB shared mem/block
Kernel 2:
  196 threads/block
  28 registers/thread
  4KB shared mem/block
Must respect the physical limitations of the processor!
NVIDIA Fermi Architecture
Occupancy:
Ability of a CUDA kernel to occupy concurrent contexts in a SM
(Streaming Multiprocessor)
Specifically, the ratio of active warps to the maximum number of warps supported
Helpful in determining how efficient the kernel could be on the GPU
CUDA Occupancy Calculator
A programmer tool for computing the multiprocessor “occupancy”
Threads per Warp: 32
Warps per Multiprocessor: 48
Threads per Multiprocessor: 1536
Thread Blocks per Multiprocessor: 8
Total # of 32-bit registers per Multiprocessor: 32768
Shared Memory per Multiprocessor (bytes): 49152
Shared Memory Allocation unit size: 128
https://docs.nvidia.com/cuda/cuda-occupancy-calculator
CUDA Occupancy Calculator
196 threads/block, 28 registers/thread, 4KB shared mem/block → 73% occupancy
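In newer CUDA runtimes the same estimate can be queried programmatically; a sketch (host code; reuses the mykernel from the earlier slide):

int numBlocks = 0;                  // resident blocks per SM
int blockSize = 196;
size_t shmemBytes = 4096;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, mykernel, blockSize, shmemBytes);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int warpsPerBlock = (blockSize + 31) / 32;
int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
float occupancy = (float)(numBlocks * warpsPerBlock) / maxWarps;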
How to Get the Parameters
Threads/block:
Programmer specified at __global__ function launch
Registers/thread:
Use compiler option to display at compile time
--ptxas-options=-v
Expected output:
ptxas info: Compiling entry function '_XYZ_' for 'sm_20'
ptxas info: Used 25 registers, 3616+0 bytes smem, 53 bytes cmem[0], 4 bytes cmem[16]
Shared memory/block:
If determined at runtime: the user-specified variable
If determined at compile time: see the output from:
--ptxas-options=-v
Maximizing Scheduling Throughput
Kernel 1:
256 threads/block
17 registers/thread
10KB Shared mem/block
Occupancy: 67%
Maximizing Scheduling Throughput
Kernel 2:
196 threads/block
28 registers/thread
4KB Shared mem/block
Occupancy: 73%
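A worked check of both results against the Fermi limits above (my arithmetic, not from the slides; register-allocation granularity is ignored):

Kernel 1 (256 threads = 8 warps/block):
  threads:       1536 / 256 = 6 blocks/SM
  registers:     32768 / (256 x 17) = 7 blocks/SM
  shared memory: 49152 / 10240 = 4 blocks/SM   <- binding limit
  occupancy = (4 x 8 warps) / 48 warps = 67%

Kernel 2 (196 threads round up to 7 warps = 224 thread slots/block):
  warps:         48 / 7 = 6 blocks/SM
  registers:     32768 / (224 x 28) = 5 blocks/SM   <- binding limit
  shared memory: 49152 / 4096 = 12 blocks/SM (capped at 8)
  occupancy = (5 x 7 warps) / 48 warps = 73%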
Outline
Maximizing Memory Throughput
Maximizing Instruction Throughput
Maximizing Scheduling Throughput
Special Optimizations
Special Optimizations
Device-only CUDA intrinsic functions
Faster implementation with reduced accuracy
Use the compiler option (-use_fast_math) to force every such math function to compile to its intrinsic counterpart,
or selectively replace individual mathematical function calls with calls to intrinsic functions
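For example (a sketch; __expf and __fdividef are documented CUDA device intrinsics):

__global__ void scaled_exp(const float* x, float* y, float denom, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // precise but slower:  y[i] = expf(x[i]) / denom;
        // faster, reduced-accuracy intrinsics:
        y[i] = __fdividef(__expf(x[i]), denom);
    }
}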
How is this relevant to writing fast code?
Fast Platforms: multicore platforms, manycore platforms, cloud platforms
Good Techniques: data structures, algorithms, software architecture
Introduced the manycore platform HW and SW mental models
Introduced the terminology for you to start FLIRTing with the technology
Introduced design trade-offs in data structures with some algorithms