
18-646 – How to Write Fast Code II
Carnegie Mellon University

Important Dates
— Mini-Project 1 – Important Dates:
  — Mon, March 15th @ 11:59PM EST: Mini-Project One Due
  — Thu, March 18th @ 3:00PM EST: Mini-Project One Review
— Term Project – Important Dates:
  — Fri, March 12th @ 11:59PM EST: Project Proposal Due (1-3 pages)
  — Tue, May 4th: Poster Presentation
  — Fri, May 14th: Final Project Report (~10 pages)
— No class on Thursday this week
— Final Exam: Friday, May 14th at 1:00-4:00PM EST

Term Project – Proposal
— The Project Proposal must contain the following:
1) Title
2) List of team members
3) Description of the problem you want to address
4) Description of how you propose to solve the problem
— Platform? Techniques?
5) Outline of your expectations
— improvements in throughput compared to baseline implementation (current or sequential)
6) A timeline with milestones
7) List of any special infrastructure requirements
— Submit early!

Mini-Project 1


What we discussed last time:
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture

— Highlighted the difference between multicore and manycore platforms
— Exposing concurrency in k-means; exploiting parallelism by exploring mappings from application to platform
— Multicore platforms: OpenMP, performance metrics, roofline model
— Manycore platforms: GPUs and CUDA

Outline
— Maximizing Memory Throughput
— Maximizing Instruction Throughput
— Maximizing Scheduling Throughput
— Special Optimizations

Maximize Memory Throughput
M1. SoA vs AoS
M2. Memory coalescing
M3. Use of shared memory
M4. Memory bank conflict
M5. Padding
“Highly parallel processors turn compute-limited algorithms into memory-limited algorithms”
[Figure: NVIDIA GTX 280 (SP) Roofline Model. GFLOPS vs. operational intensity (FLOPS/Byte); compute ceilings: 622.0 GFLOPS peak, 311.0 with MUL/ADD imbalance, 38.9 with instruction stream divergence; bandwidth ceilings: shared memory 1244 GB/s, GDDR 141.7 GB/s, PCI Express 2.5 GB/s. M1/M2 target the GDDR ceiling; M3/M4/M5 target the shared memory ceiling.]

(M1) SOA vs AOS
Struct of Arrays:

    typedef struct
    {
        float* x;
        float* y;
        float* z;
    } Constraints;          // memory layout: xxxyyyzzz

vs. Array of Structs:

    typedef struct __align__(16)
    {
        float3 position;
    } Constraint;           // memory layout: xyzxyzxyz

— Important distinction based on application memory access patterns
— Which one is “better”?

SOA vs AOS? It depends…
— What is the computation?
  — For each point: compute its distance from the origin
— What is the optimization goal?
— Example 1:
  — You have a list of coordinates, and you want to find the distance from the origin to all points
— What should be the data structure?

SOA vs AOS? Example 1
— What is the computation?
  — For each point: compute its distance from the origin
— What is the optimization goal?
  — Minimize memory bandwidth to DRAM
— Example 1:
  — You have a list of coordinates, and you want to find the distance from the origin to all points
— What should be the data structure?

    typedef struct
    {
        float* x;
        float* y;
        float* z;
    } pointIn3D;            // SoA: xxxyyyzzz

    OR

    typedef struct __align__(16)
    {
        float3 position;
    } pointIn3D;            // AoS: xyzxyzxyz

— With SoA, assign one point per thread: thread tid reads x[tid], y[tid], z[tid], so consecutive threads touch consecutive words and every load is coalesced. With AoS, words loaded sequentially in each segment sit unused before they are needed, wasting bandwidth.
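A minimal CUDA sketch of the SoA choice (hypothetical kernel, not from the slides; one thread per point, with the launch assumed to cover all n points):

    __global__ void distancesSoA(const float *x, const float *y,
                                 const float *z, float *dist, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            // Consecutive threads read consecutive words of each array,
            // so every warp load is coalesced
            float px = x[tid], py = y[tid], pz = z[tid];
            dist[tid] = sqrtf(px * px + py * py + pz * pz);
        }
    }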

SOA vs AOS? Example 2
— What is the computation?
  — For a point: compute its distance from the origin
— What is the optimization goal?
  — Minimize memory bandwidth to DRAM
— Example 2:
  — Find the distance from the origin to a list of points, estimated to be ~1% of all points
— What should be the data structure?

    typedef struct
    {
        float* x;
        float* y;
        float* z;
    } pointIn3D;            // SoA: xxxyyyzzz

    OR

    typedef struct __align__(16)
    {
        float3 position;
    } pointIn3D;            // AoS: xyzxyzxyz

— For a sparse subset, AoS wins: one aligned 16-byte struct load fetches all of x, y, z for a point, and you don’t need to load all the neighboring points that the three SoA segments would drag in.
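A matching sketch for the sparse case (hypothetical names; indices holds the ~1% of points to process, and pointIn3D is the AoS struct above). Each selected point costs one aligned 16-byte load instead of touching three separate arrays:

    __global__ void distancesAoS(const pointIn3D *pts, const int *indices,
                                 float *dist, int m)  // m = number of selected points
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < m) {
            float3 p = pts[indices[tid]].position;    // one 16-byte load per point
            dist[tid] = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
        }
    }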

(M2) Memory Coalescing
— Hardware Constraint: DRAM is accessed in “segments” of 32B/64B/128B
— Unused data loaded in a “segment” still takes up valuable bandwidth
— Goal: combine multiple memory accesses generated from multiple threads into a single physical transaction
— increases effective throughput to DRAM
— Rules for maximizing DRAM memory bandwidth:
— Possible bus transaction sizes: 32B, 64B, or 128B
— Memory segment must be aligned: First address = multiple of segment size
— Hardware coalescing for each warp: 32-word wide
Kirk, Hwu, Programming Massively Parallel Processors, Chapter 6.2
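To make the rules concrete, a small sketch (not from the slides) of how the index expression decides whether a warp's loads coalesce:

    __global__ void coalescingDemo(const float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a = in[tid];       // consecutive words per warp: one fully used transaction
        float b = in[2 * tid];   // stride 2: the warp spans two segments, half of each wasted
        float c = in[tid + 1];   // shifted off segment alignment: may split into two transactions
        out[tid] = a + b + c;
    }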

Examples:
— Are these coalesced?
[Figure: 32-thread warp access patterns with answers: YES (fully coalesced); YES (coalesced, but part of each loaded segment is unused and wasted); NO (not coalesced).]

When is the Access Coalesced?
— “Threads can access any words in any order, including the same words, and a single memory transaction is issued for each segment addressed by the warp”
— Detailed protocol:
  1) Find the memory segment that contains the address requested by the active thread with the lowest thread ID:
     — 32 bytes for 1-byte words
     — 64 bytes for 2-byte words
     — 128 bytes for 4-, 8- and 16-byte words
  2) Find all other active threads whose requested address lies in the same segment.
  3) Reduce the transaction size, if possible:
     — If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
     — If the transaction size is 64 bytes (originally or after reduction from 128 bytes) and only the lower or upper half is used, reduce the transaction size to 32 bytes.
  4) Carry out the transaction and mark the serviced threads as inactive.
  5) Repeat until all threads in the warp are serviced.

(M3) Use of Shared Memory
— Take advantage of 9x faster memory bandwidth
— Process:
  — Load from DRAM to shared memory
  — Synchronize
  — Perform work on data in shared memory
  — Synchronize
  — Write out results to DRAM
[Figure: NVIDIA GTX 280 (SP) Roofline Model, repeated from before; M3 moves a kernel's working set from the GDDR ceiling (141.7 GB/s) up to the shared memory ceiling (1244 GB/s).]
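A minimal sketch of this load / synchronize / compute / synchronize / store pattern (hypothetical kernel; assumes blockDim.x == 256 and n a multiple of 256):

    __global__ void reverseBlocks(const float *in, float *out, int n)
    {
        __shared__ float s[256];
        int base = blockIdx.x * blockDim.x;
        s[threadIdx.x] = in[base + threadIdx.x];      // 1. load DRAM -> shared memory
        __syncthreads();                              // 2. synchronize
        float v = s[blockDim.x - 1 - threadIdx.x];    // 3. work on data in shared memory
        __syncthreads();                              // 4. synchronize (kept to mirror the
                                                      //    process; v is already in a register)
        out[base + threadIdx.x] = v;                  // 5. write results to DRAM
    }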

Double Buffering
— One could double buffer the computation
  — Getting better instruction mix within each thread
  — Classic software pipelining in ILP compilers

Original:

    Loop {
        Load current tile to shared memory
        syncthreads()
        Compute current tile
        syncthreads()
    }

Double Buffered:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        syncthreads()
        Load next tile from global memory
        Compute current tile
        syncthreads()
    }

Kirk, Hwu, Programming Massively Parallel Processors, Figure 6.13
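A concrete sketch of the double-buffered loop (hypothetical kernel; launched as a single block of TILE threads, with a trivial neighbor sum standing in for the real computation):

    #define TILE 256

    __global__ void pipelined(const float *in, float *out, int nTiles)
    {
        __shared__ float tile[2][TILE];           // two buffers to ping-pong between
        int t = threadIdx.x;
        int buf = 0;
        float next = in[t];                       // load first tile into a register
        for (int i = 0; i < nTiles; ++i) {
            tile[buf][t] = next;                  // deposit current tile to shared memory
            __syncthreads();
            if (i + 1 < nTiles)
                next = in[(i + 1) * TILE + t];    // load next tile while computing
            out[i * TILE + t] =
                tile[buf][t] + tile[buf][(t + 1) % TILE];  // compute current tile
            buf ^= 1;                             // swap buffers
            __syncthreads();                      // kept to match the pseudocode above
        }
    }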

Using Shared Memory
— Two approaches to using shared memory
(1) Declare a fixed-size variable at compile time

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

(2) Define a size to be used at run time

    __global__ void mykernel(int a, float *objects) {
        extern __shared__ char sharedMemory[];
        unsigned char *membershipChanged = (unsigned char *)sharedMemory;
        float *clusters = (float *)(sharedMemory + blockDim.x);
        ...
    }

    // In host code
    mykernel <<< nBlks, nThds, shmemByteSize >>> (a, objects);
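For the run-time variant, the launch parameter must cover every array carved out of the dynamic buffer. A hedged host-side sizing sketch for the kernel above (numClusters and numCoords are assumed names from a k-means-style kernel; carving floats out of a byte buffer also assumes nThds keeps the float pointer 4-byte aligned):

    // one byte per thread for membershipChanged[], then the clusters array
    size_t shmemByteSize = nThds * sizeof(unsigned char)
                         + numClusters * numCoords * sizeof(float);
    mykernel <<< nBlks, nThds, shmemByteSize >>> (a, objects);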

Remember Matrix Transposition?
— One can use the shared memory:
  — Load Matrix B with coalesced memory access into shared memory
  — “Random access” from shared memory
  — No longer limited by uncoalesced memory access

Good conceptually, but does it work?

[Figure: a tile of the matrix in shared memory; a coalesced load brings in rows of consecutive word addresses (32, 33, 34, 35, ...; R+32, R+33, ...; 2R+32, ...; 3R+32, ..., where R is the row pitch), and the transpose then reads columns.]

(M4) Memory Bank Conflicts
— Shared memory has 32 banks
  — Organized such that successive 32-bit words are assigned to successive banks
  — Each bank has a bandwidth of 32 bits per two clock cycles (2-cycle latency)
— A bank conflict occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank
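A small sketch (not from the slides) of how access stride maps onto banks, given bank = word address mod 32:

    __global__ void bankDemo(float *out)
    {
        __shared__ float s[128];
        int t = threadIdx.x;                 // lane 0..31 within a warp
        for (int i = t; i < 128; i += 32)
            s[i] = (float)i;                 // initialize the shared array
        __syncthreads();
        float a = s[t];                      // stride 1: 32 distinct banks, no conflict
        float b = s[2 * t];                  // stride 2: lanes t and t+16 share a bank (2-way conflict)
        float c = s[3 * t];                  // stride 3: coprime with 32, no conflict
        float d = s[0];                      // all lanes read the same word: broadcast, no conflict
        out[t] = a + b + c + d;
    }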

Examples:
— Left: Linear addressing with a stride of one 32-bit word
  — no bank conflict
— Middle: Linear addressing with a stride of two 32-bit words
  — 2-way bank conflicts
— Right: Linear addressing with a stride of three 32-bit words
  — no bank conflict

Examples:
— Left: Conflict-free access via random permutation
— Middle: Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5
— Right: Conflict-free broadcast access: all threads access the same word

How do we make the matrix transposition work?

(M5) Padding Technique
— We have seen padding used to align data structures in project 1
— Padding can also be used to offset memory bank conflicts

[Figure: the transposition tile, before and after padding. Left: rows of word addresses 32...63, R+32...R+63, 2R+32...2R+63, 3R+32..., so column reads hit the same bank. Right: the same tile with one padding word (*) appended to each row, shifting successive rows by one bank so that column reads become conflict-free.]
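Applied to the matrix transposition, the fix is one extra padding column in the shared tile. A minimal sketch (hypothetical kernel; assumes a square N x N matrix with N a multiple of 32, and 32x32 thread blocks):

    __global__ void transpose(const float *A, float *B, int N)
    {
        __shared__ float tile[32][33];   // 33 = 32 + 1 padding word per row

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = A[y * N + x];   // coalesced row-wise load
        __syncthreads();

        x = blockIdx.y * 32 + threadIdx.x;               // swap block indices for the store
        y = blockIdx.x * 32 + threadIdx.y;
        B[y * N + x] = tile[threadIdx.x][threadIdx.y];   // column read: padding makes it conflict-free
    }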

Maximize Memory Throughput
M1. SoA vs AoS
M2. Memory coalescing
M3. Use of shared memory
M4. Memory bank conflict
M5. Padding

[Figure: NVIDIA GTX 280 (SP) Roofline Model, repeated from before, with M1-M5 annotated against the GDDR and shared memory bandwidth ceilings.]

Outline
— Maximizing Memory Throughput
— Maximizing Instruction Throughput
— Maximizing Scheduling Throughput
— Special Optimizations

Maximizing Instruction Throughput
(I1) Branch divergence
(I2) Optimize instruction mix
[Figure: NVIDIA GTX 280 (SP) Roofline Model, repeated from before; I1 targets the “instruction stream divergence” compute ceiling (38.9 GFLOPS), I2 the “MUL/ADD imbalance” ceiling (311.0 GFLOPS).]

Examples:
Example 1:

    tid = threadIdx.x;
    if (a[tid] > 0) {
        x += 1;
    } else {
        if (b[tid] > 0) {
            x += 2;
        } else {
            x += 3;
        }
    }

Example 2:

    Original Code:

        if (c > 0) {
            x = x*a1 + b1;
            y = y*a1 + b1;
        } else {
            x = x*a2 + b2;
            y = y*a2 + b2;
        }

    Optimized Code:

        if (c > 0) {
            a = a1;
            b = b1;
        } else {
            a = a2;
            b = b2;
        }
        x = x*a + b;
        y = y*a + b;

— Optimization:
  — Factor out decision variables to shorten the divergent code sequence

Tianyi David Han, Tarek S. Abdelrahman, Reducing Branch Divergence in GPU Programs, GPGPU-4, Mar 05-05 2011, Newport Beach, CA, USA

(I1) Branch Divergence
— At every instruction issue, the SIMT unit selects a warp that is ready to execute
— A warp executes one common instruction at a time
  — Full efficiency is realized when all 32 threads of a warp agree on their path
— If threads of a warp diverge via a data-dependent conditional branch:
  — the warp serially executes each branch path taken
  — disabling threads that are not on that path
  — when all paths complete, the threads converge back to the same execution path
— Branch divergence occurs only within a warp

Maximizing Instruction Throughput
[Figure: NVIDIA GTX 280 (SP) Roofline Model, repeated, highlighting the I1 (instruction stream divergence) and I2 (MUL/ADD imbalance) ceilings.]

(I2) Optimizing Instruction Mix
— Compiler Assisted Loop Unrolling
— Provides more instruction level parallelism for the compiler to use
— Improves the ability for the compiler to find the instruction mix that increases instructions executed per cycle (IPC)
— By default, the compiler unrolls small loops with a known trip count
— In CUDA, the #pragma unroll directive can control unrolling of any given loop
  — Must be placed immediately before the loop and only applies to that loop
— Optionally followed by a number
— Specifies how many times the loop must be unrolled

Compiler Assisted Loop Unrolling
— Example 1:
— Loop to be unrolled 5 times
#pragma unroll 5
    for (int i = 0; i < n; ++i)

— Example 2:
  — Preventing the compiler from unrolling a loop

    #pragma unroll 1
    for (int i = 0; i < n; ++i)

— Example 3:

    #pragma unroll
    for (int i = 0; i < n; ++i)

  — If n is a constant, the loop is fully unrolled
  — If n is a variable, the loop is not unrolled at all


Outline
— Maximizing Memory Throughput
— Maximizing Instruction Throughput
— Maximizing Scheduling Throughput
— Special Optimizations


Maximizing Scheduling Throughput
— For a particular application, which kernel is faster?
  — Kernel 1:
    — 256 threads/block
    — 17 registers/thread
    — 10KB shared mem/block
  — Kernel 2:
    — 196 threads/block
    — 28 registers/thread
    — 4KB shared mem/block
— Must respect the physical limitations of the processor!


NVIDIA Fermi Architecture
— Occupancy:
  — Ability of a CUDA kernel to occupy concurrent contexts in a SM (Streaming Multiprocessor)
  — Specifically, the ratio of active warps to the maximum number of warps supported
  — Helpful in determining how efficient the kernel could be on the GPU
— CUDA Occupancy Calculator
  — A programmer tool for computing the multiprocessor “occupancy”
  — https://docs.nvidia.com/cuda/cuda-occupancy-calculator
— Fermi physical limits:
  — Threads per Warp: 32
  — Warps per Multiprocessor: 48
  — Threads per Multiprocessor: 1536
  — Thread Blocks per Multiprocessor: 8
  — Total # of 32-bit registers per Multiprocessor: 32768
  — Shared Memory per Multiprocessor (bytes): 49152
  — Shared Memory Allocation unit size: 128


CUDA Occupancy Calculator
— 196 threads/block, 28 registers/thread, 4KB shared mem/block → 73% occupancy


How to Get the Parameters
— Threads/block:
  — Programmer specified at __global__ function launch
— Registers/thread:
  — Use a compiler option to display at compile time: --ptxas-options=-v
  — Expected output:
    ptxas info: Compiling entry function '_XYZ_' for 'sm_20'
    ptxas info: Used 25 registers, 3616+0 bytes smem, 53 bytes cmem[0], 4 bytes cmem[16]
— Shared memory/block:
  — If determined at runtime, a user-specified variable
  — If determined at compile time, see output from: --ptxas-options=-v


Maximizing Scheduling Throughput
— Kernel 1:
  — 256 threads/block
  — 17 registers/thread
  — 10KB shared mem/block
  — Occupancy: 67%


Maximizing Scheduling Throughput
— Kernel 2:
  — 196 threads/block
  — 28 registers/thread
  — 4KB shared mem/block
  — Occupancy: 73%


Outline
— Maximizing Memory Throughput
— Maximizing Instruction Throughput
— Maximizing Scheduling Throughput
— Special Optimizations


Special Optimizations
— Device-only CUDA intrinsic functions
  — Faster implementation with reduced accuracy
— Use the compiler option (-use_fast_math) to force each function on the left to compile to its intrinsic counterpart
— Or selectively replace mathematical function calls by calls to intrinsic functions


How is this relevant to writing fast code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture

— Introduced the manycore platform HW and SW mental models
— Introduced the terminologies for you to start FLIRTing with the technology
— Introduced design trade-offs in data structures with some algorithms