18-646 – How to Write Fast Code II
Carnegie Mellon University
Mini-Project One
Mini-Project One Due Monday, 03/08 @ 11:59PM
Mini-Project One Review Thursday, 03/11
One person on your team should add the other team members to your final code submission
Mini-Project One – Testing
The time taken for execution on Gradescope can vary quite a bit
The final performance of your submitted code (the code that exists on Gradescope on Monday 03/08 @ 11:59PM PST) will be evaluated on a GHC machine
Test your code for accuracy and speed before uploading to Gradescope
Matrix Multiplication
$ ./matrix_mul -i ../matrix_mul_03.dat -o
K-Means
$ ./seq_main -o -i kmeans03.dat -n 15
$ mv kmeans03.dat.membership seq_main.kmeans03_n15.membership
$ ./omp_main -o -i kmeans03.dat -n 15
$ diff -y --suppress-common-lines kmeans03.dat.membership seq_main.kmeans03_n15.membership
Mini-Project One – MP1.1 Test Cases
matrix_mul_03.dat
Mini-Project One – MP1.2 Test Cases
How is this relevant to writing fast code?
Fast Platforms
Multicore platforms Manycore platforms Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Introduced the manycore platform HW and SW mental models
Introduced the terminologies for you to start FLIRTing with the technology
Outline
Multicore and Manycore Differences
Hardware and Software Mental Models
Anatomy of an Application
The CUDA Platform
Multicore and Manycore Differences
What’s the difference between Multicore and Manycore?
Multicore: yoke of oxen
Manycore: flock of chickens
Slide by Bryan Catanzaro
Multicore and Manycore Differences
Fundamentally different design philosophy
Multicore: Optimized for reducing execution latency of a few threads
Sophisticated instruction control, large caches
Each core optimized for executing a single thread
Manycore: Assumes 1000-way concurrency is readily available in applications
More resources dedicated to compute (a lot of ALUs, but simple ones)
Cores optimized for aggregate throughput, deemphasizing individual performance
Significant Architectural Difference
Specifications: Intel Core i7-2600K (CPU) vs. NVIDIA GTX 580 (GPU)
Processing Elements: 4 cores, 8-way SIMD @ 2.5-3.4 GHz vs. 16 cores, 16-way SIMD, dual issue @ 1.55 GHz (0.46x – 0.62x)
Resident Threads (max): 4 cores, 2 threads, 8-wide SIMD = 64 strands vs. 16 cores, 48 SIMD vectors, 32-wide SIMD = 24,576 strands (384x)
SP GFLOP/s: 160-218 vs. 1587 (7.3x – 9.9x)
Memory Bandwidth: 21.4 GB/s – 42.7 GB/s vs. 192.4 GB/s (4.5x – 9.0x)
Die info: 995M transistors, 32nm process, 216 mm² vs. 3B transistors, 40nm process, 520 mm²
GPU is an Accelerator
A GPU cannot be used by itself; it is an accelerator attached to a host system
[Figure: GPU system architecture. Host System (CPU) with its own DRAM, connected over PCI-E to the Device System (GPU) with its own DRAM]
When Does Using a GPU Make Sense?
Applications with a lot of concurrency
1000-way, fine-grained concurrency
Some memory-intensive applications
Aggregate memory bandwidth is higher on the GPU
Advantage diminishes when task granularity becomes too large to fit in shared memory
[Figure: Penryn (SP) roofline model. Peak 50.5 GFLOPS (12.6 GFLOPS without SIMD); memory ceilings: L1 50.6 GB/s, L2 25.3 GB/s, DRAM 10.6 GB/s, plotted over operational intensity (FLOPS/Byte)]
[Figure: NVIDIA GTX 280 (SP) roofline model. Peak 622.0 GFLOPS (311.0 with MUL/ADD imbalance, 38.9 with instruction stream divergence); memory ceilings: shared memory 1244 GB/s, GDDR 141.7 GB/s, PCI Express 2.5 GB/s, plotted over operational intensity (FLOPS/Byte)]
Coarse-grained 1000-way concurrency?
[Figure (© Jike Chong 2009): task management overhead (instructions, 10^1 to 10^9) vs. synchronization overhead (seconds, 10^-9 to 10^0). HW task queues, SW task queues, OS processes, and jobs over networks incur increasing overheads, corresponding to on-die cache latency, shared off-chip memory latency, system latency, and network latency. Manycore throughput-optimized implementations sit at the low-overhead end; multicore task queue-based, Pthread-based, MPI-based, and remote procedure call-based implementations sit at progressively higher overheads.]
Outline
Multicore and Manycore Differences
Hardware and Software Mental Models
Anatomy of an Application
The CUDA Platform
Hardware and Software Mental Models
Mental models → help us make design trade-offs
Use CUDA terminology to introduce manycore programming
CUDA is designed to be “functionally forgiving”
Easy to get a correct program running
Can invest more time to improve performance
Achieving good performance requires a good understanding of hardware constraints
CUDA: Compute Unified Device Architecture
A parallel computing architecture developed by NVIDIA
Integrated host+device app C program
Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);
…
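As a concrete sketch of this alternation (hypothetical kernels scaleA and offsetB, assumed sizes), serial host code interleaves with parallel device kernels like this:

#include <cuda_runtime.h>

__global__ void scaleA( int n, float* x ) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;                /* parallel work for KernelA */
}
__global__ void offsetB( int n, float* x ) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;                /* parallel work for KernelB */
}

int main() {
  int n = 1 << 20, nTid = 256, nBlk = (n + nTid - 1) / nTid;
  float* dev_x;
  cudaMalloc( &dev_x, n*sizeof(float) );
  cudaMemset( dev_x, 0, n*sizeof(float) );

  /* serial code (host): set up data, decide what to launch ... */
  scaleA<<< nBlk, nTid >>>( n, dev_x );   /* parallel kernel (device) */
  /* serial code (host) resumes here ... */
  offsetB<<< nBlk, nTid >>>( n, dev_x );  /* another parallel kernel (device) */

  cudaDeviceSynchronize();                /* wait for the device to finish */
  cudaFree( dev_x );
  return 0;
}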
NVIDIA Fermi Architecture
16-way many-core, 48-way multi-threaded (warps)
32-wide SIMD
2 MB (16 x 128 KB) registers
1 MB (16 x 64 KB) L1 cache
0.75 MB L2 Cache
Note the inverted memory hierarchy: more register file (2 MB) than L1/shared memory (1 MB), and more L1/shared memory than L2 cache (0.75 MB)
Why the inversion?
Why Inversion in Mem Hierarchy?
For each kind of operation we can ask about latency, throughput, and concurrency (= latency × throughput):
Task: latency = time from task start to task finish (seconds); throughput = # of tasks executed per unit time (tasks/second); concurrency = # of tasks concurrently managed (tasks)
Data: latency = time from request to receiving data (seconds); throughput = # of Bytes transferred per unit time (Bytes/second); concurrency = # of memory operations concurrently managed (memory instructions)
Synchronization: latency = time from start of synchronization to completion (seconds); throughput = # of sync operations per unit time (sync ops/second); concurrency = # of sync operations in flight at the same time (sync operations)
On GPUs we care more about throughput than latency
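A quick worked example of the concurrency relation, with assumed, illustrative numbers: if DRAM latency is about 400 cycles and we want to sustain roughly 1 memory operation per cycle, then concurrency = 400 cycles × 1 op/cycle = about 400 memory operations must be in flight at once. Sustaining throughput despite long latency therefore demands massive concurrency, which is why the hardware spends its on-chip storage on register files and warp contexts rather than on a large cache.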
Memory Wall
“Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
[Figure: processor speed vs. memory access speed over time; the gap keeps growing]
Getting Around the Memory Wall
“Memory Wall”:
Increasing gap between Processor and DRAM performance
Manycore Processors:
Utilize application concurrency to hide memory latency
How?
Krste Asanovic, UC Berkeley CS152 Spring 2010
Fine-Grained Multi-Threading
Each Fermi core can maintain 48 warps of architectural context
Each warp manages a 32-wide SIMD vector worth of computation
With ~20 registers for each thread:
4 (Bytes/register) x 20 (registers) x 32 (SIMD lanes) x 48 (warps) ≈ 120 KB, filling most of the 128 KB register file per core
128 KB x 16 (cores) → 2 MB total of register files
Why Warps?
Software abstraction to hide an extra level of architectural complexity
A 128 KB register file is a large memory
It takes more than one clock cycle to retrieve information
Hardware provides 16-wide physical SIMD units and half-pumps the register files:
Provides half the operands per clock cycle, then the other half the following cycle
To simplify the programming model:
Assume we are only working with a 32-wide SIMD unit, where each 32-bit instruction has a bit more latency
How to Deal with GPUs of Different Sizes?
GPU                                   Compute Capability   Number of Multiprocessors   Number of SIMD Lanes
GeForce GTX 580                       2.0                  16                          512
GeForce GTX 570, GTX 480              2.0                  15                          480
GeForce GTX 470                       2.0                  14                          448
GeForce GTX 560 Ti                    2.1                  8                           384
GeForce GTX 460                       2.1                  7                           336
GeForce GTX 470M                      2.1                  6                           288
GeForce GTX 285, GTX 280, GTX 275     1.3                  30                          240
GeForce GTS 450, GTX 460M             2.1                  4                           192
GeForce GTX 260                       1.3                  24                          192
GeForce GT 445M                       2.1                  3                           144
GeForce 8800 Ultra, 8800 GTX          1.0                  16                          128
GeForce 9800 GT, 8800 GT, …           1.1                  14                          112
GeForce GT 415M                       2.1                  1                           48

We would like the same program to run on both a GTX 580 and a GT 415M
CUDA provides an abstraction for concurrency to be fully exposed
HW/Runtime provides the capability to schedule the computation
Thread Blocks
Computation is grouped into blocks of independent, concurrently executable work
Fully exposes the concurrency in the application
The HW/Runtime makes the decision to selectively sequentialize the execution as necessary
What are some implications or limitations?
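For instance, a minimal sketch (hypothetical kernel, assumed sizes) that exposes the work as many small, independent blocks; the HW/runtime is then free to run them in parallel or sequentialize them as needed:

__global__ void addOne( int n, float* x ) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;  /* each block covers a disjoint slice of x */
  if (i < n) x[i] += 1.0f;                      /* blocks do not communicate with each other */
}

int main() {
  int n = 1 << 20;
  float* dev_x;
  cudaMalloc( &dev_x, n*sizeof(float) );
  cudaMemset( dev_x, 0, n*sizeof(float) );

  dim3 block( 256 );                            /* threads per block */
  dim3 grid( (n + block.x - 1) / block.x );     /* enough independent blocks to cover n elements */
  addOne<<< grid, block >>>( n, dev_x );        /* HW/runtime decides how blocks map onto cores */

  cudaDeviceSynchronize();
  cudaFree( dev_x );
  return 0;
}

One implication: because any block may be sequentialized after any other, blocks cannot wait on each other within a kernel; synchronization is only guaranteed among threads inside a block.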
Threads
Threads are the computation performed in each SIMD lane in a core
CUDA provides a SIMT programming abstraction to assist users
SIMT: Single Instruction Multiple Threads
A single instruction controls multiple processing elements
Different from SIMD – SIMD exposes the SIMD width to the programmer
SIMT abstracts the # of threads in a thread block as a user-specified parameter
SIMT enables programmers to write thread-level parallel code for independent, scalar threads, and data-parallel code for coordinated threads
For functional correctness, programmers can ignore SIMT behavior
For performance, programmers can tune applications with SIMT in mind (a small sketch follows below)
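A small sketch (hypothetical kernel; a host-side launch like the vcos example is assumed) of the SIMT style: each thread runs ordinary scalar code, including a data-dependent branch, and the hardware groups threads onto SIMD lanes behind the scenes:

__global__ void relu( int n, const float* x, float* y ) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;  /* each thread handles one scalar element */
  if (i < n) {
    if (x[i] > 0.0f) y[i] = x[i];               /* per-thread branches are legal under SIMT;  */
    else             y[i] = 0.0f;               /* divergence within a warp only costs speed  */
  }
}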
What About Data?
SIMD (or SIMT) style programming can be very restrictive for communication between SIMD lanes
On the same chip, in the same core, computation in SIMD lanes (physically) takes place very close together
How can we exploit this physical proximity in the hardware while providing a generalizable construct in software?
Shared Memory/L1 Cache
Manycore processors provide memory local to each core
Computations in SIMD lanes in the same core can communicate via memory reads/writes
Two types of memory:
Programmer-managed scratch pad memory
HW-managed L1 cache
For the NVIDIA Fermi architecture, you get 64 KB per core with two configurations (a usage sketch follows below):
48 KB scratch pad (Shared Memory), 16 KB L1 cache
16 KB scratch pad (Shared Memory), 48 KB L1 cache
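A minimal sketch (hypothetical kernel; dev_d and n are assumed to be a device array and a multiple of 256) of using programmer-managed shared memory, plus the Fermi-era runtime call for picking one of the two configurations:

__global__ void reverseTile( float* d ) {
  __shared__ float tile[256];                    /* programmer-managed scratch pad, one tile per block */
  int t = threadIdx.x;
  int g = blockIdx.x*blockDim.x + t;
  tile[t] = d[g];                                /* stage data in fast on-chip memory */
  __syncthreads();                               /* all SIMD lanes in the block see the staged data */
  d[g] = tile[blockDim.x - 1 - t];               /* lanes communicate through shared memory */
}

void launchReverse( int n, float* dev_d ) {      /* host side */
  /* request the 48 KB shared memory / 16 KB L1 split for this kernel */
  cudaFuncSetCacheConfig( reverseTile, cudaFuncCachePreferShared );
  reverseTile<<< n/256, 256 >>>( dev_d );
}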
How Many Threads per Thread Block?
If I can efficiently communicate between threads in a thread block,
why not just put all my work in one thread block?
A few reasons:
A manycore processor has more than one core
If one uses only one core, one is not fully utilizing the available HW
Hardware must maintain the context of all threads in a thread block
There is a limited amount of resources on-chip
In Fermi, 48 warps of context are maintained per core
In Fermi, each thread block can have up to 1024 threads
This changes between generations (used to be a max of 768 threads/block)
One can often achieve higher performance with fewer threads/block but multiple blocks concurrently running on the same core (see the sketch below)
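A minimal sketch (hypothetical kernel, assumed sizes) contrasting one maximal-block launch with a smaller-block launch; the smaller blocks let several blocks co-reside on each core:

__global__ void scale( int n, float* x ) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;
}

void launchBoth( int n, float* dev_x ) {          /* dev_x: device array; n assumed a multiple of 1024 */
  /* one alternative: few, maximal blocks (1024 threads each) */
  scale<<< n / 1024, 1024 >>>( n, dev_x );

  /* often better: more, smaller blocks (256 threads each), so several
     blocks can be resident on each core and more cores stay busy */
  scale<<< n / 256, 256 >>>( n, dev_x );
}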
Hardware and Software Mental Models
Summary of what we have discussed:

Hardware Compute Units                    Software Abstractions    Memory Resources
SIMD Lanes                                Threads                  Register File
SIMD Execution Granularity                Warps                    Registers / Shared Memory, L1 Cache
Cores (Streaming Multiprocessors – SM)    Thread Blocks            Shared Memory, L1 Cache
Multiple SMs                              Grids                    L2 Cache
Outline
Multicore and Manycore Differences
Hardware and Software Mental Models
Anatomy of an Application
The CUDA Platform
Anatomy of a CUDA Program
__global__ void vcos( int n, float* x, float* y ) {
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  y[ix] = cos( x[ix] );
}

int main() {
  float *host_x, *host_y;
  float *dev_x, *dev_y;
  int n = 1024;
  host_x = (float*)malloc( n*sizeof(float) );
  host_y = (float*)malloc( n*sizeof(float) );
  cudaMalloc( &dev_x, n*sizeof(float) );
  cudaMalloc( &dev_y, n*sizeof(float) );

  /* TODO: fill host_x[i] with data here */
  cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice );

  /* launch 1 thread per vector-element, 256 threads per block */
  int bk = (int)( n / 256 );
  vcos<<< bk, 256 >>>( n, dev_x, dev_y );

  cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );
  /* host_y now contains cos(x) data */
  return( 0 );
}
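One practical addition (not part of the original listing) is to check for errors after the launch, since kernel launches are asynchronous. The launch portion of main() could be extended with standard runtime calls like this (printf requires <stdio.h>):

  vcos<<< bk, 256 >>>( n, dev_x, dev_y );
  cudaError_t err = cudaGetLastError();          /* catches bad launch configurations */
  if (err != cudaSuccess) printf( "launch failed: %s\n", cudaGetErrorString(err) );
  err = cudaDeviceSynchronize();                 /* waits for the kernel; surfaces runtime errors */
  if (err != cudaSuccess) printf( "kernel failed: %s\n", cudaGetErrorString(err) );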
Data Structure on Different Devices
(Same vcos program as the previous slide)
malloc( n*sizeof(float) ) allocates host_x and host_y in host (CPU) DRAM
cudaMalloc( &dev_x, n*sizeof(float) ) allocates dev_x and dev_y in device (GPU) DRAM
The two memory spaces are physically separate and connected by PCI-E; cudaMemcpy moves data between them
[Figure: Host System (CPU) DRAM ↔ PCI-E ↔ Device System (GPU) DRAM]
CUDA Memory Operations
cudaMalloc(void ** pointer, size_t nbytes)
cudaMemset(void * pointer, int value, size_t count)
cudaFree(void* pointer)
int n = 1024;
int nbytes = 1024*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes);
cudaFree(a_d);
Data Transfers: Host ⇔ Device
cudaMemcpy moves data across PCI-E between host (CPU) DRAM and device (GPU) DRAM
In the vcos example:
cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice );   /* from CPU to GPU, before the kernel launch */
cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );   /* from GPU back to CPU, after the kernel */
The first argument is the destination, the second is the source
[Figure: Host System (CPU) DRAM ↔ PCI-E ↔ Device System (GPU) DRAM]
CUDA Memory Copy Operations
cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction );
direction specifies the locations (host or device) of src and dst
Blocks the CPU thread: returns after the copy is complete
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind (a short usage sketch follows below):
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
(cudaMemcpyPeer is the separate call for copies between two devices)
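A short sketch (assumed pointers and size: dev_a and dev_b are device allocations of nbytes, host_buf is a host allocation) of the common copy directions:

void copyAround( float* dev_a, float* dev_b, float* host_buf, size_t nbytes ) {
  cudaMemcpy( dev_a, host_buf, nbytes, cudaMemcpyHostToDevice );   /* host -> device */
  cudaMemcpy( dev_b, dev_a,    nbytes, cudaMemcpyDeviceToDevice ); /* device -> device, stays on the GPU */
  cudaMemcpy( host_buf, dev_b, nbytes, cudaMemcpyDeviceToHost );   /* device -> host; blocks the CPU thread */
}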
Calling a __global__ Function
The __global__ qualifier declares a function as being a kernel. Such a function is:
Executed on the device
Callable from the host only

Built-in variables:
- blockIdx, threadIdx: for thread self-identification
- blockDim: the dimensions of the thread block, accessible within the kernel

The execution configuration <<< num_blks, num_thrd >>> gives the number of blocks and the number of threads per block, as in:
bk = (int)( n / 256 );
vcos<<< bk, 256 >>>( n, dev_x, dev_y );
CUDA Function Types
Function Type Qualifiers
Function type qualifiers specify whether a function executes on the host or on the device and whether it is callable from the host or from the device.
__global__
Executed on the device
Callable from the host only
__global__ functions must have a void return type
Any call to a __global__ function must specify its execution configuration <<< >>>
A call to a __global__ function is asynchronous

__device__
Executed on the device
Callable from the device only

__host__
Executed on the host
Callable from the host only
(a small example combining the qualifiers follows below)
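A small sketch (hypothetical names) showing the three qualifiers together; the __device__ helper can only be called from device code, while the __host__ function launches the kernel asynchronously:

__device__ float square( float v ) {            /* device-only helper */
  return v * v;
}

__global__ void squareAll( int n, float* x ) {  /* kernel: runs on the device, launched from the host */
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] = square( x[i] );
}

__host__ void run( int n, float* dev_x ) {      /* host code (__host__ is the default qualifier) */
  squareAll<<< (n + 255) / 256, 256 >>>( n, dev_x );  /* asynchronous launch */
  cudaDeviceSynchronize();                             /* wait for the kernel to finish */
}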
CUDA Built-in Variables
Built-in variables specify the grid and block dimensions and the block and thread indices
They are only valid within functions that are executed on the device (an example follows the list below)
gridDim
This variable is of type dim3 and contains the dimensions of the grid.
blockIdx
This variable is of type uint3 and contains the block index within the grid.
blockDim
This variable is of type dim3 and contains the dimensions of the block.
threadIdx
This variable is of type uint3 and contains the thread index within the block.
warpSize
This variable is of type int and contains the warp size in threads.
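For example, a minimal sketch (hypothetical kernel) using these built-ins in a grid-stride loop, so one launch of any grid size can cover an array of any length:

__global__ void saxpy( int n, float a, const float* x, float* y ) {
  int start  = blockIdx.x*blockDim.x + threadIdx.x;  /* this thread's first element */
  int stride = gridDim.x*blockDim.x;                 /* total number of threads in the grid */
  for (int i = start; i < n; i += stride)            /* grid-stride loop */
    y[i] = a * x[i] + y[i];
}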
CUDA: Simple Extension of C
__global__ void vcos( int n, float* x, float* y ) {
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  y[ix] = cos( x[ix] );
}

int main() {
  float *host_x, *host_y;
  float *dev_x, *dev_y;
  int n = 1024;
  host_x = (float*)malloc( n*sizeof(float) );
  host_y = (float*)malloc( n*sizeof(float) );
  cudaMalloc( &dev_x, n*sizeof(float) );
  cudaMalloc( &dev_y, n*sizeof(float) );

  /* TODO: fill host_x[i] with data here */
  cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice );

  /* launch 1 thread per vector-element, 256 threads per block */
  int bk = (int)( n / 256 );
  vcos<<< bk, 256 >>>( n, dev_x, dev_y );

  cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );
  /* host_y now contains cos(x) data */
  return( 0 );
}
CUDA: Simple Extension of C
Type Qualifiers: __global__, __device__, __shared__, __local__, __constant__
Keywords: threadIdx, blockIdx
Intrinsics: __syncthreads
Runtime API: memory, symbol, execution management; function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  …
  region[threadIdx.x] = image[i];
  __syncthreads();
  …
  image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc( &myimage, bytes );

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
Outline
Multicore and Manycore Differences
Hardware and Software Mental Models
Anatomy of an Application
The CUDA Platform
Compilation
Any source file containing CUDA language extensions must be compiled with NVCC
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
NVCC outputs:
C code (host CPU Code)
Must then be compiled with the rest of the application using another tool
PTX
Object code directly
Or, PTX source, interpreted at runtime (example nvcc invocations are sketched below)
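Typical nvcc invocations, as an illustrative sketch (the flags are standard nvcc options; the file names are hypothetical):

$ nvcc -c vcos.cu -o vcos.o          # compile host + device code into an object file
$ nvcc vcos.o -o vcos                # link into an executable
$ nvcc -ptx vcos.cu -o vcos.ptx      # emit PTX source instead of object code
$ nvcc -arch=sm_20 vcos.cu -o vcos   # target a specific architecture (sm_20 = Fermi)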
The CUDA Platform
C/C++ CUDA Application, e.g.:
float4 me = gx[gtid];
me.x += me.y * me.z;
↓ NVCC
CPU Code (host) and PTX Code (virtual)
Parallel Thread eXecution (PTX): a virtual machine and ISA, a programming model, execution resources and state, e.g.:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
↓ PTX-to-Target Compiler
GPU target code (physical), e.g., G80
How is this relevant to writing fast code?
Fast Platforms
Multicore platforms Manycore platforms Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Introduced the manycore platform HW and SW mental models
Introduced the terminologies for you to start FLIRTing with the technology