
18-646 – How to Write Fast Code II
Carnegie Mellon University

Mini-Project One
— Mini-Project One Due Monday, 03/08 @ 11:59PM
— Mini-Project One Review Thursday, 03/11
— One person on your team should add the other team members to your final code submission

Mini-Project One – Testing
— The time taken for execution on Gradescope can vary quite a bit
— The final performance of your submitted code (the code that exists on Gradescope on Monday 03/08 @ 11:59PM PST) will be evaluated on a GHC machine
— Test your code for accuracy and speed before uploading to gradescope
Matrix Multiplication
$ ./matrix_mul -i ../matrix_mul_03.dat -o
K-Means
$ ./seq_main -o -i kmeans03.dat -n 15
$ mv kmeans03.dat.membership seq_main.kmeans03_n15.membership
$ ./omp_main -o -i kmeans03.dat -n 15
$ diff -y --suppress-common-lines kmeans03.dat.membership seq_main.kmeans03_n15.membership

Mini-Project One – MP1.1 Test Cases
matrix_mul_03.dat

Mini-Project One – MP1.2 Test Cases


How is this relevant to writing fast code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— Introduced the manycore platform HW and SW mental models
— Introduced the terminologies for you to start FLIRTing with the technology

Outline
— Multicore and Manycore Differences
— Hardware and Software Mental Models
— Anatomy of an Application
— The CUDA Platform

Multicore and Manycore Differences
— What’s the difference between multicore and manycore?
— Multicore: yoke of oxen
— Manycore: flock of chickens
Slide by Bryan Catanzaro

Multicore and Manycore Differences
— Fundamentally different design philosophies
— Multicore: Optimized for reducing execution latency of a few threads
— Sophisticated instruction control, large caches
— Each core optimized for executing a single thread
— Manycore: Assumes 1000-way concurrency is readily available in applications
— More resources dedicated to compute (many ALUs, but simple ones)
— Cores optimized for aggregate throughput, deemphasizing individual performance

Significant Architectural Difference
Specifications: Intel Core i7-2600K (CPU) vs. NVIDIA GTX 580 (GPU)
— Processing Elements: 4 cores, 8-way SIMD @ 2.5-3.4 GHz vs. 16 cores, 16-way SIMD, dual issue @ 1.55 GHz (0.46x - 0.62x)
— Resident Threads (max): 4 cores x 2 threads x 8-wide SIMD = 64 strands vs. 16 cores x 48 SIMD vectors x 32-wide SIMD = 24,576 strands (384x)
— SP GFLOP/s: 160 - 218 vs. 1587 (7.3x - 9.9x)
— Memory Bandwidth: 21.4 GB/s - 42.7 GB/s vs. 192.4 GB/s (4.5x - 9.0x)
— Die info: 995M transistors, 32nm process, 216 mm^2 vs. 3B transistors, 40nm process, 520 mm^2

GPU is an Accelerator
— GPU System Architecture: the device system (GPU) is attached to the host system (CPU) over PCI-E
— The host and the device each have their own DRAM; the GPU cannot be used by itself

When Does Using a GPU Make Sense?
— Applications with a lot of concurrency
— 1000-way, fine-grained concurrency
— Some memory-intensive applications
— Aggregate memory bandwidth is higher on the GPU
— Advantage diminishes when task granularity becomes too large to fit in shared memory
[Figure: Roofline models. Penryn (SP): bandwidth ceilings L1 50.6 GB/s, L2 25.3 GB/s, DRAM 10.6 GB/s; compute ceilings 50.5 GFLOPS peak, 12.6 GFLOPS without SIMD. NVIDIA GTX 280 (SP): bandwidth ceilings shared memory 1244 GB/s, GDDR 141.7 GB/s, PCI Express 2.5 GB/s; compute ceilings 622 GFLOPS peak, 311 GFLOPS with MUL/ADD imbalance or instruction stream divergence. X-axis: operational intensity (FLOPS/Byte).]
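To make the roofline comparison concrete, here is a back-of-the-envelope reading of the numbers in the figure above (the operational intensity of 1 FLOP/Byte is an assumed example value, not taken from the slide):

Attainable GFLOP/s = min( peak GFLOP/s, operational intensity x memory bandwidth )

  Penryn, streaming from DRAM:    min( 50.5, 1 x 10.6 )  =  10.6 GFLOP/s
  GTX 280, streaming from GDDR:   min( 622,  1 x 141.7 ) = 141.7 GFLOP/s
  GTX 280, streaming over PCI-E:  min( 622,  1 x 2.5 )   =   2.5 GFLOP/s

The last line is why the GPU advantage only holds when the working set can stay in GPU DRAM or shared memory rather than being streamed across PCI Express for every operation.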

Coarse-grained 1000-way concurrency?
[Figure: Task management overhead (instructions, 10^1 to 10^9) plotted against synchronization overhead (seconds, 10^-9 to 10^0). Synchronization cost grows from on-die cache latency, to shared off-chip memory latency, to system latency, to network latency; task management cost grows from HW task queues, to SW task queues, to OS processes, to jobs over networks. Implementation styles span this space: manycore throughput-optimized implementations, multicore task queue-based implementations, pthread-based implementations, MPI-based implementations, and remote procedure call-based implementations. © Jike Chong 2009]

Outline
— Multicore and Manycore Differences
— Hardware and Software Mental Models
— Anatomy of an Application
— The CUDA Platform

Hardware and Software Mental Models
— Mental models help us make design trade-offs
— Use CUDA terminologies to introduce Manycore programming
— CUDA is designed to be “functionally forgiving”
— Easy to get a correct program running
— Can invest more time to improve performance
— Achieving good performance requires good understanding of hardware constraints

CUDA: Compute Unified Device Architecture
— A parallel computing architecture developed by NVIDIA
— Integrated host+device app C program
— Serial or modestly parallel parts in host C code
— Highly parallel parts in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);


NVIDIA Fermi Architecture
— 16-way many-core, 48-way multi-threaded (warps)
— 32-wide SIMD
— 2 MB (16 x 128 KB) registers
— 1 MB (16 x 64 KB) L1 cache
— 0.75 MB L2 Cache
— Note the inverted memory hierarchy: the register files (2 MB) and shared memory/L1 (1 MB) are larger in aggregate than the L2 cache (0.75 MB)
— Why the inversion?

Why Inversion in Mem Hierarchy?
— Latency
— Task: time from task start to task finish (seconds)
— Data: time from request to receiving data (seconds)
— Synchronization: time from start of synchronization to completion (seconds)
— Throughput (what we care more about on GPUs)
— Task: # of tasks executed per unit time (tasks/second)
— Data: # of Bytes transferred per unit time (Bytes/second)
— Synchronization: # of sync operations per unit time (sync ops/second)
— Concurrency = Latency x Throughput
— Task: # of tasks concurrently managed (tasks)
— Data: # of memory operations concurrently managed (memory instructions)
— Synchronization: # of sync operations in flight at the same time (sync operations)
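Using the Concurrency = Latency x Throughput relation from the table, a rough calculation shows why a throughput-oriented GPU needs so much on-chip state (the ~400 ns DRAM access latency is an assumed ballpark figure, not a number from the slides):

  Concurrency ≈ 400 ns x 141.7 GB/s ≈ 56,700 Bytes in flight ≈ 440 outstanding 128-Byte memory transactions

Keeping that many memory operations in flight means keeping the registers of thousands of threads resident on chip, which is why the register files and shared memory dwarf the L2 cache.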

Memory Wall
“Computer Architecture: A Quantitative Approach” by Hennessy and Patterson
[Figure: the growing gap between processor speed and DRAM memory access speed over time.]

Getting Around the Memory Wall
— “Memory Wall”:
Increasing gap between Processor and DRAM performance
— Manycore Processors:
— Utilize application concurrency to hide memory latency
— How?
Krste Asanovic, UC Berkeley CS152 Spring 2010

Fine-Grained Multi-Threading
— Each Fermi core can maintain 48 warps of architectural context
— Each warp manages a 32-wide SIMD vector worth of computation
— With ~20 registers for each thread:
4 (Bytes/register) x 20 (registers) x 32 (SIMD lanes) x 48 (warps) ⇒ ~128 KB of registers per core
x 16 cores ⇒ ~2 MB total of register files

Why Warps?
— Software abstraction to hide an extra level of architectural complexity
— A 128 KB register file is a large memory
— It takes more than one clock cycle to retrieve information
— Hardware provides 16-wide physical SIMD units with half-pumped register files:
— Provide half the operands per clock cycle, then the other half the following cycle
— To simplify the programming model:
— Assume we are only working with a 32-wide SIMD unit, where each 32-bit instruction has a bit more latency

How to Deal with GPUs of Different Sizes?
GPU                                  Compute Capability   Multiprocessors   SIMD Lanes
GeForce GTX 580                      2.0                  16                512
GeForce GTX 570, GTX 480             2.0                  15                480
GeForce GTX 470                      2.0                  14                448
GeForce GTX 560 Ti                   2.1                  8                 384
GeForce GTX 460                      2.1                  7                 336
GeForce GTX 470M                     2.1                  6                 288
GeForce GTX 285, GTX 280, GTX 275    1.3                  30                240
GeForce GTS 450, GTX 460M            2.1                  4                 192
GeForce GTX 260                      1.3                  24                192
GeForce GT 445M                      2.1                  3                 144
GeForce 8800 Ultra, 8800 GTX         1.0                  16                128
GeForce 9800 GT, 8800 GT             1.1                  14                112
GeForce GT 415M                      2.1                  1                 48

— We would like the same program to run on both a GTX 580 and a GT 415M
— CUDA provides an abstraction for concurrency to be fully exposed
— The HW/runtime provides the capability to schedule the computation

Thread Blocks
— Computation is grouped into blocks of independent, concurrently executable work
— Fully exposes the concurrency in the application
— The HW/Runtime makes the decision to selectively sequentialize the execution as necessary
— What are some implications or limitations?
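As a concrete illustration of these implications, here is a minimal sketch (the kernel, names, and sizes are ours, not from the course code): each block owns a disjoint chunk of the output and never waits on another block, so the HW/runtime can run the blocks concurrently across many cores on a large GPU or sequentialize them on a small one without changing the result.

__global__ void saxpy_block( int n, float a, const float* x, float* y ) {
  // each 256-thread block covers one independent 256-element chunk of y
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  if ( ix < n )                       // the last block may be partially full
    y[ix] = a * x[ix] + y[ix];
  // note: there is no way to wait here for other blocks; combining results
  // across blocks requires another kernel launch (or work on the host)
}

// example launch: 4096 independent blocks of 256 threads for n = 1,048,576
// saxpy_block<<< 4096, 256 >>>( n, 2.0f, dev_x, dev_y );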

Threads
— Threads are the computation performed in each SIMD lane in a core
— CUDA provides a SIMT programming abstraction to assist users
— SIMT: Single Instruction, Multiple Threads
— A single instruction controls multiple processing elements
— Different from SIMD – SIMD exposes the SIMD width to the programmer
— SIMT abstracts the # of threads in a thread block as a user-specified parameter
— SIMT enables programmers to write thread-level parallel code for
— Independent, scalar threads
— Data-parallel code for coordinated threads
— For functional correctness, programmers can ignore SIMT behavior
— For performance, programmers can tune applications with SIMT in mind
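To make the correctness-vs-performance point concrete, here is a hypothetical kernel fragment (not from the course code). It is functionally correct regardless of how warps execute it, but when threads within one 32-wide warp take different sides of the branch, the hardware runs both sides one after the other, so performance tuning tries to keep branches uniform within a warp.

__global__ void branchy( int n, const int* flag, float* y ) {
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  if ( ix >= n ) return;

  // correct either way under SIMT; if flag[] differs within a warp, the two
  // paths are serialized (divergence); if it is uniform per warp, no penalty
  if ( flag[ix] )
    y[ix] = y[ix] * 2.0f;
  else
    y[ix] = y[ix] + 1.0f;
}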

What About Data?
— SIMD (or SIMT) style programming can be very restrictive for communication between SIMD lanes
— On the same chip, in the same core, computations in SIMD lanes (physically) take place very close to each other
— How can we exploit this physical proximity in the hardware while providing a generalizable construct in software?

Shared Memory/L1 Cache
— Manycore processors provide memory local to each core
— Computations in SIMD lanes in the same core can communicate via memory reads/writes
— Two types of memory:
— Programmer-managed scratch pad memory
— HW-managed L1 cache
— For the NVIDIA Fermi architecture, you get 64 KB per core with two configurations:
— 48 KB scratch pad (shared memory), 16 KB L1 cache
— 16 KB scratch pad (shared memory), 48 KB L1 cache
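A minimal sketch of both halves of this 64 KB (the kernel and its names are illustrative; cudaFuncSetCacheConfig is the CUDA runtime call for requesting one of the two Fermi splits, used here under assumed sizes):

#define TILE 256

__global__ void tile_sum( const float* x, float* block_sums ) {
  __shared__ float tile[TILE];               // programmer-managed scratch pad
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = x[ix];                 // stage data into on-chip memory
  __syncthreads();                           // every lane in the block sees the tile

  if ( threadIdx.x == 0 ) {                  // naive per-block sum, for brevity
    float s = 0.0f;
    for ( int i = 0; i < TILE; i++ ) s += tile[i];
    block_sums[blockIdx.x] = s;
  }
}

// host side: request the 48 KB shared memory / 16 KB L1 split for this kernel
// (cudaFuncCachePreferL1 would request the opposite split)
// cudaFuncSetCacheConfig( tile_sum, cudaFuncCachePreferShared );

Ordinary global loads and stores go through the HW-managed L1 cache automatically; only the __shared__ tile above is under programmer control.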

How Many Threads per Thread Block?
— If I can efficiently communicate between threads in a thread block, why not just put all my work in one thread block?
— A few reasons:
— A manycore processor has more than one core
— If one uses only one core, one is not fully utilizing the available HW
— Hardware must maintain the context of all threads in a thread block on the core it runs on
— There is a limited amount of resources on-chip
— In Fermi, 48 warps of context are maintained per core
— In Fermi, each thread block can have up to 1024 threads
— This changes between generations (used to be a max of 768 threads/block)
— One can often achieve higher performance with fewer threads/block but multiple blocks concurrently running on the same core

Hardware and Software Mental Models
— Summary of what we have discussed:
Hardware Compute Units                    Software Abstractions   Memory Resources
SIMD Lanes                                Threads                 Register File
SIMD Execution Granularity                Warps                   Registers / Shared Memory, L1 Cache
Cores (Streaming Multiprocessors – SM)    Thread Blocks           Shared Memory, L1 Cache
Multiple SMs                              Grids                   L2 Cache

Outline
— Multicore and Manycore Differences
— Hardware and Software Mental Models
— Anatomy of an Application
— The CUDA Platform

Anatomy of a CUDA Program
__global__ void vcos( int n, float* x, float* y ) {
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  y[ix] = cos( x[ix] );
}

int main() {
  float *host_x, *host_y;
  float *dev_x, *dev_y;
  int n = 1024;

  host_x = (float*)malloc( n*sizeof(float) );
  host_y = (float*)malloc( n*sizeof(float) );
  cudaMalloc( &dev_x, n*sizeof(float) );
  cudaMalloc( &dev_y, n*sizeof(float) );

  /* TODO: fill host_x[i] with data here */
  cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice );

  /* launch 1 thread per vector-element, 256 threads per block */
  int bk = (int)( n / 256 );
  vcos<<< bk, 256 >>>( n, dev_x, dev_y );

  cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );
  /* host_y now contains cos(x) data */
  return( 0 );
}
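One detail worth flagging in the listing above: bk = n / 256 covers every element only because n is a multiple of 256. A slightly more defensive pattern, shown here as a sketch rather than as the course’s required solution, rounds the block count up and guards the out-of-range threads:

/* kernel guards against threads past the end of the array */
__global__ void vcos( int n, float* x, float* y ) {
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  if ( ix < n )
    y[ix] = cos( x[ix] );
}

/* host side: ceiling division so a partial final block is still launched */
int threads = 256;
int bk = ( n + threads - 1 ) / threads;
vcos<<< bk, threads >>>( n, dev_x, dev_y );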

Data Structure on Different Devices
— The same vcos program as on the previous slide, annotated by where each data structure lives:
— host_x and host_y are allocated with malloc() and live in host (CPU) DRAM
— dev_x and dev_y are allocated with cudaMalloc() and live in device (GPU) DRAM
— cudaMemcpy() moves the data between the two DRAMs over PCI-E

CUDA Memory Operations
— cudaMalloc(void** pointer, size_t nbytes)
— cudaMemset(void* pointer, int value, size_t count)
— cudaFree(void* pointer)
int n = 1024;
int nbytes = 1024*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes);
cudaFree(a_d);
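Each of these runtime calls returns a cudaError_t, which the snippet above ignores. A common checking pattern is sketched below (the CHECK macro name is ours, not part of CUDA; cudaGetErrorString is the standard runtime call for turning an error code into text):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err = (call);                                    \
    if ( err != cudaSuccess ) {                                  \
      fprintf( stderr, "CUDA error: %s at %s:%d\n",              \
               cudaGetErrorString( err ), __FILE__, __LINE__ );  \
      exit( 1 );                                                 \
    }                                                            \
  } while (0)

int main() {
  int nbytes = 1024 * sizeof(int);
  int *a_d = 0;
  CHECK( cudaMalloc( (void**)&a_d, nbytes ) );
  CHECK( cudaMemset( a_d, 0, nbytes ) );
  CHECK( cudaFree( a_d ) );
  return 0;
}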

Data Transfers: Host ↔ Device
— In the vcos example, cudaMemcpy( destination, source, ... ) moves data across PCI-E between host (CPU) DRAM and device (GPU) DRAM
— cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice ); copies the input from the CPU to the GPU before the kernel launch
— cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost ); copies the result back from the GPU to the CPU afterwards

CUDA Memory Copy Operations
cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
— direction specifies locations (host or device) of src and dst
— Blocks CPU thread: returns after the copy is complete
— Doesn’t start copying until previous CUDA calls complete
— enum cudaMemcpyKind
— cudaMemcpyHostToDevice
— cudaMemcpyDeviceToHost
— cudaMemcpyDeviceToDevice
— cudaMemcpyPeer
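A small illustrative round trip exercising the direction values (the buffer names are ours; error checking omitted for brevity):

float *h_a, *d_a, *d_b;
int nbytes = 1024 * sizeof(float);

h_a = (float*)malloc( nbytes );
cudaMalloc( (void**)&d_a, nbytes );
cudaMalloc( (void**)&d_b, nbytes );

cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );    // host -> device over PCI-E
cudaMemcpy( d_b, d_a, nbytes, cudaMemcpyDeviceToDevice );  // stays in GPU DRAM, no PCI-E
cudaMemcpy( h_a, d_b, nbytes, cudaMemcpyDeviceToHost );    // device -> host over PCI-E
// each call returns only after its copy has completed (blocking)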

Calling a __global__ Function
— The __global__ qualifier declares a function as being a kernel. Such a function is:
— Executed on the device
— Callable from the host only
— Built-in variables:
— blockIdx, threadIdx: for thread self-identification
— blockDim: the dimensions of the thread block, accessible within the kernel
— Launch syntax: kernel<<< num_blks, num_thrd >>>(args), where num_thrd is the number of threads per block
— In the vcos example: vcos<<< bk, 256 >>>( n, dev_x, dev_y );

CUDA Function Types
Function Type Qualifiers
— Function type qualifiers specify whether a function executes on the host or on the device and whether it is callable from the host or from the device.
__global__
— Executed on the device
— Callable from the host only
— __global__ functions must have void return type
— Any call to a __global__ function must specify its execution configuration <<< >>>
— A call to a __global__ function is asynchronous
__device__
— Executed on the device
— Callable from the device only
__host__
— Executed on the host
— Callable from the host only
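A short sketch showing the three qualifiers side by side (the function names are illustrative, not from the course code):

__device__ float squared( float v ) {            // device-only helper, callable from kernels
  return v * v;
}

__global__ void square_all( int n, float* x ) {  // kernel: void return, launched with <<< >>>
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  if ( ix < n )
    x[ix] = squared( x[ix] );
}

__host__ void launch_square_all( int n, float* dev_x ) {   // runs on the CPU
  square_all<<< (n + 255) / 256, 256 >>>( n, dev_x );      // asynchronous launch
  cudaDeviceSynchronize();                                 // wait if the result is needed now
}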

CUDA Built-in Variables
— Built-in variables specify the grid and block dimensions and the block and thread indices
— Only valid within functions that are executed on the device
gridDim
— This variable is of type dim3 and contains the dimensions of the grid.
blockIdx
— This variable is of type uint3 and contains the block index within the grid.
blockDim
— This variable is of type dim3 and contains the dimensions of the block.
threadIdx
— This variable is of type uint3 and contains the thread index within the block.
warpSize
— This variable is of type int and contains the warp size in threads.
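A sketch of how these variables combine, using a hypothetical two-dimensional kernel (names and sizes are ours, not from the course code):

__global__ void shift_brightness( int width, int height, float* img, float delta ) {
  // blockIdx/threadIdx are uint3, gridDim/blockDim are dim3
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if ( col < width && row < height )
    img[ row * width + col ] += delta;
}

// host side: 16x16 = 256 threads per block, enough blocks to tile the image
// dim3 threads( 16, 16 );
// dim3 blocks( (width + 15) / 16, (height + 15) / 16 );
// shift_brightness<<< blocks, threads >>>( width, height, dev_img, 0.1f );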

CUDA: Simple Extension of C
— The complete vcos listing (shown earlier under “Anatomy of a CUDA Program”) is ordinary C plus a handful of CUDA extensions, summarized below

CUDA: Simple Extension of C
— Type Qualifiers
— global, device, shared, local, constant
— Keywords
— threadIdx, blockIdx
— Intrinsics
— __syncthreads
— Runtime API
— Memory, symbol, execution management
— Function launch
__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];

  region[threadIdx] = image[i];
  __syncthreads()

  image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes)

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

Outline
— Multicore and Manycore Differences
— Hardware and Software Mental Models
— Anatomy of an Application
— The CUDA Platform

Compilation
— Any source file containing CUDA language extensions must be compiled with NVCC
— NVCC is a compiler driver
— Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
— NVCC outputs:
— C code (host CPU Code)
— Must then be compiled with the rest of the application using another tool
— PTX
— Object code directly
— Or, PTX source, interpreted at runtime
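A typical invocation, shown as a hedged example (exact flags depend on the toolchain and target GPU; -arch=sm_20 targets the Fermi parts discussed earlier):

$ nvcc -arch=sm_20 -O2 -o vcos vcos.cu    # host + device code compiled and linked in one step
$ nvcc -arch=sm_20 -ptx vcos.cu           # emit the PTX intermediate code instead of an executable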

The CUDA Platform
— Parallel Thread eXecution (PTX)
— Virtual machine and ISA
— Programming model
— Execution resources and state
[Figure: compilation flow. NVCC splits a C/C++ CUDA application into CPU code and virtual PTX code; a PTX-to-target compiler then generates physical GPU target code (e.g., for G80). Example: the CUDA source
  float4 me = gx[gtid];
  me.x += me.y * me.z;
becomes PTX such as
  ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
  mad.f32 $f1, $f5, $f3, $f1;
]



How is this relevant to writing fast code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— Introduced the manycore platform HW and SW mental models
— Introduced the terminologies for you to start FLIRTing with the technology