COMP5426 Distributed Programming
Introduction
References
– NVIDIA GPU Educators Program – https://developer.nvidia.com/educators
– NVIDIA’s Academic Programs
– https://developer.nvidia.com/academia
– The contents of these short course ppt slides are mainly taken from the following book and its accompanying teaching materials:
– D. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd edition, Morgan Kaufmann, 2013
SMs are SIMD Processors
– Control unit for instruction fetch, decode, and control is shared among multiple processing units
– Control overhead is minimized
[Figure: Streaming Multiprocessor (SM) block diagram – multiple processing units sharing one control unit, a register file, and shared memory]
Warps as Scheduling Units
[Figure: Block 1, Block 2, and Block 3 warps – each block's threads t0 … t31 grouped into 32-thread warps]
– Each block is divided into 32-thread warps
– Warps are scheduling units in SM
– Threads in a warp execute in Single Instruction Multiple Data (SIMD) manner
– The number of threads in a warp may vary in future generations
– For multi-dimensional thread blocks
– the thread block is first linearized into 1D in row-major order
– the linearized thread block is then partitioned into warps (see the sketch below)
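– A minimal sketch (not from the original slides) of this row-major linearization, assuming the warp size of 32 and a hypothetical helper name myWarpId:
// Hypothetical helper: linearize a 3D thread index in row-major order
// (z varies slowest, x fastest) and derive the warp this thread belongs to.
__device__ int myWarpId()
{
    int linearTid = threadIdx.z * (blockDim.y * blockDim.x)
                  + threadIdx.y * blockDim.x
                  + threadIdx.x;
    return linearTid / 32;   // 32 threads per warp on current hardware
}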
Global Memory (DRAM) Bandwidth – Ideal
How about performance on a GPU
– All threads access global memory for their input elements
– One memory access (4 bytes) per floating-point operation (see the sketch following this list)
– Assume a GPU with
– Peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4 * 1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits the execution to 50 GFLOPS
– This limits the execution rate to 3.33% (50/1500) of the peak floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the 1,500 GFLOPS
– Memory coalescing
– Effective use of shared memory
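– As a hedged illustration (the slides do not show the kernel here, but the book's running example is a naive matrix multiplication; the names below are assumptions), every inner-loop iteration performs two 4-byte global loads for two floating-point operations, i.e. 4 bytes of DRAM traffic per FLOP:
// Sketch of a naive square matrix multiplication P = M * N (assumed example).
// Each inner-loop iteration: 2 global loads (8 bytes) and 2 FLOPs (mul + add),
// giving the 4 bytes per FLOP ratio used in the analysis above.
__global__ void naiveMatMul(const float *M, const float *N, float *P, int Width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float sum = 0.0f;
        for (int k = 0; k < Width; ++k)
            sum += M[row * Width + k] * N[k * Width + col];
        P[row * Width + col] = sum;
    }
}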
DRAM Core Array Organization
– Each DRAM core array has about 16M bits
– Each bit is stored in a tiny capacitor accessed through one transistor
[Figure: DRAM core array organization – row decoder, memory cell core array, sense amps, column latches, and column address mux feeding a narrow off-chip data pin interface; example shows a very small (8×2-bit) DRAM core array]
DRAM Bursting
[Figure: DRAM bursting timing – address bits sent to the decoder, data on the interface; non-burst timing vs. burst timing, showing the core array access delay]
Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred to the processor but discarded when accesses are not to sequential locations.
Multiple DRAM Banks
[Figure: two DRAM banks (Bank 0 and Bank 1), each with its own sense amps, sharing the data interface through a mux]
DRAM Bursting with Banking
Single-Bank burst timing, dead time on interface
Multi-Bank burst timing, reduced dead time
DRAM Burst – A System View
[Figure: the address space divided into consecutive burst sections]
– Each address space is partitioned into burst sections
– Whenever a location is accessed, all other locations in the same section are also delivered to the processor
– In practice, we have at least a 4GB address space and burst section sizes of 128 bytes or more
Memory Coalescing
[Figure: coalesced loads – threads T0–T3 of a warp all access locations within the same burst section]
– When all threads of a warp execute a load instruction, if all accessed locations fall into the same burst section, only one DRAM request will be made and the access is fully coalesced.
Un-coalesced Accesses
[Figure: un-coalesced loads – threads T0–T3 access locations that spread across different burst sections]
– When the accessed locations spread across burst section boundaries:
– Coalescing fails
– Multiple DRAM requests are made
– The access is not fully coalesced.
– Some of the bytes accessed and transferred are not used by the threads
How to judge if an access is coalesced?
– Accesses in a warp are to consecutive locations if the index in an array access is in the form of
– A[(expression with terms independent of threadIdx.x) + threadIdx.x];
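– A small hypothetical kernel illustrating the rule (d_A is assumed to be a row-major matrix of width Width):
// Hypothetical example: one coalesced and one un-coalesced access pattern.
__global__ void accessPatterns(const float *d_A, float *d_out, int Width)
{
    int row = blockIdx.y;   // independent of threadIdx.x
    int col = blockIdx.x;   // independent of threadIdx.x

    // Coalesced: threadIdx.x is the only term that varies across the warp,
    // so threads 0..31 read 32 consecutive floats from one burst section.
    float a = d_A[row * Width + threadIdx.x];

    // Not coalesced: consecutive threads are Width elements apart, so their
    // loads fall into different burst sections.
    float b = d_A[threadIdx.x * Width + col];

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;   // keep the loads live
}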
A 2D Array in Linear Memory Space
[Figure: a 4×4 array B laid out in linear memory – B0,0 B0,1 B0,2 B0,3 B1,0 B1,1 B1,2 B1,3 B2,0 B2,1 B2,2 B2,3 B3,0 B3,1 B3,2 B3,3, the linearized (row-major) order in increasing address]
Accesses are coalesced
We need locality across threads for one instruction instead of locality across subsequent instructions for one thread
Accesses are Not Coalesced
[Figure: column-wise access pattern – in load iteration 0 and load iteration 1, threads T0–T3 each access elements in different rows, against the row-major layout; access direction is down a column in the kernel code]
We need locality across threads for one instruction instead of locality across subsequent instructions for one thread
Programmer View of CUDA Memories
[Figure: CUDA memory model – a grid with Block (0, 0) and Block (1, 0); each block has its own shared memory and per-thread registers for Thread (0, 0) and Thread (1, 0); all blocks access global memory and constant memory]
Shared Memory in CUDA
– A special type of memory whose contents are explicitly defined and used in the kernel source code
– One in each SM (Streaming Multiprocessor)
– Accessed at much higher speed (in both latency and throughput) than global memory
– Scope of access and sharing – thread blocks
– Lifetime – thread block; contents disappear after the corresponding thread block finishes execution
– Accessed by memory load/store instructions
Declaring CUDA Variables
Variable declaration | Memory | Scope | Lifetime
int LocalVar; | register | thread | thread
__device__ __shared__ int SharedVar; | shared | block | block
__device__ int GlobalVar; | global | grid | application
__device__ __constant__ int ConstantVar; | constant | grid | application
– __device__ is optional when used with __shared__ or __constant__
Where to Declare Variables?
Can host access it?
– Yes (global, constant): declare outside of any kernel code
– No (register, shared): declare in the kernel
Shared Memory Variable Declaration
__global__
void myKernel(unsigned char *in, unsigned char *out, int w, int h)
{
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
    …
}
Global Memory Access Pattern
[Figure: every thread (Thread 1, Thread 2, …) reads its data directly from global memory]
Tiling/Blocking – Basic Idea
[Figure: a tile of the global memory data is staged into on-chip memory, where Thread 1, Thread 2, … access it]
Divide the global memory content into tiles
Focus the computation of threads on one or a small number of tiles at each point in time
Tiling/Blocking – Basic Idea
[Figure: the same diagram with the next tile of global memory staged into on-chip memory]
Barrier Synchronization
– __syncthreads() is needed to ensure that all threads in the block finished using the data in the shared memory before we proceed to the next step
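– A minimal sketch of the tiling idea together with __syncthreads(), using the book's tiled matrix multiplication as the example; it assumes the block dimensions equal TILE_WIDTH and that TILE_WIDTH divides Width:
#define TILE_WIDTH 16   // assumed tile size, equal to blockDim.x and blockDim.y

__global__ void tiledMatMul(const float *M, const float *N, float *P, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    // March over the tiles of M and N needed to compute one element of P
    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        // Each thread loads one element of each tile into shared memory
        Ms[threadIdx.y][threadIdx.x] = M[row * Width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();   // wait until the whole tile is in shared memory

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();   // wait before the tile is overwritten in the next iteration
    }
    P[row * Width + col] = sum;
}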
Reduction Operation
– Summarize a set of input values into one value using a “reduction operation”
– Often used with a user defined reduction operation function as long as the operation
– Is associative and commutative
– Has a well-defined identity value
– For example, the user may supply a custom “max” function for 3D coordinate data sets, where the magnitude of each coordinate tuple is its distance from the origin (see the sketch below)
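– A hedged sketch of such a user-supplied operation (types and names below are hypothetical): a “max” over 3D points compared by their distance from the origin; it is associative and commutative, and the origin (0,0,0) is its identity value:
// Hypothetical user-defined reduction operator for 3D coordinate data.
struct Point3 { float x, y, z; };

__host__ __device__ inline float magSq(Point3 p)   // squared distance from the origin
{
    return p.x * p.x + p.y * p.y + p.z * p.z;
}

__host__ __device__ inline Point3 maxByMagnitude(Point3 a, Point3 b)
{
    return (magSq(a) >= magSq(b)) ? a : b;          // associative and commutative
}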
Partition and Summarize
– A commonly used strategy for processing large input data sets
– There is no required order of processing elements in a data set (associative and commutative)
– Partition the data set into smaller chunks
– Have each thread process a chunk
– Use a reduction tree to summarize the results from each chunk into the final answer
– We will focus on the reduction tree step for now
An Efficient Sequential Reduction O(N)
– Iterate through the input and perform the reduction operation between the result value and the current input value
– N reduction operations performed for N input values
– Each input value is only visited once – an O(N) algorithm
– This is a computationally efficient algorithm (see the sketch below)
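– A minimal host-side sketch of this sequential algorithm, assuming a sum reduction over N floats:
// Sequential O(N) sum reduction on the host (assumed example).
float sequentialSum(const float *input, int N)
{
    float result = 0.0f;               // identity value for addition
    for (int i = 0; i < N; ++i)
        result = result + input[i];    // one reduction operation per input value
    return result;
}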
A parallel reduction tree algorithm performs N-1 operations in log(N) steps
Parallel Sum Reduction
– Parallel implementation
– Recursively halve # of threads, add two values per thread in each step
– Takes log(n) steps for n elements, requires n/2 threads
– Assume an in-place reduction using shared memory
– The original vector is in device global memory
– The shared memory is used to hold a partial sum vector
– Each step brings the partial sum vector closer to the sum
– The final sum will be in element 0 of the partial sum vector
– Reduces global memory traffic due to partial sum values
– Thread block size limits n to be less than or equal to 2,048
A Naive Thread to Data Mapping
– Each thread is responsible for an even-index location of the partial sum vector (location of responsibility)
– After each step, half of the threads are no longer needed
– One of the inputs is always from the location of responsibility
– In each step, one of the inputs comes from an increasing distance away
A Parallel Sum Reduction Example
A Simple Thread Block Design
– Each thread block takes 2*blockDim.x input elements
– Each thread loads 2 elements into shared memory
__shared__ float partialSum[2*blockDim.x]; // note: a real __shared__ array needs a compile-time
                                           // constant size (or extern __shared__ with a dynamic size)
unsigned int t = threadIdx.x;
unsigned int start = 2*blockDim.x*blockIdx.x;
unsigned int j = 2*t;
partialSum[j]   = input[start + j];        // adjacent-pair loading: thread t loads elements 2t and 2t+1
partialSum[j+1] = input[start + j + 1];
– A better way
– To ensure memory coalescing
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];   // consecutive threads load consecutive elements
The Reduction Steps
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t + stride];
}
__syncthreads() is needed to ensure that all elements of each version of partial sums have been generated before we proceed to the next step
Control Divergence
– Control divergence occurs when threads in a warp take different control flow paths by making different control decisions
– Some take the then-path and others take the else-path of an if-statement
– Some threads take a different number of loop iterations than others
– The execution of threads taking different paths is serialized in current GPUs
– The control paths taken by the threads in a warp are traversed one at a time until there are no more.
– During the execution of each path, all threads taking that path will be executed in parallel
– The number of different paths can be large when considering nested control flow statements
Some Observations on the naïve reduction kernel
– In each iteration, two control flow paths will be sequentially traversed for each warp
– Threads that perform addition and threads that do not
– Threads that do not perform addition still consume execution resources
Thread Index Usage Matters
– In some algorithms, one can shift the index usage to improve the divergence behavior
– Commutative and associative operators
– Keep the active threads consecutive
– Always compact the partial sums into the front locations in the partialSum[ ] array
An Example of 4 threads
[Figure: convergent reduction with 4 threads (Thread 0 – Thread 3) – the active threads stay packed at the front of partialSum[ ]]
A Better Reduction Kernel
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
A Quick Analysis
– For a 1024 thread block
– No divergence in the first 6 steps
– 1024, 512, 256, 128, 64, 32 consecutive threads are active in each of these steps
– All threads in each warp are either all active or all inactive
– Only the first warp in the final 5 steps will still have divergence
Back to the Global Picture
– At the end of the kernel, Thread 0 of each thread block writes the block's sum, partialSum[0], into a vector indexed by blockIdx.x
– There can be a large number of such sums if the original vector is very large
– If so, do another iteration of reduction over the vector of block sums (a complete kernel sketch follows this list)
– If there are only a small number of sums, the host can simply transfer the data back and add them together
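– Putting the pieces together, a hedged sketch of one complete block-level sum reduction kernel using the convergent indexing above; BLOCK_SIZE and the kernel name are assumptions, and BLOCK_SIZE must match the launch configuration:
#define BLOCK_SIZE 1024   // assumed thread block size (compile-time constant)

__global__ void blockSumReduction(const float *input, float *blockSums, int N)
{
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockDim.x * blockIdx.x;

    // Coalesced loads: consecutive threads read consecutive elements
    partialSum[t]              = (start + t < N) ? input[start + t] : 0.0f;
    partialSum[blockDim.x + t] = (start + blockDim.x + t < N) ? input[start + blockDim.x + t] : 0.0f;

    // Convergent reduction: active threads stay packed at the front
    for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Thread 0 writes this block's sum into a vector indexed by blockIdx.x
    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];
}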