
Intro to Parallel Computing
Topic 14: Scheduling, Warps and memory
COSC 407: Intro to Parallel Computing
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing


Previously:
– Kernel Launch Configuration: nD grids/blocks
– CUDA limits
– Thread Cooperation
– Running Example: Matrix Multiplication
– Tiling (Improving Performance of Matrix Multiplication)
– CUDA Scalability
– Thread Scheduling on the H/W: Thread Lifecycle
– zero-overhead and latency tolerance
– GPU limits
– CUDA Memories Types (and Performance)
Topic 14: Scheduling, Warps and Memory
Slide materials based on ECE 408 (2007-2010), University of Illinois, Urbana-Champaign, © David Kirk/NVIDIA and Wen-mei W. Hwu
COSC 407: Intro to Parallel Computing

Parallel Code: Using One Block
Basic Idea
▪ Only ONE block used to compute the
output matrix P
▪ Each thread computes one element of P as follows:
– Loads a row of matrix M
– Loads a column of matrix N
– Performs one multiply and one addition for each pair of M and N elements
– Computes and stores the result in off-chip memory (DRAM)
Limitation:
▪ Size of P is limited to 32×32
– i.e. the number of threads allowed
in a thread block.
Topic 14: Scheduling, Warps and Memory
[Figure: matrices M, N and P; each thread, e.g. thread (2, 2), computes one element of P]
Slide materials based on ECE 408 (2007-2010), University of Illinois, Urbana-Champaign, © David Kirk/NVIDIA and Wen-mei W. Hwu
COSC 407: Intro to Parallel Computing
Parallel: Kernel – One Block (and Multiple Blocks)
// LET'S SEE THE SERIAL CODE FIRST
void MatrixMulSerial(float* M, float* N, float* P, int width) {
    for (int r = 0; r < width; r++)
        for (int c = 0; c < width; c++) {
            float sum = 0;
            for (int k = 0; k < width; k++)
                sum += M[r*width + k] * N[k*width + c];
            P[r*width + c] = sum;
        }
}

// Matrix multiplication kernel – each thread computes one P element
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width) {
    // find index of the P(r,c) element computed by this thread
    int r = threadIdx.y + blockIdx.y * blockDim.y;   // with a single block, blockIdx is (0,0)
    int c = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0;
    for (int k = 0; k < width; k++)
        sum += d_M[r*width + k] * d_N[k*width + c];
    d_P[r*width + c] = sum;                          // store the result in off-chip global memory
}

// Launch configuration: a grid of blocks, each TILE_WIDTH x TILE_WIDTH threads
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(WIDTH / TILE_WIDTH, WIDTH / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, WIDTH);
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Parallel : Multiple Blocks – Details
▪ To illustrate how the algorithm works, assume
– TILE_WIDTH = blockDim.x = blockDim.y = 2
– Each block has 4 threads
▪ If WIDTH = 4, how many blocks?
– WIDTH / TILE_WIDTH = 2
– Use 2 * 2 = 4 blocks
▪ How to identify element Px,y?
– x = TILE_WIDTH * bx + tx
– y = TILE_WIDTH * by + ty
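As a quick worked check (using the assumptions above): element P3,2 has x = 3 and y = 2, so bx = 1, tx = 1, by = 1, ty = 0; it therefore belongs to Block(1,1) and is computed by that block's thread (1, 0).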
[Figure: the 4×4 matrix P partitioned among four 2×2 blocks — Block(0,0): P0,0, P1,0, P0,1, P1,1; Block(1,0): P2,0, P3,0, P2,1, P3,1; Block(0,1): P0,2, P1,2, P0,3, P1,3; Block(1,1): P2,2, P3,2, P2,3, P3,3]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Parallel: Multiple Blocks – Details
▪ Each thread identified by blockIdx and threadIdx will calculate one P element
▪ Example of thread (0, 0) and thread (1, 0) of block (0, 0)
[Figure: thread (0, 0) and thread (1, 0) of Block(0,0) read rows of M and columns of N to compute P0,0 and P1,0]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Transparent Scalability
▪ The GPU is responsible for assigning thread blocks to SMs
– A block must be assigned to exactly one SM.
– An SM can run more than one thread block
▪ Threads in the same block may cooperate
– not all threads in an SM can cooperate as they may belong to different blocks. (More about this shortly)
▪ Using SMs allows for scalable architecture
[Figure: a GPU with many SMs, each with its own parallel data cache]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Transparent Scalability
▪ Hardware is free to assign blocks to any processor at any time
– A kernel scales across any number of parallel processors
[Figure: a kernel grid of blocks 0–7 scheduled onto devices with different numbers of SMs]
Each block can execute in any order relative to other blocks.
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

1. Blocks are assigned to SMs (as explained before)
2. Each SM splits threads in its blocks into Warps.
– Threads are grouped into warps that execute in SIMT fashion (all threads in a warp execute the same instruction)
– Warps are the scheduling units of SM
– Thread IDs within a warp are consecutive and increasing:
• Warp 0 starts with Thread ID 0
– Size of the warp is implementation specific
• Generally, # of threads in a warp (32) = # of SPs in an SM
– The warp scheduler of the SM decides which warp gets prioritized when instructions are issued.
▪ DO NOT rely on any ordering between warps
– If there are any dependencies between threads, you must
synchronize them to get correct results (more on this later).
▪ Warps are not part of the CUDA specification, but
• knowing about them can help optimize performance on particular devices (discussed later)
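As an illustration (a minimal sketch, not from the original slides: warpInfoKernel, warpIds and laneIds are hypothetical names, and a 1-D block with the usual warp size of 32 is assumed), each thread can work out which warp of its block it belongs to and its lane within that warp:

__global__ void warpInfoKernel(int* warpIds, int* laneIds) {
    const int kWarpSize = 32;                         // warp size on current NVIDIA GPUs
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    warpIds[gid] = threadIdx.x / kWarpSize;           // warp 0 holds threads 0..31, warp 1 holds 32..63, ...
    laneIds[gid] = threadIdx.x % kWarpSize;           // consecutive thread IDs share a warp
}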
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Thread Scheduling
▪ Each block is executed as subsequent 32-thread warps
– All threads in a single warp execute in parallel
– A warp in an SM runs in parallel with warps in other SMs
[Figure: Block 1, Block 2 and Block 3 warps on an SM, each warp consisting of threads t0 t1 t2 … t31]
▪ Question: Consider a GPU with warp = 32 threads
– if 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
– Each Block is divided into 256 / 32 = 8 Warps
– There are 8 * 3 = 24 Warps
Topic 14: Scheduling, Warps and Memory
COSC 407: Intro to Parallel Computing

Thread Life Cycle on the HW
The complete story
1. The Grid is Launched
2. Blocks are assigned to SMs in arbitrary order
• Each block is assigned to one SM.
Each SM is assigned zero or more blocks.
• There are limits on the number of blocks/threads the SM can track simultaneously. This is taken care of by the GPU.
3. Each block is divided into Warps whose execution is interleaved.
4. Warps are executed by the SM (each SP executes one thread).
• Threads in a warp run simultaneously.
• All threads in a warp execute the same instruction when the warp is selected for execution.
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Zero-Overhead and Latency Tolerance
With many warps, those that are ready for execution are eligible to be scheduled (scheduling priority).
▪ Latency hiding:
– While a warp is waiting for the result of a long-latency operation (e.g. global memory access ~500 cycles, floating-point arithmetic, etc.), the SM will pick another warp that’s ready to execute to:
• avoid idle time
• make full use of the hardware despite long latency operations.
▪ Zero-overhead thread scheduling
– Having zero idle time is referred to as zero-
overhead thread scheduling in processor designs.
[Figure: SM warp scheduler interleaving instructions over time — warp 4 instruction 18, warp 5 instruction 12, warp 4 instruction 19, warp 7 instruction 8, warp 5 instruction 13, warp 7 instruction 9]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

GPU Limits
• CUDA (the software) has limits (as discussed before).
• GPU also has limits on how many blocks and threads it can
simultaneously track (and schedule).
• Hardware resources are required for SMs to maintain the thread and block IDs, and to track their execution status.
• For example:
• G80 (16 SMs)
• Each SM can track up to 8 blocks or 768 threads at a time
• 3 blocks x 256 threads, or
• 6 blocks x 128 threads, or …. etc
• Max number of threads at a time = 16 SMs x 768 threads = 12,288 threads
• G200 (30 SMs)
• Each SM can process up to 8 blocks or 1024 threads at a time
• Max threads: 30 SMs x 1024 threads = 30,720 threads
• If we assign more blocks to the SM than it can track at once (which CUDA limits allow), the extra blocks are scheduled for later execution.
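As a hedged, host-side sketch (not part of the slides): the corresponding limits of whatever GPU the code is running on can be queried at runtime with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                         // properties of device 0
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}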
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
• Why is it good to know this stuff? (i.e. warps, GPU limits, etc.)
• One benefit is to allow for full utilization of each SM on the GPU
• Will discuss more on how these concepts are used when improving performance in the “CUDA Best Practices”
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Granularity: An Example
Consider G80
• CUDA Limits: 512 threads per block, 2^16 x 2^16 blocks per grid
• These are the limits for CUDA 1.0 supported by G80
• GPU Limits: 8 blocks or 768 threads per SM
• Assume we have thousands of threads to run. To fully utilize each SM on G80, should we use 8X8, 16X16 or 32X32 threads per block?
• For 32X32, we have 1024 threads per Block. Not even one block can fit into an SM!
• For 8X8, we have 64 threads per Block. Since each SM can work with only 8 blocks at a time, this means 64 × 8 = 512 threads will go into each SM. But since an SM needs 768 threads for full utilization, it is only about 66% full – underutilized (fewer warps to schedule).
• For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, with a lot of warps to schedule.
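A minimal sketch of the corresponding 16×16 launch configuration (assuming WIDTH is a multiple of 16 and that d_M, d_N and d_P have already been allocated on the device, as in the earlier kernel):

// 16 x 16 = 256 threads per block; on G80, 3 such blocks fill an SM's 768-thread budget
dim3 dimBlock(16, 16);
dim3 dimGrid(WIDTH / 16, WIDTH / 16);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, WIDTH);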
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Utilizing the Hardware
Key things that need to be considered:
• Have a number of blocks >= the number of SMs
  • Want to utilize all SMs
• Have a reasonable number of threads per block
  • Fully utilize each SM
• Occupancy
  • Occupancy is defined as the ratio of active warps on an SM to the maximum number of active warps supported by the SM
  • Occupancy varies over time as warps begin and end, and can be different for each SM
• Low occupancy results in poor instruction issue efficiency;
not enough eligible warps to hide latency between dependent instructions
https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm
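As a hedged sketch (host code, not from the slides; it assumes the MatrixMulKernel running example and a 256-thread block), the CUDA runtime can report how many blocks of a kernel fit on one SM, from which the theoretical occupancy follows:

int numBlocks;         // blocks of MatrixMulKernel that one SM can hold concurrently
int blockSize = 256;   // threads per block used at launch
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MatrixMulKernel, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float activeWarps = numBlocks * blockSize / 32.0f;            // warps resident on the SM
float maxWarps    = prop.maxThreadsPerMultiProcessor / 32.0f; // warps the SM can track
printf("Theoretical occupancy: %.0f%%\n", 100.0f * activeWarps / maxWarps);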
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Device Memories
1. Programmable
• Can control which data to put in that memory
• Includes:
• Registers
• Shared memory
• Local memory
• Constant memory
• Global memory
2. Non-programmable
• Cannot control which data is put in that memory
• Includes:
• L1 Cache memory
• L2 Cache memory
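A hedged sketch of how the programmable memories show up in code (the kernel and variable names are illustrative, not from the slides):

__constant__ float coeffs[16];               // constant memory: kernels may only read it

__global__ void memoryKinds(float* g_data) { // g_data points to global memory (cudaMalloc'd by the host)
    int i = threadIdx.x;                     // automatic scalar variable -> kept in a register
    __shared__ float tile[32];               // shared memory: one copy per block
    float buf[64];                           // per-thread array: may be placed in local memory
    buf[i % 64]  = g_data[i];                // read from global memory
    tile[i % 32] = buf[i % 64] * coeffs[i % 16];
    __syncthreads();                         // threads of the block cooperate via shared memory
    g_data[i] = tile[i % 32];                // write the result back to global memory
}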
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Remember: GPU Design
[Figure: GPU with many SMs, each with its own parallel data cache, connected through load/store units to global memory (DRAM)]
▪ A block is assigned to exactly one SM. An SM may run many blocks concurrently.
▪ All SMs can access the global memory
▪ We have different memories (as discussed before)
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Device Memories
Registers
– Private to each thread
– Partitioned among the threads in a block
– More threads means fewer registers per thread
– In the diagram: if # of active threads = 1024, then each thread gets 32,768 / 1024 = 32 registers
Shared memory
– Shared by threads in the same block
– Partitioned among blocks
• Remember that several blocks may be assigned to the same SM at the same time.
• More blocks means less shared memory per block.
– There is 64KB on-chip configurable memory, which is partitioned between shared memory and L1 cache.
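As a hedged illustration (host code; the exact split is a preference the driver may honor), the 64 KB shared-memory/L1 split can be influenced per kernel with cudaFuncSetCacheConfig, using the MatrixMulKernel running example:

// Prefer the larger shared-memory share (e.g. 48 KB shared / 16 KB L1) for this kernel
cudaFuncSetCacheConfig(MatrixMulKernel, cudaFuncCachePreferShared);
// Or prefer a larger L1 cache instead (e.g. 16 KB shared / 48 KB L1):
// cudaFuncSetCacheConfig(MatrixMulKernel, cudaFuncCachePreferL1);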
[Figure: a single SM on the Fermi architecture — instruction cache, warp scheduler and dispatch units, register file (32,768 × 4 bytes), load/store units, special function units, and 64 KB of shared memory & L1 cache]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Device Memories
Global Memory (not shown in figure)
– Shared by all threads in the grid
Question: How does this structure help implement zero-overhead thread scheduling?
▪ Zero-overhead (negligible) context switching, because each thread has its own registers and each block has its own shared memory
▪ GPU implements latency-hiding
– Many warps to run, and long latencies
from reading global memory are hidden by
warp-scheduling
– The long waiting time of warp instructions
is hidden by executing instructions from other warps
[Figure: a single SM on the Fermi architecture (same diagram as the previous slide)]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Device Memory Model
Device can:
– Read/Write registers • ~1 cycle.
– Read/Write shared memory • ~5 cycles
– Read/Write global memory
• ~500 cycles
• Allocated by host using cudaMalloc
– Read only constant memory
• ~5 cycles with caching
• a static global memory area
which is cached.
Host can transfer data per-grid to/from global/constant memory.
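A minimal host-side sketch of those per-grid transfers (h_A is a hypothetical host array of n floats; error checking omitted):

float* d_A;
size_t bytes = n * sizeof(float);
cudaMalloc(&d_A, bytes);                              // allocate global memory on the device
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // host -> global memory
// ... launch kernels that read/write d_A ...
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);  // global memory -> host
cudaFree(d_A);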
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
[Figure: device memory model — per-thread registers, per-block shared memory, and grid-wide global and constant memory accessible by the host]
Constant Memory
▪ Like global memory (DRAM) but has a dedicated on-chip cache for improved performance.
▪ Initialized in host code
– Host can read/write
– Kernels can only read it
▪ Has limited size (64 KB)
▪ Which data is stored in constant memory?
– variables declared as __constant__
– __global__ function parameters are passed to the device via
constant memory (limit is 4 KB)
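A hedged sketch of both routes into constant memory (d_filter, h_filter and applyFilter are illustrative names, not from the slides):

__constant__ float d_filter[64];       // lives in constant memory; kernels can only read it

// Host code: fill constant memory before launching any kernel that uses it
float h_filter[64] = { /* ... */ };
cudaMemcpyToSymbol(d_filter, h_filter, sizeof(h_filter));

// Kernel parameters (d_in, d_out, n) are themselves passed via constant memory (up to 4 KB)
__global__ void applyFilter(const float* d_in, float* d_out, int n);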
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

Parallel Memory Sharing
Registers / Local Memory:
– Private per thread
– Auto variables
– Register spill
• When out of registers, use local memory (see next slide)
Shared Memory:
– Shared by threads of the same block
– Inter-thread communication
Global Memory:
– Shared by all threads
– Inter-Grid communication
[Figure: registers/local memory per thread, shared memory per block, and global memory shared by sequential grids (Grid 0, Grid 1) over time]
Topic 14: Scheduling, Warps and Memory
COSC 407: Intro to Parallel Computing
Where is the Local Memory?
▪ Does not refer to a new physical memory
– It is on the global memory
• Data is put there by the compiler
– Local because each thread has
its own private area.
– Unlike global memory, it is
cached (L1)
When is the Local Memory Used?
▪ When we run out of registers (Called register spilling)
• Remember: there is a limit on # of registers per thread
▪ When declaring arrays inside kernels
• Some arrays are still stored in registers if small and the compiler can
resolve indexing (registers aren’t indexable)
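A hedged sketch (localMemExample is an illustrative name): a per-thread array declared inside a kernel that the compiler is likely to place in local memory because of its dynamic indexing:

__global__ void localMemExample(float* out) {
    float buf[16];                            // per-thread array (candidate for local memory)
    for (int i = 0; i < 16; i++)
        buf[i] = threadIdx.x * 0.5f + i;      // fill with per-thread values
    int idx = threadIdx.x % 16;               // index not known at compile time -> hard to keep in registers
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[idx];
}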
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing

L1 and L2 Cache
▪ Cache is a small, non-programmable memory
▪ Cache memories (L1 and L2) help multiple threads that access the same memory segment so that they do not need to all go to the DRAM
Aside: L2 is coherent; L1 is not coherent.
– “Not coherent” means that if two SMs are working on the same global memory location, it is not guaranteed that one SM will immediately see the changes made by the other SM
[Figure: L1 and L2 caches between the SMs and global memory]
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Conclusion:
– CUDA Scalability
– Thread Scheduling on the H/W: Thread Lifecycle
– zero-overhead and latency tolerance
– GPU limits
– CUDA Memories Types (and Performance)
Topic 14: Scheduling, Warps and Memory COSC 407: Intro to Parallel Computing
Next:
– CUDA Memories Types (and Performance)
– Memory Access Challenges
– Thread Performance
– More Example: Improving Performance of Matrix Multiplication
