Chapter 4: Data-Level Parallelism
• 4.1 Introduction
• 4.2 Vector Architecture
• 4.3 SIMD Instruction Set Extensions
• 4.4 GPU (Graphics Processing Unit)
What is a Graphics Card?
A graphics card controls what is displayed on a computer monitor and computes 3D images and graphics.
Main Steps
Graphics Pipeline
GPU Evolution
• Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s), including high-performance floating-point units
  – Programmability was an afterthought
  – Started in 1999 with the GeForce 256
• Over time, more programmability was added (2001-2005)
  – New language Cg (Nvidia) for writing small programs that run on each vertex or each pixel
• Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading operations
  – Incredibly difficult programming model, as general computation had to be expressed through the graphics pipeline
Pre-1999 PC 3D graphics accelerator
GPU* circa 1999
Direct3D 9 programmability: 2002
Direct3D 10 programmability: 2006
GPU is fast!
Large Number of Cores
NVIDIA Tesla C870
• 518 Gflops per card
• 1.5 GB memory
• 128 streaming processors (SPs)
Building Blocks for Supercomputers
www.top500.org Nov. 2011
Building Blocks for Supercomputers
www.top500.org Nov. 2013
Comparing CPU and GPU
General-Purpose GPUs (GP-GPUs)
• Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
• The host CPU issues data-parallel kernels to the GP-GPU for execution
• The programming model is Single Instruction, Multiple Thread (SIMT), a thread-oriented variant of SIMD
CUDA (Compute Unified Device Architecture)
• In 2006, Nvidia introduced the GeForce 8800 GPU, which supported a new programming language: CUDA
• CUDA™ is a parallel computing platform and programming model invented by NVIDIA
• It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU)
GPU Hardware (e.g., G80 GPU)
[Figure: G80 block diagram — input assembler; vertex, geometry, and pixel thread issue units (setup/raster/Z); an array of thread processors; L2 cache slices; frame buffer (FB) partitions]
GPU Hardware
[Slide content garbled in the original extraction; not recoverable]
CUDA: Heterogeneous Computing
• Serial code runs on the host (CPU)
• Highly parallel code runs on the device (GPU)
Overview of CUDA Programming
• A kernel is executed as a Grid of thread Blocks
  – Up to 512 threads in one block
  – All threads in a block execute the same program (kernel) but on different data
  – A block is assigned to a processor that executes the code
  – Blocks are independent of one another
• Only ONE kernel executes at a time (see the launch sketch below)
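To make the grid/block structure concrete, here is a minimal sketch (the kernel name scale and all sizes are illustrative, not from the slides): each thread handles one array element, and the launch picks enough 256-thread blocks to cover the array.

#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                        // last block may have extra threads
        data[i] *= factor;
}

int main(void) {
    const int n = 4096;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));          // device allocation
    const int threadsPerBlock = 256;                 // within the 512-thread limit above
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // launch one grid of 16 blocks
    cudaDeviceSynchronize();                         // wait for the kernel to finish
    cudaFree(d_data);
    return 0;
}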
Hardware Execution Model
• The GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but no scalar processor
• The CPU sends a whole “grid” to the GPU, which distributes thread blocks among cores (each thread block executes on one core)
Memory Hierarchy
• Registers
• Local memory
• Shared memory (shared by the threads in the same block)
• Device memory (texture, constant, local, global)
CUDA Memory Types
Memory            Access       Scope
Registers         read-write   per thread
Local memory      read-write   per thread
Shared memory     read-write   per block
Global memory     read-write   per grid
Constant memory   read-only    per grid
Texture memory    read-only    per grid
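A short sketch of how these spaces appear in CUDA C (the names coeff and scale_shared are illustrative): the __constant__ and __shared__ qualifiers place data in constant and shared memory, plain local variables live in registers, and pointer arguments refer to global memory. It assumes 256 threads per block.

__constant__ float coeff[16];   // constant memory: read-only, visible to the whole grid
                                // (host would fill it via cudaMemcpyToSymbol)

__global__ void scale_shared(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // shared memory: read-write per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index held in a register (per thread)
    if (i < n)
        tile[threadIdx.x] = in[i];                  // read from global memory
    __syncthreads();                                // whole block has finished loading
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[0];      // write back to global memory
}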
Example for Data Division and Thread
Example: DAXPY
Conventional C code for the DAXPY loop!
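The listing itself did not survive extraction; the standard serial loop for DAXPY (y = a*x + y in double precision) looks like this:

// DAXPY computed serially on the CPU: y[i] = a*x[i] + y[i]
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}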
CUDA code for DAXPY
• Launch n threads, one per element
• 256 CUDA threads per thread block in a multithreaded SIMD processor
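The slide's listing is missing; a version consistent with the description above (one thread per element, 256 threads per block; d_x and d_y are assumed to be device pointers already populated by the host) would be:

// Device code: each thread computes one element of y
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host code: n threads in total, 256 per thread block
int nblocks = (n + 255) / 256;                      // round up to cover all elements
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);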
Example: X * Y (8,192 elements)
• A Grid works on the whole 8,192 elements
• A Grid is composed of Thread Blocks, each processing 512 elements
  – # of blocks = 8,192 / 512 = 16
• A SIMD instruction executes 32 elements at a time
  – # of SIMD threads in a block = 512 / 32 = 16
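A launch matching this breakdown might look as follows (kernel and pointer names are illustrative): the <<<16, 512>>> configuration gives 16 blocks of 512 threads, and the hardware runs each block as 512/32 = 16 SIMD threads (warps).

// Element-wise multiply of two 8,192-element vectors
__global__ void vec_mul(const float *x, const float *y, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // ranges over 0 .. 8191
    out[i] = x[i] * y[i];   // 8,192 is a multiple of 512, so no bounds check needed
}

// 8,192 / 512 = 16 thread blocks of 512 threads each
vec_mul<<<16, 512>>>(d_x, d_y, d_out);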
Programmer’s View of Execution
Multithreaded SIMD Processor
Scheduling of Threads of SIMD Instructions
• Each multithreaded SIMD processor has one scheduler
• The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD lanes executing the SIMD thread
• Threads of SIMD instructions are independent, so the scheduler may select a different SIMD thread each time
CUDA Threads
Parallel Program Organization in CUDA
CUDA Program Example
Mapping Threads to GPU
Block of Threads vs. SM
• A warp: a group of threads that execute the same instruction at the same time
• An SP (streaming processor) runs one CUDA thread
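One practical consequence of this lockstep execution, sketched with an illustrative kernel below: if threads within the same warp take different branches, the two paths are serialized rather than executed in parallel.

__global__ void divergent(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads 0-31 of a block form one warp. This branch splits every
    // warp in half, so the hardware runs the two paths one after the
    // other, masking off the inactive threads each time.
    if (threadIdx.x % 32 < 16)
        a[i] += 1.0f;
    else
        a[i] -= 1.0f;
}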
Global Block Scheduler
• The global block scheduler manages and allocates blocks of threads to the SMs
• It balances the load across SMs
Life Cycle of Thread
p Grid is started on GPU p A block of threads
allocated to a SM
p SM organizes threads of a given block into warps
p Warps are scheduled on SM