
Chapter 4: Data-Level Parallelism
p 4.1 Introduction
p 4.2 Vector Architecture
p 4.3 SIMD Instruction Set Extensions
p 4.4 GPU (Graphics Processing Unit)


Chapter 4: Data-Level Parallelism
What is a Graphics Card?
A graphics card controls what is shown on a computer monitor and computes 3D images and graphics.

Chapter 4: Data-Level Parallelism
Main Steps

Chapter 4: Data-Level Parallelism
Main Steps

Chapter 4: Data-Level Parallelism
Graphics Pipeline

Chapter 4: Data-Level Parallelism
GPU Evolution
p Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s), including high-performance floating-point units
m Programmability was an afterthought
m Started in 1999 with the GeForce 256
p Over time, more programmability was added (2001-2005)
m New language Cg (Nvidia) for writing small programs run on each vertex or each pixel
p Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations
m Incredibly difficult programming model, as it had to use the graphics pipeline model for general computation

Chapter 4: Data-Level Parallelism
Pre-1999 PC 3D graphics accelerator

Chapter 4: Data-Level Parallelism
GPU* circa 1999

Chapter 4: Data-Level Parallelism
Direct3D 9 programmability: 2002

Chapter 4: Data-Level Parallelism
Direct3D 10 programmability: 2006

Chapter 4: Data-Level Parallelism
GPU is fast!

Chapter 4: Data-Level Parallelism
Large Number of Cores

Chapter 4: Data-Level Parallelism
NVIDIA Tesla C870
p 518 GFLOPS per card
p 1.5 GB memory
p 128 streaming processors (SPs)

Chapter 4: Data-Level Parallelism
Building Blocks for Supercomputers
www.top500.org Nov. 2011

Chapter 4: Data-Level Parallelism
Building Blocks for Supercomputers
www.top500.org Nov. 2013

Chapter 4: Data-Level Parallelism
Comparing CPU and GPU

Chapter 4: Data-Level Parallelism
General-Purpose GPUs (GP-GPUs)
p Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
p Host CPU issues data-parallel kernels to GP-GPU for execution
p Programming model is “Single Instruction, Multiple Thread” (SIMT)

Chapter 4: Data-Level Parallelism
CUDA: Compute Unified Device Architecture
p In 2006, Nvidia introduced the GeForce 8800 GPU, supporting a new programming language: CUDA
p CUDA™ is a parallel computing platform and programming model invented by NVIDIA
p It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU)

Chapter 4: Data-Level Parallelism
GPU Hardware (e.g., G80 GPU)
[Block diagram: input assembler; vertex, geometry (setup/raster/Z), and pixel thread issue; an array of thread processors; L2 caches; and frame buffer (FB) partitions]

Chapter 4: Data-Level Parallelism
GPU Hardware (e.g., G80 GPU)
p G80 = 8 TPCs (Texture Processing Clusters)
p Each TPC consists of 2 SMs (Streaming Multiprocessors)
p Each SM consists of 8 SPs (Streaming Processors)
p Each SM supports up to 768 threads

Chapter 4: Data-Level Parallelism
CUDA: Heterogeneous Computing
m Serial code runs on the host (CPU)
m Highly parallel code runs on the device (GPU)

Chapter 4: Data-Level Parallelism
Overview of CUDA Programming
p A kernel is executed as a Grid of thread Blocks
m Up to 512 threads in one block
m All threads in a block execute the same program (kernel) but on different data
m A block is assigned to a processor that executes its code
m Blocks are independent of one another
p Only ONE kernel executes at a time
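This grid/block organization appears directly in CUDA's kernel-launch syntax. A minimal sketch, assuming nothing beyond standard CUDA C (the kernel name and sizes are illustrative, using the 512-thread block limit from the slide):

```cuda
// Kernel: every thread runs the same code on a different element,
// selected by its block index and thread index.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may be partial
        data[i] *= 2.0f;
}

// Host side: launch ONE kernel as a grid of 16 independent blocks,
// each with 512 threads (the per-block maximum mentioned above).
// scale<<<16, 512>>>(d_data, 16 * 512);
```

Because the blocks are independent, the hardware is free to run them in any order on however many processors are available.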

Chapter 4: Data-Level Parallelism
Hardware Execution Model
p A GPU is built from multiple parallel cores; each core is a multithreaded SIMD processor with multiple lanes but no scalar processor
p CPU sends whole “grid” to GPU, which distributes thread blocks among cores (each thread block executes on one core)

Chapter 4: Data-Level Parallelism
Memory Hierarchy
p Registers
p Local memory
p Shared memory (shared by the threads in the same block)
p Device memory (texture, constant, local, global)

Chapter 4: Data-Level Parallelism
CUDA Memory Types
registers        read-write   per-thread
local memory     read-write   per-thread
shared memory    read-write   per-block
global memory    read-write   per-grid
constant memory  read-only    per-grid
texture memory   read-only    per-grid
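As a rough illustration, assuming nothing beyond standard CUDA C, the memory spaces above map to declarations like these (all names are illustrative):

```cuda
__constant__ float coeffs[16];   // constant memory: read-only, per-grid

// in and out point to global memory: read-write, per-grid
__global__ void blur(float *out, const float *in)
{
    int tid = threadIdx.x;
    float acc = 0.0f;            // a register: read-write, per-thread
    __shared__ float tile[256];  // shared memory: read-write, per-block

    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();             // whole block syncs before reading tile

    acc = tile[tid] * coeffs[tid % 16];
    out[blockIdx.x * blockDim.x + tid] = acc;
}
```

Texture memory is accessed through a separate fetch API, and local memory is where the compiler spills per-thread data, so neither appears as a simple declaration here.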

Chapter 4: Data-Level Parallelism
Example for Data Division and Thread

Chapter 4: Data-Level Parallelism
Example: DAXPY
Conventional C code for the DAXPY loop!
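A minimal sketch of that conventional loop (the function and array names are illustrative):

```c
#include <stddef.h>

/* DAXPY: Y = a*X + Y in double precision, one element per iteration */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

On a CPU this runs sequentially over all n elements; the CUDA version instead assigns one thread per element.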

Chapter 4: Data-Level Parallelism
CUDA code for DAXPY
p Launch n threads, one per element
p 256 CUDA threads per thread block in a multithreaded SIMD processor
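A sketch of such a kernel, following the well-known textbook formulation of CUDA DAXPY:

```cuda
// Each of the n threads computes one element: i is derived from the
// thread's block index, the block size, and its index within the block.
__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the last block may be partial
        y[i] = a * x[i] + y[i];
}

// Host code: 256 threads per block, enough blocks to cover n elements
// daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);
```

The rounded-up block count means some threads in the final block fall past the end of the arrays, which is why the `i < n` guard is needed.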

Chapter 4: Data-Level Parallelism
Example: X * Y (8,192 elements)
p A Grid works on the whole 8,192 elements
p A Grid is composed of Thread Blocks, each processing 512 elements
m # of blocks = 8,192/512 = 16
p A SIMD instruction executes 32 elements at a time
m # of SIMD threads in a block = 512/32 = 16
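The same sizing arithmetic, written out as a sketch (constants taken from the slide; function names are illustrative):

```c
/* Grid sizing for the 8,192-element example */
int num_blocks(int n, int elems_per_block)
{
    return n / elems_per_block;          /* 8192 / 512 = 16 blocks */
}

int simd_threads_per_block(int elems_per_block, int simd_width)
{
    return elems_per_block / simd_width; /* 512 / 32 = 16 SIMD threads */
}
```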

Chapter 4: Data-Level Parallelism
Programmer’s View of Execution

Chapter 4: Data-Level Parallelism
Multithreaded SIMD Processor

Chapter 4: Data-Level Parallelism
Scheduling of Threads of SIMD Instructions
p Each multithreaded SIMD processor has one scheduler
p The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD lanes executing the SIMD thread
p Threads of SIMD instructions are independent, so the scheduler may select a different SIMD thread each time

Chapter 4: Data-Level Parallelism
CUDA Threads

Chapter 4: Data-Level Parallelism
Parallel Program Organization in CUDA

Chapter 4: Data-Level Parallelism
CUDA Program Example

Chapter 4: Data-Level Parallelism

Chapter 4: Data-Level Parallelism
Mapping Threads to GPU

Chapter 4: Data-Level Parallelism
Block of Threads vs. SM
• A Warp: threads that execute the same instruction at the same time
• An SP runs one CUDA thread
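Given the 32-element SIMD width used earlier in the chapter, a thread's warp and its lane (position within the warp) follow from its linear index. A small sketch, with illustrative names:

```c
#define WARP_SIZE 32  /* threads that execute one instruction together */

/* Which warp a thread belongs to */
int warp_id(int thread_idx) { return thread_idx / WARP_SIZE; }

/* The thread's lane (position) inside its warp */
int lane_id(int thread_idx) { return thread_idx % WARP_SIZE; }
```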

Chapter 4: Data-Level Parallelism
Global Block Scheduler
p The global block scheduler manages and allocates thread blocks to SMs
p It balances the load across SMs

Chapter 4: Data-Level Parallelism
Life Cycle of a Thread
p Grid is started on the GPU
p A block of threads is allocated to an SM
p The SM organizes the threads of the block into warps
p Warps are scheduled on the SM
