
An Introduction to Modern GPU Architecture
Ashu Rege
Director of Developer Technology

Agenda
• Evolution of GPUs
• Computing Revolution
• Stream Processing
• Architecture details of modern GPUs

Evolution of GPUs

Evolution of GPUs (1995-1999)
• 1995 – NV1
• 1997 – Riva 128 (NV3), DX3
• 1998 – Riva TNT (NV4), DX5
  • 32-bit color, 24-bit Z, 8-bit stencil
  • Dual texture, bilinear filtering
  • 2 pixels per clock (ppc)
• 1999 – Riva TNT2 (NV5), DX6
  • Faster TNT
  • 128b memory interface
  • 32 MB memory
  • The chip that would not die ☺

[Image: Virtua Fighter (SEGA Corporation)]
NV1 (1995): 1M pixel ops/sec, 50K triangles/sec, 1M transistors, 16-bit color, nearest filtering

Evolution of GPUs (Fixed Function)
• GeForce 256 (NV10)
• DirectX 7.0
• Hardware T&L
• Cubemaps
• DOT3 – bump mapping
• Register combiners
• 2x Anisotropic filtering
• Trilinear filtering
• DXT texture compression
• 4 ppc
• Term “GPU” introduced

[Image: Deus Ex (Eidos/Ion Storm)]
NV10 (1999): 15M triangles/sec, 480M pixel ops/sec, 23M transistors, 32-bit color, trilinear filtering

NV10 – Register Combiners
[Diagram: one register combiner stage. The RGB portion reads inputs A, B, C, D from the input RGB/alpha registers through input mappings, computes A op1 B, C op2 D, and AB op3 CD, applies an RGB scale/bias, and feeds the next combiner’s RGB registers. The alpha portion reads its inputs from the alpha/blue registers, computes A op1 B, C op2 D, and AB op4 CD, applies an alpha scale/bias, and feeds the next combiner’s alpha registers.]
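To make the dataflow concrete, here is a minimal C sketch of what one combiner stage computes per fragment (illustrative, not from the original deck: the ops are fixed to multiply/sum and the scale/bias is identity, whereas real combiners select these per stage):

    typedef struct { float r, g, b; } rgb_t;

    /* Illustrative fixed ops: op1 = op2 = component-wise multiply, op3 = sum. */
    static rgb_t mul3(rgb_t a, rgb_t b) { rgb_t o = { a.r*b.r, a.g*b.g, a.b*b.b }; return o; }
    static rgb_t add3(rgb_t a, rgb_t b) { rgb_t o = { a.r+b.r, a.g+b.g, a.b+b.b }; return o; }

    rgb_t combiner_rgb(rgb_t A, rgb_t B, rgb_t C, rgb_t D)
    {
        rgb_t AB = mul3(A, B);   /* A op1 B */
        rgb_t CD = mul3(C, D);   /* C op2 D */
        return add3(AB, CD);     /* AB op3 CD; scale/bias (identity here) would apply last */
    }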

Evolution of GPUs (Shader Model 1.0)
• GeForce 3 (NV20)
• NV2A – Xbox GPU
• DirectX 8.0
• Vertex and Pixel Shaders
• 3D Textures
• Hardware Shadow Maps
• 8x Anisotropic filtering
• Multisample AA (MSAA)
• 4 ppc

[Image: Ragnarok Online (Atari/Gravity)]
NV20 (2001): 100M triangles/sec, 1G pixel ops/sec, 57M transistors, vertex/pixel shaders, MSAA

Evolution of GPUs (Shader Model 2.0)
• GeForce FX Series (NV3x)
• DirectX 9.0
• Floating Point and “Long” Vertex and Pixel Shaders
• Shader Model 2.0
  • 256 vertex ops
  • 32 tex + 64 arith pixel ops
• Shader Model 2.0a
  • 256 vertex ops
  • Up to 512 ops
• Shading Languages
  • HLSL, Cg, GLSL

[Image: Dawn Demo (NVIDIA)]
NV30 (2003): 200M triangles/sec, 2G pixel ops/sec, 125M transistors, Shader Model 2.0a

Evolution of GPUs (Shader Model 3.0)
• GeForce 6 Series (NV4x)
• DirectX 9.0c
• Shader Model 3.0
• Dynamic Flow Control in Vertex and Pixel Shaders¹
  • Branching, Looping, Predication, …
• Vertex Texture Fetch
• High Dynamic Range (HDR)
  • 64-bit render target
  • FP16x4 Texture Filtering and Blending
¹Some flow control first introduced in SM2.0a

[Image: Far Cry HDR (Ubisoft/Crytek)]
NV40 (2004): 600M triangles/sec, 12.8G pixel ops/sec, 220M transistors, Shader Model 3.0, Rotated Grid MSAA, 16x Aniso, SLI

Far Cry – No HDR/HDR Comparison

Evolution of GPUs (Shader Model 4.0)
• GeForce 8 Series (G8x)
• DirectX 10.0
  • Shader Model 4.0
  • Geometry Shaders
  • No “caps bits”
  • Unified Shaders
• New Driver Model in Vista
• CUDA based GPU computing
• GPUs become true computing processors measured in GFLOPS

[Image: Crysis (EA/Crytek)]
G80 (2006): Unified Shader Cores w/ Stream Processors, 681M transistors, Shader Model 4.0, 8x MSAA, CSAA

Crysis. Images courtesy of Crytek.

As Of Today…
• GeForce GTX 280 (GT200)
• DX10
• 1.4 billion transistors
• 576 mm² in 65nm CMOS
• 240 stream processors
  • 933 GFLOPS peak
• 1.3GHz processor clock
• 1GB DRAM
• 512-pin DRAM interface
  • 142 GB/s peak

Stunning Graphics Realism. Lush, Rich Worlds.
[Image: Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.]
Incredible Physics Effects. Core of the Definitive Gaming Platform.
[Image: Crysis © 2006 Crytek / Electronic Arts]

What Is Behind This Computing Revolution?
• Unified Scalar Shader Architecture
• Highly Data Parallel Stream Processing
• Next, let’s try to understand what these terms mean…

Unified Scalar Shader Architecture

Graphics Pipelines For Last 20 Years
Processor per function:
• Vertex: T&L evolved to vertex shading
• Triangle: triangle, point, line setup
• Pixel: flat shading, texturing, eventually pixel shading
• ROP: blending, Z-buffering, antialiasing
• Memory: wider and faster over the years

Shaders in Direct3D
• DirectX 9:
Vertex Shader, Pixel Shader
• DirectX 10:
Vertex Shader, Geometry Shader, Pixel Shader
• DirectX 11:
Vertex Shader, Hull Shader, Domain Shader, Geometry Shader, Pixel Shader, Compute Shader
• Observation: All of these shaders require the same basic functionality: Texturing (or Data Loads) and Math Ops.

Unified Pipeline
[Diagram: Vertex, Geometry (new in DX10), Pixel, and Compute (CUDA, DX11 Compute, OpenCL) workloads, along with Physics and future stages, all run on one unified Texture + Floating Point Processor, which feeds the ROP and Memory stages.]

Why Unify?
[Diagram: In a non-unified architecture, separate vertex shader and pixel shader hardware leads to unbalanced and inefficient utilization. A heavy geometry workload saturates the vertex shader while pixel shader hardware sits idle (Perf = 4); a heavy pixel workload saturates the pixel shader while vertex shader hardware sits idle (Perf = 8).]

Why Unify?
[Diagram: In a unified architecture, the same shader cores handle whichever workload dominates, giving optimal utilization: a heavy geometry workload and a heavy pixel workload both achieve Perf = 11.]

Why Scalar Instruction Shader (1)
• Vector ALU – efficiency varies
  • 4 lanes: MAD r2.xyzw, r0.xyzw, r1.xyzw – 100% utilization
  • 3 lanes: DP3 r2.w, r0.xyz, r1.xyz – 75%
  • 2 lanes: MUL r2.xy, r0.xy, r1.xy – 50%
  • 1 lane: ADD r2.w, r0.x, r1.x – 25%

Why Scalar Instruction Shader (2)
• Vector ALU with co-issue – better but not perfect
  • DP3 r2.x, r0.xyz, r1.xyz (3 lanes) co-issued with ADD r2.w, r0.w, r1.w (1 lane) } 100%
  • DP3 r2.w, r0.xyz, r1.xyz followed by ADD r2.w, r0.w, r2.w – cannot co-issue (the ADD depends on the DP3 result in r2.w)
• Vector/VLIW architecture – more compiler work required
• G8x, GT200: scalar – always 100% efficient, simple to compile
  • Up to 2x effective throughput advantage relative to vector
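For illustration (not from the original deck), the scalar expansion of that 3-wide dot product is three one-lane instructions, each at full utilization:

    struct vec3 { float x, y, z; };

    /* On a scalar GPU, DP3 lowers to a MUL and two MADs; each instruction
       fully occupies the single-lane ALU, so utilization stays at 100%
       regardless of how many components the source operation touches. */
    float dp3(struct vec3 a, struct vec3 b)
    {
        float t = a.x * b.x;   /* MUL */
        t += a.y * b.y;        /* MAD */
        t += a.z * b.z;        /* MAD, result -> r2.w */
        return t;
    }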

Complex Shader Performance on Scalar Arch.
[Chart: “Procedural Fire” (procedural Perlin noise fire shader) performance, 7900GTX vs. 8800GTX, relative scale 0–5.]

Conclusion
• Build a unified architecture with scalar cores where all shader operations are done on the same processors

Stream Processing

The Supercomputing Revolution (1)

The Supercomputing Revolution (2)

What Accounts For This Difference?
• Need to understand how CPUs and GPUs differ
  • Latency Intolerance versus Latency Tolerance
  • Task Parallelism versus Data Parallelism
  • Multi-threaded Cores versus SIMT (Single Instruction Multiple Thread) Cores
  • 10s of Threads versus 10,000s of Threads

Latency and Throughput
• “Latency is a time delay between the moment something is initiated, and the moment one of its effects begins or becomes detectable”
  • For example, the delay between a texture-read request and the moment the texture data is returned
• Throughput is the amount of work done in a given amount of time
  • For example, how many triangles are processed per second
• CPUs are low-latency, low-throughput processors
• GPUs are high-latency, high-throughput processors

Latency (1)
• GPUs are designed for tasks that can tolerate latency
  • Example: graphics in a game (simplified scenario):

[Diagram: The CPU generates Frame 0, Frame 1, Frame 2, … while the GPU, initially idle, renders Frame 0 and Frame 1 one step behind. The latency between frame generation and rendering is on the order of milliseconds.]

• To be efficient, GPUs must have high throughput, i.e. process millions of pixels in a single frame

Latency (2)
• CPUs are designed to minimize latency
  • Example: mouse or keyboard input
• Caches are needed to minimize latency
• CPUs are designed to maximize running operations out of cache
  • Instruction pre-fetch
  • Out-of-order execution, flow control
• ⇒ CPUs need a large cache, GPUs do not
  • GPUs can dedicate more of the transistor area to computation horsepower

CPU versus GPU Transistor Allocation
• GPUs can have more ALUs for the same sized chip and therefore run many more threads of computation

[Diagram: A CPU die devotes large areas to control logic and cache, with relatively few ALUs; a GPU die devotes most of its area to ALUs, with small control and cache blocks. Both connect to DRAM.]

• Modern GPUs run 10,000s of threads concurrently

Managing Threads On A GPU
• How do we:
  • Avoid synchronization issues between so many threads?
  • Dispatch, schedule, cache, and context switch 10,000s of threads?
  • Program 10,000s of threads?
• Design GPUs to run specific types of threads:
  • Independent of each other – no synchronization issues
  • SIMD (Single Instruction Multiple Data) threads – minimize thread management
    • Reduce hardware overhead for scheduling, caching etc.
  • Program blocks of threads (e.g. one pixel shader per draw call, or group of pixels)
• Any problems which can be solved with this type of computation?

Data Parallel Problems
• Plenty of problems fall into this category (luckily ☺)
  • Graphics, image & video processing, physics, scientific computing, …
• This type of parallelism is called data parallelism
  • And GPUs are the perfect solution for them!
  • In fact, the more data there is, the more efficient GPUs become at these algorithms
• Bonus: you can relatively easily add more processing cores to a GPU and increase the throughput

Parallelism in CPUs v. GPUs
• CPUs use task parallelism
  • Multiple tasks map to multiple threads
  • Tasks run different instructions
  • 10s of relatively heavyweight threads run on 10s of cores
  • Each thread managed and scheduled explicitly
  • Each thread has to be individually programmed
• GPUs use data parallelism
  • SIMD model (Single Instruction Multiple Data)
  • Same instruction on different data
  • 10,000s of lightweight threads on 100s of cores
  • Threads are managed and scheduled by hardware
  • Programming done for batches of threads (e.g. one pixel shader per group of pixels, or draw call)

Stream Processing
• What we just described:
  • Given a (typically large) set of data (“stream”)
  • Run the same series of operations (“kernel” or “shader”) on all of the data (SIMD)
• GPUs use various optimizations to improve throughput:
  • Some on-chip memory and local caches to reduce bandwidth to external memory
  • Batch groups of threads to minimize incoherent memory access
    • Bad access patterns will lead to higher latency and/or thread stalls
  • Eliminate unnecessary operations by exiting or killing threads
    • Example: Z-Culling and Early-Z to kill pixels which will not be displayed
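As a concrete sketch of this model (illustrative, not from the original deck: the kernel, sizes, and launch configuration are assumptions), a CUDA program runs the same kernel over every element of a data stream:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Kernel: every thread runs the same code on a different stream element. */
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                         /* guard threads past the stream end */
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;
        cudaMalloc((void**)&dx, bytes);
        cudaMalloc((void**)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        /* One lightweight thread per element, batched into blocks of 256. */
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", hy[0]);      /* expect 4.0 */
        cudaFree(dx); cudaFree(dy); free(hx); free(hy);
        return 0;
    }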

To Summarize
• GPUs use stream processing to achieve high throughput
  • GPUs designed to solve problems that tolerate high latencies
  • High latency tolerance ⇒ lower cache requirements
  • Less transistor area for cache ⇒ more area for computing units
  • More computing units ⇒ 10,000s of SIMD threads and high throughput
  • GPUs win ☺
• Additionally:
  • Threads managed by hardware ⇒ you are not required to write code for each thread and manage them yourself
  • Easier to increase parallelism by adding more processors
• So, the fundamental unit of a modern GPU is a stream processor…

G80 and GT200 Streaming Processor Architecture

Building a Programmable GPU
• The future of high throughput computing is programmable stream processing
• So build the architecture around the unified scalar stream processing cores
• GeForce 8800 GTX (G80) was the first GPU architecture built with this new paradigm

G80 Replaces The Pipeline Model
128 Unified Streaming Processors

[Diagram: The host feeds an input assembler; vertex, geometry, and pixel thread issue units (the latter behind setup/raster/ZCull) dispatch onto a single array of 128 SPs grouped with texture filter (TF) units and L1 caches, backed by six L2 cache slices and six framebuffer (FB) partitions. A thread processor manages thread execution across the array.]

GT200 Adds More Processing Power
[Diagram: The host CPU and system memory connect to the GPU through the host interface. An input assembler and a viewport/clip/setup/raster/ZCull unit feed separate vertex, geometry, pixel, and compute work distribution units, which dispatch onto the processor array. An interconnection network links the array to eight ROP/L2 partitions, each with its own DRAM channel.]

8800GTX (high-end G80)
• 16 Stream Multiprocessors
  • Each one contains 8 unified streaming processors – 128 in total

GTX280 (high-end GT200)
• 30 Stream Multiprocessors
  • Each one contains 8 unified streaming processors – 240 in total
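These counts can be queried at runtime (a sketch using the standard CUDA device-properties API; the printed values depend on the installed GPU):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            /* device 0 */
        printf("Stream Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
        return 0;
    }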

Inside a Stream Multiprocessor (SM)
• Scalar register-based ISA
• Multithreaded Instruction Unit
  • Up to 1024 concurrent threads
  • Hardware thread scheduling
  • In-order issue
• 8 SP: Thread Processors
  • IEEE 754 32-bit floating point
  • 32-bit and 64-bit integer
  • 16K 32-bit registers
• 2 SFU: Special Function Units
  • sin, cos, log, exp
• 1 DP: Double Precision Unit
  • IEEE 754 64-bit floating point
  • Fused multiply-add
• 16KB Shared Memory

[Diagram: SM block with I-Cache, MT Issue, C-Cache, 8 SPs, 2 SFUs, a DP unit, and Shared Memory.]
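The SFUs are what service CUDA's fast transcendental intrinsics; a small illustrative kernel (the kernel itself is hypothetical, not from the deck):

    /* __sinf and __expf are reduced-precision hardware intrinsics that map to
       the SM's Special Function Units, unlike the slower, more accurate
       library versions sinf and expf. */
    __global__ void transcend(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __sinf(in[i]) + __expf(-in[i]);
    }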

Multiprocessor Programming Model
• Workloads are partitioned into blocks of threads among multiprocessors
  • A block runs to completion
  • A block doesn’t run until resources are available
• Allocation of hardware resources
  • Shared memory is partitioned among blocks
  • Registers are partitioned among threads
• Hardware thread scheduling
  • Any thread not waiting for something can run
  • Context switching is free – every cycle
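A minimal sketch of this programming model (illustrative, not from the deck): each block below reduces its own slice of the input inside its private partition of shared memory; blocks are scheduled whenever resources become available and run to completion:

    #define BLOCK 256

    /* Each block sums BLOCK input elements in its own shared-memory partition
       and writes one partial sum; launch as block_sum<<<n / BLOCK, BLOCK>>>
       after allocating and copying as in the earlier SAXPY sketch. */
    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float s[BLOCK];            /* partitioned among blocks */
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * BLOCK + tid];
        __syncthreads();

        /* Tree reduction within the block. */
        for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = s[0];           /* one partial sum per block */
    }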

Memory Hierarchy of G80 and GT200
• SM can directly access device memory (video memory)
  • Not cached
  • Read & write
  • GT200: 140 GB/s peak
• SM can access device memory via texture unit
  • Cached
  • Read-only, for textures and constants
  • GT200: 48 GTexels/s peak
• On-chip shared memory shared among threads in an SM
  • Important for communication amongst threads
  • Provides low-latency temporary storage
  • G80 & GT200: 16KB per SM
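To show how these spaces appear to a program (an illustrative sketch; the kernel and names are assumptions): device memory is read and written directly, constants go through the cached read-only path, and shared memory carries data between threads in the same block:

    /* Cached, read-only constant data (the "textures and constants" path);
       fill from the host with cudaMemcpyToSymbol before launching. */
    __constant__ float coeff[2];

    /* Assumes a launch with blockDim.x == 256. */
    __global__ void blend(const float *in, float *out, int n)
    {
        __shared__ float tile[256];              /* on-chip, low latency */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        tile[threadIdx.x] = in[i];               /* direct device-memory read */
        __syncthreads();

        int left = (threadIdx.x > 0) ? threadIdx.x - 1 : 0;
        out[i] = coeff[0] * tile[threadIdx.x]    /* cached constant reads */
               + coeff[1] * tile[left];          /* device-memory write */
    }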

Performance Per Millimeter
• For GPU, performance == throughput
  • Caches are limited in the memory hierarchy
• Strategy: hide latency with computation, not cache
  • Heavy multithreading
  • Switch to another group of threads when the current group is waiting for memory access
• Implication: need a large number of threads to hide latency
  • Occupancy: typically 128 threads/SM minimum
  • Maximum 1024 threads/SM on GT200 (total 1024 × 30 = 30,720 threads)
• Strategy: Single Instruction Multiple Thread (SIMT)

SIMT Thread Execution
• Group 32 threads (vertices, pixels or primitives) into warps
  • Threads in a warp execute the same instruction at a time
  • Shared instruction fetch/dispatch
  • Hardware automatically handles divergence (branches)
• Warps are the primitive unit of scheduling
  • Pick 1 of 24 warps for each instruction slot
• SIMT execution is an implementation choice
  • Shared control logic leaves more space for ALUs
  • Largely invisible to programmer

[Diagram: SM block with I-Cache, MT Issue, C-Cache, 8 SPs, 2 SFUs, a DP unit, and Shared Memory.]

Shader Branching Performance
• G8x/G9x/GT200 branch granularity is 32 threads (1 warp)
• If threads diverge, both sides of the branch will execute on all 32
• More efficient than an architecture with a branch granularity of 48 threads

[Chart: PS branching efficiency versus the number of coherent 4×4 tiles, comparing G80 (32-pixel coherence) against a 48-pixel-coherence architecture.]
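An illustrative CUDA pair (hypothetical kernels, not from the deck) showing a warp-divergent branch versus a warp-uniform one:

    /* Condition differs within each warp: the warp executes BOTH paths with
       inactive threads masked off, roughly halving throughput. */
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0) out[i] = in[i] * 2.0f;
        else            out[i] = in[i] + 1.0f;
    }

    /* Condition is constant within each 32-thread warp, so every warp takes
       exactly one path: no divergence penalty. */
    __global__ void uniform_branch(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i / 32) % 2 == 0) out[i] = in[i] * 2.0f;
        else                   out[i] = in[i] + 1.0f;
    }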

Conclusion:
G80 and GT200 Streaming Processor Architecture
• Executing in blocks maximally exploits data parallelism
  • Minimizes incoherent memory access
  • Adding more ALUs yields better performance
• Performs data processing in SIMT fashion
  • Groups 32 threads into warps
  • Threads in a warp execute the same instruction at a time
• Thread scheduling is automatically handled by hardware
  • Context switching is free (every cycle)
  • Transparent scalability; easy for programming
• Memory latency is covered by a large number of in-flight threads
  • Cache is mainly used for read-only memory access (textures, constants)