
CSC 367 Parallel Programming

General-purpose computing with Graphics Processing Units (GPUs) (Introduction)
University of Toronto Mississauga, Department of Mathematical and Computational Sciences


• Revisiting PC architecture
• Why GPUs?
• General-purpose GPUs – the architecture basics

GPU computing
• Seymour Cray, "the father of supercomputing": "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"

Classic PC architecture
• CPU, DRAM, video
• Northbridge connects 3 components that must communicate at high speed
• Video also needs to have 1st-class access to DRAM
• Older NVIDIA cards are connected to AGP, up to 2 GB/s transfers
• Southbridge serves slower I/O devices
• e.g., hard drives, USB ports, Ethernet ports, etc.
• Core logic chipset acts as a switch
• Routes I/O traffic among the different devices that make up the system
Img src: https://arstechnica.com/features/2004/07/pcie/

Original PCI bus
• Shared bus topology
• Network, sound card, etc. use bus to communicate with CPU
• Needs bus arbitration: who gets access and when?
• Southbridge, Northbridge and CPU together form the host (root) role
• Detect and initialize PCI devices, control the PCI bus
• Simple, easy to implement
• Access PCI devices in the same way as accessing memory (LD/ST)
• Limitations:
• Performance (doesn't scale)
• Advanced functionality

PCI Express (PCIe)
• Point-to-point bus topology
• Point-to-point connections between any two devices
• Shared switch instead of shared bus
• Each device has its own link
• CPU can talk to any device directly
• Switch manages all traffic, centralizes resource-sharing decisions
• Other advantages
• Prioritizing certain traffic (QoS)
• e.g., fewer dropped frames

PCIe Links and Lanes
• Each link consists of one or more lanes
• Lane: 4 wires (1 bit per cycle in both directions)

Std size   PCIe 1.0   PCIe 2.0   PCIe 3.0   PCIe 4.0
x1         250MB/s    500MB/s    ~1GB/s     ~2GB/s
x4         1GB/s      2GB/s      ~4GB/s     ~8GB/s
x8         2GB/s      4GB/s      ~8GB/s     ~16GB/s
x16        4GB/s      8GB/s      ~16GB/s    ~32GB/s

• Smaller card may go in larger slot
• Small cards might take only 1 lane
• High-end graphics cards – 16 lanes
• PCIe SSD drives – x4, x8, …
Source: http://en.wikipedia.org/wiki/PCI_Express

PCIe PC architecture
• PCIe forms the interconnect backbone
• Expansion cards
• e.g., Discrete video card
• Northbridge and Southbridge are both PCIe switches
• Some Southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards
• Some PCIe I/O cards are PCI cards with a PCI-PCIe bridge
• SSD card, more RAM (very fast, but very expensive!)
Source: PCI Express: An Overview, https://arstechnica.com/features/2004/07/pcie/

CPU-GPU architecture
• CPU: latency-optimized, ILP, typical cache hit rate: 99%, typical memory bus width: 64-bit
• GPU: throughput-optimized, ILP + TLP, cache hit rate: 90% or less, typical memory bus width: 256-bit, 384-bit, 512-bit
[Diagram: CPU (aka Host) with CPU memory, connected over the PCIe bus (8-16GB/s) to the GPU (aka Device) with GPU memory (100-300GB/s)]

• Large collection of SIMD multiprocessors
• Massive thread parallelism – 100s of processors, high memory bandwidth
• Good for data-parallel computations
• Must have an inherently high level of parallelism in our application

GPUs vs CPUs – conceptual approach
• Example: add arrays A and B, into array C
• CPU (sequentially): Allocate memory, for loop to add elements pairwise
• CPU (parallel):
• create N threads (N = number of cores on the CPU)
• partition the data equally into ranges among the threads
• each thread: for each i in its range of elements, add those elements
• wait for all threads to finish
• How does performance scale?
• Assuming 8 cores => Smax = ~8X
• How to scale this further? How many threads?
• Limited cores, memory bus contention, penalty switching between threads

GPU-based parallelization basics
• GPU-based parallelization – steps:
• allocate memory for arrays on the GPU
• transfer the data from CPU memory into GPU memory
• launch a kernel (function executed by each processing thread)
• spawn a massive amount of threads (e.g., 32000, 64000, etc.)
• each thread is instructed to only handle a few elements
• wait for all threads to finish
• transfer results back to CPU memory
• Why GPUs? GPU hides memory latency better than CPU
• GPU switches between threads to hide latency
• Tons of registers, switching between threads is basically free
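The steps above can be sketched in CUDA C/C++ for the same array-add example. This is an illustrative sketch (names like vecAdd, the array size, and the block size are assumptions, and error checking is omitted), not the course's reference implementation.

```cuda
// GPU array add, following the steps on this slide.
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: the function executed by each processing thread.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // each thread handles (here) one element
        C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // 1. Allocate memory for the arrays on the GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // 2. Transfer the data from CPU memory into GPU memory.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel: spawn enough threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // 4. Wait for all threads to finish.
    cudaDeviceSynchronize();

    // 5. Transfer results back to CPU memory.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```

Unlike the CPU version, we launch far more threads than there are cores; the GPU hides memory latency by switching among them.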

• More, but less powerful (lower frequency) cores than CPU
• Must use enough threads (at least a few thousand) to hide latency and to saturate the memory bus

Summary: Main tradeoffs
• CPU – jack-of-all-trades: runs your OS, a variety of applications, all of which must get good latency
• Complex control unit, large chip to provide the logic
• Several levels of caches, increasingly larger
• GPU – process one thing in massively-parallel fashion
• A lot more cores, but smaller clock speed (100s of MHz)
• Control unit much simpler too (recall SIMD) => Less diverse computations!
• Smaller caches, not quite the same goal as for CPUs

CPU vs GPU performance
Source: http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html

CUDA architecture
• Past GPUs: only for graphics processing (vertex & pixel shaders, graphics APIs like OpenGL, etc.)
• Hard to leverage architectural features, must disguise computations as a graphics problem
• Until… the CUDA C/C++ language, hardware driver, and compiler
• CUDA architecture: support for general-purpose computing

NVIDIA GPU generations
• Some of the recent GPU generations from NVIDIA:
• Fermi (2010)
• Kepler (2012)
• Maxwell (2014)
• Pascal (2016) – octolab cluster of machines (GTX 1080 cards)!
• Volta (2017) succeeded Pascal for HPC and Deep Learning (tensor units, etc.)
• Turing (October 2018) succeeded Pascal
• Many new features in recent versions, as hardware evolved

Diversity of applications
• High performance computing
• Numerical analysis
• Physics simulations
• Machine learning
• Databases
• … and many more!
• Still good for graphics processing and rendering though!

CUDA Compute Capability
• Compute capability tells us what features are supported by that GPU
• First compute capabilities: 1.0, 1.1, 1.2, 1.3, and 2.0 (your textbook)
• Most recent: CUDA 11.0 in current cards (and on the labs)
• Higher capability versions are supersets of lower capabilities
• Support newer features, but older ones too
• e.g., double-precision floating point, atomics, unified virtual addressing, etc.
• Can compile code for a specific capability, e.g.:
• nvcc -arch=sm_11
• nvcc -arch=sm_30
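You can also query the compute capability of a card at runtime with the CUDA runtime API's cudaGetDeviceProperties(). A minimal sketch (output format is illustrative):

```cuda
// List each visible GPU and its compute capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // prop.major/prop.minor form the compute capability, e.g. 7.5
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

For example, Turing cards report compute capability 7.5, so you would compile for them with nvcc -arch=sm_75.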

CUDA architecture
• Turing architecture (RTX 2070 cards) in the DH lab machines!
[Diagram: multiple SMs, 64-bit memory controllers, global memory (typically: 2-4 GB)]
• (…-bit memory bus width), memory bandwidth: (only) 86.4 GB/s
• Pascal architecture (in the octolab machines – 1080 cards)
• (… total), memory bandwidth: 320 GB/s
• Multi-projection tech, VR support!

Streaming Multiprocessor (SM)
• Warp scheduler
• Thousands of registers (can be partitioned among threads of execution)
• Caches (we'll get to these later…)
• Shared memory – fast data interchange between threads
• Constant cache – fast broadcast of reads from constant memory
• Texture cache – to aggregate bandwidth from texture memory
• L1 cache – to reduce latency to local or global memory
• Can quickly switch between thread contexts, and issue instructions to groups of threads (aka "warps") which are ready to execute
• Execution units for integer and floating-point operations

Streaming Multiprocessor (SM)
[Diagram: SM layout – instruction cache; two instruction buffers, each with a warp scheduler and two dispatch units; registers (16,384 × 32-bit per partition; 65,536 registers total); cores with LD/ST and SFU units (×4 partitions = 128 cores/SM); texture memory / L1 cache; 64KB shared memory]

CPUs vs GPUs in a nutshell

CPU:
+ Good for sequential code
+ High clock speed
+ Optimized for latency
– Hard to leverage for massively parallel code
– Low off-chip bandwidth

GPU:
+ Designed for massively parallel computations
+ Optimized for bandwidth
+ High off-chip bandwidth
– Harder to handle complex code (complex control flow)

Announcements
• Simple task to learn about your GPU
• Very straightforward, we know you have an A3 to work on this week!
• Assignment 3 due this week!
• See TA extra office hours too
