
Parallel Computing with GPUs: GPU Architectures
Dr Paul Richmond

http://paulrichmond.shef.ac.uk/teaching/COM4521/

Last week

Parallelism can add performance to our code

We must identify parallel regions

OpenMP can be both data and task parallel

OpenMP data parallelism is parallel over data elements
but threads operate independently

Critical sections cause serialisation which can slow performance

Scheduling is required to achieve best performance

This Lecture

What is a GPU?

General Purpose Computation on GPUs (and GPU History)

GPU CUDA Hardware Model

Accelerated Systems

GPU Refresher

Latency vs. Throughput

Latency: The time required to perform some action
Measured in units of time

Throughput: The number of actions executed per unit of time
Measured in units of what is produced

E.g. An assembly line manufactures GPUs. It takes 6 hours to
manufacture a GPU but the assembly line can manufacture 100 GPUs
per day.
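A quick illustrative calculation (my own sketch, not from the slides) that makes the distinction concrete: by Little's law the number of GPUs in production at any instant equals throughput times latency, so long latency and high throughput can coexist.

/* Illustrative sketch: Little's law applied to the assembly-line example. */
#include <stdio.h>

int main(void) {
    double throughput = 100.0 / 24.0;  /* GPUs per hour (100 per day)  */
    double latency    = 6.0;           /* hours to manufacture one GPU */
    /* ~25 GPUs are in progress at any instant despite the 6 h latency */
    printf("GPUs in flight: %.0f\n", throughput * latency);
    return 0;
}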

CPU vs GPU

CPU

Latency oriented

Optimised for serial code performance

Good for single complex tasks

GPU

Throughput oriented

Massively parallel architecture

Optimised for performing many similar tasks
simultaneously (data parallel)

CPU vs GPU

Large Cache

Hide long latency memory access

Powerful Arithmetic Logical Unit
(ALU)

Low Operation Latency

Complex Control mechanisms

Branch prediction etc.

Small cache

But faster memory throughput

Energy efficient ALUs

Long latency but high throughput

Simple control

No branch prediction

Data Parallelism

Program has many similar threads of execution
Each thread performs the same behaviour on different data
Good for high throughput

We can classify an architecture based on instructions and data
(Flynn’s Taxonomy)
Instructions:

Single instruction (SI)
Multiple Instruction (MI)
Single Program (SP)
Multiple Program (MP)

Data:
Single Data (SD) – w.r.t. work item not necessarily single word
Multiple Data (MD)

e.g. SIMD = Single Instruction and Multiple Data

(SP and MP are not part of Flynn's original taxonomy)

SISD and SIMD

SISD
Classic von Neumann architecture

PU = Processing Unit

SIMD
Multiple processing elements performing the same operation simultaneously

E.g. Early vector super computers

Modern CPUs have SIMD instructions
But are not SIMD in general
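A minimal sketch (my example, not from the slides) of the kind of loop that maps naturally onto SIMD instructions: one instruction applied to many data elements.

/* Sketch: a data-parallel loop a compiler can auto-vectorise with
   SIMD instructions -- the same operation on many data elements. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   /* identical work per element */
}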

[Diagrams: SISD (a single PU drawing from one instruction pool and one data pool) and SIMD (several PUs sharing one instruction pool, each working on its own element of the data pool).]

MISD and MIMD

MISD
E.g. Pipelining architectures

MIMD
Processors are functionally asynchronous and independent

Different processors may execute different
instructions on different data

E.g. Most parallel computers

E.g. OpenMP programming model
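A small sketch of MIMD-style execution using last week's OpenMP (my illustration, not from the slides): each section is a different instruction stream operating on different data.

/* Sketch: OpenMP sections -- different threads execute different
   code on different data (MIMD-style task parallelism). */
#include <omp.h>

void process(float *a, float *b, int n) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < n; ++i) a[i] *= 2.0f; }  /* task 1 */
        #pragma omp section
        { for (int i = 0; i < n; ++i) b[i] += 1.0f; }  /* task 2 */
    }
}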

[Diagrams: MISD (several PUs applying different instructions to the same data pool, as in pipelining) and MIMD (several PUs, each with its own instruction stream and its own data).]

SPMD and MPMD

SPMD
Multiple autonomous processors simultaneously executing a program on different data

Program execution can have an independent path for each data point

E.g. Message passing on distributed memory machines.

MPMD
Multiple autonomous processors simultaneously executing at least two independent programs.

Typically client & host programming models fit this description.

E.g. Sony PlayStation 3 SPU/PPU combination, some system-on-chip configurations with CPUs and GPUs

Taxonomy of a GPU

What taxonomy best describes data parallelism with a GPU?

SISD?

SIMD?

MISD?

MIMD?

SPMD?

MPMD?

Taxonomy of a GPU

What taxonomy best describes data parallelism with a GPU?

Obvious Answer: SIMD

Less Obvious answer: SPMD

Slightly confusing answer: SIMT (Single Instruction Multiple Thread)
This is a combination of both; it differs from SIMD in that:

1) Each thread has its own registers
2) Each thread can access different memory addresses
3) Each thread can follow a different flow path

We will explore this in more detail when we look at the hardware!
http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

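A minimal CUDA sketch (illustrative only, ahead of the CUDA lectures) of what SIMT permits: threads share one instruction stream, but each holds its own registers, computes its own address and may take its own branch.

/* Sketch: SIMT -- one instruction stream, but per-thread registers,
   per-thread addresses and per-thread (divergent) flow paths. */
__global__ void simt_example(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* held in this thread's registers */
    if (i < n) {                                    /* each thread reads its own address */
        if (in[i] > 0.0f)                           /* threads may diverge here */
            out[i] = in[i] * 2.0f;
        else
            out[i] = 0.0f;
    }
}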

This Lecture

What is a GPU?

General Purpose Computation on GPUs (and GPU History)

GPU CUDA Hardware Model

Accelerated Systems

GPU Early History

Hardware has evolved from the demand for increased quality of 3D
computer graphics

Initially specialised processors for each part of the graphics pipeline
Vertices (points of triangles) and Fragments (potential pixels) can be manipulated in parallel

The stages of the graphics pipeline became programmable in early
2000’s
NVIDIA GeForce 3 and ATI Radeon 9700

DirectX 9.0 required programmable pixel and vertex shaders

The Graphics Pipeline

GPGPU

Source: NVidia Cg Users Manual

GPGPU

General Purpose computation on Graphics Hardware
First termed by Mark Harris (NVIDIA) in 2002

Recognised the use of GPUs for non graphics applications

Requires mapping a problem into graphics concepts
Data into textures (images)

Computation into shaders

Later unified processors were used rather than fixed stages
2006: GeForce 8 series

Unified Processors and CUDA

Compute Unified Device Architecture (CUDA)
First released in 2006/7

Targeted the new breed of unified “streaming multiprocessors”

C-like programming for GPUs
No computer graphics: General purpose programming model

Revolutionised GPU programming for general purpose use
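As a taster of that C-like model (a minimal sketch; full CUDA syntax is covered in later lectures), a kernel is ordinary-looking C code executed by thousands of threads:

/* Sketch: a CUDA kernel -- C-like code run by many threads at once. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Host-side launch, assuming device arrays d_a, d_b, d_c already exist:
   vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);                    */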

GPU Accelerated Libraries and Applications (MATLAB, Ansys, etc)
GPU mostly abstracted from end user
Pros: ?
Cons: ?

GPU Accelerated Directives (OpenACC)
Helps compiler auto generate code for the GPU
Very similar to OpenMP
Pros: ?
Cons: ?

OpenCL
Inspired by CUDA but targeted at more general data parallel architectures
Pros: ?
Cons: ?

Other GPU Programming Techniques

GPU Accelerated Libraries and Applications (MATLAB, Ansys, etc)
GPU mostly abstracted from end user
Pros: Easy to learn and use
Cons: … difficult to master (high level of abstraction reduces ability to perform bespoke optimisation)

GPU Accelerated Directives (OpenACC)
Helps the compiler auto-generate code for the GPU (see the directive sketch after this list)
Very similar to OpenMP
Pros: Performance portability, limited understanding of hardware required
Cons: Limited fine grained control of optimisation

OpenCL
Inspired by CUDA but targeted at more general data parallel architectures
Pros: Cross platform
Cons: Limited access to cutting edge NVIDIA specific functionality, limited support
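To illustrate the directive approach referenced above (a sketch only, assuming an OpenACC-capable compiler such as nvc/pgcc), a single pragma asks the compiler to offload a loop:

/* Sketch: an OpenACC directive offloading a loop to the GPU. */
void scale(int n, float a, float *restrict x) {
    #pragma acc parallel loop copy(x[0:n])   /* compiler generates GPU code */
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i];
}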

Other GPU Programming Techniques

This Lecture

What is a GPU?

General Purpose Computation on GPUs (and GPU History)

GPU CUDA Hardware Model

Accelerated Systems

NVIDIA GPUs have a 2-level hierarchy

Each Streaming Multiprocessor (SMP) has multiple vector “CUDA” cores

The number of SMs varies across different hardware implementations

The design of SMPs varies between GPU families

The number of cores per SMP varies between GPU families

Hardware Model

[Diagram: a GPU contains several SMs plus device memory; each SM has its own shared memory/cache, scheduler/dispatcher, instruction cache and registers.]
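Because the SM count and layout vary between families, CUDA lets you query them at run time. A small host-side sketch using the standard runtime API:

/* Sketch: query the SM count and compute capability of device 0. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}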

NVIDIA CUDA Core

CUDA Core
Vector processing unit

Stream processor

Works on a single operation

NVIDIA GPU Range

GeForce
Consumer range

Gaming oriented for mass market

Quadro Range
Workstation and professional graphics

Tesla
Number crunching boxes

Much better support for double precision

Faster memory bandwidth

Better interconnects

Tesla Range Specifications

                          “Kepler” K20   “Kepler” K40   “Maxwell” M40   Pascal P100   Volta V100
CUDA cores                2496           2880           3072            3584          5120
Chip variant              GK110          GK110B         GM200           GP100         GV100
Cores per SM              192            192            128             64            64
Single precision perf.    3.52 TFlops    4.29 TFlops    7.0 TFlops      9.5 TFlops    15 TFlops
Double precision perf.    1.17 TFlops    1.43 TFlops    0.21 TFlops     4.7 TFlops    7.5 TFlops
Memory bandwidth          208 GB/s       288 GB/s       288 GB/s        720 GB/s      900 GB/s
Memory                    5 GB           12 GB          12 GB           12/16 GB      16 GB

Chip partitioned into Streaming Multiprocessors (SMPs)

32 vector cores per SMP

Not cache coherent. No communication possible across SMPs.

Fermi Family of Tesla GPUs

Kepler Family of Tesla GPUs

Streaming Multiprocessor Extreme (SMX)

Huge increase in the number of cores per SMX
Smaller 28nm process

Increased L2 cache

Cache coherency at L2, not at L1

Maxwell Family Tesla GPUs

Streaming Multiprocessor Module (SMM)

SMM divided into 4 quadrants (GPC)
Each has its own instruction buffer, registers and scheduler for its 32 vector cores

SMM has 90% of the performance of SMX at 2x energy efficiency
128 cores vs. 192 in Kepler
BUT smaller die space = more SMMs

8x the L2 cache of Kepler (2MB)

Pascal P100 GPU

Many more SMPs

More GPCs

Each CUDA core is more efficient
More registers available

Same die size as Maxwell

Memory bandwidth improved drastically
NVLink

Warp Scheduling

GPU Threads are always executed in groups called warps (32 threads)
Warps are transparent to users

SMPs have zero overhead warp scheduling
Warps with instructions ready to execute are eligible for scheduling

Eligible warps are selected for execution based on priority (context switching)

All threads execute the same instruction (SIMD) when executed on the vector
processors (CUDA cores)

The specific way in which warps are scheduled varies across families
Fermi, Kepler and Maxwell have different numbers of warp schedulers and dispatchers
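A short illustrative kernel (my example, not from the slides) showing how the 32-thread warp grouping appears in code, and how a per-thread branch can diverge within a warp while a per-warp branch cannot:

/* Sketch: threads are scheduled in warps of 32. A branch that differs
   within a warp is serialised; a branch uniform per warp is not.     */
__global__ void divergence_demo(float *data) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;            /* warp index within the block */

    if (tid % 2 == 0)                       /* diverges inside every warp */
        data[tid] *= 2.0f;

    if (warp % 2 == 0)                      /* uniform per warp: no divergence */
        data[tid] += 1.0f;
}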

NVIDIA Roadmap

Performance Characteristics

Source: NVIDIA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide)

Performance Characteristics

Source: NVIDIA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide)

This Lecture

What is a GPU?

General Purpose Computation on GPUs (and GPU History)

GPU CUDA Hardware Model

Accelerated Systems

CPUs and Accelerators are used together
GPUs cannot be used instead of CPUs

GPUs perform compute heavy parts

Communication is via PCIe bus
PCIe 3.0: up to 8 GB per second throughput

NVLINK: 5-12x faster than PCIe 3.0
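A minimal host-code sketch (standard CUDA runtime calls) of the pattern this implies: data is copied over PCIe to the GPU's memory, the compute-heavy part runs there, and results come back.

/* Sketch: explicit transfers across PCIe between CPU and GPU memory. */
#include <cuda_runtime.h>

void offload(float *h_data, size_t n) {
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    /* ... launch kernels on d_data here ... */
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}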

Accelerated Systems

[Diagram: a CPU with its DRAM and a GPU/accelerator with its GDRAM, each behind an I/O interface, connected by PCIe.]

Insert your accelerator into PCI-e

Make sure that
There is enough space

Your power supply unit (PSU) is up to the job

You install the latest GPU drivers

Simple Accelerated Workstation

Can have multiple CPUs and Accelerators within each “Shared Memory Node”
CPUs share memory but accelerators do not!

Larger Accelerated Systems

[Diagram: two shared-memory nodes, each containing a CPU with DRAM and a GPU/accelerator with GDRAM connected over PCIe, joined by an interconnect.]

Multiple servers can be connected via an interconnect

Several vendors offer GPU servers
For example, 2 multi-core CPUs + 4 GPUs

Make sure your case and power supply are up to the job!

GPU Workstation Server

Accelerated Supercomputers


DGX-1 (Volta V100)

Capabilities of Machines Available to you

Diamond High Spec Lab (for lab sessions)
Quadro K5200 GPUs

Kepler Architecture

2.9 Tflops Single Precision

VAR Lab
Same machines as High Spec Lab (no managed desktop)

Must be booked to access (link)

ShARC Facility
Kepler Tesla K80 GPUs (general pool)

Pascal Tesla P100 GPUs in DGX-1 (DCS only)

Lab in week 8

CUDA 9.1, NSIGHT Visual Profiler available in all locations

https://www.sheffield.ac.uk/diamond/engineering/var

GPUs are better suited to parallel tasks than CPUs

Accelerators are typically not used alone, but work in tandem with
CPUs

GPU hardware is constantly evolving

GPU accelerated systems scale from simple workstations to large-scale supercomputers

CUDA is a language for general purpose GPU (NVIDIA only)
programming

Summary

Mole Quiz Next Week

Next week's lecture: 15:00-16:00 in LECT DIA-LT08

This time next week (16:00) will be a MOLE quiz.
Where? DIA-004 (Computer room 4)
When? Now
How Long: 45 mins (25 Questions)
What? Everything up to the end of this week's lectures…

E.g.
int a[5] = {1,2,3,4,5};

x = &a[3];

What is x?
1. a pointer to an integer with value of 3
2. a pointer to an integer with value of 4
3. a pointer to an integer with a value of the address of the third element of a
4. an integer with a value of 4