
Accelerated Architectures

EPCC
The University of Edinburgh

Outline

• Why do we want/need accelerators such as GPUs?
• Architectural reasons for accelerator performance advantages

• Latest accelerator products
– (current) Market leader: NVIDIA
– Alternatives: AMD GPUs, Intel Xeon Phi

• Accelerated Systems


4 key performance factors


[Diagram: data flows from Memory into the Processor (DATA IN), is processed (DATA PROCESSED), and results flow back to Memory (DATA OUT)]

1. Amount of data processed at one time (Parallel processing)

2. Processing speed on each data element (Clock frequency)

3. Amount of data transferred at one time (Memory bandwidth)

4. Time for each data element to be transferred (Memory latency)

4 key performance factors


[Same Memory/Processor data-flow diagram as the previous slide]

1. Parallel processing
2. Clock frequency
3. Memory bandwidth
4. Memory latency

• Different computational problems are sensitive to these factors in different ways (a simple model below illustrates this)

• Different architectures address these factors in different ways
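
A hedged way to see why sensitivity differs by problem (a simple roofline-style model; the symbols W, Q, F_peak and B are illustrative and not from the slides):

```latex
% Execution time is limited by compute or by memory traffic:
%   W       = floating-point operations required
%   F_peak  = peak compute rate (parallelism x clock frequency)
%   Q       = bytes moved to/from memory
%   B       = memory bandwidth
T \approx \max\!\left( \frac{W}{F_{\mathrm{peak}}},\ \frac{Q}{B} \right)
```

Compute-bound problems (large W relative to Q) are dominated by factors 1-2, bandwidth-bound problems by factor 3; latency (factor 4) matters when transfers cannot be overlapped with useful work.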

CPUs: 4 key factors


• Parallel processing
– Until relatively recently, each CPU had only a single core. Now CPUs have multiple cores, where each can process multiple instructions per cycle

• Clock frequency
– CPUs aim to maximise clock frequency, but this has now hit a limit due to power restrictions (more later)

• Memory bandwidth
– CPUs use regular DDR memory, which has limited bandwidth

• Memory latency
– Latency from DDR is high, but CPUs strive to hide it through:
– Large on-chip low-latency caches to stage data
– Multithreading
– Out-of-order execution

The Problem with CPUs

• The power used by a CPU core is proportional to Clock Frequency × Voltage²

• In the past, computers got faster by increasing the
frequency
– Voltage was decreased to keep power reasonable.

• Now, voltage cannot be decreased any further
– 1s and 0s in a system are represented by different voltages
– Reducing the overall voltage further would shrink this difference to the point where 0s and 1s could no longer be reliably distinguished (a standard power model below makes this concrete)
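
A hedged sketch of the constraint (the standard CMOS dynamic-power model, consistent with the proportionality above; α and C are illustrative symbols not in the slides):

```latex
% Dynamic power of a CMOS core:
%   alpha = activity factor,  C = switched capacitance,
%   V     = supply voltage,   f = clock frequency
P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f
```

Historically f could rise while V fell, holding power roughly steady; with V now fixed, any further increase in f raises power roughly linearly.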


The Problem with CPUs


• http://queue.acm.org/detail.cfm?id=2181798

The Problem with CPUs

• Instead, performance increases can be achieved
through exploiting parallelism

• Need a chip which can perform many parallel
operations every clock cycle
– Many cores and/or many operations per core

• Want to keep power/core as low as possible
• Much of the power expended by CPU cores is on functionality not generally that useful for HPC
– e.g. branch prediction


Accelerators

• So, for HPC, we want chips with simple, low power,
number-crunching cores

• But we need our machine to do other things as well
as the number crunching
– Run an operating system, perform I/O, set up the calculation, etc.

• Solution: “Hybrid” system containing both CPU and
“accelerator” chips


Accelerators

• It costs a huge amount of money to design and
fabricate new chips
– Not feasible for relatively small HPC market

• Luckily, over the last few years, Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market
– And largely possess the right characteristics for HPC
– Many number-crunching cores

• GPU vendors NVIDIA and AMD have tailored
existing GPU architectures to the HPC market

• GPUs are now firmly established in the HPC industry

Accelerators

• Intel released a different type of accelerator to
compete with GPUs for scientific computing
– Many Integrated Core (MIC) architecture
– AKA Xeon Phi (codenames Larrabee, Knights Ferry, Knights Corner, Knights Landing)
– Intel preferred the term “coprocessor” to “accelerator”

• KNC comprised old Pentium CPU cores from 1993
– Augmented with wide vector units

• So it again uses the concept of many simple, low-power cores
– Each performing multiple operations per cycle

• Intel Xeon Phi KNH (Knights Hill) was cancelled:
– End of the Xeon Phi era

Latest Technology

• NVIDIA
– Volta GPUs have evolved from the GeForce series

• AMD
– FirePro HPC-specific GPUs have evolved from the (ATI) Radeon series

• Intel
– Xeon Phi emerged to compete with GPUs for general-purpose computation


Image: https://www.amd.com/en/products/servers-graphics?utm_medium=redirect&utm_source=301
Image: https://www.nvidia.com/en-us/data-center/tesla-v100/
Image: https://software.intel.com/en-us/xeon-phi/mic

Accelerators addressing performance

• Parallel processing and clock frequency:
– Focus on many number-crunching cores (instead of a few high-power cores)


AMD 12-core CPU

• Not much space on the CPU die is dedicated to compute

[Die shot; highlighted regions = compute unit (= core)]


NVIDIA Pascal GPU

• The GPU dedicates much more space to compute
– At the expense of caches, controllers, sophistication etc.

[Die shot; highlighted regions = compute unit (= SM = 64 CUDA cores)]


Intel Xeon Phi

• As does the Xeon Phi

[Die shot; highlighted regions = compute unit (= core)]


Accelerators addressing performance

• Parallel processing and clock frequency:
– Focus on many number-crunching cores (instead of a few high-power cores)

• Memory bandwidth and latency:
– GDDR and later HBM offer significantly higher bandwidth than DDR
– Latency hidden by high parallelism, low-cost context switching and (some) cache memory (see the model below)
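
A hedged way to quantify the latency-hiding claim (Little's law; the example bandwidth and latency figures are illustrative, not from the slides):

```latex
% Work in flight needed to sustain bandwidth B at latency L:
\mathrm{concurrency} \approx B \times L
% e.g. 500~\mathrm{GB/s} \times 400~\mathrm{ns} \approx 200~\mathrm{KB}\ \text{in flight}
```

Accelerators supply this concurrency by keeping tens of thousands of threads resident and switching between them at essentially no cost, rather than relying on the large caches CPUs use.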


Memory

• GPUs and the Intel Xeon Phi both use graphics memory, which has much higher bandwidth

[Diagram: CPUs use DRAM; GPUs and Xeon Phi use graphics DRAM]

• For many applications, performance is very
sensitive to memory bandwidth


Memory

• NVIDIA Pascal introduced stacked memory

[Image: cross-section photomicrograph of a P100 HBM2 stack and GP100 GPU]

• Similar stacked memory is used in the Intel KNL (the last version of the Xeon Phi) and newer GPUs


(Image credit: NVIDIA Pascal architecture White Paper,
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf)

Accelerators: 4 key factors


• Parallel processing
– Accelerators have a much higher extent of parallelism than CPUs: many more cores and/or operations per core

• Clock frequency
– Accelerators typically have lower clock frequencies than CPUs, and instead get performance through parallelism

• Memory bandwidth
– Accelerators use high-bandwidth GDDR or HBM2 memory

• Memory latency
– Memory latency from GDDR is similar to DDR
– GPUs hide latency through very high levels of multithreading
– Xeon Phi hides latency in a similar way to CPUs, although the caches are smaller and there is no out-of-order execution (in current models)


• GPU performance has been increasing much more rapidly than CPU performance

Image: https://arxiv.org/pdf/1412.7789.pdf

GPU accelerated systems

• GPUs cannot be used instead of CPUs
– They must be used together
– GPUs act as accelerators
– Responsible for the computationally expensive parts of the code
– CPU-GPU communication (bandwidth/latency) presents a bottleneck

[Diagram: CPU and GPU, each with its own I/O; the GPU has HBM]

GPU accelerated systems

• Performance considerations
– Need to exploit a high level of parallelism
– Generally speaking: separate CPU and GPU memory spaces
– Need to consider transfers to/from the GPU
– HBM generally has lower capacity than DDR per GPU/CPU
– Added complexity of the programming model
– Need to take the above into consideration
– For GPUs: CUDA, OpenCL
– OpenMP or OpenACC for a directives-based approach
– Libraries and high-level programming languages can help (a minimal CUDA sketch follows)
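
A hedged, minimal CUDA sketch of the points above (illustrative, not from the slides): the separate memory spaces force explicit device allocation and host-device transfers, and the kernel maps one element to each of many threads.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each of n threads handles one element: the high parallelism the GPU needs.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                       // separate GPU address space
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);  // across PCIe/NVLink
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  // results back to CPU

    printf("c[0] = %f\n", hc[0]);              // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

In practice the transfers must be amortised over far more computation than this, since every crossing of the CPU-GPU link pays the bandwidth/latency cost noted above.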

[Diagram: CPU with DRAM and GPU with HBM, each with its own memory space, connected via I/O]

Xeon Phi accelerated systems

• Xeon Phis run their own OS
– The Xeon Phi is more independent of the CPU
– KNC was a “card” accelerator and could run in three modes:
– Offload: highly parallel regions “offloaded” to the Xeon Phi, similar to GPU acceleration
– Native mode: log into the Xeon Phi card and run directly on the KNC
– Symmetric: run across CPU+KNC as if across nodes
– KNL only available in a socket version
– Replaced the CPU completely

[Diagrams: a self-hosted KNL with DRAM and HBM2, versus a KNC card with GDRAM attached over I/O to a CPU with DRAM]

NVIDIA Volta: a look inside

• NVIDIA’s latest GPU product is highly AI/DL
oriented

• Follows basic building blocks of previous
architectures with notable improvements


Image courtesy of Alan Grey, NVIDIA

NVIDIA Volta
• Volta GV100 SM
– Chip partitioned into Streaming Multiprocessors (SMs) that act independently of each other
– 4 processing blocks per SM, each with:
– 16 FP32 cores
– 8 FP64 cores
– 16 INT32 cores
– 2 tensor cores
– L0 instruction cache
– 1 warp scheduler
– 1 dispatch unit
– 64 KB register file
– The number of SMs, and processing blocks etc. per SM, varies across products. High-end GPUs have more than 1,000 total “cores”

Image courtesy of Alan Grey, NVIDIA

NVIDIA Volta

• Volta GV100 HBM2
– Up to eight memory dies per HBM2 stack
– Up to four stacks
– Maximum of 32 GB of GPU memory


Image: https://www.eteknix.com/samsung-sk-hynix-supply-hbm2-nvidia/

NVIDIA Volta

• Volta GV100 HBM2
– Up to eight memory dies per HBM2 stack
– Up to four stacks
– Maximum of 32 GB of GPU memory

• Cache
– 128 KB configurable as:
– data cache and/or
– shared memory
– For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1 (a configuration sketch follows)
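
A hedged CUDA sketch of requesting a particular split (cudaFuncAttributePreferredSharedMemoryCarveout is the real attribute introduced for Volta; my_kernel and the 50% hint are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(void) {
    __shared__ float tile[32][32];   // placeholder use of shared memory
    tile[threadIdx.y][threadIdx.x] = 0.0f;
}

void configure(void) {
    // Hint that ~50% of the 128 KB should be carved out as shared
    // memory for this kernel; the runtime rounds to a supported split.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         50);
}
```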

Image courtesy of Alan Grey, NVIDIA

NVIDIA Volta

• NVLink
– Proprietary NVIDIA interconnect
– Introduced with the Pascal generation of GPUs
– Each NVLink connection achieves approximately 50 GB/s

– A single V100 can support 6 NVLink connections
– Achieving up to 300 GB/s
– 3x that of PCIe Gen 3

– Can be used for GPU-GPU communication (a peer-to-peer sketch follows)
– Can also be used for CPU-GPU communication
– (Currently) only with IBM Power CPUs
– Notably not available with Intel CPUs
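
A hedged CUDA sketch of GPU-GPU communication (the peer-access API shown is real; whether the traffic travels over NVLink or PCIe depends on the system, and the device numbers are illustrative):

```cuda
#include <cuda_runtime.h>

// Copy 'bytes' from a buffer on GPU 0 to a buffer on GPU 1.
// With peer access enabled on an NVLink-connected pair, the copy
// goes directly between the GPUs without staging through the host.
void copy_gpu0_to_gpu1(void *dst_on_gpu1, const void *src_on_gpu0,
                       size_t bytes) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);  // can GPU 0 reach GPU 1?
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // enable the direct path
    }
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
}
```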


NVIDIA Volta

• NVLink


Image: https://www.nvidia.com/en-us/data-center/nvlink/

NVIDIA Volta for DL

• Tensor cores
– Matrix multiply engines
– Each tensor core performs the operation D = A×B + C, where A, B, C, and D are 4×4 matrices. The matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be FP16 or FP32 matrices

– Exploit the “forgiving” nature of DL problems
– Might prove useful for some HPC codes

– Each Volta has 640 tensor cores, each performing 64 floating-point fused multiply-add (FMA) operations per clock (a programming sketch follows)
– 125 TFLOPS for training and inference
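
A hedged sketch of how tensor cores are reached from CUDA (the warp-level WMMA API builds 16×16×16 tile operations out of the hardware's 4×4 ops; requires compiling for sm_70 or newer):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A*B + C on a 16x16x16 tile: A and B are FP16,
// the accumulator is FP32, matching the slide's description.
__global__ void tensor_tile(const half *a, const half *b,
                            const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a_frag, a, 16);               // leading dim 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc, a_frag, b_frag, acc);            // tensor-core FMAs

    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```

In practice most codes reach the tensor cores indirectly, through libraries such as cuBLAS and cuDNN, rather than hand-written WMMA.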


NVIDIA Volta

• Tensor cores


Animation: https://www.nvidia.com/en-us/data-center/tensorcore/

NVIDIA Volta

• Optimised software


Image courtesy of Alan Grey, NVIDIA

AMD FirePro

• AMD acquired ATI in 2006
• AMD FirePro series: a derivative of Radeon chips with HPC enhancements

• Like NVIDIA: high computational performance and high-bandwidth graphics memory

• Currently much less widely used for GPGPU than NVIDIA, because of programming-support issues


Intel Xeon Phi

• Intel Pentium P54C cores were originally used in CPUs in 1993
– Simplistic and low-power compared to today’s high-end CPUs

• KNL moved from Pentium to Silvermont architecture cores (adapted from the Atom mobile range)
– Faster single-thread performance
– Also integrates network controllers on chip

• The philosophy behind the Phi is to dedicate a large fraction of the silicon to many of these cores

• And, similar to GPUs, the Phi uses high-bandwidth memory (GDDR on KNC, stacked MCDRAM on KNL)
– Higher memory bandwidth than the standard DDR memory used by CPUs
– KNL stacked MCDRAM: 400 GB/s bandwidth
– KNL to DRAM: ~90 GB/s bandwidth

Intel Xeon Phi

• On KNL, each core has been augmented with two wide 512-bit vector units

• In each clock cycle, each core can operate on 2 vectors of size 8 (in double precision)
– Twice the width of the 256-bit “AVX” instructions supported by current CPUs

• Multiple cores, each performing multiple operations per cycle

• Peak performance and memory bandwidth similar to GPUs

• Vector units drive performance (a worked example follows)
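
A hedged worked example of the resulting peak rate (64 cores at 1.4 GHz are illustrative KNL-class figures, not from the slide):

```latex
% Per core: 2 VPUs x 8 DP lanes x 2 FLOPs (FMA) = 32 FLOP/cycle
% Whole chip:
64\ \text{cores} \times 32\ \tfrac{\text{FLOP}}{\text{cycle}}
  \times 1.4\ \text{GHz} \approx 2.9\ \text{TFLOP/s}
```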


Programming
• GPUs
– CUDA: extensions to the C language which allow interfacing to the hardware (NVIDIA proprietary)
– OpenCL: similar to CUDA but cross-platform (including AMD and NVIDIA)
– Directives-based approach: directives help the compiler to automatically create code for the GPU. OpenACC and now also the new OpenMP 4.0 (a directives sketch follows)
– Libraries like cuFFT and NVBLAS can also help.
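
A hedged sketch of the directives-based approach (OpenACC shown; the OpenMP 4.x target directives are analogous). The compiler generates the GPU kernel and the transfers named in the data clauses:

```cuda
// Compiled with an OpenACC compiler this loop is offloaded to the GPU;
// a, b, c and n are illustrative names.
void vadd_acc(const float *a, const float *b, float *c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```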
• Xeon Phi
– Can just use regular CPU code with OpenMP
– Typically needs work to allow the compiler to auto-vectorise efficiently
– Intel-specific directives allow offloading to the Phi, so that the fast CPU cores can be used for the serial parts of the code
– Also Intel Threading Building Blocks
– Often need intrinsics to fully exploit performance capabilities (a vectorisation sketch follows)
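
A hedged sketch of the vectorisation work mentioned above (illustrative names; #pragma omp simd is a standard OpenMP hint): restrict-qualified pointers and a SIMD directive let the compiler map each iteration onto the Phi's 8-wide double-precision lanes without resorting to intrinsics.

```cuda
// y[i] = a*x[i] + y[i]: one fused multiply-add per vector lane.
void daxpy(double *__restrict__ y, const double *__restrict__ x,
           double a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```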


GPU Accelerated Systems

• CPUs and Accelerators are used together
– Communicate over PCIe bus or NVLink

[Diagram: CPU with DRAM and Accelerator with HBM, connected via PCIe (can also be NVLink) through I/O]

Scaling to larger systems
• Can have multiple CPUs and accelerators within each “workstation” or “shared memory node”
– E.g. 2 CPUs + 2 accelerators (below)
– CPUs share memory, but accelerators do not

[Diagram: two CPUs (each with DRAM) sharing memory, each connected via PCIe/NVLink to an Accelerator + HBM; an interconnect allows multiple nodes to be connected]

GPU Accelerated Supercomputer

[Diagram: a grid of Acc.+CPU nodes connected together to form a supercomputer]

DIY GPU Workstation

• Just need to slot a GPU card into PCIe
• Need to make sure there is enough space and power in the workstation

GPU Servers

• Multiple servers can be connected via interconnect

• Several vendors offer GPU servers

• Example configuration:
– 4 GPUs plus 2 (multi-core) CPUs

Summit – ORNL

https://www.olcf.ornl.gov/summit/

Each node:
• 2x 22-core Power9 CPUs
• 6x NVIDIA V100 GPUs
• 512 GB DDR4
• Mellanox EDR 100G InfiniBand
x4,608 nodes

Summit – ORNL


https://www.olcf.ornl.gov/wp-content/uploads/2018/05/Intro_Summit_System_Overview.pdf

Summit – ORNL

• Number 1 machine in the world (as of June 2018)
– ~187 PFlop/s peak performance
– ~122 PFlop/s max performance
– ~2,300,000 cores

• 8.8 MW
• Power9 system + GPUs + liquid cooling
• An EXAOP (not Flop) DL run is already a reality:
– https://www.olcf.ornl.gov/2018/06/08/genomics-code-exceeds-exaops-on-summit-supercomputer/

Going forward
• Some very interesting developments on the horizon
• NVIDIA GPUs are the current frontrunner
– 5 of the top 10 systems use NVIDIA GPUs

• Intel is focusing on Intel Xeon CPUs (not to be confused with the end-of-life Intel Xeon Phi co-processors)

• AMD has shown promising technology but still lags in software support

• FPGAs, and CPUs with FPGAs, are becoming more and more relevant due to power efficiency and performance:
– Not “exactly” an accelerator
– Separate lecture on FPGAs coming in a few weeks

Summary

• Accelerators have higher compute and memory
bandwidth capabilities than CPUs
– Silicon dedicated to many simplistic cores
– Use of stacked memory

• Accelerators are typically not used alone, but work in
tandem with CPUs

• Most common are NVIDIA GPUs and Intel Xeon Phis
– The architectures differ
– AMD also has high-performance GPUs, but these are not so widely used due to programming support

• GPU accelerated systems scale from simple workstations to large-scale supercomputers
– And dominate the top 10 most powerful systems as of June 2018