CIS 501: Computer Architecture
Unit 13: Data-Level Parallelism & Accelerators
Slides developed at UPenn, with sources that included University of Wisconsin slides
How to Compute This Fast?
• Performing the same operations on many data items
• Example: SAXPY
for (I = 0; I < 1024; I++) {
  Z[I] = A*X[I] + Y[I];
}

L1: ldf  [X+r1]->f1   // I is in r1
    mulf f0,f1->f2    // A is in f0
    ldf  [Y+r1]->f3
    addf f2,f3->f4
    stf  f4->[Z+r1]
    addi r1,4->r1
    blti r1,4096,L1
• Instruction-level parallelism (ILP) – fine grained
• Loop unrolling with static scheduling –or– dynamic scheduling
• Wide-issue superscalar (non-)scaling limits benefits
• Thread-level parallelism (TLP) – coarse grained
• Multicore
• Can we do some “medium grained” parallelism?
Data-Level Parallelism
• Data-level parallelism (DLP)
• Single operation repeated on multiple data elements
• SIMD (Single-Instruction, Multiple-Data)
• Less general than ILP: parallel insns are all same operation
• Exploit with vectors
• Old idea: Cray-1 supercomputer from late 1970s
• Eight 64-entry x 64-bit floating point “vector registers”
• 4096 bits (0.5KB) in each register! 4KB for vector register file
• Special vector instructions to perform vector operations
• Load vector, store vector (wide memory operation)
• Vector+Vector or Vector+Scalar
• addition, subtraction, multiply, etc.
• In Cray-1, each instruction specifies 64 operations!
• ALUs were expensive, so one operation per cycle (not parallel)
Example Vector ISA Extensions (SIMD)
• Extend ISA with vector storage …
• Vector register: fixed-size array of FP/int elements
• Vector length: for example 4, 8, 16, 64, …
• … and example operations for vector length of 4
• Load vector: ldf.v [X+r1]->v1
  ldf [X+r1+0]->v1[0]
  ldf [X+r1+1]->v1[1]
  ldf [X+r1+2]->v1[2]
  ldf [X+r1+3]->v1[3]
• Add two vectors: addf.vv v1,v2->v3
  addf v1[i],v2[i]->v3[i] (where i is 0,1,2,3)
• Add vector to scalar: addf.vs v1,f2->v3
  addf v1[i],f2->v3[i] (where i is 0,1,2,3)
• Today’s vectors: short (128-512 bits), but fully parallel
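For a concrete software view of these operations: a minimal SAXPY sketch using x86 SSE intrinsics (4-wide single-precision, matching the vector length above). The function name is illustrative, and it assumes n is a multiple of 4 and that the arrays do not overlap:

  #include <xmmintrin.h>  // SSE intrinsics: __m128 = 4 x 32-bit float

  // Z[i] = A*X[i] + Y[i], 4 elements per iteration.
  void saxpy_sse(float A, const float *X, const float *Y, float *Z, int n) {
      __m128 a = _mm_set1_ps(A);            // broadcast scalar A to all lanes
      for (int i = 0; i < n; i += 4) {
          __m128 x = _mm_loadu_ps(&X[i]);   // load vector (like ldf.v)
          __m128 y = _mm_loadu_ps(&Y[i]);
          __m128 t = _mm_mul_ps(a, x);      // vector*scalar (like mulf.vs)
          __m128 z = _mm_add_ps(t, y);      // vector+vector (like addf.vv)
          _mm_storeu_ps(&Z[i], z);          // store vector (like stf.v)
      }
  }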
Example Use of Vectors – 4-wide
Scalar (1 element per iteration):
L1: ldf  [X+r1]->f1
    mulf f0,f1->f2
    ldf  [Y+r1]->f3
    addf f2,f3->f4
    stf  f4->[Z+r1]
    addi r1,4->r1
    blti r1,4096,L1

Vector (4 elements per iteration):
L1: ldf.v   [X+r1]->v1
    mulf.vs v1,f0->v2
    ldf.v   [Y+r1]->v3
    addf.vv v2,v3->v4
    stf.v   v4->[Z+r1]
    addi r1,16->r1
    blti r1,4096,L1

7×1024 instructions vs. 7×256 instructions (4x fewer instructions)
• Operations
• Load vector: ldf.v [X+r1]->v1
• Multiply vector by scalar: mulf.vs v1,f2->v3
• Add two vectors: addf.vv v1,v2->v3
• Store vector: stf.v v1->[X+r1]
• Performance?
• Best case: 4x speedup
• But, vector instructions don’t always have single-cycle throughput
• Execution width (implementation) vs vector width (ISA)
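• Worked example: if the ISA’s vector width is 4 but the implementation’s execution width is 2, each vector instruction takes 2 cycles to execute, so the realized speedup is closer to 2x than 4x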
Vector Datapath & Implementation
• Vector insn. are just like normal insn… only “wider”
• Single instruction fetch (no extra N² checks)
• Wide register read & write (not multiple ports)
• Wide execute: replicate floating point unit (same as superscalar)
• Wide bypass (avoid N² bypass problem)
• Wide cache read & write (single cache tag check)
• Execution width (implementation) vs vector width (ISA)
• Example: Pentium 4 and “Core 1” execute vector ops at half width
• “Core 2” executes them at full width
• Because they are just instructions…
• …superscalar execution of vector instructions
• Multiple n-wide vector instructions per cycle
Vector Insn Sets for Different ISAs
• Intel and AMD: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2
• currently: AVX-512 offers 512b vectors
• PowerPC AltiVec/VMX: 128b
• ARM NEON: 128b
• ARM Scalable Vector Extension (SVE): up to 2048b
• implementation is narrower than this!
By the numbers: CPUs vs GPUs
                  Intel Xeon Platinum   NVIDIA GV100          Intel Xeon Phi
                  8168 “Skylake”        (Volta)               7290F
frequency         2.7 GHz               1.1 GHz               1.5 GHz
cores / threads   24 / 48               80 (“5120”) / 10Ks    72 / 288
RAM               768 GB                32 GB                 384 GB
DP TFLOPS         1.0                   5.8                   3.5
transistors       >5B ?                 21.1B                 >5B ?
price             $5,900                $9,000                $3,400
• following slides c/o the “Beyond Programmable Shading” course
• http://www.cs.cmu.edu/~kayvonf/
SIMD vs SIMT
• SIMD: single insn multiple data
• write 1 insn that operates on a vector of data
• handle control flow via explicit masking operations
• SIMT: single insn multiple thread
• write 1 insn that operates on scalar data
• each of many threads runs this insn
• compiler+hw aggregate threads into groups that execute on SIMD hardware
• compiler+hw handle masking for control flow
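To make the contrast concrete, a sketch in plain C with illustrative names (the hardware details are abstracted away): the SIMD version makes the per-lane mask for control flow explicit, while the SIMT version writes ordinary scalar control flow once per thread, in the style of a GPU kernel, and leaves the masking to the compiler and hardware:

  // SIMD style: one masked 4-wide operation; the mask is explicit.
  // Computes out[i] = (in[i] > 0) ? in[i]*2 : 0, four lanes at a time.
  void simd_style(const float *in, float *out, int n) {
      for (int i = 0; i < n; i += 4) {
          for (int lane = 0; lane < 4; lane++) {   // models one vector insn
              int mask = in[i + lane] > 0.0f;      // per-lane predicate
              out[i + lane] = mask ? in[i + lane] * 2.0f : 0.0f;  // select
          }
      }
  }

  // SIMT style: the same logic written once for one scalar "thread"
  // (tid stands in for a GPU thread index); hardware groups many such
  // threads onto SIMD lanes and masks divergent ones itself.
  void simt_style_thread(const float *in, float *out, int tid) {
      if (in[tid] > 0.0f)       // ordinary control flow, no explicit mask
          out[tid] = in[tid] * 2.0f;
      else
          out[tid] = 0.0f;
  }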
Google Tensor Processing Unit (v2)
• Slides from HotChips 2017
• https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.69-Key2-AI-ML-Pub/HotChips%20keynote%20Jeff%20Dean%20-%20August%202017.pdf
TPU v1 ISA
Systolic Array Matrix Multiply
• https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Systolic_Array_for_Neural_Network_2g8b7.GIF
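To summarize what the animation shows: in a weight-stationary systolic array, each processing element (PE) holds one weight, multiplies the value streaming past it, and adds the partial sum arriving from its neighbor, so a matrix multiply completes without re-fetching weights from memory. A minimal C sketch of that dataflow (the 4x4 size and all names are illustrative; the cycle-by-cycle pipelining is collapsed into sequential loops, whereas in hardware all chains run concurrently):

  #include <stdio.h>

  #define N 4

  // One chain of PEs computing a dot product: each "PE" holds one
  // weight, does a multiply-accumulate, and passes the partial sum on.
  float systolic_dot(const float w[N], const float x[N]) {
      float partial = 0.0f;                 // partial sum entering the chain
      for (int pe = 0; pe < N; pe++)        // each iteration models one PE
          partial += w[pe] * x[pe];         // MAC, then forward downstream
      return partial;                       // result exits the last PE
  }

  // Matrix-vector product y = W*x: one chain per output element.
  void systolic_matvec(float W[N][N], const float x[N], float y[N]) {
      for (int row = 0; row < N; row++)
          y[row] = systolic_dot(W[row], x);
  }

  int main(void) {
      float W[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
      float x[N] = {1, 2, 3, 4}, y[N];
      systolic_matvec(W, x, y);             // identity weights: y == x
      for (int i = 0; i < N; i++) printf("%.1f ", y[i]);
      printf("\n");
      return 0;
  }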
Accelerators Summary
• Data Level Parallelism
• “medium-grained” parallelism between ILP and TLP
• Still one flow of execution (unlike TLP)
• Compiler/programmer must explicitly express it (unlike ILP)
• Embrace data parallelism via “SIMT” execution model
• Becoming more programmable all the time
• Neural network accelerator
• Fast matrix multiply machine
• Slow growth in single-thread performance and the slowing of Moore’s Law drive adoption of accelerators