
CSE 371 Computer Organization and Design

CIS 371: Computer Architecture


Unit 13: Data-Level Parallelism & Accelerators
Slides developed at UPenn, with sources that included University of Wisconsin slides

How to Compute SAXPY Quickly?
Performing the same operations on many data items
Example: SAXPY

Instruction-level parallelism (ILP) – fine grained
Loop unrolling with static scheduling –or– dynamic scheduling
Wide-issue superscalar (non-)scaling limits benefits
Thread-level parallelism (TLP) – coarse grained
Can we do some “medium grained” parallelism?
L1: ldf [X+r1]->f1 // I is in r1
mulf f0,f1->f2 // A is in f0
ldf [Y+r1]->f3
addf f2,f3->f4
stf f4->[Z+r1]
addi r1,4->r1
blti r1,4096,L1
for (I = 0; I < 1024; I++) { Z[I] = A*X[I] + Y[I]; }

Data-Level Parallelism
Data-level parallelism (DLP)
Single operation repeated on multiple data elements
SIMD (Single-Instruction, Multiple-Data)
Less general than ILP: parallel insns are all the same operation
Exploit with vectors

Old idea: Cray-1 supercomputer from late 1970s
Eight 64-entry x 64-bit floating point “vector registers”
4096 bits (0.5KB) in each register! 4KB for vector register file
Special vector instructions to perform vector operations
Load vector, store vector (wide memory operation)
Vector+Vector or Vector+Scalar addition, subtraction, multiply, etc.
In Cray-1, each instruction specifies 64 operations!
ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)
Extend ISA with vector storage …
Vector register: fixed-size array of FP/int elements
Vector length, for example: 4, 8, 16, 64, …
… and example operations for vector length of 4
Load vector: ldf.v [X+r1]->v1
ldf [X+r1+0]->v1_0
ldf [X+r1+4]->v1_1
ldf [X+r1+8]->v1_2
ldf [X+r1+12]->v1_3
Add two vectors: addf.vv v1,v2->v3
addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
Add vector to scalar: addf.vs v1,f2->v3
addf v1_i,f2->v3_i (where i is 0,1,2,3)
Today’s vectors: short (128-512 bits), but fully parallel

Example Use of Vectors – 4-wide

Operations
Load vector: ldf.v [X+r1]->v1
Multiply vector by scalar: mulf.vs v1,f2->v3
Add two vectors: addf.vv v1,v2->v3
Store vector: stf.v v1->[X+r1]
Performance?
Best case: 4x speedup
But, vector instructions don’t always have single-cycle throughput
Execution width (implementation) vs vector width (ISA)

ldf [X+r1]->f1
mulf f0,f1->f2
ldf [Y+r1]->f3
addf f2,f3->f4
stf f4->[Z+r1]
addi r1,4->r1
blti r1,4096,L1

ldf.v [X+r1]->v1
mulf.vs v1,f0->v2
ldf.v [Y+r1]->v3
addf.vv v2,v3->v4
stf.v v4->[Z+r1]
addi r1,16->r1
blti r1,4096,L1

7×1024 instructions
7×256 instructions
(4x fewer instructions)

Even sequential vector implementations can benefit from pipelining: no bypassing needed
also fewer insns to fetch/decode

Vector Datapath & Implementation
Vector insns. are just like normal insns… only “wider”
Single instruction fetch (no extra N² checks)
Wide register read & write (not multiple ports)
Wide execute: replicate floating point unit (same as superscalar)
Wide bypass (avoid N² bypass problem)
Wide cache read & write (single cache tag check)

Execution width (implementation) vs vector width (ISA)
Example: Pentium 4 and “Core 1” execute vector ops at half width
“Core 2” executes them at full width

Because they are just instructions…
…superscalar execution of vector instructions
Multiple n-wide vector instructions per cycle
Vector Insn Sets for Different ISAs
Intel and AMD: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2
currently: AVX-512 offers 512b vectors
PowerPC AltiVec/VMX: 128b
ARM NEON: 128b
ARM Scalable Vector Extensions (SVE): up to 2048b
an implementation may be narrower than this!
makes vector code portable across implementations

Other Vector Instructions
These target specific domains: e.g., image processing, crypto
Vector reduction (sum all elements of a vector)
Geometry processing: 4×4 translation/rotation matrices
Saturating (non-overflowing) subword add/sub: image processing (see the sketch after this list)
Byte asymmetric operations: blending and composition in graphics
Byte shuffle/permute: crypto
Population (bit) count: crypto
Max/min/argmax/argmin: video codec
Absolute differences: video codec
Multiply-accumulate: digital-signal processing
Special instructions for AES encryption
More advanced (but present in Intel’s Xeon Phi):
Scatter/gather: indirect stores (or loads) through a vector of pointers
Vector mask: predication (conditional execution) of specific elements
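As a concrete illustration of one item above, here is a minimal scalar sketch of what a saturating subword add does for a single byte lane; a real vector instruction applies this to every byte of the register at once. This is illustrative C, not any particular ISA’s definition.

  #include <stdint.h>

  /* Sketch: 8-bit saturating add, i.e., the per-lane behavior of a
   * saturating subword add.  Results that would overflow clamp to 255
   * instead of wrapping around (useful for pixel arithmetic). */
  static inline uint8_t sat_add_u8(uint8_t a, uint8_t b) {
      unsigned sum = (unsigned)a + (unsigned)b;
      return (uint8_t)(sum > 255 ? 255 : sum);
  }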

Vector Masks (Predication)
Vector Masks: 1 bit per vector element
Implicit predicate in all vector operations
for (I=0; I<N; I++)
  if (X[I] != 0)
    Z[I] = A/X[I];
ldf.v [X+r1] -> v1
cmp_ne.v v1,f0 -> r2 // 0.0 is in f0
divf.sv {r2} v1,f1 -> v2 // A is in f1
stf.v {r2} v2 -> [Z+r1]
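As a rough scalar sketch (assuming 4-element vectors), the masked divide/store sequence above is equivalent to the following per-lane C code:

  /* Sketch: per-lane effect of the masked sequence above,
   * assuming a 4-element vector starting at index I. */
  for (int lane = 0; lane < 4; lane++) {
      int mask = (X[I + lane] != 0.0f);   /* cmp_ne.v sets one mask bit per lane      */
      if (mask)                           /* divf.sv and stf.v act only where mask=1  */
          Z[I + lane] = A / X[I + lane];
  }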

Scatter Stores & Gather Loads
How to vectorize a loop whose addresses are computed indirectly (see the C sketch below)?
Vector “scatter store”: stf.v v2->[r1+v1]
Each address calculated from r1+v1_i
stf v2_0->[r1+v1_0], stf v2_1->[r1+v1_1],
stf v2_2->[r1+v1_2], stf v2_3->[r1+v1_3]
Vector “gather loads” defined analogously
ldf.v [r1+v1]->v2
Scatter/gathers slower than regular vector load/store ops
Still provides a throughput advantage over non-vector version
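For example (a hypothetical loop, not from the original slides; A, idx, scale, and n are made-up names), an update through an index array needs gather loads to read the addressed elements and scatter stores to write them back:

  /* Hypothetical example: indirectly-indexed update.
   * A gather load reads A[idx[i]] for several i at once;
   * a scatter store writes the results back to those same addresses. */
  for (int i = 0; i < n; i++) {
      A[idx[i]] = A[idx[i]] * scale;
  }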

Using Vectors in Your Code
Write in assembly
Use “intrinsic” functions and data types
For example: _mm_mul_ps() and the “__m128” datatype (see the sketch after this list)
Use vector data types
typedef double v2df __attribute__ ((vector_size (16)));
Use a library someone else wrote
Let them do the hard work
Matrix and linear algebra packages
Let the compiler do it (automatic vectorization, with feedback)
GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
Limited impact for C/C++ code (old, hard problem)
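A minimal sketch of the intrinsics approach applied to the earlier SAXPY loop, assuming x86 SSE and that n is a multiple of 4 (remainder handling omitted):

  #include <xmmintrin.h>   /* SSE: __m128, _mm_mul_ps(), etc. */

  /* Sketch: Z = A*X + Y with SSE intrinsics; assumes n is a multiple of 4. */
  void saxpy_sse(float a, const float *x, const float *y, float *z, int n) {
      __m128 va = _mm_set1_ps(a);                        /* broadcast a into all 4 lanes */
      for (int i = 0; i < n; i += 4) {
          __m128 vx = _mm_loadu_ps(&x[i]);               /* load 4 floats of X           */
          __m128 vy = _mm_loadu_ps(&y[i]);               /* load 4 floats of Y           */
          __m128 vz = _mm_add_ps(_mm_mul_ps(va, vx), vy);/* a*X + Y, 4 lanes at once     */
          _mm_storeu_ps(&z[i], vz);                      /* store 4 results into Z       */
      }
  }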

By the numbers: CPU vs GPU

                   Intel Xeon Platinum 8276L “Cascade Lake”   Nvidia GV100 (Volta)
frequency          2.2-4.0 GHz                                1.1 GHz
cores / threads    28 / 56                                    80 (“5120”) / 10Ks
RAM                4.5 TB                                     32 GB
DP TFLOPS          ~1.5                                       5.8
Transistors        >5B ?                                      21.1B
Price              $11,700                                    $9,000

Xeon Platinum 9200 can do 3.2 Tflops with 2 dies: https://www.hpcwire.com/2019/04/02/intel-launches-second-gen-scalable-xeons-with-up-to-56-cores/, so 8276L is probably half that

Recap: Vectors for Exploiting DLP
Vectors are an efficient way of capturing parallelism
Data-level parallelism
Avoid the N² problems of superscalar
Avoid the difficult fetch problem of superscalar
Area efficient, power efficient

The catch?
Need code that is “vector-izable”
Need to modify program (unlike dynamic-scheduled superscalar)
Requires some help from the programmer

Looking forward: Intel “Xeon Phi” (née Larrabee) vectors
More flexible (vector “masks”, scatter, gather) and wider
Should be easier to exploit, more bang for the buck

Graphics Processing Units (GPU)

Tesla S870

Killer app for parallelism: graphics (3D games)

GPUs and SIMD/Vector Data Parallelism
How do GPUs have such high peak FLOPS & FLOPS/Joule?
Exploit massive data parallelism – focus on total throughput
Remove hardware structures that accelerate single threads
Specialized for graphics: e.g., data-types & dedicated texture units
“SIMT” execution model
Single instruction multiple threads
Similar to both “vectors” and “SIMD”
A key difference: better support for conditional control flow
Program it with CUDA or OpenCL
Extensions to C/C++
Perform a “shader task” (a snippet of scalar computation) over many elements
Internally, GPU uses scatter/gather and vector mask operations

following slides c/o the “Beyond Programmable Shading” course
http://www.cs.cmu.edu/~kayvonf/

diffuse reflection example

diffuse reflection
specular reflection
c/o http://www.iconarchive.com/show/pool-ball-icons-by-barkerbaggies/Ball-6-icon.html

fragment = pixel

execution context = register file

NB: GPU vendors count “cores” as ALUs!!

SIMD vs SIMT
SIMD: single insn multiple data
write 1 insn that operates on a vector of data
you handle control flow via explicit masking operations
SIMT: single insn multiple thread
write 1 insn that operates on scalar data
each of many threads runs this insn
compiler+hw aggregate threads into groups that execute on SIMD hardware
compiler+hw handle masking for control flow

trading latency for bandwidth

Note the relatively modest clock frequency

Nvidia CUDA Programming Model

up to 1024 threads per block, can have millions of blocks
blocks and grid can be 1D, 2D or 3D

Nvidia CUDA Programming Model

programmer-managed scratchpad
no cache coherence
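As a minimal sketch of the SIMT model (my own example, not taken from the slides), a CUDA SAXPY kernel assigns one array element to each thread; the hardware groups threads into warps that execute on SIMD lanes:

  /* Sketch: CUDA SAXPY kernel; one thread per element. */
  __global__ void saxpy(int n, float a, const float *x, const float *y, float *z) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index        */
      if (i < n)                                       /* extra threads do nothing   */
          z[i] = a * x[i] + y[i];
  }

  /* Host-side launch (device pointers d_x, d_y, d_z assumed already allocated):
   *   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y, d_z);
   * i.e., enough 256-thread blocks to cover all n elements. */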

Intel Xeon Phi (née Larrabee)
GPU-like manycore accelerator
61 in-order x86 cores running at ~1 GHz
2-way superscalar
4 hyperthreads per core
512-bit vector operations
Leverages CPU programming models like OpenMP
Powers China’s Tianhe-2 supercomputer
33.86 PFLOPS, 3.1M cores
the biggest in the world (at the time)
no GPUs, just Xeon multicores and Xeon Phis

Google Tensor Processing Unit (v2)
Slides from HotChips 2017
https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.69-Key2-AI-ML-Pub/HotChips%20keynote%20Jeff%20Dean%20-%20August%202017.pdf

What precision? Unclear, perhaps 16b. TPU v1 used 8b int for matrix values

TPU v1 ISA


programming in TPU “assembly” would be intolerable, but we get to use TensorFlow instead
from https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.70-NeuralNet2-Pub/HC29.22.730-TensorPU-Young-Google.pdf

Systolic Array Matrix Multiply
https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Systolic_Array_for_Neural_Network_2g8b7.GIF
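A rough C sketch (my own illustration, not Google's implementation) of the timing of an output-stationary N×N systolic array: operands a[i][k] and b[k][j] arrive at processing element (i,j) on cycle i+j+k, and each PE performs one multiply-accumulate per cycle into its local c[i][j].

  #define N 4   /* array dimension for the sketch; real TPUs are much larger, e.g. 128x128 */

  /* Sketch: cycle-by-cycle model of an output-stationary systolic matmul C = A*B.
   * A streams in from the left, B from the top; PE(i,j) accumulates c[i][j]. */
  void systolic_matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              C[i][j] = 0.0f;
      for (int t = 0; t <= 3 * (N - 1); t++) {          /* one iteration per clock cycle */
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  int k = t - i - j;                    /* which operand pair arrives now */
                  if (k >= 0 && k < N)
                      C[i][j] += A[i][k] * B[k][j];     /* one MAC per PE per cycle       */
              }
      }
  }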

Accelerators Summary
Data Level Parallelism
“medium-grained” parallelism between ILP and TLP
Still one flow of execution (unlike TLP)
Compiler/programmer must explicitly express it (unlike ILP)
Embrace data parallelism via “SIMT” execution model
Becoming more programmable all the time
Neural network accelerator
Fast matrix multiply machine
Slow growth in single-thread performance, combined with continued Moore’s Law transistor scaling, drives adoption of accelerators


Figure: CUDA thread hierarchy ↔ memory hierarchy
Per-thread private memory (per thread)
Per-thread-block shared memory (per thread-block)
Global memory (all threads)

