
18-646 – How to Write Fast Code?
Carnegie Mellon University

Course Information
— Lectures:
— Tuesday and Thursday 6:00pm-7:20pm ET
— Office Hours:
— Instructor Office Hours: Wednesdays 4:30pm-5:30pm ET
— TA Office Hours: TBD
— Course Links:
— Canvas: https://canvas.cmu.edu/courses/21510/pages/course-schedule
— Piazza: https://piazza.com/class/kkmp02yc92h598
— Gradescope: https://www.gradescope.com/courses/241050

https://canvas.cmu.edu/courses/21510/pages/course-schedule

Course Goal
— Course Goal
— When your research/application needs to be fast, you will be able to (FLIRT):
1. Feel comfortable hacking up a solution
2. Leverage existing software building blocks
3. Indicate which platform is the best one to use
4. Reason about why a piece of existing code is slow
5. Take care of potential performance bottlenecks

How to Write Fast Code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— We focus on the fast hardware in this lecture
— Fast hardware combines with good techniques to produce fast code…
— …in order to solve a problem or improve an outcome

Outline
— Landscape of Computing Platforms
— Hardware Architectures
— Multicore vs Manycore
— Instruction level parallelism
— SIMD
— Simultaneous multithreading
— Memory hierarchy
— System hierarchy
— How to Write Fast Code?

Landscape of Computing Platforms
— This is what you see most often: the exponential increase in transistor integration, and the flattening of clock speed, power consumption, and performance per clock

Landscape of Computing Platforms
— This is triggering a flurry of technology innovations!
— Let’s look at the innovations along three axes:
— Instruction Set (x86, ARM, Power/Cell, SPARC)
— Design Philosophy (multicore vs manycore)
— Power Consumption (1W – 10W – 100W)

Computing Technology Innovations
— x86 Platform: multicore (Intel Core2, AMD Bulldozer, Intel Atom, AMD Bobcat), manycore (NVIDIA GTXxxx, AMD HDxxxx, Intel Larrabee)
— Power/Cell Platform: multicore (IBM Power), manycore (IBM Cell)
— ARM Platform: multicore (Apple A4, NVIDIA Tegra, many more…), manycore (Apple A5, Imagination Technology)
— SPARC Platform: multicore (Fujitsu SPARC64 VIII), manycore (Sun/Oracle Niagara/Victoria)
— Each platform family spans the 1W – 10W – 100W power consumption range

Outline
— Landscape of Computing Platforms
— Hardware Architectures
— Multicore vs Manycore
— Instruction level parallelism
— SIMD
— Simultaneous multithreading
— Memory hierarchy
— System hierarchy
— How to Write Fast Code?

(1) Multicore vs Manycore Processors
— What’s the difference between Multicore and Manycore?
— Multicore: yoke of oxen
— Each core optimized for executing a single thread
— Manycore: flock of chickens
— Cores optimized for aggregate throughput, deemphasizing individual performance

Multicore and Manycore Parallelism
— Similar in scaling trends:
— Increasing vector unit width (SIMD level parallelism)
— Increasing numbers of cores per die (core level parallelism)
— Increasing bandwidth to off-chip memory
— Different in optimization points
(Figure: a multicore chip with per-core caches on a shared memory bus to DRAM, next to a manycore chip with many cores, caches, and per-core local memories)

Significant Architectural Difference
— Intel Core i7-2600K vs NVIDIA GTX580 (SIMD: single instruction, multiple data)

Specification          | Core i7-2600K                                | GTX580                                                   | Ratio
Processing Elements    | 4 cores, 8-way SIMD @ 2.5-3.4 GHz            | 16 cores, 16-way SIMD, dual issue @ 1.55 GHz             | 0.46x – 0.62x
Resident Threads (max) | 4 cores, 2 threads, 8-wide SIMD = 64 strands | 16 cores, 48 SIMD vectors, 32-wide SIMD = 24,576 strands | 384x
SP GFLOP/s             | 160-218                                      | 1,587                                                    | 7.3x – 9.9x
Memory Bandwidth       | 21.4 GB/s – 42.7 GB/s                        | 192.4 GB/s                                               | 4.5x – 9.0x
Die info               | 995M transistors, 32nm process, 216mm2       | 3B transistors, 40nm process, 520mm2                     |

Outline
— Landscape of Computing Platforms
— Hardware Architectures
— Multicore vs Manycore
— Instruction level parallelism
— SIMD
— Simultaneous multithreading
— Memory hierarchy
— System hierarchy
— How to Write Fast Code?

Load-store Architecture: MIPS
— Data and instructions are loaded from memory for processing
— Follows the von Neumann architecture – separation of CPU and memory

(2) Instruction Level Parallelism (ILP)
— Instructions in a sequence that can be computed at the same time
— Example: Euclidean distance between two points (x1, y1, z1) and (x2, y2, z2)

ld   r1, x1
ld   r2, y1
ld   r3, z1
ld   r4, x2
ld   r5, y2
ld   r6, z2
sub  r1, r1, r4
sub  r2, r2, r5
sub  r3, r3, r6
mul  r1, r1, r1
mul  r2, r2, r2
mul  r3, r3, r3
add  r1, r1, r2
add  r1, r1, r3
sqrt r1, r1
st   d, r1
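To make the dependences concrete, here is a C sketch of the same computation (illustrative only, not from the slides; the function name dist3 is hypothetical). The three subtract/multiply chains touch disjoint registers, which is exactly the instruction level parallelism a wide core can exploit:

#include <math.h>

/* Sketch: C equivalent of the assembly above. */
float dist3(float x1, float y1, float z1,
            float x2, float y2, float z2) {
    float dx = x1 - x2;  /* sub r1, r1, r4 */
    float dy = y1 - y2;  /* sub r2, r2, r5 */
    float dz = z1 - z2;  /* sub r3, r3, r6 */
    /* dx*dx, dy*dy, dz*dz have no mutual dependences, so a
       superscalar core can compute them in parallel */
    return sqrtf(dx * dx + dy * dy + dz * dz);
}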

Multiple Valid Instruction Sequence
— Compilers may produce valid instruction sequences that have less ILP when executed in order

Sequence A (interleaved; less ILP when executed in order):
ld   r1, x1
ld   r4, x2
sub  r1, r4, r1
mul  r1, r1, r1
ld   r2, y1
ld   r5, y2
sub  r2, r5, r2
mul  r2, r2, r2
add  r1, r1, r2
ld   r3, z1
ld   r6, z2
sub  r3, r6, r3
mul  r3, r3, r3
add  r1, r1, r3
sqrt r1, r1
st   d, r1

Sequence B (grouped; more ILP when executed in order):
ld   r1, x1
ld   r2, y1
ld   r3, z1
ld   r4, x2
ld   r5, y2
ld   r6, z2
sub  r1, r1, r4
sub  r2, r2, r5
sub  r3, r3, r6
mul  r1, r1, r1
mul  r2, r2, r2
mul  r3, r3, r3
add  r1, r1, r2
add  r1, r1, r3
sqrt r1, r1
st   d, r1

(Figure: the dataflow graph – three independent sub → mul chains on the x, y, and z pairs merge through two adds into the final sqrt)

Traditional In-order Pipeline
— An in-order processor pipeline could run into issues:
— Reduced ILP and read/write operand dependencies
— Consider the interleaved sequence on an in-order scalar pipeline vs. an in-order superscalar pipeline:

ld   r1, x1
ld   r4, x2
sub  r1, r4, r1
mul  r1, r1, r1
ld   r2, y1
ld   r5, y2
sub  r2, r5, r2
mul  r2, r2, r2
add  r1, r1, r2
ld   r3, z1
ld   r6, z2
sub  r3, r6, r3
mul  r3, r3, r3
add  r1, r1, r3
sqrt r1, r1
st   d, r1

http://www.lighterra.com/papers/modernmicroprocessors/

Later Out-of-order Pipelines
— An out-of-order processor pipeline:
— Allows instruction re-ordering
— Uses register renaming

ld   r1, x1
ld   r4, x2
sub  r1, r4, r1
mul  r1, r1, r1
ld   r2, y1
ld   r5, y2
sub  r2, r5, r2
mul  r2, r2, r2
add  r1, r1, r2
ld   r3, z1
ld   r6, z2
sub  r3, r6, r3
mul  r3, r3, r3
add  r1, r1, r3
sqrt r1, r1
st   d, r1
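For intuition about what renaming buys, consider a sketch (illustrative, not from the slides; two_products is a hypothetical helper) where one variable is reused for two unrelated products:

/* Sketch: 'r' is written twice with unrelated values. Without
   renaming, the second write must wait for the earlier read of r
   (a write-after-read hazard); with register renaming, each write
   gets its own physical register and the two products proceed
   independently. */
float two_products(float a, float b, float c, float d) {
    float r;
    r = a * b;           /* first value of r */
    float s = r + 1.0f;  /* reads the first r */
    r = c * d;           /* renamed: independent of the first r */
    return s + r;
}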

Exploiting ILP
— Motivation:
— Given a single, sequential stream of instructions, how to execute it as fast as possible?
— Advantages:
— No changes in sequential software necessary
— Disadvantages:
— Significantly more complex processor architecture
— Longer to design the processor
— Longer to verify the correctness of the processor design
— Consumes more energy than a simple in-order processor
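Software can also expose ILP directly. The sketch below (illustrative, not from the slides; sum4 is a hypothetical helper) sums an array with four independent accumulators, shortening the dependence chain that a single running sum would create. Note it reassociates the floating-point additions, which must be acceptable for the application:

#include <stddef.h>

/* Sketch: four independent accumulators expose ILP that a single
   accumulator would serialize. */
float sum4(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];      /* these four adds have no mutual   */
        s1 += a[i + 1];  /* dependences, so a superscalar    */
        s2 += a[i + 2];  /* core can keep several floating-  */
        s3 += a[i + 3];  /* point adders busy simultaneously */
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}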

Outline
— Landscape of Computing Platforms
— Hardware Architectures
— Multicore vs Manycore
— Instruction level parallelism
— SIMD
— Simultaneous multithreading
— Memory hierarchy
— System hierarchy
— How to Write Fast Code?

(3) SIMD – Single Inst, Multiple Data
— Taxonomy proposed by Michael J. Flynn in 1966
— SISD: No parallelism
— SIMD: Exploits data parallelism
— MISD: Redundancy (used in Space Shuttle flight control)
— MIMD: Distributed computing
— SIMD:
— Can be area and power efficient: amortize control overhead over SIMD width
— Parallelism exposed to programmer & compiler
(Figure: SISD adds scalars a + b = c one at a time; SIMD with width = 2 adds (a1, a2) + (b1, b2) = (c1, c2) with one instruction)

Processor Power Consumption
(Figure 4-1: Unoptimized energy per cycle – stacked bars of energy per cycle in pJ, broken down into pipeline, register-file, and instruction/data-cache components, across SPEC and MediaBench benchmarks such as gcc, ijpeg, adpcm, and jpeg)

A Brief History of x86 SIMD
— MMX: 8 x 8-bit integer
— 3dNow!: AMD’s early SIMD float extension
— SSE: 4 x 32-bit SP float
— SSE2: 2 x 64-bit DP float
— SSE3, SSSE3, SSE4.1, SSE4.2 (and AMD’s SSE4.A, SSE5)
— AVX: 8 x 32-bit SP float; AVX+FMA: 3-operand fused multiply-add
— Larrabee: 16 x 32-bit SP float
Slide by Bryan Catanzaro

What to do with SIMD?
— Neglecting SIMD is expensive
— SSE: 4-way SIMD, AVX: 8-way SIMD, Larrabee: 16-way SIMD
— NVIDIA: 32-way SIMD, ATI: 64-way SIMD

How to Utilize SIMD Capabilities?
— Compilers:
— Option available in GCC, ICC, and others
— GCC: enabled by options such as:
— -ftree-vectorize
— -msse2
— -ffast-math
— -fassociative-math
— Also enabled by default with “-O3”
— Examples at: http://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
— Hand optimization:
— Optimization using intrinsics
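As a minimal illustration of the compiler route (the file and function names here are hypothetical, not from the slides), a loop with independent iterations is the kind GCC’s tree vectorizer can turn into SSE code using the flags above:

/* vecadd.c – compile with, e.g.:
     gcc -O2 -ftree-vectorize -msse2 -c vecadd.c
   or simply -O3, which enables the vectorizer by default. */
void vec_add(float *restrict c, const float *restrict a,
             const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];  /* independent iterations map onto SIMD lanes */
}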

Example: Exploiting SIMD Parallelism
— Applied to finding distance between two points:
Scalar version:
ld   r1, x1
ld   r2, x2
ld   r3, x3
ld   r4, y1
ld   r5, y2
ld   r6, y3
sub  r1, r4, r1
sub  r2, r5, r2
sub  r3, r6, r3
mul  r1, r1, r1
mul  r2, r2, r2
mul  r3, r3, r3
add  r1, r1, r2
add  r1, r1, r3
sqrt r1, r1
st   d, r1

SIMD version:
V1 = _MM_LOAD(&X1[0])
V2 = _MM_LOAD(&Y1[0])
V1 = _MM_SUB(V2, V1)
V1 = _MM_MUL(V1, V1)
_MM_STORE(&res[0], V1)
add  r1, &res[0], &res[1]
add  r1, r1, &res[2]
sqrt r1, r1
st   d, r1

Example…
#include <stdio.h>
#include <pmmintrin.h> // header for SSE3
#define k 32
int main(){
  // alignment attributes added: _mm_load_ps/_mm_store_ps need
  // 16-byte-aligned addresses
  float x[k] __attribute__((aligned(16))); // vectors of length k
  float y[k] __attribute__((aligned(16)));
  __m128 X, Y;                   // 128-bit values
  __m128 acc = _mm_setzero_ps(); // set to (0, 0, 0, 0)
  float inner_prod, temp[4] __attribute__((aligned(16)));
  int i, j;
  for (j=0; j<k; j++) { x[j] = (float)j; y[j] = (float)j; } // initialize
  // the slide is truncated here; what follows reconstructs the
  // standard SSE inner-product loop this code sets up
  for (i=0; i<k; i+=4) {
    X = _mm_load_ps(&x[i]);                  // load 4 floats from x
    Y = _mm_load_ps(&y[i]);                  // load 4 floats from y
    acc = _mm_add_ps(acc, _mm_mul_ps(X, Y)); // 4-wide multiply-accumulate
  }
  _mm_store_ps(temp, acc); // spill the 4 partial sums
  inner_prod = temp[0] + temp[1] + temp[2] + temp[3];
  printf("inner product = %f\n", inner_prod);
  return 0;
}
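Two notes on the code above: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses (hence the alignment attributes, an addition to the truncated original), and since the SSE3 header is included, the final horizontal sum could alternatively use _mm_hadd_ps. Building requires an SSE3-capable target, e.g. gcc -msse3.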