18-646 – How to Write Fast Code?
Carnegie Mellon University
Course Information
Lectures: Tuesday and Thursday, 6:00pm-7:20pm ET
Office Hours:
Instructor Office Hours: Wednesdays 4:30pm-5:30pm ET
TA Office Hours: TBD
Course Links:
Canvas: https://canvas.cmu.edu/courses/21510/pages/course-schedule
Piazza: https://piazza.com/class/kkmp02yc92h598
Gradescope: https://www.gradescope.com/courses/241050
Course Goal
When your research/application needs to be fast, you will be able to (FLIRT):
1. Feel comfortable hacking up a solution
2. Leverage existing software building blocks
3. Indicate which platform is the best one to use
4. Reason about why a piece of existing code is slow
5. Take care of potential performance bottlenecks
How to Write Fast Code?
Fast Platforms
  Multicore platforms
  Manycore platforms
  Cloud platforms
Good Techniques
  Data structures
  Algorithms
  Software Architecture
We focus on the fast hardware in this lecture. It combines with good techniques to produce fast code… in order to solve a problem or improve an outcome.
Outline
Landscape of Computing Platforms
Hardware Architectures
  Multicore vs Manycore
  Instruction level parallelism
  SIMD
  Simultaneous multithreading
  Memory hierarchy
  System hierarchy
How to Write Fast Code?
Landscape of Computing Platforms
This is what you see most often: the exponential increase in transistor integration, alongside the flattening of clock speed, power consumption, and performance per clock.
Landscape of Computing Platforms
This is triggering a flurry of technology innovations!
Let’s look at the innovations along three axes:
  Instruction Set
  Design Philosophy
  Power Consumption
[Diagram: the innovation space spanned by Instruction Set (x86, ARM, Power/Cell, SPARC), Design Philosophy (Multicore vs Manycore), and Power Consumption (1W to 100W)]
Computing Technology Innovations
[Diagram: platforms plotted by design philosophy (multicore vs manycore) against power consumption (1W to 100W), for each instruction set]
x86 Platform - Manycore: NVIDIA GTXxxx, AMD HDxxxx, Intel Larrabee; Multicore: Intel Core2, AMD Bulldozer, AMD Bobcat, Intel Atom
Power/Cell Platform - Manycore: IBM Cell; Multicore: IBM Power
ARM Platform - Manycore: Apple A5, Imagination Technology; Multicore: Apple A4, NVIDIA Tegra, many more…
SPARC Platform - Manycore: Sun/Oracle Niagara/Victoria; Multicore: Fujitsu SPARC64 VIII
Outline
Landscape of Computing Platforms
Hardware Architectures
  Multicore vs Manycore
  Instruction level parallelism
  SIMD
  Simultaneous multithreading
  Memory hierarchy
  System hierarchy
How to Write Fast Code?
(1) Multicore vs Manycore Processors
What’s the difference between multicore and manycore?
Multicore: yoke of oxen
Each core optimized for executing a single thread
Manycore: flock of chickens
Cores optimized for aggregate throughput, deemphasizing individual performance
Multicore and Manycore Parallelism
Similar in scaling trends:
  Increasing vector unit width (SIMD level parallelism)
  Increasing numbers of cores per die (core level parallelism)
  Increasing bandwidth to off-chip memory
Different in optimization points
[Diagram: a multicore chip with a few cores, per-core caches, and a shared memory bus to DRAM, versus a manycore chip with many cores, small caches, and many memory channels]
Significant Architectural Difference
SIMD: single instruction, multiple data
Intel Core i7-2600K vs NVIDIA GTX580:

| Specifications         | Core i7-2600K                                  | GTX580                                                     | Ratio       |
| Processing Elements    | 4 cores, 8-way SIMD @ 2.5-3.4 GHz              | 16 cores, 16-way SIMD, dual issue @ 1.55 GHz               | 0.46x-0.62x |
| Resident Threads (max) | 4 cores x 2 threads x 8-wide SIMD = 64 strands | 16 cores x 48 SIMD vectors x 32-wide SIMD = 24,576 strands | 384x        |
| SP GFLOP/s             | 160-218                                        | 1,587                                                      | 7.3x-9.9x   |
| Memory Bandwidth       | 21.4-42.7 GB/s                                 | 192.4 GB/s                                                 | 4.5x-9.0x   |
| Die Info               | 995M transistors, 32nm process, 216mm^2        | 3B transistors, 40nm process, 520mm^2                      |             |
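The SP GFLOP/s row follows from the specs above (a back-of-the-envelope check, assuming each SIMD lane can issue two floating-point operations per cycle):
  Core i7-2600K: 4 cores x 8 lanes x 2 flops/cycle x 2.5-3.4 GHz = 160-218 GFLOP/s
  GTX580: 16 cores x 32 lanes x 2 flops/cycle x 1.55 GHz = 1,587 GFLOP/s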
Outline
Landscape of Computing Platforms
Hardware Architectures
  Multicore vs Manycore
  Instruction level parallelism
  SIMD
  Simultaneous multithreading
  Memory hierarchy
  System hierarchy
How to Write Fast Code?
Load-store Architecture: MIPS
Data and instructions are loaded from memory for processing
Follows the von Neumann architecture: separation of CPU and memory
(2) Instruction Level Parallelism (ILP)
Instructions in a sequence that can be computed at the same time
Example: Euclidean distance between two points
ld r1, x1
ld r2, y1
ld r3, z1
ld r4, x2
ld r5, y2
ld r6, z2
sub r1, r1, r4
sub r2, r2, r5
sub r3, r3, r6
mul r1, r1, r1
mul r2, r2, r2
mul r3, r3, r3
add r1, r1, r2
add r1, r1, r3
sqrt r1, r1
st d, r1
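For reference, a minimal C sketch of the computation this assembly implements (the function and variable names are illustrative, not from the slides); the three coordinate differences are mutually independent, which is exactly the ILP the hardware can exploit:

#include <math.h>

/* Euclidean distance between two 3D points. dx, dy, and dz (and their
   squares) have no dependences on one another, so a superscalar core can
   compute the three chains in parallel. */
float distance(float x1, float y1, float z1,
               float x2, float y2, float z2) {
    float dx = x1 - x2;  /* independent */
    float dy = y1 - y2;  /* independent */
    float dz = z1 - z2;  /* independent */
    return sqrtf(dx * dx + dy * dy + dz * dz);
}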
Multiple Valid Instruction Sequence
Compilers may produce valid instruction sequences that have less ILP when executed in order
Sequence A (less ILP when executed in order):
ld r1, x1
ld r4, x2
sub r1, r4, r1
mul r1, r1, r1
ld r2, y1
ld r5, y2
sub r2, r5, r2
mul r2, r2, r2
add r1, r1, r2
ld r3, z1
ld r6, z2
sub r3, r6, r3
mul r3, r3, r3
add r1, r1, r3
sqrt r1, r1
st d, r1

Sequence B (more ILP):
ld r1, x1
ld r2, y1
ld r3, z1
ld r4, x2
ld r5, y2
ld r6, z2
sub r1, r1, r4
sub r2, r2, r5
sub r3, r3, r6
mul r1, r1, r1
mul r2, r2, r2
mul r3, r3, r3
add r1, r1, r2
add r1, r1, r3
sqrt r1, r1
st d, r1

[Dataflow graph: the loads of x1/x2, y1/y2, z1/z2 feed three independent sub -> mul chains, which merge through two adds into the final sqrt and store]
Traditional In-order Pipeline
An in-order processor pipeline can run into issues: reduced ILP and read/write operand dependences
In-order scalar pipeline
ld r1, x1
ld r4, x2
sub r1, r4, r1
mul r1, r1, r1
ld r2, y1
ld r5, y2
sub r2, r5, r2
mul r2, r2, r2
add r1, r1, r2
ld r3, z1
ld r6, z2
sub r3, r6, r3
mul r3, r3, r3
add r1, r1, r3
sqrt r1, r1
st d, r1
In-order superscalar pipeline
Source: http://www.lighterra.com/papers/modernmicroprocessors/
Later Out-of-order Pipelines
An out-of-order processor pipeline:
Allows instruction re-ordering
Uses register-renaming
ld r1, x1
ld r4, x2
sub r1, r4, r1
mul r1, r1, r1
ld r2, y1
ld r5, y2
sub r2, r5, r2
mul r2, r2, r2
add r1, r1, r2
ld r3, z1
ld r6, z2
sub r3, r6, r3
mul r3, r3, r3
add r1, r1, r3
sqrt r1, r1
st d, r1
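A quick illustration of why register renaming matters (a hypothetical C sketch, not from the slides): reusing one temporary creates false write-after-read dependences between otherwise independent groups of operations, and renaming removes them:

/* Both groups write the temporary t, so in program order group 2 appears
   to depend on group 1. Hardware register renaming maps each write of t
   to a fresh physical register, removing the false dependence and letting
   the two groups execute in parallel. */
float sum_of_squares(float x1, float x2, float y1, float y2) {
    float t;
    t = x1 - x2;        /* group 1 writes t */
    float d1 = t * t;
    t = y1 - y2;        /* group 2 rewrites t: a false (name) dependence */
    float d2 = t * t;
    return d1 + d2;
}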
Exploiting ILP
Motivation:
Given a single, sequential stream of instructions, how do we execute it as fast as possible?
Advantages:
No changes in sequential software necessary
Disadvantages:
Significantly more complex processor architecture
Longer to design the processor
Longer to verify the correctness of the processor design
Consumes more energy than a simple in-order processor
Outline
Landscape of Computing Platforms
Hardware Architectures
  Multicore vs Manycore
  Instruction level parallelism
  SIMD
  Simultaneous multithreading
  Memory hierarchy
  System hierarchy
How to Write Fast Code?
(3) SIMD – Single Inst, Multiple Data
Taxonomy proposed by Michael J Flynn in 1966
SISD: No parallelism
SIMD: Exploits data parallelism
MISD: Redundancy (used in Space Shuttle flight control)
MIMD: Distributed computing
SIMD:
Can be area and power efficient: amortizes control overhead over the SIMD width
Parallelism exposed to programmer & compiler
[Diagram: SISD adds a + b = c one element at a time; SIMD with width=2 adds (a1, a2) + (b1, b2) = (c1, c2) in a single instruction]
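As a concrete sketch of the SIMD add above (using SSE's 4-wide single-precision floats rather than width 2; the function name is illustrative, while the intrinsics are standard SSE from <xmmintrin.h>):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One _mm_add_ps instruction adds four float lanes at once:
   (c1, c2, c3, c4) = (a1, a2, a3, a4) + (b1, b2, b3, b4). */
void simd_add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);           /* load 4 floats from a */
    __m128 vb = _mm_loadu_ps(b);           /* load 4 floats from b */
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* c[0..3] = a[0..3] + b[0..3] */
}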
Processor Power Consumption
[Figure 4-1: Unoptimized energy per cycle (pJ), broken down by pipeline component (instruction fetch, register file access, execute, load/store, and instruction/data cache activity) across benchmarks such as gcc, go, ijpeg, li, m88ksim, perl, vortex, adpcm, g721, jpeg, and pegwit, plus their average]
A Brief History of x86 SIMD
Each extension is (roughly) a subset of its successor, with widths growing over time:
  MMX: 8 x 8-bit integer
  SSE: 4 x 32-bit SP float
  SSE2: 2 x 64-bit DP float
  SSE3, SSSE3, SSE4.1, SSE4.2
  AVX: 8 x 32-bit SP float
  AVX+FMA: 3 operands
  AMD: 3dNow! (4 x 32-bit SP float), SSE4.A, SSE5
  Intel Larrabee: 16 x 32-bit SP float
Slide by Bryan Catanzaro
What to do with SIMD?
Neglecting SIMD is expensive:
SSE: 4-way SIMD, AVX: 8-way SIMD, Larrabee: 16-way SIMD
NVIDIA: 32-way SIMD, ATI: 64-way SIMD
How to Utilize SIMD Capabilities?
Compilers:
Auto-vectorization options are available in GCC, ICC, and others
GCC: enable with options such as:
  -ftree-vectorize
  -msse2
  -ffast-math
  -fassociative-math
Also enabled by default with “-O3”
Examples at: http://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
Hand optimization:
Optimization using intrinsics (see the auto-vectorization sketch below)
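A minimal sketch of a loop the GCC auto-vectorizer handles well (the function name is illustrative; compile with, e.g., gcc -O2 -ftree-vectorize -msse2, or simply -O3):

/* y[i] += a * x[i]: iterations are independent, so GCC can emit SIMD
   instructions for the loop body. restrict tells the compiler the two
   arrays do not overlap, which makes vectorization straightforward. */
void saxpy(int n, float a, const float * restrict x, float * restrict y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}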
Example: Exploiting SIMD Parallelism
Applied to finding distance between two points:
ld r1, x1
ld r2, x2
ld r3, x3
ld r4, y1
ld r5, y2
ld r6, y3
sub r1, r4, r1
sub r2, r5, r2
sub r3, r6, r3
mul r1, r1, r1
mul r2, r2, r2
mul r3, r3, r3
add r1, r1, r2
add r1, r1, r3
sqrt r1, r1
st d, r1
V1 = _MM_LOAD(&X1[0])
V2 = _MM_LOAD(&Y1[0])
V1 = _MM_SUB(V2, V1)
V1 = _MM_MUL(V1, V1)
_MM_STORE(&res[0], V1)
add r1, &res[0], &res[1]
add r1, r1, &res[2]
sqrt r1, r1
st d, r1
Example…
#include <stdio.h>
#include <xmmintrin.h>  // SSE intrinsics
#define k 32
int main(){
  float x[k]; float y[k];         // vectors of length k
  __m128 X, Y;                    // 128-bit values
  __m128 acc = _mm_setzero_ps();  // set to (0, 0, 0, 0)
  float inner_prod, temp[4];
  int i, j;
  for (j = 0; j < k; j++) { x[j] = (float)j; y[j] = (float)j; }  // init inputs
  for (i = 0; i < k; i += 4) {    // process 4 floats per iteration
    X = _mm_loadu_ps(&x[i]);      // load 4 floats from x (unaligned-safe)
    Y = _mm_loadu_ps(&y[i]);      // load 4 floats from y
    acc = _mm_add_ps(acc, _mm_mul_ps(X, Y));  // acc += x * y, lane-wise
  }
  _mm_storeu_ps(temp, acc);       // spill the 4 partial sums
  inner_prod = temp[0] + temp[1] + temp[2] + temp[3];  // horizontal sum
  printf("inner product = %f\n", inner_prod);
  return 0;
}