Carnegie Mellon
Code Optimization
15-213/18-213/15-513: Introduction to Computer Systems 13th Lecture, June 17, 2020
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
1
Carnegie Mellon
Rear Admiral Grace Hopper
▪ Invented first compiler in 1951
(technically it was a linker)
▪ Coined “compiler” (and “bug”)
▪ Compiled for Harvard Mark I
▪ Eventually led to COBOL
(which ran the world for years)
▪ “I decided data processors ought to be able to write their programs in English, and the computers would translate them into machine code”
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
2
Carnegie Mellon
John Backus
▪ Led team at IBM invented the first commercially available compiler in 1957
▪ Compiled FORTRAN code for the IBM 704 computer
▪ FORTRAN still in use today for high performance code
▪ “Much of my work has come from being lazy. I didn’t like writing programs, and so, when I was working on the IBM 701, I started work on a programming system to make it easier to write programs”
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
3
Carnegie Mellon
Fran Allen
▪ Pioneer of many optimizing
compilation techniques
▪ Wrote a paper simply called “Program Optimization” in 1966
▪ “This paper introduced the use of graph-theoretic structures to encode program content in order to automatically and efficiently derive relationships and identify opportunities for optimization”
▪ First woman to win the ACM Turing Award (the “Nobel Prize of Computer Science”)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
4
Carnegie Mellon
Today
Overview
Generally Useful Optimizations ▪ Code motion/precomputation
▪ Strength reduction
▪ Sharing of common subexpressions ▪ Example: Bubblesort
Optimization Blockers ▪ Procedure calls
▪ Memory aliasing
Exploiting Instruction-Level Parallelism Dealing with Conditionals
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
5
Carnegie Mellon
Performance Realities
There’s more to performance than asymptotic complexity
Constant factors matter too!
▪ Easily see 10:1 performance range depending on how code is written ▪ Must optimize at multiple levels:
▪ algorithm, data representations, procedures, and loops
Must understand system to optimize performance
▪ How programs are compiled and executed
▪ How modern processors + memory systems operate
▪ How to measure program performance and identify bottlenecks
▪ How to improve performance without destroying code modularity and generality
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
6
Carnegie Mellon
Optimizing Compilers
Provide efficient mapping of program to machine ▪ register allocation
▪ code selection and ordering (scheduling) ▪ dead code elimination
▪ eliminating minor inefficiencies
Don’t (usually) improve asymptotic efficiency
▪ up to programmer to select best overall algorithm
▪ big-O savings are (often) more important than constant factors
▪ but constant factors also matter
Have difficulty overcoming “optimization blockers” ▪ potential memory aliasing
▪ potential procedure side-effects
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
7
Carnegie Mellon
Generally Useful Optimizations
Optimizations that you or the compiler should do regardless of processor / compiler
Code Motion
▪ Reduce frequency with which computation performed
▪ If it will always produce same result ▪ Especially moving code out of loop
void set_row(double *a, double *b, long i, long n)
{
long j;
for (j = 0; j < n; j++) a[n*i+j] = b[j];
}
long j;
int ni = n*i;
for (j = 0; j < n; j++)
a[ni+j] = b[j];
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
8
Carnegie Mellon
Compiler-Generated Code Motion (-O1)
void set_row(double *a, double *b, long i, long n)
{
long j;
for (j = 0; j < n; j++)
a[n*i+j] = b[j];
}
set_row:
.L3:
testq
jle
imulq
leaq
movl
movsd movsd addq cmpq jne
%rcx, %rcx
.L1
%rcx, %rdx (%rdi,%rdx,8), %rdx $0,%eax
(%rsi,%rax,8), %xmm0 %xmm0, (%rdx,%rax,8) $1, %rax
%rcx, %rax
# Test n
# If <= 0, goto done # ni = n*i
# rowp = A + ni*8 #j=0
# loop:
# t = b[j]
# M[A+ni*8 + j*8] = t # j++
# j:n
# if !=, goto loop
# done:
.L1:
.L3 rep ; ret
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
9
long j;
long ni = n*i;
double *rowp = a+ni;
for (j = 0; j < n; j++)
*rowp++ = b[j];
Carnegie Mellon
Reduction in Strength
▪ Replace costly operation with simpler one ▪ Shift, add instead of multiply or divide
16*x--> x<<4
▪ Utility is machine dependent
▪ Depends on cost of multiply or divide instruction
– Intel Nehalem: integer multiply takes 3 CPU cycles, add is 1 cycle1 ▪ Recognize sequence of products
int ni = 0;
for (i = 0; i < n; i++) { for (j = 0; j < n; j++)
a[ni + j] = b[j];
ni += n;
}
for (i = 0; i < n; i++) {
int ni = n*i;
for (j = 0; j < n; j++)
a[ni + j] = b[j];
}
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
10
1https://www.agner.org/optimize/instruction_tables.pdf
Carnegie Mellon
Share Common Subexpressions
▪ Reuse portions of expressions ▪ GCC will do this with –O1
/* Sum neighbors of i,j */
up = val[(i-1)*n + j ]; down = val[(i+1)*n + j ]; left = val[i*n + j-1]; right = val[i*n + j+1]; sum = up + down + left + right;
3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*n
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
11
long inj = i*n + j;
up = val[inj - n];
down = val[inj + n];
left = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;
leaq 1(%rsi), %rax # i+1
leaq -1(%rsi), %r8 # i-1
imulq %rcx, %rsi
imulq %rcx, %rax
imulq %rcx, %r8
addq %rdx, %rsi
addq %rdx, %rax
addq %rdx, %r8
...
# i*n
# (i+1)*n
# (i-1)*n
# i*n+j
# (i+1)*n+j
# (i-1)*n+j
imulq %rcx, %rsi # i*n
addq %rdx, %rsi # i*n+j
movq %rsi, %rax # i*n+j
subq %rcx, %rax # i*n+j-n
leaq (%rsi,%rcx), %rcx # i*n+j+n ...
Carnegie Mellon
Optimization Example: Bubblesort
Bubblesort program that sorts an array A that is allocated in static storage:
▪ an element of A requires four bytes of a byte-addressed machine ▪ elements of A are numbered 1 through n (n is a variable)
▪ A[j] is in location &A+4*(j-1)
for (i = n-1; i >= 1; i–) { for (j = 1; j <= i; j++)
if (A[j] > A[j+1]) {
temp = A[j];
A[j] = A[j+1]; A[j+1] = temp;
} }
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
12
Carnegie Mellon
Translated (Pseudo) Code
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
t8 := j-1
t9 := 4*t8
temp := A[t9] // temp:=A[j] t10 := j+1
t11:= t10-1
t12 := 4*t11
t13 := A[t12] // A[j+1]
t14 := j-1
t15 := 4*t14
A[t15] := t13 // A[j]:=A[j+1] t16 := j+1
t17 := t16-1
t18 := 4*t17
A[t18]:=temp
L3: j := j+1
goto L4
L2: i := i-1 goto L5
L1:
t1 := j-1
t2 := 4*t1
t3 := A[t2]
t4 := j+1
t5 := t4-1 t6 := 4*t5 t7 := A[t6]
// A[j]
// A[j+1]
if t3<=t7 goto L3
// A[j+1]:=temp
for (i = n-1; i >= 1; i–) { for (j = 1; j <= i; j++) if (A[j] > A[j+1]) {
temp = A[j];
A[j] = A[j+1];
A[j+1] = temp;
} }
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
13
Instructions 29 in outer loop 25 in inner loop
Carnegie Mellon
Redundancy in Address Calculation
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
t1 := j-1
t2 := 4*t1
t3 := A[t2] // A[j] t4 := j+1
t8 :=j-1
t9 := 4*t8
temp := A[t9] // temp:=A[j] t10 := j+1
t11:= t10-1
t12 := 4*t11
t13 := A[t12] // A[j+1]
t14 := j-1
t15 := 4*t14
A[t15] := t13 // A[j]:=A[j+1] t16 := j+1
t17 := t16-1
t18 := 4*t17
A[t18]:=temp
L3: j := j+1
goto L4
L2: i := i-1
goto L5
L1:
t5 := t4-1 t6 := 4*t5 t7 := A[t6]
// A[j+1]
if t3<=t7 goto L3
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
14
// A[j+1]:=temp
Carnegie Mellon
Redundancy Removed
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
t8 :=j-1
t9 := 4*t8
temp := A[t9] // temp:=A[j] t12 := 4*j
t13 := A[t12] // A[j+1]
t1 := j-1
t2 := 4*t1
t3 := A[t2]
t6 := 4*j
t7 := A[t6]
if t3<=t7 goto L3
A[t9]:= t13
A[t12]:=temp
L3: j := j+1
goto L4
L2: i := i-1
goto L5 L1:
// A[j]:=A[j+1]
// A[j+1]:=temp
// A[j]
// A[j+1]
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
15
Instructions 20 in outer loop 16 in inner loop
Carnegie Mellon
More Redundancy
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
t8 :=j-1
t9 := 4*t8
temp := A[t9] // temp:=A[j] t12 := 4*j
t13 := A[t12] // A[j+1] A[t9]:= t13 // A[j]:=A[j+1] A[t12]:=temp // A[j+1]:=temp
L3: j := j+1
goto L4
L2: i := i-1
goto L5
L1:
t1 := j-1
t2 := 4*t1
t3 := A[t2]
t6 := 4*j
t7 := A[t6]
if t3<=t7 goto L3
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
16
// A[j]
// A[j+1]
Carnegie Mellon
Redundancy Removed
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
A[t2] := t7
A[t6] := t3
j := j+1
goto L4
i := i-1
goto L5
// A[j]:=A[j+1]
// A[j+1]:=old_A[j]
t1 := j-1
t2 := 4*t1
t3 := A[t2]
t6 := 4*j
t7 := A[t6]
if t3<=t7 goto L3
L3: L2:
L1:
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
17
// old_A[j]
// A[j+1]
Instructions 15 in outer loop 11 in inner loop
Carnegie Mellon
Redundancy in Loops
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
t1 := j-1
t2 := 4*t1
t3 := A[t2]
t6 := 4*j
t7 := A[t6]
if t3<=t7 goto L3
A[t2] := t7
A[t6] := t3
L3: j := j+1
goto L4 L2: i := i-1
goto L5 L1:
// A[j]
// A[j+1]
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
18
Carnegie Mellon
Redundancy Eliminated
i := n-1
L5: if i<1 goto L1
j := 1
L4: if j>i goto L2
i := n-1
L5: if i<1 goto L1
t2 := 0
t6 := 4 t19 := 4*i
L4: if t6>t19 goto L2
t3 := A[t2]
t7 := A[t6]
if t3<=t7 goto L3
A[t2] := t7
A[t6] := t3
L3: t2 := t2+4
t6 := t6+4
goto L4 L2: i := i-1
goto L5
t1 := j-1
t2 := 4*t1
t3 := A[t2]
t6 := 4*j
t7 := A[t6]
if t3<=t7 goto L3
A[t2] := t7
A[t6] := t3 j:=j+1 goto L4
i := i-1 goto L5
// A[j]
// A[j+1]
L3: L2: L1:
L1:
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
19
Carnegie Mellon
Final Pseudo Code
i := n-1
L5: if i<1 goto L1
t2 := 0
t6 := 4
t19 := i << 2
L4: if t6>t19 goto L2 t3 := A[t2]
t7 := A[t6]
if t3<=t7 goto L3
A[t2] := t7
A[t6] := t3 L3: t2 := t2+4 t6 := t6+4
goto L4
L2: i := i-1
goto L5
• •
These were Machine-Independent Optimizations.
Will be followed by Machine-Dependent Optimizations,
including allocating temporaries to registers, converting to assembly code
L1:
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
20
Instruction Count Before Optimizations 29 in outer loop 25 in inner loop
Instruction Count After Optimizations 15 in outer loop
9 in inner loop
Carnegie Mellon
Today
Overview
Generally Useful Optimizations ▪ Code motion/precomputation
▪ Strength reduction
▪ Sharing of common subexpressions ▪ Example: Bubblesort
Optimization Blockers ▪ Procedure calls
▪ Memory aliasing
Exploiting Instruction-Level Parallelism Dealing with Conditionals
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
21
Carnegie Mellon
Limitations of Optimizing Compilers
Operateunderfundamentalconstraint
▪ Must not cause any change in program behavior
▪ Often prevents optimizations that affect only “edge case” behavior
Behaviorobvioustotheprogrammerisnotobvioustocompiler
▪ e.g., Data range may be more limited than types suggest (short vs. int)
Most analysis is only within a procedure
▪ Whole-program analysis is usually too expensive
▪ Sometimes compiler does interprocedural analysis within a file (new GCC)
Most analysis is based only on static information ▪ Compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
22
Carnegie Mellon
Optimization Blocker #1: Procedure Calls Procedure to Convert String to Lower Case
void lower(char *s) {
size_t i;
for (i = 0; i < strlen(s); i++)
if (s[i] >= ‘A’ && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}
▪ Extracted from 213 lab submissions, Fall, 1998
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
23
Carnegie Mellon
Lower Case Conversion Performance
▪ Time quadruples when double string length ▪ Quadratic performance
250
200
150
100
50
0
0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000
String length
lower1
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
24
CPU seconds
Carnegie Mellon
Convert Loop To Goto Form
▪ strlen executed every iteration
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
25
void lower(char *s)
{
size_t i = 0;
if (i >= strlen(s))
goto done;
loop:
if (s[i] >= ‘A’ && s[i] <= 'Z') s[i] -= ('A' - 'a');
i++;
if (i < strlen(s))
goto loop;
done:
}
Carnegie Mellon
Calling Strlen
Strlenperformance
▪ Only way to determine length of string is to scan its entire length, looking for
null character.
Overall performance, string of length N ▪ N calls to strlen
▪ Require times N, N-1, N-2, ..., 1 ▪ Overall O(N2) performance
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
26
/* My version of strlen */
size_t strlen(const char *s)
{
size_t length = 0;
while (*s != '\0') {
s++;
length++; }
return length; }
Carnegie Mellon
Improving Performance
void lower(char *s)
{
size_t i;
size_t len = strlen(s); for (i = 0; i < len; i++)
if (s[i] >= ‘A’ && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}
▪ Move call to strlen outside of loop
▪ Legal since result does not change from one iteration to another ▪ Form of code motion
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
27
Carnegie Mellon
Lower Case Conversion Performance
▪ Time doubles when double string length ▪ Linear performance of lower2
250
200
150
100
50
0
0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000
String length
lower1
lower2
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
28
CPU seconds
Carnegie Mellon
Optimization Blocker: Procedure Calls
Whycouldn’tcompilermovestrlenoutof innerloop?
▪ Procedure may have side effects
▪ Alters global state each time called
▪ Function may not return same value for given arguments ▪ Depends on other parts of global state
▪ Procedure lower could interact with strlen
Warning:
▪ Compiler may treat procedure call as a black box ▪ Weak optimizations near them
Remedies:
▪ Use of inline functions
▪ GCC does this with –O1 – Within single file ▪ Do your own code motion
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
29
size_t lencnt = 0;
size_t strlen(const char *s)
{
size_t length = 0;
while (*s != '\0') {
s++; length++; }
lencnt += length;
return length;
}
Carnegie Mellon
Memory Matters
/* Sum rows of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) { long i, j;
for (i = 0; i < n; i++) { b[i] = 0;
for (j = 0; j < n; j++) b[i] += a[i*n + j];
} }
# sum_rows1 inner loop .L4:
movsd
addsd
movsd
(%rsi,%rax,8), %xmm0
(%rdi), %xmm0
%xmm0, (%rsi,%rax,8)
# FP load
# FP add
# FP store
addq
cmpq
jne .L4
▪ Code updates b[i] on every iteration
▪ Why couldn’t compiler optimize this away?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
30
$8, %rdi
%rcx, %rdi
Carnegie Mellon
Memory Aliasing
/* Sum rows is of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) { long i, j;
for (i = 0; i < n; i++) { b[i] = 0;
} }
for (j = 0; j < n; j++) b[i] += a[i*n + j];
double A[9] = {0, 1, 2,
4, 8, 16},
32, 64, 128};
double B[3] = A+3;
sum_rows1(A, B, 3);
double A[9] =
{ 0, 1, 2,
0313, 22803862, 21931242606}, 32, 64, 128};
▪ Code updates b[i] on every iteration
▪ Must consider possibility that these updates will affect program behavior
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
31
Value of B:
init: [4, 8, 16]
i = 0: [3, 8, 16]
i = 1: [3, 22, 16]
i = 2: [3, 22, 224]
Carnegie Mellon
Removing Aliasing
/* Sum rows is of n X n matrix a and store in vector b */
void sum_rows2(double *a, double *b, long n) { long i, j;
for (i = 0; i < n; i++) { double val = 0;
for (j = 0; j < n; j++)
val += a[i*n + j];
b[i] = val; }
}
# sum_rows2 inner loop .L10:
addsd (%rdi), %xmm0
addq $8, %rdi
cmpq %rax, %rdi
jne .L10
# FP load + add
▪ No need to store intermediate results Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
32
Carnegie Mellon
Removing Aliasing
/* Sum rows is of n X n matrix a and store in vector b */
void sum_rows3(double *restrict a, double *restrict b, long n) { long i, j;
for (i = 0; i < n; i++) { b[i] = 0;
} }
for (j = 0; j < n; j++) b[i] += a[i*n + j];
# sum_rows3 inner loop .L12:
addsd (%rdi), %xmm0
addq $8, %rdi
cmpq %rax, %rdi
jne .L12
# FP load + add
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
33
Carnegie Mellon
Optimization Blocker: Memory Aliasing
Aliasing
▪ Two different memory references specify single location ▪ Easy to have happen in C
▪ Since allowed to do address arithmetic
▪ Direct access to storage structures
▪ Get in habit of introducing local variables
▪ Accumulating within loops
▪ Your way of telling compiler not to check for aliasing
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
34
Carnegie Mellon
Quiz Time!
Check out:
https://canvas.cmu.edu/courses/16836
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
35
Carnegie Mellon
Today
Overview
Generally Useful Optimizations ▪ Code motion/precomputation
▪ Strength reduction
▪ Sharing of common subexpressions ▪ Example: Bubblesort
Optimization Blockers ▪ Procedure calls
▪ Memory aliasing
Exploiting Instruction-Level Parallelism Dealing with Conditionals
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
36
Carnegie Mellon
Exploiting Instruction-Level Parallelism Need general understanding of modern processor design
▪ Hardware can execute multiple instructions in parallel
Performance limited by data dependencies
Simple transformations can cause big speedups
▪ Compilers often cannot make these transformations
▪ Lack of associativity and distributivity in floating-point arithmetic
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
37
Carnegie Mellon
Benchmark Example: Data Type for Vectors
/* data structure for vectors */ typedef struct{
size_t len;
data_t *data;
} vec;
len
data
0 1
len-1
Data Types
▪ Use different declarations
for data_t
▪ int
▪ long
▪ float
▪ double
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
38
/* retrieve vector element and store at val */
int get_vec_element
(*vec v, size_t idx, data_t *val)
{
if (idx >= v->len)
return 0;
*val = v->data[idx];
return 1;
}
Carnegie Mellon
Benchmark Computation
void combine1(vec_ptr v, data_t *dest)
{
long int i;
*dest = IDENT;
for (i = 0; i < vec_length(v); i++) {
data_t val; get_vec_element(v, i, &val); *dest = *dest OP val;
} }
Data Types
▪ Use different declarations
for data_t
▪ int
▪ long
▪ float
▪ double
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
Operations
▪ Use different definitions of
OP and IDENT ▪+/0
▪*/1
Compute sum or product of vector elements
39
Carnegie Mellon
Cycles Per Element (CPE)
Convenient way to express performance of program that operates on vectors or lists
Length = n
In our case: CPE = cycles per OP
T = CPE*n + Overhead ▪ CPE is slope of line
2500 2000 1500 1000
500
0
0 50 100 150 200
psum1
Slope = 9.0
psum2
Slope = 6.0
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
40
Elements
Cycles
Carnegie Mellon
Benchmark Performance
void combine1(vec_ptr v, data_t *dest)
{
long int i;
*dest = IDENT;
for (i = 0; i < vec_length(v); i++) {
data_t val; get_vec_element(v, i, &val); *dest = *dest OP val;
} }
Compute sum or product of vector elements
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Combine1 unoptimized
22.68
20.02
19.98
20.18
Combine1 –O1
10.12
10.12
10.17
11.14
Combine1 –O3
4.5
4.5
6
7.8
Results in CPE (cycles per element)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
41
Carnegie Mellon
Basic Optimizations
void combine4(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v); data_t *d = get_vec_start(v); data_t t = IDENT;
for (i = 0; i < length; i++)
t = t OP d[i];
*dest = t;
}
Move vec_length out of loop
Avoid bounds check on each cycle Accumulate in temporary
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
42
Carnegie Mellon
Effect of Basic Optimizations
Method
Operation
Combine1 –O1
Combine4
void combine4(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v); data_t *d = get_vec_start(v); data_t t = IDENT;
for (i = 0; i < length; i++)
t = t OP d[i];
*dest = t;
}
Integer
Add
10.12
1.27
Mult
10.12
3.01
Double FP
Add
10.17
3.01
Mult
11.14
5.01
Eliminates sources of overhead in loop Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
43
Carnegie Mellon
Modern CPU Design
Instruction Control
Fetch Control
Address
Instructions Operations
Retirement Unit
Instruction Cache
Register File
Instruction Decode
Register Updates Prediction OK?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
44
Branch Arith Arith Arith Load Store
Functional Units
Execution
Operation Results
Addr. Addr. Data
Data Cache
Data
Carnegie Mellon
Superscalar Processor
Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
Benefit: without programming effort, superscalar processor can take advantage of the instruction level parallelism that most programs have
Most modern CPUs are superscalar.
Intel: since Pentium (1993)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
45
Carnegie Mellon
Pipelined Functional Units
Stage 1 Stage 2 Stage 3
long mult_eg(long a, long b, long c) {
long p1 = a*b;
long p2 = a*c; long p3 = p1 * p2; return p3;
}
Time
1
2
3
4
5
6
7
Stage 1
a*b
a*c
p1*p2
Stage 2
a*b
a*c
p1*p2
Stage 3
a*b
a*c
p1*p2
▪ Divide computation into stages
▪ Pass partial computations from stage to stage
▪ Stage i can start on new computation once values passed to i+1
▪ E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
46
Carnegie Mellon
Haswell CPU
▪ 8 Total Functional Units
Multiple instructions can execute in parallel 2 load, with address computation
1 store, with address computation
4 integer
2 FP multiply 1 FP add
1 FP divide
Some instructions take > 1 cycle, but can be pipelined Instruction Latency Cycles/Issue
Load / Store
Integer Multiply Integer/Long Divide Single/Double FP Multiply Single/Double FP Add Single/Double FP Divide
4 1
3 1
3-30 3-30
5 1
3 1
3-15 3-15
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
47
Carnegie Mellon
x86-64 Compilation of Combine4 Inner Loop (Case: Integer Multiply)
.L519: # Loop:
imull (%rax,%rdx,4), %ecx # t = t * d[i]
addq $1,%rdx cmpq %rdx,%rbp jg .L519
#i++ #Comparelength:i # If >, goto Loop
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Combine4
1.27
3.01
3.01
5.01
Latency Bound
1.00
3.00
3.00
5.00
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
48
Carnegie Mellon
Combine4 = Serial Computation (OP = *)
Computation (length=8)
((((((((1 * d[0]) * d[1]) * d[2]) * d[3])
* d[4]) * d[5]) * d[6]) * d[7])
Sequential dependence
▪ Performance: determined by latency of OP
d3 *
d5 *
d7 *
1d0 *
d2 *
d4 *
d6 *
d1 *
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
49
Carnegie Mellon
Loop Unrolling (2×1)
void unroll2a_combine(vec_ptr v, data_t *dest) {
long length = vec_length(v); long limit = length-1;
data_t *d = get_vec_start(v); data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = (x OP d[i]) OP d[i+1]; }
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i]; }
*dest = x; }
Perform 2x more useful work per iteration Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
50
Carnegie Mellon
Effect of Loop Unrolling
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Combine4
1.27
3.01
3.01
5.01
Unroll 2x1
1.01
3.01
3.01
5.01
Latency Bound
1.00
3.00
3.00
5.00
Helps integer add
▪ Achieves latency bound
Others don’t improve. Why? ▪ Still sequential dependency
x = (x OP d[i]) OP d[i+1];
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
51
Carnegie Mellon
Loop Unrolling with Reassociation (2x1a)
void unroll2aa_combine(vec_ptr v, data_t *dest) {
long length = vec_length(v); long limit = length-1;
data_t *d = get_vec_start(v); data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = x OP (d[i] OP d[i+1]); }
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i]; }
*dest = x; }
Can this change the result of the computation?
Yes, for FP. Why?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
52
Compare to before
x = (x OP d[i]) OP d[i+1];
Carnegie Mellon
Effect of Reassociation
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Combine4
1.27
3.01
3.01
5.01
Unroll 2x1
1.01
3.01
3.01
5.01
Unroll 2x1a
1.01
1.51
1.51
2.51
Latency Bound
1.00
3.00
3.00
5.00
Throughput Bound
0.50
1.00
1.00
0.50
4 func. units for int +, 2 func. units for load Why Not .25?
1 func. unit for FP + 3-stage pipelined FP +
2 func. units for FP *, 2 func. units for load 5-stage pipelined FP *
Nearly 2x speedup for Int *, FP +, FP * ▪ Reason: Breaks sequential dependency
x = x OP (d[i] OP d[i+1]);
▪ Why is that? (next slide)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
53
Carnegie Mellon
Reassociated Computation
x = x OP (d[i] OP d[i+1]);
d0 d1 *
1 *
*
What changed:
▪ Ops in the next iteration can be
started early (no dependency)
Overall Performance
▪ N elements, D cycles latency/op
▪ (N/2+1)*D cycles: CPE = D/2
d2 d3
* d4 d5
*
*
d6 d7 *
*
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
54
Carnegie Mellon
Loop Unrolling with Separate Accumulators (2x2)
void unroll2a_combine(vec_ptr v, data_t *dest) {
long length = vec_length(v); long limit = length-1;
data_t *d = get_vec_start(v); data_t x0 = IDENT;
data_t x1 = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x0 = x0 OP d[i];
x1 = x1 OP d[i+1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
x0 = x0 OP d[i];
}
*dest = x0 OP x1; }
Different form of reassociation Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
55
Carnegie Mellon
Effect of Separate Accumulators
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Combine4
1.27
3.01
3.01
5.01
Unroll 2x1
1.01
3.01
3.01
5.01
Unroll 2x1a
1.01
1.51
1.51
2.51
Unroll 2x2
0.81
1.51
1.51
2.51
Latency Bound
1.00
3.00
3.00
5.00
Throughput Bound
0.50
1.00
1.00
0.50
Int + makes use of two load units
2x speedup (over unroll2) for Int *, FP +, FP * Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
56
x0 = x0 OP d[i]; x1 = x1 OP d[i+1];
Carnegie Mellon
Separate Accumulators
x0 = x0 OP d[i];
x1 = x1 OP d[i+1];
1d0 1d1
* d2 * d3
* d4 * d5
* d6 * d7 **
*
What changed:
▪ Two independent “streams” of
operations
Overall Performance
▪ N elements, D cycles latency/op
▪ Should be (N/2+1)*D cycles: CPE = D/2
▪ CPE matches prediction! What Now?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
57
Carnegie Mellon
Unrolling & Accumulating
Idea
▪ Can unroll to any degree L
▪ Can accumulate K results in parallel ▪ L must be multiple of K
Limitations
▪ Diminishing returns
▪ Cannot go beyond throughput limitations of execution units ▪ Large overhead for short lengths
▪ Finish off iterations sequentially
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
58
Carnegie Mellon
Unrolling & Accumulating: Double *
Case
▪ Intel Haswell
▪ Double FP Multiplication
▪ Latency bound: 5.00. Throughput bound: 0.50
FP *
Unrolling Factor L
K
1
2
3
4
6
8
10
12
1
5.01
5.01
5.01
5.01
5.01
5.01
5.01
2
2.51
2.51
2.51
3
1.67
4
1.25
1.26
6
0.84
0.88
8
0.63
10
0.51
12
0.52
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
59
Accumulators
Carnegie Mellon
Achievable Performance
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Best
0.54
1.01
1.01
0.52
Latency Bound
1.00
3.00
3.00
5.00
Throughput Bound
0.50
1.00
1.00
0.50
Limited only by throughput of functional units
Up to 42X improvement over original, unoptimized code
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
61
Carnegie Mellon
Programming with AVX2 YMM Registers
◼ 16 total, each 32 bytes ◼ 32 single-byte integers
◼ 16 16-bit integers
◼ 8 32-bit integers
◼ 8 single-precision floats ◼ 4 double-precision floats ◼ 1 single-precision float
◼ 1 double-precision float
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
62
Carnegie Mellon
SIMD Operations
◼ SIMD Operations: Single Precision
vaddps %ymm0, %ymm1, %ymm1
++++++++
%ymm0 %ymm1
◼ SIMD Operations: Double Precision
vaddpd %ymm0, %ymm1, %ymm1
++++
%ymm0 %ymm1
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
63
Carnegie Mellon
Using Vector Instructions
Method
Integer
Double FP
Operation
Add
Mult
Add
Mult
Scalar Best
0.54
1.01
1.01
0.52
Vector Best
0.06
0.24
0.25
0.16
Latency Bound
0.50
3.00
3.00
5.00
Throughput Bound
0.50
1.00
1.00
0.50
Vec Throughput Bound
0.06
0.12
0.25
0.12
Make use of AVX Instructions
▪ Parallel operations on multiple data elements ▪ See Web Aside OPT:SIMD on CS:APP web page
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
64
Carnegie Mellon
What About Branches? Challenge
▪Instruction Control Unit must work well ahead of Execution Unit to generate enough operations to keep EU busy
Executing
How to continue?
404663: mov $0x0,%eax 404668: cmp (%rdi),%rsi 40466b: jge 404685
40466d: mov 0x8(%rdi),%rax
.. .
404685: repz retq
▪When encounters conditional branch, cannot reliably determine where to continue fetching
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
65
Carnegie Mellon
Modern CPU Design
Instruction Control
Fetch Control
Address
Instructions Operations
Retirement Unit
Instruction Cache
Register File
Instruction Decode
Register Updates Prediction OK?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
66
Branch Arith Arith Arith Load Store
Functional Units
Execution
Operation Results
Addr. Addr. Data
Data Cache
Data
Carnegie Mellon
Branch Outcomes
▪When encounter conditional branch, cannot determine where to continue fetching
▪ Branch Taken: Transfer control to branch target
▪ Branch Not-Taken: Continue with next instruction in sequence ▪Cannot resolve until outcome determined by branch/integer unit
404663: mov $0x0,%eax 404668: cmp (%rdi),%rsi 40466b: jge 404685
40466d: mov 0x8(%rdi),%rax
.. .
404685: repz retq
Branch Not-Taken
Branch Taken
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
67
Carnegie Mellon
Branch Prediction
Idea
▪ Guess which way branch will go
▪ Begin executing instructions at predicted position
▪ But don’t actually modify register or memory data
Predict Taken
Begin Execution
404663: mov $0x0,%eax 404668: cmp (%rdi),%rsi 40466b: jge 404685
40466d: mov 0x8(%rdi),%rax
.. .
404685: repz retq
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
68
Carnegie Mellon
Branch Prediction Through Loop
Assume
vector length = 100
Predict Taken (OK)
Predict Taken (Oops)
401029: vmulsd (%rdx),%xmm0,%xmm0 40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 98
401029: vmulsd (%rdx),%xmm0,%xmm0 40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 99
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029
i = 100
Read invalid location
Executed Fetched
401029: vmulsd (%rdx),%xmm0,%xmm0 40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 101
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
69
Carnegie Mellon
Branch Misprediction Invalidation
Assume
vector length = 100
Predict Taken (OK)
Predict Taken (Oops)
Invalidate
401029: vmulsd (%rdx),%xmm0,%xmm0 40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 98
401029: vmulsd (%rdx),%xmm0,%xmm0 40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 99
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029
i = 100
401029: vmulsd (%rdx),%xmm0,%xmm0 40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 101
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
70
Carnegie Mellon
Branch Misprediction Recovery
Definitely not taken
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx 401031: cmp %rax,%rdx 401034: jne 401029 401036: jmp 401040
i = 99
...
401040: vmovsd %xmm0,(%r12)
Performance Cost
▪ Multiple clock cycles on modern processor ▪ Can be a major performance limiter
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
71
Reload Pipeline
Carnegie Mellon
Branch Prediction Numbers
Default behavior:
▪ Backwards branches are often loops so predict taken ▪ Forwards branches are often if so predict not taken
Predictors average better than 95% accuracy ▪ Most branches are already predictable.
Annual branch predictor contests at top Computer Architecture conferences
▪ https://www.jilp.org/jwac-2/program/JWAC-2-program.htm ▪ Winner: 34.1 mispredictions per kilo-instruction (!)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
73
Carnegie Mellon
Getting High Performance Good compiler and flags
Don’t do anything sub-optimal
▪ Watch out for hidden algorithmic inefficiencies ▪ Write compiler-friendly code
▪ Watch out for optimization blockers: procedure calls & memory references
▪ Look carefully at innermost loops (where most work is done)
Tune code for machine
▪ Exploit instruction-level parallelism
▪ Avoid unpredictable branches
▪ Make code cache friendly (Covered later in course)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
74