CSCI 2021: Program Performance Micro-Optimizations
Chris
Updated: Mon Nov 29 02:18:17 PM CST 2021
1
Logistics
Reading: Bryant/O'Hallaron
▶ Ch 6: Memory System
▶ Ch 5: Optimization
Goals
▶ Permanent Storage
▶ Optimization Overview
▶ Micro-optimizations
P4 Reminders
▶ Search Benchmark: report times that are > 1e-02
▶ Writeup: answers are 3-4 sentences, supported with tables of times
Upcoming Events
Date       Event
Mon 11/29  Micro Opts
Wed 12/01  Review, P4 Due
Fri 12/03  Exam 3, Lab13: Review
Function Pointers
▶ Optional Tutorial posted
▶ Relevant for P4 Problem 2
▶ Optional MAKEUP Credit
2
Caution: Should I Optimize?
▶ Optimizing program execution time saves CPU time, costs Human time
▶ CPU Time: cheap
▶ Human Time: expensive
▶ Determine if there is a NEED to optimize
▶ Benchmark your code – if it is fast enough, move on
▶ When optimizing, use data/tools to direct Human Effort (benchmarks/profiler)
▶ Never sacrifice correctness for speed
3
What to Optimize First
In order of impact
1. Algorithms and Data Structure Selection
2. Elimination of unneeded work/hidden costs
3. Memory Utilization
4. Micro-optimizations
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
– Donald Knuth
4
Exercise: Optimize This
▶ Prema is tasked by her boss to optimize performance of function get_min()
▶ The current version of the function looks like the code below
▶ Prema immediately jumps to the code for ...

int get_min(storage_t *st){
  int *arr =
    malloc(sizeof(int)*get_size(st));
  for(int i=0; i
lila> gcc -Og sum_range.c   # Limit opt
lila> ./a.out 0 1000000000
sum_range1: 1.9126e+00 secs
sum_range2: 2.6942e-01 secs

### Compiled with -Og: limited opt
sum_range1:
        movl    $0, (%rdx)       # init MEMORY
        jmp     .LOOPTOP
.BODY:
        addl    %edi, (%rdx)     # MEMORY add in loop
        addl    $1, %edi
.LOOPTOP:
        cmpl    %esi, %edi
        jl      .BODY
        ret

▶ Minimal optimizations
▶ Memory reference definitely matters

lila> gcc -O1 sum_range.c   # Opt plz!
lila> ./a.out 0 1000000000
sum_range1: 2.8972e-01 secs
sum_range2: 2.7569e-01 secs

### Compiled with -O1: some opt
sum_range1:
        cmpl    %esi, %edi
        jge     .END
        movl    $0, %eax         # init REGISTER
.LOOP:
        addl    %edi, %eax       # REGISTER add in loop
        addl    $1, %edi
        cmpl    %edi, %esi
        jne     .LOOP
        movl    %eax, (%rdx)     # MEMORY write
.END:   ret

▶ Observe code differences between -Og and -O1
▶ Why is performance improved so much?
11
Dash-O! Compiler Optimizes for You
▶ gcc can perform many micro-optimizations, almost NEVER macro optimizations
▶ Series of -OX options authorize use of various micro-opts
▶ We will use -Og at times to disable many optimizations
▶ -Og: Optimize debugging: “…offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.”
▶ Individual optimizations can be enabled and disabled
▶ -O or -O1: Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
▶ -O2: Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
▶ -O3: Optimize yet more. -O3 turns on all optimizations specified by -O2 and also…
▶ -Ofast: Disregard strict standards compliance. (!)
12
Compiler Optimizations
gcc -O or gcc -O1 turns on the following optimization flags:
-fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments
-fcompare-elim -fcprop-registers -fdce -fdefer-pop -fdelayed-branch
-fdse -fforward-propagate -fguess-branch-probability -fif-conversion2
-fif-conversion -finline-functions-called-once -fipa-pure-const
-fipa-profile -fipa-reference -fmerge-constants -fmove-loop-invariants
-freorder-blocks -fshrink-wrap -fshrink-wrap-separate
-fsplit-wide-types -fssa-backprop -fssa-phiopt -ftree-bit-ccp
-ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-dce
-ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
-ftree-phiprop -ftree-sink -ftree-slsr -ftree-sra -ftree-pta
-ftree-ter -funit-at-a-time
▶ Some combination of these enables sum_range2() to fly as fast as sum_range1()
▶ We will look at some “by-hand” versions of these optimizations but let the compiler optimize for you whenever possible
13
Exercise: Allocation and Hidden Costs
Consider the following Java code
public class StringUtils{
  public static String repString(String str, int reps){
    String result = "";
    for(int i=0; i<reps; i++){
      result += str;   // hidden cost: allocates a new String each iteration
    }
    return result;
  }
}
Unrolling is Unpredictable
apollo> ./a.out 1000000000
sum_rangeA: 1.0698e+00 secs
sum_rangeB: 6.2750e-01 secs
sum_rangeC: 6.2746e-01 secs
phaedrus> ./a.out 1000000000
sum_rangeA: 2.8913e-01 secs
sum_rangeB: 5.3285e-01 secs
sum_rangeC: 2.6774e-01 secs
▶ Performance Gains vary from one compiler+processor to another
▶ All unrolling requires cleanup loops like those in the B/C versions: add on remaining elements
18
GCC Options to Unroll
▶ gcc has options to unroll loops during optimization
▶ Unrolling has unpredictable performance implications so
unrolling is not enabled for -O1, -O2, -O3
▶ Can manually enable it with compiler options like
-funroll-loops to check for performance bumps
apollo> gcc -Og unroll.c
## limited compiler opts
apollo> ./a.out 1000000000
sum_rangeA: 1.0698e+00 secs
sum_rangeB: 6.2750e-01 secs
sum_rangeC: 6.2746e-01 secs
apollo> gcc -Og -funroll-loops unroll.c
## loops unrolled by compiler
apollo> ./a.out 1000000000
sum_rangeA: 7.0386e-01 secs
sum_rangeB: 6.2802e-01 secs
sum_rangeC: 6.2797e-01 secs
apollo> gcc -O3 unroll.c
## Many opts, no unrolling
apollo> ./a.out 1000000000
sum_rangeA: 9.4124e-01 secs
sum_rangeB: 4.1833e-01 secs
sum_rangeC: 4.1832e-01 secs
apollo> gcc -Og -funroll-loops -fvariable-expansion-in-unroller unroll.c
apollo> ./a.out 1000000000 ## loops unrolled + multiple intermediates used
sum_rangeA: 5.2711e-01 secs
sum_rangeB: 6.2759e-01 secs
sum_rangeC: 6.2750e-01 secs
19
Conditional Code and Performance
Consider two examples of adding even numbers in a range

// CONDITION version
long sum_evensA(long start, long stop){
  long sum = 0;
  for(int i = start; i < stop; i++){
    if(i % 2 == 0){
      sum += i;
    }
  }
  return sum;
}

Timings for these two are shown below at two levels of optimization.

lila> gcc -Og condloop.c
lila> a.out 0 400000000
sum_evensA: 1.1969e+00 secs
sum_evensB: 2.8953e-01 secs   # 4x speedup

lila> gcc -O3 condloop.c
lila> a.out 0 400000000
sum_evensA: 2.3662e-01 secs
sum_evensB: 9.6242e-02 secs   # 2x speedup

Message is simple: eliminate conditionals whenever possible to improve performance
Function Inlining
val> gcc -Og $FILES
val> ./a.out 16000 8000
row_sums_FUNC: 2.8037e-01 secs
row_sums_MACRO: 9.2829e-02 secs
val> gcc -Og -finline-small-functions $FILES
val> ./a.out 16000 8000
row_sums_FUNC: 1.3620e-01 secs
row_sums_MACRO: 1.2969e-01 secs
val> gcc -O3 $FILES
val> ./a.out 16000 8000
row_sums_FUNC: 3.1132e-02 secs
row_sums_MACRO: 3.6975e-02 secs
▶ Inlining typically most effective for small functions (getters/setters)
▶ Can only be done if source code (C file) for function is available
▶ Like loop unrolling, function inlining has trade-offs
  ▶ Enables pipelining
  ▶ More predictable control
  ▶ More register pressure
  ▶ Increased code size
23
Profilers: gprof and Friends
▶ Profiler: a tool that monitors code execution to enable performance optimizations
▶ gprof is stock on Linux systems, interfaces with gcc
▶ Compile with profiling options: gcc -pg
▶ Run code to produce data file
▶ Examine with gprof
▶ Note: gcc versions 6 and 7 contain a bug requiring use of the -no-pie option; not a problem on apollo
# Compile
# -pg : instrument code for profiling
# -no-pie : bug fix for new-ish gcc’s
> gcc -pg -no-pie -g -Og -o unroll unroll.c
> ls
unroll unroll.c
> ./unroll 1000000000
sum_rangeA: 2.9401e-01 secs
sum_rangeB: 5.3164e-01 secs
sum_rangeC: 2.6574e-01 secs
# gmon.out now created with timing info
> ls
gmon.out unroll unroll.c
> file gmon.out
gmon.out: GNU prof performance data
> gprof -b unroll
… output on next slide …
24
gprof output for unroll
> gprof -b unroll
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
50.38 0.54 0.54 1 544.06 544.06 sum_rangeB
26.12 0.83 0.28 1 282.11 282.11 sum_rangeA
24.26 1.09 0.26 1 261.95 261.95 sum_rangeC
Call graph

index % time    self  children    called     name
[1]    100.0    0.00      1.09                main [1]
                0.54      0.00       1/1          sum_rangeB [2]
                0.28      0.00       1/1          sum_rangeA [3]
                0.26      0.00       1/1          sum_rangeC [4]
———————————————–
0.54 0.00 1/1 main [1]
[2] 50.0 0.54 0.00 1 sum_rangeB [2]
———————————————–
0.28 0.00 1/1 main [1]
[3] 25.9 0.28 0.00 1 sum_rangeA [3]
———————————————–
0.26 0.00 1/1 main [1]
[4] 24.1 0.26 0.00 1 sum_rangeC [4]
———————————————–
25
gprof Example: Dictionary Application
> ./dictionary < craft-67.txt
... Total time = 0.829561 seconds
> gprof -b dictionary
% cumulative self self total
time seconds seconds calls ms/call ms/call name
50.07 0.18 0.18 1 180.25 180.25 sort_words
19.47 0.25 0.07 463016 0.00 0.00 find_ele_rec
13.91 0.30 0.05 2862749 0.00 0.00 Strlen
8.34 0.33 0.03 463016 0.00 0.00 lower1
2.78 0.34 0.01 463017 0.00 0.00 get_token
2.78 0.35 0.01 463016 0.00 0.00 h_mod
2.78 0.36 0.01 20451 0.00 0.00 save_string
0.00 0.36 0.00 463017 0.00 0.00 get_word
0.00 0.36 0.00 463016 0.00 0.00 insert_string
0.00 0.36 0.00 20451 0.00 0.00 new_ele
 0.00      0.36     0.00        7     0.00     0.00  add_int_option
 0.00      0.36     0.00        1     0.00     0.00  add_string_option
 0.00      0.36     0.00        1     0.00     0.00  init_token
 0.00      0.36     0.00        1     0.00     0.00  new_table
 0.00      0.36     0.00        1     0.00     0.00  parse_options
 0.00      0.36     0.00        1     0.00     0.00  show_options
 0.00      0.36     0.00        1     0.00   360.50  word_freq
26
gprof Example Cont’d: Dictionary Application
> ./dictionary < craft-67.txt ## After upgrading sort_words() to qsort()
... Total time = 0.624172 seconds
> gprof -b dictionary
% cumulative self self total
time seconds seconds calls ms/call ms/call name
60.08 0.12 0.12 463016 0.00 0.00 find_ele_rec
15.02 0.15 0.03 2862749 0.00 0.00 Strlen
10.01 0.17 0.02 463016 0.00 0.00 lower1
5.01 0.18 0.01 463017 0.00 0.00 get_token
5.01 0.19 0.01 463016 0.00 0.00 h_mod
5.01 0.20 0.01 20451 0.00 0.00 save_string
0.00 0.20 0.00 463017 0.00 0.00 get_word
0.00 0.20 0.00 463016 0.00 0.00 insert_string
0.00 0.20 0.00 20451 0.00 0.00 new_ele
 0.00      0.20     0.00        8     0.00     0.00  match_length
 0.00      0.20     0.00        7     0.00     0.00  add_int_option
 0.00      0.20     0.00        1     0.00     0.00  add_string_option
 0.00      0.20     0.00        1     0.00     0.00  find_option
 0.00      0.20     0.00        1     0.00     0.00  init_token
 0.00      0.20     0.00        1     0.00     0.00  new_table
 0.00      0.20     0.00        1     0.00     0.00  parse_options
 0.00      0.20     0.00        1     0.00     0.00  show_options
 0.00      0.20     0.00        1     0.00     0.00  sort_words   ** was 0.18 **
 0.00      0.20     0.00        1     0.00   200.28  word_freq
27
Exercise: Quick Review
1. What’s the first thing to consider when optimization seems necessary?
2. What kinds of optimizations would have the biggest impact on performance?
3. What is the smartest way to “implement” micro-optimizations, to get their benefit with minimal effort?
28
Answers: Quick review
1. What’s the first thing to consider when optimization seems necessary?
A: Is optimization really necessary? Or is there something else that would be more worth the effort (e.g. fixing bugs, adding features, improving documentation, etc.)
2. What kinds of optimizations would have the biggest impact on performance?
A: From most to least important
▶ Algorithms and Data Structure Selection
▶ Elimination of unneeded work/hidden costs
▶ Memory Utilization
▶ Micro-optimizations (today’s lecture)
3. What is the smartest way to "implement" micro-optimizations, to get their benefit with minimal effort?
A: Use the compiler to mechanically perform code transforms to achieve micro-optimizations. Using -O2 will produce faster-running code because the compiler is transforming generated assembly instructions from C sources.
29