Basic Serial Optimization
CMPSC 450
Profiling
• ... "but how do you know?"
• Gain insight into code hotspots
• Variety of methods:
  • visual observation
  • get_wall_time & printf (see the timing sketch below)
  • gprof
  • valgrind/callgrind & kcachegrind
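A minimal sketch of the get_wall_time & printf approach. The get_wall_time() helper shown here is an assumed implementation based on the POSIX clock_gettime call, not taken from the course code:

#include <stdio.h>
#include <time.h>

// assumed helper: wall-clock time in seconds (POSIX clock_gettime, monotonic clock)
static double get_wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = get_wall_time();

    // region of interest: the code you suspect is a hotspot
    double s = 0.0;
    for (int i = 0; i < 10000000; i++)
        s += i * 0.5;

    double t1 = get_wall_time();
    // printing s also keeps the loop from being optimized away
    printf("loop took %f s (s = %f)\n", t1 - t0, s);
    return 0;
}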
gprof
• One of the most popular profiling tools available.
• Can be difficult to use/understand!
• Compile the application with profiling instrumentation enabled (-pg):
  • gcc -Wall -pg test_gprof.c test_gprof_new.c -o test_gprof
• Run the instrumented executable; this writes gmon.out to the working directory:
  • ./test_gprof
• Run gprof on that output:
  • gprof test_gprof gmon.out > analysis.txt
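For context, a minimal sketch of the kind of program one might profile this way. This is not the actual test_gprof.c from the command above; the function names are made up for illustration:

#include <stdio.h>

// deliberately expensive function, so it dominates the flat profile in analysis.txt
static double slow_sum(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += (double)i * j;
    return s;
}

// cheaper function for comparison
static double fast_sum(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}

int main(void)
{
    printf("%f %f\n", slow_sum(3000), fast_sum(3000));
    return 0;
}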
gprof – example output (screenshots omitted)
Kcachegrind – better visualization of valgrind/callgrind output (screenshots omitted)
Common sense optimizations
• Do less work
bool b_flag = false;
for (int i = 0; i < N; i++) {
    if (complex_function(A[i]) < THRESHOLD) {
        b_flag = true;
    }
}
Common sense optimizations
• Do less work
Original – complex_function() keeps being called even after the flag is already set:

bool b_flag = false;
for (int i = 0; i < N; i++) {
    if (complex_function(A[i]) < THRESHOLD) {
        b_flag = true;
    }
}

Optimized – leave the loop as soon as the answer is known:

bool b_flag = false;
for (int i = 0; i < N; i++) {
    if (complex_function(A[i]) < THRESHOLD) {
        b_flag = true;
        break;
    }
}
Avoid expensive operations
• Use a lookup table
• Avoid DIV/SQRT/SIN/COS/TAN

A = A + pow(B, 2.0);   // expensive library call
A = A + B * B;         // same result with a single multiply
// these values are all from -1 to +1
int i_L, i_R, i_U, i_O, i_S, i_N;
double d_edelz, d_tt, d_BF;

d_edelz = i_L + i_R + i_U + i_O + i_S + i_N;
d_BF = 0.5 * (1.0 + tanh(d_edelz / d_tt));
Avoid expensive operations
• Use a lookup table
• In this case, the input is fixed between -6 and +6.
Original:

// these values are all from -1 to +1
int i_L, i_R, i_U, i_O, i_S, i_N;
double d_edelz, d_tt, d_BF;

d_edelz = i_L + i_R + i_U + i_O + i_S + i_N;
d_BF = 0.5 * (1.0 + tanh(d_edelz / d_tt));

With a lookup table over the 13 possible integer sums:

int i_L, i_R, i_U, i_O, i_S, i_N;
double d_tt, d_BF, d_tanhTable[13];

// fill the table once: entry i holds the result for a sum of (i - 6)
for (int i = 0; i < 13; i++)
    d_tanhTable[i] = 0.5 * (1.0 + tanh((double)(i - 6) / d_tt));

// shift the sum from [-6, +6] into the index range [0, 12]
d_BF = d_tanhTable[i_L + i_R + i_U + i_O + i_S + i_N + 6];
Elimination of common subexpressions
• Does the work have to be inside the loop?
• The compiler SHOULD handle this, but sometimes doesn’t.
• This is also a candidate for a SIMD call.
Original:

for (int i = 0; i < N; i++)
    A[i] = A[i] + s + r * sin(x);

With the loop-invariant expression hoisted out:

double d_tmp = s + r * sin(x);
for (int i = 0; i < N; i++)
    A[i] = A[i] + d_tmp;

Or handed off to a SIMD routine:

double d_tmp = s + r * sin(x);
SIMD_vector_scalar_add(&A[0], d_tmp, N);
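SIMD_vector_scalar_add() is only a placeholder name on the slide. As an illustration of what such a routine might look like, here is a minimal sketch using SSE2 intrinsics (assumes an x86 target; compile with -msse2 or any option that implies SSE2):

#include <emmintrin.h>   // SSE2 intrinsics

// add the scalar s to every element of A[0..N-1], two doubles at a time
static void SIMD_vector_scalar_add(double *A, double s, int N)
{
    __m128d vs = _mm_set1_pd(s);                  // broadcast s into both lanes
    int i = 0;
    for (; i + 2 <= N; i += 2) {
        __m128d va = _mm_loadu_pd(&A[i]);         // unaligned load of A[i], A[i+1]
        _mm_storeu_pd(&A[i], _mm_add_pd(va, vs)); // store A[i]+s, A[i+1]+s
    }
    for (; i < N; i++)                            // scalar remainder
        A[i] += s;
}

A vectorizing compiler at -O2/-O3 will often generate equivalent code from the plain loop on its own, which is why hoisting the invariant expression is the more important step.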
Avoid conditional branches
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        if (i > j)       sign = 1.0;
        else if (i < j)  sign = -1.0;
        else             sign = 0.0;
        C(j) = C(j) + sign * A[i][j] * B[i];
    }
}

• This can be done more simply...
Avoid branches
• Remove the conditionals
• Less branching
• Also removes the extra double multiply by sign
for (int j = 0; j < N; j++) {
    for (int i = j + 1; i < N; i++)   // the i > j case: sign was +1
        C(j) = C(j) + A[i][j] * B[i];
}

for (int j = 0; j < N; j++) {
    for (int i = 0; i < j; i++)       // the i < j case: sign was -1
        C(j) = C(j) - A[i][j] * B[i];
}
Pay attention to memory access
• Pay attention to stride (see the loop-order sketch at the end of this slide)
• Some library calls assume contiguous data (stride = 1).
• Two-dimensional arrays in C/C++ are stored in row-major format.
  • (A[0][4] and A[0][5] are adjacent in memory)
• Watch for dependencies!
// loop-carried dependency: each iteration needs the result of the previous one
for (int i = 1; i < N; i++)
    A[i] = A[i-1] * B;
// in the earlier example the inner loops read A[i][j] with i varying,
// i.e. down a column – a stride-N access pattern in row-major storage
for (int j = 0; j < N; j++) {
    for (int i = j + 1; i < N; i++)
        C(j) = C(j) + A[i][j] * B[i];
}

for (int j = 0; j < N; j++) {
    for (int i = 0; i < j; i++)
        C(j) = C(j) - A[i][j] * B[i];
}
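A minimal sketch of the stride point above (the array and variable names here are invented for illustration): both loops sum the same row-major matrix, but the first walks memory contiguously (stride 1) while the second jumps a whole row per access (stride N), which typically runs much slower once the matrix no longer fits in cache.

#include <cstdio>
#include <vector>

int main()
{
    const int N = 2000;
    std::vector<double> A(N * N, 1.0);   // row-major: element (i, j) lives at A[i*N + j]

    double s1 = 0.0, s2 = 0.0;

    // row-major friendly: the inner index j moves through adjacent elements (stride 1)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s1 += A[i * N + j];

    // column order: the inner index i jumps N elements per step (stride N) – far more cache misses
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s2 += A[i * N + j];

    printf("%f %f\n", s1, s2);
    return 0;
}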
Use SIMD if you can!
• Compilers will not always catch opportunities to utilize SIMD operations.
• Intel compilers recognize #pragma simd hints to encourage vectorization.
• Other math libraries have built-in optimized routines (e.g. VOLK).
Scalar loop:

void myMult(float *C, float *B, float *A, int N)
{
    for (int i = 0; i < N; i++)
    {
        C[i] = A[i] * B[i];
    }
}

Using an optimized library routine instead:

void myMult(float *C, float *B, float *A, int N)
{
    VOLK_SIMD_F32_MULT(C, A, B, N);
}
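VOLK_SIMD_F32_MULT() above appears to be a placeholder. For reference, the real VOLK kernel for an element-wise float multiply is, to the best of my knowledge, volk_32f_x2_multiply_32f; a rough usage sketch, assuming VOLK is installed and the program is linked with -lvolk:

#include <volk/volk.h>

// element-wise C[i] = A[i] * B[i] using VOLK's runtime-dispatched SIMD kernel
static void myMult_volk(float *C, const float *A, const float *B, unsigned int N)
{
    volk_32f_x2_multiply_32f(C, A, B, N);
}

VOLK selects the best available implementation (SSE, AVX, NEON, or a generic fallback) at run time, so the same call works across machines.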
Compilers are great! ... until they aren’t...
• Pay attention to compiler flags
  • -OX {X = 0, 1, 2, 3}
  • -msse, -mavx
• Pay attention to aliasing
  • memcpy vs memmove
• Common optimization settings are not universal: one compiler may unroll loops at -O1 while another waits until -O2.
• The more information you give the compiler, the better.
• Drastic compiler optimization may decide your code is not necessary and remove it completely!
  • Especially in situations where the final variable value is not obviously used afterwards (see the sketch below).
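A minimal sketch of that dead-code pitfall when timing a kernel (the function names are invented for illustration): at -O2/-O3 the compiler is free to delete the whole loop in timed_work_bad(), because nothing observable depends on acc, so a benchmark wrapped around it would measure nothing.

#include <cstdio>

double timed_work_bad(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += i * 0.5;
    return 0.0;               // acc is never used: the loop above may be removed entirely
}

double timed_work_good(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += i * 0.5;
    return acc;               // the result escapes the function, so the work must be kept
}

int main()
{
    // printing the results makes the values observable, which keeps the optimizer honest
    printf("%f %f\n", timed_work_bad(1000000), timed_work_good(1000000));
    return 0;
}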
C++ data construction
• C-style construction
  • Vector allocated every time the function is called, even if not used
• Lazy construction
  • Vector only allocated when needed
C-style construction – the vector is constructed on every call, even when the branch is not taken:

void some_func(double threshold, int length)
{
    std::vector<double> v;
    if (rand() > threshold * RAND_MAX)
    {
        v = obtain_data(length);
        std::sort(v.begin(), v.end());
        process_data(v);
    }
}

Lazy construction – the vector is only constructed when it is actually needed:

void some_func(double threshold, int length)
{
    if (rand() > threshold * RAND_MAX)
    {
        std::vector<double> v = obtain_data(length);
        std::sort(v.begin(), v.end());
        process_data(v);
    }
}
C++ data construction
• Static construction
• Avoids spending time allocating and de-allocating memory on every call
• Static variables may not work so well in parallel environments.
const int length = 1000;

void some_func(double threshold, int length)
{
    static std::vector<double> v(length);   // allocated once, on the first call
    if (rand() > threshold * RAND_MAX)
    {
        v = obtain_data(length);
        std::sort(v.begin(), v.end());
        process_data(v);
    }
}
Be cautious!
• Test optimizations!
• Use #define to toggle between the original and optimized code.
#define USE_SIMD
#ifndef USE_SIMD
for (int i = 0; i < N; i++)
A[i] = B[i] * C[i];
#else
SIMD_vector_mult(A, B, C, N);
#endif
Optimizations are useless if original functionality is not sufficiently replicated.
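A minimal sketch of that kind of check (SIMD_vector_mult() is the slide's placeholder, so a plain loop stands in for it here, and the tolerance is an arbitrary choice): run both paths on the same input and compare element by element before trusting the optimized version.

#include <cmath>
#include <cstdio>

int main()
{
    const int N = 1024;
    float A_ref[N], A_opt[N], B[N], C[N];

    for (int i = 0; i < N; i++) { B[i] = 0.5f * i; C[i] = 2.0f - i; }

    // reference (original) path
    for (int i = 0; i < N; i++)
        A_ref[i] = B[i] * C[i];

    // optimized path – stand-in for SIMD_vector_mult(A_opt, B, C, N)
    for (int i = 0; i < N; i++)
        A_opt[i] = B[i] * C[i];

    // compare the two results within a small relative tolerance
    int mismatches = 0;
    for (int i = 0; i < N; i++)
        if (std::fabs(A_ref[i] - A_opt[i]) > 1e-6f * std::fabs(A_ref[i]))
            mismatches++;

    printf("mismatches: %d\n", mismatches);
    return 0;
}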
Pay attention to simple data types
• Integer math is fast
• Single precision (32-bit float) and double precision (64-bit) floating-point math have virtually the same performance*
• Float is half the size – better for caching
• SIMD calls are specific to either single or double precision
• Converting between integer and floating point is expensive! (see the sketch below)
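A minimal sketch of keeping conversions out of a hot loop (the variable names are invented for illustration): in the first loop the integer product is converted to double on every iteration; in the second the loop-varying value stays in floating point, so no conversion happens inside the loop. Both compute the same sum up to floating-point rounding.

#include <cstdio>

int main()
{
    const int N = 1000000;
    int scale = 3;
    double sum1 = 0.0, sum2 = 0.0;

    // mixed arithmetic: the integer product i*scale is converted to double every iteration
    for (int i = 0; i < N; i++)
        sum1 += i * scale * 0.001;

    // the conversion is done once, and the running value stays in floating point
    const double step = scale * 0.001;
    double x = 0.0;
    for (int i = 0; i < N; i++, x += 1.0)
        sum2 += x * step;

    printf("%f %f\n", sum1, sum2);
    return 0;
}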
Additional Resources
• http://agner.org/optimize/optimizing_cpp.pdf