
Basic Serial Optimization
CMPSC 450

Profiling … “but how do you know?”
• Gain insight into code hotspots
• Variety of methods:
• visual observation
• get_wall_time & printf (see the timing sketch below)
• gprof
• valgrind/callgrind & kcachegrind
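For the simplest method above (get_wall_time & printf), a minimal sketch follows. get_wall_time is not a standard library function, so this is just one possible implementation (using POSIX gettimeofday); suspected_hotspot is a hypothetical function under test.

#include <cstdio>
#include <sys/time.h>

// One possible wall-clock helper matching the bullet above (POSIX only).
double get_wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

// Usage: bracket the suspected hotspot and print the elapsed time.
//   double t0 = get_wall_time();
//   suspected_hotspot();
//   printf("elapsed: %f s\n", get_wall_time() - t0);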
CMPSC 450

gprof
• One of the most popular profiling tools available.
• Difficult to use/understand!
• Compile the application with profiling instrumentation enabled (-pg):
• gcc -Wall -pg test_gprof.c test_gprof_new.c -o test_gprof
• Run the instrumented program once; on exit it writes gmon.out.
• Run gprof on the output:
• gprof test_gprof gmon.out > analysis.txt
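As an end-to-end illustration of the commands above, here is a small made-up program (the file and function names are invented for this sketch); the deliberately heavy loop should dominate the resulting flat profile.

/* toy_profile.cpp - made-up example for the gprof workflow above */
#include <cstdio>
#include <cmath>

// deliberately expensive: should show up as the hotspot
double heavy(int n)
{
    double s = 0.0;
    for (int i = 1; i < n; i++)
        s += std::sqrt((double)i);
    return s;
}

// cheap helper for comparison
double light(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (double)i;
    return s;
}

int main()
{
    double total = 0.0;
    for (int rep = 0; rep < 200; rep++)
        total += heavy(1000000) + light(1000);
    std::printf("%f\n", total);
    return 0;
}

• g++ -Wall -pg toy_profile.cpp -o toy_profile
• ./toy_profile (writes gmon.out in the current directory)
• gprof toy_profile gmon.out > analysis.txt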
CMPSC 450

gprof
CMPSC 450

gprof
CMPSC 450

Kcachegrind – valgrind
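The content of this slide is a KCachegrind screenshot that does not survive as text. For reference, a typical callgrind / KCachegrind workflow looks roughly like this (the binary name is illustrative; callgrind appends the process id to its output file):
• valgrind --tool=callgrind ./test_gprof
• kcachegrind callgrind.out.<pid>
• Adding --cache-sim=yes to the valgrind command also simulates cache behavior.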
CMPSC 450

Valgrind – better visualization
CMPSC 450

Common sense optimizations
• Do less work
bool b_flag = false;
for (int i = 0; i < N; i++) {
    if (complex_function(A[i]) < THRESHOLD) {
        b_flag = true;
    }
}
CMPSC 450

Common sense optimizations
• Do less work
bool b_flag = false;
for (int i = 0; i < N; i++) {
    if (complex_function(A[i]) < THRESHOLD) {
        b_flag = true;
    }
}

bool b_flag = false;
for (int i = 0; i < N; i++) {
    if (complex_function(A[i]) < THRESHOLD) {
        b_flag = true;
        break;                  // stop scanning once the flag is set
    }
}
CMPSC 450

Avoid expensive operations
• Use a look-up table
• Avoid DIV/SQRT/SIN/COS/TAN
A = A + pow(B, 2.0);            // library call
A = A + B * B;                  // cheaper: a single multiply
// these values are all from -1 to +1
int i_L, i_R, i_U, i_O, i_S, i_N;
double d_edelz, d_tt, d_BF;
d_edelz = i_L + i_R + i_U + i_O + i_S + i_N;
d_BF = 0.5 * (1.0 + tanh(d_edelz / d_tt));
CMPSC 450

Avoid expensive operations
• Use a look-up table
• In this case, the input is fixed between -6 and +6.
// these values are all from -1 to +1
int i_L, i_R, i_U, i_O, i_S, i_N;
double d_edelz, d_tt, d_BF;
d_edelz = i_L + i_R + i_U + i_O + i_S + i_N;
d_BF = 0.5 * (1.0 + tanh(d_edelz / d_tt));

int i_L, i_R, i_U, i_O, i_S, i_N;
double d_tt, d_BF, d_tanhTable[13];
for (int i = 0; i < 13; i++)
    d_tanhTable[i] = 0.5 * (1.0 + tanh((double)(i - 6) / d_tt));
// offset the sum by +6 so the index runs 0..12
d_BF = d_tanhTable[6 + i_L + i_R + i_U + i_O + i_S + i_N];
CMPSC 450

Elimination of common subexpressions
• Does the work have to exist in a loop?
• The compiler SHOULD handle this, but sometimes doesn’t.
• This is also a candidate for a SIMD call.
for (int i = 0; i < N; i++)
    A[i] = A[i] + s + r * sin(x);

double d_tmp = s + r * sin(x);
for (int i = 0; i < N; i++)
    A[i] = A[i] + d_tmp;

double d_tmp = s + r * sin(x);
SIMD_vector_scalar_add(&A[0], d_tmp, N);
CMPSC 450

Avoid conditional branches
double sign;
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        if (i > j) sign = 1.0;
        else if (i < j) sign = -1.0;
        else sign = 0.0;
        C[j] = C[j] + sign * A[i][j] * B[i];
    }
}
• This can be done simpler...
CMPSC 450

Avoid branches
• Remove conditionals
• Less branching
• Specifically, remove the extra double multiply
for (int j = 0; j < N; j++) {
    for (int i = j + 1; i < N; i++)     // the i > j terms, sign = +1
        C[j] = C[j] + A[i][j] * B[i];
}
for (int j = 0; j < N; j++) {
    for (int i = 0; i < j; i++)         // the i < j terms, sign = -1
        C[j] = C[j] - A[i][j] * B[i];
}
CMPSC 450

Pay attention to memory access
• Pay attention to stride
• Some library calls assume contiguous data (stride = 1).
• Two-dimensional arrays in C/C++ are stored in row-major format.
• (A[0][4] and A[0][5] are adjacent in memory)
• Watch for dependencies!
for (int i = 1; i < N; i++)
    A[i] = A[i-1] * B;                  // loop-carried dependency
// note: the triangular loops below walk A[i][j] down a column, i.e. stride N in row-major storage
for (int j = 0; j < N; j++) {
    for (int i = j + 1; i < N; i++)
        C[j] = C[j] + A[i][j] * B[i];
}
for (int j = 0; j < N; j++) {
    for (int i = 0; i < j; i++)
        C[j] = C[j] - A[i][j] * B[i];
}
CMPSC 450

Use SIMD if you can!
• Compilers will not always catch opportunities to utilize SIMD operations.
• Intel compilers will recognize #pragma SIMD hints to encourage vectorization.
• Other math libraries have built-in optimized routines (volk).
void myMult(float *C, float *B, float *A, int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] * B[i];
    }
}

void myMult(float *C, float *B, float *A, int N) {
    VOLK_SIMD_F32_MULT(C, A, B, N);
}
CMPSC 450

Compilers are great! ... until they aren’t...
• Pay attention to compiler flags
• -OX {X = 0, 1, 2, 3} (common optimization settings; the actual optimization effort is not universal, i.e. one compiler may loop unroll at O1 while another unrolls at O2)
• -msse, -mavx (the more information you give the compiler, the better)
• Pay attention to aliasing
• memcpy vs memmove
• Drastic compiler optimization may decide your code is not necessary and remove it completely!
• Especially for situations where the final variable value is not obviously used afterwards (one way to guard against this is sketched after the construction example below).
CMPSC 450

C++ data construction
• C style construction
• Vector allocated each time, even if not used
• Lazy construction
• Vector only allocated when needed.
void some_func(double threshold, int length)
{
    std::vector<double> v(length);      // allocated on every call, even if unused
    if (rand() > threshold * RAND_MAX)
    {
        v = obtain_data(length);
        std::sort(v.begin(), v.end());
        process_data(v);
    }
}

void some_func(double threshold, int length)
{
    if (rand() > threshold * RAND_MAX)
    {
        std::vector<double> v(obtain_data(length));   // allocated only when needed
        std::sort(v.begin(), v.end());
        process_data(v);
    }
}
CMPSC 450
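As flagged on the "Compilers are great!" slide above, an aggressive optimizer can delete a computation whose result is never visibly used, which silently breaks timing experiments. This is a minimal sketch of one common guard, not from the slides; benchmark_kernel and sink are made-up names.

// sketch: keep a benchmarked loop alive under optimization
volatile double sink;                  // a volatile store cannot be optimized away

void benchmark_kernel(const double *A, int N)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += A[i] * A[i];
    sink = s;                          // the result is "used", so the loop must be kept
}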

C++ data construction
• Static construction
• Prevents time spent allocating and de-allocating memory
• Static variables may not work so well in parallel environments (one alternative is sketched after the code below).
CMPSC 450
const int length=1000;
void some_func(double threshold, int length) {
static std::vector<double> v(length);   // allocated once, reused across calls
if (rand() > threshold * RAND_MAX)
{
    v = obtain_data(length);
    std::sort(v.begin(), v.end());
    process_data(v);
}
}
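The parallel-environment caveat above is about every thread sharing one static buffer. As a hedged sketch (not from the slides, assuming C++11 and the same hypothetical obtain_data / process_data helpers), thread_local keeps the reuse benefit while giving each thread its own vector:

#include <vector>
#include <algorithm>
#include <cstdlib>

std::vector<double> obtain_data(int length);            // assumed helper from the slides
void process_data(const std::vector<double>& v);        // assumed helper from the slides

void some_func(double threshold, int length)
{
    // each thread gets its own persistent buffer, so concurrent calls
    // do not race on a single shared static vector
    thread_local std::vector<double> v;
    if (std::rand() > threshold * RAND_MAX)
    {
        v = obtain_data(length);
        std::sort(v.begin(), v.end());
        process_data(v);
    }
}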

Be cautious!
• Test optimizations!
• Use #define to toggle between
original and optimized code.
#define USE_SIMD
#ifndef USE_SIMD
for (int i = 0; i < N; i++) A[i] = B[i] * C[i];
#else
SIMD_vector_mult(A, B, C, N);
#endif
Optimizations are useless if original functionality is not sufficiently replicated.
CMPSC 450

Pay attention to simple data types
• Integer math is fast
• Float (32-bit) and double-precision (64-bit) floating-point math have virtually the same performance*
• Float is half the size – better for caching
• SIMD calls will be specifically float or double precision
• Converting between integer and floating point is expensive!
CMPSC 450

Additional Resources
• http://agner.org/optimize/optimizing_cpp.pdf
CMPSC 450