代写 C++ C openmp parallel computer architecture cuda Recitation 4: OpenMP Programming

Recitation 4: OpenMP Programming
15-418 Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2019
CMU 15-418/15-618, Spring 2019

Goals for today ▪Learn to use Open MP
1.
2. 3.
Sparse matrix-vector code
▪ Understand “CSR” sparse matrix format ▪ Simplest OpenMP features
Parallelize array sum via OpenMP reductions Parallelize radix sort
▪ More complex OpenMP example ▪Most of all,
ANSWER YOUR QUESTIONS!
CMU 15-418/15-618, Spring 2019

Matrix-matrix multiplication
▪Recall from recitation 2… ෍
𝑘
(𝑖,𝑗)
C
(𝑖, 𝑘) →
A
B
CMU 15-418/15-618, Spring 2019

Matrix-matrix multiplication (matmul)
▪Simple C++ implementation: /* Find element based on row-major ordering */
#define RM(r, c, width) ((r) * (width) + (c))
// Standard multiplication
void multMatrixSimple(int N, float *matA, float *matB, float *matC) {
for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) { } } float sum = 0.0; for (int k = 0; k < N; k++) sum += matA[RM(i,k,N)] * matB[RM(k,j,N)]; matC[RM(i,j,N)] = sum; CMU 15-418/15-618, Spring 2019 Today: Matrix-vector multiplication (𝑖,𝑗)෍ 𝑘 C (𝑖, 𝑘) → A ▪𝑛×𝑛 × 𝑛×1 ⇒(𝑛×1)outputvector ▪Output = dot-products of rows from A and the vector B B CMU 15-418/15-618, Spring 2019 (𝑘, 𝑗) → Matrix-vector multiplication ▪ Simple C++ implementation (discards 𝑗 loop from matmul): /* Find element based on row-major ordering */ #define RM(r, c, width) ((r) * (width) + (c)) void matrixVectorProduct(int N, float *matA, float *vecB, float *vecC) { for (int i = 0; i < N; i++) float sum = 0.0; for (int k = 0; k < N; k++) sum += matA[RM(i,k,N)] * vecB[k]; vecC[i] = sum; } } CMU 15-418/15-618, Spring 2019 Matrix-vector multiplication ▪ Our code is slightly refactored: float rvp_dense_seq(dense_t *m, vec_t *x, index_t r) { index_t nrow = m->nrow;
index_t idx = r*nrow;
float val = 0.0;
for (index_t c = 0; c < nrow; c++) val += x->value[c] * m->value[idx++];
return val; }
Row dot product (the inner loop over 𝑘 in original code)
void mvp_dense_seq(dense_t *m, vec_t *x, vec_t *y, rvp_dense_t rp_fun) { index_t nrow = m->nrow;
for (index_t r = 0; r < nrow; r++) { y->value[r] = rp_fun(m, x, r); }
}
The inner loop over rows (over 𝑖 in original code)
CMU 15-418/15-618, Spring 2019

Benchmarking dense mat-vec
$ ./mrun -l 3 -d 10 -t r –D Dense 3 10 r
MVP RVP GF seq seq 0.12
▪0.12 GFLops … machine capable of 6.4 Gflops ➔This is bad performance
▪Why? We are only counting non-zero entries of matrix
CMU 15-418/15-618, Spring 2019

Sparse matrix-vector multiplication ▪What if A is mostly zeroes? (This is common)
(𝑖,𝑗)෍
𝑘
C
▪Idea: We should only compute on non-zeros in A ▪➔ Need new sparse matrix representation
(𝑖, 𝑘) →
A
B
CMU 15-418/15-618, Spring 2019
(𝑘, 𝑗) →

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 6 2 4 1 2 9 3 1 (Compact non-zeroes into dense format)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 1 (Position corresponding to each value)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 1 5 (Position corresponding to each value)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 1 5 0 (Position corresponding to each value)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 1 5 0 3 (Position corresponding to each value)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 1 5 0 3 7 4 1 6 (Position corresponding to each value)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 15037416
Offsets: (Where each row starts)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931
Indices: 15037416
Offsets: 0 (Where each row starts)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931 Indices: 15037416 Offsets: 0 2
(Where each row starts)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931 Indices: 15037416 Offsets: 0 2 5
(Where each row starts)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931 Indices: 15037416 Offsets: 0 2 5 6
(Where each row starts)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 62412931 Indices: 15037416 Offsets: 0 2 5 6 8
(Where each row starts)
CMU 15-418/15-618, Spring 2019

Compressed sparse-row (CSR) matrix format
▪Dense matrix: ▪CSR matrix:
Row
6
2
4
1
2
9
3
1
Values: 6 2 4 1 2 9 3 1 Indices: 1 5 0 3 7 4 1 6 Offsets: 0 2 5 6 8
(Compact non-zeroes into dense format)
(Position corresponding to each value)
(Where each row starts)
CMU 15-418/15-618, Spring 2019

Sparse matrix-vector multiplication (spmv)
float rvp_csr_seq(csr_t *m, vec_t *x, index_t r) {
index_t idxmin = m->rowstart[r]; /* 1. get indices from offsets */ index_t idxmax = m->rowstart[r+1];
float val = 0.0;
for (index_t idx = idxmin; idx < idxmax; idx++) { /* 2. iterate over row */ index_t c = m->cindex[idx]; data_t mval = m->value[idx]; data_t xval = x->value[c]; val += mval * xval;
}
return val; }
/* 3. find corresponding index in vector */ /* read matrix (using idx) */
/* read vector (using c) */
/* the outer loop (across rows) doesn’t change */
void mvp_csr_seq(csr_t *m, vec_t *x, vec_t *y, rvp_csr_t rp_fun) { index_t nrow = m->nrow;
for (index_t r = 0; r < nrow; r++) { y->value[r] = rp_fun(m, x, r);
}
}
CMU 15-418/15-618, Spring 2019

Benchmarking dense mat-vec
$ ./mrun -l 3 -d 10 -t r –D Dense 3 10 r
MVP RVP GF seq seq 0.12
CMU 15-418/15-618, Spring 2019

Benchmarking spmv
$ ./mrun -l 3 -d 10 -t r Sparse 3 10 r
MVP RVP GF seq seq 1.15
▪This matrix is 10% dense (-d parameter) ▪CSR gives 9.6× speedup!
CMU 15-418/15-618, Spring 2019

Let’s optimize this code!
Exploit parallelism at multiple levels
▪ILP ▪SIMD ▪Threads
To properly measure parallel speedup, need to start from an optimized baseline
CMU 15-418/15-618, Spring 2019

What’s the ILP of spmv?
▪GHC machines: 2x LD @ 1 cycle & 2x FMA @ 3 cycle
▪Q: How many operations per iteration?
▪A: 3 loads, 1 FMA
▪Iteration takes at least: 3 cycles (latency bound)
▪Expected: 3.2 𝐺𝐻𝑧 = 1.07 𝐺𝐹𝑙𝑜𝑝𝑠 3 𝑐𝑦𝑐𝑙𝑒𝑠 𝑝𝑒𝑟 𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛
▪ Pretty close to what we observe
▪Machine utilization low: FMA ≈ 17%, LD ≈ 50%
CMU 15-418/15-618, Spring 2019

Improving spmv ILP
#define UNROLL 3
float rvp_csr_par(csr_t *m, vec_t *x, index_t r) {
index_t idxmin = m->rowstart[r], idxmax = m->rowstart[r+1];
index_t idx;
float uval[UNROLL]; float val = 0.0; for (index_t i = 0;
Accumulate elements mod UNROLL in uval array
i < UNROLL; i++) uval[i] = 0.0; for (idx = idxmin; idx+UNROLL <= idxmax; idx+=UNROLL) { for (index_t i = 0; i < UNROLL; i++) { } } index_t c = data_t mval data_t xval uval[i] += mval * xval; for (; idx < idxmax; idx ++) { index_t c = m->cindex[idx]; data_t mval = m->value[idx]; data_t xval = x->value[c]; val += mval * xval;
}
for (index_t i = 0; i < UNROLL; i++) val += uval[i]; Return sum of uval array m->cindex[idx+i];
= m->value[idx+i];
= x->value[c];
return val; }
CMU 15-418/15-618, Spring 2019
Note: gcc with –O3 will unroll these loops for us

Benchmarking spmv
$ ./mrun -l 4 -d 10 -t r Sparse 4 10 r
MVP RVP GF seq seq 1.40
CMU 15-418/15-618, Spring 2019

Benchmarking unrolled spmv
$ ./mrun –l 4 –d 10 -t r Sparse 4 10 r
MVP RVP GF seq seq 1.40 seq par 2.72
▪3× unrolling gives ≈ 2 × speedup
▪More unrolling doesn’t help (3 slightly better than 2)
▪Why? With 100% hits, LD utilization = 100% when unrolled twice
Unrolling thrice helps a bit, probably by hiding misses on
sparse vector loads
CMU 15-418/15-618, Spring 2019

Let’s optimize this code!
Exploit parallelism at multiple levels
▪ILP ▪SIMD ▪Threads
CMU 15-418/15-618, Spring 2019

Thread parallelism with OpenMP ▪OpenMP is supported by gcc
▪ “Decorate” your code with #pragmas
▪We will cover only a few of OpenMP’s features
CMU 15-418/15-618, Spring 2019

Parallelizing spmv with OpenMP ▪Focus on the outer loop
void mvp_csr_seq(csr_t *m, vec_t *x, vec_t *y,
rvp_csr_t rp_fun) {
index_t nrow = m->nrow;
for (index_t r = 0; r < nrow; r++) { y->value[r] = rp_fun(m, x, r); }
}
CMU 15-418/15-618, Spring 2019

Parallelizing spmv with OpenMP ▪Focus on the outer loop
void mvp_csr_mps(csr_t *m, vec_t *x, vec_t *y,
rvp_csr_t rp_fun) {
index_t nrow = m->nrow;
#pragma omp parallel for schedule(static)
for (index_t r = 0; r < nrow; r++) { y->value[r] = rp_fun(m, x, r);
} }
▪ schedule(static) divides loop statically across system cores (including hyperthreads)
CMU 15-418/15-618, Spring 2019

Benchmarking threaded spmv
$ ./mrun –l 4 –d 10 -t r Sparse 4 10 r
MVP RVP GF seq seq 1.40 seq par 2.72

mps seq 9.92
mps par 10.44
▪Threading gives 7× for naive, 4× for unrolled ▪We have 8×2-hyperthread = 16 cores
CMU 15-418/15-618, Spring 2019

Analyzing poor threading performance
▪Sources of poor parallel speedup:
▪ Communication/synchronization?
No, static work division is ≈zero overhead
▪Throughput limitations?
Not execution units… ‘mps/seq’ has poor FMA utilization, and ‘mps/par’ only gets 4× speedup from 8 cores
▪Load imbalance?
Maybe…how many elements are there per row? A: its random
CMU 15-418/15-618, Spring 2019

Benchmarking threaded spmv
$ ./mrun –l 4 –d 10 -t s Sparse 4 10 s
MVP RVP GF seq seq 1.40 seq par 3.30

mps seq 2.10
mps par 4.24
▪Increasing skew (all non-zero elements are in first 10% of rows) reduces speedup significantly
▪Single-thread is faster (why?), OpenMP is slower (why?) CMU 15-418/15-618, Spring 2019

Benchmarking threaded spmv
$ ./mrun –l 4 –d 10 -t u Sparse 4 10 u
MVP RVP GF seq seq 1.45 seq par 2.63

mps seq 10.08
mps par 10.51
▪…But forcing all rows to have the same number of non- zero elements doesn’t change much vs. random
▪Hypothesis: Memory bandwidth saturated? CMU 15-418/15-618, Spring 2019

Parallelizing spmv with OpenMP ▪Dealing with skewed workloads
void mvp_csr_mps(csr_t *m, vec_t *x, vec_t *y,
rvp_csr_t rp_fun) {
index_t nrow = m->nrow;
#pragma omp parallel for schedule(static)
for (index_t r = 0; r < nrow; r++) { y->value[r] = rp_fun(m, x, r);
} }
CMU 15-418/15-618, Spring 2019

Parallelizing spmv with OpenMP ▪Dealing with skewed workloads
void mvp_csr_mps(csr_t *m, vec_t *x, vec_t *y,
rvp_csr_t rp_fun) {
index_t nrow = m->nrow;
#pragma omp parallel for schedule(dynamic)
for (index_t r = 0; r < nrow; r++) { y->value[r] = rp_fun(m, x, r);
} }
▪ schedule(dynamic) divides loop into many small chunks and load balances them across threads
CMU 15-418/15-618, Spring 2019

Benchmarking threaded spmv
$ ./mrun –l 4 –d 10 -t s Sparse 4 10 s
MVP RVP GF seq seq 1.36 seq par 3.17

mpd seq 8.05
mpd par 8.24
▪Dynamic scheduling increases performance on heavily skewed workloads
CMU 15-418/15-618, Spring 2019

Benchmarking threaded spmv
$ ./mrun –l 4 –d 10 -t r Sparse 4 10 r
MVP RVP GF seq seq 1.36 seq par 3.17

mpd seq 10.52
mpd par 11.24
▪…And some on slightly skewed workloads CMU 15-418/15-618, Spring 2019

Let’s optimize this code!
Exploit parallelism at multiple levels
▪ILP ▪SIMD ▪Threads 
There’s code for this in the tar, but I’m not going to talk about it
CMU 15-418/15-618, Spring 2019

Example: Summing an array
double sum_array(double *d, int n) { int i;
double sum = 0.0;
for (i = 0; i < n ; i++) { double val = d[i]; sum += val; } return sum; } CMU 15-418/15-618, Spring 2019 Parallelizing array sum w/ OpenMP (Attempt #1) double sum_array(double *d, int n) { int i; double sum = 0.0; #pragma omp parallel for schedule(static) for (i = 0; i < n ; i++) { double val = d[i]; sum += val; } return sum; } ▪Q: Is this OK? ▪A: No, concurrent updates to sum CMU 15-418/15-618, Spring 2019 Parallelizing array sum w/ OpenMP (Attempt #2) double sum_array(double *d, int n) { int i; double sum = 0.0; #pragma omp parallel for schedule(static) for (i = 0; i < n ; i++) { double val = d[i]; #pragma omp critical sum += val; } return sum; } ▪Q: Is this OK? ▪A: Its correct, but won’t speedup (threads serialize) CMU 15-418/15-618, Spring 2019 Parallelizing array sum w/ OpenMP (Attempt #3) double sum_array(double *d, int n) { int i; double sum = 0.0; #pragma omp parallel for schedule(static) for (i = 0; i < n ; i++) { double val = d[i]; __sync_fetch_and_add(&sum, val); } return sum; } ▪GCC __sync_* intrinsics compile to x86 atomic instructions (vs. locks for #pragma omp critical) ▪Faster, but threads still serialize! CMU 15-418/15-618, Spring 2019 Parallelizing array sum w/ OpenMP (Attempt #4) double sum_array(double *d, int n) { int i; double sum = 0.0; #pragma omp parallel for schedule(static) \ reduction (+:sum) for (i = 0; i < n ; i++) { double val = d[i]; sum += val; } return sum; } ▪OpenMP has specialized support for this pattern ▪Syntax: (𝑜𝑝𝑒𝑟𝑎𝑛𝑑:𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒) CMU 15-418/15-618, Spring 2019 Benchmarking array sum $ ./bigsum Function OMP critical GCC atomic ... OMP reduce ... 16 ... 0.005 ... 0.037 ... 12.828 ▪Atomics are much faster than locks (7.4× speedup) ▪Reduction avoids serialization (346× speedup) CMU 15-418/15-618, Spring 2019 More complex example: Radix sort ▪Sort numbers digit-by-digit 103 031 593 892 193 576 298 191 031 191 892 103 593 193 576 298 103 031 576 191 892 593 193 298 031 103 191 193 298 576 593 892 Last digit Middle First digit digit CMU 15-418/15-618, Spring 2019 More complex example: Radix sort void seq_radix_sort(index_t N, data_t *indata, data_t *outdata, data_t *scratchdata) { data_t *src = indata; data_t *dest = scratchdata; index_t count[BASE]; index_t offset[BASE]; index_t total_digits = sizeof(data_t) * 8; for (index_t shift = 0; shift < total_digits; shift += BASE_BITS) { // Accumulate count for each key value memset(count, 0, BASE*sizeof(index_t)); for (index_t i = 0; i < N; i++) { data_t key = DIGITS(src[i], shift); count[key]++; } // Compute offsets offset[0] = 0; for (index_t b = 1; b < BASE; b++) offset[b] = offset[b-1] + count[b-1]; // Distribute data for (index_t i = 0; i < N; i++) { data_t key = DIGITS(src[i], shift); index_t pos = offset[key]++; dest[pos] = src[i]; } // Swap src/dest (only keep two buffers) src = dest; dest = (dest == outdata) ? scratchdata : outdata; } } CMU 15-418/15-618, Spring 2019 Translating CUDA➔OpenMP CUDA OpenMP __global__ function (kernel) #pragma omp parallel if (threadId == 0) { ... } #pragma omp single __sync_threads() #pragma omp barrier Spin lock using ‘atomicExch’ #pragma omp critical for-loop (with iterations computed from threadId) #pragma omp for schedule(static/dynamic) CMU 15-418/15-618, Spring 2019 Figure out which thread we are Split loop across threads, don’t join at the end Critical section Barrier Only one thread runs this code Another barrier Another loop, with an implicit barrier on exit Radix sort in OpenMP (credit: Haichuan Wang, UIUC) void full_par_radix_sort(index_t N, data_t *indata, data_t *outdata, data_t *scratchdata) { data_t *src = indata, *dest = scratchdata; index_t total_digits = sizeof(data_t) * 8; index_t count[BASE], offset[BASE]; for(index_t shift = 0; shift < total_digits; Run this code shift+=BASE_BITS) { in parallel in memset(count, 0, BASE*sizeof(index_t)); #pragma omp parallel int nthreads = omp_get_num_threads(); int tid = omp_get_thread_num(); for (int t = 0; t < nthreads; t++) { if(t == tid) { for(index_t b = 0; b < BASE; b++){ local_offset[b] = offset[b]; offset[b] += local_count[b]; } } #pragma } each thread { omp barrier for schedule(static) i=0;i