
18-646 – How to Write Fast Code II
Carnegie Mellon University
Ian Lane

What we discussed last time:

Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms

Good Techniques
— Data structures
— Algorithms
— Software architecture

— Highlighted the difference between multicore and manycore platforms
— Discussed the intricacies of multicore and manycore platforms
— Exposed concurrency in the k-means algorithm
— Exploited parallelism by exploring mappings from the application to the platform

Example 7: Sequential Optimization
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
int i, nthreads;
double pi = 0.0, x, local_sum;
step = 1.0/(double) num_steps;
#pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
{
    int i, id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double hoop = nthrds*step;
    local_sum = 0.0;
    for (i=id, x=(i+0.5)*step; i<num_steps; i=i+nthrds) {
        x += hoop;
        local_sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    pi += local_sum * step;
}
printf("pi = %f \n", pi);
return 0;
}

Sequential: 2.12 seconds; with 2 threads: 0.46 seconds

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
  — Overview
  — Example: numerical integration of Pi
  — Shared variable synchronization
  — SPMD vs worksharing
  — Data environment options
  — Advanced worksharing options
  — Synchronization options
— Sequential Optimizations
  — Cost Model
  — Cost Reduction

SPMD vs. Worksharing
— A parallel construct by itself creates an SPMD program
  — "Single Program Multiple Data"
  — The programmer must explicitly specify what each thread does differently
  — The division of work is hard-coded in the program
— Opportunity:
  — Many parallel regions are loops
— Question:
  — Can we make it easier to parallelize these loops?

OpenMP: Parallel For
— One of the worksharing constructs OpenMP provides is "omp for"
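A minimal sketch of the contrast, assuming a simple element-wise update (the arrays a and b and the size N are illustrative, not from the slide):

double a[N], b[N]; int i;
/* sequential loop */
for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

double a[N], b[N]; int i;
/* same loop with the "omp for" worksharing construct: the iterations are
   divided among the threads of the enclosing team ("i" is private) */
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }
}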
Numerical integration of pi using a worksharing loop with a reduction:

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel private(x)
    {
        #pragma omp for reduction(+:sum)
        for (i=0; i<num_steps; i++) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; with the worksharing loop and reduction: 1.22 seconds
— Reduction is more scalable than critical sections.
— Loops must not have a loop-carried dependency; this is a limitation of compiler technology.
— "i" is private by default.

Combined Parallel/Worksharing Construct
— OpenMP shortcut: put the "parallel" and the worksharing directive on the same line

double res[MAX]; int i;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<MAX; i++) {
        res[i] = huge();
    }
}

double res[MAX]; int i;
#pragma omp parallel for
for (i=0; i<MAX; i++) {
    res[i] = huge();
}

OpenMP: Working With Loops
— Basic approach
  — Find compute-intensive loops
  — Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies
  — Place the appropriate OpenMP directive and test

SPMD version, with the work division hard-coded:
double hoop = nthrds*step;
for (i=id, x=(i+0.5)*step; i<num_steps; i=i+nthrds) {
    x += hoop;
    sum = sum + 4.0/(1.0+x*x);
}

Loop form suited to worksharing, where x is computed from i and each iteration is independent:
for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
}

OpenMP: Reductions
— Accumulating values into a single variable (sum) creates a true dependence between loop iterations that can't be trivially removed
— This is a very common situation ... it is called a "reduction"
— Support for reduction operations is included in most parallel programming environments

for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
}

reduction (op : list)
1. A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+")
2. Updates occur on the local copy
3. Local copies are reduced into a single value and combined with the original global value

#pragma omp for reduction(+:sum)
for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
}

OpenMP: Reduction Operators
— Many different associative operators can be used with reduction:
— Initial values are the ones that make sense mathematically.

Operator   Initial value
+          0
*          1
-          0
&          ~0
|          0
^          0
&&         1
||         0
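A minimal sketch of reductions with operators other than "+" (illustrative, not from the slides; the array data[] and size n are assumed):

double prod = 1.0; int all_positive = 1; int i;
/* Each reduction variable gets a private copy initialized to the operator's
   identity (1 for "*", 1 for "&&"); the copies are combined after the loop. */
#pragma omp parallel for reduction(*:prod) reduction(&&:all_positive)
for (i = 0; i < n; i++) {
    prod = prod * data[i];
    all_positive = all_positive && (data[i] > 0.0);
}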
Example 8: Parallel For and Reduction

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main () {
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i=0; i<num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; parallel: 1.22 seconds

Advantage: we created a parallel program without changing any code, just by adding 2 simple lines of text!
Disadvantages:
1) Lots of pitfalls if you don't understand the system architecture
2) The basic result may not be the fastest one can achieve

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
  — Overview
  — Example: numerical integration of Pi
  — Shared variable synchronization
  — SPMD vs worksharing
  — Data environment options
  — Advanced worksharing options
  — Synchronization options
— Sequential Optimizations
  — Cost Model
  — Cost Reduction

Data Environment in OpenMP
— Shared-memory programming model:
  — Most variables are shared by default
— Global variables are SHARED among threads
  — File-scope variables, static variables
  — Dynamically allocated memory (ALLOCATE, malloc, new)
— But not everything is shared...
  — Stack variables in functions called from parallel regions are PRIVATE
  — Automatic variables within a statement block are PRIVATE

Example: Data Sharing

double A[10];
int main() {
    int index[10];
    #pragma omp parallel
    work(index);
    printf("%d\n", index[0]);
}

extern double A[10];
void work(int *index) {
    double temp[10];
    static int count;
    ...
}

A, index and count are shared by all threads; temp is local to each thread.

Changing Storage Attributes
— One can selectively change storage attributes for constructs using the following clauses:
  — SHARED
  — PRIVATE
  — FIRSTPRIVATE (each private copy is initialized with the value the variable had before the parallel region)
— The final value of a private variable inside a parallel loop can be transmitted to the shared variable outside the loop with:
  — LASTPRIVATE (the value from the sequentially last iteration is copied back to the shared variable)
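A minimal sketch of lastprivate (illustrative, not from the slides; the variable x, the bound n, and the constant 0.001 are placeholders):

double x = 0.0; int i, n = 1000;
/* lastprivate(x): each thread works on a private x, and after the loop the
   shared x holds the value computed in the sequentially last iteration. */
#pragma omp parallel for lastprivate(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5) * 0.001;   /* some per-iteration value */
}
/* here x == (n - 1 + 0.5) * 0.001 */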
Example 9: firstprivate example

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private(x) firstprivate(local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double hoop = nthrds*step;
        for (i=id, x=(i+0.5)*step; i<num_steps; i=i+nthrds) {
            x += hoop;
            local_sum += 4.0/(1.0+x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
  — Overview
  — Example: numerical integration of Pi
  — Shared variable synchronization
  — SPMD vs worksharing
  — Data environment options
  — Advanced worksharing options
  — Synchronization options
— Sequential Optimizations
  — Cost Model
  — Cost Reduction

The Scheduling Clause
— The schedule clause affects how loop iterations are mapped onto threads
— schedule(static [,chunk])
  — Deal out blocks of iterations of size "chunk" to each thread
— schedule(dynamic [,chunk])
  — Each thread grabs "chunk" iterations off a queue until all iterations have been handled
— schedule(guided [,chunk])
  — Threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds
— schedule(runtime)
  — Schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or the runtime library)

The Scheduling Clause

Schedule Clause   When To Use
STATIC            Pre-determined and predictable by the programmer
DYNAMIC           Unpredictable, highly variable work per iteration
GUIDED            Special case of dynamic to reduce scheduling overhead

Least work at runtime: scheduling done at compile-time
Most work at runtime: complex scheduling logic used at run-time
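A minimal sketch of dynamic scheduling for irregular work (illustrative, not from the slides; work(), n, and the chunk size 64 are assumptions):

double total = 0.0; int i;
/* schedule(dynamic, 64): each thread takes 64 iterations at a time from a
   shared queue, which helps when the cost of work(i) varies a lot. */
#pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
for (i = 0; i < n; i++) {
    total += work(i);
}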
Example 10: Scheduling

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main () {
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum) schedule(static, 10000000)
    for (i=0; i<num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; with static scheduling in chunks of 10,000,000 iterations (10 chunks): 1.31 seconds

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
  — Overview
  — Example: numerical integration of Pi
  — Shared variable synchronization
  — SPMD vs worksharing
  — Data environment options
  — Advanced worksharing options
  — Synchronization options
— Sequential Optimizations
  — Cost Model
  — Cost Reduction

Synchronization: ordered
— The ordered region executes in the sequential order
— Important for some scientific and optimization code
  — The order of a reduction may cause rounding differences

#pragma omp parallel private (tmp)
#pragma omp for ordered reduction(+:res)
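The directives above, completed into a minimal runnable sketch; the loop body, N, NEAT_STUFF(), consume(), and the declarations are assumed placeholders:

double res = 0.0, tmp; int i;
#pragma omp parallel private (tmp)
#pragma omp for ordered reduction(+:res)
for (i = 0; i < N; i++) {
    tmp = NEAT_STUFF(i);
    /* the block below executes in loop-iteration order,
       even though iterations run on different threads */
    #pragma omp ordered
    res += consume(tmp);
}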
Nested Parallelism

#include <stdio.h>
#include <omp.h>

void report_num_threads(int level) {
    #pragma omp single
    {
        printf("L %d: # threads in team %d\n", level, omp_get_num_threads());
    }
}

int main() {
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            report_num_threads(3);
        }
    }
    return(0);
}

http://download.oracle.com/docs/cd/E19205-01/819-5270/aewbc/index.html

Nested Parallelism
— Compiling and running this program with nested parallelism enabled produces the following (sorted) output:
$ export OMP_NESTED=TRUE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 2
L 2: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2

$ export OMP_NESTED=FALSE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 1
L 3: # threads in team 1
L 2: # threads in team 1
L 3: # threads in team 1
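Nesting can also be enabled from the program itself instead of the OMP_NESTED environment variable; a minimal sketch (illustrative, not from the slides):

#include <omp.h>

void enable_nesting(void) {
    omp_set_nested(1);              /* allow nested parallel regions */
    omp_set_max_active_levels(3);   /* cap nesting depth at 3 levels */
}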

OpenMP Environment Variables
— OMP_SCHEDULE=algorithm
  — static[, n]
  — dynamic[, n]
  — guided[, n]
  — runtime
— OMP_NUM_THREADS=num
— OMP_NESTED=TRUE|FALSE
— OMP_DYNAMIC=TRUE|FALSE
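A minimal sketch of how schedule(runtime) and OMP_SCHEDULE work together (illustrative; process(), n, and the example values are assumptions):

int i;
/* With schedule(runtime), the schedule is chosen when the program runs, e.g.
       $ export OMP_SCHEDULE="dynamic,1000"
       $ OMP_NUM_THREADS=4 ./a.out                                          */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++) {
    process(i);
}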

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
  — Overview
  — Example: numerical integration of Pi
  — Shared variable synchronization
  — SPMD vs worksharing
  — Data environment options
  — Advanced worksharing options
  — Synchronization options
— Sequential Optimizations
  — Cost Model
  — Cost Reduction

Optimizing the Base Case
— Base case is usually the guts of the algorithm that computes on data in caches
— Should fit in registers
— Must use SIMD
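A minimal sketch of a SIMD-friendly base case (illustrative, not from the slides), using the OpenMP simd directive so the inner loop vectorizes and the accumulator stays in a register:

float dot_base(const float *a, const float *b, int n) {
    float acc = 0.0f;
    /* ask the compiler to vectorize the loop, keeping a private
       vector accumulator that is combined at the end */
    #pragma omp simd reduction(+:acc)
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}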

Cost Model
— Algebraic model
  — +, -, /, *
  — sin, cos, etc.
— Realistic model
  — +, -, /, *
int func(int i) {
return i+2;
}
struct A {
double sum, prod;
};
A acc(int n, int array[]) {
A res = {0.0, 1.0};
  for(int i=0; i<n; ++i) {