18-646 – How to Write Fast Code II
Carnegie Mellon University, Ian Lane
What we discussed last time:
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Highlighted the difference between multicore and manycore platforms
Discussed the intricacies of multicore and manycore platforms
Exposing concurrency in the k-means algorithm
Exploiting parallelism by exploring mappings from application to platform
Example 7: Sequential Optimization
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
  double pi = 0.0, x, local_sum;
  step = 1.0/(double) num_steps;
  #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
  {
    int i, id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double hoop = nthrds*step;   /* sequential optimization: one add per iteration replaces the multiply in (i+0.5)*step */
    local_sum = 0.0;
    for (i = id, x = (id+0.5)*step; i < num_steps; i += nthrds, x += hoop) {
      local_sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    pi += local_sum * step;
  }
  printf("pi = %f\n", pi);
  return 0;
}
Sequential: 2.12 seconds
This version: 0.46 seconds
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
SPMD vs. Worksharing
A parallel construct by itself creates an SPMD program:
“Single Program Multiple Data”
The programmer must explicitly specify what each thread does differently; the division of work is hard-coded in the program.
Opportunity:
Many parallel regions are loops
Question:
Can we make it easier to parallelize these loops?
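For reference, a minimal SPMD-style sketch of the hard-coded division of work (the function scale_spmd and array a are made up for illustration): each thread uses its id to pick a cyclic slice of the iterations by hand, which is exactly what the worksharing constructs on the following slides automate.

#include <omp.h>

/* SPMD idiom: every thread runs the same code and uses its id
   to hand-pick a cyclic subset of the iterations. */
void scale_spmd(double *a, long n, double c)
{
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for (long i = id; i < n; i += nthrds)   /* hard-coded work division */
            a[i] *= c;
    }
}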
OpenMP: Parallel For
One of the worksharing constructs OpenMP provides is “omp for”: it splits the iterations of a sequential loop (for (i=0; i<N; i++) {...}) across the threads of the team.
Applied to the pi program:

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main()
{
  int i;
  double pi, x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel private(x)
  {
    #pragma omp for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
    }
  }
  pi = sum * step;
  printf("pi = %f\n", pi);
  return 0;
}

Sequential: 2.12 seconds
This version: 1.22 seconds

Notes:
“i” is private by default inside the worksharing loop.
The reduction on sum is more scalable than a critical section.
Loops must not have a loop-carried dependency; this is a limitation of compiler technology.
Combined Parallel/Worksharing Construct
OpenMP shortcut: Put the “parallel” and the worksharing directive on the same line
double res[MAX]; int i;
#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < MAX; i++) {
    res[i] = huge();
  }
}

is equivalent to:

double res[MAX]; int i;
#pragma omp parallel for
for (i = 0; i < MAX; i++) {
  res[i] = huge();
}
OpenMP: Working With Loops
Basic approach
Find compute-intensive loops
Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies
Place the appropriate OpenMP directive and test

SPMD version (the work division and the value of x depend on the thread id):
double hoop = nthrds*step;
for (i = id, x = (id+0.5)*step; i < num_steps; i += nthrds, x += hoop) {
  sum = sum + 4.0/(1.0+x*x);
}

Worksharing version (iterations are independent of the thread id):
for (i = 0; i < num_steps; i++) {
  x = (i+0.5)*step;
  sum = sum + 4.0/(1.0+x*x);
}
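To make “loop-carried dependency” concrete, here is a small hedged sketch (the arrays a and b and the bound n are made up): the first loop is a recurrence and cannot simply be annotated with omp for, while the second has independent iterations and can.

/* NOT safe for "omp for": each iteration reads the value written by the
   previous one -- a true loop-carried dependence. */
for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];

/* Safe for "omp for": every iteration touches only its own element. */
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = 2.0 * b[i];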
OpenMP: Reductions
Accumulating values into a single variable (sum) creates true dependence between loop iterations that can’t be trivially removed
This is a very common situation ... it is called a “reduction”.
Support for reduction operations is included in most parallel programming environments.
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
reduction (op : list)
1. A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”)
2. Updates occur on the local copy
3. Local copies are reduced into a single value and combined with the original global value
#pragma omp for reduction(+:sum)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
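As a rough mental model (a sketch of what the clause achieves, not what the runtime literally emits, reusing num_steps and step from the pi example), reduction(+:sum) behaves much like the hand-written pattern below: each thread accumulates into a private copy, and the copies are combined under synchronization at the end.

double sum = 0.0;
#pragma omp parallel
{
    double local = 0.0;                    /* private copy, initialized for "+" */
    #pragma omp for
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        local += 4.0/(1.0 + x*x);          /* updates go to the private copy */
    }
    #pragma omp critical
    sum += local;                          /* combine with the original global value */
}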
OpenMP: Reduction Operators
Many different associative operators can be used with reduction; the initial values are the ones that make sense mathematically:

Operator   Initial value
+          0
*          1
-          0
&          ~0
|          0
^          0
&&         1
||         0
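As an illustration of one of the non-arithmetic operators from the table (a hedged sketch; the array data and bound n are made up), a && reduction starts each private copy at 1 and ANDs the per-thread results together:

/* Check whether every element is positive, using a logical-AND reduction. */
int all_positive = 1;                       /* initial value for && is 1 */
#pragma omp parallel for reduction(&&:all_positive)
for (long i = 0; i < n; i++)
    all_positive = all_positive && (data[i] > 0);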
Example 8: Parallel For and Reduction
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main()
{
  int i;
  double pi, x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i = 0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = sum * step;
  printf("pi = %f\n", pi);
  return 0;
}
Sequential: 2.12 seconds
This version: 1.22 seconds

Advantage: we created a parallel program without changing any of the original code, just by adding two simple lines of text!
Disadvantages:
1) Lots of pitfalls if you don't understand the system architecture
2) The basic result may not be the fastest one can do
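One concrete pitfall, as a hedged sketch: dropping private(x) from the directive above makes x shared, so the threads race on it between the write and the read, and pi silently comes out wrong.

/* BROKEN: x is shared here, so threads overwrite each other's value
   between the assignment and the use -- a data race. */
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;            /* several threads write x ...       */
    sum = sum + 4.0/(1.0 + x*x);   /* ... and may read someone else's x */
}
/* Fix: keep private(x), or declare x inside the loop body. */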
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
Data Environment in OpenMP
Shared-memory programming model: most variables are shared by default
Global variables are SHARED among threads:
  File-scope variables, static variables
  Dynamically allocated memory (ALLOCATE, malloc, new)
But not everything is shared...
  Stack variables in functions called from parallel regions are PRIVATE
  Automatic variables within a statement block are PRIVATE
Example: Data Sharing
double A[10];
int main()
{
    int index[10];
    #pragma omp parallel
    work(index);
    printf("%d\n", index[0]);
}

extern double A[10];
void work(int *index)
{
    double temp[10];
    static int count;
    ...
}

A, index and count are shared by all threads.
temp is local to each thread.
Changing Storage Attributes
One can selectively change storage attributes for constructs using the following clauses:
  SHARED
  PRIVATE: each thread gets its own copy, uninitialized on entry
  FIRSTPRIVATE: each thread's copy is initialized from the value the variable had before the parallel region
The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
  LASTPRIVATE
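A small hedged sketch showing the three private-data clauses together (the variable names and values are made up for illustration):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double offset  = 10.0;    /* read-only input for every thread          */
    double scratch = 0.0;     /* per-thread temporary                      */
    double last    = 0.0;     /* we want the value from the last iteration */

    #pragma omp parallel for private(scratch) firstprivate(offset) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        scratch = i * 0.5;         /* private: undefined on entry, assigned before use   */
        last = offset + scratch;   /* firstprivate: offset starts at 10.0 in each thread */
    }                              /* lastprivate: 'last' keeps the value from i == 99   */

    printf("last = %f\n", last);   /* prints 59.5 */
    return 0;
}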
Example 9: firstprivate example
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
  double pi = 0.0, x, local_sum = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel private(x) firstprivate(local_sum)
  {
    /* firstprivate: each thread's local_sum starts at the value above, 0.0 */
    int i, id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double hoop = nthrds*step;
    for (i = id, x = (id+0.5)*step; i < num_steps; i += nthrds, x += hoop) {
      local_sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    pi += local_sum * step;
  }
  printf("pi = %f\n", pi);
  return 0;
}
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
The Scheduling Clause
The schedule clause affects how loop iterations are mapped onto threads:

schedule(static [,chunk])
  Deal out blocks of iterations of size “chunk” to each thread.
schedule(dynamic [,chunk])
  Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
schedule(guided [,chunk])
  Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.
schedule(runtime)
  Schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime library).
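A hedged sketch of when dynamic scheduling pays off (process_cell and the bound n are made up): the work per iteration varies widely, so letting threads grab small chunks from a queue balances the load better than a static split.

/* Work per iteration grows with i (triangular loop), so dynamic
   scheduling with a modest chunk keeps all threads busy. */
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < n; i++) {
    for (int j = 0; j <= i; j++)
        process_cell(i, j);        /* hypothetical per-cell work */
}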
The Scheduling Clause

Schedule clause   When to use
STATIC            Work is pre-determined and predictable by the programmer (least work at runtime: scheduling is done at compile time)
DYNAMIC           Unpredictable, highly variable work per iteration (most work at runtime: complex scheduling logic is used at run time)
GUIDED            Special case of dynamic, to reduce scheduling overhead
Example 9: Scheduling
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main()
{
  int i;
  double pi, x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum) schedule(static, 10000000)
  for (i = 0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = sum * step;
  printf("pi = %f\n", pi);
  return 0;
}

Sequential: 2.12 seconds
This version: 1.31 seconds (chunk size 10,000,000 over 100,000,000 iterations gives 10 chunks)
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
Synchronization: ordered
The ordered region executes in sequential order.
Important for some scientific code and optimization code:
the order of a reduction may cause rounding differences.

#pragma omp parallel private(tmp)
#pragma omp for ordered reduction(+:res)
for (I = 0; I < N; I++) {
    ...
}
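A minimal, self-contained sketch of the ordered construct (the function expensive() and the loop bounds are made up): the loop body runs in parallel, but the block marked ordered executes in the original iteration order.

#include <stdio.h>
#include <omp.h>

static double expensive(int i) { return (i % 7) * 0.001; }   /* hypothetical work */

int main(void)
{
    double res = 0.0;

    #pragma omp parallel for ordered reduction(+:res)
    for (int i = 0; i < 1000; i++) {
        double tmp = expensive(i);   /* runs in parallel, in any order     */
        #pragma omp ordered
        res += tmp;                  /* this block runs in iteration order */
    }

    printf("res = %f\n", res);
    return 0;
}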
Nested Parallelism

#include <stdio.h>
#include <omp.h>

void report_num_threads(int level)
{
    #pragma omp single
    { printf("L %d: # threads in team %d\n", level, omp_get_num_threads()); }
}

int main()
{
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            report_num_threads(3);
        }
    }
    return 0;
}

http://download.oracle.com/docs/cd/E19205-01/819-5270/aewbc/index.html
Nested Parallelism
Compiling and running this program with nested parallelism enabled produces the following (sorted) output:

$ export OMP_NESTED=TRUE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 2
L 2: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2

With nesting disabled, the inner regions each run with a team of one thread:

$ export OMP_NESTED=FALSE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 1
L 3: # threads in team 1
L 2: # threads in team 1
L 3: # threads in team 1
OpenMP Environment Variables
OMP_SCHEDULE=algorithm
  Used by loops that specify schedule(runtime); algorithm is one of:
  static[, n]
  dynamic[, n]
  guided[, n]
OMP_NUM_THREADS=num
OMP_NESTED=TRUE|FALSE
OMP_DYNAMIC=TRUE|FALSE
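The same settings can also be made from code through the OpenMP runtime library; a brief hedged sketch (these calls are standard OpenMP 3.0 API, shown with arbitrary example values):

#include <omp.h>

void configure_openmp(void)
{
    omp_set_num_threads(4);                  /* like OMP_NUM_THREADS=4            */
    omp_set_nested(1);                       /* like OMP_NESTED=TRUE              */
    omp_set_dynamic(0);                      /* like OMP_DYNAMIC=FALSE            */
    omp_set_schedule(omp_sched_dynamic, 16); /* like OMP_SCHEDULE="dynamic,16";
                                                used by schedule(runtime) loops   */
}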
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
Optimizing the Base Case
The base case is usually the guts of the algorithm that computes on data in caches:
  Should fit in registers
  Must use SIMD
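A hedged sketch of a SIMD-friendly base case (the function and array names are made up): omp simd asks the compiler to vectorize the inner loop, and the small fixed trip count lets the accumulator and block pointers stay in registers.

/* Dot-product base case on a small, cache-resident block. */
double dot_block(const double *restrict a, const double *restrict b)
{
    double acc = 0.0;
    #pragma omp simd reduction(+:acc)
    for (int i = 0; i < 64; i++)    /* fixed, small trip count */
        acc += a[i] * b[i];
    return acc;
}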
Cost Model
Algebraic model: +, -, /, *, sin, cos, etc.
Realistic model: +, -, /, *

int func(int i) {
    return i + 2;
}

struct A {
    double sum, prod;
};

struct A acc(int n, int array[]) {
    struct A res = {0.0, 1.0};
    for (int i = 0; i < n; ++i)
        ...
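The loop body above is cut off in the slide; here is a minimal sketch of the likely intent, which is a guess based on the field names and the {0.0, 1.0} initializer (accumulate a running sum and product of the array):

struct A {
    double sum, prod;
};

struct A acc(int n, int array[])
{
    struct A res = {0.0, 1.0};      /* identities for + and * */
    for (int i = 0; i < n; ++i) {
        res.sum  += array[i];       /* running sum            */
        res.prod *= array[i];       /* running product        */
    }
    return res;
}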