18-646 – How to Write Fast Code II?
Carnegie Mellon University Ian Lane
What we discussed last time:
Fast Platforms
Multicore platforms, Manycore platforms, Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Highlighted the difference between multicore and manycore platforms
Discussed the intricacies of multicore and manycore platforms
Exposing concurrency in the k-means algorithm
Exploiting parallelism by exploring mappings from application to platform
This lecture: An abstraction for multicore parallel programming
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
Multicore Shared-Memory Platforms
Parallel resources available in:
  SIMD lanes in vector units in a core
  Simultaneous multithreading (SMT) in a core
  Cores within a chip
[Figure: a chip with multiple cores (core-level parallelism); each core has SIMD-level parallelism and its own cache, and all cores connect over a shared memory bus to DRAM]
Exploiting Different Levels of Parallelism
How do we write fast code to exploit the many levels of parallelism on a multicore platform?
SIMD-level parallelism: exploited using a vectorizing compiler and hand-coded intrinsics
SMT-level parallelism: the OS abstracts it to core-level parallelism
Core-level parallelism: exploited using threads to describe work done on different cores
[Figure: a multicore chip showing SIMD-level parallelism within each core and core-level parallelism across the cores, each with its own cache]
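As a concrete illustration (not from the slides), the sketch below exploits two of these levels at once: threads distribute loop iterations across cores, and each thread's chunk is vectorized across SIMD lanes. The function and array names are made up, and the "simd" hint requires an OpenMP 4.0+ compiler; with older compilers one would rely on auto-vectorization of the inner loop instead.

#include <stddef.h>

/* Sketch: core-level parallelism (OpenMP threads) combined with SIMD-level
   parallelism (vector lanes) in one loop. Names here are illustrative only. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n) {
    /* Threads split the iterations across cores; each thread's chunk is then
       vectorized across the SIMD lanes of its core ("simd" needs OpenMP 4.0+). */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}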
What is a thread?
A thread of execution is a unit of processing scheduled by the OS Allows system resources to be shared efficiently
Processes
  Exist independently in the OS
  Independent state: virtual memory space, handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution
  Interact through system-provided inter-process communication mechanisms
  Preemptively scheduled by the OS

Threads
  Subsets of processes
  Threads in a process share: virtual address space and system resources
  Independent state: exception handlers, a scheduling priority, a unique thread identifier, and thread context
  Interact through shared memory
  Preemptively scheduled by the OS
Landscape of Thread Programming
POSIX threads
  POSIX: Portable Operating System Interface for Unix (an IEEE standard)
  Defines a set of C programming language types, functions and constants
  The programmer manually exposes and manages all concurrency with the API
OpenMP
  OpenMP: Open Multi-Processing, an API for shared-memory multiprocessing
  The programmer hints at the concurrency
  The compiler manages the parallelism
Intel Cilk Plus, Intel Threading Building Blocks (TBB)
  Fork-join parallelism (“spawn” and “sync” in Cilk, parallel loops/containers in TBB)
  The programmer exposes the concurrency
  The runtime determines what is run in parallel
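To make the contrast with OpenMP's directive-based style concrete, here is a minimal POSIX threads version of the “hello world” program that appears in the next example; the thread count of 4 is an arbitrary choice for illustration.

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

/* With Pthreads, thread creation, argument passing, and joining are all
   managed explicitly by the programmer. */
static void *hello(void *arg) {
    long id = (long)arg;
    printf("hello(%ld) world(%ld)\n", id, id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, hello, (void *)i);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}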
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
OpenMP Basic Defs: Solution Stack
User layer: End User, Application
Programming layer: Directives and Compiler, OpenMP library, Environment variables
System layer: OpenMP Runtime library; OS/system support for shared memory and threading
Hardware: Proc1, Proc2, Proc3, ..., ProcN over a Shared Address Space
Scope of OpenMP in this course
Topic: Concepts
I. OMP Intro: Parallel regions
II. Creating threads: Parallel, default data environment, runtime library calls
III. Synchronization: False sharing, critical, atomic
IV. Parallel loops: For, schedule, reduction
V. Odds and ends: Single, sections, master, runtime libraries, environment variables, synchronization, etc.
VI. Data Environment: Data environment details, software optimization
VII. OpenMP 3 and tasks: Tasks and other OpenMP 3 features
VIII. Memory model: The need for flush, but its actual use is left to an appendix
IX. Threadprivate: Modular software
Spectrum of Granularity
[Figure (© Jike Chong 2009): spectrum of granularity. Synchronization overhead spans roughly 10^-9 to 10^-1 seconds (on-die cache latency, shared off-chip memory latency, system latency, network latency), and task management overhead spans roughly 10^1 to 10^9 instructions (HW task queue, SW task queue, OS processes, jobs over networks). Manycore throughput-optimized and Pthread-based implementations sit at the fine-grained end; multicore task-queue-based, MPI-based, and remote-procedure-call-based implementations sit at progressively coarser granularities.]
Example 1: Hello World
Verifying that your OpenMP environment works
Write a multithreaded program where each thread prints “hello world”.
#include <stdio.h>
int main() {
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
    return 0;
}
Example 1: SPMD with “omp parallel”
Verifying that your OpenMP environment works
Write a multithreaded program where each thread prints “hello world”.
#include <stdio.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }
    return 0;
}
Switches for compiling and linking: gcc -fopenmp
Example 1: SPMD with “omp parallel”
A multi-threaded “Hello world” program
Write a multithreaded program where each thread prints “hello world”.

#include <stdio.h>
#include <omp.h>                        // OpenMP include file
int main() {
    #pragma omp parallel                // parallel region with default number of threads
    {
        int ID = omp_get_thread_num();  // runtime library function to return a thread ID
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }                                   // end of the parallel region
    return 0;
}

Sample Output:
hello(1) hello(0) world(1)
world(0)
hello(3) hello(2) world(3) world(2)
Example 1: SPMD with “omp parallel”
The same multi-threaded “Hello world” program, mapped onto the OpenMP solution stack: the directives are handled by the compiler, omp_get_thread_num() is resolved by the OpenMP runtime library, and the threads and shared address space are provided by the OS and the processors Proc1 ... ProcN.
(Mattson et al., A “Hands-on” Introduction to OpenMP)
OpenMP: Execution Pattern
Managed fork-join parallelism:
  The master thread creates a team of threads as needed at each parallel region (parallel regions may be nested)
  Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program
[Figure: master thread (in red) executing the sequential parts and forking a team of threads at each parallel region, including a nested parallel region]
Thread Creation: Parallel Regions
Threads in OpenMP are created with the parallel construct. To create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);              // runtime function to request a certain number of threads
#pragma omp parallel
{
    int ID = omp_get_thread_num();   // runtime function returning a thread ID
    foo(ID, A);
}

Each thread executes a copy of the code within the structured block: each thread calls foo(ID, A), for ID = 0 to 3.
Thread Creation: Parallel Regions
Threads in OpenMP are created with the parallel construct. To create a 4-thread parallel region:

double A[1000];
#pragma omp parallel num_threads(4)  // clause to request a certain number of threads
{
    int ID = omp_get_thread_num();   // runtime function returning a thread ID
    foo(ID, A);
}

Each thread executes a copy of the code within the structured block: each thread calls foo(ID, A), for ID = 0 to 3.
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
Example: Numerical Integration of Pi
Mathematically, we know that:
$\int_0^1 \frac{4}{1+x^2}\,dx = 4\tan^{-1}(x)\Big|_0^1 = \pi$

We can approximate the integral as a sum of rectangles:

$\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$

where each rectangle has width $\Delta x$ and height $F(x_i) = 4/(1+x_i^2)$ at the middle of interval $i$.

[Figure: plot of F(x) = 4.0/(1+x²) on [0, 1], approximated by rectangles]
Numerical Integration of Pi
#include <stdio.h>
static long num_steps = 100000;
double step;
int main () {
    int i; double pi, sum, x;
    step = 1.0/(double) num_steps;
    for (i=0, sum=0.0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
Concurrency Opportunities
What is parallelizable?
Calculation of each step is independent
One can map each of the steps to an independent thread
What’s wrong with this approach?
Significant thread management overhead: computing each step in a separate thread costs hundreds of cycles of management overhead per step, far more than the handful of floating-point operations each step performs.
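To make the overhead concrete, here is a sketch of this naive decomposition using OpenMP tasks (a construct covered later in the deck); it is my own illustration rather than code from the slides, and each task does only a few floating-point operations of useful work against hundreds of cycles of task management.

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

/* Naive decomposition: one OpenMP task per rectangle. Correct, but the task
   creation and the atomic update dominate the tiny amount of work per step. */
int main(void) {
    double step = 1.0 / (double)num_steps, sum = 0.0;
    #pragma omp parallel
    #pragma omp single
    for (long i = 0; i < num_steps; i++) {
        #pragma omp task firstprivate(i) shared(sum)
        {
            double x = (i + 0.5) * step;
            #pragma omp atomic
            sum += 4.0 / (1.0 + x * x);
        }
    }
    printf("pi = %f\n", sum * step);
    return 0;
}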
Concurrency Opportunities
What is parallelizable?
Calculation of each step is independent
Alternative:
Specify a fixed set of parallel entities
e.g. two threads (red and yellow threads)
Parallelize workload over the available number of threads
[Figure: rectangles under F(x) = 4.0/(1+x²) on [0, 1] assigned alternately to Thread 0 and Thread 1]
Implementation View
Work is evenly distributed across two processors
[Figure: a dual-core CPU, each core with its own L1 cache, sharing an L2 cache and DRAM; each core computes half of the rectangles under F(x) = 4.0/(1+x²)]
Example 2: Numerical Integration of Pi
#include <stdio.h>
static long num_steps = 100000000;
double step;
int main () {
    int i; double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
Sequential: 2.12 seconds
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
Example 3: Critical Region
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi; int nthreads; double sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        double x;
        int nthrds = omp_get_num_threads();
        for (i=id; i< num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            #pragma omp critical
            { sum += 4.0/(1.0+x*x); }
        }
        if (id == 0) nthreads = nthrds;
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
64.45 seconds with the critical section (sequential: 2.12 seconds). Why so slow?
Example 4: Atomic Operation
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi; int nthreads; double sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        double x, tmp;
        int nthrds = omp_get_num_threads();
        for (i=id; i< num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            tmp = 4.0/(1.0+x*x);
            #pragma omp atomic
            sum += tmp;
        }
        if (id == 0) nthreads = nthrds;
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
4.93 seconds with the atomic update (sequential: 2.12 seconds). Better, but why still so slow?
Shared Memory Model
Based on slides from Mattson et al, A “Hands-on” Introduction to OpenMP
[Figure: processors Proc1 ... ProcN, each with its own cache (Cache1 ... CacheN), connected to a shared memory holding a variable at address “a”]
All threads share an address space
Multiple copies of data may be present in various levels of cache
Shared Memory Model
Based on slides from Mattson et al, A “Hands-on” Introduction to OpenMP
[Figure: the same shared-memory system; Thrd 2 stores to address “a” at time t1, and Thrd 1 reads “a” at time t2]
A piece of data (at address “a”) can be written into memory by Thrd 2 at t1 and then read by Thrd 1 at t2.
We say the data at address “a” is in a shared state.
Managing Coherence is Costly
Based on slides from Mattson et al, A “Hands-on” Introduction to OpenMP
Synchronization protocols are implemented in hardware to manage coherence between caches, e.g. the MESI protocol (Papamarcos & Patel 1984), which allows a cache line to be in one of four states: Modified, Exclusive, Shared, or Invalid.
Coherence is transparent to software, but it can carry severe performance penalties.
Threads can Write to Different Addresses...
[Figure: Thrd 2 stores to address “a” at time t1 while Thrd 1 stores to a different address “b” at time t2]
Avoid the severe performance penalties induced by cache synchronization protocols when no sharing is required.
Technique: write to separate addresses during the parallel computation.
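One way to keep per-thread results truly separate is to pad each accumulator out to its own cache line; the sketch below is my own illustration (not from the slides) and assumes a typical 64-byte cache line.

#include <omp.h>

#define NUM_THREADS 2
#define CACHE_LINE_BYTES 64   /* assumption: common x86 cache line size */

/* Each per-thread partial sum occupies its own cache line, so a thread
   writing its entry does not invalidate the line holding another thread's. */
struct padded_sum {
    double value;
    char pad[CACHE_LINE_BYTES - sizeof(double)];
};

static struct padded_sum partial[NUM_THREADS];

void accumulate(const double *data, long n) {
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        partial[id].value = 0.0;
        for (long i = id; i < n; i += nthrds)
            partial[id].value += data[i];   /* each thread writes only its own line */
    }
}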
Example 5: Separate Result Storage
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi; int nthreads; double sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        double x;
        int nthrds = omp_get_num_threads();
        for (i=id, sum[id]=0.0; i< num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
        if (id == 0) nthreads = nthrds;
    }
    for (i=0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
    printf("pi = %f \n", pi);
    return 0;
}

Each thread now writes only its own sum[id], but the entries of sum[] are adjacent in memory and typically share a cache line, so updates from different cores still ping-pong that line between caches (false sharing).

A further refinement accumulates into a thread-local scalar and combines the partial results under a critical section once per thread:
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi = 0.0; int nthreads;
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        double x, local_sum = 0.0;
        int nthrds = omp_get_num_threads();
        for (i=id; i< num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            local_sum += 4.0/(1.0+x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}
1.15 seconds (sequential: 2.12 seconds)
Example 6: Eliminate False Sharing
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi = 0.0, x, local_sum; int nthreads;
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
    {
        int i, id = omp_get_thread_num();
        local_sum = 0.0;
        int nthrds = omp_get_num_threads();
        for (i=id; i< num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            local_sum += 4.0/(1.0+x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

1.15 seconds (sequential: 2.12 seconds)
Great! But is this the best we can do?
Example 7: Sequential Optimization
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi = 0.0, x, local_sum; int nthreads;
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
    {
        int i, id = omp_get_thread_num();
        local_sum = 0.0;
        int nthrds = omp_get_num_threads();
        double hoop = nthrds*step;
        for (i=id, x=(id+0.5)*step; i< num_steps; i=i+nthrds) {
            local_sum += 4.0/(1.0+x*x);
            x += hoop;   /* advance x by nthrds steps instead of recomputing (i+0.5)*step */
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

0.46 seconds (sequential: 2.12 seconds)
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
SPMD vs. Worksharing
A parallel construct by itself creates an SPMD program: “Single Program Multiple Data”
  The programmer must explicitly specify what each thread does differently
  The division of work is hard-coded in the program
Opportunity: many parallel regions are loops
Question: can we make it easier to parallelize these loops?
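The sketch below (my own illustration; the array and function names are placeholders) contrasts the two styles on the same loop: in SPMD form each thread computes its own cyclic index range by hand, while the worksharing form lets the “omp for” construct divide the iterations.

#include <omp.h>

/* SPMD style: each thread hand-computes a cyclic slice of the iterations. */
void add_spmd(double *a, const double *b, int n) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for (int i = id; i < n; i += nthrds)
            a[i] += b[i];
    }
}

/* Worksharing style: "omp for" divides the same iterations automatically. */
void add_worksharing(double *a, const double *b, int n) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }
}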
OpenMP: Parallel For
One of the worksharing constructs OpenMP provides is “omp for”, which splits the iterations of the loop that follows it across the team of threads:
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel private(x)
    {
        #pragma omp for reduction(+:sum)
        for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
1.22 seconds (sequential: 2.12 seconds)
A reduction is more scalable than a critical section.
The loop must not have loop-carried dependencies (a limitation of compiler technology).
“i” is private by default.
Combined Parallel/Worksharing Construct
OpenMP shortcut: Put the “parallel” and the worksharing directive on the same line
double res[MAX]; int i;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i< MAX; i++) {
        res[i] = huge();
    }
}

is equivalent to:

double res[MAX]; int i;
#pragma omp parallel for
for (i=0; i< MAX; i++) {
    res[i] = huge();
}
OpenMP: Working With Loops
Basic approach:
  Find compute-intensive loops
  Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies
  Place the appropriate OpenMP directive and test

Before (hand-decomposed SPMD loop):
    double hoop = nthrds*step;
    for (i=id, x=(id+0.5)*step; i< num_steps; i=i+nthrds) {
        sum = sum + 4.0/(1.0+x*x);
        x += hoop;
    }

After (a plain loop the compiler can workshare):
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
OpenMP: Reductions
Accumulating values into a single variable (sum) creates a true dependence between loop iterations that cannot be trivially removed.
This is a very common situation; it is called a “reduction”.
Support for reduction operations is included in most parallel programming environments.
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
reduction (op : list)
1. A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”)
2. Updates occur on the local copy
3. Local copies are reduced into a single value and combined with the original global value
#pragma omp for reduction(+:sum)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
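Conceptually (this is a hand-written equivalent for illustration, not OpenMP's actual implementation), reduction(+:sum) behaves like the following sketch applied to the pi loop: a per-thread local copy initialized to 0, updates on the local copy, and a protected combine at the end.

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;

int main(void) {
    double step = 1.0 / (double)num_steps, sum = 0.0;
    #pragma omp parallel
    {
        double local_sum = 0.0;                /* 1. local copy, initialized for "+" */
        #pragma omp for
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            local_sum += 4.0 / (1.0 + x * x);  /* 2. updates hit the local copy */
        }
        #pragma omp critical
        sum += local_sum;                      /* 3. combine with the shared value */
    }
    printf("pi = %f\n", sum * step);
    return 0;
}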
OpenMP: Reduction Operators
Many different associative operators can be used with reduction; the initial values are the ones that make sense mathematically:

  Operator   Initial value
  +          0
  *          1
  -          0
  &          ~0
  |          0
  ^          0
  &&         1
  ||         0
Example 8: Parallel For and Reduction
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main () {
    int i; double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
1.22 seconds (sequential: 2.12 seconds)
Advantage: we created a parallel program without restructuring the code, just by adding two simple lines of text.
Disadvantages:
  1) Lots of pitfalls if you don't understand the system architecture
  2) The basic result may not be the fastest one can do
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
Data Environment in OpenMP
Shared-memory programming model: most variables are shared by default.
Global variables are SHARED among threads:
  File-scope variables and static variables
  Dynamically allocated memory (ALLOCATE, malloc, new)
But not everything is shared:
  Stack variables in functions called from parallel regions are PRIVATE
  Automatic variables within a statement block are PRIVATE
Example: Data Sharing
extern double A[10];
void work(int *index) {
    double temp[10];
    static int count;
    ...
}

double A[10];
int main() {
    int index[10];
    #pragma omp parallel
        work(index);
    printf("%d\n", index[0]);
}

A, index and count are shared by all threads; temp is local to each thread.
Changing Storage Attributes
One can selectively change storage attributes for constructs using the following clauses: SHARED, PRIVATE, FIRSTPRIVATE.
The final value of a private variable inside a parallel loop can be transmitted to the shared variable outside the loop with: LASTPRIVATE.
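A small sketch (my own illustration, not from the slides) of how these clauses behave: tmp is firstprivate, so every thread's private copy starts from the initialized value 5 and the original is untouched after the loop, while last is lastprivate, so the value from the sequentially last iteration is copied back out.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int tmp = 5;
    int last = -1;
    #pragma omp parallel for firstprivate(tmp) lastprivate(last)
    for (int i = 0; i < 8; i++) {
        tmp += i;        /* private copy, seeded with 5 in every thread */
        last = i;        /* the copy from iteration i == 7 is written back */
    }
    printf("last = %d, tmp = %d\n", last, tmp);   /* prints: last = 7, tmp = 5 */
    return 0;
}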
Example 9: firstprivate example
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main ()
{
    int i; double pi = 0.0, x, local_sum = 0.0; int nthreads;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private(x) firstprivate(local_sum)
    {
        int i, id = omp_get_thread_num();
        local_sum = 0.0;
        int nthrds = omp_get_num_threads();
        double hoop = nthrds*step;
        for (i=id, x=(id+0.5)*step; i< num_steps; i=i+nthrds) {
            local_sum += 4.0/(1.0+x*x);
            x += hoop;
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
The Schedule Clause
The schedule clause affects how loop iterations are mapped onto threads:
  schedule(static [,chunk]) – deal out blocks of iterations of size “chunk” to each thread
  schedule(dynamic [,chunk]) – each thread grabs “chunk” iterations off a queue until all iterations have been handled
  schedule(guided [,chunk]) – threads dynamically grab blocks of iterations; the block size starts large and shrinks down to “chunk” as the calculation proceeds
  schedule(runtime) – schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or the runtime library)
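For example (a sketch of my own; the workload function is made up), dynamic scheduling pays off when iterations do wildly different amounts of work, since static blocks would leave some threads idle while others finish their heavy chunks:

#include <stdio.h>
#include <omp.h>

/* Artificial workload whose cost varies heavily from iteration to iteration. */
static double simulated_work(int i) {
    double s = 0.0;
    for (int k = 0; k < (i % 97) * 1000; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void) {
    double total = 0.0;
    /* Threads pull chunks of 16 iterations off a shared queue as they finish,
       balancing the irregular work automatically. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < 10000; i++)
        total += simulated_work(i);
    printf("total = %f\n", total);
    return 0;
}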
The Schedule Clause
Schedule clause: when to use
  STATIC – work per iteration is pre-determined and predictable by the programmer; least work at runtime (scheduling decisions are made at compile time)
  DYNAMIC – unpredictable, highly variable work per iteration; most work at runtime (complex scheduling logic is used at run time)
  GUIDED – a special case of dynamic that reduces scheduling overhead
Example 9: Scheduling
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main () {
    int i; double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum) schedule(static, 10000000)
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}
1.31 seconds (sequential: 2.12 seconds)
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
Synchronization: ordered
The ordered region executes in the sequential order
Important for some scientific code and optimization code
Order of reduction may cause rounding differences
#pragma omp parallel private(tmp)
#pragma omp for ordered reduction(+:res)
for (I=0; I<N; I++) {
    /* compute tmp for iteration I (runs in parallel) */
    #pragma omp ordered
    res += tmp;   /* the ordered region executes in sequential iteration order */
}

Nested Parallelism
#include <stdio.h>
#include <omp.h>
void report_num_threads(int level) {
    #pragma omp single
    {
        printf("L %d: # threads in team %d\n", level, omp_get_num_threads());
    }
}
int main() {
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            report_num_threads(3);
        }
    }
    return(0);
}
http://download.oracle.com/docs/cd/E19205-01/819-5270/aewbc/index.html
Nested Parallelism
Compiling and running this program with nested parallelism enabled produces the following (sorted) output:

$ export OMP_NESTED=TRUE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 2
L 2: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2

With nesting disabled, the inner parallel regions run with teams of one thread:

$ export OMP_NESTED=FALSE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 1
L 3: # threads in team 1
L 2: # threads in team 1
L 3: # threads in team 1
OpenMP Environment Variables
OMP_SCHEDULE=algorithm, where algorithm is one of:
  static[, n]
  dynamic[, n]
  guided[, n]
OMP_NUM_THREADS=num
OMP_NESTED=TRUE|FALSE
OMP_DYNAMIC=TRUE|FALSE
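As a quick sanity check (my own sketch, not from the slides), the runtime library can read back the settings these environment variables control; omp_get_schedule() requires OpenMP 3.0 or later.

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_sched_t kind; int chunk;
    omp_get_schedule(&kind, &chunk);                      /* OMP_SCHEDULE    */
    printf("schedule    : kind=%d chunk=%d\n", (int)kind, chunk);
    printf("max threads : %d\n", omp_get_max_threads());  /* OMP_NUM_THREADS */
    printf("nested      : %d\n", omp_get_nested());       /* OMP_NESTED      */
    printf("dynamic     : %d\n", omp_get_dynamic());      /* OMP_DYNAMIC     */
    return 0;
}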
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
How is This Relevant to Writing Fast Code?
k-means Algorithm Concurrency
1. Initialization: Randomly select k cluster centers
2. Expectation: Assign closest center to each data point
   N (independent)
   k (min reduction)
   D (sum reduction)
3. Maximization: Update centers based on assignments
   D (independent)
   N (histogram computation into k bins)
4. Evaluate: Re-iterate steps 2-3 until convergence
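For instance, the Expectation step's N-way concurrency maps directly onto an “omp parallel for” over the data points; the sketch below is my own illustration and assumes row-major data[n*d] and centers[k*d] arrays rather than any particular project code.

#include <float.h>

/* Expectation step: each of the n points independently finds its nearest of
   the k centers in d dimensions, so the outer loop parallelizes directly. */
void assign_clusters(const float *data, const float *centers,
                     int *assignment, int n, int k, int d) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float best_dist = FLT_MAX;
        int best_c = 0;
        for (int c = 0; c < k; c++) {            /* k-way min reduction per point */
            float dist = 0.0f;
            for (int j = 0; j < d; j++) {        /* d-way sum reduction per pair  */
                float diff = data[i * d + j] - centers[c * d + j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best_c = c; }
        }
        assignment[i] = best_c;
    }
}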
Project 1: Search Among Alternatives
What are the implementation alternatives?
Different mappings of application concurrency to platform parallelism
The search process
  More complex than mapping one level of application concurrency to one level of platform parallelism
  You may want to sequentialize some operations: some parallel operations are less “work-efficient” than their sequential counterparts
  One level of concurrency could map to multiple levels of parallelism
How is this relevant to writing fast code?
Fast Platforms
Multicore platforms, Manycore platforms, Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
This lecture: An abstraction for multicore parallel programming
Abstractions:
Help establish a mental model for programmers
Make it easier to write parallel code
Performance still depends on a deep understanding of the implementation platform