
18-646 – How to Write Fast Code II?
Carnegie Mellon University Ian Lane

What we discussed last time:
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— Highlighted the difference between multicore and manycore platforms
— Discussed the multicore and manycore platform intricacies
— Exposing concurrency in the k-means algorithm
— Exploiting parallelism by exploring mappings from application to platform
— This lecture: An abstraction for multicore parallel programming

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

Multicore Shared-Memory Platforms
— Parallel resources available in:
— SIMD lanes in vector units in a core
— Simultaneous multithreading (SMT) in a core
— Cores within a chip

[Figure: a chip with multiple cores, each with its own cache, connected by a shared memory bus to DRAM; SIMD-level parallelism inside each core, core-level parallelism across the chip]

Exploiting Different Levels of Parallelism
— How do we write fast code to exploit the many levels of parallelism on a multicore platform?
— SIMD-level parallelism: exploited using vectorizing compilers and hand-coded intrinsics
— SMT-level parallelism: the OS abstracts it to core-level parallelism
— Core-level parallelism: exploited by using threads to describe work done on different cores

[Figure: the same chip diagram as on the previous slide (cores, caches, SIMD lanes)]
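To make these levels concrete, here is a minimal sketch (not from the slides; it assumes a compiler with OpenMP 4.0 "simd" support, otherwise the inner level is left to auto-vectorization or intrinsics):

#include <omp.h>

/* y[i] = a*x[i] + y[i]: "omp parallel for" spreads iterations across cores
 * (core-level parallelism); "simd" asks the compiler to use the vector
 * lanes within each core (SIMD-level parallelism). */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}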

What is a thread?
— A thread of execution is a unit of processing scheduled by the OS
— Allows system resources to be shared efficiently

Processes vs. threads:
— Processes exist independently in the OS; threads are subsets of processes
— A process has independent state: virtual memory space, handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution
— Threads in a process share the virtual address space and system resources, but have independent exception handlers, a scheduling priority, a unique thread identifier, and thread context
— Processes interact through system-provided inter-process communication mechanisms; threads interact through shared memory
— Both are preemptively scheduled by the OS

Landscape of Thread Programming
— POSIX threads
— POSIX: Portable Operating System Interface for Unix – IEEE standard
— Defines a set of C programming language types, functions, and constants
— Manually expose and manage all concurrency with the API
— OpenMP
— OpenMP: Open Multi-Processing – an API for shared-memory multiprocessing
— Programmer hints at the concurrency
— Compiler manages the parallelism
— Intel Cilk Plus, Intel Threading Building Blocks (TBB):
— Fork-join parallelism ("spawn" and "sync" in Cilk, parallel loops/containers in TBB)
— Programmer exposes the concurrency
— Runtime determines what is run in parallel
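For contrast with the OpenMP "hello world" shown later, a minimal POSIX-threads sketch (not from the slides): all concurrency is exposed and managed manually through the API.

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Each thread prints its ID; creation, argument passing, and joining are all explicit. */
void *hello(void *arg) {
    int id = *(int *)arg;
    printf("hello world from pthread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, hello, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Compile with gcc -pthread.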

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

OpenMP Basic Defs: Solution Stack
[Figure: the OpenMP solution stack]
— User layer: End User → Application
— Programming layer: Directives and compiler, OpenMP library, environment variables
— System layer: OpenMP runtime library; OS/system support for shared memory and threading
— Hardware layer: Proc1, Proc2, Proc3, ..., ProcN over a shared address space

Scope of OpenMP in this course
— I. OMP Intro: parallel regions
— II. Creating threads: parallel, default data environment, runtime library calls
— III. Synchronization: false sharing, critical, atomic
— IV. Parallel loops: for, schedule, reduction
— V. Odds and ends: single, sections, master, runtime libraries, environment variables, synchronization, etc.
— VI. Data environment: data environment details, software optimization
— VII. OpenMP 3 and tasks: tasks and other OpenMP 3 features
— VIII. Memory model: the need for flush, but its actual use is left to an appendix
— IX. Threadprivate: modular software

Spectrum of Granularity
[Figure (© Jike Chong 2009): spectrum of granularity. Implementation styles, from manycore throughput-optimized, multicore task-queue-based, and Pthread-based implementations to MPI-based and remote-procedure-call-based implementations, plotted against synchronization overhead in seconds (from on-die cache latency, ~10^-9 s, through shared off-chip memory and system latency, up to network latency, ~10^0 s) and task management overhead in instructions (from HW task queues, ~10^1, through SW task queues and OS processes, up to jobs over networks, ~10^9).]

Example 1: Hello World
Verifying that your OpenMP environment works
— Write a multithreaded program where each thread prints “hello world”.
#include <stdio.h>

void main() {
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
}

Example 1: SPMD with “omp parallel”
Verifying that your OpenMP environment works
— Write a multithreaded program where each thread prints “hello world”.
#include <stdio.h>
#include <omp.h>

void main() {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }
}

Switches for compiling and linking: gcc -fopenmp

Example 1: SPMD with “omp parallel”
A multi-threaded “Hello world” program
— Write a multithreaded program where each thread prints "hello world".

#include <omp.h>                        // OpenMP include file
void main() {
    #pragma omp parallel                // Parallel region with default number of threads
    {
        int ID = omp_get_thread_num();  // Runtime library function to return a thread ID
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }                                   // End of the parallel region
}

Sample Output:
hello(1) hello(0) world(1)
world(0)
hello(3) hello(2) world(3) world(2)

Example 1: SPMD with “omp parallel”
A multi-threaded “Hello world” program
— Write a multithreaded program where each thread prints “hello world”.
(The same multithreaded "hello world" code, viewed against the OpenMP solution stack: directives, the OpenMP library, and environment variables at the programming layer; the OpenMP runtime library and OS/system support for shared memory and threading at the system layer; and processors Proc1 ... ProcN over a shared address space at the hardware layer.)

Based on slides from Mattson et al., A "Hands-on" Introduction to OpenMP

OpenMP: Execution Pattern
[Figure: fork-join execution; the master thread (red) forks a team at each parallel region, including a nested parallel region, with sequential parts in between]

— Managed fork-join parallelism:
— Master thread creates a team of threads as needed
— Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program
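A minimal sketch of the fork-join pattern (hypothetical example, not from the slides): sequential parts run on the master thread alone, and a team is forked at each parallel region.

#include <stdio.h>
#include <omp.h>

int main() {
    printf("sequential part: master thread only\n");

    #pragma omp parallel   /* fork: a team of threads executes this region */
    printf("parallel region 1: thread %d\n", omp_get_thread_num());
    /* join: implicit barrier at the end of the region */

    printf("sequential part again: master thread only\n");

    #pragma omp parallel   /* fork a new team */
    printf("parallel region 2: thread %d\n", omp_get_thread_num());

    return 0;
}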

Thread Creation: Parallel Regions
— Threads in OpenMP are created with the parallel construct
— To create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);              // runtime function to request a certain number of threads
#pragma omp parallel
{
    int ID = omp_get_thread_num();   // runtime function returning a thread ID
    foo(ID, A);                      // each thread executes a copy of the code
}                                    //   within the structured block

— Each thread calls foo(ID, A) for ID = 0 to 3

Thread Creation: Parallel Regions
— Threads in OpenMP are created with the parallel construct
— To create a 4-thread parallel region:

double A[1000];
#pragma omp parallel num_threads(4)  // clause requesting a certain number of threads
{
    int ID = omp_get_thread_num();   // runtime function returning a thread ID
    foo(ID, A);                      // each thread executes a copy of the code
}                                    //   within the structured block

— Each thread calls foo(ID, A) for ID = 0 to 3

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

Example: Numerical Integration of Pi
Mathematically, we know that:
$\int_0^1 \frac{4}{1+x^2}\,dx = 4\tan^{-1}(x)\,\Big|_0^1 = \pi$
We can approximate the integral as a sum of rectangles:
[Figure: F(x) = 4.0/(1+x²) plotted on 0 ≤ x ≤ 1, approximated by rectangles]

$\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$

where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$.

Numerical Integration of Pi
#include <stdio.h>

static long num_steps = 100000;
double step;

int main () {
    int i;
    double x, pi, sum;
    step = 1.0/(double) num_steps;
    for (i = 0, sum = 0.0; i < num_steps; i++) {
        x = (i + 0.5)*step;                 // one step
        sum += 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

[Figure: one rectangle ("one step") under F(x) = 4.0/(1+x²) on [0, 1]]

Concurrency Opportunities
— What is parallelizable?
— Calculation of each step is independent
— One can map each of the steps to an independent thread
— What's wrong with this approach?
— Significant thread management overhead: computing each step as a separate thread costs hundreds of cycles of thread management per step

Concurrency Opportunities
— What is parallelizable?
— Calculation of each step is independent
— Alternative:
— Specify a fixed set of parallel entities, e.g. two threads (red and yellow threads)
— Parallelize the workload over the available number of threads
[Figure: rectangles alternately assigned to Thread 0 and Thread 1]

Implementation View
— Work is evenly distributed across two processors
[Figure: a CPU with two cores, each with an L1 cache, sharing an L2 cache and DRAM; each core handles half of the rectangles]

Example 2: Numerical Integration of Pi
#include <stdio.h>

static long num_steps = 100000000;
double step;

int main () {
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

Example 3: Critical Region
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            #pragma omp critical
            { sum += 4.0/(1.0 + x*x); }
        }
        if (id == 0) nthreads = nthrds;
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Why so slow? The critical section serializes every loop iteration and adds locking overhead to each one.
Sequential: 2.12 seconds; this version: 64.45 seconds

Example 4: Atomic Operation
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x, tmp;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            tmp = 4.0/(1.0 + x*x);
            #pragma omp atomic
            sum += tmp;
        }
        if (id == 0) nthreads = nthrds;
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Better, but why still so slow? Sequential: 2.12 seconds; this version: 4.93 seconds

Shared Memory Model (based on slides from Mattson et al., A "Hands-on" Introduction to OpenMP)
— All threads share an address space
— Multiple copies of data may be present in various levels of cache
[Figure: processors Proc1 ... ProcN, each with its own cache, above a shared memory holding a value at address "a"]

Shared Memory Model
— A piece of data (at address "a") can be:
— Written into memory by Thrd 2 at time t1
— Then read by Thrd 1 at time t2
— We say the data at address "a" is in a shared state

Managing Coherence Is Costly
— Synchronization protocols are implemented in hardware to manage coherence between caches
— e.g. the MESI protocol (Papamarcos & Patel 1984)
— Allows data to be in one of four states: Modified, Exclusive, Shared, Invalid
— Transparent to software, but can carry severe performance penalties

Threads Can Write to Different Addresses...
— Avoid the severe performance penalties induced by cache synchronization protocols when no sharing is required
— Technique: write to separate addresses during parallel computation

Example 5: Separate Result Storage
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi;
    double sum[NUM_THREADS];
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum[id] += 4.0/(1.0 + x*x);
        }
        if (id == 0) nthreads = nthrds;
    }
    for (i = 0, pi = 0.0; i < nthreads; i++)
        pi += sum[i] * step;
    printf("pi = %f \n", pi);
    return 0;
}
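This version keeps the two partial sums in adjacent array elements, so the threads' updates land on the same cache line and the line ping-pongs between the cores (false sharing). One common mitigation, not shown on these slides, is to pad each thread's slot out to a full cache line; a minimal sketch, assuming 64-byte lines (the padding factor of 8 doubles is an assumption):

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 2
#define PAD 8   /* assumption: 64-byte cache line = 8 doubles */

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0;
    double sum[NUM_THREADS][PAD];   /* each partial sum sits on its own cache line */
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id][0] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum[id][0] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0; i < nthreads; i++)
        pi += sum[i][0] * step;
    printf("pi = %f \n", pi);
    return 0;
}

The lecture's own fix, in the examples that follow, takes a different route: each thread accumulates into a scalar on its own stack and combines it into pi once, inside a critical section.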
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int nthreads;
    double pi = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x, local_sum = 0.0;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            local_sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.15 seconds

Example 6: Eliminate False Sharing
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        local_sum = 0.0;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            local_sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.15 seconds
Great! But is this the best we can do?

Example 7: Sequential Optimization
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        local_sum = 0.0;
        double hoop = nthrds*step;
        for (i = id, x = (i + 0.5)*step; i < num_steps; i = i + nthrds) {
            x += hoop;
            local_sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 0.46 seconds

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

SPMD vs. Worksharing
— A parallel construct by itself creates an SPMD ("Single Program Multiple Data") program
— The programmer must explicitly specify what each thread must do differently
— The division of work is hard-coded in the program
— Opportunity: many parallel regions are loops
— Question: can we make it easier to parallelize these loops?

OpenMP: Parallel For
— One of the worksharing constructs OpenMP provides is "omp for"
— Sequential code: for (i = 0; i < ...; i++) { ... }
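To make the contrast concrete, a sketch with a hypothetical array-scaling loop (not the slide's own example): the SPMD version partitions the iteration space by hand, while "omp for" does the division of work for you.

#include <omp.h>
#define N 100000

/* SPMD style: each thread computes its own block boundaries by hand */
void scale_spmd(double *a, double s) {
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int start  = id * N / nthrds;
        int end    = (id + 1) * N / nthrds;
        for (int i = start; i < end; i++) a[i] *= s;
    }
}

/* Worksharing style: "omp for" divides the iterations among the team */
void scale_for(double *a, double s) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] *= s;
}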
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel private(x)
    {
        #pragma omp for reduction(+:sum)
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.22 seconds
— Reduction is more scalable than critical sections
— Loops must not have loop-carried dependencies (a limitation of compiler technology)
— "i" is private by default

Combined Parallel/Worksharing Construct
— OpenMP shortcut: put the "parallel" and the worksharing directive on the same line

double res[MAX]; int i;
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < MAX; i++) { res[i] = huge(); }
}

double res[MAX]; int i;
#pragma omp parallel for
for (i = 0; i < MAX; i++) { res[i] = huge(); }

OpenMP: Working With Loops
— Basic approach:
— Find compute-intensive loops
— Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies
— Place the appropriate OpenMP directive and test

double hoop = nthrds*step;
for (i = id, x = (i + 0.5)*step; i < num_steps; i = i + nthrds) {
    x += hoop;
    sum = sum + 4.0/(1.0 + x*x);
}

for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
}

OpenMP: Reductions
— Accumulating values into a single variable (sum) creates a true dependence between loop iterations that can't be trivially removed
— This is a very common situation ... it is called a "reduction"
— Support for reduction operations is included in most parallel programming environments

for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
}

reduction (op : list)
1. A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+")
2. Updates occur on the local copy
3. Local copies are reduced into a single value and combined with the original global value

#pragma omp for reduction(+:sum)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
}

OpenMP: Reduction Operators
— Many different associative operators can be used with reduction
— Initial values are the ones that make sense mathematically

Operator and initial value:
+ : 0
* : 1
- : 0
& : ~0
| : 0
^ : 0
&& : 1
|| : 0
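Roughly what reduction(+:sum) arranges for you, written out by hand (a sketch, not the actual compiler expansion): a private copy initialized to the identity for "+", updates on the copy, and a synchronized combine at the end.

#include <omp.h>

double sum_array(const double *a, int n) {
    double sum = 0.0;

    /* Hand-written equivalent of: #pragma omp parallel for reduction(+:sum) */
    #pragma omp parallel
    {
        double local = 0.0;               /* private copy, identity for "+" */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i];
        #pragma omp critical              /* combine local copies into the global */
        sum += local;
    }
    return sum;
}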
Example 8: Parallel For and Reduction

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main () {
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.22 seconds
— Advantage: we created a parallel program without changing any code, just by adding 2 simple lines of text!
— Disadvantages: 1) lots of pitfalls if you don't understand the system architecture; 2) the basic result may not be the fastest one can do

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

Data Environment in OpenMP
— Shared-memory programming model: most variables are shared by default
— Global variables are SHARED among threads:
— File-scope variables, static variables
— Dynamically allocated memory (ALLOCATE, malloc, new)
— But not everything is shared...
— Stack variables in functions called from parallel regions are PRIVATE
— Automatic variables within a statement block are PRIVATE

Example: Data Sharing

extern double A[10];
void work(int *index) {
    double temp[10];
    static int count;
    ...
}

double A[10];
int main() {
    int index[10];
    #pragma omp parallel
    work(index);
    printf("%d\n", index[0]);
}

— A, index, and count are shared by all threads
— temp is local to each thread

Changing Storage Attributes
— One can selectively change storage attributes for constructs using the following clauses:
— SHARED
— PRIVATE
— FIRSTPRIVATE
— The final value of a private variable inside a parallel loop can be transmitted to the shared variable outside the loop with:
— LASTPRIVATE
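A small sketch of how these clauses differ (hypothetical variables, not from the slides): private gives each thread an uninitialized copy, firstprivate copies the value in from before the region, and lastprivate copies the value out from the sequentially last iteration.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 1, b = 1, c = 1;

    #pragma omp parallel for private(a) firstprivate(b) lastprivate(c)
    for (int i = 0; i < 8; i++) {
        a = i;          /* private: each thread's a starts uninitialized   */
        b += i;         /* firstprivate: each thread's b starts at 1       */
        c = i;          /* lastprivate: c from iteration 7 is copied out   */
    }
    /* here a == 1 (unchanged), b == 1 (unchanged), c == 7 */
    printf("a=%d b=%d c=%d\n", a, b, c);
    return 0;
}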
Example 9: firstprivate Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel private(x) firstprivate(local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double hoop = nthrds*step;
        for (i = id, x = (i + 0.5)*step; i < num_steps; i = i + nthrds) {
            x += hoop;
            local_sum += 4.0/(1.0 + x*x);   // local_sum starts at 0.0 in each thread, copied in by firstprivate
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

The Scheduling Clause
— The schedule clause affects how loop iterations are mapped onto threads
— schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread
— schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled
— schedule(guided [,chunk]): threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds
— schedule(runtime): schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or the runtime library)

The Scheduling Clause: when to use
— STATIC: work per iteration is pre-determined and predictable by the programmer (least work at runtime: scheduling is done at compile time)
— DYNAMIC: unpredictable, highly variable work per iteration (most work at runtime: complex scheduling logic is used at run time)
— GUIDED: special case of dynamic, to reduce scheduling overhead
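When work per iteration is highly variable, schedule(dynamic) can balance the load at the cost of more runtime scheduling; a minimal sketch (work() is a hypothetical, variable-cost function, not from the slides):

#include <omp.h>

/* hypothetical work item whose cost varies strongly with i */
static double work(int i) {
    double s = 0.0;
    for (int j = 0; j < (i % 1000) * 100; j++) s += j * 1e-9;
    return s;
}

double total(int n) {
    double sum = 0.0;
    /* Each thread grabs chunks of 16 iterations from a shared queue,
     * so fast threads are not left idle waiting for slow ones. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += work(i);
    return sum;
}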
Example 9: Scheduling

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main () {
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum) schedule(static, 10000000)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.31 seconds

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

Synchronization: ordered
— The ordered region executes in sequential order
— Important for some scientific and optimization codes
— The order of a reduction may cause rounding differences

#pragma omp parallel private(tmp)
#pragma omp for ordered reduction(+:res)
for (I = 0; I < N; I++) {
    tmp = NEAT_STUFF(I);
    #pragma omp ordered
    res += consum(tmp);
}

Nested Parallelism
#include <stdio.h>
#include <omp.h>

void report_num_threads(int level) {
    #pragma omp single
    {
        printf("L %d: # threads in team %d\n", level, omp_get_num_threads());
    }
}

int main() {
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            report_num_threads(3);
        }
    }
    return(0);
}

http://download.oracle.com/docs/cd/E19205-01/819-5270/aewbc/index.html
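The same behavior can also be requested from inside the program rather than the environment; a sketch (omp_set_nested is the classic API for OpenMP 2.5/3.x runtimes; OpenMP 5.x prefers omp_set_max_active_levels):

#include <omp.h>

int main() {
    omp_set_nested(1);             /* same effect as OMP_NESTED=TRUE */
    omp_set_max_active_levels(3);  /* allow up to three nested levels of teams */
    /* ... parallel regions as in the example above ... */
    return 0;
}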

Nested Parallelism
— Compiling and running this program with nested parallelism enabled produces the following (sorted) output:
$ export OMP_NESTED=TRUE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 2
L 2: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2

$ export OMP_NESTED=FALSE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 1
L 3: # threads in team 1
L 2: # threads in team 1
L 3: # threads in team 1

OpenMP Environment Variables
— OMP_SCHEDULE=algorithm
— static[, n]
— dynamic[, n]
— guided[, n]
— runtime
— OMP_NUM_THREADS=num
— OMP_NESTED=TRUE|FALSE
— OMP_DYNAMIC=TRUE|FALSE
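These settings can also be inspected at run time through the corresponding library calls; a small sketch:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);   /* reflects OMP_SCHEDULE / schedule(runtime) */
    printf("max threads : %d\n", omp_get_max_threads());   /* OMP_NUM_THREADS */
    printf("dynamic     : %d\n", omp_get_dynamic());        /* OMP_DYNAMIC */
    printf("schedule    : kind=%d chunk=%d\n", (int)kind, chunk);
    return 0;
}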

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

k-means Algorithm Concurrency
1. Initialization: Randomly select k cluster centers
2. Expectation: Assign the closest center to each data point
— N (independent)
— k (min reduction)
— D (sum reduction)
3. Maximization: Update centers based on assignments
— D (independent)
— N (histogram computation into k bins)
4. Evaluate: Re-iterate steps 2-3 until convergence
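One possible OpenMP mapping of the expectation step (a sketch for illustration; the row-major data layout, function name, and parameter names are assumptions, not the project's reference code):

#include <float.h>
#include <omp.h>

/* Assign each of the N points to its nearest of the k centers. */
void expectation(int N, int k, int D,
                 const float *points,   /* N x D, row-major */
                 const float *centers,  /* k x D, row-major */
                 int *assign)           /* N                */
{
    #pragma omp parallel for            /* N: independent across points */
    for (int n = 0; n < N; n++) {
        float best = FLT_MAX;
        int best_c = 0;
        for (int c = 0; c < k; c++) {          /* k: min reduction per point */
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {      /* D: sum reduction           */
                float diff = points[n * D + d] - centers[c * D + d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; best_c = c; }
        }
        assign[n] = best_c;
    }
}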

Project 1: Search Among Alternatives
— What are the implementation alternatives?
— Different mapping of application concurrency to platform parallelism
— The search process
— More complex than mapping one application concurrency to one platform parallelism
— May want to sequentialize some operations:
— Some parallel operations are not as "work-efficient" as sequential operations
— One level of concurrency could map to multiple levels of parallelism

How is this relevant to writing fast code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— This lecture: An abstraction for multicore parallel programming
— Abstractions:
— Help establish a mental model for the programmers
— Make it easier to write parallel code
— Performance depends on a deep understanding of the implementation platform