
18-646 – How to Write Fast Code II?
Carnegie Mellon University Ian Lane

What we discussed last time:
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— Highlighted the difference between multicore and manycore platforms
— Discussed the multicore and manycore platform intricacies
— Exposing concurrency in the k-means algorithm
— Exploiting parallelism by exploring mappings from application to platform
— This lecture: An abstraction for multicore parallel programming

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

Multicore Shared-Memory Platforms
— Parallel resources available in:
— SIMD lanes in vector units in a core
— Simultaneous multithreading (SMT) in a core
— Cores within a chip

[Figure: a chip with multiple cores, each with its own cache, connected by a shared memory bus to DRAM; SIMD-level parallelism inside each core, core-level parallelism across the chip]

Exploiting Different Levels of Parallelism
— How do we write fast code to exploit the many levels of parallelism on a multicore platform?
— SIMD-level parallelism: exploited using vectorizing compilers and hand-coded intrinsics
— SMT-level parallelism: the OS abstracts it to core-level parallelism
— Core-level parallelism: exploited by using threads to describe work done on different cores

[Figure: the same chip diagram as on the previous slide (cores, caches, SIMD lanes)]
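To make these levels concrete, here is a minimal sketch (not from the slides; it assumes a compiler with OpenMP 4.0 "simd" support, otherwise the inner level is left to auto-vectorization or intrinsics):

#include <omp.h>

/* y[i] = a*x[i] + y[i]: "omp parallel for" spreads iterations across cores
 * (core-level parallelism); "simd" asks the compiler to use the vector
 * lanes within each core (SIMD-level parallelism). */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}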

What is a thread?
— A thread of execution is a unit of processing scheduled by the OS
— Allows system resources to be shared efficiently

Processes vs. threads:
— Processes exist independently in the OS; threads are subsets of processes
— A process has independent state: virtual memory space, handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution
— Threads in a process share the virtual address space and system resources, but have independent exception handlers, a scheduling priority, a unique thread identifier, and thread context
— Processes interact through system-provided inter-process communication mechanisms; threads interact through shared memory
— Both are preemptively scheduled by the OS

Landscape of Thread Programming
— POSIX threads
— POSIX: Portable Operating System Interface for Unix – IEEE standard
— Defines a set of C programming language types, functions, and constants
— Manually expose and manage all concurrency with the API
— OpenMP
— OpenMP: Open Multi-Processing – an API for shared-memory multiprocessing
— Programmer hints at the concurrency
— Compiler manages the parallelism
— Intel Cilk Plus, Intel Threading Building Blocks (TBB):
— Fork-join parallelism ("spawn" and "sync" in Cilk, parallel loops/containers in TBB)
— Programmer exposes the concurrency
— Runtime determines what is run in parallel
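For contrast with the OpenMP "hello world" shown later, a minimal POSIX-threads sketch (not from the slides): all concurrency is exposed and managed manually through the API.

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Each thread prints its ID; creation, argument passing, and joining are all explicit. */
void *hello(void *arg) {
    int id = *(int *)arg;
    printf("hello world from pthread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, hello, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Compile with gcc -pthread.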

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

OpenMP Basic Defs: Solution Stack
[Figure: the OpenMP solution stack]
— User layer: End User → Application
— Programming layer: Directives and compiler, OpenMP library, environment variables
— System layer: OpenMP runtime library; OS/system support for shared memory and threading
— Hardware layer: Proc1, Proc2, Proc3, ..., ProcN over a shared address space

Scope of OpenMP in this course
— I. OMP Intro: parallel regions
— II. Creating threads: parallel, default data environment, runtime library calls
— III. Synchronization: false sharing, critical, atomic
— IV. Parallel loops: for, schedule, reduction
— V. Odds and ends: single, sections, master, runtime libraries, environment variables, synchronization, etc.
— VI. Data environment: data environment details, software optimization
— VII. OpenMP 3 and tasks: tasks and other OpenMP 3 features
— VIII. Memory model: the need for flush, but its actual use is left to an appendix
— IX. Threadprivate: modular software

Spectrum of Granularity
[Figure (© Jike Chong 2009): spectrum of granularity. Implementation styles, from manycore throughput-optimized, multicore task-queue-based, and Pthread-based implementations to MPI-based and remote-procedure-call-based implementations, plotted against synchronization overhead in seconds (from on-die cache latency, ~10^-9 s, through shared off-chip memory and system latency, up to network latency, ~10^0 s) and task management overhead in instructions (from HW task queues, ~10^1, through SW task queues and OS processes, up to jobs over networks, ~10^9).]

Example 1: Hello World
Verifying that your OpenMP environment works
— Write a multithreaded program where each thread prints “hello world”.
#include <stdio.h>

void main() {
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
}

Example 1: SPMD with “omp parallel”
Verifying that your OpenMP environment works
— Write a multithreaded program where each thread prints “hello world”.
#include <stdio.h>
#include <omp.h>

void main() {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }
}

Switches for compiling and linking: gcc -fopenmp

Example 1: SPMD with “omp parallel”
A multi-threaded “Hello world” program
— Write a multithreaded program where each thread prints "hello world".

#include <omp.h>                        // OpenMP include file
void main() {
    #pragma omp parallel                // Parallel region with default number of threads
    {
        int ID = omp_get_thread_num();  // Runtime library function to return a thread ID
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }                                   // End of the parallel region
}

Sample Output:
hello(1) hello(0) world(1)
world(0)
hello(3) hello(2) world(3) world(2)

Example 1: SPMD with “omp parallel”
A multi-threaded “Hello world” program
— Write a multithreaded program where each thread prints “hello world”.
(The same multithreaded "hello world" code, viewed against the OpenMP solution stack: directives, the OpenMP library, and environment variables at the programming layer; the OpenMP runtime library and OS/system support for shared memory and threading at the system layer; and processors Proc1 ... ProcN over a shared address space at the hardware layer.)

Based on slides from Mattson et al., A "Hands-on" Introduction to OpenMP

OpenMP: Execution Pattern
[Figure: fork-join execution; the master thread (red) forks a team at each parallel region, including a nested parallel region, with sequential parts in between]

— Managed fork-join parallelism:
— Master thread creates a team of threads as needed
— Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program
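A minimal sketch of the fork-join pattern (hypothetical example, not from the slides): sequential parts run on the master thread alone, and a team is forked at each parallel region.

#include <stdio.h>
#include <omp.h>

int main() {
    printf("sequential part: master thread only\n");

    #pragma omp parallel   /* fork: a team of threads executes this region */
    printf("parallel region 1: thread %d\n", omp_get_thread_num());
    /* join: implicit barrier at the end of the region */

    printf("sequential part again: master thread only\n");

    #pragma omp parallel   /* fork a new team */
    printf("parallel region 2: thread %d\n", omp_get_thread_num());

    return 0;
}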

Thread Creation: Parallel Regions
— Threads in OpenMP are created with the parallel construct
— To create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);              // runtime function to request a certain number of threads
#pragma omp parallel
{
    int ID = omp_get_thread_num();   // runtime function returning a thread ID
    foo(ID, A);                      // each thread executes a copy of the code
}                                    //   within the structured block

— Each thread calls foo(ID, A) for ID = 0 to 3

Thread Creation: Parallel Regions
— Threads in OpenMP are created with the parallel construct
— To create a 4-thread parallel region:

double A[1000];
#pragma omp parallel num_threads(4)  // clause requesting a certain number of threads
{
    int ID = omp_get_thread_num();   // runtime function returning a thread ID
    foo(ID, A);                      // each thread executes a copy of the code
}                                    //   within the structured block

— Each thread calls foo(ID, A) for ID = 0 to 3

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

Example: Numerical Integration of Pi
Mathematically, we know that:
$\int_0^1 \frac{4}{1+x^2}\,dx = 4\tan^{-1}(x)\,\Big|_0^1 = \pi$
We can approximate the integral as a sum of rectangles:
[Figure: F(x) = 4.0/(1+x²) plotted on 0 ≤ x ≤ 1, approximated by rectangles]

$\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$

where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$.

Numerical Integration of Pi
#include <stdio.h>

static long num_steps = 100000;
double step;

int main () {
    int i;
    double x, pi, sum;
    step = 1.0/(double) num_steps;
    for (i = 0, sum = 0.0; i < num_steps; i++) {
        x = (i + 0.5)*step;                 // one step
        sum += 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

[Figure: one rectangle ("one step") under F(x) = 4.0/(1+x²) on [0, 1]]

Concurrency Opportunities
— What is parallelizable?
— Calculation of each step is independent
— One can map each of the steps to an independent thread
— What's wrong with this approach?
— Significant thread management overhead: computing each step as a separate thread costs hundreds of cycles of thread management per step

Concurrency Opportunities
— What is parallelizable?
— Calculation of each step is independent
— Alternative:
— Specify a fixed set of parallel entities, e.g. two threads (red and yellow threads)
— Parallelize the workload over the available number of threads
[Figure: rectangles alternately assigned to Thread 0 and Thread 1]

Implementation View
— Work is evenly distributed across two processors
[Figure: a CPU with two cores, each with an L1 cache, sharing an L2 cache and DRAM; each core handles half of the rectangles]

Example 2: Numerical Integration of Pi
#include <stdio.h>

static long num_steps = 100000000;
double step;

int main () {
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

Example 3: Critical Region
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            #pragma omp critical
            { sum += 4.0/(1.0 + x*x); }
        }
        if (id == 0) nthreads = nthrds;
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Why so slow? The critical section serializes every loop iteration and adds locking overhead to each one.
Sequential: 2.12 seconds; this version: 64.45 seconds

Example 4: Atomic Operation
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x, tmp;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            tmp = 4.0/(1.0 + x*x);
            #pragma omp atomic
            sum += tmp;
        }
        if (id == 0) nthreads = nthrds;
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Better, but why still so slow? Sequential: 2.12 seconds; this version: 4.93 seconds

Shared Memory Model (based on slides from Mattson et al., A "Hands-on" Introduction to OpenMP)
— All threads share an address space
— Multiple copies of data may be present in various levels of cache
[Figure: processors Proc1 ... ProcN, each with its own cache, above a shared memory holding a value at address "a"]

Shared Memory Model
— A piece of data (at address "a") can be:
— Written into memory by Thrd 2 at time t1
— Then read by Thrd 1 at time t2
— We say the data at address "a" is in a shared state

Managing Coherence Is Costly
— Synchronization protocols are implemented in hardware to manage coherence between caches
— e.g. the MESI protocol (Papamarcos & Patel 1984)
— Allows data to be in one of four states: Modified, Exclusive, Shared, Invalid
— Transparent to software, but can carry severe performance penalties

Threads Can Write to Different Addresses...
— Avoid the severe performance penalties induced by cache synchronization protocols when no sharing is required
— Technique: write to separate addresses during parallel computation

Example 5: Separate Result Storage
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi;
    double sum[NUM_THREADS];
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum[id] += 4.0/(1.0 + x*x);
        }
        if (id == 0) nthreads = nthrds;
    }
    for (i = 0, pi = 0.0; i < nthreads; i++)
        pi += sum[i] * step;
    printf("pi = %f \n", pi);
    return 0;
}
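This version keeps the two partial sums in adjacent array elements, so the threads' updates land on the same cache line and the line ping-pongs between the cores (false sharing). One common mitigation, not shown on these slides, is to pad each thread's slot out to a full cache line; a minimal sketch, assuming 64-byte lines (the padding factor of 8 doubles is an assumption):

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 2
#define PAD 8   /* assumption: 64-byte cache line = 8 doubles */

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0;
    double sum[NUM_THREADS][PAD];   /* each partial sum sits on its own cache line */
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id][0] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum[id][0] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0; i < nthreads; i++)
        pi += sum[i][0] * step;
    printf("pi = %f \n", pi);
    return 0;
}

The lecture's own fix, in the examples that follow, takes a different route: each thread accumulates into a scalar on its own stack and combines it into pi once, inside a critical section.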
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int nthreads;
    double pi = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x, local_sum = 0.0;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            local_sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.15 seconds

Example 6: Eliminate False Sharing
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        local_sum = 0.0;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            local_sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.15 seconds
Great! But is this the best we can do?

Example 7: Sequential Optimization
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum;
    step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        local_sum = 0.0;
        double hoop = nthrds*step;
        for (i = id, x = (i + 0.5)*step; i < num_steps; i = i + nthrds) {
            x += hoop;
            local_sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 0.46 seconds

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

SPMD vs. Worksharing
— A parallel construct by itself creates an SPMD ("Single Program Multiple Data") program
— The programmer must explicitly specify what each thread must do differently
— The division of work is hard-coded in the program
— Opportunity: many parallel regions are loops
— Question: can we make it easier to parallelize these loops?

OpenMP: Parallel For
— One of the worksharing constructs OpenMP provides is "omp for"
— Sequential code: for (i = 0; i < ...; i++) { ... }
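To make the contrast concrete, a sketch with a hypothetical array-scaling loop (not the slide's own example): the SPMD version partitions the iteration space by hand, while "omp for" does the division of work for you.

#include <omp.h>
#define N 100000

/* SPMD style: each thread computes its own block boundaries by hand */
void scale_spmd(double *a, double s) {
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int start  = id * N / nthrds;
        int end    = (id + 1) * N / nthrds;
        for (int i = start; i < end; i++) a[i] *= s;
    }
}

/* Worksharing style: "omp for" divides the iterations among the team */
void scale_for(double *a, double s) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] *= s;
}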
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel private(x)
    {
        #pragma omp for reduction(+:sum)
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.22 seconds
— Reduction is more scalable than critical sections
— Loops must not have loop-carried dependencies (a limitation of compiler technology)
— "i" is private by default

Combined Parallel/Worksharing Construct
— OpenMP shortcut: put the "parallel" and the worksharing directive on the same line

double res[MAX]; int i;
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < MAX; i++) { res[i] = huge(); }
}

double res[MAX]; int i;
#pragma omp parallel for
for (i = 0; i < MAX; i++) { res[i] = huge(); }

OpenMP: Working With Loops
— Basic approach:
— Find compute-intensive loops
— Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies
— Place the appropriate OpenMP directive and test

double hoop = nthrds*step;
for (i = id, x = (i + 0.5)*step; i < num_steps; i = i + nthrds) {
    x += hoop;
    sum = sum + 4.0/(1.0 + x*x);
}

for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
}

OpenMP: Reductions
— Accumulating values into a single variable (sum) creates a true dependence between loop iterations that can't be trivially removed
— This is a very common situation ... it is called a "reduction"
— Support for reduction operations is included in most parallel programming environments

for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
}

reduction (op : list)
1. A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+")
2. Updates occur on the local copy
3. Local copies are reduced into a single value and combined with the original global value

#pragma omp for reduction(+:sum)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
}

OpenMP: Reduction Operators
— Many different associative operators can be used with reduction
— Initial values are the ones that make sense mathematically

Operator and initial value:
+ : 0
* : 1
- : 0
& : ~0
| : 0
^ : 0
&& : 1
|| : 0
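Roughly what reduction(+:sum) arranges for you, written out by hand (a sketch, not the actual compiler expansion): a private copy initialized to the identity for "+", updates on the copy, and a synchronized combine at the end.

#include <omp.h>

double sum_array(const double *a, int n) {
    double sum = 0.0;

    /* Hand-written equivalent of: #pragma omp parallel for reduction(+:sum) */
    #pragma omp parallel
    {
        double local = 0.0;               /* private copy, identity for "+" */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i];
        #pragma omp critical              /* combine local copies into the global */
        sum += local;
    }
    return sum;
}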
Example 8: Parallel For and Reduction

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main () {
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.22 seconds
— Advantage: we created a parallel program without changing any code, just by adding 2 simple lines of text!
— Disadvantages: 1) lots of pitfalls if you don't understand the system architecture; 2) the basic result may not be the fastest one can do

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

Data Environment in OpenMP
— Shared-memory programming model: most variables are shared by default
— Global variables are SHARED among threads:
— File-scope variables, static variables
— Dynamically allocated memory (ALLOCATE, malloc, new)
— But not everything is shared...
— Stack variables in functions called from parallel regions are PRIVATE
— Automatic variables within a statement block are PRIVATE

Example: Data Sharing

extern double A[10];
void work(int *index) {
    double temp[10];
    static int count;
    ...
}

double A[10];
int main() {
    int index[10];
    #pragma omp parallel
    work(index);
    printf("%d\n", index[0]);
}

— A, index, and count are shared by all threads
— temp is local to each thread

Changing Storage Attributes
— One can selectively change storage attributes for constructs using the following clauses:
— SHARED
— PRIVATE
— FIRSTPRIVATE
— The final value of a private variable inside a parallel loop can be transmitted to the shared variable outside the loop with:
— LASTPRIVATE
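A small sketch of how these clauses differ (hypothetical variables, not from the slides): private gives each thread an uninitialized copy, firstprivate copies the value in from before the region, and lastprivate copies the value out from the sequentially last iteration.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 1, b = 1, c = 1;

    #pragma omp parallel for private(a) firstprivate(b) lastprivate(c)
    for (int i = 0; i < 8; i++) {
        a = i;          /* private: each thread's a starts uninitialized   */
        b += i;         /* firstprivate: each thread's b starts at 1       */
        c = i;          /* lastprivate: c from iteration 7 is copied out   */
    }
    /* here a == 1 (unchanged), b == 1 (unchanged), c == 7 */
    printf("a=%d b=%d c=%d\n", a, b, c);
    return 0;
}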
Example 9: firstprivate Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 100000000;
double step;

int main ()
{
    int i, nthreads;
    double pi = 0.0, x, local_sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel private(x) firstprivate(local_sum)
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double hoop = nthrds*step;
        for (i = id, x = (i + 0.5)*step; i < num_steps; i = i + nthrds) {
            x += hoop;
            local_sum += 4.0/(1.0 + x*x);   // local_sum starts at 0.0 in each thread, copied in by firstprivate
        }
        #pragma omp critical
        pi += local_sum * step;
    }
    printf("pi = %f \n", pi);
    return 0;
}

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

The Scheduling Clause
— The schedule clause affects how loop iterations are mapped onto threads
— schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread
— schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled
— schedule(guided [,chunk]): threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds
— schedule(runtime): schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or the runtime library)

The Scheduling Clause: when to use
— STATIC: work per iteration is pre-determined and predictable by the programmer (least work at runtime: scheduling is done at compile time)
— DYNAMIC: unpredictable, highly variable work per iteration (most work at runtime: complex scheduling logic is used at run time)
— GUIDED: special case of dynamic, to reduce scheduling overhead
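When work per iteration is highly variable, schedule(dynamic) can balance the load at the cost of more runtime scheduling; a minimal sketch (work() is a hypothetical, variable-cost function, not from the slides):

#include <omp.h>

/* hypothetical work item whose cost varies strongly with i */
static double work(int i) {
    double s = 0.0;
    for (int j = 0; j < (i % 1000) * 100; j++) s += j * 1e-9;
    return s;
}

double total(int n) {
    double sum = 0.0;
    /* Each thread grabs chunks of 16 iterations from a shared queue,
     * so fast threads are not left idle waiting for slow ones. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += work(i);
    return sum;
}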
Example 9: Scheduling

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main () {
    int i;
    double pi, x, sum = 0.0;
    step = 1.0/(double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum) schedule(static, 10000000)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum * step;
    printf("pi = %f \n", pi);
    return 0;
}

Sequential: 2.12 seconds; this version: 1.31 seconds

Outline — Multicore Shared-Memory Platforms — OpenMP – API for Shared-Memory Multi-processing — Overview — Example: numerical integration of Pi — Shared variable synchronization — SPMD vs worksharing — Data environment options — Advanced worksharing options — Synchronization options — How is This Relevant to Writing Fast Code?

Synchronization: ordered
— The ordered region executes in sequential order
— Important for some scientific and optimization codes
— The order of a reduction may cause rounding differences

#pragma omp parallel private(tmp)
#pragma omp for ordered reduction(+:res)
for (I = 0; I < N; I++) {
    tmp = NEAT_STUFF(I);
    #pragma omp ordered
    res += consum(tmp);
}

Nested Parallelism
#include <stdio.h>
#include <omp.h>

void report_num_threads(int level) {
    #pragma omp single
    {
        printf("L %d: # threads in team %d\n", level, omp_get_num_threads());
    }
}

int main() {
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            report_num_threads(3);
        }
    }
    return(0);
}

http://download.oracle.com/docs/cd/E19205-01/819-5270/aewbc/index.html
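The same behavior can also be requested from inside the program rather than the environment; a sketch (omp_set_nested is the classic API for OpenMP 2.5/3.x runtimes; OpenMP 5.x prefers omp_set_max_active_levels):

#include <omp.h>

int main() {
    omp_set_nested(1);             /* same effect as OMP_NESTED=TRUE */
    omp_set_max_active_levels(3);  /* allow up to three nested levels of teams */
    /* ... parallel regions as in the example above ... */
    return 0;
}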

Nested Parallelism
— Compiling and running this program with nested parallelism enabled produces the following (sorted) output:
$ export OMP_NESTED=TRUE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 2
L 2: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2

$ export OMP_NESTED=FALSE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 1
L 3: # threads in team 1
L 2: # threads in team 1
L 3: # threads in team 1

OpenMP Environment Variables
— OMP_SCHEDULE=algorithm
— static[, n]
— dynamic[, n]
— guided[, n]
— runtime
— OMP_NUM_THREADS=num
— OMP_NESTED=TRUE|FALSE
— OMP_DYNAMIC=TRUE|FALSE
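These settings can also be inspected at run time through the corresponding library calls; a small sketch:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);   /* reflects OMP_SCHEDULE / schedule(runtime) */
    printf("max threads : %d\n", omp_get_max_threads());   /* OMP_NUM_THREADS */
    printf("dynamic     : %d\n", omp_get_dynamic());        /* OMP_DYNAMIC */
    printf("schedule    : kind=%d chunk=%d\n", (int)kind, chunk);
    return 0;
}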

Outline
— Multicore Shared-Memory Platforms
— OpenMP – API for Shared-Memory Multi-processing
— Overview
— Example: numerical integration of Pi
— Shared variable synchronization
— SPMD vs worksharing
— Data environment options
— Advanced worksharing options
— Synchronization options
— How is This Relevant to Writing Fast Code?

k-means Algorithm Concurrency
1. Initialization: Randomly select k cluster centers
2. Expectation: Assign the closest center to each data point
— N (independent)
— k (min reduction)
— D (sum reduction)
3. Maximization: Update centers based on assignments
— D (independent)
— N (histogram computation into k bins)
4. Evaluate: Re-iterate steps 2-3 until convergence
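One possible OpenMP mapping of the expectation step (a sketch for illustration; the row-major data layout, function name, and parameter names are assumptions, not the project's reference code):

#include <float.h>
#include <omp.h>

/* Assign each of the N points to its nearest of the k centers. */
void expectation(int N, int k, int D,
                 const float *points,   /* N x D, row-major */
                 const float *centers,  /* k x D, row-major */
                 int *assign)           /* N                */
{
    #pragma omp parallel for            /* N: independent across points */
    for (int n = 0; n < N; n++) {
        float best = FLT_MAX;
        int best_c = 0;
        for (int c = 0; c < k; c++) {          /* k: min reduction per point */
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {      /* D: sum reduction           */
                float diff = points[n * D + d] - centers[c * D + d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; best_c = c; }
        }
        assign[n] = best_c;
    }
}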

Project 1: Search Among Alternatives
— What are the implementation alternatives?
— Different mapping of application concurrency to platform parallelism
— The search process
— More complex than mapping one application concurrency to one platform parallelism
— May want to sequentialize some operations:
— Some parallel operations are not as "work-efficient" as sequential operations
— One level of concurrency could map to multiple levels of parallelism

How is this relevant to writing fast code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— This lecture: An abstraction for multicore parallel programming
— Abstractions:
— Help establish a mental model for the programmers
— Make it easier to write parallel code
— Performance depends on a deep understanding of the implementation platform