18-646 – How to Write Fast Code II
Carnegie Mellon University, Ian Lane
What we discussed last time:
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Highlighted the difference between multicore and manycore platforms
Discussed the intricacies of multicore and manycore platforms
Exposing concurrency in the k-means algorithm
Exploiting parallelism by exploring mappings from application to platform
Example 7: Sequential Optimization
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
  double pi = 0.0, x, local_sum;
  step = 1.0/(double) num_steps;
  #pragma omp parallel num_threads(NUM_THREADS) private(x, local_sum)
  {
    int i, id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double hoop = nthrds*step;   /* sequential optimization: one add per iteration replaces the multiply in (i+0.5)*step */
    local_sum = 0.0;
    for (i = id, x = (id+0.5)*step; i < num_steps; i += nthrds, x += hoop) {
      local_sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    pi += local_sum * step;
  }
  printf("pi = %f\n", pi);
  return 0;
}
Sequential: 2.12 seconds
This version: 0.46 seconds
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
SPMD vs. Worksharing
A parallel construct by itself creates an SPMD program:
“Single Program Multiple Data”
The programmer must explicitly specify what each thread does differently; the division of work is hard-coded in the program.
Opportunity:
Many parallel regions are loops
Question:
Can we make it easier to parallelize these loops?
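For reference, a minimal SPMD-style sketch of the hard-coded division of work (the function scale_spmd and array a are made up for illustration): each thread uses its id to pick a cyclic slice of the iterations by hand, which is exactly what the worksharing constructs on the following slides automate.

#include <omp.h>

/* SPMD idiom: every thread runs the same code and uses its id
   to hand-pick a cyclic subset of the iterations. */
void scale_spmd(double *a, long n, double c)
{
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for (long i = id; i < n; i += nthrds)   /* hard-coded work division */
            a[i] *= c;
    }
}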
OpenMP: Parallel For
One of the worksharing constructs OpenMP provides is “omp for”: it splits the iterations of a sequential loop (for (i=0; i<N; i++) {...}) across the threads of the team.
Applied to the pi program:

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main()
{
  int i;
  double pi, x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel private(x)
  {
    #pragma omp for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
    }
  }
  pi = sum * step;
  printf("pi = %f\n", pi);
  return 0;
}

Sequential: 2.12 seconds
This version: 1.22 seconds

Notes:
“i” is private by default inside the worksharing loop.
The reduction on sum is more scalable than a critical section.
Loops must not have a loop-carried dependency; this is a limitation of compiler technology.
Combined Parallel/Worksharing Construct
OpenMP shortcut: Put the “parallel” and the worksharing directive on the same line
double res[MAX]; int i;
#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < MAX; i++) {
    res[i] = huge();
  }
}

is equivalent to:

double res[MAX]; int i;
#pragma omp parallel for
for (i = 0; i < MAX; i++) {
  res[i] = huge();
}
OpenMP: Working With Loops
Basic approach
Find compute-intensive loops
Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies
Place the appropriate OpenMP directive and test

SPMD version (the work division and the value of x depend on the thread id):
double hoop = nthrds*step;
for (i = id, x = (id+0.5)*step; i < num_steps; i += nthrds, x += hoop) {
  sum = sum + 4.0/(1.0+x*x);
}

Worksharing version (iterations are independent of the thread id):
for (i = 0; i < num_steps; i++) {
  x = (i+0.5)*step;
  sum = sum + 4.0/(1.0+x*x);
}
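To make “loop-carried dependency” concrete, here is a small hedged sketch (the arrays a and b and the bound n are made up): the first loop is a recurrence and cannot simply be annotated with omp for, while the second has independent iterations and can.

/* NOT safe for "omp for": each iteration reads the value written by the
   previous one -- a true loop-carried dependence. */
for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];

/* Safe for "omp for": every iteration touches only its own element. */
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = 2.0 * b[i];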
OpenMP: Reductions
Accumulating values into a single variable (sum) creates true dependence between loop iterations that can’t be trivially removed
This is a very common situation ... it is called a “reduction”.
Support for reduction operations is included in most parallel programming environments.
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
reduction (op : list)
1. A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”)
2. Updates occur on the local copy
3. Local copies are reduced into a single value and combined with the original global value
#pragma omp for reduction(+:sum)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
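As a rough mental model (a sketch of what the clause achieves, not what the runtime literally emits, reusing num_steps and step from the pi example), reduction(+:sum) behaves much like the hand-written pattern below: each thread accumulates into a private copy, and the copies are combined under synchronization at the end.

double sum = 0.0;
#pragma omp parallel
{
    double local = 0.0;                    /* private copy, initialized for "+" */
    #pragma omp for
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        local += 4.0/(1.0 + x*x);          /* updates go to the private copy */
    }
    #pragma omp critical
    sum += local;                          /* combine with the original global value */
}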
OpenMP: Reduction Operators
Many different associative operators can be used with reduction; the initial values are the ones that make sense mathematically:

Operator   Initial value
+          0
*          1
-          0
&          ~0
|          0
^          0
&&         1
||         0
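As an illustration of one of the non-arithmetic operators from the table (a hedged sketch; the array data and bound n are made up), a && reduction starts each private copy at 1 and ANDs the per-thread results together:

/* Check whether every element is positive, using a logical-AND reduction. */
int all_positive = 1;                       /* initial value for && is 1 */
#pragma omp parallel for reduction(&&:all_positive)
for (long i = 0; i < n; i++)
    all_positive = all_positive && (data[i] > 0);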
Example 8: Parallel For and Reduction
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main()
{
  int i;
  double pi, x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i = 0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = sum * step;
  printf("pi = %f\n", pi);
  return 0;
}
Sequential: 2.12 seconds
This version: 1.22 seconds

Advantage: we created a parallel program without changing any of the original code, just by adding two simple lines of text!
Disadvantages:
1) Lots of pitfalls if you don't understand the system architecture
2) The basic result may not be the fastest one can do
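One concrete pitfall, as a hedged sketch: dropping private(x) from the directive above makes x shared, so the threads race on it between the write and the read, and pi silently comes out wrong.

/* BROKEN: x is shared here, so threads overwrite each other's value
   between the assignment and the use -- a data race. */
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;            /* several threads write x ...       */
    sum = sum + 4.0/(1.0 + x*x);   /* ... and may read someone else's x */
}
/* Fix: keep private(x), or declare x inside the loop body. */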
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
Data Environment in OpenMP
Shared-memory programming model: most variables are shared by default
Global variables are SHARED among threads:
  File-scope variables, static variables
  Dynamically allocated memory (ALLOCATE, malloc, new)
But not everything is shared...
  Stack variables in functions called from parallel regions are PRIVATE
  Automatic variables within a statement block are PRIVATE
Example: Data Sharing
double A[10];
int main()
{
    int index[10];
    #pragma omp parallel
    work(index);
    printf("%d\n", index[0]);
}

extern double A[10];
void work(int *index)
{
    double temp[10];
    static int count;
    ...
}

A, index and count are shared by all threads.
temp is local to each thread.
Changing Storage Attributes
One can selectively change storage attributes for constructs using the following clauses:
  SHARED
  PRIVATE: each thread gets its own copy, uninitialized on entry
  FIRSTPRIVATE: each thread's copy is initialized from the value the variable had before the parallel region
The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
  LASTPRIVATE
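A small hedged sketch showing the three private-data clauses together (the variable names and values are made up for illustration):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double offset  = 10.0;    /* read-only input for every thread          */
    double scratch = 0.0;     /* per-thread temporary                      */
    double last    = 0.0;     /* we want the value from the last iteration */

    #pragma omp parallel for private(scratch) firstprivate(offset) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        scratch = i * 0.5;         /* private: undefined on entry, assigned before use   */
        last = offset + scratch;   /* firstprivate: offset starts at 10.0 in each thread */
    }                              /* lastprivate: 'last' keeps the value from i == 99   */

    printf("last = %f\n", last);   /* prints 59.5 */
    return 0;
}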
Example 9: firstprivate example
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
  double pi = 0.0, x, local_sum = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel private(x) firstprivate(local_sum)
  {
    /* firstprivate: each thread's local_sum starts at the value above, 0.0 */
    int i, id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double hoop = nthrds*step;
    for (i = id, x = (id+0.5)*step; i < num_steps; i += nthrds, x += hoop) {
      local_sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    pi += local_sum * step;
  }
  printf("pi = %f\n", pi);
  return 0;
}
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
The Scheduling Clause
The schedule clause affects how loop iterations are mapped onto threads:

schedule(static [,chunk])
  Deal out blocks of iterations of size “chunk” to each thread.
schedule(dynamic [,chunk])
  Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
schedule(guided [,chunk])
  Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.
schedule(runtime)
  Schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime library).
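A hedged sketch of when dynamic scheduling pays off (process_cell and the bound n are made up): the work per iteration varies widely, so letting threads grab small chunks from a queue balances the load better than a static split.

/* Work per iteration grows with i (triangular loop), so dynamic
   scheduling with a modest chunk keeps all threads busy. */
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < n; i++) {
    for (int j = 0; j <= i; j++)
        process_cell(i, j);        /* hypothetical per-cell work */
}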
The Scheduling Clause

Schedule clause   When to use
STATIC            Work is pre-determined and predictable by the programmer (least work at runtime: scheduling is done at compile time)
DYNAMIC           Unpredictable, highly variable work per iteration (most work at runtime: complex scheduling logic is used at run time)
GUIDED            Special case of dynamic, to reduce scheduling overhead
Example 9: Scheduling
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main()
{
  int i;
  double pi, x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum) schedule(static, 10000000)
  for (i = 0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = sum * step;
  printf("pi = %f\n", pi);
  return 0;
}

Sequential: 2.12 seconds
This version: 1.31 seconds (chunk size 10,000,000 over 100,000,000 iterations gives 10 chunks)
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
Synchronization: ordered
The ordered region executes in sequential order.
Important for some scientific code and optimization code:
the order of a reduction may cause rounding differences.

#pragma omp parallel private(tmp)
#pragma omp for ordered reduction(+:res)
for (I = 0; I < N; I++) {
    ...
}
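A minimal, self-contained sketch of the ordered construct (the function expensive() and the loop bounds are made up): the loop body runs in parallel, but the block marked ordered executes in the original iteration order.

#include <stdio.h>
#include <omp.h>

static double expensive(int i) { return (i % 7) * 0.001; }   /* hypothetical work */

int main(void)
{
    double res = 0.0;

    #pragma omp parallel for ordered reduction(+:res)
    for (int i = 0; i < 1000; i++) {
        double tmp = expensive(i);   /* runs in parallel, in any order     */
        #pragma omp ordered
        res += tmp;                  /* this block runs in iteration order */
    }

    printf("res = %f\n", res);
    return 0;
}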
Nested Parallelism

#include <stdio.h>
#include <omp.h>

void report_num_threads(int level)
{
    #pragma omp single
    { printf("L %d: # threads in team %d\n", level, omp_get_num_threads()); }
}

int main()
{
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            report_num_threads(3);
        }
    }
    return 0;
}

http://download.oracle.com/docs/cd/E19205-01/819-5270/aewbc/index.html
Nested Parallelism
Compiling and running this program with nested parallelism enabled produces the following (sorted) output:

$ export OMP_NESTED=TRUE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 2
L 2: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2
L 3: # threads in team 2

With nesting disabled, the inner regions each run with a team of one thread:

$ export OMP_NESTED=FALSE
$ ./experimentN
L 1: # threads in team 2
L 2: # threads in team 1
L 3: # threads in team 1
L 2: # threads in team 1
L 3: # threads in team 1
OpenMP Environment Variables
OMP_SCHEDULE=algorithm
  Used by loops that specify schedule(runtime); algorithm is one of:
  static[, n]
  dynamic[, n]
  guided[, n]
OMP_NUM_THREADS=num
OMP_NESTED=TRUE|FALSE
OMP_DYNAMIC=TRUE|FALSE
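The same settings can also be made from code through the OpenMP runtime library; a brief hedged sketch (these calls are standard OpenMP 3.0 API, shown with arbitrary example values):

#include <omp.h>

void configure_openmp(void)
{
    omp_set_num_threads(4);                  /* like OMP_NUM_THREADS=4            */
    omp_set_nested(1);                       /* like OMP_NESTED=TRUE              */
    omp_set_dynamic(0);                      /* like OMP_DYNAMIC=FALSE            */
    omp_set_schedule(omp_sched_dynamic, 16); /* like OMP_SCHEDULE="dynamic,16";
                                                used by schedule(runtime) loops   */
}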
Outline
Multicore Shared-Memory Platforms
OpenMP – API for Shared-Memory Multi-processing
  Overview
  Example: numerical integration of Pi
  Shared variable synchronization
  SPMD vs worksharing
  Data environment options
  Advanced worksharing options
  Synchronization options
Sequential Optimizations
  Cost Model
  Cost Reduction
Optimizing the Base Case
The base case is usually the guts of the algorithm that computes on data in caches:
  Should fit in registers
  Must use SIMD
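A hedged sketch of a SIMD-friendly base case (the function and array names are made up): omp simd asks the compiler to vectorize the inner loop, and the small fixed trip count lets the accumulator and block pointers stay in registers.

/* Dot-product base case on a small, cache-resident block. */
double dot_block(const double *restrict a, const double *restrict b)
{
    double acc = 0.0;
    #pragma omp simd reduction(+:acc)
    for (int i = 0; i < 64; i++)    /* fixed, small trip count */
        acc += a[i] * b[i];
    return acc;
}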
Cost Model
Algebraic model: +, -, /, *, sin, cos, etc.
Realistic model: +, -, /, *

int func(int i) {
    return i + 2;
}

struct A {
    double sum, prod;
};

struct A acc(int n, int array[]) {
    struct A res = {0.0, 1.0};
    for (int i = 0; i < n; ++i)
        ...
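The loop body above is cut off in the slide; here is a minimal sketch of the likely intent, which is a guess based on the field names and the {0.0, 1.0} initializer (accumulate a running sum and product of the array):

struct A {
    double sum, prod;
};

struct A acc(int n, int array[])
{
    struct A res = {0.0, 1.0};      /* identities for + and * */
    for (int i = 0; i < n; ++i) {
        res.sum  += array[i];       /* running sum            */
        res.prod *= array[i];       /* running product        */
    }
    return res;
}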