
CSC 367 Parallel Programming

Directive-based parallel programming with OpenMP
University of Toronto Mississauga, Department of Mathematical and Computational Sciences


Directive-based parallel programming
• Pthreads involve low-level programming
• Programmer must spawn them, assign them explicit work, wait for them to finish, etc.
• What if we had higher-level language constructs to automate some of the most common mechanics?
• Directive-based languages have been around for a while, but no standards
• OpenMP: a standard for directive-based parallel programming
• An API for C, C++, Fortran, to simplify parallel programming on shared memory machines
• Support for parallel constructs, concurrency, synchronization, etc.
• Much less programming effort, more "automated" parallelization

Execution model: Fork-and-Join
• Start with a single (master) thread
• Fork: create a group of worker threads
  • Everything in a parallel block is executed by every thread
• Join: all threads synchronize at the end of the parallel block
  • Execution continues with only the initial (master) thread
(Figure: the master thread runs a sequential block, forks worker threads for a parallel block, all threads meet at a sync point, then the pattern repeats: sequential block, parallel block, sync point)
• We've seen this pattern before with pthreads …
• Threads can do the exact same work, share the same tasks (work sharing), or perform distinct tasks in parallel!

Reminder: Shared Memory Model
• All worker threads share the same address space
(Figure: threads T0 … Tn-1 all attached to one shared memory. Which variables are private data, and which are shared data?)
• OpenMP provides the ability to declare variables private or shared within a parallel block (more on this later …)

OpenMP Programming Basics
• Programming model: provide "hints" or "directives" to the compiler as to what you intend to parallelize
• C/C++ #pragma compiler directives, followed by various clauses:
  #pragma omp directivename [clause list]
• Must include OpenMP function headers
  #include <omp.h>
• Compile like a regular C program, link with the omp library
  gcc -fopenmp … [-std=c99/gnu99]
• Debug with gdb and valgrind, or dedicated parallel debuggers e.g., TotalView, DDT, etc.

Parallel directive
• Regular serial execution until it hits a parallel directive
• Creates a group of threads; the main thread becomes the master of the group

// Serial code
#pragma omp parallel [clause list]
{
    // Code executed by all threads
}

A first parallel program
• Default number of threads to be spawned is stored in the environment variable OMP_NUM_THREADS
  • Useful to set a default

int main(int argc, char *argv[]) {
    printf("Starting a parallel region, spawning threads\n");
    #pragma omp parallel
    printf("Hello world\n");
    return 0;
}

$ export OMP_NUM_THREADS=8
$ gcc -o hello hello.c -fopenmp

• All threads execute the exact same code from the parallel region

A first parallel program
• A parallel region spans a block of code or a single line (no braces needed for the latter)

int main() {
    printf("Starting a parallel region, spawning threads\n");
    #pragma omp parallel
    printf("Hello world\n");
    printf("Bye world\n");
    return 0;
}

$ export OMP_NUM_THREADS=8
$ gcc -o hello hello.c -fopenmp

• What's the output of this code?

OpenMP: language extension + library
• Language extensions: #pragma omp (ignored if not compiled with -fopenmp)
• Library functions (must include omp.h, even if compiled with -fopenmp):
int omp_get_num_threads(); /* # of threads running when this is invoked */
/* refers to closest enclosing parallel block*/
void omp_set_num_threads(int n); /* # of threads to use in the next
parallel section */
int omp_get_max_threads(); /* max number of threads that can be created */
int omp_get_thread_num(); /* thread id in a group of threads */
int omp_get_num_procs(); /* number of processors available */
int omp_in_parallel(); /* non-zero if called within a parallel region */
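As a quick illustration (my own minimal sketch, not from the slides), the query functions above can be combined as follows; compile with gcc -fopenmp:

#include <stdio.h>
#include <omp.h>

int main() {
    printf("procs = %d, max threads = %d, in parallel? %d\n",
           omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());

    omp_set_num_threads(4);   // request 4 threads for the next parallel region
    #pragma omp parallel
    {
        printf("thread %d of %d, in parallel? %d\n",
               omp_get_thread_num(), omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}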

A first parallel program
• Threads run in parallel - no guarantees about ordering

int main() {
    printf("Starting a parallel region, spawning threads\n");
    #pragma omp parallel
    printf("Hello world, I am thread %d out of %d!\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

A first parallel program
• Threads run in parallel - no guarantees about ordering

int main() {
    printf("Starting a parallel region, spawning threads\n");
    #pragma omp parallel
    printf("Hello world, I am thread %d out of %d running threads!\n",
           omp_get_thread_num(), omp_get_num_threads());
    printf("There are %d threads running!\n", omp_get_num_threads());
    return 0;
}

• What's the output of this code? Notice anything?

Parallel directive clauses
• Conditional parallelization: if clause (only one!)
  • Only create threads if an expression holds
  #pragma omp parallel if(to_parallelize == 1)
• Degree of concurrency: num_threads clause
  • How many threads to spawn, overrides the default OMP_NUM_THREADS
  #pragma omp parallel num_threads(8)
• Data handling: private / firstprivate / shared clauses (different visibility)
  #pragma omp parallel private(v1) shared(v2,v3) firstprivate(v4)
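A small sketch of my own (not from the slides) combining the if and num_threads clauses; the variable name to_parallelize is just for illustration:

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int to_parallelize = (argc > 1);   // e.g., only parallelize if an argument is given

    // Threads are spawned only when the if() expression is non-zero;
    // otherwise the block runs on the master thread alone.
    #pragma omp parallel if(to_parallelize == 1) num_threads(4)
    {
        printf("Running with %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}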

Variable semantics – summary
• Semantics of each variable
• Shared: all threads share the same copy of the variable(s)
• Private: variable(s) are local to each thread
• Firstprivate: like private, but the value of each copy is initialized to the value before the parallel directive
long timeNoSee;
bool dozer;
double rainbow = 2.0;
#pragma omp parallel if (is_parallel == 1) num_threads(8) \
        shared(timeNoSee) private(dozer) firstprivate(rainbow)
// parallel execution
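To make the three kinds of visibility concrete, here is a small runnable sketch of my own (not from the original slides); note that a private copy starts uninitialized, while a firstprivate copy starts with the value from before the region:

#include <stdio.h>
#include <omp.h>

int main() {
    long n_items = 100;     // shared: one copy, read by all threads
    int  my_tmp;            // private: each thread gets its OWN, UNINITIALIZED copy
    double scale = 2.0;     // firstprivate: each thread's copy starts at 2.0

    #pragma omp parallel num_threads(4) \
            shared(n_items) private(my_tmp) firstprivate(scale)
    {
        my_tmp = omp_get_thread_num();   // must assign before use: private copies start undefined
        scale += my_tmp;                 // each thread updates only its own copy
        printf("thread %d: n_items=%ld, my_tmp=%d, scale=%.1f\n",
               omp_get_thread_num(), n_items, my_tmp, scale);
    }
    // Per-thread changes to private/firstprivate copies are discarded after the region.
    printf("after: scale = %.1f\n", scale);
    return 0;
}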

Equivalent code: pthreads / OpenMP
• Which one seems easier to write?

// pthreads version
long timeNoSee = 0;

void* func(void* arg) {
    bool dozer = 0;
    double rainbow = *(double*)arg;
    // parallel execution
}

int main() {
    // serial execution
    double rainbow = 2.0;
    for(i = 0; i < 8; i++) {
        pthread_create(... func, &rainbow);
    }
    for(i = 0; i < 8; i++) {
        pthread_join(...);
    }
    // other serial execution
}

// OpenMP version
long timeNoSee;
bool dozer;
double rainbow;

int main() {
    // serial execution
    rainbow = 2.0;
    #pragma omp parallel num_threads(8) \
            shared(timeNoSee) private(dozer) \
            firstprivate(rainbow)
    {
        // parallel execution
    }
    // other serial execution
}

Careful with variable state
• Think carefully about the intended variable visibility between threads
• Should a variable be shared or private?
  • Private: a thread can modify its own copy without impacting other threads
  • Shared: all threads see the exact same copy, tricky if the data is not read-only
• Dramatic impact on correctness if you get this wrong!
Src: https://betanews.com/2015/10/23/how-to-fix-windows-10s-worst-problems-part-2/

Default state
• If not explicitly indicated
  • Variables declared within a parallel block are implicitly private
  • Variables declared outside a parallel block become shared when the parallel region starts (with some exceptions like some loop counters .. more later)
• Can specify a default state using the default(...) clause
  • default(shared) = by default, a variable is shared by all threads
  • default(none) = must explicitly specify how every single variable used in the parallel region should be handled, otherwise compile errors are raised
• Safer to use default(none) to force you to specify the intended visibility
  • Recommended to avoid possible bugs from unintentionally sharing data

reduction clause
• Important primitive – supported seamlessly using the reduction clause
• Multiple local copies of a variable are combined into a single copy at the master when the parallel block ends
  reduction(operator: variable list)
• Operators: +, *, -, &, |, ^, &&, ||

int s = 0;
#pragma omp parallel reduction(+:s) num_threads(8)
{
    // compute sum s
    s += ...
}
// variable s now has the sum of all local copies

Silly Simple example
• Calculate the sum of the thread ids
• Simple but useful construct for more complex computations (as we'll see later ..)

int sum = 0;
#pragma omp parallel reduction(+: sum) num_threads(8)
{
    int tid = omp_get_thread_num();
    sum += tid;   // Each thread gets a private sum copy
}
printf("Sum of thread ids = %d\n", sum);

More complex example
• Calculate the dot product of two arrays 'a' and 'b' of length 'size'
• Can use the reduction clause for the parallel directive
• Notice that we have to explicitly partition the data, and make sure that each thread only operates on its assigned chunk

int dotprod = 0;
#pragma omp parallel shared(size, a, b) \
        reduction(+: dotprod) num_threads(8)   // Each thread gets a private dotprod copy
{
    int nthr = omp_get_num_threads();
    int chunk = (size+nthr-1) / nthr;          // What's this all about?
    int tid = omp_get_thread_num();
    // Loop counter i is implicitly private
    for(int i = tid*chunk; i < (tid+1)*chunk && i < size; i++) {
        dotprod += a[i] * b[i];
    }
}

For loops in OpenMP
• Notice something in the previous example?
• Every thread executes the same loop, just on different value ranges of a and b
• A simple example:

int i, size = 8;
int *a = (int*)malloc(size*sizeof(int));
for(i = 0; i < size; i++) { a[i] = i+1; }
#pragma omp parallel shared(size, a) private(i) num_threads(8)
{
    int tid = omp_get_thread_num();
    for(i = 0; i < size; i++) {
        printf("TID[%d] - a[%d] = %d\n", tid, i, a[i]);
    }
}

• What will this print?
• Kind of useless ...
• Instead, we'd like to divvy up the work (the loop iterations) and print different items

Concurrent tasks – loop scheduling
• Use parallel directive to create concurrency across iterations
  • Recall task parallelism!
• Then, automatically divvy up the iterations across threads using the for directive

#pragma omp for [clause list]
// for loop

• Clauses:
  • private, firstprivate, lastprivate
  • reduction
  • schedule
• private and firstprivate – same semantics as for the parallel directive
• lastprivate handles writing back a single copy from multiple copies

for directive
• A simple example:

int i, size = 8;
int *a = (int*)malloc(size*sizeof(int));
for(i = 0; i < size; i++) { a[i] = i+1; }
#pragma omp parallel shared(size, a) private(i) num_threads(8)
{
    int tid = omp_get_thread_num();
    #pragma omp for
    for(i = 0; i < size; i++) {
        printf("TID[%d] - a[%d] = %d\n", tid, i, a[i]);
    }
}

• What will this print?
• Loop iterations partitioned across threads now
• Each thread takes care of different iterations

for directive
• No need to explicitly partition arrays like before, all done automatically!

// Manual partitioning (as before)
int dotprod = 0;
#pragma omp parallel \
        shared(size, a, b) \
        reduction(+: dotprod) \
        num_threads(8)
{
    int nthr = omp_get_num_threads();
    int chunk = (size+nthr-1) / nthr;
    int tid = omp_get_thread_num();
    for(int i = tid*chunk; i < (tid+1)*chunk && i < size; i++) {
        dotprod += a[i] * b[i];
    }
}

// Using the for directive
int dotprod = 0;
#pragma omp parallel \
        shared(size, a, b) \
        reduction(+: dotprod) \
        num_threads(8)
{
    #pragma omp for
    for(int i = 0; i < size; i++) {
        dotprod += a[i] * b[i];
    }
}

• Easier to code than spawning pthreads => the only differences between the sequential code and the parallel code are the two #pragma directives
• Once again, directive-based parallelization simplifies converting serial to parallel code!

• Q1: Parallelize the given code using the constructs learnt so far …

Scheduling work to threads
• The schedule clause – ways to assign iterations to threads
  schedule(scheduling_class[, parameter])
• Scheduling classes
  • static
  • dynamic
  • guided
  • runtime
• Recall models studied in the early lectures and A2!

Scheduling classes
• Example: matrix multiplication (assume n x n square)
for(i = 0; i < n; i++) {
    for(j = 0; j < n; j++) {
        c[i][j] = 0;
        for(k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

• How can we parallelize this sequential code? Thoughts?

Static scheduling
• Split iterations into equal chunks of size S, assign to threads round-robin
• If no chunk size specified, divide into as many chunks as threads

#pragma omp parallel default(none) shared(a, b, c, n) num_threads(8)
{
    #pragma omp for schedule(static)
    for(int i = 0; i < n; i++) {
        for(int j = 0; j < n; j++) {
            c[i][j] = 0;
            for(int k = 0; k < n; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

• Outer loop is split into 8 chunks
• e.g., for n = 1024, chunk = 128 iterations of the outer loop => same as schedule(static,128)

Static scheduling
• Consider a for loop with 32 iterations and num_threads(4); 1 box = 1 iteration
• schedule(static), or schedule(static,8): chunk size = 8; T0 takes iterations 0-7, T1 takes 8-15, T2 takes 16-23, T3 takes 24-31
• schedule(static,4): chunk size = 4; chunks assigned round-robin T0, T1, T2, T3, T0, T1, T2, T3
• schedule(static,2): chunk size = 2; chunks assigned round-robin
• schedule(static,1): chunk size = 1; iterations assigned round-robin one at a time
• Appropriate when all iterations have the same computational cost.
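The assignment described above can also be observed at runtime with a small probe (my own sketch, not from the slides): each thread records which iterations it executed under schedule(static,4).

#include <stdio.h>
#include <omp.h>

#define N 32

int main() {
    int owner[N];   // owner[i] = id of the thread that executed iteration i

    #pragma omp parallel num_threads(4)
    {
        #pragma omp for schedule(static, 4)
        for (int i = 0; i < N; i++) {
            owner[i] = omp_get_thread_num();
        }
    }

    // Prints T0 T0 T0 T0 T1 T1 T1 T1 ... : chunks of 4 handed out round-robin
    for (int i = 0; i < N; i++) {
        printf("iter %2d -> T%d\n", i, owner[i]);
    }
    return 0;
}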

Dynamic scheduling
• Load per iteration may not be balanced => equally partitioned tasks may take different execution time
• Split iterations into chunks, assign a new chunk to a thread only when it is idle
• If no chunk size specified, the default chunk is a single iteration

#pragma omp parallel default(none) shared(a, b, c, n) num_threads(8)
{
    #pragma omp for schedule(dynamic)
    for(int i = 0; i < n; i++) {
        for(int j = 0; j < n; j++) {
            c[i][j] = 0;
            for(int k = 0; k < n; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

• What was this task model called?

Dynamic scheduling
• Consider a for loop with 32 iterations and num_threads(4); 1 box = 1 iteration
• schedule(dynamic), or schedule(dynamic,1): chunk size = 1; an idle thread grabs the next iteration
• schedule(dynamic,2): chunk size = 2; an idle thread grabs the next two iterations
• Appropriate when iterations have different computational costs.

Guided scheduling
• Imagine 1000 iterations and chunk size = 50 => 20 chunks
• 16 threads => 12 threads get 1 chunk each, 4 threads get 2 each
  • Load imbalance: 12 threads are idle for possibly a long time!
• Guided scheduling idea: start with big chunks but reduce size as computation progresses
• Chunks get smaller and smaller as computation progresses, if load gets imbalanced
  schedule(guided[, chunksize])
• Parameter chunksize specifies the minimum size chunk to use
• If not specified, default chunksize is 1

Guided scheduling
• Consider a for loop with 32 iterations and num_threads(4); 1 box = 1 iteration
• schedule(guided), or schedule(guided,1): min chunk size = 1; chunks start large and shrink as the computation progresses
• schedule(guided,2): min chunk size = 2; chunks start large and shrink, but never below 2 iterations
• Appropriate when iterations are not well balanced. Smaller chunks at the end fill the schedule to deal with poor load balancing.
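The same probe used for static scheduling can be reused here (again my own sketch, not from the slides); only the schedule clause changes, and the exact assignment now varies from run to run:

#include <stdio.h>
#include <omp.h>

#define N 32

int main() {
    int owner[N];

    #pragma omp parallel num_threads(4)
    {
        // Try schedule(dynamic, 2) or schedule(guided, 2) and compare the output:
        // dynamic hands out fixed-size chunks to whichever thread is idle,
        // guided starts with large chunks that shrink toward the minimum size.
        #pragma omp for schedule(guided, 2)
        for (int i = 0; i < N; i++) {
            owner[i] = omp_get_thread_num();
        }
    }

    for (int i = 0; i < N; i++) {
        printf("iter %2d -> T%d\n", i, owner[i]);
    }
    return 0;
}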

Runtime scheduling
• Delay scheduling decision until runtime
• Environment variable OMP_SCHEDULE determines class and chunksize
• When no scheduling class is specified => implementation dependent
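A minimal sketch of my own (not from the slides): with schedule(runtime), the class and chunk size are read from OMP_SCHEDULE when the loop starts.

#include <stdio.h>
#include <omp.h>

int main() {
    double sum = 0.0;

    // Pick the class and chunk size at run time, e.g.:
    //   $ export OMP_SCHEDULE="dynamic,4"
    //   $ ./a.out
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for schedule(runtime) reduction(+: sum)
        for (int i = 0; i < 1000; i++) {
            sum += i;
        }
    }
    printf("sum = %.0f\n", sum);   // 499500 regardless of the schedule chosen
    return 0;
}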

Restrictions on for directive

#pragma omp for
for (int i = 0; i < n; i++) {
    // statements
}

• Loop counter must be an integer type!
• Initialization must be an integer assignment
• Condition must be a <, >, <=, or >= expression
• Must have integer increments only
• No break instructions!
• Compiler will typically point you to the problem
• Refer to the OpenMP manual when in doubt!
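For instance (a sketch of my own, not from the slides), the loop below satisfies this canonical form, while the commented variants would be rejected:

#include <stdio.h>
#include <omp.h>

int main() {
    int n = 16;
    long total = 0;

    #pragma omp parallel num_threads(4)
    {
        #pragma omp for reduction(+: total)
        for (int i = 0; i < n; i += 2) {   // integer counter, < test, integer increment: OK
            total += i;
        }

        // These forms are NOT allowed with omp for (per the restrictions above):
        //   for (double x = 0.0; x < 1.0; x += 0.1)          -> counter must be an integer type
        //   for (int i = 0; i != n; i++)                     -> condition must be <, >, <= or >=
        //   for (int i = 0; i < n; i++) { if (...) break; }  -> no break out of the loop
    }
    printf("total = %ld\n", total);
    return 0;
}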

Sequences of for directives
• A sequence of for directives implicitly adds a barrier after each one

#pragma omp parallel
{
    // Spawn threads
    #pragma omp for
    for(…) {
        // Parallel execution of loop
    }
    #pragma omp for
    for(…) {
        // Parallel execution of loop
    }
}

nowait clause
• Let threads proceed without an implicit barrier
• Might be useful if there is no need to wait for results from the previous loop

#pragma omp parallel
{
    // Spawn threads
    #pragma omp for nowait
    for(…) {
        // Parallel execution of loop
    }
    #pragma omp for
    for(…) {
        // Parallel execution of loop
    }
}
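A concrete sketch of my own (assuming the two loops touch independent arrays, so skipping the barrier is safe):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main() {
    static double a[N], b[N];

    #pragma omp parallel num_threads(4)
    {
        // The second loop does not read the results of the first one,
        // so the implicit barrier between them can be dropped with nowait.
        #pragma omp for nowait
        for (int i = 0; i < N; i++) {
            a[i] = i * 2.0;
        }

        #pragma omp for
        for (int i = 0; i < N; i++) {
            b[i] = i + 1.0;
        }
    }   // the parallel region still ends with a barrier

    printf("a[10] = %.1f, b[10] = %.1f\n", a[10], b[10]);
    return 0;
}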

sections directive
• So far, threads do the same work: same block, or the same loop code
• What about diverging tasks?

#pragma omp parallel
{
    #pragma omp sections [clauses]
    {
        // Specify different tasks as "section" blocks
        #pragma omp section
        { // task 1
        }
        #pragma omp section
        { // task 2
        }
        #pragma omp section
        { // task n
        }
    }
}

• Tasks scheduled in parallel
  – each thread gets 1 or more sections
  – 1 or more threads assigned to each section
• Clauses:
  – private, firstprivate, lastprivate
  – reduction
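A self-contained sketch of my own (not from the slides) with three diverging tasks, one per section; with 3 or more threads, each section typically runs on a different thread:

#include <stdio.h>
#include <omp.h>

#define N 8

int main() {
    int a[N], sum = 0, max = -1;

    #pragma omp parallel num_threads(3) shared(a, sum, max)
    {
        #pragma omp sections
        {
            #pragma omp section
            {   // task 1: initialize the array
                for (int i = 0; i < N; i++) a[i] = i + 1;
                printf("init   done by thread %d\n", omp_get_thread_num());
            }
            #pragma omp section
            {   // task 2: some independent work
                sum = 100;
                printf("task 2 done by thread %d\n", omp_get_thread_num());
            }
            #pragma omp section
            {   // task 3: more independent work
                max = 999;
                printf("task 3 done by thread %d\n", omp_get_thread_num());
            }
        }   // implicit barrier at the end of sections
    }
    printf("sum = %d, max = %d\n", sum, max);
    return 0;
}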

Merging directives
• parallel needed to spawn threads => merge it with for or sections

// Separate directives ...
#pragma omp parallel shared(n)
{
    #pragma omp for
    for(int i = 0; i < n; i++) {
        // parallel execution
    }
}

// ... merged into one
#pragma omp parallel for shared(n)
for(int i = 0; i < n; i++) {
    // parallel execution
}

// Separate directives ...
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        { // task 1
        }
        #pragma omp section
        { // task 2
        }
        #pragma omp section
        { // task n
        }
    }
}

// ... merged into one
#pragma omp parallel sections
{
    #pragma omp section
    { // task 1
    }
    // ... remaining sections as above
}
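Putting the merged forms together in a runnable sketch (my own example, assuming a simple vector scale plus two independent tasks):

#include <stdio.h>
#include <omp.h>

#define N 16

int main() {
    int n = N;
    double v[N];

    // parallel + for merged into one directive
    #pragma omp parallel for shared(n, v) num_threads(4)
    for (int i = 0; i < n; i++) {
        v[i] = 0.5 * i;
    }

    // parallel + sections merged into one directive
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        { printf("task 1 on thread %d\n", omp_get_thread_num()); }
        #pragma omp section
        { printf("task 2 on thread %d\n", omp_get_thread_num()); }
    }

    printf("v[8] = %.1f\n", v[8]);
    return 0;
}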