
Parallel Computing with GPUs: OpenMP
Dr Paul Richmond

http://paulrichmond.shef.ac.uk/teaching/COM4521/

Last Lecture

We looked at how to make programs fast on a single core

But we didn’t consider parallelism

Guess what we are doing today?

Multicore systems and OpenMP

Parallelising Loops

Critical Sections and Synchronisation

Scoping

Data vs Task Parallelism

Multicore systems

Multi-core CPUs are a shared memory system
Each CPU has access to main memory

Each CPU can access all of the memory (hence shared)

Each CPU core has its own L1 and L2 cache (the larger L3 cache is shared)
This can cause a lack of coherence

If one core modifies its cache this might need to be synchronised on other cores

[Diagram: two CPU cores (Core 0 running hardware threads T0 and T1, Core 1 running T2 and T3), each with its own L1 instruction cache, L1 data cache (32KB) and L2 cache (256KB), sharing an L3 cache (8MB) connected to main memory]

OpenMP

Open Multi-Processing Standard
An API that supports shared memory programming in C, C++ and FORTRAN

Cross platform support using native threading
Higher level than OS models and portable

Is not suitable for distributed computing (look at MPI)

It is not an automatic parallel programming language
Parallelism is explicitly defined and controlled by the programmer

Requires compiler directives, a runtime library and environment variables

[Diagram: the OpenMP stack — the application and compiler sit above the OpenMP runtime and environment variables, which map onto the platform threading model (e.g. Windows threading or pthreads)]

OpenMP Compiler Directives

Use of #pragmas
If not understood by the compiler then they are ignored

Does not require serial code to be changed

Allows behaviour to be specified which is not part of the C specification

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        printf("Hello World\n");
    }
    return 0;
}
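OpenMP support normally has to be enabled when compiling. As a hedged note (the exact option depends on your toolchain): gcc and clang build the example above with the -fopenmp flag (e.g. gcc -fopenmp hello.c, where hello.c is an illustrative file name), while Microsoft Visual C++ uses /openmp. Without such a flag the pragma is simply ignored, as described above, and the program runs serially.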

Extending OpenMP Hello World

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        int max_threads = omp_get_max_threads();
        printf("Hello World (Thread %d of %d)\n", thread, max_threads);
    }
    return 0;
}

Hello World (Thread 5 of 8)

Hello World (Thread 6 of 8)

Hello World (Thread 2 of 8)

Hello World (Thread 7 of 8)

Hello World (Thread 1 of 8)

Hello World (Thread 0 of 8)

Hello World (Thread 3 of 8)

Hello World (Thread 4 of 8)
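One caveat with the example above: omp_get_max_threads() returns the number of threads available for a new parallel region, not necessarily the size of the team that is actually executing; inside a region the team size is returned by omp_get_num_threads(). A minimal variant using the team size (otherwise identical to the code above):

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        int team_size = omp_get_num_threads();   /* actual number of threads in this team */
        printf("Hello World (Thread %d of %d)\n", thread, team_size);
    }
    return 0;
}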

Fork and Join

OpenMP uses a fork and join model
Fork: Creates a number of parallel threads from a master thread

Master thread is always thread 0

No guarantee of order

Join: Synchronises thread termination and returns program control to master

[Diagram: the master thread forks into a team of threads; at the join the team synchronises and control returns to the master thread, and the fork and join can repeat for successive parallel regions]
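A minimal sketch of this structure (the messages are purely illustrative): everything outside the parallel region runs on the master thread, the region itself is executed by the whole team, and the implicit join returns control to the master.

#include <stdio.h>
#include <omp.h>

int main()
{
    printf("Before the fork: master thread only\n");

    #pragma omp parallel        /* fork: a team of threads is created */
    {
        printf("In the team: thread %d\n", omp_get_thread_num());
    }                           /* join: threads synchronise, control returns to the master */

    printf("After the join: master thread only\n");
    return 0;
}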

Terminology

[Diagram: the fork and join picture again; the threads created at the fork are labelled as the team of threads]

Terminology

[Diagram: within the team, the original thread is labelled the master thread and the additional threads spawned at the fork are labelled slave threads; control returns to the master thread at the join]

Multicore systems and OpenMP

Parallelising Loops

Critical Sections and Synchronisation

Scoping

Data vs Task Parallelism

OpenMP Syntax

Parallel region directive
#pragma omp parallel [clause list] {structured block}

Spawns a number of parallel threads

Clauses
Are used to specify modifications to the parallel directive e.g.

Control scoping of variables in multiple threads

Dictate the number of parallel threads (example below)

Conditional parallelism (see the if clause sketch after the example below)

#pragma omp parallel num_threads(16)
{
    int thread = omp_get_thread_num();
    int max_threads = omp_get_max_threads();
    printf("Hello World (Thread %d of %d)\n", thread, max_threads);
}
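The conditional parallelism mentioned in the clause list is expressed with the if clause. A minimal sketch, assuming an illustrative variable n and threshold: when the condition is false the region is executed by a single thread rather than by a team.

int n = 100;
#pragma omp parallel if(n > 1000)
{
    /* n > 1000 is false here, so this region runs on one thread only */
    printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
}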

num_threads()

Without this clause OMP_NUM_THREADS will be used
This is an environment variable

Set to the number of cores (or hyperthreads) on your machine

This can be set globally by omp_set_num_threads(int)

The value can be queried with omp_get_max_threads() (omp_get_num_threads() returns the size of the current team, which is 1 outside a parallel region)

num_threads takes precedence over the environment variable

num_threads() does not guarantee that the number requested
will be created
System limitations may prevent this

However: It almost always will
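A small sketch of how these controls interact (the thread counts are illustrative, and the actual team size is still subject to system limits): omp_set_num_threads() overrides the OMP_NUM_THREADS environment variable for subsequent regions, and a num_threads clause overrides both for that one region.

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(4);                  /* overrides OMP_NUM_THREADS for later regions */
    printf("Default team size: %d\n", omp_get_max_threads());

    #pragma omp parallel                     /* expected to use 4 threads (system permitting) */
    {
        if (omp_get_thread_num() == 0)       /* print once, from the master thread */
            printf("Region 1 team size: %d\n", omp_get_num_threads());
    }

    #pragma omp parallel num_threads(2)      /* the clause takes precedence for this region only */
    {
        if (omp_get_thread_num() == 0)
            printf("Region 2 team size: %d\n", omp_get_num_threads());
    }
    return 0;
}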


parallel for

#pragma omp for

Assigns work units to the team

Divides loop iterations between threads

The two directives can be combined, e.g. #pragma omp parallel for
Threads are spawned and then assigned to loop iterations

int n;
#pragma omp parallel for
for (n = 0; n < 8; n++){
    int thread = omp_get_thread_num();
    printf("Parallel thread %d \n", thread);
}

#pragma omp parallel
{
    int n;
    #pragma omp for
    for (n = 0; n < 8; n++){
        int thread = omp_get_thread_num();
        printf("Parallel thread %d \n", thread);
    }
}

#pragma omp parallel
{
    int n;
    for (n = 0; n < 8; n++){
        int thread = omp_get_thread_num();
        printf("Parallel thread %d \n", thread);
    }
}

Which is the odd one out?

parallel for

The last variant is the odd one out: it has no #pragma omp for, so the loop iterations are not divided between the threads and every thread in the team executes all 8 iterations, e.g.

Parallel thread 0 (×8)
Parallel thread 2 (×8)
Parallel thread 5 (×4)
Parallel thread 4 (×2)
Parallel thread 3 (×2)
Parallel thread 1
…

Multicore systems and OpenMP

Parallelising Loops

Critical Sections and Synchronisation

Scoping

Data vs Task Parallelism

What is wrong with this code?

Consider a problem such as the Taylor series expansion of the cos function:

cos(x) = Σ_{n=0..∞} (−1)^n x^(2n) / (2n)!  =  1 − x^2/2! + x^4/4! − x^6/6! + …

int n;
double result = 0.0;
double x = 1.0;
#pragma omp parallel for
for (n = 0; n < EXPANSION_STEPS; n++){
    double r = pow(-1, n - 1) * pow(x, 2 * n - 1) / fac(2 * n);
    result -= r;
}
printf("Approximation is %f, value is %f\n", result, cos(x));

Critical sections

Multiple threads try to write to the same value (result)!
(undefined behaviour and unpredictable results)

Critical sections
int n;
double result = 0.0;
double x = 1.0;
#pragma omp parallel for
for (n = 0; n < EXPANSION_STEPS; n++){
    double r = pow(-1, n - 1) * pow(x, 2 * n - 1) / fac(2 * n);
    #pragma omp critical
    {
        result -= r;
    }
}
printf("Approximation is %f, value is %f\n", result, cos(x));

Define the update of result as a critical section

Critical sections

#pragma omp critical [name]

Ensures mutual exclusion when accessing a shared value

Prevents race conditions

A thread will wait until no other thread is executing a critical region (with the same name) before beginning

Unnamed critical regions map to the same unspecified name

[Diagram: timeline of Threads 0–3 taking turns through the critical region, each idle while waiting for the others]

Atomics

Atomic operations can be used to safely increment a shared numeric value
For example summation

Atomics only apply to the immediate assignment

Atomics are usually faster than critical sections
Critical sections can be applied to general blocks of code (atomics can not)

Example

Compute histogram of random values for a given range

randoms is an int array of size NUM_VALUES with random values within 0:RANGE
histogram is an int array of size RANGE initialised to 0

#pragma omp parallel
{
    int i;
    #pragma omp for
    for (i = 0; i < NUM_VALUES; i++){
        int value = randoms[i];
        #pragma omp atomic
        histogram[value]++;
    }
}

Barriers

#pragma omp barrier

Synchronises threads at a barrier point

Parallel regions have an implicit barrier

Can be used to ensure execution of particular code is complete
E.g. data read by function_B

#pragma omp parallel
{
    function_A();
    #pragma omp barrier
    function_B();
}

[Diagram: Threads 0–3 each execute function_A, wait (idle time) at the barrier, then execute function_B]
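A minimal sketch of this pattern, assuming each thread writes a partial result that the other threads read after the barrier (the partial array, N and the squaring are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 4

int partial[N];                        /* shared: one slot per thread */

int main()
{
    #pragma omp parallel num_threads(N)
    {
        int i;
        int sum = 0;
        int t = omp_get_thread_num();

        partial[t] = t * t;            /* stands in for function_A: each thread fills its own slot */

        #pragma omp barrier            /* all writes complete before any thread reads */

        for (i = 0; i < N; i++)        /* stands in for function_B: reads every thread's slot */
            sum += partial[i];
        printf("Thread %d sees sum %d\n", t, sum);
    }
    return 0;
}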
Single and Master Sections

#pragma omp single { … }

Used to ensure that only a single thread executes a region of a structured block
Useful for I/O and initialisation

First available thread will execute the defined region
No control over which thread this is

Will cause an implicit barrier unless a nowait clause is used
E.g. #pragma omp single nowait
nowait removes the implied barrier and can also be applied to parallel for loops

#pragma omp master { … }

Similar to single but will always use the master thread

Is equivalent to using an if clause, e.g.

#pragma omp parallel if(omp_get_thread_num() == 0) nowait

The if clause makes the spawning of parallel threads conditional

Preferable to single: does not require an implicit barrier

Master example

int t, r;
int local_histogram[THREADS][RANGE];
zero_histogram(local_histogram);
#pragma omp parallel num_threads(THREADS)
{
    int i;
    #pragma omp for
    for (i = 0; i < NUM_VALUES; i++){
        int value = randoms[i];
        local_histogram[omp_get_thread_num()][value]++;
    }
    #pragma omp barrier
    #pragma omp master
    for (t = 0; t < THREADS; t++){
        for (r = 0; r < RANGE; r++){
            histogram[r] += local_histogram[t][r];
        }
    }
}

Same result as the atomic version

Multicore systems and OpenMP

Parallelising Loops

Critical Sections and Synchronisation

Scoping

Data vs Task Parallelism

Scoping

Scope refers to the part of the program in which a variable can be used

OpenMP has different scoping to serial programming
We must define if a variable is private or shared between threads

Shared: a variable can be accessed by all threads in the team
All variables outside of a parallel loop are shared by default

Private: a variable is local to a single thread and can only be accessed by that thread within the structured block in which it is defined
All variables inside a structured block are private by default

Scoping

In the master example above, the variables declared outside the parallel region (t, r, local_histogram, histogram) are shared, while value, declared inside the structured block, is private. But what about i?

i is private as it is the counter of the parallel for loop

Explicit scoping

Why is explicit scoping required?

It is possible to use implicit scoping as in the previous example
Although it is good practice to use shared for any shared variables

The clause default(shared) or default(none) is helpful in ensuring you have defined variable scope correctly
Changing the default scope from shared to none enforces explicit scoping of variables and gives errors if scoping is not defined

const variables can not be explicitly scoped (they are always shared); see the link below
Not enforced in Windows but this is against the spec

int a, b = 0;
#pragma omp parallel default(none) shared(b)
{
    b += a;
}

error C3052: 'a' : variable doesn't appear in a data-sharing clause under a default(none) clause

http://jakascorner.com/blog/2016/07/omp-default-none-and-const.html

Explicit scoping

Why is explicit scoping required?

Older C programming (C89) style has variable declarations before definitions and statements (including loops)
This requires declarations to be made explicitly private for the parallel structured block
E.g.
Consider our atomic histogram example:

void calculate_histogram()
{
    int i;
    int value;
    #pragma omp parallel for private(value)
    for (i = 0; i < NUM_VALUES; i++){
        value = randoms[i];
        #pragma omp atomic
        histogram[value]++;
    }
}

Advanced private scoping

If you want to pass the value of a variable from outside of a parallel structured block into the threads' private copies then you must use the firstprivate clause
The private copies will be initialised with the value the master thread held before the parallel directive

If you want to pass a private value to a variable outside of the parallel for loop you can use the lastprivate clause
This will assign the value of the last iteration of the loop

int i = 10;
#pragma omp parallel private(i)
{
    printf("Thread %d: i = %d\n", omp_get_thread_num(), i);
}

Thread 0: i = 0
Thread 2: i = 0
Thread 1: i = 0
Thread 3: i = 0

int i = 10;
#pragma omp parallel firstprivate(i)
{
    printf("Thread %d: i = %d\n", omp_get_thread_num(), i);
}

Thread 0: i = 10
Thread 2: i = 10
Thread 1: i = 10
Thread 3: i = 10

Multicore systems and OpenMP

Parallelising Loops

Critical Sections and Synchronisation

Scoping

Data vs Task Parallelism

Data vs Task Parallelism

Is an OpenMP parallel for clause data or task parallel?

Data vs Task Parallelism

Parallelism over loops is data parallelism, i.e. the task is the same (the loop); the parallelism is over the data elements the loop refers to

What about task parallelism?

Task parallelism: divide a set of tasks between threads
This is supported by sections

Further task parallelism is supported by OpenMP tasks
This is OpenMP 3.0 spec and not supported in Visual Studio 2017
Very similar to sections

Sections

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    task_A();
    #pragma omp section
    task_B();
    #pragma omp section
    task_C();
}

[Diagram: task_A, task_B and task_C execute on Threads 0–2 while Thread 3 is idle; all threads wait at the barrier (which can be omitted with the nowait clause) before main continues on the master thread]

Sections

#pragma omp sections [clauses]

Defines a code region where individual sections can be assigned to individual threads

Each section is executed exactly once by one thread

Unused threads wait for the implicit barrier

Sections

#pragma omp parallel
#pragma omp sections nowait
{
    #pragma omp section
    task_A();
    #pragma omp section
    task_B();
    #pragma omp section
    task_C();
}

[Diagram: the same timeline, illustrating the nowait case]

Sections

If the nowait clause is used then sections omit the barrier and threads will immediately enter any other parallel sections that follow

Summary

OpenMP lets us add explicit parallelism to serial code
We can parallelise loops or tasks

OpenMP uses directives to modify the code
This enables portability (serial and parallel code is the same)

OpenMP exposes both data and task parallelism using a fork and join model

Care must be taken in parallel blocks which require access to shared variables
There is a distinction between private and shared variables within a parallel block's scope