程序代写 COSC 407: Intro to Parallel Computing

Intro to Parallel Computing
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing

Copyright By PowCoder代写 加微信 powcoder

Outline (Asynchronous)
Previously:
• Sections
• Scheduling Loops (static, dynamic, guided, auto)
• Ordered Iterations
• Some examples
• Matrix multiplication
• Max reduction
• Some More Examples
• Asides (Producer/Consumer) and Comments on
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing

Matrix Multiplication
Write a C program that uses OpenMP to perform parallel matrix multiplication. The code should do the following steps:
1. create three matrices A, B, and C and initializes A and B to some values of your choice
(e.g., a[i][j]=i+j and b[i][j]=i*j).
2. perform matrix multiplication.
3. print out the resultant matrix C.
if we have two matrices:
Then their matrix product is
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Serial Version
int i, j, k; double a[NRA][NCA], b[NCA][NCB], double c[NRA][NCB];
// Initialize matrices
for (i = 0; i < NRA; i++) for (j = 0; j < NCA; j++) a[i][j] = i + j; for (i = 0; i < NCA; i++) for (j = 0; j < NCB; j++) b[i][j] = i * j; for (i = 0; i < NRA; i++) for (j = 0; j < NCB; j++) c[i][j] = 0; // matrix multiplication printf("Starting matrix multiplication...\n"); for (i = 0; i < NRA; i++) for (j = 0; j < NCB; j++) for (k = 0; k < NCA; k++) c[i][j] += a[i][k] * b[k][j]; // Print results printf("Result Matrix:\n"); for (i = 0; i < NRA; i++) { for (j = 0; j < NCB; j++) printf("%6.2f ", c[i][j]); printf("\n"); Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Q1: which loops would you parallelize? Q2: Is there any loop-carried dependencies? Finding the Following is a function that finds and returns the max integer in a given array int max_serial(int *arr){ int max = arr[0]; for (i = 0; i < N; i++) if (max < arr[i] ) max = arr[i]; return max; Assume N is a constant = size of the array Write an OpenMP code to parallelize the above function. Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing a) Yes b) No Finding the V.1 Will this code produce the required output? int max_parallel(int *arr){ int max = arr[0]; #pragma omp parallel for for (i = 0; i < N; i++) if (max < arr[i]) max = arr[i]; return max; Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing a) Thisisaperfec b) We might have c) Outputiscorre d) There is a synt Finding the v.2 How about this Solution? int max_parallel(int *arr){ int max = arr[0]; #pragma omp parallel for for (i = 0; i < N; i++) #pragma omp critical if (max < arr[i]) max = arr[i]; return max; } Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing a) Thisisagoodsolution b) We might have a race condition. c) Outputiscorrect,butthecodeismostlysequential d) There is a syntax error. Finding the v.3 What do you think now? int max_parallel(int *arr, int thread_count){ int shared_max = arr[0], i; #pragma omp parallel num_threads(thread_count) { int local_max = arr[0]; //each thread gets a copy of local_max #pragma omp for for (i = 0; i < N; i++) //each thread finds it local max if (local_max < arr[i]) local_max = arr[i]; #pragma omp critical //one thread at a time, i.e. sequential if(shared_max < local_max) shared_max = local_max; return shared_max; Only a “good” solution – there is a better way using reduction! This will be discussed shortly (not in the next slide though) Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing a) Thisisagoodsolution. b) We might have a race condition. c) Outputiscorrect,butthecodeismostlysequential d) There is a syntax error. Finding the v.4 Will this code produce the required output? int max_parallel(int *arr, int thread_count){ int local_max[thread_count]; #pragma omp parallel num_threads(thread_count) { int tid = omp_get_thread_num(); local_max[tid] = arr[0]; #pragma omp for for (int i = 0; i < N; i++) Assuming 2 threads only if (local_max[tid] max) max = local_max[i];
return max;
Topic 10 – OpenMP Examples, Models, SIMD
a) Thisisagoodsolution.
b) We might have a race condition.
c) Outputiscorrect,butthecodeismostlysequential
• First introduced in OpenMP v3.1 • Remember:
Reduction Syntax:
reduction ( : )
1. Provides a private copy of the variable for each thread
2. Combines the private results on exit
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing

Finding the v.5
Now, what do you think?
int max_parallel(int *arr){ int i, m = arr[0];
#pragma omp parallel for reduction(max:m)
for (i = 0; i < N; i++) if (m < arr[i]) m = arr[i]; Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing a) Yes b) No Finding the Min, Max and SUM A function that returns 3 values: max, min, and sum in an array using OpenMP void stats(int* arr, int* h, int* l, int* s) { int i, high = arr[0], low = arr[0], sum = 0; #pragma omp parallel for reduction(max:high) reduction(min:low) \ reduction(+:sum) for (i = 0; i < N; i++){ if(higharr[i]) low = arr[i]; sum += arr[i];
*h = high; //returning three values *l = low;
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing

Nested Loops: collapse clause
• Collapse directive turns nested loops into ONE big loop
• The iteration space is then divided according to the schedule
#pragma omp for collapse(n)
Where n is the number of loops to be collapsed
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
for(j=0; j<4; j++) printf("%d,%d ", i, j); 0,0 0,1 0,2 0,3 1,2 1,3 1,4 Example: collapse #pragma omp parallel for private(j) num_threads(4) for(i=0; i<2; i++) #pragma omp parallel for collapse(2) num_threads(4) for(i=0; i<2; i++) for(j=0; j<4; j++) printf("T%d,i:%d,j:%d\n",omp_get_thread_num(), i, j); 0,0 0,1 0,2 0,3 1,1 1,2 1,3 1,4 T0 T1 T2 T3 Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Nested Loops: collapse clause • The specified number of loops must be indicated • The loops must be perfectly nested; i.e., no code between the loops which are collapsed. • All counters in the collapsed loops are private (no need to add the private clause for them) Note: collapse doesn’t always improve the performance – a smart compiler will automatically collapse nested loops IF needed. Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Bubble Sort Starting from the first item, compare adjacent items and keep “bubbling” the larger one to the right. Repeat for remaining sublist. for(k = list_length, k >= 2; k–) for(i = 0; i < k; i++) if(a[i-1] > a[i]) swap(&a[i-1],&a[i]);
What is the complexity?
Bubble sort cannot be easily parallelized!
Why? (i.e. identify the loop carried dependency for both for- loops)
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing

Odd-Even Transposition Sort
• A sequence of phases (N is number of elements in list a).
• Similar to bubble sort, but easier to parallelize
• Outer-loop has loop-carried dependence
• Inner loop doesn’t have loop-carried
dependence
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
for(phase = 0, phase < N; phase++) if(phase % 2 == 0) //even phases for(i = 1; i < N; i += 2) if(a[i-1] > a[i]) swap(&a[i-1],&a[i]);
else //odd phases for(i = 1; i < N-1; i += 2) if(a[i] > a[i+1]) swap(&a[i],&a[i+1]);
Odd-Even Transposition Sort
for(phase = 0, phase < N; phase++) if(phase % 2 == 0) //even phases for(i = 1; i < N; i += 2) if(a[i-1] > a[i]) swap(&a[i-1],&a[i]);
else //odd phases
for(i = 1; i < N-1; i += 2) if(a[i] > a[i+1]) swap(&a[i],&a[i+1]);
Topic 10 – OpenMP Examples, Models, SIMD
No dependency between thread, thus no race condition
COSC 407: Intro to Parallel Computing

Odd-Even Sort – Attempt 1
for (phase = 0; phase < n; phase++) { if (phase % 2 == 0) # pragma omp parallel for num_threads(thread_count) \ default(none) shared(a, n) private(i, tmp) for (i = 1; i < n; i += 2){ if (a[i-1] > a[i]) {
tmp = a[i-1];
a[i-1] = a[i];
a[i] = tmp;
Q: Do we need any explicit barriers in this code?
join (implicit)
# pragma omp parallel for num_threads(thread_count) \ default(none) shared(a, n) private(i, tmp)
for (i = 1; i < n-1; i += 2) if (a[i] > a[i+1]) {
tmp = a[i+1];
a[i+1] = a[i];
a[i] = tmp;
join (implicit)
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Odd-Even Sort – Attempt 2
# pragma omp parallel num_threads(thread_count) \ default(none) shared(a, n) private(i, tmp, phase)
for (phase = 0; phase < n; phase++) { if (phase % 2 == 0) # pragma omp for for (i = 1; i < n; i += 2){ if (a[i-1] > a[i]) { tmp = a[i-1];
a[i-1] = a[i];
a[i] = tmp; }
tells OpenMP to parallelize the for loop with the existing team of threads.
Reuse thread
# pragma omp for
for (i = 1; i < n-1; i += 2) if (a[i] > a[i+1]) { tmp = a[i+1];
a[i+1] = a[i];
a[i] = tmp; }
Reuse thread
join (implicit)
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing

Comparison of Performance
• Odd-even sort with two parallel for directives and two for directives.
(Times are in seconds, size of array is 20,000)
Thread count
Two parallel for directives
Two for directives
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Producer – Consumer Model
This model has two (or more) threads/processes work together on a shared resource (a buffer).
• Producer writes data to the buffer when it is not full
• Consumer reads data from buffer when it is not empty
consumer producer
Circular buffers are actually linear buffers in which the producer or consumer moves to the beginning whenever it reaches the end of the buffer.
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing

Condition Synchronization
• Condition synchronization is a mechanism to ensure that a process is blocked if some condition is not satisfied
For a consumer process
àCondition: the buffer is not empty.
For a producer process
àCondition: the buffer is not full
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Pseudo Code
#define NUM_ITEMS 30 // number of items produced and consumed
int empty = 1, full = 0; // at beginning, buffer is empty and not full
int main(){
#pragma omp parallel
assign several threads to run produce(), and several to consume() }
PRODUCER code
void put(T item){
//add item to buffer
//set empty to 0 (buffer is not empty) //set full to 1 IF buffer is full
void produce(){
while (i < NUM_ITEMS) { #pragma omp critical(one) if (!full) { // if full, don’t do // anything put(item); // puts data into buffer i++; // incr. only if item CONSUMER code // read item from buffer // set full to 0 (buffer is not full) // set empty to 1 IF buffer has not // items // return item; void consume() { while (j < NUM_ITEMS) { #pragma omp critical(two) if (!empty) { x = get(); // read data from // buffer j++; // incr. only if item // received Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Produce-Consumer Producer #define BUFFER_SIZE 10 #define NUM_ITEMS 30 char buffer[BUFFER_SIZE]; int i_in = 0, i_out = 0; int num_items = 0; int empty = 1, full = 0; int i =0, j=0; //producer code void put(char ch) { buffer[i_in] = ch; i_in = (i_in + 1) % BUFFER_SIZE; // incr. producer index //keep track of number of items in buffer num_items++; // # of items in buffer if(num_items == 1) empty = 0; // buffer is not empty anymore if(num_items==BUFFER_SIZE) full = 1; // buffer is now full void produce(int tid) { while (i < NUM_ITEMS) { #pragma omp critical(one) if (!full) { char ch = 'A' + rand()%26; printf("T%d:Produced %c\t\tnow is %s\n",tid,ch,full?"Full":"Not full"); i++; // only increment if item added Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing // Buffer size // total number of producing and consuming //Two buffer indexes for producer/consumer //how many items in the buffer //buffer is empty and not full at beginning //track any producer or consumer action // put character in circular buffer This part should be executed atomically for all producers – ok to run in parallel with consumers. e.g. you don't want more than one producer to check !full condition and proceed (what if there is only one spot left and two threads proceed because !full is true?) Condition synchronization: can only produce if buffer is not full – if full, skip and try again later in the loop. Produce-Consumer char get() { char ch = buffer[i_out]; // read item from circular buffer i_out = (i_out + 1) % BUFFER_SIZE; // incr. consumer index //keep track of number of items in buffer num_items--; // decr. # of items if (num_items == BUFFER_SIZE-1) full = 0;// buffer is not full anymore if (num_items == 0) empty = 1; return ch; void consume(int tid) { while (j < NUM_ITEMS) { #pragma omp critical(two) if (!empty) { // buffer is now empty Only one consumer can get ch = get(); printf("T%d:Consumed %c\t\tnow is %s\n",tid,ch,empty?"Empty":"Not empty"); j++; int main(){ #pragma omp parallel firstprivate(i,j) num_threads(8) { int tid = omp_get_thread_num(); if(tid< omp_get_num_threads()/2) produce(tid); //1st half of threads producing else consume(tid); //2nd half consume Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Produce-Consumer PHQGHU......... T0: Producing T1: Consuming T2: Producing T2: Producing T1: Consuming T0: Producing T1: Consuming T0: Producing T1: Consuming T0: Producing T5: Consuming T7: Consuming Topic 10 - OpenMP Examples, Models, SIMD P P H Q H G Q W G U now is Not full now is Empty now is Not full now is Not full now is Not empty now is Not full now is Not empty now is Not full now is Not empty now is Not full now is Not empty now is Empty COSC 407: Intro to Parallel Computing Consumer-Producer Applications Computer Science Applications: • When data is set to a printer, a producer writes data to the buffer and the printer’s process reads data from buffer • When we have several “producer” threads and several “consumer” threads • Producer threads might “produce” requests for data. • Consumer threads might “consume” the request by finding or generating the requested data Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Challenge Producer-Consumer • The code we saw earlier is not the best implementation of the producer- consumer model. • Threads will be busy always even if they are practically doing nothing. • Example: if queue is empty, a consumer will still be busy running the loop and checking the if condition. • OpenMP cannot arbitrarily start/stop threads. • Idea: locks may be used to implement condition synchronization, but we will not discuss this here. • Other libraries can do this. • Java and POSIX Threads can do this • POSIX threads can explicitly create, run, stop, and destroy threads. OpenMP is best when you want a fixed number of threads which will fully use the CPU, and the threads are more-or-less doing the same thing. Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing Message-Passing: The Process 1) Master thread creates a shared array of message queues: one for each thread. 2) Start the threads using a parallel directive. Each thread allocates space for its queue. 3) After ALL the threads are done step 2, each thread will • Send all its messages to other threads (putting them in other threads’ queues) Shared array • Receive all message sent to it Topic 10 - OpenMP Examples, Models, SIMD for(i=0; iCS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com