COMP2017 / COMP9017 Week 10 Tutorial: Parallelism with POSIX threads and Optimisations
Pthreads is a library specified by the POSIX standard, defining an API for creating and manipulating threads.
To use the Pthreads library:
1. Include the pthread.h header in your source code.


2. Specify -pthread in your compiler flags to tell the compiler to link and configure the Pthreads library.
> clang -g -Wall -Werror -std=gnu11 program.c -o program -pthread
Linker flags should be specified at the end of your usual compiler options.
Creating threads
To create a thread using the Pthreads library, we use the pthread_create function.
int pthread_create(
    pthread_t* thread,
    const pthread_attr_t* attr,
    void* (*start_routine)(void*),
    void* arg
);
1. The first argument is a pointer to a pthread_t, which the function uses to store the ID of the new thread.
2. The second argument is a pointer to a pthread_attr_t structure; you can use this to specify additional options when creating the thread. It's fine to leave this as NULL.
3. The third argument is a pointer to a function that the thread will execute once it spawns.
4. The last argument is the parameter passed to the thread function.
5. The function returns a non-zero value if an error occurred.

Waiting on threads
The pthread_join function forces the calling thread to wait for a particular thread ID to finish.

int pthread_join(pthread_t thread, void** retval);
1. The first argument is the thread to wait for.
2. The second argument allows you to get the return value from the start_routine.
If you don't need the return value you can just pass in NULL.
3. The function returns a non-zero value if an error occurred.
Pre-Tutorial Question
Question 1: Hello from Pthreads
The following code creates 4 threads, gives each of them an argument, and waits for them to finish.
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void* worker(void* arg) {
    const int argument = *((int*) arg);
    printf("Hello from thread %d\n", argument);
    return NULL;
}

int main(void) {
    int args[NTHREADS] = { 1, 2, 3, 4 };
    pthread_t thread_ids[NTHREADS];

    // Create threads with given worker function and argument
    for (size_t i = 0; i < NTHREADS; i++) {
        if (pthread_create(thread_ids + i, NULL, worker, args + i) != 0) {
            perror("unable to create thread");
            return 1;
        }
    }

    // Wait for all threads to finish
    for (size_t i = 0; i < NTHREADS; i++) {
        if (pthread_join(thread_ids[i], NULL) != 0) {
            perror("unable to join thread");
            return 1;
        }
    }

    return 0;
}

Compile and run the code above. Make sure you understand exactly what each line is doing. As a reminder, printf is thread-safe by the POSIX standard, so calls from multiple threads will not interleave their output.

1. Why does the program output Hello from thread ... in different orderings?
2. What is the difference between a thread (pthread_create) and a process (fork)?

Question 2: Parallel Sum

In the following code, each thread is given a section of an array of numbers to sum.

1. How long does it take to run when you set NTHREADS to 1, 2, 3, etc.?
   You can use time ./sum to see how long the program takes to run.
2. Why does using more threads make the program slower?
3. How can this be fixed? Hint: False sharing (one possible fix is sketched after the code below).
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define LENGTH 100000000
#define NTHREADS 4
#define NREPEATS 10
#define CHUNK (LENGTH / NTHREADS)
typedef struct {
    size_t id;
    long* array;
    long result;
} worker_args;
void* worker(void* args) {
    worker_args* wargs = (worker_args*) args;
    const size_t start = wargs->id * CHUNK;
    const size_t end = wargs->id == NTHREADS - 1 ? LENGTH
                                                 : (wargs->id + 1) * CHUNK;

    // Sum values from start to end
    for (size_t i = start; i < end; i++) {
        wargs->result += wargs->array[i];
    }

    return NULL;
}
int main(void) {
    long* numbers = malloc(sizeof(long) * LENGTH);
    for (size_t i = 0; i < LENGTH; i++) {
        numbers[i] = i + 1;
    }

    worker_args* args = malloc(sizeof(worker_args) * NTHREADS);

    for (size_t n = 1; n <= NREPEATS; n++) {
        for (size_t i = 0; i < NTHREADS; i++) {
            args[i] = (worker_args) {
                .id = i,
                .array = numbers,
                .result = 0,
            };
        }

        pthread_t thread_ids[NTHREADS];

        // Launch threads
        for (size_t i = 0; i < NTHREADS; i++) {
            pthread_create(thread_ids + i, NULL, worker, args + i);
        }

        // Wait for threads to finish
        for (size_t i = 0; i < NTHREADS; i++) {
            pthread_join(thread_ids[i], NULL);
        }

        long sum = 0;

        // Calculate total sum
        for (size_t i = 0; i < NTHREADS; i++) {
            sum += args[i].result;
        }

        printf("Run %2zu: total sum is %ld\n", n, sum);
    }

    free(args);
    free(numbers);

    return 0;
}
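The hint in part 3 points at false sharing: all of the worker_args structs sit next to each other in one allocation, so the result fields of different threads share cache lines, and every wargs->result += ... forces that line to bounce between cores. A minimal sketch of one fix, assuming the struct layout stays the same, is to accumulate into a local variable inside worker and write the shared field once at the end; padding each worker_args out to a cache-line boundary is another option.

void* worker(void* args) {
    worker_args* wargs = (worker_args*) args;
    const size_t start = wargs->id * CHUNK;
    const size_t end = wargs->id == NTHREADS - 1 ? LENGTH
                                                 : (wargs->id + 1) * CHUNK;

    // Accumulate into a thread-local variable so repeated writes do not
    // invalidate the cache line shared with neighbouring worker_args entries
    long local = 0;
    for (size_t i = start; i < end; i++) {
        local += wargs->array[i];
    }

    // Single write to the shared structure at the end
    wargs->result = local;
    return NULL;
}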
Tutorial Question

Question 3: Performance and benchmarking

The execution time of a program can be measured using the time command. Taking the example from the next exercise, we can measure how long it takes to run by invoking time like so:

time ./mutex

real 0m0.420s
user 0m0.999s
sys  0m1.234s

The real time is the wall-clock time taken to run the program. The user time is the total CPU time used, and the system time is the time spent in system calls or the kernel. Here you can see that the CPU time spent is larger than the wall-clock time, so the program was executing on multiple cores simultaneously.

Sometimes it is better to have finer-grained measures of time, for example if your program has a large setup phase and you only want to measure some operation which occurs after that. For this you can use the clock() function, which provides microsecond accuracy. You can achieve nanosecond accuracy by using operating-system-specific functions such as clock_gettime (a sketch follows the example below).

#include <stdio.h>
#include <time.h>
int main(void) {
    const clock_t tick = clock();

    int ops = 0;
    for (int i = 0; i < 10000000; i++) {
        ops += i;
    }

    const clock_t tock = clock();
    printf("Time elapsed: %fs\n", (double) (tock - tick) / CLOCKS_PER_SEC);
    return 0;
}
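As a sketch of the operating-system-specific route mentioned above (assuming a POSIX system; this is not part of the exercise code), clock_gettime with CLOCK_MONOTONIC returns nanosecond-resolution timestamps. Compiling with -std=gnu11, as in the rest of this tutorial, makes the declaration visible on Linux.

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    int ops = 0;
    for (int i = 0; i < 10000000; i++) {
        ops += i;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    // Convert the two timespecs into elapsed seconds
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Time elapsed: %fs\n", elapsed);
    return 0;
}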
Question 4: Matrix multiplication

The code below performs matrix multiplication.

1. Analyse the program's performance. To use gprof, you need to compile with profiling enabled, using gcc and the -pg compiler flag.

$ gcc -g -pg -Wall -Werror -std=gnu11 matrix.c -o matrix -pthread

(a) Ensure that the -pg flag has been used to compile the program.
(b) Execute the program as usual.
(c) After the program has finished, run the profiler:

$ gprof ./matrix > mm.stats

(d) View the saved information using less. Which function is the bottleneck in the code, and why?
(e) You can use gprof -l to show statistics for each line of the source.
2. Swap the order of the loops so that it is y, k, x instead of y, x, k. Why does this have any effect on the performance of the code? (A sketch of the swapped loops appears after this list.)
3. Improve the performance by using pthreads for parallelism.
4. Measure how long it takes to run with and without parallelism using time. It is recommended that you use the lab machines to benchmark, as they are similar to the machines used for the upcoming assignments. When benchmarking, it's important that no other programs are running, since they may affect the results.
5. Compare your performance with the cache-optimised version in What Every Programmer Should Know About Memory by Ulrich Drepper.
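For part 2, the reordered loops in the multiply function from the listing below look roughly like this sketch; the hoisted a_ky variable is an illustrative addition, not something the original code contains. With x innermost, result[IDX(x, y)] and b[IDX(x, k)] are both walked sequentially through memory instead of striding a whole row apart, which is why the reordering helps the cache. For part 3, the outer y loop can then be divided between threads in the same way the array was divided in Question 2.

float* multiply(const float* a, const float* b) {
    float* result = calloc(WIDTH * WIDTH, sizeof(float));

    for (size_t y = 0; y < WIDTH; y++) {
        for (size_t k = 0; k < WIDTH; k++) {
            // a[IDX(k, y)] is constant for the whole inner loop
            const float a_ky = a[IDX(k, y)];
            for (size_t x = 0; x < WIDTH; x++) {
                // result and b are now read along a row, so consecutive
                // iterations touch consecutive memory addresses
                result[IDX(x, y)] += a_ky * b[IDX(x, k)];
            }
        }
    }

    return result;
}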
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define WIDTH 512
#define IDX(x, y) ((y) * WIDTH + (x))
/*
 * Returns the matrix multiplication of a and b.
 */
float* multiply(const float* a, const float* b) {
    float* result = calloc(WIDTH * WIDTH, sizeof(float));

    for (size_t y = 0; y < WIDTH; y++) {
        for (size_t x = 0; x < WIDTH; x++) {
            for (size_t k = 0; k < WIDTH; k++) {
                result[IDX(x, y)] += a[IDX(k, y)] * b[IDX(x, k)];
            }
        }
    }

    return result;
}

/*
 * Returns a Hadamard matrix. If H is a Hadamard matrix, then
 * HH^T = nI, where I is the identity matrix and n is the width.
 * This makes it easy to verify that the matrix multiplication was done correctly.
 * Sylvester's construction implemented here only works
 * for matrices whose width is a power of 2.
 * Note that this construction produces matrices that are symmetric.
 */
float* hadamard(void) {
    // Ensure the width is a power of 2
    assert(((WIDTH - 1) & WIDTH) == 0);

    size_t w = WIDTH;
    size_t quad_size = 1;

    float* result = malloc(WIDTH * WIDTH * sizeof(float));
    result[0] = 1;

    while ((w >>= 1) != 0) {
        // Duplicate the upper left quadrant into the other three quadrants
        for (size_t y = 0; y < quad_size; ++y) {
            for (size_t x = 0; x < quad_size; ++x) {
                const float v = result[IDX(x, y)];
                result[IDX(x + quad_size, y)] = v;
                result[IDX(x, y + quad_size)] = v;
                result[IDX(x + quad_size, y + quad_size)] = -v;
            }
        }
        quad_size *= 2;
    }

    return result;
}

// Displays a matrix.
void display(const float* matrix) {
    for (size_t y = 0; y < WIDTH; y++) {
        for (size_t x = 0; x < WIDTH; x++) {
            printf("%6.2f ", matrix[IDX(x, y)]);
        }
        printf("\n");
    }
}

int main(void) {
    // Construct the matrices
    float* a = hadamard();
    float* b = hadamard();

    // Compute the result
    float* r = multiply(a, b);

    // Verify the result
    for (size_t y = 0; y < WIDTH; y++) {
        for (size_t x = 0; x < WIDTH; x++) {
            assert(x == y ? r[IDX(x, y)] == WIDTH : r[IDX(x, y)] == 0);
        }
    }

    puts("done");

    return 0;
}

Question 5: Critical sections and mutual exclusion

A critical section is a piece of code that accesses a shared resource that must not be concurrently accessed by more than one thread. Critical sections are usually enforced using locks or other synchronisation measures.

Mutexes are a way of enforcing a critical section. In our threads, we can call pthread_mutex_lock in order to obtain ownership of the mutex before entering the critical section. If the thread cannot obtain ownership of the mutex object, it will wait until the mutex has been unlocked. The thread that owns the mutex must call pthread_mutex_unlock after it has exited the critical section.

The example below shows a typical use of mutexes in pthreads. The mutex is statically initialised using PTHREAD_MUTEX_INITIALIZER. You can also create one dynamically using pthread_mutex_init.
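As a rough, self-contained illustration of the dynamic form (for reference only; the exercise below uses the static initialiser, and Question 6 uses pthread_mutex_init in full):

#include <pthread.h>
#include <stdio.h>

int main(void) {
    pthread_mutex_t m;

    // Initialise with default attributes; a non-zero return means failure
    if (pthread_mutex_init(&m, NULL) != 0) {
        perror("unable to initialise mutex");
        return 1;
    }

    pthread_mutex_lock(&m);
    puts("inside the critical section");
    pthread_mutex_unlock(&m);

    // Release the mutex's resources once no thread will use it again
    pthread_mutex_destroy(&m);
    return 0;
}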
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define LOOPS 1000000
static unsigned counter = 0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static void* worker(void *arg) {
    for (unsigned i = 0; i < LOOPS; i++) {
        // attempt to lock the mutex...
        // the thread will wait if the mutex is already locked
        pthread_mutex_lock(&mutex);

        // only one thread at a time can execute this code
        counter += 1;

        // unlock the mutex after the critical section
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void) {
    pthread_t thread_ids[THREADS];

    for (size_t i = 0; i < THREADS; i++) {
        pthread_create(thread_ids + i, NULL, worker, NULL);
    }

    for (size_t i = 0; i < THREADS; i++) {
        pthread_join(thread_ids[i], NULL);
    }

    printf("%u\n", counter);
    return 0;
}

1. Run the code and verify that it outputs the result correctly.
2. Comment out the pthread_mutex_unlock line. Does the program behave in a way you expect?
3. Comment out both the pthread_mutex_lock and pthread_mutex_unlock lines, then repeat the program until it outputs an incorrect result. Why does this happen?

Question 6: Dining philosophers

Assume that there are N philosophers sitting at a round table. A single chopstick is placed between two adjacent philosophers. Every philosopher is either thinking or eating. However, a philosopher needs both chopsticks (to the left and to the right) to start eating. They are not allowed to acquire chopsticks that are not immediately adjacent to them.

Complete the following program so that each philosopher is able to eat. (One possible locking order is sketched after the listing.)
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define THINKERS 5
static pthread_mutex_t chopsticks[THINKERS];

void* dine(void* arg) {
    const unsigned id = *((unsigned *) arg);

    while (true) {
        // TODO: Acquire two chopsticks first.
        // The ith philosopher can only reach
        // the ith and (i + 1)th chopstick.
        printf("Philosopher %u is eating\n", id);
    }

    return NULL;
}
int main(void) {
    unsigned args[THINKERS];
    pthread_t thinkers[THINKERS];

    // create the chopsticks
    for (size_t i = 0; i < THINKERS; i++) {
        if (pthread_mutex_init(chopsticks + i, NULL) != 0) {
            perror("unable to initialize mutex");
            return 1;
        }
    }

    // launch threads
    for (size_t i = 0; i < THINKERS; i++) {
        args[i] = i;
        if (pthread_create(thinkers + i, NULL, dine, args + i) != 0) {
            perror("unable to create thread");
            return 1;
        }
    }

    // wait for threads to finish
    for (size_t i = 0; i < THINKERS; i++) {
        if (pthread_join(thinkers[i], NULL) != 0) {
            perror("unable to join thread");
            return 1;
        }
    }

    // remove the chopsticks
    for (size_t i = 0; i < THINKERS; i++) {
        if (pthread_mutex_destroy(chopsticks + i) != 0) {
            perror("unable to destroy mutex");
            return 1;
        }
    }

    return 0;
}
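One common way to complete the TODO without deadlocking is to impose a global order on the chopsticks: every philosopher always locks the lower-numbered of their two chopsticks first, so a circular wait can never form. The dine function below is a sketch of that approach; it assumes philosopher i reaches chopstick i on the left and chopstick (i + 1) mod THINKERS on the right so that the table wraps around. Other strategies, such as having odd- and even-numbered philosophers acquire in opposite orders, also work.

void* dine(void* arg) {
    const unsigned id = *((unsigned *) arg);
    const unsigned left = id;
    const unsigned right = (id + 1) % THINKERS;

    // Always lock the lower-numbered chopstick first so that no
    // circular wait (and therefore no deadlock) can occur
    const unsigned first = left < right ? left : right;
    const unsigned second = left < right ? right : left;

    while (true) {
        pthread_mutex_lock(chopsticks + first);
        pthread_mutex_lock(chopsticks + second);

        printf("Philosopher %u is eating\n", id);

        pthread_mutex_unlock(chopsticks + second);
        pthread_mutex_unlock(chopsticks + first);
    }

    return NULL;
}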