
Sogang University
Thread-Level Parallelism
CSE4100: System Programming
Youngjae Kim (PhD)


https://sites.google.com/site/youkim/home
Distributed Computing and Operating Systems Laboratory (DISCOS)
https://discos.sogang.ac.kr
Office: R911, E-mail:
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition

Overview
• Parallel Computing Hardware
  – Multicore
    – Multiple separate processors on a single chip
  – Hyperthreading
    – Efficient execution of multiple threads on a single core
• Thread-Level Parallelism
  – Splitting a program into independent tasks
  – Example: parallel summation
• Consistency Models
  – What happens when multiple threads are reading & writing shared state

Exploiting parallel execution
• So far, we've used threads to deal with I/O delays
  – e.g., one thread per client, to prevent one client from delaying another
• Multi-core/hyperthreaded CPUs offer another opportunity
  – Spread work over threads executing in parallel
  – Happens automatically, if there are many independent tasks
    – e.g., running many applications or serving many clients
  – Can also write code to make one big task go faster
    – by organizing it as multiple parallel sub-tasks

Typical Multicore Processor
[Figure: each core has private L1 d- and i-caches backed by a private
unified L2 cache; an L3 unified cache is shared by all cores and sits in
front of main memory]
• Multiple processors operating with a coherent view of memory
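The cache hierarchy in the figure can be inspected programmatically on
Linux. A minimal sketch, not from the slides, assuming glibc, whose
sysconf() accepts the non-POSIX _SC_LEVEL*_CACHE_SIZE names (these are a
glibc extension and may report 0 on other systems):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Per-core private caches (glibc-specific sysconf names) */
    printf("L1 d-cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 i-cache: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2 unified: %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    /* Shared by all cores */
    printf("L3 unified: %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}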

Benchmark Machine
• Get data about the machine from /proc/cpuinfo
• Shark machines
  – Intel Xeon E5520 @ 2.27 GHz
  – Nehalem, ca. 2010
  – 8 cores; each can do 2x hyperthreading
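As a cross-check on /proc/cpuinfo, the processor count can also be
queried at runtime. A minimal sketch, not from the slides, assuming a
Linux/glibc system like the sharks; _SC_NPROCESSORS_ONLN counts logical
CPUs, so an 8-core machine with 2x hyperthreading reports 16:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Logical processors currently on-line (cores x hyperthreads) */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld logical processors on-line\n", ncpus);
    return 0;
}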

Example 1: Parallel Summation
• Sum the numbers 0, …, n-1
  – Should add up to ((n-1)*n)/2 (e.g., 45 for n = 10)
• Partition the values 1, …, n-1 into t ranges
  – ⌊n/t⌋ values in each range
  – Each of the t threads processes one range
  – For simplicity, assume n is a multiple of t
• Let's consider different ways that multiple threads might work on their
  assigned ranges in parallel

First attempt: psum-mutex
• Simplest approach: threads sum into a global variable protected by a
  semaphore mutex

void *sum_mutex(void *vargp); /* Thread routine */

/* Global shared variables */
long gsum = 0;            /* Global sum */
long nelems_per_thread;   /* Number of elements to sum */
sem_t mutex;              /* Mutex to protect global sum */

int main(int argc, char **argv)
{
    long i, nelems, log_nelems, nthreads, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];

    /* Get input arguments */
    nthreads = atoi(argv[1]);
    log_nelems = atoi(argv[2]);
    nelems = (1L << log_nelems);
    nelems_per_thread = nelems / nthreads;
    sem_init(&mutex, 0, 1);
                                                            psum-mutex.c

psum-mutex (cont)
• Main routine continues: create the peer threads, wait for them to
  finish, then check the final answer

    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        Pthread_create(&tid[i], NULL, sum_mutex, &myid[i]);
    }
    for (i = 0; i < nthreads; i++)
        Pthread_join(tid[i], NULL);

    /* Check final answer */
    if (gsum != (nelems * (nelems-1))/2)
        printf("Error: result=%ld\n", gsum);
    exit(0);
}
                                                            psum-mutex.c

psum-mutex Thread Routine
• Each thread sums its own range of elements, locking the mutex around
  every update of the global sum

/* Thread routine for psum-mutex.c */
void *sum_mutex(void *vargp)
{
    long myid = *((long *)vargp);            /* Extract thread ID */
    long start = myid * nelems_per_thread;   /* Start element index */
    long end = start + nelems_per_thread;    /* End element index */
    long i;

    for (i = start; i < end; i++) {
        P(&mutex);
        gsum += i;
        V(&mutex);
    }
    return NULL;
}
                                                            psum-mutex.c

psum-mutex Performance
• Shark machine with 8 cores, n = 2^31
• [Table: psum-mutex running time in seconds vs. threads (cores)]
• Nasty surprise:
  – A single thread is very slow
  – Gets slower as we use more cores

Next Attempt: psum-array
• Peer thread i sums into global array element psum[i]
• Main waits for threads to finish, then sums the elements of psum
  (see the sketch below)
• Eliminates the need for mutex synchronization

/* Thread routine for psum-array.c */
void *sum_array(void *vargp)
{
    long myid = *((long *)vargp);            /* Extract thread ID */
    long start = myid * nelems_per_thread;   /* Start element index */
    long end = start + nelems_per_thread;    /* End element index */
    long i;

    for (i = start; i < end; i++) {
        psum[myid] += i;
    }
    return NULL;
}
                                                            psum-array.c

psum-array Performance
• Orders of magnitude faster than psum-mutex
• [Figure "Parallel Summation": elapsed seconds (0-5) vs. threads (cores)
  for psum-array]
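The slides list only the psum-array thread routine; the main-thread side
("main waits for threads to finish, then sums the elements of psum") is
not shown. Below is a minimal, self-contained sketch of that missing
half. It reuses the slides' variable names, uses plain
pthread_create/pthread_join in place of the CS:APP Pthread_* wrappers,
and the MAXTHREADS bound of 32 is an assumed value:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define MAXTHREADS 32

long psum[MAXTHREADS];    /* Per-thread partial sums */
long nelems_per_thread;   /* Number of elements per thread */

/* Thread routine from the psum-array slide */
void *sum_array(void *vargp)
{
    long myid = *((long *)vargp);
    long start = myid * nelems_per_thread;
    long end = start + nelems_per_thread;
    long i;

    for (i = start; i < end; i++)
        psum[myid] += i;
    return NULL;
}

int main(int argc, char **argv)
{
    long i, result = 0, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    long nthreads = atoi(argv[1]);          /* e.g., 8 */
    long nelems = 1L << atoi(argv[2]);      /* e.g., 31 for n = 2^31 */

    nelems_per_thread = nelems / nthreads;  /* assume t divides n */

    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, sum_array, &myid[i]);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    /* The combination step: main sums the partial results, no mutex */
    for (i = 0; i < nthreads; i++)
        result += psum[i];

    if (result != (nelems * (nelems - 1)) / 2)
        printf("Error: result=%ld\n", result);
    else
        printf("OK: result=%ld\n", result);
    return 0;
}

Build and run, e.g.: gcc -O2 -pthread psum-array.c && ./a.out 8 31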
Next Attempt: psum-local
• Reduce memory references by having peer thread i sum into a local
  variable (register)

/* Thread routine for psum-local.c */
void *sum_local(void *vargp)
{
    long myid = *((long *)vargp);            /* Extract thread ID */
    long start = myid * nelems_per_thread;   /* Start element index */
    long end = start + nelems_per_thread;    /* End element index */
    long i, sum = 0;

    for (i = start; i < end; i++) {
        sum += i;
    }
    psum[myid] = sum;
    return NULL;
}
                                                            psum-local.c

psum-local Performance
• Significantly faster than psum-array
• [Figure "Parallel Summation": elapsed seconds (0-5) vs. threads (cores)
  for psum-array and psum-local; psum-local reaches 0.32 s at 8 threads
  (8 cores)]

Characterizing Parallel Program Performance
• p processor cores; Tk is the running time using k cores
• Def. Speedup: Sp = T1 / Tp
  – Sp is the relative speedup if T1 is the running time of the parallel
    version of the code running on 1 core
  – Sp is the absolute speedup if T1 is the running time of the
    sequential version of the code running on 1 core
  – Absolute speedup is a much truer measure of the benefits of
    parallelism
• Def. Efficiency: Ep = Sp / p = T1 / (p * Tp)
  – Reported as a percentage in the range (0, 100]
  – Measures the overhead due to parallelization
  – e.g., if T1 = 8 s and T4 = 2.5 s, then S4 = 3.2 and E4 = 80%

Performance of psum-local
• [Table: running time Tp, speedup Sp, and efficiency Ep of psum-local
  vs. thread count t]
• Efficiencies OK, but not great
• Our example is easily parallelizable
• Real codes are often much harder to parallelize

Memory Consistency

int a = 1;
int b = 100;

Thread1:                Thread2:
Wa: a = 2;              Wb: b = 200;
Rb: print(b);           Ra: print(a);

Thread consistency constraints: within each thread, the write happens
before the read (Wa before Rb, Wb before Ra).

• What are the possible values printed?
  – Depends on the memory consistency model
  – Abstract model of how hardware handles concurrent accesses
• Sequential consistency
  – Overall effect consistent with each individual thread
  – Otherwise, arbitrary interleaving

Sequential Consistency Example
• Given the thread consistency constraints, there are six sequentially
  consistent interleavings, printing (in execution order):
  – Wa Rb Wb Ra → 100, 2
  – Wa Wb Rb Ra → 200, 2
  – Wa Wb Ra Rb → 2, 200
  – Wb Ra Wa Rb → 1, 200
  – Wb Wa Ra Rb → 2, 200
  – Wb Wa Rb Ra → 200, 2
• Impossible outputs
  – 100, 1 and 1, 100
  – Would require reaching both Ra and Rb before Wa and Wb

Non-Coherent Cache Scenario
• Write-back caches, without coordination between them
• [Figure: Thread1's cache holds a:2 and b:100; Thread2's cache holds
  a:1 and b:200; main memory still holds a:1 and b:100]
• Each thread reads the other variable's stale value from its own cache,
  so the threads can print 100 and 1, an output that sequential
  consistency forbids

Snoopy Caches
• Tag each cache block with a state:
    Invalid      cannot use value
    Shared       readable copy
    Exclusive    writeable copy
• [Figure: after the writes, Thread1's cache holds an E-tagged block a:2
  and Thread2's cache holds an E-tagged block b:200]

Snoopy Caches (cont)
• When a cache sees a request for one of its E-tagged blocks
  – Supply value from its own cache
  – Set tag to S
• [Figure: each read is serviced from the other thread's cache, the E
  tags are downgraded to S, and the threads print 200 and 2]
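To connect the consistency discussion back to code: C11 atomics with the
default memory_order_seq_cst ordering guarantee sequentially consistent
behavior for the a/b example, so the impossible outputs stay impossible
even on hardware with write-back caches and store buffers. A minimal
sketch, not from the slides:

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_int a = 1;
atomic_int b = 100;

void *thread1(void *vargp)
{
    atomic_store(&a, 2);                /* Wa (seq_cst by default) */
    printf("b=%d\n", atomic_load(&b));  /* Rb: prints 100 or 200 */
    return NULL;
}

void *thread2(void *vargp)
{
    atomic_store(&b, 200);              /* Wb (seq_cst by default) */
    printf("a=%d\n", atomic_load(&a));  /* Ra: prints 1 or 2 */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under sequential consistency, no run prints both b=100 and a=1 */
    return 0;
}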