Sogang University
Thread-Level Parallelism
CSE4100: System Programming
Youngjae Kim (PhD)
https://sites.google.com/site/youkim/home
Distributed Computing and Operating Systems Laboratory (DISCOS)
https://discos.sogang.ac.kr
Office: R911, E-mail:
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
■ Parallel Computing Hardware
  ▪ Multicore: multiple separate processors on a single chip
  ▪ Hyperthreading: efficient execution of multiple threads on a single core
■ Thread-Level Parallelism
  ▪ Splitting a program into independent tasks
  ▪ Example: parallel summation
■ Consistency Models
  ▪ What happens when multiple threads are reading and writing shared state
Exploiting parallel execution
■ So far, we have used threads to deal with I/O delays
  ▪ e.g., one thread per client, so one slow client does not delay the others
■ Multicore/hyperthreaded CPUs offer another opportunity
  ▪ Spread work over threads executing in parallel
  ▪ Happens automatically if there are many independent tasks
    (e.g., running many applications or serving many clients)
  ▪ Can also write code to make one big task go faster, by organizing it as multiple parallel sub-tasks
Typical Multicore Processor
[Figure: each core has its own L1 d-cache, L1 i-cache, and unified L2 cache; an L3 unified cache is shared by all cores and connects to main memory.]
■ Multiple processors operating with a coherent view of memory
Benchmark Machine
■ Get data about the machine from /proc/cpuinfo
■ Shark machines
  ▪ Intel Xeon E5520 @ 2.27 GHz
  ▪ Nehalem, ca. 2010
  ▪ Each core can do 2x hyperthreading
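For example, the model name and the number of logical CPUs can be read directly from /proc/cpuinfo. A minimal C sketch of doing so (my own illustration, not course-provided code):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/cpuinfo", "r");
    char line[256];
    int ncpus = 0;

    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "processor", 9) == 0)
            ncpus++;                 /* one "processor" entry per logical CPU */
        else if (strncmp(line, "model name", 10) == 0 && ncpus == 1)
            printf("%s", line);      /* print the model string once */
    }
    fclose(fp);
    printf("logical CPUs: %d\n", ncpus);
    return 0;
}

On a hyperthreaded machine the logical CPU count is twice the number of physical cores.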
Example 1: Parallel Summation
■ Sum the numbers 0, …, n-1
  ▪ Should add up to ((n-1)*n)/2
■ Partition the values 1, …, n-1 into t ranges
  ▪ ⌊n/t⌋ values in each range
  ▪ Each of the t threads processes one range
  ▪ For simplicity, assume n is a multiple of t
■ Let's consider different ways that multiple threads might work on their assigned ranges in parallel
First attempt: psum-mutex
■ Simplest approach: threads sum into a global variable protected by a semaphore mutex.
void *sum_mutex(void *vargp); /* Thread routine */

/* Global shared variables */
long gsum = 0;          /* Global sum */
long nelems_per_thread; /* Number of elements to sum */
sem_t mutex;            /* Mutex to protect global sum */

int main(int argc, char **argv)
{
    long i, nelems, log_nelems, nthreads, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];

    /* Get input arguments */
    nthreads = atoi(argv[1]);
    log_nelems = atoi(argv[2]);
    nelems = (1L << log_nelems);
    nelems_per_thread = nelems / nthreads;
    sem_init(&mutex, 0, 1);

psum-mutex.c
psum-mutex (cont)
■ Simplest approach: threads sum into a global variable protected by a semaphore mutex.
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        Pthread_create(&tid[i], NULL, sum_mutex, &myid[i]);
    }
    for (i = 0; i < nthreads; i++)
        Pthread_join(tid[i], NULL);

    /* Check final answer */
    if (gsum != (nelems * (nelems-1))/2)
        printf("Error: result=%ld\n", gsum);

    exit(0);
}

psum-mutex.c
psum-mutex Thread Routine
■ Simplest approach: threads sum into a global variable protected by a semaphore mutex.
/* Thread routine for psum-mutex.c */
void *sum_mutex(void *vargp)
{
    long myid = *((long *)vargp);          /* Extract thread ID */
    long start = myid * nelems_per_thread; /* Start element index */
    long end = start + nelems_per_thread;  /* End element index */
    long i;

    for (i = start; i < end; i++) {
        P(&mutex);
        gsum += i;
        V(&mutex);
    }
    return NULL;
}

psum-mutex.c
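P and V here are not POSIX calls; they are the CS:APP wrapper functions around sem_wait and sem_post. A minimal sketch of such wrappers, with simpler error handling than the real csapp.c:

#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

/* P: decrement the semaphore, blocking while its value is zero */
void P(sem_t *sem)
{
    if (sem_wait(sem) < 0) {
        perror("P");
        exit(1);
    }
}

/* V: increment the semaphore, waking one blocked waiter if any */
void V(sem_t *sem)
{
    if (sem_post(sem) < 0) {
        perror("V");
        exit(1);
    }
}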
psum-mutex Performance
■ Shark machine with 8 cores, n = 2^31
[Table: psum-mutex elapsed time in seconds vs. number of threads (cores).]
■ Nasty surprise:
  ▪ A single thread is very slow
  ▪ Gets slower as we use more cores
Next Attempt: psum-array
■ Peer thread i sums into global array element psum[i]
■ Main waits for threads to finish, then sums the elements of psum
■ Eliminates the need for mutex synchronization
/* Thread routine for psum-array.c */
void *sum_array(void *vargp)
{
    long myid = *((long *)vargp);          /* Extract thread ID */
    long start = myid * nelems_per_thread; /* Start element index */
    long end = start + nelems_per_thread;  /* End element index */
    long i;

    for (i = start; i < end; i++) {
        psum[myid] += i;
    }
    return NULL;
}

psum-array.c
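The slides show only the thread routine. A self-contained sketch of the whole psum-array program, with main modeled on psum-mutex's (the MAXTHREADS value and the use of raw pthread calls instead of the CS:APP wrappers are my choices):

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define MAXTHREADS 32

long psum[MAXTHREADS];  /* Partial sums, one slot per thread */
long nelems_per_thread; /* Number of elements each thread sums */

/* Same logic as sum_array above */
void *sum_array(void *vargp)
{
    long myid = *((long *)vargp);
    long start = myid * nelems_per_thread;
    long end = start + nelems_per_thread;
    for (long i = start; i < end; i++)
        psum[myid] += i;
    return NULL;
}

int main(int argc, char **argv)
{
    long nthreads = atoi(argv[1]);     /* e.g., ./psum-array 8 24 */
    long nelems = 1L << atoi(argv[2]);
    long myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];

    nelems_per_thread = nelems / nthreads;
    for (long i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, sum_array, &myid[i]);
    }
    for (long i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    /* Main thread combines the partial sums -- no mutex needed,
       since each peer wrote only its own array element */
    long result = 0;
    for (long i = 0; i < nthreads; i++)
        result += psum[i];

    if (result != (nelems * (nelems - 1)) / 2)
        printf("Error: result=%ld\n", result);
    exit(0);
}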
psum-array Performance
■ Orders of magnitude faster than psum-mutex
[Figure: "Parallel Summation" plot of elapsed seconds vs. threads (cores) for psum-array.]
Next Attempt: psum-local
■ Reduce memory references by having peer thread i sum into a local variable (register)
/* Thread routine for psum-local.c */
void *sum_local(void *vargp)
{
    long myid = *((long *)vargp);          /* Extract thread ID */
    long start = myid * nelems_per_thread; /* Start element index */
    long end = start + nelems_per_thread;  /* End element index */
    long i, sum = 0;

    for (i = start; i < end; i++) {
        sum += i;
    }
    psum[myid] = sum;
    return NULL;
}

psum-local.c
psum-local Performance
■ Significantly faster than psum-array
[Figure: "Parallel Summation" plot of elapsed seconds vs. threads (cores) for psum-array and psum-local; psum-local reaches about 0.32 s at 8 threads (8 cores).]
Characterizing Parallel Program Performance
■ p processor cores; Tk is the running time using k cores
■ Def. Speedup: Sp = T1 / Tp
  ▪ Sp is the relative speedup if T1 is the running time of the parallel version of the code on 1 core
  ▪ Sp is the absolute speedup if T1 is the running time of the sequential version of the code on 1 core
  ▪ Absolute speedup is a much truer measure of the benefits of parallelism
■ Def. Efficiency: Ep = Sp / p = T1 / (p * Tp)
  ▪ Reported as a percentage in the range (0, 100]
  ▪ Measures the overhead due to parallelization
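As a quick worked example of these definitions (the numbers are made up for illustration, not measurements from the Shark machines): if the sequential version runs in T1 = 8 s and the version on p = 4 cores runs in T4 = 2.5 s, then the absolute speedup is S4 = T1 / T4 = 8 / 2.5 = 3.2, and the efficiency is E4 = S4 / 4 = 0.8, i.e., 80%. Perfect linear speedup would give S4 = 4 and E4 = 100%.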
Performance of psum-local
[Table: running time (Tp), speedup (Sp), and efficiency (Ep) of psum-local as the number of threads t varies.]
■ Efficiencies are OK, not great
■ Our example is easily parallelizable
■ Real codes are often much harder to parallelize
Memory Consistency

int a = 1;
int b = 100;

Thread1:          Thread2:
Wa: a = 2;        Wb: b = 200;
Rb: print(b);     Ra: print(a);

Thread consistency constraints: within each thread, operations happen in program order (Wa before Rb, Wb before Ra).

■ What are the possible values printed?
  ▪ Depends on the memory consistency model
  ▪ Abstract model of how the hardware handles concurrent accesses
■ Sequential consistency
  ▪ Overall effect is consistent with each individual thread's order
  ▪ Otherwise, arbitrary interleaving
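A runnable pthread version of this abstract example (function names and output format are my own; note that, strictly speaking, the unsynchronized accesses form a data race in C, so this only mirrors the slide's idealized model):

#include <stdio.h>
#include <pthread.h>

int a = 1;
int b = 100;

/* Thread1: Wa then Rb */
void *thread1(void *arg)
{
    a = 2;                  /* Wa */
    printf("b = %d\n", b);  /* Rb */
    return NULL;
}

/* Thread2: Wb then Ra */
void *thread2(void *arg)
{
    b = 200;                /* Wb */
    printf("a = %d\n", a);  /* Ra */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}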
Sequential Consistency Example

int a = 1;
int b = 100;

Thread1:          Thread2:
Wa: a = 2;        Wb: b = 200;
Rb: print(b);     Ra: print(a);

[Figure: tree of all sequentially consistent interleavings that respect the constraints Wa before Rb and Wb before Ra. Possible outputs, in the order printed: 100, 2; 200, 2; 2, 200; 1, 200.]

■ Impossible outputs
  ▪ 100, 1 and 1, 100
  ▪ Would require reaching both Ra and Rb before Wa and Wb
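A short C program can enumerate the six interleavings that respect both threads' program order and print the resulting outputs, confirming the set above (this enumerator is my own illustration):

#include <stdio.h>

/* Ops: 0 = Wa (a = 2), 1 = Rb (print b), 2 = Wb (b = 200), 3 = Ra (print a) */
static void run(const int *order)
{
    int a = 1, b = 100;
    int out[2], n = 0;
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 0: a = 2;        break;  /* Wa */
        case 1: out[n++] = b; break;  /* Rb */
        case 2: b = 200;      break;  /* Wb */
        case 3: out[n++] = a; break;  /* Ra */
        }
    }
    printf("%d, %d\n", out[0], out[1]);
}

int main(void)
{
    /* All interleavings with 0 before 1 and 2 before 3 */
    const int orders[6][4] = {
        {0,1,2,3}, {0,2,1,3}, {0,2,3,1},
        {2,0,1,3}, {2,0,3,1}, {2,3,0,1},
    };
    for (int i = 0; i < 6; i++)
        run(orders[i]);
    return 0;
}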
Non-Coherent Cache Scenario
■ Write-back caches, without coordination between them

int a = 1;
int b = 100;

Thread1:          Thread2:
Wa: a = 2;        Wb: b = 200;
Rb: print(b);     Ra: print(a);

[Figure: each thread's write-back cache holds its own new value (Thread1's cache: a:2, b:100; Thread2's cache: b:200) while main memory still holds a:1, b:100. Each thread can read a stale copy of the other's variable, so even the "impossible" output 100, 1 can occur.]
Snoopy Caches
■ Tag each cache block with a state:

  Invalid     Cannot use value
  Shared      Readable copy
  Exclusive   Writeable copy

int a = 1;
int b = 100;

Thread1:          Thread2:
Wa: a = 2;        Wb: b = 200;
Rb: print(b);     Ra: print(a);

[Figure: after the writes, Thread1's cache holds a:2 tagged E (exclusive) and Thread2's cache holds b:200 tagged E.]
Snoopy Caches (cont)
■ Tag each cache block with a state:

  Invalid     Cannot use value
  Shared      Readable copy
  Exclusive   Writeable copy

int a = 1;
int b = 100;

Thread1:          Thread2:
Wa: a = 2;        Wb: b = 200;
Rb: print(b);     Ra: print(a);

■ When a cache sees a request for one of its E-tagged blocks:
  ▪ Supply the value from the cache
  ▪ Set the tag to S

[Figure: Thread2's read of a is supplied by Thread1's cache and both copies of a:2 become S (shared); likewise Thread1's read of b gets 200. The output is 200, 2, which is sequentially consistent.]
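To make the state transitions concrete, here is a toy single-word, two-cache model of the E/S/I states in C (an illustrative simulation I wrote, not textbook code or real hardware behavior): a write takes an Exclusive copy and invalidates the other cache; a read miss snoops an Exclusive copy elsewhere, which supplies the value and downgrades to Shared.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } State;

typedef struct {
    State state;
    int value;
} CacheLine;

int memory = 1;  /* main memory's copy of the word (a = 1) */
CacheLine cache[2] = { {INVALID, 0}, {INVALID, 0} };

/* Write: gain an Exclusive copy and invalidate the other cache */
void cache_write(int id, int v)
{
    cache[1 - id].state = INVALID;
    cache[id].state = EXCLUSIVE;
    cache[id].value = v;
}

/* Read: hit if Shared or Exclusive; on a miss, snoop the other
   cache (downgrading an Exclusive copy to Shared) or go to memory */
int cache_read(int id)
{
    if (cache[id].state == INVALID) {
        int other = 1 - id;
        if (cache[other].state == EXCLUSIVE) {
            memory = cache[other].value;  /* write back */
            cache[other].state = SHARED;  /* E -> S */
        }
        cache[id].value = memory;
        cache[id].state = SHARED;
    }
    return cache[id].value;
}

int main(void)
{
    cache_write(0, 2);              /* Thread1: Wa (a = 2) */
    printf("%d\n", cache_read(1));  /* Thread2: Ra, prints 2 via snooping */
    return 0;
}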