程序代写 CSE 371 Computer Organization and Design

CSE 371 Computer Organization and Design

CIS 371: Comp. Org. & Design.| Prof. | Multicore
CIS 371: Computer Architecture

Unit 11: Multicore
Slides developed by , & at UPenn
with sources that included University of Wisconsin slides
by , , , and

CIS 371: Comp. Org. & Design.| Prof. | Multicore
This Unit: Shared Memory Multiprocessors
Thread-level parallelism (TLP)
Shared memory model
Multiplexed uniprocessor
Hardware multithreading
Multiprocessing
Cache coherence
Valid/Invalid, MSI, MESI

System software

Chapter 7.1-7.3, 7.5
Chapter 5.8, 5.10
Suggested reading
“A Primer on Memory Consistency and Cache Coherence” (Synthesis Lectures on Computer Architecture) by , , and , November 2011
“Why On-Chip Cache Coherence is Here to Stay”
by , , and ,
Communications of the ACM (CACM), July 2012.

CIS 371: Comp. Org. & Design.| Prof. | Multicore

Beyond Implicit Parallelism
Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
for (i = 0; i < SIZE; i++) z[i] = a*x[i] + y[i]; Lots of instruction-level parallelism (ILP) But how much can we really exploit? 4 wide? 8 wide? Limits to (efficient) super-scalar execution But, if SIZE is 10,000 the loop has 10,000-way parallelism! How do we exploit it? CIS 371: Comp. Org. & Design.| Prof. | Multicore Explicit Parallelism Consider “daxpy”: double a, x[SIZE], y[SIZE], z[SIZE]; void daxpy(): for (i = 0; i < SIZE; i++) z[i] = a*x[i] + y[i]; Break it up into N “chunks” on N cores! Done by the programmer (or maybe a really smart compiler) void daxpy(int chunk_id): chuck_size = SIZE / N my_start = chuck_id * chuck_size my_end = my_start + chuck_size for (i = my_start; i < my_end; i++) z[i] = a*x[i] + y[i] Local variables are “private” and x, y, and z are “shared” Assumes SIZE is a multiple of N (that is, SIZE % N == 0) CIS 371: Comp. Org. & Design.| Prof. | Multicore Chunk ID Start End SIZE = 400, N=4 Explicit Parallelism Consider “daxpy”: double a, x[SIZE], y[SIZE], z[SIZE]; void daxpy(int chunk_id): chuck_size = SIZE / N my_start = chuck_id * chuck_size my_end = my_start + chuck_size for (i = my_start; i < my_end; i++) z[i] = a*x[i] + y[i] Main code then looks like: parallel_daxpy(): for (tid = 0; tid < CORES; tid++) { spawn_task(daxpy, tid); wait_for_tasks(CORES); CIS 371: Comp. Org. & Design.| Prof. | Multicore Explicit (Loop-Level) Parallelism Another way: “OpenMP” annotations to inform the compiler double a, x[SIZE], y[SIZE], z[SIZE]; void daxpy() { #pragma omp parallel for for (i = 0; i < SIZE; i++) { z[i] = a*x[i] + y[i]; But only works if loop is actually parallel If not parallel, unpredictable incorrect behavior may result CIS 371: Comp. Org. & Design.| Prof. | Multicore Multicore & Multiprocessor CIS 371: Comp. Org. & Design.| Prof. | Multicore CIS 371: Comp. Org. & Design.| Prof. | Multicore Multiplying Performance A single core can only be so fast Limited clock frequency Limited instruction-level parallelism What if we need even more computing power? Use multiple cores! But how? Old-school (2000s): Ultra Enterprise 25k 72 dual-core UltraSPARC IV+ processors Up to 1TB of memory Niche: large database servers $$$, weighs more than 1 ton Today: multicore is everywhere Can’t buy a single-core smartphone… …or even smart watch! Intel Quad-Core “Core i7” CIS 371: Comp. Org. & Design.| Prof. | Multicore CIS 371: Comp. Org. & Design.| Prof. | Multicore Application Domains for Multiprocessors Scientific computing/supercomputing Examples: weather simulation, aerodynamics, protein folding Large grids, integrating changes over time Each processor computes for a part of the grid Server workloads Example: airline reservation database Many concurrent updates, searches, lookups, queries Processors handle different requests Media workloads Processors compress/decompress different parts of image/frames Desktop workloads… Gaming workloads… But software must be written to expose parallelism CIS 371: Comp. Org. & Design.| Prof. | Multicore Multicore is Energy Efficient Explicit parallelism (multicore) is highly energy efficient Recall: dynamic voltage and frequency scaling Performance vs power is NOT linear Example: Intel’s Xscale 1 GHz  200 MHz reduces energy used by 30x Consider the impact of parallel execution What if we used 5 Xscales at 200Mhz? Similar performance as a 1 , but 1/6th the energy 5 cores * 1/30th = 1/6th And, amortizes background “uncore” energy among cores Assumes parallel speedup (a difficult task) Subject to Ahmdal’s law CIS 371: Comp. Org. & Design.| Prof. | Multicore Amdahl’s Law Restatement of the law of diminishing returns Total speedup limited by non-accelerated piece Analogy: drive to work & park car, walk to building Consider a task with a “parallel” and “serial” portion What is the speedup with N cores? Speedup(n, p, s) = (s+p) / (s + (p/n)) p is “parallel percentage”, s is “serial percentage” What about infinite cores? Speedup(p, s) = (s+p) / s = 1 / s Example: can optimize 50% of program A Even a “magic” optimization that makes this 50% disappear… …only yields a 2X speedup CIS 371: Comp. Org. & Design.| Prof. | Multicore Amdahl’s Law Graph Source: Wikipedia Threading & The Shared Memory Programming Model CIS 371: Comp. Org. & Design.| Prof. | Multicore CIS 371: Comp. Org. & Design.| Prof. | Multicore First, Uniprocessor Concurrency Software “thread”: Independent flows of execution “Per-thread” state Context state: PC, registers Stack (per-thread local variables) “Shared” state: globals, heap, etc. Threads generally share the same memory space A process is like a thread, but with its own memory space Java has thread support built in, C/C++ use the pthreads library Generally, system software (the O.S.) manages threads “Thread scheduling”, “context switching” In single-core system, all threads share one processor Hardware timer interrupt occasionally triggers O.S. Quickly swapping threads gives illusion of concurrent execution Much more in an operating systems course CIS 371: Comp. Org. & Design.| Prof. | Multicore Shared Memory Programming Model Programmer explicitly creates multiple threads All loads & stores to a single shared memory space Each thread has its own stack frame for local variables All memory shared, accessible by all threads A “thread switch” can occur at any time Pre-emptive multithreading by OS Common uses: Handling user interaction (GUI programming) Handling I/O latency (send network message, wait for response) Expressing parallel work via Thread-Level Parallelism (TLP) This is our focus! threads can r/w each others’ stacks mxthreading useful for I/O latency even on uniprocessor CIS 371: Comp. Org. & Design.| Prof. | Multicore Initially: all variables zero (that is, x=0, y=0) What value pairs can be read by the two loads? What about (x=0, y=0)? Shared Memory Model: Interleaving store 1 → x store 1 → y store 1 → y store 1 → x (x=0, y=1) store 1 → x store 1 → y (x=1, y=0) store 1 → y store 1 → x (x=1, y=1) store 1 → x store 1 → y (x=1, y=1) store 1 → y store 1 → x (x=1, y=1) store 1 → x store 1 → y (x=1, y=1) CIS 371: Comp. Org. & Design.| Prof. | Multicore Shared Memory Implementations Multiplexed uniprocessor Runtime system and/or OS occasionally pre-empt & swap threads Interleaved, but no parallelism Multiprocessors Multiply execution resources, higher peak performance Same interleaved shared-memory model Foreshadowing: allow private caches, further disentangle cores Hardware multithreading Tolerate pipeline latencies, higher efficiency Same interleaved shared-memory model All support the shared memory programming model CIS 371: Comp. Org. & Design.| Prof. | Multicore Simplest Multiprocessor Replicate entire processor pipeline! Instead of replicating just register file & PC Exception: share the caches (we’ll address this bottleneck soon) Multiple threads execute Shared memory programming model Operations (loads and stores) are interleaved “at random” Loads returns the value written by most recent store to location CIS 371: Comp. Org. & Design.| Prof. | Multicore Four Shared Memory Issues Cache coherence If cores have private (non-shared) caches How to make writes to one cache “show up” in others? Parallel programming How does the programmer express the parallelism? Synchronization How to regulate access to shared data? How to implement “locks”? Memory consistency models How to keep programmer sane while letting hardware optimize? How to reconcile shared memory with compiler optimizations, store buffers, and out-of-order execution? Hardware Multithreading Not the same as software multithreading! A hardware thread is a sequential stream of insns from some software thread (e.g., single-threaded process) Hardware Multithreading (MT) Multiple hardware threads dynamically share a single pipeline Replicate only per-thread structures: program counter & registers Hardware interleaves instructions CIS 371: Comp. Org. & Design.| Prof. | Multicore CIS 371: Comp. Org. & Design.| Prof. | Multicore Hardware Multithreading Why use hw multithreading? Multithreading improves utilization and throughput Single programs utilize <50% of pipeline (branch, cache miss) allow insns from different hw threads in pipeline at once Multithreading does not improve single-thread performance Individual threads run as fast or even slower Coarse-grain MT: switch on cache misses Why? Simultaneous MT: no explicit switching, fine-grain interleaving Intel’s “hyperthreading” (just one thread) SMT = Intel’s hyperthreading CIS 371: Comp. Org. & Design.| Prof. | Multicore Roadmap Checkpoint Thread-level parallelism (TLP) Shared memory model Multiplexed uniprocessor Hardware multithreading Multiprocessing Cache coherence Valid/Invalid, MSI, MESI Parallel programming Synchronization Lock implementation Locking gotchas Transactional memory Memory consistency models System software CIS 371: Comp. Org. & Design.| Prof. | Multicore Recall: Simplest Multiprocessor What if we don’t want to share the L1 caches? Bandwidth and latency issue Solution: use per-processor (“private”) caches Coordinate them with a Cache Coherence Protocol Must still provide shared-memory invariant: “Loads read the value written by the most recent store” No-Cache (Conceptual) Implementation Interconnect No-Cache (Conceptual) Implementation Memory A 500 Not a realistic design Interconnect Shared Cache Implementation Memory A 500 Cache Tag Data Interconnect Shared Cache Implementation Memory A 500 Cache Tag Data On-chip shared cache Lacks per-core caches Shared cache becomes bottleneck Interconnect Shared Cache Implementation Memory A 500 Cache Tag Data Interconnect Shared Cache Implementation Memory A 500 Cache Tag Data Cache Tag Data Interconnect Shared Cache Implementation Write into cache Memory A 500 Store 400 -> [A]

Cache Tag Data State
A 400 Dirty

Interconnect
Shared Cache Implementation
Mark as “dirty”
Memory not updated
Memory A 500

Store 400 -> [A]

–why do we mark as dirty?
–When is the dirty bit set to clean?

Interconnect
Adding Private Caches

Cache Tag Data State

Add per-core caches
(write-back caches)
Reduces latency
Increases throughput
Decreases energy

Interconnect
Adding Private Caches

Memory A 500

Cache Tag Data State

Interconnect
Adding Private Caches

Memory A 500