CSE 371 Computer Organization and Design
CIS 371: Computer Architecture
Unit 11: Multicore
Slides developed by , & at UPenn
with sources that included University of Wisconsin slides
by , , , and
This Unit: Shared Memory Multiprocessors
Thread-level parallelism (TLP)
Shared memory model
Multiplexed uniprocessor
Hardware multithreading
Multiprocessing
Cache coherence
Valid/Invalid, MSI, MESI
System software
Chapter 7.1-7.3, 7.5
Chapter 5.8, 5.10
Suggested reading
“A Primer on Memory Consistency and Cache Coherence” (Synthesis Lectures on Computer Architecture) by , , and , November 2011
“Why On-Chip Cache Coherence is Here to Stay”
by , , and ,
Communications of the ACM (CACM), July 2012.
Beyond Implicit Parallelism
Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
for (i = 0; i < SIZE; i++)
z[i] = a*x[i] + y[i];
Lots of instruction-level parallelism (ILP)
But how much can we really exploit? 4 wide? 8 wide?
Limits to (efficient) super-scalar execution
But, if SIZE is 10,000 the loop has 10,000-way parallelism!
How do we exploit it?
Explicit Parallelism
Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
for (i = 0; i < SIZE; i++)
z[i] = a*x[i] + y[i];
Break it up into N “chunks” on N cores!
Done by the programmer (or maybe a really smart compiler)
void daxpy(int chunk_id):
chunk_size = SIZE / N
my_start = chunk_id * chunk_size
my_end = my_start + chunk_size
for (i = my_start; i < my_end; i++)
z[i] = a*x[i] + y[i]
Local variables are “private” and x, y, and z are “shared”
Assumes SIZE is a multiple of N (that is, SIZE % N == 0)
SIZE = 400, N = 4
Chunk ID   Start   End
0            0     100
1          100     200
2          200     300
3          300     400
Explicit Parallelism
Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy(int chunk_id):
chunk_size = SIZE / N
my_start = chunk_id * chunk_size
my_end = my_start + chunk_size
for (i = my_start; i < my_end; i++)
z[i] = a*x[i] + y[i]
Main code then looks like:
parallel_daxpy():
  for (tid = 0; tid < CORES; tid++) {
    spawn_task(daxpy, tid);
  }
  wait_for_tasks(CORES);
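As a concrete sketch, here is one way the chunked daxpy and the spawn/wait pseudocode above could be realized with the pthreads library. The fixed SIZE and N values, the initialization in main, and the casts used to pass the chunk id are illustrative choices, not part of the original slides. (Compile with -pthread.)

#include <pthread.h>

#define SIZE 400
#define N 4                              /* number of tasks; assumes SIZE % N == 0 */

double a, x[SIZE], y[SIZE], z[SIZE];     /* shared by all threads */

static void *daxpy(void *arg) {
    int chunk_id   = (int)(long)arg;     /* private, per-thread locals */
    int chunk_size = SIZE / N;
    int my_start   = chunk_id * chunk_size;
    int my_end     = my_start + chunk_size;
    for (int i = my_start; i < my_end; i++)
        z[i] = a * x[i] + y[i];
    return NULL;
}

static void parallel_daxpy(void) {
    pthread_t tasks[N];
    for (int tid = 0; tid < N; tid++)    /* spawn_task(daxpy, tid) */
        pthread_create(&tasks[tid], NULL, daxpy, (void *)(long)tid);
    for (int tid = 0; tid < N; tid++)    /* wait_for_tasks(CORES) */
        pthread_join(tasks[tid], NULL);
}

int main(void) {
    a = 2.0;
    for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 1.0; }
    parallel_daxpy();
    return 0;
}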
Explicit (Loop-Level) Parallelism
Another way: “OpenMP” annotations to inform the compiler
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy() {
  #pragma omp parallel for
  for (int i = 0; i < SIZE; i++) {
    z[i] = a*x[i] + y[i];
  }
}
But only works if loop is actually parallel
If not parallel, unpredictable incorrect behavior may result
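For reference, a self-contained version of the annotated loop, assuming an OpenMP-capable compiler (e.g., gcc or clang with -fopenmp); the main function and the initialization values are illustrative additions.

#include <stdio.h>

#define SIZE 400

double a, x[SIZE], y[SIZE], z[SIZE];

void daxpy(void) {
    #pragma omp parallel for             /* iterations divided among threads */
    for (int i = 0; i < SIZE; i++)
        z[i] = a * x[i] + y[i];
}

int main(void) {
    a = 2.0;
    for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 1.0; }
    daxpy();
    printf("z[SIZE-1] = %f\n", z[SIZE - 1]);   /* expect 2*399 + 1 = 799 */
    return 0;
}

/* Build: gcc -fopenmp daxpy_omp.c -o daxpy_omp */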
Multicore & Multiprocessor
Multiplying Performance
A single core can only be so fast
Limited clock frequency
Limited instruction-level parallelism
What if we need even more computing power?
Use multiple cores! But how?
Old-school (2000s): Ultra Enterprise 25k
72 dual-core UltraSPARC IV+ processors
Up to 1TB of memory
Niche: large database servers
$$$, weighs more than 1 ton
Today: multicore is everywhere
Can’t buy a single-core smartphone…
…or even smart watch!
Intel Quad-Core “Core i7”
Application Domains for Multiprocessors
Scientific computing/supercomputing
Examples: weather simulation, aerodynamics, protein folding
Large grids, integrating changes over time
Each processor computes for a part of the grid
Server workloads
Example: airline reservation database
Many concurrent updates, searches, lookups, queries
Processors handle different requests
Media workloads
Processors compress/decompress different parts of image/frames
Desktop workloads…
Gaming workloads…
But software must be written to expose parallelism
Multicore is Energy Efficient
Explicit parallelism (multicore) is highly energy efficient
Recall: dynamic voltage and frequency scaling
Performance vs power is NOT linear
Example: Intel's XScale
1 GHz → 200 MHz reduces energy used by 30x
Consider the impact of parallel execution
What if we used 5 XScales at 200 MHz?
Similar performance as one XScale at 1 GHz, but 1/6th the energy
5 cores * 1/30th = 1/6th
And, amortizes background “uncore” energy among cores
Assumes parallel speedup (a difficult task)
Subject to Amdahl's law
Amdahl’s Law
Restatement of the law of diminishing returns
Total speedup limited by non-accelerated piece
Analogy: drive to work & park car, walk to building
Consider a task with a “parallel” and “serial” portion
What is the speedup with N cores?
Speedup(n, p, s) = (s+p) / (s + (p/n))
p is “parallel percentage”, s is “serial percentage”
What about infinite cores?
Speedup(p, s) = (s+p) / s = 1 / s
Example: can optimize 50% of program A
Even a “magic” optimization that makes this 50% disappear…
…only yields a 2X speedup
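As a quick check of the arithmetic, a small C helper (the function name is made up for illustration) that evaluates the speedup formula above:

#include <stdio.h>

/* Amdahl's law: speedup with n cores when fraction p of the work is
 * parallelizable and fraction s = 1 - p is serial. */
static double amdahl_speedup(double p, double s, double n) {
    return (s + p) / (s + p / n);
}

int main(void) {
    /* 50% parallel, 50% serial */
    printf("8 cores:    %.2fx\n", amdahl_speedup(0.5, 0.5, 8.0));   /* ~1.78x */
    printf("many cores: %.2fx\n", amdahl_speedup(0.5, 0.5, 1e9));   /* -> 2.00x */
    return 0;
}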
Amdahl’s Law Graph
Source: Wikipedia
Threading &
The Shared Memory Programming Model
First, Uniprocessor Concurrency
Software "threads": independent flows of execution
“Per-thread” state
Context state: PC, registers
Stack (per-thread local variables)
“Shared” state: globals, heap, etc.
Threads generally share the same memory space
A process is like a thread, but with its own memory space
Java has thread support built in, C/C++ use the pthreads library
Generally, system software (the O.S.) manages threads
“Thread scheduling”, “context switching”
In a single-core system, all threads share one processor
Hardware timer interrupt occasionally triggers O.S.
Quickly swapping threads gives illusion of concurrent execution
Much more in an operating systems course
Shared Memory Programming Model
Programmer explicitly creates multiple threads
All loads & stores to a single shared memory space
Each thread has its own stack frame for local variables
All memory shared, accessible by all threads
A “thread switch” can occur at any time
Pre-emptive multithreading by OS
Common uses:
Handling user interaction (GUI programming)
Handling I/O latency (send network message, wait for response)
Expressing parallel work via Thread-Level Parallelism (TLP)
This is our focus!
Threads can read/write each other's stacks
Multithreading is useful for hiding I/O latency even on a uniprocessor
Shared Memory Model: Interleaving
Initially: all variables zero (that is, x=0, y=0)
Thread 0: store 1 → x, then load y
Thread 1: store 1 → y, then load x
What value pairs can be read by the two loads?
The four operations can interleave in six ways (program order preserved within each thread):
Thread 1 runs entirely first: (x=0, y=1)
Thread 0 runs entirely first: (x=1, y=0)
In the four remaining interleavings, both stores happen before both loads: (x=1, y=1)
What about (x=0, y=0)?
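A minimal sketch of this experiment, assuming a pthreads environment and using C11 atomics (sequentially consistent by default), which match the interleaving model above; under that model the (x=0, y=0) outcome cannot occur:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x = 0, y = 0;
int r_y, r_x;                       /* values read by the two loads */

void *thread0(void *arg) {
    (void)arg;
    atomic_store(&x, 1);            /* store 1 → x */
    r_y = atomic_load(&y);          /* load y */
    return NULL;
}

void *thread1(void *arg) {
    (void)arg;
    atomic_store(&y, 1);            /* store 1 → y */
    r_x = atomic_load(&x);          /* load x */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("(x=%d, y=%d)\n", r_x, r_y);   /* never prints (x=0, y=0) */
    return 0;
}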
Shared Memory Implementations
Multiplexed uniprocessor
Runtime system and/or OS occasionally pre-empt & swap threads
Interleaved, but no parallelism
Multiprocessors
Multiply execution resources, higher peak performance
Same interleaved shared-memory model
Foreshadowing: allow private caches, further disentangle cores
Hardware multithreading
Tolerate pipeline latencies, higher efficiency
Same interleaved shared-memory model
All support the shared memory programming model
Simplest Multiprocessor
Replicate entire processor pipeline!
Instead of replicating just register file & PC
Exception: share the caches (we’ll address this bottleneck soon)
Multiple threads execute
Shared memory programming model
Operations (loads and stores) are interleaved “at random”
Loads return the value written by the most recent store to that location
Four Shared Memory Issues
Cache coherence
If cores have private (non-shared) caches
How to make writes to one cache “show up” in others?
Parallel programming
How does the programmer express the parallelism?
Synchronization
How to regulate access to shared data?
How to implement “locks”?
Memory consistency models
How to keep programmer sane while letting hardware optimize?
How to reconcile shared memory with compiler optimizations, store buffers, and out-of-order execution?
Hardware Multithreading
Not the same as software multithreading!
A hardware thread is a sequential stream of insns
from some software thread (e.g., single-threaded process)
Hardware Multithreading (MT)
Multiple hardware threads dynamically share a single pipeline
Replicate only per-thread structures: program counter & registers
Hardware interleaves instructions
Hardware Multithreading
Why use hw multithreading?
Multithreading improves utilization and throughput
Single programs utilize <50% of pipeline (branch, cache miss)
Allow insns from different hw threads in the pipeline at once
Multithreading does not improve single-thread performance
Individual threads run as fast or even slower
Coarse-grain MT: switch threads on cache misses. Why?
Simultaneous MT: no explicit switching, fine-grain interleaving
Intel’s “hyperthreading”
Roadmap Checkpoint
Thread-level parallelism (TLP)
Shared memory model
Multiplexed uniprocessor
Hardware multithreading
Multiprocessing
Cache coherence
Valid/Invalid, MSI, MESI
Parallel programming
Synchronization
Lock implementation
Locking gotchas
Transactional memory
Memory consistency models
System software
Recall: Simplest Multiprocessor
What if we don’t want to share the L1 caches?
Bandwidth and latency issue
Solution: use per-processor (“private”) caches
Coordinate them with a Cache Coherence Protocol
Must still provide shared-memory invariant:
“Loads read the value written by the most recent store”
No-Cache (Conceptual) Implementation
(Diagram: each core connects through the interconnect directly to memory; memory holds block A = 500)
Not a realistic design
Shared Cache Implementation
(Diagram: the cores connect through the interconnect to a single on-chip shared cache, which sits in front of memory; memory holds A = 500)
On-chip shared cache
Lacks per-core caches
Shared cache becomes bottleneck
Example: one core executes Store 400 -> [A]
Write into cache: the shared cache now holds A = 400
Mark as "dirty"
Memory not updated (still holds A = 500)
– Why do we mark as dirty?
– When is the dirty bit set to clean?
Adding Private Caches
Add per-core caches
(write-back caches)
Reduces latency
Increases throughput
Decreases energy
(Diagram: each core now has its own private cache between it and the interconnect; the shared cache and memory sit behind it)
Example: P1 loads A, so its private cache and the shared cache both hold A = 500, Clean
P1 then executes Store 400 -> [A]
P1's private cache now holds A = 400, Dirty; the shared cache still holds A = 500, Clean; memory still holds A = 500
Private Cache Problem: Incoherence
What happens when another core tries to read A?
P0 misses in its private cache and reads A from the shared cache, which still holds the stale value 500
(P1's private cache still holds A = 400, Dirty)
P0 got the wrong value!
Cache Coherence: Who bears the brunt?
Caches are supposed to be invisible to the programmer!
Rewind: Fix Problem by Tracking Sharers
Solution: Track copies of each block
The shared cache adds an "Owner" field per block: A = 500, Owner = P1 (P1 still holds A = 400, Dirty)
Use Tracking Information to "Invalidate"
On P0's read miss, the shared cache sees that P1 owns block A
P1's copy (A = 400, Dirty) is written back and invalidated
P0 now receives the up-to-date value instead of the stale one
“Valid/Invalid” Cache Coherence
To enforce the shared memory invariant…
“Loads read the value written by the most recent store”
Enforce the invariant…
“At most one valid copy of the block”
Simplest form is a two-state “valid/invalid” protocol
If a core wants a copy, must find and “invalidate” it
On a cache miss, how is the valid copy found?
Option #1 “Snooping”: broadcast to all, whoever has it responds
Option #2: “Directory”: track sharers with separate structure
Problem: multiple copies can’t exist, even if read-only
Consider mostly-read data structures, instructions, etc.
VI (MI) Coherence Protocol
VI (valid-invalid) protocol: aka “MI”
Two states (per block in cache)
V (valid): have block
I (invalid): don’t have block
Can implement with valid bit
Protocol diagram (left & next slide)
If anyone wants to read/write block
Give it up: transition to I state
Write-back if your own copy is dirty
This is an invalidate protocol
Update protocol: copy data, don’t invalidate
Sounds good, but uses too much bandwidth
(State diagram: in I, a Load or Store causes a LdMiss/StMiss and the block moves to V; in V, Loads and Stores hit and the block stays in V; another core's LdMiss or StMiss forces a WB of a dirty copy and the block returns to I)
VI Protocol State Transition Table
             This Processor                     Other Processor
State        Load             Store             Load Miss        Store Miss
I            Load Miss -> V   Store Miss -> V   —                —
V            Hit              Hit               Send Data -> I   Send Data -> I
Rows are “states”
Columns are “events”
Writeback events not shown
Memory controller not shown
Memory sends data when no processor responds
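The table above maps directly onto a small state machine. A minimal sketch in C, with made-up state and event names (an illustration, not code from a real coherence controller):

typedef enum { STATE_I, STATE_V } vi_state_t;

typedef enum {                        /* events seen by one cache block */
    LOCAL_LOAD, LOCAL_STORE,          /* from this processor            */
    OTHER_LOAD_MISS, OTHER_STORE_MISS /* observed from another core     */
} vi_event_t;

/* Returns the next state; the out-parameters report the required actions. */
vi_state_t vi_next_state(vi_state_t s, vi_event_t e,
                         int *fetch_block, int *send_data) {
    *fetch_block = 0;
    *send_data   = 0;
    switch (s) {
    case STATE_I:
        if (e == LOCAL_LOAD || e == LOCAL_STORE) {  /* Load Miss / Store Miss */
            *fetch_block = 1;                       /* acquire the only copy  */
            return STATE_V;
        }
        return STATE_I;                             /* other cores' misses: no action */
    case STATE_V:
        if (e == LOCAL_LOAD || e == LOCAL_STORE)
            return STATE_V;                         /* hit */
        *send_data = 1;                             /* give the copy up (write back if dirty) */
        return STATE_I;
    }
    return s;
}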
MSI Cache Coherence Protocol
Solution: enforce the invariant…
Multiple read-only copies —OR—
Single read/write copy
Track these MSI permissions (states) in per-core caches
Modified (M): read/write permission
Shared (S): read-only permission
Invalid (I): no permission
Also track a “Sharer” bit vector in shared cache
One bit per core; tracks all shared copies of a block
Then, invalidate all readers when a write occurs
Allows for many readers…
…while still enforcing shared memory invariant
(“Loads read the value written by the most recent store”)
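A sketch of the shared cache's (directory's) side of this idea, keeping a sharer bit vector per block and invalidating all readers when a core wants write permission. The structure, field names, and interconnect stubs are illustrative assumptions, not a real protocol implementation:

#include <stdint.h>

#define NUM_CORES 8

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

typedef struct {
    msi_state_t state;        /* permission currently granted            */
    uint8_t     sharers;      /* bit i set => core i holds a copy        */
    int         owner;        /* core holding the M copy (when state==M) */
} dir_entry_t;

/* Stand-ins for messages on the interconnect. */
static void send_invalidate(int core, long addr)   { (void)core;  (void)addr; }
static void fetch_dirty_copy(int owner, long addr) { (void)owner; (void)addr; }

/* Core `req` wants write (M) permission for block `addr`. */
void handle_store_miss(dir_entry_t *e, int req, long addr) {
    if (e->state == MODIFIED && e->owner != req)
        fetch_dirty_copy(e->owner, addr);           /* get latest data, invalidate old owner */

    if (e->state == SHARED)                         /* invalidate every read-only copy */
        for (int c = 0; c < NUM_CORES; c++)
            if ((e->sharers & (1u << c)) && c != req)
                send_invalidate(c, addr);

    e->state   = MODIFIED;                          /* single read/write copy remains */
    e->sharers = (uint8_t)(1u << req);
    e->owner   = req;
}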
MSI Coherence Example: Step #1
(Point-to-point interconnect between the cores' private caches, the shared cache, and memory)
Initial state: P1 holds A = 400, Modified; the shared cache records A = 500, state "P1 is Modified", Sharers = P1; block B is Idle; memory holds A = 500
MSI Coherence Example: Step #2
P0 sends LdMiss: Addr=A to the shared cache
The shared cache marks block A as Blocked and forwards the request to the owner: LdMissForward: Addr=A, Req=P0
Why block?
MSI Coherence Example: Step #3
P1 responds directly to P0: Response: Addr=A, Data=400
MSI Coherence Example: Step #4
The Response: Addr=A, Data=400 arrives at P0, which fills its private cache with A = 400
(Shared cache entry for A is still Blocked, Sharers = P1)
MSI Coherence Example: Step #5
P0 completes its Load [A] (400) and sends Unblock: Addr=A to the shared cache
The shared cache entry for A becomes Shared, Sharers = P0, P1 (its data is still 500)
How can the line be both shared and dirty?
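A sketch of what the shared cache does in this exchange, with the Blocked state answering the "why block?" question (it holds off other requests for A until the in-flight transfer completes). Message names follow the slides; the data structures and printf stand-ins are illustrative only:

#include <stdio.h>

typedef enum { IDLE, SHARED_BLK, P_MODIFIED, BLOCKED } dir_state_t;

typedef struct {
    dir_state_t state;
    int         owner;        /* core with the Modified copy (if any) */
    unsigned    sharers;      /* bit i set => core i has a copy       */
} dir_entry_t;

/* Step #2: a LdMiss for this block arrives from core `req`. */
void on_ld_miss(dir_entry_t *e, int req) {
    if (e->state == P_MODIFIED) {
        e->state = BLOCKED;   /* defer any other requests for this block */
        printf("LdMissForward: Req=P%d -> owner P%d\n", req, e->owner);
        /* Steps #3-#4: the owner sends Response (Addr, Data) straight to
         * the requester over the point-to-point interconnect.            */
    }
}

/* Step #5: the requester has its data and sends Unblock. */
void on_unblock(dir_entry_t *e, int req) {
    e->sharers |= (1u << req) | (1u << e->owner);
    e->state    = SHARED_BLK; /* both copies now read-only; the shared cache's
                                 own data may still be stale (hence "shared and dirty") */
}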