CIS 371: Computer Architecture
Unit 11: Multicore
Slides developed at UPenn, with sources that included University of Wisconsin slides
This Unit: Shared Memory Multiprocessors
• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Cache coherence
• Valid/Invalid, MSI, MESI
Readings
• Chapter 7.1-7.3, 7.5
• Chapter 5.8, 5.10
• Suggested reading
• “A Primer on Memory Consistency and Cache Coherence” (Synthesis Lectures on Computer Architecture), November 2011
• “Why On-Chip Cache Coherence is Here to Stay”, Communications of the ACM (CACM), July 2012
Beyond Implicit Parallelism
• Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
for (i = 0; i < SIZE; i++)
z[i] = a*x[i] + y[i];
• Lots of instruction-level parallelism (ILP) • Great!
• But how much can we really exploit? 4 wide? 8 wide?
  • Limits to (efficient) super-scalar execution
• But, if SIZE is 10,000 the loop has 10,000-way parallelism!
  • How do we exploit it?
Explicit Parallelism
• Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
for (i = 0; i < SIZE; i++)
z[i] = a*x[i] + y[i];
• Break it up into N “chunks” on N cores!
• Done by the programmer (or maybe a really smart compiler)
void daxpy(int chunk_id):
chunk_size = SIZE / N
my_start = chunk_id * chunk_size
my_end = my_start + chunk_size
for (i = my_start; i < my_end; i++)
z[i] = a*x[i] + y[i]
• Example: SIZE = 400, N = 4 → each core handles a chunk of 100 elements
• Local variables are “private” and x, y, and z are “shared”
• Assumes SIZE is a multiple of N (that is, SIZE % N == 0)
Explicit Parallelism
• Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy(int chunk_id):
chunk_size = SIZE / N
my_start = chunk_id * chunk_size
my_end = my_start + chunk_size
for (i = my_start; i < my_end; i++)
z[i] = a*x[i] + y[i]
• Main code then looks like:
parallel_daxpy():
for (tid = 0; tid < CORES; tid++) {
spawn_task(daxpy, tid);
}
wait_for_tasks(CORES);
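Below is a minimal runnable sketch of this pattern in C with POSIX threads. The slide's spawn_task/wait_for_tasks are abstract helpers, so pthread_create/pthread_join stand in for them here; SIZE, CORES, and the data values are illustrative, and SIZE is assumed to be a multiple of CORES. Compile with something like cc -pthread daxpy.c.

/* Hedged sketch: chunked daxpy with pthreads (assumes SIZE % CORES == 0). */
#include <pthread.h>

#define SIZE  400
#define CORES 4

static double a = 2.0, x[SIZE], y[SIZE], z[SIZE];   /* shared by all threads */

static void *daxpy_chunk(void *arg) {
    long chunk_id   = (long)arg;                    /* private per-thread state */
    long chunk_size = SIZE / CORES;
    long my_start   = chunk_id * chunk_size;
    long my_end     = my_start + chunk_size;
    for (long i = my_start; i < my_end; i++)
        z[i] = a * x[i] + y[i];
    return NULL;
}

int main(void) {
    pthread_t tid[CORES];
    for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 2 * i; }  /* illustrative data */
    for (long t = 0; t < CORES; t++)                /* spawn_task(daxpy, tid) */
        pthread_create(&tid[t], NULL, daxpy_chunk, (void *)t);
    for (long t = 0; t < CORES; t++)                /* wait_for_tasks(CORES) */
        pthread_join(tid[t], NULL);
    return 0;
}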
Explicit (Loop-Level) Parallelism
• Another way: “OpenMP” annotations to inform the compiler
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy() {
#pragma omp parallel for
for (i = 0; i < SIZE; i++) {
z[i] = a*x[i] + y[i];
  }
}
• But only works if loop is actually parallel
• If not parallel, unpredictable incorrect behavior may result
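For reference, a self-contained version of the annotated loop is sketched below; the array size, data, and main() are illustrative additions. With GCC or Clang it would typically be built with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially.

/* Hedged sketch: complete OpenMP daxpy (illustrative sizes and data). */
#define SIZE 400

double a = 2.0, x[SIZE], y[SIZE], z[SIZE];

void daxpy(void) {
    #pragma omp parallel for          /* compiler + runtime split the iterations */
    for (int i = 0; i < SIZE; i++)
        z[i] = a * x[i] + y[i];       /* iterations are independent, so this is safe */
}

int main(void) {
    for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 2 * i; }
    daxpy();
    return 0;
}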
Multicore & Multiprocessor Hardware
Multiplying Performance
• A single core can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
• What if we need even more computing power?
  • Use multiple cores! But how?
• Old-school (2000s): UltraEnterprise 25k
  • 72 dual-core UltraSPARC IV+ processors
  • Up to 1 TB of memory
  • Niche: large database servers
  • $$$, weighs more than 1 ton
• Today: multicore is everywhere
  • Can’t buy a single-core smartphone...
• ...or even smart watch!
Intel Quad-Core “Core i7”
Application Domains for Multiprocessors
• Scientific computing / supercomputing
  • Examples: weather simulation, aerodynamics, protein folding
  • Large grids, integrating changes over time
  • Each processor computes for a part of the grid
• Server workloads
  • Example: airline reservation database
  • Many concurrent updates, searches, lookups, queries
  • Processors handle different requests
• Media workloads
  • Processors compress/decompress different parts of image/frames
• Desktop workloads...
• Gaming workloads...
But software must be written to expose parallelism
Multicore is Energy Efficient
• Explicit parallelism (multicore) is highly energy efficient
• Recall: dynamic voltage and frequency scaling
  • Performance vs. power is NOT linear
• Example: Intel’s Xscale
• 1 GHz → 200 MHz reduces energy used by 30x
• Consider the impact of parallel execution
• What if we used 5 Xscales at 200 MHz?
• Similar performance as one Xscale at 1 GHz, but 1/6th the energy
• 5 cores * 1/30th = 1/6th
• And, amortizes background “uncore” energy among cores
• Assumes parallel speedup (a difficult task)
• Subject to Amdahl’s law
Amdahl’s Law
• Restatement of the law of diminishing returns
  • Total speedup limited by the non-accelerated piece
• Analogy: drive to work & park car, walk to building
• Consider a task with a “parallel” and a “serial” portion
• What is the speedup with N cores?
  • Speedup(n, p, s) = (s + p) / (s + p/n)
  • p is the “parallel percentage”, s is the “serial percentage” (s + p = 1)
• What about infinite cores?
  • Speedup(p, s) = (s + p) / s = 1/s
• Example: can optimize 50% of program A
• Even a “magic” optimization that makes this 50% disappear... • ...only yields a 2X speedup
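As a quick numeric check of the Speedup formula above (a sketch; the 50%/2X case matches the example, the other values are purely illustrative):

/* Hedged sketch: evaluate Speedup(n, p, s) = (s + p) / (s + p/n). */
#include <stdio.h>

static double speedup(double n, double p, double s) {
    return (s + p) / (s + p / n);
}

int main(void) {
    printf("p=0.50, s=0.50, n=4    -> %.2f\n", speedup(4, 0.50, 0.50));    /* 1.60  */
    printf("p=0.50, s=0.50, n=inf  -> %.2f\n", 1.0 / 0.50);                /* 2.00  */
    printf("p=0.95, s=0.05, n=16   -> %.2f\n", speedup(16, 0.95, 0.05));   /* ~9.14 */
    return 0;
}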
Amdahl’s Law Graph
[Figure omitted: speedup vs. number of cores for several parallel fractions. Source: Wikipedia]
Threading & The Shared Memory Programming Model
First, Uniprocessor Concurrency
• Software “thread”: independent flow of execution
  • “Per-thread” state
• Context state: PC, registers
• Stack (per-thread local variables)
• “Shared” state: globals, heap, etc.
• Threads generally share the same memory space
• A process is like a thread, but with its own memory space
• Java has thread support built in, C/C++ use the pthreads library
• Generally, system software (the O.S.) manages threads • “Thread scheduling”, “context switching”
• In single-core system, all threads share one processor
• Hardware timer interrupt occasionally triggers O.S.
• Quickly swapping threads gives illusion of concurrent execution
• Much more in an operating systems course
Shared Memory Programming Model
• Programmer explicitly creates multiple threads
• All loads & stores to a single shared memory space • Each thread has its own stack frame for local variables
• All memory shared, accessible by all threads
• A “thread switch” can occur at any time • Pre-emptive multithreading by OS
• Common uses:
• Handling user interaction (GUI programming)
• Handling I/O latency (send network message, wait for response)
• Expressing parallel work via Thread-Level Parallelism (TLP)
• This is our focus!
Shared Memory Model: Interleaving
• Initially: all variables zero (that is, x=0, y=0)
      thread 1              thread 2
      store 1 → y           store 1 → x
      load x                load y
• What value pairs can be read by the two loads?
  • store 1 → y, load x, store 1 → x, load y   (x=0, y=1)
  • store 1 → y, store 1 → x, load x, load y   (x=1, y=1)
  • store 1 → y, store 1 → x, load y, load x   (x=1, y=1)
  • store 1 → x, load y, store 1 → y, load x   (x=1, y=0)
  • store 1 → x, store 1 → y, load y, load x   (x=1, y=1)
  • store 1 → x, store 1 → y, load x, load y   (x=1, y=1)
• What about (x=0, y=0)?
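The sketch below runs the two threads above with plain (non-atomic) accesses; the names and the use of pthreads are illustrative. In C this is formally a data race (undefined behavior) and is shown only to mirror the slide's example: interleaving alone explains (0,1), (1,0), and (1,1), while (0,0) can additionally appear on hardware with store buffers or reordering, which is exactly the memory consistency question raised above.

/* Hedged sketch: the two-thread store/load race from the slide. */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;   /* shared, initially zero */
int r1, r2;         /* what the two loads observed */

static void *thread1(void *arg) { y = 1; r1 = x; return NULL; }  /* store 1 -> y; load x */
static void *thread2(void *arg) { x = 1; r2 = y; return NULL; }  /* store 1 -> x; load y */

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%d, y=%d\n", r1, r2);   /* expect (0,1), (1,0), or (1,1); (0,0) possible on real hw */
    return 0;
}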
Shared Memory Implementations
• Multiplexed uniprocessor
  • Runtime system and/or OS occasionally pre-empt & swap threads
  • Interleaved, but no parallelism
• Multiprocessors
• Multiply execution resources, higher peak performance
• Same interleaved shared-memory model
• Foreshadowing: allow private caches, further disentangle cores
• Hardware multithreading
  • Tolerate pipeline latencies, higher efficiency
  • Same interleaved shared-memory model
• All support the shared memory programming model
Simplest Multiprocessor
• Replicate entire processor pipeline!
• Instead of replicating just register file & PC
• Exception: share the caches (we’ll address this bottleneck soon)
• Multiple threads execute
• Shared memory programming model
• Operations (loads and stores) are interleaved “at random”
• A load returns the value written by the most recent store to that location
Four Shared Memory Issues
1. Cache coherence
• If cores have private (non-shared) caches
• How to make writes to one cache “show up” in others?
2. Parallel programming
• How does the programmer express the parallelism?
3. Synchronization
• How to regulate access to shared data?
• How to implement “locks”?
4. Memory consistency models
• How to keep programmer sane while letting hardware optimize?
• How to reconcile shared memory with compiler optimizations, store buffers, and out-of-order execution?
Hardware Multithreading
• Not the same as software multithreading!
• A hardware thread is a sequential stream of insns
• from some software thread (e.g., a single-threaded process)
• Hardware Multithreading (MT)
  • Multiple hardware threads dynamically share a single pipeline
  • Replicate only per-thread structures: program counter & registers
  • Hardware interleaves instructions
Hardware Multithreading
• Why use hw multithreading?
+ Multithreading improves utilization and throughput
• Single programs utilize <50% of pipeline (branch, cache miss)
• allow insns from different hw threads in pipeline at once
• Multithreading does not improve single-thread performance
• Individual threads run as fast or even slower
• Coarse-grain MT: switch on cache misses. Why?
• Simultaneous MT: no explicit switching, fine-grain interleaving
  • Intel’s “hyperthreading”
Roadmap Checkpoint
• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Cache coherence
  • Valid/Invalid, MSI, MESI
• Parallel programming
• Synchronization
• Lock implementation
• Locking gotchas
• Transactional memory
• Memory consistency models
Recall: Simplest Multiprocessor
• What if we don’t want to share the L1 caches? • Bandwidth and latency issue
• Solution: use per-processor (“private”) caches
• Coordinate them with a Cache Coherence Protocol
• Must still provide shared-memory invariant:
• “Loads read the value written by the most recent store”
No-Cache (Conceptual) Implementation
• No caches
• Not a realistic design
[Figure: cores connected directly to memory over an interconnect]

Shared Cache Implementation
• On-chip shared cache
• Lacks per-core caches
• Shared cache becomes bottleneck
• Store 400 -> [A]: write into the shared cache
• Mark the block as “dirty”; memory not updated
[Figure: cores access one shared on-chip cache over an interconnect, backed by memory]

Adding Private Caches
• Add per-core caches (write-back caches)
• Reduces latency
• Increases throughput
• Decreases energy
• Store 400 -> [A] now writes only into that core’s private cache
[Figure: each core gets a private cache between it and the shared cache]

Private Cache Problem: Incoherence
• What happens when another core tries to read A?
• Uh, oh: P0 got the wrong value!
[Figure: the new value 400 sits dirty in one core’s private cache, so P0 reads the stale copy of A from the shared cache]
Cache Coherence: Who bears the brunt?
• Caches are supposed to be invisible to the programmer!
Rewind: Fix Problem by Tracking Sharers
• Solution: track copies of each block in the shared cache
[Figure: shared cache records which private caches hold a copy of each block]

Use Tracking Information to “Invalidate”
[Figure: when another core wants the block, the tracked copy is found and invalidated]
“Valid/Invalid” Cache Coherence
• To enforce the shared memory invariant…
• “Loads read the value written by the most recent store”
• Enforce the invariant…
• “At most one valid copy of the block”
• Simplest form is a two-state “valid/invalid” protocol
  • If a core wants a copy, it must find and “invalidate” the existing one
• On a cache miss, how is the valid copy found?
  • Option #1 “Snooping”: broadcast to all, whoever has it responds
  • Option #2 “Directory”: track sharers with a separate structure
• Problem: multiple copies can’t exist, even if read-only
  • Consider mostly-read data structures, instructions, etc.
VI Protocol State Transition Table
                This Processor                     Other Processor
State           Load            Store              Load Miss        Store Miss
Invalid (I)     Load Miss → V   Store Miss → V     ---              ---
Valid (V)       Hit             Hit                Send Data → I    Send Data → I

• Rows are “states”: I vs. V
• Columns are “events”
• Writeback events not shown
• Memory controller not shown
• Memory sends data when no processor responds
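A sketch of the table above as a transition function; the event and state names are hypothetical labels, and the actual data movement is left out, matching the table's own omissions.

/* Hedged sketch of the two-state VI protocol table above. */
typedef enum { VI_I, VI_V } vi_state_t;

typedef enum {
    MY_LOAD, MY_STORE,                  /* "This Processor" columns */
    OTHER_LOAD_MISS, OTHER_STORE_MISS   /* "Other Processor" columns */
} vi_event_t;

static vi_state_t vi_next(vi_state_t s, vi_event_t e) {
    if (s == VI_I) {
        if (e == MY_LOAD || e == MY_STORE)
            return VI_V;    /* our miss fetches the (only) valid copy */
        return VI_I;        /* someone else's miss: nothing to do */
    }
    /* s == VI_V */
    if (e == MY_LOAD || e == MY_STORE)
        return VI_V;        /* hit: we already hold the single valid copy */
    return VI_I;            /* other core's miss: send data, invalidate ourselves */
}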
MSI Cache Coherence Protocol
• Solution: enforce the invariant…
• Multiple read-only copies —OR—
• Single read/write copy
• Track these MSI permissions (states) in per-core caches
  • Modified (M): read/write permission
  • Shared (S): read-only permission
  • Invalid (I): no permission
• Also track a “Sharer” bit vector in the shared cache
  • One bit per core; tracks all shared copies of a block
• Then, invalidate all readers when a write occurs
• Allows for many readers…
• …while still enforcing shared memory invariant
(“Loads read the value written by the most recent store”)
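A sketch of how a shared cache might combine MSI states with a sharer bit vector; the structure, field names, and NUM_CORES are illustrative assumptions, not a specific machine's implementation. The write handler enforces the single-writer rule by invalidating every tracked reader, while the read handler just adds the requester to the sharer vector.

/* Hedged sketch: per-block directory state kept alongside the shared cache. */
#include <stdint.h>

#define NUM_CORES 4

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

typedef struct {
    msi_state_t state;    /* permission currently granted for this block */
    uint8_t     sharers;  /* bit c set => core c holds a read-only (S) copy */
    int         owner;    /* core holding the M (read/write) copy, or -1 */
} dir_entry_t;

/* A core requests write permission (store miss, or S -> M upgrade). */
static void handle_store_miss(dir_entry_t *d, int req) {
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != req && ((d->sharers >> c) & 1)) {
            /* invalidate core c's read-only copy */
        }
    }
    if (d->state == MSI_M && d->owner != req) {
        /* fetch the dirty data from d->owner and invalidate it */
    }
    d->sharers = (uint8_t)(1u << req);
    d->owner   = req;
    d->state   = MSI_M;   /* single read/write copy */
}

/* A core requests read permission (load miss). */
static void handle_load_miss(dir_entry_t *d, int req) {
    if (d->state == MSI_M) {
        /* fetch the dirty data from d->owner; the owner downgrades M -> S */
        d->sharers |= (uint8_t)(1u << d->owner);
        d->owner    = -1;
    }
    d->sharers |= (uint8_t)(1u << req);
    d->state    = MSI_S;  /* many read-only copies allowed */
}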
MSI Coherence Example: Step #1
• P1 holds block A in Modified (M) state (data = 400)
[Figure: private caches and the shared cache connected by a point-to-point interconnect]

MSI Coherence Example: Step #2
• P0 executes Load [A] and misses: sends LdMiss: Addr=A to the shared cache
• Shared cache forwards the request to the owner: LdMissForward: Addr=A, Req=P0

MSI Coherence Example: Step #3
• Response: Addr=A, Data=400 (the owner supplies the data)

MSI Coherence Example: Step #4
• Response: Addr=A, Data=400 (the data reaches the requester, P0)

MSI Coherence Example: Step #5
• P0’s Load [A] completes with value 400; P0 sends Unblock: Addr=A

MSI Coherence Example: Step #6
• Store 300 -> [A]: a store begins the next coherence transaction
Classifying Misses: 3C Model
• Divide cache misses into three categories
• Compulsory (cold): never seen this address before
  • Would miss even in an infinite cache
• Capacity: miss caused because the cache is too small