
CIS 371: Computer Architecture
Unit 11: Multicore
Slides developed at UPenn, with sources that included University of Wisconsin slides



This Unit: Shared Memory Multiprocessors
• Thread-level parallelism (TLP)
  • Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Cache coherence
  • Valid/Invalid, MSI, MESI
• [Roadmap figure: system software and applications running on CPUs sharing memory (Mem)]

Readings
• Chapter 7.1-7.3, 7.5
• Chapter 5.8, 5.10
• Suggested reading
  • "A Primer on Memory Consistency and Cache Coherence" (Synthesis Lectures on Computer Architecture), November 2011
  • "Why On-Chip Cache Coherence is Here to Stay", Communications of the ACM (CACM), July 2012

Beyond Implicit Parallelism
• Consider “daxpy”:
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
  for (i = 0; i < SIZE; i++)
    z[i] = a*x[i] + y[i];
• Lots of instruction-level parallelism (ILP)
  • Great!
  • But how much can we really exploit? 4 wide? 8 wide?
  • Limits to (efficient) super-scalar execution
• But, if SIZE is 10,000 the loop has 10,000-way parallelism!
  • How do we exploit it?

Explicit Parallelism
• Consider "daxpy":
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy():
  for (i = 0; i < SIZE; i++)
    z[i] = a*x[i] + y[i];
• Break it up into N "chunks" on N cores!
  • Done by the programmer (or maybe a really smart compiler)
void daxpy(int chunk_id):
  chunk_size = SIZE / N
  my_start = chunk_id * chunk_size
  my_end = my_start + chunk_size
  for (i = my_start; i < my_end; i++)
    z[i] = a*x[i] + y[i]
• Example: SIZE = 400, N = 4
• Local variables are "private" and x, y, and z are "shared"
• Assumes SIZE is a multiple of N (that is, SIZE % N == 0)

Explicit Parallelism
• Consider "daxpy":
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy(int chunk_id):
  chunk_size = SIZE / N
  my_start = chunk_id * chunk_size
  my_end = my_start + chunk_size
  for (i = my_start; i < my_end; i++)
    z[i] = a*x[i] + y[i]
• Main code then looks like (see the pthreads sketch a few slides below):
parallel_daxpy():
  for (tid = 0; tid < CORES; tid++) {
    spawn_task(daxpy, tid);
  }
  wait_for_tasks(CORES);

Explicit (Loop-Level) Parallelism
• Another way: "OpenMP" annotations to inform the compiler
double a, x[SIZE], y[SIZE], z[SIZE];
void daxpy() {
  #pragma omp parallel for
  for (i = 0; i < SIZE; i++) {
    z[i] = a*x[i] + y[i];
  }
}
• But only works if loop is actually parallel
  • If not parallel, unpredictable incorrect behavior may result

Multicore & Multiprocessor Hardware

Multiplying Performance
• A single core can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
• What if we need even more computing power?
  • Use multiple cores! But how?
• Old-school (2000s): UltraEnterprise 25k
  • 72 dual-core UltraSPARC IV+ processors
  • Up to 1 TB of memory
  • Niche: large database servers
  • $$$, weighs more than 1 ton
• Today: multicore is everywhere
  • Can't buy a single-core smartphone...
  • ...or even smart watch!

Intel Quad-Core "Core i7"

Application Domains for Multiprocessors
• Scientific computing / supercomputing
  • Examples: weather simulation, aerodynamics, protein folding
  • Large grids, integrating changes over time
  • Each processor computes for a part of the grid
• Server workloads
  • Example: airline reservation database
  • Many concurrent updates, searches, lookups, queries
  • Processors handle different requests
• Media workloads
  • Processors compress/decompress different parts of image/frames
• Desktop workloads...
• Gaming workloads...
• But software must be written to expose parallelism

Multicore is Energy Efficient
• Explicit parallelism (multicore) is highly energy efficient
• Recall: dynamic voltage and frequency scaling
  • Performance vs. power is NOT linear
  • Example: Intel's XScale
    • 1 GHz → 200 MHz reduces energy used by 30x
• Consider the impact of parallel execution
  • What if we used 5 XScales at 200 MHz?
  • Similar performance as one 1 GHz XScale, but 1/6th the energy
    • 5 cores * 1/30th = 1/6th
• And, amortizes background "uncore" energy among cores
• Assumes parallel speedup (a difficult task)
  • Subject to Amdahl's law
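The parallel_daxpy() pseudocode above maps almost directly onto the pthreads library mentioned later in this unit. Below is a minimal sketch, not part of the original slides: pthread_create()/pthread_join() stand in for the slides' spawn_task()/wait_for_tasks(), and the array sizes and initialization values are made up for illustration (compile with gcc -pthread).

/* Chunked daxpy across N threads, one chunk per thread. */
#include <pthread.h>
#include <stdio.h>

#define SIZE 400
#define N    4                      /* number of threads; assumes SIZE % N == 0 */

double a = 2.0, x[SIZE], y[SIZE], z[SIZE];   /* shared by all threads */

void *daxpy(void *arg) {
    long chunk_id   = *(long *)arg;          /* which chunk this thread owns */
    long chunk_size = SIZE / N;
    long my_start   = chunk_id * chunk_size;
    long my_end     = my_start + chunk_size;
    for (long i = my_start; i < my_end; i++)
        z[i] = a * x[i] + y[i];              /* threads write disjoint elements */
    return NULL;
}

int main(void) {
    for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 1.0; }

    pthread_t tasks[N];
    long ids[N];
    for (long tid = 0; tid < N; tid++) {     /* "spawn_task(daxpy, tid)" */
        ids[tid] = tid;
        pthread_create(&tasks[tid], NULL, daxpy, &ids[tid]);
    }
    for (int tid = 0; tid < N; tid++)        /* "wait_for_tasks(CORES)" */
        pthread_join(tasks[tid], NULL);

    printf("z[SIZE-1] = %f\n", z[SIZE - 1]); /* 2*399 + 1 = 799.0 */
    return 0;
}

Because each thread writes a disjoint chunk of z, no locking is needed; the only synchronization required is the join at the end.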
Amdahl's Law
• Restatement of the law of diminishing returns
  • Total speedup limited by non-accelerated piece
  • Analogy: drive to work & park car, walk to building
• Consider a task with a "parallel" and "serial" portion
• What is the speedup with N cores?
  • Speedup(n, p, s) = (s + p) / (s + (p/n))
  • p is "parallel percentage", s is "serial percentage"
• What about infinite cores?
  • Speedup(p, s) = (s + p) / s = 1/s
• Example: can optimize 50% of program A
  • Even a "magic" optimization that makes this 50% disappear...
  • ...only yields a 2X speedup

Amdahl's Law Graph
• [Graph: speedup vs. number of cores for several parallel fractions. Source: Wikipedia]

Threading & The Shared Memory Programming Model

First, Uniprocessor Concurrency
• Software "thread": independent flows of execution
• "Per-thread" state
  • Context state: PC, registers
  • Stack (per-thread local variables)
• "Shared" state: globals, heap, etc.
  • Threads generally share the same memory space
  • A process is like a thread, but with its own memory space
  • Java has thread support built in, C/C++ use the pthreads library
• Generally, system software (the O.S.) manages threads
  • "Thread scheduling", "context switching"
  • In a single-core system, all threads share the one processor
  • Hardware timer interrupt occasionally triggers the O.S.
  • Quickly swapping threads gives the illusion of concurrent execution
• Much more in an operating systems course

Shared Memory Programming Model
• Programmer explicitly creates multiple threads
• All loads & stores to a single shared memory space
  • Each thread has its own stack frame for local variables
  • All memory shared, accessible by all threads
• A "thread switch" can occur at any time
  • Pre-emptive multithreading by OS
• Common uses:
  • Handling user interaction (GUI programming)
  • Handling I/O latency (send network message, wait for response)
  • Expressing parallel work via Thread-Level Parallelism (TLP)
    • This is our focus!

Shared Memory Model: Interleaving
• Initially: all variables zero (that is, x = 0, y = 0)
  thread 1: store 1 → y; load x
  thread 2: store 1 → x; load y
• What value pairs can be read by the two loads?
  store 1 → y; load x; store 1 → x; load y   (x=0, y=1)
  store 1 → y; store 1 → x; load x; load y   (x=1, y=1)
  store 1 → y; store 1 → x; load y; load x   (x=1, y=1)
  store 1 → x; load y; store 1 → y; load x   (x=1, y=0)
  store 1 → x; store 1 → y; load y; load x   (x=1, y=1)
  store 1 → x; store 1 → y; load x; load y   (x=1, y=1)
• What about (x=0, y=0)?
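The interleaving question above can be turned into a real program. This is a sketch (not from the original slides) using C11 atomics and pthreads; relaxed atomics keep the program free of undefined behavior while still letting the compiler and hardware reorder, so in addition to (x=0,y=1), (x=1,y=0), and (x=1,y=1), the (x=0,y=0) outcome teased above really can appear on machines with store buffers. Run it in a loop to observe different outcomes.

/* Two threads, two shared variables, four loads/stores total. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

atomic_int x, y;     /* shared; both start at 0 */
int r1, r2;          /* values observed by the two loads */

void *thread1(void *arg) {                          /* store 1 -> y ; load x */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

void *thread2(void *arg) {                          /* store 1 -> x ; load y */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("(x=%d, y=%d)\n", r1, r2);               /* which pair did this run see? */
    return 0;
}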
Shared Memory Implementations
• Multiplexed uniprocessor
  • Runtime system and/or OS occasionally pre-empts & swaps threads
  • Interleaved, but no parallelism
• Multiprocessors
  • Multiply execution resources, higher peak performance
  • Same interleaved shared-memory model
  • Foreshadowing: allow private caches, further disentangle cores
• Hardware multithreading
  • Tolerate pipeline latencies, higher efficiency
  • Same interleaved shared-memory model
• All support the shared memory programming model

Simplest Multiprocessor
• Replicate the entire processor pipeline!
  • Instead of replicating just the register file & PC
  • Exception: share the caches (we'll address this bottleneck soon)
• Multiple threads execute
  • Shared memory programming model
  • Operations (loads and stores) are interleaved "at random"
  • Loads return the value written by the most recent store to that location

Four Shared Memory Issues
1. Cache coherence
  • If cores have private (non-shared) caches
  • How to make writes to one cache "show up" in others?
2. Parallel programming
  • How does the programmer express the parallelism?
3. Synchronization
  • How to regulate access to shared data?
  • How to implement "locks"?
4. Memory consistency models
  • How to keep the programmer sane while letting hardware optimize?
  • How to reconcile shared memory with compiler optimizations, store buffers, and out-of-order execution?

Hardware Multithreading
• Not the same as software multithreading!
• A hardware thread is a sequential stream of insns
  • from some software thread (e.g., a single-threaded process)
• Hardware Multithreading (MT)
  • Multiple hardware threads dynamically share a single pipeline
  • Replicate only per-thread structures: program counter & registers
  • Hardware interleaves instructions

Hardware Multithreading
• Why use hw multithreading?
+ Multithreading improves utilization and throughput
  • Single programs utilize <50% of pipeline (branch, cache miss)
  • Allow insns from different hw threads in the pipeline at once
• Multithreading does not improve single-thread performance
  • Individual threads run as fast or even slower
• Coarse-grain MT: switch on cache misses (why?)
• Simultaneous MT: no explicit switching, fine-grain interleaving
  • Intel's "hyperthreading"

Roadmap Checkpoint
• Thread-level parallelism (TLP)
  • Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Cache coherence
  • Valid/Invalid, MSI, MESI
• Parallel programming
• Synchronization
  • Lock implementation
  • Locking gotchas
  • Transactional memory
• Memory consistency models

Recall: Simplest Multiprocessor
• What if we don't want to share the L1 caches?
  • Bandwidth and latency issue
• Solution: use per-processor ("private") caches
  • Coordinate them with a Cache Coherence Protocol
• Must still provide the shared-memory invariant:
  • "Loads read the value written by the most recent store"

No-Cache (Conceptual) Implementation
• [Figure: cores P0-P2 connected through an interconnect directly to memory]
• No caches
• Not a realistic design

Shared Cache Implementation
• [Figure: cores P0-P2 share a single on-chip cache through the interconnect]
• On-chip shared cache
  • Lacks per-core caches
  • Shared cache becomes bottleneck
• Example: Store 400 -> [A]
  • Write into the shared cache
  • Mark the block as "dirty"
  • Memory not updated

Adding Private Caches
• [Figure: each core P0-P2 gets its own cache above the interconnect and shared cache]
• Add per-core caches (write-back caches)
  • Reduces latency
  • Increases throughput
  • Decreases energy

Adding Private Caches
• Store 400 -> [A]
  • With private write-back caches, the store now goes into the storing core's own cache, not the shared cache

Private Cache Problem: Incoherence
• What happens when another core tries to read A?
• Uh, oh: P0 got the wrong value!
  • P0's load was satisfied by a stale copy; the new value 400 is still sitting dirty in the writing core's private cache

Cache Coherence: Who bears the brunt?
• Caches are supposed to be invisible to the programmer!
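To make "invisible to the programmer" concrete, here is the kind of code people write every day while silently relying on coherence: one thread publishes a value behind a flag, another spins on the flag, and nobody ever touches a cache. A minimal sketch using C11 atomics and pthreads; the variable names and the value 42 are illustrative, not from the slides.

/* Producer/consumer handoff through shared memory. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

int data;                 /* ordinary shared variable */
atomic_int ready;         /* publication flag, starts at 0 */

void *producer(void *arg) {
    data = 42;                                               /* write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then the flag  */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* spin until published */
    printf("data = %d\n", data);                             /* prints 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The producer's stores land in its own private cache first; it is the coherence protocol's job to make them visible to the consumer's loads, without the program saying anything about caches.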

Rewind: Fix Problem by Tracking Sharers
• [Figure: the shared cache is augmented with per-block tracking state]
• Solution: Track copies of each block
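One way to picture the tracking state (a hedged sketch, not the course's actual hardware design): for each block, the shared cache keeps a small "sharers" bit vector naming which private caches hold a copy, anticipating the "Sharer" bit vector described on the MSI slide later in this unit. NUM_CORES and all field names below are made up for illustration.

#include <stdint.h>
#include <stdbool.h>

#define NUM_CORES 4

typedef struct {
    uint64_t data;       /* the cached block (shrunk to 64 bits for the sketch) */
    uint8_t  sharers;    /* bit i set => core i's private cache has a copy      */
    bool     dirty;      /* block modified relative to memory                   */
} tracked_block_t;

/* Before a write by core `writer` can proceed, every other copy must be
 * found (using `sharers`) and invalidated. */
static inline uint8_t copies_to_invalidate(const tracked_block_t *b, int writer) {
    return (uint8_t)(b->sharers & ~(1u << writer));
}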

Use Tracking Information to "Invalidate"
• [Figure sequence: on a write, the shared cache consults its tracking state to find the other private copies of the block and invalidate them]

“Valid/Invalid” Cache Coherence
• To enforce the shared memory invariant…
  • "Loads read the value written by the most recent store"
• Enforce the invariant…
  • "At most one valid copy of the block"
• Simplest form is a two-state "valid/invalid" protocol
  • If a core wants a copy, must find and "invalidate" it
• On a cache miss, how is the valid copy found?
  • Option #1 "Snooping": broadcast to all, whoever has it responds
  • Option #2 "Directory": track sharers with a separate structure
• Problem: multiple copies can't exist, even if read-only
  • Consider mostly-read data structures, instructions, etc.

VI Protocol State Transition Table
State        | This Processor          | Other Processor
             | Load       | Store      | Load Miss        | Store Miss
-------------+------------+------------+------------------+------------------
Invalid (I)  | Miss → V   | Miss → V   | ---              | ---
Valid (V)    | Hit        | Hit        | Send Data → I    | Send Data → I

• Rows are "states": I vs. V
• Columns are "events"
• Writeback events not shown
• Memory controller not shown
• Memory sends data when no processor responds
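The table above reads naturally as a small state machine. Below is an illustrative C sketch of just the state transitions, not the course's implementation: data movement, writebacks, and the memory controller are omitted, exactly as they are in the table.

typedef enum { VI_INVALID, VI_VALID } vi_state_t;

typedef enum {
    EVT_LOAD,              /* this processor loads the block    */
    EVT_STORE,             /* this processor stores the block   */
    EVT_OTHER_LOAD_MISS,   /* another processor's load miss     */
    EVT_OTHER_STORE_MISS   /* another processor's store miss    */
} vi_event_t;

vi_state_t vi_next_state(vi_state_t s, vi_event_t e) {
    if (s == VI_INVALID) {
        /* Our own load or store misses, fetches the (only) copy, goes to V. */
        if (e == EVT_LOAD || e == EVT_STORE) return VI_VALID;
        return VI_INVALID;              /* other processors' misses: nothing to do */
    }
    /* s == VI_VALID */
    if (e == EVT_LOAD || e == EVT_STORE) return VI_VALID;   /* hit, stay Valid */
    return VI_INVALID;                  /* send data to the requester, then invalidate */
}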

MSI Cache Coherence Protocol
• Solution: enforce the invariant…
  • Multiple read-only copies -OR-
  • Single read/write copy
• Track these MSI permissions (states) in per-core caches
  • Modified (M): read/write permission
  • Shared (S): read-only permission
  • Invalid (I): no permission
• Also track a "Sharer" bit vector in the shared cache
  • One bit per core; tracks all shared copies of a block
  • Then, invalidate all readers when a write occurs
• Allows for many readers…
  • …while still enforcing the shared memory invariant
    ("Loads read the value written by the most recent store")
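As a companion to the permission description above, here is an illustrative C sketch of the MSI transitions as seen by one core's private cache, plus a tiny main() that replays the flavor of the example on the slides that follow (P1 starts Modified, P0 loads, then P1 stores). This is a teaching sketch, not the protocol's real message-level implementation.

#include <stdio.h>

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

typedef enum {
    CPU_LOAD, CPU_STORE,           /* requests from this core                 */
    REMOTE_LOAD, REMOTE_STORE      /* another core's load/store to this block */
} msi_event_t;

msi_state_t msi_next_state(msi_state_t s, msi_event_t e) {
    switch (e) {
    case CPU_LOAD:     return (s == MSI_I) ? MSI_S : s;   /* need read permission  */
    case CPU_STORE:    return MSI_M;                      /* need write permission */
    case REMOTE_LOAD:  return (s == MSI_M) ? MSI_S : s;   /* downgrade the writer  */
    case REMOTE_STORE: return MSI_I;                      /* invalidate all copies */
    }
    return s;
}

int main(void) {
    msi_state_t p0 = MSI_I, p1 = MSI_M;     /* P1 starts with a Modified copy */
    p0 = msi_next_state(p0, CPU_LOAD);      /* P0 loads:  P0 I -> S           */
    p1 = msi_next_state(p1, REMOTE_LOAD);   /*            P1 M -> S           */
    p1 = msi_next_state(p1, CPU_STORE);     /* P1 stores: P1 S -> M           */
    p0 = msi_next_state(p0, REMOTE_STORE);  /*            P0 S -> I           */
    printf("P0=%d P1=%d\n", p0, p1);        /* prints P0=0 (I), P1=2 (M)      */
    return 0;
}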

MSI Coherence Example
• Setting: P0, P1, and the shared cache communicate over a point-to-point interconnect
• Step #1: P1 holds block A in Modified (its private cache has the only up-to-date copy, value 400)
• Step #2: P0 executes Load [A] and misses; P0 sends LdMiss: Addr=A to the shared cache, which forwards LdMissForward: Addr=A, Req=P0 to P1
• Steps #3-#4: P1 responds with Response: Addr=A, Data=400, returned over the interconnect
• Step #5: P0 completes Load [A] (400) and sends Unblock: Addr=A to the shared cache
• Step #6: P1 executes Store 300 -> [A], which again requires exclusive (Modified) permission for A

Classifying Misses: 3C Model
• Divide cache misses into three categories
  • Compulsory (cold): never seen this address before
    • Would miss even in an infinite cache
  • Capacity: miss caused because the cache is too small
