CS4203/EE4363 Computer Organization and Design
Chapter 6
Parallel Processors from Client to Cloud
Prof. Pen-Chung Yew
With Slides from Profs. Patterson, Hennessy and Mary Jane Irwin
Introduction
• Goal: connecting multiple computers to get higher performance
• Multiprocessors
• Scalability, availability, power efficiency
• Task-level (process-level) parallelism
• High throughput for independent jobs (multiprogramming)
• Parallel processing program
• Single program run on multiple processors
• Multicore microprocessors
• Chips with multiple processors (cores)
What We’ve Already Covered
• §2.11: Parallelism and Instructions
• Synchronization
• §3.6: Parallelism and Computer Arithmetic
• Subword Parallelism
• §4.10: Parallelism and Advanced Instruction-Level Parallelism
• §5.10: Parallelism and Memory Hierarchies
• Cache Coherence
Parallel Processing
• Writing/producing parallel software is a very challenging task
• Need to get significant performance improvement
• Otherwise, just use a faster uniprocessor, since it’s easier!
• But Amdahl’s Law shows it is very challenging
• Debugging a parallel program is very difficult and time consuming due to non-determinism (e.g., thread interleaving can cause race conditions on shared data)
• Other challenges
• Partitioning a program to run on multiple cores
• Coordination/synchronization
• Communications overhead (need an interconnect for multiple cores)
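The Amdahl's Law limit mentioned above can be made concrete with a few lines of arithmetic. This is a minimal sketch; the function name and the numbers are illustrative, not from the slides:

```python
# Sketch: Amdahl's Law for parallel speedup.
# speedup = 1 / ((1 - p) + p / n), where p is the parallelizable
# fraction of the program and n is the number of processors.

def amdahl_speedup(p, n):
    """Overall speedup when fraction p of the work runs on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, 100 cores give far less
# than a 100x speedup -- the serial 5% dominates.
print(amdahl_speedup(0.95, 100))   # ~16.8x, not 100x
```

This is why "just add more cores" fails without significant parallel-software effort: the serial fraction quickly caps the achievable speedup.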
Multithreading
• Performing multiple threads of execution in parallel
• Replicate registers, PC, etc.
• Fast switching between threads
• Fine-grain multithreading
• Switch threads after each cycle
• Interleave instruction execution
• If one thread stalls, others are executed
• Coarse-grain multithreading
• Only switch on long stall (e.g., L2-cache miss)
• Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Simultaneous Multithreading (SMT) / Intel’s Hyper-Threading
• In multiple-issue dynamically scheduled processor
• Schedule instructions from multiple threads
• Instructions from independent threads execute when function units are available
• Within threads, dependencies handled by scheduling and register renaming
• Example: Intel Pentium-4 HT (Hyper-Threading)
• Two threads: duplicated registers, shared function units and caches
Multithreading Example
• Coarse-grain: switch only on a long stall
• Fine-grain: switch in every cycle
• SMT: instructions from multiple threads mixed in every cycle
Shared Memory
• SMP: Shared Memory multiProcessor
• Hardware provides single (shared) address space for all processors
• Synchronize shared variables using locks
• Use regular load/store instructions to access shared data
• Memory access time
• NUMA (non-uniform memory access) vs. UMA (uniform memory access)
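The shared-memory model above (single address space, locks around shared variables) can be sketched with Python threads. This is an illustrative toy, not from the slides; the counter and thread counts are made up:

```python
# Sketch: shared-memory multiprocessing model with lock-based
# synchronization. All threads see one address space; the lock
# serializes updates to the shared counter so increments don't race.
import threading

counter = 0                 # shared variable (single address space)
lock = threading.Lock()     # lock protecting the shared variable

def worker(n_increments):
    global counter
    for _ in range(n_increments):
        with lock:          # acquire/release around the critical section
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- correct only because of the lock
```

Note that each thread uses ordinary loads/stores on `counter`; only the synchronization (the lock) is explicit, which is the defining property of the shared-memory model.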
Message Passing
• Each processor has private address space
• Hardware sends/receives messages between processors
Loosely Coupled Clusters
• Network of independent computers
• Each has private memory and OS
• Connected using I/O system
• E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
• Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
• Administration cost (prefer virtual machines)
• Low interconnect bandwidth
• cf. processor/memory bandwidth on an SMP
Modeling Performance
• Arithmetic intensity of a kernel
• FLOPs per byte of memory accessed (FLOPs / amount of data accessed in bytes)
• For a given computer, determine
• Peak GFLOPS (from data sheet)
• Peak memory bytes/sec (using Stream benchmark)
Roofline Diagram
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
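A small sketch of the roofline calculation: attainable performance is capped by whichever roof is lower, the memory roof (bandwidth × intensity) or the compute roof (peak FP rate). The machine numbers here are made up for illustration:

```python
# Sketch: roofline model.
# attainable GFLOPs/s = min(peak_bw * arithmetic_intensity, peak_flops)

def attainable_gflops(peak_flops, peak_bw, intensity):
    """peak_flops in GFLOPs/s, peak_bw in GB/s, intensity in FLOPs/byte."""
    return min(peak_bw * intensity, peak_flops)

# Hypothetical machine: 16 GFLOPs/s peak FP, 8 GB/s memory bandwidth.
print(attainable_gflops(16.0, 8.0, 0.5))  # 4.0  -- memory-bound kernel
print(attainable_gflops(16.0, 8.0, 4.0))  # 16.0 -- compute-bound kernel
```

The "ridge point" where the two roofs meet (here at intensity 16/8 = 2 FLOPs/byte) separates memory-bound kernels from compute-bound ones.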
Comparing Systems
• Example: AMD Opteron X2 vs. Opteron X4 (Barcelona)
• 2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs. 2.3GHz
• Same memory system
• To get higher performance on X4 than X2
• Need high arithmetic intensity
• Or working set must fit in X4’s 2MB L3 cache
Optimizing Performance
• Optimize FP performance
• Balance adds & multiplies
• Improve superscalar ILP and use of SIMD instructions
• Optimize memory usage
• Software prefetch
• Avoid load stalls
• Memory affinity
• Avoid non-local data accesses
Optimizing Performance
• Choice of optimization depends on arithmetic intensity of code
• Arithmetic intensity is not always fixed
• May scale with problem size
• Caching reduces memory accesses
• Increases arithmetic intensity
Pitfalls
• Not developing the software to take into account a multiprocessor architecture
• Example: using a single lock for a shared composite resource
• Serializes accesses, even if they could be done in parallel
• Use finer-granularity locking
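The finer-granularity fix can be sketched with lock striping: one lock per stripe of a shared table instead of a single lock for the whole structure, so updates to different keys can proceed in parallel. The class name and stripe count are illustrative:

```python
# Sketch: fine-grained (striped) locking instead of one global lock.
# Updates hashing to different stripes acquire different locks and
# therefore do not serialize against each other.
import threading

class StripedCounter:
    def __init__(self, n_stripes=16):
        self.locks = [threading.Lock() for _ in range(n_stripes)]
        self.counts = [0] * n_stripes

    def increment(self, key):
        i = hash(key) % len(self.locks)
        with self.locks[i]:           # only this stripe is serialized
            self.counts[i] += 1

    def total(self):
        # Acquire every stripe lock for a consistent global snapshot.
        for lk in self.locks:
            lk.acquire()
        try:
            return sum(self.counts)
        finally:
            for lk in self.locks:
                lk.release()

c = StripedCounter()
for k in range(100):
    c.increment(k)
print(c.total())  # 100
```

The trade-off: operations on single keys scale well, but whole-structure operations (like `total`) get more expensive, which is typical of fine-grained locking designs.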
Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
• Developing parallel software
• Devising appropriate architectures
• SaaS importance is growing and clusters are a good match
• Performance per dollar and performance per Joule drive both mobiles and clouds