Digital Systems Design 4 Parallel Computing Architecture 4
Stewart Smith Digital Systems Design 4
This Lecture
• Flynn’s Taxonomy – SIMD/MIMD/etc.
• Parallel benchmarks and the Roofline Model
Instructions and Data Streams
• An alternate classification

                                 Data Streams
                                 Single                    Multiple
Instruction Streams   Single     SISD: Intel Pentium 4     SIMD: SSE instructions of x86
                      Multiple   MISD: No examples today   MIMD: Intel Xeon E5345

• SPMD: Single Program Multiple Data
  ‣ A parallel program on a MIMD computer
  ‣ Conditional code for different processors
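The SPMD idea can be sketched in a few lines: every worker runs the same function, and a rank argument selects both the slice of data and any processor-specific branch. This is only an illustration of the program structure, not something from the slides. The names `spmd_program` and `run_spmd` are hypothetical, and Python threads are used for brevity even though they do not give real parallel speedup for CPU-bound work in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_program(rank, nprocs, data):
    """The single program: every worker runs this same function."""
    # Multiple data: each rank sums its own strided slice of the input.
    local_sum = sum(data[rank::nprocs])
    # Conditional code for different processors: rank 0 takes a different branch.
    role = "coordinator" if rank == 0 else "worker"
    return rank, role, local_sum


def run_spmd(data, nprocs=4):
    # Launch the same function once per rank (hypothetical driver).
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        results = list(pool.map(lambda r: spmd_program(r, nprocs, data), range(nprocs)))
    # Combine the partial sums produced by all ranks.
    total = sum(partial for _, _, partial in results)
    return results, total


results, total = run_spmd(list(range(100)))
for rank, role, partial in results:
    print(rank, role, partial)
print("total:", total)
```

On a real MIMD machine the same pattern usually appears as one executable started on every processor (for example an MPI program that branches on its rank).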
Flynn’s Taxonomy
Parallel Benchmarks
• Linpack: matrix linear algebra (TOP500)
• SPECrate: parallel run of SPEC CPU programs
  ‣ Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
  ‣ Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
  ‣ Computational fluid dynamics (CFD) kernels
• PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  ‣ Multithreaded applications using Pthreads and OpenMP
Code or Applications?
• Traditional benchmarks
  ‣ Fixed code and data sets
• Parallel programming is evolving
  ‣ Should algorithms, programming languages, and tools be part of the system?
  ‣ Compare systems, provided they implement a given application
  ‣ E.g., Linpack, Berkeley Design Patterns
• Would foster innovation in approaches to parallelism
Modelling Performance
• Assume performance metric of interest is achievable GFLOPs/sec
  ‣ Measured using computational kernels from Berkeley Design Patterns
• Arithmetic intensity of a kernel
  ‣ FLOPs per byte of memory accessed
• For a given computer, determine
  ‣ Peak GFLOPS (from data sheet)
  ‣ Peak memory bytes/sec (using Stream benchmark)
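As a worked example of "FLOPs per byte of memory accessed", the sketch below counts operations and memory traffic for a DAXPY-style loop, `y[i] = a * x[i] + y[i]`. The kernel choice, the function names, and the 8-byte double-precision element size are assumptions made for illustration. The count also ignores caching and treats every load and store as memory traffic.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """FLOPs per byte of memory accessed."""
    return flops / bytes_accessed


def daxpy_intensity(n, bytes_per_element=8):
    # y[i] = a * x[i] + y[i]: one multiply and one add per element.
    flops = 2 * n
    # Per element: read x[i], read y[i], write y[i] (assumed 8-byte doubles).
    bytes_accessed = 3 * n * bytes_per_element
    return arithmetic_intensity(flops, bytes_accessed)


ai = daxpy_intensity(1_000_000)
print(f"DAXPY arithmetic intensity: {ai:.4f} FLOPs/byte")
```

Two FLOPs against 24 bytes gives an intensity of 1/12 FLOPs/byte, so a kernel like this is limited by memory bandwidth rather than by peak floating-point rate on most machines.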
Arithmetic Intensity
• In some kernels the intensity scales with problem size while in others it is independent
• Different results for strong and weak scaling
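To make the two cases concrete, the sketch below compares a vector add, whose intensity is constant, with a dense matrix multiply, whose intensity grows with the matrix dimension. Both kernels are chosen here for illustration and are not taken from the slides. The traffic counts are idealised: each array is assumed to move between memory and the processor exactly once, with 8-byte elements.

```python
def vector_add_intensity(n, word_bytes=8):
    # c[i] = a[i] + b[i]: one add per element.
    flops = n
    # Read a[i] and b[i], write c[i].
    traffic = 3 * n * word_bytes
    return flops / traffic


def matmul_intensity(n, word_bytes=8):
    # Dense n x n matrix multiply: n^3 multiplies plus n^3 adds.
    flops = 2 * n**3
    # Idealised traffic: read A and B once, write C once.
    traffic = 3 * n**2 * word_bytes
    return flops / traffic


for n in (120, 240, 480):
    print(n, round(vector_add_intensity(n), 4), round(matmul_intensity(n), 2))
```

Under these assumptions the vector add stays at 1/24 FLOPs/byte for any size, while the matrix multiply reaches n/12. This is one reason weak scaling (problem size grows with processor count) and strong scaling (problem size fixed) can give different results for the same kernel.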
Roofline Diagram
Attainable GFLOPs/sec = Min (Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
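The roofline formula translates directly into code. In the sketch below the machine figures (16 GFLOPs/sec peak, 10 GB/sec memory bandwidth) are made-up placeholders rather than measurements of any real system, and the function and constant names are hypothetical.

```python
def attainable_gflops(peak_gflops, peak_bw_gbytes_per_s, intensity):
    """Roofline: min(peak memory BW x arithmetic intensity, peak FP performance)."""
    return min(peak_bw_gbytes_per_s * intensity, peak_gflops)


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 16.0
PEAK_BW = 10.0

# The ridge point is the intensity at which the two roofs meet.
ridge_point = PEAK_GFLOPS / PEAK_BW

for ai in (0.25, 0.5, 1.0, 2.0, 4.0):
    bound = "memory-bound" if ai < ridge_point else "compute-bound"
    print(ai, attainable_gflops(PEAK_GFLOPS, PEAK_BW, ai), bound)
```

Kernels to the left of the ridge point sit under the sloped (memory bandwidth) part of the roof. Kernels to the right sit under the flat (peak floating-point) part.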
Comparing Systems
• Example: Opteron X2 vs. Opteron X4
  ‣ 2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs. 2.3GHz
  ‣ Same main memory system
• To get higher performance on X4 than X2
  ‣ Need high arithmetic intensity
  ‣ Or working set must fit in X4’s 2MB L3 cache
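The comparison can be sketched numerically. The core counts, clock rates, and the 2× per-core floating-point ratio come from the slide. The absolute FLOPs-per-cycle figure and the shared memory bandwidth below are placeholder assumptions, not data-sheet values, so only the ratios should be read from the output.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle


def attainable(peak, bandwidth, intensity):
    # Roofline bound: limited by memory bandwidth or by peak FP rate.
    return min(bandwidth * intensity, peak)


BASE_FLOPS_PER_CYCLE = 2   # assumed per-core figure for the X2 (placeholder)
SHARED_BW = 10.0           # assumed GB/sec, identical for both systems (placeholder)

x2_peak = peak_gflops(2, 2.2, BASE_FLOPS_PER_CYCLE)
x4_peak = peak_gflops(4, 2.3, 2 * BASE_FLOPS_PER_CYCLE)  # 2x FP performance per core

print("peak ratio X4/X2:", round(x4_peak / x2_peak, 2))
for ai in (0.5, 4.0):
    print(ai, attainable(x2_peak, SHARED_BW, ai), attainable(x4_peak, SHARED_BW, ai))
```

Whatever per-core figure is assumed, the X4 peak is about 4.2 times the X2 peak, but with the same memory system both machines share the same sloped roof. At low arithmetic intensity the two attain identical performance, and the X4 only pulls ahead once the intensity passes the X2 ridge point.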
Optimising Performance
• Optimise Floating Point performance
  ‣ Balance adds and multiplies
  ‣ Improve superscalar ILP and use of SIMD
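Why the add/multiply balance matters can be shown with a deliberately simple model. Assume, purely for illustration, a core that can issue one floating-point add and one floating-point multiply per cycle, so that peak performance needs equal numbers of each. The function name and the model itself are assumptions, and real pipelines (for example those with fused multiply-add) behave differently.

```python
def fp_balance_fraction(adds, multiplies):
    """Fraction of peak FP throughput under the assumed model:
    one add unit and one multiply unit, each issuing once per cycle."""
    total = adds + multiplies
    if total == 0:
        return 0.0
    # The busier unit determines how many cycles the kernel needs.
    cycles = max(adds, multiplies)
    # Peak would be two operations in every one of those cycles.
    return total / (2 * cycles)


print(fp_balance_fraction(100, 100))  # perfectly balanced mix
print(fp_balance_fraction(100, 0))    # adds only
print(fp_balance_fraction(150, 50))   # three adds per multiply
```

Under this model a kernel made only of adds can reach at most half of peak, which corresponds to a lower ceiling beneath the flat part of the roofline.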
Optimising Performance
• Optimise memory usage
  ‣ Software prefetch
    – Avoid load stalls
  ‣ Memory affinity
    – Avoid non-local data accesses
Optimising Performance
• Choice of optimisation depends on arithmetic intensity of code
• Arithmetic intensity is not always fixed
  ‣ May scale with problem size
  ‣ Caching reduces memory accesses
    – Increases arithmetic intensity
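The effect of caching on arithmetic intensity can be sketched by counting only the bytes that actually reach main memory. The single hit-rate parameter below is a simplifying assumption, since real cache behaviour depends on access pattern and working-set size, and the function name is hypothetical.

```python
def effective_intensity(flops, bytes_requested, cache_hit_rate):
    """FLOPs per byte of main-memory traffic, given an assumed cache hit rate."""
    # Only cache misses generate main-memory traffic in this simple model.
    dram_bytes = bytes_requested * (1.0 - cache_hit_rate)
    if dram_bytes == 0:
        return float("inf")  # everything is served from cache
    return flops / dram_bytes


# Illustrative kernel: 2 FLOPs for every 24 bytes requested.
for hit_rate in (0.0, 0.5, 0.75):
    print(hit_rate, round(effective_intensity(2e6, 24e6, hit_rate), 4))
```

With no cache hits the intensity is 1/12 FLOPs/byte. At a 75% hit rate the same kernel moves a quarter of the bytes and its intensity rises to 1/3, shifting it to the right on the roofline diagram.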
Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
  ‣ Developing parallel software
  ‣ Devising appropriate architectures
• Many reasons for optimism
  ‣ Changing software and application environment
  ‣ Chip-level multiprocessors with lower latency, higher bandwidth interconnect
• An ongoing challenge for computer architects!