Parallel 4 2021
Stewart Smith Digital Systems Design 4
Digital System Design 4
Parallel Computing Architecture 4
This Lecture
• Flynn’s Taxonomy – SIMD/MIMD/etc.
• Parallel benchmarks and the Roofline Model
Instructions and Data Streams
• An alternate classification

                                  Data Streams
                                  Single                     Multiple
  Instruction Streams   Single    SISD: Intel Pentium 4      SIMD: SSE instructions of x86
                        Multiple  MISD: No examples today    MIMD: Intel Xeon E5345

• SPMD: Single Program Multiple Data
‣ A parallel program on a MIMD computer
‣ Conditional code for different processors
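The SPMD idea above (one program, run by every processor, with conditional code per processor) can be sketched as follows. This is a minimal illustration, not a real MIMD runtime: the names `spmd_program`, `run_spmd`, `rank` and `size` are hypothetical, and Python threads are used only to show the program structure, not to obtain real parallel speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_program(rank, size, data):
    """The single program that every worker runs; `rank` selects its behaviour."""
    # Each worker sums its own strided slice of the shared data.
    partial = sum(data[rank::size])
    if rank == 0:
        # Conditional code: only worker 0 labels itself as the coordinator.
        return ("coordinator", partial)
    return ("worker", partial)


def run_spmd(size, data):
    """Launch the same function on `size` workers that differ only in rank."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        return list(pool.map(lambda r: spmd_program(r, size, data), range(size)))


results = run_spmd(4, list(range(100)))
total = sum(partial for _, partial in results)
```

Every worker executes identical code; the only per-processor difference is the branch on `rank`, which is the pattern the slide refers to as conditional code.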
Flynn’s Taxonomy
Parallel Benchmarks
• Linpack: matrix linear algebra (TOP500)
• SPECrate: parallel run of SPEC CPU programs
‣ Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
‣ Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
‣ Computational fluid dynamics (CFD) kernels
• PARSEC (Princeton Application Repository for Shared-Memory Computers) suite
‣ Multithreaded applications using Pthreads and OpenMP
Code or Applications?
• Traditional benchmarks
‣ Fixed code and data sets
• Parallel programming is evolving
‣ Should algorithms, programming languages, and tools be part of the system?
‣ Compare systems, provided they implement a given application
‣ E.g., Linpack, Berkeley Design Patterns
• Would foster innovation in approaches to parallelism
Modelling Performance
• Assume performance metric of interest is achievable GFLOPs/sec
‣ Measured using computational kernels from Berkeley Design Patterns
• Arithmetic intensity of a kernel
‣ FLOPs per byte of memory accessed
• For a given computer, determine
‣ Peak GFLOPS (from data sheet)
‣ Peak memory bytes/sec (using Stream benchmark)
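As a worked instance of "FLOPs per byte of memory accessed", the sketch below computes the arithmetic intensity of a DAXPY kernel (y[i] = a * x[i] + y[i]). DAXPY is chosen here only as an illustration and is not named on the slide; the function names and the assumption of 8-byte double-precision elements with no cache reuse are also assumptions.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_accessed


def daxpy_intensity(n, bytes_per_element=8):
    """Arithmetic intensity of y[i] = a * x[i] + y[i] over n elements."""
    # One multiply and one add per element.
    flops = 2 * n
    # Per element: read x[i], read y[i], write y[i] (assumes no cache reuse).
    bytes_accessed = 3 * bytes_per_element * n
    return arithmetic_intensity(flops, bytes_accessed)


daxpy_ai = daxpy_intensity(1_000_000)  # 2 FLOPs per 24 bytes = 1/12 FLOP/byte
```

An intensity this low suggests the kernel would be limited by memory bandwidth rather than by peak GFLOPS on most machines.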
Arithmetic Intensity
• In some kernels the intensity scales with problem size while in others it is independent
• Different results for strong and weak scaling
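The contrast between the two kinds of kernel can be sketched with two examples that are assumptions for illustration, not taken from the slide: a vector add, whose intensity is constant, and a dense matrix multiply, whose intensity grows with the matrix dimension. The traffic model is a best case (each array moves through memory once, 8-byte elements).

```python
def vector_add_intensity(n, word=8):
    """c[i] = a[i] + b[i]: n FLOPs, two vectors read and one written."""
    return n / (3 * word * n)


def matmul_intensity(n, word=8):
    """Dense n x n matrix multiply: 2*n^3 FLOPs.

    Best case: each of the three matrices moves through memory once.
    """
    return (2 * n**3) / (3 * word * n**2)


sizes = [64, 256, 1024]
vec = [vector_add_intensity(n) for n in sizes]  # constant, 1/24 FLOP/byte
mat = [matmul_intensity(n) for n in sizes]      # grows as n/12 FLOP/byte
```

Under weak scaling (problem size grows with the machine) the matrix multiply moves to higher intensity, while the vector add stays at the same point regardless of size.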
Roofline Diagram
Attainable GFLOPs/sec = Min (Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
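The roofline formula above translates directly into code. In this minimal sketch the peak values are hypothetical placeholders, not figures for any real machine, and `ridge_point` is a helper name introduced here for the intensity at which the two limits meet.

```python
def attainable_gflops(peak_gflops, peak_bw_gbytes_per_s, intensity):
    """Roofline: the lower of the memory-bound and compute-bound limits."""
    return min(peak_bw_gbytes_per_s * intensity, peak_gflops)


def ridge_point(peak_gflops, peak_bw_gbytes_per_s):
    """Arithmetic intensity at which the kernel stops being memory bound."""
    return peak_gflops / peak_bw_gbytes_per_s


PEAK_GFLOPS = 16.0  # hypothetical peak floating-point performance
PEAK_BW = 8.0       # hypothetical peak memory bandwidth in GB/s

intensities = [0.25, 1.0, 2.0, 8.0]
roofline = [attainable_gflops(PEAK_GFLOPS, PEAK_BW, ai) for ai in intensities]
ridge = ridge_point(PEAK_GFLOPS, PEAK_BW)
```

Kernels to the left of the ridge point sit on the sloped (bandwidth) part of the roof; kernels to the right sit on the flat (compute) part.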
Comparing Systems
• Example: Opteron X2 vs. Opteron X4
‣ 2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs. 2.3GHz
‣ Same main memory system
• To get higher performance on X4 than X2
‣ Need high arithmetic intensity
‣ Or working set must fit in X4’s 2MB L3 cache
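The comparison can be sketched numerically. The peak ratio below is derived only from the figures on the slide (core count, FP performance per core, clock rate); the absolute X2 peak and the shared memory bandwidth are hypothetical placeholders, not data-sheet values.

```python
def attainable(peak_gflops, bandwidth, intensity):
    """Roofline limit for one machine at a given arithmetic intensity."""
    return min(bandwidth * intensity, peak_gflops)


X2_PEAK = 10.0    # hypothetical placeholder, GFLOPs/sec
SHARED_BW = 10.0  # hypothetical placeholder, GB/s (same memory system for both)

# From the slide: 4 vs. 2 cores, 2x FP performance per core, 2.3 vs. 2.2 GHz.
x4_scale = (4 / 2) * 2 * (2.3 / 2.2)
X4_PEAK = X2_PEAK * x4_scale

low_ai, high_ai = 0.5, 8.0
x2_low = attainable(X2_PEAK, SHARED_BW, low_ai)
x4_low = attainable(X4_PEAK, SHARED_BW, low_ai)
x2_high = attainable(X2_PEAK, SHARED_BW, high_ai)
x4_high = attainable(X4_PEAK, SHARED_BW, high_ai)
```

At low intensity both machines are pinned to the same bandwidth limit, so the X4 gains nothing; only at high intensity does its roughly 4.2× higher peak become reachable.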
Optimising Performance
• Optimise floating-point performance
‣ Balance adds and multiplies
‣ Improve superscalar ILP and use of SIMD
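Why the add/multiply balance matters can be sketched with a deliberately simplified model, stated here as an assumption: a core with one adder and one multiplier that can each start one operation per cycle, so peak throughput is 2 FLOPs per cycle. The function name is hypothetical.

```python
def fp_balance_fraction(adds, multiplies):
    """Fraction of peak FP throughput reached under the simplified model.

    Assumes one adder and one multiplier, each issuing one operation
    per cycle, and ignores dependencies and memory stalls.
    """
    total = adds + multiplies
    if total == 0:
        return 0.0
    # The busier unit determines how many cycles the work takes.
    cycles = max(adds, multiplies)
    return total / (2 * cycles)


balanced = fp_balance_fraction(100, 100)  # both units always busy
adds_only = fp_balance_fraction(100, 0)   # multiplier sits idle
skewed = fp_balance_fraction(150, 50)
```

Under this model an all-add kernel can reach at most half of peak, which is why an unbalanced instruction mix lowers the compute ceiling of the roofline.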
Optimising Performance
• Optimise memory usage
‣ Software prefetch
– Avoid load stalls
‣ Memory affinity
– Avoid non-local data accesses
Optimising Performance
• Choice of optimisation depends on arithmetic intensity of code
• Arithmetic intensity is not always fixed
‣ May scale with problem size
‣ Caching reduces memory accesses
– Increases arithmetic intensity
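The effect of caching on arithmetic intensity can be sketched with a simple model, offered as an assumption rather than a measured result: only cache misses generate main-memory traffic, and cache-line granularity and write-backs are ignored. The function name and the example numbers are hypothetical.

```python
def effective_intensity(flops, bytes_requested, cache_hit_rate):
    """FLOPs per byte of main-memory traffic when only misses reach memory."""
    dram_bytes = bytes_requested * (1.0 - cache_hit_rate)
    if dram_bytes == 0:
        # Everything is served from cache: no memory-bandwidth limit applies.
        return float("inf")
    return flops / dram_bytes


no_cache = effective_intensity(1000, 8000, 0.0)     # 0.125 FLOP/byte
with_cache = effective_intensity(1000, 8000, 0.75)  # 0.5 FLOP/byte
```

The same code performs the same FLOPs in both cases; a higher hit rate simply moves the kernel to the right on the roofline diagram.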
Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
‣ Developing parallel software
‣ Devising appropriate architectures
• Many reasons for optimism
‣ Changing software and application environment
‣ Chip-level multiprocessors with lower latency, higher bandwidth interconnect
• An ongoing challenge for computer architects!