Digital System Design 4 Parallel Computing Architecture 4
Stewart Smith Digital Systems Design 4

This Lecture
• Flynn’s Taxonomy: SIMD, MIMD, etc.
• Parallel benchmarks and the Roofline Model

Instructions and Data Streams

An alternate classification

                                  Data Streams
                                  Single                     Multiple
Instruction Streams   Single      SISD: Intel Pentium 4      SIMD: SSE instructions of x86
                      Multiple    MISD: No examples today    MIMD: Intel Xeon e5345

• SPMD: Single Program Multiple Data
  ‣ A parallel program on a MIMD computer
  ‣ Conditional code for different processors
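
A minimal sketch of the SPMD idea, not taken from the lecture: every worker runs the same function, and a branch on its rank gives it a different role. The names `spmd_program` and `run_spmd` are hypothetical, and threads stand in for the processors of a MIMD machine purely to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_program(rank, nprocs, data):
    """The single program that every worker runs; `rank` selects its behaviour."""
    # Each worker processes its own strided slice of the input (multiple data).
    partial = sum(data[rank::nprocs])
    # Conditional code: only rank 0 takes the coordinator branch.
    role = "coordinator" if rank == 0 else "worker"
    return rank, role, partial


def run_spmd(nprocs, data):
    # Assumption: threads play the part of processors. Real SPMD codes
    # usually run as separate processes, for example under MPI.
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        futures = [pool.submit(spmd_program, r, nprocs, data) for r in range(nprocs)]
        results = [f.result() for f in futures]
    total = sum(partial for _, _, partial in results)
    return results, total


results, total = run_spmd(4, list(range(100)))
```

The point of the sketch is that there is one program text, yet rank 0 behaves differently from the other ranks.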
Flynn’s Taxonomy

Parallel Benchmarks
• Linpack: matrix linear algebra (TOP500)
• SPECrate: parallel run of SPEC CPU programs
  ‣ Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
  ‣ Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
  ‣ Computational fluid dynamics (CFD) kernels
• PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  ‣ Multithreaded applications using Pthreads and OpenMP

Code or Applications?
• Traditional benchmarks
  ‣ Fixed code and data sets
• Parallel programming is evolving
  ‣ Should algorithms, programming languages, and tools be part of the system?
  ‣ Compare systems, provided they implement a given application
  ‣ E.g., Linpack, Berkeley Design Patterns
• Would foster innovation in approaches to parallelism

Modelling Performance
• Assume performance metric of interest is achievable GFLOPs/sec
  ‣ Measured using computational kernels from Berkeley Design Patterns
• Arithmetic intensity of a kernel
  ‣ FLOPs per byte of memory accessed
• For a given computer, determine
  ‣ Peak GFLOPs/sec (from data sheet)
  ‣ Peak memory bytes/sec (using the STREAM benchmark)
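
As a worked illustration of "FLOPs per byte of memory accessed", the sketch below counts both quantities for a DAXPY-style kernel (`y[i] = a * x[i] + y[i]`). The kernel choice, the 8-byte word size, and the assumption of no cache reuse are illustrative and do not come from the slides.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """FLOPs per byte of memory accessed."""
    return flops / bytes_accessed


def daxpy_counts(n, word_bytes=8):
    # y[i] = a * x[i] + y[i]: one multiply and one add per element.
    flops = 2 * n
    # Per element: read x[i], read y[i], write y[i] (no cache reuse assumed).
    bytes_accessed = 3 * word_bytes * n
    return flops, bytes_accessed


flops, nbytes = daxpy_counts(1_000_000)
ai = arithmetic_intensity(flops, nbytes)  # 2 FLOPs per 24 bytes
```

Under these assumptions the intensity is 1/12 FLOP per byte regardless of `n`, which marks the kernel as one that leans heavily on memory bandwidth.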



Arithmetic Intensity
• In some kernels the intensity scales with problem size, while in others it is independent of problem size
• Different results for strong and weak scaling
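
To make the scaling claim concrete, here is a rough sketch comparing two kernels. It assumes 8-byte words and, for the matrix multiply, ideal caching in which each matrix crosses the memory interface exactly once. Both function names are hypothetical.

```python
def dot_product_intensity(n, word_bytes=8):
    # 2n FLOPs (n multiplies, n adds); 2n words are each read once.
    return (2 * n) / (2 * n * word_bytes)


def matmul_intensity(n, word_bytes=8):
    # 2n^3 FLOPs for an n x n dense multiply. Assumption: ideal caching,
    # so A and B are read once and C is written once (3n^2 words in total).
    return (2 * n**3) / (3 * n**2 * word_bytes)


for n in (64, 512, 4096):
    print(n, dot_product_intensity(n), matmul_intensity(n))
```

The dot product stays at a constant intensity, while the dense multiply grows linearly with `n` under the stated caching assumption.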

Roofline Diagram
Attainable GFLOPs/sec = Min (Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
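
The formula above translates directly into code. The machine figures in this sketch (10 GB/s and 16 GFLOPs/sec) are hypothetical values chosen only to show the two regimes.

```python
def attainable_gflops(peak_bw_gbytes, intensity, peak_gflops):
    """Roofline bound: the lower of the memory roof and the compute roof."""
    return min(peak_bw_gbytes * intensity, peak_gflops)


# Hypothetical machine: 10 GB/s peak memory bandwidth, 16 GFLOPs/sec peak.
PEAK_BW, PEAK_FP = 10.0, 16.0

for ai in (0.25, 1.0, 4.0):
    print(ai, attainable_gflops(PEAK_BW, ai, PEAK_FP))
```

At low intensity the result follows the sloped memory roof; once bandwidth times intensity exceeds the peak floating-point rate, the flat compute roof takes over.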

Comparing Systems
• Example: Opteron X2 vs. Opteron X4
  ‣ 2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs. 2.3GHz
  ‣ Same main memory system
• To get higher performance on X4 than X2
  ‣ Need high arithmetic intensity
  ‣ Or working set must fit in X4’s 2MB L3 cache
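
A sketch of why a shared memory system limits the gain from a faster processor. The numbers are placeholders, not the real Opteron figures: both systems share one bandwidth value, and the newer one has roughly four times the peak floating-point rate (twice the cores, twice the FP rate per core, ignoring the small clock difference). The cache effect from the last bullet is not modelled.

```python
def roofline(peak_bw, intensity, peak_fp):
    return min(peak_bw * intensity, peak_fp)


def ridge_point(peak_bw, peak_fp):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_fp / peak_bw


# Placeholder figures, NOT measured Opteron values.
SHARED_BW = 10.0                 # GB/s, same memory system for both
OLD_PEAK, NEW_PEAK = 16.0, 64.0  # GFLOPs/sec


def speedup(intensity):
    """Attainable performance of the newer system relative to the older one."""
    new = roofline(SHARED_BW, intensity, NEW_PEAK)
    old = roofline(SHARED_BW, intensity, OLD_PEAK)
    return new / old


for ai in (0.5, 8.0):
    print(ai, speedup(ai))
```

Below the older system's ridge point both machines sit on the same memory roof, so the speedup is 1; the full gain appears only for kernels whose intensity is beyond the newer system's ridge point.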

Optimising Performance
• Optimise floating-point performance
  ‣ Balance adds and multiplies
  ‣ Improve superscalar ILP and use of SIMD
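
One way to see why the add/multiply balance matters is a deliberately simple model. It assumes a hypothetical machine whose quoted peak counts one add and one multiply issued in parallel each cycle, so the busier unit sets the run time. The function name `fp_ceiling` and the model itself are illustrative assumptions, not something stated in the slides.

```python
def fp_ceiling(peak_gflops, adds, multiplies):
    """Attainable FP rate when the instruction mix is unbalanced.

    Assumption: one adder and one multiplier work in parallel, and the
    quoted peak is reached only when both are kept equally busy.
    """
    total = adds + multiplies
    if total == 0:
        return 0.0
    # The busier unit determines the time; the other unit idles part of it.
    return peak_gflops * total / (2 * max(adds, multiplies))


balanced = fp_ceiling(16.0, 100, 100)   # both units fully used
adds_only = fp_ceiling(16.0, 100, 0)    # multiplier idle
```

In this model a kernel made up only of adds can reach at most half of the quoted peak, which lowers the compute roof in the roofline diagram.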

Optimising Performance
• Optimise memory usage
  ‣ Software prefetch
    – Avoid load stalls
  ‣ Memory affinity
    – Avoid non-local data accesses

Optimising Performance
• Choice of optimisation depends on arithmetic intensity of code
• Arithmetic intensity is not always fixed
  ‣ May scale with problem size
  ‣ Caching reduces memory accesses
    – Increases arithmetic intensity
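
The sketch below ties these points together under a crude model: a cache hit rate removes that fraction of memory traffic, and the limiting roof decides which class of optimisation is worth pursuing. The hit-rate model, the machine figures, and both function names are assumptions made for illustration.

```python
def effective_intensity(flops, bytes_requested, cache_hit_rate):
    """FLOPs per byte of main-memory traffic; cache_hit_rate must be below 1."""
    dram_bytes = bytes_requested * (1.0 - cache_hit_rate)
    return flops / dram_bytes


def limiting_roof(intensity, peak_bw, peak_fp):
    # Left of the ridge point the memory roof binds; right of it, the compute roof.
    return "memory" if peak_bw * intensity < peak_fp else "compute"


PEAK_BW, PEAK_FP = 10.0, 16.0  # hypothetical GB/s and GFLOPs/sec

no_cache = effective_intensity(2e9, 8e9, 0.0)    # 0.25 FLOPs/byte
with_cache = effective_intensity(2e9, 8e9, 0.9)  # roughly 2.5 FLOPs/byte

print(limiting_roof(no_cache, PEAK_BW, PEAK_FP))
print(limiting_roof(with_cache, PEAK_BW, PEAK_FP))
```

When the memory roof binds, memory optimisations such as prefetching and affinity are the ones that pay off; once caching pushes the kernel past the ridge point, floating-point optimisations become the relevant ones.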

Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
  ‣ Developing parallel software
  ‣ Devising appropriate architectures
• Many reasons for optimism
  ‣ Changing software and application environment
  ‣ Chip-level multiprocessors with lower latency, higher bandwidth interconnect
• An ongoing challenge for computer architects!