Parallel 4 2021

Stewart Smith Digital Systems Design 4

Digital Systems Design 4
Parallel Computing Architecture 4

This Lecture

• Flynn’s Taxonomy – SIMD/MIMD/etc.
• Parallel benchmarks and the Roofline Model

Instructions and Data Streams
• An alternate classification

                                Data Streams
                                Single                     Multiple
Instruction Streams  Single     SISD: Intel Pentium 4      SIMD: SSE instructions of x86
                     Multiple   MISD: No examples today    MIMD: Intel Xeon e5345

• SPMD: Single Program Multiple Data
‣ A parallel program on a MIMD computer
‣ Conditional code for different processors
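
As a minimal sketch of the SPMD idea, the Python below runs one function on several workers, each on its own slice of the data, with a conditional branch that only worker 0 takes. The names `spmd_worker` and `run_spmd` are hypothetical, and threads stand in for separate processors, so this illustrates the program structure rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_worker(rank, num_workers, data):
    """Same program for every worker; `rank` selects the data and the branch."""
    # Multiple data: each worker sums its own strided slice of the input.
    partial = sum(data[rank::num_workers])
    # Conditional code for different processors: only rank 0 coordinates.
    role = "coordinator" if rank == 0 else "worker"
    return rank, role, partial


def run_spmd(data, num_workers=4):
    """Launch the same function on every worker and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(spmd_worker, r, num_workers, data)
                   for r in range(num_workers)]
        results = [f.result() for f in futures]
    total = sum(partial for _, _, partial in results)
    return results, total


results, total = run_spmd(list(range(100)))
```

Every worker executes identical code; only the value of `rank` differs, which is what separates SPMD from writing a distinct program per processor.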

Flynn’s Taxonomy

Parallel Benchmarks
• Linpack: matrix linear algebra (TOP500)
• SPECrate: parallel run of SPEC CPU programs
‣ Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
‣ Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
‣ Computational fluid dynamics (CFD) kernels
• PARSEC (Princeton Application Repository for Shared Memory Computers) suite
‣ Multithreaded applications using Pthreads and OpenMP

Code or Applications?
• Traditional benchmarks
‣ Fixed code and data sets

• Parallel programming is evolving
‣ Should algorithms, programming languages, and tools be part of the system?
‣ Compare systems, provided they implement a given application
‣ E.g., Linpack, Berkeley Design Patterns

• Would foster innovation in approaches to parallelism

Modelling Performance
• Assume performance metric of interest is achievable GFLOPs/sec
‣ Measured using computational kernels from Berkeley Design Patterns


• Arithmetic intensity of a kernel
‣ FLOPs per byte of memory accessed
• For a given computer, determine
‣ Peak GFLOPS (from data sheet)
‣ Peak memory bytes/sec (using Stream benchmark)
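
As a rough illustration of arithmetic intensity, the sketch below counts FLOPs and bytes for a vector update of the form y[i] = a*x[i] + y[i]. The choice of kernel, the 8-byte element size, the no-cache-reuse assumption, and the helper names are all illustrative and not taken from the lecture.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """FLOPs per byte of memory accessed."""
    return flops / bytes_accessed


def vector_update_counts(n, bytes_per_element=8):
    """FLOP and byte counts for y[i] = a * x[i] + y[i], assuming no cache reuse."""
    flops = 2 * n                               # one multiply and one add per element
    bytes_accessed = 3 * n * bytes_per_element  # read x[i], read y[i], write y[i]
    return flops, bytes_accessed


flops, nbytes = vector_update_counts(1_000_000)
kernel_ai = arithmetic_intensity(flops, nbytes)  # 2 FLOPs per 24 bytes
```

Under these assumptions the kernel performs 2 FLOPs for every 24 bytes moved, so its intensity is 1/12 FLOPs per byte.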

Arithmetic Intensity

• In some kernels the intensity scales with problem size, while in others it is independent

• Different results for strong and weak scaling
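
To make the dependence on problem size concrete, the sketch below compares two kernels under simplifying assumptions: a vector update whose intensity is constant, and a dense matrix multiply whose intensity grows with n. The matrix-multiply figure assumes the idealised case in which each matrix crosses the memory interface exactly once; the function names and the 8-byte element size are likewise assumptions.

```python
def vector_update_intensity(n, bytes_per_element=8):
    """y = a*x + y: 2n FLOPs over 3n element transfers, independent of n."""
    return (2 * n) / (3 * n * bytes_per_element)


def matmul_intensity(n, bytes_per_element=8):
    """Dense n x n matrix multiply: 2n^3 FLOPs.

    Idealised: each of the three n x n matrices is transferred exactly once,
    i.e. perfect on-chip reuse. Real traffic is usually higher.
    """
    return (2 * n ** 3) / (3 * n ** 2 * bytes_per_element)


sizes = [64, 256, 1024]
vector_ai = [vector_update_intensity(n) for n in sizes]  # flat
matmul_ai = [matmul_intensity(n) for n in sizes]         # grows linearly in n
```

In this model, quadrupling n quadruples the matrix-multiply intensity while the vector update stays at 1/12 FLOPs per byte.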

Roofline Diagram

Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
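
The roofline formula translates directly into code. In the sketch below, the machine parameters (16 GFLOPs/sec peak, 8 GBytes/sec memory bandwidth) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def attainable_gflops(peak_gflops, peak_bw_gbytes, intensity):
    """Roofline: min(peak memory BW x arithmetic intensity, peak FP performance)."""
    return min(peak_bw_gbytes * intensity, peak_gflops)


# Hypothetical machine, not a measured system.
PEAK_GFLOPS = 16.0  # peak floating-point performance, GFLOPs/sec
PEAK_BW = 8.0       # peak memory bandwidth, GBytes/sec

intensities = [0.25, 1.0, 2.0, 8.0]  # FLOPs per byte
roofline = [attainable_gflops(PEAK_GFLOPS, PEAK_BW, ai) for ai in intensities]
```

For this hypothetical machine the sloped, memory-bound part of the roof meets the flat, compute-bound part at an intensity of 2 FLOPs per byte; beyond that point, raising intensity gains nothing.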

Comparing Systems
• Example: Opteron X2 vs. Opteron X4
‣ 2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
‣ Same main memory system

• To get higher performance on X4 than X2
‣ Need high arithmetic intensity
‣ Or working set must fit in X4’s 2 MB L3 cache
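
The comparison can be sketched with the roofline formula. The core counts, clock rates, and the 2× per-core floating-point ratio come from the slide; the absolute FLOPs-per-cycle figures (2 and 4) and the shared 10 GBytes/sec memory bandwidth are hypothetical placeholders, and the L3 cache case is not modelled.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak floating-point rate of one chip in GFLOPs/sec."""
    return cores * ghz * flops_per_cycle


def attainable(peak, bandwidth, intensity):
    """Roofline bound for a kernel of the given arithmetic intensity."""
    return min(bandwidth * intensity, peak)


SHARED_BW = 10.0                  # hypothetical: same memory system for both chips
x2_peak = peak_gflops(2, 2.2, 2)  # assumed 2 FLOPs/cycle/core
x4_peak = peak_gflops(4, 2.3, 4)  # 2x the per-core FP rate of the X2

low_ai, high_ai = 0.5, 8.0        # FLOPs per byte
x2_low, x4_low = attainable(x2_peak, SHARED_BW, low_ai), attainable(x4_peak, SHARED_BW, low_ai)
x2_high, x4_high = attainable(x2_peak, SHARED_BW, high_ai), attainable(x4_peak, SHARED_BW, high_ai)
```

At low intensity both chips sit on the same memory roof and deliver identical performance; only at high intensity does the X4's higher compute roof show.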

Optimising Performance

• Optimise floating-point performance
‣ Balance adds and multiplies
‣ Improve superscalar ILP and use of SIMD

Optimising Performance

• Optimise memory usage
‣ Software prefetch
– Avoid load stalls
‣ Memory affinity
– Avoid non-local data accesses

Optimising Performance
• Choice of optimisation depends on arithmetic intensity of code

• Arithmetic intensity is not always fixed
‣ May scale with problem size
‣ Caching reduces memory accesses
– Increases arithmetic intensity
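
One way to picture this choice is to compare a kernel's intensity with the ridge point of the roofline, where the memory roof meets the compute roof. The sketch below does that and adds a deliberately simple cache model in which a fraction of requested bytes never reaches main memory. The function names, the machine parameters, and the hit-rate model are all assumptions for illustration.

```python
def ridge_point(peak_gflops, peak_bw_gbytes):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_gflops / peak_bw_gbytes


def suggest_optimisation(intensity, peak_gflops, peak_bw_gbytes):
    """Left of the ridge the kernel is memory-bound; otherwise compute-bound."""
    if intensity < ridge_point(peak_gflops, peak_bw_gbytes):
        return "memory"          # e.g. software prefetch, memory affinity
    return "floating-point"      # e.g. balance adds/multiplies, ILP, SIMD


def effective_intensity(flops, bytes_requested, cache_hit_rate):
    """Intensity against main-memory traffic; requires cache_hit_rate < 1."""
    dram_bytes = bytes_requested * (1.0 - cache_hit_rate)
    return flops / dram_bytes


PEAK_GFLOPS, PEAK_BW = 16.0, 8.0  # hypothetical machine, ridge point at 2.0

uncached_ai = effective_intensity(1000, 1000, 0.0)   # every request goes to memory
cached_ai = effective_intensity(1000, 1000, 0.75)    # three quarters hit in cache
before = suggest_optimisation(uncached_ai, PEAK_GFLOPS, PEAK_BW)
after = suggest_optimisation(cached_ai, PEAK_GFLOPS, PEAK_BW)
```

In this toy case the same kernel moves from the memory-bound side of the ridge to the compute-bound side once caching removes three quarters of its memory traffic, so the worthwhile optimisation changes with it.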

Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
‣ Developing parallel software
‣ Devising appropriate architectures
• Many reasons for optimism
‣ Changing software and application environment
‣ Chip-level multiprocessors with lower latency, higher bandwidth interconnect

• An ongoing challenge for computer architects!