程序代写代做代考 scheme concurrency algorithm cache computer architecture Microsoft PowerPoint – Chapter2 – Parallel Programming Platforms

Microsoft PowerPoint – Chapter2 – Parallel Programming Platforms

Introduction to
Parallel Computing

George Karypis
Parallel Programming Platforms

Elements of a Parallel Computer
Hardware

Multiple Processors
Multiple Memories
Interconnection Network

System Software
Parallel Operating System
Programming Constructs to Express/Orchestrate Concurrency

Application Software
Parallel Algorithms

Goal:
Utilize the Hardware, System, & Application Software to either

Achieve Speedup: Tp = Ts/p
Solve problems requiring a large amount of memory.

Parallel Computing Platform
Logical Organization

The user’s view of the machine as it is being
presented via its system software

Physical Organization
The actual hardware architecture

Physical Architecture is to a large extent
independent of the Logical Architecture

Logical Organization Elements
Control Mechanism

SISD/SIMD/MIMD/MISD
Single/Multiple Instruction Stream
& Single/Multiple Data Stream

SPMD:
Single Program Multiple Data

Logical Organization Elements

Communication Model
Shared-Address Space

UMA/NUMA/ccNUMA

Message-Passing

Physical Organization
Ideal Parallel Computer Architecture

PRAM: Parallel Random Access Machine
PRAM Models

EREW/ERCW/CREW/CRCW
Exclusive/Concurrent Read and/or Write

Concurrent Writes are resolved via
Common/Arbitrary/Priority/Sum

Physical Organization
Interconnection Networks (ICNs)

Provide processor-to-processor and processor-to-memory
connections
Networks are classified as:

Dynamic
The network consists of
switching elements that the
various processors attach to

indirect network
Historically used to link
processors-to-memory

shared-memory systems

Static
Consist of a number of
point-to-point links

direct network
Historically used to link
processors-to-processors

distributed-memory
system

Static & Dynamic ICNs

Evaluation Metrics for ICNs
Diameter

The maximum distance between any two nodes
Smaller the better.

Connectivity
The minimum number of arcs that must be removed to break it into two
disconnected networks

Larger the better
Measures the multiplicity of paths

Bisection width
The minimum number of arcs that must be removed to partition the network into
two equal halves.

Larger the better
Bisection bandwidth

Applies to networks with weighted arcs—weights correspond to the link width
(how much data it can transfer)
The minimum volume of communication allowed between any two halves of a
network

Larger the better
Cost

The number of links in the network
Smaller the better

Metrics and Dynamic Networks

Network Topologies
Bus-Based
Networks

Shared medium
Information is being
broadcasted
Evaluation:

Diameter: O(1)
Connectivity: O(1)
Bisection width: O(1)
Cost: O(p)

Network Topologies
Crossbar Networks

Switch-based network
Supports simultaneous
connections
Evaluation:

Diameter: O(1)
Connectivity: O(1)?
Bisection width: O(p)?
Cost: O(p2)

Network Topologies
Multistage Interconnection Networks

Multistage Switch Architecture

Pass-through

Cross-over

Connecting the Various Stages

Blocking in a Multistage Switch
Routing is done by comparing the bit-level
representation of source and destination addresses.
-match goes via pass-through
-mismatch goes via cross-over

Network Topologies
Complete and star-connected networks.

Network Topologies
Cartesian Topologies

Network Topologies
Hypercubes

Network Topologies
Trees

Summary of Performance Metrics

Physical Organization
Cache Coherence in Shared Memory
Systems

A certain level of consistency must be
maintained for multiple copies of the same
data
Required to ensure proper semantics and
correct program execution

serializability
Two general protocols for dealing with it

invalidate & update

Invalidate/Update Protocols

Invalidate/Update Protocols
The preferred scheme depends on the
characteristics of the underlying application

frequency of reads/writes to shared variables
Classical trade-off between communication
overhead (updates) and idling (stalling in
invalidates)
Additional problems with false sharing
Existing schemes are based on the invalidate
protocol

A number of approaches have been developed for
maintaining the state/ownership of the shared data

Communication Costs in Parallel
Systems

Message-Passing Systems
The communication cost of a data-transfer
operation depends on:

start-up time: ts
add headers/trailer, error-correction, execute the routing
algorithm, establish the connection between source &
destination

per-hop time: th
time to travel between two directly connected nodes.

node latency
per-word transfer time: tw

1/channel-width

Store-and-Forward & Cut-Through
Routing

Cut-through Routing Deadlocks

Messages 0, 1, 2, and 3
need to go to nodes A, B,
C, and D, respectively

Communication Model Used for
this Class

We will assume that the cost of sending a
message of size m is:

In general true because ts is much larger
than th and for most of the algorithms that
we will study mtw is much larger than lth

Routing Mechanisms
Routing:

The algorithm used to determine the path that
a message will take to go from the source to
destination

Can be classified along different
dimensions

minimal vs non-minimal
deterministic vs adaptive

Dimension Ordered Routing
There is a predefined ordering of the dimensions
Messages are routed along the dimensions in that order
until they cannot move any further

X-Y routing for meshes
E-cube routine for hypercubes

Topology Embeddings
Mapping between networks

Useful in the early days of parallel computing
when topology specific algorithms were being
developed.

Embedding quality metrics
dilation

maximum number of lines an edge is mapped to
congestion

maximum number of edges mapped on a single
link

Mapping a Cartesian Topology
onto a Hypercube

Cool things ☺

Mapping a Cartesian Topology
onto a Hypercube