
Introduction to Parallel Computing

George Karypis
Chapter 3: Principles of Parallel Algorithm Design

Outline
Overview of Some Serial Algorithms
Parallel Algorithm vs Parallel Formulation
Elements of a Parallel Algorithm/Formulation
Common Decomposition Methods (concurrency extractor!)
Common Mapping Methods (parallel overhead reducer!)

Some Serial Algorithms
Working Examples

Dense Matrix-Matrix & Matrix-Vector
Multiplication
Sparse Matrix-Vector Multiplication
Gaussian Elimination
Floyd’s All-pairs Shortest Path
Quicksort
Minimum/Maximum Finding
Heuristic Search: 15-Puzzle Problem

Dense Matrix-Vector Multiplication
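(The algorithm appears only as a figure on this slide. As a reference point for the decompositions discussed later, here is a minimal serial sketch in C++; row-major storage and all names are illustrative assumptions.)

```cpp
#include <vector>

// y = A * x for a dense n x n matrix A (row-major) and vector x.
// Each entry y[i] is an independent dot product, which is exactly
// what the row-wise task decomposition later in the chapter exploits.
std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x, int n) {
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}
```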

Dense Matrix-Matrix Multiplication

Sparse Matrix-Vector Multiplication
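(Figure elided. A minimal serial sketch using the common Compressed Sparse Row format; the format choice is an assumption, since the slide only shows a picture.)

```cpp
#include <vector>

// y = A * x with A in CSR form: row_ptr[i]..row_ptr[i+1] indexes the
// nonzeros of row i. Rows hold varying numbers of nonzeros, so
// row-based tasks are naturally non-uniform in size.
std::vector<double> spmv(const std::vector<int>& row_ptr,
                         const std::vector<int>& col_idx,
                         const std::vector<double>& val,
                         const std::vector<double>& x) {
    int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += val[k] * x[col_idx[k]];
    return y;
}
```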

Gaussian Elimination

Floyd’s All-Pairs Shortest Path
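(Figure elided. For reference, a minimal serial sketch; "no edge" is assumed to be encoded as a large finite value so no infinity handling is needed.)

```cpp
#include <algorithm>
#include <vector>

// Floyd's all-pairs shortest path on an n x n distance matrix d
// (row-major). The k loop carries a dependency: iteration k must
// finish before k+1, but the i/j loops within it are parallelizable.
void floyd(std::vector<double>& d, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                d[i * n + j] = std::min(d[i * n + j],
                                        d[i * n + k] + d[k * n + j]);
}
```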

Quicksort

Minimum Finding

15-Puzzle Problem

Parallel Algorithm vs Parallel
Formulation

Parallel Formulation
Refers to a parallelization of a serial algorithm.

Parallel Algorithm
May represent an entirely different algorithm than the
one used serially.

We primarily focus on “Parallel Formulations”
Our goal today is to discuss how to develop such parallel formulations.
Of course, there will always be examples of “parallel algorithms” that were not derived from serial algorithms.

Elements of a Parallel
Algorithm/Formulation

Pieces of work that can be done concurrently
tasks

Mapping of the tasks onto multiple processors
processes vs processors

Distribution of input/output & intermediate data across the different
processors
Management of access to shared data

either input or intermediate
Synchronization of the processors at various points of the parallel
execution

Holy Grail:
Maximize concurrency and reduce overheads due to parallelization!
Maximize potential speedup!

Finding Concurrent Pieces of Work

Decomposition:
The process of dividing the computation into
smaller pieces of work, i.e., tasks

Tasks are programmer-defined and are
considered to be indivisible

Example: Dense Matrix-Vector
Multiplication

Tasks can be of different sizes.
• granularity of a task
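(A sketch of the row-wise decomposition this slide illustrates: the rows of y are split into contiguous blocks, and each block is one task. Larger blocks mean coarser granularity. The use of std::thread is purely illustrative.)

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Row-wise decomposition of y = A * x: each of num_tasks tasks
// computes a disjoint block of rows, so no synchronization is
// needed on the output.
void parallel_matvec(const std::vector<double>& A,
                     const std::vector<double>& x,
                     std::vector<double>& y, int n, int num_tasks) {
    std::vector<std::thread> workers;
    int rows_per_task = (n + num_tasks - 1) / num_tasks;
    for (int t = 0; t < num_tasks; ++t) {
        int lo = t * rows_per_task;
        int hi = std::min(n, lo + rows_per_task);
        workers.emplace_back([&, lo, hi] {
            for (int i = lo; i < hi; ++i) {
                double sum = 0.0;
                for (int j = 0; j < n; ++j) sum += A[i * n + j] * x[j];
                y[i] = sum;  // each task writes its own slice of y
            }
        });
    }
    for (auto& w : workers) w.join();
}
```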

Example: Query Processing

Query:

Example: Query Processing
Finding concurrent tasks…

Task-Dependency Graph
In most cases, there are dependencies between
the different tasks

certain task(s) can only start once some other task(s)
have finished

e.g., producer-consumer relationships
These dependencies are represented using a
directed acyclic graph (DAG) called the task-dependency graph

Task-Dependency Graph (cont)
Key Concepts Derived from the Task-
Dependency Graph

Degree of Concurrency
The number of tasks that can be concurrently
executed

we usually care about the average degree of
concurrency

Critical Path
The longest vertex-weighted path in the graph

The weights represent task size

Task granularity affects both of the above
characteristics
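(A worked instance of the two measures just defined, with assumed numbers: for unit-weight tasks, the average degree of concurrency is the total work divided by the critical path length.)

```latex
% Example (assumed numbers): a task-dependency graph with 15
% unit-weight tasks whose longest path contains 5 tasks:
\[
  \text{avg.\ degree of concurrency}
  = \frac{\text{total work}}{\text{critical path length}}
  = \frac{15}{5} = 3
\]
```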

Task-Interaction Graph
Captures the pattern of interaction between
tasks

This graph usually contains the task-dependency
graph as a subgraph

i.e., there may be interactions between tasks even if there
are no dependencies

these interactions usually occur due to accesses to shared
data

Task Dependency/Interaction
Graphs

These graphs are important for developing
effective mappings of the tasks onto the different
processors

Maximize concurrency and minimize overheads

More on this later…

Common Decomposition Methods

Task decomposition methods:
Data Decomposition
Recursive Decomposition
Exploratory Decomposition
Speculative Decomposition
Hybrid Decomposition

Recursive Decomposition
Suitable for problems that can be solved
using the divide-and-conquer paradigm
Each of the subproblems generated by the
divide step becomes a task

Example: Quicksort
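(A minimal sketch of the recursive decomposition: each subproblem produced by the divide (partition) step becomes a task. Spawning a thread per task is for illustration only; a real implementation would use a task pool and a serial cutoff.)

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Quicksort on a[lo, hi): the partition step creates two
// independent subproblems, each of which is a concurrent task.
void quicksort_tasks(std::vector<int>& a, int lo, int hi) {
    if (hi - lo < 2) return;
    int pivot = a[hi - 1], mid = lo;          // Lomuto partition
    for (int i = lo; i < hi - 1; ++i)
        if (a[i] < pivot) std::swap(a[i], a[mid++]);
    std::swap(a[mid], a[hi - 1]);
    // The two tasks generated by the divide step:
    std::thread left(quicksort_tasks, std::ref(a), lo, mid);
    quicksort_tasks(a, mid + 1, hi);          // run one task here
    left.join();
}
```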

Example: Finding the Minimum
Note that we can obtain divide-and-conquer algorithms for problems that are traditionally solved using non-divide-and-conquer approaches
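(A minimal serial sketch of the divide-and-conquer min-finding referred to above: each half of the range is an independent subproblem, even though the textbook serial algorithm is a simple linear scan. Assumes a nonempty range.)

```cpp
#include <algorithm>
#include <vector>

// Recursive minimum over a[lo, hi): split the range in half,
// solve each half (potentially in parallel), combine with min().
int rec_min(const std::vector<int>& a, int lo, int hi) {
    if (hi - lo == 1) return a[lo];
    int mid = lo + (hi - lo) / 2;
    return std::min(rec_min(a, lo, mid), rec_min(a, mid, hi));
}
```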

Recursive Decomposition
How good are the decompositions that it
produces?

average concurrency?
critical path?

How do the quicksort and min-finding
decompositions measure up?

Data Decomposition
Used to derive concurrency for problems that operate on
large amounts of data
The idea is to derive the tasks by focusing on the
multiplicity of data
Data decomposition is often performed in two steps

Step 1: Partition the data
Step 2: Induce a computational partitioning from the data
partitioning

Which data should we partition?
Input/Output/Intermediate?

Well… all of the above, leading to different data decomposition
methods

How do we induce a computational partitioning?
Owner-computes rule

Example: Matrix-Matrix
Multiplication

Partitioning the output data
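(A sketch of output-data partitioning combined with the owner-computes rule: C is split into blocks, and the task that owns a block performs all the computation that lands in it. Block shapes and names are illustrative.)

```cpp
#include <vector>

// One task under output partitioning of C = A * B (all n x n,
// row-major): the task owns rows [rlo, rhi) and columns [clo, chi)
// of C and, by the owner-computes rule, performs every
// multiply-add contributing to that block.
void matmul_block(const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C, int n,
                  int rlo, int rhi, int clo, int chi) {
    for (int i = rlo; i < rhi; ++i)
        for (int j = clo; j < chi; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;  // disjoint writes: no synchronization
        }
}
```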

Example: Matrix-Matrix
Multiplication

Partitioning the intermediate data

Data Decomposition
Is the most widely-used decomposition
technique

after all, parallel processing is often applied to
problems that have a lot of data
splitting the work based on this data is the natural
way to extract a high degree of concurrency

It is used by itself or in conjunction with other
decomposition methods

Hybrid decomposition

Exploratory Decomposition
Used to decompose computations that
correspond to a search of a space of
solutions

Example: 15-puzzle Problem

Exploratory Decomposition
It is not as general-purpose
It can result in speedup anomalies

engineered slow-down or superlinear
speedup

Speculative Decomposition
Used to extract concurrency in problems in
which the next step is one of many
possible actions that can only be
determined when the current task
finishes
This decomposition assumes a certain
outcome of the currently executed task
and executes some of the next steps

Just like speculative execution at the
microprocessor level

Example: Discrete Event
Simulation

Speculative Execution
If predictions are wrong…

work is wasted
work may need to be undone

state-restoring overhead
memory/computations

However, it may be the only way to extract
concurrency!

Mapping the Tasks
Why do we care about task mapping?

Can I just randomly assign them to the available processors?

Proper mapping is critical as it needs to minimize the
parallel processing overheads

If Tp is the parallel runtime on p processors and Ts is the serial
runtime, then the total overhead is To = p*Tp - Ts

The work done by the parallel system beyond that required by the
serial system
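(A worked instance of the overhead formula, with assumed numbers.)

```latex
% T_o = p T_p - T_s. Example (assumed numbers): serial runtime
% T_s = 100s, run on p = 4 processors with parallel runtime T_p = 30s:
\[
  T_o = p\,T_p - T_s = 4 \times 30 - 100 = 20
\]
% i.e., 20s of the 120s of aggregate processor time is pure
% parallelization overhead.
```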

Overhead sources:
Load imbalance
Inter-process communication

coordination/synchronization/data-sharing

Remember the holy grail… minimizing these overheads can be at odds with each other!

Why Can Mapping Be Complicated?
Proper mapping needs to take into account the task-dependency
and interaction graphs

Are the tasks available a priori?
Static vs dynamic task generation

How about their computational requirements?
Are they uniform or non-uniform?
Do we know them a priori?

How much data is associated with each task?
How about the interaction patterns between the tasks?

Are they static or dynamic?
Do we know them a priori?
Are they data instance dependent?
Are they regular or irregular?
Are they read-only or read-write?

Depending on the above characteristics, different mapping
techniques of different complexity and cost are required

(In the slide, the questions about task generation and computational requirements are annotated as properties of the task-dependency graph; the questions about interaction patterns as properties of the task-interaction graph.)

Example: Simple & Complex Task
Interaction

Mapping Techniques for Load
Balancing

Be aware…
The assignment of tasks whose aggregate
computational requirements are the same does not
automatically ensure load balance.

Each processor is assigned three tasks, but (a) is better than (b)!

Load Balancing Techniques
Static

The tasks are distributed among the processors prior
to the execution
Applicable for tasks that are generated statically and have
known and/or uniform computational requirements

Dynamic
The tasks are distributed among the processors
during the execution of the algorithm

i.e., tasks & data are migrated
Applicable for tasks that are generated dynamically or have
unknown computational requirements

Static Mapping: Array Distribution

Suitable for algorithms that use data decomposition
and whose underlying input/output/intermediate data
are in the form of arrays

Block Distribution
Cyclic Distribution
Block-Cyclic Distribution
Randomized Block Distributions

1D/2D/3D
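(A sketch contrasting the three 1D schemes: which process owns array element i when n elements are distributed over p processes. The block formula assumes n is divisible by p, for simplicity.)

```cpp
// 1D distributions of n array elements over p processes.
// Block: contiguous chunks; cyclic: round-robin by element;
// block-cyclic: round-robin by blocks of size b.
int block_owner(int i, int n, int p)        { return i / (n / p); }
int cyclic_owner(int i, int p)              { return i % p; }
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }
```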

Examples: Block Distributions

Examples: Block Distributions

Example: Block-Cyclic Distributions

Gaussian Elimination
The active portion
of the array shrinks
as the computations
progress

Random Block Distributions
Sometimes the computations are performed only
at certain portions of an array

sparse matrix-matrix multiplication

Random Block Distributions
Better load balance can be achieved via a
random block distribution

Graph Partitioning
A mapping can be achieved by directly
partitioning the task interaction graph.

e.g., finite element mesh-based computations

Directly partitioning this graph

Example: Sparse Matrix-Vector
Another instance of graph partitioning

Dynamic Load Balancing Schemes

There is a huge body of research
Centralized Schemes

A certain processor is responsible for giving out work
master-slave paradigm

Issue:
task granularity

Distributed Schemes
Work can be transferred between any pair of processors.
Issues:

How do the processors get paired?
Who initiates the work transfer? push vs pull
How much work is transferred?
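(A minimal sketch of a centralized scheme using a shared counter as the "master": each worker repeatedly grabs the next chunk of iterations, so faster workers automatically take more work. The chunk size is the task-granularity knob mentioned above; all names are illustrative.)

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Centralized dynamic load balancing via self-scheduling:
// an atomic counter hands out chunks of [0, n) on demand.
void dynamic_for(int n, int chunk, int num_workers,
                 const std::function<void(int)>& body) {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w)
        workers.emplace_back([&] {
            for (;;) {
                int lo = next.fetch_add(chunk);  // grab the next chunk
                if (lo >= n) break;
                int hi = std::min(n, lo + chunk);
                for (int i = lo; i < hi; ++i) body(i);
            }
        });
    for (auto& w : workers) w.join();
}
```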

Mapping to Minimize Interaction
Overheads

Maximize data locality
Minimize volume of data-exchange
Minimize frequency of interactions
Minimize contention and hot spots
Overlap computation with interactions
Selective data and computation replication

Achieving the above is usually an interplay of
decomposition and mapping, and is typically done iteratively