程序代写代做代考 concurrency algorithm Microsoft PowerPoint – Chapter 8 – Dense Matrix Algorithms

Microsoft PowerPoint – Chapter 8 – Dense Matrix Algorithms

Introduction to
Parallel Computing

George Karypis
Dense Matrix Algorithms

Outline
Focus on numerical algorithms involving
dense matrices:

Matrix-Vector Multiplication
Matrix-Matrix Multiplication
Gaussian Elimination

Decompositions & Scalability

Review

Matrix-Vector Multiplication
Compute: y = Ax

y, x are nx1 vectors
A is an nxn dense matrix

Serial complexity: W = O(n2).
We will consider:

1D & 2D partitioning.

Row-wise 1D Partitioning

How do we perform the operation?

Row-wise 1D Partitioning
Each processor needs to have the entire x vector.

All-to-all broadcast Local computations

Analysis?

Block 2D Partitioning

How do we perform the operation?

Block 2D Partitioning
Each processor needs to have the portion of the x vector

that corresponds to the set of columns that it stores.

Analysis?

1D vs 2D Formulation
Which one is better?

Matrix-Matrix Multiplication
Compute: C = AB

A, B, & C are nxn dense
matrices.

Serial complexity:
W = O(n3).
We will consider:

2D & 3D partitioning.

Simple 2D Algorithm
Processors are arranged in a logical
sqrt(p)*sqrt(p) 2D topology.
Each processor gets a block of
(n/sqrt(p))*(n/sqrt(p)) block of A, B, & C.
It is responsible for computing the entries
of C that it has been assigned to.
Analysis?

How about the
memory

complexity?

Cannon’s Algorithm
Memory efficient variant of the simple
algorithm.
Key idea:

Replace traditional loop:

With the following loop:

During each step, processors operate on
different blocks of A and B.

Can we do better?
Can we use more than O(n2) processors?
So far the task corresponded to the dot-
product of two vectors

i.e., Ci,j = Ai,* . B*,j
How about performing this dot-product in
parallel?
What is the maximum concurrency that we
can extract?

3D Algorithm—DNS Algorithm
Partitioning the intermediate data

3D Algorithm—DNS Algorithm

Which one is better?

Gaussian Elimination
Solve Ax=b

A is an nxn dense matrix.
x and b are dense vectors

Serial complexity:
W = O(n3).
There are two key steps in
each iteration:

Division step
Rank-1 update

We will consider:
1D & 2D partitioning, and
introduce the notion of
pipelining.

1D Partitioning
Assign n/p rows of A to
each processor.
During the ith iteration:

Divide operation is
performed by the processor
who stores row i.
Result is broadcasted to the
rest of the processors.
Each processor performs
the rank-1 update for its
local rows.

Analysis?

(one element per processor)

1D Pipelined Formulation
Existing Algorithm:
Next iteration starts only when the
previous iteration has finished.
Key Idea:
The next iteration can start as soon as the
rank-1 update involving the next row has
finished.

Essentially multiple iterations are perform
simultaneously!

Cost-optimal with
n processors

1D Partitioning
Is the block mapping a good idea?

2D Mapping
Each processor gets a 2D
block of the matrix.
Steps:

Broadcast of the “active” column
along the rows.
Divide step in parallel by the
processors who own portions of
the row.
Broadcast along the columns.
Rank-1 update.

Analysis?

2D Pipelined

Cost-optimal with
n2 processors