Chapter 8 – Dense Matrix Algorithms
Introduction to Parallel Computing
George Karypis
Dense Matrix Algorithms
Outline
Focus on numerical algorithms involving
dense matrices:
Matrix-Vector Multiplication
Matrix-Matrix Multiplication
Gaussian Elimination
Decompositions & Scalability
Review
Matrix-Vector Multiplication
Compute: y = Ax
y, x are n×1 vectors
A is an n×n dense matrix
Serial complexity: W = O(n²).
We will consider:
1D & 2D partitioning.
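For reference, the serial computation is a doubly nested loop; a minimal C sketch (assuming A is stored row-major in a flat array):

/* Serial y = Ax for an n x n row-major matrix A: Theta(n^2) work. */
void matvec_serial(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* dot product of row i with x */
        y[i] = sum;
    }
}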
Row-wise 1D Partitioning
How do we perform the operation?
Row-wise 1D Partitioning
Each processor needs to have the entire x vector.
All-to-all broadcast of x, followed by local computations.
Analysis?
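A minimal MPI sketch of the row-wise formulation, assuming p divides n and each process initially holds n/p contiguous rows of A plus the matching n/p entries of x (function and variable names are illustrative, not from the slides):

#include <mpi.h>
#include <stdlib.h>

/* Row-wise 1D y = Ax: all-to-all broadcast of x, then local computation.
   A_local: n/p contiguous rows of A (row-major); x_local, y_local: n/p entries. */
void matvec_1d(int n, const double *A_local, const double *x_local,
               double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* All-to-all broadcast: every process assembles the full vector x. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);

    /* Local computation on the n/p rows owned by this process. */
    for (int i = 0; i < nlocal; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x[j];
        y_local[i] = sum;
    }
    free(x);
}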
Block 2D Partitioning
How do we perform the operation?
Block 2D Partitioning
Each processor needs to have the portion of the x vector
that corresponds to the set of columns that it stores.
Analysis?
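A hedged sketch of the 2D formulation using MPI sub-communicators, assuming a √p × √p Cartesian grid, √p dividing n, and that the slice of x needed by each block column already sits on the diagonal process of that column (the initial alignment of x is omitted; all names are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* 2D block y = Ax. nb = n / sqrt(p).
   A_block: nb x nb block owned by this process (row-major).
   x_block: slice of x for this block column (contents valid on the diagonal process).
   y_block: slice of y for this block row (ends up on the processes in grid column 0).
   grid_comm: 2D Cartesian communicator over the sqrt(p) x sqrt(p) grid. */
void matvec_2d(int nb, const double *A_block, double *x_block, double *y_block,
               MPI_Comm grid_comm)
{
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);

    /* One sub-communicator per grid row and per grid column. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(grid_comm, coords[0], coords[1], &row_comm);
    MPI_Comm_split(grid_comm, coords[1], coords[0], &col_comm);

    /* Broadcast the x slice down each column from that column's diagonal process. */
    MPI_Bcast(x_block, nb, MPI_DOUBLE, coords[1], col_comm);

    /* Local computation: partial result for this block row. */
    double *y_partial = malloc(nb * sizeof(double));
    for (int i = 0; i < nb; i++) {
        y_partial[i] = 0.0;
        for (int j = 0; j < nb; j++)
            y_partial[i] += A_block[i * nb + j] * x_block[j];
    }

    /* Sum the partial results across each row; grid column 0 holds y. */
    MPI_Reduce(y_partial, y_block, nb, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    free(y_partial);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}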
1D vs 2D Formulation
Which one is better?
Matrix-Matrix Multiplication
Compute: C = AB
A, B, & C are n×n dense matrices.
Serial complexity: W = O(n³).
We will consider:
2D & 3D partitioning.
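As a baseline, the serial triple loop (a C sketch; C is assumed to be zero-initialized by the caller):

/* Serial C = AB for n x n row-major matrices: Theta(n^3) work. */
void matmul_serial(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}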
Simple 2D Algorithm
Processors are arranged in a logical √p × √p 2D topology.
Each processor gets an (n/√p) × (n/√p) block of A, B, & C.
It is responsible for computing the entries of C that have been assigned to it.
Analysis?
How about the memory complexity?
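In this simple algorithm, process (i, j) first obtains all √p blocks Ai,* (all-to-all broadcast along its process row) and B*,j (all-to-all broadcast along its process column), and then performs √p block multiplications. Holding 2√p blocks means Θ(n²/√p) memory per process, i.e., Θ(n²√p) in total. A sketch of the local computation only, assuming the gathered blocks are already available (illustrative names):

/* Local work of the simple 2D algorithm on process (i, j).
   nb = n / sqrt(p), q = sqrt(p).
   Arow[k] points to block A_{i,k}; Bcol[k] points to block B_{k,j};
   both come from the all-to-all broadcasts along the process row/column.
   C_block is assumed zero-initialized. */
void simple_2d_local(int nb, int q, double **Arow, double **Bcol, double *C_block)
{
    for (int k = 0; k < q; k++)                 /* sqrt(p) block multiply-adds */
        for (int i = 0; i < nb; i++)
            for (int l = 0; l < nb; l++)
                for (int j = 0; j < nb; j++)
                    C_block[i * nb + j] += Arow[k][i * nb + l] * Bcol[k][l * nb + j];
}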
Cannon’s Algorithm
Memory-efficient variant of the simple algorithm.
Key idea:
Replace the traditional loop, in which every processor steps through the blocks in the same order,
Ci,j = Σk Ai,k · Bk,j, k = 0, …, √p − 1,
with a staggered loop in which processor (i, j) starts at its own offset,
Ci,j = Σk Ai,(i+j+k) mod √p · B(i+j+k) mod √p, j.
During each step, processors operate on
different blocks of A and B.
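A sketch of Cannon's algorithm in MPI, assuming a periodic √p × √p Cartesian grid in which process (i, j) starts with the unshifted blocks Ai,j and Bi,j, and C is zero-initialized (names are illustrative; error handling omitted):

#include <mpi.h>

/* Cannon's algorithm. nb = n / sqrt(p); A, B, C are nb x nb row-major blocks.
   grid_comm: periodic sqrt(p) x sqrt(p) Cartesian communicator. */
void cannon(int nb, double *A, double *B, double *C, MPI_Comm grid_comm)
{
    int dims[2], periods[2], coords[2], src, dst, left, right, up, down;
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);
    int q = dims[0];

    /* Initial alignment: shift block row i of A left by i, block column j of B up by j. */
    MPI_Cart_shift(grid_comm, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid_comm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid_comm, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid_comm, MPI_STATUS_IGNORE);

    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);   /* A moves one block left each step */
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);      /* B moves one block up each step   */

    for (int step = 0; step < q; step++) {
        /* Local block multiply-add: C += A * B on the currently resident blocks. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    C[i * nb + j] += A[i * nb + k] * B[k * nb + j];

        /* Shift A left and B up by one block for the next step. */
        MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, left, 1, right, 1,
                             grid_comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, up, 2, down, 2,
                             grid_comm, MPI_STATUS_IGNORE);
    }
}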
Can we do better?
Can we use more than O(n²) processors?
So far the task corresponded to the dot-product of two vectors,
i.e., Ci,j = Ai,* · B*,j
How about performing this dot-product in
parallel?
What is the maximum concurrency that we
can extract?
3D Algorithm—DNS Algorithm
Partitioning the intermediate data
3D Algorithm—DNS Algorithm
Which one is better?
Gaussian Elimination
Solve Ax = b
A is an n×n dense matrix.
x and b are dense vectors.
Serial complexity: W = O(n³).
There are two key steps in
each iteration:
Division step
Rank-1 update
We will consider:
1D & 2D partitioning, and
introduce the notion of
pipelining.
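The two key steps are easiest to see in serial code; a minimal sketch without partial pivoting that reduces A to unit upper-triangular form and updates b (back-substitution to recover x is omitted):

/* Serial Gaussian elimination without pivoting: Theta(n^3) work.
   A is n x n row-major; b is the right-hand side. */
void gaussian_serial(int n, double *A, double *b)
{
    for (int i = 0; i < n; i++) {
        /* Division step: normalize row i by the pivot A[i][i]. */
        double pivot = A[i * n + i];
        for (int j = i + 1; j < n; j++)
            A[i * n + j] /= pivot;
        b[i] /= pivot;
        A[i * n + i] = 1.0;

        /* Rank-1 update: eliminate column i from all rows below row i. */
        for (int k = i + 1; k < n; k++) {
            double scale = A[k * n + i];
            for (int j = i + 1; j < n; j++)
                A[k * n + j] -= scale * A[i * n + j];
            b[k] -= scale * b[i];
            A[k * n + i] = 0.0;
        }
    }
}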
1D Partitioning
Assign n/p rows of A to
each processor.
During the ith iteration:
Divide operation is performed by the processor that stores row i.
Result is broadcast to the rest of the processors.
Each processor performs
the rank-1 update for its
local rows.
Analysis?
(one element per processor)
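A hedged MPI sketch of this 1D row-wise formulation with a block mapping, assuming p divides n, no pivoting is needed, and the right-hand side is handled separately (names are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* 1D Gaussian elimination, block row mapping, no pivoting.
   A_local: the n/p contiguous rows [rank*nlocal, (rank+1)*nlocal) of A, row-major. */
void gaussian_1d(int n, double *A_local, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;
    double *row = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) {
        int owner = i / nlocal;                     /* processor that stores row i */
        if (rank == owner) {
            double *r = &A_local[(i - owner * nlocal) * n];
            /* Division step, performed only by the owner of row i. */
            for (int j = i + 1; j < n; j++)
                r[j] /= r[i];
            r[i] = 1.0;
            for (int j = 0; j < n; j++)
                row[j] = r[j];
        }
        /* Broadcast the updated row i to the rest of the processors. */
        MPI_Bcast(row, n, MPI_DOUBLE, owner, comm);

        /* Rank-1 update on the locally stored rows below row i. */
        for (int k = 0; k < nlocal; k++) {
            if (rank * nlocal + k > i) {
                double scale = A_local[k * n + i];
                for (int j = i + 1; j < n; j++)
                    A_local[k * n + j] -= scale * row[j];
                A_local[k * n + i] = 0.0;
            }
        }
    }
    free(row);
}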
1D Pipelined Formulation
Existing Algorithm:
Next iteration starts only when the
previous iteration has finished.
Key Idea:
The next iteration can start as soon as the
rank-1 update involving the next row has
finished.
Essentially, multiple iterations are performed simultaneously!
Cost-optimal with
n processors
1D Partitioning
Is the block mapping a good idea?
2D Mapping
Each processor gets a 2D
block of the matrix.
Steps:
Broadcast of the “active” column
along the rows.
Divide step in parallel by the processors that own portions of the row.
Broadcast along the columns.
Rank-1 update.
Analysis?
2D Pipelined
Cost-optimal with
n² processors