程序代写代做代考 scheme concurrency algorithm cache Parallel Numerical Algorithms – Chapter 6 – LU Factorization

Parallel Numerical Algorithms – Chapter 6 – LU Factorization

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Parallel Numerical Algorithms
Chapter 6 – LU Factorization

Prof. Michael T. Heath

Department of Computer Science
University of Illinois at Urbana-Champaign

CS 554 / CSE 512

Michael T. Heath Parallel Numerical Algorithms 1 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Outline

1 LU Factorization
Motivation
Gaussian Elimination

2 Parallel Algorithms for LU
Fine-Grain Algorithm
Agglomeration Schemes
Scalability

3 Partial Pivoting

Michael T. Heath Parallel Numerical Algorithms 2 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

LU Factorization

System of linear algebraic equations has form

Ax = b

where A is given n× n matrix, b is given n-vector, and x is
unknown solution n-vector to be computed

Direct method for solving general linear system is by
computing LU factorization

A = LU

where L is unit lower triangular and U is upper triangular

Michael T. Heath Parallel Numerical Algorithms 3 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

LU Factorization

System Ax = b then becomes

LUx = b

Solve lower triangular system

Ly = b

by forward-substitution to obtain vector y

Finally, solve upper triangular system

Ux = y

by back-substitution to obtain solution x to original system

Michael T. Heath Parallel Numerical Algorithms 4 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

Gaussian Elimination Algorithm

LU factorization can be computed by Gaussian elimination as
follows, where U overwrites A

for k = 1 to n− 1
for i = k + 1 to n

`ik = aik/akk
end
for j = k + 1 to n

for i = k + 1 to n
aij = aij − `ikakj

end
end

end

{ loop over columns }
{ compute multipliers

for current column }

{ apply transformation to
remaining submatrix }

Michael T. Heath Parallel Numerical Algorithms 5 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

Gaussian Elimination Algorithm

In general, row interchanges (pivoting) may be required to
ensure existence of LU factorization and numerical stability
of Gaussian elimination algorithm, but for simplicity we
temporarily ignore this issue

Gaussian elimination requires about n3/3 paired additions
and multiplications, so model serial time as

T1 = tc n
3/3

where tc is time required for multiply-add operation

About n2/2 divisions also required, but we ignore this
lower-order term

Michael T. Heath Parallel Numerical Algorithms 6 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

Loop Orderings for Gaussian Elimination

Gaussian elimination has general form of triple-nested loop
in which entries of L and U overwrite those of A

for
for

for
aij = aij − (aik/akk) akj

end
end

end

Indices i, j, and k of for loops can be taken in any order,
for total of 3! = 6 different ways of arranging loops

Michael T. Heath Parallel Numerical Algorithms 7 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

Loop Orderings for Gaussian Elimination

Different loop orders have different memory access
patterns, which may cause their performance to vary
widely, depending on architectural features such as cache,
paging, vector registers, etc.

Perhaps most promising for parallel implementation are kij
and kji forms, which differ only in accessing matrix by
rows or columns, respectively

Michael T. Heath Parallel Numerical Algorithms 8 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Motivation
Gaussian Elimination

Gaussian Elimination Algorithm

kji form of Gaussian elimination

for k = 1 to n− 1
for i = k + 1 to n

`ik = aik/akk
end
for j = k + 1 to n

for i = k + 1 to n
aij = aij − `ik akj

end
end

end

Multipliers `ik computed outside inner loop for greater
efficiency

Michael T. Heath Parallel Numerical Algorithms 9 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Parallel Algorithm

Partition

For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and
computes and stores{

uij , if i ≤ j
`ij , if i > j

yielding 2-D array of n2 fine-grain tasks

Communicate

Broadcast entries of A vertically to tasks below

Broadcast entries of L horizontally to tasks to right

Michael T. Heath Parallel Numerical Algorithms 10 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Fine-Grain Tasks and Communication

a11
u11

a12
u12

a21
ℓ21

a22
u22

a13
u13

a14
u14

a23
u23

a24
u24

a31
ℓ31

a32
ℓ32

a41
ℓ41

a42
ℓ42

a33
u33

a34
u34

a43
ℓ43

a44
u44

a51
ℓ51

a52
ℓ52

a61
ℓ61

a62
ℓ62

a53
ℓ53

a54
ℓ54

a63
ℓ63

a64
ℓ64

a15
u15

a16
u16

a25
u25

a26
u26

a35
u35

a36
u36

a45
u45

a46
u46

a55
u55

a56
u56

a65
ℓ65

a66
u66

Michael T. Heath Parallel Numerical Algorithms 11 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Fine-Grain Parallel Algorithm

for k = 1 to min(i, j)− 1
recv broadcast of akj from task (k, j)
recv broadcast of `ik from task (i, k)
aij = aij − `ik akj

end
if i ≤ j then

broadcast aij to tasks (k, j), k = i + 1, . . . , n
else

recv broadcast of ajj from task (j, j)
`ij = aij/ajj
broadcast `ij to tasks (i, k), k = j + 1, . . . , n

end

{ vert bcast }
{ horiz bcast }
{ update entry }

{ vert bcast }

{ vert bcast }
{ multiplier }
{ horiz bcast }

Michael T. Heath Parallel Numerical Algorithms 12 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Agglomeration

Agglomerate

With n× n array of fine-grain tasks, natural strategies are

2-D: combine k × k subarray of fine-grain tasks to form
each coarse-grain task, yielding (n/k)2 coarse-grain tasks

1-D column: combine n fine-grain tasks in each column
into coarse-grain task, yielding n coarse-grain tasks

1-D row: combine n fine-grain tasks in each row into
coarse-grain task, yielding n coarse-grain tasks

Michael T. Heath Parallel Numerical Algorithms 13 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

2-D Agglomeration

a11
u11

a12
u12

a21
ℓ21

a22
u22

a13
u13

a14
u14

a23
u23

a24
u24

a31
ℓ31

a32
ℓ32

a41
ℓ41

a42
ℓ42

a33
u33

a34
u34

a43
ℓ43

a44
u44

a51
ℓ51

a52
ℓ52

a61
ℓ61

a62
ℓ62

a53
ℓ53

a54
ℓ54

a63
ℓ63

a64
ℓ64

a15
u15

a16
u16

a25
u25

a26
u26

a35
u35

a36
u36

a45
u45

a46
u46

a55
u55

a56
u56

a65
ℓ65

a66
u66

Michael T. Heath Parallel Numerical Algorithms 14 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

1-D Column Agglomeration

a11
u11

a12
u12

a21
ℓ21

a22
u22

a13
u13

a14
u14

a23
u23

a24
u24

a31
ℓ31

a32
ℓ32

a41
ℓ41

a42
ℓ42

a33
u33

a34
u34

a43
ℓ43

a44
u44

a51
ℓ51

a52
ℓ52

a61
ℓ61

a62
ℓ62

a53
ℓ53

a54
ℓ54

a63
ℓ63

a64
ℓ64

a15
u15

a16
u16

a25
u25

a26
u26

a35
u35

a36
u36

a45
u45

a46
u46

a55
u55

a56
u56

a65
ℓ65

a66
u66

Michael T. Heath Parallel Numerical Algorithms 15 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

1-D Row Agglomeration

a11
u11

a12
u12

a21
ℓ21

a22
u22

a13
u13

a14
u14

a23
u23

a24
u24

a31
ℓ31

a32
ℓ32

a41
ℓ41

a42
ℓ42

a33
u33

a34
u34

a43
ℓ43

a44
u44

a51
ℓ51

a52
ℓ52

a61
ℓ61

a62
ℓ62

a53
ℓ53

a54
ℓ54

a63
ℓ63

a64
ℓ64

a15
u15

a16
u16

a25
u25

a26
u26

a35
u35

a36
u36

a45
u45

a46
u46

a55
u55

a56
u56

a65
ℓ65

a66
u66

Michael T. Heath Parallel Numerical Algorithms 16 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Mapping

Map

2-D: assign (n/k)2/p coarse-grain tasks to each of p
processes using any desired mapping in each dimension,
treating target network as 2-D mesh

1-D: assign n/p coarse-grain tasks to each of p processes
using any desired mapping, treating target network as 1-D
mesh

Michael T. Heath Parallel Numerical Algorithms 17 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

2-D Agglomeration with Cyclic Mapping

a11
u11

a12
u12

a21
ℓ21

a22
u22

a13
u13

a14
u14

a23
u23

a24
u24

a31
ℓ31

a32
ℓ32

a41
ℓ41

a42
ℓ42

a33
u33

a34
u34

a43
ℓ43

a44
u44

a51
ℓ51

a52
ℓ52

a61
ℓ61

a62
ℓ62

a53
ℓ53

a54
ℓ54

a63
ℓ63

a64
ℓ64

a15
u15

a16
u16

a25
u25

a26
u26

a35
u35

a36
u36

a45
u45

a46
u46

a55
u55

a56
u56

a65
ℓ65

a66
u66

Michael T. Heath Parallel Numerical Algorithms 18 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Coarse-Grain 2-D Parallel Algorithm

for k = 1 to n− 1
broadcast {akj : j ∈ mycols, j ≥ k} in process column
if k ∈ mycols then

for i ∈ myrows, i > k
`ik = aik/akk { multipliers }

end
end
broadcast {`ik : i ∈ myrows, i > k} in process row
for j ∈ mycols, j > k

for i ∈ myrows, i > k,
aij = aij − `ik akj { update }

end
end

end

Michael T. Heath Parallel Numerical Algorithms 19 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Performance Enhancements

Each process becomes idle as soon as its last row and
column are completed

With block mapping, in which each process holds
contiguous block of rows and columns, some processes
become idle long before overall computation is complete

Block mapping also yields unbalanced load, as computing
multipliers and updates requires successively less work
with increasing row and column numbers

Cyclic or reflection mapping improves both concurrency
and load balance

Michael T. Heath Parallel Numerical Algorithms 20 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Performance Enhancements

Performance can also be enhanced by overlapping
communication and computation

At step k, each process completes updating its portion of
remaining unreduced submatrix before moving on to step
k + 1

Broadcast of each segment of row k + 1, and computation
and broadcast of each segment of multipliers for step k + 1,
could be initiated as soon as relevant segments of row
k + 1 and column k + 1 have been updated by their owners,
before completing remainder of their updating for step k
This send ahead strategy enables other processes to start
working on next step earlier than they otherwise could

Michael T. Heath Parallel Numerical Algorithms 21 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

1-D Column Agglomeration with Cyclic Mapping

a11
u11

a21
ℓ21

a31
ℓ31

a41
ℓ41

a51
ℓ51

a61
ℓ61

a12
u12

a22
u22

a32
ℓ32

a42
ℓ42

a52
ℓ52

a62
ℓ62

a13
u13

a23
u23

a33
u33

a43
ℓ43

a53
ℓ53

a63
ℓ63

a14
u14

a24
u24

a34
u34

a44
u44

a54
ℓ54

a64
ℓ64

a15
u15

a25
u25

a35
u35

a45
u45

a55
u55

a65
ℓ65

a16
u16

a26
u26

a36
u36

a46
u46

a56
u56

a66
u66

Michael T. Heath Parallel Numerical Algorithms 22 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

1-D Column Agglomeration

Matrix rows need not be broadcast vertically, since any
given column is contained entirely in only one process

But there is no parallelism in computing multipliers or
updating any given column

Horizontal broadcasts still required to communicate
multipliers for updating

Michael T. Heath Parallel Numerical Algorithms 23 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Coarse-Grain 1-D Column Parallel Algorithm

for k = 1 to n− 1
if k ∈ mycols then

for i = k + 1 to n
`ik = aik/akk

end
end
broadcast {`ik : k < i ≤ n} for j ∈ mycols, j > k

for i = k + 1 to n
aij = aij − `ik akj

end
end

end

{ multipliers }

{ broadcast }

{ update }

Michael T. Heath Parallel Numerical Algorithms 24 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

1-D Row Agglomeration with Cyclic Mapping

a11
u11

a12
u12

a13
u13

a14
u14

a15
u15

a16
u16

a21
ℓ21

a22
u22

a23
u23

a24
u24

a25
u25

a26
u26

a31
ℓ31

a32
ℓ32

a33
u33

a34
u34

a35
u35

a36
u36

a41
ℓ41

a42
ℓ42

a43
ℓ43

a44
u44

a45
u45

a46
u46

a51
ℓ51

a52
ℓ52

a53
ℓ53

a54
ℓ54

a55
u55

a56
u56

a61
ℓ61

a62
ℓ62

a63
ℓ63

a64
ℓ64

a65
ℓ65

a66
u66

Michael T. Heath Parallel Numerical Algorithms 25 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

1-D Row Agglomeration

Multipliers need not be broadcast horizontally, since any
given matrix row is contained entirely in only one process

But there is no parallelism in updating any given row

Vertical broadcasts still required to communicate each row
of matrix to processes below it for updating

Michael T. Heath Parallel Numerical Algorithms 26 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Coarse-Grain 1-D Row Parallel Algorithm

for k = 1 to n− 1
broadcast {akj : k ≤ j ≤ n}
for i ∈ myrows, i > k,

`ik = aik/akk
end
for j = k + 1 to n

for i ∈ myrows, i > k,
aij = aij − `ik akj

end
end

end

{ broadcast }

{ multipliers }

{ update }

Michael T. Heath Parallel Numerical Algorithms 27 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Performance Enhancements

Same performance enhancements as for 2-D
agglomeration apply to both 1-D column and 1-D row
agglomerations as well, including cyclic mapping and send
ahead strategy

Michael T. Heath Parallel Numerical Algorithms 28 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Scalability for 2-D Agglomeration

Updating by each process at step k requires about
(n− k)2/p operations

Summing over n− 1 steps

Tcomp ≈ tc
n−1∑
k=1

(n− k)2/p

≈ tc n3/(3p)

Michael T. Heath Parallel Numerical Algorithms 29 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Scalability for 2-D Agglomeration

Similarly, amount of data broadcast at step k along each
process row and column is about (n− k)/


p, so on 2-D

mesh

Tcomm ≈
n−1∑
k=1

2(ts + tw (n− k)/

p )

≈ 2 ts n + tw n2/

p

where we have allowed for overlap of broadcasts for
successive steps

Michael T. Heath Parallel Numerical Algorithms 30 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Scalability for 2-D Agglomeration

Total execution time is

Tp ≈ tc n3/(3p) + 2 ts n + tw n2/

p

To determine isoefficiency function, set

tc n
3/3 ≈ E (tc n3/3 + 2 ts n p + tw n2


p )

which holds for large p if n = Θ(

p ), so isoefficiency

function is Θ(p

p ), since T1 = Θ(n3)

Michael T. Heath Parallel Numerical Algorithms 31 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Scalability for 1-D Agglomeration

With either 1-D column or 1-D row agglomeration, updating
by each process at step k requires about (n− k)2/p
operations

Summing over n− 1 steps

Tcomp ≈ tc
n−1∑
k=1

(n− k)2/p

≈ tc n3/(3p)

Michael T. Heath Parallel Numerical Algorithms 32 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Scalability for 1-D Agglomeration

Amount of data broadcast at step k is about n− k, so on
1-D mesh

Tcomm ≈
n−1∑
k=1

(ts + tw (n− k))

≈ ts n + tw n2/2

where we have allowed for overlap of broadcasts for
successive steps

Michael T. Heath Parallel Numerical Algorithms 33 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Scalability for 1-D Agglomeration

Total execution time is

Tp ≈ tc n3/(3p) + ts n + tw n2/2

To determine isoefficiency function, set

tc n
3/3 ≈ E (tc n3/3 + ts n p + tw n2p/2)

which holds for large p if n = Θ(p), so isoefficiency function
is Θ(p3), since T1 = Θ(n3)

Michael T. Heath Parallel Numerical Algorithms 34 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Partial Pivoting

Row ordering of A is irrelevant in system of linear
equations

Partial pivoting takes rows in order of largest entry in
magnitude of leading column of remaining unreduced
matrix

This choice ensures that multipliers do not exceed 1 in
magnitude, which reduces amplification of rounding errors

In general, partial pivoting is required to ensure existence
and numerical stability of LU factorization

Michael T. Heath Parallel Numerical Algorithms 35 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Partial Pivoting

Partial pivoting yields factorization of form

PA = LU

where P is permutation matrix

If PA = LU , then system Ax = b becomes

PAx = LUx = Pb

which can be solved by forward-substitution in lower
triangular system Ly = Pb, followed by back-substitution
in upper triangular system Ux = y

Michael T. Heath Parallel Numerical Algorithms 36 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Parallel Partial Pivoting

Partial pivoting complicates parallel implementation of
Gaussian elimination and significantly affects potential
performance

With 2-D algorithm, pivot search is parallel but requires
communication within process column and inhibits
overlapping of successive steps

With 1-D column algorithm, pivot search requires no
communication but is purely serial

Once pivot is found, index of pivot row must be
communicated to other processes, and rows must be
explicitly or implicitly interchanged in each process

Michael T. Heath Parallel Numerical Algorithms 37 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Parallel Partial Pivoting

With 1-D row algorithm, pivot search is parallel but requires
communication among processes and inhibits overlapping
of successive steps

If rows are explicitly interchanged, then only two processes
are involved

If rows are implicitly interchanged, then mapping of rows to
processes is altered, which may degrade concurrency and
load balance

Tradeoff between column and row algorithms with partial
pivoting depends on relative speeds of communication and
computation

Michael T. Heath Parallel Numerical Algorithms 38 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Alternatives to Partial Pivoting

Because of negative effects of partial pivoting on parallel
performance, various alternatives have been proposed that
limit pivot search

tournament pivoting
threshold pivoting
pairwise pivoting

Such strategies are not foolproof and may trade off some
degree of stability and accuracy for speed

Stability and accuracy may be recovered via iterative
refinement, but this has its own cost

Michael T. Heath Parallel Numerical Algorithms 39 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

Communication vs. Memory Tradeoff

If explicit replication of storage is allowed, then lower
communication volume is possible

As with matrix multiplication, “2.5-D” algorithms have
recently been developed that use partial storage
replication to reduce communication volume to whatever
extent available memory allows

If sufficient memory is avaiable, then these algorithms can
achieve provably optimal communication

Michael T. Heath Parallel Numerical Algorithms 40 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

References

J. W. Demmel, M. T. Heath, and H. A. van der Vorst,
Parallel numerical linear algebra, Acta Numerica
2:111-197, 1993

G. A. Geist and C. H. Romine, LU factorization algorithms
on distributed-memory multiprocessor architectures, SIAM
J. Sci. Stat. Comput. 9:639-649, 1988

L. Grigori, J. Demmel, and H. Xiang, CALU: A
communication optimal LU factorization algorithm, SIAM J.
Matrix Anal. Appl. 32:1317-1350, 2011

B. A. Hendrickson and D. E. Womble, The torus-wrap
mapping for dense matrix calculations on massively
parallel computers, SIAM J. Sci. Stat. Comput.
15:1201-1226, 1994

Michael T. Heath Parallel Numerical Algorithms 41 / 42

LU Factorization
Parallel Algorithms for LU

Partial Pivoting

References

J. M. Ortega, Introduction to Parallel and Vector Solution of
Linear Systems, Plenum Press, 1988
J. M. Ortega and C. H. Romine, The ijk forms of
factorization methods II: parallel systems, Parallel Comput.
7:149-162, 1988
Y. Robert, The Impact of Vector and Parallel Architectures
on the Gaussian Elimination Algorithm, Wiley, 1990
E. Solomonik and J. Demmel, Communication-optimal
parallel 2.5D matrix multiplication and LU factorization
algorithms, 17th Euro-Par Conf. on Parallel Processing,
LNCS 6853, Springer, 2011
S. A. Vavasis, Gaussian elimination with pivoting is
P-complete, SIAM J. Disc. Math. 2:413-423, 1989

Michael T. Heath Parallel Numerical Algorithms 42 / 42

LU Factorization
Motivation
Gaussian Elimination

Parallel Algorithms for LU
Fine-Grain Algorithm
Agglomeration Schemes
Scalability

Partial Pivoting