Parallel Numerical Algorithms – Chapter 6 – LU Factorization
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Parallel Numerical Algorithms
Chapter 6 – LU Factorization
Prof. Michael T. Heath
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Michael T. Heath Parallel Numerical Algorithms 1 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Outline
1 LU Factorization
Motivation
Gaussian Elimination
2 Parallel Algorithms for LU
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
3 Partial Pivoting
Michael T. Heath Parallel Numerical Algorithms 2 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
LU Factorization
System of linear algebraic equations has form
Ax = b
where A is given n× n matrix, b is given n-vector, and x is
unknown solution n-vector to be computed
Direct method for solving general linear system is by
computing LU factorization
A = LU
where L is unit lower triangular and U is upper triangular
Michael T. Heath Parallel Numerical Algorithms 3 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
LU Factorization
System Ax = b then becomes
LUx = b
Solve lower triangular system
Ly = b
by forward-substitution to obtain vector y
Finally, solve upper triangular system
Ux = y
by back-substitution to obtain solution x to original system
Michael T. Heath Parallel Numerical Algorithms 4 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
Gaussian Elimination Algorithm
LU factorization can be computed by Gaussian elimination as
follows, where U overwrites A
for k = 1 to n− 1
for i = k + 1 to n
`ik = aik/akk
end
for j = k + 1 to n
for i = k + 1 to n
aij = aij − `ikakj
end
end
end
{ loop over columns }
{ compute multipliers
for current column }
{ apply transformation to
remaining submatrix }
Michael T. Heath Parallel Numerical Algorithms 5 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
Gaussian Elimination Algorithm
In general, row interchanges (pivoting) may be required to
ensure existence of LU factorization and numerical stability
of Gaussian elimination algorithm, but for simplicity we
temporarily ignore this issue
Gaussian elimination requires about n3/3 paired additions
and multiplications, so model serial time as
T1 = tc n
3/3
where tc is time required for multiply-add operation
About n2/2 divisions also required, but we ignore this
lower-order term
Michael T. Heath Parallel Numerical Algorithms 6 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
Loop Orderings for Gaussian Elimination
Gaussian elimination has general form of triple-nested loop
in which entries of L and U overwrite those of A
for
for
for
aij = aij − (aik/akk) akj
end
end
end
Indices i, j, and k of for loops can be taken in any order,
for total of 3! = 6 different ways of arranging loops
Michael T. Heath Parallel Numerical Algorithms 7 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
Loop Orderings for Gaussian Elimination
Different loop orders have different memory access
patterns, which may cause their performance to vary
widely, depending on architectural features such as cache,
paging, vector registers, etc.
Perhaps most promising for parallel implementation are kij
and kji forms, which differ only in accessing matrix by
rows or columns, respectively
Michael T. Heath Parallel Numerical Algorithms 8 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Motivation
Gaussian Elimination
Gaussian Elimination Algorithm
kji form of Gaussian elimination
for k = 1 to n− 1
for i = k + 1 to n
`ik = aik/akk
end
for j = k + 1 to n
for i = k + 1 to n
aij = aij − `ik akj
end
end
end
Multipliers `ik computed outside inner loop for greater
efficiency
Michael T. Heath Parallel Numerical Algorithms 9 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Parallel Algorithm
Partition
For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and
computes and stores{
uij , if i ≤ j
`ij , if i > j
yielding 2-D array of n2 fine-grain tasks
Communicate
Broadcast entries of A vertically to tasks below
Broadcast entries of L horizontally to tasks to right
Michael T. Heath Parallel Numerical Algorithms 10 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Fine-Grain Tasks and Communication
a11
u11
a12
u12
a21
ℓ21
a22
u22
a13
u13
a14
u14
a23
u23
a24
u24
a31
ℓ31
a32
ℓ32
a41
ℓ41
a42
ℓ42
a33
u33
a34
u34
a43
ℓ43
a44
u44
a51
ℓ51
a52
ℓ52
a61
ℓ61
a62
ℓ62
a53
ℓ53
a54
ℓ54
a63
ℓ63
a64
ℓ64
a15
u15
a16
u16
a25
u25
a26
u26
a35
u35
a36
u36
a45
u45
a46
u46
a55
u55
a56
u56
a65
ℓ65
a66
u66
Michael T. Heath Parallel Numerical Algorithms 11 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Fine-Grain Parallel Algorithm
for k = 1 to min(i, j)− 1
recv broadcast of akj from task (k, j)
recv broadcast of `ik from task (i, k)
aij = aij − `ik akj
end
if i ≤ j then
broadcast aij to tasks (k, j), k = i + 1, . . . , n
else
recv broadcast of ajj from task (j, j)
`ij = aij/ajj
broadcast `ij to tasks (i, k), k = j + 1, . . . , n
end
{ vert bcast }
{ horiz bcast }
{ update entry }
{ vert bcast }
{ vert bcast }
{ multiplier }
{ horiz bcast }
Michael T. Heath Parallel Numerical Algorithms 12 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Agglomeration
Agglomerate
With n× n array of fine-grain tasks, natural strategies are
2-D: combine k × k subarray of fine-grain tasks to form
each coarse-grain task, yielding (n/k)2 coarse-grain tasks
1-D column: combine n fine-grain tasks in each column
into coarse-grain task, yielding n coarse-grain tasks
1-D row: combine n fine-grain tasks in each row into
coarse-grain task, yielding n coarse-grain tasks
Michael T. Heath Parallel Numerical Algorithms 13 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
2-D Agglomeration
a11
u11
a12
u12
a21
ℓ21
a22
u22
a13
u13
a14
u14
a23
u23
a24
u24
a31
ℓ31
a32
ℓ32
a41
ℓ41
a42
ℓ42
a33
u33
a34
u34
a43
ℓ43
a44
u44
a51
ℓ51
a52
ℓ52
a61
ℓ61
a62
ℓ62
a53
ℓ53
a54
ℓ54
a63
ℓ63
a64
ℓ64
a15
u15
a16
u16
a25
u25
a26
u26
a35
u35
a36
u36
a45
u45
a46
u46
a55
u55
a56
u56
a65
ℓ65
a66
u66
Michael T. Heath Parallel Numerical Algorithms 14 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
1-D Column Agglomeration
a11
u11
a12
u12
a21
ℓ21
a22
u22
a13
u13
a14
u14
a23
u23
a24
u24
a31
ℓ31
a32
ℓ32
a41
ℓ41
a42
ℓ42
a33
u33
a34
u34
a43
ℓ43
a44
u44
a51
ℓ51
a52
ℓ52
a61
ℓ61
a62
ℓ62
a53
ℓ53
a54
ℓ54
a63
ℓ63
a64
ℓ64
a15
u15
a16
u16
a25
u25
a26
u26
a35
u35
a36
u36
a45
u45
a46
u46
a55
u55
a56
u56
a65
ℓ65
a66
u66
Michael T. Heath Parallel Numerical Algorithms 15 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
1-D Row Agglomeration
a11
u11
a12
u12
a21
ℓ21
a22
u22
a13
u13
a14
u14
a23
u23
a24
u24
a31
ℓ31
a32
ℓ32
a41
ℓ41
a42
ℓ42
a33
u33
a34
u34
a43
ℓ43
a44
u44
a51
ℓ51
a52
ℓ52
a61
ℓ61
a62
ℓ62
a53
ℓ53
a54
ℓ54
a63
ℓ63
a64
ℓ64
a15
u15
a16
u16
a25
u25
a26
u26
a35
u35
a36
u36
a45
u45
a46
u46
a55
u55
a56
u56
a65
ℓ65
a66
u66
Michael T. Heath Parallel Numerical Algorithms 16 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Mapping
Map
2-D: assign (n/k)2/p coarse-grain tasks to each of p
processes using any desired mapping in each dimension,
treating target network as 2-D mesh
1-D: assign n/p coarse-grain tasks to each of p processes
using any desired mapping, treating target network as 1-D
mesh
Michael T. Heath Parallel Numerical Algorithms 17 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
2-D Agglomeration with Cyclic Mapping
a11
u11
a12
u12
a21
ℓ21
a22
u22
a13
u13
a14
u14
a23
u23
a24
u24
a31
ℓ31
a32
ℓ32
a41
ℓ41
a42
ℓ42
a33
u33
a34
u34
a43
ℓ43
a44
u44
a51
ℓ51
a52
ℓ52
a61
ℓ61
a62
ℓ62
a53
ℓ53
a54
ℓ54
a63
ℓ63
a64
ℓ64
a15
u15
a16
u16
a25
u25
a26
u26
a35
u35
a36
u36
a45
u45
a46
u46
a55
u55
a56
u56
a65
ℓ65
a66
u66
Michael T. Heath Parallel Numerical Algorithms 18 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Coarse-Grain 2-D Parallel Algorithm
for k = 1 to n− 1
broadcast {akj : j ∈ mycols, j ≥ k} in process column
if k ∈ mycols then
for i ∈ myrows, i > k
`ik = aik/akk { multipliers }
end
end
broadcast {`ik : i ∈ myrows, i > k} in process row
for j ∈ mycols, j > k
for i ∈ myrows, i > k,
aij = aij − `ik akj { update }
end
end
end
Michael T. Heath Parallel Numerical Algorithms 19 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Performance Enhancements
Each process becomes idle as soon as its last row and
column are completed
With block mapping, in which each process holds
contiguous block of rows and columns, some processes
become idle long before overall computation is complete
Block mapping also yields unbalanced load, as computing
multipliers and updates requires successively less work
with increasing row and column numbers
Cyclic or reflection mapping improves both concurrency
and load balance
Michael T. Heath Parallel Numerical Algorithms 20 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Performance Enhancements
Performance can also be enhanced by overlapping
communication and computation
At step k, each process completes updating its portion of
remaining unreduced submatrix before moving on to step
k + 1
Broadcast of each segment of row k + 1, and computation
and broadcast of each segment of multipliers for step k + 1,
could be initiated as soon as relevant segments of row
k + 1 and column k + 1 have been updated by their owners,
before completing remainder of their updating for step k
This send ahead strategy enables other processes to start
working on next step earlier than they otherwise could
Michael T. Heath Parallel Numerical Algorithms 21 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
1-D Column Agglomeration with Cyclic Mapping
a11
u11
a21
ℓ21
a31
ℓ31
a41
ℓ41
a51
ℓ51
a61
ℓ61
a12
u12
a22
u22
a32
ℓ32
a42
ℓ42
a52
ℓ52
a62
ℓ62
a13
u13
a23
u23
a33
u33
a43
ℓ43
a53
ℓ53
a63
ℓ63
a14
u14
a24
u24
a34
u34
a44
u44
a54
ℓ54
a64
ℓ64
a15
u15
a25
u25
a35
u35
a45
u45
a55
u55
a65
ℓ65
a16
u16
a26
u26
a36
u36
a46
u46
a56
u56
a66
u66
Michael T. Heath Parallel Numerical Algorithms 22 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
1-D Column Agglomeration
Matrix rows need not be broadcast vertically, since any
given column is contained entirely in only one process
But there is no parallelism in computing multipliers or
updating any given column
Horizontal broadcasts still required to communicate
multipliers for updating
Michael T. Heath Parallel Numerical Algorithms 23 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Coarse-Grain 1-D Column Parallel Algorithm
for k = 1 to n− 1
if k ∈ mycols then
for i = k + 1 to n
`ik = aik/akk
end
end
broadcast {`ik : k < i ≤ n}
for j ∈ mycols, j > k
for i = k + 1 to n
aij = aij − `ik akj
end
end
end
{ multipliers }
{ broadcast }
{ update }
Michael T. Heath Parallel Numerical Algorithms 24 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
1-D Row Agglomeration with Cyclic Mapping
a11
u11
a12
u12
a13
u13
a14
u14
a15
u15
a16
u16
a21
ℓ21
a22
u22
a23
u23
a24
u24
a25
u25
a26
u26
a31
ℓ31
a32
ℓ32
a33
u33
a34
u34
a35
u35
a36
u36
a41
ℓ41
a42
ℓ42
a43
ℓ43
a44
u44
a45
u45
a46
u46
a51
ℓ51
a52
ℓ52
a53
ℓ53
a54
ℓ54
a55
u55
a56
u56
a61
ℓ61
a62
ℓ62
a63
ℓ63
a64
ℓ64
a65
ℓ65
a66
u66
Michael T. Heath Parallel Numerical Algorithms 25 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
1-D Row Agglomeration
Multipliers need not be broadcast horizontally, since any
given matrix row is contained entirely in only one process
But there is no parallelism in updating any given row
Vertical broadcasts still required to communicate each row
of matrix to processes below it for updating
Michael T. Heath Parallel Numerical Algorithms 26 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Coarse-Grain 1-D Row Parallel Algorithm
for k = 1 to n− 1
broadcast {akj : k ≤ j ≤ n}
for i ∈ myrows, i > k,
`ik = aik/akk
end
for j = k + 1 to n
for i ∈ myrows, i > k,
aij = aij − `ik akj
end
end
end
{ broadcast }
{ multipliers }
{ update }
Michael T. Heath Parallel Numerical Algorithms 27 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Performance Enhancements
Same performance enhancements as for 2-D
agglomeration apply to both 1-D column and 1-D row
agglomerations as well, including cyclic mapping and send
ahead strategy
Michael T. Heath Parallel Numerical Algorithms 28 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Scalability for 2-D Agglomeration
Updating by each process at step k requires about
(n− k)2/p operations
Summing over n− 1 steps
Tcomp ≈ tc
n−1∑
k=1
(n− k)2/p
≈ tc n3/(3p)
Michael T. Heath Parallel Numerical Algorithms 29 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Scalability for 2-D Agglomeration
Similarly, amount of data broadcast at step k along each
process row and column is about (n− k)/
√
p, so on 2-D
mesh
Tcomm ≈
n−1∑
k=1
2(ts + tw (n− k)/
√
p )
≈ 2 ts n + tw n2/
√
p
where we have allowed for overlap of broadcasts for
successive steps
Michael T. Heath Parallel Numerical Algorithms 30 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Scalability for 2-D Agglomeration
Total execution time is
Tp ≈ tc n3/(3p) + 2 ts n + tw n2/
√
p
To determine isoefficiency function, set
tc n
3/3 ≈ E (tc n3/3 + 2 ts n p + tw n2
√
p )
which holds for large p if n = Θ(
√
p ), so isoefficiency
function is Θ(p
√
p ), since T1 = Θ(n3)
Michael T. Heath Parallel Numerical Algorithms 31 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Scalability for 1-D Agglomeration
With either 1-D column or 1-D row agglomeration, updating
by each process at step k requires about (n− k)2/p
operations
Summing over n− 1 steps
Tcomp ≈ tc
n−1∑
k=1
(n− k)2/p
≈ tc n3/(3p)
Michael T. Heath Parallel Numerical Algorithms 32 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Scalability for 1-D Agglomeration
Amount of data broadcast at step k is about n− k, so on
1-D mesh
Tcomm ≈
n−1∑
k=1
(ts + tw (n− k))
≈ ts n + tw n2/2
where we have allowed for overlap of broadcasts for
successive steps
Michael T. Heath Parallel Numerical Algorithms 33 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Scalability for 1-D Agglomeration
Total execution time is
Tp ≈ tc n3/(3p) + ts n + tw n2/2
To determine isoefficiency function, set
tc n
3/3 ≈ E (tc n3/3 + ts n p + tw n2p/2)
which holds for large p if n = Θ(p), so isoefficiency function
is Θ(p3), since T1 = Θ(n3)
Michael T. Heath Parallel Numerical Algorithms 34 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Partial Pivoting
Row ordering of A is irrelevant in system of linear
equations
Partial pivoting takes rows in order of largest entry in
magnitude of leading column of remaining unreduced
matrix
This choice ensures that multipliers do not exceed 1 in
magnitude, which reduces amplification of rounding errors
In general, partial pivoting is required to ensure existence
and numerical stability of LU factorization
Michael T. Heath Parallel Numerical Algorithms 35 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Partial Pivoting
Partial pivoting yields factorization of form
PA = LU
where P is permutation matrix
If PA = LU , then system Ax = b becomes
PAx = LUx = Pb
which can be solved by forward-substitution in lower
triangular system Ly = Pb, followed by back-substitution
in upper triangular system Ux = y
Michael T. Heath Parallel Numerical Algorithms 36 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Parallel Partial Pivoting
Partial pivoting complicates parallel implementation of
Gaussian elimination and significantly affects potential
performance
With 2-D algorithm, pivot search is parallel but requires
communication within process column and inhibits
overlapping of successive steps
With 1-D column algorithm, pivot search requires no
communication but is purely serial
Once pivot is found, index of pivot row must be
communicated to other processes, and rows must be
explicitly or implicitly interchanged in each process
Michael T. Heath Parallel Numerical Algorithms 37 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Parallel Partial Pivoting
With 1-D row algorithm, pivot search is parallel but requires
communication among processes and inhibits overlapping
of successive steps
If rows are explicitly interchanged, then only two processes
are involved
If rows are implicitly interchanged, then mapping of rows to
processes is altered, which may degrade concurrency and
load balance
Tradeoff between column and row algorithms with partial
pivoting depends on relative speeds of communication and
computation
Michael T. Heath Parallel Numerical Algorithms 38 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Alternatives to Partial Pivoting
Because of negative effects of partial pivoting on parallel
performance, various alternatives have been proposed that
limit pivot search
tournament pivoting
threshold pivoting
pairwise pivoting
Such strategies are not foolproof and may trade off some
degree of stability and accuracy for speed
Stability and accuracy may be recovered via iterative
refinement, but this has its own cost
Michael T. Heath Parallel Numerical Algorithms 39 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
Communication vs. Memory Tradeoff
If explicit replication of storage is allowed, then lower
communication volume is possible
As with matrix multiplication, “2.5-D” algorithms have
recently been developed that use partial storage
replication to reduce communication volume to whatever
extent available memory allows
If sufficient memory is avaiable, then these algorithms can
achieve provably optimal communication
Michael T. Heath Parallel Numerical Algorithms 40 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
References
J. W. Demmel, M. T. Heath, and H. A. van der Vorst,
Parallel numerical linear algebra, Acta Numerica
2:111-197, 1993
G. A. Geist and C. H. Romine, LU factorization algorithms
on distributed-memory multiprocessor architectures, SIAM
J. Sci. Stat. Comput. 9:639-649, 1988
L. Grigori, J. Demmel, and H. Xiang, CALU: A
communication optimal LU factorization algorithm, SIAM J.
Matrix Anal. Appl. 32:1317-1350, 2011
B. A. Hendrickson and D. E. Womble, The torus-wrap
mapping for dense matrix calculations on massively
parallel computers, SIAM J. Sci. Stat. Comput.
15:1201-1226, 1994
Michael T. Heath Parallel Numerical Algorithms 41 / 42
LU Factorization
Parallel Algorithms for LU
Partial Pivoting
References
J. M. Ortega, Introduction to Parallel and Vector Solution of
Linear Systems, Plenum Press, 1988
J. M. Ortega and C. H. Romine, The ijk forms of
factorization methods II: parallel systems, Parallel Comput.
7:149-162, 1988
Y. Robert, The Impact of Vector and Parallel Architectures
on the Gaussian Elimination Algorithm, Wiley, 1990
E. Solomonik and J. Demmel, Communication-optimal
parallel 2.5D matrix multiplication and LU factorization
algorithms, 17th Euro-Par Conf. on Parallel Processing,
LNCS 6853, Springer, 2011
S. A. Vavasis, Gaussian elimination with pivoting is
P-complete, SIAM J. Disc. Math. 2:413-423, 1989
Michael T. Heath Parallel Numerical Algorithms 42 / 42
LU Factorization
Motivation
Gaussian Elimination
Parallel Algorithms for LU
Fine-Grain Algorithm
Agglomeration Schemes
Scalability
Partial Pivoting