
Lecture 3: Dense Linear Algebra (SUMMA and Matrix Factorizations
Kirk M. Soodhalter
Trinity College Dublin, The University of Dublin, Ireland. http://math.soodhalter.com
Soodhalter Parallel Numerics 5636 Dense Linear Algebra

Class Business
• First programming assignment has been posted.
• First on-paper assignment will be posted by week's end
• This class is listed through Blackboard, so I have all email addresses and a way to send group emails

Previously in Parallel Numerics…
• Dense matrix-vector multiplication (in serial and parallel)
• Dense matrix-matrix multiplication (in serial and started in-parallel discussion)
→ We demonstrated that one cannot attain the communication lower bound with a one-dimensional data layout/machine topology

Overview for today
• Two-dimensional parallel dense matrix-matrix product (SUMMA)
• We will not cover Cannon’s method (perhaps covered in 5611?) – but you should check it out
• Solving dense linear systems
• LU- and QR-factorizations
• Stability of LU
• Parallel LU
• Communication-avoiding LU

Different Parallel Data Layouts for Matrices (not all!)
1) 1D Column Blocked Layout
2) 1D Column Cyclic Layout
3) 1D Column Block Cyclic Layout (block size b)
4) Row versions of the previous layouts
5) 2D Row and Column Blocked Layout
6) 2D Row and Column Block Cyclic Layout (generalizes the others)
(Figure: processors 0–3 assigned to matrix blocks under each layout.)
1Slide from https://people.eecs.berkeley.edu/~demmel/cs267_Spr15

2D Parallel matrix multiplication
• SUMMA
– Scalable Universal Matrix Multiplication Algorithm
– Within log p of the communication lower bound
• Cannon's algorithm – came first; assumes A, B square and p a perfect square
• 2.5D SUMMA
– Uses additional memory to communicate less
• Parallel Strassen – attains even lower communication bounds

The outer product form of matrix-matrix multiplication
• C = AB: dot product formulation
– Each entry C(i, j) = Σ_{k=1}^n A(i, k)B(k, j) = A(i, :)B(:, j)
– Then we loop over i and j
• Rearrange the loop order (ijk → kji): column-wise outer products
C = AB = Σ_{k=1}^n A(:, k) · B(k, :)  (column k times row k)
• Block outer product version (block size b = 4):
C = A(:, 1:4)B(1:4, :) + A(:, 5:8)B(5:8, :) + ···
  = Σ_{k=1}^{n/b} A(:, kth block of columns) · B(kth block of rows, :)
• Outer products are BLAS-3!
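The block outer-product formulation above can be sketched in a few lines of NumPy (an illustrative sketch, not from the slides; it assumes the block size b divides the inner dimension):

```python
import numpy as np

def matmul_block_outer(A, B, b):
    """C = A @ B accumulated as a sum of block outer products:
    C = sum_k A[:, k*b:(k+1)*b] @ B[k*b:(k+1)*b, :]."""
    n = A.shape[1]
    assert n % b == 0, "block size must divide inner dimension"
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in range(n // b):
        # rank-b update: a BLAS-3 (GEMM) call on a block column / block row pair
        C += A[:, k*b:(k+1)*b] @ B[k*b:(k+1)*b, :]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 12)), rng.standard_normal((12, 8))
assert np.allclose(matmul_block_outer(A, B, 4), A @ B)
```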

Layout and topology
Assumptions (for simplicity, not necessity):
• p processors, p = s²
• n ≡ 0 mod s → s × s grid of matrix blocks
• Processors laid out in an s × s grid
What to calculate where:
• C[i, j] is the (n/s) × (n/s) block of C on processor P_ij
• A[i, k] is the ith part of the kth (n/s) × b submatrix of A
• B[k, j] is the jth part of the kth b × (n/s) submatrix of B
• C[i, j] ← C[i, j] + Σ_{k=1}^{n/b} A[i, k] B[k, j]  (summation over submatrices)
• Need not be a square processor grid (assumed here for simplicity)

SUMMA matrix-matrix multiplication on an s × s grid
1  for k = 0, …, n/b − 1 do
2    for all i = 1, …, s do
3      owner of A[i, k] broadcasts it to row i of processors
4    end
5    for all j = 1, …, s do
6      owner of B[k, j] broadcasts it to column j of processors
7    end
8    Receive A[i, k] into Acol, B[k, j] into Brow
9    C[this] ← C[this] + Acol · Brow
10 end
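The broadcast pattern above can be simulated serially: processor (i, j) owns one block each of A, B, C, and in step k uses the kth block of its row of A and the kth block of its column of B. A minimal sketch (names mine; block size b = n/s so there is one block per step):

```python
import numpy as np

def summa(A, B, s):
    """Serial simulation of SUMMA on an s-by-s processor grid,
    with block size b = n/s (one 'broadcast' per step k)."""
    n = A.shape[0]
    b = n // s
    blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b]
    C = np.zeros_like(A)
    for k in range(s):
        for i in range(s):          # A[i,k] "broadcast" along processor row i
            for j in range(s):      # B[k,j] "broadcast" along processor column j
                Acol, Brow = blk(A, i, k), blk(B, k, j)
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Acol @ Brow
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((12, 12)), rng.standard_normal((12, 12))
assert np.allclose(summa(A, B, 3), A @ B)
```

In a real implementation each (i, j) loop body runs concurrently on its own processor, and the two inner loops become row and column broadcasts.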

SUMMA Costs
1  for k = 0, …, n/b − 1 do
2    for all i = 1, …, s do
3      owner of A[i, k] broadcasts it to row i of processors
4    end
5    for all j = 1, …, s do
6      owner of B[k, j] broadcasts it to column j of processors
7    end
8    Receive A[i, k] into Acol, B[k, j] into Brow
9    C[this] ← C[this] + Acol · Brow
10 end
If we use a binary-tree broadcast along each row/column:
• log s messages per broadcast
• α + βb(n/s) cost per message
• Total communication: 2(n/b) log s [α + βb(n/s)] = 2 log s [α(n/b) + β(n²/s)]
• Choose the maximal block size b = n/s =⇒ 2s log s [α + β(n²/s²)]

Briefly: Strassen matrix multiplication
Matrix-matrix multiplication using a divide-and-conquer strategy.
A matrix blocking to compute C = AB: let
A = [A11 A12; A21 A22] and B = [B11 B12; B21 B22], so that
C = [C11 C12; C21 C22] = [A11B11 + A12B21, A11B12 + A12B22; A21B11 + A22B21, A21B12 + A22B22]
Strassen's observation
Define 7 auxiliary matrices M1, …, M7 which are combined to get the
Cij's using 7 rather than 8 mults. Ex: M1 = (A11 + A22)(B11 + B22).
• Called recursively on these 7 smaller multiplications
• Recurse until switching to another (e.g., BLAS-3) method
• Cost ≈ O(n^2.8) but with some reduced stability
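A minimal recursive Strassen sketch, assuming n is a power of two (the cutoff value and function name are illustrative choices, not from the slides):

```python
import numpy as np

def strassen(A, B, cutoff=32):
    """Strassen's algorithm: 7 recursive multiplies instead of 8."""
    n = A.shape[0]
    if n <= cutoff:                      # switch to BLAS-3 for small blocks
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C11 = M1 + M4 - M5 + M7              # combine the 7 products
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(2)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
assert np.allclose(strassen(A, B, cutoff=16), A @ B)
```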

Shifting gears: solving dense linear systems…
Linear systems of equations
• Solve Ax = b with A ∈ R^{n×n}, b ∈ R^n, and A dense
• Iterative methods for sparse matrices or matrix-free problems to follow
• We consider direct methods (LU-factorization)
– Factorize PA = LU: L, U lower/upper triangular, P a row-exchange (permutation) matrix
– Solve P^T LUx = b
– Serial LU-factorization with partial pivoting is backwards stable
– ∥PA − LU∥∞ small (close to εmach) in practice
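In practice one calls a LAPACK-backed routine rather than writing the factorization by hand; a sketch (my illustration, not from the slides) using SciPy's `lu_factor`/`lu_solve`, which compute PA = LU with partial pivoting and then do the two triangular solves:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

lu_piv = lu_factor(A)        # PA = LU; pivots stored compactly
x = lu_solve(lu_piv, b)      # forward and backward substitution

assert np.allclose(A @ x, b)
```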

Linear least-squares (LS) problems
• Given A ∈ R^{m×n}, b ∈ R^m, solve min_{x∈R^n} ∥b − Ax∥2
• LS solutions satisfy the normal equations: A^T Ax = A^T b
• For m ≥ n, we denote the "skinny" QR-factorization of A as
A = Q [R; 0]
with Q ∈ R^{m×m} orthogonal and R ∈ R^{n×n} upper triangular
• If rank A = rank R = n, then the LS solution is given by
Rx = (Q^T b)_{1:n}
• Serial QR-factorization (with an appropriate orthogonalization routine) is backwards stable
• ∥A − QR∥∞ small (close to εmach) in practice
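Solving the LS problem through the skinny QR factorization can be sketched directly (illustrative; NumPy's `qr` in `'reduced'` mode returns exactly the skinny factors):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A, mode='reduced')   # A = QR with Q m-by-n, R n-by-n
x = np.linalg.solve(R, Q.T @ b)          # Rx = (Q^T b)_{1:n}

# agrees with the reference least-squares solution
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```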

Communication lower bounds
The same communication lower bounds shown for serial and parallel matrix-matrix multiplication have been proven to extend to other dense/direct linear algebra algorithms and more generally to many procedures involving three nested loops.

Quick LU-factorization (without pivoting) example I
• A = [3 1 3; 6 7 3; 9 12 3]
• Eliminating the first column: let L1 = [1 0 0; −2 1 0; −3 0 1]; then
L1 A = [3 1 3; 0 5 −3; 0 9 −6]

Quick LU-factorization (without pivoting) example II
• L1 A = [3 1 3; 0 5 −3; 0 9 −6]
• Eliminating the second column: let L2 = [1 0 0; 0 1 0; 0 −9/5 1]; then
L2 L1 A = [3 1 3; 0 5 −3; 0 0 −3/5]

Why do we need pivoting?
• Stability: avoid division by zero or by small elements =⇒ ∥A − LU∥2 can be large due to roundoff
• Example: A1 = [0 3 3; 3 1 3; 6 2 3] has no LU-factorization
• Example: A2 = [10^{−20} 3 3; 3 1 3; 6 2 3] has an unstable LU-factorization, with
L1 = [1 0 0; −3×10^{20} 1 0; −6×10^{20} 0 1]
• Partial pivoting moves the column-max to the diagonal:
P̂A1 = [6 2 3; 3 1 3; 0 3 3],  P̂A2 = [6 2 3; 3 1 3; 10^{−20} 3 3]

Worst-case instability of LU with partial pivoting
Theorem
Let PA = LU be the LU factorization with partial pivoting in exact arithmetic. Then the computed versions of these factors, P̂, L̂, and Û, satisfy
L̂Û = P̂A + δA,
where ∥δA∥2/∥A∥2 = O(ρ · εmach) and ρ = max_{i,j} |u_ij| / max_{i,j} |a_ij|.

Worst-case instability of LU with partial pivoting
ρ is called the growth factor. ρ ≈ 1 =⇒ stable factorization
Example (Worst-case instability)
In spite of partial pivoting yielding backwards stability, ρ can be huge. Let
A = [1 0 0 0 1; −1 1 0 0 1; −1 −1 1 0 1; −1 −1 −1 1 1; −1 −1 −1 −1 1];
then
U = [1 0 0 0 1; 0 1 0 0 2; 0 0 1 0 4; 0 0 0 1 8; 0 0 0 0 16],
i.e., ρ = 2^{n−1} = 16 for n = 5.

Worst-case rare for serial LU with pivoting
• Previous example is pathological. “Never” happens in practice for serial case
• Can be justified with careful statistical arguments2
– Each step of Gaussian elimination introduces a rank-one
correction to remaining submatrix
– On average, this implies statistical relationships between
entries of remaining submatrix tending to retard growth
• Important: this average stability relies on operations introducing rank-one corrections.
=⇒ When moving to HPC/Parallel setting where we favor BLAS-3 and block operations (i.e., higher-rank corrections), average stability can no longer be assumed.
2Trefethen and Schreiber. Average case stability of Gaussian elimination. 1990

Block LU-factorization – obtained by delaying updates
• Matrix A = [A11 A12; A21 A22] ∈ R^{n×n} with A11 ∈ R^{b×b}
• First step computes the LU factorization with partial pivoting of the first block column (panel):
P1 [A11; A21] = [L11; L21] U11
• We obtain the factorization
P1 A = [L11 0; L21 I_{n−b}] [U11 U12; 0 A22^{(1)}]
with U12 = L11^{−1} A12 and A22^{(1)} = A22 − L21 U12
• The algorithm is applied recursively on A22^{(1)}

Block LU-factorization – pseudocode
1 Compute LU of the first panel: P1 [A11; A21] = [L11; L21] U11
2 Apply the pivoting matrix P1 to the full matrix: A = P1 A
3 Solve the triangular system U12 = L11^{−1} A12
4 Update the trailing matrix: A22^{(1)} = A22 − L21 U12
5 Apply recursively on the trailing matrix A22^{(1)}
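The steps above can be sketched as a right-looking block LU in NumPy. Pivoting is omitted to keep the three update steps visible, so this sketch assumes A has an LU factorization without pivoting (e.g., diagonally dominant), and it uses explicit inverses where production code would use triangular solves:

```python
import numpy as np

def block_lu(A, b):
    """Right-looking block LU without pivoting: returns L, U with A = L @ U."""
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    A = A.astype(float).copy()
    for k in range(0, n, b):
        e = k + b
        # 1) unblocked LU of the b-by-b diagonal block
        for j in range(k, e - 1):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(b)
        U11 = np.triu(A[k:e, k:e])
        L[k:e, k:e], U[k:e, k:e] = L11, U11
        if e < n:
            # 2) panel and block row: L21 = A21 U11^{-1}, U12 = L11^{-1} A12
            L[e:, k:e] = A[e:, k:e] @ np.linalg.inv(U11)
            U[k:e, e:] = np.linalg.inv(L11) @ A[k:e, e:]
            # 3) delayed trailing update: A22 <- A22 - L21 U12 (BLAS-3 GEMM)
            A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U

rng = np.random.default_rng(5)
n, b = 12, 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
L, U = block_lu(A, b)
assert np.allclose(L @ U, A)
```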

Parallel LU factorization overview
LU factorization on a P = Pr × Pc grid of processors (as in ScaLAPACK pdgetrf)
1 for i = 0, b, 2b, …, n do
2   A^{(i)} = A(i:n, i:n)
3   Factorize the ith column panel – find a pivot for each column, do row swaps: O(n log2 Pr) messages
4   Broadcast pivot information along all rows, do the swaps at left and right: O((n/b)(log2 Pc + log2 Pr)) messages
5   Broadcast L11 along its row to compute U12: O((n/b) log2 Pc) messages
6   Broadcast L21 along rows and U12 down columns to update the trailing matrix: O((n/b)(log2 Pc + log2 Pr)) messages
7 end
Bottleneck
• Getting pivot information from an entire column spread across all processors
Source: slides from Laura Grigori:
https://people.eecs.berkeley.edu/~demmel/cs267_Spr15/Lectures/
lecture13_densela2_CommAvoid_UCB_Grigori_v3.pdf

QR-factorization via Householder reflections
For A ∈ R^{m×n}, m ≥ n, we compute the "skinny" QR factorization
A = Q [R; 0]
General schemes: triangular orthogonalization (Gram-Schmidt) vs. orthogonal triangularization (Householder).
• Householder matrix: H_i = I − τ_i h_i h_i^T
• H_i has the following properties:
– it is symmetric and orthogonal, H_i² = I,
– it is independent of the scaling of h_i,
– it reflects x about the hyperplane span(h_i)^⊥
• Choose h_i (the hyperplane) so that x → ∥x∥2 e1:
h_i = ±(∥x∥2 e1 − x); sign chosen for stability
• For QR, we choose a Householder matrix that annihilates the elements of a vector x, except the first one.
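The reflector construction can be written out directly (an illustrative sketch; the helper name and sign convention, chosen to avoid cancellation, are mine):

```python
import numpy as np

def householder(x):
    """Return (h, tau) with (I - tau h h^T) x = -sign(x[0]) ||x||_2 e1."""
    h = x.astype(float).copy()
    sign = 1.0 if x[0] >= 0 else -1.0
    h[0] += sign * np.linalg.norm(x)   # add magnitudes: no cancellation
    tau = 2.0 / (h @ h)
    return h, tau

rng = np.random.default_rng(6)
x = rng.standard_normal(6)
h, tau = householder(x)
Hx = x - tau * h * (h @ x)             # apply H without forming it

# all entries below the first are annihilated; the norm is preserved
assert np.allclose(Hx[1:], 0.0)
assert np.isclose(abs(Hx[0]), np.linalg.norm(x))
```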

QR-factorization via Householder reflections: an example
Apply Householder transformations to annihilate subdiagonal entries:
A = [∗ ∗ ∗ ∗; ∗ ∗ ∗ ∗; ∗ ∗ ∗ ∗; ∗ ∗ ∗ ∗]
  = H1 [r11 r12 r13 r14; 0 ∗ ∗ ∗; 0 ∗ ∗ ∗; 0 ∗ ∗ ∗]
  = H1 [1 0; 0 Ĥ2] [r11 r12 r13 r14; 0 r22 r23 r24; 0 0 ∗ ∗; 0 0 ∗ ∗]
  = H1 H2 H3 R,  with Q = H1 H2 H3
For general A ∈ R^{m×n}, the factorization is
R = Hn Hn−1 ··· H2 H1 A ⇐⇒ A = (Hn Hn−1 ··· H2 H1)^T R, and Q = H1 H2 ··· Hn−1 Hn

Householder reflections: a compact form
Compact representation for Q
• The orthogonal factor Q can be represented implicitly as
Q = H1 ··· Hb = (I − τ1 h1 h1^T) ··· (I − τb hb hb^T) = I − Y T Y^T,
where Y = (h1 h2 ··· hb) is lower triangular and T ∈ R^{b×b} is upper triangular
• Example (for b = 2):
Y = (h1 h2),  T = [τ1  −τ1 h1^T h2 τ2; 0  τ2]
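The b = 2 formula can be verified numerically (an illustrative check of my own, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 6
h1, h2 = rng.standard_normal(n), rng.standard_normal(n)
t1, t2 = 2 / (h1 @ h1), 2 / (h2 @ h2)   # reflector scalings

H1 = np.eye(n) - t1 * np.outer(h1, h1)
H2 = np.eye(n) - t2 * np.outer(h2, h2)

Y = np.column_stack([h1, h2])
T = np.array([[t1, -t1 * (h1 @ h2) * t2],
              [0.0, t2]])

# compact WY form: H1 H2 = I - Y T Y^T
assert np.allclose(H1 @ H2, np.eye(n) - Y @ T @ Y.T)
```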

Block QR-factorization
A = [A11 A12; A21 A22] where A11 ∈ R^{b×b}
Block QR Algebra
• Compute the first panel QR-factorization: Q1^T [A11; A21] = [R11; 0]
• Update Q1^T A = [R11 R12; 0 Ã22] and apply the algorithm recursively on the trailing matrix Ã22

Block QR-factorization
A = [A11 A12; A21 A22] = Q1 [R11 R12; 0 Ã22]
1 Compute the panel factorization: Q1^T [A11; A21] = [R11; 0]
2 Compute the compact representation: Q1 = I − Y1 T1 Y1^T
3 Update the trailing matrix:
Q1^T [A12; A22] = [A12; A22] − Y1 (T1^T (Y1^T [A12; A22])) = [R12; Ã22]
4 Continue recursively on the trailing matrix
Bottleneck
• We must use the entire column to build each Householder vector

Tall Skinny QR (TSQR) in Parallel
• We compute the QR-factorization of W ∈ R^{m×b} with b ≪ m
• p processors: each processor owns a block of rows of W
• Classic parallel TSQR (ScaLAPACK)
– Compute a Householder vector for each column (bottleneck!)
– Compute the norm of each column: O(b log p) messages
• Communication-avoiding algorithm
– Perform local QR-factorizations, then reduce; repeat
– O(log p) messages

Communication Avoiding TSQR Example
(Figure: parallel TSQR reduction tree. Each processor Pi computes a local QR of its row block Wi, yielding Vi0 and Ri0; the R factors are then combined pairwise and re-factored: (R00, R10) → R01, (R20, R30) → R11, then (R01, R11) → R02.)
References: Golub, Plemmons, Sameh '88; Pothen, Raghavan '89; Da Cunha, Becker, Patterson '02
Source: slides from Laura Grigori:
https://people.eecs.berkeley.edu/~demmel/cs267_Spr15/Lectures/
lecture13_densela2_CommAvoid_UCB_Grigori_v3.pdf

Communication Avoiding TSQR Example II
• Stage 1:
W = [W0; W1; W2; W3] = [Q00 R00; Q10 R10; Q20 R20; Q30 R30] = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]
• Stages 2 & 3:
[R00; R10; R20; R30] = [Q01 R01; Q11 R11] = diag(Q01, Q11) · [R01; R11]  (Stage 2);
[R01; R11] = Q02 R02  (Stage 3)
• Q = diag(Q00, Q10, Q20, Q30) · diag(Q01, Q11) · Q02 and R = R02
• Important: the QR-factorization is unique (for full-rank W, with R having a positive diagonal).
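The three stages can be simulated serially with local QRs (a sketch under my own naming; each `np.linalg.qr` call stands in for one processor's local factorization):

```python
import numpy as np

rng = np.random.default_rng(8)
m, b, p = 64, 4, 4
W = rng.standard_normal((m, b))

# Stage 1: local QR on each of the p row blocks
Rs = [np.linalg.qr(Wi)[1] for Wi in np.split(W, p)]

# Stage 2: pairwise reduction of the stacked R factors
R01 = np.linalg.qr(np.vstack([Rs[0], Rs[1]]))[1]
R11 = np.linalg.qr(np.vstack([Rs[2], Rs[3]]))[1]

# Stage 3: final reduction
R02 = np.linalg.qr(np.vstack([R01, R11]))[1]

# R02 is the R factor of W, up to row signs (uniqueness of QR)
R_ref = np.linalg.qr(W)[1]
assert np.allclose(np.abs(R02), np.abs(R_ref))
```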

Different reductions on different architectures
Flexibility of TSQR and CAQR algorithms
(Figure: the same TSQR factorization organized with different reduction trees: a binary tree in parallel, a flat tree sequentially, and a hybrid tree, e.g., on dual-core nodes.)
The reduction tree will depend on the underlying architecture and could be chosen dynamically; Q is represented implicitly as a product.
Source: slides from Laura Grigori

Communication-avoiding QR for general matrices
• A ∈ Rm×n on s × s grid of processors, block size b
• Use communication-avoiding TSQR for panel factorization
• Broadcast Qij across appropriate rows after each panel factorization

Analogous communication avoiding LU-fact. (CALU)?
Tall skinny matrix W:
• Stage 1: W = [W0; W1; W2; W3] = Π0 · diag(L00, L10, L20, L30) · [U0; U1; U2; U3], with Π0 = diag(Π00, Π10, Π20, Π30)
• Stage 2: [U0; U1; U2; U3] = Π1 · diag(L01, L11) · [U01; U11], with Π1 = diag(Π01, Π11)
• Stage 3: [U01; U11] = Π2 L02 U02

Block parallel pivoting
• Recall: LU stability is controlled by the growth factor ρ of the entries of U
• Unstable for large numbers of processors P
• When P = number of rows, it corresponds to "parallel pivoting", which is known to be unstable [Trefethen and Schreiber, '90]
Source: slides from Laura Grigori

Solution: Tournament Pivoting I
At each iteration, we have
A = [A11 A12; A21 A22] with W = [A11; A21] ∈ R^{n×b}
• Preprocess W to find good pivots, obtaining P
• Perform all permutations, i.e., compute PA
• Perform LU without pivoting on the updated W; update the trailing matrix:
PA = [L11 0; L21 I_{n−b}] [U11 U12; 0 A22 − L21 U12]

Solution: Tournament Pivoting II
• We do a reduction with LU at each step, but only to select pivots; only the top b pivot rows of each block are passed on.
• Compute LU of each block Wi to get Π0 = diag(Π00, Π10, Π20, Π30):
W = [W0; W1; W2; W3] = [Π00 L00 U00; Π10 L10 U10; Π20 L20 U20; Π30 L30 U30]
• Apply pivots to get Wi0 = (Πi0^T Wi)_{1:b}; compute
[W00; W10] = Π01 L01 U01 and [W20; W30] = Π11 L11 U11 to get Π1
• Apply pivots to get W01 = (Π01^T [W00; W10])_{1:b} and W11 = (Π11^T [W20; W30])_{1:b}; compute [W01; W11] = Π2 L02 U02 to get Π2
• Schematically, we have computed an unpivoted LU factorization of the fully pivoted matrix: Π2^T Π1^T Π0^T W = LU
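The tournament can be sketched with SciPy's pivoted LU standing in for each local factorization (an illustrative simulation; the helper names and tree shape are mine):

```python
import numpy as np
from scipy.linalg import lu

def panel_winners(W_stack, idx, b):
    """One round of the tournament: LU with partial pivoting on the
    stacked candidates; the rows pivoted to the top b positions win."""
    P, L, U = lu(W_stack)                        # W_stack = P @ L @ U
    top = [int(np.argmax(P[:, k])) for k in range(b)]
    return [idx[t] for t in top], W_stack[top]

rng = np.random.default_rng(9)
n, b, p = 32, 2, 4
W = rng.standard_normal((n, b))

# round 0: each of the p blocks selects its b local pivot rows
blocks = np.array_split(np.arange(n), p)
cands = [panel_winners(W[blk], list(blk), b) for blk in blocks]

# reduction tree: merge candidate sets pairwise until one remains
while len(cands) > 1:
    cands = [panel_winners(np.vstack([r1, r2]), i1 + i2, b)
             for (i1, r1), (i2, r2) in zip(cands[::2], cands[1::2])]

pivot_rows, _ = cands[0]   # b global row indices chosen as panel pivots
assert len(set(pivot_rows)) == b
```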

Example of Tournament Pivoting with b = 2 and n = 4
(Figure: each processor Pi computes an LU factorization with partial pivoting of its block Wi, Wi = Πi Li Ui, to select good pivot rows; the selected rows are merged pairwise and re-factored along the reduction tree until the final good pivots for factorizing W remain.)
Source: slides from Laura Grigori:
https://people.eecs.berkeley.edu/~demmel/cs267_Spr15/Lectures/
lecture13_densela2_CommAvoid_UCB_Grigori_v3.pdf

Stability of CALU with Tournament Pivoting
• Experimentally: CALU has been demonstrated stable (with low growth factor ρ) for large random matrices with normally distributed entries
• Theoretically: CALU is equivalent in exact arithmetic to BLAS-3 serial block LU with p.p. applied to an auxiliary matrix.
• Example: Let
A = [A11 A12; A21 A22; A31 A32]
• One round of tournament pivoting produces pivot matrices Π0 and Π1. Relabel
[A11 A12; A21 A22; A31 A32] = Π1^T Π0^T A
• Then CALU is equivalent to LU with p.p. applied to
G = [A11 A12; A21 ; −A31 A32]

Second programming assignment
Exercise
You will be asked to build on your serial orthogonalization code to implement a simple version of the communication-avoiding QR method.

What did we talk about today (and on Wednesday)?
• The SUMMA algorithm for dense matrix-matrix multiplication on a 2D layout of processors
• The two most widely used matrix factorizations, LU and QR, used for directly solving linear systems and least-squares problems, respectively
• The stability of LU with partial pivoting
• Parallel and then communication avoiding QR
• The naive analog of CALU followed by the stable version using tournament pivoting
Next time
• Sparse matrices and their generalization (hierarchical matrices).
• Discretizing PDEs