
Lecture 9: Parallel Sparse Matvec and Preconditioning
Kirk M. Soodhalter
Trinity College Dublin, The University of Dublin, Ireland
http://math.soodhalter.com
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Class Business
• Any questions from last week?
• Talk announcement: Ronny Ramlau, Inverse Problems in Adaptive Optics – Wednesday 13. March at 10am – Venue TBA
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Previously in Parallel Numerics…
• Understanding a deficiency of steepest descent
• Deriving the method of conjugate gradients (CG)
• A-conjugacy (i.e., orthogonality) of the search directions
• Polynomial-based convergence analysis of Krylov subspace methods
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Overview for today
• Sparse matrix-vector product in parallel
• Preconditioning
• Preconditioned CG
• Parallel CG with preconditioning
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Core operations of a preconditioned Krylov subspace method
• Preconditioner setup (to be discussed)
• Matrix-vector product
• Vector updates
• Dot products
• Actual preconditioning operations (TBD)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Recall: Sparse Storage Schemes
Example (Coordinate Format)
The simplest scheme is to store three arrays. One real (complex) array stores the non-zero entries. Two integer arrays store the row and column indices of each entry.
1.
3. 
6. 0.
• •
0. 0. 2.
􏰀
7. 5. 1. 2. 11. 3. 6. 4. 8. 10. 􏰀3 2 1 1 4 2 3 2 3 4􏰁
􏰀3 4 1 4 4 1 1 2 4 3􏰁
􏰁
4. 0.
0. 7.
0. 10. 11.
5. A: →JR:
8.
This stores redundant information.
JC:
Rather than store both row and column indices, we can store just column indices and then store pointer indices to where each row begins
A: 􏰀1.2.3.4.5.6.7.8.10.11.􏰁 JA: 􏰀1 4 1 2 4 1 3 4 3 4􏰁 IA: 􏰀1 3 6 9􏰁
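As a sanity check, here is a small Python sketch (numpy only; the arrays are the 1-based ones above, shifted to 0-based) that expands both the coordinate and the CSR arrays back into the same dense 4×4 matrix:

import numpy as np

# Coordinate (COO) arrays from the example above (1-based on the slide, 0-based here).
AA_coo = np.array([7., 5., 1., 2., 11., 3., 6., 4., 8., 10.])
JR = np.array([3, 2, 1, 1, 4, 2, 3, 2, 3, 4]) - 1   # row indices
JC = np.array([3, 4, 1, 4, 4, 1, 1, 2, 4, 3]) - 1   # column indices

# CSR arrays from the example above.
AA_csr = np.array([1., 2., 3., 4., 5., 6., 7., 8., 10., 11.])
JA = np.array([1, 4, 1, 2, 4, 1, 3, 4, 3, 4]) - 1    # column indices
IA = np.array([1, 3, 6, 9, 11]) - 1                  # row pointers (length n+1)

n = 4
A_from_coo = np.zeros((n, n))
A_from_coo[JR, JC] = AA_coo          # scatter the COO triples

A_from_csr = np.zeros((n, n))
for i in range(n):
    for k in range(IA[i], IA[i + 1]):
        A_from_csr[i, JA[k]] = AA_csr[k]

assert np.array_equal(A_from_coo, A_from_csr)
print(A_from_coo)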
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Recall: Compressed Sparse Row Format
SpMV in Compressed Sparse Row (CSR) Format

SpMV: y = y + A*x; only store, and do arithmetic on, the nonzero entries.
CSR format is the simplest of many possible data structures for A: an n×n matrix with nnz nonzeros is stored via row pointers (row_ptr), column indices (colind), and values (data).

Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)

for each row i
    for k = row_ptr[i] to row_ptr[i+1]-1 do
        y[i] = y[i] + data[k]*x[colind[k]]

[Figure: row_ptr, colind, and data arrays for an example matrix]

Source: https://sites.google.com/a/lbl.gov/cs267-spring-2017/
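For reference, a minimal sequential Python transcription of this kernel (0-based indexing; the 4×4 matrix is the CSR example from the previous slide, used only to check the result against a dense matvec):

import numpy as np

def csr_matvec(n, row_ptr, colind, data, x, y):
    # y <- y + A*x with A stored in CSR (0-based row_ptr / colind / data).
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += data[k] * x[colind[k]]
    return y

A = np.array([[1., 0., 0., 2.],
              [3., 4., 0., 5.],
              [6., 0., 7., 8.],
              [0., 0., 10., 11.]])
row_ptr = np.array([0, 2, 5, 8, 10])
colind  = np.array([0, 3, 0, 1, 3, 0, 2, 3, 2, 3])
data    = np.array([1., 2., 3., 4., 5., 6., 7., 8., 10., 11.])
x = np.arange(1., 5.)
y = csr_matvec(4, row_ptr, colind, data, x, np.zeros(4))
assert np.allclose(y, A @ x)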
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Recall: Some other storage formats
We discussed
• Coordinate format
• Compressed sparse row (CSR) format
Some other formats
• Compressed sparse column (CSC) format – analogous to (CSR) but roles of rows and columns exchanged
• Modified Sparse Row (MSR) format – two arrays (AA, JA); AA has diagonal entries in the first n positions, then nondiagonal entries afterward. First n positions of JA point to first entry of each row; afterward are the column indices.
• Diagonal storage – for narrow banded matrices. Store the diagonals as columns of a 2D array. A second integer array (with positive & negative entries) indicates the offset of each diagonal relative to the main one, one entry per column.
• Ellpack-Itpack – Assume each row has at most Nd nonzeros. The nonzeros of each row are stored as one row of a 2D array; a second 2D array stores the respective column indices (see the sketch below).
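To make the last format concrete, a small sketch (the array names coef/jcoef follow the usual Ellpack-Itpack convention and are assumed here) converting a dense matrix to the two 2D Ellpack arrays and doing a matvec with them:

import numpy as np

def to_ellpack(A, pad_col=0):
    # Ellpack-Itpack: values and column indices, one matrix row per array row,
    # padded with zero values (and a dummy column index) up to Nd entries.
    n = A.shape[0]
    Nd = max(np.count_nonzero(A[i]) for i in range(n))  # max nonzeros per row
    coef = np.zeros((n, Nd))            # nonzero values, row by row
    jcoef = np.full((n, Nd), pad_col)   # corresponding column indices
    for i in range(n):
        cols = np.nonzero(A[i])[0]
        coef[i, :len(cols)] = A[i, cols]
        jcoef[i, :len(cols)] = cols
    return coef, jcoef

A = np.array([[1., 0., 0., 2.],
              [3., 4., 0., 5.],
              [6., 0., 7., 8.],
              [0., 0., 10., 11.]])
coef, jcoef = to_ellpack(A)
x = np.arange(1., 5.)
y = (coef * x[jcoef]).sum(axis=1)   # padded entries contribute 0
assert np.allclose(y, A @ x)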
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Recall: Matvec as a loop over dot products
Warm up: Matrix-vector multiplication (implements y = y + A*x)

Algorithm: Computing y ← y + Ax
for i = 1,...,n do
    for j = 1,...,n do
        y(i) = y(i) + A(i, j) x(j)
    end
end

[Figure: y(i) = y(i) + A(i,:) * x(:)]
Graphic from https://people.eecs.berkeley.edu/~demmel/cs267_Spr15
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

CSR Matvec I
CSR-Matrix Representation
A (matrix entries):             ( A(1)  A(2)  ···  A(nnz) )
JA (column indices):            ( JA(1) JA(2) ···  JA(nnz) )
IA (row beginning pointers):    ( IA(1) IA(2) ···  IA(n+1) )
where nnz is the # of non-zeros

1 Algorithm: CSR Matvec – Dot Product Form
2 for i = 1,...,n do
3     k1 = IA(i)
4     k2 = IA(i+1) − 1
5     y(i) = ⟨A(k1 : k2), x(JA(k1 : k2))⟩
6 end
Question to the audience
What is the problem with this algorithm in parallel?
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

CSR Matvec II
Moving CSR-Matvec to parallel
• The loop is parallelizable
• Each processor is given a few rows of the matrix
(contiguously stored)
• The vector x is also split up (as discussed in lecture 1)
• Problem: each processor needs entries from all of x
• Line 5 of the algorithm uses indirect addressing, which requires communication
Definition (Gather Operation)
The indirect addressing of x is done in two steps. The processor first receives the needed entries of x into a local vector and then performs a local BLAS dot product. This method of handling the parallel indirect addressing x(JA(k1 : k2)) is called a gather operation.
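A sequential Python sketch of what one processor does under a block-row distribution: build the gather index set (the global entries of x its rows reference), gather them into a local buffer, then run the local dot-product kernel. The MPI communication itself is omitted and all names are illustrative; only the indexing is shown.

import numpy as np

def local_matvec_with_gather(row_ptr, colind, data, my_rows, x_global):
    # Matvec restricted to the rows in my_rows = (lo, hi), gathering the needed
    # entries of x first (as a single processor would after communication).
    lo, hi = my_rows
    needed = np.unique(colind[row_ptr[lo]:row_ptr[hi]])   # gather index set
    x_local = x_global[needed]                            # the gather operation
    glob2loc = {g: l for l, g in enumerate(needed)}       # global -> local index
    y_local = np.zeros(hi - lo)
    for i in range(lo, hi):
        k1, k2 = row_ptr[i], row_ptr[i + 1]
        y_local[i - lo] = sum(data[k] * x_local[glob2loc[colind[k]]]
                              for k in range(k1, k2))
    return y_local

# Example: the last two rows of the 4x4 CSR matrix from the earlier slide.
row_ptr = np.array([0, 2, 5, 8, 10])
colind  = np.array([0, 3, 0, 1, 3, 0, 2, 3, 2, 3])
data    = np.array([1., 2., 3., 4., 5., 6., 7., 8., 10., 11.])
x = np.arange(1., 5.)
print(local_matvec_with_gather(row_ptr, colind, data, (2, 4), x))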
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Gather Operation

When performing a matrix-by-vector product, communication will be necessary to get the needed components of x that are not stored in a given processor.

[Figure: Illustration of the row-oriented matrix-by-vector multiplication: the needed entries of x(1:n) are gathered into a local vector x(*), which is combined with the local row A(i,*) in a dot product to produce y(i).]

From Chapter 11 in Y Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

CSC-Matvec: Gather-Scatter Method
CSC-Matrix Representation
A (matrix entries by column):    ( A(1)  A(2)  ···  A(nnz) )
JA (row indices):                ( JA(1) JA(2) ···  JA(nnz) )
IA (column beginning pointers):  ( IA(1) IA(2) ···  IA(n+1) )
where nnz is the # of non-zeros

• Column-wise storage =⇒ the dot-product matvec would require non-sequential data access
• Instead, consider x as acting on A via the map
      x : R^(n×n) → R^n   such that   A ↦ Σ_{i=1}^{n} A(:, i) x(i)
• We call this the basis expansion operator
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

CSC-Matvec: Gather-Scatter Method
CSC-Matrix Representation
A (matrix entries by column):    ( A(1)  A(2)  ···  A(nnz) )
JA (row indices):                ( JA(1) JA(2) ···  JA(nnz) )
IA (column beginning pointers):  ( IA(1) IA(2) ···  IA(n+1) )
where nnz is the # of non-zeros

1 Algorithm: Computing y ← Ax
2 y(1 : n) = 0
3 for i = 1,...,n do
4     k1 = IA(i)
5     k2 = IA(i+1) − 1
6     y(JA(k1 : k2)) = y(JA(k1 : k2)) + x(i) A(k1 : k2)
7 end

=⇒ Must gather y, then send back (scatter) to the owning processor
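A runnable Python transcription of the column-oriented algorithm above (0-based indices; the gather-update-scatter of y becomes an indexed accumulation, checked against the dense 4×4 example matrix):

import numpy as np

def csc_matvec(n, col_ptr, rowind, data, x):
    # y <- A*x with A stored in CSC; each column scatters its contribution
    # x(i)*A(:, i) into y at the row positions rowind[k1:k2].
    y = np.zeros(n)
    for i in range(n):
        k1, k2 = col_ptr[i], col_ptr[i + 1]
        np.add.at(y, rowind[k1:k2], x[i] * data[k1:k2])   # gather-update-scatter
    return y

A = np.array([[1., 0., 0., 2.],
              [3., 4., 0., 5.],
              [6., 0., 7., 8.],
              [0., 0., 10., 11.]])
col_ptr = np.array([0, 3, 4, 6, 10])
rowind  = np.array([0, 1, 2, 1, 2, 3, 0, 1, 2, 3])
data    = np.array([1., 3., 6., 4., 7., 10., 2., 5., 8., 11.])
x = np.arange(1., 5.)
assert np.allclose(csc_matvec(4, col_ptr, rowind, data, x), A @ x)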
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Gather-Scatter Operation
[Figure: Illustration of the column-oriented matrix-by-vector multiplication: entries of y are gathered into a local vector y(*), updated as y(*) = y(*) + x(j)·A(*, j), and scattered back into y(1:n).]

Definition (Scatter Operation)
The indirect addressing of y is done in three steps: Gather the relevant entries of y into a local vector; add the corresponding contributions from a column of A; send the updated entries of y back to the owning processor. This last step is called a scatter operation.

Going back to the outer loop, subsums can be computed (independently) into separate temporary vectors and then added at the completion of all these partial sums to obtain the final result.

From Chapter 11 in Y Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Unknown ordering and sparse matvec for PDEs
• For problems with local coupling of unknowns (e.g., discretized PDEs), nearby unknowns should be on the same processor
• For complicated domains, or for problems not arising from PDEs, graph partitioning methods can do this
• Problem: What do we do at the edges of the partition of unknowns between two processors?
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Recall the five-point stencil approximation of ∆
[Five-point stencil figure: φ(x,y) with neighbours φ(x−a,y), φ(x+a,y), φ(x,y−a), φ(x,y+a)]

         [  4 −1  0 −1  0  0  0  0  0 ]
         [ −1  4 −1  0 −1  0  0  0  0 ]
         [  0 −1  4  0  0 −1  0  0  0 ]
         [ −1  0  0  4 −1  0 −1  0  0 ]
L = 1/h² [  0 −1  0 −1  4 −1  0 −1  0 ]
         [  0  0 −1  0 −1  4  0  0 −1 ]
         [  0  0  0 −1  0  0  4 −1  0 ]
         [  0  0  0  0 −1  0 −1  4 −1 ]
         [  0  0  0  0  0 −1  0 −1  4 ]

(the 5-point matrix for a 3×3 interior grid, lexicographic ordering of the unknowns)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Recall Matrix-free Jacobi iteration
Jacobi for Poisson’s equation
φ^(k+1)(xi, yi) = (1/4) [ φ^(k)(xi+1, yi) + φ^(k)(xi−1, yi) + φ^(k)(xi, yi+1) + φ^(k)(xi, yi−1) + bi ]

What would a parallel method look like?
• Break domain into rectangular blocks, assign each to a parallel process.
• Data dependency: locally, it looks like a Dirichlet problem.
• Update boundary data between each iteration.
• What do we do about neighbors on other processors?

[Figure: domain partitioned into rectangular blocks, one per process (P0, …)]
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

“Localized” Parallel Sparse Matvec Pseudocode
1 Algorithm: Parallel Sparse Matvec
2 Scatter values of boundary unknowns to neighboring processors
3 Gather values of unknowns from neighboring processors
4 Perform local part of matvec
5 Perform part of matvec coming from external nodes
• This strategy is agnostic to how matrices and vectors are stored and multiplied on the processor
• A preprocessing step is required to keep track of which processors are “neighbors” with respect to storing neighboring unknowns
Definition (Halo)
For a given processor with local unknowns, the unknowns on neighboring processors needed to perform the part of the sparse matvec local to that processor are called the halo
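A small sketch of the halo idea for the 5-point operator on a strip-partitioned grid, with the communication simulated by array copies; in a real code each halo copy would be a message exchange between neighbouring processes (function and variable names here are illustrative only):

import numpy as np

def stencil_matvec_partitioned(u, nblocks):
    # Apply v = 4u - u_N - u_S - u_E - u_W (zero values outside the grid; the
    # 1/h^2 scaling is omitted), with the grid split into horizontal strips.
    # Each strip first receives a one-row halo from its neighbours, then works locally.
    ny, nx = u.shape
    edges = np.linspace(0, ny, nblocks + 1).astype(int)   # strip boundaries
    v = np.zeros_like(u)
    for b in range(nblocks):
        lo, hi = edges[b], edges[b + 1]
        # "Communication": gather halo rows from the neighbouring strips.
        halo_lo = u[lo - 1] if lo > 0 else np.zeros(nx)
        halo_hi = u[hi] if hi < ny else np.zeros(nx)
        local = np.vstack([halo_lo, u[lo:hi], halo_hi])    # local strip + halo
        # Purely local part of the matvec.
        for j in range(1, hi - lo + 1):
            for k in range(nx):
                west = local[j, k - 1] if k > 0 else 0.0
                east = local[j, k + 1] if k < nx - 1 else 0.0
                v[lo + j - 1, k] = 4 * local[j, k] - local[j - 1, k] - local[j + 1, k] - west - east
    return v

u = np.random.rand(9, 9)
assert np.allclose(stencil_matvec_partitioned(u, 3), stencil_matvec_partitioned(u, 1))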
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

How to deal with boundary conditions
A PDE is generally stated with boundary conditions, yielding a unique solution. We consider two types.
Dirichlet
    ∆u(x, y) = f(x, y)    (x, y) ∈ Ω
    u(x, y) = g(x, y)     (x, y) ∈ ∂Ω

Neumann
    ∆u(x) = f(x)          x ∈ Ω
    ∂u/∂n (x) = g(x)      x ∈ ∂Ω
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning


Using Dirichlet boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1  1−1 4−1 −1  L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)

Boundary values moved to right-hand side
−1 −14−1 −1 −14
∆u(xi, yi) ≈ 1 (4u(xi, yi) − u(xi+1, yi) − u(xi−1, yi) h2
−u(xi, yi+1) − u(xi, yi−1)) = bi
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Dirichlet boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1  1−1 4−1 −1  L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)

Boundary values moved to right-hand side
−1 −14−1 −1 −14
∆u(xi, yi) ≈ 1 (4u(xi, yi) − g(xi+1, yi) − u(xi−1, yi) h2
−u(xi, yi+1) − u(xi, yi−1)) = bi
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Dirichlet boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1  1−1 4−1 −1  L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)

Boundary values moved to right-hand side
∆u(xi, yi) ≈ 1 (4u(xi, yi) − u(xi−1, yi) h2
−1 −14−1 −1 −14
−u(xi,yi+1)−u(xi,yi−1))=bi+ 1g(xi+1,yi) h2
Conclusion: Dirichlet boundary condition leads to modifications of right-hand side of the discretized linear system
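As a minimal illustration of this conclusion, a sketch (assuming the boundary data has already been sampled into four arrays g['north'], g['south'], g['east'], g['west'], one per side of the square) of how the known Dirichlet values enter the right-hand side:

import numpy as np

def poisson_rhs_with_dirichlet(f, g, h):
    # Right-hand side for the 5-point discretization on an m x m interior grid.
    # Start from b_i = f(x_i, y_i); every interior node whose stencil reaches the
    # boundary receives the known value there, scaled by 1/h^2, added to b_i.
    b = np.array(f, dtype=float)
    b[0, :]  += g["south"] / h**2   # neighbours just below the first interior row
    b[-1, :] += g["north"] / h**2   # neighbours just above the last interior row
    b[:, 0]  += g["west"]  / h**2
    b[:, -1] += g["east"]  / h**2
    return b.ravel()                # lexicographic ordering, matching L above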
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Neumann boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1 
1−1 4−1 −1 
L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
 −1 −14−1 −1 −14
Approximate normal derivative (example for left vertical boundary)
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)
Suppose we have Neumann boundary condition: ∂u = g(x, y).
∂n On the left vertical boundary, this reduces to
∂u 􏰆ux􏰇 􏰆1􏰇
= · =⇒ ux(x,y)=g(x,y)
∂n uy 0
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Neumann boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1 
1−1 4−1 −1 
L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
 −1 −14−1 −1 −14
Approximate normal derivative (example for left vertical boundary)
We can approximate with second-order central difference:
u(x+h,y)−u(x−h,y) ≈g(x,y) 2h
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Neumann boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1 
1−1 4−1 −1 
L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
 −1 −14−1 −1 −14
Approximate normal derivative (example for left vertical boundary)
We can approximate on the grid:
u(xi+1, yi) − u(xbdry, yi) ≈ g(x, y) 2h
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Neumann boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1 
1−1 4−1 −1 
L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
 −1 −14−1 −1 −14
Approximate normal derivative (example for left vertical boundary)
We can approximate on the grid:
u(xbdry, yi) ≈ u(xi+1, yi) − 2hg(x, y)
Conclusion: Neumann boundary condition leads to modifications of matrix entries
∆u(xi, yi) ≈ 1 (4u(xi, yi) − u(xi+1, yi) − u(xbdry,y ) h2 i
−u(xi, yi+1) − u(xi, yi−1)) = bi
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Neumann boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1 
1−1 4−1 −1 
L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
 −1 −14−1 −1 −14
Approximate normal derivative (example for left vertical boundary)
We can approximate on the grid:
u(xbdry, yi) ≈ u(xi+1, yi) − 2hg(x, y)
Conclusion: Neumann boundary condition leads to modifications of matrix entries and right-hand side
∆u(xi, yi) ≈ 1 (4u(xi, yi) − u(xi+1, yi)−u(xi+1, yi) + 2hg(x, y) h2
−u(xi, yi+1) − u(xi, yi−1)) = bi
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Using Neumann boundary with the stencil
4−1 −1  −1 4 −1 −1
−14 −1 
1−1 4−1 −1 
L = h2  −1 −1 4 −1 −1 
 −1 −1 −1 4 4 −1 −1 
 −1 −14−1 −1 −14
Approximate normal derivative (example for left vertical boundary)
We can approximate on the grid:
u(xbdry, yi) ≈ u(xi+1, yi) − 2hg(x, y)
Conclusion: Neumann boundary condition leads to modifications of matrix entries and right-hand side
∆u(xi, yi) ≈ 1 (4u(xi, yi)−2u(xi+1, yi) h2
−u(xi, yi+1) − u(xi, yi−1)) = bi−2hg(x, y)
φ(x,y+a)
φ(x−a,y)
φ(x,y)
φ(x+a,y)
φ(x,y−a)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Preconditioning
We have seen two types of solvers, direct and iterative
• Direct: operate on matrix entries, good with dense matrices, destroy sparsity
• Iterative: uses only matvec, reduces large, sparse problems to small, dense, direct problems, convergence may be slow
• Idea: Preconditioning! Use some tools from direct methods toolbox to speed up convergence
Preconditioning – approximate A by something easier to (directly) invert
A preconditioner M is most often chosen such that M ≈ A, where we then solve, e.g., M−1Ax = M−1b, such that
• M−1 easy to apply (often by some direct method)
• The spectrum of M−1A has better clustering properties
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Preconditioning
Preconditioning – approximate A−1 by something easier to (directly) apply
A preconditioner M can also be chosen such that M ≈ A−1, where we then solve, e.g., MAx = Mb, such that
• M is an approximate inverse of A which can be efficiently
applied
• The spectrum of MA has better clustering properties
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Tridiagonal preconditioning of the Laplacian
Example
Let L be the discretized Laplacian, dimension 2500. Let M be the tridiagonal portion of L.
• The black points are the eigenvalues of L
• The red points are the eigenvalues of M−1L (a sketch for reproducing such a comparison follows below)
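Such a comparison can be reproduced with a few lines of Python (here with a smaller grid than the dimension-2500 example on the slide, and dense eigensolvers for simplicity):

import numpy as np

def laplacian_2d(m):
    # 5-point Laplacian (scaled by h^2) on an m x m interior grid.
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    I = np.eye(m)
    return np.kron(I, T) + np.kron(T, I)    # diagonal entries equal 4

m = 20
L = laplacian_2d(m)                          # dimension m^2 = 400
M = np.tril(np.triu(L, -1), 1)               # tridiagonal part of L
eig_L = np.linalg.eigvalsh(L)                # eigenvalues of L ("black points")
eig_ML = np.linalg.eigvals(np.linalg.solve(M, L))   # eigenvalues of M^{-1} L ("red points")
print(eig_L.min(), eig_L.max())
print(eig_ML.real.min(), eig_ML.real.max())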
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Tridiagonal preconditioning of the Laplacian – CG convergence
Applying CG to the preconditioned and unpreconditioned systems, we see clearly that the preconditioned system has better convergence properties
• CG Convergence Comparison Animation
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Types of Preconditioners: Stationary
Recall matrix splittings: A = M − N which induce the preconditioned system M−1Ax = M−1b
• One can always symmetrize a stationary method: first iteration with M, second iteration with MT
• One such example would be a sweep of forward Gauss-Seidel (M = D − F) followed by a sweep of backward Gauss-Seidel (MT = D − E)
Exercise
Applying consecutive iterations with M and MT is a new symmetrized stationary iteration. Thus there is an associated splitting A = Msym − Nsym, where Msym is the new, symmetric preconditioner. What is Msym?
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Types of Preconditioners: Incomplete LU/Cholesky
Apply the Gaussian elimination algorithm but retain only entries matching a pattern:
→ Limited Bandwidth
→ Same sparsity pattern as A
→ Disallow entries below an absolute value threshold
This leads to an approximate factorization A = LU + R. Set M = LU and forget R
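A minimal dense-storage sketch of the "same sparsity pattern as A" variant, ILU(0): the standard IKJ Gaussian elimination loop, but only updating positions that are nonzero in A. This is only an illustration of the idea, not an efficient sparse implementation.

import numpy as np

def ilu0(A):
    # Incomplete LU with zero fill-in: elimination restricted to the nonzero
    # pattern of A.  Returns L and U stored together (L has unit diagonal).
    A = np.array(A, dtype=float)
    pattern = A != 0
    n = A.shape[0]
    for i in range(1, n):
        for k in range(i):
            if pattern[i, k]:
                A[i, k] /= A[k, k]                  # multiplier l_ik
                for j in range(k + 1, n):
                    if pattern[i, j]:
                        A[i, j] -= A[i, k] * A[k, j]
    return A

# Example: apply M^{-1} = (LU)^{-1} to a residual by two triangular solves.
A = np.array([[4., -1., 0., -1.],
              [-1., 4., -1., 0.],
              [0., -1., 4., -1.],
              [-1., 0., -1., 4.]])
LU = ilu0(A)
L = np.tril(LU, -1) + np.eye(4)
U = np.triu(LU)
r = np.ones(4)
z = np.linalg.solve(U, np.linalg.solve(L, r))   # z = M^{-1} r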
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

ILU with no fill-in for the Laplacian
[L,U] = ilu(A)
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

ILU with no fill-in for the Laplacian – eigenvalues
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Preconditioned Conjugate Gradients
For a preconditioner M, the new system matrix M−1A is likely not SPD.
• Question: How do we derive a preconditioned CG method?
• Answer: Consider a different inner-product!
Observation
• Let M be SPD. Generally M−1A is not SPD.
• Then M induces an inner product.
• Observe that M−1A is symmetric with respect to the M-inner product:
⟨y, M−1Ax⟩_M = ⟨y, Ax⟩ = ⟨Ay, x⟩ = ⟨M(M−1A)y, x⟩ = ⟨x, M−1Ay⟩_M
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

The Conjugate Gradient Method
PCG is simply an implementation of CG for M−1Ax = M−1b in the M-inner product.
1 Algorithm: The PCG method
2 Compute r0 = b − Ax0, z0 = M−1r0, p0 := z0
3 for j = 0, 1, 2, … until convergence do
4     αj := ⟨rj, zj⟩/⟨Apj, pj⟩
5     xj+1 := xj + αjpj
6     rj+1 := rj − αjApj
7     zj+1 := M−1rj+1;  βj := ⟨rj+1, zj+1⟩/⟨rj, zj⟩
8     pj+1 := zj+1 + βjpj
9 end
Note: M−1A is also symmetric with respect to the A-inner product, and an efficient algorithm can be derived for this.
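A direct Python transcription of the algorithm above; apply_M_inv stands in for whatever preconditioner solve is available (e.g. the ILU triangular solves from earlier), and the Jacobi example at the bottom is purely illustrative.

import numpy as np

def pcg(A, b, apply_M_inv, x0=None, tol=1e-8, maxit=500):
    # Preconditioned CG for SPD A, following the PCG algorithm above.
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    z = apply_M_inv(r)
    p = z.copy()
    rz = r @ z
    for j in range(maxit):
        Ap = A @ p
        alpha = rz / (Ap @ p)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_M_inv(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Example: Jacobi (diagonal) preconditioning of a small SPD system.
A = np.diag(np.arange(1., 101.)) + 0.01 * np.ones((100, 100))
b = np.ones(100)
x = pcg(A, b, apply_M_inv=lambda r: r / np.diag(A))
print(np.linalg.norm(A @ x - b))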
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Overview of explicit preconditioners
• ILU and stationary methods use approximations of A itself
• Generally, this results in some sort of triangular matrix solves needing to be performed at every iteration
=⇒ Strongly sequential!!!
• Jacobi can be easily parallelized, but it is slow to converge
• Question: How do we design preconditioners yielding fast convergence but easy to parallelize??
• Idea: Find approximations M ≈ A−1
    → Polynomial preconditioning (see the sketch below)
    → Sparse approximate inverse
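As a sketch of the polynomial idea, one can take M ≈ A−1 to be a truncated Neumann series of the Jacobi-scaled matrix, M = (Σ_{k=0}^{m−1} (I − D−1A)^k) D−1; applying M needs only matvecs and diagonal scalings, so it parallelizes like the matvec itself. The particular polynomial and the names below are illustrative, not the lecture's prescription.

import numpy as np

def make_neumann_poly_prec(A, m=4):
    # Returns a function applying M r = (sum_{k=0}^{m-1} (I - D^{-1} A)^k) D^{-1} r,
    # a truncated Neumann series for A^{-1}.
    d = np.diag(A)
    def apply_M(r):
        g = r / d                  # g_0 = D^{-1} r
        z = g.copy()
        for _ in range(m - 1):
            g = g - (A @ g) / d    # g_{k+1} = (I - D^{-1} A) g_k
            z += g
        return z
    return apply_M

# Quick check that M acts like an approximate inverse on a small SPD example.
A = np.diag(np.arange(1., 101.)) + 0.01 * np.ones((100, 100))
apply_M = make_neumann_poly_prec(A, m=4)
b = np.ones(100)
print(np.linalg.norm(A @ apply_M(b) - b) / np.linalg.norm(b))   # shrinks as m grows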
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

What did we talk about today?
• Sparse matrix-vector product in parallel
• Preconditioning
• Preconditioned CG
• Parallel CG with preconditioning
Further reading
• Section 9.2 and Chapter 11 in Y Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.
• Available legally online at: http://www-users.cs.umn.edu/~saad/books.html
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning

Next time: an actual return to parallelism!!!!
• Parallel preconditioning
• Red-black preconditioning
• More on parallelized iterative solvers
• Synchronous versus asynchronous methods
Soodhalter Parallel Numerics 5636 Parallel Sparse Matvec and Preconditioning