CMPSC 450
Concurrent Scientific Programming
Locality and Parallelism in Simulations
Spring 2016 Kamesh Madduri
Sources of Parallelism and Locality in Simulations
• Parallelism and locality are both critical to performance
– Data movement is expensive
• Real-world problems have parallelism and locality
– Objects often depend more on nearby than distant objects
– Dependence on distant objects can often be simplified
• Scientific models may introduce more parallelism
• Slides based on CS 267: Applications of Parallel Computing by Prof. Jim Demmel
– www.cs.berkeley.edu/~demmel/cs267_Spr13
Basic kinds of simulations
1. Discrete event systems
– "Game of Life", Manufacturing systems, PacMan, …
2. Particle systems
– Billiard balls, Galaxies, Atoms, Pinball, …
3. Ordinary differential equations (ODEs): lumped variables depending on continuous parameters
– Structural mechanics, Chemical kinetics, …
4. Partial differential equations (PDEs): continuous variables depending on continuous parameters
– Heat, Elasticity, Electrostatics, …
Basic kinds of simulations
• A given phenomenon can be modeled at multiple levels
• Many simulations combine more than one of these levels
– e.g., circuit simulations, games
Example: Circuit simulation
Outline
• Discrete event systems
– Time and space are discrete
• Particle systems
– Important special case of lumped systems
• Lumped systems (ODEs)
– Locations/entities are discrete, time is continuous
• Continuous systems (PDEs)
– Time and space are continuous
• Identify common problems and solutions
Discrete event systems
• Systems are represented as
– finite set of variables
– the set of all variable values at a given time is called the state
– each variable is updated by computing a transition function depending on the other variables
• Two types of systems
– synchronous: at each discrete time step, evaluate all
transition functions; also called a state machine
– asynchronous: transition functions are evaluated only if the inputs change, based on an “event” from another part of the system; also called event-driven simulation.
Conway’s Game of Life
• Cellular automaton
• Defined transitions at each time step
• Example of a synchronous discrete event simulation
Parallelism in Game of Life
• Easy to parallelize by dividing physical domain (domain decomposition)
• Simulation is synchronous, each cell is updated at every step
• Use two copies of grid (old and new) and switch between them
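Below is a minimal serial sketch in C (not taken from the slides) of one synchronous step using two copies of the grid. The grid size and halo convention are illustrative assumptions; a domain-decomposed version would fill the halo rows/columns by exchanging boundary state with neighboring processors at each step.

#define N 64   /* interior grid size; rows/columns 0 and N+1 form a halo of boundary copies */

static int live_neighbors(int cur[N + 2][N + 2], int i, int j) {
    int count = 0;
    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
            if (di != 0 || dj != 0)
                count += cur[i + di][j + dj];
    return count;
}

/* One synchronous step: every cell reads the old grid (cur) and writes the new grid (nxt). */
void life_step(int cur[N + 2][N + 2], int nxt[N + 2][N + 2]) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            int live = live_neighbors(cur, i, j);
            nxt[i][j] = cur[i][j] ? (live == 2 || live == 3) : (live == 3);
        }
}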
Locality in Game of Life
• Locality is achieved by using large patches
• Exchange of boundary state information at each iteration
• How to pick shapes of domains?
– Minimize communication
• Minimizing communication => Minimizing “surface to volume ratio” of partitions
Locality in regular meshes
Synchronous circuit simulations
• Circuit is a graph made up of subcircuits connected by wires
– Component simulations need to interact if they share a wire.
– Data structure is (irregular) graph of subcircuits.
– Parallel algorithm is timing-driven or synchronous
• Evaluate all components at every timestep (determined by known circuit delay)
• Graph partitioning assigns subgraphs (work, tasks) to processors
– Determines parallelism and locality.
– Goal: evenly distribute subgraphs to nodes (load balance).
– Goal: minimize edge crossings (minimize communication).
– Partitioning easy for meshes, but NP-hard in general. So we will approximate.
Asynchronous simulations
• Synchronous simulations may waste time:
– Simulates even when the inputs do not change.
• Asynchronous (event-driven) simulations update only when
an event arrives from another component:
– No global time steps, but individual events contain time stamp.
– e.g., Circuit simulation with delays (events are gates changing).
– e.g., Traffic simulation (events are cars changing lanes, etc.).
• Asynchronous is more efficient, but harder to parallelize
– In MPI, events are naturally implemented as messages, but how do you know when to execute a “receive”?
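As a concrete serial illustration (an assumption, not code from the slides), the sketch below shows the core of an event-driven loop in C: events carry a time stamp and are always processed in time-stamp order, with no global time step. A plain array with a linear scan stands in for the priority queue a real simulator would use.

#include <stdio.h>

/* An event: a time stamp plus the component it updates (hypothetical fields). */
typedef struct {
    double time;
    int component;
} Event;

#define MAX_EVENTS 1024
static Event queue[MAX_EVENTS];
static int num_events = 0;

static void schedule(double time, int component) {
    if (num_events < MAX_EVENTS) {
        queue[num_events].time = time;
        queue[num_events].component = component;
        num_events++;
    }
}

/* Remove and return the event with the smallest time stamp (linear scan for brevity). */
static int pop_earliest(Event *out) {
    if (num_events == 0) return 0;
    int best = 0;
    for (int k = 1; k < num_events; k++)
        if (queue[k].time < queue[best].time) best = k;
    *out = queue[best];
    queue[best] = queue[--num_events];
    return 1;
}

int main(void) {
    schedule(1.5, 0);
    schedule(0.3, 1);
    schedule(0.9, 2);
    Event e;
    while (pop_earliest(&e)) {
        /* a real simulator would evaluate the component's transition
         * function here and possibly schedule new, later events */
        printf("t = %.2f: update component %d\n", e.time, e.component);
    }
    return 0;
}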
Scheduling Asynchronous simulations
• Conservative
– Only simulate up to (and including) the minimum time stamp of
inputs.
– Need deadlock detection.
• Speculative (or Optimistic)
– Assume no new inputs will arrive and keep simulating.
– May need to back up if the assumption was wrong, using timestamps
– Example: Timewarp
• Optimizing both load balance and locality may be difficult
– e.g., circuit simulation: Locality means putting tightly-coupled subcircuits on one processor. Since “active” part of circuit likely to be in a tightly coupled subcircuit, this may be bad for load balance.
Summary of discrete event simulations
• Model of world is discrete
– Both space and time
• Approaches
– Decompose domain, i.e., set of objects
– Run each component ahead using
• Synchronous: communicate at end of each time step
• Asynchronous: communicate on-demand
– Conservative
– Speculative
Particle Systems
• A particle system has
– a finite number of particles
– moving in space according to Newton’s Laws (i.e. F = ma)
– time is continuous
• Examples
– stars in space with laws of gravity
– electron beam in semiconductor manufacturing
– atoms in a molecule with electrostatic forces
– neutrons in a fission reactor
– cars on a freeway with Newton's laws + model of driver and engine
– balls in a pinball game
• Reminder: many simulations combine techniques such as particle simulations with some discrete events
Forces in Particle Systems
• Force on each particle can be subdivided:
Force = external + nearby + far-field
e.g., External force
• externally imposed electric field in electron beam
e.g., Nearby force
• balls on a billiard table bounce off of each other
• Van der Waals forces in fluid (1/r^6)
e.g., Far-field force
• gravity, electrostatics, radiosity in graphics
• forces governed by elliptic PDE
Parallelism in external forces
• Usually the simplest
• Force on each particle is independent, "embarrassingly parallel" or "pleasantly parallel"
1. Evenly distribute particles to processors
– Any distribution works
– Locality not an issue, no communication
2. Apply external force (see the sketch below)
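A brief sketch of how external forces parallelize (the particle layout and the uniform force are assumptions, not from the slides): each particle is updated independently, shown here with an OpenMP parallel loop.

#include <omp.h>

typedef struct { double x, y, vx, vy; } Particle;   /* hypothetical layout */

/* Apply a uniform external force (fx, fy) to every particle: no particle
 * depends on any other, so any even distribution of the loop works and
 * no communication is needed. */
void apply_external_force(Particle *p, int n, double fx, double fy,
                          double mass, double dt) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        p[i].vx += dt * fx / mass;    /* a = F/m, integrate velocity */
        p[i].vy += dt * fy / mass;
    }
}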
Parallelism in Nearby Forces
• Nearby forces require interaction, and therefore communication
• Force may depend on other nearby particles
– Example: collisions
– Simplest algorithm is O(n²), look at all pairs and see if they collide
• Usual parallel model is domain decomposition of physical region in which particles are located
– O(n/P) particles per processor if evenly distributed
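For reference, here is a serial sketch of the simplest O(n²) approach named above (the array layout and the toy force law are assumptions): examine all pairs and accumulate a force only when two particles lie within a cutoff radius. Domain decomposition with ghost zones (next slide) avoids scanning particles that cannot possibly be "nearby".

/* Particle coordinates and accumulated forces stored in plain arrays. */
void nearby_forces_all_pairs(int n, const double *x, const double *y,
                             double *fx, double *fy, double cutoff) {
    double cut2 = cutoff * cutoff;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double dx = x[j] - x[i], dy = y[j] - y[i];
            double r2 = dx * dx + dy * dy;
            if (r2 < cut2 && r2 > 0.0) {
                double s = 1.0 / r2;          /* toy short-range repulsion */
                fx[i] -= s * dx;  fy[i] -= s * dy;
                fx[j] += s * dx;  fy[j] += s * dy;
            }
        }
}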
Parallelism in Nearby Forces
• Challenge 1: interactions of particles near processor boundary:
– need to communicate particles near boundary to neighboring processors.
• Region near boundary called “ghost zone”
– Low surface to volume ratio means low communication.
• Use squares, not slabs, to minimize ghost zone sizes
Communicate particles in boundary region to neighbors
Need to check for collisions between regions
Parallelism in Nearby Forces
• Challenge 2: load imbalance, if particles cluster:
– galaxies, electrons hitting a device wall.
• To reduce load imbalance, divide space unevenly.
– Each region contains roughly equal number of particles.
– Quad-tree in 2D, oct-tree in 3D.
Example: each square contains at most 3 particles
Parallelism in Far-field forces
• Far-field forces involve all-to-all interaction and therefore communication.
• Force depends on all other particles:
– Examples: gravity, protein folding
– Simplest algorithm is O(n²)
– Just decomposing space does not help since every particle needs to “visit” every other particle.
• Use clever algorithms to reduce communication
• Use clever algorithms and approximations to beat O(n²)
Far-field force: Particle-mesh methods
• Based on approximation
– Superimpose a regular mesh
– “Move” particles to nearest grid points
• Exploit fact that the far-field force satisfies a PDE that is easy to solve on a regular mesh
– FFT, Multigrid
– Cost drops to O(n log n) or O(n) instead of O(n²)
• Accuracy depends on the fineness of the grid and the uniformity of the particle distribution
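A sketch of the first step only (the mesh size and nearest-grid-point assignment are illustrative assumptions): deposit each particle's mass onto a regular mesh. A real particle-mesh code would then solve the far-field PDE on the mesh (e.g., by FFT or multigrid) and interpolate the resulting forces back to the particles.

#define M 128   /* mesh cells per dimension; particles live in the unit square [0,1) x [0,1) */

void deposit_to_mesh(const double *px, const double *py, const double *mass,
                     int n, double grid[M][M]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            grid[i][j] = 0.0;                /* clear the mesh */
    for (int k = 0; k < n; k++) {
        int i = (int)(px[k] * M);            /* cell containing the particle */
        int j = (int)(py[k] * M);
        if (i >= M) i = M - 1;               /* clamp particles on the upper boundary */
        if (j >= M) j = M - 1;
        grid[i][j] += mass[k];               /* nearest-grid-point style deposit */
    }
}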
Tree decomposition
• Based on approximation.
– Forces from a group of far-away particles are "simplified"; the group resembles a single large particle.
– Use tree; each node contains an approximation of descendants.
• Also O(n log n) or O(n) instead of O(n²).
• Several algorithms
– Barnes-Hut.
– Fast multipole method (FMM) of Greengard/Rokhlin.
– Anderson's method.
Summary of particle methods
• Model contains discrete entities, namely, particles
• Time is continuous – must be discretized to solve
• Simulation follows particles through time steps
– Force = external_force + nearby_force + far_field_force
– All-pairs algorithm is simple, but inefficient, O(n²)
– Particle-mesh methods approximate by moving particles to a regular mesh, where it is easier to compute forces
– Tree-based algorithms approximate by treating set of particles as a group, when far away
• May think of this as a special case of a “lumped” system
Systems of Lumped Variables (ODEs)
• Many systems are approximated by
– System of “lumped” variables
– Each depends on continuous parameter (usually time)
e.g., circuits
– approximated as a graph
• wires are edges
• nodes are connections between 2 or more wires
• each edge has resistor, capacitor, inductor, or voltage source
– system is “lumped” because we are not computing the voltage/current at every point in space along a wire, just endpoints.
– Variables related by Ohm's Law, Kirchhoff's Laws, etc.
• Forms a system of ordinary differential equations (ODEs)
– Differentiated with respect to time
Solving ODEs
• In examples from circuit simulations, structural physics, and most others, the matrices are sparse:
– i.e., most array elements are 0.
– neither store nor compute on these 0's.
– Sparse because each component only depends on a few others.
• Given a set of ODEs, two kinds of questions are:
– Compute the values of the variables at some time t
• Explicit methods
• Implicit methods
– Compute modes of vibration
• Eigenvalue problems
Solving ODEs: Explicit methods
• Assume ODE is x’(t) = f(x) = A*x(t), where A is a sparse matrix – Compute x(i*dt) = x[i]
at i = 0, 1, 2, …
– ODE gives x’(i*dt) = slope (use slope at x[i])
x[i+1]=x[i] + dt*slope
• Explicit methods, e.g., (Forward) Euler’s method.
– Approximate x’(t)=A*x(t) by (x[i+1] – x[i])/dt = A*x[i].
– x[i+1] = x[i] + dt*A*x[i], i.e., sparse matrix-vector multiplication.
• Tradeoffs:
– Simple algorithm: sparse matrix vector multiply.
– Stability problems: May need to take very small time steps, especially if system is stiff (i.e., A has some large entries, so x can change rapidly).
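A small dense sketch of the method just described (the 3x3 size is an assumption for brevity); for a sparse A, the inner matrix-vector product would be the CSR SpMV kernel shown later in these slides.

#define NV 3   /* number of variables (small and dense only for illustration) */

/* Forward Euler for x'(t) = A*x(t): x[i+1] = x[i] + dt*A*x[i]. */
void forward_euler(const double A[NV][NV], double x[NV], double dt, int steps) {
    for (int s = 0; s < steps; s++) {
        double slope[NV];
        for (int i = 0; i < NV; i++) {       /* slope = A*x */
            slope[i] = 0.0;
            for (int j = 0; j < NV; j++)
                slope[i] += A[i][j] * x[j];
        }
        for (int i = 0; i < NV; i++)         /* advance one time step */
            x[i] += dt * slope[i];
    }
}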
Solving ODEs: Implicit methods
• Assume ODE is x’(t) = f(x) = A*x(t), where A is a sparse matrix – Compute x(i*dt) = x[i]
at i = 0, 1, 2, …
– ODE gives x’((i+1)*dt) = slope (use slope at x[i+1])
x[i+1]=x[i] + dt*slope
• Implicit methods, e.g., Backward Euler method.
– Approximate x’(t)=A*x(t) by (x[i+1] – x[i])/dt = A*x[i+1].
– (I – dt*A)*x[i+1] = x[i], i.e., we need to solve a sparse linear system.
• Tradeoffs:
– Larger timestep possible: especially for stiff problems.
– More difficult algorithm: need to solve a sparse linear system of equations at each time step.
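A sketch of one backward Euler step under the same small dense assumption: form M = I - dt*A and solve M*x[i+1] = x[i]. A few Jacobi sweeps stand in for the sparse direct or iterative solver a real code would use (see the following slides); for small dt, M is close to the identity and diagonally dominant, so Jacobi converges quickly.

#define NV 3   /* number of variables (small and dense only for illustration) */

void backward_euler_step(const double A[NV][NV], const double x_old[NV],
                         double x_new[NV], double dt, int sweeps) {
    double M[NV][NV];
    for (int i = 0; i < NV; i++)             /* M = I - dt*A */
        for (int j = 0; j < NV; j++)
            M[i][j] = (i == j ? 1.0 : 0.0) - dt * A[i][j];

    for (int i = 0; i < NV; i++)             /* initial guess: x_new = x_old */
        x_new[i] = x_old[i];

    for (int it = 0; it < sweeps; it++) {    /* Jacobi sweeps for M*x_new = x_old */
        double next[NV];
        for (int i = 0; i < NV; i++) {
            double sum = x_old[i];
            for (int j = 0; j < NV; j++)
                if (j != i) sum -= M[i][j] * x_new[j];
            next[i] = sum / M[i][i];
        }
        for (int i = 0; i < NV; i++)
            x_new[i] = next[i];
    }
}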
Solving ODEs: Eigensolvers
• Computing modes of vibration in structural physics: finding eigenvalues and eigenvectors.
– Seek solution of d²x(t)/dt² = A*x(t) of the form x(t) = sin(ω*t) * x0, where x0 is a constant vector
• ω is called the frequency of vibration
• x0 is sometimes called a "mode shape"
– Plug in to get -ω²*x0 = A*x0, so that -ω² is an eigenvalue and x0 is an eigenvector of A.
– Solution schemes reduce either to sparse-matrix multiplication, or solving sparse linear systems.
Implicit methods and Eigenproblems
• Implicit methods for ODEs need to solve linear systems
• Direct methods (Gaussian elimination)
– Called LU Decomposition, because we factor A = L*U.
– More complicated than sparse-matrix vector multiplication.
• Iterative solvers
– Jacobi, Successive over-relaxation (SOR), Conjugate Gradient (CG), Multigrid, …
• Most have sparse-matrix-vector multiplication in kernel.
• Eigenproblems
– dense and sparse cases.
– Also depend on sparse-matrix-vector multiplication, direct methods.
ODEs and Sparse Matrices
• All these problems reduce to sparse matrix problems
– Explicit: sparse matrix-vector multiplication (SpMV).
– Implicit: solve a sparse linear system
• direct solvers (Gaussian elimination).
• iterative solvers (use sparse matrix-vector multiplication).
– Eigenvalue/vector algorithms may also be explicit or implicit.
• Conclusion: SpMV is key to many ODE problems
– Relatively simple algorithm to study in detail
– Two key problems: locality and load balance
SpMV in Compressed Sparse Row (CSR) format
SpMV: y = y + A*x; only store, and do arithmetic on, nonzero entries
CSR format is the simplest of many possible data structures for A
[Figure: CSR representation of matrix A, with input vector x and output vector y]
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)×x(j)
for each row i
    for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
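To make the kernel concrete, here is a self-contained example (the 4x4 matrix is made up for illustration) showing the three CSR arrays and one SpMV: ptr[i] .. ptr[i+1]-1 index the nonzeros of row i within val[] and ind[].

#include <stdio.h>

int main(void) {
    /* A = [ 2 0 0 1 ]
           [ 0 3 0 0 ]
           [ 0 0 4 0 ]
           [ 1 0 0 5 ]  stored in CSR form: */
    int    ptr[] = {0, 2, 3, 4, 6};        /* row pointers, length n+1 */
    int    ind[] = {0, 3, 1, 2, 0, 3};     /* column index of each nonzero */
    double val[] = {2, 1, 3, 4, 1, 5};     /* value of each nonzero */
    double x[]   = {1, 1, 1, 1};
    double y[]   = {0, 0, 0, 0};
    int n = 4;

    for (int i = 0; i < n; i++)            /* y = y + A*x */
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            y[i] += val[k] * x[ind[k]];

    for (int i = 0; i < n; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 3 3 4 6 */
    return 0;
}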
Parallel Sparse Matrix-vector multiplication
• y = A*x, where A is a sparse n x n matrix
[Figure: rows of A, x, and y partitioned across processors P1–P4]
• Questions
– which processors store y[i], x[i], and A[i,j]
– which processors compute
• y[i] = sum (from j = 1 to n) A[i,j] * x[j]
       = (row i of A) * x … a sparse dot product
• Partitioning
– Partition index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np.
– For all i in Nk, Processor k stores y[i], x[i], and row i of A
– For all i in Nk, Processor k computes y[i] = (row i of A) * x
• “Owner computes” rule: Processor k compute the y[i]s it owns.
Matrix Reordering via Graph Partitioning
• "Ideal" matrix structure for parallelism: block diagonal
– p (number of processors) blocks, can all be computed locally.
– If no non-zeros outside these blocks, no communication needed
• Can we reorder the rows/columns to get close to this?
– Most nonzeros in diagonal blocks, few outside
[Figure: block-diagonal matrix times a vector, both partitioned across processors P0–P4]
Goals of Reordering
• Performance goals
– balance load (how is load measured?)
• Approx equal number of nonzeros (not necessarily rows)
– balance storage (how much does each processor store?)
• Approx equal number of nonzeros
– minimize communication (how much is communicated?)
• Minimize nonzeros outside diagonal blocks
• Related optimization criterion is to move nonzeros near diagonal
– improve register and cache re-use
• Group nonzeros in small vertical blocks so source (x) elements loaded into
cache or registers may be reused (temporal locality)
• Group nonzeros in small horizontal blocks so nearby source (x) elements in the cache may be used (spatial locality)
• Other algorithms reorder for other reasons
– Reduce # nonzeros in matrix after Gaussian elimination
– Improve numerical stability
Graph Partitioning and Sparse Matrices
• Edges in the graph correspond to nonzeros in the matrix: here the matrix is symmetric (edges are unordered) and weights are equal (1)
• If divided over 3 procs, there are 14 nonzeros outside the diagonal blocks, which represent the 7 (bidirectional) edges
[Figure: 6×6 symmetric sparse matrix, with 1s marking nonzeros, and the corresponding graph on vertices 1–6 with 7 edges]
Summary: Common problems
• Load Balancing
– Dynamically – if load changes significantly during job
– Statically – Graph partitioning
• Discrete systems
• Sparse matrix vector multiplication
• Linear algebra
– Solving linear systems (sparse and dense)
– Eigenvalue problems will use similar techniques
• Fast Particle Methods
– O(n log n) instead of O(n²)
PDEs: Continuous variables, continuous parameters
• Elliptic problems (steady state, global space dependence)
– Electrostatic/gravitational potential: Potential(position)
• Hyperbolic problems (time dependent, local space dependence)
– Sound waves: Pressure(position, time)
• Parabolic problems (time dependent, global space dependence)
– Heat flow: Temperature(position, time)
– Diffusion: Concentration(position, time)
PDEs: Global vs Local dependence
• Global means either a lot of communication or very small time steps
• Local arises from finite wave speeds: limits communication
• Many problems combine features
– Fluid flow: Velocity, Pressure, Density(position, time)
– Elasticity: Stress, Strain(position, time)
Example: deriving the heat equation
[Figure: a 1D bar from 0 to 1, with interior points x-h, x, x+h]
Consider a simple problem
• A bar of uniform material, insulated except at ends
• Let u(x,t) be the temperature at position x at time t
• Heat travels from x-h to x+h at a rate proportional to:
∂u(x,t)/∂t = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h
• As h tends to 0, we get the heat equation:
∂u(x,t)/∂t = C * ∂²u(x,t)/∂x²
Explicit method for Heat equation
• Discretize space and time using explicit approach (forward Euler) to approximate time derivative
• Change variable x to j*h, t to i*δ, and u(x,t) to u(j,i)
• Let z = C*δ/h²
u[j,i+1]= z*u[j-1,i]+ (1-2*z)*u[j,i]+ z*u[j+1,i]
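A C sketch of this update (the array size and fixed-boundary handling are assumptions consistent with the boundary conditions on the next slide): one explicit step advances every interior point from time level i to i+1.

#define NX 100   /* number of grid intervals, so h = 1.0/NX and points j = 0 .. NX */

/* One explicit step: u_new[j] = z*u_old[j-1] + (1-2*z)*u_old[j] + z*u_old[j+1].
 * Stability requires z = C*δ/h² <= 0.5 (see the final slide on stability). */
void heat_step(const double u_old[NX + 1], double u_new[NX + 1], double z) {
    u_new[0]  = u_old[0];                    /* boundary values held fixed */
    u_new[NX] = u_old[NX];
    for (int j = 1; j < NX; j++)
        u_new[j] = z * u_old[j - 1] + (1.0 - 2.0 * z) * u_old[j] + z * u_old[j + 1];
}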
Explicit solution for the Heat equation
• Use "finite differences" with u[j,i] as the temperature at
– time t = i*δ (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
– initial conditions on u[j,0]
– boundary conditions on u[0,i] and u[N,i]
• At each timestep i = 0,1,2,…
For j = 0 to N
    u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
where z = C*δ/h²
• This corresponds to
– Matrix-vector multiply by T
– Combine nearest neighbors on grid
[Figure: space-time grid of values u[j,i], with position index j (u[0,0] … u[5,0]) on the horizontal axis and time levels i = 0 … 5 on the vertical axis]
Matrix view of explicit method for Heat
• u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i], same as:
  u[:, i+1] = T * u[:, i], where T is tridiagonal:
  T = tridiag(z, 1-2z, z) = I – z*L
  L = tridiag(-1, 2, -1) is called the Laplacian (in 1D)
• For a 2D mesh (5 point stencil), the Laplacian is pentadiagonal
Parallelism in explicit method for PDEs
• Sparse matrix vector multiply, via Graph Partitioning
• Partitioning the space (x) into p chunks
– good load balance (assuming large number of points relative to p)
– minimize communication (least dependence on data outside chunk)
• Generalizes to multiple dimensions, arbitrary graphs (= arbitrary sparse matrices)
• Explicit approach often used for hyperbolic equations
– Finite wave speed, so only depend on nearest chunks
• Problem with explicit approach for heat (parabolic):
– numerical instability: solution blows up eventually if z = C*δ/h² > 0.5
– need to make the time step very small when h is small: δ < 0.5*h²/C