XJCO3221 Parallel Computation
University of Leeds

Lecture 11: Reduction

Overview:
Parallel reduction: Overview and library support
Parallel reduction: Implementation
Summary and next lecture
Previous lectures
In the last lecture we looked at data reorganisation and collective communication:
Communication is usually the most significant overhead for distributed systems.
Collective communication involves multiple processes in a one-to-many, many-to-one or many-to-many pattern.
Collectives reduce the communication time t_comm compared to a loop of point-to-point communications.
In MPI: MPI_Bcast(), MPI_Scatter(), MPI_Gather().
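As a reminder of how these are called, a minimal sketch of a broadcast (the variable data and the choice of rank 0 as root are illustrative, not from the lecture):

#include <mpi.h>

int main( int argc, char **argv )
{
    int data = 0;

    MPI_Init( &argc, &argv );

    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if( rank==0 ) data = 42;        /* Initially only rank 0 has the value. */

    /* One-to-many: after this call, every process has rank 0's 'data'. */
    MPI_Bcast( &data, 1, MPI_INT, 0, MPI_COMM_WORLD );

    MPI_Finalize();
    return 0;
}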
Today’s lecture
Today we will look at a common combination of data reorganisation and computation: Reduction.
Reduces a data set to one of a smaller size.
Important for both shared and distributed memory systems.
Supported by many parallel APIs, including OpenMP and MPI.
Often optimised using a binary tree.
Binary trees also useful for collective communication.
Reminder: Serial reduction
Start with a large data set.
Apply binary operations to reduce to a smaller set.
Example 1: Summing the elements of an array
float sum = 0.0f;
for( i=0; i<n; i++ )
    sum += a[i];        /* Reduce the n values of a[] to a single total. */
Note each operation is performed sequentially.

[Figure: serial sum of an 8-element array 0..7, applying one + operation at a time.]

The total computation time t_comp is proportional to the array size n, i.e. the time complexity is O(n).

If these were processing units, most would be idle throughout most of the calculation.
Parallel reduction
Ideally we would want to perform all calculations simultaneously:

[Figure: all additions of the 8-element array 0..7 performed at once.]

This would have a time complexity of t_comp = O(1), but is not possible to achieve in practice.
For now, note that:
Any parallel reduction must change the sequence of calculations.
Some concrete examples will be given later in this lecture.
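As a taster for those examples, here is a minimal serial sketch of the pairwise ('binary tree') ordering, assuming an array a of floats whose length n is a power of 2:

/* Pairwise reduction: log2(n) levels rather than n-1 sequential additions. */
for( int stride=1; stride<n; stride*=2 )
    for( int i=0; i<n; i+=2*stride )
        a[i] += a[i+stride];

/* The total is now in a[0]. Within each level the additions are
   independent of one another, so they could be performed in parallel. */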
Recap: Commutativity and associativity
Let ⊗ denote a general binary operator: c = a ⊗ b.
As parallel reduction alters the sequence in which calculations are
performed, ⊗ must be associative:
An operator ⊗ is associative if a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c
If ⊗ is only approximately associative, the result of parallel reduction will be (slightly) different from serial reduction.
Some parallel reduction algorithms also require ⊗ to be commutative:
An operator ⊗ is commutative if a ⊗ b = b ⊗ a.
Commutativity and associativity (examples)
Associative and commutative: max, min, Boolean AND, OR, XOR, exact addition and multiplication.
Associative but not commutative: matrix multiplication.
Commutative but not associative: finite precision floating point addition and multiplication¹, signed saturated addition².
Neither commutative nor associative: subtraction, division.

¹ Only approximately associative. See Worksheet 2 Question 6.
² e.g. fn(a,b) = ( a+b<1 ? a+b : 1 ) with a=0.8, b=0.5 and c=-0.3.
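The floating point case is easy to demonstrate; a small sketch (the values are chosen to exaggerate the rounding, and are not from the lecture):

#include <stdio.h>

int main( void )
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    /* Exact arithmetic gives 1.0 either way, but in single precision
       (a+b)+c = 1.0 while a+(b+c) = 0.0, because c is lost when added
       to the much larger-magnitude b. */
    printf( "%g %g\n", (a+b)+c, a+(b+c) );
    return 0;
}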
An important application of reduction is as part of the MapReduce pattern¹:
Fusion of a map followed by a reduction.
Can avoid the need for synchronisation after the map.
¹ McCool et al., Structured Parallel Programming (Morgan Kaufmann, 2012).
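As an illustrative sketch (not from the lecture), a fused map and reduce in OpenMP, assuming an array a of floats of length n: the 'map' squares each element and the '+' reduction accumulates the results in the same loop, so no synchronisation point is needed between the two stages:

float sumOfSquares = 0.0f;
#pragma omp parallel for reduction(+:sumOfSquares)
for( int i=0; i<n; i++ )
    sumOfSquares += a[i]*a[i];    /* map (square) fused with reduce (sum) */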
Distributed systems example
Suppose a database is distributed over nodes in a cluster. Each node has access to part of the full database.
Suppose a user initiates a search. We could use MapReduce:
Each node searches its local database ('map').
Local results are combined to give the required global result (‘reduce’).
MapReduce was developed by Google, and was one of the reasons for their early success.
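A minimal MPI sketch of this pattern, assuming each rank holds numLocalRecords records and a hypothetical matches() predicate (both illustrative, not from the lecture):

/* 'Map': each process searches its own portion of the database. */
int localCount = 0;
for( int i=0; i<numLocalRecords; i++ )
    if( matches(localRecords[i],query) )    /* hypothetical helper */
        localCount++;

/* 'Reduce': combine the local counts into a global total on rank 0. */
int globalCount;
MPI_Reduce( &localCount, &globalCount, 1, MPI_INT, MPI_SUM,
            0, MPI_COMM_WORLD );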
Example: Vector dot product
Consider the vector dot product (aka inner or scalar product):
a · b = Σᵢ aᵢbᵢ

In serial:

float dot = 0.0f;
for( i=0; i<n; i++ )
    dot += a[i]*b[i];
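Both APIs mentioned earlier support this reduction directly. Hedged sketches follow, assuming arrays a and b of length n (or a local section of length localN per process in the MPI case). In OpenMP, the reduction clause gives each thread a private copy of dot and combines them at the end of the loop:

float dot = 0.0f;
#pragma omp parallel for reduction(+:dot)
for( int i=0; i<n; i++ )
    dot += a[i]*b[i];

In MPI, each process computes a partial dot product on its own section, and the partial results are then combined:

float localDot = 0.0f, dot;
for( int i=0; i<localN; i++ )
    localDot += a[i]*b[i];

/* Sum the partial results across all processes; every rank receives 'dot'. */
MPI_Allreduce( &localDot, &dot, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD );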