Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk
Room SLB1004, SLB
Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk
Room INB2221, INB
Week  W/C    Lecture                            Workshop
1     23/01  Introduction                       –
2     30/01  Architectures                      Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                         Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                         Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                       Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation         Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –
Data Reorganisation Patterns
Gather
Scatter
Sparse matrix in coordinate format:
values       = [ 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ]
column index = [ 1   1   2   2   3   3   4   4   5   6    7    8    ]
row index    = [ 1   6   2   8   1   4   5   7   3   6    2    8    ]
Perform a matrix operation (e.g. transpose) only on the non-zero elements
Given a mask, perform the image processing operation only on the foreground pixels
Data movement – major bottleneck of parallel applications, sometimes more important than the parallel computation itself
Data-intensive applications: reorganise data first, then perform the computation
Reorganisation does not always imply direct data reduction (e.g. sort)
Gather and scatter patterns arise from the combination of random (irregular) read and write, respectively, with the map pattern
[Diagram: gather and scatter – a mask (indices) selects elements so that operations are performed only on the reduced set]
Input: a collection of read locations (addresses or array indices) and a source array
Gather reads data from the source array at the given locations into the output
Output size = index collection size
Applications: sparse matrix operations, ray tracing, volume rendering, proximity queries, and collision detection
[Diagram: index collection + source array → output]
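A minimal serial sketch of gather in C++ (the function and type names are mine, not from the slides); a parallel version simply turns the loop into a map:
----------------------------------------
// Gather: out[i] = src[idx[i]] – random reads, regular writes
#include <cstddef>
#include <vector>

std::vector<int> gather(const std::vector<int>& src,
                        const std::vector<std::size_t>& idx) {
    std::vector<int> out(idx.size());   // output size = index collection size
    for (std::size_t i = 0; i < idx.size(); i++)
        out[i] = src[idx[i]];           // read location taken from idx
    return out;
}
----------------------------------------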
Given the following locations and source array, what values should go into the output collection?

index:     0  1  2  3  4  5  6  7  8  9 10 11
input:     3  7  0  1  4  0  0  4  5  3  1  0
locations: 3  1 10  8  0
output:    1  7  1  5  3
Shift – a special case of gather
Moves data to the left or right in memory
Data accesses are offset by fixed distances (efficient)
Shift requires "out of bounds" data
• different variants based on how the boundary conditions are handled
[Diagram: left and right shifts]
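A hedged sketch of a right shift with a "default value" boundary rule; the fill parameter and function name are illustrative (other variants clamp to the edge or rotate):
----------------------------------------
// Right shift by `offset`; out-of-bounds reads produce `fill`
#include <cstddef>
#include <vector>

std::vector<int> shift_right(const std::vector<int>& in,
                             std::size_t offset, int fill) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = (i >= offset) ? in[i - offset]  // fixed-distance access
                               : fill;           // boundary handling
    return out;
}
----------------------------------------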
Stencil – regular data access
For each offset in the stencil, generate a new input vector by shifting the input by that offset
Good for one-dimensional stencils
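For instance, a 3-point averaging stencil (offsets -1, 0, +1) can be assembled from shifted copies of the input; the averaging operation and zero-fill boundary are illustrative choices, not prescribed by the slides:
----------------------------------------
// 3-point stencil from shifted copies: out[i] = (in[i-1]+in[i]+in[i+1])/3
#include <cstddef>
#include <vector>

static std::vector<int> shift_by(const std::vector<int>& in, int offset) {
    std::vector<int> out(in.size(), 0);              // zero fill at boundaries
    for (std::size_t i = 0; i < in.size(); i++) {
        long j = static_cast<long>(i) + offset;      // fixed-distance read
        if (j >= 0 && j < static_cast<long>(in.size()))
            out[i] = in[j];
    }
    return out;
}

std::vector<int> stencil3(const std::vector<int>& in) {
    std::vector<int> a = shift_by(in, -1), c = shift_by(in, +1);
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = (a[i] + in[i] + c[i]) / 3;          // combine shifted copies
    return out;
}
----------------------------------------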
Input: a collection of write locations (addresses or array indices) and a source array
Scatter writes data into the output at the given locations – similar to gather, which uses the locations for reading
Input size = index collection size
The combination of a map with a random write
• writes to the same location are possible
• parallel writes to the same location are collisions
[Diagram: index collection + source array → output]
Gather (N – number of indices):
-------------------------------
par_for i = 1:N
    output[i] = input[index[i]]
-------------------------------

Scatter:
-------------------------------
par_for i = 1:N
    output[index[i]] = input[i]
-------------------------------
Given the following locations and source array, what values should go into the output collection?

index:     0  1  2  3  4  5  6  7  8  9 10 11
input:     3  7  0  1  4  0  0  4  5  3  1  0
locations: 2  4  1  5  5  0  4  2  1  2  1  4
output:    ?  ?  ?  ?  ?  ?
With a serial, last-write-wins scatter the output becomes:

index:     0  1  2  3  4  5  6  7  8  9 10 11
input:     3  7  0  1  4  0  0  4  5  3  1  0
locations: 2  4  1  5  5  0  4  2  1  2  1  4
output:    0  1  3  ·  0  4
(position 3 is never written; several locations receive more than one write)
Race condition: two (or more) values are written to the same location in the output collection. The result is undefined – rules are needed to resolve it!
Different strategies:
• Atomic scatter
• Permutation scatter
• Merge scatter
• Priority scatter
Atomic
Only one value is written – which one is non-deterministic in parallel execution
Useful with Boolean variables
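A sketch of an atomic scatter of Boolean flags (names are mine): all colliding writes store the same value, so the non-deterministic write order is harmless:
----------------------------------------
// Mark which output slots are hit; collisions all write `true`
#include <atomic>
#include <cstddef>
#include <vector>

void mark_hits(const std::vector<std::size_t>& idx,
               std::vector<std::atomic<bool>>& hit) {
    for (std::size_t i = 0; i < idx.size(); i++)   // imagine this loop is parallel
        hit[idx[i]].store(true, std::memory_order_relaxed);
}
----------------------------------------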
Permutation
Collisions are not allowed!
The output is a permutation of the input
Check for collisions in advance: turn the scatter into a gather
Examples: FFT scrambling, matrix/image transpose, unpacking
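One way to realise "turn scatter into gather": if the write locations form a permutation, invert the index map once and let every output element gather its value instead; a sketch with illustrative names:
----------------------------------------
// Permutation scatter as a gather: element i is destined for dest[i]
#include <cstddef>
#include <vector>

std::vector<int> permute(const std::vector<int>& in,
                         const std::vector<std::size_t>& dest) {
    std::vector<std::size_t> inv(dest.size());
    for (std::size_t i = 0; i < dest.size(); i++)
        inv[dest[i]] = i;                          // invert the index map
    std::vector<int> out(in.size());
    for (std::size_t j = 0; j < out.size(); j++)   // collision-free gather
        out[j] = in[inv[j]];
    return out;
}
----------------------------------------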
Merge
Associative and commutative operators are used to merge colliding elements
Commutativity is required since scatters to a particular location can occur in any order
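A minimal serial sketch of merge scatter with addition as the associative, commutative merge operator – the same pattern the histogram example below relies on:
----------------------------------------
// Merge scatter: colliding writes are combined with '+' instead of overwritten
#include <cstddef>
#include <vector>

void scatter_add(const std::vector<int>& in,
                 const std::vector<std::size_t>& idx,
                 std::vector<int>& out) {
    for (std::size_t i = 0; i < in.size(); i++)
        out[idx[i]] += in[i];   // merge on collision
}
----------------------------------------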
Priority
Input elements are assigned a priority based on their position
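A sketch of one possible convention – the element with the higher input position wins on collision, making the result deterministic (the slides do not fix the exact rule):
----------------------------------------
// Priority scatter: keep the write from the highest input position.
// `best` holds the winning position per output slot, initialised to -1.
#include <cstddef>
#include <vector>

void scatter_priority(const std::vector<int>& in,
                      const std::vector<std::size_t>& idx,
                      std::vector<int>& out, std::vector<long>& best) {
    for (std::size_t i = 0; i < in.size(); i++) {
        if (static_cast<long>(i) > best[idx[i]]) {   // higher priority wins
            best[idx[i]] = static_cast<long>(i);
            out[idx[i]] = in[i];
        }
    }
}
----------------------------------------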
Histogram – an estimator of the data distribution. Uses bins (value intervals) to count the number of values falling into each interval.
----------------------------------
vector<int> A;   // input values
vector<int> H;   // histogram bins

//zero all bins
for (size_t i = 0; i < H.size(); i++)
    H[i] = 0;

//fill in histogram bins
for (size_t i = 0; i < A.size(); i++)
    H[bin_index(A[i])]++;
----------------------------------
bin_index provides a list of write locations – scatter!
A = {1,4,3,2,0,2,1,3,2,7,5,6,3,8,1,3};
H = 8 bins

int bin_index_1(int value) {
    return value;
}
int bin_index_2(int value) {
    return value / 2;
}

H for bin_index_1: 1 3 3 4 1 1 1 1   (note: the value 8 falls outside the 8 bins – an out-of-range write!)
H for bin_index_2: 4 7 2 2 1 0 0 0
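A self-contained version assembled from the fragments above, runnable to check the counts (giving H1 nine bins to accommodate the out-of-range value 8 is my workaround):
----------------------------------------
// Serial histogram of A under both bin_index variants
#include <cstdio>
#include <vector>

static int bin_index_1(int value) { return value; }
static int bin_index_2(int value) { return value / 2; }

int main() {
    std::vector<int> A = {1,4,3,2,0,2,1,3,2,7,5,6,3,8,1,3};
    std::vector<int> H1(9, 0), H2(8, 0);    // 9 bins: A contains the value 8
    for (int a : A) H1[bin_index_1(a)]++;   // merge scatter with '+'
    for (int a : A) H2[bin_index_2(a)]++;
    for (int h : H1) std::printf("%d ", h);
    std::printf("\n");
    for (int h : H2) std::printf("%d ", h);
    std::printf("\n");
}
----------------------------------------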
Partial OpenCL implementation – naive!
----------------------------------------
__kernel void hist_naive(__global const int* A, __global int* H,
                         int nr_bins) {
    int id = get_global_id(0);
    //zero bin counts
    if (id < nr_bins)
        H[id] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    H[bin_index(A[id])]++;   // unguarded read-modify-write!
}
----------------------------------------
What if two threads want to increase the same bin at the same time?
One option is to wrap the update in a critical section (lock() … unlock()).
Atomic functions resolve the conflict by serialising accesses to a variable.
More on this in the following lecture on synchronisation techniques.
OpenCL Merge Scatter with atomic operations
----------------------------------------
__kernel void hist_atomic(__global const int* A, __global int* H,
                          int nr_bins) {
    int id = get_global_id(0);
    //zero bin counts
    if (id < nr_bins)
        H[id] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    atomic_inc(&H[bin_index(A[id])]);   // collision-safe increment
}
----------------------------------------
Complexity measures (for N indices) are the same as for the parallel map pattern:
• work – requires N operations, O(N)
• span – O(N/p) for p parallel processors
In reality, performance is affected by the random nature of the memory accesses:
• Gather: performance is lower due to random reads (not so bad)
• Scatter: performance is lower due to random writes and potential write conflicts (e.g. consider all indices being the same!)
Further reading: Structured Parallel Programming: Patterns for Efficient Computation (McCool, Robison & Reinders)
• Chapter 6, Data Reorganization