Systems, Networks & Concurrency 2019
Data Parallelis5m Uwe R. Zimmer – The Australian National University
[Bacon98]
J. Bacon
Concurrent Systems
1998 (2nd Edition) Addison Wesley Longman Ltd, ISBN 0-201-17767-6
[Ada 2012 Language Reference Manual]
see course pages or http://www.ada-auth.org/standards/ada12.html
[Chapel 1.13 Language Specification Version 0.981]
see course pages or
Data Parallelism
http://chapel.cray.com/docs/latest/_downloads/chapelLanguageSpec.pdf
released on 7. April 2016
References
© 2019 Uwe R. Zimmer, The Australian National University page 396 of 757 (chapter 5: “Data Parallelism” up to page 426)
a$v=a$ y = a$y fxp fa$xp
z a$z type Real_Precision = Float
type Scalar = Real_Precision type Vector = [Real_Precision]
scale :: Scalar -> Vector -> Vector
scale scalar vector = map (scalar *) vector
© 2019 Uwe R. Zimmer, The Australian National University
page 397 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
Potentially concurrent, yet: Executed sequentially.
a$v=a$ y = a$y fxp fa$xp
z a$z type Real_Precision = Float
type Scalar = Real_Precision type Vector = [Real_Precision]
scale :: Scalar -> Vector -> Vector
scale scalar vector = map (scalar *) vector
© 2019 Uwe R. Zimmer, The Australian National University
page 398 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
Executed in parallel.
This may be faster or slower than a sequential execution.
a$v=a$ y = a$y fxp fa$xp
z a$z
import Control.Parallel.Strategies type Real_Precision = Float
type Scalar = Real_Precision type Vector = [Real_Precision]
scale :: Scalar -> Vector -> Vector
scale scalar vector = parMap rpar (scalar *) vector
© 2019 Uwe R. Zimmer, The Australian National University
page 399 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
a$v=a$ y = a$y fxp fa$xp
z a$z type Real is digits 15;
type Vectors is array (Positive range <>) of Real;
function Scale (Scalar : Real; Vector : Vectors) return Vectors is
Scaled_Vector : Vectors (Vector’Range);
begin
for i in Vector’Range loop
Scaled_Vector (i) := Scalar * Vector (i); end loop;
return Scaled_Vector; end Scale;
© 2019 Uwe R. Zimmer, The Australian National University
page 400 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
Translates into
CPU-level vector operations
Buzzword collection: AltiVec, SPE, MMX, SSE, NEON, SPU, AVX, …
Combined with
in-lining, loop unrolling and caching this is as fast as a single CPU will get.
a$v=a$ y = a$y fxp fa$xp
z a$z type Real is digits 15;
type Vectors is array (Positive range <>) of Real;
function Scale (Scalar : Real; Vector : Vectors) return Vectors is
Scaled_Vector : Vectors (Vector’Range);
begin
for i in Vector’Range loop
Scaled_Vector (i) := Scalar * Vector (i); end loop;
return Scaled_Vector; end Scale;
© 2019 Uwe R. Zimmer, The Australian National University
page 401 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
Function is
“promoted”
a$v=a$ y = a$y fxp fa$xp
z a$z const Index = {1 .. 100000000},
Vector
: [Index] real = 1.0,
: real = 5.1,
Scale Scaled
: [Vector] real = Scale * Vector;
© 2019 Uwe R. Zimmer, The Australian National University
page 402 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
Translates into CPU-level vector operations as well as multi-core or
fully distributed operations
Function is
“promoted”
a$v=a$ y = a$y fxp fa$xp
z a$z const Index = {1 .. 100000000},
Vector
: [Index] real = 1.0,
: real = 5.1,
Scale Scaled
: [Vector] real = Scale * Vector;
© 2019 Uwe R. Zimmer, The Australian National University
page 403 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Vectorization
1 2 fx1p fx2p ^1 2h ^1 z1 z2
2h ^1 2h
v=v&y=y&x=x/y=y/z=z
type Real_Precision = Float type Vector = [Real_Precision]
equal :: Vector -> Vector -> Bool
equal v_1 v_2 = foldr (&&) True $ zipWith (==) v_1 v_2
© 2019 Uwe R. Zimmer, The Australian National University
page 404 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Reduction
Potentially concurrent, yet:
Executed lazy sequentially.
1 2 fx1p fx2p ^1 2h ^1 z1 z2
2h ^1 2h
v=v&y=y&x=x/y=y/z=z
type Real_Precision = Float type Vector = [Real_Precision]
equal :: Vector -> Vector -> Bool
equal v_1 v_2 = foldr (&&) True $ zipWith (==) v_1 v_2
© 2019 Uwe R. Zimmer, The Australian National University
page 405 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Reduction
Potentially concurrent, yet:
Executed lazy sequentially.
1 2 fx1p fx2p ^1 z1 z2
2h ^1
2h ^1 2h
v=v&y=y&x=x/y=y/z=z
type Real_Precision = Float type Vector = [Real_Precision]
equal :: Vector -> Vector -> Bool
equal = (==)
© 2019 Uwe R. Zimmer, The Australian National University
page 406 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Reduction
type Real is digits 15;
z1 z2
type Vectors is array (Positive range <>) of Real;
function ”=” (Vector_1, Vector_2 : Vectors) return Boolean is
(for all i in Vector_1’Range => Vector_1 (i) = Vector_2 (i));
Data Parallelism
Vector Machines
Reduction
v=v&y=y&x=x/y=y/z=z
1 2 fx1p fx2p ^1 2h ^1 2h ^1 2h
© 2019 Uwe R. Zimmer, The Australian National University page 407 of 757 (chapter 5: “Data Parallelism” up to page 426)
Translates into
CPU-level vector operations
/-chain is evaluated lazy sequentially.
type Real is digits 15;
z1 z2
type Vectors is array (Positive range <>) of Real;
function ”=” (Vector_1, Vector_2 : Vectors) return Boolean is
(for all i in Vector_1’Range => Vector_1 (i) = Vector_2 (i));
Data Parallelism
Vector Machines
Reduction
v=v&y=y&x=x/y=y/z=z
1 2 fx1p fx2p ^1 2h ^1 2h ^1 2h
© 2019 Uwe R. Zimmer, The Australian National University page 408 of 757 (chapter 5: “Data Parallelism” up to page 426)
Translates into
CPU-level vector operations
/-chain is evaluated lazy sequentially.
Infinite recursion
Data Parallelism
Vector Machines
Reduction
x1 x2
v1 = v2 & fy1p= fy2p& ^x1 = x2h/^y1 = y2h/^z1 = z2h
z1 z2
type Real is digits 15;
type Vectors is array (Positive range <>) of Real;
function ”=” (Vector_1, Vector_2 : Vectors) return Boolean is (Vector_1 = Vector_2);
© 2019 Uwe R. Zimmer, The Australian National University page 409 of 757 (chapter 5: “Data Parallelism” up to page 426)
Translates into
CPU-level vector operations
/-chain is evaluated lazy sequentially.
Data Parallelism
Vector Machines
Reduction
v=v&y=y&x=x/y=y/z=z
1 2 fx1p fx2p ^1 2h ^1 2h ^1 2h
z1 z2
type Real is digits 15;
type Vectors is array (Positive range <>) of Real;
function Equal (Vector_1, Vector_2 : Vectors) return Boolean is (Vector_1 = Vector_2);
© 2019 Uwe R. Zimmer, The Australian National University page 410 of 757 (chapter 5: “Data Parallelism” up to page 426)
Translates into
CPU-level vector operations
/-chain is evaluated lazy sequentially.
Data Parallelism
Vector Machines
Reduction
v=v&y=y&x=x/y=y/z=z
1 2 fx1p fx2p ^1 2h ^1 2h ^1 2h
z1 z2
type Real is digits 15;
type Vectors is array (Positive range <>) of Real;
function Equal (Vector_1, Vector_2 : Vectors) return Boolean renames ”=”;
© 2019 Uwe R. Zimmer, The Australian National University page 411 of 757 (chapter 5: “Data Parallelism” up to page 426)
Translates into
CPU-level vector operations
/-chain is evaluated lazy sequentially.
type Real is digits 15;
z1 z2
type Vectors is array (Positive range <>) of Real;
function ”=” (Vector_1, Vector_2 : Vectors) return Boolean is
(for all i in Vector_1’Range => Vector_1 (i) = Vector_2 (i));
Data Parallelism
Vector Machines
Reduction
v=v&y=y&x=x/y=y/z=z
1 2 fx1p fx2p ^1 2h ^1 2h ^1 2h
© 2019 Uwe R. Zimmer, The Australian National University page 412 of 757 (chapter 5: “Data Parallelism” up to page 426)
Function is
“promoted”
1 2 fx1p fx2p ^1 2h ^1 z1 z2
2h ^1 2h
v=v&y=y&x=x/y=y/z=z
const Index = {1 .. 100000000},
Vector_1, Vector_2 : [Index] real = 1.0;
proc Equal (v1, v2) : bool {return && reduce (v1 == v2);}
© 2019 Uwe R. Zimmer, The Australian National University
page 413 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Reduction
/-operations are evaluated in a concurrent divide-and-conquer (binary tree) structure.
Translates into CPU-level vector operations as well as multi-core or
fully distributed operations
Function is
“promoted”
z1 z2
const Index = {1 .. 100000000},
Vector_1, Vector_2 : [Index] real = 1.0; proc Equal (v1, v2) : bool
{return && reduce (v1 == v2);}
© 2019 Uwe R. Zimmer, The Australian National University
page 414 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Reduction
x1 x2
v1 = v2 & fy1p= fy2p& ^x1 = x2h/^y1 = y2h/^z1 = z2h
Type mismatch
1 2 fx1p fx2p ^1 2h ^1 z1 z2
2h ^1 2h
v=v&y=y&x=x/y=y/z=z
const Index = {1 .. 100000000},
Vector_1, Vector_2 : [Index] real = 1.0;
proc Equal (v1, v2) : bool {return v1 == v2;}
writeln (Equal (Vector_1, Vector_2));
© 2019 Uwe R. Zimmer, The Australian National University
page 415 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
Reduction
© 2019 Uwe R. Zimmer, The Australian National University
page 416 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
R $ -1 :V SW
6px”S-1 5 -1W
S : -1 :W TX
© 2019 Uwe R. Zimmer, The Australian National University
page 417 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
R $ -1 :V SW
6px”S-1 5 -1W
S : -1 :W TX
Data Parallelism
Vector Machines
General Data-parallelism
R $ -1 :V SW
6px”S-1 5 -1W S : -1 :W
const Mask : [1 .. 3, 1 .. 3] real = ((0, -1, 0), (-1, 5, -1), (0, -1, 0));
TX
© 2019 Uwe R. Zimmer, The Australian National University page 418 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
R $ -1 :V SW
6px”S-1 5 -1W S : -1 :W
const Mask : [1 .. 3, 1 .. 3] real = ((0, -1, 0), (-1, 5, -1), (0, -1, 0)); proc Unsharp_Mask (P, (i, j) : index (Image)) : real
TX
{return + reduce (Mask * P [i – 1 .. i + 1, j – 1 .. j + 1]);}
© 2019 Uwe R. Zimmer, The Australian National University page 419 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
R $ -1 :V SW
6px”S-1 5 -1W S : -1 :W
const Mask : [1 .. 3, 1 .. 3] real = ((0, -1, 0), (-1, 5, -1), (0, -1, 0)); proc Unsharp_Mask (P, (i, j) : index (Image)) : real
{return + reduce (Mask * P [i – 1 .. i + 1, j – 1 .. j + 1]);}
const Sharpened_Picture = forall px in Image do Unsharp_Mask (Picture, px);
TX
© 2019 Uwe R. Zimmer, The Australian National University page 420 of 757 (chapter 5: “Data Parallelism” up to page 426)
Translates into CPU-level vector operations as well as multi-core or
fully distributed operations
Data Parallelism
Vector Machines
General Data-parallelism
R $ 6px”S-1
– 1 :V 5 -1W
const Mask : [1 .. 3, 1 .. 3] real = ((0, -1, 0), (-1, 5, -1), (0, -1, 0)); proc Unsharp_Mask (P, (i, j) : index (Image)) : real
{return + reduce (Mask * P [i – 1 .. i + 1, j – 1 .. j + 1]);}
const Sharpened_Picture = forall px in Image do Unsharp_Mask (Picture, px);
S S T
:
– 1
W :W X
© 2019 Uwe R. Zimmer, The Australian National University page 421 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
”
© 2019 Uwe R. Zimmer, The Australian National University page 422 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
”
Cellular automaton transitions from a state S into the next state Sl: S ” Sl + 6c ! S: c ” cl = r^S,ch, i.e. all cells of a state transition concurrently into new cells by following a rule r.
© 2019 Uwe R. Zimmer, The Australian National University page 423 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Vector Machines
General Data-parallelism
”
Cellular automaton transitions from a state S into the next state Sl: S ” Sl + 6c ! S: c ” cl = r^S,ch, i.e. all cells of a state transition concurrently into new cells by following a rule r.
Next_State = forall World_Indices in World do Rule (State, World_Indices);
© 2019 Uwe R. Zimmer, The Australian National University page 424 of 757 (chapter 5: “Data Parallelism” up to page 426)
John Conway’s Game of Life rule:
proc Rule (S, (i, j) : index (World)) : Cell {
Data Parallelism
Vector Machines
General Data-parallelism
”
Cellular automaton transitions from a state S into the next state Sl: S ” Sl + 6c ! S: c ” cl = r^S,ch, i.e. all cells of a state transition concurrently into new cells by following a rule r.
Next_State = forall World_Indices in World do Rule (State, World_Indices);
const Population : index ({0 .. 9}) =
+ reduce Count (Cell.Alive, S [i – 1 .. i + 1, j – 1 .. j + 1]);
return (if Population == 3
|| (Population == 4 && S [i, j] == Cell.Alive) then Cell.Alive
else Cell.Dead);
© 2019 Uwe R. Zimmer, The Australian National University page 425 of 757 (chapter 5: “Data Parallelism” up to page 426)
}
• Data-Parallelism
• Vectorization
• Reduction
• Generaldata-parallelism
• Examples
• Imageprocessing • Cellularautomata
© 2019 Uwe R. Zimmer, The Australian National University
page 426 of 757 (chapter 5: “Data Parallelism” up to page 426)
Data Parallelism
Summary
Data Parallelism