
High Performance Computing

Models of Parallel Programming

Dr Ligang He


Models of Parallel Programming

Different approaches for programming on parallel and distributed computing systems include:
– Dedicated languages designed specifically for parallel computers
– Smart compilers, which automatically parallelise sequential code
– Data parallelism: multiple processors run the same operation on different elements of a data structure
– Task parallelism: multiple processors run different operations

Two classes of parallelism according to computer architecture:
– Shared memory: processors share a common address space
– Distributed memory: the memory in each processor has its own address space


Specially Designed Language


Occam

Occam is a concurrent programming language

Occam is an executable implementation of Communicating
Sequential Processes (CSP) theory

CSP: a mathematical theory for describing the interactions of tasks in a concurrent
system
It is theoretically possible to prove whether a program written in Occam is correct

Occam was specially designed to make full use of the features of a
custom-designed computer system built from transputer (transistor
computer) chips, developed by INMOS

The transputer was the first microprocessor specially designed for parallel computing
A number of transputer chips are wired together to form a complete computer system (no
bus, RAM or OS)

In Occam, processes communicate through channels
A channel can be regarded as the message-passing mechanism within a computer


Occam

Sequential execution:

SEQ
  x := x + 1
  y := x * x

Parallel execution (p and q run concurrently):

PAR
  p()
  q()

Communication between processes (in Occam the channel name comes first, followed by ! to send or ? to receive):

-- in process A: send the value of x on the channel Channel_var
Channel_var ! x
-- in process B: receive that value from Channel_var into y
Channel_var ? y


Occam

The Occam language did not become popular
Poor portability
The transputer chip was very expensive

For more information:

Occam 2 reference manual
www.wotug.org/occam/documentation/oc21refman.pdf

Occam archive
http://vl.fmnet.info/occam/

Transputer
http://en.wikipedia.org/wiki/INMOS_Transputer


Dedicated languages

In general, dedicated languages are going to do a better job:
1. Designed with the hardware architecture in mind
2. The structure of the language reflects the nature of the parallelism

However
1. Niche languages are not generally popular
2. It’s hard to port existing code

Much better to modify and extend an existing language to
include parallelism, because

1. A bigger audience of potential users
2. Only need to learn new constructs or API, not a new language
3. Porting is a lot easier


Compiler Approach


Compiler Approach

– Automatically parallelises sequential code

– Very conservative: the compiler must preserve the sequential semantics

– Re-implements the code in a more efficient way

– Can remove dependencies where possible


Re-implement the code in compilation

(The before and after versions of the code are shown as figures in the original slides.)
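As an illustrative sketch of this kind of rewrite (my example, not necessarily the slide's), the compiler might hoist a computation that does not change between iterations out of the loop:

! Before: the loop-invariant value C/D is recomputed on every iteration
DO I = 1, N
   A(I) = B(I) * (C / D)
ENDDO

! After: the compiler computes the invariant value once, outside the loop
T = C / D
DO I = 1, N
   A(I) = B(I) * T
ENDDO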


Removing the dependency

Anti-dependency and output-dependency can be
removed by renaming the variables


Dependency and Parallelism (revisit)

– Dependency: if instruction A must finish before instruction B can run, then B is dependent on A

– Two types of dependency

– Control dependency: waiting for the instruction which controls the execution flow to complete

• IF (X != 0) THEN Y = 1.0/X : the statement Y = 1.0/X has a control dependency on the test X != 0

– Data dependency: a dependency caused by calculations or memory accesses

• Flow dependency (read after write): A=X+Y; B=A+C;

• Anti-dependency (write after read): B=A+C; A=X+Y;

• Output dependency (write after write): A=2; X=A+1; A=5;


Removing the dependency

Anti-dependency and output-dependency can be
removed by renaming the variables

Anti-dependency:

Before:
Y = 2 * X
X = 5

After:
Y = 2 * X
X1 = 5

Output-dependency:

Before:
X = A / B
Y = X + 2.0
X = D - E

After:
X = A / B
Y = X + 2.0
X1 = D - E
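One detail worth making explicit: after the write is renamed, any later statement that needs the new value must read the renamed variable (X1). A minimal sketch for the anti-dependency case (my annotation, not from the slides):

! After renaming there is no conflict between the two statements,
! so they can be reordered or executed in parallel
Y  = 2 * X     ! still reads the original X
X1 = 5         ! the renamed write; later code uses X1 for the new value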


Examples of automatic parallelization by compiler

A compiler takes code written in a standard language (e.g. C or
Fortran) and automatically compiles it to be run in parallel (e.g. by
parallelizing loops)
Example 1 (can be parallelised: each iteration touches only its own elements A(I) and B(I)):

DO I = 1, N
   A(I) = A(I) + B(I)
ENDDO

Example 2 (cannot be parallelised: iteration I needs A(I-1) from the previous iteration, a loop-carried dependency):

DO I = 2, N
   A(I) = A(I-1) + B(I)
ENDDO

Compiling the code:
f90 -O3 -autopar foo.f
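As a rough sketch (not from the slides), the transformation that automatic parallelisation performs on Example 1 is equivalent to splitting the iteration space by hand, for example across two processors; the exact flag above and the generated code are compiler-specific:

! Sketch, assuming two processors: each gets half of the iterations,
! which is safe because iteration I only reads and writes A(I) and B(I)
! Processor 1:
DO I = 1, N/2
   A(I) = A(I) + B(I)
ENDDO
! Processor 2, running at the same time:
DO I = N/2 + 1, N
   A(I) = A(I) + B(I)
ENDDO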


Can this loop be parallelised?

Example 3:

DO I = 2, N
   A(I) = 2 * B(I)
   C(I) = A(I) + C(I)
ENDDO


Features of the Compiler approach

– Can work fairly well for some regular problems

– Fully automatic parallelisation is difficult, and is unlikely to produce efficient code in general


Assisting Compiler

– Programmers can assist the compiler by writing the code in a way that explicitly expresses the parallelism in the program

– Usually done using directives (pseudo-comments that tell the compiler where the parallelism lies)

– During the 1990s, OpenMP emerged as a common set of directives for implementing various types of parallel execution and synchronisation

– These directives are added to serial code (and ignored if targeting a single-processor machine)

– Explicit parallelism: we are instructing the compiler what to do

– If something goes wrong, we cannot blame the compiler, only ourselves


Examples of Assisting Compilers

Correct:

C$OMP PARALLEL DO
      DO I = 1, N
         A(I) = A(I) + B(I)
      ENDDO
C$OMP END PARALLEL DO

Wrong (the loop-carried dependency on A(I-1) makes parallel execution unsafe):

C$OMP PARALLEL DO
      DO I = 2, N
         A(I) = A(I-1) + B(I)
      ENDDO
C$OMP END PARALLEL DO
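As a usage note (assuming the gfortran compiler, which the slides do not mention): directives such as C$OMP are only acted on when OpenMP support is enabled at compile time, for example

gfortran -fopenmp foo.f

Without the flag, the directive lines are treated as ordinary comments, which is why the same source still builds and runs correctly on a single-processor machine.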


Compilers

For more information on this subject area see:

Further reading in High Performance Compilers for Parallel
Computing, Michael Wolfe

Vectorizing C Compilers: how good are they? Lauren Smith,
ACM/IEEE Conference on Supercomputing

Parallelizing and Vectorizing Compilers, Eigenmann and
Hoeflinger, (also on course web page)


Data Parallelism


Task parallelism vs. Data parallelism

Task parallelism:

if CPU = "a" then
   do task "A"
else if CPU = "b" then
   do task "B"
end if

Data parallelism (d is a one-dimensional array):

if CPU = "a" then
   low_limit = 1; upper_limit = 50
else if CPU = "b" then
   low_limit = 51; upper_limit = 100
end if
do i = low_limit, upper_limit
   task on d(i)
end do
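For comparison, the same two patterns expressed with OpenMP-style Fortran directives (a hedged sketch; TASK_A, TASK_B and TASK_ON are hypothetical subroutines, not part of the slides):

! Task parallelism: different blocks of work run on different threads
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL TASK_A()
C$OMP SECTION
      CALL TASK_B()
C$OMP END PARALLEL SECTIONS

! Data parallelism: the same operation applied to different elements of D
C$OMP PARALLEL DO
      DO I = 1, 100
         CALL TASK_ON(D(I))
      ENDDO
C$OMP END PARALLEL DO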


Data parallelism

– If we are applying the same operation to every element in an array (or list, set, tree, database or whatever) then we may be able to exploit data parallelism

– F90 and HPF provide support for data parallelism
– F90 and HPF allow scalar operations to be applied to whole arrays to support data parallelism
– Adding two matrices can be written as

A = B + C   ! A, B, C are arrays

which is equivalent to

do i = 1, m
   do j = 1, n
      A(i,j) = B(i,j) + C(i,j)
   enddo
enddo

The former expression is called an explicit parallel statement, while the latter is called an implicit parallel statement (the latter can be parallelised by the compiler if possible)


Data parallelism

A data parallel program is a sequence of explicitly and/or implicitly parallel statements

The compiler can partition the data into disjoint sub-domains, one per processor

Data placement is an essential part of data parallelism, and if you get the data locality wrong, you will take a performance hit

Fortunately, compilers can handle data partitioning, data allocation/placement and data communication
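As an illustration of directive-guided placement (a hedged HPF-style sketch, not taken from the slides), the programmer states how arrays are distributed and the compiler derives the partitioning and communication:

      REAL A(100,100), B(100,100), C(100,100)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK, *) ONTO P
!HPF$ ALIGN B WITH A
!HPF$ ALIGN C WITH A
      A = B + C     ! the compiler maps this onto the distributed blocks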


Data parallelism

Data-parallel programming is higher-level than the message-passing approach

the programmer does not need to specify communication structures
this is done by the compiler (inferred from the program and the underlying computer architecture)

However,
not all algorithms can be specified in data parallel terms
if the program has irregular communication patterns then it will be compiled less efficiently

Examples of data parallel languages include F90 and HPF