
Information Technology
FIT3143 – LECTURE WEEK 1 INTRODUCTION TO PARALLEL COMPUTING

1. Parallel computing concept and applications


2. Parallel computing models
3. Parallel computing performance
Associated learning outcomes
• Explain the fundamental principles of parallel computing architectures and algorithms (LO1)
• Analyze and evaluate the performance of parallel algorithms (LO4)
FIT3143 Parallel Computing 2

1. Parallel computing concept and applications
FIT3143 Parallel Computing 3

Definitions
A parallel computer is simply a single computer with multiple internal processors, or a group of multiple computers interconnected to form a coherent high-performance computing platform.
Parallel programming is programming multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than is possible with a single computer.
FIT3143 Parallel Computing

Adapted from https://computing.llnl.gov/tutorials/parallel_comp/
FIT3143 Parallel Computing
[Figure: serial computing vs. parallel computing]

The demand for computational speed
There is a continual demand for greater computational speed from a computer system than is currently possible.
Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems.
Computations must be completed within a “reasonable” time period.
FIT3143 Parallel Computing

Grand challenge problems
A grand challenge problem is one that cannot be solved in a reasonable amount of time with today’s computers.
For example, an execution time of 10 years is usually unacceptable.
• Modeling large DNA structures
• Global weather forecasting
• Modeling motion of astronomical bodies
FIT3143 Parallel Computing

Computationally challenging problems
FIT3143 Parallel Computing 8

Reasons for not using existing serial machines
• May not be able to reach the solution within a reasonable amount of time if using a single machine.
• The limits imposed by both:
– Physical constraints, such as the speed of light.
– The cost of designing and manufacturing new chips.
– Current chip clock speeds may reach a few billion Hz but will not grow enormously beyond that.
FIT3143 Parallel Computing 9

Reasons for using parallel machines
• With parallelism, much greater speeds can be achieved. The fastest computer in the world (as of the June 2020 TOP500 list) has a peak speed of about 415,530 Teraflops (https://top500.org/lists/top500/2020/06/). (flops: floating-point operations per second)
• The enormous size of the problems that are being investigated. These problems fall in many domains of science, engineering, and economics.
• To solve these problems requires carrying out huge numbers of computations, such as 10^12, 10^14, 10^16 or even more. This number of computations cannot be done by any existing serial machine but is feasible to attack using parallel processing.
FIT3143 Parallel Computing 10

Why is Parallel Computing Important?
Source: Rice University, 2008
FIT3143 Parallel Computing 11

Some Particularly Challenging Computations
• Global climate modelling
• Biology: genomics; protein folding; drug design
• Astrophysical modelling
• Computational Chemistry
• Computational Material Sciences and Nanosciences
• Engineering
• Semiconductor design
• Earthquake and structural modelling, remote sensing
• Computational fluid dynamics (airplane design) & Combustion (engine design)
• Simulation
• Deep learning
• Game design
• Telecommunications (e.g. Network monitoring & optimization)
• Autonomous systems
• Business
• Financial derivatives and economic modelling
• Transaction processing, web services and search engines
• Analytics
• Nuclear weapons: testing by simulation
• Cryptography
Source: Rice University, 2008
FIT3143 Parallel Computing 12

Why is Parallelism now necessary for Mainstream Computing?
• In a single-core processor, performance no longer improves much simply by increasing the operating frequency, because of three barriers:
• Memory wall
Raising the operating frequency alone cannot improve performance, because the real limit is the growing gap between CPU speed and memory throughput (data transfer rate).
• ILP (Instruction Level Parallelism) wall
Difficulty in finding enough independent instructions to process fully in parallel.
• Power wall
Power consumption roughly doubles when the operating frequency doubles. Faster clock speeds require higher input voltages, and every transistor leaks a small amount of current; since dynamic power grows roughly with voltage squared times frequency, a 1% clock increase results in about a 3% power increase.
FIT3143 Parallel Computing 13

Types of parallelism
• Bit level parallelism
• Parallelism achieved by doubling the word size.
• Instruction level parallelism
• Superscalar processors with multiple execution units and out-of-order execution.
• Data level parallelism
• Each processor performs the same task on different pieces of distributed data (see the first loop in the sketch below).
• Task level parallelism
• Each processor executes a different thread (or process) on the same or different data (see the sections part of the sketch below).
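To make the last two categories concrete, here is a minimal, purely illustrative C sketch. It assumes OpenMP, which the slide does not prescribe, and the helper functions sum_squares and find_max are hypothetical. The first loop is data-level parallelism (the same operation applied to different array elements); the sections construct is task-level parallelism (different tasks run by different threads).

#include <stdio.h>

#define N 1000000

/* Hypothetical, independent pieces of work for the task-level example. */
static double sum_squares(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * a[i];
    return s;
}

static double find_max(const double *a, int n) {
    double m = a[0];
    for (int i = 1; i < n; i++) if (a[i] > m) m = a[i];
    return m;
}

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (double)i;

    /* Data-level parallelism: every thread applies the SAME operation
       to a DIFFERENT slice of the data. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    /* Task-level parallelism: different threads execute DIFFERENT tasks,
       here on the same input data. */
    double s = 0.0, m = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        s = sum_squares(a, N);
        #pragma omp section
        m = find_max(a, N);
    }

    printf("sum of squares = %g, max = %g, b[10] = %g\n", s, m, b[10]);
    return 0;
}

Compiled with OpenMP enabled (for example gcc -fopenmp), the loops run across threads; without it the pragmas are ignored and the program simply runs serially.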
FIT3143 Parallel Computing 14

Classes of parallel computers
Multi core processor
FIT3143 Parallel Computing 15

Classes of parallel computers
Symmetric multiprocessor
FIT3143 Parallel Computing 16

Classes of parallel computers
Cluster & Distributed/Grid computing
FIT3143 Parallel Computing 17

Classes of parallel computers
Massively parallel computing / supercomputing
FIT3143 Parallel Computing 18

Classes of parallel computers
Reconfigurable computing (FPGA) & ASIC
FIT3143 Parallel Computing 19

Classes of parallel computers
General-purpose computing on graphics processing units (GPGPU)
FIT3143 Parallel Computing 20

E.g. Embarrassingly parallel computation
An ideal parallel computation is one that can be immediately divided into completely independent parts that can be executed simultaneously – this is what is known as an embarrassingly-parallel computation, or job-level parallelism.
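As a minimal, purely illustrative C/OpenMP sketch of such a computation (the simulate function and the parameter sweep are hypothetical, and no particular language is implied by the slide): every iteration below is completely independent, so the parts can all execute simultaneously with no communication between them.

#include <stdio.h>
#include <math.h>

/* Hypothetical stand-in for an expensive, fully independent job. */
static double simulate(double p) {
    return sin(p) * sin(p);
}

int main(void) {
    enum { JOBS = 64 };
    double result[JOBS];

    /* Embarrassingly parallel: iterations share nothing and never
       communicate, so each can run on a different processor. */
    #pragma omp parallel for
    for (int i = 0; i < JOBS; i++)
        result[i] = simulate((double)i / JOBS);

    for (int i = 0; i < JOBS; i++)
        printf("job %2d -> %f\n", i, result[i]);
    return 0;
}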
In this context, a network is a collection of integrated processors that are communicating and sharing information to speed up the solution to a single task.
FIT3143 Parallel Computing 21

1. How big of a collection?
– A few very powerful, special purpose computers? (2-64)
– Many smaller, standard microprocessors? (100-10,000)
– A huge number of simple, one bit processors? (billions, trillions)
2. What kind of processors?
– Special purpose?
– Standard microprocessors?
– Simple, one-bit processors?
3. How closely integrated is the collection of processors?
http://www.lsbu.ac.uk/oracle/oracle7/server/doc/SPS73/chap3.htm
– Tightly coupled? Designed as a single architecture and a single system.
– Loosely coupled? Systems that can operate as a single logical system but can also function independently. Not designed as a single architecture.
FIT3143 Parallel Computing 22

How do the processors communicate?
– Shared memory?
– External communications channel?
What type of information is shared?
– Data values, inputs, results?
– Synchronizing information?
Focus on a single task
– We will focus more on task level parallelism rather than job-level parallelism or embarrassingly parallel computation. This means we’re more interested in situations where all the processors work together to speed up the solution of a single task.
Job = Task1 + Task2 + Task3 + …
FIT3143 Parallel Computing 23

2. Parallel computing models
FIT3143 Parallel Computing 24

How to classify parallel computers?
Parallel computers can be classified by:
• Type and number of processors.
• Interconnection scheme of the processors.
• Communication scheme.
• Input/output operations.
FIT3143 Parallel Computing 25

Conventional computer
• Consists of a processor executing a program stored in a (main) memory.
• Each main memory location is identified by its address. Addresses start at 0 and extend to 2^n - 1 when there are n bits (binary digits) in the address. (You learned this in COA)
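For example, with n = 32 address bits the addresses run from 0 to 2^32 - 1, giving 4,294,967,296 distinct locations (a 4 GiB byte-addressable space).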
Image link: https://media.geeksforgeeks.org/wp-content/uploads/basic_structure.png
FIT3143 Parallel Computing 26

Multi-processor computer architecture
1. Flynn’s Classification
2. Shared memory model
3. Distributed memory model
4. Distributed shared memory model
FIT3143 Parallel Computing 27

Taxonomy (Classification)
A taxonomy of parallel architectures can be built based on three relationships:
1. Relationship between PE and the instruction sequence executed
2. Relationship between PE and the memory
3. Relationship between PE and the interconnection
PE: Processing Element (CPU)
FIT3143 Parallel Computing 28

[Diagram: Parallel Architecture classified by PE & Memory, PE & Instruction Sequence, and PE & Interconnection Network]
FIT3143 Parallel Computing 29

Flynn taxonomy
Flynn (1966) developed a taxonomy of parallel systems based on the number of independent instruction and data streams.
1. SISD: Single Instruction Stream- Single Data Stream
This is the conventional sequential machine.
2. MISD: Multiple Instruction Stream- Single Data Stream
3. SIMD: Single Instruction Stream- Multiple Data Stream
4. MIMD: Multiple Instruction Stream- Multiple Data Stream
FIT3143 Parallel Computing 30

Flynn’s Classification
FIT3143 Parallel Computing 31

Flynn’s Classification
§ Single Instruction, Single Data (SISD):
– A serial (non-parallel) computer
– Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
– Single data: only one data stream is being used as input during any one clock cycle
– Deterministic execution
– This is the oldest and, until recently, the most prevalent form of computer
– Examples: most PCs, single CPU workstations and mainframes
FIT3143 Parallel Computing 32

Flynn’s Classification
Single Instruction, Multiple Data (SIMD):
§ A type of parallel computer
§ Single instruction: All processing units execute the same instruction at any given clock
§ Multiple data: Each processing unit can operate on a different data element (see the sketch after the examples below)
§ This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
§ Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
§ Synchronous (lockstep) and deterministic execution
§ Two varieties: Processor Arrays and Vector Pipelines
§ Examples:
– Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
– Vector Pipelines: IBM 9000, 90, Fujitsu VP, NEC SX-2, Hitachi S820
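As a small illustration of the SIMD idea rather than of any machine listed above, the loop below applies one operation across many data elements. The OpenMP simd directive used here is only an assumed example; it asks the compiler to emit vector instructions so that one instruction operates on several elements in lockstep.

#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    /* One instruction stream, many data elements: the same multiply-add
       is applied to several array elements per vector instruction. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] + 1.0f;

    printf("c[100] = %f\n", c[100]);
    return 0;
}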
FIT3143 Parallel Computing 33

Flynn’s Classification
Multiple Instruction, Single Data (MISD):
§ A single data stream is fed into multiple processing units.
§ Each processing unit operates on the data independently via independent instruction streams
§ Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon computer
§ Some conceivable uses might be:
– multiple frequency filters operating on a single signal stream
– multiple cryptography algorithms attempting to crack a single coded message.
FIT3143 Parallel Computing 34

Flynn’s Classification
Multiple Instruction, Multiple Data (MIMD):
§ Currently, the most common type of parallel computer. Most modern computers fall into this category.
§ Multiple Instruction: every processor may be executing a different instruction stream
§ Multiple Data: every processor may be working with a different data stream
§ Execution can be synchronous or asynchronous, deterministic or non-deterministic
§ Examples: most current supercomputers, networked parallel computer “grids” and multi-processor SMP computers – including some types of PCs.
FIT3143 Parallel Computing 35

Parallel Computer Memory Architectures
§ Broadly divided into three categories:
– Shared memory
– Distributed memory
– Hybrid (shared and distributed)
Shared Memory
§ Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
§ Multiple processors can operate independently but share the same memory resources.
§ Changes in a memory location effected by one processor are visible to all other processors.
§ Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
FIT3143 Parallel Computing 36

Parallel Computer Memory Architectures
Distributed Memory
§ Distributed memory systems require a communication network to connect inter-processor memory.
§ Processors have their own local memory. There is no concept of global address space across all processors.
§ Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
§ When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer’s responsibility.
§ The network “fabric” used for data transfer varies widely, though it can be as simple as Ethernet.
FIT3143 Parallel Computing 37

Parallel Computer Memory Architectures
Hybrid
§ The largest and fastest computers in the world today employ both shared and distributed memory architectures.
§ The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine’s memory as global.
§ The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory – not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
§ Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
§ Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.
FIT3143 Parallel Computing 38

Parallel Programming Models Overview
§ There are several parallel programming models in common use:
– Shared Memory
– Message Passing
– Data Parallel
§ Parallel programming models exist as an abstraction above hardware and memory architectures.
§ Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware.
FIT3143 Parallel Computing 39

Single Program Multiple Data (SPMD) structure
• Another programming structure we may use is the Single Program Multiple Data (SPMD) structure.
• In this structure, a single source program is written and each processor will execute its personal copy of this program, although independently, and not in synchrony.
• This source program can be constructed so that parts of the program are executed by certain computers and not others depending on the identity of the computer.
• For a master-slave structure, the program could have parts for the master and parts for the slaves (see the MPI sketch below).
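A minimal SPMD sketch, assuming C with MPI (the slide does not mandate a particular library): every process runs the same source program and branches on its own identity (rank), so rank 0 executes the master part while every other rank executes the slave part.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0) {
        /* Part of the single program executed only by the master. */
        printf("Master: coordinating %d slave process(es)\n", size - 1);
    } else {
        /* Part of the same program executed by every slave. */
        printf("Slave %d of %d processes\n", rank, size);
    }

    MPI_Finalize();
    return 0;
}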
FIT3143 Parallel Computing 40

Multiple Program Multiple Data (MPMD) structure
• Within the MIMD classification, each processor will have its own program to execute. This could be described as MPMD.
• In this case, some of the programs to be executed could be copies of the same program.
• Typically only two source programs are written, one for the designated master processor, and one for the remaining processors, which are called slave processors.
FIT3143 Parallel Computing 41

Three ways to create executable code for a shared memory multiprocessor
1. Parallel Programming Languages
• With special parallel programming constructs and statements that allow shared variables and parallel code sections to be declared.
• Then the compiler is responsible for producing the final executable code.
2. Threads
• Contain regular high-level language code sequences for individual processors. These code sequences can then access shared locations (see the Pthreads sketch after this list).
3. Sequential programming
• Use a regular sequential programming language and modify its syntax to specify parallelism.
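Illustrating option 2, here is a minimal POSIX-threads sketch (an assumed example; the slide does not name a threading library). Several threads run ordinary C code and read the same shared array; a mutex protects the shared running total.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define N 1000

/* Shared locations visible to every thread. */
static int data[N];
static long total = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg) {
    long id = (long)arg;
    int chunk = N / NTHREADS;
    long s = 0;
    for (int i = (int)id * chunk; i < (int)(id + 1) * chunk; i++)
        s += data[i];
    pthread_mutex_lock(&lock);      /* protect the shared variable */
    total += s;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < N; i++) data[i] = 1;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, partial_sum, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld (expected %d)\n", total, N);
    return 0;
}

Compile with the platform’s thread support enabled, for example gcc -pthread.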
FIT3143 Parallel Computing 42

Programming message-passing multicomputers
• Dividing the problem into parts that are intended to be executed simultaneously to solve the problem.
• Common approach is to use message-passing library routines that are inserted into a conventional sequential program for message passing.
• A problem is divided into a number of concurrent processes that may be executed on a different computer.
• Processes communicate by sending messages; this will be the only way to distribute data and results between processes (see the MPI sketch below).
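A minimal message-passing sketch, assuming C with MPI library routines inserted into an otherwise sequential program (the choice of MPI is an assumption, not something the slide prescribes): process 0 sends a value to process 1, and that message is the only way the data moves between the two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;   /* hypothetical partial result to distribute */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run it with at least two processes, for example mpirun -np 2 ./a.out.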
FIT3143 Parallel Computing 43

3. Parallel computing performance
FIT3143 Parallel Computing 44

Potential for increased computational speed
In all forms of MIMD multiprocessor/multicomputer systems, it is necessary to divide the computations into tasks or processes that can be executed simultaneously.
The size of a process can be described by its granularity, which is related to the number of processors being used.
In coarse granularity, each process contains a large number of sequential instructions and takes substantial time to execute.
FIT3143 Parallel Computing 45

In fine granularity, a process might consist of a few instructions, or perhaps even one instruction.
Medium granularity describes the middle ground.
Sometimes granularity is defined as the size of the computation between communication or synchronization points.
Generally, we want to increase the granularity to reduce the cost of process creation and inter-process communication, but of course this will likely reduce the number of concurrent processes and amount of parallelism. A suitable compromise has to be made.
FIT3143 Parallel Computing 46

The ratio
Computation / Communication = Computation time / Communication time = t_comp / t_comm
can be used as a granularity metric.
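As a purely illustrative example with made-up figures: if a process computes for 10 ms between messages (t_comp = 10 ms) and each message exchange costs 2 ms (t_comm = 2 ms), the ratio is 5. A coarser decomposition that computes for 50 ms between the same messages raises the ratio to 25, at the cost of fewer concurrent processes.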
FIT3143 Parallel Computing

Speed-Up factor
A measure of relative performance between a multiprocessor system and a single processor system is the speed-up factor, S(n), defined as
S(n) = Execution time using one processor (single processor system) / Execution time using a multiprocessor with n processors
FIT3143 Parallel Computing 48

The factor S(n) gives the increase in speed due to using a multiprocessor
For comparing a parallel solution with a sequential solution, we will use the fastest known sequential algorithm for running on a single processor. The underlying algorithm for a parallel implementation might be (and is usually) different.
The speed-up factor can also be cast in terms of computational steps:
S(n) = Number of computational steps using one processor / Number of parallel computational steps with n processors
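As a purely illustrative example with made-up figures: if the fastest sequential program takes 100 s and the parallel version on n = 8 processors takes 20 s, then S(8) = 100 / 20 = 5.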
The maximum speedup is n with n processors (linear speed-up).
FIT3143 Parallel Computing 49

Superlinear Speedup
where S(n) > n, may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formation.
One common reason for superlinear speedup is the extra memory in the multiprocessor system, which can hold more of the problem data at any instant; this leads to less of the relatively slow disk-memory traffic.
Superlinear speedup can occur in search algorithms.
FIT3143 Parallel Computing 50

Space-time diagram of a message passing program
It is reasonable to expect that some part of a computation cannot be divided at all into concurrent processes and must be performed serially.
During the initialization period, or the period before the concurrent processes are set up, only one processor is doing useful work; for the rest of the computation, additional processors operate in parallel.
