Computer Science, University of Warwick

Related Technologies

• HPC covers a wide range of technologies:

• Computer architecture

• Networking

• Compilers

• Algorithms

• Workload and resource manager

• A big HPC system handles many parallel programs from different users

• Task scheduling and resource allocation

• Metrics: system throughput, resource utilization, mean response time (see the sketch below)
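A minimal sketch (illustrative only, not from the slides) of how these three metrics could be computed from a hypothetical log of completed jobs; the JobRecord fields and the numbers are assumptions, not output from a real resource manager:

#include <algorithm>
#include <cstdio>
#include <vector>

// One completed job, as a hypothetical scheduler might log it.
struct JobRecord {
    double submit;   // submission time (s)
    double start;    // time the job started running (s)
    double finish;   // completion time (s)
    int    cores;    // cores allocated
};

int main() {
    std::vector<JobRecord> log = {
        {0.0, 0.0, 10.0, 4}, {1.0, 2.0, 8.0, 2}, {3.0, 10.0, 15.0, 8}};
    const int total_cores = 16;                   // size of the (imaginary) system

    double makespan = 0.0, busy = 0.0, resp = 0.0;
    for (const JobRecord& j : log) {
        makespan = std::max(makespan, j.finish);  // when the last job finished
        busy += (j.finish - j.start) * j.cores;   // core-seconds actually used
        resp += j.finish - j.submit;              // response time = completion - submission
    }
    std::printf("throughput    = %.2f jobs/s\n", log.size() / makespan);
    std::printf("utilization   = %.2f\n", busy / (total_cores * makespan));
    std::printf("mean response = %.2f s\n", resp / log.size());
    return 0;
}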



History and Evolution of HPC Systems

1960s: Scalar processor

– Processes one data item at a time


Scalar processor


History and Evolution of HPC Systems

1960s: Scalar processor

1970s: Vector processor

– Can process an array of data items in one go (see the SIMD sketch after this list)

– Architecture: one master processor and many math co-processors (ALUs)

– Each time, the master processor fetches an instruction and a vector of data items and feeds them to the ALUs

– Overhead: more complicated address decoding and data-fetching procedure

– Think about the difference between a vector processor and a scalar processor
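To make the contrast concrete, here is a minimal sketch of scalar versus vector addition of two arrays. It uses x86 AVX intrinsics purely as an illustration of one instruction operating on several data items at once; classic vector processors and GPUs generalize the same idea to much wider vectors. (The AVX route is my choice of example, not something taken from the slides.)

#include <immintrin.h>

// Scalar version: one data item per add instruction.
void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Vector (SIMD) version: eight floats per add instruction.
void add_vector(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);             // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);             // load 8 floats from b
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb)); // add and store 8 at once
    }
    for (; i < n; ++i)                                   // handle any leftover items
        c[i] = a[i] + b[i];
}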


GPU (Vector processor)

GPU: Graphics Processing Unit

The GPU is treated as a PCIe device by the host CPU




Data processing on GPU

– CUDA: programming on the GPU (see the sketch below)

– Arrays A and B are fetched in one memory access operation

– Different threads process different data items

– If there is not much parallel processing, the GPU can be slower due to the overhead
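A minimal CUDA sketch of this idea (not the lecture's exact example): each thread adds one pair of elements, so different threads process different data items in parallel, and the kernel-launch overhead only pays off when there is enough parallel work.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element: thread i computes C[i] = A[i] + B[i].
__global__ void vecAdd(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);                    // unified memory, for brevity
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // enough blocks to cover n items
    vecAdd<<<blocks, threads>>>(A, B, C, n);
    cudaDeviceSynchronize();

    std::printf("C[0] = %.1f (expected 3.0)\n", C[0]);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}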


History and Evolution of HPC Systems

1960s: Scalar processor

1970s: Vector processor

Late 1980s: Massively Parallel Processing (MPP)

– Up to thousands of processors, each with its own memory

– Processors can fetch and run instructions in parallel

– The workload of a parallel program is broken down across the processors

• Issues: workload balance and inter-processor communication

– Think about the difference between MPP and a vector processor


Architecture of BlueGene/L (MPP)

– Established the philosophy of constructing supercomputers from a massive number of low-performance processors


History and Evolution of HPC Systems

1960s: Scalar processor

1970s: Vector processor

Late 1980s: Massively Parallel Processing (MPP)

Late 1990s: Cluster

– Connect stand-alone computers with a high-speed, over-cable network

• Commodity off-the-shelf computers

• High-speed networks: Gigabit Ethernet, InfiniBand

• Over-cable network vs. on-board network

– Not a new idea in itself, but one that attracted renewed interest

• Performance improvements in CPUs and networking

• Advantage over custom-designed mainframe computers: good portability


Cluster Architecture


History and Evolution of HPC Systems

1960s: Scalar processor

1970s: Vector processor

Late 1980s: Massively Parallel Processing (MPP)

Late 1990s: Cluster

Late 1990s: Grid

– Integrates geographically distributed resources

– A further evolution of cluster computing

– Draws an analogy with the power grid


History and Evolution of HPC Systems

1960s: Scalar processor

1970s: Vector processor

Late 1980s: Massively Parallel Processing (MPP)

Late 1990s: Cluster

Late 1990s: Grid

Since the 2000s: Cloud

– Commercialization of Grid and Cluster computing

– Use the resources and services provided by a third party (the Cloud service provider) and pay for what you use

– Virtualization technology: secure running environments, higher resource utilization


Architecture Types

– All the previous HPC systems can be divided into two architecture types

• Shared memory system

• Distributed memory system


Architecture Types

– Shared memory (uniform memory access: SMP)

• Multiple CPU cores, a single memory, shared I/O (multicore CPU)

• All resources in an SMP machine are equally available to each core

• Due to resource contention, uniform memory access systems do not scale well

• CPU cores share access to a common memory space

• Implemented over a shared system bus or switch

• Support for critical sections is required (see the sketch below)

• Local caches are critical:

• Without them, bus/switch contention (or network traffic) reduces the system's efficiency

• Caches introduce the problem of coherency (ensuring that stale cache lines are invalidated when other processors alter shared memory)
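A minimal sketch (host C++ threads; my example, not from the slides) of why critical sections matter on a shared-memory machine: several threads update one shared counter, and the mutex-protected critical section is what keeps the increments from racing.

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    long counter = 0;                              // shared memory, visible to all threads
    std::mutex m;

    auto work = [&]() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);   // critical section: one thread at a time
            ++counter;
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(work);
    for (auto& th : threads) th.join();

    // Without the lock, lost updates would make this total come out short.
    std::printf("counter = %ld (expected 400000)\n", counter);
    return 0;
}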


Architecture Types

• Shared memory (Non-Uniform Memory Access: NUMA)

• Multiple CPUs

• Each CPU has fast access to its local area of the memory, but slower access to other areas

• Scales well to a large number of processors due to the hierarchical memory access

• More complicated memory access pattern: local and remote memory addresses (see the sketch below)

• Global address space

(Figures: a node of a NUMA machine; a NUMA machine)
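A minimal sketch (Linux libnuma, compile with -lnuma; this library is my assumption, it is not mentioned in the slides) of placing data on a chosen NUMA node: threads running on that node's cores see fast local access, while threads on other nodes pay the slower remote-access cost.

#include <cstdio>
#include <numa.h>

int main() {
    if (numa_available() < 0) {                    // kernel / library support check
        std::printf("NUMA is not available on this machine\n");
        return 1;
    }
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

    size_t bytes = 1 << 20;
    // Ask for the buffer to be backed by node 0's local memory.
    double* data = static_cast<double*>(numa_alloc_onnode(bytes, 0));
    if (data != nullptr) {
        data[0] = 42.0;                            // first touch: the page now lives on node 0
        numa_free(data, bytes);
    }
    return 0;
}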


Architecture Types

– Distributed memory (MPP, cluster)

• Each processor has its own independent memory

• Interconnected through over-cable networks

• When processors need to exchange (or share) data, they must do so through explicit communication

• Message passing (e.g. the MPI library; see the sketch below)

• Typically larger latencies between processors

• Scalability is good if the task to be computed can be divided properly
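A minimal MPI sketch (C API; run with at least two ranks, e.g. mpirun -np 2): rank 0 sends an array to rank 1. On a distributed-memory system this explicit message is the only way the two processes can share the data.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {
        for (int i = 0; i < 4; ++i) buf[i] = i + 1.0;
        // Explicit communication: the data does not exist in rank 1's memory
        // until this message arrives.
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %.1f %.1f %.1f %.1f\n",
                    buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}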


Granularity of Parallelism

– Defined as the size of the computations that are being performed in parallel

– Four types of parallelism (in increasing order of granularity):

• Instruction-level parallelism (e.g. pipelining)

• Thread-level parallelism (e.g. running a GPU program)

• Process-level parallelism (e.g. running an MPI job on a cluster)

• Job-level parallelism (e.g. running a batch of independent jobs on a cluster)


Dependency and Parallelism

– Dependency: if instruction A must finish before instruction B can run, then B is dependent on A

– Two types of dependency (see the sketch after this list)

– Control dependency: waiting for the instruction that controls the execution flow to complete

• IF (X!=0) THEN Y=1.0/X: the statement Y=1.0/X has a control dependency on the test X!=0

– Data dependency: dependency arising from calculations or memory accesses

• Flow dependency: A=X+Y; B=A+C;

• Anti-dependency: B=A+C; A=X+Y;

• Output dependency: A=2; X=A+1; A=5;
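The slide's one-line examples, expanded into a small sketch with comments on which statements must stay ordered and which could, in principle, be reordered or renamed:

// Illustrative only; the variable names follow the slide's examples.
void dependency_examples(float X, float Y, float C) {
    float A, B;

    // Control dependency: the division may only execute if the test succeeds.
    if (X != 0.0f)
        Y = 1.0f / X;

    // Flow (true) dependency: B reads the A written just above,
    // so the two statements cannot be swapped.
    A = X + Y;
    B = A + C;

    // Anti-dependency: the first statement reads A, the second overwrites it.
    // Swapping them changes the result, but renaming A removes the conflict.
    B = A + C;
    A = X + Y;

    // Output dependency: both statements write A; reordering them changes
    // the value that later reads of A would see.
    A = 2.0f;
    X = A + 1.0f;
    A = 5.0f;

    (void)B; (void)X; (void)Y;   // silence unused-value warnings in this sketch
}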


Parallel computing vs. distributed computing

• Parallel Computing

• Breaking the problem to be computed into parts that can be run simultaneously on different processors

• Example: an MPI program to perform matrix multiplication

• Solves tightly coupled problems

• Distributed Computing

• Parts of the work to be computed are computed in different places (note: this does not necessarily imply simultaneous processing)

• Example: running a workflow on a Grid

• Solves loosely coupled problems (not much communication)