CS2305: Computer Architecture
Data-Level Parallelism
(Computer Architecture: Chapter 4)
Department of Computer Science and Engineering
Shanghai University
• 4.1 Introduction
• 4.2 Vector Architecture
• 4.3 SIMD Instruction Set Extensions
• 4.4 GPU
Flynn’s Classification (1966)
SISD: Single Instruction, Single Data
  • Conventional uniprocessor
SIMD: Single Instruction, Multiple Data
  • One instruction stream, multiple data paths
  • Distributed memory SIMD (MPP, DAP, CM-1&2, Maspar)
  • Shared memory SIMD (STARAN, vector computers)
MISD: Multiple Instruction, Single Data
  • Not a practical configuration
MIMD: Multiple Instruction, Multiple Data
  • Message passing machines (Transputers, nCube, CM-5)
  • Non-cache-coherent shared memory machines (BBN Butterfly, T3D)
  • Cache-coherent shared memory machines (Sequent, Sun Starfire, SGI Origin)
Approaches to Exploiting Parallelism
• Instruction-Level Parallelism (ILP): execute independent instructions from one instruction stream in parallel (pipelining, superscalar, VLIW)
• Thread-Level Parallelism (TLP): execute independent instruction streams in parallel (multithreading, multiple cores)
• Data-Level Parallelism (DLP): execute multiple operations of the same type in parallel (vector/SIMD execution)
Resurgence of DLP
• Convergence of application demands and technology constraints drives architecture choice
• New applications such as graphics, machine vision, speech recognition, and machine learning all require large numerical computations that are often trivially data parallel
• SIMD-based architectures (vector-SIMD, subword-SIMD, GPUs) are the most efficient way to execute these algorithms
Introduction
• SIMD architectures can exploit significant data-level parallelism for:
  - matrix-oriented scientific computing
  - media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
  - Needs to fetch only one instruction per data operation
  - Makes SIMD attractive for personal mobile devices
• Advantage compared to MIMD: SIMD allows the programmer to continue to think sequentially
• Single Instruction Multiple Data architectures make use of data parallelism
• We care about SIMD because of area and power efficiency concerns
  - Control overhead is amortized over the SIMD width
• However, parallelism must be exposed to the programmer & compiler
SIMD Parallelism
1) Vector architectures
2) SIMD extensions
3) Graphics Processor Units (GPUs)
• 4.1 Introduction
• 4.2 Vector Architecture
• 4.3 SIMD Instruction Set Extensions
• 4.4 GPU
Vector Architectures
• Basic idea:
  - Read sets of data elements into “vector registers”
  - Operate on those registers
  - Disperse the results back into memory
• Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth
Vector Supercomputers
Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory
The initial model weighed 5.5 tons including the Freon refrigeration system
Cray-1 (1976)
T0 (Torrent) Vector Microprocessor (1995)
Vector Supercomputer: NEC SX-6 (2003)
Earth System Simulator (NEC, ~2002)
Basic structure of VMIPS
• This processor has a scalar architecture just like MIPS.
• There are also eight 64-element vector registers, as well as a set of vector functional units.
• The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations.
• A set of crossbar switches (thick gray lines in the figure) connects these ports to the inputs and outputs of the vector functional units.
Architecture: VMIPS (running example)
• Vector registers
  - Each register holds a 64-element, 64 bits/element vector
  - Register file has 16 read ports and 8 write ports
• Vector functional units
  - Fully pipelined
  - Data and control hazards are detected
• Vector load-store unit
  - Fully pipelined
  - One word per clock cycle after initial latency
• Scalar registers
  - 32 general-purpose registers
  - 32 floating-point registers
Loosely based on Cray-1
Vector Programming Model
Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add a vector to a scalar
• LV/SV: vector load and vector store from an address
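A minimal illustrative sequence combining the three forms (register choices are arbitrary; semantics follow the VMIPS ISA):

    LV       V1,Rx       ; load 64 elements of X starting at address in Rx
    ADDVS.D  V2,V1,F0    ; V2[i] = V1[i] + F0 (scalar operand in F0)
    ADDVV.D  V3,V2,V1    ; V3[i] = V2[i] + V1[i]
    SV       Ry,V3       ; store V3 to memory starting at address in Ry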
Vector Instruction Set Advantages
• Compact
  - one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers (elements)
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
• Scalable
  - can run the same object code on more parallel pipelines or lanes
  - i.e., the code is machine independent
Example: DAXPY
Double Precision a*X plus Y
64 elements in total
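The loop being vectorized (the usual textbook form; X and Y are 64-element arrays of doubles, a is a scalar double):

    for (i = 0; i < 64; i=i+1)
        Y[i] = a * X[i] + Y[i];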
How Vector Processors Work: An Example
MIPS code:
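A sketch of the scalar loop, following the standard DAXPY sequence from Hennessy & Patterson (Rx and Ry hold the starting addresses of X and Y):

          L.D     F0,a         ; load scalar a
          DADDIU  R4,Rx,#512   ; last address to load (64 elements x 8 bytes)
    Loop: L.D     F2,0(Rx)     ; load X[i]
          MUL.D   F2,F2,F0     ; a * X[i]
          L.D     F4,0(Ry)     ; load Y[i]
          ADD.D   F4,F4,F2     ; a * X[i] + Y[i]
          S.D     F4,0(Ry)     ; store into Y[i]
          DADDIU  Rx,Rx,#8     ; increment index to X
          DADDIU  Ry,Ry,#8     ; increment index to Y
          DSUBU   R20,R4,Rx    ; compute bound
          BNEZ    R20,Loop     ; check if done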
How Vector Processors Work: An Example
VMIPS code:
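Again following the Hennessy & Patterson VMIPS sequence:

    L.D      F0,a       ; load scalar a
    LV       V1,Rx      ; load vector X
    MULVS.D  V2,V1,F0   ; vector-scalar multiply
    LV       V3,Ry      ; load vector Y
    ADDVV.D  V4,V2,V3   ; add
    SV       Ry,V4      ; store the result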
Almost 600 executed MIPS instructions -> just 6 VMIPS instructions
Vector Arithmetic Execution
• Note: parallel hardware is not required; a single lane executing one element per cycle suffices
• A deep pipeline also helps!
Vector Execution Time
• Execution time of a sequence of vector operations depends on three factors:
  - Length of the operand vectors
  - Structural hazards
  - Data dependencies
• VMIPS functional units consume one element per clock cycle
  - Execution time is approximately the vector length
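A rough worked example under the one-element-per-cycle assumption: a single ADDVV.D over 64 elements takes about 64 clock cycles plus pipeline startup latency, and a sequence of vector instructions that cannot overlap (because of structural hazards or data dependences) costs roughly 64 cycles per non-overlapping group.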
Vector Chaining
Chaining forwards each result element of one vector instruction to a dependent vector instruction as it is produced, so the consumer can start before the producer finishes; it is the vector analog of scalar operand forwarding.
Vector Chaining Advantage
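For instance, in the DAXPY sequence above, chaining lets ADDVV.D V4,V2,V3 begin consuming elements of V2 as soon as MULVS.D V2,V1,F0 produces them; without chaining, the add must wait for the entire multiply to complete, roughly doubling the time for the dependent pair.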
Optimizations
How can a vector processor execute a single vector faster than one element per clock cycle?
Multiple elements per clock cycle improve performance
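For example, with 4 parallel lanes, elements 0, 4, 8, ... go to lane 0, elements 1, 5, 9, ... to lane 1, and so on, so a 64-element vector instruction completes in about 64/4 = 16 cycles instead of 64.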
Vector Instruction Execution
T0 (Torrent) Vector Microprocessor (1995)
Optimizations (Cont.)
How do you program a vector computer?
Automatic Code Vectorization
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
Optimizations (Cont.)
How does a vector processor handle programs where the vector lengths are not the same as the length of the vector register (64 for VMIPS)?
Most application vectors do not match the architecture vector length
Vector Length Register
• Vector length not known at compile time?
• Use the Vector Length Register (VLR) to control the length of any vector operation, including vector loads and stores
Vector Stripmining
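A sketch of the standard strip-mined DAXPY loop from the textbook (MVL is the maximum vector length, 64 for VMIPS; the generated code sets VLR to VL before each strip):

    low = 0;
    VL = (n % MVL);                        /* find the odd-size piece first  */
    for (j = 0; j <= (n/MVL); j=j+1) {     /* outer loop, once per strip     */
        for (i = low; i < (low+VL); i=i+1) /* runs for length VL             */
            Y[i] = a * X[i] + Y[i];        /* main operation                 */
        low = low + VL;                    /* start of next strip            */
        VL = MVL;                          /* reset to maximum vector length */
    }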
Optimizations (Cont.)
What happens when there is an IF statement inside the code to be vectorized?
More code can vectorize if we can efficiently handle conditional statements
Vector Mask Registers
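The standard textbook illustration (from Hennessy & Patterson): to vectorize

    for (i = 0; i < 64; i=i+1)
        if (X[i] != 0)
            X[i] = X[i] - Y[i];

turn the condition into a mask and execute the subtraction only where the mask is set:

    LV       V1,Rx      ; load vector X into V1
    LV       V2,Ry      ; load vector Y
    L.D      F0,#0      ; load FP zero into F0
    SNEVS.D  V1,F0      ; sets mask bit VM(i) to 1 if V1(i) != F0
    SUBVV.D  V1,V1,V2   ; subtract under vector mask
    SV       Rx,V1      ; store the result in X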
Masked Vector Instructions
Optimizations (Cont.)
What does a vector processor need from the memory system?
Without sufficient memory bandwidth, vector execution can be futile
Vector Memory-Memory vs. Vector Register Machines
Memory Banks
Supplying Bandwidth for Vector Load/Store Units
• Most vector processors use memory banks, which allow multiple independent accesses rather than simple memory interleaving, for three reasons:
  - To support simultaneous accesses from multiple loads or stores, the memory system needs multiple banks and must be able to control the addresses to the banks independently
  - To support non-sequential data accesses (strided and indexed loads/stores)
  - To support multiple processors sharing the same memory system, each generating its own independent stream of addresses
• In combination, these features lead to a large number of independent memory banks (~thousands)
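A worked example along the lines of the one in Hennessy & Patterson: the Cray T90 (T932) has 32 processors, each able to generate 4 loads and 2 stores per clock; the processor clock is 2.167 ns while the SRAM cycle time is 15 ns. Sustaining 32 x 6 = 192 memory requests per clock, with each bank busy for ceil(15/2.167) = 7 processor clocks, requires 192 x 7 = 1344 banks; the actual T932 had 1024 banks, so it could not sustain full bandwidth to all processors simultaneously.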
Vector Memory System
Optimizations (Cont.)
How does a vector processor handle multi-dimensional matrices?
Multi-dimensional matrices are a popular data structure
Stride
• Consider the matrix multiply:

for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

• Must vectorize the multiplication of rows of B with columns of D
• Use non-unit stride
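A sketch of how the column access maps onto a strided load: assuming row-major storage of doubles, consecutive elements of a column of D are 100 x 8 = 800 bytes apart, and VMIPS's load-vector-with-stride instruction can fetch them (R1 holds the column's starting address):

    DADDIU  R2,R0,#800   ; byte stride between D[k][j] and D[k+1][j]
    LVWS    V1,(R1,R2)   ; V1[i] = Mem[R1 + i*800], one column of D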
Optimizations (Cont.)
How does a vector processor handle sparse matrices?
Sparse matrices are also a popular data structure
Scatter-Gather
• Consider:

for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];

• Use index vectors K[] and M[]:

LV      Vk, Rk          ; load K
LVI     Va, (Ra+Vk)     ; load A[K[]]
LV      Vm, Rm          ; load M
LVI     Vc, (Rc+Vm)     ; load C[M[]]
ADDVV.D Va, Va, Vc      ; add them
SVI     (Ra+Vk), Va     ; store A[K[]]