Chapter 4: Data-Level Parallelism
p 4.1 Introduction
p 4.2 Vector Architecture
p 4.3 SIMD Instruction Set Extensions
p 4.4 GPU
SIMD Multimedia Extensions
p Media applications operate on data types narrower than the native word size
m Example: disconnect carry chains to “partition” adder
p Support to handle short vectors added to existing ISAs
p Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
p Newer designs have 256-bit registers
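The partitioned adder can be imitated in software. The sketch below is only an illustration: the function name add_8x8 is hypothetical, and the masking trick stands in for a hardware adder whose carry chain is cut at every 8-bit boundary.

```c
#include <stdint.h>

/* Add eight unsigned 8-bit lanes packed into 64-bit words, with wraparound
 * inside each lane. No carry ever crosses from one lane into the next. */
uint64_t add_8x8(uint64_t a, uint64_t b)
{
    const uint64_t HI = 0x8080808080808080ULL;  /* top bit of every lane */

    /* Add the low 7 bits of each lane. The largest per-lane sum is
     * 0x7F + 0x7F = 0xFE, so any carry stays inside the lane (bit 7). */
    uint64_t low = (a & ~HI) + (b & ~HI);

    /* Top bit of each lane = a7 XOR b7 XOR carry-in. The carry-in already
     * sits in bit 7 of low; the carry out of bit 7 is simply dropped. */
    return low ^ ((a ^ b) & HI);
}
```

For example, adding 0x01 to a lane holding 0xFF wraps that lane to 0x00 and leaves the neighbouring lane unchanged, whereas an ordinary 64-bit add would carry into it.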
MMX Instructions
p Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
m optional signed/unsigned saturation (clamp to the largest or smallest representable value) on overflow
p Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
p Multiply in parallel: 4 16b
p Compare =, > in parallel: 8 8b, 4 16b, 2 32b
m sets field to 0s (false) or 1s (true); removes branches
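A lane-by-lane sketch in plain C of two ideas from this slide: saturating add, and a compare that produces an all-0s or all-1s field which And, And Not, and Or then use to select values without a branch. The function names and the choice of unsigned 8-bit lanes are assumptions made for illustration; they are not the MMX instructions themselves.

```c
#include <stdint.h>

enum { LANES = 8 };  /* eight 8-bit fields, as in one 64-bit register */

/* Unsigned saturating add: a lane that overflows is set to 255. */
void add_sat_u8(const uint8_t a[LANES], const uint8_t b[LANES],
                uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* Compare a > b per lane: the field becomes all 1s (true) or all 0s (false). */
void cmp_gt_u8(const uint8_t a[LANES], const uint8_t b[LANES],
               uint8_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = (a[i] > b[i]) ? 0xFF : 0x00;
}

/* Branch-free select: out = mask ? x : y, built from And, And Not, Or. */
void select_u8(const uint8_t mask[LANES], const uint8_t x[LANES],
               const uint8_t y[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)((mask[i] & x[i]) | ((uint8_t)~mask[i] & y[i]));
}
```

Calling cmp_gt_u8(a, b, mask) followed by select_u8(mask, a, b, out) yields the per-lane maximum of a and b with no data-dependent branch, which is the sense in which the compare "removes branches".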
Example SIMD Code for DAXPY
      L.D     F0,a         ;load scalar a
      MOV     F1,F0        ;copy a into F1 for SIMD MUL
      MOV     F2,F0        ;copy a into F2 for SIMD MUL
      MOV     F3,F0        ;copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512   ;last address to load
Loop: L.4D    F4,0[Rx]     ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0     ;aX[i], aX[i+1], aX[i+2], aX[i+3]
      L.4D    F8,0[Ry]     ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4     ;aX[i]+Y[i], ..., aX[i+3]+Y[i+3]
      S.4D    0[Ry],F8     ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32    ;increment index to X
      DADDIU  Ry,Ry,#32    ;increment index to Y
      DSUBU   R20,R4,Rx    ;compute bound
      BNEZ    R20,Loop     ;check if done
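For readers more comfortable with C, the loop above corresponds roughly to the sketch below. The function name daxpy_4wide is hypothetical, and n is assumed to be a multiple of 4. In the assembly, 512 bytes are covered 32 bytes at a time, which is 64 doubles in 16 iterations.

```c
#include <stddef.h>

/* Y = a*X + Y, four doubles per iteration, mirroring the 4-wide
 * L.4D / MUL.4D / ADD.4D / S.4D sequence in the assembly listing.
 * Assumption: n is a multiple of 4. */
void daxpy_4wide(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i += 4) {
        double t[4];
        for (int k = 0; k < 4; k++)      /* L.4D + MUL.4D: t = a * X[i..i+3] */
            t[k] = a * x[i + k];
        for (int k = 0; k < 4; k++)      /* L.4D + ADD.4D + S.4D on Y[i..i+3] */
            y[i + k] = t[k] + y[i + k];
    }
}
```

The inner k loops stand for a single SIMD instruction each; a compiler targeting a 256-bit SIMD extension may or may not turn them into such instructions.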
Why are Multimedia SIMD instructions so popular?
p Cost little to add to the standard arithmetic unit
p Require little extra state compared to vector architectures
p Do not need the high memory bandwidth that a vector architecture requires and that many computers don’t have
p Avoid the virtual memory problems of vector architectures, where a single instruction that generates 64 memory accesses can take a page fault in the middle of the vector
Three major omissions in SIMD vs. vector
p Limited instruction set:
m fixed number of data operands in opcode
m no vector length control (several fixed lengths)
m no strided load/store or scatter/gather
p Limited vector register length:
m requires superscalar dispatch to keep multiply/add/load units busy
p Limited mask registers:
m few or no mask registers to support conditional execution of elements, as vector processors have
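One consequence of having no vector length control can be sketched in C. With a fixed group width, an array whose length is not a multiple of that width needs a separate scalar cleanup loop, whereas a vector architecture would shorten the last vector through its vector length register. The function name daxpy_any_n and the width of 4 are assumptions for illustration.

```c
#include <stddef.h>

enum { WIDTH = 4 };  /* fixed number of elements per SIMD-style group */

/* Y = a*X + Y for any n: full groups of WIDTH elements first, then a
 * scalar loop for the n % WIDTH elements that do not fill a group. */
void daxpy_any_n(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    /* Main loop: each pass stands for one fixed-width SIMD group. */
    for (; i + WIDTH <= n; i += WIDTH)
        for (size_t k = 0; k < WIDTH; k++)
            y[i + k] += a * x[i + k];

    /* Cleanup loop: leftover elements are handled one at a time. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

With n = 10 the main loop handles elements 0 to 7 in two groups and the cleanup loop handles elements 8 and 9.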