程序代写代做代考 assembly mips x86 Java Fortran chain matlab Microsoft PowerPoint – Comp Arch Chapter_03_stu [ퟸ펟 모ëfiœ]

Microsoft PowerPoint – Comp Arch Chapter_03_stu [ퟸ펟 모ëfiœ]

COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface

5th
Edition

Chapter 3
Arithmetic for Computers

Chapter 3 — Arithmetic for Computers — 2

Arithmetic for Computers

 Operations on integers
 Addition and subtraction

 Multiplication and division

 Dealing with overflow

 Floating-point real numbers
 Representation and operations

§
3
.1

In
tro

d
u
ctio

n

Chapter 3 — Arithmetic for Computers — 3

Integer Addition

 Example: 7 + 6

§
3
.2

A
d
d

itio
n
a

n
d
S

u
b
tra

ctio
n

 Overflow if result out of range
 Adding +ve and –ve operands, no overflow

 Adding two +ve operands
 Overflow if result sign is 1

 Adding two –ve operands
 Overflow if result sign is 0

(Since the sum is not ______ than one of the operand)

(Speed up of addition: carry _________ adder; carry _____ adder)

( _______ carry adder)

Chapter 3 — Arithmetic for Computers — 4

Integer Subtraction

 Add negation of second operand

 Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111
–6: 1111 1111 … 1111 1010
+1: 0000 0000 … 0000 0001

 Overflow if result out of range
 Subtracting two +ve or two –ve operands, no overflow

 Subtracting +ve from –ve operand
 Overflow if result sign is 0

 Subtracting –ve from +ve operand
 Overflow if result sign is 1

(Since this is equivalent with addition of operands of _________ sign)

Chapter 3 — Arithmetic for Computers — 5

Dealing with Overflow

 Some languages (e.g., C) ignore overflow
 Use MIPS addu, addui, subu instructions

 Other languages (e.g., Ada, Fortran)
require raising an exception
 Use MIPS add, addi, sub instructions
 On overflow, invoke exception handler

 Save PC in exception program counter (EPC)
register

 Jump to predefined handler address
 mfc0 (move from coprocessor reg) instruction can

retrieve EPC value, to return after corrective action

(also Java)

(Interrupt: exception coming
from ________)

($k0  EPC; jr $___)

(Since some integer arithmetic for handling _________)

Chapter 3 — Arithmetic for Computers — 6

Arithmetic for Multimedia

 Graphics and media processing operates
on vectors of 8-bit and 16-bit data
 Use 64-bit adder, with partitioned carry chain

 Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors

 SIMD (single-instruction, multiple-data)

 Saturating operations
 On overflow, result is largest representable

value
 c.f. 2s-complement modulo arithmetic

 E.g., clipping in audio, saturation in video

Chapter 3 — Arithmetic for Computers — 7

Multiplication

 Start with long-multiplication approach

1000
× 1001

1000
0000
0000
1000
1001000

Length of product is
the sum of operand
lengths

multiplicand

multiplier

product

§
3
.3

M
u
ltip

lica
tio

n

Chapter 3 — Arithmetic for Computers — 8

Multiplication Hardware

Initially 0

Chapter 3 — Arithmetic for Computers — 9

Multiplication Example
1000

× 1001
1000
0000
0000
1000
1001000

Chapter 3 — Arithmetic for Computers — 10

Optimized Multiplier

 Perform steps in parallel: add/shift

 One cycle per partial-product addition
 That’s ok, if frequency of multiplications is low

Iteration Multiplicand Product
0 0010 0000 0011
1 0010 ____ 0011

0010 0001 0001
2 0010 ____ 0001

0010 0001 1000
3 0010 ____ 1100
4 0010 0000 0110

Chapter 3 — Arithmetic for Computers — 11

Faster Multiplier

 Uses multiple adders
 Cost/performance tradeoff

 Can be pipelined
 Several multiplication performed in parallel

(16+8+4+2+1
= 31 adders;
5 add times)

1011……1101 (32-bit)

x 1101……1011

1011……1101

1011……1101

0000……0000

1011……1101

1011……1101

+ 1011……1101

(Width? ___ )

Chapter 3 — Arithmetic for Computers — 12

Faster Multiplier

 Carry save adder

1011

x 1101

1011 (F)

0000 (E)

1011 (B)

+ 1011 (A)

100001 (A+B)

+ 000

1000010 (A+B+E)

+ 101

10001111 (A+B+E+F)

1011

x 1101

1011 (F)

0000 (E)

1011 (B)

+ 1011 (A)

100111 (Sum of F+E+B)

001000 (Carry of F+E+B)

+ 1011

1101111 (Sum of F+E+B+A)

+ 0010000 (Carry of F+E+B+A)

10001111 (F+E+B+A)

Chapter 3 — Arithmetic for Computers — 13

MIPS Multiplication

 Two 32-bit registers for product
 HI: most-significant 32 bits

 LO: least-significant 32-bits

 Instructions
 mult rs, rt / multu rs, rt

 64-bit product in HI/LO

 mfhi rd / mflo rd

 Move from HI/LO to rd

 Can test HI value to see if product overflows 32 bits

 mul rd, rs, rt

 Least-significant 32 bits of product –> rd

($rd = Hi)

Chapter 3 — Arithmetic for Computers — 14

Division
 Check for 0 divisor
 Long division approach

 If divisor ≤ dividend bits
 1 bit in quotient, subtract

 Otherwise
 0 bit in quotient, bring down next

dividend bit

 Restoring division
 Do the subtract, and if remainder

goes < 0, add divisor back  Signed division  Divide using absolute values  Adjust sign of quotient and remainder as required 1001 1000 1001010 -1000 10 101 1010 -1000 10 n-bit operands yield n-bit quotient and remainder quotient dividend remainder divisor § 3 .4 D ivisio n (dividend = quotient x divisor + ________ ) (Sign of dividend and _________ must be same) Chapter 3 — Arithmetic for Computers — 15 Division Hardware Initially dividend Initially divisor in left half Chapter 3 — Arithmetic for Computers — 16 Division Hardware Example Chapter 3 — Arithmetic for Computers — 17 Optimized Divider  One cycle per partial-remainder subtraction  Looks a lot like a multiplier!  Same hardware can be used for both Iteration Divisor Remainder 0000 111- ____ 110- 2 + 1110 = ____ 1100 ____ 1100 ____ 100- 3 + 1110 = ____ 1001 1 + ____ = 1110 1110 ____ 1110 0 0010 0000 0111 ____ 001- = ____ 0011 Remainder Quotient Remainder 4 + 1110 (Restoring division) (Non-restoring division) 0000 0111 0000 111- + 1110 = 1110 1110 ____ 110- + ____ = 1111 1100 ____ 100- + 0010 = 0001 1001 ____ 001- + ____ = 0001 0011 (__________ division: when the current quotient bit is __, (r + d) x 2 − d = 2r + 2d − d = 2r + d)) (restoring) (shift left) (subtraction for next quotient bit) (shift left) (addition for next quotient bit) Chapter 3 — Arithmetic for Computers — 18 Faster Division  Can’t use parallel hardware as in multiplier  Subtraction is conditional on sign of remainder  Faster dividers (e.g. SRT devision) generate multiple quotient bits per step  Still require multiple steps (Sweeney, Robertson, and Tocher division; popular method for division in many microprocessor; similar to non-restoring division, but it uses a _______ _____ based on 6 bit remainder and 4 bit divisor to determine 4 bit quotient) Chapter 3 — Arithmetic for Computers — 19 MIPS Division  Use HI/LO registers for result  HI: 32-bit remainder  LO: 32-bit quotient  Instructions  div rs, rt / divu rs, rt  No overflow or divide-by-0 checking  Software must perform checks if required  Use mfhi, mflo to access result (Lo = $rs/$rt; Hi = $rs mod $rt) (+7 / +2 : Quotient = +3, Remainder = +1) (–7 / +2 : Quotient = __, Remainder = __) (+7 / –2 : Quotient = __, Remainder = __) (–7 / –2 : Quotient = __, Remainder = __) (Sign of dividend and __________ must be same) Chapter 3 — Arithmetic for Computers — 20 Floating Point  Representation for non-integral numbers  Including very small and very large numbers  Like scientific notation  –2.34 × 1056  +0.002 × 10–4  +987.02 × 109  In binary  ±1.xxxxxxx2 × 2 yyyy  Types float and double in C normalized not normalized § 3 .5 F lo a tin g P o in t Chapter 3 — Arithmetic for Computers — 21 Floating Point Standard  Defined by IEEE Std 754-1985  Developed in response to divergence of representations  Portability issues for scientific code  Now almost universally adopted  Two representations  Single precision (32-bit)  Double precision (64-bit) Chapter 3 — Arithmetic for Computers — 22 IEEE Floating-Point Format  S: sign bit (0  non-negative, 1  negative)  Normalize significand: 1.0 ≤ |significand| < 2.0  Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)  Significand is Fraction with the “1.” restored  Exponent: excess representation: actual exponent + Bias  Ensures exponent is unsigned  Single: Bias = 127; Double: Bias = 1203 S Exponent Fraction single: 8 bits double: 11 bits single: 23 bits double: 52 bits Bias)(ExponentS 2Fraction)(11)(x  (mantissa, significand; decides _________)(decides the ______) Chapter 3 — Arithmetic for Computers — 23 Single-Precision Range  Exponents 00000000 and 11111111 reserved  Smallest value  Exponent: 00000001  actual exponent = 1 – 127 = –126  Fraction: 000…00  significand = 1.0  ±1.0 × 2–126 ≈ ±1.2 × 10–38  Largest value  exponent: 11111110  actual exponent = 254 – 127 = +127  Fraction: 111…11  significand ≈ 2.0  ±2.0 × 2+127 ≈ ±3.4 × 10+38 Chapter 3 — Arithmetic for Computers — 24 Double-Precision Range  Exponents 0000…00 and 1111…11 reserved  Smallest value  Exponent: 00000000001  actual exponent = 1 – 1023 = –1022  Fraction: 000…00  significand = 1.0  ±1.0 × 2–1022 ≈ ±2.2 × 10–308  Largest value  Exponent: 11111111110  actual exponent = 2046 – 1023 = +1023  Fraction: 111…11  significand ≈ 2.0  ±2.0 × 2+1023 ≈ ±1.8 × 10+308 Chapter 3 — Arithmetic for Computers — 25 Floating-Point Precision  Relative precision  all fraction bits are significant  Single: approx 2–23  Equivalent to 23 × log102 ≈ 23 × 0.3 ≈ 6 decimal digits of precision  Double: approx 2–52  Equivalent to 52 × log102 ≈ 52 × 0.3 ≈ 16 decimal digits of precision (IEEE 754-2008: 16-bit ____ precision; ___-bit quadruple precision) Chapter 3 — Arithmetic for Computers — 26 Floating-Point Example  Represent –0.75  –0.75 = (–1)1 × 1.12 × 2 –1  S = 1  Fraction = 1000…002  Exponent = –1 + Bias  Single: –1 + 127 = 126 = 011111102  Double: –1 + 1023 = 1022 = 011111111102  Single: 1011111101000…00  Double: 1011111111101000…00 Chapter 3 — Arithmetic for Computers — 27 Floating-Point Example  What number is represented by the single- precision float 11000000101000…00  S = 1  Fraction = 01000…002  Fxponent = 100000012 = 129  x = (–1)1 × (1 + 012) × 2 (129 – 127) = (–1) × 1.25 × 22 = –5.0 Chapter 3 — Arithmetic for Computers — 28 Denormal Numbers  Exponent = 000...0  hidden bit is 0  Smaller than normal numbers  allow for gradual underflow, with diminishing precision  Denormal with fraction = 000...0 BiasS 2Fraction)(01)(x  0.0 BiasS 20)(01)(x Two representations of 0.0! Bias)(ExponentS 2Fraction)(11)(x  Chapter 3 — Arithmetic for Computers — 29 Infinities and NaNs  Exponent = 111...1, Fraction = 000...0  ±Infinity  Can be used in subsequent calculations, avoiding need for overflow check  Exponent = 111...1, Fraction ≠ 000...0  Not-a-Number (NaN)  Indicates illegal or undefined result  e.g., 0.0 / 0.0  Can be used in subsequent calculations Chapter 3 — Arithmetic for Computers — 30 Floating-Point Addition  Consider a 4-digit decimal example  9.999 × 101 + 1.610 × 10–1  1. Align decimal points  Shift number with smaller exponent  9.999 × 101 + 0.016 × 101  2. Add significands  9.999 × 101 + 0.016 × 101 = 10.015 × 101  3. Normalize result & check for over/underflow  1.0015 × 102  4. Round and renormalize if necessary  1.002 × 102 Chapter 3 — Arithmetic for Computers — 31 Floating-Point Addition  Now consider a 4-digit binary example  1.0002 × 2 –1 + –1.1102 × 2 –2 (0.5 + –0.4375)  1. Align binary points  Shift number with smaller exponent  1.0002 × 2 –1 + –0.1112 × 2 –1  2. Add significands  1.0002 × 2 –1 + –0.1112 × 2 –1 = 0.0012 × 2 –1  3. Normalize result & check for over/underflow  1.0002 × 2 –4, with no over/underflow  4. Round and renormalize if necessary  1.0002 × 2 –4 (no change) = 0.0625 Chapter 3 — Arithmetic for Computers — 32 FP Adder Hardware  Much more complex than integer adder  Doing it in one clock cycle would take too long  Much longer than integer operations  Slower clock would penalize all instructions  FP adder usually takes several cycles  Can be pipelined Chapter 3 — Arithmetic for Computers — 33 FP Adder Hardware Step 1 Step 2 Step 3 Step 4 Chapter 3 — Arithmetic for Computers — 34 Floating-Point Multiplication  Consider a 4-digit decimal example  1.110 × 1010 × 9.200 × 10–5  1. Add exponents  For biased exponents, subtract bias from sum  New exponent = 10 + –5 = 5  2. Multiply significands  1.110 × 9.200 = 10.212  10.212 × 105  3. Normalize result & check for over/underflow  1.0212 × 106  4. Round and renormalize if necessary  1.021 × 106  5. Determine sign of result from signs of operands  +1.021 × 106 Chapter 3 — Arithmetic for Computers — 35 Floating-Point Multiplication  Now consider a 4-digit binary example  1.0002 × 2 –1 × –1.1102 × 2 –2 (0.5 × –0.4375)  1. Add exponents  Unbiased: –1 + –2 = –3  Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127  2. Multiply significands  1.0002 × 1.1102 = 1.1102  1.1102 × 2–3  3. Normalize result & check for over/underflow  1.1102 × 2 –3 (no change) with no over/underflow  4. Round and renormalize if necessary  1.1102 × 2 –3 (no change)  5. Determine sign: +ve × –ve  –ve  –1.1102 × 2 –3 = –0.21875 (FP division: similar to multiplication, but exponent must be _________ and the bias needs to be ________ ) Chapter 3 — Arithmetic for Computers — 36 FP Arithmetic Hardware  FP multiplier is of similar complexity to FP adder  But uses a multiplier for significands instead of an adder  FP arithmetic hardware usually does  Addition, subtraction, multiplication, division, reciprocal, square-root  FP  integer conversion  Operations usually takes several cycles  Can be pipelined Chapter 3 — Arithmetic for Computers — 37 FP Instructions in MIPS  FP hardware is coprocessor 1  Adjunct processor that extends the ISA  Separate FP registers  32 single-precision: $f0, $f1, … $f31  Paired for double-precision: $f0/$f1, $f2/$f3, …  Release 2 of MIPs ISA supports 32 × 64-bit FP reg’s  FP instructions operate only on FP registers  Programs generally don’t do integer ops on FP data, or vice versa  More registers with minimal code-size impact  FP load and store instructions  lwc1, ldc1, swc1, sdc1  e.g., ldc1 $f8, 32($sp) (Their id is $f2) (load word to __________ 1, the floating-point unit) Chapter 3 — Arithmetic for Computers — 38 FP Instructions in MIPS  Single-precision arithmetic  add.s, sub.s, mul.s, div.s  e.g., add.s $f0, $f1, $f6  Double-precision arithmetic  add.d, sub.d, mul.d, div.d  e.g., mul.d $f4, $f4, $f6  Single- and double-precision comparison  c.xx.s, c.xx.d (xx is eq, lt, le, …)  Sets or clears FP condition-code bit  e.g. c.lt.s $f3, $f4  Branch on FP condition code true or false  bc1t, bc1f  e.g., bc1t TargetLabel (ne, gt, ge) Chapter 3 — Arithmetic for Computers — 39 FP Example: °F to °C  C code: float f2c (float fahr) { return ((5.0/9.0)*(fahr - 32.0)); }  fahr in $f12, result in $f0, literals in global memory space  Compiled MIPS code: f2c: lwc1 $f16, const5($gp) lwc2 $f18, const9($gp) div.s $f16, $f16, $f18 lwc1 $f18, const32($gp) sub.s $f18, $f12, $f18 mul.s $f0, $f16, $f18 jr $ra ($f16 = 5.0) (Some _________ execute 5.0/9.0) Chapter 3 — Arithmetic for Computers — 40 FP Example: Array Multiplication  X = X + Y × Z  All 32 × 32 matrices, 64-bit double-precision elements  C code: void mm (double x[][], double y[][], double z[][]) { int i, j, k; for (i = 0; i! = 32; i = i + 1) for (j = 0; j! = 32; j = j + 1) for (k = 0; k! = 32; k = k + 1) x[i][j] = x[i][j] + y[i][k] * z[k][j]; }  Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2 Chapter 3 — Arithmetic for Computers — 41 FP Example: Array Multiplication  MIPS code: li $t1, 32 # $t1 = 32 (row size/loop end) li $s0, 0 # i = 0; initialize 1st for loop L1: li $s1, 0 # j = 0; restart 2nd for loop L2: li $s2, 0 # k = 0; restart 3rd for loop sll $t2, $s0, 5 # $t2 = i * 32 (size of row of x) addu $t2, $t2, $s1 # $t2 = i * size(row) + j sll $t2, $t2, 3 # $t2 = byte offset of [i][j] addu $t2, $a0, $t2 # $t2 = byte address of x[i][j] l.d $f4, 0($t2) # $f4 = 8 bytes of x[i][j] L3: sll $t0, $s2, 5 # $t0 = k * 32 (size of row of z) addu $t0, $t0, $s1 # $t0 = k * size(row) + j sll $t0, $t0, 3 # $t0 = byte offset of [k][j] addu $t0, $a2, $t0 # $t0 = byte address of z[k][j] l.d $f16, 0($t0) # $f16 = 8 bytes of z[k][j] … Chapter 3 — Arithmetic for Computers — 42 FP Example: Array Multiplication … sll $t0, $s0, 5 # $t0 = i*32 (size of row of y) addu $t0, $t0, $s2 # $t0 = i*size(row) + k sll $t0, $t0, 3 # $t0 = byte offset of [i][k] addu $t0, $a1, $t0 # $t0 = byte address of y[i][k] l.d $f18, 0($t0) # $f18 = 8 bytes of y[i][k] mul.d $f16, $f18, $f16 # $f16 = y[i][k] * z[k][j] add.d $f4, $f4, $f16 # f4=x[i][j] + y[i][k]*z[k][j] addiu $s2, $s2, 1 # $k k + 1 bne $s2, $t1, L3 # if (k != 32) go to L3 s.d $f4, 0($t2) # x[i][j] = $f4 addiu $s1, $s1, 1 # $j = j + 1 bne $s1, $t1, L2 # if (j != 32) go to L2 addiu $s0, $s0, 1 # $i = i + 1 bne $s0, $t1, L1 # if (i != 32) go to L1 ( ____-major (C, C++, Mathematica) order vs _______-major order (Fortran, MATLAB)) (Paired ______: add.ps $f0, $f2, $f4 ≡ add.s $f0, $f2, $f4 and then add.s $f1, $f3, $f5) Chapter 3 — Arithmetic for Computers — 43 Accurate Arithmetic  IEEE Std 754 specifies additional rounding control  Extra bits of precision (guard, round, sticky)  Choice of rounding modes  Allows programmer to fine-tune numerical behavior of a computation  Not all FP units implement all options  Most programming languages and FP libraries just use defaults  Trade-off between hardware complexity, performance, and market requirements (ex) 2.5610 x 10 0 + 2.3410 x 10 2 with guard/round; ______10 x 10 2 +2.3410 x 10 2 =2.365610 x10 2 = ____10 x 10 2 (ex) Without guard/round; _____10 x 10 2 +2.3410 x 10 2 =_____10 x10 2 (______ bit = 1 if there are nonzero bits to the right of the round bit; (ex) 5.0110 x 10 −1 + 2.3410 x 10 2 = _______ + 2.34 = 2.3450  _____; With sticky bit, _____ ) (round up, down, truncate, round to the nearest ____) (Halfway: if LSB is odd, add one; even, _______) (Fused multiply add: a = a + (b x c); rounding once after ________ to maximize the precision) (2.35  ___; 2.45  ____) Subword Parallellism  Graphics and audio applications can take advantage of performing simultaneous operations on short vectors  Example: 128-bit adder:  Sixteen 8-bit adds  Eight 16-bit adds  Four 32-bit adds  Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD) Chapter 3 — Arithmetic for Computers — 44 § 3 .6 P a ra lle lism a n d C o m p u te r A rith m e tic: S u b w o rd P a ra lle lism (__ bit/color x 3 colors + 8 bit for location of pixel; ____ bit sound) (two 64-bit adds; subword parallelism) Chapter 3 — Arithmetic for Computers — 45 x86 FP Architecture  Originally based on 8087 FP coprocessor  8 × 80-bit extended-precision registers  Used as a push-down stack  Registers indexed from TOS: ST(0), ST(1), …  FP values are 32-bit or 64 in memory  Converted on load/store of memory operand  Integer operands can also be converted on load/store  Very difficult to generate and optimize code  Result: poor FP performance § 3 .7 R e a l S tu ff: S tre a m in g S IM D E xte n sio n s a n d A V X in x8 6 Chapter 3 — Arithmetic for Computers — 46 x86 FP Instructions  Optional variations  I: integer operand  P: pop operand from stack  R: reverse operand order  But not all combinations allowed Data transfer Arithmetic Compare Transcendental FILD mem/ST(i) FISTP mem/ST(i) FLDPI FLD1 FLDZ FIADDP mem/ST(i) FISUBRP mem/ST(i) FIMULP mem/ST(i) FIDIVRP mem/ST(i) FSQRT FABS FRNDINT FICOMP FIUCOMP FSTSW AX/mem FPATAN F2XMI FCOS FPTAN FPREM FPSIN FYL2X Chapter 3 — Arithmetic for Computers — 47 Streaming SIMD Extension 2 (SSE2)  Adds 4 × 128-bit registers  Extended to 8 registers in AMD64/EM64T  Can be used for multiple FP operands  2 × 64-bit double precision  4 × 32-bit double precision  Instructions operate on them simultaneously  Single-Instruction Multiple-Data (Later 16 register; XMM) (2011, Intel YMM; Advanced Vector Extenstions (AVX), 8 x 32-bit or 4 64-bit) Matrix Multiply  Unoptimized code: 1. void dgemm (int n, double* A, double* B, double* C) 2. { 3. for (int i = 0; i < n; ++i) 4. for (int j = 0; j < n; ++j) 5. { 6. double cij = C[i+j*n]; /* cij = C[i][j] */ 7. for(int k = 0; k < n; k++ ) 8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */ 9. C[i+j*n] = cij; /* C[i][j] = cij */ 10. } 11. } Chapter 3 — Arithmetic for Computers — 48 § 3 .8 G o in g F a ste r: S u b w o rd P a ra lle lism a n d M a trix M u ltip ly (double precision, general matrix multiply) Matrix Multiply  x86 assembly code: 1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0 2. mov %rsi,%rcx # register %rcx = %rsi 3. xor %eax,%eax # register %eax = 0 4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1 5. add %r9,%rcx # register %rcx = %rcx + %r9 6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A 7. add $0x1,%rax # register %rax = %rax + 1 8. cmp %eax,%edi # compare %eax to %edi 9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0 10. jg 30 # jump if %eax > %edi

11. add $0x1,%r11d # register %r11 = %r11 + 1

12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element

Chapter 3 — Arithmetic for Computers — 49

§
3
.8

G
o
in

g
F

a
ste

r: S
u
b
w

o
rd

P
a
ra

lle
lism

a
n
d

M
a
trix M

u
ltip

ly

(scalar double)

Matrix Multiply

 Optimized C code:
1. #include

2. void dgemm (int n, double* A, double* B, double* C)

3. {

4. for ( int i = 0; i < n; i+=4 ) 5. for ( int j = 0; j < n; j++ ) { 6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */ 7. for( int k = 0; k < n; k++ ) 8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */ 9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n), 10. _mm256_broadcast_sd(B+k+j*n))); 11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */ 12. } 13. } Chapter 3 — Arithmetic for Computers — 50 § 3 .8 G o in g F a ste r: S u b w o rd P a ra lle lism a n d M a trix M u ltip ly (With AVX subword-parallel inst for x86) (4 double precision FP) (load 4 double precision FP number in parallel) (add 4 products to 4 sums) (make 4 copies of an elemement of B) (multiply 4 numbers in parallel) Matrix Multiply  Optimized x86 assembly code: 1. vmovapd (%r11),%ymm0 # Load 4 elements of C into %ymm0 2. mov %rbx,%rcx # register %rcx = %rbx 3. xor %eax,%eax # register %eax = 0 4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element 5. add $0x8,%rax # register %rax = %rax + 8 6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements 7. add %r9,%rcx # register %rcx = %rcx + %r9 8. cmp %r10,%rax # compare %r10 to %rax 9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0 10. jne 50 # jump if not %r10 != %rax

11. add $0x1,%esi # register % esi = % esi + 1

12. vmovapd %ymm0,(%r11) # Store %ymm0 into 4 C elements

Chapter 3 — Arithmetic for Computers — 51

§
3
.8

G
o
in

g
F

a
ste

r: S
u
b
w

o
rd

P
a
ra

lle
lism

a
n
d

M
a
trix M

u
ltip

ly

(v means vector of YMM)

(parallel double)

(6.4 GFLOPS/1.7 GFLOPS = 3.85 faster with optimized code ≈ 4)

Chapter 3 — Arithmetic for Computers — 52

Right Shift and Division

 Left shift by i places multiplies an integer
by 2i

 Right shift divides by 2i?
 Only for unsigned integers

 For signed integers
 Arithmetic right shift: replicate the sign bit
 e.g., –5 / 4

 111110112 >> 2 = 111111102 = –2
 Rounds toward –∞

 c.f. 111110112 >>> 2 = 001111102 = +62

§
3
.9

F
a
lla

cie
s a

n
d
P

itfa
lls

Chapter 3 — Arithmetic for Computers — 53

Associativity

 Parallel programs may interleave
operations in unexpected orders
 Assumptions of associativity may fail

(x+y)+z x+(y+z)
x -1.50E+38 -1.50E+38
y 1.50E+38
z 1.0 1.0

1.00E+00 0.00E+00

0.00E+00
1.50E+38

 Need to validate parallel programs under
varying degrees of parallelism

(Integer addition is _________, but FP addition is not)

(Important in numerical analysis; result of sequential and ________ computation must be compared)

Chapter 3 — Arithmetic for Computers — 54

Who Cares About FP Accuracy?

 Important for scientific code
 But for everyday consumer use?

 “My bank balance is out by 0.0002¢!” 

 The Intel Pentium FDIV bug
 The market expects accuracy

 See Colwell, The Pentium Chronicles

(Found by Thomas Nicely, Sept. 1994; costed $__________ for recall )

Chapter 3 — Arithmetic for Computers — 55

Concluding Remarks

 Bits have no inherent meaning
 Interpretation depends on the instructions

applied

 Computer representations of numbers
 Finite range and precision

 Need to account for this in programs

§
3
.9

C
o
n
clu

d
in

g
R

e
m

a
rks

Chapter 3 — Arithmetic for Computers — 56

Concluding Remarks

 ISAs support arithmetic
 Signed and unsigned integers

 Floating-point approximation to reals

 Bounded range and precision
 Operations can overflow and underflow

 MIPS ISA
 Core instructions: 54 most frequently used

 100% of SPECINT, 97% of SPECFP

 Other instructions: less frequent

Chapter 3 — Arithmetic for Computers — 57

ALU Implementation

Chapter 3 — Arithmetic for Computers — 58

ALU Implementation