代写代考 IS 501: Comp Arch| Dr. | Arithmetic 1

Computer Organization and Design
Unit 3: Arithmetic
Based on slides by Profs. & & C.J. IS 501: Comp Arch| Dr. | Arithmetic 1

Copyright By PowCoder代写 加微信 powcoder

This Unit: Arithmetic
System software
• A little review
• Binary + 2s complement
• Ripple-carry addition (RCA)
• Fast integer addition
• Carry-select (CSeA)
• Carry Lookahead Adder (CLA)
• Shifters
• Integer multiplication and division • Floating point arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic 2

• Chapter 3
CIS 501: Comp Arch| Dr. | Arithmetic 3

The Importance of Fast Arithmetic
Tinsn-mem Tregfile TALU Tdata-mem
• Addition of two numbers is most common operation
• Programs use addition frequently
• Loads and stores use addition for address calculation
• Branches use addition to test conditions and calculate targets • All insns use addition to calculate default next PC
• Fast addition is critical to high performance
CIS 501: Comp Arch| Dr. | Arithmetic 4

Review: Binary Integers
• Computers represent integers in binary (base2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
+ Natural since only two values are represented
• Addition, etc. take place as usual (carry the 1, etc.)
17 = +5 = 22 =
• Some old machines use decimal (base10) with only 0/1
30 = 011 000
• Often called BCD (binary-coded decimal)
– Unnatural for digital logic, implementation complicated & slow + Morepreciseforcurrencyoperations
CIS 501: Comp Arch| Dr. | Arithmetic 5

Fixed Width
• On pencil and paper, integers have infinite width
• In hardware, integers have fixed width • Nbits:16,32or64
• LSBis20,MSBis2N-1
• Range:0to2N–1
• Numbers >2N represented using multiple fixed-width integers
• In software (e.g., Java BigInteger class)
CIS 501: Comp Arch| Dr. | Arithmetic 6

What About Negative Integers?
• Sign/magnitude
• Unsigned plus one bit for sign
10 = 000001010, -10 = 100001010
+ Matches our intuition from “by hand” decimal arithmetic – representations of both +0 and –0
– Addition is difficult
• symmetric range: –(2N-1–1) to 2N-1–1
• Option II: two’s complement (2C)
• Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
+ One representation for 0 (all zeroes) + Easy addition
• asymmetric range: –(2N-1) to 2N-1–1
CIS 501: Comp Arch| Dr. | Arithmetic 7

The Tao of 2C
• How did 2C come about?
• “Let’s design a representation that makes addition easy”
• Think of subtracting 10 from 0 by hand with 8-bit numbers • Have to “borrow” 1s from some imaginary leading 1
0 = 100000000
-10 = 00000010
-10 = 011111110
• Now, add the conventional way…
-10 = 11111110
+10 = 00000010
0 = 100000000
CIS 501: Comp Arch| Dr. | Arithmetic 8

Still More On 2C
• What is the interpretation of 2C?
• Same as binary, except MSB represents –2N–1, not 2N–1
• –10 = 11110110 = –27+26+25+24+22+21 + Extends to any width
• –10 = 110110 = –25+24+22+21
• Why? 2N = 2*2N–1
• –25+24+22+21 = (–26+2*25)–25+24+22+21 = –26+25+24+22+21
• Equivalent to computing modulo 2N
• Trick to negating a number quickly: –B = B’ + 1 • –(1) = (0001)’+1 = 1110+1 = 1111 = –1
• –(–1) = (1111)’+1 = 0000+1 = 0001 = 1
• –(0) = (0000)’+1 = 1111+1 = 0000 = 0
• Think about why this works
CIS 501: Comp Arch| Dr. | Arithmetic 9

CIS 501: Comp Arch| Dr. | Arithmetic 10

1st Grade: Decimal Addition
• Repeat N times
• Add least significant digits and any overflow from previous add • Carry “overflow” to next addition
• Overflow: any digit other than least significant of sum • Shift two addends and sum one digit to the right
• Sum of two N-digit numbers can yield an N+1 digit number
CIS 501: Comp Arch| Dr. | Arithmetic 11

Binary Addition: Works the Same Way
43 = 00101011
+29 = 00011101
72 = 01001000
• Repeat N times
• Add least significant bits and any overflow from previous add • Carry the overflow to next addition
• Shift two addends and sum one bit to the right
• Sum of two N-bit numbers can yield an N+1 bit number
– More steps (smaller base)
+ Each one is simpler (adding just 1 and 0) • So simple we can do it in hardware
CIS 501: Comp Arch| Dr. | Arithmetic 12

The Half Adder
• How to add two binary integers in hardware?
• Start with adding two bits
• When all else fails … look at truth table
A B = C0 S 00 = 00
01 = 01 10 = 01 11 = 10
• CO(carryout)=AB
• This is called a half adder
CIS 501: Comp Arch| Dr. | Arithmetic

The Other Half
• We could chain half adders together, but to do that… • Need to incorporate a carry out from previous adder
C AB = 0 00 = 0 01 = 0 10 = 0 11 = 1 00 = 1 01 = 1 10 = 1 11 =
C0 S 00 01 01 10 01 10 10 11
• S=C’A’B|C’AB’|CA’B’|CAB=C^A^B
• CO=C’AB|CA’B|CAB’|CAB=CA|CB|AB • This is called a full adder
CIS 501: Comp Arch| Dr. | Arithmetic

Ripple-Carry Adder
• N-bit ripple-carry adder
• N 1-bit full adders “chained” together
• CO0 = CI1, CO1 = CI2, etc.
• CON–1 is carry-out of entire adder
• CON–1 = 1 ® “overflow”
• Example: 16-bit ripple carry adder
• How fast is this?
• How fast is an N-bit ripple-carry adder?
A0 FA S0 B0
A1 FA S1 B1
A2 FA S2 B2
A15 FA S15
CIS 501: Comp Arch| Dr. | Arithmetic

Quantifying Adder Delay
• Combinational logic dominated by gate delays • How many gates between input and output?
• Array storage dominated by wire delays
• Longest delay or critical path is what matters
• Can implement any combinational function in “2” logic levels • 1levelofAND+1levelofOR(PLA)
• NOTs are “free”: push to input (DeMorgan’s) or read from latch
• Example: delay(FullAdder) = 2
• d(CarryOut) = delay(AB | AC | BC)
• d(Sum)=d(A^B^C)=d(AB’C’|A’BC’|A’B’C|ABC)=2 • Note ‘^’ means Xor (just like in C & Java)
• Caveat: “2” assumes gates have few (<8 ?) inputs CIS 501: Comp Arch| Dr. | Arithmetic 16 Ripple-Carry Adder Delay • Longest path is to CO15 (or S15) • delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15)) • d(A15) = d(B15) = 0, d(CI15) = d(CO14) • d(CO15)=2+d(CO14)=2+2+d(CO13)... • d(CO15) = 32 • d(CON–1)=2N – Too slow! – Linear in number of bits • Number of gates is also linear A0 FA S0 B0 A1 FA S1 B1 A2 FA S2 B2 A15 FA S15 CIS 501: Comp Arch| Dr. | Arithmetic CIS 501: Comp Arch| Dr. | Arithmetic 18 4th Grade: Decimal Division 9 3 |29 -27 2 // quotient // divisor | dividend // remainder • Shift divisor left (multiply by 10) until MSB lines up with dividend’s • Repeat until remaining dividend (remainder) < divisor • Find largest single digit q such that (q*divisor) < dividend • Set LSB of quotient to q • Subtract (q*divisor) from dividend • Shift quotient left by one digit (multiply by 10) • Shift divisor right by one digit (divide by 10) CIS 501: Comp Arch| Dr. | Arithmetic 19 Binary Division 3 |29 = 0011 |011101 -24 = 5 = -3= 2 = - 011000 000101 -000011 000010 CIS 501: Comp Arch| Dr. | Arithmetic Binary Division Hardware • Same as decimal division, except (again) – More individual steps (base is smaller) + Each step is simpler • Find largest bit q such that (q*divisor) < dividend • q = 0 or 1 • Subtract (q*divisor) from dividend • q = 0 or 1 ® no actual multiplication, subtract divisor or not • Complication: largest q such that (q*divisor) < dividend • How do you know if (1*divisor) < dividend? • Human can “eyeball” this • Computer does not have eyeballs • Subtract and see if result is negative CIS 501: Comp Arch| Dr. | Arithmetic 21 Software Divide Algorithm • Can implement this algorithm in software • Inputs:dividendanddivisor • Outputs:quotient=dividend/divisor rem = dividend % divisor remainder = 0; quotient = 0; for (int i = 0; i < 32; i++) { remainder = (remainder << 1) | (dividend >> 31);
if (remainder >= divisor) {
quotient = (quotient << 1) | 1; remainder = remainder - divisor; quotient = (quotient << 1) | 0; dividend = dividend << 1; CIS 501: Comp Arch| Dr. | Arithmetic 22 Divide Example • Input:Divisor=00011,Dividend=11101 Step Remainder Quotient Remainder Dividend 00000 00000 00000 00001 00001 00000 00010 00001 00100 00010 01001 00010 • Result: Quotient: 1001, Remainder: 10 CIS 501: Comp Arch| Dr. | Arithmetic Divider Circuit Shift in 0 or 1 • N cycles for n-bit divide Shift in 0 or 1 Shift in 0 CIS 501: Comp Arch| Dr. | Arithmetic Fast Addition CIS 501: Comp Arch| Dr. | Arithmetic 25 Bad idea: a PLA-based Adder? • If any function can be expressed as two-level logic... • ...why not use a PLA for an entire 8-bit adder? • Not small • Approx. 216 AND gates, each with 16 inputs • Then, 8 OR gates, each with 216 inputs • Number of gates exponential in bit width! • Not that fast, either • An AND gate with 65K inputs != 2-input AND gate • Many-input gates made from tree of, say, 4-input gates • 216-input gates would have at least 8 logic levels • So, at least 10 levels of logic for a 16-bit PLA • Even so, delay is still logarithmic in number of bits • There are better (faster, smaller) ways CIS 501: Comp Arch| Dr. | Arithmetic 26 Theme: Hardware != Software • Hardware can do things that software fundamentally can’t • And vice versa (of course) • In hardware, it’s easier to trade resources for latency • One example of this: speculation • Slow computation waiting for some slow input? • Input one of two things? • Computewithboth(slow),chooserightonelater(fast) • Does this make sense in software? Not on a uni-processor • Difference? hardware is parallel, software is sequential CIS 501: Comp Arch| Dr. | Arithmetic 27 Carry-Select Adder • Carry-selectadder • Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1 • Choose the correct one when CO7 finally becomes available + Effectively cuts carry chain in half (break critical path) – But adds mux B7-0 S15-0 0 A15-8 B15-8 S7-0 S15-8 CO 1 S16CO A15-8 15-8 CIS 501: Comp Arch| Dr. | Arithmetic Multi-Segment Carry-Select Adder • Multiple segments • Example: 5, 5, 6 bit = 16 bit • Hardware cost • Still mostly linear (~2x) • Compute each segment with 0 and 1 carry-in • Serial mux chain • 5-bitadder(10)+ Two muxes (4) = 14 S4-0 A0 S9-5 1 S 10 A9-5 9-5 A 0 S15-10 15-10 B15-10 1 S A15-10 B 6+ 15-10 CIS 501: Comp Arch| Dr. | Arithmetic Carry-Select Adder Delay • What is carry-select adder delay (two segment)? • d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2 • d(CO15) = MAX(2*8, 2*8) + 2 = 18 • In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA) • What if we cut adder into 4 equal pieces? • Would it be 2*(N/4) + 2 = 10? Not quite • d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2 • d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2 • d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2 • d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2 • d(CO15)=2*4+3*2=14 • N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2 • 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18 CIS 501: Comp Arch| Dr. | Arithmetic 30 16b Carry-Select Adder Delay 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 CIS 501: Comp Arch| Dr. | Arithmetic Gate delay Another Option: Carry Lookahead • Is carry-select adder as fast as we can go? • Nope! • Another approach to using additional resources • Instead of redundantly computing sums assuming different carries • Use redundancy to compute carries more quickly • This approach is called carry lookahead (CLA) CIS 501: Comp Arch| Dr. | Arithmetic 32 A & B ® Generate & Propagate • Let’s look at the single-bit carry-out function • CO = AB|AC|BC • Factor out terms that use only A and B (available immediately) • CO =(AB)|(A|B)C • (AB): generates carry-out regardless of incoming C ® rename to G • (A|B):propagatesincomingC®renametoP CIS 501: Comp Arch| Dr. | Arithmetic Infinite Hardware CLA • Can expand C1...N in terms of G’s, P’s, and C0 • Example: C16 • C16 = G15 | P15C15 • C16 = G15 | P15(G14|P14C14) • C16 = G15 | P15G14 | ... | P15P14...P2P1G0 | P15P14...P2P1P0C0 • Similar expansions for C15, C14, etc. • How much does this cost? • CN needs: N AND’s + 1 OR’s, largest have N+1 inputs • CN...C1 needs: N*(N+1)/2 AND’s + N OR’s, max N+1 inputs • N=16: 152 total gates, max input 17 • N=64: 2144 total gates, max input 65 • And how fast is it really? – Not that fast, unfortunately, 17-input gates are really slow CIS 501: Comp Arch| Dr. | Arithmetic 34 Compromise: Multi-Level CLA • Ripplecarry + Few small gates: 2N gates, max 3 inputs – Laid in series: 2N (linear) latency • InfinitehardwareCLA – Many big gates: N*(N+3)/2 additional gates, max N+1 inputs + Laid in parallel: constant 5 latency • Is there a compromise? • Reasonable number of small gates? • Sublinear (doesn’t have to be constant) latency? • Yes,multi-levelCLA:exploitshierarchytoachievethis CIS 501: Comp Arch| Dr. | Arithmetic 35 Carry Lookahead Adder (CLA) • Calculate “propagate” and “generate” based on A, B • Not based on carry in • Combine with tree structure • Take aways C • Reasonable area C G1-0 P1-0 C1 • Treegiveslogarithmicdelay G3-0 P3-0 C2 G3-2 P3-2 C3 CIS 501: Comp Arch| Dr. | Arithmetic Individual G & P ® Windowed G & P • WindowedG/P:usefulabstractionformulti-levelCLA • Individual carry equations • C1 =G0 |P0C0, C2 =G1 |P1C1 • Infinite hardware CLA equations • C1 = G0 | P0C0 • C2 =G1 |P1G0 |P1P0C0 • Group terms into “windows” • C2 = (G1 | P1G0) | (P1P0)C0 • C2 = G1-0 | P1-0C0 • G1-0, P1-0 are window G & P • a single bit summarizing information from all the bits in the window • G1-0: carry-out generated by 1-0 bits • P1-0: carry-out propagated by 1-0 bits CIS 501: Comp Arch| Dr. | Arithmetic 37 Two-Level CLA for 4-bit Adder • Individual carry equations • C1 =G0 |P0C0, C2 =G1 |P1C1, C3 =G2 |P2C2, C4 =G3 |P3C3 • Infinite hardware CLA equations • C1 = G0 | P0C0 • C2 =G1 |P1G0 |P1P0C0 • C3 = G2 | P2G1 | P2P1G0 | P2P1P0C0 • C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0C0 • Hierarchical CLA equations • First level: expand C2 using C1, C4 using C3 • C2 =G1 |P1(G0 |P0C0)=(G1 |P1G0)|(P1P0)C0=G1-0 |P1-0C0 • C4 =G3 |P3(G2 |P2C2)=(G3 |P3G2)|(P3P2)C2=G3-2 |P3-2C2 • Second level: expand C4 using expanded C2 • C4 = G3-2 | P3-2(G1-0 | P1-0C0) = (G3-2 | P3-2G1-0) | (P3-2P1-0)C0 • C4 = G3-0 | P3-0C0 CIS 501: Comp Arch| Dr. | Arithmetic 38 Two-Level CLA for 4-bit Adder • Top first-level CLA block • Input: C0, G0, P0, G1, P1 • Output: C1, G1-0, P1-0 • Bottom first-level CLA block • Input: C2, G2, P2, G3, P3 • Output: C3, G3-2, P3-2 • Second-level CLA block • Input: C0, G1-0, P1-0, G3-2, P3-2 • Output: C2, G3-0, P3-0 • These3blocksare“thesame”C B2 3 • C4 block • Input: C0, G3-0, P3-0 • Output: C4 G3-2 P3-2 C3 G3-0 P3-0 C2 CIS 501: Comp Arch| Dr. | Arithmetic G1-0 P1-0 C1 4-bit CLA Timing: d0 • Which signals are ready at 0 gate delays (d0)? • C0 • A0,B0 C0 • A1,B1 • A2,B2 C • A3,B3 G1-0 P1-0 C1 G3-0 P3-0 C2 G3-2 P3-2 C3 CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d1 • Signals ready from before • d0:C0,Ai,Bi • New signals ready at d1 • P0 = A0 | B0, G0 = A0B0 • P1 = A1 | B1, G1 = A1B1 • P2 = A2 | B2, G2 = A2B2 • P3 = A3 | B3, G3 = A3B3 G1-0 P1-0 C1 G3-0 P3-0 C2 G3-2 P3-2 C3 CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d3 • Signals ready from before • d0:C0,Ai,Bi • d1:Pi,Gi • New signals ready at d3 • P1-0=P1P0 • G1-0=G1 |P1G0 • C1 = G0 | P0C0 • P3-2=P3P2 • G3-2=G3 |P3G2 • C3 is not ready G1-0 P1-0 C1 G3-0 P3-0 C2 G3-2 P3-2 C3 CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d5 • Signals ready from before • d0:C0,Ai,Bi • d1:Pi,Gi C0 • d3:P1-0,G1-0,C1,P3-2,G3-2 A0 • New signals ready at d5 3-0 3-2 1-0 • G3-0 = G3-2 | P3-2G1-0 • C2 = G1-0 | P1-0C0 CIS 501: Comp Arch| Dr. | Arithmetic G1-0 P1-0 C1 G3-2 P3-2 C3 G3-0 P3-0 C2 4-bit CLA Timing: d7 • Signals ready from before • d0:C0,Ai,Bi • d1:Pi,Gi C0 • d3:P1-0,G1-0,C1,P3-2,G3-2 • d5: P3-0, G3-0, C2 C • New signals ready at d7 • C3 = G2 | P2C2 • C4 = G3-0 | P3-0C0 C • Si ready d1 after Ci C C4 CIS 501: Comp Arch| Dr. | Arithmetic G1-0 P1-0 C1 G3-0 P3-0 C2 G3-2 P3-2 C3 Two-Level CLA for 4-bit Adder • Hardware? • CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2) • 3 of these • C4 block: 2 gates, max 2 inputs • Total: 17 gates/3inputs • Infinite: 14/5 C • Latency? C • 2 for “outer” CLA, 4 for “interior” • G/P go “up”, C go “down” • Total:8(7forCLA,1forS) • Infinite: 4 G1-0 P1-0 C1 G3-2 P3-2 C3 G3-0 P3-0 C2 • 2LisbiggerandslowerL CIS 501: Comp Arch| Dr. | Arithmetic Two-Level CLA for 16-bit Adder • 4 G/P inputs per level • Hardware? • First level: 14 gates * 4 blocks • Second level: 14 gates * 1 block • Total: 70 gates • largest gate: 5 inputs • Infinite: 152 gates, 17 inputs • Latency? • Total:8(1+2+2+2+1) • Infinite:4(1+2+1) • CLA for a 64-bit adder? CIS 501: Comp Arch| Dr. | Arithmetic G15-0 P15-0 C4,8,12 G3-0 P3-0 C1,2,3 G7-4 P7-4 C5,6,7 G11-8 P11-8 C9,10,11 G15-12 P15-12 C13,14,15 C16 46 16-bit CLA Timing: d0 • What is ready after 0 gate delays? • C0 G15-0 P15-0 C4,8,12 G3-0 P3-0 C1,2,3 G7-4 P7-4 C5,6,7 G11-8 P11-8 C9,10,11 G15-12 P15-12 C13,14,15 CIS 501: Comp Arch| Dr. | Arithmetic C16 47 16-bit CLA Timing: d1 • What is ready after 1 gate delay? • Individual G/P G15-0 P15-0 C4,8,12 G3-0 P3-0 C1,2,3 G7-4 P7-4 C5,6,7 G11-8 P11-8 C9,10,11 G15-12 P15-12 C13,14,15 CIS 501: Comp Arch| Dr. | Arithmetic C16 48 16-bit CLA Timing: d3 • What is ready after 3 gate delays? • 1st level group G/P • Interior carries of 1st group • C1, C2, C3 G15-0 P15-0 C4,8,12 G3-0 P3-0 C1,2,3 G7-4 P7-4 C5,6,7 G11-8 P11-8 C9,10,11 G15-12 P15-12 C13,14,15 CIS 501: Comp Arch| Dr. | Arithmetic C16 49 16-bit CLA Timing: d5 • And after 5 gate delays? • Outer level group G/P • Outer level “interior” carries • C4, C8, C12 G15-0 P15-0 C4 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com