Computer Organization and Design
Unit 3: Arithmetic
Based on slides by Profs. & & C.J. IS 501: Comp Arch| Dr. | Arithmetic 1
Copyright By PowCoder代写 加微信 powcoder
This Unit: Arithmetic
System software
• A little review
• Binary + 2s complement
• Ripple-carry addition (RCA)
• Fast integer addition
• Carry-select (CSeA)
• Carry Lookahead Adder (CLA)
• Shifters
• Integer multiplication and division • Floating point arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic 2
• Chapter 3
CIS 501: Comp Arch| Dr. | Arithmetic 3
The Importance of Fast Arithmetic
Tinsn-mem Tregfile TALU Tdata-mem
• Addition of two numbers is most common operation
• Programs use addition frequently
• Loads and stores use addition for address calculation
• Branches use addition to test conditions and calculate targets • All insns use addition to calculate default next PC
• Fast addition is critical to high performance
CIS 501: Comp Arch| Dr. | Arithmetic 4
Review: Binary Integers
• Computers represent integers in binary (base2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
+ Natural since only two values are represented
• Addition, etc. take place as usual (carry the 1, etc.)
17 = +5 = 22 =
• Some old machines use decimal (base10) with only 0/1
30 = 011 000
• Often called BCD (binary-coded decimal)
– Unnatural for digital logic, implementation complicated & slow + Morepreciseforcurrencyoperations
CIS 501: Comp Arch| Dr. | Arithmetic 5
Fixed Width
• On pencil and paper, integers have infinite width
• In hardware, integers have fixed width • Nbits:16,32or64
• LSBis20,MSBis2N-1
• Range:0to2N–1
• Numbers >2N represented using multiple fixed-width integers
• In software (e.g., Java BigInteger class)
CIS 501: Comp Arch| Dr. | Arithmetic 6
What About Negative Integers?
• Sign/magnitude
• Unsigned plus one bit for sign
10 = 000001010, -10 = 100001010
+ Matches our intuition from “by hand” decimal arithmetic – representations of both +0 and –0
– Addition is difficult
• symmetric range: –(2N-1–1) to 2N-1–1
• Option II: two’s complement (2C)
• Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
+ One representation for 0 (all zeroes) + Easy addition
• asymmetric range: –(2N-1) to 2N-1–1
CIS 501: Comp Arch| Dr. | Arithmetic 7
The Tao of 2C
• How did 2C come about?
• “Let’s design a representation that makes addition easy”
• Think of subtracting 10 from 0 by hand with 8-bit numbers • Have to “borrow” 1s from some imaginary leading 1
0 = 100000000
-10 = 00000010
-10 = 011111110
• Now, add the conventional way…
-10 = 11111110
+10 = 00000010
0 = 100000000
CIS 501: Comp Arch| Dr. | Arithmetic 8
Still More On 2C
• What is the interpretation of 2C?
• Same as binary, except MSB represents –2N–1, not 2N–1
• –10 = 11110110 = –27+26+25+24+22+21 + Extends to any width
• –10 = 110110 = –25+24+22+21
• Why? 2N = 2*2N–1
• –25+24+22+21 = (–26+2*25)–25+24+22+21 = –26+25+24+22+21
• Equivalent to computing modulo 2N
• Trick to negating a number quickly: –B = B’ + 1 • –(1) = (0001)’+1 = 1110+1 = 1111 = –1
• –(–1) = (1111)’+1 = 0000+1 = 0001 = 1
• –(0) = (0000)’+1 = 1111+1 = 0000 = 0
• Think about why this works
CIS 501: Comp Arch| Dr. | Arithmetic 9
CIS 501: Comp Arch| Dr. | Arithmetic 10
1st Grade: Decimal Addition
• Repeat N times
• Add least significant digits and any overflow from previous add • Carry “overflow” to next addition
• Overflow: any digit other than least significant of sum • Shift two addends and sum one digit to the right
• Sum of two N-digit numbers can yield an N+1 digit number
CIS 501: Comp Arch| Dr. | Arithmetic 11
Binary Addition: Works the Same Way
43 = 00101011
+29 = 00011101
72 = 01001000
• Repeat N times
• Add least significant bits and any overflow from previous add • Carry the overflow to next addition
• Shift two addends and sum one bit to the right
• Sum of two N-bit numbers can yield an N+1 bit number
– More steps (smaller base)
+ Each one is simpler (adding just 1 and 0) • So simple we can do it in hardware
CIS 501: Comp Arch| Dr. | Arithmetic 12
The Half Adder
• How to add two binary integers in hardware?
• Start with adding two bits
• When all else fails … look at truth table
A B = C0 S 00 = 00
01 = 01 10 = 01 11 = 10
• CO(carryout)=AB
• This is called a half adder
CIS 501: Comp Arch| Dr. | Arithmetic
The Other Half
• We could chain half adders together, but to do that… • Need to incorporate a carry out from previous adder
C AB = 0 00 = 0 01 = 0 10 = 0 11 = 1 00 = 1 01 = 1 10 = 1 11 =
C0 S 00 01 01 10 01 10 10 11
• S=C’A’B|C’AB’|CA’B’|CAB=C^A^B
• CO=C’AB|CA’B|CAB’|CAB=CA|CB|AB • This is called a full adder
CIS 501: Comp Arch| Dr. | Arithmetic
Ripple-Carry Adder
• N-bit ripple-carry adder
• N 1-bit full adders “chained” together
• CO0 = CI1, CO1 = CI2, etc.
• CON–1 is carry-out of entire adder
• CON–1 = 1 ® “overflow”
• Example: 16-bit ripple carry adder
• How fast is this?
• How fast is an N-bit ripple-carry adder?
A0 FA S0 B0
A1 FA S1 B1
A2 FA S2 B2
A15 FA S15
CIS 501: Comp Arch| Dr. | Arithmetic
Quantifying Adder Delay
• Combinational logic dominated by gate delays • How many gates between input and output?
• Array storage dominated by wire delays
• Longest delay or critical path is what matters
• Can implement any combinational function in “2” logic levels • 1levelofAND+1levelofOR(PLA)
• NOTs are “free”: push to input (DeMorgan’s) or read from latch
• Example: delay(FullAdder) = 2
• d(CarryOut) = delay(AB | AC | BC)
• d(Sum)=d(A^B^C)=d(AB’C’|A’BC’|A’B’C|ABC)=2 • Note ‘^’ means Xor (just like in C & Java)
• Caveat: “2” assumes gates have few (<8 ?) inputs
CIS 501: Comp Arch| Dr. | Arithmetic 16
Ripple-Carry Adder Delay
• Longest path is to CO15 (or S15)
• delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15))
• d(A15) = d(B15) = 0, d(CI15) = d(CO14)
• d(CO15)=2+d(CO14)=2+2+d(CO13)... • d(CO15) = 32
• d(CON–1)=2N – Too slow!
– Linear in number of bits
• Number of gates is also linear
A0 FA S0 B0
A1 FA S1 B1
A2 FA S2 B2
A15 FA S15
CIS 501: Comp Arch| Dr. | Arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic 18
4th Grade: Decimal Division
9 3 |29 -27 2
// quotient
// divisor | dividend
// remainder
• Shift divisor left (multiply by 10) until MSB lines up with dividend’s • Repeat until remaining dividend (remainder) < divisor
• Find largest single digit q such that (q*divisor) < dividend • Set LSB of quotient to q
• Subtract (q*divisor) from dividend
• Shift quotient left by one digit (multiply by 10)
• Shift divisor right by one digit (divide by 10)
CIS 501: Comp Arch| Dr. | Arithmetic 19
Binary Division
3 |29 = 0011 |011101
-24 = 5 = -3= 2 =
- 011000 000101 -000011 000010
CIS 501: Comp Arch| Dr. | Arithmetic
Binary Division Hardware
• Same as decimal division, except (again)
– More individual steps (base is smaller)
+ Each step is simpler
• Find largest bit q such that (q*divisor) < dividend
• q = 0 or 1
• Subtract (q*divisor) from dividend
• q = 0 or 1 ® no actual multiplication, subtract divisor or not
• Complication: largest q such that (q*divisor) < dividend • How do you know if (1*divisor) < dividend?
• Human can “eyeball” this
• Computer does not have eyeballs
• Subtract and see if result is negative
CIS 501: Comp Arch| Dr. | Arithmetic 21
Software Divide Algorithm
• Can implement this algorithm in software
• Inputs:dividendanddivisor
• Outputs:quotient=dividend/divisor
rem = dividend % divisor
remainder = 0;
quotient = 0;
for (int i = 0; i < 32; i++) {
remainder = (remainder << 1) | (dividend >> 31);
if (remainder >= divisor) {
quotient = (quotient << 1) | 1;
remainder = remainder - divisor;
quotient = (quotient << 1) | 0;
dividend = dividend << 1;
CIS 501: Comp Arch| Dr. | Arithmetic 22
Divide Example
• Input:Divisor=00011,Dividend=11101
Step Remainder Quotient Remainder Dividend
00000 00000
00000 00001
00001 00000
00010 00001
00100 00010
01001 00010
• Result: Quotient: 1001, Remainder: 10
CIS 501: Comp Arch| Dr. | Arithmetic
Divider Circuit
Shift in 0 or 1
• N cycles for n-bit divide
Shift in 0 or 1
Shift in 0
CIS 501: Comp Arch| Dr. | Arithmetic
Fast Addition
CIS 501: Comp Arch| Dr. | Arithmetic 25
Bad idea: a PLA-based Adder?
• If any function can be expressed as two-level logic... • ...why not use a PLA for an entire 8-bit adder?
• Not small
• Approx. 216 AND gates, each with 16 inputs
• Then, 8 OR gates, each with 216 inputs
• Number of gates exponential in bit width!
• Not that fast, either
• An AND gate with 65K inputs != 2-input AND gate
• Many-input gates made from tree of, say, 4-input gates
• 216-input gates would have at least 8 logic levels
• So, at least 10 levels of logic for a 16-bit PLA
• Even so, delay is still logarithmic in number of bits
• There are better (faster, smaller) ways
CIS 501: Comp Arch| Dr. | Arithmetic 26
Theme: Hardware != Software
• Hardware can do things that software fundamentally can’t • And vice versa (of course)
• In hardware, it’s easier to trade resources for latency
• One example of this: speculation
• Slow computation waiting for some slow input?
• Input one of two things?
• Computewithboth(slow),chooserightonelater(fast)
• Does this make sense in software? Not on a uni-processor • Difference? hardware is parallel, software is sequential
CIS 501: Comp Arch| Dr. | Arithmetic 27
Carry-Select Adder
• Carry-selectadder
• Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1 • Choose the correct one when CO7 finally becomes available + Effectively cuts carry chain in half (break critical path)
– But adds mux
B7-0 S15-0 0
A15-8 B15-8
S7-0 S15-8
CO 1 S16CO A15-8 15-8
CIS 501: Comp Arch| Dr. | Arithmetic
Multi-Segment Carry-Select Adder
• Multiple segments
• Example: 5, 5, 6 bit = 16 bit
• Hardware cost
• Still mostly linear (~2x)
• Compute each segment with 0 and 1 carry-in
• Serial mux chain
• 5-bitadder(10)+ Two muxes (4) = 14
S4-0 A0 S9-5
1 S 10 A9-5 9-5
A 0 S15-10
15-10 B15-10
1 S A15-10
B 6+ 15-10
CIS 501: Comp Arch| Dr. | Arithmetic
Carry-Select Adder Delay
• What is carry-select adder delay (two segment)? • d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2
• d(CO15) = MAX(2*8, 2*8) + 2 = 18
• In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA)
• What if we cut adder into 4 equal pieces?
• Would it be 2*(N/4) + 2 = 10? Not quite
• d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2
• d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2
• d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2 • d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2
• d(CO15)=2*4+3*2=14
• N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2 • 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18
CIS 501: Comp Arch| Dr. | Arithmetic 30
16b Carry-Select Adder Delay
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
CIS 501: Comp Arch| Dr. | Arithmetic
Gate delay
Another Option: Carry Lookahead
• Is carry-select adder as fast as we can go? • Nope!
• Another approach to using additional resources
• Instead of redundantly computing sums assuming different carries • Use redundancy to compute carries more quickly
• This approach is called carry lookahead (CLA)
CIS 501: Comp Arch| Dr. | Arithmetic 32
A & B ® Generate & Propagate
• Let’s look at the single-bit carry-out function
• CO = AB|AC|BC
• Factor out terms that use only A and B (available immediately)
• CO =(AB)|(A|B)C
• (AB): generates carry-out regardless of incoming C ® rename to G • (A|B):propagatesincomingC®renametoP
CIS 501: Comp Arch| Dr. | Arithmetic
Infinite Hardware CLA
• Can expand C1...N in terms of G’s, P’s, and C0 • Example: C16
• C16 = G15 | P15C15
• C16 = G15 | P15(G14|P14C14)
• C16 = G15 | P15G14 | ... | P15P14...P2P1G0 | P15P14...P2P1P0C0
• Similar expansions for C15, C14, etc.
• How much does this cost?
• CN needs: N AND’s + 1 OR’s, largest have N+1 inputs
• CN...C1 needs: N*(N+1)/2 AND’s + N OR’s, max N+1 inputs • N=16: 152 total gates, max input 17
• N=64: 2144 total gates, max input 65
• And how fast is it really?
– Not that fast, unfortunately, 17-input gates are really slow
CIS 501: Comp Arch| Dr. | Arithmetic 34
Compromise: Multi-Level CLA
• Ripplecarry
+ Few small gates: 2N gates, max 3 inputs – Laid in series: 2N (linear) latency
• InfinitehardwareCLA
– Many big gates: N*(N+3)/2 additional gates, max N+1 inputs + Laid in parallel: constant 5 latency
• Is there a compromise?
• Reasonable number of small gates?
• Sublinear (doesn’t have to be constant) latency?
• Yes,multi-levelCLA:exploitshierarchytoachievethis
CIS 501: Comp Arch| Dr. | Arithmetic 35
Carry Lookahead Adder (CLA)
• Calculate “propagate” and “generate” based on A, B • Not based on carry in
• Combine with tree structure • Take aways
C • Reasonable area C
G1-0 P1-0 C1
• Treegiveslogarithmicdelay
G3-0 P3-0 C2
G3-2 P3-2 C3
CIS 501: Comp Arch| Dr. | Arithmetic
Individual G & P ® Windowed G & P
• WindowedG/P:usefulabstractionformulti-levelCLA
• Individual carry equations
• C1 =G0 |P0C0, C2 =G1 |P1C1
• Infinite hardware CLA equations • C1 = G0 | P0C0
• C2 =G1 |P1G0 |P1P0C0
• Group terms into “windows” • C2 = (G1 | P1G0) | (P1P0)C0
• C2 = G1-0 | P1-0C0
• G1-0, P1-0 are window G & P
• a single bit summarizing information from all the bits in the window • G1-0: carry-out generated by 1-0 bits
• P1-0: carry-out propagated by 1-0 bits
CIS 501: Comp Arch| Dr. | Arithmetic 37
Two-Level CLA for 4-bit Adder
• Individual carry equations
• C1 =G0 |P0C0, C2 =G1 |P1C1, C3 =G2 |P2C2, C4 =G3 |P3C3
• Infinite hardware CLA equations • C1 = G0 | P0C0
• C2 =G1 |P1G0 |P1P0C0
• C3 = G2 | P2G1 | P2P1G0 | P2P1P0C0
• C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0C0
• Hierarchical CLA equations
• First level: expand C2 using C1, C4 using C3
• C2 =G1 |P1(G0 |P0C0)=(G1 |P1G0)|(P1P0)C0=G1-0 |P1-0C0 • C4 =G3 |P3(G2 |P2C2)=(G3 |P3G2)|(P3P2)C2=G3-2 |P3-2C2
• Second level: expand C4 using expanded C2
• C4 = G3-2 | P3-2(G1-0 | P1-0C0) = (G3-2 | P3-2G1-0) | (P3-2P1-0)C0
• C4 = G3-0 | P3-0C0
CIS 501: Comp Arch| Dr. | Arithmetic 38
Two-Level CLA for 4-bit Adder
• Top first-level CLA block • Input: C0, G0, P0, G1, P1
• Output: C1, G1-0, P1-0
• Bottom first-level CLA block • Input: C2, G2, P2, G3, P3
• Output: C3, G3-2, P3-2
• Second-level CLA block
• Input: C0, G1-0, P1-0, G3-2, P3-2 • Output: C2, G3-0, P3-0
• These3blocksare“thesame”C B2 3
• C4 block
• Input: C0, G3-0, P3-0
• Output: C4
G3-2 P3-2 C3
G3-0 P3-0 C2
CIS 501: Comp Arch| Dr. | Arithmetic
G1-0 P1-0 C1
4-bit CLA Timing: d0
• Which signals are ready at 0 gate delays (d0)? • C0
• A0,B0 C0 • A1,B1
• A2,B2 C • A3,B3
G1-0 P1-0 C1
G3-0 P3-0 C2
G3-2 P3-2 C3
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d1
• Signals ready from before • d0:C0,Ai,Bi
• New signals ready at d1 • P0 = A0 | B0, G0 = A0B0
• P1 = A1 | B1, G1 = A1B1
• P2 = A2 | B2, G2 = A2B2
• P3 = A3 | B3, G3 = A3B3
G1-0 P1-0 C1
G3-0 P3-0 C2
G3-2 P3-2 C3
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d3
• Signals ready from before • d0:C0,Ai,Bi
• d1:Pi,Gi
• New signals ready at d3
• P1-0=P1P0
• G1-0=G1 |P1G0
• C1 = G0 | P0C0
• P3-2=P3P2
• G3-2=G3 |P3G2
• C3 is not ready
G1-0 P1-0 C1
G3-0 P3-0 C2
G3-2 P3-2 C3
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d5
• Signals ready from before • d0:C0,Ai,Bi
• d1:Pi,Gi C0
• d3:P1-0,G1-0,C1,P3-2,G3-2 A0
• New signals ready at d5
3-0 3-2 1-0
• G3-0 = G3-2 | P3-2G1-0 • C2 = G1-0 | P1-0C0
CIS 501: Comp Arch| Dr. | Arithmetic
G1-0 P1-0 C1
G3-2 P3-2 C3
G3-0 P3-0 C2
4-bit CLA Timing: d7
• Signals ready from before • d0:C0,Ai,Bi
• d1:Pi,Gi C0 • d3:P1-0,G1-0,C1,P3-2,G3-2
• d5: P3-0, G3-0, C2 C • New signals ready at d7
• C3 = G2 | P2C2
• C4 = G3-0 | P3-0C0 C
• Si ready d1 after Ci C C4
CIS 501: Comp Arch| Dr. | Arithmetic
G1-0 P1-0 C1
G3-0 P3-0 C2
G3-2 P3-2 C3
Two-Level CLA for 4-bit Adder
• Hardware?
• CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)
• 3 of these
• C4 block: 2 gates, max 2 inputs • Total: 17 gates/3inputs
• Infinite: 14/5
C • Latency? C
• 2 for “outer” CLA, 4 for “interior” • G/P go “up”, C go “down”
• Total:8(7forCLA,1forS) • Infinite: 4
G1-0 P1-0 C1
G3-2 P3-2 C3
G3-0 P3-0 C2
• 2LisbiggerandslowerL
CIS 501: Comp Arch| Dr. | Arithmetic
Two-Level CLA for 16-bit Adder
• 4 G/P inputs per level
• Hardware?
• First level: 14 gates * 4 blocks
• Second level: 14 gates * 1 block • Total: 70 gates
• largest gate: 5 inputs
• Infinite: 152 gates, 17 inputs
• Latency?
• Total:8(1+2+2+2+1) • Infinite:4(1+2+1)
• CLA for a 64-bit adder?
CIS 501: Comp Arch| Dr. | Arithmetic
G15-0 P15-0 C4,8,12
G3-0 P3-0 C1,2,3
G7-4 P7-4 C5,6,7
G11-8 P11-8 C9,10,11
G15-12 P15-12 C13,14,15
C16 46
16-bit CLA Timing: d0
• What is ready after 0 gate delays? • C0
G15-0 P15-0 C4,8,12
G3-0 P3-0 C1,2,3
G7-4 P7-4 C5,6,7
G11-8 P11-8 C9,10,11
G15-12 P15-12 C13,14,15
CIS 501: Comp Arch| Dr. | Arithmetic
C16 47
16-bit CLA Timing: d1
• What is ready after 1 gate delay? • Individual G/P
G15-0 P15-0 C4,8,12
G3-0 P3-0 C1,2,3
G7-4 P7-4 C5,6,7
G11-8 P11-8 C9,10,11
G15-12 P15-12 C13,14,15
CIS 501: Comp Arch| Dr. | Arithmetic
C16 48
16-bit CLA Timing: d3
• What is ready after 3 gate delays? • 1st level group G/P
• Interior carries of 1st group
• C1, C2, C3
G15-0 P15-0 C4,8,12
G3-0 P3-0 C1,2,3
G7-4 P7-4 C5,6,7
G11-8 P11-8 C9,10,11
G15-12 P15-12 C13,14,15
CIS 501: Comp Arch| Dr. | Arithmetic
C16 49
16-bit CLA Timing: d5
• And after 5 gate delays?
• Outer level group G/P
• Outer level “interior” carries
• C4, C8, C12
G15-0 P15-0 C4
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com