CSE 371 Computer Organization and Design
CIS 501: Comp Arch| Dr. | Arithmetic
Computer Organization and Design
Copyright By PowCoder代写 加微信 powcoder
Unit 3: Arithmetic
Based on slides by Profs. & & C.J. IS 501: Comp Arch| Dr. | Arithmetic
This Unit: Arithmetic
A little review
Binary + 2s complement
Ripple-carry addition (RCA)
Fast integer addition
Carry-select (CSeA)
Carry Lookahead Adder (CLA)
Integer multiplication and division
Floating point arithmetic
System software
CIS 501: Comp Arch| Dr. | Arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic
The Importance of Fast Arithmetic
Addition of two numbers is most common operation
Programs use addition frequently
Loads and stores use addition for address calculation
Branches use addition to test conditions and calculate targets
All insns use addition to calculate default next PC
Fast addition is critical to high performance
CIS 501: Comp Arch| Dr. | Arithmetic
Review: Binary Integers
Computers represent integers in binary (base2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
Natural since only two values are represented
Addition, etc. take place as usual (carry the 1, etc.)
17 = 10001
+5 = 101
22 = 10110
Some old machines use decimal (base10) with only 0/1
30 = 011 000
Often called BCD (binary-coded decimal)
Unnatural for digital logic, implementation complicated & slow
More precise for currency operations
CIS 501: Comp Arch| Dr. | Arithmetic
Fixed Width
On pencil and paper, integers have infinite width
In hardware, integers have fixed width
N bits: 16, 32 or 64
LSB is 20, MSB is 2N-1
Range: 0 to 2N–1
Numbers >2N represented using multiple fixed-width integers
In software (e.g., Java BigInteger class)
CIS 501: Comp Arch| Dr. | Arithmetic
What About Negative Integers?
Sign/magnitude
Unsigned plus one bit for sign
10 = 000001010, -10 = 100001010
Matches our intuition from “by hand” decimal arithmetic
representations of both +0 and –0
Addition is difficult
symmetric range: –(2N-1–1) to 2N-1–1
Option II: two’s complement (2C)
Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
One representation for 0 (all zeroes)
Easy addition
asymmetric range: –(2N-1) to 2N-1–1
CIS 501: Comp Arch| Dr. | Arithmetic
The Tao of 2C
How did 2C come about?
“Let’s design a representation that makes addition easy”
Think of subtracting 10 from 0 by hand with 8-bit numbers
Have to “borrow” 1s from some imaginary leading 1
0 = 100000000
-10 = 00000010
-10 = 011111110
Now, add the conventional way…
-10 = 11111110
+10 = 00000010
0 = 100000000
CIS 501: Comp Arch| Dr. | Arithmetic
Still More On 2C
What is the interpretation of 2C?
Same as binary, except MSB represents –2N–1, not 2N–1
–10 = 11110110 = –27+26+25+24+22+21
Extends to any width
–10 = 110110 = –25+24+22+21
Why? 2N = 2*2N–1
–25+24+22+21 = (–26+2*25)–25+24+22+21 = –26+25+24+22+21
Equivalent to computing modulo 2N
Trick to negating a number quickly: –B = B’ + 1
–(1) = (0001)’+1 = 1110+1 = 1111 = –1
–(–1) = (1111)’+1 = 0000+1 = 0001 = 1
–(0) = (0000)’+1 = 1111+1 = 0000 = 0
Think about why this works
CIS 501: Comp Arch| Dr. | Arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic
1st Grade: Decimal Addition
Repeat N times
Add least significant digits and any overflow from previous add
Carry “overflow” to next addition
Overflow: any digit other than least significant of sum
Shift two addends and sum one digit to the right
Sum of two N-digit numbers can yield an N+1 digit number
Why N+1, why not 2N?
CIS 501: Comp Arch| Dr. | Arithmetic
Binary Addition: Works the Same Way
1 111111
43 = 00101011
+29 = 00011101
72 = 01001000
Repeat N times
Add least significant bits and any overflow from previous add
Carry the overflow to next addition
Shift two addends and sum one bit to the right
Sum of two N-bit numbers can yield an N+1 bit number
More steps (smaller base)
Each one is simpler (adding just 1 and 0)
So simple we can do it in hardware
CIS 501: Comp Arch| Dr. | Arithmetic
The Half Adder
How to add two binary integers in hardware?
Start with adding two bits
When all else fails … look at truth table
A B = C0 S
0 0 = 0 0
0 1 = 0 1
1 0 = 0 1
1 1 = 1 0
CO (carry out) = AB
This is called a half adder
CIS 501: Comp Arch| Dr. | Arithmetic
The Other Half
We could chain half adders together, but to do that…
Need to incorporate a carry out from previous adder
C A B = C0 S
0 0 0 = 0 0
0 0 1 = 0 1
0 1 0 = 0 1
0 1 1 = 1 0
1 0 0 = 0 1
1 0 1 = 1 0
1 1 0 = 1 0
1 1 1 = 1 1
S = C’A’B | C’AB’ | CA’B’ | CAB = C ^ A ^ B
CO = C’AB | CA’B | CAB’ | CAB = CA | CB | AB
This is called a full adder
CIS 501: Comp Arch| Dr. | Arithmetic
Ripple-Carry Adder
N-bit ripple-carry adder
N 1-bit full adders “chained” together
CO0 = CI1, CO1 = CI2, etc.
CON–1 is carry-out of entire adder
CON–1 = 1 “overflow”
Example: 16-bit ripple carry adder
How fast is this?
How fast is an N-bit ripple-carry adder?
CIS 501: Comp Arch| Dr. | Arithmetic
Quantifying Adder Delay
Combinational logic dominated by gate delays
How many gates between input and output?
Array storage dominated by wire delays
Longest delay or critical path is what matters
Can implement any combinational function in “2” logic levels
1 level of AND + 1 level of OR (PLA)
NOTs are “free”: push to input (DeMorgan’s) or read from latch
Example: delay(FullAdder) = 2
d(CarryOut) = delay(AB | AC | BC)
d(Sum) = d(A ^ B ^ C) = d(AB’C’ | A’BC’ | A’B’C | ABC) = 2
Note ‘^’ means Xor (just like in C & Java)
Caveat: “2” assumes gates have few (<8 ?) inputs
Gate delays is an abstraction: not all gates are equally expensive, etc.
CIS 501: Comp Arch| Dr. | Arithmetic
Ripple-Carry Adder Delay
Longest path is to CO15 (or S15)
delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15))
d(A15) = d(B15) = 0, d(CI15) = d(CO14)
d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) …
d(CO15) = 32
d(CON–1) = 2N
Linear in number of bits
Number of gates is also linear
CIS 501: Comp Arch| Dr. | Arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic
4th Grade: Decimal Division
9 // quotient
3 |29 // divisor | dividend
2 // remainder
Shift divisor left (multiply by 10) until MSB lines up with dividend’s
Repeat until remaining dividend (remainder) < divisor
Find largest single digit q such that (q*divisor) < dividend
Set LSB of quotient to q
Subtract (q*divisor) from dividend
Shift quotient left by one digit (multiply by 10)
Shift divisor right by one digit (divide by 10)
Note that shifting the divisor to the right is equivalent to shifting the dividend to the left. We will use this trick later.
CIS 501: Comp Arch| Dr. | Arithmetic
Binary Division
3 |29 = 0011 |011101
-24 = - 011000
5 = 000101
- 3 = - 000011
2 = 000010
CIS 501: Comp Arch| Dr. | Arithmetic
Binary Division Hardware
Same as decimal division, except (again)
More individual steps (base is smaller)
Each step is simpler
Find largest bit q such that (q*divisor) < dividend
q = 0 or 1
Subtract (q*divisor) from dividend
q = 0 or 1 no actual multiplication, subtract divisor or not
Complication: largest q such that (q*divisor) < dividend
How do you know if (1*divisor) < dividend?
Human can “eyeball” this
Computer does not have eyeballs
Subtract and see if result is negative
CIS 501: Comp Arch| Dr. | Arithmetic
Software Divide Algorithm
Can implement this algorithm in software
Inputs: dividend and divisor
Outputs: quotient = dividend / divisor
rem = dividend % divisor
remainder = 0;
quotient = 0;
for (int i = 0; i < 32; i++) {
remainder = (remainder << 1) | (dividend >> 31);
if (remainder >= divisor) {
quotient = (quotient << 1) | 1;
remainder = remainder - divisor;
quotient = (quotient << 1) | 0;
dividend = dividend << 1;
Same idea except that we are shifting the dividend and keeping the divisor constant
Note ORing the quotient with 0 is redundant. All this means is that we are shifting in a zero into the LSB as opposed to shifting in a 1 as in the previous statement.
Small point right shifting dividend may result in sign extension if you aren’t careful which isn’t what you want. That why ANDing with 1 may be useful.
CIS 501: Comp Arch| Dr. | Arithmetic
Divide Example
Input: Divisor = 00011 , Dividend = 11101
Step Remainder Quotient Remainder Dividend
0 00000 00000 00000 11101
1 00001 00000 00001 11010
2 00011 00001 00000 10100
3 00001 00010 00001 01000
4 00010 00100 00010 10000
5 00101 01001 00010 00000
Result: Quotient: 1001, Remainder: 10
CIS 501: Comp Arch| Dr. | Arithmetic
Divider Circuit
Shift in 0 or 1
Shift in 0 or 1
Shift in 0
N cycles for n-bit divide
Fast Addition
CIS 501: Comp Arch| Dr. | Arithmetic
CIS 501: Comp Arch| Dr. | Arithmetic
Bad idea: a PLA-based Adder?
If any function can be expressed as two-level logic…
…why not use a PLA for an entire 8-bit adder?
Approx. 216 AND gates, each with 16 inputs
Then, 8 OR gates, each with 216 inputs
Number of gates exponential in bit width!
Not that fast, either
An AND gate with 65K inputs != 2-input AND gate
Many-input gates made from tree of, say, 4-input gates
216-input gates would have at least 8 logic levels
So, at least 10 levels of logic for a 16-bit PLA
Even so, delay is still logarithmic in number of bits
There are better (faster, smaller) ways
CIS 501: Comp Arch| Dr. | Arithmetic
Theme: Hardware != Software
Hardware can do things that software fundamentally can’t
And vice versa (of course)
In hardware, it’s easier to trade resources for latency
One example of this: speculation
Slow computation waiting for some slow input?
Input one of two things?
Compute with both (slow), choose right one later (fast)
Does this make sense in software? Not on a uni-processor
Difference? hardware is parallel, software is sequential
CIS 501: Comp Arch| Dr. | Arithmetic
Carry-Select Adder
Carry-select adder
Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1
Choose the correct one when CO7 finally becomes available
Effectively cuts carry chain in half (break critical path)
But adds mux
On board: bad CSA design with 15/1 split
CIS 501: Comp Arch| Dr. | Arithmetic
Multi-Segment Carry-Select Adder
Multiple segments
Example: 5, 5, 6 bit = 16 bit
Hardware cost
Still mostly linear (~2x)
Compute each segment
with 0 and 1 carry-in
Serial mux chain
5-bit adder (10) +
Two muxes (4) = 14
CIS 501: Comp Arch| Dr. | Arithmetic
Carry-Select Adder Delay
What is carry-select adder delay (two segment)?
d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2
d(CO15) = MAX(2*8, 2*8) + 2 = 18
In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA)
What if we cut adder into 4 equal pieces?
Would it be 2*(N/4) + 2 = 10? Not quite
d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2
d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2
d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2
d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2
d(CO15) = 2*4 + 3*2 = 14
N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2
16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18
16b Carry-Select Adder Delay
CIS 501: Comp Arch| Dr. | Arithmetic
Optimal delay is at M=4: 14 gate delays
TODO: can do better with Bharath’s asymmetric 2-2-3-4-5 design! Just 12 gate delays, overlaps RCA latency with muxes better.
gate delays 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 32 18 14.66666666666667 14 14.4 15.33333333333333 16.571428571428569 18 19.555555555555561 21.2 22.90909090909091 24.666666666666671 26.46153846153846 28.28571428571426 30.133333333333319 32 M
Gate delay
CIS 501: Comp Arch| Dr. | Arithmetic
Another Option: Carry Lookahead
Is carry-select adder as fast as we can go?
Another approach to using additional resources
Instead of redundantly computing sums assuming different carries
Use redundancy to compute carries more quickly
This approach is called carry lookahead (CLA)
CIS 501: Comp Arch| Dr. | Arithmetic
A & B Generate & Propagate
Let’s look at the single-bit carry-out function
CO = AB|AC|BC
Factor out terms that use only A and B (available immediately)
CO =(AB)|(A|B)C
(AB): generates carry-out regardless of incoming C rename to G
(A|B) : propagates incoming C rename to P
Why don’t we worry about sum bit?
We have a carry out if we either 1) generate it OR 2) we propagate AND there is a carry-in
CIS 501: Comp Arch| Dr. | Arithmetic
Infinite Hardware CLA
Can expand C1…N in terms of G’s, P’s, and C0
Example: C16
C16 = G15 | P15C15
C16 = G15 | P15(G14|P14C14)
C16 = G15 | P15G14 | … | P15P14…P2P1G0 | P15P14…P2P1P0C0
Similar expansions for C15, C14, etc.
How much does this cost?
CN needs: N AND’s + 1 OR’s, largest have N+1 inputs
CN…C1 needs: N*(N+1)/2 AND’s + N OR’s, max N+1 inputs
N=16: 152 total gates, max input 17
N=64: 2144 total gates, max input 65
And how fast is it really?
Not that fast, unfortunately, 17-input gates are really slow
CIS 501: Comp Arch| Dr. | Arithmetic
Compromise: Multi-Level CLA
Ripple carry
Few small gates: 2N gates, max 3 inputs
Laid in series: 2N (linear) latency
Infinite hardware CLA
Many big gates: N*(N+3)/2 additional gates, max N+1 inputs
Laid in parallel: constant 5 latency
Is there a compromise?
Reasonable number of small gates?
Sublinear (doesn’t have to be constant) latency?
Yes, multi-level CLA: exploits hierarchy to achieve this
5 gate delays = 3 GD for carry computation + 2 GD for final addition (full adder)
CIS 501: Comp Arch| Dr. | Arithmetic
Carry Lookahead Adder (CLA)
Calculate “propagate” and “generate” based on A, B
Not based on carry in
Combine with tree structure
Take aways
Tree gives logarithmic delay
Reasonable area
Ci is the carry into bit i
CIS 501: Comp Arch| Dr. | Arithmetic
Individual G & P Windowed G & P
Windowed G/P: useful abstraction for multi-level CLA
Individual carry equations
C1 = G0 | P0C0, C2 = G1 | P1C1
Infinite hardware CLA equations
C1 = G0 | P0C0
C2 = G1 | P1G0 | P1P0C0
Group terms into “windows”
C2 = (G1 | P1G0) | (P1P0)C0
C2 = G1-0 | P1-0C0
G1-0, P1-0 are window G & P
a single bit summarizing information from all the bits in the window
G1-0: carry-out generated by 1-0 bits
P1-0: carry-out propagated by 1-0 bits
CIS 501: Comp Arch| Dr. | Arithmetic
Two-Level CLA for 4-bit Adder
Individual carry equations
C1 = G0 | P0C0, C2 = G1 | P1C1, C3 = G2 | P2C2, C4 = G3 | P3C3
Infinite hardware CLA equations
C1 = G0 | P0C0
C2 = G1 | P1G0 | P1P0C0
C3 = G2 | P2G1 | P2P1G0 | P2P1P0C0
C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0C0
Hierarchical CLA equations
First level: expand C2 using C1, C4 using C3
C2 = G1 | P1(G0 | P0C0) = (G1 | P1G0) | (P1P0)C0 = G1-0 | P1-0C0
C4 = G3 | P3(G2 | P2C2) = (G3 | P3G2) | (P3P2)C2 = G3-2 | P3-2C2
Second level: expand C4 using expanded C2
C4 = G3-2 | P3-2(G1-0 | P1-0C0) = (G3-2 | P3-2G1-0) | (P3-2P1-0)C0
C4 = G3-0 | P3-0C0
NB: Gx-y and Px-y are each single bit wires
CIS 501: Comp Arch| Dr. | Arithmetic
Two-Level CLA for 4-bit Adder
Top first-level CLA block
Input: C0, G0, P0, G1, P1
Output: C1, G1-0, P1-0
Bottom first-level CLA block
Input: C2, G2, P2, G3, P3
Output: C3, G3-2, P3-2
Second-level CLA block
Input: C0, G1-0, P1-0, G3-2, P3-2
Output: C2, G3-0, P3-0
These 3 blocks are “the same”
Input: C0, G3-0, P3-0
Output: C4
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d0
Which signals are ready at 0 gate delays (d0)?
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d1
Signals ready from before
d0: C0, Ai, Bi
New signals ready at d1
P0 = A0 | B0, G0 = A0B0
P1 = A1 | B1, G1 = A1B1
P2 = A2 | B2, G2 = A2B2
P3 = A3 | B3, G3 = A3B3
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d3
Signals ready from before
d0: C0, Ai, Bi
d1: Pi, Gi
New signals ready at d3
P1-0 = P1P0
G1-0 = G1 | P1G0
C1 = G0 | P0C0
P3-2 = P3P2
G3-2 = G3 | P3G2
C3 is not ready
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d5
Signals ready from before
d0: C0, Ai, Bi
d1: Pi, Gi
d3: P1-0, G1-0 , C1 , P3-2 , G3-2
New signals ready at d5
P3-0 = P3-2P1-0
G3-0 = G3-2 | P3-2G1-0
C2 = G1-0 | P1-0C0
CIS 501: Comp Arch| Dr. | Arithmetic
4-bit CLA Timing: d7
Signals ready from before
d0: C0, Ai, Bi
d1: Pi, Gi
d3: P1-0, G1-0 , C1 , P3-2 , G3-2
d5: P3-0, G3-0, C2
New signals ready at d7
C3 = G2 | P2C2
C4 = G3-0 | P3-0C0
Si ready d1 after Ci
Q: why is sum ready in just one gate delay?
Q: what happens in C4 block?
CIS 501: Comp Arch| Dr. | Arithmetic
Two-Level CLA for 4-bit Adder
CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)
3 of these
C4 block: 2 gates, max 2 inputs
Total: 17 gates/3inputs
Infinite: 14/5
2 for “outer” CLA, 4 for “interior”
G/P go “up”, C go “down”
Total: 8 (7 for CLA, 1 for S)
Infinite: 4
2L is bigger and slower
CLA hierarchy doesn’t pay off until we have larger adders
CIS 501: Comp Arch| Dr. | Arithmetic
Two-Level CLA for 16-bit Adder
4 G/P inputs per level
First level: 14 gates * 4 blocks
Second level: 14 gates * 1 block
Total: 70 gates
largest gate: 5 inputs
Infinite: 152 gates, 17 inputs
Total: 8 (1 + 2 + 2 + 2 + 1)
Infinite: 4 (1 + 2 + 1)
CLA for a 64-bit adder?
CIS 501: Comp Arch| Dr. | Arithmetic
16-bit CLA Timing: d0
What is ready after 0 gate delays?
CIS 501: Comp Arch| Dr. | Arithmetic
16-bit CLA Timing: d1
What is ready after 1 gate delay?
Individual G/P
CIS 501: Comp Arch| Dr. | Arithmetic
16-bit CLA Timing: d3
What is ready after 3 gate delays?
1st level group G/P
Interior carries of 1st group
C1, C2, C3
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com