留学生作业代写 CSE 371 Computer Organization and Design

CSE 371 Computer Organization and Design

CIS 501: Comp Arch| Dr. | Arithmetic
Computer Organization and Design

Copyright By PowCoder代写 加微信 powcoder

Unit 3: Arithmetic

Based on slides by Profs. & & C.J. IS 501: Comp Arch| Dr. | Arithmetic
This Unit: Arithmetic
A little review
Binary + 2s complement
Ripple-carry addition (RCA)
Fast integer addition
Carry-select (CSeA)
Carry Lookahead Adder (CLA)
Integer multiplication and division
Floating point arithmetic

System software

CIS 501: Comp Arch| Dr. | Arithmetic

CIS 501: Comp Arch| Dr. | Arithmetic
The Importance of Fast Arithmetic
Addition of two numbers is most common operation
Programs use addition frequently
Loads and stores use addition for address calculation
Branches use addition to test conditions and calculate targets
All insns use addition to calculate default next PC
Fast addition is critical to high performance

CIS 501: Comp Arch| Dr. | Arithmetic
Review: Binary Integers
Computers represent integers in binary (base2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
Natural since only two values are represented
Addition, etc. take place as usual (carry the 1, etc.)

17 = 10001
+5 = 101
22 = 10110

Some old machines use decimal (base10) with only 0/1
30 = 011 000
Often called BCD (binary-coded decimal)
Unnatural for digital logic, implementation complicated & slow
More precise for currency operations

CIS 501: Comp Arch| Dr. | Arithmetic
Fixed Width
On pencil and paper, integers have infinite width

In hardware, integers have fixed width
N bits: 16, 32 or 64
LSB is 20, MSB is 2N-1

Range: 0 to 2N–1

Numbers >2N represented using multiple fixed-width integers
In software (e.g., Java BigInteger class)

CIS 501: Comp Arch| Dr. | Arithmetic
What About Negative Integers?
Sign/magnitude
Unsigned plus one bit for sign
10 = 000001010, -10 = 100001010
Matches our intuition from “by hand” decimal arithmetic
representations of both +0 and –0
Addition is difficult
symmetric range: –(2N-1–1) to 2N-1–1

Option II: two’s complement (2C)
Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
One representation for 0 (all zeroes)
Easy addition
asymmetric range: –(2N-1) to 2N-1–1

CIS 501: Comp Arch| Dr. | Arithmetic
The Tao of 2C
How did 2C come about?
“Let’s design a representation that makes addition easy”
Think of subtracting 10 from 0 by hand with 8-bit numbers
Have to “borrow” 1s from some imaginary leading 1

0 = 100000000
-10 = 00000010
-10 = 011111110

Now, add the conventional way…

-10 = 11111110
+10 = 00000010
0 = 100000000

CIS 501: Comp Arch| Dr. | Arithmetic
Still More On 2C
What is the interpretation of 2C?
Same as binary, except MSB represents –2N–1, not 2N–1
–10 = 11110110 = –27+26+25+24+22+21
Extends to any width
–10 = 110110 = –25+24+22+21
Why? 2N = 2*2N–1
–25+24+22+21 = (–26+2*25)–25+24+22+21 = –26+25+24+22+21
Equivalent to computing modulo 2N

Trick to negating a number quickly: –B = B’ + 1
–(1) = (0001)’+1 = 1110+1 = 1111 = –1
–(–1) = (1111)’+1 = 0000+1 = 0001 = 1
–(0) = (0000)’+1 = 1111+1 = 0000 = 0
Think about why this works

CIS 501: Comp Arch| Dr. | Arithmetic

CIS 501: Comp Arch| Dr. | Arithmetic
1st Grade: Decimal Addition

Repeat N times
Add least significant digits and any overflow from previous add
Carry “overflow” to next addition
Overflow: any digit other than least significant of sum
Shift two addends and sum one digit to the right

Sum of two N-digit numbers can yield an N+1 digit number

Why N+1, why not 2N?

CIS 501: Comp Arch| Dr. | Arithmetic
Binary Addition: Works the Same Way
1 111111
43 = 00101011
+29 = 00011101
72 = 01001000

Repeat N times
Add least significant bits and any overflow from previous add
Carry the overflow to next addition
Shift two addends and sum one bit to the right
Sum of two N-bit numbers can yield an N+1 bit number

More steps (smaller base)
Each one is simpler (adding just 1 and 0)
So simple we can do it in hardware

CIS 501: Comp Arch| Dr. | Arithmetic

The Half Adder
How to add two binary integers in hardware?
Start with adding two bits
When all else fails … look at truth table

A B = C0 S
0 0 = 0 0
0 1 = 0 1
1 0 = 0 1
1 1 = 1 0

CO (carry out) = AB
This is called a half adder

CIS 501: Comp Arch| Dr. | Arithmetic
The Other Half
We could chain half adders together, but to do that…
Need to incorporate a carry out from previous adder
C A B = C0 S
0 0 0 = 0 0
0 0 1 = 0 1
0 1 0 = 0 1
0 1 1 = 1 0
1 0 0 = 0 1
1 0 1 = 1 0
1 1 0 = 1 0
1 1 1 = 1 1

S = C’A’B | C’AB’ | CA’B’ | CAB = C ^ A ^ B
CO = C’AB | CA’B | CAB’ | CAB = CA | CB | AB
This is called a full adder

CIS 501: Comp Arch| Dr. | Arithmetic
Ripple-Carry Adder
N-bit ripple-carry adder
N 1-bit full adders “chained” together
CO0 = CI1, CO1 = CI2, etc.
CON–1 is carry-out of entire adder
CON–1 = 1  “overflow”

Example: 16-bit ripple carry adder
How fast is this?
How fast is an N-bit ripple-carry adder?

CIS 501: Comp Arch| Dr. | Arithmetic
Quantifying Adder Delay
Combinational logic dominated by gate delays
How many gates between input and output?
Array storage dominated by wire delays
Longest delay or critical path is what matters

Can implement any combinational function in “2” logic levels
1 level of AND + 1 level of OR (PLA)
NOTs are “free”: push to input (DeMorgan’s) or read from latch
Example: delay(FullAdder) = 2
d(CarryOut) = delay(AB | AC | BC)
d(Sum) = d(A ^ B ^ C) = d(AB’C’ | A’BC’ | A’B’C | ABC) = 2
Note ‘^’ means Xor (just like in C & Java)

Caveat: “2” assumes gates have few (<8 ?) inputs Gate delays is an abstraction: not all gates are equally expensive, etc. CIS 501: Comp Arch| Dr. | Arithmetic Ripple-Carry Adder Delay Longest path is to CO15 (or S15) delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15)) d(A15) = d(B15) = 0, d(CI15) = d(CO14) d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) … d(CO15) = 32 d(CON–1) = 2N Linear in number of bits Number of gates is also linear CIS 501: Comp Arch| Dr. | Arithmetic CIS 501: Comp Arch| Dr. | Arithmetic 4th Grade: Decimal Division 9 // quotient 3 |29 // divisor | dividend 2 // remainder Shift divisor left (multiply by 10) until MSB lines up with dividend’s Repeat until remaining dividend (remainder) < divisor Find largest single digit q such that (q*divisor) < dividend Set LSB of quotient to q Subtract (q*divisor) from dividend Shift quotient left by one digit (multiply by 10) Shift divisor right by one digit (divide by 10) Note that shifting the divisor to the right is equivalent to shifting the dividend to the left. We will use this trick later. CIS 501: Comp Arch| Dr. | Arithmetic Binary Division 3 |29 = 0011 |011101 -24 = - 011000 5 = 000101 - 3 = - 000011 2 = 000010 CIS 501: Comp Arch| Dr. | Arithmetic Binary Division Hardware Same as decimal division, except (again) More individual steps (base is smaller) Each step is simpler Find largest bit q such that (q*divisor) < dividend q = 0 or 1 Subtract (q*divisor) from dividend q = 0 or 1  no actual multiplication, subtract divisor or not Complication: largest q such that (q*divisor) < dividend How do you know if (1*divisor) < dividend? Human can “eyeball” this Computer does not have eyeballs Subtract and see if result is negative CIS 501: Comp Arch| Dr. | Arithmetic Software Divide Algorithm Can implement this algorithm in software Inputs: dividend and divisor Outputs: quotient = dividend / divisor rem = dividend % divisor remainder = 0; quotient = 0; for (int i = 0; i < 32; i++) {   remainder = (remainder << 1) | (dividend >> 31);
  if (remainder >= divisor) {
    quotient = (quotient << 1) | 1;     remainder = remainder - divisor; quotient = (quotient << 1) | 0;   dividend = dividend << 1; Same idea except that we are shifting the dividend and keeping the divisor constant Note ORing the quotient with 0 is redundant. All this means is that we are shifting in a zero into the LSB as opposed to shifting in a 1 as in the previous statement. Small point right shifting dividend may result in sign extension if you aren’t careful which isn’t what you want. That why ANDing with 1 may be useful. CIS 501: Comp Arch| Dr. | Arithmetic Divide Example Input: Divisor = 00011 , Dividend = 11101 Step Remainder Quotient Remainder Dividend 0 00000 00000 00000 11101 1 00001 00000 00001 11010 2 00011 00001 00000 10100 3 00001 00010 00001 01000 4 00010 00100 00010 10000 5 00101 01001 00010 00000 Result: Quotient: 1001, Remainder: 10 CIS 501: Comp Arch| Dr. | Arithmetic Divider Circuit Shift in 0 or 1 Shift in 0 or 1 Shift in 0 N cycles for n-bit divide Fast Addition CIS 501: Comp Arch| Dr. | Arithmetic CIS 501: Comp Arch| Dr. | Arithmetic Bad idea: a PLA-based Adder? If any function can be expressed as two-level logic… …why not use a PLA for an entire 8-bit adder? Approx. 216 AND gates, each with 16 inputs Then, 8 OR gates, each with 216 inputs Number of gates exponential in bit width! Not that fast, either An AND gate with 65K inputs != 2-input AND gate Many-input gates made from tree of, say, 4-input gates 216-input gates would have at least 8 logic levels So, at least 10 levels of logic for a 16-bit PLA Even so, delay is still logarithmic in number of bits There are better (faster, smaller) ways CIS 501: Comp Arch| Dr. | Arithmetic Theme: Hardware != Software Hardware can do things that software fundamentally can’t And vice versa (of course) In hardware, it’s easier to trade resources for latency One example of this: speculation Slow computation waiting for some slow input? Input one of two things? Compute with both (slow), choose right one later (fast) Does this make sense in software? Not on a uni-processor Difference? hardware is parallel, software is sequential CIS 501: Comp Arch| Dr. | Arithmetic Carry-Select Adder Carry-select adder Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1 Choose the correct one when CO7 finally becomes available Effectively cuts carry chain in half (break critical path) But adds mux On board: bad CSA design with 15/1 split CIS 501: Comp Arch| Dr. | Arithmetic Multi-Segment Carry-Select Adder Multiple segments Example: 5, 5, 6 bit = 16 bit Hardware cost Still mostly linear (~2x) Compute each segment with 0 and 1 carry-in Serial mux chain 5-bit adder (10) + Two muxes (4) = 14 CIS 501: Comp Arch| Dr. | Arithmetic Carry-Select Adder Delay What is carry-select adder delay (two segment)? d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2 d(CO15) = MAX(2*8, 2*8) + 2 = 18 In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA) What if we cut adder into 4 equal pieces? Would it be 2*(N/4) + 2 = 10? Not quite d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2 d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2 d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2 d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2 d(CO15) = 2*4 + 3*2 = 14 N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18 16b Carry-Select Adder Delay CIS 501: Comp Arch| Dr. | Arithmetic Optimal delay is at M=4: 14 gate delays TODO: can do better with Bharath’s asymmetric 2-2-3-4-5 design! Just 12 gate delays, overlaps RCA latency with muxes better. gate delays 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 32 18 14.66666666666667 14 14.4 15.33333333333333 16.571428571428569 18 19.555555555555561 21.2 22.90909090909091 24.666666666666671 26.46153846153846 28.28571428571426 30.133333333333319 32 M Gate delay CIS 501: Comp Arch| Dr. | Arithmetic Another Option: Carry Lookahead Is carry-select adder as fast as we can go? Another approach to using additional resources Instead of redundantly computing sums assuming different carries Use redundancy to compute carries more quickly This approach is called carry lookahead (CLA) CIS 501: Comp Arch| Dr. | Arithmetic A & B  Generate & Propagate Let’s look at the single-bit carry-out function CO = AB|AC|BC Factor out terms that use only A and B (available immediately) CO =(AB)|(A|B)C (AB): generates carry-out regardless of incoming C  rename to G (A|B) : propagates incoming C  rename to P Why don’t we worry about sum bit? We have a carry out if we either 1) generate it OR 2) we propagate AND there is a carry-in CIS 501: Comp Arch| Dr. | Arithmetic Infinite Hardware CLA Can expand C1…N in terms of G’s, P’s, and C0 Example: C16 C16 = G15 | P15C15 C16 = G15 | P15(G14|P14C14) C16 = G15 | P15G14 | … | P15P14…P2P1G0 | P15P14…P2P1P0C0 Similar expansions for C15, C14, etc. How much does this cost? CN needs: N AND’s + 1 OR’s, largest have N+1 inputs CN…C1 needs: N*(N+1)/2 AND’s + N OR’s, max N+1 inputs N=16: 152 total gates, max input 17 N=64: 2144 total gates, max input 65 And how fast is it really? Not that fast, unfortunately, 17-input gates are really slow CIS 501: Comp Arch| Dr. | Arithmetic Compromise: Multi-Level CLA Ripple carry Few small gates: 2N gates, max 3 inputs Laid in series: 2N (linear) latency Infinite hardware CLA Many big gates: N*(N+3)/2 additional gates, max N+1 inputs Laid in parallel: constant 5 latency Is there a compromise? Reasonable number of small gates? Sublinear (doesn’t have to be constant) latency? Yes, multi-level CLA: exploits hierarchy to achieve this 5 gate delays = 3 GD for carry computation + 2 GD for final addition (full adder) CIS 501: Comp Arch| Dr. | Arithmetic Carry Lookahead Adder (CLA) Calculate “propagate” and “generate” based on A, B Not based on carry in Combine with tree structure Take aways Tree gives logarithmic delay Reasonable area Ci is the carry into bit i CIS 501: Comp Arch| Dr. | Arithmetic Individual G & P  Windowed G & P Windowed G/P: useful abstraction for multi-level CLA Individual carry equations C1 = G0 | P0C0, C2 = G1 | P1C1 Infinite hardware CLA equations C1 = G0 | P0C0 C2 = G1 | P1G0 | P1P0C0 Group terms into “windows” C2 = (G1 | P1G0) | (P1P0)C0 C2 = G1-0 | P1-0C0 G1-0, P1-0 are window G & P a single bit summarizing information from all the bits in the window G1-0: carry-out generated by 1-0 bits P1-0: carry-out propagated by 1-0 bits CIS 501: Comp Arch| Dr. | Arithmetic Two-Level CLA for 4-bit Adder Individual carry equations C1 = G0 | P0C0, C2 = G1 | P1C1, C3 = G2 | P2C2, C4 = G3 | P3C3 Infinite hardware CLA equations C1 = G0 | P0C0 C2 = G1 | P1G0 | P1P0C0 C3 = G2 | P2G1 | P2P1G0 | P2P1P0C0 C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0C0 Hierarchical CLA equations First level: expand C2 using C1, C4 using C3 C2 = G1 | P1(G0 | P0C0) = (G1 | P1G0) | (P1P0)C0 = G1-0 | P1-0C0 C4 = G3 | P3(G2 | P2C2) = (G3 | P3G2) | (P3P2)C2 = G3-2 | P3-2C2 Second level: expand C4 using expanded C2 C4 = G3-2 | P3-2(G1-0 | P1-0C0) = (G3-2 | P3-2G1-0) | (P3-2P1-0)C0 C4 = G3-0 | P3-0C0 NB: Gx-y and Px-y are each single bit wires CIS 501: Comp Arch| Dr. | Arithmetic Two-Level CLA for 4-bit Adder Top first-level CLA block Input: C0, G0, P0, G1, P1 Output: C1, G1-0, P1-0 Bottom first-level CLA block Input: C2, G2, P2, G3, P3 Output: C3, G3-2, P3-2 Second-level CLA block Input: C0, G1-0, P1-0, G3-2, P3-2 Output: C2, G3-0, P3-0 These 3 blocks are “the same” Input: C0, G3-0, P3-0 Output: C4 CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d0 Which signals are ready at 0 gate delays (d0)? CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d1 Signals ready from before d0: C0, Ai, Bi New signals ready at d1 P0 = A0 | B0, G0 = A0B0 P1 = A1 | B1, G1 = A1B1 P2 = A2 | B2, G2 = A2B2 P3 = A3 | B3, G3 = A3B3 CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d3 Signals ready from before d0: C0, Ai, Bi d1: Pi, Gi New signals ready at d3 P1-0 = P1P0 G1-0 = G1 | P1G0 C1 = G0 | P0C0 P3-2 = P3P2 G3-2 = G3 | P3G2 C3 is not ready CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d5 Signals ready from before d0: C0, Ai, Bi d1: Pi, Gi d3: P1-0, G1-0 , C1 , P3-2 , G3-2 New signals ready at d5 P3-0 = P3-2P1-0 G3-0 = G3-2 | P3-2G1-0 C2 = G1-0 | P1-0C0 CIS 501: Comp Arch| Dr. | Arithmetic 4-bit CLA Timing: d7 Signals ready from before d0: C0, Ai, Bi d1: Pi, Gi d3: P1-0, G1-0 , C1 , P3-2 , G3-2 d5: P3-0, G3-0, C2 New signals ready at d7 C3 = G2 | P2C2 C4 = G3-0 | P3-0C0 Si ready d1 after Ci Q: why is sum ready in just one gate delay? Q: what happens in C4 block? CIS 501: Comp Arch| Dr. | Arithmetic Two-Level CLA for 4-bit Adder CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2) 3 of these C4 block: 2 gates, max 2 inputs Total: 17 gates/3inputs Infinite: 14/5 2 for “outer” CLA, 4 for “interior” G/P go “up”, C go “down” Total: 8 (7 for CLA, 1 for S) Infinite: 4 2L is bigger and slower  CLA hierarchy doesn’t pay off until we have larger adders CIS 501: Comp Arch| Dr. | Arithmetic Two-Level CLA for 16-bit Adder 4 G/P inputs per level First level: 14 gates * 4 blocks Second level: 14 gates * 1 block Total: 70 gates largest gate: 5 inputs Infinite: 152 gates, 17 inputs Total: 8 (1 + 2 + 2 + 2 + 1) Infinite: 4 (1 + 2 + 1) CLA for a 64-bit adder? CIS 501: Comp Arch| Dr. | Arithmetic 16-bit CLA Timing: d0 What is ready after 0 gate delays? CIS 501: Comp Arch| Dr. | Arithmetic 16-bit CLA Timing: d1 What is ready after 1 gate delay? Individual G/P CIS 501: Comp Arch| Dr. | Arithmetic 16-bit CLA Timing: d3 What is ready after 3 gate delays? 1st level group G/P Interior carries of 1st group C1, C2, C3 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com