Lecture 5:
Arithmetic 1/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021
Arithmetic for Computers
▪ Operations on integers
– Addition and subtraction
– Multiplication and division
– Dealing with overflow
▪ Floating-point real numbers
– Representation and operations
2 UC Davis EEC 170, Winter 2021 / © John Owens
§3.1 Introduction
Arithmetic
▪ Where we’ve been:
– Performance (seconds, cycles, instructions) – Abstractions:
– Instruction Set Architecture
– Assembly Language and Machine Language
▪ What’s up ahead:
– Number Representation
– Implementing an ALU
– Implementing the Architecture
a
32
b
32
operation
ALU
32
result
3
UC Davis EEC 170, Winter 2021 / © John Owens
32 bit signed numbers
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten 0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten 0000 0000 0000 0000 0000 0000 0000 0010two = + 2ten …
0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646 ten
0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647 ten
1000 0000 0000 0000 0000 0000 0000 0000two = – 2,147,483,648ten 1000 0000 0000 0000 0000 0000 0000 0001two = – 2,147,483,647ten 1000 0000 0000 0000 0000 0000 0000 0010two = – 2,147,483,646ten …
1111 1111 1111 1111 1111 1111 1111 1101two = – 3ten
1111 1111 1111 1111 1111 1111 1111 1110two = – 2 ten
1111 1111 1111 1111 1111 1111 1111 1111two = – 1 ten
maxint
minint
4
UC Davis EEC 170, Winter 2021 / © John Owens
What happens if you compute –minint?
5 UC Davis EEC 170, Winter 2021 / © John Owens
Two’s Complement Operations
▪ Negating a two’s complement number: invert all bits and add 1
– remember: “negate” and “invert” are quite different!
▪ Converting n bit numbers into numbers with more than n bits:
– RISC V n bit immediate gets converted to 64 bits for arithmetic
– copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
– “sign extension” (lbu vs. lb)
– mem [x1] = 0xff
– lb x2, 0(x1) -> x2 = ffff ffff
– lbu x2, 0(x1) -> x2 = 0000 00ff
6 UC Davis EEC 170, Winter 2021 / © John Owens
Addition & Subtraction
▪ Two’s complement operations easy
– subtraction using addition of negative numbers
0111 (7) + 1010 (-6)
0111 (7)
– 0110 (6)
7
UC Davis EEC 170, Winter 2021 / © John Owens
Addition & Subtraction
▪ Overflow (result too large for finite computer word):
– adding two n-bit numbers does not yield an n-bit number
1111
+ 0001
10000
– How about -1 + -1?
8 UC Davis EEC 170, Winter 2021 / © John Owens
Detecting Overflow
▪ ▪ ▪
▪ ▪
No overflow when adding a positive and a negative number No overflow when signs are the same for subtraction Overflow occurs when the value affects the sign:
– overflow when adding two positives yields a negative
– or, adding two negatives gives a positive
– or, subtract a negative from a positive and get a negative
– or, subtract a positive from a negative and get a positive
Consider the operations A + B, and A – B
– Can overflow occur if B is 0 ?
– Can overflow occur if A is 0 ?
HW problem on figuring out exact criteria
9 UC Davis EEC 170, Winter 2021 / © John Owens
and Subtraction
Integer Addition ▪ Example: 7 + 6
▪ Overflow if result out of range
– Adding +ve and –ve operands, no overflow
– Adding two +ve operands
– Overflow if result sign is 1 – Adding two –ve operands
– Overflow if result sign is 0
10
UC Davis EEC 170, Winter 2021 / © John Owens
Integer Subtraction
▪ Add negation of second operand
▪ Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111 –6: 1111 1111 … 1111 1010 +1:0000 0000 … 0000 0001
▪ Overflow if result out of range
– Subtracting two +ve or two –ve operands, no overflow
– Subtracting +ve from –ve operand
– Overflow if result sign is 0
– Subtracting –ve from +ve operand – Overflow if result sign is 1
11
UC Davis EEC 170, Winter 2021 / © John Owens
Effects of Overflow
▪ An exception (interrupt) occurs
– Control jumps to predefined address for exception
– Interrupted address is saved for possible resumption
▪ Details based on software system / language
– example: flight control vs. homework assignment
– C/Java do not detect overflow
– Fortran (evidently) can detect overflow
▪ Don’t always want to detect overflow
– –
–
– RISC-V advocates branches on overflow instead
MIPS instructions (but not RISC-V): addu,addiu,subu addiu sign-extends
sltu, sltiu for unsigned comparisons
12 UC Davis EEC 170, Winter 2021 / © John Owens
Arithmetic for Multimedia
▪
▪
Saturating operations
– On overflow, result is largest representable value
– c.f. 2s-complement modulo arithmetic – E.g., clipping in audio, saturation in video
Graphics and media processing operates on vectors of 8-bit and 16-bit data
– Use 64-bit adder, with partitioned carry chain
(this probably doesn’t mean anything to you yet, but wait until the end of lecture)
– Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors – SIMD (single-instruction, multiple-data)
13 UC Davis EEC 170, Winter 2021 / © John Owens
Review: Boolean Algebra & Gates
▪ Problem: Consider a logic function with three inputs: A, B, and C.
▪ Output D is true if at least one input is true
▪ Output E is true if exactly two inputs are true
▪ Output F is true only if all three inputs are true
14 UC Davis EEC 170, Winter 2021 / © John Owens
Review: Boolean Algebra & Gates
▪ Show the Boolean equations for these three functions.
15 UC Davis EEC 170, Winter 2021 / © John Owens
Review: Boolean Algebra & Gates
▪ Show an implementation consisting of inverters, AND, and OR gates.
16 UC Davis EEC 170, Winter 2021 / © John Owens
You should know …
▪ Boolean algebra
▪ Logic gates (and, or, not, xor, nor, multiplexors [muxes], decoders, etc.)
▪ Converting between equations, truth tables, and gate representations
▪ Critical path
▪ Clocking, and registers / memory
▪ Finite state machines
▪ … COD5e Appendix A summarizes this material
17 UC Davis EEC 170, Winter 2021 / © John Owens
Break
18 UC Davis EEC 170, Winter 2021 / © John Owens
Administrivia
▪ We have a TA! I already put him to work. Trivikram Reddy
▪ By popular demand, I will have an office hour Fri 18 October, “when the train arrives” (9:30)–11 at the CoHo
19 UC Davis EEC 170, Winter 2021 / © John Owens
Bit Slices
▪ Concentrate on one bit of the adder:
0111 + 0110
▪ Could we build the same hardware for every bit?
– This is a good idea. Why?
– Each bit’s hardware is called a “bit slice”
▪
20 UC Davis EEC 170, Winter 2021 / © John Owens
Truth Table for Adder Bit Slice ▪ 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0
0
1
1
0
1
1
0
0
1
0
1
0
1
0
1
1
1
0
0
1
1
1
1
1
1
21 UC Davis EEC 170, Winter 2021 / © John Owens
Adder Equations ▪ Sum = (A⊕B)⊕Cin
▪ Carry = AB + ACin + BCin ▪ Abstract as “Full Adder”:
A B
Cin
Full Cout Adder Sum
22 UC Davis EEC 170, Winter 2021 / © John Owens
Cascading Adders
▪ Cascade Full Adders to make multibit adder:
▪ A+B=S
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
Full Adder
B3
A2
S3
Full Adder
B2
S2
A1
Full Adder
B1
S1
A0
Full Adder
B0
S0
23
UC Davis EEC 170, Winter 2021 / © John Owens
Arithmetic for Multimedia
▪
▪
Saturating operations
– On overflow, result is largest representable value
– c.f. 2s-complement modulo arithmetic – E.g., clipping in audio, saturation in video
Graphics and media processing operates on vectors of 8-bit and 16-bit data
– Use 64-bit adder, with partitioned carry chain
(this probably doesn’t mean anything to you yet, but wait until the end of lecture)
– Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors – SIMD (single-instruction, multiple-data)
24 UC Davis EEC 170, Winter 2021 / © John Owens
Truth Table for Subtractor Bit Slice ▪ 3 inputs (A, B, Bin); 2 outputs (Diff, Bout)
A
B
Bin
Diff
Bout
0
0
0
0
0
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
1
0
0
1
1
0
0
0
1
1
1
1
1
25 UC Davis EEC 170, Winter 2021 / © John Owens
Cascading Subtractors
▪ Cascade Full Subtrs to make multibit subtr:
▪ A-B=S
A3
A B
Bin
A B
Bin
A B
Bin
A B
Full Subtr
Full Subtr
Full Subtr
Bout Diff
Bout Diff
Bout Diff
Bout Diff
B3
B2
A1
B1
Bin
26
UC Davis EEC 170, Winter 2021 / © John Owens
Full Subtr
S3
A2
S2
S1
A0
B0
S0
How can we combine + and -?
A
A B
Cin
A B
Bin
Full Adder
Full Subtr
Cout Sum
Bout Diff
+ –
+ –
B
▪ This is common—it’s what we’ll do (for example) for logic functions (and, or, etc.)
27 UC Davis EEC 170, Winter 2021 / © John Owens
Cin+1
S
How can we combine + and -?
S = A- B
S = A + (~B) + 1
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Full Adder
Full Adder
Full Adder
Full Adder
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
B3
A2
S3
B2
S2
A1
B1
S1
A0
B0
S0
28
UC Davis EEC 170, Winter 2021 / © John Owens
Lest you think this is only theoretical …
29 UC Davis EEC 170, Winter 2021 / © John Owens
How can we combine + and -?
A1
B1
+ –
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
+ –
Full Adder
Cout
Sum S0
B0
S = A + ~B + 1 First, negate B …
30
UC Davis EEC 170, Winter 2021 / © John Owens
How can we combine + and -?
A1
B1
+ –
A B
Cin
A B
Cin
Full Adder
Cout
Sum
S1
A0
+ –
+ –
Full Adder
Cout
Sum
B0
S0
S = A + ~B + 1 … then add 1
0
1
31
UC Davis EEC 170, Winter 2021 / © John Owens
Control for +/-
One bit controls three muxes. This is a “control point”.
A B
Cin
A B
Cin
A1
B1
+ –
Full Adder
Cout
Sum
S1
A0
Cout
Sum
How do we set this control point for add? subtract?
+ –
Full Adder
B0
S0
0
+ –
1
32
UC Davis EEC 170, Winter 2021 / © John Owens
imm[11:0] rs1 funct3 rd opcode I-type
RISC-V instruc
rs2
imm[31:12] rd opcode
im
m
[2
|19
rd opcode
0|1
codings
:12]
rs1 funct3 imm[4:0] opcode
S-type B-type U-type J-type
LUI AUIPC JAL JALR BEQ BNE BLT BGE BLTU BGEU LB
LH LW LBU LHU SB
SH SW ADDI SLTI SLTIU XORI ORI ANDI SLLI SRLI SRAI ADD SUB SLL SLT SLTU XOR SRL
SRA
imm[11:5]
imm[12|10:5] tiorns2 enrs1 funct3 imm[4:1|11] opcode
0:1|11
RV32I Base Instruction Set
imm[31:12]
rd
0110111
imm[31:12]
rd
0010111
imm[20|10:1|11|19:12]
rd
1101111
imm[11:0]
rs1
000
rd
1100111
imm[12|10:5]
rs2
rs1
000
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
001
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
100
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
101
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
110
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
111
imm[4:1|11]
1100011
imm[11:0]
rs1
000
rd
0000011
imm[11:0]
rs1
001
rd
0000011
imm[11:0]
rs1
010
rd
0000011
imm[11:0]
rs1
100
rd
0000011
imm[11:0]
rs1
101
rd
0000011
imm[11:5]
rs2
rs1
000
imm[4:0]
0100011
imm[11:5]
rs2
rs1
001
imm[4:0]
0100011
imm[11:5]
rs2
rs1
010
imm[4:0]
0100011
imm[11:0]
rs1
000
rd
0010011
imm[11:0]
rs1
010
rd
0010011
imm[11:0]
rs1
011
rd
0010011
imm[11:0]
rs1
100
rd
0010011
imm[11:0]
rs1
110
rd
0010011
imm[11:0]
rs1
111
rd
0010011
0000000
shamt
rs1
001
rd
0010011
0000000
shamt
rs1
101
rd
0010011
0100000
shamt
rs1
101
rd
0010011
0000000
rs2
rs1
000
rd
0110011
0100000
rs2
rs1
000
rd
0110011
0000000
rs2
rs1
001
rd
0110011
0000000
rs2
rs1
010
rd
0110011
0000000
rs2
rs1
011
rd
0110011
0000000
0000000
0100000
rs2 rs1 rs2 rs1 rs2 rs1
33
100 rd
101 rd
101 rd
0110011
0110011
0110011
UC Davis EEC 170, Winter 2021 / © John Owens
0000000 rs2 rs1 110 rd 0110011 OR
|| ||
imm[12 10:5] rs2 rs1 101 imm[4:1 11] 1100011 BGE
imm[12|10:5] rs2 rs1 110 imm[4:1|11] 1100011 BLTU imm[12|10:5] rs2 rs1 111 imm[4:1|11] 1100011 BGEU
imm[11:0] rs1 000 rd 0000011 LB
RISC-V instruction encodings (funct7)
imm[11:0] rs1
0000011 LH 0000011 LW 0000011 LBU 0000011 LHU 0100011 SB 0100011 SH 0100011 SW 0010011 ADDI 0010011 SLTI 0010011 SLTIU 0010011 XORI 0010011 ORI 0010011 ANDI
imm[11:0] rs1 ▪ ADD: i0m0m[0110:0] 00 rs1 imm[11:0] rs1
001 rd 010 rd
100 rd
101 rd
000 imm[4:0]
001 imm[4:0]
010 imm[4:0] 000 rd
010 rd
011 rd
100 rd
110 rd
111 rd
001 rd 0010011 SLLI
▪ SUB: 0100000
imm[11:5] rs2 rs1
imm[11:5] rs2 rs1 imm[11:5] rs2 rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 0000000 shamt rs1
0000000
shamt
rs1
101
rd
0010011
0100000
shamt
rs1
101
rd
0010011
0000000
rs2
rs1
000
rd
0110011
0100000
rs2
rs1
000
rd
0110011
0000000
rs2
rs1
001
rd
0110011
0000000
rs2
rs1
010
rd
0110011
0000000
rs2
rs1
011
rd
0110011
0000000 rs2 rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1
fm pred succ rs1
100 rd
101 rd
101 rd
110 rd
111 rd
000 rd
SRLI SRAI ADD SUB SLL SLT SLTU
0110011 XOR
0110011 SRL
0110011 SRA
0110011 OR
0110011 AND
34
UC Davis EEC 170, Winter 2021 / © John Owens
0001111 FENCE
000000000000 00000 000 00000 1110011 ECALL
imm[11:0] rs1 001 rd 0000011 LH
imm[11:0] rs1 010 rd 0000011 LW imm[11:0] rs1 100 rd 0000011 LBU
RISC-V instruction encodings (funct3)
imm[11:0] rs1 101 rd 0000011 LHU
imm[11:5] rs2 rs1 imm[11:5] rs2 rs1
000 imm[4:0]
001 imm[4:0]
010 imm[4:0]
000 ADDI
010 SLTI
011 SLTIU
100 XORI 110 ORI
111
001 SLLI 101 SRLI 101 SRAI 000 ADD
000 SUB
001 SLL
010 SLT
011 SLTU
▪ ADDI: 000
imm[11:5] rs2 rs1
0100011 SB 0100011 SH 0100011 SW
▪ SLTI: 010
imm[11:0] rs1
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0110011
rd
0110011
rd
0110011
rd
0110011
rd
0110011
▪ SLTIU: 011
imm[11:0] rs1
imm[11:0] rs1
▪SLT: 010
imm[11:0] rs1
imm[11:0] rs1
▪ SLimTmU[:11:0]011 rs1 0000000 shamt rs1 0000000 shamt rs1 0100000 shamt rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1
fm pred succ rs1
ANDI
35
0001111 FENCE
100 rd
101 rd
101 rd
110 rd
111 rd
000 rd
0110011 XOR
0110011 SRL
0110011 SRA
0110011 OR
0110011 AND
UC Davis EEC 170, Winter 2021 / © John Owens
000000000000 00000 000 00000 1110011 ECALL
Bit Slices
▪ Concentrate on one bit of the adder:
0111 + 0110
▪ Needs:
– 2 inputs (A and B)
– Carry from previous slice (Cin)
– Output (Sum)
– Carry to next slice (Cout)
▪
36 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Opcode Map
essor User’s Manual / Joe Heinrich]
37 UC Davis EEC 170, Winter 2021 / © John Owens
End of lecture / Quiz 1
38 UC Davis EEC 170, Winter 2021 / © John Owens
But What About Performance? ▪ Critical path of one bitslice is CP
▪ Critical path of n-bit rippled-carry adder is n*CP
▪ Design Trick:
– Throw hardware at it
A0 B0
A1 B1
A2 B2
A3 B3
CarryIn0
1-bit ALU
CarryOut0
Result0
Result1
Result2
Result3
CarryIn1
1-bit ALU
CarryOut1
CarryIn2
1-bit ALU
CarryOut2 CarryIn3
CarryOut3
39
UC Davis EEC 170, Winter 2021 / © John Owens
1-bit ALU
Truth Table for Adder Bit Slice ▪ 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0=Cin
0
1
1
0
1=Cin
1
0
0
1
0=Cin
1
0
1
0
1=Cin
1
1
0
0
1
1
1
1
1
1
40 UC Davis EEC 170, Winter 2021 / © John Owens
A0 B0
A1 B1
A2 B2
A3 B3
Carry Look Ahead (Design trick: peek)
C0 = Cin
S
S
S
S
C4 = . . .
G
P
G = A and B P = A xor B WHY are these interesting?
41
UC Davis EEC 170, Winter 2021 / © John Owens
A B Cout
0 0 0 “kill”
0 1 Cin 1 0 Cin
1 1 1
“propagate” “propagate” “generate”
G P
G P
C1 = G0 + C0 · P0
G P
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G P
CLA vs. Ripple
C0 = Cin
S
G = A and B P = A xor B
CarryIn0
A0 B0
A1 B1
A0
1-bit ALU
CarryOut0
Result0
G P
B0 CarryIn1
A1
B1 CarryIn2
1-bit
ALU Result1
C1 = G0 + C0 · P0
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 +
G0 · P1 · P2 +
C0 · P0 · P1 · P2
C4 = . . .
S
G P
A2 A2 B2
CarryOut1
1-bit ALU
CarryOut2
1-bit
ALU
CarryOut3
Result2
Result3
S
G P
B2
A3 B3
CarryIn3
A3 B3
S
G
P
G P
42
UC Davis EEC 170, Winter 2021 / © John Owens
Cascaded Carry Look-ahead (16-bit)
C
L C0
A
4-bit Adder
4-bit Adder
4-bit Adder
G0 P0
C1 = G0 + C0 · P0
Abstraction!
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G P
C4 = . . .
43
UC Davis EEC 170, Winter 2021 / © John Owens
Design Trick: Guess (or “Precompute”)
CP(2n) = 2*CP(n)
n-bit adder n-bit adder
(2n) = CP(n) + CP(mux)
n-bit adder
1n-bit adder
0
n-bit adder
Cout
Carry-select adder
44
UC Davis EEC 170, Winter 2021 / © John Owens
P
Carry Skip Adder: reduce worst case delay
▪ Just speed up the slowest case for each block
4-bit Ripple Adder 4-bit Ripple Adder
B A4 B A0
▪ Exercise: optimal design uses variable block sizes (why?)
P3 S P3 S
P2
P1
P0
P2
P1
P0
45 UC Davis EEC 170, Winter 2021 / © John Owens
Adder Lessons
▪ Reuse hardware if possible
– +/- reuse is compelling argument for 2’s complement
▪ For higher performance:
– Look for critical path, optimize for it Reorganize equations [propagate/generate / carry lookahead]
– Precompute [carry save]
– Reduce worst-case delay [carry skip]
46 UC Davis EEC 170, Winter 2021 / © John Owens
Finished way early (1:20 in, 30 minutes
47 UC Davis EEC 170, Winter 2021 / © John Owens
End of lecture
48 UC Davis EEC 170, Winter 2021 / © John Owens
Lecture 6:
Arithmetic 2/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021
Multiply (unsigned)
▪ Paper and pencil example (unsigned):
Multiplicand 1000 Multiplier x1001 1000
0000 0000
1000 Product 01001000
▪ m bits x n bits = m+n bit product
▪ Binary makes it easy:
– 0 => place 0 ( 0 x multiplicand )
– 1 => place a copy ( 1 x multiplicand )
▪ 4 versions of multiply hardware & algorithm:
– successive refinement
50 UC Davis EEC 170, Winter 2021 / © John Owens
m bits x n bits = m+n bit product
51 UC Davis EEC 170, Winter 2021 / © John Owens
Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
52 UC Davis EEC 170, Winter 2021 / © John Owens
How does it work?
00
A3 A3 A2
0
0000
A3 A2 A1 A0 A2
A1
A0
A1
A0
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
▪ At each stage shift A left (multiply it by 2)
▪ Use next bit of B to determine whether to add in shifted
multiplicand
▪ Accumulate 2n bit partial product at each stage
53 UC Davis EEC 170, Winter 2021 / © John Owens
Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
▪ Stage i accumulates A * 2i if Bi == 1
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
▪ Q: How much hardware for 32 bit multiplier? Critical path?
54 UC Davis EEC 170, Winter 2021 / © John Owens
Unsigned shift-add multiplier (version 1)
▪ 64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg
Multiplicand
64 bits
Shift Left
Multiplier
32 bits
Control
Shift Right
64-bit ALU
Product
64 bits
Write
Multiplier = datapath + control
55
UC Davis EEC 170, Winter 2021 / © John Owens
Multiply Algorithm V1
Product
0000 0000 0011
1:0000 0010 0011 2:0000 0010 0011 3:0000 0010 0001 1:0000 0110 0001 2:0000 0110 0001 3:0000 0110 0000
0000 0110 0000
Start
1. Test Multiplier0
Multiplier Multiplicand 0000 0010
Multiplier0 = 1
Multiplier0 = 0
0000 0010
0000 0100
0000 0100 0000 0100 0000 1000 0000 1000
0000 1000
1a. Add multiplicand to product &
place the result in Product register
2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions
Yes: 32 repetitions Done
56
UC Davis EEC 170, Winter 2021 / © John Owens
Observations on Multiply Version 1
▪ 1 clock per cycle => ≈ 100 clocks per multiply – Ratio of multiply to add 5:1 to 100:1
▪ 1/2 bits in multiplicand always 0 => 64-bit adder is wasted
▪ 0’s inserted in right of multiplicand as shifted
=> least significant bits of product never changed once formed
▪ Instead of shifting multiplicand to left, shift product to right?
57 UC Davis EEC 170, Winter 2021 / © John Owens
Multiply Hardware Version 2
▪ 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg, 32-bit Multiplier reg
Multiplicand
32 bits
32-bit ALU
Product
64 bits
Multiplier
32 bits
Control
Shift Right
Shift Right
Write
58
UC Davis EEC 170, Winter 2021 / © John Owens
How to think of this?
▪ Remember original combinational multiplier:
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
59 UC Davis EEC 170, Winter 2021 / © John Owens
Simply warp to let product move right…
0000
A3 A2 A1 A0
A3 A2 A1 A0
A3 A2 A1 A0
B0
B1
B2
B3
P7 P6 P5 P4 P3 P2 P1 P0
▪ MultiplicaAnd staAys stAill anAd product moves right 3210
60 UC Davis EEC 170, Winter 2021 / © John Owens
Multiply Algorithm V2
Start
Multiplier0 = 1
1. Test Multiplier0
Multiplier0 = 0
0000 0000
1: 0010 0000
2: 0001 0000
0011 0010
0011 0010
1a. Add multiplicand to the left half of product & 0011 0010
3: 00010000 00p0l1acet0h0e1re0sultinthelefthalfofProductregister
1: 0011 0000 0001 0010
2: 0001 1000 0001 0010 Product Multiplier Multiplicand
3: 0001 1000 1: 0001 1000 2: 0000 1100 3: 0000 1100 1: 0000 1100 2: 0000 0110 3: 0000 0110
0000 0110
0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010
0000 0010
2. Shift the Product register right 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions
Yes: 32 repetitions Done
61
UC Davis EEC 170, Winter 2021 / © John Owens
Still more wasted space!
Start
Multiplier0 = 1
1. Test Multiplier0
Multiplier0 = 0
0000 0000 1: 0010 0000 2: 0001 0000 3: 0001 0000 1: 0011 0000
0011 0010
1a. Add multiplicand to the left half of product & 0011 0010
place the result in the left half of Product register 0011 0010
2: 0001 1000 3: 0001 1000 1: 0001 1000 2: 0000 1100 3: 0000 1100 1: 0000 1100 2: 0000 0110 3: 0000 0110
0000 0110
0001 0010
0001 0010
0001 0010 Product Multiplier Multiplicand
0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010
0000 0010
2. Shift the Product register right 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions
Yes: 32 repetitions Done
62
UC Davis EEC 170, Winter 2021 / © John Owens
Observations on Multiply Version 2
▪ Product register wastes space that exactly matches size of multiplier
=> combine Multiplier register and Product register
63 UC Davis EEC 170, Winter 2021 / © John Owens
Multiply Hardware Version 3
▪ 32-bit Multiplicand reg, 32 -bit ALU, 64-bit Product reg, (0-bit Multiplier reg)
Multiplicand 32 bits
32-bit ALU
Product (Multiplier) 64 bits
Shift Right
Write
Control
64
UC Davis EEC 170, Winter 2021 / © John Owens
Multiply Algorithm V3
Product0 = 1
Start
1.
Product0
Product0 = 0
Test
1a. Add multiplicand to the left half of product & place the result in the left half of Product register
0000 0011 0010
1: 0010 0011 0010
2: 0001 0001 0010
1: 0011 0000 0010 Product Multiplicand
2: 0001 1000 0010
1: 0001 1000 0010
2: 0000 1100 0010
1: 0000 1100 0010
2: 0000 0110 0010
0000 0110 0010
2. Shift the Product register right 1 bit.
32nd repetition?
No: < 32 repetitions
Yes: 32 repetitions Done
65
UC Davis EEC 170, Winter 2021 / © John Owens
Observations on Multiply Version 3
▪ 2 steps per bit because Multiplier & Product combined ▪ What about signed multiplication?
- easiest solution is to make both positive & remember whether to complement product when done (leave out the sign bit, run for 31 steps)
- apply definition of 2’s complement
- need to sign-extend partial products and subtract at the end
- Booth’s Algorithm is elegant way to multiply signed numbers using same hardware as before and save cycles
- can handle multiple bits at a time
66 UC Davis EEC 170, Winter 2021 / © John Owens
Faster Multiplier
▪
Uses multiple adders
- Cost/performance tradeoff
▪
Can be pipelined
- Several multiplications performed in parallel
67
UC Davis EEC 170, Winter 2021 / © John Owens
Motivation for Booth’s Algorithm
▪ ▪
Example 2 x 6 = 0010 x 0110:
0010 x 0110
+ + + +
0000 shift (0 in multiplier) 0010 add (1 in multiplier) 0010 add (1 in multiplier)
0000 shift (0 in multiplier) 0001100
68 UC Davis EEC 170, Winter 2021 / © John Owens
Motivation for Booth’s Algorithm
▪ ALU with add or subtract can get same result in more than one way:
▪ 6 = 4 + 2 = -2 + 8
0110 = -00010 + 01000 = 11110 + 01000
▪ For example:
0010 (2 [multiplier]) x 0110 (6 [multiplicand])
0000 - 0010
0000 + 0010
00001100
shift (0 in multiplier)
sub (first 1 in multpl.)
shift (mid string of 1s)
add (prior step had last 1)
69 UC Davis EEC 170, Winter 2021 / © John Owens
Booth’s Algorithm
▪ CuernredntoBfitruBnit tomthide dRilgehtoEfxprluanatbioenginning oExfarmupnle Op
1 0 sub
1 1 0 1 0 0
Begins run of 1s 0001111000 011110
Middle of run of 1s 0001111000 none End of run of 1s 0001111000 add Middle of run of 0s 0001111000 none
▪ Originally for speed (when shift was faster than add)
▪ Replace a string of 1s in multiplier with an initial subtract when we first
see a one and then later add for the bit after the last one
▪ Handles two’s complement!
70 UC Davis EEC 170, Winter 2021 / © John Owens
Booth’s Example (2 x 7)
Operation
0. initial value
1a. P = P - m 1b.
2.
3.
4a. 4b.
Multiplicand 0010
Product
0000 0111 0 1110 0111 0 1111 0011 1 1111 1001 1 1111 1100 1 0001 1100 1 0000 1110 0
next?
10 -> sub
shift P (sign ext) 11 -> nop, shift 11 -> nop, shift 01 -> add
shift done
Blue + red = multiplier
(red is 2 bits to compare); Green = arithmetic ops; Black = product
71 UC Davis EEC 170, Winter 2021 / © John Owens
0010
0010
0010
0010
0010
0010
+ 1110->
+ 0010->
Booth’s Example (2 x -3)
Operation Multiplicand Product
next?
10 -> sub shift P (sign ext) 01 -> add
shift P
10 -> sub shift
11 -> nop
shift done
0. initial value 0010
0000 1101 0 1110 1101 0
1111 0110 1
1a. P = P – m 1b.
2a. 2b.
3a.
3b. 4a.
1110 + 1110 0010
+ 0010
0001 011 0000 1011 0
0 1
0010
+ 1110 1110 1011 0
1111 0101 1 1111 0101 1
0010 0010
4b.
Blue + red = multiplier (red is 2 bits to compare); Green = arithmetic ops; Black = product
0010
1111 1010 1
72 UC Davis EEC 170, Winter 2021 / © John Owens
Radix-4 Modified Booth’s Algorithm
Current
Bits Right
00 0
01 0
10 0
11 0
00 1
01 1
10 1
Explanation Example Recode
Bit to the
Middle of zeros Single one Begins run of 1s Begins run of 1s
Ends run of 1s Ends run of 1s Isolated 0 Middle of run
00 00 00 00 00 0 00 00 00 01 00 1 00 01 11 10 00 -2
00 01 11 11 00 -1 00 00 11 11 00 1
00 01 11 11 00 2 00 11 10 11 00 -1 00 11 11 11 00 0
11 1
Same insight as one-bit Booth’s, simply adjust for alignment of 2 bits. Allows multiplication 2 bits at a time.
73 UC Davis EEC 170, Winter 2021 / © John Owens
RISC-V Multiplication Support
▪ Four multiply instructions: – mul: multiply
– Gives the lower 64 bits of the product – mulh: multiply high
– Gives the upper 64 bits of the product, assuming the operands are signed
– mulhu: multiply high unsigned
– Gives the upper 64 bits of the product, assuming the operands are unsigned
– mulhsu: multiply high signed/unsigned
– Gives the upper 64 bits of the product, assuming one operand is signed and the other unsigned
– Use mulh result to check for 64-bit overflow
74 UC Davis EEC 170, Winter 2021 / © John Owens
RISC-V Support for multiply
▪ “If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order andrdhcannot be the same asrs1or
rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.”
75 UC Davis EEC 170, Winter 2021 / © John Owens
Multiplication Summary
▪ Iterative algorithm ▪ Design techniques:
– Analyze hardware—what’s not in use?
– Spend more hardware to get higher performance
– Booth’s Algorithm—more general (2’s complement)
– Booth’s Algorithm—recoding is powerful technique to think about problem in a different way
– Booth’s Algorithm—more bits at once gives higher performance
76 UC Davis EEC 170, Winter 2021 / © John Owens
Break
77 UC Davis EEC 170, Winter 2021 / © John Owens
Administrivia
▪ TA Trivikram coming at 11:30 to present the project – You’re going to write 4 RISC-V procedures.
▪ HW 3 released last Friday, due on Friday
– Solutions will be outside my office, Kemper 3175
▪ Midterm is week from today
– Open book, open note
– Bring a calculator
78 UC Davis EEC 170, Winter 2021 / © John Owens
Is NVIDIA Doubling Down On RISC-V?
▪ Betteridge’s law of headlines is an adage that states:
“Any headline that ends in a question mark can be answered by the word no”.
79 UC Davis EEC 170, Winter 2021 / © John Owens
Is NVIDIA Doubling Down On RISC-V?
▪ “Six RISC-V positions have been advertised by NVIDIA, based in Shanghai and pertaining to architecture, design, and verification.”
▪ “Due its light weight and extensibility, RISC-V is gaining mainstream adoption across many new sectors, including datacenter accelerators, mobile & wireless, automotive, and IoT.”
▪ “In a 2017 RISC-V workshop in Shanghai, NVIDIA explained that shortcomings such as low performance and lack of caches and thread protection meant Falcon’s architecture could not meet growing complexity demands.”
▪ “NVIDIA listed the technical criteria for its next-gen architecture: more than twice the performance of Falcon, less than twice the area cost of Falcon, support for caches, tightly coupled memories, 64-bit addresses, and suitability for modern operating systems. They concluded only RISC-V meets all criteria. The new RISC-V micro-controllers will outperform Falcon micro- controllers by three times, Tom’s Hardware has reported.”
https://medium.com/syncedreview/is-nvidia-doubling-down-on-r8is0c-v-1ce714a919eb UC Davis EEC 170, Winter 2021 / © John Owens
RISC-V Support for divide
▪ 4 instructions:
-{div, divu, rem, remu} rd, rs1, rs2
– – – –
▪ “If both the quotient and remainder are required from the same division, the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr,
rs1, rs2 (rdq cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single divide operation instead of performing two separate divides.”
▪ Overflow and division-by-zero don’t produce errors
– Just return defined results
– Faster for the common case of no error
div: rs1 / rs2, treat as signed
divu: rs1 / rs2, treat as unsigned
rem: rs1 mod rs2, treat as signed
remu: rs1 mod rs2, treat as unsigned
81 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Support for multiply/divide
▪ Rather than target the general-purpose registers:
– –
hi
mul placed its output into two specialhiandloregisters div placed its divide output intoloand its rem output into
–
general-purpose register)
MIPS providedmfloandmfhiinstructions (destination:
82 UC Davis EEC 170, Winter 2021 / © John Owens
Divide: Paper & Pencil
1001 Quotient Divisor 1000 1001010 Dividend
–100100
101
1010
–1000
10 Remainder
(or Modulo result)
▪
▪ ▪
See how big a number can be subtracted, creating quotient bit on each step
– Binary => 1 * divisor or 0 * divisor Dividend = Quotient x Divisor + Remainder 3 versions of divide, successive refinement
83 UC Davis EEC 170, Winter 2021 / © John Owens
Divide Hardware Version 1
▪ 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
64 bits
Shift Right
64-bit ALU
Remainder
64 bits
Write
Quotient
32 bits
Control
Shift Left
84
UC Davis EEC 170, Winter 2021 / © John Owens
Divide Algorithm V1
▪ Takes n+1 steps for n-bit Quotient & Rem.
Start: Place Dividend in Remainder
1. Subtract the Divisor register from the Remainder register, and place the result in the Remainder register.
▪ Remainder Quotient
0000 0111
Divisor Test Remainder ≥0 Remainder
0010 0000
Remainder < 0
0000
2b. Restore the original value by adding the Divisor register to the Remainder register, & place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
“restoring” division
2a. Shift the Quotient register to the left setting the new rightmost
bit to 1.
3. Shift the Divisor register right1 bit.
n+1 repetition?
No: < n+1 repetitions
Yes: n+1 repetitions (n = 4 here)
UC Davis EEC 170, Winter 2021 / © John Owens
Done
85
Divide Algorithm I example (7 / 2)
Remainder Quotient Divisor
0000 0111 1:1110 0111 00000
2:0000 0111 0000 3:0000 0111 0000 1:1111 0111 0000
2:0000 0111 000
000000010 0000 0010 0000
0010 0000
0001 0000
0001 0000
0001 0000
0000 1000
0000 1000
0000 1000
0000 0100
0000 0100
0000 0100
0000 0010
0000 0010
0000 0010
0000 0001
Answer: Quotient = 3 Remainder = 1
0
0 0 0
00
0
3:0000 0111 000
1:11111111 000
2:0000 0111 00
00
00
0
3:0000 0111 00
000
1:0000 0011 00
000 0001
2:0000 0011 0
3:0000 0011 0
0001
0001 2:0000 0001 00011
3:0000 0001 00011
1:0000 0001 0
86
UC Davis EEC 170, Winter 2021 / © John Owens
Divide: Paper & Pencil
01010 Quotient
Divisor 0001 00001010 Dividend 00001
–0001
0000 0001
–0001
0
00 Remainder (or Modulo result)
- No way to get a 1 in leading digit!
- (this is an overflow, i.e quotient would have n+1 bits)
-
can save 1 iteration
⇒ switch order to shift first and then subtract,
87 UC Davis EEC 170, Winter 2021 / © John Owens
Observations on Divide Version 1
▪ 1/2 bits in divisor always 0
=> 1/2 of 64-bit adder is wasted
=> 1/2 of divisor is wasted
▪ Instead of shifting divisor to right, shift remainder to left?
88 UC Davis EEC 170, Winter 2021 / © John Owens
Divide Algorithm I example: wasted space
Remainder Quotient 0000 0111 00000
1:1110 0111 00000
2:0000 0111 00000
3:0000 0111 00000
1:1111 0111 00000
2:0000 0111 00000
3:0000 0111 00000
1:1111 1111 00000
2:0000 0111 00000
3:0000 0111 00000
1:0000 0011 00000
2:0000 0011 00001
3:0000 0011 0
Divisor 0010 0000
0010 0000
0010 0000
0001 0000
0001 0000
0001 0000
0000 1000
0000 1000
0000 1000
0000 0100
0000 0100
0000 0100
0000 0010
0000 0010
0000 0010
0001
1:0000 0001 0
0001
2:0000 0001 0
0011
3:0000 0001 00011 0000 0010
89 UC Davis EEC 170, Winter 2021 / © John Owens
Divide Hardware Version 2
▪ 32-bit Divisor reg, 32-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
32 bits
32-bit ALU
Remainder
64 bits
Quotient
32 bits
Control
Shift Left
Shift Left
Write
90
UC Davis EEC 170, Winter 2021 / © John Owens
Divide Algorithm V2
Start: Place Dividend in Remainder
1. Shift the Remainder register left 1 bit.
Remainder Quotient Divisor
0000 0111 0000 0010
2. Subtract the Divisor register from the
left half of the Remainder register, & place the result in the left half of the Remainder register.
Remainder ≥ 0
Test Remainder
Remainder < 0
3a. Shift the Quotient register to the left setting the new rightmost
bit to 1.
3b. Restore the original value by adding the Divisor register to the left half of the Remainder register, &place the sum in the left half of the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
nth
repetition?
No: < n repetitions Yes: n repetitions (n = 4 here)
UC Davis EEC 170, Winter 2021 / © John Owens
Done
91
Observations on Divide Version 2
▪ Eliminate Quotient register by combining with Remainder as shifted left
- Start by shifting the Remainder left as before.
- Thereafter loop contains only two steps because the shifting of the Remainder register shifts both the remainder in the left half and the quotient in the right half
- The consequence of combining the two registers together and the new order of the operations in the loop is that the remainder will shifted left one time too many.
- Thus the final correction step must shift back only the remainder in the left half of the register
92 UC Davis EEC 170, Winter 2021 / © John Owens
Divide Hardware Version 3
▪ 32-bit Divisor reg, 32 -bit ALU, 64-bit Remainder reg, (0-bit Quotient reg)
Divisor
32 bits
32-bit ALU
“HI” “LO”
Shift Left
Write
Remainder
(Quotient)
Control
64 bits
93
UC Davis EEC 170, Winter 2021 / © John Owens
Start: Place Dividend in Remainder
Divide Algorithm V3
1. Shift the Remainder register left 1 bit.
2. Subtract the Divisor register from the Remainder Divisor
left half of the Remainder register, & place the
0000 0111 0010
3a. Shift the Remainder register to the left setting the new rightmost
result in the left half of the Remainder register.
Remainder ≥ 0 Test Remainder < 0 Remainder
3b. Restore the original value by adding the Divisor register to the left half of the Remainder register, &place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new least significant bit to 0.
bit to 1.
nth repetition?
No: < n repetitions
Yes: n repetitions (n = 4 here) Done. Shift left half of Remainder right 1 bit.
UC Davis EEC 170, Winter 2021 / © John Owens
94
Final Multiply / Divide Hardware
Multiplicand 32 bits
32-bit ALU
Product (Multiplier) 64 bits
Divisor
32 bits
32-bit ALU
“HI” “LO”
Shift Right
Write
Control
Remainder
(Quotient)
Shift Left
Write
Control
64 bits
95
UC Davis EEC 170, Winter 2021 / © John Owens
Observations on Divide Version 3
▪ Same Hardware as Multiply: just need ALU to add or subtract, and 64-bit register to shift left or shift right
▪ Hi and Lo registers in MIPS combine to act as 64-bit register for multiply and divide
▪ Signed Divides: Simplest is to remember signs, make positive, and complement quotient and remainder if necessary
- Note: Dividend and Remainder must have same sign
- Note: Quotient negated if Divisor sign & Dividend sign disagree e.g., –7 ÷ 2 = –3, remainder = –1
- What about? –7 ÷ 2 = –4, remainder = +1
- See http://mathforum.org/library/drmath/view/52343.html
▪ Possible for quotient to be too large: if divide 64-bit integer by 1, quotient is 64 bits (called “saturation”)
96 UC Davis EEC 170, Winter 2021 / © John Owens
SRT Division
D. Sweeney of IBM, J.E. Robertson of the University of Illinois, and T.D. Tocher of Imperial College, London
Current
Remainder
P-D (Partial Divisor) Plot
9|9 43211111 8|8 42211110 7|7 32111100 6|6 32111000 5 | 5 2 1 1 1 0 00 0 4 | 4 2 1 1 0 0 00 0 3|3 11000000 2|2 10000000 1|1 00000000 0 +------------------
0 1 2 3 4 5 6 78 9
97 UC Davis EEC 170, Winter 2021 / © John Owens
SRT Division
▪ Intel Pentium divide implementation: SRT division with 2 bits/ iteration (radix 4)
▪ Allows negative entries
▪ 1066 entries in lookup table
98
UC Davis EEC 170, Winter 2021 / © John Owens
[http://members.cox.net/srice1/ pentbug/introduction.html]
Faster Division
▪ Can’t use parallel hardware as in multiplier
- Subtraction is conditional on sign of remainder
▪ Faster dividers (e.g. SRT division) generate multiple quotient bits per step
- Still require multiple steps
99
UC Davis EEC 170, Winter 2021 / © John Owens
Division Lessons
▪ In practice, slower than multiplication
- Also less frequent
- But, in the simple case, can use same hardware!
▪ Generates quotient and remainder together
▪ Floating-point division faster than integer division (why?) ▪ Similar hardware lessons as multiplier:
- Look for unused hardware
- Can process multiple bits at once at cost of extra hardware
100 UC Davis EEC 170, Winter 2021 / © John Owens
End of lecture (1:30 in, to allow project
101 UC Davis EEC 170, Winter 2021 / © John Owens
Lecture 7:
Arithmetic 3/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021
RISC-V logical instructions
Instruction Meaning Pseudocode
XORI rd,rs1,imm
SRAI rd,rs1,imm
Exclusive Or Immediate
rd ← ux(rs1) ⊕ ux(imm)
ORI rd,rs1,imm
Or Immediate
rd ← ux(rs1) ∨ ux(imm)
ANDI rd,rs1,imm
SLLI rd,rs1,imm
And Immediate
rd ← ux(rs1) ∧ ux(imm)
Shift Left Logical Immediate
rd ← ux(rs1) « ux(imm)
SRLI rd,rs1,imm
Shift Right Logical Immediate
rd ← ux(rs1) » ux(imm)
SLL rd,rs1,rs2
SRL rd,rs1,rs2
Shift Right Arithmetic Immediate
rd ← sx(rs1) » ux(imm)
Shift Left Logical
rd ← ux(rs1) « rs2
XOR rd,rs1,rs2
Exclusive Or
rd ← ux(rs1) ⊕ ux(rs2)
Shift Right Logical
rd ← ux(rs1) » rs2
SRA rd,rs1,rs2
OR rd,rs1,rs2
Shift Right Arithmetic
rd ← sx(rs1) » rs2
Or
rd ← ux(rs1) ∨ ux(rs2)
AND rd,rs1,rs2
And
rd ← ux(rs1) ∧ ux(rs2)
103 UC Davis EEC 170, Winter 2021 / © John Owens
Shift Operations ▪ Bit manipulation:
- S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM & 0 11111111 00000000000000000000000 ------------------------------------
0 EEEEEEEE 00000000000000000000000
- Right shift 23 bits to get
000000000000000000000000 EEEEEEEE
- Do arithmetic manipulation
000000000000000000000000 ENEWENEW
- Left shift 23 bits to get
0 ENEWENEW 00000000000000000000000
104 UC Davis EEC 170, Winter 2021 / © John Owens
Shift Operations
▪ Arithmetic operation:
- Example: 00011 << 2 [3 left shift 2]
- 00011 << 2 = 01100 = 12 = 2 * 4
- Each bit shifted left == multiply by two
- Example: 01010 >> 1 [10 right shift 1]
– 01010 >> 1 = 00101 = 5 = 10/2
– Each bit shifted right == divide by two
– Why?
– Compilers do this—“strength reduction”
105 UC Davis EEC 170, Winter 2021 / © John Owens
Shift Operations
▪
▪
With left shift, what do we shift in?
– 00011 << 2 = 01100 (arithmetic)
- 0000XXXX << 4 = XXXX0000 (logical)
- We shifted in zeroes
How about right shift?
- XXXX0000 >> 4 = 0000XXXX (logical) – Shifted in zero
– 00110 (= 6) >> 1 = 00011 (3) (arithmetic) – Shifted in zero
– 11110 (= -2) >> 1 = 11111 (-1) (arithmetic) – Shifted in one
106 UC Davis EEC 170, Winter 2021 / © John Owens
Shift Operations
▪ How about right shift?
– XXXX0000 >> 4 = 0000XXXX: Logical shift
– Shifted in zero
– 00110 (= 6) >> 1 = 00011 (3)
11110 (= -2) >> 1 = 11111 (-1): Arithmetic shift
– Shifted in sign bit
▪ RISC-V supports both logical and arithmetic:
– slli, srai, srli: Shift amount taken from within instruction (“imm”)
6 bits 6 bits 5 bits 3 bits 5 bits
7 bits 5 bits 5 bits 3 bits 5 bits
– sll, sra, srl: shift amount taken from register (“variable”)
– How far can we shift with slli/srai/slli? With sll/sra/srl?
7 bits
7 bits
funct6
shamt
rs1
funct3
rd
opcode
funct7
rs2
rs1
funct3
rd
opcode
107
UC Davis EEC 170, Winter 2021 / © John Owens
Combinational Shifter from MUXes
Basic Building Block AB
sel 1 0 D
8-bit right shifter
A7 A6 A5 A4 A3 A2 A1 A0 1010101010101010
1010101010101010
1010101010101010 R7 R6 R5 R4 R3 R2 R1 R0
S2 S1 S0
▪ What comes in the MSBs?
▪ How many levels for 64-bit shifter? ▪ What if we use 4-1 Muxes ?
108 UC Davis EEC 170, Winter 2021 / © John Owens
General Shift Right Scheme using 16 bit example ▪S 0 If we added right-to-left connections, we could supportROTATE
(0,1)
S1 (0,2)
S2 (0,4)
(not in RISC-V but found in other ISAs)
S3 (0,8)
109 UC Davis EEC 170, Winter 2021 / © John Owens
Funnel Shifter ▪ Shift A by i bits
▪ Problem: Set Y, X, sa ▪ Logical:
▪ Arithmetic: ▪ Rotate:
▪ Left shifts:
Extract 64 bits of 128 (sa selects which bits)
Y
Y
64
X
64
64 R
Shift Right
X
| sa |
R
110
UC Davis EEC 170, Winter 2021 / © John Owens
Barrel Shifter
▪ Technology-dependent solutions: transistor per switch
in6
in5
in4
out3
out2
out1
out0
SR3
SR2 SR1 SR0
in3
in2 in1 in0
111
UC Davis EEC 170, Winter 2021 / © John Owens
Shifter Summary
▪ Shifts common in logical ops, also in arithmetic ▪ RISC-V (oops) has:
– 2 flavors of shift: logical and arithmetic
– 2 directions of shift: right and left
– 2 sources for shift amount: immediate, variable
▪ Lots of cool shift algorithms, but …
– Barrel shifter prevalent in today’s hardware
112 UC Davis EEC 170, Winter 2021 / © John Owens
ating Point
Floating Point
▪ Representation for non-integral numbers
– Including very small and very large numbers
▪ Like scientific notation
– –2.34 × 1056
– +0.002 × 10–4
– +987.02 × 109 ▪ In binary
– ±1.xxxxxxx2 × 2yyyy
▪ Typesfloatanddoublein C
normalized
not normalized
113
UC Davis EEC 170, Winter 2021 / © John Owens
Floating Point Standard
▪ Defined by IEEE Std 754-1985
▪ Developed in response to divergence of representations
– Portability issues for scientific code ▪ Now almost universally adopted
▪ Two representations
– Single precision (32-bit)
– Double precision (64-bit)
114 UC Davis EEC 170, Winter 2021 / © John Owens
115 UC Davis EEC 170, Winter 2021 / © John Owens
Floating-point Formats
▪ Single-precision (32 bits)
▪ Double precision (64 bits)
116 UC Davis EEC 170, Winter 2021 / © John Owens
IEEE Floating-Point Format
x = (−1)S ×(1+Fraction)×2(Exponent−Bias)
▪ S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
▪ Normalize significand: 1.0 ≤ |significand| < 2.0
- Always has a leading pre-binary-point 1 bit, so no need to
represent it explicitly (hidden bit)
- Significand is Fraction with the “1.” restored
Fraction: single: 23 bits double: 52 bits
▪ Exponent: excess representation: actual exponent + Bias
- Ensures exponent is unsigned
- Single: Bias = 127; Double: Bias = 1023
117 UC Davis EEC 170, Winter 2021 / © John Owens
Exponent: single: 8 bits double: 11 bits
Single-Precision Range
▪ Exponents 00000000 and 11111111 reserved
▪ Smallest value
- Exponent: 00000001
⇒ actual exponent = 1 – 127 = –126
- Fraction: 000...00 ⇒ significand = 1.0
- ±1.0 × 2–126 ≈ ±1.2 × 10–38
▪ Largest value
- exponent: 11111110
⇒ actual exponent = 254 – 127 = +127
- Fraction: 111...11 ⇒ significand ≈ 2.0
- ±2.0 × 2+127 ≈ ±3.4 × 10+38
118
UC Davis EEC 170, Winter 2021 / © John Owens
Double-Precision Range
▪ Exponents 0000...00 and 1111...11 reserved
▪ Smallest value
- Exponent: 00000000001
⇒ actual exponent = 1 – 1023 = –1022
- Fraction: 000...00 ⇒ significand = 1.0
- ±1.0 × 2–1022 ≈ ±2.2 × 10–308
▪ Largest value
- Exponent: 11111111110
⇒ actual exponent = 2046 – 1023 = +1023
- Fraction: 111...11 ⇒ significand ≈ 2.0
- ±2.0 × 2+1023 ≈ ±1.8 × 10+308
119
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Precision
▪ Relative precision
- all fraction bits are significant
- Single: approx 2–23
- Equivalent to 23 × log102 ≈ 23 × 0.3 ≈ 6 decimal digits of
precision
- Double: approx 2–52
- Equivalent to 52 × log102 ≈ 52 × 0.3 ≈ 16 decimal digits of precision
120
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Example ▪ Represent –0.75
- -
–0.75 = (–1)1 × 1.12 × 2–1 - S =1
Fraction = 1000...002 - Exponent = –1 + Bias
- Single: –1 + 127 = 126 = 011111102
- Double: –1 + 1023 = 1022 = 011111111102
▪ Single: 1011111101000...00
▪ Double: 1011111111101000...00
121
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Example
▪ What number is represented by the single-precision float 11000000101000...00
- S =1
-
-
▪
Fraction = 01000...002
Fxponent = 100000012 = 129 x = (–1)1 × (1 + .012) × 2(129 – 127)
= (–1) × 1.25 × 22 = –5.0
122
UC Davis EEC 170, Winter 2021 / © John Owens
Denormal Numbers
▪ Exponent = 000...0 ⇒ hidden bit is 0
x=(−1)S ×(0+Fraction)×2−Bias
▪ Smaller than normal numbers
- allow for gradual underflow, with diminishing precision
▪ Denormal with fraction = 000...0
x=(−1)S ×(0+0)×2−Bias =±0.0
Two representations of 0.0!
123
UC Davis EEC 170, Winter 2021 / © John Owens
Infinities and NaNs
▪ Exponent = 111...1, Fraction = 000...0
- ±Infinity
- Can be used in subsequent calculations, avoiding need for overflow check
▪ Exponent = 111...1, Fraction ≠ 000...0
- Not-a-Number (NaN)
- Indicates illegal or undefined result
- e.g., 0.0 / 0.0
- Can be used in subsequent calculations
124
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Addition
▪ Consider a 4-digit decimal example - 9.999 × 101 + 1.610 × 10–1
▪ 1. Align decimal points
- Shift number with smaller exponent
- 9.999 × 101 + 0.016 × 101
▪ 2. Add significands
- 9.999 × 101 + 0.016 × 101 = 10.015 × 101
▪ 3. Normalize result & check for over/underflow - 1.0015 × 102
▪ 4. Round and renormalize if necessary - 1.002 × 102
125
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Addition
▪ Now consider a 4-digit binary example
- 1.0002 × 2–1 + –1.1102 × 2–2 (0.5 + –0.4375)
▪ 1. Align binary points
- Shift number with smaller exponent
- 1.0002 × 2–1 + –0.1112 × 2–1
▪ 2. Add significands
- 1.0002 × 2–1 + –0.1112 × 2–1 = 0.0012 × 2–1
▪ 3. Normalize result & check for over/underflow - 1.0002 × 2–4, with no over/underflow
▪ 4. Round and renormalize if necessary - 1.0002 × 2–4 (no change) = 0.0625
126
UC Davis EEC 170, Winter 2021 / © John Owens
FP Adder Hardware
▪ Much more complex than integer adder
▪ Doing it in one clock cycle would take too long
- Much longer than integer operations
- Slower clock would penalize all instructions ▪ FP adder usually takes several cycles
- Can be pipelined
127 UC Davis EEC 170, Winter 2021 / © John Owens
FP Adder Hardware
Step 1
Step 2
Step 3 Step 4
128
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Multiplication ▪ Consider a 4-digit decimal example
- 1.110 × 1010 × 9.200 × 10–5
▪ 1. Add exponents
- For biased exponents, subtract bias from sum
- New exponent = 10 + –5 = 5
▪ 2. Multiply significands
- 1.110 × 9.200 = 10.212 ⇒ 10.212 × 105
▪ 3. Normalize result & check for over/underflow - 1.0212 × 106
▪ 4. Round and renormalize if necessary - 1.021 × 106
▪ 5. Determine sign of result from signs of operands - +1.021 × 106
129
UC Davis EEC 170, Winter 2021 / © John Owens
Floating-Point Multiplication ▪ Now consider a 4-digit binary example
- 1.0002 × 2–1 × –1.1102 × 2–2 (0.5 × –0.4375)
▪ 1. Add exponents
- Unbiased: –1 + –2 = –3
- Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127
▪ 2. Multiply significands
- 1.0002 × 1.1102 = 1.1102 ⇒ 1.1102 × 2–3
▪ 3. Normalize result & check for over/underflow - 1.1102 × 2–3 (no change) with no over/underflow
▪ 4. Round and renormalize if necessary - 1.1102 × 2–3 (no change)
▪
5. Determine sign: +ve × –ve ⇒ –ve - –1.1102 × 2–3 = –0.21875
130
UC Davis EEC 170, Winter 2021 / © John Owens
FP Arithmetic Hardware
▪ FP multiplier is of similar complexity to FP adder
- But uses a multiplier for significands instead of an adder
▪ FP arithmetic hardware usually does
- Addition, subtraction, multiplication, division, reciprocal,
square-root
FP ↔ integer conversion
▪ Operations usually takes several cycles - Can be pipelined
-
131 UC Davis EEC 170, Winter 2021 / © John Owens
FP Instructions in RISC-V
▪ Separate FP registers: f0, ..., f31
- double-precision
- single-precision values stored in the lower 32 bits
▪ FP instructions operate only on FP registers
- Programs generally don’t do integer ops on FP data, or vice
versa
- More registers with minimal code-size impact
▪ FP load and store instructions
- -
flw, fld fsw, fsd
132 UC Davis EEC 170, Winter 2021 / © John Owens
FP Instructions in RISC-V ▪ Single-precision arithmetic
-fadd.s, fsub.s, fmul.s, fdiv.s, fsqrt.s
-
▪ Double-precision arithmetic
e.g.,fadds.s f2, f4, f6
-fadd.d, fsub.d, fmul.d, fdiv.d, fsqrt.d
-
▪ Single- and double-precision comparison
-feq.s, flt.s, fle.s -feq.d, flt.d, fle.d
- Result is 0 or 1 in integer destination register
- Use beq, bne to branch on comparison result
▪ Branch on FP condition code true or false -B.cond
e.g.,fadd.d f2, f4, f6
133
UC Davis EEC 170, Winter 2021 / © John Owens
FP Example: °F to °C ▪ C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
- fahr in f10, result in f10, literals in global memory space ▪ Compiled RISC-V code:
f2c:
flw f0,const5(x3) // f0 = 5.0f
flw f1,const9(x3) // f1 = 9.0f
fdiv.s f0, f0, f1 // f0 = 5.0f / 9.0f
flw f1,const32(x3) // f1 = 32.0f
fsub.s f10,f10,f1 // f10 = fahr - 32.0
fmul.s f10,f0,f10 // f10 = (5.0f/9.0f) * (fahr-32.0f)
jalr x0,0(x1) // return
134 UC Davis EEC 170, Winter 2021 / © John Owens
FP Example: Array Multiplication
▪ C = C +A × B
- All 32 × 32 matrices, 64-bit double-precision elements
▪ C code:
▪ void mm (double c[][],
double a[][], double b[][]) {
size_t i, j, k;
for (i = 0; i < 32; i = i + 1)
for (j = 0; j < 32; j = j + 1)
for (k = 0; k < 32; k = k + 1)
}
c[i][j] = c[i][j]
+ a[i][k] * b[k][j];
-
i,j,kin x5, x6, x7
Addresses ofc,a,bin x10, x11, x12, and
135 UC Davis EEC 170, Winter 2021 / © John Owens
FP Example: Array Multiplication
▪
RISC-V code:
mm:...
li x28,32
li x5,0
L1: li x6,0
L2: li x7,0
// x28 = 32 (row size/loop end)
// i = 0; initialize 1st for loop
// j = 0; initialize 2nd for loop
// k = 0; initialize 3rd for loop
// x30 = i * 2**5 (size of row of c)
slli x30,x5,5
add x30,x30,x6 // x30 = i * size(row) + j
slli x30,x30,3 // x30 = byte offset of [i][j]
add x30,x10,x30 // x30 = byte address of c[i][j]
fld f0,0(x30) // f0 = c[i][j]
L3: slli x29,x7,5 // x29 = k * 2**5 (size of row of b)
add x29,x29,x6 // x29 = k * size(row) + j
slli x29,x29,3 // x29 = byte offset of [k][j]
add x29,x12,x29 // x29 = byte address of b[k][j]
fld f1,0(x29) // f1 = b[k][j]
136
UC Davis EEC 170, Winter 2021 / © John Owens
FP Example: Array Multiplication
...
slli x29,x5,5 // x29 = i * 2**5 (size of row of a)
add slli
add
x29,x29,x7 // x29 = i * size(row) + k
x29,x29,3 // x29 = byte offset of [i][k]
x29,x11,x29 // x29 = byte address of a[i][k]
fld f2,0(x29) // f2 = a[i][k]
fmul.d f1, f2, f1 // f1 = a[i][k] * b[k][j]
fadd.d f0, f0, f1 // f0 = c[i][j] + a[i][k] * b[k][j]
addi x7,x7,1 // k = k + 1
bltu x7,x28,L3 // if (k < 32) go to L3
fsd f0,0(x30) // c[i][j] = f0
bltu bltu
addi x6,x6,1
x6,x28,L2
addi x5,x5,1
x5,x28,L1
// j = j + 1
// if (j < 32) go to L2
// i = i + 1
// if (i < 32) go to L1
137
UC Davis EEC 170, Winter 2021 / © John Owens
Accurate Arithmetic
▪ IEEE Std 754 specifies additional rounding control
- Extra bits of precision (guard, round, sticky)
- Choice of rounding modes
- Allows programmer to fine-tune numerical behavior of a computation
▪ Not all FP units implement all options
- Most programming languages and FP libraries just use
defaults
▪ Trade-off between hardware complexity, performance, and market requirements
138 UC Davis EEC 170, Winter 2021 / © John Owens
Subword Parallellism
▪ Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
- Example: 128-bit adder:
- Sixteen 8-bit adds
- Eight 16-bit adds
- Four 32-bit adds
▪ Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)
139 UC Davis EEC 170, Winter 2021 / © John Owens
§3.6 Parallelism and Computer Arithmetic: Subword Parallelism
x86 FP Architecture
▪ Originally based on 8087 FP coprocessor
- 8 × 80-bit extended-precision registers
- Used as a push-down stack
- Registers indexed from TOS: ST(0), ST(1), ...
▪ FP values are 32-bit or 64 in memory
- Converted on load/store of memory operand
- Integer operands can also be converted on load/store
▪ Very difficult to generate and optimize code
- Result: poor FP performance
140 UC Davis EEC 170, Winter 2021 / © John Owens
§3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86
x86 FP Instructions ▪ Optional variations
Data transfer Arithmetic
- I: integer operand
FIADDP mem/ST(i) - P: pop operand from stack
FILD mem/ST(i)
FISTP mem/ST(i) FIMULP mem/ST(i)
FISUBRP mem/ST(i) - R: reverse opeFraIDnIdVRoPrdmerm/ST(i)
FICOMP
FIUCOMP
FSTSW AX/mem
FLDPI
FLD1 FSQRT
- But not all combinations allowed
FRNDINT
FLDZ
FABS
Compare
Transcendental
FPATAN
F2XMI
FCOS
FPTAN
FPREM
FPSIN
FYL2X
141
UC Davis EEC 170, Winter 2021 / © John Owens
Streaming SIMD Extension 2 (SSE2)
▪ Adds 4 × 128-bit registers
- Extended to 8 registers in AMD64/EM64T
▪ Can be used for multiple FP operands
- 2 × 64-bit double precision
- 4 × 32-bit double precision
- Instructions operate on them simultaneously
- Single-Instruction Multiple-Data
142
UC Davis EEC 170, Winter 2021 / © John Owens
Matrix Multiply ▪ Unoptimized code:
1. void dgemm (int n, double* A, double* B, double* C)
2. {
3. for (int i = 0; i < n; ++i)
4. for (int j = 0; j < n; ++j)
5. {
6. double cij = C[i+j*n]; /* cij = C[i][j] */
7. for (int k = 0; k < n; k++)
8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
9. C[i+j*n] = cij; /* C[i][j] = cij */
10. } 11. }
143
UC Davis EEC 170, Winter 2021 / © John Owens
§3.8 Going Faster: Subword Parallelism and Matrix Multiply
Matrix Multiply ▪ x86 assembly code:
1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0
2. mov %rsi,%rcx # register %rcx = %rsi
3. xor %eax,%eax # register %eax = 0
4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1
5. add %r9,%rcx # register %rcx = %rcx + %r9
6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A
7. add $0x1,%rax # register %rax = %rax + 1
8. cmp %eax,%edi # compare %eax to %edi
9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0
10. jg 30
11. add $0x1,%r11d # register %r11 = %r11 + 1
12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element
144
UC Davis EEC 170, Winter 2021 / © John Owens
Matrix Multiply ▪ Optimized C code:
1. #include
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4.
5.
6.
7.
8.
9.
10.
11.
12. }
13. }
for ( int i = 0; i < n; i+=4 )
for ( int j = 0; j < n; j++ ) {
__m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */
for( int k = 0; k < n; k++ )
c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
_mm256_mul_pd(_mm256_load_pd(A+i+k*n),
_mm256_broadcast_sd(B+k+j*n)));
_mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
145
UC Davis EEC 170, Winter 2021 / © John Owens
Matrix Multiply
▪ Optimized x86 assembly code:
1. vmovapd (%r11),%ymm0
2. mov %rbx,%rcx
3. xor %eax,%eax
4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element 5. add $0x8,%rax # register %rax = %rax +8
10. jne 50
11. add $0x1,%esi
12. vmovapd %ymm0,(%r11)
# jump if not %r10 != %rax
# register % esi = % esi + 1
# Store %ymm0 into 4 C elements
# Load 4 elements of C into %ymm0
# register %rcx = %rbx
# register %eax = 0
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements
7. add %r9,%rcx # register %rcx = %rcx + %r9
8. cmp %r10,%rax # compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0
146
UC Davis EEC 170, Winter 2021 / © John Owens
es and Pitfalls
Right Shift and Division
▪ Left shift by i places multiplies an integer by 2i
▪ Right shift divides by 2i?
– Only for unsigned integers
▪ For signed integers
– Arithmetic right shift: replicate the sign bit
– e.g., –5 / 4
– 111110112 >> 2 = 111111102 = –2
– Rounds toward –∞
c.f. 111110112 >>> 2 = 001111102 = +62
–
147
UC Davis EEC 170, Winter 2021 / © John Owens
Associativity
▪ Parallel programs may interleave operations in unexpected orders
– Assumptions of associativity may fail
(x+y)+z
x+(y+z)
x
-1.50E+38
0.00E+00
-1.50E+38
y
1.50E+38
1.50E+38
z
1.0
1.0
1.00E+00
0.00E+00
▪ Need to validate parallel programs under varying degrees of parallelism
148 UC Davis EEC 170, Winter 2021 / © John Owens
Who Cares About FP Accuracy?
▪ Important for scientific code
– But for everyday consumer use?
–
▪ The Intel Pentium FDIV bug
– The market expects accuracy
– See Colwell, The Pentium Chronicles
“My bank balance is out by 0.0002¢!”☹
149 UC Davis EEC 170, Winter 2021 / © John Owens
Concluding Remarks
▪ Bits have no inherent meaning
– Interpretation depends on the instructions applied
▪ Computer representations of numbers
– Finite range and precision
– Need to account for this in programs
▪ ISAs support arithmetic
– Signed and unsigned integers
– Floating-point approximation to reals
▪ Bounded range and precision
– Operations can overflow and underflow
150 UC Davis EEC 170, Winter 2021 / © John Owens
§3.10 Concluding Remarks
Break
151 UC Davis EEC 170, Winter 2021 / © John Owens
Administrivia
▪ My office hour is in the Coffee House, not the Silo. The Silo is closed for construction.
▪ Maybe oops: https://en.wikipedia.org/wiki/ Kahan_summation_algorithm
▪ Tell me about errors in slides / in homework
▪ Anyone try RARS?
▪ Midterm philosophy
▪ No typing on the midterm
▪ Bring a calculator
152 UC Davis EEC 170, Winter 2021 / © John Owens
Problem: Design a “fast” ALU for the RISC-V ISA
▪ Requirements?
– Must support the Arithmetic / Logic operations
– Tradeoffs of cost and speed based on frequency of occurrence, hardware budget
153 UC Davis EEC 170, Winter 2021 / © John Owens
RISC-V ALU requirements
▪ Add, Sub, AddI, AddI
– => 2’s complement adder/sub
▪ And, Or, AndI, OrI, Xor, Xori
– => Logical AND, logical OR, XOR
▪ SLTI, SLTIU (set less than)
– => 2’s complement adder with inverter, check sign bit of
result
▪ See ALU from COD5E, appendix A.5
154 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS arithmetic instruction format ▪ I-format:
immediate
rs1
funct3
rd
opcode
12 bits
▪ R-format: SLTI
SLTIU ANDI ORI XORI
5 bits
3 bits
5 bits
7 bits
opcode
Type op funct
ADD 00 0110011 | 000 | 0000000
Type op 0010011
SUB
5 bits
funct3
rd
7 bits
5 bits
AND OR XOR SLT SLTU
3 bits
5 bits
7 bits
ADDI 0010011 | 000
funct7 rs2 rs1
0010011
0010011
0010011
0010011
00 00 00 00 00
0110011 | 0110011 | 0110011 | 0110011 | 0110011 |
111 | 0000000 110 | 0000000 100 | 0000000 010 | 0000000 011 | 0000000
| 010 | 011 | 111 | 110 | 100
00
0110011 | 000 | 0100000
155
UC Davis EEC 170, Winter 2021 / © John Owens
Design Trick: divide & conquer
▪ Trick: Break the problem into simpler problems, solve them and glue together the solution
▪ Example: assume the immediates have been taken care of before ADDI 0010011 | 000 ADD 00 0110011 | 000 | 0000000
SUB 00 0110011 | 000 | 0100000
Type op Type op funct
the ALU
ANDI ORI XORI SLTI SLTIU
0010011 | 111 AND 00 0110011 | 111 | 0000000
–
7 operations (could be 3 bits, but really 4)
0010011 | 110 0010011 | 100 0010011 | 010 0010011 | 011
OR 00 XOR 00 SLT 00 SLTU 00
0110011 | 110 | 0000000 0110011 | 100 | 0000000 0110011 | 010 | 0000000 0110011 | 011 | 0000000
156 UC Davis EEC 170, Winter 2021 / © John Owens
Let’s Build a ALU
▪ Functional Specification:
– inputs: 2 x 32-bit operands A, B, 4-bit OPeration
– outputs: 32-bit result S, 1-bit carry, 1 bit overflow
– operations: add, sub, and, or, xor, slt, sltu
32 32 AB
c ALU OP ovf
S
32
4
157
UC Davis EEC 170, Winter 2021 / © John Owens
We already know how to do add/sub
A1
B1
+ –
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
A + ~B + 1
+ –
+ –
Cout Full
Sum S0
S=
… then add 1
B
0
Adder
0
1
158
UC Davis EEC 170, Winter 2021 / © John Owens
Control for +/-
One bit controls three muxes. This is a “control point”.
A1
B1
+ –
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
+ –
Cout Full
Sum S0
How do we set this control point for add? subtract?
B0
Adder
0
+ –
1
159
UC Davis EEC 170, Winter 2021 / © John Owens
160 UC Davis EEC 170, Winter 2021 / © John Owens
AND and OR
▪ Consider ALU that supports two functions, AND and OR
▪ How do we do this?
A
and
or
B
161 UC Davis EEC 170, Winter 2021 / © John Owens
AND and OR
▪
A
Combinational logic:
– Control bit OP is 0 for AND, 1 for OR
A
B
OP
OUT
0
0
0
0
0
0
1
0
0
1
0
0
0
1
1
1
1
0
0
0
1
0
1
1
1
1
0
1
1
1
1
1
B
▪
Hard with lots OP of functions! But let’s do it anyway.
and
or
A
OP
B
162 UC Davis EEC 170, Winter 2021 / © John Owens
Function Inputs Outputs K-Map
127
M0 M1 M2 M3 A B Cin S Cout
add 0000000 00
7-to-2 Combinational Logic
▪ Start turning the crank . . . 0
163 UC Davis EEC 170, Winter 2021 / © John Owens
AND and OR
▪
using a mux
Instead, generate several functions and use control bits to select
and
OP
0 1
▪
Not easy to decide the “best” way to build something
– Don’t want too many inputs to a single gate
– Don’t want to have to go through too many gates
– For our purposes, ease of comprehension is important
A
B
or
164 UC Davis EEC 170, Winter 2021 / © John Owens
Supporting More Functions
▪ With the mux approach, it’s easy to add other functions
– Like add
– To add more, just enlarge mux
– Control signals in mux, not datapath
OP
0 1
and
165
UC Davis EEC 170, Winter 2021 / © John Owens
A
B
or
Supporting More Functions
▪ With the mux approach, it’s easy to add other functions – Like add
OP
– To add more, just enlarge mux
– Control signals in mux, not datapath
0
1
2
A
B
A
B Cin
FA
Cout Sum
166
UC Davis EEC 170, Winter 2021 / © John Owens
Supporting More Functions
▪ With the mux approach, it’s easy to add other functions
– Like add
– To add more, just enlarge mux
(or put more muxes elsewhere)
– Control signals in muxes, not datapath
OP
OP
A
B
A +B
– Cin
+ –
FA
Cout Sum
0
1
2
0
1
167
UC Davis EEC 170, Winter 2021 / © John Owens
Tailoring the ALU to RISC-V
▪ Need to support the set-on-less-than instruction (slt)
– slt produces a 1 if rs < rt and 0 otherwise
- use subtraction: (a-b) < 0 implies a < b
- So now we’ve got a-b as our result. How does this translate to slt operation? What do we have to test and where does it go?
- We test
- To produce the proper result, it goes in
168 UC Davis EEC 170, Winter 2021 / © John Owens
Tailoring the ALU to RISC-V
▪ Need to support the set-on-less-than instruction (slt)
- slt produces a 1 if rs < rt and 0 otherwise
- use subtraction: (a-b) < 0 implies a < b
- So now we’ve got a-b as our result. How does this translate to slt operation? What do we have to test and where does it go?
- We test
the highest (sign) bit (S[31])
- To produce the proper result, it goes in
the lowest bit (S[0])
169 UC Davis EEC 170, Winter 2021 / © John Owens
Why do you think the MIPS-V designers
170 UC Davis EEC 170, Winter 2021 / © John Owens
Tailoring the ALU to the MIPS
▪ Need to support test for equality (beq $t5, $t6, LABEL)
- use subtraction: (a-b) = 0 implies a = b - How do we test if the product is zero?
171 UC Davis EEC 170, Winter 2021 / © John Owens
Original Diagram: bit-slice ALU
A32 B32
a31 b31
a0
m ALU0
b0 m
cin
4 M
ALU31
co
cin
co
s0
s31
32 S
172 UC Davis EEC 170, Winter 2021 / © John Owens
Revised Diagram
▪ LSB and MSB need to do a little extra
A 32 B 32
a31 b31
ALU0 cos31 cin
a0 b0
ALU0
co s0 cin
4
produce carry-in (for add/ subtract), etc.
?
M
C/L to
Overflow
(but not in RISC-V)
32 S
173
UC Davis EEC 170, Winter 2021 / © John Owens
slt sign bit
Behavioral Representation: Verilog
module ALU(A, B, m, S, c, ovf);
input [0:31] A, B;
input [0:3] m;
output [0:31] S;
output c, ovf;
reg [0:31] S;
reg c, ovf;
always @(A, B, m) begin
case (m)
0: S = A + B;
1: S = A - B;
2: S = ...
... end
endmodule
32 32 AB
c ALU m ovf
S
32
4
174
UC Davis EEC 170, Winter 2021 / © John Owens
Conclusion
▪ We can build an ALU to support the RISC-V instruction set
- key idea: use multiplexor to select the output we want
- we can efficiently perform subtraction using two’s complement
- we can replicate a 1-bit ALU to produce a 32-bit ALU
▪ Important points about hardware
- all of the gates are always working
- the speed of a gate is affected by the number of inputs to the gate
- the speed of a circuit is affected by the number of gates in series
(on the “critical path” or the “deepest level of logic”)
175 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Opcode Map
[from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich]
176 UC Davis EEC 170, Winter 2021 / © John Owens
Encodings for ADD, SUB
OP
OP
Big picture of what we’re doing here: understand tie btwn ISA and hardware
A
B
A +B
- Cin
+ -
Cout Sum
0
1
2
0 1
OP
+ -
0
FA
1
177
UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Opcode Map
[from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich]
178 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS (really) Encodings for A
OP
OP
A
B
A +B
- Cin
FA
Cout Sum
0
1
2
OP
0 1
d bit do? Green? Blue? -
+0+ -1
179 UC Davis EEC 170, Winter 2021 / © John Owens
ADD 00 100000 (040)
ADDU 00 100001 (041) D, SUB, SLT
SUB 00 SUBU 00
*? 00
*? 00
SLT 00 SLTU 00
100010 (042) 100011 (043) 101000 (050) 101001 (051) 101010 (052) 101011 (053)
e
D
MIPS Opcode Map
[from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich]
180 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS arithmetic instruction format
Type op ADDI 001000 (010)
I-Type:
op Rs
31 25 20 15 0
ADDIU 001001 (011)
Rt
Immed 16
SLTI 001010 (012) SLTIU 001011 (013) ANDI 001100 (014) ORI 001101 (015) XORI 001110 (016) LUI 001111 (017)
“Perhaps it is surprising that addiu and sltiu also
What does the red bit do?
sign-extend their immediates, but they do. The u
stands for unsigned, but in reality addiu is often used simply as an add instruction that cannot overflow, and hence we often want to add negative numbers. It's much harder to come up with an excuse for why sltiu sign extends its immediate field.” COD2E p. 230
181 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Opcode Map
[from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich]
182 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS arithmetic instruction format
Type op funct
SLL 00
R-type:
*00 SRL 00 SRA 00 SLLV 00 * 00 SRLV 00 SRAV 00
31 25 20 15 10 5 0
000000 (000) 000001 (001)
Rt Rd shamt funct
op Rs
000010 (002)
What does the red bit do?
000011 (003) 000100 (004) 000101 (005) 000110 (006) 000111 (007)
Green? Blue?
* instructions?
183 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Opcode Map
[from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich]
184 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS arithmetic instruction format
R-type: I-Type:
Type op
31 25 20 15
5 0
op
ADDI ADDIU
Rs
Rt Rd funct
ADD 000000 100000 (040)
SLTI SLTIU ANDI ORI XORI LUI
001010 (012) 001011 (013) 001100 (014) 001101 (015) 001110 (016) 001111 (017)
op
001000 (010) 001001 (011)
Rs
Rt
Type op funct ADDU 000000 100001 (041)
Immed 16
SUB 000000 100010 (042) SUBU 000000 100011 (043) AND 000000 100100 (044) OR 000000 100101 (045) XOR 000000 100110 (046) NOR 000000 100111 (047)
185
UC Davis EEC 170, Winter 2021 / © John Owens