CS代考计算机代写 computer architecture compiler scheme mips chain assembly Java RISC-V Fortran x86 algorithm cache Lecture 5:

Lecture 5:
Arithmetic 1/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021

Arithmetic for Computers
▪ Operations on integers
– Addition and subtraction
– Multiplication and division
– Dealing with overflow
▪ Floating-point real numbers
– Representation and operations
2 UC Davis EEC 170, Winter 2021 / © John Owens
§3.1 Introduction

Arithmetic
▪ Where we’ve been:
– Performance (seconds, cycles, instructions) – Abstractions:
– Instruction Set Architecture
– Assembly Language and Machine Language
▪ What’s up ahead:
– Number Representation
– Implementing an ALU
– Implementing the Architecture
a
32
b
32
operation
ALU
32
result
3
UC Davis EEC 170, Winter 2021 / © John Owens

32 bit signed numbers
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten 0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten 0000 0000 0000 0000 0000 0000 0000 0010two = + 2ten …
0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646 ten
0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647 ten
1000 0000 0000 0000 0000 0000 0000 0000two = – 2,147,483,648ten 1000 0000 0000 0000 0000 0000 0000 0001two = – 2,147,483,647ten 1000 0000 0000 0000 0000 0000 0000 0010two = – 2,147,483,646ten …
1111 1111 1111 1111 1111 1111 1111 1101two = – 3ten
1111 1111 1111 1111 1111 1111 1111 1110two = – 2 ten
1111 1111 1111 1111 1111 1111 1111 1111two = – 1 ten
maxint
minint
4
UC Davis EEC 170, Winter 2021 / © John Owens

What happens if you compute –minint?
5 UC Davis EEC 170, Winter 2021 / © John Owens

Two’s Complement Operations
▪ Negating a two’s complement number: invert all bits and add 1
– remember: “negate” and “invert” are quite different!
▪ Converting n bit numbers into numbers with more than n bits:
– RISC V n bit immediate gets converted to 64 bits for arithmetic
– copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
– “sign extension” (lbu vs. lb)
– mem [x1] = 0xff
– lb x2, 0(x1) -> x2 = ffff ffff
– lbu x2, 0(x1) -> x2 = 0000 00ff
6 UC Davis EEC 170, Winter 2021 / © John Owens

Addition & Subtraction
▪ Two’s complement operations easy
– subtraction using addition of negative numbers
0111 (7) + 1010 (-6)
0111 (7)
– 0110 (6)
7
UC Davis EEC 170, Winter 2021 / © John Owens

Addition & Subtraction
▪ Overflow (result too large for finite computer word):
– adding two n-bit numbers does not yield an n-bit number
1111
+ 0001
10000
– How about -1 + -1?
8 UC Davis EEC 170, Winter 2021 / © John Owens

Detecting Overflow
▪ ▪ ▪
▪ ▪
No overflow when adding a positive and a negative number No overflow when signs are the same for subtraction Overflow occurs when the value affects the sign:
– overflow when adding two positives yields a negative
– or, adding two negatives gives a positive
– or, subtract a negative from a positive and get a negative
– or, subtract a positive from a negative and get a positive
Consider the operations A + B, and A – B
– Can overflow occur if B is 0 ?
– Can overflow occur if A is 0 ?
HW problem on figuring out exact criteria
9 UC Davis EEC 170, Winter 2021 / © John Owens

and Subtraction
Integer Addition ▪ Example: 7 + 6
▪ Overflow if result out of range
– Adding +ve and –ve operands, no overflow
– Adding two +ve operands
– Overflow if result sign is 1 – Adding two –ve operands
– Overflow if result sign is 0
10
UC Davis EEC 170, Winter 2021 / © John Owens

Integer Subtraction
▪ Add negation of second operand
▪ Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111 –6: 1111 1111 … 1111 1010 +1:0000 0000 … 0000 0001
▪ Overflow if result out of range
– Subtracting two +ve or two –ve operands, no overflow
– Subtracting +ve from –ve operand
– Overflow if result sign is 0
– Subtracting –ve from +ve operand – Overflow if result sign is 1
11
UC Davis EEC 170, Winter 2021 / © John Owens

Effects of Overflow
▪ An exception (interrupt) occurs
– Control jumps to predefined address for exception
– Interrupted address is saved for possible resumption
▪ Details based on software system / language
– example: flight control vs. homework assignment
– C/Java do not detect overflow
– Fortran (evidently) can detect overflow
▪ Don’t always want to detect overflow
– –

– RISC-V advocates branches on overflow instead
MIPS instructions (but not RISC-V): addu,addiu,subu addiu sign-extends
sltu, sltiu for unsigned comparisons
12 UC Davis EEC 170, Winter 2021 / © John Owens

Arithmetic for Multimedia


Saturating operations
– On overflow, result is largest representable value
– c.f. 2s-complement modulo arithmetic – E.g., clipping in audio, saturation in video
Graphics and media processing operates on vectors of 8-bit and 16-bit data
– Use 64-bit adder, with partitioned carry chain
(this probably doesn’t mean anything to you yet, but wait until the end of lecture)
– Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors – SIMD (single-instruction, multiple-data)
13 UC Davis EEC 170, Winter 2021 / © John Owens

Review: Boolean Algebra & Gates
▪ Problem: Consider a logic function with three inputs: A, B, and C.
▪ Output D is true if at least one input is true
▪ Output E is true if exactly two inputs are true
▪ Output F is true only if all three inputs are true
14 UC Davis EEC 170, Winter 2021 / © John Owens

Review: Boolean Algebra & Gates
▪ Show the Boolean equations for these three functions.
15 UC Davis EEC 170, Winter 2021 / © John Owens

Review: Boolean Algebra & Gates
▪ Show an implementation consisting of inverters, AND, and OR gates.
16 UC Davis EEC 170, Winter 2021 / © John Owens

You should know …
▪ Boolean algebra
▪ Logic gates (and, or, not, xor, nor, multiplexors [muxes], decoders, etc.)
▪ Converting between equations, truth tables, and gate representations
▪ Critical path
▪ Clocking, and registers / memory
▪ Finite state machines
▪ … COD5e Appendix A summarizes this material
17 UC Davis EEC 170, Winter 2021 / © John Owens

Break
18 UC Davis EEC 170, Winter 2021 / © John Owens

Administrivia
▪ We have a TA! I already put him to work. Trivikram Reddy
▪ By popular demand, I will have an office hour Fri 18 October, “when the train arrives” (9:30)–11 at the CoHo
19 UC Davis EEC 170, Winter 2021 / © John Owens

Bit Slices
▪ Concentrate on one bit of the adder:
0111 + 0110
▪ Could we build the same hardware for every bit?
– This is a good idea. Why?
– Each bit’s hardware is called a “bit slice”

20 UC Davis EEC 170, Winter 2021 / © John Owens

Truth Table for Adder Bit Slice ▪ 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0
0
1
1
0
1
1
0
0
1
0
1
0
1
0
1
1
1
0
0
1
1
1
1
1
1
21 UC Davis EEC 170, Winter 2021 / © John Owens

Adder Equations ▪ Sum = (A⊕B)⊕Cin
▪ Carry = AB + ACin + BCin ▪ Abstract as “Full Adder”:
A B
Cin
Full Cout Adder Sum
22 UC Davis EEC 170, Winter 2021 / © John Owens

Cascading Adders
▪ Cascade Full Adders to make multibit adder:
▪ A+B=S
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
Full Adder
B3
A2
S3
Full Adder
B2
S2
A1
Full Adder
B1
S1
A0
Full Adder
B0
S0
23
UC Davis EEC 170, Winter 2021 / © John Owens

Arithmetic for Multimedia


Saturating operations
– On overflow, result is largest representable value
– c.f. 2s-complement modulo arithmetic – E.g., clipping in audio, saturation in video
Graphics and media processing operates on vectors of 8-bit and 16-bit data
– Use 64-bit adder, with partitioned carry chain
(this probably doesn’t mean anything to you yet, but wait until the end of lecture)
– Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors – SIMD (single-instruction, multiple-data)
24 UC Davis EEC 170, Winter 2021 / © John Owens

Truth Table for Subtractor Bit Slice ▪ 3 inputs (A, B, Bin); 2 outputs (Diff, Bout)
A
B
Bin
Diff
Bout
0
0
0
0
0
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
1
0
0
1
1
0
0
0
1
1
1
1
1
25 UC Davis EEC 170, Winter 2021 / © John Owens

Cascading Subtractors
▪ Cascade Full Subtrs to make multibit subtr:
▪ A-B=S
A3
A B
Bin
A B
Bin
A B
Bin
A B
Full Subtr
Full Subtr
Full Subtr
Bout Diff
Bout Diff
Bout Diff
Bout Diff
B3
B2
A1
B1
Bin
26
UC Davis EEC 170, Winter 2021 / © John Owens
Full Subtr
S3
A2
S2
S1
A0
B0
S0

How can we combine + and -?
A
A B
Cin
A B
Bin
Full Adder
Full Subtr
Cout Sum
Bout Diff
+ –
+ –
B
▪ This is common—it’s what we’ll do (for example) for logic functions (and, or, etc.)
27 UC Davis EEC 170, Winter 2021 / © John Owens
Cin+1
S

How can we combine + and -?
S = A- B
S = A + (~B) + 1
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Full Adder
Full Adder
Full Adder
Full Adder
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
B3
A2
S3
B2
S2
A1
B1
S1
A0
B0
S0
28
UC Davis EEC 170, Winter 2021 / © John Owens

Lest you think this is only theoretical …
29 UC Davis EEC 170, Winter 2021 / © John Owens

How can we combine + and -?
A1
B1
+ –
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
+ –
Full Adder
Cout
Sum S0
B0
S = A + ~B + 1 First, negate B …
30
UC Davis EEC 170, Winter 2021 / © John Owens

How can we combine + and -?
A1
B1
+ –
A B
Cin
A B
Cin
Full Adder
Cout
Sum
S1
A0
+ –
+ –
Full Adder
Cout
Sum
B0
S0
S = A + ~B + 1 … then add 1
0
1
31
UC Davis EEC 170, Winter 2021 / © John Owens

Control for +/-
One bit controls three muxes. This is a “control point”.
A B
Cin
A B
Cin
A1
B1
+ –
Full Adder
Cout
Sum
S1
A0
Cout
Sum
How do we set this control point for add? subtract?
+ –
Full Adder
B0
S0
0
+ –
1
32
UC Davis EEC 170, Winter 2021 / © John Owens

imm[11:0] rs1 funct3 rd opcode I-type
RISC-V instruc
rs2
imm[31:12] rd opcode
im
m
[2
|19
rd opcode
0|1
codings
:12]
rs1 funct3 imm[4:0] opcode
S-type B-type U-type J-type
LUI AUIPC JAL JALR BEQ BNE BLT BGE BLTU BGEU LB
LH LW LBU LHU SB
SH SW ADDI SLTI SLTIU XORI ORI ANDI SLLI SRLI SRAI ADD SUB SLL SLT SLTU XOR SRL
SRA
imm[11:5]
imm[12|10:5] tiorns2 enrs1 funct3 imm[4:1|11] opcode
0:1|11
RV32I Base Instruction Set
imm[31:12]
rd
0110111
imm[31:12]
rd
0010111
imm[20|10:1|11|19:12]
rd
1101111
imm[11:0]
rs1
000
rd
1100111
imm[12|10:5]
rs2
rs1
000
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
001
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
100
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
101
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
110
imm[4:1|11]
1100011
imm[12|10:5]
rs2
rs1
111
imm[4:1|11]
1100011
imm[11:0]
rs1
000
rd
0000011
imm[11:0]
rs1
001
rd
0000011
imm[11:0]
rs1
010
rd
0000011
imm[11:0]
rs1
100
rd
0000011
imm[11:0]
rs1
101
rd
0000011
imm[11:5]
rs2
rs1
000
imm[4:0]
0100011
imm[11:5]
rs2
rs1
001
imm[4:0]
0100011
imm[11:5]
rs2
rs1
010
imm[4:0]
0100011
imm[11:0]
rs1
000
rd
0010011
imm[11:0]
rs1
010
rd
0010011
imm[11:0]
rs1
011
rd
0010011
imm[11:0]
rs1
100
rd
0010011
imm[11:0]
rs1
110
rd
0010011
imm[11:0]
rs1
111
rd
0010011
0000000
shamt
rs1
001
rd
0010011
0000000
shamt
rs1
101
rd
0010011
0100000
shamt
rs1
101
rd
0010011
0000000
rs2
rs1
000
rd
0110011
0100000
rs2
rs1
000
rd
0110011
0000000
rs2
rs1
001
rd
0110011
0000000
rs2
rs1
010
rd
0110011
0000000
rs2
rs1
011
rd
0110011
0000000
0000000
0100000
rs2 rs1 rs2 rs1 rs2 rs1
33
100 rd
101 rd
101 rd
0110011
0110011
0110011
UC Davis EEC 170, Winter 2021 / © John Owens
0000000 rs2 rs1 110 rd 0110011 OR

|| ||
imm[12 10:5] rs2 rs1 101 imm[4:1 11] 1100011 BGE
imm[12|10:5] rs2 rs1 110 imm[4:1|11] 1100011 BLTU imm[12|10:5] rs2 rs1 111 imm[4:1|11] 1100011 BGEU
imm[11:0] rs1 000 rd 0000011 LB
RISC-V instruction encodings (funct7)
imm[11:0] rs1
0000011 LH 0000011 LW 0000011 LBU 0000011 LHU 0100011 SB 0100011 SH 0100011 SW 0010011 ADDI 0010011 SLTI 0010011 SLTIU 0010011 XORI 0010011 ORI 0010011 ANDI
imm[11:0] rs1 ▪ ADD: i0m0m[0110:0] 00 rs1 imm[11:0] rs1
001 rd 010 rd
100 rd
101 rd
000 imm[4:0]
001 imm[4:0]
010 imm[4:0] 000 rd
010 rd
011 rd
100 rd
110 rd
111 rd
001 rd 0010011 SLLI
▪ SUB: 0100000
imm[11:5] rs2 rs1
imm[11:5] rs2 rs1 imm[11:5] rs2 rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 imm[11:0] rs1 0000000 shamt rs1
0000000
shamt
rs1
101
rd
0010011
0100000
shamt
rs1
101
rd
0010011
0000000
rs2
rs1
000
rd
0110011
0100000
rs2
rs1
000
rd
0110011
0000000
rs2
rs1
001
rd
0110011
0000000
rs2
rs1
010
rd
0110011
0000000
rs2
rs1
011
rd
0110011
0000000 rs2 rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1
fm pred succ rs1
100 rd
101 rd
101 rd
110 rd
111 rd
000 rd
SRLI SRAI ADD SUB SLL SLT SLTU
0110011 XOR
0110011 SRL
0110011 SRA
0110011 OR
0110011 AND
34
UC Davis EEC 170, Winter 2021 / © John Owens
0001111 FENCE
000000000000 00000 000 00000 1110011 ECALL

imm[11:0] rs1 001 rd 0000011 LH
imm[11:0] rs1 010 rd 0000011 LW imm[11:0] rs1 100 rd 0000011 LBU
RISC-V instruction encodings (funct3)
imm[11:0] rs1 101 rd 0000011 LHU
imm[11:5] rs2 rs1 imm[11:5] rs2 rs1
000 imm[4:0]
001 imm[4:0]
010 imm[4:0]
000 ADDI
010 SLTI
011 SLTIU
100 XORI 110 ORI
111
001 SLLI 101 SRLI 101 SRAI 000 ADD
000 SUB
001 SLL
010 SLT
011 SLTU
▪ ADDI: 000
imm[11:5] rs2 rs1
0100011 SB 0100011 SH 0100011 SW
▪ SLTI: 010
imm[11:0] rs1
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0010011
rd
0110011
rd
0110011
rd
0110011
rd
0110011
rd
0110011
▪ SLTIU: 011
imm[11:0] rs1
imm[11:0] rs1
▪SLT: 010
imm[11:0] rs1
imm[11:0] rs1
▪ SLimTmU[:11:0]011 rs1 0000000 shamt rs1 0000000 shamt rs1 0100000 shamt rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1 0100000 rs2 rs1 0000000 rs2 rs1 0000000 rs2 rs1
fm pred succ rs1
ANDI
35
0001111 FENCE
100 rd
101 rd
101 rd
110 rd
111 rd
000 rd
0110011 XOR
0110011 SRL
0110011 SRA
0110011 OR
0110011 AND
UC Davis EEC 170, Winter 2021 / © John Owens
000000000000 00000 000 00000 1110011 ECALL

Bit Slices
▪ Concentrate on one bit of the adder:
0111 + 0110
▪ Needs:
– 2 inputs (A and B)
– Carry from previous slice (Cin)
– Output (Sum)
– Carry to next slice (Cout)

36 UC Davis EEC 170, Winter 2021 / © John Owens

MIPS Opcode Map
essor User’s Manual / Joe Heinrich]
37 UC Davis EEC 170, Winter 2021 / © John Owens

End of lecture / Quiz 1
38 UC Davis EEC 170, Winter 2021 / © John Owens

But What About Performance? ▪ Critical path of one bitslice is CP
▪ Critical path of n-bit rippled-carry adder is n*CP
▪ Design Trick:
– Throw hardware at it
A0 B0
A1 B1
A2 B2
A3 B3
CarryIn0
1-bit ALU
CarryOut0
Result0
Result1
Result2
Result3
CarryIn1
1-bit ALU
CarryOut1
CarryIn2
1-bit ALU
CarryOut2 CarryIn3
CarryOut3
39
UC Davis EEC 170, Winter 2021 / © John Owens
1-bit ALU

Truth Table for Adder Bit Slice ▪ 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0=Cin
0
1
1
0
1=Cin
1
0
0
1
0=Cin
1
0
1
0
1=Cin
1
1
0
0
1
1
1
1
1
1
40 UC Davis EEC 170, Winter 2021 / © John Owens

A0 B0
A1 B1
A2 B2
A3 B3
Carry Look Ahead (Design trick: peek)
C0 = Cin
S
S
S
S
C4 = . . .
G
P
G = A and B P = A xor B WHY are these interesting?
41
UC Davis EEC 170, Winter 2021 / © John Owens
A B Cout
0 0 0 “kill”
0 1 Cin 1 0 Cin
1 1 1
“propagate” “propagate” “generate”
G P
G P
C1 = G0 + C0 · P0
G P
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G P

CLA vs. Ripple
C0 = Cin
S
G = A and B P = A xor B
CarryIn0
A0 B0
A1 B1
A0
1-bit ALU
CarryOut0
Result0
G P
B0 CarryIn1
A1
B1 CarryIn2
1-bit
ALU Result1
C1 = G0 + C0 · P0
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 +
G0 · P1 · P2 +
C0 · P0 · P1 · P2
C4 = . . .
S
G P
A2 A2 B2
CarryOut1
1-bit ALU
CarryOut2
1-bit
ALU
CarryOut3
Result2
Result3
S
G P
B2
A3 B3
CarryIn3
A3 B3
S
G
P
G P
42
UC Davis EEC 170, Winter 2021 / © John Owens

Cascaded Carry Look-ahead (16-bit)
C
L C0
A
4-bit Adder
4-bit Adder
4-bit Adder
G0 P0
C1 = G0 + C0 · P0
Abstraction!
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G P
C4 = . . .
43
UC Davis EEC 170, Winter 2021 / © John Owens

Design Trick: Guess (or “Precompute”)
CP(2n) = 2*CP(n)
n-bit adder n-bit adder
(2n) = CP(n) + CP(mux)
n-bit adder
1n-bit adder
0
n-bit adder
Cout
Carry-select adder
44
UC Davis EEC 170, Winter 2021 / © John Owens
P

Carry Skip Adder: reduce worst case delay
▪ Just speed up the slowest case for each block
4-bit Ripple Adder 4-bit Ripple Adder
B A4 B A0
▪ Exercise: optimal design uses variable block sizes (why?)
P3 S P3 S
P2
P1
P0
P2
P1
P0
45 UC Davis EEC 170, Winter 2021 / © John Owens

Adder Lessons
▪ Reuse hardware if possible
– +/- reuse is compelling argument for 2’s complement
▪ For higher performance:
– Look for critical path, optimize for it Reorganize equations [propagate/generate / carry lookahead]
– Precompute [carry save]
– Reduce worst-case delay [carry skip]
46 UC Davis EEC 170, Winter 2021 / © John Owens

Finished way early (1:20 in, 30 minutes
47 UC Davis EEC 170, Winter 2021 / © John Owens

End of lecture
48 UC Davis EEC 170, Winter 2021 / © John Owens

Lecture 6:
Arithmetic 2/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021

Multiply (unsigned)
▪ Paper and pencil example (unsigned):
Multiplicand 1000 Multiplier x1001 1000
0000 0000
1000 Product 01001000
▪ m bits x n bits = m+n bit product
▪ Binary makes it easy:
– 0 => place 0 ( 0 x multiplicand )
– 1 => place a copy ( 1 x multiplicand )
▪ 4 versions of multiply hardware & algorithm:
– successive refinement
50 UC Davis EEC 170, Winter 2021 / © John Owens

m bits x n bits = m+n bit product
51 UC Davis EEC 170, Winter 2021 / © John Owens

Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
52 UC Davis EEC 170, Winter 2021 / © John Owens

How does it work?
00
A3 A3 A2
0
0000
A3 A2 A1 A0 A2
A1
A0
A1
A0
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
▪ At each stage shift A left (multiply it by 2)
▪ Use next bit of B to determine whether to add in shifted
multiplicand
▪ Accumulate 2n bit partial product at each stage
53 UC Davis EEC 170, Winter 2021 / © John Owens

Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
▪ Stage i accumulates A * 2i if Bi == 1
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
▪ Q: How much hardware for 32 bit multiplier? Critical path?
54 UC Davis EEC 170, Winter 2021 / © John Owens

Unsigned shift-add multiplier (version 1)
▪ 64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg
Multiplicand
64 bits
Shift Left
Multiplier
32 bits
Control
Shift Right
64-bit ALU
Product
64 bits
Write
Multiplier = datapath + control
55
UC Davis EEC 170, Winter 2021 / © John Owens

Multiply Algorithm V1
Product
0000 0000 0011
1:0000 0010 0011 2:0000 0010 0011 3:0000 0010 0001 1:0000 0110 0001 2:0000 0110 0001 3:0000 0110 0000
0000 0110 0000
Start
1. Test Multiplier0
Multiplier Multiplicand 0000 0010
Multiplier0 = 1
Multiplier0 = 0
0000 0010
0000 0100
0000 0100 0000 0100 0000 1000 0000 1000
0000 1000
1a. Add multiplicand to product &
place the result in Product register
2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions Yes: 32 repetitions Done 56 UC Davis EEC 170, Winter 2021 / © John Owens Observations on Multiply Version 1 ▪ 1 clock per cycle => ≈ 100 clocks per multiply – Ratio of multiply to add 5:1 to 100:1
▪ 1/2 bits in multiplicand always 0 => 64-bit adder is wasted
▪ 0’s inserted in right of multiplicand as shifted
=> least significant bits of product never changed once formed
▪ Instead of shifting multiplicand to left, shift product to right?
57 UC Davis EEC 170, Winter 2021 / © John Owens

Multiply Hardware Version 2
▪ 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg, 32-bit Multiplier reg
Multiplicand
32 bits
32-bit ALU
Product
64 bits
Multiplier
32 bits
Control
Shift Right
Shift Right
Write
58
UC Davis EEC 170, Winter 2021 / © John Owens

How to think of this?
▪ Remember original combinational multiplier:
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
59 UC Davis EEC 170, Winter 2021 / © John Owens

Simply warp to let product move right…
0000
A3 A2 A1 A0
A3 A2 A1 A0
A3 A2 A1 A0
B0
B1
B2
B3
P7 P6 P5 P4 P3 P2 P1 P0
▪ MultiplicaAnd staAys stAill anAd product moves right 3210
60 UC Davis EEC 170, Winter 2021 / © John Owens

Multiply Algorithm V2
Start
Multiplier0 = 1
1. Test Multiplier0
Multiplier0 = 0
0000 0000
1: 0010 0000
2: 0001 0000
0011 0010
0011 0010
1a. Add multiplicand to the left half of product & 0011 0010
3: 00010000 00p0l1acet0h0e1re0sultinthelefthalfofProductregister
1: 0011 0000 0001 0010
2: 0001 1000 0001 0010 Product Multiplier Multiplicand
3: 0001 1000 1: 0001 1000 2: 0000 1100 3: 0000 1100 1: 0000 1100 2: 0000 0110 3: 0000 0110
0000 0110
0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010
0000 0010
2. Shift the Product register right 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition?
No: < 32 repetitions Yes: 32 repetitions Done 61 UC Davis EEC 170, Winter 2021 / © John Owens Still more wasted space! Start Multiplier0 = 1 1. Test Multiplier0 Multiplier0 = 0 0000 0000 1: 0010 0000 2: 0001 0000 3: 0001 0000 1: 0011 0000 0011 0010 1a. Add multiplicand to the left half of product & 0011 0010 place the result in the left half of Product register 0011 0010 2: 0001 1000 3: 0001 1000 1: 0001 1000 2: 0000 1100 3: 0000 1100 1: 0000 1100 2: 0000 0110 3: 0000 0110 0000 0110 0001 0010 0001 0010 0001 0010 Product Multiplier Multiplicand 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 0000 0010 2. Shift the Product register right 1 bit. 3. Shift the Multiplier register right 1 bit. 32nd repetition? No: < 32 repetitions Yes: 32 repetitions Done 62 UC Davis EEC 170, Winter 2021 / © John Owens Observations on Multiply Version 2 ▪ Product register wastes space that exactly matches size of multiplier => combine Multiplier register and Product register
63 UC Davis EEC 170, Winter 2021 / © John Owens

Multiply Hardware Version 3
▪ 32-bit Multiplicand reg, 32 -bit ALU, 64-bit Product reg, (0-bit Multiplier reg)
Multiplicand 32 bits
32-bit ALU
Product (Multiplier) 64 bits
Shift Right
Write
Control
64
UC Davis EEC 170, Winter 2021 / © John Owens

Multiply Algorithm V3
Product0 = 1
Start
1.
Product0
Product0 = 0
Test
1a. Add multiplicand to the left half of product & place the result in the left half of Product register
0000 0011 0010
1: 0010 0011 0010
2: 0001 0001 0010
1: 0011 0000 0010 Product Multiplicand
2: 0001 1000 0010
1: 0001 1000 0010
2: 0000 1100 0010
1: 0000 1100 0010
2: 0000 0110 0010
0000 0110 0010
2. Shift the Product register right 1 bit.
32nd repetition?
No: < 32 repetitions Yes: 32 repetitions Done 65 UC Davis EEC 170, Winter 2021 / © John Owens Observations on Multiply Version 3 ▪ 2 steps per bit because Multiplier & Product combined ▪ What about signed multiplication? - easiest solution is to make both positive & remember whether to complement product when done (leave out the sign bit, run for 31 steps) - apply definition of 2’s complement - need to sign-extend partial products and subtract at the end - Booth’s Algorithm is elegant way to multiply signed numbers using same hardware as before and save cycles - can handle multiple bits at a time 66 UC Davis EEC 170, Winter 2021 / © John Owens Faster Multiplier ▪ Uses multiple adders - Cost/performance tradeoff ▪ Can be pipelined - Several multiplications performed in parallel 67 UC Davis EEC 170, Winter 2021 / © John Owens Motivation for Booth’s Algorithm ▪ ▪ Example 2 x 6 = 0010 x 0110: 0010 x 0110 + + + + 0000 shift (0 in multiplier) 0010 add (1 in multiplier) 0010 add (1 in multiplier) 0000 shift (0 in multiplier) 0001100 68 UC Davis EEC 170, Winter 2021 / © John Owens Motivation for Booth’s Algorithm ▪ ALU with add or subtract can get same result in more than one way: ▪ 6 = 4 + 2 = -2 + 8 0110 = -00010 + 01000 = 11110 + 01000 ▪ For example: 0010 (2 [multiplier]) x 0110 (6 [multiplicand]) 0000 - 0010 0000 + 0010 00001100 shift (0 in multiplier) sub (first 1 in multpl.) shift (mid string of 1s) add (prior step had last 1) 69 UC Davis EEC 170, Winter 2021 / © John Owens Booth’s Algorithm ▪ CuernredntoBfitruBnit tomthide dRilgehtoEfxprluanatbioenginning oExfarmupnle Op 1 0 sub 1 1 0 1 0 0 Begins run of 1s 0001111000 011110 Middle of run of 1s 0001111000 none End of run of 1s 0001111000 add Middle of run of 0s 0001111000 none ▪ Originally for speed (when shift was faster than add) ▪ Replace a string of 1s in multiplier with an initial subtract when we first see a one and then later add for the bit after the last one ▪ Handles two’s complement! 70 UC Davis EEC 170, Winter 2021 / © John Owens Booth’s Example (2 x 7) Operation 0. initial value 1a. P = P - m 1b. 2. 3. 4a. 4b. Multiplicand 0010 Product 0000 0111 0 1110 0111 0 1111 0011 1 1111 1001 1 1111 1100 1 0001 1100 1 0000 1110 0 next? 10 -> sub
shift P (sign ext) 11 -> nop, shift 11 -> nop, shift 01 -> add
shift done
Blue + red = multiplier
(red is 2 bits to compare); Green = arithmetic ops; Black = product
71 UC Davis EEC 170, Winter 2021 / © John Owens
0010
0010
0010
0010
0010
0010
+ 1110->
+ 0010->

Booth’s Example (2 x -3)
Operation Multiplicand Product
next?
10 -> sub shift P (sign ext) 01 -> add
shift P
10 -> sub shift
11 -> nop
shift done
0. initial value 0010
0000 1101 0 1110 1101 0
1111 0110 1
1a. P = P – m 1b.
2a. 2b.
3a.
3b. 4a.
1110 + 1110 0010
+ 0010
0001 011 0000 1011 0
0 1
0010
+ 1110 1110 1011 0
1111 0101 1 1111 0101 1
0010 0010
4b.
Blue + red = multiplier (red is 2 bits to compare); Green = arithmetic ops; Black = product
0010
1111 1010 1
72 UC Davis EEC 170, Winter 2021 / © John Owens

Radix-4 Modified Booth’s Algorithm
Current
Bits Right
00 0
01 0
10 0
11 0
00 1
01 1
10 1
Explanation Example Recode
Bit to the
Middle of zeros Single one Begins run of 1s Begins run of 1s
Ends run of 1s Ends run of 1s Isolated 0 Middle of run
00 00 00 00 00 0 00 00 00 01 00 1 00 01 11 10 00 -2
00 01 11 11 00 -1 00 00 11 11 00 1
00 01 11 11 00 2 00 11 10 11 00 -1 00 11 11 11 00 0
11 1
Same insight as one-bit Booth’s, simply adjust for alignment of 2 bits. Allows multiplication 2 bits at a time.
73 UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V Multiplication Support
▪ Four multiply instructions: – mul: multiply
– Gives the lower 64 bits of the product – mulh: multiply high
– Gives the upper 64 bits of the product, assuming the operands are signed
– mulhu: multiply high unsigned
– Gives the upper 64 bits of the product, assuming the operands are unsigned
– mulhsu: multiply high signed/unsigned
– Gives the upper 64 bits of the product, assuming one operand is signed and the other unsigned
– Use mulh result to check for 64-bit overflow
74 UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V Support for multiply
▪ “If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order andrdhcannot be the same asrs1or
rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.”
75 UC Davis EEC 170, Winter 2021 / © John Owens

Multiplication Summary
▪ Iterative algorithm ▪ Design techniques:
– Analyze hardware—what’s not in use?
– Spend more hardware to get higher performance
– Booth’s Algorithm—more general (2’s complement)
– Booth’s Algorithm—recoding is powerful technique to think about problem in a different way
– Booth’s Algorithm—more bits at once gives higher performance
76 UC Davis EEC 170, Winter 2021 / © John Owens

Break
77 UC Davis EEC 170, Winter 2021 / © John Owens

Administrivia
▪ TA Trivikram coming at 11:30 to present the project – You’re going to write 4 RISC-V procedures.
▪ HW 3 released last Friday, due on Friday
– Solutions will be outside my office, Kemper 3175
▪ Midterm is week from today
– Open book, open note
– Bring a calculator
78 UC Davis EEC 170, Winter 2021 / © John Owens

Is NVIDIA Doubling Down On RISC-V?
▪ Betteridge’s law of headlines is an adage that states:
“Any headline that ends in a question mark can be answered by the word no”.
79 UC Davis EEC 170, Winter 2021 / © John Owens

Is NVIDIA Doubling Down On RISC-V?
▪ “Six RISC-V positions have been advertised by NVIDIA, based in Shanghai and pertaining to architecture, design, and verification.”
▪ “Due its light weight and extensibility, RISC-V is gaining mainstream adoption across many new sectors, including datacenter accelerators, mobile & wireless, automotive, and IoT.”
▪ “In a 2017 RISC-V workshop in Shanghai, NVIDIA explained that shortcomings such as low performance and lack of caches and thread protection meant Falcon’s architecture could not meet growing complexity demands.”
▪ “NVIDIA listed the technical criteria for its next-gen architecture: more than twice the performance of Falcon, less than twice the area cost of Falcon, support for caches, tightly coupled memories, 64-bit addresses, and suitability for modern operating systems. They concluded only RISC-V meets all criteria. The new RISC-V micro-controllers will outperform Falcon micro- controllers by three times, Tom’s Hardware has reported.”
https://medium.com/syncedreview/is-nvidia-doubling-down-on-r8is0c-v-1ce714a919eb UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V Support for divide
▪ 4 instructions:
-{div, divu, rem, remu} rd, rs1, rs2
– – – –
▪ “If both the quotient and remainder are required from the same division, the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr,
rs1, rs2 (rdq cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single divide operation instead of performing two separate divides.”
▪ Overflow and division-by-zero don’t produce errors
– Just return defined results
– Faster for the common case of no error
div: rs1 / rs2, treat as signed
divu: rs1 / rs2, treat as unsigned
rem: rs1 mod rs2, treat as signed
remu: rs1 mod rs2, treat as unsigned
81 UC Davis EEC 170, Winter 2021 / © John Owens

MIPS Support for multiply/divide
▪ Rather than target the general-purpose registers:
– –
hi
mul placed its output into two specialhiandloregisters div placed its divide output intoloand its rem output into

general-purpose register)
MIPS providedmfloandmfhiinstructions (destination:
82 UC Davis EEC 170, Winter 2021 / © John Owens

Divide: Paper & Pencil
1001 Quotient Divisor 1000 1001010 Dividend
–100100
101
1010
–1000
10 Remainder
(or Modulo result)

▪ ▪
See how big a number can be subtracted, creating quotient bit on each step
– Binary => 1 * divisor or 0 * divisor Dividend = Quotient x Divisor + Remainder 3 versions of divide, successive refinement
83 UC Davis EEC 170, Winter 2021 / © John Owens

Divide Hardware Version 1
▪ 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
64 bits
Shift Right
64-bit ALU
Remainder
64 bits
Write
Quotient
32 bits
Control
Shift Left
84
UC Davis EEC 170, Winter 2021 / © John Owens

Divide Algorithm V1
▪ Takes n+1 steps for n-bit Quotient & Rem.
Start: Place Dividend in Remainder
1. Subtract the Divisor register from the Remainder register, and place the result in the Remainder register.
▪ Remainder Quotient
0000 0111
Divisor Test Remainder ≥0 Remainder
0010 0000
Remainder < 0 0000 2b. Restore the original value by adding the Divisor register to the Remainder register, & place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0. “restoring” division 2a. Shift the Quotient register to the left setting the new rightmost bit to 1. 3. Shift the Divisor register right1 bit. n+1 repetition? No: < n+1 repetitions Yes: n+1 repetitions (n = 4 here) UC Davis EEC 170, Winter 2021 / © John Owens Done 85 Divide Algorithm I example (7 / 2) Remainder Quotient Divisor 0000 0111 1:1110 0111 00000 2:0000 0111 0000 3:0000 0111 0000 1:1111 0111 0000 2:0000 0111 000 000000010 0000 0010 0000 0010 0000 0001 0000 0001 0000 0001 0000 0000 1000 0000 1000 0000 1000 0000 0100 0000 0100 0000 0100 0000 0010 0000 0010 0000 0010 0000 0001 Answer: Quotient = 3 Remainder = 1 0 0 0 0 00 0 3:0000 0111 000 1:11111111 000 2:0000 0111 00 00 00 0 3:0000 0111 00 000 1:0000 0011 00 000 0001 2:0000 0011 0 3:0000 0011 0 0001 0001 2:0000 0001 00011 3:0000 0001 00011 1:0000 0001 0 86 UC Davis EEC 170, Winter 2021 / © John Owens Divide: Paper & Pencil 01010 Quotient Divisor 0001 00001010 Dividend 00001 –0001 0000 0001 –0001 0 00 Remainder (or Modulo result) - No way to get a 1 in leading digit! - (this is an overflow, i.e quotient would have n+1 bits) - can save 1 iteration ⇒ switch order to shift first and then subtract, 87 UC Davis EEC 170, Winter 2021 / © John Owens Observations on Divide Version 1 ▪ 1/2 bits in divisor always 0 => 1/2 of 64-bit adder is wasted
=> 1/2 of divisor is wasted
▪ Instead of shifting divisor to right, shift remainder to left?
88 UC Davis EEC 170, Winter 2021 / © John Owens

Divide Algorithm I example: wasted space
Remainder Quotient 0000 0111 00000
1:1110 0111 00000
2:0000 0111 00000
3:0000 0111 00000
1:1111 0111 00000
2:0000 0111 00000
3:0000 0111 00000
1:1111 1111 00000
2:0000 0111 00000
3:0000 0111 00000
1:0000 0011 00000
2:0000 0011 00001
3:0000 0011 0
Divisor 0010 0000
0010 0000
0010 0000
0001 0000
0001 0000
0001 0000
0000 1000
0000 1000
0000 1000
0000 0100
0000 0100
0000 0100
0000 0010
0000 0010
0000 0010
0001
1:0000 0001 0
0001
2:0000 0001 0
0011
3:0000 0001 00011 0000 0010
89 UC Davis EEC 170, Winter 2021 / © John Owens

Divide Hardware Version 2
▪ 32-bit Divisor reg, 32-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
32 bits
32-bit ALU
Remainder
64 bits
Quotient
32 bits
Control
Shift Left
Shift Left
Write
90
UC Davis EEC 170, Winter 2021 / © John Owens

Divide Algorithm V2
Start: Place Dividend in Remainder
1. Shift the Remainder register left 1 bit.
Remainder Quotient Divisor
0000 0111 0000 0010
2. Subtract the Divisor register from the
left half of the Remainder register, & place the result in the left half of the Remainder register.
Remainder ≥ 0
Test Remainder
Remainder < 0 3a. Shift the Quotient register to the left setting the new rightmost bit to 1. 3b. Restore the original value by adding the Divisor register to the left half of the Remainder register, &place the sum in the left half of the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0. nth repetition? No: < n repetitions Yes: n repetitions (n = 4 here) UC Davis EEC 170, Winter 2021 / © John Owens Done 91 Observations on Divide Version 2 ▪ Eliminate Quotient register by combining with Remainder as shifted left - Start by shifting the Remainder left as before. - Thereafter loop contains only two steps because the shifting of the Remainder register shifts both the remainder in the left half and the quotient in the right half - The consequence of combining the two registers together and the new order of the operations in the loop is that the remainder will shifted left one time too many. - Thus the final correction step must shift back only the remainder in the left half of the register 92 UC Davis EEC 170, Winter 2021 / © John Owens Divide Hardware Version 3 ▪ 32-bit Divisor reg, 32 -bit ALU, 64-bit Remainder reg, (0-bit Quotient reg) Divisor 32 bits 32-bit ALU “HI” “LO” Shift Left Write Remainder (Quotient) Control 64 bits 93 UC Davis EEC 170, Winter 2021 / © John Owens Start: Place Dividend in Remainder Divide Algorithm V3 1. Shift the Remainder register left 1 bit. 2. Subtract the Divisor register from the Remainder Divisor left half of the Remainder register, & place the 0000 0111 0010 3a. Shift the Remainder register to the left setting the new rightmost result in the left half of the Remainder register. Remainder ≥ 0 Test Remainder < 0 Remainder 3b. Restore the original value by adding the Divisor register to the left half of the Remainder register, &place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new least significant bit to 0. bit to 1. nth repetition? No: < n repetitions Yes: n repetitions (n = 4 here) Done. Shift left half of Remainder right 1 bit. UC Davis EEC 170, Winter 2021 / © John Owens 94 Final Multiply / Divide Hardware Multiplicand 32 bits 32-bit ALU Product (Multiplier) 64 bits Divisor 32 bits 32-bit ALU “HI” “LO” Shift Right Write Control Remainder (Quotient) Shift Left Write Control 64 bits 95 UC Davis EEC 170, Winter 2021 / © John Owens Observations on Divide Version 3 ▪ Same Hardware as Multiply: just need ALU to add or subtract, and 64-bit register to shift left or shift right ▪ Hi and Lo registers in MIPS combine to act as 64-bit register for multiply and divide ▪ Signed Divides: Simplest is to remember signs, make positive, and complement quotient and remainder if necessary - Note: Dividend and Remainder must have same sign - Note: Quotient negated if Divisor sign & Dividend sign disagree e.g., –7 ÷ 2 = –3, remainder = –1 - What about? –7 ÷ 2 = –4, remainder = +1 - See http://mathforum.org/library/drmath/view/52343.html ▪ Possible for quotient to be too large: if divide 64-bit integer by 1, quotient is 64 bits (called “saturation”) 96 UC Davis EEC 170, Winter 2021 / © John Owens SRT Division D. Sweeney of IBM, J.E. Robertson of the University of Illinois, and T.D. Tocher of Imperial College, London Current Remainder P-D (Partial Divisor) Plot 9|9 43211111 8|8 42211110 7|7 32111100 6|6 32111000 5 | 5 2 1 1 1 0 00 0 4 | 4 2 1 1 0 0 00 0 3|3 11000000 2|2 10000000 1|1 00000000 0 +------------------ 0 1 2 3 4 5 6 78 9 97 UC Davis EEC 170, Winter 2021 / © John Owens SRT Division ▪ Intel Pentium divide implementation: SRT division with 2 bits/ iteration (radix 4) ▪ Allows negative entries ▪ 1066 entries in lookup table 98 UC Davis EEC 170, Winter 2021 / © John Owens [http://members.cox.net/srice1/ pentbug/introduction.html] Faster Division ▪ Can’t use parallel hardware as in multiplier - Subtraction is conditional on sign of remainder ▪ Faster dividers (e.g. SRT division) generate multiple quotient bits per step - Still require multiple steps 99 UC Davis EEC 170, Winter 2021 / © John Owens Division Lessons ▪ In practice, slower than multiplication - Also less frequent - But, in the simple case, can use same hardware! ▪ Generates quotient and remainder together ▪ Floating-point division faster than integer division (why?) ▪ Similar hardware lessons as multiplier: - Look for unused hardware - Can process multiple bits at once at cost of extra hardware 100 UC Davis EEC 170, Winter 2021 / © John Owens End of lecture (1:30 in, to allow project 101 UC Davis EEC 170, Winter 2021 / © John Owens Lecture 7: Arithmetic 3/3 John Owens Introduction to Computer Architecture UC Davis EEC 170, Winter 2021 RISC-V logical instructions Instruction Meaning Pseudocode XORI rd,rs1,imm SRAI rd,rs1,imm Exclusive Or Immediate rd ← ux(rs1) ⊕ ux(imm) ORI rd,rs1,imm Or Immediate rd ← ux(rs1) ∨ ux(imm) ANDI rd,rs1,imm SLLI rd,rs1,imm And Immediate rd ← ux(rs1) ∧ ux(imm) Shift Left Logical Immediate rd ← ux(rs1) « ux(imm) SRLI rd,rs1,imm Shift Right Logical Immediate rd ← ux(rs1) » ux(imm) SLL rd,rs1,rs2 SRL rd,rs1,rs2 Shift Right Arithmetic Immediate rd ← sx(rs1) » ux(imm) Shift Left Logical rd ← ux(rs1) « rs2 XOR rd,rs1,rs2 Exclusive Or rd ← ux(rs1) ⊕ ux(rs2) Shift Right Logical rd ← ux(rs1) » rs2 SRA rd,rs1,rs2 OR rd,rs1,rs2 Shift Right Arithmetic rd ← sx(rs1) » rs2 Or rd ← ux(rs1) ∨ ux(rs2) AND rd,rs1,rs2 And rd ← ux(rs1) ∧ ux(rs2) 103 UC Davis EEC 170, Winter 2021 / © John Owens Shift Operations ▪ Bit manipulation: - S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM & 0 11111111 00000000000000000000000 ------------------------------------ 0 EEEEEEEE 00000000000000000000000 - Right shift 23 bits to get 000000000000000000000000 EEEEEEEE - Do arithmetic manipulation 000000000000000000000000 ENEWENEW - Left shift 23 bits to get 0 ENEWENEW 00000000000000000000000 104 UC Davis EEC 170, Winter 2021 / © John Owens Shift Operations ▪ Arithmetic operation: - Example: 00011 << 2 [3 left shift 2] - 00011 << 2 = 01100 = 12 = 2 * 4 - Each bit shifted left == multiply by two - Example: 01010 >> 1 [10 right shift 1]
– 01010 >> 1 = 00101 = 5 = 10/2
– Each bit shifted right == divide by two
– Why?
– Compilers do this—“strength reduction”
105 UC Davis EEC 170, Winter 2021 / © John Owens

Shift Operations


With left shift, what do we shift in?
– 00011 << 2 = 01100 (arithmetic) - 0000XXXX << 4 = XXXX0000 (logical) - We shifted in zeroes How about right shift? - XXXX0000 >> 4 = 0000XXXX (logical) – Shifted in zero
– 00110 (= 6) >> 1 = 00011 (3) (arithmetic) – Shifted in zero
– 11110 (= -2) >> 1 = 11111 (-1) (arithmetic) – Shifted in one
106 UC Davis EEC 170, Winter 2021 / © John Owens

Shift Operations
▪ How about right shift?
– XXXX0000 >> 4 = 0000XXXX: Logical shift
– Shifted in zero
– 00110 (= 6) >> 1 = 00011 (3)
11110 (= -2) >> 1 = 11111 (-1): Arithmetic shift
– Shifted in sign bit
▪ RISC-V supports both logical and arithmetic:
– slli, srai, srli: Shift amount taken from within instruction (“imm”)
6 bits 6 bits 5 bits 3 bits 5 bits
7 bits 5 bits 5 bits 3 bits 5 bits
– sll, sra, srl: shift amount taken from register (“variable”)
– How far can we shift with slli/srai/slli? With sll/sra/srl?
7 bits
7 bits
funct6
shamt
rs1
funct3
rd
opcode
funct7
rs2
rs1
funct3
rd
opcode
107
UC Davis EEC 170, Winter 2021 / © John Owens

Combinational Shifter from MUXes
Basic Building Block AB
sel 1 0 D
8-bit right shifter
A7 A6 A5 A4 A3 A2 A1 A0 1010101010101010
1010101010101010
1010101010101010 R7 R6 R5 R4 R3 R2 R1 R0
S2 S1 S0
▪ What comes in the MSBs?
▪ How many levels for 64-bit shifter? ▪ What if we use 4-1 Muxes ?
108 UC Davis EEC 170, Winter 2021 / © John Owens

General Shift Right Scheme using 16 bit example ▪S 0 If we added right-to-left connections, we could supportROTATE
(0,1)
S1 (0,2)
S2 (0,4)
(not in RISC-V but found in other ISAs)
S3 (0,8)
109 UC Davis EEC 170, Winter 2021 / © John Owens

Funnel Shifter ▪ Shift A by i bits
▪ Problem: Set Y, X, sa ▪ Logical:
▪ Arithmetic: ▪ Rotate:
▪ Left shifts:
Extract 64 bits of 128 (sa selects which bits)
Y
Y
64
X
64
64 R
Shift Right
X
| sa |
R
110
UC Davis EEC 170, Winter 2021 / © John Owens

Barrel Shifter
▪ Technology-dependent solutions: transistor per switch
in6
in5
in4
out3
out2
out1
out0
SR3
SR2 SR1 SR0
in3
in2 in1 in0
111
UC Davis EEC 170, Winter 2021 / © John Owens

Shifter Summary
▪ Shifts common in logical ops, also in arithmetic ▪ RISC-V (oops) has:
– 2 flavors of shift: logical and arithmetic
– 2 directions of shift: right and left
– 2 sources for shift amount: immediate, variable
▪ Lots of cool shift algorithms, but …
– Barrel shifter prevalent in today’s hardware
112 UC Davis EEC 170, Winter 2021 / © John Owens

ating Point
Floating Point
▪ Representation for non-integral numbers
– Including very small and very large numbers
▪ Like scientific notation
– –2.34 × 1056
– +0.002 × 10–4
– +987.02 × 109 ▪ In binary
– ±1.xxxxxxx2 × 2yyyy
▪ Typesfloatanddoublein C
normalized
not normalized
113
UC Davis EEC 170, Winter 2021 / © John Owens

Floating Point Standard
▪ Defined by IEEE Std 754-1985
▪ Developed in response to divergence of representations
– Portability issues for scientific code ▪ Now almost universally adopted
▪ Two representations
– Single precision (32-bit)
– Double precision (64-bit)
114 UC Davis EEC 170, Winter 2021 / © John Owens

115 UC Davis EEC 170, Winter 2021 / © John Owens

Floating-point Formats
▪ Single-precision (32 bits)
▪ Double precision (64 bits)
116 UC Davis EEC 170, Winter 2021 / © John Owens

IEEE Floating-Point Format
x = (−1)S ×(1+Fraction)×2(Exponent−Bias)
▪ S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
▪ Normalize significand: 1.0 ≤ |significand| < 2.0 - Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit) - Significand is Fraction with the “1.” restored Fraction: single: 23 bits double: 52 bits ▪ Exponent: excess representation: actual exponent + Bias - Ensures exponent is unsigned - Single: Bias = 127; Double: Bias = 1023 117 UC Davis EEC 170, Winter 2021 / © John Owens Exponent: single: 8 bits double: 11 bits Single-Precision Range ▪ Exponents 00000000 and 11111111 reserved ▪ Smallest value - Exponent: 00000001 ⇒ actual exponent = 1 – 127 = –126 - Fraction: 000...00 ⇒ significand = 1.0 - ±1.0 × 2–126 ≈ ±1.2 × 10–38 ▪ Largest value - exponent: 11111110 ⇒ actual exponent = 254 – 127 = +127 - Fraction: 111...11 ⇒ significand ≈ 2.0 - ±2.0 × 2+127 ≈ ±3.4 × 10+38 118 UC Davis EEC 170, Winter 2021 / © John Owens Double-Precision Range ▪ Exponents 0000...00 and 1111...11 reserved ▪ Smallest value - Exponent: 00000000001 ⇒ actual exponent = 1 – 1023 = –1022 - Fraction: 000...00 ⇒ significand = 1.0 - ±1.0 × 2–1022 ≈ ±2.2 × 10–308 ▪ Largest value - Exponent: 11111111110 ⇒ actual exponent = 2046 – 1023 = +1023 - Fraction: 111...11 ⇒ significand ≈ 2.0 - ±2.0 × 2+1023 ≈ ±1.8 × 10+308 119 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Precision ▪ Relative precision - all fraction bits are significant - Single: approx 2–23 - Equivalent to 23 × log102 ≈ 23 × 0.3 ≈ 6 decimal digits of precision - Double: approx 2–52 - Equivalent to 52 × log102 ≈ 52 × 0.3 ≈ 16 decimal digits of precision 120 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Example ▪ Represent –0.75 - - –0.75 = (–1)1 × 1.12 × 2–1 - S =1 Fraction = 1000...002 - Exponent = –1 + Bias - Single: –1 + 127 = 126 = 011111102 - Double: –1 + 1023 = 1022 = 011111111102 ▪ Single: 1011111101000...00 ▪ Double: 1011111111101000...00 121 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Example ▪ What number is represented by the single-precision float 11000000101000...00 - S =1 - - ▪ Fraction = 01000...002 Fxponent = 100000012 = 129 x = (–1)1 × (1 + .012) × 2(129 – 127) = (–1) × 1.25 × 22 = –5.0 122 UC Davis EEC 170, Winter 2021 / © John Owens Denormal Numbers ▪ Exponent = 000...0 ⇒ hidden bit is 0 x=(−1)S ×(0+Fraction)×2−Bias ▪ Smaller than normal numbers - allow for gradual underflow, with diminishing precision ▪ Denormal with fraction = 000...0 x=(−1)S ×(0+0)×2−Bias =±0.0 Two representations of 0.0! 123 UC Davis EEC 170, Winter 2021 / © John Owens Infinities and NaNs ▪ Exponent = 111...1, Fraction = 000...0 - ±Infinity - Can be used in subsequent calculations, avoiding need for overflow check ▪ Exponent = 111...1, Fraction ≠ 000...0 - Not-a-Number (NaN) - Indicates illegal or undefined result - e.g., 0.0 / 0.0 - Can be used in subsequent calculations 124 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Addition ▪ Consider a 4-digit decimal example - 9.999 × 101 + 1.610 × 10–1 ▪ 1. Align decimal points - Shift number with smaller exponent - 9.999 × 101 + 0.016 × 101 ▪ 2. Add significands - 9.999 × 101 + 0.016 × 101 = 10.015 × 101 ▪ 3. Normalize result & check for over/underflow - 1.0015 × 102 ▪ 4. Round and renormalize if necessary - 1.002 × 102 125 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Addition ▪ Now consider a 4-digit binary example - 1.0002 × 2–1 + –1.1102 × 2–2 (0.5 + –0.4375) ▪ 1. Align binary points - Shift number with smaller exponent - 1.0002 × 2–1 + –0.1112 × 2–1 ▪ 2. Add significands - 1.0002 × 2–1 + –0.1112 × 2–1 = 0.0012 × 2–1 ▪ 3. Normalize result & check for over/underflow - 1.0002 × 2–4, with no over/underflow ▪ 4. Round and renormalize if necessary - 1.0002 × 2–4 (no change) = 0.0625 126 UC Davis EEC 170, Winter 2021 / © John Owens FP Adder Hardware ▪ Much more complex than integer adder ▪ Doing it in one clock cycle would take too long - Much longer than integer operations - Slower clock would penalize all instructions ▪ FP adder usually takes several cycles - Can be pipelined 127 UC Davis EEC 170, Winter 2021 / © John Owens FP Adder Hardware Step 1 Step 2 Step 3 Step 4 128 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Multiplication ▪ Consider a 4-digit decimal example - 1.110 × 1010 × 9.200 × 10–5 ▪ 1. Add exponents - For biased exponents, subtract bias from sum - New exponent = 10 + –5 = 5 ▪ 2. Multiply significands - 1.110 × 9.200 = 10.212 ⇒ 10.212 × 105 ▪ 3. Normalize result & check for over/underflow - 1.0212 × 106 ▪ 4. Round and renormalize if necessary - 1.021 × 106 ▪ 5. Determine sign of result from signs of operands - +1.021 × 106 129 UC Davis EEC 170, Winter 2021 / © John Owens Floating-Point Multiplication ▪ Now consider a 4-digit binary example - 1.0002 × 2–1 × –1.1102 × 2–2 (0.5 × –0.4375) ▪ 1. Add exponents - Unbiased: –1 + –2 = –3 - Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127 ▪ 2. Multiply significands - 1.0002 × 1.1102 = 1.1102 ⇒ 1.1102 × 2–3 ▪ 3. Normalize result & check for over/underflow - 1.1102 × 2–3 (no change) with no over/underflow ▪ 4. Round and renormalize if necessary - 1.1102 × 2–3 (no change) ▪ 5. Determine sign: +ve × –ve ⇒ –ve - –1.1102 × 2–3 = –0.21875 130 UC Davis EEC 170, Winter 2021 / © John Owens FP Arithmetic Hardware ▪ FP multiplier is of similar complexity to FP adder - But uses a multiplier for significands instead of an adder ▪ FP arithmetic hardware usually does - Addition, subtraction, multiplication, division, reciprocal, square-root FP ↔ integer conversion ▪ Operations usually takes several cycles - Can be pipelined - 131 UC Davis EEC 170, Winter 2021 / © John Owens FP Instructions in RISC-V ▪ Separate FP registers: f0, ..., f31 - double-precision - single-precision values stored in the lower 32 bits ▪ FP instructions operate only on FP registers - Programs generally don’t do integer ops on FP data, or vice versa - More registers with minimal code-size impact ▪ FP load and store instructions - - flw, fld fsw, fsd 132 UC Davis EEC 170, Winter 2021 / © John Owens FP Instructions in RISC-V ▪ Single-precision arithmetic -fadd.s, fsub.s, fmul.s, fdiv.s, fsqrt.s - ▪ Double-precision arithmetic e.g.,fadds.s f2, f4, f6 -fadd.d, fsub.d, fmul.d, fdiv.d, fsqrt.d - ▪ Single- and double-precision comparison -feq.s, flt.s, fle.s -feq.d, flt.d, fle.d - Result is 0 or 1 in integer destination register - Use beq, bne to branch on comparison result ▪ Branch on FP condition code true or false -B.cond e.g.,fadd.d f2, f4, f6 133 UC Davis EEC 170, Winter 2021 / © John Owens FP Example: °F to °C ▪ C code: float f2c (float fahr) { return ((5.0/9.0)*(fahr - 32.0)); } - fahr in f10, result in f10, literals in global memory space ▪ Compiled RISC-V code: f2c: flw f0,const5(x3) // f0 = 5.0f flw f1,const9(x3) // f1 = 9.0f fdiv.s f0, f0, f1 // f0 = 5.0f / 9.0f flw f1,const32(x3) // f1 = 32.0f fsub.s f10,f10,f1 // f10 = fahr - 32.0 fmul.s f10,f0,f10 // f10 = (5.0f/9.0f) * (fahr-32.0f) jalr x0,0(x1) // return 134 UC Davis EEC 170, Winter 2021 / © John Owens FP Example: Array Multiplication ▪ C = C +A × B - All 32 × 32 matrices, 64-bit double-precision elements ▪ C code: ▪ void mm (double c[][], double a[][], double b[][]) { size_t i, j, k; for (i = 0; i < 32; i = i + 1) for (j = 0; j < 32; j = j + 1) for (k = 0; k < 32; k = k + 1) } c[i][j] = c[i][j] + a[i][k] * b[k][j]; - i,j,kin x5, x6, x7 Addresses ofc,a,bin x10, x11, x12, and 135 UC Davis EEC 170, Winter 2021 / © John Owens FP Example: Array Multiplication ▪ RISC-V code: mm:... li x28,32 li x5,0 L1: li x6,0 L2: li x7,0 // x28 = 32 (row size/loop end) // i = 0; initialize 1st for loop // j = 0; initialize 2nd for loop // k = 0; initialize 3rd for loop // x30 = i * 2**5 (size of row of c) slli x30,x5,5 add x30,x30,x6 // x30 = i * size(row) + j slli x30,x30,3 // x30 = byte offset of [i][j] add x30,x10,x30 // x30 = byte address of c[i][j] fld f0,0(x30) // f0 = c[i][j] L3: slli x29,x7,5 // x29 = k * 2**5 (size of row of b) add x29,x29,x6 // x29 = k * size(row) + j slli x29,x29,3 // x29 = byte offset of [k][j] add x29,x12,x29 // x29 = byte address of b[k][j] fld f1,0(x29) // f1 = b[k][j] 136 UC Davis EEC 170, Winter 2021 / © John Owens FP Example: Array Multiplication ... slli x29,x5,5 // x29 = i * 2**5 (size of row of a) add slli add x29,x29,x7 // x29 = i * size(row) + k x29,x29,3 // x29 = byte offset of [i][k] x29,x11,x29 // x29 = byte address of a[i][k] fld f2,0(x29) // f2 = a[i][k] fmul.d f1, f2, f1 // f1 = a[i][k] * b[k][j] fadd.d f0, f0, f1 // f0 = c[i][j] + a[i][k] * b[k][j] addi x7,x7,1 // k = k + 1 bltu x7,x28,L3 // if (k < 32) go to L3 fsd f0,0(x30) // c[i][j] = f0 bltu bltu addi x6,x6,1 x6,x28,L2 addi x5,x5,1 x5,x28,L1 // j = j + 1 // if (j < 32) go to L2 // i = i + 1 // if (i < 32) go to L1 137 UC Davis EEC 170, Winter 2021 / © John Owens Accurate Arithmetic ▪ IEEE Std 754 specifies additional rounding control - Extra bits of precision (guard, round, sticky) - Choice of rounding modes - Allows programmer to fine-tune numerical behavior of a computation ▪ Not all FP units implement all options - Most programming languages and FP libraries just use defaults ▪ Trade-off between hardware complexity, performance, and market requirements 138 UC Davis EEC 170, Winter 2021 / © John Owens Subword Parallellism ▪ Graphics and audio applications can take advantage of performing simultaneous operations on short vectors - Example: 128-bit adder: - Sixteen 8-bit adds - Eight 16-bit adds - Four 32-bit adds ▪ Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD) 139 UC Davis EEC 170, Winter 2021 / © John Owens §3.6 Parallelism and Computer Arithmetic: Subword Parallelism x86 FP Architecture ▪ Originally based on 8087 FP coprocessor - 8 × 80-bit extended-precision registers - Used as a push-down stack - Registers indexed from TOS: ST(0), ST(1), ... ▪ FP values are 32-bit or 64 in memory - Converted on load/store of memory operand - Integer operands can also be converted on load/store ▪ Very difficult to generate and optimize code - Result: poor FP performance 140 UC Davis EEC 170, Winter 2021 / © John Owens §3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86 x86 FP Instructions ▪ Optional variations Data transfer Arithmetic - I: integer operand FIADDP mem/ST(i) - P: pop operand from stack FILD mem/ST(i) FISTP mem/ST(i) FIMULP mem/ST(i) FISUBRP mem/ST(i) - R: reverse opeFraIDnIdVRoPrdmerm/ST(i) FICOMP FIUCOMP FSTSW AX/mem FLDPI FLD1 FSQRT - But not all combinations allowed FRNDINT FLDZ FABS Compare Transcendental FPATAN F2XMI FCOS FPTAN FPREM FPSIN FYL2X 141 UC Davis EEC 170, Winter 2021 / © John Owens Streaming SIMD Extension 2 (SSE2) ▪ Adds 4 × 128-bit registers - Extended to 8 registers in AMD64/EM64T ▪ Can be used for multiple FP operands - 2 × 64-bit double precision - 4 × 32-bit double precision - Instructions operate on them simultaneously - Single-Instruction Multiple-Data 142 UC Davis EEC 170, Winter 2021 / © John Owens Matrix Multiply ▪ Unoptimized code: 1. void dgemm (int n, double* A, double* B, double* C) 2. { 3. for (int i = 0; i < n; ++i) 4. for (int j = 0; j < n; ++j) 5. { 6. double cij = C[i+j*n]; /* cij = C[i][j] */ 7. for (int k = 0; k < n; k++) 8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */ 9. C[i+j*n] = cij; /* C[i][j] = cij */ 10. } 11. } 143 UC Davis EEC 170, Winter 2021 / © John Owens §3.8 Going Faster: Subword Parallelism and Matrix Multiply Matrix Multiply ▪ x86 assembly code: 1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0 2. mov %rsi,%rcx # register %rcx = %rsi 3. xor %eax,%eax # register %eax = 0 4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1 5. add %r9,%rcx # register %rcx = %rcx + %r9 6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A 7. add $0x1,%rax # register %rax = %rax + 1 8. cmp %eax,%edi # compare %eax to %edi 9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0 10. jg 30 # jump if %eax > %edi
11. add $0x1,%r11d # register %r11 = %r11 + 1
12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element
144
UC Davis EEC 170, Winter 2021 / © John Owens

Matrix Multiply ▪ Optimized C code:
1. #include
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4.
5.
6.
7.
8.
9.
10.
11.
12. }
13. }
for ( int i = 0; i < n; i+=4 ) for ( int j = 0; j < n; j++ ) { __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */ for( int k = 0; k < n; k++ ) c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */ _mm256_mul_pd(_mm256_load_pd(A+i+k*n), _mm256_broadcast_sd(B+k+j*n))); _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */ 145 UC Davis EEC 170, Winter 2021 / © John Owens Matrix Multiply ▪ Optimized x86 assembly code: 1. vmovapd (%r11),%ymm0 2. mov %rbx,%rcx 3. xor %eax,%eax 4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element 5. add $0x8,%rax # register %rax = %rax +8 10. jne 50
11. add $0x1,%esi
12. vmovapd %ymm0,(%r11)
# jump if not %r10 != %rax
# register % esi = % esi + 1
# Store %ymm0 into 4 C elements
# Load 4 elements of C into %ymm0
# register %rcx = %rbx
# register %eax = 0
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements
7. add %r9,%rcx # register %rcx = %rcx + %r9
8. cmp %r10,%rax # compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0
146
UC Davis EEC 170, Winter 2021 / © John Owens

es and Pitfalls
Right Shift and Division
▪ Left shift by i places multiplies an integer by 2i
▪ Right shift divides by 2i?
– Only for unsigned integers
▪ For signed integers
– Arithmetic right shift: replicate the sign bit
– e.g., –5 / 4
– 111110112 >> 2 = 111111102 = –2
– Rounds toward –∞
c.f. 111110112 >>> 2 = 001111102 = +62

147
UC Davis EEC 170, Winter 2021 / © John Owens

Associativity
▪ Parallel programs may interleave operations in unexpected orders
– Assumptions of associativity may fail
(x+y)+z
x+(y+z)
x
-1.50E+38
0.00E+00
-1.50E+38
y
1.50E+38
1.50E+38
z
1.0
1.0
1.00E+00
0.00E+00
▪ Need to validate parallel programs under varying degrees of parallelism
148 UC Davis EEC 170, Winter 2021 / © John Owens

Who Cares About FP Accuracy?
▪ Important for scientific code
– But for everyday consumer use?

▪ The Intel Pentium FDIV bug
– The market expects accuracy
– See Colwell, The Pentium Chronicles
“My bank balance is out by 0.0002¢!”☹
149 UC Davis EEC 170, Winter 2021 / © John Owens

Concluding Remarks
▪ Bits have no inherent meaning
– Interpretation depends on the instructions applied
▪ Computer representations of numbers
– Finite range and precision
– Need to account for this in programs
▪ ISAs support arithmetic
– Signed and unsigned integers
– Floating-point approximation to reals
▪ Bounded range and precision
– Operations can overflow and underflow
150 UC Davis EEC 170, Winter 2021 / © John Owens
§3.10 Concluding Remarks

Break
151 UC Davis EEC 170, Winter 2021 / © John Owens

Administrivia
▪ My office hour is in the Coffee House, not the Silo. The Silo is closed for construction.
▪ Maybe oops: https://en.wikipedia.org/wiki/ Kahan_summation_algorithm
▪ Tell me about errors in slides / in homework
▪ Anyone try RARS?
▪ Midterm philosophy
▪ No typing on the midterm
▪ Bring a calculator
152 UC Davis EEC 170, Winter 2021 / © John Owens

Problem: Design a “fast” ALU for the RISC-V ISA
▪ Requirements?
– Must support the Arithmetic / Logic operations
– Tradeoffs of cost and speed based on frequency of occurrence, hardware budget
153 UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V ALU requirements
▪ Add, Sub, AddI, AddI
– => 2’s complement adder/sub
▪ And, Or, AndI, OrI, Xor, Xori
– => Logical AND, logical OR, XOR
▪ SLTI, SLTIU (set less than)
– => 2’s complement adder with inverter, check sign bit of
result
▪ See ALU from COD5E, appendix A.5
154 UC Davis EEC 170, Winter 2021 / © John Owens

MIPS arithmetic instruction format ▪ I-format:
immediate
rs1
funct3
rd
opcode
12 bits
▪ R-format: SLTI
SLTIU ANDI ORI XORI
5 bits
3 bits
5 bits
7 bits
opcode
Type op funct
ADD 00 0110011 | 000 | 0000000
Type op 0010011
SUB
5 bits
funct3
rd
7 bits
5 bits
AND OR XOR SLT SLTU
3 bits
5 bits
7 bits
ADDI 0010011 | 000
funct7 rs2 rs1
0010011
0010011
0010011
0010011
00 00 00 00 00
0110011 | 0110011 | 0110011 | 0110011 | 0110011 |
111 | 0000000 110 | 0000000 100 | 0000000 010 | 0000000 011 | 0000000
| 010 | 011 | 111 | 110 | 100
00
0110011 | 000 | 0100000
155
UC Davis EEC 170, Winter 2021 / © John Owens

Design Trick: divide & conquer
▪ Trick: Break the problem into simpler problems, solve them and glue together the solution
▪ Example: assume the immediates have been taken care of before ADDI 0010011 | 000 ADD 00 0110011 | 000 | 0000000
SUB 00 0110011 | 000 | 0100000
Type op Type op funct
the ALU
ANDI ORI XORI SLTI SLTIU
0010011 | 111 AND 00 0110011 | 111 | 0000000

7 operations (could be 3 bits, but really 4)
0010011 | 110 0010011 | 100 0010011 | 010 0010011 | 011
OR 00 XOR 00 SLT 00 SLTU 00
0110011 | 110 | 0000000 0110011 | 100 | 0000000 0110011 | 010 | 0000000 0110011 | 011 | 0000000
156 UC Davis EEC 170, Winter 2021 / © John Owens

Let’s Build a ALU
▪ Functional Specification:
– inputs: 2 x 32-bit operands A, B, 4-bit OPeration
– outputs: 32-bit result S, 1-bit carry, 1 bit overflow
– operations: add, sub, and, or, xor, slt, sltu
32 32 AB
c ALU OP ovf
S
32
4
157
UC Davis EEC 170, Winter 2021 / © John Owens

We already know how to do add/sub
A1
B1
+ –
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
A + ~B + 1
+ –
+ –
Cout Full
Sum S0
S=
… then add 1
B
0
Adder
0
1
158
UC Davis EEC 170, Winter 2021 / © John Owens

Control for +/-
One bit controls three muxes. This is a “control point”.
A1
B1
+ –
A B
Cin
A B
Cin
Cout Full
Sum S1
Adder
A0
+ –
Cout Full
Sum S0
How do we set this control point for add? subtract?
B0
Adder
0
+ –
1
159
UC Davis EEC 170, Winter 2021 / © John Owens

160 UC Davis EEC 170, Winter 2021 / © John Owens

AND and OR
▪ Consider ALU that supports two functions, AND and OR
▪ How do we do this?
A
and
or
B
161 UC Davis EEC 170, Winter 2021 / © John Owens

AND and OR

A
Combinational logic:
– Control bit OP is 0 for AND, 1 for OR
A
B
OP
OUT
0
0
0
0
0
0
1
0
0
1
0
0
0
1
1
1
1
0
0
0
1
0
1
1
1
1
0
1
1
1
1
1
B

Hard with lots OP of functions! But let’s do it anyway.
and
or
A
OP
B
162 UC Davis EEC 170, Winter 2021 / © John Owens

Function Inputs Outputs K-Map
127
M0 M1 M2 M3 A B Cin S Cout
add 0000000 00
7-to-2 Combinational Logic
▪ Start turning the crank . . . 0
163 UC Davis EEC 170, Winter 2021 / © John Owens

AND and OR

using a mux
Instead, generate several functions and use control bits to select
and
OP
0 1

Not easy to decide the “best” way to build something
– Don’t want too many inputs to a single gate
– Don’t want to have to go through too many gates
– For our purposes, ease of comprehension is important
A
B
or
164 UC Davis EEC 170, Winter 2021 / © John Owens

Supporting More Functions
▪ With the mux approach, it’s easy to add other functions
– Like add
– To add more, just enlarge mux
– Control signals in mux, not datapath
OP
0 1
and
165
UC Davis EEC 170, Winter 2021 / © John Owens
A
B
or

Supporting More Functions
▪ With the mux approach, it’s easy to add other functions – Like add
OP
– To add more, just enlarge mux
– Control signals in mux, not datapath
0
1
2
A
B
A
B Cin
FA
Cout Sum
166
UC Davis EEC 170, Winter 2021 / © John Owens

Supporting More Functions
▪ With the mux approach, it’s easy to add other functions
– Like add
– To add more, just enlarge mux
(or put more muxes elsewhere)
– Control signals in muxes, not datapath
OP
OP
A
B
A +B
– Cin
+ –
FA
Cout Sum
0
1
2
0
1
167
UC Davis EEC 170, Winter 2021 / © John Owens

Tailoring the ALU to RISC-V
▪ Need to support the set-on-less-than instruction (slt)
– slt produces a 1 if rs < rt and 0 otherwise - use subtraction: (a-b) < 0 implies a < b - So now we’ve got a-b as our result. How does this translate to slt operation? What do we have to test and where does it go? - We test - To produce the proper result, it goes in 168 UC Davis EEC 170, Winter 2021 / © John Owens Tailoring the ALU to RISC-V ▪ Need to support the set-on-less-than instruction (slt) - slt produces a 1 if rs < rt and 0 otherwise - use subtraction: (a-b) < 0 implies a < b - So now we’ve got a-b as our result. How does this translate to slt operation? What do we have to test and where does it go? - We test the highest (sign) bit (S[31]) - To produce the proper result, it goes in the lowest bit (S[0]) 169 UC Davis EEC 170, Winter 2021 / © John Owens Why do you think the MIPS-V designers 170 UC Davis EEC 170, Winter 2021 / © John Owens Tailoring the ALU to the MIPS ▪ Need to support test for equality (beq $t5, $t6, LABEL) - use subtraction: (a-b) = 0 implies a = b - How do we test if the product is zero? 171 UC Davis EEC 170, Winter 2021 / © John Owens Original Diagram: bit-slice ALU A32 B32 a31 b31 a0 m ALU0 b0 m cin 4 M ALU31 co cin co s0 s31 32 S 172 UC Davis EEC 170, Winter 2021 / © John Owens Revised Diagram ▪ LSB and MSB need to do a little extra A 32 B 32 a31 b31 ALU0 cos31 cin a0 b0 ALU0 co s0 cin 4 produce carry-in (for add/ subtract), etc. ? M C/L to Overflow (but not in RISC-V) 32 S 173 UC Davis EEC 170, Winter 2021 / © John Owens slt sign bit Behavioral Representation: Verilog module ALU(A, B, m, S, c, ovf); input [0:31] A, B; input [0:3] m; output [0:31] S; output c, ovf; reg [0:31] S; reg c, ovf; always @(A, B, m) begin case (m) 0: S = A + B; 1: S = A - B; 2: S = ... ... end endmodule 32 32 AB c ALU m ovf S 32 4 174 UC Davis EEC 170, Winter 2021 / © John Owens Conclusion ▪ We can build an ALU to support the RISC-V instruction set - key idea: use multiplexor to select the output we want - we can efficiently perform subtraction using two’s complement - we can replicate a 1-bit ALU to produce a 32-bit ALU ▪ Important points about hardware - all of the gates are always working - the speed of a gate is affected by the number of inputs to the gate - the speed of a circuit is affected by the number of gates in series (on the “critical path” or the “deepest level of logic”) 175 UC Davis EEC 170, Winter 2021 / © John Owens MIPS Opcode Map [from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich] 176 UC Davis EEC 170, Winter 2021 / © John Owens Encodings for ADD, SUB OP OP Big picture of what we’re doing here: understand tie btwn ISA and hardware A B A +B - Cin + - Cout Sum 0 1 2 0 1 OP + - 0 FA 1 177 UC Davis EEC 170, Winter 2021 / © John Owens MIPS Opcode Map [from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich] 178 UC Davis EEC 170, Winter 2021 / © John Owens MIPS (really) Encodings for A OP OP A B A +B - Cin FA Cout Sum 0 1 2 OP 0 1 d bit do? Green? Blue? - +0+ -1 179 UC Davis EEC 170, Winter 2021 / © John Owens ADD 00 100000 (040) ADDU 00 100001 (041) D, SUB, SLT SUB 00 SUBU 00 *? 00 *? 00 SLT 00 SLTU 00 100010 (042) 100011 (043) 101000 (050) 101001 (051) 101010 (052) 101011 (053) e D MIPS Opcode Map [from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich] 180 UC Davis EEC 170, Winter 2021 / © John Owens MIPS arithmetic instruction format Type op ADDI 001000 (010) I-Type: op Rs 31 25 20 15 0 ADDIU 001001 (011) Rt Immed 16 SLTI 001010 (012) SLTIU 001011 (013) ANDI 001100 (014) ORI 001101 (015) XORI 001110 (016) LUI 001111 (017) “Perhaps it is surprising that addiu and sltiu also What does the red bit do? sign-extend their immediates, but they do. The u stands for unsigned, but in reality addiu is often used simply as an add instruction that cannot overflow, and hence we often want to add negative numbers. It's much harder to come up with an excuse for why sltiu sign extends its immediate field.” COD2E p. 230 181 UC Davis EEC 170, Winter 2021 / © John Owens MIPS Opcode Map [from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich] 182 UC Davis EEC 170, Winter 2021 / © John Owens MIPS arithmetic instruction format Type op funct SLL 00 R-type: *00 SRL 00 SRA 00 SLLV 00 * 00 SRLV 00 SRAV 00 31 25 20 15 10 5 0 000000 (000) 000001 (001) Rt Rd shamt funct op Rs 000010 (002) What does the red bit do? 000011 (003) 000100 (004) 000101 (005) 000110 (006) 000111 (007) Green? Blue? * instructions? 183 UC Davis EEC 170, Winter 2021 / © John Owens MIPS Opcode Map [from MIPS R4000 Microprocessor User’s Manual / Joe Heinrich] 184 UC Davis EEC 170, Winter 2021 / © John Owens MIPS arithmetic instruction format R-type: I-Type: Type op 31 25 20 15 5 0 op ADDI ADDIU Rs Rt Rd funct ADD 000000 100000 (040) SLTI SLTIU ANDI ORI XORI LUI 001010 (012) 001011 (013) 001100 (014) 001101 (015) 001110 (016) 001111 (017) op 001000 (010) 001001 (011) Rs Rt Type op funct ADDU 000000 100001 (041) Immed 16 SUB 000000 100010 (042) SUBU 000000 100011 (043) AND 000000 100100 (044) OR 000000 100101 (045) XOR 000000 100110 (046) NOR 000000 100111 (047) 185 UC Davis EEC 170, Winter 2021 / © John Owens