Chapter 4
The Processor (Datapath and Control) Part1 – Single Cycle Datapath
1
The Processor: Datapath & Control
• We will study an implementation of a subset of the core MIPS instructions:
– memory-referenceinstructions:lw,sw
– arithmetic-logicalinstructions:add,sub,and,or,slt – controlflowinstructions:beq,j
2
Overview of the implementation • Generic Implementation:
– usetheprogramcounter(PC)tosupplyinstructionaddress – get the instruction from memory
– usetheinstructiontodecideexactlywhattodo
– readregisters
• All instructions use the ALU after reading the registers
3
• After using ALU:
– Arithmetic-logical instruction: write the ALU result into a register
– Memory-referenceinstruction:accessmemorytoread(load)or write (store).
– Branch instruction: may need to update PC
4
Abstract / Simplified View of the MIPS subset
Two types of functional units:
-elements that operate on data values (combinational) -elements that contain state (sequential)
5
State Elements
• Unclocked vs. Clocked
• Clocks used in synchronous logic
– when should an element that contains state be updated? Falling edge
Clock period Rising edge
cycle time
6
State element (memory element)
• A clock is used to define when signals can be read or written.
• We will assume an edge-triggered clocking methodology, which allows us to read and write in the same clock cycle.
‘write’ occurs at the clock edge. ‘read’ occurs at any time.
7
Latches and Flip-flops
•
Output is equal to the stored value inside the element (don’t need to ask for permission to look at the value)
• • •
Change of state (value) is based on the clock
Latches: whenever the inputs change, and the clock is asserted
Flip-flop: state changes only on a clock edge (edge-triggered methodology)
A clocking methodology defines when signals can be read and written
— wouldn’t want to read a signal at the same time it was being written
“logically true”,
— could mean electrically low
8
D-latch
• Two inputs:
– the data value to be stored (D)
– the clock signal (C) indicating when to read & store D
• Two outputs:
– thevalueoftheinternalstate(Q)andit’scomplement
C
D
Q Q_
D C Q
9
D flip-flop
• Output changes only on the clock edge
D C Q
DDDQDDQQ
C
latch latch CCQ
Q
10
Our Implementation
• •
An edge triggered methodology Typical execution:
– readcontentsofsomestateelements,
– sendvaluesthroughsomecombinationallogic – writeresultstooneormorestateelements
Clock cycle
State
State
element Combinational logic 12
element
11
Register File
• Built using D flip-flops
Read register number 1
Read register number 1
M . . . u
Read data 1
Read register number 2
Write register
Register file
Read data 2
Read register number 2
Write data
Write
M
Read data 1
Register n – 2 Register n – 1
x
Register 0 Register 1
u Read data 2 x
12
Multiplexor
•
Select
A 32
A30
M
Select
B31
32 B 32
C
B30
M u x
u
x .
C30
A31
M
A0
M
B0
u x
C31
.
u x
C0
13
Register File
• Note: we still use the real clock to determine when to write
Register number
n-to-2n . decoder .
D
Register data
D
Write
n–1 n
C
Register 1
0 1
C
Register 0
D
.
C
Register n – 2
D
C
Register n – 1
14
Elements for Instruction Fetch
Instruction address
Instruction memory
Instruction
PC
Add Sum
a. Instruction memory
b. Program counter
c. Adder
15
Register and ALU
32 registers = 2 5
lw $t1, offset_value($t2)
5 numbers
Read data 1
Register
Read register 2
Zero
Data
Write Write
Read data 2
32bits
5
Read register 1
sw $t1, offset_value($t2) 32bits
4 ALU operation
5
Registers register
Data
ALU ALU result
32bits Data
a. Registers
b. ALU
RegWrite
The ALU takes two 32 bit inputs,
and produces a 32 bit result, a 1-bit signal if the result is 0.
4-bit control signal described in Appendix C.
We need to add the base register ($t2) and offset value.
16
Elements for load and store
Address
Read data
Write data
Data memory
a. Data memory unit
b. Sign-extension unit
used for load/store
MemWrite
MemRead
-Sign-extension unit – sign extend 16-bit offset_value to a 32-bit signed value. -Data memory unit – reads from or writes to. It has both read and write control
signals, and address input, and an input for the data to be written into memory.
16
Sign 32 extend
17
Sign-Extend
-Sign-extension unit – sign extend 16-bit offset_value to a 32-bit signed value.
-To increase the size of a data item by repeating the high-order sign bit of the original data item in the high-order bits of the larger, destination data item.
Example:
Suppose that we sign-extend a 4-bit value to 8 bit signed value.
Original: 0010 (2 dec) -> the high-order bit is 0,
so it extends to: 0000 0010 (still 2 dec)
Original: 1110 (-2 dec with 2’s complement)
the high-order bit is 1,
so it extends to: 1111 1110 (still -2 dec in 2’s complement)
18
Partial Datapath for Branch
beq $t1, $t2, offset
Multiply by 4
19
Example
Loop:
sll $t1, $t3, 2
add $t1, $t1, $t6 lw $t0, 0($t1) bne $t0, $s5, Exit addi $s3, $s4, 1
j Loop
Exit:
80000
80004
80008
80012
80016
80020 2 2000 80024
0 0 19 9 2 0
0 9 22 9 0 32 35 9 8 0
8 81919 1
5 8 21
…
20
Partial Datapath for Memory and R-type instructions
add $s2, $s3, $s4
s2
$s4
s3 $s3 s4
$s3+$s4
21
Simple Datapath for the basic instructions
22
Control
• Selecting the operations to perform (ALU, read/write, etc.)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example:
add $8, $17, $18 Instruction Format:
000000 10001 10010 01000 00000100000 op rs rt rd shamt funct
• ALU’s operation based on instruction type and function code
23
Control
• e.g., what should the ALU do with this instruction
• Example: lw $1, 100($2)
35 2 1 100
op rs rt 16 bit offset • ALU control input
0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than 1100 NOR
24
ALU Control
Bnegate Ainvert
Operation
a0 b0
CarryIn ALU0 Less CarryOut
Result0
a1 b1 0
CarryIn ALU1 Less CarryOut
Result1
a2
CarryIn ALU2 Less CarryOut
Result2
b2 0
a31 b31
CarryIn ALU31 Less
Result31
0
Overflow
.
. . CarryIn
.
Set
. Zero .
25
Control
• Must describe hardware to compute 4-bit ALU control input
• Multiple level of decoding
•
– giveninstructiontype 00 = lw, sw
ALUOp
computed from instruction type
01 = beq,
10 = arithmetic
– functioncodeforarithmetic
How the ALU control bits are set depend on the ALUop control bits and The different function codes for the R-type instructions.
26
Control
• Describe it using a truth table (can turn into gates):
FIGURE 4.13 The truth table for the 4 ALU control bits (called Operation). The inputs are the ALUOp and function code field. Only the entries for which the ALU control is asserted are shown. Some don’t-care entries have been added. For example, the ALUOp does not use the encoding 11, so the truth table can contain entries 1X and X1, rather than 10 and 01. Note that when the function field is used, the first 2 bits (F5 and F4) of these instructions are always 10, so they are don’t-care terms and are replaced with XX in the truth table. Copyright © 2009 Elsevier, Inc. All rights reserved.
27
Designing Main Control Unit
• The three instruction classes: – TheR-type
– Branch
– Load-store
Observations:
– Theopfield,alsocalledtheopcode,isalwayscontainedinbits 31:26. We will refer to this as Op[5:0]
– The two registers to be read are always specified by the rs and rt fields, at positions 25:21 and 20:16. (R-type, brand eq, store)
– Thebaseregisterforloadandstoreisalwaysinbitpositions 25:21 (rs).
– The16-bitoffsetforbranchequal,load,andstoreisalwaysin positions 15:0.
– Thedestinationregisterisinoneoftwoplaces.Foraloaditisin bit position 20:16 (rt), while for an R-type it is in bit position 15:11 (rd). We need to add a multiplexor to select which field.
28
FIGURE 4.14 The three instruction classes (R-type, load and store, and branch) use two different instruction formats. The jump instructions use another format, which we will discuss shortly. (a) Instruction format for R-format instructions, which all have an opcode of 0. These instructions have three register operands: rs, rt, and rd. Fields rs and rt are sources, and rd is the destination. The ALU function is in the funct field and is decoded by the ALU control design in the previous section. The R-type instructions that we implement are add, sub, AND, OR, and slt. The shamt field is used only for shifts; we will ignore it in this chapter. (b) Instruction format for load (opcode = 35ten) and store (opcode = 43ten) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For
loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory. (c) Instruction format for branch equal (opcode = 4). The registers rs and rt are the source registers that are compared for equality. The 16-bit address field is sign-extended, shifted, and added to the PC to compute the branch target address.
29
Control
30
Branch – is it a branch instruction? If it is, 1, otherwise 0. ALU Op1 ALU Op0
R-format 1 0 lw 0 0 sw 0 0 beq 0 1
-> ALU operation determined by funct field -> ALU performs addition
-> ALU performs addition
-> ALU performs subtraction
31
The control lines determined by the opcode fields
Memto- Reg Mem Mem
Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0
R-format 1 0 0 1 0 0 0 1 0
lw sw beq
011110000 X1X001000 X 0 X 0 0 0 1 0 1
FIGURE 4.18 The setting of the control lines is completely determined by the opcode fields of the instruction. The first row of the table corresponds to the R-format instructions (add, sub, AND, OR, and slt). For all these instructions, the source register fields are rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst are set. Furthermore, an R-type instruction writes a register (Reg Write = 1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high. The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct field. The second and third rows of this table
give the control signal settings for lw and sw. These ALUSrc and ALUOp fields are set to perform the address calculation. The MemRead and MemWrite are set to perform the memory access. Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register. The branch instruction is similar to an R-format operation, since it sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALU control = 01), which is used to test for equality. Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last
two rows of the table is replaced with X for don’t care. Don’t cares can also be added to RegDst when RegWrite is 0. This type of don’t care must be added by the designer, since it depends on knowledge of how the datapath works.
32
Complete the output by entering 0,1, or X (don’t care for multiplexors)
Input or output
Signal name (opcode)
R-format
(0 = 000000)
lw sw Beg (35=100011) (43=101011) (4=000100)
Input
Op5
0
1 1 0
Output
Op4
Op3
Op2
Op1
Op0 RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
0 0 0 0 0
0 0 0 0 1 0 0 0 1 1 1 0 1 1 0
33
Answer
Input or output
Signal name (opcode)
R-format lw sw Beg
(0 = 000000) (35=100011) (43=101011) (4=000100)
Input
Op5
0 1 1 0
Output
Op4
Op3
Op2
Op1
Op0 RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 0 1 0 X X 0 1 1 0 0 1 X X 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1
34
Simple Datapath with Control Unit
35
Example1: add $t1, $t2, $t3
1. The instruction is fetched, and the PC is incremented.
2. Two registers, $t2 ad $t3, are read from the register file, and
the main control unit compute the setting of the control lines during this step also.
3. The ALU operates on the data read from the register file,
using the function code (bits 5:0, which is the funct field, of the instruction) to generate the ALU function.
4. The result from the ALU is written into the register file using
bits 15:11 of the instruction to select the destination register ($t1).
36
37
Example2: lw $t1, offset($t2)
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. A register ($t2) value is read from the register file.
3. The ALU computes the sum of the value read from the register file and the sign-extended, lower 16 bits of the instruction (offset).
4. The sum from the ALU is used as the address for the data
memory.
5. The data from the memory unit is written into the register file; the register destination is given by bits 20:16 of the instruction ($t1).
38
39
Example3: beq $t1, $t2, offset
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. Two registers, $t1 and $t2, are read from the register file.
3. The ALU performs a subtract on the data values read from the register file. The value of PC+4 is added to the sign-extended, lower 16 bits of the instruction (offset) shifted left by two; the result is the branch target address.
4. The Zero result from the ALU is used to decide which adder
result to store into the PC.
40
41
Truth Table for the control function
FIGURE 4.22 The control function for the simple single-cycle implementation is completely specified by this truth table.
42
Extending to handle jump
• Instruction format for the jump instruction (opcode=2).
000010 address Bit position 31:26 25:0
Field
-The destination address – concatenation of
-the upper 4 bits of the current PC+4 (31:28 of the instruction address)
-the 26 bit immediate field of the jump instruction -the bits 00 two (multiplied by 4)
43
New mux added
44
Clock Speed in Single-cycle Approach
• Long cycle time:
– Cycletimemustbelongenoughfortheslowestinstruction. – Forlw
• PC’s Clock -to-Q +
• InstructionMemoryAccessTime+
• Register File Access Time +
• ALU Delay (address calculation) +
• Data Memory Access Time +
• Register File Setup Time +
• ClockSkew(thedifferenceinabsolutetimebetweenthetimes
when two state elements see a clock edge)
• Cycle time is much longer than needed for all other instructions
– R-type instructions do not require data memory access
– JumpdoesnotrequireALUoperationnordatamemoryaccess
45
Single Cycle Implementation Example • Calculate cycle time assuming negligible delays except:
– memory (200ps),
ALU and adders (100ps), register file access (50ps)
Assuming that the multiplexors, control unit, PC accesses, sign extension unit, and wires have no delay,
Which of the following implementations would be faster and by how much?
1. An implementation in which every instruction operates in 1 clock cycle of a fixed length.
2. An implementation where every instruction executes in 1 clock cycle using a variable-length clock, which for each instruction is only as long as it needs to be. (Such an approach is not terribly practical, but it will allow us to see what is being sacrificed when all the instructions must execute in a single clock of the same length.)
Assume that the following instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps.
46
PC
Read address
Read register 1
ALUSrc
4 ALU operation
4
Add ALU result
Instruction
Write Registers Read
ALU ALU r e s u l t
Address Read
d a t a Mu
I n s t r u c t i o n memory
r e g i s t e r
d a t a 2
M ux
Add
Mu x
Read register 2
Read data 1
MemWrite MemtoReg
Write data
x
RegWrite
Write data
Data memory
16
Sign extend
32
MemRead
Shift left 2
Zero
PCSrc
47
Instruction class
critical functional units used by the instruction class
R-type Load word
Instruction
Register
ALU
Register
Store word Branch Jump
Instruction fetch
Register access
Memory access
fetch
access
access
Instruction fetch
Register access
ALU ALU ALU
Memory access
Register access
Instruction fetch
Register access
Instruction fetch
48
Instructio n class
Instructio n memory
Register read
ALU operation
Data Register memory write
Total 400 ps 600 ps 550 ps 350 ps 200 ps
R-type
200
50
100
0 50
Load
200
50
100
200
50
word
Store word
200 200 200
50 50
100 100
200 0
Branch Jump
49
• Since it is a single cycle, CPI = 1. (1 cycle per instruction).
• CPU execution time = Instruction count x CPI x Clock cycle time
= Instruction count x Clock cycle time Instruction count is the same for both implementations.
1. The longest is 600 ps by lw instruction.
2. How above the average?
CPU clock cycle
= 600 x 25% + 550 x 10% + 400 x 45% + 350 x 15% + 200 x 5% = 447.5 ps
The variable clock implementation has a shorted average clock cycle, it is faster. Performance ratio 600/447.7 = 1.34
50