Memory Allocation III CSE 351 Autumn 2016
Roadmap
1
car *c = malloc(sizeof(car));
c->miles = 100;
c->gals = 17;
float mpg = get_mpg(c);
free(c);
Car c = new Car();
c.setMiles(100);
c.setGals(17);
float mpg =
c.getMPG();
Java:
C:
Assembly language:
Machine code:
0111010000011000
100011010000010000000010
1000100111000010
110000011111101000011111
Computer system:
OS:
Memory & data
Arrays & structs
Integers & floats
RISC V assembly
Procedures & stacks
Executables
Memory & caches
Processor Pipeline
Performance
Parallelism
CMPT 295
L28: Control
1
Agenda
Datapath Review
Control Implementation
Performance Analysis
Pipelined Execution
Pipelined Datapath
2
CMPT 295
L28: Control
Single-Cycle RISC-V RV32I Datapath
3
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
CMPT 295
L28: Control
PCSel
Does this instruction change my control flow?
What is the address of my next instruction?
ImmSel
Does this instruction have/use an immediate?
If yes, what type of instruction is it? How is the immediate stored?
RegWEn
Does this instruction write to the destination register rd?
BrUn
Does this instruction do a branch? If so, is it unsigned or signed?
BSel
Does this instruction operate on R[rs2] or an immediate?
ASel
Does this instruction operate on R[rs1] or PC?
ALUSel
What operation should we perform on the selected operands?
MemRW
Do we want to write to memory? Do we want to read from memory?
If we don’t care about the memory output, what should we do?
WBSel
Which value do we want to write back to rd?
If we aren’t writing back (RegWEn = 0) does this value matter?
3
Single-Cycle RISC-V RV32I Datapath
4
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
Control Logic
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
CMPT 295
L28: Control
4
Agenda
Quick Datapath Review
Control Implementation
Performance Analysis
Pipelined Execution
Pipelined Datapath
5
CMPT 295
L28: Control
Our Control Bits
PCSel
Does this instruction change my control flow?
What is the address of my next instruction?
ImmSel
Does this instruction have/use an immediate?
If yes, what type of instruction is it? How is the immediate stored?
RegWEn
Does this instruction write to the destination register rd?
BrUn
Does this instruction do a branch? If so, is it unsigned or signed?
6
CMPT 295
L28: Control
6
Our Control Bits
BSel
Does this instruction operate on R[rs2] or an immediate?
ASel
Does this instruction operate on R[rs1] or PC?
ALUSel
What operation should we perform on the selected operands?
MemRW
Do we want to write to memory? Do we want to read from memory?
If we don’t care about the memory output, what should we do?
WBSel
Which value do we want to write back to rd?
If we aren’t writing back (RegWEn = 0) does this value matter?
7
CMPT 295
L28: Control
7
Designing Control Signals
Questions you should ask:
Is this control signal the same for every instruction of the same type? (I, R, S, SB, etc.) If so, can you use a combination of opcode/funct3/funct7 to encode the value?
Is this control signal dependent on other controls?
ie. PCSel and BrEq, BrLT
Does the value of this control signal alter the execution of the instruction?
Some cases: yes! (MemRW, for example)
Some cases: no! (ImmSel in R-type inst, for example)
8
CMPT 295
L28: Control
8
Let’s try an example!
Design PCSel yourself!
I recognise this is hard, given you don’t all have logic simulator in front of you 🙁 describe the circuit and its dependencies as best you can !
Might help to split it into three cases:
Regular (non-control) instructions
Branches
Jumps
You may assume PCSel = 0 → PC = PC + 4, and PCSel = 1 → PC = ALUout
9
CMPT 295
L28: Control
9
PCSel: Regular Instructions
Assumption: PCSel = 0 → PC = PC + 4, and PCSel = 1 → PC = ALUout
This isn’t the case in every datapath! How can you check? Look at what the PC-input MUX maps 0 and 1 to. Its controlled by PCSel! (Review lecture 9, 10)
For regular instructions, PCSel is always 0, so our circuit looks like this (pretty boring, huh?)
10
CMPT 295
L28: Control
10
PCSel: Branch Instructions
How do we know if an instruction is a branch?
Intuition: check the green sheet!
In order: opcode, func3 → same fields but in hex
Look at that! they all have the same opcode! We should also check to make sure no other instructions have the same one!
spoiler: they don’t, but you should check!
11
CMPT 295
L28: Control
11
PCSel: Branch Instructions
Let’s describe our desired behaviour in words:
If we are a regular instruction, choose PC+4. If we are a branch instruction, choose ALUout
If we are a regular instruction, set PCSel = 0. If we are a branch instruction, set PCSel = 1
We can identify a branch instruction by doing an equality check on the opcode. Here’s our sub circuit:
12
CMPT 295
L28: Control
12
PCSel: Jumps
Which instructions are our jump instructions?
In order, opcode, func3 (none for jal) → hex
Oh no! These are different, so no easy generalisation here.
Same as with branching, though, we can do an equality check on the opcode using a comparator. We’ll have to do two separate comparisons here.
13
CMPT 295
L28: Control
13
Peer Question
Which of the following circuits is the correct PCSel for jumps?
14
A
B
CMPT 295
L28: Control
14
Putting them all together
We can have a regular instruction OR a branch instruction OR a jump instruction. To combine all our signals together and retain the functionality of each individual piece, we’ll OR them!
Describing your circuit aloud, and keying in on the words you use, might be a helpful design/debugging strategy!
If any of the sub-circuits are true, PCSel will become (1)
Otherwise, it’ll be 0
Because we only have sub-circuits for the branch and jump cases, all normal instructions will have PCSel = 0, while branch, jump will have PCSel = 1 as desired 🙂
15
CMPT 295
L28: Control
15
PCSel: Final Circuit
16
CMPT 295
L28: Control
16
Control Signals: Big picture!
Control signals are how we get the same hardware to behave differently and produce different instructions
For every instruction, all control signals are set to one of their possible values (Not always 0 or 1!) or an indeterminate (*) value indicating the control signal doesn’t affect the instruction’s execution
Each control signal has a sub-circuit based on ~nine bits from the instruction format:
Upper 5 func7 bits (lower 2 are the same for all 295 instructions)
All func3 bits
“2nd” upper opcode bit (others are the same for all 295 instructions)
17
CMPT 295
L28: Control
17
Control Signals: ADD
18
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
CMPT 295
L28: Control
18
ADD: PCSel
Should we execute the next instruction (0), or jump control flow to the address given by our ALU output (1)?
We aren’t a branch or jump!
PCSel = 0
19
CMPT 295
L28: Control
19
ADD: ImmSel, RegWEn
How do we want to assemble our immediate?
Wait… we don’t ? have one?
We DON’T CARE about this signal
ImmSel = *
Do we want to write (1) to our destination register rd, or not (0)?
Add should write!
RegWEn = 1
20
CMPT 295
L28: Control
20
ADD: BrUn
When we compare R[rs1] and R[rs2], should the comparison be signed (0), or unsigned (1)?
We aren’t doing a branch !
This value doesn’t matter
BrUn = *
21
CMPT 295
L28: Control
21
ADD: ASel, BSel, ALUSel
Which operands do we want to operate on?
ADD requires rs1 and rs2
ASel = 0 (rs1)
BSel = 0 (rs2)
What operation do we want to perform?
ADD == uh… add ?
ALUSel = “Add”
but wait… that’s not binary, how does that work?
22
CMPT 295
L28: Control
22
ALUSel
For diagramming purposes, we set ALUSel on examples and exam questions to an english value (add, sub, or, etc.)
In your CPU, it’ll have a binary value (and so will all other signals!)
The mapping between english words and binary values depends on how you build your ALU!
These mappings are arbitrary! As long as you’re consistent (all add-based instructions have the same ALUSel) things will work just fine
23
CMPT 295
L28: Control
23
ADD: MemRW, WBSel
Are we reading (0) or writing (1) memory?
Wait, we’re not doing anything with memory. Can this be a “don’t care” value?
NO NO NO NO NO ! 🙁
We never want to “accidentally” write memory! This has to be a “passive read”.
MemRW = 0
What value do we want to write back to rd?
ALU Out!
WBSel = 2
24
CMPT 295
L28: Control
24
ADD: Control Signals
Here are the signals and values we’ve compiled for our ADD instruction:
(green = left 3 cols = control INPUTS)
(orange = right 9 cols = control OUTPUTS)
25
Inst[31:0] BrEq BrLT PCSel ImmSel BrUn ASel BSel ALUSel MemRW RegWEn WBSel
add * * +4 * * Reg Reg Add Read 1 (Y) ALU
CMPT 295
L28: Control
25
RV32I, a nine-bit ISA!
26
Not in
CMPT 295
Instruction type encoded using only 9 bits inst[30],inst[14:12], inst[6:2]
inst[30]
inst[14:12]
inst[6:2]
CMPT 295
L28: Control
Control Logic
addi datapath
27
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
Inst[31:0] PCSel ImmSel RegWEn BrUn BrLT BrEq BSel ASel ALUSel MemRW WBSel
addi +4 I 1 * * * Imm Reg Add Read ALU
CMPT 295
L28: Control
27
28
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
lw datapath
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
lw +4 I 1 * * * Imm Reg Add Read Mem
CMPT 295
L28: Control
28
29
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
Br datapath
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
beq +4 B 0 * 0 * Imm PC Add Read *
beq ALU B 0 * 1 * Imm PC Add Read *
CMPT 295
L28: Control
29
30
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
Jal datapath
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
jal ALU J 1 * * * Imm PC Add Read PC+4
CMPT 295
L28: Control
30
31
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
add +4 * 1 (Y) * * * Reg Reg Add Read ALU
sub +4 * 1 * * * Reg Reg Sub Read ALU
(R-R Op) +4 * 1 * * * Reg Reg (Op) Read ALU
addi +4 I 1 * * * Imm Reg Add Read ALU
lw +4 I 1 * * * Imm Reg Add Read Mem
sw +4 S 0 (N) * * * Imm Reg Add Write *
beq +4 B 0 * 0 * Imm PC Add Read *
beq ALU B 0 * 1 * Imm PC Add Read *
bne ALU B 0 * 0 * Imm PC Add Read *
bne +4 B 0 * 1 * Imm PC Add Read *
blt ALU B 0 0 * 1 Imm PC Add Read *
bltu ALU B 0 1 * 1 Imm PC Add Read *
jalr ALU I 1 * * * Imm Reg Add Read PC+4
jal ALU J 1 * * * Imm PC Add Read PC+4
auipc +4 U 1 * * * Imm PC Add Read ALU
CMPT 295
L28: Control
Control Construction Options
ROM
“Read-Only Memory”
Regular structure (like previous slide’s table)
Can be easily reprogrammed
fix errors
add instructions
Popular when designing control logic manually
Combinatorial Logic
Today, chip designers use logic synthesis tools to convert truth tables to networks of gates
Not easily changeable/re-programmable because requires modifying hardware
But! Likely less expensive, more complex
32
CMPT 295
L28: Control
ROM-based Control
Lecture 12: Control & Performance
33
ROM
Inst[30,14:12,6:2]
BrEq
9
PCSel
ALUSel[3:0]
4
11-bit address (inputs)
15 data bits (outputs)
BrLT
ImmSel[2:0]
3
BrUn
ASel
BSel
MemRW
RegWEn
WBSel[1:0]
2
CMPT 295
L28: Control
ROM Controller Implementation
34
Control Word for add
Control Word for sub
Control Word for or
.
.
.
Address
Decoder
.
.
.
Inst[]
BrEQ
BrLT
Controller output (PCSel, ImmSel, …)
add
sub
or
jal
11
CMPT 295
L28: Control
Agenda
Quick Datapath Review
Control Implementation
Performance Analysis
Pipelined Execution
Pipelined Datapath
35
CMPT 295
L28: Control
Instruction Timing
IF ID EX MEM WB Total
IMEM Reg Read ALU DMEM Reg W
200 ps 100 ps 200 ps 200 ps 100 ps 800 ps
36
CMPT 295
L28: Control
Instruction Timing
Maximum clock frequency
fmax = 1/800ps = 1.25 GHz
Most blocks idle most of the time! ex. “IF” active every 600ps
Instr IF = 200ps ID = 100ps ALU = 200ps MEM=200ps WB = 100ps Total
add X X X X 600ps
beq X X X 500ps
jal X X X X 600ps
lw X X X X X 800ps
sw X X X X 700ps
Instruction 1
Instruction 2
ID
MEM
WB
ALU
IF
ID
MEM
WB
ALU
IF
Clk
CMPT 295
L28: Control
Performance Measures
In our example, CPU executes instructions at 1.25 GHz
1 instruction every 800 ps
Can we improve its performance?
What do we mean with this statement?
Not so obvious:
Quicker response time, so one job finishes faster?
More jobs per unit time (e.g. web server returning pages)?
Longer battery life?
38
CMPT 295
L28: Control
“Iron Law” of Processor Performance
39
Time = Instructions Cycles Time
Program Program * Instruction * Cycle
CMPT 295
L28: Control
Instructions per Program
Determined by
Task
Algorithm, e.g. O(N2) vs O(N)
Programming language
Compiler
Instruction Set Architecture (ISA)
40
Time = Instructions Cycles Time
Program Program * Instruction * Cycle
CMPT 295
L28: Control
(Average) Clock cycles per Instruction, or CPI
Determined by
ISA and processor implementation (or microarchitecture)
E.g. for “our” single-cycle RISC-V design, CPI = 1
Complex instructions (e.g. strcpy), CPI >> 1
True for most CISC languages
Superscalar processors, CPI < 1 (next lecture)
41
Time = Instructions Cycles Time
Program Program * Instruction * Cycle
CMPT 295
L28: Control
Time per Cycle (1/Frequency)
Determined by
Processor microarchitecture (processor critical path)
Technology (e.g. transistor size)
Power budget (lower voltages reduce transistor speed)
42
Time = Instructions Cycles Time
Program Program * Instruction * Cycle
CMPT 295
L28: Control
Speed Trade-off Example
For some task (e.g. image compression) …
43
Processor A Processor B
# Instructions 1 Million 1.5 Million
Average CPI 2.5 1
Clock rate f 2.5 GHz 2 GHz
Execution time 1 ms 0.75 ms
Processor B is faster for this task, despite executing more instructions and having a lower clock rate! Why? Each instruction is less complex! (~2.5 B instructions = 1 A instruction)
CMPT 295
L28: Control
Poll
If we reduce the time a program takes to execute by pipelining our CPU, which factor is likely to shrink?
Instructions per program
Cycles per instruction
Time per cycle
44
Time = Instructions Cycles Time
Program Program * Instruction * Cycle
CMPT 295
L28: Control
44
Agenda
Quick Datapath Review
Control Implementation
Administrivia
Performance Analysis
Pipelined Execution
Pipelined Datapath
45
CMPT 295
L28: Control
Pipeline Analogy: Doing Laundry
Minh, Nick, Brandon, and Ali each have one load of clothes to wash, dry, fold, and put away
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers
46
M
N
B
A
CMPT 295
L28: Control
lol wash the MIPS green shirts
Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
1 load finishes every 2 hours,
47
T
a
s
k
O
r
d
e
r
B
N
A
M
30
Time
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
6 PM
7
8
9
10
11
12
1
2 AM
CMPT 295
L28: Control
Pipelined Laundry
Pipelined laundry takes 3.5 hours for 4 loads!
1 load finishes every half hour (after the first load, which takes 2 hours)
48
T
a
s
k
O
r
d
e
r
B
N
A
M
12
2 AM
6 PM
7
8
9
10
11
1
Time
30
30
30
30
30
30
30
CMPT 295
L28: Control
Pipelining Lessons (1/2)
Pipelining doesn’t decrease latency of single task; it increases throughput of entire workload
Multiple tasks operating simultaneously using different resources
Potential speedup ~ number of pipeline stages
Speedup reduced by time to fill and drain the pipeline:
8 hours/3.5 hours which gives 2.3X speedup v. potential 4X in this example
49
6 PM
7
8
9
Time
30
30
30
30
30
30
30
B
N
A
M
T
a
s
k
O
r
d
e
r
CMPT 295
L28: Control
Pipelining Lessons (2/2)
50
6 PM
7
8
9
Time
30
30
30
30
30
30
30
B
N
A
M
T
a
s
k
O
r
d
e
r
Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
Pipeline rate limited by slowest pipeline stage
Unbalanced lengths of pipeline stages reduces speedup
CMPT 295
L28: Control
Agenda
Quick Datapath Review
Control Implementation
Administrivia
Performance Analysis
Pipelined Execution
Pipelined Datapath
51
CMPT 295
L28: Control
52
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
wb
Pipelining RISC-V CPU
CMPT 295
L28: Control
52
Quick review: Circuit pipelining!
When we calculate cycle time, or critical path, we do so between state elements, inputs, and outputs.
Adding registers between circuit components decreases our critical path and increases our frequency
53
CMPT 295
L28: Control
53
Pipelining with RISC-V
54
Phase Pictogram tstep Serial
Instruction Fetch 200 ps
Reg Read 100 ps
ALU 200 ps
Memory 200 ps
Register Write 100 ps
tinstruction 800 ps
add t0, t1, t2
or t3, t4, t5
sll t6, t0, t3
tcycle
instruction sequence
tinstruction
tcycle Pipelined
200 ps
200 ps
200 ps
200 ps
200 ps
1000 ps
CMPT 295
L28: Control
Tc = “Time between completion of instructions”
54
Pipeline Performance
Use Tc (“time between completion of instructions”) to measure speedup
Speedup due to increased throughput
Latency for each instruction does not decrease, in fact it may increase if our stages are uneven!
It takes longer for the first instruction to finish, but every instruction after finishes faster!
55
CMPT 295
L28: Control
Morgan Kaufmann Publishers
Chapter 4 — The Processor
Pipelining with RISC-V
56
add t0, t1, t2
or t3, t4, t5
sll t6, t0, t3
tcycle
instruction sequence
tinstruction
Single Cycle Pipelining
Timing tstep = 100 … 200 ps tcycle = 200 ps
Register access only 100 ps All cycles same length
Instruction time, tinstruction = tcycle = 800 ps 1000 ps
Clock rate, fs 1/800 ps = 1.25 GHz 1/200 ps = 5 GHz
CMPT 295
L28: Control
Tc = “Time between completion of instructions”
56
Sequential vs Simultaneous
add t0, t1, t2
or t3, t4, t5
sll t6, t0, t3
tcycle
= 200 ps
instruction sequence
tinstruction = 1000 ps
sw t0, 4(t3)
lw t0, 8(t3)
addi t2, t2, 1
What happens sequentially, what happens simultaneously?
CMPT 295
L28: Control
Tc = “Time between completion of instructions”
57
RISC-V Pipeline
add t0, t1, t2
or t3, t4, t5
slt t6, t0, t3
tcycle
= 200 ps
instruction sequence
tinstruction = 1000 ps
sw t0, 4(t3)
lw t0, 8(t3)
addi t2, t2, 1
Resource use by instruction over time
Resource use in a particular time slot
58
CMPT 295
L28: Control
Single-Cycle RISC-V RV32I Datapath
59
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
CMPT 295
L28: Control
59
Pipelining RISC-V RV32I Datapath
60
IMEM
ALU
Imm.
Gen
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
0
1
2
1
0
pc
0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
wb
Instruction Fetch
(F)
Instruction Decode/Register Read
(D)
ALU Execute
(X)
Memory Access
(M)
Write Back
(W)
NOTE: Control signals are also pipelined!
CMPT 295
L28: Control
60
Pipelined RISC-V RV32I Datapath
61
IMEM
ALU
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
aluX
pcF+4
+4
pcD
pcF
pcX
pcM
instD
instX
rs1X
rs2X
aluM
rs2M
immX
Imm.
Recalculate PC+4 in M stage to avoid sending both PC and PC+4 down pipeline
instM
instW
Must pipeline instruction along with data, so control operates correctly in each stage
CMPT 295
L28: Control
61
Each stage operates on different instruction
62
IMEM
ALU
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
aluX
pcF+4
+4
pcD
pcF
pcX
pcM
instD
instX
rs1X
rs2X
aluM
rs2M
immX
Imm.
instM
instW
add t0, t1, t2
or t3, t4, t5
slt t6, t0, t3
sw t0, 4(t3)
lw t0, 8(t3)
Pipeline registers separate stages, hold data for each instruction in flight
CMPT 295
L28: Control
62
Instruction Level Parallelism (ILP)
Pipelining allows us to execute parts of multiple instructions at the same time using the same hardware!
This is known as instruction level parallelism
Later: Other types of parallelism
DLP: same operation on lots of data (SIMD)
TLP: executing multiple threads “simultaneously” (OpenMP)
63
CMPT 295
L28: Control
Pipelined Control
Control signals derived from instruction
As in single-cycle implementation
Information is stored in pipeline registers for use by later stages
At any given point, there are up to 5 different instructions in the datapath! We must keep track of 5 different sets of control bits!
64
ID
M
WB
ALU
IF
ID
M
WB
ALU
IF
ID
M
WB
ALU
IF
CMPT 295
L28: Control
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
64
Question: Assume the stage times shown below.
Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version.
The latency will be 1.25x slower.
The throughput will be 3x faster.
F F
(A)
F T
(B)
T F
(C)
T T
(D)
1 2
65
Instr Fetch Reg Read ALU Op Mem Access Reg Write
200ps 100 ps 200ps 200ps 100 ps
CMPT 295
L28: Control
1/600
4/800 = 1/200
1/200*600/1 = 3x faster throughput
800/600 = 4/3 = 1.33!!
Question: Assume the stage times shown below.
Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version.
The latency will be 1.25x slower.
The throughput will be 3x faster.
F F
(A)
F T
(B)
T F
(C)
T T
(D)
1 2
66
Instr Fetch Reg Read ALU Op Mem Access Reg Write
200ps 100 ps 200ps 200ps 100 ps
No mem access
throughput:
(IF+ID+EX+WB) = 600 →
(4*max_stage)/4 = 200
old/new = 600/200 = 3x faster
CMPT 295
L28: Control
1/600
4/800 = 1/200
1/200*600/1 = 3x faster throughput
800/600 = 4/3 = 1.33!!
F F
(A)
F T
(B)
T F
(C)
T T
(D)
1 2
67
Instr Fetch Reg Read ALU Op Mem Access Reg Write
200ps 100 ps 200ps 200ps 100 ps
No mem access! Latency:
IF+ID+EX+WB = 600 →
4*max_stage = 800
old/new = 600/800 = negative speedup! 800/600 = 1.33x slower!
Question: Assume the stage times shown below.
Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version.
The latency will be 1.25x slower.
The throughput will be 3x faster.
CMPT 295
L28: Control
1/600
4/800 = 1/200
1/200*600/1 = 3x faster throughput
800/600 = 4/3 = 1.33!!
Question: Assume the stage times shown below.
Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version.
The latency will be 1.25x slower.
The throughput will be 3x faster.
F F
(A)
F T
(B)
T F
(C)
T T
(D)
1 2
68
Instr Fetch Reg Read ALU Op Mem Access Reg Write
200ps 100 ps 200ps 200ps 100 ps
CMPT 295
L28: Control
1/600
4/800 = 1/200
1/200*600/1 = 3x faster throughput
800/600 = 4/3 = 1.33!!
Summary
Implementing controller for your datapath
Ask yourself the questions on the beginning slides!
Work in stages, put everything together at the end!
Pipelining improves performance by exploiting Instruction Level Parallelism
5-stage pipeline for RV32I: IF, ID, EX, MEM, WB
Executes multiple instructions in parallel
Each instruction has the same latency, but there’s better throughput
Think: what problems does pipelining introduce? (more on this next lecture)
69
CMPT 295
L28: Control
Bonus slides (non testable)
70
CMPT 295
L28: Control
70
Energy per Task
71
Energy = Instructions Energy
Program Program * Instruction
Energy α Instructions * C V2
Program Program
“Capacitance” depends on
technology,
processor features
e.g. # of cores
Supply voltage,
e.g. 1V
Want to reduce capacitance and voltage to reduce energy/task
CMPT 295
L28: Control
Energy Trade-off Example
“Next-generation” processor
C (Moore’s Law): -15 %
Supply voltage, Vsup: -15 %
Energy consumption: 1 - (1-0.85)3 = -39 %
Significantly improved energy efficiency thanks to
Moore’s Law AND
Reduced supply voltage
72
CMPT 295
L28: Control
Energy “Iron Law”
Performance = Power * Energy Efficiency
(Tasks/Second) (Joules/Second) (Tasks/Joule)
Energy efficiency (e.g., instructions/Joule) is key metric in all computing devices
For power-constrained systems (e.g., 20MW datacenter), need better energy efficiency to get more performance at same power
For energy-constrained systems (e.g., 1W phone), need better energy efficiency to prolong battery life
CMPT 295
L28: Control
End of Scaling
In recent years, industry has not been able to reduce supply voltage much, as reducing it further would mean increasing “leakage power” where transistor switches don’t fully turn off (more like dimmer switch than on-off switch)
Also, size of transistors and hence capacitance, not shrinking as much as before between transistor generations
Power becomes a growing concern – the “power wall”
Cost-effective air-cooled chip limit around ~150W
74
CMPT 295
L28: Control