程序代写代做代考 compiler algorithm assembly cache RISC-V mips Java Memory Allocation III CSE 351 Autumn 2016

Memory Allocation III CSE 351 Autumn 2016

Roadmap
1
car *c = malloc(sizeof(car));
c->miles = 100;
c->gals = 17;
float mpg = get_mpg(c);
free(c);
Car c = new Car();
c.setMiles(100);
c.setGals(17);
float mpg =
c.getMPG();

Java:
C:
Assembly language:
Machine code:
0111010000011000
100011010000010000000010
1000100111000010
110000011111101000011111
Computer system:
OS:

Memory & data
Arrays & structs
Integers & floats
RISC V assembly
Procedures & stacks
Executables
Memory & caches
Processor Pipeline
Performance
Parallelism

CMPT 295
L28: Control

1

Agenda
Datapath Review
Control Implementation
Performance Analysis
Pipelined Execution
Pipelined Datapath

2

CMPT 295
L28: Control

Single-Cycle RISC-V RV32I Datapath
3
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]

inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb

CMPT 295
L28: Control
PCSel
Does this instruction change my control flow?
What is the address of my next instruction?
ImmSel
Does this instruction have/use an immediate?
If yes, what type of instruction is it? How is the immediate stored?
RegWEn
Does this instruction write to the destination register rd?
BrUn
Does this instruction do a branch? If so, is it unsigned or signed?

BSel
Does this instruction operate on R[rs2] or an immediate?
ASel
Does this instruction operate on R[rs1] or PC?
ALUSel
What operation should we perform on the selected operands?
MemRW
Do we want to write to memory? Do we want to read from memory?
If we don’t care about the memory output, what should we do?
WBSel
Which value do we want to write back to rd?
If we aren’t writing back (RegWEn = 0) does this value matter?

3

Single-Cycle RISC-V RV32I Datapath
4
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
Control Logic
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb

CMPT 295
L28: Control

4

Agenda
Quick Datapath Review
Control Implementation
Performance Analysis
Pipelined Execution
Pipelined Datapath
5

CMPT 295
L28: Control

Our Control Bits
PCSel
Does this instruction change my control flow?
What is the address of my next instruction?
ImmSel
Does this instruction have/use an immediate?
If yes, what type of instruction is it? How is the immediate stored?
RegWEn
Does this instruction write to the destination register rd?
BrUn
Does this instruction do a branch? If so, is it unsigned or signed?
6

CMPT 295
L28: Control

6

Our Control Bits
BSel
Does this instruction operate on R[rs2] or an immediate?
ASel
Does this instruction operate on R[rs1] or PC?
ALUSel
What operation should we perform on the selected operands?
MemRW
Do we want to write to memory? Do we want to read from memory?
If we don’t care about the memory output, what should we do?
WBSel
Which value do we want to write back to rd?
If we aren’t writing back (RegWEn = 0) does this value matter?
7

CMPT 295
L28: Control

7

Designing Control Signals
Questions you should ask:
Is this control signal the same for every instruction of the same type? (I, R, S, SB, etc.) If so, can you use a combination of opcode/funct3/funct7 to encode the value?
Is this control signal dependent on other controls?
ie. PCSel and BrEq, BrLT
Does the value of this control signal alter the execution of the instruction?
Some cases: yes! (MemRW, for example)
Some cases: no! (ImmSel in R-type inst, for example)
8

CMPT 295
L28: Control

8

Let’s try an example!
Design PCSel yourself!

I recognise this is hard, given you don’t all have logic simulator in front of you 🙁 describe the circuit and its dependencies as best you can !

Might help to split it into three cases:
Regular (non-control) instructions
Branches
Jumps
You may assume PCSel = 0 → PC = PC + 4, and PCSel = 1 → PC = ALUout

9

CMPT 295
L28: Control

9

PCSel: Regular Instructions
Assumption: PCSel = 0 → PC = PC + 4, and PCSel = 1 → PC = ALUout
This isn’t the case in every datapath! How can you check? Look at what the PC-input MUX maps 0 and 1 to. Its controlled by PCSel! (Review lecture 9, 10)
For regular instructions, PCSel is always 0, so our circuit looks like this (pretty boring, huh?)
10

CMPT 295
L28: Control

10

PCSel: Branch Instructions
How do we know if an instruction is a branch?
Intuition: check the green sheet!

In order: opcode, func3 → same fields but in hex
Look at that! they all have the same opcode! We should also check to make sure no other instructions have the same one!
spoiler: they don’t, but you should check!
11

CMPT 295
L28: Control

11

PCSel: Branch Instructions
Let’s describe our desired behaviour in words:
If we are a regular instruction, choose PC+4. If we are a branch instruction, choose ALUout
If we are a regular instruction, set PCSel = 0. If we are a branch instruction, set PCSel = 1
We can identify a branch instruction by doing an equality check on the opcode. Here’s our sub circuit:
12

CMPT 295
L28: Control

12

PCSel: Jumps
Which instructions are our jump instructions?

In order, opcode, func3 (none for jal) → hex
Oh no! These are different, so no easy generalisation here.
Same as with branching, though, we can do an equality check on the opcode using a comparator. We’ll have to do two separate comparisons here.

13

CMPT 295
L28: Control

13

Peer Question
Which of the following circuits is the correct PCSel for jumps?
14

A
B

CMPT 295
L28: Control

14

Putting them all together
We can have a regular instruction OR a branch instruction OR a jump instruction. To combine all our signals together and retain the functionality of each individual piece, we’ll OR them!
Describing your circuit aloud, and keying in on the words you use, might be a helpful design/debugging strategy!
If any of the sub-circuits are true, PCSel will become (1)
Otherwise, it’ll be 0
Because we only have sub-circuits for the branch and jump cases, all normal instructions will have PCSel = 0, while branch, jump will have PCSel = 1 as desired 🙂
15

CMPT 295
L28: Control

15

PCSel: Final Circuit
16

CMPT 295
L28: Control

16

Control Signals: Big picture!
Control signals are how we get the same hardware to behave differently and produce different instructions
For every instruction, all control signals are set to one of their possible values (Not always 0 or 1!) or an indeterminate (*) value indicating the control signal doesn’t affect the instruction’s execution
Each control signal has a sub-circuit based on ~nine bits from the instruction format:
Upper 5 func7 bits (lower 2 are the same for all 295 instructions)
All func3 bits
“2nd” upper opcode bit (others are the same for all 295 instructions)

17

CMPT 295
L28: Control

17

Control Signals: ADD
18
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]

inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb

CMPT 295
L28: Control

18

ADD: PCSel
Should we execute the next instruction (0), or jump control flow to the address given by our ALU output (1)?
We aren’t a branch or jump!
PCSel = 0
19

CMPT 295
L28: Control

19

ADD: ImmSel, RegWEn
How do we want to assemble our immediate?
Wait… we don’t ? have one?
We DON’T CARE about this signal
ImmSel = *

Do we want to write (1) to our destination register rd, or not (0)?
Add should write!
RegWEn = 1
20

CMPT 295
L28: Control

20

ADD: BrUn
When we compare R[rs1] and R[rs2], should the comparison be signed (0), or unsigned (1)?
We aren’t doing a branch !
This value doesn’t matter

BrUn = *
21

CMPT 295
L28: Control

21

ADD: ASel, BSel, ALUSel
Which operands do we want to operate on?
ADD requires rs1 and rs2
ASel = 0 (rs1)
BSel = 0 (rs2)

What operation do we want to perform?
ADD == uh… add ?
ALUSel = “Add”
but wait… that’s not binary, how does that work?
22

CMPT 295
L28: Control

22

ALUSel
For diagramming purposes, we set ALUSel on examples and exam questions to an english value (add, sub, or, etc.)
In your CPU, it’ll have a binary value (and so will all other signals!)
The mapping between english words and binary values depends on how you build your ALU!
These mappings are arbitrary! As long as you’re consistent (all add-based instructions have the same ALUSel) things will work just fine
23

CMPT 295
L28: Control

23

ADD: MemRW, WBSel
Are we reading (0) or writing (1) memory?
Wait, we’re not doing anything with memory. Can this be a “don’t care” value?
NO NO NO NO NO ! 🙁
We never want to “accidentally” write memory! This has to be a “passive read”.
MemRW = 0
What value do we want to write back to rd?
ALU Out!
WBSel = 2
24

CMPT 295
L28: Control

24

ADD: Control Signals
Here are the signals and values we’ve compiled for our ADD instruction:

(green = left 3 cols = control INPUTS)
(orange = right 9 cols = control OUTPUTS)
25
Inst[31:0] BrEq BrLT PCSel ImmSel BrUn ASel BSel ALUSel MemRW RegWEn WBSel
add * * +4 * * Reg Reg Add Read 1 (Y) ALU

CMPT 295
L28: Control

25

RV32I, a nine-bit ISA!
26

Not in
CMPT 295
Instruction type encoded using only 9 bits inst[30],inst[14:12], inst[6:2]

inst[30]
inst[14:12]
inst[6:2]

CMPT 295
L28: Control

Control Logic
addi datapath
27
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb
Inst[31:0] PCSel ImmSel RegWEn BrUn BrLT BrEq BSel ASel ALUSel MemRW WBSel
addi +4 I 1 * * * Imm Reg Add Read ALU

CMPT 295
L28: Control

27

28
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]

inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb

lw datapath
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
lw +4 I 1 * * * Imm Reg Add Read Mem

CMPT 295
L28: Control

28

29
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]

inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb

Br datapath
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
beq +4 B 0 * 0 * Imm PC Add Read *
beq ALU B 0 * 1 * Imm PC Add Read *

CMPT 295
L28: Control

29

30
IMEM

ALU

Imm.
Gen

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

0
1
2

1
0
pc

0
1
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]

inst[31:0]
ImmSel
RegWEn
BrUn
BrEq
BrLT
ASel
BSel
ALUSel
MemRW
WBSel
PCSel
wb

Jal datapath
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
jal ALU J 1 * * * Imm PC Add Read PC+4

CMPT 295
L28: Control

30

31
Inst[31:0] PCSel ImmSel RegWEn BrUn BrEq BrLT BSel ASel ALUSel MemRW WBSel
add +4 * 1 (Y) * * * Reg Reg Add Read ALU
sub +4 * 1 * * * Reg Reg Sub Read ALU
(R-R Op) +4 * 1 * * * Reg Reg (Op) Read ALU
addi +4 I 1 * * * Imm Reg Add Read ALU
lw +4 I 1 * * * Imm Reg Add Read Mem
sw +4 S 0 (N) * * * Imm Reg Add Write *
beq +4 B 0 * 0 * Imm PC Add Read *
beq ALU B 0 * 1 * Imm PC Add Read *
bne ALU B 0 * 0 * Imm PC Add Read *
bne +4 B 0 * 1 * Imm PC Add Read *
blt ALU B 0 0 * 1 Imm PC Add Read *
bltu ALU B 0 1 * 1 Imm PC Add Read *
jalr ALU I 1 * * * Imm Reg Add Read PC+4
jal ALU J 1 * * * Imm PC Add Read PC+4
auipc +4 U 1 * * * Imm PC Add Read ALU

CMPT 295
L28: Control

Control Construction Options
ROM
“Read-Only Memory”
Regular structure (like previous slide’s table)
Can be easily reprogrammed
fix errors
add instructions
Popular when designing control logic manually
Combinatorial Logic
Today, chip designers use logic synthesis tools to convert truth tables to networks of gates
Not easily changeable/re-programmable because requires modifying hardware
But! Likely less expensive, more complex
32

CMPT 295
L28: Control

ROM-based Control

Lecture 12: Control & Performance
33
ROM
Inst[30,14:12,6:2]
BrEq
9
PCSel
ALUSel[3:0]
4
11-bit address (inputs)
15 data bits (outputs)
BrLT
ImmSel[2:0]
3
BrUn
ASel
BSel
MemRW
RegWEn
WBSel[1:0]
2

CMPT 295
L28: Control
ROM Controller Implementation
34
Control Word for add
Control Word for sub
Control Word for or
.
.
.

Address
Decoder
.
.
.
Inst[]
BrEQ
BrLT
Controller output (PCSel, ImmSel, …)
add
sub
or
jal
11

CMPT 295
L28: Control
Agenda
Quick Datapath Review
Control Implementation
Performance Analysis
Pipelined Execution
Pipelined Datapath
35

CMPT 295
L28: Control

Instruction Timing
IF ID EX MEM WB Total
IMEM Reg Read ALU DMEM Reg W
200 ps 100 ps 200 ps 200 ps 100 ps 800 ps

36

CMPT 295
L28: Control

Instruction Timing

Maximum clock frequency
fmax = 1/800ps = 1.25 GHz
Most blocks idle most of the time! ex. “IF” active every 600ps
Instr IF = 200ps ID = 100ps ALU = 200ps MEM=200ps WB = 100ps Total
add X X X X 600ps
beq X X X 500ps
jal X X X X 600ps
lw X X X X X 800ps
sw X X X X 700ps

Instruction 1
Instruction 2
ID
MEM
WB
ALU
IF
ID
MEM
WB
ALU
IF
Clk

CMPT 295
L28: Control

Performance Measures
In our example, CPU executes instructions at 1.25 GHz
1 instruction every 800 ps

Can we improve its performance?
What do we mean with this statement?
Not so obvious:
Quicker response time, so one job finishes faster?
More jobs per unit time (e.g. web server returning pages)?
Longer battery life?

38

CMPT 295
L28: Control

“Iron Law” of Processor Performance
39
Time = Instructions Cycles Time
Program Program * Instruction * Cycle

CMPT 295
L28: Control

Instructions per Program
Determined by
Task
Algorithm, e.g. O(N2) vs O(N)
Programming language
Compiler
Instruction Set Architecture (ISA)
40
Time = Instructions Cycles Time
Program Program * Instruction * Cycle

CMPT 295
L28: Control

(Average) Clock cycles per Instruction, or CPI
Determined by
ISA and processor implementation (or microarchitecture)
E.g. for “our” single-cycle RISC-V design, CPI = 1
Complex instructions (e.g. strcpy), CPI >> 1
True for most CISC languages
Superscalar processors, CPI < 1 (next lecture) 41 Time = Instructions Cycles Time Program Program * Instruction * Cycle CMPT 295 L28: Control Time per Cycle (1/Frequency) Determined by Processor microarchitecture (processor critical path) Technology (e.g. transistor size) Power budget (lower voltages reduce transistor speed) 42 Time = Instructions Cycles Time Program Program * Instruction * Cycle CMPT 295 L28: Control Speed Trade-off Example For some task (e.g. image compression) … 43 Processor A Processor B # Instructions 1 Million 1.5 Million Average CPI 2.5 1 Clock rate f 2.5 GHz 2 GHz Execution time 1 ms 0.75 ms Processor B is faster for this task, despite executing more instructions and having a lower clock rate! Why? Each instruction is less complex! (~2.5 B instructions = 1 A instruction) CMPT 295 L28: Control Poll If we reduce the time a program takes to execute by pipelining our CPU, which factor is likely to shrink? Instructions per program Cycles per instruction Time per cycle 44 Time = Instructions Cycles Time Program Program * Instruction * Cycle CMPT 295 L28: Control 44 Agenda Quick Datapath Review Control Implementation Administrivia Performance Analysis Pipelined Execution Pipelined Datapath 45 CMPT 295 L28: Control Pipeline Analogy: Doing Laundry Minh, Nick, Brandon, and Ali each have one load of clothes to wash, dry, fold, and put away Washer takes 30 minutes Dryer takes 30 minutes “Folder” takes 30 minutes “Stasher” takes 30 minutes to put clothes into drawers 46 M N B A CMPT 295 L28: Control lol wash the MIPS green shirts Sequential Laundry Sequential laundry takes 8 hours for 4 loads 1 load finishes every 2 hours, 47 T a s k O r d e r B N A M 30 Time 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 6 PM 7 8 9 10 11 12 1 2 AM CMPT 295 L28: Control Pipelined Laundry Pipelined laundry takes 3.5 hours for 4 loads! 1 load finishes every half hour (after the first load, which takes 2 hours) 48 T a s k O r d e r B N A M 12 2 AM 6 PM 7 8 9 10 11 1 Time 30 30 30 30 30 30 30 CMPT 295 L28: Control Pipelining Lessons (1/2) Pipelining doesn’t decrease latency of single task; it increases throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup ~ number of pipeline stages Speedup reduced by time to fill and drain the pipeline: 8 hours/3.5 hours which gives 2.3X speedup v. potential 4X in this example 49 6 PM 7 8 9 Time 30 30 30 30 30 30 30 B N A M T a s k O r d e r CMPT 295 L28: Control Pipelining Lessons (2/2) 50 6 PM 7 8 9 Time 30 30 30 30 30 30 30 B N A M T a s k O r d e r Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline? Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipeline stages reduces speedup CMPT 295 L28: Control Agenda Quick Datapath Review Control Implementation Administrivia Performance Analysis Pipelined Execution Pipelined Datapath 51 CMPT 295 L28: Control 52 IMEM ALU Imm. Gen +4 DMEM Branch Comp. Reg[] AddrA AddrB DataA AddrD DataB DataD Addr DataW DataR 1 0 0 1 2 1 0 pc 0 1 inst[11:7] inst[19:15] inst[24:20] inst[31:7] pc+4 alu mem wb alu pc+4 Reg[rs1] pc imm[31:0] Reg[rs2] wb Pipelining RISC-V CPU CMPT 295 L28: Control 52 Quick review: Circuit pipelining! When we calculate cycle time, or critical path, we do so between state elements, inputs, and outputs. Adding registers between circuit components decreases our critical path and increases our frequency 53 CMPT 295 L28: Control 53 Pipelining with RISC-V 54 Phase Pictogram tstep Serial Instruction Fetch 200 ps Reg Read 100 ps ALU 200 ps Memory 200 ps Register Write 100 ps tinstruction 800 ps add t0, t1, t2 or t3, t4, t5 sll t6, t0, t3 tcycle instruction sequence tinstruction tcycle Pipelined 200 ps 200 ps 200 ps 200 ps 200 ps 1000 ps CMPT 295 L28: Control Tc = “Time between completion of instructions” 54 Pipeline Performance Use Tc (“time between completion of instructions”) to measure speedup Speedup due to increased throughput Latency for each instruction does not decrease, in fact it may increase if our stages are uneven! It takes longer for the first instruction to finish, but every instruction after finishes faster! 55 CMPT 295 L28: Control Morgan Kaufmann Publishers Chapter 4 — The Processor Pipelining with RISC-V 56 add t0, t1, t2 or t3, t4, t5 sll t6, t0, t3 tcycle instruction sequence tinstruction Single Cycle Pipelining Timing tstep = 100 … 200 ps tcycle = 200 ps Register access only 100 ps All cycles same length Instruction time, tinstruction = tcycle = 800 ps 1000 ps Clock rate, fs 1/800 ps = 1.25 GHz 1/200 ps = 5 GHz CMPT 295 L28: Control Tc = “Time between completion of instructions” 56 Sequential vs Simultaneous add t0, t1, t2 or t3, t4, t5 sll t6, t0, t3 tcycle = 200 ps instruction sequence tinstruction = 1000 ps sw t0, 4(t3) lw t0, 8(t3) addi t2, t2, 1 What happens sequentially, what happens simultaneously? CMPT 295 L28: Control Tc = “Time between completion of instructions” 57 RISC-V Pipeline add t0, t1, t2 or t3, t4, t5 slt t6, t0, t3 tcycle = 200 ps instruction sequence tinstruction = 1000 ps sw t0, 4(t3) lw t0, 8(t3) addi t2, t2, 1 Resource use by instruction over time Resource use in a particular time slot 58 CMPT 295 L28: Control Single-Cycle RISC-V RV32I Datapath 59 IMEM ALU Imm. Gen +4 DMEM Branch Comp. Reg[] AddrA AddrB DataA AddrD DataB DataD Addr DataW DataR 1 0 0 1 2 1 0 pc 0 1 inst[11:7] inst[19:15] inst[24:20] inst[31:7] pc+4 alu mem wb alu pc+4 Reg[rs1] pc imm[31:0] Reg[rs2] inst[31:0] ImmSel RegWEn BrUn BrEq BrLT ASel BSel ALUSel MemRW WBSel PCSel wb CMPT 295 L28: Control 59 Pipelining RISC-V RV32I Datapath 60 IMEM ALU Imm. Gen +4 DMEM Branch Comp. Reg[] AddrA AddrB DataA AddrD DataB DataD Addr DataW DataR 1 0 0 1 2 1 0 pc 0 1 inst[11:7] inst[19:15] inst[24:20] inst[31:7] pc+4 alu mem wb alu pc+4 Reg[rs1] pc imm[31:0] Reg[rs2] wb Instruction Fetch (F) Instruction Decode/Register Read (D) ALU Execute (X) Memory Access (M) Write Back (W) NOTE: Control signals are also pipelined! CMPT 295 L28: Control 60 Pipelined RISC-V RV32I Datapath 61 IMEM ALU +4 DMEM Branch Comp. Reg[] AddrA AddrB DataA AddrD DataB DataD Addr DataW DataR 1 0 aluX pcF+4 +4 pcD pcF pcX pcM instD instX rs1X rs2X aluM rs2M immX Imm. Recalculate PC+4 in M stage to avoid sending both PC and PC+4 down pipeline instM instW Must pipeline instruction along with data, so control operates correctly in each stage CMPT 295 L28: Control 61 Each stage operates on different instruction 62 IMEM ALU +4 DMEM Branch Comp. Reg[] AddrA AddrB DataA AddrD DataB DataD Addr DataW DataR 1 0 aluX pcF+4 +4 pcD pcF pcX pcM instD instX rs1X rs2X aluM rs2M immX Imm. instM instW add t0, t1, t2 or t3, t4, t5 slt t6, t0, t3 sw t0, 4(t3) lw t0, 8(t3) Pipeline registers separate stages, hold data for each instruction in flight CMPT 295 L28: Control 62 Instruction Level Parallelism (ILP) Pipelining allows us to execute parts of multiple instructions at the same time using the same hardware! This is known as instruction level parallelism Later: Other types of parallelism DLP: same operation on lots of data (SIMD) TLP: executing multiple threads “simultaneously” (OpenMP) 63 CMPT 295 L28: Control Pipelined Control Control signals derived from instruction As in single-cycle implementation Information is stored in pipeline registers for use by later stages At any given point, there are up to 5 different instructions in the datapath! We must keep track of 5 different sets of control bits! 64 ID M WB ALU IF ID M WB ALU IF ID M WB ALU IF CMPT 295 L28: Control Morgan Kaufmann Publishers 10 October, 2017 Chapter 4 — The Processor 64 Question: Assume the stage times shown below. Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version. The latency will be 1.25x slower. The throughput will be 3x faster. F F (A) F T (B) T F (C) T T (D) 1 2 65 Instr Fetch Reg Read ALU Op Mem Access Reg Write 200ps 100 ps 200ps 200ps 100 ps CMPT 295 L28: Control 1/600 4/800 = 1/200 1/200*600/1 = 3x faster throughput 800/600 = 4/3 = 1.33!! Question: Assume the stage times shown below. Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version. The latency will be 1.25x slower. The throughput will be 3x faster. F F (A) F T (B) T F (C) T T (D) 1 2 66 Instr Fetch Reg Read ALU Op Mem Access Reg Write 200ps 100 ps 200ps 200ps 100 ps No mem access throughput: (IF+ID+EX+WB) = 600 → (4*max_stage)/4 = 200 old/new = 600/200 = 3x faster CMPT 295 L28: Control 1/600 4/800 = 1/200 1/200*600/1 = 3x faster throughput 800/600 = 4/3 = 1.33!! F F (A) F T (B) T F (C) T T (D) 1 2 67 Instr Fetch Reg Read ALU Op Mem Access Reg Write 200ps 100 ps 200ps 200ps 100 ps No mem access! Latency: IF+ID+EX+WB = 600 → 4*max_stage = 800 old/new = 600/800 = negative speedup! 800/600 = 1.33x slower! Question: Assume the stage times shown below. Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version. The latency will be 1.25x slower. The throughput will be 3x faster. CMPT 295 L28: Control 1/600 4/800 = 1/200 1/200*600/1 = 3x faster throughput 800/600 = 4/3 = 1.33!! Question: Assume the stage times shown below. Suppose we remove loads and stores from our ISA. Consider going from a single-cycle implementation to a 4-stage pipelined version. The latency will be 1.25x slower. The throughput will be 3x faster. F F (A) F T (B) T F (C) T T (D) 1 2 68 Instr Fetch Reg Read ALU Op Mem Access Reg Write 200ps 100 ps 200ps 200ps 100 ps CMPT 295 L28: Control 1/600 4/800 = 1/200 1/200*600/1 = 3x faster throughput 800/600 = 4/3 = 1.33!! Summary Implementing controller for your datapath Ask yourself the questions on the beginning slides! Work in stages, put everything together at the end! Pipelining improves performance by exploiting Instruction Level Parallelism 5-stage pipeline for RV32I: IF, ID, EX, MEM, WB Executes multiple instructions in parallel Each instruction has the same latency, but there’s better throughput Think: what problems does pipelining introduce? (more on this next lecture) 69 CMPT 295 L28: Control Bonus slides (non testable) 70 CMPT 295 L28: Control 70 Energy per Task 71 Energy = Instructions Energy Program Program * Instruction Energy α Instructions * C V2 Program Program “Capacitance” depends on technology, processor features e.g. # of cores Supply voltage, e.g. 1V Want to reduce capacitance and voltage to reduce energy/task CMPT 295 L28: Control Energy Trade-off Example “Next-generation” processor C (Moore’s Law): -15 % Supply voltage, Vsup: -15 % Energy consumption: 1 - (1-0.85)3 = -39 % Significantly improved energy efficiency thanks to Moore’s Law AND Reduced supply voltage 72 CMPT 295 L28: Control Energy “Iron Law” Performance = Power * Energy Efficiency (Tasks/Second) (Joules/Second) (Tasks/Joule) Energy efficiency (e.g., instructions/Joule) is key metric in all computing devices For power-constrained systems (e.g., 20MW datacenter), need better energy efficiency to get more performance at same power For energy-constrained systems (e.g., 1W phone), need better energy efficiency to prolong battery life CMPT 295 L28: Control End of Scaling In recent years, industry has not been able to reduce supply voltage much, as reducing it further would mean increasing “leakage power” where transistor switches don’t fully turn off (more like dimmer switch than on-off switch) Also, size of transistors and hence capacitance, not shrinking as much as before between transistor generations Power becomes a growing concern – the “power wall” Cost-effective air-cooled chip limit around ~150W 74 CMPT 295 L28: Control