PowerPoint 演示文稿
CO101
Principle of Computer
Organization
Lecture 14: Pipelined MIPS
Processor 4
Liang Yanyan
澳門科技大學
Macau of University of Science and Technology
Control Hazards
• When the flow of instruction addresses is not sequential
(i.e., PC = PC + 4); incurred by change of flow
instructions
• Unconditional branches (j, jal, jr)
• Conditional branches (beq, bne)
• Exceptions
• Possible approaches
• Stall (impacts CPI)
• Move decision point as early in the pipeline as possible, thereby
reducing the number of stall cycles
• Delay decision (requires compiler support)
• Predict and hope for the best !
• Control hazards occur less frequently than data hazards,
but there is nothing as effective against control hazards
as forwarding is for data hazards.
2
Control Hazards 1: Jumps Incur One Stall
• Jumps not decoded until ID, so one flush is needed
• To flush, set IF.Flush to zero the instruction field of the IF/ID
pipeline register (turning it into a nop)
• Fortunately, jumps are very infrequent – only 3% of the
SPECint instruction mix
3
Datapath Branch and Jump Hardware
4
Supporting ID Stage Jumps
5
Control Hazard 2: Branch Instr
• Branch instruction.
• Branch is calculated at EX stage.
• Branch decision is not known until EX is finished, i.e. next
instruction cannot be loaded until EX is finished → control
hazard.
6
DM Reg Reg IM beq $2, $3, Label
? ? ? IM
Clock cycle
1 2 3 4 5 6 7 8
At compile time, insert NOP to stall the pipeline
• Insert two NOPs below BEQ to stall for two cycles.
• But stall often reduces performance of the pipeline.
Loop: addi $t1, $t1, 1
bne $t1, $t2, Loop
• Assume $t1 = 0 and $t2 = 1000 initially. We are expecting the
loop can finish within 2000 cycles, but with branch stalling, it
takes 4000 cycles.
7
DM Reg Reg IM beq $2, $3, Label
? ? ? DM Reg Reg IM
Clock cycle
1 2 3 4 5 6 7 8
Calculate the branch condition earlier
• Can we calculate the branch condition earlier?
• Yes. Add extra functional unit to calculate the branch condition at
ID stage → The branch condition will be available at cycle 3.
• If misprediction, only need to flush one instruction.
8
DM Reg Reg IM beq $2, $3, Label
next instruction 1
Label: . . .
IM
Clock cycle
1 2 3 4 5 6 7
DM Reg Reg IM
flush
Single cycle processor: partitioning
9
Branch detection at ID stage
10
check if two registers
are equal.
Used to insert NOP instruction
into pipeline when there is a
branch instruction.
P
C
Addr
Inst.
Mem.
4
Read Reg #1
Reg #1
Read Reg #2
Write Reg
Reg #2
Write Data
Reg. File
Addr
Read Data
Write Data
Data Mem ext.
0
1
0
1
×4
ALU
+
+
Inst[25-21]
Inst[20-16]
IF/ID
ID/EX EX/MEM MEM/WB
Control
IF.Flush
0
1
0
1 Inst[15-11]
Inst[20-16]
0
1 0
=
Inst[31-26]
0
1
Two “Types” of Stalls
• Nop instruction (or bubble) inserted between two
instructions in the pipeline (as done for load-use
situations)
• Keep the instructions earlier in the pipeline (later in the code)
from progressing down the pipeline for a cycle (“bounce” them in
place with write control signals)
• Insert nop by zeroing control bits in the pipeline register at the
appropriate stage
• Let the instructions later in the pipeline (earlier in the code)
progress normally down the pipeline
• Flushes (or instruction squashing) were an instruction in
the pipeline is replaced with a nop instruction (as done
for instructions located sequentially after j instructions)
• Zero the control bits for the instruction to be flushed
11
Delayed Branches
• If the branch hardware has been moved to the ID stage,
then we can eliminate all branch stalls with delayed
branches which are defined as always executing the
next sequential instruction after the branch instruction –
the branch takes effect after that next instruction.
• MIPS compiler moves an instruction to immediately after
the branch that is not affected by the branch (a safe
instruction) thereby hiding the branch delay.
• With deeper pipelines, the branch delay grows requiring
more than one delay slot
• Delayed branches have lost popularity compared to more
expensive but more flexible (dynamic) hardware branch
prediction
• Growth in available transistors has made hardware branch
prediction relatively cheaper
12
Scheduling Branch Delay Slots
• A is the best choice, fills delay slot and reduces IC
• In B and C, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
13
add $1,$2,$3
if $2=0 then
delay slot
A. From before branch B. From branch target C. From fall through
add $1,$2,$3
if $1=0 then
delay slot
add $1,$2,$3
if $1=0 then
delay slot
sub $4,$5,$6
sub $4,$5,$6
becomes becomes becomes
if $2=0 then
add $1,$2,$3
add $1,$2,$3
if $1=0 then
sub $4,$5,$6
add $1,$2,$3
if $1=0 then
sub $4,$5,$6
At run time, using branch prediction
• Another approach is to guess whether or not the branch is taken.
• In terms of hardware, it’s easier to assume the branch is not taken.
• This way we just increment the PC and continue to execution the next instruction,
as for normal instructions.
• If we’re correct, then there is no problem and the pipeline keeps
going at full speed.
14
DM Reg Reg IM beq $2, $3, Label
next instruction 1
next instruction 2
DM Reg Reg IM
Clock cycle
1 2 3 4 5 6 7
DM Reg Reg IM
Branch Prediction
• If our guess is wrong, that means we have already started executing
two incorrect instructions. As a result, we need to discard, or flush,
those instructions from the pipeline and begin executing the right
one from the branch target address, Label.
• If misprediction, need to flush one instructions.
15
DM Reg Reg IM beq $2, $3, Label
next instruction 1
Label: . . .
Reg IM
Clock cycle
1 2 3 4 5 6 7 8
DM Reg Reg IM
flush
Static Branch Prediction
• Resolve branch hazards by assuming a given outcome and
proceeding without waiting to see the actual branch outcome
• Predict not taken – always predict branches will not be taken,
continue to fetch from the sequential instruction stream, only when
branch is taken does the pipeline stall
• If taken, flush instructions after the branch in IF stage if branch logic in
ID – one stall
• ensure that those flushed instructions haven’t changed the machine
state – automatic in the MIPS pipeline since machine state changing
operations are at the tail end of the pipeline (MemWrite (in MEM) or
RegWrite (in WB))
• restart the pipeline at the branch destination
• Predict taken – predict branches will always be taken
• Predict taken always incurs one stall cycle (if branch destination
hardware has been moved to the ID stage)
• Is there a way to “cache” the address of the branch target instruction ??
16
Example
• Assumption: The prediction strategy is that branch not
taken.
17
36# sub $10, $4, $8
40# beq $1, $3, TARGET
44# and $12, $2, $5
48# or $13, $2, $6
52# add $14, $4, $2
56# slt $15, $6, $7
…
72# TARGET: lw $4, 50($7)
Guess branch not taken: load instruction “and”
18
0
1 0
After the branch condition is calculated, we find that the prediction
is wrong → Select the target address, choose NOP instruction.
Fetch target instruction “lw”, insert “nop”
19
0
1 0
Performance gain of branch prediction
• In general, branch prediction is worth
• Example: for (i = 0; i < 100; i++) if we always predict (i < 100) is true → 99% correct. • Mispredicting a branch means that one clock cycles are wasted. • A longer pipeline may require more instructions to be flushed for a misprediction, resulting in more wasted time and lower performance. • If our predictions are just occasionally correct, e.g. 10% accuracy, then it is better to stall and waste more than one cycles for every branch. • We must be careful that instructions do not modify registers or memory before they get flushed. 20 Branching Structures • Predict not taken works well for “top of the loop” branching structures • But such loops have jumps at the bottom of the loop to return to the top of the loop – and incur the jump stall overhead • Predict not taken doesn’t work well for “bottom of the loop” branching structures 21 Static Branch Prediction, con’t • As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance. With more hardware, it is possible to try to predict branch behavior dynamically during program execution. • Dynamic branch prediction – predict branches at runtime using run-time information. 22 Dynamic prediction (1-bit prediction scheme) • Predict branch behavior during program execution. 23 Dynamic prediction (2-bit prediction scheme) • Predict branch behavior during program execution. 24 Dynamic prediction • A branch prediction buffer (aka branch history table (BHT)) in the IF stage addressed by the lower bits of the PC, contains bit(s) passed to the ID stage through the IF/ID pipeline register that tells whether the branch was taken the last time it was execute. 25 • The BHT predicts when a branch is taken, but does not tell where its taken to! • If the prediction is correct, stalls can be avoided no matter which direction they go. Example of dynamic branch prediction • Assume that we start off in state 3. A program contains 7 beq instructions. • Prediction accuracy: 4/7=57% 26 Instruction Actual execution Current FSM state Prediction Correct? Next FSM state beq … Taken 3 Taken Yes 3 beq … Not taken 3 Taken No 2 beq … Taken 2 Taken Yes 3 beq … Taken 3 Taken Yes 3 beq … Not Taken 3 Taken No 2 beq … Not Taken 2 Taken No 1 beq … Not Taken 1 Not taken Yes 0 1-bit Prediction Accuracy 27 2-bit Prediction Accuracy • A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed. 28 2-bit Prediction Accuracy • A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed. 29 Summary • Control hazard: attempts to make a decision before branch condition is evaluated. • E.g. branch instructions. • Solution: insert NOP at compile time; branch prediction with flush. • Modify the datapath to detect the branch one cycle earlier. 30 CO101�Principle of Computer Organization Control Hazards Control Hazards 1: Jumps Incur One Stall Datapath Branch and Jump Hardware Supporting ID Stage Jumps Control Hazard 2: Branch Instr At compile time, insert NOP to stall the pipeline Calculate the branch condition earlier Single cycle processor: partitioning Branch detection at ID stage Two “Types” of Stalls Delayed Branches Scheduling Branch Delay Slots At run time, using branch prediction Branch Prediction Static Branch Prediction Example Guess branch not taken: load instruction “and” Fetch target instruction “lw”, insert “nop” Performance gain of branch prediction Branching Structures Static Branch Prediction, con’t Dynamic prediction (1-bit prediction scheme) Dynamic prediction (2-bit prediction scheme) Dynamic prediction Example of dynamic branch prediction 1-bit Prediction Accuracy 2-bit Prediction Accuracy 2-bit Prediction Accuracy Summary