Lesson 06 – Instruction-Level Parallelism: Introduction
Introduction
Pipelining became a universal technique in 1985: it overlaps the execution of instructions.
Beyond pipelining, Instruction-Level Parallelism (ILP) executes instructions in parallel.
There are two main approaches:
Hardware-based dynamic approaches: used in server and desktop processors.
Software-based static approaches: not as successful outside of scientific applications.
Instruction-Level Parallelism
When exploiting instruction-level parallelism, the goal is to minimize CPI.
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
Parallelism within a basic block is limited; the typical basic block is only 3-6 instructions.
Must optimize across branches.
Data Dependence
Major challenge for instruction-level parallelism: data dependence.
Instruction j is data dependent on instruction i if either:
Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i.
Dependent instructions cannot be executed simultaneously.
Data Dependence
Dependencies are a property of programs:
Pipeline organization determines if dependence is detected and if it causes a stall
Data dependence conveys:
Possibility of a hazard
Order in which results must be calculated
Upper bound on exploitable instruction level parallelism
Dependencies that flow through memory locations are difficult to detect.
Name Dependence
Two instructions use the same name but no flow of information.
Not a true data dependence, but is a problem when reordering instructions
Anti-dependence: instruction j writes a register or memory location that instruction i reads
Initial ordering (i before j) must be preserved
Output dependence: instruction i and instruction j write the same register or memory location
Ordering must be preserved
To resolve, use register renaming techniques.
Data Hazards:
Read after write (RAW)
Write after write (WAW)
Write after read (WAR)
Other Factors
Control Dependence:
Ordering of instruction i with respect to a branch instruction
An instruction that is control dependent on a branch cannot be moved before the branch, because its execution would then no longer be controlled by the branch.
An instruction that is not control dependent on a branch cannot be moved after the branch, because its execution would then be controlled by the branch.
Control Dependence
Example 1:
      ADD R1,R2,R3
      BEQ R4,R0,L
      SUB R1,R1,R6
      ....
L:    ....
      OR  R7,R1,R8
The OR instruction is dependent on both the ADD and the SUB.

Example 2:
      ADD R1,R2,R3
      BEQ R12,R0,skip
      SUB R4,R5,R6
      ADD R5,R4,R9
      ....
skip: ....
      OR  R7,R8,R9
Assume R4 isn't used after skip: it is then possible to move the SUB before the BEQ.
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Loop Unrolling
Loop Unrolling
Compiler Techniques for Exposing ILP
Loop unrolling: a loop transformation technique that optimizes program execution by enabling static (compile-time) pipeline scheduling of the instructions in the loop body.
Example:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
Latencies assumed in the examples:

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
FP Loop: Where Are the Hazards?
To simplify, assume 8 is the lowest address, R1 initially points to the highest-address element of x, and F2 = s.
Loop: L.D    F0,0(R1)     ;F0=vector element
      ADD.D  F4,F0,F2     ;add scalar in F2
      S.D    0(R1),F4     ;store result
      DADDUI R1,R1,#-8    ;decrement pointer 8 bytes (DW)
      BNEZ   R1,Loop      ;branch if R1!=zero
FP Loop Showing Stalls
1 Loop: L.D    F0,0(R1)     ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2     ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4     ;store result
7       DADDUI R1,R1,#-8    ;decrement pointer 8 bytes (DW)
8       stall               ;assumes can't forward to branch
9       BNEZ   R1,Loop      ;branch if R1!=zero

9 clock cycles per iteration: can we rewrite the code to minimize stalls?
Revised FP Loop Minimizing Stalls
1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4     ;altered offset after moving DADDUI
7       BNEZ   R1,Loop

Swap the DADDUI and the S.D by changing the offset of the S.D.
7 clock cycles per iteration, but only 3 are real work (L.D, ADD.D, S.D) and 4 are loop overhead and stalls. How to make it faster?
Loop unrolling:
Unroll by a factor of 4 (assume # elements is divisible by 4)
Eliminate unnecessary instructions (DADDUI and BNEZ)
Optimize (reschedule the code)
1  Loop: L.D    F0,0(R1)
3         ADD.D  F4,F0,F2
6         S.D    0(R1),F4       ;drop DADDUI & BNEZ
7         L.D    F6,-8(R1)
9         ADD.D  F8,F6,F2
12        S.D    -8(R1),F8      ;drop DADDUI & BNEZ
13        L.D    F10,-16(R1)
15        ADD.D  F12,F10,F2
18        S.D    -16(R1),F12    ;drop DADDUI & BNEZ
19        L.D    F14,-24(R1)
21        ADD.D  F16,F14,F2
24        S.D    -24(R1),F16
25        DADDUI R1,R1,#-32     ;alter to 4*8
26        BNEZ   R1,Loop
Loop Unrolling
Please Note
Note the number of live registers compared with the original loop.
In this schedule, each L.D is followed by a 1-cycle stall and each ADD.D by 2 cycles of stall.
27 clock cycles, or 6.75 per original iteration.
(Assumes the number of elements is a multiple of 4.)
Rewrite the loop to minimize stalls?
Loop Unrolled 4 Times
Optimize Unrolled Loop by Minimizing Stalls
1  Loop: L.D    F0,0(R1)
2         L.D    F6,-8(R1)
3         L.D    F10,-16(R1)
4         L.D    F14,-24(R1)
5         ADD.D  F4,F0,F2
6         ADD.D  F8,F6,F2
7         ADD.D  F12,F10,F2
8         ADD.D  F16,F14,F2
9         S.D    0(R1),F4
10        S.D    -8(R1),F8
11        S.D    -16(R1),F12
12        DADDUI R1,R1,#-32
13        S.D    8(R1),F16     ; 8-32 = -24
14        BNEZ   R1,Loop

14 clock cycles, or 3.5 per original iteration.
Loop Unrolling Factor
We do not usually know the upper bound of the loop at compile time.
Suppose it is n, and we would like to unroll the loop to make k copies of the body.
Instead of a single unrolled loop, we generate a pair of consecutive loops (see the C sketch below):
The 1st executes (n mod k) times and has a body that is the original loop.
The 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times.
For large values of n, most of the execution time will be spent in the unrolled loop.
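A minimal C sketch of this two-loop scheme, assuming an array x, a scalar s, and a trip count n (the function and variable names are illustrative, not from the lesson):

    /* Unroll by k = 4 with a clean-up loop for the leftover (n mod 4) iterations. */
    void add_scalar(double *x, double s, int n)
    {
        int i = 0;

        /* 1st loop: executes (n mod 4) times; its body is the original loop. */
        for (; i < n % 4; i++)
            x[i] = x[i] + s;

        /* 2nd loop: the unrolled body, iterating n/4 times. */
        for (; i < n; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }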
5 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that unrolling the loop is useful by finding that the loop iterations are independent (except for the loop maintenance code).
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations.
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and making sure that they do not refer to the same address.
5. Schedule the code, preserving any dependences needed to yield the same result as the original code.
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Dynamic Branch Prediction
Basic Dynamic Branch Prediction
Static Branch Prediction
Rescheduling code around a delayed branch requires predicting the branch statically at compile time.
The simplest scheme is to predict every branch as taken.
Average misprediction rate = untaken branch frequency = 34% for the SPEC benchmarks.
A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run.
[Figure: misprediction rate for the SPEC integer benchmarks (compress, eqntott, espresso, gcc, li) and floating-point benchmarks (doduc, ear, hydro2d, mdljdp, su2cor), ranging from roughly 4% to 25%.]
Dynamic Branch Prediction
Why does prediction work?
Underlying algorithm has regularities
Data that is being operated on has regularities
Instruction sequences have redundancies that are artifacts of the way humans and compilers think about problems.
Can we implement branch prediction dynamically, in hardware, at runtime?
There is a small number of important branches in programs that have dynamic behavior.
Dynamic Branch Prediction
Performance = f(accuracy, cost of misprediction)
Branch History Table (BHT): the lower bits of the branch PC index a table of 1-bit values.
Each entry says whether or not the branch was taken last time.
No address check (no tags).
Problem: in a loop, a 1-bit BHT causes two mispredictions (even though the average loop runs about 9 iterations before exit):
The end-of-loop case, when it exits instead of looping as before.
The first iteration the next time the loop is reached, when it predicts exit instead of looping.
Solution: a 2-bit scheme in which the prediction is changed only after two consecutive mispredictions.
Branch Prediction
Basic 2-bit predictor: For each branch:
Predict taken or not taken
If the prediction is wrong two consecutive times,
change prediction
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n branches
(m,n) predictor: uses the behavior of the last m branches to choose from 2^m n-bit predictors
Tournament predictor:
Combine correlating predictor with local predictor
2-Bit Branch Predictor
Idea: change the prediction only after two consecutive mispredictions.
[State diagram: the four states are strong Predict Taken, weak Predict Taken, weak Predict Not Taken, and strong Predict Not Taken; a taken outcome (T) moves one step toward strong taken, a not-taken outcome (NT) moves one step toward strong not taken.]
Green: go, taken. Red: stop, not taken.
Adds hysteresis to the decision-making process.
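A small C sketch of one such 2-bit saturating counter, as one entry of a BHT; the 0-3 encoding and the names are assumptions for illustration:

    /* One 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
       Two consecutive mispredictions are needed to flip the prediction. */
    typedef unsigned char counter2_t;            /* holds values 0..3 */

    int predict_taken(counter2_t c)              /* 1 = predict taken */
    {
        return c >= 2;
    }

    void update(counter2_t *c, int taken)        /* taken = actual outcome */
    {
        if (taken  && *c < 3) (*c)++;            /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;            /* saturate at 0 */
    }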
2-Bit Branch Predictor Accuracy
[Figure: misprediction rate of a 4096-entry 2-bit BHT on SPEC89 benchmarks, integer (eqntott, espresso, gcc, li, spice) and floating point (doduc, fpppp, matrix300, nasa7), ranging from about 0% up to 18%.]
Mispredictions occur because either:
The guess was wrong for that branch, or
The branch history of a different branch was used when indexing the BHT.
Correlating Branch Prediction
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.
In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters.
Thus, the old 2-bit BHT is a (0,2) predictor.
Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches.
Each entry in the table has 2^m n-bit predictors.
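A hedged C sketch of how such an (m,n) predictor could be indexed, reusing counter2_t, predict_taken(), and update() from the 2-bit counter sketch above; the 4096-counter size and m = 2 are illustrative assumptions:

    #define M           2                        /* global history bits (m) */
    #define TABLE_BITS  12                       /* 4096 2-bit counters in total */

    static counter2_t bht[1 << TABLE_BITS];      /* the 2^m sub-tables share this array */
    static unsigned   global_hist;               /* outcomes of the last m branches */

    /* Concatenate low PC bits with the m-bit global history to pick a counter. */
    static unsigned index_of(unsigned pc)
    {
        unsigned pc_bits = (pc >> 2) & ((1u << (TABLE_BITS - M)) - 1);
        return (pc_bits << M) | (global_hist & ((1u << M) - 1));
    }

    int predict(unsigned pc) { return predict_taken(bht[index_of(pc)]); }

    void after_branch(unsigned pc, int taken)    /* train, then shift in the outcome */
    {
        update(&bht[index_of(pc)], taken);
        global_hist = (global_hist << 1) | (taken & 1);
    }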
Correlating Branch Prediction
(2,2) predictor: the behavior of the 2 most recent branches selects between four 2-bit predictions for each branch, updating just the selected prediction.
[Figure: the branch address and a 2-bit global branch history together index a table of 2-bit per-branch predictors, which produces the prediction.]
Branch Prediction Performance
[Figure: frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT; the (2,2) correlating predictor gives the lowest misprediction rate on nearly every benchmark, while the unlimited 2-bit BHT is barely better than the 4096-entry one.]
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Dynamic Branch Prediction
Advanced Dynamic Branch Prediction
The motivation for correlating branch predictors came from the observation that the standard 2-bit predictor uses only local information.
Tournament predictors go further: they use multiple predictors (a global predictor and a local predictor) and choose between them.
Tournament Predictors
A tournament predictor using, say, 4K 2-bit selection counters indexed by the local branch address chooses between:
Global predictor:
4K entries indexed by the history of the last 12 branches (2^12 = 4K)
Each entry is a standard 2-bit predictor
Local predictor:
Local history table: 1024 10-bit entries recording the last 10 outcomes of each branch, indexed by the branch address
The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
The advantage of the tournament predictor is its ability to select the right predictor for a particular branch (e.g., the global predictor is selected about 40% of the time for SPEC integer and less than 15% for SPEC FP).
Tournament Predictors
Using multilevel branch prediction can achieve better accuracy and effectively use a large number of prediction bits.
An n-bit saturating counter chooses between the predictors; the usual choice is between a global and a local predictor.
[State diagram: the selector has four states ("use predictor 1" x2, "use predictor 2" x2); transitions are labeled P1/P2 = {0,1}/{0,1}, where 0 means that predictor's prediction was incorrect and 1 means it was correct. Outcomes 0/0 and 1/1 leave the choice unchanged, repeated 1/0 outcomes move toward "use predictor 1", and repeated 0/1 outcomes move toward "use predictor 2".]
Branch Prediction
[Figure: gshare predictor - a 10-bit global history shift register of the most recent branch outcomes (not taken/taken) is XORed with bits of the branch address to index 1024 2-bit predictors, which produce the prediction.]
[Figure: tournament predictor - the branch address and the branch history index a set of global predictors, a set of local predictors, and a selector; a multiplexer controlled by the selector chooses which of the two predictions is used.]
Branch Prediction Performance
[Figure: conditional branch misprediction rate versus total predictor size (up to 512 Kbits) for local 2-bit predictors, correlating predictors, and tournament predictors; rates range up to about 8%, and tournament predictors achieve the lowest misprediction rate at every size.]
Tagged Hybrid Predictors
Problem: reduce the size of the tables required for each branch and history.
Solution: a tagged hybrid branch predictor.
It combines multiple predictors and uses tags to track whether a prediction is associated with the current branch or not.
It employs a series of global predictors indexed with different-length histories.
It uses hash tables whose hash values are based on the branch address and the branch history.
Because longer histories increase the chance of hash collisions, multiple tables with increasingly shorter histories are used (a rough sketch follows).
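A simplified C sketch of the table-selection idea only (prediction lookup, no allocation or update); the number of tables, history lengths, hash, and field widths are assumptions and not the real TAGE design:

    #define NTABLES 4
    #define TBITS   10                           /* 1024 entries per table */

    struct tagged_entry { unsigned short tag; unsigned char ctr; /* 2-bit counter */ };
    static struct tagged_entry tbl[NTABLES][1 << TBITS];
    static unsigned long long  ghist;            /* global branch history */
    static const int hist_len[NTABLES] = { 64, 16, 4, 1 };   /* longest history first */

    static unsigned fold(unsigned long long h, int len)      /* crude history folding */
    {
        unsigned r = 0;
        for (int i = 0; i < len; i++)
            r ^= (unsigned)((h >> i) & 1) << (i % TBITS);
        return r;
    }

    /* Use the prediction of the longest-history table whose tag matches this branch;
       otherwise fall back to a base prediction supplied by the caller. */
    int predict_tagged(unsigned pc, int base_prediction)
    {
        for (int t = 0; t < NTABLES; t++) {
            unsigned idx = (fold(ghist, hist_len[t]) ^ (pc >> 2)) & ((1u << TBITS) - 1);
            if (tbl[t][idx].tag == (unsigned short)(pc & 0xffff))
                return tbl[t][idx].ctr >= 2;
        }
        return base_prediction;
    }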
[Figure: tagged hybrid predictor organization - a base predictor P(0) and a series of tagged tables P(1)..P(4), each indexed by a hash of the PC and a global-history segment h[0:L(i)] of increasing length; every entry holds a prediction and a tag, the tags are compared (=?) against a hash of the PC, and the matching table that uses the longest history supplies the final prediction.]
Tagged Hybrid Predictors Performance
[Figure: misses per one thousand instructions for TAGE versus gshare on SPECfp, SPECint, multimedia, and server workloads; TAGE mispredicts noticeably less in every category, with values ranging from under 1 to roughly 7 MPKI.]
Branch Target Buffers (BTB)
Branch hazards are caused by delays needed for calculating the branch condition and the branch target address
Calculation of the branch target address is costly and stalls the instruction fetch
Branch Target Buffer (BTB) stores PCs of recent branch instructions and their calculated target addresses.
As in a cache lookup, when a match is found the corresponding predicted PC is returned.
If the branch was predicted taken, instruction fetch continues at the returned predicted PC.
Dynamic Branch Prediction
Branch Target Buffer: dynamic branch prediction in which the address of the branch indexes a table to obtain both the prediction AND the predicted target address (if taken).
[Figure: the PC of the instruction being fetched is compared against the stored Branch PC entries; on a match, the stored Predicted PC (plus any extra prediction state bits) is used as the next PC; otherwise the branch is not predicted and fetch proceeds normally with Next PC = PC+4.]
Please note: the stored branch PC must match the fetch PC exactly, since we cannot afford to use the predicted target of a different branch.
Branch-Target Buffer
[Figure: branch-target buffer structure - the PC of the instruction to fetch is looked up against the Branch PC (tag) column; the number of entries determines how many branches can be tracked. Yes (hit): the instruction is a taken branch and the Predicted PC (data) column supplies the next PC. No (miss): the instruction is not predicted to be a taken branch; proceed normally.]
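A C sketch of a direct-mapped BTB lookup along these lines; the entry count and field names are assumptions:

    #define BTB_ENTRIES 512

    struct btb_entry {
        unsigned branch_pc;        /* tag: PC of the branch instruction        */
        unsigned predicted_pc;     /* target used the last time it was taken   */
        int      valid;
    };
    static struct btb_entry btb[BTB_ENTRIES];

    /* Returns the next PC for the fetch stage.  A hit means "this PC is a
       branch predicted taken"; a miss means fetch falls through to PC + 4. */
    unsigned next_fetch_pc(unsigned pc)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->branch_pc == pc)   /* tag check: must be the same branch */
            return e->predicted_pc;
        return pc + 4;                        /* not predicted to be a taken branch */
    }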
Branch-Target Buffer
[Figure: handling of the branch-target buffer across the pipeline stages.]
IF: send the PC to memory and to the branch-target buffer, and check whether an entry is found.
If no entry is found: execute normally; if the instruction turns out to be a taken branch, enter the branch instruction address and the next PC into the branch-target buffer.
If an entry is found: send out the predicted PC. In EX, check whether the branch really was taken. If not, the branch was mispredicted: kill the fetched instruction, restart the fetch at the other target, and delete the entry from the target buffer. If it was taken, the branch was correctly predicted: continue execution with no stalls.
Branch Folding
Optimization:
For unconditional branches, store the target instructions themselves in the buffer (this requires a larger branch-target buffer): "branch folding".
Keeping the target instruction in the buffer also compensates for the longer access/decoding time required by the larger buffer.
Return Address Predictor
Most unconditional branches come from function returns, and the same procedure can be called from multiple sites.
This causes the buffer to forget the return address from previous calls.
Solution: create a return address buffer organized as a stack (a sketch follows).
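A minimal C sketch of such a return address stack; the depth and overflow policy are assumptions:

    #define RAS_DEPTH 16

    static unsigned ras[RAS_DEPTH];
    static int      ras_top;                  /* number of valid entries */

    void on_call(unsigned return_pc)          /* push at every call instruction */
    {
        if (ras_top < RAS_DEPTH)
            ras[ras_top++] = return_pc;       /* a real design would wrap and
                                                 overwrite the oldest entry */
    }

    unsigned predict_return(void)             /* pop at every return instruction */
    {
        return ras_top > 0 ? ras[--ras_top] : 0;   /* 0 = no prediction available */
    }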
Return Address Predictor
[Figure: misprediction frequency (0%-70%) versus the number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex; even a small number of entries reduces return mispredictions dramatically.]
Integrated Instruction Fetch Unit
Design monolithic unit that performs: Branch prediction
Instruction prefetch
Fetch ahead
Instruction memory access and buffering
Deals with crossing cache lines
Dynamic Branch Prediction (Summary)
Prediction becoming important part of execution
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches correlated with next branch
Either different branches or different executions of same branches
Tournament predictors take insight to next level, by using multiple predictors
Usually one based on global information and one based on local information, and combining them with a selector
Branch Target Buffer: include branch address & prediction
Branch folding can store the target instructions of unconditional branches directly in the buffer.
A return address buffer (stack) caters for function returns.
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Dynamic Scheduling Introduction
Dynamic Scheduling
Rearrange the order of instruction execution to reduce stalls while maintaining data flow.

Advantages:
The compiler doesn't need knowledge of the microarchitecture.
Handles cases where dependencies are unknown at compile time.

Disadvantages:
Substantial increase in hardware complexity.
Complicates exception handling.

Dynamic scheduling implies out-of-order execution and out-of-order completion.
Example 1
FDIV.D F0,F2,F4
FADD.D F10,F0,F8
FSUB.D F12,F8,F14

FSUB.D is not dependent on the earlier instructions, so dynamic scheduling can execute it before FADD.D (which must wait for the long-latency divide).

Example 2
FDIV.D F0,F2,F4
FMUL.D F6,F0,F8
FADD.D F0,F10,F14

FADD.D is not data dependent, but the anti-dependence on F0 makes it impossible to issue it earlier without register renaming.
Example 3
FDIV.D F0,F2,F4
FADD.D F6,F0,F8
SD.D   F6,0(X1)
FSUB.D F8,F10,F14
FMUL.D F6,F10,F8

Data dependence on F0; anti-dependence on F8; output dependence on F6.
Register Renaming
Name dependences on F6 and F8 (the dependence through F0 is a true data dependence).
Register Renaming
Example 3 (renamed)
FDIV.D F0,F2,F4
FADD.D S,F0,F8
SD.D   S,0(X1)
FSUB.D T,F10,F14
FMUL.D F6,F10,T

Now only true data dependences (RAW hazards) remain, and they can be strictly ordered.
Register Renaming
Register renaming is provided by reservation stations (RS)
Tomasulo’s Algorithm
Tracks when operands are available Introduces register renaming
in hardware
Minimizes WAW and WAR hazards
Each reservation station contains:
The instruction
Buffered operand values (when available)
The reservation station number of the instruction providing the operand values
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Tomasulo’s Algorithm
Tomasulo’s Algorithm
Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
Tomasulo’s Algorithm: Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
Issue—Decode instructions, check for structural hazards
Read operands—Wait until no data hazards, then read operands
Tomasulo’s Algorithm
Control & buffers distributed with Function Units (FU)
FU buffers called “Reservation Stations” (RS)
RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
As instructions are issued, the register specifiers are replaced by values or pointers to RS
Form of register renaming
Avoids WAR and WAW hazards
Pending instructions designate the RS to which they will
send their output
May be more reservation stations than registers, so can do optimizations compilers can’t
Tomasulo’s Algorithm
Result values broadcast from RS to all FUs on a result bus, called the common data bus (CDB)
Results do not go through registers Form of bypassing
Only the last output updates the register file
Load and Stores treated as FUs with buffers
Contain data and addresses, act like reservation stations
Integer instructions can go past branches, allowing FP operations beyond the basic block to enter the FP queue.
Tomasulo’s Algorithm
[Figure: basic structure of a Tomasulo-based FP unit - an instruction queue from the instruction unit feeds issue; load/store operations go through an address unit to load buffers and store buffers connected to the memory unit; floating-point operations go to reservation stations (three in front of the FP adders, two in front of the FP multipliers), which also receive values from the FP registers over the operand buses; all results return on the common data bus (CDB), which feeds the reservation stations, store buffers, and FP registers.]
Tomasulo's Algorithm
Three steps:
Issue:
Get the next instruction from the FIFO instruction queue.
If a reservation station is available, issue the instruction to that RS together with any operand values that are already available.
If no reservation station is available, the instruction stalls; if an operand value is not yet available, the RS records which unit will produce it.
Execute:
When an operand becomes available (on the CDB), store it in every reservation station waiting for it.
When all operands are ready, execute the operation in the functional unit.
Loads and stores are maintained in program order through their effective addresses.
No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
Write result:
Write the result on the CDB into the waiting reservation stations and store buffers.
Stores must wait to write memory until both the address and the value have been received.
Tomasulo’s Algorithm Execution
1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instruction and sends operands (renames registers).
2. Execute: operate on operands
When both operands ready then execute; if not ready, watch Common Data Bus for result.
3. Write result: finish execution
Write on Common Data Bus to all awaiting units; mark reservation station available.
Normal data bus: data + destination (“go to” bus)
Common data bus: data + source (“come from” bus)
64 bits of data + 4 bits of Functional Unit source address
A waiting unit captures the value only if the broadcast source tag matches the functional unit it expects to produce the result; the CDB thus performs the broadcast to all units at once.
Tomasulo’s Algorithm Example 1
Code sequence:
LD.D  F6,32(R2)
LD.D  F2,44(R3)
FMUL  F0,F2,F4
FSUB  F8,F2,F6
FDIV  F10,F0,F6
FADD  F6,F8,F2

Instruction status:
Instruction      | Issue | Execute | Write result
LD.D F6,32(R2)   |   ✔   |    ✔    |      ✔
LD.D F2,44(R3)   |   ✔   |    ✔    |
FMUL F0,F2,F4    |   ✔   |         |
FSUB F8,F2,F6    |   ✔   |         |
FDIV F10,F0,F6   |   ✔   |         |
FADD F6,F8,F2    |   ✔   |         |
Tomasulo’s Algorithm Snapshot
Reservation stations:
Name  | Busy | Op   | Vj | Vk         | Qj    | Qk    | A
Load1 | No   |      |    |            |       |       |
Load2 | Yes  | Load |    |            |       |       | 44+[R3]
Add1  | Yes  | FSUB |    | M[32+[R2]] | Load2 |       |
Add2  | Yes  | FADD |    |            | Add1  | Load2 |
Add3  | No   |      |    |            |       |       |
Mult1 | Yes  | FMUL |    | [F4]       | Load2 |       |
Mult2 | Yes  | FDIV |    | M[32+[R2]] | Mult1 |       |

Register result status:
Field | F0    | F2    | F4 | F6   | F8   | F10   | F12 | ... | F30
FU    | Mult1 | Load2 |    | Add2 | Add1 | Mult2 |     |     |
Tomasulo’s Algorithm Example 2
Loop iterations:
Loop: LD.D   F0,0(R1)
      FMUL.D F4,F0,F2
      SD.D   F4,0(R1)
      SUBI   R1,R1,#8
      BNE    R1,R2,Loop   // branches if R1 != R2

Instruction status:
Instruction     | From iteration | Issue | Execute | Write result
LD.D F0,0(R1)   |       1        |   ✔   |    ✔    |
FMUL F4,F0,F2   |       1        |   ✔   |         |
SD.D F4,0(R1)   |       1        |   ✔   |         |
LD.D F0,0(R1)   |       2        |   ✔   |    ✔    |
FMUL F4,F0,F2   |       2        |   ✔   |         |
SD.D F4,0(R1)   |       2        |   ✔   |         |

Tomasulo’s Algorithm Snapshot
Reservation stations:
Name   | Busy | Op    | Vj | Vk   | Qj    | Qk | A
Load1  | Yes  | Load  |    |      |       |    | [R1] + 0
Load2  | Yes  | Load  |    |      |       |    | [R1] - 8
Add1   | No   |       |    |      |       |    |
Add2   | No   |       |    |      |       |    |
Add3   | No   |       |    |      |       |    |
Mult1  | Yes  | FMUL  |    | [F2] | Load1 |    |
Mult2  | Yes  | FMUL  |    | [F2] | Load2 |    |
Store1 | Yes  | Store |    |      | Mult1 |    | [R1]
Store2 | Yes  | Store |    |      | Mult2 |    | [R1] - 8

Register result status:
Field | F0    | F2 | F4    | F6 | F8 | F10 | F12 | ... | F30
FU    | Load2 |    | Mult2 |    |    |     |     |     |
Tomasulo’s Algorithm Organization
[Figure: the FP Op Queue and the FP registers feed the reservation stations (Add1-Add3 in front of the FP adders, Mult1-Mult2 in front of the FP multipliers); six load buffers bring data from memory, and store buffers send data to memory; all results are broadcast on the Common Data Bus (CDB). Assumed latencies: FP add/sub 3 clocks, FP multiply 10 clocks, FP divide 40 clocks.]
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands
Store buffers have a V field for the value to be stored
Qj, Qk: the reservation stations producing the source operands (a blank entry means the value is already available)
Store buffers only have a Qi field, for the RS producing the value to be stored
Busy: indicates that the reservation station and its FU are busy
Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
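A minimal C sketch of these reservation-station fields and of the write-result broadcast on the CDB, assuming tags are small integers with 0 meaning "value already present" (names and sizes are illustrative):

    struct rs {
        int    busy;             /* station (and its FU) in use                */
        char   op[8];            /* operation to perform, e.g. "ADD"           */
        double vj, vk;           /* operand values, valid when qj/qk == 0      */
        int    qj, qk;           /* 0 = value ready; otherwise the tag of the
                                    reservation station that will produce it   */
    };

    #define NUM_RS 8
    static struct rs stations[NUM_RS];

    /* Write-result step: the producing station broadcasts (tag, value) on the
       CDB; every station waiting on that tag captures the value. */
    void cdb_broadcast(int tag, double value)
    {
        for (int i = 0; i < NUM_RS; i++) {
            if (!stations[i].busy) continue;
            if (stations[i].qj == tag) { stations[i].vj = value; stations[i].qj = 0; }
            if (stations[i].qk == tag) { stations[i].vk = value; stations[i].qk = 0; }
        }
    }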
Tomasulo’s Algorithm Example
Instruction stream (j and k are the source operands):
LD.D  F6,34+R2
LD.D  F2,45+R3
FMUL  F0,F2,F4
FSUB  F8,F6,F2
FDIVD F10,F0,F6
FADD  F6,F8,F2

Instruction status (Issue / Exec Comp / Write Result): empty - nothing has issued yet.
Load buffers (3 load buffers, Load1-Load3): not busy, no addresses.
Reservation stations (3 FP adder RS Add1-Add3, 2 FP multiplier RS Mult1-Mult2): not busy; the Time column is the FU countdown of remaining execution cycles.
Register result status (F0, F2, F4, F6, F8, F10, F12, ..., F30): all blank.
Clock (cycle counter): 0
Tomasulo’s Algorithm Example (Cycle 1)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1
LD.D F2
45+ R3
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1
Clock
1
Register result status:
Tomasulo’s Algorithm Example (Cycle 2)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1
LD.D F2
45+ R3 2
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Load1
Clock
2
Register result status:
Tomasulo’s Algorithm Example (Cycle 3)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3
LD.D F2
45+ R3 2
FMUL F0
F2 F4 3
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F4] Load2
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 Load2 Load1
Clock
3
Register result status:
Tomasulo’s Algorithm Example (Cycle 4)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 No
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 Yes FSUB M[A1] Load2
Add2 No
Add3 No
Mult1 Yes FMUL [F4] Load2
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 Load2 M[A1] Add1
Clock
4
Register result status:
Tomasulo’s Algorithm Example (Cycle 5)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6 5
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
3
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 No
Add3 No
10
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] M[A1] Add1 Mult2
Clock
5
Register result status:
Tomasulo’s Algorithm Example (Cycle 6)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
2
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 Yes FADD M[A2] Add1
Add3 No
9
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 Add1 Mult2
Clock
6
Register result status:
Tomasulo’s Algorithm Example (Cycle 7)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
1
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 Yes FADD M[A2] Add1
Add3 No
8
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 Add1 Mult2
Clock
7
Register result status:
Tomasulo’s Algorithm Example (Cycle 8)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
3
Add2 Yes FADD (M-M) M[A2]
Add3 No
7
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
8
Register result status:
Tomasulo’s Algorithm Example (Cycle 9)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
2
Add2 Yes FADD (M-M) M[A2]
Add3 No
6
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
9
Register result status:
Tomasulo’s Algorithm Example (Cycle 10)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
1
Add2 Yes FADD (M-M) M[A2]
Add3 No
5
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
10
Register result status:
Tomasulo’s Algorithm Example (Cycle 11)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
11
Register result status:
Tomasulo’s Algorithm Example (Cycle 12)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
12
Register result status:
Tomasulo’s Algorithm Example (Cycle 13)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
13
Register result status:
Tomasulo’s Algorithm Example (Cycle 14)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
14
Register result status:
Tomasulo’s Algorithm Example (Cycle 15)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14 15
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
15
Register result status:
Faster Than Light Computation (Skip 39 Cycles)
Tomasulo’s Algorithm Example (Cycle 54)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14 15
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
2
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
54
Register result status:
Tomasulo’s Algorithm Example (Cycle 55)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 15 16
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5 55
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
55
Register result status:
Tomasulo’s Algorithm Example (Cycle 56)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 15 16
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5 55 56
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Result
Clock
56
Register result status:
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Tomasulo’s Algorithm (Dynamic Loop Unrolling)
Tomasulo’s Algorithm Loop Example
Loop: LD.D  F0,0(R1)
      FMUL  F4,F0,F2
      SD.D  F4,0(R1)
      SUBI  R1,R1,#8
      BNEZ  R1,Loop
This time assume the multiply takes 5 clocks.
Assume the 1st load takes 8 clocks (L1 cache miss) and the 2nd load takes 1 clock (hit).
To be clear, clock cycles will be shown for SUBI and BNEZ; in reality, the integer instructions run ahead of the FP instructions.
Two iterations are shown.
Iteration count
Instruction status:
Loop Example
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1
1 FMULF4 F0 F2
1 SD.DF4 0 R1
2 LD.DF0 0 R1
2 FMULF4 F0 F2
2 SD.DF4 0 R1
Load2 No
Busy Addr
FU
Load1 No
Load3 No
Store1 No
Store2 No
Store3 No
Added Store Buffers
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
Instruction Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU
Clock R1 0 80
Register result status:
Value of Register used for address, iteration control
Loop Example (Cycle 1)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 No
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1
Clock R1 1 80
Register result status:
Loop Example (Cycle 2)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 No
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 2 80
Register result status:
Loop Example (Cycle 3)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Busy Addr
Load1 Yes 80
FU
Load2 No
Load3 No
Store1 Yes 80
Mult1
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 3 80
Register result status:
Loop Example (Cycle 4)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 4 80
Register result status:
Loop Example (Cycle 5)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 5 72
Register result status:
Loop Example (Cycle 6)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
Busy Addr
Load2 Yes 72
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult1
Clock R1 6 72
Register result status:
Loop Example (Cycle 7)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
Busy Addr
Load2 Yes 72
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 7 72
Register result status:
Loop Example (Cycle 8)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Busy Addr
Load1 Yes 80
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 8 72
Register result status:
Loop Example (Cycle 9)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Busy Addr
Load1 Yes 80
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 9 72
Register result status:
Loop Example (Cycle 10)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 10 64
Register result status:
Loop Example (Cycle 11)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 11 64
Register result status:
Loop Example (Cycle 12)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 12 64
Register result status:
Loop Example (Cycle 13)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 13 64
Register result status:
Loop Example (Cycle 14)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 14 64
Register result status:
Loop Example (Cycle 15)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
M[80]*F2
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 15 64
Register result status:
Loop Example (Cycle 16)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
M[80]*F2
M[72]*F2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
5 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 16 64
Register result status:
Loop Example (Cycle 17)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 Yes 64
M[80]*F2
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 17 64
Register result status:
Loop Example (Cycle 18)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 Yes 64
M[80]*F2
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 18 64
Register result status:
Loop Example (Cycle 19)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18 19
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8 19
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 No
Store2 Yes 72
Store3 Yes 64
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 19 56
Register result status:
Loop Example (Cycle 20)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18 19
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8 19 20
Load2 No
Busy Addr
FU
Load1 Yes 56
Load3 Yes 64
Store1 No
Store2 No
Store3 Yes 64
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 20 56
Register result status:
Why Can Tomasulo Overlap Iterations of Loops?
Register renaming
Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
Reservation stations
Permit instruction issue to advance past integer control flow operations.
Also buffer old values of registers – totally avoiding the WAR stalls.
Other perspective: Tomasulo building data flow dependency graph on the fly
Major Advantages of Tomasulo’s Algorithm
The distribution of the hazard detection logic:
Distributed reservation stations and the CDB
If multiple instructions are waiting on a single result, and each already has its other operand, then all of them can be released simultaneously by the broadcast on the CDB.
With a centralized register file, the units would instead have to read their operands from the registers whenever the register buses become available.
The elimination of stalls for WAW and WAR hazards
What about Precise Interrupts?
The state of the machine looks as if no instruction beyond the faulting instruction has issued.
Tomasulo had:
In-order issue, out-of-order execution, and out-of-order completion
Need to “fix” the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Tomasulo’s Algorithm with Speculation
Relationship between Precise Interrupts and Speculation
Speculation: guess and check
Important for branch prediction:
Need to “take our best shot” at predicting branch direction
If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
This is exactly same as precise exceptions!
Technique for both precise interrupts/exceptions
and speculation:
In-order completion or commit
Hardware-Based Speculation
Execute instructions along predicted execution paths, but only commit the results once the prediction is known to be correct.
Instruction commit: allowing an instruction to update the register file only when the instruction is no longer speculative.
This requires an additional piece of hardware to prevent any irreversible action (updating state or taking an exception) until the instruction commits.
Reorder Buffer
Reorder Buffer (ROB)
Register values and memory references are not written until an instruction commits.
On misprediction:
Speculated entries in ROB are cleared
Exceptions:
Not recognized until the faulting instruction is ready to commit
Reorder Buffer Operation
ROB holds instructions in FIFO order, exactly as issued
Modify the reservation stations: the operand source is now the ROB instead of the functional unit.
When instructions complete, their results are placed into the ROB.
The ROB supplies operands to other instructions between execution complete and commit: it effectively provides more registers, just like the reservation stations do.
Results are tagged with the ROB entry number instead of the reservation station number.
Instructions commit in order: values at the head of the ROB are placed in the registers (or memory locations).
As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions.
[Figure: the FP Op Queue issues to the reservation stations in front of the FP adders; results go into the reorder buffer, and the commit path moves values from the head of the reorder buffer into the FP registers.]
Reorder Buffer Entry Fields
Each entry in the ROB contains four fields:
1 Instruction type: a branch (has no destination result), a store (has a memory address destination), or a register operation (FP operation or load, which has a register destination)
2 Destination: the register number (for loads and FP operations) or the memory address (for stores) where the instruction result should be written
3 Value: the value of the instruction result, held until the instruction commits
4 Ready: indicates that the instruction has completed execution and the value is ready
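A small C sketch of a ROB entry with these four fields, plus the FIFO head/tail pointers; the sizes and names are assumptions:

    enum itype { ITYPE_BRANCH, ITYPE_STORE, ITYPE_REGOP };

    struct rob_entry {
        enum itype type;         /* branch, store, or register operation          */
        unsigned   dest;         /* register number, or memory address for stores */
        double     value;        /* result held here until the entry commits      */
        int        ready;        /* 1 = execution finished, value is valid        */
    };

    #define ROB_SIZE 16
    static struct rob_entry rob[ROB_SIZE];
    static int rob_head, rob_tail;   /* FIFO: commit at head, allocate at tail */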
Reorder Buffer
Entry | Busy | Instruction     | State        | Destination | Value
1     | No   | LD.D F6,32(R2)  | Commit       | F6          | Mem[32+R2]
2     | No   | LD.D F2,44(R3)  | Commit       | F2          | Mem[44+R3]
3     | Yes  | FMUL F0,F2,F4   | Write result | F0          | #2 x F4
4     | Yes  | FSUB F8,F2,F6   | Write result | F8          | #2 - #1
5     | Yes  | FDIV F10,F0,F6  | Execute      | F10         |
6     | Yes  | FADD F6,F8,F2   | Write result | F6          | #4 + #2

Reservation stations:
Name  | Busy | Op   | Vj         | Vk         | Qj | Qk | Dest | A
Load1 | No   |      |            |            |    |    |      |
Load2 | No   |      |            |            |    |    |      |
Add1  | No   |      |            |            |    |    |      |
Add2  | No   |      |            |            |    |    |      |
Add3  | No   |      |            |            |    |    |      |
Mult1 | No   | FMUL | Mem[44+R3] | F4         |    |    | #3   |
Mult2 | Yes  | FDIV |            | Mem[32+R2] | #3 |    | #5   |

FP register status:
Field     | F0  | F1 | F2 | F3 | F4 | F5 | F6  | F7  | F8  | F9 | F10
Reorder # | 3   |    |    |    |    |    | 6   |     | 4   |    | 5
Busy      | Yes | No | No | No | No | No | Yes | ... | Yes | No | Yes
Adding Speculation to Tomasulo
In non-speculative Tomasulo’s Algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file
With speculation, the register file is not updated until the instruction commits
(we know definitively that the instruction should execute)
Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit
ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo’s Algorithm
Revised Tomasulo’s Algorithm
1 Issue — get instruction from FP Op Queue (sometimes called “dispatch”)
If reservation station and ROB slot free, issue instruction and send operands and ROB no. for destination
2 Execution: operate on operands
When both operand values are in the reservation station, execute; if not ready, watch the CDB for the result (this step checks for RAW hazards)
3 Write result — finish execution
Write result and ROB tag on CDB to all awaiting FUs and ROB; Mark reservation station available
4 Commit — update register with reorder result (sometimes called “graduation”)
When instruction reaches head of ROB and result present, update register with result (or store to memory) and remove instruction from ROB. When a mispredicted branch reaches head of ROB, discard all entries (flush ROB)
Architecture of Speculative Tomasulo
[Figure: the instruction queue issues load/store operations (through an address unit, load buffers, and a store address/store data path to the memory unit) and floating-point operations (to the reservation stations in front of the FP adders and FP multipliers); the FP registers and the reorder buffer (register #, data) supply operands over the operand buses, and results return on the common data bus (CDB).]
Assumed latencies: floating-point addition 2 clock cycles, floating-point division 20 clock cycles, load 3 clock cycles, cache miss penalty 10 clock cycles.
Speculative Tomasulo Example (Cycle 1)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 2)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 3)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 4)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 5)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F4
LD.D F4,0(R3)
N
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 6)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F0
FADD F0,F4,F6
N
F4
LD.D F4,0(R3)
N
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
ROB5, R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 7)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
—
ROB5
SD.D 0(R3),F4
N
F0
FADD F0,F4,F6
N
F4
LD.D F4,0(R3)
N
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
ROB5, R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 8)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
—
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
N
F4
M[10]
LD.D F4,0(R3)
Y
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
3
FDIV
ROB2,R(F6)
6
FADD
R(F4),ROB1
M[10], R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 9)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
—
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Ex
F4
M[10]
LD.D F4,0(R3)
Y
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
M[10], R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 10)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
—
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Y
F4
M[10]
LD.D F4,0(R3)
Y
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
FP adders
FP multipliers
Speculative Tomasulo Example (Cycle 11)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
—
M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Y
F4
M[10]
LD.D F4,0(R3)
Y
—
BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
M[20]
LD.D F0,10(R2)
Y
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
What about memory hazards?
3
FDIV
ROB2,R(F6)
FP adders
FP multipliers
Avoiding Memory Hazards
WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending
RAW dependences through memory are maintained by two restrictions (a sketch of the first check follows):
1. A load is not allowed to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the address of the load.
2. Program order is maintained for the computation of the effective address of a load with respect to all earlier stores.
Together, these restrictions ensure that any load that accesses a memory location written by an earlier store cannot perform the memory access until the store has written the data.
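A hedged C sketch of the first restriction, reusing the rob_entry sketch from the reorder-buffer slide; a real design would only consider stores older than the load, while this conservative version checks every active store:

    /* Before a load performs its memory access, scan the active ROB entries for
       a pending store whose Destination (address) matches the load address. */
    int load_may_access_memory(unsigned load_addr)
    {
        for (int i = rob_head; i != rob_tail; i = (i + 1) % ROB_SIZE) {
            if (rob[i].type == ITYPE_STORE && rob[i].dest == load_addr)
                return 0;        /* conflicting store still pending: wait */
        }
        return 1;                /* safe to read memory now */
    }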
Speculation for Greater ILP
Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct.
Speculation
fetch, issue, and execute instructions as if branch predictions were always correct
Dynamic scheduling
only fetches and issues instructions
Essentially a data flow execution model: Operations execute as soon as their operands are available.
Speculation for Greater ILP
3 components of HW-based speculation:
1. Dynamic branch prediction to choose which instructions to execute
2. Speculation to allow execution of instructions before control dependences are resolved, with the ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling to deal with the scheduling of different combinations of basic blocks
Exceptions and Interrupts
The IBM 360/91 invented "imprecise interrupts":
The computer stopped at this PC, which is likely close to the faulting instruction - not so popular with programmers.
Also, what about virtual memory? (Not present in the IBM 360.)
Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
If we speculate and are wrong, we need to back up and restart execution to point at which we predicted incorrectly
This is exactly what is needed to implement a precise exception
Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROB
If a speculated instruction raises an exception, the exception is recorded in the ROB This is why reorder buffers are in all new processors
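A minimal C sketch of the commit step, assuming an illustrative ROB entry layout (busy, done, exception) and leaving the actual register-file/memory update and the ROB flush to the caller; it is meant only to show why deferring exception recognition to the ROB head yields a precise exception.

#include <stdbool.h>

#define ROB_SIZE 8

typedef struct {
    bool busy;
    bool done;            /* result (or recorded exception) is ready        */
    bool exception;       /* exception noted when the instruction executed  */
} RobEntry;

typedef enum { COMMIT_NONE, COMMIT_OK, COMMIT_RAISE_EXCEPTION } CommitAction;

/* One commit attempt per cycle: instructions retire strictly in program
 * order, and an exception is recognized only when the faulting
 * instruction reaches the head of the ROB.  At that point every older
 * instruction has already committed and no younger one has, so the
 * exception is precise; work on a squashed speculative path never gets
 * here and is simply discarded when the ROB is flushed.                */
CommitAction commit_step(RobEntry rob[ROB_SIZE], int *head)
{
    RobEntry *e = &rob[*head];

    if (!e->busy || !e->done)
        return COMMIT_NONE;            /* head not ready: nothing commits this cycle */

    if (e->exception)
        return COMMIT_RAISE_EXCEPTION; /* caller flushes the ROB and enters the handler */

    /* ...write the result to the register file or memory here...       */
    e->busy = false;
    *head = (*head + 1) % ROB_SIZE;
    return COMMIT_OK;
}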
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Multiple Instruction Issue
Multiple Instructions Issue
Multiple Issue and ILP
To achieve CPI < 1, we need to complete multiple instructions per clock.
Solutions:
  Multiple-issue superscalar processors
  VLIW (very long instruction word) processors
  Vector processors operating on multi-word vectors
The anticipated success of multiple issue led to reporting Instructions Per Clock cycle (IPC) instead of CPI.
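For instance (an illustrative figure, not from the lecture): a processor that sustains 2 completed instructions every clock cycle has IPC = 2, which is the same as CPI = 1/IPC = 0.5.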
The main motivation of ILP is to reduce the number of stalls in the pipelined execution of instructions, in particular for multi-cycle operations.
The main goal of ILP is to get performance close to the ideal CPI of 1.
Getting CPI < 1:
Issuing Multiple Instructions/Cycle
Superscalar
Varying number of instructions per cycle (1 to 8), scheduled statically by the compiler or dynamically in hardware (Tomasulo)
IBM PowerPC, Sun UltraSparc, Pentium 4
Very Long Instruction Words VLIW
Fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates
Intel Architecture-64 (IA-64) 64-bit address
Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
Vector Processing
Explicit coding of independent loops as operations on large vectors of numbers
Multimedia instructions are being added to many processors
Scalar Pipeline Execution
[Diagram: successive instructions each pass through Fetch, Decode, Execute, and Write back, with one instruction entering per base cycle; time axis 0 to 13 base cycles.]
Superscalar of Degree 3
[Diagram: three instructions enter the Fetch/Decode/Execute/Write-back pipeline together each base cycle; time axis 0 to 9 base cycles.]
VLIW Execution with Degree 3
[Diagram: one long instruction is fetched and decoded, its 3 operations execute in parallel, then write back; time axis 0 to 9 base cycles.]
Vector Execution with Degree 3
[Diagram: Fetch and Decode are followed by Read vector, Execute, and Write vector/Write back stages operating on entire vectors; time axis 0 to 9 base cycles.]
Multiple Issue

Superscalar (static)
  Issue structure: dynamic; hazard detection: hardware; scheduling: static
  Distinguishing characteristic: in-order execution
  Examples: mostly in the embedded space: MIPS and ARM, including the Cortex-A53

Superscalar (dynamic)
  Issue structure: dynamic; hazard detection: hardware; scheduling: dynamic
  Distinguishing characteristic: some out-of-order execution, but no speculation
  Examples: none at present

Superscalar (speculative)
  Issue structure: dynamic; hazard detection: hardware; scheduling: dynamic with speculation
  Distinguishing characteristic: out-of-order execution with speculation
  Examples: Intel Core i3, i5; AMD Phenom; IBM Power 7

VLIW/LIW
  Issue structure: static; hazard detection: primarily software; scheduling: static
  Distinguishing characteristic: all hazards determined and indicated by the compiler (often implicitly)
  Examples: most examples are in signal processing, such as the TI C6x

EPIC
  Issue structure: primarily static; hazard detection: primarily software; scheduling: mostly static
  Distinguishing characteristic: all hazards determined and indicated explicitly by the compiler
  Examples: Itanium
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Multiple Instruction Issue
Very Long Instruction Word (VLIW)
VLIW Processors
Example VLIW bundle:
One integer instruction (or branch)
Two independent floating-point operations
Two independent memory references
Package multiple operations into one instruction.
Must be enough parallelism in code to fill the available slots.
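A minimal C sketch of how such a bundle could be represented, assuming illustrative field names and an 8-bit encoding per field rather than any actual VLIW ISA.

#include <stdint.h>

/* One slot of the bundle: a single RISC-style operation.  The field
 * widths and names are illustrative, not a real encoding.              */
typedef struct {
    uint8_t opcode;                 /* opcode 0 is treated as a NOP      */
    uint8_t dst, src1, src2;
} Op;

/* The bundle described above: one integer/branch operation, two
 * independent FP operations, and two independent memory references,
 * all issued together.  Slots the compiler cannot fill must be padded
 * with NOPs, which is the code-size cost discussed later in the lesson. */
typedef struct {
    Op int_or_branch;
    Op fp[2];
    Op mem[2];
} VliwBundle;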
VLIW Pipeline
[Diagram: successive VLIW instructions (vertical axis) versus time (horizontal axis); each instruction passes through IF and ID, then its several EX units operate in parallel, followed by MEM and WB.]
VLIW Organization
[Diagram: the instruction format has one field per operation slot (FP Add, FP Mult, Int ALU, Branch, Load/Store); an Instruction Issue Unit routes each field to the corresponding functional unit (FP multiplier, FP adder, integer ALU, branch unit, load/store unit), all of which share a common Register File.]
VLIW: Very Long Instruction Word
Each “instruction” has explicit coding for multiple operations
In the IA-64 architecture, grouping called a “packet”
In the Transmeta processor, grouping called a “molecule” (with “atoms” as ops)
Tradeoff instruction space for simple decoding
The long instruction word has room for many operations.
By definition, all the operations the compiler puts in the long instruction word are independent, so they can execute in parallel.
E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch.
With 16 to 24 bits per field, the word is 7×16 = 112 bits to 7×24 = 168 bits wide.
Needs a compiling technique that schedules across several branches.
Recall: Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: LD.D   F0,0(R1)
2        LD.D   F6,-8(R1)
3        LD.D   F10,-16(R1)
4        LD.D   F14,-24(R1)
5        FADD   F4,F0,F2
6        FADD   F8,F6,F2
7        FADD   F12,F10,F2
8        FADD   F16,F14,F2
9        SD.D   F4,0(R1)
10       SD.D   F8,-8(R1)
11       SD.D   F12,-16(R1)
12       SUBUI  R1,R1,#32
13       BNEZ   R1,Loop
14       SD.D   F16,8(R1)      ; 8-32 = -24

Latencies: LD.D to FADD: 1 cycle; FADD to SD.D: 2 cycles.
14 clock cycles, or 3.5 per iteration.
Loop Unrolling in VLIW
Memory reference 1    Memory reference 2    FP operation 1     FP operation 2     Integer op/branch   Clock
LD.D F0,0(R1)         LD.D F6,-8(R1)                                                                    1
LD.D F10,-16(R1)      LD.D F14,-24(R1)                                                                  2
LD.D F18,-32(R1)      LD.D F22,-40(R1)      FADD F4,F0,F2      FADD F8,F6,F2                            3
LD.D F26,-48(R1)                            FADD F12,F10,F2    FADD F16,F14,F2                          4
                                            FADD F20,F18,F2    FADD F24,F22,F2                          5
SD.D F4,0(R1)         SD.D F8,-8(R1)        FADD F28,F26,F2                                             6
SD.D F12,-16(R1)      SD.D F16,-24(R1)                                           SUBUI R1,R1,#56        7
SD.D F20,24(R1)       SD.D F24,16(R1)                                                                   8
SD.D F28,8(R1)                                                                   BNEZ R1,Loop           9

Unrolled 7 times to avoid delays.
7 iterations complete in 9 clocks, or 1.3 clocks per iteration (1.8X).
Average: 2.5 ops per clock, 50% efficiency.
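To see where the last two figures come from, counting directly from the schedule above: the nine VLIW instructions contain 23 operations, so roughly 23/9 ≈ 2.5 operations issue per clock, and with five slots per instruction only 23/(9 × 5) ≈ 51% of the available slots are filled, i.e., about 50% efficiency.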
Please Note
Need more registers in VLIW (15 vs. 6 in single scalar)
Problems with 1st Generation VLIW
VLIWs have not been successful in commercial processors for several reasons:
Increase in code size
  Generating enough operations in a straight-line code fragment requires ambitious unrolling of loops
  Whenever VLIW instructions are not full, the unused functional-unit slots translate into wasted bits in the instruction encoding
Operated in lock-step; no hazard detection hardware
  A stall in any functional-unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
  The compiler can predict functional-unit latencies, but cache behavior is hard to predict
Binary code compatibility
  In a pure VLIW, different numbers of functional units and different unit latencies require different versions of the code
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Multiple Instruction Issue
Dynamic Scheduling, Multiple Issue, and Speculation
Dynamic Scheduling, Multiple Issue, and Speculation
Modern architectures:
Dynamic scheduling + multiple issue
+ speculation
Two approaches:
Assign reservation stations and update the pipeline control table in half clock cycles
Only supports 2 instructions/clock
Design logic to handle any possible dependencies between the instructions
Issue logic (dispatch unit) is the bottleneck in dynamically scheduled superscalars
Superscalar Organization
[Diagram: an Instruction Fetch Unit feeds an Instruction Queue; a Dispatch Unit issues instructions to a Floating-Point Execution Unit, an Integer Execution Unit, a Load Unit, a Store Unit, and a Branch Unit; results drain to memory through a Write Buffer.]
2-way Superscalar Pipeline
[Diagram: two instructions enter the IF, ID, EX, MEM, WB pipeline together every cycle; successive instruction pairs (vertical axis) are staggered by one cycle along the time axis (horizontal).]
Multiple Issue and Dependencies
Examine all the dependencies among the instructions issued simultaneously
If dependencies exist within the bundle of instructions, encode them in the reservation stations
Also need multiple completion/commit
To simplify reservation station allocation:
Limit the number of instructions of a given class that can be issued in a “bundle”, i.e., one FP, one integer, one load, one store
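As a small illustration of that per-class limit, a check of the following kind could run before reservation-station allocation; the class names, the helper bundle_respects_class_limits, and the one-per-class caps are assumptions for the sketch, not details from the lecture.

#include <stdbool.h>

typedef enum { CLASS_INT, CLASS_FP, CLASS_LOAD, CLASS_STORE, NUM_CLASSES } InstrClass;

/* Illustrative cap: at most one instruction of each class per bundle,
 * so the issue logic knows in advance which reservation-station and
 * ROB resources the bundle can possibly need.                          */
static const int max_per_class[NUM_CLASSES] = { 1, 1, 1, 1 };

/* Returns true if every instruction in the candidate bundle can be
 * issued this cycle without exceeding its class limit.                 */
bool bundle_respects_class_limits(const InstrClass bundle[], int n)
{
    int count[NUM_CLASSES] = { 0 };
    for (int i = 0; i < n; i++) {
        if (++count[bundle[i]] > max_per_class[bundle[i]])
            return false;            /* too many of this class: split the bundle */
    }
    return true;
}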
2-way Superscalar Tomasulo
[Diagram: Tomasulo organization extended for 2-way issue. An instruction queue feeds an address unit and load buffers for load/store operations and reservation stations for integer and floating-point operations; a reorder buffer sits between the reservation stations and the integer/FP register file; store address and store data go to the memory unit, and results from the integer unit, the FP adders, the FP multipliers, and load data return on the common data bus (CDB).]
Multiple Issue Example
Loop: LD    R2,0(R1)     ;R2 = array element
      ADDI  R2,R2,#1     ;increment R2
      SD    R2,0(R1)     ;store result
      ADDI  R1,R1,#8     ;increment pointer
      BNE   R2,R3,Loop   ;branch if not last element
Multiple Issue Example (No Speculation)
Iteration  Instruction        Issues at   Executes at   Memory access   Write CDB at   Comment
           clock cycle        clock cycle at clock cycle at clock cycle
1          LD   R2,0(R1)      1           2             3               4              First issue
1          ADDI R2,R2,#1      1           5                             6              Wait for LD
1          SD   R2,0(R1)      2           3             7                              Wait for ADDI
1          ADDI R1,R1,#8      2           3                             4              Execute directly
1          BNE  R2,R3,Loop    3           7                                            Wait for ADDI
2          LD   R2,0(R1)      4           8             9               10             Wait for BNE
2          ADDI R2,R2,#1      4           11                            12             Wait for LD
2          SD   R2,0(R1)      5           9             13                             Wait for ADDI
2          ADDI R1,R1,#8      5           8                             9              Wait for BNE
2          BNE  R2,R3,Loop    6           13                                           Wait for ADDI
3          LD   R2,0(R1)      7           14            15              16             Wait for BNE
3          ADDI R2,R2,#1      7           17                            18             Wait for LD
3          SD   R2,0(R1)      8           15            19                             Wait for ADDI
3          ADDI R1,R1,#8      8           14                            15             Wait for BNE
3          BNE  R2,R3,Loop    9           19                                           Wait for ADDI
Multiple Issue Example (with Speculation)
Iteration  Instruction        Issues at    Executes at   Read access    Write CDB     Commits at    Comment
           clock cycle        clock cycle  at clock cycle at clock cycle clock cycle
1          LD   R2,0(R1)      1            2             3              4             5             First issue
1          ADDI R2,R2,#1      1            5                            6             7             Wait for LD
1          SD   R2,0(R1)      2            3                                          7             Wait for ADDI
1          ADDI R1,R1,#8      2            3                            4             8             Commit in order
1          BNE  R2,R3,Loop    3            7                                          8             Wait for ADDI
2          LD   R2,0(R1)      4            5             6              7             9             No execute delay
2          ADDI R2,R2,#1      4            8                            9             10            Wait for LD
2          SD   R2,0(R1)      5            6                                          10            Wait for ADDI
2          ADDI R1,R1,#8      5            6                            7             11            Commit in order
2          BNE  R2,R3,Loop    6            10                                         11            Wait for ADDI
3          LD   R2,0(R1)      7            8             9              10            12            Earliest possible
3          ADDI R2,R2,#1      7            11                           12            13            Wait for LD
3          SD   R2,0(R1)      8            9                                          13            Wait for ADDI
3          ADDI R1,R1,#8      8            9                            10            14            Executes earlier
3          BNE  R2,R3,Loop    9            13                                         14            Wait for ADDI
Register Renaming
Register renaming vs. reorder buffers:
Instead of virtual registers from reservation stations and reorder buffer, create a single register pool
Contains visible registers and virtual registers
Use hardware-based map to rename registers during issuing process
WAW and WAR hazards are avoided
Speculation recovery occurs by copying during commit
Still need a ROB-like queue to update table in order
Simplifies commit:
Record that mapping between architectural register and physical register is no longer speculative
Free up physical register used to hold older value
In other words: SWAP physical registers on commit
Physical register de-allocation is more difficult
Simple approach: deallocate a virtual register when the next instruction writes to its mapped architecturally visible register
Integrated Issue and Renaming
Combining instruction issue with register renaming:
Issue logic pre-reserves enough physical registers for the bundle
Issue logic finds dependencies within bundle, maps registers as necessary
Issue logic finds dependencies between the current bundle and bundles already in execution, and maps registers as necessary
Instr. #  Instruction      Physical register assigned   Instruction with physical   Rename map change
                           or destination               register numbers
1         ADD R1,R2,R3     p32                          ADD p32,p2,p3               R1→p32
2         SUB R1,R1,R2     p33                          SUB p33,p32,p2              R1→p33
3         ADD R2,R1,R2     p34                          ADD p34,p33,p2              R2→p34
4         SUB R1,R3,R2     p35                          SUB p35,p3,p34              R1→p35
5         ADD R1,R1,R2     p36                          ADD p36,p35,p34             R1→p36
6         SUB R1,R3,R1     p37                          SUB p37,p3,p36              R1→p37
The difficulty in this process is that all the renaming and replacement of architectural registers by physical ones must happen in a single clock cycle, not sequentially.
It is up to the issue logic to find all the dependencies and rewrite the instructions in parallel (a sequential sketch of the same behavior follows below).
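As an illustration only (the helper name rename_bundle, the initial mapping of Ri to pi, and the simple bump allocator standing in for a free list are assumptions, not part of the lecture), the following C sketch performs per-bundle renaming sequentially; real issue logic does the equivalent work with parallel comparators in a single cycle.

#define NUM_ARCH_REGS 32

typedef struct {
    int opcode;                /* not interpreted here                      */
    int dst, src1, src2;       /* architectural register numbers on input,  */
} Instr;                       /* physical register numbers on output       */

/* Rename a whole issue bundle, reproducing the table above: each
 * instruction's sources are looked up in the current map (which already
 * reflects earlier instructions of the same bundle), then its
 * destination gets a fresh physical register and the map is updated
 * before the next instruction is handled.                               */
void rename_bundle(Instr bundle[], int n,
                   int rename_map[NUM_ARCH_REGS],
                   int *next_free_phys)
{
    for (int i = 0; i < n; i++) {
        bundle[i].src1 = rename_map[bundle[i].src1];
        bundle[i].src2 = rename_map[bundle[i].src2];

        int p = (*next_free_phys)++;     /* a real renamer pops a free list       */
        rename_map[bundle[i].dst] = p;   /* later instructions see the new name   */
        bundle[i].dst = p;               /* rewrite the destination as well       */
    }
}

Starting from a map in which architectural register Ri points to physical register pi and with p32 as the first free physical register, running the six instructions of the table through this loop yields exactly the assignments and rename-map changes shown (p32 through p37).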
How Much Speculation?
How much to speculate?
Mis-speculation degrades performance and power relative to no speculation
  May cause additional misses (cache, TLB)
Prevent speculative code from causing higher-cost misses (e.g., in L2)
Speculating through multiple branches complicates speculation recovery
Speculation and energy efficiency
Note: speculation is only energy efficient when it significantly improves performance
How Much Speculation?
[Chart: misspeculation percentage (y-axis 0% to 45%) across SPEC2000 integer benchmarks (164.gzip, 175.vpr, 176.gcc, 186.crafty, 181.mcf) and floating-point benchmarks (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa).]
Value prediction
Predict the value an instruction will produce so that dependent instructions can proceed early.
Uses:
  Loads that load from a constant pool
  Instructions that produce a value from a small set of values
Not incorporated into modern processors.
A similar idea, address aliasing prediction, is used on some processors to determine whether two stores, or a load and a store, reference the same address, to allow reordering.
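For illustration only, since the slide notes value prediction is not used in real processors: a last-value predictor indexed by the load's PC, with an assumed table size and a small saturating confidence counter; all names and sizes here are hypothetical.

#include <stdbool.h>
#include <stdint.h>

#define VP_ENTRIES 1024

/* Last-value predictor: predicts that a load will return the same value
 * it returned last time.  This works well for loads from a constant
 * pool or instructions that produce values from a small set.           */
typedef struct {
    uint64_t tag;
    uint64_t last_value;
    uint8_t  confidence;      /* saturating counter: predict only when high */
} VpEntry;

static VpEntry vp_table[VP_ENTRIES];

bool value_predict(uint64_t pc, uint64_t *predicted)
{
    VpEntry *e = &vp_table[(pc >> 2) % VP_ENTRIES];
    if (e->tag == pc && e->confidence >= 2) {
        *predicted = e->last_value;
        return true;
    }
    return false;             /* no confident prediction available        */
}

void value_train(uint64_t pc, uint64_t actual)
{
    VpEntry *e = &vp_table[(pc >> 2) % VP_ENTRIES];
    if (e->tag != pc) {                      /* new load: reset the entry */
        e->tag = pc;
        e->last_value = actual;
        e->confidence = 0;
    } else if (e->last_value == actual) {
        if (e->confidence < 3) e->confidence++;
    } else {
        e->last_value = actual;              /* value changed: retrain    */
        e->confidence = 0;
    }
}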
END OF SECTION
Lesson 06 – Instruction-Level Parallelism: Case Studies
Fallacies and Pitfalls
Fallacies and Pitfalls
It is easy to predict the performance/energy efficiency of two different versions of the same ISA if we hold the technology constant
[Chart: i7 920 and Atom 230 performance and energy ratio. Relative speedup and relative energy efficiency (scale roughly 0 to 11) across Java benchmarks (_201_compress, _202_jess, _209_db, _213_javac, _212_mpegaudio, _228_jack, antlr, Bloat, Fop, Luindex) and SPEC CPU2006 integer and floating-point benchmarks (400.perlbench through 483.xalancbmk, 416.gamess through 482.sphinx3).]
Fallacies and Pitfalls
Processors with lower CPIs / faster clock rates will also be faster
Processor             Implementation   Clock rate   Power                                     SPECint2006 base   SPECfp2006 base
                      technology
Intel Pentium 4 670   90 nm            3.8 GHz      115 W                                     11.5               12.2
Intel Itanium 2       90 nm            1.66 GHz     104 W total, approx. 70 W for one core    14.5               17.3
Intel i7 920          45 nm            3.3 GHz      130 W total, approx. 80 W for one core    35.5               38.4

The Pentium 4 relied on a higher clock rate, the Itanium 2 on a lower CPI at a lower clock rate, yet the i7 far outperforms both on SPEC2006.
Fallacies and Pitfalls
Sometimes bigger and more aggressive is better
Pentium 4 and Itanium were advanced designs, but could not achieve their peak instruction throughput because of relatively small caches as compared to i7
And sometimes smarter is better than bigger and more aggressive
The TAGE branch predictor outperforms gshare with fewer stored predictions
Fallacies and Pitfalls
Believing that there are large amounts of ILP available, if only we had the right techniques
[Chart: degree of parallelism achieved for gcc, espresso, li, fpppp, doduc, and tomcatv as the instruction window size is varied (infinite, 256, 128, 64, 32); even with an infinite window the available parallelism is far from unlimited, and it drops further as the window shrinks.]
END OF SECTION