
Lesson 06 – Thread-Level Parallelism: Introduction
Introduction

Introduction
Pipelining became a universal technique by 1985  Overlaps the execution of instructions
Beyond pipelining, Instruction-Level Parallelism (ILP)  Executes instructions in parallel
 There are two main approaches:
Hardware-based dynamic approaches: used in server and desktop processors
Software-based static approaches: not as successful outside of scientific applications

Instruction-Level Parallelism
 When exploiting instruction-level parallelism, the goal is to minimize CPI
 Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
 Parallelism within a basic block is limited
 Typical size of a basic block = 3-6 instructions
 Must optimize across branches

Data Dependence
Major challenge for instruction-level parallelism: data dependence
 Instruction j is data dependent on instruction i if:
 Instruction i produces a result that may be used by instruction j, or
 Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
Dependent instructions cannot be executed simultaneously.

Data Dependence
Dependencies are a property of programs:
 Pipeline organization determines if dependence is detected and if it causes a stall
Data dependence conveys:
 Possibility of a hazard
 Order in which results must be calculated
 Upper bound on exploitable instruction level parallelism
Dependencies that flow through memory locations are difficult to detect.

Name Dependence
Two instructions use the same register or memory name, but there is no flow of information between them.
 Not a true data dependence, but a problem when reordering instructions
 Anti-dependence: instruction j writes a register or memory location that instruction i reads
 Initial ordering (i before j) must be preserved
 Output dependence: instruction i and instruction j write the same register or memory location
 Ordering must be preserved
To resolve, use register renaming techniques.

Data Hazards:
 Read after write (RAW)
 Write after write (WAW)
 Write after read (WAR)
Other Factors
Control Dependence:
 Ordering of instruction i with respect to a branch instruction
 Instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
 An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch

Control Dependence
Example 1:
   ADD R1,R2,R3
   BEQ R4,R0,L
   SUB R1,R1,R6
   ....
L: ....
   OR  R7,R1,R8
The OR instruction depends on the ADD and the SUB.

Example 2:
      ADD R1,R2,R3
      BEQ R12,R0,skip
      SUB R4,R5,R6
      ADD R5,R4,R9
      ....
skip: ....
      OR  R7,R8,R9
 Assume R4 isn't used after skip
 Possible to move SUB before the BEQ

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Loop Unrolling
Loop Unrolling

Compiler Techniques for Exposing ILP
Loop Unrolling: a loop transformation technique to optimize program execution through static pipeline scheduling of dependent instructions
Example:
 for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
Latencies between a producing and a consuming instruction (in clock cycles):

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

FP Loop: Where Are the Hazards?
 To simplify, assume 8 is lowest address, R1 = base address of x and F2 = s
Loop: L.D    F0,0(R1)    ;F0=vector element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    0(R1),F4    ;store result
      DADDUI R1,R1,#-8   ;decrement pointer 8 bytes (DW)
      BNEZ   R1,Loop     ;branch if R1!=zero

FP Loop Showing Stalls
1 Loop: L.D    F0,0(R1)    ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2    ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4    ;store result
7       DADDUI R1,R1,#-8   ;decrement pointer 8 bytes (DW)
8       stall              ;assumes can't forward to branch
9       BNEZ   R1,Loop     ;branch if R1!=zero

9 clock cycles: rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls
1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4    ;altered offset since DADDUI was moved up
7       BNEZ   R1,Loop

Swap DADDUI and S.D by changing the address of S.D.
7 clock cycles, but only 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead. How to make it faster?

Loop unrolling:
 Unroll by a factor of 4 (assume # elements is divisible by 4)
 Eliminate unnecessary instructions (DADDUI and BNEZ)
 Optimize (reschedule the code)
1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2
6        S.D    0(R1),F4
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32
26       BNEZ   R1,Loop

Please note the number of live registers vs. the original loop.

Loop Unrolled 4 Times

1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2       ;1 cycle stall after the load
6        S.D    0(R1),F4       ;2 cycles stall after the ADD.D; drop DSUBUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8      ;drop DSUBUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12    ;drop DSUBUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32     ;alter to 4*8
26       BNEZ   R1,Loop

27 clock cycles, or 6.75 per iteration (assumes the loop count is a multiple of 4).
Rewrite the loop to minimize stalls?

Optimize Unrolled Loop by Minimizing Stalls
1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       S.D    8(R1),F16      ; 8-32 = -24
14       BNEZ   R1,Loop

14 clock cycles, or 3.5 per iteration.

Loop Unrolling Factor
 Do not usually know upper bound of loop
 Suppose it is n, and we would like to unroll
the loop to make k copies of the body
 Instead of a single unrolled loop, we
generate a pair of consecutive loops:
 1st executes (n mod k) times and has a body that is the original loop
 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
 For large values of n, most of the execution time will be spent in the unrolled loop
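
As a rough illustration of this split (not code from the lecture), a minimal C sketch for the running x[i] = x[i] + s example, assuming an unroll factor k = 4 and an element count n known only at run time:

/* First loop runs (n mod k) times with the original body;
   second loop is the body unrolled k = 4 times and runs n/k times. */
void add_scalar(double *x, long n, double s) {
    long i = 0;
    for (; i < n % 4; i++)              /* cleanup: n mod k iterations */
        x[i] = x[i] + s;
    for (; i < n; i += 4) {             /* unrolled body: n/k iterations */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}

For large n the cleanup loop is negligible, which is why most of the execution time lands in the unrolled loop.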

5 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1 Determine that loop unrolling is useful by finding that the loop iterations are independent (except for the loop maintenance code)
2 Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3 Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4 Determine that the loads and stores in the unrolled loop can be interchanged, by observing that loads and stores from different iterations are independent
 This transformation requires analyzing the memory addresses and making sure that they do not refer to the same address
5 Reschedule the instructions in the unrolled loop, preserving any dependences needed to yield the same result as the original code

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Dynamic Branch Prediction
Basic Dynamic Branch Prediction

Static Branch Prediction
 Rescheduling code around a delayed branch requires predicting the branch statically at compile time
 Simplest scheme: predict every branch as taken
 Average misprediction rate = untaken branch frequency = 34% on the SPEC benchmarks, but the rate varies widely across programs (roughly 4% to 22% in the figure)
 A more accurate scheme predicts branches using profile information collected from earlier runs and modifies the prediction based on the last run

[Figure: misprediction rate per SPEC benchmark (integer: compress, eqntott, espresso, gcc, li; floating point: doduc, ear, hydro2d, mdljdp, su2cor), ranging from about 4% to 22%.]

Dynamic Branch Prediction
Why does prediction work?
 Underlying algorithm has regularities
 Data that is being operated on has regularities
 Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems
Can we implement a dynamic branch prediction which will be executed at runtime?
 There is a small number of important branches in programs which have dynamic behavior

Dynamic Branch Prediction
 Performance = ƒ(accuracy, cost of misprediction)
 Branch History Table (BHT): Lower bits of PC
address index table of 1-bit values
 Says whether or not branch taken last time
 No address check
 Problem: in a loop, 1-bit BHT will cause two
mispredictions (avg is 9 iterations before exit):
 End of loop case, when it exits instead of looping as before
 First time through loop on next time through code, when it predicts exit instead of looping
 Solution: 2-bit scheme where change prediction only if get misprediction twice

Branch Prediction
Basic 2-bit predictor:  For each branch:
 Predict taken or not taken
 If the prediction is wrong two consecutive times,
change prediction
Correlating predictor:
 Multiple 2-bit predictors for each branch
 One for each possible combination of outcomes of preceding n branches
 (m,n) predictor: uses the behavior of the last m branches to choose from 2^m n-bit predictors
Tournament predictor:
 Combine correlating predictor with local predictor

2-Bit Branch Predictor
 Idea: change the prediction only after two consecutive mispredictions

[Figure: 2-bit predictor state diagram with four states: Predict Taken (strong), Predict Taken (weak), Predict Not Taken (weak), Predict Not Taken (strong). A taken outcome (T) moves the state toward the taken side, a not-taken outcome (NT) toward the not-taken side; green = go (taken), red = stop (not taken).]

 Adds hysteresis to the decision-making process
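
A minimal C sketch of one such 2-bit saturating counter (illustrative, not from the lecture): states 0-1 predict not taken, states 2-3 predict taken, and the prediction only flips after two consecutive mispredictions.

/* One 2-bit saturating counter. */
typedef unsigned char counter2_t;          /* holds values 0..3 */

int predict_taken(counter2_t c) {
    return c >= 2;                         /* 2 and 3 predict taken */
}

void train(counter2_t *c, int taken) {
    if (taken) {
        if (*c < 3) (*c)++;                /* saturate at 3 (strongly taken) */
    } else {
        if (*c > 0) (*c)--;                /* saturate at 0 (strongly not taken) */
    }
}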

2-Bit Branch Predictor Accuracy
[Figure: misprediction rate of a 4096-entry 2-bit BHT on SPEC89 benchmarks (eqntott, espresso, gcc, li, spice, doduc, fpppp, matrix300, nasa7). Rates range from about 0-1% on the floating-point codes (nasa7, matrix300) up to 18% (eqntott); integer programs mispredict more often than floating-point programs.]

Mispredictions occur because either:
 The guess was wrong for that branch, or
 The branch history of a different branch was used when indexing the BHT

Correlating Branch Prediction
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch-history table.
In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters.
 Thus, the old 2-bit BHT is a (0,2) predictor
Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches.
Each entry in the table has 2^m n-bit predictors.
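
A hedged C sketch of the lookup and update for an (m, 2) predictor with a global history shift register; the table size, m = 2, and the indexing are illustrative assumptions, not the lecture's:

#include <stdint.h>

#define M    2                               /* history length (m) */
#define ROWS 1024                            /* BHT entries */
static uint8_t bht[ROWS][1 << M];            /* 2-bit counters, values 0..3 */
static uint8_t ghist;                        /* last M branch outcomes */

int correlating_predict(uint32_t pc) {
    uint32_t row = (pc >> 2) & (ROWS - 1);   /* drop byte offset, mask to table */
    return bht[row][ghist & ((1 << M) - 1)] >= 2;
}

void correlating_update(uint32_t pc, int taken) {
    uint32_t row = (pc >> 2) & (ROWS - 1);
    uint8_t *c = &bht[row][ghist & ((1 << M) - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = (uint8_t)((ghist << 1) | (taken & 1));   /* shift in the outcome */
}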

Correlating Branch Prediction
(2,2) predictor:
 The behavior of the two most recent branches selects between four predictions for the next branch, and only that prediction is updated

[Figure: the low-order bits of the branch address (4 bits) and the 2-bit global branch history together index a table of 2-bit per-branch predictors to produce the prediction.]

Branch Prediction Performance

[Figure: frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT. The correlating (2,2) predictor, despite using far fewer bits, matches or outperforms even the unlimited 2-bit BHT; rates range from roughly 0-1% on the FP codes up to 18% for the 2-bit tables vs. about 5-6% for the (2,2) predictor on the worst integer codes.]

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Dynamic Branch Prediction
Advanced Dynamic Branch Prediction

The motivation for correlating branch predictors came from the standard 2-bit predictor using only local information.
Tournament predictors use multiple predictors (a global and a local predictor) and choose between them.

Tournament Predictors
A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:
Global predictor:
 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
 Each entry is a standard 2-bit predictor
Local predictor:
 Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
 The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch (the global predictor is chosen roughly 40% of the time for SPEC integer and less than 15% of the time for SPEC FP).
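
A small C sketch of just the selection logic; the component global and local predictions are passed in, and the table size and names are illustrative assumptions:

#include <stdint.h>

#define SEL_ROWS 4096
static uint8_t selector[SEL_ROWS];           /* 2-bit: 0..1 favor global, 2..3 favor local */

/* Choose between the two component predictions for this branch. */
int tournament_predict(uint32_t pc, int global_pred, int local_pred) {
    return (selector[(pc >> 2) % SEL_ROWS] >= 2) ? local_pred : global_pred;
}

/* Train the selector only when the two components disagree. */
void tournament_update(uint32_t pc, int global_pred, int local_pred, int taken) {
    uint8_t *s = &selector[(pc >> 2) % SEL_ROWS];
    if (global_pred == local_pred) return;
    if (local_pred  == taken && *s < 3) (*s)++;      /* local was right  */
    if (global_pred == taken && *s > 0) (*s)--;      /* global was right */
}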

Tournament Predictors
 Using multilevel branch prediction achieves better accuracy and makes effective use of a large number of prediction bits
 Use an n-bit saturating counter to choose between predictors
 The usual choice is between a global and a local predictor

[Figure: 2-bit selector state machine with states "use predictor 1" and "use predictor 2" (strong and weak). Each transition is labeled P1/P2 = {0,1}/{0,1}, where 0 = prediction incorrect and 1 = prediction correct; 1/0 outcomes move toward predictor 1, 0/1 outcomes move toward predictor 2, and 0/0 or 1/1 outcomes leave the state unchanged.]

[Figure: two predictor organizations. gshare predictor: the branch address is combined with a 10-bit global branch history register (a shift register of the most recent branch results, not taken/taken) by an exclusive OR to index 1024 2-bit predictors and produce the prediction. Tournament predictor: the branch address and branch history index separate global and local predictor tables; a selector chooses between the two predictions through a multiplexer.]

Branch Prediction Performance

[Figure: conditional branch misprediction rate (0-8%) vs. total predictor size (0 to 512 Kbits) for local 2-bit predictors, correlating predictors, and tournament predictors; tournament predictors give the lowest misprediction rate at every size, and all curves flatten as the predictor grows.]

Tagged Hybrid Predictors
 Problem: reduce the size of the tables required for each branch and history
 Solution: tagged hybrid branch predictor
 Combines multiple predictors and tracks whether a prediction is associated with the current branch or not
 Employs a series of global predictors indexed with different history lengths
 Uses hash tables, whose hash value is based on the branch address and branch history
 Since longer histories increase the chance of hash collisions, multiple tables with increasingly shorter histories are used
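
A loose C sketch of the lookup side of such a tagged scheme, in the spirit of TAGE; the hash, table sizes, history lengths, and tag width here are made-up illustrations, not the real predictor's functions:

#include <stdint.h>

#define NTABLES 4
#define ENTRIES 1024                                 /* per-table entries (10 index bits) */
static const int hist_len[NTABLES] = {5, 10, 20, 40};

struct tagged_entry { uint16_t tag; uint8_t ctr; };  /* ctr: small saturating counter */
static struct tagged_entry tbl[NTABLES][ENTRIES];
static uint8_t  base_ctr[ENTRIES];                   /* tagless base predictor */
static uint64_t ghist;                               /* global history bits */

static uint32_t fold(uint64_t h) {                   /* fold history into 10 bits */
    uint32_t r = 0;
    while (h) { r ^= (uint32_t)h & (ENTRIES - 1); h >>= 10; }
    return r;
}

int tagged_predict(uint32_t pc) {
    for (int t = NTABLES - 1; t >= 0; t--) {         /* longest history first */
        uint64_t h   = ghist & ((1ULL << hist_len[t]) - 1);
        uint32_t idx = (fold(h) ^ (pc >> 2)) & (ENTRIES - 1);
        uint16_t tag = (uint16_t)(((pc >> 2) ^ (uint32_t)(h >> 3)) & 0x3FF);
        if (tbl[t][idx].tag == tag)                  /* tag hit: entry belongs to this branch */
            return tbl[t][idx].ctr >= 4;             /* 3-bit counter, taken if >= 4 */
    }
    return base_ctr[(pc >> 2) & (ENTRIES - 1)] >= 2; /* fall back to the base predictor */
}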

Tagged Hybrid Predictors

[Figure: organization of a tagged hybrid predictor. A base predictor P(0) is indexed by the PC alone; predictors P(1) through P(4) are indexed by hashes of the PC and history segments h[0:L(1)] through h[0:L(4)] of increasing length. Each tagged entry holds a prediction (pred) and a tag; tag comparisons (=?) determine which predictor supplies the final prediction.]

Tagged Hybrid Predictor Performance

[Figure: misses per one thousand instructions for TAGE vs. gshare on SPECfp, SPECint, multimedia, and server workloads; TAGE mispredicts substantially less in every category (chart values range from about 0.6 to 6.8 misses per thousand instructions).]

Branch Target Buffers (BTB)
 Branch hazards are caused by delays needed for calculating the branch condition and the branch target address
 Calculation of the branch target address is costly and stalls the instruction fetch
 Branch Target Buffer (BTB) stores PCs of recent branch instructions and their calculated target addresses.
 As in caches, when a match is found the corresponding predicted PC is returned
 If the branch was predicted taken, instruction fetch continues at the returned predicted PC
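
A minimal C sketch of a direct-mapped BTB lookup; the entry count and layout are illustrative assumptions:

#include <stdint.h>

#define BTB_ENTRIES 512

struct btb_entry { uint32_t branch_pc; uint32_t target_pc; int valid; };
static struct btb_entry btb[BTB_ENTRIES];

/* Returns 1 and sets *next_pc on a hit; 0 means fetch PC + 4 as usual. */
int btb_lookup(uint32_t pc, uint32_t *next_pc) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->branch_pc == pc) {      /* tag check: must be this exact branch */
        *next_pc = e->target_pc;
        return 1;
    }
    return 0;
}

void btb_insert(uint32_t pc, uint32_t target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->branch_pc = pc; e->target_pc = target; e->valid = 1;
}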

Dynamic Branch Prediction
 Branch Target Buffer: dynamic branch prediction uses the address of the branch as an index to get both the prediction AND the branch target address (if taken)

[Figure: the PC of the instruction being fetched is compared (=?) against the stored branch PCs; each entry also holds a predicted PC and may hold extra prediction state bits. No match: the instruction is not predicted as a branch, proceed normally (next PC = PC + 4). Match: the instruction is a branch, and the predicted PC is used as the next PC.]

Please note: the fetch PC must match the stored branch PC, since using another branch's target address would be wrong.

Branch-Target Buffer

[Figure: the PC of the instruction to fetch is looked up against the Branch PC (tag) column of the branch-target buffer; on a match (=), the Predicted PC (data) column supplies the next PC. Yes: the instruction is a taken branch and the predicted PC should be used as the next PC. No: the instruction is not predicted to be a taken branch; proceed normally. The number of entries determines the buffer's capacity.]

Branch-Target Buffer

[Flowchart:]
IF: send the PC to memory and to the branch-target buffer.
ID: if no entry was found in the buffer and the instruction turns out to be a taken branch, enter the branch instruction address and the next PC into the branch-target buffer; if it is not a taken branch, normal instruction execution continues. If an entry was found, send out the predicted PC.
EX: if the branch was mispredicted, kill the fetched instruction, restart the fetch at the other target, and delete the entry from the target buffer. If the branch was correctly predicted, continue execution with no stalls.

Branch Folding
Optimization:
 For unconditional branches, store the target instructions themselves in the buffer
 Requires a larger branch-target buffer ("branch folding")
 Adding the target instruction to the buffer also hides the longer access time required by the larger buffer

Return Address Predictor
 Most unconditional branches come from function returns  The same procedure can be called from multiple sites
 This causes the buffer to potentially forget the return address from previous calls
 Create return address buffer organized as a stack
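
A small C sketch of such a return-address stack; the depth and the wrap-around policy are illustrative assumptions:

#include <stdint.h>

#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static int ras_top;                      /* index of next free slot */

void ras_push(uint32_t return_pc) {      /* on a call instruction */
    ras[ras_top % RAS_DEPTH] = return_pc;
    ras_top++;                           /* deep call chains simply wrap and overwrite */
}

uint32_t ras_pop(void) {                 /* on a return: predicted target */
    if (ras_top > 0) ras_top--;
    return ras[ras_top % RAS_DEPTH];     /* may be stale if the stack overflowed */
}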

Return Address Predictor

[Figure: misprediction frequency (0-70%) vs. number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex; misprediction frequency drops sharply as entries are added, approaching zero for most programs at 8-16 entries.]

Integrated Instruction Fetch Unit
 Design monolithic unit that performs:  Branch prediction
 Instruction prefetch
 Fetch ahead
 Instruction memory access and buffering
 Deals with crossing cache lines

Dynamic Branch Prediction (Summary)
 Prediction becoming important part of execution
 Branch History Table: 2 bits for loop accuracy
 Correlation: Recently executed branches correlated with next branch
 Either different branches or different executions of same branches
 Tournament predictors take insight to next level, by using multiple predictors
 Usually one based on global information and one based on local information, and combining them with a selector
 Branch Target Buffer: include branch address & prediction
 Branch folding stores the target instructions themselves for unconditional branches
 A return address buffer (stack) caters for function returns

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Dynamic Scheduling Introduction

Dynamic Scheduling
Rearrange order of instructions to reduce stalls while maintaining data flow.
Advantages:
 The compiler doesn't need to have knowledge of the microarchitecture
 Handles cases where dependences are unknown at compile time
Disadvantages:
 Substantial increase in hardware complexity
 Complicates exception handling

Dynamic scheduling implies:  Out-of-order execution  Out-of-order completion
Example 1
FDIV.D F0,F2,F4 FADD.D F10,F0,F8 FSUB.D F12,F8,F14
Dynamic Scheduling
FSUB.D is not dependent on FDIV.D, so it need not wait for the stalled FADD.D and can execute out of order.

Dynamic Scheduling
FADD.D is not dependent, but the anti-dependence in F0 makes it impossible to issue earlier without register renaming.
Example 2
FDIV.D F0,F2,F4 FMUL.D F6,F0,F8 FADD.D F0,F10,F14

Example 3
FDIV.D F0,F2,F4 FADD.D F6,F0,F8 SD.D F6,0(X1) FSUB.D F8,F10,F14 FMUL.D F6,F10,F8
Data dependence through F0
Anti-dependence through F8
Output dependence through F6
Register Renaming
Name dependences involve F6 and F8.

Register Renaming
Example 3
FDIV.D F0,F2,F4 FADD.D S,F0,F8 SD.D S,0(X1) FSUB.D T,F10,F14 FMUL.D F6,F10,T
Now only data-dependences (RAW hazards) remain, which can be strictly ordered.

Register Renaming
Register renaming is provided by reservation stations (RS).
Tomasulo's Algorithm:
 Tracks when operands are available
 Introduces register renaming in hardware
 Minimizes WAW and WAR hazards
Each reservation station contains:
 The instruction
 Buffered operand values (when available)
 The reservation station number of the instruction providing the operand values

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Tomasulo’s Algorithm

Tomasulo’s Algorithm
 Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
 Tomasulo’s Algorithm: Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
 Issue—Decode instructions, check for structural hazards
 Read operands—Wait until no data hazards, then read operands

Tomasulo’s Algorithm
 Control & buffers distributed with Function Units (FU)
 FU buffers called “Reservation Stations” (RS)
 RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
 As instructions are issued, the register specifiers are replaced by values or pointers to RS
 Form of register renaming
 Avoids WAR and WAW hazards
 Pending instructions designate the RS to which they will
send their output
 May be more reservation stations than registers, so can do optimizations compilers can’t

Tomasulo’s Algorithm
 Result values broadcast from RS to all FUs on a result bus, called the common data bus (CDB)
 Results do not go through registers  Form of bypassing
 Only the last output updates the register file
 Load and Stores treated as FUs with buffers
 Contain data and addresses, act like reservation stations
 Integer instructions can go past branches, allowing FP ops beyond basic block in
FP queue

Tomasulo’s Algorithm
From instruction unit
Instruction queue
Load/Store operations
FP registers
Store buffers
Floating-point operations
Address unit
Load buffers
Address Data
Memory unit
FP adders
Operation bus
321
stations 1
Operand buses
Reservation 2
Common data bus (CDB)
FP multipliers

Tomasulo's Algorithm
Three Steps:

Issue:
 Get the next instruction from the FIFO queue
 If a matching reservation station is available, issue the instruction to that RS, together with any operand values that are already available
 If no reservation station is available, stall the instruction (structural hazard)

Execute:
 When an operand becomes available (broadcast on the CDB), store it in any reservation stations waiting for it
 When all operands are ready, execute the operation
 Loads and stores are maintained in program order through their effective addresses
 No instruction is allowed to initiate execution until all branches that precede it in program order have completed

Write result:
 Write the result on the CDB into the reservation stations and store buffers waiting for it
 Stores must wait until both the address and the value are received

Tomasulo’s Algorithm Execution
1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instruction and sends operands (renames registers).
2. Execute: operate on operands
When both operands ready then execute; if not ready, watch Common Data Bus for result.
3. Write result: finish execution
Write on Common Data Bus to all awaiting units; mark reservation station available.
 Normal data bus: data + destination (“go to” bus)
 Common data bus: data + source (“come from” bus)
 64 bits of data + 4 bits of Functional Unit source address
 Write if matches expected Functional Unit (produces result)  Does the broadcast

Tomasulo’s Algorithm Example 1
Code Sequence
LD.D  F6,32(R2)
LD.D  F2,44(R3)
FMUL  F0,F2,F4
FSUB  F8,F2,F6
FDIV  F10,F0,F6
FADD  F6,F8,F2

Tomasulo's Algorithm Snapshot

Instruction status:
Instruction         Issue  Execute  Write result
LD.D  F6,32(R2)       ✔       ✔          ✔
LD.D  F2,44(R3)       ✔
FMUL  F0,F2,F4        ✔
FSUB  F8,F2,F6        ✔
FDIV  F10,F0,F6       ✔
FADD  F6,F8,F2        ✔

Reservation stations:
Name   Busy  Op    Vj   Vk            Qj     Qk     A
Load1  No
Load2  Yes   Load                                   44+[R3]
Add1   Yes   FSUB       M[32+[R2]]    Load2
Add2   Yes   FADD                     Add1   Load2
Add3   No
Mult1  Yes   FMUL       [F4]          Load2
Mult2  Yes   FDIV       M[32+[R2]]    Mult1

Register result status:
Field  F0     F2     F4  F6    F8    F10    F12 ... F30
FU     Mult1  Load2      Add2  Add1  Mult2

Tomasulo’s Algorithm Example 2
Loop Iterations
Loop: LD.D   F0,0(R1)
      FMUL.D F4,F0,F2
      SD.D   F4,0(R1)
      SUBI   R1,R1,#8
      BNE    R1,R2,Loop   ;branches if R1 != R2

Tomasulo's Algorithm Snapshot

Instruction status:
Instruction       From iteration  Issue  Execute  Write result
LD.D  F0,0(R1)          1           ✔       ✔
FMUL  F4,F0,F2          1           ✔
SD.D  F4,0(R1)          1           ✔
LD.D  F0,0(R1)          2           ✔       ✔
FMUL  F4,F0,F2          2           ✔
SD.D  F4,0(R1)          2           ✔

Reservation stations (store buffers added):
Name    Busy  Op     Vj   Vk    Qj      A
Load1   Yes   Load                      [R1]+0
Load2   Yes   Load                      [R1]-8
Add1    No
Add2    No
Add3    No
Mult1   Yes   FMUL        [F2]  Load1
Mult2   Yes   FMUL        [F2]  Load2
Store1  Yes   Store             Mult1   [R1]
Store2  Yes   Store             Mult2   [R1]-8

Register result status:
Field  F0     F2  F4     F6  F8  F10  F12 ... F30
FU     Load2      Mult2

Tomasulo's Algorithm Organization

[Figure: the FP op queue and FP registers feed the reservation stations (Add1-Add3 in front of the FP adders, Mult1-Mult2 in front of the FP multipliers). Six load buffers bring data from memory and three store buffers send data to memory. All results are broadcast on the common data bus (CDB).]

Assumed latencies: FP add/sub 3 clocks, FP multiply 10 clocks, FP divide 40 clocks.

Reservation Station Components
 Op: Operation to perform in the unit (e.g., + or –)
 Vj, Vk: values of the source operands
 Store buffers have a single V field, the result to be stored
 Qj, Qk: the reservation stations producing the source operands (blank/zero means the value is already available)
 Store buffers only have a Qi field, for the RS producing the value to be stored
 Busy: indicates that the reservation station or FU is busy
Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
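
A hedged C sketch of these fields and of the renaming done at issue; the station count, register count, and names are illustrative assumptions, and the execute/write-result steps are omitted:

#define NUM_RS   8
#define NUM_REGS 32

struct rs_entry {
    int    busy;
    int    op;            /* operation code, e.g. an OP_FADD constant */
    double vj, vk;        /* source operand values, valid only when qj/qk == 0 */
    int    qj, qk;        /* RS number producing the operand, 0 = value already here */
};

static struct rs_entry rs[NUM_RS + 1];   /* stations 1..NUM_RS; index 0 means "none" */
static double regfile[NUM_REGS];
static int    reg_status[NUM_REGS];      /* RS that will write each register, 0 = none */

/* Capture one source operand: either its value or the RS that will produce it. */
static void read_operand(int src_reg, double *v, int *q) {
    if (reg_status[src_reg] != 0)
        *q = reg_status[src_reg];        /* rename: wait on that station's result */
    else {
        *v = regfile[src_reg];
        *q = 0;
    }
}

/* Issue: claim a free station, rename the sources, and claim the destination. */
int issue(int op, int dest, int src1, int src2) {
    for (int r = 1; r <= NUM_RS; r++) {
        if (!rs[r].busy) {
            rs[r].busy = 1;
            rs[r].op   = op;
            read_operand(src1, &rs[r].vj, &rs[r].qj);
            read_operand(src2, &rs[r].vk, &rs[r].qk);
            reg_status[dest] = r;        /* later readers of dest now name this RS */
            return r;
        }
    }
    return 0;                            /* no free station: structural hazard, stall */
}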

Tomasulo’s Algorithm Example
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2
LD.D F2
45+ R3
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Instruction stream
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
FU Countdown
3 Load/Buffers
3 FP Adder R.S. 2 FP Mult R.S.
F0 F2 F4 F6 F8 F10 F12 … F30
FU
Clock
0
Register result status:
Clock cycle counter

Tomasulo’s Algorithm Example (Cycle 1)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1
LD.D F2
45+ R3
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1
Clock
1
Register result status:

Tomasulo’s Algorithm Example (Cycle 2)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1
LD.D F2
45+ R3 2
FMUL F0
F2 F4
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Load1
Clock
2
Register result status:

Tomasulo’s Algorithm Example (Cycle 3)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3
LD.D F2
45+ R3 2
FMUL F0
F2 F4 3
FSUB F8
F6 F2
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 Yes 34+[R2]
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F4] Load2
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 Load2 Load1
Clock
3
Register result status:

Tomasulo’s Algorithm Example (Cycle 4)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 No
Load2 Yes 45+[R3]
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 Yes FSUB M[A1] Load2
Add2 No
Add3 No
Mult1 Yes FMUL [F4] Load2
Mult2 No
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 Load2 M[A1] Add1
Clock
4
Register result status:

Tomasulo’s Algorithm Example (Cycle 5)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6 5
FADD F6
F8 F2
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
3
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 No
Add3 No
10
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] M[A1] Add1 Mult2
Clock
5
Register result status:

Tomasulo’s Algorithm Example (Cycle 6)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
2
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 Yes FADD M[A2] Add1
Add3 No
9
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 Add1 Mult2
Clock
6
Register result status:

Tomasulo’s Algorithm Example (Cycle 7)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
1
Add1 Yes FSUB M[A1] M[A2] Load2
Add2 Yes FADD M[A2] Add1
Add3 No
8
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 Add1 Mult2
Clock
7
Register result status:

Tomasulo’s Algorithm Example (Cycle 8)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
3
Add2 Yes FADD (M-M) M[A2]
Add3 No
7
Mult1 Yes FMUL M[A2] [F4] Load2
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
8
Register result status:

Tomasulo’s Algorithm Example (Cycle 9)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
2
Add2 Yes FADD (M-M) M[A2]
Add3 No
6
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
9
Register result status:

Tomasulo’s Algorithm Example (Cycle 10)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
1
Add2 Yes FADD (M-M) M[A2]
Add3 No
5
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] Add2 (M-M) Mult2
Clock
10
Register result status:

Tomasulo’s Algorithm Example (Cycle 11)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
11
Register result status:

Tomasulo’s Algorithm Example (Cycle 12)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
12
Register result status:

Tomasulo’s Algorithm Example (Cycle 13)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
13
Register result status:

Tomasulo’s Algorithm Example (Cycle 14)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1
Mult1 Yes FMUL M[A2] [F4]
Mult2 Yes FDIV M[A1] Mult1
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU Mult1 M[A2] (M-M+M) (M-M) Mult2
Clock
14
Register result status:

Tomasulo’s Algorithm Example (Cycle 15)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14 15
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
15
Register result status:

Faster Than Light Computation (Skip 39 Cycles)

Tomasulo’s Algorithm Example (Cycle 54)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 14 15
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
2
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
54
Register result status:

Tomasulo’s Algorithm Example (Cycle 55)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 15 16
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5 55
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1
Mult2 Yes FDIV M*F4 M[A1]
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Mult2
Clock
55
Register result status:

Tomasulo’s Algorithm Example (Cycle 56)
Exec Write
Instruction
j k Issue Comp Result
LD.D F6
34+ R2 1 3 4
LD.D F2
45+ R3 2 4 5
FMUL F0
F2 F4 3 15 16
FSUB F8
F6 F2 4 7 8
FDIVD F10
F0 F6 5 55 56
FADD F6
F8 F2 6 10 11
Instruction status:
Busy Addr
Load1 No
Load2 No
Load3 No
S1 S2 RS RS
Time
Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes
Reservation Stations:
F0 F2 F4 F6 F8 F10 F12 … F30
FU M*F4 M[A2] (M-M+M) (M-M) Result
Clock
56
Register result status:

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Tomasulo’s Algorithm (Dynamic Loop Unrolling)

Tomasulo's Algorithm Loop Example

Loop: LD.D  F0,0(R1)
      FMUL  F4,F0,F2
      SD.D  F4,0(R1)
      SUBI  R1,R1,#8
      BNEZ  R1,Loop
 This time assume Multiply takes 5 clocks
 Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
 To be clear, will show clocks for SUBI, BNEZ
 Reality: integer instructions ahead of FP Instructions
 Show 2 iterations

Iteration count
Instruction status:
Loop Example
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1
1 FMULF4 F0 F2
1 SD.DF4 0 R1
2 LD.DF0 0 R1
2 FMULF4 F0 F2
2 SD.DF4 0 R1
Load2 No
Busy Addr
FU
Load1 No
Load3 No
Store1 No
Store2 No
Store3 No
Added Store Buffers
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
Instruction Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU
Clock R1 0 80
Register result status:
Value of Register used for address, iteration control

Loop Example (Cycle 1)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 No
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1
Clock R1 1 80
Register result status:

Loop Example (Cycle 2)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 No
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 2 80
Register result status:

Loop Example (Cycle 3)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Busy Addr
Load1 Yes 80
FU
Load2 No
Load3 No
Store1 Yes 80
Mult1
Store2 No
Store3 No
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 3 80
Register result status:

Loop Example (Cycle 4)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 4 80
Register result status:

Loop Example (Cycle 5)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
Load2 No
Busy Addr
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load1 Mult1
Clock R1 5 72
Register result status:

Loop Example (Cycle 6)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
Busy Addr
Load2 Yes 72
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult1
Clock R1 6 72
Register result status:

Loop Example (Cycle 7)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
Busy Addr
Load2 Yes 72
FU
Load1 Yes 80
Load3 No
Store1 Yes 80
Store2 No
Store3 No
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 7 72
Register result status:

Loop Example (Cycle 8)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Busy Addr
Load1 Yes 80
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 8 72
Register result status:

Loop Example (Cycle 9)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Busy Addr
Load1 Yes 80
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL [F2] Load1
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 9 72
Register result status:

Loop Example (Cycle 10)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 Yes 72
Load3 No
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL [F2] Load2
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load2 Mult2
Clock R1 10 64
Register result status:

Loop Example (Cycle 11)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 11 64
Register result status:

Loop Example (Cycle 12)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 12 64
Register result status:

Loop Example (Cycle 13)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 13 64
Register result status:

Loop Example (Cycle 14)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
Mult1
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes FMUL M[80] [F2]
Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 14 64
Register result status:

Loop Example (Cycle 15)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
M[80]*F2
Mult2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes FMUL M[72] [F2]
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 15 64
Register result status:

Loop Example (Cycle 16)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 No
M[80]*F2
M[72]*F2
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
5 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 16 64
Register result status:

Loop Example (Cycle 17)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 Yes 64
M[80]*F2
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 17 64
Register result status:

Loop Example (Cycle 18)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 Yes 80
Store2 Yes 72
Store3 Yes 64
M[80]*F2
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 18 64
Register result status:

Loop Example (Cycle 19)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18 19
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8 19
Load1 No
Busy Addr
FU
Load2 No
Load3 Yes 64
Store1 No
Store2 Yes 72
Store3 Yes 64
M[72]*F2
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 19 56
Register result status:

Loop Example (Cycle 20)
Iteration Instruction j k Issue
Exec. Write Comp Result
1 LD.DF0 0 R1 1 9 10
1 FMULF4 F0 F2 2 14 15
1 SD.DF4 0 R1 3 18 19
2 LD.DF0 0 R1 6 10 11
2 FMULF4 F0 F2 7 15 16
2 SD.DF4 0 R1 8 19 20
Load2 No
Busy Addr
FU
Load1 Yes 56
Load3 Yes 64
Store1 No
Store2 No
Store3 Yes 64
Mult1
Instruction status:
Reservation stations:
Code:
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes FMUL [F2] Load3
Mult2 No
LD.D F0 0 R1
FMUL F4 F0 F2
SD.D F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
F0 F2 F4 F6 F8 F10 F12 … F30
FU Load3 Mult2
Clock R1 20 56
Register result status:

Why Can Tomasulo Overlap Iterations of Loops?
Register renaming
Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
Reservation stations
Permit instruction issue to advance past integer control flow operations.
Also buffer old values of registers – totally avoiding the WAR stalls.
Other perspective: Tomasulo building data flow dependency graph on the fly

Major Advantages of Tomasulo’s Algorithm
 The distribution of the hazard detection logic:
 Distributed reservation stations and the CDB
 If multiple instructions waiting on single result, and each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
 If a centralized register file is used, the units would have to read their results from the registers when register buses are available
 The elimination of stalls for WAW and WAR hazards

What about Precise Interrupts?
The state of the machine looks as if no instruction beyond the faulting instruction has issued.
Tomasulo had:
 In-order issue, out-of-order execution, and out-of-order completion
Need to “fix” the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Dynamic Scheduling
Tomasulo’s Algorithm with Speculation

Relationship between Precise Interrupts and Speculation
 Speculation: guess and check
 Important for branch prediction:
 Need to “take our best shot” at predicting branch direction
 If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
 This is exactly same as precise exceptions!
 Technique for both precise interrupts/exceptions
and speculation:
 In-order completion or commit

Hardware-Based Speculation
 Execute instructions along predicted execution paths but only commit the results of prediction when correct
 Instruction commit: allowing an instruction to update the register file when instruction is no longer speculative
 Need an additional piece of hardware to prevent any irreversible action (i.e., updating state or taking an exception) until an instruction commits
 Reorder Buffer

Reorder Buffer (ROB)
Register values and memory references are not written until an instruction commits.
On misprediction:
 Speculated entries in ROB are cleared
Exceptions:
 Not recognized until the instruction that caused the exception is ready to commit

Reorder Buffer Operation
 ROB holds instructions in FIFO order, exactly as issued
 Modify reservation stations:
 Operand source is now the ROB instead of the functional unit
 When instructions complete, results are placed into the ROB
 The ROB supplies operands to other instructions between execution complete and commit; it effectively provides more registers, like the reservation stations do
 Results are tagged with the ROB entry number instead of the reservation station number
 When instructions commit, the values at the head of the ROB are placed in the registers (or memory locations)
 As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: the FP op queue feeds the reorder buffer, which updates the FP registers along the commit path; reservation stations and the FP adders produce results that flow back into the ROB.]

Reorder Buffer Entry Fields
Each entry in the ROB contains four fields:
1 Instruction type
A branch (has no destination result), a store (has a memory address destination), or a register operation (FP operation or load, which
has register destinations)
2 Destination
Register number (for loads and FP operations) or memory address (for stores) where the instruction result should be written
3 Value
Value of instruction result until the instruction commits
4 Ready
Indicates that instruction has completed execution, and the value is ready
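
A minimal C sketch of one ROB entry with those four fields; the types and names are illustrative, not a definitive implementation:

enum rob_type { ROB_BRANCH, ROB_STORE, ROB_REGOP };   /* 1. instruction type */

struct rob_entry {
    enum rob_type type;
    int           dest;     /* 2. destination register number, or memory address for a store */
    double        value;    /* 3. result value, held here until the instruction commits */
    int           ready;    /* 4. set once execution has completed and the value is valid */
};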

Reorder Buffer

Entry  Busy  Instruction       State         Destination  Value
1      No    LD.D F6,32(R2)    Commit        F6           Mem[32+R2]
2      No    LD.D F2,44(R3)    Commit        F2           Mem[44+R3]
3      Yes   FMUL F0,F2,F4     Write result  F0           #2 x F4
4      Yes   FSUB F8,F2,F6     Write result  F8           #2 - #1
5      Yes   FDIV F10,F0,F6    Execute       F10
6      Yes   FADD F6,F8,F2     Write result  F6           #4 + #2

Reservation stations:
Name   Busy  Op    Vj           Vk           Qj  Qk  Dest  A
Load1  No
Load2  No
Add1   No
Add2   No
Add3   No
Mult1  No    FMUL  Mem[44+R3]   F4                   #3
Mult2  Yes   FDIV               Mem[32+R2]   #3      #5

FP register status:
Field      F0   F1  F2  F3  F4  F5  F6   F7  F8   F9  F10
Reorder #  3                        6        4        5
Busy       Yes  No  No  No  No  No  Yes  No  Yes  No  Yes

Adding Speculation to Tomasulo
 In non-speculative Tomasulo’s Algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file
 With speculation, the register file is not updated until the instruction commits
 (we know definitively that the instruction should execute)
 Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit
 ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo’s Algorithm

Revised Tomasulo’s Algorithm
1 Issue — get instruction from FP Op Queue (sometimes called “dispatch”)
 If reservation station and ROB slot free, issue instruction and send operands and ROB no. for destination
2 Execution — operate on operands
 When both operands ready then execute; if not ready, watch CDB for result; when both operand values in reservation station, execute; checks RAW
3 Write result — finish execution
 Write result and ROB tag on CDB to all awaiting FUs and ROB; Mark reservation station available
4 Commit — update register with reorder result (sometimes called “graduation”)
 When instruction reaches head of ROB and result present, update register with result (or store to memory) and remove instruction from ROB. When a mispredicted branch reaches head of ROB, discard all entries (flush ROB)
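
A hedged C sketch of the commit step only, reusing the rob_entry sketch from the Reorder Buffer Entry Fields slide; the circular-buffer bookkeeping, the flush handling, and the memory model (an array of doubles indexed by the destination field) are illustrative assumptions:

#define ROB_SIZE 16
static struct rob_entry rob[ROB_SIZE];   /* circular buffer */
static int rob_head, rob_count;

/* Retire at most one instruction per call, strictly from the head of the ROB. */
void commit_one(double *regfile, double *memory) {
    if (rob_count == 0 || !rob[rob_head].ready)
        return;                              /* head has not finished executing: wait */
    struct rob_entry *e = &rob[rob_head];
    if (e->type == ROB_REGOP)
        regfile[e->dest] = e->value;         /* architectural register updated only here */
    else if (e->type == ROB_STORE)
        memory[e->dest] = e->value;          /* stores reach memory only at commit */
    /* A mispredicted branch at the head would instead flush the buffer
       (rob_head = rob_count = 0) and restart fetch at the correct target. */
    rob_head = (rob_head + 1) % ROB_SIZE;
    rob_count--;
}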

Architecture of Speculative Tomasulo

[Figure: the instruction unit feeds an instruction queue. Load/store operations go through an address unit into load buffers and the store-address/store-data paths of the memory unit; floating-point operations are dispatched over an operation bus into reservation stations in front of the FP adders and FP multipliers. A reorder buffer (register number + data) sits between the common data bus (CDB) and the FP registers; operand buses and the CDB connect the units.]

Assumed latencies:
 Floating-point addition: 2 clock cycles
 Floating-point division: 20 clock cycles
 Load operation: 3 clock cycles
 Cache miss penalty: 10 clock cycles

Speculative Tomasulo Example (Cycle 1)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
1
10+R2
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 2)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
1
10+R2
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 3)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 4)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 5)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F4
LD.D F4,0(R3)
N

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 6)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest
F0
FADD F0,F4,F6
N
F4
LD.D F4,0(R3)
N

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
ROB5, R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 7)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest

ROB5
SD.D 0(R3),F4
N
F0
FADD F0,F4,F6
N
F4
LD.D F4,0(R3)
N

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
ROB5, R(F6)
1
10+R2
5
0+R3
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 8)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest

M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
N
F4
M[10]
LD.D F4,0(R3)
Y

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
3
FDIV
ROB2,R(F6)
6
FADD
R(F4),ROB1
M[10], R(F6)
1
10+R2
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 9)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest

M[10]
SD.D 0(R3),F4
Y
F0
FADD F0,F4,F6
Ex
F4
M[10]
LD.D F4,0(R3)
Y

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
6
FADD
M[10], R(F6)
1
10+R2
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 10)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest

M[10]
SD.D 0(R3),F4
Y
F0

FADD F0,F4,F6
Y
F4
M[10]
LD.D F4,0(R3)
Y

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
LD.D F0,10(R2)
N
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
3
FDIV
ROB2,R(F6)
1
10+R2
FP adders
FP multipliers

Speculative Tomasulo Example (Cycle 11)
Done?
ROB7
ROB6 ROB5 ROB4 ROB3 ROB2 ROB1
To Memory
from Memory
Dest
Newest

M[10]
SD.D 0(R3),F4
Y
F0

FADD F0,F4,F6
Y
F4
M[10]
LD.D F4,0(R3)
Y

BNE F2,<...>
N
F2
FDIV F2,F10,F6
N
F10
FADD F10,F4,F0
N
F0
M[20]
LD.D F0,10(R2)
Y
FP Op Queue
Reorder Buffer Registers
Oldest
Dest
Dest
Reservation Stations
2
FADD
R(F4),ROB1
What about memory hazards?
3
FDIV
ROB2,R(F6)
FP adders
FP multipliers

Avoiding Memory Hazards
 WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending
 RAW dependences through memory are maintained by two restrictions:
1
Not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the address field of the load.
2
Maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.
 These restrictions ensure that any load that accesses a memory location written by an earlier store cannot perform the memory access until the store has written the data
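
A small C sketch of the first restriction, again reusing the rob_entry sketch from earlier; here 'count' is assumed to be the number of ROB entries older than the load, and addresses are compared directly:

/* Return 1 if the load may read memory now: no older, still-active store
   in the ROB targets the same effective address. */
int load_may_proceed(const struct rob_entry *rob, int head, int count,
                     int size, int load_addr) {
    for (int i = 0; i < count; i++) {
        const struct rob_entry *e = &rob[(head + i) % size];
        if (e->type == ROB_STORE && e->dest == load_addr)
            return 0;                        /* possible RAW through memory: wait */
    }
    return 1;
}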

Speculation for Greater ILP
Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct.
Speculation
fetch, issue, and execute instructions as if branch predictions were always correct
Dynamic scheduling
only fetches and issues instructions
Essentially a data flow execution model: Operations execute as soon as their operands are available.

Speculation for Greater ILP
Three components of HW-based speculation:
1. Dynamic branch prediction to choose which instructions to execute (a minimal predictor is sketched below)
2. Speculation to allow execution of instructions before control dependences are resolved, with the ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks
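Component 1 is an ordinary dynamic branch predictor. As an illustration only (the lecture does not prescribe a particular scheme), the following is a minimal 2-bit saturating-counter predictor; the table size and the names predict and train are arbitrary choices of this sketch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One 2-bit counter per table entry: 0,1 predict not-taken; 2,3 predict taken. */
#define BP_ENTRIES 1024
static uint8_t bp_table[BP_ENTRIES];     /* counters start at 0 (not taken) */

static bool predict(uint64_t pc) {
    return bp_table[(pc >> 2) % BP_ENTRIES] >= 2;
}

static void train(uint64_t pc, bool taken) {
    uint8_t *c = &bp_table[(pc >> 2) % BP_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    uint64_t pc = 0x4000;
    int mispredicts = 0;
    /* A loop branch taken 9 times, then falling through, repeated 3 times:
       2 cold-start mispredictions plus one per loop exit. */
    for (int run = 0; run < 3; run++) {
        for (int i = 0; i < 10; i++) {
            bool taken = (i != 9);
            if (predict(pc) != taken) mispredicts++;
            train(pc, taken);
        }
    }
    printf("mispredictions: %d\n", mispredicts);
    return 0;
}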

Exceptions and Interrupts
IBM 360/91 invented "imprecise interrupts"
 The computer stopped at this PC; the faulting instruction is only likely to be close to this address
 Not so popular with programmers
 Also, what about virtual memory? (Not in the IBM 360)
Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
 If we speculate and are wrong, we need to back up and restart execution at the point at which we predicted incorrectly
 This is exactly what is needed to implement a precise exception
Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
 If a speculated instruction raises an exception, the exception is recorded in the ROB
 This is why reorder buffers are in all new processors (see the commit sketch below)
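A minimal C sketch of this commit rule, assuming a hypothetical RobEntry layout and a circular-buffer ROB. The only point illustrated is that exceptions and mispredictions are acted on when the offending instruction reaches the head of the ROB, not when they are detected, which is what makes them precise.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical ROB entry with just the fields needed for in-order commit. */
typedef struct {
    bool     done;         /* finished executing                        */
    bool     exception;    /* exception recorded at execute time        */
    bool     mispredicted; /* branch whose prediction turned out wrong  */
    uint64_t pc;
} RobEntry;

/* Commit from the head of the ROB in program order. */
static void commit(RobEntry *rob, int size, int *head, int *count)
{
    while (*count > 0 && rob[*head].done) {
        RobEntry *e = &rob[*head];
        if (e->exception) {
            printf("precise exception at PC 0x%llx: flush ROB and trap\n",
                   (unsigned long long)e->pc);
            *count = 0;                  /* squash all younger work */
            return;
        }
        /* Normal case: update architectural state (registers/memory) here. */
        *head = (*head + 1) % size;
        (*count)--;
        if (e->mispredicted) {
            printf("mispredict at PC 0x%llx: flush younger instructions\n",
                   (unsigned long long)e->pc);
            *count = 0;                  /* refetch from the correct path */
            return;
        }
    }
}

int main(void) {
    RobEntry rob[4] = {
        { .done = true, .pc = 0x100 },                     /* commits normally */
        { .done = true, .exception = true, .pc = 0x104 },  /* traps at commit  */
        { .done = true, .pc = 0x108 },                     /* squashed         */
    };
    int head = 0, count = 3;
    commit(rob, 4, &head, &count);
    return 0;
}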

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Multiple Instruction Issue
Multiple Instruction Issue

Multiple Issue and ILP
To achieve CPI < 1, we need to complete multiple instructions per clock. Solutions:
 Multi-way issue superscalar processors
 VLIW (very long instruction word) processors
 Multi-word vector processors
The anticipated success of multiple issue led to reporting Instructions Per Clock cycle (IPC) rather than CPI.
 The main motivation of ILP is to reduce the amount of stall in the pipelined execution of instructions, in particular for multi-cycle operations
 The main goal of ILP is to get performance close to the ideal CPI = 1

Getting CPI < 1: Issuing Multiple Instructions/Cycle
Superscalar: varying number of instructions per cycle (1 to 8), scheduled statically by the compiler or dynamically in hardware (Tomasulo)
 IBM PowerPC, Sun UltraSparc, Pentium 4
Very Long Instruction Words (VLIW): fixed number of instructions (4-16) scheduled by the compiler; operations are packed into wide templates
 Intel Architecture-64 (IA-64) 64-bit address
 Renamed: "Explicitly Parallel Instruction Computer (EPIC)"
Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
 Multimedia instructions being added to many processors

[Figure: execution timelines in base cycles for a scalar pipeline, a superscalar of degree 3, VLIW execution with degree 3 (one fetch/decode, three operations executed per instruction), and vector execution with degree 3 (vector read, execute, write).]

Multiple Issue
Superscalar (static)
 Issue structure: dynamic | Hazard detection: hardware | Scheduling: static
 Distinguishing characteristic: in-order execution
 Examples: mostly in the embedded space: MIPS and ARM, including the Cortex-A53
Superscalar (dynamic)
 Issue structure: dynamic | Hazard detection: hardware | Scheduling: dynamic
 Distinguishing characteristic: some out-of-order execution, but no speculation
 Examples: none at the present
Superscalar (speculative)
 Issue structure: dynamic | Hazard detection: hardware | Scheduling: dynamic with speculation
 Distinguishing characteristic: out-of-order execution with speculation
 Examples: Intel Core i3, i5; AMD Phenom; IBM Power 7
VLIW/LIW
 Issue structure: static | Hazard detection: primarily software | Scheduling: static
 Distinguishing characteristic: all hazards determined and indicated by the compiler (often implicitly)
 Examples: most examples are in signal processing, such as the TI C6x
EPIC
 Issue structure: primarily static | Hazard detection: primarily software | Scheduling: mostly static
 Distinguishing characteristic: all hazards determined and indicated explicitly by the compiler
 Examples: Itanium

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Multiple Instruction Issue
Very Long Instruction Word (VLIW)

VLIW Processors
Example VLIW bundle:
 One integer instruction (or branch)
 Two independent floating-point operations
 Two independent memory references
Package multiple operations into one instruction. There must be enough parallelism in the code to fill the available slots (a minimal sketch of such a bundle follows below).
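As a hypothetical illustration (not an actual VLIW encoding), such a bundle could be represented as below; the slot layout, field names, and opcode set are assumptions of this sketch.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical slot encoding; field names and widths are illustrative only. */
enum { OP_NOP = 0, OP_ADD, OP_FADD, OP_LD, OP_SD, OP_BNEZ };

typedef struct {
    uint8_t opcode;       /* OP_NOP marks an empty slot                  */
    uint8_t dest;         /* destination register number                 */
    uint8_t src1, src2;   /* source register numbers                     */
    int16_t imm;          /* displacement for loads/stores/branches      */
} Slot;

/* One VLIW instruction: all five operations issue together, and the
   compiler guarantees that they are independent of one another. */
typedef struct {
    Slot int_or_branch;   /* one integer instruction (or branch)         */
    Slot fp[2];           /* two independent floating-point operations   */
    Slot mem[2];          /* two independent memory references           */
} VliwBundle;

int main(void) {
    /* A bundle with only two useful operations; the rest stay NOP
       (wasted encoding space, a problem discussed later for VLIW). */
    VliwBundle b = {0};
    b.mem[0] = (Slot){ .opcode = OP_LD,   .dest = 0, .src1 = 1, .imm = 0 };
    b.fp[0]  = (Slot){ .opcode = OP_FADD, .dest = 4, .src1 = 0, .src2 = 2 };
    printf("bundle size: %zu bytes\n", sizeof b);
    return 0;
}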

VLIW Pipeline
[Figure: VLIW pipeline. Each wide instruction goes through a single IF and ID; its operations then execute in parallel EX slots before MEM and WB.]

VLIW Organization (Instruction Format)
[Figure: VLIW organization. An instruction issue unit dispatches the slots of each long instruction to FP add, FP multiply, integer ALU, branch, and load/store units, which share a common register file.]

VLIW: Very Long Instruction Word
 Each "instruction" has explicit coding for multiple operations
 In the IA-64 architecture, the grouping is called a "packet"
 In the Transmeta processor, the grouping is called a "molecule" (with "atoms" as operations)
 Tradeoff: instruction space for simple decoding
 The long instruction word has room for many operations
 By definition, all the operations the compiler puts in the long instruction word are independent, so they can execute in parallel
 E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
 At 16 to 24 bits per field, that is 7x16 = 112 bits to 7x24 = 168 bits wide
 Needs a compiling technique that schedules across several branches

Recall: Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: LD.D   F0,0(R1)
2         LD.D   F6,-8(R1)
3         LD.D   F10,-16(R1)
4         LD.D   F14,-24(R1)
5         FADD   F4,F0,F2
6         FADD   F8,F6,F2
7         FADD   F12,F10,F2
8         FADD   F16,F14,F2
9         SD.D   F4,0(R1)
10        SD.D   F8,-8(R1)
11        SD.D   F12,-16(R1)
12        SUBUI  R1,R1,#32
13        BNEZ   R1,Loop
14        SD.D   F16,8(R1)    ; 8-32 = -24
LD.D to FADD: 1 cycle; FADD to SD.D: 2 cycles
14 clock cycles, or 3.5 per iteration.

Loop Unrolling in VLIW
Clock | Memory reference 1 | Memory reference 2 | FP operation 1  | FP operation 2  | Integer operation/branch
  1   | LD.D F0,0(R1)      | LD.D F6,-8(R1)     |                 |                 |
  2   | LD.D F10,-16(R1)   | LD.D F14,-24(R1)   |                 |                 |
  3   | LD.D F18,-32(R1)   | LD.D F22,-40(R1)   | FADD F4,F0,F2   | FADD F8,F6,F2   |
  4   | LD.D F26,-48(R1)   |                    | FADD F12,F10,F2 | FADD F16,F14,F2 |
  5   |                    |                    | FADD F20,F18,F2 | FADD F24,F22,F2 |
  6   | SD.D F4,0(R1)      | SD.D F8,-8(R1)     | FADD F28,F26,F2 |                 |
  7   | SD.D F12,-16(R1)   | SD.D F16,-24(R1)   |                 |                 | SUBUI R1,R1,#56
  8   | SD.D F20,24(R1)    | SD.D F24,16(R1)    |                 |                 |
  9   | SD.D F28,8(R1)     |                    |                 |                 | BNEZ R1,Loop
 Unrolled 7 times to avoid delays (a C-level view of this loop follows after the slide)
 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
 Average: 2.5 operations per clock, 50% efficiency
Please note: need more registers in VLIW (15 vs. 6 in the single-issue scalar version)
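For reference, a hedged C-level view of the computation this schedule implements, assuming the pointer walks an array of doubles downward and the scalar corresponds to F2; the function name and the multiple-of-7 assumption are this sketch's own.

#include <stddef.h>
#include <stdio.h>

/* The body is unrolled 7 times so the VLIW slots can be kept busy;
   this simple version assumes n is a multiple of 7. */
static void add_scalar_unrolled7(double *x, size_t n, double s)
{
    for (size_t i = n; i >= 7; i -= 7) {
        x[i - 1] += s;  x[i - 2] += s;  x[i - 3] += s;  x[i - 4] += s;
        x[i - 5] += s;  x[i - 6] += s;  x[i - 7] += s;
    }
}

int main(void) {
    double x[14] = {0};
    add_scalar_unrolled7(x, 14, 1.5);
    printf("%g %g\n", x[0], x[13]);   /* 1.5 1.5 */
    return 0;
}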

Problems with 1st Generation VLIW
VLIW has not been successful for commercial processors for several reasons:
Increase in code size
 Generating enough operations in a straight-line code fragment requires ambitious unrolling of loops
 Whenever VLIW instructions are not full, the unused functional units translate into wasted bits in the instruction encoding
Operated in lock-step; no hazard detection hardware
 A stall in any functional-unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
 The compiler might predict functional-unit latencies, but caches are hard to predict
Binary code compatibility
 Pure VLIW: different numbers of functional units and different unit latencies require different versions of the code

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Multiple Instruction Issue
Dynamic Scheduling, Multiple Issue, and Speculation

Dynamic Scheduling, Multiple Issue, and Speculation
 Modern architectures: dynamic scheduling + multiple issue + speculation
 Two approaches:
 Assign reservation stations and update the pipeline control table in half clock cycles; only supports 2 instructions/clock
 Design logic to handle any possible dependencies between the instructions
 Issue logic (dispatch unit) is the bottleneck in dynamically scheduled superscalars

Superscalar Organization
[Figure: superscalar organization. An instruction fetch unit and instruction queue feed a dispatch unit, which issues to floating-point, integer, load, store, and branch execution units; stores go through a write buffer.]

2-way Superscalar Pipeline
[Figure: 2-way superscalar pipeline. Two instructions enter IF, ID, EX, MEM, and WB together in every cycle.]

Multiple Issue and Dependencies
 Examine all the dependencies among the instructions issued simultaneously
 If dependencies exist within the bundle of instructions, encode them in the reservation stations
 Also need multiple completion/commit
 To simplify reservation station allocation: limit the number of instructions of a given class that can be issued in a "bundle", i.e., one FP, one integer, one load, one store

2-way Superscalar Tomasulo
[Figure: 2-way superscalar Tomasulo organization. The instruction queue issues to an address unit with load buffers and to reservation stations in front of the integer unit, FP adders, and FP multipliers; a reorder buffer holds store addresses, store data, and load data; results return over the common data bus (CDB) to the reorder buffer, reservation stations, and the integer and FP registers, and the memory unit performs the actual accesses.]

Multiple Issue Example
Loop: LD   R2,0(R1)     //R2 = array element
      ADDI R2,R2,#1     //increment R2
      SD   R2,0(R1)     //store result
      ADDI R1,R1,#8     //increment pointer
      BNE  R2,R3,Loop   //branch if not last element
(A C-level equivalent of this loop is sketched after the first table below.)

Multiple Issue Example (No Speculation)
All numeric columns are clock cycle numbers.
Iteration | Instruction    | Issues | Executes | Memory access | Writes CDB | Comment
    1     | LD R2,0(R1)    |   1    |    2     |       3       |     4      | First issue
    1     | ADDI R2,R2,#1  |   1    |    5     |               |     6      | Wait for LD
    1     | SD R2,0(R1)    |   2    |    3     |       7       |            | Wait for ADDI
    1     | ADDI R1,R1,#8  |   2    |    3     |               |     4      | Execute directly
    1     | BNE R2,R3,Loop |   3    |    7     |               |            | Wait for ADDI
    2     | LD R2,0(R1)    |   4    |    8     |       9       |    10      | Wait for BNE
    2     | ADDI R2,R2,#1  |   4    |   11     |               |    12      | Wait for LD
    2     | SD R2,0(R1)    |   5    |    9     |      13       |            | Wait for ADDI
    2     | ADDI R1,R1,#8  |   5    |    8     |               |     9      | Wait for BNE
    2     | BNE R2,R3,Loop |   6    |   13     |               |            | Wait for ADDI
    3     | LD R2,0(R1)    |   7    |   14     |      15       |    16      | Wait for BNE
    3     | ADDI R2,R2,#1  |   7    |   17     |               |    18      | Wait for LD
    3     | SD R2,0(R1)    |   8    |   15     |      19       |            | Wait for ADDI
    3     | ADDI R1,R1,#8  |   8    |   14     |               |    15      | Wait for BNE
    3     | BNE R2,R3,Loop |   9    |   19     |               |            | Wait for ADDI
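For orientation only, the loop used in these tables roughly corresponds to the following C, assuming R1 walks an array of 64-bit integers and R3 holds the value that ends the loop; the function and variable names are hypothetical.

#include <stdint.h>
#include <stdio.h>

/* C equivalent of:
   Loop: LD R2,0(R1); ADDI R2,R2,#1; SD R2,0(R1); ADDI R1,R1,#8; BNE R2,R3,Loop
   where p plays the role of R1, v of R2, and stop_value of R3. */
static void increment_until(int64_t *p, int64_t stop_value)
{
    int64_t v;
    do {
        v = *p + 1;                /* LD + ADDI           */
        *p = v;                    /* SD                  */
        p++;                       /* ADDI R1,R1,#8       */
    } while (v != stop_value);     /* BNE R2,R3,Loop      */
}

int main(void) {
    int64_t a[4] = {0, 0, 0, 4};
    increment_until(a, 5);         /* stops after incrementing a[3] to 5 */
    printf("%lld %lld\n", (long long)a[0], (long long)a[3]);   /* 1 5 */
    return 0;
}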

Multiple Issue Example (with Speculation)
All numeric columns are clock cycle numbers.
Iteration | Instruction    | Issues | Executes | Reads memory | Writes CDB | Commits | Comment
    1     | LD R2,0(R1)    |   1    |    2     |      3       |     4      |    5    | First issue
    1     | ADDI R2,R2,#1  |   1    |    5     |              |     6      |    7    | Wait for LD
    1     | SD R2,0(R1)    |   2    |    3     |              |            |    7    | Wait for ADDI
    1     | ADDI R1,R1,#8  |   2    |    3     |              |     4      |    8    | Commit in order
    1     | BNE R2,R3,Loop |   3    |    7     |              |            |    8    | Wait for ADDI
    2     | LD R2,0(R1)    |   4    |    5     |      6       |     7      |    9    | No execute delay
    2     | ADDI R2,R2,#1  |   4    |    8     |              |     9      |   10    | Wait for LD
    2     | SD R2,0(R1)    |   5    |    6     |              |            |   10    | Wait for ADDI
    2     | ADDI R1,R1,#8  |   5    |    6     |              |     7      |   11    | Commit in order
    2     | BNE R2,R3,Loop |   6    |   10     |              |            |   11    | Wait for ADDI
    3     | LD R2,0(R1)    |   7    |    8     |      9       |    10      |   12    | Earliest possible
    3     | ADDI R2,R2,#1  |   7    |   11     |              |    12      |   13    | Wait for LD
    3     | SD R2,0(R1)    |   8    |    9     |              |            |   13    | Wait for ADDI
    3     | ADDI R1,R1,#8  |   8    |    9     |              |    10      |   14    | Executes earlier
    3     | BNE R2,R3,Loop |   9    |   13     |              |            |   14    | Wait for ADDI
Without speculation, the load of the next iteration cannot execute until the BNE completes, so the steady state is about 6 cycles per iteration (LD executes in cycles 2, 8, 14); with speculation the loop sustains about 3 cycles per iteration (LD executes in cycles 2, 5, 8).

Register Renaming
Register renaming vs. reorder buffers:
 Instead of virtual registers from reservation stations and the reorder buffer, create a single register pool
 Contains the architecturally visible registers and the virtual registers
 Use a hardware-based map to rename registers during the issuing process
 WAW and WAR hazards are avoided
 Speculation recovery occurs by copying during commit
 Still need a ROB-like queue to update the table in order
 Simplifies commit:
 Record that the mapping between the architectural register and the physical register is no longer speculative
 Free up the physical register used to hold the older value
 In other words: SWAP physical registers on commit
 Physical register de-allocation is more difficult
 Simple approach: deallocate a virtual register when the next instruction writes to its mapped architecturally visible register

Integrated Issue and Renaming
Combining instruction issue with register renaming:
 Issue logic pre-reserves enough physical registers for the bundle
 Issue logic finds dependencies within the bundle and maps registers as necessary
 Issue logic finds dependencies between the current bundle and bundles already in execution, and maps registers as necessary

Instr. # | Instruction  | Physical register assigned or destination | Instruction with physical register numbers | Rename map change
   1     | ADD R1,R2,R3 | p32 | ADD p32,p2,p3   | R1→p32
   2     | SUB R1,R1,R2 | p33 | SUB p33,p32,p2  | R1→p33
   3     | ADD R2,R1,R2 | p34 | ADD p34,p33,p2  | R2→p34
   4     | SUB R1,R3,R2 | p35 | SUB p35,p3,p34  | R1→p35
   5     | ADD R1,R1,R2 | p36 | ADD p36,p35,p34 | R1→p36
   6     | SUB R1,R3,R1 | p37 | SUB p37,p3,p36  | R1→p37

The difficulty in this process is that all renaming and replacement of architectural registers by physical ones happens in one clock cycle, not sequentially. It is up to the issue logic to find all dependencies and rewrite the instructions in parallel. (A software sketch of the underlying rename map follows below.)
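As a rough software analogy (not the lecture's own material), the rename map can be pictured as an array indexed by architectural register number plus a free list of physical registers; the names rename_map, free_list, and rename_dest are hypothetical, and real hardware performs all of these steps for a whole bundle in parallel within one cycle.

#include <stdio.h>

#define NUM_ARCH 32
#define NUM_PHYS 64

/* rename_map[r] = physical register currently holding architectural register r.
   Initially R0..R31 map to p0..p31; p32..p63 are on the free list. */
static int rename_map[NUM_ARCH];
static int free_list[NUM_PHYS];
static int free_top = 0;

static void init(void) {
    for (int r = 0; r < NUM_ARCH; r++) rename_map[r] = r;
    for (int p = NUM_PHYS - 1; p >= NUM_ARCH; p--) free_list[free_top++] = p;
}

/* Rename one instruction "dest = src1 op src2": sources read the current map,
   the destination gets a fresh physical register (pre-reserved by the issue
   logic), and the map is updated, which removes WAW and WAR hazards. */
static void rename_dest(const char *op, int dest, int src1, int src2) {
    int p1 = rename_map[src1];
    int p2 = rename_map[src2];
    int pd = free_list[--free_top];
    rename_map[dest] = pd;
    printf("%s p%d,p%d,p%d   (R%d -> p%d)\n", op, pd, p1, p2, dest, pd);
}

int main(void) {
    init();
    rename_dest("ADD", 1, 2, 3);   /* ADD R1,R2,R3 -> ADD p32,p2,p3  */
    rename_dest("SUB", 1, 1, 2);   /* SUB R1,R1,R2 -> SUB p33,p32,p2 */
    rename_dest("ADD", 2, 1, 2);   /* ADD R2,R1,R2 -> ADD p34,p33,p2 */
    return 0;
}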

How Much Speculation?
 How much to speculate
 Mis-speculation degrades performance and power relative to no speculation
 May cause additional misses (cache, TLB)
 Prevent speculative code from causing higher-cost misses (e.g., in L2)
 Speculating through multiple branches
 Complicates speculation recovery
 Speculation and energy efficiency
 Note: speculation is only energy efficient when it significantly improves performance
[Figure: fraction of mis-speculated instructions (0% to 45%) for SPEC2000 integer benchmarks (164.gzip, 175.vpr, 176.gcc, 186.crafty, 181.mcf) and floating-point benchmarks (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa); the integer codes mis-speculate far more often than the floating-point codes.]

Energy Efficiency
 Value prediction
 Uses:
 Loads that load from a constant pool
 Instructions that produce a value from a small set of values
 Not incorporated into modern processors
 A similar idea, address aliasing prediction, is used on some processors to determine whether two stores, or a load and a store, reference the same address, to allow reordering

END OF SECTION

Lesson 06 – Instruction-Level Parallelism: Case Studies
Fallacies and Pitfalls

Fallacies and Pitfalls
 Fallacy: it is easy to predict the performance/energy efficiency of two different versions of the same ISA if we hold the technology constant
[Figure: i7 920 vs. Atom 230 performance and energy ratios (speedup and relative energy efficiency) across Java and SPEC CPU2006 benchmarks; the ratios vary widely from benchmark to benchmark.]

Fallacies and Pitfalls
 Fallacy: processors with lower CPIs / faster clock rates will also be faster
Processor           | Technology | Clock rate | Power                                | SPECCInt2006 base | SPECCFP2006 base
Intel Pentium 4 670 | 90 nm      | 3.8 GHz    | 115 W                                | 11.5              | 12.2
Intel Itanium 2     | 90 nm      | 1.66 GHz   | 104 W (approx. 70 W for one core)    | 14.5              | 17.3
Intel i7 920        | 45 nm      | 3.3 GHz    | 130 W total (approx. 80 W one core)  | 35.5              | 38.4
 Pentium 4 had the higher clock, lower CPI
 Itanium had the same CPI, lower clock

Fallacies and Pitfalls
 Pitfall: sometimes bigger and more aggressive is better
 Pentium 4 and Itanium were advanced designs, but could not achieve their peak instruction throughput because of relatively small caches compared to the i7
 And sometimes smarter is better than bigger and more aggressive
 The TAGE branch predictor outperforms gshare while storing fewer predictions

Fallacies and Pitfalls
 Fallacy: believing that there are large amounts of ILP available, if only we had the right techniques
[Figure: degree of parallelism measured for espresso, gcc, li, fpppp, doduc, and tomcatv with instruction window sizes of infinite, 256, 128, 64, and 32; even generous windows expose only limited ILP, especially for the integer codes.]

END OF SECTION