程序代写代做代考 algorithm compiler go mips Module Outline:

Module Outline:
Module 3: High Performance Techniques— Tomasulo Algorithm
• Improved Dynamic Instruction Scheduling: Tomasulo Algorithm
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 1

Recall: Compiler Techniques for Parallelism
• Loop unrolling => multiple iterations of loop in software:
– amortizes loop overhead over several iterations
– gives more opportunity for scheduling around stalls
• Software Pipelining => take one instruction from each of several iterations of the loop
– software overlapping of loop iterations
• Very Long Instruction Word machines (VLIW) => multiple operations coded in single, long instruction
– requires sophisticated compiler to decide which operations can be done in parallel
– trace scheduling => find common path and schedule code as if branches didn’t exist
(add “fixup code”)
• ALL OF THE ABOVE require additional registers
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 2

Recall: How are WAR and WAW hazards handled in scoreboard?
• WAR hazards handled by stalling in WriteBack stage • WAW hazards handled by stalling in Issue stage
• Are either of these real hazards???
– consider the following WAR hazard:
ADD $1, $2, $3 SUB $3, $5, $4 ADD $2, $3, $5
– why not do renaming like this:
ADD $1, $2, $3 SUB $7, $5, $4 ADD $2, $7, $5
– now, WAR hazard has disappeared!!
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 3

A Famous Dynamic Scheduling Algorithm: Tomasulo Algorithm
• For IBM 360/91 about 3 years after CDC6600 (1966)
• Goal: high performance without special compilers
• Difference between IBM 360 and CDC6600 ISA
– IBM has only 2 register specifiers per instruction vs. 3 in CDC6600 – IBM has 4 FP registers vs. 8 in CDC6600
– IBM has memory-register ops
• Why study this algorithm?
BECAUSE: it leads to DEC Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …etc.
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 4

Tomasulo Algorithm vs. Scoreboard
• Control and buffers distributed with Function Units (FU) vs. centralized in scoreboard:
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS);
called Register Renaming
– avoids WAR, WAW hazards
– more reservation stations than registers, so can do optimizations that compilers can’t do
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Load and Store treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 5

Tomasulo Organization
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 6

Reservation Station Components
• Op: operation to perform in the unit (e.g., add or sub)
• Vj, Vk, Value of source operands
– store buffers has V field, result to be stored
• Qj, Qk: reservation stations producing source registers (value to be written)
– note: no ready flags as in scoreboard; Qj, Qk = 0 ==> ready – store buffers only have Qi for RS producing result
• Busy: indicates reservation station or FU is busy
• Register result status—indicates which functional unit will write each register, if one exists; blank when no pending instructions that will write that register
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 7

1. Issue:
3. Write Result:
Three Stages of Tomasulo Algorithm
• Get instruction from FP Op Queue
– if reservation station free (no structural hazard), control issues instructions and sends operands (renames registers);
2. Execution:
• Operate on operands (EX)
– when both operands ready then execute; if not ready, watch Common Data Bus for result
• Finish execution (WB)
– write on Common Data Bus to all awaiting units; mark reservation station available
Remarks:
• Normal data bus: data + destination (“go to” bus)
• Common Data Bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– write if matches expected Functional Unit (produces result); does the broadcast
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 8

Tomasulo Example
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 9

Tomasulo Example Cycle 1
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 10

Tomasulo Example Cycle 2
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 11

Tomasulo Example Cycle 3
• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoredboard
• Load1 completing; what is waiting for Load1??
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 12

Tomasulo Example Cycle 4
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 13

Tomasulo Example Cycle 5
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 14

Tomasulo Example Cycle 6
• Issue ADDD here vs. scoreboard?
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 15

Tomasulo Example Cycle 7
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 16

Tomasulo Example Cycle 8
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 17

Tomasulo Example Cycle 9
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 18

Tomasulo Example Cycle 10
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 19

Tomasulo Example Cycle 11
• All quick instructions complete in this cycle!
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 20

Tomasulo Example Cycle 12
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 21

Tomasulo Example Cycle 13
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 22

Tomasulo Example Cycle 14
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 23

Tomasulo Example Cycle 15
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 24

Tomasulo Example Cycle 16
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 25

Let’s skip some cycles: Tomasulo Example Cycle 55
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 26

Tomasulo Example Cycle 56
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 27

Tomasulo Example Cycle 57
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 28

Compare to Scoreboard Cycle 62
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 29

Tomasulo
Scoreboard
Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC6600)
pipelined functional units
6 load, 3 store, 3 add, 2 mult/division window size: ~14 instructions
no issue on structural hazard
WAR: renaming avoids
WAW: renaming avoids
broadcast results from FU
control: reservation stations
multiple functional units
1 load/store, 1 add, 2 mult, 1 division ~5 instructions
same
WAR: stall completion
stall issue
write/read registers
central scoreboard
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 30

• Complexity
– delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed • Performance limited by Common Data Bus
Tomasulo Drawbacks
– multiple CDBs ==> more FU logic for parallel associative stores
ELEC6036 (Vincent TAM) Module 3: High Performance Techniques—Tomasulo Page 31