CS3350B Computer Organization Chapter 4: Instruction-Level Parallelism Part 3: Beyond Pipelining
Alex Brandt
Department of Computer Science University of Western Ontario, Canada
Monday March 22, 2021
Outline
1 Introduction
2 VLIW
3 Loop Unrolling
4 Dynamic Superscalar Processors
5 Register Renaming
Instruction-Level Parallelism (ILP)
Instruction-level parallelism involves executing multiple instructions at the same time.
ë Instructions may simply overlap (pipelining) or,
ë Instructions may be executed completely in parallel (superscalar).
There are many techniques which are used to provide ILP or to support ILP in achieving greater speed-up:
ë Pipelining.
ë Branch prediction.
ë Superscalar execution.
ë Very Long Instruction Word (VLIW).
ë Register renaming.
ë Loop unrolling.
Multiple Issue Processors
A multiple issue processor issues (executes) multiple instructions within a clock cycle. (Aims for CPI < 1)
ë VLIW Processors.
ë Static Superscalar Processors (essentially the same as VLIW).
ë Dynamic Superscalar Processors.
By their nature, all multiple issue processors have multiple execution units (ALUs) in their datapath.
Depending on the type of multiple issue processor, other circuitry may also be duplicated or augmented.
Note: multiple issue processors are not necessarily pipelined (the concepts are separate), but in practice pipelining is so effective, and multiple issue came later, that all multiple issue processors are also pipelined.
Static Superscalar Processors
The name static implies that code scheduling is done by the compiler. Basically: side-by-side datapaths simultaneously executing instructions.
The compiler handles dependencies and hazards, scheduling code so that instructions on different datapaths do not conflict.
Near identical to VLIW so we’ll skip the details.
Static Superscalar Pipeline
Outline
1 Introduction
2 VLIW
3 Loop Unrolling
4 Dynamic Superscalar Processors
5 Register Renaming
VLIW Processors (1/2)
VLIW processors have very long instruction words.
Essentially, multiple instructions are encoded within a single (long)
instruction memory word called an issue packet.
The instructions which can be packed together are limited: usually only one lw/sw, only one branch, and the rest arithmetic.
In this case, the instruction word size no longer matches the data memory word size.
Simplest scheme: just concatenate multiple instructions together.
Ex: Two 32-bit instrs. together in a single 64-bit instruction word.
[Diagram: one full instruction word = Instr. 1 (32 bits) followed by Instr. 2 (32 bits)]
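As a rough illustration of this concatenation scheme (my own sketch; the pack/slot helper names are hypothetical, not from any real ISA), a 64-bit issue packet can be formed by shifting one 32-bit instruction word beside the other:

    #include <stdint.h>

    /* Hypothetical 2-issue packet: slot 0 in the high 32 bits, slot 1 in the low 32 bits. */
    typedef uint64_t issue_packet_t;

    issue_packet_t pack(uint32_t slot0_instr, uint32_t slot1_instr) {
        return ((issue_packet_t)slot0_instr << 32) | slot1_instr;
    }

    uint32_t slot0(issue_packet_t p) { return (uint32_t)(p >> 32); }
    uint32_t slot1(issue_packet_t p) { return (uint32_t)(p & 0xFFFFFFFFu); }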
VLIW Processors (2/2)
In a VLIW pipeline:
One IF unit fetches a single long word encoding multiple instrs.
ID stage must handle multiple simultaneous decodes and register reads.
In EX stage, each instr. is issued to a different execution unit (ALU).
Only one data memory to read/write from!
There is a limitation on which kinds of instructions can be executed simultaneously.
In the WB stage the register file must handle multiple writes (to different registers, obviously).
4-Stage VLIW (without MEM stage for simplicity)
M. Oskin et al. Exploiting ILP in page-based intelligent memory. In ACM/IEEE International Symposium on MICRO-32, Proceedings, pages 208-218, 1999.
A VLIW Example (1/3)
Consider a 2-issue extension of MIPS.
The first slot of the issue packet must be an arithmetic (R- or I-type) instruction or a branch.
The second slot of the issue packet must be lw or sw. If the compiler cannot find an instruction for a slot, it inserts a nop.
ë Much like load-delay slot or branch-delay slot.
[Diagram: issue packet = Instr. 1 (arithmetic or branch) followed by Instr. 2 (lw or sw)]
A VLIW Example (2/3)
loop: lw   $t0, 0($s1)    # $t0 = array element
      addu $t0, $t0, $s2  # add scalar in $s2
      sw   $t0, 0($s1)    # store result
      addi $s1, $s1, -4   # decrement pointer
      bne  $s1, $0, loop  # branch if $s1 != 0

Equivalent C:

    for (int i = n; i > 0; --i) {
        A[i] += s2;
    }
Need to schedule code for 2-issue.
Instructions in same issue packet must be independent.
Assume perfect branch prediction.
Load-use and RAW dependencies still need to be handled.
ë But, assume all possible datapath optimizations (forwarding).
A VLIW Example (3/3)
loop: lw   $t0, 0($s1)    # $t0 = array element
      addu $t0, $t0, $s2  # add scalar in $s2
      sw   $t0, 0($s1)    # store result
      addi $s1, $s1, -4   # decrement pointer
      bne  $s1, $0, loop  # branch if $s1 != 0

Scheduled for the 2-issue datapath:

          ALU or branch         Data transfer      CC
    loop: nop                   lw $t0, 0($s1)     1
          addi $s1, $s1, -4     nop                2
          addu $t0, $t0, $s2    nop                3
          bne  $s1, $0, loop    sw $t0, 4($s1)     4
Scheduled 5 instructions in 4 cycles, so in the limit CPI approaches 4/5 = 0.8. The nops do not count towards the instruction count.
Sometimes when scheduling code you need to adjust offsets: here sw uses 4($s1) because the addi that decrements $s1 now executes before the store.
Outline
1 Introduction
2 VLIW
3 Loop Unrolling
4 Dynamic Superscalar Processors
5 Register Renaming
Loop Unrolling
Compilers use loop unrolling to expose more parallelism.
Essentially, body of loop is replaced with multiple copies of itself (still
in a loop).
Avoids unnecessary branch delays, can more effectively schedule code and fill load-use slots.
Ex: 4-time loop unrolling. Notice i += 4 in the unrolled code below.
Original:

    int i = 0;
    for (i; i < n; i++) {
        A[i] += 10;
    }

Unrolled 4 times:

    int i = 0;
    for (i; i < n; i += 4) {
        A[i]   += 10;
        A[i+1] += 10;
        A[i+2] += 10;
        A[i+3] += 10;
    }
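One caveat the example glosses over: unrolling by 4 only covers loops whose trip count is a multiple of 4. A common compiler fix, sketched here under that assumption (add_ten is a hypothetical name, not from the slides), is to add a cleanup loop for the leftover iterations:

    void add_ten(int *A, int n) {
        int i = 0;
        /* Main unrolled loop: processes 4 elements per iteration. */
        for (; i + 3 < n; i += 4) {
            A[i]   += 10;
            A[i+1] += 10;
            A[i+2] += 10;
            A[i+3] += 10;
        }
        /* Cleanup loop: handles the remaining 0-3 elements when n % 4 != 0. */
        for (; i < n; i++) {
            A[i] += 10;
        }
    }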
Unrolling MIPS
loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $0, loop

    for (int i = n; i > 0; --i) {
        A[i] += s2;
    }
loop: lw   $t0, 0($s1)    # $t0 = array element
      lw   $t1, -4($s1)   # $t1 = array element
      lw   $t2, -8($s1)   # $t2 = array element
      lw   $t3, -12($s1)  # $t3 = array element
      addu $t0, $t0, $s2  # add scalar in $s2
      addu $t1, $t1, $s2  # add scalar in $s2
      addu $t2, $t2, $s2  # add scalar in $s2
      addu $t3, $t3, $s2  # add scalar in $s2
      sw   $t0, 0($s1)    # store result
      sw   $t1, -4($s1)   # store result
      sw   $t2, -8($s1)   # store result
      sw   $t3, -12($s1)  # store result
      addi $s1, $s1, -16  # decrement pointer
      bne  $s1, $0, loop  # branch if $s1 != 0
Notice that the loop body is not copied exactly 4 times.
Static register renaming: $t0 becomes $t1, $t2, $t3 for successive copies of the loop body.
Can now easily reschedule code and fill load-delay slots.
Far fewer branch instructions and branch-delay slots.
Combining Loop Unrolling and VLIW
Same 2-issue extension of MIPS. First slot is ALU, second slot is data transfer. Datapath has all forwarding and possible optimizations.
Remember:
Need one instruction between a load and its use. Insert a nop if no instruction is possible.
          ALU or branch          Data transfer           CC
    loop: addi $s1, $s1, -16     lw $t0, 0($s1)          1
          nop                    lw $t1, 12($s1)  # -4   2
          addu $t0, $t0, $s2     lw $t2, 8($s1)   # -8   3
          addu $t1, $t1, $s2     lw $t3, 4($s1)   # -12  4
          addu $t2, $t2, $s2     sw $t0, 16($s1)  # 0    5
          addu $t3, $t3, $s2     sw $t1, 12($s1)  # -4   6
          nop                    sw $t2, 8($s1)   # -8   7
          bne  $s1, $0, loop     sw $t3, 4($s1)   # -12  8
Scheduled 14 instructions in 8 cycles. In the limit, CPI approaches 8/14 ≈ 0.57.
Outline
1 Introduction
2 VLIW
3 Loop Unrolling
4 Dynamic Superscalar Processors
5 Register Renaming
Dynamic Superscalar Processors
Dynamic in the name implies that the hardware handles code scheduling.
Because of the dynamic nature, out-of-order execution can occur.
ë Instructions are actually executed in a different order than they are fetched.
This scheme allows for different types of execution paths which all take a different amount of time:
ë Ex: normal ALU, floating point unit, memory load/store path.
This scheme is particularly good at overcoming stalls due to cache misses and other dependencies.
However, hardware becomes much more complex than static schemes.
A Dynamic Superscalar Pipeline (1/2)
Fetch one instr. per cycle as normal.
“Pre-decode” instr. and add to instruction buffer in RD (dispatch) stage.
ë Out-of-order execution must wait for updated values of operands.
Once instr. operands are ready, dispatcher issues instr. to one of many execution units. This is where out-of-order execution is introduced.
Dispatch: adding an instr. to the buffer. Issue: sending an instr. to an execution unit.
A Dynamic Superscalar Pipeline (2/2)
Before the WB stage, a reorder buffer or completion stage makes sure instructions are in order before writing results back (a.k.a. committing).
ë Out-of-order execution means WAR and WAW dependencies matter.
Finally, instructions are retired.
A centralized reservation station is both a dispatcher and issuer.
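As a toy software model of the dispatch/issue distinction (my own sketch, not the actual hardware; RSEntry, reg_ready, and issue_ready are invented names), instructions sit in a reservation-station buffer after dispatch and are issued, possibly out of program order, once their source operands are ready:

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_REGS 8

    typedef struct {
        int dest, src1, src2;   /* architectural register numbers */
        bool issued;            /* has this entry been sent to an execution unit? */
    } RSEntry;

    /* reg_ready[r] is false while an older, not-yet-completed instruction
       still has to write register r. */
    bool reg_ready[NUM_REGS] = { true, true, true, true, true, true, true, true };

    /* Issue any dispatched entry whose source operands are ready,
       regardless of its position in the buffer (out-of-order issue). */
    void issue_ready(RSEntry *rs, int n) {
        for (int i = 0; i < n; i++) {
            if (!rs[i].issued && reg_ready[rs[i].src1] && reg_ready[rs[i].src2]) {
                rs[i].issued = true;
                printf("issuing entry %d (writes r%d)\n", i, rs[i].dest);
            }
        }
    }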
Pros and Cons of Dynamic Scheduling
The compiler is only so good at scheduling code. Data hazards are hard to resolve.
ë The compiler only sees pointers, but the hardware sees actual memory addresses.
Dynamic scheduling overcomes unpredictable stalls (cache misses) but requires complex circuitry.
Dynamic scheduling more aggressively overlaps instructions since operand values are read and then queued in reservation stations.
Instructions are fetched in-order, executed out-of-order, and then committed/retired in-order and sequentially.
Flynn’s bottleneck: can only retire as many instr. per cycle as are fetched.
ë Superscalar machines are usually augmented to fetch multiple instructions at once.
Comparing Multiple Issue Processors (1/2)
VLIW
Static scheduling. In-order execution.
Single IF unit but many EX units.
Instructions packed together in issue packet by compiler.
Static SS
Static scheduling. In-order execution.
Many IF units (or one IF fetching multiple instr.) and many EX units.
Compiler explicitly schedules each datapath.
Dynamic SS
Dynamic scheduling. Out-of-order exec.
Single IF unit but many EX units.
IF unit might fetch multiple instr. per cycle.
Comparing Multiple Issue Pipelines (2/2)
[Figure: comparison of multiple-issue pipelines. Source: Tapani Ahonen, University of Tampere.]
Outline
1 Introduction
2 VLIW
3 Loop Unrolling
4 Dynamic Superscalar Processors
5 Register Renaming
Write After X Dependencies
Out-of-order execution causes hazards beyond RAW dependencies.
WAR dependency – write a value after it is read by a previous instruction.
Ex: Cannot write to $t1 in add until addi has read its value of $t1.

    addi $t0, $t1, 2
    add  $t1, $t3, $0
WAW dependency – write a value after its destination has been written to by a previous instruction. Needed to maintain consistent values for future instructions.
Ex: If add executes before addi, then the final value of $t1 is incorrect.

    addi $t1, $t4, 12
    add  $t1, $t3, $0
False Dependencies
Some dependencies only exist due to re-use of the same variable/register.
In out-of-order execution, the last 3 instructions cannot execute before the first 3 instructions, because of WAR dependency between A[4] = t0 and t0 = A[64].
Renaming removes false dependency.
Before renaming:

    t0 = A[4];
    t0 = t0 + 3;
    A[4] = t0;
    t0 = A[64];
    t0 = t0 + 5;
    A[64] = t0;

After renaming:

    t0 = A[4];
    t0 = t0 + 3;
    A[4] = t0;
    t1 = A[64];
    t1 = t1 + 5;
    A[64] = t1;
Register Renaming
Register renaming is both a static and a dynamic technique used to help superscalar pipelines; it can fix WAR and WAW dependencies.
Code is modified so that every destination is replaced with a unique "logical" destination (sometimes called a value name).
For every instruction which writes to a register, create a new value name for that destination.
Reservation stations provide a hardware buffer for storing logical destinations.
For dynamic renaming, the hardware maintains a mapping from register names to logical destinations to modify operands for incoming instructions.
Ex:

    add $t6, $t0, $t2
    sub $t4, $t2, $t0
    xor $t0, $t6, $t2
    and $t2, $t2, $t6

RAW: $t6 in add and xor. WAR: $t0 in sub and xor. WAR: $t2 in sub and and. WAR: $t2 in xor and and.
Register Renaming Example

Value-name mapping (each destination gets a fresh name):

    Instr.               $t0   $t2   $t4   $t6
    Initially            V0    V1    V2    V3
    add $t6, $t0, $t2                      V4
    sub $t4, $t2, $t0                V5
    xor $t0, $t6, $t2    V6
    and $t2, $t2, $t6          V7

Renamed instructions:

    add V4, V0, V1
    sub V5, V1, V0
    xor V6, V4, V1
    and V7, V1, V4

Working through the last instruction: and reads $t2 (currently named V1) and $t6 (currently named V4), and its destination $t2 receives the fresh name V7.
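A minimal sketch of the dynamic renaming mechanism behind this table, assuming a simple map from architectural registers to value names (rename_instr and the register indexing are my own invention, not from the slides); sources are looked up first, then the destination gets a fresh name:

    #include <stdio.h>

    /* Architectural registers used in the example: 0 = $t0, 1 = $t2, 2 = $t4, 3 = $t6. */
    #define NUM_REGS 4

    int reg_map[NUM_REGS] = { 0, 1, 2, 3 };  /* initially $t0->V0, $t2->V1, $t4->V2, $t6->V3 */
    int next_value_name   = 4;               /* the next fresh value name is V4 */

    /* Rename one instruction "dest = op(src1, src2)": read the current names of the
       sources, then give the destination a brand-new name. Old destination names are
       never reused, which is what removes WAR and WAW dependencies. */
    void rename_instr(const char *op, int dest, int src1, int src2) {
        int v1 = reg_map[src1], v2 = reg_map[src2];
        int vd = next_value_name++;
        reg_map[dest] = vd;
        printf("%s V%d, V%d, V%d\n", op, vd, v1, v2);
    }

    int main(void) {
        rename_instr("add", 3, 0, 1);  /* add $t6, $t0, $t2  ->  add V4, V0, V1 */
        rename_instr("sub", 2, 1, 0);  /* sub $t4, $t2, $t0  ->  sub V5, V1, V0 */
        rename_instr("xor", 0, 3, 1);  /* xor $t0, $t6, $t2  ->  xor V6, V4, V1 */
        rename_instr("and", 1, 1, 3);  /* and $t2, $t2, $t6  ->  and V7, V1, V4 */
        return 0;
    }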
Summary
Multiple issue processors execute multiple instructions simultaneously for CPI < 1.
VLIW uses statically-scheduled issue packets.
Loop unrolling exposes more parallelism and removes some branching overhead.
Dynamic superscalar processors combat against unexpected stalls (cache misses) by allowing for out-of-order execution.
Register renaming fixes WAR and WAW dependencies.
RAW dependencies are the only true dependencies and must still be accounted for in scheduling.