Learning goals:
• Explain how a CPU pipeline works and how it can contribute to the
efficient execution of a program.
• Trace the flow of instructions through a pipelined processor. As part
of the trace be able to explain what values would be loaded (and
when) into the various stage registers.
• Appropriatelyuselatency,throughput,andCPItodescribethe
performance of a pipeline.
• Performlatency,throughputandCPIcalculationsforbothsequential
and pipelined processors as appropriate
UNIT 2
CPU Pipeline
CPSC 313.203 – 2019W2 Pipelined Y86-64
1
irmovq
irmovq
pushq
call
irmovq rmmovq halt
result, %rbx %rax, 0(%rbx)å
Processing now
• 000000 000008 000010
000018 30 000020 F300100000000000 000028 0040030000000000 000030 0000000050040800 000038 00000000000030F6 000040 0200000000000000 000048 66 60
000050 30 F0 01 00 00 000058 0000000000900000 ***
001000 EFBEADDE00000000 ***
001FE0 EFBEADDE00000000 001FE8 EFBEADDE00000000 001FF0 EFBEADDE00000000 001FF8 EFBEADDE00000000
isodd: mrmovq
irmovq modq je irmovq
done: ret
.pos 0x1000 result: .quad
# The stack .pos 0x1FE0
8(%rsp), %rax 2, %rsi
%rsi, %rax done
1, %rax
0xDEADBEEF
0x2000, %rsp
4, %rbx
%
rbx
isodd
30 00
F4
00
20
00
.quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF
CPSC 313.203 – 2019W2 Pipelined Y86-64
2
00
F3
00
04
A0
00
00
00
30 00 00
00
00
00 00
00 00
00 00
3F
80
34
00
00
00
73 00
5D
00
00
00
00
00
00
irmovq
irmovq pushq call irmovq rmmovq halt
0x2000, %rsp
4, %rbx
%rbx
isodd
result, %rbx %rax, 0(%rbx)
Icode
Ifun rA
rB valA valB valC
valP ValE ValM
nextPC
isodd:
mrmovq
irmovq modq je irmovq
done: ret
.pos 0x1000 result: .quad
# The stack
.pos 0x1FE0
8(%rsp), %rax 2, %rsi
%rsi, %rax done
1, %rax
0xDEADBEEF
.quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF
CPSC 313.203 – 2019W2 Pipelined Y86-64
3
Icode
Ifun
rA
rB
valA
valB
valC
valP
ValE
ValM
nextPC
irmovq 0x2000, %rsp
3
0
F
4
0
0
2000
A
2000
0
A
irmovq
irmovq pushq call irmovq rmmovq halt
0x2000, %rsp
4, %rbx
%rbx
isodd
result, %rbx %rax, 0(%rbx)
isodd:
mrmovq
irmovq modq je irmovq
done: ret
.pos 0x1000 result: .quad
# The stack
.pos 0x1FE0
8(%rsp), %rax 2, %rsi
%rsi, %rax done
1, %rax
0xDEADBEEF
.quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF
CPSC 313.203 – 2019W2 Pipelined Y86-64
4
Icode
Ifun
rA
rB
valA
valB
valC
valP
ValE
ValM
nextPC
irmovq 0x2000, %rsp
3
0
F
4
0
0
2000
A
2000
0
A
irmovq 4, %rbx
3
0
F
3
0
0
4
14
4
0
14
pushq %rbx
call isodd
irmovq 0x2000, %rsp irmovq 4, %rbx pushq %rbx
call
irmovq rmmovq halt
isodd
result, %rbx %rax, 0(%rbx)
isodd:
mrmovq
irmovq
modq
je
irmovq
done: ret
.pos 0x1000 result: .quad
# The stack
.pos 0x1FE0
8(%rsp), %rax 2, %rsi
%rsi, %rax done
1, %rax
0xDEADBEEF
.quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF .quad 0xDEADBEEF
CPSC 313.203 – 2019W2 Pipelined Y86-64
5
Motivation and basic concepts • How instruction execution proceeds:
• Sequential Y86 implementation signal flows through each stage
Instr 1 Instr 2
Instr 3
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
CPSC 313.203 – 2019W2 Pipelined Y86-64
6
CPSC 313.203 – 2019W2 Pipelined Y86-64
7
Human Pipeline
Fun video – not shown in class
CPSC 313.203 – 2019W2 Pipelined Y86-64
8
Goal?
CPSC 313.203 – 2019W2 Pipelined Y86-64
9
Motivation and basic concepts
• Can we start executing Instruction 2 before Instruction 1 finishes? (i.e. can we overlap execution of the instructions)
Instr 1 Instr 2
Instr 3
FDEMW FDEMW
FDEMW
CPSC 313.203 – 2019W2 Pipelined Y86-64
10
Yes, by doing this • Pipeline computation
– Divide the execution into modules (fetch, decode, …) • Modules are ordered along the flow of computation.
• One module’s output is the input of the next module.
• Then each module becomes a pipeline stage.
– Add pipeline registers before every stage except the 1st.
– These store the inputs for that stage (i.e. outputs from the previous stage)
– Stages execute in parallel working on different instructions. • But, this introduces some new problems (more later)
CPSC 313.203 – 2019W2 Pipelined Y86-64
11
Pipeline Stages How many stages should a pipeline have?
• If it has too few stages…
– We are not exploiting the parallelism present in the program’s
instructions
• If it has too many…
– Then there is much more overhead and complexity.
– The program may not have enough parallelism to use them well.
• More later
CPSC 313.203 – 2019W2 Pipelined Y86-64
12
Examples
• MIPS processor (1985): first RISC processor, 5 stages. • Sparc, PowerPC processors: 9 pipeline stages.
• Intel Pentium IV (late models): ____________
• Intel Core i7 processor: _______________
CPSC 313.203 – 2019W2 Pipelined Y86-64
13
Write back
Memory
Stat Stat
W_stat
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_CndCC
AB
ALU fun.
ALU
dstE
ALU
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
srcA
srcB
D
stat
icode
ifun
rA
rB
valC
valP
m_sta Stat t
M_sta t
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
Data memory
M_valA
rmmovq rA, D(rB) 40ABDDDDDDDDFFFFFFFF
Section 4.5.3 has detailed Description of registers
valP and valA are merged for the call and jmp instructions
Execute
Decode
Fetch imem_error instr_valid
E_stat
Select d_rvalA A
d_srcA d_srcB
dstE dstM srcA srcB
W_valM W_valE
AB Register M
file
D_stat
f_stat Stat
Predict PC
E
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
An Example Problem
What will each pipeline register contain when the last instruction in this sequence is entering the Fetch stage?
• .pos 0x100
• irmovq $1, %rax
• irmovq $2, %rcx
• irmovq $3, %rdx
• irmovq $4, %rbx
• irmovq $5, %rsi
CPSC 313.203 – 2019W2 Pipelined Y86-64
15
F
D ➤ E M W
F.prPC
irmovq $1, %rax
main memory
register file
16
CPSC 313.203 – 2019W2 Pipelined Y86-64
F
D ➤ E M W
F.prPC
D.iCd
D.iFn
D.rA
D.rB
D.valC
D.valP
irmovq $2, %rcx
irmovq $1, %rax
main memory
register file
17
CPSC 313.203 – 2019W2 Pipelined Y86-64
F
D ➤ E
M W
F.prPC
D.iCd
D.iFn
D.rA
D.rB
D.valC
D.valP
E.iCd
E.iFn
E.valC
E.srcA
E.srcB
E.dstE
E.dstM
E.valA
E.valB
irmovq $3, %rdx
irmovq $2, %rcx irmovq $1, %rax
main memory
register file
18
CPSC 313.203 – 2019W2 Pipelined Y86-64
F
D ➤ E M W
F.prPC
D.iCd
D.iFn
D.rA
D.rB
D.valC
D.valP
irmovq $3, %rdx
E.iCd
E.iFn
E.valC
E.srcA
E.srcB
E.dstE
E.dstM
E.valA
E.valB
M.iCd
M.dstE
M.dstM
M.valA
M.valE
M.bch
irmovq $4, %rbx
irmovq $2, %rcx irmovq $1, %rax main memory
register file
19
CPSC 313.203 – 2019W2 Pipelined Y86-64
F
D ➤ E
M
W
F.prPC
D.iCd
D.iFn
D.rA
D.rB
D.valC
D.valP
E.iCd
E.iFn
E.valC
E.srcA
E.srcB
E.dstE
E.dstM
E.valA
E.valB
main memory
M.iCd
M.dstE
M.dstM
M.valA
M.valE
M.bch
W.iCd
W.dstE
W.dstM
W.valE
W.valM
irmovq $5, %rsi
irmovq $4, %rbx irmovq $3, %rdx
irmovq $2, %rcx
irmovq $1, %rax
register file
20
CPSC 313.203 – 2019W2 Pipelined Y86-64
Consider these 3 instructions
• IRMOVQ 0xcafe, %rax • ADDQ %rax, %rbx • JE 0x0
• HALT
Assume %rbx has the –ve value of 0xCAFE
CPSC 313.203 – 2019W2 Pipelined Y86-64
IRMOVQ ADDQ
JE
F
Tick 1
CPSC 313.203 – 2019W2 Pipelined Y86-64
22
IRMOVQ ADDQ
JE
Tick 2
D
F
CPSC 313.203 – 2019W2 Pipelined Y86-64
23
Tick 3
E
D
F
IRMOVQ ADDQ
JE
CPSC 313.203 – 2019W2 Pipelined Y86-64
24
IRMOVQ ADDQ
JE
Tick 4
M
E
D
CPSC 313.203 – 2019W2 Pipelined Y86-64
25
IRMOVQ ADDQ
JE
Tick 5
W
M
E
CPSC 313.203 – 2019W2 Pipelined Y86-64
26
IRMOVQ ADDQ
JE
Tick 6
W
M
CPSC 313.203 – 2019W2 Pipelined Y86-64
27
IRMOVQ ADDQ
JE
W
Tick 7
CPSC 313.203 – 2019W2 Pipelined Y86-64
28
Today
– Continue transforming the sequential processor into a pipelined one
• Learning outcomes
– Quantitatively evaluate tradeoffs in different pipeline architectures
CPSC 313.203 – 2019W2 Pipelined Y86-64
Consider these 3 instructions
• IRMOVQ 0xcafe, %rax • ADDQ %rax, %rbx • JE 0x0
• HALT
Assume %rbx has the –ve value of 0xCAFE
CPSC 313.203 – 2019W2 Pipelined Y86-64
IRMOVQ ADDQ
JE
Tick 3
E
D
F
CPSC 313.203 – 2019W2 Pipelined Y86-64
31
irmovq 0xcafe, %
addq %rax
F0
Start of tick 0
000000 000008 000010
, %rbx
FE
rax
je 0
halt
00
00
30
00
00
00
60
CA
03
73
00
00
00
00
00
00
00
00
00
10
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0
irmovq 0xcafe, %rax
CPSC 313.203 – 2019W2 Pipelined Y86-64
32
0x00
irmovq 0xcafe, %
F0
End of tick 0
000000 000008 000010
FE
rax
addq %rax
je 0
, %rbx
halt
CA
00
00
00
00
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0xA
3
0
F
0
0xcafe
irmovq 0xcafe, %rax
0xA 0xA
CPSC 313.203 – 2019W2 Pipelined Y86-64
0xA
33
irmovq 0xcafe, %
addq %rax
000000 000008 000010
F0
Start of tick 1
FE
rax
, %rbx
je 0
halt
CA
00
00
00
00
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0xA
irmovq 0xcafe, %rax
addq %rax, %rbx
0xA
3
0
F
0
0xcafe
CPSC 313.203 – 2019W2 Pipelined Y86-64
34
0xA
irmovq 0xcafe, %
addq %rax
000000 000008 000010
F0
End of tick 1
FE
rax
, %rbx
je 0
halt
CA
00
00
00
00
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
3
0
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0xC
irmovq 0xcafe, %rax
addq %rax, %rbx
0xC 0xC
CPSC 313.203 – 2019W2 Pipelined Y86-64
35
0xcafe
0
6
0
0
3
0xC
irmovq 0xcafe, %
addq %rax
halt
Start of tick 2
, %rbx
rax
je 0
Where we left off on Friday
000000 000008 000010
F0
FE
CA
00
00
00
00
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
irmovq 0xcafe, %rax
3
0
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0xC
addq %rax, %rbx
je 0
0xC
CPSC 313.203 – 2019W2 Pipelined Y86-64
0xcafe
0
6
0
0
3
0xC
36
irmovq 0xcafe, %
je 0
End of tick 2
rax
addq %rax
, %rbx
halt
000000 000008 000010
3
0xcafe
0
F0
FE
CA
00
00
00
00
30
00
00
00
60
73
00
00
00
00
00
03
00
00
10
irmovq 0xcafe, %rax
6
0
0
-0xCAFE
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0x0
addq %rax, %rbx
je 0
0x0
0x15 0x15
3
7
3
CPSC 313.203 – 2019W2 Pipelined Y86-64
0x0
37
irmovq 0xcafe, %
End of tick 2
We have some problems!
rax
addq %rax
, %rbx
je 0
halt
000000 000008 000010
3
0xcafe
0
F0
FE
CA
00
00
00
00
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
irmovq 0xcafe, %rax
Got 0 but would expect 0xcafe
addq %rax, %rbx
Which one? Won’t know until je is in the execute stage
je 0
3
6
0
0
-0xCAFE
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0x0
7
3
0x0
0x15 0x15
0x0
CPSC 313.203 – 2019W2 Pipelined Y86-64
0x0
38
irmovq 0xcafe, %
addq %rax
rax
, %rbx
je 0
halt
000000 000008 000010
3
0xcafe
0
F0
Start of tick 3
FE
CA
00
00
00
00
irmovq 0xcafe, %rax
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
addq %rax, %rbx
6
0
0
-0xcafe
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0x0
je 0
irmovq 0xcafe, %rax
0x0
0x15
3
7
3
CPSC 313.203 – 2019W2 Pipelined Y86-64
0x0
39
irmovq 0xcafe, %
End of tick 3
rax
addq %rax
, %rbx
je 0
halt
irmovq 0xcafe, %rax
3
0xcafe
0
000000 000008 000010
3
-0xcafe
3
F0
FE
CA
00
00
00
00
30
00
00
00
60
73
00
00
00
00
00
03
00
00
10
addq %rax, %rbx
7
3
%rax 0 %rbx -0xCAFE
SF 1 OF 0 ZF 0
PC 0xA
je 0
irmovq 0xcafe, %rax
0xA 0xA
3
0
F
0
0xcafe
CPSC 313.203 – 2019W2 Pipelined Y86-64
0xA
40
irmovq 0xcafe, %
Start of tick 4
irmovq 0xcafe, %rax
rax
addq %rax
F0
, %rbx
je 0
halt
addq %rax, %rbx
3
0xcafe
0
000000 000008 000010
3
-0xcafe
3
FE
CA
00
00
00
00
30
00
00
00
60
03
73
00
00
00
00
00
00
00
10
je 0
7
3
%rax 0 %rbx -0xCAFE
SF 1 OF 0 ZF 0
PC 0xA
irmovq 0xcafe, %rax
addq %rax, %rbx
0xA 0xA
3
0
F
0
0xcafe
CPSC 313.203 – 2019W2 Pipelined Y86-64
0xA
41
irmovq 0xcafe, %
addq %rax
End of tick 4
irmovq 0xcafe, %rax
rax
, %rbx
je 0
halt
addq %rax, %rbx
6
-0xcafe
3
000000 000008 000010
7
F0
FE
CA
30
00
00
00
60
03
73
00
00
00
00
00
00
00
00
00
00
00
10
je 0
3
0
%rax 0xCAFE %rbx -0xCAFE
SF 1 OF 0 ZF 0
PC 0xC
irmovq 0xcafe, %rax
addq %rax, %rbx
0xC
0xC
0xcafe
0
6
0
0
3
CPSC 313.203 – 2019W2 Pipelined Y86-64
0xC
42
irmovq 0xcafe, %
addq %rax
je 0
halt
000000 000008 000010
7
F0
Start of tick 5
addq %rax, %rbx
FE
rax
, %rbx
CA
je 0
6
-0xcafe
3
30
00
00
00
60
03
73
00
00
00
00
00
00
00
00
00
00
00
10
irmovq 0xcafe, %rax
3
0
%rax 0xCAFE %rbx -0xCAFE
SF 1 OF 0 ZF 0
PC 0xA
addq %rax, %rbx
je 0
0xC
0xC
0xcafe
0
6
0
0
3
CPSC 313.203 – 2019W2 Pipelined Y86-64
0xC
43
irmovq 0xcafe, %
addq %rax
End of tick 5
addq %rax, %rbx
rax
, %rbx
je 0
halt
je 0
7
000000 000008 000010
3
0xcafe
0
F0
FE
CA
30
00
00
00
60
03
73
00
00
00
00
00
00
00
00
00
00
00
10
irmovq 0xcafe, %rax
6
0
0xcafe
-0xcafe
3
%rax 0xCAFE %rbx -0xCAFE
SF 1 OF 0 ZF 0
PC 0x0
addq %rax, %rbx
je 0
0x0
0x15 0x15
7
3
CPSC 313.203 – 2019W2 Pipelined Y86-64
0x0
44
Problems
1. Data hazard – An instruction in the decode step needs an updated register value from an instruction ahead of it in the pipeline
2. Control hazard – Occurs when we don’t know what the next instruction to execute should be: conditional jmp, and return.
CPSC 313.203 – 2019W2 Pipelined Y86-64
How could we solve these problems?
CPSC 313.203 – 2019W2 Pipelined Y86-64
46
If we had added 3 NOPs it would work!
irmovq 0xcafe, %rax
3
0xcafe 0
0xcafe, %
rax
irmovq
NOP NOP NOP
NOP
addq %rax
, %rbx
je 0
halt
NOP
%rax 0 %rbx -0xCAFE
SF 0 OF 0 ZF 0
PC 0x0
NOP
addq %rax, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
47
Measuring Efficiency
– How long it takes to execute one instruction from start to finish
• Latency:
– Will usually not be reduced in a pipeline (in fact it might go up!)
• Throughput:
– The number of instructions we can execute per unit of time – This is the only meaningful measure for pipelined CPUs
CPSC 313.203 – 2019W2 Pipelined Y86-64
48
Adding Pipeline Registers
Fetch
Decode
Execute
Memory
Writeback
CPSC 313.203 – 2019W2 Pipelined Y86-64
49
Adding Pipeline Registers
Fetch
Decode Execute Memory Writeback
CPSC 313.203 – 2019W2 Pipelined Y86-64
50
Question:
Assume that:
• Eachstagestakes200picosecondstoexecute • Eachregisteradds20picoseconds
Decode Execute Memory Writeback
Q1: What is the latency of an instruction in an UNPIPELINED implementation? Q2: What is the latency of an instruction in a PIPELINED implementation?
Fetch
CPSC 313.203 – 2019W2 Pipelined Y86-64
51
Today – Identify data and control hazards
• Learning outcomes
– Propose various mechanisms to deal with them.
CPSC 313.203 – 2019W2 Pipelined Y86-64
52
From Latency to Throughput:
Assume that:
• Eachstagestakes200picosecondstoexecute • Eachregisteradds20picoseconds
Decode Execute Memory Writeback
Q1: What is the throughput of an UNPIPELINED implementation? Q2: What is the throughput of a PIPELINED implementation?
We have picoseconds per instruction; bandwidth is instructions per second …
Fetch
CPSC 313.203 – 2019W2 Pipelined Y86-64
53
From Latency to Throughput (this is just math)
• We have:
1020 picoseconds/instruction
How many picoseconds / second? 10?
• We want:
??? Instructions / second
CPSC 313.203 – 2019W2 Pipelined Y86-64
54
CPSC 313
What is the throughput of our pipelined processor?
Assume that:
• Eachstagestakes200picosecondstoexecute • Eachregisteradds20picoseconds
Decode Execute Memory Writeback
Fetch
CPSC 313.203 – 2019W2 Pipelined Y86-64
55
From Latency to Throughput
• We have:
An instruction retired every 220 picoseconds
How many picoseconds / second? 10?
• We want:
??? Instructions / second
CPSC 313.203 – 2019W2 Pipelined Y86-64
56
Now you do it!
• Assume the following:
– Four stages, each takes 240 picoseconds – Registers add 40 picoseconds
• Compute:
– Latency of an unpipelined implementation
– Latency of a pipelined implementation
– Throughput of an unpipelined implementation – Throughput of a pipelined implementation
CPSC 313.203 – 2019W2 Pipelined Y86-64
57
Now you do it! – Latency of an unpipelined implementation
240 * 4 + 40 = 1000ps
– Latency of a pipelined implementation
280 * 4 = 1120ps
– Throughput of an unpipelined implementation 1012 1
1 ∗1000=1𝐺𝐼𝑃𝑆
– Throughput of a pipelined implementation 1012 1
1 ∗ 280 = 3.57 𝐺𝐼𝑃𝑆
• Compute:
Latency increase 1120 −1000 = 0.12→12%
Throughput Increase
3.57 = 3.57 ~ 3.6 1
Rule of thumb:
Throughput gain is slightly less than the number of stages
1000
CPSC 313.203 – 2019W2 Pipelined Y86-64
58
What if? • Assumethefollowing:
– Three stages:
• Stage 1 takes 300 picoseconds
• Stage 2 takes 60 picoseconds • Stage 3 takes 100 picoseconds
– Registers add 40 picoseconds
• Compute:
– Latency of an unpipelined implementation
– Latency of a pipelined implementation
– Throughput of an unpipelined implementation – Throughput of a pipelined implementation
CPSC 313.203 – 2019W2 Pipelined Y86-64
59
• Compute:
– Latency of an unpipelined implementation
300 + 60 + 100 + 40 = 500𝑝𝑠
– Latency of a pipelined implementation 3 ∗ 300+40 =1020𝑝𝑠
Latency increase 1020 −500 = 1.04→100%
500
Throughput Increase
2.94 = 1.47 ~ 1.5 2
Rule of thumb didn’t hold. Why?
– Throughput of an unpipelined implementation 12
10 ∗ 1 =2𝐺𝐼𝑃𝑆 1 500
– Throughput of a pipelined implementation 1012 1
1 ∗ 340 = 2.94 𝐺𝐼𝑃𝑆
CPSC 313.203 – 2019W2 Pipelined Y86-64
60
Performance another way • Cycles per instruction (CPI)
• In a perfect world, how many (clock) cycles should it take to execute an instruction?
– Pipelined?
– Unpipelined?
• What might prevent a processor from achieving this?
CPSC 313.203 – 2019W2 Pipelined Y86-64
61
Data Hazards
• Data hazards result in a decrease in the effective throughput
• Challenge is to reduce or eliminate the number of data hazards
• First we need to understand under what conditions a hazard will occur
CPSC 313.203 – 2019W2 Pipelined Y86-64
62
Data Hazards: To Hazard or Not to Hazard (1)
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax irmovq 0xBEEF, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
63
Data Hazards: To Hazard or Not to Hazard
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax irmovq 0xBEEF, %rbx
An irmovq never suffers from a hazard, BUT It can cause hazards!
CPSC 313.203 – 2019W2 Pipelined Y86-64
64
Data Hazards: To Hazard or Not to Hazard (2)
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax rmmovq %rax, 0(%rbx)
CPSC 313.203 – 2019W2 Pipelined Y86-64
65
Data Hazards: To Hazard or Not to Hazard
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax rmmovq %rax, 0(%rbx)
read
CPSC 313.203 – 2019W2 Pipelined Y86-64
66
Data Hazards: To Hazard or Not to Hazard (3)
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax irmovq 0xFADE, %rcx rmmovq %rax, 0(%rbx)
CPSC 313.203 – 2019W2 Pipelined Y86-64
67
CPSC 313
Data Hazards: To Hazard or Not to Hazard
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax irmovq 0xFADE, %rcx rmmovq %rax, 0(%rbx)
CPSC 313.203 – 2019W2 Pipelined Y86-64
68
Data Hazards: To Hazard or Not to Hazard (4)
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax irmovq 0xFADE, %rcx irmovq 0xCAFE, %rdi rmmovq %rax, 0(%rbx)
CPSC 313.203 – 2019W2 Pipelined Y86-64
69
CPSC 313
Data Hazards: To Hazard or Not to Hazard
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
irmovq 0xDEAD, %rax irmovq 0xFADE, %rcx irmovq 0xCAFE, %rdi rmmovq %rax, 0(%rbx)
CPSC 313.203 – 2019W2 Pipelined Y86-64
70
CPSC 313
Data Hazards: To Hazard or Not to Hazard (5)
Fetch Decode Fetch
Execute Decode Fetch
Memory Execute Decode Fetch
Writeback Memory Execute Decode Fetch
Writeback Memory Execute
Decode
rmmovq %rax, 0(%rbx) mrmovq 0(%rbx), %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
71
Writeback Memory Execute
Writeback Memory
Writeback
CPSC 313
Data Hazards: To Hazard or Not to Hazard
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
rmmovq %rax, 0(%rbx) mrmovq 0(%rbx), %rdx
No hazards due to memory read/write
(But still possible on the register being updated In a mrmovq.)
CPSC 313.203 – 2019W2 Pipelined Y86-64
72
What rule governs data Hazards?
• Can you write a rule that summarizes all the ways you can get data hazards?
CPSC 313.203 – 2019W2 Pipelined Y86-64
73
What rule governs data Hazards?
• Can you write a rule that summarizes all the ways you can get data hazards?
– We create a data hazard any time we read a register within three instructions of the instruction in which we wrote it.
– In other words, in software, the only way to avoid a data hazard is to place three instructions between the instruction in which you read the register and the instruction in which you write the register.
CPSC 313.203 – 2019W2 Pipelined Y86-64
74
• Nops +
–
• Stalls +
–
• Data Forwarding +
–
Coping with Hazards
CPSC 313.203 – 2019W2 Pipelined Y86-64
75
CPSC 313
Coping with Hazards
• Nops
+ Easy;canbedoneentirelyinsoftware – Wastes cycles/resources
• Stalls
+ Even easier: transparent to the programmer; hardware does it. – Wastes cycles/resources
• Data Forwarding +
–
CPSC 313.203 – 2019W2 Pipelined Y86-64
76
CPSC 313
irmovq 0xDEAD, %rax
addq %rax, %rdi
Data Mitigation: Stalling
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Decode
Decode
Decode
Execute
Memory
Writeback
Fetch
Fetch
Fetch
Fetch
Decode
Execute
Memory
Fetch
Decode
Execute
Fetch
Decode
Fetch
Stall Bubbles
CPSC 313.203 – 2019W2 Pipelined Y86-64
77
irmovq $5, %rax
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
78
irmovq $5, %rax
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
79
irmovq $5, %rax
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
80
irmovq $5, %rax
Bubble/ NOP
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
81
Bubble/ NOP
Bubble/ NOP
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
82
Bubble/ NOP
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
addq%rax, %rsi
CPSC 313.203 – 2019W2 Pipelined Y86-64
83
Administrivia
• Technical Career Fair today – Next until 4 – see Canvas annoucment
• Online quiz 2 out at 5:00 PM due Monday 11:59 PM • Midterm 1 – February 10 5:30 PM in ESB 1013 (?)
– See canvas announcement on reporting conflicts
• A2 is in the process of being finalized will be out by early next
week – Due end of February
CPSC 313.203 – 2019W2 Pipelined Y86-64
84
irmovq $5, %rax
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
85
irmovq $5, %rax
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
86
irmovq $5, %rax
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
87
irmovq $5, %rax
Bubble/ NOP
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
88
Bubble/ NOP
Bubble/ NOP
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
89
Bubble/ NOP
Bubble/ NOP
subq %rax, %rcx
xorq %rax, %rdx
addq%rax, %rsi
CPSC 313.203 – 2019W2 Pipelined Y86-64
90
How many stalls/bubbles?
irmovq $5, %rax
3 stalls need
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
91
irmovq $5, %rax
3 stalls need
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
92
2 stalls need
irmovq $5, %rax
xor %r10, %r10
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
93
irmovq $5, %rax
xor %r10, %r10
1 stall need
xor %r11, %r11
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
94
xor %r10, %r10 xor %r11, %r11
xor %r12, %r12
0 stalls need
subq %rax, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
95
What can you or the compiler do to help?
• How can we rewrite the following code to expose more
instruction level parallelism?
1. addq %rax, %rbx
2. iaddq $5, %rbx
3. addq %rdx, %rcx
4. iaddq $9, %rcx
5. addq %rsi, %rdi
6. iaddq $3, %rdi
iaddq – fictitious immediate add instruction
CPSC 313.203 – 2019W2 Pipelined Y86-64
96
Dealing with hazards
• Each stage register has a control input that determines what happens when the clock ticks:
– Normal: register’s new value is the input value.
– Stall: register’s new value is the same as its current value. – Bubble: register’s new value is the same as for NOP.
• But how does the hardware know to stall?
CPSC 313.203 – 2019W2 Pipelined Y86-64
97
Write back
Memory
Stat W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU
fun. dstE
ALU AB
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
A couple of issues Stat register
srcA, srcB
srcA = func(ra, rsp) srcB = func(rb, rsp)
Doesn’t get used beyond Decode
M_valA
Execute
Decode
E_stat
Select d_rvalA A
dstE dstM
AB Register M
file
E
D_stat
f_stat Stat
Fetch imem_error instr_valid
W_valM W_valE
srcA srcB
Predict PC
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
98
CPSC 313.203 – 2019W2 Pipelined Y86-64
Data memory
Pipeline Control Unit
CPSC313.203–2019W929 PipelinedY86-64
99
irmovq $5, %rax
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
100
Pipeline Considerations Can We Do Better?
Consider this code: irmovq $5, %rax
subq %rax, %rdx
At what stage is irmovq in when addq needs %rax?
CPSC 313.203 – 2019W2 Pipelined Y86-64
101
At what stage will %rax be updated?
At what stage is %rax extracted from the register file?
irmovq $5, %rax
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
102
Administrivia
• Online quiz 2 out at 5:00 PM due tonight 11:59 PM
• Midterm 1 – February 10 5:30 PM in ESB 1013 (?) – See canvas announcement on reporting conflicts
• A2 is in the process of being finalized will be out early this week – Due end of February
CPSC 313.203 – 2019W2 Pipelined Y86-64
103
popq %rcx
irmovq $5, %rax
subq %rax, %rcx
xorq %rax, %rdx
CPSC 313.203 – 2019W2 Pipelined Y86-64
104
Data Forwarding
In y86, at the end of what stage is output known by CPU?
– irmovq, addq, subq, andq, xorq, pushq, popq (rsp), call (rsp), ret (rsp) – mrmovq, popl (target reg)
• in y86, at the end of what stage is register value needed?
CPSC 313.203 – 2019W2 Pipelined Y86-64
105
CPSC 313
Forwarding
Write back
Memory
Stat W_stat
m_stat Stat M_stat
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU
fun. dstE
ALU AB
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
M_valA
Execute
Decode
E_stat
Select d_rvalA A
dstE dstM
AB Register M
file
E
D_stat
f_stat Stat
Fetch imem_error instr_valid
W_valM W_valE
srcA srcB
Predict PC
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
106
CPSC 313.203 – 2019W2 Pipelined Y86-64
Data memory
CPSC 313
Forwarding
Write back
Memory
Stat W_stat
m_stat
Stat
M_stat
W
stat
icode
dmem_error
valE
valM
Data memory
dstE
dstM
Mem. control
M_Cnd
read
write
Addr
data out
data in
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU
fun. dstE
ALU AB
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
M_valA
Forward A
Execute
Decode
E_stat
Forward B
Select d_rvalA dstE dstM A
W_valM W_valE
AB Register M
file
E
D_stat
f_stat Stat
Fetch imem_error instr_valid
srcA srcB
Predict PC
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
107
CPSC 313.203 – 2019W2 Pipelined Y86-64
CPSC 313
Forwarding
Write back
Memory
Stat W_stat
m_stat Stat M_stat
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU
fun. dstE
ALU AB
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
Data memory
M_valA
Execute
Decode
E_stat
Forw Forw AB
Select A
dstE dstM
W_valM W_valE
srcA srcB
AB Register M
file
E
CPSC 313.203 – 2019W2 Pipelined Y86-64
CPSC 313
Forwarding
Write back
Memory
Stat W_stat
m_stat Stat M_stat
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU
fun. dstE
ALU AB
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
Data memory
M_valA
Execute
Decode
E_stat
Forw Forw AB
Select A
dstE dstM
W_valM W_valE
srcA srcB
AB Register M
file
E
CPSC 313.203 – 2019W2 Pipelined Y86-64
CPSC 313.203 – 2019W2 Pipelined Y86-64
110
110
irmovq $10, %rax
Pipe (with forwarding)
CPSC 313.203 – 2019W2 Pipelined Y86-64
111
111
irmovq $10, %rax
addq $%rax, %rbx
Pipe (with forwarding)
CPSC 313.203 – 2019W2 Pipelined Y86-64
112
112
Pipe (with forwarding)
irmovq $10, %rax
Note dependency But no stall needed
addq $%rax, %rbx
mrmovq $10(%rax), %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
113
113
Pipe (with forwarding)
irmovq $10, %rax
addq $%rax, %rbx
mrmovq $10(%rax), %rcx
addq $%rcx, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
114
114
irmovq $10, %rax addq $%rax, %rbx
Pipe
(with forwarding)
mrmovq $10(%rax), %rcx
Data dependency stall needed
addq $%rcx, %rbx
addq $%rsi, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
115
115
addq $%rax, %rbx
Pipe
mrmovq $10(%rax), %rcx
(with forwarding)
Bubble
addq $%rcx, %rbx
addq $%rsi, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
116
116
mrmovq $10(%rax), %rcx
Pipe
(with forwarding)
Bubble
addq $%rcx, %rbx
addq $%rsi, %rbx
addq $%rax, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
117
117
Coping with Hazards
• Nops
+ Easy;canbedoneentirelyinsoftware – Wastes cycles/resources
• Stalls
+ Even easier: transparent to the programmer; hardware does it. – Wastes cycles/resources
• Data Forwarding +
–
CPSC 313.203 – 2019W2 Pipelined Y86-64
118
Coping with Hazards
• Nops
+ Easy; can be done entirely in software – Wastes cycles/resources
• Stalls
+ Even easier: transparent to the programmer; hardware does it. – Wastes cycles/resources
• Data Forwarding
+ Achieve maximum use of pipeline resources without programmer
involvement
– Adds hardware complexity
CPSC 313.203 – 2019W2 Pipelined Y86-64
119
Pipelined Y86 Implementation
• Unitoutline
• Motivationandbasicconcepts • Initialimplementation
• Hazards
– Types of hazards
– Dealing with hazards by stalling
– Data hazards: using data forwarding to avoid stalling – Control hazards
– Indirect jumps
• Performanceanalysis.
CPSC 313.203 – 2019W2 Pipelined Y86-64
120
Control Dependencies
• Control dependencies determine what code is executed next
• Examples:
– Whether a branch is taken or not (conditional jump)
– Whether the address of the next instruction is taken from memory or a register (e.g. return)
– When an instruction writes to the area of memory instructions are being fetched from (self modifying code) We will not explore this
CPSC 313.203 – 2019W2 Pipelined Y86-64
121
Control Hazards
Consider: an unconditional JMP What about a CALL instruction?
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
Data memory
M_valA
Execute
Decode
E_stat
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
D_stat
E
f_stat Stat jcmalpl afalwvoauyrsiTtaekFeunctFioetnch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
122
Control Hazards
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
Data memory
M_valA
Consider: an unconditional JMP What about a CALL instruction? What about a RET instruction?
Execute
Decode
E_stat
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
E
D_stat
f_stat Stat
ret Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
123
Control Hazards
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
Data memory
M_valA
Consider: an unconditional JMP What about a CALL instruction? What about a RET instruction?
Execute
retDecode
E_stat
Select d_rvalA A
dstE dstM
Predict PC
W _valM W_valE
AB Register M
file
E
??
D_stat
f_stat Stat
Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
124
Control Hazards
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
M_valA
Consider: an unconditional JMP
What about a CALL instruction? ret What about a RET instruction?
bubDbecleode
Execute
E_stat
Select d_rvalA A
dstE dstM
Predict PC
W _valM W_valE
AB Register M
file
E
??
D_stat
f_stat Stat
Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
125
Data memory
Write back
W_stat
m_stat Stat
M_stat
Control Hazards
dmem_error
control
M_Cnd
read
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
ret
Mem.
Memory
write
Addr
M_valA
Consider: an unconditional JMP What about a CALL instruction? What about a RET instruction?
e
bubble
Execut
E_stat
bubDbecleode
Select d_rvalA A
dstE dstM
Predict PC
W _valM W_valE
??
D_stat
f_stat Stat
Fetch imem_error instr_valid
CPSC 313.203 – 2019W2 Pipelined Y86-64
126
f_pc
Select PC
M_valA W_valM
Data memory
AB Register M
file
E
Instruction memory
PC increment
F
predPC
Control Hazards
ret Write back
y
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
bubble
Data memory
Memor
M_valA
Consider: an unconditional JMP What about a CALL instruction? What about a RET instruction?
e
bubDbecleode
E_stat
bubble
Execut
Select d_rvalA A
dstE dstM
Predict PC
W _valM W_valE
AB Register M
file
E
Instruction memory
PC increment
??
D_stat
f_stat Stat
Fetch imem_error instr_valid
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
127
Control Hazards
Consider: an unconditional JMP
What about a CALL instruction?
What about a RET instruction?
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
Data memory
M_valA
Execute
Decode
E_stat
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
What about a conditional jump instruction?
D_stat
f_stat Stat
E
jge dest Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
128
What if we stalled?
CPSC 313.203 – 2019W2 Pipelined Y86-64
© 2012 Donald Acton 129
Control Hazards
Consider: an unconditional JMP
What about a CALL instruction?
What about a RET instruction?
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
Data memory
M_valA
Execute
Decode
E_stat
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
What about a conditional jump instruction?
D_stat
f_stat Stat
E
jge dest Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
130
Control Hazards
Consider: an unconditional JMP
What about a CALL instruction?
What about a RET instruction?
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
Data memory
M_valA
Execute
E_stat
jge destDecode What about a conditional jump instruction?
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
E
??
D_stat
f_stat Stat
Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
131
Control Hazards
Consider: an unconditional JMP
What about a CALL instruction?
What about a RET instruction?
Write back
Memory
W_stat
m_stat Stat M_stat
dmem_error
M_Cnd
Mem. control
read
write
Addr
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
M_valA
jge dest
Execute
E_stat
bubble Decode What about a conditional jump instruction?
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
E
??
D_stat
f_stat Stat
Fetch imem_error instr_valid
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
132
Data memory
Write back
W_stat
m_stat Stat
M_stat
Control Hazards
dmem_error
M_Cnd
Mem. control
read write
data out
data in
W
stat
icode
valE
valM
dstE
dstM
M
stat
icode
Cnd
valE
valA
dstE
dstM
e_Cnd
CC
ALU
ALU AB
ALU fun.
dstE
ALU
E
stat
icode
ifun
valC
valA
valB
dstE
dstM
D
stat
icode
ifun
rA
rB
valC
valP
jge destMemory
Addr
M_valA
Consider: an unconditional JMP
What about a CALL instruction?
What about a RET instruction?
bubbleExecute
E_stat
bubble Decode What about a conditional jump instruction?
Select d_rvalA A
dstE dstM
Predict PC
W_valM W_valE
AB Register M
file
D_stat
E
f_stat Stat Fetch imem_error
instr_valid
We know
Instruction memory
PC increment
f_pc
Select PC
M_valA W_valM
F
predPC
CPSC 313.203 – 2019W2 Pipelined Y86-64
133
Data memory
Branch Prediction
• Let’s convert these two code snippets into y86
do { if (p == NULL)
j++; handle error
} while j < 10;
CPSC 313.203 – 2019W2 Pipelined Y86-64
134
Branch Prediction
• Let’s convert these two code snippets into y86
do {
j++;
} while j < 10;
# Let’s assume j is in %rsi irmovq 1, %rax
# rax = 1
# rdi = 10 # set CC
if (p == NULL)
handle error
# Let’s assume p is in %rbp
rrmovq %rbp, %rbx andq %rbx, %rbx jne skip
# Assume we handle the error here
Skip: # p != NULL or has been handled
Loop:
addq %rax, %rsi irmovq 10, %rdi subq %rsi, %rdi jl Loop
CPSC 313.203 – 2019W2 Pipelined Y86-64
135
Fun Facts about Branch Prediction
• Always taken typically has about a 60% success rate
• Never taken typically has about a 40% success rate
• Some processors do:
– Backward taken; forward not taken (65% success rate)
• Some processors devote lots of fancy hardware to branch prediction – 1 bit: do whatever you did last time
– More bits for more complicated schemes
– Entries for branches at different values of the PC
• Other processors let you execute instructions while you’re waiting to determine what’s going to happen with your jump (these instructions occupy what are known as branch delay slots).
CPSC 313.203 – 2019W2 Pipelined Y86-64
136
CPSC 313
Today – Identify data and control hazards
• Learningoutcomes
– Distinguish those that the hardware can handle without stalling and those
that require stalling.
– Compute CPI in the presence of stalls.
– Practice rearranging code to reduce the number of hazards
• How we’ll get there
– Finish talking about branch prediction
– Practice computing CPI
– Practice reducing the number of hazards
CPSC 313.203 – 2019W2 Pipelined Y86-64
137
CPSC 313
administrivai
• Midterm 1 on Monday @ 5:30 ESB 1013
• Monday’s class will be TA led question/answer session
• See the MT 1 description on Canvas for details on what is allowed
• We will be adjusting the learning outcomes to be tested to better reflect what was covered (e.g. you don’t have to know anything about the x86)
• Make sure you know your account (r2d2) – Run getacct to determine it
CPSC 313.203 – 2019W2 Pipelined Y86-64
138
CPSC 313
The Problem with Branch Prediction • What happens if you are wrong?
• Consider:
irmovq 0x1, %rcx
irmovq 0x1, %rcx
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx mulq %rcx, %rcx
CPSC 313.203 – 2019W2 Pipelined Y86-64
139
The Problem with Branch Prediction
• What happens if you are wrong?
• Consider:
irmovq 0x1, %rcx
irmovq 0x1, %rcx addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
addq %rcx, %rbx mulq %rcx, %rcx
You have two bad instructions in the pipeline!
• That is, instructions you should not have executed!
• So what do you do???
You
• cancel them
• or squash them
• or quash them!
• Shoot them down
All mean the same thing
So:
• In the best case, conditional branches incur no penalty
• In the worst case (where you mispredict), it’s no worse
than if you’d done no prection)
CPSC 313.203 – 2019W2 Pipelined Y86-64
140
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (1)
addq %rax, %rbx
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
141
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (2)
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
Execute
Memory
Writeback
je Skip
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
142
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (3)
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
valE = valA+valB=0
Memory
Writeback
je Skip
valC=Skip
Execute
Memory
Writeback
irmovq 0x5000, %rcx
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Assume we predict NOT TAKEN
Execute next instruction in sequence as if the jump were not taken.
Warning The book predicts branch always taken
CPSC 313.203 – 2019W2 Pipelined Y86-64
143
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (4)
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
valE = valA+valB=0 ZF = 1
Writeback
je Skip
valC=Skip
Cnd(1,’e’)
Memory
Writeback
irmovq 0x5000, %rcx
valC=0x5000
Execute
Memory
Writeback
irmovq 0x5000, %rcx
Decode
Execute
Memory
Writeback
Fetch
Decode
Execute
Memory
Uh oh! We can now evaluate the condition, and it turns out we should have jumped!
CPSC 313.203 – 2019W2 Pipelined Y86-64
144
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (5)
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
valE = valA+valB=0 ZF = 1
R[%rbx] = 0
je Skip
valC=Skip
Cnd(1,’e’)
Writeback
irmovq 0x5000, %rcx
valC=0x5000
valE=valC+0
Memory
Writeback
irmovq 0x5000, %rcx
valC=0x5000
Execute
Memory
Writeback
addq %rax, %rbx
Decode
Execute
Memory
OK – so now we are executing at the place we should have been executing after the jump
CPSC 313.203 – 2019W2 Pipelined Y86-64
145
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (6)
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
valE = valA+valB=0 ZF = 1
R[%rbx] = 0
je Skip
valC=Skip
Cnd(1,’e’)
Writeback
irmovq 0x5000, %rcx
valC=0x5000
valE=valC+0
Memory
Writeback
irmovq 0x5000, %rcx
valC=0x5000
Execute
Memory
Writeback
addq %rax, %rbx
Decode
Execute
Memory
But what about these instructions???
Bad news: You have started executing two instructions that you should not have.
Good news: They haven’t modified any state that anyone has “seen.”
CPSC 313.203 – 2019W2 Pipelined Y86-64
146
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Branch Mis-Predicts (6)
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
valE = valA+valB=0 ZF = 1
R[%rbx] = 0
je Skip
valC=Skip
Cnd(1,’e’)
Writeback
irmovq 0x5000, %rcx
valC=0x5000
valE=valC+0
Memory
Writeback
irmovq 0x5000, %rcx
valC=0x5000
Execute
Memory
Writeback
addq %rax, %rbx
Decode
Execute
Memory
We will cancel/squash/quash them!
Make sure that no visible changes get made:
• No writeback
• No writes to memory
• No update of condition codes
Bad news: You have started executing two instructions that you should not have.
Good news: They haven’t modified any state that anyone has “seen.”
CPSC 313.203 – 2019W2 Pipelined Y86-64
147
CPSC 313
R[%rcx] = 1, R[%rdx] = 1 R[%rax] = 10, R[%rbx] = -10
addq %rax, %rbx
je Skip
irmovq 0x5000, %rcx irmovq 0x5000, %rdx
Skip:
addq %rcx, %rbx
mulq %rcx, %rcx
Compute CPI
addq %rax, %rbx
valA=R[%rax]=10 valB=R[%rbx]=-10
valE = valA+valB=0 ZF = 1
R[%rbx] = 0
je Skip
valC=Skip
Cnd(1,’e’)
Writeback
irmovq 0x5000, %rcx
valC=0x5000
valE=valC+0
Memory
Writeback
irmovq 0x5000, %rcx
valC=0x5000
Execute
Memory
Writeback
addq %rax, %rbx
Decode
Execute
Memory
9 cycles/3 instructions completed = 3 CPI
+1
CPSC 313.203 – 2019W2 Pipelined Y86-64
148
FAILED BRANCH PREDICTION – BY THE BOOK
CPSC 313.203 – 2019W2 Pipelined Y86-64
149
Branch prediction – failed Pipe (with forwarding)
jle
CPSC 313.203 – 2019W2 Pipelined Y86-64
150
150
Branch prediction – failed Pipe (with forwarding)
jle
addq $%rsi, %rbx
pc + 9
CPSC 313.203 – 2019W2 Pipelined Y86-64
151
151
Branch prediction – failed Pipe (with forwarding)
jle
pc + 9
addq $%rsi, %rbx
addq $%rdx, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
152
152
jle
Branch prediction – failed Pipe (with forwarding)
pc + 9
aBdudbqbl$e%rsi, %rbx
These need to be shot down
aBdudbqbl$e%rdx, %rbx
subq $%rsi, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
153
153
Bubble
Branch prediction – failed Pipe (with forwarding)
jle
Bubble
subq $%rsi, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
154
154
Bubble
Branch prediction – failed
jle
Pipe
Bubble
(with forwarding)
subq $%rsi, %rbx
addq %rax, %rax
CPSC 313.203 – 2019W2 Pipelined Y86-64
155
155
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
Now you practice! (1)
1. Assume no forwarding: how many bubbles are produced?
2. How many CPI did this program achieve?
CPSC 313.203 – 2019W2 Pipelined Y86-64
156
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
Now you practice! (1)
1. Assume no forwarding: how many bubbles are produced? 5/iteration => 30
2. How many CPI did this program achieve? Just under 2 (71/37)
CPSC 313.203 – 2019W2 Pipelined Y86-64
157
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
Now you practice! (2)
1. Assume forwarding: how many bubbles are produced?
2. How many CPI did this program achieve?
CPSC 313.203 – 2019W2 Pipelined Y86-64
158
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
Now you practice! (2)
1. Assume forwarding: how many bubbles are produced? 2/iteration => 12
2. How many CPI did this program achieve? (53/37 = 1.4)
CPSC 313.203 – 2019W2 Pipelined Y86-64
159
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Now you practice! (3)
Done:
1. Assume forwarding and branch prediction (always taken): How many bubbles?
2. How many CPI did this program achieve?
CPSC 313.203 – 2019W2 Pipelined Y86-64
160
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi = 5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
Now you practice! (3)
1. Assume forwarding and branch prediction (always taken): How many bubbles?
2. How many CPI did this program achieve?
NOT MUCH BETTER!
We are predicting everything but the last branch wrong!
CPSC 313.203 – 2019W2 Pipelined Y86-64
161
CPSC 313
Now you practice! (4)
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
1. Assume forwarding and branch prediction (never taken): How many bubbles?
2. How many CPI did this program achieve?
CPSC 313.203 – 2019W2 Pipelined Y86-64
162
CPSC 313
Loop:
xorq %rax, %rax xorq %rsi, %rsi irmovq 1, %rbx irmovq 5, %rdi subq %rsi, %rdi jle Done
addq %rbx, %rsi addq %rbx, %rax jmp Loop
# rax = 0 (count) # rsi = 0 = j
# rdi=5
# check end cond
# if 10-j <= 0, jmp # j++
# count++
count = 0;
for (j = 0; j < 5; j++)
count++;
Done:
Now you practice! (4)
1. Assume forwarding and branch prediction (never taken): How many bubbles? Just 2 on the last iteration!
2. How many CPI did this program achieve? 1.16
CPSC 313.203 – 2019W2 Pipelined Y86-64
163
Administrivia • Midterm Monday 5:30 ESB 1013
• All module 2 slides posted in 1 batch
CPSC 313.203 – 2019W2 Pipelined Y86-64
164
Today
– Practice rearranging code to reduce the number of hazards
• Learning outcomes
– Place pipelining in the context of the many forms of hardware
parallelism
• How we’ll get there
– Finish practice problems from Wednesday
CPSC 313.203 – 2019W2 Pipelined Y86-64
165
Aside: how are errors handled in the pipeline
• Instructions could get errors while executing (e.g. divide by 0) • Fetch could try to fetch an instruction from an invalid address • Memory stage could try to read or write to an invalid address
CPSC 313.203 – 2019W2 Pipelined Y86-64
166
Suppose divide by 0
irmovq $100, %rsi Pipe
This instruction must complete
(with forwarding)
divq %rax, %rbx
Error Detected here
pc + 9
• Update stat at end of clock tick to indicate exception
addq %rsi, %rbx
• Deal with exception as instruction is
retired
• Probably shoot-down instuctions
behind it in the pipeline
addq %rdx, %rbx
CPSC 313.203 – 2019W2 Pipelined Y86-64
167
167
CPSC 313
irmovq Data, %rbp points to data
irmovq 8, %rsi
xorq %rcx, %rcx (accumulator)
loop:
mrmovq 0(%rbp), %rdi andq %rdi, %rdi
je Done
addq %rdi, %rcx
addq %rsi, %rbp
item
jmp loop
Done:
halt Data:
.quad 0xDECADE .quad 0xBE11 .quad 0xAB1E .quad 0xABBA .quad 0
# rbp
# read from array
# Check for 0 element # Done if 0
# add element to rcx # advance to next
# process next
Assume forwarding and branch prediction (always taken)
long long array[5]; long long *ap = array; long long sum = 0;
while (*ap != 0) {
sum += *ap;
ap++; }
Let’s try getting rid of bubbles
# rsi = 8 # rcx = 0
1. Where are the bubbles?
2. How many bubbles?
3. CPI (Hint: 26 instructions are executed in the sequential case)
CPSC 313.203 – 2019W2 Pipelined Y86-64
168
CPSC 313
irmovq Data, %rbp points to data
irmovq 8, %rsi
xorq %rcx, %rcx (accumulator)
loop:
mrmovq 0(%rbp), %rdi andq %rdi, %rdi
je Done
addq %rdi, %rcx
addq %rsi, %rbp
item
jmp loop
Done:
halt Data:
.quad 0xDECADE .quad 0xBE11 .quad 0xAB1E .quad 0xABBA .quad 0
# rbp
# read from array
# Check for 0 element # Done if 0
# add element to rcx # advance to next
# process next
Assume forwarding and branch prediction (always taken)
long long array[5]; long long *ap = array; long long sum = 0;
while (*ap != 0) {
sum += *ap;
ap++; }
Let’s try getting rid of bubbles
# rsi = 8 # rcx = 0
There are two ways you can reduce the number of bubbles. What are they?
CPSC 313.203 – 2019W2 Pipelined Y86-64
169
CPSC 313
irmovq Data, %rbp points to data
irmovq 8, %rsi
xorq %rcx, %rcx (accumulator)
loop:
mrmovq 0(%rbp), %rdi andq %rdi, %rdi
je Done
addq %rdi, %rcx
addq %rsi, %rbp
item
jmp loop
Done:
halt Data:
.quad 0xDECADE .quad 0xBE11 .quad 0xAB1E .quad 0xABBA .quad 0
# rbp
# read from array
# Check for 0 element # Done if 0
# add element to rcx
# advance to next
# process next
Assume forwarding and branch prediction (always taken)
long long array[5]; long long *ap = array; long long sum = 0;
while (*ap != 0) {
sum += *ap;
ap++; }
Let’s try getting rid of bubbles
# rsi = 8 # rcx = 0
There are two ways you can reduce the number of bubbles. What are they?
1. Do something useful after the mrmovq that does not depend on rdi
CPSC 313.203 – 2019W2 Pipelined Y86-64
170
CPSC 313
irmovq Data, %rbp points to data
irmovq 8, %rsi
xorq %rcx, %rcx (accumulator)
loop:
mrmovq 0(%rbp), %rdi addq %rsi, %rbp
# rbp
Assume forwarding and branch prediction (always taken)
long long array[5]; long long *ap = array; long long sum = 0;
while (*ap != 0) {
sum += *ap;
ap++;
item
Done: halt
Data:
.quad 0xDECADE
.quad 0xBE11 .quad 0xAB1E .quad 0xABBA .quad 0
# add element to rcx } # Check for 0 element
# process next
addq %rdi, %rcx
andq %rdi, %rdi
jne loop
Let’s try getting rid of bubbles
#rsi=8 #rcx=0
# read from array
# advance to next
There are two ways you can reduce the number of bubbles. What are they?
1. Do something useful after the mrmovq that does not depend on rdi
2. Remove one branch and make the taken be the more common
CPSC 313.203 – 2019W2 Pipelined Y86-64
171
General CPI Analysis
• cycles per instruction (CPI)
– measures pipeline efficiency = 1/throughput
Don’t count NOPs
• CPI = totalCycles / instructionRetiredCycles = 1 + lp + mp + rp
• bubble penalties – load/use
– branchmisprediction – return
lp=1bubble*probabilityof occurrence
mp=2bubbles*probabilityof occurrence
rp=3bubbles*probabilityof occurrence
CPSC 313.203 – 2019W2 Pipelined Y86-64
172
172
Performance Analysis Example
Cause
Name
Instruction frequency
Condition frequency
Bubbles
Penalty
load/use
lp
25%
20%
1
.25?*.2=.05
mispredict
mp
20%
40%
2
.2*.4?*2=.16
return
rp
2%
100%
3
.0?2*3 = .06
total
1 + .27 = 1.27
?
CPI = ?
CPSC 313.203 – 2019W2 Pipelined Y86-64
173
173
Pipelining & Parallelism
• Pipelining is one form of parallelism, in that multiple instructions execute at the same time (just in difference stages).
• Can you think of scenarios in which you might want different forms of parallelism?
CPSC 313.203 – 2019W2 Pipelined Y86-64
174
Pipelining & Parallelism
• Pipelining is one form of parallelism, in that multiple instructions execute at the same time (just in difference stages).
• Can you think of scenarios in which you might want different forms of parallelism?
– My laptop: I want to prepare powerpoint sides, but also listen to music. – A server: I want to handle multiple requests at the same time.
– A compute heavy job: I want to parallelize my one big workload.
– Multithreading: I want all my threads to run at the same time.
CPSC 313.203 – 2019W2 Pipelined Y86-64
175
Different Kinds of Parallelism • Multitasking (or multiprogramming): This is basically an illusion.
– Software makes it look like many things are happening at once, but quite likely the operating system is simply rapidly switching among many different jobs.
• Multiprocessing:
– Place multiple processing units (CPU) on a machine so you can run different programs on
each unit.
• Parallel runtimes that can leverage multiprocessing or other forms of parallelism:
– The parallelism can be figured out either manually or automatically.
– Then you need a runtime system to dispatch the different tasks to different processing elements
• Multithreading:
– One way to implement a parallel runtime
– Again, the idea is to allow different threads to run on different processing units.
CPSC 313.203 – 2019W2 Pipelined Y86-64
176
Multiprocessing
CPU
CPU
CPU
CPU
Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
177
Multiprocessing
CPU
CPU
CPU
CPU
Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
178
What is really inside a single chip?
Execution Core
L1 Cache
L2 Cache
L2 controller is here, But the actual cache is On a separate die.
Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
179
Multicore
Execution Core
L1 Cache
L2 Cache
Execution Core
Execution Core
L1 Cache
L1 Cache
L2 Cache
Memory Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
180
CPSC 313
Hyperthreading in brief
Integer Adder FPU
CPSC 313.203 – 2019W2 Pipelined Y86-64
181
Hyperthreaded
Execution Core
L1 Cache
L2 Cache
Execution Cores
L1 Cache
L2 Cache
Memory Memory
CPSC 313.203 – 2019W2 Pipelined Y86-64
182
CPSC 313
Hyperthreaded Multicore
Execution Core
L1 Cache
L2 Cache
Execution Cores
L1 Cache
L2 Cache
Execution Cores
L1 Cache
Execution Cores
L1 Cache
L2 Cache
Memory Memory Memory CPSC 313.203 – 2019W2 Pipelined Y86-64
183
Flynn’s Taxonomy: xIxD for x = M(ultiple)/S(ingle)
• SISD
– Single Instruction, Single Data – Your basic single processor.
• MIMD:
– Multiple Instruction, Multiple Data
– This is really what a multi-core or multiprocessor machine is.
CPSC 313.203 – 2019W2 Pipelined Y86-64
184
More Exotic Architectures
• SIMD:
– Single Instruction, Multiple Data
– That is, you want to apply the same set of instructions to lots of data – Can you think of applications where this makes sense?
• MISD
– Multiple Instruction, Single Data
– This time, different functional units do different things to the same data. – These are less common structures.
CPSC 313.203 – 2019W2 Pipelined Y86-64
185
• SIMD:
More Exotic Architectures
– Single Instruction, Multiple Data
– That is, you want to apply the same set of instructions to lots of data
– Can you think of applications where this makes sense?
• This is basically what a GPU is – you try to perform the same computation on a lot of
data in parallel.
• MISD
– Multiple Instruction, Single Data
– This time, different functional units do different things to the same data. – These are less common structures.
CPSC 313.203 – 2019W2 Pipelined Y86-64
186