SEC204
Computer Architecture and Low Level Programming
Dr. Vasilios Kelefouras
Email: v.kelefouras@plymouth.ac.uk
Website: https://www.plymouth.ac.uk/staff/vasilios-kelefouras
School of Computing (University of Plymouth)
Date: 21/10/2019
Instruction Pipeline
So far we have used the following sequence: fetch instruction, decode instruction, execute instruction
Notice that each execution step uses a different functional unit, but the units are idle most of the time
That’s a lot of hardware sitting around doing nothing
We shouldn’t have to wait for an entire instruction to complete before we can re-use the functional units
Pipelining solves this inefficiency
Pipelining is a general technique applied in our everyday life, not just to computers, e.g., restaurants
Instruction Pipeline – an analogy to laundry
Assume we have got:
One washer (takes 1 hour) – W
One drier (takes 1 hour) – D
One folder (takes 1 hour) – F
Something/someone to store the clothes (takes 1 hour) – S
So, it takes 4 hours to wash, dry, fold and store a load of laundry
Instruction Pipeline – an analogy to laundry (1)
Assume we have got 4 loads of laundry – we have to wait for 16 hours

Time in hours:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Load 1:          W  D  F  S
Load 2:                      W  D  F  S
Load 3:                                  W  D  F  S
Load 4:                                              W  D  F  S
Instruction Pipeline – an analogy to laundry (2)
But, if we use the pipelining technique, we need to wait for just 7 hours

Time in hours:   1  2  3  4  5  6  7
Load 1:          W  D  F  S
Load 2:             W  D  F  S
Load 3:                W  D  F  S
Load 4:                   W  D  F  S

2nd time slot: after the washer has washed the 1st load, it starts washing the 2nd
4th time slot: the washer washes the 4th load, the drier dries the 3rd load, the folder folds the 2nd load and the man stores the 1st load
Instruction Pipeline – an analogy to laundry (3)

Time in hours:   1  2  3  4  5  6  7
Load 1:          W  D  F  S
Load 2:             W  D  F  S
Load 3:                W  D  F  S
Load 4:                   W  D  F  S
                 (hours 1-3: filling, hour 4: full, hours 5-7: emptying)

The latency of a single load remains 4 hours
But throughput is increased – the number of loads completed per unit of time
Once the pipeline is full, a load finishes after every time slot
The time to fill and drain the pipeline reduces throughput, but it happens only at the beginning and at the end, respectively. Consider 1000 loads: the filling and emptying time becomes negligible
The maximum speedup equals the number of pipeline stages
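To make the arithmetic concrete, here is a minimal C sketch (an illustration, not part of the original slides): with k equal stages of length t, n loads take n*k*t time units sequentially, but only (k + n - 1)*t when pipelined:

#include <stdio.h>

/* Sequential vs pipelined laundry time: k stages of t hours each. */
int main(void) {
    const int k = 4;                     /* washer, drier, folder, storer */
    const int t = 1;                     /* hours per stage */
    const int loads[] = {4, 100, 1000};

    for (int i = 0; i < 3; i++) {
        int n = loads[i];
        int seq  = n * k * t;            /* wait for each load to finish */
        int pipe = (k + n - 1) * t;      /* overlap the stages */
        printf("%4d loads: sequential %4d h, pipelined %4d h, speedup %.2f\n",
               n, seq, pipe, (double)seq / pipe);
    }
    return 0;
}

For 4 loads this gives 16 vs 7 hours, and for 1000 loads the speedup approaches the number of stages (4), matching the points above.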
Instruction Pipeline – back to computers (1)
Now, the time in hours becomes time in CPU clock cycles
Different processors have different numbers of pipeline stages
Many designs include pipelines as long as 7, 10 and even 20 stages
The instructions are divided into smaller steps which are performed by different processor units
Each pipeline stage needs one CPU cycle
Pipelining increases throughput, but increases the latency of each instruction due to the added overhead of the pipelining process itself (explained next)
As the pipeline is made “deeper” (with a greater number of steps), a given step can be implemented with simpler circuitry, which may let the processor clock run faster; e.g., splitting a 60 ns stage into two shorter stages allows a shorter clock period
Instruction Pipeline – back to computers (2)
Pipelining increases the CPU instruction throughput – the number of instructions completed per unit of time
it does not reduce the execution time of an individual instruction
In fact, it usually slightly increases the execution time of each instruction due to overhead in the pipeline control
Pipeline overhead arises because extra hardware is needed (pipeline registers), which introduces a delay for several reasons, e.g., clock skew
Clock skew is a phenomenon in synchronous circuits in which the clock signal arrives at different components at different times
The increase in instruction throughput means that a program runs faster and has lower total execution time
Instruction Pipeline – Performance Issues (1)
Consider a non-pipelined machine with 5 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, and 50 ns
1. Find the instruction latency
2. How much time does it take to execute 100 instructions?
Instruction latency = 50 + 50 + 60 + 60 + 50 = 270 ns
Time to execute 100 instructions = 100 * 270 = 27000 ns
Instruction Pipeline – Performance Issues (2)
Suppose we introduce pipelining
Assume that when introducing pipelining, the clock skew adds 5ns of overhead to each execution stage
1. What is the instruction latency on the pipelined machine?
2. How much time does it take to execute 100 instructions?
The lengths of the pipeline stages must all be the same, i.e., the speed of the slowest stage plus the overhead
The length of a pipelined stage = max(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 5 stages * 65 ns = 325 ns (note: 65 ns is the cycle time, i.e., the time between instruction completions once the pipeline is full)
Time to execute 100 instructions = 5*65 + 99*65 = 325 + 6435 = 6760 ns (the first instruction completes after 5 cycles; after that, one instruction completes per cycle)
Speedup for 100 instructions = 27000 / 6760 = 3.994 ≈ 4 (the ratio of the average instruction time without pipelining to the average instruction time with pipelining)
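To check the arithmetic, here is a minimal C sketch (not part of the original slides) that reproduces this worked example with the given stage lengths, the 5 ns per-stage overhead and 100 instructions:

#include <stdio.h>

/* Reproduces the worked example above: 5 unpipelined stage lengths,
 * 5 ns of pipeline-register/skew overhead per stage, 100 instructions. */
int main(void) {
    const int stage_ns[5] = {50, 50, 60, 60, 50};
    const int overhead_ns = 5;
    const int n = 100;

    int unpipelined = 0, slowest = 0;
    for (int i = 0; i < 5; i++) {
        unpipelined += stage_ns[i];
        if (stage_ns[i] > slowest) slowest = stage_ns[i];
    }
    const int cycle = slowest + overhead_ns;            /* 65 ns */
    const int pipelined = 5 * cycle + (n - 1) * cycle;  /* first instr, then one per cycle */

    printf("unpipelined: %d ns, pipelined: %d ns, speedup: %.3f\n",
           n * unpipelined, pipelined,
           (double)(n * unpipelined) / pipelined);
    /* prints: unpipelined: 27000 ns, pipelined: 6760 ns, speedup: 3.994 */
    return 0;
}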
The classic five stage RISC pipeline
1. Instruction Fetch: Read an instruction from memory
2. Instruction Decode: identify the instruction and its operands
3. Execute: execute an arithmetical instruction or compute the address of a load/store
4. Memory: load or store from/to memory
5. Write Back: Store the result in the destination register
Note, not all instructions need all five steps
The classic five stage RISC pipeline
[Figure: CPU with the register file (RF), L1 instruction cache, L1 data cache, unified L2 cache, and main memory]
Instruction Fetch (IF):
The instruction is read from memory
The PC (Program Counter) holds the address of the next instruction
The instruction is loaded from L1 instruction cache (1 cycle)
But the next instruction is not always in the instruction cache
The instruction is stored into the Instruction Register (IR)
The classic five stage RISC pipeline
Instruction Decode (ID):
The processor reads the Instruction Register (IR) and identifies the instruction
Reads any operands required from the register file
The CPU generates the control signals
The instruction decode phase will calculate the next PC and will send it back to the IF phase so that the IF phase knows which instruction to fetch next
Execute:
The Execute stage is where the actual computation occurs
These calculations are all done by the ALU
The arithmetical instructions are executed at this stage
For load/store instructions, the memory address is calculated
The classic five stage RISC pipeline
Memory:
The Memory Access stage performs any memory access required by the current instruction
So, for loads, it would load an operand from L1 data cache memory
For stores, it would store an operand into memory
For all other instructions, it would do nothing
Note that the data are not always in L1 data cache memory
Write Back:
For instructions that have a result (a destination register), the Write Back writes this result back to the register file
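As a toy illustration only (a made-up mini instruction set, executed one instruction at a time rather than pipelined), the C sketch below maps the five stages onto the steps of a simple interpreter loop, to show what each stage contributes:

#include <stdio.h>

/* Toy register machine: each step of the loop below plays the role of
 * one of the five RISC pipeline stages. Not a real ISA. */
enum Op { ADD, LOAD, STORE, HALT };
typedef struct { enum Op op; int dest, src1, src2, addr; } Instr;

static int reg[8];    /* register file (RF) */
static int mem[16];   /* data memory */

int main(void) {
    const Instr prog[] = {
        {LOAD,  1, 0, 0, 0},   /* r1 = mem[0] */
        {LOAD,  2, 0, 0, 1},   /* r2 = mem[1] */
        {ADD,   3, 1, 2, 0},   /* r3 = r1 + r2 */
        {STORE, 3, 0, 0, 2},   /* mem[2] = r3 */
        {HALT,  0, 0, 0, 0},
    };
    mem[0] = 20; mem[1] = 22;

    int pc = 0;
    for (;;) {
        Instr ir = prog[pc++];                      /* IF: fetch, advance PC */
        if (ir.op == HALT) break;                   /* ID: identify instruction */
        int a = reg[ir.src1], b = reg[ir.src2];     /* ID: read operands from RF */
        int result = 0;
        switch (ir.op) {
        case ADD:   result = a + b;              break;  /* EX: ALU operation */
        case LOAD:  result = mem[ir.addr];       break;  /* MEM: read memory */
        case STORE: mem[ir.addr] = reg[ir.dest]; break;  /* MEM: write memory */
        default:    break;
        }
        if (ir.op != STORE) reg[ir.dest] = result;  /* WB: write destination reg */
    }
    printf("mem[2] = %d\n", mem[2]);   /* prints 42 */
    return 0;
}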
The classic five stage RISC pipeline
However, not all the instructions need all five stages

Instruction      Steps required
Arithmetical     IF  ID  EX  NOP  WB
Load             IF  ID  EX  MEM  WB
Store            IF  ID  EX  MEM  NOP
Branch           IF  ID  EX  NOP  NOP
Pipeline Terminology (1)

Clock cycle:    1    2    3    4    5    6    7    8    9
Instruction 1:  IF   ID   EX   MEM  WB
Instruction 2:       IF   ID   EX   MEM  WB
Instruction 3:            IF   ID   EX   MEM  WB
Instruction 4:                 IF   ID   EX   MEM  WB
Instruction 5:                      IF   ID   EX   MEM  WB

A pipeline diagram shows the execution of a series of instructions
The instruction sequence is shown vertically, from top to bottom
Clock cycles are shown horizontally, from left to right
The pipeline depth is the number of stages – in this case, five
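Such diagrams are easy to generate programmatically. The small C sketch below (illustrative only) prints the same staggered five-stage diagram, assuming one instruction enters the pipeline per cycle and no hazards:

#include <stdio.h>

/* Prints a hazard-free five-stage pipeline diagram for five instructions. */
int main(void) {
    const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    const int depth = 5, instrs = 5;

    printf("%-16s", "Clock cycle:");
    for (int c = 1; c <= instrs + depth - 1; c++)
        printf("%-5d", c);
    printf("\n");

    for (int i = 0; i < instrs; i++) {
        char label[20];
        snprintf(label, sizeof label, "Instruction %d:", i + 1);
        printf("%-16s", label);
        for (int c = 0; c < i; c++)
            printf("%-5s", "");            /* instruction not yet issued */
        for (int s = 0; s < depth; s++)
            printf("%-5s", stages[s]);     /* one stage per clock cycle */
        printf("\n");
    }
    return 0;
}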
Pipeline Terminology (2)

Clock cycle:    1    2    3    4    5    6    7    8    9
Instruction 1:  IF   ID   EX   MEM  WB
Instruction 2:       IF   ID   EX   MEM  WB
Instruction 3:            IF   ID   EX   MEM  WB
Instruction 4:                 IF   ID   EX   MEM  WB
Instruction 5:                      IF   ID   EX   MEM  WB
                (cycles 1-4: filling, cycle 5: full, cycles 6-9: emptying)

In the first four cycles here, the pipeline is filling, since there are unused functional units
In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use
In cycles 6-9, the pipeline is emptying
Instruction Pipeline – Wrap Up
Pipelining attempts to maximize instruction throughput by overlapping the execution of multiple instructions
Pipelining offers significant speedup
— In the best case, one instruction finishes on every cycle, and the speedup is equal to the pipeline depth
The pipeline datapath is much like the single-cycle one, but with added pipeline registers
— Each stage needs its own functional units
Instruction Pipeline – Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
1. Structural hazards: the hardware cannot support the use of a functional unit by 2 instructions at the same time
2. Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
3. Control hazards: pipelining branch and jump instructions introduces the problem that the destination of the branch is unknown
– A common solution is to stall the pipeline until the hazard is resolved, inserting one or more NOP cycles into the pipeline
Instruction Pipeline – Data Hazards (1)
More data dependences -> more pipeline stalls
TARGET: reduce pipeline stalls as far as possible

Typical 5 pipeline stages of an integer ALU:

+----+----+----+-----+----+
| IF | ID | EX | MEM | WB |
+----+----+----+-----+----+

A floating point ALU consists of more pipeline stages (more EX stages)
When an instruction needs the result of another instruction, data hazards occur (RAW, WAR, WAW) => pipeline stalls
Instruction Pipeline – Data Hazards (2)
Example:
add R1,R2,R3
sub R5,R3,R1

Let us try to pipeline these two instructions.
The sub instruction needs the value of register R1, which becomes available only after the WB stage of the add.

Clock cycle:  1    2    3      4      5      6    7    8    9
instr 1:      IF   ID   EX     MEM    WB
instr 2:           IF   stall  stall  stall  ID   EX   MEM  WB

...and three stalls occur

SOLUTION: Pipeline Forwarding (or bypassing)
Instruction Pipeline – Data Hazards (3)
Example:
add R1,R2,R3
sub R5,R3,R1

SOLUTION: Pipeline Forwarding (or bypassing)

Clock cycle:  1    2    3      4    5    6    7
instr 1:      IF   ID   EX     MEM  WB
instr 2:           IF   stall  ID   EX   MEM  WB

The result in R1 is ready after the EX stage, so it is forwarded to the sub instruction
Instruction Pipeline – Data Hazards (4)
Let us now consider another example, where bypassing is applied:
ld R1, 0x12FF
add R3,R4,R1

Clock cycle:  1    2    3      4      5    6    7    8
instr 1:      IF   ID   EX     MEM    WB
instr 2:           IF   stall  stall  ID   EX   MEM  WB

The calculation of the address takes place in the EX stage
At the next stage (MEM), the datum is read from memory, and it is bypassed to the next instruction
The first stall cycle is inevitable
You can find more at https://www.youtube.com/watch?v=EW9vtuthFJY
Think Pair Share
How many stall cycles occur in the following code?

I1: load r1,a
I2: load r2,b
I3: r3=r1+r2
I4: load r4,c
I5: r5=r3-r4
I6: r6=r3*r5
I7: st d,r6

In this code, pipeline stalls occur twice: the first due to the immediate read of r2 (I3) right after it is loaded from memory (I2), and similarly between I4 and I5.
In total, 4 CPU cycles are wasted

Solution: reorder the instructions as follows

I1: load r1,a
I2: load r2,b
I4: load r4,c   -> hides the load latency between I2 and I3
I3: r3=r1+r2
I5: r5=r3-r4
I6: r6=r3*r5
I7: st d,r6

In this code, no stalls occur, since none of the load instructions is immediately followed by a dependent (arithmetic) instruction
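One way to sanity-check such exercises is to script the dependence check. The C sketch below is purely illustrative and hard-codes the simplified model used on these slides (a load immediately followed by an instruction that reads the loaded register wastes 2 cycles); it reports the 4 wasted cycles of the original sequence:

#include <stdio.h>

/* Simplified stall model from the slides: a load immediately followed
 * by an instruction that reads the loaded register costs 2 stall cycles. */
typedef struct {
    const char *text;   /* for readability only */
    int is_load;
    char dest;          /* register written ('1' for r1, ...), 0 if none */
    char src[2];        /* registers read, 0 if unused */
} Instr;

static int wasted_cycles(const Instr *prog, int n) {
    int stalls = 0;
    for (int i = 1; i < n; i++) {
        const Instr *prev = &prog[i - 1];
        if (!prev->is_load) continue;
        for (int j = 0; j < 2; j++)
            if (prog[i].src[j] != 0 && prog[i].src[j] == prev->dest) {
                stalls += 2;   /* per the slides' example */
                break;
            }
    }
    return stalls;
}

int main(void) {
    const Instr original[] = {
        {"load r1,a", 1, '1', {0, 0}},
        {"load r2,b", 1, '2', {0, 0}},
        {"r3=r1+r2",  0, '3', {'1', '2'}},
        {"load r4,c", 1, '4', {0, 0}},
        {"r5=r3-r4",  0, '5', {'3', '4'}},
        {"r6=r3*r5",  0, '6', {'3', '5'}},
        {"st d,r6",   0, 0,   {'6', 0}},
    };
    printf("wasted cycles: %d\n", wasted_cycles(original, 7));  /* prints 4 */
    return 0;
}

Running the same check on the reordered sequence above reports 0 wasted cycles, since no load is immediately followed by a dependent instruction.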
Control Hazards
A control hazard occurs when we need to find the destination of a branch, and can’t fetch any new instructions until we know that destination

beq r1,r3,L1
and r2,r3,r5
or  r6,r1,r7
add r8,r1,r9
...
L1: xor r10,r1,r11

Clock cycle:    1    2    3    4    5    6
beq r1,r3,L1:   IF   ID   EX   MEM  WB
next instr:          NOP  NOP  NOP  IF?
next instr:               NOP  NOP  NOP  IF?

The branch is resolved only late in the pipeline; only at that cycle does the CPU know which instruction is next (either the fall-through and instruction or the xor at L1)

Solutions:
Branch Prediction: the outcome and target of conditional branches are predicted using some heuristic
• Branch predictors play a critical role in modern CPUs
Control Hazards
Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline
The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken
The branch that is guessed to be the most likely is then fetched and speculatively executed
If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay
Where performance matters, avoid writing code with hard-to-predict if conditions in hot loops
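As a rough illustration of that advice (a sketch, not from the slides): in a hot loop over unpredictable data, an if can sometimes be replaced by branch-free arithmetic, so there is no branch to mispredict. This assumes 32-bit ints and an arithmetic right shift for signed values (true on mainstream compilers):

#include <stdio.h>

/* Sums the non-negative elements of an array, with and without a branch. */
int sum_if(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 0)                    /* hard to predict on random data */
            sum += a[i];
    return sum;
}

int sum_branchless(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] & ~(a[i] >> 31);      /* mask is 0 for negative values */
    return sum;
}

int main(void) {
    int a[] = {3, -1, 4, -1, 5, -9, 2, 6};
    printf("%d %d\n", sum_if(a, 8), sum_branchless(a, 8));   /* prints 20 20 */
    return 0;
}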