
Computer Architecture I

CS-GY-6133
Topic: Out-of-Order Processing and Branch Prediction

Instructor: Siddharth Garg

Cycle 6
[Figure: out-of-order execution example at cycle 6. Reservation stations for the FP MULT (4-cycle) and FP ADD (1-cycle) units, the FP register file (f0-f8), and a five-entry ROB (entries A-E) holding the five instructions below, with valid bits and tag/value fields as shown on the slide.]

L1: MULD F4, F2, F8
    ADDD F8, F6, F6
    ADDD F2, F0, F8
    BEQ  R1, R2, L1
    SUBD F2, F0, F4

BRANCH TAKEN (Oops!)
Undo renaming in RF

Hardware Branch Prediction
A hardware branch predictor has two tasks:
Determine whether the branch is taken or not taken
Calculate the target address (where to go if taken)
Target address calculation is also required for unconditional branches (jumps)

The past predicts the future.
Local History: if a branch was taken the previous time, it will probably be taken again (why?)
Global History: the taken/not-taken behavior of other branches can be predictive for the current branch (why?)

Local History Prediction
For each branch, predict based on whether that branch was previously T/NT
If previously Taken, then predict Taken
Pattern: T, T, T, NT, T, T, T, T, ...
The single NT is mispredicted as T, and the following T is then mispredicted as NT: two mispredictions per loop exit
Even worse if the pattern is T, NT, T, NT, ...: every branch is mispredicted

Sticky prediction: wait for two consecutive mispredictions before changing the prediction from T to NT (and vice versa)

2-bit Saturating Counter
[FSM diagram, four states:]
(11) "Strong confidence in Taken": predict Taken; Taken -> stay in (11), Not Taken -> (10)
(10) "Weak confidence in Taken": predict Taken; Taken -> (11), Not Taken -> (01)
(01) "Weak confidence in Not Taken": predict Not Taken; Taken -> (10), Not Taken -> (00)
(00) "Strong confidence in Not Taken": predict Not Taken; Taken -> (01), Not Taken -> stay in (00)
Two bits of state are required to implement the FSM
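The FSM above can be sketched in a few lines. This is a minimal model, not the slides' hardware: states are the integers 0 to 3 (00 to 11), and Taken is predicted whenever the counter is 10 or 11.

```python
# 2-bit saturating counter: states 0..3 correspond to 00..11.
class TwoBitCounter:
    def __init__(self, state=3):          # 3 = "strong confidence in Taken" (11)
        self.state = state

    def predict(self):
        return self.state >= 2            # True => predict Taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 11
        else:
            self.state = max(0, self.state - 1)   # saturate at 00

# The T, T, T, NT, T, T pattern from the previous slide now costs only
# one misprediction per NT, instead of two with a 1-bit predictor.
c = TwoBitCounter()
outcomes = [True, True, True, False, True, True]
mispredicts = 0
for t in outcomes:
    if c.predict() != t:
        mispredicts += 1
    c.update(t)
print(mispredicts)   # 1
```

This is exactly the "sticky" behavior: a single NT only weakens confidence (11 -> 10), so the next T is still predicted correctly.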


Branch Prediction Buffer
Hardware structure that is looked up during the fetch stage to determine Taken/Not Taken
[Figure: the m least-significant bits of the 32-bit program counter index a table of 2^m entries; each entry is a 2-bit saturating counter whose value gives the Taken/Not Taken prediction.]
Ideally, one 2-bit saturating counter per PC, but that is too expensive
Why the m LSBs and not the MSBs?
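A sketch of the buffer (an illustrative model, not a hardware description): 2^m two-bit counters indexed by the m low PC bits. The LSBs are used because nearby branches differ in their low bits, while their high bits are mostly identical; indexing by MSBs would map most of a program's branches to the same few entries.

```python
M = 3
counters = [3] * (2 ** M)          # all entries start in the 11 (strong taken) state

def predict(pc):
    # Index with the m least-significant PC bits; counter >= 10 means Taken.
    return counters[pc & (2 ** M - 1)] >= 2

def update(pc, taken):
    i = pc & (2 ** M - 1)
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

# First branch of the example trace on the next slides: PC 0x0 -> entry 000.
print(predict(0x0))    # True (predict Taken)
update(0x0, False)     # actual outcome Not Taken: counter 11 -> 10
print(counters[0])     # 2
```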

Example
Consider m=3 and assume that all counters start in the 11 state
8 2-bit counters, indexed 0...7
Branch trace (PC, outcome):

0x00000000 N
0x00000002 N
0x00000007 T
0x00000000 T
0x00000002 N
0x00000007 N
0x00000000 T
0x00000002 N
0x00000007 N

Branch 1
Index into the table and predict the branch outcome: PC 0x00000000 selects entry 000, which holds 11, so predict Taken.

Branch 1
Update the counter once the actual branch outcome is known: the branch was Not Taken, so entry 000 decrements from 11 to 10.

Branch 2
Index into the table and predict: PC 0x00000002 selects entry 010, which holds 11, so predict Taken.

Branch 2
Update the counter: the branch was Not Taken, so entry 010 decrements from 11 to 10.

Branch 3
Index into the table, predict, and update the counter: PC 0x00000007 selects entry 111, which holds 11, so predict Taken. The branch is Taken, so the prediction is correct and the counter stays saturated at 11.
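The three branches above generalize: this short sketch (same assumed m=3 buffer, all counters starting at 11) runs the whole nine-branch trace and tallies mispredictions.

```python
counters = [3] * 8                      # 2**3 entries, all in the 11 state
trace = [(0x0, False), (0x2, False), (0x7, True),
         (0x0, True),  (0x2, False), (0x7, False),
         (0x0, True),  (0x2, False), (0x7, False)]

mispredicts = 0
for pc, taken in trace:
    i = pc & 0b111                      # 3 low PC bits index the table
    if (counters[i] >= 2) != taken:     # counter >= 10 predicts Taken
        mispredicts += 1
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

print(mispredicts)   # 5 of the 9 branches are mispredicted
```

The branches at 0x00000002 and 0x00000007 are mostly Not Taken, so the optimistic 11 start costs several mispredictions before their counters adjust.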

Correlating Predictors
The 2-bit scheme looks only at each branch's own prior behavior to make predictions
But branch behavior can be correlated:

if (aa==2) aa=0;
if (bb==2) bb=0;
if (aa==bb) {...}

If the first two branches are taken, the third branch MUST be taken

k-bit global branch history register (BHR)
[Figure: a k=2 BHR holding T, NT: the previous branch was Not Taken, and the branch before that was Taken.]
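Putting the BHR together with per-branch counters gives the 2-level predictor of the next slides. This is a sketch with the slides' assumed parameters (m=3, k=2): each PHT row holds 2^k two-bit counters, and the global history selects the column.

```python
M, K = 3, 2
pht = [[3] * (2 ** K) for _ in range(2 ** M)]   # 2**m rows x 2**k counters, all 11
bhr = 0                                          # k-bit global history; 0 = NT, NT

def predict(pc):
    return pht[pc & (2 ** M - 1)][bhr] >= 2      # counter >= 10 -> predict Taken

def update(pc, taken):
    global bhr
    row = pht[pc & (2 ** M - 1)]
    row[bhr] = min(3, row[bhr] + 1) if taken else max(0, row[bhr] - 1)
    bhr = ((bhr << 1) | taken) & (2 ** K - 1)    # shift the outcome into the history

# Branch 1 of the trace: PC 0x0, history NT,NT -> counter 11 -> predict Taken
print(predict(0x0))    # True
update(0x0, False)     # counter (row 000, history NT,NT) decrements to 10
print(pht[0][0])       # 2
```

The same branch can now have up to 2^k different predictions, one per recent global history pattern.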

2-Level Correlation Predictors
Assume m=3, k=2
[Figure: the 3 PC LSBs select a row of the Pattern History Table (PHT); the k=2-bit BHR selects one of the 2^k s=2-bit saturating counters in that row, and that counter gives Taken/Not Taken.]
Branch trace (PC, outcome):

0x00000000 N
0x00000002 N
0x00000007 T
0x00000000 T
0x00000002 N
0x00000007 N
0x00000000 T
0x00000002 N
0x00000007 N

Branch 1
Assume all counters start at 11 and the BHR is initialized to NT, NT.
PC 0x00000000 selects PHT row 000; the BHR (NT, NT) selects the counter in that row, which holds 11, so predict Taken.

Branch 1
Update the PHT: the branch was Not Taken, so the counter at (row 000, history NT,NT) decrements from 11 to 10.

Branch 2
Update the BHR (shift in the NT outcome: the older entry is discarded, the younger one shifts up, so the history is still NT, NT), then index into the PHT: PC 0x00000002 selects row 010, whose (NT, NT) counter holds 11, so predict Taken.

Branch 2
Update the PHT: the branch was Not Taken, so the counter at (row 010, history NT,NT) decrements from 11 to 10.

Branch 3
Update the BHR (shift in NT; the history is still NT, NT), then index into the PHT: PC 0x00000007 selects row 111, whose (NT, NT) counter holds 11, so predict Taken.

Branch 3
Update the PHT: the branch was Taken, so the counter at (row 111, history NT,NT) stays saturated at 11 (the prediction was correct).

Branch 4
Update the BHR (shift in the T outcome: older entry NT, younger entry T), then index into the PHT: PC 0x00000000 selects row 000, and the new history selects a different counter in that row, one that still holds its initial 11, so predict Taken.

Branch 4
Update the PHT: the branch was Taken, so the counter at (row 000, history NT,T) stays saturated at 11 (the prediction was correct).

Conflicts in PHT
Two branches can use the same counter! With m=3, PCs 0x00000000 and 0x20000000 share the same three low-order bits, so both index entry 000: their outcomes interfere with each other (aliasing).
Branch trace (PC, outcome):

0x00000000 N
0x20000000 N
0x00000007 T
0x00000000 T
0x20000000 N
0x00000007 N
0x00000000 T
0x20000000 N
0x00000007 N
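The collision is easy to see by computing the index for both PCs in the trace:

```python
M = 3
for pc in (0x00000000, 0x20000000):
    print(pc & (2 ** M - 1))   # both PCs index entry 0: they share one counter
```

Only the m low bits participate in the index, so any two branches whose addresses agree in those bits conflict, no matter how far apart they are.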

Virtual Memory
A 32-bit address allows each program to access 4 GB of main memory
What if the DRAM is smaller than 4 GB?

Virtual memory: the addresses a program uses do not refer to actual physical locations in main memory

[Figure: CPU -> Main Memory (DRAM), < 4 GB; overflow goes to the Hard Disk (denser than DRAM, but slow), > 4 GB]

Virtual Memory
[Figure: 32-bit virtual addresses 0...00 through 1...11 form a 4 GB virtual address space, mapped onto a TINY 4 B main memory with 2-bit physical addresses 00-11, backed by a large disk.]
Anything that does not fit in main memory is stored on disk


Pages
Virtual address space is partitioned into chunks called "pages"
Typical page size: 4 KB (2^12 bytes)
Address space: 4 GB => 2^32 / 2^12 = 2^20 = 1M pages
[Figure: bytes 0-4095 form the first 4 KB page, bytes 4096-8191 the second, and so on up to byte 2^32 - 1.]

Virtual Address (32 bits) = Virtual Page Number (20 bits, selects a page) followed by Page Offset (12 bits, selects a byte within the page)
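The split above is just a shift and a mask, sketched here for the 4 KB (2^12-byte) pages described on this slide:

```python
PAGE_BITS = 12   # 4 KB pages

def split(va):
    # Top 20 bits select the page; low 12 bits select the byte within it.
    return va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)   # (VPN, offset)

print(split(4097))        # (1, 1): byte 1 of virtual page 1
print(split(2**32 - 1))   # (1048575, 4095): last byte of the last (2**20 - 1) page
```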

Multiple Processes
[Figure: two 4 GB process virtual address spaces (Process 1 and Process 2) map into a single physical address space: 4 GB of main memory = 1M pages, PPN 0 ... PPN 2^20 - 1.]
- Multiple processes share the same physical memory
- Each process sees a contiguous 32-bit virtual address space, even if the physical locations are not contiguous

Address Translation
[Figure: the Virtual Page Number (20 bits) of the 32-bit virtual address is translated by the page table into a Physical Page Number (m bits); the Page Offset (12 bits) passes through unchanged. PPN and offset together form the main memory address.]
For an n-byte main memory: m = log2(n/4096), with physical pages PPN 0 ... PPN n/4096 - 1

Page Table
[Figure: the same translation path: Virtual Page Number (20 bits) -> page table -> Physical Page Number (m bits); the 12-bit Page Offset passes through to form the main memory address.]

Holds all virtual-to-physical mappings for addresses currently mapped into main memory

Note: not all VAs are necessarily mapped to main memory -> some are on disk!
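A hypothetical page-table lookup following the figure: the VPN selects an entry, a missing entry means the page is not in main memory. The mappings here are made up for illustration.

```python
PAGE_BITS = 12
page_table = {0x00000: 0x2, 0x00001: 0x0}   # assumed VPN -> PPN mappings

def translate(va):
    vpn, offset = va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        # No mapping in main memory: the page lives on disk (or is invalid).
        raise RuntimeError("page fault: OS must load the page from disk")
    return (page_table[vpn] << PAGE_BITS) | offset   # PPN || offset

print(hex(translate(0x00000010)))   # 0x2010: VPN 0 -> PPN 2, offset 0x10
```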

Translation Look-Aside Buffer (TLB)
Page tables can be large
In practice, most processes use only a fraction of the VA space => most page-table entries are empty
More complex data structures can compress the page table
Bottom line: it is expensive to store the entire page table on chip, and a page-table lookup can be time consuming

TLBs cache page-table entries in a small on-chip hardware structure
On a TLB miss, either hardware or software brings the relevant entry from the page table into the TLB by "walking" the page table
What if the entry is not in the page table? PAGE FAULT EXCEPTION!
On the exception, the OS loads the page from disk into main memory
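The TLB-in-front-of-the-page-table flow can be sketched as follows. Everything here (sizes, mappings, the eviction policy) is an assumption for illustration, not a description of real hardware.

```python
page_table = {0: 7, 1: 3}   # assumed VPN -> PPN mappings currently in memory
tlb, TLB_SIZE = {}, 2       # tiny cache of recent translations

def lookup(vpn):
    if vpn in tlb:                       # TLB hit: fast path, no table walk
        return tlb[vpn]
    if vpn not in page_table:            # no mapping at all
        raise RuntimeError("page fault")
    if len(tlb) >= TLB_SIZE:             # crude eviction (real TLBs use LRU-like policies)
        tlb.pop(next(iter(tlb)))
    tlb[vpn] = page_table[vpn]           # the "walk" fills the TLB entry
    return tlb[vpn]

print(lookup(0))   # 7 (TLB miss, filled from the page table)
print(lookup(0))   # 7 (now a TLB hit)
```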

Address Translation
[Figure: CPU -> L1 Cache -> L2 Cache -> Address Translation -> Main Memory]
Main memory is accessed using physical addresses
L1 and L2 caches are accessed using virtual addresses

Address Translation
[Figure: CPU -> Address Translation -> L1 Cache -> L2 Cache -> Main Memory]
Main memory is accessed using physical addresses
L1 cache is physically addressed
L2 cache is physically addressed

Homonyms
The same VA (in different processes) can map to two different physical addresses
[Figure: the same VA in two process virtual address spaces points to different physical pages of the 4 GB (1M-page) main memory.]

Synonyms
Different VAs can map to the same physical address
Shared code and data
[Figure: two different VAs point to the same physical page of the 4 GB (1M-page) main memory.]

Address Translation
[Figure: CPU -> L1 Cache -> L2 Cache -> Address Translation -> Main Memory]
Main memory is accessed using physical addresses
L1 and L2 caches are accessed using virtual addresses
Drawbacks of using virtually addressed caches?
Homonyms: the same VA maps to different PAs. Can the caches distinguish homonyms?

Synonyms: two VAs map to the same PA => multiple copies of the same data in the cache

Solution: flush the caches on a context switch

Address Translation
[Figure: CPU -> Address Translation -> L1 Cache -> L2 Cache -> Main Memory]
Main memory is accessed using physical addresses
L1 cache is physically addressed
L2 cache is physically addressed
Drawbacks?

Adds extra latency for the TLB lookup before each L1 cache access

Fast, single-cycle L1 access is critical for processor performance

Address Translation
[Figure: CPU -> L1 Cache (virtually indexed, physically tagged) -> L2 Cache -> Main Memory, with address translation performed in parallel with the L1 lookup]
Main memory is accessed using physical addresses
L1 cache is virtually indexed but physically tagged
L2 cache is physically addressed
Perform address translation in parallel with the L1-cache lookup

Virtually indexed but physically tagged (VIPT) cache!

Best of both worlds (hopefully)
Since the L1 cache tags are physical, this solves the synonym and homonym problems
And address translation does not add extra latency to the lookup
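One way to see why VIPT can start the lookup before translation finishes: if all the cache index bits fall inside the 12-bit page offset, the index is identical in the VA and the PA. The cache geometry below (32 KB, 8-way, 64 B lines) is an assumed example, not a parameter from the slides.

```python
import math

page_bits = 12                                   # 4 KB pages
cache_size, ways, line = 32 * 1024, 8, 64        # assumed L1 geometry
sets = cache_size // (ways * line)               # 64 sets
# Bits used before the tag: byte-in-line bits + set-index bits.
index_bits = int(math.log2(line)) + int(math.log2(sets))
print(index_bits <= page_bits)   # True: the index comes entirely from the page offset
```

When this inequality fails (cache size / associativity > page size), the index depends on translated bits and synonyms can land in different sets, which is why L1 capacity per way is often kept at or below the page size.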
