CLASS NOTES/FOILS:
CS 520: Computer Architecture & Organization
Part I: Basic Concepts
Dr. Kanad Ghose ghose@cs.binghamton.edu http://www.cs.binghamton.edu/~ghose
Department of Computer Science State University of New York Binghamton, NY 13902-6000
All material in this set of notes and foils authored by Kanad Ghose, 1997-2020.
Any Reproduction, Distribution and Use Without Explicit Written Permission from the Author is Strictly Forbidden
CS 520 – Fall 2020
These foils copyrighted by Kanad Ghose, 1997 through 2020. DO NOT COPY OR DISTRIBUTE IN ANY FORM or give a copy to anybody who is not registered for this course. The contents of these notes
are protected by the US copyright laws.
Course Goals
To gain a thorough understanding of all major aspects of contemporary processor design, including:
– general performance considerations
– hardware-software tradeoffs
– considerations made for implementation technology
– state-of-the-art techniques for enhancing performance
Appreciate and understand the synergism between the compiler and the architecture in modern systems
Understand and gain a deeper appreciation of the buzzwords:
“superscalar”, “L2 cache”, “trace cache”, “MMU”, “DDR”, “SDRAM”, “VLIW”, “EPIC”, “SMT”, “CMP”, “turbo boost”, “Multi-core”, “GPU”, “GP-GPU”…..
Understand how various hardware components are integrated into a computing platform and all relevant issues and tradeoffs.
Gain an insight into how the architecture influences the OS (& vice-versa), how to write fast, efficient code and other things that all software designers (including systems programmers) ought to know!
The focus of this course is on uniprocessor systems and instruction level parallelism (ILP); there are other courses that look at parallel/multiprocessor systems in depth (CS 624, CS 625)
What you SHOULD know as pre-requisites to this course
Basic concepts of instruction set design and computer organization:
– opcode encoding
– addressing modes
– register-to-register (aka Load-Store) instruction set
– data representation
– computer arithmetic
Material covered in any undergraduate logic design course:
– gates and implementation of combinatorial functions
– latches
– simple state machine design
Simple assembly language programming and programming in C
Virtual memory and memory protection mechanisms
– segments and pages
– translation of virtual addresses to physical addresses
– page-level protection
Material covered in any undergraduate operating system course
Hennessy and Patterson’s text, Computer Architecture: A Quantitative Approach (published by Morgan-Kaufmann) will be a good background reading for this course. Either the 3rd edition or the 4th edition will suffice.
We will largely rely on these notes and papers (which will be available online) for this course.
Instruction Set Architectures
Instruction Set Architecture (ISA)
= functional behavior of a processor as specified by its instruction set
– characterizes a processor family
– there can be different implementations of a specific ISA
What characterizes an ISA:
– Instruction types and semantics
– Instruction formats
– Addressing modes, primitive binary-level data formats, endianness, etc.
Examples of popular processor families/ISAs:

    X86 (aka IA-32):               Intel, AMD, Via, Cyrix (NS), IDT, Transmeta, ...
    64-bit versions of the X86:    Intel, AMD
    IA-64:                         Intel
    SPARC:                         Sun, HAL, Fujitsu, ...
    POWER PC:                      IBM, Motorola (Freescale), others
    MIPS:                          Cray/SGI, Cavium, NEC, ...
    Alpha:                         DEC/Compaq
    Precision (HP PA-RISC):        HP
    ARM:                           DEC, Intel, Qualcomm, Samsung, + MANY others
    RISC-V:                        Open ISA
The Economics of Processor Design
Cost breakdown for a processor chip:
CPU cost = fabrication & packaging costs +
design cost +
profit
Time to design a new high end CPU
600 man years (150 designers working for 4 years)
At $ 300 K per person per year, this translates to $ 180 million
Neglecting all other cost components, the per-unit cost breakdown based on the number of units sold is as follows:

    # of units sold      cost per unit
    10 K                 $18 K
    1 M                  $180

– Unless sufficient volumes are expected, it is difficult to justify the design of a new CPU
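A back-of-the-envelope check of these per-unit figures (a sketch; the $180 million design cost is taken from above, the list of volumes is illustrative):

    #include <stdio.h>

    /* Amortize a fixed design cost over different sales volumes. */
    int main(void) {
        const double design_cost = 180e6;      /* $180 million, from the slide */
        const long volumes[] = {10000L, 100000L, 1000000L, 10000000L};
        for (int i = 0; i < 4; i++)
            printf("%10ld units -> $%.2f of design cost per unit\n",
                   volumes[i], design_cost / volumes[i]);
        return 0;
    }

At 10 K units the design cost alone is $18 K per chip; at 1 M units it falls to $180, matching the table above.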
The Economics of Processor Design (continued)
Putting out a new ISA costs more than putting out a new processor for an existing ISA; some of the cost components here are:
– ISA development costs
– software development costs
– source code recompilation costs
– processor costs
These factors pose significant challenges for the introduction of a new ISA, especially one competing with an existing ISA.
– A new ISA must be assured of an adequate base of applications (software)
Example:
Introduction of the POWER PC ISA (1990s) to compete with the X86:
– it would have taken 5+ years to “just recompile” the then-existing PC software base to a new ISA
Market segments: no single ISA spans all of these segments
– PCs (biggest market segment in $ sales)
– Embedded CPUs (biggest market segment in number of units)
– Workstation, high-end systems
– Server systems
– Others (specialized, special-purpose)
The Economics of Processor Design (continued)
Example 1: Evolution of the X86 PC processor market: desktops, servers and laptops (no tablets or phones)
1996: Intel, Cyrix, AMD: 74 million units shipped, $15-16 billion in revenue.
1999: 80+ million units shipped, $18-20 billion in revenue
:
2005: 240+ million – mostly Intel and AMD, $80 billion in revenue
2011: Downturn in desktop/laptop sales – users moving gradually to tablets and smart phones, tighter margins, lower prices: Intel – 80% market share (average across all lines), AMD – 19%, VIA – less than 0.5%; $41 billion in revenue.
Trend: Most profits in the server market, ultra-low power X86 devices showing up in phones, tablets and servers. Overall 5% to 6% market growth. Impact of competition not clear.
Example 2: Extent of R&D, Capital Expenditures: Intel spent more than $5.5 billion each on R&D and on capital expenditures annually for the past two years. In 1994, Intel’s R&D spending was $2.3 billion and capital expenditures were about $1.1 billion.
Example 3: Server processor market – for 2006-2011:
– 22 million server processor chips to be shipped
– 60% of these will be X86-based
– $500 to $1K+ for each microprocessor
The Economics of Processor Design (continued)
Example 4: The nature of the desktop/server processor market:
– Commodity-like in nature, huge and with low profit margin for all but the highest end offering (servers and high-end desktop CPUs)
– Innovation absolutely necessary to distinguish products from competitors: multiple cores, integrated memory controllers, integrated graphics, integrated encryption, performance per Watt etc.
Average selling price for a CPU chip dropped from $200 in 2000 to $150 in 2005 to $85-$105 in 2010 (Intel CPUs) and has since stabilized
– Highest-end products from Intel/AMD are still well over $1400 per CPU chip in some cases.
Example 5: The nature of the cell phone/tablet CPU market
– Very competitive, many vendors, increasingly sophisticated designs
– Performance, features and power consumption matter most
– Features: graphics, encryption, speech engines
– Past designs: multiple discrete CPUs
– Now: highly integrated, multi-core designs, 1+ GHz clock rates
– Cost: few $s to few 10s of $s per CPU
– ARM ISA rules at this point, but new X86 products (Silvermont) will appear in late 2013.
Example 6: The X86 server/desktop market: statistics for Intel
– Revenue: $26.5 billion in 2001 to $ 43.6 billion in 2010
– Breakdown for 2010: 60% from PC processors (desktop, laptop,
netbook), about 17% from servers, about 18% from chipsets
– Owns 80% of the PC/server CPU market, AMD owns most of rest
The Economics of Processor Design (continued)

The market leaders by CPU types:

    Desktop PC:         Intel, AMD
    Embedded systems:   ARM, X-scale (Intel & others), IBM-PPC, Motorola 68K/Coldfire, MIPS
    Server systems:     Intel (X86, X86-64, IA-64), AMD, Sun, HP, IBM
    High-end systems:   Intel, AMD, IBM (mostly in IBM products)
    Workstations:       Intel, AMD, Sun, IBM, SGI, HP
– IBM, SGI and HP’s share of the CPU market is almost gone now! Alpha and Sun are extinct. ARM has been trying to penetrate the server CPU market in recent years.
Current cost drivers for CPU chip beyond basic design costs:
– Functional testing
– VLSI/Chip-level testing
Both costs increasing significantly as chip complexity grows
“Fabless” CPU design companies are on the rise. ARM started like this and now just defines and licenses the ARM ISA. ARM CPUs are extensively used in the tablet/cell phone market (including the Apple products)
The CPI Equation
The execution time of a program:  Texec = N * τ * CPI

    N    = number of instructions executed (dynamic instruction count)
    τ    = clock cycle time
    CPI  = average number of clocks per instruction
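A quick plug-in with illustrative numbers (not from the notes) shows how the three factors combine:

    T_{exec} = N \cdot \tau \cdot CPI = 10^{9} \times 0.5\,\mathrm{ns} \times 1.2 = 0.6\,\mathrm{s}

Halving any one of N, τ or CPI halves the execution time, which is why the RISC and CISC approaches below can both be framed as attacks on this product.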
RISCs (Reduced Instruction Set Computers) and CISCs (Complex Instruction Set Computers):
RISC CPUs attempt to reduce Texec by:
– reducing the CPI
– reducing hardware complexity in a manner that makes it easier to use a faster clock
CISC CPUs attempt to reduce Texec by:
– reducing N
Here’s why……………………….
RISCs vs. CISCs
Characteristics of a RISC ISA:
1. Uses a “load-store” architecture: only load and store instructions access memory; the other instructions are register-to-register
2. Instruction semantics are at a fairly low level.
3. Uses a small number of addressing modes
4. Instruction formats are uniform, i.e., instruction sizes do not vary widely
1 & 2 ⇒ operation times are fast (most of the time, operands are from registers)
2 & 3 & 4 ⇒ instructions can be fetched and decoded quickly
Taken together, these, in turn, lead to:
a. Simpler hardware
b. Smaller logic delays, that make it possible to use a faster clock
However,
2 also implies that N goes up.
– So the design challenge is to ensure that the growth in N is more than compensated by the reduction in CPI and τ.
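A hedged numeric illustration of this tradeoff (all numbers invented for the comparison): suppose a CISC runs a program with N = 10^9 instructions at CPI = 4 and τ = 2 ns, while the RISC version needs 30% more instructions but achieves CPI = 1.5 and τ = 1 ns. Then

    T_{CISC} = 10^{9} \times 4 \times 2\,\mathrm{ns} = 8\,\mathrm{s}, \qquad
    T_{RISC} = 1.3 \times 10^{9} \times 1.5 \times 1\,\mathrm{ns} = 1.95\,\mathrm{s}

so the reduction in CPI and τ more than compensates for the growth in N.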
RISCs vs. CISCs (continued)
Characteristics of a CISC CPU:
1. Uses many “operate” instructions that involve at least one operand from memory
2. Semantics of many instructions are at a fairly high level
3. Uses a fair number of addressing modes
4. Instruction formats are non-uniform, instrn. sizes can vary widely
1 & 2 ⇒ operation times are slow
2 & 3 & 4 ⇒ instruction decoding times can be substantial
4 ⇒ instruction fetching time may be higher due to misalignment
Taken together, these, in turn, lead to:
a. Relatively complex hardware
b. Larger logic delays, that make it impossible to use a faster clock
c. Higher CPI
– Basic philosophy: amortize the cost of fetching and decoding an instruction by performing more operations per instruction
– Design challenge: ensure that the increase in CPI and τ is more than compensated by the drop in N
Locality of Reference (LOR)
LOR is the basis of almost all techniques for speeding up the processor
All programs exhibit two types of LOR: Temporal LOR and Spatial LOR:
Temporal LOR: Once an address A is accessed, there is a high probability of accessing it again in the near future. This behavior is due to the use of loops, loop counting variables, marching pointers, subroutines, system calls and other similar artifacts.
Spatial LOR: Once an address A is accessed, there is a high probability of accessing a nearby location A ± δ (δ small) in the near future. Programs exhibit this behavior due to the sequential nature of instruction fetching between branches, the processing of consecutive array elements in a loop, etc.
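A minimal C fragment (my own illustration) that exhibits both kinds of LOR:

    #include <stdio.h>
    #define N 1024

    int main(void) {
        static int a[N];
        int sum = 0;                    /* sum and i are reused every iteration: temporal LOR */
        for (int i = 0; i < N; i++) {
            a[i] = i;                   /* a[0], a[1], a[2], ... touch consecutive  */
            sum += a[i];                /* addresses: spatial LOR in the data stream */
        }
        /* The loop body itself is fetched over and over again: temporal (and
           spatial) LOR within the instruction stream. */
        printf("%d\n", sum);
        return 0;
    }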
Examples:
    Mechanism             Type of LOR Exploited
    Registers             Temporal (within data stream)
    Instruction Cache     Temporal & Spatial (within instrn. stream)
    Data Cache            Temporal and Spatial (within data stream)
    Pipelining            Spatial (within instruction stream)
    Interleaved Memory    Spatial (within instruction or data stream)
    SDRAM, EDO DRAM       Spatial (within instruction or data stream)
Contemporary VLSI Technology
Today, virtually all CPUs are implemented completely within a single chip. The basic circuit components implemented are transistors (electronic switches), capacitors, interconnections and, to a limited extent, resistors.
Basic Microchip Fabrication Process:
Cut out a wafer of extremely pure semiconducting material
Fabricate several identical copies of the design on a single wafer. Fabrication is a series of steps that forms layers of various materials within the bulk of the wafer through a series of etching or vapor deposition steps.
Examples of these layers:
    metal layers (typical: 2 to 5 metal layers, separated by insulators)
    poorly conducting semiconducting layers, e.g., polysilicon
    relatively better conducting semiconducting layers
    insulating layers (e.g., silicon oxide)
– To form each layer, one or more fabrication steps are needed. The pattern of each layer is specified using a mask that exposes or hides the surface area on which the new layer is to be formed. The masks for the various layers have to be aligned very carefully to ensure that layers are formed where they were really intended.
Contemporary VLSI Technology (continued)
– Main use of metal layers: facilitate interconnection between parts of the circuit. A metal connection for one layer can cross a metal connection in a different metal layer without getting electrically shorted. This is possible as two different metal layers are separated by at least one insulating layer.
– The fabrication steps essentially lay out the necessary circuitry
on the surface of the wafer following a specification of the layout of the circuit.
Cut out each copy: each cut out copy is called a die – visually (or automatically) inspect the dies and reject ones that are obviously flawed; package the remaining dies.
Test out each packaged chip and discard the chips that fail the tests.
Fabrication processes for CPUs, logic and SRAMs are different from
the fabrication processes for DRAMs:
CPU/logic/SRAMs: faster speed requires several layers of metals and faster switching speeds for transistors: higher fabrication cost.
DRAMs: requires more density – one or two metal layers suffice: relatively cheaper fabrication cost.
Implications of the VLSI Fabrication Process
Cost of each good die goes down with an increase in the fraction of good
dies: this fraction is called the yield.
– Yield depends (among other factors) on:
Die area: the bigger the area is, the lower is the yield likely to be, assuming that defects are distributed randomly over the area of the wafer.
[Plot: yield (%) falls off as die size increases.]
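A commonly used first-order model (not part of these notes) makes the trend concrete: if manufacturing defects occur randomly with density D per unit area, the probability that a die of area A is defect-free is roughly

    \mathrm{Yield} \approx e^{-D \cdot A}

so, for example, a die twice as large as one that yields 80% would yield only about 0.8^2 = 64% under this model.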
Design “Regularity”: the more repetitive the circuit layouts are, the higher the yield. Repetitive layouts allow the masks for the various layers to be aligned better and thus reduce defects stemming from the misalignment of the masks of the various layers.
The bigger the layout/area of a circuit component is, the slower is its speed:
Bigger transistors: longer switching time
Longer interconnection wires: higher delays
“Smaller is faster” – the smaller is the layout area of a component, the faster is it likely to be.
Technological Constraints
1. Overall number of transistors within a single chip is limited. Current limits:
100 million to several 10s of billion transistors within a high-end CPU chip (irregular design, several metal layers for fast logic speeds)
About 1500 million transistors for DRAMs (regular design, 2 metal layers), 2 trillion transistors in Samsung’s 1 TB 3D V-NAND chip
2. Signal delays within a single chip are often 10 to 50 times faster than the delays of signals that have to cross chip boundaries.
[Figure: a signal confined within the chip sees delay T; a signal crossing the chip boundary sees delay 10-50 x T.]
– Near term situation: main source of delay within a chip is not the switching delay of gates but the delay of connections. This is exactly opposite of the situation just a few years back!
3. Number of signal pins/external contact bumps on a package is limited – typ. a few hundred pins to 4066 pins at most (LGA-4066 package).
Implications of the Technological Constraints on CPU Design
Constraint 1 ⇒ Put only the most essential functions on a single chip:
1971: Very simple 4-bit CPU within a single chip – the Intel 4004 – the first microprocessor
late 70s: 16-bit general purpose CPU within a single chip, no floating point logic or cache or memory protection logic on-chip
early 80s: 32-bit CPU, no cache or floating point logic on-chip. Also, CPU + limited amount of RAM + communications controller on a single chip (Inmos Transputer)
late 80s, early 90s: 32-bit CPUs and 64-bit CPUs, on-chip floating point logic and single level of caching on chip. Gradual transition from single-issue pipelines to 2 or 4-way superscalar issues per cycle.
1993: 64-bit CPU, floating point logic, two levels of on-chip caches, superscalar instruction issue – DEC 21164
Also in 90s: 8 Complete 16-bit integer CPUs, each with 64 KBytes of DRAM plus hypercube interconnection within a single chip (IBM’s EXECUBE chip), multiple POWER PCs in a single die/MCM, VLIW media processors, IA-64 (VLIW-like)….
Recent Past (2000 through 2014): Multithreaded CPUs, multiple CPUs on a chip (CMP – chip multiprocessors, aka multi-core processors), multi-core chips with integrated DRAM controllers and 3rd-generation PCIe controllers, media processors with on-chip storage, GPUs, ....
Now: CPU + RAM + chipset logic within a single chip, multicore CPU, stacked DRAM memory, IO transceivers within a single package, with or without GPUs AND emphasis on power-aware system-level designs
Implications of the Technological Constraints on CPU Design: continued
Constraints 2 and 3 ⇒ Confine most of the accesses within the CPU chip whenever possible
– requires the exploitation of spatial and temporal LOR, such as the use of on-chip caches
When off-chip accesses are needed, constraint 3 implies:
(i) May have to multiplex pins between address or data or
between instruction and data streams, and/or
(ii) Use burst transfers wherever possible to amortize off-chip access latency – this requires exploitation of spatial LOR leading to the use of streaming (off-chip) memory devices such as EDO DRAMs or SDRAMs or interleaved memory systems.
Constraint 2 ⇒ Cannot feed a very fast clock to the CPU from external sources: a fast clock has to be synthesized internally
General design goals – optimize design to get the best performance under the technological constraints for a reasonable $ cost (for some target market segment)
Two new technology constraints:
Heating is a problem: need more energy-efficient designs
– high-end CPUs, as well as embedded CPUs for portable devices
Interconnect delays becoming higher than logic delays: clock skews may become unmanageable.
Power/Energy Dissipation – Potential Limiters
Peak power dissipations in some past and recent high-end processors:
    Processor                                    Peak Power (W)    Clock rate (MHz)
    Alpha 21264C                                 95                1001
    AMD Athlon XP                                67                1800
    AMD Athlon 64 3800+                          89                2400
    IBM Power 4                                  135               1300
    Intel Pentium 4 (dual MT)                    115               3600
    Intel Itanium 2                              130               1000
    Intel Xeon 5680 (6-core)                     130               3333
    AMD FX-8170 Bulldozer (8-core, 2011)         125+              3900
    Intel i7-6950X (10-core, 25 MB L3, 2016)     140               3500/4000 (turbo)
    AMD 7H12 (64-core, 256 MB L3, 2020)          280               2600/3300 (turbo)
Power dissipation has the following components:
Dynamic or switching power: caused by transistors switching in the course of normal operations. This component is proportional to the product of clock frequency and the square of the supply voltage.
Static power: caused by leakage – transistors are imperfect switches – they provide a conducting path when they are supposed to be off. This component increases with temperature.
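In the standard first-order CMOS expressions (α is the switching activity factor and C the total switched capacitance; these symbols are not defined in the notes):

    P_{dynamic} \approx \alpha \, C \, V_{dd}^{2} \, f, \qquad
    P_{static} = V_{dd} \, I_{leakage}

which is consistent with the statement above: the switching component grows linearly with the clock frequency f and quadratically with the supply voltage V_dd.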
The problem with high power dissipation:
Processor lifetime decreases – heat puts mechanical stresses on the silicon die, interconnections on chip can break, transistors can malfunction, the leakage component of power goes up, increasing heat dissipation further.
– The areal power density – power dissipated per unit area of the die – has to be low as well, otherwise localized hot spots are created.
Cooling and power supply costs go up, driving up overall system costs
Stands in the way of using a faster clock – performance is limited
General Trends
CPU/Logic/SRAMs:
Device counts on chip are increasing at the rate of 60% to 80% per year for logic chips (like the CPU):
Moore’s law: transistor counts in chips double every 18 months
Common mislabeling: CPU performance doubles every 18 months
– Was true until 2007-2008; lower per-core performance growth rate now; cumulative performance of all cores per chip still growing
Now: up to 5.6 billion transistors in a microprocessor.
Logic gate speeds and CPU clock rates doubled every 3 years till the early 2000s. Now quite saturated at 4 to 6 GHz, although 500 GHz lab prototypes cooled at 4 degrees Kelvin exist!
DRAMs:
DRAM device densities are increasing at the rate of 60% per year (= 4X in 3 years)
DRAM access times have improved by about only 30% in the last 10 years!
Many recent innovations to the DRAM interface have happened:
– 2.4 GHz memory system bus
– Better power management
– Multiple DRAM controllers on CPU chip
– Very high capacity DRAM chips (1 Tb)
General Trends (contd.)

Bulk/Mass Storage Devices:

Rotating platter magnetic disk drives dominate: 1 TByte drives for less than $100 (= 10 cents per GigaByte). A 10 TB heavy-duty HDD costs $350.
Flash based devices increasing in capacity (up to 256 GBytes in eSATA SSD drives), but costs are higher.
– February 2017: 1 TB Flash memory chip from Toshiba
– July 2013 announcement by Samsung: Almost 1 TByte capacity
Flash in production – V-Flash (128 Gbits in each Flash chip).
– Capacity achieved through vertical integration, keeping the basic process technology the same! 40:1 aspect ratio (vertical dimension to horizontal dimension), vs. 10:1 from the nearest competitor (IBM)
– Likely to have significant impact on the storage hierarchy.

External Interconnection Between Chips on a PCB:
Despite significant advances in printed circuit board (PCB) design and implementation (multi-layer boards, surface mount technologies, smaller pitched lines etc.), the speed of external interconnections on the PCB has grown very slowly (compared to the rate of growth of the CPU clocks) over the recent years:
– 1.66 GHz+ clock rates for external buses + transfers on both clock edges
– Fast point-to-point PCIe interconnections (up to 16 lanes wide)
– 500 Mbits/sec to 4 Gbits/sec data rates on carefully sized and terminated point-to-point data links
– Assisted by multiple on-chip DRAM controllers
Some Recent Data on the State of Markets and Hardware
Markets:
Server CPUs: 2016 server CPU market share (X86 and server-class ARM): 22.9 million units, $13.9 billion revenue. Intel dominates 98%+ of the market. Numbers for recent years are hard to get these days!
Embedded CPUs: 17.7 billion ARM-based processors shipped in 2016. This is 34% of the embedded CPU market by units.
Fastest server CPU (June 2017): AMD EPYC (Naples) 7551 – 180W, 32 cores, 128 PCIe connections/lanes, 8 DDR4 channels per CPU, up to 2TB memory per CPU, dedicated security subsystem, integrated chipset.
Largest hard disk: 14 TB HGST (subsidiary of Western Digital) Ultrastar He12 – 8-platter drive with 864 Gbits/sq. inch storage density, 7200 RPM, 6 to 12 Gb/sec transfer rates.
Highest capacity SSD drives: Seagate 60 TB (3.5 inch), announced July 2017, Samsung 16 TB (2.5 inch), in production, $ 10K each.
Notable accelerators: Google’s TPU 2.0, Intel/Altera Stratix etc.
The End of Performance Scaling with a Single Chip
Technology generation (also called technology node or simply node) is characterized by the size of the smallest feature that can be implemented on the chip, measured in nanometers (nm). Transistors shrink in size with each new generation.
Limitations of Single-Chip Solutions at the High-End:
Moore’s Law implicitly assumes that the chip area does not change from one technology generation to the next: this is no longer the case for high-end products.
The cost per unit area of the chip is going up more than exponentially as transistors shrink with progressively newer technology generations.
In addition to this, design costs are going up with each new generation.
The End of Performance Scaling with a Single Chip (contd)
Note that the overall “cost” includes design/validation software costs that go beyond the silicon! Silicon, of course, costs more per unit area as transistors shrink, since wafer costs are going up and yield is lower with smaller transistors.
To scale performance according to Moore’s Law in the future, instead of building a single large chip, it makes more sense to use several smaller (and possibly different types of) chips interconnected on a common substrate and place them within a single package. This is called heterogeneous integration.
There are other requirements that drive the need for heterogeneous integration:
The number of connections to devices outside the package are limited by the finite number of pins/contacts on the package – this constrains communication bandwidth and adds to the latency.
Package-external memory access using these pins is still slower, as external DRAMs use a standard that limits the number of connections to a single memory DIMM to less than 200 bits.
– In contrast, connections inside a package are faster and wider, and being shorter in length, they consume less power!
Takeaway: heterogeneous integration also permits components that communicate heavily with each other to be connected more efficiently in terms of both power and performance.
In general, heterogeneous integration connects several different chips on a common substrate (called the interposer) inside one package leading to a System-in-a-Package (SiP).
The End of Performance Scaling with a Single Chip (contd)
– SiPs integrate many diverse components that can come from different vendors and different technology generations.
Some existing examples of the different chips integrated as a SiP include:
Several smaller multicore chips (AMD EPYC) or several smaller FPGAs (programmable hardware) (Xilinx Virtex)
General-purpose multicore processors, vector accelerators, (local) stacked DRAM and high-speed IO chips (Intel Xeon Phi)
GPU and (local) stacked DRAM (AMD Fiji)
General-purpose multicore processor chip, FPGA chip, transceiver chip (Intel Stratix)
Heterogeneous integration may also enable performance scaling for a SiP to take place at a rate exceeding what is predicted by Moore’s Law.
Heterogeneous integration is the revolution that is happening now in the chip industry!
Steps Involved in Designing and Implementing a Modern Processor
Formulate, refine and validate instruction set (ISA)
Formulate datapath organization (main components, interconnections and interfaces to bus/memory) + control logic
Implement detailed simulator for datapath
Validate simulator
Retarget compiler to ISA
Add organization-specific optimizations to back end of compiler
Simulate the execution of benchmarks; if the organization or control logic needs to be refined, modify the simulator and repeat
Convert simulator to register-transfer level (RTL) design
Validate RTL level design/refine-iterate as needed
Synthesize using custom VLSI cell library, validate
Extract circuit and simulate electrically/refine-iterate as needed
Tapeout and chip fabrication
Chip-level testing, hardware integration and board-level testing
Designing and Implementing a Modern Processor (contd.)
Typical situations for a high-end design:
ISA already exists
Core design/validation team of 150 to 200+ engineers
2 validators for every real designer
Separate support design teams (beyond core team) for interfaces, buses, chipsets….
Not uncommon to discover bugs at different validation points – may trigger complete design review and substantial changes (these are simplified as “refine” in the last flowchart).
The detailed organizational-level simulator can be quite complex. Example: the simulator for the Intel P6 (precursor to the Pentiums) had 750,000 lines of source code!!
Some bugs are discovered after release – usually not serious
Trend: use a synthesizable language like Verilog for the organizational simulator – eliminates step for converting organizational simulator to synthesizable code (and associated validation).
Steps for simpler processors are equally complex!
Recent Industry Trends
High-end processors:
Addition of multithreading support within processor
Addition of support for virtual machines
Recognition of the “power wall” – migration to multi-core designs (multiple processors in a single chip) with simpler cores, each with slower clock rates.
– supports multithreaded workloads – a truer form of support
– more energy-efficient
Multi-core products not only target servers but desktops and laptops as well.
Embedded processors:
Integration with memory, digital signal processors (DSPs),
programmable logic and more.
Multi-core (both homogeneous and heterogeneous)
Array processors, with and without multithreading (for video, signal processing, network processing)
32-bit embedded processor for less than $1 – announced recently (Luminary Micro’s LM3S101, 10K unit pricing).
Times of significant changes and revolutionary developments but exciting!
Getting a Lower CPI: Pipelining
Given: a set of N processing tasks, each requiring K consecutive steps of duration T each.
Simple, non-pipelined processing: Employ a monolithic hardware of delay K * T to process one task at a time: total processing time is:
Tnp = N * (K * T)

[Timeline: with no overlap, Task 1, Task 2, Task 3, Task 4, ... complete at times KT, 2KT, 3KT, 4KT, ...]
– No overlap in the processing of consecutive tasks
– Thruput = completion rate = one task every K*T time units
– Processing latency per task = time between start and end of processing = K * T
Pipelined processing: dedicate a piece of logic for every processing step
– As soon as task i has gone through the processing logic for step j, process it using the logic for step (j+1) and use the processing logic for step j to process task (i + 1)
– This is like an assembly line
– Up to K consecutive tasks can be in varying stages of processing simultaneously
– The logic for each step is called a stage
Pictorially, the overlapped processing activity is shown using a Gantt chart: note the analogy with fluid flow through a pipeline.
[Gantt chart: one row per stage (Stage 1 through Stage 4; K = 4 in this example), time 0, T, 2T, ... on the horizontal axis. Task i enters Stage 1 at time (i - 1)*T and advances one stage every T time units, so up to 4 tasks are in flight simultaneously and Task 1, Task 2, Task 3, ... complete at times 4T, 5T, 6T, ... Note the different time scale from the previous figure.]
Some characteristics of pipelined processing:
Total processing time: Tp = K * T + (N – 1) * T
Reason: first task completes after time K * T, thereafter, we have one completion every T time units
Thruput = one task per T time units in the steady state – improved by a factor of K
Processing latency = K * T (unchanged)
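Plugging in illustrative numbers (K = 4 stages as in the Gantt chart above, T = 1 ns, N = 1000 tasks):

    T_{np} = N \cdot K \cdot T = 4000\,\mathrm{ns}, \qquad
    T_{p} = K \cdot T + (N - 1) \cdot T = 1003\,\mathrm{ns}, \qquad
    \mathrm{speedup} = T_{np}/T_{p} \approx 3.99

i.e., the speedup approaches K as N grows large.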
Instruction Pipelines: Pipelines for overlapping instruction execution steps
For simplicity, let us assume that the steps involved in processing an instruction, regardless of the type of the instruction, are as follows:
1. Fetch Instruction (F)
2. Decode Instruction and fetch operands from registers (D/RF)
3. Execute arithmetic, logical, shift operation specified in instrn. (EX)
4. Perform any memory operation specified in instruction (MEM)
5. Write the result(s) to the destination(s) (WB)
– The steps given above are typical of load-store architectures
Assume a processing delay of T for each step and a stage for each step, resulting in the following instruction pipeline:

    F -> D/RF -> EX -> MEM -> WB
– We will call the pipeline for our example load-store machine APEX (A Pipeline EXample) from now on
– The instruction set of APEX will be described shortly
Consider a sequence of N instructions to be processed: the Gantt chart is shown below:
[Gantt chart: one row per stage (F, D/RF, EX, MEM, WB), time on the horizontal axis; idle slots are shaded. Instruction I1 occupies F in the first cycle, D/RF in the second, and so on; I2 trails I1 by one cycle, I3 by two, etc. The first few cycles, before the pipeline is full, are the pipeline fill time; the final cycles, as IN-4 ... IN leave the pipeline, are the pipeline drain time. The last instruction IN leaves the WB stage at time (N + 4)·T.]

The Execution of a Sequence of N Instructions on APEX
A possible implementation of a single stage of APEX: one or more input latches or registers feeding the combinational logic for the stage, whose output goes to the input latch of the following stage (or back to an earlier stage).

An implementation of the pipeline as a whole: a chain of stages (Stage #k, Stage #k+1, Stage #k+2, ...), with the logic implementing each step separated from its neighbors by master-slave latches, all driven by a common clock. This is a synchronous pipeline, clocked from a common source.
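A minimal C sketch (my own illustration, not the course simulator) of such a synchronous pipeline: the array stage[] stands in for the master-slave latches, and each iteration of the outer loop is one clock tick that advances every task by one stage.

    #include <stdio.h>

    #define K      4      /* number of stages (illustrative) */
    #define EMPTY  (-1)

    int main(void) {
        int stage[K + 1];                     /* stage[i] = id of the task in stage i this cycle */
        for (int i = 1; i <= K; i++) stage[i] = EMPTY;

        int next_task = 1, n_tasks = 6;
        for (int cycle = 1; cycle <= n_tasks + K - 1; cycle++) {
            /* Clock edge: every task advances one stage; a new task enters stage 1. */
            for (int i = K; i > 1; i--) stage[i] = stage[i - 1];
            stage[1] = (next_task <= n_tasks) ? next_task++ : EMPTY;

            if (stage[K] != EMPTY)            /* the task in the last stage finishes this cycle */
                printf("cycle %2d: task %d completes\n", cycle, stage[K]);
        }
        return 0;
    }

With K = 4 and 6 tasks, task 1 completes in cycle 4 and one task completes per cycle thereafter, matching Tp = K*T + (N - 1)*T.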
Aside: Components of the Datapath of a Modern Processor
Register file: a small RAM with multiple ports; each location of this RAM is a register. Two or more registers can be read concurrently and one or more registers may be written while other registers are being read:
[Figure: a register file with 2 read ports and a single write port. Inputs: clock/control, the address of the register to be read out on Port A, the address of the register to be read out on Port B, and the address of and data for the register to be written on Port C. Outputs: the contents of the two registers read out on Ports A and B.]
– behaves like a combinatorial circuit
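A small behavioral model in C (my own sketch; the register count and port layout follow the figure, the function name is invented) of a register file with two read ports and one write port:

    #include <stdint.h>

    #define NUM_REGS 32

    typedef struct { uint32_t regs[NUM_REGS]; } regfile_t;

    /* One cycle of register file activity: the reads on ports A and B behave
       combinationally, while the write on port C commits at the clock edge. */
    void regfile_cycle(regfile_t *rf,
                       unsigned addr_a, uint32_t *data_a,        /* read port A  */
                       unsigned addr_b, uint32_t *data_b,        /* read port B  */
                       int we, unsigned addr_c, uint32_t data_c) /* write port C */
    {
        *data_a = rf->regs[addr_a];
        *data_b = rf->regs[addr_b];
        if (we)
            rf->regs[addr_c] = data_c;
    }

    int main(void) {
        regfile_t rf = {{0}};
        uint32_t a, b;
        regfile_cycle(&rf, 1, &a, 2, &b, 1, 5, 42);  /* read r1, r2; write 42 into r5 */
        regfile_cycle(&rf, 5, &a, 0, &b, 0, 0, 0);   /* a now reads back 42 */
        return 0;
    }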
ALU: combinatorial circuit performing arithmetic and logical ops:
[Figure: ALU with inputs Operand 1, Operand 2, a function select and incoming flag values (e.g., carry); outputs: the ALU result and new flag values.]
Shifter/rotator: combinatorial shift/rotate logic – similar to ALU
Multiplexors & tri-state buffers/drivers
The APEX Pipeline: Details
The ISA of APEX: a first look
[Figure: instruction encoding formats, e.g., for LOAD and for operate (op / opL) instructions]
As an instruction flows through the pipeline, all required operand values and addresses flow along with it. The address of the instruction (“PC-value”) also flows with the instruction through the pipeline. Reasons:
– PC-relative addresses can be computed where they are needed
– Return address for an interrupt can be determined easily
The APEX Pipeline: Details
APEX (A Pipeline EXample) is an instruction pipeline that implements a load-store instruction set. All instructions are 32-bits wide. There are 32 general purpose registers, numbered 0 through 31. In the generic description of example instructions, registers that are the destination of the result of an instruction are dubbed rdest. Registers that provide an input value to an instruction are dubbed rsrc. Memory locations are assumed to be 32-bits wide. Mem [X] refers to the memory location at address X.
APEX implements the load-store ISA (Instruction Set Architecture) using the following 5-stage pipeline:

    Processing Step                            Associated Stage Name     Abbreviation
    Fetch instruction                          Fetch stage               F
    Decode and fetch register operands         Decode, register fetch    D/RF
    Execute register-to-register operation     Execute                   EX
    Perform memory operation, if any           Memory                    MEM
    Write results back                         Writeback                 WB

    The APEX Pipeline:  F -> D/RF -> EX -> MEM -> WB

Examples of instructions from the ISA implemented by APEX:

    ADD   rdest, rsrc1, rsrc2    : rdest <- rsrc1 + rsrc2
    ADDL  rdest, rsrc1, literal  : rdest <- rsrc1 + literal
    LOAD  rdest, rsrc1, literal  : rdest <- Mem [rsrc1 + literal]
    LDR   rdest, rsrc1, rsrc2    : rdest <- Mem [rsrc1 + rsrc2]
    STORE rsrc1, rsrc2, literal  : Mem [rsrc2 + literal] <- rsrc1
The APEX Pipeline: Details

How the stages are used:

Reg-to-reg (ADD rdest, rsrc1, rsrc2):
    F:    Read instruction from memory
    D/RF: Decode instruction; read rsrc1 and rsrc2 from the RF
    EX:   Add contents of rsrc1 and rsrc2; set condition code flags
    MEM:  Hold on to the result produced in the EX stage in the previous cycle
    WB:   Write result to register rdest

Load, indexed-literal offset (LOAD rdest, rsrc1, literal):
    F:    Read instruction from memory
    D/RF: Decode instruction; read rsrc1; sign-extend the literal
    EX:   Add contents of rsrc1 and the sign-extended literal to form the memory address
    MEM:  Read the contents of the memory location whose address was computed in the previous cycle in the EX stage
    WB:   Write the data read out from memory to the register rdest

Load, indexed-reg. offset (LDR rdest, rsrc1, rsrc2):
    F:    Read instruction from memory
    D/RF: Decode instruction; read rsrc1 and rsrc2 from the RF
    EX:   Add contents of rsrc1 and rsrc2; set condition code flags
    MEM:  Read the contents of the memory location whose address was computed in the previous cycle in the EX stage
    WB:   Write the data read out from memory to the register rdest

Store, indexed-literal offset (STORE rsrc1, rsrc2, literal):
    F:    Read instruction from memory
    D/RF: Decode instruction; read rsrc1 and rsrc2; sign-extend the literal
    EX:   Add contents of rsrc2 and the sign-extended literal to form the memory address
    MEM:  Write the contents of rsrc1 (read out in the D/RF stage) to the memory location whose address was computed in the EX stage in the previous cycle
    WB:   No action

Use of the stages of APEX by register-to-register, load and store instructions
Information Flowing Between Adjacent Pipeline Stages:

Reg-to-reg (ADD rdest, rsrc1, rsrc2):
    F to D/RF:    the instruction read from memory; the PC value
    D/RF to EX:   instrn. info*; the PC value; contents of rsrc1 and rsrc2; address of rdest
    EX to MEM:    instrn. info; the PC value; address of rdest; result of the operation in the EX stage
    MEM to WB:    instrn. info; the PC value; address of rdest; result of the operation in the EX stage

Load, indexed-literal offset (LOAD rdest, rsrc1, literal):
    F to D/RF:    the instruction read from memory; the PC value
    D/RF to EX:   instrn. info; the PC value; contents of rsrc1; sign-extended literal; address of rdest
    EX to MEM:    instrn. info; the PC value; address of rdest; result of the operation in the EX stage
    MEM to WB:    instrn. info; the PC value; address of rdest (the data being loaded comes from memory)

Load, indexed-reg. offset (LDR rdest, rsrc1, rsrc2):
    F to D/RF:    the instruction read from memory; the PC value
    D/RF to EX:   instrn. info; the PC value; contents of rsrc1 and rsrc2; address of rdest
    EX to MEM:    instrn. info; the PC value; address of rdest; result of the operation in the EX stage
    MEM to WB:    instrn. info; the PC value; address of rdest (the data being loaded comes from memory)

Store, indexed-literal offset (STORE rsrc1, rsrc2, literal):
    F to D/RF:    the instruction read from memory; the PC value
    D/RF to EX:   instrn. info; the PC value; contents of rsrc1 and rsrc2; sign-extended literal
    EX to MEM:    instrn. info; the PC value; contents of rsrc1; result of the operation in the EX stage
    MEM to WB:    none

* “Instrn. info” indicates information about the instruction, such as the instruction itself or its decoded form – see text

Information flow among the stages of APEX for register-to-register, load and store instructions
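One way to summarize the table above is as a set of C structures describing the contents of each inter-stage latch (my own sketch; the type and field names are invented, not from the course simulator):

    #include <stdint.h>

    typedef struct {            /* latch between F and D/RF */
        uint32_t instruction;   /* the instruction read from memory */
        uint32_t pc;            /* the PC value */
    } latch_f_drf_t;

    typedef struct {            /* latch between D/RF and EX */
        uint32_t instrn_info;   /* the instruction or its decoded form */
        uint32_t pc;
        uint32_t src1_value;    /* contents of rsrc1 */
        uint32_t src2_value;    /* contents of rsrc2 or the sign-extended literal */
        unsigned rdest;         /* address of the destination register, if any */
    } latch_drf_ex_t;

    typedef struct {            /* latch between EX and MEM */
        uint32_t instrn_info;
        uint32_t pc;
        unsigned rdest;
        uint32_t result;        /* ALU result or the computed memory address */
        uint32_t store_value;   /* contents of rsrc1, carried along for STOREs */
    } latch_ex_mem_t;

    typedef struct {            /* latch between MEM and WB */
        uint32_t instrn_info;
        uint32_t pc;
        unsigned rdest;
        uint32_t result;        /* ALU result or the data loaded from memory */
    } latch_mem_wb_t;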
[Datapath walk-through: the instruction "ADDL R5, R1, #20", stored at address 1001, moves through APEX one stage per cycle.

    Time T   (F):    the instruction is read from the I-Cache into the instruction register using the PC value (1001); the PC update logic computes the next PC.
    Time T+1 (D/RF): the decoder decodes the instruction, the register file is read using the address of r1, the literal 20 is sign-extended, and the address of the destination register (= 5) is recorded.
    Time T+2 (EX):   the decoded information sets up the ALU to do an addition; the ALU adds the value of r1 and the sign-extended literal to produce the sum.
    Time T+3 (MEM):  the decoded info tells the MEM stage to simply pass the sum and the destination register address (5) along.
    Time T+4 (WB):   the decoded info tells the WB stage to write the sum (the data to be written) into r5.

The decoded information accompanies the instruction through a chain of latches holding decoded info (the "IR Chain"), and the instruction's address travels through a parallel "PC Chain".]

Events in the APEX Pipeline during the processing of an ADDL instruction stored at address 1001
The APEX Pipeline – Some Noteworthy Features
For certain instructions, no processing is done within a stage – the inputs to the stage are simply passed on to the following stage.
Examples:
For register-to-register operate instructions, no processing is done within the MEM stage.
The WB stage performs no processing for the STORE instruction
All register file reads are done from the same stage (viz., the D/RF stage).
– Reason for doing this: reduces need for extra read ports on the register file.
All register file writes are done from the same stage (viz., WB).
– This is true even if the result to be written to a register is available in an earlier stage (as in the case of register-to-register operate instructions, where the results are computed by the EX stage, and can be written from the MEM stage into the register file)
– Reason for doing this: reduces need for extra write ports on the register file.
The APEX Pipeline – Some Noteworthy Features (contd.)
The EX stage implements arithmetic and logical ops; effective addresses for LOADs and STOREs are also computed in this stage.
Ideally, the effective processing delay of each stage is T, where T is the period of the pipeline clock.
In practice, an instruction can spend more than one cycle within a stage. When this happens, all prior stages are also held up: such stalls are common in situations such as:
Delays in fetching an instruction due to a miss in the instruction cache (I-cache). (The F stage fetches instructions from the I-cache at the rate of one every cycle when the instruction is in the I-cache.)
Similar delays in accessing memory locations cached in the D-cache within the MEM stage.
– Formally, a pipeline stage is said to stall if its processing delay exceeds the ideal one cycle delay.
– In this simplistic APEX pipeline, when a stage stalls, all preceding stages also stall
PROCESSING OF BRANCH INSTRUCTIONS WILL BE DESCRIBED LATER!
The APEX Pipeline – Some Timing Issues
Overlapping decoding and reading of the register file (within one pipeline cycle of length T):
– The register file read access starts at the beginning of the cycle: the two src registers are read out assuming both src register address fields within the instruction are valid.
– Decoding proceeds in parallel; after the decoding delay, the decoded info is available to discard data read from register(s) whose address field(s) was/were incorrectly assumed to be valid.

Reading out an instruction from the I-cache, assuming a cache hit (within one pipeline cycle of length T):
– The I-cache is accessed and the instruction read out from the I-cache using the PC value at the beginning of the cycle.
– The PC is updated for use in the next cycle.
More on Basics of Instruction Pipeline
Instruction pipelines exploit ILP – Instruction Level Parallelism – concurrency within a single instruction stream.
Terminology for synchronous pipelines:
Pipeline latency = number of pipeline cycles needed to process an instruction
Pipeline clock = common clock used to drive the interstage latches
“Clock rate” = pipeline clock rate = 1/T
Choosing the pipeline clock rate: clock period must be long enough to accommodate:
Slowest logic delay
Delay of input and output latches
Clock skew
[Figure: stage #i consists of input latch(es), the logic for stage #i (delay Li), and output latch(es).]

    tl1    = input latching delay (delay of the slave latch)
    tl2    = output latching delay (delay of the master latch)
    tlatch = tl1 + tl2 = overall latching delay: roughly the same for all stages
    Li     = processing delay for the logic of stage #i
    Lmax   = processing delay for the slowest logic associated with any stage = max {Li}
    tskew  = allowance for clock skew

    T = pipeline cycle time = tlatch + Lmax + tskew

(The idle time within stage #i in each cycle is Lmax - Li + tskew.)
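With illustrative numbers, say Lmax = 0.8 ns, tlatch = 0.15 ns and tskew = 0.05 ns:

    T = t_{latch} + L_{max} + t_{skew} = 0.15 + 0.8 + 0.05 = 1.0\,\mathrm{ns}
    \Rightarrow \text{pipeline clock rate} = 1/T = 1\,\mathrm{GHz}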
Efficiency Issues
Overall pipeline efficiency:
Total duration for which stages are committed = Tp
Total resources committed over this duration = K * Tp (in stage-cycles)
Actual usage of stages: K * N (each stage used once by N instructions)
Efficiency = Actual resources used/Resources committed
= (K * N)/(K * Tp)
= N/(K + N -1), which is < 100%
Reason: some stages remain unused during pipeline filling and flushing
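For example (numbers assumed for illustration), with K = 5 stages and N = 100 instructions:

    \mathrm{Efficiency} = \frac{N}{K + N - 1} = \frac{100}{104} \approx 96\%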
Logic Utilization:
Logic delays of all stages are not uniform
Fastest logic waits for slowest logic to catch up: utilization loss
Cure: ensure logic delays are about the same for all stages
CPI - Ideal vs. Real
CPI with pipelining:
Texec = Tp = K * T + (N - 1) * T = N * CPI * T, so CPI = (K + N - 1)/N
Best possible CPI is 1 (N >> K and K > 1): the ideal value
Thruput = N/Texec ;  IPC (Instructions per clock) = 1/CPI
Real-life CPI is higher than 1 – this is due to:
Data dependencies (or interlocking): to see the impact of dependencies, consider the processing of the following code fragment on APEX:
LOAD R1, R3, #50 /* R1 <- Mem [R3 + 50] */
ADD R2, R1, R6 /* R2 <- R1 + R6 */
Here, R1 does not get written till the LOAD goes into the WB stage. The ADD has to wait in the D/RF stage to read the updated value of R1.
Result: the EX and MEM stages idle – 2 cycles lost = a 2-cycle bubble
Branches: If the decision to take a branch is made when the branch instruction is in the EX stage, the processing of the two instructions that followed the branch instruction into the pipeline (which are sitting in the F and D/RF stages) has to be abandoned.
Result: 2-cycle long bubble in the pipeline.
CPI - Ideal vs. Real (continued)
Slow memory interface: If the memory and cache system cannot deliver data in one cycle, the activities of the F and MEM stages can take longer (2 cycles or more).
Result: 1-cycle or longer bubbles
Bubbles due to other types of resource contention (LATER)
CPI for real pipelines:
- Net effect of the factors listed above: the total number of cycles lost (i.e., the number of 1-cycle bubbles) is roughly proportional to N – say, equal to f·N. (Simple model)
CPI = (K + N * (1 + f) - 1)/N ≈ 1 + f, at best
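For example, if an average of 0.3 one-cycle bubbles is lost per instruction (f = 0.3, an assumed figure), then for large N:

    CPI \approx 1 + f = 1.3, \qquad IPC = 1/CPI \approx 0.77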
Instruction pipeline design goal: achieve a CPI as close to 1 as possible in the face of dependencies, branching, slow memory interfaces and resource conflicts.
- both hardware and software solutions are used.
Depicting Usage of Pipeline Stages
Precedence graphs: nodes = stages, directed arcs = input-output order
[Figure: an example precedence graph over Stages 1 through 5.]
Reservation Tables:
One row per stage, one column per cycle of use by the instruction
Total number of columns = pipeline latency in cycles
Columns numbered 1 onwards, left to right
An X mark is put in the grid at the intersection of the row for stage S and column m if stage S is used by the instruction in the m-th cycle following the initiation of the instruction
Example: reservation table for APEX:
    column number:   1     2     3     4     5
    F                X
    D/RF                   X
    EX                           X
    MEM                                X
    WB                                       X

    (Pipeline latency = 5 cycles)
A More Realistic Instruction Pipeline - Multiple Execution Units
[Figure: an instruction pipeline with a fetch/decode front end feeding an integer and load-store path (EX, MEM, WB) and a floating point path (FD/FRF, EX1, EX2, EX3, EX4, FWB).]
Another example of an instruction pipeline; each stage has a delay of one pipeline cycle.
Separate registers for floating point data and integer data
Separate execution units for:
(i) integer (“integer unit” / IU),
(ii) floating point (“floating point unit” / FPU) and,
(iii) load/store instructions (“load-store unit”)
Floating point execution units are usually pipelined: “floating point pipeline”
F and MEM may be further pipelined (not shown)
Note the potential resource conflict in using the WB and FWB stages to write to the
integer register file and the floating point register file, respectively.
Function Pipelines: Pipelines for Implementing Operations
A function or an operation that is relatively complex can be implemented as a pipeline
Example: Floating Point Addition:
Assume input operands are normalized and of the form: m1 be1 andm2 be2
- the base b is usually a power of 2: lets assume its 2
The result is also needed in normalized form
The leading 1 in the normalized mantissa may not be stored explicitly (as in the IEEE floating point standard specs). We assume that the lead- ing 1 is explicitly stored
The logic blocks that are needed are: adders, comparators, combination- al shifters, and a combinational logic block for counting the number of leading zeros.
Algorithm:
Step 1. Compare the exponents of the two operands to determine the number that has an exponent higher than or equal to the exponent of the other number. Assume that the number so identified is m1*b^e1. Let d = |e1 - e2|, i.e., the magnitude of the difference of the exponents of the two numbers.
Step 2. Shift m2 by d places to the right to effect the alignment of the exponents. Doing this step ensures that both numbers are aligned - i.e., they have the same exponent, viz., e1.
Step 3. Add the resulting mantissa for the second number (i.e., m2 shifted as in step 2, which is equivalent to m2*b^(-d)) with m1. The sum so produced corresponds to the mantissa of the result. The exponent of the result at this point is e1.
Step 4. Normalize the result by shifting the result mantissa to the left and decrementing the result exponent by one for each position shifted, if no overflow resulted in the addition of step 3. If an overflow occurred in Step 3, normalization is performed by shifting the result mantissa one position to the right and incrementing the result exponent by one.
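A minimal Python sketch of Steps 1 through 4, assuming b = 2, two positive operands, an explicitly stored leading 1 and integer mantissas; rounding, sticky bits and error handling are ignored, and all names here are illustrative:

    MANT_BITS = 24   # mantissa width: a normalized mantissa m satisfies 2**(MANT_BITS-1) <= m < 2**MANT_BITS
                     # the value represented by (m, e) is (m / 2**(MANT_BITS-1)) * 2**e

    def fp_add(m1, e1, m2, e2):
        # Step 1: compare exponents; make (m1, e1) the operand with the larger exponent
        if e2 > e1:
            m1, e1, m2, e2 = m2, e2, m1, e1
        d = e1 - e2                            # d = |e1 - e2|
        # Step 2: shift m2 right by d places to align the exponents
        m2 >>= d
        # Step 3: add the aligned mantissas; the result exponent is e1 for now
        m, e = m1 + m2, e1
        # Step 4: normalize - shift right once on overflow; shift left past any leading zeros
        if m >> MANT_BITS:                     # overflow: the sum needs MANT_BITS + 1 bits
            m, e = m >> 1, e + 1
        else:
            while m and (m >> (MANT_BITS - 1)) == 0:   # LZD + left shifts (cannot occur for two positive operands)
                m, e = m << 1, e - 1
        return m, e

    # 1.5 * 2**3 + 1.25 * 2**1 = 12 + 2.5 = 14.5
    m, e = fp_add(3 << (MANT_BITS - 2), 3, 5 << (MANT_BITS - 3), 1)
    print(m / 2 ** (MANT_BITS - 1) * 2 ** e)   # 14.5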
Figures 2.7(a) and 2.7(b) depict two possible pipelined implementations of these steps.
Figures 2.8(a) and 2.8(b) depict the reservation tables for the floating point function pipelines of Figures 2.7(a) and 2.7(b).
Figure 2.7(a). A function pipeline for implementing floating point addition (simplified): the stages Compare Exponents, Shift Mantissa, Add Mantissas, LZD (leading-zero detection) and Normalize/Overflow Adjustment. The inputs are m1*b^e1 and m2*b^e2 (with their signs); the operand fields, the aligned mantissa, the raw mantissa sum and the leading-zero count flow between stages, and the output is the final normalized sum. Error codes flowing between stages are not depicted.
Figure 2.7(b). Alternative structure of the function pipeline for implementing floating point addition (simplified): the stages Compare Exponents, Shift Mantissa, Add Mantissas, LZD and Adjust Exponent. In this case the mantissa shifter is used for aligning the mantissa before addition as well as for normalizing the mantissa of the sum; the outputs are the adjusted exponent and the normalized mantissa of the result.
Figure 2.8. Reservation tables for the pipelines of Figure 2.7: on the left, the table for the floating point adder of Figure 2.7(a); on the right, the table for the adder of Figure 2.7(b). Rows are the stages named above, with one column per cycle of use.
Pipeline Classifications
Linear vs. non-linear: In a linear pipeline, a stage is used exactly once by the task processed by the pipeline. In a non-linear pipeline, a task may use a stage more than once.
Non-linear pipeline: some row in the reservation table for the pipeline has two or more X marks
- Also called feedback pipeline.
Synchronous vs. asynchronous:
synchronous: a common clock is used to move data between stages
asynchronous: clockless design - handshake signals are used between stages to coordinate data movement
Unifunction vs. multi-function (applies to function pipelines only):
unifunction: only one function is implemented by the pipeline
multi-function: a common set of pipeline stages is used to implement two or more functions.
- a multi-function pipeline can either auto-configure (i.e., sense the type of function requested and perform it automatically) or require explicit reconfiguration by a controller.
Static vs. Dynamic:
static: pipeline configuration is fixed
dynamic: pipeline structure can be altered on-the-fly
Number of Pipeline Stages: What’s Optimum?
Speedup with pipelining: S = Tnp/Tp
= N*K*T/(K*T+(N-1)*T)
Maximum possible speedup, Smax = K (assuming N >> K, K > 1)
– Does this mean we can keep on increasing the number of stages indefinitely? The answer is NO for several reasons:
Clock skew may go up, forcing the use of a slower clock – the simple analysis does not capture the effect of clock skews
The latching overhead goes up, again forcing the use of a slower clock – this is again because the analysis was simplistic. Earle latches, which combine function logic into the latches, effectively help in reducing the latching overhead.
The number of cycles wasted due to branching, data dependencies and resource conflicts can go up as the number of stages increases
The amount of state information needed to resume from an interrupt increases with the number of stages, complicating the process of resumption following an interrupt.
Most real instruction pipelines are limited to 4 to 8 stages for the reasons listed above
Where additional stages are needed to get a faster clock rate, consecutive stages are grouped and inter-group communication takes place through a queue (instead of a set of latches), leading to a decoupled pipeline.
De-coupled Pipelines: making pipeline sections independent of each other
Original pipeline:
Need to choose clock period to take into account larger clock skew
Blocking of a pipeline stage towards the end of the pipeline (say, stage S6) can hold up activities in all stages preceding S6. This also impacts the activities of stages S8 through S10
De-coupled version of the same pipeline:
Queues
There are three independent sections, comprising the stages S1 through S4, S5 through S7 and S8 through S10.
Each section is clocked independently (at the same frequency) – as each section is smaller, the clock skews are smaller.
Each section operates independently of the others – this is due to the replacement of latches in-between the sections in the original pipelines with queues. For example, if S6 blocks, the two sections S1 through S4 and S8 through S10 can continue to operate.
[Figure: the original ten-stage pipeline S1 through S10 with latches between consecutive stages, and the de-coupled version in which the queues Q1 and Q2 separate the three sections S1-S4, S5-S7 and S8-S10.]
A Fundamental Result
Consider the row of the reservation table of a pipeline, linear or non-linear, that has the most number of X marks – let M be the number of X marks in this row. The corresponding stage is clearly the most heavily used or busiest stage.
Consider now the use of the pipeline over time Tp by N instructions (or operations), where N is sufficiently large.
The time for which the busiest stage is actually used is N * M cycles. The time for which this stage remains committed is Tp. The efficiency of using the busiest stage is thus:
    efficiency = N * M / Tp        (a)
The thruput of the pipeline in processing the N instructions (or ops) is:
    thruput = N / Tp               (b)
Since efficiency ≤ 1 (efficiency cannot exceed 100%), (a) and (b) together imply that:
    thruput ≤ 1/M
Thus, the maximum thruput potentially realizable from the pipeline is:
    MPRT (Maximum Potentially Realizable Thruput) = 1/M
The maximum actual thruput (MAT) obtained from the pipeline may be less than the MPRT.
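As a sketch of how this result could be applied mechanically, the following computes the MPRT from a reservation table encoded as before (stage -> set of columns); the table used here is a made-up non-linear example:

    def mprt(rt):
        # M = number of X marks in the busiest row; MPRT = 1/M initiations per cycle
        M = max(len(cols) for cols in rt.values())
        return 1.0 / M

    rt = {"S1": {1, 4}, "S2": {2}, "S3": {3}, "S4": {5}}   # S1 is used twice
    print(mprt(rt))   # 0.5 -> at most one initiation every 2 cycles, on average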
The Impact of Pipeline Bottlenecks
Consider a K-stage pipeline with non-uniform stage delays. Assume the delay of stage Si (1 ≤ i ≤ K) is Ti. The time it takes to execute a sequence of N instructions (or operations) using this pipeline is:
    Tp = Σi Ti + (N-1) * Ts
where the summation is taken over all the stages and Ts = max { Ti } is the delay of the slowest stage(s).
Instructions complete at the rate of one every Ts
The slowest stage is a performance bottleneck:
[Gantt chart: a three-stage pipeline with stage delays T (S1), 2T (S2) and T (S3) processing instructions I1 through I5; S2 is busy continuously while S1 and S3 are idle for T out of every 2T interval.]
A stage with a delay of Ti is unutilized for the duration (Ts – Ti) – Thruput and utilization both suffer!
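A small Python sketch of the formula above, using the 3-stage example (stage delays T, 2T, T) with T = 1 time unit; the function name is illustrative:

    def pipelined_time(stage_delays, N):
        Ts = max(stage_delays)                    # delay of the slowest stage(s)
        return sum(stage_delays) + (N - 1) * Ts   # Tp = sum_i Ti + (N-1)*Ts

    print(pipelined_time([1, 2, 1], N=5))   # 4 + 4*2 = 12 time units
    # in the steady state, instructions complete only once every Ts = 2 time units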
Avoiding Pipeline Bottlenecks: Replication vs. Segmentation
The slower stages can be split up into a sequence of simpler stages, each with a smaller delay than the delay of the slowest stage. This approach of further pipelining a stage is called segmentation:
[Figure: the slowest stage S2 (delay 2T) split into two stages S2a and S2b, each with delay T; the segmented pipeline has a delay of T per stage, latency = 4*T and maximum throughput = 1/T.]
Alternatively, the slowest stage may be replicated as many times as re- quired to allow operations (instructions) to effectively pass more frequently through it, albeit through a copy that is free
[Figure: the slowest stage S2 (delay 2T) replicated as S2p and S2q, with a demux before and a mux after the two copies steering alternate operations to whichever copy is free; latency = 4*T, maximum throughput = 1/T. The accompanying Gantt chart shows instructions I1 through I9 flowing through S1, the alternating S2p/S2q copies and S3 at a steady rate of one per T.]
Improving the Thruput of Uni-function, Non-linear Function Pipelines
With linear, uni-function pipelines, a new operation can be started every cycle. In the steady state, we also have an operation completing per cycle.
What about non-linear pipelines?
– The presence of two or more X marks in the row for a stage in the reservation table implies that the stage is used more than once by an operation
– An operation already started up can contend for the use of this stage with an operation that started later:
[Figure: a reservation table in which the row for stage S1 has two X marks three columns apart, and a timing diagram showing the use of S1 by instruction Ia and by instruction Ib initiated 3 cycles after Ia; both need S1 in the same cycle.]
Unless mechanisms are in place to prevent both instructions from using this stage in the same cycle, the two instructions will collide
In general, a new operation cannot be initiated every cycle for a non-linear pipeline.
Improving the Thruput of Uni-function, Non-linear Function Pipelines (Continued)
Claim: If the reservation table has a row where (any) two X marks on this row are separated by a distance of k, then two operations initiated at an interval of k will collide:
Proof: trivial! [Timing diagram: Ia, initiated at time 0, uses stage S at times j and j+k; Ib, initiated at time k, uses S at times j+k and j+2k; both therefore need S at time j+k.]
There are two ways to avoid a collision:
Static technique: schedule instructions into the pipeline in a manner that
avoids the condition for collision mentioned above
the schedule is such that the initiation interval between any two operations in the schedule does not equal the distance between any two X marks in any row of the reservation table of the pipeline
The compiler can generate code using such a schedule
Dynamic technique: use hardware facilities to avoid collisions: directly or indirectly delay the initiation of a new operation that can collide with ops already within the pipeline.
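For the static technique, the forbidden latencies can be read straight off the reservation table. A sketch of that computation, using the same stage -> column-set encoding as before (the example table is hypothetical):

    def forbidden_latencies(rt):
        # a latency k is forbidden if some row has two X marks a distance k apart
        forbidden = set()
        for cols in rt.values():
            cols = sorted(cols)
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    forbidden.add(cols[j] - cols[i])
        return forbidden

    rt = {"S1": {1, 4}, "S2": {2}, "S3": {3, 5}, "S4": {6}}
    print(forbidden_latencies(rt))   # {2, 3}: initiations 2 or 3 cycles apart would collide

A collision-free static schedule simply avoids initiating two operations any of these distances apart.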
Improving the Thruput of Uni-function, Non-linear Function Pipelines (Continued)
Static scheduling:
– Trivial static schedules – follow a schedule where the initiations of
consecutive ops are separated by:
(i) any distance higher than the pipeline latency or,
(ii) any distance higher than the maximum of all the distances between any two X marks in the same row, for all the rows.
– More desirable: schedules that produce the best thruput and yet avoid collisions: best thruput schedules
Terminology:
Initiation: the starting of a new operation in the pipeline
Initiation latency: the latency between two successive initiations
Initiation cycle: a repetitive sequence of initiations
Forbidden latency: an initiation latency that causes collisions
Collision-free initiation: an initiation that does not collide with other already-initiated instructions.
Multifunction Pipelines
Example 1: The Arithmetic Unit Pipeline of the TI ASC (Advanced Scientific Computer) (late 60s to early 80s; research project):
All reconfiguration is done by microcode.
Cycle time is 60 nsecs.
Inputs and outputs are vectors (= linear arrays)
[Figure: the eight stages of the basic arithmetic pipeline (Receiver, Multiply, Accumulate, Exponent Subtract, Align, Add, Normalize, Output) and the stage subsets used when it is configured for Integer Add, Integer Multiply, Floating Point Add and Floating Point Vector Dot Product. The Receiver gets a stream of operands from the memory and the Output stage writes the result stream to memory.]
Multifunction Pipelines (continued)
[Figure: reservation tables for the Integer Add (3 columns), Integer Multiply (4 columns), Floating Point Add (6 columns) and Floating Point Vector Dot Product (8 columns) configurations, each with the stages listed above as rows. Annotations note where a zero is added, that for the dot product the exponents of the multiplicands are added and the accumulated value is held, and that the addition of the accumulated values for the final result, after the inputs are exhausted, is not shown.]
Multifunction Pipelines (continued)
Example 2: Software controlled pipeline implementing multiple functions:
Common set of stages implementing floating point multiply&add function:
operand_1 * operand_2 + operand_3
Three functions implemented:
– multiply&add (as above)
– multiply (operand_3 is forced to be zero, perhaps using an explicit zero register (hardwired register containing zero), so that multiply is a pseudo instruction)
– addition (operand_1 is forced to be a one, perhaps using an explicit one register – similar comments as above apply)
Reservation tables for all three functions are identical – they all implement multiply&add
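A purely functional sketch of the idea (the hardwired zero and one registers are modelled here simply as constant operands):

    def fmadd(op1, op2, op3):        # the one function the pipeline actually implements
        return op1 * op2 + op3

    def fmul(op1, op2):              # multiply: operand_3 forced to zero
        return fmadd(op1, op2, 0.0)

    def fadd(op2, op3):              # addition: operand_1 forced to one
        return fmadd(1.0, op2, op3)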
CLASS NOTES/FOILS:
CS 520: Computer Architecture & Organization
Part II: Instruction Pipelines
Instruction Pipelines: Scalar Pipelines and Some Terminology
Instruction issuing: the act of moving an instruction from the decoding stage to the execution stage. In APEX, “issue” thus refers to the process involved in moving an instruction from the D/RF stage to the EX stage.
We will further clarify the notion of issue shortly and distinguish it from a process known as dispatch
Scalar pipeline: an instruction pipeline capable of issuing at most one instruction per cycle
– Design goal for a scalar pipeline: achieve a CPI as close to unity as possible.
Function unit: function pipeline (= multiple stages) or a non-pipelined logic unit within the EX stage that implements operations.
– APEX, as described thus far, has a non-pipelined, single cycle delay function unit within the EX stage
Sequential execution model: the model of program execution that is implicitly assumed:
– Instructions are processed logically in a non-pipelined fashion in program order
– The current instruction being processed is pointed to by an
architectural PC
– All pipelined CPUs must implement this model, although the physical implementation is pipelined
Instruction Pipelines: Scalar Pipelines and Some Terminology (continued)
Implications of the sequential execution model: Results produced by the pipelined execution must match results produced by a strictly sequential, non-pipelined execution. Specifically,
1. Instructions have to be fetched, decoded and issued in program order to respect/maintain the data dependencies in the original program
2. The processor state defined by the execution model, consisting of the contents of the architectural registers and memory locations, must be updated in program order
– APEX clearly meets both of these requirements
More terminology:
ILP (Instruction Level Parallelism): parallelism available across the processing steps of instructions, within a single thread of control
Machine parallelism: mechanisms within an instruction pipeline for exploiting ILP
– Sufficient machine parallelism is required in order to come close to a CPI of unity in real scalar pipelines
Recap: The APEX Pipeline – Some Noteworthy Features
Destination and source register addresses appear in fixed fields within the instructions.
– Logic within the D/RF stage blindly assumes that all of these fields are valid within the instruction that is in the D/RF stage and reads out the source registers from the register file before the instruction has been decoded.
– If decoding reveals that a field assumed as the address of a source register is not valid, then the data read out using that field is simply discarded.
– This scheme allows decoding delays and register access delays to be partly overlapped.
For certain instructions, no processing is done within a stage – the inputs to the stage are simply passed on to the following stage (at the end of the current clock cycle). Examples:
For register-to-register operate instructions, no processing is done within the MEM stage.
The WB stage performs no processing for the STORE instruction
All register files are read from the same stage (viz., the D/RF stage).
– Reason for doing this: reduces need for extra read ports on the register file.
Recap: The APEX Pipeline – Some Noteworthy Features (contd.)
All register files are written from the same stage (viz., WB).
– This is true even if the result to be written to a register is available in an earlier stage (as in the case of register-to-register operate instructions, where the results are computed by the EX stage, and can be written from the MEM stage into the register file)
– Reason: reduces need for extra write ports on the register file.
The EX stage implements arithmetic and logical ops; effective
addresses for LOADs and STOREs are also computed in this stage.
Ideally, the effective processing delay of each stage is T, where T is the period of the pipeline clock.
In practice, an instruction can spend more than one cycle within a stage. When this happens, all prior stages are also held up: such stalls are common in situations such as:
Delays in fetching an instruction due to a miss in the instruction cache (I-cache). (The F stage fetches instructions from the I-cache at the rate of one every cycle when the instruction is in the I-cache.)
Similar delays in accessing memory locations cached in the D-cache within the MEM stage.
In the simple APEX pipeline, an instruction stalls in the D/RF stage till the registers it needs as input have valid data in them. This stall is needed to ensure that data dependencies are satisfied. (We will look at mechanisms that reduce such stalls later on.)
– Formally, a pipeline stage is said to stall if its processing delay exceeds the ideal one cycle delay.
– In this simplistic APEX pipeline, when a stage stalls, all preceding stages also stall
Enhancing Machine Parallelism: Multiple Function Units
A single, non-pipelined function unit can be a bottleneck if instruction execution latencies (i.e., instruction execution times) vary from one type of instruction to another.
Example: APEX, MULtiply instructions with an execution latency of 3 cycles and all other register-to-register ops with a latency of a single cycle.
Consider the execution of the following code fragment, with no data dependencies among the instructions:
I1: ADD R1, R2, R3 /* R1 <- R2 + R3 */
I2: MUL R8, R4, R6 /* R8 <- R4 * R6 */
I3: ADD R7, R2, R5
I4: ADD R9, R3, #1 /* R9 <- R3 + 1 */
The Gantt chart for the processing of the code fragment is as follows:
[Gantt chart: I2 (MUL) occupies the EX stage for 3 cycles; while it does, I3 is blocked in the D/RF stage and I4 in the F stage; the last instruction finishes its WB at t+9.]
The 3-cycle execution latency of the MUL instruction (I2) prevents the D/RF stage from maintaining an issue rate of one instruction per cycle: the D/RF stage and the stage preceding it (F) stall (or block) for two cycles, since the D/RF stage cannot send over I3 to the EX stage while it is still busy executing the multiply operation
Result: 10 cycle processing time for the above code fragment.
Enhancing Machine Parallelism: Multiple Function Units (continued)
Function unit replication, based on the operation type, offers a solution to this problem.
Example: Same as last example, but dedicate a separate function unit (FU) - with a latency of 3 cycles - to implement multiplication; all other register-to-register ops are implemented in a single cycle by an “add” FU:
[Gantt chart and pipeline diagram: F and D/RF feed separate Add and Mult FUs (Add: FU for ADD instructions, Mult: FU for MUL instructions), which feed MEM and WB. While the Mult FU processes I2, the D/RF stage issues I3 and I4 to the Add FU; I3 is then blocked in the Add FU, waiting for writes to occur in program order.]
In this case, while the multiply FU is busy processing I2 (MUL), the D/RF stage can issue I3 (ADD) to the add FU.
Assuming that registers have to be updated in program order, notice that although I3 completes before I2 (i.e., completes execution out of program order), it has to wait within the add FU (thereby stalling it for one cycle) before it can write its destination register - this is to allow the preceding instruction I2 to complete its write.
Result: Even with the ability to sustain issue with multiple FUs, the processing time for the code fragment remains unchanged at 10 cycles
Enhancing Machine Parallelism: Multiple Function Units (continued)
Assume now that additional mechanisms are in place to allow the writes
to the registers to take place out of program order. - The resulting processing profile is:
[Gantt chart: the same pipeline with Add and Mult FUs, now with out-of-order writes allowed; I3 writes ahead of I2, though an instruction still blocks briefly waiting for the MEM stage, and the fragment completes in 9 cycles.]
Overall result: Total processing time of the code fragment is 9 cycles - the one cycle improvement is not dramatic in absolute terms, but it does indicate a 10% decrease over the original value.
In general, to fully exploit multiple FUs, writes have to be made out of order, but additional mechanisms have to be in place to allow data dependencies in the original program to be maintained irrespective of out-of-order writes
- We will look at mechanisms for doing this - such as register renaming - LATER
Notice also that the FUs are underutilized.
Enhancing Machine Parallelism: Multiple Function Units (continued)
Mechanisms/artifacts needed to support multiple FUs:
(Obviously) multiple FUs
Additional datapath connections: D/RF to FUs, FUs to WB
Mechanisms to handle simultaneous or out-of-order completions
Data forwarding mechanisms from one FU to another (LATER!)
Contemporary scalar pipelines have at least three types of FUs:
Integer FU (single cycle latency for integer (including MUL!!), logical and shift ops)
Floating point FU - for floating point ops
LOAD/STORE FU - implements memory address generation and access operations.
Enhancing Machine Parallelism: Multiple Function Units (continued)
Main motivations for having multiple FUs:
Avoids FU bottlenecks
The breakup of a single monolithic function unit into multiple types of function units, each implementing specific function types, allows the function unit hardware to be tailored to the appropriate function type
Setting up the pipeline to handle multiple function units allows function units to be added or upgraded as needed.
- An initial implementation may not have a floating point unit and will have to implement such operations in software using the integer units.
- A later generation may implement the floating point operations using a floating point unit.
Isolating the functions by type and using dedicated function units promotes design modularity. For instance, if independent function units are available for integer addition/subtraction and integer multiply, the design of the multiply unit can be upgraded without affecting the integer unit or affecting the issue logic
Any additional hardware necessary to support multiple FUs is thus justifiable.
Enhancing Machine Parallelism: Pipelined Function Units
Even with multiple FUs, instruction issuing may stall if subsequent instructions require a common FU with multi-cycle latency.
Example:
APEX, 2 FUs (multiply - 3 cycle latency, add - 1 cycle latency)
No dependency within the code fragment
Writes to register file done in program order
I1: ADD R1, R2, R3 /* R1 <- R2 + R3 */
I2: MUL R4, R5, R6 /* R4 <- R5 * R6 */
I3: MUL R7, R8, R9
I4: ADD R10, R3, #10 /* R10 <- R3 + 10 */
I5: ADD R11, R12, R13
Gantt Chart for processing:
[Gantt chart: I3 (the second MUL) waits in D/RF for the Mult FU to free up, holding I4 and I5 behind it; the Add FU is also blocked at times so that writes occur in program order.]
Result: 13 cycles needed to process the code fragment
Enhancing Machine Parallelism: Pipelined Function Units (continued)
The MUL FU is clearly a bottleneck in this case
The need to write registers in program order causes the ADD FU to block: this requirement can be relaxed to avoid this indirect blocking, but other mechanisms have to be put in place to maintain all data dependencies despite out-of-order writes to the registers.
- One can easily verify that out-of-order writes will cut down the processing time of the above code fragment to 12 cycles.
One possible solution is to pipeline the MUL FU - this is really a use of the segmentation solution for removing bottlenecks
- The pipelined FU solution helps when a new operation for the MUL FU does not depend on the operation in progress within the MUL FU in any way (this is the case for the example)
[Gantt chart: with the Mult FU pipelined into three stages, I2 and I3 overlap within it; some blocking remains so that writes occur in program order.]
Result: Code fragment processing time is 11 cycles, down from 13; out-of-order writes do not help in this case.
Enhancing Machine Parallelism: Issue Queue
So far, the conditions for issuing an instruction are:
1. Required FU must be free, and
2. All input operands are available
Relaxing the second constraint can be useful for performance in a pipeline with multiple function units. To see this, consider the following code fragment:
/* R2, R6, R7 contain valid data at this point */
I1: LOAD R1, R2, #10
I2: ADDL R4, R1, #1
I3: MUL R5, R6, R7
Here, assuming that there are two FUs, one for the ADD and one for the MUL, and that the address computation of the LOAD is done on the ADD FU, we see that the ADDL has to wait in the D/RF stage for 3 cycles, introducing a 2-cycle bubble in the pipeline. The MUL instruction, which follows the ADDL, cannot be issued till the ADDL moves out of the D/RF stage:
[Gantt chart: the ADDL (I2) sits in the D/RF stage waiting for R1 from the LOAD, so the MUL (I3) is blocked behind it in the F stage.]
Result: assuming the same latencies as before, total processing time is 11 cycles
Enhancing Machine Parallelism: Issue Queue (continued)
If we can somehow move the ADDL out of the D/RF stage and make it wait for the value of R1 in a latch associated with the ADD FU, the MUL instruction can be readily issued to the MUL FU, to shorten the total time needed to process the same code fragment:
[Gantt chart: the ADDL is “issued” to a buffer associated with the Add FU, freeing up the D/RF stage; the MUL is then issued to the Mult FU while the ADDL waits for R1 and executes once R1 arrives; writes from the two FUs are assumed to be serialized in program order.]
Result: Total processing time is 9 cycles
For pipelines with multiple FUs, the D/RF stage is a critical resource. All we have done in this case is to free up the D/RF stage for use by a later instruction as soon as possible.
Note also what we have effectively done: instruction executions can now start out-of-order.
Enhancing Machine Parallelism: Issue Queue (continued)
Note, however, that before an instruction that has not received all of its input operands can be moved out of the D/RF stage, we must:
Ensure that storage is available within the buffers that hold instructions that have left the D/RF stage and are waiting for one or more of their input values to become available. These buffers are collectively known as the issue queue (IQ). Some alternative names for the IQ as used in existing literature are: scheduling queue, dispatch buffer, instruction window (buffer), instruction pool buffer (IPB), instruction shelves, reservation stations:
[Figure: F and D/RF feed the IQ; instructions move from the IQ to the Add and Mul FUs (Add: FU for ADD instructions, Mult: FU for MUL instructions), which feed MEM and WB.]
Mechanisms are in place for noting the sources of the inputs of this instruction that are not available in the D/RF stage (“who am I waiting for?”) and to hold the value of any valid operand that was successfully acquired in the D/RF stage.
Mechanisms must be in place to allow the waiting instruction to eventually receive its operands when they become available (“how do I get my inputs?”)
Once the waiting instruction has received all of its inputs, mechanisms must be in place to start its execution (“when do I start up?”)
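A schematic sketch of what an IQ entry might record to answer these three questions; the field and method names are illustrative, not taken from any particular design:

    from dataclasses import dataclass, field

    @dataclass
    class IQEntry:
        opcode: str
        dest: int                                         # destination register
        fu_type: str = "ADD"                              # FU needed once all operands arrive
        src_ready: list = field(default_factory=list)     # per source: operand available?
        src_value: list = field(default_factory=list)     # value, once acquired in D/RF or later
        src_tag: list = field(default_factory=list)       # "who am I waiting for?" (producer tag)

        def wakeup(self, tag, value):                     # "how do I get my inputs?" - match a broadcast tag
            for i, t in enumerate(self.src_tag):
                if not self.src_ready[i] and t == tag:
                    self.src_value[i], self.src_ready[i] = value, True

        def ready_to_issue(self):                         # "when do I start up?" - all inputs received
            return all(self.src_ready)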
Enhancing Machine Parallelism: Issue Queues (continued)
Effectively, the issue queue decouples the F and D/RF stages from the rest of the pipeline, allowing F and D/RF not to be held up due to momentary delays in the rest of the pipeline
Out-of-order instruction startup requires the role of the D/RF stage to be modified as follows:
Decode instruction
Read out the values of all registers that are ready (i.e., available)
Note dependencies from earlier instructions in program order, specifically the instructions that are to provide the remaining input operands for the instruction, and set up mechanisms/data flow paths to allow such values to be steered to the instruction when it is waiting in the issue queue (IQ)
Move the instruction into the IQ
The term instruction dispatching is used to refer to the above steps, while the term instruction issue is used to refer to the step of starting up the execution of an instruction in the IQ once its inputs are all received and when the required FU is free
Note that the dispatching mechanism not only relaxes condition 2 for instruction issuing, but condition 1 as well. The only requirement for dispatching an instruction is that an empty slot be available in the IQ
Enhancing Machine Parallelism: Issue Queues (continued)
What the IQ really does: opens up a window of instructions from which instructions can be selectively issued - possibly out of program order - to maximize the exploitation of the ILP
CAUTION: Many authors use the terms dispatch and issue interchangeably, and in a sense opposite to what we have used - what they really mean can usually be understood in context.
So, to clear things up, the following pictures show where the terms issue and dispatch apply:
In-order issue, with no issue taking place till all input operands are ready, in-order startup:
[Figure: F and D/RF followed directly by the Add and Mul FUs, then MEM and WB; the arrow labelled "Issue" sits at the D/RF-to-FU boundary.]
Out-of-order startup using an IQ:
[Figure: F and D/RF followed by the IQ, then the Add and Mul FUs, MEM and WB; the arrow labelled "Dispatch" sits at the D/RF-to-IQ boundary and the arrow labelled "Issue" at the IQ-to-FU boundary.]
Enhancing Machine Parallelism: Issue Queues (continued)
Centralized Implementation of the IQ:
Global IQ (shelves)
Waiting instructions carry a tag that indicates the type of FU needed
Multiple connections to the FUs from the IQ to avoid serializing the start ups. (Two different FUs should be allowed to start up at the same time, if needed.)
Distributed Implementation of the IQ:
Separate set of instruction buffers - called reservation stations - for each FU
Each reservation station entry (RSE) associated with a FU defines a virtual FU (VFU). Dispatching thus amounts to issuing an instruction to a virtual FU.
Other obvious comparisons are possible for centralized vs. distributed implementations of the IQ
Aspects of machine parallelism seen so far:
Multiple FUs
Pipelined FUs
Out-of-order completions (writes to registers)
IQ and dispatch/issue mechanisms - out of order startup
Terminology related to instruction scheduling
Resolving dependencies: noting and setting up data flow paths to satisfy flow dependencies and, possibly, other types of dependencies as well.
Satisfying dependencies: completion of flow of data to satisfy flow dependency and, possibly, other types of dependencies.
Instruction dispatching: A step that involves decoding one (or more) instruction(s), determining type(s) of VFU needed and resolving dependencies.
Instruction issuing: this refers to the satisfaction of dependencies for an already dispatched instruction, which enables the instruction for execution. The actual execution can be delayed pending the availability of a physical FU.
Out-of-order (OOO) datapaths/processors: processors that support out-of-order startup and completion in an effort to harvest ILP. Virtually all contemporary desktop, laptop and server processors fall into this category, as do some high-end embedded CPUs.
Instruction Wait States in Out-of-Order Machines
With all the mechanisms discussed so far for enhancing the machine
parallelism, an instruction can be in one of the following wait states:
WD - waiting to be dispatched after being fetched -- this happens when a virtual FU (or equivalently, a space in the IQ) is not available for the instruction.
WOP - dispatched, waiting for one or more operands in the IQ.
WE - waiting issue: all input operands are available, but waiting for the FU or other execution resources (which is/are busy).
WR - waiting for the results to be produced after execution has started. The minimum waiting time in this state is 1 cycle.
WWB - execution completed, waiting for write back resources (like register file write port, common write bus etc.).
WC - waiting for commitment - that is, waiting for prior writes to architectural registers and memory to be serialized following the sequential execution model. More on this later!
The minimum waiting time in some of these states can be zero cycles (i.e., some waiting states can be skipped), unless noted otherwise.
Data Dependencies
Given a pipeline with infinite machine parallelism, program data dependencies limit the performance improvements possible over a non-pipelined execution
Definitions:
The sources (or inputs) of an instruction are the registers or memory loca- tions whose values are needed by the instruction to produce the results.
The sinks (or destinations or outputs) of an instruction are the registers or memory locations that are updated by the instruction.
The input set (or read set) of an instruction is the set of its inputs.
The output set (or write set) of an instruction is the set of its outputs. Some
of the inputs and outputs of an instruction can be implicit
Examples:
(a) ADD R1, R4, R6 /* R1 <-- R4 + R6 */
    Sources: R4, R6; Read set = {R4, R6}
    Destinations: R1, PSW flags (implicit); Write set = {R1, PSW flags}
(b) LOAD R1, R2, #120 /* R1 <-- Mem[R2 + 120] */
    Sources: R2, Mem[R2 + 120]; Read set = {R2, Mem[R2 + 120]}
    Destinations: R1; Write set = {R1}
- Note that the literal (120) is not considered as a source
Data Dependencies (continued)
We define the sequential ordering or the program order of two instructions within the binary of the program as the relative logical order of processing these two instructions as dictated by the sequential execution model.
- We will use the notation I > J to indicate that the sequential execution model implies that instruction I should be processed before instruction J.
A true dependency or flow dependency or a R-A-W (Read-After-Write) dependency is said to exist between two instructions I and J under the following conditions:
(i) I precedes J in program order, i.e., I > J, and
(ii) A source for J is identical to a destination for I, i.e., destination_set (I) /\ source_set (J) is non-empty (/\ = set intersection)
– This dependency is depicted as an arc from I to J: I → J
Example:
I1: ADD R1, R2, R4
I2: MUL R4, R1, R8
Here, we have a flow dependency from I1 to I2 over the register R1. Also, we say that I2 is flow-dependent on I1.
Data Dependencies (continued)
For pipelines that support in-order startup, the flow dependency of an instruction J on a prior instruction I implies a potential waiting for J in the D/RF stage till at least the production of the result by I.
For pipelines supporting out-of-order startup, a flow dependency from I to J implies a potential waiting for J in the WOP state when I is in the WE or E state (or even the WWB state).
An anti-dependency (also called a write-after-read, W-A-R dependency) exists between two instructions I and J under the following conditions:
(i) I precedes J in program order, i.e., I > J, and
(ii) A destination of J is identical to a source for I, i.e.,
source_set (I) /\ destination_set (J) is non-empty (/\ = set intersection)
– This dependency is depicted as an arc from I to J
Example:
I1: MUL R1, R6, R9
I2: ADD R6, R11, R15
I3: STORE R1, R10, #10
Here we have an anti-dependency from I1 to I2 over R6. We also say that I2 is anti-dependent on I1.
Data Dependencies (continued)
An anti-dependency from I to J over a register Rk implies that I must read Rk before J writes Rk. More specifically, if the pipeline employs out-of-order startups and completion, I can force J to wait in the WWB state while I is in the state WOP.
(If I > J, and J is in WWB, I cannot be in any state that precedes WOP)
Why is it important to ensure that anti-dependencies are maintained? To see this, consider the code fragment from the last example:
Here an anti-dependency exists from the MUL to the ADD over R6 and the STORE is flow-dependent on the MUL over R1
If, for any reason, the ADD writes its result into R6 before the MUL instruction has a chance to read R6, MUL will produce an incorrect result. The STORE, which is flow dependent on the MUL will thus store an incorrect value into the memory, producing unintended results.
This example shows that the ordering constraints imposed by an anti-dependency are necessary for preserving a flow dependency elsewhere in the program.
In particular, if somehow the flow dependency from the MUL to the STORE is maintained (i.e., if we by some means guarantee that the STORE writes to memory the result produced by the MUL) then the ordering required by the anti-dependency from the MUL to the ADD can be ignored.
Data Dependencies (continued)
The third and final type of data dependency in a program is output dependency, also known as write-after-write (W-A-W dependency). An output dependency exists between instructions I and J under the following conditions:
(i) I > J, and
(ii) A destination of I is the same as a destination of J, i.e.,
destination_set (I) /\ destination_set (J) is non-empty (/\ = set intersection)
– This dependency is depicted as an arc from I to J
Example:
I1: MUL R1, R2, R6
I2: STORE R1, R5, #4
I3: ADD R1, R10, R19
Here an output dependency exists from I1 to I3 over register R1; I3 is output-dependent on I1
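A minimal Python sketch that classifies the dependencies between two instructions I and J (with I > J, i.e., I earlier in program order), given their register read and write sets; the example uses the ADD/MUL pair from the flow-dependency example above:

    def classify(read_I, write_I, read_J, write_J):
        # I precedes J in program order; the sets contain register names
        deps = []
        if write_I & read_J:  deps.append("flow (R-A-W)")
        if read_I & write_J:  deps.append("anti (W-A-R)")
        if write_I & write_J: deps.append("output (W-A-W)")
        return deps

    # I1: ADD R1, R2, R4     I2: MUL R4, R1, R8
    print(classify({"R2", "R4"}, {"R1"}, {"R1", "R8"}, {"R4"}))
    # ['flow (R-A-W)', 'anti (W-A-R)'] - flow over R1, anti over R4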
Data Dependencies (continued)
Output dependencies between instructions imply that results must be written to their destinations in program order. Specifically, an output dependency from I to J implies that J can potentially wait in the WWB state while I is in the state WOP, WE or E.
This implies that results produced out of program order must be held in temporary latches and have to be written out from these latches to the appropriate destinations in program order.
Why is it important to maintain the ordering constraints implied by an output dependency?
To understand this, let's look at the last example:
Here an output dependency exists between the MUL and the ADD. If ADD completes before the MUL and writes the result of the addition into R1, the STORE instruction can incorrectly pick up the value of R1 as generated by the ADD, producing unintended results.
This happens because the flow dependency between the MUL and the STORE gets violated.
As this example shows, the real need to maintain the ordering dictated by the output dependency stems from a need to maintain flow dependencies elsewhere in the program.
As the two preceding examples show, the importance of maintaining the anti and output dependencies stems from the need to maintain one or more flow dependencies. Thus, flow dependencies represent a more fundamental form of data dependency than the two other types of dependencies. In fact, it is possible to use temporary storage or temporary registers to get rid of anti and output dependencies completely, retaining only the flow dependencies.
Data Dependencies (continued)
Alternatively stated, this means that as long as the flow dependencies in the original binary versions are somehow maintained, it is possible to ignore anti and output dependencies. It is for this reason that flow dependency has been referred to as true data dependency.
The ability to ignore anti and output dependencies by simply maintaining the original flow dependencies has been exploited in the design of modern pipelined CPUs.
Data dependencies over memory locations can be defined in the same manner as dependencies over registers:
I1: LOAD R1, R2, #10
I2: ADD R4, R3, R1
I3: STORE R4, R2, #10
Here, both the LOAD and the STORE target a common memory location whose address is obtained by adding the contents of R2 with the literal 10. Since the (randomly-accessible) memory can be viewed as an array, we can designate this memory location as Mem[R2 + 10]. Since Mem[R2 + 10] is in the source set of the LOAD and also in the write set of the STORE, and since the LOAD precedes the STORE in program order, there is an anti-dependency between the LOAD and the STORE. One can similarly have flow and output dependencies over memory locations.
Dependencies not present in the original program are introduced when the compiler is forced to reuse the registers, which are finite in number.
Detecting Data Dependencies
Data dependencies over registers can be easily detected by comparing the contents of the fields within the instructions that contain register addresses. One can detect such dependencies statically – for example, at compilation time or during post-compilation processing of the object code. These dependencies can also be detected dynamically – i.e., at run time, by the hardware.
Dependencies over memory locations can sometimes be difficult or even impossible to detect statically. For example, consider the following code fragment:
I1: STORE R0, R5, #0
I2: ADD R3, R1, R4
I3: LOAD R1, R2, #10
Since the contents of the registers are generally not known at the time of compilation (or post-processing), there is no way to infer whether Mem[R5 + 0] is different from or the same as the memory location Mem[R2 + 10]. Consequently, it is not possible to conclude whether a flow dependency exists between the STORE and the LOAD.
Sometimes it may be possible to conclude whether the STORE and the LOAD are dependent if additional information is available from the original source code of the binaries. For example, if R5 and R2 point, respectively, to the base addresses of two arrays A and B, each of which occupies 2048 consecutive locations, then Mem[R5 + 0] is a different memory location than Mem[R2 + 10] and thus the STORE and the LOAD are not dependent.
More general techniques are available for detecting dependencies over memory locations using source code – these techniques go by the general name of “memory disambiguation”
Detecting Data Dependencies (continued)
When it is impossible to detect a dependency over memory locations, most compilers and post-processors make a conservative assumption to play it safe and assume that the instructions in question are dependent. This is precisely a situation where dynamic dependency checking by the hardware has an edge. The effective addresses computed for the STORE and LOAD instructions can be compared at run-time to detect any dependency between these two instructions. If no dependencies are detected, both the LOAD and the STORE may be allowed to execute concurrently if hardware facilities are present, leading to some potential performance improvement.
The cost of detecting dependencies dynamically:
The detection of dependencies over registers requires the use of comparators that are typically 5 to 6 bits wide, since the typical numbers of registers are about 32 to 64.
In contrast, since the physical addresses of memory locations can be a few tens of bits, wider comparators are needed to detect dependencies over memory locations at run time.
The order in which instructions are examined for data dependencies is also critical – this order must correspond to the sequence (otherwise known as program order) in which these instructions are executed in the implicit sequential execution model.
Detecting Data Dependencies (continued)
Example: Detecting dependencies over registers
I1: ADD R1, R2, R4
I2: MUL R6, R8, R1
Assume the following instruction format for these instructions:
<31:25 - opcode><24:20 - dest><19:15 - src1><14:10 - src2><9:0 - unused>
A flow dependency exists if src1 or src2 of I2 is the same as the dest of I1
The necessary hardware facilities for detecting this is as follows:
Two 5-bit comparators (the ISA used in this example has 32 architectural registers)
Additional logic indicating the type of the instructions involved in the dependency. As an example, if I1 is a STORE, then bits 14 thru 10 of the latch that holds I2 should not be compared with bits 24:20 of I1
[Figure: the latches holding I1 and I2, each with the fields opcode, dest, src1, src2 and unused; two 5-bit comparators compare the dest field of I1 against the src1 and src2 fields of I2, and their outputs are qualified by the signals “I1 is register-to-register” and “I2 is register-to-register” to flag a flow dependence from I1 to I2 on src1 or on src2.]
Note also that simple instruction formats, as in a RISC, help considerably
The other two types of dependencies over registers can be detected similarly
Detecting dependencies in this manner is necessary for setting up appropriate forwarding paths (LATER).
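A behavioural Python sketch of this check for the 32-bit format shown above (field positions as given; the helper name and the boolean qualifiers are illustrative):

    def bits(word, hi, lo):                      # extract bit field <hi:lo> of a 32-bit word
        return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

    def raw_dependence(i1, i2, i1_is_reg_reg, i2_is_reg_reg):
        # flow dependence if dest <24:20> of I1 matches src1 <19:15> or src2 <14:10> of I2,
        # qualified by the instruction-type signals (e.g., do not compare if I1 is a STORE)
        if not (i1_is_reg_reg and i2_is_reg_reg):
            return False
        dest1 = bits(i1, 24, 20)
        return dest1 == bits(i2, 19, 15) or dest1 == bits(i2, 14, 10)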
Detecting Data Dependencies (continued)
Example: Detecting dependencies over memory locations
In CPUs that use a write-back cache, a need to detect dependencies over memory location occurs with the use of writeback buffers to hold pending writes to memory from the cache. (In a write-back cache, data is updated in the cache only; the updated or “dirty” cache data is written back to the memory only when the updated cache line is selected as a victim for replacement.)
– Typically the writes from the data cache are queued up in what is called a write buffer (aka writeback buffer) in FIFO order (which is the order in which dirty lines were replaced); the actual writes to the memory take place from these buffers in program order. Each entry in this queue contains the data value being written and the effective memory address to which that data item is to be written
– As soon as the effective address of a LOAD is computed, it is compared in parallel with the effective addresses of all the writes queued up in the write buffer. A match indicates a dependency over a memory location that is yet to be updated
– If multiple matches occur, the dependency is from the STORE whose matching entry occurs last in FIFO order in the queue
– Note that wider comparators are needed to detect dependencies over memory addresses compared to the size of the comparators that are needed to detect dependencies over registers.
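The parallel comparison just described can be sketched in C as follows; WB_ENTRIES, the entry layout and the FIFO-as-array representation are assumptions made only for illustration.

#include <stdint.h>

#define WB_ENTRIES 8   /* assumed write buffer depth */

typedef struct {
    int      valid;
    uint64_t addr;     /* effective address of the pending write */
    uint64_t data;     /* value waiting to be written to memory  */
} WriteBufEntry;

/* Entries are kept in FIFO order: index 0 = oldest, count-1 = youngest.
 * Returns the index of the queued STORE the LOAD depends on, or -1 if the
 * LOAD is independent of all pending writes.  In hardware all comparators
 * operate in parallel; the loop keeps the last (youngest) match, mirroring
 * the rule that the latest matching entry in FIFO order supplies the data. */
int match_pending_store(const WriteBufEntry wb[WB_ENTRIES], int count,
                        uint64_t load_addr)
{
    int match = -1;
    for (int i = 0; i < count; i++)
        if (wb[i].valid && wb[i].addr == load_addr)
            match = i;
    return match;
}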
[Figure: write buffer organization — each queued entry holds the data to be written and its effective memory address; dirty data arrives from the cache and writes drain from the buffer to memory in FIFO order. The effective address computed for a LOAD by the Load/Store FU is compared in parallel (one comparator per entry) against all queued effective addresses, and a priority and data mux selects the data of the youngest matching entry.]
96
Data Flow Graphs
Shows data dependencies graphically: Nodes = instructions
Directed arcs = data dependencies
An Example:
I1: FMUL R5, R7, R8
I2: FADD R1, R2, R5
I3: MOV R9, R1
I4: FADD R6, R5, R8
I5: FDIV R1, R4, R11
I6: FST R1, R10, #100
I7: FADD R4, R7, R11
[Figure: data flow graph for I1 through I7. Arcs are labelled with the register over which the dependency occurs — for example, I1 to I2 and I1 to I4 on R5, I2 to I3 on R1, and I5 to I6 on R1. A dashed arc marks a redundant flow dependence on R1.]
Dashed arc can be removed – if a node has flow dependency over the same register from more than one node, only the flow dependency from the node closest to it in program order matters
97
Coping with Data Dependencies
Hardware Techniques:
Simple interlocking for in-order machines
Data forwarding
Dynamic Instruction Scheduling:
– Register renaming (most prevalent)
– CDC 6600 Scoreboarding (historical)
– Tomasulo’s “algorithm” (historical – similarities with register renaming)
Decoupled Execute-Access Mechanism
Multithreading
Software Techniques:
Software interlocking
Software pipelining
98
Simple Interlocking For Pipelines with In-Order Issuing Mechanism
Simple goal: stall instruction issue till input register operands are available.
For the 5-stage APEX pipeline, one simple way to implement this is to make an instruction wait in the D/RF stage, till the instruction producing the required register value has updated the register in question.
The mechanism for doing this is as follows. Define the status of a register as:
valid (or available): if it contains valid data
invalid (or busy): if the instruction generating data into the register has not yet updated the register (this does not happen till that instruction has gone through the WB stage)
A single bit is added to every register in the register file (RF) to indicate if it is valid or not. These bits are read out along with the register value from the RF
Stall an instruction in the D/RF stage:
(a) if any of its source registers is invalid, OR (b) if the destination register is invalid
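A minimal C sketch of this interlock test follows, assuming one status bit per architectural register (1 = invalid/busy); the array size and the function names are illustrative.

#include <stdbool.h>

#define NUM_ARCH_REGS 16                 /* assumed number of architectural registers */
static bool reg_invalid[NUM_ARCH_REGS];  /* 1 = invalid (busy), 0 = valid             */

/* Stall the instruction in D/RF if (a) any source register is invalid
 * OR (b) the destination register is invalid.                           */
bool must_stall(int src1, int src2, int dest)
{
    return reg_invalid[src1] || reg_invalid[src2] || reg_invalid[dest];
}

void mark_busy_on_dispatch(int dest)  { reg_invalid[dest] = true;  }  /* set on dispatch   */
void mark_valid_in_WB(int dest)       { reg_invalid[dest] = false; }  /* cleared in the WB */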
[Figure: register file organization for simple interlocking. The data part of the register file has 2 read ports and 1 write port; a status bit array (invalid = 1) has 2 read ports and 2 write ports. The D/RF stage supplies the src1, src2 and dest addresses, reads the src1 and src2 values, and reads the src1, src2 and dest status bits; issue stalls if ORing the three status bits gives a 1. The dest status bit is set to 1 on dispatch; the WB stage supplies the data to be written and the dest address, and clears the status bit to 0 (valid).]
99
Simple Interlocking For Use with In-Order Issuing Mechanism (contd.)
Condition (a) ensures that the data dependency from the instruction producing the value of a source register to the instruction that consumes this value is maintained.
Condition (b) ensures that anti- and output-dependencies from preceding instructions in the pipeline to the instruction in the D/RF stage are maintained.
Condition (b) also handles multiple function units and out-of-order writebacks – verify this!
This mechanism is also referred to as “simple scoreboarding”.
Note that this interlocking mechanism introduces a pipeline bubble for every cycle spent by a stalled instruction in the D/RF stage.
100
Data Forwarding
Used to reduce delays due to flow dependencies
Consider a single FU version of APEX and the following code fragment:
I1: AND R1, R2, R3
I2: ADD R4, R1, R6
There is a flow dependence from the AND instruction to the ADD over R1.
Recall that in APEX, register operands are read out while the instruction is in the D/RF stage and that results are written to the registers only when the instruction is in the WB stage.
The ADD must therefore wait in the D/RF stage till the AND has proceeded into the WB stage and written its result to R1. The net result of the flow dependency in this case is to introduce a two cycle long bubble in the pipeline.
Data forwarding (or shortstopping or data bypassing) can get rid of this bubble altogether:
The solution exploits the fact that the ADD needs the value of R1 only when it is about to enter the EX stage – at which time the required value of R1 has been already produced by the AND, which is about to enter the MEM stage.
What is thus needed is a mechanism that will pick up the result produced by the AND and forward it to the ADD as it enters the EX stage.
101
Data Forwarding (continued)
The key requirements of a forwarding mechanism are as follows:
Logic to detect when an opportunity for forwarding exists:
This requires that an instruction carry along with it the addresses of all its source registers until at least its entry into the EX stage. If the address of the destination register of the instruction in the EX or MEM stage matches the address of any source register of an instruction that is about to enter the EX stage, data forwarding is necessary.
– Need latches and comparators
A mechanism to prioritize the potential sources of forwarding:
When the destinations of more than one instruction match a source, only the instruction closest in program order to the instruction about to enter EX must forward the required data:
AND R1, R2, R3
ADD R1, R1, #1
SUB R4, R1, R5
The SUB should get its data (value of R1) forwarded from the ADD instead of from the AND
Appropriate data flow paths are needed to accomplish data forwarding: these usually take the form of multiplexers and additional connections. The multiplexer simply chooses between the value of a register (possibly not up to date) that was read out when the instruction was in the D/RF stage and the data from one of the two potential forwarding sources (the instructions in the EX and MEM stages), based on the output of the comparators and additional information (for prioritizing the potential forwarding sources and validating that they are instruction types that can legitimately forward data).
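The three requirements above can be condensed into a small C sketch of the forwarding control for one source operand; the stage-latch structure and the writes_reg flag are assumptions made for illustration.

#include <stdbool.h>

typedef enum { FWD_NONE, FWD_FROM_EX, FWD_FROM_MEM } FwdSelect;

typedef struct {
    bool writes_reg;   /* instruction type can legitimately forward a result */
    int  dest;         /* destination register address carried with it       */
} StageInst;

/* Decide what the MUX in front of one ALU input should select for the
 * instruction about to enter EX.  The instruction currently in EX is the
 * closest preceding one in program order, so it has priority over MEM.   */
FwdSelect select_forwarding(int src, const StageInst *in_ex, const StageInst *in_mem)
{
    if (in_ex->writes_reg && in_ex->dest == src)
        return FWD_FROM_EX;
    if (in_mem->writes_reg && in_mem->dest == src)
        return FWD_FROM_MEM;
    return FWD_NONE;   /* use the value read out in the D/RF stage */
}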
102
Data Forwarding (continued)
Notice that data forwarding is actually an optimization that establishes data flow from the source instruction to the destination instruction of a flow-dependent pair of instructions. Such an optimization is beyond the realm of the normal optimizations performed routinely by a typical compiler, which can never optimize activities “within” instructions.
I1: ADD R1, R2, R3
I2: MUL R4, R5, R1
[Figure: original data flow — R2 and R3 feed the + that produces R1, and R1 and R5 feed the * that produces R4, with R1 passing through the register file. Data flow with forwarding — the output of the + is routed directly to the input of the *, bypassing the register file write and read of R1.]
Is it possible to avoid delays due to flow dependencies altogether with forwarding? The answer to this question is no, even for a simple pipeline like APEX. To see this, consider the code fragment:
LOAD R1, R2, #20
ADD R3, R1, R7
Here we have a flow dependency from the LOAD to the ADD over R1. If bubbles are to be entirely avoided in the pipeline due to this dependency, the value of R1 must be available when the ADD is about to enter the EX stage. This is not possible because at that point, the LOAD is about to enter the MEM stage and the cache access within the MEM stage (which is necessary to retrieve the value to be loaded by the LOAD) has not yet been started.
– Similar conclusions hold when FUs have latencies > 1 cycle.
103
Data Forwarding (continued)
Example:
Forwarding paths used to forward an ALU-computed result in APEX:
[Figure: Paths used for forwarding the result computed by the ALU in APEX, spanning the D/RF, EX, MEM and WB stages. MUXes at the src1 and src2 (or sign-extended literal) ALU inputs select between the values read in the D/RF stage and the result latches following the EX and MEM stages; the starred src2 path also applies to the STORE instruction.]
Note a subtlety in the above diagram: consider, for example, the forwarding paths from the result latch following the ALU to the input of the ALU through the MUXes. Do not conclude that these paths are used for forwarding the ALU result of an instruction in the EX back to the ALU input (i.e., to the same instruction)! The ALU result latch makes all the difference. These paths are instead used to forward data from an instruction exiting from the EX stage (and going into the MEM stage) to an instruction entering the EX stage (from the D/RF stage). Similar remarks apply to the other forwarding paths.
104
Dynamic Instruction Scheduling
The main limitations of static instruction scheduling techniques for coping with dependencies, such as software interlocking and software pipelining, were noted earlier:
Software scheduling does not allow binary compatibility
Software scheduling does not cope efficiently with FUs that have unpredictable latencies
Software scheduling techniques can add NOPs that can increase the size of the binaries
Software scheduling techniques cannot cope with dependencies when program branching occurs unless predictions can be made about the branch directions statically
Does not handle dependencies over memory locations efficiently
Dynamic instruction scheduling takes care of data dependencies on-the- fly (i.e., as the program executes) using hardware and as such does not have the limitations listed above. To get the best possible performance, dynamic instruction scheduling hardware will use one or more of the following:
multiple FUs, possibly pipelined
out-of-order startup (aka out-of-order execution)
out-of-order completions
The many advantages of dynamic instruction scheduling come at a cost in the form of:
Substantial amount of additional hardware
Potential increase in the pipeline cycle time (and possibly the CPI)
Larger amount of CPU state information – complicates interrupt processing
105
Terminology Related to Dynamic Instruction Scheduling
Resolving dependencies: noting and setting up data flow paths to satisfy flow dependencies and, possibly, other types of dependencies as well.
Satisfying dependencies: completion of flow of data to satisfy flow dependency and, possibly, other types of dependencies.
Instruction dispatching: A step that involves decoding one (or more) instruction(s), determining type(s) of VFU needed and resolving dependencies.
Instruction issuing: this refers to the satisfaction of dependencies for an already dispatched instruction, which enables the instruction for execution. The actual execution can be delayed pending the availability of a physical FU.
106
Register Renaming: Preamble
Consider the following code fragment:
(I1) MUL R2, R1, R7      /* R2 <- R1 * R7 */
(I2) ADD R4, R2, R3      /* R4 <- R2 + R3 */
(I3) SUB R2, R6, #8      /* R2 <- R6 - 8 */
(I4) LOAD R5, R2, #12    /* R5 <- Mem[R2 + 12] */
When this code is executed on an out-of-order processor, it is likely that the MUL instruction is still executing (or waiting to start execution) when the SUB has produced a result.
In this case we have two active instances (or simply instances) of the same architectural register R2:
- One to hold a value targeting R2 by the MUL (and to be consumed by the ADD)
- Another to hold a value targeting R2 by the SUB (and to be consumed by the LOAD).
Note that in the sequential execution model, we do not have the possibility of having multiple instances of an architectural register at any time.
In general in any out-of-order processor design, the dispatch of an instruction that has an architectural register as a destination effectively creates a new instance of that architectural register.
107
Register Renaming
Uses multiple physical registers to implement different active instances of an architectural register, if needed, to satisfy true data dependencies. This allows dispatches to be sustained and delays due to output and anti dependency constraints to be completely avoided.
Basic Idea:
The dispatch of an instruction that updates an architectural register A results in the allocation of a free physical register P for implementing a new instance of A
Subsequent instructions (till the issue of another instruction that updates A) requiring A as an input get the data from P: this maintains all the flow dependencies
Physical registers are thus dynamically bound to architectural registers
Additional hardware mechanisms are needed for:
Managing the allocation, de-allocation and use of physical registers
Preventing an instruction from starting up prematurely before flow dependencies are satisfied.
Register renaming is the most prevalent dynamic instruction issuing mech- anism today.
108
Register Renaming (continued)
An example:
Consider the following code fragment (“instruction stream”):
I0: ADD R1, R2, R3
I1: MUL R4, R1, R6
I2: LOAD R1, address
I3: ADD R2, R1, R6
Here, the LOAD cannot update R1 till the MUL has read the value of R1 as generated by the first ADD. This is a delay due to anti-dependency. Further, the LOAD cannot update R1 before the first ADD has updated R1. This is a delay due to an output dependency.
Consider now the renamed version of this instruction stream:
/* Assume that Pj currently implements Rj */
/* Assume free physical registers are P32, P33, P34 and P35 */
I0: ADD P32, P2, P3      /* R1 renamed to P32 */
I1: MUL P33, P32, P6     /* R4 renamed to P33; note use of P32 for R1 */
I2: LOAD P34, address    /* R1 renamed to P34 */
I3: ADD P35, P34, P6     /* R2 renamed to P35 */
In this version, the delays due to anti and output dependencies are absent.
109
Register Renaming (continued)
Requirements of register renaming:
Need more physical registers than architectural registers
Need a mapping table to indicate physical register currently associated with an architectural register: this is called the rename table or the register alias table
Need a mechanism to substitute physical register ids for architectural registers within an instruction
Need a mechanism to free up a physical register once it is no longer needed (see the sketch after this list):
A physical register can be reclaimed (and added back to the free list) if:
All instructions that require it as an input have read it
and
The corresponding architectural register has been renamed
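A minimal C sketch of the reclaim test just listed; the per-register bookkeeping (the pending-reader count in particular) is an assumption made for illustration, since an implementation may track consumers differently — the APEX adaptation that follows uses Renamed[] and W vectors rather than a counter.

#include <stdbool.h>

typedef struct {
    bool allocated;        /* currently bound to some architectural register      */
    bool renamed;          /* a newer MRI of that architectural register exists   */
    int  pending_readers;  /* dispatched instructions that still need to read it  */
} PhysReg;

/* A physical register can go back on the free list only when every
 * instruction that requires it as an input has read it AND the
 * corresponding architectural register has been renamed again.          */
bool can_reclaim(const PhysReg *p)
{
    return p->allocated && p->renamed && p->pending_readers == 0;
}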
110
Register Renaming Adapted for APEX
Datapath components:
[Figure: APEX datapath adapted for register renaming — Fetch, Decode/Rename1 and Rename2/Dispatch stages; register files (int + FP) implemented as physical registers; an IntFU, Load FU, Store FU, FP Add FU and FP Mul/Div FU, each function unit fronted by reservation stations; an InBus and two result buses. Each reservation station has slots to hold the contents of an input register or a literal value and a slot to hold the address of the destination register; the allocated/free bit of each reservation station and the valid bits of the input operand slots are NOT shown.]
111
Register Renaming Adapted for APEX (continued)
The APEX pipeline/datapath is modified as follows:
- The fetch stage remains unchanged
- The D/RF stage is split into two stages: Decode/Rename1 and Rename2/Dispatch.
The actions necessary to resolve references to architectural registers and instruction dispatching are implemented in these stages.
- All FUs update the physical registers through two result buses, which write data through two write ports
Each reservation station entry (RSE) includes slots to hold input operands and their status and a slot to hold the address of the destination register. An allocated/free bit is also associated with each RSE.
Two result buses are used to forward data to waiting VFUs and update the destination register. Completing VFUs have to compete for access to these buses. A VFU can use any one of the two buses for this purpose.
The result put out on one of the result buses is also forwarded to instructions within the Decode/Rename1 and Rename2/Dispatch stages. This essentially requires VFUs using the result buses to put out their result as well as the address of the destination register.
112
Register Renaming Adapted for APEX (continued)
As in the case of Tomasulo’s technique, a FU uses appropriate techniques to select one of a number of RSEs that have been enabled for execution.
The FUs have the same characteristics that we assumed for APEX incorporating Tomasulo’s technique: the physical FUs are pipelined and have the following latencies:
Physical FU        Latency (in cycles)
Integer FU         1
Load FU            3*
Store FU           3*
FP Add FU          4
FP Mul./Div. FU    7
* If memory target is in cache; longer otherwise
Assume that a single physical register file implements both integer and FP registers. Architectural registers have addresses in contiguous, neighboring ranges for integer and FP registers.
Example:
40 architectural registers: first 32 are for integers (R0 through R31), next 8 are for FP operands (F0 through F7, which are aliased to R32 to R39). This allows a unique index in the range 0 to 39 to be used to look up the rename table, as explained later.
113
Register Renaming Adapted for APEX (continued)
The major hardware-implemented data structures used to support the register renaming mechanism, instruction dispatch, instruction issue and instruction completion are as follows:
A rename table, indexed by the address of an architectural register. The corresponding entry in the rename table gives the physical register acting as the stand-in (hereafter called the most recent instance, MRI) of the architectural register:
Example rename table (index = architectural register number, entry = physical register that is the MRI):
    R0  -> P5      (physical register P5 is the MRI of R0)
    R1  -> P12     (physical register P12 is the MRI of R1)
    R2  -> P4      (physical register P4 is the MRI of R2)
    R3  -> P6      (physical register P6 is the MRI of R3)
    ...
    R39 -> P11     (physical register P11 is the MRI of R39)

The number of entries in the rename table = the # of architectural registers

An allocation list, AL [ ], of physical registers, implemented as a bit vector: this list is used for allocating a free physical register as the MRI of the destination architectural register at the time of instruction dispatch. Example (1 = allocated, 0 = free):

    index: 0 1 2 3 4 5 6 7 8 9 10 11 ... 49 50
    AL[ ]: 1 1 1 0 1 0 0 1 1 1  1  1 ...  0  0

    Physical registers P0, P1, P2, P4, P7 through P11 are allocated
    Physical registers P3, P5, P6, P49 and P50 are free
114
Register Renaming Adapted for APEX (continued)
An array Renamed [ ]: Renamed [k] indicates whether physical register Pk was the MRI of an architectural register in the past and has since been superseded by a more recent instance of the same architectural register. This information is used for deallocating a physical register when it is updated. (Renamed [k] = 1 means Pk has been superseded by a more recent instance.)
Status information for each physical register:
- A bit vector, Status [ ], indicating if a physical register contains valid data or not: the validity information is obtained by looking at AL [ ] and Status [ ] in conjunction. A physical register Pj contains valid data if Pj is allocated (i.e., AL[j] = 1) and if Status [j] is a one.
Forwarding information that allows the simultaneous update of a physical register by a VFU and the forwarding of the result being written to the physical register to waiting VFUs:
- A bit vector Wk for each physical register Pk that indicates the VFU slots that are awaiting the data to be written into this physical register: Wk[j] is set to 1 if the result to be written to Pk is to be forwarded to slot #j
- Each VFU slot that can have a data forwarded to it is given a unique id
115
Register Renaming Adapted for APEX (continued)
Example:
[Figure: VFU slot numbering — each input operand slot of every VFU (the reservation stations in front of the IntFU, Load FU, Store FU, FP Add FU and FP Mul/Div FU) is given a unique id; in this example the slots are numbered 0 through 42. The bit vector Wk for a physical register Pk has one bit per VFU slot; in the example shown, the bits for slots 1, 4, 7, 41 and 42 are set.]
Data to be written to Pk has to be forwarded to VFU slots numbered 1, 4, 7, 41, 42
A bit vector indicating the allocation status of each VFU, VFU_Status [ ].
- All of this status information is maintained in a scoreboard-like structure.
116
Register Renaming Adapted for APEX (continued)
Format of an RSE that requires two input operands:
[Figure: the three RSEs (FPAddVFU0, FPAddVFU1, FPAddVFU2 — slot ids 28 through 33) in front of the physical FP Add FU. Each RSE holds an operation type (+ or -), a status bit (0 = free; 1 = allocated), a left input data slot with its valid bit (1 = valid, 0 = invalid), a right input data slot with its valid bit, and the address of the destination physical register.]
- Note need to have operation type field with some RSEs (e.g., integer VFUs) since they implement multiple functions.
Data forwarding is accomplished by using the information in the W array for the physical register being written. A slot, after receiving its data from the forwarding bus, sets its valid bit to 1:
[Figure: Using Wk to forward the data being written to physical register Pk. The VFU writing its result to Pk drives the result bus, which updates the register file and also feeds the forwarding logic for the Decode/RNM1 and RNM2/Dispatch stages. A latch holding Wk drives the load enable of each waiting slot (Slot 0 through Slot S-1), so only the slots whose Wk bit is 1 latch the forwarded data.]
After a physical register is updated, its W vector is cleared
117
Register Renaming Adapted for APEX (continued)
Renaming and Dispatching Instructions:
The steps involved in renaming the destination architectural register, processing the references to source architectural registers and eventually dispatching an instruction of the form:
Rj <- Rk op Rl
are as follows:
1. The availability of a VFU of the appropriate type and the availability of a free physical register to hold the result are checked simultaneously. If either or both of these required resources are unavailable, instruction dispatching blocks till they become available.
2. The addresses Pr and Ps, say, of the physical registers that are the (current) MRIs of the architectural registers Rk and Rl, respectively, are read out from the rename table.
3. The free physical register pulled out from the free list in step 1, say Pq, is recorded as the (new) MRI for Rj in the rename table. Pq is also marked as allocated. The selected VFU is also marked as allocated. The “Renamed” bit of the physical register that was the past MRI of Rj is set to indicate that it has been replaced by the new MRI.
4. The address fields of the original instruction are substituted with the addresses of the physical registers corresponding to the architectural registers, producing the renamed instruction:
Pq <- Pr op Ps
118
Register Renaming Adapted for APEX (continued)
5. If the source physical registers are valid, they are read out. If a source physical register is busy, the number for the corresponding slot in the selected function unit is used to update the waiting list for the physical register. A forwarding mechanism is also used to bypass the attempt to read source physical registers to pick up the data from function units that are writing data into these source physical registers.
6. The renamed instruction is dispatched to the selected function unit as follows:
- The data that could be read out of the source physical registers (or forwarded in from the output of function unit(s)) is moved into the respective input slot(s) of the function unit selected for the instruction. These input slots are marked as “valid”.
- All other input operand slot(s) of the selected function unit - i.e., the slots that are still waiting input data are marked as “invalid”.
- The address of the destination physical register is also moved into the result slot for the selected VFU.
Note that step 2 has to precede step 3, since Rk or Rl may be the same as Rj. Notice also the relative complexity of the renaming and dispatch process.
The steps involved in handling the issue of LOAD and STORE instructions are similar.
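A condensed C sketch of dispatch steps 1 through 6 for an instruction Rj <- Rk op Rl is given below. The data-structure names (rename_table, AL, Status, Renamed) follow the slides; the VFU and slot helper functions are assumptions, declared here only to make the sketch self-contained.

#include <stdbool.h>

#define NUM_ARCH 40
#define NUM_PHYS 51

extern int  rename_table[NUM_ARCH];  /* arch reg -> current MRI (phys reg) */
extern bool AL[NUM_PHYS];            /* allocation list                    */
extern bool Status[NUM_PHYS];        /* 1 = register holds valid data      */
extern bool Renamed[NUM_PHYS];       /* 1 = superseded by a newer MRI      */

/* Assumed helpers, not defined here: */
int  alloc_vfu(int op_type);               /* returns a free VFU id or -1     */
int  alloc_free_phys_reg(void);            /* returns a free phys reg or -1   */
int  vfu_slot_id(int vfu, int which);      /* unique id of a VFU input slot   */
long read_phys_reg(int p);
void deliver_operand(int vfu, int which, long value);
void add_to_wait_list(int p, int slot);    /* sets bit 'slot' in Wp           */
void set_dest(int vfu, int p);

bool dispatch(int op_type, int Rj, int Rk, int Rl)
{
    int vfu = alloc_vfu(op_type);          /* step 1: need a free VFU ...     */
    int Pq  = alloc_free_phys_reg();       /* ... and a free physical register */
    if (vfu < 0 || Pq < 0) return false;   /* otherwise dispatching blocks    */

    int Pr = rename_table[Rk];             /* step 2: current MRIs of the     */
    int Ps = rename_table[Rl];             /* sources, read before step 3     */

    Renamed[rename_table[Rj]] = true;      /* step 3: old MRI of Rj is now    */
    rename_table[Rj] = Pq;                 /* superseded by Pq                */
    AL[Pq] = true;  Status[Pq] = false;

    /* steps 4-6: dispatch the renamed instruction Pq <- Pr op Ps             */
    int srcs[2] = { Pr, Ps };
    for (int s = 0; s < 2; s++) {
        if (AL[srcs[s]] && Status[srcs[s]])                 /* source is valid */
            deliver_operand(vfu, s, read_phys_reg(srcs[s]));
        else                                                /* wait for forwarding */
            add_to_wait_list(srcs[s], vfu_slot_id(vfu, s));
    }
    set_dest(vfu, Pq);
    return true;
}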
119
Register Renaming Adapted for APEX (continued)
The relative complexity of the renaming and dispatch processes requires that these steps be implemented using more than one stage to avoid pipeline bottlenecks. The current example uses two pipeline stages to do this:
The Decode/Rename1 stage implements steps 1 through 3
The Rename2/Dispatch stage implements steps 4 through 6
VFU completion and data forwarding:
When a FU completes, it arbitrates for access to any one of the two result buses. On being granted access to a result bus, it writes the result into the destination register.
At the same time, the Wq list associated with the destination register Pq is used to forward the data to the waiting function unit slots, as described earlier.
This result is also forwarded to the instructions currently within the Decode/Rename1 and Rename2/Dispatch stages.
If Renamed [q] is set, Pq is added back to the free list (i.e., AL [q] is reset to a 0). Otherwise, Status [q] is set to indicate that Pq contains valid data.
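These completion steps can be sketched in C for a result destined for physical register Pq, under the assumption that the W vectors are kept as per-register bit vectors over the numbered VFU slots; forward_to_slot() is an assumed helper.

#include <stdbool.h>

#define NUM_PHYS  51
#define NUM_SLOTS 43   /* VFU input slots 0 .. 42, as in the earlier example */

extern long phys_reg[NUM_PHYS];
extern bool AL[NUM_PHYS], Status[NUM_PHYS], Renamed[NUM_PHYS];
extern bool W[NUM_PHYS][NUM_SLOTS];   /* W[q][j] = slot j awaits Pq's value */

void forward_to_slot(int slot, long value);   /* assumed helper */

/* Called when a VFU has been granted a result bus for register Pq. */
void complete(int q, long result)
{
    phys_reg[q] = result;                         /* write the register        */
    for (int j = 0; j < NUM_SLOTS; j++)           /* forward via Wq, then      */
        if (W[q][j]) {                            /* clear the W vector        */
            forward_to_slot(j, result);
            W[q][j] = false;
        }
    if (Renamed[q]) {                             /* an old instance: free it  */
        AL[q] = false;
        Renamed[q] = false;
    } else {
        Status[q] = true;                         /* Pq now holds valid data   */
    }
}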
Deallocating a VFU: Once the operation for a VFU has started up (and if the physical FUs are pipelined), the VFU can be de-allocated. Note contrast with Tomasulo’s.
120
Register Renaming Adapted for APEX (continued)
Deallocating physical registers:
A physical register can be deallocated at the time a physical register is updated, as described earlier.
Another opportunity for deallocating a physical register occurs when an instruction renames the architectural register for which physical register Pq was the most recent instance and there is no instruction waiting to read Pq (which implies that Pq’s status bit should indicate Pq’s contents as valid). If these conditions hold, Pq is marked for deallocation in the Decode/Rename1 stage but the actual deallocation does not take place till the instruction moves into the Rename2/Dispatch stage, one cycle later.
- The reason for delaying the actual deallocation is to allow the instruction, say I, preceding the one that marked Pq as renamed to read Pq as a possible input operand when I is in the Rename2/Dispatch stage.
When a physical register is deallocated, its W vector is cleared.
Additional terminology related to dynamic scheduling:
instruction wakeup: the process of enabling waiting instructions for potential execution when such instructions receive the last of any source operands they were waiting for.
instruction selection: the process of selecting a subset of the enabled instructions for issue.
Precise interrupts are easy to implement with register renaming (LATER). Note also the differences between register renaming and Tomasulo’s algorithm (irrespective of the specific implementation of register renaming).
121
Forwarding Register Value Using the Register Address as a “Tag”
Any slot waiting for the value of a physical register (“destination slot”) can pick up the value of the register from a forwarding bus using an associative addressing mechanism
Each destination slot has two parts - a data part that will hold the value sought and a “tag” part that holds the address of the register whose value is sought (that is, being awaited)
[Figure: a destination slot and its associated tag slot. A p-bit comparator compares the contents of the tag slot against the tag part of the forwarding bus; on a match, the clocked enable latches the data part of the forwarding bus into the destination slot, and Tag_slot_reset can clear the tag slot.]
A tag match causes data from the forwarding bus to be loaded into the destination slot.
The forwarding bus now requires two parts - an address part and a data part
The function unit producing the value gains access to the forwarding bus and floats out the value of the register as well as its address.
Destinations whose associated tag value match what is floated on the tag part of the forwarding bus latch the value into the destination field. Additionally, the tag slot may be reset to a value that does not equal the address of any physical register to indicate that the destination slot contains valid data. Alternatively, an associated ”valid” bit may be set.
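A minimal C sketch of this associative pickup, modelling each waiting destination slot with its tag; the structure is illustrative, and the valid bit stands in for the tag-reset alternative mentioned above.

#include <stdbool.h>

typedef struct {
    bool valid;   /* 1 = slot already holds valid data               */
    int  tag;     /* address of the physical register being awaited  */
    long data;
} WaitingSlot;

/* All comparators operate in parallel in hardware; here they are swept by
 * a loop.  bus_tag/bus_data are the tag and data parts of the forwarding
 * bus driven by the completing function unit.                             */
void snoop_forwarding_bus(WaitingSlot slots[], int n, int bus_tag, long bus_data)
{
    for (int i = 0; i < n; i++) {
        if (!slots[i].valid && slots[i].tag == bus_tag) {
            slots[i].data  = bus_data;   /* latch the value             */
            slots[i].valid = true;       /* (or reset the tag instead)  */
        }
    }
}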
122
Alternative Implementation of Register Renaming
Uses centralized Issue Queue (IQ) and associative tag matching
Associative matching, using the address of the physical register being written as the key (or tag), is used to accomplish forwarding to data slots within the IQ, as shown below (paths for dispatching and issuing to/from the IQ are NOT shown):
[Figure: centralized Issue Queue (IQ) and the physical register file. Results forwarded from the FUs on the forwarding bus are matched against the src1 and src2 fields of the IQ entries; the same forwarding path also reaches instructions in the decode/rename and dispatch stages.]

Every physical register has a bit (register valid bit) that indicates if the contents of the register are valid or not.

Format of an entry in the issue queue:

    status bit | FU type needed | literal operand (if any) | src1 field | src2 field | dest | other fields

where:
    status bit: indicates if this IQ entry is allocated or free
    src1 field: placeholder for src1 value + src1 tag + ready bit
    src2 field: placeholder for src2 value + src2 tag + ready bit
    src1 tag, src2 tag = addresses of physical registers corresponding to src registers
    dest: address of the destination physical register
123
Alternative Implementation of Register Renaming (continued)
Instruction decoding and dispatching steps:
a. Stall if the resources needed for the dispatch are not available. (For a register-to-register operation, the resources needed for a dispatch are: a free IQ entry AND a free physical register for the destination.)
b. Look up the rename table for the physical register addresses corresponding to the source registers. Assign the physical register for the destination and update the rename table.
c. If a source register is valid (as indicated by its register valid bit), read it out into the corresponding field in the selected IQ entry and set the ready bit of this entry to mark the contents of the source field as valid.
d. If a source register is not valid, set the src register address field in the IQ entry to the physical address of the source register and clear the ready bit to indicate that the corresponding source register value is not ready.
e. If the instruction has a single source register, set the ready bit of the unused source register.
f. If the instruction has any literal operand, read it out into the literal field of the IQ entry. Set the “dest” field in the entry to the address of the reg- ister allocated in Step b).
h. Set the “FU type needed” field in the IQ entry to indicate the type of FU needed to execute this instruction.
i. Mark the IQ entry as allocated.
124
Alternative Implementation of Register Renaming (continued)
Instruction writeback:
a. When an instruction completes, wait till a forwarding bus is available. Continue with the following steps when a forwarding bus is available. (Several FUs completing at the same time have to arbitrate for access to the forwarding buses. A FU cannot forward its result to waiting in- structions and write its result to the destination register unless it gains access to the forwarding bus)
b. Drive the address of the destination register and the result on the respective parts of the forwarding bus. The result gets written to the destination register and also gets forwarded to instructions which are being renamed and dispatched.
Instruction wakeup, selection and issue:
a. Associative matching and data pickup: As a result gets forwarded, IQ source register fields whose ready bits are not set and whose src register address field matches the address of the register driven on the forwarding bus do the following:
- The register value driven on the forwarding bus is latched into the src register value field
- The ready bit is set
Note that if multiple forwarding buses are present, the src tag fields in all src register fields that are not marked as ready have to be compared against the register address on each of the forwarding buses at the same time. This is because the result can be forwarded on any one (and exactly one) of these buses
125
Alternative Implementation of Register Renaming (continued)
b. Instruction wakeup: IQ entries that have both sources marked as ready are eligible for issue. An instruction waiting for one or more source operands wakes up (= becomes ready for execution) when all of its sources are ready. Wakeups are thus a consequence of a match (or matches) in the course of forwarding. The wakeup logic associated with every allocated IQ entry simply ANDs the ready bits in the source fields of the IQ entry of that instruction. The instruction is considered to have awakened when this AND gate outputs a 1.
c. Selecting ready instructions for issue: Not all ready instructions can be issued simultaneously due to limitations on the connections between the IQ and the FUs and/or due to the existence of a finite number of FUs of a specific type. This step essentially selects a few of the ready instructions for actual issue as follows:
- Ready IQ entries send a request for issue to the selection logic
- The selection logic identifies instructions that can be issued
- The operands for these selected instructions are read out from the IQ entries along with any necessary information into the appropriate FU to allow execution to commence in the following cycle.
- The ready bits in the IQ entries vacated by issued instructions are cleared and these IQ entries are marked as free.
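A C sketch of the wakeup AND and a simple select step over a centralized IQ follows; the IQ entry fields mirror the format shown earlier, while the first-found selection policy is an assumption made only for illustration.

#include <stdbool.h>

typedef struct {
    bool allocated;           /* IQ entry status bit                     */
    int  fu_type_needed;
    bool src1_ready, src2_ready;
    /* src tags/values, literal, dest ... omitted in this sketch */
} IQEntry;

/* Wakeup logic: AND of the ready bits of an allocated entry. */
static bool awake(const IQEntry *e)
{
    return e->allocated && e->src1_ready && e->src2_ready;
}

/* Selection: pick one ready entry that needs an FU of type fu_type.
 * Returns its index (the entry is vacated), or -1 if none is ready.      */
int select_for_issue(IQEntry iq[], int n, int fu_type)
{
    for (int i = 0; i < n; i++) {
        if (awake(&iq[i]) && iq[i].fu_type_needed == fu_type) {
            iq[i].allocated  = false;           /* entry is freed on issue */
            iq[i].src1_ready = iq[i].src2_ready = false;
            return i;
        }
    }
    return -1;
}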
126
Alternative Implementation of Register Renaming (continued)
Timing requirement: data forwarding, instruction wakeup, selection of instructions for issue and instruction issue are consecutive steps that all have to be completed within a single clock cycle. Equivalently,
Tforward + Tmatch + Twakeup + Tselection + Tissue <= Tclock

where:
    Tforward   = time taken to drive the result and the dest register address on the forwarding bus
    Tmatch     = time needed to do a tag match
    Twakeup    = time needed to AND the ready bits of an IQ entry and drive the output of the gate as the request input to the selection logic
    Tselection = time needed by the selection logic to select an instruction to issue
    Tissue     = time needed to move the selected instruction to the appropriate FU.
As clock frequencies increase, it becomes increasingly difficult to meet this constraint.
In the most common implementation, there is one selection logic for each FU in the processor.
Most modern pipelined processors use renaming and associative tag matching with a centralized IQ or distributed IQs (reservation stations).
127
Handling Dependencies over Condition Codes in Register Renaming
Basic idea: CC architectural registers are handled just like the GP (general purpose) architectural registers.
Hardware needed: ICC and FPCC registers (architectural); more physical registers (and associated W vectors, status fields etc.) to allow ICC or FPCC to be renamed.
Rename table extended to add entries for ICC and FPCC
When an instruction that requires flags in a CC as input is dispatched, the rename table is looked up to get the current MRI (which is a physical register) for that CC. (The required CC architectural register is treated just like an input GP architectural register.)
When an instruction that sets ICC or FPCC is dispatched, a new physical register is assigned to hold the CC values generated by this instruction.
At time of completion, CC values and results are both forwarded on the result bus. Physical registers are deallocated as usual.
128
Obtaining Operands in a Dynamically Scheduled Processor: Choices
Dispatch-bound operand reads (from register file):
Whatever source registers are ready are read out at the time of dispatch
Whatever is read out is moved along with the instruction to wherever it has to wait (IQ or reservation station) for its remaining operand(s).
The waiting station entry (IQ entry or reservation station entry) must have placeholders for all source operands
Issue-bound operand reads (from the register file):
The instruction, source register addresses and any literals are moved to the IQ or reservation station entry at the time of dispatch.
At the time of instruction issue (when all sources are known to be valid), source register operands are read out from the register file.
Permits IQ entries and reservation station entries to be narrower. Entries need to hold register addresses (as opposed to source values in the case of dispatch-bound designs).
Delaying the register file reads (compared to the dispatch-bound designs) slows down the deallocation of physical registers.
129
The CDC 6600 Scoreboard Mechanism
One of the earliest dynamic instruction scheduling techniques was used in the CDC 6600 which:
- was the fastest supercomputer at one time
- had a clean RISCy ISA (with some exceptions)
- survived into the early 80s
- Had 16 FUs (5 load/store FUs, 7 integer FUs, 4 floating pt. FUs)
We will illustrate this mechanism by adapting it for APEX, preserving all the main features of the original mechanism:
[Figure: APEX adapted for CDC 6600-style scoreboarding — an F stage, a D stage and a WB stage; a register file connected to multiple FUs through sets of input buses and output buses; each FU has input latches that hold a single set of inputs until the result is written back. Register operand flow through the pipeline is controlled by the scoreboard's combinational logic.]
130
The CDC 6600 Scoreboard Mechanism (continued)
Multiple sets of input buses exist to allow several sets of input operands to be read by a number of FUs within one cycle. A single cycle is needed to read out and transfer the contents of valid source registers to waiting function units via this bus.
Multiple sets of result buses are provided to let a number of FUs write their results to the register file. A single cycle is needed to write a destination register via these buses
The input latches associated with a FU have to hold the inputs steady till the result computed by the FU is written to the register file. Till this write completes, the FU is considered busy.
A register is considered busy if an instruction that is to write a result to it has been dispatched.
An instruction is active if it is in the pipeline and has not yet completed
The D stage can dispatch at most one instruction per cycle - the conditions for a dispatch are given later.
131
The CDC 6600 Scoreboard Mechanism (continued)
A hardware-maintained global data structure, called a scoreboard, is used to control/initiate:
Instruction dispatching to the FU (original papers/book calls this “issue”)
Initiate the transfer of input operands to the FUs through a set of input buses when ALL input operands required by the FUs become ready (i.e., available)
Initiate writeback of the computed results from the FUs
The scoreboard does this by maintaining the following status information, which is updated continuously:
Function Unit Status:
- allocation status: busy or free
- operation type (e.g., addition, subtraction)
- source register addresses
- destination register address
- status of source registers (busy or free)
Register Status:
- ids of function units that are computing the data for the source registers (if applicable)
Instruction Status: maintained for each active instruction
- Pipeline stage where instruction is currently in and its status.
132
The CDC 6600 Scoreboard Mechanism (continued)
Register addresses for reading, as well as writing the register file via the input and output buses are provided by the scoreboard logic.
Dispatch Rule: An instruction fetched by the F stage can be dispatched by the scoreboard logic only when the two following conditions are met:
1. No active and already dispatched instruction has the same destination register
2. When a FU of the required type is free.
- Note that condition (1) ensures that output (W-A-W) dependencies are maintained
- Note also that the availability of input operands is not a condition for dispatching.
Startup Rule: A dispatched instruction does not start up until it has all of its input operands. The scoreboard signals a FU holding a dispatched instruction to read the register file for all input operands simultaneously (i.e., in the same cycle) when:
1. All required input registers are free
2. The required input bus is free (one bus is used to read all input operands in parallel)
An obvious optimization occurs when these conditions are valid at the time of dispatching, in which case dispatching and input operand readout take place in the same cycle.
133
The CDC 6600 Scoreboard Mechanism (continued)
- Note that the startup rule enforces the flow dependencies.
Completion Rule: A FU starts up execution once it has read in all input operands. After a FU has finished execution, it has to wait for a signal from the scoreboard logic before it can write to the register file through an output bus. The scoreboard sends the signal to the FU to proceed with the write only when it knows that no FU is yet to read the contents of the register being written to.
- This completion rule takes care of anti-dependencies.
- Notice also that limiting the write to a busy register from only one active instruction (thru the dispatch rule) considerably simplifies the bookkeeping needed for implementing the completion rule.
Note the absence of any FU-to-FU forwarding - this was the case in the original CDC 6600, too.
Dependency Type    How Enforced
Flow               Startup rule
Anti               Completion rule
Output             Dispatch rule
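The dispatch and startup rules can be summarised as two predicate checks, sketched below in C; the reg_busy array mirrors the scoreboard's register status, and fu_free()/input_bus_free() are assumed helpers.

#include <stdbool.h>

#define NUM_REGS 48                     /* assumed number of architectural registers */

extern bool reg_busy[NUM_REGS];         /* a dispatched instruction will write it    */
bool fu_free(int fu_type);              /* assumed: an FU of this type is free       */
bool input_bus_free(void);              /* assumed: a full set of input buses is free */

/* Dispatch rule: no active, already dispatched instruction targets the same
 * destination register, and an FU of the required type is free.            */
bool can_dispatch(int fu_type, int dest)
{
    return !reg_busy[dest] && fu_free(fu_type);
}

/* Startup rule: a dispatched instruction reads all of its inputs in the same
 * cycle, so every input register must be free and the input bus available.  */
bool can_start(int src1, int src2)
{
    return !reg_busy[src1] && !reg_busy[src2] && input_bus_free();
}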
134
The CDC 6600 Scoreboard Mechanism (continued)
Reasons for the startup rule in the CDC 6600: each bus was designed to carry a full set of input operands; reading input operands piecemeal will reduce the bus utilizations. (FUs were divided into 4 groups and buses allocated to a group were shared within the group.)
FU                    Latency   Comments
Integer               1         Implements integer ops including MOVC
Floating Point Add    4         FP adds & subtracts
Floating Point Mul.   3         FP multiplication
Floating Point Div.   7         FP division
Load unit             3*        Integer & FP loads; latency includes time for address computation
Store unit            3*        Integer & FP stores; latency includes time for address computation
* If memory target is in cache; longer otherwise

Assume also:
A one cycle delay in the D stage. If all operands are ready, they are transferred to the appropriate FU in the same cycle. Literals are transferred at the time of dispatch via the input bus.
A one cycle delay for all other transfers via the input bus.
135
The CDC 6600 Scoreboard Mechanism (continued)
A one cycle delay for a write to a register file (and the PSW flags) via the output bus. The scoreboard signals a FU to do such a write at the beginning of the cycle during which the write takes place.
Two sets of input and output buses:
- Both sets of input and output buses steer data to/from all FUs.
- In case of contention, the transfers associated with the FUs having longer latencies have priority over others, including transfers associated with literal movements during dispatch.
(I0) MOVC R1, #200
(I1) FLOAD F0, R1, #0
(I2) FLOAD F1, R1, #8
(I3) FDIV F2, F1, F0
(I4) FMUL F3, F1, F0
(I5) FSTORE F2, R1, #16
(I6) ADDL R1, R1, #400
(I7) FSUB F4, F1, F0
/* Implemented on IntFU */
136
The CDC 6600 Scoreboard Mechanism (continued)
The scoreboard entries updated at the end of this cycle are:
Instruction status: I2 is in F; I1 is in D; I0 is in Int FU & executing
Register status: R1 is busy, expecting an update from the Int FU.
FU status:
    FU id   op type   status   src1/status   src2/status   dest/status
    IntFU   movc      busy     XX/XX         XX/XX         R1/busy
    (XX = don't care)

The scoreboard entries updated at the end of this cycle are:
Instruction status: I3 is in F; I2 is in D; I1 is in LoadFU; I0 is in Int FU
Register status: R1 is busy, expecting an update from the Int FU; F0 is busy, expecting an update from the LoadFU.
FU status:
    FU id    op type   status   src1/status   src2/status   dest/status
    IntFU    movc      busy     XX/XX         XX/XX         R1/busy
    LoadFU   fload     busy     R1/busy       XX/XX         F0/busy
137
The CDC 6600 Scoreboard Mechanism (continued)
The scoreboard entries updated in this cycle are:
Instruction status: I3 is in F; I2 is in D; I1 is (still) in LoadFU; I0 is in WB.
Register status: R1 is free, updated from the Int FU; F0 is busy, expecting an update from the LoadFU.
FU status:
    FU id    op type   status   src1/status   src2/status   dest/status
    IntFU    XX        free     XX/XX         XX/XX         XX
    LoadFU   fload     busy     R1/free       XX/XX         F0/busy
Instruction status: I3 is in F; I2 is in D; I1 is in LoadFU.
Register status: R1 is free; F0 is busy, expecting an update from the LoadFU.
FU status: Similar to that at the end of cycle 4, except that the src1/status field of LoadFU is R1/free.
138
The CDC 6600 Scoreboard Mechanism (continued)
The term scoreboarding has been used to mean very diverse kinds of dependency handling mechanisms:
- In the early RISC CPUs, scoreboarding was a simple interlocking mechanism that prevented an instruction from being dispatched if either its source or destination registers were busy:
[Figure: the same register file and status bit array organization shown earlier for simple interlocking (status: busy = 1). The D/RF stage reads the src1 and src2 values and the src1, src2 and dest status bits; issue stalls if ORing the status bits gives a 1. The dest status bit is set to 1 on dispatch and cleared to 0 (free) by the WB stage when the data is written.]
– Considers all three types of dependencies to limit dispatch from the same pipeline stage (namely, D/RF): relatively less sophisticated compared to the original CDC 6600
– Today, the term may represent dynamic instruction issue/dispatch mechanisms that are more complex than the CDC 6600’s
Potential improvements to the CDC 6600 scoreboarding mechanism:
Allow operands to be transferred to FUs in a piecemeal fashion
Incorporate forwarding logic
Incorporate write buffers within the FUs – this frees up the input latches for a FU to accept a new set of inputs
139
Tomasulo’s Algorithm
Employed in the IBM 360, model 91, this scheme is named after its inventor:
– Landmark in pipeline CPU design
– Remains sophisticated even by today’s standards
– This dynamic issuing scheme was implemented within the floating point unit of the 360/91
– The 360/91 was a commercial failure – the 360/85 ran faster using a cache.
– The Tomasulo scheme was way ahead of its time: variations of this scheme started showing up in microprocessors only in the early 90s!
We will present Tomasulo’s algorithm as adapted for APEX: this mechanism will be used to schedule all types of instructions, not just floating point ops.
The main features of the datapath in APEX using Tomasulo’s algorithm are:
– Multiple FUs as shown. All FUs have associated reservation stations
– A common bus, called the CDB (“common data bus”) is used for forwarding data from one FU to all destinations that are waiting for the result within a single cycle.
– Forwarding is accomplished by using statically assigned source tags to identify forwarding sources, as detailed later.
140
Tomasulo’s Algorithm (continued)
Datapath Components:
[Figure: APEX datapath adapted for Tomasulo’s algorithm — F and D stages; register files (int + FP); an IntFU, Load FU, Store FU, FP Add FU and FP Mul/Div FU, each with reservation stations; an InBus and the common data bus (CDB).]
– Common data bus (CDB), shown in bold (grey shading) is used for forwarding
– Register values, control information (including tags) move on thin lines: these actually represent multiple buses.
– Each reservation station defines a virtual function unit (VFU)
– Fetches and dispatches take place in program order.
141
Tomasulo’s Algorithm (continued)
The physical FUs are assumed to be pipelined and have the following latencies:
Physical FU          Latency (in cycles)
Integer FU           1
Load FU              3*
Store FU             3*
FP Add FU            4
FP Mul./Div. FU      7
* If memory target is in cache; longer otherwise
– Assume branches to be processed by the F stage itself using some hardware prediction mechanism. This is not discussed further.
Each reservation station entry (RSE) holds a set of operands (or information to locate the input operands), in addition to implicit information naming this RSE uniquely as a virtual FU.
Each VFU, excepting the Store VFUs, is a potential source of data to be forwarded: these sources are uniquely numbered, based on the relative position of the RSEs in front of each physical FU, as shown in the table below:
142
Tomasulo’s Algorithm (continued)
VFU type            Number    Identifiers (source tags)
Integer VFUs        4         1, 2, 3, 4
Load VFUs           8         5 through 12
FP Add VFUs         3         13, 14, 15
FP Mul/Div VFUs     2         16, 17
Notice that the Store VFUs are not assigned any unique ids, since they do not need to deliver a result to a waiting VFU or to a register. The source ids are called source tags.
– Note that 5-bit tags suffice for the purpose of unique naming
– Note also that the tag value zero (00000) is unassigned to any source
Each architectural register has an associated tag register. If the tag register holds a zero, the register contains valid data. If the tag register is non-zero, its contents name the VFU that is to write a result into the register.
Example:
F0: tag register = 00000      R2: tag register = 00111
FP register F0 contains valid data; integer register R2 is expecting data from the Load VFU with the id 7.
143
Tomasulo’s Algorithm (continued)
Each RSE of a virtual FU (excluding the Store VFUs) has fields for two input operands and an associated tag register for each field/slot. The implications of the contents of the tag registers for RSE fields are identical to those for registers. Each RSE for a Store VFU has three input slots: for the three input operands of a STORE with a literal offset. (This restricts the load and store instructions to a register-offset format for memory addresses.)
In addition, each RSE has a bit flag that indicates its status as free or allocated.
Example:
Three RSEs sit in front of the FP Add FU (the physical FU); each entry below is shown as (status, left input tag register, right input tag register) - the data slots are not shown:
FPAddVFU0 (Id = 13): status = 1, left tag = 00000, right tag = 00000
FPAddVFU1 (Id = 14): status = 0, left tag = 00010, right tag = 00011 (don't care - entry is free)
FPAddVFU2 (Id = 15): status = 1, left tag = 00001, right tag = 00101
Each entry has a left input data slot, a right input data slot, the associated left and right input tag registers, and a status bit (0 = free; 1 = allocated).
144
Tomasulo’s Algorithm (continued)
In this example,
– FPAddVFU0 (Id = 13) has been allocated to an instruction and has received both input operands, since the associated tag registers hold the tag 00000; this VFU is presumably executing the operation using the physical FPAdd FU (or is about to start execution).
– FPAddVFU 1 (Id = 14), has not been allocated to any instruction.
– FPAddVFU 2 (Id = 15) is awaiting both operands, from the sources with ids 00001 (Integer VFU 0) and 00101 (Load VFU 0).
Potential destinations of forwarding:
– architectural registers
– any RSE slot
A virtual FU is enabled for startup when all tag registers in its associated RSE contain 00000.
– The startup actually occurs when the associated FU selects one of the enabled RSEs.
– The issue rule for a VFU is thus:
(a) the selected VFU must have all of its inputs available
(b) the associated physical FU must be free
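A small Python sketch of this issue rule (the RSE representation below is an assumption used only for illustration): an RSE is enabled when it is allocated and every one of its tag registers is 0, and startup additionally requires the physical FU to be free.

class RSE:
    def __init__(self):
        self.allocated = False
        self.tags = [0, 0]        # left/right input tag registers (0 = operand valid)
        self.data = [None, None]

def enabled(rse):
    # (a) all inputs available: every tag register holds 00000
    return rse.allocated and all(t == 0 for t in rse.tags)

def select_for_startup(rses, fu_free):
    # (b) the associated physical FU must be free; pick one enabled RSE
    if not fu_free:
        return None
    for rse in rses:
        if enabled(rse):
            return rse
    return None

rses = [RSE(), RSE()]
rses[0].allocated, rses[0].tags = True, [0, 5]   # still waiting on source tag 5
rses[1].allocated, rses[1].tags = True, [0, 0]   # both operands available
print(select_for_startup(rses, fu_free=True) is rses[1])   # True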
145
Tomasulo’s Algorithm (continued)
Forwarding using the source tags:
– When a VFU completes, it requests access to the CDB, possibly competing for the CDB with other VFUs that have completed. The CDB arbitration logic selects one of these VFUs.
– The selected VFU puts out the result it computed, as well as its source tag (i.e., id) on the CDB.
– All destinations, whose associated tag register contains the same tag value as the one floated on the CDB pick up the data simultaneously from the CDB. After the data from the CDB has been loaded, the associated tag register is set to 00000.
– This requires each destination to have an associated 5-bit comparator to continuously monitor the CDB, in effect implementing an associative tag matching mechanism:
[Figure: each destination (an architectural register or an RSE input data slot) has an associated tag register and a 5-bit comparator that continuously compares the tag part of the CDB against the tag register; on a match, the comparator enables the data part of the CDB to be clocked into the destination and asserts Tag_register_reset.]
A tag match enables data from the CDB to be loaded into the destination.
The associated tag register is reset to 00000 after the destination is loaded from the CDB
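A Python sketch of the associative pick-up just described (the Dest class is a simplified stand-in for an architectural register or an RSE input data slot): every destination whose tag register equals the tag floated on the CDB latches the data and resets its tag to 0.

class Dest:
    def __init__(self):
        self.value, self.tag = None, 0    # tag 0 means the value is valid

def cdb_broadcast(destinations, source_tag, result):
    for d in destinations:
        if d.tag == source_tag:           # 5-bit comparator match
            d.value = result              # data loaded from the CDB
            d.tag = 0                     # tag register reset to 00000

regs = {"F2": Dest(), "F3": Dest()}
regs["F2"].tag = 6                        # F2 is waiting on the source with tag 6
cdb_broadcast(regs.values(), 6, 3.14)
print(regs["F2"].value, regs["F2"].tag)   # 3.14 0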
146
Tomasulo’s Algorithm (continued)
Instruction dispatching rule: The D stage can dispatch an instruction only if a virtual FU of the required type is free. The following actions are taken on a dispatch depending on the type of instruction:
Register-to-register instructions:
– Whether a source architectural register is busy or not, its tag register as well as its contents are copied into the appropriate slots of the selected RSE. If the register is busy, this amounts to copying invalid data into the RSE data slot, which gets overwritten when the actual data is forwarded. Notice that copying the source tags associated with the busy register sets up the forwarding path, maintaining the required flow dependency.
– Literal operands of a register-to-register instruction are copied into the appropriate slot of the selected RSE and the associated tag register is reset to 00000.
– The tag register associated with the destination architectural register is overwritten with the source tag of the selected virtual function unit. Note that overwriting the contents of the tag register associated with a busy architectural register does not pose any problem – it simply reflects the fact that the register is to be updated by the source named in the tag register instead of the source named by previous contents of the tag register, if any.
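A sketch of these register-to-register dispatch actions in Python (the dictionary-based register file and field names are illustrative assumptions, not part of APEX): the source's tag and contents are copied whether or not the register is busy, and the destination's tag register is overwritten with the selected VFU's id.

# Register file: name -> [value, tag]; tag 0 means the value is valid.
regfile = {"R1": [200, 0], "F2": [0.0, 6], "F4": [0.0, 0]}

def dispatch_r2r(src1, src2, dest, vfu_id):
    rse = {"data": [None, None], "tags": [None, None], "allocated": True}
    for slot, src in enumerate((src1, src2)):
        value, tag = regfile[src]
        rse["data"][slot] = value   # may be stale if tag != 0 ...
        rse["tags"][slot] = tag     # ... the copied tag sets up the forwarding path
    regfile[dest][1] = vfu_id       # dest is now to be updated by this VFU
    return rse

rse = dispatch_r2r("F2", "R1", "F4", 13)   # e.g., F4 <- F2 op R1 on FPAddVFU0
print(rse["tags"], regfile["F4"])          # [6, 0]  [0.0, 13]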
MOV instruction:
– MOV is treated as a register-to-register instruction.
147
Tomasulo’s Algorithm (continued)
LOAD instructions:
– The tag and data contents of source registers are copied to the appropriate slots of the selected Load VFU
– Any literal associated with the LOAD instruction is copied into the appropriate slot of the selected Load VFU and its associated tag register is reset to 00000.
– The tag register associated with the destination architectural register of the LOAD instruction is set to the source tag of the selected VFU.
STORE instruction:
– The store instruction does not have a destination register – only the source registers and any literal operand need to be handled. These are handled exactly as in the case of a LOAD instruction.
Branch instructions:
Branch instructions are handled in a manner to be discussed later.
148
Tomasulo’s Algorithm (continued)
Handling VFU Completions: Details
The data forwarding from a VFU that has completed is handled as
described earlier. This step takes a single cycle.
At the end of the cycle in which data forwarding takes place, the completing VFU is marked as released (i.e., unallocated).
– Note the following subtle aspect of the forwarding mechanism:
Within a cycle, the completion protocols are implemented before the protocols for dispatching an instruction are started. This can be accomplished, for instance, by using a two-phase clock: completion protocols are initiated and finished within the first phase, while the issue protocols are implemented during the second phase.
If this relative ordering of the completion and issue protocols is not maintained, a waiting destination may never receive the data it was supposed to receive, or may pick up data forwarded from an unintended source.
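A tiny Python sketch of why this ordering matters (the two functions below are assumed phases of one simulated cycle): the CDB broadcast of the completion phase clears waiting tags before the dispatch phase copies register tags, so a newly dispatched instruction cannot copy a tag whose broadcast it has just missed.

regfile = {"F2": {"value": None, "tag": 6}}        # F2 is waiting on tag 6

def completion_phase(cdb_tag, cdb_value):
    for reg in regfile.values():
        if reg["tag"] == cdb_tag:                  # associative tag match
            reg["value"], reg["tag"] = cdb_value, 0

def dispatch_phase(src):
    # Copies the current tag/value of src into an RSE slot; must run after
    # completion_phase, or a value broadcast in this same cycle is missed forever.
    return dict(regfile[src])

completion_phase(6, 2.5)          # phase 1: the VFU with tag 6 completes
rse_slot = dispatch_phase("F2")   # phase 2: picks up the fresh value, tag = 0
print(rse_slot)                   # {'value': 2.5, 'tag': 0}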
149
Tomasulo’s Algorithm (continued)
An Example:
We will assume that all registers are initially set to zeros and their tag registers, consequently, are set to 00000. All virtual function units are also assumed to be free.
– This, in fact, is how the control logic will initialize the tag registers and VFU status
Assume further that each VFU has its own output buffer. Till the data is written from a VFU’s output buffer, the VFU and its associated RSE are not free.
Assume that when a conflict occurs over using the CDB, the VFU that has the longest latency is given preference. Ties are broken arbitrarily.
(CDB conflicts do not occur in this example.) Consider now the following code fragment:
(I0) MOVC R1, #200
(I1) FLOAD F2, R1, #0
(I2) FLOAD F3, R1, #4
(I3) FADD F4, F2, F3
(I4) FLOAD F5, R1, #8
(I5) FLOAD F7, R1, #12
(I6) FSUB F6, F4, F5
(I7) FMUL F5, F4, F7
(I8) FSTORE F5, R1, #100
150
Tomasulo’s Algorithm (continued)
The DFG for this code fragment is as follows:
[DFG: I0 supplies R1 to I1, I2, I4, I5 and I8; I1 (F2) and I2 (F3) feed I3; I3 (F4) feeds I6 and I7; I4 (F5) feeds I6; I5 (F7) feeds I7; I7 (F5) feeds I8. Anti- and output dependencies over F5 (involving I4, I6 and I7) are also present. Arcs are labelled with the register carrying the dependency.]
The Gantt chart for the processing of this code sequence is as follows. Note that only four Load VFUs are shown for brevity.
151
Tomasulo’s Algorithm (continued)
Tomasulo’s dynamic scheduling algorithm handles all three types of data dependencies (flow, anti, output) by simply maintaining only the flow dependencies.
(Recall that anti and output dependency ordering constraints have to be maintained only to preserve flow dependencies elsewhere in the code.)
– This is equivalent to getting rid of anti and output dependencies
– Flow dependencies are maintained by copying tags to set up appropriate data flow paths for forwarding:
152
Tomasulo’s Algorithm (continued)
Hardware facilities required by Tomasulo’s algorithm:
Common data bus and associated arbitration logic
Hardware facilities to keep track of instruction and VFU status for dispatching. This information is global in nature.
Hardware facilities to allow the physical FUs to start up (and arbitrate over selecting one of possibly many VFUs that are enabled for startup). This logic is distributed in nature.
Tag manipulation logic and tag registers
Associative tag matching logic:
– Number of comparators needed is D, where D is the number of potential destinations
D = # of architectural registers + # of RSE slots (all VFU inputs that can hold a register value)
– If more than one CDB is used to avoid a bottleneck over the use of the CDB, the number of comparators needed increases proportionately.
This is a substantial investment even by today’s standards. Most of the performance benefits show up when we have superscalar instruction issue.
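A rough, purely illustrative count (the architectural register total and the Store VFU configuration below are assumptions; only the non-Store VFU mix comes from the earlier table):

# D = # architectural registers + # RSE slots that can hold a register value
arch_regs = 32                             # assumed int + FP architectural registers
nonstore_rse_slots = (4 + 8 + 3 + 2) * 2   # VFU counts from the table, 2 slots each
store_rse_slots = 4 * 3                    # assumed 4 Store VFUs, 3 input slots each
D = arch_regs + nonstore_rse_slots + store_rse_slots
print(D)                                   # 78 comparators for a single CDB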
153
Tomasulo’s Algorithm (continued)
An instruction can wait in a VFU for three reasons:
Waiting for the availability of input operands
Waiting for the availability of the physical FU (even if it is pipelined)
Waiting for access to the CDB
Notice that Tomasulo’s algorithm dynamically associates architectural registers with VFUs – in doing so, it supports multiple instances of the same architectural register.
For example, in the example, during cycles 9 and 10, one instance of F5 is associated with the VFU with id 7, while a later instance of F5 is associated with the VFU with id 16. (The output buffers of the VFUs and the RSE slots actually serve as the storage – direct or indirect – needed by each instance.)
– This is equivalent to “faking” more registers than are apparent from the ISA description.
– Functionally, another dynamic instruction issuing technique called register renaming (coming up next) provides similar abilities: performance is not restricted by the number of architectural registers
CAUTION: Hennessey & Patterson dub Tomasulo’s scheme as register renaming – there are actually major differences between these two schemes.
154
Tomasulo’s Algorithm (continued)
Tomasulo’s algorithm will dynamically unroll loops and allocate VFUs as needed – this is equivalent to dynamically allocating more registers (in the form of the output buffers of each allocated VFU).
– Check this out by drawing the Gantt chart for the execution of the matrix multiplication inner loop, assuming that the BNZ is correctly predicted within the F stage.
Also check out the Weiss and Smith paper – it talks about alternate mechanisms for data forwarding without the need for tag matching.
This scheme relies on the fact (based on OLD stats) that most of the time one FU has to forward its result to another FU. Today, with large instruction windows and good register allocation schemes, this may no longer be true!
Tomasulo’s scheme, like the scoreboarding mechanism, allows architectural registers to be updated out of order. This generally poses problems with determining a precise state to resume from following an interrupt.
– Additional mechanisms are needed to get a precise state (LATER)
155
Tomasulo’s Algorithm (continued)
Some notes about the original implementation of the Tomasulo technique:
Was deployed only within the floating point part of the 360/91
Function units were not pipelined
VFUs implemented on a common physical FU shared a common output buffer that simply held the computed result till it was driven out on the CDB. Till the result was forwarded, the VFU and its associated RSE was not considered free.
Precise interrupts (LATER) were not implemented despite out-of-order completions.
The implementation complexity was very significant, as discrete components were used in the implementation.
156
Handling Dependencies over Condition Code in Tomasulo’s Technique
Hardware needed:
One CC register for integer FUs and another for floating point FUs (ICC
and FPCC, respectively).
Source tag registers associated with these CC registers.
RSE of VFU extended to have one more slot, to hold the CC flags – this slot has an associated tag field; this extension is needed to handle instructions like ADDC (add with carry) and conditional branches. VFU entries now look as follows:
[Each VFU entry now has: left input data and left input tag register, right input data and right input tag register, CC input data and CC input tag register, and a status bit (0 = free; 1 = allocated).]
Handling dispatches:
Instructions dependent on CC flags (ADDC, BC, BZ etc.):
– handle src registers as usual
– copy the ICC (or FPCC, as the case may be) tag and data into the CC input slot of the VFU
– if the instruction can set the CC flags, copy the id of the selected VFU into the tag register of this CC (ICC or FPCC)
Other instructions:
– handle src registers as usual
– copy a tag value of 00000 (= valid) into the CC input tag field of the VFU; these instructions do not care about the CC value, so they do not need to wait till the CC becomes valid
– if the instruction can set the CC flags, copy the id of the selected VFU into the tag register of this CC
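A small Python sketch of this dispatch-time CC handling (ICC and the structures below are simplified stand-ins, not the actual hardware): CC-dependent instructions copy the CC tag and data into the extra slot, other instructions stamp the slot valid, and any CC-setting instruction claims the CC tag register.

ICC = {"value": 0, "tag": 0}        # condition-code register and its tag register

def dispatch_cc(uses_cc, sets_cc, vfu_id):
    cc_slot = {"value": None, "tag": 0}
    if uses_cc:                     # e.g., ADDC, BC, BZ
        cc_slot["value"], cc_slot["tag"] = ICC["value"], ICC["tag"]
    # else: the CC tag stays 00000 - the instruction need not wait for the CC
    if sets_cc:
        ICC["tag"] = vfu_id         # the latest CC value will come from this VFU
    return cc_slot

print(dispatch_cc(uses_cc=False, sets_cc=True, vfu_id=3))   # {'value': None, 'tag': 0}
print(dispatch_cc(uses_cc=True, sets_cc=False, vfu_id=9))   # waits on CC tag 3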
157
Handling Dependencies over Condition Code in Tomasulo’s (contd.)
Handling startups: execution of an instruction starts up when all the tags associated with all of its inputs are 00000.
Handling completions: when a VFU completes, it floats out the result, the CC flag values for this result, and its id on the CDB. CC values are thus forwarded (like results) to any waiting instruction.
158
The Software Interlocking Algorithm
Basic idea: rearrange instructions in a code sequence so that interlocked instructions are separated by other instructions from elsewhere within this sequence or by NOPs, keeping the behavior of the code sequence unchanged.
Typically used to take care of dependencies that involve ops with a predictable latency
– Dependencies from a LOAD are thus not handled by this algorithm
– However, if the LOADs have a minimum guaranteed latency, then this algorithm can be used to reduce the interlocking delay due to a dependency from a LOAD; additional hardware mechanisms have to be used to stall the startup of an operation that depends on the LOAD should the latency of the LOAD extend beyond this minimum stipulated value
Algorithm requires specific details of the pipeline (#FUs by type, latencies etc.)
This algorithm is implemented in the postpass phase of a compiler, since this is a machine-specific optimization
Assumes that there is no hardware mechanism for delaying the issue of an instruction when input operands are not ready. Neither do we have an issue queue – so “issue” in this case means “start execution”.
– The D/RF stage has to issue an instruction every cycle: when the issue of a data-dependent instruction has to be delayed by Q cycles, Q NOP instructions are issued back-to-back.
Only algorithm of choice if hardware does not incorporate interlocking logic to enforce data dependencies (such as in a VLIW processor – LATER)
159
The Software Interlocking Algorithm (continued)
Outline of Algorithm:
1. Construct the DFG of a basic block of code (basic block = code fragment with one entry point and one exit point)
2. Initialization:
(a) Create a list L of instructions that can be started when control flows into the basic block – these instructions are the ones whose inputs are all ready and whose output and anti-dependencies have cleared.
(b) Create an output list, R, which is initially empty.
(c) Initialize a data structure to keep track of the state of the CPU as follows:
– Mark each register that contains a valid data item as valid
– Mark each FU as free
3. Pretend that an instruction is issued:
(a) If L is empty, add a NOP to the end of the list R
160
The Software Interlocking Algorithm (continued)
(b) If L is non-empty, then check if an instruction from L can be issued as follows:
For an instruction I in L to be issuable:
All the other instructions that it depends on (for a flow, anti or output dependency) must have completed (i.e., updated the necessary registers or memory locations) or have delivered the data that I is waiting for through a forwarding mechanism, if such a mechanism exists.
A function unit for executing the instruction must be free
(i) If there is more than one instruction in L that can be issued, some secondary criterion may be used to pick one for issue. The selected instruction is removed from L and added to the end of R. The status of the destination register of this instruction is marked as busy
(ii) If no instruction from L can be issued, add a NOP to the end of R.
4. Assume that a single cycle has elapsed since the initiation of the instruction that was just added to R. Update the state of the processor (FU status, register status etc.) at this point in time.
161
The Software Interlocking Algorithm (continued)
5. Update L based on the newly updated state information:
– From the DFG, add any instruction to L which can now be issued since previously initiated instruction(s) that it depended on have just completed or forwarded any required data
– Any type of dependence (flow, anti- or output) has to be considered
– From L take out any instruction that cannot be issued because the required FU is busy
6. Go to step 3, looping through steps 3 through 6 till all instruction nodes in the original DFG have been added to R
The final contents of R, taken in sequence, give a software-interlocked version of the code for the original basic block
– This rearranged code has the same DFG as the original code
– The rearranged code may also have additional NOPs
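A compact Python sketch of the scheduling loop in steps 3 through 6, under simplifying assumptions (only flow dependencies are modelled, the FUs are pipelined and always free, and forwarding lets a consumer start exactly latency cycles after its producer); it illustrates the L/R bookkeeping rather than the full algorithm.

# Each instruction: (name, dest, sources, latency); a small sample basic block
code = [("I1", "R5", ["R7", "R8"], 3),
        ("I2", "R1", ["R2", "R5"], 1),
        ("I4", "R6", ["R5", "R8"], 1)]

def schedule(code):
    written = {c[1] for c in code}      # registers produced within the block
    done_at = {}                        # dest register -> cycle its value is ready
    pending, R, cycle = list(code), [], 1
    while pending:
        # L = instructions from the DFG whose inputs are available this cycle
        L = [i for i in pending
             if all(s not in written or done_at.get(s, 10**9) <= cycle for s in i[2])]
        if L:
            pick = L[0]                 # secondary criterion here: program order
            pending.remove(pick)
            R.append(pick[0])
            done_at[pick[1]] = cycle + pick[3]   # result forwarded after its latency
        else:
            R.append("NOP")             # nothing issuable this cycle: emit a NOP
        cycle += 1
    return R

print(schedule(code))   # ['I1', 'NOP', 'NOP', 'I2', 'I4']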
162
The Software Interlocking Algorithm (continued)
Generally, software interlocking cannot be used to take care of dependencies from a LOAD to another instruction.
– This is because of the unpredictable timing of a LOAD (cache miss service times can vary): hardware mechanisms are very much required to cope with these dependencies.
An Example:
Consider the following code for a basic block
/* R2, R4, R5, R7, R8, R10 and R11 contain valid data at this point */
I1: FMUL R5, R7, R8
I2: FADD R1, R2, R5
I3: MOV  R9, R1         /* R9 <- R1; done using the integer unit */
I4: FADD R6, R5, R8
I5: FDIV R1, R4, R11
I6: FST  R1, R10, #100  /* Mem[R10 + 100] <- R1; dedicated unit to compute address */
I7: FADD R4, R7, R11
Assume the following pipeline structure:
Independent floating point add and multiplication units; both pipelined
Latencies: FADD: 1 cycle; FMUL: 3 cycles; FDIV: 4 cycles; MOV: 1 cycle; FST: 1 cycle for address computation
Forwarding is possible from the output of the FUs to an instruction reading its operand in the D/RF stage
Only one writeback port
163
The Software Interlocking Algorithm (continued)
DFG for the example:
[DFG: flow dependencies I1->I2 and I1->I4 over R5, I2->I3 over R1, and I5->I6 over R1; anti-dependencies I3->I5 over R1 and I5->I7 over R4; an output dependency I2->I5 over R1. A dashed flow arc I2->I6 over R1 is also shown. Arcs are labelled with the register over which the dependency occurs.]
Dashed arc can be removed - if a node has flow dependency over the same register from more than one node, only the flow dependency from the node closest to it in program order matters
164
The Software Interlocking Algorithm (continued)
Pass #   L (at start of pass)   L (at end of pass)   Instruction added to R   Comments

1    L = {I1}       L = {}     I1    I1 “issued”
2    L = {}         L = {}     NOP   NOP “issued”
3    L = {}         L = {}     NOP   NOP “issued”
4    L = {I2, I4}   L = {I4}   I2    Flow dependency from I1 to I2 satisfied through forwarding; I2 “issued” since it has more instructions dependent on it than I4
5    L = {I4, I3}   L = {I4}   I3    Flow dependency from I2 to I3 satisfied through forwarding; I3 “issued” in preference over I4. With I3’s issue, the anti-dependency from I3 to I5 is satisfied.
6    L = {I4, I5}   L = {I4}   I5    I2 updates R1 - this satisfies the output dep. to I5, making I5 issuable*; I5 “issued”. With I5’s issue, the anti-dep. to I7 is satisfied.
7    L = {I4, I7}   L = {I7}   I4    I4 “issued”; I4 chosen arbitrarily over I7
8    L = {I7}       L = {}     I7    I7 “issued”
9    L = {}         L = {}     NOP   NOP “issued”
10   L = {I6}       L = {}     I6    I5 completes, forwarding result to I6; I6 “issued”

* actually satisfied earlier since writes are serialized
165
The Software Interlocking Algorithm (continued)
The re-arranged, software-interlocked version of the code, as produced by the algorithm is thus:
I1: FMUL R5, R7, R8
    NOP
    NOP
I2: FADD R1, R2, R5
I3: MOV  R9, R1
I5: FDIV R1, R4, R11
I4: FADD R6, R5, R8
I7: FADD R4, R7, R11
    NOP
I6: FST  R1, R10, #100
General limitations of software interlocking:
1. Code produced is specific to a particular CPU implementation - there is no binary compatibility
2. Does not handle the variable latencies of LOADs in the worst case. Preferred (and commonly used) compromise: use software interlocking to fill the delay slots of the LOAD assuming that a cache hit occurs; use hardware to hold up instructions that depend on the LOAD when a cache miss occurs
- This is a hybrid solution that uses the best of both worlds!
166
Software Pipelining: Another Software Interlocking Algorithm
Used to reduce interlocking delays within a loop body
Basic idea is to modify the loop body by splicing in instructions from
consecutive loop iterations
The algorithm that does loop pipelining effectively starts out with the DFG of one or a number of consecutive iterations and applies the software interlocking algorithm to come up with a modified loop body that has instructions from different loop iterations within the new loop body:
- The construction starts by picking up the instruction(s) that has (have) the most impact on latency as the initial member of L
- A loop priming code sequence is needed to start the modified loop
- A similar flush code sequence is needed following the modified loop body
- Often, when operation latencies are large, the modified loop is obtained by starting with an unrolled version of the loop: this requires more registers
Example:
Consider the nested loops used for multiplying two-dimensional matrices:
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
C[i,j] = C[i,j] + A[i,k] * B[k,j]
- assume that the array C[ ] is initialized to zeros.
167
Software Pipelining: Another Software Interlocking Algorithm (contd.)
A straightforward implementation of the body of the inner loop in APEX is:
loop: LOAD R1, R2, R6   /* load A[i,k] into R1: R1 <- A[i,k] */
      LOAD R3, R2, R7   /* R3 <- B[k,j], B is stored transposed */
      MUL  R4, R1, R3   /* form A[i,k] * B[k,j] */
      ADD  R5, R4, R5   /* C[i,j] in R5; C[i,j] += A[i,k] * B[k,j] */
      ADDL R2, R2, #4   /* increment k */
      SUBL R9, R9, #1   /* decrement inner loop counter */
      BNZ  loop
The DFG for the loop body is:
[Figure: DFG of the loop body - LOAD R1 and LOAD R3 feed the MUL, which feeds the ADD; ADDL, SUBL and BNZ complete the node set. A Gantt chart showing the execution latencies of LOAD R1, LOAD R3, MUL, ADD, ADDL, SUBL and BNZ accompanies the DFG.]
This DFG does not show dependencies from one iteration to a subsequent one (“loop-carried dependencies”)
168
Software Pipelining: Another Software Interlocking Algorithm (contd.)
Consider the execution of this code on an in-order startup pipeline with multiple FUs (for ADD, MUL, LD/ST and branch) where:
All FUs are pipelined
Multiple writeback ports exist into the register file
Forwarding mechanisms from the outputs of the FUs are in place
Assume the following latencies for the FUs: LOAD: 2 cycles; MUL: 4 cycles; all others: 1 cycle
The flow dependencies in this code result in an execution time of 11 cycles per iteration.
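One way to account for the 11 cycles (an assumed breakdown; the foils do not show it): with in-order startup of at most one instruction per cycle, LOAD R1 starts in cycle 1 and LOAD R3 in cycle 2, so R3 is forwarded at the end of cycle 3; the MUL therefore starts in cycle 4 and, with its 4-cycle latency, forwards R4 at the end of cycle 7; the ADD starts in cycle 8, and the ADDL, SUBL and BNZ then start in cycles 9, 10 and 11.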
Software pipelining reduces the cycle count for the above loop by splicing together instructions from different iterations of the loop:
The software pipelined version that reduces the interlocking delays is:
{code to prime loop}
loop: MUL  R4, R1, R3   /* perform the multiplication for iteration k */
      LOAD R1, R2, R6   /* load A[i, k+1] for iteration k+1 */
      LOAD R3, R2, R7   /* load B[k+1, j] for iteration k+1 */
      ADDL R2, R2, #4   /* increment for iteration k+1 */
      ADD  R5, R4, R5   /* update C[i, j] for iteration k */
      SUBL R9, R9, #1
      BNZ  loop
{code to flush loop}

Notice how the delay slots of the MUL are filled with the LOADs for the next iteration and by the ADDL; the ADDL also fills the delay slot of the second LOAD.

The total time needed per iteration of this loop is now reduced to just 7 cycles.
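Under the same assumptions as before, the 7 cycles can be accounted for as follows: the MUL starts in cycle 1 and forwards R4 at the end of cycle 4; the two LOADs and the ADDL for iteration k+1 start in cycles 2, 3 and 4, filling the MUL's delay slots; the ADD can then start in cycle 5, followed by the SUBL in cycle 6 and the BNZ in cycle 7.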
169
Software Pipelining: Another Software Interlocking Algorithm (contd.)
To see how this modified body was obtained, consider the DFG for two consecutive iterations of the original loop:
[Figure: the DFGs of iteration k and iteration k+1, drawn side by side; each has the nodes LOAD R1, LOAD R3, MUL, ADD, ADDL, SUBL and BNZ, with the same intra-iteration dependencies as before.]
We start with the MUL from iteration #k in L. This implies that the two LOADs from iteration #k and the ADDL from iteration #k (which has a smaller latency than the MUL) have already been scheduled. The other implication is that the two LOADs from iteration #(k+1) are also in L to begin with:
L = {MULk, LOADk+1 R1, LOADk+1 R3}
where subscripts have been used to identify the iterations. The SUBL and BNZ can also be in this list, but we will add them later, since these two instructions should occur at the end of the loop body.
The software interlocking algorithm is now used, but if an instruction was already generated in R, it is not added to L again (from a different iteration).
170
Software Pipelining: Another Software Interlocking Algorithm (contd.)
The final schedule produced is:
R = { MULk, LOADk+1 R1, LOADk+1 R3, ADDLk+1, ADDk, SUBLk, BNZk }
This modified loop - the software pipelined loop, as mentioned earlier, takes only 7 cycles per iteration:
MULk  LOADk+1 R1  LOADk+1 R3  ADDLk+1  ADDk  SUBLk  BNZk
Note that in the modified loop, we have spliced instructions from two consecutive iterations. Only one instance of each instruction in the original loop is included in the modified loop body.
Example:
If the latency of the MUL is 8 cycles instead of the 4 cycles assumed, the software pipelined code produced above does not produce the best performance.
In this particular case, the original loop body has to be unrolled once to get a loop body consisting of instructions from two consecutive iterations - this requires the use of more registers to support software pipelining:
171
Software Pipelining: Another Software Interlocking Algorithm (contd.)
loop: LOAD R1, R2, R6     /* load A[i,k] into R1: R1 <- A[i,k] */
      LOAD R3, R2, R7     /* R3 <- B[k,j], B is stored transposed */
      MUL  R4, R1, R3     /* form A[i,k] * B[k,j] */
      ADD  R5, R4, R5     /* C[i,j] in R5; C[i,j] += A[i,k] * B[k,j] */
      ADDL R2, R2, #4     /* increment k */
      LOAD R10, R2, R6    /* load A[i,k+1] into R10 */
      LOAD R11, R2, R7    /* R11 <- B[k+1,j] */
      MUL  R12, R10, R11  /* form A[i,k+1] * B[k+1,j] */
      ADD  R5, R12, R5    /* C[i,j] += A[i,k+1] * B[k+1,j] */
      ADDL R2, R2, #4     /* increment k */
      SUBL R9, R9, #1     /* decrement inner loop counter */
      BNZ  loop
The second set of instructions starting with the LOAD R10 could have reused the data registers from the earlier instructions in this case. New register allocations are used only to support software pipelining.
The software pipelined version of this loop can be obtained using the same technique described earlier. The modified loop body is:
loop: MUL  R4, R1, R3
      MUL  R12, R10, R11
      LOAD R1, R2, R6
      LOAD R3, R2, R7
      ADDL R2, R2, #4
      LOAD R10, R2, R6
      LOAD R11, R2, R7
      ADDL R2, R2, #4
      ADD  R5, R4, R5
      ADD  R5, R12, R5
      SUBL R9, R9, #1
      BNZ  loop

Again, the software pipelined version does better.
172
Using Multithreading to Avoid Interlocking Delays
Basic idea: Instructions for multiple threads are fetched into indepen- dent instruction buffers and injected into function pipelines. As long as two instructions from the same thread are not present simultaneously in the pipeline, data dependencies within a single thread do not cause de- lays.
Basic hardware setup is as shown:
[Figure: the instruction fetching mechanism fetches instructions of threads 1 through t from the memory system into t instruction buffers (I1,1, I1,2, ... for thread 1; I2,1, I2,2, ... for thread 2; ...; It,1, It,2, ... for thread t); a multiplexer selects an instruction from one of the buffers and injects it into a function unit with k stages; the register file is partitioned into t sets, one per thread.]
Interlocked instructions from the same thread are separated by instructions from the other threads: in the above example, as long as k < t, two instructions from the same thread cannot be in the FU at the same time.
173
Using Multithreading to Avoid Interlocking Delays (contd.)
Each thread has its own register set; a separate PC is also maintained for each thread in the fetch unit.
The thread id or some equivalent information flows into the FU with the thread instruction: this is used to select the appropriate register set for operand reads and result writes.
Particularly suitable for avoiding relatively large load latencies.
The arrangement shown also avoids branching penalty: the instruction
following a branch does not enter the pipeline till the branch is resolved.
Instructions may be entered into the pipeline in two ways:
Synchronously/cyclically (instruction level multiprogramming) from each thread; consequently, the first k instructions injected into the pipeline are: I1,1, I2,1, I3,1, ..., Ik,1. Most implementations of multithreading to this date have used this approach.
Asynchronously: instructions are entered as and when needed from the threads. For example, instructions may be dispatched from one thread continuously till a load or branch instruction is included, when instruction dispatching switches to the next thread.
The processing time per thread goes up, but the overall system/pipeline utilization goes up.
Used in the Denelcor HEP, Tera system’s multithreaded CPU, the I/O processor of the CDC 6600 (“barrel”) and the IBM Space Shuttle I/O processor.
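A Python sketch of the synchronous (cyclic) injection policy (the buffer contents and sizes below are assumptions): with t threads and a k-stage FU, round-robin injection keeps two instructions of the same thread out of the FU at the same time as long as k < t.

from collections import deque

buffers = [deque([f"I{t+1},{i+1}" for i in range(3)]) for t in range(4)]  # t = 4 threads
K = 3                                        # FU depth (k stages), k < t here

def round_robin_issue(buffers, cycles):
    pipe, issued, t = deque([None] * K), [], 0
    for _ in range(cycles):
        nxt = buffers[t].popleft() if buffers[t] else None
        pipe.appendleft(nxt)                 # inject into stage 1 of the FU
        pipe.pop()                           # instruction leaving the last stage
        issued.append(nxt)
        t = (t + 1) % len(buffers)           # cyclic selection of the next thread
    return issued

print(round_robin_issue(buffers, 8))
# ['I1,1', 'I2,1', 'I3,1', 'I4,1', 'I1,2', 'I2,2', 'I3,2', 'I4,2']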
174
Back to Dynamic Scheduling: The LOAD/STORE Queue
To comply with the sequential execution model, updates to memory must be made in program order. A FIFO queue, called the load store queue (LSQ), is used to enforce this ordering. Other names for the LSQ: Memory Order Buffer, Memory Queue.
In addition to the IQ entry, an additional entry is made at the tail of the LSQ at the time of dispatching a LOAD or STORE instruction: thus, a free entry at the tail of the LSQ is another resource that is needed for dispatching a LOAD or STORE instruction.
The LSQ has head and tail pointers and the pointer values are also used to determine if an entry is free at the tail of the LSQ.
The IQ entry for a LOAD or STORE is used to calculate the memory address targeted by the LOAD or STORE. The entry in the LSQ is used to perform the memory operation (via the cache or caches), when the entry has all relevant fields valid and when the entry is at the head of the LSQ.
Recall from the example shown on Page 90, that memory address comparators are associated with a queue like this.
The format of an LSQ entry is as follows:
- memory address, and a bit indicating if the memory address field is valid
- a bit indicating if the entry is for a LOAD or a STORE
- a status bit indicating if this LSQ entry is allocated or free
- for a LOAD: the address of the destination register
- for a STORE: the value to be STOREd (src 1), its data ready bit, and the tag used to pick up this value through forwarding
175
The LSQ and Dynamic Instruction Scheduling (contd.)
A LSQ entry has the following fields:
A bit that indicates if the entry is for a load or store instruction.
A field to hold a computed memory address.
A field to hold the destination physical register address (for a LOAD) and a bit that indicates if this field holds a valid address.
A field to hold the data to be stored (for STORE instructions), a bit that indicates if this field holds valid data, and a tag field for picking up this data through forwarding. Forwarding logic is associated with this field to pick up the forwarded value, as in the case of the IQ.
With the LSQ in place, the issue queue (IQ) entry for a LOAD is shown below:
[IQ entry for a LOAD: status bit (indicates if this IQ entry is allocated or free), load/store FU id, “other fields”, literal operand if any, src1 field (src1 tag, src1 value, src1 ready bit), src2 field (src2 tag, src2 value, src2 ready bit), and the LSQ index.]

Once the IQ entry for the LOAD or STORE issues, the targeted memory address is computed and written directly into the memory address field of the LSQ entry. The LSQ entry is directly addressed in the LSQ using the LSQ index stored in the IQ entry.

The forwarding bus also runs under the LSQ entries and is used to forward the value of a register to be STOREd.
176
The LSQ and Dynamic Instruction Scheduling (contd.)
The dispatching step for the instruction LOAD Rj, Rk, literal is as follows:
1. Stall unless ALL of the resources needed for a dispatch are available. The resources needed are a free entry in the IQ (or a free RSE), a free physical register for the destination, and a free entry at the tail of the LSQ.
2. The address, Pr say, of the physical register currently acting as the stand-in for the architectural register Rk is read out from the rename table.
3. The free physical register pulled out from the free list in step 1, say Pq, is recorded as the new stand-in for Rj in the rename table. Pq is also marked as allocated. The selected IQ entry is also marked as allocated.
4. The IQ and LSQ entries are set up appropriately. If Pr contains valid data, its value is also copied into the appropriate field of the IQ entry. The index of the LSQ entry is also written into the appropriate field in the IQ entry for the LOAD. The “FU Type” field in the IQ entry is set to that of the integer adder.
Again, note that the IQ entry for a LOAD is used only to compute the memory address and the computed address is directly written into the memory address field of the LSQ entry, using the contents of the LSQ index field stored in the IQ entry. The LSQ entry is used to perform the memory access (via the cache(s)) when the entry moves to the head of the LSQ and when all pertinent fields of the LSQ are valid.
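A Python sketch of this LOAD dispatch/issue bookkeeping (the dictionaries and field names are simplified assumptions): the IQ entry only computes the address, and it carries the LSQ index so the result can be written straight into the LSQ entry made at the tail.

IQ, LSQ = [], []                  # simplified: Python lists stand in for the queues

def dispatch_load(dest_phys_reg, base_value_or_tag):
    lsq_index = len(LSQ)          # a new entry is made at the tail of the LSQ
    LSQ.append({"is_load": True, "addr": None, "addr_valid": False,
                "dest": dest_phys_reg})
    IQ.append({"fu_type": "IntFU", "src1": base_value_or_tag,
               "lsq_index": lsq_index})      # IQ entry only computes the address
    return lsq_index

def issue_address(iq_entry, computed_addr):
    # The computed address is written directly into the indexed LSQ entry
    LSQ[iq_entry["lsq_index"]]["addr"] = computed_addr
    LSQ[iq_entry["lsq_index"]]["addr_valid"] = True

idx = dispatch_load("P7", 200)
issue_address(IQ[0], 200 + 8)
print(LSQ[idx])   # the memory access happens once this entry reaches the LSQ head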
177
Dispatch and Issue Components for a Centralized IQ
The “FU type” field in the IQ entry uses one-hot encoding - that is, it has one bit corresponding to each FU in the pipeline, and only one bit in this field is set at any time within a valid IQ entry.
Wakeup logic for each IQ entry - a multi-input AND gate that logically ANDs the src valid bits and the IQ entry valid bit. The output of this gate is directed to one of the “request_issue” inputs of the selection logic for the FU needed, based on the one-hot encoded id of that FU.
Each FU in the pipeline has a selection logic. The selection logic has a number of inputs (“request_issue”), one per IQ entry. The signal on each input indicates if a request to use the FU is present from an awakened IQ entry. If simultaneous requests are made in the same cycle for the FU, only one of them is selected.
The selection logic for a FU uses the index of the selected IQ entry to read it out of the IQ for issue.
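A Python sketch of the wakeup and select steps just described (the IQ representation below is an assumption): wakeup ANDs the valid bits, the one-hot FU-type field steers the request, and each FU's select logic grants one requester.

# Each IQ entry: entry valid bit, per-source ready bits, and the FU it needs
iq = [{"valid": 1, "src_ready": [1, 1], "fu": "IntFU"},
      {"valid": 1, "src_ready": [1, 0], "fu": "IntFU"},   # still waiting on src2
      {"valid": 1, "src_ready": [1, 1], "fu": "FPAdd"}]

def wakeup(entry):
    # AND of the IQ-entry valid bit and all of the source valid bits
    return entry["valid"] and all(entry["src_ready"])

def select(iq, fu):
    # request_issue inputs for this FU's selection logic; grant the lowest index
    requests = [i for i, e in enumerate(iq) if wakeup(e) and e["fu"] == fu]
    return requests[0] if requests else None

print(select(iq, "IntFU"), select(iq, "FPAdd"))   # 0 2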
178
Handling Interrupts in Out-of-Order Execution Pipelines
General Overview Of Interrupt Processing:
Interrupts are asynchronous events that are triggered during the normal course of processing by external events (such as I/O completion, reset) or internal events (such as overflow, page faults etc.).
- While many interrupts caused by external events can be momentarily ignored, interrupts caused by internal events have to be generally serviced promptly in order to continue with further processing.
- Interrupt processing involves saving the processor state and transferring control to an appropriate routine to service the interrupt. This control transfer is also accompanied by a switch to the privileged mode (“kernel mode”).
- The interrupt service may result in a context switch to another task (as in servicing a page fault interrupt) or in resuming with the interrupted task. In any case, the processing of the interrupted task has to be resumed with the processor in the user mode.
- Resuming the processing of an interrupted task involves the restoration of the processor state.
179
Interrupts in Pipelined CPUs: The Problem
Out-of-order completions are a consequence of many of the artifacts (such as multiple FUs, dynamic scheduling etc.) used for speeding up instruction pipelines. It is thus possible for completing instructions to update architectural registers out of program order.
- Consequently, if an instruction I causes an interrupt, it is possible that instructions occurring after I in program order (that were not dependent on I) may have already updated the architectural registers:
- When program execution has to resume by restarting I, the instructions that occur after I and which already updated their destination register when the interrupt was recognized should not be reprocessed
- One way to do this would be to save information identifying the instructions that occur after I in program order and that had already updated their destinations: maintaining this information is not as easy as it seems; even if this information is available, it is not a simple matter to make sure that these instructions are not processed again when I is reprocessed after servicing the interrupt
- Requires the saving of extensive “micro-state” info (scoreboard state, registers, pipeline latches etc.)
180
Interrupts in Pipelined CPUs: The Problem
A relatively easier approach will be to implement what is called a precise state, which is a processor state obtained by updating the architectural registers strictly in program order. A precise state corresponds to what will be obtained if the instructions are processed strictly one at a time, without any overlap - corresponding exactly to the sequential execution model. A precise state also has an associated instruction address A such that:
(i) All instructions occurring in program order before the instruction at address A have completed without any problems (or these problems were already taken care of).
(ii) Neither the instruction at address A nor any instruction occurring after it in program order has updated any architectural register.
- the effect of instruction processing is atomic
If the precise state at the point of transferring control to the interrupt handling routine is saved, then resuming processing after the interrupt consists simply of:
a) Setting the processor state to the saved precise state.
b) Transferring control to the instruction at the address A associated with the precise state.
181
Interrupts in Pipelined CPUs: Some Terminology
Architectural registers - registers visible at the ISA level, including flag registers.
Architectural state - same as precise state.
Instruction retirement - the step of writing the result of an instruction that has not caused an exception to update the precise state. Sometimes also called instruction commitment.
- This is the process of updating the architectural state
Precise Interrupts: an interrupt (and its handling) mechanism relying on the resumption from a precise state which is maintained by the hardware or reconstructed by the software.
182
Handling Interrupts in Pipeline - Overview
Possible Techniques:
Disallow certain types of interrupts and/or allow interrupts to result in imprecise state: many early pipelines did this.
Use explicit instructions and compilation techniques to reconstruct a precise state from which processing continues after interrupt handling
- Hardware facilities (like shadow registers or “barrier” instructions) are usually provided to let the software reconstruct a precise state after control is transferred to the system-level handler.
Design hardware to ensure that a precise state is always maintained:
(i) Avoid out-of-order completions even with multiple FUs present: this requires FUs to have identical latencies and updates to be in program order. Delay stages can be added to equalize the FU latencies.
Examples: early Pentium implementations, most Motorola 680X0 implementations, Intel Atom.
Variation: make sure by a certain stage within the FU that the instruction cannot generate an interrupt
183
Handling Interrupts in Pipeline - Overview (continued)
(ii) Allow out of order completions but employ additional facilities to implement a precise state:
(a) Use reorder buffers: most modern CPUs, especially ones that use register renaming, rely on this mechanism:
- Entries set up in a FIFO queue called the reorder buffer (ROB) in program order at the time of dispatching.
- Results of instructions completing out-of-order are written to the corresponding ROB entries as they complete.
- The contents of the ROB are then used to update the architectural (=precise) state in program order.
(b) Use history buffers: Here updates to the architectural registers may be made as soon as instructions complete - in particular, out-of-order.
- Enough “history” information about the old contents of destinations updated out-of-order is saved (in a history buffer or in shadow registers) to revert to a precise state should an interrupt occur.
- A few microprocessors, such as the Motorola 88K used shadow registers to checkpoint registers updated out of order.
184
Handling Interrupts in Pipeline - Overview (continued)
(c) Use future (register) files: Here a complete duplicate of the register set is maintained.
- The primary register file is updated as instructions complete while the other register file is updated in program order.
- On an interrupt, the precise state is obtained by swapping these two register files.
- This technique was implemented in some early Cray pipelined supercomputers (the Cray X-MP). Some AMD processors use this technique as well.
Note the need to restore the rename table and other structures (such as free lists, waiting lists etc.) in all of the techniques for implementing precise states for processors that use register renaming.
185
Reorder Buffers
The most dominant scheme for implementing precise interrupts today
Fits well with the use of register renaming
We will illustrate how a reorder buffer (ROB) mechanism works for APEX with register renaming
The ROB used is a circular FIFO. There is a head and tail pointer for this queue.
There is no physical register file: the slots within the ROB serve as physical registers
There is, however, a separate register file, ARF, which is updated in program order. The ARF has a register for every architectural register.
- The precise state is defined by the contents of these registers and additional information.
- ARF = architectural register file
The rename table is modified to indicate if the current stand-in for an input architectural register is a register within the ARF (src_bit = 0) or a slot within the ROB (src_bit = 1).
Example: the rename table (RNT) entry for Rk holds (src_bit = 0, ar/slot_id = p), so the alias for Rk is the p-th register in the ARF; the entry for Rm holds (src_bit = 1, ar/slot_id = q), so the alias for Rm is the q-th slot in the ROB.
186
Reorder Buffers (contd.)
Datapath components:
[Figure: the Fetch and Decode/Rename 1 stages feed a Rename 2/Dispatch stage, which sets up entries in the reservation stations (in front of the IntFU, Load FU, Store FU, FP Add FU and FP Mul/Div FU) and in the reorder buffer (ROB); the retirement logic at the head of the ROB updates the architectural register file (ARF, int + FP registers). Each reservation station has slots to hold the contents of an input register or a literal value and a slot to hold the address of the destination register; operands are supplied over the InBus, and results return on the result buses. The allocated/free bit of a reservation station, the valid bits of its input operand slots, and the branch handling mechanisms are NOT shown.]
187
Reorder Buffers (contd.)
An entry is made for an instruction in the ROB in program order at the time of dispatching the instruction.
- Each entry in this FIFO consists of the following fields:
Instruction address (PC_value)
Address of destination architectural register (ar_address), if any
Result value (result): result of reg-to-reg or calculated memory address
Value of register to be stored to memory, used by STORE, (svalue) and its valid bit (sval_valid)
Exception codes (excodes)
Status bit (indicating if the result/address is valid) (status)
Instruction type (reg-to-reg, load, store, branch etc.) - used at commitment to interpret the format of the entry and carry out actions specific to instruction type (itype)
- We will use the notation ROB[i] to refer to the i-th entry of the ROB.
188
Reorder Buffers (contd.)
The structure of the ROB:
[Figure: the ROB drawn as a circular buffer; ROB.head and ROB.tail are the moving pointers, the slots between them are marked e, w or c, and the remaining slots are marked u.]
e: Instruction assigned to this slot is in execution
w: Instruction assigned to this slot is waiting for input(s) c: Instruction assigned to this slot has completed
u: This slot is unallocated
The ROB is a circular FIFO, maintained in the order of instruction dis- patch – that is, in program order.
Entries are made at the tail of the ROB at the time of dispatch – this is in program order
Instructions that completed without any exception are retired from the head of the ROB – this is also in program order. The entry at the head of the ROB at any point is the earliest instruction (in program order) at that point that is yet to be retired.
Note that we do not need any special flag to indicate if a ROB entry is free or allocated. We can deduce whether an entry is free or allocated from its relative position from the head and tail pointers.
189
Reorder Buffers (contd.)
We now describe how this ROB is used to implement precise interrupts
in APEX by describing ONLY THE STEPS THAT GET AFFECTED. Steps necessary for processing an instruction of the form Rj ← Rk op Rl:
Decode/Rename1/Rename2/Dispatch:
A dispatch takes place only if a free slot is available in the ROB and if a VFU of the required type is available. (The lack of a free slot is indicated by an identical value for the head and tail pointers.)
Note where sources (latest values for Rk and Rl) are available from:
– if Rename_Table[src_ar].src_bit = 0, valid data for src is available from the ARF directly, otherwise, it is available from the ROB slot indicated in the rename table entry for the src_ar. (src_ar = k and l).
The ROB slot for the destination is marked as allocated, other fields are initialized, the rename table entry for the destination is updated and the ROB tail pointer is updated:
ROB[ROB.tail].status = invalid;
ROB[ROB.tail].itype = r2r; /* indicator for the type of the dispatched instruction: register-to-register */
ROB[ROB.tail].PC_value = address of dispatched instruction;
ROB[ROB.tail].ar_address = j (address of destination ar);
ROB[ROB.tail].excodes and result are left as they are;
Rename_Table[j].src_bit = 1 (latest value of Rj will be generated within the ROB entry just set up);
Rename_Table[j].ar/slot_id = ROB.tail;
ROB.tail++;
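The same dispatch-time updates can be rendered in C as the sketch below. Everything here is illustrative: the structure layouts, the use of an explicit occupancy count (instead of comparing the head and tail pointers) to make the full test unambiguous, and the vfu_free flag standing in for the VFU availability check are all assumptions.

    #include <stdbool.h>

    #define ROB_SIZE 64                      /* illustrative size */
    #define NUM_ARCH 32                      /* architectural registers */

    typedef struct {                         /* only the fields touched at dispatch */
        unsigned pc_value, ar_address, itype;
        bool     status;
    } rob_entry_t;

    typedef struct { unsigned ar_or_slot; bool src_bit; } rnt_entry_t;

    enum { R2R = 0 };

    rob_entry_t ROB[ROB_SIZE];
    rnt_entry_t Rename_Table[NUM_ARCH];
    unsigned    rob_tail = 0, rob_count = 0;

    /* Dispatch-time ROB and rename-table updates for "Rj <- Rk op Rl".
       Returns false (i.e., stall) when the ROB is full or no VFU is free. */
    bool dispatch_r2r(unsigned pc, unsigned j, bool vfu_free)
    {
        if (rob_count == ROB_SIZE || !vfu_free)
            return false;                        /* no free ROB slot or no free VFU */

        rob_entry_t *e = &ROB[rob_tail];
        e->status     = false;                   /* result not produced yet */
        e->itype      = R2R;
        e->pc_value   = pc;
        e->ar_address = j;                       /* destination architectural register */
        /* excodes and result fields are left as they are, as in the notes */

        Rename_Table[j].src_bit    = true;       /* latest Rj now lives in the ROB... */
        Rename_Table[j].ar_or_slot = rob_tail;   /* ...in the slot just allocated */

        rob_tail = (rob_tail + 1) % ROB_SIZE;    /* ROB.tail++ with wrap-around */
        rob_count++;
        return true;
    }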
190
Reorder Buffers (contd.)
Data is read out for any input that is a register within the ARF (src_bit = 0 in the rename table entry). ARF registers are always valid. If the stand-in for an input register is a ROB slot, then the input operand is read out only if the status bit of that ROB slot is valid.
– It is assumed that the ARF is read during the early part of a clock cycle while the ARF is updated (by the retire logic) during the later part of a clock.
Any forwarding to an instruction takes place as before, on the basis of the ROB slot number (which is the analog of a physical register address).
The destination slot of the selected VFU is written with the index of the ROB slot selected for the instruction.
– Other steps are unchanged.
Instruction Completion:
When an instruction completes, it simply writes its result and any exception code that it produced into the ROB slot assigned for its destination.
Forwarding to waiting VFU slots and to an instruction within the Rename2/Dispatch stage takes place as before.
191
Reorder Buffers (contd.)
The ROB slot for the result is updated as follows:
ROB[dest_slot_id].result = result produced by the VFU;
ROB[dest_slot_id].excodes = exception conditions produced by the instruction during its execution (can be all 0s, when no exceptions have occurred);
ROB[dest_slot_id].status = valid;
– Unlike the implementation described earlier, the ROB entry for the destination is not deallocated: this is left to the logic implementing the retirement of completed instructions.
Instruction Retirement:
For a register-to-register instruction (as inferred from the itype field, with a value of r2r), the instruction retirement unit uses the head pointer to write valid results from the head of the ROB into the ARF in program order only if no exceptions were produced during its execution.
– Only one result is written per cycle, during the later part of a clock. (More than one entry at the head of the ROB can also be retired in order during a single cycle.)
No writes take place into the ARF if:
(i) ROB[head].status is invalid, OR
(ii) ROB[head].excodes indicates that an exception code was generated.
192
Reorder Buffers (contd.)
– In case (i), the retirement unit stalls until the status of the entry at the head becomes valid.
– In case (ii), the pipeline is interrupted. The ARF gives a precise state whose associated instruction address is in the PC_value field of ROB[head].
If neither of these conditions holds, the retirement unit simply implements the following steps:
ARF[ROB[ROB.head].ar_address] = ROB[ROB.head].result;
if ( (Rename_Table[ROB[ROB.head].ar_address].ar/slot_id == ROB.head)
     & (Rename_Table[ROB[ROB.head].ar_address].src_bit == 1) ) then {
    Rename_Table[ROB[ROB.head].ar_address].ar/slot_id = ROB[ROB.head].ar_address;
    Rename_Table[ROB[ROB.head].ar_address].src_bit = 0 };
ROB.head++
– What the if-then does is to check if the Rename Table entry for the destination architectural register was set up by the instruction being retired – in other words, if the entry being retired corresponds to the most recent instance of the corresponding architectural register. If this is the case, the Rename Table entry is updated to point to the ar to which the result is being committed.
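A C sketch of this retirement step is given below; as before, the data structures and the explicit occupancy count are illustrative assumptions rather than a prescribed implementation.

    #include <stdbool.h>

    #define ROB_SIZE 64
    #define NUM_ARCH 32

    typedef struct {
        unsigned ar_address, result, excodes;
        bool     status;
    } rob_entry_t;

    typedef struct { unsigned ar_or_slot; bool src_bit; } rnt_entry_t;

    rob_entry_t ROB[ROB_SIZE];
    rnt_entry_t Rename_Table[NUM_ARCH];
    unsigned    ARF[NUM_ARCH];
    unsigned    rob_head = 0, rob_count = 0;

    /* Retire the register-to-register instruction at the head of the ROB.
       Returns 0 on a stall (head not ready), 1 on a normal commit, and
       -1 when an exception must be raised with the precise state in the ARF. */
    int retire_one(void)
    {
        if (rob_count == 0 || !ROB[rob_head].status)
            return 0;                          /* case (i): wait for completion */
        if (ROB[rob_head].excodes != 0)
            return -1;                         /* case (ii): interrupt the pipeline */

        unsigned ar = ROB[rob_head].ar_address;
        ARF[ar] = ROB[rob_head].result;        /* commit in program order */

        /* Reset the rename-table entry only if it still points at this ROB slot,
           i.e., this is the most recent instance of the architectural register. */
        if (Rename_Table[ar].src_bit && Rename_Table[ar].ar_or_slot == rob_head) {
            Rename_Table[ar].ar_or_slot = ar;
            Rename_Table[ar].src_bit    = false;
        }

        rob_head = (rob_head + 1) % ROB_SIZE;  /* ROB.head++ with wrap-around */
        rob_count--;
        return 1;
    }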
NOTE: In this design, we assume that we have an ISA that does not use a condition code register. If the ISA uses condition code registers, the ”excodes” field in a ROB entry can be expanded to hold the condition codes AND the exception codes.
193
Reorder Buffers (contd.)
Note the need to have the following ports to the register file implementing the Rename Table:
– read ports for reading the entries for the source architectural registers
– write port to update the rename table with the destination of the
dispatched instruction
– read/write port to check if the rename table entry has to be updated
and the write for the update (if-then on the last page)
We assume that within a cycle, these ports are used in the order they are listed. The potential bottleneck implied by the need to support two back-to-back rename table read-write sequences within a single cycle can be avoided by using latches and comparators in a forwarding-like mechanism. (See Problem XX).
Handling STORE instructions:
This is quite similar to register-to-register instructions. The "result" field will hold the effective memory address of a STORE and the "svalue" field will hold the value to be stored.
Valid bits of the “result” and “svalue” are appropriately initialized at the time of dispatch and set when the values of these fields are written. Other fields are set appropriately at the time of dispatch and “itype” is set to the value “Store”.
The actual write to the memory takes place when the ROB entry for the STORE gets processed in the normal course. At this time the TLB is probed to determine that the STORE will not cause a page fault. If the page to which the write is to take place is not in the RAM, the STORE instruction is considered to have triggered a page fault (which is then handled as stated earlier).
194
Reorder Buffers (contd.)
Until the write to memory takes place, the store data sits in a write buffer; LOADs targeting the same memory location as a pending write (i.e., a STORE) get data forwarded to them (as described earlier) from the write buffer.
Handling LOAD instructions:
These are also processed as register-to-register instructions. The LOAD VFUs perform the memory accesses and deposit the fetched data into the appropriate ROB slot. Various fields are initialized and set appropriately.
The memory data gets written to the ARF as the retirement logic processes the corresponding ROB entry, provided the LOAD did not generate any exception.
A page fault occurring during the execution of the LOAD by the VFU is stored as an exception code. Page faults for LOADs are not recognized till the retirement logic processes the corresponding ROB entry.
To allow the routine processing of a LOAD, the LOAD VFU sets the valid bit of the ROB entry as soon as it detects a page fault. Other than this all other phases of completion are skipped.
195
Reorder Buffers (contd.)
Handling other exceptions and other page faults:
Page faults triggered during instruction fetching, during the memory write of a STORE or by external sources are handled as follows:
– Page faults triggered by instruction fetch: further fetches are suspended and the pipeline is allowed to drain. If the retirement logic finds that a prior instruction generated an exception, the precise state is installed as before and the exception is serviced. If processing is to continue, the pending page fault induced by instruction fetching is also serviced.
Other page faults: these are treated similarly as fetch-induced page faults.
In modern CPUs, page faults are discovered as the TLB miss handler is executed – this is a software routine consisting typically of at least a few tens of instructions.
– No hardware exception is generated – control flows to a TLB miss handler, treating the TLB miss like a trap. The software routine handling the TLB miss discovers the page fault and transfers control to the handler. (TLB miss handlers typically have their own dedicated registers.) If the TLB miss handler determines that the page is missing, a page fault interrupt is generated.
– By this time, typically, all instructions prior to the one that triggered the page fault have been processed – even if they caused exceptions. These exceptions, if any, and the page fault interrupt triggered by the TLB miss handler are treated in the usual way (i.e., processed in program order using the ROB).
196
Reorder Buffers (contd.)
Real Examples:
Historical: Intel Pentium Pro (P6) – 40 entry ROB, AMD K5 – 16 entry ROB, HP PA8000 – 56 entry ROB, occupied 15% of die area
More recent:
Intel CPUs: Haswell – 182 entries, Skylake – 224 entries, Sunny Cove – 352 entries
AMD CPUs: Zen – 192 entries, Zen 2 – 224 entries
Compacting ROB entries:
Instruction addresses are consecutive between taken branches, so there is no need to store the full PC value for each of the consecutive instructions that follow the one beginning the consecutive sequence
Store offsets for the consecutive instructions if instructions differ in size
If instructions all have the same size, say 32-bits, the offsets are implicit (4 bytes). A single bit is needed in this case to mark which ROB entry contains the full PC value.
In contemporary machines, a branch prediction mechanism is used to change the instruction fetch path following the branch at the time the branch instruction is fetched. It is thus possible to know at the time of setting up the ROB entry of the branch whether the sequence of consecutive instructions will be broken or not. Put in other words, the predicted direction of the branch instruction determines if the full PC is to be stored in the ROB entry of the branch.
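The sketch below illustrates the fixed-size case: only entries that begin a consecutive sequence carry a full PC, and the PC of any other entry is recovered by walking back to the nearest such entry. The structure and field names are assumptions made for illustration, and at least one older entry is assumed to carry a full PC.

    #include <stdint.h>
    #include <stdbool.h>

    #define ROB_SIZE 64

    typedef struct {
        uint64_t pc;            /* meaningful only when has_full_pc is set  */
        bool     has_full_pc;   /* entry that starts a consecutive sequence */
    } rob_pc_entry_t;

    rob_pc_entry_t ROB[ROB_SIZE];

    /* Recover the full PC of the instruction in ROB slot 'idx', assuming fixed
       4-byte instructions: walk back to the nearest entry that holds a full PC
       and add 4 bytes per intervening entry. */
    uint64_t rob_full_pc(unsigned idx)
    {
        unsigned dist = 0;
        unsigned i = idx;
        while (!ROB[i].has_full_pc) {
            i = (i + ROB_SIZE - 1) % ROB_SIZE;   /* previous entry, with wrap-around */
            dist++;
        }
        return ROB[i].pc + 4ull * dist;
    }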
197
Aside: Implementing the ROB
Uses a multiported register file with 2^m entries:
A write port serves as the portal to access the tail of the ROB. An m-bit counter is used to address this port, that is, the entry at the tail. This m-bit counter serves as the tail pointer.
– To insert an entry, the entry is assembled in an external latch and written into the register file through this write port at the location addressed by the counter.
– This counter is incremented after adding an entry at the tail of the ROB. Note that this counter wraps around automatically to all zeroes when it is at its maximum value and then incremented.
A read port serves as the portal for accessing the head of the ROB. An m-bit counter is used to address this port, that is, the entry at the head of the ROB. This counter serves as the head pointer.
– To commit an entry, the read port is accessed and the location addressed by the associated m-bit counter is read out into a latch.
– The status bit indicates if the entry can be committed. If the entry is not ready for commitment, the reading process is repeated in the next cycle.
– If the entry read out into the latch can be committed, the counter is incremented. Again note the automatic wrap-around of the pointer value.
A separate set of read and write ports on this register file is used to read out input operand values and write result and the excodes fields, respectively.
The number of entries in most real ROBs is not a power of 2
Many machines have ROB timeout hardware, to detect the blocking of the head due to memory or memory-mapped device errors
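A minimal sketch of the pointer advance is shown below; the entry count of 224 is only an example of a non-power-of-2 size, and the power-of-2 shortcut is shown in a comment.

    /* Pointer advance for a circular ROB.  With 2^m entries an m-bit counter
       wraps to zero by itself; for the non-power-of-2 sizes used by real ROBs
       the wrap has to be made explicit. */

    #define ROB_ENTRIES 224u      /* illustrative, not a power of two */

    static inline unsigned rob_advance(unsigned ptr)
    {
        ptr++;
        if (ptr == ROB_ENTRIES)   /* explicit wrap-around */
            ptr = 0;
        return ptr;
    }

    /* For a power-of-2 size the same effect is a single AND:
       ptr = (ptr + 1) & (ROB_ENTRIES - 1);   -- valid only when ROB_ENTRIES = 2^m */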
198
Constructing the Precise State in Software
Most modern hardware schemes for precise interrupts maintain a precise state before the interrupt handler takes over.
In some cases, the precise state is reconstructed by the interrupt handler. Some hardware support is necessary for this – this is in the form of checkpointing.
– Checkpoints are made using shadow registers: the dispatch logic saves the current value of a destination register that will be written out of order into a shadow register
[Figure: the shadow register array – each entry holds a register id and the saved value.]
– Only a finite number of registers are checkpointed
– Shadow registers are aliased to memory locations
– The shadowing mechanism is turned off when the interrupt handler is constructing the precise state – if this is not the case, the checkpointed values will get clobbered while the precise state is being constructed!
– Similar schemes were used in many early RISC CPUs (Motorola 88K, MIPS R2000 etc.)
Shadow registers can be implemented as a stack in many machines that incorporate speculative execution: previous contents of destination registers updated speculatively are saved on a shadow register stack.
The history buffer mechanism is a full-blown version of shadowing: it checkpoints all registers that are written – in order or out of order.
More sophisticated forms of hardware checkpointing are possible but real processors rarely rely on such mechanisms.
199
Ensuring Precise States Using Barrier Instructions
Basic Idea: Instructions that can potentially generate an exception are allowed to complete before any following instruction can be allowed to update the architectural register.
Often, a special instruction has to be inserted by the compiler to defer the update:
Example:
The TRAPB (“trap barrier”) instruction in the DEC Alpha ISA
Consider the following code
:
FMUL F1, F5, F0
FADD F2, F1, F4
TRAPB
STORE R1, R6, #0
:
– Here the TRAPB instruction ensures that the STORE does not update the memory till the FMUL and FADD complete without an exception.
The overhead of using a TRAPB instruction frequently can be avoided by coding sections of code in the single assignment style, with TRAPB instructions in-between such adjacent sections:
200
Ensuring Precise States Using Barrier Instructions (contd.)
– In the single assignment style of coding (reminiscent of a functional style of programming), each register is assigned at most once
– Single assignment ensures that if an exception occurs within a single- assignment coded section, execution can resume just after the TRAPB preceding this section.
– Effectively, each TRAPB implements a checkpoint.
This, of course, comes at a cost: more architectural registers are tied up.
Tradeoffs: among number of registers needed, number of TRAPBs used and amount of useful work undone on an exception:
– more TRAPBs used => less work undone, higher overhead, fewer registers needed
– fewer TRAPBs used => more work undone – an exception can be generated by the last instruction within a single-assignment coded section, in which case the results of all instructions after the TRAPB preceding this section have to be thrown away, etc.
201
Real Machines: Implementations of Register Renaming
Variation 1: Reorder buffer slots are used as physical registers and a separate architectural register file (ARF) is used (what we have just seen). The Intel P6 microarchitecture, implemented by the Pentium Pro, Pentium II and Pentium III, is an example of this design.
[Figure: Variation 1 datapath – the Fetch stages (F1, F2) and Decode/Dispatch stages (D1, D2) dispatch instructions into the issue queue (IQ) and the load-store queue (LSQ); instructions issue from the IQ to the function units FU 1 .. FU m, reading their physical registers (ROB slots) over dedicated connections; results return over the result/status forwarding buses to the ROB, and committed values are written into the architectural register file (ARF).]
The LSQ (load-store queue) shown is used to maintain the program order of load and store instructions. Memory operations have to appear to be performed in program order. This ordering can be relaxed only when the address of a later load in the LSQ does not match the address of any store in front of it in the LSQ (LATER)
A load or a store instruction is handled as follows:
– The instruction is dispatched to the issue queue (IQ) as described earlier. Simultaneously, an entry for the instruction is set up at the tail of the LSQ (which is a FIFO queue)
– After the effective address of the load or store instruction is computed, the effective address is inserted into the address field of the LSQ entry of the instruction. The IQ entry for the load or store instruction has the address of the LSQ entry of the load or store instruction.
– The value of the register to be stored can be read out (if it is ready) into the LSQ entry of the store at the time of dispatch, or it can be forwarded later to the LSQ entry.
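The address-matching rule described above can be sketched as follows. This is a simplified, sequential rendering with assumed structure and field names; it ignores partial overlaps and accesses of different widths, which a real LSQ must handle.

    #include <stdint.h>
    #include <stdbool.h>

    #define LSQ_SIZE 32

    typedef struct {
        bool     is_store;
        bool     addr_valid;     /* effective address has been computed   */
        bool     value_valid;    /* store value has arrived (stores only) */
        uint64_t addr;
        uint64_t value;
    } lsq_entry_t;

    lsq_entry_t LSQ[LSQ_SIZE];
    unsigned    lsq_head = 0;

    /* Check whether the load in slot 'load_idx' (whose address is already valid)
       may proceed, scanning the older LSQ entries from the head.  Result:
         +1  a matching older store can forward its value into *fwd_value
          0  the load may go to the D-cache (no older store conflicts)
         -1  the load must wait (unresolved older store address or value) */
    int lsq_check_load(unsigned load_idx, uint64_t *fwd_value)
    {
        int decision = 0;
        unsigned i = lsq_head;
        while (i != load_idx) {
            lsq_entry_t *e = &LSQ[i];
            if (e->is_store) {
                if (!e->addr_valid)
                    return -1;                    /* unknown address: cannot reorder */
                if (e->addr == LSQ[load_idx].addr) {
                    if (!e->value_valid)
                        decision = -1;            /* match, but value not ready yet  */
                    else {
                        *fwd_value = e->value;    /* youngest older match seen so far */
                        decision = 1;
                    }
                }
            }
            i = (i + 1) % LSQ_SIZE;
        }
        return decision;
    }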
202
Real Machines: Implementations of Register Renaming (continued)
Variation 2: A separate physical register file (PRF) and an architectural register file are used. The ROB entry of an instruction that has a destination register points to the physical register that is assigned to hold the result. In this case, the physical registers are called rename buffers. The IBM PowerPC 604 implements this design.
[Figure: Variation 2 datapath – the Fetch stages (F1, F2) and Decode/Dispatch stages (D1, D2) dispatch into the IQ and the LSQ; the function units FU 1 .. FU m read the physical register file (PRF) and write their results into it, while exception codes/status go to the ROB over the result/status forwarding buses; at commitment, values are copied from the PRF into the architectural register file (ARF).]
The physical registers can be allocated in a circular FIFO fashion (like ROB entries) or they can be allocated using a list or lists to keep track of free and allocated registers.
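A list-based allocator can be as simple as the stack-style free list sketched below; the register count and the function names are illustrative.

    #include <stdbool.h>

    #define NUM_PREGS 128        /* illustrative number of physical registers */

    /* A simple stack-based free list, one of the two allocation options above. */
    static unsigned free_list[NUM_PREGS];
    static unsigned free_top;               /* number of registers on the free list */

    void preg_init(void)
    {
        for (unsigned p = 0; p < NUM_PREGS; p++)
            free_list[p] = p;
        free_top = NUM_PREGS;
    }

    /* Allocate a physical register at dispatch; returns false (stall) if none is free. */
    bool preg_alloc(unsigned *preg)
    {
        if (free_top == 0)
            return false;
        *preg = free_list[--free_top];
        return true;
    }

    /* Return a physical register to the free list, e.g., when the instruction that
       overwrote the corresponding architectural register commits. */
    void preg_free(unsigned preg)
    {
        free_list[free_top++] = preg;
    }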
Variation 3: The physical and architectural registers are implemented within a common register file. The Intel Pentium 4 and the Alpha 21264 implementations use this design.
[Figure: Variation 3 datapath – the Fetch stages (F1, F2) and Decode/Dispatch stages (D1, D2) dispatch into the IQ and the LSQ; the function units FU 1 .. FU m read and write a common register file (RF) that holds both the architectural and the physical registers; results and exception codes/status are carried on the forwarding buses, and the ROB entries point into the common RF.]
203
Real Machines: Implementations of Register Renaming (continued)
The register alias table (RAT) points to the registers within the common RF that correspond to the most recent instances of the architectural registers.
A second table (the “retirement RAT”, in Intel’s terminology), similar to the register alias table points to architectural registers within the RF, i.e., to registers that hold committed values of the architectural registers.
The ROB entries point to registers within the common RF.
A non-FIFO, list-based allocation/deallocation is an absolute need for this scheme.
As register values are committed, no data movement is necessary.
204
Superscalar Processors: A First Look
Execution time of a program with N instructions is: texec = N * CPI * T
where CPI is the average number of clocks needed per instruction and T is the clock period.
Superscalar machines dispatch more than one instruction per cycle
Requires complicated fetch & dispatch logic
Requires complex logic to cope with dependencies
Effective CPI goes down, increasing throughput
CPI decrease somewhat defeated by relatively larger branching penalty
In the steady state, the completion rate is equal to the instruction dis- patch rate.
A (maximum) dispatch rate of one instruction per cycle – what we have in scalar pipelines – is thus a bottleneck
A dispatch rate of one instruction per cycle is often called the Flynn bottleneck, after Michael Flynn, who first pointed this out quantitatively in a landmark paper in the early 70s.
205
Superscalar Processors: A First Look (contd.)
Superscalar CPUs attempt to dispatch multiple instructions per cycle.
In a m-way superscalar CPU, a maximum of m instructions can be dispatched per cycle.
Ideally, this should lead to a CPI that is 1/m-th of that of a scalar CPU.
Typical values of m:
2 (early superscalar CPUs)
4 (most common today)
6 to 8 – some very aggressive designs.
Actual performance gains are less than what is predicted, particularly as m goes up.
Still need to maintain compliance with the sequential execution model, despite parallel dispatching.
Branching and dependencies take a relatively higher toll on performance.
Proposed in the early 70s; internal IBM projects during the 80s; first production CPUs (i960CA, i860, Risc/6000, DEC 21064) in the late 80s, early 90s.
Mainstream technology now!
206
Potential Challenges in Designing a m-way Superscalar Processor
Fetching m instructions per cycle for dispatch in a m-way superscalar CPU: this is complicated by the fact that the set of m instructions to be dispatched can cross memory and cache line boundaries.
Basic strategy: maximize the number of instructions that can be examined per cycle for dispatch
Resolving dependencies among the instructions being dispatched and the instructions that have been dispatched earlier and still remain active.
Issuing multiple instructions per cycle to free FUs when the input oper- ands become available – this is not different from what is done in scalar pipelines with multiple FUs and dynamic scheduling.
Retiring multiple instructions per cycle – this is again not different from what is done in scalar pipelines with multiple FUs and dynamic schedul- ing.
Coping with branching – this is a very serious problem in superscalar machines where a branch instruction may be encountered potentially in each consecutive group of m instructions that are being examined for dispatch.
Coping with load latencies, as clock rates get higher.
207
Superscalar CPUs: Ramifications on Architecture and Implementation
Wider Fetch, Decode, Dispatch Stages, Wider/more internal busses:
– Width(Fetch) ≥ Width(Decode) ≥ Number of Dispatches/Cycle
Function units likely to be pipelined + dispatch buffers/reservation stations needed to maintain an issue rate that balances the dispatch rate
Faster/complex memory interface, ROBs, register files
Wider paths to memory, wider caches, load bypassing, more ports on ROB, register file
Popular implementation in a m-way machine: in-order dispatching: in each cycle, at least m instructions are fetched and examined for dis- patch:
If resources/dependencies permit, dispatch all instructions in the group of instructions fetched
Simultaneously dispatch all instructions in this group in program order, stopping at the first instruction in program order within the group that cannot be dispatched
Note that following the instruction within the group that could not be dispatched, there may be others that could be dispatched
– This strategy of in-order dispatching is easy to implement but lowers the effective dispatch rate
Instruction aligning facility needed to have the ability to examine at least m instructions for dispatch, even if they are located in different cache lines
Aggressive branch handling techniques are needed
208
Dependency Checking in Superscalar CPUs
Consider an m-way superscalar CPU. In this case the dependencies to be resolved are:
1. Dependencies among the m instructions being dispatched, say, I1, I2, I3, .., Im.
2. Dependencies involving each of the dispatched instructions and the instructions that have already been dispatched but not completed.
Detecting dependencies in the first category requires dependencies to be resolved in program order among the m instructions being examined for dispatch.
If the processor uses register renaming, flow dependencies are the only dependencies that have to be detected.
Consider the potential flow dependencies within the m instructions being examined for dispatch:
– Assume dependencies to be over architectural registers
– Assume each instruction to have one destination register and at most two source registers
– There can be flow dependencies from I1 to I2, I3, …, Im: detecting these requires 2 * (m – 1) comparators (to compare the destination register address of I1 with the addresses of the two source registers for each of I2 through Im).
– To detect potential flow dependencies from I2 to I3, I4, .., Im similarly requires 2 * (m – 2) comparators.
209
Dependency Checking in Superscalar CPUs (contd.)
– The number of comparators needed to detect the flow dependencies among the m instructions is thus
2 * {(m – 1) + (m – 2) + … + 1} = m * (m – 1), which is O(m^2).
Consider now how the flow dependencies among the group of m instructions being examined for dispatch impact the assignment of physical registers for the destinations of these instructions:
The physical registers for the sources of I1 can be looked up from the rename table – this automatically takes care of flow dependencies to I1 from previously dispatched active instructions.
The destination of I1 can be assigned from the pool of free physical registers.
The physical registers for the sources of I2 can be looked up from the rename table in parallel with the lookup of the physical registers for the sources of I1.
– This lookup for the sources of I2 will be valid only if there are no flow dependencies from I1 to I2.
– In case there is a flow dependency from I1 to a source of I2, the physical register to be used for that source of I2 is the physical register used for the destination of I1. The physical register id for this source, as picked up from the rename table, has to be discarded.
210
Dependency Checking in Superscalar CPUs (contd.)
– The physical register id for a source of I2 is thus:
= the physical register id looked up from the rename table if there is no dependency from I1 to I2.
= the physical register id assigned to the destination of I1 from the free list if there is a flow dependency from I1 to I2 over this source.
This correct id for the physical register for the source can be selected using a multiplexer controlled by the output of the comparator detecting the flow dependency:
[Figure: a 2-to-1 multiplexer produces the physical register id for src1 of I2 – input 0 is the id looked up from the rename table, input 1 is the id obtained from the free list for the destination of I1; the select input is the output of the comparator that detects whether the architectural address of the destination of I1 is the same as the address of the architectural register for src1 of I2.]
211
Dependency Checking in Superscalar CPUs (contd.)
Note that if the rename table lookup for the physical register addresses for the sources are done before the free register lookup and the rename table update for the destination registers in consecutive pipeline stages, a latch may be needed to hold the physical register id for the sources till the physical register id for the destinations have been assigned:
[Figure: the rename table is read in the Decode/Rename1 stage and the looked-up physical register id for src1 of I2 is held in a latch; in the Rename2/Dispatch stage a multiplexer, controlled by the comparator that detects whether the architectural destination of I1 matches the architectural register for src1 of I2, selects between the latched id and the physical register id obtained from the free list for the destination of I1.]
The mechanism used for handling dependencies can also be used to allow the rename table update for destinations and the lookup of the physical register ids for the sources to be done simultaneously within a single stage.
212
Dependency Checking in Superscalar CPUs (contd.)
Similarly, the physical register id for a source of I3 is:
= the physical register id picked up from the rename table if there is no dependency from I1 to I3 or from I2 to I3.
= the physical register id assigned to the destination of I1 from the free list if there is a flow dependency from I1 to I3 over this source and no flow dependency from I2 to I3 over this source.
= the physical register id assigned to the destination of I2 from the free list if there is a flow dependency from I2 to I3 over this source, irrespective of the presence or absence of any flow dependency from I1 to I3 over this source.
Notice that the manner of picking up the physical register ids as described above takes care of both categories of dependencies.
The complete dependency resolution process is shown in the following figure, where three instructions are being dispatched. The boxes labeled M are data selectors (essentially multiplexers) for picking up the physical register stand-in for a source register either from the rename table or from the newly-assigned stand-in for the destination of an earlier instruction that is being co-dispatched. (A sequential C sketch of this selection logic follows the figure.)
213
Dependency Checking in Superscalar CPUs (contd.)
[Figure: dependency checking and renaming for three co-dispatched instructions I1, I2 and I3, each with a destination (dest) and two sources (src1, src2). Comparators check for flow dependencies from I1 to I2, from I1 to I3 and from I2 to I3; Cabc denotes the output of the comparator detecting a flow dependency of src a of instruction c on the destination of instruction b. The rename table supplies the looked-up aliases for the sources, the free list of physical registers supplies the physical register ids for the three destinations, and the comparator outputs control the data selectors M0 through M3 that produce the physical register ids for the sources of I1, I2 and I3.]
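The selection performed by the comparators and data selectors in the figure can be expressed sequentially as the C sketch below; the structures, the dispatch width of 3 and the sentinel value -1 for an unused register field are assumptions made for illustration.

    #define GROUP    3           /* dispatch width used in the figure */
    #define NUM_ARCH 32

    typedef struct {
        int dest_ar;             /* destination architectural register, -1 if none */
        int src_ar[2];           /* source architectural registers, -1 if unused   */
        int dest_preg;           /* physical register assigned from the free list  */
        int src_preg[2];         /* output: resolved physical sources              */
    } group_instr_t;

    int rename_table[NUM_ARCH];  /* architectural reg -> current physical reg */

    /* Resolve the sources of a co-dispatched group: each source takes the rename
       table mapping unless an earlier instruction in the same group writes that
       architectural register, in which case it takes that instruction's newly
       assigned destination; the closest earlier writer wins. */
    void rename_group(group_instr_t g[GROUP])
    {
        for (int i = 0; i < GROUP; i++) {
            for (int s = 0; s < 2; s++) {
                if (g[i].src_ar[s] < 0) continue;
                int preg = rename_table[g[i].src_ar[s]];     /* rename table lookup */
                for (int j = 0; j < i; j++)                  /* earlier co-dispatched */
                    if (g[j].dest_ar == g[i].src_ar[s])
                        preg = g[j].dest_preg;               /* later j overrides: closest wins */
                g[i].src_preg[s] = preg;
            }
        }
        /* Only after the sources are resolved is the rename table updated with the
           destinations, the youngest instance of each architectural register last. */
        for (int i = 0; i < GROUP; i++)
            if (g[i].dest_ar >= 0)
                rename_table[g[i].dest_ar] = g[i].dest_preg;
    }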
214
Supporting Back-to-Back Execution
Consider the following instruction sequence:
ADD R2, R4, R5
SUB R6, R2, #4
In this code sequence, if both the ADD and the SUB have been dispatched, the SUB has to wait for the value of R2 to be produced before it can start up.
Ideally, the SUB must issue as soon as the value of R2 is available. The ADD and the SUB are then said to execute back-to-back
With the use of an IQ, such a back-to-back execution is only possible if the value of R2 is forwarded to the SUB as it moves to the assigned FU, after being issued.
This, in turn, implies that the SUB must be awakened and selected for issue one cycle before the ADD completes execution, permitting the SUB to move out to its assigned FU as the result of the ADD is forwarded to waiting instructions. Unfortunately, this is not what happens in the wakeup logic discussed on Pages 160-164:
[Timing without early tag broadcast:
Cycle T: ADD executes and produces the value of R2; the value of R2 and its tag are broadcast to waiting instructions.
Cycle T+1: SUB is awakened as a result of the tag match, selected for issue and moves to its assigned FU.
Cycle T+2: SUB executes.]
There is a one-cycle gap between the startup of the executions of the ADD and the SUB – that is, back-to-back execution is not supported.
215
Supporting Back-to-Back Execution (contd.)
Back-to-back execution is supported if the tag of R2 is broadcasted one cycle before the value of R2 is available:
[Timing with early tag broadcast:
Cycle T: ADD executes and produces the value of R2; the tag of R2 is broadcast to waiting instructions; SUB is selected for issue and moves to its assigned FU.
Cycle T+1: the value of R2 is picked up by waiting instructions in the IQ that had a match with the tag broadcast in the previous cycle; SUB receives the value of R2 directly from the forwarding bus at the input of its assigned FU; SUB executes.]
Note the following requirements:
1. Tag must be broadcasted before the result (one cycle earlier in this example)
2. The broadcast of the value follows the broadcast of the tag – entries in the IQ that match a broadcasted tag value pick up their data from the forwarding bus in the following cycle.
3. Dependent instructions that are starting up have the data forwarded directly to the FU input(s). These instructions cannot pick up the broadcasted data value as they have moved out from the IQ to their assigned FU’s inputs.
4. The selection logic itself broadcasts the tag based on what it selected for issue in the past; it also schedules the broadcast of the data value. Note that if the data does not follow the tag, issued instructions (like the SUB) will never get the value(s) they are waiting for! The selection logic must thus guarantee that the data values follow the broadcast of the corresponding tags.
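The toy C program below simply walks through this two-cycle timing for the ADD/SUB pair; the tag value and the printout are, of course, purely illustrative.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy trace of the back-to-back timing above (ADD -> SUB through R2).
       The tag is broadcast in cycle T while the ADD is still executing; the value
       follows in cycle T+1 and is steered straight to the FU input of the SUB. */
    int main(void)
    {
        const int r2_tag = 7;          /* ROB-slot / physical-register tag of R2 */
        const int sub_waits_on = 7;    /* tag the SUB is waiting for in the IQ   */

        /* Cycle T: ADD executes; the scheduler broadcasts the tag one cycle early. */
        bool sub_selected = (sub_waits_on == r2_tag);
        printf("cycle T  : ADD executes, tag %d broadcast, SUB %s\n",
               r2_tag, sub_selected ? "woken, selected and moves to its FU" : "waits");

        /* Cycle T+1: the value follows the tag on the forwarding bus.  The SUB has
           already left the IQ, so the value is latched directly at its FU input. */
        if (sub_selected)
            printf("cycle T+1: value of R2 forwarded to the FU input, SUB executes\n");

        return 0;
    }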
216
CLASS NOTES/FOILS:
CS 520: Computer Architecture & Organization
Part III: Branch Handling Techniques and Speculative Execution
Dr. Kanad Ghose ghose@cs.binghamton.edu http://www.cs.binghamton.edu/~ghose
Department of Computer Science State University of New York Binghamton, NY 13902-6000
All material in this set of notes and foils authored by Kanad Ghose 1997-2019 and 2020 by Kanad Ghose
Any Reproduction, Distribution and Use Without Explicit Written Permission from the Author is Strictly Forbidden
CS 520 – Fall 2020
217
Branch Handling Techniques – Basics
Branch instruction: any instruction that can potentially cause an instruction from a non-consecutive location to be fetched.
Branching is necessitated by the following mechanisms in HLLs:
Subroutine calls, returns, and jumps: these require unconditional branching
Data dependent selection of control flow path: these require conditional branching.
Branching is also needed during system calls and returns, as well as trap handling and returns: special instructions that branch and change the ex- ecution mode are needed for these functions.
Branching is one of the most serious impediments towards achieving a good CPI (particularly in superscalar CPUs).
Both hardware and software solutions exist.
218
Branch Instructions: ISA Variations
Subroutine call/return:
RISCy: Jump-and-link:
JAL dest src1 literal
– saves the return address in dest and transfers control to the effective address computed by adding the contents of src1 and the sign-extended literal.
Variation: dest is an implied register
– No special instruction needed for returns!
CISCy: combines return address saving and control transfer with other functions (e.g., pushing call arguments into the activation stack, forming frame pointer to the called routine’s activation record etc.)
– Special instructions for returns quite common.
Conditional branches:
Branch condition computed by an instruction separate from the one that causes branching:
Variation 1: branch condition held in PSW flags (“condition code flags”):
Example:
SUB R2, R5, R11 /* sets CC flags */
BZ #428 /* tests Z flag */
219
Branch Instructions: ISA Variations (continued)
– branch instructions usually use PC-relative addressing in modern ISAs.
– Several sets of CC registers may be used, one for each FU or one for each stage of a FU. The compiler schedules the branch instruction to check the appropriate CC register explicitly to maintain the flow dependency between the instruction that sets the CC and the branch instruction that checks it.
Variation 2: branch condition held in architectural register: Example:
CMP R2, R1, R4 /* compare R1, R4; result in R2 */
BZ R2, #428
– If multiple CC registers are absent, variation 1 constrains instruction ordering severely – for instance, no instruction that sets the CC flags can appear between the SUB and the BZ. Variation 2 (or variation 1 with multiple CC flags) does not have this restriction.
Combined instruction for evaluating the branch condition and branching: here no explicit storage is needed to hold the evaluated branch condition.
Example:
CMPBZ R1, R4, #428
– This approach also reduces the instruction count compared to the other approaches.
220
Branch Handling: Terminology and Stats
Taken branch: a branch instruction that causes an instruction from a non-consecutive location to be fetched
Branch target: the address to which a taken branch transfers control to
Branch penalty: number of pipeline bubbles resulting from a taken branch
Branch Resolution: resolving the direction in which a branch transfers control
Fall-through part: instruction sequence starting with the instruction immediately following the branch; these instructions are executed if the branch is not taken (hence, fall-through)
Branching Facts:
Branch Instruction Frequency: One out of every 5 to 6 instructions on the average is a branch
– branching is more pervasive within OS code
– branching is less frequent in scientific code (where one out of 10 to 15 instructions is a branch)
Branch Instruction Behavior: 60% to 75% of the branch instructions are taken.
– Reason: branches at the end of a loop are likely to be taken; uncondition- al branches are always taken etc.
Stats apply to both CISC and RISC ISAs
221
Branch Instructions and Pipelines: Main Issues
1. Maintaining flow dependency between the instruction that evaluates the condition for branching and the instruction that branches on this condition:
– Important consideration in dynamically scheduled pipelines
– Can use software scheduling to maintain the dependency in pipelines that employ no hardware scheduling (except perhaps for handling interlocks on loads)
2. Reducing or avoiding the penalty of branching - this actually targets one or more of the following components:
Reduce delay in branch resolution
Reduce delay in fetching and processing the target of a taken branch: this, in turn, has several components:
– Reduce/Eliminate time taken for computing the effective address of the target
– Reduce the time taken to fetch the target instruction (or a group of consecutive instructions starting with the target)
– Reduce the time taken to decode the instructions
3. Speculative execution: reduce the time taken to start the execution of the instructions starting with the target by executing instructions beyond unresolved branches
222
The Penalty of Branching in Pipelines
Consider a simple version of APEX executing the following code fragment:
        SUB R1, R2, R4
        CMP R1, R6
        BZ target
        ADDL R1, R1, #4
        LOAD R2, R4, #0
        :
        :
target: STORE R6, R2, #0
Assume further that the branch condition is evaluated when the BZ instruction is in the EX stage. (The result of the CMP can be forwarded to the BZ as it enters the EX stage.)
If the branch instruction causes the branch to take place, further processing of the two following instructions (ADDL and LOAD), which have been partially processed, has to be abandoned – a process that is described as squashing or flushing or annulment.
The earliest instance at which the target of the BZ can enter the pipeline is in the cycle following the one in which the BZ entered the EX stage. Consequently, branching results in a two cycle bubble:
[Figure: the five-stage pipeline (F, D/RF, EX, MEM, WB) with the BZ in EX and the CMP and SUB ahead of it; the ADDL and LOAD behind the BZ (in D/RF and F) are the instructions to be squashed if the branch is taken, and the STORE at the target is yet to be fetched.]
Note that there is no penalty if the branch is not taken.
223
The Penalty of Branching in Pipelines (continued)
An analysis:
Execution time of N instructions without branching: Texec = k * T + (N – 1) * T
Execution time with branching:
b = probability that an instruction is a branch
s = probability that the branch instruction is taken
P = length (in cycles) of the bubble introduced in the pipeline on a taken branch
– Each taken branch effectively prolongs the execution time by P cycles
– Total execution time with branching is thus:
Texec,branch = Texec + N * b * s * P * T
Example values: for APEX: P = 2, k = 5; from stats: b = 0.2, s = 0.75
When N is large,
Texec,branch/Texec = 1.3
execution time is prolonged by 30% due to branching in scalar pipelines
(Branching penalty is more severe in superscalar pipelines, as we will see later.)
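The 30% figure follows directly from the model: for large N, Texec ≈ N * T, so the slowdown factor approaches 1 + b * s * P. The throwaway C program below simply evaluates that expression with the example values.

    #include <stdio.h>

    /* Evaluate Texec,branch / Texec for large N using the model above. */
    int main(void)
    {
        double b = 0.2;    /* fraction of instructions that are branches */
        double s = 0.75;   /* fraction of branches that are taken        */
        double P = 2.0;    /* bubbles per taken branch (simple APEX)     */

        printf("slowdown factor = %.2f\n", 1.0 + b * s * P);   /* prints 1.30 */
        return 0;
    }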
224
Instruction Squashing
Not generally a problem, unless squashed instructions have modified processor state
What if squashed instructions trigger page faults or I-cache miss?
– Since page faults are not processed until the preceding instruction has
committed, page faults do not create problems
– I-cache miss handling triggered by squashed instruction does not have any impact on the results – it only has an impact on performance
What if squashed instructions have updated the CPU state?
– e.g., squashed instructions may have updated CPU registers using
some options like auto-increment or auto-decrement
– Only way to undo this is to rely on mechanisms for implementing precise interrupts (LATER)
If squashed instruction have no side effects, squashing can be accom- plished by freezing the clock to the pipeline stages following the ones that contain the instructions to be squashed till the instruction at the tar- get of the branch enter these stages
Some pipelines also require selective squashing based on the branch di- rection (LATER).
225
Maintaining Flow Dependency to Branch Instructions
Maintaining flow dependency between instruction that evaluates condition for branching and instruction that branches on this condition:
– With out-of-order startups and completion, this dependency has to be enforced explicitly.
– Not an issue with combined evaluate and branch instructions (like CMPBZ).
– When branch instruction puts evaluated branch condition into a GPR, existing mechanisms to enforce flow dependencies can be used.
– If the branch condition is evaluated into the CC flags, and if multiple sets of CC flags exist, the compiler can explicitly code in the dependency and rely on existing mechanisms for enforcing flow dependencies. This is very much like the last approach.
– If a single set of CC registers exist, hardware mechanisms can be used to ensure that only the instruction before the branch sets the CC.
– In an ISA that features an explicit "setCC" bit in instructions that can set the CC flags, the compiler can set this bit for the instruction that precedes a branch and reset this bit in all other instructions that can set the CC.
226
Techniques for Avoiding or Reducing the Penalty of Branching
Early Compares
Unconditionally Fetching Instruction Stream Starting at the Target
Static Prediction
Delayed Branching and Delayed Branching with Squashing
Dynamic Branch Prediction:
– Branch History Table
– Branch Target Buffers
– Alternate Stream Prefetching Based on Prediction
Separate Branch Unit
Predicated Execution (aka conditional assignments/guarded execution)
Branch Folding
Two-level Branch Prediction
Hybrid Branch Handling Techniques & Others
227
Branch Handling: Early Compares
Resolves branch earlier within the instruction pipeline, reducing
branching penalty for certain compare-and-branch instructions
Some comparisons, like comparing a register against zero or any
equality or inequality can be done as the instruction is decoded
– The hardware assumes that the instruction is a branch and performs such necessary comparisons
– Comparator circuit for comparing against zero, checking for equality or inequality is fast
– Result of comparison is used as soon as the decoding completes
The decode stage also includes a dedicated adder to compute the branch target address – this address computation also proceeds speculatively in parallel with the decoding
[Figure: parallel activities within the D/RF stage for Early Compares – the comparand GPR address reads the register file and the value is checked by a fast "= 0?" comparator while the decoder examines the instruction opcode; in parallel, an adder adds the PC value (in PC_D/RF) to the sign-extended literal to form the target address. The target address and the result of the comparison are sent to PC_F if the instruction is a branch and the comparand register = 0.]
228
Branch Handling: Early Compares (continued)
Net result: branch direction resolved and the branch instruction is pro- cessed completely within D/RF stage
For APEX, this reduces the penalty on branching to just a single cycle. The instruction following the branch has to be squashed if the branch is taken.
The compiler can also transform the source code to convert branch tests to branch tests against a zero when possible (& when this conversion does not increase the dynamic instruction count):
for (i = 0; i < N; i++) { }
can be transformed to:
for (i = 0; i != N; i++) { }
- the transformed loop executes faster since a fast compare can implement the looping test (i != N) within the D/RF stage using a single compare-and-branch instruction.
- Other comparisons may be converted using additional instructions
229
Unconditionally Fetching Instruction Stream Starting at the Target
Loop Buffers:
Used to hold instructions that make up a small loop body
Reduces time taken to fetch instructions in the loop body when the branch at the end of the loop is taken
Examples:
CDC Star 100: 256 byte loop buffer
CDC 6600: 60 byte loop buffer
Special instructions are used to demarcate the loop body and to direct the loop buffer hardware to store instructions fetched for the loop body into the loop buffer
Limitations:
- Size of loop buffer
- Branches within loop body
Well-designed I-caches do this automatically!
230
Unconditionally Fetching Instruction Stream Starting at the Target (contd.)
Explicit Prefetching of Alternate Streams:
Compiler inserts instructions to prefetch alternate stream of instruction starting at the target.
Example:
PREPARE_TO_BRANCH instruction in the TI ASC
Very hardware intensive if the original instruction stream starting with the fall-through part as well as the stream starting at the target have to be fetched simultaneously (more ports to memory system)
Alternate stream prefetching is generally employed in conjunction with prediction information.
Main limitations:
- Very hardware intensive
- Branch instructions within alternate streams may inhibit potential gains
231
Delayed Branching - preamble
Definition: The delay slot of a branch instruction is defined as the number of instructions that have to be squashed when the branch is taken.
Example: the delay slot for branch instructions in APEX as described earlier is two instructions without early compares and a single instruction with early compares.
Definition: Control dependencies: control dependencies are defined at the level of the source code, which implicitly uses the sequential semantics. Consider a conditional branch defined by the if statement:
if (<condition>) { statement_sequence_1 }
Here, the sequence of statements statement_sequence_1 is control dependent on the condition <condition>.
Consider now the following if-else statement:
if (<condition>) { statement_sequence_1 }
else { statement_sequence_2 }
Here both statement_sequence_1 and statement_sequence_2 are control dependent on <condition>.
Control dependencies at the source translate to control dependencies at the level of instructions.
Control dependencies essentially imply that the processing of some instructions are dependent on the way a branch goes.
232
Delayed Branching
Makes explicit use of the post-processing phase of the compiler to rearrange instructions within the binary
The delayed branching mechanism for branch handling modifies the hardware to reduce the branching penalty as follows:
Allow the processing of the instructions in the delay slot of the branch to continue irrespective of the outcome of the branch
Allow the target for a taken branch to be fetched as soon as the branch is resolved and the address for the target is computed
Note that the semantics of the delayed branch is different from the sequential semantics:
– In the sequential semantics of a branch, the instruction of the target is processed immediately after the branch instruction if the branch is taken
– In the delayed branching scheme, the instruction processing sequence on a taken branch is:
the branch instruction, instructions in the delay slot, followed by the instruction at the target.
– The net effect is as if the processing of the branch is delayed by the number of cycles in the delay slot
233
Delayed Branching (continued)
The pipeline modifications for delayed branching are very simple, as seen from the following example for APEX:
[Figure: APEX pipeline stages F, D/RF, EX, MEM, WB with the I-Cache, Decoder, Instruction Register, Register File and ALU; a chain of latches holds decoded info (the "IR Chain") alongside a "PC Chain"; a MUX in the PC update logic selects the next fetch address]
– The multiplexer chooses the effective address computed for the conditional branch instruction by the ALU in the EX stage to go into the PC for the fetch stage if the branch condition is valid
234
Delayed Branching (continued)
Making use of the delayed branching mechanism:
Lets first consider what the delayed branching mechanism does by looking at the processing of the following code fragment on the simple version of APEX with delayed branching hardware and a delay slot of 2 cycles:
        I1
        I2
        BZ target
        I3
        I4
        I5
        :
target: I20
– When the branch is not taken the instruction sequence processed is:
I1, I2, BZ, I3, I4, I5,….
– The instruction sequence processed when the branch is taken is:
I1, I2, BZ, I3, I4, I20, I21,….
Note that the two instructions in the delay slot (I3 & I4) are never
squashed – they are processed no matter which way the branch goes.
235
Delayed Branching (continued)
Thus, to make use of the delayed branching mechanism and to ensure that no performance loss results on branching, we must find enough instructions that can be moved after the branch such that:
a) The data dependencies in the original program are preserved
b) The instructions moved into the delay slot can be executed no matter which way the branch goes – this requires the control dependencies in the original program to be preserved.
c) The instructions moved into the delay slot accomplish some useful processing, if possible
An algorithm similar to the software interlocking (compile-time scheduling) algorithm can be used to determine the instructions that can be moved into the delay slot to exploit the delayed branching mechanism:
– The algorithm must take into account control dependencies in addition to flow dependencies
– A conditional branch instruction is scheduled as early as possible
– Control dependent instructions cannot be moved into the delay slot
– Where enough useful instructions cannot be moved into the delay slot of a branch, NOPs should be used as fillers
Note that the NOPs used as fillers make no useful contribution to the performance – each NOP is effectively a bubble. In fact, NOP fillers increase the binary size and thus can have a detrimental effect by cluttering up the cache.
236
Delayed Branching (continued)
An Example:
– Suppose the original code fragment to be reorganized to exploit the delayed branching hardware for APEX is:
target: LOAD R2, R5, #2
        ADDL R5, R5, #1
        SUB  R1, R2, R4
        CMP  R1, R6
        BZ   target
        ADDL R1, R1, #4
        LOAD R2, R4, #0
        :
        :
        STORE R6, R2, #0
– The reorganized version is:
target: LOAD R2, R5, #2
        SUB  R1, R2, R4
        CMP  R1, R6
        BZ   target
        ADDL R5, R5, #1
        NOP  /* filler */
        ADDL R1, R1, #4
        LOAD R2, R4, #0
        :
        :
        STORE R6, R2, #0
237
Delayed Branching (continued)
Facts:
One useful instruction to be moved into the delay slot of a branch can be found about 70% of the time
The average probability of finding two useful instructions to be moved into the delay slot is 25%
The average probability of finding three or more useful instructions that can be moved into the delay slot is very small
It is thus clear that the delayed branching mechanism is of little use in pipelines where the delay slot is 2 cycles long or more
In modern superscalar pipelines, a single cycle delay slot can encompass 2 to 6 instructions (if the machine uses 2-way to 6-way dispatching per cycle). Delayed branching is also fairly useless in these CPUs.
– Delayed branching was used extensively in many early RISC scalar CPUs
238
Delayed Branching With Squashing/Annulment
Makes use of the predicted direction of a branch at compile time. Often, the programmer can supply the expected branch direction as a hint to the compiler.
This is an enhancement of the basic delayed branching mechanism that allows the instructions within the delay slot to be squashed if the branch direction does not coincide with what was predicted at compile time
Thus, to exploit this mechanism, instructions along the predicted path can be moved into the delay slot – these will all be useful instructions. NOP fillers are not needed.
The success of this scheme depends on how accurate the predictions are.
Some CPUs also allow the number of instructions to be squashed to be
specified. Example:
– The delay slot of a loop-closing branch can be filled with instructions from the target if delayed branching with squashing is used. Most of the time, this branch will be taken, executing useful instructions
in the delay slot. When the branch is not taken, performance losses result due to squashing.
239
Delayed Branching With Squashing/Annulment (continued)
Implementation:
Two bits can be associated with each conditional branch instruction to encode the prediction and the squashing information – both of these bits can be set by the compiler. The interpretation of these two bits is as follows:
00: No squashing, irrespective of branch direction – This is the pure delayed branching scheme
01: Squash if the branch is not taken (this implies that
the static prediction is that branch is taken most of the time)
10: Squash if the branch is taken (this is used if the static prediction is that the branch is not taken most of the time)
11: Squash, irrespective of the branch direction: this is the primitive scheme, with no special handling for branches.
– Given that all four schemes are available, the order in which these options are used are as follows:
(i) If a definite prediction is possible, use the options encoded by 01 or 10
(ii) If a definite prediction is not possible – i.e., the branch is equally likely to be taken and not taken, attempt to fill in the delay slot (or most of the delay slot) with useful instructions wherever possible, and use the option encoded by 00.
(iii) Use the option encoded by 11 in all other cases. This avoids an increase in the size of the binary by avoiding NOP fillers.
Note that delayed branching with squashing is really a primitive form of speculative execution.
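As a rough illustration only (the encoding is the one listed above; it is not taken from any specific ISA, and the enum and function names are assumptions made for this sketch), a decoder could interpret the 2-bit prediction/squash field as follows in C:

    #include <stdbool.h>

    /* Hypothetical 2-bit annul/prediction field, encoded as listed on this foil. */
    enum annul_mode {
        NEVER_SQUASH  = 0x0,   /* 00: pure delayed branching                  */
        SQUASH_IF_NT  = 0x1,   /* 01: static prediction is "taken"            */
        SQUASH_IF_T   = 0x2,   /* 10: static prediction is "not taken"        */
        ALWAYS_SQUASH = 0x3    /* 11: primitive scheme, no delay slot filling */
    };

    /* Returns true if the delay-slot instructions must be squashed, given the
       mode bits and the actual outcome of the branch. */
    static bool squash_delay_slot(enum annul_mode mode, bool branch_taken)
    {
        switch (mode) {
        case NEVER_SQUASH:  return false;
        case SQUASH_IF_NT:  return !branch_taken;
        case SQUASH_IF_T:   return branch_taken;
        default:            return true;          /* ALWAYS_SQUASH */
        }
    }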
240
Static Branch Prediction
The compiler can make a reasonable prediction of the way a conditional branch is likely to transfer control in many cases:
The branch at the end of a loop is likely to be taken.
A branch that checks for an error condition (such as overflow or the carry flag being set) is likely to be not taken.
Opcode-based static prediction: certain branch instructions are used in very specific ways by the compiler – in many cases the compiler can make a prediction on the branch behavior (e.g., in the POWER PC architecture, branch instructions that check the link and count register used for looping, are predicted to not branch).
A profiling run that uses a “typical” data set can sometimes indicate which way some branches are likely to proceed.
Static branch prediction can be frozen into the hardware: a pipeline with no branch handling mechanism effectively predicts not taken!
Predict the branch direction based on the sign of the offset in the branch instruction (which typically uses PC-relative addressing):
+ve offset: branch is predicted not taken
-ve offset: branch is predicted taken
– used in the DEC 21064 Alpha implementation, POWER PC 601, 603
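A minimal C sketch of this frozen rule (the 32-bit signed offset type is an assumption; real ISAs encode narrower offsets):

    #include <stdbool.h>
    #include <stdint.h>

    /* Static prediction frozen into the hardware: branches with a negative
       PC-relative offset (typically loop-closing) are predicted taken,
       branches with a positive offset are predicted not taken. */
    static bool static_predict_taken(int32_t branch_offset)
    {
        return branch_offset < 0;
    }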
241
Static Branch Prediction (continued)
Some CPUs use a combination of static predictions (e.g., the POWER PC architectures use opcode-based static prediction for some branch instructions and prediction based on the sign of the branch offset for the others).
Some architectures allow the compiler to reverse the default/frozen prediction – a bit settable by the compiler to indicate the reversal of the prediction is incorporated within the instruction (e.g., POWER PC 601, 603).
It is also possible to switch between static and dynamic prediction in some CPUs (e.g. HP PA 8000).
How static prediction is used:
Instructions can be tagged with a bit indicating the static prediction – the hardware uses this to decide whether to continue processing instructions in the fall-through part or to fetch instructions from the target.
In pipelines that employ delayed branching with squashing (e.g., some SPARC implementations, Intel i860, HP PA-RISC, etc.).
In more general mechanisms for supporting speculative execution. In many cases, remedial actions to undo the effects of speculatively executing along the predicted path are carried out as instruction processing continues along the predicted path. MORE LATER!
As the default initial prediction for Branch Target Buffer (BTB) entries (e.g., DEC 21064 Alpha implementation).
242
Dynamic Branch Prediction – Branch Target Buffers
Basic tenet: the past behavior of a branch instruction is indicative of its future behavior – i.e., there is a large correlation among the successive behaviors of the same branch instruction.
Dynamic branch prediction thus requires some history reflecting the past behavior to be maintained within the CPU.
A Branch Target Buffer (BTB) maintains this information – this is very much like a cache. In fact, the BTB is probed in parallel with the I-cache, using the instruction address as the key:
Possible outcomes/conclusions:
BTB hit,  I-cache hit:  instruction fetched is a branch instruction
BTB hit,  I-cache miss: instruction to be fetched is a branch
BTB miss, I-cache hit:  instruction may or may not be a branch
BTB miss, I-cache miss: instruction may or may not be a branch
– If a BTB miss occurs, and the fetched instruction turns out to be a branch, the following actions are carried out:
a) An entry is established in the BTB for the instruction, possibly after replacing a victim entry to make room in the BTB for the entry being added.
b) A default prediction is used for the branch instruction and appropriate actions based on this default prediction are carried out.
c) The information in the BTB is updated after the branch is resolved.
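The probe-in-parallel-with-the-I-cache and the miss actions above can be sketched as follows in C; the direct-mapped organization, sizes and field names are simplifying assumptions made only for this sketch (a later foil notes that real BTBs are often associative):

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512               /* assumed size */

    struct btb_entry {
        bool     valid;
        uint32_t tag;         /* address of the branch instruction   */
        uint32_t target;      /* stored target address               */
        uint8_t  history;     /* prediction FSM state (e.g., 2 bits) */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Probed in parallel with the I-cache, keyed by the fetch address.
       Returns true on a BTB hit and returns the stored target and history. */
    static bool btb_probe(uint32_t fetch_pc, uint32_t *target, uint8_t *history)
    {
        struct btb_entry *e = &btb[(fetch_pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->tag == fetch_pc) {
            *target  = e->target;
            *history = e->history;
            return true;
        }
        return false;
    }

    /* On a BTB miss that turns out to be a branch: establish an entry (possibly
       evicting a victim); the same routine serves to update the entry after the
       branch resolves.  A default prediction is used until then. */
    static void btb_establish_or_update(uint32_t branch_pc, uint32_t target,
                                        uint8_t new_history)
    {
        struct btb_entry *e = &btb[(branch_pc >> 2) % BTB_ENTRIES];
        e->valid   = true;
        e->tag     = branch_pc;
        e->target  = target;
        e->history = new_history;
    }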
243
Dynamic Branch Prediction – Branch Target Buffers (continued)
– If a BTB hit occurs, the following actions are necessary:
(i) Appropriate actions are carried out based on the prediction information (as well as other information about the target) as retrieved from the BTB.
(ii) The information in the BTB is updated after resolving the branch.
If a BTB is incorporated in the simple APEX pipeline, these actions are carried out as shown:
[Figure: BTB actions in the APEX pipeline stages F, D/RF, EX, MEM, WB]
  F (I-cache and BTB probed in parallel):
    BTB hit:  make use of the prediction to alter the fetch sequence, if needed
  D/RF (instruction decoded & found to be a branch):
    BTB miss: if the instrn. is a branch, establish a BTB entry & use the default prediction
  EX (branch resolved; target address computed):
    BTB hit:  probe BTB & update entry
    BTB miss: probe BTB, update entry, & compute target address for a default pred of taken

Penalties:
  BTB hit and prediction correct:   0 cycles
  BTB hit and incorrect prediction: 2 cycles
  BTB miss, default prediction of not taken, 2 subcases:
    0 cycles (default pred is correct), 2 cycles (default pred incorrect)
  BTB miss, default pred of taken, 2 subcases:
    2 cycles (correct default pred – still need to compute the target address in the EX stage, so the default pred does not help!), 0 cycles (incorrect default pred)
244
Dynamic Branch Prediction – Branch Target Buffers (continued)
– Note that the BTB has to be multi-ported to allow its continuous access from the fetch stage while it may get updated from the EX stage. Notice also that it does not matter if the BTB entry being replaced from the fetch stage is the same as the entry that is being updated from the EX stage! (The BTB access for an update from the EX stage is abandoned if the BTB probe from the EX stage results in a miss.)
The BTB is associatively addressed; in some CPUs, it is fully associative.
Modern, larger BTBs can use a simple hash function on the address of a branch to locate the BTB entry for the branch instruction – in such cases the probe has to wait till the fetched instruction has been identified as a branch.
In addition to prediction information, the BTB contains information about the target of the branch. Several BTB variations are possible depending on the nature of the information regarding the target:
Only the computed address of the target is stored in the BTB: If a BTB hit occurs, and if the branch is taken, this saves the time needed to compute the effective address of the target. In this case, the BTB is called a branch history buffer. In many machines, the presence of an entry in the history buffer implies a prediction of taken.
The target instruction itself and the address of the target's successor are stored within the BTB entry. On a BTB hit and if the branch is taken, this saves the time needed to access the target instruction, as well as the time needed to compute the address of the successor of the target.
245
Dynamic Branch Prediction – Branch Target Buffers (continued)
Decoded information about the target is stored in the BTB, in addition to the address of the successor of the target. On a BTB hit and a taken branch, this saves the time needed to fetch and decode the target, as well as the time needed to initiate the fetching of the instruction beyond the target.
– Useful for CISCy ISAs.
The target instruction, as well as its logical successor and the address of the successor of the target’s successor is stored in the BTB. This allows the possibility of branch folding: if the target is an unconditional branch, it can be skipped and the successor of the target can be fetched directly. (Many compilers implement “long” conditional jumps in this fashion, and this scheme will be useful in such cases.)
– Other obvious variations are possible.
The BTB entry for a target address can be reliably used only when the target address does not depend on the value of a GPR (which might get modified between two consecutive executions of the same branch instruction).
– PC-relative addressing for specifying the target is useful in this context.
In many cases, the BTB entries may be incorporated into the I-cache entries. In this case, I-cache entries are made wider to accommodate the BTB entries for potential branch instructions. Each cache entry is also tagged to indicate if the entry is for a branch instruction. On the positive side, this saves the tags that would otherwise be needed in an independent BTB.
246
Dynamic Branch Prediction – Branch Target Buffers (continued)
Another implementation of the BTB, called a branch target access cache (BTAC), keeps only the target information in a separate cache (the BTAC), while the prediction/history information is maintained within the I-cache.
– This helps in avoiding wide entries in the I-cache
Example:
Fully-associative branch target buffer with explicit prediction information:
[Figure: example entry for BZ #-400 occurring at address 1200 – associative lookup tag = 1200; prediction info (history bits); effective address of target = 800; prediction logic derives the current prediction from the history bits]
247
Dynamic Branch Prediction – Predictors
Real-life constraints:
1. Only a finite amount of information can be retained about the past executions of a branch for future predictions.
2. The prediction information has to be quickly decoded to get the current prediction.
Most common implementations use a finite state machine (FSM) to encode the past few actual outcomes of a branch instruction. The FSM state represents the current prediction.
One-bit predictor: a single bit, coding the actual outcome of the last execution of the branch, is kept. The current prediction is the same as the last actual behavior.
– accuracy is lower than that of the other schemes, but still reasonable
– one possible predictor option in the MIPS 8000
Two-bit saturating counter: logically, we have a two-bit counter that is incremented if the branch is taken and decremented if it is not taken. The counters saturate at either extreme. If the current value of the counter is 1 or 0, then the prediction is that the branch is likely to be not taken. If the current value of the counter is 2 or 3, then the prediction is that the branch is likely to be taken.
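A minimal C sketch of this counter, using the 0-3 encoding just described (values 0 and 1 predict not taken, values 2 and 3 predict taken):

    #include <stdbool.h>
    #include <stdint.h>

    /* Predict taken when the counter is 2 or 3, not taken when it is 0 or 1. */
    static bool counter_predicts_taken(uint8_t ctr)
    {
        return ctr >= 2;
    }

    /* Increment on a taken branch, decrement on a not-taken branch,
       saturating at 3 and 0 respectively. */
    static uint8_t counter_update(uint8_t ctr, bool taken)
    {
        if (taken)
            return (ctr < 3) ? (uint8_t)(ctr + 1) : 3;
        return (ctr > 0) ? (uint8_t)(ctr - 1) : 0;
    }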
248
Dynamic Branch Prediction – Predictors (continued)
– The two bit saturating counter is implemented with the following FSM:
[Figure: 2-bit saturating counter FSM with states N (00), N? (01), T? (10) and T (11); a taken outcome (t) moves the state one step toward T, saturating at 11, and a not-taken outcome (n) moves it one step toward N, saturating at 00; the initial state is marked in the figure]
– N (predict not-taken, higher confidence), N? (predict not taken, lower confidence), T? (predict taken, lower confidence), T(predict taken, higher confidence)
– n, t: actual behavior of branch (t = taken, n = not taken)
– msb of the state label is the prediction
– Two bad guesses in a row changes prediction.
– Note need to start from the initial state (if we start in T? and alternately go through n and t, a correct prediction is never made).
249
Dynamic Branch Prediction – Predictors (continued)
Alternate 2-bit predictor: predicts branch to be taken if any one of the last two executions resulted in a taken branch
– Here state labels in bold give state names, associated predictions are noted below the state labels:
[Figure: FSM with states NN, NT, TN and TT recording the actual outcomes of the last two executions; the prediction is N in state NN and T in states NT, TN and TT; on each execution the new t/n outcome is shifted into the state, displacing the older bit; the initial state is marked in the figure]
– Note that the labels for the state record the branch history explicitly.
– Can implement the state update using a shift register and use an OR-gate to derive the prediction info:
[Figure: a 2-bit shift register (t = 1, n = 0) is shifted with the actual branch behavior; the two bits are ORed to form the prediction]
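A sketch of that implementation: the 2-bit history (t = 1, n = 0) is shifted on every execution and the prediction is just the OR of the two bits.

    #include <stdbool.h>
    #include <stdint.h>

    /* 2-bit history of the last two outcomes (t = 1, n = 0). */
    static bool shift2_predicts_taken(uint8_t hist)
    {
        return (hist & 0x3) != 0;              /* OR of the two history bits */
    }

    /* Shift the actual outcome into the 2-bit history register. */
    static uint8_t shift2_update(uint8_t hist, bool taken)
    {
        return (uint8_t)(((hist << 1) | (taken ? 1 : 0)) & 0x3);
    }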
250
Dynamic Branch Prediction – Predictors (continued)
Accuracy of branch prediction:
(1984 data): Based on n bits of history per branch (CISCy + reg-to-reg ISAs)

  # of history bits         Prediction accuracy (average)
  0 (default prediction)    64.1% to 77.8%
  1                         79.7% to 96.5%
  2                         83.4% to 96.5%
  :                         :
  5                         83.7% to 97.1%

– This old data shows that the accuracy of the predictor saturates after 2 to 3 bits of history.

Accuracy of prediction of 2-bit saturating counter (1992, SPEC92 benchmarks):

  Prediction Technique                       Accuracy (geometric mean/(range)) (over most SPEC 92 int+fp)
  Static, always taken                       62.5%
  Static, based on sign of branch offset     68.5%
  Dynamic (1-bit)                            89% (int: 76.5% to 87.5%, fp: 88% to 97.8%)
  Dynamic (2-bit saturating counter)         93% (int: 86.5% to 93%, fp: 91.5% to 98.5%)
251
Dynamic Branch Prediction – More on BTB Structures & Variations
Typical Hardware Associated With BTB:
Prediction Logic: Derives current prediction from encoded FSM state
Prediction FSM updating logic: Updates FSM state based on current FSM state and actual branch behavior
BTB data/tag structures: holds data about branches & prediction info, together with address tags for lookup.
BTB access steps:
– F: BTB lookup; FSM state retrieved; target address retrieved
– D/RF: Initial FSM state established based on default prediction
– EX: Target address computed (needed only on a BTB miss); BTB probed again and FSM state updated
BTB variations: examples
Separate I-cache and target instruction cache (BTC):
[Figure: the PC of the fetch stage probes the I-cache (which supplies the instruction sequence including the branch) and the BTC – Branch Target Cache – (which supplies the target instruction sequence)]
– I-cache and BTC are probed in parallel
252
More on BTB Structures & Variations (continued)
Integrated I-cache and BTB with successor index: Direct-mapped I-cache
I-cache with “successor index” to point to target of first branch within a cache line
Successor index points to next line or line containing target of first branch (in program order) in a line:
[Figure: integrated I-cache line with fields – Prediction Info | Successor Index | Instructions]
Successor field has a valid bit
One set of common information for all branch instructions in a cache line
– variation: pred. info only for the first branch in a cache line; the compiler spaces out the branches to make sure that there's only one branch instruction per cache line (MIPS 8000)
253
More on BTB Structures & Variations (continued)
Successor field can be initialized as follows:
– When instructions are processed from the I-cache: successor index
is initialized to point to next cache line by default; successor field set appropriately when instructions in line are actually processed (AMD K5, K6)
– Value of successor index is not stored but recomputed every time branch instruction(s) in line is processed (MIPS 8000 and 10K)
– When fetched instructions are predecoded (&tagged) by a separate predecode unit before the fetched instruction is put into the I-cache (Sun UltraSPARC)
How does the successor index field help?
– Saves 2-step cache lookup (RAM access followed by tag comparison in normal I-cache access sequence is replaced by a direct lookup for the target instruction in the I-cache)
– Preferred as clock cycle times fall (normal I-cache lookup can require 2 to 3 pipeline stages; with successor index this reduces to 1 or 2 cycles)
254
Early Branch Resolution with Dedicated Branch Unit
Used in CPUs that employ queue for fetched instructions
Variations in increasing order of performance improvement
Buffer/Queue of instructions between Fetch and Decode/Dispatch stages; Branch unit examines entries in this queue before they are decoded and dispatched (e.g., POWER 1, POWER 2). The branch unit may look ahead of the instruction(s) being sent over for decoding/ dispatch
Branch unit examines instructions as they are read into Buffer/Queue; instructions dispatched from the queue (e.g., POWER PC 603)
[Figure (both variations): Fetch Logic feeds an Instruction Queue that feeds decode/dispatch; the Branch Unit examines entries in the queue (variation 1) or examines instructions as they are pre-decoded and read into the queue (variation 2)]
Key requirements for early branch resolution:
Branch unit has access to registers/flags that contain branch condition
or relies on prediction information
Branch unit has dedicated address computation unit or BTB that has cached target address
255
Using Predicated Execution to Remove Branch Instructions
Predicated instruction:
– Every (or some subset – typically, reg-to-reg instructions) instruction
has an associated predicate (bit flag)
– The predicate can be tied to a branch condition/flag
– Instruction updates its destination(s) only if the associated predicate is valid.
Example: Conditional MOVE (CMOV) instructions:
CMOV Rdest, Rcond, Rsrc
– moves Rsrc into Rdest only if Rcond satisfies the condition encoded in the CMOV opcode (EQ, NEQ, etc.)
Coding the following source code using CMOVs:
if (x == y) {
    a = c;
    y = x + 4;
}
else
    a++;

CMP     R1, R2, R3   /* R2, R3 hold x, y; result of CMP in R1 */
CMOVEQ  R4, R1, R5   /* R4 holds a; R5 holds c */
ADDL    R6, R2, #4
CMOVEQ  R3, R1, R6
ADDL    R7, R4, #1
CMOVNEQ R4, R1, R7
256
Using Predicated Execution to Remove Branch Instructions (contd.)
– Avoids branch instructions, but the CMOVEQs act like NOPs and one of the ADDLs is “wasted” if the “then” or “else” clauses are not valid. Note also that in addition to spurious ops (like one of the ADDLs), we also need additional registers to hold results (R6 or R7 in this case) that have to be moved conditionally.
– Does not save bubbles on misprediction but ensures smooth sequential instruction fetching pattern is maintained; avoids any potential delay for fetching the branch target that would be present in a traditional implementation without CMOVs.
– Really useful for coding small sections of conditionally executed code
– Similar facilities are provided in the DEC Alpha ISA and some
VLIW processors (LATER).
257
Improving the Accuracy of Dynamic Prediction: 2-level Branch Prediction
Exploits correlation between the branch in question and other branches that occurred before it (“global branch history”)
– this correlation is seen predominantly in integer/systems code (scientific code is more dominated by loop-closing branches)
– note that all dynamic predictors that we looked at thus far ignore this correlation.
Main idea: how control went to a branch instruction depends on other branches taken on the way.
Provides better accuracy than (independent) “per branch instruction” predictors discussed earlier
Example 1: This is from the gcc benchmark of the SPEC 92 (also SPEC 95) benchmark suite:
if (tem != 0)    /* branch b1 */
    y0 = tem;
if (y0 == 0)     /* branch b2 */
    return 0;
– Assume that the two “if”s are implemented using the branch instructions b1 and b2 as shown, and that the “then” clauses of the two “if”s are implemented as the fall-through part of these branches.
– These are really control dependencies that lead to data dependencies within a subsequent branch condition!
258
2-level Branch Prediction (continued)
– The possible control flow paths leading from b1 to b2 are:
[Figure: control flow paths from b1 to b2 – on the not-taken (n) path out of b1, y0 = tem (≠ 0) and the behavior of b2 is taken; on the taken (t) path, y0 is unchanged and the behavior of b2 depends on the value of y0]
– In this case, the behavior of b2 can be predicted from the way branch b1 actually behaved: if b1 is not taken then b2 is taken, indicating that the behaviors of b1 and b2 are correlated
Example2: This is from the eqntott benchmark (which is part of the SPEC 92 and SPEC 95 integer suites)
if (aa == 2)       /* branch b1 */
    aa = 0;        /* not taken part */
if (bb == 2)       /* branch b2 */
    bb = 0;        /* not taken part */
if (aa != bb) {    /* branch b3 */
    :              /* not taken part */
}
– Assume that the three "if"s are implemented using three branch instructions b1, b2 and b3 as shown, and that the "then" clauses of the three "if"s are implemented in the fall-through part of the branches b1, b2 and b3.
259
2-level Branch Prediction (continued)
– The possible control flow paths leading into the branch b3 and the values of the variables aa and bb just prior to processing b3 are:

[Figure: b1 branches (n/t) into b2, and b2 branches (n/t) into b3, giving four paths into b3]

  Branch path:   Values just prior to b3:
  nn             aa = 0,  bb = 0
  nt             aa = 0,  bb ≠ 2
  tn             aa ≠ 2,  bb = 0
  tt             aa ≠ 2,  bb ≠ 2
– In one out of the four possible ways of reaching b3 (using the path nn) the outcome of b3 can be predicted based on the outcome of b1 and b2
– In the other cases (b3 reached via paths nt, tn and tt), a good guess can be made on the likely outcome of b3 based on the outcome of b1 and b2 and expected values of aa and bb just prior to b3.
– In other words, there is a correlation among the three branches
Exploiting the branch correlation:
– Consider again the code of example 2. Assume now that this code is executed in a loop with the values for aa and bb as indicated in the table below just prior to the three branches. These input values are chosen somewhat randomly before each iteration, assuming that 0 ≤ aa, bb ≤ 2.
260
2-level Branch Prediction (continued)
– Assume that a 2-bit saturating counter is used to predict the outcome of b3, and this counter is initialized to zero
– The table below shows how this counter predicts the branch outcomes:

  Iter.  At start of iteration:   Path    Prediction   Actual     Counter after    Correct
  #      aa   bb   counter        to b3   for b3       behavior   b3 is resolved   prediction?
  0      2    2    0              nn      n            t          1                No
  1      0    2    1              tn      n            t          2                No
  2      2    1    2              nt      t            n          1                No
  3      1    0    1              tt      n            n          0                Yes
  4      2    2    0              nn      n            t          1                No
  5      0    2    1              tn      n            t          2                No
  6      2    1    2              nt      t            n          1                No
  7      0    2    1              tn      n            t          2                No
  8      1    1    2              tt      t            t          3                Yes
  9      1    2    3              tn      t            n          2                No
  10     0    2    2              tn      t            t          3                Yes
  11     2    2    3              nn      t            t          3                Yes
  12     1    2    3              tn      t            n          2                No
  13     2    1    2              nt      t            n          1                No
  14     0    1    1              tt      n            n          0                Yes
  15     2    0    0              nt      n            t          1                No
  16     2    1    1              nt      n            n          0                Yes
  17     2    2    0              nn      n            t          1                No
  18     2    2    1              nn      n            t          2                No
  19     1    1    2              tt      t            t          3                Yes
  20     0    2    3              tn      t            t          3                Yes
  21     2    0    3              nt      t            t          3                Yes
  22     1    2    3              tn      t            n          2                No
  23     0    1    2              tt      t            n          1                No
  24     2    2    1              nn      n            t          2                No
9 out of 25 executions of b3 are correctly predicted, leading to a branch prediction accuracy of 36%
261
2-level Branch Prediction (continued)
Notice, however, the branch patterns – i.e., actual behavior of b3 (in execution order) along individual branch paths to b3:
path nn:  t t t t t t
path nt:  n n n t n t
path tn:  t t t n t n t n
path tt:  n t n t n
There is a more predictable branch behavior for b3 if we separate out the behavior of b3 based on the path leading to b3, as shown above
Four different predictors for b3, one for each path into b3, can be used for this purpose. If four 2-bit saturating counters are used for each path, all initialized to 0, the number of correct predictions we get are:
predictor for path nn: 4 correct predictions (counter values before execution of b3 on this path are: 0-1-2-3-3-3)
predictor for path nt: 4 correct predictions (counter values before execution of b3 on this path are: 0-0-0-0-1-0)
predictor for path tn: 3 correct predictions (counter values before execution of b3 on this path are: 0-1-2-3-2-3-2-3)
predictor for path tt: 3 correct predictions (counter values before execution of b3 on this path are: 0-0-1-0-1)
– This leads to a total of 14 correct predictions, improving the overall prediction accuracy to 56% (vs. 36% obtained from a single 2-bit saturating counter predictor for b3): this improvement comes about from taking into account the behavior of other branches on the way to b3.
262
2-level Branch Prediction (continued)
This leads to the main idea behind two-level branch predictors:
– choose a per-branch-instruction predictor based on how control arrived at the branch via other branches on the way
– the latter information is global in nature and called the global branch history: it can be implemented as a shift register
Implementation:
[Figure: an m-bit Global History Register, shifted with the current branch outcome, selects one of the 2^m counter subarrays (numbered 0 to 2^m – 1); bits of the branch instruction address select the prediction counter within the selected subarray. Prediction logic and update paths NOT shown]
The m-bit global shift register can be used to select one of the 2^m counter subarrays.
The lower-order k bits of the branch instruction address can be used to select the relevant counter within the selected counter subarray.
Alternatively, bits of the branch instruction’s address and bits of the history register can be EX-ORed to locate the branch’s predictor: common in contemporary implementations.
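A sketch of the indexing just described; the values of m and k and the helper names are illustrative assumptions only.

    #include <stdbool.h>
    #include <stdint.h>

    #define M 4                                 /* global history bits (example) */
    #define K 6                                 /* PC index bits (example)       */

    static uint8_t  counters[1u << M][1u << K]; /* 2^m subarrays of 2^k counters */
    static uint32_t ghr;                        /* m-bit global history register */

    /* Locate the prediction counter for a branch: the global history selects the
       subarray, low-order instruction-address bits select the counter within it. */
    static uint8_t *select_counter(uint32_t branch_pc)
    {
        uint32_t subarray = ghr & ((1u << M) - 1);
        uint32_t index    = (branch_pc >> 2) & ((1u << K) - 1);
        return &counters[subarray][index];
    }

    /* Shift the actual branch outcome into the global history register. */
    static void ghr_update(bool taken)
    {
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
    }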
263
2-level Branch Prediction (continued)
Note that several branch instructions can share common counters – by increasing the number of locations per subarray (i.e., using a larger k) the degree of sharing can be minimized
A variation will be to use fully-associative or set-associative lookup within the selected subarray.
Total number of bits needed for the subarrays: 2^m × 2^k × n, where n = # of bits per counter
– This configuration is called a (m, n) correlating predictor – a more accurate classifier is (m, k, n)
– Logically, if the subarrays are viewed together as a single array of prediction counters, (m + k) bits, derived from the global shift register and the branch instruction address, are needed to access the n-bit counter from this unified array
Total # of bits in global history register = m
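As a quick sanity check on these sizes (the reported results below quote only m and n, so assume k = 4 purely for illustration): an (8, 2) configuration with k = 4 needs 2^8 × 2^4 × 2 = 8192 bits = 1 KByte of counter storage, which matches the 1 KByte unified array size quoted on the next foil.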
There are two extreme cases: none of these are really two-level in nature:
k = 0: In this case, the appropriate counter is selected based solely
on global branching history; the selected counter is also heavily shared
m = 0: In this case we have the classical BTB structure with uncorrelated prediction
264
2-level Branch Prediction (continued)
Reported performance: For (8,2) configuration, 2-bit saturating counter (i.e., n = 2), with a total unified array size of 1 K Byte, running SPEC benchmarks on IBM RISC 6000-like system:
– Prediction accuracy improves from about 82% for eqntott to 95% (classical, i.e., (0, 2) vs. (8, 2))
– Improvement – but less dramatic – for other int benchmarks (up to 5.4%)
– Little improvement for scientific benchmarks, as expected
When only the global shift register is used to select the counter (i.e., k = 0), significant performance gains are obtained only when m is large (m = 15 boosts prediction accuracy for eqntott by 14.3% to nearly 97%)
Other two-level predictors – see paper by Yeh & Patt: more complex design; no intuitive reasons for design
Who uses 2-level branch prediction: Intel P6, Pentium IIs – but no details have been published!
265
Reducing the impact of interference/conflicts in predictor tables
Main problem: if lower order bits of PC are used to index into the table that holds predictor info, several different branch instructions share a common predictor
– one branch thus affects the prediction of all branches that it shares its predictor with
Solution 1: make the predictor table large to reduce interference. Rationale: obvious!
Solution 2 – used in conjunction with global history: combine bits from branch instruction’s address and global history to generate the table index.
Rationale: Better to use information from two sources! Branches with some common history/address combinations share an entry; use of global info factors in context; note similarity with two-level scheme discussed earlier
Variations:
Ex-or bits of the branch instruction’s address and the global history bits and use the lower-order bits of the ex-or as the table index (“gshare”). Better than gselect (below) in deriving more info out of the individual components.
Concatenate lower-order bits of the branch instruction’s address and the history (“gselect”) to get the table index.
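The two index computations can be contrasted with a short C sketch; the table size and history length used here are illustrative assumptions.

    #include <stdint.h>

    #define HIST_BITS   8                  /* global history length (example) */
    #define TABLE_BITS 12                  /* log2 of predictor table entries */

    /* gshare: ex-or branch-address bits with the global history. */
    static uint32_t gshare_index(uint32_t branch_pc, uint32_t ghr)
    {
        return ((branch_pc >> 2) ^ ghr) & ((1u << TABLE_BITS) - 1);
    }

    /* gselect: concatenate low-order branch-address bits with the history. */
    static uint32_t gselect_index(uint32_t branch_pc, uint32_t ghr)
    {
        uint32_t pc_bits = (branch_pc >> 2) & ((1u << (TABLE_BITS - HIST_BITS)) - 1);
        return (pc_bits << HIST_BITS) | (ghr & ((1u << HIST_BITS) - 1));
    }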
266
Reducing Interference/Conflicts in Predictor Tables (contd.)
Solution 3 – used in conjunction with global history: use a naive indexing scheme (say, just the lower order bits of the branch instruction’s address) to index into a pattern table, where an entry (in the pattern table) records the global behavior of all branches that map into this entry, and then use the pattern table entry’s value to index into a predictor table:
[Figure: the branch PC indexes a Pattern Table; the value stored in the selected pattern table entry indexes the Predictor Array]
Rationale: global information is still combined – albeit indirectly – with the address of the branch. Preserving the global (=context) information is better!
Solution 4 – use multiple banks for the predictor arrays and use a dedicated hash function for each bank to map the branch instruction's address to an entry within each bank. The hash functions are chosen to guarantee that two distinct branch addresses map to different entries in at least one bank. Output from the banks may be chosen by a majority vote or by a chooser. Proposed for use in the Alpha EV8 implementation. This approach directly reduces interference arising from sharing predictors.
Other solutions: combine the PC and global history to index into the pattern table, select one of multiple pattern tables (say, based on global history bits and the branch PC) to locate the predictor index, etc.
267
Combined/Selective Branch Predictors
Motivations:
One type of predictor is not good enough:
– extremes of choices: completely branch-local info (e.g., 2-bit saturating counter) vs. global history based.
Want to keep using the predictor that demonstrates better accuracy, and switch to an alternative predictor if the currently chosen predictor fails to make correct predictions.
Selection mechanism – choosing among two predictors dynamically:
[Figure: an index derived from the branch PC and/or global history accesses Predictor 1, Predictor 2 and a chooser table (which tracks predictor accuracy); the individual predictions feed a MUX, and the chooser table entry selects which prediction is used]
Chooser table could be an array of 2-bit saturating counters; recall that such counters are good at tracking consistency in behavior and forgiving momentary inconsistencies.
Update a counter as follows:
– if predictor 1 predicts correctly, and predictor 2 predicts incorrectly, decrement saturating counter
– if predictor 2 predicts correctly, and predictor 1 predicts incorrectly, increment saturating counter
– if both predict correctly or both predict incorrectly, do not update
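Those update rules translate directly into code; in this sketch the chooser is a 2-bit saturating counter whose upper half (values 2 and 3) favors predictor 2, matching the increment/decrement convention above.

    #include <stdbool.h>
    #include <stdint.h>

    /* Final prediction: the most significant bit of the chooser counter picks
       the predictor (counter >= 2 selects predictor 2). */
    static bool choose_prediction(uint8_t chooser, bool pred1, bool pred2)
    {
        return (chooser >= 2) ? pred2 : pred1;
    }

    /* Update the chooser only when the two predictors differ in correctness. */
    static uint8_t chooser_update(uint8_t chooser, bool pred1, bool pred2, bool taken)
    {
        bool p1_ok = (pred1 == taken);
        bool p2_ok = (pred2 == taken);
        if (p1_ok && !p2_ok)                       /* move toward predictor 1 */
            return (chooser > 0) ? (uint8_t)(chooser - 1) : 0;
        if (p2_ok && !p1_ok)                       /* move toward predictor 2 */
            return (chooser < 3) ? (uint8_t)(chooser + 1) : 3;
        return chooser;                            /* both right or both wrong */
    }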
268
Combined Branch Predictors (contd.)
Choose among the predictors using the most significant bit of the counter obtained from the chooser table.
McFarling’s combined predictor: uses lower order bits in PC to index into chooser array; component predictors can be bimodal (2-bit saturating counter), gshare, gselect, pattern based.
Variation used in DEC/Compaq Alpha 21264: local and global bits access separate arrays:
[Figure: Alpha 21264 predictor – Local predictor: the lower 10 bits of the PC index a 1K-entry pattern table (10 bits/pattern), whose output indexes a 1K 3-bit prediction array; Global predictor: a 12-bit global history indexes 4K 2-bit counters (a bimodal predictor); Chooser: 4K 2-bit saturating counters drive a MUX that produces the chosen prediction. Update logic NOT shown]
Other styles of combined or multiple predictors:
Use static prediction, where directed by compiler (requires an additional bit in the branch instruction set by the compiler; can look at offset sign to predict) or use a dynamic predictor, including a selective predictor as described above.
Use a separate structure to predict return address of branches that implement calls – common practice in many high-end designs.
269
Other Techniques for Improving Branch Prediction Accuracy
Another way to remember how control flowed to a branch instruction: probe the BTB with the address of the predecessor of the branch instruction – used in the Hitachi Gmicro/100.
270
Speculative Execution
Main idea: execute instructions (& produce results) beyond (= after) branch instruction, speculating (= predicting) the branch direction
– architectural state is not updated till direction of branch is resolved
– goes well beyond what is achieved using delayed branching with
squashing
– can execute instructions speculatively along both possible paths following a branch and beyond more than one branch in some aggressive designs.
Speculation degree/distance: amount of processing done for speculatively executed instructions. Several possibilities:
– just fetch instructions speculatively
– fetch & decode instructions speculatively
– fetch, decode & dispatch instructions speculatively
– fetch, decode, dispatch, issue & execute instructions speculatively, forward results to waiting FUs/VFUs (but not update architectural state): this is commonly done in modern, high-end CPUs.
Level of speculation refers to the number of branch instructions in the speculated path beyond which instructions are speculatively executed. Restricted to a few levels at most.
Commonly used in modern superscalar CPUs.
271
Speculative Execution (continued)
Speculative execution along predicted branch direction:
[Figure: currently active* instructions along the predicted control flow path – I1, I2, B3, I4, I5, B6, I7, I8, I9, I10, I11, B12, I13, I14, I15]
* active = dispatched, in execution, waiting execution, waiting retirement
Here, instructions are speculatively executed along the predicted path following three unresolved branch instructions (B3, B6 and B12); the level of speculation is 3.
If it is discovered that B6 is mispredicted, instructions I7 through I15, including B12, have to be squashed.
Any mechanism that implements precise interrupts in hardware can be extended to support speculative execution: a mispredicted branch is treated like an instruction that has raised an exception, causing the rollback of all processing done after this branch.
This rollback has to be up to and including the earliest mispredicted branch in the window of active instructions.
272
Handling Speculative Execution with a ROB
Additional facilities needed:
A branch instruction stack, BIS, implemented as a circular FIFO queue, with pointers for the stack top and stack bottom. Both pointers move, as seen below:
[Figure: the BIS and the ROB, both implemented as circular FIFOs. The ROB holds, from ROB.head (the next instruction to be retired) to ROB.tail: I1, I2, B1, I8, I9, B2, I20, B3, I27, B4, I200, I201, along the predicted path (shown as a solid line); I405 is the instruction that follows B2 on the path that was not predicted. The BIS (indices around 12 through 17 are shown) holds one entry per uncommitted branch – B1, B2, B3 and B4 – with BIS.bottom at B1's entry and BIS.top at B4's entry; each BIS entry points to the ROB entry established for its branch.]
– As soon as a branch instruction is dispatched, an entry for this branch is pushed onto the BIS: the BIS entry points to the ROB entry established for the branch. BIS.top points to the topmost entry in the BIS; BIS.bottom points to entry at the bottom of the stack.
– As a branch instruction is committed, the BIS.bottom pointer is moved up (incremented in a circular fashion).
– At any time, the BIS has entries for all branch instructions that are executing or to be committed from the ROB.
– Figure shows when the branches B1, B2, B3 and B4 have been dispatched, with B1 as the earliest uncommitted branch.
273
Handling Speculative Execution with a ROB (contd.)
Assume the use of centralized Issue Queue (=IQ) where the entry for an instruction is tagged with the BIS index for the closest preceding branch instruction. This tag is called the branch tag. This index is the value of BIS.top at the time of dispatching the instruction. An associative addressing mechanism is used to clear IQ entries that match a BIS index used as a key.
Issued instructions carry along their branch tags; an instruction executing on a FU can be flushed by presenting a branch tag that matches its own.
Dispatched branch instructions also carry with them a pointer to their BIS entry (BIS pointer) as they go through the pipeline.
Dispatching a branch instruction:
Follow usual steps for dispatching a branch instruction, waiting if the BIS is full. Tag the entry for the dispatched branch instruction in the IQ with the value of the current BIS.top.
Increment BIS.top (in a circular fashion) and establish an entry for this branch instruction in the BIS, with the BIS entry pointing to the ROB entry for the branch just dispatched. Set the BIS pointer for the dispatched branch instruction with the current value of BIS.top.
Dispatching non-branch instructions:
Follow usual steps and tag IQ entry for the instruction with the current value of BIS.top.
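A simplified C sketch of this dispatch-side bookkeeping; the structure, its size and the helper names are assumptions made only for illustration (stall-on-full handling is omitted).

    #include <stdint.h>

    #define BIS_SIZE 16                      /* assumed number of BIS entries */

    struct bis_state {
        uint32_t rob_index[BIS_SIZE];        /* each entry points to the ROB entry of its branch */
        uint32_t top, bottom;                /* circular stack pointers */
    };

    static struct bis_state bis;

    static uint32_t bis_next(uint32_t i) { return (i + 1) % BIS_SIZE; }

    /* Dispatching any instruction: its IQ entry is tagged with the current
       BIS.top, i.e., the BIS index of the closest preceding branch. */
    static uint32_t branch_tag_for_dispatch(void)
    {
        return bis.top;
    }

    /* Dispatching a branch: after tagging its own IQ entry with the old BIS.top,
       bump BIS.top and record the branch's ROB entry; the new BIS.top is the
       BIS pointer that the branch carries down the pipeline. */
    static uint32_t bis_push_branch(uint32_t rob_entry)
    {
        bis.top = bis_next(bis.top);
        bis.rob_index[bis.top] = rob_entry;
        return bis.top;
    }

    /* Committing a branch: free its BIS entry by moving BIS.bottom up. */
    static void bis_commit_branch(void)
    {
        bis.bottom = bis_next(bis.bottom);
    }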
274
Handling Speculative Execution with a ROB (contd.)
Handling correctly predicted branch:
No special actions needed at the time of resolving the branch.
At the time of committing the ROB entry for the branch, increment
BIS.bottom (to free up the BIS entry).
Handling a mispredicted branch:
As soon as a branch misprediction is detected, the pointer to its BIS entry (BIS pointer, which flows along with the instruction after dispatch) is used to locate the ROB entry of the branch.
All ROB entries starting with the one following the mispredicted branch's entry and ending with the entry at the tail of the ROB are flushed from the ROB: this is simply done by setting ROB.tail to point to the ROB entry immediately following that of the mispredicted branch in program order.
Example: if B2’s misprediction is discovered, ROB.tail is set to point to the ROB entry for I20 (see figure). The entry for B2 is still kept on the ROB and BIS. I20’s entry will be replaced with that for the instruction (I405 in this example) that follows B2 in the correct path (the path that was not predicted).
All instructions in the IQ or that are currently executing are flushed by using the branch tags (=BIS indices) starting with the value of BIS.top through and including the BIS entry for the mispredicted branch. This is done by generating a succession of branch tag values, starting with BIS.top and decrementing it, till it points to the BIS entry for the mispredicted branch.
275
Handling Speculative Execution with a ROB (contd.)
Example: The branch tags generated for flushing instructions from the IQ or the FUs when B2 is mispredicted are, in succession:
BIS.top, BIS.top – 1, BIS.top – 2 (16, 15 and 14, respectively)
All instructions in the fetch and decode/rename/dispatch stages are flushed.
BIS.top is updated to point to the entry of the mispredicted branch.
The mispredicted branch is re-dispatched (after updating the BTB etc.)
Note that when a branch instruction is mispredicted, the scheme described flushes all instructions that were dispatched following this branch in program order.
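A sketch of this recovery sequence (self-contained for illustration: the two stub functions stand in for the real IQ/FU squash logic and ROB truncation, and BIS_SIZE is the same assumed constant as in the earlier sketch):

    #include <stdint.h>

    #define BIS_SIZE 16                              /* assumed, as before */

    static uint32_t bis_top;                         /* BIS.top */

    /* Stand-ins for the real squash and ROB-truncation logic. */
    static void flush_by_tag(uint32_t branch_tag)  { (void)branch_tag; }
    static void rob_set_tail(uint32_t rob_index)   { (void)rob_index;  }

    /* Recover from a mispredicted branch whose BIS pointer is bad_bis; the ROB
       entry immediately following the branch (in program order) is passed in. */
    static void recover_from_misprediction(uint32_t bad_bis,
                                           uint32_t rob_entry_after_branch)
    {
        /* Generate branch tags from BIS.top down to (and including) the
           mispredicted branch's BIS entry, squashing matching IQ/FU entries. */
        for (uint32_t tag = bis_top; ; tag = (tag + BIS_SIZE - 1) % BIS_SIZE) {
            flush_by_tag(tag);
            if (tag == bad_bis)
                break;
        }
        bis_top = bad_bis;                    /* BIS.top -> mispredicted branch */
        rob_set_tail(rob_entry_after_branch); /* ROB.tail -> entry after branch */
        /* The fetch and decode/rename/dispatch stages are flushed, the BTB is
           updated, and the branch is re-dispatched, as described above. */
    }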
Note similarity with the handling of exceptions (the precise interrupt mechanism).
276
Register Renaming & Speculative Execution
In a machine using register renaming, the handling of the mispredicted branch requires the following:
a) Physical registers allocated to all instructions dispatched following the mispredicted branch have to be deallocated.
b) The rename table must be restored to its state at the time of dispatching the mispredicted branch instruction.
Implementing Condition (a) – several choices:
In a machine integrating physical registers into the ROB entry or a machine that uses rename buffers arranged in a circular FIFO (like the ROB) (Datapath variations 1 and 2 on pages 191-192):
– Done as described on Page 264 – simply move pointers within the ROB and the circular FIFO of rename buffers from the dispatch end (ROB.tail) to the entry corresponding to the mispredicted branch.
In a machine using rename buffers where the rename buffers are managed as a linked list (datapath variation 2, pages 191-192) or in a datapath integrating architectural registers and physical registers (datapath variation 3):
– Alternative 1: Walk back the ROB from the dispatching end (ROB.tail) to the entry of the mispredicted branch and for every ROB entry so walked, add back the physical register to the free list of physical registers
– Alternative 2: Save the state of (= “checkpoint”) the free list at the point of dispatching a branch into the BIS (easy to do when the list is implemented as a bit vector). Restore the state of the free list from what was saved with the BIS entry for the mispredicted branch.
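A sketch of that checkpointing choice: when the free list is a bit vector, saving and restoring it with a BIS entry is a single copy (sizes and names are illustrative assumptions).

    #include <stdint.h>

    #define BIS_ENTRIES 16                     /* assumed number of BIS entries */
    typedef uint64_t freelist_t;               /* one bit per physical register */

    static freelist_t free_list;                        /* current free list         */
    static freelist_t freelist_checkpoint[BIS_ENTRIES]; /* one saved copy per branch */

    /* At branch dispatch: checkpoint the free list into the branch's BIS entry. */
    static void checkpoint_free_list(uint32_t bis_index)
    {
        freelist_checkpoint[bis_index] = free_list;
    }

    /* On a misprediction: restore the free list saved with the mispredicted
       branch's BIS entry, which returns the registers allocated after it. */
    static void restore_free_list(uint32_t bis_index)
    {
        free_list = freelist_checkpoint[bis_index];
    }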
277
Register Renaming & Speculative Execution (contd.)
Implementing Condition (b) – restoration of the rename table – several possibilities:
Alternative 1: Applicable to a machine that integrates physical registers and architectural registers (datapath variation 3, pages 191-192):
– Recall use of two rename tables: normal rename table and “retirement RAT” pointing to committed values of architectural registers (Page 192).
– Stop instruction dispatching on discovering a misprediction
– Allow instructions ahead of the mispredicted branch to commit
– When the mispredicted branch reaches the commit end of the ROB, copy the retirement rename table ("retirement RAT") into the rename table: this restores the machine to a precise state. Renaming starts normally from this point onwards and dispatching is resumed with the mispredicted branch.
– Note how this treats a mispredicted branch like an instruction that generated an exception.
– Recovery from misprediction can be slowed down by a long latency instruction that precedes the mispredicted branch.
278
Register Renaming & Speculative Execution (contd.)
Alternative 2: Checkpoint the rename table – works with all datapath variations
– Save the rename table just prior to dispatching a branch (or periodically).
– Make the BIS entry for the branch point to the saved rename table.
– On a misprediction, restore the rename table from the saved rename table, as pointed to by the BIS entry of the mispredicted branch.
– Requires a significant amount of storage to save rename table for every speculatively dispatched branch.
Alternative 3: Improvement to Alternative 1 – use retirement RAT and walk back from the commit end of the ROB to the entry for the mispredicted branch:
– Applicable to datapaths integrating the physical registers and the architectural registers (Variation3, Pages 191-192).
– Stop dispatching and copy the contents of the retirement RAT into the normal rename table.
– Start with the ROB entry at the commit end (ROB.head): let A be the destination architectural register of the instruction allocated to this entry and let P be the physical register allocated to this entry.
– Update the rename table entry for A with P
279
Register Renaming & Speculative Execution (contd.)
– Walk back to the entry of the mispredicted branch and process all ROB entries encountered similarly.
– Rename table now holds the mappings that existed at the time of dispatching the mispredicted branch.
– Resume dispatching with mispredicted branch.
– Recovery can be faster compared to that of Alternative 1: the resumption of dispatches is no longer held up till the commitment of all instructions preceding the mispredicted branch.
Alternative 4: use normal rename table and walk forward – works with all datapath variations:
– Save old state of the rename table entry modified by an instruction at the time of its dispatch – either in its ROB entry or in a separate history buffer.
– On discovering a misprediction, stop dispatching and walk forward from the most recent entry established in the ROB to the entry for the mispredicted instruction.
– For each ROB entry encountered, restore the rename table entry updated by the instruction – using the saved rename table entry. This restores the rename table to the state that existed at the time just prior to the dispatch of the mispredicted branch.
– Resume dispatching with the mispredicted branch after this.
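A minimal sketch of the walk-forward restoration in Alternative 4 above; the ROB entry layout and names (ROBEntry, walk_forward_recover) are assumptions for illustration only:

    from collections import namedtuple

    # Each ROB entry carries the old rename-table mapping it overwrote.
    ROBEntry = namedtuple("ROBEntry", "dest_areg new_preg old_preg is_branch")

    def walk_forward_recover(rob, rename_table, free_list, branch_index):
        """rob: list ordered oldest..newest; branch_index: ROB slot of the
        mispredicted branch.  Walk from the newest entry back to the entry
        just after the branch, undoing each mapping and reclaiming the
        physical register it had allocated."""
        for idx in range(len(rob) - 1, branch_index, -1):
            e = rob[idx]
            if e.dest_areg is not None:
                rename_table[e.dest_areg] = e.old_preg   # undo the mapping
                free_list.add(e.new_preg)                # reclaim the physical reg
            del rob[idx]                                 # squash the entry
        return rename_table, free_list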
280
RENAME TABLE RESTORATION ON BRANCH MISPREDICTIONS:
Assume for these examples a fictitious machine with 4 architectural registers, R0 through R3. Assume that there are 10 physical registers, P0 through P9. Assume identity mapping in the front end RAT (or rename table) before we execute the following code fragment (source operands/regs not listed):
    ADD  R1, ...    /* physical reg P4 assigned to R1 */
    MUL  R0, ...    /* P5 assigned to R0 */
    AND  R1, ...    /* P6 assigned to R1 */
    BZ   ...        /* speculatively executed branch that gets mispredicted */
    ADDC R2, ...    /* P7 assigned to R2 */
    SUB  R1, ...    /* P8 assigned to R1 */
Assume further that at the time the misprediction is detected all of the above instructions have dis- patched and the ADD is about to be committed. The contents of the rename table (= front-end RAT) evolve as shown below.
    Rename Table initial contents:                 R0→P0, R1→P1, R2→P2, R3→P3
    Rename Table at the point of dispatching BZ:   R0→P5, R1→P6, R2→P2, R3→P3   (overwritten values: P0 for R0; P1, P4 for R1)
    Rename Table contents after dispatching SUB:   R0→P5, R1→P8, R2→P7, R3→P3
The contents of the ROB (destination architectural register id and destination physical register of each entry) at the time the misprediction is detected are:

    entry to retire (commit end)   ADD    arch reg R1, phy reg P4
                                   MUL    arch reg R0, phy reg P5
                                   AND    arch reg R1, phy reg P6
                                   BZ     –,           –
                                   ADDC   arch reg R2, phy reg P7
    most recent entry made         SUB    arch reg R1, phy reg P8
    (other fields of the entries are not shown)
281
Alternative 1: Treat mispredicted branch as an exception. Restore rename table by copying valid entries from the retirement RAT to the front end RAT (FE-RAT) after all instructions ahead of BZ are committed:
    Retirement RAT (R-RAT) after committing the instructions preceding BZ:        R0→P5, R1→P6, R2→–,  R3→–
    Restored front-end RAT (valid entries from the R-RAT copied to the FE-RAT):   R0→P5, R1→P6, R2→P2, R3→P3
The R-RAT entries for R2 and R3 are not copied to the FE-RAT, as they are not initialized.
Alternative 2: Checkpoint the rename table – works with all datapath variations – here the rename table at the point of dispatching BZ is saved.
Alternative 3: Improvement to Alternative 1 – use retirement RAT and walk back from the commit end of the ROB to the entry for the mispredicted branch:
    R-RAT just before committing ADD:             R0→–,  R1→–,  R2→–, R3→–
    Walk-back updates:
      R-RAT updated with the mapping for ADD:     R0→–,  R1→P4, R2→–, R3→–
      R-RAT updated with the mapping for MUL:     R0→P5, R1→P4, R2→–, R3→–
      R-RAT updated with the mapping for AND:     R0→P5, R1→P6, R2→–, R3→–
    Valid R-RAT entries copied to the FE-RAT
      (as in Alt. 1):                             R0→P5, R1→P6, R2→P2, R3→P3
282
Alternative 4: use the normal rename table and walk forward – works with all datapath variations. The old rename table mappings are saved in the ROB entries. The ROB contents at the point of detecting the misprediction are:

    entry to retire                ADD    arch reg R1, phy reg P4, old mapping P1
                                   MUL    arch reg R0, phy reg P5, old mapping P0
                                   AND    arch reg R1, phy reg P6, old mapping P4
                                   BZ     –,           –,          –
                                   ADDC   arch reg R2, phy reg P7, old mapping P2
    most recent entry made         SUB    arch reg R1, phy reg P8, old mapping P6
    (other fields of the entries are not shown)

Rename table (RAT) updates on walking forward:

    RAT just after dispatching SUB:                                   R0→P5, R1→P8, R2→P7, R3→P3
    RAT updated with the old mapping overwritten by SUB (free P8):    R0→P5, R1→P6, R2→P7, R3→P3
    RAT updated with the old mapping overwritten by ADDC (free P7):   R0→P5, R1→P6, R2→P2, R3→P3
    No updates for BZ – this is the restored RAT:                     R0→P5, R1→P6, R2→P2, R3→P3
283
Register Renaming & Speculative Execution (contd.) Restoring the rename table – general comments:
Checkpointing is the fastest scheme – hardware implementation used should avoid entry-by-entry copying.
Example: fast, parallel restoration similar to that used in the AMD K6:
– One entry in the rename table permits four checkpoints, “a” through “d”; each entry uses 7 bits (bits 6 through 0) to name a destination ….
– For each bit, there are 4 values – bits “a” through “d”, corresponding to the 4 checkpoints. The bits holding these 4 values are set up logically as a circular shift register, as indicated by the arrows
– The current rename table entry is made up of bits of the same type, either “a” or “b” or “c” or “d”. A column multiplexer is used to point to the right type of bits
– If the “a” bits correspond to the current checkpoint, the next checkpoint will use the “b” bits, the one after that the “c” bits and so on
– Restoration: simply consists of shifting all entries by the desired amount – in reality, we do not even need to shift the bits – if the “c” bits made up the current rename table and we need to move back to checkpoint “a”, we simply set the column multiplexers to point to the “a” columns.
[Figure: one 7-bit rename table entry, with each bit position (6 down to 0) stored as four copies – 6a 6b 6c 6d, 5a 5b 5c 5d, ..., 0a 0b 0c 0d – one copy per checkpoint; the four copies of each bit are wired as a circular shift register, and column multiplexers select the copies (“a”, “b”, “c” or “d”) that form the current rename table.]
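A rough software model of the externally visible effect of the scheme above; the column copy at checkpoint time is only for the sketch (the real hardware avoids any copying or data movement thanks to the circular-shift wiring), and all names and sizes are illustrative assumptions:

    NUM_CHECKPOINTS = 4
    NUM_AREGS = 8        # assumed number of architectural registers

    # columns[c][areg] = mapping stored in checkpoint column c ("a".."d")
    columns = [[None] * NUM_AREGS for _ in range(NUM_CHECKPOINTS)]
    current = 0          # column multiplexer selection

    def dispatch_branch_checkpoint():
        """Preserve the current column and let future renames use the next one."""
        global current
        saved = current                           # this column now holds the checkpoint
        nxt = (current + 1) % NUM_CHECKPOINTS
        columns[nxt] = list(columns[current])     # sketch only: real hw never copies
        current = nxt
        return saved                              # recorded in the branch's BIS entry

    def restore_checkpoint(saved):
        """Restoration = simply re-pointing the column multiplexer."""
        global current
        current = saved

    def rename(areg, preg):
        columns[current][areg] = preg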
284
Register Renaming & Speculative Execution (contd.)
– Restoration of all entries is thus performed in parallel and is very fast – it takes only as long as setting the column multiplexers’ selection bits. Note also that no data movement takes place during a restoration – this saves power.
– Practically feasible only for ISAs that have a few architectural registers, like the X’86.
Both walking techniques are inherently sequential in nature and slow: this can be a serious performance impediment in superscalar datapaths.
The performances of the walking techniques are dependent on the position of the mispredicted branch’s entry in the ROB.
Additional storage requirement (highest to lowest, not counting storage needed by retirement RAT):
– rename table checkpointing (most) – Alternative 2
– rename table + forward walking – Alternative 4
– retirement RAT + backward walk (Alternative 3) and Alternative 1.
Rename table restoration technique chosen can influence the choice of the technique used for physical register restoration.
285
Getting Rid of the Rename Table: Associatively Addressed ROB
Motivation: Rename tables – current and saved versions – require a substantial amount of storage if several levels of speculative execution are supported.
ROB modified as follows:
ROB entries also double as physical registers.
The ROB permits an associative search for locating the most recently established entry that updates a specific architectural register (i.e., associative addressing locates the most recent instance for an architectural register). (ROB entries have a type field to identify entries for instructions that update a register.)
search key: architectural register id
search results: a match or no-match indication; a match returns the contents of (i) the result field, (ii) the status of the result field and (iii) the index of the ROB entry.
Dispatch steps (assumes a centralized issue queue, IQ):
1. Wait until a free ROB entry and an IQ entry are available.
2. For every source architectural register for the instruction dispatched, do the following:
(a) Associatively address the ROB to locate the most recent entry for the source architectural register.
(b) On a match, if the returned result is valid (status = valid), the source operand value as returned is used as before.
286
Getting Rid of the Rename Table: Associatively Addressed ROB (contd.)
(c) On a match, if the returned result is not valid, the returned ROB index (which is the analog of the physical register id, as obtained from a rename table) is used to set up the forwarding path that would eventually bring the result to the dispatched instruction.
(d) On a no-match, the value of the architectural register is read out from the architectural register file (ARF).
3. Instruction is dispatched to the IQ. The index of ROB slot for the instruction dispatched is written to the IQ entry – this index is used to eventually write the result and also to forward the result to instructions that are awaiting the result.
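A minimal sketch of the dispatch-time source-operand lookup in steps 2(a)–(d) and 3 above; the entry fields and the function name lookup_source are assumptions used only for illustration:

    def lookup_source(rob, arf, src_areg):
        """rob: list ordered oldest..newest of dicts with keys
        'dest_areg', 'result', 'result_valid'; arf: architectural reg file."""
        for idx in range(len(rob) - 1, -1, -1):          # associative search, newest first
            entry = rob[idx]
            if entry.get("dest_areg") == src_areg:       # match
                if entry["result_valid"]:
                    return ("value", entry["result"])    # use the returned result directly
                return ("tag", idx)                      # wait on forwarding from ROB slot idx
        return ("value", arf[src_areg])                  # no match: read the ARF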
Technique described is used in the Pentium II and III implementations.
Optimization: the ROB is probed and the ARF is read in parallel; the value read from the ARF is used only in the no-match case and is discarded otherwise.
Limitation: associative addressing delay (and hardware expenses) may become prohibitive as the ROB size increases. Not used in current designs.
287
CLASS NOTES/FOILS:
CS 520: Computer Architecture & Organization
Part IV: Memory Systems
Dr. Kanad Ghose ghose@cs.binghamton.edu http://www.cs.binghamton.edu/~ghose
Department of Computer Science State University of New York Binghamton, NY 13902-6000
All material in this set of notes and foils authored by Kanad Ghose 1997-2018 and 2019 by Kanad Ghose
Any Reproduction, Distribution and Use Without Explicit Permission from the Author is Strictly Forbidden
CS 520 – Fall 2020
288
The Memory Hierarchy
Hierarchy based on:
Speed: access latency, also called access delay or simply, delay
Cost/bit
The hierarchy of randomly-accessible storage devices (the first three rows are on-chip storage resources):

    STORAGE (fastest to slowest)               DELAY: Absolute                        DELAY: in CPU clocks
    Isolated latches                           a few psecs. to a few nsecs.           << 1 CPU clock
    Registers / Small RAM                      a few 10s of psecs. to a few nsecs.    < 1 CPU clock
    Large on-chip SRAM                         a few nsecs.                           < 1 to a few CPU clocks
    Off-chip SRAM                              a few nsecs. to 20 nsecs.
    Off-chip DRAM                              30 nsecs. and higher
    NEW non-volatile: 3D XPoint (“Optane”),
      P-RAM, STT-RAM, M-RAM                    (* price points unclear)
    Flash storage                              tens of microsecs., slower writes
    Magnetic disks                             6 to 14 msecs.
    Optical disks                              10 to 20 msecs.

The newest non-volatile memory technology shown is Toggle M-RAM (magnetic RAM); ReRAM (resistive RAM) and some others are not shown. M-RAM is becoming a close contender to DRAM from a performance perspective; its cost remains unclear.
289
Some Non-volatile Memory Technologies and DRAM
Non-volatile memory devices retain contents even if power is shut off. Flash and hard disk drives are common examples, but many other solid-state (that is, made using semiconductor technology, no moving parts) non-volatile randomly-ac- cessible (NVRAM) memory technologies are available.
Generally, solid-state NVRAM devices fail after a specific number of writes to a memory cell. The number of writes (in cycles) after which failure occurs is called the write endurance. Flash memory devices used in SSD drives use internal wear-levelling to improve overall write endurance (LATER).
Write speed and write endurance are important metrics for NVRAMs. Here’s how they stack up:
SRAM has a write speed of a few nsecs. and supports a practically unlimited number of writes (close to 10 Quintillion writes = 10 followed by 18 zeros)
DRAM has a write speed of several tens of nsecs. and has a write endurance of 10 Quadrillion cycles (= 10 followed by 15 zeros)
Toggle MRAM has a write time that is only about 2 to 3 times of DRAM and has a write endurance very close to that of SRAM
Optane, ReRAM, PCRAM (Phase change RAM) have write speeds of 100s of microseconds and have a write endurance of 10 million to 10 trillion cycles
Flash memory has a write speed approaching a msec. and a write endurance of 10,000 cycles.
290
The Memory Hierarchy (continued)
The other aspect of memory performance: bandwidth
Examples:
Register file bandwidth for a 4-issue per cycle superscalar CPU running at 3 GHz.:
- 8 operands, 32 bits wide, per cycle => BW = 768 Gbits/sec. (= 96 GBytes/sec.)
– Double this for 64-bit datapaths
CPU I/O pin bandwidth: 128 data I/O pins, 1 GHz. external memory bus: 128 Gbits/sec. – this is the best data rate one can realize from off-chip memory devices. Can double this by using two transfers per clock (one per clock edge, as in DDR memory interfaces)
PCIe (Gen 3) connections: BW depends on the number of parallel lanes
Typical disk transfer rate: several 10s to 125 MBytes/sec. (single disk unit, sustained data transfer rate), peak data rates: 6 Gbits/sec. (SAS drives)
DRAM bandwidth: regular DRAM chips –
– Single byte access (“byte-wide” DRAM chip): 1 byte/60 nsecs.
– Internal BW based on a 1024-bit row size: 1K to 4K bits/40 nsecs.
– Significant loss in data rate occurs as only a byte gets selected internally within the chip and gets sent out!
– Burst or page mode data rate from DRAM is faster (roughly about 2 times individual byte access rates)
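A quick, illustrative arithmetic check of the register-file and pin-bandwidth examples above (the numbers simply restate those examples):

    GIGA = 1e9

    # Register file: 4-issue superscalar at 3 GHz, 8 operands x 32 bits per cycle.
    rf_bits_per_cycle = 8 * 32
    rf_bw_bits = rf_bits_per_cycle * 3 * GIGA             # 768 Gbits/s
    print(rf_bw_bits / GIGA, "Gbits/s =", rf_bw_bits / (8 * GIGA), "GBytes/s")

    # CPU pin bandwidth: 128 data I/O pins on a 1 GHz memory bus, 1 bit/pin/transfer.
    pin_bw_bits = 128 * 1 * GIGA                          # 128 Gbits/s
    print(pin_bw_bits / GIGA, "Gbits/s; with DDR (2 transfers/clock):",
          2 * pin_bw_bits / GIGA, "Gbits/s")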
291
Some Characteristics of Memory Traffic
In a RISC ISA, roughly 15% to 20% of the instructions processed are loads and stores.
For a CISC ISA like the X’86, the percentage of memory instructions is higher – about 35% of the instructions are loads and stores, if the CISC instructions are broken down into RISC-like equivalents.
Loads outnumber stores by a ratio of 2:1, typically.
Traffic directed at the external memory is usually bursty in nature,
because of locality of reference and the presence of on-chip caches.
In a high end processor, a series of cache misses can occur in rapid succession.
Challenge: smooth out memory traffic – improves overall performance of memory system.
In many scientific applications, memory locations separated by a constant distance (“stride”) are accessed. This is also true for graphics and media applications.
292
The Processor-Memory Performance Gap
Alludes to ability of memory to supply data to processor as and when needed.
Both data rate (bandwidth, BW) and latency (time to get data) matter. The effective memory data transfer time is approximately given by:
Flat overhead + (data_size/BW)
The flat overhead comes from the time taken to send the address to memory, activate the memory to initiate the transfer, and the delays of the bus and I/O pins.
BW can be improved by widening the external bus and using memory devices in parallel. This sounds simple, but there are engineering limits imposed by the limited number of I/O pins and limits within the memory devices.
Much more challenging: reducing the flat overhead.
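To make the point concrete, a small illustrative calculation of the model above; the 40 ns overhead, 64-byte block and 16 bytes/ns bandwidth are assumed example values, not figures from the notes:

    def transfer_time(flat_overhead_ns, data_size_bytes, bw_bytes_per_ns):
        # effective transfer time = flat overhead + data_size / BW
        return flat_overhead_ns + data_size_bytes / bw_bytes_per_ns

    # Example: 40 ns flat overhead, 64-byte cache block, 16 bytes/ns (128 Gbits/s).
    print(transfer_time(40.0, 64, 16.0), "ns")   # 44.0 ns - the flat overhead dominates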
293
Techniques for Improving Memory Performance
Memory interleaving
Cache memory
Newer memory interfaces
Prefetching
Stream buffers
Store bypassing by loads and predicting store bypassing (“dynamic memory disambiguation”)
Cache miss prediction
Simultaneous multithreading
294
RAM Devices
Basic storage component: bitcell – one bit of storage:
Static RAM (SRAM) bitcell = flip-flop (back-to-back connected
inverters)
Dynamic (DRAM) bitcell = MOS capacitor: charged => 1, discharged => 0
The RAM as a black box:
a
Address Control
2a words, d bits/ : a word
Bit Array
(2 X d RAM)
d
Data in/out
295
RAM Devices (contd.)
Steps involved in reading out a word from a SRAM: apply the address, apply the read and output enable signal, wait for the duration of the read access time and latch in the data put out on the data output line:
[Timing diagram: a valid address is applied on the address bus; the falling edge of address_strobe informs the SRAM that a new address has been issued; r/w indicates to the SRAM that the request is a read; oe enables the SRAM to output the contents of the addressed location on the data bus; the SRAM drives valid data onto the data bus after the read access time.]
– Note that the interface is asynchronous. A memory controller is used to generate these signals from the CPU side
In a conventional SRAM chip, the address_strobe line has to be pulsed low every time a new address is applied: this is because of the asynchronous nature of the interface
Steps involved for a write are similar.
The read and write access times are almost identical for semiconductor SRAMs.
296
RAM Devices (contd.) Simplified SRAM organization:
Bitcells are arranged in 2-dimensional array:
[Figure: simplified SRAM organization – a row address (r bits) drives the row address decoder/driver, which selects one of rows 0 through 2^r – 1 of the bitcell array via a word line per row; bit line drivers place the data being written onto the bit lines, and bit line sense amps recover the data being read out; each bitcell is a latch connected to its bit line pair through switches Sa and Sb under read/write control.]
Bitcells in a row share a word line
Bitcells in a column share a pair of bit lines carrying complementary signals (bit line and its complement)
For a single-ported SRAM, only one row of bitcells is selected for access – this is done by driving the word select line for the row to a high state: this has the effect of turning on the switches Sa and Sb, connecting the bitcell to the complementary bit lines.
297
RAM Devices (contd.)
If the access operation is to read out the contents of the bitcell, the contents of the bitcell are driven onto the bit and bit lines
If the access operation is to write into the bitcell, the data to be written is driven onto the bitline (and the complement of this data is driven onto the bit line) – this latches the data to be written into the bitcell
Typical SRAM access times: few nsecs. to tens of nsecs., sizes to 256K X 18 per chip.
The various register files internal to a CPU chip are actually on-chip SRAMs with multiple ports.
What are the components of the SRAM access time?
[Figure: the SRAM organization redrawn with the longest delay path highlighted (in gray): row address → row address decoder/driver → word line across the selected row → bit lines down the column → bit line sense amps → data read out.]
298
RAM Devices (contd.)
Between the application of an address and control signals and the time the data is read out, the delays in the path shown are:
t_decoder = the delay of the decoder and the driver of the word lines: this is proportional to the # of address bits (a)
t_row_line = the delay of the row wire – from the decoder to the farthest end: this is proportional to the # of bits/word (d)
t_bit_line = the delay of the bit lines – from the row farthest from the sense amp to the sense amp: this is proportional to the number of words (2^a)
t_sense = the delay of the sense amp
The read access time is thus (t_decoder + t_row_line + t_bit_line + t_sense)
Note how the access time depends on the dimensions of the array and why “smaller is faster” holds in this case.
– Possible ways to speed up the SRAM access time:
(a) Reduce the array dimensions
(b) Use smaller transistors (i.e., a more aggressive fabrication process): this reduces the various proportionality constants.
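A minimal sketch of the access-time model above; the proportionality constants c_dec, c_row, c_bit and the sense-amp delay are made-up values used only to show how the array dimensions enter the delay:

    def sram_read_access_time(a, d, c_dec=5.0, c_row=0.02, c_bit=0.001, t_sense=50.0):
        t_decoder  = c_dec * a          # proportional to # of address bits
        t_row_line = c_row * d          # proportional to # of bits per word
        t_bit_line = c_bit * (2 ** a)   # proportional to # of words (2^a)
        return t_decoder + t_row_line + t_bit_line + t_sense   # arbitrary time units

    # "Smaller is faster": halving the number of words (a -> a-1) halves t_bit_line.
    print(sram_read_access_time(a=14, d=64), sram_read_access_time(a=13, d=64))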
299
RAM Devices (contd.)
Pipelined burst mode SRAMs have been introduced to allow a sequence of memory locations to be read out (or written to) consecutively without the need to repeatedly assert the control lines (address strobe, read/write, output enable) before every access.
– This is accomplished by using a synchronous interface (which also eliminates the need for the address strobe line). The net result is that data is accessed at a faster rate compared to ordinary SRAMs:
[Timing diagram: the bus clock runs continuously; Address 1, Address 2, Address 3 are applied on successive clock edges; r/w indicates that the requests are reads and oe enables the data outputs; the SRAM drives Data 1, Data 2, Data 3 onto the data bus on subsequent clock edges.]
Although not shown here, the time taken to get the contents of the first address onto the data bus may take a little longer (typically 2 or more clock cycles); subsequent data items come out at a regular rate (typically one per bus clock).
The high data rate is achieved by breaking down the unified array into banks and interleaving accesses to the banks (LATER!)
Potential applicability to pipelined CPUs: on-chip cache misses (that require several consecutive memory words to be fetched) can be serviced quickly using the burst mode
300
RAM Devices (contd.)
Register Files in Modern Datapaths
Implemented as relatively small multiported SRAM. (SRAM design is used for its speed.)
Multiported register files are also used to implement other components of modern datapaths such as:
– Physical register files
– Register alias tables
– Reorder buffers
– Repository of status info for scoreboards
– Issue Queue and RSEs (each word has additional associated logic, e.g., for tag matching)
– Instruction buffers (in between memory and the I-cache or between I-cache and decoder logic)
Delay impacted by the number of words in the register file, the number of bits/word etc., exactly as in static RAMs.
Area and delays increase with number of register file ports.
General design strategy: limit register file (RF) accesses to at most one per pipeline stage if possible. Where needed, overlap accesses to multiple RFs.
301
RAM Devices (contd.)
Partitioning Register Files (RFs) to Avoid Pipeline Bottlenecks:
With increasing numbers of FUs and physical registers, two delay components increase:
Register file access time increases (bitwire delay, decoder delay, delays due to use of large number of ports etc. all go up).
Wire delays for the connections between the FUs and the register file go up.
– Net effect: register access time may form a critical path, affecting pipeline clock cycle time adversely.
Solution 1 – traditional: several approaches:
Break down the physical registers with N rows into K smaller portions (segments or modules), each with N/K rows. Each module is an independent RF. Pre-decode the address to first identify the required module and then perform the access within the module. Ports with the same number on all modules share a common bus. Often called a multi-banked RF design.
Use a sub-banked design: break down rows in the RF into independent columns of adjacent bits, called subbanks. Each subbank has its own word line and driver and shares sensing and word-line driving logic with adjacent subbanks. Accesses are confined to subbanks of interest. This reduces word-line delays.
302
RAM Devices (contd.)
Solution 2 – Clustered Datapath: partition physical register file, and dedicate a partition to a small number of FUs:
– Reduces number of register file ports needed
– Also reduces length (& delay) of interconnection between FUs and RF.
– In effect, register files and FUs that use them predominantly are clustered. A cluster refers to a group of FUs and the local RF that they share
Accesses to a non-local or remote RF by a FU usually costs an additional cycle. To avoid performance losses, accesses by a FU must be confined predominantly to its local RF. Careful mapping of source and destination registers may be needed here: ideally, an instruction producing a value and its consumers (i.e., flow-dependent instructions) must be mapped to the same cluster.
Register renaming is almost mandated with the use of a clustered design to simplify the process of allocating instructions to clusters.
Example: DEC 21264 and 21364 CPUs use a clustered datapath.
Solution 3 – other micro-architectural level solutions: several possible, all exploit usage patterns of data, locality of values, short lifetimes of values etc.
303
RAM Devices (contd.) Simplified DRAM organization:
Logically DRAMs are organized like SRAMs – the one distinction is that each row is quite long internally to the chip.
The address of the data to be read out is supplied in two parts:
– The row address is first driven to the DRAM chip and the row
is read out into an internal row buffer
– The column address is then driven, and is used to select the required data from the line in the row buffer using a multiplexer
[Figure: accessing a byte-wide DRAM – control inputs and a multiplexed address input enter the DRAM chip; the row address is captured in a row address buffer that feeds the row driver, and the column address is captured in a column address buffer that feeds the column multiplexer; the selected row is read into the row buffer, the column multiplexer picks the desired byte from the row buffer, and the byte leaves the chip through the output buffer and driver.]
304
RAM Devices (contd.)
The read access sequence for a DRAM with an asynchronous interface is illustrated in the timing
diagram below:
[Timing diagram: the falling edge of RAS informs the DRAM chip that a new row address has been issued; the falling edge of CAS informs it that a new column address has been issued; the multiplexed address bus carries the row address and then the column address; with r/w indicating a read and oe enabling the output, valid data appears on the data bus after the read access time (the data lines float when the output is off).]
– Note the asynchronous nature of the memory interface. The write access sequence is similar.
DRAM access times are in the range 40 to 70 nsecs.
305
RAM Devices (contd.)
Since the charge in the capacitor within a bitcell in a DRAM decays because of inevitable leakage (no real insulator is perfect!), the charge held in such capacitors has to be replenished periodically, unless the bitcell contents are modified: this operation is called refreshing.
– The duration between successive refreshing of a row can be at most 10 to 15 microseconds, typically.
A DRAM controller has to be used in any system that contains DRAM chips. This controller performs the following functions:
Generates the necessary signals for each DRAM chip in the system to perform a read, a write or a read-modify-write cycle, decoding the higher order address bits to select the chips.
Initiates and performs the refreshing of each DRAM chip it controls.
In higher end systems, it performs any error detection and correction and interleaving
The general approach to refreshing is to refresh the contents of a row at a time.
– Refreshing the row involves reading out the contents of a row into an internal buffer before the data stored in that row deteriorates
– The contents of the buffer is then written back to the row – the write replenishes the charges on the bitcell capacitors that held a ‘1’
– When the refresh of a row is in progress, the DRAM cannot perform an access request.
306
RAM Devices (contd.)
A refresh cycle consists of a sequence of refreshes that restore the logic levels of the rows, one row at a time. Two types of refreshing cycles are used in practice:
Burst mode refresh: here a number of rows are consecutively refreshed when the DRAM is idle (i.e., not servicing any request) – this is generally used in a system with a cache, where the DRAM idles for a relatively longer duration on the average between servicing cache misses.
Distributed mode refresh: here the refresh of a row alternates with a period of use of the DRAM, so that the refreshes are uniformly distributed between uses.
The DRAM controller can refresh a row by using any one of the following approaches (irrespective of the type of refresh cycle used):
RAS-only refresh: Here the DRAM controller puts out the address of the row to be refreshed and drives the RAS line low, keeping the CAS line high and the write and output disabled. This refreshes the addressed row. The DRAM controller needs an internal register to keep track of the row addresses to perform the full refresh cycle.
307
RAM Devices (contd.)
CAS before RAS (CBR) refresh: most DRAM chips feature an internal row address register (RAR) that is initialized randomly (but wraps around) to generate the addresses of all the rows when incremented progressively.
– When the CAS line is strobed low before the RAS line is strobed low, keeping the write and output disabled, the internal logic of the DRAM refreshes the row pointed to by the RAR, which automatically gets incremented after the refresh.
– Note that the DRAM controller does not need to put out the address of the row to be refreshed – this saves power.
– Likely to be the refresh mode in all future DRAMs
Hidden refresh: Here a refresh is performed on the row just accessed. This is done by keeping CAS low (which is required for the normal access), strobing RAS low (which, again, is required for the normal access), then bringing RAS high again, finally strobing the RAS line low again. This has the effect of a CBR refresh. There is nothing “hidden” in terms of delays – the CBR refresh sequence takes additional time. The only “hidden” aspect of this refresh is the fact that the refresh is sneaked in right after a normal read access.
308
Evolution of DRAM Memory Technology
Initial large-scale DRAM offerings had an asynchronous interface. Later technologies provide good burst performance.
Fast Page Mode (FPM) DRAMs
This is a special operating mode for asynchronous DRAMs that allows the contents of consecutive columns within the same row to be read out in succession. The column addresses can in some cases be internally generated based on some programmed burst patterns. The timing diagram that applies is:
[Timing diagram: the falling edge of RAS signals the row address; CAS is then strobed low repeatedly, each falling edge signalling a new column address (Column1, Column2, Column3); the DRAM returns Data1, Data2, Data3 in turn, and each rising edge of CAS turns the data output off after a turnoff delay.]
309
DRAM Technologies (contd.)
Gives a good data transfer rate for data within a row compared to doing a sequence of solitary reads since repeated strobing of the RAS line is not needed.
Extended Data Out (EDO) DRAMs
EDO DRAMs are a very simple extension of the FPM DRAMs – in the EDO DRAMs, the data output is kept enabled even when CAS goes high.
[Timing diagram: as in FPM, the falling edge of RAS signals the row address and CAS is strobed with Column1, Column2, Column3; unlike FPM, the rising edge of CAS does not turn off the data output, so Data1, Data2, Data3 stay valid while the next column address is being applied; each falling edge of CAS both indicates the availability of the next column address and latches in the data from the previous column.]
310
DRAM Technologies (contd.)
– This allows output data to remain valid even when CAS falls subsequently.
– The falling edge of CAS can be used to indicate the availability of the next column address to the EDO DRAM. It can simultaneously be used to latch-in the data from the previous column.
In the FPM DRAMs, a new column address is not applied till the previous data output is available. This is no longer a requirement for EDO DRAMs – the next column address is applied as the data from the previous column is being latched in.
Column addresses can be applied quicker compared to a FPM DRAM
Faster data access rate
Interface/controllers not much different from standard FPM DRAMs:
interface is asynchronous.
Synchronous DRAMs (SDRAMs)
Similar to synchronous RAMs – interface to SDRAMs is synchronous, using a clock derived from the CPU clock.
High data rate obtained from DRAM array by using interleaving within the SDRAM.
311
DRAM Memory Technologies (contd.)
Provides a very high data rate – better than EDO DRAMs.
DDR (Double Data Rate) DRAM: SDRAM, with one transfer per clock edge. Mainstream technology now.
Rambus DRAMs
Uses dedicated, impedance-matched bus (“RAMbus”) between specially packaged DRAM components and CPU to obtain a very high data rate. Really a combination of an interconnection and DRAM.
High data rate results from the use of a short, impedance matched, high clock rate (500 MHz. +, 8+ bits wide) RAMbus.
Possible way to implement large off-chip cache and CPU-off-chip cache interconnection.
Was embraced recently by Intel – now fallen from grace; several vendors make RAMbus DRAMs as commodity parts. Future uncertain.
Cached DRAMs (CDRAMs)
Uses wide line buffer within DRAM as a cache.
Variations use static RAM-like buffers internal to DRAM as cache and
more than one set of buffers.
Has not caught on.
312
Synchronous DRAM (SDRAM), DDR-n
Current mainstream technology – now in the 3rd generation (DDR-3)
Modern DRAM memory systems use a number of DIMMs (dual in-line memory modules). Each DIMM is a small PCB card with memory chips on each side and includes other logic elements. Each DIMM has an edge connector on the PCB that plugs into a DIMM socket on the motherboard.
[Figure: a DIMM card carrying eight x8 DRAM chips, an area for control buffers etc., a notch in the edge connector, and the edge connections.]
Basic interface is similar to synchronous static RAM interface:
DIMM sockets are connected to a memory bus driven by a clock signal which is derived from the CPU clock by logic within the DRAM controller. DRAM controller can be external or internal to the CPU chip. Common now: internal DRAM controller.
Interface still requires row address to precede column address, as in the asynchronous DRAM interface.
Commands are sent on the rising edge of this bus clock (1T timing). If the memory bus has a higher level of loading (= higher capacitive loading), a 2T timing is used to send commands to the DIMM over a two bus cycle period. 1T addressing is more common now.
313
SDRAMs/DDR-n (contd.)
Two data transfers per bus clock – hence the name DDR – double data
rate.
Standards maintained by JEDEC
For DDR-2 and DDR-3, the memory data bus width is set at 64-bits (“x64”). This is also the same as the width of the DIMM modules.
If x8 (8-bit wide DRAM chips) are used on the DIMM, at least 8 of these chips are needed on the DIMM card so as to accommodate (that is, match) the 64-bit bus width. Error correction, if supported on the DIMM with error correction codes (ECC), will require additional chips.
Most DRAM controllers for DDR-2 and DDR-3 support multiple channels. These are like DMA channels and enable concurrent trans- fers on the channels.
Modern DIMM cards include an SPD (serial presence detect) device that reports the various characteristics of the DIMM (including timing, organization and performance limits) to the DRAM controller to let the DRAM controller automatically figure out the best operational configuration. Chip temperatures are also reported to the DRAM controller by the SPD in some cases for power management.
The DDR-3 devices include explicit power management modes. As memory bus speeds increase, the power dissipation in the RAM goes up and such power management hooks are necessary.
314
Memory Interleaving
Memory interleaving is a technique for enhancing the effective memory bandwidth – it does nothing to reduce the memory latency
The basic idea is to set up a number of memory modules, M0, M1, .., Mk-1, each of which has a word size of w and an access time of T
Addresses are assigned to these k memory modules in the following manner:
– An address a is mapped onto the memory module numbered a mod k, to an address a div k within that memory module
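A minimal sketch of the low-order interleaved address mapping just described (module numbers and offsets only; the function name is illustrative):

    def low_order_map(a, k):
        # address a -> (module number a mod k, address a div k within that module)
        return a % k, a // k

    # k = 4 modules: consecutive addresses hit consecutive modules, so a run of
    # k addresses can be serviced in an overlapped fashion.
    for a in range(8):
        print(a, "->", low_order_map(a, 4))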
A request to access a number of consecutive memory addresses in sequence can be serviced in an overlapped fashion by the k memory modules, resulting in the delivery of k consecutive words in time T:
[Timing figure: modules M0 through M3 access addresses a, a+1, a+2, a+3; M0 starts at time 0, M1 at T/4, M2 at T/2 and M3 at 3T/4, and each access finishes T after it starts (s = start of memory access, f = end of memory access).]
As shown above, the initiation of memory accesses for consecutive modules is staggered by T/k (k = 4 in the example above).
This results in memory accesses completing at the rate of one every T/4 time units
315
Memory Interleaving (contd.)
Since the data available from each module appears at a distinct time at its output, a w-bit wide bus can be used to deliver the data from all the modules to the CPU, even though each module delivers w-bit wide data:
[Figure: modules M0, M1, M2, ..., Mk-1, each with a w-bit output, share a common w-bit bus to the CPU.]
– As long as the data put out on the common bus is picked up within time T/k of their appearance on the common bus, the outputs from modules do not interfere with each other
– This is called k-way low-order interleaving, since the lower order bits in the memory address determines the memory module that holds the target address.
The effective bandwidth obtained from the interleaved memory system is one w-bit word every T/k cycles, resulting in a data rate of (w*k)/T bits/sec. This is a k-fold increase over the data rate from a single memory module.
The bandwidth improvement given above can be realized only if k consecutive addresses to be accessed all map to distinct memory modules – if this is not the case we have a module conflict: an access is directed to a memory module while it is currently servicing another.
316
Memory Interleaving (contd.)
Just what kind of bandwidth improvement results when the sequence of addresses issued to the interleaved memory system leads to module conflicts? We derive an expression for the effective bandwidth for the two following cases:
Memory accesses during instruction fetching:
In this case, the address sequence consists of consecutive addresses between two consecutive taken branches.
We assume that k consecutive addresses to be fetched are available
in a buffer of size k. The addresses in this buffer are issued to the interleaved memory system all at once, although internally, the memory system initiates the reads from consecutive modules in a staggered fashion.
We further assume that as soon as the first taken branch instruction is encountered in program order within this group of k instructions in the buffer, the instructions within the buffer following this branch do not have to be accessed. This implicitly requires the use of a BTB to predict if the branch is going to be taken or not.
We also assume that the instruction sequence at the target of the taken branch is not fetched as soon as the taken branch is discovered, since any attempt to fetch these instructions may result in module conflicts with the accesses already in progress.
Let B = probability that an instruction is likely to be a taken branch. (In terms of the earlier notation, B = b.s)
317
Memory Interleaving (contd.)
Now, the probability that exactly j of the k addresses (where j ≤ k) in the buffer correspond to useful accesses is as follows:

    Value of j      probability             # of useful accesses
    1               B                       1
    2               (1 – B) * B             2
    3               (1 – B)^2 * B           3
    4               (1 – B)^3 * B           4
    :               :                       :
    k – 1           (1 – B)^(k-2) * B       k – 1
    k               (1 – B)^(k-1)           k
Note that when j equals k, it does not matter if the last instruction is a taken branch or not!
The effective bandwidth from the interleaved memory system, scaled to the bandwidth of a single module, is thus:
BW_k = 1*B + 2*(1 – B)*B + 3*(1 – B)^2*B + .... + (k – 1)*(1 – B)^(k-2)*B + k*(1 – B)^(k-1)
BW_k can be solved by induction on the value of k. This gives:
BW_k = {1 – (1 – B)^k}/B
This result shows that for a non-zero B, the bandwidth improvement gained by adding additional memory modules is sublinear, with the BW saturating at higher values of k.
– This is not surprising, given the branch statistics of typical programs.
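A quick, illustrative check that the closed form above matches the summation (B = 0.2 is an arbitrary example value for the taken-branch probability):

    def bw_sum(k, B):
        # 1*B + 2*(1-B)*B + ... + (k-1)*(1-B)^(k-2)*B + k*(1-B)^(k-1)
        s = sum(j * (1 - B) ** (j - 1) * B for j in range(1, k))
        return s + k * (1 - B) ** (k - 1)

    def bw_closed(k, B):
        return (1 - (1 - B) ** k) / B

    for k in (2, 4, 8, 16):
        print(k, round(bw_sum(k, 0.2), 4), round(bw_closed(k, 0.2), 4))
    # With B = 0.2 the scaled bandwidth saturates near 1/B = 5 as k grows.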
318
Memory Interleaving (contd.)
Memory accesses during data fetching:
The data access pattern from a CPU can be assumed to be randomly scattered across the memory modules – this is an approximation, since actual data access patterns do have some locality!
Under these assumptions, the expected number of overlapped fetches, without any module conflict, starting with k requested addresses sitting in the request buffer is the same as the average length of a sequence of integers drawn from the set {1, 2, .., k} without any repetition.
The probability of drawing (j + 1) integers with replacement (where j ≤ k) from the set of k distinct integers, such that the first j drawn are distinct and the (j + 1)-th is a repetition of one of the first j, is:
p(j) = {(k/k)*((k-1)/k)*((k-2)/k)*....*((k-j+1)/k)}*(j/k) = {j * (k – 1)!}/{(k – j)! * k^j}
Thus, the average length of the distinct strings, with non-repeating numbers in a string, is simply Σ_{j=1..k} {p(j) * j} – which is also the scaled bandwidth.
Unfortunately, there is no exact closed form for the above summation. Approximations, within 4% error for values of k up to 45, show that:
BW_k ≈ k^0.56
This again shows that increasing k indefinitely does not pay off in this case either!
– Real systems behave significantly better because of locality.
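An illustrative check of the data-access case: the expected number of accesses before the first module conflict, Σ j·p(j), compared against the k^0.56 approximation quoted above (function names are arbitrary):

    from math import factorial

    def p(j, k):
        # first j module numbers (drawn uniformly, with replacement, from k
        # modules) are distinct, and the (j+1)-th repeats one of them
        return j * factorial(k - 1) / (factorial(k - j) * k ** j)

    def expected_run(k):
        return sum(j * p(j, k) for j in range(1, k + 1))

    for k in (4, 8, 16, 32):
        print(k, round(expected_run(k), 3), round(k ** 0.56, 3))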
319
Memory Interleaving (contd.)
Other types of memory interleaving schemes:
The low order interleaving scheme that we have discussed thus far takes a N-bit physical address and interprets it as follows:
    [ Address within module (F bits) | Module number (m bits) ]   (N bits total)
High-order interleaving is a variation of interleaving that specifies the module number in the higher order bits of the memory address and the address within a module in the lower order bits:
    [ Module number (m bits) | Address within module (F bits) ]   (N bits total)
Higher-order interleaving effectively partitions the physical address space of size 2^N words into subspaces that have 2^F contiguous words each and maps the addresses within each subspace onto a single module.
With high order interleaving, assuming that F is adequately large, the instruction accesses made by a program will be mostly confined to a single module. This implies that virtually no bandwidth improvement will be seen in the course of fetching instructions.
320
Memory Interleaving (contd.)
Note now the contrast between low order interleaving and high order interleaving:
Expandability: The configuration of the low order interleaved memory system is quite rigid: the number of memory modules used will have to be exactly equal to k = 2^m.
– If fewer memory modules are used, the memory system is quite useless since “holes” are left in the physical address space at regular intervals – there is no large contiguous range of addresses available for use.
– Additional memory modules cannot be added in the low order interleaved systems, since module address bits will have to “spill” into the field of bits reserved for the address within a module.
– In contrast, it is easy to add additional modules in the high-order interleaved systems. In fact, in such systems, the number of memory modules does not have to be exactly equal to 2^m. In particular, more than 2^m memory modules can be added – this will simply increase the number of physical address bits to something higher than N.
Performance: For a sequential memory access pattern, the high-order interleaved memory system offers no (or, at best, occasional) BW gain. In contrast, with low order interleaving, sequential access patterns can offer a BW improvement by up to a factor of 2^m.
321
Memory Interleaving (contd.)
Fault Tolerance: When a single memory module fails, the entire low-order interleaved memory system is rendered useless – no large contiguous range of addresses is available in the rest of the system. In contrast, the failure of a single memory module in the high-order interleaved system still leaves large contiguous parts of the entire address space intact.
– Is it possible to get an interleaved memory system that combines the best features of low and high order interleaving? The answer is yes and the scheme that results is called hybrid interleaving.
In the hybrid interleaving scheme, 2^b memory modules are grouped together in what is called a block. The number of blocks in the system is 2^(m-b). The physical address is interpreted as follows:
    [ Block # ((m – b) bits) | Address within module (F bits) | Module number within the block (b bits) ]   (N bits total)
– This arrangement offers the advantages of the high-order and low-order interleaving schemes
– The failure of a single module within a block renders that block useless, but enough blocks are still left to offer a large contiguous range of addresses and BW improvement for sequential accesses.
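A minimal sketch of the hybrid address interpretation above (field widths F and b are parameters; the function name is illustrative):

    def hybrid_map(addr, F, b):
        module_in_block = addr & ((1 << b) - 1)          # low b bits
        addr_in_module  = (addr >> b) & ((1 << F) - 1)   # next F bits
        block           = addr >> (b + F)                # remaining (m - b) bits
        return block, addr_in_module, module_in_block

    # Consecutive addresses rotate through the 2^b modules of one block first,
    # giving low-order-style BW gains while each block still covers a
    # contiguous range of addresses.
    print([hybrid_map(a, F=4, b=2) for a in range(6)])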
322
Memory Interleaving (contd.)
Handling module conflicts: the address sequence to be issued is dispatched from a request queue or the address sequence is generated from a descriptor as in the case of a vector pipeline. If the address being dispatched targets a busy memory module (i.e., it has a module conflict), the request is queued into a buffer of blocked requests and retried at a later time:
[Figure: the interleaved memory controller takes the address sequence, issues requests to the memory modules, and places blocked requests (those hitting a busy module) in a buffer for later retry.]
Where is memory interleaving used?
– Inside pipelined burst-mode SRAMs
– Inside synchronous DRAMs (SDRAMs)
– Inside graphics cards employing DRAMs or SGRAMs
– In memory systems of high-end workstations and PCs
– As the main memory system of pipelined vector supercomputers
323
Early DRAM Memory System
SIMM/DIMM (single/dual in-line memory module)/SODIMM (small outline DIMM): a number of DRAM chips on a small board, used to effectively implement a memory system with a larger word size.
Example: Four 1 M X 16 DRAM chips on a SIMM can implement a 1 M X 64 memory system.
A memory system using SIMMs and a DRAM controller is shown below:
[Figure: a DRAM controller receives control signals, address and data from/to the CPU; the higher-order bits of the address generate the module select lines for SIMM 0 and SIMM 1, the lower-order bits form the address within a module, and control signals go to the SIMMs; within each SIMM the chips drive the data lines for bits 63 thru 48, 47 thru 32, 31 thru 16 and 15 thru 0.]
324
Modern Memory Systems Based on DDRs
Modern memory systems are built using dynamic RAM chips that are packaged as DIMMs (dual in-line memory modules) on a circuit card with edge connectors.
A DIMM is essentially a circuit card that has DRAM chips on both sides and additional control and driver logic for addressing the chips and responding to commands issued by the external DRAM controller.
DIMMs are designed to drop into memory slots on the motherboard. DIMMs generally have 8 DRAM chips per side
DIMMs use X4 or X8 memory chips (that is, chips that have 4- or 8-bit wide data paths on each chip)
With X8 chips, 8 or 9 chips are used on each side of the DIMM to realize memory systems that support 64 bit wide words or words with 64 bits of data and 8-bits of ECC, respectively.
With X4 chips, 8 or 9 chips are used on each side of the DIMM and both sides collectively realize the 64-bit or 64+8 bit memory words.
DIMMs have a fixed number of external connections: from 72 pins on a small outline DIMM (SO-DIMM) to 240 (DDR2, DDR3, FB DIMM,..) to 288 pins (DDR4).
325
Modern Memory Systems: Details (contd.)
Connectors on the DIMM are used for command, address and data I/O
In the memory system, each 64-bit wide (or 72-bit wide: 64 data+8-bit ECC) set of chips within a DIMM is called a rank.
Chips within a rank share a common chip select signal.
When X4 chips are used, both sides of the DIMM make up a single rank and with the use of X8 chips, there are two ranks per DIMM, one on each side.
With the use of X16 and wider DRAM chips or DRAM chip in smaller packages on a DIMM, a single DIMM can have 4 or 8 ranks
A memory channel is a physical connection used by the memory controller to communicate with the DIMM slots that are associated with the channel. DIMM connectors connect to the channel lines on the motherboard through the memory slot
A channel consists of separate lines for commands, address and data
In higher end systems multiple channels can be used. This requires independent per-channel controllers
New support in hardware permits channels to be virtualized.
326
Modern Memory Systems: Details (contd.) Types of DIMMs:
Unregistered or unbuffered (older terminology): the signal lines between the DIMM and the channel do not go through buffers.
– Access latency is lowest but only a few DIMMs can be accommodated per channel.
Registered or Buffered DIMMs incorporate buffer registers and drivers to reduce the signal driving load from the DRAM controller on the command and address lines.
– Buffering increases access delays but permits more DIMMs to be accommodated on a single physical channel.
Fully buffered DIMMs (FB-DIMMs) provide buffering for all signals – command, address and data.
– FB-DIMMs include logic for parallel-to-serial and serial-to-parallel conversion of the data lines to permit higher density (= wider and/or higher capacity) DRAM chips to be accommodated on the DIMM without requiring any change to the number of data lines on a channel.
Load-reduced DIMMs (LR-DIMMs) buffer all signal lines to/from the DIMM like FB-DIMMs but transfer data in parallel for higher performance.
Modern DIMMs also include a serial presence detect (SPD) facility in the form of a pre-programmed EEPROM chip that contains parameters that describe (that is, self-identify) the DIMM to the DRAM controller. These parameters include the DIMM geometry (capacity, number of ranks, banks (LATER), etc.) and timing parameters.
Some DIMMs also contain temperature sensors that can be read out.
327
Modern Memory Systems: Details (contd.)
DRAM controllers are external to the DIMMs (DIMMs incorporate some simpler control logic on board for responding to commands from the DRAM controller).
Earlier systems: the DRAM controller was external to the CPU (as part of the "North bridge" chipset, later called the Memory Controller Hub, MCH).
Recent CPU chips, which are multicore, incorporate one or more DRAM controllers to meet the demands of faster and/or multiple cores – this leads to higher performance and power savings.
Ranks, Banks and Arrays:
Rank: a set of chips on a single DIMM that share a chip select line – that is, the chips in a rank are addressed simultaneously by the DRAM controller.
Commonly, a single rank provides 64-bit memory words (or 64-bit memory words and 8 bits of ECC). This is referred to as a 64-bit rank, that is, a rank with a data width of 64 bits.
Chips in a rank share address lines but each chip has its own dedicated data lines to permit data access in parallel at the 64-bit (or 64+8-bit) memory word granularity.
Bank: Each DRAM chip consists of a number of chip-internal banks.
Access to banks within a chip can be interleaved.
At the level of a rank, the rank can be viewed as consisting of a number of banks, with each such bank made up of the individual chip-internal banks (with the same chip-internal bank number) of the co-accessed chips in the rank.
328
Modern Memory Systems: Details (contd.)
Array: This is a chip-internal bank and is essentially a DRAM bitcell array that uses a row and column address to access specific bits (as shown on Page 293 of the notes).
2 to 4 arrays per chip are typical, but higher numbers of arrays (16 and more) are also encountered within some chips.
Low-order interleaving is typical.
Sensing data during reads in DRAM Arrays:
DRAM arrays use a single bitline per column for sensing
Before a bitcell row in the array (which could be a few thousand bits wide) can be accessed, the bit lines in the array need to be precharged to a logic one level (or something in-between a 0 and 1 level) to enable the data to be sensed and read out by the sense amps.
This is because sensing is essentially done to detect a 1 or 0 by noting whether the voltage level of a precharged bitline comes down slightly when a bitcell is connected to it. This happens when the bitcell contains a 0.
Actually, sensing is more complex than this: reference bitcells containing 1s and 0s are used to sense the contents of the bitcells in a row, by noting the difference between the bitline levels caused by the bitcells in the row and those caused by the reference bitcells associated with the row.
329
Modern Memory Systems: Details (contd.)
Physical hierarchy in modern memory systems
[Figure: physical hierarchy in modern memory systems – DIMMs hold DRAM chips; within a DRAM chip, a multiplexed address input feeds a row address buffer and a column address buffer, the row address drives the row driver of a bitcell array (one of several chip-internal banks), the column address drives the column multiplexer over the row buffer, and the desired data leaves through an output buffer and driver under the chip's control inputs.]
Hierarchy in the memory access path:
Physical address issued by CPU -> DRAM controller for the specific channel -> DIMM -> Rank -> Bank -> Row in array -> Column in array
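A minimal sketch of such a decomposition is shown below; the field widths and their ordering are illustrative assumptions (real controllers use programmable mappings, as noted later), not any particular controller's scheme:

/* Sketch: split a physical address into channel, rank, bank, row and
   column fields.  The widths below are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define COL_BITS   10   /* assumed: 1024 columns per row */
#define BANK_BITS   3   /* assumed: 8 banks per rank     */
#define RANK_BITS   1   /* assumed: 2 ranks per channel  */
#define CHAN_BITS   1   /* assumed: 2 channels           */
/* remaining upper bits select the row */

typedef struct { uint32_t chan, rank, bank, row, col; } dram_coord_t;

static dram_coord_t map_address(uint64_t paddr)
{
    dram_coord_t c;
    c.col  = paddr & ((1u << COL_BITS) - 1);   paddr >>= COL_BITS;
    c.bank = paddr & ((1u << BANK_BITS) - 1);  paddr >>= BANK_BITS;
    c.rank = paddr & ((1u << RANK_BITS) - 1);  paddr >>= RANK_BITS;
    c.chan = paddr & ((1u << CHAN_BITS) - 1);  paddr >>= CHAN_BITS;
    c.row  = (uint32_t)paddr;                  /* upper bits = row */
    return c;
}

int main(void)
{
    dram_coord_t c = map_address(0x12345678ULL);
    printf("chan %u rank %u bank %u row %u col %u\n",
           c.chan, c.rank, c.bank, c.row, c.col);
    return 0;
}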
330
Modern Memory Systems: Details (contd.)
Memory Access Delays: Slowest to Fastest:
Slowest: Row being refreshed (need to wait till refresh is over)
Need to precharge before access, then read out the row into the chip-internal row buffer for the bank and then access the column
Bitlines already precharged, just access row and then column
Fastest: Row buffer hit, just access column (aka “open page hit” or “open row hit”) – fastest access. Most DRAM controllers maintain a separate request queue for access requests to open pages and schedule them preferentially over other requests.
Timing and other issues: See next page for details.
Mapping addresses to channel, rank, bank: specified by the OS, to optimize concurrency and overall performance.
Mapping configurations are maintained in registers within the DRAM controllers
Interleaving modes can also be specified in the same manner, along with power saving options (DDR-3 and later)
331
DDR-3 Operation and Timing Details
DDR-3 (and SDRAM chips in general) contain mode registers that specify the delay parameters (tCL, tCRD, tRP – see below) being used by the DRAM controller (these have to be compatible with the chip!), the burst size (amount of data transferred back-to-back), the burst order and other parameters. These mode registers are "programmed" before use by the DRAM controller.
DDR-n chips usually incorporate lower-order interleaving.
General DDR-3 Operation:
Based on a physical address issued by the CPU, the DRAM controller computes the bank address, row address and the column address.
An “active” command is issued (specifying a bank address and a row address) to select a row.
Read or write commands are then issued specifying a column address
Data access follows, matching the pre-specified read (or write) burst size (which is kept as part of a mode register).
DDR-3 Timing: recall that row address must precede column address to access data
CAS (column address strobe) latency, tCL: number of CPU cycles between the sending of a column address to the DRAM controller and arrival of data (which is usually sent as required word first, with wraparound delivery for the rest). This time is also referred to as the delay for accessing an “open page” (a row = a page).
332
DDR-3 Operation and Timing (contd.)
RAS to CAS delay (row address to column address delay), tCRD: time (in CPU cycles) between the sending of a row address to the DRAM controller and the earliest time at which a column address can be sent to access data in the selected row.
– If no row is currently selected, it will take time tCRD + tCL to access data from the DRAM.
Row precharge time, tRP: the time in clock cycles needed to prepare (precharge) for access to a row different from the row that was being accessed, before the new row address can be sent.
– If the address of a row changes from that of the current access, it will take time tRP + tCRD + tCL to access data from the new row.
– Access to a different row is called a row (or page) conflict.
Row active time, tRAS: minimum time for which a row must remain active if any data has to be accessed from it and before another precharge command can be sent.
– tRAS ≈ tRP + tCRD + tCL (but tRAS can be reduced by overlapping tRP and tCL)
DRAM timing is specified as a series of numbers separated by dashes, as follows: tCL-tCRD-tRP-tRAS-tCMD (where tCMD is the time needed to send a command; sometimes the last two parameters are omitted!)
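The three access cases listed on the previous page can be turned into a back-of-the-envelope latency estimate directly from these parameters; the sketch below does just that (simplified: tCMD, refresh and bus contention are ignored, and the 9-9-9 figures are illustrative):

/* Sketch: estimated DRAM access latency for the three cases, using the
   tCL / tCRD / tRP notation of these notes. */
#include <stdio.h>

enum row_state { ROW_BUFFER_HIT, ROW_CLOSED, ROW_CONFLICT };

static int access_latency(enum row_state s, int tCL, int tCRD, int tRP)
{
    switch (s) {
    case ROW_BUFFER_HIT: return tCL;               /* open page hit        */
    case ROW_CLOSED:     return tCRD + tCL;        /* activate, then CAS   */
    case ROW_CONFLICT:   return tRP + tCRD + tCL;  /* precharge comes first */
    }
    return -1;
}

int main(void)
{
    /* illustrative 9-9-9 timing values */
    printf("hit: %d, closed: %d, conflict: %d cycles\n",
           access_latency(ROW_BUFFER_HIT, 9, 9, 9),
           access_latency(ROW_CLOSED,     9, 9, 9),
           access_latency(ROW_CONFLICT,   9, 9, 9));
    return 0;
}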
333
Cache Memory Systems
Basic idea introduced by M. V. Wilkes ("slave store" – mid '60s), first incorporated (commercially) in the IBM S/360 Model 85 (late '60s).
Motivations/Basics:
CPU cycle times are decreasing at a faster rate than the rate at which memory access times are decreasing.
Cannot have too many registers to cut down on the frequency of memory accesses made by the CPU:
– too many registers => longer register access times (“larger is slower”)
– too many registers => more register saves/restores during a context switch
– ISA limitations may preclude the use of additional registers
Locality of reference (spatial + temporal) should allow the use of a fast buffer, with lower access time than the memory, to hold copies of memory locations that are accessed often (or are likely to be accessed because of spatial locality).
– When the CPU generates a memory request, this buffer is likely to have a copy of the required data
– The CPU can then retrieve the data from this buffer, instead of having to go to the slower memory
334
Cache Memory Systems (contd.)
– The memory address issued by the CPU can be used to look up the buffer to see if a copy of the memory location exists within the buffer
– Such a buffer is called a cache.
– Cache = hidden storage
Required features of a cache:
Speed: The time it takes to access data (in this context, data also includes instructions) from the cache should be substantially faster than the time it takes to access the same data from the memory. Typically, the cache access times – at least for on-chip caches – are higher than but close to the access times for registers.
Fast lookup logic: allows CPU to quickly determine if a copy of the memory address issued by the CPU exists in the cache or not: if a copy is found in the cache, we have a cache hit, otherwise we have a cache miss
Fast, automatic management: allows the desired data to be fetched into the cache on a miss, so that subsequent accesses to the same address may result in a cache hit. This, in turn, involves the use of:
A fast, easy-to-implement replacement algorithm: on a miss, the data least likely to be used in the future has to be evicted from the finite-capacity cache to make room for the new data that is fetched into the cache.
335
Cache Memory Systems (contd.)
A fast placement policy, which determines where within the cache the data to be fetched is to be placed
Appropriate write policies to handle updates made by the CPU to any data item in the cache
– closely related to the write policy is a write allocation policy that decides if the memory data is to be allocated space in the cache on a write by the CPU that results in a cache miss.
Dynamic tracking of program locality: the management hardware associated with the cache must ensure that the data maintained within the cache is such that cache hits occur most of the time a memory access is attempted by the CPU.
The cache logic attempts to dynamically track and exploit temporal locality by bringing missing data into the cache, expecting it to be accessed again in the near future
The cache mechanism also attempts to dynamically track and exploit the spatial locality by using a unit of data allocation within the cache called a cache line or a cache block. The cache line size is bigger than the size of the biggest data accessed at a time by the CPU (i.e., the word size): the line size is typically 2^n * word size, where n is a small integer.
– On a cache miss, the cache line containing the missing data is fetched into the cache with the expectation that other data within this line are likely to be referenced in the future due to the existence of spatial locality.
336
Cache Memory Systems (contd.)
– The cache replacement algorithm, in conjunction with these two mechanisms has to ensure that the localities continue to be tracked well by the cache.
Transparency: Ideally, the cache should remain transparent to applications, so as to ensure that binary compatibility is guaranteed as the application is moved from one CPU to another that is ISA compatible, but has a different cache configuration. We will see later that in real systems, the cache has to remain visible to the OS and in some cases the applications have to be transformed to make optimal use of the cache!
Caches vs. registers:
Feature                 Registers                                     Caches
Tracking of locality    Static; compiler can "look ahead"             Dynamic: based on past behavior
Expandability           Typically, none for architectural registers   Easy
How managed             Software – by the compiler                    By hardware
ISA visibility          Visible                                       Mostly invisible
337
Cache Memory Systems (contd.)
Simple Cache-CPU Interface and Operation:
[Figure: the CPU connects to the cache through address, data_in/data_out and control lines (probe, read/write, clock, hit/miss); the cache in turn connects to the main memory subsystem (= RAM chips + controllers) through its own address, data and control/status lines.]
The interface between the cache and the memory is as discussed earlier for SRAMs (or DRAMs).
Interface between CPU and cache is similar but the control signals used are worth pointing out:
probe: this informs the cache to perform a cache access (strobed)
read/write: type of cache access desired.
hit/miss: status of probe from cache. The CPU stalls till the miss is serviced.
clock: The CPU clock is fed to the cache to allow the cache operations to be synchronized with the CPU. In a pipelined CPU it is usual to operate the cache closest to the CPU at the pipeline clock rate.
338
Cache Memory Systems (contd.)
Performing a memory read with the cache in place – steps:
1. The CPU puts out the effective address of the memory location to be read, raises the read/write line, strobes the probe line and waits for a hit or miss indication from the cache.
2. The cache responds with a hit or miss indication:
(a) If the response is a hit, the desired data item is moved into the CPU completing the memory access
(b) If the response is a miss, the CPU stalls, and steps 3) and 4) are carried out:
3. The cache memory system selects a victim line using its replacement algorithm and takes the actions necessary to evict the victim line. This step may involve the updating of the victim line's image in the main memory for some write policies. Once the victim line is evicted, a memory access is initiated to fetch the line containing the desired word into the cache.
4. The cache gets the desired line from the memory and simultaneously forwards the desired word to the CPU. A variety of cache fetch policies can be used to speed this step up. On receiving the desired data, the CPU continues.
339
Cache Memory Systems (contd.)
Performing a memory write with the cache in place – steps:
1. The CPU puts out the effective address of the memory location to be written to and the data to be written, lowers the read/write line, strobes the probe line and waits for a hit or miss indication from the cache.
2. The cache responds with a hit or miss indication:
(a) If the response is a hit, the desired data item is written to the cache. The memory copy of the data being written may also have to be updated, if the cache implements a write-through write policy
(b) If the response is a miss, the CPU stalls, and steps 3) and 4) are carried out:
3. Based on the allocation policy for a write miss, the cache may simply update the main memory copy of the data and signal the completion of the write to the CPU to allow it to continue, without allocating space in the cache for the line containing the word being updated.
Alternatively, the cache logic selects and evicts a victim for replacement (as in the case of a read access), and proceeds on to the next step.
4. The cache fetches the line containing the word being written to into the slot left vacant by the victim line, updates the word in the cache (and possibly the copy in main memory, depending on the write policy), and signals the completion of the write to the CPU.
340
Cache Memory Systems (contd.)
Cache allocation policy on a write miss: two variations are possible here:
Always allocate a line frame for the line containing the word being written to on a cache miss during a write ("write miss") – this is called WA (write allocate), the fact that there is a miss being implicit.
Do not allocate a line frame for the line containing the word being written to on a write miss – this is called NWA (no write allocate), again with the fact that there is a miss being kept implicit.
Cache write policies: two variations are possible here
Write-through (WT) or store-through: on a write hit, the memory copy is updated along with the copy in the cache. On a write miss, the memory copy is updated. If the cache employs an NWA allocation policy, no additional steps are needed. If the cache employs the WA allocation policy, the updated line in memory is also fetched into the cache after evicting a victim line from the cache.
– A cache employing WT (called a “WT-cache”) is preferred in systems where the main memory and the cache have to be kept consistent: this is a typical need in many shared memory multiprocessors.
Write-back (WB) or copy-back: on a write hit, only the copy in the cache is updated and the line is marked as "dirty" (to imply that it is not consistent with its image in the memory). On a write miss, if the allocation policy is WA, the line containing the word being written to is fetched into the cache, marked as dirty and the word is updated only within the cache. A NWA write allocation scheme is generally not used with a WB cache, although in theory one is possible.
341
Cache Memory Systems (contd.)
– In a WB cache, a victim line, if marked as dirty, requires its image in the memory to be updated before it can be evicted from the cache.
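As an illustration of the two combinations discussed above (WT + NWA and WB + WA), the following self-contained sketch applies both policies to a toy direct-mapped cache with one-word lines; the data structures and helper names are purely illustrative:

/* Sketch: WT+NWA vs. WB+WA on a toy direct-mapped cache of one-word lines. */
#include <stdio.h>

#define FRAMES 4
typedef struct { int valid, dirty; unsigned tag; unsigned data; } frame_t;

static unsigned memory[64];
static frame_t  cache[FRAMES];

static void write_wt_nwa(unsigned addr, unsigned data)
{
    frame_t *f = &cache[addr % FRAMES];
    if (f->valid && f->tag == addr / FRAMES)
        f->data = data;            /* write hit: update the cached copy     */
    memory[addr] = data;           /* write through: memory always updated  */
                                   /* NWA: nothing is allocated on a miss   */
}

static void write_wb_wa(unsigned addr, unsigned data)
{
    frame_t *f = &cache[addr % FRAMES];
    if (!(f->valid && f->tag == addr / FRAMES)) {      /* write miss        */
        if (f->valid && f->dirty)                      /* dirty victim: update its memory image */
            memory[f->tag * FRAMES + (addr % FRAMES)] = f->data;
        f->valid = 1; f->dirty = 0;
        f->tag   = addr / FRAMES;
        f->data  = memory[addr];                       /* WA: fetch the line */
    }
    f->data  = data;               /* update only the cache copy             */
    f->dirty = 1;                  /* memory is updated when the line is evicted */
}

int main(void)
{
    write_wt_nwa(5, 100);
    write_wb_wa(9, 200);
    printf("mem[5]=%u (updated), mem[9]=%u (stale until eviction)\n",
           memory[5], memory[9]);
    return 0;
}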
Cache fetch policies: several variations possible, as follows:
Simple fetching: entire line containing the desired word is loaded into the cache on a miss, and the missing word is then delivered to the waiting CPU.
Load-through or read-through: If the width of the data connection from the memory to the CPU is less than the width of a cache line, this policy is frequently implemented. The part of the line containing the desired word is first fetched into the cache and simultaneously forwarded to the CPU. The remaining parts of the line are then fetched in a wraparound fashion into the cache:
[Figure: load-through fetch of one 8-word line with a memory transfer unit of 2 words – the 2-word unit containing the desired word is fetched first (and forwarded to the CPU), followed by the remaining units in wraparound order (fetch sequence 0, 1, 2, 3).]
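The wraparound order sketched above can be computed as follows (a small sketch assuming an 8-word line and a 2-word memory transfer unit, as in the figure):

/* Sketch: wraparound fetch order for a line, desired unit first. */
#include <stdio.h>

#define WORDS_PER_LINE 8
#define UNIT_WORDS     2   /* memory transfer unit = 2 words */

int main(void)
{
    int desired_word = 5;                       /* word within the line   */
    int units = WORDS_PER_LINE / UNIT_WORDS;
    int first = desired_word / UNIT_WORDS;      /* unit holding the word  */

    for (int i = 0; i < units; i++) {
        int u = (first + i) % units;            /* wraparound order       */
        printf("fetch words %d-%d%s\n", u * UNIT_WORDS,
               u * UNIT_WORDS + UNIT_WORDS - 1,
               i == 0 ? "  (forwarded to CPU)" : "");
    }
    return 0;
}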
342
Cache Memory Systems (contd.)
Cache Organizations: Three organizations, direct-mapped, set-associative and fully-associative, are predominantly used; we will also discuss a fourth one (sector-mapped).
Common concepts for direct-mapped, set-associative and fully-associative caches:
The physical address space is considered to be broken up into lines, where each line is 2^L words, where, typically, L is a small integer. Every memory line has a unique address, which is obtained by taking the address of any word within the line and dropping the lower L bits.
[Figure: example memory with 4 words per line – word addresses 0-3 make up line 0, word addresses 4-7 make up line 1, word addresses 8-11 make up line 2, and so on.]
A line from memory is held within a line frame within the cache.
The placement rules for the direct-mapped, set-associative and fully-associative caches allow several different memory lines to be mapped to a given line frame, although only one such memory line can be held within the line frame at a time.
343
Cache Memory Systems (contd.)
Additional information must thus be held as part of a line frame to indicate which one of these memory lines is actually resident within the line frame: this information is called the tag
The field of a line frame holding the actual line is called the data part, while the field holding the tag is called the tag part.
A line frame, typically, has some additional information besides the tag and the actual contents of the memory line that is identified by the tag, at the least a single bit that marks the contents of the tag and data fields as valid:
[Figure: a line frame – a valid bit (v), other info, the tag field ("tag part") and the data field ("data part").]
– The cache is initialized by clearing the valid bits in all line frames.
Determining a cache hit or miss boils down to comparing parts of the address issued by the processor with the tag fields of all line frames that have the valid bit set.
The capacity or size of a cache usually refers to the combined size of the data parts of all line frames within the cache.
344
Cache Memory Systems (contd.) The direct-mapped cache:
Here the cache has a capacity of C (= 2^c) line frames, numbered 0 onwards. A memory line with the address l can be placed only in the line frame numbered l mod C.
– The line frame in which a word with an N-bit address w can potentially be found is the one whose number is obtained by dropping the lower L bits of w and then taking the lower c bits of what remains.
The above placement rule indicates that:
memory lines with the address .....        can be placed in line frame .....
0, C, 2C, 3C, .....                        0
1, C+1, 2C+1, 3C+1, ..                     1
::                                         ::
j, C+j, 2C+j, 3C+j, .. (j < C)             j
- Note that the lower c bits of the addresses of all lines that can be placed in a given line frame are identical
The tag part of a line frame must be long enough to uniquely identify the line currently residing in that frame: this is precisely the leading (m-c) bits of the m-bit address of a line in memory.
[Figure: the line address is split into the tag (upper bits) and the lower c bits (the line frame number); the L bits below the line address select the word within the line.]
345
Cache Memory Systems (contd.)
An example - small cache (unrealistic), but shows the major ideas:
Number of line frames in cache = 8 (=> c = 3)
Number of bits in memory address of a word, N = 10 (say); number of words per line = 4 (=> L = 2)
Size of tag field = N – c – L bits = 5 bits
Value of tag field = leading T = (N – c – L) bits of the line address
[Figure: a cached word and its associated tag in the example direct-mapped cache – the 10-bit word address 1100100111 refers to the word numbered 11 (=3) within the memory line numbered 11001001; the lower c = 3 bits of that line number (001) select line frame 1, whose valid bit is set and whose tag field holds 11001.]
Notice that for direct-mapped caches, there is no choice in selecting a victim line: whatever is the line that is resident in the line frame needed by a missing line (that is about to be fetched) has to be replaced.
Each line frame will have an additional “dirty” bit if the direct-mapped cache uses writeback. No such bit is needed if the cache uses write through.
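The address breakdown in the example above can be reproduced with a few shifts and masks; the following sketch uses the same parameters (N = 10, L = 2, c = 3) and the same address 1100100111:

/* Sketch: decompose the example address into word #, line, frame # and tag. */
#include <stdio.h>

#define L_BITS 2    /* 4 words per line  */
#define C_BITS 3    /* 8 line frames     */

int main(void)
{
    unsigned addr  = 0x327;                        /* 1100100111 in binary */
    unsigned word  = addr & ((1u << L_BITS) - 1);  /* 11    = 3            */
    unsigned line  = addr >> L_BITS;               /* 11001001             */
    unsigned frame = line & ((1u << C_BITS) - 1);  /* 001   = 1            */
    unsigned tag   = line >> C_BITS;               /* 11001 = 0x19         */

    printf("word %u of line 0x%x -> frame %u, tag 0x%x\n",
           word, line, frame, tag);
    /* a hit requires: frame 1 valid and its stored tag == 0x19 (11001) */
    return 0;
}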
346
Cache Memory Systems (contd.)
The organization of a direct-mapped cache:
[Figure: organization of a direct-mapped cache – the word address issued by the CPU is split into a "tag", a "frame number" and a "word number"; the frame number indexes a RAM of line frames (each holding a valid bit, a dirty bit, the tag and the data), the frame read out lands in an output latch, its stored tag is compared (=) against the address tag to generate hit/miss, and the word number drives a multiplexer that steers the desired word out of the output latch.]
The line frames are words within a static RAM.
347
Cache Memory Systems (contd.) Steps for read access:
1. Data/tag readout: Use the frame number bits from the address issued by the CPU to read out the contents of the addressed frame into a latch. (The frame number bits are obtained by taking the lower c bits of the line containing the word.)
2. Tag comparison: Compare the tag field in the address (higher (N – c – L) bits of the address issued by the CPU) with the tag value of the line frame read out in step 1. If a match occurs and the line frame read out is valid, then signal a cache hit. Signal a cache miss otherwise.
3. Data steering: On a cache hit, use the lower L bits of the address issued by the CPU (the word number within the line) to select the desired word from the array output latch.
– in many modern caches, steps 2 and 3 can be overlapped, as described later.
Steps for a write access:
– Steps 1 and 2 are similar to those for a read
3. Updating within the latch: on a hit, the data word being written is first written into the appropriate field in the array output latch. The dirty bit within this latch is also set.
4. Updating the line in the array: On a hit, the contents of the latch are written back to their original location within the array.
348
Cache Memory Systems (contd.)
Handling a miss: on a miss, the line read out in the output buffer is the victim: if it is dirty, it is written back. The missing line is read from memory into the output latch, its tag value is written into the tag part of the output latch, and the valid bit is also set. The contents of the output latch are then written to the line frame identified using the middle-order bits in the address that was issued by the CPU.
Timing issues: we assume that the time to access the data and tag parts of a line from the cache is identical. This is true when both are parts of the same RAM (as we have shown) or when they are both implemented in the same technology. The overall timing is as shown:
[Timing diagram: the cache probe is started; the tag and data parts of the line frame are read out into the latch; the tag match is then completed; finally the required data is steered to the destination. The cache access time spans this entire sequence along the time axis.]
The cache access time, tc, is defined as the time between the start of the cache probe and the time by which the data is available in the destination.
In most scalar pipelines, the cache access time – particularly the cache RAM access time – determines the pipeline cycle time. In early pipelined microprocessors, the cache access was performed in just a single stage, while in later designs, which featured a higher clock rate, the cache access is performed in two stages: the tag/data RAM is read out in the first stage, while the remaining steps are carried out within the second stage.
349
Cache Memory Systems (contd.)
In many modern microprocessors, the data steering time can be quite substantial due to the relatively large wire delays. This may well force the use of three stages for the cache access.
A unique feature of the direct-mapped cache is that on a read access, only one data item is delivered out from the cache. The candidate data can be steered towards the destination as the tag is being compared, cutting down on the cache access time in case of a hit. On a miss, the latching of the steered data into the destination latch is abandoned:
[Timing diagram: the cache probe is started; the tag and data parts of the line frame are read out into the latch; the candidate data is steered towards the destination while the tag match completes, so the cache access time on a hit is shortened.]
– Modern systems use this approach.
A further timing optimization is possible when the data steering time is less than the tag comparison time (this, BTW, is unlikely in modern microprocessors). Here, the processing on this data item can start speculatively before the result of the tag match is available. If a hit occurs, this effectively overlaps part of the processing with the tag matching step. If a miss is signalled, further processing can be abandoned. This overlap can be well exploited when the I-cache is direct-mapped. Parts of the instruction decoding step and the tag matching process can be overlapped.
350
Cache Memory Systems (contd.)
A direct-mapped cache provides a good performance for instruction accesses, which are predominantly sequential in nature. If the capacity of the direct-mapped cache is C line frames, it will still provide good performance for all types of accesses in general when C is reasonably large.
I-caches in many high-end microprocessors are direct-mapped, since they provide fast access times. Many off-chip caches, which have a large capacity, are also direct-mapped.
Another timing optimization is to avoid the use of the AND gate by widening the comparator by one bit and comparing the tags as before, and also comparing the valid bit with a literal one.
The fully-associative cache:
Here the cache has a capacity of C (= 2^c) line frames, numbered 0 onwards. A memory line can be placed in any line frame, irrespective of its line address. This very flexible placement rule has far reaching consequences:
The tag width needed to uniquely identify a memory line within the cache has to be (N – L) bits, which is independent of C. The tags are thus wider than what we have in direct-mapped caches.
When a victim line frame has to be found, quite unlike the case of a direct-mapped cache, any one of the line frames can be chosen. A replacement algorithm has to be found to choose one of the line frames. Because of this wide choice for selecting a victim line, a good replacement algorithm can provide a very high hit ratio even for a relatively small fully-associative cache.
351
Cache Memory Systems (contd.)
– As we will see later, a random replacement algorithm can give a fairly good performance with a fully-associative cache.
An associative lookup mechanism has to be used to look up the desired memory line from the cache: in essence, every line frame needs its own comparator to detect if the tag part of the issued address matches the contents of its tag field.
The organization of the fully-associative cache is as follows:
[Figure: organization of the fully-associative cache – the "tag" part of the word address issued by the CPU is compared in parallel against every entry of an associative tag memory (each entry holding a valid bit, a dirty bit and the tag); a match selects the corresponding entry of a non-associative data memory, whose contents are read into an output latch, and the "word number" drives a multiplexer that steers the desired word to the destination, along with the hit/miss indication.]
352
Cache Memory Systems (contd.)
Steps for a read access:
1. The tag part of the issued address is associatively compared (i.e., compared in parallel) with the tag parts of each cache line.
2. If a match occurs and the contents of the line frame are valid (i.e., if there is a cache hit), the data part of the matching line frame is steered out into an external latch, along with an indication of the hit. On a miss, the associative array simply puts out the miss signal.
3. On a hit, the desired word is then steered through a multiplexer to its destination. Status information used for replacement is also updated.
Steps for a write access: similar to those for a read: the associative search is performed and the data is written into the matching line (only on a hit).
Handling a cache miss: On a miss, the replacement logic is used to identify a victim line frame. The missing line is written into the victim frame along with its associated tag, and the replacement status information is updated.
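A software sketch of this lookup is shown below; the per-frame comparison that the hardware performs in parallel is emulated with a loop, and the sizes are illustrative:

/* Sketch: fully-associative lookup - every frame's tag is compared with
   the tag of the issued address (in parallel in hardware). */
#include <stdio.h>

#define FRAMES 16
#define L_BITS 4                 /* 16 words per line (assumed) */

typedef struct { int valid; unsigned tag; } fa_frame_t;
static fa_frame_t tags[FRAMES];

static int fa_lookup(unsigned addr)          /* returns frame # or -1      */
{
    unsigned tag = addr >> L_BITS;           /* tag = entire line address  */
    for (int i = 0; i < FRAMES; i++)         /* "parallel" comparison      */
        if (tags[i].valid && tags[i].tag == tag)
            return i;
    return -1;                               /* miss */
}

int main(void)
{
    tags[7].valid = 1; tags[7].tag = 0x123;
    printf("lookup 0x1234 -> frame %d\n", fa_lookup(0x1234));   /* hit: 7   */
    printf("lookup 0x5678 -> frame %d\n", fa_lookup(0x5678));   /* miss: -1 */
    return 0;
}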
353
Cache Memory Systems (contd.)
Timing issues: In the fully-associative cache, the data can be steered only after the associative lookup is completed:
[Timing diagram: the cache probe is started; the associative lookup completes; the data from the matching line becomes available in the output latch; only then is the required data steered to the destination. The cache access time spans this entire sequence.]
Note that unlike the direct-mapped cache, there is no possibility of steering the data towards the destination as the associative lookup is in progress. Furthermore, the associative lookup process generally takes longer than a solitary comparison due to the length of the wires internal to the associative array. Consequently, all things being equal, the fully associative cache will have a longer cycle time than the direct-mapped cache.
Fully associative caches are generally not used for the I-cache or the D-cache, since their access time is higher than that of other alternatives for these caches. Typically, the fully associative cache is used to implement TLBs (LATER!), which are small in size and thus have an acceptably low cycle time.
354
Cache Memory Systems (contd.)
The design of the set-associative cache is motivated by the contrasting features of the direct-mapped cache and the fully-associative cache:
Feature                              Direct-mapped cache                  Fully-associative cache
cycle time                           fast                                 relatively slower
amount of hardware                   small: latch + comparator +          substantial: associative logic + latch +
needed beyond a RAM                  multiplexer                          multiplexer + replacement logic
hit ratio                            low for small sizes                  high, even for small capacities
The higher hit ratio of the fully-associative cache comes from its ability to have a choice in selecting a victim. The lower cycle time of the direct-mapped cache comes from the use of an external comparator and its ability to overlap parts of the tag comparison process and the data steering steps. The design of the set associative cache attempts to incorporate both of these features.
355
Cache Memory Systems (contd.)
In a p-way set-associative cache with a capacity of C line frames, we effectively have p direct-mapped caches that have a capacity of C/p line frames.
The line frames that are at the same offset within each direct-mapped cache make up a set.
A set thus consists of p line frames
The placement rule for a set-associative cache allows a line from the memory to be placed within any line frame in the set that is uniquely identified by the lower log2(C/p) bits of the line address.
If a victim needs to be selected, the contents of one of the p line frames may be chosen as a victim: this provides the flexibility that is missing in a direct-mapped cache.
Once a choice exists in selecting a victim for replacement, appropriate status information must be maintained for that purpose.
The lookup process for a set-associative cache is as follows:
– Direct lookup of the line frames of the set using the lower log2(C/p) bits of the line address.
– Fully-associative lookup of the desired line within the set.
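The two-step lookup just described can be sketched as follows (set count, associativity and line size are illustrative):

/* Sketch: p-way set-associative lookup - direct lookup of the set, then an
   associative search of the p frames in that set. */
#include <stdio.h>

#define P        4               /* ways                         */
#define SETS     64              /* C/p sets -> 6 set-index bits */
#define L_BITS   4               /* 16 words per line (assumed)  */

typedef struct { int valid; unsigned tag; } way_t;
static way_t cache[SETS][P];

static int sa_lookup(unsigned addr)          /* returns way # or -1         */
{
    unsigned line = addr >> L_BITS;
    unsigned set  = line % SETS;             /* direct lookup of the set    */
    unsigned tag  = line / SETS;
    for (int w = 0; w < P; w++)              /* associative search in set   */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return w;
    return -1;                               /* miss */
}

int main(void)
{
    unsigned addr = 0xABCD;
    unsigned line = addr >> L_BITS, set = line % SETS, tag = line / SETS;
    cache[set][2].valid = 1; cache[set][2].tag = tag;
    printf("0x%X -> set %u, hit in way %d\n", addr, set, sa_lookup(addr));
    return 0;
}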
356
Cache Memory Systems (contd.)
Note the two extremes of a p-way set-associative cache:
p = 1: this is a direct-mapped cache
p = C: this is a fully-associative cache.
The set-associative cache is organized as follows:
[Figure: a 4-way set-associative cache – the word address issued by the CPU is split into a "tag", a "set #" and a "word number"; the set number reads out the valid bit, tag, data and status of the 4 line frames of a set into output latches, four equality comparators (=) match the stored tags against the address tag, an encoder and the comparator outputs select the matching way and generate hit/miss, a multiplexer steers the desired word to the destination, and the replacement logic updates the status bits.]
357
Cache Memory Systems (contd.) Steps for a read access:
1. The lower log2(C/p) bits of the line address are used to read out the tag, data and status bits of the set into the output buffers.
2. The tag part of the issued address is associatively compared (i.e., compared in parallel) with the tag parts of each of the p cache lines in the output buffers.
3. If a match occurs and the matching line frame is valid (i.e., if there is a cache hit), the data part of the matching line frame is steered out to its destination, along with an indication of the hit. On a miss, the cache simply puts out the miss signal.
– the status information used for replacement is also updated.
Steps for a write access: similar to those for a read: the set-associative search is performed and the data is written into the matching line only on a hit. The time it takes to complete a write hit is thus longer compared to the time needed for a read hit.
Handling a cache miss: On a miss, the replacement logic is used to identify a victim line frame within a set. The missing line is written into the victim frame along with its associated tag, and the replacement status information is updated.
358
Cache Memory Systems (contd.)
Timing Issues: The timing of the set-associative cache is as follows for a read access:
[Timing diagram: the cache probe is started; the tag and data parts of the line frames within the set that potentially contains the desired data are read out into the output latches, and tag matching for the tag fields read out from all of the "ways" is started; the tag match then completes (hit or miss indication available); only after that is the required data steered to the destination. The cache access time spans this sequence.]
– this is like the timing for the unoptimized direct-mapped cache.
Just like the fully-associative cache, there is no way to start steering the required data to the destination as the tag matching step is in progress – in a p-way set associative cache there are p potential candidates for the desired data.
359
Cache Memory Systems (contd.)
– In theory, all p candidate data items can be steered towards the destination and only the desired item can be multiplexed at the destination: this will mask out some of the wire delays, but demands p groups of connections, making this possibility impractical.
A simple analysis: effective memory access time with the cache, Teff. Assumptions:
– Time to detect a cache hit or miss = tc
– Additional time for a read hit = 0
– Additional time needed for a write on a hit = tw (total time for a write hit is thus tc + tw)
– Cache does not overlap steps of consecutive accesses
– hit ratio (for both read & write) = h
– cache is write through with write allocate; on a write miss, memory is first updated and the missing line is then read into the cache
– no read through, but data needed by the CPU is forwarded to the CPU as the line is loaded into the cache
– time to update a word in memory = tM
– time to fetch a line from memory = tL
– time for selecting a victim is overlapped with cache probe
– fraction of write accesses = w
360
Cache Memory Systems (contd.)
The table given below shows the timings in the four different cases that arise on a cache access:

access type \ outcome ->               cache hit (probability = h)      cache miss (probability = 1 – h)
read access (probability = 1 – w)      tc                               tc + tL
write access (probability = w)         tc + tw                          tc + tM + tL
Weighting the timing figures in the above table with the appropriate probabilities for the cases, we get:
Teff = tc·h·(1 – w) + (tc + tL)·(1 – h)·(1 – w) + (tc + tw)·h·w + (tc + tM + tL)·(1 – h)·w
     = tc + (1 – h)·tL + h·w·tw + (1 – h)·w·tM    (1)
– Note a few obvious things: Teff comes down as h increases or as w de- creases, and in general, as the miss handling times come down.
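Equation (1) transcribes directly into a small helper that lets different parameter choices be tried out; the values plugged in below are the ones used in the example on the next page:

/* Sketch: Teff per equation (1) above. */
#include <stdio.h>

static double teff(double tc, double tw, double tM, double tL,
                   double w, double h)
{
    return tc + (1.0 - h) * tL + h * w * tw + (1.0 - h) * w * tM;
}

int main(void)
{
    /* tc = 1, tw = 1, tM = 4, tL = 16, w = 0.3 (the example's values) */
    for (double h = 0.95; h <= 0.985; h += 0.01)
        printf("h = %.2f  Teff = %.3f cycles\n",
               h, teff(1.0, 1.0, 4.0, 16.0, 0.3, h));
    return 0;
}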
361
Cache Memory Systems (contd.)
Let's now plug in some typical values, with times in CPU cycles: tc = 1, tw = 1, tM = 4, tL = 16
w = 0.3
This gives:
Teff = 1 + (1 – h)*16 + h*0.3*1 + (1 – h)*0.3*4
     = 18.2 – 16.9*h (2)
When h = 0.95 (this is easily achieved in practice for typical caches used)
Teff = 2.145
When h = 0.97 (this is again easily achieved in practice for typical caches used)
Teff = 1.807
When h = 0.98 (this is also achieved in practice for somewhat bigger caches)
Teff = 1.638
– The effective memory access time is thus roughly 115%, 80% and 65% more than the cache read access time
362
Cache Memory Systems (contd.)
Best value of Teff is 1.3 occurs when there are no misses (h = 1), which is, of course, impossible in practice. Note that the extra 0.3 cycles beyond the 1 cycle needed for a read access come from the additional time needed for a write during a write hit. A subsequent cache access is delayed till the write is completed.
Cutting down the effective time of a cache write hit – allowing the additional time needed by a cache access to be overlapped with the additional time needed by a prior write hit:
Basic idea: use a 2nd. port into the data part of the cache arrays to do the write after a write hit is detected.
The data being written has to be forwarded to an immediately following cache read, if needed.
The tag part of the cache does not need an extra port – keeps the tag RAM access time unchanged.
Read and write hits can be sustained in any order at the rate of one access per cycle: the additional time needed by a cache write thus does not get in the way of cache accesses.
The write mechanism can also be used to load missing data into the cache.
363
Cache Memory Systems (contd.) Resulting timings:
[Figure: pipelined cache writes and reads – one cache access per cycle is maintained on read or write hits. Cache access pipeline: in Stage 1 the cache probe is started and the tag and data parts of the line frames within the set are read out into the output latches; in Stage 2 the tag match completes, the data required by a read is steered to its destination, data from an impending cache write is forwarded to a read immediately following that write if necessary, and the write is started; in Stage 3 (used only on a write hit) the data for the write hit is written through a second port into the data RAM part of the cache.]
The second port used for the write is also used for filling in the cache.
Second port can be avoided if the data part is sub-banked (= broken down into columns, each a word wide, with a single port for each word-wide column)
– as long as the write and the access following it address distinct word-wide columns, the write update and the following access can be done concurrently.
364
Cache Memory Systems (contd.)
What it really takes to do pipelined cache writes: dual-ported data arrays plus miscellaneous logic –
[Figure: removing the additional delay needed by a cache write hit through write pipelining – the cache components used for a read access (tag/data arrays, set # and word # decode, tag comparators, multiplexer) are augmented with pipeline latches between stages 1 and 2 and between stages 2 and 3, a second port through which the data for the line frame being written is driven into the data array, a comparator that detects whether a read attempt targets the word written in the previous cycle, and multiplexers that forward the write data to such a read.]
365
Cache Memory Systems (contd.) Performance Aspects:
Caches in modern systems – at least at the first level – are split: separate caches are used for instructions (I-cache) and data (D-cache).
The three Cs of cache misses: cache misses fall into three categories:
Compulsory or cold start misses: happen when the very first reference is made to a line.
Capacity misses: these happen because the cache is not big enough to hold all the data that is accessed by the program – this causes cached blocks to be evicted to make room and brought back into the cache when they are needed again.
Conflict misses: these occur as lines contend for space within a common line frame – conflicts increase as associativity decreases.
Capacity misses are the most dominant; conflict misses decrease as associativity increases; the relative fraction of compulsory misses increases as the cache capacity and associativity are increased (since the numbers of capacity and conflict misses come down).
Overall miss rates go down as cache size and associativity increases. On the down side, increasing associativity may result in a longer cache cycle time.
Associativities beyond 4 or 8 do not help that much in reducing cache miss rates.
366
Cache Memory Systems (contd.)
Miss rates of 2% or less, overall, are typical even in modest sized caches (16 KBytes, 4-way associative).
Rule of thumb: the miss rate of a direct-mapped cache is roughly equal to that of a 2-way set-associative cache of roughly half the capacity.
Typical range of line sizes used in L1 caches: 16 bytes to 64 bytes; for L2 caches somewhat larger line sizes (up to 128 bytes) are used.
Increasing the line size keeping the cache capacity fixed initially causes the miss rate to come down – larger line sizes better exploit spatial locality of reference. However, increasing the line size keeping the capacity (and associativity) fixed cuts down on the number of sets and increases the possibility of conflicts, increasing the overall miss rate.
Larger line sizes also increase the cache miss handling time. An optimum line size that gives the best overall performance (not miss rate) therefore exists.
367
Cache Memory Systems (contd.)
Improving the cache miss access time:
The simple estimate of the effective memory access time with a cache shows that the cache miss handling time has to be improved to reduce Teff. Several techniques are possible:
Read through: This was discussed earlier (see the load-through/read-through discussion on page 342).
Write Buffering: Writes to the memory are required on a cache miss:
– in write-through caches, on a write miss, and
– in write back caches, on a read or write miss (with WA) if the victim line is dirty
Instead of waiting for these writes to memory to complete, these writes are deposited into a fast write buffer, from where they proceed at the rate that memory allows. The cache controller waits for whatever time it takes to deposit these writes to the write buffer.
A set of comparators monitor the addresses of pending writes in the write buffer to forward data required by a later miss that requires the data in the write buffer.
368
Cache Memory Systems (contd.)
Load bypassing: This technique is used in conjunction with write buffering: if the data required by a read miss (triggered by a load) does not need any data item sitting in the write buffer, the memory access for the read miss is serviced before any of the writes in the write buffer.
– Helps overall performance as many subsequent instructions are data dependent on a load in general.
Multiple levels of caches: general motivations:
(a) Caches closest to the main datapath of the processor should be as fast as possible: these are the so-called level 1 or L1 caches
=> these caches have a relatively lower capacity ("smaller is faster") and a relatively higher miss rate
(b) For best overall performance, it is thus essential to reduce the miss handling times for the caches closest to the main datapath
– The second requirement can be met by using caches beyond the L1 caches (level 2 onwards).
Design considerations of L2 and higher level caches: note that references that have locality are mostly satisfied by the L1 caches: the L1 caches thus "filter out" most of the locality.
– The access pattern directed at the L2 and higher level caches thus have progressively lower amounts of locality
369
Cache Memory Systems (contd.)
To get a good hit ratio and performance, the L2 and higher level caches have to have progressively bigger capacities (and/or higher associativity), with larger line sizes.
Streaming type of RAMs (synchronous DRAMs, EDO DRAMs, synchronous SRAMs) will serve as efficient storage technology for handling misses from the lowest level cache.
Most modern systems use at least two levels of caching:
– split, on-chip L1 caches, operating at the pipeline clock rate
– split or shared (between instructions and data) L2 cache(s): these are generally off-chip, but in some cases the L2 cache(s) may be on-chip
Examples:
Most low-end PCs and workstations: on-chip L1 caches (16 to 32 KBytes, split) and off-chip SRAM-based integrated L2 cache.
DEC Alpha 21164 implementation: two levels of on-chip caches – L1 I-cache and L1 D-cache: 8 KBytes each; L2 cache – integrated (on-chip), 96 KBytes
Pentium II: L1 caches: 16 KBytes each, with 256 KBytes or 512 KBytes of L2 cache integrated within the same package (or cartridge)
Multiple levels of caches will remain mainstream as CPU clock rates grow at a higher rate than improvements in memory speeds.
370
Cache Memory Systems (contd.)
Subblocking: With cache subblocking a line frame (aka a block frame) is divided up into slots that are each capable of holding a part of a line (called a subblock). There is a valid bit associated with each subblock frame (as well as, possibly, dirty bits):
[Figure: a subblocked line frame – a valid bit for the tag (v), the tag field ("tag part") and a data field ("data part") divided into subblocks, each subblock carrying its own valid bit.]
– on a miss, a line frame is allocated as before but only the subblock containing the targeted word is fetched and its valid bit is set.
– a cache hit now requires: a tag match, a valid bit for the tag to be set and a valid bit for the targeted subblock to be set as well.
– a miss may still occur if the tag is valid and matches, but the required subblock is not valid: in this case the missing subblock needs to be fetched; no allocation of a line frame is needed.
With subblocking larger block sizes can be used without increasing the miss handling time.
Cache subblocking also provides a way of "sharing" a common tag across subblocks.
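A minimal sketch of the subblocked hit check described above (sizes illustrative):

/* Sketch: hit check with per-subblock valid bits. */
#include <stdio.h>

#define SUBBLOCKS 4

typedef struct {
    int      tag_valid;
    unsigned tag;
    int      sub_valid[SUBBLOCKS];   /* one valid bit per subblock */
} sub_frame_t;

/* 0 = full miss (allocate a frame), 1 = hit,
   2 = subblock miss (fetch only the missing subblock, no new frame) */
static int sub_lookup(const sub_frame_t *f, unsigned tag, int subblock)
{
    if (!f->tag_valid || f->tag != tag) return 0;
    return f->sub_valid[subblock] ? 1 : 2;
}

int main(void)
{
    sub_frame_t f = { 1, 0x3A, { 1, 0, 1, 0 } };
    printf("%d %d %d\n",
           sub_lookup(&f, 0x3A, 0),    /* 1: hit                      */
           sub_lookup(&f, 0x3A, 1),    /* 2: subblock miss            */
           sub_lookup(&f, 0x7F, 0));   /* 0: tag mismatch, full miss  */
    return 0;
}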
371
Cache Memory Systems (contd.)
Lockup-free caches: With lockup-free caches, one or more subsequent cache accesses are allowed as a miss is being serviced.
– The CPU stalls only when some of these subsequent accesses require the line being fetched.
Victim caches: Here a separate victim cache is used to hold recently evicted lines from the cache. On a miss, the missing line can be possibly found in the victim cache – works well when conflict misses or capacity misses dominate.
Software techniques: several techniques can be used to improve the performance of caching – examples include the following:
Data padding: Fields of a structure that are accessed in succession can be forced to lie within a single cache line for efficient access by padding the structure with dummy data to force the alignment of its elements within a line.
Array merging: elements of different arrays that are accessed in succession can be put into the same cache line to improve the hit ratio and overall performance:
for (i = 0; i< N; i++) {
A[i] = B[i] + C[i];
B[i] = B[i]/2;
}
Here, A[i], B[i] and C[i] can be forced to be within a common cache line by using an array of a structure that has fields for the elements of A, B and C.
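A sketch of this array-merging idea in C (names and sizes are illustrative): the three separate arrays are replaced by a single array of a structure so that the elements used together fall within the same or an adjacent cache line:

    #define N 1024

    /* Separate arrays: A[i], B[i] and C[i] live in three different cache lines. */
    double A[N], B[N], C[N];

    /* Merged layout: the elements accessed together share a line. */
    struct abc { double a, b, c; };
    struct abc M[N];

    void merged_loop(void)
    {
        for (int i = 0; i < N; i++) {
            M[i].a = M[i].b + M[i].c;   /* one line fetch brings in a, b and c together */
            M[i].b = M[i].b / 2;
        }
    }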
372
Cache Memory Systems (contd.)
Loop fusion: Consecutive loops that access common arrays can be fused, data dependencies permitting, into a single loop to promote the efficient use of the data cache.
Loop interchange: The iterators for nested loops can be switched where data dependencies allow to promote better use of the cache:
for (j = 0; j < N; j++)
for (i = 0; i < N; i++)
if (A[i, j] > s) A[i,j] = A[i,j] – s;
In this loop, assuming a row-major allocation (as in C) and each element fitting into a single word, the consecutive elements of A accessed by the inner loop are N words apart. More efficient cache accesses, touching elements in consecutive words, take place when the j and i loops are interchanged, as sketched below.
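The interchanged version, sketched in C for a row-major 2-D array (N is an illustrative size), makes the inner loop walk consecutive words:

    #define N 1024

    /* Original: j outer, i inner -> stride-N accesses with row-major storage.
       Interchanged: i outer, j inner -> unit-stride accesses. */
    void threshold(double A[N][N], double s)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (A[i][j] > s)
                    A[i][j] = A[i][j] - s;
    }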
Blocking: Blocking simply exploits temporal locality of reference: data items fetched into the cache are reused from within the cache as much as possible. This is done by limiting the span of inner loops to force data reuse as much as possible.
Example: 2-dimensional matrix multiplication. The original inner loop is converted to two nested loops that force the product for one submatrix to be generated at a time (see the sketch below).
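A minimal C sketch of such blocking (tiling) for matrix multiplication, assuming square N x N matrices and a tile size BS chosen so that the three BS x BS tiles fit comfortably in the cache:

    #define N  1024
    #define BS 32    /* tuning parameter: 3 * BS * BS doubles should fit in the cache */

    void matmul_blocked(double C[N][N], const double A[N][N], const double B[N][N])
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int kk = 0; kk < N; kk += BS)
                    /* one BS x BS tile of the product at a time: data fetched for
                       the tile is reused from the cache before it is evicted */
                    for (int i = ii; i < ii + BS; i++)
                        for (int j = jj; j < jj + BS; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + BS; k++)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }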
373
Cache Memory Systems (contd.)
Cache memory accessing in a system using paged virtual memory:
The CPU issues virtual addresses. Each such virtual address consists of a virtual page number (VP#) followed by the offset within the page.
Each virtual address has to be translated to a physical address before a memory access is made. What about accessing caches?
– The answer depends on the type of index and tagging used by the cache.
The index refers to the bits used to select a set
Tagging, as before, refers to the information used to uniquely identify a line within a line frame.
An index or tag can be physical (meaning they are derived out of the physical address, after address translation) or virtual (meaning that they are picked up directly off the virtual address issued by the CPU)
For a paged virtual memory system, if the index is derived from bits within the offset field, then the index is both virtual and real – since the offset bits remain unchanged during address translation.
374
Cache Memory Systems (contd.)
Possible combinations of indices and tags:
Index: physical or virtual (derived from the page-offset bits); Tag: physical
– constraint: the index does not change during address translation
– resulting cache type: physically indexed and tagged (aka physically addressed cache); requires translation of the VP# using a TLB to detect a hit or miss – most common

Index: physical; Tag: physical
– constraint: virtual address translation is needed before indexing into a set
– resulting cache type: physically addressed cache; both the VP# and part or all of the index bits have to be translated before any array access is possible – not used

Index: virtual; Tag: virtual
– constraint: possible mapping constraints
– resulting cache type: virtually indexed and tagged (virtually addressed cache) – used in some CPUs; avoids address translation delays

Index: virtual; Tag: physical (real)
– constraint: possible constraints (?)
– resulting cache type: virtually indexed and physically tagged – used in some CPUs

Index: physical (real); Tag: virtual
– not used in any real CPU
375
Cache Memory Systems (contd.)
In a virtually-indexed and physically tagged cache, under certain conditions, cache access and address translation can be overlapped if a TLB is used.
A TLB (Translation Lookaside Buffer) is a special, small cache used for translating the virtual page number of recently accessed pages to their corresponding (physical) page frame numbers.
– if the index bits used for selecting a set do not depend on address translation, the readout of the tags and data can take place while the virtual page number gets translated using the TLB. Details are given LATER.
Characteristics of the TLB:
The TLB is probed using the virtual page number and an address space identifier (ASid) that uniquely identifies the virtual address space that this virtual page belongs to as the probe keys.
– The ASid is kept in a special control register
– The ASid can be the process id (pid) or parts of the pid
– The ASid is not needed as a key if the TLB is purged (i.e., flushed) when a context switch occurs.
The TLB can be fully-associative (common) or it can be set-associative.
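A simplified C model of a fully-associative TLB probe keyed on (ASid, VP#) may help; the entry fields and sizes are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        bool     valid;
        uint16_t asid;   /* address space identifier (e.g., pid or part of it) */
        uint32_t vpn;    /* virtual page number                                */
        uint32_t pfn;    /* physical page frame number                         */
        uint8_t  prot;   /* page-level protection bits (r/w/x)                 */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit and writes the frame number to *pfn.
       In hardware all entries are compared in parallel (CAM), not in a loop. */
    bool tlb_lookup(uint16_t asid, uint32_t vpn, uint32_t *pfn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
                *pfn = tlb[i].pfn;
                return true;
            }
        return false;    /* TLB miss: trap to the software miss handler */
    }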
376
Cache Memory Systems (contd.)
The number of entries in the TLB is quite small: 16, 32, 64 or 128 entry TLBs have been used in practice.
Despite the small size of the TLB, a good hit ratio is obtained, since each entry within a TLB “covers” accesses within an entire page. (Accesses are localized within a page or a small number of pages).
There may be dedicated TLBs for the I-cache and the D-cache (increasingly common) or a multi-ported TLB may be shared between the I-cache and the D-cache (early microprocessors).
Page-level protection information (i.e., access modes – read-only, read-write, execute-only etc.) is also stored within the TLB and is used to ensure that pages are accessed in the appropriate mode.
If the cache is physically tagged, it is necessary to first translate the virtual page number issued by the processor to a physical page number in order to determine a cache hit or miss. This implied ordering leads to two possible implementations for sustaining a high cache access rate (especially for I-caches, which, in general, have to be accessed once per cycle):
First translate the virtual page number using the TLB and then perform the cache access – throughput degradations due to this sequential ordering can be avoided by pipelining the access.
– Not a preferred solution: increases latencies of LOADs, branch penalties etc. (The cache access takes 2+ stages; the TLB potentially adds one more.)
Overlap accesses of TLB with part of access of physically tagged cache: possible under some conditions (coming up shortly!)
– preferred approach.
377
Cache Memory Systems (contd.)
Privileged instructions are provided to allow the OS to manipulate the TLB entries. Examples of TLB-related instructions:
– PROBE_RANDOM: select an entry at random from the TLB (for replacement) and return its index
– WRITE_INDEX_hi|lo: write the contents of a register into the TLB slot pointed to by the index
– an invalidate instruction: invalidates the entry for the specified virtual page number of the specified process
– The virtual address that resulted in a TLB miss is also available in a special TLB register. Other instructions and special registers are also used.
– A special instruction may also be included to flush the entire TLB.
378
Cache Memory Systems (contd.)
TLB misses are usually handled in software (TLB misses trap to a kernel-level trap handler, which typically uses dedicated trap-handling registers; such traps can be serviced without the usual saving and restoration of registers needed on a context switch):
[Flowchart, shown as steps:]
1. Enter on TLB miss.
2. Extract the virtual page number that resulted in the TLB miss and get the process id of the process that was running.
3. Look up the page mapping table for the process to see if the page is in main memory.
4. Page not in main memory: invoke the routines for context switching and page fault service.
5. Page in main memory: select a victim entry in the TLB ("PROBE_RANDOM"), install an entry in the TLB for the missing page ("WRITE_INDEX") and resume.
– Note that page faults are detected by the TLB miss handler!
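A schematic C version of the handler above; PROBE_RANDOM and WRITE_INDEX are modeled by stub functions, and the page-table lookup and page-fault service are assumed interfaces for the sketch, not a real OS's API:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed interfaces standing in for privileged TLB instructions and OS services. */
    extern bool page_table_lookup(uint16_t asid, uint32_t vpn, uint32_t *pfn);
    extern int  tlb_probe_random(void);                                   /* "PROBE_RANDOM" */
    extern void tlb_write_index(int idx, uint16_t asid, uint32_t vpn, uint32_t pfn); /* "WRITE_INDEX" */
    extern void page_fault_service(uint16_t asid, uint32_t vpn);          /* incl. context switch */

    void tlb_miss_handler(uint16_t asid, uint32_t miss_vpn)
    {
        uint32_t pfn;
        if (page_table_lookup(asid, miss_vpn, &pfn)) {
            int victim = tlb_probe_random();              /* pick a TLB entry to replace */
            tlb_write_index(victim, asid, miss_vpn, pfn); /* install the new mapping     */
            return;                                       /* resume the faulting access  */
        }
        /* Page not in main memory: the TLB miss handler has detected a page fault. */
        page_fault_service(asid, miss_vpn);
    }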
The access time of the TLB is usually smaller than that of the tag and data arrays – this is a consequence of "smaller is faster".
379
Cache Memory Systems (contd.)
Overlapping TLB-based address translation with cache access in a physically addressed cache:
Basic premise: the bits of the virtual address used as the index (to select a set) are not affected by address translation.
As long as the above constraint is maintained, the index bits are used to read out the tag and data arrays while the virtual address bits are translated to derive the tag needed for tag matching:
[Figure: Overlapping the access of a physically addressed cache with TLB-based address translation. Timeline: the cache probe is started using the index bits of the virtual address while the TLB access is started in parallel; the tag and data parts of the line frames within the set are read out into the output latches; the TLB access completes, making the physical tag available for comparison; the tag match completes; finally, the data required by a read is steered to its destination.]
– Note the need to keep the TLB access time no larger than the tag and data array access times: if this is not the case, address translation delays become the bottleneck.
380
Cache Memory Systems (contd.)
Example hardware setup for overlapped address translation for a physically addressed cache:
[Figure: A 2-way set-associative cache with a TLB. The word address issued by the CPU is split into the virtual page number (VP#), the "set #" and the "word number". The set number indexes the tag and data arrays, whose outputs (including the valid bits and set status) are captured in output latches, while the VP# and the address space id probe the Translation Lookaside Buffer (TLB) to produce the page frame number (PF#), i.e., the physical address tag. Equality comparators match this physical tag against the tags read out of the set; their outputs drive the hit/miss signal, an encoder and a multiplexer that steer the desired word to its destination, and the replacement logic that updates the set status.]
381
Cache Memory Systems (contd.)
Guaranteeing the constraints of overlapped translation for a physically addressed cache:
Ensure that the index bits are confined entirely within the offset field:
The virtual address is split into the virtual page number (VP#) and the page offset; the offset has f bits (2^f bytes per page), of which L bits select the byte within a line and s bits form the index.
– This requirement boils down to ensuring that (s + L) <= f
– The maximum number of index bits is thus: s_max = f – L
– This has implications on the total capacity of the cache:
If the cache is p-way set-associative, the maximum cache capacity that allows address translation to be overlapped with the tag and data array accesses is:
D_max = p * 2^s_max * 2^L bytes = p * 2^(f – L) * 2^L bytes = p * 2^f bytes
– There are several ways to increase the cache capacity and still meet the constraints of overlapping address translation:
(a) Increase associativity (p) – not preferred beyond a limit
(b) Increase the line size (2^L) – not preferred beyond a limit, as the cache miss handling time increases, unless subblocking is used.
382
Cache Memory Systems (contd.)
Have the OS ensure that the k lower order bits in the virtual page number are identical to the k lower order bits in the frame number, where k is typically small. Since these k bits are unaffected by address translation, they can be used to extend the index beyond the page offset field.
– In this case, s_max = (f + k – L) bits
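A small C illustration of the resulting capacity limit, assuming 4 KByte pages (f = 12) and 64-Byte lines (L = 6); the numbers are only examples:

    #include <stdio.h>

    /* D_max = p * 2^(f + k - L) * 2^L = p * 2^(f + k) bytes */
    static unsigned long max_capacity(unsigned p, unsigned f, unsigned L, unsigned k)
    {
        unsigned s_max = f + k - L;                  /* maximum number of index bits */
        return (unsigned long)p * (1UL << s_max) * (1UL << L);
    }

    int main(void)
    {
        printf("%lu bytes\n", max_capacity(8, 12, 6, 0));  /* 8-way, no coloring: 32 KBytes  */
        printf("%lu bytes\n", max_capacity(8, 12, 6, 2));  /* k = 2 colored bits: 128 KBytes */
        return 0;
    }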
383
Cache Replacement Algorithms
Key requirements:
Must be fast – otherwise the replacement algorithm may stretch the cache miss handling time => cannot implement in software
Must be easy to implement in hardware (relatively small area requirements)
Preferable: replacement logic operates in parallel with cache access so that victim is identified by the time a cache hit or miss is detected.
Commonly used algorithms for set-associative caches:
Hot/cold bit: used in two-way set-associative caches:
A single status bit is associated with each set: this bit is set to point to the line frame in the way last accessed within the set.
The status bit is read out with the tag and data parts of the set
The victim frame is pointed to by the complement of this bit; no explicit complementing is necessary, since the complemented signal is available from the status latch when the set is read out. This implements true LRU for the two-way case.
The status bit is updated on a hit to a line frame within the set: this requires the status bit to be updated on every cache access: this updating can be done using a second port to the status RAM, in parallel with the steering of the data.
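The hot/cold bit scheme is small enough to model in a few lines of C (NUM_SETS is illustrative):

    #define NUM_SETS 256

    /* hotcold[set] points to the way last accessed; the victim is its complement. */
    static unsigned char hotcold[NUM_SETS];

    static unsigned hc_victim_way(unsigned set)          { return hotcold[set] ^ 1u; }
    static void     hc_touch(unsigned set, unsigned way) { hotcold[set] = (unsigned char)way; }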
384
Cache Replacement Algorithms (contd.)
A true LRU implementation for a 4-way set-associative cache:
True LRU implementations are expensive: for an m-way set-associative cache, enough status bits must be used to rank order the m line frames within a set in the order of accesses made to them. The number of orderings possible is m!, so log2(m!) status bits (rounded up) must be used per set for this purpose.
A more direct approach that is easier to implement is shown below. This is the logic used within each set to identify the LRU frame:
[Figure: LRU logic for one set of a 4-way set-associative cache – a chain of D-latch pairs (A0A1, B0B1, C0C1, D0D1), each holding a 2-bit frame address and clocked by the Hit signal; L0L1 is the address of the currently accessed frame and NA, NB, NC are the shift-enable signals for the A, B and C latch pairs.]
– Here L0L1 is the address of the currently accessed frame, while D0D1 gives the address of the victim frame.
385
Cache Replacement Algorithms (contd.)
– NA is a one if the currently accessed frame is not the one whose address is stored in A0A1:
NA = (L0 XOR A0) + (L1 XOR A1)
– NB and NC are similarly derived.
– The hit line is pulsed high on every hit.
– What this logic does is shift the address of an accessed frame to the right if it is not subsequently accessed.
– The address of the LRU frame is found in the rightmost pair of latches
– Reads and updates are done as in the case of the hot/cold bits.
It is unlikely for a true LRU implementation to be used in practice because of its complexity; approximate LRU implementation and other algorithms give adequate performance.
A multi-level hot/cold bit replacement algorithm approximating the true LRU algorithm:
Used in the Intel 486 and Pentium CPUs for 4-way set-associative caches.
386
Cache Replacement Algorithms (contd.)
Here the 4 frames within a set (say, A, B, C and D) are divided into pairs (A,B) and (C,D).
A hot/cold bit is used to point to the pair most recently accessed, while two hot/cold bits are used to identify the frame most recently accessed within a pair. A total of three status bits are thus needed within each set. The hot/cold bit within the pair that is not accessed remains unchanged.
The victim frame is within the cold half, and is the cold frame within this half: the address of the victim is obtained by complementing the hot/cold bit for the pair and complementing the hot/cold bit within the pair so identified.
An example trace; this assumes that initially all three status bits are cleared. The pair (A,B) has address 0 and the pair (C,D) has address 1; the frame addresses are A = 00, B = 01, C = 10, D = 11. The status bits are listed as (pair bit, bit within pair 0, bit within pair 1):

Frame accessed   Status bits   Victim
A                0 0 0         D
B                0 1 0         D
C                1 1 0         A
D                1 1 1         A

– Note why this is approximate LRU: the victim selected after accessing C should have been D if true LRU were implemented.
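A C sketch of this three-bit scheme for frames A = 0, B = 1, C = 2, D = 3; the bit layout is an assumption made for the sketch:

    #include <stdint.h>

    #define NUM_SETS 256

    /* bits[set]: bit 0 = hot pair (0 -> pair (A,B), 1 -> pair (C,D)),
       bit 1 = hot frame within pair (A,B), bit 2 = hot frame within pair (C,D). */
    static uint8_t bits[NUM_SETS];

    void plru_touch(unsigned set, unsigned frame)        /* frame in 0..3 */
    {
        unsigned pair = frame >> 1, within = frame & 1;
        bits[set] = (uint8_t)((bits[set] & ~1u) | pair);                   /* hot pair */
        if (pair == 0)
            bits[set] = (uint8_t)((bits[set] & ~2u) | (within << 1));
        else
            bits[set] = (uint8_t)((bits[set] & ~4u) | (within << 2));
    }

    unsigned plru_victim(unsigned set)
    {
        unsigned cold_pair  = (bits[set] & 1u) ^ 1u;                       /* not-hot pair */
        unsigned hot_within = (cold_pair == 0) ? ((bits[set] >> 1) & 1u)
                                               : ((bits[set] >> 2) & 1u);
        return (cold_pair << 1) | (hot_within ^ 1u);                       /* cold frame   */
    }

Running the access sequence A, B, C, D through plru_touch()/plru_victim() reproduces the victims D, D, A, A shown in the trace above.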
387
Cache Replacement Algorithms (contd.)
Random replacement: Here the victim frame within a set is chosen at random.
No status information needs to be maintained with each set.
Typical ways of implementing random: use a pseudo random generator or use lower order bits of a cycle counting clock or a combination.
Probability of choosing a victim that is likely to be used next is 1/m – this is not as bad as direct-mapped!
Rotating pointer: Here each set has a log2(m)-bit pointer that is initialized randomly to point to the victim block. This is the status information associated with each set.
This counter is read out along with the tags and data items of the set: if the counter points to the frame just accessed, it is incremented (modulo m) to point away from the most recently accessed frame.
Not last used (NLU): Here the status information stored for a set simply points to the frame that was most recently accessed.
The victim frame is chosen randomly as long as it is not the frame pointed to by the pointer. NLU thus simply means replace at random, but not the frame that was accessed most recently (i.e., accessed last).
Has the potential to perform slightly better than random replacement.
388
Cache Replacement Algorithms (contd.)
Replacement algorithms for fully-associative caches (such as most TLBs): The commonly used algorithms here are random, rotating pointer or NLU.
Bottom line:
True LRU replacement algorithms are hardware intensive; approximations – including random – give performance that comes quite close.
As cache sizes increase, the choice of the replacement algorithm used becomes less and less critical.
389
Virtual Address Caches and Virtually Indexed, Physically Tagged Caches
Virtual Address Caches
Used in the Univ. of Manchester MU5 prototype, the Berkeley SPUR prototype and the multi-chip MIPS 6000 CPU implementation (all are history now!)
Main advantage: no address translation is needed before a cache access; no TLB is needed.
Presents some serious problems:
Data inconsistency: a line frame shared among processes does not have the same virtual address in general (since part of the cache tag is made up of the AS identifier). Consequently, multiple entries (synonyms) may exist in the cache for the same physical line frame. An update made by a process will affect only the copy that it modifies – the update is not automatically propagated to the other copies.
Possible solutions:
1. Flush the cache at context switch times and ensure that the physical copy in the main memory also gets updated. A reverse translation buffer (RTB) has to be maintained to get the physical address of the line for these memory updates.
2. Detect synonyms and move them out of the cache, possibly down a cache hierarchy: requires additional facilities and logic (like an RTB) to detect synonyms; this logic was part of the L2 cache in the MIPS 6000. Moving synonyms out requires the same overhead as a miss: this effectively slows down cache operation.
390
Virtual Address Caches
3. Use a common set of AS identifiers and virtual addresses for shared data: this complicates the allocation procedures followed by the OS.
Performance problems: These arise from the need to handle data inconsistencies with synonyms, as described above.
Potential false matches: If the virtual address spaces are not distinguished from each other using AS ids, a cache flush will be necessary on context switches to avoid false matches.
Virtually Indexed, Physically Tagged Caches
Motivation: same as for a virtual address cache – avoid address transla- tion before array lookup; a TLB is still needed (unlike a virtual address cache). Another motivation is described later. Used in some HP Precision implementations.
Here the index used to locate the sets includes k lower order bits from the virtual page number:
[Figure: the virtual address is split into the virtual page number and the offset within the page; the index consists of k bits taken from the low end of the virtual page number plus the offset bits above the byte-offset-within-line field.]
A line frame shared between several processes can thus be in up to 2^k sets, since the virtual addresses can differ in the leading k bits of the index. (The remaining lower order bits of the index do not change from one process to another.)
391
Virtual Address Caches and Virtually Indexed, Physically Tagged Caches
Synonyms for a shared page can be detected by comparing the physical address tag obtained from the TLB against the frames read out from all possible sets that can contain aliases.
Because of the use of physical address tags, all of the synonyms will have the same tag.
Potential sets that can contain synonyms are identified by using all possible combinations of the leading k bits in the index.
If k is 2, and the cache is m-way set-associative, this means 4 * m line frames will be read out instead of the m that are read out in a physically addressed cache or a virtually addressed cache. Each data RAM will thus need 4 read ports.
Data inconsistencies can be avoided by updating all the line frames that contain synonyms.
Since physical address tags are used, the cache does not need to be flushed on context switches: false matches can never occur.
A direct-mapped, virtually-indexed, physically-tagged cache can provide all the advantages of a direct-mapped cache (over a set-associative cache) and may provide a better hit ratio – a replaced line may still have a synonym in the cache that can supply the missing data. This, of course, requires that the synonyms be handled as described above.
– The same advantage can be gleaned from virtually addressed caches.
392
Speeding Up TLB Miss Handling: Inverted Page Tables
Dedicate registers for TLB miss handling – no need for full-fledged context switch.
Also need to look up page table very quickly on a TLB miss.
Simplest way of looking up a page table: use the virtual page number as an index to directly locate the required page-mapping table entry ("direct lookup"). These page tables must be held in RAM for best performance (= least TLB miss handling time).
Problem with direct lookup: enormous amounts of RAM needed to hold page table entries in RAM:
Example: For a system that supports 48-bit virtual addresses, with 12 bits of byte offset within a page, 2^36 page table entries are needed. If each entry requires 8 Bytes, this translates to a storage requirement of 2^39 Bytes. Not only is this huge but it is also beyond the physical addressing capability of all contemporary CPUs (32-bit physical addresses are common)!!
Solution: store mappings (virtual page number to physical page frame number) in the RAM only for the pages that have been brought into the RAM – this is done using an inverted page table.
– The “inverted” qualifier alludes to the fact that there is (at most) one entry in this table for each page frame (as opposed to one entry per virtual page in the normal (disk-resident) page tables). “Inverted” does not imply that the table is searched using the page frame number – we still use the virtual page number to perform any lookup in the inverted page table.
393
Inverted Page Table
Typical implementation:
– hash virtual page number into an index into a table of pointers
– the entry located in this table is a pointer into linked list of page table entries, which are searched linearly
– linked list is required to handle collisions on hashing (different virtual page numbers mapping to the same pointer table index)
Time to locate an entry = time to generate pointer table index through hashing + time needed to search linked list.
Fast TLB miss handling implies linked list should be small.
Tradeoff: size of pointer table vs. size of linked list; increasing pointer table size reduces collisions on hashing and reduces the size of the linked list.
Allows entries for pages within memory to be updated easily on page swaps.
Hash function: typically a simple ex-or of bit fields in the virtual address (e.g., the ex-or of the bits of the segment address (see the next foil!) and the virtual page number within this segment, with zero padding to equalize the bit lengths); the necessary number of lower order bits of the ex-or-ed value is used as an index into the pointer table.
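An illustrative C walk of such a hashed, chained inverted page table; the table size, hash and entry fields are assumptions made for the sketch, not any particular OS's layout:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define PTR_TABLE_SIZE (1u << 16)      /* 2^n pointer-table entries (n = 16 assumed) */

    struct ipt_entry {                     /* at most one entry per resident page frame  */
        uint16_t asid;
        uint64_t vpn;                      /* virtual page number                        */
        uint32_t pfn;                      /* page frame number                          */
        struct ipt_entry *next;            /* collision chain                            */
    };

    static struct ipt_entry *ptr_table[PTR_TABLE_SIZE];

    static unsigned hash_vpn(uint16_t asid, uint64_t vpn)
    {
        /* simple ex-or style hash folded down to the pointer-table size */
        uint64_t h = vpn ^ ((uint64_t)asid << 4) ^ (vpn >> 16);
        return (unsigned)(h & (PTR_TABLE_SIZE - 1));
    }

    /* Returns true and the frame number if the page is resident in main memory. */
    bool ipt_lookup(uint16_t asid, uint64_t vpn, uint32_t *pfn)
    {
        for (struct ipt_entry *e = ptr_table[hash_vpn(asid, vpn)]; e != NULL; e = e->next)
            if (e->asid == asid && e->vpn == vpn) {
                *pfn = e->pfn;
                return true;
            }
        return false;                      /* not resident: take the page fault path     */
    }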
394
Providing Enough Bits to Address Large Virtual Address Spaces
Need to address large address spaces is becoming all too common.
Typical number of bits in the effective address generated by the CPU is 32. How can large virtual address spaces be addressed with only 32 bits?
– if the page offset is 12 bits (4 KByte pages), the 20 bits in the virtual page number allow for a virtual address space of only 2^20 pages.
Solution: use a segmented virtual address space: a few higher order bits (say 4) in the effective address issued by the CPU locate a segment register (one of 16 in this example) that contains a segment address with many more bits (say, 24) – this is concatenated with the remaining part of the virtual page number (16 bits in this case) to form a virtual address space of 2^40 pages.
[Figure: the 32-bit effective address issued by the CPU = 4-bit segment register number + 16 bits identifying the virtual page within the segment + 12-bit byte offset within the page. The selected segment register supplies a 24-bit segment address, which together with the 16-bit page-within-segment field forms a 40-bit virtual page address.]
If an inverted page table is used, the hash function will ex-or the 24-bit segment address with the 16 bits of the virtual page number within the segment (padded with 8 leading zeros), and use the lower order n bits of the ex-or-ed value as an index into a pointer table of 2^n entries, as sketched below.
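A hedged C sketch of this address formation and hash (4-bit segment selector, 24-bit segment address, 16-bit page number within the segment; n is the pointer-table index width):

    #include <stdint.h>

    static uint32_t seg_regs[16];                  /* each holds a 24-bit segment address  */

    /* 40-bit virtual page address = 24-bit segment address : 16-bit page-in-segment */
    static uint64_t virtual_page(uint32_t ea)      /* ea = 32-bit effective address        */
    {
        uint32_t seg  = ea >> 28;                  /* top 4 bits select a segment register */
        uint32_t vpis = (ea >> 12) & 0xFFFFu;      /* 16-bit page number within segment    */
        return ((uint64_t)(seg_regs[seg] & 0xFFFFFFu) << 16) | vpis;
    }

    /* Pointer-table index: ex-or of the 24-bit segment address with the
       zero-padded 16-bit page number, keeping the lower n bits. */
    static unsigned ipt_index(uint32_t ea, unsigned n)
    {
        uint32_t seg_addr = seg_regs[ea >> 28] & 0xFFFFFFu;
        uint32_t vpis     = (ea >> 12) & 0xFFFFu;
        return (seg_addr ^ vpis) & ((1u << n) - 1);
    }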
395
Speeding Up TLB Miss Handling: Two-Level TLBs and Super Pages
Basic idea is very much like multi-level caches.
Second level TLB is bigger and is needed for accommodating applications with large memory footprints.
Second-level TLBs may remain dedicated to specific L1 TLBs, or a common 2nd-level TLB can be shared by the L1 TLBs.
Not generally useful to use a L2 cache as a backup for the L1 TLBs, as entry formats are different.
Common in most high-end designs today. Second level TLB sizes range from 256 to 2K entries.
TLB coverage can also be improved using super pages. A super page is a set of 2^k contiguous virtual pages of normal size, where k is a small integer value (6 to 8 is common). A virtual super page is allocated 2^k contiguous normal sized page frames.
Super pages are used for graphics and I/O buffers for high-speed devices.
A single TLB entry maps a superpage, thus conserving TLB space and permitting a higher TLB coverage (or “reach”)
396
Characteristics of Contemporary Caches/Off-chip Memory Systems
On-chip/On same package cache characteristics:
L1 I-caches: 8 KBytes to 128 KBytes, direct-mapped to 4-way, line sizes: 8 Bytes to 16 Bytes
L1 D-caches: 8 KBytes to 512 KBytes, direct-mapped to 8-way associativity, line sizes from 8 Bytes to 16 Bytes
L2 cache: mostly unified (I + D), up to 2 to 4 MBytes, lines sizes from 16 Bytes to 128 Bytes. Subblocked for large capacities/line sizes. On-chip L3 caches can be large (24+ MBytes), but are shared by many cores (CPUs).
Small on-chip caches for embedded CPUs can have very high associativity.
Many contemporary CPUs are adding a unified L3 cache, with capacities on the order of a few MBytes.
On-chip cache delays:
L1 caches: 1 cycle (embedded CPUs, rare in medium to high-end systems) to 2 cycles (common in medium to high-end systems).
L2 caches: 2 cycles to 6 cycles; higher for L3 in general.
Off-chip memory access delays: 10s of cycles.
On-chip caches can occupy 20% to 55%+ of the chip area!
Many CPUs give a boot time choice for:
– choosing between write-back and write through for a range of pages
– turning off caching for a range of pages
397
Hiding Memory Latency: Multi-threaded Datapaths
Also called “hyperthreading”. Thread = a process (incorrect usage of the term “thread”, but the processor industry is kind of stuck with this!).
Same as “instruction level” multiprocessing, discussed earlier.
Here, instructions from different threads are injected into the pipeline. In a superscalar multithreaded design, instructions from different threads can be co-dispatched. Hence the alternative name, simultaneous multithreading (SMT).
Threads have their own contexts: context = state information needed to run a thread = registers, rename tables, PCs etc.
Common implementation:
Some resources, such as L1 I-cache and L1 D-cache, caches beyond L1, IQ, fetch logic, parts of the predictor, execution units, etc. can be shared among threads – some restrictions may be imposed on the sharing to prevent the hogging of resources by a thread.
Other resources (rename table, physical registers, ROB, LSQs etc) can be dedicated to threads.
Note that partitioning the ROB among threads can indirectly limit use of shared resources among threads. Additional mechanisms may be needed to ensure that all threads are making progress – e.g., round robin instruction fetching, dispatching.
398
Hiding Memory Latency: Multi-threaded Datapaths (contd.)
Such multithreading provides a way of hiding memory latency – when a thread is blocked due to a cache miss, instructions from other threads can continue to be processed.
Multithreaded datapaths thus improve average throughput on a set of ready-to-run threads; individual execution time of a thread can go up!
Real designs: DEC Alpha EV8 (21464) – never put into production, HT versions of Intel P4 (“Northwood”, “Prescott”), IBM POWER 4 and beyond, Sun’s Niagara.
There is a practical limit on the number of threads supported simultaneously – shared resources increase in size and become slower, putting a limit on the pipeline clock.
A limit of 2 to 4 threads is common; limit can be higher with simpler datapaths (e.g., 8 or more threads in Sun’s in-order Niagara implementations).
Most OSs typically have a few "threads" (really, processes) that can be run simultaneously on a multithreaded datapath. Typical server code is heavily multithreaded in this sense – as soon as a server request comes in, a process (i.e., a thread in the jargon of SMT) is spawned to serve the request.
399
Design Choices for Off-Chip Caches: Historical Evolution Trends:
Off-chip discrete component cache controller with standard RAMs (early systems) -> specialized off-chip cache components (tag RAM, cache controller) -> on-chip controller for off-chip caches.
Organizational trends: lookaside cache -> backside cache -> in-line cache (in-line only for very high end systems).
Memory beyond off-chip cache: standard DRAM -> FPM DRAM -> standard SRAM -> EDO DRAM -> SDRAM -> SSRAMs (the last only on very high end systems).
Lookaside Cache
External cache components sit on external memory bus (aka “system bus”), along with other devices, including DRAMs.
[Figure: Lookaside off-chip cache organization with an off-chip cache/DRAM controller. The CPU, the tag RAM and SRAMs of the external cache, the cache/DRAM controller and the DRAM banks all sit on the single memory bus. Decoding logic and detailed connections are not shown.]
400
Design Choices for Off-Chip Caches (contd.)
External cache and DRAM accesses started in parallel – on a cache hit, DRAM access is abandoned.
Major limitation: external cache can only operate at system bus speed, which is relatively slow as the bus is long and several other things are hanging off this bus.
Backside Cache
Uses a shorter, dedicated external "backside bus" for the external cache.
[Figure: Backside off-chip cache organization with an off-chip cache/DRAM controller. The CPU connects to the cache SRAMs and the cache controller/tag RAM over the dedicated backside bus, while the DRAM controller and the DRAM banks sit on the memory bus. Decoding logic and detailed connections are not shown.]
Backside bus is significantly faster, allowing external cache to operate at higher speeds (compared to the lookaside organization).
401
Design Choices for Off-Chip Caches (contd.)
In-line Cache
Goes beyond what the backside bus offers – here CPU accesses the external cache using a dedicated point-to-point port (not a bus) that connects it to the cache through the cache controller:
[Figure: In-line cache organization – the CPU connects through a dedicated point-to-point port to the off-chip cache controller and the off-chip cache; the DRAM controller and the DRAM banks sit on the memory bus.]
In-line cache organization with a single-chip off-chip cache. Decoding logic and de- tailed connections are not shown. This organization provides the best performance and is preferred for bus-connected shared memory multiprocessor systems that implement coherent caches with bus snooping logic.
Dedicated port can be clocked at very high speed, often close to the CPU clock speed, leading to best throughput.
402
Design Choices for Off-Chip Caches (contd.)
Provides the best performance: the cache controller is likely to be integrated into the CPU chip in future systems, leading to the following configuration:
[Figure: Encapsulated "off-chip" caches – the off-chip cache and its controller are packaged together with the CPU; the DRAM controller and the DRAM banks sit on the memory bus.]
Here, off-chip caches are within the same package, MCM (multi-chip module) or cartridge as the CPU – provides faster data rates to cache, as physical dimensions of interconnects are small. Can use either back- side or in-line organizations. There were products in this category (Intel P6, Pentium II and III “Slot 1” cartridges).
Current Trends
On-chip caches have dedicated access paths and controllers.
DRAM controller part of North-bridge chipset (MCH – memory con-
trol hub) – see later. No external caches are used.
DRAM controllers and interfaces are moving on-chip in current and emerging multicore products.
403
Removing the Cache Bottleneck From the Pipeline
The I-Cache and the D-Cache are bottlenecks in contemporary pipe- lines: caches are pipelined in 2 stages as described earlier (actually, 3 stages including the write).
As processor clocks frequencies increase, this degree of pipelining may not be enough.
Of the two cache access stages, the stage involving the data and tag RAM access is the slowest – further pipelining of this stage, retaining traditional circuit design techniques is almost impossible.
Here’s what can be done to reduce tag/data RAM access delays:
Implement the tag/data array as multiple smaller arrays, with predecoding to select the target (small) array (smaller is faster; the net access delay is smaller even with the added predecoding delay).
Wave pipeline the tag/data RAMs: Static RAMs are essentially combinatorial logic – wave pipelining basically exploits logic delays to send multiple access requests into the RAM, without any interference among multiple requests “in flight”.
Net effect: pipelining at a faster rate without any latches (& associated delays) to separate out consecutive requests.
Can wave pipeline entire cache (RAM access + compare/steer).
Reduce logic delay in cache access path by combining the logic of effective address computation (effectively an adder) with the logic of the row address decoder for the RAMs. (In theory, any combinatorial function can be implemented with two levels of gates.)
– Wave pipelining & logic merging as described are used in the Sun UltraSPARC-III implementation.
404
Other Techniques for Speeding Up the Memory Interface
Streaming Stores: No cache allocations on a write miss – used when it is known that written lines are not going to be reused. Shows up ex- plicitly as part of ISA.
Write Combining: Used to improve performance of streaming stores. Sequential write-throughs to parts of a single cache line are collected into a write buffer before the entire line is written out, thus combining a series of writes to smaller regions into a single write to an entire line.
Amortizes memory startup overhead for writes, making efficient use of newer DRAM technologies (EDO DRAMs, SDRAMs, Rambus) that support streaming. Makes efficient use of external bus as well – one single bus transaction suffices instead of multiple transactions.
Writes can be combined in a single line buffer, which will be written out to memory (= “flushed”) only when writes are generated to another line. Used in Intel’s Deschutes line. Need associated comparators to allow LOADs to bypass correctly, as in any write buffer.
Further improvements possible through the use of multiple “write combining” buffers (e.g., in Intel’s Katmai (Pentium III) line).
Explicit instruction needed to flush these buffers is also added to the ISA (e.g., the SFENCE instruction in Intel’s Katmai line).
Explicit Prefetching: Instruction added to prefetch data from memory into caches at any specified level or implied buffer – only if these pre- fetches do not trigger a page fault. Does not modify processor state.
Example: PREFETCH instruction in Intel’s Katmai specifies where cache line has to be prefetched – all cache levels, L1 only, all levels be- yond L1 or a buffer (bypassing caches).
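On x86 these mechanisms are visible to the programmer through the SSE intrinsics; a hedged example (the exact caching behavior and buffer usage are implementation dependent):

    #include <stddef.h>
    #include <xmmintrin.h>    /* _mm_prefetch, _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

    /* Copy n floats (n a multiple of 4, dst 16-byte aligned) using non-temporal
       (streaming) stores: no cache allocation on the writes, and the hardware's
       write-combining buffers coalesce them into full-line writes. */
    void stream_copy(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0);  /* explicit prefetch ahead */
            _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));            /* streaming store         */
        }
        _mm_sfence();    /* drain/order the write-combining buffers (cf. SFENCE) */
    }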
405
Prefetching in Contemporary Processors
Basics
Prefetch request: anticipated memory read request.
Demand request: actual request (triggered on a miss) – different from a prefetch request.
Hardware-initiated prefetching anticipates the next address from which a load instruction will access data and performs this access in advance to move the requested data into the L1 cache well before it is needed.
Streaming instructions (typically found in the multimedia instruction sets such as SSE) do this based on user supplied information, as seen earlier.
For non-explicit accesses, the hardware needs to anticipate the address of the memory locations to be accessed in advance.
Many ISAs support explicit prefetching commands that can be inserted by the compiler in strategic places within the code.
– In many cases, these instruction specify what is to be done with the prefetched data (place in cache, consume in buffers only without being moved into the cache, get data for potential update and writeback with no intent to cache for future use etc.)
406
Prefetching in Contemporary Processors (contd.)
The term prefetch distance is used to refer to how far in advance (in terms of number of data items) the actual prefetching starts.
This distance has to be chosen to account for delays in the access path and the number of misses that would occur before any data is prefetched.
– Keeping this distance small would reduce potential near-term misses but may not prefetch data far enough in advance to avoid future cache misses.
– Keeping this distance large may trigger too many near-term demand misses.
Prefetch burst size refers to the number of items prefetched back-to- back. A small burst size can cause future misses; a large burst size may bring in a lot of data in vain (prefetched data does not get used). Unuti- lized prefetches can waste precious memory bandwidth and latency.
Hardware-initiated Prefetching
Anticipating the address of the next memory location to be accessed: two broad techniques are used in current hardware implementation:
Next line prefetch: prefetch the next cache line – works well for instruction accesses as well as data accesses. The hardware has to detect a sequential access pattern before starting prefetching. For data accesses, next line prefetching amounts to prefetching at a stride distance of unity (see below).
407
Prefetching in Contemporary Processors (contd.)
Stride-based prefetching – detects stride-based accesses and initiates prefetching based on detected stride. Usually triggered by repeated executions of the same LOAD instruction in a loop, as exemplified in the following code:
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
sum = sum + A[i, j] * B [j, i];
- If the arrays A and B are allocated in row-major order (as in C), the loop that implements this code will have two LOAD instructions - one accessing A[i, j] and another accessing B[j, i]. The strides used by the LOADs for the A and the B elements are 1 and N, respectively.
Example 1: Adaptive next line prefetching in the AMD Barcelona Quad- Core Design
Prefetch next line is triggered by two consecutive accesses: access of lines L, L+1 will trigger the prefetching of the next N consecutive lines starting with L+2.
Burst size(N) is programmable.
Prefetch depth is adaptive - this distance is increased dynamically if the demand stream catches up with the prefetch stream.
- The IBM POWER family uses a similar next line prefetching scheme.
408
Prefetching in Contemporary Processors (contd.)
Example 2: Prefetching from the L1 D-cache in the Intel Core 2 Duo
Prefetchers exist between the core and L1, between L1 and L2, and between L2 and memory. A prefetch request from the upper level is treated as a demand request at the next level.
Instruction-pointer (IP) based prefetching as used in this design stores information related to prefetching and related to automatically detecting the state of the prefetching and the setup of prefetching conditions in a 256-entry prefetch history table (PHT).
Detects constant stride-based accesses. These are a series of accesses that target uniformly-spaced memory addresses. Often, the same LOAD instruction is used within a loop to generate such a series of accesses. The IP prefetcher used in Intel's Core 2 Duo automatically enables prefetching from the L1 D-cache for such LOAD instructions.
The format of an entry in the PHT is as follows:
entry in the history table
<12 bits of the last virtual memory address targeted by the load>
<13-bit signed computed stride>: the difference between the memory addresses targeted by the past two executions of the LOAD.
<2-bit history/state>: relates to the state and usage of this entry for prefetching – not disclosed, but potentially: entry just set up on the first execution of the load; stride between the addresses used in the first and second executions computed on the second execution; stride confirmed on the third execution; prefetching disabled.
409
Prefetching in Contemporary Processors (contd.)
<6-bits of address targeted for prefetching>: used to avoid duplicate prefetches.
The prefetching sequence is as follows:
1. The PHT is looked up for LOAD instructions at the head of the load queue.
2. On detecting a stride-based access, a prefetch request is generated for the L1 D-cache and queued up in a FIFO queue. As mentioned, the prefetch is triggered on the 3rd execution of a LOAD at the earliest. When this FIFO queue is full, new requests overwrite earlier requests.
3. Prefetch requests compete with streaming requests and demand misses for accessing the L1 D-cache. When the cache port is free and when an appropriate number of cache fill buffers and external bus request queue entries are available, the prefetch request is processed.
– If an L1 D-cache hit occurs, the requested data is fetched into the fill buffer.
– If an L1 D-cache miss is triggered, the request moves down to the L2 cache as a normal "demand" request.
4. The prefetched data is not necessarily placed into the L1 D-cache – an option specifies whether the line is to be placed into the L1 D-cache or not. The option may specify that the prefetched line be consumed directly off the fill buffer and not cached. Under that option, the prefetched line is cached only when a later demand request uses it.
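A simplified C model of a PHT entry and its per-execution update; the field widths follow the description above, while the state encoding is only a guess at the undisclosed details:

    #include <stdbool.h>
    #include <stdint.h>

    struct pht_entry {                /* indexed by low bits of the LOAD's instruction pointer */
        uint16_t last_addr;           /* 12 bits of the last virtual address targeted          */
        int16_t  stride;              /* 13-bit signed computed stride                         */
        uint8_t  state;               /* 2 bits: 0 = just set up, 1 = one address seen,
                                         2 = stride computed (confirm on the next execution)   */
        uint8_t  last_pref;           /* 6 bits of the last prefetched address (not modeled)   */
    };

    /* Called on every execution of the LOAD; returns true and a prefetch address
       once the same stride has been seen on two consecutive executions. */
    bool pht_update(struct pht_entry *e, uint32_t vaddr, uint32_t *prefetch_addr)
    {
        uint16_t addr12 = (uint16_t)(vaddr & 0xFFFu);
        int16_t  new_stride = (int16_t)(addr12 - e->last_addr);
        bool issue = false;

        if (e->state == 0) {
            e->state = 1;                           /* first execution: just record the address */
        } else if (e->state == 1 || new_stride != e->stride) {
            e->stride = new_stride;                 /* second execution or stride change: train  */
            e->state  = 2;
        } else {                                    /* stride confirmed: issue a prefetch ahead  */
            *prefetch_addr = vaddr + (uint32_t)(int32_t)e->stride;
            issue = true;
        }
        e->last_addr = addr12;
        return issue;
    }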
410
Prefetching in Contemporary Processors (contd.)
The Intel design uses multiple prefetchers: mechanisms are in place to avoid any adverse impact of aggressive prefetching. A prefetch moni- toring logic can throttle prefetches or momentarily suspend prefetching as needed.
Implementation-dependent options can specify the prefetch distance (how much in advance to prefetch) and the amount of prefetched data (how many data items are prefetched at a time back-to-back as a burst).
Stream prefetchers fetch a sequence of lines into separate stream buffers. These buffers are FIFO in nature.
A common implementation is a next-line prefetcher that prefetches lines into a buffer.
Stream and stride-based access detection are essentially identical.
Many CPUs include special instructions to control prefetching (ARM, X86, POWER,..).
The memory controller needs to prioritize memory accesses by miss over prefetch accesses.
411
Memory Prefetching: Recap
Key idea: fetch memory location into cache well before it is accessed. Almost imperative in superscalar designs.
Types of prefetching:
Compiler-directed: compiler generates/inserts special prefetching instructions – existed in many earlier designs. Also called explicit prefetching. Also possible to use “helper threads” (where thread = thread in OS sense!) to do such prefetching.
Automatic prefetching: automatically triggered by hardware on consistent cache misses. Need to design the mechanism carefully to avoid cache pollution due to unnecessary prefetches.
Typical implementation of automatic prefetching:
Hardware detects stride-based references (e.g., in processing array
elements in a loop) and prefetches based on the predicted stride.
Hardware can use predictor in the course of pointer-based accesses (more difficult in general) and other scenarios.
Determining when to initiate and turn off prefetching is always a challenge!
Not uncommon to see prefetchers in-between cache levels and between lowest level on-chip cache and memory. Intel’s Core 2 Duo has this.
412
CLASS NOTES/FOILS:
CS 520: Computer Architecture & Organization
Part V: Superscalar CPUs (remaining parts)
Dr. Kanad Ghose ghose@cs.binghamton.edu http://www.cs.binghamton.edu/~ghose
Department of Computer Science State University of New York Binghamton, NY 13902-6000
All material in this set of notes and foils authored by Kanad Ghose 1997-2019 and 2020 by Kanad Ghose
Any Reproduction, Distribution and Use Without Explicit Written Permission from the Author is Strictly Forbidden
CS 520 – Fall 2020
413
Superpipelined, Superscalar and VLIW Architectures: Recap
Execution time of a program with N instructions is: t_exec = N * CPI * τ
where CPI is the average number of clocks needed per instruction and τ is the clock period.
t_exec can be reduced by reducing N (the approach taken by CISCs, which have a number of well-known performance problems) or by reducing one or both of CPI and τ.
Superpipelined, Superscalar and VLIW architectures represent approaches for further exploitation of instruction-level parallelism – they differ in how the CPI and τ factors are reduced.
Superpipelined machines break up pipeline stages further to allow a faster clock (i.e., a smaller τ):
Requires more pipeline stages
Average CPI goes up since more pipe stages imply larger penalty on branching, larger interlock delays etc.
Throughput goes up because of higher clock rates
414
Superpipelined, Superscalar and VLIW Architectures: Recap
Superscalar machines dispatch more than one instruction per cycle
Requires complicated fetch & dispatch logic
Requires complex logic to cope with dependencies
Effective CPI goes down, increasing throughput
CPI decrease somewhat defeated by relatively larger branching penalty
VLIW machines issue a single instruction that starts several operations per cycle
Requires extensive compilation to identify operations that can be initiated concurrently
Effective CPI goes down, and a lower τ is also possible, increasing throughput
Throughput growth somewhat offset by poor code density
Almost guaranteed to be binary incompatible!
Actual performance gains are less than what is predicted, particularly as m goes up.
415
Potential Challenges in Designing a m-way Superscalar Processor: Recap
Fetching m instructions per cycle for dispatch in a m-way superscalar CPU: this is complicated by the fact that the set of m instructions to be dispatched can cross memory and cache line boundaries.
Basic strategy: maximize the number of instructions that can be examined per cycle for dispatch
Resolving dependencies among the instructions being dispatched and the instructions that have been dispatched earlier and still remain active.
Issuing multiple instructions per cycle to free FUs when the input operands become available – this is not different from what is done in scalar pipelines with multiple FUs and dynamic scheduling.
Retiring multiple instructions per cycle – this is again not different from what is done in scalar pipelines with multiple FUs and dynamic scheduling.
Coping with branching – this is a very serious problem in superscalar machines where a branch instruction may be encountered potentially in each consecutive group of m instructions that are being examined for dispatch.
Coping with load latencies, as clock rates get higher.
Aggressive branch handling techniques are needed
416
Instruction Dispatching Alternatives in Superscalar CPUs
Variables in the design space of superscalar CPUs:
Number of instructions examined at a time for dispatch:
Constrained by cache/memory alignment requirements:
Alignment-constrained dispatching.
Independent (almost!) of cache/memory alignment requirements:
Alignment-independent dispatching.
How dependencies are handled
In-order vs. out-of-order dispatching.
Datapath restrictions for dispatching:
Restriction imposed on types of instructions that can be dispatched per
cycle.
Datapath does not restrict instructions for dispatch based on their types.
Primary mechanism used to select instructions for dispatching: Compiler
Hardware
417
Alignment-Constrained Dispatching
Early superscalar microprocessors used this technique.
Here, a complete cache line is fetched from the I-cache and all the instructions within this line are dispatched before the successor cache line is fetched.
The instruction decoder does not step beyond a cache line boundary, even if doing so would maximize the number of instructions being examined for dispatch.
This can seriously lower the average number of instructions that are examined for dispatch and the average number of instructions that are dispatched per cycle in a k-way superscalar machine for two reasons:
1. Due to the inability to dispatch k instructions per cycle:
Cycle 1: Cache line containing I1, I2, I3, I4 fetched into instruction buffer
Cycle 2: I1, I2, I3 dispatched; I4 not dispatched due to non-availability of VFU
[Instruction buffer snapshots (k = 4): in cycle 2 the buffer holds I1 I2 I3 I4; in cycle 3 it holds I4 followed by I5 I6 I7 I8 from the next cache line]
# of instructions dispatched = 3
Cycle 3: I4 dispatched; next cache line containing I5, I6, I7, I8 fetched into the instruction buffer
# of instructions dispatched = 1
Cache line size is 4 instructions
average dispatch rate during cycles 2 and 3 is 2 instructions per cycle.
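The dispatch-rate loss above can be mimicked with a toy model; a fixed per-cycle dispatch cap stands in for FU availability here, and all parameters are illustrative assumptions:

# Toy model of alignment-constrained dispatch (assumed parameters, not from the notes).
# Each cycle, only the remaining instructions of the current cache line are examined;
# 'max_dispatch_per_cycle' caps how many of them can actually go (e.g., FU availability).
def alignment_constrained(n_instr, line_size=4, max_dispatch_per_cycle=3):
    dispatched = 0
    cycles = 0
    while dispatched < n_instr:
        line_start = (dispatched // line_size) * line_size
        in_line = min(line_start + line_size, n_instr) - dispatched  # rest of the current line only
        issued = min(in_line, max_dispatch_per_cycle)
        dispatched += issued
        cycles += 1
    return n_instr / cycles          # average dispatch rate

print(alignment_constrained(8))      # mirrors the example: 3 + 1 + 3 + 1 -> 2.0 per cycle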
418
Alignment-Constrained Dispatching (contd.)
2. Due to branching:
– Assume I3 is a branch instruction that is taken:
Cycle 1: Cache line containing I1, I2, I3, I4 fetched from the cache and put into the instruction buffer: I3 is a branch that's predicted to be taken. The target of I3 is I23
Cycle 2: I1, I2, I3 dispatched; cache line containing the target instruction is fetched into the instruction buffer
# of instructions dispatched = 3
Cycle 3: I23, I24 dispatched; cache line containing the next set of instructions is fetched into the instruction buffer
# of instructions dispatched = 2
Cache line size is 4 instructions
[Instruction buffer snapshots (k = 4): cycle 2 buffer holds I1 I2 I3 I4; cycle 3 buffer holds the target's cache line I21 I22 I23 I24, with I25 – I28 in the following line]
– average dispatch rate during cycles 2 and 3 is 2.5 instructions per cycle.
A technique used in some early superscalar CPUs was to use NOP padding to force the target of a branch to start at the beginning of a cache line, allowing k instructions starting with the target to be examined for dispatch on a taken branch. For the example shown above, this means that the instruction layouts in the cache have to be as follows:
[Padded layout: one cache line holds I21, I22, NOP, NOP; the next holds I23, I24, I25, I26 – the branch target I23 now starts a cache line]
– Increases binary size: not an attractive solution.
419
Alignment-Independent Dispatching
In an ideal k-way superscalar CPU with alignment-independent dispatching, k instructions are examined for dispatching every cycle, irrespective of branching or the number of issues made in the previous cycle.
This goal implies that k instructions, starting with the first instruction in program order to be dispatched, must be available for dispatch in that cycle.
When branches are absent or when branches are predicted to be not taken, the goal of alignment-independent dispatching can be achieved by using a circular buffer that can hold 2k instructions.
– Cache line size is assumed as k instructions.
– This buffer is initially loaded with 2k instructions from two consecutive cache lines.
– As soon as k instructions are dispatched from the buffer, k instructions from a cache line are loaded into this buffer.
– At any point, k consecutive instructions can be examined from this buffer
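A minimal sketch of the 2k-entry circular buffer described by the bullets above, using a Python deque as a stand-in for the hardware buffer (k = 4 assumed):

from collections import deque

K = 4
buffer = deque()                       # holds at most 2k instructions, in program order

def refill(next_line):
    """Load one cache line (k instructions) whenever k slots are free."""
    if len(buffer) <= K:
        buffer.extend(next_line)

def dispatch_window():
    """The k consecutive instructions examined for dispatch this cycle."""
    return list(buffer)[:K]

# Prime with two consecutive cache lines, as in the notes.
refill([f"I{i}" for i in range(1, K + 1)])
refill([f"I{i}" for i in range(K + 1, 2 * K + 1)])
print(dispatch_window())               # ['I1', 'I2', 'I3', 'I4']
for _ in range(3):                     # suppose 3 of them dispatch this cycle
    buffer.popleft()
print(dispatch_window())               # ['I4', 'I5', 'I6', 'I7']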
420
Alignment-Independent Dispatching (contd.)
The following example shows how this works (k = 4; the circular-buffer snapshots and Dispatch PC pointer from the figure are omitted):
Cycle 1: I1 through I4 examined for dispatch: I1 through I3 dispatched
Cycle 2: I4 through I7 examined for dispatch: I4, I5 dispatched; fetch of cache line containing the following instructions started
Cycle 3: I6 through I9 examined for dispatch: I6 through I9 dispatched; fetch of cache line containing the following instructions started
Cycle 4: I10 through I13 examined for dispatch: I10, I11 dispatched
Can still allow k instructions to be examined for dispatch with branching if a prediction mechanism with a BTIC (branch target instruction cache) is used, with k instructions starting with the target stored in the BTIC (even if the group of k instructions starting with the target crosses a cache line boundary).
421
Alignment-Independent Dispatching (contd.)
A straightforward implementation is as follows:
The buffer is a register file with 2k entries, with k read ports and k write ports.
The decoder accesses instructions using the read ports; instructions are loaded from the cache line via the write ports.
The read ports are addressed using the following addresses (all modulo 2k):
PC mod 2k (i.e., the lower log2(2k) bits of the PC)
(PC mod 2k) + 1
(PC mod 2k) + 2
:
(PC mod 2k) + (k – 1)
The addresses for the write ports are set up similarly:
(PC mod 2k) + k
(PC mod 2k) + k + 1
:
(PC mod 2k) + 2k – 1
– The cache line is not fetched and written through the write ports until there is space in the buffer for k instructions.
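The port addressing can be sketched as follows; per the k = 4 figure on the next foil, the buffer entry index is taken from the lower log2(2k) bits of the (word-addressed) PC, and wrap-around modulo 2k is assumed:

# Sketch of the read/write port addresses for the 2k-entry buffer (word-addressed PC assumed).
K = 4                                    # 4-way example, so the buffer has 2k = 8 entries
ENTRIES = 2 * K

def read_port_addresses(pc):
    """Entries holding the k instructions starting at the dispatch PC."""
    base = pc % ENTRIES                  # i.e., the lower log2(2k) bits of the PC
    return [(base + i) % ENTRIES for i in range(K)]

def write_port_addresses(pc):
    """Entries that receive the next cache line of k instructions."""
    base = pc % ENTRIES
    return [(base + K + i) % ENTRIES for i in range(K)]

print(read_port_addresses(13))           # [5, 6, 7, 0]
print(write_port_addresses(13))          # [1, 2, 3, 4]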
422
Alignment-Independent Dispatching (contd.)
An example of this is shown below for k = 4: we assume that each instruction is one word wide and the PC contains the word address for the first instruction of the next group of instructions to be examined for dispatch.
[Figure (k = 4): a 2k-entry register file indexed by the lower 3 bits of the PC; read ports select the entries at offsets +0 through +3 for the instruction decoder, while write ports fill the entries at offsets +4 through +7 from the I-cache or BTIC]
This implementation requires a large amount of silicon area. The effective wire lengths from the registers holding instructions to the decoder are long and may introduce relatively large wire delays, and the address decoders for the register file also introduce delays – simpler implementations that logically realize the same functional requirements are possible.
423
Alignment-Independent Dispatching (contd.)
The deck buffer implementation used in the 4-way superscalar MIPS 8000:
Instructions are a word wide; fetches made from the cache are always “quadword” aligned (aligned in groups of 4 words).
Instructions fetched from the cache move into the actual latches from where decoding takes place (called the dispatch buffers, DBs) through a series of instruction buffers (IBs) and then through a single set of latches called the on-deck registers (ODRs), as shown below:
[Figure: four quadword-aligned instructions flow from the I-cache into the IBs, then into the ODRs, and finally into the DBs through per-slot MUXes; recirculating and bypass connections let each DB either retain its current instruction or accept a new one. Each IB entry, ODR and DB holds a single instruction.]
A DB can hold on to the instruction that it currently has for the next cycle using the recirculating connection or it can have a new instruction forwarded to it from the ODR or the entry at the head of the IB in the next cycle.
424
Alignment-Independent Dispatching (contd.)
These facilities are used to ensure that the four DBs always contain four consecutive instructions that can be examined for dispatch in every cycle:
Cycle 1: Ij, Ij+1, Ij+2 dispatched; the following three instructions move into vacant DB slots from the ODRs
[Figure: snapshots of the IB, ODR and DB contents and the Dispatch PC at the end of each cycle; – = already dispatched]
Cycle 2: Ij+3, Ij+4 dispatched; the following two instructions move into vacant DB slots from the ODRs and the IB; vacant slots in the ODRs are also filled from the IB – this frees up a set of IB entries; the IB shifts down and is reloaded from the I-cache
Cycle 3: Ij+5, Ij+6, Ij+7 dispatched; the following three instructions move into vacant DB slots from the ODRs; vacant ODR slots are filled from the IBs
– Effectively, the four ODRs and the four DBs make up an 8-entry circular buffer; the IB acts as a queue – one entry of four instructions moves out to the ODRs/DBs at a time.
425
Alignment-Independent Dispatching (contd.)
Although not depicted, instructions can also be forwarded to the DBs from a BTIC.
Odd-even line caches are a way of logically partitioning the I-cache to achieve alignment-independent dispatching.
First used in the 4-way superscalar IBM RISC/6000; a cache line can hold a quadword aligned group of 4 instructions.
The I-cache is partitioned logically into two independent I-caches: one (“even cache”) holds only the even-numbered lines from memory; the other (“odd cache”) holds the odd-numbered lines from memory.
On an I-cache access request, the line address is first incremented. The original and the incremented addresses are used to probe the even and odd caches, with gating logic ensuring that the even one of the two addresses goes to the even cache and the odd one to the odd cache.
– If the incremented address crosses a page boundary, the appropriate cache is not accessed
– Can avoid this restriction by looking up the TLBs for both line addresses
Resulting scheme fetches 8 consecutive instructions on a hit, starting with the instruction pointed to by the dispatch PC.
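A small sketch of the even/odd gating described above; a 4-instruction (16-byte) line is assumed, and the page-boundary and TLB checks are omitted:

# Compute which line address goes to the even cache and which to the odd cache.
LINE_BYTES = 16                          # assumed: 4 one-word instructions per line

def probe_addresses(dispatch_pc):
    """Return (even_cache_line, odd_cache_line) line addresses to probe."""
    line = dispatch_pc // LINE_BYTES
    lines = (line, line + 1)             # original and incremented line addresses
    even = next(l for l in lines if l % 2 == 0)
    odd = next(l for l in lines if l % 2 == 1)
    return even, odd

print(probe_addresses(0x104))            # PC in line 16 -> probe even line 16 and odd line 17
print(probe_addresses(0x11C))            # PC in line 17 -> probe even line 18 and odd line 17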
426
Alignment-Independent Dispatching (contd.)
Cache interface is as shown below:
[Figure: the line address based on the dispatch PC and its incremented value (+1) go through even/odd address gating to the even and odd cache data & tag arrays (supplying the set # for each); the TLB provides the virtual page number for the tag comparison and multiplexing logic that selects the required lines]
Potential limitation: delay of incrementer in cache access path.
A dual-ported TLB can be used to allow two consecutive memory lines that are in two different pages to be read out from the cache.
427
In-order vs. Out-of-order Dispatching in Superscalar CPUs
In-order dispatching: Here a group of k consecutive instructions in program order, say I1, I2, …, Ik, are examined for dispatch. If the dispatching conditions allow all of these k instructions to be dispatched, they are dispatched. Otherwise, the instructions dispatched are I1 through Ip, where p < k, and Ip+1 is the first instruction in program order that does not satisfy the dispatching conditions.
In other words the group of dispatched instructions in program order ends with the first instruction that cannot be dispatched within the group of instructions that are being examined for dispatch.
Note that as a consequence of in-order dispatching, an instruction Ir, where p < r ≤ k, will not be dispatched even if it satisfies the conditions for dispatch. This effectively reduces the dispatch rate. This is shown in the following example for alignment-constrained dispatch:
Cycle 1: The cache line containing I1, I2, I3, I4 is examined for dispatch in the decode/dispatch buffer. I1, I2 and I4 satisfy the dispatch conditions but I3 does not. Consequently, only I1 and I2 are dispatched. I4 is not dispatched even though it meets the dispatching conditions:
# of instructions dispatched = 2
Cycle 2: I3 becomes dispatchable, so both I3 and I4 are dispatched; the next group of 4 instructions is fetched into the dispatch buffer:
# of instructions dispatched = 2
[Dispatch buffer contents (k = 4): cycle 1 holds I1 I2 I3 I4; cycle 2 holds – – I3 I4]
428
In-order vs. Out-of-order Dispatching in Superscalar CPUs (contd.)
Out-of-order dispatching: Here a group of k consecutive instructions in program order, say I1, I2, …, Ik, are examined for dispatch. All of the instructions in the group that satisfy the dispatching conditions are dispatched – irrespective of the program order.
Instructions dispatched out-of-order are usually not dependent on earlier instructions that could not be dispatched: this is checked for explicitly.
The effective dispatch rate is improved only if alignment-independent dispatching is used:
Situation with alignment-constrained dispatching:
Cycle 1: The cache line containing I1, I2, I3, I4 is examined for dispatch in the decode/dispatch buffer. I1, I2 and I4 satisfy the dispatch conditions but I3 does not. I4 is not dependent in any way on I3. All except I3 are thus dispatched:
# of instructions dispatched = 3
Cycle 2: I3 becomes dispatchable and is dispatched; the next group of 4 instructions is fetched into the dispatch buffer:
# of instructions dispatched = 1
[Dispatch buffer contents (k = 4): cycle 1 holds I1 I2 I3 I4; cycle 2 holds – – I3 –]
– effective dispatch rate over two cycles = 2 per cycle
Situation with alignment-independent dispatching:
Cycle 1: Same as before, but the next group of instructions is also fetched into the dispatch buffer.
Cycle 2: I3 becomes dispatchable; I5 and I7 are also dispatchable. I3, I5 and I7 are dispatched; the next group of 4 instructions is fetched into the dispatch buffer:
# of instructions dispatched = 3
– effective dispatch rate over two cycles = 3 per cycle
[Dispatch buffer at the start of cycle 2 (k = 4): I3, I5, I6, I7]
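The difference between the two policies boils down to where the scan over the examined group stops; a toy sketch (the readiness flags below are illustrative):

# Compare the two dispatch policies over one examined group (window) of instructions.
def in_order_dispatch(window, ready):
    """Dispatch up to, but not past, the first non-ready instruction in program order."""
    dispatched = []
    for instr in window:
        if not ready[instr]:
            break
        dispatched.append(instr)
    return dispatched

def out_of_order_dispatch(window, ready):
    """Dispatch every ready instruction in the window (dependences assumed checked)."""
    return [instr for instr in window if ready[instr]]

window = ["I1", "I2", "I3", "I4"]
ready = {"I1": True, "I2": True, "I3": False, "I4": True}   # I3 blocks, as in the example
print(in_order_dispatch(window, ready))      # ['I1', 'I2']
print(out_of_order_dispatch(window, ready))  # ['I1', 'I2', 'I4']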
429
In-order vs. Out-of-order Dispatching in Superscalar CPUs (contd.)
Note that the alignment-independent instruction fetching mechanism also gets more complicated when it has to handle out-of-order dispatching
Very few superscalar CPUs use out-of-order dispatching; these are machines with few or no reservation stations – examples include:
– Motorola 88110: only FP instructions are dispatched out-of-order
– IBM PowerPC 601: only FP and branch instructions dispatched out-of-order.
Modern superscalar CPUs tend to employ an instruction pool buffer and use in-order dispatching to get all the advantages of out-of-order dispatching. Additionally, the logic needed to take care of dependencies and the extra complexity of alignment-independent dispatching for out-of-order dispatch are avoided.
430
Datapath Restrictions in Superscalar CPUs
In most early superscalar CPUs, limitations of the datapath (connections from the decode buffer to the FUs and RSEs) or the relatively small number of FUs restricted the types of instructions that could be dispatched in the same cycle. Examples of such restrictions are:
No more than a single LOAD or a single STORE could be dispatched in a single cycle, since the RSEs of the load and store units were single ported.
A floating point or integer instruction could be dispatched to the respective FU only from specific positions within the dispatch buffer because of restrictions on the connections to the FUs from instruction slots within the dispatch buffer.
– In the DEC 21064 and 21164 Alpha implementations, a dedicated stage was added to the pipeline to rearrange instructions into their appropriate positions within the dispatch buffer to enable them to be co-dispatched.
– More specifically, in the 2-way superscalar DEC 21064, a “swap” stage was added to switch the positions of the two consecutive instructions fetched from the I-cache, if necessary, to enable their parallel dispatching. (In the 4-way superscalar 21164, the analogous stage was called the “reorder” stage.)
In most modern superscalar CPUs, these restrictions are absent because an adequate number of FUs and connections exist, and also because an instruction pool buffer is used, with connections to it from all slots within the dispatch buffer.
431
Other Issues in an m-way Superscalar CPU
Implications on instruction issuing: The issuing mechanism must be able to take in m instructions from the decode/dispatch stage. It must also have the ability to issue at least m instructions to the physical FUs. If an instruction pool buffer is used, with issuing taking place from this buffer, this requires a large set of connections from the IQ to the FUs.
Implications on completion and retirement: In the steady state, m dispatches per cycle imply that m instructions have to be retired per cycle. This does not necessarily require the reorder buffer to have multiple ports so that m consecutive instructions can be retired in one cycle. A possible solution to avoid multiple ports (and the addressing logic needed to address the consecutive slots for the instructions being retired) is as follows:
– Each entry in the ROB can have m slots to accommodate up to m instructions in program order.
– If fewer than m dispatches take place in a cycle, not all of the slots in the ROB entry will be used.
– The retirement logic simply looks at the contents of the m slots in a single ROB entry to decide what instructions can be retired in a cycle.
The architectural register file, however, cannot avoid the need for m write ports.
Coping with branching: branching presents a more serious threat to superscalar datapaths. For every cycle lost due to branching, m instructions are “lost”. Aggressive branch handling techniques and speculative execution mechanisms have to be used in modern superscalar CPUs.
432
Addressing the Memory Gap in Superscalar CPUs
The performance gap between the CPU and memory has a much higher impact on performance compared to scalar pipelines:
The issue queue can fill up more quickly on load misses
If the loads have a latency of two cycles or higher (as is the case), the compiler has to find more instructions to put into the delay slot of the load
The load-store queue has to be longer compared to a scalar design
The L1 D-cache will need additional ports
Multiple D-cache misses need to be outstanding – lockup-free caches are a must
Speculative execution exacerbates all of this.
Solutions currently deployed to address these issues:
Cache miss prediction for loads
Speculative bypassing of the store queue:
– Based on partial address matches
– Based on prediction
433
The Memory Dataflow Path: Details
In spite of out-of-order execution, memory operations must be completed in sequential program order. The usual implementation:
As load/store instructions are dispatched to the IQ or reservation stations, an entry is made for these instructions at the tail of a FIFO queue, called the LSQ (load-store queue), to retain their program order.
Each LSQ entry has the form: op_type | valid_bits | address | data | mask, where:
– op_type is the type of the load or store (word load, byte load, ...)
– valid_bits indicate if the address and data fields have valid contents
– address: virtual address generated by the processor (usually the address of the first byte of the full word targeted by the load or store)
– data: applicable only to stores – the data to be stored
– mask: indicates the bytes targeted within the full-width memory word
– we assume that the largest chunk of memory addressable by a single load or store is a full memory word (= a power-of-two bytes)
The IQ or reservation station entries for instructions are set up to compute an effective address and forward that address directly to the LSQ entry.
The data to be stored are similarly forwarded to the LSQ entries that need them.
A load or store operation proceeds from the head of the LSQ to the L1 D-cache only if all fields within that entry are valid. If the entry at the head is not valid, entries behind it are also held up.
434
The Memory Dataflow Path: Details (contd.)
Loads in the LSQ are allowed to bypass the stores ahead of them in the LSQ only under the following conditions:
1. The memory address fields of ALL stores ahead of the load are valid.
2. (a) The address fields of these stores do NOT match the address of the load, OR
(b) The address(es) match but the mask bits of these stores indicate that the load does not target any of the bytes that these stores target.
– Note that condition 2(a) can be relaxed by checking for a mismatch of a few bits of the full address instead of the complete address
Similarly, to forward data to a load from an earlier store still sitting in the LSQ, the conditions are:
1. The memory address fields of ALL stores ahead of the load are valid.
2. The address field of one or more of these stores matches the address of the load AND the mask bits indicate that the load targets a subset of the bytes written by the matching store.
– On such a match, the load gets its data from the nearest preceding matching store
– Partial address matches (as in the case of store bypassing) do not work in this case!
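The bypass and forwarding conditions listed above can be sketched as follows; the per-byte masks and word addresses are an assumed representation of the LSQ fields, not the exact hardware encoding:

# Sketch of the LSQ checks for one load against the stores ahead of it in program order.
def can_bypass(load_addr, load_mask, older_stores):
    """The load may access the D-cache ahead of the stores only if no older store overlaps it."""
    for addr_valid, st_addr, st_mask in older_stores:
        if not addr_valid:
            return False                      # condition 1: every older store address must be known
        if st_addr == load_addr and (st_mask & load_mask):
            return False                      # overlapping bytes: cannot simply bypass
    return True

def forward_from(load_addr, load_mask, older_stores):
    """Nearest older store that covers ALL the bytes the load needs, or None."""
    if not all(valid for valid, _, _ in older_stores):
        return None                           # condition 1 again
    for _, st_addr, st_mask in reversed(older_stores):   # nearest preceding store first
        if st_addr == load_addr and (st_mask & load_mask):
            # forward only if this store writes every byte the load wants
            return (st_addr, st_mask) if (st_mask & load_mask) == load_mask else None
    return None

# Stores ahead of the load, oldest first: (address_valid, word_address, byte_mask)
stores = [(True, 0x100, 0b1111), (True, 0x104, 0b0011)]
print(can_bypass(0x108, 0b1111, stores))      # True  - the load does not overlap any older store
print(forward_from(0x104, 0b0001, stores))    # (260, 3) - byte 0 is covered by the nearest store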
With large LSQs, the sheer number of comparators needed to detect the bypassing or forwarding conditions can be quite substantial. In a superscalar design that allows L loads to proceed to the L1 D-cache per cycle, the number of these comparators goes up by a factor of L.
435
L1 D-Cache Miss Predictor
The Motivation:
First used in the DEC Alpha 21264 implementation: addresses latency in accessing data from the L1 D-cache by loads (3 cycles for this implementation).
To support back-to-back issue of a load, its dependents and their dependents should be issued as early as possible.
A load hit or a miss is not discovered until 3 cycles after its issue; waiting that long before its dependents (and their dependents) can be issued can lead to serious performance loss.
The Solution:
Issue instructions in the “shadow” of the load (that is, instructions following the load that can be issued within the 3-cycle load latency) speculatively assuming a hit in the L1 D-cache can avoid this performance loss. The instructions within the shadow of the load are typically the dependents of the load and their dependents.
On discovering a miss on the L1 D-cache access, speculatively issued instructions have to be rolled back (that is, their execution has to be abandoned) and reissued after the miss is serviced. The term instruction replay has been used to describe the rollback and reissue process. (DEC called this the “mini-restart” mechanism.)
In the 21264, all instructions in the shadow of the load, whether or not they were dependent on the load, were replayed on a cache miss by the load.
436
L1 D-Cache Miss Predictor (contd.)
Instruction replays have an additional overhead (2 cycles in the 21264), so to avoid wasting these two additional cycles for the replays, a cache hit predictor was used.
Instructions are issued in the shadow of the load only if a cache hit was predicted – this reduces replays.
The cache hit predictor was a 4-bit saturating counter:
The most significant bit of this counter indicates the prediction – if this bit is a one, a cache hit is predicted
The counter was incremented on cache hits
The counter was decremented by 2 on cache misses – this adds the required hysteresis to avoid flipping predictions frequently after a misprediction was made.
Similar techniques used in many contemporary processors (such as the Intel P4).
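A sketch of such a 4-bit saturating hit predictor (the initial counter value below is an assumption):

# 4-bit saturating counter: +1 on a hit, -2 on a miss, prediction taken from the MSB.
class HitPredictor:
    def __init__(self):
        self.counter = 0b1111              # start strongly predicting "hit" (assumed)

    def predict_hit(self):
        return (self.counter & 0b1000) != 0   # MSB set -> predict an L1 hit

    def update(self, was_hit):
        if was_hit:
            self.counter = min(self.counter + 1, 15)
        else:
            self.counter = max(self.counter - 2, 0)

p = HitPredictor()
p.update(False); p.update(False)           # two misses: counter 15 -> 13 -> 11
print(p.predict_hit())                     # True: still predicting hit (hysteresis)
p.update(False); p.update(False)           # 11 -> 9 -> 7
print(p.predict_hit())                     # False: MSB cleared, predict miss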
437
Speculative Forwarding From Stores to Loads
Motivation:
Forwarding data from a prior store in the LSQ to a later load is almost mandated for superscalar designs.
If the load is not dependent on the prior stores in the LSQ, it can go ahead and access the L1 D-cache
Detecting the bypassing or store-to-load dependency requires the full address of the load to be compared with the address of the prior stores in the LSQ
Ideal arrangement: perform the address comparisons and the L1 D- cache access in parallel
The problem: comparing long memory addresses takes a long time, precluding the possibility of overlapping the cache access with the address comparison.
The solution as used in the Intel P4 (90 nm implementation):
Compare only a few bits of the addresses (faster than comparing the full addresses) and perform any store-load forwarding/bypassing.
Compare full addresses later and verify that the results of comparing a few bits match the results of comparing full addresses.
Replay the load and following instructions if earlier comparisons were incorrect.
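A sketch of the speculate-then-verify flow above; the number of partial address bits compared is an assumption, not the P4's actual figure:

# Fast partial-address check first, full-address verification later, replay on a mismatch.
PARTIAL_BITS = 12
MASK = (1 << PARTIAL_BITS) - 1

def speculative_match(load_addr, store_addr):
    """Fast check: compare only the low PARTIAL_BITS of the two addresses."""
    return (load_addr & MASK) == (store_addr & MASK)

def verify(load_addr, store_addr):
    """Slow check: full address comparison; a mismatch forces a replay of the load."""
    return load_addr == store_addr

load, store = 0x7FF00234, 0x12345234       # same low 12 bits, different full addresses
if speculative_match(load, store):
    print("forward speculatively")
    if not verify(load, store):
        print("full compare failed -> replay the load and the following instructions")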
438
Speculative Forwarding From Stores to Loads (contd.)
The actual solution is more complex as the P4 supports byte addressing with different types of loads and stores (load/store byte, load word, load double word, load quad word; word = 2 bytes) and the virtual addresses generated specify the address of the first byte of whatever is being loaded or stored:
[Figure: the SFB (Store Forwarding Buffer) supplies data to the load through a MUX whose offset is first selected by a partial address match and later corrected by the full address match produced by the comparators and priority logic of the MOB (Memory Ordering Buffer); a final MUX/shifter applies the shift amount to align the data forwarded to the load]
The SFB stores data to be written to the L1 D-cache by the queued up stores and selected bits of the memory addresses for each – these address bits are compared for the speculative store-to-load forwarding.
The MOB holds full addresses of loads and stores – this is effectively what we called the LSQ. Full addresses are compared by the logic associated with the MOB. The usual conditions for comparing addresses apply, as discussed on Page 406.
439
Speculative Forwarding From Stores to Loads (contd.)
Incorrect forwarding from a store to a load can take place because of the following reasons:
The results of the partial address comparisons were not corroborated by the full address comparison in the MOB – this comparison takes place when the load is added to the MOB.
– Two possibilities on detecting an incorrect forwarding case using partial bit matching: either a different store matches or no stores match. In the former case, the address of the correct matching entry is forwarded to the SFB through the multiplexer on the left.
The match based on partial bit comparisons was correct but the data forwarded was not correctly aligned – in this case the multiplexer at the bottom shifts the forwarded data appropriately.
On incorrect matches using partial bit comparisons or on detecting incorrect alignments, the load and the following instructions are replayed.
In the implementation described, restrictions are placed on forwarding data from stores to loads based on the alignment of the data – for example, a load cannot get data forwarded from a store storing a quad word if the data needed by the load crosses a double word boundary. These restrictions simplify the address comparison/shift logic.
440
The Complexity of Superscalar Datapaths
The hardware complexity of a m-way superscalar processor grows with
m as follows:
The area of the register file grows as m² – this is because each added port increases both linear dimensions (height and width) of each bitcell in the register file, and the number of ports grows linearly with m, so the total bitcell area is O(m²). Other components of the register file (decoders, drivers, sense amps, etc.) grow linearly with m.
The dispatch buffers, ROB, etc. grow similarly.
The area of the dependency checking logic grows as m².
The total area occupied by the FUs grows linearly with m.
– The overall area needed is O(m²).
The performance growth, in contrast, goes up sub-linearly with m:
[Graph: relative performance vs. m, rising sub-linearly and flattening as m grows]
Reasons:
Branches and dependencies limit the potential growth in performance
Pipeline complexities also cause these penalties to go up
Cycle times are stretched by the added hardware complexity
Bottom line: bad return for investments made! 441
Superpipelined CPUs
Historically, superpipelining was necessitated to accommodate the relatively large latencies of the L1 caches in relation to the relatively smaller latencies of the other stages in the pipeline:
– Circuit design techniques could be used to reduce logic delays
– Little could be done to reduce the overall cache access time.
– The obvious solution was to pipeline the cache access into two (or more) stages, which basically implied that two or more stages have to be devoted to instruction fetching and the data cache access.
– Net result: more pipeline stages, making possible the use of a faster pipeline clock.
There is no clear agreement on the transition point between normal pipelining and superpipelining.
Today, virtually all pipelines can be considered as superpipelined – the term “superpipelining” only has historical significance.
442
Putting it Together: A Typical Classical Desktop PC System & Bridges
Typical PC System with “chipset” for accommodating peripherals:
[Block diagram: the CPU (with on-chip caches) connects over the memory bus to the North Bridge (system controller – faster buses/links, more I/O pins), which also connects to memory and the AGP graphics card; the South Bridge (peripheral controller – slower buses/links, support for legacy buses) hangs off the North Bridge via data & control signals and serves the PCI bus (network card, SCSI card), the LPC bus, ISA, EIDE 1/EIDE 2 and other buses for slow I/O devices]
North (= north of PCI Bus in the diagram) and South Bridge chips make up the so-called “chipset”.
North bridge handles higher speed buses (memory buses – often separate buses for DDR memory and RDRAM) and high-speed point-to-point links.
443
A Typical Classical Desktop PC System & Bridges (contd.)
The North bridge has the following features/functions:
Integrates DRAM controller, bus and link interfaces, DMA controllers
Provides bus interfaces, buffers and arbitration logic
Provides interface to South Bridge – usually proprietary, implying that north and south bridge chips from different vendors may not work together!
Provides AGP interface (PCI-like, but faster, point-to-point)
Connectivity and interface between the various buses it handles.
The North Bridge has a higher number of signal pins and faster logic.
The South bridge has the following features/functions:
Interfaces to buses, in-between buses, buffers, arbitration logic, DMA controllers.
EIDE controllers (usually two: primary and secondary)
Real-time clock
Power management hooks
LPC (“low pin count”) bus interface and controller (serial port, mouse port, game port, etc.)
LAN wakeup and diagnostic interfaces
Interfaces to Flash devices
USB controller, buffers and interfaces
Additional ports (e.g., the AC97 audio port on Intel south bridges)
444
A Typical Classical Desktop PC System & Bridges (contd.)
Main reasons for not integrating the North and South bridges:
Peripheral bus specs evolving, legacy support needs changing – separate chips means these changes affect only the South bridge
An integrated chip will need more pins – drives up costs
Processing speeds are different – can use more cost-effective technology (e.g., slow, older process for South Bridge, more aggressive technology for North Bridge).
– Some vendors are integrating bridges with limited functionality for each bridge.
Going beyond the classical organization
North bridge includes support for multiprocessing, encryption etc.,
Use of a fast, dedicated connection between bridges, with the South Bridge getting exclusive access to the PCI Bus,
Integration of graphics controller into the North Bridge
Integration of off-chip cache or off-chip cache controller into the North Bridge
Support for managing requests from multiple threads (as in the Intel 875 chipset).
Integration of North Bridge functions into CPU chip.
Intel terminology: Intel now calls the bridges hubs:
– North Bridge = memory controller hub (MCH)
– South Bridge = I/O controller hub (ICH)
445
Modern System-Level Organization Trends
Low-cost, low-power embedded CPUs integrate DRAM controller and graphics (many products available now, starting with the Intel Atom and others).
This integration is possible in low power products targeting the embedded area as the processing cores used are relatively simpler compared to the high-end ones, so silicon real estate and power budgets are available for these other artifacts. Integration also saves power and enhances overall system-level performance, as connections among components are all internal to the chip.
CPU chips targeting cell phones integrate graphics, encryption, DRAM controllers (and multiple cores) within a single chip.
Other embedded products integrate DRAM controllers, graphics and hard disk (SATA) controllers with one or more cores in a single chip – targeting low-cost laptops and tablets. (Tablets often use Flash-based solid state non-volatile storage – these use a SATA interface.)
High-end multicore products (targeting desktops, servers) integrate multiple cores, one or more DRAM controllers into a single chip.
Access to DRAM banks uses a high-speed PCIe 3.0 interface with multiple lanes (= multiple pins/wires).
Emerging server CPUs will permit the network adapter to be directly connected to the CPU using PCIe 3.0 links – the Intel Xeon 5 family is an example of this. The need to do this is driven by the prevalence of high-speed network adapters.
446
Emerging System-Level Configuration: Intel Xeon 5 Family
The NIC and other adapters interface directly with the multicore CPU chip using PCIe 3.0 links:
[Block diagram: the Xeon 5 multicore CPU connects to DRAM DIMMs through its on-chip DRAM controllers, to a NIC and a SATA/eSATA adapter through multi-lane PCIe 3.0 links, and to another CPU through a QPI interconnection]
On-board DRAM controllers interface to DRAM system
Has a fast connection to another CPU for a multi-CPU-chip multiprocessing configuration.
NIC can DMA to/from one half of shared lowest level (L3) cache.
447
Energy Implications: Power and Performance
Power = energy consumed per unit time
The total power consumed by a chip is the sum of two components:
Dynamic power: caused by transistors switching from on to off or vice versa.
Static power: caused by leakage within each transistor, inevitable as transistors are not perfect switches – they conduct a little even when they are supposed to be completely off. Static power consumption goes up with temperature and is thus dependent indirectly on dynamic power.
As an application executes, transistors that make up the chip switch.
Not every transistor switches at each clock; some may not switch at all as the program executes – for example, none of the transistors that make up the floating point execution units will switch if the application does not use any floating point instructions.
In CMOS circuits (which are used in implementing modern processors), when a transistor switches, it essentially charges up or discharges a load (which is effectively a capacitor) connected to its output. In synchronous pipelines, switching is synchronized with the clock.
At each such switch, the energy expended is e = 0.5 × C × V², where C is the load capacitance and V is the supply voltage.
448
Energy Implications: Power and Performance (contd.)
The dynamic power consumed by each processing core due to switching (dynamic power consumption) can be expressed as:
P = S × C × V² × f, where:
S = switching coefficient – depends on the application and the hardware (microarchitecture): measures how often each transistor switches on the average in each clock tick
C = average load capacitance charged/discharged on each switching event
V = supply voltage
f = clock frequency
Simply reducing the clock frequency will not save energy
– a given application requires a fixed number of switching events, irrespective of the clock frequency; at each switching event energy will be spent
– Another way to look at this: if we lower only the clock frequency, we will lower the power dissipation, but we now need to execute for a longer time, so the energy consumed will be unaffected.
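A toy calculation that makes this point concrete; the event count, capacitance and voltages below are made-up values:

# Total dynamic energy for a program: (number of switching events) x 0.5 * C * V^2.
def energy(events, C, V):
    """Each switching event costs 0.5 * C * V^2 joules."""
    return events * 0.5 * C * V * V

EVENTS = 1e12        # switching events required by the application (fixed by the program)
C = 1e-15            # average switched capacitance per event, in farads (assumed)

base = energy(EVENTS, C, V=1.0)            # nominal frequency, nominal voltage
slow = energy(EVENTS, C, V=1.0)            # frequency halved, voltage unchanged
dvfs = energy(EVENTS, C, V=0.8)            # frequency halved AND voltage lowered to 0.8 V
print(slow / base)                         # 1.0  -> slowing the clock alone saves no energy
print(dvfs / base)                         # 0.64 -> the V^2 term is where the savings come from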
449
Energy Implications: Power and Performance (contd.)
Most chips use a dynamic voltage and frequency scaling (DVFS) mechanism that adjusts the clock frequency, f, as well as the supply voltage, based on the realized performance.
The clock frequency can be reduced during periods of low activity without impacting performance adversely. For example, when the IPC is low due to sustained on-chip cache misses, it doesn’t make sense to run the core at the highest clock frequency, so the clock frequency can be lowered.
The time it takes for a transistor to switch is a function of the supply voltage – the switching time decreases as supply voltage is increased.
Fast clock rates demand smaller switching time for the transistors.
Thus, at lower clock frequencies, a lower supply voltage can be used.
The DVFS mechanism reduces energy consumption by reducing both the supply voltage and clock frequency during times of low activity and boosts up both when activity picks up.
Modern chips use a voltage regulator (either external to the processor chip or internal to the chip, as in many emerging designs) to adjust the supply voltage as part of the DVFS mechanism.
Software modules, called governors, are used to choose the correct DVFS setting (a fixed number of such settings exist) based on some performance criteria (such as CPU utilization).
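A minimal, illustrative governor-style policy; the (V, f) settings and the utilization-based rule below are assumptions – real governors are OS- and platform-specific:

# Pick the lowest (V, f) setting whose frequency still covers the observed demand.
SETTINGS = [(0.8, 1.2e9), (0.9, 1.8e9), (1.0, 2.4e9), (1.1, 3.0e9)]   # assumed (volts, Hz) pairs

def choose_setting(utilization, current_freq):
    """utilization in [0, 1], measured at current_freq; returns the chosen (V, f) pair."""
    demand = utilization * current_freq          # cycles per second the workload actually needs
    for volts, freq in SETTINGS:                 # settings sorted from slowest to fastest
        if freq >= demand:
            return volts, freq
    return SETTINGS[-1]

print(choose_setting(0.35, 3.0e9))               # light load -> drop to (0.8 V, 1.2 GHz)
print(choose_setting(0.95, 3.0e9))               # near-saturated -> stay at (1.1 V, 3.0 GHz)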
450
CLASS NOTES/FOILS:
CS 520: Computer Architecture & Organization
Part VI: Multicore Microprocessors and Parallel Systems
Dr. Kanad Ghose ghose@cs.binghamton.edu http://www.cs.binghamton.edu/~ghose
Department of Computer Science State University of New York Binghamton, NY 13902-6000
All material in this set of notes and foils authored by Kanad Ghose 1997-2018 and 2019 by Kanad Ghose
Any Reproduction, Distribution and Use Without Explicit Written Permission from the Author is Strictly Forbidden
CS 520 – Fall 2020
451
Types of Parallelism
ILP – concurrency within a single flow of control:
Exposed/exploited using hardware only (out-of-order execution mechanisms)
Exposed completely using compilers (VLIW)
Exposed/exploited using hardware and a one-time compilation for discovering concurrency (EPIC)
TLP – thread level parallelism:
The processor industry refers to independent processes as threads!!
Such “threads” can be processed concurrently
Modern systems typically have many active threads (several users processes as well as kernel processes) – many of these can be processed in parallel
Typical servers spawn processes to service requests, so these can also be processed concurrently
Coarse-grained synchronization needed for accessing shared resources
Hardware mechanisms used for harvesting TLP:
– Multi-threaded processors (”hyper-threaded” or SMT processors)
– Multi-core processors or chip multiprocessors (CMPs)
– Distributed computing systems
452
Types of Parallelism (contd.)
TTLP: True thread-level parallelism – these are threads as used in
traditional OS terminology
Here a thread is a single control flow and threads share a common address space
Threads share common variables (= memory locations)
Finer-grained synchronization required; spin-locks are commonly used for synchronization across threads
Main reasons: the overhead of sleep locks can be significant – they require a system call and process scheduling – and other processors exist for making progress on the overall computation.
Can use all three hardware mechanisms above for TLP (distributed versions use a distributed shared memory abstraction).
Data Parallelism: another type of parallelism where a common instruc- tion stream applies the same operation to multiple data streams in paral- lel, using independent execution units for each data stream.
Implemented on a single platform using one or more independent memory systems.
Distributed Data Parallelism: multiple platforms (hosts) execute the same code on their own chunk of data. The “map” part of map-reduce implementations is an instance of exploiting this parallelism.
453
Reasons for Using Parallel Systems
Main reasons for using parallel systems:
Performance: execute concurrent portions in parallel.
Reliability: redundant executions on identical data sets with identical code, with cross-checks for validation. Can use voting with at least 3 identical executions.
Increased energy-efficiency: execute in parallel at lower clock frequency settings on each processor. A lower clock frequency also permits the supply voltage to be reduced.
Contemporary terminology:
Core = processor
Multicore processor or multicore chip = many cores within a single package, possibly sharing resources such as lower-level on-chip caches, DRAM controllers, on-chip interconnections etc.
Uncore = parts of a multicore chip, excluding the cores.
454
Speedup Using Multiple Processors – the Limit
Amdahl’s Law: For any solution using independent processors, the performance speedup compared to a single processor is limited by the fraction of sequential code, say F:
Speedup = T1/TN = 1/{F + (1-F)/N}
where:
T1 = processing time on a single processor = S + P
TN = processing time on N processors = S + P/N
S = uniprocessor execution time on the sequential part of the code
P = uniprocessor execution time on the parallelizable part of the code
F = S/(S + P)
[Graph: Speedup vs. N – the ideal (linear) curve versus the curve from the equation above, which saturates at 1/F]
The maximum possible speedup (when N is very large) is 1/F. The goal of programming/code parallelization should be to reduce F.
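Plugging numbers into the equation above (F = 0.1 assumed for illustration) shows how quickly the speedup saturates at 1/F:

# Amdahl's Law exactly as stated above: Speedup = 1 / (F + (1 - F)/N).
def amdahl_speedup(F, N):
    return 1.0 / (F + (1.0 - F) / N)

for N in (2, 4, 16, 1024):
    print(N, round(amdahl_speedup(F=0.1, N=N), 2))   # F = 10% sequential code (assumed)
# The speedup approaches 1/F = 10 no matter how many processors are added.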
455
Speedup Using Multiple Processors – the Limit (contd.)
The ideal behavior: speedup increases linearly with N. The reality is
different:
The non-zero value of F prevents linear speedup
As N increases, the synchronization overhead can go up; as some aspects of synchronization are inherently sequential, this can increase the effective value of F.
The cumulative impact of caching can reduce the time spent on both the sequential and parallel portions of the code, and this can sometimes lead to super-linear speedup.
Bottom Line: It is still important to design a high performance processor as applications are bound to have inherently sequential portions of code.
456
Flynn’s Classification of Computer Organizations
Based on the number of instruction streams and data streams:
– Instruction stream = instruction fetch path (logical)
– Data stream = operand access path (logical)
Four classes are possible (the block diagrams of fetch/decode (F/D) units, execution (EX) units and memory are omitted here):
SISD: Single Instruction Stream, Single Data Stream
Examples: Uniprocessors (including pipelined processors)
SIMD: Single Instruction Stream, Multiple Data Streams
Examples: Thinking Machine CM, CM-2, Maspar MP series, DAP
MISD: Multiple Instruction Streams, Single Data Stream
This is the missing category: no real machines exist in this class
MIMD: Multiple Instruction Streams, Multiple Data Streams
Examples: Intel iPSC, iPSC/2, iPSC/860, Thinking Machine CM-5, Cray T3D, BBN Butterfly, Alliant FX series
457
(Classical) Shared Memory Multiprocessors
Also known as tightly-coupled multiprocessors
Processors share address space (i.e., memory modules)
Processors interact by reading and writing shared variables; processor interaction delay is typically a few tens to a few hundred processor cycles (at most) – most of this delay is incurred in going through the interconnection network
[Figure: two organizations, with M = memory modules and P = processors]
“Dance Hall” Style: all memory modules sit on one side of the interconnection network and all processors on the other.
Examples: Most Single-Bus Systems (Alliant FX/80, Sequent Balance), Cedar
“Distributed Shared Memory” (DSM) Style: each processor has a memory module attached to it, and the processor-memory pairs are connected through the interconnection network.
Examples: BBN Butterfly, IBM RP3, Kendall Square KSR-1, DASH, Cray T3D
458
Shared Memory Multiprocessors (contd.)
Caches can be incorporated with processors to improve effective memory latencies; if shared variables are cacheable, the caches must be kept coherent (LATER)
Memory Access Latency = Network Latency * 2 + Memory Access Time
– caching data on the processor side cuts this down drastically
DSMMs do better, even if caches are not used (on accesses to the processor's own dedicated module)
Interconnection network (IN) is usually dynamic (and has low latency); Some DSMMs use static INs
Programming model: threads sharing variables, usually with a sequential consistency model of memory – relatively more convenient than a distributed memory model
Synchronization mechanisms needed to assure consistency of shared data where needed; Spin locks are common
Multicore microprocessors are effectively SMPs – they incorporate processors with local and/or shared lower level caches and an on-chip interconnection into a single chip. Also called a CMP (chip multiprocessor).
A shared memory multiprocessor is called a SMP (for Symmetric MP) if all processors are identically organized around the shared memory and if all processors experience the same shared memory access latency.
459
Distributed Memory Multiprocessors
Also called loosely-coupled multiprocessors or multicomputers
These effectively look like a distributed/networked system, with a fast local IN.
[Figure: each “node” pairs a processor (P, also called a Processing Element, PE) with its own memory module (M); the nodes communicate only through the interconnection network]
Examples: Intel iPSC, iPSC/2 (“Hypercube”), iPSC/860, NCube/10, Thinking Machine CM-5
Processors do not share address space – each processor has its private address space/memory module
Processors interact by sending/receiving messages – each node typically has dedicated communications coprocessors for this purpose
Message passing latencies (=communication delays) are relatively longer: often as high as a few thousand processor cycles
Communication delay = Flat Overhead + Constant * Size of Message
460
Distributed Memory Multiprocessors (contd.)
The interconnection network is usually static
Programming model: concurrently executing tasks (i.e., processes) communicating via message passing:
Not suited for applications whose concurrent units require frequent interaction
Relatively difficult to program: programmer has to take into account network topology, communication latencies, data partitioning to write efficient code
Run-time programming library supports primitives for communication and data distribution/collection. (Using traditional distributed programming primitives like the socket API is useless, as these use the slower TCP/IP protocols).
The overall design scales much better with the number of processors compared to an SMP – mostly because static INs are more scalable.
Note that a DSM can be easily simulated on an SMP.
461
Chip Multiprocessors (CMPs)
These are shared memory multiprocessors within a single chip. Real products in this category were enabled in the last 9 to 10 years, as it became possible to pack more devices into a single chip. The on-chip artifacts include:
Multiple CPUs (or processing cores) – hence the alternative name multicore microprocessor applies.
The cores often include one or two levels of private caches, along with shared caches – either global or shared among a few cores.
Newer multicore products also include on-chip DRAM controllers, interconnection networks and hardware-supported coherent caches (later).
Programming model: globally shared address space, Pthreads or Java threads.
These are threads in the OS sense!
Thread allocations to cores are done by the OS, but the programmer can define CPU affinities.
What on-chip integrations achieves:
Higher throughput
Higher energy efficiency (combined IPC per Watt)
462
Multicore Microprocessors
Rationale:
Single core, out-of-order designs were giving disproportionately low performance gain compared to the investments made in the number of transistors. This is a technology-independent limitation.
The so-called power wall was also getting in the way of further performance gains – total power, local peak temperature on the chip were at their limits. This is a technology dependent limitation.
With the shrinking of transistor sizes, two or more simpler cores could be integrated into a single chip and can provide better performance with a lower power requirement than a single powerful core that used the same number of transistors.
Basic tenet: use multiple, more energy-efficient cores within a single chip to get a higher combined performance as opposed to implementing a single, powerful, inefficient design.
Generations of Multicore Microprocessors:
First generation: Homogeneous cores that shared no on-chip resources. Each core had its private cache hierarchy; I/O pads were either shared or dedicated to the cores. Hastily thrown together to hit the market!
Second generation: Homogeneous cores sharing lower level on-chip caches. Advantage: the overhead of migrating a process from one core to another is lower.
463
Multicore Microprocessors (contd.)
Third generation: Homogeneous cores, with possibly shared lower level caches and hardware supported coherent caches. Requires shared interconnections among caches and cache coherency logic.
Fourth generation: All or some of the above, but with heterogeneous cores. Example: “big-little” ARM and AMD x86 products.
Product offering from many vendors: dual-core and soon-to-be-available quad-core designs from AMD and Intel implementing the X86-64 and IA-32 ISAs; 8-core (in-order cores) SPARC implementation from Sun (the Niagara processor). The embedded computing market had multicore offerings from the early 2000s.
Programming these microprocessors to exploit their full capabilities is challenging – OSs, run-time systems and libraries need to exploit the capabilities of multi-core designs.
Chips with heterogeneous cores (general purpose core(s) plus a mix of one or more of graphics cores, specialized floating point cores, DSP cores etc.) are likely to appear soon. The embedded computing market is way ahead in this respect.
Current products/trends: 2 to 12 cores, several levels of caching, coherent caches (using on-chip switched interconnection: e.g., ring interconnection in Intel's Sandy Bridge line, point-to-point connections, e.g., QPI), on-chip DRAM controllers. Some include GPUs + general purpose cores. Multicore products have also permeated into the embedded market segment.
464
Example: Intel Core 2 Duo
Implements the 64-bit X86-64 ISA. Dual core, with shared lower level L2 cache.
65 nm implementation, 290 million transistors; 1.86 GHz to 2.93 GHz clock rate. Introduced in late 2006.
32 KByte L1-I, 32 KByte L1-D and 2 MByte to 4 MByte shared on-chip L2 cache. No cache coherence.
Additional undisclosed mechanisms are provided for implementing dynamic and fair sharing of the shared L2 cache (“smart cache” technology).
Uses a total of 8 prefetchers, to L1 caches, to L2 caches and to memory.
Core design is relatively simpler: the 20+ stage P4 pipeline is abandoned in favor of a 14-stage pipeline very similar to that of the P6 (which had two fewer stages).
One of the two extra stages permits macro-op fusion (fuses two simple X86 instructions – such as a compare with a branch – into a single instruction to achieve a faster decoding rate).
The other stage is used for computing 64-bit addresses.
Power dissipation is lower than the single core P4.
465
Multicore Microprocessors (contd.)
Main advantages of integrating multiple processing cores into a single
chip:
Communication and interaction among cores is much faster, as on chip interconnections can be operated at a much faster rate compared to off-chip interconnections.
Snoop based protocols will thus have lower performance overhead compared to implementation using off-chip logic and interconnections.
Faster design time – basic core design can be used unchanged.
Simplifies motherboard design compared to SMP motherboards for single-core design.
Cores can be controlled independently to match processing power and power dissipation to dynamics of application – higher energy efficiency results.
Has significant performance and energy advantages compared to a multi-threaded design when it comes to processing concurrent “threads”.
The main challenges in exploiting multicore processors:
On-chip memory hierarchy has to be much smarter and more energy-efficient.
Programming these microprocessors to exploit their full capabilities is challenging – OSs, run-time systems and libraries need to exploit the capabilities of multi-core designs.
Have to somehow address the Amdahl bottleneck!
466
Multicore Microprocessors (contd.)
Current design flavors:
[Figure: three multicore cache organizations in front of external memory –
(a) No shared on-chip caches: each CPU has its own private L1 and L2 caches.
(b) Shared on-chip lower-level cache but no cache coherence: private L1s in front of a shared L2; with an L3 present, core-pairs share an L2 in some designs.
(c) Shared on-chip caches with inclusion and coherence (example: Intel Nehalem): private L1/L2 per core plus a shared, inclusive L3 kept coherent.]
When no cache sharing is present, writing (true) thread based applications can be grossly inefficient – as shared structure updates will need locks and these locks have to be marked uncacheable.
This style is more suited for multiprocessing applications (processes interacting very rarely)
What about sharing at the kernel level?
With shared lower level caches, the situation is a little better:
But significant overhead is needed to keep shared variables coherent in private caches; often critical sections are used, with locks marked uncacheable
467
Multicore Microprocessors (contd.)
Cache coherence in multicore systems:
Often first two levels of cache (L1 and L2) are private, and no on-chip memory is present.
Private caches are kept coherent using a cache coherence protocol (LATER). An on-chip interconnection network is used to maintain cache coherence.
Often, because of large wire delays, larger shared caches are partitioned into banks and delays in accessing cache banks are non-uniform.
Lowest level cache can be shared.
Increasingly common to see multiple on-chip DRAM controllers, one DRAM controller dedicated to a small number of cores, but still usable by other cores (via a fast on-chip interconnection).
Cores shared a common clock and DVFS logic in early products but emerging products have per-core DVFS to support efficient power management (as in the IBM POWER 8 series).
468
Dynamic Interconnection Networks
Dedicated communication path set up dynamically between communicating entities
Paths are reconfigurable – reconfiguration is done entirely in hardware using switching elements
Characteristics of interest:
Total number of switching elements (cost)
Number of switching elements to be traversed in a path set up (delay)
Number and nature of interconnection permutations realizable (connectivity)
Parallelism: number of concurrent paths that can be set up
Strategies for setting up and using paths (arbitration, routing)
Path redundancy – number of different ways of connecting point A to point B (“unique path property” or not), fault tolerance etc.
Implementation Constraints: wiring complexity, complexity of switching elements, complexity of control logic, power requirements etc.
469
Dynamic Interconnection Networks (contd.)
The two extremes: the single bus and the crossbar INs
The Single Bus Interconnection:
p = # of processors, m = # of memory modules
Total # of switches needed = O(m+p) = O(p)
Delay per path = 2 = O(1)
All 1-to-1 or 1-to-many permutations realizable, one at a time
Each connection to the bus is a switching element
The Crossbar Interconnection:
[Figure: a p X m crossbar – each processor (P) connects to each memory module (M) through a switching element at the corresponding crosspoint]
Total # of switches needed = O(m.p) = O(p^2)
Delay per path = 1 = O(1)
All permutations realizable
470
Dynamic Interconnection Networks (contd.)
An intermediate design: the Shuffle-Exchange Multistage IN (SE-MIN) – trades off performance for lower complexity (compared to a crossbar)
An 8 X 8 Shuffle-Exchange MIN Using 2 X 2 Switching Elements (“Omega Network”)
[Figure: an 8 X 8 Omega network – inputs 0-7 on the left, outputs 0-7 on the right, three stages (numbered 2, 1, 0 from input side to output side) of 2 X 2 crossbar switching elements; each element can be set straight, crossed, upper-broadcast or lower-broadcast]
For N X N S-E MIN:
N = 2^n = p = m
O(N log N) switches
Delay per path = n = O(log N)
Many permutations realizable (shifted, ordered ..)
The wiring pattern preceding each stage implements the shuffle permutation
Each switching element incorporates control logic to arbitrate over conflicting requests and determine switch setting
Logic for setting switch: To set the switch in Stage # i, look at bit # i in the address of the destination (“destination tag”): if it is 0, set the switch to choose the upper output of the switching element; otherwise, set the switch to select the lower output. (LSB = bit # 0)
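For example, for N = 64 an Omega network needs log2 64 = 6 stages of 32 two-by-two switches (192 switching elements) and a path delay of 6, versus 64 X 64 = 4096 crosspoints but a delay of 1 in a crossbar. The destination-tag rule above is simple enough to sketch in a few lines of C (the function name and port encoding below are made up for illustration):

    /* Sketch: destination-tag routing for an N x N Omega network, N = 2^n.
       setting[i] receives the output chosen at stage i (stages n-1 ... 0,
       traversed from the input side to the output side).                   */
    #include <stdio.h>

    enum port { UPPER = 0, LOWER = 1 };

    void destination_tag_route(unsigned dest, int n, enum port setting[])
    {
        for (int i = n - 1; i >= 0; i--)
            setting[i] = ((dest >> i) & 1) ? LOWER : UPPER;  /* bit i of the tag */
    }

    int main(void)
    {
        enum port s[3];
        destination_tag_route(5, 3, s);     /* destination 101 in the 8 x 8 MIN */
        for (int i = 2; i >= 0; i--)
            printf("stage %d: %s output\n", i, s[i] == UPPER ? "upper" : "lower");
        return 0;
    }

Because the setting depends only on the destination tag, any input can compute its whole path locally; conflicts arise only when two paths need different settings of the same switching element.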
471
Static Interconnection Networks
Connects entities (called nodes) using a fixed wiring pattern. Static point-to-point interconnection among nodes: in general, a direct connection (= link) does not exist between any pair of nodes
Going from one node (the source) to another (the destination) may thus require the traversal of intermediate nodes
Links are usually serial; buffers are provided at each node to hold on to message if required outgoing link is busy (i.e., in use)
Characteristics of interest:
Number of links per node (degree)
Diameter = maximum # of links to be crossed in path from source to destination = “hopcount”
Types and number of embedded subnetworks
Routing algorithm: strategy used to follow fixed links between node pairs on the way from a source node to a destination node
Planarity: how many crossovers are needed for implementation – a consideration for VLSI implementations.
Alternative routing strategies, fault-tolerance etc.
472
Static Interconnection Networks (contd.)
Node-to-node message is usually tagged with address of destination and/or source node (and other routing info, if required): these tags are used for properly routing a message to its destination
– Dedicated communications controller can be used at each node to perform routing functions (more common)
– In the absence of a dedicated communications controller, the arrival of a message usually interrupts the processor, which performs the routing functions
End-to-end delay of sending a message from a source node to a destination node may be reduced by using dedicated controllers at each node and using:
Circuit switching: dynamically set up dedicated path (as in a dynamic IN) using links at each node on the way.
Pipelining traversal of messages across nodes, links and buffers – a special form of such pipelining called wormhole routing is used in most high-end static INs today.
– with wormhole routing a message is broken down into smaller chunks called flits
– each node can buffer only a few flits
– as soon as a flit moves out to the next node on the way, it signals the preceding node to send over the next flit
– buffers are relinquished as soon as the last flit moves out
473
Special variant: point-to-point switched on-chip interconnections
Usually on-chip and based on dynamic switches that are interconnected, with wide links, each switch having:
At least one connection to a core
Other connections to other switches
Example: bridged ring interconnections in the Intel Sandy Bridge series of multicore microprocessors:
The ring passes traffic in one direction only
The ring actually consists of four rings operating in parallel: request, data (32 Byte wide), ack, coherence traffic
Each link operates at the full CPU clock rate
The link is unidirectional and has ten access points (switches)
The core can access the L3 bank directly across it or access the ring at two separate points (to cut down on the number of ”hops”). These all use the grey colored ”bridge” connections
[Figure: the Sandy Bridge ring – switches (ring access points) serve Core 0-3 and L3 banks 0-3, the System Agent (PCIe, display manager, DRAM controller) and the shader/graphics/video controller; grey "bridge" connections give each core direct access to the L3 bank across it and to two points on the ring]
474
Static Interconnection Networks (contd.)
An example – the hypercube interconnection network (aka n-dimensional cube or n-cube or binary hypercube):
Connects 2^n nodes, using n links per node
Nodes are assigned n-bit binary addresses; a link exists between two
nodes if their addresses differ in exactly one bit position.
An Example: A 3-cube and routing from node 000 to node 111
[Figure: a 3-cube – nodes 000 through 111; link 0 connects nodes differing in bit 0, link 1 in bit 1, link 2 in bit 2]
The route taken in the example, one hop per row:
at node (src)   dest   rel = src XOR dest   link used   routed to
000             111    111                  0           001
001             111    110                  1           011
011             111    100                  2           111
111             111    000                  (done)
Routing algorithm:
    src, dest: n-bit addresses of source and destination nodes
S:  /* src = address of node sending message or node relaying msg. */
    Compute rel = src XOR dest;    /* bit-wise ex-or */
    if (rel == 0) then
        done    /* message has reached its destination */
    else {
        Let K be the position of the most-significant 1 in the n-bit entity rel;
        /* bits are given positions (n-1) through 0, msb onwards */
        Route message out through link # K
        /* Link # K connects two neighboring nodes in the hypercube
           that differ in their addresses in the K-th bit position */
    }
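The same routing step in C (a sketch; the function name is made up). Note that the algorithm as stated picks the most-significant 1 of rel, while the worked example in the figure happens to flip bits LSB-first – either order works, since every hop removes one differing bit.

    /* Sketch: returns the link to take at node 'src' for a message headed
       to 'dest' in an n-cube, or -1 if the message has already arrived.   */
    int next_link(unsigned src, unsigned dest, int n)
    {
        unsigned rel = src ^ dest;          /* bit-wise exclusive-or        */
        if (rel == 0)
            return -1;                      /* reached the destination      */
        for (int k = n - 1; k >= 0; k--)    /* most-significant 1 of rel    */
            if (rel & (1u << k))
                return k;                   /* route out through link # k   */
        return -1;                          /* unreachable: rel == 0 handled above */
    }

Traversing link k moves the message to neighbor src ^ (1 << k); repeating the step reaches dest in at most n hops, which is the diameter of the n-cube.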
475
Static Interconnection Networks (contd.)
The hypercube interconnection was used in the Intel iPSC, iPSC/2, NCube, CM-1 and CM-2 (the CM-1 and CM-2 use a 12-dimensional hypercube with a node that uses a 4 X 4 mesh interconnection internal to the node)
Another example – the FAT tree interconnection: used in the Thinking Machine’s CM-5 system
[Figure: a fat-tree interconnection – processing nodes at the leaves, connected upward through levels of 2 X 4 switches]
Routing from one node to another requires the message to be routed up from the source to the nearest common “ancestor” switch and going down to the destination from there.
Earlier ancestors can be used in case of congestion
476
The Abstraction of Shared Memory in Shared Memory MPs
All shared variables are kept in a common (i.e., global) memory – some additional mechanisms are needed to handle race conditions over shared variables, specifically the following:
A write to the variable from one processor that is attempted concurrently with reads on the same variable by others
Concurrent write attempts to a shared variable from two or more processors
Concurrent read and write attempts on the same variable by several processors
The sequential consistency model of shared memory:
Concurrent read and write attempts are atomic and arbitrarily serialized. Synchronization is required if any specific ordering is needed by the programmer.
All processors see the same serialization order.
This is a simple and intuitive model; it is what we implicitly assume
There is a delay in enforcing the required serialization – a write attempt blocks the writer till the serialization is completed AND till every processor gets to see the completion of this serialization.
477
The Abstraction of Shared Memory in Shared Memory MPs (contd.)
These requirements are captured in the following abstraction:
[Figure: each CPU has a queue of pending memory requests (reads and writes of shared variables); a rotating random selector in front of the memory shared by all – where all variables are stored – services one request at a time]
There is only one copy of each variable and it is kept in the memory shared by all CPUs.
The rotating selector picks a queue at random, picks up the request at the head of that queue and performs the memory operation.
A new request is not selected till the previous request is completed. The selector is the serialization mechanism.
This simple view gets enormously complicated when one or more levels of caches are present in each processor.
478
The Abstraction of Shared Memory in Shared Memory MPs (contd.)
Implementation considerations:
Serialization is complicated due to the use of one or more levels of caching in each processor
– updates to a shared variable have to be propagated to all cached copies to maintain the illusion of a common update order on a single logical copy of the variable.
The serialization overhead can slow down overall progress when variables are heavily shared.
Alternative memory consistency models exist:
Main idea: permit momentary inconsistencies when they do not matter
Reduces performance overhead for maintaining consistency
Programming models are less intuitive
Implementations get to be quite complicated, not very popular
479
The Sequential Consistency Model: Details
Requirements of the sequential consistency model of shared memory:
Single logical copy of each variable, irrespective of caching, bypassing etc.
Memory operations from each processor are issued to the single logical copy in program order, one at a time: a read waits till its data is received, and a write waits till its update is visible to all processors. The second part of this requirement states that memory operations are atomic: a completed memory operation should be visible to all processors before the next memory operation is started.
– This property is essentially a strict ordering requirement on the memory operations issued by a single thread/processor
Ordering of memory operations from different processors interleaved arbitrarily in a globally visible manner (that is, all processors see the same arbitrary order) AND relative ordering of a single processor’s request is also maintained in this global sequence.
Simply put, the atomicity requirement reflects the expected behavior that once a memory operation is completed, its result should be visible to all processors immediately. This atomicity requirement may be violated if notifications of the completion of two memory operations are seen in different orders by two or more processors.
480
The Sequential Consistency Model: Details (contd.)
Example: Assume X, Y, A and B are all initialized to 0s. Consider the following two threads executing on two CPUs:
CPU0: X = 1; Y = 1;
CPU1: A = Y; B = X;
Possible global serialization sequences permitted by sequential consistency can be:
X=1; Y=1; A=Y; B=X;    /* A = 1 and B = 1 */
X=1; A=Y; Y=1; B=X;    /* A = 0 and B = 1 */
X=1; A=Y; B=X; Y=1;    /* A = 0 and B = 1 */
A=Y; B=X; X=1; Y=1;    /* A = 0 and B = 0 */
A=Y; X=1; B=X; Y=1;    /* A = 0 and B = 1 */
A=Y; X=1; Y=1; B=X;    /* A = 0 and B = 1 */
The final values of A and B that are possible are as listed in the com- ments next to each sequence.
Note that A = 1 and B = 0 is not a possible outcome, as this will imply: Y = 1 completed before A = Y and B = X completed before X = 1
(i) Y = 1 completing before A = Y also implies that X = 1 completed before A = Y (program ordering requirement for CPU 0).
(ii) B = X completing before X = 1 also implies A = Y completed before X = 1 (program ordering requirement for CPU 1).
So, (i) and (ii) are in contradiction.
481
The Cache Coherence Problem in Shared Memory MPs
All processors/cores have local caches
The basic requirement: a write to a locally cached variable must be eventually propagated to other copies cached elsewhere and the main memory copy; data inconsistency results if this is not the case
Two main issues need to be addressed:
Propagating the updates correctly to processors that need them
Implementing the correct model of memory consistency
Propagating the Updates – two broad approaches:
Write broadcasting – broadcast the update to all cached copies
– writer blocks till broadcast is successful.
– while updates are being installed, other processors cannot look at their cached copies
Write invalidation – invalidate all other cached copies before the write can be performed locally
– writer blocks till the invalidations are completed
– if other caches need the most recent version of the shared variable, they will experience a cache miss and this will be serviced by supplying the most recent copy to them.
482
The Cache Coherence Problem in Shared Memory MPs (contd.)
Cache invalidations DO NOT replace the line – they simply mark the contents of the line as invalid. Retaining the tag and the cache entry on invalidations saves any overhead on subsequent local accesses to the line. A local access to a line marked as invalid is treated as a cache miss.
In both approaches, the most recent update (in the order implied by the memory consistency model) will have to be eventually propagated to the memory copy.
Generic Implementation:
Cache lines holding shared variables are maintained in different states
States indicate:
– What is the most recent version following the consistency model.
– Whether the locally cached copy is valid or not (invalid means that the line does not reflect the most recent state)
– Other information: whether the line needs to be propagated to the memory if it is selected for replacement; whether this is the only cached copy, etc.
A cache coherency protocol is used to decide on state transitions consistent with the memory consistency model being used: on a read miss or a write – either a hit or a miss, the processor puts out the address of the line on the interconnection along with other information. Other processors see this address and look up their caches. If the same line exists in their caches, appropriate protocol actions are taken – these include possibly changing the state of the locally-cached lines in all processors.
483
An Example: the MESI Cache Coherence Protocol
Usable in SMP systems that support a broadcast network: a bus interconnection or a fast, on-chip interconnection for multi-core designs that supports broadcast
Falls in the general category of snoopy protocols, as cache controllers always look (or “snoop”) at line addresses floated out in the broadcast network to take appropriate protocol actions.
Implements the sequential consistency model
Broadcast network is used for two reasons:
To serialize concurrent read/write attempts.
To learn about the sharing status (i.e., whether other caches have a copy or not) of a cache line.
Protocol name is derived from the possible states of a cache line:
M – modified: present only in one cache and this copy is more
up-to-date than the memory copy – that is, this line is “dirty”.
E – exclusive: only cached copy and memory copy is up-to-date.
S – shared: line is held in one or more other caches as well. Memory copy is up-to-date as well.
I – invalid: cached locally, but contents invalid
484
An Example: the MESI Cache Coherence Protocol (contd.)
Hardware Requirements for implementation on a bus-based SMP:
Snoop logic: shadow tag memory (or dual-ported cache directory access) to look up if a line is locally cached when a line address is floated out on the broadcast network by another processor’s cache controller.
A single additional line on the bus called the “shared” line, that is raised to a high logic level prior to any bus transaction. If a line address is floated out on the bus on a cache miss, processors that have a valid copy of the line pull down this line from a logic one level to a zero. The processor which floated out the line address can thus conclude if the cache line is being shared or not.
Write-through caches at each processor to ensure that the line address is floated out on the network on cache writes (and misses).
Protocol Actions and State Transitions:
Read hit: No state changes or broadcast needed; proceeds without
delay.
Read miss: Put out line address and information indicating read miss on the bus.
– If another cache has a copy in the M state (as indicated by the shared line going low), it supplies the missing line, updates the memory copy, and the state of the line in both caches is set to S.
– If other caches have a copy in the S state, they remain in the S state and only one of these caches (selected through arbitration) supplies the line.
– If no other cache has a copy, the read miss request is satisfied by the memory and it is installed in the state E in the cache that experienced the read miss.
485
An Example: the MESI Cache Coherence Protocol (contd.)
Write hit: Actions depend on the state of the locally cached line
– If the line is in the E state, change the state to M and the write proceeds to update the local copy; no bus actions are needed.
– If the line is in the M state, no state changes or bus actions are needed and the write proceeds without delay on the local copy.
– If the line is in the S state, broadcast the line address along with information that indicates the transaction as a write on the bus. On seeing this, other caches change their block state to I. The state of the local copy is changed to M and the local copy is updated.
Write miss: Put out the line address and information that indicates a write miss on the bus. Protocol actions depend on the state of the line in other caches
– If no other caches have a copy (“shared” line on bus remains high), the rest of the line is read out from memory, the update is made on the local copy and its state is set to M.
– If another cache has a copy of the line in the M state, it updates the memory copy and sets its own state to I. The processor generating the miss gets the line from memory, updates the line and sets the state of the line to M. This ensures that dirty lines are flushed out prior to an update by another processor and improves the resiliency of the system to faults.
– If other caches have a copy in the state S, or another cache has a copy in the state E, one of the caches that have it in the S state or the cache that has it in the E state supplies the copy (to effectively get the rest of the data in the line), and all caches that had a copy change their state to I. The write updates the relevant part of the line and the local copy is set to the state M. Variation: if other caches have the line in S or E, they simply change their state to I and memory supplies the line; the rest of the step is identical. This variation matches how the previous subcase was handled.
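A minimal sketch of these transitions for a single line in one cache, written as next-state functions. This is an illustration only: bus arbitration, data supply and the memory updates are not modeled, and the variation noted in the last subcase is ignored.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* Local read: a hit leaves the state alone; a miss installs the line in
       S if the bus "shared" line was pulled low (some other cache has it),
       and in E otherwise.                                                   */
    mesi_t on_local_read(mesi_t s, bool shared_line_low)
    {
        if (s != INVALID)
            return s;                                /* read hit             */
        return shared_line_low ? SHARED : EXCLUSIVE; /* read miss            */
    }

    /* Local write: every case ends in M.  E -> M silently; S -> M after an
       invalidation broadcast; I -> M after a write-miss bus transaction.    */
    mesi_t on_local_write(mesi_t s)
    {
        (void)s;
        return MODIFIED;
    }

    /* Snooped transaction from another cache for the same line: a remote
       write (hit or miss) invalidates our copy; a remote read miss degrades
       M or E to S (an M copy also supplies the line and updates memory).    */
    mesi_t on_snooped(mesi_t s, bool remote_is_write)
    {
        if (remote_is_write)
            return INVALID;
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    }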
486
An Example: the MESI Cache Coherence Protocol (contd.)
A few things worth noting:
Notice how the memory copies are kept up-to-date as often as possible when a line in the M state is supplied to another cache on a read miss or when a write miss occurs and the only cached copy is in the M state.
On a read miss, the reader is not delayed as the cache-to-cache supply proceeds in parallel with the memory update. This is possible as the data being written out is floated on the bus for the memory update and the processor generating the miss picks it up.
In the case of a write miss, the writer is delayed as a result of the memory update. Question: why can't this memory write be overlapped with the cache-to-cache supply?
The MESI protocols (and slight variants) are implemented in chipsets that support cache coherence for SMPs using the Intel Pentium series of processors as well as SMPs using some of the IBM Power PC implementations.
NEXT PAGE: protocol state transitions in MESI
487
An Example: the MESI Cache Coherence Protocol (contd.)
The MESI Protocol States and Transitions:
[Figure: MESI protocol states (Modified, Exclusive, Shared, Invalid) and transitions – local read hits leave the state unchanged; local writes move a line to M (a write hit in S and any write miss generate bus activity); a read miss installs the line in E (MM supplies, no other copy) or S (another cache supplies); a snooped read miss degrades M or E to S with this cache supplying the line, and a snooped write miss or write hit in another cache forces the local copy to I. A write miss cannot install the line in Exclusive, since the update has not yet been propagated to MM. Transitions marked * also trigger a main-memory (MM) update. The legend distinguishes purely local transactions, local transactions that generate bus activity (with or without transitions elsewhere), and transitions induced by other caches/cores.]
488
An Example: the MESI Cache Coherence Protocol (contd.)
Implementation:
Uses a shared bus to serialize coherence requests.
Each cache has snooping logic (shown shaded in gray below) to monitor the block addresses issued on the shared bus.
Tag fields are extended to hold the state of the cached block. A separate shadow directory (actually a second port to the extended tag array) is used to permit concurrent access to the extended tag by the snooping logic. Only one private cache is shown; the others look similar and are connected identically to the bus
[Figure: a CPU (and upper-level private cache) with its private cache – cache lookup logic over the tag and data arrays – attached to the address and data parts of the shared bus]
The shared bus may be replaced by an interconnection supporting broadcast and serialization.
Intel products and the AMBA interface of ARM implement MESI cache coherence.
489
Details: Snooping Controller
Hardware outlay:
[Figure: snooping controller hardware at one core/node – a snoop latch captures line addresses and transaction info from the broadcast network (usually a bus, with a bus arbiter); a shadow tag part mirrors the normal tag part of the D-cache and feeds the tag match logic/muxes and state update logic (LS = line state array); a memory write queue (tag + data) and a read buffer sit between the data part of the D-cache / CPU datapath and the bus]
Shadow tag array is a copy of the primary tag array. A dual ported tag array can be used as an alternative.
Line address and bus transaction type info is latched into the snoop latch. A matching entry in the cache is looked up and appropriate state updates are made using the snoop update logic.
The state update logic is invoked when: (a) the shadow tag array match logic indicates a match with an address on the bus (from another CPU), or (b) the arbiter wins access to the bus to write local data/transaction info to the bus. Complications are added when two or more levels of local caching are present.
490
MESI: SC Model Compliance
Meeting the requirements of the sequentially-consistent (SC) model of memory:
Maintaining ordering of memory operations from a single processor and thread: using the ROB and LSQ – a memory operation is initiated via the cache only when the LOAD or STORE is at the head of the LSQ and ROB, and the memory op must complete before the next memory operation can be started.
Implementing global ordering and atomicity: using serialization through the bus:
– Before any memory update is made to the locally cached copy, the request is to be floated out on the bus (going through bus arbitration). When the request is visible on the bus, all other processors see it and make appropriate state changes to the line, if it exists, in the cache.
– Read misses also need to be floated out to the bus and retrieve the most-recently updated copy. Thus, the most recently updated copy is visible to any requester.
491
MESI: Performance Improvements
Reducing the impact of strict ordering on performance:
Prefetch a line to be read and install the prefetched cache line in the E state (if MM supplies it) or the S state (if another cache supplies the line).
Prefetch a line to be written and install the prefetched cache line in the M state. (Even if only a part of the line is updated, the rest of the data in the line needs to be fetched.)
Exploit the speculative execution mechanism: Under the strict SC requirement, if a CPU has two LOADs in sequence, say LOAD X followed by LOAD Y (X, Y are memory locations), LOAD X has to complete before the memory read for LOAD Y is started.
– This strict ordering is enforced by requiring the memory operation to be started only when a memory instruction is at the head of the ROB.
This strict ordering delay can be avoided by starting LOAD Y before LOAD X completes or before LOAD Y is at the head of the ROB. The data returned by LOAD Y is marked as speculative.
If Y is invalidated before the LOAD Y instruction is committed (that is, before LOAD Y ends up at the head of the ROB), this implies that the atomicity requirement of LOAD Y was violated. In other words, the speculatively read value for Y was modified by a write operation (from another processor) before the memory operation for the LOAD Y was really supposed to start (that is when the LOAD Y was at the head of the ROB).
– If LOAD Y’s atomicity requirement is violated, the LOAD Y and all following instructions are flushed and execution resumes with the LOAD Y.
492
MESI: Performance Improvements (contd.)
Note that STOREs cannot be issued speculatively, as updates made to memory cannot be undone.
Use a split-phase bus: in a simple bus, the bus remains locked up till the operation on the bus completes. Long coherence activities thus prevent the bus from being used by other requests.
A split-phase bus allows the bus request to be started and any response to be delivered as a separate bus transaction.
Coherence activities on different lines can thus be overlapped with a split-phase bus.
LOAD bypassing earlier stores is not an option to use for performance improvement, as this violates the SC model requirements (see “TSO model of memory consistency” – LATER)
493
Other Cache Coherence Protocols and Implementations
MOESI is another popular cache coherence protocol:
In MOESI, an “Owner” state is added to permit the writeback of a modified block to be delayed.
The cache where the modification took place has the block in the “Owner” state and this block can be supplied directly by the owner to another cache who has a miss on the block.
Replacement of a block in the “Owner” state causes a writeback to update the memory.
Requires a cache-to-cache transfer path.
AMD, Sun and Intel implement the MOESI protocol in their products.
494
Cache Coherency in Distributed Shared Memory Multiprocessors
Uses a centralized or distributed directory scheme and message passing with acknowledgements to implement cache coherence. A distributed directory based scheme is shown below:
[Figure: a DSM multiprocessor – each node pairs a processor (P) and its cache (C) with a directory (D) and a memory module (M); the nodes are connected by an interconnection network]
Directory-Based Cache Coherency Protocol for a DSM Multiprocessor:
In a DSM, the memory module associated with a processor contributes lines to the shared address space.
Each memory module maintains a local directory. For each line accessed from the memory module, this directory lists an entry for the line that records the state of the line and the ids of the processors that have cached the line (a presence bit vector); a sketch of one possible entry format appears after this list.
Requests to access a memory line by any core requires serialization through the directory associated with the memory module that contains the line (the “home” of the line).
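One way to picture a directory entry and the home-node handling of a write request to a shared line (a sketch under assumed parameters: at most 64 nodes, with made-up names – the actual entry format and message primitives are machine-specific):

    #include <stdint.h>

    enum dir_state { UNCACHED, SHARED_STATE, MODIFIED_STATE };

    /* Sketch: one directory entry per line at the line's home node.        */
    struct dir_entry {
        enum dir_state state;     /* state of the line                      */
        uint64_t       presence;  /* bit i set => node i has a cached copy  */
    };

    /* Home-node handling of a write request from node 'rn' when the line is
       currently shared: invalidate every other sharer, wait for their acks,
       then record rn as the sole (modified) holder.                        */
    void on_write_request_shared(struct dir_entry *e, int rn)
    {
        for (int node = 0; node < 64; node++) {
            if ((e->presence & (1ull << node)) && node != rn) {
                /* send an invalidation message to 'node' and wait for its
                   acknowledgement – the messaging itself is not shown      */
            }
        }
        e->presence = 1ull << rn;     /* only the requester holds the line  */
        e->state    = MODIFIED_STATE;
    }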
495
Cache Coherency in DSM Multiprocessors (contd.)
In a directory-based cache coherence protocol supported on a static network, messages convey requested protocol actions: read_miss, write_miss, invalidate (or write_broadcast), data etc. The term “node” is used to refer to a CPU-private cache-memory module combination.
The messages specify the block address and all messages need to be acknowledged by the proper directory/node.
A write request to a line requires the following steps:
– Requesting node, RN, sends a write request to the home node (HN)
– HN looks up its directory entry for the line and does the following:
If the state of the line is listed as shared:
– HN sends invalidation messages to nodes that have cached the line and waits for all such nodes to send an acknowledgement message back (to indicate that they have marked their locally cached copies as invalid)
– HN then supplies the missing line, marking it as invalid in its cache (if present), notes its state as modified and records the requesting node as the only node containing the updated line in the presence bit vector.
– On receiving the supplied line, the RN marks its state as modified in its cache and proceeds with the update.
If the state of the line is modified:
– HN requests the node (say, XN) that has the line in the modified state to supply the line to the requester and updates its presence bit vector appropriately.
– Till HN receives an acknowledgement from the RN following RN’s receipt of the line, all requests to access that line are queued up.
– On receiving the supplied line, the RN marks its state as modified in its cache and proceeds with the update.
496
Cache Coherency in DSM Multiprocessors (contd.)
A read request proceeds analogously. If the directory state at the HN lists the line as shared, a node close to the requester may be notified via a message from the owner to supply the line, etc.
As described above, the three possible states of a cached line are: SHARED, INVALID and MODIFIED (and MODIFIED is also the same as exclusive).
Things to note:
Directories require extra storage.
Several messages needed to take care of a single request. The latency for implementing coherence can be considerable.
Directory-based cache coherence protocols generally have a higher overhead in terms of performance (and implementation/verification cost). However, they can be scaled up to accommodate more nodes, as a bus can only handle a few nodes.
Scales better than MESI, as degree of sharing is typically small.
Ideally, each directory must have an entry for each line in the associated memory module. However, fewer entries can be used if directory entries are replaced. On replacing a directory entry, appropriate actions need to be taken. (Think of what these ought to be!)
497
The TSO Model of Memory Consistency
Total Store Order (TSO): permits later LOADs to bypass earlier STOREs for improving performance. Main features:
Strict ordering among writes from a single CPU maintained.
Reads from a processor can take place out of program order: LOAD can bypass earlier STOREs when the address targeted by the LOAD does not match the addresses targeted by the earlier STOREs. This removes the strict ordering requirement that a STORE occurring earlier in program order must complete before a following LOAD can be started.
Writes are implemented atomically: till a write is seen by all processors, reads to the same address cannot return any value.
Example: Assume A,B, X and Y are all initialized to zero before the following concurrent codes execute:
CPU 0: X = 1; LOAD Y;
CPU 1: Y = 1; LOAD X;
With the strict sequential consistency model, the two LOADs cannot both return 0s, as the STORE in each CPU completes before the LOAD that follows it. With TSO, the LOAD in each CPU can complete before the STOREs and thus both LOADs can return 0s.
Programmer’s job can get complicated to exploit the semantics of TSO!
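A C11 sketch of the example above (variable and function names are illustrative). On a TSO machine the plain version can end with both loaded values being 0; placing a full fence between each CPU's STORE and LOAD – the role MFENCE plays on x86 – rules that outcome out:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int X, Y;                  /* shared, both initialized to 0      */
    int r0, r1;                       /* values returned by the two LOADs   */

    void *cpu0(void *arg)
    {
        atomic_store_explicit(&X, 1, memory_order_relaxed);  /* X = 1       */
        atomic_thread_fence(memory_order_seq_cst);           /* the fence   */
        r0 = atomic_load_explicit(&Y, memory_order_relaxed); /* LOAD Y      */
        return NULL;
    }

    void *cpu1(void *arg)
    {
        atomic_store_explicit(&Y, 1, memory_order_relaxed);  /* Y = 1       */
        atomic_thread_fence(memory_order_seq_cst);
        r1 = atomic_load_explicit(&X, memory_order_relaxed); /* LOAD X      */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* with the fences, r0 == 0 && r1 == 0 cannot be observed           */
        printf("LOAD Y returned %d, LOAD X returned %d\n", r0, r1);
        return 0;
    }

Compile with -pthread; without the two fences, the relaxed version is free to exhibit the TSO reordering described above.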
498
The Release Consistency Model of Memory Consistency
Main idea: relax ordering of reads and writes within a thread in code outside critical sections.
Uses two primitives: Acquire (=get lock, treated like a read) and Release (= free lock, treated like a write). These primitives can be coded using existing instructions, for example:
Acquire can be implemented as: while (lock); /* loop as long as ”lock” is set to 1 */
Release can be implemented as: lock = 0;
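A sketch of how the two primitives map onto C11 atomics (names are illustrative; the acquire below also grabs the lock with an atomic exchange rather than just spinning on a read):

    #include <stdatomic.h>

    atomic_int lockvar;         /* 0 = free, 1 = held                        */

    /* Acquire: later reads/writes cannot be moved above this point.         */
    void acquire(void)
    {
        while (atomic_exchange_explicit(&lockvar, 1, memory_order_acquire))
            ;                   /* spin while the lock is held               */
    }

    /* Release: all earlier reads/writes must complete before the lock is
       freed – this plays the role of the fence before the Release above.    */
    void release(void)
    {
        atomic_store_explicit(&lockvar, 0, memory_order_release);
    }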
Requirements/attributes of RC model:
Outside critical sections, reorder reads and writes to get performance advantage without violating control flow and data dependencies.
All previous reads and writes must complete before a release can complete. This can be implemented using a memory fence instruction (MFENCE) inserted before the Release.
An Acquire has to complete before subsequent read and write operations can complete.
Acquire and Release operations must follow strict sequential ordering requirement.
Again, programmers will have to figure out how to use this model and how to use it correctly!
Current consensus: stick to the sequential consistency model at the programming level.
499
Synchronization in Multithreaded/Multicore Systems: Mutexes
Used for updating shared variables consistently – the part of the code that must be executed atomically by one thread at a time is called a critical section.
Main issues to consider:
Nature of indivisible primitives used:
– all-software vs. hardware-supported
Waiting on locks:
– spin-wait (busy waiting) vs. sleep-wait
Implications of caching:
– coherent caches vs. caches with no coherence support
Fairness issues:
– starvation-free vs. fairness
– Generally difficult to have perfect solutions for all of these considerations. Good practical solutions invariably involve some compromises across the board.
We will consider the lock and unlock functions at this point. These are usually part of a thread programming library in some form or other.
500
Implementing Mutex Locks
All-software implementation – e.g., Lamport’s bakery algorithm
Very slow, difficult to scale to many threads/cores
Simpler software implementation on uniprocessors: shut off interrupts when executing critical sections:
– Does not work on SMPs/multicores
– Will also not work on hyperthreaded designs where thread switching is controlled in hardware
Most common hardware implementation of indivisible (= atomic) locking: use read-modify-write memory cycles:
Old contents of a memory location read out and updated in one indivisible memory (r-m-w) cycle: indivisibility guaranteed by the memory controller
Example: the test-and-set (T&S) instruction:
– reg = T&S(X), X is a location containing a Boolean value
– semantics: [[reg = X; X = 1]]
– [[..]] = sequence implemented atomically
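With GCC/Clang the same r-m-w semantics is available without writing assembly, via an atomic builtin (a sketch; the parameter name is illustrative):

    #include <stdbool.h>

    /* reg = T&S(X): atomically fetch the old value of the lock word and set
       it to 1.  A return value of true means the lock was already held.    */
    static inline bool test_and_set(volatile char *loc)
    {
        return __atomic_test_and_set(loc, __ATOMIC_SEQ_CST);
    }

    /* The matching unlock simply clears the word atomically.               */
    static inline void clear(volatile char *loc)
    {
        __atomic_clear(loc, __ATOMIC_SEQ_CST);
    }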
501
Implementing Mutex Locks (contd.)
Example: Compare-and-swap (C&S) instruction:
– reg = C&S(address, old_value, new_value)
– semantics:
[[ if (*address == old_value) {*address = new_value; reg = true}
else reg = false ]]
– CMPXCHG in the X86 ISA and CMPSWP in the PPC ISA are similar to this.
Implementing mutual exclusion:
Naive implementation:
lock(X);
code for critical section protected by X
unlock(X);
lock (X) can be implemented using T&S as: while (T&S(X)) do; /* spin wait */
that is:
L: reg = T&S(X); if (reg == 1) goto L;
502
Implementing Mutex Locks (contd.)
unlock(X) is implemented as X=0; /* no atomic operation needed */
With C&S, the implementation of “lock (X)” simply uses the fact that: reg = T&S(X) is equivalent to reg = C&S(X, 0, 1)
Problems with the naive implementation:
Memory traffic caused by spin waiting:
– T&S done needlessly in a loop: ties up access to X because of repeated r-m-w cycles caused by T&S, possibly including those by the thread doing the unlock(X)
– Can create hot spots in large scale SMP systems – if lock access requires a trip through a network that can have congestions (such as bus, switched multistage network), the network traffic induced by spinning can delay traffic that is not related to the lock or to the critical section protected by the lock – this causes overall throughput to drop and is called the hot spot contention problem.
– Solution: check if lock is free before actually trying a lock(X):
L: if (X == 1) goto L;  /* spin if somebody else has the lock */
   reg = T&S(X);
   if (reg == 1) goto L;
– This is a little more efficient, but there are repeated read cycles wasted as X is read in a loop before the T&S is executed.
503
Implementing Mutex Locks (contd.)
– Better solution: wait before you retry the T&S
L: reg = T&S(X);
   if (reg == 1) { wait(delay); goto L; }
– The choice of the delay is critical: too small a delay may not reduce the spin traffic appreciably; too long a delay may reduce the impact of spinning but can delay the thread considerably and unnecessarily when very few threads contend for the lock
– Solution: use a dynamically adjusted delay that senses the degree of contention and sets the individual delays appropriately: very difficult to implement simply
– A practical way of adjusting the delay dynamically is to increase the delay progressively on each successive failed attempt to grab the lock (as in the classical Ethernet's backoff, doubling the delay between successive retries on each failed attempt) – this is the so-called backoff lock (a sketch follows below).
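A sketch combining the "check before T&S" idea with the exponential backoff just described (the delay constants are arbitrary; __atomic_load_n, __atomic_test_and_set and __atomic_clear are GCC/Clang builtins):

    static volatile char X;                     /* the lock word; 0 = free   */

    static void spin_delay(unsigned iters)      /* crude busy-wait delay     */
    {
        for (volatile unsigned i = 0; i < iters; i++)
            ;
    }

    void backoff_lock(void)
    {
        unsigned delay = 64;                    /* initial delay (arbitrary) */
        for (;;) {
            while (__atomic_load_n(&X, __ATOMIC_RELAXED))
                ;                               /* read-only spin: stays in the cache */
            if (!__atomic_test_and_set(&X, __ATOMIC_ACQUIRE))
                return;                         /* got the lock              */
            spin_delay(delay);                  /* lost the race: back off   */
            if (delay < 64 * 1024)
                delay *= 2;                     /* double it, Ethernet-style */
        }
    }

    void backoff_unlock(void)
    {
        __atomic_clear(&X, __ATOMIC_RELEASE);
    }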
To spin or sleep:
Spin-wait is not a viable choice for uniprocessors that do not support
multithreading
Sleep waiting may be an overkill for multithreaded/multicore systems, as context switching overhead is significant.
504
Locking and Caches
Multicores with private caches and no coherence (current generation of multicores)
Private copies not kept coherent, so update to X using a T&S(X) in one core is not seen in the cached copy in the other core: the locking protocol fails here
Solution: do not bring locks into caches – disable caching: the instruction will do this automatically or locks can be kept in a page marked as non-cacheable (like IO buffer pages)
Can we use the cache to read-in X to avoid unnecessary T&S?
Systems with private, coherent caches:
No need to disable caching, but repeated T&S in cores can cause “ping-ponging” if cache coherence protocol uses write invalidations (later)
Ping-ponging: the only valid copy of X moves back and forth across the various caches as they do a T&S
Can use reads (and caching) to first check if the lock is free before trying a T&S (as in earlier solution) to avoid unnecessary T&S and thus reduce ping-ponging
For small critical sections protected by locks, it may be better to prevent locks from being cached (as coherence overhead can be substantial)
505
The Issue of Fairness in Mutual Exclusion
A locking protocol is unfair if a thread attempting to enter the critical section is delayed indefinitely (this is called starvation) or delayed for an unduly long time compared to other threads.
Sleep waiting mechanisms can implement fairness by using appropriate scheduling changes
To implement a similar solution with spin waiting, we have to:
Build in a fair scheduling solution as part of the lock-related functions
Make considerations for spin traffic and caches
Here’s a solution: assume that there are N threads (and N cores).
Assume caches are present and a cache coherence protocol to be in place. A coherence protocol is NOT a requirement, but helps!
Use a thread/core-local bit to spin on – assume an array SPIN[N] to be used – each element is cache line sized (use padding if necessary!). This array is indexed by a unique thread id, tid
Use a global queue, GLQ, to implement fair scheduling on the lock. Updates to this queue are protected using a single lock variable, qlock.
qlock is not cacheable if coherence protocols are not used Assume appropriate initializations for the various structs.
506
The Issue of Fairness in Mutual Exclusion (contd.)
lock (X) does the following:
S:  reg1 = T&S(X);
    if (reg1 == false)
        enter CS                          /* lock acquired directly */
    else {
        reg2 = T&S(qlock);                /* can have contention here */
        if (reg2 == true) go to S;
        enqueue this thread's tid (my_tid) on the GLQ;
        qlock = false;
        SPIN[my_tid] = true;
L:      if (SPIN[my_tid]) go to L;        /* local spinning */
        /* unlock() clears SPIN[my_tid] to hand the lock over: enter CS */
    } /* else */
unlock(X) does this:
S2: reg3 = T&S(qlock);                    /* can have contention here */
    if (reg3 == true) go to S2;
    /* now check and update GLQ indivisibly */
    if (GLQ is empty)                     /* need to check this in the CS! */
        X = 0;                            /* clear lock */
    else {
        next_tid = dequeue (GLQ);
        SPIN[next_tid] = false;           /* no need to protect using a lock */
    } /* else */
    qlock = false;
There IS contention over checking and updating the GLQ, but this is minimal; backoff locks can be used if this contention is serious enough.
507