程序代写代做代考 computer architecture algorithm A 64 Kbytes ISL-TAGE branch predictor∗

A 64 Kbytes ISL-TAGE branch predictor∗

André Seznec
INRIA/IRISA

May 20, 2011

1 Outline

The ISL-TAGE predictor consists in a TAGE pre-
dictor combined with a loop predictor (to predict
loops) , a Statistical Corrector predictor and an Im-
mediate Update Mimicker, IUM.

A TAGE predictor [4] constitutes the core of the
ISL-TAGE predictor.

The TAGE predictor (Section 2) captures most of
the correlation on the branch outcomes for very long
histories. But sometimes TAGE fails to predict loops
with constant number of iterations. The loop predic-
tor (Section 3 ) is used to predict these loops. TAGE
also fails to predict branches that are not strongly
correlated, but only statistically biased. The Statis-
tical Corrector predictor (Section 4) is in in charge of
tracking these statistically correlated branches that
are not correctly predicted by the TAGE predictor.
The TAGE predictor must be updated at retire time
to avoid pollution by the wrong path; this delayed up-
date induces extra mispredictions compared with an
optimistic at fetch time update. The Immediate Up-
date Mimicker (Section 5) aims at limiting these ex-
tra mispredictions through predicting branches with
inflight non-retired occurrences.

The parameters of the submitted ISL-TAGE pre-
dictor are summarized in Section 6 .

2 The TAGE conditional branch pre-
dictor

The TAGE predictor was described in [3] and [4].
Only marginal modification sare introduced here, es-
sentially associated with the huge storage budget al-
lowed for the contest.

Figure 1 illustrates a TAGE predictor. The TAGE
predictor features a base predictor T0 in charge of
providing a basic prediction and a set of (partially)
tagged predictor components Ti. These tagged pre-
dictor components Ti, 1 ≤ i ≤ M are indexed using

∗This work was partially supported by the European Re-
search Council Advanced Grant DAL

=?
=?
=?

predic*on

Xbar

h[0:L1]
pc
pc
h[0:L2]
pc
h[0:L3]

h[0,L1]

Figure 1: A 4-component TAGE predictor synopsis:
a base predictor is backed with several tagged pre-
dictor components indexed with increasing history
lengths

different history lengths that form a geometric series
[1], i.e, L(i) = (int)(αi−1 ∗ L(1) + 0.5). Through-
out this paper, the base predictor will be a simple
PC-indexed 2-bit counter bimodal table; in order to
save storage space, the hysteresis bit is shared among
several counters as in [2].

An entry in a tagged component consists in a
signed counter ctr which sign provides the predic-
tion, a (partial) tag and an unsigned useful counter
u. Throughout this paper, u is a 1-bit counter and
ctr is a 3-bit counter.

Sharing storage space between predictor ta-
bles in the TAGE predictor The TAGE pre-
dictor features independent logic tables. For some
applications, some of the tables are underutilized
while some others are under larger pressure. Shar-
ing the storage space among several tables can be
implemented without requiring to a real multiported
memory table, but a bank interleaved tables as illus-

trated in Figure 1. Experiments showed that global
interleaving of all tables is not the best solution, but
that interleaving between for a few adjacent history
lengths can be slightly beneficial (reduction of about
1% of the misprediction number on the distributed
benchmark sets).

A few definitions and notations The provider
component is the matching component with the
longest history. The alternate prediction altpred is
the prediction that would have occurred if there had
been a miss on the provider component.

If there is no hitting component then altpred is the
default prediction.

2.1 Prediction computation

The prediction selection algorithm is exactly the
same in [3] or [4].

At prediction time, the base predictor and the
tagged components are accessed simultaneously. The
prediction is provided by the hitting tagged predictor
component that uses the longest history. In case of
no matching tagged predictor component, the default
prediction is used.

However we remarked that when the confidence
counter of the matching entry is null, on some traces,
the alternate prediction altpred is sometimes more
accurate than the ”normal” prediction. This prop-
erty is global to the application and can be dy-
namically monitored through a single 4-bit counter
( USE ALT ON NA in the simulator).

Prediction computation summary Therefore
the prediction computation algorithm is as follows:

1. Find the longest matching component and the
alternate component

2. if (the confidence counter is not weak or
USE ALT ON NA is negative) then the provider
component provides the prediction else the pre-
diction is provided by the alternate component

2.2 Updating the TAGE predictor

Update on a correct prediction by the longest
marching entry The prediction counter of the
provider component is updated.

Update on a misprediction by the longest
matching entry First we update the provider
component prediction counter.

As a second step, if the provider component Ti
is not the component using the longest history (i.e.,
i < M) , we try to allocate entries on predictor components Tk using a longer history than Ti (i.e., i < k < M). It should be noticed that, since, the predictor size storage budget allocated for the competition is huge we allocate up to 4 of these entries. For smaller pre- dictors, one would try to limit the footprint of the application on the predictor by allocating a single predictor entry. The M-i-1 uj useful bits are read from predictor components Tj, i < j < M . The al- location algorithm chose up to four entries for which the useful bits are null, moreover we guarantee that the entries are not allocated in consecutive tables. When an entry is allocated, its prediction counter is set in weak mode and its useful bit is set to null. Updating the useful bit u The useful bit u of the provider component is set whenever the actual pre- diction is correct and the alternate prediction altpred is incorrect. In order to avoid the useful bits to stay forever set, we implement the following reset policy. On al- location of new entries, we dynamically monitor the number of successes and fails when trying to allo- cate new entries after a misprediction; this monitor- ing is performed through a single 8-bit counter (u=1, increment, u=0 decrement). This counter (variable TICK in the simulator) saturates when more failures than successes are encountered on allocations. At that time we reset all the u bits of the predictor. Typically, such a global reset occurs when one of two entries on the used portion of the predictor has been set to useful. This simple policy was found to be more efficient than the previously proposed management using 2-bit useful counters for the TAGE predictor [4][3]. 3 The loop predictor component The loop predictor simply tries to identify regular loops with constant number of iterations. The implemented loop predictor provides the global prediction when a loop has successively been executed 7 times with the same number of iterations. The loop predictor used in the submission features only 64 entries and is 4-way skewed associative. Each entry consists of a past iteration count on 10 bits, a speculative and a retire iteration count on 10 bits each , a partial tag on 10 bits, a confidence counter on 3 bits, an age counter on 3 bits and 1 direction bit i.e. 47 bits per entry. The loop predictor storage is only 3,008 bits i.e. 376 bytes. Replacement policy is based on the age. An entry can be replaced only if its age counter is null. On allocation, age is first set to 7. Age is decremented 2 whenever the entry was a possible replacement target and incremented when the entry is used and has pro- vided a valid prediction. Age is reset to zero whenever the branch is determined as not being a regular loop. 4 The Statistical Corrector Predictor The TAGE predictor is very efficient at predict- ing very correlated branches even if the correlation is with very remote branches, e.g. on a 1000 bits branch history. However TAGE fails at predicting statistically biased branches e.g. branches that have only a very small bias towards a direction, but are not correlated to the history path. On some of these branches, the TAGE predictor performs worse than a simple PC-indexed table of wide counters. In order to better predict this class of statistically biased branches, we introduce the Statistical Correc- tor predictor . The correction aims at detecting the unlikely predictions and to revert them: the predic- tion and the (address, history) pair is presented to Statistical Corrector predictor which decides whether or not inverting the prediction. Since in most cases the prediction provided by the TAGE predictor is cor- rect, the Statistical Corrector predictor can be quite small. In the submission, the Statistical Corrector predic- tor is derived from the GEHL predictor [1]. It fea- tures 5 logical tables indexed with the same history lengths as the main TAGE predictor (0, 3,8,12,17) and the prediction (taken/not taken) flowing out from the TAGE predictor. The logical tables are sharing a single interleaved 4K 6-bit entries, i.e., 3Kbytes. The prediction is computed as the sign of the sum of the (centered) predictions read on the Sta- tistical Corrector table plus the (centered) output of the hitting bank in TAGE (x8), thus taking into ac- count the confidence in the TAGE prediction. The TAGE prediction is reverted if the Statistical Cor- rector predictor disagrees and the absolute value of the sum is above a dynamic threshold. The dynamic threshold is adjusted at run-time in order to ensure that the use Statistical Corrector predictor is benefi- cial. The technique is similar to the technique used for dynamically adapting the update threshold of the GEHL predictor. 5 The Immediate Update Mimicker On a real hardware processor, the predictor tables are updated at retire time to avoid pollution of the predictor by the wrong path. A single predictor table entry may provide several mispredictions in a row due to this late update. In order to reduce this impact, we implement an add-on to TAGE, the IUM, immediate update mimicker. However on a misprediction the history can be re- paired immediately and when a block is fetched on the correct path, the still speculative branch history is correct. In practice, repairing the global history is straightforward if one uses a circular buffer to imple- ment the global history. We leverage the same idea with IUM predictor. When fetching a conditional branch, IUM memorizes the identity of the entry E in the TAGE predictor (number of the table and its index) that provides the prediction as well as the pre- dicted direction. At branch resolution on a mispre- diction, the IUM is repaired through reinitializing its head pointer to the associated IUM entry and updat- ing this entry with the correct direction. When fetching on the correct path, the associated IUM entry associated with an inflight branch B fea- tures the matching predictor entry E that provided the TAGE prediction and the effective outcome of branch B (corrected in case of a misprediction on B). In case of a new hit on entry E in the predictor be- fore the retirement of branch B, the (TAGE predictor + IUM) can respond with the direction provided by the IUM rather than with the TAGE prediction (on which entry E has not been updated). IUM can be implemented in hardware through a fully-associative table, It allows to recover about 3/ 4th of the mispredictions due to late update of the TAGE predictor tables. The storage cost of the spec- ulative predictor is only 64*20 bits, i.e. 160 bytes, plus the pointers to determine the head ( position of the last fetch branch) and the tail ( position of the next to be retired branch) of the IUM. Important remark : In the submission, the TAGE predictor, the Statistical Predictor and the loop predictor are re-read at retire stage to recom- pute the predictions, but also to get parameters such as the provider components etc before update. On a real hardware predictor, one would like to avoid this re-read and recomputation. The IUM could be aug- mented with a few fields (counter values, u bit, ..) to transmit these informations from fetch stage to re- tire stage, thus saving an extra read on the predictor tables. 6 The submitted ISL-TAGE predictor 6.1 The TAGE predictor component For a 512 Kbits predictor, the best accuracy we found is achieved by a 16 components TAGE predic- tor, i.e. 15 tagged components and a base bimodal 3 predictor. Hysteresis bits are shared on the base pre- dictor. Each entry in predictor table Ti features a Wi bits wide tag, a 3-bit prediction counter and a 1-bit useful counter. 6.1.1 Tag width tradeoff Using a large tag width leads to waste part of the storage while using a too small tag width leads to false tag match detections. Experiments showed that one can use narrower tags on the tables with smaller history lengths. 6.1.2 The global history vector The usual combination of the global branch history vector and a short path history (limited to 1 bit per branch) was found to be slightly outperformed by a global history vector introducing much more infor- mation for the indirect branches as well as for the calls. 4 bits (resp. 5 bits) mixing target and program counter are introduced for the indirect branches and the calls. 6.2 TAGE parameters Since this is competition we have run a guided search for the best set of history lengths. We first determined the number and respective sizes of tables in the predictor through exploring the design space using series of geometric history lengths. Then we grouped the tables to be interleaved. This leads to a configuration with 4 groups of tagged tables with increasing tag width, respectively T1-T2, T3-T7, T8- T13 and T14-T15. The set of possible history lengths was then explored with a geometric series with min- imum history length 8 and maximum history length 2000 as a starting point. The set of history lengths in the submitted predictor is {0, 3, 8, 12, 17, 33, 35, 67, 97, 138, 195, 330, 517, 1193, 1741, 1930}. The characteristics of the TAGE component are summarized in Table 1. The base bimodal predictor features 32K prediction entries sharing 8K hysteresis entries. The TAGE predictor features a total of 482 Kbits of prediction storage. Base T1-2 T3-7 T8-13 T14-15 Kentries 32+8 4 16 8 1 Tag bits 8 11 13 14 Kbits 40 48 240 136 18 Table 1: Characteristics of the TAGE predictor com- ponents 6.2.1 Total predictor storage budget The total storage budget for the submitted ISL- TAGE predictor represents 61,696 bytes for TAGE predictor, 3,072 bytes for the Statistical Corrector predictor, 376 bytes for the loop predictor and 160 bytes for the IUM predictor, i.e a total of 65,304 bytes. Apart the prediction tables storage, the ISL-TAGE predictor uses 1) a circular buffer to store the history, for convenience we use a a 4Kbits circular buffer, 2) two 12-bit pointers on the circular buffer, one for fetch history, the second for retire history 3) two 16 bits path history vectors 4) a 7-bit counter to determine the usefulness 5) a 8-bit counter to control the u bits reset 6) a 32-bit random number 7) two 6-bit point- ers on the IUM 8) two 6-bit counters to manage the threshold on the statistical corrector, i.e. a total of 4,223 extra storage bits for control logic. In total, the ISL predictor uses 65,833bytes of storage. 7 Options of the submitted ISL- TAGE predictor Through commenting some #define lines, one can run different versions of the submitted predictor. #define IUM enables the IUM predictor. #define INITHISTLENGTH enables the use of the best found set of history length. #define SHARINGTABLES enables the sharing of storage tables among the logic tables of the TAGE predictor. #define LOOPPRE- DICTOR enables the loop predictor. #define STAT- COR enables the Statistical Corrector predictor. References [1] A. Seznec. Analysis of the o-gehl branch predic- tor. In Proceedings of the 32nd Annual Inter- national Symposium on Computer Architecture, june 2005. [2] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeidès. Design tradeoffs for the ev8 branch predictor. In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002. [3] André Seznec. The l-tage branch predic- tor. Journal of Instruction Level Parallelism (http://wwwjilp.org/vol9), April 2007. [4] André Seznec and Pierre Michaud. A case for (partially)-tagged geometric history length pre- dictors. Journal of Instruction Level Parallelism (http://www.jilp.org/vol8), April 2006. 4

Related Posts