Generic Datapath Alternatives of Advanced Superscalar Processors

Dezső Sima
Kandó Polytechnic, Institute of Informatics, Budapest, Hungary
sima@alpha1.obudakandohu

Abstract: In the quest for higher performance, virtually all recent models of significant superscalar lines have introduced both shelving and register renaming. The implementation of these advanced features has had major implications for the datapath of the processors. As a consequence, the datapaths of recent superscalars differ substantially from those used in previous models. Although the internal details of most new microarchitectures have been disclosed, the design space of their datapath has not yet been described. In this paper we contribute to this area by identifying generic alternatives of the kernel of the datapath in advanced superscalar processors. Our work is based on the exploration of the design spaces of both shelving and register renaming, since the manner of their implementation is decisive for the layout of the kernel of the datapath. From the design spaces mentioned, we first point out those aspects which are relevant for the generic alternatives of the kernel of the datapath and review the related design choices. Then, from the feasible combinations of the indicated design choices, we derive and present 24 generic datapath alternatives. In this framework we also show which generic alternatives have been chosen in recent superscalars.

Key Words: Datapath, Microarchitecture, Superscalar Processors, Shelving, Register Renaming

1 Introduction

In a superscalar processor dependencies between subsequent instructions are clear obstacles to parallel execution and limit performance. Thus, to achieve higher performance, superscalars must aggressively counteract control and data dependencies [Rau and Fisher 1993, Smith and Sohi 1995]. When no suitable countermeasures are taken, control dependencies, caused by conditional control transfer instructions, segment general purpose code into
short sequences of about 4 to 6 instructions, each of which the processor can exploit separately for parallel execution [Stephens 1991, Yeh and Patt 1992]. This fact and the existing data dependencies between subsequent instructions limit the speed-up potential of a straightforward ILP-processor to an astonishingly low figure of about two [Riseman and Foster 1972, Jouppi 1989, Lam and Wilson 1992].

Over the years a number of techniques have been introduced to address these issues. To cope with control dependencies superscalars employ speculative branch processing [Smith 1981]. In this technique the processor makes a guess about the outcome of each conditional control transfer instruction and continues execution along that path. If the guess turns out to be false, the processor cancels all speculatively executed instructions and resumes execution along the right path. This technique originated in the beginning of the 1980s to avoid bubbles in pipelined processors and came into widespread use with advanced pipelined processors, such as the i80486, MicroSparc, R4000 or the MC 68040, around 1990 [Sima 1997].

To tackle data dependencies two major techniques were conceived: shelving [Thornton 1970, Tomasulo 1967] and register renaming [Tjaden and Flynn 1970, Keller 1975]. Shelving is a hardware technique which eliminates issue blockages due to all three kinds of register data dependencies in straight-line code, that is, due to RAW, WAR and WAW dependencies. Shelving rests on two concepts: (a) it decouples dependency checking from instruction issue, and (b) it significantly widens the window that is scanned for independent instructions. Register renaming is also a hardware technique; it is used to remove false data dependencies, that is, the WAR and WAW dependencies occurring between instructions of a straight-line code segment. The principle of register renaming is straightforward. If a WAR- or WAW-dependent instruction is encountered, its destination register is
temporarily substituted by a dynamically allocated rename buffer. For an overview of the techniques mentioned, see e.g. [Sima, Fountain and Kacsuk 1997].

In Fig. 1 we outline the introduction of shelving and register renaming in superscalar processors by reviewing the evolution of instruction issue policies in superscalars. In the figure we indicate the first year of volume shipment of the processors.

Figure 1: Evolution of instruction issue policies of superscalar processors

The first wave of superscalars employed the straightforward superscalar issue policy. This policy addresses control dependencies with speculative branch processing but provides neither shelving nor renaming. In this case data dependencies cause issue blockages. This gives rise to an issue bottleneck which severely impedes processor performance. Despite its limitations the straightforward superscalar issue policy is used in virtually all first models of superscalar lines, such as in the Pentium, PA 7100, PowerPC 601, the Alpha 21064 or the SuperSparc. A few
subsequent members of these families also retained this simple issue scheme, including the PA 7200, Alpha 21164 and the UltraSparc. A few processors enhanced the straightforward issue policy either by shelving alone or by renaming alone, as indicated in Fig. 1. But the decisive step in the evolution of issue policies came when, in addition to speculative branch processing, processors introduced both shelving and register renaming. The resulting advanced superscalar issue policy benefits from the synergy of the following three features: (a) Shelving relieves instruction issue from dependency checking. This eliminates the issue bottleneck of the straightforward issue policy. (b) Register renaming removes both WAR and WAW dependencies. This increases the average number of executable instructions per clock cycle in the instruction window. (c) Shelving considerably widens the instruction window. This reduces the performance degradation caused by RAW dependencies and further increases the average number of executable instructions per clock cycle in the window. As a consequence, the advanced superscalar issue policy has a considerably higher performance potential than the straightforward issue policy. For the reasons mentioned most recent superscalar processors turned to the advanced superscalar issue scheme, including the R10000, PentiumPro, the PowerPC 603 - PowerPC 620, the Alpha 21264 and a number of other recent processors. This recent stage of evolution allows us, in our discussion of generic datapath alternatives, to take it for granted that processors make use of both shelving and register renaming.

In our paper we restrict ourselves to the kernel of the datapath, that is, to the essential part of the datapath which is needed for the processing of fixed-point and floating-point data, excluding input and output. Thus, we omit all additional parts which are required for further tasks, such as fetching and decoding of instructions, fetching of data from the cache or executing load/store and control transfer instructions. Furthermore, in this paper we focus on generic datapath alternatives of advanced superscalar processors. We understand alternatives to be generic if they differ from each other in a qualitative respect, that is, in their structure. While considering generic alternatives we do not care about quantitative aspects of the datapath, such as the width of the buses, the capacity of the register files or buffers employed, or even the kind and number of the execution units available. Under the assumptions made, the generic alternatives of the kernel of the datapath may be derived from the qualitative aspects of the design spaces of shelving and renaming. Thus, to obtain the generic datapath alternatives we first identify and review the relevant qualitative aspects of the design spaces of shelving and renaming, and then we determine their feasible combinations, from which we obtain the generic datapath alternatives.
2 Basic alternatives of shelving

2.1 Overview

From the design space of shelving [Sima, Fountain and Kacsuk 1997, Sima 1998] there are two qualitative aspects which are decisive for the layout of the datapath: the type of shelving buffers and the operand fetch policy used. The type of shelving buffer specifies the infrastructural background of shelving, whereas the operand fetch policy decides whether operands are fetched during instruction issue or, in contrast, during instruction dispatch.

2.2 Types of shelving buffers

Shelving buffers may be implemented either as reservation stations (RS) or as combined shelving buffers (see Fig. 2). Reservation stations are used exclusively for shelving. In contrast, combined buffers are also employed for reordering and, in some cases, for renaming as well, as shown in Fig. 2.

Figure 2: Types of shelving buffers

Usually, shelving buffers are implemented as reservation stations following one of three possible schemes, as Fig. 3 illustrates.

Figure 3: Basic variants of reservation stations

Individual reservation stations are used in front of each execution unit (EU). In this case instructions which are scheduled for execution in a particular EU are first transferred into the individual reservation station preceding that EU. An alternative approach is to implement reservation stations as group stations. Here, the same reservation station serves a group of cooperating EUs executing the same type of instructions, for instance, FX-, FP- or load/store instructions. A final alternative is the central reservation station. In this case a single buffer holds and dispatches instructions for all EUs available in the processor.

A quite different approach to implementing shelving buffers is to use a combined buffer for shelving and reordering and, in some cases, for renaming as well, as shown in Fig. 4.

Figure 4: Use of combined buffers for shelving

Most recent superscalars make use of a reorder buffer (ROB) to assure the logical integrity of program execution. An ROB can be enhanced to function also as a shelving buffer, or even
to perform register renaming as well. The first alternative is exemplified by the PA 8000, the latter by the Metaflow Lightning processor, which was announced around 1990 but never reached the market, and by the K6. In the Lightning, the combined structure was designated as the DRIS (Deferred Scheduling, Register Renaming, Instruction Shelve).

2.3 Operand fetch policies

2.3.1 Basic fetch policies

Closely connected with shelving is the policy governing how processors fetch operands. A processor's fetch policy is either issue bound or dispatch bound, as indicated in Fig. 5.

Figure 5: Operand fetch policies

The issue bound fetch policy means that operands are fetched during instruction issue. In this case, shelving buffers basically hold instructions along with their operand values, requiring that the buffers be long enough to provide space for all the source operands. The other basic alternative is when operands are fetched during dispatching, called the dispatch bound fetch policy. In this case, shelving buffers can be much shorter since they contain instructions with their register identifiers only.
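The difference between the two policies shows up directly in the width of a shelving buffer entry. The following sketch is purely illustrative and not taken from the paper; the class and field names are invented for this example. It contrasts the two entry formats: an issue bound entry must reserve room for the operand values (or for a tag standing in for a value not yet produced), whereas a dispatch bound entry records register identifiers only.

    from dataclasses import dataclass
    from typing import Optional, Union

    # Issue bound shelving: the entry carries the operand values themselves
    # (or a tag standing in for a value that has not been produced yet).
    @dataclass
    class IssueBoundEntry:
        oc: str                              # operation code
        dest: int                            # (renamed) destination register identifier
        op1: Union[int, str, None] = None    # operand value, or a tag if not yet available
        op2: Union[int, str, None] = None

    # Dispatch bound shelving: the entry only names the source registers;
    # the values are read from the register file(s) at dispatch time.
    @dataclass
    class DispatchBoundEntry:
        oc: str       # operation code
        dest: int     # (renamed) destination register identifier
        src1: int     # source register identifier
        src2: int     # source register identifier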
2.3.2 The issue bound fetch policy

Fig. 6 shows the principle of the issue bound operand fetch policy, assuming the use of individual reservation stations and a common architectural register file for both FX- and FP-data. We also suppose for the moment that the processor employs only shelving. Later, in Section 4, we describe how the processor operates when both shelving and renaming are used. The assumptions made do not affect the principle discussed but allow us to focus on the vital points.

In our case, the referenced source register identifiers of the issued instructions (s1 and s2) are forwarded during instruction issue to the architectural register file to fetch the referenced operands (o1, o2). In addition, the operation codes (OC) and the destination register identifiers of the issued instructions (d) are written into the allocated reservation stations. After the referenced source operands (o1, o2) become available, they too are written into the proper reservation station. Here we will assume that all referenced register contents are available. Later, in Section 4, we will discuss the case when referenced operands are not yet available during instruction issue.

Figure 6: The principle of issue bound operand fetching (In the figure we assume a common architectural register file for both FX- and FP-data and individual reservation stations)

2.3.3 The dispatch bound operand fetch policy

An alternative to the issue bound fetch policy discussed so far is the dispatch bound operand fetch policy. In this case, operands are fetched in connection with instruction dispatch rather than with instruction issue. In Fig. 7 we illustrate the principle of the dispatch bound operand fetch policy under the same assumptions as above.

Figure 7: The principle of dispatch bound operand fetching (In the figure we assume a common architectural register file for both FX- and FP-data and individual reservation stations)

As Fig. 7 shows, the reservation stations hold the instructions along with the referenced source register identifiers. During dispatch, the operation codes (OC) and destination register identifiers of the dispatched instructions (d) are forwarded from the reservation stations to the associated EUs, and the source register identifiers (s1, s2) are passed to the register file. After fetching, the source operands (o1, o2) are gated into the inputs of the corresponding EUs.
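The two policies can also be contrasted as two variants of the issue and dispatch steps. The following Python sketch is only illustrative; the register file, the reservation station representation and the instruction format are invented for the example and are not taken from the paper. Under issue bound fetching the register file is read when the instruction enters the reservation station; under dispatch bound fetching it is read when the instruction leaves it.

    # A toy architectural register file; indices are register identifiers.
    regfile = [0] * 32

    def issue_issue_bound(instr, rs):
        # Issue step, issue bound policy: read the operands now and shelve the values.
        rs.append({"oc": instr["oc"], "d": instr["d"],
                   "o1": regfile[instr["s1"]], "o2": regfile[instr["s2"]]})

    def dispatch_issue_bound(rs):
        # Dispatch step, issue bound policy: the shelved entry already carries its operands.
        entry = rs.pop(0)
        return entry["oc"], entry["d"], entry["o1"], entry["o2"]

    def issue_dispatch_bound(instr, rs):
        # Issue step, dispatch bound policy: shelve only the register identifiers.
        rs.append({"oc": instr["oc"], "d": instr["d"], "s1": instr["s1"], "s2": instr["s2"]})

    def dispatch_dispatch_bound(rs):
        # Dispatch step, dispatch bound policy: read the operands from the register file now.
        entry = rs.pop(0)
        return entry["oc"], entry["d"], regfile[entry["s1"]], regfile[entry["s2"]]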
3 Basic alternatives of register renaming

3.1 Overview

From the design space of register renaming in superscalar processors [Sima, Fountain and Kacsuk 1997] there is only one qualitative aspect which basically affects the implementation of the datapath: the layout of the rename buffers. In the following, we outline the design options for it. As Fig. 8 illustrates, there are three fundamentally different ways to implement rename buffers: using merged architectural and rename register files, employing stand alone rename register files, and holding renamed values in the ROB.

Figure 8: Generic types of rename buffers

In the first approach, rename buffers and architectural registers are implemented in the same physical register file, called the merged architectural and rename register file. Available physical registers are assigned dynamically either to architectural or to rename registers. Assuming again a split
register scenario, FX- and FP-data are separately held in the corresponding merged FX- and FP-register files. The high-end models of the IBM ES/9000 mainframe family, the Power line of processors, the PM1 (Sparc64) and the R10000 are examples which opted for merged architectural and rename register files. By contrast, both alternative methods separate rename buffers and architectural registers.

The second option employs stand alone rename register files for FX- and FP-data. In this case, the rename register files are used exclusively for renaming. The PowerPC processors PowerPC 603 - PowerPC 620 exemplify this alternative.

The last alternative for renaming is to enhance the ROB so that it can perform this task as well. While using an ROB for preserving the sequential consistency of instruction execution, an entry is assigned to each issued instruction for the duration of its execution. So, it is quite natural to use these entries for renaming as well, basically by extending them with a new field which will hold the execution result. Since the ROB maintains a single mechanism to keep the program logic intact, usually all renamed instructions are held in the same ROB queue despite using split architectural register files for FX- and FP-data. Therefore, each ROB entry is expected to be long enough to buffer either FX- or FP-data. Examples of processors which use the ROB also for renaming are the Am 29000 superscalar, the K5 and the PentiumPro. Here we note that the ROB may actually be extended even further to provide additional buffer space for, and to manage, shelving as well. The Lightning processor project made use of this concept, designating the enhanced ROB as the DRIS. The K6 also made use of this arrangement.
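The three organizations can be summarized as three different data layouts. The following Python fragment is a purely illustrative sketch; the class names, field names and sizes are invented for this example and are not taken from the paper. It only records where a renamed value lives in each scheme.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class MergedRegisterFile:
        # One physical register file; each physical register is dynamically
        # either an architectural register or a rename register.
        values: list = field(default_factory=lambda: [0] * 64)
        arch_map: dict = field(default_factory=dict)                      # architectural reg -> physical reg
        free: list = field(default_factory=lambda: list(range(32, 64)))   # currently free physical regs

    @dataclass
    class StandAloneRenameFile:
        # Separate architectural and rename register files; rename entries are
        # copied back into the architectural file when an instruction completes.
        arch: list = field(default_factory=lambda: [0] * 32)
        rename: list = field(default_factory=lambda: [None] * 16)

    @dataclass
    class RenamingROBEntry:
        # ROB entry extended with a result field, so the ROB doubles as a rename buffer.
        dest_arch_reg: int = 0
        result: Optional[int] = None    # wide enough for FX- or FP-data in a real design
        finished: bool = False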
Concerning terminology, we point out that we use the term rename buffer in a generic sense to denote any one of the implementation alternatives mentioned above. If we instead use a specific designation of rename buffers, such as rename register file etc., we are referring to a particular implementation alternative.

4 Principle of operation assuming shelving, renaming and the use of an ROB

4.1 Overview

In this section we first outline how the ROB operates. Based on this we then describe the overall processor operation, assuming the use of both shelving and renaming. In our description we focus on the kernel of the datapath which serves to execute operational FX- and FP-instructions. In order to describe the operation we have assumed one particular implementation alternative (use of individual reservation stations for shelving and stand alone rename register files for renaming), but we will point out the substantial differences that using other alternatives for shelving or for renaming would cause. For the sake of simplicity we ignore bypasses of the execution units in our discussion. We also omit any datapaths or status bits needed to indicate or check operand availability.

The ROB is implemented basically as a circular buffer whose entries are allocated and deallocated by means of two revolving pointers. Quite simply, during instruction issue an ROB entry is allocated to each issued instruction. Assuming in-order issue, in accordance with recent practice, subsequent ROB entries are then allocated to issued instructions in program order. Each ROB entry keeps track of the execution status of the associated instruction. When execution units finish their operation they send status reports to the ROB. Based on these, the ROB allows instructions to update the program state strictly in program order. This is achieved by permitting a finished instruction to update the program state only after all preceding instructions have been completed. Thereafter, the associated ROB entry is deallocated and becomes eligible for reuse.

Because of major differences we discuss the operation of the processor separately for the issue bound and for the dispatch bound operand fetch policies.
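The ROB behaviour just described can be captured in a few lines. The following is a minimal Python sketch, assuming a fixed-size circular buffer; the class, method and field names are invented for this illustration and do not come from the paper. Entries are allocated in program order at issue, marked when an execution unit reports finishing, and retired strictly from the oldest end.

    class ReorderBuffer:
        # Minimal sketch of an ROB as a circular buffer with two revolving pointers.

        def __init__(self, size=16):
            self.entries = [None] * size      # each entry: {"instr": ..., "finished": bool}
            self.head = 0                     # oldest entry (next candidate to complete)
            self.tail = 0                     # next free slot
            self.count = 0

        def allocate(self, instr):
            # Issue: allocate the next entry in program order; return its index.
            assert self.count < len(self.entries), "ROB full: issue must stall"
            idx = self.tail
            self.entries[idx] = {"instr": instr, "finished": False}
            self.tail = (self.tail + 1) % len(self.entries)
            self.count += 1
            return idx

        def mark_finished(self, idx):
            # Status report from an execution unit.
            self.entries[idx]["finished"] = True

        def complete(self):
            # Retire finished instructions strictly in program order, from the head.
            retired = []
            while self.count and self.entries[self.head]["finished"]:
                retired.append(self.entries[self.head]["instr"])
                self.entries[self.head] = None
                self.head = (self.head + 1) % len(self.entries)
                self.count -= 1
            return retired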
4.2 Processor operation assuming issue bound operand fetching

A processor built according to the assumptions made, presuming two FX- and two FP-execution units, looks like the one shown in Fig. 9. It consists of two symmetrical halves for the processing of FX- and FP-instructions, as well as of a Decode/Issue unit and an ROB. Each half consists of the FX- or FP-rename and architectural register files, two individual reservation stations (FX-RS, FP-RS) and two execution units (FX-EU, FP-EU).

Figure 9: The datapath of a processor assuming rename register files, individual reservation stations, issue bound operand fetch and an ROB

The overall operation is as follows. Decoded instructions are issued into the allocated reservation stations. Meanwhile the referenced source and destination registers are renamed and the source operands are fetched, provided that they are available. Instructions wait in the reservation stations until their operands become available and they can be forwarded to the execution units. When an execution unit finishes its operation, the result produced is written into the allocated rename buffer. At the same time, the result is also broadcast to the reservation stations to supply this value to all instructions waiting for it. Because of the different latencies of the operations, instructions will typically finish out of order. As long as a result is held in a rename buffer it is called an intermediate result, since it has not yet been used to update the program state. A rename buffer holds an intermediate result until the instruction which generated it completes or the result becomes obsolete due to specific events, such as a mispredicted branch.
The last action of instruction processing is completion. During completion the program state is updated irreversibly by writing the intermediate result held in a rename buffer into the originally referenced architectural register. The ROB allows instructions to complete only in program order to preserve sequential consistency, as discussed above.

In the following we discuss the individual tasks of the operation. These are: renaming the destination and the source registers, fetching the operands, dispatching the instructions, and finishing and completion of instructions.

Renaming the destination registers: Each destination register is renamed by allocating a free rename buffer in the rename register file to it and substituting the number of the destination register (d) with the identifier of the allocated rename buffer (d'), which is usually its index, provided that a free buffer is available. An established allocation remains valid until either (a) a subsequent instruction writes the same destination register, and consequently this architectural register is reallocated to a new rename buffer, or (b) the instruction which includes this destination register completes. After completion the allocated rename buffer is deallocated and reclaimed for further use.

Renaming the source registers: A referenced architectural register is either renamed (if there exists a valid mapping for it) or it is not. If there exists a valid mapping, the referenced source register also needs to be renamed during instruction issue. This is achieved by substituting its register number (s1 or s2) with the identifier of the rename buffer (s1' or s2') which is actually allocated to it. In this way, references to the renamed source registers are redirected to the allocated rename buffers.

Fetching the source operands: Referenced source operands are fetched either from the corresponding rename register file or from the corresponding architectural register file, depending on whether they are renamed or not. If a referenced source register is renamed, renaming yields a valid rename register index (s1', s2') and the requested operand is fetched from the addressed rename buffer. If, on the other hand, the referenced source register is not renamed, the rename process signals this and the requested operand (o1, o2) is fetched from the referenced architectural register. For more details of the rename process see [Sima, Fountain and Kacsuk 1997]. If a referenced source register is renamed and thus its value is fetched from the rename register file, it can happen that the requested operand (o1, o2) is not yet available since it has not yet been produced. In this case, a tag which uniquely identifies the requested operand is placed into the shelving buffer. Usually, the index of the rename register (s1', s2') that will hold the requested source operand value is used as the tag.
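The renaming and fetching tasks just described can be sketched as follows. This is an illustrative Python fragment, assuming a stand alone rename register file; all names (arch_regs, rename_regs, free_bufs, mapping and the two functions) are invented for the example. It shows how a destination register receives a freshly allocated rename buffer and how a source read returns either a value or a tag.

    from collections import deque

    # Illustrative state:
    arch_regs   = [0] * 32            # architectural register file
    rename_regs = [None] * 8          # stand alone rename register file (None = not yet produced)
    free_bufs   = deque(range(8))     # indices of currently free rename buffers
    mapping     = {}                  # architectural reg -> rename buffer index (valid mappings)

    def rename_destination(d):
        # Allocate a free rename buffer to destination register d and return its index d'.
        d_prime = free_bufs.popleft()        # a real processor stalls issue if none is free
        rename_regs[d_prime] = None          # the value has not been produced yet
        mapping[d] = d_prime                 # latest mapping for d
        return d_prime

    def fetch_source(s):
        # Return either the operand value or, if it is not yet available, its tag (s').
        if s in mapping:                     # the source register is renamed
            s_prime = mapping[s]
            value = rename_regs[s_prime]
            return value if value is not None else ("tag", s_prime)
        return arch_regs[s]                  # not renamed: read the architectural register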
Dispatching: Once renamed, instructions are held in the corresponding reservation station (FX-RS, FP-RS). In each cycle the reservation stations are scanned for executable instructions, that is, for instructions which have all of their source operands (o1, o2) available. Executable instructions are then dispatched to available execution units according to the dispatch policy of the processor [Sima, Fountain and Kacsuk 1997].

Finishing of instructions: When an instruction finally finishes its execution, three tasks need to be performed: (a) the generated result has to be written into the allocated rename buffer, as already discussed, (b) the reservation stations need to be updated, and (c) the ROB needs to be updated. To update the reservation stations, the generated results (r) are broadcast along with their tags (d') to all reservation stations holding the same type of instructions, such as FX- or FP-instructions. Each reservation station performs an associative look-up for the broadcast tag. Matching tags are substituted with the broadcast result values and their status bits are set appropriately. Thus, when the reservation stations are scanned for executable instructions in the next cycle, this operand already shows up as available. To update the ROB, the tags of the generated results are forwarded to it. Assuming that the tag is the index of the allocated rename buffer (d'), again an associative look-up is needed to find the related ROB entry. The status of the instruction found in the matching ROB entry is then changed to finished.
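The three finishing tasks map onto a short routine. The sketch below is illustrative only; the parameter names, the tuple representation of a tag and the dictionary-based reservation station entries are assumptions made for this example, not structures taken from the paper.

    def finish(result, tag, rename_regs, reservation_stations, rob_status):
        # result: the value produced by an execution unit
        # tag:    the index d' of the rename buffer allocated to the instruction

        # (a) write the result into the allocated rename buffer
        rename_regs[tag] = result

        # (b) broadcast (result, tag) to the reservation stations: every entry waiting
        #     on this tag replaces it with the value and marks the operand available
        for rs in reservation_stations:
            for entry in rs:
                for op in ("o1", "o2"):
                    if entry.get(op) == ("tag", tag):
                        entry[op] = result
                        entry[op + "_valid"] = True

        # (c) report to the ROB: mark the entry associated with this tag as finished
        rob_status[tag] = "finished"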
Completion of instructions: Finally, when an instruction completes, three tasks need to be performed: (a) the program state has to be updated by writing the value held in the associated rename buffer into the corresponding architectural register, and both (b) the rename buffer and (c) the ROB entry allocated to the completed instruction need to be reclaimed for further use. The program state is updated differently when using merged architectural and rename register files than with the other alternatives. In the case of merged files, when an instruction completes, the rename buffer allocated to that instruction is simply declared to be the associated architectural register and its status is changed appropriately. In addition, the physical register which previously represented the reallocated architectural register is freed for further use. If rename buffers are implemented separately from the architectural registers (rename register files, rename buffers within the ROB), each time an instruction completes the content of the allocated rename buffer needs to be written back into the corresponding architectural register.
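The difference between the two completion schemes can be made explicit in code. The following Python sketch is illustrative only; the function and parameter names, and the way register status and mappings are represented, are assumptions for the example. With merged files no data is moved, only the bookkeeping changes; with separate rename buffers the value is copied back.

    def complete_merged(mapping, register_status, free_physical, arch_reg, d_prime, old_physical):
        # Merged architectural/rename register file: the rename register simply
        # becomes the architectural register; no value is copied.
        register_status[d_prime] = "architectural"   # the renamed value is now program state
        mapping[arch_reg] = d_prime                  # arch_reg is now backed by physical reg d_prime
        register_status[old_physical] = "free"       # the previous physical register is reclaimed
        free_physical.append(old_physical)

    def complete_separate(arch_regs, rename_regs, free_bufs, arch_reg, d_prime):
        # Stand alone rename register file (or rename buffers in the ROB): the
        # intermediate result is written back into the architectural register.
        arch_regs[arch_reg] = rename_regs[d_prime]   # irreversible update of the program state
        rename_regs[d_prime] = None                  # the rename buffer is reclaimed
        free_bufs.append(d_prime)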
4.3 Processor operation assuming dispatch bound operand fetching

In the case of dispatch bound operand fetching (Fig. 10), source and destination registers are renamed during instruction issue by means of separate mapping tables for FX- and FP-instructions. Renaming of the source registers yields an indication of whether the referenced source register is actually renamed to any rename buffer. If yes, it yields the index of the rename buffer (s1', s2') which holds the requested source operand. If no, the source register identifiers (s1, s2) remain unchanged. For each referenced source operand either the source register identifier (s1, s2) or the index of the associated rename buffer (s1', s2') is written into the appropriate fields of the allocated reservation station. Thus, when operands are fetched dispatch bound, reservation stations contain renamed destination register identifiers (d') and renamed or not renamed source register identifiers (s1/s1', s2/s2').

Figure 10: The datapath of a processor assuming rename register files, individual reservation stations, dispatch bound operand fetch and an ROB

Again, as in the case of issue bound operand fetching, instructions with all needed operands available are dispatched to the connected execution units (FX-EU, FP-EU). But in this case the source operands (o1, o2) are fetched during instruction dispatch rather than during instruction issue. A further difference is that the reservation stations need not be updated with the produced results. A last dissimilarity is that, during completion of an instruction, the corresponding mapping table also needs to be updated by deleting the mapping (d-d') that is no longer needed.
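The issue step with dispatch bound fetching and renaming can be sketched as follows. This is an illustrative Python fragment; the function name, the dictionary-based instruction and reservation station formats, and the tagging of sources as architectural or renamed are assumptions made for this example. Only identifiers are shelved; the mapping table decides, per source, whether the identifier refers to an architectural register or to a rename buffer.

    def issue_with_renaming_dispatch_bound(instr, mapping, rename_regs, free_bufs, rs):
        # Rename the sources: keep s if it is not mapped, otherwise substitute s'.
        srcs = []
        for s in (instr["s1"], instr["s2"]):
            srcs.append(("ren", mapping[s]) if s in mapping else ("arch", s))

        # Rename the destination: allocate a rename buffer d' and record the mapping d -> d'.
        d_prime = free_bufs.pop(0)          # free_bufs: a list of free rename buffer indices
        rename_regs[d_prime] = None
        mapping[instr["d"]] = d_prime

        # Shelve identifiers only; operand values are read later, at dispatch time.
        rs.append({"oc": instr["oc"], "d": d_prime, "s1": srcs[0], "s2": srcs[1]})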
5 Generic datapath alternatives of advanced superscalars

As pointed out previously, the basic alternatives for shelving differ in two aspects: (a) the layout of the shelving buffers and (b) the operand fetch policy used, whereas the basic alternatives for renaming vary in (c) the layout of the rename buffers. Now, we can derive the generic datapath alternatives from the feasible combinations of the available design choices for the aspects mentioned. Neglecting contradicting combinations (designated by crossed fields), we obtain 24 generic datapath alternatives for superscalar processors which employ both shelving and renaming, as indicated in Fig. 11. Each generic datapath alternative is characterized by a particular combination of the available design choices. Thus, alternative 1 uses individual reservation stations, the issue bound operand fetch policy and merged architectural and rename register files; alternative 2 employs individual reservation stations, the dispatch bound operand fetch policy and merged architectural and rename register files; alternative 3 has group reservation stations, the issue bound operand fetch policy and merged architectural and rename register files; and so on.

Figure 11: Basic datapath alternatives

Fig. 11 also shows which datapath alternatives have been chosen by recent superscalars. Thus, for example, alternative 2 was chosen by the developers of the Power1 and of the Nx586, alternative 3 is used in the PM1 (Sparc64), alternative 4 in a number of processors, such as the Power2, R10000, Alpha 21264 and the high end
processors of the ES-9000 family, and so on.

Subsequently, in Figures 12/1-12/16 we present the generic datapath alternatives in more detail. These figures, however, do not include the alternatives which make use of separate rename register files, since their internal structures can easily be derived from the datapaths using merged files by replacing the merged files with separate rename and architectural register files, as indicated in Fig. 8. In order to have a lucid presentation we have made a few simplifications in the figures. First, we focus on the kernel of the datapaths by taking into account only FX- and FP-execution, neglecting load/store and branch processing. Second, we ignore bypasses that forward the results of execution units to their inputs, and we do not indicate the multiplicity of the buses. Third, we omit any datapaths for the operation codes and for indicating or checking the availability of operands. In the figures we use the following abbreviations:
d: destination register identifier
d': renamed destination register identifier (it specifies the rename buffer into which the result of an instruction has to be written)
MT: mapping table
o1, o2: source operands
ren: renaming
s1, s2: source register identifiers
s1', s2': renamed source register identifiers

In Figures 12/1-12/16 wide buses carrying FX- or FP-data are designated by bold lines to indicate their high implementation cost. This is in contrast to narrow buses, which typically forward register identifiers and are drawn as thin lines.

Figure 12/1: Datapath alternative 1: Individual reservation stations, issue bound operand fetch, merged register files (PowerPC 603-620, but with stand alone rename register files)
Figure 12/2: Datapath alternative 2: Individual reservation stations, dispatch bound operand fetch, merged register files (Power1, Nx586)
Figure 12/3: Datapath alternative 3: Group reservation stations, issue bound operand fetch, merged register files (PM-1)
Figure 12/4: Datapath alternative 4: Group reservation stations, dispatch bound operand fetch, merged register files (ES-9000 models, Power2, R10000, Alpha 21264)
Figure 12/5: Datapath alternative 5: Central reservation station, issue bound operand fetch, merged register files
Figure 12/6: Datapath alternative 6: Central reservation station, dispatch bound operand fetch, merged register files
Figure 12/7: Datapath alternative 7: Shelving within the ROB, issue bound operand fetch, merged register files
Figure 12/8: Datapath alternative 8: Shelving within the ROB, dispatch bound operand fetch, merged register files (PA-8000, but with stand alone rename register files)
Figure 12/9: Datapath alternative 9: Individual reservation stations, issue bound operand fetch, renaming within the ROB (Am 29000 superscalar, K5)
Figure 12/10: Datapath alternative 10: Individual reservation stations, dispatch bound operand fetch, renaming within the ROB
Figure 12/11: Datapath alternative 11: Group reservation stations, issue bound operand fetch, renaming within the ROB
Figure 12/12: Datapath alternative 12: Group reservation stations, dispatch bound operand fetch, renaming within the ROB
Figure 12/13: Datapath alternative 13: Central reservation station, issue bound operand fetch, renaming within the ROB
Figure 12/14: Datapath alternative 14: Central reservation station, dispatch bound operand fetch, renaming within the ROB (Pentium Pro)
Figure 12/15: Datapath alternative 15: Shelving and renaming within the ROB, issue bound operand fetch
Figure 12/16: Datapath alternative 16: Shelving and renaming within the ROB, dispatch bound operand fetch (Lightning)

Thus far we have reviewed the possible layouts of the kernel of the datapath in advanced superscalars which employ both shelving and renaming. As indicated in Fig. 11, eight out of the 24 possible alternatives are used in recent processors. The details of the microarchitectures of recent superscalars have been disclosed in different depth. The given generic datapath alternatives may be helpful in uncovering undisclosed aspects of the described microarchitectures. We point out that there are currently more than a dozen possible but unused generic datapath alternatives. It would be an appealing, albeit rather complex, task to investigate the
alternatives that have not yet been singled out or, in a broader sense, to assess all generic datapath alternatives in terms of their performance potential and cost of implementation. But that is a further issue.

7 Conclusion

Based on the exploration of the design spaces of shelving and register renaming, generic alternatives of the kernel of the datapath can be determined. Assuming a given execution engine, generic alternatives differ in the following three aspects: (a) the layout of the shelving buffers, (b) the operand fetch policy used, and (c) the layout of the rename buffers. By taking into account all feasible combinations of the available design choices for the aspects mentioned, we find that there are 24 generic datapath alternatives. Of these, roughly one in three is used in advanced superscalars.

References

[Rau and Fisher 1993] Rau, B. R., Fisher, J. A.: "Instruction level parallel processing: History, overview and perspective", The Journal of Supercomputing, 7, (1993), 9-50.
[Smith and Sohi 1995] Smith, J. E., Sohi, G. S.: "The microarchitecture of superscalar processors", Proc. IEEE, 83, 12, (1995), 1609-1624.
[Sima 1997] Sima, D.: "The design space of superscalar instruction issue", IEEE Micro, Sept./Oct., (1997), 28-39.
[Sima, Fountain and Kacsuk 1997] Sima, D., Fountain, T., Kacsuk, P.: "Advanced Computer Architectures", Addison Wesley, Harlow etc., (1997).
[Stephens 1991] Stephens, C. et al.: "Instruction level profiling and evaluation of the IBM RS/6000", Proc. 18th ISCA, (1991), 137-146.
[Yeh and Patt 1992] Yeh, T.-Y., Patt, Y. N.: "Alternative implementations of two-level adaptive branch prediction", Proc. 19th ISCA, (1992), 124-134.
[Riseman and Foster 1972] Riseman, E. M., Foster, C. C.: "The inhibition of potential parallelism by conditional jumps", IEEE Trans. on Computers, C-21, 12, (1972), 1406-1411.
[Jouppi 1989] Jouppi, N. P.: "The nonuniform distribution of instruction-level and machine parallelism and its effect on performance", IEEE Trans. on Computers, 38, 12, (1989), 1645-1658.
[Lam and Wilson 1992] Lam, M. S., Wilson, R. P.: "Limits of control flow on parallelism", Proc. 19th ISCA, (1992), 46-57.
[Smith 1981] Smith, J. E.: "A study of branch prediction strategies", Proc. 8th ISCA, (1981), 135-148.
[Thornton 1970] Thornton, J. E.: "Design of a Computer: The Control Data 6600", Scott, Foresman & Company, Glenview, (1970).
[Tomasulo 1967] Tomasulo, R. M.: "An efficient algorithm for exploiting multiple arithmetic units", IBM J. Res. and Dev., 11, 1, (1967), 25-33.
[Tjaden and Flynn 1970] Tjaden, G. S., Flynn, M. J.: "Detection and parallel execution of independent instructions", IEEE Trans. on Computers, C-19, 10, (1970), 889-895.
[Keller 1975] Keller, R. M.: "Look-ahead processors", Computing Surveys, 7, 4, (1975), 177-195.
[Sima 1998] Sima, D.: "The design space of shelving", Journal of Systems Architecture, to be published.

References to processors:

[R8000] Hsu, P. Y.-T.: "Designing the TFP microprocessor", IEEE Micro, Apr., (1994), 23-33.
[R10000] Gwennap, L.: "MIPS R10000 uses decoupled architecture", Microprocessor Report, 8, 14, (1994), 18-22.
[Pentium] Alpert, D., Avnon, D.: "Architecture of the Pentium microprocessor", IEEE Micro, June, (1993), 11-21.
[Pentium Pro] Gwennap, L.: "Intel's P6 uses decoupled superscalar design", Microprocessor Report, 9, 2, (1995), 9-15.
[PowerPC 601] Becker, M. et al.: "The PowerPC 601 microprocessor", IEEE Micro, Oct., (1993), 54-68.
[PowerPC 603] Burgess, B. et al.: "The PowerPC 603 microprocessor", Comm. ACM, 37, (1994), 34-42.
[PowerPC 604] Song, S. P. et al.: "The PowerPC 604 RISC microprocessor", IEEE Micro, Oct., (1994), 8-17.
[PowerPC 620a] Levitan, D. et al.: "The PowerPC 620 microprocessor: a high performance superscalar RISC microprocessor", Proc. COMPCON, (1995), 285-291.
[PowerPC 620b] Ogden, D. et al.: "A new PowerPC microprocessor for low power computing systems", Proc. COMPCON, (1995), 281-284.
[PA-7100] Asprey, T. et al.: "Performance features of the PA-7100 microprocessor", IEEE Micro, June, (1993), 22-35.
[PA-7200] Kurpanek, G. et al.: "PA-7200: A PA-RISC processor with integrated high performance MP bus interface", Proc. COMPCON, (1994), 375-382.
[PA-8000] Gwennap, L.: "PA-8000 combines complexity and speed", Microprocessor Report, 8, 15, (1994), 1-8.
[SuperSPARC] Blanck, G., Krueger, S.: "The SuperSPARC microprocessor", Proc. COMPCON, (1992), 136-141.
[UltraSparc] Greenley, D. et al.: "UltraSparc: the next generation superscalar 64-bit SPARC", Proc. COMPCON, (1995), 442-451.
[Gmicro/500] Kchiyard, K. et al.: "The Gmicro/500 superscalar microprocessors with branch buffers", IEEE Micro, Oct., (1993), 12-22.
[Alpha 21064] DECchip 21064 and DECchip 21064A Alpha AXP Microprocessors Hardware Reference Manual, DEC, Maynard, Massachusetts, (1994).
[Alpha 21064A] DECchip 21064 and DECchip 21064A Alpha AXP Microprocessors Hardware Reference Manual, DEC, Maynard, Massachusetts, (1994).
[Alpha 21164] Edmondson, J. H. et al.: "Superscalar instruction execution in the Alpha microprocessor", IEEE Micro, April, (1995), 33-43.
[Alpha 21264] Leibholz, D., Razdan, R.: "The Alpha 21264: A 500 MIPS out-of-order execution microprocessor", Proc. COMPCON, (1997), 28-36.
[MC68060] Circello, J. et al.: "The superscalar architecture of the MC68060", IEEE Micro, April, (1995), 10-21.
[Motorola 88110] Diefendorff, K., Allen, M.: "Organization of the Motorola 88110 superscalar RISC microprocessor", IEEE Micro, April, (1992), 40-63.
[AMD K5] Slater, M.: "AMD's K5 designed to outrun Pentium", Microprocessor Report, 8, 14, (1994), 1-11.
[AMD 29K] Case, B.: "AMD unveils first superscalar 29K core", Microprocessor Report, 8, 14, (1994), 23-26.
[M1] Burkhardt, B.: "Delivering next-generation performance on today's installed computer base", Proc. COMPCON, (1994), 11-16.