fbpx
Wikipedia

Out-of-order execution

In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units,[1] rather than by their original order in a program.[2][3] In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.[4]

History edit

Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s.

Early use in supercomputers edit

The first machine to use out-of-order execution was the CDC 6600 (1964), designed by James E. Thornton, which uses a scoreboard to avoid conflicts. It permits an instruction to execute if its source operand (read) registers aren't to be written to by any unexecuted earlier instruction (true dependency) and the destination (write) register not be a register used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an execution unit on false dependencies (write after write (WAW) and write after read (WAR) conflicts, respectively termed first order conflict and third order conflict by Thornton, who termed true dependencies (read after write (RAW)) as second order conflict) because each address has only a single location referable by it. The WAW is worse than WAR for the 6600, because when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and they can not receive any further instructions until the WAW-causing instruction's destination register has been written to by earlier instruction.[5]

About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm,[6] which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register rn can be executed before an earlier instruction using the register rn is executed, by actually writing into an alternative (renamed) register alt-rn, which is turned into a normal ("architectural") register rn only when all the earlier instructions addressing rn have been executed, but until then rn is given for earlier instructions and alt-rn for later ones addressing rn. In the Model 91 the register renaming is implemented by a bypass termed Common Data Bus (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute out-of-order the instructions at the same execution unit, not just between the units like the 6600. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. The Model 91 is also capable of re-ordering loads and stores to execute before the preceding loads and stores,[7] unlike the 6600, which only has a limited ability to move loads past loads, and stores past stores, but not loads past stores and stores past loads.[8] Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point code. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers.

Precise exceptions edit

To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by James E. Smith and Andrew R. Pleszkun.[9] The CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an invisible exchange package, so that it can resume at the same state of execution.[10] However to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions.[9] Smith simulated that adding a reorder buffer (or history buffer or equivalent) to Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%.[9] Important academic research in this subject was led by Yale Patt with his HPSm simulator.[11]

In the 1980s many early RISC microprocessors, like the Motorola 88100, had out-of-order writeback to the registers, resulting in imprecise exceptions. Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.

Decoupling edit

Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching.[12] He implemented those ideas in the Astronautics ZS-1 (1988), featuring a decoupling of the integer/load/store pipeline from the floating-point pipeline, allowing inter-pipeline re-ordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper he opined that enforcing the precise exceptions only on the integer/memory pipeline should be sufficient for many usecases, as it even permits virtual memory. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent the stalling of the front-end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming.[7] A similar decoupled architecture had been used a bit earlier in the Culler 7.[13] The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches.

Research comes to fruition edit

With the POWER1 (1990) IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a physical register file (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a datafull reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named program counter stack by IBM) to undo changes to count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; due to POWER1's inability to reorder floating-point arithmetic instructions (results became available in-order), their destination registers aren't renamed. POWER1 also doesn't have reservation stations needed for out-of-order use of a same execution unit.[14][15] The next year IBM's ES/9000 model 900 had register renaming also for the general-purpose registers. It also has reservation stations with six entries for the dual integer unit (each cycle, from the six instructions up to two can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The re-ordering distance is up to 32 instructions.[16] The A19 of Unisys' A-series of mainframes was also released in 1991 and was claimed to have out-of-order execution, and one analyst called the A19's technology three to five years ahead of the competition.[17][18]

Wide adoption edit

The first superscalar single-chip processors (Intel i960CA in 1989) used a simple scoreboarding scheduling like the CDC 6600 had quarter of a century earlier, but in 1992-1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers. Motorola 88110 (1992) used a history buffer to revert instructions.[19] Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance.[20][21][22] PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched-instruction-queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count register could be renamed.[28] In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution, accomplished with micro-OPs. The re-ordering distance is up to 14 micro-OPs.[29] PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry re-order buffer lets no more than four instructions to overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store.[30][31]

PowerPC 604 (1995) was the first single-chip processor with execution unit-level re-ordering, as three out of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The re-order buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the re-ordering of loads and stores upon cache misses.[32] HAL SPARC64 (1995) exceeded the re-ordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for integer, floating-point, and address generation unit, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a re-ordered state at a time[33][34] Pentium Pro (1995) introduced a unified reservation station, which at the 20 micro-OP capacity permitted very flexible re-ordering, backed by a 40-entry re-order buffer. Loads can be re-ordered ahead of both loads and stores.[35]

The practically attainable per-cycle rate of execution rose more as full out-of-order execution was further adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year Cyrix 6x86 and AMD K5 brought advanced re-ordering techniques into mainstream personal computers. Since DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/Intel Itanium 2 and IBM POWER6, though the latter had an out-of-order floating-point unit.[36] The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III/IV, and IBM's mainframes which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively.

Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First, Qualcomm's Scorpion (re-ordering distance of 32) shipped in Snapdragon,[37] and a bit later Arm's A9 succeeded A8. For low-end x86 personal computers in-order early Intel Atoms were first challenged by AMD's Bobcat, and in 2013 were succeeded by an out-of-order Silvermont.[38] Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution is still prevalent in microcontrollers and embedded systems, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations.

Basic concept edit

To appreciate out-of-order Execution it is useful to first describe in-order, to be able to make a comparison of the two. Instructions cannot be completed instantaneously: they take time (multiple cycles). Therefore, results will lag behind where they are needed. In-order still has to keep track of the dependencies. Its approach is however quite unsophisticated: stall, every time. out-of-order uses much more sophisticated data tracking techniques, as seen below.

In-order processors edit

In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:

  1. Instruction fetch.
  2. If input operands are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
  3. The instruction is executed by the appropriate functional unit.
  4. The functional unit writes the results back to the register file.

Often, an in-order processor has a bit vector recording which registers will be written to by a pipeline.[39] If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.

Out-of-order processors edit

This new paradigm breaks up the processing of instructions into these steps:

  1. Instruction fetch.
  2. Instruction decoding.
  3. Instruction renaming.
  4. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
  5. The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
  6. The instruction is issued to the appropriate functional unit and executed by that unit.
  7. The results are queued.
  8. Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.

The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoOE processors fill these slots in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order, in the processor they are handled in data order, the order in which the data, operands, become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoOE processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue edit

One of the differences created by the new paradigm is the creation of queues that allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was decoupled architecture. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.

The instructions of the program may not be run in the originally specified order, as long as the end result is correct. It separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer.

The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high-performance by exploiting the fine-grain parallelism between the two.[40] In doing so, it effectively hides all memory latency from the processor's perspective.

A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing the effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favour a multi-threaded design approach.

Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well.[41] Control intensive code include such things as nested branches that occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word (VLIW) architectures.[42]

To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.

Execute and writeback decoupling allows program restart edit

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.

The ability to issue instructions past branches that are yet to resolve is known as speculative execution.

Micro-architectural choices edit

  • Are the instructions dispatched to a centralized queue or to multiple distributed queues?
IBM PowerPC processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term reservation stations for their distributed queues.
  • Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.
Early Intel out-of-order processors use a results queue called a re-order buffer, while most later out-of-order processors use register maps.
More precisely: Intel P6 family microprocessors have both a re-order buffer (ROB) and a register alias table (RAT). The ROB was motivated mainly by branch misprediction recovery.
The Intel P6 family is among the earliest OoOE microprocessors but were supplanted by the NetBurst architecture. Years later, NetBurst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the Core and Nehalem microarchitectures. The succeeding Sandy Bridge, Ivy Bridge, and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the EV6 and the P4 but with a somewhat shorter pipeline.[43][44]

See also edit

References edit

  1. ^ Kukunas, Jim (2015). Power and Performance: Software Analysis and Optimization. Morgan Kaufman. p. 37. ISBN 9780128008140.
  2. ^ "Out-of-order execution" (PDF). cs.washington.edu. 2006. Retrieved 2014-01-17. don't wait for previous instructions to execute if this instruction does not depend on them
  3. ^ "The Centennial Celebration". Regis High School. 2011-03-14. Retrieved 2022-06-25. The algorithm "allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially" (also known as out of order execution).
  4. ^ "Out-of-order Execution". pcguide.com. Retrieved 2014-01-17. This flexibility improves performance since it allows execution with less 'waiting' time.
  5. ^ Thornton (1970, p. 125-127)
  6. ^ Tomasulo, Robert Marco (1967), (PDF), IBM Journal of Research and Development, 11 (1): 25–33, CiteSeerX 10.1.1.639.7540, doi:10.1147/rd.111.0025, S2CID 8445049, archived from the original (PDF) on 2018-06-12
  7. ^ a b Smith, James E. (July 1989). "Dynamic Instruction Scheduling and the Astronautics ZS-1" (PDF). Computer. 22 (7): 21–35. doi:10.1109/2.30730. S2CID 329170.
  8. ^ Thornton (1970, p. 48-50)
  9. ^ a b c Smith, James E.; Pleszkun, Andrew R. (June 1985). "Implementation of precise interrupts in pipelined processors". 12th ISCA.
    (Expanded version published in May 1988 as Implementing Precise Interrupts in Pipelined Processors.)
  10. ^ Moudgill, Mayan; Vassiliadis, Stamatis (January 1996). . p. 18. CiteSeerX 10.1.1.33.3304. Archived from the original (pdf) on 13 October 2022.
  11. ^ Hwu, W.; Patt, Yale N. (1986). HPSm, a high performance restricted data flow architecture having minimal functionality. ACM. pp. 297–306. ISBN 978-0-8186-0719-6. Retrieved 2013-12-06. {{cite book}}: |work= ignored (help)
  12. ^ Smith, James E. (November 1984). "Decoupled Access/Execute Computer Architectures" (PDF). ACM Transactions on Computer Systems. 2 (4): 289–308. doi:10.1145/357401.357403. S2CID 13903321.
  13. ^ Smotherman, Mark. "Culler-7". Clemson University.
  14. ^ Grohoski, Gregory F. (January 1990). (PDF). IBM Journal of Research and Development. 34 (1): 37–58. doi:10.1147/rd.341.0037. Archived from the original (PDF) on January 9, 2005.
  15. ^ Smith, James E.; Sohi, Gurindar S. (December 1995). "The Microarchitecture of Superscalar Processors" (PDF). Proceedings of the IEEE. 83 (12): 1617. doi:10.1109/5.476078.
  16. ^ Liptay, John S. (July 1992). (PDF). IBM Journal of Research and Development. 36 (4): 713–731. doi:10.1147/rd.364.0713. Archived from the original (PDF) on January 17, 2005.
  17. ^ Ziegler, Bart (March 7, 1991). "Unisys Unveils 'Top Gun' Mainframe Computers". AP News.
  18. ^ "Unisys' New Mainframe Leaves Big Blue In The Dust". Bloomberg. March 25, 1991. The new A19 relies on "super-scalar" techniques from scientific computers to execute many instructions concurrently. The A19 can overlap as many as 140 operations, more than 10 times as many as conventional mainframes can.
  19. ^ Ullah, Nasr; Holle, Matt (March 1993). "The MC88110 Implementation of Precise Exceptions in a Superscalar Architecture" (pdf). ACM Sigarch Computer Architecture News. Motorola Inc. 21: 15–25. doi:10.1145/152479.152482. S2CID 7036627.
  20. ^ Smotherman, Mark (29 April 1994). "Motorola MC88110 Overview".
  21. ^ Diefendorff, Keith; Allen, Michael (April 1992). (PDF). IEEE Micro. 12 (2): 40–63. doi:10.1109/40.127582. S2CID 25668727. Archived from the original (PDF) on 2022-10-21.
  22. ^ Smotherman, Mark; Chawla, Shuchi; Cox, Stan; Malloy, Brian (December 1993). "Instruction scheduling for the Motorola 88110". Proceedings of the 26th Annual International Symposium on Microarchitecture. pp. 257–262. doi:10.1109/MICRO.1993.282761. ISBN 0-8186-5280-2. S2CID 52806289.
  23. ^ "PowerPC™ 601 RISC Microprocessor Technical Summary" (PDF). Retrieved 23 October 2022.
  24. ^ Moore, Charles R.; Becker, Michael C. et al. "The PowerPC 601 microprocessor". IEEE Micro. 13 (5). September 1993.
  25. ^ Diefendorff, Keith (August 1993). "PowerPC 601 Microprocessor" (PDF). Hot Chips.
  26. ^ Smith, James E.; Weiss, Shlomo (June 1994). "PowerPC 601 and Alpha 21064: A Tale of Two RISCs" (PDF). IEEE Computer. 27 (6): 46–58. doi:10.1109/2.294853. S2CID 1114841.
  27. ^ Sima, Dezsö (September–October 2000). "The design space of register renaming techniques". IEEE Micro. 20 (5): 70–83. CiteSeerX 10.1.1.387.6460. doi:10.1109/40.877952. S2CID 11012472.
  28. ^ [23][24][25][26][27]
  29. ^ Gwennap, Linley (28 March 1994). (PDF). Microprocessor Report. Archived from the original (PDF) on 2 December 2021.
  30. ^ Burgess, Brad; Ullah, Nasr; Van Overen, Peter; Ogden, Deene (June 1994). "The PowerPC 603 microprocessor". Communications of the ACM. 37 (6): 34–42. doi:10.1145/175208.175212. S2CID 34385975.
  31. ^ "PowerPC™ 603 RISC Microprocessor Technical Summary" (PDF). Retrieved 27 October 2022.
  32. ^ Song, S. Peter; Denman, Marvin; Chang, Joe (October 1994). "The PowerPC 604 RISC microprocessor" (PDF). IEEE Micro. 14 (5): 8. doi:10.1109/MM.1994.363071. S2CID 11603864.
  33. ^ "SPARC64+: HAL's Second Generation 64-bit SPARC Processor" (PDF). Hot Chips.
  34. ^ "Le Sparc64". Research Institute of Computer Science and Random Systems (in French).
  35. ^ Gwennap, Linley (16 February 1995). "Intel's P6 Uses Decoupled Superscalar Design" (PDF). Microprocessor Report.
  36. ^ Le, Hung Q. et al. "IBM POWER6 microarchitecture" (PDF). IBM Journal of Research and Development. 51 (6). November 2007.
  37. ^ Mallia, Lou. (PDF). Archived from the original (PDF) on 29 October 2013.
  38. ^ Anand Lal Shimpi (2013-05-06). "Intel's Silvermont Architecture Revealed: Getting Serious About Mobile". AnandTech.
  39. ^ "Inside the Minor CPU model: Scoreboard". 2017-06-09. Retrieved 2023-01-09.
  40. ^ Smith, J. E. (1984). "Decoupled access/execute computer architectures". ACM Transactions on Computer Systems. 2 (4): 289–308. CiteSeerX 10.1.1.127.4475. doi:10.1145/357401.357403. S2CID 13903321.
  41. ^ Kurian, L.; Hulina, P. T.; Coraor, L. D. (1994). (PDF). IEEE Transactions on Computers. 43 (10): 1129–1139. doi:10.1109/12.324539. S2CID 6913858. Archived from the original (PDF) on 2018-06-12.
  42. ^ Dorojevets, M. N.; Oklobdzija, V. (1995). "Multithreaded decoupled architecture". International Journal of High Speed Computing. 7 (3): 465–480. doi:10.1142/S0129053395000257.
  43. ^ Kanter, David (2010-09-25). "Intel's Sandy Bridge Microarchitecture".
  44. ^ "The Haswell Front End - Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel".
  • Thornton, James (1970). Design of a Computer: The Control Data 6600 (PDF). ISBN 9780673059536.

Further reading edit

  • Smith, James E.; Pleszkun, A. R. (June 1985). "Implementation of precise interrupts in pipelined processors". ACM SIGARCH Computer Architecture News. 13 (3): 36–44. doi:10.1145/327070.327125.

order, execution, computer, engineering, order, execution, more, formally, dynamic, execution, paradigm, used, most, high, performance, central, processing, units, make, instruction, cycles, that, would, otherwise, wasted, this, paradigm, processor, executes, . In computer engineering out of order execution or more formally dynamic execution is a paradigm used in most high performance central processing units to make use of instruction cycles that would otherwise be wasted In this paradigm a processor executes instructions in an order governed by the availability of input data and execution units 1 rather than by their original order in a program 2 3 In doing so the processor can avoid being idle while waiting for the preceding instruction to complete and can in the meantime process the next instructions that are able to run immediately and independently 4 Contents 1 History 1 1 Early use in supercomputers 1 2 Precise exceptions 1 3 Decoupling 1 4 Research comes to fruition 1 5 Wide adoption 2 Basic concept 2 1 In order processors 2 2 Out of order processors 3 Dispatch and issue decoupling allows out of order issue 4 Execute and writeback decoupling allows program restart 5 Micro architectural choices 6 See also 7 References 8 Further readingHistory editOut of order execution is a restricted form of data flow computation which was a major research area in computer architecture in the 1970s and early 1980s Early use in supercomputers edit The first machine to use out of order execution was the CDC 6600 1964 designed by James E Thornton which uses a scoreboard to avoid conflicts It permits an instruction to execute if its source operand read registers aren t to be written to by any unexecuted earlier instruction true dependency and the destination write register not be a register used by any unexecuted earlier instruction false dependency The 6600 lacks the means to avoid stalling an execution unit on false dependencies write after write WAW and write after read WAR conflicts respectively termed first order conflict and third order conflict by Thornton who termed true dependencies read after write RAW as second order conflict because each address has only a single location referable by it The WAW is worse than WAR for the 6600 because when an execution unit encounters a WAR the other execution units still receive and execute instructions but upon a WAW the assignment of instructions to execution units stops and they can not receive any further instructions until the WAW causing instruction s destination register has been written to by earlier instruction 5 About two years later the IBM System 360 Model 91 1966 introduced register renaming with Tomasulo s algorithm 6 which dissolves false dependencies WAW and WAR making full out of order execution possible An instruction addressing a write into a register rn can be executed before an earlier instruction using the register rn is executed by actually writing into an alternative renamed register alt rn which is turned into a normal architectural register rn only when all the earlier instructions addressing rn have been executed but until then rn is given for earlier instructions and alt rn for later ones addressing rn In the Model 91 the register renaming is implemented by a bypass termed Common Data Bus CDB and memory source operand buffers leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB Another advantage the Model 91 has over the 6600 is the ability to execute out of order the instructions at the same execution unit not just between the units like the 6600 This is accomplished by reservation stations from which instructions go to the execution unit when ready as opposed to the FIFO queue of each execution unit of the 6600 The Model 91 is also capable of re ordering loads and stores to execute before the preceding loads and stores 7 unlike the 6600 which only has a limited ability to move loads past loads and stores past stores but not loads past stores and stores past loads 8 Only the floating point registers of the Model 91 are renamed making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed point code The 91 and 6600 both also suffer from imprecise exceptions which needed to be solved before out of order execution could be applied generally and made practical outside supercomputers Precise exceptions edit To have precise exceptions the proper in order state of the program s execution must be available upon an exception By 1985 various approaches were developed as described by James E Smith and Andrew R Pleszkun 9 The CDC Cyber 205 was a precursor as upon a virtual memory interrupt the entire state of the processor including the information on the partially executed instructions is saved into an invisible exchange package so that it can resume at the same state of execution 10 However to make all exceptions precise there has to be a way to cancel the effects of instructions The CDC Cyber 990 1984 implements precise interrupts by using a history buffer which holds the old overwritten values of registers that are restored when an exception necessitates the reverting of instructions 9 Smith simulated that adding a reorder buffer or history buffer or equivalent to Cray 1S would reduce the performance of executing the first 14 Livermore loops unvectorized by only 3 9 Important academic research in this subject was led by Yale Patt with his HPSm simulator 11 In the 1980s many early RISC microprocessors like the Motorola 88100 had out of order writeback to the registers resulting in imprecise exceptions Instructions started execution in order but some e g floating point took more cycles to complete execution However the single cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600 Decoupling edit Smith also researched how to make different execution units operate more independently of each other and of the memory front end and branching 12 He implemented those ideas in the Astronautics ZS 1 1988 featuring a decoupling of the integer load store pipeline from the floating point pipeline allowing inter pipeline re ordering The ZS 1 was also capable of executing loads ahead of preceding stores In his 1984 paper he opined that enforcing the precise exceptions only on the integer memory pipeline should be sufficient for many usecases as it even permits virtual memory Each pipeline had an instruction buffer to decouple it from the instruction decoder to prevent the stalling of the front end To further decouple the memory access from execution each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming 7 A similar decoupled architecture had been used a bit earlier in the Culler 7 13 The ZS 1 s ISA like IBM s subsequent POWER aided the early execution of branches Research comes to fruition edit With the POWER1 1990 IBM returned to out of order execution It was the first processor to combine register renaming though again only floating point registers with precise exceptions It uses a physical register file i e a dynamically remapped file with both uncommitted and committed values instead of a datafull reorder buffer but the ability to cancel instructions is needed only in the branch unit which implements a history buffer named program counter stack by IBM to undo changes to count link and condition registers The reordering capability of even the floating point instructions is still very limited due to POWER1 s inability to reorder floating point arithmetic instructions results became available in order their destination registers aren t renamed POWER1 also doesn t have reservation stations needed for out of order use of a same execution unit 14 15 The next year IBM s ES 9000 model 900 had register renaming also for the general purpose registers It also has reservation stations with six entries for the dual integer unit each cycle from the six instructions up to two can be selected and then executed and six entries for the FPU Other units have simple FIFO queues The re ordering distance is up to 32 instructions 16 The A19 of Unisys A series of mainframes was also released in 1991 and was claimed to have out of order execution and one analyst called the A19 s technology three to five years ahead of the competition 17 18 Wide adoption edit The first superscalar single chip processors Intel i960CA in 1989 used a simple scoreboarding scheduling like the CDC 6600 had quarter of a century earlier but in 1992 1996 a rapid advancement of techniques enabled by increasing transistor counts saw proliferation down to personal computers Motorola 88110 1992 used a history buffer to revert instructions 19 Loads could be executed ahead of preceding stores While stores and branches were waiting to start execution subsequent instructions of other types could keep flowing through all the pipeline stages including writeback The 12 entry capacity of the history buffer placed a limit on the reorder distance 20 21 22 PowerPC 601 1993 was an evolution of the RISC Single Chip itself a simplification of POWER1 The 601 permitted branch and floating point instructions to overtake the integer instructions already in the fetched instruction queue the lowest four entries of which were scanned for dispatchability In the case of a cache miss loads and stores could be reordered Only the link and count register could be renamed 28 In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general purpose registers to single chip CPUs NexGen s Nx586 was the first x86 processor capable of out of order execution accomplished with micro OPs The re ordering distance is up to 14 micro OPs 29 PowerPC 603 renamed both the general purpose and FP registers Each of the four non branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units A five entry re order buffer lets no more than four instructions to overtake an unexecuted instruction Due to a store buffer a load can access cache ahead of a preceding store 30 31 PowerPC 604 1995 was the first single chip processor with execution unit level re ordering as three out of its six units each had a two entry reservation station permitting the newer entry to execute before the older The re order buffer capacity is 16 instructions A four entry load queue and a six entry store queue track the re ordering of loads and stores upon cache misses 32 HAL SPARC64 1995 exceeded the re ordering capacity of the ES 9000 model 900 by having three 8 entry reservation stations for integer floating point and address generation unit and a 12 entry reservation station for load store which permits greater reordering of cache memory access than preceding processors Up to 64 instructions can be in a re ordered state at a time 33 34 Pentium Pro 1995 introduced a unified reservation station which at the 20 micro OP capacity permitted very flexible re ordering backed by a 40 entry re order buffer Loads can be re ordered ahead of both loads and stores 35 The practically attainable per cycle rate of execution rose more as full out of order execution was further adopted by SGI MIPS R10000 and HP PA RISC PA 8000 in 1996 The same year Cyrix 6x86 and AMD K5 brought advanced re ordering techniques into mainstream personal computers Since DEC Alpha gained out of order execution in 1998 Alpha 21264 the top performing out of order processor cores have been unmatched by in order cores other than HP Intel Itanium 2 and IBM POWER6 though the latter had an out of order floating point unit 36 The other high end in order processors fell far behind namely Sun s UltraSPARC III IV and IBM s mainframes which had lost the out of order execution capability for the second time remaining in order into the z10 generation Later big in order processors were focused on multithreaded performance but eventually the SPARC T series and Xeon Phi changed to out of order execution in 2011 and 2016 respectively Almost all processors for phones and other lower end applications remained in order until c 2010 First Qualcomm s Scorpion re ordering distance of 32 shipped in Snapdragon 37 and a bit later Arm s A9 succeeded A8 For low end x86 personal computers in order early Intel Atoms were first challenged by AMD s Bobcat and in 2013 were succeeded by an out of order Silvermont 38 Because the complexity of out of order execution precludes achieving the lowest minimum power consumption cost and size in order execution is still prevalent in microcontrollers and embedded systems as well as in phone class cores such as Arm s A55 and A510 in big LITTLE configurations Basic concept editTo appreciate out of order Execution it is useful to first describe in order to be able to make a comparison of the two Instructions cannot be completed instantaneously they take time multiple cycles Therefore results will lag behind where they are needed In order still has to keep track of the dependencies Its approach is however quite unsophisticated stall every time out of order uses much more sophisticated data tracking techniques as seen below In order processors edit In earlier processors the processing of instructions is performed in an instruction cycle normally consisting of the following steps Instruction fetch If input operands are available in processor registers for instance the instruction is dispatched to the appropriate functional unit If one or more operands are unavailable during the current clock cycle generally because they are being fetched from memory the processor stalls until they are available The instruction is executed by the appropriate functional unit The functional unit writes the results back to the register file Often an in order processor has a bit vector recording which registers will be written to by a pipeline 39 If any input operands have the corresponding bit set in this vector the instruction stalls Essentially the vector performs a greatly simplified role of protecting against register hazards Thus out of order execution uses 2D matrices whereas in order execution uses a 1D vector for hazard avoidance Out of order processors edit This new paradigm breaks up the processing of instructions into these steps Instruction fetch Instruction decoding Instruction renaming Instruction dispatch to an instruction queue also called instruction buffer or reservation stations The instruction waits in the queue until its input operands are available The instruction can leave the queue before older instructions The instruction is issued to the appropriate functional unit and executed by that unit The results are queued Only after all older instructions have their results written back to the register file then this result is written back to the register file This is called the graduation or retire stage The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable In the outline above the OoOE processor avoids the stall that occurs in step 2 of the in order processor when the instruction is not completely ready to be processed due to missing data OoOE processors fill these slots in time with other instructions that are ready then re order the results at the end to make it appear that the instructions were processed as normal The way the instructions are ordered in the original computer code is known as program order in the processor they are handled in data order the order in which the data operands become available in the processor s registers Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output the processor itself runs the instructions in seemingly random order The benefit of OoOE processing grows as the instruction pipeline deepens and the speed difference between main memory or cache memory and the processor widens On modern machines the processor runs many times faster than the memory so during the time an in order processor spends waiting for data to arrive it could have processed a large number of instructions Dispatch and issue decoupling allows out of order issue editOne of the differences created by the new paradigm is the creation of queues that allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage An early name for the paradigm was decoupled architecture In the earlier in order processors these stages operated in a fairly lock step pipelined fashion The instructions of the program may not be run in the originally specified order as long as the end result is correct It separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer The buffer s purpose is to partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine grain parallelism between the two 40 In doing so it effectively hides all memory latency from the processor s perspective A larger buffer can in theory increase throughput However if the processor has a branch misprediction then the entire buffer may need to be flushed wasting a lot of clock cycles and reducing the effectiveness Furthermore larger buffers create more heat and use more die space For this reason processor designers today favour a multi threaded design approach Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well 41 Control intensive code include such things as nested branches that occur frequently in operating system kernels Decoupled architectures play an important role in scheduling in very long instruction word VLIW architectures 42 To avoid false operand dependencies which would decrease the frequency when instructions could be issued out of order a technique called register renaming is used In this scheme there are more physical registers than defined by the architecture The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time Execute and writeback decoupling allows program restart editThe queue for results is necessary to resolve issues such as branch mispredictions and exceptions traps The results queue allows programs to be restarted after an exception which requires the instructions to be completed in program order The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions The ability to issue instructions past branches that are yet to resolve is known as speculative execution Micro architectural choices editAre the instructions dispatched to a centralized queue or to multiple distributed queues IBM PowerPC processors use queues that are distributed among the different functional units while other out of order processors use a centralized queue IBM uses the term reservation stations for their distributed queues Is there an actual results queue or are the results written directly into a register file For the latter the queueing function is handled by register maps that hold the register renaming information for each instruction in flight Early Intel out of order processors use a results queue called a re order buffer while most later out of order processors use register maps More precisely Intel P6 family microprocessors have both a re order buffer ROB and a register alias table RAT The ROB was motivated mainly by branch misprediction recovery The Intel P6 family is among the earliest OoOE microprocessors but were supplanted by the NetBurst architecture Years later NetBurst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies Materials were not able to match the design s ambitious clock targets due to thermal issues and later designs based on NetBurst namely Tejas and Jayhawk were cancelled Intel reverted to the P6 design as the basis of the Core and Nehalem microarchitectures The succeeding Sandy Bridge Ivy Bridge and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ re ordering techniques from the EV6 and the P4 but with a somewhat shorter pipeline 43 44 See also edit nbsp The Wikibook Microprocessor Design has a page on the topic of Out Of Order Execution Instruction scheduling Memory fence Replay system Shelving bufferReferences edit Kukunas Jim 2015 Power and Performance Software Analysis and Optimization Morgan Kaufman p 37 ISBN 9780128008140 Out of order execution PDF cs washington edu 2006 Retrieved 2014 01 17 don t wait for previous instructions to execute if this instruction does not depend on them The Centennial Celebration Regis High School 2011 03 14 Retrieved 2022 06 25 The algorithm allows sequential instructions that would normally be stalled due to certain dependencies to execute non sequentially also known as out of order execution Out of order Execution pcguide com Retrieved 2014 01 17 This flexibility improves performance since it allows execution with less waiting time Thornton 1970 p 125 127 Tomasulo Robert Marco 1967 An Efficient Algorithm for Exploiting Multiple Arithmetic Units PDF IBM Journal of Research and Development 11 1 25 33 CiteSeerX 10 1 1 639 7540 doi 10 1147 rd 111 0025 S2CID 8445049 archived from the original PDF on 2018 06 12 a b Smith James E July 1989 Dynamic Instruction Scheduling and the Astronautics ZS 1 PDF Computer 22 7 21 35 doi 10 1109 2 30730 S2CID 329170 Thornton 1970 p 48 50 a b c Smith James E Pleszkun Andrew R June 1985 Implementation of precise interrupts in pipelined processors 12th ISCA Expanded version published in May 1988 as Implementing Precise Interrupts in Pipelined Processors Moudgill Mayan Vassiliadis Stamatis January 1996 On Precise Interrupts p 18 CiteSeerX 10 1 1 33 3304 Archived from the original pdf on 13 October 2022 Hwu W Patt Yale N 1986 HPSm a high performance restricted data flow architecture having minimal functionality ACM pp 297 306 ISBN 978 0 8186 0719 6 Retrieved 2013 12 06 a href Template Cite book html title Template Cite book cite book a work ignored help Smith James E November 1984 Decoupled Access Execute Computer Architectures PDF ACM Transactions on Computer Systems 2 4 289 308 doi 10 1145 357401 357403 S2CID 13903321 Smotherman Mark Culler 7 Clemson University Grohoski Gregory F January 1990 Machine organization of the IBM RISC System 6000 processor PDF IBM Journal of Research and Development 34 1 37 58 doi 10 1147 rd 341 0037 Archived from the original PDF on January 9 2005 Smith James E Sohi Gurindar S December 1995 The Microarchitecture of Superscalar Processors PDF Proceedings of the IEEE 83 12 1617 doi 10 1109 5 476078 Liptay John S July 1992 Design of the IBM Enterprise System 9000 high end processor PDF IBM Journal of Research and Development 36 4 713 731 doi 10 1147 rd 364 0713 Archived from the original PDF on January 17 2005 Ziegler Bart March 7 1991 Unisys Unveils Top Gun Mainframe Computers AP News Unisys New Mainframe Leaves Big Blue In The Dust Bloomberg March 25 1991 The new A19 relies on super scalar techniques from scientific computers to execute many instructions concurrently The A19 can overlap as many as 140 operations more than 10 times as many as conventional mainframes can Ullah Nasr Holle Matt March 1993 The MC88110 Implementation of Precise Exceptions in a Superscalar Architecture pdf ACM Sigarch Computer Architecture News Motorola Inc 21 15 25 doi 10 1145 152479 152482 S2CID 7036627 Smotherman Mark 29 April 1994 Motorola MC88110 Overview Diefendorff Keith Allen Michael April 1992 Organization of the Motorola 88110 superscalar RISC microprocessor PDF IEEE Micro 12 2 40 63 doi 10 1109 40 127582 S2CID 25668727 Archived from the original PDF on 2022 10 21 Smotherman Mark Chawla Shuchi Cox Stan Malloy Brian December 1993 Instruction scheduling for the Motorola 88110 Proceedings of the 26th Annual International Symposium on Microarchitecture pp 257 262 doi 10 1109 MICRO 1993 282761 ISBN 0 8186 5280 2 S2CID 52806289 PowerPC 601 RISC Microprocessor Technical Summary PDF Retrieved 23 October 2022 Moore Charles R Becker Michael C et al The PowerPC 601 microprocessor IEEE Micro 13 5 September 1993 Diefendorff Keith August 1993 PowerPC 601 Microprocessor PDF Hot Chips Smith James E Weiss Shlomo June 1994 PowerPC 601 and Alpha 21064 A Tale of Two RISCs PDF IEEE Computer 27 6 46 58 doi 10 1109 2 294853 S2CID 1114841 Sima Dezso September October 2000 The design space of register renaming techniques IEEE Micro 20 5 70 83 CiteSeerX 10 1 1 387 6460 doi 10 1109 40 877952 S2CID 11012472 23 24 25 26 27 Gwennap Linley 28 March 1994 NexGen Enters Market with 66 MHz Nx586 PDF Microprocessor Report Archived from the original PDF on 2 December 2021 Burgess Brad Ullah Nasr Van Overen Peter Ogden Deene June 1994 The PowerPC 603 microprocessor Communications of the ACM 37 6 34 42 doi 10 1145 175208 175212 S2CID 34385975 PowerPC 603 RISC Microprocessor Technical Summary PDF Retrieved 27 October 2022 Song S Peter Denman Marvin Chang Joe October 1994 The PowerPC 604 RISC microprocessor PDF IEEE Micro 14 5 8 doi 10 1109 MM 1994 363071 S2CID 11603864 SPARC64 HAL s Second Generation 64 bit SPARC Processor PDF Hot Chips Le Sparc64 Research Institute of Computer Science and Random Systems in French Gwennap Linley 16 February 1995 Intel s P6 Uses Decoupled Superscalar Design PDF Microprocessor Report Le Hung Q et al IBM POWER6 microarchitecture PDF IBM Journal of Research and Development 51 6 November 2007 Mallia Lou Qualcomm High Performance Processor Core and Platform for Mobile Applications PDF Archived from the original PDF on 29 October 2013 Anand Lal Shimpi 2013 05 06 Intel s Silvermont Architecture Revealed Getting Serious About Mobile AnandTech Inside the Minor CPU model Scoreboard 2017 06 09 Retrieved 2023 01 09 Smith J E 1984 Decoupled access execute computer architectures ACM Transactions on Computer Systems 2 4 289 308 CiteSeerX 10 1 1 127 4475 doi 10 1145 357401 357403 S2CID 13903321 Kurian L Hulina P T Coraor L D 1994 Memory latency effects in decoupled architectures PDF IEEE Transactions on Computers 43 10 1129 1139 doi 10 1109 12 324539 S2CID 6913858 Archived from the original PDF on 2018 06 12 Dorojevets M N Oklobdzija V 1995 Multithreaded decoupled architecture International Journal of High Speed Computing 7 3 465 480 doi 10 1142 S0129053395000257 Kanter David 2010 09 25 Intel s Sandy Bridge Microarchitecture The Haswell Front End Intel s Haswell Architecture Analyzed Building a New PC and a New Intel Thornton James 1970 Design of a Computer The Control Data 6600 PDF ISBN 9780673059536 Further reading editSmith James E Pleszkun A R June 1985 Implementation of precise interrupts in pipelined processors ACM SIGARCH Computer Architecture News 13 3 36 44 doi 10 1145 327070 327125 Retrieved from https en wikipedia org w index php title Out of order execution amp oldid 1185158093, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.