Contents
- 2. Background Required to Understand this Chapter Advanced Computer Architecture. Smruti R. Sarangi Chapter 4
- 3. Contents Simpler Version of an OOO Processor Compiler based Techniques
- 4. Aggressive Speculation Branch prediction is one form of speculation If we detect that a branch has
- 5. Types of Aggressive Speculation
- 6. Address Speculation: Predict the memory address of a load or store Predict last address scheme Use
- 7. Stride based Address Pattern
- 8. Predicting the Stride Last address (A): The memory address computed the last time the instruction with
- 9. Load-Store Dependence Speculation Predict a collision (same memory address) between
- 10. Collision History Table Loads show consistent behavior They are either colliding or non-colliding
- 11. Using the CHT When we compute the address of a load We access the CHT If
- 12. Store Sets Explicitly remember load-store dependences PC → Store set
- 13. Basic Idea For every load, we have an associated store set Stores that have forwarded values
- 14. Load Latency Speculation A load might hit in the L1 cache (2 cycles) or might go
- 15. Make a guess For load instructions, predict if it will
- 16. Constants Value prediction: Why are values predictable?
- 17. Value Predictor
- 18. Using an additional predictor for confidence First, use the confidence table to find out if it
- 19. Contents Simpler Version of an OOO Processor Compiler based Techniques
- 20. Replay Flushing the pipeline for every misspeculation is not a wise thing Instead, flush a part
- 21. Forward Slice of Instruction I0 A forward slice contains an
- 22. Non-Selective Replay Trivial Solution: Flush the pipeline between the dispatch and execute stages Smarter Solution It
- 23. Example Let us say that instructions 2, 3, and 4 had one operand waking up in
- 24. Instruction Window Entry When an operand becomes ready, we set its timer to n Every cycle
- 25. More about Non-Selective Replay We attach the expected latency with each instruction packet as it flows
- 26. Two methods of replaying Method 1: Keep instructions that have been issued in the issue queue
- 27. Two methods of replaying - II Move the instructions to a dedicated replay queue after issue
- 28. Orphan Instructions Assume that the load instruction misses in the L1 cache The add, sub, and
- 29. Orphan Instructions - II Keep track of squashed instructions. Re-broadcast tags of orphan instructions. → We
- 30. Delayed Selective Replay Let us now propose an idea to replay only those instructions that are
- 31. Delayed Selective Replay - II When an instruction finishes execution → Check if its poison bit
- 32. Orphan Instructions We can always wait for the instruction to reach the head of the ROB.
- 33. Token Based Selective Replay Let us use a pattern found in most programs: Most of the
- 34. After Predicting a d-cache Miss Instructions that are predicted to miss, will have a non-deterministic execution
- 35. Structure of the Rename Table If an instruction is a token head, we save the id
- 36. While reading the rename table ... Read the tokenVecs of the source operands Merge the tokenVecs
- 37. After execution After the token head instruction completes execution, see if it took additional cycles (verification
- 38. Instructions in S2 Assume an instruction that was not predicted to miss actually misses No token
- 39. Contents Simpler Version of an OOO Processor Compiler based Techniques
- 40. A Simpler Design Physical Register File (PRF) based design Fast
- 41. Let us now look at a different kind of OOO processor Instead of having a physical
- 42. Changes to renaming Entry in the RAT table ROB id ROB/RF bit ROB/RF bit → 1
- 43. Changes to Dispatch and Wakeup Each entry in the IW now stores the values of the
- 44. Changes to Wakeup, Bypass, Reg. Write and Commit We can follow the same speculative wakeup strategy
- 45. PRF based design vs ARF based design points in the PRF based design A value resides
- 46. Contents Simpler Version of an OOO Processor Compiler based Techniques
- 47. Compiler based Optimizations Can the compiler optimize the code?
- 48. Constant Folding We can directly replace a with 10, b
- 49. Strength Reduction slow fast
- 50. Common Subexpression Elimination Each line in the second example corresponds to one line of assembly code.
- 51. Dead Code Elimination Dead code
- 52. Silent Stores Silent stores write the same value that is already present
- 53. Loop Based Optimizations
- 54. Loop Invariant based Code Motion There is no point setting (val = 5) repeatedly.
- 55. Induction Variable based Optimization Original Induction variable Replace a multiply
- 56. Loop Fusion Original Optimized Fuse the loops Loop fusion reduces
- 57. Loop Unrolling - I Original loop Assembly code
- 58. Loop Unrolling - II Advantage: fewer total instructions and specifically
- 59. Software Pipelining
- 60. L S I
- 61. Visualization of the Execution Process We can create our loops
- 62. Can we execute instructions in this order? I0 → S1
- 63. Advantages of Software Pipelining Consider this order: I0 → S1 → L2 → I1 → S2
- 64. Different Loop Iterators: Group of 3 iterations
- 65. Code with Different Loop Iterators Unroll the loop 3 times
- 66. If we had 32 registers, we could do this very
- 67. Epilogue and Prologue
- 68. Solution without Unrolling i = -1; t = B[0]; .loop
- 69. Unrolling and Mixing
- 70. Contents Simpler Version of an OOO Processor Compiler based Techniques
- 71. Sounds like a promising idea … Less hardware → less power, less complexity Modern software
- 72. VLIW Processors VLIW (Very Long Instruction Word) processors were the first designs in this space. Bundle
- 73. If Statements: Predicated Execution Use predicated execution (remember GPUs). If
- 74. Curious Case of Memory Instructions We can have multiple memory instructions in a bundle The addresses
- 75. VLIW vs EPIC Given that VLIW processors do not necessarily
- 76. Intel Itanium Processor Unique collaboration between Intel and HP Aim: EPIC processor Designed to leverage the
- 77. Fetch Stage Each bundle contains 3 instructions The decoupling buffer can hold 8 such bundles
- 78. Branch Predictors Itanium has four types of branch predictors Compiler directed Four special registers: Target Address
- 79. Branch Predictors – II Multi-way Branches Compilers ensure that (typically) the last instruction in a bundle
- 80. This part of the pipeline Itanium has 9 issue ports: 2 for memory, 2 for integer,
- 81. Register Remapping Stage Large 128-entry register file. 32 static registers
- 82. Example: Function foo calls function bar We deliberately create an
- 83. Register Stack Frame The in and local registers are preserved across function calls. The out registers
- 84. Binary Search No processing done after receiving the return value.
- 85. Register Stack Frame The in and local registers are preserved across function calls. The out registers
- 86. Support for Software Pipelining and Overflows Main Problem: We run out of registers Itanium has a
- 87. High Performance Execution Engine Scoreboard Simple mechanism for OOO execution
- 88. Conditions: Instruction I WAW Hazards Check all the earlier entries
- 89. Conditions: II Instructions wait in the scoreboard until they are safe No hazards
- 90. Predication If we flush the pipeline upon a branch misprediction It would be quite unfair Let
- 91. Code without Predication Count the number of branch instructions. /*
- 92. Predicated Instructions The comparison generates predicates (flags) po → number is odd, pe → number is
- 93. Pipeline
- 94. Load Boosting Boost a load and some instructions that use its value to a point before
- 95. A host of compiler optimizations can be used to speed
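
Items 6 to 8 of the outline mention a "predict last address" scheme and stride-based address prediction. The following is a minimal sketch of how such a predictor might be organized, not the design from the slides: the class name `StridePredictor`, the per-PC dictionary, and the rule "stride = current address minus last address" are all assumptions made for illustration. On the first observation the stride is taken as 0, so the sketch falls back to predicting the last address.

```python
class StridePredictor:
    """Toy per-PC stride predictor (hypothetical structure, for illustration only)."""

    def __init__(self):
        # Maps an instruction's PC to a (last_address, stride) pair.
        self.table = {}

    def predict(self, pc):
        """Return the predicted next address, or None if this PC has not been seen."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        last_address, stride = entry
        return last_address + stride

    def update(self, pc, address):
        """Record the address actually computed and recompute the stride."""
        entry = self.table.get(pc)
        # With no history, assume a stride of 0 (last-address prediction).
        stride = 0 if entry is None else address - entry[0]
        self.table[pc] = (address, stride)


predictor = StridePredictor()
for actual in (1000, 1008, 1016):
    predictor.update(0x40, actual)
next_guess = predictor.predict(0x40)  # last address 1016 plus stride 8
```

A real predictor would normally bound the table size and gate its predictions with a confidence mechanism; both are left out here to keep the sketch short.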