Chapter_5-alternative-approaches

Февраль 27, 2021

Главная
Разное
Chapter_5-alternative-approaches

Содержание

2. Background Required to Understand this Chapter Advanced Computer Architecture. Smruti R. Sarangi Chapter 4
3. Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO Processor Compiler based Techniques
4. Aggressive Speculation Branch prediction is one form of speculation If we detect that a branch has
5. Types of Aggressive Speculation Advanced Computer Architecture. Smruti R. Sarangi
6. Address Speculation: Predict the memory address of a load or store Predict last address scheme Use
7. Stride based Address Pattern Advanced Computer Architecture. Smruti R. Sarangi
8. Predicting the Stride Last address (A): The memory address computed the last time the instruction with
9. Load-Store Dependence Speculation Advanced Computer Architecture. Smruti R. Sarangi Predict a collision (same memory address) between
10. Collision History Table Loads show consistent behavior They are either colliding or non-colliding Advanced Computer Architecture.
11. Using the CHT When we compute the address of a load We access the CHT If
12. Store Sets Advanced Computer Architecture. Smruti R. Sarangi Explicitly remember load-store dependences PC ? Store set
13. Basic Idea For every load, we have an associated store set Stores that have forwarded values
14. Load Latency Speculation A load might hit in the L1 cache (2 cycles) or might go
15. Make a guess Advanced Computer Architecture. Smruti R. Sarangi For load instructions, predict if it will
16. Advanced Computer Architecture. Smruti R. Sarangi Constants Value prediction: Why are values predictable?
17. Value Predictor Advanced Computer Architecture. Smruti R. Sarangi
18. Using an additional predictor for confidence First, use the confidence table to find out if it
19. Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO Processor Compiler based Techniques
20. Replay Flushing the pipeline for every misspeculation is not a wise thing Instead, flush a part
21. Forward Slice of Instruction I0 Advanced Computer Architecture. Smruti R. Sarangi A forward slice contains an
22. Non-Selective Replay Trivial Solution: Flush the pipeline between the dispatch and execute stages Smarter Solution It
23. Example Let us say that instructions 2, 3, and 4 had one operand waking up in
24. Instruction Window Entry When an operand becomes ready, we set its timer to n Every cycle
25. More about Non-Selective Replay We attach the expected latency with each instruction packet as it flows
26. Two methods of replaying Method 1: Keep instructions that have been issued in the issue queue
27. Two methods of replaying - II Move the instructions to a dedicated replay queue after issue
28. Orphan Instructions Assume that the load instruction misses in the L1 cache The add, sub, and
29. Orphan Instructions - II Keep track of squashed instructions. Re-broadcast tags of orphan instructions. ? We
30. Delayed Selective Replay Let us now propose an idea to replay only those instructions that are
31. Delayed Selective Replay - II When an instruction finishes execution ? Check if its poison bit
32. Orphan Instructions We can always wait for the instruction to reach the head of the ROB.
33. Token Based Selective Replay Let us use a pattern found in most programs: Most of the
34. After Predicting a d-cache Miss Instructions that are predicted to miss, will have a non-deterministic execution
35. Structure of the Rename Table If an instruction is a token head, we save the id
36. While reading the rename table ... Read the tokenVecs of the source operands Merge the tokenVecs
37. After execution After the token head instruction completes execution, see if it took additional cycles (verification
38. Instructions in S2 Assume an instruction that was not predicted to miss actually misses No token
39. Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO Processor Compiler based Techniques
40. A Simpler Design Physical Register File (PRF) based design Advanced Computer Architecture. Smruti R. Sarangi Fast
41. Let us now look at a different kind of OOO processor Instead of having a physical
42. Changes to renaming Entry in the RAT table ROB id ROB/RF bit ROB/RF bit ? 1
43. Changes to Dispatch and Wakeup Each entry in the IW now stores the values of the
44. Changes to Wakeup, Bypass, Reg. Write and Commit We can follow the same speculative wakeup strategy
45. PRF based design vs ARF based design points in the PRF based design A value resides
46. Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO Processor Compiler based Techniques
47. Compiler based Optimizations Can the compiler optimize the code? Advanced Computer Architecture. Smruti R. Sarangi
48. Constant Folding Advanced Computer Architecture. Smruti R. Sarangi We can directly replace a with 10, b
49. Strength Reduction Advanced Computer Architecture. Smruti R. Sarangi slow fast
50. Common Subexpression Elimination Each line in the second example corresponds to one line of assembly code.
51. Dead Code Elimination Advanced Computer Architecture. Smruti R. Sarangi Dead code
52. Silent Stores Silent stores write the same value that is already present Advanced Computer Architecture. Smruti
53. Advanced Computer Architecture. Smruti R. Sarangi Loop Based Optimizations
54. Loop Invariant based Code Motion There is no point setting (val = 5) repeatedly. Advanced Computer
55. Induction Variable based Optimization Advanced Computer Architecture. Smruti R. Sarangi Original Induction variable Replace a multiply
56. Loop Fusion Advanced Computer Architecture. Smruti R. Sarangi Original Optimized Fuse the loops Loop fusion reduces
57. Loop Unrolling - I Advanced Computer Architecture. Smruti R. Sarangi Original loop Assembly code
58. Advanced Computer Architecture. Smruti R. Sarangi Loop Unrolling - II Advantage: fewer total instructions and specifically
59. Advanced Computer Architecture. Smruti R. Sarangi Software Pipelining
60. Advanced Computer Architecture. Smruti R. Sarangi L S I
61. Visualization of the Execution Process Advanced Computer Architecture. Smruti R. Sarangi We can create our loops
62. Can we execute instructions in this order? Advanced Computer Architecture. Smruti R. Sarangi I0 ? S1
63. Advantages of Software Pipelining Consider this order: I0 ? S1 ? L2 ? I1 ? S2
64. Different Loop Iterators: Group of 3 iterations Advanced Computer Architecture. Smruti R. Sarangi
65. Code with Different Loop Iterators Advanced Computer Architecture. Smruti R. Sarangi Unroll the loop 3 times
66. Advanced Computer Architecture. Smruti R. Sarangi If we had 32 registers, we could do this very
67. Epilogue and Prologue Advanced Computer Architecture. Smruti R. Sarangi
68. Solution without Unrolling Advanced Computer Architecture. Smruti R. Sarangi i = -1; t = B[0]; .loop
69. Unrolling and Mixing Advanced Computer Architecture. Smruti R. Sarangi
70. Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO Processor Compiler based Techniques
71. . Sounds like a promising idea … Less hardware ? less power, less complexity Modern software
72. VLIW Processors VLIW (Very Long Instruction Word) processors were the first designs in this space. Bundle
73. If Statements: Predicated Execution Use predicated execution (remember GPUs). Advanced Computer Architecture. Smruti R. Sarangi If
74. Curious Case of Memory Instructions We can have multiple memory instructions in a bundle The addresses
75. VLIW vs EPIC Advanced Computer Architecture. Smruti R. Sarangi Given that VLIW processors do not necessarily
76. Intel Itanium Processor Unique collaboration between Intel and HP Aim: EPIC processor Designed to leverage the
77. Fetch Stage Each bundle contains 3 instructions The decoupling buffer can hold 8 such bundles Advanced
78. Branch Predictors Itanium has four types of branch predictors Compiler directed Four special registers: Target Address
79. Branch Predictors – II Multi-way Branches Compilers ensure that (typically) the last instruction in a bundle
80. This part of the pipeline Itanium has 9 issue ports: 2 for memory, 2 for integer,
81. Register Remapping Stage Large 128-entry register file. Advanced Computer Architecture. Smruti R. Sarangi 32 static registers
82. Example: Function foo calls function bar Advanced Computer Architecture. Smruti R. Sarangi We deliberately create an
83. Register Stack Frame The in and local registers are preserved across function calls. The out registers
84. Binary Search Advanced Computer Architecture. Smruti R. Sarangi No processing done after receiving the return value.
85. Register Stack Frame The in and local registers are preserved across function calls. The out registers
86. Support for Software Pipelining and Overflows Main Problem: We run out of registers Itanium has a
87. High Performance Execution Engine Advanced Computer Architecture. Smruti R. Sarangi Scoreboard Simple mechanism for OOO execution
88. Conditions: Instruction I Advanced Computer Architecture. Smruti R. Sarangi WAW Hazards Check all the earlier entries
89. Conditions: II Instructions wait in the scoreboard until they are safe No hazards Advanced Computer Architecture.
90. Predication If we flush the pipeline upon a branch misprediction It would be quite unfair Let
91. Code without Predication Count the number of branch instructions. Advanced Computer Architecture. Smruti R. Sarangi /*
92. Predicated Instructions The comparison generates predicates (flags) po ? number is odd, pe ? number is
93. Advanced Computer Architecture. Smruti R. Sarangi Pipeline
94. Load Boosting Boost a load and some instructions that use its value to a point before
95. Advanced Computer Architecture. Smruti R. Sarangi A host of compiler optimizations can be used to speed
97. Скачать презентацию

Background Required to Understand this Chapter
Advanced Computer Architecture. Smruti R. Sarangi
Chapter 4

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Techniques

EPIC based Techniques: Intel Itanium

Aggressive Speculation
Branch prediction is one form of speculation
If we detect that

a branch has been mispredicted
Solution: flush the pipeline
This is not the only form of speculation
Another very common type: load latency speculation or value speculation
Assume that a load will hit in the cache
Speculatively wakeup instructions
Later on if this is not the case: DO SOMETHING

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 5

Types of Aggressive Speculation
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 6

Address Speculation: Predict the memory address of a load or store
Predict last

address scheme
Use a simple predictor

Advanced Computer Architecture. Smruti R. Sarangi
2n entries

Слайд 7

Stride based Address Pattern
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 8

Predicting the Stride
Last address (A): The memory address computed the last time

the instruction with this PC was executed.
A stride-based access pattern is followed if: current address – last address = S
Then we set the pattern bit, P
Alternatively, if P is set, we predict the next address to be
A + S

Advanced Computer Architecture. Smruti R. Sarangi

Last address (A)

Stride (S)

Pattern (P)

Слайд 9

Load-Store Dependence Speculation
Advanced Computer Architecture. Smruti R. Sarangi
Predict a collision (same memory

address) between a load and a store

If there are no collisions, send the load directly to the cache.

Forward values across unresolved stores.

Слайд 10

Collision History Table
Loads show consistent behavior
They are either colliding or non-colliding
Advanced Computer

Architecture. Smruti R. Sarangi

n-bit PC

colliding or non-colliding

Collision History Table (CHT)

Слайд 11

Using the CHT
When we compute the address of a load
We access the

CHT
If it is predicted to be colliding
Wait for all prior stores to be resolved
Else
Send the load to the d-cache
Once the address is resolved
Update the CHT, recover the state (if necessary)

Advanced Computer Architecture. Smruti R. Sarangi

We can augment it with the store?load distance (D). A load waits till there are less than D entries before it in the LSQ.

Слайд 12

Store Sets
Advanced Computer Architecture. Smruti R. Sarangi
Explicitly remember load-store dependences
PC ? Store

set identifier

Last fetched store in the store set

Слайд 13

Basic Idea
For every load, we have an associated store set
Stores that have

forwarded values to it in the past
A store may be a part of a single store set

Advanced Computer Architecture. Smruti R. Sarangi

Load

Read the store set id
Get the instruction number of the latest store (S) from the LFST
The load waits for store S to get resolved and then receives the forwarded data.

Store

Read the store set id
Set the instruction number of the current store in the corresponding entry of the LFST.
Can be used to speculatively forward data to loads in its store set.

Whenever we detect a load-store dependence, we update the SSIT and LFST

Слайд 14

Load Latency Speculation
A load might hit in the L1 cache (2 cycles) or

might go to the lower levels of the memory system.
We don’t know for sure

Advanced Computer Architecture. Smruti R. Sarangi

Pipeline

L1 d-cache

L2 cache

Main memory

1-2 cycles: hit rate: 90%

10-50 cycles: hit rate: 50%

300 cycles: hit rate: 100%

Слайд 15

Make a guess
Advanced Computer Architecture. Smruti R. Sarangi
For load instructions, predict if

it will hit in the data cache or not. If it will, do an early broadcast.

Design a hit-miss predictor. Same idea as branch predictors.

Слайд 16

Advanced Computer Architecture. Smruti R. Sarangi
Constants
Value prediction: Why are values predictable?

Слайд 17

Value Predictor
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 18

Using an additional predictor for confidence
First, use the confidence table to find

out if it makes sense to predict
Simultaneously, make a prediction using a predictor table (value, memory dependence, ALU result)
Predictor table can contain 1 value, or the last k values
Make a prediction, and use it if it has high confidence
Update both the tables when the results are available
If needed recover with a replay/flush mechanism

Confidence
Table
(uses sat. counters)

Confidence

Predictor
Table

Prediction

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 19

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Techniques

EPIC based Techniques: Intel Itanium

Слайд 20

Replay
Flushing the pipeline for every misspeculation is not a wise thing
Instead, flush

a part of the pipeline (or only those instructions that have gotten a wrong value)
Replay those instructions once again (after let’s say the load completes its execution)
When the instructions are being replayed, they are guaranteed to use the correct value of the load
Identify and replay the forward slice ?

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 21

Forward Slice of Instruction I0
Advanced Computer Architecture. Smruti R. Sarangi
A forward slice

contains an instruction’s consumers, its
consumers and so on.

Слайд 22

Non-Selective Replay
Trivial Solution: Flush the pipeline between the dispatch and execute stages

Smarter Solution
It is not necessary to flush all the instructions between the schedule and execute stages
Try to reduce the set of instructions
Define a window of vulnerability (WV) for n cycles after a load is selected. A load should complete within n cycles if it hits in the d-cache and does not wait in the LSQ
However, if the load takes more than n cycles, we need to do a replay

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 23

Example
Let us say that instructions 2, 3, and 4 had one operand

waking up in the WV of instruction 1
If there is a misspeculation, all three instructions get squashed
Instruction 1 gets reissued with the correct value later

Advanced Computer Architecture. Smruti R. Sarangi
1: ld r1, [r2]
2: add r4, r1, r3
3: add r5, r6, r7
4: add r8, r9, r10

Predict the value

squash them

Слайд 24

Instruction Window Entry
When an operand becomes ready, we set its timer to

n
Every cycle it decrements (count down timer)
Once it becomes 0, we can conclude that this instruction will not be squashed

Set after a misspeculation

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 25

More about Non-Selective Replay
We attach the expected latency with each instruction packet

as it flows down the pipeline
Wherever there is an additional delay (such as a cache miss)
Time for a replay
Set the kill wire
Each instruction window entry that has a non-zero timer, resets its ready flag
We now have a set of instructions that will be replayed
Methods of replaying instructions

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 26

Two methods of replaying
Method 1: Keep instructions that have been issued in

the issue queue (see reference)

Pipeline Stages

Verify

verification status

remove from the IW if verified

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 27

Two methods of replaying - II
Move the instructions to a dedicated replay

queue after issue
Once an instruction is verified, remove it from the replay queue

Pipeline Stages

Replay queue

Verify

remove from the IW

verification status

remove from replay
queue if verified

Method 2:

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 28

Orphan Instructions
Assume that the load instruction misses in the L1 cache
The add,

sub, and xor instructions will need to be squashed, and replayed
For the add and sub instructions, tag will be broadcast
What about the xor instruction?
Say that r6’s ready bit was forcefully set to 0

ld r1, 8[r4]
add r2, r1, r1
sub r4, r3, r2
xor r5, r6, r7

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 29

Orphan Instructions - II
Keep track of squashed instructions.
Re-broadcast tags of orphan

instructions.
? We need to dynamically detect which instructions are orphans.

Advanced Computer Architecture. Smruti R. Sarangi

Impractical Method

Better Approach

Let the orphan instruction reach the head of the ROB
Execute and commit it.

Слайд 30

Delayed Selective Replay
Let us now propose an idea to replay only those

instructions that are in the forward slice of the misspeculated load
Let us extend the non-selective replay scheme
At the time of asserting the kill signal, plant a poison bit in the destination register of the load
Propagate the bit along the bypass paths and through the register file
If an instruction reads any operand whose poison bit is set, then the instruction’s poison bit and its destination register are also set.
When an instruction finishes execution ?

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 31

Delayed Selective Replay - II
When an instruction finishes execution ?
Check if its

poison bit is set.
If yes, squash it
If no, remove the instruction from the IW (it is verified to be correct)
Issues with this scheme
It is effective, but assumes that we know the value: n
This might not be possible all the time
Instructions in the WV that have not been issued might become orphans

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 32

Orphan Instructions
We can always wait for the instruction to reach the head

of the ROB.
Another scheme: Let’s say instruction J was orphaned because one of its operands (woken up by inst. K) was reset back to a non-ready state.
Instruction K will later come back to rescue J, via broadcasting on the completion bus.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 33

Token Based Selective Replay
Let us use a pattern found in most programs:
Most

of the misses in the data cache are accounted for by a relatively small number of instructions
90/10 thumb rule ? 90% of the misses are accounted for by 10% of instructions
Predictor ? Given a PC, predict if it will lead to a d-cache miss
Use a predictor similar to a branch predictor at the fetch stage

Prediction
Table

Hit/miss prediction

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 34

After Predicting a d-cache Miss
Instructions that are predicted to miss, will have

a non-deterministic execution time (most likely) and lead to replays (set S1)
Other instructions will not lead to replays (most likely) (set S2)
Let us consider an instruction in set S1
At decode time, let the instruction collect a free token
Save the id of the token in the instruction packet
Example: assume the instruction: ld r1, 4[r4] is predicted to miss
Save the id of the token in the instruction packet of this instruction
Say that the instruction gets token #5
This instruction is the token head for token #5
Let us propagate this information to all the instructions dependent on the load
If this load fails, all the dependent instructions fail as well

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 35

Structure of the Rename Table
If an instruction is a token head, we

save the id of the token that it owns in the instruction packet
Assume we have a maximum of N tokens.
tokenVec is an N-bit vector
For the token head instruction, if it owns the ith token, set the ith bit to true in tokenVec
Tokens are propagated the same way as poison bits

rename
table

phy. reg

tokenVec

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 36

While reading the rename table ...
Read the tokenVecs of the source operands
Merge

the tokenVecs of the source operands
Save the merged tokenVec for the destination register (in the rename table)

add r1, r2, r3

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 37

After execution
After the token head instruction completes execution, see if it took

additional cycles (verification of latency speculation)
If YES, broadcast the token id to signal a replay (Case 1)
If NO, broadcast the token id to all the instructions. They can turn the corresponding bit off. (Case 2)

Case 1

Replay

Case 2

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 38

Instructions in S2
Assume an instruction that was not predicted to miss actually

misses
No token is attached to it
Wait till it reaches the head of the ROB; flush the pipeline.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 39

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Techniques

EPIC based Techniques: Intel Itanium

Слайд 40

A Simpler Design
Physical Register File (PRF) based design
Advanced Computer Architecture. Smruti R.

Sarangi

Fast and efficient

State recovery is complex

Physical register management is onerous

Architectural Register File (ARF) based design

Have a dedicated architectural register file that stores the committed state

Enhance the ROB to store uncommitted values

Слайд 41

Let us now look at a different kind of OOO processor
Instead of

having a physical register file, let us have an architectural register file(ARF)
A 16-entry architectural register file that contains the committed architectural state.

Decode

Renaming

ARF

ROB

Committed State

Temporary State

Commit Time

Rest remains the same

Write back uncommitted results

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 42

Changes to renaming
Entry in the RAT table
ROB id
ROB/RF bit
ROB/RF bit ? 1

(value in the ROB), 0 (value in the ARF)
Use the ROB if the ROB/RF bit indicates that the value might be there in the ROB
Entry in the ROB: (ready bit indicates if the value is in the ROB (1) or being generated in the pipeline (0))

Instruction

value dest

ready dest

Entry in the RAT table

Entry in the ROB

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 43

Changes to Dispatch and Wakeup
Each entry in the IW now stores the

values of the operands
Reason: We will not be accessing the RF again
What is the tag in this case?
It is not the id of the physical register.
It is the id of the ROB entry.
What else?
Along with the tag, we need to broadcast the value of the operand, if we will not get the value from the bypass network
This will make the circuit slower

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 44

Changes to Wakeup, Bypass, Reg. Write and Commit
We can follow the same

speculative wakeup strategy and broadcast a tag (in this case, id of ROB entry) immediately after an instruction is selected. Tags+values are broadcast when the instruction is in the write-back stage.
Instructions directly proceed from the select unit to the execution units
All tags are ROB ids.
After execution, we write the result to the ROB entry
Commit is simple. We always have the architectural state in the ARF.
We just need to flush the ROB.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 45

PRF based design vs ARF based design
points in the PRF

based design
A value resides in only a single location (PRF). Multiple copies of values are never maintained. In a 64-bit machine, a value is 64 bits wide.
Each entry in the IW is smaller (values are not saved).
The broadcast also uses 7-bit tags
Restoring state is complicated
points in the ARF based design
Recovery from misspeculation is easy
We do not need a free list
Values are stored at multiple places (ARF, ROB, IW)

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 46

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Techniques

EPIC based Techniques: Intel Itanium

Слайд 47

Compiler based Optimizations
Can the compiler optimize the code?
Advanced Computer Architecture. Smruti

R. Sarangi

Слайд 48

Constant Folding
Advanced Computer Architecture. Smruti R. Sarangi
We can directly replace a with

10, b with 20, and c with 400

Слайд 49

Strength Reduction
Advanced Computer Architecture. Smruti R. Sarangi
slow
fast

Слайд 50

Common Subexpression Elimination
Each line in the second example corresponds to one line

of assembly code.
We do not compute (a+b) many times.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 51

Dead Code Elimination
Advanced Computer Architecture. Smruti R. Sarangi
Dead code

Слайд 52

Silent Stores
Silent stores write the same value that is already present
Advanced Computer

Architecture. Smruti R. Sarangi

Silent
store

Слайд 53

Advanced Computer Architecture. Smruti R. Sarangi
Loop Based Optimizations

Слайд 54

Loop Invariant based Code Motion
There is no point setting (val = 5)

repeatedly.

Advanced Computer Architecture. Smruti R. Sarangi

Original

Loop Invariants Moved

Слайд 55

Induction Variable based Optimization
Advanced Computer Architecture. Smruti R. Sarangi
Original
Induction
variable
Replace
a multiply
with an add
Optimized
An

add operation is faster than a multiply operation. Hence, it makes sense to replace multiplies with adds.

-6

Слайд 56

Loop Fusion
Advanced Computer Architecture. Smruti R. Sarangi
Original
Optimized
Fuse the loops
Loop fusion reduces the

instruction count and the number of branches
significantly

Слайд 57

Loop Unrolling - I
Advanced Computer Architecture. Smruti R. Sarangi
Original loop
Assembly code

Слайд 58

Advanced Computer Architecture. Smruti R. Sarangi
Loop Unrolling - II
Advantage: fewer total instructions

and specifically fewer branch instructions

Слайд 59

Advanced Computer Architecture. Smruti R. Sarangi
Software Pipelining

Слайд 60

Advanced Computer Architecture. Smruti R. Sarangi
L
S
I

Слайд 61

Visualization of the Execution Process
Advanced Computer Architecture. Smruti R. Sarangi
We can create

our loops differently. Execute instructions from different iterations.

Слайд 62

Can we execute instructions in this order?
Advanced Computer Architecture. Smruti R.

Sarangi

I0 ? S1 ? L2

I1 ? S2 ? L3

I2 ? S3 ? L4

Order of
operations
in a row

Treat each row as a pipeline stage. Execute
instructions from different
iterations roughly at the same time.

Слайд 63

Advantages of Software Pipelining
Consider this order:
I0 ? S1 ? L2 ?

I1 ? S2 ? L3 ? I2 ? S3 ? L4
The gap between the L, S, and I blocks is one block
This means that we can absorb delays
We can accommodate multi-cycle loads without stalls
The blocks I, S, and L can possibly be executed concurrently
There is a problem
How do we execute three blocks (belonging to different iterations) possibly concurrently?
Solution: Use different loop iterators

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 64

Different Loop Iterators: Group of 3 iterations
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 65

Code with Different Loop Iterators
Advanced Computer Architecture. Smruti R. Sarangi
Unroll the loop

3 times

We cannot execute S1
and L2 in parallel because
of the dependence on r5

Слайд 66

Advanced Computer Architecture. Smruti R. Sarangi
If we had 32 registers, we could

do this very easily
and elegantly

Слайд 67

Epilogue and Prologue
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 68

Solution without Unrolling
Advanced Computer Architecture. Smruti R. Sarangi
i = -1; t =

B[0];
.loop if (i < 298) {
i++;
A[i] = t;
t = B[i+1];
}
A[299] = t;

Слайд 69

Unrolling and Mixing
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 70

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Techniques

EPIC based Techniques: Intel Itanium

Слайд 71

.
Sounds like a promising idea …
Less hardware ? less power, less complexity
Modern

software is quite fast and quite intelligent
Basic idea:
Create bundles of several instructions (using the compiler)
Schedule a bundle in one go
Assume that all dependences are handled.

Advanced Computer Architecture. Smruti R. Sarangi

Can we outsource the work of renaming and scheduling
to the compiler?

Слайд 72

VLIW Processors
VLIW (Very Long Instruction Word) processors were the first designs in

this space.
Bundle instructions into long words
If an instruction is 4 bytes, bundle 4 into a 16-byte word
Schedule and execute all instructions together
Problems caused by
Conditional if statements – control flow not predictable
Memory instructions – addresses are computed at runtime

Advanced Computer Architecture. Smruti R. Sarangi

Basic philosophy of many VLIW processors

It is the compiler’s job to ensure correctness

Слайд 73

If Statements: Predicated Execution
Use predicated execution (remember GPUs).
Advanced Computer Architecture. Smruti R.

Sarangi

If Statements

There maybe a branch in the bundle
If it is taken, the rest of the instructions are invalid
Mark them with an invalid bit
Let these instructions pass through the pipeline (just don’t process them)
Remember predicated execution in GPUs

Слайд 74

Curious Case of Memory Instructions
We can have multiple memory instructions in a

bundle
The addresses are computed at runtime
In this case, we have a hazard
Same is the case for two store instructions, and a load ? store dependence

Advanced Computer Architecture. Smruti R. Sarangi

st r1, 8[r2]
ld r3, 8[r4]

Avoid such situations in software or
hardware

Слайд 75

VLIW vs EPIC
Advanced Computer Architecture. Smruti R. Sarangi
Given that VLIW processors do

not necessarily guarantee correctness, their usability is limited
Mostly used in digital signal processors
VLIW processors have been replaced by EPIC processors
EPIC ? Explicitly Parallel Instruction Computing
They guarantee correctness
Irrespective of the compiler

Слайд 76

Intel Itanium Processor
Unique collaboration between Intel and HP
Aim:
EPIC processor
Designed to leverage

the best of software and hardware
Targeted the server market
Primarily gets rid of the scheduler: instruction window, wakeup, select, and broadcast
The branch predictor, decode unit, execute units, and advanced load-store handling are still required

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 77

Fetch Stage
Each bundle contains 3 instructions
The decoupling buffer can hold 8 such

bundles

Advanced Computer Architecture. Smruti R. Sarangi

Read 32 bytes from the i-cache

We can fit six instructions

Bundle 1

Bundle 2

Decoupling Buffer

Слайд 78

Branch Predictors
Itanium has four types of branch predictors
Compiler directed
Four special registers: Target

Address Registers (TARs)
The compiler populates them.
Contain a PC and a target
Whenever the current PC matches the PC in a TAR ? predict taken and jump to the target
Traditional Predictor
Large PAp predictor

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 79

Branch Predictors – II
Multi-way Branches
Compilers ensure that (typically) the last instruction

in a bundle is a branch
If there are multiway branches: there are many possible targets for a given bundle
Predict the first instruction that is most likely a taken branch and then predict its target
Loop Exit Predictor
The compiler marks the loop instruction
It also populates the register with the loop iteration count
The predictor keeps decrementing the loop count till it reaches 0. Then it predicts a loop exit.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 80

This part of the pipeline
Itanium has 9 issue ports: 2 for memory,

2 for integer, 2 for floating point, 3 for branch instructions
Disperse the instructions ? map instructions to issue ports
Data hazards:
Option 1: Avoid data hazards in a bundle or put nop instructions or forward results.
Option 2: Use stop bits. Instructions between two instructions with their stop bits set to 1 are independent of each other.
Structural hazards: Each bundle indicates the resources that it requires. This information is used to avoid structural hazards.

Advanced Computer Architecture. Smruti R. Sarangi

Instruction
Fetch

Instruction
Dispersal

Слайд 81

Register Remapping Stage
Large 128-entry register file.
Advanced Computer Architecture. Smruti R. Sarangi
32 static

registers

96 stacked registers

Visible to all functions

Limited visibility across functions

Allocate different sets of virtual registers to each function.
This avoids spilling.

Слайд 82

Example: Function foo calls function bar
Advanced Computer Architecture. Smruti R. Sarangi
We deliberately

create an overlap in the virtual register set to pass
parameters.

Слайд 83

Register Stack Frame
The in and local registers are preserved across function calls.
The

out registers are used to send parameters to callee functions.
An alloc instruction automatically creates such a register stack frame.
Communicating return values.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 84

Binary Search
Advanced Computer Architecture. Smruti R. Sarangi
No processing done after receiving the

return value. Just pass it on.

This is known as tail recursion

Слайд 85

Register Stack Frame
The in and local registers are preserved across function calls.
The

out registers are used to send parameters to callee functions.
An alloc instruction automatically creates such a register stack frame.
Communicating return values.
Store the return values in a static register
In this case, directly jump to the return address in the main function.
We don’t need to process return values.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 86

Support for Software Pipelining and Overflows
Main Problem: We run out of registers

Itanium has a Register Stack Engine (RSE)
Automatically handles the spilling of registers to memory and restoring them
Software Pipelining
We use separate registers for the same variable across different iterations.
This issue is taken care of automatically
Notion of a rotating register set
Assign registers based on the loop iteration number
Easier to write the code of SW-pipelined loops

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 87

High Performance Execution Engine
Advanced Computer Architecture. Smruti R. Sarangi
Scoreboard
Simple mechanism for OOO

execution
Makes instructions wait till it is safe to execute them
finished field ? 1 if it has finished its execution, 0 otherwise.
fu ? Name of the functional unit

Слайд 88

Conditions: Instruction I
Advanced Computer Architecture. Smruti R. Sarangi
WAW Hazards
Check all the

earlier entries
For each earlier entry E the following expression should be false

WAR Hazards

Check all the earlier entries
For each earlier entry E the following expression should be false

Слайд 89

Conditions: II
Instructions wait in the scoreboard until they are safe
No hazards
Advanced Computer

Architecture. Smruti R. Sarangi

RAW Hazards

Check all the earlier entries
For each earlier entry E the following expression should be false

Structural Hazards

1. For each earlier entry E

Слайд 90

Predication
If we flush the pipeline upon a branch misprediction
It would be quite

unfair
Let the if statement just be used to mark an instruction with the result of the comparison
Store the result in a flags register
The rest of the instructions are processed regardless of the branch outcome
Some results modify the architectural state, many do not

Advanced Computer Architecture. Smruti R. Sarangi

Consider the following piece of code

if( rand () %2 == 0)
x = y;
else
x = z;

Слайд 91

Code without Predication
Count the number of branch instructions.
Advanced Computer Architecture. Smruti R.

Sarangi

/* mappings : x <-> r1 , y <-> r2 , z <-> r3 */
mod r0 , r0 , 2 /* assume r0 contains the output of rand () ,
compute the remainder when dividing it by 2 */
cmp r0 , 0 /* compare */
beq . even
mov r1 , r3 /* odd case */
b. exit
.even :
mov r1 , r2 /* even case */
. exit :

Слайд 92

Predicated Instructions
The comparison generates predicates (flags)
po ? number is odd, pe ?

number is even
If the predicate is correct, the instruction gets executed, otherwise not
Itanium sets and maintains the predicate registers
An instruction is executed if all the predicate registers are set to 1 for the instruction.

Advanced Computer Architecture. Smruti R. Sarangi

/* mappings : x <-> r1 , y <-> r2 , z <-> r3 */
mod r0 , r0 , 2 /* assume r0 contains the output of rand ()
compute the remainder when dividing it by */
po ,pe = cmp r0 , 0 /* compare and set the predicates */
[po] mov r1 , r3 /* odd case */
[pe] mov r1 , r2 /* even case */

Слайд 93

Advanced Computer Architecture. Smruti R. Sarangi
Pipeline

Слайд 94

Load Boosting
Boost a load and some instructions that use its value to

a point before “where it appears in the code”.
Loads are almost always on the critical path ? hence, boosting them is beneficial because they can get their data early
Put the load address in the ALAT
Advanced Load Address Table
Subsequently, each store checks the ALAT for a match, and marks it (upon an address match)
Put a load-check (ld.c) instruction at the original point
Check the ALAT
If there have been no intervening stores to the same address, the speculation is successful.
Else, re-execute the load and its boosted forward slice

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 95

Advanced Computer Architecture. Smruti R. Sarangi
A host of compiler optimizations can be

used to speed up
programs and improve their memory access behavior.

EPIC processors guarantee correctness as well as follow the VLIW model that gives primacy to the compiler.

We can use the ROB as the physical register file and use it to buffer temporary values. The ARF can contain the committed state.

There are three methods of replaying instructions: non-selective, delayed selective and token-based replay

There are four kinds of aggressive speculation: load address, dependence, latency, and value speculation.

Conclusion

Chapter_5-alternative-approaches

Содержание

Background Required to Understand this ChapterAdvanced Computer Architecture. Smruti R. SarangiChapter 4

ContentsAdvanced Computer Architecture. Smruti R. SarangiSimpler Version of an OOO ProcessorCompiler based

Aggressive Speculation Branch prediction is one form of speculation If we detect that

Types of Aggressive SpeculationAdvanced Computer Architecture. Smruti R. Sarangi

Address Speculation: Predict the memory address of a load or storePredict last

Stride based Address PatternAdvanced Computer Architecture. Smruti R. Sarangi

Predicting the StrideLast address (A): The memory address computed the last time

Load-Store Dependence SpeculationAdvanced Computer Architecture. Smruti R. SarangiPredict a collision (same memory

Collision History TableLoads show consistent behaviorThey are either colliding or non-collidingAdvanced Computer

Using the CHT When we compute the address of a loadWe access the

Store SetsAdvanced Computer Architecture. Smruti R. SarangiExplicitly remember load-store dependencesPC ? Store

Basic IdeaFor every load, we have an associated store setStores that have

Load Latency SpeculationA load might hit in the L1 cache (2 cycles) or

Make a guessAdvanced Computer Architecture. Smruti R. SarangiFor load instructions, predict if

Advanced Computer Architecture. Smruti R. SarangiConstantsValue prediction: Why are values predictable?

Value PredictorAdvanced Computer Architecture. Smruti R. Sarangi

Using an additional predictor for confidenceFirst, use the confidence table to find

ContentsAdvanced Computer Architecture. Smruti R. SarangiSimpler Version of an OOO ProcessorCompiler based

Replay Flushing the pipeline for every misspeculation is not a wise thingInstead, flush

Forward Slice of Instruction I0Advanced Computer Architecture. Smruti R. SarangiA forward slice

Non-Selective ReplayTrivial Solution: Flush the pipeline between the dispatch and execute stages

ExampleLet us say that instructions 2, 3, and 4 had one operand

Instruction Window EntryWhen an operand becomes ready, we set its timer to

More about Non-Selective ReplayWe attach the expected latency with each instruction packet

Two methods of replayingMethod 1: Keep instructions that have been issued in

Two methods of replaying - IIMove the instructions to a dedicated replay

Orphan InstructionsAssume that the load instruction misses in the L1 cacheThe add,

Orphan Instructions - IIKeep track of squashed instructions. Re-broadcast tags of orphan

Delayed Selective ReplayLet us now propose an idea to replay only those

Delayed Selective Replay - IIWhen an instruction finishes execution ?Check if its

Orphan InstructionsWe can always wait for the instruction to reach the head

Token Based Selective ReplayLet us use a pattern found in most programs:Most

After Predicting a d-cache MissInstructions that are predicted to miss, will have

Structure of the Rename TableIf an instruction is a token head, we

While reading the rename table ...Read the tokenVecs of the source operandsMerge

After executionAfter the token head instruction completes execution, see if it took

Instructions in S2Assume an instruction that was not predicted to miss actually

ContentsAdvanced Computer Architecture. Smruti R. SarangiSimpler Version of an OOO ProcessorCompiler based

A Simpler DesignPhysical Register File (PRF) based designAdvanced Computer Architecture. Smruti R.

Let us now look at a different kind of OOO processorInstead of

Changes to renamingEntry in the RAT tableROB idROB/RF bitROB/RF bit ? 1

Changes to Dispatch and Wakeup Each entry in the IW now stores the

Changes to Wakeup, Bypass, Reg. Write and CommitWe can follow the same

PRF based design vs ARF based design points in the PRF

ContentsAdvanced Computer Architecture. Smruti R. SarangiSimpler Version of an OOO ProcessorCompiler based

Compiler based OptimizationsCan the compiler optimize the code? Advanced Computer Architecture. Smruti

Constant FoldingAdvanced Computer Architecture. Smruti R. SarangiWe can directly replace a with

Strength ReductionAdvanced Computer Architecture. Smruti R. Sarangislowfast

Common Subexpression EliminationEach line in the second example corresponds to one line

Dead Code EliminationAdvanced Computer Architecture. Smruti R. SarangiDead code

Silent StoresSilent stores write the same value that is already presentAdvanced Computer

Advanced Computer Architecture. Smruti R. SarangiLoop Based Optimizations

Loop Invariant based Code MotionThere is no point setting (val = 5)

Induction Variable based OptimizationAdvanced Computer Architecture. Smruti R. SarangiOriginalInductionvariableReplacea multiplywith an addOptimizedAn

Loop FusionAdvanced Computer Architecture. Smruti R. SarangiOriginalOptimizedFuse the loopsLoop fusion reduces the

Loop Unrolling - IAdvanced Computer Architecture. Smruti R. SarangiOriginal loopAssembly code

Advanced Computer Architecture. Smruti R. SarangiLoop Unrolling - IIAdvantage: fewer total instructions

Advanced Computer Architecture. Smruti R. SarangiSoftware Pipelining

Advanced Computer Architecture. Smruti R. SarangiLSI

Visualization of the Execution ProcessAdvanced Computer Architecture. Smruti R. SarangiWe can create

Can we execute instructions in this order? Advanced Computer Architecture. Smruti R.

Advantages of Software PipeliningConsider this order: I0 ? S1 ? L2 ?

Different Loop Iterators: Group of 3 iterationsAdvanced Computer Architecture. Smruti R. Sarangi

Code with Different Loop IteratorsAdvanced Computer Architecture. Smruti R. SarangiUnroll the loop

Advanced Computer Architecture. Smruti R. SarangiIf we had 32 registers, we could

Epilogue and PrologueAdvanced Computer Architecture. Smruti R. Sarangi

Solution without UnrollingAdvanced Computer Architecture. Smruti R. Sarangii = -1; t =

Unrolling and MixingAdvanced Computer Architecture. Smruti R. Sarangi

ContentsAdvanced Computer Architecture. Smruti R. SarangiSimpler Version of an OOO ProcessorCompiler based

.Sounds like a promising idea …Less hardware ? less power, less complexityModern

VLIW ProcessorsVLIW (Very Long Instruction Word) processors were the first designs in

If Statements: Predicated ExecutionUse predicated execution (remember GPUs).Advanced Computer Architecture. Smruti R.

Curious Case of Memory InstructionsWe can have multiple memory instructions in a

VLIW vs EPICAdvanced Computer Architecture. Smruti R. SarangiGiven that VLIW processors do

Background Required to Understand this Chapter
Advanced Computer Architecture. Smruti R. Sarangi
Chapter 4

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Aggressive Speculation
Branch prediction is one form of speculation
If we detect that

Types of Aggressive Speculation
Advanced Computer Architecture. Smruti R. Sarangi

Address Speculation: Predict the memory address of a load or store
Predict last

Stride based Address Pattern
Advanced Computer Architecture. Smruti R. Sarangi

Predicting the Stride
Last address (A): The memory address computed the last time

Load-Store Dependence Speculation
Advanced Computer Architecture. Smruti R. Sarangi
Predict a collision (same memory

Collision History Table
Loads show consistent behavior
They are either colliding or non-colliding
Advanced Computer

Using the CHT
When we compute the address of a load
We access the

Store Sets
Advanced Computer Architecture. Smruti R. Sarangi
Explicitly remember load-store dependences
PC ? Store

Basic Idea
For every load, we have an associated store set
Stores that have

Load Latency Speculation
A load might hit in the L1 cache (2 cycles) or

Make a guess
Advanced Computer Architecture. Smruti R. Sarangi
For load instructions, predict if

Advanced Computer Architecture. Smruti R. Sarangi
Constants
Value prediction: Why are values predictable?

Value Predictor
Advanced Computer Architecture. Smruti R. Sarangi

Using an additional predictor for confidence
First, use the confidence table to find

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Replay
Flushing the pipeline for every misspeculation is not a wise thing
Instead, flush

Forward Slice of Instruction I0
Advanced Computer Architecture. Smruti R. Sarangi
A forward slice

Non-Selective Replay
Trivial Solution: Flush the pipeline between the dispatch and execute stages

Example
Let us say that instructions 2, 3, and 4 had one operand

Instruction Window Entry
When an operand becomes ready, we set its timer to

More about Non-Selective Replay
We attach the expected latency with each instruction packet

Two methods of replaying
Method 1: Keep instructions that have been issued in

Two methods of replaying - II
Move the instructions to a dedicated replay

Orphan Instructions
Assume that the load instruction misses in the L1 cache
The add,

Orphan Instructions - II
Keep track of squashed instructions.
Re-broadcast tags of orphan

Delayed Selective Replay
Let us now propose an idea to replay only those

Delayed Selective Replay - II
When an instruction finishes execution ?
Check if its

Orphan Instructions
We can always wait for the instruction to reach the head

Token Based Selective Replay
Let us use a pattern found in most programs:
Most

After Predicting a d-cache Miss
Instructions that are predicted to miss, will have

Structure of the Rename Table
If an instruction is a token head, we

While reading the rename table ...
Read the tokenVecs of the source operands
Merge

After execution
After the token head instruction completes execution, see if it took

Instructions in S2
Assume an instruction that was not predicted to miss actually

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

A Simpler Design
Physical Register File (PRF) based design
Advanced Computer Architecture. Smruti R.

Let us now look at a different kind of OOO processor
Instead of

Changes to renaming
Entry in the RAT table
ROB id
ROB/RF bit
ROB/RF bit ? 1

Changes to Dispatch and Wakeup
Each entry in the IW now stores the

Changes to Wakeup, Bypass, Reg. Write and Commit
We can follow the same

PRF based design vs ARF based design
points in the PRF

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

Compiler based Optimizations
Can the compiler optimize the code?
Advanced Computer Architecture. Smruti

Constant Folding
Advanced Computer Architecture. Smruti R. Sarangi
We can directly replace a with

Strength Reduction
Advanced Computer Architecture. Smruti R. Sarangi
slow
fast

Common Subexpression Elimination
Each line in the second example corresponds to one line

Dead Code Elimination
Advanced Computer Architecture. Smruti R. Sarangi
Dead code

Silent Stores
Silent stores write the same value that is already present
Advanced Computer

Advanced Computer Architecture. Smruti R. Sarangi
Loop Based Optimizations

Loop Invariant based Code Motion
There is no point setting (val = 5)

Induction Variable based Optimization
Advanced Computer Architecture. Smruti R. Sarangi
Original
Induction
variable
Replace
a multiply
with an add
Optimized
An

Loop Fusion
Advanced Computer Architecture. Smruti R. Sarangi
Original
Optimized
Fuse the loops
Loop fusion reduces the

Loop Unrolling - I
Advanced Computer Architecture. Smruti R. Sarangi
Original loop
Assembly code

Advanced Computer Architecture. Smruti R. Sarangi
Loop Unrolling - II
Advantage: fewer total instructions

Advanced Computer Architecture. Smruti R. Sarangi
Software Pipelining

Advanced Computer Architecture. Smruti R. Sarangi
L
S
I

Visualization of the Execution Process
Advanced Computer Architecture. Smruti R. Sarangi
We can create

Can we execute instructions in this order?
Advanced Computer Architecture. Smruti R.

Advantages of Software Pipelining
Consider this order:
I0 ? S1 ? L2 ?

Different Loop Iterators: Group of 3 iterations
Advanced Computer Architecture. Smruti R. Sarangi

Code with Different Loop Iterators
Advanced Computer Architecture. Smruti R. Sarangi
Unroll the loop

Advanced Computer Architecture. Smruti R. Sarangi
If we had 32 registers, we could

Epilogue and Prologue
Advanced Computer Architecture. Smruti R. Sarangi

Solution without Unrolling
Advanced Computer Architecture. Smruti R. Sarangi
i = -1; t =

Unrolling and Mixing
Advanced Computer Architecture. Smruti R. Sarangi

Contents
Advanced Computer Architecture. Smruti R. Sarangi
Simpler Version of an OOO Processor
Compiler based

.
Sounds like a promising idea …
Less hardware ? less power, less complexity
Modern

VLIW Processors
VLIW (Very Long Instruction Word) processors were the first designs in

If Statements: Predicated Execution
Use predicated execution (remember GPUs).
Advanced Computer Architecture. Smruti R.

Curious Case of Memory Instructions
We can have multiple memory instructions in a

VLIW vs EPIC
Advanced Computer Architecture. Smruti R. Sarangi
Given that VLIW processors do