Chapter_5-alternative-approaches

Содержание

Слайд 2

Background Required to Understand this Chapter

Advanced Computer Architecture. Smruti R. Sarangi

Chapter 4

Background Required to Understand this Chapter Advanced Computer Architecture. Smruti R. Sarangi Chapter 4

Слайд 3

Contents

Advanced Computer Architecture. Smruti R. Sarangi

Simpler Version of an OOO Processor

Compiler based

Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO
Techniques

EPIC based Techniques: Intel Itanium

Слайд 4

Aggressive Speculation

Branch prediction is one form of speculation
If we detect that

Aggressive Speculation Branch prediction is one form of speculation If we detect
a branch has been mispredicted
Solution: flush the pipeline
This is not the only form of speculation
Another very common type: load latency speculation or value speculation
Assume that a load will hit in the cache
Speculatively wakeup instructions
Later on if this is not the case: DO SOMETHING

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 5

Types of Aggressive Speculation

Advanced Computer Architecture. Smruti R. Sarangi

Types of Aggressive Speculation Advanced Computer Architecture. Smruti R. Sarangi

Слайд 6

Address Speculation: Predict the memory address of a load or store

Predict last

Address Speculation: Predict the memory address of a load or store Predict
address scheme
Use a simple predictor

Advanced Computer Architecture. Smruti R. Sarangi
2n entries

Слайд 7

Stride based Address Pattern

Advanced Computer Architecture. Smruti R. Sarangi

Stride based Address Pattern Advanced Computer Architecture. Smruti R. Sarangi

Слайд 8

Predicting the Stride

Last address (A): The memory address computed the last time

Predicting the Stride Last address (A): The memory address computed the last
the instruction with this PC was executed.
A stride-based access pattern is followed if: current address – last address = S
Then we set the pattern bit, P
Alternatively, if P is set, we predict the next address to be
A + S

Advanced Computer Architecture. Smruti R. Sarangi

Last address (A)

Stride (S)

Pattern (P)

Слайд 9

Load-Store Dependence Speculation

Advanced Computer Architecture. Smruti R. Sarangi

Predict a collision (same memory

Load-Store Dependence Speculation Advanced Computer Architecture. Smruti R. Sarangi Predict a collision
address) between a load and a store

If there are no collisions, send the load directly to the cache.

Forward values across unresolved stores.

Слайд 10

Collision History Table

Loads show consistent behavior
They are either colliding or non-colliding

Advanced Computer

Collision History Table Loads show consistent behavior They are either colliding or
Architecture. Smruti R. Sarangi

n-bit PC

colliding or non-colliding

Collision History Table (CHT)

Слайд 11

Using the CHT

When we compute the address of a load
We access the

Using the CHT When we compute the address of a load We
CHT
If it is predicted to be colliding
Wait for all prior stores to be resolved
Else
Send the load to the d-cache
Once the address is resolved
Update the CHT, recover the state (if necessary)

Advanced Computer Architecture. Smruti R. Sarangi

We can augment it with the store?load distance (D). A load waits till there are less than D entries before it in the LSQ.

Слайд 12

Store Sets

Advanced Computer Architecture. Smruti R. Sarangi

Explicitly remember load-store dependences

PC ? Store

Store Sets Advanced Computer Architecture. Smruti R. Sarangi Explicitly remember load-store dependences
set identifier

Last fetched store in the store set

Слайд 13

Basic Idea

For every load, we have an associated store set
Stores that have

Basic Idea For every load, we have an associated store set Stores
forwarded values to it in the past
A store may be a part of a single store set

Advanced Computer Architecture. Smruti R. Sarangi

Load

Read the store set id
Get the instruction number of the latest store (S) from the LFST
The load waits for store S to get resolved and then receives the forwarded data.

Store

Read the store set id
Set the instruction number of the current store in the corresponding entry of the LFST.
Can be used to speculatively forward data to loads in its store set.

Whenever we detect a load-store dependence, we update the SSIT and LFST

Слайд 14

Load Latency Speculation

A load might hit in the L1 cache (2 cycles) or

Load Latency Speculation A load might hit in the L1 cache (2
might go to the lower levels of the memory system.
We don’t know for sure

Advanced Computer Architecture. Smruti R. Sarangi

Pipeline

L1 d-cache

L2 cache

Main memory

1-2 cycles: hit rate: 90%

10-50 cycles: hit rate: 50%

300 cycles: hit rate: 100%

Слайд 15

Make a guess

Advanced Computer Architecture. Smruti R. Sarangi

For load instructions, predict if

Make a guess Advanced Computer Architecture. Smruti R. Sarangi For load instructions,
it will hit in the data cache or not. If it will, do an early broadcast.

Design a hit-miss predictor. Same idea as branch predictors.

Слайд 16

Advanced Computer Architecture. Smruti R. Sarangi

Constants

Value prediction: Why are values predictable?

Advanced Computer Architecture. Smruti R. Sarangi Constants Value prediction: Why are values predictable?

Слайд 17

Value Predictor

Advanced Computer Architecture. Smruti R. Sarangi

Value Predictor Advanced Computer Architecture. Smruti R. Sarangi

Слайд 18

Using an additional predictor for confidence

First, use the confidence table to find

Using an additional predictor for confidence First, use the confidence table to
out if it makes sense to predict
Simultaneously, make a prediction using a predictor table (value, memory dependence, ALU result)
Predictor table can contain 1 value, or the last k values
Make a prediction, and use it if it has high confidence
Update both the tables when the results are available
If needed recover with a replay/flush mechanism

Confidence
Table
(uses sat. counters)

PC

Confidence

Predictor
Table

PC

Prediction

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 19

Contents

Advanced Computer Architecture. Smruti R. Sarangi

Simpler Version of an OOO Processor

Compiler based

Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO
Techniques

EPIC based Techniques: Intel Itanium

Слайд 20

Replay

Flushing the pipeline for every misspeculation is not a wise thing
Instead, flush

Replay Flushing the pipeline for every misspeculation is not a wise thing
a part of the pipeline (or only those instructions that have gotten a wrong value)
Replay those instructions once again (after let’s say the load completes its execution)
When the instructions are being replayed, they are guaranteed to use the correct value of the load
Identify and replay the forward slice ?

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 21

Forward Slice of Instruction I0

Advanced Computer Architecture. Smruti R. Sarangi

A forward slice

Forward Slice of Instruction I0 Advanced Computer Architecture. Smruti R. Sarangi A
contains an instruction’s consumers, its
consumers and so on.

Слайд 22

Non-Selective Replay

Trivial Solution: Flush the pipeline between the dispatch and execute stages

Non-Selective Replay Trivial Solution: Flush the pipeline between the dispatch and execute

Smarter Solution
It is not necessary to flush all the instructions between the schedule and execute stages
Try to reduce the set of instructions
Define a window of vulnerability (WV) for n cycles after a load is selected. A load should complete within n cycles if it hits in the d-cache and does not wait in the LSQ
However, if the load takes more than n cycles, we need to do a replay

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 23

Example

Let us say that instructions 2, 3, and 4 had one operand

Example Let us say that instructions 2, 3, and 4 had one
waking up in the WV of instruction 1
If there is a misspeculation, all three instructions get squashed
Instruction 1 gets reissued with the correct value later

Advanced Computer Architecture. Smruti R. Sarangi
1: ld r1, [r2]
2: add r4, r1, r3
3: add r5, r6, r7
4: add r8, r9, r10

Predict the value

squash them

Слайд 24

Instruction Window Entry

When an operand becomes ready, we set its timer to

Instruction Window Entry When an operand becomes ready, we set its timer
n
Every cycle it decrements (count down timer)
Once it becomes 0, we can conclude that this instruction will not be squashed

Set after a misspeculation

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 25

More about Non-Selective Replay

We attach the expected latency with each instruction packet

More about Non-Selective Replay We attach the expected latency with each instruction
as it flows down the pipeline
Wherever there is an additional delay (such as a cache miss)
Time for a replay
Set the kill wire
Each instruction window entry that has a non-zero timer, resets its ready flag
We now have a set of instructions that will be replayed
Methods of replaying instructions

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 26

Two methods of replaying

Method 1: Keep instructions that have been issued in

Two methods of replaying Method 1: Keep instructions that have been issued
the issue queue (see reference)

IW

Pipeline Stages

Verify

verification status

remove from the IW if verified

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 27

Two methods of replaying - II

Move the instructions to a dedicated replay

Two methods of replaying - II Move the instructions to a dedicated
queue after issue
Once an instruction is verified, remove it from the replay queue

IW

Pipeline Stages

Replay queue

Verify

remove from the IW

verification status

remove from replay
queue if verified

Method 2:

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 28

Orphan Instructions

Assume that the load instruction misses in the L1 cache
The add,

Orphan Instructions Assume that the load instruction misses in the L1 cache
sub, and xor instructions will need to be squashed, and replayed
For the add and sub instructions, tag will be broadcast
What about the xor instruction?
Say that r6’s ready bit was forcefully set to 0

ld r1, 8[r4]
add r2, r1, r1
sub r4, r3, r2
xor r5, r6, r7

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 29

Orphan Instructions - II

Keep track of squashed instructions.
Re-broadcast tags of orphan

Orphan Instructions - II Keep track of squashed instructions. Re-broadcast tags of
instructions.
? We need to dynamically detect which instructions are orphans.

Advanced Computer Architecture. Smruti R. Sarangi

Impractical Method

Better Approach

Let the orphan instruction reach the head of the ROB
Execute and commit it.

Слайд 30

Delayed Selective Replay

Let us now propose an idea to replay only those

Delayed Selective Replay Let us now propose an idea to replay only
instructions that are in the forward slice of the misspeculated load
Let us extend the non-selective replay scheme
At the time of asserting the kill signal, plant a poison bit in the destination register of the load
Propagate the bit along the bypass paths and through the register file
If an instruction reads any operand whose poison bit is set, then the instruction’s poison bit and its destination register are also set.
When an instruction finishes execution ?

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 31

Delayed Selective Replay - II

When an instruction finishes execution ?
Check if its

Delayed Selective Replay - II When an instruction finishes execution ? Check
poison bit is set.
If yes, squash it
If no, remove the instruction from the IW (it is verified to be correct)
Issues with this scheme
It is effective, but assumes that we know the value: n
This might not be possible all the time
Instructions in the WV that have not been issued might become orphans

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 32

Orphan Instructions

We can always wait for the instruction to reach the head

Orphan Instructions We can always wait for the instruction to reach the
of the ROB.
Another scheme: Let’s say instruction J was orphaned because one of its operands (woken up by inst. K) was reset back to a non-ready state.
Instruction K will later come back to rescue J, via broadcasting on the completion bus.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 33

Token Based Selective Replay

Let us use a pattern found in most programs:
Most

Token Based Selective Replay Let us use a pattern found in most
of the misses in the data cache are accounted for by a relatively small number of instructions
90/10 thumb rule ? 90% of the misses are accounted for by 10% of instructions
Predictor ? Given a PC, predict if it will lead to a d-cache miss
Use a predictor similar to a branch predictor at the fetch stage

Prediction
Table

PC

Hit/miss prediction

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 34

After Predicting a d-cache Miss

Instructions that are predicted to miss, will have

After Predicting a d-cache Miss Instructions that are predicted to miss, will
a non-deterministic execution time (most likely) and lead to replays (set S1)
Other instructions will not lead to replays (most likely) (set S2)
Let us consider an instruction in set S1
At decode time, let the instruction collect a free token
Save the id of the token in the instruction packet
Example: assume the instruction: ld r1, 4[r4] is predicted to miss
Save the id of the token in the instruction packet of this instruction
Say that the instruction gets token #5
This instruction is the token head for token #5
Let us propagate this information to all the instructions dependent on the load
If this load fails, all the dependent instructions fail as well

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 35

Structure of the Rename Table

If an instruction is a token head, we

Structure of the Rename Table If an instruction is a token head,
save the id of the token that it owns in the instruction packet
Assume we have a maximum of N tokens.
tokenVec is an N-bit vector
For the token head instruction, if it owns the ith token, set the ith bit to true in tokenVec
Tokens are propagated the same way as poison bits

rename
table

r1

phy. reg

tokenVec

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 36

While reading the rename table ...

Read the tokenVecs of the source operands
Merge

While reading the rename table ... Read the tokenVecs of the source
the tokenVecs of the source operands
Save the merged tokenVec for the destination register (in the rename table)

add r1, r2, r3

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 37

After execution

After the token head instruction completes execution, see if it took

After execution After the token head instruction completes execution, see if it
additional cycles (verification of latency speculation)
If YES, broadcast the token id to signal a replay (Case 1)
If NO, broadcast the token id to all the instructions. They can turn the corresponding bit off. (Case 2)

Case 1

Replay

Case 2

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 38

Instructions in S2

Assume an instruction that was not predicted to miss actually

Instructions in S2 Assume an instruction that was not predicted to miss
misses
No token is attached to it
Wait till it reaches the head of the ROB; flush the pipeline.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 39

Contents

Advanced Computer Architecture. Smruti R. Sarangi

Simpler Version of an OOO Processor

Compiler based

Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO
Techniques

EPIC based Techniques: Intel Itanium

Слайд 40

A Simpler Design

Physical Register File (PRF) based design

Advanced Computer Architecture. Smruti R.

A Simpler Design Physical Register File (PRF) based design Advanced Computer Architecture.
Sarangi

Fast and efficient

State recovery is complex

Physical register management is onerous

Architectural Register File (ARF) based design

Have a dedicated architectural register file that stores the committed state

Enhance the ROB to store uncommitted values

Слайд 41

Let us now look at a different kind of OOO processor

Instead of

Let us now look at a different kind of OOO processor Instead
having a physical register file, let us have an architectural register file(ARF)
A 16-entry architectural register file that contains the committed architectural state.

Decode

Renaming

ARF

ROB

IW

Register
Write

Committed State

Temporary State

Commit Time

Rest remains the same

Write back uncommitted results

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 42

Changes to renaming

Entry in the RAT table

ROB id

ROB/RF bit

ROB/RF bit ? 1

Changes to renaming Entry in the RAT table ROB id ROB/RF bit
(value in the ROB), 0 (value in the ARF)
Use the ROB if the ROB/RF bit indicates that the value might be there in the ROB
Entry in the ROB: (ready bit indicates if the value is in the ROB (1) or being generated in the pipeline (0))

Instruction

value dest

ready dest

Entry in the RAT table

Entry in the ROB

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 43

Changes to Dispatch and Wakeup

Each entry in the IW now stores the

Changes to Dispatch and Wakeup Each entry in the IW now stores
values of the operands
Reason: We will not be accessing the RF again
What is the tag in this case?
It is not the id of the physical register.
It is the id of the ROB entry.
What else?
Along with the tag, we need to broadcast the value of the operand, if we will not get the value from the bypass network
This will make the circuit slower

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 44

Changes to Wakeup, Bypass, Reg. Write and Commit

We can follow the same

Changes to Wakeup, Bypass, Reg. Write and Commit We can follow the
speculative wakeup strategy and broadcast a tag (in this case, id of ROB entry) immediately after an instruction is selected. Tags+values are broadcast when the instruction is in the write-back stage.
Instructions directly proceed from the select unit to the execution units
All tags are ROB ids.
After execution, we write the result to the ROB entry
Commit is simple. We always have the architectural state in the ARF.
We just need to flush the ROB.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 45

PRF based design vs ARF based design

points in the PRF

PRF based design vs ARF based design points in the PRF based
based design
A value resides in only a single location (PRF). Multiple copies of values are never maintained. In a 64-bit machine, a value is 64 bits wide.
Each entry in the IW is smaller (values are not saved).
The broadcast also uses 7-bit tags
Restoring state is complicated
points in the ARF based design
Recovery from misspeculation is easy
We do not need a free list
Values are stored at multiple places (ARF, ROB, IW)

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 46

Contents

Advanced Computer Architecture. Smruti R. Sarangi

Simpler Version of an OOO Processor

Compiler based

Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO
Techniques

EPIC based Techniques: Intel Itanium

Слайд 47

Compiler based Optimizations

Can the compiler optimize the code?

Advanced Computer Architecture. Smruti

Compiler based Optimizations Can the compiler optimize the code? Advanced Computer Architecture. Smruti R. Sarangi
R. Sarangi

Слайд 48

Constant Folding

Advanced Computer Architecture. Smruti R. Sarangi

We can directly replace a with

Constant Folding Advanced Computer Architecture. Smruti R. Sarangi We can directly replace
10, b with 20, and c with 400

Слайд 49

Strength Reduction

Advanced Computer Architecture. Smruti R. Sarangi

slow

fast

Strength Reduction Advanced Computer Architecture. Smruti R. Sarangi slow fast

Слайд 50

Common Subexpression Elimination

Each line in the second example corresponds to one line

Common Subexpression Elimination Each line in the second example corresponds to one
of assembly code.
We do not compute (a+b) many times.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 51

Dead Code Elimination

Advanced Computer Architecture. Smruti R. Sarangi

Dead code

Dead Code Elimination Advanced Computer Architecture. Smruti R. Sarangi Dead code

Слайд 52

Silent Stores

Silent stores write the same value that is already present

Advanced Computer

Silent Stores Silent stores write the same value that is already present
Architecture. Smruti R. Sarangi

Silent
store

Слайд 53

Advanced Computer Architecture. Smruti R. Sarangi

Loop Based Optimizations

Advanced Computer Architecture. Smruti R. Sarangi Loop Based Optimizations

Слайд 54

Loop Invariant based Code Motion

There is no point setting (val = 5)

Loop Invariant based Code Motion There is no point setting (val =
repeatedly.

Advanced Computer Architecture. Smruti R. Sarangi

Original

Loop Invariants Moved

Слайд 55

Induction Variable based Optimization

Advanced Computer Architecture. Smruti R. Sarangi

Original

Induction
variable

Replace
a multiply
with an add

Optimized

An

Induction Variable based Optimization Advanced Computer Architecture. Smruti R. Sarangi Original Induction
add operation is faster than a multiply operation. Hence, it makes sense to replace multiplies with adds.

-6

Слайд 56

Loop Fusion

Advanced Computer Architecture. Smruti R. Sarangi

Original

Optimized

Fuse the loops

Loop fusion reduces the

Loop Fusion Advanced Computer Architecture. Smruti R. Sarangi Original Optimized Fuse the
instruction count and the number of branches
significantly

Слайд 57

Loop Unrolling - I

Advanced Computer Architecture. Smruti R. Sarangi

Original loop

Assembly code

Loop Unrolling - I Advanced Computer Architecture. Smruti R. Sarangi Original loop Assembly code

Слайд 58

Advanced Computer Architecture. Smruti R. Sarangi

Loop Unrolling - II

Advantage: fewer total instructions

Advanced Computer Architecture. Smruti R. Sarangi Loop Unrolling - II Advantage: fewer
and specifically fewer branch instructions

Слайд 59

Advanced Computer Architecture. Smruti R. Sarangi

Software Pipelining

Advanced Computer Architecture. Smruti R. Sarangi Software Pipelining

Слайд 60

Advanced Computer Architecture. Smruti R. Sarangi

L

S

I

Advanced Computer Architecture. Smruti R. Sarangi L S I

Слайд 61

Visualization of the Execution Process

Advanced Computer Architecture. Smruti R. Sarangi

We can create

Visualization of the Execution Process Advanced Computer Architecture. Smruti R. Sarangi We
our loops differently. Execute instructions from different iterations.

Слайд 62

Can we execute instructions in this order?

Advanced Computer Architecture. Smruti R.

Can we execute instructions in this order? Advanced Computer Architecture. Smruti R.
Sarangi

I0 ? S1 ? L2

I1 ? S2 ? L3

I2 ? S3 ? L4

Order of
operations
in a row

Treat each row as a pipeline stage. Execute
instructions from different
iterations roughly at the same time.

Слайд 63

Advantages of Software Pipelining

Consider this order:
I0 ? S1 ? L2 ?

Advantages of Software Pipelining Consider this order: I0 ? S1 ? L2
I1 ? S2 ? L3 ? I2 ? S3 ? L4
The gap between the L, S, and I blocks is one block
This means that we can absorb delays
We can accommodate multi-cycle loads without stalls
The blocks I, S, and L can possibly be executed concurrently
There is a problem
How do we execute three blocks (belonging to different iterations) possibly concurrently?
Solution: Use different loop iterators

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 64

Different Loop Iterators: Group of 3 iterations

Advanced Computer Architecture. Smruti R. Sarangi

Different Loop Iterators: Group of 3 iterations Advanced Computer Architecture. Smruti R. Sarangi

Слайд 65

Code with Different Loop Iterators

Advanced Computer Architecture. Smruti R. Sarangi

Unroll the loop

Code with Different Loop Iterators Advanced Computer Architecture. Smruti R. Sarangi Unroll
3 times

We cannot execute S1
and L2 in parallel because
of the dependence on r5

Слайд 66

Advanced Computer Architecture. Smruti R. Sarangi

If we had 32 registers, we could

Advanced Computer Architecture. Smruti R. Sarangi If we had 32 registers, we
do this very easily
and elegantly

Слайд 67

Epilogue and Prologue

Advanced Computer Architecture. Smruti R. Sarangi

Epilogue and Prologue Advanced Computer Architecture. Smruti R. Sarangi

Слайд 68

Solution without Unrolling

Advanced Computer Architecture. Smruti R. Sarangi

i = -1; t =

Solution without Unrolling Advanced Computer Architecture. Smruti R. Sarangi i = -1;
B[0];
.loop if (i < 298) {
i++;
A[i] = t;
t = B[i+1];
}
A[299] = t;

Слайд 69

Unrolling and Mixing

Advanced Computer Architecture. Smruti R. Sarangi

Unrolling and Mixing Advanced Computer Architecture. Smruti R. Sarangi

Слайд 70

Contents

Advanced Computer Architecture. Smruti R. Sarangi

Simpler Version of an OOO Processor

Compiler based

Contents Advanced Computer Architecture. Smruti R. Sarangi Simpler Version of an OOO
Techniques

EPIC based Techniques: Intel Itanium

Слайд 71

.

Sounds like a promising idea …
Less hardware ? less power, less complexity
Modern

. Sounds like a promising idea … Less hardware ? less power,
software is quite fast and quite intelligent
Basic idea:
Create bundles of several instructions (using the compiler)
Schedule a bundle in one go
Assume that all dependences are handled.

Advanced Computer Architecture. Smruti R. Sarangi

Can we outsource the work of renaming and scheduling
to the compiler?

Слайд 72

VLIW Processors

VLIW (Very Long Instruction Word) processors were the first designs in

VLIW Processors VLIW (Very Long Instruction Word) processors were the first designs
this space.
Bundle instructions into long words
If an instruction is 4 bytes, bundle 4 into a 16-byte word
Schedule and execute all instructions together
Problems caused by
Conditional if statements – control flow not predictable
Memory instructions – addresses are computed at runtime

Advanced Computer Architecture. Smruti R. Sarangi

Basic philosophy of many VLIW processors

It is the compiler’s job to ensure correctness

Слайд 73

If Statements: Predicated Execution

Use predicated execution (remember GPUs).

Advanced Computer Architecture. Smruti R.

If Statements: Predicated Execution Use predicated execution (remember GPUs). Advanced Computer Architecture.
Sarangi

If Statements

There maybe a branch in the bundle
If it is taken, the rest of the instructions are invalid
Mark them with an invalid bit
Let these instructions pass through the pipeline (just don’t process them)
Remember predicated execution in GPUs

Слайд 74

Curious Case of Memory Instructions

We can have multiple memory instructions in a

Curious Case of Memory Instructions We can have multiple memory instructions in
bundle
The addresses are computed at runtime
In this case, we have a hazard
Same is the case for two store instructions, and a load ? store dependence

Advanced Computer Architecture. Smruti R. Sarangi

st r1, 8[r2]
ld r3, 8[r4]

Avoid such situations in software or
hardware

Слайд 75

VLIW vs EPIC

Advanced Computer Architecture. Smruti R. Sarangi

Given that VLIW processors do

VLIW vs EPIC Advanced Computer Architecture. Smruti R. Sarangi Given that VLIW
not necessarily guarantee correctness, their usability is limited
Mostly used in digital signal processors
VLIW processors have been replaced by EPIC processors
EPIC ? Explicitly Parallel Instruction Computing
They guarantee correctness
Irrespective of the compiler

Слайд 76

Intel Itanium Processor

Unique collaboration between Intel and HP
Aim:
EPIC processor
Designed to leverage

Intel Itanium Processor Unique collaboration between Intel and HP Aim: EPIC processor
the best of software and hardware
Targeted the server market
Primarily gets rid of the scheduler: instruction window, wakeup, select, and broadcast
The branch predictor, decode unit, execute units, and advanced load-store handling are still required

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 77

Fetch Stage

Each bundle contains 3 instructions
The decoupling buffer can hold 8 such

Fetch Stage Each bundle contains 3 instructions The decoupling buffer can hold
bundles

Advanced Computer Architecture. Smruti R. Sarangi

Read 32 bytes from the i-cache

We can fit six instructions

Bundle 1

Bundle 2

Decoupling Buffer

Слайд 78

Branch Predictors

Itanium has four types of branch predictors
Compiler directed
Four special registers: Target

Branch Predictors Itanium has four types of branch predictors Compiler directed Four
Address Registers (TARs)
The compiler populates them.
Contain a PC and a target
Whenever the current PC matches the PC in a TAR ? predict taken and jump to the target
Traditional Predictor
Large PAp predictor

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 79

Branch Predictors – II

Multi-way Branches
Compilers ensure that (typically) the last instruction

Branch Predictors – II Multi-way Branches Compilers ensure that (typically) the last
in a bundle is a branch
If there are multiway branches: there are many possible targets for a given bundle
Predict the first instruction that is most likely a taken branch and then predict its target
Loop Exit Predictor
The compiler marks the loop instruction
It also populates the register with the loop iteration count
The predictor keeps decrementing the loop count till it reaches 0. Then it predicts a loop exit.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 80

This part of the pipeline

Itanium has 9 issue ports: 2 for memory,

This part of the pipeline Itanium has 9 issue ports: 2 for
2 for integer, 2 for floating point, 3 for branch instructions
Disperse the instructions ? map instructions to issue ports
Data hazards:
Option 1: Avoid data hazards in a bundle or put nop instructions or forward results.
Option 2: Use stop bits. Instructions between two instructions with their stop bits set to 1 are independent of each other.
Structural hazards: Each bundle indicates the resources that it requires. This information is used to avoid structural hazards.

Advanced Computer Architecture. Smruti R. Sarangi

Instruction
Fetch

Instruction
Dispersal

Register Remapping

Слайд 81

Register Remapping Stage

Large 128-entry register file.

Advanced Computer Architecture. Smruti R. Sarangi

32 static

Register Remapping Stage Large 128-entry register file. Advanced Computer Architecture. Smruti R.
registers

96 stacked registers

Visible to all functions

Limited visibility across functions

Allocate different sets of virtual registers to each function.
This avoids spilling.

Слайд 82

Example: Function foo calls function bar

Advanced Computer Architecture. Smruti R. Sarangi

We deliberately

Example: Function foo calls function bar Advanced Computer Architecture. Smruti R. Sarangi
create an overlap in the virtual register set to pass
parameters.

Слайд 83

Register Stack Frame

The in and local registers are preserved across function calls.
The

Register Stack Frame The in and local registers are preserved across function
out registers are used to send parameters to callee functions.
An alloc instruction automatically creates such a register stack frame.
Communicating return values.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 84

Binary Search

Advanced Computer Architecture. Smruti R. Sarangi

No processing done after receiving the

Binary Search Advanced Computer Architecture. Smruti R. Sarangi No processing done after
return value. Just pass it on.

This is known as tail recursion

Слайд 85

Register Stack Frame

The in and local registers are preserved across function calls.
The

Register Stack Frame The in and local registers are preserved across function
out registers are used to send parameters to callee functions.
An alloc instruction automatically creates such a register stack frame.
Communicating return values.
Store the return values in a static register
In this case, directly jump to the return address in the main function.
We don’t need to process return values.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 86

Support for Software Pipelining and Overflows

Main Problem: We run out of registers

Support for Software Pipelining and Overflows Main Problem: We run out of

Itanium has a Register Stack Engine (RSE)
Automatically handles the spilling of registers to memory and restoring them
Software Pipelining
We use separate registers for the same variable across different iterations.
This issue is taken care of automatically
Notion of a rotating register set
Assign registers based on the loop iteration number
Easier to write the code of SW-pipelined loops

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 87

High Performance Execution Engine

Advanced Computer Architecture. Smruti R. Sarangi

Scoreboard

Simple mechanism for OOO

High Performance Execution Engine Advanced Computer Architecture. Smruti R. Sarangi Scoreboard Simple
execution
Makes instructions wait till it is safe to execute them
finished field ? 1 if it has finished its execution, 0 otherwise.
fu ? Name of the functional unit

Слайд 88

Conditions: Instruction I

Advanced Computer Architecture. Smruti R. Sarangi

WAW Hazards

Check all the

Conditions: Instruction I Advanced Computer Architecture. Smruti R. Sarangi WAW Hazards Check
earlier entries
For each earlier entry E the following expression should be false

 

WAR Hazards

Check all the earlier entries
For each earlier entry E the following expression should be false

 

Слайд 89

Conditions: II

Instructions wait in the scoreboard until they are safe
No hazards

Advanced Computer

Conditions: II Instructions wait in the scoreboard until they are safe No
Architecture. Smruti R. Sarangi

RAW Hazards

Check all the earlier entries
For each earlier entry E the following expression should be false

 

Structural Hazards

 

1. For each earlier entry E

Слайд 90

Predication

If we flush the pipeline upon a branch misprediction
It would be quite

Predication If we flush the pipeline upon a branch misprediction It would
unfair
Let the if statement just be used to mark an instruction with the result of the comparison
Store the result in a flags register
The rest of the instructions are processed regardless of the branch outcome
Some results modify the architectural state, many do not

Advanced Computer Architecture. Smruti R. Sarangi

Consider the following piece of code

if( rand () %2 == 0)
x = y;
else
x = z;

Слайд 91

Code without Predication

Count the number of branch instructions.

Advanced Computer Architecture. Smruti R.

Code without Predication Count the number of branch instructions. Advanced Computer Architecture.
Sarangi

/* mappings : x <-> r1 , y <-> r2 , z <-> r3 */
mod r0 , r0 , 2 /* assume r0 contains the output of rand () ,
compute the remainder when dividing it by 2 */
cmp r0 , 0 /* compare */
beq . even
mov r1 , r3 /* odd case */
b. exit
.even :
mov r1 , r2 /* even case */
. exit :

Слайд 92

Predicated Instructions

The comparison generates predicates (flags)
po ? number is odd, pe ?

Predicated Instructions The comparison generates predicates (flags) po ? number is odd,
number is even
If the predicate is correct, the instruction gets executed, otherwise not
Itanium sets and maintains the predicate registers
An instruction is executed if all the predicate registers are set to 1 for the instruction.

Advanced Computer Architecture. Smruti R. Sarangi

/* mappings : x <-> r1 , y <-> r2 , z <-> r3 */
mod r0 , r0 , 2 /* assume r0 contains the output of rand ()
compute the remainder when dividing it by */
po ,pe = cmp r0 , 0 /* compare and set the predicates */
[po] mov r1 , r3 /* odd case */
[pe] mov r1 , r2 /* even case */

Слайд 93

Advanced Computer Architecture. Smruti R. Sarangi

Pipeline

Advanced Computer Architecture. Smruti R. Sarangi Pipeline

Слайд 94

Load Boosting

Boost a load and some instructions that use its value to

Load Boosting Boost a load and some instructions that use its value
a point before “where it appears in the code”.
Loads are almost always on the critical path ? hence, boosting them is beneficial because they can get their data early
Put the load address in the ALAT
Advanced Load Address Table
Subsequently, each store checks the ALAT for a match, and marks it (upon an address match)
Put a load-check (ld.c) instruction at the original point
Check the ALAT
If there have been no intervening stores to the same address, the speculation is successful.
Else, re-execute the load and its boosted forward slice

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 95

Advanced Computer Architecture. Smruti R. Sarangi

A host of compiler optimizations can be

Advanced Computer Architecture. Smruti R. Sarangi A host of compiler optimizations can
used to speed up
programs and improve their memory access behavior.

EPIC processors guarantee correctness as well as follow the VLIW model that gives primacy to the compiler.

We can use the ROB as the physical register file and use it to buffer temporary values. The ARF can contain the committed state.

There are three methods of replaying instructions: non-selective, delayed selective and token-based replay

There are four kinds of aggressive speculation: load address, dependence, latency, and value speculation.

Conclusion

Имя файла: Chapter_5-alternative-approaches.pptx
Количество просмотров: 48
Количество скачиваний: 0