Chapter_2_ooo

Февраль 28, 2021

Содержание

2. Background Required to Understand this Chapter Advanced Computer Architecture. Smruti R. Sarangi http://www.cse.iitd.ac.in/~srsarangi/archbooksoft.html
3. Advanced Computer Architecture. Smruti R. Sarangi
4. A Simplified Diagram of a Processor with 5 Stages Advanced Computer Architecture. Smruti R. Sarangi Fetch
5. Pipelines For more efficiency, we can pipeline the design. This will eliminate idleness in the processor.
6. Pipelined Version of the Processor Advanced Computer Architecture. Smruti R. Sarangi Note the positions of the
7. Problems with In-order Pipelines Hazards Structural Hazards ? Two instructions vie for the same resource (NOT
8. Pipeline Diagrams Advanced Computer Architecture. Smruti R. Sarangi add r1, r2, r3 add r4, r1, r3
9. Pipeline Interlocks Advanced Computer Architecture. Smruti R. Sarangi IF OF EX MA RW 1 1 1
10. Forwarding from the MA to the EX stage ? No stalls Advanced Computer Architecture. Smruti R.
11. Forwarding Multiplexers Advanced Computer Architecture. Smruti R. Sarangi ALU Latch Memory access unit Forwarded input Input
12. We need 4 Forwarding Paths Advanced Computer Architecture. Smruti R. Sarangi Forward as late as possible
13. Final View of the Pipelined Processor with Forwarding Multiplexers Advanced Computer Architecture. Smruti R. Sarangi We
14. Data Hazards in In-order Pipelines with Forwarding ld r4, 4[r0] add r5, r4, 1 Need the
15. Solution: Stall the Pipeline ld r4, 4[r0] add r5, r4, 1 Cycle N Cycle N+1 ld
16. Control Hazards beq .label add r1, r2, r3 beq .label sub r5, r6, r7 We know
17. Advanced Computer Architecture. Smruti R. Sarangi
18. Performance Equation - I Is Computer A faster that Computer B Wrong Answers: More is the
19. Performance Equation - II IPC is the number of instructions per cycle Let us loosely refer
20. So, what does performance depend on … #instructions in the program Depends on the compiler Frequency
21. How to improve performance? There are 3 factors: IPC, #instructions, and frequency #instructions is dependent on
22. What about frequency? What is frequency dependent on … Frequency = 1 / clock period Clock
23. Limits to Increasing Frequency Assume that we have the fastest possible transistors Can we increase the
24. Limits to increasing frequency - II What does it mean to have a very high frequency?
25. How many pipeline stages can we have? We are limited by the latch delay Even with
26. Pipeline Stages vs IPC CPI = CPIideal + stall_rate * stall_penalty The stall rate will remain
27. Summary: Why we cannot increase frequency by increasing the number of pipeline stages? Advanced Computer Architecture.
28. Since we cannot increase frequency … Increase IPC Advanced Computer Architecture. Smruti R. Sarangi
29. Increase IPC Issue more instructions per cycle 2, 4, or 8 instructions Make it a superscalar
30. In-order Superscalar Processor Have multiple in-order pipelines. Inst. Fetch Operand Fetch Execute Memory Access Register Write-back
31. In-order Superscalar Processor - II There can be dependences between instructions Have O(n2) forwarding paths for
32. Contents Advanced Computer Architecture. Smruti R. Sarangi
33. What to do ... Don’t follow program order Too many dependences mov r1, 1 add r3,
34. Execute out of order mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov
35. Continuation ... mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5, 1
36. Basic Principle of OOO Processors Advanced Computer Architecture. Smruti R. Sarangi ILP Instruction level parallelism The
37. Revisit the Example mov r1, 1 add r3, r1, r2 add r4, r3, r2 mov r5,
38. Pool of Instructions: Instruction Window Needs to be large enough such that the requisite number of
39. Problems with creating an Instruction Pool Typically 1 in 5 instructions is a branch Predict the
40. Advanced Computer Architecture. Smruti R. Sarangi Motivation for Branch Prediction 1 2 3 4 We need
41. The Maths of Branch Prediction Advanced Computer Architecture. Smruti R. Sarangi
42. Advanced Computer Architecture. Smruti R. Sarangi For (n=100) : A plot of Pn vs p If
43. Advanced Computer Architecture. Smruti R. Sarangi If we need a large instruction window, we need a
44. Advanced Computer Architecture. Smruti R. Sarangi Nature of Dependences
45. Dependences between Instructions Program Order Dependence mov r1, 1 mov r2, 2 One instruction appears after
46. Data Dependences RAW ? Read after Write Dependence (True dependence) mov r1, 1 add r3, r1,
47. Data Dependences - II WAW ? Write after Write Dependence (Output dependence) mov r1, 1 add
48. Data Dependences - III WAR ? Write after Read Dependence (Anti dependence) add r1, r2, r3
49. Control Dependences The add instruction is control dependent on the branch(beq) instruction If the branch is
50. Basic Results In-order processors respect all program order dependences. Thus, they automatically respect all data and
51. Can output and anti dependences be removed? Don’t you think that these dependences are there because
52. Solution: Assume infinite number of physical registers mov r1, 1 add r1, r2, r3 add r4,
53. Renaming Program with real (architectural) registers Program with physical registers RAW dependences WAR dependences WAW dependences
54. Where are we now ... Fetch + Decode + Rename Instruction Memory Pool of Instructions Execution
55. Issue with Write-back To an outsider should it matter if the processor is in-order or OOO
56. Assume that there is an exception or interrupt Languages like C or Java have dedicated functions
57. Precise Exceptions Assume that the exception handler decides to do nothing and return back After this
58. Precise Exceptions - II To an external observer The execution should always be correct and as
59. Precise Exceptions - III We thus need precise exceptions Assume that the dynamic instructions in a
60. Precise Exceptions in an OOO Processor Processor (Out-of-order execution) Register File Memory Update State In program
61. Advanced Computer Architecture. Smruti R. Sarangi In-order pipelines have a limited IPC because of hazards and
63. Скачать презентацию

Background Required to Understand this Chapter
Advanced Computer Architecture. Smruti R. Sarangi
http://www.cse.iitd.ac.in/~srsarangi/archbooksoft.html

Advanced Computer Architecture. Smruti R. Sarangi

A Simplified Diagram of a Processor with 5 Stages
Advanced Computer Architecture. Smruti

R. Sarangi

Fetch

unit

Instruction

memory

Immediate

and branch

unit

file

Control

unit

op2

op1

ALU
unit

Branch

unit

flags

Memory

unit

Data

memory

write unit

Interconnection

element

Instruction

fetch stage

Operand fetch and

decode stage

Execute stage

Memory access

stage

[IF] Instruction fetch
[OF] Operand fetch and decode
[EX] Execute

[MA] Memory access
[RW] Register write-back

Pipelines
For more efficiency, we can pipeline the design. This will eliminate idleness

in the processor.

In-order Pipelines
Instructions enter the pipeline in program order

Inst. Fetch

Operand Fetch

Execute

Memory
Access

5-Stage Pipeline

Advanced Computer Architecture. Smruti R. Sarangi

Pipelined Version of the Processor
Advanced Computer Architecture. Smruti R. Sarangi
Note the positions

of the pipeline latches.

IF-OF

Fetch

unit

Instruction

memory

Immediate

and branch

unit

file

Control

unit

OF-EX

op2

op1

ALU

Unit

Branch

unit

flags

EX-MA

Memory

unit

Data

memory

MA-RW

write unit

Problems with In-order Pipelines
Hazards
Structural Hazards ? Two instructions vie for the same

resource (NOT possible in simple 5-stage pipelines)
Data Hazards ? An instruction stands to read or write the wrong data.
Control Hazards ? Instructions are fetched from the wrong path of the branch

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Advanced Computer Architecture. Smruti R. Sarangi

Pipeline Diagrams
Advanced Computer Architecture. Smruti R. Sarangi
add r1, r2, r3
add r4,

r1, r3

Pipeline Interlocks
Advanced Computer Architecture. Smruti R. Sarangi
IF
OF
EX
MA
RW
1
1
1
1
1
2
2
2
3
3
1
2
3
4
5
6
Pipeline
bubble
3
2
2
3
2
3
7
An interlock inserts a nop

instruction (bubble) in the pipeline

Forwarding from the MA to the EX stage ? No stalls
Advanced Computer

Architecture. Smruti R. Sarangi

Forwarding Multiplexers
Advanced Computer Architecture. Smruti R. Sarangi
ALU
Latch
Memory
access unit
Forwarded input
Input 1
Input 2
EX

stage

MA stage

We need 4 Forwarding Paths
Advanced Computer Architecture. Smruti R. Sarangi
Forward as late

as possible

Final View of the Pipelined Processor with Forwarding Multiplexers
Advanced Computer Architecture. Smruti

R. Sarangi

We add 6 forwarding multiplexers

Data Hazards in In-order Pipelines with Forwarding
ld r4, 4[r0]
add r5, r4,

Need the value in r4 now

Earliest it can be generated

Load-use Hazard

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Advanced Computer Architecture. Smruti R. Sarangi

Solution: Stall the Pipeline
ld r4, 4[r0]
add r5, r4, 1
Cycle N
Cycle N+1
ld

r4, 4[r0]

Cycle N+2

Advanced Computer Architecture. Smruti R. Sarangi

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Control Hazards
beq .label
add r1, r2, r3
beq .label
sub r5, r6, r7
We know the

status of the branch now. Assume it is taken.

Cancel these instructions

Two instruction slots are wasted

Advanced Computer Architecture. Smruti R. Sarangi

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Advanced Computer Architecture. Smruti R. Sarangi

Performance Equation - I
Is Computer A faster that Computer B
Wrong Answers:
More

is the clock speed, faster is the computer
More is the RAM, faster is the computer
What does it mean for computer A to be faster than computer B
Short Answer: NOTHING
Performance is always with respect to a program. You can say that a certain program runs faster on computer A as compared to computer B.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 19

Performance Equation - II
IPC is the number of instructions per cycle
Let us

loosely refer to the reciprocal of the time per program as the performance

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 20

So, what does performance depend on …
#instructions in the program
Depends on the

compiler
Frequency
Depends on the transistor technology and the architecture
If we have more pipeline stages, then the time to traverse each stage reduces roughly proportionally
Given that each stage needs to be processed in one clock cycle, smaller the stage, higher the frequency
To increase the frequency, we simply need to increase the number of pipeline stages
IPC
Depends on the architecture and the compiler
A large part of this book is devoted to this aspect.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 21

How to improve performance?
There are 3 factors:
IPC, #instructions, and frequency
#instructions

is dependent on the compiler ? not on the architecture
Let us look at IPC and frequency
IPC
What is the IPC of an in-order pipeline?

1 if there are no stalls, otherwise < 1

Advanced Computer Architecture. Smruti R. Sarangi

Methods to increase IPC

Слайд 22

What about frequency?
What is frequency dependent on …
Frequency = 1

/ clock period
Clock Period:
1 pipeline stage is expected to take 1 clock cycle
Clock period = maximum latency of the pipeline stages
How to reduce the clock period?
Make each stage of the pipeline smaller by increasing the number of pipeline stages
Use faster transistors

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 23

Limits to Increasing Frequency
Assume that we have the fastest possible transistors
Can we

increase the frequency to 100 GHz?

Reasons

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 24

Limits to increasing frequency - II
What does it mean to have a

very high frequency?
Before answering, keep these facts in mind:

Thumb Rule

P ? power
f ? frequency

Thermo-dynamics

T ? Temperature

We need to increase the number of pipeline stages ? more hazards, more forwarding paths

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 25

How many pipeline stages can we have?
We are limited by the latch

delay
Even with an infinite number of stages, the minimum clock period will be equal to the latch delay

Logic

Latch

Logic

Latch

Few stages

More stages

Even more stages

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 26

Pipeline Stages vs IPC

CPI = CPIideal + stall_rate * stall_penalty
The stall rate

will remain more or less constant per instruction with the number of pipeline stages
The stall penalty (in terms of cycles) will however increase
This will lead to a net increase in CPI and loss in IPC

Advanced Computer Architecture. Smruti R. Sarangi

As we increase the number of stages, the IPC goes down.

Слайд 27

Summary: Why we cannot increase frequency by increasing the number of pipeline

stages?

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 28

Since we cannot increase frequency …
Increase IPC
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 29

Increase IPC
Issue more instructions per cycle
2, 4, or 8 instructions
Make it a

superscalar processor ? A processor that can execute multiple instructions per cycle

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 30

In-order Superscalar Processor
Have multiple in-order pipelines.
Inst. Fetch
Operand Fetch
Execute
Memory
Access
Register
Write-back
Inst. Fetch
Operand Fetch
Execute
Memory
Access
Register
Write-back
Inst. Fetch
Operand

Fetch

Execute

Memory
Access

Instruction i

Instruction i+1

Instruction i+2

Advanced Computer Architecture. Smruti R. Sarangi

Inst. Fetch

Слайд 31

In-order Superscalar Processor - II
There can be dependences between instructions
Have O(n2)

forwarding paths for an n-issue processor
Complicated logic for detecting dependences, hazards, and forwarding
Still might not be enough ...
To get the peak IPC (= n) in an n-issue pipeline, we need to ensure that there are no stalls
There will be no stalls if there are no taken branches, and no data dependences between instructions.
Programs typically do not have such long sequences of instructions without dependences

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 32

Contents
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 33

What to do ...
Don’t follow program order
Too many dependences
mov r1,

1
add r3, r1, r2
add r4, r3, r2
mov r5, 1
add r6, r5, 1
add r8, r7, r6

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 34

Execute out of order
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov

r5, 1
add r6, r5, 1
add r8, r7, r6

Execute on a 2-issue OOO processor

Advanced Computer Architecture. Smruti R. Sarangi

Execute 2 instructions in parallel

Слайд 35

Continuation ...
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5,

1
add r6, r5, 1
add r8, r7, r6

cycle 1

cycle 2

cycle 3

issue slot 1

issue slot 2

Advanced Computer Architecture. Smruti R. Sarangi

In Out-of-order (OOO) processors, the execution is not as per program order. It is as per the data dependence order ? the consumer is executed always after the producer.

Слайд 36

Basic Principle of OOO Processors
Advanced Computer Architecture. Smruti R. Sarangi
ILP
Instruction level parallelism
The

number of ready and independent instructions
we can simultaneously execute.

Слайд 37

Revisit the Example
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5,

add r6, r5, 1

add r8, r7, r6

Pool of Instructions

Issue ready and mutually independent instructions

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 38

Pool of Instructions: Instruction Window
Needs to be large enough such that the

requisite number of mutually independent instructions can be found.
Typical instruction window sizes: 64 to 128
How do we create a large pool of instructions in a program with branches? We need to be sure that all the instructions are on the correct path

for (i = 1; i < m; i++) {
for (j = 1; j < i; j ++ ) {
if (j %2 == 0) continue;
....
}
}

Advanced Computer Architecture. Smruti R. Sarangi

Example

Слайд 39

Problems with creating an Instruction Pool
Typically 1 in 5 instructions is

a branch

Predict the directions of the branches, and their targets

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 40

Advanced Computer Architecture. Smruti R. Sarangi
Motivation for Branch Prediction
1
2
3
4
We need
high IPC
This

means that
we need a large
instruction window

It will have a
lot of branches.

We need to predict
ALL the branches
correctly.

Слайд 41

The Maths of Branch Prediction
Advanced Computer Architecture. Smruti R. Sarangi

Слайд 42

Advanced Computer Architecture. Smruti R. Sarangi
For (n=100) : A plot of Pn

vs p

If Pn = 10%, p has to be as low as 0.5% !!!

Слайд 43

Advanced Computer Architecture. Smruti R. Sarangi
If we need a large instruction window,

we need a very accurate branch predictor. The accuracy of the branch predictor limits the size of the instruction window.

Слайд 44

Advanced Computer Architecture. Smruti R. Sarangi
Nature of Dependences

Слайд 45

Dependences between Instructions
Program Order Dependence
mov r1, 1
mov r2, 2
One instruction appears after

the other in program order

The program order is the order of instructions that is perceived by a single cycle in-order processor executing the program.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 46

Data Dependences
RAW ? Read after Write Dependence (True dependence)
mov r1, 1
add

r3, r1, r2

It is a producer-consumer dependence.
The earlier instruction produces a value, and the later instruction reads it.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 47

Data Dependences - II
WAW ? Write after Write Dependence (Output dependence)
mov r1,

1
add r1, r4, r2

Two instructions write to the same location
The later instruction needs to take effect after the former

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 48

Data Dependences - III
WAR ? Write after Read Dependence (Anti dependence)
add r1,

r2, r3
add r2, r5, r6

Earlier instruction reads, later instruction writes
The later instruction needs to execute after the earlier instruction has read its values

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 49

Control Dependences
The add instruction is control dependent on the branch(beq) instruction
If the

branch is taken then only the add instruction will execute, not otherwise

beq .label
.....
.label
add r1, r2, r3

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 50

Basic Results
In-order processors respect all program order dependences. Thus, they automatically respect

all data and control dependences.

OOO processors respect only data and control dependences.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 51

Can output and anti dependences be removed?
Don’t you think that these dependences

are there because we have a finite number of registers.
What if we had an infinite number of registers?

mov r1, 1
add r5, r6, r7
add r1, r4, r2
add r8, r9, r10

add r1, r2, r3
add r5, r6, r7
add r2, r5, r6
add r8, r9, r10

Advanced Computer Architecture. Smruti R. Sarangi

Set r1

Use r1

Avatar 1

Avatar 2

Слайд 52

Solution: Assume infinite number of physical registers
mov r1, 1
add r1, r2, r3
add

r4, r1, 1
mov r2, 5
add r6, r2, r8
mov r1, 8
add r9, r1, r2

mov p11, 1
add p12, p2, p3
add p41, p12, 1
mov p21, 5
add p61, p21, p8
mov p13, 8
add p91, p13, p21

Code with architectural registers

Code with physical registers

Advanced Computer Architecture. Smruti R. Sarangi

Architectural register

Physical register

Format in this example: rx is mapped to px

Слайд 53

Renaming
Program with real (architectural) registers
Program with physical registers
RAW dependences
WAR dependences
WAW dependences
RAW dependences
Higher

instruction level parallelism (ILP)

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 54

Where are we now ...
Fetch + Decode + Rename
Instruction
Memory
Pool of Instructions
Execution Units
Issue

+ Reg.
read

Write back results

Branch Predictor

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 55

Issue with Write-back
To an outsider should it matter if the processor is

in-order or OOO
NO

Processor

Instructions

Memory

Update State

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 56

Assume that there is an exception or interrupt
Languages like C or Java

have dedicated functions that are called if there is a divide-by-zero in the code.
The question is:
What if the sub instruction has executed when we enter the exception handler?
An in-order processor will never do this.

mov r4, 10
mov r2, 0
div r3, r1, r2
sub r4, r4, 1

Divide by 0

Divide by zero handler

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 57

Precise Exceptions
Assume that the exception handler decides to do nothing and return

back
After this the sub instruction should be executed
This is exactly what will happen in an in-order processor
In an OOO processor there is a possibility that the sub inst. can execute out of order
The outsider (exception handler) will see a different view as compared to the view it will see with an in-order processor.

Regular Instructions

÷ by 0

Exception
handler

sub
inst.

Advanced Computer Architecture. Smruti R. Sarangi

Flow of actions

Слайд 58

Precise Exceptions - II
To an external observer
The execution should always be correct

and as per program order
Even in the presence of interrupts and exceptions

Regular Instructions

÷ by 0

Exception
handler

sub
inst.

Regular Instructions

÷ by 0

Exception
handler

sub
inst.

Correct

Wrong

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 59

Precise Exceptions - III
We thus need precise exceptions
Assume that the dynamic instructions

in a program (ordered in program order) are: ins1, ins2, ins3 ... insn
Assume that the processor starts the exception/interrupt handler after it has just finished writing the results of instruction: insk
Then instructions: ins1 ... insk should have executed completely and written their results to the memory/register file
AND, insk+1 and later instructions should not appear to have started their execution at all
Such an exception or interrupt is precise

ins1

ins2

ins3

insk

Exception

insk+1

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 60

Precise Exceptions in an OOO Processor
Processor
(Out-of-order execution)
Register File
Memory
Update State
In program order
Advanced Computer

Architecture. Smruti R. Sarangi

Слайд 61

Advanced Computer Architecture. Smruti R. Sarangi
In-order pipelines have a limited IPC because
of

hazards and branches

Multi-issue in-order pipelines do not solve the
problem. Reason: dependences and interlocks

Hence, we issue instructions out-of-order (OOO). We need
a large instruction window to find sufficient independent insts.

To sustain a large instruction window, we need a very
accurate branch predictor.

To expose additional ILP, we can remove WAR/WAR
hazards. Finally, we need to have precise exceptions.

Conclusion

Chapter_2_ooo

Содержание

Background Required to Understand this ChapterAdvanced Computer Architecture. Smruti R. Sarangihttp://www.cse.iitd.ac.in/~srsarangi/archbooksoft.html

Advanced Computer Architecture. Smruti R. Sarangi

A Simplified Diagram of a Processor with 5 StagesAdvanced Computer Architecture. Smruti

PipelinesFor more efficiency, we can pipeline the design. This will eliminate idleness

Pipelined Version of the ProcessorAdvanced Computer Architecture. Smruti R. SarangiNote the positions

Problems with In-order PipelinesHazardsStructural Hazards ? Two instructions vie for the same

Pipeline Diagrams Advanced Computer Architecture. Smruti R. Sarangiadd r1, r2, r3add r4,

Pipeline Interlocks Advanced Computer Architecture. Smruti R. SarangiIFOFEXMARW1111122233123456Pipelinebubble3223237An interlock inserts a nop

Forwarding from the MA to the EX stage ? No stallsAdvanced Computer

Forwarding MultiplexersAdvanced Computer Architecture. Smruti R. SarangiALULatchMemory access unitForwarded inputInput 1Input 2EX

We need 4 Forwarding PathsAdvanced Computer Architecture. Smruti R. SarangiForward as late

Final View of the Pipelined Processor with Forwarding MultiplexersAdvanced Computer Architecture. Smruti

Data Hazards in In-order Pipelines with Forwardingld r4, 4[r0] add r5, r4,

Solution: Stall the Pipelineld r4, 4[r0] add r5, r4, 1Cycle NCycle N+1ld

Control Hazardsbeq .labeladd r1, r2, r3beq .labelsub r5, r6, r7We know the

Advanced Computer Architecture. Smruti R. Sarangi

Performance Equation - IIs Computer A faster that Computer BWrong Answers: More

Performance Equation - IIIPC is the number of instructions per cycleLet us

So, what does performance depend on …#instructions in the programDepends on the

How to improve performance? There are 3 factors: IPC, #instructions, and frequency#instructions

What about frequency? What is frequency dependent on … Frequency = 1

Limits to Increasing FrequencyAssume that we have the fastest possible transistorsCan we

Limits to increasing frequency - IIWhat does it mean to have a

How many pipeline stages can we have?We are limited by the latch

Pipeline Stages vs IPC CPI = CPIideal + stall_rate * stall_penaltyThe stall rate

Summary: Why we cannot increase frequency by increasing the number of pipeline

Since we cannot increase frequency …Increase IPCAdvanced Computer Architecture. Smruti R. Sarangi

Increase IPC Issue more instructions per cycle2, 4, or 8 instructionsMake it a

In-order Superscalar ProcessorHave multiple in-order pipelines. Inst. FetchOperand FetchExecuteMemoryAccessRegisterWrite-back Inst. FetchOperand FetchExecuteMemoryAccessRegisterWrite-back Inst. FetchOperand

In-order Superscalar Processor - IIThere can be dependences between instructions Have O(n2)

ContentsAdvanced Computer Architecture. Smruti R. Sarangi

What to do ... Don’t follow program order Too many dependencesmov r1,

Execute out of ordermov r1, 1add r3, r1, r2add r4, r3, r2mov

Continuation ...mov r1, 1 add r3, r1, r2add r4, r3, r2mov r5,

Basic Principle of OOO ProcessorsAdvanced Computer Architecture. Smruti R. SarangiILPInstruction level parallelismThe

Revisit the Examplemov r1, 1add r3, r1, r2add r4, r3, r2mov r5,

Pool of Instructions: Instruction WindowNeeds to be large enough such that the

Problems with creating an Instruction Pool Typically 1 in 5 instructions is

Advanced Computer Architecture. Smruti R. SarangiMotivation for Branch Prediction1234We need high IPCThis

The Maths of Branch PredictionAdvanced Computer Architecture. Smruti R. Sarangi

Advanced Computer Architecture. Smruti R. SarangiFor (n=100) : A plot of Pn

Advanced Computer Architecture. Smruti R. SarangiIf we need a large instruction window,

Advanced Computer Architecture. Smruti R. SarangiNature of Dependences

Dependences between InstructionsProgram Order Dependencemov r1, 1mov r2, 2One instruction appears after

Data DependencesRAW ? Read after Write Dependence (True dependence)mov r1, 1 add

Data Dependences - IIWAW ? Write after Write Dependence (Output dependence)mov r1,

Data Dependences - IIIWAR ? Write after Read Dependence (Anti dependence)add r1,

Control DependencesThe add instruction is control dependent on the branch(beq) instructionIf the

Basic ResultsIn-order processors respect all program order dependences. Thus, they automatically respect

Can output and anti dependences be removed?Don’t you think that these dependences

Solution: Assume infinite number of physical registersmov r1, 1add r1, r2, r3add

RenamingProgram with real (architectural) registersProgram with physical registersRAW dependencesWAR dependencesWAW dependencesRAW dependencesHigher

Where are we now ...Fetch + Decode + RenameInstructionMemoryPool of InstructionsExecution UnitsIssue

Issue with Write-backTo an outsider should it matter if the processor is

Assume that there is an exception or interruptLanguages like C or Java

Precise ExceptionsAssume that the exception handler decides to do nothing and return

Precise Exceptions - IITo an external observerThe execution should always be correct

Precise Exceptions - IIIWe thus need precise exceptionsAssume that the dynamic instructions

Precise Exceptions in an OOO ProcessorProcessor(Out-of-order execution)Register FileMemoryUpdate StateIn program orderAdvanced Computer

Advanced Computer Architecture. Smruti R. SarangiIn-order pipelines have a limited IPC becauseof

Похожие презентации

Background Required to Understand this Chapter
Advanced Computer Architecture. Smruti R. Sarangi
http://www.cse.iitd.ac.in/~srsarangi/archbooksoft.html

A Simplified Diagram of a Processor with 5 Stages
Advanced Computer Architecture. Smruti

Pipelines
For more efficiency, we can pipeline the design. This will eliminate idleness

Pipelined Version of the Processor
Advanced Computer Architecture. Smruti R. Sarangi
Note the positions

Problems with In-order Pipelines
Hazards
Structural Hazards ? Two instructions vie for the same

Pipeline Diagrams
Advanced Computer Architecture. Smruti R. Sarangi
add r1, r2, r3
add r4,

Pipeline Interlocks
Advanced Computer Architecture. Smruti R. Sarangi
IF
OF
EX
MA
RW
1
1
1
1
1
2
2
2
3
3
1
2
3
4
5
6
Pipeline
bubble
3
2
2
3
2
3
7
An interlock inserts a nop

Forwarding from the MA to the EX stage ? No stalls
Advanced Computer

Forwarding Multiplexers
Advanced Computer Architecture. Smruti R. Sarangi
ALU
Latch
Memory
access unit
Forwarded input
Input 1
Input 2
EX

We need 4 Forwarding Paths
Advanced Computer Architecture. Smruti R. Sarangi
Forward as late

Final View of the Pipelined Processor with Forwarding Multiplexers
Advanced Computer Architecture. Smruti

Data Hazards in In-order Pipelines with Forwarding
ld r4, 4[r0]
add r5, r4,

Solution: Stall the Pipeline
ld r4, 4[r0]
add r5, r4, 1
Cycle N
Cycle N+1
ld

Control Hazards
beq .label
add r1, r2, r3
beq .label
sub r5, r6, r7
We know the

Performance Equation - I
Is Computer A faster that Computer B
Wrong Answers:
More

Performance Equation - II
IPC is the number of instructions per cycle
Let us

So, what does performance depend on …
#instructions in the program
Depends on the

How to improve performance?
There are 3 factors:
IPC, #instructions, and frequency
#instructions

What about frequency?
What is frequency dependent on …
Frequency = 1

Limits to Increasing Frequency
Assume that we have the fastest possible transistors
Can we

Limits to increasing frequency - II
What does it mean to have a

How many pipeline stages can we have?
We are limited by the latch

Pipeline Stages vs IPC

CPI = CPIideal + stall_rate * stall_penalty
The stall rate

Since we cannot increase frequency …
Increase IPC
Advanced Computer Architecture. Smruti R. Sarangi

Increase IPC
Issue more instructions per cycle
2, 4, or 8 instructions
Make it a

In-order Superscalar Processor
Have multiple in-order pipelines.
Inst. Fetch
Operand Fetch
Execute
Memory
Access
Register
Write-back
Inst. Fetch
Operand Fetch
Execute
Memory
Access
Register
Write-back
Inst. Fetch
Operand

In-order Superscalar Processor - II
There can be dependences between instructions
Have O(n2)

Contents
Advanced Computer Architecture. Smruti R. Sarangi

What to do ...
Don’t follow program order
Too many dependences
mov r1,

Execute out of order
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov

Continuation ...
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5,

Basic Principle of OOO Processors
Advanced Computer Architecture. Smruti R. Sarangi
ILP
Instruction level parallelism
The

Revisit the Example
mov r1, 1
add r3, r1, r2
add r4, r3, r2
mov r5,

Pool of Instructions: Instruction Window
Needs to be large enough such that the

Problems with creating an Instruction Pool
Typically 1 in 5 instructions is

Advanced Computer Architecture. Smruti R. Sarangi
Motivation for Branch Prediction
1
2
3
4
We need
high IPC
This

The Maths of Branch Prediction
Advanced Computer Architecture. Smruti R. Sarangi

Advanced Computer Architecture. Smruti R. Sarangi
For (n=100) : A plot of Pn

Advanced Computer Architecture. Smruti R. Sarangi
If we need a large instruction window,

Advanced Computer Architecture. Smruti R. Sarangi
Nature of Dependences

Dependences between Instructions
Program Order Dependence
mov r1, 1
mov r2, 2
One instruction appears after

Data Dependences
RAW ? Read after Write Dependence (True dependence)
mov r1, 1
add

Data Dependences - II
WAW ? Write after Write Dependence (Output dependence)
mov r1,

Data Dependences - III
WAR ? Write after Read Dependence (Anti dependence)
add r1,

Control Dependences
The add instruction is control dependent on the branch(beq) instruction
If the

Basic Results
In-order processors respect all program order dependences. Thus, they automatically respect

Can output and anti dependences be removed?
Don’t you think that these dependences

Solution: Assume infinite number of physical registers
mov r1, 1
add r1, r2, r3
add

Renaming
Program with real (architectural) registers
Program with physical registers
RAW dependences
WAR dependences
WAW dependences
RAW dependences
Higher

Where are we now ...
Fetch + Decode + Rename
Instruction
Memory
Pool of Instructions
Execution Units
Issue

Issue with Write-back
To an outsider should it matter if the processor is

Assume that there is an exception or interrupt
Languages like C or Java

Precise Exceptions
Assume that the exception handler decides to do nothing and return

Precise Exceptions - II
To an external observer
The execution should always be correct

Precise Exceptions - III
We thus need precise exceptions
Assume that the dynamic instructions

Precise Exceptions in an OOO Processor
Processor
(Out-of-order execution)
Register File
Memory
Update State
In program order
Advanced Computer

Advanced Computer Architecture. Smruti R. Sarangi
In-order pipelines have a limited IPC because
of