Содержание

Слайд 2

Background Required to Understand this Chapter

Advanced Computer Architecture. Smruti R. Sarangi

http://www.cse.iitd.ac.in/~srsarangi/archbooksoft.html

Background Required to Understand this Chapter Advanced Computer Architecture. Smruti R. Sarangi http://www.cse.iitd.ac.in/~srsarangi/archbooksoft.html

Слайд 3

Advanced Computer Architecture. Smruti R. Sarangi

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 4

A Simplified Diagram of a Processor with 5 Stages

Advanced Computer Architecture. Smruti

A Simplified Diagram of a Processor with 5 Stages Advanced Computer Architecture.
R. Sarangi

Fetch

unit

Instruction

memory

Immediate

and branch

unit

Register

file

Control

unit

op2

op1

ALU
unit

Branch

unit

flags

Memory

unit

Data

memory

Register

write unit

 

Interconnection

element

Instruction

fetch stage

Operand fetch and

decode stage

 

Execute stage

 

 

Memory access

stage

 

 

Register write-back

stage

 

 

[IF] Instruction fetch
[OF] Operand fetch and decode
[EX] Execute

[MA] Memory access
[RW] Register write-back

Слайд 5

Pipelines

For more efficiency, we can pipeline the design. This will eliminate idleness

Pipelines For more efficiency, we can pipeline the design. This will eliminate
in the processor.

In-order Pipelines
Instructions enter the pipeline in program order

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

5-Stage Pipeline

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 6

Pipelined Version of the Processor

Advanced Computer Architecture. Smruti R. Sarangi

Note the positions

Pipelined Version of the Processor Advanced Computer Architecture. Smruti R. Sarangi Note
of the pipeline latches.

IF-OF

Fetch

unit

Instruction

memory

Immediate

and branch

unit

Register

file

Control

unit

OF-EX

op2

op1

ALU

Unit

Branch

unit

flags

EX-MA

Memory

unit

Data

memory

MA-RW

Register

write unit

Слайд 7

Problems with In-order Pipelines

Hazards
Structural Hazards ? Two instructions vie for the same

Problems with In-order Pipelines Hazards Structural Hazards ? Two instructions vie for
resource (NOT possible in simple 5-stage pipelines)
Data Hazards ? An instruction stands to read or write the wrong data.
Control Hazards ? Instructions are fetched from the wrong path of the branch

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 8

Pipeline Diagrams

Advanced Computer Architecture. Smruti R. Sarangi

add r1, r2, r3
add r4,

Pipeline Diagrams Advanced Computer Architecture. Smruti R. Sarangi add r1, r2, r3
r1, r3

1

2

Слайд 9

Pipeline Interlocks

Advanced Computer Architecture. Smruti R. Sarangi

IF

OF

EX

MA

RW

1

1

1

1

1

2

2

2

3

3

1

2

3

4

5

6

Pipeline

bubble

3

2

2

3

2

3

7

An interlock inserts a nop

Pipeline Interlocks Advanced Computer Architecture. Smruti R. Sarangi IF OF EX MA
instruction (bubble) in the pipeline

Слайд 10

Forwarding from the MA to the EX stage ? No stalls

Advanced Computer

Forwarding from the MA to the EX stage ? No stalls Advanced
Architecture. Smruti R. Sarangi

Слайд 11

Forwarding Multiplexers

Advanced Computer Architecture. Smruti R. Sarangi

ALU

Latch

Memory

access unit

Forwarded input

Input 1

Input 2

EX

Forwarding Multiplexers Advanced Computer Architecture. Smruti R. Sarangi ALU Latch Memory access
stage

MA stage

Слайд 12

We need 4 Forwarding Paths

Advanced Computer Architecture. Smruti R. Sarangi

Forward as late

We need 4 Forwarding Paths Advanced Computer Architecture. Smruti R. Sarangi Forward as late as possible
as possible

Слайд 13

Final View of the Pipelined Processor with Forwarding Multiplexers

Advanced Computer Architecture. Smruti

Final View of the Pipelined Processor with Forwarding Multiplexers Advanced Computer Architecture.
R. Sarangi

We add 6 forwarding multiplexers

Слайд 14

Data Hazards in In-order Pipelines with Forwarding

ld r4, 4[r0]

add r5, r4,

Data Hazards in In-order Pipelines with Forwarding ld r4, 4[r0] add r5,
1

Need the value in r4 now

Earliest it can be generated

Load-use Hazard

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 15

Solution: Stall the Pipeline

ld r4, 4[r0]

add r5, r4, 1

Cycle N

Cycle N+1

ld

Solution: Stall the Pipeline ld r4, 4[r0] add r5, r4, 1 Cycle
r4, 4[r0]

Cycle N+2

Advanced Computer Architecture. Smruti R. Sarangi

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

Слайд 16

Control Hazards

beq .label

add r1, r2, r3

beq .label

sub r5, r6, r7

We know the

Control Hazards beq .label add r1, r2, r3 beq .label sub r5,
status of the branch now. Assume it is taken.

Cancel these instructions

Two instruction slots are wasted

Advanced Computer Architecture. Smruti R. Sarangi

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

Слайд 17

Advanced Computer Architecture. Smruti R. Sarangi

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 18

Performance Equation - I

Is Computer A faster that Computer B
Wrong Answers:
More

Performance Equation - I Is Computer A faster that Computer B Wrong
is the clock speed, faster is the computer
More is the RAM, faster is the computer
What does it mean for computer A to be faster than computer B
Short Answer: NOTHING
Performance is always with respect to a program. You can say that a certain program runs faster on computer A as compared to computer B.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 19

Performance Equation - II

IPC is the number of instructions per cycle
Let us

Performance Equation - II IPC is the number of instructions per cycle
loosely refer to the reciprocal of the time per program as the performance

 

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 20

So, what does performance depend on …

#instructions in the program
Depends on the

So, what does performance depend on … #instructions in the program Depends
compiler
Frequency
Depends on the transistor technology and the architecture
If we have more pipeline stages, then the time to traverse each stage reduces roughly proportionally
Given that each stage needs to be processed in one clock cycle, smaller the stage, higher the frequency
To increase the frequency, we simply need to increase the number of pipeline stages
IPC
Depends on the architecture and the compiler
A large part of this book is devoted to this aspect.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 21

How to improve performance?

There are 3 factors:
IPC, #instructions, and frequency
#instructions

How to improve performance? There are 3 factors: IPC, #instructions, and frequency
is dependent on the compiler ? not on the architecture
Let us look at IPC and frequency
IPC
What is the IPC of an in-order pipeline?

1 if there are no stalls, otherwise < 1

Advanced Computer Architecture. Smruti R. Sarangi

Methods to increase IPC

Слайд 22

What about frequency?

What is frequency dependent on …
Frequency = 1

What about frequency? What is frequency dependent on … Frequency = 1
/ clock period
Clock Period:
1 pipeline stage is expected to take 1 clock cycle
Clock period = maximum latency of the pipeline stages
How to reduce the clock period?
Make each stage of the pipeline smaller by increasing the number of pipeline stages
Use faster transistors

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 23

Limits to Increasing Frequency

Assume that we have the fastest possible transistors
Can we

Limits to Increasing Frequency Assume that we have the fastest possible transistors
increase the frequency to 100 GHz?

Reasons

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 24

Limits to increasing frequency - II

What does it mean to have a

Limits to increasing frequency - II What does it mean to have
very high frequency?
Before answering, keep these facts in mind:

 

Thumb Rule

P ? power
f ? frequency

 

Thermo-dynamics

T ? Temperature

We need to increase the number of pipeline stages ? more hazards, more forwarding paths

1

2

3

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 25

How many pipeline stages can we have?

We are limited by the latch

How many pipeline stages can we have? We are limited by the
delay
Even with an infinite number of stages, the minimum clock period will be equal to the latch delay

Logic

Latch

Latch

Logic

Latch

Few stages

More stages

Even more stages

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 26

Pipeline Stages vs IPC

 

CPI = CPIideal + stall_rate * stall_penalty

The stall rate

Pipeline Stages vs IPC CPI = CPIideal + stall_rate * stall_penalty The
will remain more or less constant per instruction with the number of pipeline stages
The stall penalty (in terms of cycles) will however increase
This will lead to a net increase in CPI and loss in IPC

Advanced Computer Architecture. Smruti R. Sarangi

As we increase the number of stages, the IPC goes down.

Слайд 27

Summary: Why we cannot increase frequency by increasing the number of pipeline

Summary: Why we cannot increase frequency by increasing the number of pipeline
stages?

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 28

Since we cannot increase frequency …

Increase IPC

Advanced Computer Architecture. Smruti R. Sarangi

Since we cannot increase frequency … Increase IPC Advanced Computer Architecture. Smruti R. Sarangi

Слайд 29

Increase IPC

Issue more instructions per cycle
2, 4, or 8 instructions
Make it a

Increase IPC Issue more instructions per cycle 2, 4, or 8 instructions
superscalar processor ? A processor that can execute multiple instructions per cycle

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 30

In-order Superscalar Processor

Have multiple in-order pipelines.

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

Inst. Fetch

Operand Fetch

Execute

Memory
Access

Register
Write-back

Inst. Fetch

Operand

In-order Superscalar Processor Have multiple in-order pipelines. Inst. Fetch Operand Fetch Execute
Fetch

Execute

Memory
Access

Register
Write-back

Instruction i

Instruction i+1

Instruction i+2

Advanced Computer Architecture. Smruti R. Sarangi

Inst. Fetch

Inst. Fetch

Inst. Fetch

Слайд 31

In-order Superscalar Processor - II

There can be dependences between instructions
Have O(n2)

In-order Superscalar Processor - II There can be dependences between instructions Have
forwarding paths for an n-issue processor
Complicated logic for detecting dependences, hazards, and forwarding
Still might not be enough ...
To get the peak IPC (= n) in an n-issue pipeline, we need to ensure that there are no stalls
There will be no stalls if there are no taken branches, and no data dependences between instructions.
Programs typically do not have such long sequences of instructions without dependences

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 32

Contents

Advanced Computer Architecture. Smruti R. Sarangi

Contents Advanced Computer Architecture. Smruti R. Sarangi

Слайд 33

What to do ...

Don’t follow program order

Too many dependences

mov r1,

What to do ... Don’t follow program order Too many dependences mov
1
add r3, r1, r2
add r4, r3, r2
mov r5, 1
add r6, r5, 1
add r8, r7, r6

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 34

Execute out of order

mov r1, 1
add r3, r1, r2
add r4, r3, r2

mov

Execute out of order mov r1, 1 add r3, r1, r2 add
r5, 1
add r6, r5, 1
add r8, r7, r6

Execute on a 2-issue OOO processor

Advanced Computer Architecture. Smruti R. Sarangi

Execute 2 instructions in parallel

Слайд 35

Continuation ...

mov r1, 1
add r3, r1, r2
add r4, r3, r2

mov r5,

Continuation ... mov r1, 1 add r3, r1, r2 add r4, r3,
1
add r6, r5, 1
add r8, r7, r6

cycle 1

cycle 2

cycle 3

issue slot 1

issue slot 2

Advanced Computer Architecture. Smruti R. Sarangi

In Out-of-order (OOO) processors, the execution is not as per program order. It is as per the data dependence order ? the consumer is executed always after the producer.

Слайд 36

Basic Principle of OOO Processors

Advanced Computer Architecture. Smruti R. Sarangi

ILP

Instruction level parallelism
The

Basic Principle of OOO Processors Advanced Computer Architecture. Smruti R. Sarangi ILP
number of ready and independent instructions
we can simultaneously execute.

Слайд 37

Revisit the Example

mov r1, 1

add r3, r1, r2

add r4, r3, r2

mov r5,

Revisit the Example mov r1, 1 add r3, r1, r2 add r4,
1

add r6, r5, 1

add r8, r7, r6

Pool of Instructions

Issue ready and mutually independent instructions

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 38

Pool of Instructions: Instruction Window

Needs to be large enough such that the

Pool of Instructions: Instruction Window Needs to be large enough such that
requisite number of mutually independent instructions can be found.
Typical instruction window sizes: 64 to 128
How do we create a large pool of instructions in a program with branches? We need to be sure that all the instructions are on the correct path

for (i = 1; i < m; i++) {
for (j = 1; j < i; j ++ ) {
if (j %2 == 0) continue;
....
}
}

Advanced Computer Architecture. Smruti R. Sarangi

Example

Слайд 39

Problems with creating an Instruction Pool

Typically 1 in 5 instructions is

Problems with creating an Instruction Pool Typically 1 in 5 instructions is
a branch

Predict the directions of the branches, and their targets

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 40

Advanced Computer Architecture. Smruti R. Sarangi

Motivation for Branch Prediction

1

2

3

4

We need
high IPC

This

Advanced Computer Architecture. Smruti R. Sarangi Motivation for Branch Prediction 1 2
means that
we need a large
instruction window

It will have a
lot of branches.

We need to predict
ALL the branches
correctly.

Слайд 41

The Maths of Branch Prediction

Advanced Computer Architecture. Smruti R. Sarangi

The Maths of Branch Prediction Advanced Computer Architecture. Smruti R. Sarangi

Слайд 42

Advanced Computer Architecture. Smruti R. Sarangi

For (n=100) : A plot of Pn

Advanced Computer Architecture. Smruti R. Sarangi For (n=100) : A plot of
vs p

If Pn = 10%, p has to be as low as 0.5% !!!

Слайд 43

Advanced Computer Architecture. Smruti R. Sarangi

If we need a large instruction window,

Advanced Computer Architecture. Smruti R. Sarangi If we need a large instruction
we need a very accurate branch predictor. The accuracy of the branch predictor limits the size of the instruction window.

Слайд 44

Advanced Computer Architecture. Smruti R. Sarangi

Nature of Dependences

Advanced Computer Architecture. Smruti R. Sarangi Nature of Dependences

Слайд 45

Dependences between Instructions

Program Order Dependence

mov r1, 1
mov r2, 2

One instruction appears after

Dependences between Instructions Program Order Dependence mov r1, 1 mov r2, 2
the other in program order

The program order is the order of instructions that is perceived by a single cycle in-order processor executing the program.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 46

Data Dependences

RAW ? Read after Write Dependence (True dependence)

mov r1, 1
add

Data Dependences RAW ? Read after Write Dependence (True dependence) mov r1,
r3, r1, r2

It is a producer-consumer dependence.
The earlier instruction produces a value, and the later instruction reads it.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 47

Data Dependences - II

WAW ? Write after Write Dependence (Output dependence)

mov r1,

Data Dependences - II WAW ? Write after Write Dependence (Output dependence)
1
add r1, r4, r2

Two instructions write to the same location
The later instruction needs to take effect after the former

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 48

Data Dependences - III

WAR ? Write after Read Dependence (Anti dependence)

add r1,

Data Dependences - III WAR ? Write after Read Dependence (Anti dependence)
r2, r3
add r2, r5, r6

Earlier instruction reads, later instruction writes
The later instruction needs to execute after the earlier instruction has read its values

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 49

Control Dependences

The add instruction is control dependent on the branch(beq) instruction
If the

Control Dependences The add instruction is control dependent on the branch(beq) instruction
branch is taken then only the add instruction will execute, not otherwise

beq .label
.....
.label
add r1, r2, r3

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 50

Basic Results

In-order processors respect all program order dependences. Thus, they automatically respect

Basic Results In-order processors respect all program order dependences. Thus, they automatically
all data and control dependences.

OOO processors respect only data and control dependences.

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 51

Can output and anti dependences be removed?

Don’t you think that these dependences

Can output and anti dependences be removed? Don’t you think that these
are there because we have a finite number of registers.
What if we had an infinite number of registers?

mov r1, 1
add r5, r6, r7
add r1, r4, r2
add r8, r9, r10

add r1, r2, r3
add r5, r6, r7
add r2, r5, r6
add r8, r9, r10

Advanced Computer Architecture. Smruti R. Sarangi

Set r1

Set r1

Use r1

Use r1

Use r1

Avatar 1

Avatar 2

Слайд 52

Solution: Assume infinite number of physical registers

mov r1, 1
add r1, r2, r3
add

Solution: Assume infinite number of physical registers mov r1, 1 add r1,
r4, r1, 1
mov r2, 5
add r6, r2, r8
mov r1, 8
add r9, r1, r2

mov p11, 1
add p12, p2, p3
add p41, p12, 1
mov p21, 5
add p61, p21, p8
mov p13, 8
add p91, p13, p21

Code with architectural registers

Code with physical registers

Advanced Computer Architecture. Smruti R. Sarangi

Architectural register

Physical register

Format in this example: rx is mapped to px

Слайд 53

Renaming

Program with real (architectural) registers

Program with physical registers

RAW dependences

WAR dependences

WAW dependences

RAW dependences

Higher

Renaming Program with real (architectural) registers Program with physical registers RAW dependences
instruction level parallelism (ILP)

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 54

Where are we now ...

Fetch + Decode + Rename

Instruction
Memory

Pool of Instructions

Execution Units

Issue

Where are we now ... Fetch + Decode + Rename Instruction Memory
+ Reg.
read

Write back results

Branch Predictor

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 55

Issue with Write-back

To an outsider should it matter if the processor is

Issue with Write-back To an outsider should it matter if the processor
in-order or OOO
NO

Processor

Instructions

Register File

Memory

Update State

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 56

Assume that there is an exception or interrupt

Languages like C or Java

Assume that there is an exception or interrupt Languages like C or
have dedicated functions that are called if there is a divide-by-zero in the code.
The question is:
What if the sub instruction has executed when we enter the exception handler?
An in-order processor will never do this.

mov r4, 10
mov r2, 0
div r3, r1, r2
sub r4, r4, 1

Divide by 0

Divide by zero handler

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 57

Precise Exceptions

Assume that the exception handler decides to do nothing and return

Precise Exceptions Assume that the exception handler decides to do nothing and
back
After this the sub instruction should be executed
This is exactly what will happen in an in-order processor
In an OOO processor there is a possibility that the sub inst. can execute out of order
The outsider (exception handler) will see a different view as compared to the view it will see with an in-order processor.

Regular Instructions

÷ by 0

Exception
handler

sub
inst.

Advanced Computer Architecture. Smruti R. Sarangi

Flow of actions

Слайд 58

Precise Exceptions - II

To an external observer
The execution should always be correct

Precise Exceptions - II To an external observer The execution should always
and as per program order
Even in the presence of interrupts and exceptions

Regular Instructions

÷ by 0

Exception
handler

sub
inst.

Regular Instructions

÷ by 0

Exception
handler

sub
inst.

Correct

Wrong

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 59

Precise Exceptions - III

We thus need precise exceptions
Assume that the dynamic instructions

Precise Exceptions - III We thus need precise exceptions Assume that the
in a program (ordered in program order) are: ins1, ins2, ins3 ... insn
Assume that the processor starts the exception/interrupt handler after it has just finished writing the results of instruction: insk
Then instructions: ins1 ... insk should have executed completely and written their results to the memory/register file
AND, insk+1 and later instructions should not appear to have started their execution at all
Such an exception or interrupt is precise

ins1

ins2

ins3

insk

Exception

insk+1

Advanced Computer Architecture. Smruti R. Sarangi

Слайд 60

Precise Exceptions in an OOO Processor

Processor
(Out-of-order execution)

Register File

Memory

Update State

In program order

Advanced Computer

Precise Exceptions in an OOO Processor Processor (Out-of-order execution) Register File Memory
Architecture. Smruti R. Sarangi

Слайд 61

Advanced Computer Architecture. Smruti R. Sarangi

In-order pipelines have a limited IPC because
of

Advanced Computer Architecture. Smruti R. Sarangi In-order pipelines have a limited IPC
hazards and branches

Multi-issue in-order pipelines do not solve the
problem. Reason: dependences and interlocks

Hence, we issue instructions out-of-order (OOO). We need
a large instruction window to find sufficient independent insts.

To sustain a large instruction window, we need a very
accurate branch predictor.

To expose additional ILP, we can remove WAR/WAR
hazards. Finally, we need to have precise exceptions.

Conclusion

Имя файла: Chapter_2_ooo.pptx
Количество просмотров: 31
Количество скачиваний: 0