Agenda
Agenda
What Happens When the Processor is Too Large?
What Happens When the Processor is Too Large?
Microcontrol Unit
Microcoded Microarchitecture
A Bus-based Datapath for RISC
Agenda
An Ideal Pipeline
An Ideal Pipeline
Unpipelined Datapath for MIPS
Simplified Unpipelined Datapath
Pipelined Datapath
Pipelined Control
Pipelined Control
Pipelined Control
Pipelined Control
“Iron Law” of Processor Performance
“Iron Law” of Processor Performance
“Iron Law” of Processor Performance
CPI Examples
Technology Assumptions
Pipeline Diagrams
Pipeline Diagrams: Transactions vs. Time
Pipeline Diagrams: Space vs. Time
Instructions Interact With Each Other in Pipeline
Agenda
Overview of Structural Hazards
Example Structural Hazard:
Example Structural Hazard:
Example Structural Hazard:
Agenda
Overview of Data Hazards
Example Data Hazard
Feedback to Resolve Hazards
Resolving Data Hazards with Stalls
Stalled Stages and Pipeline Bubbles
Stall Control Logic
Stall Control Logic (ignoring jumps &branches)
Source & Destination Registers
Deriving the Stall Signal
Hazards due to Loads & Stores
Data Hazards Due to Loads and Store
Overview of Data Hazards
Adding Bypassing to the Datapath
Deriving the Bypass Signal
The Bypass Signal
Bypass and Stall Signals
Fully Bypassed Datapath
Overview of Data Hazards
Agenda
Control Hazards
Opcode Decoding Bubble
Speculate next address is PC+4
Pipelining Jumps
Jump Pipeline Diagrams
Pipelining Conditional Branches
Pipelining Conditional Branches
Pipelining Conditional Branches
New Stall Signal
Control Equations for PC and IR Muxes
Branch Pipeline Diagrams
Reducing Branch Penalty
Branch Delay Slots
Branch Pipeline Diagrams (branch delay slot)
Why an Instruction may not be dispatched every cycle (CPI>1)
Agenda
Acknowledgements
706.77K
Category: informaticsinformatics

Computer Architecture ELE 475 / COS 475 Slide Deck 2: Microcode and Pipelining Review of Embedded Systems

1.

Computer Architecture
ELE 475 / COS 475
Slide Deck 2: Microcode and
Pipelining Review of
Embedded Systems
Lecturer: Assistant-professor – Mukhanov
Samat Bakytzhanovich
1

2. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
2

3. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
3

4. What Happens When the Processor is Too Large?

4

5. What Happens When the Processor is Too Large?

• Time Multiplex Resources!
5

6. Microcontrol Unit

Maurice Wilkes, 1954
op
code
conditional
flip-flop
First used in EDSAC-2,
completed 1958
Next state
address
Matrix A
Matrix B
Decoder
Memory
Embed the control
logic state table in
a memory array
Control lines to
ALU, MUXs, Registers
6

7. Microcoded Microarchitecture

busy?
zero?
opcode
controller
(ROM)
holds fixed
microcode instructions
Datapath
Data
holds user program
written in macrocode
instructions (e.g., x86,
MIPS, etc.)
Addr
Memory
(RAM)
enMem
MemWrt
7

8. A Bus-based Datapath for RISC

bcompare?
Opcode
ldIR
OpSel ldA
ldB
32(PC)
1(Link)
rd
rs2
rs1
2
IR
ImmSel
Imm
2
Ext
enImm
rd
rs2
rs1
busy
3
A
ALU
control
enALU
B
ALU
RegSel
addr
32 GPRs
+ PC ...
32-bit Reg
data
ldMA
MA
addr
RegWrt
enReg
Memory MemWrt
data
enMem
Bus 32
Microinstruction: register to register transfer (17 control signals)
8

9. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
9

10. An Ideal Pipeline

stage
1
stage
2
stage
3
stage
4
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• Scheduling of a transaction entering the pipeline is not
affected by the transactions in other stages
10

11. An Ideal Pipeline

stage
1
stage
2
stage
3
stage
4
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• Scheduling of a transaction entering the pipeline is not
affected by the transactions in other stages
• These conditions generally hold for industry assembly
lines, but instructions depend on each other causing
various hazards
11

12. Unpipelined Datapath for MIPS

PCSrc
br
rind
jabs
pc+4
RegWrite
MemWrite
WBSrc
0x4
Add
Add
clk
we
PC
clk
31
addr
inst
Inst.
Memory
clk
rs1
rs2
rd1
ws
wd rd2
we
addr
ALU
GPRs
z
Imm
Ext
rdata
Data
Memory
wdata
ALU
Control
OpCode
RegDst
ExtSel
OpSel
BSrc
zero?
12

13. Simplified Unpipelined Datapath

0x4
Add
PC
addr
rdata
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
13

14. Pipelined Datapath

0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
14

15. Pipelined Control

0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
15

16. Pipelined Control

Hardwired
Controller
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
16

17. Pipelined Control

Hardwired
Controller
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
17

18. Pipelined Control

Hardwired
Controller
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
18

19. “Iron Law” of Processor Performance

Time = Instructions
Cycles
Time
Program
Program * Instruction * Cycle
–Instructions per program depends on source code,
compiler technology, and ISA
–Cycles per instructions (CPI) depends upon the ISA
and the microarchitecture
–Time per cycle depends upon the microarchitecture
and the base technology
Microarchitecture
Microcoded
Single-cycle unpipelined
Pipelined
CPI
>1
1
1
cycle time
short
long
short
19

20. “Iron Law” of Processor Performance

Time = Instructions
Cycles
Time
Program
Program * Instruction * Cycle
–Instructions per program depends on source code,
compiler technology, and ISA
–Cycles per instructions (CPI) depends upon the ISA
and the microarchitecture
–Time per cycle depends upon the microarchitecture
and the base technology
Microarchitecture
Microcoded
Single-cycle unpipelined
Pipelined
Multi-cycle, unpipelined control
CPI
>1
1
1
cycle time
short
long
short
20

21. “Iron Law” of Processor Performance

Time = Instructions
Cycles
Time
Program
Program * Instruction * Cycle
–Instructions per program depends on source code,
compiler technology, and ISA
–Cycles per instructions (CPI) depends upon the ISA
and the microarchitecture
–Time per cycle depends upon the microarchitecture
and the base technology
Microarchitecture
Microcoded
Single-cycle unpipelined
Pipelined
Multi-cycle, unpipelined control
CPI
>1
1
1
>1
cycle time
short
long
short
short
21

22. CPI Examples

Microcoded machine
7 cycles
5 cycles
10 cycles
Inst 2
Inst 3
Inst 1
Time
3 instructions, 22 cycles, CPI=7.33
Unpipelined machine
Inst 1
Inst 2
Inst 3
3 instructions, 3 cycles, CPI=1
Pipelined machine
Inst 1
Inst 2
Inst 3
3 instructions, 3 cycles, CPI=1
22

23. Technology Assumptions

• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!)
Thus, the following timing assumption is reasonable
tIM tRF tALU tDM tRW
A 5-stage pipeline will be the focus of our detailed design
- some commercial designs have over 30 pipeline
stages to do an integer add!
23

24. Pipeline Diagrams

Hardwired
Controller
0x4
Add
PC
addr
rdata
Inst.
Memory
fetch
phase
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
rdata
Data
Memory
Imm
Ext
decode & registerfetch phase
we
addr
wdata
execute
phase
memory
phase
write
-back
phase
We need some way to show multiple
simultaneous transactions in both space and time
24

25. Pipeline Diagrams: Transactions vs. Time

Hardwired
Controller
0x4
Add
PC
addr
rdata
we
rs1
rs2
rd1
ws
wd rd2
GPRs
IR
Inst.
Memory
fetch
phase
ALU
rdata
Data
Memory
Imm
Ext
decode & registerfetch phase
time
instruction1
instruction2
instruction3
instruction4
instruction5
t0
IF1
t1 t2
ID1 EX1
IF2 ID2
IF3
we
addr
wdata
execute
phase
memory
phase
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
write
-back
phase
25

26. Pipeline Diagrams: Space vs. Time

Hardwired
Controller
0x4
Add
addr
rdata
IR
Inst.
Memory
fetch
phase
Resources
PC
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
rdata
Data
Memory
Imm
Ext
wdata
decode & registerfetch phase
time
IF
ID
EX
MA
WB
we
addr
t0
I1
t1
I2
I1
t2
I3
I2
I1
execute
phase
t3
I4
I3
I2
I1
t4
I5
I4
I3
I2
I1
memory
phase
t5
t6
t7
....
I5
I4
I3
I2
I5
I4
I3
I5
I4
I5
write
-back
phase
26

27. Instructions Interact With Each Other in Pipeline

• Structural Hazard: An instruction in the
pipeline needs a resource being used by
another instruction in the pipeline
• Data Hazard: An instruction depends on a
data value produced by an earlier instruction
• Control Hazard: Whether or not an instruction
should be executed depends on a control
decision made by an earlier instruction
27

28. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
28

29. Overview of Structural Hazards

• Structural hazards occur when two instructions need
the same hardware resource at the same time
• Approaches to resolving structural hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create structural hazards
– Stall: Hardware includes control logic that stalls until
earlier instruction is no longer using contended resource
– Duplicate: Add more hardware to design so that each
instruction can access independent resources at the same
time
• Simple 5-stage MIPS pipeline has no structural hazards
specifically because ISA was designed that way
29

30. Example Structural Hazard:

Unified Memory
0x4
Add
PC
addr
rdata
Inst.
Memory
IF
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
rdata
Data
Memory
Imm
Ext
ID
we
addr
wdata
EX
MEM
WB
30

31. Example Structural Hazard:

Unified Memory
0x4
Add
PC
addr
rdata
Inst.
Memory
IF
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
we
addr
rdata
Data
Memory
Imm
Ext
wdata
ID
EX
MEM
WB
addr
rdata
wdata
Unified Memory
31

32. Example Structural Hazard:

2-Cycle Memory
0x4
Add
PC
addr
rdata
Inst.
Memory
IF
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
M0
Stage
ALU
we
addr
rdata
Data
Memory
Imm
Ext
ID
M1
Stage
wdata
EX
MEM
WB
32

33. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
33

34. Overview of Data Hazards

• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
34

35. Example Data Hazard

r1 …
r4 r1 …
0x4
PC
addr
IR
IR
IR
Add
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
(ADDI R1, R0, #10)
(ADDI R4, R1, #17)
R
wdata
wdata
MD1
...
r1 r0 + 10
r4 r1 + 17
...
we
addr
MD2
r1 is stale. Oops!
35

36. Feedback to Resolve Hazards

FB1
FB2
stage
1
FB3
stage
2
FB4
stage
3
stage
4
• Later stages provide dependence information to
earlier stages which can stall (or kill) instructions
• Controlling a pipeline in this manner works provided
the instruction at stage i+1 can complete without any
interaction from instructions in stages 1 to i
(otherwise deadlock)
36

37. Resolving Data Hazards with Stalls

(Interlocks)
Stall Condition
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
...
r1 r0 + 10
r4 r1 + 17
...
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
37

38. Stalled Stages and Pipeline Bubbles

time
t0 t1
(I1) r1 (r0) + 10 IF1 ID1
(I2) r4 (r1) + 17
IF2
(I3)
(I4)
(I5)
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2 t3 t4 t5
EX1 MA1 WB1
ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages
t2
I3
I2
I1
t6
t7
....
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
t3 t4 t5 t6 t7 . . . .
I3
I3
I3
I4
I5
I2
I2
I2
I3
I4
I5
nop nop nop I2
I3
I4
I5
I4
I1
nop nop nop I2
I3
I3
I1
nop nop nop I2
nop
I5
I4
I5
pipeline bubble
38

39. Stall Control Logic

stall
ws
Cstall
rs
rt
?
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Compare the source registers of the instruction in the decode
stage with the destination register of the uncommitted
instructions.
39

40. Stall Control Logic (ignoring jumps &branches)

Stall Control Logic (ignoring jumps &branches)
stall
ws
we
Cstall
rs
?
rt
re2
re1
Cre
0x4
addr
IR
ws
we
Cdest
Cdest
nop
Add
PC
ws
we
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Cdest
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Should we always stall if the rs field matches some rd?
not every instruction writes a register we
not every instruction reads a register re
40

41. Source & Destination Registers

Source & Destination Registers
ALU
ALUI
LW
SW
BZ
J
JAL
JR
JALR
R-type:
op
rs
rt
rd
func
I-type:
op
rs
rt
immediate16
J-type:
op
immediate26
source(s) destination
rs, rt
rd
rs
rt
rs
rt
rs, rt
rd (rs) func (rt)
rt (rs) op immediate
rt M [(rs) + immediate]
M [(rs) + immediate] (rt)
cond (rs)
true: PC (PC) + immediate
rs
false: PC (PC) + 4
rs
PC (PC) + immediate
r31 (PC), PC (PC) + immediate
PC (rs)
rs
r31 (PC), PC (rs)
rs
31
31
41

42. Deriving the Stall Signal

Cdest
ws = Case opcode
ALU
rd
ALUi, LW
rt
JAL, JALR
R31
we = Case opcode
ALU, ALUi, LW (ws 0)
JAL, JALR
on
...
off
Cstall
Cre
re1 = Case opcode
ALU, ALUi,
LW, SW, BZ,
JR, JALR
on
J, JAL
off
re2 = Case opcode
on
ALU, SW
off
...
stall = ((rsD =wsE).weE +
(rsD =wsM).weM +
(rsD =wsW).weW) . re1D
((rtD =wsE).weE +
(rtD =wsM).weM +
(rtD =wsW).weW) . re2D
+
42

43. Hazards due to Loads & Stores

Hazards due to Loads & Stores
Stall Condition
What if
(r1)+7 = (r3)+5 ?
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
...
M[(r1)+7] (r2)
r4 M[(r3)+5]
...
we
addr
R
wdata
wdata
MD1
MD2
Is there any possible data hazard
in this instruction sequence?
43

44. Data Hazards Due to Loads and Store

• Example instruction sequence
– Mem[ Regs[r1] + 7 ] <- Regs[r2]
– Regs[r4] <- Mem[ Regs[r3] + 5 ]
• What if Regs[r1]+7 == Regs[r3]+5 ?
– Writing and reading to/from the same address
– Hazard is avoided because our memory system
completes writes in a single cycle
– More realistic memory system will require more
careful handling of data hazards due to loads and
stores
44

45. Overview of Data Hazards

• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
45

46. Adding Bypassing to the Datapath

stall
r4 r1
0x4
r1
nop
Add
IR
E
IR
M
IR
31
ASrc
PC
addr
inst IR
D
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
(I1)
(I2)
R
wdata
wdata
MD1
...
r1 r0 + 10
r4 r1 + 17
we
addr
MD2
When does this bypass help?
r1 Mem[r0 + 10]
r4 r1 + 17
JAL 500
r4 r31 + 17
46
W

47. Deriving the Bypass Signal

time
t0 t1
(I1) r1 (r0) + 10 IF1 ID1
(I2) r4 (r1) + 17
IF2
(I3)
(I4)
(I5)
t2 t3 t4 t5
EX1 MA1 WB1
ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages
t6
t7
....
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
Each stall or kill introduces a bubble in the pipeline
⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time
t0 t1
(I1) r1 (r0) + 10 IF1 ID1
(I2) r4 (r1) + 17
IF2
(I3)
(I4)
(I5)
t2
EX1
ID2
IF3
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
47

48. The Bypass Signal

Deriving it from the Stall Signal
stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D
+((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )
ws = Case opcode
ALU
rd
ALUi, LW
rt
JAL, JALR R31
ASrc = (rsD=wsE).weE.re1D
we = Case opcode
ALU, ALUi, LW (ws 0)
JAL, JALR
on
...
off
Is this correct?
No because only ALU and ALUi instructions can benefit
from this bypass
Split weE into two components: we-bypass, we-stall
48

49. Bypass and Stall Signals

Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
ALU, ALUi (ws 0)
...
off
we-stallE = Case opcodeE
LW
(ws 0)
JAL, JALR on
...
off
ASrc
= (rsD =wsE).we-bypassE . re1D
stall
= ((rsD =wsE).we-stallE +
(rsD=wsM).weM + (rsD=wsW).weW). re1D
+((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D
49

50. Fully Bypassed Datapath

PC for JAL, ...
stall
0x4
nop
Add
PC
addr
ASrc
inst IR
Inst
Memory
D
Is there still
a need for the
stall signal ?
IR
M
IR
31
we
rs1
rs2
rd1
ws
wd rd2
A
ALU
GPRs
Imm
Ext
IR
E
Y
B
BSrc
we
addr
rdata
Data
Memory
R
wdata
wdata
MD1
MD2
stall = (rsD=wsE). (opcodeE=LWE).(wsE 0 ).re1D
+ (rtD=wsE). (opcodeE=LWE).(wsE 0 ).re2D
50
W

51. Overview of Data Hazards

• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
51

52. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
52

53. Control Hazards

• What do we need to calculate next PC?
– For Jumps
• Opcode, offset and PC
– For Jump Register
• Opcode and Register value
– For Conditional Branches
• Opcode, PC, Register (for condition), and offset
– For all other instructions
• Opcode and PC
– have to know it’s not one of above!
53

54. Opcode Decoding Bubble

(assuming no branch delay slots for now)
time
t0 t1 t2
(I1) r1 (r0) + 10 IF1 ID1 EX1
(I2) r3 (r2) + 17
IF2 IF2
(I3)
(I4)
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1 t2
nop I2
I1
nop
I1
I1
CPI = 2!
t3 t4 t5 t6 t7 . . . .
MA1 WB1
ID2 EX2 MA2 WB2
IF3 IF3 ID3 EX3 MA3 WB3
IF4 IF4 ID4 EX4 MA4 WB4
t3
t4 t5 t6 t7 . . . .
nop I3
nop I4
nop I3
nop I4
I2
nop I2
nop I3
nop I4
nop I4
I1
nop I2
nop I3
nop I4
I1
nop I2
nop I3
nop
pipeline bubble 54

55. Speculate next address is PC+4

PCSrc (pc+4 / jabs / rind/ br)
stall
Add
nop
0x4
Add
Jump?
I1
I2
I3
I4
PC
addr
inst
IR
104
Inst
Memory
I2
096
100
104
304
ADD
J 304
ADD
ADD
kill
E
M
IR
IR
I1
A jump instruction kills (not stalls)
the following instruction
How? 55

56. Pipelining Jumps

PCSrc (pc+4 / jabs / rind/ br)
stall
To kill a fetched
instruction -- Insert
a mux before IR
Add
nop
0x4
Add
Jump?
IRSrc D
I1
I2
I3
I4
PC
addr
304
104
Inst
Memory
096
100
104
304
inst
nop
IR
nop
I2
ADD
J 304
ADD
ADD
kill
E
M
IR
IR
II21
I1
Any
interaction
between
stall and
IRSrcD = Case opcodeD jump?
J, JAL
...
nop
IM
56

57. Jump Pipeline Diagrams

time
t0 t1 t2
IF1 ID1 EX1
IF2 ID2
IF3
(I1) 096: ADD
(I2) 100: J 304
(I3) 104: ADD
(I4) 304: ADD
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
I3
I2
I1
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
nop nop nop nop
IF4 ID4 EX4 MA4 WB4
t3 t4 t5 t6 t7
I4
I5
nop I4
I5
I2
nop I4
I5
I5
I1
I2
nop I4
I1
I2
nop I4
nop
....
I5
pipeline bubble
57

58. Pipelining Conditional Branches

PCSrc (pc+4 / jabs / rind / br)
stall
Add
nop
0x4
Add
E
M
IR
IR
I1
BEQZ?
zero?
IRSrcD
PC
104
I1
I2
I3
I4
096
100
104
108
304
addr
inst
nop
Inst
Memory
ADD
BEQZ r1 +200
ADD

ADD
IR
A
ALU
Y
I2
Branch condition is not known until the
execute stage
what action should be taken in the
decode stage ?
58

59. Pipelining Conditional Branches

PCSrc (pc+4 / jabs / rind / br)
stall
?
Add
E
nop
0x4
Add
M
BEQZ?
IR
IR
I2
I1
zero?
IRSrcD
I1
I2
I3
I4
PC
addr
108
Inst
Memory
096
100
104
108
304
inst
nop
IR
A
ALU
Y
I3
If the branch is taken
ADD
- kill the two following instructions
BEQZ r1 +200
- the instruction at the decode stage is
ADD

not valid
59
ADD
stall signal is not valid

60. Pipelining Conditional Branches

stall
Add
PCSrc (pc+4 / jabs / rind / br)
E
IRSrc E
nop
0x4
Add
Jump?
M
BEQZ?
IR
IR
I2
I1
zero?
PC
I1
I2
I3
I4
IRSrcD
PC
addr
108
Inst
Memory
096
100
104
108
304
inst
nop
IR
A
ALU
Y
I3
If the branch is taken
ADD
- kill the two following instructions
BEQZ r1 +200
- the instruction at the decode stage is
ADD

not valid
60
ADD
stall signal is not valid

61. New Stall Signal

stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D
+ ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )
. !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z)
Don’t stall if the branch is taken. Why?
Instruction at the decode stage is invalid
61

62. Control Equations for PC and IR Muxes

PCSrc = Case opcodeE
BEQZ.z, BNEZ.!z
br
...
Case opcodeD
J, JAL
JR, JALR
...
jabs
rind
pc+4
IRSrcD = Case opcodeE
BEQZ.z, BNEZ.!z
nop
...
Case opcodeD
J, JAL, JR, JALR nop
...
IM
Give priority
to the older
instruction,
i.e., execute-stage
instruction
over decode-stage
instruction
IRSrcE = Case opcodeE
BEQZ.z, BNEZ.!z
nop
...
stall.nop + !stall.IRD
62

63. Branch Pipeline Diagrams

(resolved in execute stage)
time
t0 t1
(I1) 096: ADD
IF1 ID1
IF2
(I2) 100: BEQZ +200
(I3) 104: ADD
(I4) 108:
(I5) 304: ADD
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
EX1
ID2
IF3
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 nop nop nop
IF4 nop nop nop nop
IF5 ID5 EX5 MA5 WB5
t2
I3
I2
I1
t3
I4
I3
I2
I1
t4 t5 t6 t7 . . . .
I5
nop I5
nop nop I5
I2
nop nop I5
I1
I2
nop nop I5
nop
pipeline bubble
63

64. Reducing Branch Penalty

(resolve in decode stage)
• One pipeline bubble can be removed if an extra
comparator is used in the Decode stage
– But might elongate cycle time
PCSrc (pc+4 / jabs / rind/ br)
E
Add
nop
0x4
IR
Add
PC
addr
nop
inst
IR
Inst
Memory
D
we
rs1
rs2
rd1
ws
wd rd2
Zero detect on
register file output
GPRs
Pipeline diagram now same as for jumps64

65. Branch Delay Slots

(expose control hazard to software)
• Change the ISA semantics so that the instruction
that follows a jump or branch is always executed
– gives compiler the flexibility to put in a useful instruction where normally
a pipeline bubble would have resulted.
I1
I2
I3
I4
096
100
104
304
ADD
BEQZ r1 +200
Delay slot instruction executed
ADD
regardless of branch outcome
ADD
• Other techniques include more advanced
branch prediction, which can dramatically
reduce the branch penalty... to come later
65

66. Branch Pipeline Diagrams (branch delay slot)

time
t0 t1
(I1) 096: ADD
IF1 ID1
(I2) 100: BEQZ +200
IF2
(I3) 104: ADD
(I4) 304: ADD
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
EX1
ID2
IF3
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
t2
I3
I2
I1
t3
I4
I3
I2
I1
t4
t5
t6
t7
I4
I3
I2
I1
I4
I3
I2
I4
I3
I4
....
66

67. Why an Instruction may not be dispatched every cycle (CPI>1)

Why an Instruction may not be
dispatched every cycle (CPI>1)
• Full bypassing may be too expensive to implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase cycle time and
counteract the benefit of reducing CPI
• Loads have two-cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible pipeline hazard
(compiler schedules independent instruction or inserts NOP to avoid
hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
• MIPS:“Microprocessor without Interlocked Pipeline Stages”
• Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute significant
number of NOP instructions inserted by the compiler. NOPs not
counted in useful CPI (alternatively, increase instructions/program)
67

68. Agenda

• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
68

69. Acknowledgements

• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
• MIT material derived from course 6.823
• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750
69
English     Русский Rules