Similar presentations:
Computer Architecture ELE 475 / COS 475 Slide Deck 2: Microcode and Pipelining Review of Embedded Systems
1.
Computer ArchitectureELE 475 / COS 475
Slide Deck 2: Microcode and
Pipelining Review of
Embedded Systems
Lecturer: Assistant-professor – Mukhanov
Samat Bakytzhanovich
1
2. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
2
3. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
3
4. What Happens When the Processor is Too Large?
45. What Happens When the Processor is Too Large?
• Time Multiplex Resources!5
6. Microcontrol Unit
Maurice Wilkes, 1954op
code
conditional
flip-flop
First used in EDSAC-2,
completed 1958
Next state
address
Matrix A
Matrix B
Decoder
Memory
Embed the control
logic state table in
a memory array
Control lines to
ALU, MUXs, Registers
6
7. Microcoded Microarchitecture
busy?zero?
opcode
controller
(ROM)
holds fixed
microcode instructions
Datapath
Data
holds user program
written in macrocode
instructions (e.g., x86,
MIPS, etc.)
Addr
Memory
(RAM)
enMem
MemWrt
7
8. A Bus-based Datapath for RISC
bcompare?Opcode
ldIR
OpSel ldA
ldB
32(PC)
1(Link)
rd
rs2
rs1
2
IR
ImmSel
Imm
2
Ext
enImm
rd
rs2
rs1
busy
3
A
ALU
control
enALU
B
ALU
RegSel
addr
32 GPRs
+ PC ...
32-bit Reg
data
ldMA
MA
addr
RegWrt
enReg
Memory MemWrt
data
enMem
Bus 32
Microinstruction: register to register transfer (17 control signals)
8
9. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
9
10. An Ideal Pipeline
stage1
stage
2
stage
3
stage
4
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• Scheduling of a transaction entering the pipeline is not
affected by the transactions in other stages
10
11. An Ideal Pipeline
stage1
stage
2
stage
3
stage
4
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• Scheduling of a transaction entering the pipeline is not
affected by the transactions in other stages
• These conditions generally hold for industry assembly
lines, but instructions depend on each other causing
various hazards
11
12. Unpipelined Datapath for MIPS
PCSrcbr
rind
jabs
pc+4
RegWrite
MemWrite
WBSrc
0x4
Add
Add
clk
we
PC
clk
31
addr
inst
Inst.
Memory
clk
rs1
rs2
rd1
ws
wd rd2
we
addr
ALU
GPRs
z
Imm
Ext
rdata
Data
Memory
wdata
ALU
Control
OpCode
RegDst
ExtSel
OpSel
BSrc
zero?
12
13. Simplified Unpipelined Datapath
0x4Add
PC
addr
rdata
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
13
14. Pipelined Datapath
0x4Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
14
15. Pipelined Control
0x4Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
15
16. Pipelined Control
HardwiredController
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
16
17. Pipelined Control
HardwiredController
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
17
18. Pipelined Control
HardwiredController
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
fetch
decode & registerexecute
memory
phase
fetch phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
write
-back
phase
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
18
19. “Iron Law” of Processor Performance
Time = InstructionsCycles
Time
Program
Program * Instruction * Cycle
–Instructions per program depends on source code,
compiler technology, and ISA
–Cycles per instructions (CPI) depends upon the ISA
and the microarchitecture
–Time per cycle depends upon the microarchitecture
and the base technology
Microarchitecture
Microcoded
Single-cycle unpipelined
Pipelined
CPI
>1
1
1
cycle time
short
long
short
19
20. “Iron Law” of Processor Performance
Time = InstructionsCycles
Time
Program
Program * Instruction * Cycle
–Instructions per program depends on source code,
compiler technology, and ISA
–Cycles per instructions (CPI) depends upon the ISA
and the microarchitecture
–Time per cycle depends upon the microarchitecture
and the base technology
Microarchitecture
Microcoded
Single-cycle unpipelined
Pipelined
Multi-cycle, unpipelined control
CPI
>1
1
1
cycle time
short
long
short
20
21. “Iron Law” of Processor Performance
Time = InstructionsCycles
Time
Program
Program * Instruction * Cycle
–Instructions per program depends on source code,
compiler technology, and ISA
–Cycles per instructions (CPI) depends upon the ISA
and the microarchitecture
–Time per cycle depends upon the microarchitecture
and the base technology
Microarchitecture
Microcoded
Single-cycle unpipelined
Pipelined
Multi-cycle, unpipelined control
CPI
>1
1
1
>1
cycle time
short
long
short
short
21
22. CPI Examples
Microcoded machine7 cycles
5 cycles
10 cycles
Inst 2
Inst 3
Inst 1
Time
3 instructions, 22 cycles, CPI=7.33
Unpipelined machine
Inst 1
Inst 2
Inst 3
3 instructions, 3 cycles, CPI=1
Pipelined machine
Inst 1
Inst 2
Inst 3
3 instructions, 3 cycles, CPI=1
22
23. Technology Assumptions
• A small amount of very fast memory (caches)backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!)
Thus, the following timing assumption is reasonable
tIM tRF tALU tDM tRW
A 5-stage pipeline will be the focus of our detailed design
- some commercial designs have over 30 pipeline
stages to do an integer add!
23
24. Pipeline Diagrams
HardwiredController
0x4
Add
PC
addr
rdata
Inst.
Memory
fetch
phase
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
rdata
Data
Memory
Imm
Ext
decode & registerfetch phase
we
addr
wdata
execute
phase
memory
phase
write
-back
phase
We need some way to show multiple
simultaneous transactions in both space and time
24
25. Pipeline Diagrams: Transactions vs. Time
HardwiredController
0x4
Add
PC
addr
rdata
we
rs1
rs2
rd1
ws
wd rd2
GPRs
IR
Inst.
Memory
fetch
phase
ALU
rdata
Data
Memory
Imm
Ext
decode & registerfetch phase
time
instruction1
instruction2
instruction3
instruction4
instruction5
t0
IF1
t1 t2
ID1 EX1
IF2 ID2
IF3
we
addr
wdata
execute
phase
memory
phase
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
write
-back
phase
25
26. Pipeline Diagrams: Space vs. Time
HardwiredController
0x4
Add
addr
rdata
IR
Inst.
Memory
fetch
phase
Resources
PC
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
rdata
Data
Memory
Imm
Ext
wdata
decode & registerfetch phase
time
IF
ID
EX
MA
WB
we
addr
t0
I1
t1
I2
I1
t2
I3
I2
I1
execute
phase
t3
I4
I3
I2
I1
t4
I5
I4
I3
I2
I1
memory
phase
t5
t6
t7
....
I5
I4
I3
I2
I5
I4
I3
I5
I4
I5
write
-back
phase
26
27. Instructions Interact With Each Other in Pipeline
• Structural Hazard: An instruction in thepipeline needs a resource being used by
another instruction in the pipeline
• Data Hazard: An instruction depends on a
data value produced by an earlier instruction
• Control Hazard: Whether or not an instruction
should be executed depends on a control
decision made by an earlier instruction
27
28. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
28
29. Overview of Structural Hazards
• Structural hazards occur when two instructions needthe same hardware resource at the same time
• Approaches to resolving structural hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create structural hazards
– Stall: Hardware includes control logic that stalls until
earlier instruction is no longer using contended resource
– Duplicate: Add more hardware to design so that each
instruction can access independent resources at the same
time
• Simple 5-stage MIPS pipeline has no structural hazards
specifically because ISA was designed that way
29
30. Example Structural Hazard:
Unified Memory0x4
Add
PC
addr
rdata
Inst.
Memory
IF
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
rdata
Data
Memory
Imm
Ext
ID
we
addr
wdata
EX
MEM
WB
30
31. Example Structural Hazard:
Unified Memory0x4
Add
PC
addr
rdata
Inst.
Memory
IF
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
we
addr
rdata
Data
Memory
Imm
Ext
wdata
ID
EX
MEM
WB
addr
rdata
wdata
Unified Memory
31
32. Example Structural Hazard:
2-Cycle Memory0x4
Add
PC
addr
rdata
Inst.
Memory
IF
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
M0
Stage
ALU
we
addr
rdata
Data
Memory
Imm
Ext
ID
M1
Stage
wdata
EX
MEM
WB
32
33. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
33
34. Overview of Data Hazards
• Data hazards occur when one instruction depends on adata value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
34
35. Example Data Hazard
r1 …r4 r1 …
0x4
PC
addr
IR
IR
IR
Add
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
(ADDI R1, R0, #10)
(ADDI R4, R1, #17)
R
wdata
wdata
MD1
...
r1 r0 + 10
r4 r1 + 17
...
we
addr
MD2
r1 is stale. Oops!
35
36. Feedback to Resolve Hazards
FB1FB2
stage
1
FB3
stage
2
FB4
stage
3
stage
4
• Later stages provide dependence information to
earlier stages which can stall (or kill) instructions
• Controlling a pipeline in this manner works provided
the instruction at stage i+1 can complete without any
interaction from instructions in stages 1 to i
(otherwise deadlock)
36
37. Resolving Data Hazards with Stalls
(Interlocks)Stall Condition
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
...
r1 r0 + 10
r4 r1 + 17
...
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
37
38. Stalled Stages and Pipeline Bubbles
timet0 t1
(I1) r1 (r0) + 10 IF1 ID1
(I2) r4 (r1) + 17
IF2
(I3)
(I4)
(I5)
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2 t3 t4 t5
EX1 MA1 WB1
ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages
t2
I3
I2
I1
t6
t7
....
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
t3 t4 t5 t6 t7 . . . .
I3
I3
I3
I4
I5
I2
I2
I2
I3
I4
I5
nop nop nop I2
I3
I4
I5
I4
I1
nop nop nop I2
I3
I3
I1
nop nop nop I2
nop
I5
I4
I5
pipeline bubble
38
39. Stall Control Logic
stallws
Cstall
rs
rt
?
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Compare the source registers of the instruction in the decode
stage with the destination register of the uncommitted
instructions.
39
40. Stall Control Logic (ignoring jumps &branches)
Stall Control Logic (ignoring jumps &branches)stall
ws
we
Cstall
rs
?
rt
re2
re1
Cre
0x4
addr
IR
ws
we
Cdest
Cdest
nop
Add
PC
ws
we
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Cdest
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Should we always stall if the rs field matches some rd?
not every instruction writes a register we
not every instruction reads a register re
40
41. Source & Destination Registers
Source & Destination RegistersALU
ALUI
LW
SW
BZ
J
JAL
JR
JALR
R-type:
op
rs
rt
rd
func
I-type:
op
rs
rt
immediate16
J-type:
op
immediate26
source(s) destination
rs, rt
rd
rs
rt
rs
rt
rs, rt
rd (rs) func (rt)
rt (rs) op immediate
rt M [(rs) + immediate]
M [(rs) + immediate] (rt)
cond (rs)
true: PC (PC) + immediate
rs
false: PC (PC) + 4
rs
PC (PC) + immediate
r31 (PC), PC (PC) + immediate
PC (rs)
rs
r31 (PC), PC (rs)
rs
31
31
41
42. Deriving the Stall Signal
Cdestws = Case opcode
ALU
rd
ALUi, LW
rt
JAL, JALR
R31
we = Case opcode
ALU, ALUi, LW (ws 0)
JAL, JALR
on
...
off
Cstall
Cre
re1 = Case opcode
ALU, ALUi,
LW, SW, BZ,
JR, JALR
on
J, JAL
off
re2 = Case opcode
on
ALU, SW
off
...
stall = ((rsD =wsE).weE +
(rsD =wsM).weM +
(rsD =wsW).weW) . re1D
((rtD =wsE).weE +
(rtD =wsM).weM +
(rtD =wsW).weW) . re2D
+
42
43. Hazards due to Loads & Stores
Hazards due to Loads & StoresStall Condition
What if
(r1)+7 = (r3)+5 ?
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
...
M[(r1)+7] (r2)
r4 M[(r3)+5]
...
we
addr
R
wdata
wdata
MD1
MD2
Is there any possible data hazard
in this instruction sequence?
43
44. Data Hazards Due to Loads and Store
• Example instruction sequence– Mem[ Regs[r1] + 7 ] <- Regs[r2]
– Regs[r4] <- Mem[ Regs[r3] + 5 ]
• What if Regs[r1]+7 == Regs[r3]+5 ?
– Writing and reading to/from the same address
– Hazard is avoided because our memory system
completes writes in a single cycle
– More realistic memory system will require more
careful handling of data hazards due to loads and
stores
44
45. Overview of Data Hazards
• Data hazards occur when one instruction depends on adata value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
45
46. Adding Bypassing to the Datapath
stallr4 r1
0x4
r1
nop
Add
IR
E
IR
M
IR
31
ASrc
PC
addr
inst IR
D
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
(I1)
(I2)
R
wdata
wdata
MD1
...
r1 r0 + 10
r4 r1 + 17
we
addr
MD2
When does this bypass help?
r1 Mem[r0 + 10]
r4 r1 + 17
JAL 500
r4 r31 + 17
46
W
47. Deriving the Bypass Signal
timet0 t1
(I1) r1 (r0) + 10 IF1 ID1
(I2) r4 (r1) + 17
IF2
(I3)
(I4)
(I5)
t2 t3 t4 t5
EX1 MA1 WB1
ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages
t6
t7
....
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
Each stall or kill introduces a bubble in the pipeline
⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time
t0 t1
(I1) r1 (r0) + 10 IF1 ID1
(I2) r4 (r1) + 17
IF2
(I3)
(I4)
(I5)
t2
EX1
ID2
IF3
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
47
48. The Bypass Signal
Deriving it from the Stall Signalstall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D
+((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )
ws = Case opcode
ALU
rd
ALUi, LW
rt
JAL, JALR R31
ASrc = (rsD=wsE).weE.re1D
we = Case opcode
ALU, ALUi, LW (ws 0)
JAL, JALR
on
...
off
Is this correct?
No because only ALU and ALUi instructions can benefit
from this bypass
Split weE into two components: we-bypass, we-stall
48
49. Bypass and Stall Signals
Split weE into two components: we-bypass, we-stallwe-bypassE = Case opcodeE
ALU, ALUi (ws 0)
...
off
we-stallE = Case opcodeE
LW
(ws 0)
JAL, JALR on
...
off
ASrc
= (rsD =wsE).we-bypassE . re1D
stall
= ((rsD =wsE).we-stallE +
(rsD=wsM).weM + (rsD=wsW).weW). re1D
+((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D
49
50. Fully Bypassed Datapath
PC for JAL, ...stall
0x4
nop
Add
PC
addr
ASrc
inst IR
Inst
Memory
D
Is there still
a need for the
stall signal ?
IR
M
IR
31
we
rs1
rs2
rd1
ws
wd rd2
A
ALU
GPRs
Imm
Ext
IR
E
Y
B
BSrc
we
addr
rdata
Data
Memory
R
wdata
wdata
MD1
MD2
stall = (rsD=wsE). (opcodeE=LWE).(wsE 0 ).re1D
+ (rtD=wsE). (opcodeE=LWE).(wsE 0 ).re2D
50
W
51. Overview of Data Hazards
• Data hazards occur when one instruction depends on adata value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
51
52. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
52
53. Control Hazards
• What do we need to calculate next PC?– For Jumps
• Opcode, offset and PC
– For Jump Register
• Opcode and Register value
– For Conditional Branches
• Opcode, PC, Register (for condition), and offset
– For all other instructions
• Opcode and PC
– have to know it’s not one of above!
53
54. Opcode Decoding Bubble
(assuming no branch delay slots for now)time
t0 t1 t2
(I1) r1 (r0) + 10 IF1 ID1 EX1
(I2) r3 (r2) + 17
IF2 IF2
(I3)
(I4)
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1 t2
nop I2
I1
nop
I1
I1
CPI = 2!
t3 t4 t5 t6 t7 . . . .
MA1 WB1
ID2 EX2 MA2 WB2
IF3 IF3 ID3 EX3 MA3 WB3
IF4 IF4 ID4 EX4 MA4 WB4
t3
t4 t5 t6 t7 . . . .
nop I3
nop I4
nop I3
nop I4
I2
nop I2
nop I3
nop I4
nop I4
I1
nop I2
nop I3
nop I4
I1
nop I2
nop I3
nop
pipeline bubble 54
55. Speculate next address is PC+4
PCSrc (pc+4 / jabs / rind/ br)stall
Add
nop
0x4
Add
Jump?
I1
I2
I3
I4
PC
addr
inst
IR
104
Inst
Memory
I2
096
100
104
304
ADD
J 304
ADD
ADD
kill
E
M
IR
IR
I1
A jump instruction kills (not stalls)
the following instruction
How? 55
56. Pipelining Jumps
PCSrc (pc+4 / jabs / rind/ br)stall
To kill a fetched
instruction -- Insert
a mux before IR
Add
nop
0x4
Add
Jump?
IRSrc D
I1
I2
I3
I4
PC
addr
304
104
Inst
Memory
096
100
104
304
inst
nop
IR
nop
I2
ADD
J 304
ADD
ADD
kill
E
M
IR
IR
II21
I1
Any
interaction
between
stall and
IRSrcD = Case opcodeD jump?
J, JAL
...
nop
IM
56
57. Jump Pipeline Diagrams
timet0 t1 t2
IF1 ID1 EX1
IF2 ID2
IF3
(I1) 096: ADD
(I2) 100: J 304
(I3) 104: ADD
(I4) 304: ADD
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
I3
I2
I1
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
nop nop nop nop
IF4 ID4 EX4 MA4 WB4
t3 t4 t5 t6 t7
I4
I5
nop I4
I5
I2
nop I4
I5
I5
I1
I2
nop I4
I1
I2
nop I4
nop
....
I5
pipeline bubble
57
58. Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br)stall
Add
nop
0x4
Add
E
M
IR
IR
I1
BEQZ?
zero?
IRSrcD
PC
104
I1
I2
I3
I4
096
100
104
108
304
addr
inst
nop
Inst
Memory
ADD
BEQZ r1 +200
ADD
…
ADD
IR
A
ALU
Y
I2
Branch condition is not known until the
execute stage
what action should be taken in the
decode stage ?
58
59. Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br)stall
?
Add
E
nop
0x4
Add
M
BEQZ?
IR
IR
I2
I1
zero?
IRSrcD
I1
I2
I3
I4
PC
addr
108
Inst
Memory
096
100
104
108
304
inst
nop
IR
A
ALU
Y
I3
If the branch is taken
ADD
- kill the two following instructions
BEQZ r1 +200
- the instruction at the decode stage is
ADD
…
not valid
59
ADD
stall signal is not valid
60. Pipelining Conditional Branches
stallAdd
PCSrc (pc+4 / jabs / rind / br)
E
IRSrc E
nop
0x4
Add
Jump?
M
BEQZ?
IR
IR
I2
I1
zero?
PC
I1
I2
I3
I4
IRSrcD
PC
addr
108
Inst
Memory
096
100
104
108
304
inst
nop
IR
A
ALU
Y
I3
If the branch is taken
ADD
- kill the two following instructions
BEQZ r1 +200
- the instruction at the decode stage is
ADD
…
not valid
60
ADD
stall signal is not valid
61. New Stall Signal
stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D+ ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )
. !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z)
Don’t stall if the branch is taken. Why?
Instruction at the decode stage is invalid
61
62. Control Equations for PC and IR Muxes
PCSrc = Case opcodeEBEQZ.z, BNEZ.!z
br
...
Case opcodeD
J, JAL
JR, JALR
...
jabs
rind
pc+4
IRSrcD = Case opcodeE
BEQZ.z, BNEZ.!z
nop
...
Case opcodeD
J, JAL, JR, JALR nop
...
IM
Give priority
to the older
instruction,
i.e., execute-stage
instruction
over decode-stage
instruction
IRSrcE = Case opcodeE
BEQZ.z, BNEZ.!z
nop
...
stall.nop + !stall.IRD
62
63. Branch Pipeline Diagrams
(resolved in execute stage)time
t0 t1
(I1) 096: ADD
IF1 ID1
IF2
(I2) 100: BEQZ +200
(I3) 104: ADD
(I4) 108:
(I5) 304: ADD
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
EX1
ID2
IF3
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 nop nop nop
IF4 nop nop nop nop
IF5 ID5 EX5 MA5 WB5
t2
I3
I2
I1
t3
I4
I3
I2
I1
t4 t5 t6 t7 . . . .
I5
nop I5
nop nop I5
I2
nop nop I5
I1
I2
nop nop I5
nop
pipeline bubble
63
64. Reducing Branch Penalty
(resolve in decode stage)• One pipeline bubble can be removed if an extra
comparator is used in the Decode stage
– But might elongate cycle time
PCSrc (pc+4 / jabs / rind/ br)
E
Add
nop
0x4
IR
Add
PC
addr
nop
inst
IR
Inst
Memory
D
we
rs1
rs2
rd1
ws
wd rd2
Zero detect on
register file output
GPRs
Pipeline diagram now same as for jumps64
65. Branch Delay Slots
(expose control hazard to software)• Change the ISA semantics so that the instruction
that follows a jump or branch is always executed
– gives compiler the flexibility to put in a useful instruction where normally
a pipeline bubble would have resulted.
I1
I2
I3
I4
096
100
104
304
ADD
BEQZ r1 +200
Delay slot instruction executed
ADD
regardless of branch outcome
ADD
• Other techniques include more advanced
branch prediction, which can dramatically
reduce the branch penalty... to come later
65
66. Branch Pipeline Diagrams (branch delay slot)
timet0 t1
(I1) 096: ADD
IF1 ID1
(I2) 100: BEQZ +200
IF2
(I3) 104: ADD
(I4) 304: ADD
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
EX1
ID2
IF3
t3 t4 t5 t6 t7 . . . .
MA1 WB1
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
t2
I3
I2
I1
t3
I4
I3
I2
I1
t4
t5
t6
t7
I4
I3
I2
I1
I4
I3
I2
I4
I3
I4
....
66
67. Why an Instruction may not be dispatched every cycle (CPI>1)
Why an Instruction may not bedispatched every cycle (CPI>1)
• Full bypassing may be too expensive to implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase cycle time and
counteract the benefit of reducing CPI
• Loads have two-cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible pipeline hazard
(compiler schedules independent instruction or inserts NOP to avoid
hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
• MIPS:“Microprocessor without Interlocked Pipeline Stages”
• Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute significant
number of NOP instructions inserted by the compiler. NOPs not
counted in useful CPI (alternatively, increase instructions/program)
67
68. Agenda
• Microcoded Microarchitectures• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
68
69. Acknowledgements
• These slides contain material developed and copyright by:– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
• MIT material derived from course 6.823
• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750
69
informatics