Software Scheduling to Avoid Load Hazards

672.04K

Category:

electronics

Computer structure pipeline

1. Computer Structure Pipeline

Lecturer:
Aharon Kupershtok
1
Computer Structure 2015 – Pipeline

2. A Basic Processor

Fetch
src1 reg
PC
src2 reg
deco
de
opcode
Inst. Inst.
Cache
Execute
Decode
Memory
Control signals
ALU
ALU
Register
source control
File
src1 src1
src1
src2
data
src2 src2
data
A
L
U
Mem
Rd/Wr
Data
Cache
address
dst
dst reg
imm
data
Sign
Ext.
imm
Write-back value
2
Computer Structure 2015 – Pipeline
Write
back

3. Pipelined Car Assembly

1 hour
2 hours
1 hour
chassis
engine
finish
Car 1
2
3
3
Computer Structure 2015 – Pipeline

4.

Pipelining Instructions
Program
execution
order
2
Time
lw R1, 100(R0)
4
6
Inst
Reg ALU
Fetch
lw R2, 200(R0)
8
10
12
14
16
18
Data
Reg
Access
Inst
Reg ALU
Fetch
8 ns
lw R3, 300(R0)
Data
Reg
Access
Inst
Fetch
8 ns
...
8 ns
Program
execution
order
2
Time
lw R1, 100(R0)
Inst
Fetch
lw R2, 200(R0)
2 ns
lw R3, 300(R0)
4
6
Reg ALU
Inst
Fetch
2 ns
8
2 ns
14
12
Data
Reg
Access
Reg ALU
Inst
Fetch
10
Data
Reg
Access
Reg ALU
2 ns
2 ns
Data
Reg
Access
2 ns
2 ns
Ideal speedup is number of stages in the pipeline. Do we achieve this?
4
Computer Structure 2015 – Pipeline

5. Pipelining

Pipelining does not reduce the latency of single task,
it increases the throughput of entire workload
Potential speedup = Number of pipe stages
Pipeline rate is limited by the slowest pipeline stage
Þ Partition the pipe to many pipe stages
Þ Make the longest pipe stage to be as short as possible
Þ Balance the work in the pipe stages
Pipeline adds overhead (e.g., latches)
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for dependencies
Þ Too many pipe-stages start to loose performance
IPC of an ideal pipelined machine is 1
5
Every clock one instruction finishes
Computer Structure 2015 – Pipeline

6. Pipelined CPU

Fetch
4
Decode
Memory
+
deco
de
Inst.
Cache
Control
signals
Register
File
src1
src1
data
Instruction
PC
Execute
src2
src2
data
ALU
source
ALU
Control
A
L
U
Mem
Wr
Mem
Rd
Data
Cache
address
dst
Sign
Ext.
data
dst
6
Computer Structure 2015 – Pipeline
WB

7. Structural Hazard

Different instructions using the same resource at the same time
Register File:
Accessed in 2 stages:
Read during stage 2 (ID)
Write during stage 5 (WB)
Solution: 2 read ports, 1 write port
IM
Reg
IM
DM
DM
Reg
IM
Reg
DM
Reg
Memory
Accessed in 2 stages:
Instruction Fetch during stage 1 (IF)
Data read/write during stage 4 (MEM)
Solution: separate instruction cache and data cache
IM
Reg
DM
Reg
IM
Reg
Reg
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all
instructions
7
Computer Structure 2015 – Pipeline
Reg
DM
Reg

8. Pipeline Example: cycle 1

Fetch
4
Decode
deco
de
0
Memory
+
Inst.
Cache
Control
signals
Register
File
src1
src1
data
Instruction
PC
Execute
src2
src2
data
ALU
source
ALU
Control
A
L
U
Mem
Wr
Data
Cache
address
dst
Sign
Ext.
data
dst
0
4
8
12
lw
sub
and
or
8
Mem
Rd
R10,9(R1)
R11,R2,R3
R12,R4,R5
R13,R6,R7
Computer Structure 2015 – Pipeline
WB

9. Pipeline Example: cycle 2

Fetch
4
Decode
deco
de
4
Memory
+
Inst.
Cache
lw
Control
signals
Register
File
src1
src1
data
Instruction
PC
Execute
src2
src2
data
ALU
source
ALU
Control
A
L
U
Mem
Wr
Data
Cache
address
dst
Sign
Ext.
data
dst
0
4
8
12
lw
sub
and
or
9
Mem
Rd
R10,9(R1)
R11,R2,R3
R12,R4,R5
R13,R6,R7
Computer Structure 2015 – Pipeline
WB

10. Pipeline Example: cycle 3

Fetch
4
Decode
deco
de
8
Control
signals
Inst.
Cache sub
src2
src2
data
10
ALU
source
ALU
Control
A
L
U
Mem
Wr
Mem
Rd
Data
Cache
address
dst
dst
lw
sub
and
or
lw
Register
File
src1
src1
data [R1]
Sign
Ext.
0
4
8
12
Memory
+
Instruction
PC
Execute
data
9
10
R10,9(R1)
R11,R2,R3
R12,R4,R5
R13,R6,R7
Computer Structure 2015 – Pipeline
WB

11. Pipeline Example: cycle 4

Fetch
4
Decode
deco
de
12
Memory
+
Control
signals
sub
Register
File
src1
src1
data [R2]
Instruction
PC
Execute
Inst.
Cache and
src2
src2
data [R3]
ALU
source
ALU
Control
lw
A
L [R1]+9 address
U
dst
dst
lw
sub
and
or
11
Mem
Rd
Data
Cache
data
Sign
Ext.
0
4
8
12
Mem
Wr
11
10
R10,9(R1)
R11,R2,R3
R12,R4,R5
R13,R6,R7
Computer Structure 2015 – Pipeline
WB

12. Pipeline Example: cycle 5

Fetch
4
Decode
deco
de
16
Memory
Inst.
Cache
or
Control
signals
and
Register
File
src1
src1
data [R4]
src2
src2
data [R5]
ALU
source
ALU
Control
sub
12
Mem
Rd
l
w
A
L[R2]-[R3]address
U
M[[R1]+9]
dst
dst
lw
sub
and
or
Mem
Wr
Data
Cache
data
Sign
Ext.
0
4
8
12
WB
+
Instruction
PC
Execute
12
11
R10,9(R1)
R11,R2,R3
R12,R4,R5
R13,R6,R7
Computer Structure 2015 – Pipeline
10

13. RAW Dependency

Program
execution
order
sub R2, R1, R3
and R12,R2, R5
or
R13,R6, R2
add R14,R2, R2
sw
13
R15,100(R2)
Fetch DEC
EXE
MEM
WB
Fetch DEC
EXE
MEM
WB
Fetch DEC
EXE
MEM
WB
Fetch DEC
EXE
MEM
WB
EXE
MEM
Fetch DEC
Computer Structure 2015 – Pipeline
WB

14. Using Bypass to Solve RAW Dependency

Program
execution
order
sub R2, R1, R3
and R12,R2, R5
or
R13,R6, R2
add R14,R2, R2
sw
R15,100(R2)
Fetch DEC
EXE
MEM
WB
Fetch DEC
EXE
MEM
WB
Fetch DEC
EXE
MEM
WB
Fetch DEC
EXE
MEM
WB
EXE
MEM
Fetch DEC
Bypass result directly from EXE output to EXE input
14
Computer Structure 2015 – Pipeline
WB

15. RAW Dependency

Fetch
4
Decode
deco
de
16
Memory
Inst.
Cache
or
Control
signals
and
Register
File src1
[R4]
src1
data
src2
src2
data [R5]
ALU
source
ALU
Control
sub
15
Mem
Rd
l
w
A
L[R2]-[R3]address
U
M[[R1]+9]
dst
dst
lw
sub
and
or
Mem
Wr
Data
Cache
data
Sign
Ext.
0
4
8
12
WB
+
Instruction
PC
Execute
R5
R4,9(R1)
R5,R2,R3
R12,R4,R5
R13,R6,R7
Computer Structure 2015 – Pipeline
R4

16. Forwarding Hardware

Fetch
L1
deco
de
L2
L3
and
sub
Inst.
Cache
or
Control
signals
Register
File src1 [R4]
src1
data
src2
ALU
source
src2 [R5]
data
ALU
Control
Mem
Wr
2
16
l
w
A
L[R2]-[R3]address
U
M[[R1]+9]
dst
data
R4 R5
src1
src2
R4,9(R1)
R5,R2,R3
R12,R4,R5
R13,R6,R7
R4
R5
dst
lw
sub
and
or
Mem
Rd
Data
Cache
1
Sign
Ext.
0
4
8
12
WB
L4
R5
Bypass Control
Computer Structure 2015 – Pipeline
R4
dstt-2
16
Memory
+
Instruction
PC
Execute
dstt-1
4
Decode

17. Forwarding Control

Forwarding from EXE (L3)
if (L3.RegWrite and (L3.dst == L2.src1)) ALUSelA = 1
if
(L3.RegWrite and (L3.dst == L2.src2)) ALUSelB = 1
Forwarding from MEM (L4)
if (L4.RegWrite and
((not L3.RegWrite) or (L3.dst L2.src1)) and
(L4.dst = L2.src1)) ALUSelA = 2
if
17
(L4.RegWrite and
((not L3.RegWrite) or (L3.dst L2.src2)) and
(L4.dst = L2.src2)) ALUSelB = 2
Computer Structure 2015 – Pipeline

18. Register File Split

sub R2, R1, R3
xxx
xxx
and R12,R2,R11
Þ
18
IM
Reg
IM
DM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
Reg
Reg
DM
Register file is written during first half of the cycle
Register file is read during second half of the cycle
Register file is written before it is read Þ returns the
correct data
Computer Structure 2015 – Pipeline
Reg

19. Can't Always Forward

Load word can still causes a hazard:
an instruction tries to read a register following a load
instruction that writes to the same register
Program
execution
order
lw
Time (clock cycles)
1
2
R2, 30(R1)
and R12,R2, R5
or
R13,R6, R2
add R14,R2, R2
sw
R15,100(R2)
IM
3
Reg
IM
4
5
DM
Reg
Reg
IM
DM
Reg
IM
6
8
9
Reg
DM
Reg
IM
7
Reg
DM
Reg
Reg
DM
Reg
Þ A hazard detection unit is needed to “stall” the load instruction
19
Computer Structure 2015 – Pipeline

20. Stall If Cannot Forward

if (L2.RegWrite and (L2.opcode == lw) and
( (L2.dst == L1.src1) or (L2.dst == L1.src2) ) then stall
De-assert the enable to the L1 latch, and to the IP
The dependent instruction (and) stays another cycle in L1
Issue a NOP into the L2 latch (instead of the stalled inst.)
Allow the stalling instruction (lw) to move on
Program
execution
order
lw
Time (clock cycles)
1
2
R2, 30(R1)
and R12,R2, R5
or
R13,R6, R2
IM
3
Reg
IM
4
5
DM
Reg
Reg
Reg
IM
IM
6
DM
Reg
7
8
9
Reg
DM
Reg
bubble
add R14,R2, R2
sw
20
R15,100(R2)
IM
DM
Reg
IM
Reg
Computer Structure 2015 – Pipeline
Reg
DM
Reg

21. Software Scheduling to Avoid Load Hazards

Example: code for (assume all variables are in memory):
a = b + c;
d = e – f;
Slow code
LW Rb,b
LW Rc,c
Stall ADD
Ra,Rb,Rc
SW
a,Ra
LW Re,e
LW Rf,f
Stall SUB
Rd,Re,Rf
SW d,Rd
Fast code
LW Rb,b
LW Rc,c
LW Re,e
ADD
Ra,Rb,Rc
LW Rf,f
SW
a,Ra
SUB
Rd,Re,Rf
SW d,Rd
Instruction order can be changed as long as the
correctness is kept
21
Computer Structure 2015 – Pipeline

22. Control Hazards

22
Computer Structure 2015 – Pipeline

23. Control Hazard on Branches

Fetch
4
Execute
+
Inst.
Cache
Memory
+
Instruction
PC
Decode
Register
File
src1
src1
data
src2
src2
data
A
L
U
Data
Cache
address
dst
data
Sign
Ext.
0 or
4 jcc 48 (offset=40)
8 and
12 mul
16 sub
23
jcc target; if cond then IP IP + 4 + offset
Computer Structure 2015 – Pipeline
WB

24. Control Hazard on Branches

Fetch
4
8
Inst.
Cache
8
jcc
Instruction
PC
+
Decode
Execute
Register
File
src1
src1
data
src2
src2
data
4
or
Memory
+
A
L
U
Data
Cache
address
dst
data
Sign
Ext.
0 or
4 jcc 48 (offset=40)
8 and
12 mul
16 sub
24
jcc target; if cond then IP IP + 4 + offset
Computer Structure 2015 – Pipeline
WB

25. Control Hazard on Branches

Fetch
4
+
Decode
12
8
and
12
Instruction
PC
Inst.
Cache
Execute
jcc
Register
File
src1
src1
data
src2
src2
data
+
Memory
or
A
L
U
Data
Cache
address
dst
Sign
Ext.
data
40
0 or
4 jcc 48 (offset=40)
8 and
12 mul
16 sub
25
jcc target; if cond then IP IP + 4 + offset
Computer Structure 2015 – Pipeline
WB

26. Control Hazard on Branches

Fetch
Decode
Execute
Memory
WB
48
4
+
16
12
mul
16
Instruction
PC
Inst.
Cache
and
Register
File
src1
src1
data
src2
src2
data
+
or
jcc
A cond
L
U
Data
Cache
address
dst
data
Sign
Ext.
0 or
4 jcc 48 (offset=40)
8 and
12 mul
16 sub
26
jcc target; if cond then IP IP + 4 + offset
Computer Structure 2015 – Pipeline

27. Control Hazard on Branches

Fetch
4
+
Decode
16
20
sub
48
Instruction
PC
Inst.
Cache
Execute
mul
Register
File
src1
src1
data
src2
src2
data
+
Memory
and
A
L
U
jcc
Data
Cache
address
dst
data
Sign
Ext.
0 or
4 jcc 48 (offset=40)
8 and
12 mul
16 sub
27
jcc target; if cond then IP IP + 4 + offset
Computer Structure 2015 – Pipeline
WB

28. Control Hazard on Branches

PC
Beq
IM
Reg
Reg
DM
PC
And
The 3 instructions
following the branch
get into the pipe even
if the branch is taken
IM
Reg
DM
Reg
PC
mul
IM
Reg
DM
Reg
PC
sub
IM
Reg
DM
Reg
PC
Inst from target
28
IM
Computer Structure 2015 – Pipeline
Reg
DM
Reg

29. Control Hazard: Stall

Stall pipe when branch is encountered until resolved
Stall impact: assumptions
CPI = 1
20% of instructions are branches
Stall 3 cycles on every branch
Þ CPI new = 1 + 0.2 × 3 = 1.6
(CPI new = CPI Ideal + avg. stall cycles / instr.)
We loose 60% of the performance
29
Computer Structure 2015 – Pipeline

30. Control Hazard: Predict Not Taken

Execute instructions from the fall-through (not-taken) path
As if there is no branch
If the branch is not-taken (~50%), no penalty is paid
If branch actually taken
Flush the fall-through path instructions before they change
the machine state (memory / registers)
Fetch the instructions from the correct (taken) path
Assuming ~50% branches not taken on average
CPI new = 1 + (0.2 × 0.5) × 3 = 1.3
30
Computer Structure 2015 – Pipeline

31. Dynamic Branch Prediction

Add a Branch Target Buffer (BTB) that predicts (at fetch)
Instruction is a branch
Branch taken / not-taken
Taken branch target
Branch IP
IP of inst in fetch
31
Target IP
History
?=
Inst predicted
Predicted
Branch predicted
BTB allocatedtoatbeexecute
all branch
info
is known
branch – after
Target
taken
or not
taken
BTB is looked up at instruction fetch
Computer Structure 2015 – Pipeline

32. BTB

Allocation
Allocate instructions identified as branches (after decode)
Both conditional and unconditional branches are allocated
Not taken branches need not be allocated
BTB miss implicitly predicts not-taken
Prediction
BTB lookup is done parallel to IC lookup
BTB provides
32
Indication that the instruction is a branch (BTB hits)
Branch predicted target
Branch predicted direction
Branch predicted type (e.g., conditional, unconditional)
Update (when branch outcome is known)
Branch target
Branch history (taken / not-taken)
Computer Structure 2015 – Pipeline

33. BTB (cont.)

Wrong prediction
Predict not-taken, actual taken
Predict taken, actual not-taken, or actual taken but wrong
target
In case of wrong prediction – flush the pipeline
Reset latches (same as making all instructions to be NOPs)
Select the PC source to be from the correct path
Need get the fall-through with the branch
33
Start fetching instruction from correct path
Assuming P% correct prediction rate
CPI new = 1 + (0.2 × (1-P)) × 3
For example, if P=0.7
CPI new = 1 + (0.2 × 0.3) × 3 = 1.18
Computer Structure 2015 – Pipeline

34. Adding a BTB to the Pipeline

Fetch
Decode
Execute
Memory
Flush and Repair
Repair target
Predicted target
Instruction
BTB
target
50
IP
4
src2
direction
taken
allocate
Register
File
src1
src1
data
src2
data
≠
+ 8
≠
4
Predicted direction
Next seq. address
+ calc. target
A
L
U
address
dst
or
Sign
Ext.
Inst.
0 or
Cache jcc
4 jcc 50
8 and
Lookup
current
BTB
provides
IC provides the
...
IP in IC and
in
predicted
target
50 sub
instruction bytes
BTBdirection
in parallel
and
54 mul
58 add
34
Data
Cache
Computer Structure 2015 – Pipeline
data
WB

35. Adding a BTB to the Pipeline

Fetch
Decode
Execute
Memory
Flush and Repair
Repair target
Predicted target
+
takenPredicted direction
8 Next seq. address
Instruction
BTB
target
IP
50
54
allocate
Inst.
0 or
sub
Cache
4 jcc 50
8 and
...
50 sub
54 mul
58 add
35
jcc
42
+ calc. target
Register
File
src1
src1
data
src2
direction
≠
4
≠
50
A
L
U
src2
data
Data
Cache
address
dst
or
Sign
Ext.
Computer Structure 2015 – Pipeline
data
WB

36. Adding a BTB to the Pipeline

Fetch
Decode
Execute
Flush and Repair
Predicted directiontaken
Next seq. address 8
+
Instruction
BTB
target
IP
58
54
Inst.
0 or
mul
Cache
4 jcc 50
8 and
...
50 sub
54 mul
58 add
36
src2
direction
allocate
Register
File
src1
src1
data
≠
4
50
≠
Repair target
Predicted target
Issue flush
Memory in case
WB of
Along with mismatch
the repair IPVerify target
(if taken)
Verify
direction
+ calc. target50
A taken
L
U
src2
data
Data
Cache
address
dst
jcc
sub
or
Sign
Ext.
Computer Structure 2015 – Pipeline
data

37. Using The BTB

PC moves to next instruction
Inst Mem gets PC
and fetches new inst
BTB gets PC
and looks it up
IF
yes
yes
IF/ID latch loaded
with new inst
ID
EXE yes
37
Br taken ?
PC perd addr
IF/ID latch loaded
with pred inst
Branch ?
BTB Hit ?
no
no
PC PC + 4
IF/ID latch loaded
with seq. inst
no
Computer Structure 2015 – Pipeline

38. Using The BTB (cont.)

ID
Using The BTB (cont.)
yes
EXE
Branch ?
Calculate br
cond & trgt
no
continue
Update BTB
MEM
yes
Corect
pred ?
no
continue
Flush pipe &
update PC
WB
38
IF/ID latch loaded
with correct inst
Computer Structure 2015 – Pipeline

39. Backup

39
Computer Structure 2015 – Pipeline

40. MIPS Instruction Formats

31
26
· R-type
op
(register insts)
6
0
rt
rd
shamt funct
rs
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
21
16
· I-type (Load,
16
31
26
21
rt
Store, Branch,
op
rs
6 bits 5 bits 5 bits
inst’s w/imm
data)
· J-type (Jump) 31
26
op
6 bits
11
0
immediate
16 bits
target address
26 bits
op: operation of the instruction
rs, rt, rd: the source and destination register specifiers
shamt: shift amount
funct: selects the variant of the operation in the “op” field
address / immediate: address offset or immediate value
target address: target address of the jump instruction
40
Computer Structure 2015 – Pipeline
0

41. The Memory Space

1 byte
41
Each memory location
is 8 bit = 1 byte wide
has an address
We assume 32 byte address
An address space of 232 bytes
Memory stores both instructions
and data
Each instruction is 32 bit wide
Þ stored in 4 consecutive bytes
in memory
Various data types have
different width
Computer Structure 2015 – Pipeline
00000000
00000001
00000002
00000003
00000004
00000005
00000006
FFFFFFFA
FFFFFFFB
FFFFFFFC
FFFFFFFD
FFFFFFFE
FFFFFFFF

42. Register File

The Register File holds 32 registers
Each register is 32 bit wide
The RF supports parallel
reading any two registers and
writing any register
Inputs
5
5
5
32
Read
reg 1
Read
data 1
32
Read
reg 2 Register
Write
reg
File
Write
data
Write reg (relevant when RegWrite=1)
#register to which the value in Write data is written to
Write data (relevant when RegWrite=1)
data written to Write reg
Outputs
42
Read reg 1/2: #register whose value
will be output on Read data 1/2
RegWrite: write enable
RegWrite
Read data 1/2: data read from Read reg 1/2
Computer Structure 2015 – Pipeline
Read
data 2
32

43. Memory Components

Inputs
Address: address of the memory
location we wish to access
Read: read data from location
Write: write data into location
Write data (relevant when Write=1)
data to be written into specified location
Outputs
Write
32
Address
32
Memory
Write
Data
Read
Data
32
Read
Read data (relevant when Read=1)
data read from specified location
Cache
Memory components are slow relative to the CPU
43
A cache is a fast memory which contains only small part of the memory
Instruction cache stores parts of the memory space which hold code
Data Cache stores parts of the memory space which hold data
Computer Structure 2015 – Pipeline

44. The Program Counter (PC)

44
Holds the address (in memory) of the next instruction to be
executed
After each instruction, advanced to point to the next
instruction
If the current instruction is not a taken branch,
the next instruction resides right after the current instruction
PC PC + 4
If the current instruction is a taken branch,
the next instruction resides at the branch target
PC target
(absolute jump)
PC PC + 4 + offset×4 (relative jump)
Computer Structure 2015 – Pipeline

45. Instruction Execution Stages

Fetch
Decode
For load: read data from memory
For store: write data into memory
Write Back
45
For a memory access: calculate effective address
For an ALU operation: execute operation in ALU
For a branch: calculate condition and target
Memory Access
Decode instruction (generate control signals)
Fetch operands from register file
Execute
Fetch instruction pointed by PC from I-Cache
Write result back to register file
update program counter
Computer Structure 2015 – Pipeline
Instruction
Fetch
Instruction
Decode
Execute
Memory
Result
Store

46. The MIPS CPU

Instruction
fetch
Instruction
Decode /
register fetch
Execute /
address
calculation
Control
RegWrite
[20-16]
Address
Instruction
Instruction
Cache
Instruction
0
[15-11]
m
u
x
1
RegDst
[15-0]
Read
reg 1
Read
reg 2
Write
reg
Write
data
Add
Shift
left 2
Read
data 2
m
u
x
1
PCSrc
Read
data 1
Register
File
[25-21]
PC
ALUSrc
0
ALU
Branch
Zero
ALU res
MemWrite
MemtoReg
Address
m
u
x
1
Write
Data
16
Sign 32
extend
ALU
Control
[5-0]
ALUOp
46
Write
back
0
Add
4
Memory
access
Computer Structure 2015 – Pipeline
Read
Data
Data
Cache
MemRead
1
m
u
x
0

47. Executing an Add Instruction

R2 R3+R5
Add R2, R3, R5 ;
op 26
31
rs 21
ALU
rt 16
3
Add
rd 11 shamt 6
5
2
0
0 = Add
[PC]+4
0
4
Address
Instruction
Instruction
Memory
ADD
[20-16]
Instruction
3
5
m 2
u
0
[15-11]
x
1
RegDst=1
[15-0]
RegWrite=1
Read
reg 1
Read
reg 2
Write
reg
Write
data
Read
data 1
Register
File
[25-21]
PC
funct 0
Read
data 2
Add
Shift
left 2
PCSrc=0
R3
ALUSrc=0
R5
m
u
x
1
0
m
u
x
1
ALU
+
Branch=0
Zero
ALU res
MemtoReg=0
Address
Write
Data
16
Sign 32
extend
ALU
Control
[5-0]
ALUOp
47
MemWrite=0
Computer Structure 2015 – Pipeline
Read
Data
Data
Memory
MemRead=0
R3 + R5
1
m
u
x
0

48. Executing a Load Instruction

R1 Mem[R2+30]
LW R1, (30)R2 ;
31
op 26
rs 21
LW
rt 16
2
Add
immediate
1
30
[PC]+4
0
4
RegWrite=1
Address
Instruction
Instruction
Memory
LW
[20-16]
Instruction
0
[15-11]
m
u
x
1
1
RegDst=0
[15-0]
Read
reg 1
Read
reg 2
Write
reg
Write
data
16
Read
data 1
Register
File
2
[25-21]
PC
0
PCSrc=0
R2
ALUSrc=1
0
Read
data 2
Sign 32
extend
Add
Shift
left 2
m
u
x
1
30
m
u
x
1
ALU
+
Zero
ALU res
Branch=0 MemWrite=0
R2+30
Address
Write
Data
ALU
Control
Read
Data
Data
Memory
MemtoReg=1
1
m
u
x
0
MemRead=1
[5-0]
ALUOp
48
Computer Structure 2015 – Pipeline
Mem[R2+30]

49. Executing a Store Instruction

SW R1, (30)R2 ; Mem[R2+30] R1
31
op 26
rs 21
SW
rt 16
2
Add
immediate
1
30
[PC]+4
0
4
PC
Address
Instruction
Instruction
Memory
SW
[20-16]
Instruction
RegWrite = 0
2
1
Read
reg 1
Read
reg 2
0
[15-11]
Write
reg
m
u
x
1
RegDst =
[15-0]
Write
data
16
Read
data 1
Register
File
[25-21]
0
PCSrc = 0
R2
ALUSrc = 1
0
Read
data 2
Sign 32
extend
Add
Shift
left 2
m
u
x
1
30
m
u
x
1
ALU
+
Zero
ALU res
Branch=0 MemWrite = 1
R2+30
R1
Write
Data
ALU
Control
[5-0]
ALUOp
49
Address
Computer Structure 2015 – Pipeline
MemtoReg =
Read
Data
Data
Memory
MemRead
1
m
u
x
0

50. Executing a BEQ Instruction

BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ;
else PC PC+4
31
op 26
rs 21
BEQ
rt 16
4
Add
immediate
5
27
PC+4
Instruction
Instruction
Memory
ADD
[20-16]
Instruction
0
[15-11]
m
u
x
1
RegDst=
[15-0]
[5-0]
RegWrite=0
Read
reg 1
Read
reg 2
Write
reg
Write
data
Add
Shift
left 2
Read
data 1
0)
ALUSrc=0
Read
data 2
R5
m
u
x
1
PCSrc= (R4 – R5 ==
R4
Register
File
4
5
[25-21]
Address
PC+4
or
PC+4+SignExt(27)*4
0
4
PC
0
0
m
u
x
1
-
ALU
Zero
ALU res
Branch=1 MemWrite=0
R4-R5
Write
Data
16
Sign 32
extend
27
ALU
Control
ALUOp
50
MemtoReg=
Address
Computer Structure 2015 – Pipeline
Read
Data
Data
Memory
MemRead=0
1
m
u
x
0

51. Control Signals

Don’t Care
func 10 0000 10 0010
op 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
add
sub
ori
lw
sw
beq
jump
RegDst
ALUSrc
MemtoReg
1
0
0
1
0
0
0
1
0
0
1
1
x
1
x
x
0
x
x
x
x
RegWrite
MemWrite
Branch
1
0
0
1
0
0
1
0
0
1
0
0
0
1
0
0
0
1
0
0
x
Jump
0
0
0
0
0
0
1
Add
Add
ALUctr<2:0>
51
Add
Subtract Or
Subtract xxx
Computer Structure 2015 – Pipeline

52. Pipelined CPU: Load (cycle 1 – Fetch)

R1 Mem[R2+30]
LW R1, (30)R2 ;
op 26
31
rs 21
LW
rt 16
2
immediate
1
0
30
0
PCSrc
m
u
x1
IF/ID
4
Address
Instruction
Instruction
Memory
lw
Instruction
PC
PC+4
RegWrite
Read
reg 1
Read
reg 2
Write
reg
Write
data
Register File
Add
[15-0] 16
[20-16]
[15-11]
Read
data 1
Add
result
Add
Shift
left 2
Branch
MemWrite
ALUSrc
Read
data 2
Sign 32
extend
L4
L3
L2
0
m
u
x
1
ALU
m
u
x
1
MemtoReg
Address
Write
Data
ALU
Control
6
0
Zero
result
Data
Memory
MemRead
ALUOp
RegDst
52
Read
Data
Computer Structure 2015 – Pipeline
0
m
u
x
1

53. Pipelined CPU: Load (cycle 2 – Dec)

R1 Mem[R2+30]
LW R1, (30)R2 ;
op 26
31
LW
rs 21
rt 16
2
immediate
1
0
30
0
PCSrc
m
u
x1
IF/ID
Add
PC+4
Address
Instruction
Instruction
Memory
Instruction
PC
RegWrite
Read
reg 1
Read
reg 2
Write
reg
Write
data
Register File
4
[15-0] 16
[20-16]
[15-11]
Read
data 1
Add
result
Add
Shift
left 2
R2
ALUSrc
0
m
u
x
1
30
Branch
MemWrite
Read
data 2
Sign 32
extend
L4
L3
L2
ALU
m
u
x
1
MemtoReg
Address
Write
Data
ALU
Control
6
0
Zero
result
Data
Memory
MemRead
ALUOp
RegDst
53
Read
Data
Computer Structure 2015 – Pipeline
0
m
u
x
1

54. Pipelined CPU: Load (cycle 3 – Exe)

R1 Mem[R2+30]
LW R1, (30)R2 ;
op 26
31
LW
rs 21
rt 16
2
immediate
1
0
30
0
PCSrc
m
u
x1
IF/ID
Add
Address
Instruction
Instruction
Memory
Instruction
PC
RegWrite
Read
reg 1
Read
reg 2
Write
reg
Write
data
Register File
4
[15-0] 16
[20-16]
[15-11]
Read
data 1
Add
result
Add
Shift
left 2
Branch
MemWrite
ALUSrc
Read
data 2
Sign 32
extend
L4
L3
L2
0
m
u
x
1
ALU
m
u
x
1
MemtoReg
R2+30
Address
Write
Data
ALU
Control
6
0
Zero
result
Data
Memory
MemRead
ALUOp
RegDst
54
Read
Data
Computer Structure 2015 – Pipeline
0
m
u
x
1

55. Pipelined CPU: Load (cycle 4 – Mem)

R1 Mem[R2+30]
LW R1, (30)R2 ;
op 26
31
LW
rs 21
rt 16
2
immediate
1
0
30
0
PCSrc
m
u
x1
IF/ID
Add
Address
Instruction
Instruction
Memory
Instruction
PC
RegWrite
Read
reg 1
Read
reg 2
Write
reg
Write
data
Register File
4
[15-0] 16
[20-16]
[15-11]
Read
data 1
Add
result
Add
Shift
left 2
Branch
MemWrite
ALUSrc
Read
data 2
Sign 32
extend
L4
L3
L2
0
m
u
x
1
ALU
m
u
x
1
MemtoReg
Address
Write
Data
ALU
Control
6
0
Zero
result
Data
Memory
MemRead
ALUOp
RegDst
55
Read
Data
Computer Structure 2015 – Pipeline
D
0
m
u
x
1

56. Pipelined CPU: Load (cycle 5 – WB)

R1 Mem[R2+30]
LW R1, (30)R2 ;
op 26
31
LW
rs 21
rt 16
2
immediate
1
0
30
0
PCSrc
m
u
x1
IF/ID
Add
Address
Instruction
Instruction
Memory
Instruction
PC
RegWrite
Read
reg 1
Read
reg 2
Write
reg
D
Write
data
Register File
4
[15-0] 16
[20-16]
[15-11]
Read
data 1
Add
result
Add
Shift
left 2
Branch
MemWrite
ALUSrc
Read
data 2
Sign 32
extend
L4
L3
L2
0
m
u
x
1
ALU
m
u
x
1
MemtoReg
Address
Write
Data
ALU
Control
6
0
Zero
result
Data
Memory
MemRead
ALUOp
RegDst
56
Read
Data
Computer Structure 2015 – Pipeline
0
m
u
x
1

57. Datapath with Control

PCSrc
ID/EX
0
M
u
x
1
WB
Control
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
Add
Add
Add result
Instruction
memory
ALUSrc
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Branch
Shift
left 2
MemWrite
PC
Instruction
RegWrite
4
Address
Data
memory
Read
data
Write
data
Instruction 16
[15– 0]
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
ALU
control
0
M
u
x
1
MemRead
ALUOp
RegDst
57
Computer Structure 2015 – Pipeline
1
M
u
x
0

58. Multi-Cycle Control

Pass control signals along just like the data
Instruction
R-format
lw
sw
beq
Execution/Address Calculation Memory access stage
stage control lines
control lines
Reg
ALU
ALU
ALU
Mem
Mem
Dst
Op1
Op0
Src Branch Read Write
1
1
0
0
0
0
0
0
0
0
1
0
1
0
X
0
0
1
0
0
1
X
0
1
0
1
0
0
WB
Instruction
IF/ID
58
Control
M
WB
EX
M
WB
ID/EX
EX/MEM
MEM/WB
Computer Structure 2015 – Pipeline
Write-back
stage control
lines
Reg Mem to
write
Reg
1
0
1
1
0
X
0
X

59. Five Execution Steps

Instruction Fetch
Instruction Decode and Register Fetch
59
Use PC to get instruction and put it in the Instruction Register.
Increment the PC by 4 and put the result back in the PC.
IR = Memory[PC];
PC = PC + 4;
Read registers rs and rt
Compute the branch address
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type
(we are busy "decoding" it in our control logic)
Computer Structure 2015 – Pipeline

60. Five Execution Steps (cont.)

Execution
ALU is performing one of three functions, based on instruction type:
Memory Reference: effective address calculation.
ALUOut = A + sign-extend(IR[15-0]);
R-type:
ALUOut = A op B;
Branch:
if (A==B) PC = ALUOut;
Memory Access or R-type instruction completion
Write-back step
60
Computer Structure 2015 – Pipeline

61. The Store Instruction

31
26
op
6 bits
sw
21
rs
5 bits
16
rt
5 bits
0
immediate
16 bits
rt, rs, imm16
mem[PC]
Fetch the instruction from memory
Addr <- R[rs] + SignExt(imm16)
Calculate the memory address
Mem[Addr] <- R[rt] Store the register into memory
PC <- PC + 4
Calculate the next instruction’s address
Mem[Rs + SignExt[imm16]] <- Rt
61
Example: sw
Computer Structure 2015 – Pipeline
rt, rs, imm16

62. RAW Hazard: SW Solution

• Have compiler avoid hazards by adding NOP instructions
Program
execution
order
Time (clock cycles)
CC1
CC2
Value of R2 10
10
sub R2, R1, R3
NOP
Reg
IM
NOP
CC4
10
CC5
10/–20
DM
Reg
DM
Reg
IM
NOP
IM
R13,R6, R2
CC7
-20
Reg
DM
Reg
IM
Reg
DM
Reg
IM
Computer Structure 2015 – Pipeline
Reg
DM
Reg
Problem: this really slows us down!
62
CC9
-20
Reg
DM
IM
R15,100(R2)
CC8
-20
Reg
Reg
add R14,R2, R2
sw
CC6
-20
DM
Reg
IM
and R12,R2, R5
or
IM
CC3
10
Reg
Reg
DM
Reg

63. Delayed Branch

63
Define branch to take place AFTER n following instruction
HW executes n instructions following the branch
regardless of branch is taken or not
SW puts in the n slots following the branch instructions
that need to be executed regardless of branch resolution
Instructions that are before the branch instruction, or
Instructions from the converged path after the branch
If cannot find independent instructions, put NOP
Original Code
New Code
r3
R4
If
R1
X:
If (r1==r2) goto x
r3 = 23
R4 = R3 +R5
NOP
R1 = R4 + R5
X: R7 = R1
= 23
= R3+R5
(r1==r2) goto x
= R4 + R5
R7 = R1
Þ
Computer Structure 2015 – Pipeline

64. Delayed Branch Performance

Filling 1 delay slot is easy, 2 is hard, 3 is harder
Assuming we can effectively fill d% of the delayed slots
CPInew = 1 + 0.2 × (3 × (1-d))
For example, for d=0.5, we get CPInew = 1.3
Mixing architecture with micro-arch
New generations requires more delay slots
Cause computability issues between generations
64
Computer Structure 2015 – Pipeline

English Русский Rules