Memory Data Flow

1. Memory Data Flow

Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by John P. Shen
Updated by Mikko Lipasti

2. Memory Data Flow

• Memory Data Flow
  – Memory Data Dependences
  – Load Bypassing
  – Load Forwarding
  – Speculative Disambiguation
  – The Memory Bottleneck
• Cache Hits and Cache Misses

3. Memory Data Dependences


Besides branches, long memory latencies are one of the biggest
performance challenges today.
To preserve sequential (in-order) state in the data caches and
external memory (so that recovery from exceptions is possible),
stores are performed in order. This takes care of antidependences
and output dependences to memory locations.
However, loads can be issued out of order with respect to stores if
the out-of-order loads check for data dependences with respect to
previous, pending stores.

WAW:  store X ... store X
WAR:  load X  ... store X
RAW:  store X ... load X
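The three orderings above can be captured in a small sketch (illustrative Python, not from the notes): given two references to the same address, the pair of access types determines the dependence class.

```python
# Hypothetical helper: classify the dependence between two memory
# references to the SAME address, given their program order.
def classify(first, second):
    """first/second are 'load' or 'store'; first precedes second."""
    if first == "store" and second == "load":
        return "RAW"   # true dependence: load must see the stored value
    if first == "load" and second == "store":
        return "WAR"   # antidependence: store must not clobber the load
    if first == "store" and second == "store":
        return "WAW"   # output dependence: later store determines final value
    return None        # load-load pairs impose no ordering constraint

assert classify("store", "load") == "RAW"
assert classify("load", "store") == "WAR"
assert classify("store", "store") == "WAW"
```

Note that performing stores in order automatically enforces the WAR and WAW cases, which is why only RAW (store-to-load) requires an explicit check.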

4. Memory Data Dependences


“Memory Aliasing” = Two memory references involving the same memory
location (collision of two memory addresses).
“Memory Disambiguation” = Determining whether two memory references
will alias or not (whether there is a dependence or not).
Memory Dependency Detection:
– Must compute effective addresses of both memory references
– Effective addresses can depend on run-time data and other instructions
– Comparison of addresses requires much wider comparators
Example code:
(1) STORE V
(2) ADD
(3) LOAD  W
(4) LOAD  X
(5) LOAD  V    ; RAW on V with (1)
(6) ADD
(7) STORE W    ; WAR on W with (3)

5. Total Order of Loads and Stores


Keep all loads and stores totally in order with respect to each other.
However, loads and stores can execute out of order with respect to
other types of instructions.
Consequently, stores are held for all previous instructions, and loads
are held for stores.
– I.e. stores performed at commit point
– Sufficient to prevent wrong branch path stores since all prior branches
now resolved

6. Illustration of Total Order

[Figure: cycle-by-cycle illustration of total ordering. A decoder feeds
a single load/store reservation station and one address unit; over
cycles 1-8 the sequence Load v, Store v, Load w, Load x, Store w issues
strictly in program order, with each store holding the cache port
(address, then write data) before any younger load is released.]

ISSUING LOADS AND STORES WITH TOTAL ORDERING

7. Load Bypassing

• Loads can be allowed to bypass stores (if no aliasing).
• Two separate reservation stations and address
generation units are employed for loads and stores.
• Store addresses still need to be computed before loads
can be issued to allow checking for load dependences. If
dependence cannot be checked, e.g. store address
cannot be determined, then all subsequent loads are
held until address is valid (conservative).
• Stores are kept in ROB until all previous instructions
complete; and kept in the store buffer until gaining
access to cache port.
– Store buffer is “future file” for memory
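The conservative check described above can be sketched as follows (names are illustrative): a load may bypass only if every prior pending store has a computed address and none of those addresses match.

```python
# Sketch of the conservative bypass check: a load is released only if
# every prior store's address is known and none of them alias the load.
def may_bypass(load_addr, pending_stores):
    """pending_stores: prior stores' addresses, None = not yet computed."""
    for store_addr in pending_stores:
        if store_addr is None:
            return False          # unknown address: hold load (conservative)
        if store_addr == load_addr:
            return False          # alias: load depends on this store
    return True                   # no alias with any prior store: bypass

assert may_bypass(0x400A, [0x800A, 0x800C])      # no match -> bypass
assert not may_bypass(0x800A, [0x800A])          # alias -> hold
assert not may_bypass(0x400A, [0x800A, None])    # unknown store addr -> hold
```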

8. Illustration of Load Bypassing

[Figure: cycle-by-cycle illustration of load bypassing. Loads and stores
now have separate reservation stations and address units; completed
stores (Store v, Store w) wait in a store buffer for the cache port
while younger loads (Load w, Load x) bypass them and access the cache
immediately, with each store released from the buffer when the port is
free.]

LOAD BYPASSING OF STORES

9. Load Forwarding

• If a subsequent load has a dependence on a store
still in the store buffer, it need not wait till the store
is issued to the data cache.
• The load can be directly satisfied from the store
buffer if the address is valid and the data is available
in the store buffer.
• Since data is sourced from the store buffer:
– Could avoid accessing the cache to reduce power/latency
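A minimal sketch of the store-buffer lookup (illustrative, assuming each entry holds a resolved address and valid data): the youngest matching store supplies the load's data.

```python
# Sketch of load forwarding: search the store buffer youngest-first so
# the most recent store to the address supplies the data.
def lookup(load_addr, store_buffer):
    """store_buffer: oldest-first list of (addr, data) entries."""
    for addr, data in reversed(store_buffer):   # youngest match wins
        if addr == load_addr:
            return data                         # forward from store buffer
    return None                                 # no match: access the cache

buf = [(0x800A, 1), (0x800C, 2), (0x800A, 3)]
assert lookup(0x800A, buf) == 3     # youngest store to 0x800A forwards
assert lookup(0x4000, buf) is None  # miss in buffer: go to cache
```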

10. Illustration of Load Forwarding

[Figure: cycle-by-cycle illustration of load forwarding. The setup
matches the load-bypassing figure, but when Load v's address matches
Store v still waiting in the store buffer, the store's data is
forwarded directly to the load instead of the load waiting for the
store to reach the cache.]

LOAD BYPASSING OF STORES WITH FORWARDING

11. The DAXPY Example

Y(i) = A * X(i) + Y(i)

      LD     F0, a
      ADDI   R4, Rx, #512    ; last address
Loop: LD     F2, 0(Rx)       ; load X(i)
      MULTD  F2, F0, F2      ; A*X(i)
      LD     F4, 0(Ry)       ; load Y(i)
      ADDD   F4, F2, F4      ; A*X(i) + Y(i)
      SD     F4, 0(Ry)       ; store into Y(i)
      ADDI   Rx, Rx, #8      ; inc. index to X
      ADDI   Ry, Ry, #8      ; inc. index to Y
      SUB    R20, R4, Rx     ; compute bound
      BNZ    R20, loop       ; check if done

[Figure: the loop's LD, MULTD, ADDD, and SD issuing under total order.]

12. Performance Gains From Weak Ordering

Load Bypassing:              Load Forwarding:
CODE:                        CODE:
  ST X                         ST X
   :                            :
   :                            :
  LD Y                         LD X

[Figure: in the load/store unit, LD Y issues from the reservation
station ahead of ST X, which waits in the store buffer; for forwarding,
ST X's buffered data is supplied directly to LD X.]

Performance gains:
  Load bypassing:  11%-19% increase over total ordering
  Load forwarding: 1%-4% increase over load bypassing

13. Optimizing Load/Store Disambiguation

Non-speculative load/store disambiguation:
1. Loads wait for addresses of all prior stores
2. Full address comparison
3. Bypass if no match, forward if match

(1) can limit performance:
  load  r5, MEM[r3]   ; cache miss
  store r7, MEM[r5]   ; RAW for agen, stalled
  load  r8, MEM[r9]   ; independent load stalled

14. Speculative Disambiguation

What if aliases are rare?
1. Loads don't wait for addresses of all prior stores
2. Full address comparison of stores that are ready
3. Bypass if no match, forward if match
4. Check all store addresses when they commit
   – No matching loads: speculation was correct
   – Matching unbypassed load: incorrect speculation
5. Replay starting from incorrect load

[Figure: load/store reservation station feeding agen and memory stages,
with a load queue and a store queue alongside the reorder buffer.]
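Steps 4 and 5 above can be sketched as a commit-time search (illustrative Python, structure assumed): the committing store scans the load queue for younger loads that already executed to the same address.

```python
# Sketch of the commit-time violation check: a committing store searches
# the load queue for already-executed younger loads to its address; any
# match is a mis-speculated load that must be replayed.
def committing_store_check(store_addr, load_queue):
    """load_queue: executed younger loads as (addr, tag) pairs.
    Returns the tags of loads that must be replayed."""
    return [tag for addr, tag in load_queue if addr == store_addr]

loads = [(0x800A, "i2"), (0x400C, "i5")]
assert committing_store_check(0x800A, loads) == ["i2"]  # violation: replay i2
assert committing_store_check(0x9000, loads) == []      # speculation correct
```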

15. Speculative Disambiguation: Load Bypass

Before agen:
  i1: st R3, MEM[R8]: ??
  i2: ld R9, MEM[R4]: ??
After agen (in the load/store queues, alongside the reorder buffer):
  i2: ld R9, MEM[R4]: x400A
  i1: st R3, MEM[R8]: x800A
• i1 and i2 issue in program order
• i2 checks store queue (no match)

16. Speculative Disambiguation: Load Forward

Before agen:
  i1: st R3, MEM[R8]: ??
  i2: ld R9, MEM[R4]: ??
After agen (in the load/store queues, alongside the reorder buffer):
  i2: ld R9, MEM[R4]: x800A
  i1: st R3, MEM[R8]: x800A
• i1 and i2 issue in program order
• i2 checks store queue (match=>forward)

17. Speculative Disambiguation: Safe Speculation

Before agen:
  i1: st R3, MEM[R8]: ??
  i2: ld R9, MEM[R4]: ??
After agen (in the load/store queues, alongside the reorder buffer):
  i2: ld R9, MEM[R4]: x400C
  i1: st R3, MEM[R8]: x800A
• i1 and i2 issue out of program order
• i1 checks load queue at commit (no match)

18. Speculative Disambiguation: Violation

Before agen:
  i1: st R3, MEM[R8]: ??
  i2: ld R9, MEM[R4]: ??
After agen (in the load/store queues, alongside the reorder buffer):
  i2: ld R9, MEM[R4]: x800A
  i1: st R3, MEM[R8]: x800A
• i1 and i2 issue out of program order
• i1 checks load queue at commit (match)
– i2 marked for replay

19. Use of Prediction

• If aliases are rare: static prediction
  – Predict no alias every time
    • Why even implement forwarding? PowerPC 620 doesn't
  – Pay misprediction penalty rarely
• If aliases are more frequent: dynamic prediction
  – Use PHT-like history table for loads
    • If alias predicted: delay load
    • If aliased pair predicted: forward from store to load
      – More difficult to predict the pair [store sets, Alpha 21264]
  – Pay misprediction penalty rarely
• Memory cloaking [Moshovos, Sohi]
  – Predict load/store pair
  – Directly copy store data register to load target register
  – Reduce data transfer latency to absolute minimum
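The "PHT-like history table" idea above can be sketched with 2-bit saturating counters indexed by load PC (the structure is illustrative, not any particular machine's design):

```python
# Hedged sketch of a PHT-like alias predictor: a table of 2-bit
# saturating counters indexed by load PC. Counter >= 2 predicts "alias"
# (delay the load); outcomes train the counter up or down.
class AliasPredictor:
    def __init__(self, entries=1024):
        self.table = [0] * entries            # 2-bit counters, range 0..3

    def predict_alias(self, load_pc):
        return self.table[load_pc % len(self.table)] >= 2

    def update(self, load_pc, aliased):
        i = load_pc % len(self.table)
        if aliased:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = AliasPredictor()
assert not p.predict_alias(0x1000)    # counters start at "no alias"
p.update(0x1000, True)
p.update(0x1000, True)
assert p.predict_alias(0x1000)        # repeated aliasing flips the prediction
```

The 2-bit hysteresis means a single non-aliasing execution does not immediately flip a load back to speculating, mirroring branch-predictor practice.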

20. Load/Store Disambiguation Discussion

• RISC ISA:
  – Many registers, most variables allocated to registers
  – Aliases are rare
  – Most important to not delay loads (bypass)
  – Alias predictor may or may not be necessary
• CISC ISA:
  – Few registers, many operands from memory
  – Aliases much more common, forwarding necessary
  – Incorrect load speculation should be avoided
  – If load speculation allowed, predictor probably necessary
• Address translation:
  – Can't use virtual address (must use physical)
  – Wait till after TLB lookup is done
  – Or, use subset of untranslated bits (page offset)
    • Safe for proving inequality (bypassing OK)
    • Not sufficient for showing equality (forwarding not OK)
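The page-offset point above can be made concrete (4 KB pages assumed for illustration): if the untranslated low 12 bits differ, the physical addresses must differ, so bypassing is safe; if they match, the full addresses may still differ, so forwarding cannot be justified.

```python
# Sketch of partial disambiguation on the untranslated page offset
# (assuming 4 KB pages, so the low 12 bits pass through translation).
PAGE_OFFSET_MASK = 0xFFF

def offsets_differ(va1, va2):
    """True => addresses provably distinct (bypass safe).
    False => offsets match; equality NOT proven (forwarding unsafe)."""
    return (va1 & PAGE_OFFSET_MASK) != (va2 & PAGE_OFFSET_MASK)

assert offsets_differ(0x1234, 0x5678)      # proven distinct: bypass OK
assert not offsets_differ(0x1234, 0x9234)  # same offset, different pages:
                                           # may or may not alias
```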

21. The Memory Bottleneck

[Figure: superscalar pipeline with a single load/store path. The
dispatch buffer, register file, and rename registers feed reservation
stations for branch, integer, and floating-point units plus one
load/store unit, whose pipeline is effective address generation,
address translation, and D-cache access. A reorder buffer and store
buffer sit at completion and retirement, writing back to the data
cache.]

22. Load/Store Processing

For both loads and stores:
1. Effective Address Generation:
   – Must wait on register value
   – Must perform address calculation
2. Address Translation:
   – Must access TLB
   – Can potentially induce a page fault (exception)
For loads: D-cache access (read)
   – Can potentially induce a D-cache miss
   – Check aliasing against store buffer for possible load forwarding
   – If bypassing a store, must be flagged as a “speculative” load until completion
For stores: D-cache access (write)
   – When completing, must check aliasing against “speculative” loads
   – After completion, wait in store buffer for access to D-cache
   – Can potentially induce a D-cache miss

23. Easing The Memory Bottleneck

[Figure: the same pipeline as the memory-bottleneck figure, but with
two load/store units and a missed-load buffer, so loads that miss the
data cache are set aside instead of blocking the load/store units.]

24. Memory Bottleneck Techniques

Dynamic Hardware (Microarchitecture):
– Use multiple load/store units (need multiported D-cache)
– Use more advanced caches (victim cache, stream buffer)
– Use hardware prefetching (need load history and stride detection)
– Use non-blocking D-cache (need missed-load buffers/MSHRs)
– Large instruction window (memory-level parallelism)
Static Software (Code Transformation):
– Insert prefetch or cache-touch instructions (mask miss penalty)
– Array blocking based on cache organization (minimize misses)
– Reduce unnecessary load/store instructions (redundant loads)
– Software-controlled memory hierarchy (expose it above the DSI)

25. Caches and Performance

• Caches
  – Enable design for common case: cache hit
    • Cycle time, pipeline organization
    • Recovery policy
  – Uncommon case: cache miss
    • Fetch from next level
      – Apply recursively if multiple levels
    • What to do in the meantime?
    • What is the performance impact?
    • Various optimizations are possible

26. Performance Impact

• Cache hit latency
  – Included in “pipeline” portion of CPI
    • E.g. IBM study: 1.15 CPI with 100% cache hits
  – Typically 1-3 cycles for L1 cache
    • Intel/HP McKinley: 1 cycle
      – Heroic array design
      – No address generation: load r1, (r2)
    • IBM Power4: 3 cycles
      – Address generation
      – Array access
      – Word select and align
      – Register file write (no bypass)
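The relationship between hit CPI and overall CPI can be written as a one-line model: overall CPI is the pipeline (all-hits) CPI plus memory stall cycles per instruction. The example numbers below beyond the cited 1.15 are assumed for illustration.

```python
# Simple CPI model: overall CPI = pipeline CPI + memory stalls per
# instruction, where stalls = refs/instr * miss rate * miss penalty.
def cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    return base_cpi + refs_per_instr * miss_rate * miss_penalty

# With the 1.15 all-hits CPI cited above, plus assumed values of
# 0.3 memory refs/instruction, 5% miss rate, 20-cycle miss penalty:
assert abs(cpi(1.15, 0.3, 0.05, 20) - 1.45) < 1e-9
```

Even a modest 5% miss rate adds 0.30 to CPI here, a 26% slowdown over the all-hits case, which is why the miss-rate and miss-penalty discussions that follow matter.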

27. Cache Hit continued

• Cycle stealing common
  – Address generation < cycle
  – Array access > cycle
  – Clean FSD cycle boundaries violated
• Speculation rampant
  – “Predict” cache hit
  – Don't wait for (full) tag check
  – Consume fetched word in pipeline
  – Recover/flush when miss is detected
    • Reportedly 7 (!) cycles later in Pentium 4

[Figure: AGEN and CACHE stages overlapping across cycle boundaries.]

28. Cache Hits and Performance

• Cache hit latency determined by:
  – Cache organization
    • Associativity
      – Parallel tag checks expensive, slow
      – Way select slow (fan-in, wires)
    • Block size
      – Word select may be slow (fan-in, wires)
    • Number of blocks (sets x associativity)
      – Wire delay across array
      – “Manhattan distance” = width + height
      – Word line delay: width
      – Bit line delay: height
• Array design is an art form
  – Detailed analog circuit/wire delay modeling

[Figure: SRAM array with word lines spanning the width and bit lines
spanning the height.]

29. Cache Misses and Performance

• Miss penalty
  – Detect miss: 1 or more cycles
  – Find victim (replace block): 1 or more cycles
    • Write back if dirty
  – Request block from next level: several cycles
    • May need to find line in one of many caches (coherence)
  – Transfer block from next level: several cycles
    • (block size) / (bus width)
  – Fill block into data array, update tag array: 1+ cycles
  – Resume execution
• In practice: 6 cycles to 100s of cycles
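Summing the components listed above gives the total penalty; the cycle counts below are assumed for illustration, with the transfer term taken from the (block size)/(bus width) rule.

```python
# Sketch totaling the miss-penalty components above. Transfer beats
# follow the rule (block size) / (bus width); other counts are assumed.
def miss_penalty(detect, victim, request, block_bytes, bus_bytes, fill):
    transfer = block_bytes // bus_bytes   # bus beats to move the block
    return detect + victim + request + transfer + fill

# Example: 64 B block over a 16 B bus => 4 transfer beats.
assert miss_penalty(detect=1, victim=1, request=10,
                    block_bytes=64, bus_bytes=16, fill=1) == 17
```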

30. Cache Miss Rate

• Determined by:
  – Program characteristics
    • Temporal locality
    • Spatial locality
  – Cache organization
    • Block size, associativity, number of sets

31. Improving Locality

• Instruction text placement
  – Profile program, place unreferenced or rarely
    referenced paths “elsewhere”
    • Maximize temporal locality
  – Eliminate taken branches
    • Fall-through path has spatial locality

32. Improving Locality

• Data placement, access order
  – Arrays: “block” loops to access a subarray that fits into cache
    • Maximize temporal locality
  – Structures: pack commonly-accessed fields together
    • Maximize spatial, temporal locality
  – Trees, linked lists: allocate in usual reference order
    • Heap manager usually allocates sequential addresses
    • Maximize spatial locality
• Hard problem, not easy to automate:
  – C/C++ disallows rearranging structure fields
  – OK in Java
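The array "blocking" transformation above can be sketched with a tiled matrix transpose (Python for clarity; the tile size B stands in for "subarray that fits into cache"):

```python
# Minimal sketch of loop blocking (tiling): transpose a matrix in B x B
# tiles so the source and destination rows touched within one tile stay
# cache-resident; B would be chosen so a tile fits in the cache.
def transpose_blocked(m, B=2):
    n = len(m)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):               # walk tile origins
        for jj in range(0, n, B):
            for i in range(ii, min(ii + B, n)):   # work within one tile
                for j in range(jj, min(jj + B, n)):
                    out[j][i] = m[i][j]
    return out

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert transpose_blocked(m) == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

The result is identical to an untiled transpose; only the access order changes, trading long row/column strides for reuse within each tile.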

33. Cache Miss Rates: 3 C’s [Hill]

• Compulsory miss
  – First-ever reference to a given block of memory
  – Cold misses = m_c: number of misses for an infinite fully-associative cache
• Capacity
  – Working set exceeds cache capacity
  – Useful blocks (with future references) displaced
  – Capacity misses = m_f - m_c: additional misses for a finite FA cache
• Conflict
  – Placement restrictions (not fully-associative) cause useful
    blocks to be displaced
  – Think of as capacity within a set
  – Conflict misses = m_a - m_f: additional misses in the actual cache
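The decomposition above can be demonstrated by simulating the same block-address trace on the three reference caches (a hedged sketch; block-granularity trace and LRU replacement assumed):

```python
# Sketch of the 3C decomposition: run one trace of block addresses on
# (a) an infinite FA cache, (b) a finite FA LRU cache, and (c) a
# direct-mapped cache of the same size, then split the miss counts.
from collections import OrderedDict

def misses_fa_infinite(trace):
    seen, m = set(), 0
    for b in trace:
        if b not in seen:
            m += 1
            seen.add(b)
    return m                                  # m_c: compulsory (cold) misses

def misses_fa_lru(trace, capacity):
    cache, m = OrderedDict(), 0
    for b in trace:
        if b in cache:
            cache.move_to_end(b)              # refresh LRU position
        else:
            m += 1
            if len(cache) == capacity:
                cache.popitem(last=False)     # evict least-recently used
            cache[b] = True
    return m                                  # m_f: finite FA misses

def misses_direct_mapped(trace, sets):
    cache, m = {}, 0
    for b in trace:
        s = b % sets                          # placement restricted to one set
        if cache.get(s) != b:
            m += 1
            cache[s] = b
    return m                                  # m_a: actual cache misses

trace = [0, 4, 0, 4, 0, 4]                    # blocks 0 and 4 map to set 0
mc = misses_fa_infinite(trace)                # 2: first touch of each block
mf = misses_fa_lru(trace, capacity=4)         # 2: both blocks fit
ma = misses_direct_mapped(trace, sets=4)      # 6: they evict each other
assert (mc, mf - mc, ma - mf) == (2, 0, 4)    # compulsory, capacity, conflict
```

In this trace all extra misses are conflict misses: the working set fits the cache, but the direct-mapped placement forces blocks 0 and 4 into the same set.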

34. Cache Miss Rate Effects

• Number of blocks (sets x associativity)
  – Bigger is better: fewer conflicts, greater capacity
• Associativity
  – Higher associativity reduces conflicts
  – Very little benefit beyond 8-way set-associative
• Block size
  – Larger blocks exploit spatial locality
  – Usually: miss rates improve until 64B-256B
  – At 512B or more, miss rates get worse
    • Larger blocks less efficient: more capacity misses
    • Fewer placement choices: more conflict misses

35. Cache Miss Rate

• Subtle tradeoffs between cache organization
  parameters
  – Large blocks reduce compulsory misses but increase
    miss penalty
    • #compulsory ~= (working set) / (block size)
    • #transfers = (block size) / (bus width)
  – Large blocks increase conflict misses
    • #blocks = (cache size) / (block size)
  – Associativity reduces conflict misses
  – Associativity increases access time
• Can an associative cache ever have a higher miss rate
  than a direct-mapped cache of the same size?

36. Cache Miss Rates: 3 C’s

• Vary size and associativity
  – Compulsory misses are constant
  – Capacity and conflict misses are reduced

37. Cache Miss Rates: 3 C’s

• Vary size and block size
  – Compulsory misses drop with increased block size
  – Capacity and conflict misses can increase with larger blocks

38. Summary

• Memory Data Flow
  – Memory Data Dependences
  – Load Bypassing
  – Load Forwarding
  – Speculative Disambiguation
  – The Memory Bottleneck
• Cache Hits and Cache Misses
• Further coverage of memory hierarchy later
in the semester