Similar presentations:
Fault-Tolerant Design
1. Chapter 3
Fault-Tolerant DesignEE141
System-on-Chip
Test Architectures
1
Ch. 3 - Fault-Tolerant Design - P. 1
2. What is this chapter about?
Gives Overview of Fault-Tolerant DesignFocus on
Basic Concepts in Fault-Tolerant Design
Metrics Used to Specify and Evaluate Dependability
Review of Coding Theory
Fault-Tolerant Design Schemes
– Hardware Redundancy
– Information Redundancy
– Time Redundancy
Examples of Fault-Tolerant Applications in Industry
EE141
System-on-Chip
Test Architectures
2
Ch. 3 - Fault-Tolerant Design - P. 2
3. Fault-Tolerant Design
IntroductionFundamentals of Fault Tolerance
Fundamentals of Coding Theory
Fault Tolerant Schemes
Industry Practices
Concluding Remarks
EE141
System-on-Chip
Test Architectures
3
Ch. 3 - Fault-Tolerant Design - P. 3
4. Introduction
Fault ToleranceAbility of system to continue error-free operation in
presence of unexpected fault
Important in mission-critical applications
E.g., medical, aviation, banking, etc.
Errors very costly
Becoming important in mainstream applications
Technology scaling causing circuit behavior to
become less predictable and more prone to failures
Needing fault tolerance to keep failure rate within
acceptable levels
EE141
System-on-Chip
Test Architectures
4
Ch. 3 - Fault-Tolerant Design - P. 4
5. Faults
Permanent FaultsDue to manufacturing defects, early life failures,
wearout failures
Wearout failures due to various mechanisms
– e.g., electromigration, hot carrier degradation, dielectric
breakdown, etc.
Temporary Faults
Only present for short period of time
Caused by external disturbance or marginal design
parameters
EE141
System-on-Chip
Test Architectures
5
Ch. 3 - Fault-Tolerant Design - P. 5
6. Temporary Faults
TransientErrors (Non-recurring errors)
Cause by external disturbance
– e.g., radiation, noise, power disturbance, etc.
Intermittent
Errors (Recurring errors)
Cause by marginal design parameters
Timing problems
– e.g., races, hazards, skew
Signal integrity problems
– e.g., crosstalk, ground bounce, etc.
EE141
System-on-Chip
Test Architectures
6
Ch. 3 - Fault-Tolerant Design - P. 6
7. Redundancy
FaultTolerance requires some form of
redundancy
Time Redundancy
Hardware Redundancy
Information Redundancy
EE141
System-on-Chip
Test Architectures
7
Ch. 3 - Fault-Tolerant Design - P. 7
8. Time Redundancy
PerformSame Operation Twice
See if get same result both times
If not, then fault occurred
Can detect temporary faults
Cannot detect permanent faults
– Would affect both computations
Advantage
Little to no hardware overhead
Disadvantage
Impacts system or circuit performance
EE141
System-on-Chip
Test Architectures
8
Ch. 3 - Fault-Tolerant Design - P. 8
9. Hardware Redundancy
Replicatehardware and compare outputs
From two or more modules
Detects both permanent and temporary faults
Advantage
Little or no performance impact
Disadvantage
Area and power for redundant hardware
EE141
System-on-Chip
Test Architectures
9
Ch. 3 - Fault-Tolerant Design - P. 9
10. Information Redundancy
Encodeoutputs with error detecting or
correcting code
Code selected to minimize redundancy for
class of faults
Advantage
Less hardware to generate redundant
information than replicating module
Drawback
Added complexity in design
EE141
System-on-Chip
Test Architectures
10
Ch. 3 - Fault-Tolerant Design - P. 10
11. Failure Rate
(t)= Component failure rate
Measured in FITS (failures per 109 hours)
Failure rate
Infant
mortality
Working life
Wearout
Overall curve
Random failures
Early
failures
Wearout
failures
Time
EE141
System-on-Chip
Test Architectures
11
Ch. 3 - Fault-Tolerant Design - P. 11
12. System Failure Rate
Systemconstructed from components
No Fault Tolerance
Any component fails, whole system fails
k
sys c ,i
i 1
EE141
System-on-Chip
Test Architectures
12
Ch. 3 - Fault-Tolerant Design - P. 12
13. Reliability
Ifcomponent working at time 0
R(t) = Probability still working at time t
Exponential
Failure Law
If failure rate assumed constant
– Good approximation if past infant mortality period
R(t ) e
EE141
System-on-Chip
Test Architectures
t
13
Ch. 3 - Fault-Tolerant Design - P. 13
14. Reliability for Series System
SeriesSystem
All components need to work for system to
work
A
B
C
Rsys RA RB RC
EE141
System-on-Chip
Test Architectures
14
Ch. 3 - Fault-Tolerant Design - P. 14
15. System Reliability with Redundancy
Systemreliability with component B in
Parallel
Can tolerate one component B failing
B
A
C
B
Rsys RA 1 (1 RB ) RC RA (2RB R ) RC
EE141
System-on-Chip
Test Architectures
2
2
B
15
Ch. 3 - Fault-Tolerant Design - P. 15
16. Mean-Time-to-Failure (MTTF)
Averagetime before system fails
Equal to area under reliability curve
MTTF R(t )dt
0
For
Exponential Failure Law
MTTF e dt
0
EE141
System-on-Chip
Test Architectures
t
1
16
Ch. 3 - Fault-Tolerant Design - P. 16
17. Maintainability
Ifsystem failed at time 0
M(t) = Probability repaired and operational
at time t
System
repair time divided into
Passive repair time
– Time for service engineer to travel to site
Active repair time
– Time to locate failing component,
repair/replace, and verify system operational
– Can be improved through designing system so
easy to locate failed component and verify
EE141
System-on-Chip
Test Architectures
17
Ch. 3 - Fault-Tolerant Design - P. 17
18. Repair Rate and MTTR
= rate at which system repairedAnalogous to failure rate
Maintainability
often modeled as
M (t ) 1 e
Mean-Time-to-Repair
EE141
System-on-Chip
Test Architectures
t
(MTTR) = 1/
18
Ch. 3 - Fault-Tolerant Design - P. 18
19. Availability
S1
0
t0
Normal system operation
t1
t2
t3
t4
t
failures
System Availability
Fraction of time system is operational
MTTF
system availability
MTTF MTTR
EE141
System-on-Chip
Test Architectures
19
Ch. 3 - Fault-Tolerant Design - P. 19
20. Availability
TelephoneSystems
Required to have system availability of
0.9999 (“four nines”)
High-Reliability
Systems
May require 7 or more nines
Fault-Tolerant
Design
Needed to achieve such high availability
from less reliable components
EE141
System-on-Chip
Test Architectures
20
Ch. 3 - Fault-Tolerant Design - P. 20
21. Coding Theory
CodingUsing more bits than necessary to
represent data
Provides way to detect errors
– Errors occur when bits get flipped
Error
Detecting Codes
Many types
Detect different classes of errors
Use different amounts of redundancy
Ease of encoding and decoding data varies
EE141
System-on-Chip
Test Architectures
21
Ch. 3 - Fault-Tolerant Design - P. 21
22. Block Code
Message= Data Being Encoded
Block code
Encodes m messages with n-bit codeword
log 2 m
redundancy 1
n
If
no redundancy
m messages encoded with log2(m) bits
minimum possible
EE141
System-on-Chip
Test Architectures
22
Ch. 3 - Fault-Tolerant Design - P. 22
23. Block Code
Todetect errors, some redundancy
needed
Space of distinct 2n blocks partitioned into
codewords and non-codewords
Can
detect errors that cause codeword
to become non-codeword
Cannot detect errors that cause
codeword to become another codeword
EE141
System-on-Chip
Test Architectures
23
Ch. 3 - Fault-Tolerant Design - P. 23
24. Separable Block Code
Separablen-bit blocks partitioned into
– k information bits directly representing message
– (n-k) check bits
Denoted (n,k) Block Code
Advantage
k-bit message directly extracted without
decoding
Rate
of Separable Block Code = k/n
EE141
System-on-Chip
Test Architectures
24
Ch. 3 - Fault-Tolerant Design - P. 24
25. Example of Separable Block Code
(4,3)Parity Code
Check bit is XOR of 3 message bits
message 101 codeword 1010
Single
Bit Parity
log 2 m
log 2 (2k )
k n k 1
redundancy 1
1
1
n
n
n
n
n
k n 1
rate
n
n
EE141
System-on-Chip
Test Architectures
25
Ch. 3 - Fault-Tolerant Design - P. 25
26. Example of Non-Separable Block Code
One-HotCode
Each Codeword has single 1
Example of 8-bit one-hot
– 10000000, 01000000, 00100000, 00010000
00001000, 00000100, 00000010, 00000001
Redundancy = 1 - log2(8)/8 = 5/8
log 2 m
log 2 (n)
redundancy 1
1
n
n
EE141
System-on-Chip
Test Architectures
26
Ch. 3 - Fault-Tolerant Design - P. 26
27. Linear Block Codes
Specialclass
Modulo-2 sum of any 2 codewords also
codeword
Null space of (n-k)xn Boolean matrix
– Called Parity Check Matrix, H
For
any n-bit codeword c
cHT = 0
All 0 codeword exists in any linear code
EE141
System-on-Chip
Test Architectures
27
Ch. 3 - Fault-Tolerant Design - P. 27
28. Linear Block Codes
GeneratorMatrix, G
kxn Matrix
Codeword
c for message m
c = mG
GHT
=0
EE141
System-on-Chip
Test Architectures
28
Ch. 3 - Fault-Tolerant Design - P. 28
29. Systematic Block Code
Firstk-bits correspond to message
Last n-k bits correspond to check bits
For
Systematic Code
G = [Ikxk : Pkx(n-k)]
H = [I(n-k)x(n-k) : PT(n-k)xk]
Example
1 0 0
G 0 1 0
0 0 1
EE141
System-on-Chip
Test Architectures
1
1
1
H 1 1 1 1
29
Ch. 3 - Fault-Tolerant Design - P. 29
30. Distance of Code
Distancebetween two codewords
Number of bits in which they differ
Distance
of Code
Minimum distance between any two
codewords in code
If n=k (no redundancy), distance = 1
Single-bit parity, distance = 2
Code
with distance d
Detect d-1 errors
Correct up to (d-1)/2 errors
EE141
System-on-Chip
Test Architectures
30
Ch. 3 - Fault-Tolerant Design - P. 30
31. Error Correcting Codes
Codewith distance 3
Called single error correcting (SEC) code
Code
with distance 4
Called single error correcting and double
error detecting (SEC-DED) code
Procedure
for constructing SEC code
Described in [Hamming 1950]
Any H-matrix with all columns distinct and
no all-0 column is SEC
EE141
System-on-Chip
Test Architectures
31
Ch. 3 - Fault-Tolerant Design - P. 31
32. Hamming Code
Forany value of n
SEC code constructed by
– setting each column in H equal to binary
representation of column number (starting from 1)
Number of rows in H equal to log2(n+1)
Example
of SEC Hamming Code for n=7
0 0 0 1 1 1 1
H 0 1 1 0 0 1 1
1 0 1 0 1 0 1
EE141
System-on-Chip
Test Architectures
32
Ch. 3 - Fault-Tolerant Design - P. 32
33. Error Correction in Hamming Code
Syndrome,s
s = HvT for received vector v
If v is codeword
– Syndrome = 0
If v non-codeword and single-bit error
– Syndrome will match one of columns of H
– Will contain binary value of bit position in error
EE141
System-on-Chip
Test Architectures
33
Ch. 3 - Fault-Tolerant Design - P. 33
34. Example of Error Correction
For(7,3) Hamming Code
Suppose codeword 0110011 has one-bit
error changing it to 1110011
0
0
0
T
s vH [1110011] 1
1
1
1
EE141
System-on-Chip
Test Architectures
0 1
1 0
1 1
0 0 [001]
0 1
1 0
1 1
34
Ch. 3 - Fault-Tolerant Design - P. 34
35. SEC-DED Code
MakeSEC Hamming Code SEC-DED
By adding parity check over all bits
Extra parity bit
– 1 for single-bit error
– 0 for double-bit error
Makes possible to detect double bit error
– Avoid assuming single-bit error and
miscorrecting it
EE141
System-on-Chip
Test Architectures
35
Ch. 3 - Fault-Tolerant Design - P. 35
36. Example of Error Correction
For(7,4) SEC-DED Hamming Code
Suppose codeword 0110011 has two-bit
error changing it to 1010011
– Doesn’t match any column in H
0
0
0
s vH T [1010011] 1
1
1
1
EE141
System-on-Chip
Test Architectures
0 1 1
1 0 1
1 1 1
0 0 1 [0010]
0 1 1
1 0 1
1 1 1
36
Ch. 3 - Fault-Tolerant Design - P. 36
37. Hsiao Code
Weightof column
Number of 1’s in column
Constructing
n-bit SEC-DED Hsiao Code
First use all possible weight-1 columns
– Then all possible weight-3 columns
– Then weight-5 columns, etc.
Until n columns formed
Number check bits is log2(n+1)
Minimizes number of 1’s in H-matrix
– Less hardware and delay for computing syndrome
– Disadvantage: Correction logic more complex 37
EE141
System-on-Chip
Test Architectures
Ch. 3 - Fault-Tolerant Design - P. 37
38. Example of Hsiao Code
(7,3)Hsiao Code
Uses weight-1 and weight-3 columns
0
0
H
0
1
EE141
System-on-Chip
Test Architectures
0 0 1 0 1 1
0 1 0 1 0 1
1 0 0 1 1 0
0 0 0 1 1 1
38
Ch. 3 - Fault-Tolerant Design - P. 38
39. Unidirectional Errors
Errorsin block of data which only cause
0 1 or 1 0, but not both
Any number of bits in error in one direction
Example
Correct codeword 111000
Unidirectional errors could cause
– 001000, 000000, 101000 (only 1 0 errors)
Non-unidirectional errors
– 101001, 011001, 011011 (both1 0 and 0 1)
EE141
System-on-Chip
Test Architectures
39
Ch. 3 - Fault-Tolerant Design - P. 39
40. Unidirectional Error Detecting Codes
Allunidirectional error detecting (AUED)
Codes
Detect all unidirectional errors in codeword
Single-bit parity is not AUED
– Cannot detect even number of errors
No linear code is AUED
– All linear codes must contain all-0 vector, so
cannot detect all 1 0 errors
EE141
System-on-Chip
Test Architectures
40
Ch. 3 - Fault-Tolerant Design - P. 40
41. Two-Rail Code
Two-RailCode
One check bit for each information bit
– Equal to complement of information bit
Two-Rail Code is AEUD
50% Redundancy
Example
of (6,3) Two-Rail Code
Message 101 has Codeword 101010
Set of all codewords
– 000111, 001110, 010101, 011100, 100110,
101010, 110001, 111000
EE141
System-on-Chip
Test Architectures
41
Ch. 3 - Fault-Tolerant Design - P. 41
42. Berger Codes
Lowestredundancy of separable AUED
codes
For k information bits, log2(k+1) check bits
Check bits equal to binary representation
of number of 0’s in information bits
Example
Information bits 1000101
– log2(7+1)=3 check bits
– Check bits equal to 100 (4 zero’s)
EE141
System-on-Chip
Test Architectures
42
Ch. 3 - Fault-Tolerant Design - P. 42
43. Berger Codes
Codewordsfor (5,3) Berger Code
00011, 00110, 01010, 01101, 10010,
10101, 11001, 11100
If
unidirectional errors
Contain 1 0 errors
– increase 0’s in information bits
– can only decrease binary number in check bits
Contain 0 1 errors
– decrease 0’s in information bits
– can only increase binary number in check bits
EE141
System-on-Chip
Test Architectures
43
Ch. 3 - Fault-Tolerant Design - P. 43
44. Berger Codes
If8 information bits
Berger code requires log2 8+1 =4 check bits
log 2 m
log 2 (2k )
8 1
redundancy 1
1
1 25%
n
n
12 4
(16,8)
Two-Rail Code
Requires 50% redundancy
Redundancy
advantage of Berger Code
Increases as k increased
EE141
System-on-Chip
Test Architectures
44
Ch. 3 - Fault-Tolerant Design - P. 44
45. Constant Weight Codes
ConstantWeight Codes
Non-separable, but lower redundancy than
Berger
Each codeword has same number of 1’s
Example
2-out-of-3 constant weight code
110, 011, 101
AEUD
code
Unidirectional errors always change number
of 1’s
EE141
System-on-Chip
Test Architectures
45
Ch. 3 - Fault-Tolerant Design - P. 45
46. Constant Weight Codes
Numbercodewords in m-out-of-n code
C
n
m
Codewords
maximized when m close to
n/2 as possible
n/2-out-of-n when n even
(n/2-0.5 or n/2+0.5)-out-of-n when n odd
Minimizes redundancy of code
EE141
System-on-Chip
Test Architectures
46
Ch. 3 - Fault-Tolerant Design - P. 46
47. Example
6-out-of-12constant weight code
C612 924 codewords
log 2 m
log 2 (924)
redundancy 1
1
17.9%
n
12
12-bit
Berger Code
Only 28 = 256 codewords
log 2 m
log 2 (28 )
redundancy 1
1
33.3%
n
12
EE141
System-on-Chip
Test Architectures
47
Ch. 3 - Fault-Tolerant Design - P. 47
48. Constant Weight Codes
AdvantageLess redundancy than Berger codes
Disadvantage
Non-separable
Need decoding logic
– to convert codeword back to binary message
EE141
System-on-Chip
Test Architectures
48
Ch. 3 - Fault-Tolerant Design - P. 48
49. Burst Error
BurstError
Common, multi-bit errors tend to be clustered
– Noise source affects contiguous set of bus lines
Length of burst error
– number of bits between first and last error
Wrap around from last to first bit of codeword
Example:
Original codeword 00000000
00111100 is burst error length 4
00110100 is burst error length 4
– Any number of errors between first and last error
EE141
System-on-Chip
Test Architectures
49
Ch. 3 - Fault-Tolerant Design - P. 49
50. Cyclic Codes
Specialclass of linear code
Any codeword shifted cyclically is another
codeword
Used to detect burst errors
Less redundancy required to detect burst
error than general multi-bit errors
– Some distance 2 codes can detect all burst
errors of length 4
– detecting all possible 4-bit errors requires
distance 5 code
EE141
System-on-Chip
Test Architectures
50
Ch. 3 - Fault-Tolerant Design - P. 50
51. Cyclic Redundancy Check (CRC) Code
Mostwidely used cyclic code
Uses binary alphabet based on GF(2)
CRC
code is (n,k) block code
Formed using generator polynomial, g(x)
– called code generator
– degree n-k polynomial (same degree as
number of check bits)
g ( x ) g n k x
n k
... g 2 x g1 x g 0
2
c ( x ) m( x ) g ( x )
EE141
System-on-Chip
Test Architectures
51
Ch. 3 - Fault-Tolerant Design - P. 51
52.
Messagem(x)
g(x)
c(x)
Codeword
0000
0
x2 + 1
0
000000
0001
1
x2 + 1
x2 + 1
000101
0010
x
x2 + 1
x3 + x
001010
0011
x+1
x2 + 1
x3 + x2 + x + 1
001111
0100
x2
x2 + 1
x4 + x2
010100
0101
x2 + 1
x2 + 1
x4 + 1
010001
0110
x2 + x
x2 + 1
x4 + x3 + x2 + x
011110
0111
x2 + x + 1
x2 + 1
x4 + x3 + x + 1
011011
1000
x3
x2 + 1
x5 + x3
101000
1001
x3 + 1
x2 + 1
x5 + x3 + x2 + 1
101101
1010
x3 + x
x2 + 1
x5 + x
100010
1011
x3 + x + 1
x2 + 1
x5 + x2 + x + 1
100111
1100
x3 + x2
x2 + 1
x5 + x4 + x3 + x2
111100
1101
x3 + x2 + 1
x2 + 1
x5 + x4 + x3 + 1
111001
1110
x3 + x2 + x
x2 + 1
x5 + x4 + x2 + x
110110
1111
x3 + x2 + x + 1
x2 + 1
x5 + x4 + x + 1
110011
EE141
System-on-Chip
Test Architectures
52
Ch. 3 - Fault-Tolerant Design - P. 52
53. CRC Code
Linearblock code
Has G-matrix and H-matrix
G-matrix shifted version of generator
polynomial
g n k
0
G
.
0
EE141
System-on-Chip
Test Architectures
...
g1
g0
0
0
g n k
...
g1
g0
0
.
.
.
.
.
0
...
g n k
...
g1
0
0
.
g0
53
Ch. 3 - Fault-Tolerant Design - P. 53
54. CRC Code Example
(6,4)CRC code generated by g(x)=x2+1
1
0
G
0
0
EE141
System-on-Chip
Test Architectures
0 1 0 0 0
1 0 1 0 0
0 1 0 1 0
0 0 1 0 1
54
Ch. 3 - Fault-Tolerant Design - P. 54
55. Systematic CRC Codes
Toobtain systematic CRC code
codewords formed using Galois division
– nice because LFSR can be used for performing
division
c ( x ) m( x ) x
n k
r ( x)
n k
m( x ) x
r ( x) remainder of
g ( x)
EE141
System-on-Chip
Test Architectures
55
Ch. 3 - Fault-Tolerant Design - P. 55
56. Galois Division Example
Encodem(x)=x2+x with g(x)=x2+1
Requires dividing m(x)xn-k =x4+x3 by g(x)
111
101 11000
101
110
101
110
101
11 remainder
Remainder r(x)=x+1
– c(x) = m(x)xn-k+r(x) = (x2+x)(x2)+x+1 = x4+x3+x+1
EE141
System-on-Chip
Test Architectures
56
Ch. 3 - Fault-Tolerant Design - P. 56
57.
Messagem(x)
g(x)
r(x)
c(x)
Codeword
0000
0
x2 + 1
0
0
000000
0001
1
x2 + 1
1
x2 + 1
000101
0010
x
x2 + 1
x
x3 + x
001010
0011
x+1
x2 + 1
x+1
x3 + x2 + x + 1
001111
0100
x2
x2 + 1
1
x4 + 1
010001
0101
x2 + 1
x2 + 1
0
x4 + x2
010100
0110
x2 + x
x2 + 1
x+1
x4 + x3 + x + 1
011011
0111
x2 + x + 1
x2 + 1
x
x4 + x3 + x + 1
011110
1000
x3
x2 + 1
x
x4 + x3 + x + 1
100010
1001
x3 + 1
x2 + 1
x+1
x4 + x3 + x + 1
100111
1010
x3 + x
x2 + 1
0
x4 + x3 + x + 1
101000
1011
x3 + x + 1
x2 + 1
1
x4 + x3 + x + 1
101101
1100
x3 + x2
x2 + 1
x+1
x4 + x3 + x + 1
110011
1101
x3 + x2 + 1
x2 + 1
x
x4 + x3 + x + 1
110110
1110
x3 + x2 + x
x2 + 1
1
x4 + x3 + x + 1
111001
1111
x3 + x2 + x + 1
x2 + 1
0
x4 + x3 + x2 + x 111100
EE141
System-on-Chip
Test Architectures
57
Ch. 3 - Fault-Tolerant Design - P. 57
58. Generating Check Bits for CRC Code
UseLFSR
With characteristic polynomial equal to g(x)
Append n-k 0’s to end of message
Example:
m(x)=x2+x+1 and g(x)=x3+x+1
Message
0
0
0
111000
Appended 0’s
0
1
0
Final state after shifting equals remainder
EE141
System-on-Chip
Test Architectures
58
Ch. 3 - Fault-Tolerant Design - P. 58
59. Checking CRC Codeword
CheckingReceived Codeword for Errors
Shift codeword into LFSR
– with same characteristic polynomial as used to
generate it
If final state of LFSR non-zero, then error
0
0
0
111010
codeword to check
EE141
System-on-Chip
Test Architectures
59
Ch. 3 - Fault-Tolerant Design - P. 59
60. Selecting Generator Polynomial
Keyissue for CRC Codes
If first and last bit of polynomial are 1
– Will detect burst errors of length n-k or less
If generator polynomial is multiple of (x+1)
– Will detect any odd number of errors
If g(x) = (x+1)p(x) where p(x) primitive of
degree n-k-1 and n < 2n-k-1
– Will detect single, double, triple, and odd errors
EE141
System-on-Chip
Test Architectures
60
Ch. 3 - Fault-Tolerant Design - P. 60
61. Commonly Used CRC Generators
CRC codeGenerator Polynomial
CRC-5 (USB token packets)
x5+x2+1
CRC-12 (Telecom systems)
x12+x11+x3+x2+x+1
CRC-16-CCITT (X25, Bluetooth)
CRC-32 (Ethernet)
CRC-64 (ISO)
EE141
System-on-Chip
Test Architectures
x16+x12+x5+1
x32+x26+x23+x22+x16+x12+x11+x10+x8
+x7+x5+x4+x+1
x64+x4+x3+x+1
61
Ch. 3 - Fault-Tolerant Design - P. 61
62. Fault Tolerance Schemes
AddingFault Tolerance to Design
Improves dependability of system
Requires redundancy
– Hardware
– Time
– Information
EE141
System-on-Chip
Test Architectures
62
Ch. 3 - Fault-Tolerant Design - P. 62
63. Hardware Redundancy
Involvesreplicating hardware units
At any level of design
– gate-level, module-level, chip-level, board-level
Three
Basic Forms
Static (also called Passive)
– Masks faults rather than detects them
Dynamic (also called Active)
– Detects faults and reconfigures to spare hardware
Hybrid
– Combines active and passive approaches
EE141
System-on-Chip
Test Architectures
63
Ch. 3 - Fault-Tolerant Design - P. 63
64. Static Redundancy
Masksfaults so no erroneous outputs
Provides uninterrupted operation
Important for real-time systems
– No time to reconfigure or retry operation
Simple self-contained
– No need to update or rollback system state
EE141
System-on-Chip
Test Architectures
64
Ch. 3 - Fault-Tolerant Design - P. 64
65. Triple Module Redundancy (TMR)
Well-knownstatic redundancy scheme
Three copies of module
Use majority voter to determine final output
Error in one module out-voted by other two
Module
1
Module
2
Majority
Voter
Module
3
EE141
System-on-Chip
Test Architectures
65
Ch. 3 - Fault-Tolerant Design - P. 65
66. TMR Reliability and MTTF
TMRworks if any 2 modules work
Rm = reliability of each module
Rv = reliability of voter
RTMR Rv [ Rm3 C23 Rm2 (1 Rm )] Rv (3Rm2 2 Rm3 )
MTTF
for TMR
0
0
0
MTTFTMR RTMR dt Rv (3Rm2 2 Rm3 )dt e vt (3e 2 mt 2e 3 mt )dt
3
2
2 m v 3 m v
EE141
System-on-Chip
Test Architectures
66
Ch. 3 - Fault-Tolerant Design - P. 66
67. Comparison with Simplex
NeglectingMTTFTMR
TMR
fault rate of voter
5 1 5
MTTFsimplex
2 m 3 m 6 m 6
3
2
has lower MTTF, but
Can tolerate temporary faults
Higher reliability for short mission times
EE141
System-on-Chip
Test Architectures
67
Ch. 3 - Fault-Tolerant Design - P. 67
68. Comparison with Simplex
Crossoverpoint
RTMR Rsimplex
3e 2 mt 2e 3 mt e mt
Solve t
RTMR
ln 2
m
0.7 MTTFsimplex
> Rsimplex when
Mission time shorter than 70% of MTTF
EE141
System-on-Chip
Test Architectures
68
Ch. 3 - Fault-Tolerant Design - P. 68
69. N-Modular Redundancy (NMR)
NMRN modules along with majority voter
– TMR special case
Number of failed modules masked = (N-1)/2
As N increases, MTTF decreases
– But, reliability for short missions increases
If
goal only to tolerate temporary faults
TMR sufficient
EE141
System-on-Chip
Test Architectures
69
Ch. 3 - Fault-Tolerant Design - P. 69
70. Interwoven Logic
Replaceeach gate
with 4 gates using inconnection pattern
that automatically corrects errors
Traditionally
not as attractive as TMR
Requires lots of area overhead
Renewed interest by researchers
investigating emerging nanoelectronic
technologies
EE141
System-on-Chip
Test Architectures
70
Ch. 3 - Fault-Tolerant Design - P. 70
71. Interwoven Logic with 4 NOR Gates
+2a
X
+
2b
X
Y
+
+
2c
+
1a
+
4a
+
+
2
+
1b
1
+
4
+
1c
3
+
+
1d
+
2d
4b
+
+
4c
3a
+
+
4d
3b
+
Y
3c
+
3d
EE141
System-on-Chip
Test Architectures
71
Ch. 3 - Fault-Tolerant Design - P. 71
72. Example of Error on Third Y Input
XX
Y
0
0
0
0
+ 0
2a
+ 0
2b
+ 1
2c
+ 1
1a
+ 0
4a
+ 1
+
+
2
+
1
+
4
+ 1
2d
1b
4b
+ 0
1c
3
+ 0
1d
Y
EE141
System-on-Chip
Test Architectures
+ 0
0
0
1
0
+ 1
+ 0
4c
3a
+ 1
+ 0
4d
3b
+ 0
3c
+ 0
3d
72
Ch. 3 - Fault-Tolerant Design - P. 72
73. Dynamic Redundancy
InvolvesDetecting fault
Locating faulty hardware unit
Reconfiguring system to use spare fault-free
hardware unit
EE141
System-on-Chip
Test Architectures
73
Ch. 3 - Fault-Tolerant Design - P. 73
74. Unpowered (Cold) Spares
AdvantageExtends lifetime of spares
Equations
Assume spare not failing until powered
Perfect reconfiguration capability
Rw / cold _ spare (1 t )e t
MTTFw / cold _ spare
EE141
System-on-Chip
Test Architectures
2
74
Ch. 3 - Fault-Tolerant Design - P. 74
75. Unpowered (Cold) Spares
Onecold spare doubles MTTF
Assuming faults always detected and
reconfiguration circuitry never fails
Drawback
of cold spare
Extra time to power and initialize
Cannot be used to help in detecting faults
Fault detection requires either
– periodic offline testing
– online testing using time or information
redundancy
EE141
System-on-Chip
Test Architectures
75
Ch. 3 - Fault-Tolerant Design - P. 75
76. Powered (Hot) Spares
Canuse spares for online fault detection
One approach is duplicate-and-compare
If outputs mismatch then fault occurred
– Run diagnostic procedure to determine which
module is faulty and replace with spare
Any number of spares can be used
Module
A
Spare
Module
Module
B
EE141
System-on-Chip
Test Architectures
Output
Compare
Agree/Disagree
76
Ch. 3 - Fault-Tolerant Design - P. 76
77. Pair-and-a-Spare
Avoidshalting system to run diagnostic
procedure when fault occurs
Module
A
Module
B
Output
Compare
Agree/Disagree
Switch
Module
C
Module
D
EE141
System-on-Chip
Test Architectures
Output
Compare
Agree/Disagree
77
Ch. 3 - Fault-Tolerant Design - P. 77
78. TMR/Simplex
Whenone module in TMR fails
Disconnect one of remaining modules
Improves MTTF while retaining advantages
of TMR when 3 good modules
TMR/Simplex
Reliability always better than either TMR or
Simplex alone
EE141
System-on-Chip
Test Architectures
78
Ch. 3 - Fault-Tolerant Design - P. 78
79. Comparison of Reliability vs Time
10.9
RELIABILITY
0.8
0.7
SIMPLEX
TMR
0.6
TMR/SIMPLEX
0.5
0.4
0.3
0
0.2
0.4
0.6
0.8
1
NORMALIZED MISSION TIME (T/MTTF)
EE141
System-on-Chip
Test Architectures
79
Ch. 3 - Fault-Tolerant Design - P. 79
80. Hybrid Redundancy
Combinesboth static and dynamic
redundancy
Masks faults like static
Detects and reconfigures like dynamic
EE141
System-on-Chip
Test Architectures
80
Ch. 3 - Fault-Tolerant Design - P. 80
81. TMR with Spares
IfTMR module fails
Replace with spare
– can be either hot or cold spare
While system has three working modules
– TMR will provide fault masking for
uninterrupted operation
EE141
System-on-Chip
Test Architectures
81
Ch. 3 - Fault-Tolerant Design - P. 81
82. Self-Purging Redundancy
Usesthreshold voter instead of majority
voter
Threshold voter outputs 1 if number of
input that are 1 greater than threshold
– Otherwise outputs 0
Requires hot spares
EE141
System-on-Chip
Test Architectures
82
Ch. 3 - Fault-Tolerant Design - P. 82
83. Self-Purging Redundancy
Module1
Elem.
Switch
Module
2
Elem.
Switch
Module
3
Elem.
Switch
Module
4
Elem.
Switch
Module
5
Elem.
Switch
Threshold
Voter
2
Elementary Switch
Initialization
S
R
Flip
Flop
EE141
System-on-Chip
Test Architectures
Module
&
Voter
83
Ch. 3 - Fault-Tolerant Design - P. 83
84. Self-Purging Redundancy
Comparedwith 5MR
Self-purging with 5 modules
– Tolerate up to 3 failing modules (5MR cannot)
– Cannot tolerate two modules simultaneously
failing (5MR can)
Compared
with TMR with 2 spares
Self-purging with 5 modules
– simpler reconfiguration circuitry
– requires hot spares (3MR w/spares can use
either hot or cold spares)
EE141
System-on-Chip
Test Architectures
84
Ch. 3 - Fault-Tolerant Design - P. 84
85. Time Redundancy
AdvantageLess hardware
Drawback
Cannot detect permanent faults
If
error detected
System needs to rollback to known good
state before resuming operation
EE141
System-on-Chip
Test Architectures
85
Ch. 3 - Fault-Tolerant Design - P. 85
86. Repeated Execution
Repeatoperation twice
Simplest time redundancy approach
Detects temporary faults occurring during
one execution (but not both)
– Causes mismatch in results
Can reuse same hardware for both
executions
– Only one copy of functional hardware needed
EE141
System-on-Chip
Test Architectures
86
Ch. 3 - Fault-Tolerant Design - P. 86
87. Repeated Execution
Requiresmechanism for storing and
comparing results of both executions
In processor, can store in memory or on
disk and use software to compare
Main
cost
Additional time for redundant execution
and comparison
EE141
System-on-Chip
Test Architectures
87
Ch. 3 - Fault-Tolerant Design - P. 87
88. Multi-threaded Redundant Execution
Canuse in processor-based system that
can run multiple threads
Two copies of thread executed concurrently
Results compared when both complete
Take advantage of processor’s built-in
capability to exploit processing resources
– Reduce execution time
– Can significantly reduce performance penalty
EE141
System-on-Chip
Test Architectures
88
Ch. 3 - Fault-Tolerant Design - P. 88
89. Multiple Sampling of Outputs
Doneat circuit-level
Sample once at end of normal clock cycle
Same again after delay of t
Two samples compared to detect mismatch
– Indicates error occurred
Detect fault whose duration is less than t
Performance overhead depends on
– Size of t relative to normal clock period
EE141
System-on-Chip
Test Architectures
89
Ch. 3 - Fault-Tolerant Design - P. 89
90. Multiple Sampling of Outputs
Simpleapproach using two latches
Main
Latch
Signal
Clk
Error
Shadow
Latch
Clk+ t
EE141
System-on-Chip
Test Architectures
90
Ch. 3 - Fault-Tolerant Design - P. 90
91. Multiple Sampling of Outputs
Approachusing stability checker at output
Normal
Clock Period
Checking
Period
Stability
Stability
Checking
Checking
Period
Period
Normal
Clock Period
t
t
&
+
&
Signal
Error
+
&
EE141
System-on-Chip
Test Architectures
91
Ch. 3 - Fault-Tolerant Design - P. 91
92. Diverse Recomputation
Usesame hardware, but perform
computation differently second time
Can detect permanent faults that affects
only one computation
For
arithmetic or logical operations
Shift operands when performing second
computation [Patel 1982]
Detects permanent fault affecting only one
bit-slice
EE141
System-on-Chip
Test Architectures
92
Ch. 3 - Fault-Tolerant Design - P. 92
93. Information Redundancy
Basedon Error Detecting and
Correcting Codes
Advantage
Detects both permanent and temporary
faults
Implemented with less hardware overhead
than using multiple copies of module
Disadvantage
More complex design
EE141
System-on-Chip
Test Architectures
93
Ch. 3 - Fault-Tolerant Design - P. 93
94. Error Detection
Errordetecting codes used to detect
errors
If error detected
– Rollback to previous known error-free state
– Retry operation
EE141
System-on-Chip
Test Architectures
94
Ch. 3 - Fault-Tolerant Design - P. 94
95. Rollback
Requiresadding storage to save
previous state
Amount of rollback depends on latency of
error detection mechanism
Zero-latency error detection
– rollback implemented by preventing system
state from updating
If errors detected after n cycles
– need rollback restoring system to state at least
n clock cycles earlier
EE141
System-on-Chip
Test Architectures
95
Ch. 3 - Fault-Tolerant Design - P. 95
96. Checkpoint
Executiondivided into set of operations
Before each operation executed
– checkpoint created where system state saved
If any error detected during operation
– rollback to last checkpoint and retry operation
If multiple retries fail
– operation halts and system flags that
permanent fault has occurred
EE141
System-on-Chip
Test Architectures
96
Ch. 3 - Fault-Tolerant Design - P. 96
97. Error Detection
Encodeoutputs of circuit with error
detecting code
Non-codeword output indicates error
Inputs
m
k
Functional
Logic
Outputs
k
m
EE141
System-on-Chip
Test Architectures
Check Bit
Generator
Checker
c
Error
Indication
97
Ch. 3 - Fault-Tolerant Design - P. 97
98. Self-Checking Checker
Hastwo outputs
Normal error-free case (1,0) or (0,1)
If equal to each other, then error (0,0) or (1,1)
Cannot have single error indicator output
– Stuck-at 0 fault on output could never be detected
EE141
System-on-Chip
Test Architectures
98
Ch. 3 - Fault-Tolerant Design - P. 98
99. Totally Self-Checking Checker
Requiresthree properties
Code Disjoint
– all codeword inputs mapped to codeword outputs
Fault Secure
– for all codeword inputs, checker in presence of
fault will either procedure correct codeword output
or non-codeword output (not incorrect codeword)
Self-Testing
– For each fault, at least one codeword input gives
error indication
EE141
System-on-Chip
Test Architectures
99
Ch. 3 - Fault-Tolerant Design - P. 99
100. Duplicate-and-Compare
Equalitychecker indicates error
Undetected error can occur only if
common-mode fault affecting both copies
Only faults after stems detected
Over 100% overhead (including checker)
Stems
Primary
Inputs
Functional
Logic
Equality
Checker
Error
Indication
Functional
Logic
EE141
System-on-Chip
Test Architectures
100
Ch. 3 - Fault-Tolerant Design - P. 100
101. Single-Bit Parity Code
Totallyself-checking checker formed by
removing final gate from XOR tree
Functional
Logic
EI0
EI1
Parity
Prediction
EE141
System-on-Chip
Test Architectures
101
Ch. 3 - Fault-Tolerant Design - P. 101
102. Single-Bit Parity Code
Cannotdetect even bit errors
Can ensure no even bit errors by
generating each output with independent
cone of logic
– Only single bit errors can occur due to single
point fault
– Typically requires a lot of overhead
EE141
System-on-Chip
Test Architectures
102
Ch. 3 - Fault-Tolerant Design - P. 102
103. Parity-Check Codes
Eachcheck bit is parity for some set of
output bits
Example: 6 outputs and 3 check bits
Z1 Z2 Z3 Z4 Z5 Z 6 c1 c2 c3
Parity Group 1
1
0
0
1
1
0
1
0
0
Parity Group 2
0
1
1
0
0
0
0
1
0
Parity Group 3
0
0
0
0
0
1
0
0
1
EE141
System-on-Chip
Test Architectures
103
Ch. 3 - Fault-Tolerant Design - P. 103
104. Parity-Check Codes
Forc check bits and k functional outputs
2ck possible parity check codes
Can choose code based on structure of
circuit to minimize undetected error
combinations
Fanouts in circuit determine possible error
combinations due to single-point fault
EE141
System-on-Chip
Test Architectures
104
Ch. 3 - Fault-Tolerant Design - P. 104
105. Checker for Parity-Check Codes
Constructedfrom single-bit parity
checkers and two-rail checkers
Z1
Z4
Z5
c1
Z2
Z3
Parity
Checker
Two-Rail
Checker
Parity
Checker
Two-Rail
Checker
E0
E1
c2
Z6
c3
Parity
Checker
EE141
System-on-Chip
Test Architectures
105
Ch. 3 - Fault-Tolerant Design - P. 105
106. Two-Rail Checkers
Totallyself-checking two-rail checker
A0
B0
&
+
C0
+
C1
&
A1
B1
EE141
System-on-Chip
Test Architectures
&
&
106
Ch. 3 - Fault-Tolerant Design - P. 106
107. Berger Codes
Inverter-freecircuit
Inverters only at primary inputs
Can be synthesized using only algebraic
factoring [Jha 1993]
Only unidirectional errors possible for
single point faults
– Can use unidirectional code
– Berger code gives 100% coverage
EE141
System-on-Chip
Test Architectures
107
Ch. 3 - Fault-Tolerant Design - P. 107
108. Constant Weight Codes
Non-separablewith lower redundancy
Drawback: need decoding logic to convert
codeword back to its original binary value
Can use for encoding states of FSM
– No need for decoding logic
EE141
System-on-Chip
Test Architectures
108
Ch. 3 - Fault-Tolerant Design - P. 108
109. Error Correction
Informationredundancy can also be
used to mask errors
Not as attractive as TMR because logic for
predicting check bits very complex
However, very good for memories
– Check bits stored with data
– Error do not propagate in memories as in logic
circuits, so SEC-DED usually sufficient
EE141
System-on-Chip
Test Architectures
109
Ch. 3 - Fault-Tolerant Design - P. 109
110. Error Correction
Memoriesvery dense and prone to errors
Especially due to single-event upsets (SEUs)
from radiation
SEC-DED
check bits stored in memory
32-bit word, SEC-DED requires 7 check bits
– Increases size of memory by 7/32=21.9%
64-bit word, SEC-DED requires 8 check bits
– Increases size of memory by 8/64=12.5%
EE141
System-on-Chip
Test Architectures
110
Ch. 3 - Fault-Tolerant Design - P. 110
111. Memory ECC Architecture
Write Data WordRead Data Word
Generate
Check
Bits
Data Word
In
Write
Check Bits
Memory
Calculated
Check Bits
Data Word
Out
Correct
Data
EE141
System-on-Chip
Test Architectures
Generate
Syndrome
Read
Check Bits
111
Ch. 3 - Fault-Tolerant Design - P. 111
112. Hamming Code for ECC RAM
ZInput Data
Hamming
Check Bit
Generator
Bit Error
Z
Correction Circuit
Z
RAM
Core
c
Parity Bit
Generator
c
N words
Z+c+1
bits/word
Parity Bit
Generator
Parity Group 1
Parity Group 2
Parity Group 3
Parity Group 4
Z2
1
0
1
0
Error Type
No bit error
Single-bit correctable error
Double-bit error detection
EE141
System-on-Chip
Test Architectures
Z3
0
1
1
0
Z4
1
1
1
0
Parity
Check
Hamming
Check
Hamming c
Check Bit
Detect/Correct
Generator
Generate
Z1
1
1
0
0
Output
Data
Z5
1
0
0
1
Z6
0
1
0
1
Z7
1
1
0
1
Z8
0
0
1
1
c1
1
0
0
0
c2
0
1
0
0
c3
0
0
1
0
c4
0
0
0
1
Condition
Hamming check bits match, no parity error
Hamming check bits mismatch, parity error
Hamming check bits mismatch, no parity error
112
Ch. 3 - Fault-Tolerant Design - P. 112
113. Memory ECC
SEC-DEDgenerally very effective
Memory bit-flips tend to be independent
and uniformly distributed
If bit-flip occurs, gets corrected next time
memory location accessed
Main risk is if memory word not access for
long time
– Multiple bit-flips could accumulate
EE141
System-on-Chip
Test Architectures
113
Ch. 3 - Fault-Tolerant Design - P. 113
114. Memory Scrubbing
Everylocation in memory read on
periodic basis
Reduces chance of multiple errors
accumulating in a memory word
Can be implemented by having memory
controller cycle through memory during idle
periods
EE141
System-on-Chip
Test Architectures
114
Ch. 3 - Fault-Tolerant Design - P. 114
115. Multiple-Bit Upsets (MBU)
Canoccur due to single SEU
Typically occur in adjacent memory cells
Memory
interleaving used
To prevent MBUs from resulting in multiple
bit errors in same word
Memory
Word1
Bit1
Word2
Bit1
Word3
Bit1
EE141
System-on-Chip
Test Architectures
Word4
Bit1
Word1
Bit2
Word2
Bit2
Word3
Bit2
Word4
Bit2
Word1 Word2
Bit3
Bit3
Word3
Bit3
Word4
Bit3
115
Ch. 3 - Fault-Tolerant Design - P. 115
116.
TypeIssues
Goal
Examples
Techniques
Long-Life
Systems
Difficult or
Expensive to Repair
Maximize
MTTF
Satellites
Spacecraft
Implanted Biomedical
Dynamic
Redundancy
Reliable
Real-Time
Systems
Error or Delay
Catastrophic
Fault Masking
Capability
Aircraft
Nuclear Power Plant
Air Bag Electronics
Radar
TMR
High
Availability
Systems
Downtime
Very Costly
High
Availability
Reservation System
Stock Exchange
Telephone Systems
No Single Point of
Failure;
Self-Checking Pairs;
Fault Isolation
High
Integrity
Systems
Data Corruption
Very Costly
High
Data Integrity
Banking
Transaction
Processing
Database
Checkpointing,
Time Redundancy;
ECC; Redundant
Disks
Mainstream
Low-Cost
Systems
Reasonable Level of
Failures Acceptable
Meet Failure Rate
Expectations
at Low Cost
Consumer Electronics
Personal Computers
Often None;
Memory ECC; Bus
Parity; Changing as
Technology Scales
EE141
System-on-Chip
Test Architectures
116
Ch. 3 - Fault-Tolerant Design - P. 116
117. Concluding Remarks
Manydifferent fault-tolerant schemes
Choosing scheme depends on
Types of faults to be tolerated
– Temporary or permanent
– Single or multiple point failures
– etc.
Design constraints
– Area, performance, power, etc.
EE141
System-on-Chip
Test Architectures
117
Ch. 3 - Fault-Tolerant Design - P. 117
118. Concluding Remarks
Astechnology scales
Circuits increasingly prone to failure
Achieving sufficient fault tolerance will be
major design issue
EE141
System-on-Chip
Test Architectures
118
Ch. 3 - Fault-Tolerant Design - P. 118