Reconfigurable computing
Small Embedded Systems
Unit 6.6
Reliability and Safety
Introduction
Definitions of reliability and availability
Reliability calculations
Redundancy as a way to increase reliability
Safety integrity levels standards
Examples
Reliability
Reliability R(t) is defined as the probability that a system will perform the required function, under the specified operating conditions, until at least time t.
If systems are subjected to conditions other than those for which the system was designed (e.g. extreme temperature or vibration), the statistical models that we will develop do not apply.
For many types of hardware, reliability is well modelled by an exponential distribution:
R(t) = e^(−λt)
where λ is the failure rate.
The normal unit of λ is the FIT: failures in 10⁹ hours.
Mean Time to Fail
The mean time to fail (MTTF) is the expected time of first failure, and is given by
MTTF = 1/λ
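As a sketch of the exponential model in code (the function names are my own, not from the slides):

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t): probability the unit survives to time t."""
    return math.exp(-failure_rate_per_hour * hours)

def mttf_hours(failure_rate_per_hour: float) -> float:
    """MTTF = 1 / lambda for the constant-failure-rate model."""
    return 1.0 / failure_rate_per_hour

# 100 FIT = 100 failures per 10**9 device-hours
lam = 100 / 1e9
print(f"MTTF = {mttf_hours(lam):.0f} hours")          # MTTF = 10000000 hours
print(f"R(1 year) = {reliability(lam, 365*24):.4f}")  # R(1 year) = 0.9991
```

Note that a large MTTF (here over 1000 years) does not mean an individual unit is expected to last that long in service; it is just 1/λ for the constant-rate model.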
Related Concepts
Availability
The probability that an item is in a state to perform a required function under given conditions at a given instant of time, assuming that the required external resources are provided.
Maintainability
The probability that an item, under stated conditions of use, can be retained in or restored to a state in which it can perform its required function within a specified time.
What’s the difference between reliability and availability?
Imagine a system that breaks down for 2 seconds every hour
Its availability is quite good, but reliability is very bad
Reliability Example
We design a system with a constant failure rate of 100 FIT, and sell 50 000 units. How many of our units are likely to fail in the first year of operation?
100 FIT = 100 / (10⁹ × 60 × 60) = 2.8 × 10⁻¹¹ failures/second
1 year = 365 × 24 × 60 × 60 = 3.15 × 10⁷ seconds
Probability of surviving the first year is R = e^(−λt) = e^(−2.8×10⁻¹¹ × 3.15×10⁷) ≈ 0.9991
So the probability of failure is 1 − 0.9991 = 0.0009
The number that fail is 50 000 × 0.0009 = 45.
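The same calculation in code (mine, not the slides'; keeping full precision gives ~44 rather than the rounded figure of 45):

```python
import math

fit = 100                        # failure rate in FIT (failures per 1e9 hours)
lam = fit / 1e9 / 3600           # failures per second, ~2.8e-11
t = 365 * 24 * 3600              # one year, ~3.15e7 seconds

surviving = math.exp(-lam * t)   # probability a unit survives the year
expected_failures = 50_000 * (1 - surviving)
print(round(expected_failures))  # 44 (the slide's rounded intermediates give 45)
```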
Variations in Failure Rate
In practice, λ is not normally constant throughout the whole lifetime of a system
Failure rates often follow the “Bath-tub” curve:
Initially failure rate is high, due to devices that are weakened by manufacturing defects.
Once the weakened devices have failed, the failure rate is constant and low ( the steady state condition ).
Eventually device wears out, and failure rate climbs again.
Burn-in
The initial high failure rate means devices can pass initial testing, but then fail quickly once with the customer.
Weakened devices can be screened out by burn-in.
Period of deliberate over-stress of the device (elevated temperatures, voltages, ..) to accelerate failure of weak devices
Devices that survive burn-in may be assumed to be in the steady state.
System Reliability
We know the reliability behaviour of our components.
What happens when we combine multiple components to build a system?
Reliability can be influenced at two levels:
At component level:
Each component being more reliable leads to a more reliable system.
At system level:
Design the system to be fault tolerant;
Operate the system to minimise unplanned downtime.
System Reliability
Calculation depends on how the failure of an individual sub-system affects the system as a whole:
A series system is a system in which all sub-systems must function correctly. If any sub-system fails, then the whole system fails.
A parallel system is a system which will work correctly if any sub-system works correctly. Only if all subsystems fail does the whole system fail.
The reliability use of the words "series" and "parallel" has nothing to do with their usual electronics meaning: it is not about how the components are physically connected together.
Simple System Reliability Example
Most systems normally have series behaviour, unless we deliberately build in features to make system fault tolerant
Example: A system is built from two components
Component A: reliability = 90%
Component B: reliability = 95%
System requires that both components work
If any component fails, whole system fails
Probability of whole system working is
(Probability that A works) × (Probability that B works)
= 90% × 95% = 85.5%
System reliability is less than component reliability
With many components, reliability can be very low
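A minimal sketch of the series calculation (function name is my own):

```python
def series_reliability(reliabilities):
    """Series system: every component must work, so multiply reliabilities."""
    product = 1.0
    for r in reliabilities:
        product *= r
    return product

print(series_reliability([0.90, 0.95]))   # ≈ 0.855
# 100 components, each 99% reliable, gives only ~37% system reliability:
print(series_reliability([0.99] * 100))   # ≈ 0.366
```

The second line illustrates the point above: with many series components, system reliability collapses even when each component is individually very reliable.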
Redundant Systems
To get parallel reliability, we build in redundancy to make system fault tolerant
Dual modular redundancy
Both units perform same computation
If one fails the overall system is OK, as long as the decision unit can recognize that one has malfunctioned
e.g. a failed unit goes dead, or produces an invalid CRC
If failed unit can produce answers that are plausible but wrong (the “babbling idiot” problem), then we need a better approach
[Diagram: Inputs feed identical units 1 and 2; a Decide unit produces the Outputs]
Redundant Systems
Suppose the components have reliability of 90%
Their unreliability is 100% – 90% = 10%
What is reliability of combination?
The system will only stop working if both copies stop working
Probability of not working is 10%×10% = 1%
Probability that system will work is 100%-1% = 99%
Reliability of combination is 99%
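The dual-redundancy arithmetic above can be checked with a short sketch (function name is my own; like the slides, this assumes a perfect decision unit):

```python
def parallel_reliability(reliabilities):
    """Parallel system: fails only when every component has failed."""
    prob_all_fail = 1.0
    for r in reliabilities:
        prob_all_fail *= (1.0 - r)
    return 1.0 - prob_all_fail

print(parallel_reliability([0.90, 0.90]))  # ≈ 0.99
```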
[Diagram: Inputs feed identical units 1 and 2; a Decide unit produces the Outputs]
Redundant Systems
Triple modular redundancy
Three units perform same computation
Decision unit just needs to take majority decision
Say each component has reliability of 90%
System reliability is 72.9% + 24.3% = 97.2%
[Diagram: Inputs feed identical units 1, 2 and 3; a Decide unit takes the majority vote to produce the Outputs]
Outcome       | Probability          | Value
No copies fail | (90%)³              | 72.9%
1 copy fails   | 3 × (90%)² × (10%)¹ | 24.3%
2 copies fail  | 3 × (90%)¹ × (10%)² | 2.7%
3 copies fail  | (10%)³              | 0.1%
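The TMR figure can be reproduced with a small k-of-n helper (a sketch; the binomial model assumes independent, identical units and a perfect voter):

```python
from math import comb

def k_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical units (each reliability r) work."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

# TMR with majority voting: at least 2 of the 3 units must work
print(k_of_n_reliability(2, 3, 0.90))  # ≈ 0.972
```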
Diversity Redundancy
Dual modular redundancy with two identical duplicate units:
This gives us good reliability if failures of units are random and uncorrelated
If there is a design error in the identical units then unit 2 fails whenever unit 1 fails: redundancy gives no protection at all
Diversity redundancy requires that units 1 and 2 perform the same task, but use a different hardware design and a different software design
This gives us protection against design errors
[Diagram: Inputs feed units 1 and 2 of different design; a Decide unit produces the Outputs]
Reliability Block Diagrams (RBDs)
A diagrammatic representation of how the overall system reliability depends on the reliability of its components
Show components of system as a series of blocks
System is working if there exists a path from start to end through the blocks
Each block is characterised by a reliability measure (reliability, availability, maintainability)
Series Reliability
System is working if there exists a path from start to end through the blocks
System made of a series combination of blocks works correctly only if all blocks are working correctly
[RBD: blocks X, Y and Z in series]
Series Reliability
Overall system reliability/availability is determined by product of component reliabilities/availabilities
[RBD: blocks X (A=90%), Y (A=80%) and Z (A=70%) in series]
System availability = 90% × 80% × 70% = 50.4%
Parallel Reliability
System made of a parallel combination of blocks works correctly if any block is working correctly
[RBD: blocks X, Y and Z in parallel]
Parallel Reliability
System made of a parallel combination of blocks works correctly if any block is working correctly
Overall non-availability is the product of the component non-availabilities
[RBD: blocks X (A=90%), Y (A=80%) and Z (A=70%) in parallel]
1 − A_system = (1 − 0.9) × (1 − 0.8) × (1 − 0.7) = 0.006
System availability = 99.4%
Series/Parallel Combinations
Series and parallel can be combined
[RBD: block X (A=90%) in series with the parallel combination of Y (A=80%) and Z (A=70%)]
1 − A_YZ = (1 − 0.8) × (1 − 0.7) = 0.06, so A_YZ = 94%
System availability = 90% × 94% = 84.6%
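The series/parallel RBD arithmetic can be sketched as (function names are my own):

```python
def series(*avail):
    """All blocks needed: multiply availabilities."""
    p = 1.0
    for a in avail:
        p *= a
    return p

def parallel(*avail):
    """Any block suffices: multiply unavailabilities."""
    q = 1.0
    for a in avail:
        q *= (1.0 - a)
    return 1.0 - q

a_yz = parallel(0.80, 0.70)    # ≈ 0.94
a_system = series(0.90, a_yz)  # ≈ 0.846
```

Nesting these two helpers evaluates any series/parallel RBD from the inside out, exactly as done by hand above.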
Understanding RBDs
RBDs’ serial/parallel combinations are statements of how components interact to cause system failure
They are not an expression of physical circuit connection
Suppose we have a circuit whose circuit diagram is this
If failure of one sensor causes system failure, the RBD is this:
(The order we draw the three in doesn’t matter; the important thing is that the three are in a series reliability relationship)
[Circuit diagram: microcontroller connected to Sensor1 and Sensor2]
[RBD: Microcontroller, Sensor1 and Sensor2 in series]
Understanding RBDs
RBDs’ serial/parallel combinations are statements of how components interact to cause system failure
They are not an expression of physical circuit connection
Suppose we have a circuit whose circuit diagram is this
If the system only fails when both sensors fail, then our RBD is this:
[Circuit diagram: microcontroller connected to Sensor1 and Sensor2]
[RBD: Microcontroller in series with the parallel combination of Sensor1 and Sensor2]
Safety Integrity Levels (SILs)
SILs (defined by International Electrotechnical Commission) are commonly used to measure safety system performance
They are defined for several cases:
Low demand mode
Figures are expressed as probability of dangerous failure per occasion of demand
Demand mode:

Level | Reliability   | Probability of a failure on demand | Consequence of a failure
SIL 4 | >99.99%       | ≥10⁻⁵ to <10⁻⁴                     | Catastrophic community impact
SIL 3 | 99.9%–99.99%  | ≥10⁻⁴ to <10⁻³                     | Potential for multiple fatalities
SIL 2 | 99%–99.9%     | ≥10⁻³ to <10⁻²                     | Potential for major injuries or one fatality
SIL 1 | 90%–99%       | ≥10⁻² to <10⁻¹                     | Potential for minor injuries
SIL 0 | N/A           | N/A                                | N/A
Safety Integrity Levels (SILs)
SILs (defined by International Electrotechnical Commission) are commonly used to measure safety system performance
They are defined for several cases:
Continuous mode usage
Figures are expressed as probability of dangerous failure per hour
Continuous mode:

Level | Probability of a dangerous failure per hour | Consequence of a failure
SIL 4 | ≥10⁻⁹ to <10⁻⁸                              | Catastrophic community impact
SIL 3 | ≥10⁻⁸ to <10⁻⁷                              | Potential for multiple fatalities
SIL 2 | ≥10⁻⁷ to <10⁻⁶                              | Potential for major injuries or one fatality
SIL 1 | ≥10⁻⁶ to <10⁻⁵                              | Potential for minor injuries
SIL 0 | N/A                                         | N/A
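As an illustrative sketch, the continuous-mode bands can be encoded as a lookup (the `sil_from_pfh` helper and its boundaries come from the table above, not from the IEC standard text):

```python
def sil_from_pfh(pfh: float) -> int:
    """Map a probability of dangerous failure per hour to a continuous-mode SIL.
    Returns 0 when the failure rate is too high for any SIL band."""
    bands = [(4, 1e-9, 1e-8),
             (3, 1e-8, 1e-7),
             (2, 1e-7, 1e-6),
             (1, 1e-6, 1e-5)]
    for level, low, high in bands:
        if low <= pfh < high:
            return level
    return 0

print(sil_from_pfh(5e-8))  # 3
```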
Example of Good Practice
Boeing 777 primary flight computer
Fly-by-wire system: computer directly controls actuators
Hardware and software use triple modular redundancy
Three different microprocessors (one Intel, one AMD, one Motorola) work out the pitch, yaw and roll control signals
Three different copies of software used, compiled on three different compilers
By aeroprints.com, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32511778
Alex Pereslavtsevderivative work: Altair78 (talk) -, GFDL 1.2, https://commons.wikimedia.org/w/index.php?curid=16892729
Example of Poor Practice
Boeing 737MAX
Two accidents (October 2018 and March 2019) led to entire fleet being grounded
One cause was lack of redundancy on unreliable sensor
User:Acefitt / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)
Boeing 737 Max: The Problem
https://spectrum.ieee.org/aerospace/aviation/how-the-boeing-737-max-disaster-looks-to-a-software-developer
Better thermodynamic efficiency (fuel economy) comes from larger engines
Ground clearance of original 737 was limited
737 Max mounted new engines higher and more forward
Altered handling led to tendency for nose to pitch up
A new automated computer system was introduced to push the nose down when the computer believed this was necessary to avoid a stall
Boeing 737 Max: The Sensors
Computer measured angle of attack using a sensor
Sensor is external in extreme environment, subject to shaking and vibration
One sensor on each side, but controlling computer only takes data from sensor on one side
No operational redundancy (although there were 2 sensors)
Summary
Reliability, Availability and Maintainability are key measures for safety critical systems
Large systems where all components must work (serial reliability) have significantly lower reliability than the constituent components
Redundancy (only some of the components must work: parallel reliability) can give significant reliability gains