Small Embedded Systems

Unit 6.6
Reliability and Safety

Introduction
Definitions of reliability and availability
Reliability calculations
Redundancy as a way to increase reliability
Safety integrity level (SIL) standards
Examples

Reliability
Reliability R(t) is defined as the probability that a system will perform the required function, under the specified operating conditions, until at least time t.
If systems are subjected to conditions other than those for which they were designed (e.g. extreme temperature or vibration), the statistical models that we will develop do not apply
For many types of hardware, reliability is modelled well by an exponential distribution:

R(t) = e^(-λt)

Mean Time to Fail

λ is the failure rate
The normal unit of λ is the FIT: failures in 10⁹ hours.
The mean time to fail (MTTF) is the expected time of first failure, and is given by

MTTF = 1/λ
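As a quick check of these definitions, here is a minimal Python sketch (ours, not from the unit; the function names are illustrative) that evaluates R(t) and the MTTF for a constant failure rate:

```python
import math

FIT = 1e-9  # 1 FIT = 1 failure per 10^9 device-hours

def reliability(t_hours, lam_per_hour):
    # R(t) = exp(-lambda * t): probability of surviving until time t
    return math.exp(-lam_per_hour * t_hours)

def mttf_hours(lam_per_hour):
    # MTTF = 1 / lambda for the constant-failure-rate model
    return 1.0 / lam_per_hour

lam = 100 * FIT                  # a 100 FIT part, as failures/hour
print(mttf_hours(lam))           # 1e7 hours, roughly 1140 years
print(reliability(8760, lam))    # survival over one year (8760 h), ~0.9991
```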

Related Concepts
Availability
The probability that an item is in a state to perform a required function under given conditions at a given instant of time, assuming that the required external resources are provided.

Maintainability
The probability that an item, under stated conditions of use, can be retained in or restored to a state in which it can perform its required function, within a specified time
What’s the difference between reliability and availability?
Imagine a system that breaks down for 2 seconds every hour
Its availability is quite good, but reliability is very bad

Reliability Example
We design a system of constant failure rate 100 FIT, and sell 50 000 units. How many of our units are likely to fail in the first year of operation?

100 FIT = 100 / (10⁹ × 60 × 60) = 2.8 × 10⁻¹¹ failures/second
1 year = 365 × 24 × 60 × 60 = 3.15 × 10⁷ seconds
The probability of surviving the first year is

R = e^(-λt) = e^(-(2.8 × 10⁻¹¹) × (3.15 × 10⁷)) ≈ 0.9991

So the probability of failure is 1 − 0.9991 = 0.0009
The number that fail is 50 000 × 0.0009 = 45
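The same arithmetic can be checked with a few lines of Python (our sketch; note that the exact calculation gives about 44 failures, which the slide's rounding of R to 0.9991 turns into 45):

```python
import math

lam = 100 / (1e9 * 3600)      # 100 FIT in failures/second, ~2.8e-11
t = 365 * 24 * 3600           # one year in seconds, ~3.15e7
r = math.exp(-lam * t)        # survival probability, ~0.99912
print(1 - r)                  # failure probability, ~0.00088
print(50_000 * (1 - r))       # expected failures, ~44
```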

Variations in Failure Rate
In practice, λ is not normally constant throughout the whole lifetime of a system
Failure rates often follow the “Bath-tub” curve:

Initially the failure rate is high, due to devices that are weakened by manufacturing defects.
Once the weakened devices have failed, the failure rate is constant and low (the steady-state condition).
Eventually devices wear out, and the failure rate climbs again.

Burn-in
The initial high failure rate can cause devices to pass initial testing, but then fail quickly once with the customer.

Weakened devices can be screened out by burn-in.
Burn-in is a period of deliberate over-stress of the device (elevated temperatures, voltages, …) to accelerate the failure of weak devices
Devices that survive burn-in may be assumed to be in the steady state.

System Reliability
We know the reliability behaviour of our components.
What happens when we combine multiple components to build a system? 
Reliability can be influenced at two levels:
At component level:
Each component being more reliable leads to a more reliable system.
At system level:
Design the system to be fault tolerant;
Operate the system to minimise unplanned downtime.

System Reliability
Calculation depends on how the failure of an individual sub-system affects the system as a whole:
A series system is a system in which all sub-systems must function correctly. If any sub-system fails, then the whole system fails.
A parallel system is a system which will work correctly if any sub-system works correctly. Only if all subsystems fail does the whole system fail.
The reliability use of the words series and parallel has nothing to do with their usual electronics meaning: it describes failure behaviour, not how the components are wired together

Simple System Reliability Example
Most systems normally have series behaviour, unless we deliberately build in features to make the system fault tolerant
Example: A system is built from two components
Component A: reliability = 90%
Component B: reliability = 95%
System requires that both components work
If any component fails, whole system fails
Probability of whole system working is
(Probability that A works) × (Probability that B works)
= 90% × 95% = 85.5%
System reliability is less than component reliability
With many components, reliability can be very low
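The series rule is just a product over the components; a one-line helper (our sketch) reproduces the example:

```python
from math import prod

def series_reliability(blocks):
    # Series system: works only if every block works
    return prod(blocks)

print(series_reliability([0.90, 0.95]))   # 0.855
```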

Redundant Systems
To get parallel reliability, we build in redundancy to make system fault tolerant
Dual modular redundancy

Both units perform same computation
If one fails the overall system is OK, as long as the decision unit can recognize that one has malfunctioned
e.g. the failed unit goes dead, or produces an invalid CRC
If failed unit can produce answers that are plausible but wrong (the “babbling idiot” problem), then we need a better approach

[Diagram: inputs feed two identical units, 1 and 2; a decision unit compares their outputs and produces the system output]

Redundant Systems
Suppose the components have reliability of 90%
Their unreliability is 100% – 90% = 10%
What is reliability of combination?

The system will only stop working if both copies stop working
Probability of not working is 10%×10% = 1%
Probability that system will work is 100%-1% = 99%

Reliability of combination is 99%
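The complementary parallel rule, again as a small sketch of ours: the system fails only if every copy fails.

```python
from math import prod

def parallel_reliability(blocks):
    # Parallel system: fails only if every block fails
    return 1.0 - prod(1.0 - r for r in blocks)

print(parallel_reliability([0.90, 0.90]))   # 0.99
```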


Redundant Systems
Triple modular redundancy

Three units perform same computation
Decision unit just needs to take majority decision
Say each component has reliability of 90%

System reliability is 72.9% + 24.3% = 97.2% (the cases where no copies fail or exactly one copy fails; see the table below)

[Diagram: inputs feed three identical units, 1, 2 and 3; a decision unit takes a majority vote and produces the system output]
No copies fail    (90%)³                  72.9%
1 copy fails      3 × (90%)² × (10%)¹     24.3%
2 copies fail     3 × (90%)¹ × (10%)²      2.7%
3 copies fail     (10%)³                   0.1%
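The table is the binomial distribution over three independent copies; a short sketch (ours) sums the two passing cases of the 2-out-of-3 majority vote:

```python
def tmr_reliability(r):
    # Majority vote survives if zero or one of the three copies fails
    none_fail = r ** 3                 # 0.729 for r = 0.90
    one_fails = 3 * r ** 2 * (1 - r)   # 0.243 for r = 0.90
    return none_fail + one_fails

print(tmr_reliability(0.90))   # 0.972
```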

Diversity Redundancy
Dual modular redundancy with two identical duplicate units:

This gives us good reliability if failures of units are random and uncorrelated
If there is a design error in the identical units then unit 2 fails whenever unit 1 fails: redundancy gives no protection at all
Diversity redundancy requires that units 1 and 2 perform the same task, but use a different hardware design and a different software design
This gives us protection against design errors

[Diagram: as before, inputs feed units 1 and 2, with a decision unit producing the system output]

Reliability Block Diagrams (RBDs)
A diagrammatic representation of how the overall system reliability depends on the reliability of its components
Show components of system as a series of blocks
System is working if there exists a path from start to end through the blocks
Each block is characterised by a reliability measure (reliability, availability, maintainability)

Series Reliability
System is working if there exists a path from start to end through the blocks
System made of a series combination of blocks works correctly only if all blocks are working correctly

[Diagrams: blocks X, Y and Z drawn in series; the system works only while a path through all three exists]

Series Reliability
Overall system reliability/availability is determined by product of component reliabilities/availabilities

[Diagram: blocks X (A = 90%), Y (A = 80%) and Z (A = 70%) in series]

System availability = 90% × 80% × 70% = 50.4%

Parallel Reliability
System made of a parallel combination of blocks works correctly if any one of its blocks is working correctly

[Diagram: blocks X, Y and Z in parallel]

Parallel Reliability
System made of a parallel combination of blocks works correctly if any one of its blocks is working correctly

Overall non-availability is the product of the component non-availabilities

[Diagram: blocks X (A = 90%), Y (A = 80%) and Z (A = 70%) in parallel]

1 − A_system = (1 − 0.9) × (1 − 0.8) × (1 − 0.7) = 0.006
System availability = 99.4%

Series/Parallel Combinations
Series and parallel can be combined

[Diagram: block X (A = 90%) in series with the parallel pair Y (A = 80%) and Z (A = 70%)]

1 − A_YZ = (1 − 0.8) × (1 − 0.7) = 0.06, so A_YZ = 94%
System availability = 90% × 94% = 84.6%
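Combining the two rules gives a simple RBD evaluator; this sketch (ours, with the helpers repeated so it runs standalone) reproduces the 84.6% figure:

```python
from math import prod

def series(avail):
    # All blocks must work
    return prod(avail)

def parallel(avail):
    # The system fails only if every block fails
    return 1.0 - prod(1.0 - a for a in avail)

a_yz = parallel([0.80, 0.70])    # Y and Z in parallel: 0.94
print(series([0.90, a_yz]))      # X in series with (Y || Z): 0.846
```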

Understanding RBDs
RBDs’ serial/parallel combinations are statements of how components interact to cause system failure
They are not an expression of physical circuit connection
Suppose we have a circuit whose circuit diagram is this

If failure of one sensor causes system failure, the RBD is this:

(The order we draw the three in doesn’t matter; the important thing is that the three are in a series reliability relationship)

[Diagrams: a circuit in which a microcontroller reads Sensor1 and Sensor2; the corresponding RBD places Micro-controller, Sensor1 and Sensor2 in series]

Understanding RBDs
RBDs’ serial/parallel combinations are statements of how components interact to cause system failure
They are not an expression of physical circuit connection
Suppose we have a circuit whose circuit diagram is this

If the system only fails when both sensors fail, then our RBD is this:

[Diagrams: the same circuit; the corresponding RBD places Micro-controller in series with the parallel pair Sensor1 and Sensor2]

Safety Integrity Levels (SILs)
SILs (defined by the International Electrotechnical Commission) are commonly used to measure safety system performance
They are defined for several cases:
Low demand mode
Figures are expressed as probability of dangerous failure per occasion of demand

Demand mode

Level   Reliability     Probability of a failure on demand   Consequence of a failure
SIL 4   > 99.99%        ≥ 10⁻⁵ to < 10⁻⁴                     Catastrophic community impact
SIL 3   99.9%–99.99%    ≥ 10⁻⁴ to < 10⁻³                     Potential for multiple fatalities
SIL 2   99%–99.9%       ≥ 10⁻³ to < 10⁻²                     Potential for major injuries or 1 fatality
SIL 1   90%–99%         ≥ 10⁻² to < 10⁻¹                     Potential for minor injuries
SIL 0   –               –                                    N/A

Safety Integrity Levels (SILs)
SILs (defined by the International Electrotechnical Commission) are commonly used to measure safety system performance
They are defined for several cases:
Continuous mode usage
Figures are expressed as probability of dangerous failure per hour

Continuous mode

Level   Probability of a dangerous failure per hour   Consequence of a failure
SIL 4   ≥ 10⁻⁹ to < 10⁻⁸                              Catastrophic community impact
SIL 3   ≥ 10⁻⁸ to < 10⁻⁷                              Potential for multiple fatalities
SIL 2   ≥ 10⁻⁷ to < 10⁻⁶                              Potential for major injuries or 1 fatality
SIL 1   ≥ 10⁻⁶ to < 10⁻⁵                              Potential for minor injuries
SIL 0   –                                             N/A

Example of Good Practice
Boeing 777 primary flight computer
Fly-by-wire system: the computer directly controls the actuators
Hardware and software use triple modular redundancy
Three different microprocessors (one Intel, one AMD, one Motorola) work out the pitch, yaw and roll control signals
Three different copies of the software are used, compiled with three different compilers
[Images: Boeing 777. Credits: aeroprints.com, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32511778; Alex Pereslavtsev (derivative work: Altair78), GFDL 1.2, https://commons.wikimedia.org/w/index.php?curid=16892729]

Example of Poor Practice
Boeing 737 MAX
Two accidents (October 2018 and March 2019) led to the entire fleet being grounded
One cause was a lack of redundancy on an unreliable sensor
[Image: Boeing 737 MAX. Credit: User:Acefitt, CC BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0]

Boeing 737 MAX: The Problem
https://spectrum.ieee.org/aerospace/aviation/how-the-boeing-737-max-disaster-looks-to-a-software-developer
Better thermodynamic efficiency (fuel economy) comes from larger engines
Ground clearance of the original 737 was limited
The 737 MAX mounted the new engines higher and further forward
The altered handling led to a tendency for the nose to pitch up
A new automated computer system was introduced to push the nose down when the computer believed it necessary, in order to avoid a stall

Boeing 737 MAX: The Sensors
The computer measured the angle of attack using a sensor
The sensor is external, in an extreme environment, and subject to shaking and vibration
There is one sensor on each side, but the controlling computer only takes data from the sensor on one side
No operational redundancy (although there were two sensors)

Summary
Reliability, Availability and Maintainability are key measures for safety-critical systems
Large systems where all components must work (serial reliability) have significantly lower reliability than their constituent components
Redundancy (only some of the components must work: parallel reliability) can give significant reliability gains