Consider common causes of failure in the process industries, such as pipes becoming corroded and fracturing, valves sticking, or other physical problems with individual components. The safety of the whole system depends on the reliability of these individual components, so systems are designed to eliminate (so far as practicable) single points of failure where one component could fail and cause an accident.
The dual pumps in Figure 1 demonstrate this principle of eliminating single points of failure. Running both at lower speeds, for example, reduces both mechanical stress and noise. But the key feature of a setup like this is that even where it involves precisely duplicated components, the exact timing of any likely failure is sufficiently unpredictable (“random failure”) that the chance of both failing at the same time is extremely small.
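The arithmetic behind that claim can be sketched in a few lines. The failure probability below is an invented figure purely for illustration, and the pumps are assumed to fail independently:

```python
# Assumed per-hour failure probability, invented for illustration
p_single = 1e-4            # chance a single pump fails in a given hour

# Independent random failures: both pumps must fail in the same hour,
# so the probabilities multiply
p_both = p_single * p_single

print(f"One pump:   {p_single:.0e}")
print(f"Both pumps: {p_both:.0e}")
```

With these illustrative numbers, duplication improves the failure rate by four orders of magnitude, which is why the independence assumption matters so much.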
Figure 1: A pumping system designed using the principle of dual modular redundancy. (Source: LDRA)
To ensure that failure rates are within specification, a process of hazard analysis attempts to identify all the dangerous states the system could get into, and fault trees are drawn up to analyse the circumstances that could lead to each dangerous occurrence. Such a process is built on an underlying assumption that the system design of duplicated components or assemblies is sound. If that is not the case, the resulting failures, known as “systematic failures,” will occur every time the particular circumstances that trigger them co-exist.
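A fault tree for the dual-pump system might place an OR gate (either a motor or a valve fault disables one pump) under an AND gate (both pumps must fail for flow to stop). A minimal sketch, with invented probabilities and assuming independent events:

```python
def or_gate(*probs):
    # P(at least one of several independent events occurs)
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(*probs):
    # P(all of several independent events occur)
    result = 1.0
    for p in probs:
        result *= p
    return result

# Hypothetical per-demand failure probabilities, for illustration only
p_motor, p_valve = 1e-3, 5e-4

p_pump = or_gate(p_motor, p_valve)     # one pump fails: motor OR valve
p_no_flow = and_gate(p_pump, p_pump)   # dangerous event: both pumps fail

print(f"Single pump failure: {p_pump:.2e}")
print(f"Total loss of flow:  {p_no_flow:.2e}")
```

Real fault-tree tools handle common-cause factors and non-independent events; this sketch only shows the basic gate arithmetic.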
Software does not age in the way physical components do. Software failures are, in general, the result of software faults that were created when the system was specified, designed and built, and these faults will cause a failure whenever certain circumstances occur. They too are systematic failures, and in a world where modern cars can contain 100 million lines of software, it would be foolhardy to assume they cannot happen.
Redundancy in software
Most functional safety process standards reference some form of “V” model. Consider the example shown in Figure 2 from EN/ISO 13849:2015, “Safety of machinery — Safety-related parts of control systems.”
Figure 2: V-model software development lifecycle model from ISO 13849:2015, “Safety of machinery – Safety related parts of control systems”. (Source: ISO 13849:2015 via LDRA)
Suppose that the simple system shown in Figure 1 is to be upgraded and applied to a situation where each pump needs to have its own custom-built, safety critical, embedded controller developed in accordance with ISO 13849.
If the same controller is applied to both pumps, then the robust dual redundancy protection afforded in the original system will be compromised. If there is a bug in the software that causes a pump to malfunction under a particular set of circumstances, then both pumps will fail at the same time.
If that is unacceptable, then the risk can be minimized by deploying two separate design teams to develop the system from requirements through to completion, using different hardware, different operating systems, and different high-level languages.
How safe is safe enough?
Developing the system in this way might be commercially justifiable, but in most cases it is not. A compromise requires a definition of what represents a system that is “safe enough.”
One approach is to use applied statistics in the form of Failure Modes and Effects Analysis (FMEA). This involves considering each component and subsystem, and then analysing how each might fail and what would happen if it did. A refinement to that approach, Failure Mode Effects and Criticality Analysis (FMECA), brings the criticality of that failure into the calculation.
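The core of FMECA ranking can be sketched as a risk priority number (RPN) calculation, the product of severity, occurrence and detection scores. The failure modes and scores below are invented for illustration:

```python
# Each failure mode scored 1-10 on severity, occurrence, and
# detection difficulty; all entries are hypothetical examples
failure_modes = [
    # (description,                              sev, occ, det)
    ("Shared controller bug stops both pumps",    10,   3,   8),
    ("Bearing wear stops one pump",                4,   6,   3),
    ("Sensor drift misreports flow",               6,   4,   5),
]

# Rank by RPN = severity x occurrence x detection, highest first
ranked = sorted(failure_modes, key=lambda m: m[1] * m[2] * m[3], reverse=True)

for desc, sev, occ, det in ranked:
    print(f"RPN {sev * occ * det:4d}  {desc}")
```

Note how the shared-controller failure mode tops the ranking despite a modest occurrence score: its severity (both pumps lost) and poor detectability dominate, which is exactly the effect criticality analysis is meant to surface.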
A failure mode that applies to both of our example pump controllers will be considerably more critical than one that applies to only one, a fact that would be reflected in any criticality analysis. Another consideration might be that common processors or RTOSes proven over many years of varied service are likely to be less risky than newly developed, bespoke code.
With reference to the V-model in Figure 2, the development lifecycle of that new code can be further dissected. For example, one approach to reducing risk while limiting expenditure might be to use a common design, but to develop two control system software packages based on it. In this scenario, the risk of a shared design flaw remains, but the likelihood of a shared software bug or coding error is considerably reduced.
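One common way of combining two such diverse implementations, sketched here with invented names and thresholds, is to run both and compare their outputs, treating any disagreement as a trigger for a safe state:

```python
# Shared design parameter, hypothetical: run at half speed on low demand
LOW_DEMAND = 50.0   # litres/min

def controller_a(demand):
    # First team's implementation of the shared design
    return 0.5 if demand < LOW_DEMAND else 1.0

def controller_b(demand):
    # Second team's independent implementation of the same design
    if demand >= LOW_DEMAND:
        return 1.0
    return 0.5

def commanded_speed(demand):
    # Comparator: a bug in either implementation that changes its
    # output is caught as a disagreement rather than acted upon
    a, b = controller_a(demand), controller_b(demand)
    if a != b:
        raise RuntimeError("controllers disagree - enter safe state")
    return a

print(commanded_speed(30.0))
print(commanded_speed(80.0))
```

A comparator of this kind catches divergent coding errors, but, as the article notes, it cannot catch a flaw in the shared design itself, since both implementations would faithfully reproduce it.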
Would it then be “safe enough?” In general, that is a social question rather than an engineering one, but a common approach to quantifying it leverages the value placed on a statistical life or VSL. This is defined as the additional cost that individuals would be willing to bear for improvements in safety (that is, reductions in risks) that, in the aggregate, reduce the expected number of fatalities by one.
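A back-of-envelope VSL calculation might weigh the expected reduction in fatalities across a deployed fleet against the cost of a safety improvement. Every figure below is invented for illustration:

```python
# All values hypothetical, for illustration only
vsl = 10_000_000.0              # value of a statistical life, dollars
fatality_risk_reduction = 2e-7  # per unit per year, from the upgrade
units_deployed = 50_000
lifetime_years = 10

# Aggregate expected reduction in fatalities over the fleet's lifetime
lives_saved = fatality_risk_reduction * units_deployed * lifetime_years

# Spending below this figure is arguably justified on VSL grounds
justified_spend = lives_saved * vsl

print(f"Expected statistical lives saved: {lives_saved:.2f}")
print(f"Spend justified by VSL:          ${justified_spend:,.0f}")
```

With these illustrative numbers, an upgrade costing less than the justified spend would pass the VSL test, while a more expensive one would need some other justification.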
Other compromises might be to use two different POSIX-compliant RTOSes, which would allow the same code to be executed on either; the same code running on different hardware; different compilers… The possibilities, especially in combination, are almost endless.
Mark Pitchford is a technical specialist with LDRA Software Technology. He has over 30 years’ experience in software development for engineering applications and has worked on many significant industrial and commercial projects in development and management, both in the UK and internationally. Since 2001, he has worked with development teams looking to achieve compliant software development in safety- and security-critical environments, working with standards such as DO-178, IEC 61508, ISO 26262, IIRA and RAMI 4.0. Mark earned his Bachelor of Science degree at Nottingham Trent University and has been a Chartered Engineer for over 20 years.
The post Software quality: Balancing risk and cost appeared first on Embedded.com.