Improving System Availability ― Part 2

Mean Times to Failure and Repair

Jan 21, 2026

This post is the second in a series on the use of risk management techniques to improve system reliability and availability in process and energy facilities. The first post is here.

In the first post in this series, Improving System Availability ― Part 1, we noted that both safety and reliability programs are risk based. However, they have different goals and rationales. There can never be an optimum level for safety; any accident that leads to injury or death is unacceptable. However, when it comes to system availability there is an optimum value that can be calculated using cost-benefit analysis.

Many organizations have mature safety programs, yet they continue to experience avoidable production losses, escalating maintenance costs, and declining asset performance. These outcomes are sometimes addressed using the language and tools of safety risk management, even though the underlying problems are fundamentally different. Because both objectives are framed in terms of ‘risk reduction’, safety and reliability are frequently treated as variations of the same problem. It is therefore commonly assumed that improvements in safety will automatically improve reliability, and that reliability initiatives will inherently improve safety. This assumption is misleading, and acting on it can be counterproductive.

The first post described some of the basic parameters in a reliability program: reliability, maintainability, and availability.

In this post we consider failure rates and three of the basic parameters to do with availability analysis.

Failure Rates

The failure rate of an item can be defined in one of two ways: as the fraction of all units in the original population of that item that have failed by time t (the cumulative distribution), or as the probability that a particular item will have failed by time t. Either way, failure rate is a population-based concept. (Many references call this quantity the cumulative failure probability, or unreliability, and reserve the term ‘failure rate’ for the number of failures per unit time.)

The failure rate is represented by F(t), which will generally have a shape such as that shown. It asymptotically approaches a value of 1.0. No matter how well built and maintained an item may be, eventually it will fail. Therefore, replacement or repair strategies must be considered as soon as an item is installed.
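The population-based definition of F(t) can be illustrated with a short Python sketch. The failure times below are hypothetical (ten identical pumps, hours in service) and the function name is an assumption made for illustration; it simply counts the fraction of the original population that has failed by time t.

```python
from bisect import bisect_right


def empirical_failure_fraction(failure_times, t):
    """Fraction of the original population that has failed by time t."""
    ordered = sorted(failure_times)
    # bisect_right counts how many failure times are <= t
    return bisect_right(ordered, t) / len(ordered)


# Hypothetical failure times (operating hours) for ten identical pumps.
failure_times = [1200, 2500, 3100, 4000, 4700, 5200, 6100, 7500, 9000, 12000]

for t in (0, 3000, 6000, 12000):
    print(t, empirical_failure_fraction(failure_times, t))
```

For this made-up data set the fraction rises from 0.0 at installation to 1.0 once the last unit has failed, which is the behavior described above.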

Failure Rate Terms

The following terms are used when calculating failure rates. (Other authorities use different definitions, particularly for MTBF and MTTF, so care should be taken when using data from different sources.)

Mean Time to Failure (MTTF)
The mean of an equipment item’s operating times, i.e., the time from when an item is put into operation to the time when it fails.
Mean Time to Repair (MTTR)
The mean time it takes to repair an equipment item. It is formally defined as the ‘total corrective maintenance time divided by the number of corresponding maintenance actions during a given period of time’.
Mean Downtime (MDT)
MDT and MTTR are often treated as being the same. However, some analysts distinguish between the two. MTTR is just the repair time itself, whereas MDT is the total time needed to bring an item back into service, including the time for shutdown activities such as waiting for technicians to be available, transporting items to and from the work site, and the ordering of spare parts.
Mean Time between Failures (MTBF)
MTBF is the mean of the time between the failures for any particular item. It includes both operating and repair time. Therefore, MTBF = (MTTF + MDT).
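
As a rough illustration of how these four terms relate, the following Python sketch computes them from a hypothetical maintenance history. The record layout (operating hours, hands-on repair hours, and other downtime such as waiting for technicians or parts) and all of the numbers are assumptions made for the example, not data from a real facility.

```python
from statistics import mean

# Hypothetical history for one equipment item. Each tuple is one
# run-fail-restore cycle: (operating_hours, repair_hours, other_downtime_hours)
cycles = [
    (900.0, 6.0, 18.0),
    (1100.0, 4.0, 8.0),
    (1000.0, 5.0, 19.0),
]

mttf = mean(c[0] for c in cycles)        # mean operating time before failure
mttr = mean(c[1] for c in cycles)        # mean hands-on repair time only
mdt = mean(c[1] + c[2] for c in cycles)  # repair plus logistics and waiting
mtbf = mttf + mdt                        # definition used in this post

print(f"MTTF={mttf} h, MTTR={mttr} h, MDT={mdt} h, MTBF={mtbf} h")
```

In this example MTTR is 5 hours but MDT is 20 hours, which shows why it matters which of the two a data source is actually reporting.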

MTBF relates to the other terms as shown in the following sketch.

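Availability can be improved either by reducing MTTR or by increasing MTTF. (Under the definition above, a shorter repair time actually reduces MTBF, so MTBF on its own is not the quantity to maximize.) In practice, and depending on the specific details, reducing MTTR is often found to contribute most to increasing overall availability.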
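
The effect of the two levers on availability can be sketched in Python. The sketch assumes the commonly used steady-state relationship, availability = MTTF / (MTTF + MDT), which is not stated in this post, and the numbers are hypothetical.

```python
def availability(mttf, mdt):
    """Steady-state availability: fraction of time the item is in service."""
    return mttf / (mttf + mdt)


# Hypothetical baseline: 1000 h between failures, 20 h to restore service.
base = availability(1000.0, 20.0)

# Two alternative improvements of the same proportional size.
double_mttf = availability(2000.0, 20.0)  # item runs twice as long
half_mdt = availability(1000.0, 10.0)     # item is restored twice as fast

print(f"baseline     {base:.4f}")
print(f"double MTTF  {double_mttf:.4f}")
print(f"halve MDT    {half_mdt:.4f}")
```

Under this formula availability depends only on the ratio of MTTF to MDT, so halving downtime gives the same result as doubling time to failure. Which of the two is the better investment therefore comes down to cost and practicality, and shortening downtime (for example through spares holding or technician availability) is often the easier change to make.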

Conclusion

The first two posts in this series have described the distinction between safety and availability programs. They have also described some of the basic parameters of availability, reliability, and maintainability.

In the next post we will discuss the concept of effectiveness.

The AI Architect

Jan 22

This is an excellent clarification of the fundamental differences between safety and availability programs. The point that safety has no optimum level while availability can be optimized through cost-benefit analysis is crucial for organizations to understand.

I particularly appreciate your breakdown of MTBF, MTTF, and MTTR. The distinction between MTTR (actual repair time) and MDT (total downtime including logistics) is often overlooked but can have significant implications for availability calculations. Your observation that reducing MTTR often contributes more to increasing overall availability than increasing MTTF is valuable practical wisdom.

The challenge you identify - that organizations frequently conflate safety and reliability objectives, assuming improvements in one automatically enhance the other - is indeed counterproductive. They require different analytical frameworks and investment strategies. Looking forward to your next post on effectiveness!
