Defect Analysis Decision Support Flowchart


Logging defects in the CMMS is a key part of the work execution process. We need to get the quality of the information we collect right the first time. Details such as the defect description, failure code, equipment type, correct tag number and the risk that the defect presents to the business are all vital inputs to the downstream planning and execution of the defect repair. This area is often the initial focus of work execution improvement programs and it certainly is a sensible starting point.

However, as you start to capture high-quality information you need to develop a structured process for actually using it. The structured use of this information needs to be included in your maintenance system for two main reasons. First of all, it drives system improvement. We start to use this information to drive improvement in the overall system outputs and guide decisions on which inputs need to be modified.

Second, it provides evidence to the maintenance team who are recording the data that it is being used, and used for a purpose that will ultimately make their working lives and their plant better. It answers the “what’s in it for me?” question and in doing so reinforces good defect logging behavior and increases buy-in to the maintenance process.

Reactive Failure Analysis Process


The process is fairly simple. It is reactive in nature because it only kicks in once a failure or defect has been detected. This means we have often already had to suffer the consequences of the failure, but it’s an opportunity to stop it happening again. The main purpose is to guide the reliability engineer through a decision flowchart that allows them to quickly decide on the best course of action for dealing with the failure.

The purple boxes represent courses of action that should be followed up. The diamonds are questions and the white boxes are process steps.

We start with a failure event. This means that we have had a functional failure of some piece of equipment. From this point we do two things at the same time: we capture the data and we start the data analysis.

If the failure results in a production loss then we log the loss in a production loss management system (PLMS) and relate the loss to the work order. Through the work order, this automatically links the loss to the equipment and the system it is part of, so we can run reports in the future to understand which systems and items of equipment are contributing the highest production losses. These losses are also known as the indirect costs of maintenance.
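As a rough illustration of the kind of report this linkage makes possible, here is a minimal Python sketch. The record layout, field names, tag numbers and figures are all made up for the example; a real PLMS or CMMS export will look different.

```python
from collections import Counter

# Hypothetical loss records. Each loss is related to a work order, which in
# turn identifies the equipment tag and the system it belongs to.
loss_records = [
    {"work_order": "WO-1001", "equipment": "P-101", "system": "Export", "loss_units": 300},
    {"work_order": "WO-1002", "equipment": "C-201", "system": "Compression", "loss_units": 400},
    {"work_order": "WO-1003", "equipment": "P-101", "system": "Export", "loss_units": 200},
]


def total_losses(records, key):
    """Sum production losses per value of `key`, highest total first."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record["loss_units"]
    return totals.most_common()


# The same records can be rolled up by equipment or by system.
print(total_losses(loss_records, "equipment"))
print(total_losses(loss_records, "system"))
```

Because every loss carries its work order, the roll-up works at either level without any extra data entry.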

If no production loss has resulted from the failure (this could be the case where there is redundancy built into the plant) we need to understand if a safety, environment or cost impact has occurred. You can set limits on these for your site. For example, you might classify a cost impact as anything greater than £10,000 of repair costs or greater than a certain number of production units, say 5,000 barrels of oil for an oil production installation.
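The site limits can be written down as a simple check. The sketch below uses the example figures above as thresholds; the class, field and function names are hypothetical, and treating any safety or environmental incident as significant is an assumption you would tune for your own site.

```python
from dataclasses import dataclass

# Example limits taken from the figures in the text; set your own per site.
REPAIR_COST_LIMIT_GBP = 10_000
PRODUCTION_LOSS_LIMIT_BBL = 5_000


@dataclass
class FailureImpact:
    repair_cost_gbp: float = 0.0
    production_loss_bbl: float = 0.0
    safety_incident: bool = False          # assumption: any incident counts
    environmental_incident: bool = False   # assumption: any incident counts


def is_significant(impact: FailureImpact) -> bool:
    """Return True if any impact category exceeds the site limit."""
    return (
        impact.safety_incident
        or impact.environmental_incident
        or impact.repair_cost_gbp > REPAIR_COST_LIMIT_GBP
        or impact.production_loss_bbl > PRODUCTION_LOSS_LIMIT_BBL
    )
```

A failure with £12,000 of repair costs would pass this check and carry on through the flowchart, while one costing £2,000 with no other impact would drop out to the periodic review.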

If there is no significant impact then at this point the process ends. We are concerned at this point with acute failure events. Chronic failures, i.e. the low-impact failures that are either quickly fixed or have very little impact, still need to be investigated, but this will happen as part of a monthly or even quarterly failure review.


We determine the root cause. If it was human factors based, i.e. related to the reliability of people, then the action is to start looking at skills, attitude, competency, operating procedures etc.

The failure could have had both a human factors and a technical root cause, in which case we also need to determine the failure mode, i.e. how it actually failed.

Next we ask, could we have detected this failure? If we couldn’t have detected it then we need to live with the risk that it might happen again or look at engineering design-out options.

If we could have detected the failure then we need to know if we had a preventative or predictive maintenance (PM) routine in place which was designed to detect the failure mode. If we didn’t then we need to consider creating one, provided it makes economic sense to do so and you have the manpower and budget to carry out the PM activity.

If there was a PM in place we need to know if it was given the correct priority. If not we can update the priority to reflect the risk. If the PM was in place, was correctly prioritised but was still not carried out then we need to look at maintenance system improvements.

If the PM was carried out we now ask if it detected an issue and if this issue was recorded and reported into the work execution system. If not then the action is to focus on work order closeout training. This is the process of capturing findings after a work order has been completed and triggering any follow-on actions. We are particularly interested at this point in understanding if the correct risk ranking was allocated to the work order, if it was logged in the CMMS. If it wasn’t then we need to improve the team’s understanding of the risk-based prioritisation tool that is used for ranking defects. If we don’t rank work orders correctly then they won’t be given a suitable level of importance in the planning and scheduling process.

Finally, if the priority was correct but we simply never acted on it we are back to looking at improvements in the maintenance system.
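The questions above can be sketched as a small Python function. This is only one reading of the flow as described in the text: the class, the field names and the action wording are hypothetical, and the flowchart itself remains the reference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Answers:
    """Yes/no answers to the flowchart questions (names are illustrative)."""
    human_factors_cause: bool
    technical_cause: bool
    detectable: bool = False
    pm_in_place: bool = False
    pm_priority_correct: bool = False
    pm_carried_out: bool = False
    issue_recorded: bool = False        # detected and reported into the CMMS
    risk_ranking_correct: bool = False


def recommend_actions(a: Answers) -> List[str]:
    """Walk the questions in order and return the recommended actions."""
    actions = []
    if a.human_factors_cause:
        actions.append("Human factors improvement")
    if not a.technical_cause:
        return actions
    # Technical branch: the first "no" answer decides the action.
    if not a.detectable:
        actions.append("Accept the risk or design out")
    elif not a.pm_in_place:
        actions.append("Consider creating a PM")
    elif not a.pm_priority_correct:
        actions.append("Update PM priority")
    elif not a.pm_carried_out:
        actions.append("Maintenance system improvement")
    elif not a.issue_recorded:
        actions.append("Work order closeout training")
    elif not a.risk_ranking_correct:
        actions.append("Risk-based prioritisation training")
    else:
        # Correctly ranked but never acted on.
        actions.append("Maintenance system improvement")
    return actions
```

For example, a failure with both a human factors and a technical root cause that could not have been detected returns two actions: the human factors improvement and the accept-or-design-out decision.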

It is unlikely that the maintenance team will have the time or resources to take actions to address the root cause of all failures that are run through this process. However, you can add a code to each work order to represent the course of action recommended, and reports can be run at a later date to identify improvement opportunities. Each work order could be coded with a simple code or letter that allows this analysis to be carried out. For example DO – Design Out, HF – Human Factors Improvement, SU – Equipment Strategy Update, MS – Maintenance System Improvement etc.
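As a simple illustration of the reporting side, the sketch below counts how often each action code appears across a set of work orders. The four codes are the examples given above; the work order numbers and the function name are made up.

```python
from collections import Counter

# Example action codes from the text; extend with your own.
ACTION_CODES = {
    "DO": "Design Out",
    "HF": "Human Factors Improvement",
    "SU": "Equipment Strategy Update",
    "MS": "Maintenance System Improvement",
}

# Hypothetical (work order, action code) pairs exported from the CMMS.
coded_work_orders = [
    ("WO-1001", "MS"),
    ("WO-1002", "HF"),
    ("WO-1003", "MS"),
    ("WO-1004", "DO"),
    ("WO-1005", "MS"),
]


def action_code_report(work_orders):
    """Return (code, description, count) rows, most frequent code first."""
    counts = Counter(code for _, code in work_orders)
    return [
        (code, ACTION_CODES.get(code, "Unknown code"), n)
        for code, n in counts.most_common()
    ]


for row in action_code_report(coded_work_orders):
    print(row)
```

A report like this shows at a glance where most of the recommended actions fall, which helps decide where limited improvement effort is best spent.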


This flow chart is designed to add structure to the day-to-day analysis of defects. It’s only by consistently analysing these and taking actions to prevent them happening again that we will really start to see reliability improvements in the plant.

This process can be integrated into my MaintXL System™. This is an overall framework for applying a maintenance excellence program and measuring the results.

I’d love to know what you think of the reactive failure analysis flowchart. If you have any comments or suggestions then please leave a comment below.

