
What is Reliability-Centered Maintenance (RCM)?

March 3, 2026 | 12 min read | Dovient Learning

Every plant has limited maintenance resources. You cannot afford to maintain every piece of equipment the same way. A critical pump that feeds your entire production line deserves a different maintenance strategy than a workshop fan. Reliability-Centered Maintenance gives you a structured method to decide what maintenance each piece of equipment actually needs based on what it does, how it fails, and what happens when it fails.

RCM was originally developed in the commercial aviation industry in the late 1960s. United Airlines engineers, working with the FAA, created the methodology to determine maintenance requirements for the Boeing 747. The results were striking: they showed that much of the traditional time-based overhaul maintenance being done on aircraft was unnecessary and, in some cases, actually introduced failures. The findings were published in a landmark 1978 report, "Reliability-Centered Maintenance," by F. Stanley Nowlan and Howard Heap.

Since then, RCM has been adopted across power generation, oil and gas, military systems, mining, and manufacturing. The international standard for RCM is SAE JA1011, which defines the minimum criteria an analysis must meet to be called RCM.

The core principle of RCM is this: the maintenance strategy for each asset should be determined by the consequences of its failure, not by a generic schedule. Some equipment needs daily monitoring. Some needs monthly inspection. Some is best left to run until it fails. RCM tells you which is which.

The 7 RCM Questions

Every RCM analysis answers seven questions about each piece of equipment (or more precisely, about each function of each piece of equipment). These questions are asked in order, and each answer feeds into the next.

Question 1: What are the functions of the asset in its current operating context?

Start by defining what the equipment is supposed to do, stated in measurable terms. Not "pump fluid" but "transfer cooling water at 200 liters per minute at 4 bar pressure from the reservoir to the heat exchanger." The operating context matters. A pump in a desert climate, running 24/7 on abrasive slurry, has a different context than an identical pump in an air-conditioned room running clean water for 8 hours a day.

Functions include primary functions (the main reason the equipment exists) and secondary functions (safety, containment, structural integrity, comfort, appearance, regulatory compliance).

Question 2: In what ways can each function fail (functional failures)?

A functional failure is the inability of the asset to perform a function to the standard required by the user. For our pump example:

  • Complete loss of function: pump stops entirely, zero flow
  • Partial loss of function: pump delivers less than 200 liters per minute
  • Exceeding upper limits: pump delivers fluid at more than 6 bar, the upper pressure limit in this example (overpressure)
  • Loss of containment: pump leaks more than 0.5 liters per hour

Notice that functional failures are defined against a performance standard. A pump that delivers 180 L/min when the standard requires 200 L/min has failed functionally, even though it is still running.
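Checking a reading against a performance standard can be sketched in a few lines of Python. The limits are the ones from the pump example above. The `PumpReading` class and the `functional_failures` helper are hypothetical names introduced only for illustration, not part of any RCM standard or tool.

```python
from dataclasses import dataclass


@dataclass
class PumpReading:
    flow_l_min: float    # measured discharge flow
    pressure_bar: float  # measured discharge pressure
    leak_l_hr: float     # measured external leakage


# Performance standards taken from the pump example in the text.
REQUIRED_FLOW_L_MIN = 200.0
MAX_PRESSURE_BAR = 6.0
MAX_LEAK_L_HR = 0.5


def functional_failures(reading: PumpReading) -> list[str]:
    """Return the functional failures present in one reading (hypothetical helper)."""
    failures = []
    if reading.flow_l_min <= 0:
        failures.append("complete loss of function")
    elif reading.flow_l_min < REQUIRED_FLOW_L_MIN:
        failures.append("partial loss of function")
    if reading.pressure_bar > MAX_PRESSURE_BAR:
        failures.append("exceeding upper limits")
    if reading.leak_l_hr > MAX_LEAK_L_HR:
        failures.append("loss of containment")
    return failures


# A pump that is still running at 180 L/min has nevertheless failed functionally.
print(functional_failures(PumpReading(flow_l_min=180, pressure_bar=4.0, leak_l_hr=0.1)))
# ['partial loss of function']
```

The point of the sketch is that "failed" is a comparison against a stated standard, not a judgment about whether the machine is still turning.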

Question 3: What causes each functional failure (failure modes)?

Failure modes are the specific events that lead to each functional failure. For the "pump stops entirely" functional failure, failure modes might include:

  • Impeller seized due to bearing failure
  • Motor winding burned out
  • Coupling shear pin failed
  • Suction line blocked by debris
  • Power supply tripped on overload
  • Mechanical seal failed catastrophically

Be specific enough to identify a meaningful maintenance task. "Pump broke" is not a useful failure mode. "Impeller seized due to bearing failure caused by loss of lubrication" is, because it points directly to a preventable condition.

A thorough RCM analysis on a complex asset might identify 30-60 failure modes. Focus on the ones that are reasonably likely to occur in the operating context. You do not need to analyze meteor strikes.

Question 4: What happens when each failure occurs (failure effects)?

Describe what happens in real, observable terms when each failure mode occurs. Include:

  • What evidence of failure do operators or maintenance staff see? (alarm, noise, leak, product defect)
  • What is the impact on production? (line stops, reduced rate, quality problem)
  • What is the safety or environmental impact?
  • What does it take to repair? (time, parts, skills, cost)

For "impeller seized due to bearing failure": Operator hears grinding noise followed by high-vibration alarm. Pump stops within 2-3 minutes. Cooling water flow to heat exchanger drops to zero. Production line overheats and trips within 15 minutes. Repair requires pump disassembly, bearing and seal replacement, realignment. Typical repair time: 6-8 hours. Parts cost: $1,200. Lost production: approximately $45,000 per occurrence.

Question 5: Why does each failure matter (failure consequences)?

This is the critical question. RCM categorizes failure consequences into four groups, and the category determines the maintenance strategy. This is what makes RCM different from other approaches: the maintenance task must be justified by the consequence of the failure, not by the fact that the failure is possible.

Figure: RCM Failure Consequence Categories (listed from highest to lowest priority)

  • Hidden Failures. Failure is not evident to operators under normal conditions. Typically protective devices. Example: relief valve, backup pump, safety interlock
  • Safety / Environmental. Failure could injure or kill someone, or cause an environmental breach. Example: pressure vessel rupture, toxic gas leak
  • Operational. Failure affects production output, quality, customer service, or operating costs significantly. Example: production line stops, major quality defect
  • Non-Operational. Failure has no safety, environmental, or significant operational impact. Only repair cost. Example: workshop light, non-critical indicator

Key RCM principle: Hidden failures ALWAYS require proactive maintenance (failure-finding tasks) because the failure is not apparent until a second event occurs, which may have catastrophic consequences.

The four consequence categories are:

Hidden failure consequences. These are the most dangerous. A hidden failure is one that is not evident to operators under normal conditions. The classic example is a relief valve. If the valve fails stuck shut, nobody knows until the system overpressures. The valve's failure is hidden, and the consequences of the combined event (valve failure + overpressure) can be catastrophic. RCM always requires proactive maintenance for hidden failures, typically through scheduled failure-finding tasks (testing the relief valve periodically to confirm it works).

Safety and environmental consequences. If a failure mode could kill or injure someone, or cause a significant environmental breach, a proactive maintenance task is mandatory. If no task can reduce the risk to an acceptable level, the equipment must be redesigned.

Operational consequences. The failure affects output, quality, or customer service, or adds significantly to operating costs. A proactive task is justified only when the total cost of the failure (repair + lost production + consequential damage) exceeds the cost of the task. This is where economics drives the decision: the cost of the maintenance task must be less than the cost of the failure.

Non-operational consequences. The failure has no safety, environmental, or operational impact. The only cost is the repair itself. In this case, proactive maintenance is only worth doing if it costs less than the repair on failure. Often, the right answer is to let it run to failure.
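For operational and non-operational consequences the test is economic, and the comparison can be sketched as below. The repair and lost-production figures are taken from the seized-impeller example in Question 4. The failure frequency, the yearly task cost, and the fraction of failures the task prevents are assumed values chosen purely for illustration, and labor cost is left out because the article does not give one.

```python
def annual_failure_cost(failures_per_year: float, repair_cost: float,
                        lost_production: float) -> float:
    """Expected yearly cost of letting the failure happen."""
    return failures_per_year * (repair_cost + lost_production)


def task_is_worth_doing(task_cost_per_year: float, failures_per_year: float,
                        repair_cost: float, lost_production: float,
                        fraction_prevented: float) -> bool:
    """True if the yearly task cost is below the yearly failure cost it avoids."""
    avoided = fraction_prevented * annual_failure_cost(
        failures_per_year, repair_cost, lost_production)
    return task_cost_per_year < avoided


# Repair ($1,200) and lost production ($45,000) come from the bearing example.
# One failure every two years, a $3,000/year task, and 80% effectiveness are assumptions.
print(annual_failure_cost(failures_per_year=0.5, repair_cost=1_200, lost_production=45_000))
# 23100.0
print(task_is_worth_doing(task_cost_per_year=3_000, failures_per_year=0.5,
                          repair_cost=1_200, lost_production=45_000,
                          fraction_prevented=0.8))
# True
```

With no lost production and a cheap repair, the same function returns False for almost any task cost, which is the run-to-failure case.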

Question 6: What can be done to predict or prevent each failure (proactive tasks)?

For each failure mode worth preventing (based on the consequence analysis), RCM evaluates possible proactive maintenance tasks in a specific order of preference:

  1. Condition-based maintenance (on-condition tasks). Monitor the equipment for signs of approaching failure and act when a defined threshold is reached. Examples: vibration monitoring on bearings, oil analysis on gearboxes, infrared scanning of electrical panels. This is the most efficient approach because you only do work when the equipment's condition warrants it.
  2. Scheduled restoration. Overhaul or restore the equipment at fixed intervals regardless of condition. This only works for failure modes that have a predictable wear-out pattern. A component must show increasing failure probability with age for this to be effective. Many failure modes do not show this pattern.
  3. Scheduled discard (replacement). Replace the component at fixed intervals regardless of condition. Same requirement as restoration: there must be an identifiable age at which failure probability increases.
  4. Failure-finding tasks. Periodically test hidden-function equipment (protective devices, standby systems, alarms) to confirm it works. The interval is calculated based on the acceptable probability of a multiple failure.
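One commonly used approximation for the failure-finding interval is FFI = 2 × acceptable unavailability × MTBF of the protective device. It assumes the device fails at a roughly constant rate and that the interval is short compared with its MTBF, so the device is on average unavailable for about half an interval per failure. The sketch below applies that approximation; the function name and the example numbers (a 120-month MTBF and a 1% acceptable unavailability) are assumptions, not values from this article.

```python
def failure_finding_interval(mtbf_protective: float,
                             acceptable_unavailability: float) -> float:
    """Approximate test interval, in the same time unit as the MTBF.

    Uses FFI = 2 * U * MTBF, which follows from an average unavailability
    of roughly interval / (2 * MTBF) for a constant failure rate.
    """
    if mtbf_protective <= 0:
        raise ValueError("MTBF must be positive")
    if not 0 < acceptable_unavailability < 1:
        raise ValueError("unavailability must be between 0 and 1")
    return 2 * acceptable_unavailability * mtbf_protective


# Assumed example: standby device with a 120-month MTBF, 1% acceptable unavailability.
interval_months = failure_finding_interval(mtbf_protective=120,
                                           acceptable_unavailability=0.01)
print(f"Test about every {interval_months:.1f} months")
```

A tighter unavailability target or a less reliable protective device both shorten the interval, which matches the intuition that critical, failure-prone protection needs more frequent testing.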

Question 7: What should be done if no suitable proactive task can be found?

If no proactive task is technically feasible and worth doing, the default action depends on the consequence category:

  • Hidden failures: The equipment must be redesigned to make the failure evident, or a failure-finding task must be implemented
  • Safety/environmental: The equipment must be redesigned to eliminate the failure mode or reduce its consequence
  • Operational: Accept the risk of run-to-failure (but document the decision)
  • Non-operational: Run to failure. No scheduled maintenance. Fix it when it breaks.

The RCM Decision Logic

The seven questions are applied through a structured decision logic. This logic determines which maintenance strategy to apply for each failure mode. The following diagram shows the flow.

Figure: RCM Decision Logic (Simplified)

Starting point: failure mode identified.

  1. Is the failure hidden? YES: schedule a failure-finding task. NO: go to the next question.
  2. Safety/environmental impact? YES: the task MUST reduce risk. If not: redesign. NO: go to the next question.
  3. Significant operational impact? YES: the task must be cost-effective vs. failure cost. NO: non-operational failure, run to failure (if task not justified).

Task selection order:

  1. Condition-based (best)
  2. Scheduled restoration
  3. Scheduled discard
  4. Failure-finding
  5. Redesign
  6. Run to failure

Always try condition-based monitoring first.

The logic always asks: does the consequence justify the task? Hidden and safety failures always require proactive action. Operational failures depend on cost-benefit. Non-operational failures default to run-to-failure unless a proactive task pays for itself.
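The simplified decision logic can be written as a short function. This is a sketch of the simplified flow described in this article only, not of the full decision diagram in any RCM standard. The function name, parameter names, and returned strings are hypothetical.

```python
def select_strategy(hidden: bool,
                    safety_or_environmental: bool,
                    significant_operational: bool,
                    task_reduces_risk: bool = False,
                    task_pays_for_itself: bool = False) -> str:
    """Walk the three questions in order and return the resulting strategy."""
    if hidden:
        # Hidden failures always get a scheduled test of the protective function.
        return "failure-finding task"
    if safety_or_environmental:
        # A task is mandatory; if none reduces the risk enough, redesign.
        return "proactive task" if task_reduces_risk else "redesign"
    if significant_operational:
        # Economics decides: the task must cost less than the failure.
        return "proactive task" if task_pays_for_itself else "run to failure (documented)"
    # Non-operational: only the repair cost is at stake.
    return "proactive task" if task_pays_for_itself else "run to failure"


# Standby pump: nobody sees it fail while it sits idle.
print(select_strategy(hidden=True, safety_or_environmental=False,
                      significant_operational=False))
# failure-finding task

# Mechanical seal leak on clean water, contained in a drip tray.
print(select_strategy(hidden=False, safety_or_environmental=False,
                      significant_operational=False))
# run to failure
```

The order of the checks matters: a hidden or safety-related failure never reaches the cost-benefit branch.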

RCM Analysis Example: Centrifugal Cooling Water Pump

Let us walk through a simplified RCM analysis on a centrifugal pump that supplies cooling water to a production heat exchanger. This pump is one of two (one running, one standby) in a chemical processing plant.

Asset Context

Horizontal centrifugal pump, 30 kW motor, 200 L/min rated capacity, operating 24/7 on clean treated water. Standby pump is available with manual switchover (takes approximately 10 minutes). The production process can tolerate a 15-minute interruption before product quality is affected.

Function 1: Transfer cooling water at 200 L/min at 4 bar

| Functional Failure | Failure Mode | Consequence | Task Selected |
| --- | --- | --- | --- |
| A: No flow (complete loss) | Bearing seizure | Operational (10 min to switch to standby, within tolerance) | Vibration monitoring monthly, grease bearings per schedule |
| A: No flow (complete loss) | Motor winding burnout | Operational (same as above) | Insulation resistance test annually, thermal imaging quarterly |
| A: No flow (complete loss) | Coupling failure | Operational (same as above) | Visual inspection of coupling alignment quarterly |
| B: Reduced flow (<200 L/min) | Impeller wear | Operational (gradual, affects process quality) | Monitor discharge pressure weekly, replace impeller when below threshold |
| B: Reduced flow (<200 L/min) | Suction strainer blockage | Operational | Check differential pressure across strainer monthly, clean when delta-P exceeds 0.5 bar |
| C: External leakage (>0.5 L/hr) | Mechanical seal failure | Non-operational (clean water, no safety/environmental concern, contained in drip tray) | Run to failure. Replace seal when leak exceeds acceptable drip rate. |
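If the worksheet is kept as structured records rather than free text, it becomes easy to query. The sketch below is one possible representation, with the rows transcribed from the pump analysis above and the consequence shortened to its category. The `WorksheetRow` class and `modes_by_consequence` helper are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorksheetRow:
    functional_failure: str
    failure_mode: str
    consequence: str
    task: str


# Rows transcribed from the pump worksheet in the text.
WORKSHEET = [
    WorksheetRow("A: No flow", "Bearing seizure", "Operational",
                 "Vibration monitoring monthly, grease bearings per schedule"),
    WorksheetRow("A: No flow", "Motor winding burnout", "Operational",
                 "Insulation resistance test annually, thermal imaging quarterly"),
    WorksheetRow("A: No flow", "Coupling failure", "Operational",
                 "Visual inspection of coupling alignment quarterly"),
    WorksheetRow("B: Reduced flow", "Impeller wear", "Operational",
                 "Monitor discharge pressure weekly"),
    WorksheetRow("B: Reduced flow", "Suction strainer blockage", "Operational",
                 "Check strainer differential pressure monthly"),
    WorksheetRow("C: External leakage", "Mechanical seal failure", "Non-operational",
                 "Run to failure"),
]


def modes_by_consequence(rows):
    """Group failure modes under their consequence category."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.consequence].append(row.failure_mode)
    return dict(grouped)


for consequence, modes in modes_by_consequence(WORKSHEET).items():
    print(f"{consequence}: {', '.join(modes)}")
```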

Function 2: Contain fluid (secondary function)

Note that the standby pump itself is a hidden function. If the standby pump has failed and nobody knows (because it is not running), the running pump is now a single point of failure. RCM requires a failure-finding task for the standby: test-run the standby pump monthly for 15 minutes and verify it reaches rated flow and pressure.

This is one of RCM's most valuable contributions. Without the systematic analysis, many plants never test their standby equipment and only discover it does not work when they actually need it.

When to Use RCM vs. Simpler Approaches

A full RCM analysis is thorough but time-consuming. A team of 5-6 people (operators, maintenance, engineering) analyzing a complex asset can spend 40-60 hours on a single system. For a plant with thousands of assets, analyzing everything with full RCM is not practical.

Use full RCM analysis for:

  • Critical assets where failure has high consequences (safety, environmental, major production loss)
  • Complex systems with many failure modes and high maintenance costs
  • Assets where existing maintenance strategies are clearly not working (chronic failures despite regular PMs)
  • New or unfamiliar equipment where failure modes are not well understood

Use simpler approaches for less critical equipment:

  • Manufacturer recommendations. Follow the OEM maintenance schedule for standard, low-criticality equipment.
  • Condition monitoring only. For equipment where one or two failure modes dominate, set up condition monitoring without the full analysis.
  • Experience-based maintenance. For simple, well-understood equipment, experienced technicians often know the failure patterns. Document their knowledge and build schedules from it. Capturing tribal knowledge is valuable here.
  • Run to failure. For non-critical, cheap, easy-to-replace items (indicator lights, non-critical filters, convenience features), planned run-to-failure is a valid strategy.

A practical approach is to classify all plant assets by criticality (A, B, C) and apply full RCM to the top 10-15% (A-class assets), simplified RCM to the next 30% (B-class), and manufacturer-recommended or run-to-failure strategies to the rest (C-class).
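The A/B/C split can be sketched as a ranking exercise. The example assumes each asset already has a numeric criticality score (how that score is produced is outside this sketch) and uses 15% and 30% as the class shares. The asset names, scores, and function name are hypothetical.

```python
from collections import Counter


def classify_assets(scores: dict[str, float],
                    a_share: float = 0.15,
                    b_share: float = 0.30) -> dict[str, str]:
    """Rank assets by criticality score and assign class A, B, or C."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    a_count = round(len(ranked) * a_share)
    b_count = round(len(ranked) * b_share)
    classes = {}
    for position, asset in enumerate(ranked):
        if position < a_count:
            classes[asset] = "A"   # full RCM
        elif position < a_count + b_count:
            classes[asset] = "B"   # simplified RCM
        else:
            classes[asset] = "C"   # OEM schedule or run to failure
    return classes


# Hypothetical plant with 20 assets scored 1 (least critical) to 20 (most critical).
scores = {f"asset-{i:02d}": float(i) for i in range(1, 21)}
classes = classify_assets(scores)
print(sorted(Counter(classes.values()).items()))
# [('A', 3), ('B', 6), ('C', 11)]
```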

Key Insights from RCM That Change How You Think About Maintenance

Even if you never do a formal RCM analysis, understanding its principles will improve your maintenance program. Here are the insights that matter most:

1. Not all failures are created equal

A bearing failure on a critical pump and a bearing failure on a workshop grinder are completely different maintenance problems, even if the bearing is identical. The consequence determines the response, not the component.

2. Time-based replacement is often wrong

The Nowlan and Heap study found that only 11% of the aircraft components it examined showed an age-related wear-out pattern, where failure probability increases with age. The other 89% showed no such wear-out age, so their failures were effectively random with respect to age. This means that replacing a component every 6 months "just in case" is wasteful for most failure modes. Condition monitoring (checking for actual signs of deterioration) is more effective.
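A Weibull hazard function is one common way to picture the difference between the two groups. It is used here only as an illustration and is not part of the Nowlan and Heap figures. A shape parameter of 1 gives a constant failure rate (random with respect to age), while a shape parameter above 1 gives a rate that rises with age (wear-out). The parameter values below are assumed.

```python
def weibull_hazard(age: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if age <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("age, shape, and scale must all be positive")
    return (shape / scale) * (age / scale) ** (shape - 1)


# Assumed parameters: characteristic life of 1000 hours for both components.
for age in (100, 500, 900):
    random_rate = weibull_hazard(age, shape=1.0, scale=1000)
    wearout_rate = weibull_hazard(age, shape=3.0, scale=1000)
    print(f"age={age:4d} h  random={random_rate:.5f}  wear-out={wearout_rate:.5f}")
```

For the constant-rate component, a new part is no less likely to fail in the next hour than the old one, so a scheduled replacement buys nothing. Only the wear-out component benefits from an age limit.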

3. Hidden failures are the most dangerous

Equipment that sits idle until needed (backup systems, safety devices, emergency generators) must be tested regularly. If you have a backup pump that you have not run in 6 months, you do not know if it works. Test it.

4. Run-to-failure is sometimes the right answer

RCM gives you permission to stop maintaining things that do not need it. If a failure has no safety, environmental, or significant operational consequence, and the repair is cheap and quick, there is no business case for preventive maintenance. Let it run until it fails, then fix it. This frees up maintenance resources for the equipment that actually needs attention.

5. Maintenance preserves function, not equipment

The goal of maintenance is not to keep machines looking new. It is to keep them performing their required functions. This distinction matters when deciding what to maintain and to what standard.

Common Mistakes in RCM Implementation

  • Analyzing too many assets. Trying to do full RCM on every pump, motor, and valve in the plant. Focus on the critical few.
  • Not having operators on the team. Engineers and maintenance planners cannot do RCM alone. Operators know how the equipment actually behaves in service, not just how it is supposed to behave.
  • Skipping the operating context. The same pump in two different services can have completely different failure modes and consequences. Always define the context first.
  • Writing tasks without specifying intervals and criteria. "Check bearing condition" is not a useful task. "Measure bearing vibration with handheld analyzer at points A, B, C monthly. Alert threshold: 4.5 mm/s. Action threshold: 7.1 mm/s" is a useful task.
  • Not reviewing results. An RCM analysis is a living document. As you accumulate failure data, some task intervals will need adjustment. Review the analysis every 2-3 years or when failure patterns change.
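A task written with explicit criteria, like the vibration example in the list above, translates directly into a check that leaves no room for interpretation. The sketch uses the 4.5 mm/s and 7.1 mm/s thresholds from that example and assumes a reading exactly at a threshold counts as reaching it. The function name and sample readings are hypothetical.

```python
ALERT_MM_S = 4.5   # alert threshold from the example task
ACTION_MM_S = 7.1  # action threshold from the example task


def assess_vibration(readings_mm_s: dict[str, float]) -> dict[str, str]:
    """Classify each measurement point as 'ok', 'alert', or 'action'."""
    status = {}
    for point, value in readings_mm_s.items():
        if value >= ACTION_MM_S:
            status[point] = "action"
        elif value >= ALERT_MM_S:
            status[point] = "alert"
        else:
            status[point] = "ok"
    return status


# Hypothetical monthly readings at points A, B, C.
print(assess_vibration({"A": 2.8, "B": 5.0, "C": 7.4}))
# {'A': 'ok', 'B': 'alert', 'C': 'action'}
```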

RCM and OEE

RCM and OEE complement each other. OEE tells you where your equipment losses are. RCM tells you the most effective maintenance strategy to reduce those losses. If your OEE data shows that unplanned breakdowns on a specific machine are your biggest Availability loss, an RCM analysis on that machine will identify the failure modes driving those breakdowns and the best tasks to prevent them.
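Picking the first RCM candidate from breakdown data can be as simple as totaling unplanned downtime per machine. The log below is invented for illustration and is not real plant data. The machine names and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical unplanned-downtime log: (machine, minutes lost).
DOWNTIME_LOG = [
    ("filler-1", 120), ("capper-2", 45), ("filler-1", 200),
    ("labeler-3", 30), ("capper-2", 60), ("filler-1", 95),
]


def rank_by_unplanned_downtime(log):
    """Total the minutes lost per machine, largest first."""
    totals = Counter()
    for machine, minutes in log:
        totals[machine] += minutes
    return totals.most_common()


print(rank_by_unplanned_downtime(DOWNTIME_LOG))
# [('filler-1', 415), ('capper-2', 105), ('labeler-3', 30)]
```

The machine at the top of the list is where an RCM analysis is most likely to pay back first.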

Similarly, if your maintenance backlog is growing and you suspect you are doing unnecessary PMs, RCM can help you eliminate tasks that are not justified by the actual failure consequences. Plants that apply RCM typically reduce their total number of PM tasks by 40-60% while improving reliability. They do less maintenance, but more of the right maintenance.

Where Dovient Fits

RCM analysis generates detailed information about failure modes, maintenance tasks, and decision rationale. This information needs to live somewhere accessible, not in a binder on a shelf.

Dovient supports RCM programs by:

  • Storing failure mode knowledge. The failure modes, effects, and recommended tasks from your RCM analysis become part of your maintenance knowledge base. When a technician encounters a problem, they can search for the failure mode and find the recommended diagnostic and repair approach.
  • Connecting symptoms to root causes. Dovient's AI diagnostic tools use your plant's repair history to connect observed symptoms with likely failure modes. This is RCM knowledge put to practical use at the point of repair.
  • Tracking failure patterns over time. RCM analyses need to be reviewed as failure data accumulates. Dovient tracks repair patterns and failure frequencies, giving you the data to validate or adjust your RCM task intervals.

If you are running or considering an RCM program and want to see how Dovient can support it, schedule a conversation with our team.

