Failure Mode and Effects Analysis: A Proactive Risk Assessment Tool
What is Failure Mode and Effects Analysis (FMEA)?
Teams conducting FMEA regularly report 30-50% reductions in unplanned equipment failures.
FMEA is a structured, proactive method for identifying potential ways equipment or processes could fail and assessing the impact of those failures. Unlike reactive troubleshooting that responds to failures after they occur, FMEA anticipates failure modes and evaluates their severity, likelihood, and detectability before they impact production.
- Each potential failure is assigned three numerical ratings that multiply to produce a Risk Priority Number (RPN). This systematic approach forces teams to think critically about failure mechanisms and prioritize corrective actions based on risk, not gut feel or crisis management.
- FMEA originated in the aerospace and automotive industries, where the consequences of failure can be catastrophic: a design flaw could result in loss of life. The discipline helps ensure critical safety and reliability standards are met.
- Today, manufacturers across all sectors use FMEA to prevent downtime, reduce scrap, improve safety, and optimize equipment reliability. Teams that conduct FMEA regularly report 30-50% reductions in unplanned failures.
- FMEA documentation creates institutional knowledge that outlasts individual technicians. This knowledge base becomes a foundation for continuous improvement and focused preventive efforts rather than blanket maintenance.
How to Conduct an FMEA: A Practical Guide for Maintenance Teams
Cross-functional teams catch failure modes that individual specialists miss.
Assemble a cross-functional team: maintenance technicians, operators, engineers, and supervisors who understand the equipment deeply. Diversity of perspective is critical: operators see things technicians miss, and engineers understand design intent. Select a specific piece of equipment or process and systematically identify every way it could fail.
- Define equipment functions clearly, then list every potential failure mode. For a pump, this includes seal failure, cavitation, impeller wear, suction blockage, overheating, bearing seizure, and others. Be comprehensive, because missed failure modes undermine the entire analysis.
- For each failure mode, document what would happen if it occurred (effect on production, safety, customer perception, quality) and assess severity on a 1-10 scale. Severity is usually the hardest rating to change, and it drives action priority.
- Estimate how often this failure occurs (occurrence rating, 1-10). Base the estimate on historical data, or on expert judgment if data isn't available. High occurrence indicates maintenance gaps, design weakness, or harsh operating conditions.
- Evaluate how likely your current detection methods are to catch this failure before it impacts operations (detection rating, 1-10, where a higher number means the failure is harder to detect). Multiply all three numbers to get the RPN. Focus corrective actions on high-RPN items first. After implementing improvements, re-rate and recalculate the RPN to verify progress.
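The rating, ranking, and re-rating steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `FailureMode` class and `rank_by_rpn` helper are hypothetical names, and the pump ratings are invented for the example rather than taken from real equipment data.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of an FMEA worksheet (hypothetical structure)."""
    name: str
    severity: int    # 1-10, higher = worse consequence
    occurrence: int  # 1-10, higher = more frequent
    detection: int   # 1-10, higher = harder to detect

    def __post_init__(self):
        # Reject ratings outside the 1-10 scale used in the text.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be 1-10, got {value}")

    @property
    def rpn(self) -> int:
        # RPN = S x O x D
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative ratings only; a real team would assign these together.
pump_modes = [
    FailureMode("seal failure", severity=7, occurrence=5, detection=4),
    FailureMode("cavitation", severity=6, occurrence=4, detection=7),
    FailureMode("bearing seizure", severity=9, occurrence=3, detection=6),
]

ranked = rank_by_rpn(pump_modes)
for mode in ranked:
    print(f"{mode.name}: RPN {mode.rpn}")

# Re-rate after a corrective action. Here we assume better monitoring
# lowered the detection rating for cavitation from 7 to 3.
cavitation_after = FailureMode("cavitation", severity=6, occurrence=4, detection=3)
print(f"cavitation after improvement: RPN {cavitation_after.rpn}")
```

With these sample ratings, cavitation ranks first (RPN 168), ahead of bearing seizure (162) and seal failure (140). Re-rating after the assumed detection improvement drops cavitation to 72.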
Understanding RPN: Severity, Occurrence, and Detection Ratings
RPNs above 400 are critical. These need immediate corrective action regardless of other priorities.
RPN (Risk Priority Number) combines three independent ratings into a single risk score. Severity measures the consequence of failure, Occurrence rates how likely it is to happen, and Detection assesses how easily it would be caught. RPN = S × O × D produces a number from 1 to 1,000.
- Severity (S) rates the impact of the failure mode: 1 = negligible (equipment still runs), 5 = moderate (significant downtime), 10 = catastrophic (safety hazard, loss of life). High-severity failures demand attention regardless of how unlikely they are.
- Occurrence (O) rates how likely the failure is to happen: 1 = extremely rare (once per decade), 5 = occasional (few times per year), 10 = almost certain (monthly). High occurrence indicates conditions that need immediate attention and control.
- Detection (D) rates how hard it is for your team to catch the failure before it becomes critical, so a higher number means weaker detection: 1 = almost certain to detect, 5 = moderate (requires periodic inspection), 10 = virtually undetectable. Detection depends on monitoring capability and operator awareness.
- RPNs above 400 are critical and need immediate corrective action. RPNs between 200 and 400 are high risk. Even low RPNs warrant review. For example, a moderate failure (S=5, O=4, D=5) yields an RPN of 100, which merits investigation. Thresholds depend on industry, equipment criticality, and organizational risk tolerance.
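As a rough sketch of how these bands might be applied, the function below computes the RPN and sorts it into the thresholds named above. The function name `classify_risk`, the band label "review" for RPNs under 200, and the `severity_alert` cutoff of 9 are assumptions made for illustration. The thresholds are parameters because, as noted, they vary by industry and risk tolerance.

```python
def classify_risk(severity, occurrence, detection,
                  critical_above=400, high_from=200, severity_alert=9):
    """Return (rpn, band, severity_flag) for one failure mode.

    Bands follow the thresholds in the text: above 400 is critical,
    200-400 is high. The "review" label and the severity_alert cutoff
    are assumptions for this sketch, not part of any standard.
    """
    ratings = {"severity": severity, "occurrence": occurrence, "detection": detection}
    for name, value in ratings.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be 1-10, got {value}")

    rpn = severity * occurrence * detection

    if rpn > critical_above:
        band = "critical"
    elif rpn >= high_from:
        band = "high"
    else:
        band = "review"

    # High-severity failures deserve attention even when the RPN is low.
    severity_flag = severity >= severity_alert
    return rpn, band, severity_flag


print(classify_risk(5, 4, 5))   # the moderate example from the text
print(classify_risk(8, 7, 8))   # frequent, severe, and hard to detect
print(classify_risk(10, 2, 3))  # rare but catastrophic
```

The third call shows why RPN should not be read in isolation: a catastrophic but rare failure scores only 60, yet the severity flag still marks it for attention.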




