A bearing fails on your bottling line. The technician replaces it in 40 minutes. Two weeks later, the same bearing fails again. Then again a month after that. Each time, you fix the symptom. But the problem keeps coming back because nobody asked the real question: why did this bearing fail in the first place?
That is what Root Cause Analysis is for. RCA is a structured way to trace a problem back to its origin, so you fix the actual cause instead of patching the same failure over and over.
Plants that use RCA consistently report 30-50% fewer repeat failures within the first year. The technique is not complicated. It just requires discipline and a willingness to keep asking "why" past the obvious answer.
What Is Root Cause Analysis?
Root Cause Analysis is a problem-solving method used to identify the underlying reason a failure or defect occurred. Instead of stopping at the immediate cause ("the bearing seized"), RCA pushes deeper to find systemic issues ("the lubrication schedule was never updated when we increased line speed").
RCA originated in the nuclear and aerospace industries in the 1950s, where the cost of repeated failures was catastrophic. It has since become standard practice in manufacturing, healthcare, aviation, and process industries.
The core principle is simple: every failure has a chain of causes. The visible symptom sits at the top. Underneath it, there are contributing factors. At the bottom is the root cause, the underlying factor that, if eliminated, would prevent the failure from recurring. Complex failures can have more than one.
A good RCA answers three questions:
- What happened? The specific failure, defect, or unplanned event.
- Why did it happen? The chain of causes leading to the event.
- What will prevent it from happening again? The corrective action targeting the root cause.
Why Surface-Level Fixes Fail
Most maintenance teams are under pressure to get machines running again as fast as possible. That is understandable. Every minute of downtime costs money. But speed without investigation creates a cycle of repeat failures.
Consider these numbers from a typical plant we worked with:
- 62% of their unplanned downtime came from recurring failures (same machine, same failure mode)
- Average time to repair each occurrence: 35 minutes
- Average number of recurrences before someone investigated the root cause: 4.2 times
That means each recurring failure mode consumed roughly 147 minutes of downtime (35 minutes × 4.2 occurrences) before anyone looked deeper. A single RCA session, typically 30-60 minutes, could have eliminated the problem after the first occurrence.
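The arithmetic behind that figure is easy to rerun with your own numbers. The sketch below is a minimal illustration, not a costing model. The function names are hypothetical, and it assumes every occurrence takes the average repair time.

```python
def downtime_before_rca(repair_minutes: float, recurrences: float) -> float:
    """Downtime spent re-fixing one failure mode before anyone investigated."""
    return repair_minutes * recurrences


def downtime_avoidable(repair_minutes: float, recurrences: float) -> float:
    """Downtime from every occurrence after the first.

    Assumes, as the text does, that an RCA after the first occurrence
    would have prevented all later ones.
    """
    return repair_minutes * max(recurrences - 1, 0)


# Figures from the plant example above.
total = downtime_before_rca(repair_minutes=35, recurrences=4.2)
avoidable = downtime_avoidable(repair_minutes=35, recurrences=4.2)

print(f"Downtime before investigation: {total:.0f} min")
print(f"Avoidable with an RCA after the first failure: {avoidable:.0f} min")
```

With the example figures, roughly 112 of the 147 minutes come from repeat occurrences. That is the downtime a 30-60 minute RCA session is competing against.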
Surface-level fixes treat symptoms. They get the machine running today but guarantee you will be fixing the same thing next week. RCA trades a small investment of time now for a permanent fix.
The 5 Whys Method
The simplest RCA technique is the 5 Whys. You start with the problem and ask "why?" repeatedly until you reach the root cause. It usually takes about five iterations, though sometimes it takes three and sometimes seven. The number is a guideline, not a rule.
5 Whys Example: Pump Failure
Problem: The coolant pump on CNC Machine #7 failed during second shift.
Why #1: Why did the pump fail?
The pump motor overheated and tripped the thermal protection.
Why #2: Why did the motor overheat?
The pump was running with restricted flow, causing the motor to work harder.
Why #3: Why was the flow restricted?
The inlet strainer was clogged with metal chips and coolant sludge.
Why #4: Why was the strainer clogged?
The strainer had not been cleaned in over 6 weeks.
Why #5: Why had the strainer not been cleaned?
Strainer cleaning was not included in the preventive maintenance checklist for this machine.
Root cause: Missing PM task. The strainer cleaning was part of the original PM schedule but was dropped during a checklist revision 8 months ago.
Corrective action: Add strainer inspection and cleaning to the PM checklist with a 2-week frequency. Audit all PM checklists against original equipment manufacturer recommendations.
Notice how the first answer ("the motor overheated") would have led to a motor replacement. The real fix was a 5-minute strainer cleaning added to a PM checklist.
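If you want to capture a 5 Whys in a form that can be stored and searched later, a small record structure is enough. The following is a minimal sketch, assuming a plain in-memory record. The class and field names are hypothetical and not tied to any particular CMMS or tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WhyStep:
    question: str
    answer: str


@dataclass
class FiveWhys:
    problem: str
    steps: list[WhyStep] = field(default_factory=list)
    root_cause: str = ""
    corrective_action: str = ""

    def ask(self, question: str, answer: str) -> "FiveWhys":
        """Record one why/answer pair; returns self so calls can be chained."""
        self.steps.append(WhyStep(question, answer))
        return self

    def report(self) -> str:
        """Render the analysis in the same layout as the worked example."""
        lines = [f"Problem: {self.problem}"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"Why #{number}: {step.question}")
            lines.append(f"  {step.answer}")
        lines.append(f"Root cause: {self.root_cause}")
        lines.append(f"Corrective action: {self.corrective_action}")
        return "\n".join(lines)


# The pump failure example from above, condensed.
rca = FiveWhys(problem="Coolant pump on CNC Machine #7 failed during second shift.")
rca.ask("Why did the pump fail?", "Motor overheated and tripped the thermal protection.")
rca.ask("Why did the motor overheat?", "Pump was running with restricted flow.")
rca.ask("Why was the flow restricted?", "Inlet strainer was clogged with chips and sludge.")
rca.ask("Why was the strainer clogged?", "It had not been cleaned in over 6 weeks.")
rca.ask("Why had it not been cleaned?", "Strainer cleaning was missing from the PM checklist.")
rca.root_cause = "Missing PM task."
rca.corrective_action = "Add strainer inspection and cleaning to the PM checklist every 2 weeks."

print(rca.report())
```

The number of steps is deliberately not fixed at five, since the count is a guideline rather than a rule.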
Tips for Running 5 Whys
- Do it at the machine, not in a conference room. You need to see the failure conditions firsthand.
- Include the operator and the technician who made the repair. They saw things that did not make it into the work order.
- Stick to facts. "The bearing was dry" is a fact. "Someone forgot to lubricate it" is a guess until confirmed.
- If the trail branches (multiple causes), follow each branch separately.
- Write it down. A 5 Whys analysis that stays in someone's head helps nobody.
Fishbone Diagram (Ishikawa)
For more complex problems where the 5 Whys hits a dead end, the fishbone diagram gives you a broader view. Popularized by Kaoru Ishikawa in the 1960s, it organizes potential causes into six categories. This guards against tunnel vision and pushes your team to consider a wider range of contributing factors.
The six standard categories for manufacturing are: Man (people), Machine (equipment), Material (inputs), Method (process), Measurement (instrumentation), and Environment (conditions).
How to Use a Fishbone Diagram
- Write the problem (effect) in the box on the right side of the diagram.
- Draw the main spine and six category branches.
- Brainstorm potential causes in each category. Write each cause as a branch off the relevant category.
- For each potential cause, ask "why?" to add sub-branches. Go 2-3 levels deep.
- Circle the 2-3 most likely root causes based on evidence and data.
- Verify with data. Inspect the machine, check logs, pull sensor readings, interview operators.
The fishbone diagram works best with 3-6 people in the room: the operator, the technician, a process engineer, and a supervisor. One person facilitates and writes. A typical session takes 30-45 minutes.
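A fishbone does not have to live on a whiteboard. The sketch below shows one way to hold the same structure in code, so you can see which categories nobody has looked at yet and which causes have been verified. It is an illustration under stated assumptions: the class is hypothetical, and the example causes are invented for the bearing scenario from the introduction.

```python
CATEGORIES = ("Man", "Machine", "Material", "Method", "Measurement", "Environment")


class Fishbone:
    def __init__(self, effect: str) -> None:
        self.effect = effect
        # One empty branch per standard category.
        self.branches = {category: [] for category in CATEGORIES}

    def add_cause(self, category: str, cause: str, verified: bool = False) -> None:
        """Attach a potential cause to a category branch."""
        if category not in self.branches:
            raise ValueError(f"Unknown category: {category}")
        self.branches[category].append({"cause": cause, "verified": verified})

    def unexplored(self) -> list:
        """Categories with no causes yet: a prompt to keep brainstorming."""
        return [name for name, causes in self.branches.items() if not causes]

    def verified_causes(self) -> list:
        """(category, cause) pairs that have been confirmed with data."""
        return [
            (name, item["cause"])
            for name, causes in self.branches.items()
            for item in causes
            if item["verified"]
        ]


# Hypothetical entries for illustration only.
diagram = Fishbone("Bearing on bottling line fails repeatedly")
diagram.add_cause("Method", "Lubrication schedule not updated after line speed increase", verified=True)
diagram.add_cause("Machine", "Shaft misalignment")
diagram.add_cause("Material", "Wrong grease grade")

print("Still to explore:", diagram.unexplored())
print("Verified:", diagram.verified_causes())
```

The `verified` flag mirrors the last step of the session: a cause stays a hypothesis until someone checks it against the machine, the logs, or the sensor data.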
Fault Tree Analysis
Fault Tree Analysis (FTA) takes a more rigorous approach than the fishbone. Instead of brainstorming possible causes by category, you start with the failure event and work backward through a logical tree of conditions that had to be true for the failure to occur.
FTA uses AND/OR logic gates:
- AND gate: All conditions below this gate must be present for the event above to occur. Example: a fire requires fuel AND ignition AND oxygen.
- OR gate: Any single condition below is enough to cause the event above. Example: a machine stops because of a power failure OR a mechanical jam OR an emergency stop.
FTA is more formal than the 5 Whys or fishbone diagram. It is typically used for high-consequence failures: safety incidents, failures that caused major production losses, or recurring problems that the simpler methods failed to resolve.
The output of FTA is a list of "minimal cut sets." Each one is a smallest combination of basic events that, occurring together, can cause the top-level failure. This tells you exactly where to focus your corrective actions for maximum impact.
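To make the gate logic and cut sets concrete, here is a minimal sketch that computes minimal cut sets for a small tree. It assumes a simple representation: a basic event is a string and a gate is a `(gate_type, children)` tuple. This is a teaching example, not a substitute for dedicated FTA software, and the pump tree at the end is hypothetical.

```python
from itertools import product


def minimize(sets: set) -> set:
    """Drop any cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


def cut_sets(node) -> set:
    """Return the minimal cut sets for a fault tree node."""
    if isinstance(node, str):
        # A basic event is its own single-element cut set.
        return {frozenset([node])}

    gate, children = node
    child_sets = [cut_sets(child) for child in children]

    if gate == "OR":
        # Any one child's cut set is enough to cause the event.
        combined = set().union(*child_sets)
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"Unknown gate: {gate}")

    return minimize(combined)


# The two gate examples from the text.
fire = ("AND", ["fuel", "ignition", "oxygen"])
machine_stops = ("OR", ["power failure", "mechanical jam", "emergency stop"])

# Hypothetical tree: the third branch is redundant and gets minimized away.
pump_damage = ("OR", [
    "power failure",
    ("AND", ["strainer clogged", "thermal protection bypassed"]),
    ("AND", ["strainer clogged", "thermal protection bypassed", "high ambient temperature"]),
])

for name, tree in [("fire", fire), ("machine stops", machine_stops), ("pump damage", pump_damage)]:
    print(name, sorted(sorted(s) for s in cut_sets(tree)))
```

An AND gate yields one cut set containing all of its conditions, while an OR gate yields one cut set per condition. Small cut sets, especially single events, are usually where corrective actions pay off first.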
When to Use RCA
You cannot run a full RCA on every failure. A blown fuse on a light fixture does not need a fishbone diagram. Use RCA when:
- The failure has occurred more than twice. Once could be random. Three times is a pattern.
- Downtime reached 2 hours or more. Any failure that takes your line down for that long deserves investigation.
- Safety was involved. Any near-miss or injury requires a formal RCA. No exceptions.
- The cost is significant. If the failure caused more than $5,000 in lost production, scrap, or repair costs, it is worth 30-60 minutes of structured analysis.
- The repair was unusual. If the technician had to improvise or the fix was not in any manual, that is a knowledge gap worth investigating.
A good rule of thumb: if you track your MTTR (Mean Time to Repair), any failure whose repair time exceeds your plant's average MTTR deserves an RCA.
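These criteria are simple enough to turn into a screening check that runs when a work order is closed. The sketch below is one possible encoding, assuming the thresholds from the list above. The `Failure` fields and the function name are hypothetical, and the example values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    occurrences: int         # times this failure mode has occurred
    downtime_hours: float    # line downtime for this event
    safety_involved: bool    # any near-miss or injury
    cost_usd: float          # lost production + scrap + repair
    improvised_repair: bool  # fix was not in any manual
    repair_minutes: float    # time to repair this event


def rca_triggers(failure: Failure, plant_mttr_minutes: float) -> list:
    """Return the reasons this failure qualifies for an RCA (empty if none)."""
    reasons = []
    if failure.occurrences > 2:
        reasons.append("recurring failure")
    if failure.downtime_hours >= 2:
        reasons.append("downtime of 2 hours or more")
    if failure.safety_involved:
        reasons.append("safety involved")
    if failure.cost_usd > 5000:
        reasons.append("cost above $5,000")
    if failure.improvised_repair:
        reasons.append("unusual repair")
    if failure.repair_minutes > plant_mttr_minutes:
        reasons.append("repair time above plant MTTR")
    return reasons


# Hypothetical events; plant MTTR of 35 minutes is an assumed value.
blown_fuse = Failure(1, 0.1, False, 20.0, False, 6.0)
pump_trip = Failure(3, 2.5, False, 1200.0, False, 150.0)

print(rca_triggers(blown_fuse, plant_mttr_minutes=35))
print(rca_triggers(pump_trip, plant_mttr_minutes=35))
```

Returning the list of reasons, rather than a bare yes or no, gives the RCA team a starting point for why the event was flagged.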
Common RCA Mistakes
RCA is simple in concept but easy to do poorly. Here are the mistakes we see most often:
- Stopping too early. "The operator did not follow the procedure" is almost never the root cause. Why did the operator not follow the procedure? Was the procedure unclear? Was training inadequate? Was the procedure wrong for current conditions? Keep asking why.
- Blaming people instead of systems. If your RCA conclusion is "human error," you have not finished. People make mistakes when systems allow them to. The fix is in the system: better guards, clearer checklists, poka-yoke devices.
- Doing RCA in a conference room. You need to see the machine, touch the failed part, check the actual conditions. Go to the floor.
- No corrective action assigned. An RCA without a specific corrective action, a responsible person, and a due date is just a discussion. The whole point is to prevent recurrence.
- No follow-up. Did the corrective action actually work? Check back in 30 and 90 days to confirm the failure has not recurred. If it has, the RCA missed something.
- Only using one method. The 5 Whys works for straightforward problems. Complex failures with multiple contributing factors need a fishbone diagram or fault tree. Match the tool to the problem.
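Two of these mistakes, missing corrective actions and missing follow-up, are easy to guard against with a simple record. The sketch below shows one way to do it, assuming you track each action with an owner and a due date. The class name, owner, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CorrectiveAction:
    description: str
    owner: str
    due: date

    def is_overdue(self, today: date) -> bool:
        """True once the due date has passed."""
        return today > self.due

    def follow_up_dates(self, completed_on: date) -> list:
        """Dates for the 30-day and 90-day recurrence checks."""
        return [completed_on + timedelta(days=days) for days in (30, 90)]


action = CorrectiveAction(
    description="Add strainer inspection and cleaning to the PM checklist (2-week frequency)",
    owner="Maintenance planner",  # hypothetical role
    due=date(2024, 1, 15),        # hypothetical date
)

for check in action.follow_up_dates(completed_on=date(2024, 1, 1)):
    print("Check for recurrence on", check.isoformat())
```

Because all three fields are required, an action cannot be recorded without a description, a responsible person, and a due date.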
RCA in Your Breakdown Response Process
RCA fits into the final step of any good breakdown response process. After the machine is back up and running, take 30 minutes to run a quick 5 Whys on the failure. This is when the details are freshest and the physical evidence is still available.
Some plants schedule a weekly "RCA review" where the maintenance team picks the top 2-3 failures from the past week and runs a structured analysis on each. This takes about an hour and typically generates 3-5 corrective actions that prevent future downtime.
The key is to make RCA a habit, not a special event. The more your team practices it, the faster they get. An experienced team can run a solid 5 Whys in 10 minutes.
How Dovient Helps
Running an RCA is only useful if the results get captured, shared, and acted on. Dovient connects directly to the RCA workflow in three ways:
- Repair history at your fingertips. When you start an RCA, the first question is: has this happened before? Dovient's diagnostic troubleshooter instantly shows you past failures on the same machine, what was found, and what fixed it. No digging through paper logs.
- Pattern detection across machines. Dovient tracks failure modes across your entire plant. If three different conveyors all had bearing failures in the same month, that pattern shows up automatically. Same supplier? Same lubricant batch? Same installation crew? The data tells you where to look.
- Knowledge capture that sticks. Every RCA conclusion and corrective action gets stored in a searchable format. When a similar failure happens next year, the next technician finds the previous RCA in seconds, not buried in a filing cabinet or lost in a retired engineer's notebook.
If you want to see how Dovient helps maintenance teams run faster, more effective RCAs, schedule a conversation with our team.