Root Cause Analysis Done Right: The 5 Whys & Beyond
The 5 Whys is a powerful tool, but only if you avoid common traps: stopping too early, assigning blame, and ignoring systemic factors. Walk through three real postmortems and learn what a blameless RCA looks like in practice.
Beyond Human Error
Human error is never the root cause; it's the starting point for an investigation. A proper blameless RCA uncovers the systemic factors that allowed the human error to occur in the first place.
The 5 Whys technique — ask 'why' five times until you reach a systemic cause — was developed at Toyota and has been widely adopted in software engineering. It works best for incidents with a clear causal chain. It fails when teams stop at 'someone made a mistake' instead of asking why the system allowed that mistake to have production impact.
Three real postmortems illustrate the difference between surface-level RCA and systemic RCA — and show why the distinction changes your action items entirely.
Real Postmortem 1: The Deployment That Took Down Payments
Incident: A deployment to the payments service caused a 40-minute outage during which no payment could be processed. Surface finding: an engineer deployed a configuration change without testing it in staging. Surface action item: 'Require staging sign-off before production deployments.'
5 Whys applied: Why did the configuration error reach production? Because staging validation was skipped. Why was staging validation skipped? Because the deployment was marked 'low-risk' and staging was optional for low-risk changes. Why was it marked low-risk? Because there was no automated risk classification — engineers self-assessed. Why is there no automated risk classification? Because the deployment pipeline was built before the payments service existed. Why does the deployment pipeline not reflect the current system's criticality? Because it has never been revisited since initial setup.
Systemic action items: (1) Automate risk classification based on which services a deployment touches. (2) Remove the 'skip staging' option for any deployment touching the payments service. (3) Add a quarterly deployment pipeline review to the engineering calendar. These actions prevent the entire class of problem — not just this specific case.
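Action items (1) and (2) can be sketched in a few lines. This is a minimal illustration, not the pipeline from the incident: the `Deployment` type, the `CRITICAL_SERVICES` set, and both function names are assumptions made for the example. Treating a deployment with no declared services as high-risk is also a design choice of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical list of critical services; a real pipeline would
# likely load this from a service catalog instead of hard-coding it.
CRITICAL_SERVICES = frozenset({"payments"})


@dataclass(frozen=True)
class Deployment:
    change_id: str
    touched_services: frozenset[str]


def classify_risk(deployment: Deployment) -> str:
    """Classify risk from the services touched, not from self-assessment."""
    if not deployment.touched_services:
        # Unknown blast radius: fail safe rather than assume low risk.
        return "high"
    if deployment.touched_services & CRITICAL_SERVICES:
        return "high"
    return "low"


def staging_required(deployment: Deployment) -> bool:
    """High-risk deployments cannot skip staging."""
    return classify_risk(deployment) == "high"
```

The point of the sketch is that the engineer no longer chooses the risk label. The label follows from what the change touches, so the "low-risk" shortcut cannot be applied to the payments service by mistake.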
Real Postmortem 2: The Third-Party Outage Nobody Caught for 45 Minutes
Incident: A Stripe partial outage lasted 2.5 hours. The team didn't know about it until 45 minutes in, when customer support volume spiked. Surface finding: nobody was watching Stripe's status page. Surface action item: 'Someone should check vendor status pages during incidents.'
5 Whys applied: Why did the team not know about the Stripe outage for 45 minutes? Because they had no automated monitoring for third-party service status. Why did they have no automated monitoring? Because third-party monitoring wasn't on the team's observability roadmap. Why wasn't it on the roadmap? Because the last major third-party incident was over a year ago and the pain wasn't fresh. Why did the team learn about it from support volume rather than engineering monitoring? Because their alerting only covered internal error rates — which take 15+ minutes to rise above threshold after a third-party outage begins.
Systemic action items: (1) Set up PulsAPI monitoring for all critical third-party dependencies (Stripe, Auth0, AWS, SendGrid). (2) Configure PagerDuty integration for Tier 1 vendor outages. (3) Add a 'third-party status check' as the first step of all incident response runbooks. (4) Review the vendor dependency list quarterly and update monitoring coverage. These actions close the systemic 'third-party observability gap', not just for Stripe but for every dependency.
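The routing rule behind action items (1) and (2) might look like the sketch below. It leaves out fetching vendor status and calling PagerDuty, and it does not use any real PulsAPI or PagerDuty API. The tier assignments, the status vocabulary ("operational", "degraded", "outage"), and all names are assumptions for illustration only.

```python
from dataclasses import dataclass

# Hypothetical tiering of the vendors named above; adjust to your
# own dependency list.
VENDOR_TIERS = {"Stripe": 1, "Auth0": 1, "AWS": 1, "SendGrid": 2}


@dataclass(frozen=True)
class VendorStatus:
    vendor: str
    state: str  # assumed vocabulary: "operational", "degraded", "outage"


def route_alert(status: VendorStatus) -> str:
    """Return 'page', 'notify', or 'ignore' for a vendor status report."""
    if status.state == "operational":
        return "ignore"
    if VENDOR_TIERS.get(status.vendor) == 1:
        # Tier 1 vendor problems page the on-call engineer immediately.
        return "page"
    # Lower tiers and untiered vendors still reach a channel,
    # so nothing is silently dropped.
    return "notify"
```

With a rule like this in place, the team hears about a Tier 1 vendor problem from the vendor's own status signal instead of waiting for internal error rates or support volume to rise.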
Common 5 Whys Failure Modes and How to Avoid Them
Stopping at human error: 'The engineer didn't follow the procedure' is never a root cause. It's a prompt to ask why the procedure wasn't followed, why the procedure didn't prevent the outcome, and why the procedure existed in that form in the first place. Good facilitation pushes past human error every time.
Branching causes: many incidents have multiple contributing factors, not a single causal chain. Use a fishbone (Ishikawa) diagram to map multiple why-chains simultaneously. For a complex incident, you might have three parallel chains: one about tooling, one about process, and one about communication. All three need action items.
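One way to keep parallel why-chains honest is to record each one as data and check that none is left without an action item. The sketch below assumes a simple `WhyChain` structure invented for this example. It does not replace the fishbone diagram itself, only the bookkeeping around it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WhyChain:
    category: str  # e.g. "tooling", "process", "communication"
    whys: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)


def chains_missing_actions(chains: list[WhyChain]) -> list[str]:
    """Return the categories whose why-chain has no action item yet."""
    return [chain.category for chain in chains if not chain.action_items]
```

Running `chains_missing_actions` at the end of the review meeting gives the facilitator a quick list of branches that were explored but never turned into follow-up work.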
Action items without owners: an RCA that ends with a list of improvements but no assigned owners is a historical document, not a change driver. Every action item needs a single owner (a person, not a team), a concrete deliverable, and a deadline. Weekly reviews of open RCA action items, even brief ones in engineering standups, improve follow-through. Use your project management tool to track them alongside feature work, so that reliability improvements stay visible and accountable.
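The three required fields can be enforced at the moment an action item is recorded. The following is a minimal sketch with hypothetical names (`ActionItem`, `open_items_for_review`). In practice a project management tool would hold this data, and the check would be a required-field rule there.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str        # a single named person, not a team
    deliverable: str
    deadline: date
    done: bool = False

    def __post_init__(self) -> None:
        # Reject items that lack an owner or a concrete deliverable.
        if not self.owner.strip():
            raise ValueError(f"action item {self.title!r} has no owner")
        if not self.deliverable.strip():
            raise ValueError(f"action item {self.title!r} has no deliverable")


def open_items_for_review(
    items: list[ActionItem], today: date
) -> list[tuple[ActionItem, bool]]:
    """Open items sorted by deadline, each paired with an 'overdue' flag."""
    open_items = sorted(
        (item for item in items if not item.done),
        key=lambda item: item.deadline,
    )
    return [(item, item.deadline < today) for item in open_items]
```

The output of `open_items_for_review` is what a weekly standup review would read from: the nearest deadlines first, with overdue items flagged.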