The 5 Whys method
This technique involves asking a series of “why” questions until you uncover the root cause: the real “why” beneath all the others. It's a lightweight root-cause analysis method in which a team repeatedly asks “Why?” (usually ~5 times) about a clearly stated problem until it reaches a systemic, actionable cause that, if fixed, would have prevented the problem from occurring.
Why it's valuable
- Fast & low-overhead: No fancy tools required; 30–60 minutes with the right people can reveal the real blocker.
- Builds shared understanding: Cuts through opinions to converge on evidence-backed cause(s).
- Targets leverage points: Ends at a process, policy, environment, or design gap you can change.
Limits to be aware of
Used dogmatically, 5 Whys can oversimplify complex or entangled causes. To avoid this, teams should follow these principles:
- Treat it as a why-tree (branch when needed)
- Require evidence for each why
- Stop when you hit a controllable, testable cause
When to use it
Use 5 Whys when:
- The problem is specific, observable, and ideally measurable.
- You have access to the people and data closest to the work.
- You need speed and a first, credible root cause to drive corrective and preventive actions (CAPA).
Avoid or augment when:
- The issue is multi-factor across teams/systems (use a fishbone diagram first to fan out categories, then 5 Whys on the top suspects).
- You lack trust/psychological safety (address that first or use an async, blameless write-up).
- The “problem” is strategic/market-level (use Impact Mapping, hypothesis tests, or discovery methods instead).
How to prepare a 5 whys session
- Crisp problem statement: what happened, where, when, scope, and impact/metric (e.g., “Checkout error rate spiked from 0.2% to 3.4% on 2025-09-28 between 12:00–14:00 CT; revenue down 7% hour-over-hour.”)
- Evidence at hand: logs, screenshots, event timeline, change list, user reports—enough to verify each “why.”
- The right people: 5–8 cross-functional participants who touch detection, diagnosis, fix, and prevention (facilitator + scribe included).
- Working agreements: blameless approach, one conversation at a time, challenge ideas with data, not people.
- Timebox: 45–60 minutes for the workshop; 15 minutes after for actions.
How to run it
Prep (10–15 min before)
- Grab a copy of the 5 Whys template, or search for the template and add it directly to your existing FigJam board.
- Assign roles
  - Facilitator: keeps it blameless, probes for evidence, manages time.
  - Scribe: captures the chain(s), decisions, owners, due dates.
- Set the frame
  - Write the problem statement and impact metric big and visible.
  - Draw a simple why-tree: problem at the top; each “why” as a node; allow branches when multiple plausible causes appear (see the sketch after this list).
- Collect quick facts
  - Timeline of events, recent changes, alerts, relevant tickets.
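For teams that capture the why-tree digitally, here is a minimal sketch of the structure, reusing the checkout example from the problem-statement section above. The WhyNode class, its field names, and the specific causes and evidence strings are illustrative assumptions, not part of the template.

```python
from dataclasses import dataclass, field

@dataclass
class WhyNode:
    """One node in a why-tree: a cause, its supporting evidence, and deeper whys."""
    statement: str
    evidence: list[str] = field(default_factory=list)       # logs, tickets, interviews
    children: list["WhyNode"] = field(default_factory=list)  # branches to deeper whys

    def add_why(self, statement: str, *evidence: str) -> "WhyNode":
        """Attach a deeper 'why' and return it, so chains read top-down."""
        child = WhyNode(statement, list(evidence))
        self.children.append(child)
        return child

# Problem statement at the top; each answer to "Why?" becomes a child node.
problem = WhyNode("Checkout error rate spiked from 0.2% to 3.4% on 2025-09-28")
w1 = problem.add_why("Payment service returned 500s", "app logs 12:00-14:00 CT")
w2 = w1.add_why("Retry policy disabled by config change", "deploy changelog")
# Branch when a second plausible cause appears; pursue the most likely first.
w1b = w1.add_why("Upstream gateway timeouts increased", "gateway latency metrics")
```

Keeping evidence attached to each node makes the facilitator's “What observation/log shows that?” check a natural part of note-taking rather than an afterthought.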
Facilitate the 5 Whys (40 min)
- Why #1: Proximate cause
  - Ask: “Why does this happen?” or “What causes this to happen?”
  - Require evidence: “What observation/log shows that?”
- Whys #2–#4: Mechanism & enabling conditions
  - For each answer, ask a follow-up “Why?” question.
  - If multiple plausible answers emerge, branch and pursue the most impactful/likely first.
  - Continually test: would fixing this have prevented the problem?
- Why #5 (±2): Systemic cause
  - Stop when you reach a controllable, systemic cause (policy, process, tooling, environment, design, staffing, training, incentives).
  - If you land on “human error,” go one more “why” to reveal the systemic, process, political, or other gap that allowed it.
- Validate
  - Cross-check with data or a quick experiment (“If X were true, would event Y still occur?”).
  - If uncertainty remains, mark a learning action (a small test) rather than guessing.
- Converge on the primary root cause(s)
  - You may end with 1–3 root causes across branches.
  - Link to Value Stream Maps (VSM), User Experience Maps, DDD Event Storming maps, or other artifacts.
Capture potential actions for causes
One simple way to turn root causes into potential next steps for the team is to ideate on Corrective and Preventive Actions (CAPA), as in the sketch after this list:
- Containment: immediate guardrails to limit ongoing impact and risk (feature flag, hotfix, alert).
- Corrective: introduce a change to stop recurrence in the short term (procedure, check, test).
- Preventive: design/process improvement that makes the failure mode impossible or highly unlikely (automation, architecture, training, policy, incentives).
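To make the CAPA follow-up concrete, here is a minimal sketch of how the scribe might record actions with owners and due dates, per the working agreements above. The Action and CapaType classes, the field names, and the example entries (reusing the hypothetical checkout incident) are illustrative assumptions, not part of any CAPA standard.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class CapaType(Enum):
    CONTAINMENT = "containment"  # immediate guardrails to limit impact
    CORRECTIVE = "corrective"    # stop recurrence in the short term
    PREVENTIVE = "preventive"    # make the failure mode unlikely by design

@dataclass
class Action:
    root_cause: str      # the validated root cause this action addresses
    capa_type: CapaType
    description: str
    owner: str
    due: date

# Hypothetical entries for the checkout example used earlier.
actions = [
    Action("Retry policy disabled by config change", CapaType.CONTAINMENT,
           "Feature-flag the affected checkout path off", "payments on-call",
           date(2025, 9, 28)),
    Action("Retry policy disabled by config change", CapaType.PREVENTIVE,
           "Add a CI check that validates retry config before deploy",
           "platform team", date(2025, 10, 15)),
]
```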
Facilitator cheat-sheet phrases
- “What evidence supports that why?”
- “If we fixed this, would the problem still occur?”
- “What system made that outcome likely?”
- “Let’s branch: note both hypotheses, and start with the one with the biggest impact.”
- “We’re not judging people—we’re improving the system.”
Examples
Domain | Problem (metric & impact) | 5 Whys (evidence-led chain) | Root Cause (systemic) | CAPA Examples — Containment / Corrective / Preventive | Validation Metrics |
---|---|---|---|---|---|
Developer Platform | Service onboarding takes 15 days (target ≤5); teams miss milestones and spin up shadow infra. | 1) Onboarding tickets sit idle ~6 days. (Jira cycle-time report) 2) Approvals are manual and batched weekly. (SOP, approver calendar) 3) Approvers fear over-provisioning without standardized roles. (interviews) 4) No codified IAM policy packs; each request needs bespoke review. (repo audit) 5) Platform backlog prioritized features over governance; no OKR for time-to-first-commit. (backlog history, OKR doc) | Missing least-privilege IAM policy packs and self-service provisioning policy, driven by misaligned prioritization. | Containment: Daily approval SLA; temporary pre-approved “starter” roles. Corrective: Ship 3 standardized IAM policy packs (read-only, contributor, admin) with guardrails; add request form validation. Preventive: Terraform module + policy-as-code tests in CI; platform OKR on TTFC; monthly access review. | Median Time-to-First-Commit ≤5 days; ≥80% of onboardings fully self-service; zero shadow infra exceptions in a quarter. |
Dept of Veterans Affairs (VA) | Claims status data >72 hours stale; call center volume +35%; Veteran trust at risk. | 1) Nightly ETL failed two nights. (Airflow logs) 2) Source API was rate-limited (429s). (HTTP logs) 3) Batch window overlapped partner’s peak after their schedule change. (partner notice) 4) No alerting/backoff for 429s; rigid schedule. (observability config) 5) No SLO/clear ownership for data freshness across data eng/integration teams. (RACI, SLO gap) | Absent ownership & SLO for data freshness, leaving rate-limit failures unmonitored and schedules inflexible. | Containment: Manual re-run off-peak; throttle requests. Corrective: 429 alerting; exponential backoff + retry; reschedule to partner off-peak. Preventive: Define Data Freshness SLO (P95 < 6h); RACI ownership; partner-schedule change webhook; migrate critical flows to event-driven ingestion. | Sustain P95 freshness < 6h for 30 days; call volume returns to baseline; 0 unacknowledged 429 episodes per month. |
Dept of State | Passport median processing = 14 weeks (target ≤8); public service delays and backlog growth. | 1) Manual review queue +60%. (queue metrics) 2) Photo auto-screen flag rate 40% (↑). (model logs) 3) New vendor model sensitivity increased. (release notes) 4) Deployed without calibration/shadow testing. (QA plan) 5) No representative, privacy-approved dataset or governance to validate model; no DSA enabling safe testing. (legal/policy audit) | Model governance gap: photo-screening model deployed without calibrated, representative validation due to missing data-governance & shadow-test protocol. | Containment: Lower sensitivity / rollback; add human-in-the-loop triage shifts. Corrective: Create privacy-approved synthetic dataset; run shadow test vs prior model; calibrate threshold. Preventive: Formal ML governance (pre-prod shadowing, bias/error checks, sign-off); vendor DSA; ongoing drift monitoring. | Median processing ≤8 weeks; manual-review rate ≤15%; false-positive photo flags ≤5% sustained. |
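As one concrete illustration of the corrective action in the VA row (exponential backoff and retry on HTTP 429s), here is a minimal Python sketch. The fetch_claims_status function, the retry limits, and the delays are hypothetical; the sketch assumes the requests library is available and that any Retry-After header is expressed in seconds.

```python
import random
import time

import requests  # assumes the requests library is installed

MAX_RETRIES = 5
BASE_DELAY_S = 1.0

def fetch_claims_status(url: str) -> requests.Response:
    """Fetch with exponential backoff and retry on HTTP 429 (rate-limited)."""
    for attempt in range(MAX_RETRIES):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-429 errors immediately
            return resp
        # Honor Retry-After if the partner API provides it (assumed to be in
        # seconds); otherwise back off exponentially, with jitter to avoid
        # synchronized retries across workers.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else BASE_DELAY_S * (2 ** attempt)
        time.sleep(delay + random.uniform(0, 0.5))
    raise RuntimeError(f"Still rate-limited after {MAX_RETRIES} attempts")
```

Pairing the backoff with 429 alerting (the other corrective action in that row) keeps the failure mode visible as well as tolerated.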
Next steps
Once we've captured the root causes of the problems we want to solve, we need to refine and prioritize them.
- Leverage our Problem Statement Framing play to refine each issue into a clear, concise, actor-centric problem statement, aligned to a measurable outcome, that can support organizing your Outcome-oriented Roadmap.
- Leverage a 2x2 matrix or another favorite prioritization technique (e.g., dot voting), and select the root causes that we believe need to be addressed next (see the sketch after this list).
  - You need to determine the axes, and the definitions for each, that best align with your situational decision-making.
  - A basic axis example for validated causes could be x = Value (more or less value if we solve it) and y = Effort (more or less effort required to solve it).
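For teams that want a lightweight, repeatable pass at the 2x2, here is a minimal Python sketch that sorts validated causes into quadrants. The Cause tuple, the 1–5 scales, the quadrant labels, and the example scores (loosely based on root causes from the examples table) are all illustrative assumptions.

```python
from typing import NamedTuple

class Cause(NamedTuple):
    name: str
    value: int   # 1-5: more or less value if we solve it (assumed scale)
    effort: int  # 1-5: more or less effort required to solve it (assumed scale)

def quadrant(c: Cause, midpoint: float = 3.0) -> str:
    """Place a cause in a 2x2 with x = Value and y = Effort."""
    if c.value >= midpoint:
        return "quick win" if c.effort < midpoint else "big bet"
    return "fill-in" if c.effort < midpoint else "avoid"

# Hypothetical scores for root causes like those in the examples table.
causes = [
    Cause("Missing least-privilege IAM policy packs", value=5, effort=3),
    Cause("No SLO or ownership for data freshness", value=4, effort=2),
    Cause("Manual, batched onboarding approvals", value=3, effort=1),
]

# Discuss quick wins first, then big bets; highest value first within a quadrant.
order = {"quick win": 0, "big bet": 1, "fill-in": 2, "avoid": 3}
for c in sorted(causes, key=lambda c: (order[quadrant(c)], -c.value)):
    print(f"{quadrant(c):>9}: {c.name} (value={c.value}, effort={c.effort})")
```

Scoring is a conversation aid, not a formula; use the printed ordering to seed the discussion, not to replace it.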