{"kind":"Skill","metadata":{"namespace":"community","name":"thinking-five-whys-plus","version":"0.1.0"},"spec":{"description":"Enhanced root cause analysis with explicit bias guards and stopping criteria. Use for incident post-mortems, bug investigations, and process failures where standard 5 Whys might mislead.","files":{"SKILL.md":"---\nname: thinking-five-whys-plus\ndescription: Enhanced root cause analysis with explicit bias guards and stopping criteria. Use for incident post-mortems, bug investigations, and process failures where standard 5 Whys might mislead.\n---\n\n# Five Whys Plus\n\n## Overview\n\nThe Five Whys technique from Toyota Production System is powerful but often misapplied. This enhanced version adds explicit guards against common failures: premature stopping, single-cause bias, blame-oriented thinking, and confirmation bias. It transforms a simple technique into a rigorous root cause methodology.\n\n**Core Principle:** Keep asking \"why\" until you reach actionable root causes, but guard against the technique's known failure modes.\n\n## When to Use\n\n- Incident post-mortems\n- Bug investigations\n- Process failures\n- Customer complaints\n- Recurring problems\n- Any situation where you need root cause, not just proximate cause\n\nDecision flow:\n\n```\nProblem occurred?\n  → Is the cause obvious and verified? → yes → Fix directly\n  → Need to find root cause? → yes → APPLY FIVE WHYS PLUS\n  → Is this a complex multi-factor problem? 
→ yes → Consider Kepner-Tregoe PA\n```\n\n## Standard Five Whys Failure Modes\n\n| Failure Mode | Description | Guard |\n|--------------|-------------|-------|\n| Premature stopping | Accepting first plausible cause | Minimum depth + actionability test |\n| Single-cause bias | Assuming one root cause | Branch on \"what else?\" |\n| Blame orientation | Stopping at human error | \"Why was the error possible?\" |\n| Confirmation bias | Finding expected cause | Devil's advocate review |\n| Circular reasoning | Why loops back on itself | Detect and break cycles |\n| Speculation depth | Going beyond evidence | Evidence requirement |\n\n## The Five Whys Plus Process\n\n### Step 1: State the Problem Precisely\n\nBad: \"The system was slow\"\nGood: \"API response times exceeded 2 seconds for 30% of requests between 14:00 and 14:45 UTC on January 15\"\n\n```markdown\nProblem Statement:\n- What happened: [Specific observable symptom]\n- When: [Time range]\n- Where: [Affected systems/users]\n- Extent: [Scope and severity]\n- Impact: [Business/user impact]\n```\n\n### Step 2: Apply \"Why\" with Evidence Requirement\n\nFor each \"why,\" require evidence:\n\n```markdown\nWhy #1: Why did [problem] occur?\nAnswer: [Hypothesis]\nEvidence: [Data, logs, metrics that support this]\nConfidence: [High/Medium/Low]\n```\n\n**Evidence types:**\n- Logs showing the event\n- Metrics correlating with the timeline\n- Code showing the behavior\n- Configuration proving the state\n- Testimony from multiple sources\n\n### Step 3: Branch on \"What Else?\"\n\nAfter each \"why,\" explicitly ask \"what else could cause this?\"\n\n```markdown\nWhy #1: Why did API response times spike?\nPrimary answer: Database queries were slow\nEvidence: DB query times increased from 50ms to 1.5s\n\nWhat else could cause this?\n- [x] Network latency (checked: normal)\n- [x] Application code changes (checked: none deployed)\n- [x] Memory pressure (checked: normal)\n- [x] External API dependencies (checked: normal)\n\n→ Proceeding 
with database queries as verified cause\n```\n\n### Step 4: Apply \"Why Was This Possible?\" for Human Error\n\nNever stop at \"human error\" or \"someone made a mistake.\"\n\n```\nBAD chain:\nWhy did the outage occur? → Config was wrong\nWhy was config wrong? → Engineer made a typo\n→ STOP (blames human)\n\nGOOD chain:\nWhy did the outage occur? → Config was wrong\nWhy was config wrong? → Engineer made a typo\nWhy was a typo possible? → No validation on config changes\nWhy was there no validation? → Config system doesn't support schemas\nWhy doesn't it support schemas? → Tech debt, never prioritized\n→ ROOT CAUSE: Config validation infrastructure gap\n```\n\n### Step 5: Check Stopping Criteria\n\nOnly stop when ALL are true:\n\n| Criterion | Question | ✓ |\n|-----------|----------|---|\n| Actionable | Can we take concrete action on this cause? | |\n| Controllable | Is this within our control to fix? | |\n| Fundamental | Would fixing this prevent recurrence? | |\n| Evidenced | Do we have evidence, not just speculation? | |\n| Not-blame | Is this a system issue, not just \"someone messed up\"? | |\n\n### Step 6: Verify with Counter-Analysis\n\nBefore finalizing, apply devil's advocate:\n\n```markdown\nProposed root cause: [X]\n\nCounter-analysis:\n1. What evidence contradicts this conclusion?\n2. What other explanation fits the evidence?\n3. Would someone with a different perspective agree?\n4. If we fix X, are we confident the problem won't recur?\n5. Are we finding what we expected to find? 
(confirmation bias check)\n```\n\n## Enhanced Template\n\n```markdown\n# Five Whys Plus Analysis\n\n## Problem Statement\n- **What:** [Specific symptom]\n- **When:** [Time range]\n- **Where:** [Affected scope]\n- **Impact:** [Severity and consequences]\n\n## Why Chain\n\n### Why #1: Why did [problem] occur?\n**Answer:**\n**Evidence:**\n**Confidence:** High / Medium / Low\n**What else considered:**\n**Ruled out because:**\n\n### Why #2: Why did [answer #1] occur?\n**Answer:**\n**Evidence:**\n**Confidence:**\n**What else considered:**\n**Ruled out because:**\n\n### Why #3: Why did [answer #2] occur?\n**Answer:**\n**Evidence:**\n**Confidence:**\n**What else considered:**\n**Ruled out because:**\n\n[Continue as needed...]\n\n## Stopping Criteria Check\n- [ ] Actionable: We can take concrete action\n- [ ] Controllable: Within our control\n- [ ] Fundamental: Prevents recurrence\n- [ ] Evidenced: Supported by data\n- [ ] System-focused: Not blaming individuals\n\n## Counter-Analysis\n**Contradicting evidence:**\n**Alternative explanations:**\n**Confirmation bias check:**\n**Confidence in conclusion:**\n\n## Root Causes Identified\n1. [Primary root cause]\n2. 
[Contributing factor if applicable]\n\n## Recommended Actions\n| Action | Addresses | Owner | Timeline |\n|--------|-----------|-------|----------|\n| | | | |\n\n## Verification Plan\nHow will we know the fix worked?\n```\n\n## Example: Production Outage\n\n```markdown\n# Five Whys Plus: Payment Service Outage\n\n## Problem Statement\n- What: Payment service returned 500 errors\n- When: 2024-01-15 14:00-14:45 UTC\n- Where: Production, US-East region\n- Impact: 2,400 failed transactions, ~$180K revenue impact\n\n## Why Chain\n\n### Why #1: Why did the payment service return 500 errors?\n**Answer:** Database connection pool exhausted\n**Evidence:** Connection pool metrics showed 100/100 in use, logs showed \"connection wait timeout\"\n**Confidence:** High\n**What else considered:**\n- Application bugs (no recent deploys)\n- Memory issues (heap normal)\n- Network problems (latency normal)\n\n### Why #2: Why was the connection pool exhausted?\n**Answer:** Queries taking 10x longer than normal\n**Evidence:** P99 query time went from 50ms to 500ms at 14:00\n**Confidence:** High\n**What else considered:**\n- Connection leak (connection count stable before incident)\n- Sudden traffic spike (traffic was normal)\n\n### Why #3: Why were queries taking 10x longer?\n**Answer:** Missing index on payment_status table\n**Evidence:** EXPLAIN shows sequential scan on 10M row table\n**Confidence:** High\n**What else considered:**\n- Lock contention (no blocking locks)\n- DB resource exhaustion (CPU/memory normal)\n\n### Why #4: Why was the index missing?\n**Answer:** Migration to add the index was rolled back 2 weeks ago\n**Evidence:** Deployment logs show rollback on 2024-01-01\n**Confidence:** High\n\n### Why #5: Why was the migration rolled back?\n**Answer:** Migration timed out during the deploy window\n**Evidence:** Deploy log shows \"migration timeout after 30 minutes\"\n\n### Why #6: Why did the migration time out?\n**Answer:** Table too large for online migration in the current window\n**Evidence:** 
Table has 10M rows, online migration takes ~2 hours\n**Confidence:** High\n\n### Why #7 (System-level): Why wasn't this caught before impact?\n**Answer:** No alerting on query performance degradation\n**Evidence:** No alerts fired until connection pool exhausted\n\n## Stopping Criteria Check\n- [x] Actionable: Can add index, fix alerting\n- [x] Controllable: Within our control\n- [x] Fundamental: Index prevents query issue, alerting prevents impact\n- [x] Evidenced: All steps have supporting data\n- [x] System-focused: Process and tooling issues, not blame\n\n## Root Causes Identified\n1. **Primary:** Index migration process doesn't handle large tables\n2. **Contributing:** No alerting on query latency before connection exhaustion\n\n## Recommended Actions\n| Action | Addresses | Owner | Timeline |\n|--------|-----------|-------|----------|\n| Implement online index creation tool | Root cause 1 | Platform | 2 weeks |\n| Add query latency alerting | Root cause 2 | SRE | 1 week |\n| Create index during maintenance window | Immediate fix | DBA | Tonight |\n```\n\n## Common Patterns to Catch\n\n### The Blame Stop\n\n```\nBAD: \"Why did it fail?\" → \"Engineer didn't test properly\" → STOP\n\nBETTER: → \"Why was it possible to deploy without proper testing?\"\n        → \"Why doesn't the pipeline enforce testing?\"\n        → System/process root cause\n```\n\n### The Premature Technical Stop\n\n```\nBAD: \"Why was it slow?\" → \"Query was inefficient\" → STOP\n\nBETTER: → \"Why was an inefficient query in production?\"\n        → \"Why didn't code review catch it?\"\n        → \"Why don't we have query performance testing?\"\n```\n\n### The Circular Why\n\n```\nDETECT: \"Why A?\" → \"Because B\" → \"Why B?\" → \"Because A\"\n\nBREAK: Introduce external evidence or third factor\n```\n\n### The Speculation Dive\n\n```\nDETECT: Answers become increasingly speculative without evidence\n\nBREAK: \"What evidence do we have for this?\"\n       If none, mark as hypothesis and 
seek evidence\n```\n\n## Verification Checklist\n\n- [ ] Problem stated with specific details (what, when, where, extent)\n- [ ] Each \"why\" has supporting evidence\n- [ ] \"What else?\" asked at each branch point\n- [ ] Didn't stop at human error—asked \"why was error possible?\"\n- [ ] Stopping criteria all satisfied\n- [ ] Counter-analysis performed\n- [ ] Root cause is actionable and controllable\n- [ ] Actions address root cause, not just symptoms\n\n## Key Questions\n\n- \"What evidence supports this answer?\"\n- \"What else could explain this?\"\n- \"Why was this mistake/error/failure possible?\"\n- \"If we stop here, will the problem actually be prevented?\"\n- \"Are we finding what we expected, or what the evidence shows?\"\n- \"Would someone outside our team reach the same conclusion?\"\n\n## Ohno's Wisdom (Extended)\n\nTaiichi Ohno said: \"By asking 'why' five times and answering each time, we can get to the real cause of the problem.\"\n\nThe extension: Five is not magic. The real guidance is:\n1. Keep asking until you reach something actionable\n2. But don't speculate past your evidence\n3. And never stop at human blame\n\nThe technique is simple. 
Applying it well requires discipline.\n"},"import":{"commit_sha":"a31e22d4445ad8fef7cd771d32af537aebb68c49","imported_at":"2026-05-22T21:14:39Z","license_text":"MIT License\n\nCopyright (c) 2025 TJ Boudreaux\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n","owner":"tjboudreaux","repo":"tjboudreaux/cc-thinking-skills","source_url":"https://github.com/tjboudreaux/cc-thinking-skills/tree/a31e22d4445ad8fef7cd771d32af537aebb68c49/skills/thinking-five-whys-plus"}},"content_hash":[98,13,156,164,54,121,194,92,218,214,98,15,55,197,172,182,206,231,244,12,220,125,172,116,157,246,202,157,241,148,155,207],"trust_level":"unsigned","yanked":false}
