Root Cause Analysis Methodology
This skill should be used when the user asks to "perform root cause analysis", "investigate production issue", "analyze incident", "find root cause", "debug production error", "trace the cause", or mentions investigating production problems, alerts, or outages. Provides systematic RCA methodology and investigation workflows.
What this skill does
# Root Cause Analysis Methodology

## Overview

Root cause analysis (RCA) is a systematic investigation process to identify the underlying cause of production incidents, errors, and outages. This skill provides structured methodologies for conducting effective RCA that goes beyond surface-level symptoms to find actionable root causes.

## When to Use This Skill

Apply this skill when:

- Production alerts fire indicating system degradation
- Users report errors or unexpected behavior
- Incidents occur requiring post-mortem investigation
- Metrics show anomalous patterns
- Any situation requiring systematic debugging of production issues

## Core RCA Principles

### 1. Timeline Reconstruction

Establish a clear timeline of events:

- Identify when the issue first appeared (error logs, metrics, user reports)
- Note when alerts fired or detection occurred
- Map recent changes (deployments, configuration changes, infrastructure changes)
- Identify when the issue resolved (if applicable)

Create a visual timeline connecting:

- **WHEN**: Timestamps of key events
- **WHAT**: What changed or broke at each point
- **WHERE**: Which systems, services, or components were affected

### 2. Symptom vs. Root Cause

Distinguish between symptoms and root causes:

**Symptoms** are observable effects:

- "API returning 500 errors"
- "Database queries timing out"
- "Memory usage at 95%"

**Root causes** are underlying reasons:

- "Connection pool exhausted due to size reduction in deployment"
- "Missing database index causing full table scans"
- "Memory leak introduced in commit abc123"

Always trace from symptoms to root causes by asking "why?" repeatedly.

### 3. The Five Whys Technique

Ask "why?" five times to drill down from symptom to root cause:

**Example:**

1. **Why** are users seeing errors? → API is returning 500s
2. **Why** is API returning 500s? → Database queries are timing out
3. **Why** are queries timing out? → Connection pool is exhausted
4. **Why** is connection pool exhausted? → Pool size was reduced from 100 to 10
5. **Why** was pool size reduced? → Deployment of commit abc123 changed configuration

Root cause: Configuration change in commit abc123 reduced pool size inappropriately.

### 4. Data-Driven Investigation

Base conclusions on evidence:

- **Metrics**: Error rates, latency percentiles, resource utilization
- **Logs**: Error messages, stack traces, debug output
- **Code**: Recent commits, blame information, diff analysis
- **Configuration**: Recent changes to config files, environment variables
- **Infrastructure**: Deployment logs, scaling events, resource changes

Avoid speculation: validate hypotheses with data.

## RCA Investigation Workflow

### Step 1: Gather Initial Information

Collect the triggering incident data:

- Alert details (name, severity, time, affected systems)
- Error messages and stack traces
- Relevant metrics (error rates, latency, resource usage)
- User reports or issue descriptions

### Step 2: Establish Scope and Impact

Determine:

- **Scope**: Which services, endpoints, or features are affected?
- **Severity**: How many users impacted? Revenue impact?
- **Duration**: When did it start? Is it ongoing?
- **Frequency**: One-time or recurring issue?

### Step 3: Build Timeline of Events

Construct chronological timeline:

1. Query metrics to find when anomaly started
2. Identify recent deployments or changes before incident
3. Note when alerts fired
4. Map any correlated events (scaling, traffic spikes, dependency failures)

### Step 4: Search Codebase for Related Code

Identify relevant code:

- Search for error messages in logs
- Find files/functions mentioned in stack traces
- Locate services or components mentioned in alerts
- Use grep to find error-handling code, API endpoints, database queries

Focus on:

- Entry points (API endpoints, event handlers)
- Data access layer (database queries, cache operations)
- External integrations (third-party APIs, message queues)

### Step 5: Analyze Recent Changes

Use git to find recent changes to relevant code:

- **git log**: Recent commits to affected files
- **git blame**: Who changed specific lines and when
- **git diff**: What changed between working and broken versions
- **git bisect**: Binary search to find breaking commit (for regressions)

Prioritize commits made shortly before incident started.

### Step 6: Correlate Changes with Timeline

Connect code changes to incident timeline:

- Did deployment coincide with error spike?
- Was configuration changed near incident start?
- Did dependency update introduce regression?

Look for temporal correlation between changes and symptoms.

### Step 7: Identify Root Cause

Synthesize findings to pinpoint root cause:

- What specific code, configuration, or infrastructure change caused symptoms?
- Why did this change cause the problem?
- What assumption or validation was missing?
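
To shortlist candidate changes for the questions above, the temporal correlation from Step 6 can be sketched as a simple filter. This is a minimal sketch, assuming change events (deployments, configuration edits) have already been collected by hand into lines of the form `<UTC ISO 8601 timestamp> <description>`; the function name `changes_in_window` and the timestamps in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: read change events on stdin and print those whose
# timestamp falls between a window start and the incident start (inclusive).
# Assumes every timestamp uses the same UTC ISO 8601 layout, for example
# 2024-05-01T14:20:00Z, so plain string comparison orders them correctly.
changes_in_window() {
  local window_start="$1" incident_start="$2"
  awk -v lo="$window_start" -v hi="$incident_start" \
    '$1 >= lo && $1 <= hi'
}

# Illustrative usage (made-up times; events.txt is a hypothetical file):
# incident assumed to start at 14:30 UTC, with a 30-minute lookback window.
#   changes_in_window 2024-05-01T14:00:00Z 2024-05-01T14:30:00Z < events.txt
```

Any change that survives the filter is only a candidate; it still needs the evidence checks described in the rest of this step.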

Ensure root cause is:

- **Specific**: Not "the code is buggy" but "missing null check in function X"
- **Actionable**: Can be fixed with specific changes
- **Validated**: Supported by evidence (metrics, logs, code)

### Step 8: Verify Root Cause Hypothesis

Validate the identified root cause:

- Confirm timeline alignment (change introduced before symptoms appeared)
- Check if reverting change would resolve issue
- Look for similar patterns in logs or metrics
- Test hypothesis in staging environment if possible

### Step 9: Document Findings

Create RCA report including:

- **Summary**: One-paragraph overview of incident and root cause
- **Timeline**: Chronological event sequence
- **Root Cause**: Specific code/config change that caused issue
- **Impact**: Scope, severity, duration, affected users
- **Evidence**: Metrics, logs, commits supporting conclusion
- **Suggested Fix**: How to resolve and prevent recurrence

See `examples/rca-report-template.md` for report structure.

## Investigation Techniques

### Searching for Error Patterns

When analyzing error messages:

1. Extract key terms from error message (excluding variable values)
2. Search codebase for error string
3. Find where error is raised or logged
4. Trace backwards to identify trigger conditions

**Example:**

- Error: `ConnectionPoolExhausted: Could not acquire connection within timeout`
- Search for: `ConnectionPoolExhausted` or `Could not acquire connection`
- Find: Connection pool configuration and usage
- Trace: Recent changes to pool size or connection usage patterns

### Using Git Blame Effectively

Git blame identifies when lines were last changed:

```bash
git blame path/to/file.js
```

Focus on:

- Lines mentioned in stack traces
- Configuration values that seem incorrect
- Error-handling code paths
- Recently changed lines (within incident timeframe)

Cross-reference blame timestamps with incident timeline.
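
One way to do this cross-referencing is sketched below. The helper names `blame_range` and `commits_in_window`, and the path, line numbers, and dates in the usage comments, are hypothetical. The git options used (`-L`, `--date`, `--since`, `--until`, `--format`) are standard.

```bash
#!/usr/bin/env bash

# Hypothetical helper: blame only the lines named in a stack trace,
# printing ISO-style dates so they can be compared with the timeline.
blame_range() {
  local file="$1" first="$2" last="$3"
  git blame --date=iso -L "${first},${last}" -- "$file"
}

# Hypothetical helper: list commits touching a file inside a time window,
# one per line as "<short hash> <committer date> <subject>".
commits_in_window() {
  local file="$1" start="$2" stop="$3"
  git log --since="$start" --until="$stop" \
    --format='%h %cI %s' -- "$file"
}

# Illustrative usage (path, line range, and dates are made up):
#   blame_range path/to/file.js 40 60
#   commits_in_window path/to/file.js '2024-05-01 13:30' '2024-05-01 14:30'
```

Restricting blame to a line range keeps the output focused on the code named in the stack trace, and the windowed log shows whether any of those lines changed shortly before the incident began.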

### Analyzing Metrics Patterns

Look for metric patterns indicating root cause:

- **Sudden spike**: Deployment, configuration change, traffic surge
- **Gradual increase**: Memory leak, resource exhaustion, unbounded growth
- **Periodic pattern**: Cron job, scheduled task, batch process
- **Correlation**: Multiple metrics changing together (cause and effect)

Compare metrics before, during, and after incident.

### Dependency Analysis

Consider dependencies that could cause issues:

- Third-party API failures or slowdowns
- Database performance degradation
- Message queue backlogs
- Infrastructure resource constraints
- Network issues or DNS resolution failures

Check dependency health metrics and status pages.

## Common Root Cause Categories

### Code Changes

- New bugs introduced in recent commits
- Logic errors in conditionals or loops
- Missing error handling or validation
- Resource leaks (memory, connections, file handles)
- Race conditions or concurrency issues

### Configuration Changes

- Incorrect values (pool sizes, timeouts, limits)
- Missing required configuration
- Environment variable changes
- Feature flag toggles

### Infrastructure Changes

- Scaling events (too few or too many i