systematic-debugging

Methodology for debugging non-trivial problems systematically. This skill should be used automatically when investigating bugs, test failures, or unexpected behavior that isn't immediately obvious. Emphasizes hypothesis formation, parallel investigation with subagents, and avoiding common anti-patterns like jumping to conclusions or weakening tests.

What this skill does

# Systematic Debugging

## When to Use

Invoke this methodology automatically when:
- A test fails and the cause isn't immediately obvious
- Unexpected behavior occurs in production or development
- An error message doesn't directly point to the fix
- Multiple potential causes exist

## Core Principles

1. **Hypothesize before acting** - Form explicit hypotheses about root cause before changing code
2. **Test hypotheses systematically** - Validate or eliminate each hypothesis with evidence
3. **Parallelize investigation** - Use subagents for concurrent read-only exploration
4. **Preserve test integrity** - Never weaken tests to make them pass

## Debugging Scope Ladder

**Always prefer the smallest, most reproducible scope that demonstrates the bug.** Work up the ladder only when the smaller scope can't reproduce or doesn't apply:

| Priority | Scope | When to Use | Command |
|----------|-------|-------------|---------|
| 1 | **Unit test** | Logic errors, algorithm bugs, single-function issues | `cargo test -p freenet -- specific_test` |
| 2 | **Mocked unit test** | Transport/ring logic needing isolation | Unit test with `MockNetworkBridge` / `MockRing` |
| 3 | **Simulation test** | Multi-node behavior, state machines, race conditions | `cargo test -p freenet --test simulation_integration -- --test-threads=1` |
| 4 | **SimNetwork + FaultConfig** | Fault tolerance, message loss, network partitions | SimNetwork with configured fault injection |
| 5 | **fdev single-process** | Quick multi-peer CI validation | `cargo run -p fdev -- test --seed 42 single-process` |
| 6 | **freenet-test-network** | 20+ peer large-scale behavior | Docker-based `freenet-test-network` |
| 7 | **Real network** | Issues that only manifest with real UDP/NAT/latency | Manual multi-peer test across machines |

**Why this order matters:**
- Lower scopes are faster, deterministic, and reproducible by anyone
- Higher scopes require more infrastructure and time, and may not be accessible to all contributors
- Gateway logs, aggregate telemetry, and production metrics are not available to every developer — don't assume access to these when designing reproduction steps
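To illustrate the first rung of the ladder, here is a minimal sketch of what a unit-scope reproduction might look like for a suspected distance-calculation bug. `ring_distance` is a hypothetical helper written for this example, assuming locations live on a ring of circumference 1.0; it is not the actual freenet API.

```rust
/// Hypothetical helper (not the freenet API): distance between two
/// locations on a ring of circumference 1.0.
fn ring_distance(a: f64, b: f64) -> f64 {
    let direct = (a - b).abs();
    // The shorter way around the ring may cross the 1.0 -> 0.0 wrap point.
    direct.min(1.0 - direct)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn distance_wraps_around_the_ring() {
        // 0.95 -> 0.05 is 0.1 across the wrap point, not 0.9.
        let d = ring_distance(0.95, 0.05);
        assert!((d - 0.1).abs() < 1e-9);
    }

    #[test]
    fn distance_is_symmetric() {
        assert_eq!(ring_distance(0.2, 0.5), ring_distance(0.5, 0.2));
    }
}
```

A test like this runs in milliseconds and needs no network, which is the reason to try this scope before anything higher on the ladder.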

## Debugging Workflow

### Phase 0: Claim the Issue

If you're working on a GitHub issue, **check if it's already assigned** before starting. If someone else is assigned, stop and inform the user — don't duplicate effort. If unassigned, assign it to yourself so others know it's being worked on:

```bash
gh issue view <ISSUE> --repo freenet/<REPO> # Check assignees
gh issue edit <ISSUE> --repo freenet/<REPO> --add-assignee @me
```

### Phase 1: Reproduce and Isolate

1. **Reproduce the failure** — Confirm the bug exists and is reproducible
2. **Use the scope ladder** — Start at the smallest scope that can demonstrate the bug:
   - Can you write a unit test? Try that first
   - Needs multiple nodes? Use the simulation framework with a deterministic seed
   - Only happens under fault conditions? Use `SimNetwork` with `FaultConfig`
   - Can't reproduce in simulation? Then escalate to real network testing
3. **Record the seed** — When using simulation tests, always record the seed value for reproducibility
4. **Gather initial evidence** — Read error messages, logs, stack traces

**Simulation-first approach for distributed bugs:**
```bash
# Run simulation tests deterministically
cargo test -p freenet --features simulation_tests --test sim_network -- --test-threads=1

# With logging to observe event sequences
RUST_LOG=info cargo test -p freenet --features simulation_tests --test sim_network -- --nocapture --test-threads=1

# Reproduce with a specific seed
cargo run -p fdev -- test --seed 0xDEADBEEF single-process
```
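Why recording the seed matters can be shown with a small sketch. It assumes, for illustration only, that every random decision in a run is drawn from one seeded generator. `SplitMix64` is a standard small generator used here as a stand-in, and `schedule` is a hypothetical function; neither is part of the freenet simulation framework.

```rust
/// Minimal SplitMix64 generator, used only to illustrate seeded determinism.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for a simulation run: the order in which peers
/// take their turn, derived from the seed and nothing else.
fn schedule(seed: u64, peers: usize, steps: usize) -> Vec<usize> {
    assert!(peers > 0, "need at least one peer");
    let mut rng = SplitMix64::new(seed);
    (0..steps)
        .map(|_| (rng.next_u64() % peers as u64) as usize)
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn same_seed_replays_the_same_schedule() {
        assert_eq!(schedule(0xDEADBEEF, 5, 64), schedule(0xDEADBEEF, 5, 64));
    }
}
```

If a failure depends on one particular interleaving, only the seed that produced it is guaranteed to produce it again, so a failing run without its seed may be impossible to replay.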

### Phase 1b: When the Bug Is Reported from the Live Network

When a bug comes from production observations (user reports, telemetry, monitoring), the goal is to **translate the network observation into a local reproduction as fast as possible**. Live-network debugging has the slowest feedback loop — adding telemetry, redeploying, waiting — so minimize time spent there.

**The workflow:**

1. **Constrain the problem from network data** — What operation type? Which peers? What hop count? What timing pattern? Use telemetry or user reports to narrow this down.
2. **Translate constraints into simulation parameters:**

| Network Observation | Simulation Translation |
|---------------------|----------------------|
| "GET times out at hop 3" | `#[freenet_test]` with 4+ nodes, specific `node_locations` matching topology |
| "Peer X never responds" | Node configured to drop/delay messages via `FaultConfig` |
| "73% timeout rate" | `FaultConfig { message_loss_rate: 0.7, .. }` or unresponsive target node |
| "Works for PUT but not GET" | Test both operations — likely incomplete wiring in dispatch path |
| "Rapid connect/disconnect cycles" | Simulation with transport-level fault injection |
| "Messages dropped after acknowledgement" | `FaultConfig` with selective message loss after initial handshake |

3. **Write the simulation test** — Start with `#[freenet_test]` or `SimNetwork + FaultConfig`. Use a deterministic seed.
4. **Debug locally** — Now iterate with full control: add tracing, assertions, state inspection. No redeployment needed.
5. **Validate** — Optionally confirm via telemetry that the deployed fix improves live behavior.
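The translation step above can be sketched in isolation. The table refers to `FaultConfig { message_loss_rate: 0.7, .. }`; the struct below, `SketchFaultConfig`, is a hypothetical stand-in with an assumed `seed` field, and `simulate_delivery` is an invented helper. The real `FaultConfig` and `SimNetwork` types may look quite different.

```rust
/// Hypothetical stand-in for a fault configuration. The real `FaultConfig`
/// may have different fields and semantics.
struct SketchFaultConfig {
    /// Probability in [0.0, 1.0] that any single message is dropped.
    message_loss_rate: f64,
    /// Seed that makes the drop pattern reproducible.
    seed: u64,
}

/// For each of `count` messages, returns true if delivered, false if dropped.
fn simulate_delivery(cfg: &SketchFaultConfig, count: usize) -> Vec<bool> {
    // xorshift64 requires a non-zero state, so force the lowest bit on.
    let mut state = cfg.seed | 1;
    (0..count)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Map the top 53 bits to a float in [0, 1).
            let unit = (state >> 11) as f64 / (1u64 << 53) as f64;
            unit >= cfg.message_loss_rate
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn drop_pattern_is_reproducible() {
        let cfg = SketchFaultConfig { message_loss_rate: 0.7, seed: 42 };
        assert_eq!(simulate_delivery(&cfg, 100), simulate_delivery(&cfg, 100));
    }

    #[test]
    fn extremes_behave_as_expected() {
        let none = SketchFaultConfig { message_loss_rate: 0.0, seed: 42 };
        assert!(simulate_delivery(&none, 100).iter().all(|&d| d));
        let all = SketchFaultConfig { message_loss_rate: 1.0, seed: 42 };
        assert!(simulate_delivery(&all, 100).iter().all(|&d| !d));
    }
}
```

Because the drop pattern depends only on the seed and the loss rate, a timeout observed under a given configuration can be replayed exactly while tracing and assertions are added.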

**If a `telemetry-monitor` skill is available** (project-local, not part of this plugin), use it to query the centralized OpenTelemetry collector for constraining the problem. But treat telemetry as input to simulation design, not as the primary debugging tool.

**Resist the temptation to keep adding telemetry to find the root cause.** Once you know *what* fails (operation type, peer pattern, timing), stop analyzing network data and reproduce locally. The simulation feedback loop is orders of magnitude faster.

### Phase 2: Form Hypotheses

Before touching any code, explicitly list potential causes:

```
Hypotheses:
1. [Most likely] The X component isn't handling Y case
2. [Possible] Race condition between A and B
3. [Less likely] Configuration mismatch in Z
```

Rank by likelihood based on evidence. Avoid anchoring on the first idea.

**Freenet-specific hypothesis patterns:**
- **State machine bugs** — Invalid transitions in operations (CONNECT, GET, PUT, UPDATE, SUBSCRIBE)
- **Ring/routing errors** — Incorrect peer selection, distance calculations, topology issues
- **Transport issues** — UDP packet loss handling, encryption/decryption, connection lifecycle
- **Contract execution** — WASM sandbox issues, state verification failures
- **Determinism violations** — Code using `std::time::Instant::now()` instead of `TimeSource`, or `rand::random()` instead of `GlobalRng`
- **Silent failure / fire-and-forget** — Spawned task dies with no error propagation (check: is the `JoinHandle` stored and polled? what happens if the task exits?), broadcast sent to zero targets with no warning, channel overflow silently dropping messages. Look for: `tokio::spawn` without `.await`/`.abort()`, `let _ = sender.send()`, missing logging on empty target sets
- **Resource exhaustion** — HashMap/Vec/channel entries inserted but never removed, causing unbounded memory growth or channel backpressure. Check: is there a cleanup path for every insert? Is cleanup triggered on both success AND failure/timeout? Run sustained operations and assert collection sizes stay bounded
- **Incomplete wiring** — Feature only works for some operation types (e.g., router feedback wired for GET but not subscribe/put/update). When debugging "X doesn't work for operation Y," check all enum variants in the dispatch path — commented-out arms, `_ => Irrelevant` catch-alls, and missing match arms are common
- **TTL/timing race conditions** — Two time-dependent operations where the first can expire before the second completes (e.g., transient TTL expires before CONNECT handshake, interest TTL expires before subscription renewal, broadcast fires be
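
The resource-exhaustion check in the list above ("is cleanup triggered on both success AND failure/timeout?") can be turned into a test. The sketch below is illustrative only: `PendingOps` and `Outcome` are hypothetical types invented for this example, assuming in-flight operations are tracked in a map keyed by a transaction id.

```rust
use std::collections::HashMap;

/// Hypothetical outcome of an operation (not a freenet type).
enum Outcome {
    Success,
    Timeout,
}

/// Hypothetical tracker for in-flight operations, keyed by transaction id.
struct PendingOps {
    entries: HashMap<u64, &'static str>,
    timeouts: u64,
}

impl PendingOps {
    fn new() -> Self {
        Self { entries: HashMap::new(), timeouts: 0 }
    }

    fn start(&mut self, tx: u64, kind: &'static str) {
        self.entries.insert(tx, kind);
    }

    /// Removes the entry on every outcome, so a timeout cannot leak it.
    /// Returns false if the transaction was not being tracked.
    fn finish(&mut self, tx: u64, outcome: Outcome) -> bool {
        if let Outcome::Timeout = outcome {
            self.timeouts += 1;
        }
        self.entries.remove(&tx).is_some()
    }

    fn in_flight(&self) -> usize {
        self.entries.len()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn map_stays_bounded_under_sustained_load() {
        let mut ops = PendingOps::new();
        for tx in 0..10_000u64 {
            ops.start(tx, "GET");
            let outcome = if tx % 2 == 0 { Outcome::Success } else { Outcome::Timeout };
            assert!(ops.finish(tx, outcome));
            // At most zero entries should survive a completed operation.
            assert_eq!(ops.in_flight(), 0);
        }
        assert_eq!(ops.timeouts, 5_000);
    }
}
```

If the removal were moved inside a success-only branch, the same test would fail on the first timed-out operation, which is the kind of leak this hypothesis pattern is meant to catch.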

Files: 2

Size: 30.3 KB

Complexity: 50/100

Category: Code Review

Source: https://github.com/freenet/freenet-agent-skills/tree/main/skills/systematic-debugging
