systematic-debugging

Methodology for debugging non-trivial problems systematically. This skill should be used automatically when investigating bugs, test failures, or unexpected behavior that isn't immediately obvious. Emphasizes hypothesis formation, parallel investigation with subagents, and avoiding common anti-patterns like jumping to conclusions or weakening tests.


# Systematic Debugging

## When to Use

Invoke this methodology automatically when:
- A test fails and the cause isn't immediately obvious
- Unexpected behavior occurs in production or development
- An error message doesn't directly point to the fix
- Multiple potential causes exist

## Core Principles

1. **Hypothesize before acting** - Form explicit hypotheses about root cause before changing code
2. **Test hypotheses systematically** - Validate or eliminate each hypothesis with evidence
3. **Parallelize investigation** - Use subagents for concurrent read-only exploration
4. **Preserve test integrity** - Never weaken tests to make them pass

## Debugging Scope Ladder

**Always prefer the smallest, most reproducible scope that demonstrates the bug.** Work up the ladder only when the smaller scope can't reproduce or doesn't apply:

| Priority | Scope | When to Use | Command |
|----------|-------|-------------|---------|
| 1 | **Unit test** | Logic errors, algorithm bugs, single-function issues | `cargo test -p freenet -- specific_test` |
| 2 | **Mocked unit test** | Transport/ring logic needing isolation | Unit test with `MockNetworkBridge` / `MockRing` |
| 3 | **Simulation test** | Multi-node behavior, state machines, race conditions | `cargo test -p freenet --test simulation_integration -- --test-threads=1` |
| 4 | **SimNetwork + FaultConfig** | Fault tolerance, message loss, network partitions | SimNetwork with configured fault injection |
| 5 | **fdev single-process** | Quick multi-peer CI validation | `cargo run -p fdev -- test --seed 42 single-process` |
| 6 | **freenet-test-network** | 20+ peer large-scale behavior | Docker-based `freenet-test-network` |
| 7 | **Real network** | Issues that only manifest with real UDP/NAT/latency | Manual multi-peer test across machines |

**Why this order matters:**
- Lower scopes are faster, deterministic, and reproducible by anyone
- Higher scopes require more infrastructure and time, and may not be accessible to all contributors
- Gateway logs, aggregate telemetry, and production metrics are not available to every developer — don't assume access to these when designing reproduction steps

## Debugging Workflow

### Phase 0: Claim the Issue

If you're working on a GitHub issue, **check if it's already assigned** before starting. If someone else is assigned, stop and inform the user — don't duplicate effort. If unassigned, assign it to yourself so others know it's being worked on:

```bash
gh issue view <ISSUE> --repo freenet/<REPO>  # Check assignees
gh issue edit <ISSUE> --repo freenet/<REPO> --add-assignee @me
```

### Phase 1: Reproduce and Isolate

1. **Reproduce the failure** — Confirm the bug exists and is reproducible
2. **Use the scope ladder** — Start at the smallest scope that can demonstrate the bug:
   - Can you write a unit test? Try that first
   - Needs multiple nodes? Use the simulation framework with a deterministic seed
   - Only happens under fault conditions? Use `SimNetwork` with `FaultConfig`
   - Can't reproduce in simulation? Then escalate to real network testing
3. **Record the seed** — When using simulation tests, always record the seed value for reproducibility
4. **Gather initial evidence** — Read error messages, logs, stack traces
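
One way to make step 3 hard to skip is to have the test print the seed it ran with, so the value always lands in the failure output. The sketch below is a minimal, standard-library-only illustration of that idea. `SplitMix64`, `parse_seed`, `seed_from_env`, `event_sequence`, and the `SIM_SEED` environment variable are all hypothetical names chosen for this example; they are not part of the freenet crates, which provide their own seeded RNG (`GlobalRng`).

```rust
use std::env;

/// Tiny deterministic PRNG (SplitMix64). Hypothetical stand-in for a
/// project-provided seeded RNG; it only keeps this sketch self-contained.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Parse a seed given as decimal ("42") or 0x-prefixed hex ("0xDEADBEEF").
fn parse_seed(raw: &str) -> Option<u64> {
    let raw = raw.trim();
    match raw.strip_prefix("0x").or_else(|| raw.strip_prefix("0X")) {
        Some(hex) => u64::from_str_radix(hex, 16).ok(),
        None => raw.parse().ok(),
    }
}

/// Read the seed from the (hypothetical) SIM_SEED variable, falling back to
/// `default` when the variable is unset or unparsable.
fn seed_from_env(default: u64) -> u64 {
    match env::var("SIM_SEED") {
        Ok(raw) => parse_seed(&raw).unwrap_or(default),
        Err(_) => default,
    }
}

/// The sequence of pseudo-random "events" a run would observe for a seed.
fn event_sequence(seed: u64, len: usize) -> Vec<u64> {
    let mut rng = SplitMix64(seed);
    (0..len).map(|_| rng.next_u64()).collect()
}

#[test]
fn same_seed_replays_same_events() {
    let seed = seed_from_env(42);
    // Print the seed first so it is visible next to any failure.
    println!("simulation seed: {seed:#x}");
    assert_eq!(event_sequence(seed, 16), event_sequence(seed, 16));
}
```

With this shape, a failing run can be replayed by exporting the printed value (for example `SIM_SEED=0x2a`) before rerunning the test.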

**Simulation-first approach for distributed bugs:**
```bash
# Run simulation tests deterministically
cargo test -p freenet --features simulation_tests --test sim_network -- --test-threads=1

# With logging to observe event sequences
RUST_LOG=info cargo test -p freenet --features simulation_tests --test sim_network -- --nocapture --test-threads=1

# Reproduce with a specific seed
cargo run -p fdev -- test --seed 0xDEADBEEF single-process
```

### Phase 1b: When the Bug Is Reported from the Live Network

When a bug comes from production observations (user reports, telemetry, monitoring), the goal is to **translate the network observation into a local reproduction as fast as possible**. Live-network debugging has the slowest feedback loop — adding telemetry, redeploying, waiting — so minimize time spent there.

**The workflow:**

1. **Constrain the problem from network data** — What operation type? Which peers? What hop count? What timing pattern? Use telemetry or user reports to narrow this down.
2. **Translate constraints into simulation parameters:**

| Network Observation | Simulation Translation |
|---------------------|----------------------|
| "GET times out at hop 3" | `#[freenet_test]` with 4+ nodes, specific `node_locations` matching topology |
| "Peer X never responds" | Node configured to drop/delay messages via `FaultConfig` |
| "73% timeout rate" | `FaultConfig { message_loss_rate: 0.7, .. }` or unresponsive target node |
| "Works for PUT but not GET" | Test both operations — likely incomplete wiring in dispatch path |
| "Rapid connect/disconnect cycles" | Simulation with transport-level fault injection |
| "Messages dropped after acknowledgement" | `FaultConfig` with selective message loss after initial handshake |

3. **Write the simulation test** — Start with `#[freenet_test]` or `SimNetwork + FaultConfig`. Use a deterministic seed.
4. **Debug locally** — Now iterate with full control: add tracing, assertions, state inspection. No redeployment needed.
5. **Validate** — Optionally confirm via telemetry that the deployed fix improves live behavior.

**If a `telemetry-monitor` skill is available** (project-local, not part of this plugin), use it to query the centralized OpenTelemetry collector and constrain the problem. But treat telemetry as input to simulation design, not as the primary debugging tool.

**Resist the temptation to keep adding telemetry to find the root cause.** Once you know *what* fails (operation type, peer pattern, timing), stop analyzing network data and reproduce locally. The simulation feedback loop is orders of magnitude faster.
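
The translation step in the workflow above can be sketched without any project code. The example below takes the "73% timeout rate" observation and turns it into a seeded, repeatable loss model. It is an illustration only: `LossConfig`, `count_timeouts`, and `SplitMix64` are hypothetical names, and `LossConfig` is a stand-in for the real `FaultConfig`, whose fields and behavior may differ.

```rust
/// Hypothetical stand-in for a fault configuration. The real `FaultConfig`
/// lives in the freenet crate and may expose different fields.
#[derive(Clone, Copy)]
struct LossConfig {
    seed: u64,
    /// Probability in [0.0, 1.0] that a single message is dropped.
    message_loss_rate: f64,
}

/// Tiny deterministic PRNG (SplitMix64), used so the run is repeatable.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0.0, 1.0), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Model `requests` single-message operations over a lossy link. A request
/// counts as a timeout when its message is dropped.
fn count_timeouts(cfg: LossConfig, requests: u32) -> u32 {
    let mut rng = SplitMix64(cfg.seed);
    (0..requests)
        .filter(|_| rng.next_f64() < cfg.message_loss_rate)
        .count() as u32
}

#[test]
fn observed_timeout_rate_is_reproduced_locally() {
    let cfg = LossConfig { seed: 0xDEAD_BEEF, message_loss_rate: 0.73 };
    let timeouts = count_timeouts(cfg, 10_000);
    let observed = f64::from(timeouts) / 10_000.0;
    // Loose bounds: the goal is the same failure regime, not an exact match.
    assert!((0.65..=0.80).contains(&observed));
    // Same seed, same outcome: the failure can be replayed at will.
    assert_eq!(timeouts, count_timeouts(cfg, 10_000));
}
```

Once the local model shows the same failure regime as the network observation, the loss rate and seed become fixed inputs, and the remaining debugging happens against a run that can be repeated exactly.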

### Phase 2: Form Hypotheses

Before touching any code, explicitly list potential causes:

```
Hypotheses:
1. [Most likely] The X component isn't handling Y case
2. [Possible] Race condition between A and B
3. [Less likely] Configuration mismatch in Z
```

Rank by likelihood based on evidence. Avoid anchoring on the first idea.

**Freenet-specific hypothesis patterns:**
- **State machine bugs** — Invalid transitions in operations (CONNECT, GET, PUT, UPDATE, SUBSCRIBE)
- **Ring/routing errors** — Incorrect peer selection, distance calculations, topology issues
- **Transport issues** — UDP packet loss handling, encryption/decryption, connection lifecycle
- **Contract execution** — WASM sandbox issues, state verification failures
- **Determinism violations** — Code using `std::time::Instant::now()` instead of `TimeSource`, or `rand::random()` instead of `GlobalRng`
- **Silent failure / fire-and-forget** — Spawned task dies with no error propagation (check: is the `JoinHandle` stored and polled? what happens if the task exits?), broadcast sent to zero targets with no warning, channel overflow silently dropping messages. Look for: `tokio::spawn` without `.await`/`.abort()`, `let _ = sender.send()`, missing logging on empty target sets
- **Resource exhaustion** — HashMap/Vec/channel entries inserted but never removed, causing unbounded memory growth or channel backpressure. Check: is there a cleanup path for every insert? Is cleanup triggered on both success AND failure/timeout? Run sustained operations and assert collection sizes stay bounded
- **Incomplete wiring** — Feature only works for some operation types (e.g., router feedback wired for GET but not subscribe/put/update). When debugging "X doesn't work for operation Y," check all enum variants in the dispatch path — commented-out arms, `_ => Irrelevant` catch-alls, and missing match arms are common
- **TTL/timing race conditions** — Two time-dependent operations where the first can expire before the second completes (e.g., transient TTL expires before CONNECT handshake, interest TTL expires before subscription renewal, broadcast fires be

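The incomplete-wiring pattern from the list above can be illustrated with a small dispatch function. This is a sketch under stated assumptions: `OpKind`, `reports_feedback_buggy`, `reports_feedback`, and `unwired_ops` are hypothetical names, and the choice of which operations should report feedback is made up for the example rather than taken from the freenet code.

```rust
/// Hypothetical operation kinds, named after the operations listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpKind {
    Connect,
    Get,
    Put,
    Update,
    Subscribe,
}

/// Buggy shape: feedback is wired for GET only, and the catch-all arm
/// silently swallows every other operation.
fn reports_feedback_buggy(op: OpKind) -> bool {
    match op {
        OpKind::Get => true,
        _ => false,
    }
}

/// Safer shape: every variant is listed, so adding a new variant becomes a
/// compile error until someone decides how it should be handled.
fn reports_feedback(op: OpKind) -> bool {
    match op {
        OpKind::Get | OpKind::Put | OpKind::Update | OpKind::Subscribe => true,
        OpKind::Connect => false,
    }
}

/// Return the operations in `expected` that `handler` does not cover.
fn unwired_ops(handler: fn(OpKind) -> bool, expected: &[OpKind]) -> Vec<OpKind> {
    expected.iter().copied().filter(|op| !handler(*op)).collect()
}

#[test]
fn every_expected_operation_is_wired() {
    let expected = [OpKind::Get, OpKind::Put, OpKind::Update, OpKind::Subscribe];
    // The buggy dispatcher exposes exactly the operations it forgot.
    assert_eq!(
        unwired_ops(reports_feedback_buggy, &expected),
        vec![OpKind::Put, OpKind::Update, OpKind::Subscribe]
    );
    // The exhaustive dispatcher covers all of them.
    assert!(unwired_ops(reports_feedback, &expected).is_empty());
    assert!(!reports_feedback(OpKind::Connect));
}
```

A test that iterates over every operation kind, rather than only the one named in the bug report, tends to surface this class of bug at the unit-test scope.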