Claude
Skills
Sign in
Back

self-healing

Included with Lifetime
$97 forever

Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.

Backend & APIsscripts

What this skill does


# Self-Healing

Active runtime recovery for coding agents. When something breaks, run the loop: **diagnose → patch → verify → file**. Leave behind a reusable, verified artifact instead of a swept-under-the-rug failure.

The premise mirrors [browser-use/browser-harness](https://github.com/browser-use/browser-harness): *the harness improves itself every run*. An agent that hits a gap doesn't fail — it writes the fix during execution, verifies it works, and files the durable artifact for future runs. Coding tasks deserve the same loop.

## What this skill is for

When a coding agent hits a wall mid-task, the default failure modes are:

1. **Paper over it** — "let me try a different approach" — and lose the recovery
2. **Pretend the fix worked** — without re-running the broken thing
3. **Symptom-fix** — skip the test, swallow the error, retry until green

All three turn a one-time failure into a recurrence. The next agent on the same project hits the same wall.

This skill enforces one discipline: **verify before persist**. A patch isn't real until you've re-run the failing operation and watched it succeed. When it does, file the verified fix so the next run benefits.

## Relationship to self-improvement

These two skills are deliberately split. Run both — they feed each other but don't overlap.

| Aspect      | `self-healing` (this skill)                                          | `self-improvement`                                            |
| ----------- | -------------------------------------------------------------------- | ------------------------------------------------------------- |
| **When**    | During execution, failure is live                                    | After the fact, at natural breakpoints                        |
| **Verb**    | Heal now — restore working state                                     | Remember for later — accumulate knowledge                     |
| **Outcome** | Verified patch + (optional) reusable artifact                        | Logged learning, correction, request                          |
| **Verify**  | **Mandatory** — no persist without proof                             | Not required                                                  |
| **Files**   | `.learnings/HEALS.md` + `.learnings/heals/<HEAL-ID>/` (lazy)         | `.learnings/ERRORS.md`, `LEARNINGS.md`, `FEATURE_REQUESTS.md` |
| **Trigger** | Failure observed mid-task                                            | Correction, knowledge gap, feature request, recurrence        |

**Boundary rule:** if you're capturing a fact, a correction, or a wish — that's `self-improvement`. If you're applying and verifying a fix to a live failure — that's `self-healing`.

## The Heal Loop

```
  ● failure observed
  │
  ● 1. DIAGNOSE  capture context — command, error, env, what was attempted
  │              search HEALS.md for the same Pattern-Key first
  │              (most heals are recurrences; don't reinvent)
  │
  ● 2. PATCH     write the fix — script, helper, env tweak, alt command
  │              artifacts → .learnings/heals/<HEAL-ID>/  (only if needed)
  │
  ● 3. VERIFY    re-run the failing op — must succeed
  │              ↻ if still failing: refine and retry, cap at 3 attempts
  │              ✗ if uncrackable: file Status: abandoned with notes
  │
  ● 4. FILE      write HEAL-YYYYMMDD-XXX to .learnings/HEALS.md
  │              with Pattern-Key, status, verification proof
  │
  ✓ working state restored, heal persisted

  (conditional) PROMOTE  if Pattern-Key recurrence ≥ 3 across distinct tasks,
                          append a Handoff block → self-improvement promotes to memory
```

If you abandon a heal mid-loop, don't pretend it succeeded. File a `HEAL-` entry with `Status: abandoned` and notes on what didn't work. The next agent learns from the dead end too.

## When to trigger

Self-healing fires on **active failures during execution** — the agent has just observed something not working and needs to make it work to continue. Five shapes:

### 1. Tool failure (command / test / build / lint)
Any invocation exits non-zero or produces wrong output. Don't acknowledge and retry verbatim — diagnose, patch, verify.

*Examples:* `npm install` errors when a `pnpm-lock.yaml` is present (switch tool); `pytest` fails with `ModuleNotFoundError` (activate the venv); `tsc` flags a stale type (regenerate the client); `eslint` reports a config error (install the missing parser).

### 2. Missing capability / tool gap
The agent needs something that doesn't exist yet — a script, a helper, a wrapper, a glue function. Write it in the moment. This is the closest analog to browser-harness's `agent_helpers.py`.

*Examples:* dedupe a CSV by custom key (write a small Python helper); bootstrap 12 microservices the same way (write `scripts/bootstrap-all.sh`); bulk-rename branches matching a pattern (write a `gh`-based shell helper).

### 3. Environment issue
The local environment isn't what the project expects. Detect, patch, verify.

*Examples:* runtime version mismatch (`nvm use`, `pyenv local`, `rustup override`); stale dependency cache after a branch switch; dirty git state blocking a checkout; missing `.env` (copy from `.env.example` and surface gaps).

### 4. External service / API change
A service the agent depends on returns something unexpected. Find a workaround and capture it.

*Examples:* an MCP tool returns `InputValidationError` because the schema changed (patch the call shape); a public API hits a rate limit (back off, switch endpoint, batch); an upstream lib bumped a default and broke a script (pin the version).

### 5. About-to-retry-the-same-broken-approach
The agent catches itself about to redo the failing step. That self-recognition is a heal forming — capture the alternate approach as the patch.

### Detection signals to watch for

- Non-zero exit codes
- Stack traces in tool output
- The same operation failing twice with the same error
- "I'll try a different approach" — capture it as a heal
- `command not found` / `module not found` / `permission denied`
- Stale assertions, snapshot mismatches, type errors that weren't there before
- "Weird" output that suggests environmental rather than logical bugs

## HEAL Entry Format

Append to `.learnings/HEALS.md` (create if missing):

```markdown
## [HEAL-YYYYMMDD-XXX] short_kebab_name

**Logged**: ISO-8601 timestamp
**Status**: verified | pending-verify | abandoned
**Trigger**: tool-failure | missing-capability | env-issue | external-change | <free-form>
**Active-Context**: (optional) — current skill, task phase, or workflow stage; omit if not applicable
**Area**: free-form tag — what part of the system (`build`, `tests`, `ci`, `auth`, `data-pipeline`, `mobile`, ...)
**Priority**: low | medium | high | critical

### Failure
What broke — concrete: the command, the error message, the action that was blocked. Include exit codes and verbatim error lines.

### Diagnosis
The root cause as understood after investigation. Why the obvious approach didn't work. Not a guess — what was actually verified during the heal.

### Fix
The patch that was applied. Verbatim commands, code snippets, or pointers to files under `.learnings/heals/<HEAL-ID>/`. Keep it minimal — just enough to reproduce.

### Verification
What was run after the fix and what it returned. Exit code, output snippet, test pass count. **This is the proof.** Without it, the entry is `pending-verify` or `abandoned`.

### Artifacts
(omit this section if no files were generated; otherwise list relative paths under `.learnings/heals/<HEAL-ID>/`)

### Metadata
- Related Files: path/to/file.ext
- See Also: HEAL-... | LRN-... | ERR-... (related entries)
- Pattern-Key: lower.snake.case key for recurrence detection (e.g. `env.lockfile_mismatch`)
- Recurrence-Count: 1
- First-Seen / Last-Seen: YYYY-MM-DD

---
```

### Field guidance

- **Status** — `verified` = the verify step passed. `pending-verify` = patch applied but couldn't be fully proven (sandboxed/offline

Related in Backend & APIs