self-healing

Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.


# Self-Healing

Active runtime recovery for coding agents. When something breaks, run the loop: **diagnose → patch → verify → file**. Leave behind a reusable, verified artifact instead of a swept-under-the-rug failure.

The premise mirrors [browser-use/browser-harness](https://github.com/browser-use/browser-harness): *the harness improves itself every run*. An agent that hits a gap doesn't fail — it writes the fix during execution, verifies it works, and files the durable artifact for future runs. Coding tasks deserve the same loop.

## What this skill is for

When a coding agent hits a wall mid-task, the default failure modes are:

1. **Paper over it** — "let me try a different approach" — and lose the recovery
2. **Pretend the fix worked** — without re-running the broken thing
3. **Symptom-fix** — skip the test, swallow the error, retry until green

All three turn a one-time failure into a recurrence. The next agent on the same project hits the same wall.

This skill enforces one discipline: **verify before persist**. A patch isn't real until you've re-run the failing operation and watched it succeed. When it does, file the verified fix so the next run benefits.

## Relationship to self-improvement

These two skills are deliberately split. Run both — they feed each other but don't overlap.

| Aspect      | `self-healing` (this skill)                                          | `self-improvement`                                            |
| ----------- | -------------------------------------------------------------------- | ------------------------------------------------------------- |
| **When**    | During execution, failure is live                                    | After the fact, at natural breakpoints                        |
| **Verb**    | Heal now — restore working state                                     | Remember for later — accumulate knowledge                     |
| **Outcome** | Verified patch + (optional) reusable artifact                        | Logged learning, correction, request                          |
| **Verify**  | **Mandatory** — no persist without proof                             | Not required                                                  |
| **Files**   | `.learnings/HEALS.md` + `.learnings/heals/<HEAL-ID>/` (lazy)         | `.learnings/ERRORS.md`, `LEARNINGS.md`, `FEATURE_REQUESTS.md` |
| **Trigger** | Failure observed mid-task                                            | Correction, knowledge gap, feature request, recurrence        |

**Boundary rule:** if you're capturing a fact, a correction, or a wish — that's `self-improvement`. If you're applying and verifying a fix to a live failure — that's `self-healing`.

## The Heal Loop

```
  ● failure observed
  │
  ● 1. DIAGNOSE  capture context — command, error, env, what was attempted
  │              search HEALS.md for the same Pattern-Key first
  │              (most heals are recurrences; don't reinvent)
  │
  ● 2. PATCH     write the fix — script, helper, env tweak, alt command
  │              artifacts → .learnings/heals/<HEAL-ID>/  (only if needed)
  │
  ● 3. VERIFY    re-run the failing op — must succeed
  │              ↻ if still failing: refine and retry, cap at 3 attempts
  │              ✗ if uncrackable: file Status: abandoned with notes
  │
  ● 4. FILE      write HEAL-YYYYMMDD-XXX to .learnings/HEALS.md
  │              with Pattern-Key, status, verification proof
  │
  ✓ working state restored, heal persisted

  (conditional) PROMOTE  if Pattern-Key recurrence ≥ 3 across distinct tasks,
                          append a Handoff block → self-improvement promotes to memory
```

If you abandon a heal mid-loop, don't pretend it succeeded. File a `HEAL-` entry with `Status: abandoned` and notes on what didn't work. The next agent learns from the dead end too.
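
The patch and verify steps of the loop can be sketched roughly as follows. This is a minimal illustration, not the skill's own implementation: `heal`, `HealResult`, and `run` are hypothetical names, and the sketch assumes each candidate patch is a plain callable and that "success" means the failing command exits with code 0.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 3  # the loop caps refinement at 3 attempts


@dataclass
class HealResult:
    status: str                 # "verified" or "abandoned"
    attempts: int               # how many patches were tried
    proof: str = ""             # exit code and output snippet from the passing re-run
    notes: List[str] = field(default_factory=list)  # what each failed attempt returned


def run(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output without raising on failure."""
    return subprocess.run(cmd, capture_output=True, text=True)


def heal(failing_cmd: List[str], patches: List[Callable[[], None]]) -> HealResult:
    """Apply candidate patches in order, re-running failing_cmd after each one."""
    notes: List[str] = []
    tried = 0
    for attempt, patch in enumerate(patches[:MAX_ATTEMPTS], start=1):
        tried = attempt
        patch()                       # 2. PATCH
        result = run(failing_cmd)     # 3. VERIFY: re-run the broken thing
        if result.returncode == 0:
            proof = f"exit 0: {result.stdout.strip()[:200]}"
            return HealResult("verified", attempt, proof, notes)
        notes.append(
            f"attempt {attempt}: exit {result.returncode}: {result.stderr.strip()[:200]}"
        )
    # Nothing passed verification: report the dead end honestly.
    return HealResult("abandoned", tried, notes=notes)
```

The returned `proof` and `notes` are what would go into the Verification section of the filed entry; the sketch deliberately never reports `verified` without a passing re-run.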

## When to trigger

Self-healing fires on **active failures during execution** — the agent has just observed something not working and needs to make it work to continue. Five shapes:

### 1. Tool failure (command / test / build / lint)
Any invocation that exits non-zero or produces wrong output. Don't just acknowledge the failure and rerun the same command verbatim: diagnose, patch, verify.

*Examples:* `npm install` errors when a `pnpm-lock.yaml` is present (switch tool); `pytest` fails with `ModuleNotFoundError` (activate the venv); `tsc` flags a stale type (regenerate the client); `eslint` reports a config error (install the missing parser).
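
As an illustration of the lockfile example, the diagnosis step could check which lockfile is present before choosing an install command. This is a sketch only; `detect_package_manager` is a hypothetical helper, and the mapping covers just the three common Node lockfiles.

```python
from pathlib import Path
from typing import Optional

# Lockfile name -> package manager that writes it.
LOCKFILES = {
    "pnpm-lock.yaml": "pnpm",
    "yarn.lock": "yarn",
    "package-lock.json": "npm",
}


def detect_package_manager(project_dir: str) -> Optional[str]:
    """Return the package manager implied by the first lockfile found, or None."""
    root = Path(project_dir)
    for lockfile, manager in LOCKFILES.items():
        if (root / lockfile).is_file():
            return manager
    return None
```

If this returns `"pnpm"` while the failing command was `npm install`, the patch is to switch tools, and the verify step is to re-run the install with the detected manager.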

### 2. Missing capability / tool gap
The agent needs something that doesn't exist yet — a script, a helper, a wrapper, a glue function. Write it in the moment. This is the closest analog to browser-harness's `agent_helpers.py`.

*Examples:* dedupe a CSV by custom key (write a small Python helper); bootstrap 12 microservices the same way (write `scripts/bootstrap-all.sh`); bulk-rename branches matching a pattern (write a `gh`-based shell helper).
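
For the CSV example, a one-off helper might look like the sketch below. The function names and the choice to keep the first occurrence of each key are assumptions made for illustration; a real heal would match whatever the task actually requires.

```python
import csv
from typing import Callable, Dict, Hashable, Iterable, Iterator

Row = Dict[str, str]


def dedupe_rows(rows: Iterable[Row], key: Callable[[Row], Hashable]) -> Iterator[Row]:
    """Yield rows in order, keeping only the first row seen for each key."""
    seen = set()
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row


def dedupe_csv(src: str, dst: str, key: Callable[[Row], Hashable]) -> int:
    """Write a deduplicated copy of src to dst and return the number of rows kept."""
    with open(src, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames or []
        kept = list(dedupe_rows(reader, key))
    with open(dst, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```

A call such as `dedupe_csv("in.csv", "out.csv", key=lambda r: r["email"].lower())` (hypothetical file and column names) would treat rows as duplicates when their emails match case-insensitively.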

### 3. Environment issue
The local environment isn't what the project expects. Detect, patch, verify.

*Examples:* runtime version mismatch (`nvm use`, `pyenv local`, `rustup override`); stale dependency cache after a branch switch; dirty git state blocking a checkout; missing `.env` (copy from `.env.example` and surface gaps).
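
The missing `.env` case could be handled along these lines. This is a sketch under simplifying assumptions: `ensure_env` and `parse_env` are hypothetical helpers, the parser only understands plain `KEY=VALUE` lines, and a "gap" is taken to mean a key listed in `.env.example` that is absent or empty in `.env`.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def ensure_env(project_dir: str) -> List[str]:
    """Create .env from .env.example if missing; return keys that still need values."""
    root = Path(project_dir)
    env_path, example_path = root / ".env", root / ".env.example"
    if not example_path.is_file():
        raise FileNotFoundError(f"{example_path} not found; nothing to copy from")
    if not env_path.exists():
        shutil.copyfile(example_path, env_path)  # never overwrite an existing .env
    expected = parse_env(example_path.read_text())
    actual = parse_env(env_path.read_text())
    return sorted(key for key in expected if not actual.get(key))
```

The returned list is what gets surfaced to the user; the heal only counts as verified once the originally failing command runs with the resulting `.env`.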

### 4. External service / API change
A service the agent depends on returns something unexpected. Find a workaround and capture it.

*Examples:* an MCP tool returns `InputValidationError` because the schema changed (patch the call shape); a public API hits a rate limit (back off, switch endpoint, batch); an upstream lib bumped a default and broke a script (pin the version).
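
For the rate-limit example, the "back off" part might be sketched as a small retry wrapper with exponential delays. `RateLimited` is a hypothetical exception standing in for whatever error the real client raises, and the default attempt count and delays are illustrative, not values from the skill.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error: the caller's client raises this when the service rate-limits."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting base_delay * 2**attempt between rate-limited attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Only the rate-limit error is retried; any other exception propagates immediately so it can be diagnosed rather than retried until green.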

### 5. About-to-retry-the-same-broken-approach
The agent catches itself about to redo the failing step. That self-recognition is a heal forming — capture the alternate approach as the patch.
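
One way an agent might notice this shape mechanically is to fingerprint each failure and flag repeats. The sketch below is illustrative only: `RetryGuard` and `fingerprint` are hypothetical names, and normalizing digits and whitespace is an assumed heuristic for treating "the same error" as the same.

```python
import re
from collections import Counter
from typing import Tuple


def fingerprint(command: str, error: str) -> Tuple[str, str]:
    """Reduce a failure to a comparable key, ignoring numbers and spacing."""
    normalized = re.sub(r"\d+", "N", error)      # line numbers, PIDs, timings
    normalized = " ".join(normalized.split())    # collapse whitespace
    return (command, normalized)


class RetryGuard:
    """Track failures and report when the same one has been seen before."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._failures: Counter = Counter()

    def record_failure(self, command: str, error: str) -> bool:
        """Record a failure; return True once it has repeated `threshold` times."""
        key = fingerprint(command, error)
        self._failures[key] += 1
        return self._failures[key] >= self.threshold
```

When `record_failure` returns `True`, the agent should stop retrying and enter the heal loop with an alternate approach.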

### Detection signals to watch for

- Non-zero exit codes
- Stack traces in tool output
- The same operation failing twice with the same error
- "I'll try a different approach" — capture it as a heal
- `command not found` / `module not found` / `permission denied`
- Stale assertions, snapshot mismatches, type errors that weren't there before
- "Weird" output that suggests environmental rather than logical bugs
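
Several of these signals can be checked mechanically against a tool's exit code and output. The sketch below is a rough heuristic, not an exhaustive detector: `detect_signals` and the signal names are hypothetical, and the regular expressions cover only a few common message shapes.

```python
import re
from typing import List

# Signal name -> pattern searched for in combined stdout/stderr.
SIGNALS = {
    "command-not-found": re.compile(r"command not found", re.I),
    "module-not-found": re.compile(
        r"ModuleNotFoundError|module not found|Cannot find module", re.I
    ),
    "permission-denied": re.compile(r"permission denied", re.I),
    "stack-trace": re.compile(
        r"Traceback \(most recent call last\)|^\s+at .+:\d+:\d+\)?$", re.M
    ),
}


def detect_signals(exit_code: int, output: str) -> List[str]:
    """Return the names of the heal-worthy signals present in a tool result."""
    found: List[str] = []
    if exit_code != 0:
        found.append("non-zero-exit")
    found.extend(name for name, pattern in SIGNALS.items() if pattern.search(output))
    return found
```

An empty result does not prove the run was healthy; signals such as stale snapshots or "weird" output still need the agent's judgment.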

## HEAL Entry Format

Append to `.learnings/HEALS.md` (create if missing):

```markdown
## [HEAL-YYYYMMDD-XXX] short_kebab_name

**Logged**: ISO-8601 timestamp
**Status**: verified | pending-verify | abandoned
**Trigger**: tool-failure | missing-capability | env-issue | external-change | <free-form>
**Active-Context**: (optional) — current skill, task phase, or workflow stage; omit if not applicable
**Area**: free-form tag — what part of the system (`build`, `tests`, `ci`, `auth`, `data-pipeline`, `mobile`, ...)
**Priority**: low | medium | high | critical

### Failure
What broke — concrete: the command, the error message, the action that was blocked. Include exit codes and verbatim error lines.

### Diagnosis
The root cause as understood after investigation. Why the obvious approach didn't work. Not a guess — what was actually verified during the heal.

### Fix
The patch that was applied. Verbatim commands, code snippets, or pointers to files under `.learnings/heals/<HEAL-ID>/`. Keep it minimal — just enough to reproduce.

### Verification
What was run after the fix and what it returned. Exit code, output snippet, test pass count. **This is the proof.** Without it, the entry is `pending-verify` or `abandoned`.

### Artifacts
(omit this section if no files were generated; otherwise list relative paths under `.learnings/heals/<HEAL-ID>/`)

### Metadata
- Related Files: path/to/file.ext
- See Also: HEAL-... | LRN-... | ERR-... (related entries)
- Pattern-Key: lower.snake.case key for recurrence detection (e.g. `env.lockfile_mismatch`)
- Recurrence-Count: 1
- First-Seen / Last-Seen: YYYY-MM-DD

---
```
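
Because most heals are recurrences, the filing step usually starts by looking up the `Pattern-Key` and bumping `Recurrence-Count` rather than appending a duplicate. A minimal sketch of that lookup follows; `bump_recurrence` is a hypothetical helper, and it assumes entries begin with a `## [HEAL-` heading and use the Metadata bullet layout shown in the template. Updating `Last-Seen` is left out.

```python
import re
from typing import Tuple

ENTRY_START = re.compile(r"(?m)^## \[HEAL-")
COUNT_LINE = re.compile(r"(?m)^(- Recurrence-Count:[ \t]*)(\d+)")


def bump_recurrence(heals_md: str, pattern_key: str) -> Tuple[str, bool]:
    """Increment Recurrence-Count on the first entry whose Pattern-Key matches.

    Returns (text, found). When found is False the text is unchanged and the
    caller should append a new entry instead.
    """
    key_line = re.compile(
        r"(?m)^- Pattern-Key:[ \t]*`?" + re.escape(pattern_key) + r"`?[ \t]*$"
    )
    starts = [m.start() for m in ENTRY_START.finditer(heals_md)]
    ends = starts[1:] + [len(heals_md)]
    for begin, end in zip(starts, ends):
        entry = heals_md[begin:end]
        if key_line.search(entry):
            bumped = COUNT_LINE.sub(
                lambda m: m.group(1) + str(int(m.group(2)) + 1), entry, count=1
            )
            return heals_md[:begin] + bumped + heals_md[end:], True
    return heals_md, False
```

A caller would read `.learnings/HEALS.md`, call `bump_recurrence(text, "env.lockfile_mismatch")`, and write the result back only after the verify step has passed.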

### Field guidance

- **Status** — `verified` = the verify step passed. `pending-verify` = patch applied but couldn't be fully proven (sandboxed/offline

Files: 8

Size: 44.6 KB

Complexity: 72/100

Category: Backend & APIs

Source: https://github.com/pskoett/pskoett-ai-skills/tree/main/plugin/skills/self-healing
