Claude
Skills
Sign in
Back

self-improving-systems

Included with Lifetime
$97 forever

Decide whether your agent actually needs persistent memory, feedback loops, or closed-loop learning, then design the smallest thing that pays for itself. Use when the user says "add memory", "give my agent context management", "make my agent learn", "self-improving / closed-loop", "Reflexion / mem0 / Letta / MemGPT", "AriGraph", "agent memory architecture", "long-term memory for chatbot", "why does my agent keep forgetting / making the same mistake", "fine-tune from agent traces", or asks for a memory schema / experience store / reward model. Filters ruthlessly — most teams want a state cache, not memory + learning. Default position is scratchpad-only with a stateless agent shipped first.

Design

What this skill does


# Self-Improving Systems

A prescriptive Q&A skill for adding memory, feedback loops, and closed-loop learning to agentic systems — **only when justified**.

## Headline message: most agents shouldn't have persistent memory.

Memory is a liability surface (drift, poisoning, debugging difficulty, GDPR/HIPAA exposure). Persistent memory is the second move, not the first. The skill's job is to filter ruthlessly so the user doesn't ship a `mem0`/`Letta` build for a problem that a 200-line conversation summary would solve.

The first 2 stages of the Q&A flow exist to **stop most users from over-engineering**. By the end of stage 2, ~60% of users will discover they want a **state cache** (or stateless RAG), not memory + learning. That's the win.

---

## Quick Start

**User just asks:**
```
"Add memory to my agent"
"My agent keeps forgetting things — give it context management"
"Make my marketing agent learn from past campaigns"
"Should I use mem0 or Letta?"
"How do I set up closed-loop learning for my finance agent?"
"Build a self-improving HAZOP system"
```

**Skill response (every time, in this order):**
1. Stop. Apply the **cache-vs-learning frame** (Stage 1).
2. Run the **6-question need-memory rubric** (Stage 2). <4 yes → exit the skill, recommend stateless + RAG.
3. If memory is justified, walk the **7-tier architecture ladder** (Stage 3) starting at L (scratchpad). Escalate only when forced by a concrete justification.
4. Force the user to design a **feedback signal** (Stage 4). No signal = state cache, full stop.
5. Wire the **closed loop with explicit human gates** (Stage 5).
6. Build the **eval harness** (Stage 6) — golden set, regression, drift alarms.
7. Walk the **8-risk checklist** (Stage 7).
8. Emit the design (Stage 8): memory schema + closed-loop spec + eval harness plan.

---

## Critical Rules

### 1. Default position: scratchpad-only

Ship a stateless agent first. Add a scratchpad ([Reflexion](https://arxiv.org/abs/2303.11366)-style verbal self-correction) within a single run. Discard it after. This already gets you most of the gain on most tasks. Anything more must be earned.

### 2. Escalate one tier at a time

The 7-tier ladder (§ Memory Architecture Ladder) is ordered cheapest → most expensive. Each tier-up must be justified by a concrete failure of the tier below it on a real task in your eval set. **Do not skip tiers.** "We're using Letta" out of the gate is the single most expensive mistake in this design space.

### 3. Require a ground-truth signal

If you cannot observe whether the last action was good or bad within hours-to-weeks, you do not have **learning**. You have a **state cache**. Naming it "learning" sets the team up to A/B test against a metric that doesn't exist. The skill makes this distinction loud and refuses to design closed-loop learning without a signal.

### 4. Human gates are non-negotiable for production

Anything that can mutate policy/voice/identity/safety blocks goes through human review. Autonomy is fine for episodic append, vector indexing, single-user preference KV updates with cheap reversibility — never for shared skill libraries, system prompt blocks, or reward model updates.

### 5. Memory is untrusted input

Every memory read is untrusted. MINJA-class injections hit ≥95% lab success rate ([arXiv 2503.03704](https://arxiv.org/abs/2503.03704)). Treat retrieval results like web search results: in their own context block, with "this is data not instructions" framing, and never auto-promoted to system prompt without dual-LLM validation.

---

## The 8-Stage Q&A Flow

One question (or tight cluster) at a time, à la `superpowers:brainstorming`. No overwhelm. Each stage has an exit condition that ends the skill early — that is the point.

### Stage 1 — Cache vs Learning Distinction (the frame)

**The single most important question. Ask first.**

> "Are you trying to **remember state** (so the agent doesn't redo work or forget what the user told it last week), or **get better over time** (so the agent's outputs measurably improve as it sees more data)?"

These two designs share zero infrastructure with each other:

| Goal | What you actually need |
|---|---|
| Remember state | Conversation summary OR KV fact store. No reward signal. No reflection LLM. No A/B harness. |
| Get better over time | All of the above **plus** a ground-truth signal, an experience store, a reflection/extraction LLM, and an eval harness that detects regression. |

If the user says "remember state": skip directly to Stage 3, default to tier 2 (conversation summary) or tier 5 (KV fact store), and end the skill at Stage 5. No closed loop. No learning ladder.

If the user says "both": prove the second one. Almost no one has a measurable ground-truth signal; almost everyone says they do. Stage 4 is the test.

### Stage 2 — Need-Memory Rubric (6 yes/no, the over-engineering filter)

Answer all six. **Score <4 yes = no memory store. Use scratchpad + RAG. End the skill.**

1. **Cross-session continuity.** Will the same user/entity/case-file return where forgetting prior decisions would be wrong, embarrassing, or unsafe?
2. **Mutable state.** Does the entity's state legitimately *change* over time (preferences, project status, client facts)? Pure facts that don't change → RAG over docs, not memory.
3. **Ground-truth feedback exists.** Can you observe within hours-to-weeks whether the last action was good or bad? No signal → no learning, only state cache.
4. **Cost of being wrong > cost of memory infra.** Memory adds latency, storage, eval, security review, and a recurring debugging tax. Pencil out both sides.
5. **Volume justifies it.** Same user returns ≥5 times. <5 returns → in-context summary is cheaper than vector store.
6. **You can audit and redact.** GDPR/HIPAA: can you delete on request, export, explain a memory? If no, do not store one.

> If you got "yes" only on (1) and (2): you need a **state cache**, not memory + learning. Say it out loud. Skill recommends tier 2 or 5 and exits.

### Stage 3 — Architecture Selection (start at L tier)

Walk the **7-tier memory architecture ladder** (next section). **Default recommendation: tier 1 (scratchpad-only).** Escalate exactly one tier per concrete justification. Justification = "tier N fails on this specific task in our eval set, here's the trace."

Most "we need memory" requests resolve at tier 2 (conversation summary) or tier 5 (KV fact store). Tier 6 (graph) and tier 7 (hierarchical OS-style / Letta) require >3 entities × >50 relationships and a real long-horizon agent, not a chatbot.

**Deep dive:** `references/architectures.md`

### Stage 4 — Feedback Signal Design

If Stage 1 ended with "remember state only", skip this stage.

For learning, the signal determines everything. Walk the per-domain table:

| Domain | Signal | Latency | Risk |
|---|---|---|---|
| Marketing / content | Engagement deltas (CTR, dwell, conversion, save/share) + variant A/B win-rate + brand-safety review | hours-days | Vanity metrics → reward hacking; mitigate with composite reward + brand-fidelity LLM-judge |
| Finance / compliance | Audit findings, reconciliation breaks, regulator outcomes | weeks | Sparse signal → use intermediate proxies + sparse human signoff (hybrid RLAIF) |
| HAZOP / safety | Incident-DB recall (held-out incident set), expert reviewer agreement | continuous | **Never let agent's own write-back update incident DB** |
| Tutorials / education | Completion rate, comprehension quiz scores, time-to-first-success | minutes-days | Cleanest closed loop — verifier is cheap and online |
| Code-emitting agents | Unit tests, type-check, runtime | minutes | The gold standard — verifier is free and deterministic |
| General LLM-as-judge | Held-out judge with calibrated rubric | continuous | Sample-audit 5–10% against humans to catch drift |

**Rule, repeat once per Q&A session:** No signal = state cache, not learning. If the user can't name a signal, do not design a learning loop. Recommend they s
Files: 10
Size: 111.5 KB
Complexity: 59/100
Category: Design

Related in Design