self-improving-systems
Decide whether your agent actually needs persistent memory, feedback loops, or closed-loop learning, then design the smallest thing that pays for itself. Use when the user says "add memory", "give my agent context management", "make my agent learn", "self-improving / closed-loop", "Reflexion / mem0 / Letta / MemGPT", "AriGraph", "agent memory architecture", "long-term memory for chatbot", "why does my agent keep forgetting / making the same mistake", "fine-tune from agent traces", or asks for a memory schema / experience store / reward model. Filters ruthlessly — most teams want a state cache, not memory + learning. Default position is scratchpad-only with a stateless agent shipped first.
What this skill does
# Self-Improving Systems

A prescriptive Q&A skill for adding memory, feedback loops, and closed-loop learning to agentic systems, **only when justified**.

## Headline message: most agents shouldn't have persistent memory.

Memory is a liability surface (drift, poisoning, debugging difficulty, GDPR/HIPAA exposure). Persistent memory is the second move, not the first. The skill's job is to filter ruthlessly so the user doesn't ship a `mem0`/`Letta` build for a problem that a 200-line conversation summary would solve.

The first 2 stages of the Q&A flow exist to **stop most users from over-engineering**. By the end of stage 2, ~60% of users will discover they want a **state cache** (or stateless RAG), not memory + learning. That's the win.

---

## Quick Start

**User just asks:**

```
"Add memory to my agent"
"My agent keeps forgetting things - give it context management"
"Make my marketing agent learn from past campaigns"
"Should I use mem0 or Letta?"
"How do I set up closed-loop learning for my finance agent?"
"Build a self-improving HAZOP system"
```

**Skill response (every time, in this order):**

1. Stop. Apply the **cache-vs-learning frame** (Stage 1).
2. Run the **6-question need-memory rubric** (Stage 2). <4 yes → exit the skill, recommend stateless + RAG.
3. If memory is justified, walk the **7-tier architecture ladder** (Stage 3) starting at tier 1 (scratchpad). Escalate only when forced by a concrete justification.
4. Force the user to design a **feedback signal** (Stage 4). No signal = state cache, full stop.
5. Wire the **closed loop with explicit human gates** (Stage 5).
6. Build the **eval harness** (Stage 6): golden set, regression, drift alarms.
7. Walk the **8-risk checklist** (Stage 7).
8. Emit the design (Stage 8): memory schema + closed-loop spec + eval harness plan.

---

## Critical Rules

### 1. Default position: scratchpad-only

Ship a stateless agent first.
Add a scratchpad ([Reflexion](https://arxiv.org/abs/2303.11366)-style verbal self-correction) within a single run. Discard it after. This already gets you most of the gain on most tasks. Anything more must be earned.

### 2. Escalate one tier at a time

The 7-tier ladder (§ Memory Architecture Ladder) is ordered cheapest → most expensive. Each tier-up must be justified by a concrete failure of the tier below it on a real task in your eval set. **Do not skip tiers.** "We're using Letta" out of the gate is the single most expensive mistake in this design space.

### 3. Require a ground-truth signal

If you cannot observe whether the last action was good or bad within hours-to-weeks, you do not have **learning**. You have a **state cache**. Naming it "learning" sets the team up to A/B test against a metric that doesn't exist. The skill makes this distinction loud and refuses to design closed-loop learning without a signal.

### 4. Human gates are non-negotiable for production

Anything that can mutate policy/voice/identity/safety blocks goes through human review. Autonomy is fine for episodic append, vector indexing, and single-user preference KV updates with cheap reversibility, but never for shared skill libraries, system prompt blocks, or reward model updates.

### 5. Memory is untrusted input

Every memory read is untrusted. MINJA-class injections hit ≥95% lab success rate ([arXiv 2503.03704](https://arxiv.org/abs/2503.03704)). Treat retrieval results like web search results: in their own context block, with "this is data not instructions" framing, and never auto-promoted to system prompt without dual-LLM validation.

---

## The 8-Stage Q&A Flow

One question (or tight cluster) at a time, à la `superpowers:brainstorming`. No overwhelm. Each stage has an exit condition that ends the skill early; that is the point.

### Stage 1: Cache vs Learning Distinction (the frame)

**The single most important question.
Ask first.**

> "Are you trying to **remember state** (so the agent doesn't redo work or forget what the user told it last week), or **get better over time** (so the agent's outputs measurably improve as it sees more data)?"

These two goals call for very different amounts of infrastructure:

| Goal | What you actually need |
|---|---|
| Remember state | Conversation summary OR KV fact store. No reward signal. No reflection LLM. No A/B harness. |
| Get better over time | All of the above **plus** a ground-truth signal, an experience store, a reflection/extraction LLM, and an eval harness that detects regression. |

If the user says "remember state": skip directly to Stage 3, default to tier 2 (conversation summary) or tier 5 (KV fact store), and end the skill at Stage 5. No closed loop. No learning ladder.

If the user says "both": prove the second one. Almost no one has a measurable ground-truth signal; almost everyone says they do. Stage 4 is the test.

### Stage 2: Need-Memory Rubric (6 yes/no, the over-engineering filter)

Answer all six. **Score <4 yes = no memory store. Use scratchpad + RAG. End the skill.**

1. **Cross-session continuity.** Will the same user/entity/case-file return where forgetting prior decisions would be wrong, embarrassing, or unsafe?
2. **Mutable state.** Does the entity's state legitimately *change* over time (preferences, project status, client facts)? Pure facts that don't change → RAG over docs, not memory.
3. **Ground-truth feedback exists.** Can you observe within hours-to-weeks whether the last action was good or bad? No signal → no learning, only state cache.
4. **Cost of being wrong > cost of memory infra.** Memory adds latency, storage, eval, security review, and a recurring debugging tax. Pencil out both sides.
5. **Volume justifies it.** Same user returns ≥5 times. <5 returns → in-context summary is cheaper than vector store.
6. **You can audit and redact.** GDPR/HIPAA: can you delete on request, export, explain a memory?
If no, do not store one.

> If you got "yes" only on (1) and (2): you need a **state cache**, not memory + learning. Say it out loud. Skill recommends tier 2 or 5 and exits.

### Stage 3: Architecture Selection (start at tier 1)

Walk the **7-tier memory architecture ladder** (next section). **Default recommendation: tier 1 (scratchpad-only).** Escalate exactly one tier per concrete justification. Justification = "tier N fails on this specific task in our eval set, here's the trace."

Most "we need memory" requests resolve at tier 2 (conversation summary) or tier 5 (KV fact store). Tier 6 (graph) and tier 7 (hierarchical OS-style / Letta) require >3 entities × >50 relationships and a real long-horizon agent, not a chatbot.

**Deep dive:** `references/architectures.md`

### Stage 4: Feedback Signal Design

If Stage 1 ended with "remember state only", skip this stage. For learning, the signal determines everything. Walk the per-domain table:

| Domain | Signal | Latency | Risk |
|---|---|---|---|
| Marketing / content | Engagement deltas (CTR, dwell, conversion, save/share) + variant A/B win-rate + brand-safety review | hours-days | Vanity metrics → reward hacking; mitigate with composite reward + brand-fidelity LLM-judge |
| Finance / compliance | Audit findings, reconciliation breaks, regulator outcomes | weeks | Sparse signal → use intermediate proxies + sparse human signoff (hybrid RLAIF) |
| HAZOP / safety | Incident-DB recall (held-out incident set), expert reviewer agreement | continuous | **Never let agent's own write-back update incident DB** |
| Tutorials / education | Completion rate, comprehension quiz scores, time-to-first-success | minutes-days | Cleanest closed loop: verifier is cheap and online |
| Code-emitting agents | Unit tests, type-check, runtime | minutes | The gold standard: verifier is free and deterministic |
| General LLM-as-judge | Held-out judge with calibrated rubric | continuous | Sample-audit 5–10% against humans to catch drift |

**Rule,
repeat once per Q&A session:** No signal = state cache, not learning. If the user can't name a signal, do not design a learning loop. Recommend they s
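
As a rough illustration, the Stage 2 rubric and the "no signal = state cache" rule can be sketched as a small triage function. The names `RubricAnswers` and `triage` are hypothetical and not part of the skill. The source also does not say whether the "yes only on (1) and (2)" case or the "<4 yes" threshold takes precedence; this sketch assumes the state-cache case is checked first.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricAnswers:
    """One boolean per Stage 2 question, in the order the rubric lists them.

    Hypothetical container for illustration; field names are assumptions.
    """
    cross_session_continuity: bool      # question 1
    mutable_state: bool                 # question 2
    ground_truth_feedback: bool         # question 3
    cost_of_error_exceeds_infra: bool   # question 4
    volume_justifies: bool              # question 5
    can_audit_and_redact: bool          # question 6


def triage(answers: RubricAnswers) -> str:
    """Map the six yes/no answers to a recommendation string."""
    values = [getattr(answers, f.name) for f in fields(answers)]
    score = sum(values)  # number of "yes" answers

    # Assumption: "yes only on (1) and (2)" is checked before the
    # generic threshold, since the source lists both rules separately.
    if values == [True, True, False, False, False, False]:
        return "state cache (tier 2 or 5)"

    # Score <4 yes: no memory store, use scratchpad + RAG, end the skill.
    if score < 4:
        return "stateless: scratchpad + RAG"

    # No observable signal means a state cache, not learning.
    if not answers.ground_truth_feedback:
        return "state cache (no feedback signal, so no learning loop)"

    return "memory justified: continue to Stage 3 at tier 1"
```

For example, `triage(RubricAnswers(True, True, True, True, False, False))` scores 4 with a feedback signal present, so it returns `"memory justified: continue to Stage 3 at tier 1"`. Returning a plain string keeps the sketch short; a real implementation would more likely return a structured decision.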