am2rican5:test-driven-autonomous-dev
Design test suites and feedback loops that enable Claude agents to work autonomously with high confidence. Covers oracle-based testing, deterministic sampling, tiered test design, and structured error output for agent consumption. Use when user says "design tests for autonomous agents", "oracle testing", "test-driven autonomous development", "agent feedback loop", "tests for parallel agents", or "design test suite for autonomous work". Do NOT use for running existing tests or executing test commands (use your test runner directly) or for parallel agent coordination (use parallel-agent-orchestration instead).
What this skill does
# Test-Driven Autonomous Development

## Critical Rules

- NEVER skip baseline measurement: always establish what "correct" looks like before agents start changing things
- NEVER let agents make changes without a way to verify correctness: if there's no test, there's no autonomy
- ALWAYS structure test output so agents can interpret failures without human help
- WHEN designing tests, every failure message MUST include: what was expected, what happened, and where to look
- WHEN no reference implementation exists, HELP the user create one before proceeding
- NEVER optimize test speed at the cost of correctness: fast tests that miss bugs are worse than no tests

## Instructions

### Step 1: Identify the Oracle

An oracle is a source of truth that tells you what "correct" looks like. Find or create one:

**Option A: Reference Implementation (Best)**

If a known-correct implementation exists (e.g., GCC for a compiler, a legacy system being replaced, a spec with reference outputs):

1. Set up the reference so it can be invoked programmatically
2. Create a harness that runs both the reference and the system under test on the same input
3. Diff the outputs: any difference is a failure

**Option B: Snapshot/Golden File Testing**

If no reference exists but you have known-correct outputs:

1. Run the current (correct) system and capture outputs as golden files
2. After agents make changes, compare new outputs against golden files
3. Differences require manual review or explicit approval

**Option C: Property-Based Testing**

If correct outputs aren't known but invariants are:

1. Define properties that must always hold (e.g., "output is valid JSON", "response time < 500ms", "no data loss")
2. Generate random inputs and verify properties hold
3. Agents can work autonomously as long as no property violations occur

**Option D: Human-Verified Seed Tests**

If nothing above works:

1. Create a small set of manually verified input/output pairs
2. These serve as regression anchors
3. Agents can work but must not break seed tests
4. Expand the seed set as confidence grows

Present options to the user. IF no oracle can be identified → warn that full autonomy is risky and recommend shorter agent sessions with human checkpoints.

### Step 2: Design Test Tiers

Organize tests from fast/narrow to slow/comprehensive:

**Tier 1: Unit Tests (seconds)**

- Test individual functions or components in isolation
- Run after every change
- Must complete in <30 seconds total
- Purpose: catch obvious breakage immediately

**Tier 2: Integration Tests (minutes)**

- Test interactions between components
- Run after a batch of related changes
- Must complete in <5 minutes
- Purpose: catch interface mismatches and data flow issues

**Tier 3: System Tests (minutes to hours)**

- Test the full system end-to-end against the oracle
- Run after a milestone or before merging
- May take longer but must be comprehensive
- Purpose: final validation that the whole system works

For each tier, define:

1. What inputs to use
2. What outputs to check
3. Pass/fail criteria
4. How to run (command or script)

### Step 3: Create a Fast Subset for Iteration

Agents need rapid feedback. Create a "smoke test" subset:

1. From each tier, select the most representative tests
2. Aim for <60 seconds total runtime
3. Cover the critical paths: if these pass, the system is probably working
4. Include at least one test per major component

#### Deterministic Sampling Strategy

WHEN the full test suite has N tests:

- IF N < 50 → run all of them as the fast subset
- IF N is 50-500 → select ~10% covering each component, prioritize tests that have caught bugs before
- IF N > 500 → select a fixed set of ~50 covering all components, plus any test that failed in the last 5 runs

The fast subset should be runnable via a single command (e.g., `make test-fast` or `npm test -- --suite=smoke`).

### Step 4: Structure Error Output for Agent Consumption

Test failures must be machine-readable AND agent-actionable.
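As a rough illustration of "machine-readable and agent-actionable", the sketch below models one failure as a Python dataclass and serializes it to JSON. It covers the fields this step calls for (test name, component, expected vs actual, diff, investigation path, context), but the field names, the test name, and the file path are all assumptions made for the example. It is not the template from `references/error-output-format.md`, which is not reproduced here.

```python
import difflib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class FailureRecord:
    """One test failure, shaped for an agent to act on (field names are assumed)."""

    test_name: str
    component: str
    expected: str
    actual: str
    investigation_path: List[str]  # files or functions to inspect first
    context: Dict[str, str] = field(default_factory=dict)

    def diff(self) -> str:
        """Unified diff of expected vs. actual output, line by line."""
        lines = difflib.unified_diff(
            self.expected.splitlines(),
            self.actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
        return "\n".join(lines)

    def to_json(self) -> str:
        """Serialize the record, plus the computed diff, as one JSON object."""
        payload = asdict(self)
        payload["diff"] = self.diff()
        return json.dumps(payload, indent=2, sort_keys=True)


# All values below are hypothetical and exist only to exercise the record.
record = FailureRecord(
    test_name="test_add_two_ints",
    component="codegen",
    expected="result: 5\n",
    actual="result: 6\n",
    investigation_path=["src/codegen/arith.py"],
    context={"tier": "unit", "input": "2 + 3"},
)
print(record.to_json())
```

Emitting one JSON object per failure lets an agent parse the expectation, the observed result, and the suggested starting point without reading a stack trace.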
Every failure should include: test name, component, expected vs actual, diff, investigation path, and context. See `references/error-output-format.md` for the full structured format template and test runner configuration guidance.

### Step 5: Wire It All Together

Create a test runner script that agents will use:

1. `test-fast`: runs the fast subset, returns structured output
2. `test-full`: runs all tiers sequentially, returns structured output
3. `test-against-oracle`: runs oracle comparison (if applicable)

Each command should:

- Exit 0 on all pass, non-zero on any failure
- Output structured failure information (Step 4 format)
- Report summary: `PASS: N, FAIL: M, SKIP: K`

Configure agents to run `test-fast` after every change and `test-full` before committing.

## Examples

### Example: Designing Tests for a Compiler Project

User says: "I'm building a C compiler and want agents to work on it autonomously"

Result: GCC used as oracle, 3 test tiers (unit/integration/system), fast subset of 25 tests running in 20 seconds, structured error output pointing agents to specific codegen functions.

### Example: Tests for an API Migration

User says: "Migrate REST API from v1 to v2, agents should handle each endpoint"

Result: v1 API used as oracle on staging, schema validation per endpoint as fast subset (10 seconds), behavioral parity tests for full validation.

See `references/examples.md` for detailed walkthroughs of both scenarios.

## Troubleshooting

### Flaky tests undermine agent confidence

**Cause:** Non-deterministic tests (timing, ordering, external dependencies) that sometimes pass and sometimes fail.

**Solution:** Quarantine flaky tests out of the fast subset. Fix them separately. Agents should only run deterministic tests autonomously: a flaky failure wastes an entire agent cycle investigating a non-bug.

### Oracle drift

**Cause:** The reference implementation was updated but test expectations weren't.

**Solution:** Pin the oracle version. When updating, re-run all tests to regenerate expected outputs. Keep the oracle version in a config file that agents can check.

### Slow feedback loops killing agent productivity

**Cause:** Fast subset is too large or includes slow tests.

**Solution:** Profile test runtime. Move anything over 5 seconds out of the fast subset. Consider running slow tests in a separate background agent that validates while the main agent continues working.

### Agent can't interpret test failures

**Cause:** Test output is a raw stack trace with no actionable context.

**Solution:** Wrap the test runner with a harness that parses failures and adds the structured format from Step 4. Even a simple shell script that greps for FAIL lines and adds component/file information helps.
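The wrapper idea in the last troubleshooting item can be sketched in Python instead of shell. Everything specific here is an assumption for illustration: the raw line format (`PASS: name`, `SKIP: name`, `FAIL: name (path)`), the `COMPONENT_BY_DIR` mapping, and the test names and paths. A real runner's output will differ, so the regular expression would need adapting.

```python
import json
import re
from typing import Dict, List

# Assumed raw format: "PASS: name", "SKIP: name", or "FAIL: name (path/to/file)".
LINE_RE = re.compile(r"^(PASS|FAIL|SKIP): (\S+)(?: \((.+)\))?$")

# Hypothetical mapping from source directory to component name.
COMPONENT_BY_DIR: Dict[str, str] = {
    "src/parser": "parser",
    "src/codegen": "codegen",
}


def component_for(path: str) -> str:
    """Return the component whose directory prefixes the path, else 'unknown'."""
    for directory, component in COMPONENT_BY_DIR.items():
        if path.startswith(directory + "/"):
            return component
    return "unknown"


def summarize(raw_output: str) -> dict:
    """Parse raw runner output into counts plus one structured entry per failure."""
    counts = {"PASS": 0, "FAIL": 0, "SKIP": 0}
    failures: List[dict] = []
    for line in raw_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip stack traces and any other unrecognized lines
        status, name, path = match.groups()
        counts[status] += 1
        if status == "FAIL":
            path = path or ""
            failures.append(
                {"test": name, "file": path, "component": component_for(path)}
            )
    summary = "PASS: {PASS}, FAIL: {FAIL}, SKIP: {SKIP}".format(**counts)
    # Mirror the Step 5 convention: 0 when everything passed, non-zero otherwise.
    return {"summary": summary, "failures": failures, "exit_code": 1 if failures else 0}


raw = """\
PASS: test_lex_ident
FAIL: test_emit_add (src/codegen/arith.py)
Traceback (most recent call last):
SKIP: test_slow_link
"""
report = summarize(raw)
print(json.dumps(report, indent=2))
```

In practice the `raw` string would come from the real test command's captured output, and the wrapper would exit with `report["exit_code"]` so the Step 5 exit-code convention still holds.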