agent-artifex:implement

Use when the user wants to improve an existing MCP server, agent, chatbot, or tool-calling system. This includes: improving tool descriptions, fixing error messages, adding output schemas, writing tests, implementing quality checks, adding evals, setting up test harnesses, or any task where they say "help me improve", "fix my descriptions", "add tests", "write evals", "implement quality checks", "make my server better", "apply the design principles", or are ready to make code changes to improve quality. This skill covers both design application (making the code better) and test implementation (verifying the code is good). For scaffolding new projects, use claude-api:mcp-builder. For design principles without code changes, use agent-artifex:design.

# agent-artifex:implement — AI Services Implementation Guide

## When to Use

This is the hands-on improvement skill. It covers both applying design principles to make code better AND writing tests to verify code quality. Use it whenever the user is ready to make changes — whether that means rewriting tool descriptions, restructuring error messages, adding output schemas, writing test harnesses, or building eval pipelines.

Cross-references:
- Scaffolding new projects → `claude-api:mcp-builder`
- Design principles without code changes → `agent-artifex:design`
- Gap diagnosis → `agent-artifex:assess`

---

## On Invocation

Start by understanding what the user needs:

1. **Determine the work type:**
   - **Design application**: Improving tool descriptions, fixing error messages, adding schemas, restructuring tool sets → read the corresponding design reference (`agent-artifex/skills/design/references/`)
   - **Test implementation**: Writing quality checks, evals, test harnesses → read the corresponding testing reference (`references/`)
   - **Both**: Improve the code AND add tests (the ideal flow)
2. **What are you building?** MCP server? Agent? Chatbot? All three?
3. **Which area?** If not specified, determine from context:
   - Building/modifying tool definitions → Tool Description Design / Quality
   - Validating tool call results → Server Correctness
   - Testing whether the foundation model (FM) picks the right tool → Agent Behavior
   - Verifying the final answer to the user → Response Accuracy
   - Testing multi-turn conversations → Chatbot Integration
   - Fixing error messages → Error Message Design
   - Restructuring parameters → Parameter & Schema Design
   - Optimizing system prompts → System Prompt Design
   - Improving multi-turn handling → Multi-Turn Conversation Design
   - Reorganizing tool sets → Tool Set Architecture
   - Standardizing output formats → Response Format Design
4. **What's the tech stack?** TypeScript/Python/Go? Which test framework? MCP SDK version?

Then **read the relevant reference files** before writing any code.

---

## Reference Files

### Design references (for applying improvements)

Read these when making code changes to improve quality. Each file contains principles, patterns, anti-patterns, and concrete guidance for one design area.

| Design Area | Reference File | What it contains |
|---|---|---|
| Tool Description Design | `agent-artifex/skills/design/references/tool-descriptions.md` | Six-component rubric, structural markers, augmentation patterns, domain-specific guidance |
| Parameter & Schema Design | `agent-artifex/skills/design/references/parameter-schema.md` | `.describe()` patterns, output schema design, argument count guidance, naming conventions |
| Error Message Design | `agent-artifex/skills/design/references/error-messages.md` | Problem/input/why/recovery structure, anti-patterns, `isError` usage, cross-references in recovery |
| System Prompt Design | `agent-artifex/skills/design/references/system-prompts.md` | Knowledge placement, ordering instructions, prompt sizing, collision avoidance |
| Multi-Turn Conversation Design | `agent-artifex/skills/design/references/multi-turn.md` | Result trimming, stable ID patterns, pagination, context pressure mitigation |
| Tool Set Architecture | `agent-artifex/skills/design/references/tool-set-architecture.md` | Dynamic discovery, cross-references, tool splitting, token footprint management |
| Response Format Design | `agent-artifex/skills/design/references/response-format.md` | Field naming, pagination patterns, fact vs. narrative, schema consistency |

### Testing references (for writing tests)

Read these when writing test code, assertions, harness setup, or eval pipelines. Each file contains working code examples, prompt templates, regex patterns, and pass/fail criteria.

| Testing Area | Reference File | What it contains |
|---|---|---|
| Tool Description Quality | `references/tool-descriptions.md` | Tier 1 code examples (all 5 checks with regex), Tier 2 FM scoring prompt template, multi-model jury setup, pass/fail criteria |
| Server Correctness | `references/server-correctness.md` | Schema validation (Ajv/jsonschema), error anti-pattern regex, golden-file patterns, FM recovery 4-step procedure |
| Agent Behavior | `references/agent-behavior.md` | Scenario design with examples, recorded replay (TestProvider pattern), live evaluation 4-step procedure, grading guidance |
| Response Accuracy | `references/response-accuracy.md` | Closed-loop harness 5 steps, claim decomposition with LLM prompt templates, DeepMind FACTS two-phase evaluation |
| Chatbot Integration | `references/chatbot-testing.md` | 5 coreference categories, 5 workflow patterns, 6 scenario categories, 4 conflict types, 6 degradation failure modes |

The canonical source documents with full evidence and footnotes are in `docs/ai-services/`.

---

## Design Application by Area

### Tool Description Design

**What to look for:**
- Descriptions shorter than 4 sentences
- Missing Usage Guidelines (89.3% prevalence — "use this when", "do not use", "instead use")
- Vague Limitations that hurt more than help (removing bad limitations improved success rate (SR) by 10 percentage points)
- No cross-references between related or confusable tools

**What to change:**
- Add a Purpose statement: what the tool does, what it returns, and its behavioral characteristics
- Add Usage Guidelines with domain-specific cues: when to use, when NOT to use, what to use instead
- Make Limitations concrete and actionable, or remove them entirely if they are vague
- Add inter-tool cross-references: "Use `tool_x` instead when [condition]" or "Often used after `tool_y`"

**How to verify:**
- Run Tier 1 structural checks: sentence count >= 3, regex markers for Usage Guidelines and Limitations present
- Check that every related tool pair has at least one cross-reference
- Optionally run Tier 2 FM-scored rubric: all six component means >= 3 across a 3-model jury
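
The Tier 1 structural checks and the cross-reference check above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's reference implementation: the usage-guideline phrases come from the list above, while the limitation cues, the sentence splitter, and the function names (`count_sentences`, `tier1_check`, `missing_cross_references`) are assumptions made for this example.

```python
import re

# Usage-guideline phrases are taken from the guidance above.
USAGE_MARKERS = re.compile(r"\b(use this when|do not use|instead use)\b", re.IGNORECASE)
# Limitation cues are illustrative assumptions; tune them for your domain.
LIMITATION_MARKERS = re.compile(
    r"\b(limitation|does not support|cannot|only works)\b", re.IGNORECASE
)


def count_sentences(text):
    """Rough sentence count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p.strip()])


def tier1_check(description, min_sentences=3):
    """Run the structural checks on a single tool description."""
    return {
        "enough_sentences": count_sentences(description) >= min_sentences,
        "has_usage_guidelines": bool(USAGE_MARKERS.search(description)),
        "has_limitations": bool(LIMITATION_MARKERS.search(description)),
    }


def missing_cross_references(tools, related_pairs):
    """Return related pairs where neither description names the other tool.

    tools maps tool name -> description; related_pairs is a list of (name, name).
    """
    missing = []
    for a, b in related_pairs:
        if b not in tools[a] and a not in tools[b]:
            missing.append((a, b))
    return missing
```

A test suite could call `tier1_check` on every registered tool and fail when any value is `False`. The splitter is deliberately naive (abbreviations such as "e.g." inflate the count), so treat the sentence check as a coarse gate rather than a precise measure.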

### Parameter & Schema Design

**What to look for:**
- Missing `.describe()` annotations on Zod schemas or missing `description` fields in JSON Schema
- No `outputSchema` declared (server returns unstructured text only)
- More than 20 parameters on a single tool (out-of-distribution for FM training)
- Generic parameter names like `data`, `input`, `value`, `options` without clarifying descriptions

**What to change:**
- Add type, meaning, behavioral effect, and default value to every parameter description
- Add `outputSchema` declarations so servers return `structuredContent`
- Rename ambiguous parameters or add descriptions that disambiguate
- For tools with > 20 parameters, consider splitting into multiple tools or using nested objects

**How to verify:**
- Check all `inputSchema.properties` entries have non-empty, non-trivial descriptions
- Verify `outputSchema` is declared and `structuredContent` conforms to it
- Count arguments per tool; flag any exceeding 20
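
As a rough sketch of the first and third checks, plus the presence half of the second, the function below inspects one tool definition held as a plain dictionary with `inputSchema` and `outputSchema` keys. The function name `check_tool_schema` and the `min_description_chars` threshold for "non-trivial" are assumptions for illustration. Validating `structuredContent` against the schema needs a JSON Schema validator and is not shown here.

```python
MAX_PARAMS = 20  # threshold from the guidance above


def check_tool_schema(tool, min_description_chars=10):
    """Return a list of findings for one tool definition (empty list means clean).

    min_description_chars is an assumed cutoff for a "trivial" description.
    """
    findings = []
    props = tool.get("inputSchema", {}).get("properties", {})

    for name, schema in props.items():
        desc = (schema.get("description") or "").strip()
        if not desc:
            findings.append(f"{name}: missing description")
        elif len(desc) < min_description_chars or desc.lower() == name.lower():
            # Very short text, or text that only repeats the parameter name.
            findings.append(f"{name}: trivial description")

    if len(props) > MAX_PARAMS:
        findings.append(f"{len(props)} parameters exceeds {MAX_PARAMS}")

    if "outputSchema" not in tool:
        findings.append("no outputSchema declared")

    return findings
```

Running this over every tool returned by the server's tool listing gives a quick report of which definitions still need parameter descriptions or an output schema.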

### Error Message Design

**What to look for:**
- Stack traces leaking to the FM (`Error at`, `at function_name (`)
- Raw exception class names (`TypeError:`, `ReferenceError:`)
- Error messages shorter than 20 characters
- No recovery actions — the FM receives an error but no guidance on what to do next

**What to change:**
- Structure errors with: what went wrong, which input caused it, why it failed, and what to try instead
- Add tool cross-references in recovery actions: "Try `tool_x` with [adjusted args]"
- Set `isError: true` on all error responses so the FM knows the call failed
- Remove internal implementation details; replace with user/FM-facing language
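
One way to apply this structure is a small helper that assembles the four parts into a tool result. The sketch below assumes the result is a dictionary with a `content` list of text items and an `isError` flag. The helper name `build_error_result`, its parameters, and the tool name `list_customers` in the usage lines are hypothetical.

```python
def build_error_result(problem, bad_input, reason, recovery):
    """Assemble a structured error: what went wrong, which input, why, what to try next."""
    text = (
        f"{problem} "
        f"Input: {bad_input}. "
        f"Reason: {reason} "
        f"Next step: {recovery}"
    )
    # isError tells the FM that the call failed rather than returned data.
    return {"content": [{"type": "text", "text": text}], "isError": True}


# Example usage with a hypothetical lookup failure.
result = build_error_result(
    problem="Customer lookup failed.",
    bad_input="customer_id='cus_999'",
    reason="No customer with that ID exists.",
    recovery="Try `list_customers` to find a valid ID, then call this tool again.",
)
```

Keeping the message assembly in one place makes it harder for individual handlers to fall back to raw exception text.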

**How to verify:**
- Regex checks: no matches for `/Error\s+at\s/`, `/at\s+\w+\s+\(/`, `/^(TypeError|ReferenceError|Error):/`
- Length checks: all error messages >= 20 characters
- Recovery action presence: error text contains actionable guidance (not just "failed")
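
The three checks above translate directly into Python's `re` module. In this sketch the anti-pattern expressions are the ones listed above, and the length rule flags messages shorter than 20 characters as in the "What to look for" list. The recovery cue words and the function name `check_error_message` are assumptions for illustration, and a keyword match is only a weak proxy for genuinely actionable guidance.

```python
import re

# Anti-pattern expressions from the verification list above.
ANTI_PATTERNS = [
    re.compile(r"Error\s+at\s"),
    re.compile(r"at\s+\w+\s+\("),
    re.compile(r"^(TypeError|ReferenceError|Error):"),
]
# Assumed cue words that suggest the message tells the FM what to do next.
RECOVERY_CUES = re.compile(r"\b(try|use|retry|check|provide)\b", re.IGNORECASE)
MIN_LENGTH = 20


def check_error_message(message):
    """Return a list of findings for one error message (empty list means clean)."""
    findings = []
    if any(pattern.search(message) for pattern in ANTI_PATTERNS):
        findings.append("leaks stack trace or exception class")
    if len(message) < MIN_LENGTH:
        findings.append("too short")
    if not RECOVERY_CUES.search(message):
        findings.append("no recovery guidance")
    return findings
```

A test harness can trigger each known failure path on the server, pass the returned text through `check_error_message`, and assert that the findings list is empty.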

### System Prompt Design

**What to look for:**
- Domain knowledge duplicated between system prompt and tool descriptions
