eval-design

Included with Lifetime

$97 forever

This skill should be used when the user asks "what evals can we create", "how do I evaluate this", "design an eval", "create evals for", "how do I know if my LLM is working", "measure quality", or mentions evals, evaluation, scoring rubrics, golden datasets, LLM-as-judge, quality metrics, or judge prompts.

Design

What this skill does


# Eval Design

Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.

**Announce at start:** "I'm using the eval-design skill to help design your evaluation."

## When to Use

```dot
digraph when_to_use {
  "User question" [shape=box];
  "About measuring LLM output quality?" [shape=diamond];
  "Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
  "Not this skill" [shape=box];

  "User question" -> "About measuring LLM output quality?";
  "About measuring LLM output quality?" -> "Use this skill" [label="yes"];
  "About measuring LLM output quality?" -> "Not this skill" [label="no"];
}
```

**Use for:**
- Designing new evaluations
- Choosing between eval approaches
- Creating scoring rubrics or judge prompts
- Building golden datasets
- Questions like "how do I know if my LLM is working?"

**Not for:**
- Langfuse setup, tracing, or observability (use the official `langfuse` skill)
- Implementing evals (output spec to coding agent)
- Generic testing that isn't LLM-specific

## The Process

```dot
digraph eval_design_flow {
  rankdir=TB;

  understand [label="Understand\nthe System" shape=box];
  failures [label="Identify\nFailure Modes" shape=box];
  match [label="Match Eval Type\nto Problem" shape=box];
  design [label="Design\nthe Eval" shape=box];
  output [label="Output\nSpec" shape=box];

  understand -> failures -> match -> design -> output;
}
```

Ask questions ONE AT A TIME. Adapt depth based on user's experience level.

## Phase 1: Understand the System

Ask about:
- What does the LLM application do? (summarisation, Q&A, agent, chat)
- What's already instrumented in Langfuse? (traces, spans, generations)
- What does a "good" output look like for users?

## Phase 2: Identify Failure Modes

Key questions:
- What failures have you seen in production?
- Have you done error analysis on real traces?
- Which failures have the highest business impact?

**Critical:** Ground eval design in ACTUAL failures, not hypothetical concerns. If user hasn't analysed real failures, recommend starting there.

## Phase 3: Match Eval Type to Problem

| Problem Type | Recommended Eval |
|--------------|------------------|
| Format validation (JSON, dates) | Code-based |
| Factual accuracy with known answers | Code-based + match |
| Tone, helpfulness, coherence | LLM-as-judge |
| Safety, toxicity | LLM-as-judge |
| RAG retrieval quality | LLM-as-judge per chunk |
| High-stakes decisions | Human → LLM-judge |
| Novel/unclear failure modes | Human annotation first |

Consult `references/eval-types.md` for detailed guidance on each type.

## Phase 4: Design the Eval

For each eval, define:
1. **Success criteria** - prefer binary PASS/FAIL over Likert scales
2. **Rubric or logic** - specific, unambiguous criteria
3. **Dataset requirements** - happy path + edge cases + adversarial
4. **Integration** - when/where to run

Use `references/judge-prompts.md` for LLM-as-judge prompt templates.

## Output Format

Output a structured spec using the template in `references/document-template.md`.

The spec should be complete enough that a coding agent can implement it using the Langfuse SDK without further clarification.

## Principles for Great Evals

### Start from Error Analysis

Examine real failures, not hypothetical concerns. Generic metrics ("hallucination score") without grounding in actual user complaints are noise.

Questions to surface:
- "What would a human expert check when reviewing this output?"
- "If this eval passes but users are unhappy, what did we miss?"

### Match Tool to Task

Don't over-engineer:
- Code-based evals are faster, cheaper, more deterministic
- Only use LLM-as-judge when judgment is genuinely subjective
- Human annotation for calibration, not production eval

### Binary PASS/FAIL Over Likert

Rating 1-5 creates noise and inconsistency. Binary forces clarity:
- What exactly makes something PASS?
- What exactly makes something FAIL?

### Cover the Right Cases

Dataset should include:
- **Happy path**: Common, expected inputs
- **Edge cases**: Boundaries, unusual but valid inputs
- **Adversarial**: Inputs designed to break the system

### Calibrate LLM Judges

Before deploying LLM-as-judge:
1. Create human-labelled examples (20-30)
2. Run judge on same examples
3. Measure agreement (target >85%)
4. Adjust prompt until calibrated

## Red Flags to Watch For

Push back when you see:
- Jumping to LLM-as-judge when code-based would suffice
- Likert scales instead of binary
- Generic metrics without connection to actual failures
- Missing adversarial cases
- No plan for calibration

## Anti-Patterns

**"Vibes-based" evaluation:** Testing a few examples and shipping if it "looks good"
→ Systematic dataset with coverage of failure modes

**Evaluating the model, not the product:** Generic benchmarks that don't reflect real use
→ Evals grounded in actual user complaints and failures

**Over-engineering:** Complex eval pipelines for simple checks
→ Simplest eval that catches the failure mode

## Handoff to Implementation

After outputting the spec:

"This eval design is ready for implementation. A coding agent can use the official `langfuse` skill to implement this using the Langfuse SDK."

Do NOT attempt to implement the eval. The skill outputs design specs only.

## Quick Reference

| User Says | Do This |
|-----------|---------|
| "How do I know if my LLM is working?" | Start Phase 1 |
| "I need to evaluate [specific thing]" | Jump to Phase 2, focus on that failure mode |
| "Should I use LLM-as-judge?" | Ask about failure mode → recommend based on type |
| "Help me write a judge prompt" | Use `references/judge-prompts.md` templates |
| "What should my dataset look like?" | Cover happy path + edge + adversarial |

## Additional Resources

### Reference Files

- **`references/document-template.md`** - Output spec template with guidance
- **`references/eval-types.md`** - Detailed eval type selection and patterns
- **`references/judge-prompts.md`** - Ready-to-use LLM-as-judge prompts

Files: 4

Size: 22.0 KB

Complexity: 39/100

Category: Design

Source: https://github.com/tavva/ben-claude-plugins/tree/main/plugins/eval-designer/skills/eval-design

Related in Design

contribute

Included

Local-only OSS contribution command center. Auto-refreshes the user's in-flight PR and issue state on invoke so conversations start with full context — no need to brief Claude on what's in flight. Helps the user find issues to contribute to on GitHub, builds per-repo dossiers of what each upstream expects (CLA, DCO, branch convention, AI policy, draft-first, review bots, issue templates), runs deterministic gates before any external action so AI-assisted contributions don't reach maintainers as slop. State is markdown-only: candidate files at ~/.contribute-system/candidates/, repo dossiers at ~/.contribute-system/research/, append-only event log at ~/.contribute-system/log.jsonl. No database, no cloud calls. Use when the user asks about their PRs / issues / contributions, wants to find new work to take on, claim an issue, build/refresh a repo's dossier, or draft a Design Issue or PR. Trigger with "/contribute", "what's my PR status", "find a contribution", "claim issue X", "draft a Design Issue for Y", "refresh dossier for Z".

Designscripts

architectural-analysis

Included

User-triggered deep architectural analysis of a codebase or scoped subtree across eight modes — information architecture, data flow, integration points, UI surfaces, interaction patterns, data model, control flow, and failure modes. This skill should be used when the user asks to "diagram this codebase," "map the architecture," "show the data flow," "give me an ERD," "trace control flow," "find the integration points," "verify the layout pattern," "audit the UX architecture," or any similar request whose primary deliverable is mermaid diagrams plus cited reports under docs/architecture/. Dispatches haiku/sonnet sub-agents in parallel for per-mode exploration, then verifies every citation mechanically before any node lands in a diagram. Not for one-off prose explanations of code (use code-explanation) or for high-level system design from scratch (use system-design).

Designscripts

mcp

Included

Model Context Protocol (MCP) server development and tool management. Languages: Python, TypeScript. Capabilities: build MCP servers, integrate external APIs, discover/execute MCP tools, manage multi-server configs, design agent-centric tools. Actions: create, build, integrate, discover, execute, configure MCP servers/tools. Keywords: MCP, Model Context Protocol, MCP server, MCP tool, stdio transport, SSE transport, tool discovery, resource provider, prompt template, external API integration, Gemini CLI MCP, Claude MCP, agent tools, tool execution, server config. Use when: building MCP servers, integrating external APIs as MCP tools, discovering available MCP tools, executing MCP capabilities, configuring multi-server setups, designing tools for AI agents.

Designscripts

react-native-skia

Included

Design, build, debug, and optimise high-polish animated graphics in React Native or Expo using @shopify/react-native-skia, Reanimated, and Gesture Handler. Use when the user wants canvas-driven UI, shaders, paths, rich text, image filters, sprite fields, Skottie, video frames, snapshots, web CanvasKit setup, or performance tuning for custom motion-heavy elements such as loaders, hero art, cards, charts, progress indicators, particle systems, or gesture-driven surfaces. Also use when the user asks for fluid, glow, glass, blob, parallax, 60fps/120fps, or GPU-friendly animated effects in React Native, even if they do not explicitly say "Skia". Do not use for ordinary form/layout work with standard views.

Designscripts

plaid

Included

Product Led AI Development — guides founders from idea to launched product. Six capabilities: Idea (discover a product idea), Validate (pressure-test the idea against fatal flaws, problem reality, competition, and 2-week MVP feasibility), Plan (vision intake + document generation), Design (translate image references into a design.md spec), Launch (go-to-market strategy), and Build (roadmap execution). Use when someone says "PLAID", "plaid idea", "help me find an idea", "product idea", "idea from my business", "idea from my expertise", "plaid validate", "validate my idea", "pressure-test", "is this idea good", "find fatal flaws", "validate the problem", "plan a product", "define my vision", "generate a PRD", "product strategy", "plaid design", "design from image", "translate image to design", "create design.md", "extract design tokens", "plaid launch", "go-to-market", "launch plan", "GTM strategy", "launch playbook", "plaid build", "build the app", "start building", or "execute the roadmap".

Designscripts

nextjs-framer-motion-animations

Included

Adds production-safe Motion for React or Framer Motion animations to Next.js apps, including reveal, hover and tap micro-interactions, whileInView, stagger, AnimatePresence, layout and layoutId transitions, reorder, scroll-linked UI, and lightweight route-content transitions. Use when the user asks to add, refactor, or debug Motion or Framer Motion in App Router or Pages Router codebases, especially around server/client boundaries, reduced motion, LazyMotion, bundle size, hydration, or route transitions. Avoid for GSAP-style timelines, WebGL or 3D scenes, heavy scroll storytelling, or CSS-only effects unless Motion is explicitly requested.

Designscripts