recce-eval


Use when the user asks to "run eval", "recce eval", "evaluate plugin", "benchmark recce", "compare with plugin", "compare without plugin", "eval case", "score eval", "eval report", "eval history", "list eval scenarios", "list eval cases", "show eval history", "run eval case", or wants to measure the Recce Review Agent's effectiveness compared to pure Claude Code without the plugin.


# /recce-eval — Evaluate Recce Plugin Effectiveness

Measure the Recce Review Agent's impact by running headless Claude Code sessions with and without the Recce plugin, then scoring results against known ground truth.

**Relationship to mcp-e2e-validate:** The `mcp-e2e-validate` skill tests whether the plugin *mechanism* works (hooks fire, MCP responds). This skill tests whether the plugin provides *value* (better accuracy, fewer false positives). Run `mcp-e2e-validate` first to confirm plumbing works, then `recce-eval` to measure how much it helps.

---

## Dependencies

Eval scripts require:

- **yq** — YAML processor ([mikefarah/yq](https://github.com/mikefarah/yq)). Install: `brew install yq`
- **jq** — JSON processor. Install: `brew install jq`
- **git** — required for v2 eval flows that clone/manage projects. Install: `brew install git` (or use your OS package manager)
- **Python 3 with venv + pip** — required for v2 eval flows via `setup-v2-project.sh`. Ensure `python3` and `pip` are on your PATH and that `python3 -m venv` works.
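The dependency list above can be checked before any eval run. The following is a minimal sketch, not part of the plugin: the function names `missing_deps` and `has_venv` are hypothetical, and the default command list simply mirrors the bullets above.

```bash
# Print the name of every required command that is not on PATH.
# With no arguments, check the dependencies listed in this section.
missing_deps() {
    [ "$#" -gt 0 ] || set -- yq jq git python3 pip
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# venv is a Python module rather than a PATH entry, so probe it by import.
has_venv() {
    python3 -c 'import venv' >/dev/null 2>&1
}
```

`missing_deps` prints nothing when every command is found, so an empty result means the tools are in place. `has_venv` only confirms that the module imports; it does not create an environment.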

## Setup

Read learned patterns before starting:

```
Read → ${CLAUDE_PLUGIN_ROOT}/reference/learned-patterns.md
```

## Prerequisites

Before running eval, confirm:

1. **dbt project with data loaded** — seeds populated, `dbt run` succeeds on the target
2. **Recce installed** — `recce` CLI in PATH (for MCP server)
3. **`target-base/` artifacts exist** — `dbt docs generate --target-path target-base` on the base branch
4. **No other Recce MCP server on eval port** — default 8085 (configurable via `RECCE_EVAL_MCP_PORT`)
5. **Claude Code CLI installed** — `claude` in PATH
6. **Sufficient API budget** — each run costs ~$1-5 depending on scenario complexity
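Two of these prerequisites are easy to verify mechanically. The sketch below assumes nothing about the plugin's own scripts: `eval_mcp_port` and `has_base_artifacts` are hypothetical helper names. The artifact check relies only on the fact that `dbt docs generate` writes `manifest.json` and `catalog.json` into its target path.

```bash
# Resolve the eval MCP port: RECCE_EVAL_MCP_PORT wins, otherwise 8085.
eval_mcp_port() {
    printf '%s\n' "${RECCE_EVAL_MCP_PORT:-8085}"
}

# Succeed only if the base artifacts directory holds both files that
# `dbt docs generate` produces.
has_base_artifacts() {
    dir="${1:-target-base}"
    [ -f "$dir/manifest.json" ] && [ -f "$dir/catalog.json" ]
}
```

A caller might run `has_base_artifacts target-base || echo "regenerate base artifacts"` from the dbt project root before starting a run.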

---

## Subcommand Routing

Parse user input to determine which flow to execute:

- **`run --case <id>[,<id2>,...] [-n N]`** → Run Flow (one or more scenarios by ID)
- **`run --all [-n N]`** → Run Flow (all scenarios)
- **`run --select [-n N]`** → Select Flow → Run Flow (interactive scenario picker)
- **`score <run-dir>`** → Score Flow
- **`report [eval-id]`** → Report Flow
- **`list`** → List Flow (short-circuit)
- **`history`** → History Flow (short-circuit)
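The routing rules above amount to a two-level dispatch. This is a rough sketch under two assumptions: the selector flag (`--case`, `--all`, `--select`) directly follows `run`, and the flow labels printed here are illustrative names, not identifiers from the plugin.

```bash
# Map the first two words of the user's input to a flow label.
route_subcommand() {
    case "${1:-}" in
        run)
            case "${2:-}" in
                --select)     echo "select-then-run" ;;
                --case|--all) echo "run" ;;
                *)            echo "unknown"; return 1 ;;
            esac ;;
        score)   echo "score" ;;
        report)  echo "report" ;;
        list)    echo "list" ;;     # short-circuit: stop after listing
        history) echo "history" ;;  # short-circuit: stop after showing history
        *)       echo "unknown"; return 1 ;;
    esac
}
```

Returning a non-zero status for unrecognized input lets the caller ask the user to rephrase instead of guessing a flow.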

Shared flags (apply to all flows that accept them):

| Flag | Description | Default |
|------|-------------|---------|
| `--version` | Scenario version: `v1` or `v2` | `v2` |
| `--target` | dbt target name | `dev-local` (v1), `dev` (v2) |
| `--adapter` | Override adapter detection | Auto-detect from profiles.yml |
| `--plugin-dir` | Recce plugin path | Auto-resolve via `resolve-recce-root.sh` |
| `--model` | Claude model for headless runs | Inherits from current session |
| `--no-bare` | Disable bare mode and authenticate via OAuth, so no API key is needed | `--bare` is ON by default |

### Version-Based Path Routing

Based on `--version`, determine the scenario subdirectory and default target:

- `--version v1`: scenarios live in `skills/recce-eval/scenarios/v1/`, set `DEFAULT_TARGET=dev-local`
- `--version v2` (default): scenarios live in `skills/recce-eval/scenarios/v2/`, set `DEFAULT_TARGET=dev`

**IMPORTANT**: Throughout this document, all references to the scenarios directory must use the version-appropriate path. Use `scenarios/v1/` for v1 and `scenarios/v2/` for v2 in every scenario path lookup, glob, and `--patch-file` reference.

The `--target` flag overrides the default if provided. `DEFAULT_TARGET` is used in Step 2 when `--target` is not provided.
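One way to express this routing in shell is shown below. It is a sketch: `resolve_version`, `SCENARIO_DIR`, and `TARGET` are hypothetical names, while `DEFAULT_TARGET` and the directory layout come from the rules above. `CLAUDE_PLUGIN_ROOT` is assumed to be set by the caller.

```bash
# Usage: resolve_version [version] [target-override]
# Sets SCENARIO_DIR, DEFAULT_TARGET, and TARGET in the calling shell.
resolve_version() {
    version="${1:-v2}"              # v2 is the default version
    case "$version" in
        v1) DEFAULT_TARGET="dev-local" ;;
        v2) DEFAULT_TARGET="dev" ;;
        *)  echo "unsupported version: $version" >&2; return 1 ;;
    esac
    SCENARIO_DIR="${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/${version}"
    TARGET="${2:-$DEFAULT_TARGET}"  # an explicit --target value wins
}
```

Because the function assigns variables rather than printing them, call it directly (not in a subshell) and then read `$SCENARIO_DIR` and `$TARGET`.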

### List Flow (short-circuit)

```bash
bash "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/list-scenarios.sh" --version <v1|v2>
```

Display results as a table:

| ID | Name | Case Type | Difficulty |
|----|------|-----------|---------|

**STOP here.** Do not proceed to the Run Flow.

### History Flow (short-circuit)

Read `.claude/recce-eval/history.json` in the dbt project root:

```bash
cat .claude/recce-eval/history.json 2>/dev/null || echo "NO_HISTORY"
```

- If the file is missing or reads `NO_HISTORY`, tell the user: "No eval history found. Run `/recce-eval run` first."
- If present, parse the JSON array and display as a table:

| Eval ID | Timestamp | Adapter | Scenario | Baseline Det. | Plugin Det. | Baseline Judge | Plugin Judge |
|---------|-----------|---------|----------|--------------|-------------|----------------|--------------|

**STOP here.** Do not proceed to the Run Flow.

### Select Flow (interactive picker)

When `--select` is used, present the user with an interactive scenario picker before entering the Run Flow.

1. Load all scenario YAML files from the version-appropriate directory (same as List Flow).
2. Use `AskUserQuestion` with `multiSelect: true` to let the user pick scenarios:
   - Each option's `label` is the scenario ID
   - Each option's `description` is the scenario name and difficulty
3. Parse the selected IDs and proceed to the Run Flow with those scenarios (same as `--case <id1>,<id2>,...`).

If the user selects nothing (cancels), **STOP**.

---

## Run Flow

This is the core orchestration — 12 steps that set up scenarios, run headless Claude Code, score results, and produce a report.

### Step 1: Read Scenario(s)

Use the version-appropriate scenario directory (see Version-Based Path Routing above).

Which files to read depends on the selector flag:

- `--case <id>` (single ID): read `<scenario-dir>/<id>.yaml`.
- `--case <id1>,<id2>,...` (comma-separated): read each `<scenario-dir>/<id>.yaml`.
- `--all`: read all `data-*.yaml` and `code-*.yaml` files in `<scenario-dir>/` (skip non-scenario files like `eval-config.yaml`).
- `--select`: scenarios were already selected in the Select Flow above.

Where `<scenario-dir>` is `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v1` (v1) or `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v2` (v2).
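The file selection described above can be sketched as two small helpers. The names `case_files` and `all_files` are hypothetical, and any scenario IDs you pass in are your own. The only facts taken from this document are the `<id>.yaml` naming and the `data-*` / `code-*` prefixes.

```bash
# Print one scenario path per line for a comma-separated --case value.
case_files() {
    scenario_dir="$1"
    old_ifs="$IFS"; IFS=','
    set -f                       # keep IDs from being glob-expanded
    for id in $2; do
        printf '%s/%s.yaml\n' "$scenario_dir" "$id"
    done
    set +f; IFS="$old_ifs"
}

# Print every data-*.yaml and code-*.yaml file. Matching on those two
# prefixes is what leaves out files such as eval-config.yaml.
all_files() {
    for f in "$1"/data-*.yaml "$1"/code-*.yaml; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

`case_files` does not check that the files exist, so a caller would still want to report unknown IDs before starting a run.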

For each scenario file, extract the required fields in a single `yq` call.

**v2 scenarios** use `prompt.template` + `prompt.vars` (template-based):

```bash
yq -o=json '{
  "id": .id,
  "case_type": .case_type,
  "setup_strategy": .setup.strategy,
  "patch_file": .setup.patch_reverse_file,
  "prompt_template": .prompt.template,
  "prompt_vars": .prompt.vars,
  "max_budget_usd": .headless.max_budget_usd,
  "ground_truth": .ground_truth,
  "judge_criteria": .judge_criteria,
  "restore_files": .teardown.restore_files
}' "<scenario-dir>/<id>.yaml"
```

**v1 scenarios** use `prompt:` as an inline string (no template/vars):

```bash
yq -o=json '{
  "id": .id,
  "case_type": .case_type,
  "setup_strategy": .setup.strategy,
  "patch_file": .setup.patch_reverse_file,
  "prompt_inline": .prompt,
  "max_budget_usd": .headless.max_budget_usd,
  "ground_truth": .ground_truth,
  "judge_criteria": .judge_criteria,
  "restore_files": .teardown.restore_files
}' "<scenario-dir>/<id>.yaml"
```

When `prompt_template` is non-null (v2), read the template file and substitute vars in Step 5. When `prompt_inline` is non-null (v1), use it directly as the prompt text.
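The null check above has one practical wrinkle: when a key is absent, `yq` (and `jq -r` on its JSON output) prints the literal string `null`, so a plain emptiness test is not enough. A minimal sketch, with hypothetical helper names `is_set` and `prompt_mode`:

```bash
# Treat both the empty string and the literal "null" as "not set".
is_set() {
    [ -n "$1" ] && [ "$1" != "null" ]
}

# Usage: prompt_mode <prompt_template> <prompt_inline>
prompt_mode() {
    if is_set "$1"; then
        echo "template"   # v2: load the template file, substitute vars in Step 5
    elif is_set "$2"; then
        echo "inline"     # v1: use the string directly as the prompt text
    else
        echo "missing"; return 1
    fi
}
```

A scenario with neither field set is reported as `missing` with a non-zero status, which is a cheaper failure than launching a headless session with an empty prompt.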

### Step 1b: Clone & Bootstrap v2 Project (v2 only)

**Skip this step entirely for `--version v1`.** Only execute when `--version v2`.

v2 scenarios include `environment.repo` and `environment.ref` fields that specify the dbt project to clone. Parse these from the first scenario (all v2 scenarios share the same repo):

```bash
yq -o=json '{"repo": .environment.repo, "ref": .environment.ref // "main"}' "<scenario-dir>/<first-scenario-id>.yaml"
```

Clone the repo and bootstrap dbt:

```bash
# REPO and REF hold the values parsed from environment.repo / environment.ref above.
eval "$(bash "${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/setup-v2-project.sh" \
    --repo "$REPO" --ref "$REF")"
echo "PROJECT_DIR=$PROJECT_DIR"
```

Record `PROJECT_DIR` — pass it as `--project-dir "$PROJECT_DIR"` to all `run-case.sh` invocations in Step 7.
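The `eval "$(...)"` idiom works because a child process cannot modify its parent's environment; instead, the script prints shell assignments and the caller evaluates them. The sketch below uses a hypothetical stand-in, `emit_project_env`, since the real output of `setup-v2-project.sh` is not shown in this document. That it emits both `PROJECT_DIR` and `WORK_DIR`, and the `project` subdirectory, are assumptions.

```bash
# Hypothetical stand-in for setup-v2-project.sh: print assignments on stdout.
# Anything meant for a human (progress, warnings) should go to stderr so it
# is not evaluated as shell code.
emit_project_env() {
    work_dir="$1"
    echo "setting up in $work_dir" >&2
    printf "WORK_DIR='%s'\n" "$work_dir"
    printf "PROJECT_DIR='%s/project'\n" "$work_dir"
}

# The caller evaluates the output so the variables exist in its own shell.
eval "$(emit_project_env /tmp/recce-eval-demo 2>/dev/null)"
echo "PROJECT_DIR=$PROJECT_DIR"
```

The single-quoted assignments in this sketch would break on a path containing a single quote; a production script would need to escape its values before printing them.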

**Cleanup**: At the very end of the Run Flow (after Step 12), remove the temp project:

```bash
# Delete only when WORK_DIR is set and sits strictly below the temp root.
TMP_ROOT="${TMPDIR:-/tmp}"
if [[ -n "$WORK_DIR" && "$WORK_DIR" == "${TMP_ROOT%/}"/?* ]]; then
    rm -rf "$WORK_DIR"
fi
```

### Step 2: Detect Adapter

Determine the dbt adapter type from profiles.yml. Use `--adapter` if provided; otherwise auto-detect:

```bash
# Default target depen
```


Source: https://github.com/datarecce/recce-claude-plugin/tree/main/plugins/recce-dev/skills/recce-eval
