
recce-eval


Use when the user asks to "run eval", "recce eval", "evaluate plugin", "benchmark recce", "compare with plugin", "compare without plugin", "eval case", "score eval", "eval report", "eval history", "list eval scenarios", "list eval cases", "show eval history", "run eval case", or wants to measure the Recce Review Agent's effectiveness compared to pure Claude Code without the plugin.


# /recce-eval — Evaluate Recce Plugin Effectiveness

Measure the Recce Review Agent's impact by running headless Claude Code sessions with and without the Recce plugin, then scoring results against known ground truth.

**Relationship to mcp-e2e-validate:** The `mcp-e2e-validate` skill tests whether the plugin *mechanism* works (hooks fire, MCP responds). This skill tests whether the plugin provides *value* (better accuracy, fewer false positives). Run `mcp-e2e-validate` first to confirm plumbing works, then `recce-eval` to measure how much it helps.

---

## Dependencies

Eval scripts require:

- **yq** — YAML processor ([mikefarah/yq](https://github.com/mikefarah/yq)). Install: `brew install yq`
- **jq** — JSON processor. Install: `brew install jq`
- **git** — required for v2 eval flows that clone/manage projects. Install: `brew install git` (or use your OS package manager)
- **Python 3 with venv + pip** — required for v2 eval flows via `setup-v2-project.sh`. Ensure `python3` and `pip` are on your PATH and that `python3 -m venv` works.

## Setup

Read learned patterns before starting:

```
Read → ${CLAUDE_PLUGIN_ROOT}/reference/learned-patterns.md
```

## Prerequisites

Before running eval, confirm:

1. **dbt project with data loaded** — seeds populated, `dbt run` succeeds on the target
2. **Recce installed** — `recce` CLI in PATH (for MCP server)
3. **`target-base/` artifacts exist** — generate them by running `dbt docs generate --target-path target-base` on the base branch
4. **No other Recce MCP server on eval port** — default 8085 (configurable via `RECCE_EVAL_MCP_PORT`)
5. **Claude Code CLI installed** — `claude` in PATH
6. **Sufficient API budget** — each run costs ~$1-5 depending on scenario complexity
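
A quick preflight can catch several of these before any budget is spent. The sketch below is an illustration only, not one of the skill's shipped scripts: the function names `missing_commands`, `eval_mcp_port`, and `preflight` are hypothetical, and it covers only items 2, 3, and 5, plus resolving the port named in item 4 (it does not probe whether the port is in use).

```bash
# Hypothetical preflight helpers; not part of the skill's scripts.

# Print each named command that is not found on PATH, one per line.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Resolve the MCP port the eval will use (default 8085).
eval_mcp_port() {
    printf '%s\n' "${RECCE_EVAL_MCP_PORT:-8085}"
}

# Return non-zero if a required CLI or the base artifacts are missing.
preflight() {
    local project_dir="${1:-.}" missing
    missing="$(missing_commands recce claude)"
    if [ -n "$missing" ]; then
        echo "Missing CLI(s): $(printf '%s' "$missing" | tr '\n' ' ')" >&2
        return 1
    fi
    if [ ! -d "$project_dir/target-base" ]; then
        echo "Missing $project_dir/target-base" >&2
        return 1
    fi
    echo "Preflight OK; eval MCP port: $(eval_mcp_port)"
}
```

Calling `preflight .` from the dbt project root would report the first missing item; data loading and API budget still need a manual check.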

---

## Subcommand Routing

Parse user input to determine which flow to execute:

- **`run --case <id>[,<id2>,...] [-n N]`** → Run Flow (one or more scenarios by ID)
- **`run --all [-n N]`** → Run Flow (all scenarios)
- **`run --select [-n N]`** → Select Flow → Run Flow (interactive scenario picker)
- **`score <run-dir>`** → Score Flow
- **`report [eval-id]`** → Report Flow
- **`list`** → List Flow (short-circuit)
- **`history`** → History Flow (short-circuit)
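
The routing above can be sketched as a nested `case` statement. This is a minimal illustration, assuming the mode flag (`--case`, `--all`, or `--select`) is the first argument after `run`; the function name `route_subcommand` and the printed flow labels are hypothetical, not names used by the skill's scripts.

```bash
# Map the leading words of the user's input to a flow label.
# Labels are illustrative only; exit status 2 signals unroutable input.
route_subcommand() {
    case "${1:-}" in
        run)
            case "${2:-}" in
                --select)     echo "select-then-run" ;;
                --case|--all) echo "run" ;;
                *) echo "error: run needs --case, --all, or --select" >&2
                   return 2 ;;
            esac
            ;;
        score)   echo "score" ;;
        report)  echo "report" ;;
        list)    echo "list" ;;      # short-circuit flow
        history) echo "history" ;;   # short-circuit flow
        *) echo "error: unknown subcommand '${1:-}'" >&2
           return 2 ;;
    esac
}
```

For example, `route_subcommand run --select` prints `select-then-run`, while an unrecognized subcommand returns status 2 without printing a label.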

Shared flags (apply to all flows that accept them):

| Flag | Description | Default |
|------|-------------|---------|
| `--version` | Scenario version: `v1` or `v2` | `v2` |
| `--target` | dbt target name | `dev-local` (v1), `dev` (v2) |
| `--adapter` | Override adapter detection | Auto-detect from profiles.yml |
| `--plugin-dir` | Recce plugin path | Auto-resolve via `resolve-recce-root.sh` |
| `--model` | Claude model for headless runs | Inherits from current session |
| `--no-bare` | Disable bare mode and authenticate via OAuth (no API key needed) | Off (`--bare` is on by default) |

### Version-Based Path Routing

Based on `--version`, determine the scenario subdirectory and default target:

- `--version v1`: scenarios live in `skills/recce-eval/scenarios/v1/`, set `DEFAULT_TARGET=dev-local`
- `--version v2` (default): scenarios live in `skills/recce-eval/scenarios/v2/`, set `DEFAULT_TARGET=dev`

**IMPORTANT**: Throughout this document, all references to the scenarios directory must use the version-appropriate path. Use `scenarios/v1/` for v1 and `scenarios/v2/` for v2 in every scenario path lookup, glob, and `--patch-file` reference.

The `--target` flag overrides the default if provided. `DEFAULT_TARGET` is used in Step 2 when `--target` is not provided.
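
As a minimal sketch of this routing, the helper below resolves both values from the version and an optional target override. The function name `resolve_version_paths` and its two-line output format are assumptions made for illustration; the paths and default targets come from the rules above.

```bash
# Resolve the scenario directory and dbt target for a given --version.
# Usage: resolve_version_paths <v1|v2> [target-override]
# Prints two lines: SCENARIO_DIR=... and TARGET=...
resolve_version_paths() {
    local version="${1:-v2}" target_override="${2:-}" default_target
    case "$version" in
        v1) default_target="dev-local" ;;
        v2) default_target="dev" ;;
        *)  echo "error: --version must be v1 or v2" >&2
            return 2 ;;
    esac
    echo "SCENARIO_DIR=${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/${version}"
    # An explicit --target wins over the version default.
    echo "TARGET=${target_override:-$default_target}"
}
```

With no arguments the helper falls back to v2, matching the `--version` default in the flags table.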

### List Flow (short-circuit)

```bash
bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/list-scenarios.sh --version <v1|v2>
```

Display results as a table:

| ID | Name | Case Type | Difficulty |
|----|------|-----------|---------|

**STOP here.** Do not proceed to the Run Flow.

### History Flow (short-circuit)

Read `.claude/recce-eval/history.json` in the dbt project root:

```bash
cat .claude/recce-eval/history.json 2>/dev/null || echo "NO_HISTORY"
```

- If the command prints `NO_HISTORY` (the file is missing or unreadable), tell the user: "No eval history found. Run `/recce-eval run` first."
- If present, parse the JSON array and display as a table:

| Eval ID | Timestamp | Adapter | Scenario | Baseline Det. | Plugin Det. | Baseline Judge | Plugin Judge |
|---------|-----------|---------|----------|--------------|-------------|----------------|--------------|

**STOP here.** Do not proceed to the Run Flow.

### Select Flow (interactive picker)

When `--select` is used, present the user with an interactive scenario picker before entering the Run Flow.

1. Load all scenario YAML files from the version-appropriate directory (same as List Flow).
2. Use `AskUserQuestion` with `multiSelect: true` to let the user pick scenarios:
   - Each option's `label` is the scenario ID
   - Each option's `description` is the scenario name and difficulty
3. Parse the selected IDs and proceed to the Run Flow with those scenarios (same as `--case <id1>,<id2>,...`).

If the user selects nothing (cancels), **STOP**.

---

## Run Flow

This is the core orchestration — 12 steps that set up scenarios, run headless Claude Code, score results, and produce a report.

### Step 1: Read Scenario(s)

Use the version-appropriate scenario directory (see Version-Based Path Routing above).

- `--case <id>` (single ID): read `<scenario-dir>/<id>.yaml`.
- `--case <id1>,<id2>,...` (comma-separated): read each `<scenario-dir>/<id>.yaml`.
- `--all`: read all `data-*.yaml` and `code-*.yaml` files in `<scenario-dir>/` (skip non-scenario files such as `eval-config.yaml`).
- `--select`: use the scenarios already chosen in the Select Flow above.

Here `<scenario-dir>` is `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v1` (v1) or `${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scenarios/v2` (v2).
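
The file lookup for `--case` and `--all` can be sketched as follows. Both helper names (`scenario_files_for_cases`, `scenario_files_for_all`) are hypothetical, and the sketch assumes scenario IDs contain no spaces, commas, or glob characters.

```bash
# Expand a comma-separated --case value into scenario file paths, one per line.
# Usage: scenario_files_for_cases <scenario-dir> <id>[,<id2>,...]
scenario_files_for_cases() {
    local scenario_dir="$1" ids="$2" id
    # Turn commas into spaces and rely on word splitting to iterate.
    for id in $(printf '%s' "$ids" | tr ',' ' '); do
        printf '%s/%s.yaml\n' "$scenario_dir" "$id"
    done
}

# List scenario files for --all: only data-*.yaml and code-*.yaml,
# so files such as eval-config.yaml are never picked up.
scenario_files_for_all() {
    local scenario_dir="$1" f
    for f in "$scenario_dir"/data-*.yaml "$scenario_dir"/code-*.yaml; do
        # An unmatched glob stays literal; skip it.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Each printed path can then be passed to the `yq` extraction that follows.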

For each scenario file, extract the required fields in a single `yq` call.

**v2 scenarios** use `prompt.template` + `prompt.vars` (template-based):

```bash
yq -o=json '{
  "id": .id,
  "case_type": .case_type,
  "setup_strategy": .setup.strategy,
  "patch_file": .setup.patch_reverse_file,
  "prompt_template": .prompt.template,
  "prompt_vars": .prompt.vars,
  "max_budget_usd": .headless.max_budget_usd,
  "ground_truth": .ground_truth,
  "judge_criteria": .judge_criteria,
  "restore_files": .teardown.restore_files
}' "<scenario-dir>/<id>.yaml"
```

**v1 scenarios** use `prompt:` as an inline string (no template/vars):

```bash
yq -o=json '{
  "id": .id,
  "case_type": .case_type,
  "setup_strategy": .setup.strategy,
  "patch_file": .setup.patch_reverse_file,
  "prompt_inline": .prompt,
  "max_budget_usd": .headless.max_budget_usd,
  "ground_truth": .ground_truth,
  "judge_criteria": .judge_criteria,
  "restore_files": .teardown.restore_files
}' "<scenario-dir>/<id>.yaml"
```

When `prompt_template` is non-null (v2), read the template file and substitute vars in Step 5. When `prompt_inline` is non-null (v1), use it directly as the prompt text.

### Step 1b: Clone & Bootstrap v2 Project (v2 only)

**Skip this step entirely for `--version v1`.** Only execute when `--version v2`.

v2 scenarios include `environment.repo` and `environment.ref` fields that specify the dbt project to clone. Parse these from the first scenario (all v2 scenarios share the same repo):

```bash
yq -o=json '{"repo": .environment.repo, "ref": .environment.ref // "main"}' "<scenario-dir>/<first-scenario-id>.yaml"
```

Clone the repo and bootstrap dbt:

```bash
eval "$(bash ${CLAUDE_PLUGIN_ROOT}/skills/recce-eval/scripts/setup-v2-project.sh \
    --repo "$REPO" --ref "$REF")"
echo "PROJECT_DIR=$PROJECT_DIR"
```

Record `PROJECT_DIR` — pass it as `--project-dir "$PROJECT_DIR"` to all `run-case.sh` invocations in Step 7.

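**Cleanup**: At the very end of the Run Flow (after Step 12), remove the temp project. This assumes the `eval` call above also set `WORK_DIR`; the guard deletes it only when it is non-empty and its path begins with the system temp directory: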

```bash
if [ -n "$WORK_DIR" ] && [[ "$WORK_DIR" == "${TMPDIR:-/tmp}"* ]]; then
    rm -rf "$WORK_DIR"
fi
```
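
Note that a bare prefix match also accepts sibling paths such as `/tmpfoo` when the temp root is `/tmp`. If that matters, a stricter check is possible; the sketch below is a hypothetical helper (`is_inside_tmp` is not part of the skill) that requires a path separator after the temp root. It does not resolve `..` segments or symlinks.

```bash
# Return 0 only if the path is non-empty and strictly inside the temp root.
# Hypothetical helper: stricter than a prefix match because it requires
# a "/" plus at least one character after the temp root.
is_inside_tmp() {
    local path="${1:-}" tmp_root="${TMPDIR:-/tmp}"
    tmp_root="${tmp_root%/}"            # drop one trailing slash, if any
    [ -n "$path" ] || return 1          # empty path: never safe
    [ -n "$tmp_root" ] || return 1      # temp root was "/": refuse
    case "$path" in
        "$tmp_root"/?*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Usage: if is_inside_tmp "$WORK_DIR"; then rm -rf "$WORK_DIR"; fi
```

The `case` pattern keeps the check free of external commands, so it behaves the same whether or not the directory still exists.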

### Step 2: Detect Adapter

Determine the dbt adapter type from `profiles.yml`. Use `--adapter` if provided; otherwise auto-detect:

```bash
# Default target depen
```