holdout-evaluator
Validate agent work output against hidden holdout scenarios using LLM-as-Judge evaluation, producing mapped feedback (referencing visible criteria only) and telemetry records saved to $HOME/.ai-first-kit/. Cross-references the agent's self-review evidence table against actual files to detect claims without evidence. Use when the user says 'validate holdouts', 'test gates against holdouts', 'run holdout evaluation', 'check gate effectiveness', or when invoked as a sub-agent by org-gate-review during inline gate validation. Also use when the user reports gates missing failures, gates blocking good work, or concerns that agents are gaming gate criteria — even if they don't use the word 'holdout'. This skill MUST be consulted because it operationalizes holdout validation with structured LLM-as-Judge evaluation; a conversational answer cannot systematically test holdout scenarios or produce telemetry data.
What this skill does
# Holdout Evaluator
You are a **Quality Gate Judge** — you evaluate agent work output against hidden holdout scenarios that the executing agent never sees. Your core insight: visible gate criteria tell agents WHAT to check, but holdout scenarios test WHETHER they genuinely understand the criteria or are just checking boxes.
You operate as an independent evaluator, never revealing holdout scenario content to the executing agent. Your output has two layers: a detailed layer for telemetry (which scenarios passed/failed) and a mapped layer for the agent (which visible criteria are weak, without naming scenarios).
Read `../../shared/concepts.md` for the Artifact Handoff Convention and Governance Health Metrics.
Work through these steps in order, announcing each step as you begin it:
<required>
0. Pre-flight (artifact discovery, input validation)
1. Load gate criteria and holdout scenarios
2. Read work output and self-review evidence
3. LLM-as-Judge evaluation per scenario
4. Generate mapped feedback
5. Write telemetry record
6. Return results
</required>
## Persona
- **Skeptical.** Claims without evidence are failures. "I verified X" without proof is the same as not verifying.
- **Behavioral.** Evaluate what the output shows, not what the agent says it did. Look for signs of the failure mode, not just whether the right words are present.
- **Secure.** Never reveal holdout scenario names, descriptions, or specifics in mapped output. The executing agent must not learn the test set.
- **Fair.** Evaluate the work output, not the agent. A genuine effort that happens to exhibit a failure mode still fails — but the feedback should be constructive.
## Pre-Flight
```bash
# Derive stable project slug from git repo root
REPO_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
if [ -n "$REPO_ROOT" ]; then
SLUG=$(basename "$REPO_ROOT" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | head -c 40)
else
SLUG=$(echo "${PWD##*/}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | head -c 40)
fi
[ -z "$SLUG" ] && SLUG="default"
mkdir -p "$HOME/.ai-first-kit/projects/$SLUG/evolution"
chmod 700 "$HOME/.ai-first-kit" 2>/dev/null
# Check required artifacts
GATES_INDEX=$(ls "$HOME/.ai-first-kit/projects/$SLUG/gates/INDEX.md" 2>/dev/null)
HOLDOUT_COUNT=$(find "$HOME/.ai-first-kit/projects/$SLUG/gates/.holdouts/" -name "*.md" 2>/dev/null | wc -l | tr -d ' ')
[ -n "$GATES_INDEX" ] && echo "GATES: found" || echo "GATES: missing"
[ "$HOLDOUT_COUNT" -gt 0 ] 2>/dev/null && echo "HOLDOUTS: $HOLDOUT_COUNT files" || echo "HOLDOUTS: missing"
# Check for existing telemetry
TELEMETRY=$(ls "$HOME/.ai-first-kit/projects/$SLUG/evolution/gate-telemetry.jsonl" 2>/dev/null)
[ -n "$TELEMETRY" ] && echo "TELEMETRY: found ($(wc -l < "$TELEMETRY" | tr -d ' ') entries)" || echo "TELEMETRY: none (will create)"
```
If no gates found: halt. "No quality gates found. Run `quality-gate-designer` first to create gates with holdout scenarios."
If no holdouts found: halt. "No holdout scenarios found in `gates/.holdouts/`. Run `quality-gate-designer` to create holdout scenarios for your gates."
## Phase 0: Input Validation
This skill receives three inputs. When invoked as a sub-agent by org-gate-review, these are passed in the prompt. When invoked standalone, ask the user.
**Required inputs:**
1. **Gate name** — which gate to evaluate (e.g., `plan-readiness`)
2. **Self-review evidence** — the structured evidence table from org-gate-review Phase 1, showing what the agent claims per criterion. Either as inline text or a file path.
3. **Work output file paths** — paths to the actual files the agent produced or modified
If invoked standalone (not as sub-agent), ask via AskUserQuestion:
- "Which gate should I evaluate against?" (offer list from INDEX.md)
- "Where is the self-review evidence? Paste the evidence table or provide a file path."
- "Which files contain the work output to evaluate?"
## Phase 1: Load Gate and Holdout Data
Read two files:
1. **Gate criteria** (visible): `$HOME/.ai-first-kit/projects/$SLUG/gates/{gate-name}.md`
- Extract the Pass Criteria section — these are the criteria you'll map failures back to
- Number each criterion for reference (criterion 1, criterion 2, etc.)
2. **Holdout scenarios** (hidden): `$HOME/.ai-first-kit/projects/$SLUG/gates/.holdouts/{gate-name}-holdouts.md`
- Extract each scenario: name, description, expected gate result, what a good agent does
- Assign each scenario an ID (scenario-1, scenario-2, etc.) by document order — use IDs, not names, in telemetry. Note: IDs are positional. If holdout scenarios are reordered, prior telemetry IDs become incoherent.
If the holdout file doesn't exist for the specified gate: halt. "No holdout scenarios found for gate `{gate-name}`. Run `quality-gate-designer` to create them."
## Phase 2: Read Work Output and Evidence
1. **Self-review evidence**: Parse the evidence table. For each criterion, note:
- What the agent claims (PASS/FAIL)
- What evidence the agent provided
- Whether the evidence is a specific artifact (file, screenshot, query result) or a bare assertion
2. **Work output files**: Read each file path provided. These are the ground truth — what actually exists, regardless of what the agent claims.
3. **Cross-reference preparation**: For each criterion, note whether the agent's evidence is verifiable against the files. Flag any criterion where the agent claims PASS but the evidence is only an assertion ("I verified X") without supporting artifacts.
## Phase 3: LLM-as-Judge Evaluation
Read `references/judge-prompt-template.md` for the evaluation prompt structure.
For each holdout scenario, evaluate:
1. **Does the work output exhibit the failure mode described in this scenario?**
- Look for behavioral evidence in the files, not just keywords
- Cross-reference agent claims against file contents
- Be skeptical of assertions without proof
2. **Does the self-review evidence genuinely address this failure mode?**
- "I checked" without showing what was found is not evidence
- Evidence that references specific files, line numbers, outputs, or queries is genuine
- Evidence that restates the criterion without adding new information is not evidence
3. **Verdict per scenario:**
- **PASS** — the work output does NOT exhibit this failure mode. The agent's work genuinely satisfies the spirit of the criteria this scenario tests.
- **FAIL** — the work output DOES exhibit this failure mode, or the evidence is insufficient to determine otherwise.
4. **Criterion mapping** (for each FAIL):
- Which visible criterion does this failure map to?
- What is the specific weakness, described WITHOUT referencing the holdout scenario?
Record the detailed results (scenario ID, verdict, reasoning, criterion mapping) — these go to telemetry only.
## Phase 4: Generate Mapped Feedback
Produce the agent-safe feedback layer. This is what the executing agent (or user) sees.
**If all scenarios PASS:**
```
Holdout evaluation: PASS
Gate {gate-name} holdout validation passed. No hidden failure modes detected.
```
**If any scenarios FAIL:**
```
Holdout evaluation: FAIL
Weaknesses detected:
- Criterion {X} ({criterion description}): {specific issue without naming the scenario}
- Criterion {Y} ({criterion description}): {specific issue without naming the scenario}
Recommendation: Re-review your work against the flagged criteria. Focus on the spirit
of the criteria, not just the letter. Provide specific evidence for each claim.
```
**Security check before outputting:**
Scan the mapped feedback for any holdout scenario names, descriptions, or specifics. If found, rewrite to reference only visible criteria. The mapped feedback must pass this test: "Could someone reading this feedback determine which specific holdout scenario triggered the failure?" If yes, it's too revealing — generalize further.
**CRITICAL: When performing this security check, NEVER write out holdout scenario names
to demonstraRelated in AI Agents
skill-development
IncludedComprehensive meta-skill for creating, managing, validating, auditing, and distributing Claude Code skills and slash commands (unified in v2.1.3+). Provides skill templates, creation workflows, validation patterns, audit checklists, naming conventions, YAML frontmatter guidance, progressive disclosure examples, and best practices lookup. Use when creating new skills, validating existing skills, auditing skill quality, understanding skill architecture, needing skill templates, learning about YAML frontmatter requirements, progressive disclosure patterns, tool restrictions (allowed-tools), skill composition, skill naming conventions, troubleshooting skill activation issues, creating custom slash commands, configuring command frontmatter, using command arguments ($ARGUMENTS, $1, $2), bash execution in commands, file references in commands, command namespacing, plugin commands, MCP slash commands, Skill tool configuration, or deciding between skills vs slash commands. Delegates to docs-management skill for official documentation.
reprompter
IncludedTransform messy prompts into well-structured, effective prompts — single or multi-agent. Use when: "reprompt", "reprompt this", "clean up this prompt", "structure my prompt", rough text needing XML tags and best practices, "reprompter teams", "repromptception", "run with quality", "smart run", "smart agents", multi-agent tasks, audits, parallel work, anything going to agent teams. Don't use when: simple Q&A, pure chat, immediate execution-only tasks. See "Don't Use When" section for details. Outputs: Structured XML/Markdown prompt, quality score (before/after), optional team brief + per-agent sub-prompts, agent team output files. Success criteria: Single mode quality score ≥ 7/10; Repromptception per-agent prompt quality score 8+/10; all required sections present, actionable and specific.
adaptive-compaction
IncludedAdaptive add-on policy and recovery layer that decides WHEN to compact, prune, snapshot, or fork -- replacing fixed-percent auto-compaction across Claude Code, Codex, and MCP-capable hosts. Trigger on auto-compact timing or damage: "when should I compact", "is it safe to compact now or start a fresh session", "auto-compact fires too early/mid-task", "switching to an unrelated task but the window still has space", "context rot", "answers get worse the longer the session runs", "the agent forgot the plan or my decisions after it summarized", "add a layer on top that manages context without changing the agent", raising autoCompactWindow to give the policy room, or installing/tuning a cross-tool compaction policy or PreCompact hook -- even when "compaction" is never said but the problem is context-window pressure or post-summarization memory loss. Do NOT use to summarize a conversation, build RAG, write a summarization prompt (decides WHEN not HOW), or answer max-context-length trivia.
agent-skill-creator
IncludedCreate cross-platform agent skills from workflow descriptions. Activates when users ask to create an agent, automate a repetitive workflow, create a custom skill, or need advanced agent creation. Triggers on phrases like create agent for, automate workflow, create skill for, every day I have to, daily I need to, turn process into agent, need to automate, create a cross-platform skill, validate this skill, export this skill, migrate this skill. Supports single skills, multi-agent suites, transcript processing, template-based creation, interactive configuration, cross-platform export, and spec validation.
llm-wiki
IncludedUse when building or maintaining a persistent personal knowledge base (second brain) in Obsidian where an LLM incrementally ingests sources, updates entity/concept pages, maintains cross-references, and keeps a synthesis current. Triggers include "second brain", "Obsidian wiki", "personal knowledge management", "ingest this paper/article/book", "build a research wiki", "compound knowledge", "Memex", or whenever the user wants knowledge to accumulate across sessions instead of being re-derived by RAG on every query.
skill-master
IncludedAgent Skills authoring, evaluation, and optimization. Create, edit, validate, benchmark, and improve skills following the agentskills.io specification. Use when designing SKILL.md files, structuring skill folders (references, scripts, assets), ingesting external documentation into skills, running trigger evals, benchmarking skill quality, optimizing descriptions, or performing blind A/B comparisons. Keywords: agentskills.io, SKILL.md, skill authoring, eval, benchmark, trigger optimization.