agent-evaluation
Use this when you need to EVALUATE, IMPROVE, or OPTIMIZE an existing LLM agent's output quality, including improving tool selection accuracy and answer quality, reducing costs, or fixing issues where the agent gives wrong or incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. IMPORTANT: Always also load the instrumenting-with-mlflow-tracing skill before starting any work. Covers the end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).
What this skill does
# Agent Evaluation with MLflow

Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components: tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.

## ⛔ CRITICAL: Must Use MLflow APIs

**DO NOT create custom evaluation frameworks.** You MUST use MLflow's native APIs:

- **Datasets**: Use `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- **Scorers**: Use `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- **Evaluation**: Use `mlflow.genai.evaluate()` - NOT custom evaluation loops
- **Scripts**: Use the provided `scripts/` directory templates - NOT custom `evaluation/` directories

**Why?** MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.

If you're tempted to create `evaluation/eval_dataset.py` or similar custom files, STOP. Use `scripts/create_dataset_template.py` instead.

## Table of Contents

1. [Quick Start](#quick-start)
2. [Documentation Access Protocol](#documentation-access-protocol)
3. [Setup Overview](#setup-overview)
4. [Evaluation Workflow](#evaluation-workflow)
5. [References](#references)

## Quick Start

**⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.**

**Setup (prerequisite)**: Install MLflow 3.8+, configure environment, integrate tracing

**Evaluation workflow in 5 steps** (each uses MLflow APIs):

1. **Understand**: Run agent, inspect traces, understand purpose
2. **Scorers**: Select and register scorers for quality criteria
3. **Dataset**: ALWAYS discover existing datasets first, only create new if needed
3.5. **Dry Run**: Run 3 questions first to catch broken tools and misconfigured scorers before the full eval
4. **Evaluate**: Run agent on dataset, apply scorers, analyze results

## Command Conventions

**Always use `uv run` for MLflow and Python commands:**

```bash
uv run mlflow --version          # MLflow CLI commands
uv run python scripts/xxx.py     # Python script execution
uv run python -c "..."           # Python one-liners
```

This ensures commands run in the correct environment with proper dependencies.

**CRITICAL: Separate stderr from stdout when capturing CLI output:**

When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:

```bash
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```

## Documentation Access Protocol

**All MLflow documentation must be accessed through llms.txt:**

1. Start at: `https://mlflow.org/docs/latest/llms.txt`
2. Query llms.txt for your topic with a specific prompt
3. If llms.txt references another doc, use WebFetch with that URL
4. Do not use WebSearch - use WebFetch with llms.txt first

**This applies to all steps**, especially:

- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand the `mlflow.genai.evaluate` API)

## Discovering Agent Structure

**Each project has a unique structure.** Use dynamic exploration instead of assumptions:

### Find Agent Entry Points

```bash
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
# -E enables extended regex so (a|b) is treated as alternation
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
```

### Understand Project Structure

```bash
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```

## Setup Overview

### Pre-check: Use Existing Environment

**Before doing ANY setup, check if `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_ID` are already set:**

```bash
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
```

**If BOTH are already set, skip Steps 1-2 entirely.** The environment is pre-configured. Do NOT run `setup_mlflow.py`, do NOT create a `.env` file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.

### Setup Steps (only if environment is NOT pre-configured)

1. **Install MLflow** (version >=3.8.0)
2. **Configure environment** (tracking URI and experiment)
   - **Guide**: Follow `references/setup-guide.md` Steps 1-2
3. **Integrate tracing** (autolog and @mlflow.trace decorators)
   - ⚠️ **MANDATORY**: Use the `instrumenting-with-mlflow-tracing` skill for tracing setup
   - ✓ **VERIFY**: Run `scripts/validate_tracing_runtime.py` after implementing

⚠️ **Tracing must work before evaluation.** If tracing fails, stop and troubleshoot.

**Checkpoint - verify before proceeding:**

- [ ] MLflow >=3.8.0 installed
- [ ] MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- [ ] Autolog enabled and @mlflow.trace decorators added
- [ ] Test run creates a trace (verify trace ID is not None)

**Validation scripts:**

```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```

## Evaluation Workflow

### Step 1: Agent Interview (REQUIRED: do not skip)

Before doing anything else, ask the user these questions. Do NOT proceed until you have answers.

**Required:**

1. "What does your agent do? Describe its purpose in 1-2 sentences."
2. "What are the 2-3 most important things it needs to get right?"
3. "Are there common failure modes you've already noticed?"

**Use answers to:**

- Derive scorer names and criteria (do not invent them)
- Write the `agent_description` parameter for `generate_evals_df`
- Set evaluation priorities

**If running in automated mode:** Read the agent's purpose from the codebase (SKILL.md, README, or main entry point docstring). Still surface what you found and confirm before proceeding.

### Step 2: Define Quality Scorers

1. **Check registered scorers in your experiment:**

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
   ```

   **IMPORTANT: If there are registered scorers in the experiment, they must be used for evaluation.**

2. **Select additional built-in scorers that apply to the agent**

   See `references/scorers.md` for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.

3. **Create additional custom scorers as needed**

   If needed, create additional scorers using the `make_judge()` API. See `references/scorers.md` for how to create custom scorers and `references/scorers-constraints.md` for best practices.

   > ⚠️ **CRITICAL (Scorer Return Values):** Scorers MUST instruct the LLM judge to return `"yes"` or `"no"` (or booleans/numerics). Return values of `"pass"` or `"fail"` are **silently cast to `None`** by `_cast_assessment_value_to_float` and **excluded from `results.metrics`** with no error or warning; results simply disappear. See `references/scorers-constraints.md` Constraint 2 for the full list of safe vs. broken return values.

4. **REQUIRED: Register new scorers before evaluation** using the Python API:

   ```python
   from mlflow.genai.judges import make_judge
   from mlflow.genai.scorers import BuiltinScorerName  # placeholder name for a built-in scorer class
   import os

   scorer = make_judge(...)
   # Or, scorer = BuiltinScorerName()
   scorer.register()
   ```

   **IMPORTANT:** See `references/scorers.md` → "Model Selection for Scorers" to configure the `model` parameter of scorers before registration.

⚠️ **Scorers MUST be registered before evaluation.**
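
The return-value constraint from Step 2 can be checked during the dry run, before a full evaluation spends tokens. The following is a minimal sketch, not MLflow's own logic: `is_known_safe` and `unsafe_verdicts` are hypothetical helper names, and treating every string other than exactly `"yes"` or `"no"` as unverified is an assumption of this sketch (the authoritative list lives in `references/scorers-constraints.md`).

```python
from numbers import Real

# Values the skill text names as safe. Anything else is treated as
# unverified here (an assumption of this sketch, not MLflow behavior).
SAFE_STRINGS = {"yes", "no"}


def is_known_safe(value: object) -> bool:
    """Return True if `value` is a boolean, a number, or "yes"/"no"."""
    if isinstance(value, Real):  # bool is a subclass of int, so it passes too
        return True
    return isinstance(value, str) and value in SAFE_STRINGS


def unsafe_verdicts(verdicts: dict[str, object]) -> dict[str, object]:
    """Map scorer name -> verdict for every verdict not known to be safe."""
    return {name: v for name, v in verdicts.items() if not is_known_safe(v)}


if __name__ == "__main__":
    # Hypothetical dry-run output: scorer name -> raw judge verdict.
    dry_run = {"tool_selection": "yes", "completeness": "pass", "latency_ok": True}
    for name, value in unsafe_verdicts(dry_run).items():
        print(f"{name}: {value!r} may be dropped from results.metrics")
```

Run against the three dry-run questions, a check like this surfaces a judge that answers `"pass"` or `"fail"` while the fix is still a one-line change to the judge instructions.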