agent-evaluation

Use this when you need to EVALUATE, IMPROVE, or OPTIMIZE an existing LLM agent's output quality, including improving tool selection accuracy or answer quality, reducing costs, or fixing issues where the agent gives wrong or incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. IMPORTANT - Always load the instrumenting-with-mlflow-tracing skill as well before starting any work. Covers the end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).

# Agent Evaluation with MLflow

Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.

## ⛔ CRITICAL: Must Use MLflow APIs

**DO NOT create custom evaluation frameworks.** You MUST use MLflow's native APIs:

- **Datasets**: Use `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- **Scorers**: Use `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- **Evaluation**: Use `mlflow.genai.evaluate()` - NOT custom evaluation loops
- **Scripts**: Use the provided `scripts/` directory templates - NOT custom `evaluation/` directories

**Why?** MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.

If you're tempted to create `evaluation/eval_dataset.py` or similar custom files, STOP. Use `scripts/create_dataset_template.py` instead.

## Table of Contents

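1. [Quick Start](#quick-start)
2. [Command Conventions](#command-conventions)
3. [Documentation Access Protocol](#documentation-access-protocol)
4. [Discovering Agent Structure](#discovering-agent-structure)
5. [Setup Overview](#setup-overview)
6. [Evaluation Workflow](#evaluation-workflow)
7. [References](#references)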

## Quick Start

**⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.**

**Setup (prerequisite)**: Install MLflow 3.8+, configure environment, integrate tracing

**Evaluation workflow in 5 steps** (each uses MLflow APIs):

1. **Understand**: Run agent, inspect traces, understand purpose
2. **Scorers**: Select and register scorers for quality criteria
3. **Dataset**: ALWAYS discover existing datasets first, only create new if needed
3.5. **Dry Run**: Run 3 questions first — catch broken tools and misconfigured scorers before full eval
4. **Evaluate**: Run agent on dataset, apply scorers, analyze results
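
The dry run in step 3.5 can be sketched as slicing the dataset to its first three records and passing that slice to `mlflow.genai.evaluate()`, so the check still goes through MLflow rather than a custom loop. The helper names `first_n` and `dry_run` and the sample records are assumptions for illustration, not part of this skill's scripts. MLflow is imported inside `dry_run` so the slicing helper can be tried on its own.

```python
from typing import Any, Callable, Sequence


def first_n(records: Sequence[dict], n: int = 3) -> list[dict]:
    """Return the first n records (hypothetical helper for a dry run)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(records[:n])


def dry_run(records: Sequence[dict], predict_fn: Callable[..., Any], scorers: list, n: int = 3):
    """Evaluate only the first n records through MLflow before the full run."""
    import mlflow  # imported here so first_n() works without MLflow installed

    return mlflow.genai.evaluate(
        data=first_n(records, n),
        predict_fn=predict_fn,
        scorers=scorers,
    )


# Hypothetical records: each dict carries the keyword arguments for the agent.
sample = [{"inputs": {"question": f"question {i}"}} for i in range(10)]
subset = first_n(sample)
```

If the dry run surfaces a broken tool or a scorer that returns nothing, fix that first; the full evaluation uses the same call with the complete dataset.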

## Command Conventions

**Always use `uv run` for MLflow and Python commands:**

```bash
uv run mlflow --version          # MLflow CLI commands
uv run python scripts/xxx.py     # Python script execution
uv run python -c "..."           # Python one-liners
```

This ensures commands run in the correct environment with proper dependencies.

**CRITICAL: Separate stderr from stdout when capturing CLI output:**

When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:

```bash
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```
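
The same separation applies when a script captures CLI output instead of redirecting it to files. The sketch below is a minimal illustration using only the standard library; `run_and_parse_json` is a hypothetical helper, and the demo command is a stand-in that prints JSON to stdout and a log line to stderr, not a real MLflow invocation.

```python
import json
import subprocess
import sys


def run_and_parse_json(cmd: list[str]) -> tuple[object, str]:
    """Run cmd, parse stdout as JSON, and return (parsed, stderr_text)."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # stdout holds only the structured payload; logs stay in stderr.
    return json.loads(completed.stdout), completed.stderr


# Stand-in command (an assumption for illustration): JSON on stdout,
# a log line on stderr, the way a CLI with JSON output might behave.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys, json; print(json.dumps({'ok': True})); print('log line', file=sys.stderr)",
]
payload, logs = run_and_parse_json(demo_cmd)
```

Because the two streams are captured separately, the log line never reaches `json.loads`, which is the failure the redirect above is meant to prevent.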

## Documentation Access Protocol

**All MLflow documentation must be accessed through llms.txt:**

1. Start at: `https://mlflow.org/docs/latest/llms.txt`
2. Query llms.txt for your topic with a specific prompt
3. If llms.txt references another doc, use WebFetch with that URL
4. Do not use WebSearch - use WebFetch with llms.txt first

**This applies to all steps**, especially:

- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand the `mlflow.genai.evaluate()` API)

## Discovering Agent Structure

**Each project has unique structure.** Use dynamic exploration instead of assumptions:

### Find Agent Entry Points
```bash
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
# -E enables extended regex so (a|b) alternation works
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
```

### Understand Project Structure
```bash
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -E -A 5 "scripts|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```

## Setup Overview

### Pre-check: Use Existing Environment

**Before doing ANY setup, check if `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_ID` are already set:**

```bash
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
```

**If BOTH are already set, skip Steps 1-2 entirely.** The environment is pre-configured. Do NOT run `setup_mlflow.py`, do NOT create a `.env` file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.
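
The same pre-check can be expressed in Python when it needs to run inside a script. This is a minimal sketch under the rule stated above (both variables must be set to skip Steps 1-2); the function names are hypothetical and are not taken from `scripts/validate_environment.py`.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")


def missing_mlflow_vars(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def can_skip_setup(env: Mapping[str, str]) -> bool:
    """True only when BOTH variables are set, so Steps 1-2 can be skipped."""
    return not missing_mlflow_vars(env)


if __name__ == "__main__":
    missing = missing_mlflow_vars(os.environ)
    print("pre-configured" if not missing else f"missing: {', '.join(missing)}")
```

Passing the environment in as a mapping keeps the check read-only, which matches the instruction not to override pre-configured values.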

### Setup Steps (only if environment is NOT pre-configured)

1. **Install MLflow** (version >=3.8.0)
2. **Configure environment** (tracking URI and experiment)
   - **Guide**: Follow `references/setup-guide.md` Steps 1-2
3. **Integrate tracing** (autolog and @mlflow.trace decorators)
   - ⚠️ **MANDATORY**: Use the `instrumenting-with-mlflow-tracing` skill for tracing setup
   - ✓ **VERIFY**: Run `scripts/validate_tracing_runtime.py` after implementing

⚠️ **Tracing must work before evaluation.** If tracing fails, stop and troubleshoot.

**Checkpoint - verify before proceeding:**

- [ ] MLflow >=3.8.0 installed
- [ ] MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- [ ] Autolog enabled and @mlflow.trace decorators added
- [ ] Test run creates a trace (verify trace ID is not None)

**Validation scripts:**
```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```

## Evaluation Workflow

### Step 1: Agent Interview (REQUIRED — do not skip)

Before doing anything else, ask the user these questions. Do NOT proceed until you have answers.

**Required:**
1. "What does your agent do? Describe its purpose in 1-2 sentences."
2. "What are the 2-3 most important things it needs to get right?"
3. "Are there common failure modes you've already noticed?"

**Use answers to:**
- Derive scorer names and criteria (do not invent them)
- Write the `agent_description` parameter for `generate_evals_df`
- Set evaluation priorities

**If running in automated mode:** Read agent purpose from the codebase (SKILL.md, README, or main entry point docstring). Still surface what you found and confirm before proceeding.

### Step 2: Define Quality Scorers

1. **Check registered scorers in your experiment:**
   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
   ```

**IMPORTANT: if there are registered scorers in the experiment then they must be used for evaluation.**

2. **Select additional built-in scorers that apply to the agent** 

See `references/scorers.md` for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered. 

3. **Create additional custom scorers as needed**

If needed, create additional scorers using the `make_judge()` API. See `references/scorers.md` on how to create custom scorers and `references/scorers-constraints.md` for best practices.

> ⚠️ **CRITICAL — Scorer Return Values:** Scorers MUST instruct the LLM judge to return `"yes"` or `"no"` (or booleans/numerics). Return values of `"pass"` or `"fail"` are **silently cast to `None`** by `_cast_assessment_value_to_float` and **excluded from `results.metrics`** with no error or warning — results simply disappear. See `references/scorers-constraints.md` Constraint 2 for the full list of safe vs. broken return values.
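
Because the failure is silent, a small pre-flight check on sample judge outputs can help during the dry run. The sketch below encodes only the values named in the warning above; it does not reproduce MLflow's casting logic, `classify_judge_value` is a hypothetical helper, and anything it reports as `unknown` should be checked against `references/scorers-constraints.md`.

```python
SAFE_STRINGS = {"yes", "no"}       # named as safe in the warning above
BROKEN_STRINGS = {"pass", "fail"}  # named as silently dropped in the warning above


def classify_judge_value(value: object) -> str:
    """Classify a judge return value as 'safe', 'broken', or 'unknown'."""
    # Booleans and numerics are listed as acceptable return values.
    if isinstance(value, (bool, int, float)):
        return "safe"
    if isinstance(value, str):
        # Exact match only: case variants are not covered by the warning.
        if value in SAFE_STRINGS:
            return "safe"
        if value in BROKEN_STRINGS:
            return "broken"
    return "unknown"


# Hypothetical sample of values a judge might return during a dry run.
report = {repr(v): classify_judge_value(v) for v in ["yes", "fail", True, 0.5, "maybe"]}
```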

4. **REQUIRED: Register new scorers before evaluation** using Python API:
   
   ```python
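   from mlflow.genai.judges import make_judge
   from mlflow.genai.scorers import BuiltinScorerName  # placeholder: import the built-in scorer class you selected

   # Create a custom judge, or instantiate a built-in scorer: scorer = BuiltinScorerName()
   scorer = make_judge(...)
   # Register the scorer in the experiment so evaluation can use it
   scorer.register()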
   ```

**IMPORTANT:** See `references/scorers.md` → "Model Selection for Scorers" to configure the `model` parameter of scorers before registration.

⚠️ **Scorers MUST be registered before evaluation.**
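
As an illustration of a custom judge that follows the return-value constraint, the sketch below asks for `"yes"` or `"no"` and registers the result. The scorer name and instruction wording are hypothetical (real ones should come from the Step 1 interview), and the `name`, `instructions`, and `model` arguments are assumed from the `make_judge()` API as documented for recent MLflow 3 releases; confirm them through llms.txt before relying on this.

```python
JUDGE_NAME = "answers_the_question"  # hypothetical scorer name

# The instructions reference template variables and ask for "yes" or "no".
JUDGE_INSTRUCTIONS = (
    "Decide whether the response in {{ outputs }} fully answers the request "
    "in {{ inputs }}. Answer 'yes' if it does and 'no' if it does not."
)


def build_and_register_judge(model: str):
    """Create the judge and register it in the active experiment."""
    from mlflow.genai.judges import make_judge  # lazy import: needs MLflow

    judge = make_judge(name=JUDGE_NAME, instructions=JUDGE_INSTRUCTIONS, model=model)
    judge.register()  # registration must happen before evaluation
    return judge
```

The `model` value is left as a parameter on purpose; choose it according to "Model Selection for Scorers" in `references/scorers.md`.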