agent-platform-eval-flywheel

Measure and improve the quality of AI models and agents on Google Cloud using the Eval Quality Flywheel methodology. Use when evaluating an agent or model, building an eval dataset, picking or writing evaluation metrics, analyzing failures, comparing results before and after a fix, or when guidance is needed on Agent Platform eval methodology — including dataset schema, LLM-as-judge scoring, and common failure causes. For fine-tuning, use agent-platform-tuning. For deployment, use agent-platform-deploy.


# Agent Platform Eval Flywheel Skill

Help users evaluate and iteratively improve GenAI models and agents using
the Agent Platform GenAI Evaluation SDK (`google.genai` / `agentplatform`).

## When to use this skill

-   Evaluating GenAI agents or models with the Agent Platform GenAI
    Evaluation SDK (`client.evals.evaluate()`).
-   Creating evaluation datasets from session traces, pandas DataFrames, or
    synthetic generation.
-   Selecting, configuring, or writing custom evaluation metrics.
-   Analyzing rubric verdicts, loss patterns, and clustering failures.
-   Suggesting concrete code/prompt improvements based on eval results.

## Setup

Install the SDK:

```bash
pip install "google-cloud-aiplatform[evaluation]>=1.154.0" "google-genai>=1.0.0"
```

The SDK needs `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION`. Check the
environment variables first; if either is missing, ask the user. Newer Gemini
models often need `location="global"`.
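
The environment check above can be sketched as follows. This is a minimal illustration, not SDK code: the helper name `resolve_gcp_settings` and its return shape are assumptions made for this example. It only reads the two variables named above and reports which ones are missing so the user can be asked for them.

```python
import os
from typing import Mapping, Optional


def resolve_gcp_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read project/location from the environment and list what is missing.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    env = os.environ if env is None else env
    required = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")
    # Treat unset and blank values the same way.
    values = {name: (env.get(name) or "").strip() for name in required}
    missing = [name for name, value in values.items() if not value]
    return {
        "project": values["GOOGLE_CLOUD_PROJECT"] or None,
        "location": values["GOOGLE_CLOUD_LOCATION"] or None,
        "missing": missing,
    }


if __name__ == "__main__":
    settings = resolve_gcp_settings()
    if settings["missing"]:
        print("Ask the user for:", ", ".join(settings["missing"]))
    else:
        print("Using", settings["project"], "in", settings["location"])
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.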

## The Quality Flywheel

Five stages, run in order on the first pass, then loop 2 → 5 until quality
targets are met.
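
As a rough sketch of that control flow (Stage 1 once, then Stages 2 to 5 repeated until targets are met), the loop could look like the code below. Everything here is a hypothetical placeholder: `run_flywheel`, the `stages` mapping, and `targets_met` are illustration names, not SDK APIs, and the iteration cap is an assumed safety limit.

```python
from typing import Callable, Dict, Tuple


def run_flywheel(
    stages: Dict[int, Callable[[dict], dict]],
    targets_met: Callable[[dict], bool],
    max_iterations: int = 10,
) -> Tuple[dict, int]:
    """Run Stage 1 once, then repeat Stages 2..5 until targets are met.

    `stages` maps a stage number (1-5) to a callable that takes and returns
    a state dict. All names are hypothetical placeholders for illustration.
    """
    state = stages[1]({})  # Stage 1 runs only on the first pass.
    iterations = 0
    while iterations < max_iterations:
        for number in (2, 3, 4, 5):
            state = stages[number](state)
        iterations += 1
        if targets_met(state):
            break
    return state, iterations
```

The cap keeps a loop that never reaches its targets from running forever; hitting it is a signal to revisit the targets or the dataset rather than to lower the bar.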

### Shortcuts that waste time

| Shortcut                                                 | Why it fails                                                                                              |
| -------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| "I'll tune the metric threshold down so it passes."      | Hides real failures. Fix the agent, not the bar.                                                          |
| "This case is flaky, I'll skip it."                      | Flakiness reveals non-determinism in the agent. Fix with `temperature=0` or stricter instructions.        |
| "I just need to fix the eval dataset, not the agent."    | If expected outputs keep moving, the agent has a behavior problem.                                        |
| "I can tell from the trace it works — skip Stage 3."     | Self-grading doesn't generalize. Always run `evaluate()` and read scores.                                 |
| "One iteration is enough."                               | Expect 5–10+ iterations. Stopping early leaves regressions on other metrics undetected.                   |

### 1. Prepare Data

Produce an `EvaluationDataset`. There are three input shapes; pick the one
that matches the data the user already has:

-   **`EvalCase` list (single-turn or multi-turn):**

    ```python
    from agentplatform import types
    dataset = types.EvaluationDataset(eval_cases=[
        types.EvalCase(prompt="What is 2+2?", response="4", reference="4"),
        # For multi-turn agent traces, set agent_data instead of prompt/response.
    ])
    ```

    Multi-turn agent traces wrap each conversation in `AgentData` →
    `ConversationTurn` → `AgentEvent`. See
    [references/dataset_schema.md](references/dataset_schema.md) for the
    full type hierarchy.

-   **Pandas DataFrame (tabular sources — CSV, BigQuery, Sheets):**

    ```python
    import pandas as pd
    from agentplatform import types

    df = pd.DataFrame({
        "prompt":    ["What is 2+2?", "Capital of France?"],
        "response":  ["4",            "Paris"],
        "reference": ["4",            "Paris"],
    })
    dataset = types.EvaluationDataset(eval_dataset_df=df)
    ```

    Column names must match the fields the chosen metrics expect (see
    [references/dataset_schema.md](references/dataset_schema.md) for the
    per-metric requirements table).

-   **Cold start (no data at all):** synthesize scenarios server-side with
    `client.evals.generate_user_scenarios(...)` and a
    `UserScenarioGenerationConfig` (`user_scenario_count`,
    `simulation_instruction`, `environment_data`). Stage 2 plays them out.

For ADK session dumps, use `scripts/parse_adk_traces.py` instead of writing
the conversion by hand.
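
Because DataFrame column names must match the fields the chosen metrics expect, a quick pre-flight check can catch mismatches before any evaluation call. The sketch below is an assumption-laden illustration: `missing_columns` is a hypothetical helper, and `EXAMPLE_REQUIREMENTS` is a made-up mapping, not the real per-metric table (that lives in `references/dataset_schema.md`).

```python
from typing import Dict, Iterable, List, Mapping

# Hypothetical requirements for illustration only. Replace with the real
# per-metric table from references/dataset_schema.md.
EXAMPLE_REQUIREMENTS: Dict[str, List[str]] = {
    "general_quality": ["prompt", "response"],
    "final_response_match": ["prompt", "response", "reference"],
}


def missing_columns(
    columns: Iterable[str],
    metrics: Iterable[str],
    requirements: Mapping[str, List[str]],
) -> Dict[str, List[str]]:
    """Return {metric: [absent columns]} for metrics whose inputs are missing.

    Metrics that do not appear in `requirements` are skipped.
    """
    present = set(columns)
    gaps: Dict[str, List[str]] = {}
    for metric in metrics:
        absent = [c for c in requirements.get(metric, []) if c not in present]
        if absent:
            gaps[metric] = absent
    return gaps
```

With a DataFrame, call it as `missing_columns(df.columns, ["general_quality"], EXAMPLE_REQUIREMENTS)`; an empty dict means every listed metric has the columns it needs.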

### 2. Run Inference

Populate responses/traces on the dataset. **Skip this stage** if traces are
already complete (e.g., production logs or replay).

```python
# Agent eval — pass a callable wrapping the user's ADK Agent/App.
client.evals.run_inference(model=agent_callable, src=dataset)

# Model eval — pass a model ID directly.
client.evals.run_inference(model="gemini-2.5-flash", src=dataset)

# Synthesized scenarios — let the simulator drive.
client.evals.run_inference(
    model=agent_callable,
    src=dataset,
    user_simulator_config=UserSimulatorConfig(max_turn=10),
)

# DataFrame also works as src= — no EvalCase wrapping needed.
client.evals.run_inference(model="gemini-2.5-flash", src=df)
```
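
To decide whether this stage can be skipped, one option is to check that every row already carries a response. The helper below is a minimal sketch under assumptions: `needs_inference` is a hypothetical name, rows are plain dicts, and the field is called `response` as in the Stage 1 examples. It does not inspect multi-turn agent traces.

```python
from typing import Iterable, Mapping


def needs_inference(
    rows: Iterable[Mapping[str, object]], field: str = "response"
) -> bool:
    """Return True if any row lacks a usable value for `field`.

    A value counts as missing when it is absent, None, NaN, or a blank string.
    """
    for row in rows:
        value = row.get(field)
        if value is None:
            return True
        if isinstance(value, float) and value != value:  # NaN check
            return True
        if isinstance(value, str) and not value.strip():
            return True
    return False
```

For a pandas DataFrame, `needs_inference(df.to_dict("records"))` applies the same check row by row.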

### 3. Grade (always run)

```python
result = client.evals.evaluate(dataset=dataset, metrics=[...])
```

**Pick metrics by what you want to measure.** Full catalog in
[references/metric_registry.md](references/metric_registry.md).

**Agent metrics (multi-turn, adaptive rubrics)** — start here for agent eval.

| Goal                                          | Metric                          |
| --------------------------------------------- | ------------------------------- |
| Did the agent achieve the user's goal?        | `multi_turn_task_success`       |
| Was the reasoning path logical and efficient? | `multi_turn_trajectory_quality` |
| Tool/function calling quality across turns    | `multi_turn_tool_use_quality`   |
| Overall conversational quality                | `multi_turn_general_quality`    |
| Final response quality (no reference needed)  | `final_response_quality`        |
| Final response vs. a golden reference         | `final_response_match`          |
| Single-turn tool use                          | `tool_use_quality`              |

**General quality metrics (single-turn, adaptive rubrics)** — for model eval.

| Goal                                                  | Metric                  |
| ----------------------------------------------------- | ----------------------- |
| Overall response quality (recommended starting point) | `general_quality`       |
| Linguistic quality (fluency, coherence, grammar)      | `text_quality`          |
| Adherence to specific constraints / instructions      | `instruction_following` |

**Static rubric metrics (fixed criteria)** — apply alongside the above.

| Goal                                              | Metric          |
| ------------------------------------------------- | --------------- |
| Catch hallucinated claims (RAG, factual answers)  | `hallucination` |
| Factuality / consistency against provided context | `grounding`     |
| Safety policy compliance                          | `safety`        |

**Domain-specific check no built-in covers:** write a custom metric.

-   **Predefined:** `types.RubricMetric.<NAME>` — server-side AutoRater, no
    judge model needed.
-   **Custom LLM-as-a-judge:** `types.LLMMetric` with `prompt_template` or
    `types.MetricPromptBuilder` for structured rubrics.
-   **Custom code:** `types.CodeExecutionMetric` with a `custom_function`
    string containing `def evaluate(instance: dict)` for remote sandboxed
    execution; or `types.Metric` with `custom_function=<callable>` for
    local execution.
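
For the custom-code option, the function body might look like the sketch below. It follows the `def evaluate(instance: dict)` signature named above, but the rest is assumed: that `instance` exposes `response` and `reference` keys, and that returning a float score is acceptable. Check both against the SDK before relying on it.

```python
def evaluate(instance: dict) -> float:
    """Score 1.0 when response equals reference after normalization, else 0.0.

    Assumptions (not confirmed by the SDK docs quoted here): `instance` has
    "response" and "reference" keys, and a float is a valid return value.
    """

    def normalize(text) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences do not count as failures.
        return " ".join(str(text or "").lower().split())

    response = normalize(instance.get("response"))
    reference = normalize(instance.get("reference"))
    if not reference:
        return 0.0  # Nothing to compare against.
    return 1.0 if response == reference else 0.0
```

A deterministic check like this is cheap to run and easy to debug, which makes it a reasonable complement to rubric-based metrics rather than a replacement for them.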

**Always persist the result** so Stages 4 and 5 can read it. Save both JSON
(machine-readable, diffable) and HTML (human-readable, linkable):

```python
import datetime
from pathlib import Path

from agentplatform._genai import _evals_visualization

out_dir = Path("artifacts/grade_results")
out_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

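# Write with an explicit encoding so the output does not depend on the
# platform default.
result_json = result.model_dump_json()
(out_dir / f"results_{ts}.json").write_text(result_json, encoding="utf-8")

html = _evals_visualization.get_evaluation_html(result_json)
(out_dir / f"results_{ts}.html").write_text(str(html), encoding="utf-8")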
```

Or after the fact: `scripts/render_html_report.py --type evaluation` or
`scripts/inspect_results.py --save-html`.
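
Since the file names embed a sortable timestamp, later stages can locate the most recent runs without extra bookkeeping. The sketch below assumes the `results_<timestamp>.json` naming used above; `list_result_files` and `latest_two` are hypothetical helpers, not part of the SDK or the bundled scripts.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def list_result_files(out_dir: Path) -> List[Path]:
    """Return results_*.json files oldest-first.

    The %Y%m%d_%H%M%S timestamp sorts lexicographically in time order.
    """
    return sorted(Path(out_dir).glob("results_*.json"), key=lambda p: p.name)


def latest_two(out_dir: Path) -> Optional[Tuple[Path, Path]]:
    """Return (previous, current) result files, or None if fewer than two exist."""
    files = list_result_files(out_dir)
    if len(files) < 2:
        return None
    return files[-2], files[-1]
```

The returned pair gives a before/after comparison something concrete to load, for example `latest_two(Path("artifacts/grade_results"))`.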

### 4. Analyze Failures

Read `summary_metrics` and `eval_case_results` — never fabricate scores.
Use `scripts/inspect_results.py --failing-only

Files: 10

Size: 87.7 KB

Complexity: 71/100

Category: Cloud & DevOps

Source: https://github.com/google/skills/tree/main/skills/cloud/agent-platform-eval-flywheel
