deepeval
DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when the user wants to evaluate or improve an AI agent, tool-using workflow, multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or goldens; use deepeval generate; use deepeval test run; send results to Confident AI; monitor production; run online evals; inspect traces; or iterate on prompts, tools, retrieval, or agent behavior from eval failures. AI agents are the primary use case. Covers Python SDK, pytest eval suites, CLI generation, traced evals, Confident AI reporting, and agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest, non-AI test setup, or non-DeepEval observability work unless the user asks to compare or migrate to DeepEval; for instrumenting an app with DeepEval tracing, @observe, or framework integrations (use the `deepeval-tracing` skill); or for raw OpenTelemetry / OTLP export without the deepeval package (use the `deepeval-otel` skill).
# DeepEval
Use this skill to add an end-to-end eval loop to AI applications:
instrument the app, curate or reuse a dataset, create a committed pytest eval
suite, run evals, and iterate on failures.
## Prerequisites
Requires Python 3.9+ and `pip install deepeval` in the target project. Metrics
and synthetic generation need model credentials. Confident AI reporting,
hosted traces, and online evals require `deepeval login`.
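The local prerequisites can be checked before any eval work starts. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper name, not part of DeepEval, and it does not verify model credentials or the `deepeval login` state.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum Python version stated in the prerequisites


def check_prerequisites():
    """Report local prerequisites without installing or modifying anything.

    Hypothetical helper: it only checks the interpreter version and whether
    the `deepeval` package can be found on the import path.
    """
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "deepeval_installed": importlib.util.find_spec("deepeval") is not None,
    }


status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

If `deepeval_installed` is reported as missing, run `pip install deepeval` in the target project before continuing.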
## Workflow Summary
1. Inspect the target app and existing DeepEval usage.
2. Ask the required intake questions.
3. Reuse existing metrics and datasets when available.
4. Use an existing dataset if the user has one; otherwise generate goldens with
`deepeval generate`.
5. Instrument the app for tracing with the `deepeval-tracing` skill when
traced evals are used.
6. Run `deepeval test run`.
7. Iterate for the requested number of rounds, defaulting to 5.
## Core Principles
1. Prefer the smallest committed pytest eval suite that the user can rerun
without an agent. Do not hide goldens or tests in throwaway scripts.
2. Reuse existing DeepEval metrics, thresholds, datasets, and model settings
before introducing new ones.
3. Prefer traced single-turn evals when the app can be instrumented.
Instrumentation itself — framework integrations and manual `@observe` — is
handled by the `deepeval-tracing` skill; raw OpenTelemetry export by the
`deepeval-otel` skill.
4. Use `deepeval generate` for dataset generation. Use `deepeval test run` for
pytest eval execution. Do not default to the raw `pytest` command.
5. Keep metrics in a separate `metrics.py` module for committed eval suites.
6. Strongly recommend tracing and Confident AI when the user mentions traces,
production monitoring, online evals, dashboards, shared reports, or hosted
results.
7. Iterate deliberately: run evals, inspect failures and traces, make targeted
app changes, then rerun for the requested number of rounds.
## Required Workflow
1. Inspect the codebase for app type and existing DeepEval usage.
- For classification guidance, read `references/choose-use-case.md`.
- Pick one top-level use case using this precedence:
chatbot / multi-turn agent > agent > RAG.
- If an app is both RAG and agentic, treat it as agent. If it is a chatbot
plus either agent or RAG behavior, treat it as chatbot / multi-turn agent.
- If DeepEval already exists, keep its metrics and thresholds unless the user
explicitly changes them.
2. Ask the intake questions before editing application code.
- Read `references/intake.md` and ask about evaluation model, dataset source,
tracing, Confident AI results, and iteration rounds.
3. Choose test shape, metrics, and artifacts.
- Read `references/pytest-e2e-evals.md`.
- Read `references/metrics.md`.
- Read `references/artifact-contracts.md` for expected file locations.
- Use `templates/test_multi_turn_e2e.py` for chatbot / multi-turn agent.
- Use `templates/test_single_turn_tracing.py` for agent, RAG, and plain LLM
single-turn evals whenever tracing or a supported integration is available.
- Use `templates/test_single_turn_no_tracing.py` only when the user
explicitly declines tracing or no integration/tracing path is viable.
- Put metric instances in a `metrics.py` module based on
`templates/metrics.py`, or in the project's existing metrics module, not
inline in the eval file.
4. Prepare the dataset.
- For existing datasets, read `references/datasets.md`.
- For synthetic data, read `references/synthetic-data.md`.
- First ask whether the user already has a dataset.
- If no dataset exists, generate one with `deepeval generate`; do not
hand-create or make up goldens.
- Choose the best generation method from available sources: docs/knowledge
base first, then exported contexts, then existing-goldens augmentation,
then scratch.
- Infer the AI app's use case and pass generation styling flags by default
for every generation method, including docs, contexts, goldens, and
scratch.
- Target about 30-50 generated goldens for a useful first eval dataset.
- For chatbot / multi-turn agent use cases, use multi-turn conversational
goldens unless the user explicitly asks to start with single-turn QA pairs.
- For local or Confident AI datasets, follow `references/datasets.md`.
5. Instrument the app and choose the traced eval shape.
- Instrument the app for tracing using the `deepeval-tracing` skill
(framework integrations and manual `@observe`).
- Read `references/traced-evals.md` for the traced eval shapes and span
metrics.
- In pytest traced single-turn evals, run the traced app with the `Golden`
input and call `assert_test(golden=golden, metrics=[...])`.
- In script-based traced single-turn evals, use
`for golden in dataset.evals_iterator(metrics=[...])`.
- Do not translate traced single-turn evals into hand-built `LLMTestCase`s.
- Add component/span-level metrics only where diagnostics are useful.
6. Create the pytest eval suite.
- Read `references/pytest-e2e-evals.md`.
- Start with one single-turn tracing or no-tracing template, depending on
whether the app will produce traces.
- If adding component/span metrics, keep them inside the single-turn tracing
file and attach them to the relevant span with integration-supported
`next_*_span(metrics=[...])` or `@observe(metrics=[...])`.
- Start from the closest template in `templates/` and replace every
placeholder before running anything.
7. Run and iterate.
- Use `deepeval test run tests/evals/test_<app>.py`.
- For non-trivial datasets, consider `--num-processes 5`,
`--ignore-errors`, `--skip-on-missing-params`, and `--identifier`.
- Follow `references/iteration-loop.md` for the requested number of rounds.
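The use-case precedence from step 1 can be written as a small lookup. This is an illustrative sketch only: `pick_use_case` and its boolean flags are hypothetical names, and the `"plain LLM"` fallback label is an assumption for apps that match none of the three categories.

```python
# Precedence from step 1: chatbot / multi-turn agent > agent > RAG.
PRECEDENCE = ["chatbot / multi-turn agent", "agent", "RAG"]


def pick_use_case(is_chatbot=False, is_agent=False, is_rag=False):
    """Return the single top-level use case for an app (hypothetical helper)."""
    detected = {
        "chatbot / multi-turn agent": is_chatbot,
        "agent": is_agent,
        "RAG": is_rag,
    }
    # Walk the precedence list and return the first category that applies.
    for use_case in PRECEDENCE:
        if detected[use_case]:
            return use_case
    # Assumption: label for apps that are none of the above.
    return "plain LLM"


print(pick_use_case(is_agent=True, is_rag=True))
print(pick_use_case(is_chatbot=True, is_rag=True))
```

An app that is both RAG and agentic resolves to `agent`, and a chatbot with RAG or agent behavior resolves to `chatbot / multi-turn agent`, matching the rules in step 1.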
## Common Commands
Bootstrap single-turn goldens from docs only when no curated dataset exists:
```bash
deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
```
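The command above covers the docs method. The source order from step 4 of the required workflow (docs or knowledge base first, then exported contexts, then existing goldens, then scratch) can be sketched as follows. `choose_generation_method` is a hypothetical helper, and the assumption that `contexts`, `goldens`, and `scratch` are valid `--method` values is inferred from the method list in step 4; confirm against `references/synthetic-data.md` before use.

```python
# Preferred source order from step 4 of the required workflow.
METHOD_ORDER = ["docs", "contexts", "goldens", "scratch"]


def choose_generation_method(has_docs=False, has_contexts=False, has_goldens=False):
    """Return the first generation method whose source is available.

    Hypothetical helper. "scratch" needs no source material, so it is
    always available and acts as the last resort.
    """
    available = {
        "docs": has_docs,
        "contexts": has_contexts,
        "goldens": has_goldens,
        "scratch": True,
    }
    for method in METHOD_ORDER:
        if available[method]:
            return method


print(choose_generation_method(has_contexts=True, has_goldens=True))
```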
Run the eval suite:
```bash
deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
```
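When iterating over several rounds, it can help to build one command per round so each run gets its own identifier. The sketch below only assembles argument lists from the flags named in this document and does not execute anything. `build_test_run_command` and `plan_rounds` are hypothetical helpers, and `tests/evals/test_app.py` and `retrieval` are placeholder values.

```python
import shlex

DEFAULT_ROUNDS = 5  # default number of iteration rounds in this workflow


def build_test_run_command(test_file, purpose, round_number, num_processes=5,
                           ignore_errors=False, skip_on_missing_params=False):
    """Assemble a `deepeval test run` argument list for one round."""
    argv = [
        "deepeval", "test", "run", test_file,
        "--num-processes", str(num_processes),
        "--identifier", f"iterating-on-{purpose}-round-{round_number}",
    ]
    # Optional flags are appended only when requested.
    if ignore_errors:
        argv.append("--ignore-errors")
    if skip_on_missing_params:
        argv.append("--skip-on-missing-params")
    return argv


def plan_rounds(test_file, purpose, rounds=DEFAULT_ROUNDS):
    """Return one command per round, numbered from 1."""
    return [build_test_run_command(test_file, purpose, n)
            for n in range(1, rounds + 1)]


# Placeholder file name and purpose, for illustration only.
commands = plan_rounds("tests/evals/test_app.py", "retrieval")
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv)` after the app changes for that round are in place.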
Open the latest hosted report when Confident AI is enabled:
```bash
deepeval view
```
## References
| Topic | File |
| --- | --- |
| Intake questions and branching | `references/intake.md` |
| Use case selection | `references/choose-use-case.md` |
| Dataset loading | `references/datasets.md` |
| Synthetic data generation | `references/synthetic-data.md` |
| Metrics | `references/metrics.md` |
| Pytest E2E evals | `references/pytest-e2e-evals.md` |
| Traced evals and span metrics | `references/traced-evals.md` |
| Confident AI | `references/confident-ai.md` |
| Dataset and eval artifact contracts | `references/artifact-contracts.md` |
| Iteration loop | `references/iteration-loop.md` |
## Templates
| Purpose | Template |
| --- | --- |
| Single-turn tracing | `templates/test_single_turn_tracing.py` |
| Single-turn no tracing | `templates/test_single_turn_no_tracing.py` |
| Multi-turn E2E | `templates/test_multi_turn_e2e.py` |
| Shared metric lists | `templates/metrics.py` |
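The template choice described in step 3 of the required workflow reduces to two questions: is the app a chatbot / multi-turn agent, and is a tracing path available? A minimal sketch, assuming a hypothetical `pick_template` helper and the use-case labels used in this document:

```python
def pick_template(use_case, tracing_available=True):
    """Return the eval template path for a use case (hypothetical helper).

    `tracing_available` should be False only when the user explicitly
    declines tracing or no integration/tracing path is viable.
    """
    if use_case == "chatbot / multi-turn agent":
        return "templates/test_multi_turn_e2e.py"
    if tracing_available:
        return "templates/test_single_turn_tracing.py"
    return "templates/test_single_turn_no_tracing.py"


print(pick_template("agent"))
print(pick_template("RAG", tracing_available=False))
```

Metric instances stay out of the chosen eval file in every case, following `templates/metrics.py` or the project's existing metrics module.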