langsmith-dataset

Use this skill for ANY question about creating test or evaluation datasets for LangChain agents. Covers generating datasets from traces (final_response, single_step, trajectory, RAG types), uploading to LangSmith, and managing evaluation data.


# LangSmith Dataset

Auto-generate evaluation datasets from LangSmith traces for testing and validation.

## Setup

### Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # Required
LANGSMITH_PROJECT=your-project-name                   # Optional: default project
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
```
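
As a rough illustration of how these variables might be validated before any LangSmith call, the sketch below reads them from a mapping such as `os.environ`. The helper name `read_langsmith_env` and the keys of the returned dictionary are hypothetical and are not taken from the skill's scripts.

```python
import os
from typing import Mapping, Optional


def read_langsmith_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the LangSmith settings listed above (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("LANGSMITH_API_KEY", "").strip()
    if not api_key:
        # The API key is the only variable marked as required.
        raise RuntimeError("LANGSMITH_API_KEY is not set")
    return {
        "api_key": api_key,
        "project": env.get("LANGSMITH_PROJECT") or None,
        "workspace_id": env.get("LANGSMITH_WORKSPACE_ID") or None,
    }
```

If the variables live in a `.env` file, `load_dotenv()` from `python-dotenv` can populate `os.environ` before this helper is called.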

### Dependencies

```bash
pip install langsmith click rich python-dotenv
```

## Usage

Navigate to `skills/langsmith-dataset/scripts/` to run commands.

### Scripts

- **`generate_datasets.py`** - Create evaluation datasets from traces
- **`query_datasets.py`** - View and inspect datasets

### Common Flags

All dataset generation commands support:

- `--root-run-name <name>` - Filter traces by root run name (e.g., "LangGraph" for DeepAgents)
- `--limit <n>` - Number of traces to process (default: 30)
- `--last-n-minutes <n>` - Only recent traces
- `--output <path>` - Output file (.json or .csv)
- `--upload <name>` - Upload to LangSmith with this dataset name
- `--replace` - Overwrite existing file/dataset (will prompt for confirmation)
- `--yes` - Skip confirmation prompts (use with caution)
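
How the script turns these flags into a LangSmith query is not shown here. One plausible mapping, assuming it relies on `Client.list_runs` from the `langsmith` SDK, is sketched below. `build_run_query` is a hypothetical helper, and restricting the query to root runs is an assumption.

```python
from datetime import datetime, timedelta, timezone


def build_run_query(project, root_run_name=None, limit=30,
                    last_n_minutes=None, now=None):
    """Translate the common flags into keyword arguments for Client.list_runs."""
    kwargs = {"project_name": project, "is_root": True, "limit": limit}
    if root_run_name:
        # LangSmith filter syntax: match runs by exact name.
        kwargs["filter"] = f'eq(name, "{root_run_name}")'
    if last_n_minutes is not None:
        now = now or datetime.now(timezone.utc)
        kwargs["start_time"] = now - timedelta(minutes=last_n_minutes)
    return kwargs
```

With the real SDK this could be used as `Client().list_runs(**build_run_query("my-project", "LangGraph"))`.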

**IMPORTANT - Safety Prompts:**
- The script prompts for confirmation before deleting existing datasets with `--replace`
- **ALWAYS respect these prompts** - wait for user input before proceeding
- **NEVER use `--yes` flag unless the user explicitly requests it**
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user
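
The confirmation behavior described above can be pictured with a minimal sketch. `confirm_replace` and its parameters are hypothetical; since `click` is among the dependencies, the actual script may well use `click.confirm` instead.

```python
from typing import Callable


def confirm_replace(target: str, yes: bool = False,
                    ask: Callable[[str], str] = input) -> bool:
    """Return True only if replacing `target` has been approved."""
    if yes:
        # --yes: the caller has explicitly opted out of prompts.
        return True
    answer = ask(f"Replace existing '{target}'? [y/N] ")
    # Anything other than an explicit yes counts as a refusal.
    return answer.strip().lower() in {"y", "yes"}
```

Defaulting to "no" on empty or unrecognized input keeps an accidental Enter key from deleting a dataset.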

### Understanding Trace Hierarchy

Traces have depth levels based on parent-child relationships:

```
Depth 0: Root agent (e.g., "LangGraph")
  ├── Depth 1: Middleware/chains (model, tools, SummarizationMiddleware)
  │     ├── Depth 2: Tool calls (sql_db_query, retriever, etc.)
  │     └── Depth 2: LLM calls (ChatOpenAI, ChatAnthropic)
  └── Depth 3+: Nested subagent calls
```

**Use `--root-run-name` to target specific agent frameworks:**
- DeepAgents: `--root-run-name LangGraph`
- Custom agents: Use your root node name
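
To make the depth numbering concrete, here is a small sketch that derives depths from parent-child links. It assumes runs are plain dicts with `id` and `parent_run_id` keys (LangSmith run objects expose fields with the same names); `compute_depths` is a hypothetical helper.

```python
def compute_depths(runs):
    """Map run id -> depth, where a run with no known parent has depth 0."""
    parent_of = {r["id"]: r.get("parent_run_id") for r in runs}
    depths = {}

    def depth(run_id):
        if run_id not in depths:
            parent = parent_of.get(run_id)
            # A missing or unknown parent marks the root of the trace.
            depths[run_id] = 0 if parent not in parent_of else depth(parent) + 1
        return depths[run_id]

    for run_id in parent_of:
        depth(run_id)
    return depths
```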

## Dataset Types

### 1. Final Response

Full conversation with expected output - tests complete agent behavior.

```bash
# Basic usage
python generate_datasets.py --type final_response \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/final_response.json

# With custom output fields
python generate_datasets.py --type final_response \
  --project my-project \
  --output-fields "answer,result" \
  --output /tmp/final.json

# Messages only (ignore output dict keys)
python generate_datasets.py --type final_response \
  --project my-project \
  --messages-only \
  --output /tmp/final.json
```

**Structure:**
```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_response": "The top 3 genres based on the number of tracks are:\n\n1. Rock with 1,297 tracks\n2. Latin with 579 tracks\n3. Metal with 374 tracks"
  }
}
```

**Extraction Priority:**
1. Messages from root run (AI responses with content)
2. User-specified output fields (`--output-fields`)
3. Common keys (answer, output)
4. Full output dict

**Important:** The script always checks the root run first for the final response, so that intermediate tool outputs are not mistaken for the answer.
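
The priority order above might look roughly like this in code. This is a sketch under assumptions: messages are dicts with `type` and `content` keys (as in the Single Step structure), the last non-empty AI message is taken as the final response, and `extract_final_response` is a hypothetical name.

```python
def extract_final_response(outputs, output_fields=()):
    """Pick the expected response from a root run's outputs dict."""
    # 1. Last AI message with non-empty content.
    for msg in reversed(outputs.get("messages", [])):
        if msg.get("type") == "ai" and msg.get("content"):
            return msg["content"]
    # 2. User-specified fields, then 3. common keys.
    for key in (*output_fields, "answer", "output"):
        if outputs.get(key):
            return outputs[key]
    # 4. Fall back to the full output dict.
    return outputs
```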

### 2. Single Step

Single node inputs/outputs - tests any specific node's behavior. **Supports multiple occurrences per trace** to capture conversation evolution.

```bash
# Extract all occurrences (default)
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --output /tmp/single_step.json

# Sample 2 occurrences per trace
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --sample-per-trace 2 \
  --output /tmp/single_step_sampled.json

# Target specific tool at depth 2
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name sql_db_query \
  --output /tmp/sql_query.json
```

**Structure:**
```json
{
  "trace_id": "...",
  "run_id": "...",
  "occurrence": 2,
  "inputs": {
    "messages": [
      {"type": "human", "content": "What are the top 3 genres?"},
      {"type": "ai", "content": "", "tool_calls": [...]},
      {"type": "tool", "content": "...results..."},
      ...
    ]
  },
  "outputs": {
    "expected_output": {
      "messages": [
        {"type": "ai", "content": "", "tool_calls": [...]}
      ]
    },
    "node_name": "model"
  }
}
```

**Key Features:**
- `occurrence` field tracks which invocation (1st, 2nd, 3rd, etc.)
- Later occurrences have more conversation history → tests context handling
- `--sample-per-trace` randomly samples N occurrences per trace
- Use `--run-name` to target any node at any depth

**Common targets:**
- `model` (depth 1) - LLM invocations with growing context
- `tools` (depth 1) - Tool execution chain
- Any custom node name
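
The `occurrence` numbering and `--sample-per-trace` behavior can be sketched as follows. The dict keys (`trace_id`, `name`, `start_time`), the `seed` parameter, and the helper name `select_occurrences` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def select_occurrences(runs, run_name, sample_per_trace=None, seed=None):
    """Number matching runs per trace, optionally sampling N of them."""
    by_trace = defaultdict(list)
    for run in sorted(runs, key=lambda r: r["start_time"]):
        if run["name"] == run_name:
            by_trace[run["trace_id"]].append(run)

    rng = random.Random(seed)
    selected = []
    for matches in by_trace.values():
        # occurrence is 1-based and follows execution order within the trace.
        numbered = [dict(r, occurrence=i) for i, r in enumerate(matches, 1)]
        if sample_per_trace is not None and len(numbered) > sample_per_trace:
            numbered = sorted(rng.sample(numbered, sample_per_trace),
                              key=lambda r: r["occurrence"])
        selected.extend(numbered)
    return selected
```

Numbering before sampling means a sampled example keeps its original position in the trace, so an `occurrence` of 3 still refers to the third invocation.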

### 3. Trajectory

Tool call sequence - tests execution path with configurable depth.

```bash
# Include all tool calls (all depths)
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/trajectory_all.json

# Only tool calls up to depth 2
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --depth 2 \
  --output /tmp/trajectory_depth2.json

# Only root-level tool calls (depth 0) - usually empty if tools are at depth 2+
python generate_datasets.py --type trajectory \
  --project my-project \
  --depth 0 \
  --output /tmp/trajectory_root.json
```

**Structure:**
```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_trajectory": [
      "sql_db_list_tables",
      "sql_db_schema",
      "sql_db_query_checker",
      "sql_db_query"
    ]
  }
}
```

**Depth Control:**
- Omit `--depth` = all levels (includes subagent tool calls)
- `--depth 2` = root + 2 levels (typical for capturing all main tools)
- `--depth 1` = often only middleware/chains, no actual tool calls
- `--depth 0` = root only (no tool calls)

**Note:** Tool calls are typically at depth 2 in LangGraph/DeepAgents architecture.
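
A minimal sketch of the depth filter, assuming each run is a dict carrying a precomputed `depth` plus `run_type`, `name`, and `start_time` keys (`"tool"` is a standard LangSmith run type). `build_trajectory` is a hypothetical helper, not the script's own function.

```python
def build_trajectory(runs, max_depth=None):
    """Return tool names in execution order, limited to max_depth if given."""
    tools = [
        r for r in runs
        if r["run_type"] == "tool"
        and (max_depth is None or r["depth"] <= max_depth)
    ]
    return [r["name"] for r in sorted(tools, key=lambda r: r["start_time"])]
```

With tools sitting at depth 2, `max_depth=0` or `max_depth=1` yields an empty trajectory, which matches the notes above.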

### 4. RAG

Question/chunks/answer/citations - tests retrieval quality.

```bash
python generate_datasets.py --type rag \
  --project my-project \
  --limit 30 \
  --output /tmp/rag_ds.csv  # Supports .json or .csv
```

**Structure (CSV format):**
```csv
question,retrieved_chunks,answer,cited_chunks
"How do I...","Chunk 1\n\nChunk 2","The answer is...","[""Chunk 1""]"
```
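
For consumers of the CSV file, the following sketch writes and reads one row with Python's standard `csv` module. It assumes that `retrieved_chunks` joins chunks with a blank line and that `cited_chunks` holds a JSON-encoded list; both helper names are hypothetical. Note that the `csv` module escapes embedded quotes by doubling them.

```python
import csv
import io
import json


def rag_row_to_csv(question, retrieved_chunks, answer, cited_chunks):
    """Serialize one RAG example as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["question", "retrieved_chunks", "answer", "cited_chunks"])
    writer.writerow([
        question,
        "\n\n".join(retrieved_chunks),   # chunks separated by a blank line
        answer,
        json.dumps(cited_chunks),        # citations stored as a JSON list
    ])
    return buf.getvalue()


def read_rag_rows(text):
    """Parse CSV text back into dicts, decoding the two list-valued columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["retrieved_chunks"] = row["retrieved_chunks"].split("\n\n")
        row["cited_chunks"] = json.loads(row["cited_chunks"])
        rows.append(row)
    return rows
```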

## Output Formats

All dataset types support both JSON and CSV:
```bash
# JSON output (default)
python generate_datasets.py --type trajectory --project my-project --output ds.json

# CSV output (use .csv extension)
python generate_datasets.py --type trajectory --project my-project --output ds.csv
```

## Upload to LangSmith

```bash
# Generate and upload in one command
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 50 \
  --output /tmp/trajectory_ds.json \
  --upload "Skills: Trajectory"

# Use --replace to overwrite existing dataset
python generate_datasets.py --type final_response \
  --project my-project \
  --output /tmp/final.json \
  --upload "Skills: Final Response" \
  --replace
```
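
For readers who want to upload a generated file themselves, the sketch below shows one way to do it with the `langsmith` SDK's `has_dataset`, `delete_dataset`, `create_dataset`, and `create_examples` methods. It is not the script's actual implementation; `upload_examples` is a hypothetical helper, and `client` is expected to be a `langsmith.Client` instance.

```python
def upload_examples(client, dataset_name, examples, replace=False):
    """Create (or recreate) a dataset and add the generated examples."""
    if client.has_dataset(dataset_name=dataset_name):
        if not replace:
            raise ValueError(f"Dataset '{dataset_name}' already exists")
        # Destructive: confirm with the user before reaching this point.
        client.delete_dataset(dataset_name=dataset_name)
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[ex["inputs"] for ex in examples],
        outputs=[ex["outputs"] for ex in examples],
        dataset_id=dataset.id,
    )
    return dataset
```

A real client can be created with `from langsmith import Client; client = Client()`, which picks up `LANGSMITH_API_KEY` from the environment.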

**Naming Convention:** Use "Skills: <Type>" format for consistency:
- "Skills: Final Response"
- "Skills: Single Step (model)"
- "Skills: Single Step (sql_db_query)"
- "Skills: Trajectory (all depths)"
- "Skills: Trajectory (depth=2)"

## Query Datasets

```bash
# List all datasets
python query_datasets.py list-datasets

# Filter by name pattern
python query_datasets.py list-datasets | grep "Skills:"

# View dataset examples
python query_datasets.py show "Skills: Trajectory" --limit 5

# View local file
python query_datasets.py view-file /
```