arize-experiment
Creates, runs, and analyzes Arize experiments for evaluating and comparing model performance. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI. Use when the user mentions create experiment, run experiment, compare models, model performance, evaluate AI, experiment results, benchmark, A/B test models, or measure accuracy.
What this skill does
# Arize Experiment Skill
> **`SPACE`** — All `--space` flags and the `ARIZE_SPACE` env var accept a space **name** (e.g., `my-workspace`) or a base64 space **ID** (e.g., `U3BhY2U6...`). Find yours with `ax spaces list`.
## Concepts
- **Experiment** = a named evaluation run against a specific dataset version, containing one run per example
- **Experiment Run** = the result of processing one dataset example -- includes the model output, optional evaluations, and optional metadata
- **Dataset** = a versioned collection of examples; every experiment is tied to a dataset and a specific dataset version
- **Evaluation** = a named metric attached to a run (e.g., `correctness`, `relevance`), with optional label, score, and explanation
The typical flow: export a dataset → process each example → collect outputs and evaluations → create an experiment with the runs.
## Prerequisites
Proceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.
If an `ax` command fails, troubleshoot based on the error:
- `command not found` or version error → see references/ax-setup.md
- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong, follow references/ax-profiles.md to create/update it. If the user doesn't have their key, direct them to https://app.arize.com/admin > API Keys
- Space unknown → run `ax spaces list` to pick by name, or ask the user
- Project unclear → ask the user, or run `ax projects list -o json --limit 100` and present as selectable options
- **Security:** Never read `.env` files or search the filesystem for credentials. Use `ax profiles` for Arize credentials and `ax ai-integrations` for LLM provider keys. If credentials are not available through these channels, ask the user.
- **CRITICAL — Never fabricate outputs:** When running an experiment, you MUST call the real model API specified by the user for every dataset example. Never fabricate, simulate, or hardcode model outputs, latencies, or evaluation scores. If you cannot call the API (missing SDK, missing credentials, network error), stop and tell the user what is needed before proceeding.
## List Experiments: `ax experiments list`
Browse experiments, optionally filtered by dataset. Output goes to stdout.
```bash
ax experiments list
ax experiments list --dataset DATASET_NAME --space SPACE --limit 20 # DATASET_NAME: name or ID (name preferred)
ax experiments list --cursor CURSOR_TOKEN
ax experiments list -o json
```
### Flags
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--dataset` | string | none | Filter by dataset |
| `--limit, -l` | int | 15 | Max results (1-100) |
| `--cursor` | string | none | Pagination cursor from previous response |
| `-o, --output` | string | table | Output format: table, json, csv, parquet, or file path |
| `-p, --profile` | string | default | Configuration profile |
## Get Experiment: `ax experiments get`
Quick metadata lookup -- returns experiment name, linked dataset/version, and timestamps.
```bash
ax experiments get NAME_OR_ID
ax experiments get NAME_OR_ID -o json
ax experiments get NAME_OR_ID --dataset DATASET_NAME --space SPACE # required when using experiment name instead of ID
```
### Flags
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `NAME_OR_ID` | string | required | Experiment name or ID (positional) |
| `--dataset` | string | none | Dataset name or ID (required if using experiment name instead of ID) |
| `--space` | string | none | Space name or ID (required if using dataset name instead of ID) |
| `-o, --output` | string | table | Output format |
| `-p, --profile` | string | default | Configuration profile |
### Response fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Experiment ID |
| `name` | string | Experiment name |
| `dataset_id` | string | Linked dataset ID |
| `dataset_version_id` | string | Specific dataset version used |
| `experiment_traces_project_id` | string | Project where experiment traces are stored |
| `created_at` | datetime | When the experiment was created |
| `updated_at` | datetime | Last modification time |
## Export Experiment: `ax experiments export`
Download all runs to a file. By default uses the REST API; pass `--all` to use Arrow Flight for bulk transfer.
```bash
# EXPERIMENT_NAME, DATASET_NAME: name or ID (name preferred)
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE
# -> experiment_abc123_20260305_141500/runs.json
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --all
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --output-dir ./results
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --stdout
ax experiments export EXPERIMENT_NAME --dataset DATASET_NAME --space SPACE --stdout | jq '.[0]'
```
### Flags
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `NAME_OR_ID` | string | required | Experiment name or ID (positional) |
| `--dataset` | string | none | Dataset name or ID (required if using experiment name instead of ID) |
| `--space` | string | none | Space name or ID (required if using dataset name instead of ID) |
| `--all` | bool | false | Use Arrow Flight for bulk export (see below) |
| `--output-dir` | string | `.` | Output directory |
| `--stdout` | bool | false | Print JSON to stdout instead of file |
| `-p, --profile` | string | default | Configuration profile |
### REST vs Flight (`--all`)
- **REST** (default): Lower friction -- no Arrow/Flight dependency, standard HTTPS ports, works through any corporate proxy or firewall. Limited to 500 runs per page.
- **Flight** (`--all`): Required for experiments with more than 500 runs. Uses gRPC+TLS on a separate host/port (`flight.arize.com:443`) which some corporate networks may block.
**Agent auto-escalation rule:** If a REST export returns exactly 500 runs, the result is likely truncated. Re-run with `--all` to get the full dataset.
Output is a JSON array of run objects:
```json
[
{
"id": "run_001",
"example_id": "ex_001",
"output": "The answer is 4.",
"evaluations": {
"correctness": { "label": "correct", "score": 1.0 },
"relevance": { "score": 0.95, "explanation": "Directly answers the question" }
},
"metadata": { "model": "gpt-4o", "latency_ms": 1234 }
}
]
```
## Create Experiment: `ax experiments create`
Create a new experiment with runs from a data file.
```bash
ax experiments create --name "gpt-4o-baseline" --dataset DATASET_NAME --space SPACE --file runs.json
ax experiments create --name "claude-test" --dataset DATASET_NAME --space SPACE --file runs.csv
```
### Flags
| Flag | Type | Required | Description |
|------|------|----------|-------------|
| `--name, -n` | string | yes | Experiment name |
| `--dataset` | string | yes | Dataset to run the experiment against |
| `--space, -s` | string | no | Space name or ID (required if using dataset name instead of ID) |
| `--file, -f` | path | yes | Data file with runs: CSV, JSON, JSONL, or Parquet |
| `-o, --output` | string | no | Output format |
| `-p, --profile` | string | no | Configuration profile |
### Passing data via stdin
Use `--file -` to pipe data directly — no temp file needed:
```bash
echo '[{"example_id": "ex_001", "output": "Paris"}]' | ax experiments create --name "my-experiment" --dataset DATASET_NAME --space SPACE --file -
# Or with a heredoc
ax experiments create --name "my-experiment" --dataset DATASET_NAME --space SPACE --file - << 'EOF'
[{"example_id": "ex_001", "output": "Paris"}]
EOF
```
### Required columns in the runs file
| Column | Type | Required | Description |
|--------|------|----------|-------------|
| `example_id` | string | yes | ID of the dataset example this run corresponds to |
| `output` | string | yes | The modRelated in Code Review
gstack
IncludedFast headless browser for QA testing and site dogfooding. Navigate pages, interact with elements, verify state, diff before/after, take annotated screenshots, test responsive layouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or test a site, verify a deployment, dogfood a user flow, or file a bug with screenshots. (gstack)
startup-due-diligence
IncludedLegal due diligence review for seed-stage and Series A startups (US, Delaware C-Corp focus). Supports both investor and founder perspectives. Capabilities include: (1) Interactive document review and issue spotting; (2) Document request list generation; (3) Cap table and SAFE/convertible note analysis; (4) Red flag identification with severity ratings; (5) Diligence report generation. TRIGGERS: due diligence, DD, startup investment, cap table review, Series A, seed round, investor diligence, legal review startup, SAFE analysis, convertible note, 409A, founder vesting.
interview-master
IncludedThis skill should be used when the user asks to "generate interview questions", "prepare for interview", "optimize resume", "conduct mock interview", "analyze git commits for resume", "generate resume from code", "review my resume", or mentions interview preparation, career assistance, or extracting project experience from git history. Provides comprehensive interview and career development guidance for both job seekers and interviewers.
fix-issue
IncludedFixes GitHub issues using parallel analysis agents for root cause investigation, code exploration, and regression detection. Reads issue context from gh CLI, searches codebase and memory for related patterns, generates a fix with tests, and links the resolution back to the issue via PR. Includes prevention analysis to avoid recurrence. Use when debugging errors, resolving regressions, fixing bugs, or triaging issues.
sf-apex
IncludedGenerates and reviews Salesforce Apex code with 150-point scoring. TRIGGER when: user writes, reviews, or fixes Apex classes, triggers, test classes, batch/queueable/schedulable jobs, or touches .cls/.trigger files. DO NOT TRIGGER when: LWC JavaScript (use sf-lwc), Flow XML (use sf-flow), SOQL-only queries (use sf-soql), or non-Salesforce code.
swift-development
IncludedComprehensive Swift development for building, testing, and deploying iOS/macOS applications. Use when Claude needs to: (1) Build Swift packages or Xcode projects from command line, (2) Run tests with XCTest or Swift Testing framework, (3) Manage iOS simulators with simctl, (4) Handle code signing, provisioning profiles, and app distribution, (5) Format or lint Swift code with SwiftFormat/SwiftLint, (6) Work with Swift Package Manager (SPM), (7) Implement Swift 6 concurrency patterns (async/await, actors, Sendable), (8) Create SwiftUI views with MVVM architecture, (9) Set up Core Data or SwiftData persistence, or any other Swift/iOS/macOS development tasks.