
vm-trace-analyzer


Analyze VictoriaMetrics query trace JSON to diagnose slow queries and produce a structured performance report with time breakdown, bottleneck analysis, and optimization recommendations. ALWAYS use this skill when: (1) the user mentions a VictoriaMetrics or VM trace, query trace, or trace JSON, (2) the user provides or references a JSON file containing duration_msec/message/children fields, (3) the user asks why a VictoriaMetrics/VM query is slow and has trace output, (4) the user asks about vmstorage node distribution, cache misses, or rollup performance in the context of a trace, (5) the user mentions vmselect trace, trace=1, or query performance debugging with VictoriaMetrics. This skill provides a structured report template that ensures consistent, thorough analysis — do not attempt to analyze VM traces without it.



# VictoriaMetrics Query Trace Analyzer

You are analyzing a VictoriaMetrics query trace — a JSON tree that records every step of a PromQL query execution. Your goal is to read this tree, understand what happened, and produce a clear performance report with actionable recommendations.

## Background

In **Cluster** mode two components are involved in query processing:
- **vmselect** — query frontend that accepts PromQL or MetricsQL queries, fetches data from vmstorage nodes, and applies calculations
- **vmstorage** — stores time series data and serves it to vmselect over RPC

**Single-node** mode runs everything in one process. The trace structure is similar but without RPC wrappers.

You can tell which mode you're looking at from the root message in the trace:
- **Cluster** traces contain `vmselect-<version>: /select/...`
- **Single-node** traces contain `/victoria-metrics-<version>: /api/v1/...`
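
The mode check above can be sketched as a small helper. This is a minimal illustration, not part of the skill's scripts: `detect_mode` is a hypothetical name, and it only looks for the two prefixes listed above.

```python
def detect_mode(root_message: str) -> str:
    """Classify a trace as 'cluster' or 'single-node' from its root message.

    Hypothetical helper: it relies only on the two markers described above.
    """
    # Check the cluster marker first, since cluster paths may also
    # contain "/api/v1/" further along the URL.
    if "vmselect-" in root_message and "/select/" in root_message:
        return "cluster"
    if "victoria-metrics-" in root_message and "/api/v1/" in root_message:
        return "single-node"
    return "unknown"
```

Returning `"unknown"` instead of guessing keeps the report honest when the root message does not match either form.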

## What is a query trace?

When you add `trace=1` to a VictoriaMetrics HTTP query, it returns a JSON tree describing every internal operation.
Each node looks like this:

```json
{
  "duration_msec": 123.456,
  "message": "description of what happened",
  "children": [ ... ]
}
```

In Cluster mode the tree is rooted at vmselect; in Single-node mode it is rooted at the single victoria-metrics process. It captures the full query execution pipeline: parsing, series search, data fetch from storage, rollup computation, aggregation, and response generation.
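
Because every node has the same three fields, the whole trace can be traversed with a short recursive generator. The sketch below is illustrative only: `walk`, `find_nodes`, and the `sample` trace are hypothetical and are not taken from `parse_trace.py`. It assumes `children` may be missing or `null` on leaf nodes.

```python
from typing import Iterator, List, Tuple


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) for every node in the trace, parents first."""
    yield depth, node
    # Leaf nodes may omit "children" or set it to null (assumption).
    for child in node.get("children") or []:
        yield from walk(child, depth + 1)


def find_nodes(root: dict, substring: str) -> List[dict]:
    """Return all nodes whose message contains the given substring."""
    return [n for _, n in walk(root) if substring in n.get("message", "")]


# Made-up trace used only to demonstrate the traversal.
sample = {
    "duration_msec": 12.5,
    "message": "root",
    "children": [
        {
            "duration_msec": 10.0,
            "message": "eval: query=up",
            "children": [
                {
                    "duration_msec": 8.0,
                    "message": "fetch unique series=3, blocks=3, samples=90, bytes=512",
                }
            ],
        },
        {"duration_msec": 0.1, "message": "sort series by metric name and labels"},
    ],
}

for depth, node in walk(sample):
    print("  " * depth + f'{node["duration_msec"]:.3f} ms  {node["message"]}')
```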

## How to analyze the trace

### Step 0: Run the parse script

Before manually reading the trace file, run the analysis script to extract structured data:

```bash
python3 <skill_base_dir>/scripts/parse_trace.py <trace_file>
```

This outputs: root info, trace tree (depth 3), key nodes with durations, per-vmstorage RPC breakdown, and computed totals (bytes, samples, series). Use this output as your primary data source for the report.

Additional subcommands for deeper investigation:
- `python3 <script> <trace> tree --depth N` — print the trace tree to depth N
- `python3 <script> <trace> nodes --pattern "fetch unique"` — find all nodes matching a substring

Only drill deeper if the summary output reveals ambiguities or missing information.

After running the summary, also check for relevant performance improvements in newer VictoriaMetrics versions:

```bash
python3 <skill_base_dir>/scripts/check_changelog.py <version> <mode>
```

Where `<version>` is the semver from the parse script output (e.g., `v1.130.0`) and `<mode>` is `cluster` or `single-node`. This fetches changelogs from GitHub and shows performance-relevant fixes/features in versions newer than what the trace was captured on. If the fetch fails, skip this section gracefully.

### Step 1: Start at the root

Read the trace JSON file the user provides (or use the script output from Step 0).
The root node tells you the big picture. Extract:
- **Endpoint**: `/api/v1/query` (instant) or `/api/v1/query_range` (range)
- **Query**: the PromQL expression after `query=`
- **Time parameters**: `start=`, `end=`, `step=` (for range queries)
- **Result count**: `series=` at the end
- **Total duration**: the root `duration_msec`
- **Version**: printed at the very start of the trace's root message.
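
As a rough sketch, the fields listed above can be pulled out of the root message with regular expressions. The exact message layout is an assumption here: the `example` string is invented for illustration, and `parse_root_message` is a hypothetical helper, so adjust the patterns to match what the real trace shows.

```python
import re


def parse_root_message(message: str) -> dict:
    """Extract the Step 1 fields from a root message (assumed layout)."""

    def first(pattern, cast=str):
        # Return the first capture group converted with `cast`, or None.
        m = re.search(pattern, message)
        return cast(m.group(1)) if m else None

    return {
        "version": first(r"(v\d+\.\d+\.\d+)"),
        "endpoint": first(r"(/api/v1/query(?:_range)?)"),
        "start": first(r"\bstart=(\d+)", int),
        "end": first(r"\bend=(\d+)", int),
        "step": first(r"\bstep=(\d+)", int),
        # Greedy match so quotes inside the expression stay in the query.
        "query": first(r'query="(.*)": series=\d+\s*$'),
        "series": first(r"series=(\d+)\s*$", int),
    }


# Invented root message, for illustration only.
example = (
    "vmselect-20250101-000000-tags-v1.130.0-cluster: "
    "/select/0/prometheus/api/v1/query_range: start=1700000000000, "
    "end=1700003600000, step=60000, "
    'query="sum(rate(http_requests_total[5m]))": series=4'
)
info = parse_root_message(example)
```

Fields that are absent (for example `start=` on an instant query) come back as `None` rather than raising, so the report can simply omit them.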

### Step 2: Identify the phases

Walk the top-level children and classify each into one of the phases below by matching the message patterns described here.
The phases appear roughly in this order, but not every trace has all of them; just report what's there.

For large traces, focus on the top-level children first.
Drill into subtrees only when they are relevant to the bottleneck or when durations are surprising.

**Expression evaluation** — nodes matching: `eval: query=..., timeRange=..., step=..., mayCache=...: series=N, points=N, pointsPerSeries=N`
These trace the recursive evaluation of the PromQL/MetricsQL expression tree.
Each eval node may have children for sub-expressions. Key numbers:
- *series* — number of time series produced by this sub-expression
- *points* — total data points across all series
- *pointsPerSeries* — data points per series

**Functions and aggregations** — nodes matching:
- `transform <func>(): series=N` — PromQL functions (histogram_quantile, clamp, etc.)
- `aggregate <func>(): series=N` — aggregation operators (sum, avg, max, etc.)
- `binary op "<op>": series=N` — binary operations

**Series search (index lookup)** — where label matchers get resolved to internal series IDs:
- In *Cluster* traces, these nodes are wrapped in `rpc at vmstorage <addr>` → `rpc call search_v7()`; in *Single-node* traces, they appear directly without RPC wrappers
- Key messages:
    - `init series search`,
    - `search TSIDs`,
    - `search N indexDBs in parallel` — parallel index database search,
    - `search indexDB` — individual index partition,
    - `found N metric ids for filter=...` — a metric ID is the unique identifier of a time series within vmstorage,
    - `found N TSIDs for N metricIDs` — a TSID refers to the same series as its metric ID,
    - `sort N TSIDs`
- Cache-related messages in this phase:
    - `search for metricIDs in tag filters cache` followed by `cache miss` or a cache hit (no `cache miss` child)
    - `put N metricIDs in cache` / `stored N metricIDs into cache`

**Data fetch** — getting raw data:
- *Cluster:* `fetch matching series: ...` wraps RPC calls to each vmstorage node:
    - `rpc at vmstorage <addr>` — per-node RPC,
    - `sent N blocks to vmselect` — amount of raw data transmitted back,
    - `fetch unique series=N, blocks=N, samples=N, bytes=N` — aggregate summary across all vmstorage nodes,
- *Single-node*: `search for parts with data for N series` followed by data scan messages.

The **bytes** value in `fetch unique series` gives the total amount of data transferred and is a good indicator of I/O load.
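
To turn the `fetch unique series=N, blocks=N, samples=N, bytes=N` summary into numbers for the report, a small parser is enough. This is a sketch under the assumption that the message follows the pattern shown above exactly. `parse_fetch_summary` and the derived ratios are illustrative additions, not output of the skill's scripts.

```python
import re

FETCH_RE = re.compile(
    r"fetch unique series=(\d+), blocks=(\d+), samples=(\d+), bytes=(\d+)"
)


def parse_fetch_summary(message: str):
    """Parse a 'fetch unique series=...' message into a dict, or None."""
    m = FETCH_RE.search(message)
    if m is None:
        return None
    series, blocks, samples, nbytes = map(int, m.groups())
    return {
        "series": series,
        "blocks": blocks,
        "samples": samples,
        "bytes": nbytes,
        # Derived ratios; guard against division by zero on empty results.
        "samples_per_series": samples / series if series else 0.0,
        "bytes_per_sample": nbytes / samples if samples else 0.0,
    }


summary = parse_fetch_summary(
    "fetch unique series=10, blocks=40, samples=2000, bytes=4000"
)
```

The derived ratios make it easier to state the bottleneck concretely, for example how many raw samples each series contributes.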

**Rollup computation** — computing rate(), increase(), avg_over_time(), etc.:
- `rollup <func>(): timeRange=..., step=N, window=N`
- `rollup <func>() with incremental aggregation <agg>() over N series` — this is an optimization
- `the rollup evaluation needs an estimated N bytes of RAM for N series and N points per series`  — memory estimate
- `parallel process of fetched data: series=N, samples=N` — the actual computation over raw samples
- `series after aggregation with <func>(): N; samplesScanned=N` — post-aggregation result

This phase often dominates execution time for queries that scan large amounts of raw data.

**Response generation** — usually trivial:
- `sort series by metric name and labels`
- `generate /api/v1/query_range response for series=N, points=N`

This phase is usually trivially fast. It can become a bottleneck if the response is huge (hundreds of series with thousands of data points per series) and the client reads the response slowly.

### Step 3: Build the time breakdown

For each phase, note the `duration_msec`.
In **Cluster** traces, the same phases repeat for each vmstorage node — aggregate for the summary but also track per-node numbers to spot imbalances.
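
One way to track per-node numbers is to sum `duration_msec` over the `rpc at vmstorage <addr>` nodes for each address and compare the slowest node with the fastest. The following is a minimal sketch: the helper names, the sample trace, and the idea of using a max/min ratio as the imbalance signal are assumptions made for illustration.

```python
import re
from collections import defaultdict

RPC_RE = re.compile(r"rpc at vmstorage (\S+)")


def per_node_durations(root: dict) -> dict:
    """Sum duration_msec of 'rpc at vmstorage <addr>' nodes per address."""
    totals = defaultdict(float)
    stack = [root]
    while stack:
        node = stack.pop()
        m = RPC_RE.match(node.get("message", ""))
        if m:
            totals[m.group(1)] += node.get("duration_msec", 0.0)
            # Do not descend: child time is already part of this duration.
            continue
        stack.extend(node.get("children") or [])
    return dict(totals)


def imbalance_ratio(totals: dict) -> float:
    """Slowest node time divided by fastest node time (1.0 = balanced)."""
    if not totals:
        return 0.0
    fastest = min(totals.values())
    return max(totals.values()) / fastest if fastest > 0 else float("inf")


# Made-up cluster trace fragment with two storage nodes.
sample = {
    "duration_msec": 410.0,
    "message": "fetch matching series: ...",
    "children": [
        {"duration_msec": 100.0, "message": "rpc at vmstorage 10.0.0.1:8401"},
        {"duration_msec": 400.0, "message": "rpc at vmstorage 10.0.0.2:8401"},
    ],
}
node_totals = per_node_durations(sample)
ratio = imbalance_ratio(node_totals)
```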

### Step 4: Find the bottleneck

Identify which phase consumed the most time and explain *why* in concrete terms.
For instance, "The rollup scanned 212M raw samples" is useful; "the query was slow" is not.

### Step 5: Write recommendations

Base recommendations only on what the trace actually shows.
If the query is fast and healthy, say so — don't invent problems.

Follow this algorithm to select recommendations:

- **Step 5a:** From the time breakdown, identify which single phase dominates (>60% of total latency). Map it to the matching pattern in the "Recommendation patterns" section below.
- **Step 5b:** Use ONLY that pattern's recommendations, in the listed priority order. Do not pull recommendations from other patterns.
- **Step 5c:** If no single phase exceeds 60%, pick the pattern with the highest contribution and note secondary factors, but still do not mix recommendations across patterns.
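
The selection rule in Steps 5a to 5c can be expressed as a short function. This is a sketch only: `pick_pattern` and the phase labels used as dictionary keys are hypothetical, and the breakdown is assumed to hold one `duration_msec` total per phase from Step 3.

```python
def pick_pattern(breakdown: dict, total_msec: float, threshold: float = 0.60):
    """Pick the phase whose recommendation pattern should be used.

    breakdown maps a phase label to its duration in milliseconds.
    Returns None when there is nothing to rank.
    """
    if not breakdown or total_msec <= 0:
        return None
    ranked = sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True)
    top_phase, top_msec = ranked[0]
    share = top_msec / total_msec
    dominant = share > threshold
    return {
        "phase": top_phase,      # the only pattern to draw recommendations from
        "share": share,
        "dominant": dominant,    # Step 5a/5b when True, Step 5c when False
        # Secondary factors are only noted, never used for recommendations.
        "secondary": [] if dominant else [name for name, _ in ranked[1:]],
    }


clear_case = pick_pattern(
    {"rollup": 700.0, "data fetch": 200.0, "series search": 100.0}, 1000.0
)
mixed_case = pick_pattern(
    {"rollup": 400.0, "data fetch": 350.0, "series search": 250.0}, 1000.0
)
```

In both cases a single phase is returned, which mirrors the rule that recommendations are never mixed across patterns.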

## Report format

```markdown
## Query Overview

- **Query:** `<the PromQL/MetricsQL expression>`
- **Type
