vm-trace-analyzer

Analyze VictoriaMetrics query trace JSON to diagnose slow queries and produce a structured performance report with time breakdown, bottleneck analysis, and optimization recommendations. ALWAYS use this skill when: (1) the user mentions a VictoriaMetrics or VM trace, query trace, or trace JSON, (2) the user provides or references a JSON file containing duration_msec/message/children fields, (3) the user asks why a VictoriaMetrics/VM query is slow and has trace output, (4) the user asks about vmstorage node distribution, cache misses, or rollup performance in the context of a trace, (5) the user mentions vmselect trace, trace=1, or query performance debugging with VictoriaMetrics. This skill provides a structured report template that ensures consistent, thorough analysis — do not attempt to analyze VM traces without it.

What this skill does


# VictoriaMetrics Query Trace Analyzer

You are analyzing a VictoriaMetrics query trace — a JSON tree that records every step of a PromQL query execution. Your goal is to read this tree, understand what happened, and produce a clear performance report with actionable recommendations.

## Background

In **Cluster** mode two components are involved in query processing:
- **vmselect** — query frontend that accepts PromQL or MetricsQL queries, fetches data from vmstorage nodes, and applies calculations
- **vmstorage** — stores time series data and serves it to vmselect over RPC

**Single-node** mode runs everything in one process. The trace structure is similar but without RPC wrappers.

You can tell which mode you're looking at from the root message in the trace:
- **Cluster** traces contain `vmselect-<version>: /select/...`
- **Single-node** traces contain `/victoria-metrics-<version>: /api/v1/...`
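
The mode check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: `detect_mode` is a hypothetical helper, and the regular expressions assume the root message starts with exactly the two prefixes listed above.

```python
import re

# Assumption: the root message begins with one of the two prefixes described
# in the text. The version part is treated as an opaque run of non-space chars.
_CLUSTER_RE = re.compile(r"^vmselect-\S*?: /select/")
_SINGLE_NODE_RE = re.compile(r"^/?victoria-metrics-\S*?: /api/v1/")


def detect_mode(root_message: str) -> str:
    """Return 'cluster', 'single-node', or 'unknown' for a root trace message."""
    if _CLUSTER_RE.search(root_message):
        return "cluster"
    if _SINGLE_NODE_RE.search(root_message):
        return "single-node"
    return "unknown"
```

Returning `"unknown"` instead of raising keeps the report usable when the trace was truncated or captured from an unexpected component.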

## What is a query trace?

When you add `trace=1` to a VictoriaMetrics HTTP query, it returns a JSON tree describing every internal operation.
Each node looks like this:

```json
{
  "duration_msec": 123.456,
  "message": "description of what happened",
  "children": [ ... ]
}
```

The tree is rooted at vmselect in Cluster mode and at the single-node process otherwise. It captures the full query execution pipeline: parsing, series search, data fetch from storage, rollup computation, aggregation, and response generation.
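
Because every node has the same three fields, a short recursive generator is enough to traverse the whole tree. The sketch below is illustrative only: `walk` is a hypothetical helper and the sample trace is invented, with made-up durations and messages.

```python
import json
from typing import Iterator, Tuple


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, float, str]]:
    """Yield (depth, duration_msec, message) for a node and all descendants."""
    yield depth, float(node.get("duration_msec", 0.0)), node.get("message", "")
    # "children" may be missing or null on leaf nodes.
    for child in node.get("children") or []:
        yield from walk(child, depth + 1)


# Invented sample; real traces are much larger.
sample = json.loads("""
{
  "duration_msec": 12.5,
  "message": "root",
  "children": [
    {"duration_msec": 10.0, "message": "child A",
     "children": [{"duration_msec": 4.0, "message": "grandchild"}]}
  ]
}
""")

for depth, ms, msg in walk(sample):
    print(f"{'  ' * depth}{ms:10.3f} ms  {msg}")
```

A parent's `duration_msec` includes the time spent in its children, so durations should be compared between siblings rather than summed across levels.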

## How to analyze the trace

### Step 0: Run the parse script

Before manually reading the trace file, run the analysis script to extract structured data:

```bash
python3 <skill_base_dir>/scripts/parse_trace.py <trace_file>
```

This outputs: root info, trace tree (depth 3), key nodes with durations, per-vmstorage RPC breakdown, and computed totals (bytes, samples, series). Use this output as your primary data source for the report.

Additional subcommands for deeper investigation:
- `python3 <script> <trace> tree --depth N` — print the trace tree to depth N
- `python3 <script> <trace> nodes --pattern "fetch unique"` — find all nodes matching a substring

Only drill deeper if the summary output reveals ambiguities or missing information.

After running the summary, also check for relevant performance improvements in newer VictoriaMetrics versions:

```bash
python3 <skill_base_dir>/scripts/check_changelog.py <version> <mode>
```

Where `<version>` is the semver from the parse script output (e.g., `v1.130.0`) and `<mode>` is `cluster` or `single-node`. This fetches changelogs from GitHub and shows performance-relevant fixes/features in versions newer than what the trace was captured on. If the fetch fails, skip this section gracefully.

### Step 1: Start at the root

Read the trace JSON file the user provides (or use the script output from Step 0).
The root node tells you the big picture. Extract:
- **Endpoint**: `/api/v1/query` (instant) or `/api/v1/query_range` (range)
- **Query**: the PromQL expression after `query=`
- **Time parameters**: `start=`, `end=`, `step=` (for range queries)
- **Result count**: `series=` at the end
- **Total duration**: the root `duration_msec`
- **Version**: printed at the very start of the root message.
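
If the parse script is unavailable, the fields listed above can be pulled from the root message with regular expressions. The following is a rough sketch under stated assumptions: `parse_root` is a hypothetical helper, the sample message is invented, and the exact message layout (for example, the query being double-quoted) is assumed rather than confirmed.

```python
import re
from typing import Optional


def _int(pattern: str, message: str) -> Optional[int]:
    m = re.search(pattern, message)
    return int(m.group(1)) if m else None


def _str(pattern: str, message: str) -> Optional[str]:
    m = re.search(pattern, message)
    return m.group(1) if m else None


def parse_root(message: str, duration_msec: float) -> dict:
    """Extract the overview fields from a root trace message."""
    endpoint = _str(r"(/api/v1/query(?:_range)?)", message)
    return {
        "endpoint": endpoint,
        "type": {"/api/v1/query": "instant",
                 "/api/v1/query_range": "range"}.get(endpoint),
        # Assumption: the query appears as query="..." with backslash escapes.
        "query": _str(r'query="((?:[^"\\]|\\.)*)"', message),
        "start": _int(r"\bstart=(\d+)", message),
        "end": _int(r"\bend=(\d+)", message),
        "step": _int(r"\bstep=(\d+)", message),
        "series": _int(r"series=(\d+)\s*$", message),
        "version": _str(r"(v\d+\.\d+\.\d+)", message),
        "total_msec": duration_msec,
    }


# Invented root message for illustration.
root_message = (
    'vmselect-20250101-000000-tags-v1.130.0-cluster: '
    '/select/0/prometheus/api/v1/query_range: start=1700000000000, '
    'end=1700003600000, step=60000, '
    'query="sum(rate(http_requests_total[5m]))": series=1'
)
info = parse_root(root_message, 842.1)
```

Fields that cannot be found come back as `None`, so the report can state that a value is missing instead of guessing it.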

### Step 2: Identify the phases

Walk the top-level children and classify each into one of these phases.
Not every trace has all of them — just report what's there.

For large traces, focus on the top-level children first.
Drill into subtrees only when they are relevant to the bottleneck or when durations are surprising.

A query trace typically has these phases, roughly in this order.
Not all phases appear in every trace. Identify them by matching the message patterns described here.

**Expression evaluation** — nodes matching: `eval: query=..., timeRange=..., step=..., mayCache=...: series=N, points=N, pointsPerSeries=N`
These trace the recursive evaluation of the PromQL/MetricsQL expression tree.
Each eval node may have children for sub-expressions. Key numbers:
- *series* — number of time series produced by this sub-expression
- *points* — total data points across all series
- *pointsPerSeries* — data points per series

**Functions and aggregations** — nodes matching:
- `transform <func>(): series=N` — PromQL functions (histogram_quantile, clamp, etc.)
- `aggregate <func>(): series=N` — aggregation operators (sum, avg, max, etc.)
- `binary op "<op>": series=N` — binary operations

**Series search (index lookup)** — where label matchers get resolved to internal series IDs:
- In *Cluster* traces, these nodes are wrapped in `rpc at vmstorage <addr>` → `rpc call search_v7()`; in *Single-node* traces, they appear directly without RPC wrappers
- Key messages:
    - `init series search`,
    - `search TSIDs`,
    - `search N indexDBs in parallel` — parallel index database search,
    - `search indexDB` — individual index partition,
    - `found N metric ids for filter=...` — metric ID, unique time series identifier within vmstorage,
    - `found N TSIDs for N metricIDs`: each metric ID resolved to its TSID, the internal time series identifier,
    - `sort N TSIDs`
- Cache-related messages in this phase:
    - `search for metricIDs in tag filters cache` followed by `cache miss` or a cache hit (no `cache miss` child)
    - `put N metricIDs in cache` / `stored N metricIDs into cache`

**Data fetch** — getting raw data:
- *Cluster:* `fetch matching series: ...` wraps RPC calls to each vmstorage node:
    - `rpc at vmstorage <addr>` — per-node RPC,
    - `sent N blocks to vmselect` — amount of raw data transmitted back,
    - `fetch unique series=N, blocks=N, samples=N, bytes=N` — aggregate summary across all vmstorage nodes,
- *Single-node*: `search for parts with data for N series` followed by data scan messages.

The **bytes** value in `fetch unique series` tells you the total amount of data transferred and is a good indicator of I/O load.
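
The counters in the `fetch unique series` message can be read with a single regular expression. This is a small sketch: `parse_fetch_summary` is a hypothetical helper, it assumes the message uses exactly the `series=N, blocks=N, samples=N, bytes=N` layout shown above, and the sample numbers are invented.

```python
import re
from typing import Optional

FETCH_RE = re.compile(
    r"fetch unique series=(\d+), blocks=(\d+), samples=(\d+), bytes=(\d+)"
)


def parse_fetch_summary(message: str) -> Optional[dict]:
    """Return the four counters as ints, or None if the message does not match."""
    m = FETCH_RE.search(message)
    if not m:
        return None
    series, blocks, samples, nbytes = (int(g) for g in m.groups())
    return {
        "series": series,
        "blocks": blocks,
        "samples": samples,
        "bytes": nbytes,
        # Derived value that helps judge how heavy each series is.
        "bytes_per_series": nbytes / series if series else 0.0,
    }


summary = parse_fetch_summary(
    "fetch unique series=1200, blocks=4800, samples=212000000, bytes=360000000"
)
```

Quoting `bytes` together with `samples` in the report makes it clear whether the load comes from many series or from a long time range per series.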

**Rollup computation** — computing rate(), increase(), avg_over_time(), etc.:
- `rollup <func>(): timeRange=..., step=N, window=N`
- `rollup <func>() with incremental aggregation <agg>() over N series` — this is an optimization
- `the rollup evaluation needs an estimated N bytes of RAM for N series and N points per series`  — memory estimate
- `parallel process of fetched data: series=N, samples=N` — the actual computation over raw samples
- `series after aggregation with <func>(): N; samplesScanned=N` — post-aggregation result

This phase often dominates execution time for queries that scan large amounts of raw data.

**Response generation** — usually trivial:
- `sort series by metric name and labels`
- `generate /api/v1/query_range response for series=N, points=N`

This phase is usually trivially fast. It can become a bottleneck when the response is huge (hundreds of series with thousands of data points each) and the client reads the response slowly.
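
The message patterns above map naturally onto an ordered table of regular expressions. The sketch below is one possible encoding, not the skill's actual implementation: `classify` and the phase labels are hypothetical, and it assumes each message starts with the listed pattern.

```python
import re

# Ordered (phase, pattern) table built from the message patterns listed above.
PHASE_PATTERNS = [
    ("expression evaluation", r"eval: "),
    ("functions and aggregations",
     r"(?:transform|aggregate) \S+\(\)|binary op "),
    ("series search",
     r"init series search|search TSIDs|search \d+ indexDBs in parallel"
     r"|search indexDB|found \d+ metric ids|found \d+ TSIDs|sort \d+ TSIDs"
     r"|search for metricIDs in tag filters cache|cache miss"
     r"|put \d+ metricIDs in cache|stored \d+ metricIDs into cache"),
    ("data fetch",
     r"fetch matching series|rpc at vmstorage|sent \d+ blocks to vmselect"
     r"|fetch unique series|search for parts with data"),
    ("rollup",
     r"rollup \S+\(\)|the rollup evaluation needs"
     r"|parallel process of fetched data|series after aggregation with"),
    ("response generation",
     r"sort series by metric name|generate /api/v1/"),
]
_COMPILED = [(phase, re.compile(pattern)) for phase, pattern in PHASE_PATTERNS]


def classify(message: str) -> str:
    """Return the phase whose pattern matches the start of the message."""
    for phase, regex in _COMPILED:
        if regex.match(message):
            return phase
    return "other"
```

In Cluster traces `rpc at vmstorage` wraps both series search and data fetch, so a wrapper classified as "data fetch" here may still need its children inspected to split the two.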

### Step 3: Build the time breakdown

For each phase, note the `duration_msec`.
In **Cluster** traces, the same phases repeat for each vmstorage node — aggregate for the summary but also track per-node numbers to spot imbalances.
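
Per-node imbalance can be spotted by summing the duration of each `rpc at vmstorage <addr>` subtree. This is a minimal sketch on an invented trace: `rpc_time_per_node` and `imbalance` are hypothetical helpers, and the address is assumed to be the first whitespace-free token after the prefix.

```python
import re
from collections import defaultdict
from statistics import median
from typing import Dict, Optional

RPC_RE = re.compile(r"^rpc at vmstorage (\S+)")


def rpc_time_per_node(node: dict,
                      totals: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Sum duration_msec of every 'rpc at vmstorage <addr>' subtree per address."""
    if totals is None:
        totals = defaultdict(float)
    m = RPC_RE.match(node.get("message", ""))
    if m:
        # Strip a trailing colon in case the address is followed by ": ...".
        totals[m.group(1).rstrip(":")] += float(node.get("duration_msec", 0.0))
        return totals  # children are already included in this duration
    for child in node.get("children") or []:
        rpc_time_per_node(child, totals)
    return totals


def imbalance(totals: Dict[str, float]) -> float:
    """Ratio of the slowest node to the median node (1.0 means balanced)."""
    values = list(totals.values())
    mid = median(values) if values else 0.0
    return max(values) / mid if mid else 0.0


# Invented trace with two storage nodes.
trace = {
    "duration_msec": 500.0, "message": "fetch matching series: ...",
    "children": [
        {"duration_msec": 100.0, "message": "rpc at vmstorage vmstorage-0:8401"},
        {"duration_msec": 20.0, "message": "rpc at vmstorage vmstorage-0:8401"},
        {"duration_msec": 480.0, "message": "rpc at vmstorage vmstorage-1:8401",
         "children": [{"duration_msec": 470.0, "message": "sent 4800 blocks to vmselect"}]},
    ],
}
per_node = dict(rpc_time_per_node(trace))
```

RPCs to different nodes typically overlap in time, so the per-node totals are meant for comparison between nodes and should not be added up as wall-clock time.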

### Step 4: Find the bottleneck

Identify which phase consumed the most time and explain *why* in concrete terms.
For instance, "The rollup scanned 212M raw samples" is useful; "the query was slow" is not.

### Step 5: Write recommendations

Base recommendations only on what the trace actually shows.
If the query is fast and healthy, say so — don't invent problems.

Follow this algorithm to select recommendations:

- **Step 5a:** From the time breakdown, identify which single phase dominates (>60% of total latency). Map it to the matching pattern in the "Recommendation patterns" section below.
- **Step 5b:** Use ONLY that pattern's recommendations, in the listed priority order. Do not pull recommendations from other patterns.
- **Step 5c:** If no single phase exceeds 60%, pick the pattern with the highest contribution and note secondary factors, but still do not mix recommendations across patterns.
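
Steps 5a to 5c amount to a threshold check on each phase's share of total latency. The sketch below shows one way to express it; `pick_pattern` is a hypothetical helper, the phase names and durations are invented, and it assumes the per-phase durations do not overlap.

```python
from typing import Dict, Optional


def pick_pattern(phase_ms: Dict[str, float], total_ms: float,
                 threshold: float = 0.60) -> Optional[dict]:
    """Select the single recommendation pattern to use for the report."""
    if not phase_ms or total_ms <= 0:
        return None
    ranked = sorted(phase_ms.items(), key=lambda kv: kv[1], reverse=True)
    top_phase, top_ms = ranked[0]
    share = top_ms / total_ms
    dominant = share > threshold
    return {
        "pattern": top_phase,   # the only pattern recommendations come from
        "share": share,
        "dominant": dominant,   # Step 5a/5b when True, Step 5c when False
        # Step 5c: mention the other phases as secondary factors only.
        "secondary": [] if dominant else [name for name, _ in ranked[1:]],
    }


clear = pick_pattern({"rollup": 700.0, "data fetch": 200.0, "series search": 100.0}, 1000.0)
mixed = pick_pattern({"rollup": 400.0, "data fetch": 350.0, "series search": 250.0}, 1000.0)
```

In both cases only one pattern is returned, which mirrors the rule that recommendations are never mixed across patterns.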

## Report format

```markdown
## Query Overview

- **Query:** `<the PromQL/MetricsQL expression>`
- **Type

Source: https://github.com/victoriametrics/skills/tree/main/plugins/diagnostics/skills/vm-trace-analyzer
