
building-dashboards


Designs and builds Axiom dashboards via API. Covers chart types, APL and metrics/MPL query patterns, SmartFilters, layout, and configuration options. Use when creating dashboards, migrating from Splunk, or configuring chart options.


What this skill does


# Building Dashboards

## Philosophy

1. **Decisions first.** Every panel answers a question that leads to an action.
2. **Overview → drilldown → evidence.** Start broad, narrow on click/filter, end with raw logs.
3. **Rates and percentiles over averages.** Averages hide problems; p95/p99 expose them.
4. **Simple beats dense.** One question per panel. No chart junk.
5. **Validate with data.** Never guess fields—discover schema first.
6. **Compute what's asked, or defer.** If a panel can't be computed, replace it with a `Note` documenting the blocker. Never substitute a different quantity, even if the substitution is disclosed. See [Compute or Defer](#compute-or-defer).
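
Principle 3 can be illustrated with a small sketch. The latency sample is hypothetical, and `percentile` is an illustrative nearest-rank helper, not part of APL, MPL, or this skill's scripts.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in ms: 18 fast requests and 2 very slow ones.
latencies_ms = [20] * 18 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(mean_ms, p50_ms, p95_ms)  # 218.0 20 2000
```

The mean (218 ms) describes no real request: most take 20 ms and the slow tail takes 2000 ms. The p50/p95 pair shows both facts directly.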

---

## Entry Points

| Starting from | Workflow |
|---------------|----------|
| **Vague description** | Intake → check dataset kind → design blueprint (APL or MPL) → queries per panel → deploy |
| **Template** | Pick template → customize dataset/service/env → deploy |
| **Splunk dashboard** | Extract SPL → translate via spl-to-apl → map to chart types → deploy |
| **Grafana dashboard** | Project canonical panel spec (`expr`, `legendFormat`, `unit`, `title`, `description`) → translate PromQL → map chart types → deploy. See [reference/grafana-migration.md](./reference/grafana-migration.md). |
| **Exploration** | Use axiom-sre to discover schema/signals → productize into panels |

---

## Intake: What to Ask First

1. **Audience & decision**
   - Oncall triage? (fast refresh, error-focused)
   - Team health? (daily trends, SLO tracking)
   - Exec reporting? (weekly summaries, high-level)

2. **Scope**
   - Service, environment, region, cluster, endpoint?
   - Single service or cross-service view?

3. **Dataset kind.** Run `scripts/metrics/datasets <deploy>` and check `kind`.
   - `otel:metrics:v1` → metrics dataset, follow the **Metrics path**.
   - anything else → events/logs dataset, follow the **APL path**.

   > **Never run `getschema` on a metrics dataset.** It returns 0 rows without error.

   **APL path:** discover fields with `['dataset'] | where _time between (ago(1h) .. now()) | getschema`. Continue to steps 4–5.

   **Metrics path:**
   - `scripts/metrics/metrics-spec <deploy> <dataset>` — required before any MPL query.
   - `scripts/metrics/metrics-info <deploy> <dataset> metrics | tags | tags <tag> values` for discovery.
   - If discovery is empty, retry with `--start` 7 days ago (sparse metrics).
   - `find-metrics <value>` searches tag *values*, not metric names — use it only with a known entity name.
   - Skip to the **Metrics/MPL Blueprint**.

4. **Golden signals** (APL path)
   - Traffic: requests/sec, events/min
   - Errors: error rate, 5xx count
   - Latency: p50, p95, p99 duration
   - Saturation: CPU, memory, queue depth, connections

5. **Drilldown dimensions** (APL path)
   - What do users filter/group by? (service, route, status, pod, customer_id)
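
The dataset-kind branch in step 3 can be sketched as follows. The function names and the dataset names in the usage lines are hypothetical; only the `otel:metrics:v1` kind and the two discovery commands come from the steps above.

```python
METRICS_KIND = "otel:metrics:v1"  # kind value named in step 3


def discovery_path(kind: str) -> str:
    """Return 'metrics' for a metrics dataset, 'apl' for anything else."""
    return "metrics" if kind == METRICS_KIND else "apl"


def first_discovery_step(kind: str, dataset: str) -> str:
    """Return the first discovery command or query for the dataset (illustrative only)."""
    if discovery_path(kind) == "metrics":
        # getschema returns 0 rows on metrics datasets, so it is never used here.
        return f"scripts/metrics/metrics-spec <deploy> {dataset}"
    return f"['{dataset}'] | where _time between (ago(1h) .. now()) | getschema"


# Hypothetical dataset names and a placeholder non-metrics kind.
print(first_discovery_step(METRICS_KIND, "k8s-metrics"))
print(first_discovery_step("some-other-kind", "http-logs"))
```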

---

## Dashboard Blueprint

Pick the blueprint matching the dataset kind.

### APL Blueprint (events/logs datasets)

#### 1. At-a-Glance (Statistic panels)
Single numbers that answer "is it broken right now?"
- Error rate (last 5m)
- p95 latency (last 5m)
- Request rate (last 5m)
- Active alerts (if applicable)

#### 2. Trends (TimeSeries panels)
Time-based patterns that answer "what changed?"
- Traffic over time
- Error rate over time
- Latency percentiles over time
- Stacked by status/service for comparison

#### 3. Breakdowns (Table/Pie panels)
Top-N analysis that answers "where should I look?"
- Top 10 failing routes
- Top 10 error messages
- Worst pods by error rate
- Request distribution by status

#### 4. Evidence (LogStream + SmartFilter)
Raw events that answer "what exactly happened?"
- LogStream filtered to errors
- SmartFilter for service/env/route
- Key fields projected for readability

### Metrics/MPL Blueprint (metrics datasets)

Use `align to $__interval using …` for bucketing; `$__interval` is supplied by the dashboard runtime, whereas a hard-coded window over- or under-resolves as the dashboard's time range changes. Validate every pipeline with `scripts/metrics/mpl-validate-chart`; both it and `chart-add --mpl` reject inline time ranges (`[1h..]`).

Exception: for sparse metrics where `$__interval` rounds to empty buckets, a fixed wider window (e.g. `1h`) is acceptable; document why on the chart.
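
A rough way to see why a fixed window misbehaves is to count the buckets it produces at different dashboard time ranges. This is plain arithmetic under stated assumptions: `bucket_count` is a hypothetical helper, and how the runtime actually derives `$__interval` is not described here.

```python
from datetime import timedelta


def bucket_count(time_range: timedelta, window: timedelta) -> int:
    """Number of whole buckets a fixed window yields over a dashboard time range."""
    return time_range // window


fixed_window = timedelta(minutes=5)  # a hard-coded alignment window

# Short range: too few points to show a trend.
print(bucket_count(timedelta(minutes=15), fixed_window))  # 3

# Long range: far more points than a chart can usefully render.
print(bucket_count(timedelta(days=30), fixed_window))  # 8640
```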

#### 1. At-a-Glance (Statistic panels)
Current values — "what's the state right now?"
- Use `group using avg` (gauges) or `group using last` (counters).
- Read the metric's `unit` via `metrics-info … metrics <m> info` and pass it to `chart-add --unit`. Ratio metrics (0–1) need `| map * 100` in MPL before `--unit "%"`.

#### 2. Trends (TimeSeries panels)
Trends over time — "what changed?"
- `align to $__interval using avg|sum|last`.
- Group by low-cardinality tags only (≤10 series per chart).
- Embed the unit in `--name` (`"P95 Latency (ms)"`, `"Memory (MiB)"`); scale magnitudes in MPL (`| map / 1048576` for bytes → MiB).
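
The series ceiling above can be checked before a chart is built by counting distinct tag-value combinations. The tag names, sample data, and `series_count` helper below are hypothetical; in practice the values would come from metrics discovery.

```python
MAX_SERIES = 10  # ceiling stated in the Trends guidance


def series_count(points, group_by):
    """Count distinct tag-value combinations, i.e. the series a chart would draw."""
    return len({tuple(point[tag] for tag in group_by) for point in points})


# Hypothetical tag sets: 3 services x 2 environments x 2 pods each.
points = [
    {"service": s, "env": e, "pod": f"{s}-{e}-{i}"}
    for s in ("api", "web", "worker")
    for e in ("prod", "staging")
    for i in range(2)
]

for group_by in (["service"], ["service", "env"], ["pod"]):
    n = series_count(points, group_by)
    verdict = "ok" if n <= MAX_SERIES else "too many series"
    print(group_by, n, verdict)
```

Grouping by `pod` yields 12 series here, which is the signal to filter first or move that dimension to a Breakdowns panel.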

#### 3. Breakdowns (TimeSeries or Table panels)
Per-entity detail — "where should I look?"
- Metrics broken down by entity (host, pod, service).
- Filter to keep series count manageable.
- One dimension per panel; don't overload a single chart.

#### 4. Entity State (TimeSeries or Table panels)
Boolean/state metrics — answer "what is on/off/active?"
- Use `align to $__interval using last`.
- Sparse state metrics may need a fixed wider interval (1h+).

---

## Required Chart Structure

Each chart needs a unique kebab-case `id` (`error-rate`, `p95-latency`), and every layout `i` must match one of those ids. Pass the same id to `chart-add --id` and `layout-pack <id>:…`. `dashboard-assemble` cross-checks the two before emitting output.
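
The id rules above can be expressed as a small pre-flight check. This is a sketch of the constraint itself, not the logic inside `dashboard-assemble`; the `check_ids` helper and the kebab-case regular expression are assumptions.

```python
import re

# Assumed definition of kebab-case: lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_ids(chart_ids, layout_ids):
    """Return a list of problems; an empty list means chart and layout ids agree."""
    problems = []
    seen = set()
    for chart_id in chart_ids:
        if not KEBAB_CASE.match(chart_id):
            problems.append(f"chart id not kebab-case: {chart_id}")
        if chart_id in seen:
            problems.append(f"duplicate chart id: {chart_id}")
        seen.add(chart_id)
    for layout_id in layout_ids:
        if layout_id not in seen:
            problems.append(f"layout i has no matching chart: {layout_id}")
    return problems


print(check_ids(["error-rate", "p95-latency"], ["error-rate", "p95-latency"]))  # []
print(check_ids(["error-rate", "Error Rate", "error-rate"], ["p95-latency"]))
```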

---

## Chart Unit Configuration

Pass a friendly unit string to `chart-add --unit` (`"%"`, `"s"`, `"ms"`, `"B"`, `"req/s"`). The script picks `unit` enum + `customUnits` suffix per chart type. `customUnits` is a label, not a formatter — scale magnitudes in MPL (`| map / 1048576` for bytes → MiB, `| map / 1000000` for bytes → MB, `| map * 100` for 0–1 ratio → percent). For metrics charts, read the source unit from `metrics-info … metrics <m> info` and pass it through. Internals (advanced options the agent may merge with `jq`): [reference/chart-config.md](./reference/chart-config.md).
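
Because the unit is only a label, the number has to be scaled before it is displayed. The sketch below mirrors the three `| map` factors quoted above in plain Python; the function names and sample values are hypothetical.

```python
def bytes_to_mib(n: float) -> float:
    """Same factor as `| map / 1048576` (1 MiB = 1024 * 1024 bytes)."""
    return n / 1048576


def bytes_to_mb(n: float) -> float:
    """Same factor as `| map / 1000000` (1 MB = 10^6 bytes)."""
    return n / 1000000


def ratio_to_percent(r: float) -> float:
    """Same factor as `| map * 100` for a 0-1 ratio."""
    return r * 100


raw_bytes = 536870912  # hypothetical memory reading

print(bytes_to_mib(raw_bytes))   # 512.0
print(bytes_to_mb(raw_bytes))    # 536.870912
print(ratio_to_percent(0.25))    # 25.0
```

Labelling the unscaled `536870912` as "MiB" would overstate memory by a factor of about a million, which is why the scaling belongs in the query rather than in the label.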

---

## Compute or Defer

Each panel either computes the requested quantity, or it's replaced by a `Note` documenting the blocker. Substituting a different quantity is never acceptable — disclaimers don't reach whoever acts on the number.

Defer template (use `chart-add --type Note`):

```
**Deferred — blocked by:** <one-line reason>.

**Original spec:** <what the panel should compute, dimensions, unit>.

**To unblock:** <pointer to the fix>.
```

Common blockers: MPL parser limits, missing tag with no reverse-tag equivalent, missing metric with no OTel rename match. Full rationale: [reference/design-playbook.md § Substituting a Different Quantity](./reference/design-playbook.md#substituting-a-different-quantity-for-the-asked-one).
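
Filling the defer template consistently is easy to script. The `defer_note` helper and the example panel are hypothetical; the three headings come from the template above, and how the text is passed to `chart-add --type Note` is not shown because that interface is not documented here.

```python
def defer_note(reason: str, original_spec: str, unblock: str) -> str:
    """Render the three-part defer template as Markdown text for a Note panel."""
    return (
        # \u2014 is the dash used in the template's first heading.
        f"**Deferred \u2014 blocked by:** {reason}.\n\n"
        f"**Original spec:** {original_spec}.\n\n"
        f"**To unblock:** {unblock}."
    )


# Hypothetical panel that could not be computed.
note = defer_note(
    reason="MPL parser limits",
    original_spec="p95 queue wait per pod, in ms",
    unblock="rerun once the expression parses in mpl-validate-chart",
)
print(note)
```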

---

## Chart Types

| Type          | When                                                | Key constraint                                                       |
|---------------|-----------------------------------------------------|----------------------------------------------------------------------|
| Statistic     | Single KPI, current value                           | Query must return one row.                                           |
| TimeSeries    | Trends over time, percentile overlays               | `bin_auto(_time)`; `percentiles_array()` for multi-percentile.       |
| Table         | Top-N lists, breakdowns                             | Bound with `top N`; control columns via `project`.                   |
| Pie           | Share-of-total for ≤6 categories                    | Aggregate to ≤6 slices; never high-cardinality.                      |
| LogStream     | Raw event inspection                                | `take 100–500`; `project-keep` to relevant fields; filter hard.      |
| Heatmap       | Distribution / latency densi