signoz-investigating-alerts

Diagnose why a SigNoz alert fired by correlating the alert's own signal with neighbor signals (error rate, latency, throughput, CPU/memory), traces, and logs around the fire window — and rank likely causes. Make sure to use this skill whenever the user asks "why did this alert fire", "what caused alert X", "investigate this alert", "RCA for the alert that paged me", "what's wrong with [service]" in the context of a recent fire, or otherwise asks for a root-cause analysis of a firing or recently-fired alert. Read-only — does not modify any alert or notification.

# Alert Investigate

Diagnose why a SigNoz alert fired. The skill correlates the alert's own
signal with neighbor signals around the fire window, and surfaces a
ranked list of likely causes with supporting evidence. It is the
companion to `signoz-explaining-alerts` — explain decodes the rule
statically; investigate diagnoses a specific incident.

## Prerequisites

This skill calls SigNoz MCP server tools heavily (`signoz:signoz_get_alert`,
`signoz:signoz_get_alert_history`, `signoz:signoz_execute_builder_query`,
`signoz:signoz_query_metrics`, `signoz:signoz_search_traces`, `signoz:signoz_search_logs`,
`signoz:signoz_get_trace_details`, etc.). Before running the workflow,
confirm the `signoz:signoz_*` tools are available. If they are not, the
SigNoz MCP server is not installed or configured — run `signoz-mcp-setup` first
to initialize or repair the MCP connection. The investigation depends on
correlating multiple MCP queries; without the server there is no way to ground
the analysis.

## When to use

Use this skill when the user wants to:
- Understand why a specific alert fired.
- Find the root cause of a recent incident triggered by an alert.
- Correlate the alert's signal with related metrics, traces, and logs.
- Distinguish "real signal" fires from flapping or threshold-mistuning.

Do NOT use when the user wants to:
- Understand what an alert is configured to monitor → `signoz-explaining-alerts`.
- Create a new alert → `signoz-creating-alerts`.
- Modify an alert (raise threshold, add hysteresis) → call
`signoz:signoz_update_alert` directly.
- Run a free-form ad-hoc investigation without an alert as the anchor →
`signoz-generating-queries`.

## Required inputs

| Input | Required | Source if missing |
|---|---|---|
| Alert identifier (rule ID or name) | yes | `$ARGUMENTS[0]` or recent context |
| Time window | no | default to most recent fire from `signoz:signoz_get_alert_history` |

If the alert name is fuzzy, resolve it on a **best-effort** basis (the skill is read-only):
1. Call `signoz:signoz_list_alert_rules`, paginate, fuzzy-match the name.
2. State the interpretation: "Investigating fire of 'High Error Rate —
Checkout' (id 42) at 14:32 UTC. If you meant a different alert or
fire, tell me."
3. Proceed.
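
The fuzzy-match step above can be sketched as follows. This is a minimal illustration, not the skill's actual matching logic: it assumes the paginated rule list has already been collected into plain dicts with hypothetical `id` and `name` keys, and it uses the standard-library `difflib` as one possible similarity measure.

```python
from difflib import get_close_matches


def resolve_alert(query, rules, cutoff=0.6):
    """Return the rule whose name is closest to `query`, or None.

    `rules` is a list of dicts with hypothetical `id` and `name` keys,
    standing in for the collected output of the list-rules tool.
    """
    by_name = {rule["name"].lower(): rule for rule in rules}
    hits = get_close_matches(query.lower(), list(by_name), n=1, cutoff=cutoff)
    return by_name[hits[0]] if hits else None


# Illustrative rule list (made-up data).
rules = [
    {"id": 42, "name": "High Error Rate - Checkout"},
    {"id": 43, "name": "High Latency - Checkout"},
    {"id": 44, "name": "Disk Usage - db-01"},
]

match = resolve_alert("high error rate checkout", rules)
if match:
    # State the interpretation so the user can correct it.
    print(f"Investigating '{match['name']}' (id {match['id']}). "
          "If you meant a different alert, tell me.")
```

Returning `None` when nothing clears the cutoff keeps the caller from silently investigating the wrong alert.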

If the alert has never fired in the lookback window, **stop**: there is
nothing to investigate. Respond with:
> "Alert '[name]' has not fired in the last 7d, so there is no fire
> window to investigate. Use `signoz-explaining-alerts` to walk through
> the rule, or check whether the alert is enabled."

## Workflow

After a setup step that resolves the alert and its fire window (Tier 0),
the investigation runs in three tiers with strict early-stop gates.
Tier 1 always runs. Tier 2 runs only if Tier 1 confirms a real fire.
Tier 3 runs only if Tier 2 surfaces correlated anomalies. Skipping the
gates produces hundreds of unnecessary trace/log queries on quiet
alerts.

### Step 1: Resolve alert + fire window (Tier 0)

1. Resolve the alert id via `signoz:signoz_list_alert_rules` (paginated) if
   not given.
2. Call `signoz:signoz_get_alert` for the full rule config, which is needed
   to know what query, threshold, and resource scope the alert evaluated.
3. Call `signoz:signoz_get_alert_history` with a 7d lookback. From the
   response:
   - **Pick the fire window**. Default to the most recent fire. If the
     user passed an explicit time window via `$ARGUMENTS[1]`, honor it.
   - **Note the fire pattern**:
     - `one-off` → single fire with a long quiet period before/after.
     - `sustained` → fires that stayed firing for ≥ 1 evaluation cycle.
     - `flapping` → ≥ 3 fires within a 1h window, alternating fire/resolve.
     - `recurring` → fires at regular intervals (cron-like, e.g., every hour).
   - The pattern tells you what to expect from Tiers 2 and 3.
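
The fire-pattern labels above could be assigned with a small classifier like the one below. It is a sketch under assumptions: the history is taken to be a list of `(start, end)` datetime pairs (a hypothetical shape, not the actual tool response), the precedence order (flapping, then recurring, then sustained, then one-off) is a choice made here, and the 10% gap tolerance for "regular intervals" is illustrative.

```python
from datetime import datetime, timedelta


def classify_fire_pattern(fires, eval_interval, tolerance=0.1):
    """Label a fire history. `fires` is a list of (start, end) datetimes,
    sorted by start; the shape is an assumption for this sketch."""
    if not fires:
        raise ValueError("no fires in the lookback window")
    starts = [start for start, _ in fires]

    # flapping: any three consecutive fires starting within one hour.
    for i in range(len(starts) - 2):
        if starts[i + 2] - starts[i] <= timedelta(hours=1):
            return "flapping"

    # recurring: three or more fires with near-equal gaps between starts.
    if len(starts) >= 3:
        gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
        mean_gap = sum(gaps) / len(gaps)
        if all(abs(g - mean_gap) <= tolerance * mean_gap for g in gaps):
            return "recurring"

    # sustained vs one-off: judged on the most recent fire's duration.
    last_start, last_end = fires[-1]
    if last_end - last_start >= eval_interval:
        return "sustained"
    return "one-off"


t0 = datetime(2024, 5, 2, 14, 0)
two_min = timedelta(minutes=2)
hourly = [(t0 + timedelta(hours=h), t0 + timedelta(hours=h) + two_min) for h in range(3)]
burst = [(t0 + timedelta(minutes=m), t0 + timedelta(minutes=m) + two_min) for m in (0, 10, 20)]

print(classify_fire_pattern(hourly, timedelta(minutes=5)))
print(classify_fire_pattern(burst, timedelta(minutes=5)))
```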

### Step 2: Tier 1 — what fired and how hard

This tier always runs. It establishes the fire is real (vs. transient
threshold tickle or flap) and quantifies the magnitude.

1. Re-run the alert's primary query over a window centered on the fire
   start: `[fire_start - 30m, fire_start + 30m]`.
   - Use `signoz:signoz_execute_builder_query` for builder/formula alerts.
   - Use `signoz:signoz_query_metrics` for PromQL alerts.
2. Compute:
   - **Peak value** during the fire window.
   - **Threshold breach magnitude**:
     `(peak - threshold) / threshold * 100` for "above" alerts,
     inverted for "below".
   - **Fire duration**: how long the breach lasted.
   - **Pre-fire baseline**: average value in the 30m before fire start.
3. **Early-stop gate**: if the breach magnitude is < 10% over the
   threshold AND the fire duration is < 1 evaluation window, classify
   as "marginal fire": the alert may be too sensitive. Skip Tiers 2
   and 3 and go to Step 5 with a single hypothesis: "threshold may be
   too tight, recommend tuning."
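
The Tier 1 arithmetic and gate can be written out as below. This is a sketch, assuming the re-run query has been flattened into `(unix_seconds, value)` pairs and that the threshold is non-zero; the function and key names are hypothetical, and "inverted for below" is interpreted here as `(threshold - trough) / threshold * 100`.

```python
def tier1_summary(points, threshold, fire_start, fire_end,
                  eval_window_s, direction="above"):
    """Summarise a fire. `points` is a list of (unix_seconds, value) pairs;
    this shape is an assumption, not the MCP response format."""
    during = [v for t, v in points if fire_start <= t <= fire_end]
    # Pre-fire baseline: samples from the 30 minutes before fire start.
    pre = [v for t, v in points if fire_start - 1800 <= t < fire_start]
    if not during:
        raise ValueError("no samples inside the fire window")

    if direction == "above":
        peak = max(during)
        breach_pct = (peak - threshold) / threshold * 100
    else:
        peak = min(during)  # the trough is the worst value for "below" alerts
        breach_pct = (threshold - peak) / threshold * 100

    duration_s = fire_end - fire_start
    return {
        "peak": peak,
        "breach_pct": breach_pct,
        "duration_s": duration_s,
        "baseline": sum(pre) / len(pre) if pre else None,
        # Early-stop gate: small breach AND shorter than one evaluation window.
        "marginal": breach_pct < 10 and duration_s < eval_window_s,
    }


# Made-up series: quiet around 2.0, brief excursion just over a 5.0 threshold.
points = [(0, 2.0), (600, 2.2), (1200, 1.8),
          (1800, 5.2), (1830, 5.4), (1860, 5.1), (2400, 2.1)]
summary = tier1_summary(points, threshold=5.0, fire_start=1800,
                        fire_end=1860, eval_window_s=300)
print(summary)
```

With these numbers the breach is roughly 8% and lasts 60 seconds against a 300-second evaluation window, so the gate classifies it as a marginal fire.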

### Step 3: Tier 2 — neighbor signals vs baseline

Run only if Tier 1 confirms a real breach. Pull related signals for the
same resource scope as the alert and compare the fire window to a
baseline window.

1. **Pick a baseline window**. Use the same hour, previous day
   (`[fire_start - 24h, fire_start - 24h + fire_duration]`). If the
   alert fired during a known-anomalous time (deploy, weekly job),
   note it in the output but still proceed.

2. **Look up neighbor signals** for the alert's resource type. See
   `references/neighbor-signals.md` for the lookup table. Common cases:
   - **Service-level alert** (`service.name = X`): pull error rate,
     p95/p99 latency, request throughput, dependency error rates if
     trace data is available.
   - **Host / VM alert** (`host.name = X`): CPU, memory, disk I/O,
     network I/O.
   - **K8s pod / namespace alert**: pod restarts, container CPU/memory
     limits, node pressure, recent rollouts.

3. For each neighbor signal:
   - Query both windows (fire + baseline) via
     `signoz:signoz_execute_builder_query` or `signoz:signoz_query_metrics`.
   - Compute the delta (% change in fire window vs baseline).
   - Rank by absolute delta.

4. **Early-stop gate**: if no neighbor signal shows ≥ 25% deviation
   from baseline, classify as "isolated fire: the alert's own signal
   moved but nothing else did." This is unusual and worth surfacing.
   Skip Tier 3 and go to Step 5 with hypotheses focused on the alert's
   own query (likely causes: data source change, instrumentation
   change, downstream silent failure that only shows in this metric).
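
The baseline window, delta ranking, and 25% gate can be sketched as follows. It assumes each neighbor signal has already been reduced to one aggregate value per window (for example a mean); the function names and signal names are hypothetical, and signals with a missing or zero baseline are skipped because a percentage change is undefined for them.

```python
from datetime import datetime, timedelta


def baseline_window(fire_start, fire_duration):
    """Same hour on the previous day, same length as the fire."""
    start = fire_start - timedelta(hours=24)
    return start, start + fire_duration


def rank_neighbors(fire_values, baseline_values, gate_pct=25.0):
    """Return ([(signal, pct_change), ...] sorted by |pct_change|, isolated).

    Both inputs map a signal name to one aggregate value per window
    (an assumed shape for this sketch).
    """
    deltas = []
    for name, fire_v in fire_values.items():
        base_v = baseline_values.get(name)
        if not base_v:  # missing or zero baseline: % change is undefined
            continue
        deltas.append((name, (fire_v - base_v) / base_v * 100))
    deltas.sort(key=lambda d: abs(d[1]), reverse=True)
    # Early-stop gate: no neighbor moved by at least gate_pct.
    isolated = not any(abs(pct) >= gate_pct for _, pct in deltas)
    return deltas, isolated


window = baseline_window(datetime(2024, 5, 2, 14, 32), timedelta(minutes=10))

# Made-up aggregates for a service-level alert.
fire = {"error_rate": 4.0, "p99_latency_ms": 900.0, "throughput_rps": 95.0}
base = {"error_rate": 1.0, "p99_latency_ms": 300.0, "throughput_rps": 100.0}
ranked, isolated = rank_neighbors(fire, base)
for name, pct in ranked:
    print(f"{name}: {pct:+.0f}%")
```

Ranking by absolute delta means a sharp drop (for example throughput collapsing) surfaces alongside sharp rises.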

### Step 4: Tier 3 — traces and logs at the fire window

Run only if Tier 2 found correlated neighbor anomalies. Drill into
specific failing operations.

1. **Traces** (if the alert is service-scoped and traces are
   available):
   - Call `signoz:signoz_search_traces` for the fire window with filter:
     `service.name = <scope>` AND `hasError = true`. Cap at top 20.
   - Group results by `name` (operation) and `error_message`. Surface
     the top 3 by frequency with a representative trace ID for each.
   - Optionally call `signoz:signoz_get_trace_details` on one representative
     to extract specific span attributes (DB statement, downstream URL,
     status code).

2. **Logs** for the fire window:
   - Call `signoz:signoz_search_logs` with filter:
     `<scope_filter>` AND `severity_text IN ('ERROR', 'FATAL')`. Cap
     at top 20 most recent.
   - Group by `body` pattern (or `exception.type` if present). Surface
     the top 3 distinct messages with counts.

3. Cross-reference: do the traces and logs point at the same
   downstream service, dependency, or code path? If so, that becomes
   the leading hypothesis.
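
The trace grouping described above (top 3 by frequency, one representative trace ID each) might look like the sketch below. It assumes the search results have been flattened into dicts; `name` and `error_message` follow the field names used above, while `trace_id` and the output keys are hypothetical.

```python
from collections import Counter


def top_error_groups(spans, top_n=3):
    """Group error spans by (operation, error message) and keep the first
    trace ID seen for each group as its representative."""
    counts = Counter()
    representative = {}
    for span in spans:
        key = (span["name"], span["error_message"])
        counts[key] += 1
        representative.setdefault(key, span["trace_id"])
    return [
        {"operation": op, "error_message": msg,
         "count": n, "trace_id": representative[(op, msg)]}
        for (op, msg), n in counts.most_common(top_n)
    ]


# Made-up search results for illustration.
spans = [
    {"name": "POST /checkout", "error_message": "db timeout", "trace_id": "t1"},
    {"name": "POST /checkout", "error_message": "db timeout", "trace_id": "t2"},
    {"name": "POST /checkout", "error_message": "db timeout", "trace_id": "t3"},
    {"name": "GET /cart", "error_message": "upstream 503", "trace_id": "t4"},
    {"name": "GET /cart", "error_message": "upstream 503", "trace_id": "t5"},
    {"name": "GET /health", "error_message": "connection reset", "trace_id": "t6"},
]
groups = top_error_groups(spans)
for g in groups:
    print(g["count"], g["operation"], g["error_message"], g["trace_id"])
```

The same grouping applies to logs by swapping the key for a `body` pattern or `exception.type`.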

See `references/baseline-comparison.md` for query templates that pair
fire-window and baseline-window calls cleanly.

### Step 5: Build the structured output

Use this exact section order. Lead with a TL;DR — engineers under
pressure scan the top first and stop reading once they have what
they need. Compression plus proof: every claim cites the MCP query
that produced it; no generic "check logs / verify connectivity"
filler.

**1. TL;DR** — one or two sentences, no more.

Files: 4

Size: 35.1 KB

Complexity: 55/100

Category: General

Source: https://github.com/signoz/agent-skills/tree/main/plugins/signoz/skills/signoz-investigating-alerts
