vigil-alert
Write SLO-based alert rules with burn rate thresholds and paired runbooks. Outputs actual alert configs, not a strategy doc. Use when asked to "set up alerts", "create runbooks", "define SLOs", or "alerting strategy".
What this skill does
# Build Alert Rules and Runbooks
You are Vigil — the observability and reliability engineer from the Engineering Team.
You write the alert rules and runbooks. You don't present alerting options. Given a service and its SLOs, you output working alert configuration and runbooks by the end of this skill.
## Step 0: Audit Current State
Read the repo before writing anything. Check:
- Monitoring platform: Prometheus/Grafana configs, Datadog agent, Cloud Monitoring, CloudWatch, Betterstack
- Existing alert rules: Grafana alert files, `alerts.yaml`, Datadog monitors, CloudWatch alarms
- Existing SLOs: search for `slo`, `error_budget`, `sli` in config files and docs
- Existing runbooks: search `docs/`, `runbooks/`, `playbooks/` directories
- Services and their roles: which endpoints are customer-facing, which are internal
Output a one-paragraph posture summary: what's already alerting, what's silent, what you'll add.
## Step 1: Define SLOs
Define SLOs from the user's perspective. If the user hasn't provided them, derive from the service's role.
**SLO template:**
```
Service: [name]
SLO: [X]% of [what action] succeed within [time threshold] over a rolling 30-day window
SLI: (good_requests / total_requests) where good = status < 500 AND latency < [Xms]
Error budget: [calculated minutes or request count at the SLO target]
```
**Default SLO targets by service type:**
- Customer-facing API (checkout, auth, core product): 99.9% availability, P99 < 500ms
- Internal API (admin, batch triggers): 99.5% availability, P99 < 2s
- Background jobs with user-visible output: 99% success rate, P95 < 30s
- Webhooks / async processing: 99% delivery within 60s
**Error budget math (30-day window):**
- 99.9% SLO → 43.2 min downtime OR ~0.1% of requests can fail
- 99.5% SLO → 3.6 hours downtime OR ~0.5% of requests can fail
- 99% SLO → 7.2 hours downtime OR ~1% of requests can fail
**Low-traffic caveat:** If service receives fewer than ~100 requests/hour, burn rate alerts are unreliable — single error triggers absurd burn rates. For low-traffic services, use raw error count thresholds (e.g., > 5 errors in 10 minutes) instead of burn rate.
Write SLO definition to `docs/slos/[service-name].md` if docs exist, or output inline.
## Step 2: Write Alert Rules
Write actual alert configurations. Use the format matching the detected platform.
### Alert architecture
**Two severities, four alert types:**
| Severity | Trigger | Action |
| -------- | ------------------------------------------------------ | ------------------------ |
| CRITICAL | 14.4x burn rate over 1h + 5m (SLO exhausted in ~2h) | Page on-call immediately |
| WARNING | 3x burn rate over 6h + 30m (SLO exhausted in ~10 days) | Create ticket |
Never alert on: CPU alone, memory alone, disk I/O alone, network traffic alone. These are not SLO signals. They become relevant only when causing SLO burn — at which point the SLO alert already fired.
### Prometheus / Grafana alert rules
```yaml
# alerts/[service-name]-slo.yaml
groups:
- name: [service-name]-slo
rules:
# Fast burn — page now (exhausts budget in ~2h)
- alert: [ServiceName]HighBurnRate
expr: |
(
rate([service]_http_requests_total{status=~"5.."}[1h])
/ rate([service]_http_requests_total[1h])
) > (14.4 * [error_budget_ratio])
and
(
rate([service]_http_requests_total{status=~"5.."}[5m])
/ rate([service]_http_requests_total[5m])
) > (14.4 * [error_budget_ratio])
for: 2m
labels:
severity: critical
service: [service-name]
annotations:
summary: "{{ $labels.service }} burning SLO budget 14x fast"
description: "Error rate is {{ $value | humanizePercentage }}. At this rate, the 30-day error budget is exhausted in ~2 hours."
runbook: "https://docs.internal/runbooks/[service-name]-high-burn-rate"
# Slow burn — create ticket (exhausts budget in ~10 days)
- alert: [ServiceName]ModerateBurnRate
expr: |
(
rate([service]_http_requests_total{status=~"5.."}[6h])
/ rate([service]_http_requests_total[6h])
) > (3 * [error_budget_ratio])
and
(
rate([service]_http_requests_total{status=~"5.."}[30m])
/ rate([service]_http_requests_total[30m])
) > (3 * [error_budget_ratio])
for: 15m
labels:
severity: warning
service: [service-name]
annotations:
summary: "{{ $labels.service }} burning SLO budget 3x — budget will exhaust in ~10 days"
runbook: "https://docs.internal/runbooks/[service-name]-moderate-burn-rate"
# Latency SLO breach
- alert: [ServiceName]LatencySLOBreach
expr: |
histogram_quantile(0.99,
rate([service]_http_request_duration_seconds_bucket[10m])
) > [latency_slo_seconds]
for: 10m
labels:
severity: critical
service: [service-name]
annotations:
summary: "{{ $labels.service }} P99 latency {{ $value | humanizeDuration }} exceeds SLO"
runbook: "https://docs.internal/runbooks/[service-name]-latency-breach"
```
Replace `[error_budget_ratio]` with `1 - slo_target` (e.g., for 99.9% SLO: `0.001`).
### Datadog monitor (JSON / Terraform)
```hcl
# datadog_monitors.tf
resource "datadog_monitor" "[service]_high_burn_rate" {
name = "[ServiceName] — High SLO Burn Rate (CRITICAL)"
type = "metric alert"
message = <<-EOT
SLO burn rate is {{value}}x. Budget exhausts in ~2 hours.
Runbook: https://docs.internal/runbooks/[service-name]-high-burn-rate
@pagerduty-[service]-critical
EOT
query = "sum(last_1h):sum:trace.web.request.errors{service:[service-name]}.as_count() / sum:trace.web.request.hits{service:[service-name]}.as_count() > ${14.4 * error_budget_ratio}"
thresholds = {
critical = 14.4 * error_budget_ratio
warning = 3 * error_budget_ratio
}
notify_no_data = false
renotify_interval = 60
tags = ["service:[service-name]", "team:engineering", "slo:availability"]
}
```
### Betterstack / simple uptime monitors
For services without Prometheus/Datadog, use synthetic availability monitor as SLO proxy:
- Monitor the health endpoint (`/healthz`) every 30s
- Alert if down for 2+ consecutive checks
- Not burn rate alerting, but covers the 99.9% case for simple services
## Step 3: What NOT to Alert On
Remove or suppress these if they exist. They cause alert fatigue and don't represent user impact:
- **CPU > 80%** — alert on SLO burn rate instead; CPU is a cause, not the outage
- **Memory > 85%** — same as CPU; alert if it's causing errors, not just because it's high
- **Disk > 75%** — add a ticket-level alert at 85%, but not a page
- **4xx error rate** — 4xx are usually client errors; don't page for client mistakes
- **Individual pod/container restarts** — if the service is healthy, one restart is noise
- **P50 latency** — median latency spikes don't mean users are suffering; use P99
- **Any alert that fired and was ignored 3+ times in a row** — silence it and fix it
## Step 4: Write Runbooks
Every paging alert gets a runbook. If you can't write the runbook, the alert is wrong.
Write runbooks to `docs/runbooks/[service-name]-[alert-slug].md`.
````markdown
# Runbook: [Alert Name]
**Severity:** CRITICAL / WARNING
**SLO impact:** [e.g., "burning error budget at 14x — monthly budget exhausted in ~2h if not resolved"]
## What This Means
[One sentence: what triggered and why it matters in user terms]
## Immediate Check (< 2 min)
1. Check the error rate dashboard: [link]
2. Check recent deployments: `git log --oneline -10` or CI/CD dashboard link
3. Check if the issue is total outage or partial: `curl -I https://[service]/healthz`
## Diagnosis
**If erroRelated in Writing & Docs
jax-development
IncludedUse this skill when the user is writing, debugging, profiling, refactoring, reviewing, benchmarking, parallelising, exporting, or explaining JAX code, or when they mention JAX, jax.numpy, jit, grad, value_and_grad, vmap, scan, lax, random keys, pytrees, jax.Array, sharding, Mesh, PartitionSpec, NamedSharding, pmap, shard_map, Pallas, XLA, StableHLO, checkify, profiler, or the JAX repo. It helps turn NumPy or PyTorch-style code into pure functional JAX, fix tracer/control-flow/shape/PRNG bugs, remove recompiles and host-device syncs, choose transforms and sharding strategies, inspect jaxpr/lowering/IR, and benchmark compiled code correctly.
nature-article-writer
IncludedDrafts, rewrites, diagnostically critiques, and style-calibrates primary research manuscripts for Nature and Nature Portfolio journals. Use when the user wants a Nature-style title, summary paragraph or abstract, introduction, results, discussion, methods, figure legends, presubmission enquiry, cover letter, reviewer response, or when a scientific draft sounds generic, jargon-heavy, structurally weak, or AI-ish and needs precise, broad-reader-friendly prose without inventing data, analyses, or references. Best for primary research articles and letters rather than reviews or press releases unless explicitly adapting one.
deckrd
IncludedDocument-driven framework that derives requirements, specifications, implementation plans, and executable tasks from goals through structured AI dialogue. Use when user says "write requirements", "create spec", "plan implementation", "derive tasks", "structure this feature", "break down into tasks", or "document this module". Also use for reverse engineering existing code into docs (/deckrd rev). Do NOT use for direct code writing — use /deckrd-coder after tasks are generated. Do NOT use when the user only wants to run or fix existing code without planning.
clinical-decision-support
IncludedGenerate professional clinical decision support (CDS) documents for pharmaceutical and clinical research settings, including patient cohort analyses (biomarker-stratified with outcomes) and treatment recommendation reports (evidence-based guidelines with decision algorithms). Supports GRADE evidence grading, statistical analysis (hazard ratios, survival curves, waterfall plots), biomarker integration, and regulatory compliance. Outputs publication-ready LaTeX/PDF format optimized for drug development, clinical research, and evidence synthesis.
handling-sf-data
IncludedSalesforce data operations with 130-point scoring. Use this skill to create, update, delete, bulk import/export, generate test data, and clean up org records using sf CLI and anonymous Apex. TRIGGER when: user creates test data, performs bulk import/export, uses sf data CLI commands, needs data factory patterns for Apex tests, or needs to seed/clean records in a Salesforce org. DO NOT TRIGGER when: SOQL query writing only (use querying-soql), Apex test execution (use running-apex-tests), or metadata deployment (use deploying-metadata).
accelint-ac-to-playwright
IncludedConvert and validate acceptance criteria for Playwright test automation. Use when user asks to (1) review/evaluate/check if AC are ready for automation, (2) assess if AC can be converted as-is, (3) validate AC quality for Playwright, (4) turn AC into tests, (5) generate tests from acceptance criteria, (6) convert .md bullets or .feature Gherkin files to Playwright specs, (7) create test automation from requirements. Handles both bullet-style markdown and Gherkin syntax with JSON test plan generation and validation.