langchain-incident-runbook
Triage LangChain 1.0 / LangGraph 1.0 production incidents: LLM-specific SLOs, provider outage runbook, latency spike decision tree, cost-overrun response, agent loop containment. Use during an on-call page, in a post-mortem, or when writing the team's first LLM runbook. Trigger with "langchain incident", "llm on-call", "langchain slo", "langchain outage", "langchain cost spike", "langchain agent loop".
# LangChain Incident Runbook
## Overview
3:07am. PagerDuty: "LangChain p95 latency > 10s for 5 minutes." You open LangSmith,
filter by `service="triage-agent"` over the last 15 minutes, and the first trace
is 43 seconds long — an agent is on step 24 of 25 iterations, bouncing between
the same two tools on a vague user prompt ("help me with my account"). The cost
dashboard shows **$400 spent in the last 10 minutes**, up from a $6/hour baseline.
This is P10: `create_react_agent` defaults to `recursion_limit=25` with no cost
cap; vague prompts never converge; the spend hits before `GraphRecursionError`
surfaces. First move is not to push a code fix — it is to flip
`recursion_limit=5` via config reload and add a middleware token-budget cap per
session, then deal with the stuck sessions.
Or: same alert, different signature. p95 is healthy at 1.8s, but p99 is 12s and
spiky. The spikes correlate with instance starts in Cloud Run. P36: Python +
LangChain + embedding preloads = 5–15s cold start; Cloud Run scales to zero by
default, so first-request p99 is 10x p95. First move is `--min-instances=1`
(or a keepalive pinger), not more CPU.
The shape of the page decides the first move. This runbook gives you:
1. The LLM-specific SLO set most teams do not have: **p95 TTFT <1s, p99 total
latency <10s, error-rate <0.5%, cost-per-req <$0.05** — with Prometheus
burn-rate recording rules that page on user-visible regression.
2. A triage decision tree with three root paths (latency / cost / error-rate),
each with a 3-step diagnostic and first-response action.
3. Provider outage runbook wired to `.with_fallbacks([backup])` so failover is a
config flip, not a code change.
4. Agent-loop containment via `recursion_limit` tuning and middleware
token-budget caps so runaway agents stop burning cost **before** the
`GraphRecursionError`.
5. Post-incident debug bundle (cross-ref `langchain-debug-bundle`) and write-up
template.
Pinned: `langchain-core 1.0.x`, `langgraph 1.0.x`, `langsmith 0.3+`. Primary pain
anchors: **P10** (agent runaway), **P36** (cold start). Adjacent: P29 (per-process
rate limiter), P30 (`max_retries=6` means 7 attempts), P31 (Anthropic cache RPM).
## Prerequisites
- LangSmith workspace with tracing enabled (free tier is fine for runbook work)
- Prometheus + Alertmanager or equivalent (Datadog, Grafana Cloud) — burn-rate rules assume PromQL
- `langchain-observability` skill applied — metrics are grounded in what LangSmith callbacks emit
- A `backup` model factory from `langchain-rate-limits` — the failover playbook assumes `.with_fallbacks()` is already wired
- On-call rotation + PagerDuty (or equivalent) integrated with the alert pipeline
## Instructions
### Step 1 — Define the LLM-specific SLO set
HTTP-style SLOs miss what users actually feel. Define four, publish them, wire
burn-rate alerts to the symptom.
| SLO | Threshold | Alert condition | First-response action |
|---|---|---|---|
| **p95 TTFT** (time to first streamed token) | <1s | burn-rate > 2% over 5min | Check streaming is enabled; check provider status; check cold start (P36) |
| **p99 total latency** | <10s | burn-rate > 5% over 5min | Check agent loop depth (P10); cold start (P36); provider latency |
| **Error rate** (5xx + uncaught exceptions) | <0.5% | burn-rate > 1% over 5min | Check provider 429/500; auth token; schema drift on structured output |
| **Cost per request** | <$0.05 (tier-dependent) | p95 spend/req > $0.20 over 15min | Check agent recursion (P10); retry rate (P30); token-use per req |
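As a rough illustration of the table above, the four thresholds can be held in one place and checked against observed values. This is a minimal sketch: `SloTargets`, `breached_slos`, and the metric key names are hypothetical, not something LangChain or LangSmith provides, and the observed numbers are made up for the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SloTargets:
    """The four LLM-specific SLO thresholds from the table (hypothetical container)."""
    p95_ttft_s: float = 1.0          # time to first streamed token, seconds
    p99_latency_s: float = 10.0      # total request latency, seconds
    error_rate: float = 0.005        # 0.5% of requests
    cost_per_req_usd: float = 0.05   # tier-dependent


def breached_slos(observed: dict[str, float],
                  targets: SloTargets = SloTargets()) -> list[str]:
    """Return the names of SLOs whose observed value is at or above its threshold."""
    limits = asdict(targets)
    # Missing metrics are treated as healthy here; a real check should alert on gaps.
    return [name for name, limit in limits.items()
            if observed.get(name, 0.0) >= limit]


# Illustrative observation: latency and cost are out of bounds, the rest are fine.
observed = {
    "p95_ttft_s": 0.6,
    "p99_latency_s": 12.0,
    "error_rate": 0.002,
    "cost_per_req_usd": 0.21,
}
print(breached_slos(observed))
```

The returned names map directly onto the root paths in the triage tree: a latency breach and a cost breach showing up together is the signature of an agent loop.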
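Prometheus recording rule and alert pattern for p99 latency. As written this is a
threshold alert held for 5 minutes rather than a true burn-rate calculation;
replicate it for TTFT, error-rate, and cost: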
```yaml
groups:
  - name: langchain_slo
    interval: 30s
    rules:
      - record: langchain:p99_latency_5m
        expr: histogram_quantile(0.99, sum(rate(langchain_request_duration_seconds_bucket[5m])) by (le, service))
      - alert: LangChainP99LatencyBurn
        expr: langchain:p99_latency_5m > 10
        for: 5m
        labels: { severity: page, team: llm }
        annotations:
          summary: "LangChain p99 > 10s for {{ $labels.service }}"
          runbook: "https://runbooks/langchain-incident-runbook#latency"
```
See [LLM SLOs](references/llm-slos.md) for the canonical set (free / paid /
enterprise tiers), burn-rate recipes (fast + slow), and a TTFT-specific rule
that requires streaming to be instrumented.
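For readers new to burn-rate alerting, the arithmetic behind a fast + slow recipe can be sketched as follows. This is an assumption-laden illustration, not the contents of `references/llm-slos.md`: the function names are hypothetical, and the 14.4 paging threshold is only an example value commonly used with a 30-day budget.

```python
def burn_rate(bad_events: int, total_events: int, error_budget: float) -> float:
    """How fast the error budget is being spent: 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget


def should_page(fast_window: float, slow_window: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows agree, which filters out short blips."""
    return fast_window >= threshold and slow_window >= threshold


# Error-rate SLO of <0.5% gives an error budget of 0.005.
BUDGET = 0.005

fast = burn_rate(bad_events=40, total_events=500, error_budget=BUDGET)    # last 5 min
slow = burn_rate(bad_events=300, total_events=4000, error_budget=BUDGET)  # last 1 h
print(f"fast={fast:.1f}x slow={slow:.1f}x page={should_page(fast, slow)}")
```

Requiring both windows to exceed the threshold is the design choice that keeps a single bad minute from paging while still catching a sustained regression quickly.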
### Step 2 — Triage decision tree: which root path?
The alert name tells you the root path. Do not mix diagnostics across paths —
the first-response action differs.
```
Alert fired
├── Latency (p95/p99 breach, TTFT breach)
│ ├── 1. Provider status page (Anthropic, OpenAI) green? → if red, Step 3
│ ├── 2. Cold start pattern? (p99 >> p95, correlates with instance starts) → P36
│ └── 3. Streaming configured? (TTFT only makes sense with .stream/.astream)
│
├── Cost (spend/req or absolute spend/hour breach)
│ ├── 1. Agent recursion depth? (LangSmith: max steps per trace) → P10
│ ├── 2. Retry rate elevated? (callback log: attempt count / logical call) → P30
│ └── 3. Token-use per req regression? (input + output tokens from callbacks)
│
└── Error rate (5xx + uncaught exceptions)
    ├── 1. Provider 429/500 spike? (distinguish client 4xx from provider 5xx)
    ├── 2. Auth? (API key rotation, expired token, org quota exhausted)
    └── 3. Schema drift on structured output? (Pydantic ValidationError in traces)
```
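The tree above can also be encoded as data so that an alert handler or chat bot prints the right checklist. The sketch below is one possible encoding: `TRIAGE_STEPS` and `root_path` are hypothetical names, and the keyword matching assumes alert names follow the `LangChainP99LatencyBurn` style used in the Prometheus rule.

```python
# Ordered diagnostics per root path, copied from the decision tree.
TRIAGE_STEPS = {
    "latency": [
        "Provider status page green? If red, go to Step 3",
        "Cold start pattern (p99 >> p95, correlates with instance starts)? See P36",
        "Streaming configured? TTFT needs .stream/.astream",
    ],
    "cost": [
        "Agent recursion depth (max steps per trace in LangSmith)? See P10",
        "Retry rate elevated (attempts per logical call)? See P30",
        "Token use per request regression (input + output tokens)?",
    ],
    "error_rate": [
        "Provider 429/500 spike? Separate client 4xx from provider 5xx",
        "Auth? Key rotation, expired token, org quota exhausted",
        "Schema drift on structured output (Pydantic ValidationError)?",
    ],
}


def root_path(alert_name: str) -> str:
    """Map an alert name to exactly one root path so diagnostics are not mixed."""
    name = alert_name.lower()
    if "latency" in name or "ttft" in name:
        return "latency"
    if "cost" in name or "spend" in name:
        return "cost"
    if "error" in name:
        return "error_rate"
    raise ValueError(f"no triage path for alert {alert_name!r}")


path = root_path("LangChainP99LatencyBurn")
for number, step in enumerate(TRIAGE_STEPS[path], start=1):
    print(f"{path} {number}. {step}")
```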
For each leaf, [Latency Triage](references/latency-triage.md) and
[Cost Overrun Response](references/cost-overrun-response.md) give the LangSmith
filter query, the exact metric to inspect, and the remediation.
### Step 3 — Provider outage: detect, circuit-break, fail over
Detection precedes failover. Do not flip fallbacks on an application bug.
1. **Detect** via three signals — all three should agree before declaring a
provider outage:
- Vendor status page watcher (`status.anthropic.com`, `status.openai.com`) —
poll every 30s, surface into Slack
- In-app canary probe — a 1-req/min call to each configured provider with a
trivial prompt, tracked as a separate SLO
- Error-rate spike on the primary provider in your own metrics (distinguishes
a real outage from your app's bug)
2. **Circuit-break** the primary: a `CircuitBreaker` middleware (see
`langchain-middleware-patterns` if available, or a simple
`aiobreaker`-backed runnable) opens after N consecutive `APIError` /
`APITimeoutError` within a window. Once open, calls skip the primary and go
straight to the backup. This bounds the latency cost of a down provider.
3. **Fail over** via `.with_fallbacks([backup])` (the method takes a list of
fallback runnables); the fallback chain is already wired (see
`langchain-rate-limits`). During an outage, either flip a feature flag that
swaps the default factory, or temporarily set the primary's `max_retries=0`
so the chain reaches the fallback immediately.
4. **Comms**: post a user-facing status page entry ("Degraded performance on
feature X — monitoring upstream provider") and an internal Slack update with
the canary graph attached. [Provider Outage Playbook](references/provider-outage-playbook.md)
has the full comms template and the circuit-breaker middleware snippet.
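The detect and circuit-break steps above can be sketched in plain Python. This is not the `langchain-middleware-patterns` middleware or `aiobreaker`; `outage_confirmed`, `CircuitBreaker`, and `call_with_breaker` are hypothetical names, the thresholds are placeholders, and `primary` / `backup` stand in for any callables that invoke a model.

```python
import time
from typing import Callable, Optional


def outage_confirmed(status_page_red: bool, canary_failing: bool,
                     primary_error_spike: bool) -> bool:
    """Declare a provider outage only when all three signals agree."""
    return status_page_red and canary_failing and primary_error_spike


class CircuitBreaker:
    """Opens after `threshold` consecutive failures inside `window_s` seconds."""

    def __init__(self, threshold: int = 3, window_s: float = 30.0,
                 cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._failures: list[float] = []        # timestamps of the current streak
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self.clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: close so the next call probes the primary again.
            self._opened_at = None
            self._failures.clear()
            return False
        return True

    def record_success(self) -> None:
        self._failures.clear()                   # any success ends the streak

    def record_failure(self) -> None:
        now = self.clock()
        # Drop failures that have aged out of the window, then add this one.
        self._failures = [t for t in self._failures if now - t <= self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.threshold:
            self._opened_at = now


def call_with_breaker(breaker: CircuitBreaker,
                      primary: Callable[[str], str],
                      backup: Callable[[str], str],
                      prompt: str) -> str:
    """Skip the primary entirely while the breaker is open."""
    if breaker.is_open():
        return backup(prompt)
    try:
        result = primary(prompt)
    except Exception:  # assumption: narrow this to provider API/timeout errors
        breaker.record_failure()
        return backup(prompt)
    breaker.record_success()
    return result
```

While the breaker is open the primary is never called, so a down provider costs no timeout per request; that is the latency bound the circuit-break step refers to.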
### Step 4 — Agent loop containment: stop the bleed before GraphRecursionError
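P10 is the most common cost-spike cause. `create_react_agent` graphs run under
LangGraph's default `recursion_limit=25`, which caps graph super-steps rather than
model calls: each model-then-tool round trip consumes two steps, so the default
allows roughly 12 model calls per user turn. With Claude Sonnet at ~$3/MTok
input, a tool-call loop carrying ~10k tokens of context burns real money per minute.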
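To make the cost claim concrete, here is the back-of-envelope arithmetic as a sketch. It assumes every model call re-sends the same ~10k input tokens at ~$3/MTok and ignores output tokens, caching, and context growth, so real spend is likely higher; `loop_input_cost_usd` is a hypothetical helper and the call counts are illustrative.

```python
def loop_input_cost_usd(model_calls: int, tokens_per_call: int = 10_000,
                        usd_per_mtok: float = 3.0) -> float:
    """Input-token cost of one user turn, assuming a constant context per call."""
    return model_calls * tokens_per_call * usd_per_mtok / 1_000_000


# Compare a tightly capped agent with progressively looser limits.
for calls in (5, 12, 25):
    cost = loop_input_cost_usd(calls)
    print(f"{calls:>2} model calls -> ${cost:.2f} input cost per turn")
```

Under these assumptions even 5 calls cost $0.15 per turn, three times the $0.05 cost-per-request SLO, which is why the limit needs tuning per agent rather than being left at the default.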
Three containment layers, applied in order:
1. **Set `recursion_limit` per agent depth** — interactive chat agents rarely
need more than 5–8 steps; background research agents can justify 15; never
leave the default 25 in production.
```python
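from langgraph.prebuilt import create_react_agent

# recursion_limit is a runtime config key, not a create_react_agent() argument.
agent = create_react_agent(llm, tools)  # llm and tools are defined elsewhere
result = agent.invoke(
    {"messages": [("user", "help me with my account")]},
    config={"recursion_limit": 8},  # P10: was default 25
)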
```