
langchain-incident-runbook


Triage LangChain 1.0 / LangGraph 1.0 production incidents — LLM-specific SLOs, provider outage runbook, latency spike decision tree, cost-overrun response, agent loop containment. Use during an on-call page, in a post-mortem, or writing the team's first LLM runbook. Trigger with "langchain incident", "llm on-call", "langchain slo", "langchain outage", "langchain cost spike", "langchain agent loop".

Tags: AI Agents, saas, langchain, langgraph, python, langchain-1.0, sre, incident-response, slo


# LangChain Incident Runbook

## Overview

3:07am. PagerDuty: "LangChain p95 latency > 10s for 5 minutes." You open LangSmith,
filter by `service="triage-agent"` over the last 15 minutes, and the first trace
is 43 seconds long — an agent is on step 24 of 25 iterations, bouncing between
the same two tools on a vague user prompt ("help me with my account"). The cost
dashboard shows **$400 spent in the last 10 minutes**, up from a $6/hour baseline.
This is P10: agents built with `create_react_agent` run under LangGraph's default
`recursion_limit=25` with no cost cap; vague prompts never converge; the spend
hits before `GraphRecursionError` surfaces. The first move is not to push a code
fix. It is to lower `recursion_limit` to 5 via config reload and add a middleware
token-budget cap per session, then deal with the stuck sessions.

Or: same alert, different signature. p95 is healthy at 1.8s, but p99 is 12s and
spiky. The spikes correlate with instance starts in Cloud Run. This is P36: Python +
LangChain + embedding preloads add up to a 5–15s cold start, and Cloud Run scales
to zero by default, so the first request after a scale-up lands in p99 at several
times p95. The first move is `--min-instances=1` (or a keepalive pinger), not more CPU.

The shape of the page decides the first move. This runbook gives you:

1. The LLM-specific SLO set most teams do not have: **p95 TTFT <1s, p99 total
   latency <10s, error-rate <0.5%, cost-per-req <$0.05** — with Prometheus
   burn-rate recording rules that page on user-visible regression.
2. A triage decision tree with three root paths (latency / cost / error-rate),
   each with a 3-step diagnostic and first-response action.
3. Provider outage runbook wired to `.with_fallbacks(backup)` so failover is a
   config flip, not a code change.
4. Agent-loop containment via `recursion_limit` tuning and middleware
   token-budget caps so runaway agents stop burning cost **before** the
   `GraphRecursionError`.
5. Post-incident debug bundle (cross-ref `langchain-debug-bundle`) and write-up
   template.

Pinned: `langchain-core 1.0.x`, `langgraph 1.0.x`, `langsmith 0.3+`. Primary pain
anchors: **P10** (agent runaway), **P36** (cold start). Adjacent: P29 (per-process
rate limiter), P30 (`max_retries=6` means 7 attempts), P31 (Anthropic cache RPM).

## Prerequisites

- LangSmith workspace with tracing enabled (free tier is fine for runbook work)
- Prometheus + Alertmanager or equivalent (Datadog, Grafana Cloud) — burn-rate rules assume PromQL
- `langchain-observability` skill applied — metrics are grounded in what LangSmith callbacks emit
- A `backup` model factory from `langchain-rate-limits` — the failover playbook assumes `.with_fallbacks()` is already wired
- On-call rotation + PagerDuty (or equivalent) integrated with the alert pipeline

## Instructions

### Step 1 — Define the LLM-specific SLO set

HTTP-style SLOs miss what users actually feel. Define four, publish them, wire
burn-rate alerts to the symptom.

| SLO | Threshold | Alert condition | First-response action |
|---|---|---|---|
| **p95 TTFT** (time to first streamed token) | <1s | burn-rate > 2% over 5min | Check streaming is enabled; check provider status; check cold start (P36) |
| **p99 total latency** | <10s | burn-rate > 5% over 5min | Check agent loop depth (P10); cold start (P36); provider latency |
| **Error rate** (5xx + uncaught exceptions) | <0.5% | burn-rate > 1% over 5min | Check provider 429/500; auth token; schema drift on structured output |
| **Cost per request** | <$0.05 (tier-dependent) | p95 spend/req > $0.20 over 15min | Check agent recursion (P10); retry rate (P30); token-use per req |

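Prometheus recording rule and alert pattern for p99 latency. As written, this is a
threshold alert sustained for 5 minutes rather than a multi-window burn-rate
calculation; replicate it for TTFT, error rate, and cost: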

```yaml
groups:
  - name: langchain_slo
    interval: 30s
    rules:
      - record: langchain:p99_latency_5m
        expr: histogram_quantile(0.99, sum(rate(langchain_request_duration_seconds_bucket[5m])) by (le, service))
      - alert: LangChainP99LatencyBurn
        expr: langchain:p99_latency_5m > 10
        for: 5m
        labels: { severity: page, team: llm }
        annotations:
          summary: "LangChain p99 > 10s for {{ $labels.service }}"
          runbook: "https://runbooks/langchain-incident-runbook#latency"
```

See [LLM SLOs](references/llm-slos.md) for the canonical set (free / paid /
enterprise tiers), burn-rate recipes (fast + slow), and a TTFT-specific rule
that requires streaming to be instrumented.
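
For orientation, a burn rate compares the observed bad-event ratio with the error budget the SLO allows. The sketch below shows that arithmetic for the error-rate SLO in the table. The function names, the sample event counts, and the paging factor of 14.4 are illustrative assumptions, not values taken from this runbook or its references.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget is spent exactly as fast as the SLO allows;
    values above 1.0 mean the budget will run out early.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target              # 0.005 for a 99.5% target
    observed_bad_ratio = bad_events / total_events
    return observed_bad_ratio / error_budget


def should_page(long_window_rate: float, short_window_rate: float,
                factor: float = 14.4) -> bool:
    """Page only when both windows burn faster than `factor` (assumed value)."""
    return long_window_rate >= factor and short_window_rate >= factor


# Error-rate SLO from the table: <0.5% errors, i.e. a 99.5% success target.
# The event counts are made-up sample data.
rate_1h = burn_rate(bad_events=90, total_events=1000, slo_target=0.995)  # about 18x
rate_5m = burn_rate(bad_events=12, total_events=100, slo_target=0.995)   # about 24x
page = should_page(rate_1h, rate_5m)
```

Requiring both a long and a short window to agree is one common way to avoid paging on a blip that has already recovered; the window lengths and factor should come from your own error budget policy.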

### Step 2 — Triage decision tree: which root path?

The alert name tells you the root path. Do not mix diagnostics across paths —
the first-response action differs.

```
Alert fired
  ├── Latency (p95/p99 breach, TTFT breach)
  │     ├── 1. Provider status page (Anthropic, OpenAI) green? → if red, Step 3
  │     ├── 2. Cold start pattern? (p99 >> p95, correlates with instance starts) → P36
  │     └── 3. Streaming configured? (TTFT only makes sense with .stream/.astream)
  │
  ├── Cost (spend/req or absolute spend/hour breach)
  │     ├── 1. Agent recursion depth? (LangSmith: max steps per trace) → P10
  │     ├── 2. Retry rate elevated? (callback log: attempt count / logical call) → P30
  │     └── 3. Token-use per req regression? (input + output tokens from callbacks)
  │
  └── Error rate (5xx + uncaught exceptions)
        ├── 1. Provider 429/500 spike? (distinguish client 4xx from provider 5xx)
        ├── 2. Auth? (API key rotation, expired token, org quota exhausted)
        └── 3. Schema drift on structured output? (Pydantic ValidationError in traces)
```

For each leaf, [Latency Triage](references/latency-triage.md) and
[Cost Overrun Response](references/cost-overrun-response.md) give the LangSmith
filter query, the exact metric to inspect, and the remediation.
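
The tree above can also be encoded as data so that an alert handler or chat bot prints the right checklist. This is a minimal sketch: the keyword matching and the alert names used in the example are hypothetical and would need to follow your own alert naming scheme.

```python
from enum import Enum


class RootPath(Enum):
    LATENCY = "latency"
    COST = "cost"
    ERROR_RATE = "error_rate"


# Ordered checks per root path, copied from the decision tree.
CHECKS = {
    RootPath.LATENCY: [
        "Provider status page green? If red, go to Step 3",
        "Cold start pattern (p99 >> p95, correlates with instance starts)? See P36",
        "Streaming configured (.stream/.astream)?",
    ],
    RootPath.COST: [
        "Agent recursion depth (max steps per trace)? See P10",
        "Retry rate elevated (attempt count / logical call)? See P30",
        "Token-use per request regression?",
    ],
    RootPath.ERROR_RATE: [
        "Provider 429/500 spike?",
        "Auth (key rotation, expired token, org quota)?",
        "Schema drift on structured output (Pydantic ValidationError)?",
    ],
}

# Hypothetical keyword map from alert name to root path.
KEYWORDS = {
    RootPath.LATENCY: ("latency", "ttft"),
    RootPath.COST: ("cost", "spend"),
    RootPath.ERROR_RATE: ("error",),
}


def route_alert(alert_name: str) -> RootPath:
    """Pick exactly one root path from the alert name."""
    name = alert_name.lower()
    for path, words in KEYWORDS.items():
        if any(word in name for word in words):
            return path
    raise ValueError(f"No root path for alert {alert_name!r}")


path = route_alert("LangChainP99LatencyBurn")
checklist = CHECKS[path]
```

Returning a single path, and raising on an unknown alert, keeps the "do not mix diagnostics" rule explicit in code.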

### Step 3 — Provider outage: detect, circuit-break, fail over

Detection precedes failover. Do not flip fallbacks on an application bug.

1. **Detect** via three signals — all three should agree before declaring a
   provider outage:
   - Vendor status page watcher (`status.anthropic.com`, `status.openai.com`) —
     poll every 30s, surface into Slack
   - In-app canary probe — a 1-req/min call to each configured provider with a
     trivial prompt, tracked as a separate SLO
   - Error-rate spike on the primary provider in your own metrics (distinguishes
     a real outage from your app's bug)
2. **Circuit-break** the primary: a `CircuitBreaker` middleware (see
   `langchain-middleware-patterns` if available, or a simple
   `aiobreaker`-backed runnable) opens after N consecutive `APIError` /
   `APITimeoutError` within a window. Once open, calls skip the primary and go
   straight to the backup. This bounds the latency cost of a down provider.
3. **Fail over** via `.with_fallbacks(backup)` — the fallback chain is already
   wired (see `langchain-rate-limits`). During an outage, either flip a feature
   flag that swaps the default factory, or temporarily set the primary's
   `max_retries=0` so the chain reaches the fallback immediately.
4. **Comms**: post a user-facing status page entry ("Degraded performance on
   feature X — monitoring upstream provider") and an internal Slack update with
   the canary graph attached. [Provider Outage Playbook](references/provider-outage-playbook.md)
   has the full comms template and the circuit-breaker middleware snippet.
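
To make the circuit-break step concrete, here is a dependency-free sketch of the idea. It is not the middleware from the playbook and does not use `aiobreaker`; the class name, defaults, and injected clock are assumptions for illustration. It counts consecutive failures only (no sliding window) and closes fully after a cool-down instead of modelling a half-open state.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry the primary after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock                      # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker so the primary is tried again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary: Callable[[str], str],
             backup: Callable[[str], str], prompt: str) -> str:
        if self.is_open():
            return backup(prompt)               # skip the primary entirely
        try:
            result = primary(prompt)
        except Exception:                       # narrow to provider errors in real code
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return backup(prompt)
        self.failures = 0                       # a success resets the consecutive count
        return result
```

While the breaker is open, requests pay only the backup's latency instead of waiting for the primary to time out first, which is the bound described in the circuit-break step.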

### Step 4 — Agent loop containment: stop the bleed before GraphRecursionError

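P10 is the most common cost-spike cause. Agents built with `create_react_agent` run
under LangGraph's default `recursion_limit=25`. The limit counts graph super-steps,
and a ReAct loop alternates model and tool steps, so one user turn can make roughly a
dozen model calls before `GraphRecursionError`. With Claude Sonnet at ~$3/MTok input,
a 10k-token tool-call loop burns real money per minute.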
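
A rough worked example of that arithmetic, under loudly stated assumptions: every model call resends about 10k input tokens, a runaway turn makes 10 model calls, and output tokens are ignored. The call counts are illustrative, not measured values.

```python
INPUT_PRICE_PER_MTOK = 3.00      # USD per million input tokens, from the text
TOKENS_PER_CALL = 10_000         # assumed context size resent on each model call


def input_cost_per_turn(model_calls: int,
                        tokens_per_call: int = TOKENS_PER_CALL,
                        price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> float:
    """Input-token cost in USD for one user turn; output tokens are ignored."""
    return model_calls * tokens_per_call * price_per_mtok / 1_000_000


runaway_turn = input_cost_per_turn(model_calls=10)  # assumed looping turn, about $0.30
bounded_turn = input_cost_per_turn(model_calls=3)   # assumed well-behaved turn, about $0.09
```

Under these assumptions a single looping turn costs about six times the $0.05 cost-per-request SLO, which is why the cost path in the triage tree starts with recursion depth.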

Three containment layers, applied in order:

1. **Set `recursion_limit` per agent depth** — interactive chat agents rarely
   need more than 5–8 steps; background research agents can justify 15; never
   leave the default 25 in production.

   ```python
   from langgraph.prebuilt import create_react_agent

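   # recursion_limit is a run-config key, not a create_react_agent argument,
   # so bind it on the compiled graph. `llm` and `tools` are defined elsewhere.
   agent = create_react_agent(llm, tools).with_config(
       {"recursion_limit": 8},  # P10: was default 25
   )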
   ```
