
langchain-langgraph-checkpointing


Persist LangGraph agent state correctly with MemorySaver and PostgresSaver — thread_id discipline, JSON-serializable state rules, time-travel, schema migration. Use when adding chat memory, migrating from ConversationBufferMemory, or time-traveling an agent state to debug an incident. Trigger with "langgraph checkpointer", "MemorySaver", "PostgresSaver", "thread_id", "langgraph time travel", "langgraph state persistence".

Tags: AI Agents, saas, langchain, langgraph, python, langchain-1.0, checkpointing, persistence, memory

What this skill does

# LangGraph Checkpointing (Python)

## Overview

A chat agent that "keeps introducing itself" is almost always P16. The caller
invokes `graph.invoke(state)` without passing `config={"configurable":
{"thread_id": ...}}` — LangGraph's checkpointer silently spawns a fresh state per
call. No error, no warning, no log line. The user sees it; the code does not.

That is one of five separate checkpointing pitfalls this skill covers:

- **P16** — missing `thread_id` silently resets memory
- **P17** — `interrupt_before` raises `TypeError` when state holds non-JSON values
  (`datetime`, `Decimal`, custom classes) — and it raises *at the interrupt
  boundary*, not when the bad value was first assigned, so the traceback points
  at the wrong line
- **P20** — `PostgresSaver` does not auto-migrate checkpoint schema; upgrading
  `langgraph` silently reads old checkpoints as empty state
- **P40** — `ConversationBufferMemory` and the rest of legacy chat memory were
  removed in LangChain 1.0; checkpointers are the replacement
- **P51** — Deep Agent virtual-FS state in `state["files"]` grows unboundedly
  and eventually makes checkpoint writes a latency hotspot

This skill walks through picking a checkpointer by environment, enforcing
`thread_id` at the application boundary, constraining state to JSON-safe
primitives, Postgres setup + migration, and time-travel for incident debugging.
Pinned to `langgraph >= 1.0, < 2.0`, `langgraph-checkpoint-postgres >= 1.0, <
2.0`. Pain-catalog anchors: P16, P17, P18, P20, P22, P40, P51.

## Prerequisites

- Python 3.10+
- `pip install langgraph langchain-core` (both `>= 1.0, < 2.0`)
- For Postgres: `pip install langgraph-checkpoint-postgres` and a Postgres 13+
  instance
- For async Postgres: the same package; `AsyncPostgresSaver` runs on
  `psycopg` 3's async connection, so `asyncpg` is not required
- A `thread_id` strategy — typically a UUID4 string per conversation; see
  [thread-id-discipline.md](references/thread-id-discipline.md)

## Instructions

### Step 1 — Pick a checkpointer by environment

| Env | Checkpointer | Import |
|---|---|---|
| Dev, tests, notebooks | `MemorySaver` | `langgraph.checkpoint.memory` |
| Single-host CLI / desktop | `SqliteSaver` | `langgraph.checkpoint.sqlite` |
| Staging, prod (sync) | `PostgresSaver` | `langgraph.checkpoint.postgres` |
| Staging, prod (async / FastAPI) | `AsyncPostgresSaver` | `langgraph.checkpoint.postgres.aio` |

`MemorySaver` is in-process only: state vanishes on restart, and every worker
process holds its own separate copy (the LangGraph analog of P22). Use it
wherever state loss is acceptable; never in a multi-worker web backend.

`PostgresSaver` and its async sibling require `setup()` on every startup *and
after every `langgraph` upgrade* (see Step 5). Checkpoint storage overhead is
typically **1-10 KB per step** of serialized state; plan your DB size
accordingly — a 2,000-turn conversation with 3 KB average state fits in
~6 MB per thread.
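
The sizing arithmetic above can be sketched as a small estimator. This is a rough planning aid only: it assumes one checkpoint row per step, treats a "turn" as a single step (real graphs often take several steps per turn), and the function name and the 10,000-thread fleet size are illustrative, not part of LangGraph.

```python
def estimate_thread_storage_bytes(steps: int, avg_state_kb: float) -> int:
    """Rough estimate: one serialized checkpoint of avg_state_kb per step."""
    if steps < 0 or avg_state_kb < 0:
        raise ValueError("steps and avg_state_kb must be non-negative")
    # Decimal kilobytes (1 KB = 1000 bytes), matching the ~6 MB figure above.
    return int(steps * avg_state_kb * 1000)


# The example from the text: 2,000 turns at 3 KB average state.
per_thread = estimate_thread_storage_bytes(steps=2_000, avg_state_kb=3)
print(f"{per_thread / 1_000_000:.1f} MB per thread")

# Hypothetical fleet size, purely illustrative.
fleet = per_thread * 10_000
print(f"{fleet / 1_000_000_000:.1f} GB for 10,000 such threads")
```

Measure real checkpoint sizes in staging before relying on the estimate; indexes and write metadata add overhead this sketch ignores.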

See [checkpointer-comparison.md](references/checkpointer-comparison.md) for the
full matrix including latency, concurrency, and the FastAPI lifespan pattern.

### Step 2 — Require `thread_id` at every invocation

This is the fail-loud middleware that prevents P16:

```python
from typing import Any

def require_thread_id(config: dict[str, Any]) -> dict[str, Any]:
    """Raise if thread_id is missing. Fails loud so P16 surfaces in tests,
    not in user-visible conversation logs."""
    configurable = (config or {}).get("configurable", {})
    thread_id = configurable.get("thread_id")
    if not thread_id:
        raise ValueError(
            "thread_id missing from config['configurable']. "
            "Every graph invocation must carry a thread_id."
        )
    if not isinstance(thread_id, str):
        raise TypeError(
            f"thread_id must be str (UUID), got {type(thread_id).__name__}"
        )
    return config
```

Call it at every application boundary:

```python
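import uuid

# Mint the id once when a conversation starts, store it (session, DB row),
# and reuse it on every later turn. A fresh UUID per call reproduces P16.
config = {"configurable": {"thread_id": str(uuid.uuid4())}}
require_thread_id(config)
result = graph.invoke(initial_state, config=config)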
```

For web endpoints, extract the check into a FastAPI dependency: a required
`Header(...)` with no default makes FastAPI return `422` when the header is
missing. For multi-tenant apps, scope the thread id by composing tenant, user,
and conversation ids into a single colon-delimited string (example:
`"acme:alice:conv-1"`). See
[thread-id-discipline.md](references/thread-id-discipline.md) for UUID
generation, rotation, and the integration test that proves tenants do not share
state.

### Step 3 — Keep state JSON-serializable (TypedDict, primitives only)

```python
from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages
from langchain_core.messages import AnyMessage


class AgentState(TypedDict):
    # Messages are safe — LangGraph registers a custom serializer.
    messages: Annotated[list[AnyMessage], add_messages]

    # Primitives only below. NO datetime, NO Decimal, NO custom classes.
    user_id: str
    turn_count: int
    last_action_at: str           # ISO string, not datetime.datetime
    pending_approval: bool
    metadata: dict[str, str]      # dict keys must be str
    plan: list[dict[str, str]]    # list of primitive dicts
```

The rule: **state fields must be JSON-safe primitives or recursive structures
of them** (`str`, `int`, `float`, `bool`, `None`, `list`, `dict[str, ...]`).
`json.dumps(state)` must succeed. If it raises, the checkpointer raises —
often at a HITL interrupt many steps later (P17), which is why the traceback
never points at the line that introduced the bad value.
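
One way to make the failure surface at the node that introduced the bad value, rather than at a later interrupt, is to validate each node's update before returning it. The sketch below is an assumption-laden illustration: `find_non_json_paths` and `assert_json_safe` are hypothetical names, and `messages` is skipped because, per the comment in `AgentState`, it is handled by its own serializer.

```python
from datetime import datetime
from typing import Any


def find_non_json_paths(value: Any, path: str = "state") -> list[str]:
    """Return paths of values outside the JSON-safe set named in the rule."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return []
    if isinstance(value, dict):
        bad: list[str] = []
        for key, item in value.items():
            if not isinstance(key, str):
                bad.append(f"{path}[{key!r}] (non-str key)")
            bad.extend(find_non_json_paths(item, f"{path}[{key!r}]"))
        return bad
    if isinstance(value, list):
        bad = []
        for index, item in enumerate(value):
            bad.extend(find_non_json_paths(item, f"{path}[{index}]"))
        return bad
    # Anything else (datetime, Decimal, custom classes, ...) is flagged.
    return [f"{path} ({type(value).__name__})"]


def assert_json_safe(
    update: dict[str, Any], skip: tuple[str, ...] = ("messages",)
) -> None:
    """Raise TypeError naming every offending path in a node's state update."""
    checked = {k: v for k, v in update.items() if k not in skip}
    bad = find_non_json_paths(checked)
    if bad:
        raise TypeError("non-JSON-safe state values: " + ", ".join(bad))


assert_json_safe({"user_id": "u1", "turn_count": 3, "metadata": {"k": "v"}})
try:
    assert_json_safe({"last_action_at": datetime(2024, 1, 1)})
except TypeError as exc:
    print(exc)  # names state['last_action_at'] as the offending path
```

Calling `assert_json_safe` on the dict a node is about to return, at least in tests, moves the error to the line that created the value.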

For non-primitive inputs, coerce at node output boundaries with a helper:

```python
from datetime import datetime, timezone
from decimal import Decimal

def record_purchase(state: AgentState) -> dict:
    now = datetime.now(timezone.utc)  # utcnow() is deprecated since Python 3.12
    price = Decimal("19.99")
    return {
        "last_action_at": now.isoformat(),
        "metadata": {**state["metadata"], "price": str(price)},
    }
```

Forbidden-types reference and the full `to_state` / `from_state` helper pair
are in [json-serializability-rules.md](references/json-serializability-rules.md).
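
For nested values, the per-field coercion shown above can be generalized into a recursive converter. This is a sketch under stated assumptions, not the `to_state` helper from the reference file: the name `to_json_safe` and the choice of string encodings (ISO 8601 for dates, `str()` for `Decimal`) are illustrative.

```python
import json
from datetime import date, datetime, timezone
from decimal import Decimal
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively coerce common non-JSON types into JSON-safe primitives."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()          # ISO 8601 string
    if isinstance(value, Decimal):
        return str(value)                 # keep exact digits, avoid float rounding
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    return value                          # primitives pass through unchanged


raw = {
    "price": Decimal("19.99"),
    "at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "items": (1, 2),
}
safe = to_json_safe(raw)
json.dumps(safe)  # succeeds once every value has been coerced
```

Unknown custom classes pass through untouched here, so pairing the converter with a strict check before the node returns is still worthwhile.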

### Step 4 — Compile the graph with a checkpointer (Postgres, sync)

```python
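import os

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

DB_URI = os.environ["DATABASE_URL"]

def build_graph() -> StateGraph:
    # agent_node and human_approval_node are your node functions, defined elsewhere.
    builder = StateGraph(AgentState)
    builder.add_node("agent", agent_node)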
    builder.add_node("human_approval", human_approval_node)
    builder.set_entry_point("agent")
    builder.add_edge("agent", "human_approval")
    builder.set_finish_point("human_approval")
    return builder

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()     # Idempotent. Creates checkpoint tables if missing.
    graph = build_graph().compile(
        checkpointer=checkpointer,
        interrupt_before=["human_approval"],
    )

    config = {"configurable": {"thread_id": "user-123"}}
    require_thread_id(config)
    result = graph.invoke({"messages": [HumanMessage("hi")]}, config=config)
```

For async, mirror the pattern with `AsyncPostgresSaver.from_conn_string(...)`
inside a FastAPI `@asynccontextmanager` lifespan; every call site uses
`await graph.ainvoke(...)`.

### Step 5 — Run `setup()` on startup AND after every `langgraph` upgrade

P20 is the quiet one: you `pip install --upgrade langgraph`, tests pass, CI
goes green, you deploy. Existing threads come back empty. No DB error.

```python
# Put this in your deploy script / migration runbook:
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()
    # Sanity check: read one known thread and assert its checkpoint still exists.
    snap = checkpointer.get({"configurable": {"thread_id": "canary-thread"}})
    assert snap is not None, "Canary thread lost after schema migration"
```
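
Because P20 describes old checkpoints coming back as *empty* state rather than as missing rows, a `None` check alone may not catch it. A stricter variant of the canary check could also inspect the stored values. The sketch below assumes the object returned by `checkpointer.get(...)` is dict-like with a `channel_values` mapping; the helper name and the default `required_keys` are hypothetical.

```python
from typing import Any, Mapping, Optional


def assert_canary_intact(
    checkpoint: Optional[Mapping[str, Any]],
    required_keys: tuple[str, ...] = ("messages",),
) -> None:
    """Fail if the canary checkpoint is missing or its state came back empty."""
    if checkpoint is None:
        raise AssertionError("canary thread has no checkpoint")
    # Assumption: state lives under 'channel_values' in the checkpoint dict.
    values = checkpoint.get("channel_values") or {}
    missing = [key for key in required_keys if not values.get(key)]
    if missing:
        raise AssertionError(f"canary thread lost state for: {missing}")


# Usage with a stand-in checkpoint; in the deploy script, pass the result of
# checkpointer.get({"configurable": {"thread_id": "canary-thread"}}) instead.
assert_canary_intact({"channel_values": {"messages": ["hi"]}})
```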

Run this in **staging** first, with a canary thread whose state you pre-populated
from an older `langgraph` version. If the assertion holds, promote. If not,
the migration path is: dump checkpoi
