
langchain-middleware-patterns


Build composable middleware for LangChain 1.0 chains and LangGraph 1.0 agents — PII redaction, caching, retry, token budgets, guardrails — with ORDERING rules that avoid cache-key leakage and double-counting. Use when adding cross-cutting behavior, hardening against prompt injection, enforcing per-tenant budgets, or debugging cache-poisoning incidents. Trigger with "langchain middleware", "langgraph middleware", "PII redaction middleware", "cache middleware order", "langchain guardrails".

Tags: AI Agents, saas, langchain, langgraph, python, langchain-1.0, middleware, security, caching


# LangChain Middleware Patterns (Python)

## Overview

Tenant A sends a prompt: *"Summarize this support ticket from **[email protected]**
about her overdue invoice."* The chain's caching middleware ran before the PII
redaction middleware, so the raw prompt — email and all — became part of the
cache key. Thirty seconds later Tenant B sends a semantically identical prompt
(different tenant, different customer, same shape). Cache hits. Tenant B's user
gets back a summary that names `[email protected]` and her overdue invoice. That is
pain-catalog entry **P24** in production, and it is a real class of incident —
post-mortems read like "we added caching to cut cost, leaked a customer's PII to
a different tenant within an hour."

The sibling failure modes:

- **P25** — Retry middleware runs the model call twice on a 429; both attempts
  fire `on_llm_end` and the token-usage aggregator sums both, so a single logical
  call bills as two and the tenant's per-session budget trips at 50% of true usage.
- **P10** — Agent loops exceed 15 iterations on vague prompts. There is no
  default cost cap. A per-session token-budget middleware solves this; without
  one, a single "help me with my account" prompt can burn thousands of tokens.
- **P34** — `Runnable.invoke` does not sanitize prompt injection. A RAG document
  containing `"Ignore previous instructions and..."` is followed verbatim.
  Guardrails middleware is your injection defense; without it, indirect prompt
  injection is a one-line exploit.
- **P61** — `set_llm_cache(InMemoryCache())` hashes the prompt string only.
  Two chains with different tool bindings return the same cached response;
  tools are silently ignored by the cache key.

This skill defines the canonical middleware order for LangChain 1.0 chains and
LangGraph 1.0 agents, with an ordering-invariants matrix (every adjacent pair
has a named failure mode if you swap them), six reference implementations, a
cache-key hash that includes prompt **plus bound-tools plus tenant_id**, retry
telemetry that deduplicates by `request_id`, and an integration test pattern
that asserts the ordering invariant on every build.
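
The double counting in P25 and the `request_id` deduplication mentioned above can be illustrated with a small stdlib-only sketch. `UsageAggregator` and its methods are hypothetical names, not part of LangChain. The sketch assumes every retry attempt of one logical call carries the same `request_id`, and that the most recent attempt's usage is the one to bill.

```python
from collections import defaultdict
from threading import Lock


class UsageAggregator:
    """Hypothetical aggregator: one billable record per logical request."""

    def __init__(self) -> None:
        self._lock = Lock()
        # session_id -> {request_id -> tokens}. Keying by request_id means a
        # retried attempt overwrites the earlier record instead of adding to it.
        self._by_session: dict[str, dict[str, int]] = defaultdict(dict)

    def record(self, session_id: str, request_id: str, tokens: int) -> None:
        with self._lock:
            self._by_session[session_id][request_id] = tokens

    def total(self, session_id: str) -> int:
        with self._lock:
            return sum(self._by_session[session_id].values())


agg = UsageAggregator()
agg.record("s1", "req-1", 120)  # first attempt reports usage
agg.record("s1", "req-1", 120)  # retry of the same logical call
agg.record("s1", "req-2", 80)   # a genuinely new call
print(agg.total("s1"))
```

A naive `+=` aggregator would report 320 tokens for this session; keying by `request_id` reports 200.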

Pin: `langchain-core 1.0.x`, `langchain 1.0.x`, `langgraph 1.0.x`. Pain-catalog
anchors: **P10, P24, P25, P34, P61**, with supporting references to P27, P29,
P30, P33.

## Prerequisites

- Python 3.10+
- `langchain-core >= 1.0, < 2.0`
- `langgraph >= 1.0, < 2.0` (for agent middleware)
- At least one provider package: `pip install langchain-anthropic` (or `langchain-openai`)
- Optional: `presidio-analyzer` + `presidio-anonymizer` for PII NER beyond regex
- Optional: `redis` + `langchain-redis` for multi-worker cache and rate limiting

## Instructions

### Step 1 — Adopt the canonical middleware order

Every LangChain 1.0 chain and LangGraph 1.0 agent that goes to production
applies middleware in this order:

```
user → redact → guardrail → budget → cache → retry → model
```

- **redact → cache (P24):** cache key must be PII-free or Tenant A's PII leaks to Tenant B on a hit
- **guardrail → cache:** an injection-laden prompt must never become a cache entry
- **budget → cache:** cache hits count against RPS; check budget first so loops cannot DoS a session on hits alone
- **cache → retry:** cache hits bypass retry; retry wraps only the model call

Production chains typically run **4-6 middleware layers** with **<1ms per
layer** overhead (bench: p50 0.3ms/layer, p99 0.9ms on a 100-request sample).
See [ordering-invariants.md](references/ordering-invariants.md) for the full
pairwise matrix and the benchmark script.
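
One way to keep the order from drifting is to check it when the pipeline is assembled. The following is a minimal sketch under stated assumptions: `CANONICAL_ORDER`, `assert_canonical_order`, and `compose` are hypothetical helpers, not LangChain APIs, and `compose` only chains the input-transforming layers (cache and retry wrap the model call itself and would need a wrapper-style interface).

```python
from typing import Any, Callable

Middleware = Callable[[dict[str, Any]], dict[str, Any]]

CANONICAL_ORDER = ["redact", "guardrail", "budget", "cache", "retry"]


def assert_canonical_order(names: list[str]) -> None:
    """Raise if the named layers are not in canonical relative order."""
    unknown = [n for n in names if n not in CANONICAL_ORDER]
    if unknown:
        raise ValueError(f"Unknown middleware: {unknown}")
    positions = [CANONICAL_ORDER.index(n) for n in names]
    if positions != sorted(positions):
        raise ValueError(f"Middleware out of order: {names}")


def compose(layers: list[tuple[str, Middleware]]) -> Middleware:
    """Validate the ordering once, then apply the layers left to right."""
    assert_canonical_order([name for name, _ in layers])

    def run(inputs: dict[str, Any]) -> dict[str, Any]:
        for _, layer in layers:
            inputs = layer(inputs)
        return inputs

    return run


pipeline = compose([
    ("redact", lambda d: {**d, "input": d["input"].replace("a@b.co", "<EMAIL_0>")}),
    ("guardrail", lambda d: d),
    ("budget", lambda d: d),
])
print(pipeline({"input": "mail a@b.co"}))
```

Swapping two entries, for example listing `budget` before `redact`, makes `compose` raise at construction time, which is the property a build-time ordering test would assert.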

### Step 2 — PII redaction middleware

Mask entities with reversible placeholders so the caller can reinsert the
original values into the output, while the cache key and the model prompt only
ever see redacted text.

```python
import re
from typing import Any

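# Order matters: the broad PHONE pattern also matches SSN- and card-shaped
# digit runs, so the more specific patterns must run before it.
_REDACTORS = [
    ("EMAIL", re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")),
    ("SSN",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CC",    re.compile(r"\b(?:\d[ -]*?){13,16}\b")),
    ("PHONE", re.compile(r"\+?\d[\d\s\-\(\)]{7,}\d")),
]

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Return the redacted text plus a placeholder -> original-value map."""
    pmap: dict[str, str] = {}
    for label, pattern in _REDACTORS:
        tokens: dict[str, str] = {}  # original value -> placeholder for this label

        def _replace(m: re.Match) -> str:
            # Reuse the placeholder when the same value appears more than once,
            # so every token in pmap actually occurs in the redacted text.
            value = m.group(0)
            if value not in tokens:
                tokens[value] = f"<{label}_{len(tokens)}>"
                pmap[tokens[value]] = value
            return tokens[value]

        text = pattern.sub(_replace, text)
    return text, pmap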

def redaction_middleware(inputs: dict[str, Any]) -> dict[str, Any]:
    redacted, pmap = redact(inputs["input"])
    return {**inputs, "input": redacted, "_pii_map": pmap}
```

For names, addresses, and custom entities, Presidio's `AnalyzerEngine` covers
20+ entity types. See [pii-redaction.md](references/pii-redaction.md) for the
regex vs spaCy vs Presidio tradeoff matrix, GDPR/HIPAA/PCI-DSS entity lists,
and the reinsertion pattern (return un-redacted output **only** to the
originating tenant — never cross-populate).
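
The reinsertion rule above can be sketched as follows. `reinsert_for_tenant` and its parameters are hypothetical names chosen for illustration. The sketch assumes the placeholder map is stored alongside the tenant that produced it, and that placeholders have the `<LABEL_n>` shape used in this step.

```python
def reinsert_for_tenant(
    output: str,
    pmap: dict[str, str],
    pmap_tenant: str,
    requester_tenant: str,
) -> str:
    """Restore original values only for the tenant that owns the map."""
    if pmap_tenant != requester_tenant:
        # A different tenant (for example on a shared cache hit) keeps the
        # placeholders; it never sees the originating tenant's values.
        return output
    for token, original in pmap.items():
        output = output.replace(token, original)
    return output


pmap = {"<EMAIL_0>": "jane@example.com"}
summary = "Ticket from <EMAIL_0> about an overdue invoice."

print(reinsert_for_tenant(summary, pmap, "tenant-a", "tenant-a"))
print(reinsert_for_tenant(summary, pmap, "tenant-a", "tenant-b"))
```

The first call restores the address for the owning tenant; the second returns the summary with the placeholder intact.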

### Step 3 — Guardrails middleware

Detect injection patterns up front and wrap user content so the model treats
it as data. Two layers: pattern match (catches the 90% case cheaply) plus
prompt wrapping (neutralizes what slips through).

```python
import re
from typing import Any

INJECTION_PATTERNS = [
    re.compile(r"ignore (all |the )?(previous|prior|above) (instructions|rules)", re.I),
    re.compile(r"system prompt (is|was|now)", re.I),
    re.compile(r"you are now (a |an )?", re.I),
    # Also reject attempts to open or close the <user_input> wrapper applied below.
    re.compile(r"</?(system|instruction|prompt|user_input)>", re.I),
]

class GuardrailViolation(Exception):
    pass

def guardrail_middleware(inputs: dict[str, Any],
                        allowed_tools: set[str] | None = None) -> dict[str, Any]:
    for pattern in INJECTION_PATTERNS:
        if pattern.search(inputs["input"]):
            raise GuardrailViolation(f"Injection pattern matched: {pattern.pattern!r}")
    wrapped = f"<user_input>\n{inputs['input']}\n</user_input>"
    out = {**inputs, "input": wrapped}
    if allowed_tools is not None:
        out["_tool_allowlist"] = allowed_tools
    return out
```

Never rely on the model to "know what is an instruction" without wrapping.
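
The `_tool_allowlist` key set by the guardrail only helps if something downstream enforces it. A minimal sketch of that check is shown here, assuming tool calls arrive as dicts with a `"name"` key; `enforce_tool_allowlist` and `ToolNotAllowed` are hypothetical names, not LangChain APIs.

```python
from typing import Any


class ToolNotAllowed(Exception):
    pass


def enforce_tool_allowlist(
    tool_calls: list[dict[str, Any]],
    state: dict[str, Any],
) -> list[dict[str, Any]]:
    """Reject any model-proposed tool call that is not on the allowlist."""
    allowed = state.get("_tool_allowlist")
    if allowed is None:
        return tool_calls  # no allowlist configured for this request
    for call in tool_calls:
        if call["name"] not in allowed:
            raise ToolNotAllowed(f"Tool {call['name']!r} is not allowed")
    return tool_calls


state = {"input": "<user_input>\nhi\n</user_input>", "_tool_allowlist": {"search_docs"}}
ok = enforce_tool_allowlist([{"name": "search_docs", "args": {"q": "invoice"}}], state)
print(len(ok))
```

Running the check before tool execution means an injected instruction that persuades the model to request an unlisted tool fails closed rather than executing.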

### Step 4 — Token-budget middleware (per-session / per-tenant)

Directly addresses P10 — agents loop 15+ iterations on vague prompts and burn
thousands of tokens. The budget middleware raises before the model call if
the session is over ceiling.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from threading import Lock
from typing import Any

class BudgetExceeded(Exception): pass

@dataclass
class TokenBudget:
    ceiling: int = 50_000           # tokens per session
    _usage: dict[str, int] = field(default_factory=lambda: defaultdict(int))
    _lock: Lock = field(default_factory=Lock)

    def record(self, session_id: str, tokens: int) -> None:
        with self._lock:
            self._usage[session_id] += tokens

    def check(self, session_id: str) -> None:
        with self._lock:
            used = self._usage[session_id]
        if used >= self.ceiling:
            raise BudgetExceeded(f"Session {session_id}: {used}/{self.ceiling}")

budget = TokenBudget(ceiling=50_000)

def budget_middleware(inputs: dict[str, Any]) -> dict[str, Any]:
    budget.check(inputs.get("session_id") or "anonymous")
    return inputs
```

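Pair with a `BaseCallbackHandler.on_llm_end` handler that calls
`budget.record(...)` with the sum of `input_tokens` and `output_tokens` from the
response's `usage_metadata`. For multi-worker deploys, back `TokenBudget` with
Redis: an in-memory dict lives in a single process, so each worker would
otherwise enforce its own separate ceiling (P29).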
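
The check-then-record cycle can be traced end to end with a stdlib-only sketch. Everything here is illustrative: `fake_model_call` stands in for a real model call, the 1,000-token ceiling is chosen so the demo trips quickly, and a module-level dict replaces the `TokenBudget` class to keep the example self-contained.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    pass


CEILING = 1_000  # deliberately small so the demo trips quickly
usage: dict[str, int] = defaultdict(int)
calls_made: dict[str, int] = defaultdict(int)


def fake_model_call(prompt: str) -> dict[str, int]:
    """Stand-in for a model call; the keys mirror input_tokens/output_tokens."""
    return {"input_tokens": 300, "output_tokens": 100}


def run_agent_loop(session_id: str, max_iterations: int = 15) -> None:
    for _ in range(max_iterations):
        # Check first: refuse the call if the session is already at the ceiling.
        if usage[session_id] >= CEILING:
            raise BudgetExceeded(f"Session {session_id}: {usage[session_id]}/{CEILING}")
        meta = fake_model_call("help me with my account")
        # Record after: usage is only known once the call returns.
        usage[session_id] += meta["input_tokens"] + meta["output_tokens"]
        calls_made[session_id] += 1


try:
    run_agent_loop("s1")
except BudgetExceeded as exc:
    print(f"stopped after {calls_made['s1']} calls: {exc}")
```

The loop stops after 3 calls instead of running all 15 iterations. Because the check happens before usage is known, the ceiling is soft: a session can overshoot it by at most one call's worth of tokens (1,200 against a ceiling of 1,000 here).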

### Step 5 — Caching middleware with tool-aware key

P61 is the booby trap: `InMemoryCache()` hashes the prompt string only, so
two chains with different tool lists return the same cached response. Use a
custom key over **prompt + bound tools + tenant id**.
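
The collision is easy to reproduce in isolation. The sketch below is illustrative only: `prompt_only_key` models a key derived from the prompt string alone, as described above, and `composite_key` is a hypothetical contrast that uses SHA-256 over a JSON payload rather than any LangChain internals.

```python
import hashlib
import json


def prompt_only_key(prompt: str) -> str:
    """Key built from the prompt string alone (the P61 failure mode)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def composite_key(prompt: str, tool_names: list[str], tenant_id: str) -> str:
    """Key over prompt, tool names, and tenant; JSON keeps field boundaries unambiguous."""
    payload = json.dumps([prompt, sorted(tool_names), tenant_id])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


prompt = "What is the refund status for order 1234?"

# Two chains with the same prompt but different tool bindings share one entry.
collides = prompt_only_key(prompt) == prompt_only_key(prompt)

# Adding tools and tenant to the key separates them.
split_by_tools = (
    composite_key(prompt, ["lookup_order"], "tenant-a")
    != composite_key(prompt, [], "tenant-a")
)
split_by_tenant = (
    composite_key(prompt, [], "tenant-a")
    != composite_key(prompt, [], "tenant-b")
)
print(collides, split_by_tools, split_by_tenant)
```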

```python
import hashlib, json
from typing import Callable

def cache_key(prompt: str, bound_tools: list[dict] | None, tenant_id: str) -> str:
    """Blake2b-16 hash. Tool-aware, tenant-aware, collision-safe via \\x1f separator."""
    h = hashlib.blake2b(digest_size=16)
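    h.update(prompt.encode("utf-8"))
    # The source is truncated at this point. The lines below are a reconstruction
    # that follows the docstring (an assumption, not the original author's code).
    h.update(b"\x1f")
    h.update(json.dumps(bound_tools or [], sort_keys=True).encode("utf-8"))
    h.update(b"\x1f")
    h.update(tenant_id.encode("utf-8"))
    return h.hexdigest()
```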
