Claude
Skills
Sign in
Back

langchain-cost-tuning

Included with Lifetime
$97 forever

Control LangChain 1.0 AI spend with accurate streaming token accounting, model tiering, provider-specific cache hit tuning, per-tenant budgets, and retry dedup. Use when AI spend grows faster than traffic, a cost regression lands, or you need per-tenant budget enforcement. Trigger with "langchain cost", "langchain token accounting", "langchain per-tenant budget", "langchain model tiering", "prompt cache savings".

AI Agentssaaslangchainlanggraphpythonlangchain-1.0costtokensbudget

What this skill does

# LangChain Cost Tuning (Python)

## Overview

An engineer shipped a new research agent Tuesday. By Friday the Anthropic
bill had grown 6x while traffic grew 1.4x. The cost dashboard — wired to
`on_llm_end` — showed spend up maybe 2x. Reconciling against the provider
console on Monday surfaced two compounding bugs: (1) the agent's `ChatOpenAI`
fallback kept the default `max_retries=6`, so each logical call billed as up
to **7 requests** (P30); (2) retry middleware was registered *below* token
accounting, so every retry fired `on_llm_end` twice — the aggregator summed
both emissions while LangSmith deduped them by generation ID, undercounting
the dashboard by ~50% against actual billed rate (P25).

The fix took an afternoon: cap retries at 2, tag retries with a stable
`request_id`, and migrate token accounting to `AIMessage.usage_metadata` read
from `astream_events(version="v2")`. Finding the bug took a week. This skill
is that week compressed into a runbook.

Cost tuning for a LangChain 1.0 production app has five levers, each with a
sharp failure mode:

- **Token accounting** — `on_llm_end` lags streams by 5-30s (P01); retries double-count (P25); Anthropic cache savings aggregate per-call, never per-session (P04).
- **Retry discipline** — `max_retries=6` default on `ChatOpenAI` (P30); Anthropic 50 RPM tier throttles cached and uncached calls against the same budget (P31).
- **Agent loop caps** — `create_react_agent` defaults to `recursion_limit=25`; vague prompts burn a session's budget before `GraphRecursionError` surfaces (P10).
- **Caching** — `InMemoryCache` ignores bound tools in the cache key and returns wrong answers (P61); `RedisSemanticCache` ships with a 0.95 threshold that hits <5% of the time (P62).
- **Model tiering** — Running `claude-opus-4-5` on intent classification is 30-60x more expensive than `claude-haiku-4-5` for a task the cheaper model solves at equal quality.

Pin: `langchain-core 1.0.x`, `langchain-anthropic 1.0.x`, `langchain-openai 1.0.x`.
Pain-catalog anchors: P01, P04, P10, P23, P25, P30, P31, P61, P62.

## Prerequisites

- Python 3.10+
- `langchain-core >= 1.0, < 2.0`
- At least one provider package: `pip install langchain-anthropic langchain-openai`
- `redis-py >= 5.0` for budget middleware (optional; in-process dict works for dev)
- Provider console access (Anthropic, OpenAI) to reconcile `usage_metadata`
  against billed spend — you will need this to verify any instrumentation fix

## Instructions

### Step 1 — Read `usage_metadata`, never `response_metadata["token_usage"]`

LangChain 1.0 standardizes all provider usage into `AIMessage.usage_metadata`.
`response_metadata["token_usage"]` still exists as a compatibility shim but its
shape is provider-specific (Anthropic nests under `usage`, OpenAI flat, Gemini
uses different keys). Code that reads it directly will break when you switch
providers or when a provider SDK upgrades.

```python
from langchain_core.messages import AIMessage

def read_usage(msg: AIMessage) -> dict:
    """Canonical shape: input_tokens, output_tokens, input_token_details,
    output_token_details. Safe across Anthropic, OpenAI, Gemini."""
    meta = msg.usage_metadata or {}
    details_in = meta.get("input_token_details", {}) or {}
    details_out = meta.get("output_token_details", {}) or {}
    return {
        "input": meta.get("input_tokens", 0),
        "output": meta.get("output_tokens", 0),
        "cache_read": details_in.get("cache_read", 0),       # Anthropic
        "cache_creation": details_in.get("cache_creation", 0),
        "reasoning": details_out.get("reasoning", 0),        # OpenAI o1/o3
    }
```

Include `reasoning` in your output-billable total for o1/o3. A call with
`output_tokens=500` and `reasoning=2000` actually bills 2500 output tokens.

### Step 2 — Stream-accurate aggregation via `astream_events(version="v2")`

`on_llm_end` fires once after the stream closes, so dashboards lag by stream
duration (P01). Anthropic populates `usage_metadata` on the `message_start` and
`message_delta` events; OpenAI populates only the final chunk. Both show up as
`on_chat_model_stream` events in `astream_events`.

```python
async def metered_invoke(chain, inputs, meter):
    async for event in chain.astream_events(inputs, version="v2"):
        if event["event"] == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            if getattr(chunk, "usage_metadata", None):
                meter.record(
                    run_id=event["run_id"],
                    usage=chunk.usage_metadata,
                )
```

See [Token Accounting Pitfalls](references/token-accounting-pitfalls.md) for
the full streaming-delta behavior across providers and reconciliation against
provider dashboards.

### Step 3 — Dedup retries on `run_id`, not prompt hash

Retry middleware runs the model twice on transient errors. Both emit usage
events. If the aggregator keys on prompt hash, it looks like one call cost
twice as much. If it keys on `run_id` (LangChain assigns one per generation
attempt), you can attach a stable `request_id` at the chain level and dedupe
on that (P25).

```python
from uuid import uuid4

class RetryAwareMeter:
    def __init__(self):
        self._seen: set[str] = set()
        self.totals = {"input": 0, "output": 0, "cache_read": 0}

    def record(self, run_id: str, usage: dict, request_id: str | None = None):
        # Keep only the last emission per logical request.
        # On retry: same request_id, different run_id -> overwrite.
        key = request_id or run_id
        if key in self._seen:
            # Retry emission — subtract prior, add new (last wins).
            prior = self._prior_by_key.get(key, {})
            for k in self.totals:
                self.totals[k] -= prior.get(k, 0)
        self._seen.add(key)
        self._prior_by_key[key] = usage
        self.totals["input"]  += usage.get("input_tokens", 0)
        self.totals["output"] += usage.get("output_tokens", 0)
        details = usage.get("input_token_details", {}) or {}
        self.totals["cache_read"] += details.get("cache_read", 0)
```

Inject `request_id` via `config={"metadata": {"request_id": str(uuid4())}}` on
each invoke. The meter reads `event["metadata"]["request_id"]` alongside
`run_id`.

Alternative: place token accounting **above** retry middleware in the chain —
retries happen inside, so only the successful attempt emits. This is simpler
but makes retries invisible to observability, which you usually want to see.

### Step 4 — Model tiering: draft cheap, finalize expensive

Most chains have a structural split: a cheap "understand the request" call and
an expensive "produce the final artifact" call. Running the expensive model on
both roughly triples cost for no quality gain.

**Per-1M pricing snapshot, 2026-04** (verify current prices before shipping at
https://www.anthropic.com/pricing and https://openai.com/api/pricing/):

| Model | Input $/1M | Output $/1M | Cache read $/1M | Role |
|---|---|---|---|---|
| `claude-haiku-4-5` | $1.00 | $5.00 | $0.10 | Draft, classify, route |
| `claude-sonnet-4-6` | $3.00 | $15.00 | $0.30 | Finalize, reason, extract |
| `claude-opus-4-5` | $15.00 | $75.00 | $1.50 | High-stakes, long-horizon |
| `gpt-4o-mini` | $0.15 | $0.60 | n/a (prefix cache only) | Draft, classify |
| `gpt-4o` | $2.50 | $10.00 | n/a | Finalize |
| `gpt-o3-mini` | $1.10 | $4.40 | n/a | Reasoning, planning |

Anthropic cache reads cost **10% of input**. Cache creation costs **125% of
input**. Break-even is ~4 uses of a cached prefix. See
[Cache Economics](references/cache-economics.md).

**Decision tree:**

```
input
  └── intent classification / routing
      └── gpt-4o-mini OR claude-haiku-4-5        (~$0.15-$1 per 1M in)
  └── generation / reasoning
      ├── single-pass, low-stakes
      │   └── gpt-4o-mini                         (draft)
      ├── single-pass, high-stakes (extraction, contracts)
      │   └── claude-sonnet-4-6                   (finalize)
      ├── mult

Related in AI Agents