langchain-cost-tuning
Control LangChain 1.0 AI spend with accurate streaming token accounting, model tiering, provider-specific cache hit tuning, per-tenant budgets, and retry dedup. Use when AI spend grows faster than traffic, a cost regression lands, or you need per-tenant budget enforcement. Trigger with "langchain cost", "langchain token accounting", "langchain per-tenant budget", "langchain model tiering", "prompt cache savings".
# LangChain Cost Tuning (Python)
## Overview
An engineer shipped a new research agent Tuesday. By Friday the Anthropic
bill had grown 6x while traffic grew 1.4x. The cost dashboard — wired to
`on_llm_end` — showed spend up maybe 2x. Reconciling against the provider
console on Monday surfaced two compounding bugs: (1) the agent's `ChatOpenAI`
fallback kept the default `max_retries=6`, so each logical call billed as up
to **7 requests** (P30); (2) retry middleware was registered *above* token
accounting, so every retried call fired `on_llm_end` twice. The aggregator
summed both emissions while LangSmith deduped them by generation ID, and the
dashboard ended up ~50% under the actual billed rate (P25).
The fix took an afternoon: cap retries at 2, tag retries with a stable
`request_id`, and migrate token accounting to `AIMessage.usage_metadata` read
from `astream_events(version="v2")`. Finding the bug took a week. This skill
is that week compressed into a runbook.
Cost tuning for a LangChain 1.0 production app has five levers, each with a
sharp failure mode:
- **Token accounting**: `on_llm_end` lags streams by 5-30s (P01); retries double-count (P25); Anthropic cache savings aggregate per-call, never per-session (P04).
- **Retry discipline**: `max_retries=6` default on `ChatOpenAI` (P30); Anthropic 50 RPM tier throttles cached and uncached calls against the same budget (P31).
- **Agent loop caps**: `create_react_agent` defaults to `recursion_limit=25`; vague prompts burn a session's budget before `GraphRecursionError` surfaces (P10).
- **Caching**: `InMemoryCache` ignores bound tools in the cache key and returns wrong answers (P61); `RedisSemanticCache` ships with a 0.95 threshold that hits <5% of the time (P62).
- **Model tiering**: running `claude-opus-4-5` on intent classification costs about 15x more per token than `claude-haiku-4-5` (at the Step 4 snapshot prices) for a task the cheaper model solves at equal quality.
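
The retry lever reduces to simple arithmetic. A minimal sketch, assuming (as the incident above describes) that every attempt is billed; `worst_case_requests` and `worst_case_cost` are hypothetical helper names, not LangChain APIs:

```python
def worst_case_requests(max_retries: int) -> int:
    """One initial attempt plus every allowed retry."""
    return 1 + max_retries


def worst_case_cost(per_attempt_usd: float, max_retries: int) -> float:
    """Upper bound on spend for one logical call if every attempt is billed."""
    return per_attempt_usd * worst_case_requests(max_retries)


# max_retries=6 allows up to 7 billed requests; capping at 2 allows up to 3.
print(worst_case_requests(6), worst_case_requests(2))
```

This is an upper bound, not an expectation: most calls succeed on the first attempt, so the multiplier only bites during provider incidents or rate-limit storms.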
Pin: `langchain-core 1.0.x`, `langchain-anthropic 1.0.x`, `langchain-openai 1.0.x`.
Pain-catalog anchors: P01, P04, P10, P23, P25, P30, P31, P61, P62.
## Prerequisites
- Python 3.10+
- `langchain-core >= 1.0, < 2.0`
- At least one provider package: `pip install langchain-anthropic langchain-openai`
- `redis-py >= 5.0` for budget middleware (optional; in-process dict works for dev)
- Provider console access (Anthropic, OpenAI) to reconcile `usage_metadata`
against billed spend — you will need this to verify any instrumentation fix
## Instructions
### Step 1 — Read `usage_metadata`, never `response_metadata["token_usage"]`
LangChain 1.0 standardizes all provider usage into `AIMessage.usage_metadata`.
`response_metadata["token_usage"]` still exists as a compatibility shim but its
shape is provider-specific (Anthropic nests under `usage`, OpenAI flat, Gemini
uses different keys). Code that reads it directly will break when you switch
providers or when a provider SDK upgrades.
```python
from langchain_core.messages import AIMessage


def read_usage(msg: AIMessage) -> dict:
    """Canonical shape: input_tokens, output_tokens, input_token_details,
    output_token_details. Safe across Anthropic, OpenAI, Gemini."""
    meta = msg.usage_metadata or {}
    details_in = meta.get("input_token_details", {}) or {}
    details_out = meta.get("output_token_details", {}) or {}
    return {
        "input": meta.get("input_tokens", 0),
        "output": meta.get("output_tokens", 0),
        "cache_read": details_in.get("cache_read", 0),  # Anthropic
        "cache_creation": details_in.get("cache_creation", 0),
        "reasoning": details_out.get("reasoning", 0),  # OpenAI o1/o3
    }
```
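For o1/o3, `reasoning` is a breakdown of `output_tokens`, not an extra charge
on top of it. A call reporting `output_tokens=2500` with `reasoning=2000` bills
2500 output tokens, of which only 500 are visible in the response; adding
`reasoning` to `output_tokens` double-counts.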
### Step 2 — Stream-accurate aggregation via `astream_events(version="v2")`
`on_llm_end` fires once after the stream closes, so dashboards lag by stream
duration (P01). Anthropic populates `usage_metadata` on the `message_start` and
`message_delta` events; OpenAI populates only the final chunk. Both show up as
`on_chat_model_stream` events in `astream_events`.
```python
async def metered_invoke(chain, inputs, meter):
    async for event in chain.astream_events(inputs, version="v2"):
        if event["event"] == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            # Only chunks that carry usage are worth recording.
            if getattr(chunk, "usage_metadata", None):
                meter.record(
                    run_id=event["run_id"],
                    usage=chunk.usage_metadata,
                )
```
See [Token Accounting Pitfalls](references/token-accounting-pitfalls.md) for
the full streaming-delta behavior across providers and reconciliation against
provider dashboards.
### Step 3 — Dedup retries on `run_id`, not prompt hash
Retry middleware runs the model again on transient errors, and every attempt
emits its own usage event. An aggregator keyed on prompt hash merges the
attempts, so one call looks like it cost twice as much. `run_id` separates
them (LangChain assigns one per generation attempt) but cannot tell you that
two attempts belong to the same logical call. Attach a stable `request_id` at
the chain level and dedupe on that, falling back to `run_id` when it is absent
(P25).
```python
from uuid import uuid4  # callers mint one request_id per logical call


class RetryAwareMeter:
    def __init__(self):
        # Last normalized usage recorded per logical request.
        self._prior_by_key: dict[str, dict[str, int]] = {}
        self.totals = {"input": 0, "output": 0, "cache_read": 0}

    def record(self, run_id: str, usage: dict, request_id: str | None = None):
        # Keep only the last emission per logical request.
        # On retry: same request_id, different run_id -> overwrite.
        key = request_id or run_id
        details = usage.get("input_token_details", {}) or {}
        current = {
            "input": usage.get("input_tokens", 0),
            "output": usage.get("output_tokens", 0),
            "cache_read": details.get("cache_read", 0),
        }
        # Retry emission: subtract the prior contribution, add the new one
        # (last wins). For a first emission the prior is empty.
        prior = self._prior_by_key.get(key, {})
        for k in self.totals:
            self.totals[k] += current[k] - prior.get(k, 0)
        self._prior_by_key[key] = current
```
Inject `request_id` via `config={"metadata": {"request_id": str(uuid4())}}` on
each invoke. The meter reads `event["metadata"]["request_id"]` alongside
`run_id`.
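
A minimal sketch of that wiring, assuming a meter whose `record` accepts `request_id` as `RetryAwareMeter` does. The function name `metered_invoke_with_request_id` is hypothetical; `chain` is anything exposing `astream_events`:

```python
import asyncio
from uuid import uuid4


async def metered_invoke_with_request_id(chain, inputs, meter, request_id=None):
    """Stream `chain`, tagging every usage emission with one stable request_id."""
    request_id = request_id or str(uuid4())
    config = {"metadata": {"request_id": request_id}}
    async for event in chain.astream_events(inputs, config=config, version="v2"):
        if event["event"] != "on_chat_model_stream":
            continue
        chunk = event["data"]["chunk"]
        usage = getattr(chunk, "usage_metadata", None)
        if usage:
            meter.record(
                run_id=event["run_id"],
                usage=usage,
                # Fall back to the local id if the event carries no metadata.
                request_id=event.get("metadata", {}).get("request_id", request_id),
            )
    return request_id


# Usage (chain and meter come from your application):
# asyncio.run(metered_invoke_with_request_id(chain, {"question": "..."}, meter))
```

Minting the ID outside the retry boundary is the point: every attempt made under one invoke carries the same `request_id`, so the meter can collapse them.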
Alternative: place token accounting **above** retry middleware in the chain —
retries happen inside, so only the successful attempt emits. This is simpler
but makes retries invisible to observability, which you usually want to see.
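
The ordering effect can be seen with plain-Python stand-ins. This is an illustrative sketch only: `make_model_call`, `with_accounting`, and `with_retry` are hypothetical wrappers, not LangChain middleware, and the "model" is assumed to return unusable output on its first attempt.

```python
def make_model_call():
    """Hypothetical model stand-in: the first attempt returns unusable output."""
    attempts = {"n": 0}

    def call(prompt: str) -> dict:
        attempts["n"] += 1
        return {
            "ok": attempts["n"] > 1,  # only the second attempt succeeds
            "usage": {"input_tokens": 100, "output_tokens": 20},
        }

    return call


def with_accounting(fn, emissions: list):
    """Record one usage emission for every call that passes through."""
    def wrapped(prompt: str) -> dict:
        result = fn(prompt)
        emissions.append(result["usage"])
        return result
    return wrapped


def with_retry(fn, max_retries: int = 2):
    """Re-run fn until it reports ok or the retry budget is spent."""
    def wrapped(prompt: str) -> dict:
        result = fn(prompt)
        for _ in range(max_retries):
            if result["ok"]:
                break
            result = fn(prompt)
        return result
    return wrapped


# Accounting above retry: retries happen inside, so one emission is recorded.
outer_emissions: list = []
with_accounting(with_retry(make_model_call()), outer_emissions)("hi")

# Accounting below retry: every attempt passes through and emits.
inner_emissions: list = []
with_retry(with_accounting(make_model_call(), inner_emissions))("hi")

print(len(outer_emissions), len(inner_emissions))  # 1 2
```

The outer placement yields a clean total but hides the failed attempt; the inner placement shows both attempts and therefore needs the `request_id` dedup described above.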
### Step 4 — Model tiering: draft cheap, finalize expensive
Most chains have a structural split: a cheap "understand the request" call and
an expensive "produce the final artifact" call. Running the expensive model on
both roughly triples cost for no quality gain.
**Per-1M pricing snapshot, 2026-04** (verify current prices before shipping at
https://www.anthropic.com/pricing and https://openai.com/api/pricing/):
| Model | Input $/1M | Output $/1M | Cache read $/1M | Role |
|---|---|---|---|---|
| `claude-haiku-4-5` | $1.00 | $5.00 | $0.10 | Draft, classify, route |
| `claude-sonnet-4-6` | $3.00 | $15.00 | $0.30 | Finalize, reason, extract |
| `claude-opus-4-5` | $15.00 | $75.00 | $1.50 | High-stakes, long-horizon |
| `gpt-4o-mini` | $0.15 | $0.60 | n/a (prefix cache only) | Draft, classify |
| `gpt-4o` | $2.50 | $10.00 | n/a | Finalize |
| `o3-mini` | $1.10 | $4.40 | n/a | Reasoning, planning |
Anthropic cache reads cost **10% of input**. Cache creation costs **125% of
input**. On those multipliers alone, caching pays for itself on the second use
of a prefix (1.25 + 0.10 = 1.35x of base input versus 2.00x uncached). See
[Cache Economics](references/cache-economics.md).
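
The cache arithmetic can be checked directly. A minimal sketch that counts only the two multipliers stated above and the snapshot input prices from the table; it ignores TTL expiry, partial prefix hits, and output tokens, so treat the result as a lower bound on the practical break-even. `prefix_cost` and `break_even_uses` are hypothetical helper names:

```python
# Snapshot input prices copied from the table above ($ per 1M tokens).
# Verify against the provider pricing pages before relying on them.
INPUT_PRICE_PER_1M = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
}
CACHE_WRITE_MULTIPLIER = 1.25  # cache creation: 125% of input
CACHE_READ_MULTIPLIER = 0.10   # cache read: 10% of input


def prefix_cost(model: str, prefix_tokens: int, uses: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix `uses` times."""
    if uses <= 0:
        return 0.0
    base = INPUT_PRICE_PER_1M[model] * prefix_tokens / 1_000_000
    if not cached:
        return base * uses
    # First use writes the cache; every later use reads it.
    return base * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (uses - 1))


def break_even_uses(model: str, prefix_tokens: int, max_uses: int = 50):
    """Smallest number of uses at which caching is strictly cheaper, or None."""
    for uses in range(1, max_uses + 1):
        cached = prefix_cost(model, prefix_tokens, uses, cached=True)
        uncached = prefix_cost(model, prefix_tokens, uses, cached=False)
        if cached < uncached:
            return uses
    return None


print(break_even_uses("claude-sonnet-4-6", 10_000))  # 2
```

A single use is more expensive with caching (1.25x versus 1.00x), which is why one-shot prompts should not set a cache breakpoint.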
**Decision tree:**
```
input
  └── intent classification / routing
        └── gpt-4o-mini OR claude-haiku-4-5 (~$0.15-$1 per 1M in)
  └── generation / reasoning
        ├── single-pass, low-stakes
        │     └── gpt-4o-mini (draft)
        ├── single-pass, high-stakes (extraction, contracts)
        │     └── claude-sonnet-4-6 (finalize)
        ├── mult
```