langchain-performance-tuning
Tune LangChain 1.0 / LangGraph 1.0 Python chains and agents for throughput, latency, and cost — streaming modes, explicit batch concurrency, semantic plus exact caches, persistent message history, and async-safe retriever patterns. Use when p95 latency exceeds target, batching "does not work", cost grows linearly with traffic, or a process restart wipes chat history. Trigger with "langchain performance", "langchain slow batch", "langchain throughput", "langchain p95 latency", "semantic cache hit rate".
# LangChain Performance Tuning
## Overview
An engineer calls `chain.batch(inputs_1000)` expecting 1000 parallel LLM calls. Actual behavior: `Runnable.batch` and `Runnable.abatch` in LangChain 1.0 default to `max_concurrency=1`, so the 1000 inputs run **sequentially with bookkeeping overhead** — sometimes slower than a plain `for` loop. This is pain-catalog entry P08. The fix is one line:
```python
# Before: serial, ~1000 * per_call_latency
await chain.abatch(inputs)
# After: up to 10 calls in flight at once, roughly 10x throughput until the provider rate limit bites
await chain.abatch(inputs, config={"max_concurrency": 10})
```
Other silent regressions in the same pain catalog: P48 (`invoke` inside `async def` blocks the FastAPI event loop), P22 (`InMemoryChatMessageHistory` loses every user's chat on restart), P62 (`RedisSemanticCache` at the default `score_threshold=0.95` returns under 5% hit rate), P59 (async retrievers leak connections on cancellation), P60 (`BackgroundTasks` fires *after* the response — wrong for per-token SSE), P01 (streaming token counts are only reliable on the `on_chat_model_end` event).
This skill wires a production performance baseline: explicit batch concurrency, async-only code paths, Redis-backed caches tuned on a golden set, persistent chat history with TTL, and TTFT instrumentation from `astream_events(version="v2")`.
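To make P48 concrete, here is a minimal standard-library sketch of why a sync call inside `async def` serializes every request on the event loop. `slow_sync_call` and `slow_async_call` are hypothetical stand-ins for `chain.invoke` and `chain.ainvoke`; no LangChain code runs here, and the delays are arbitrary.

```python
import asyncio
import time


def slow_sync_call(delay: float) -> str:
    # Hypothetical stand-in for a synchronous chain.invoke(...)
    time.sleep(delay)
    return "done"


async def slow_async_call(delay: float) -> str:
    # Hypothetical stand-in for `await chain.ainvoke(...)`
    await asyncio.sleep(delay)
    return "done"


async def blocking_handler(delay: float) -> str:
    # Sync call inside async def: the event loop is frozen for `delay` seconds.
    return slow_sync_call(delay)


async def non_blocking_handler(delay: float) -> str:
    # Awaiting yields control, so other requests make progress meanwhile.
    return await slow_async_call(delay)


async def timed(handler, n: int, delay: float) -> float:
    # Fire n concurrent "requests" and measure wall-clock time for all of them.
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(n)))
    return time.perf_counter() - start


def compare(n: int = 5, delay: float = 0.05) -> tuple[float, float]:
    async def _run() -> tuple[float, float]:
        blocked = await timed(blocking_handler, n, delay)
        overlapped = await timed(non_blocking_handler, n, delay)
        return blocked, overlapped

    return asyncio.run(_run())
```

With five concurrent requests, the blocking handler takes about five times the per-call delay, while the awaiting handler takes about one.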
## Prerequisites
- Python 3.11+ with `langchain>=1.0,<2`, `langgraph>=1.0,<2`, `langchain-openai` or `langchain-anthropic`, `langchain-community`, `langchain-redis` or `redis>=5`.
- A working LangChain 1.0 chain or LangGraph 1.0 graph that already passes functional tests.
- Redis 7+ reachable from the app for cache and history (local Docker is fine for dev).
- A FastAPI / Starlette async endpoint, or an equivalent async entrypoint.
- Observability: a place to emit metrics (Prometheus, OpenTelemetry, or LangSmith) — needed to measure TTFT, p95, and cache hit rate.
## Instructions
1. **Establish a latency budget and baseline.** Pick explicit targets before changing code: TTFT under 1s, p95 total under 5s, throughput over 20 req/s per worker, cost under $X per 1k interactions. Run a 5-minute load test with `locust` or `wrk` against the current chain and record p50 / p95 / p99 / TTFT / total cost. Without these numbers every downstream change is theater.
2. **Convert every hot path to async (P48).** Inside `async def` handlers, replace `invoke`, `stream`, `batch`, and `tool.run` with `ainvoke`, `astream` / `astream_events(version="v2")`, `abatch`, and `tool.arun`. For retrievers, call `retriever.ainvoke(query)`; the older `get_relevant_documents` / `aget_relevant_documents` methods are deprecated in favor of `invoke` / `ainvoke`. See `references/async-safety-checklist.md` for a grep pattern and a CI linter. Target: zero sync LangChain calls inside any async function.
3. **Fix `.abatch()` concurrency (P08).** Every `.abatch` / `.batch` call must pass `config={"max_concurrency": N}` where N is chosen from the provider table in `references/batch-concurrency-per-provider.md` (Anthropic 10-20, OpenAI 20-50, local vLLM 100+). For multi-worker deploys, cap account-wide calls with a LiteLLM / Portkey proxy or a Redis semaphore — `max_concurrency` only governs one process.
4. **Instrument TTFT with `astream_events(version="v2")` (P01).** Measure time to first token separately from total latency — user-perceived performance hinges on TTFT. Read usage metadata only on the `on_chat_model_end` event; per-chunk usage fields lag and are not reliable mid-stream.
```python
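from time import perf_counter


async def run(chain, query: str) -> dict:
    """Measure TTFT, total latency, and token usage for one streamed call."""
    t0 = perf_counter()
    ttft = None
    tokens = 0
    async for ev in chain.astream_events({"input": query}, version="v2"):
        # The first streamed chunk marks time to first token.
        if ev["event"] == "on_chat_model_stream" and ttft is None:
            ttft = perf_counter() - t0
        # Usage is only reliable once a model call has finished (P01).
        elif ev["event"] == "on_chat_model_end":
            usage = ev["data"]["output"].usage_metadata or {}
            # Sum across model calls so multi-call chains are not undercounted.
            tokens += usage.get("total_tokens", 0)
    return {"ttft_s": ttft, "total_s": perf_counter() - t0, "tokens": tokens}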
```
5. **Enable an exact LLM cache.** For deterministic (temperature=0) prompts, set `RedisCache` or `SQLiteCache` globally. LangChain 1.0 keys include the bound tools signature (P61 fix), which prevents cache poisoning when an agent's tool list changes. Always set an explicit TTL on Redis keys — default Redis keys are immortal.
```python
import redis
from langchain_community.cache import RedisCache
from langchain_core.globals import set_llm_cache

# ttl is in seconds (3600 is an example value); without it cached entries never expire.
set_llm_cache(RedisCache(redis.Redis.from_url("redis://cache:6379/0"), ttl=3600))
```
6. **Add a semantic cache with a tuned threshold (P62).** The `RedisSemanticCache` default `score_threshold=0.95` produces < 5% hit rate on real traffic. Collect a 200-500 prompt golden set with labeled near-duplicates, measure cosine similarity with your embedding model, and pick the F1-maximizing threshold — typically **0.85-0.90** for `text-embedding-3-small`. Full procedure in `references/cache-tuning.md`. Do not run semantic cache behind `temperature > 0`; users will see prior random draws.
7. **Replace `InMemoryChatMessageHistory` (P22).** Every production chat path must use `RedisChatMessageHistory` (with `ttl`) or a LangGraph checkpointer (`AsyncPostgresSaver` / `AsyncSqliteSaver`). Add a restart test: mid-conversation, kill and restart the worker, assert the next user turn still sees prior messages. See `references/persistent-history.md` for migration steps and trim policies.
8. **Close retriever connection pools in FastAPI `lifespan` (P59).** Build the vector store once at startup, expose it via `app.state`, and close it in the `finally` block of the lifespan context manager. Never construct a retriever per request: a cancelled request can abandon its connection before cleanup runs, which leaks Postgres connections.
9. **Stream tokens with SSE, not `BackgroundTasks` (P60).** `BackgroundTasks` runs after the response body is flushed; per-token dispatch via it delivers tokens the client will never read. Use `EventSourceResponse` (sse-starlette) or a WebSocket and pipe events from `astream_events`.
10. **Re-run the load test and diff the four metrics.** TTFT, p95, throughput, cost per 1k. If any regressed, revert that step and investigate — do not stack changes without verification. Execute in this order to isolate effects:
    1. Run the baseline load test and save results.
    2. Set `max_concurrency` on every `.abatch` call and re-run.
    3. Add exact cache, re-run, check cache hit rate.
    4. Configure semantic cache with tuned threshold, re-run, check hit rate again.
    5. Verify persistent history survives a worker restart.
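The threshold selection in step 6 can be sketched as a small sweep over a labeled golden set. This is an illustration under assumptions, not the procedure from `references/cache-tuning.md`: `f1_at_threshold` and `best_threshold` are hypothetical helpers, the scores are taken to be cosine similarities already computed with your embedding model, and the sample data is invented.

```python
from typing import Sequence


def f1_at_threshold(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> float:
    """F1 of 'treat as cache hit when similarity >= threshold' on a labeled set."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, is_dup in pairs if s >= threshold and is_dup)
    fp = sum(1 for s, is_dup in pairs if s >= threshold and not is_dup)
    fn = sum(1 for s, is_dup in pairs if s < threshold and is_dup)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scores: Sequence[float], labels: Sequence[bool], candidates: Sequence[float]
) -> tuple[float, float]:
    """Return (threshold, f1) for the F1-maximizing candidate; first one wins ties."""
    return max(
        ((t, f1_at_threshold(scores, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Invented golden set: similarity of each prompt pair, and whether a human
# labeled the pair as a true near-duplicate.
scores = [0.97, 0.91, 0.88, 0.86, 0.84, 0.80, 0.72]
labels = [True, True, True, True, False, False, False]
threshold, f1 = best_threshold(scores, labels, [0.80, 0.85, 0.90, 0.95])
```

On this toy data, 0.95 misses three of the four true duplicates, while 0.85 separates the two groups cleanly. Real golden sets rarely separate this well, so inspect the false positives at the chosen threshold before shipping it.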
### Throughput Tuning Table (starting values)
| Provider | Safe `max_concurrency` | Ceiling signal |
|----------|------------------------|-----------------|
| Anthropic (sonnet-4.5/4.6) | 10-20 | 429 `rate_limit_error` |
| OpenAI (gpt-4o / 4o-mini) | 20-50 | 429 + TPM exhaustion header |
| OpenAI o1 / reasoning | 2-5 | Cost + latency, not rate |
| Google Gemini 1.5/2.5 | 10-30 | 429 |
| Cohere | 20-40 | 429 |
| Local vLLM / TGI | 100-500 (batch N≈32-64) | GPU KV-cache OOM |
| Ollama on consumer GPU | 1-4 | Process queue backpressure |
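As a mental model for what a per-process cap like `max_concurrency` buys you, the sketch below bounds in-flight calls with `asyncio.Semaphore`. It is an assumption-laden illustration rather than LangChain's implementation: `bounded_map` and `InFlightProbe` are hypothetical names, and the probe is a fake provider call that only sleeps.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    fn: Callable[[T], Awaitable[R]], items: Sequence[T], limit: int
) -> list[R]:
    """Run fn over items with at most `limit` calls in flight; results keep input order."""
    if limit < 1:
        raise ValueError("limit must be >= 1")
    sem = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        async with sem:
            return await fn(item)

    return list(await asyncio.gather(*(guarded(item) for item in items)))


class InFlightProbe:
    """Hypothetical fake provider call that records peak concurrency."""

    def __init__(self, delay: float = 0.01) -> None:
        self.delay = delay
        self.current = 0
        self.peak = 0

    async def __call__(self, x: int) -> int:
        self.current += 1
        self.peak = max(self.peak, self.current)
        try:
            await asyncio.sleep(self.delay)  # stands in for network latency
            return x * 2
        finally:
            self.current -= 1


def demo(n: int = 20, limit: int = 4) -> tuple[list[int], int]:
    probe = InFlightProbe()
    results = asyncio.run(bounded_map(probe, list(range(n)), limit))
    return results, probe.peak
```

The semaphore lives inside one event loop, which is why a cap like this cannot protect an account-wide rate limit across several workers.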
### Latency Breakdown Template
Record these for every change, not just total:
| Metric | Target | Source |
|--------|--------|--------|
| TTFT p50 / p95 | 500ms / 1s | first `on_chat_model_stream` event |
| Total p50 / p95 | 2s / 5s | end-to-end handler |
| Tool-call p95 | < 1s per tool | `on_tool_end` - `on_tool_start` |
| Retriever p95 | < 300ms | `on_retriever_end` - `on_retriever_start` |
| Provider p95 | measure per model | split by LLM node |
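To fill in the template from raw samples, a nearest-rank percentile is enough for a first pass. The sketch below is one possible helper, not part of any metrics library: `percentile` and `summarize` are hypothetical names, and the sample latencies are invented.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_s: Sequence[float], p95_target_s: float) -> dict:
    """Report p50 / p95 in seconds and whether p95 meets the target."""
    p95 = percentile(samples_s, 95)
    return {
        "p50_s": percentile(samples_s, 50),
        "p95_s": p95,
        "within_target": p95 <= p95_target_s,
    }


# Invented TTFT samples in seconds; one slow outlier dominates p95.
ttft_samples = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 4.0]
report = summarize(ttft_samples, p95_target_s=1.0)
```

Here the median looks healthy at 0.7s while p95 is 4.0s, which is why the template asks for both columns and not an average.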
### Batch Sweet-Spot Numbers
- Anthropic tier 2 chat: `max_concurrency=10` saturates at roughly 8 req/s; p95 roughly doubles once `max_concurrency` exceeds 20.
- OpenAI `gpt-4o-mini` tier 3: knee of the curve around `max_concurrency=30-40`; ~40 req/s throughput.
- Local vLLM A100: server-side batch sweet spot `N=32-64`, client `max_concurrency=100+`.
Verify on your own account — these are starting points, not promises.
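One way to sanity-check figures like these is Little's law: in steady state, in-flight requests ≈ throughput × mean latency. The helpers below are a hypothetical back-of-the-envelope sketch. They assume a saturated, steady-state system and ignore rate limits, retries, and latency growth under load.

```python
import math


def implied_latency_s(concurrency: int, throughput_rps: float) -> float:
    """Mean per-request latency implied by a saturated (concurrency, throughput) pair."""
    if throughput_rps <= 0:
        raise ValueError("throughput_rps must be positive")
    return concurrency / throughput_rps


def concurrency_for(target_rps: float, mean_latency_s: float) -> int:
    """Smallest in-flight count that can sustain target_rps at the given mean latency."""
    return max(1, math.ceil(target_rps * mean_latency_s))


# Figures from the list above: max_concurrency=10 saturating near 8 req/s.
latency = implied_latency_s(10, 8.0)
# Assumed target: the 20 req/s per-worker budget from step 1.
needed = concurrency_for(20.0, latency)
```

If `needed` lands above the safe range in the tuning table, the target calls for lower per-call latency or more provider quota, not a larger `max_concurrency`.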
## Output
Deliverables from running this skill end-to-end:
- A `perf/` directory with `baseline.json` and `tuned.json` load-test results