langchain-rate-limits

Rate-limit LangChain 1.0 calls correctly across multi-worker deployments — Redis-backed limiters, asyncio.Semaphore, narrow exception whitelists, and provider-specific throttle handling. Use when hitting 429s in production, scaling workers horizontally, or tuning throughput against Anthropic, OpenAI, or Gemini tier limits. Trigger with "langchain rate limit", "langchain 429", "langchain semaphore", "langchain token bucket", "anthropic rpm", "openai rpm throttling", "InMemoryRateLimiter", "redis rate limiter".

Tags: Backend & APIs, saas, langchain, langgraph, python, langchain-1.0, rate-limits, throttling, concurrency

What this skill does

# LangChain Rate Limits (Python)

## Overview

A team deploys 10 Cloud Run workers. Each worker initializes its `ChatAnthropic`
with `InMemoryRateLimiter(requests_per_second=10)`: they read the docs, picked a
safe-looking number, and shipped. Thirty seconds later the dashboard lights up
with 429s. The cluster is pushing 100 RPS (6,000 RPM) at Anthropic's 50 RPM
tier-1 ceiling, not the 10 RPS they configured. The name is the clue:
`InMemoryRateLimiter` is **in-process**, so each worker keeps its own counter,
and ten workers × 10 RPS = 100 RPS to the provider. This is pain-catalog entry
**P29**, and it lands on every team that scales past one pod.
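
The arithmetic behind P29 can be sketched in a few lines. The helper name `effective_cluster_rpm` is hypothetical; the numbers are the ones from the scenario above.

```python
def effective_cluster_rpm(workers: int, per_worker_rps: float) -> float:
    """Aggregate requests per minute when every worker enforces its own limit."""
    return workers * per_worker_rps * 60


cluster_rpm = effective_cluster_rpm(workers=10, per_worker_rps=10)
tier_ceiling_rpm = 50  # tier-1 figure quoted in the scenario above

print(f"cluster can send up to {cluster_rpm:.0f} RPM against a {tier_ceiling_rpm} RPM ceiling")
print(f"overshoot factor: {cluster_rpm / tier_ceiling_rpm:.0f}x")
```

The per-worker setting is multiplied by the replica count, so any autoscaling event silently changes the effective limit.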

Three more traps wait on the same code path:

- **P07** — `.with_fallbacks([backup])` defaults `exceptions_to_handle=(Exception,)`,
  which on Python <3.12 swallows `KeyboardInterrupt`. Ctrl+C during a 429
  retry storm silently falls through to the backup chain and keeps billing.
- **P30** — `ChatOpenAI` and `ChatAnthropic` default `max_retries=6`. That is
  retries, not attempts: **7 total requests per logical call** on flaky
  networks. One `.invoke()` can bill 7x.
- **P31** — Anthropic's RPM counts cache reads, cache writes, and uncached
  calls **uniformly**. Cache-heavy workloads at 50 RPM can 429 on cache writes
  while the ITPM dashboard shows headroom.

This skill covers measuring demand before picking a limit; the
`InMemoryRateLimiter` vs Redis-backed limiter vs `asyncio.Semaphore` decision
tree; the narrow `exceptions_to_handle` whitelist; `max_retries=2` math; and
the provider-specific limit taxonomy (RPM, ITPM, OTPM, concurrent,
cached-vs-uncached). Pin: `langchain-core 1.0.x`, `langchain-anthropic 1.0.x`,
`langchain-openai 1.0.x`. Pain-catalog anchors: **P07, P08, P29, P30, P31**.
For `.batch(max_concurrency=...)` tuning, see the sibling skill
`langchain-performance-tuning` — this skill is about provider-facing rate caps.

## Prerequisites

- Python 3.10+ (3.12+ fixes the `KeyboardInterrupt` half of P07)
- `langchain-core >= 1.0, < 2.0`
- At least one provider: `pip install langchain-anthropic langchain-openai`
- For multi-worker prod: `redis >= 4.5` client and a Redis server reachable from every worker
- Completed `langchain-model-inference` — the chat-model factory from that skill is where `rate_limiter=` gets attached

## Instructions

### Step 1 — Measure actual demand before picking a number

**Do not guess at `requests_per_second`.** Instrument first, size second.
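Attach a `BaseCallbackHandler` that logs per-call `input_tokens`,
`output_tokens`, and cache-read tokens (`input_token_details["cache_read"]`)
from each generation's `message.usage_metadata`. In `on_llm_end`,
`response.generations` is a list of lists, so index it as
`response.generations[i][j]`: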

```python
# with_config returns a new runnable, so keep the result
chain = chain.with_config({"callbacks": [DemandLogger()]})
```

Collect 24-48 hours of representative traffic. Roll up: p50 and p95 RPM, p95
ITPM, p95 OTPM, and cache hit rate. Then find the binding constraint (the limit
your p95 load comes closest to) and size the limiter at **70% of that limit's
tier ceiling**.

See [Measuring Demand](references/measuring-demand.md) for the full
`DemandLogger` implementation, pandas roll-up, OTEL integration, load-test
harness, and multi-tenant sizing strategies.
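
As a rough illustration of the roll-up step, the sketch below uses only the standard library rather than the pandas version in the reference. It assumes the logger has produced a list of request timestamps in epoch seconds; the helper names (`per_minute_counts`, `percentile`, `limiter_rps`) are hypothetical.

```python
import math
from collections import Counter


def per_minute_counts(timestamps: list[float]) -> list[int]:
    """Bucket epoch-second timestamps into requests-per-minute counts,
    including the empty minutes between the first and last request."""
    if not timestamps:
        return []
    buckets = Counter(int(t // 60) for t in timestamps)
    first, last = min(buckets), max(buckets)
    return [buckets.get(minute, 0) for minute in range(first, last + 1)]


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def limiter_rps(tier_ceiling_rpm: float, headroom: float = 0.70) -> float:
    """Requests per second that corresponds to headroom x the tier ceiling."""
    return tier_ceiling_rpm * headroom / 60


# Synthetic log: 3 requests in minute 0, none in minute 1, 5 in minute 2.
timestamps = [0, 10, 20, 125, 130, 140, 150, 170]
counts = per_minute_counts(timestamps)

print("per-minute counts:", counts)
print("p50 RPM:", percentile(counts, 50))
print("p95 RPM:", percentile(counts, 95))
print("limiter setting for a 50 RPM tier:", round(limiter_rps(50), 2), "requests/second")
```

If the p95 RPM already exceeds the 70% budget, the limiter will queue traffic at peak; that is a signal to request a higher tier rather than to raise the limiter.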

### Step 2 — `InMemoryRateLimiter` for single-process dev only; never multi-worker prod

LangChain 1.0 ships `InMemoryRateLimiter` as a first-class `BaseChatModel` parameter:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.rate_limiters import InMemoryRateLimiter

limiter = InMemoryRateLimiter(
    requests_per_second=0.58,    # 35 RPM = 70% of Anthropic tier-1 50 RPM
    check_every_n_seconds=0.1,
    max_bucket_size=5,           # burst capacity
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)
```

**`InMemoryRateLimiter` is per-process.** Safe for:

- Single-process local dev (`python script.py`)
- Single-worker uvicorn (`uvicorn --workers 1`)
- Jupyter notebooks, batch scripts

**Unsafe for** (this is P29):

- Multi-worker uvicorn / gunicorn (`--workers 4`)
- Any container orchestrator running more than one replica (Cloud Run with max-instances > 1, K8s, ECS)
- Distributed job runners (Celery, Temporal, Cloud Tasks fanout)
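
To see why per-process state breaks down, the simulation below runs a toy token bucket with an explicit clock. It is an illustration only, not LangChain's `InMemoryRateLimiter` implementation, and `TokenBucket` and `admitted_in_one_minute` are hypothetical names.

```python
class TokenBucket:
    """Toy token bucket driven by an explicit clock so the demo is deterministic."""

    def __init__(self, rate_per_second: float, capacity: float) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = 0.0
        self.last = 0.0

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted_in_one_minute(buckets: list[TokenBucket]) -> int:
    """Offer every bucket one request every 100 ms for 60 simulated seconds."""
    admitted = 0
    for tick in range(1, 601):
        now = tick / 10
        admitted += sum(bucket.try_acquire(now) for bucket in buckets)
    return admitted


one_worker = admitted_in_one_minute([TokenBucket(35 / 60, 5)])
ten_workers = admitted_in_one_minute([TokenBucket(35 / 60, 5) for _ in range(10)])

print(f"1 worker admits {one_worker} requests per minute")
print(f"10 workers admit {ten_workers} requests per minute")
```

Each bucket honors its own 35 RPM setting, yet the provider sees ten times that, because no bucket can observe what the others admitted.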

### Step 3 — Redis-backed limiter for cluster-wide enforcement

For multi-worker deployments, cluster-wide rate limiting requires shared state.
Redis is the default answer: an atomic Lua script for a sliding window, or
`CL.THROTTLE` from the redis-cell module for GCRA.

```python
import redis
from langchain_anthropic import ChatAnthropic
# RedisRateLimiter class defined in references/redis-limiter-pattern.md
from your_app.limiters import RedisRateLimiter

client = redis.Redis.from_url("redis://redis.internal:6379/0")

limiter = RedisRateLimiter(
    client,
    key="anthropic:prod",
    requests_per_second=35 / 60,  # 35 RPM cluster-wide, not per-worker
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)
```

**Key scoping decisions:**

- `key="anthropic:prod"` — all tenants share one global budget (simplest)
- `key=f"anthropic:tenant:{tenant_id}"` — per-tenant quota (requires cleanup for dead tenants)
- Two-level: per-tenant + global, acquire both (best for multi-tenant SaaS)

See [Redis Limiter Pattern](references/redis-limiter-pattern.md) for the full
`RedisRateLimiter` implementation (atomic Lua sliding window), the GCRA
alternative via `CL.THROTTLE`, failure modes (Redis down, clock skew), and
per-tenant cleanup strategy.
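
The sliding-window and two-level ideas can be modeled in plain Python before touching Redis. The sketch below is an in-process stand-in, not the `RedisRateLimiter` from the reference: the class name `SharedSlidingWindow` is hypothetical, and in a real deployment both passes would have to run inside one Lua script (or transaction) to stay atomic across workers.

```python
from collections import defaultdict, deque


class SharedSlidingWindow:
    """In-process stand-in for limiter state that would live in Redis.

    Each key holds the timestamps of admitted requests. A request is admitted
    only if every key it names has room inside the trailing window.
    """

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.log: dict[str, deque[float]] = defaultdict(deque)

    def try_acquire(self, limits: dict[str, int], now: float) -> bool:
        # Pass 1: evict expired entries and check every key before recording
        # anything, so a denial on one key never consumes budget on another.
        for key, limit in limits.items():
            entries = self.log[key]
            while entries and entries[0] <= now - self.window:
                entries.popleft()
            if len(entries) >= limit:
                return False
        # Pass 2: every key has room, so record the request against each one.
        for key in limits:
            self.log[key].append(now)
        return True


store = SharedSlidingWindow()
limits_a = {"anthropic:tenant:a": 20, "anthropic:prod": 35}
limits_b = {"anthropic:tenant:b": 20, "anthropic:prod": 35}

# Tenant A offers 30 requests, then tenant B offers 30, all within one second.
admitted_a = sum(store.try_acquire(limits_a, now=i * 0.01) for i in range(30))
admitted_b = sum(store.try_acquire(limits_b, now=0.5 + i * 0.01) for i in range(30))

print(f"tenant a admitted: {admitted_a}")
print(f"tenant b admitted: {admitted_b}")
print(f"global total: {admitted_a + admitted_b}")
```

Tenant A stops at its own quota, while tenant B is cut off earlier by the shared global key, which is the behavior the two-level scheme is meant to produce.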

### Step 4 — `asyncio.Semaphore` for per-worker in-flight concurrency cap

The rate limiter throttles **request rate**. A semaphore throttles **in-flight
count**. Use both:

```python
import asyncio

# Cluster: 35 RPM (Redis enforces)
# Worker: 20 in-flight at once (semaphore enforces)
worker_sem = asyncio.Semaphore(20)

async def bounded_invoke(inp):
    async with worker_sem:
        return await llm.ainvoke(inp)

# Fanout: gather schedules every call, the semaphore gates how many run at once
async def fan_out(inputs):
    return await asyncio.gather(*[bounded_invoke(x) for x in inputs])
```

Why both: a semaphore prevents a single worker from queueing hundreds of
pending limiter acquires against Redis (head-of-line blocking on the event
loop). The limiter prevents the cluster from exceeding the provider tier. They
solve different problems.
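
A quick way to confirm the in-flight cap is to replace the model call with a stub and record the peak concurrency. This is a sketch under that assumption; `run_fanout` and `fake_llm_call` are hypothetical names, and `asyncio.sleep` stands in for `llm.ainvoke`.

```python
import asyncio


async def run_fanout(n_requests: int, cap: int) -> int:
    """Run n_requests stub calls under a semaphore and return the peak in-flight count."""
    sem = asyncio.Semaphore(cap)
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for llm.ainvoke(...)
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(n_requests)))
    return peak


peak = asyncio.run(run_fanout(n_requests=50, cap=5))
print(f"peak in-flight: {peak}")
```

All 50 coroutines are scheduled at once, but only `cap` of them ever hold the semaphore, so the remaining calls wait inside the worker instead of piling up against the limiter.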

**Semaphore sizing**: target the latency-bandwidth product (Little's law:
in-flight count ≈ request rate × latency). If p95 request latency is 2s and the
worker's RPS cap is 10, in-flight count ≈ 2 × 10 = 20. Overshoot is wasted
memory; undershoot leaves throughput on the table.
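
The sizing rule reduces to one multiplication. The helper below is a minimal sketch; `semaphore_size` is a hypothetical name, and rounding up with a floor of 1 is an assumption made so that the semaphore never blocks all traffic.

```python
import math


def semaphore_size(p95_latency_s: float, worker_rps: float) -> int:
    """Little's law: in-flight requests = request rate x time each request takes."""
    return max(1, math.ceil(p95_latency_s * worker_rps))


print(semaphore_size(p95_latency_s=2.0, worker_rps=10))       # 20
print(semaphore_size(p95_latency_s=2.0, worker_rps=35 / 60))  # 2
```

The second call shows that when a worker's share of the budget is well under one request per second, the semaphore needed is very small.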

### Step 5 — Narrow `with_fallbacks(exceptions_to_handle=...)` — never `(Exception,)`

`.with_fallbacks([backup])` defaults to `exceptions_to_handle=(Exception,)`.
This is P07: the catch-all hands every failure to the backup, including ones a
fallback cannot fix, and on Python <3.12 it can interfere with
`KeyboardInterrupt` propagation, so Ctrl+C during a retry storm silently hands
off to the backup and keeps running. **Always narrow the tuple:**

```python
from anthropic import (
    RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)

resilient = (prompt | claude | parser).with_fallbacks(
    [prompt | gpt4o | parser],
    exceptions_to_handle=(
        RateLimitError, APITimeoutError,
        APIConnectionError, InternalServerError,
    ),
    # NEVER: Exception, BaseException, AuthenticationError,
    # BadRequestError, ValidationError
)
```

The whitelist is **only transient provider errors**. `AuthenticationError`,
`BadRequestError`, and `ValidationError` point to bugs in your code or
credentials: at best the fallback fails the same way, at worst it hides the bug
behind a successful backup call. See the sibling skill's reference
`langchain-sdk-patterns/references/fallback-exception-list.md` for the full
per-provider whitelist (Anthropic, OpenAI, Gemini).
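
The effect of the whitelist can be shown without any provider SDK. The sketch below uses stand-in exception classes and a plain `try`/`except` helper; `FakeRateLimitError`, `FakeAuthenticationError`, and `call_with_fallback` are hypothetical and only mimic the shape of the fallback decision, not LangChain's internals.

```python
class FakeRateLimitError(Exception):
    """Stand-in for a transient provider error (HTTP 429)."""


class FakeAuthenticationError(Exception):
    """Stand-in for a non-transient error (bad API key)."""


TRANSIENT = (FakeRateLimitError,)


def call_with_fallback(primary, backup, exceptions_to_handle=TRANSIENT):
    """Try primary; run backup only if the error is in the whitelist."""
    try:
        return primary()
    except exceptions_to_handle:
        return backup()


def throttled():
    raise FakeRateLimitError("429")


def bad_key():
    raise FakeAuthenticationError("401")


def backup():
    return "backup answer"


# Transient error: the backup runs.
print(call_with_fallback(throttled, backup))

# Credential bug with a narrow whitelist: the error surfaces immediately.
try:
    call_with_fallback(bad_key, backup)
except FakeAuthenticationError as err:
    print(f"surfaced: {err}")

# Same bug with a catch-all: the backup masks it.
print(call_with_fallback(bad_key, backup, exceptions_to_handle=(Exception,)))
```

With the catch-all, the credential bug never appears in logs as a failure, which is the masking behavior the narrow tuple is meant to prevent.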

### Step 6 — `max_retries=2`, never the default `max_retries=6`

`max_retries` is **retries, not attempts.** Default `max_retries=6` on
`ChatOpenAI` / `ChatAnthropic` means **initial + 6 retries = 7 billed requests**
per logical call (P30). On a flaky network, one `.invoke()` costs 7x what you
budgeted.
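
The retry arithmetic, and what it does to a rate budget, can be sketched as follows. The helper names are hypothetical, and the 35 RPM budget is the figure used in the earlier steps; the second function assumes the worst case where every attempt reaches the provider and counts against RPM.

```python
def max_requests_per_call(max_retries: int) -> int:
    """Worst case: the initial attempt plus every retry is sent to the provider."""
    return 1 + max_retries


def worst_case_calls_per_minute(rpm_budget: int, max_retries: int) -> int:
    """Logical calls that fit in the budget if every call uses all of its retries."""
    return rpm_budget // max_requests_per_call(max_retries)


for retries in (6, 2):
    print(
        f"max_retries={retries}: up to {max_requests_per_call(retries)} requests per call, "
        f"{worst_case_calls_per_minute(35, retries)} worst-case calls/minute on a 35 RPM budget"
    )
```

During a throttling incident the retries themselves consume the budget, so a high retry count shrinks the number of logical calls that can succeed.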

```python
from langchain_openai import ChatOpenAI

# BAD: relies on the default retry count
llm = ChatOpenAI(model="gpt-4o")  # max_retries=6

# GOOD: production default
llm = ChatOpenAI(
    model="gpt-4o",
    max_retries=2,      # initial + 2 retries = 3 requests at most
)
```
