langchain-rate-limits

Rate-limit LangChain 1.0 calls correctly across multi-worker deployments — Redis-backed limiters, asyncio.Semaphore, narrow exception whitelists, and provider-specific throttle handling. Use when hitting 429s in production, scaling workers horizontally, or tuning throughput against Anthropic, OpenAI, or Gemini tier limits. Trigger with "langchain rate limit", "langchain 429", "langchain semaphore", "langchain token bucket", "anthropic rpm", "openai rpm throttling", "InMemoryRateLimiter", "redis rate limiter".

What this skill does

# LangChain Rate Limits (Python)

## Overview

A team deploys 10 Cloud Run workers. Each worker initializes its `ChatAnthropic`
with `InMemoryRateLimiter(requests_per_second=10)`: they read the docs, picked a
safe-looking number, and shipped. Thirty seconds later the dashboard lights up
with 429s. The cluster is pushing 100 RPS (6,000 RPM) at Anthropic's 50 RPM
tier-1 ceiling, not the 10 RPS they configured, and even 10 RPS is 600 RPM, so
the number was never safe against that tier. The name is the clue:
`InMemoryRateLimiter` is **in-process**. Each worker has its own counter, so ten
workers × 10 RPS = 100 RPS to the provider. This is pain-catalog entry **P29**,
and it lands on every team that scales past one pod.

Three more traps wait on the same code path:

- **P07**: `.with_fallbacks([backup])` defaults to `exceptions_to_handle=(Exception,)`,
  which catches every ordinary error, not only transient ones. An authentication
  or bad-request failure during a 429 retry storm silently falls through to the
  backup chain and keeps billing. (`KeyboardInterrupt` derives from
  `BaseException`, so `except Exception` does not intercept Ctrl+C on any
  Python 3 version.)
- **P30**: `ChatOpenAI` and `ChatAnthropic` default `max_retries=6`. That is
  retries, not attempts: **7 total requests per logical call** on flaky
  networks. One `.invoke()` can bill 7x.
- **P31**: Anthropic's RPM counts cache reads, cache writes, and uncached
  calls **uniformly**. Cache-heavy workloads at 50 RPM can 429 on cache writes
  while the ITPM dashboard shows headroom.

This skill covers measuring demand before picking a limit; the
`InMemoryRateLimiter` vs Redis-backed limiter vs `asyncio.Semaphore` decision
tree; the narrow `exceptions_to_handle` whitelist; `max_retries=2` math; and
the provider-specific limit taxonomy (RPM, ITPM, OTPM, concurrent,
cached-vs-uncached). Pin: `langchain-core 1.0.x`, `langchain-anthropic 1.0.x`,
`langchain-openai 1.0.x`. Pain-catalog anchors: **P07, P08, P29, P30, P31**.
For `.batch(max_concurrency=...)` tuning, see the sibling skill
`langchain-performance-tuning` — this skill is about provider-facing rate caps.

## Prerequisites

- Python 3.10+
- `langchain-core >= 1.0, < 2.0`
- At least one provider: `pip install langchain-anthropic langchain-openai`
- For multi-worker prod: `redis >= 4.5` client and a Redis server reachable from every worker
- Completed `langchain-model-inference` — the chat-model factory from that skill is where `rate_limiter=` gets attached

## Instructions

### Step 1 — Measure actual demand before picking a number

**Do not guess at `requests_per_second`.** Instrument first, size second.
Attach a `BaseCallbackHandler` that logs per-call `input_tokens`,
`output_tokens`, and cache-read tokens (`input_token_details["cache_read"]`)
from `response.generations[0][0].message.usage_metadata`:

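```python
# DemandLogger is defined in references/measuring-demand.md.
# with_config returns a new runnable, so keep the result.
chain = chain.with_config({"callbacks": [DemandLogger()]})
```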

Collect 24-48 hours of representative traffic. Roll up: p50 and p95 RPM, p95
ITPM, p95 OTPM, cache hit rate. Size the limiter at **70% of the binding
constraint's tier ceiling** on your p95.

See [Measuring Demand](references/measuring-demand.md) for the full
`DemandLogger` implementation, pandas roll-up, OTEL integration, load-test
harness, and multi-tenant sizing strategies.
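
As a rough illustration of the roll-up and sizing arithmetic, the sketch below buckets request timestamps into minutes, takes the p95 of the per-minute counts, and derives a limiter rate at 70% of a tier ceiling. The function names, the sample timestamps, and the 50 RPM ceiling are assumptions made for this example; it is not the `DemandLogger` roll-up from the reference file.

```python
import math
from collections import Counter

def per_minute_counts(timestamps_s):
    """Count requests in each minute (timestamps are seconds since start)."""
    counts = Counter(int(t // 60) for t in timestamps_s)
    first, last = min(counts), max(counts)
    # Include idle minutes as zero so quiet periods are not ignored.
    return [counts.get(m, 0) for m in range(first, last + 1)]

def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def limiter_rps(tier_ceiling_rpm, headroom=0.70):
    """Requests per second at `headroom` of the tier ceiling."""
    return tier_ceiling_rpm * headroom / 60

# Hypothetical traffic: three quiet minutes and one burst minute.
timestamps = [5, 20, 65, 70, 110, 130, 185, 186, 187, 188, 189, 190]
counts = per_minute_counts(timestamps)   # [2, 3, 1, 6]
p95_rpm = percentile(counts, 95)         # 6
budget_rpm = 50 * 0.70                   # 35 RPM against an assumed 50 RPM tier

print("p95 RPM:", p95_rpm)
print("limiter requests_per_second:", round(limiter_rps(50), 3))
print("fits inside budget:", p95_rpm <= budget_rpm)
```

If the p95 demand does not fit inside the budget, the limiter will queue requests during peaks, which is a signal to request a higher tier or shed load rather than to raise the limiter above the ceiling.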

### Step 2 — `InMemoryRateLimiter` for single-process dev only; never multi-worker prod

LangChain 1.0 ships `InMemoryRateLimiter` as a first-class `BaseChatModel` parameter:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.rate_limiters import InMemoryRateLimiter

limiter = InMemoryRateLimiter(
    requests_per_second=0.58,    # 35 RPM = 70% of Anthropic tier-1 50 RPM
    check_every_n_seconds=0.1,
    max_bucket_size=5,           # burst capacity
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)
```

**`InMemoryRateLimiter` is per-process.** Safe for:

- Single-process local dev (`python script.py`)
- Single-worker uvicorn (`uvicorn --workers 1`)
- Jupyter notebooks, batch scripts

**Unsafe for** (this is P29):

- Multi-worker uvicorn / gunicorn (`--workers 4`)
- Any container orchestrator that can run more than one replica (Cloud Run with max-instances > 1, K8s, ECS)
- Distributed job runners (Celery, Temporal, Cloud Tasks fanout)
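
The per-process behavior can be reproduced with a toy model. The sketch below uses a simplified token bucket with an injected clock; it is an assumption-laden stand-in, not LangChain's `InMemoryRateLimiter` implementation, and `TokenBucket` and `admitted` are hypothetical names. Each simulated worker owns its own bucket, so the total admitted grows linearly with worker count.

```python
class TokenBucket:
    """Simplified token bucket with an injected clock (illustrative only)."""

    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = 0.0
        self.last = now

    def try_acquire(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def admitted(buckets, duration_s, step_s=0.01):
    """Total requests admitted when every worker tries to send on each tick."""
    total = 0
    ticks = int(duration_s / step_s)
    for i in range(1, ticks + 1):
        now = i * step_s
        total += sum(b.try_acquire(now) for b in buckets)
    return total

# Every bucket is configured for 10 requests per second.
one_worker = admitted([TokenBucket(10, 1)], duration_s=10)
ten_workers = admitted([TokenBucket(10, 1) for _ in range(10)], duration_s=10)

print("one worker admitted:", one_worker)
print("ten workers admitted:", ten_workers)
```

The ten-worker total comes out at ten times the single-worker total, because no bucket can see what the others have admitted.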

### Step 3 — Redis-backed limiter for cluster-wide enforcement

For multi-worker deployments, cluster-wide rate limiting requires shared state.
Redis is the default answer: an atomic Lua script for a sliding window, or
`CL.THROTTLE` from the `redis-cell` module for GCRA.

```python
import redis
from langchain_anthropic import ChatAnthropic
# RedisRateLimiter class defined in references/redis-limiter-pattern.md
from your_app.limiters import RedisRateLimiter

client = redis.Redis.from_url("redis://redis.internal:6379/0")

limiter = RedisRateLimiter(
    client,
    key="anthropic:prod",
    requests_per_second=35 / 60,  # 35 RPM cluster-wide, not per-worker
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)
```

**Key scoping decisions:**

- `key="anthropic:prod"` — all tenants share one global budget (simplest)
- `key=f"anthropic:tenant:{tenant_id}"` — per-tenant quota (requires cleanup for dead tenants)
- Two-level: per-tenant + global, acquire both (best for multi-tenant SaaS)

See [Redis Limiter Pattern](references/redis-limiter-pattern.md) for the full
`RedisRateLimiter` implementation (atomic Lua sliding window), the GCRA
alternative via `CL.THROTTLE`, failure modes (Redis down, clock skew), and
per-tenant cleanup strategy.
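
To make the sliding-window idea concrete without a Redis server, here is a minimal in-memory sketch of the three steps such a limiter performs per acquire: evict old timestamps, count, and record. `SlidingWindowLimiter` is a hypothetical class written for this example, not the `RedisRateLimiter` from the reference file. In a real deployment the three steps would have to run atomically inside Redis (for example in one Lua script), since separate round trips would race between workers.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-memory stand-in for a shared sliding-window store (illustrative only)."""

    def __init__(self, max_requests, window_s):
        self.max_requests = max_requests
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps inside the window

    def try_acquire(self, key, now):
        events = self._events[key]
        # 1. Drop timestamps that have aged out of the window.
        while events and events[0] <= now - self.window_s:
            events.popleft()
        # 2. Reject if the window is already full.
        if len(events) >= self.max_requests:
            return False
        # 3. Record this request.
        events.append(now)
        return True

# One shared limiter object plays the role of Redis: every "worker" hits the
# same counter, so the 35-per-minute budget holds across all of them.
shared = SlidingWindowLimiter(max_requests=35, window_s=60)
granted = sum(
    shared.try_acquire("anthropic:prod", now=second)
    for second in range(60)      # one attempt per second...
    for _worker in range(10)     # ...from each of 10 workers
)
print(granted)  # 35
```

Because the state is keyed, the same structure supports the per-tenant scoping above: a different `key` gets an independent window.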

### Step 4 — `asyncio.Semaphore` for per-worker in-flight concurrency cap

The rate limiter throttles **request rate**. A semaphore throttles **in-flight
count**. Use both:

```python
import asyncio

# Cluster: 35 RPM (Redis enforces)
# Worker: 20 in-flight at once (semaphore enforces)
worker_sem = asyncio.Semaphore(20)

async def bounded_invoke(inp):
    # llm is the rate-limited chat model built in Step 3
    async with worker_sem:
        return await llm.ainvoke(inp)

async def fan_out(inputs):
    # Fanout: every input is scheduled, but at most 20 run at once
    return await asyncio.gather(*[bounded_invoke(x) for x in inputs])
```

Why both: a semaphore prevents a single worker from queueing hundreds of
pending limiter acquires against Redis (head-of-line blocking on the event
loop). The limiter prevents the cluster from exceeding the provider tier. They
solve different problems.
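
The in-flight cap can be checked with a self-contained simulation. In this sketch `asyncio.sleep` stands in for the model call and the counters are instrumentation added for the example; `run_fanout` and `fake_call` are hypothetical names.

```python
import asyncio

async def run_fanout(n_tasks, max_in_flight):
    sem = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def fake_call(i):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)   # stand-in for llm.ainvoke(...)
            in_flight -= 1
            return i

    # All tasks are created up front; the semaphore gates how many proceed.
    results = await asyncio.gather(*(fake_call(i) for i in range(n_tasks)))
    return results, peak

results, peak = asyncio.run(run_fanout(n_tasks=100, max_in_flight=20))
print(len(results), peak)  # 100 20
```

All 100 calls complete and `gather` preserves input order, while the peak concurrency never exceeds the semaphore size.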

**Semaphore sizing**: target the latency × throughput product (Little's law).
If p95 request latency is 2s and the worker's RPS cap is 10, in-flight count
≈ 2 × 10 = 20. Overshoot is wasted memory; undershoot leaves throughput on the
table.
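
The sizing rule is a direct application of Little's law (items in flight = arrival rate × time in system). A small helper, with a hypothetical name, makes the arithmetic explicit:

```python
import math

def semaphore_size(p95_latency_s, worker_rps_cap):
    """In-flight target = latency x throughput, rounded up, never below 1."""
    return max(1, math.ceil(p95_latency_s * worker_rps_cap))

print(semaphore_size(2.0, 10))        # 20, the example from the text
print(semaphore_size(2.0, 35 / 60))   # 2, one worker carrying a whole 35 RPM budget
```

The second call shows that at tier-1 rates the steady-state in-flight count is tiny, so a large semaphore mainly matters for absorbing bursts.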

### Step 5 — Narrow `with_fallbacks(exceptions_to_handle=...)` — never `(Exception,)`

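`.with_fallbacks([backup])` defaults to catching `Exception`. This is P07: the
default treats every ordinary error as a reason to fall back, including
failures caused by your own code or credentials. (`KeyboardInterrupt` derives
from `BaseException`, so the default does not intercept Ctrl+C; the hazard is
the non-transient `Exception` subclasses.) A bad request during a retry storm
silently hands off to the backup and keeps running.
**Always narrow the tuple:**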

```python
from anthropic import (
    RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)

resilient = (prompt | claude | parser).with_fallbacks(
    [prompt | gpt4o | parser],
    exceptions_to_handle=(
        RateLimitError, APITimeoutError,
        APIConnectionError, InternalServerError,
    ),
    # NEVER: Exception, BaseException, AuthenticationError,
    # BadRequestError, ValidationError
)
```

The whitelist is **only transient provider errors**. `AuthenticationError`,
`BadRequestError`, and `ValidationError` are bugs in your code/credentials —
fallback produces the same crash. See the sibling skill's reference
`langchain-sdk-patterns/references/fallback-exception-list.md` for the full
per-provider whitelist (Anthropic, OpenAI, Gemini).
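
The effect of the whitelist can be seen in a stripped-down model of fallback dispatch. Everything in this sketch is hypothetical: `TransientError` and `AuthError` stand in for provider SDK exceptions, and `call_with_fallback` is a toy function, not LangChain's `with_fallbacks` implementation.

```python
class TransientError(Exception):
    """Hypothetical stand-in for rate-limit, timeout, or 5xx errors."""

class AuthError(Exception):
    """Hypothetical stand-in for a non-transient credential error."""

def call_with_fallback(primary, backup, exceptions_to_handle):
    """Run primary; on a whitelisted exception, run backup instead."""
    try:
        return primary()
    except exceptions_to_handle:
        return backup()

def rate_limited():
    raise TransientError("429")

def bad_key():
    raise AuthError("401")

def backup():
    return "backup answer"

# Transient failure: the backup is used.
print(call_with_fallback(rate_limited, backup, (TransientError,)))

# Non-transient failure: the narrow tuple lets it surface immediately.
try:
    call_with_fallback(bad_key, backup, (TransientError,))
except AuthError as exc:
    print("raised:", exc)

# The broad tuple hides the same bug behind a backup call.
print(call_with_fallback(bad_key, backup, (Exception,)))
```

With the narrow tuple the credential bug is visible on the first call; with `(Exception,)` it is masked and every request pays for a backup invocation.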

### Step 6 — `max_retries=2`, never the default `max_retries=6`

`max_retries` is **retries, not attempts.** Default `max_retries=6` on
`ChatOpenAI` / `ChatAnthropic` means **initial + 6 retries = 7 billed requests**
per logical call (P30). On a flaky network, one `.invoke()` costs 7x what you
budgeted.
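
The retries-versus-attempts arithmetic can be verified with a toy retry loop. This is a sketch with no backoff and a hypothetical always-failing call; it models only the counting, not how any provider SDK schedules its retries.

```python
def call_with_retries(send, max_retries):
    """Try once, then retry up to max_retries times; return the attempt count."""
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        try:
            send()
            break
        except ConnectionError:
            continue
    return attempts

def always_fails():
    raise ConnectionError("flaky network")

print(call_with_retries(always_fails, max_retries=6))  # 7
print(call_with_retries(always_fails, max_retries=2))  # 3
```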

```python
from langchain_openai import ChatOpenAI

# BAD: relies on the default
llm = ChatOpenAI(model="gpt-4o")  # max_retries=6

# GOOD: production default
llm = ChatOpenAI(
    model="gpt-4o",
    max_retries=2,      # initial + 2 retries = 3 requests at most
)
```

Files: 6

Size: 49.4 KB

Complexity: 57/100

Category: Backend & APIs

Source: https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/langchain-py-pack/skills/langchain-rate-limits
