
langchain-deploy-integration


Deploy a LangChain 1.0 / LangGraph 1.0 app to Cloud Run, Vercel, or LangServe correctly — timeouts sized for chain length, cold-start mitigation, SSE anti-buffering headers, Secret Manager over `.env`. Use when prepping first prod deploy, debugging a stream that hangs behind a proxy, or diagnosing p99 latency spikes. Trigger with "langchain deploy", "langchain cloud run", "langchain vercel python", "langchain langserve", "langchain docker".



# LangChain Deploy Integration (Python)

## Overview

An engineer ships a working LangGraph agent to Vercel. Every non-trivial request
returns `FUNCTION_INVOCATION_TIMEOUT`. The Python runtime on Vercel defaults to
a **10-second** cap (P35) — a three-tool agent with one RAG round easily runs
20-40s. Local dev never exposed the wall because `uvicorn` on a laptop has no
timeout. Two fixes apply together and each is load-bearing:

`vercel.json` (the baseline cap bump; Pro plan max is 60s, Enterprise 900s):

```json
{ "functions": { "api/chat.py": { "maxDuration": 60 } } }
```

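```python
# app/api/chat.py: stream the response so partial output arrives before the cap
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    input: str  # assumed request shape

# `chain` is assumed to be a chat-model Runnable built elsewhere in the app;
# its chunks are pydantic message chunks, hence model_dump_json().

@app.post("/api/chat")
async def chat(req: ChatRequest):
    async def gen():
        async for chunk in chain.astream(req.input):
            yield f"data: {chunk.model_dump_json()}\n\n"
    return StreamingResponse(gen(), media_type="text/event-stream",
                             headers={"X-Accel-Buffering": "no"})
```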

The `maxDuration: 60` raises the Vercel-imposed wall; streaming reduces
time-to-first-byte to under a second so the user sees progress even on a
40-second completion. Once the Vercel cap is fixed, the next three walls are:
Cloud Run cold starts (**5-15s** p99 on Python + LangChain, P36), `.env`
secrets leaking via `docker exec <container> env` (P37), and SSE streams hanging
because Nginx / Cloud Run buffer the final chunk (P46).
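
The sizing argument above can be sketched as a quick budget check. This is a rough illustration rather than a measurement tool: the function name, the per-step latency inputs, and the 1.5x headroom factor are all assumptions, and the caps are simply the figures quoted in this skill.

```python
# Platform caps in seconds, as quoted in this skill.
PLATFORM_CAPS = {
    "vercel-hobby": 10,
    "vercel-pro": 60,
    "vercel-enterprise": 900,
    "cloud-run": 3600,
}

def platforms_that_fit(step_latencies_s, headroom=1.5):
    """Return platforms whose cap covers the summed step latency plus headroom.

    step_latencies_s: estimated worst-case seconds per chain step (assumed inputs).
    headroom: safety multiplier for retries and slow tool calls (assumed value).
    """
    budget = sum(step_latencies_s) * headroom
    return [name for name, cap in PLATFORM_CAPS.items() if cap >= budget]

# Three tool calls plus one RAG round: 30s of work, 45s with headroom.
print(platforms_that_fit([8, 8, 8, 6]))
# → ['vercel-pro', 'vercel-enterprise', 'cloud-run']
```

Feeding in measured p99 step latencies instead of guesses makes the result more trustworthy; the point is only that the cap should be chosen from the chain's length, not discovered in production.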

This skill walks through a production-grade multi-stage Dockerfile, Cloud Run
flags for cold-start mitigation, Vercel `maxDuration` + streaming, LangServe
route mounting with FastAPI lifespan, SSE anti-buffering headers, and Secret
Manager via `pydantic.SecretStr`. Pin: `langchain-core 1.0.x`, `langgraph 1.0.x`,
`langserve 1.0.x`. Pain-catalog anchors: **P35** (Vercel 10s default),
**P36** (Cloud Run cold start), **P37** (`.env` leaks), **P46** (SSE buffering).

## Prerequisites

- Python 3.11+ (3.12 preferred for `uvicorn` startup speed)
- `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0`, `langserve >= 1.0, < 2.0`
- `fastapi >= 0.110`, `uvicorn[standard] >= 0.27`
- Target platform: `gcloud` CLI (Cloud Run), `vercel` CLI (Vercel), or `docker` (generic)
- For Cloud Run: a GCP project with Secret Manager API enabled
- For Vercel: a project with `@vercel/python` runtime configured
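
The version pins above can be verified at startup or in CI. The following is a minimal sketch using only the standard library; `check_pins` and `major_version` are hypothetical helper names, and the check looks at the major version only, which is an assumption about how strict the pin needs to be.

```python
from importlib import metadata

PINNED = ("langchain-core", "langgraph", "langserve")

def major_version(version: str) -> int:
    """Extract the leading major number from a version string such as '1.0.3'."""
    return int(version.split(".", 1)[0])

def check_pins(packages=PINNED, required_major=1):
    """Return {package: problem} for packages that are missing or off the 1.x line."""
    problems = {}
    for name in packages:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if major_version(installed) != required_major:
            problems[name] = f"found {installed}, expected {required_major}.x"
    return problems

if __name__ == "__main__":
    for package, problem in check_pins().items():
        print(f"{package}: {problem}")
```

An empty result means every pinned package is installed on the expected major line.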

## Instructions

### Step 1 — Multi-stage Dockerfile with slim runtime and `uvicorn`

A multi-stage build keeps the runtime image under 400MB, which cuts Cloud Run
cold starts by 2-3 seconds. Use `python:3.12-slim` as the final stage (not
`python:3.12` — that base adds ~900MB for dev tooling that never runs in prod).

```dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.12-slim AS builder
WORKDIR /build
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv export --format requirements-txt --no-hashes > requirements.txt \
 && pip wheel --wheel-dir=/wheels -r requirements.txt

FROM python:3.12-slim AS runtime
RUN useradd -m -u 10001 app
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links=/wheels /wheels/* \
 && rm -rf /wheels
COPY --chown=app:app app/ ./app/
USER app
EXPOSE 8080
ENV PORT=8080 PYTHONUNBUFFERED=1 PYTHONDONTWRITEBYTECODE=1
# `exec` replaces the shell so uvicorn receives SIGTERM directly and can shut down cleanly.
CMD ["sh", "-c", "exec uvicorn app.main:app --host 0.0.0.0 --port ${PORT} --workers 1"]
```

Single worker is correct — Cloud Run handles horizontal scale; in-process
multi-worker just duplicates LangChain client memory. See [Dockerfile and
Secrets](references/dockerfile-and-secrets.md) for the distroless variant
and the `.dockerignore` hardening for `.env` files.

### Step 2 — Deploy to Cloud Run with cold-start mitigation

Python + LangChain + `tiktoken` + one embedding model imports take 5-15
seconds (P36). At `--min-instances=0`, every scale-from-zero request eats that
as user-facing latency. Paying for one always-on instance is usually cheaper
than the lost requests.

```bash
gcloud run deploy langchain-api \
  --source=. \
  --region=us-central1 \
  --min-instances=1 \
  --max-instances=20 \
  --cpu=2 --memory=2Gi \
  --cpu-boost \
  --no-cpu-throttling \
  --timeout=3600 \
  --concurrency=80 \
  --set-secrets=ANTHROPIC_API_KEY=anthropic-key:latest,OPENAI_API_KEY=openai-key:latest \
  --service-account=SERVICE_ACCOUNT_EMAIL
# --timeout=3600 is the Cloud Run per-request maximum (1 hour) — needed
# because multi-tool LangGraph agents routinely run 1-5 minutes end-to-end.
```

The load-bearing flags: `--min-instances=1` kills cold-start p99 (one always-warm
replica costs ~$15/mo and dominates p99 improvement); `--cpu-boost` doubles CPU
for the first 10 seconds; `--no-cpu-throttling` (CPU-always-allocated billing)
keeps `astream` running between keepalive pings so long LangGraph runs do not
stall at tool boundaries; `--concurrency=80` matches typical I/O-bound
workloads (drop to 10 if embedding large docs in-process).
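
Before paying for an always-on instance, it can help to see which imports dominate startup. The sketch below times a cold import per module. `time_import` is a hypothetical helper, and the modules in the usage loop are stand-ins for the app's real dependencies.

```python
import importlib
import sys
import time

def time_import(module_name: str) -> float:
    """Import a module and return the wall-clock seconds the import took.

    Only meaningful for modules not yet loaded in this process; an
    already-imported module returns 0.0 because there is nothing to measure.
    """
    if module_name in sys.modules:
        return 0.0
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Replace with the app's heaviest dependencies, e.g. langchain_core or tiktoken.
    for name in ("json", "decimal", "csv"):
        print(f"{name}: {time_import(name):.3f}s")
```

Run it in a fresh interpreter inside the built image so the numbers reflect the container rather than a warm laptop. CPython's `python -X importtime` gives a finer per-module breakdown if the coarse totals point at one package.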

See [Cloud Run Deploy](references/cloud-run-deploy.md) for VPC egress, file
secret mounts, revision traffic splitting, and the full cost model.

### Step 3 — Vercel Python: `maxDuration: 60` + streaming to beat the cap

Vercel's Hobby plan caps a function at **10s** by default (P35); Pro allows up
to **60s** and Enterprise up to **900s**. Always set `maxDuration` explicitly,
because the default is a trap.

In `vercel.json`:

```json
{
  "functions": {
    "api/chat.py": { "maxDuration": 60, "memory": 1024 }
  }
}
```
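
A CI check along these lines can catch a function entry that slipped through without an explicit `maxDuration`. This is a sketch under assumptions: the helper name is hypothetical, `api/health.py` is an invented example entry, and the check only inspects functions already listed in the config (it does not discover files on disk).

```python
import json

def functions_missing_max_duration(config_text: str):
    """Return function paths in a vercel.json document that set no maxDuration."""
    config = json.loads(config_text)
    functions = config.get("functions", {})
    return sorted(path for path, opts in functions.items()
                  if "maxDuration" not in opts)

example = """
{
  "functions": {
    "api/chat.py": { "maxDuration": 60, "memory": 1024 },
    "api/health.py": { "memory": 512 }
  }
}
"""
print(functions_missing_max_duration(example))  # → ['api/health.py']
```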

Streaming is not just a UX fix — it is the mitigation for bursts that still
exceed `maxDuration`. Time-to-first-byte under a second keeps the proxy
considering the request alive; partial content renders on the client; when
the cap finally triggers, the user has already seen most of the answer. The
Vercel entrypoint pattern mirrors the Overview snippet above — pair with the
SSE headers from Step 5.

Edge Runtime is **not** an option here — `@vercel/edge` is JavaScript-only.
Anything that imports `langchain` must run on `@vercel/python` (serverless,
Node-free Python container). See [Vercel Python Deploy](references/vercel-python-deploy.md)
for env vars vs Vercel Secrets, cold-start profiling, and the serverless vs
fluid-compute tradeoff.
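
One way to keep bytes flowing while an agent waits on a slow tool call is to interleave SSE comment lines, which `EventSource` clients ignore, with the real events. The sketch below is an illustration only: `with_keepalive` is a hypothetical helper (not part of LangChain or FastAPI), the 15-second default interval is an arbitrary assumption, and it presumes every proxy on the path forwards comment lines unbuffered.

```python
import asyncio
from typing import AsyncIterator

async def with_keepalive(source: AsyncIterator[str],
                         interval: float = 15.0) -> AsyncIterator[str]:
    """Yield items from `source`, emitting an SSE comment whenever it is idle.

    The pending __anext__() call is kept alive across timeouts instead of being
    cancelled, so a slow upstream step is never interrupted by the keepalive.
    """
    iterator = source.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": keepalive\n\n"  # SSE comment line, ignored by clients
                continue
            try:
                item = pending.result()
            except StopAsyncIteration:
                return
            yield item
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        pending.cancel()  # no-op if the task already finished

async def _demo():
    async def slow_events():
        yield "data: first\n\n"
        await asyncio.sleep(0.05)  # stands in for a slow tool call
        yield "data: last\n\n"
    return [frame async for frame in with_keepalive(slow_events(), interval=0.01)]

frames = asyncio.run(_demo())
print(frames[0], frames[-1])
```

In a handler, the wrapped iterator would be passed to `StreamingResponse` in place of the raw generator. Keepalives do not extend `maxDuration`; they only stop intermediaries from treating a quiet stream as dead.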

### Step 4 — LangServe: `add_routes` + FastAPI lifespan for pool cleanup

LangServe ships typed HTTP routes over any `Runnable`. The `playground` path
is invaluable in dev but **must be disabled in production** — it leaks chain
topology to anyone who can hit the URL. Mount behind a FastAPI `lifespan`
that closes `asyncpg` / `httpx` / Redis pools on revision retirement;
`on_shutdown` fires too late on Cloud Run and connections leak across
revisions.

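```python
# app/main.py
import os
from contextlib import asynccontextmanager
from fastapi import FastAPI
from langserve import add_routes

# build_chain() and db_pool are assumed to be defined elsewhere in the app.
chain = build_chain()  # build once, share between app.state and the routes

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.chain = chain
    yield
    await db_pool.close()  # runs at shutdown, when the revision is retired

app = FastAPI(lifespan=lifespan)

# __debug__ is True unless Python runs with -O, so it cannot gate production.
# APP_ENV is an assumed variable name; the default keeps the playground off.
is_prod = os.getenv("APP_ENV", "production") == "production"
add_routes(app, chain, path="/chat",
           enable_feedback_endpoint=False,
           playground_type="chat",
           disabled_endpoints=["playground"] if is_prod else None)
```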

See [LangServe Patterns](references/langserve-patterns.md) for typed input/output
schemas, auth middleware, and coexisting with raw FastAPI handlers.
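
For a quick smoke test of the mounted route, `add_routes(..., path="/chat")` exposes an invoke endpoint at `/chat/invoke` that accepts a JSON body with an `input` field. The sketch below only builds the request, using the standard library; the helper name, the base URL, and the `{"question": ...}` input shape are assumptions that depend on the chain's input schema.

```python
import json
import urllib.request

def build_invoke_request(base_url: str, chain_input, path: str = "/chat"):
    """Build (but do not send) a POST to the LangServe invoke endpoint for `path`."""
    body = json.dumps({"input": chain_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_invoke_request("http://localhost:8080", {"question": "ping"})
print(req.full_url)  # → http://localhost:8080/chat/invoke
```

Passing `req` to `urllib.request.urlopen` sends it; the JSON response carries the chain result under an `output` key.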

### Step 5 — SSE anti-buffering: survive Nginx, Cloud Run, Cloudflare

Nginx, Cloud Run's load balancer, and Cloudflare all buffer responses by
default. On SSE, buffering means the client never sees the final `end` event
and `LangGraph.astream` hangs forever (P46). Two headers plus one response
flush fix it:

```python
from fastapi.responses import StreamingResponse

def sse_headers() -> dict:
    return {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache, no-transform",
        "X-Accel-Buffering": "no",          # disables Nginx buffering
        "Connection": "keep-alive",
    }

@app.post("/api/chat/stream")
async def stream(payload: dict):
    async def gen():
        async for event in gra
```