langchain-ci-integration

Wire LangChain 1.0 / LangGraph 1.0 tests into a GitHub Actions pipeline — unit tests with FakeListChatModel, VCR-gated integration tests, warning-filter policy, and eval-regression merge gates. Complements langchain-local-dev-loop (F23) which covers the inner loop; THIS covers the CI wire-up. Use when setting up GHA for a new LLM service, after a VCR cassette leak incident, or hardening an existing pipeline. Trigger with "langchain ci", "langchain github actions", "langchain test pipeline", "vcr ci", "langchain eval gate", "pytest -W error langchain".

# LangChain CI Integration (Python)

## Overview

A PR passes every test on your laptop. You push. GHA runs `pytest` and aborts
during collection — before a single test executes — with:

```
PytestUnraisableExceptionWarning: Exception ignored in: ...
DeprecationWarning: langchain_community.llms ...
```

The org runs `pytest -W error` and a provider SDK emitted a `DeprecationWarning`
at *import* time, which the warning filter promoted to an exception while pytest
was still walking the test tree. This is **P45** and it blocks every PR for the
team until someone pins a `filterwarnings` config.

Meanwhile the integration suite has its own failure mode: a VCR cassette
recorded three months ago at `temperature=0` against Anthropic is now flaking
against a snapshot. `temperature=0` is not deterministic on Claude — it still
nucleus-samples (**P05**) — so the cassette captured *one* valid completion, not
*the* valid completion. And yesterday a reviewer caught
`Authorization: Bearer sk-ant-...` in a cassette file that had been committed
six weeks ago (**P44**) because `vcrpy` records all request headers by default.

This skill covers the outer loop: the GitHub Actions workflow, the unit /
integration / eval gate separation, VCR cassette hygiene, pytest warning
policy, and a merge-blocking eval regression gate. The **inner** loop — fake
model fixtures, VCR recording workflow, local determinism tricks — lives in
`langchain-local-dev-loop` (F23); cross-reference it, do not duplicate it.
Pin: `langchain-core 1.0.x`, `langgraph 1.0.x`, `actions/checkout@v4`,
`actions/setup-python@v5`, `vcrpy 6.x`. Pain-catalog anchors: **P05, P43, P44, P45**.

## Prerequisites

- Python 3.10, 3.11, or 3.12 (matrix)
- `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0`
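- `pytest >= 8`, `pytest-asyncio`, `pytest-timeout` (provides the `--timeout` flag used below)
- `vcrpy >= 6` plus a pytest VCR plugin such as `pytest-recording` that consumes the `vcr_config` fixture (integration)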
- `langchain-local-dev-loop` (F23) applied locally — fixtures and recording workflow
- GitHub repo with Actions enabled; secrets set for any live-API nightly job

## Instructions

### Step 1 — GHA workflow skeleton with four jobs

Single workflow at `.github/workflows/tests.yml`. Matrix on unit only; keep
integration and eval single-version to control cost.

```yaml
name: tests

on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *"  # nightly live-API re-record check (06:00 UTC)

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: |
            pyproject.toml
            requirements*.txt
      - run: pip install -e ".[test]"
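      # No -W error on the command line: the error policy lives in
      # pyproject.toml (Step 2). pytest gives command-line -W precedence over
      # ini filterwarnings, so the flag would override the scoped ignores.
      - run: pytest tests/unit/ --timeout=30 -q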

  integration:
    needs: unit
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'run-integration')
    runs-on: ubuntu-latest
    env:
      RUN_INTEGRATION: "1"
      VCR_MODE: "none"  # replay only; nightly run overrides with "all" ("once" records only when a cassette is missing)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,integration]"
      # Error policy comes from pyproject.toml filterwarnings (Step 2), not -W.
      - run: pytest tests/integration/ --timeout=60 -q

  eval:
    needs: unit
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # need base ref for delta comparison
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,eval]"
      - run: python scripts/run_eval.py --baseline origin/${{ github.base_ref }} --head HEAD --n 100
      # run_eval.py posts a PR comment and exits nonzero on regression > threshold

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[dev]"
      - run: ruff check .
      - run: python scripts/dryrun_load_chains.py   # catches ImportError migration regressions
```

See [GHA Workflow Reference](references/github-actions-workflow.md) for the full
job definitions including the secret-injection pattern, the matrix caching
nuance, and the `softprops/action-gh-release`-style PR comment action used by
the eval job.
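
The gate logic inside `scripts/run_eval.py` is not shown here, so the following is only a minimal sketch of a threshold-based regression check. The function names, the use of a mean score, and the `0.02` default threshold are assumptions for illustration, not the script's actual interface.

```python
from statistics import mean


def regression(baseline_scores, head_scores):
    """Drop in mean eval score from baseline to head (positive = head is worse)."""
    return mean(baseline_scores) - mean(head_scores)


def gate_exit_code(baseline_scores, head_scores, threshold=0.02):
    """Return 1 (block the merge) when the drop exceeds `threshold`, else 0.

    `threshold` is a hypothetical default; tune it to the noise level of
    your eval set.
    """
    return 1 if regression(baseline_scores, head_scores) > threshold else 0


# Hypothetical usage at the end of an eval script:
#   sys.exit(gate_exit_code(baseline_scores, head_scores))
```

A nonzero exit code fails the `eval` job, which is what turns the comparison into a merge gate once the job is marked as a required check.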

### Step 2 — Unit job: `-W error` + `filterwarnings` to neutralize P45

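Root cause of the collection abort: pytest collects tests by importing them.
Some provider SDKs emit `DeprecationWarning` on import. With warnings promoted
to errors, those become exceptions during collection. Fix at the *filter*
level, not by abandoning the error policy (which would mask real warnings).
Declare the policy in `pyproject.toml` rather than on the command line: pytest
gives command-line `-W` options precedence over ini `filterwarnings` entries,
so a `-W error` flag would override the scoped ignores.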

In `pyproject.toml`:

```toml
[tool.pytest.ini_options]
filterwarnings = [
    "error",
    # P45 — neutralize known import-time noise; scoped per module so new
    # warnings from YOUR code still fail the build.
    "ignore::DeprecationWarning:langchain_community.*",
    "ignore::DeprecationWarning:pydantic.*",
    "ignore:Pydantic serializer warnings:UserWarning",
]
asyncio_mode = "auto"
testpaths = ["tests"]
```

The ordering matters — `"error"` first, specific `"ignore"` entries after, so
the filters override the global promote-to-error. Keep the list **narrow**: a
blanket `ignore::DeprecationWarning` hides regressions you need to see.
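
The ordering rule can be checked with the standard-library `warnings` module, which pytest's filters are built on. This is an illustrative sketch only: the `outcome` helper and the `myapp.chains` module name are hypothetical, and the demo calls `warnings` directly rather than going through pytest.

```python
import warnings


def outcome(filters, category, module):
    """Apply (action, category, module_regex) filters in list order, then emit
    one warning attributed to `module` and report what happened to it."""
    with warnings.catch_warnings():
        warnings.resetwarnings()
        for action, cat, module_regex in filters:
            # Each call inserts at the front of the filter list, so entries
            # later in `filters` are consulted first and take precedence.
            warnings.filterwarnings(action, category=cat, module=module_regex)
        try:
            warnings.warn_explicit(
                "deprecated API", category, "example.py", 1, module=module
            )
        except Warning:
            return "raised"
        return "ignored"


# Same shape as the pyproject.toml policy: global error, then a scoped ignore.
POLICY = [
    ("error", Warning, ""),
    ("ignore", DeprecationWarning, r"langchain_community.*"),
]

print(outcome(POLICY, DeprecationWarning, "langchain_community.llms"))
print(outcome(POLICY, DeprecationWarning, "myapp.chains"))
print(outcome(list(reversed(POLICY)), DeprecationWarning, "langchain_community.llms"))
```

The first call is ignored by the scoped entry, the second is raised because nothing narrower matches, and the third is raised because reversing the list lets the global `error` entry win.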

Unit tests use `FakeListChatModel` fixtures from F23 (do not redefine them
here). One CI-specific gotcha (**P43**): `FakeListChatModel` does not emit
`response_metadata["token_usage"]`, so any callback that asserts on token counts
will break. Either subclass the fake and inject `generation_info`, or gate the
assertion:

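```python
import pytest
from langchain_core.language_models.fake_chat_models import FakeListChatModel

def test_chain_uses_tokens(patched_chat_model):
    # Skip before invoking. Exact type check, so a token-injecting subclass still runs.
    if type(patched_chat_model) is FakeListChatModel:
        pytest.skip("fake model doesn't emit token_usage (P43)")
    result = chain.invoke({"input": "hi"})  # `chain`: the chain under test, built on the fixture
    assert result.response_metadata["token_usage"]["total_tokens"] > 0
```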

Budget: unit job should finish in **<2 minutes** across the 3-version matrix.
If it doesn't, something is probably calling out to a real provider. List the
slowest tests with `pytest tests/unit/ --durations=10` and audit which of them
lack fake-model fixtures.
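
One way to surface unit tests that reach for the network is to make outbound connections fail fast. The sketch below is an assumption-laden illustration, not part of F23 or this pipeline: `block_network` and `NetworkBlockedError` are hypothetical names, and the guard only covers code paths that go through `socket.socket.connect`.

```python
import socket
from contextlib import contextmanager


class NetworkBlockedError(RuntimeError):
    """Raised when code under test tries to open an outbound connection."""


@contextmanager
def block_network():
    """Temporarily replace socket.socket.connect with a guard that raises."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        raise NetworkBlockedError(
            f"unit test attempted a network connection to {address!r}"
        )

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Restore the real method even if the test body raised.
        socket.socket.connect = original_connect
```

Wrapped in an autouse fixture in `tests/unit/conftest.py`, a guard like this turns a slow accidental API call into an immediate, named failure.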

### Step 3 — Integration job: VCR replay + `filter_headers` (P44)

Integration tests replay pre-recorded VCR cassettes. Three rules:

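1. Gate the job. `if: contains(github.event.pull_request.labels.*.name, 'run-integration')` or `env.RUN_INTEGRATION == "1"`, plus a nightly cron that switches `VCR_MODE` to a recording mode and re-records against live APIs. Use `all` to overwrite existing cassettes; vcrpy's `once` records only when the cassette file does not exist yet. PRs default to pure replay.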
2. Enforce `filter_headers` at the fixture level — not per-test. A single `conftest.py` prevents any contributor from recording a cassette with raw credentials.
3. Pre-commit + CI both scan cassettes for leaked keys. Belt and suspenders.

Fixture (lives in `tests/integration/conftest.py`, owned by this skill's
pipeline concern — F23 owns the *recording* workflow):

```python
import os

import pytest

# Consumed by a pytest VCR plugin (e.g. pytest-recording) via @pytest.mark.vcr.
@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": [
            "authorization",
            "x-api-key",
            "anthropic-version",
            ("openai-organization", "REDACTED"),
        ],
        "filter_post_data_parameters": ["api_key"],
        # CI default is replay only; the workflow's VCR_MODE env overrides it.
        "record_mode": os.environ.get("VCR_MODE", "none"),
        "match_on": ["method", "scheme", "host", "port", "path", "query"],
    }
```
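
Rule 3, scanning cassettes for leaked keys, could be sketched as a small script shared by the pre-commit hook and a CI step. The regexes, the function names, and the assumption that cassettes are `*.yaml` files under a single directory are all illustrative; adjust them to the providers and layout you actually use.

```python
import re
from pathlib import Path

# Illustrative patterns only; extend them for the providers you record against.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_\-]{8,}"),               # Anthropic-style key prefix
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                     # OpenAI-style key prefix
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{16,}", re.IGNORECASE),
]


def find_leaks(text):
    """Return the 1-based line numbers in `text` that look like they hold a secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append(lineno)
    return hits


def scan_cassettes(root):
    """Map each cassette path under `root` to its suspicious line numbers."""
    leaks = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        hits = find_leaks(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            leaks[str(path)] = hits
    return leaks
```

A CI step can call `scan_cassettes` on the cassette directory and fail the job when the returned mapping is non-empty, which catches cassettes recorded before `filter_headers` was enforced.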

Integration suite must finish in **<5 minutes** wall-clock on the runner, or
you will start getting cancellation flakes from the `concurrency` block. If
you exceed 5 minutes, split into a nightly-only long tier.

See [Integration Gating](references/integration-gating.md) for the full
live-vs-replay decision tree, cost-per-run budget worksheet, and the
