langchain-data-handling

Load and chunk documents for LangChain 1.0 RAG pipelines correctly — language-aware splitters, table-safe PDF loaders, Cloudflare-compatible web loaders, chunk-boundary strategies that survive real-world structure. Use when building a RAG pipeline, diagnosing why retrieval misquotes a table, or debugging a crawler returning blank content. Trigger with "langchain document loader", "text splitter", "chunking strategy", "pdf loader", "markdown splitter", "webbaseloader".

# LangChain Data Handling — Loaders and Splitters (Python)

## Overview

You have a RAG system over a Python docs site. A user asks "what does
`trim_messages` do?" and the retriever returns this chunk:

```
### `trim_messages(strategy="last", include_system=True)`

Trim a message history to fit a token budget. The newest messages are kept;
older messages are dropped. Pass `include_system=True` to preserve the system
```

...and that's it. The chunk ends there. The code example showing the function
body — the actual thing the user wanted — is in a **different** chunk, retrieved
with a lower similarity score and dropped before the LLM sees it. The model
then hallucinates the function's behavior from the signature alone.

This is pain-catalog entry **P13**. `RecursiveCharacterTextSplitter`'s default
separators are `["\n\n", "\n", " ", ""]`. It tries blank lines first, so a chunk
boundary can land on a blank line **inside** a triple-backtick code fence in
Markdown. The fix is a one-line swap to
`RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)`, which tries
heading and fence boundaries before blank lines, so a fenced block that fits
within `chunk_size` stays in one chunk. But you have to know the bug exists.
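To see why separator order matters, here is a toy recursive splitter in plain Python. It is a simplified sketch, not LangChain's implementation: separators are treated as regular expressions, there is no overlap, and `MARKDOWN_LIKE` is an abbreviated, assumed stand-in for a Markdown-aware separator list. The names `toy_split`, `fence_is_broken`, and `DOC` are hypothetical.

```python
import re

DEFAULT = ["\n\n", "\n", " ", ""]
# Assumed, abbreviated stand-in for a Markdown-aware list: headings come first.
MARKDOWN_LIKE = [r"\n#{1,6} ", "\n\n", "\n", " ", ""]


def toy_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator that occurs, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep and re.search(sep, text):
            break
    else:
        # No usable separator left: hard cut every chunk_size characters.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    # Zero-width lookahead keeps each separator at the start of the next piece.
    pieces = [p for p in re.split(f"(?={sep})", text) if p]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            # Oversized piece: retry with the lower-priority separators only.
            chunks.extend(toy_split(piece, separators[i + 1:], chunk_size))
        elif len(current) + len(piece) <= chunk_size:
            current += piece  # greedily merge small neighbours
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def fence_is_broken(chunk: str) -> bool:
    """True when a chunk holds an odd number of ``` fence lines."""
    fences = [ln for ln in chunk.splitlines() if ln.lstrip().startswith("```")]
    return len(fences) % 2 == 1


DOC = (
    "# Intro\n\n"
    "Some introductory prose about the function.\n\n"
    "## Example\n\n"
    "```python\n"
    "def trim(msgs):\n"
    "    kept = msgs[-3:]\n"
    "\n"
    "    return kept\n"
    "```\n"
)

for name, seps in (("default", DEFAULT), ("markdown-like", MARKDOWN_LIKE)):
    chunks = toy_split(DOC, seps, chunk_size=120)
    print(name, [fence_is_broken(c) for c in chunks])
```

With the default list the greedy merge fills the first chunk up to the blank line inside the function body, so both chunks end up with half a fence. With headings tried first, the `## Example` section stays whole. `fence_is_broken` also works as a quick audit over the chunks of an existing index.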

The sibling failures this skill prevents:

- **P49**: `PyPDFLoader` emits one `Document` per page. A 5-row financial table
  that spans a page break gets torn in half: rows 1-3 land in one chunk, rows
  4-5 in another with no header. A RAG answer sourced from the second chunk
  misquotes the numbers because the column meanings are in the first chunk.
  Fix: use `PyMuPDFLoader` or `UnstructuredPDFLoader`, which can detect tables
  and emit them as distinct structured elements when configured to (for
  example `mode="elements"` on `UnstructuredPDFLoader`).
- **P50**: without a custom header, `WebBaseLoader` sends a generic,
  non-browser User-Agent such as `python-requests/2.x` (the exact default
  varies by `langchain-community` version). Cloudflare-protected sites flag
  this as a bot and return a **403 interstitial HTML page** ("Checking your
  browser...") instead of real content. The crawler indexes the challenge
  page. You notice weeks later when every retrieval from that source returns
  the same Cloudflare text. Fix: set a realistic
  `header_template={"User-Agent": "Mozilla/5.0 ..."}`, respect `robots.txt`,
  and rate-limit per-host to 1 req/sec.
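
A cheap guard against P50 is to scan loaded text for challenge-page markers before indexing. The sketch below is a heuristic under stated assumptions: the marker list and the `max_chars` threshold are illustrative guesses, not an exhaustive Cloudflare signature, and `looks_like_challenge_page` and `drop_challenge_pages` are hypothetical helpers that work on plain strings such as a document's `page_content`.

```python
# Illustrative markers: "checking your browser" is the text quoted above,
# the other two are assumed common interstitial phrases.
CHALLENGE_MARKERS = (
    "checking your browser",
    "just a moment",
    "attention required",
)


def looks_like_challenge_page(text: str, max_chars: int = 2000) -> bool:
    """Flag short pages that contain a known interstitial phrase."""
    if len(text) > max_chars:
        # Assumption: a real article is much longer than a challenge page.
        return False
    lowered = text.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def drop_challenge_pages(pages: list[str]) -> list[str]:
    """Keep only the pages that do not look like a bot challenge."""
    return [page for page in pages if not looks_like_challenge_page(page)]
```

Running this at ingestion time turns a failure you would notice weeks later into one you can log on the day of the crawl.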

Pinned versions: `langchain-core 1.0.x`, `langchain-community 1.0.x`,
`langchain-text-splitters 1.0.x`, `pymupdf`, `unstructured`.
Pain-catalog anchors: P13, P49, P50, P15.

This skill is the **upstream half** of the RAG pipeline — load and chunk.
For the downstream half (embedding, scoring, reranking) see the pair skill
`langchain-embeddings-search`, which covers score semantics (P12), dim guards
(P14), and reranker filtering (P15). Do not re-implement chunking there.

## Prerequisites

- Python 3.10+
- `langchain-core >= 1.0, < 2.0` and `langchain-community >= 1.0, < 2.0`
- `langchain-text-splitters >= 1.0, < 2.0`
- PDF support: `pip install pymupdf "unstructured[pdf]"` (the quotes keep shells such as zsh from expanding the brackets)
- Web loading: `pip install beautifulsoup4 requests`
- For corpus dedup (optional): `pip install datasketch`
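
A quick way to confirm the version pins before debugging anything else is to read the installed versions with the standard library. This is a minimal sketch: `is_1x` and `check_prerequisites` are hypothetical helper names, and the range check looks at the major version only.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_1X = ("langchain-core", "langchain-community", "langchain-text-splitters")


def is_1x(version_string: str) -> bool:
    """True for versions >= 1.0 and < 2.0, judged by the major component only."""
    major = version_string.split(".")[0]
    return major.isdigit() and int(major) == 1


def check_prerequisites(packages: tuple[str, ...] = REQUIRED_1X) -> dict[str, str]:
    """Map each package name to 'ok', 'missing', or an out-of-range note."""
    report = {}
    for name in packages:
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if is_1x(found) else f"out of range ({found})"
    return report


if __name__ == "__main__":
    for package, status in check_prerequisites().items():
        print(f"{package}: {status}")
```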

## Instructions

### Step 1 — Choose a loader by source format

Loader selection is the first decision — get it wrong and no amount of
splitter tuning will recover. Use the decision table:

| Source | Use | NOT | Why |
|---|---|---|---|
| PDF with tables | `PyMuPDFLoader` or `UnstructuredPDFLoader` | `PyPDFLoader` | Tables torn by page splits (P49) |
| PDF text-only | `PyPDFLoader` | — | Simple, fast, OK when no tables |
| Web page | `WebBaseLoader(header_template=...)` | Default UA | Cloudflare 403 (P50) |
| Markdown docs | `UnstructuredMarkdownLoader` | Plain text read | Preserves heading structure |
| HTML long-form | `WebBaseLoader` + `HTMLHeaderTextSplitter` | Plain text | Keeps `<h1>`/`<h2>` context |
| Code repo | `GenericLoader` with language parser | `DirectoryLoader` as text | Language-aware chunking |
| Corpus (1000+ docs) | `DirectoryLoader` + `glob` filter | One-by-one | Parallel load with `use_multithreading=True`, progress bar |

```python
from langchain_community.document_loaders import (
    PyMuPDFLoader,            # table-aware PDF
    WebBaseLoader,            # web pages (set custom UA)
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

# PDF with tables — P49 fix
pdf_docs = PyMuPDFLoader("10-Q-filing.pdf").load()

# Web page — P50 fix
web_docs = WebBaseLoader(
    "https://example.com/article",
    header_template={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    },
).load()

# Markdown docs site
md_docs = UnstructuredMarkdownLoader("docs/guide.md").load()

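# Corpus: one loader instance per matching file
corpus = DirectoryLoader(
    "./docs", glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,        # progress bar (requires tqdm)
    use_multithreading=True,   # load files in parallel; off by default
).load()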
```
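
The P50 fix also calls for honoring `robots.txt` and limiting each host to roughly one request per second. A minimal standard-library sketch of both checks follows. `PerHostRateLimiter` and `allowed_by_robots` are hypothetical helpers, fetching the `robots.txt` body is left to the caller, and the injectable `clock` and `sleep` parameters exist only to make the limiter testable.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PerHostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before hitting url's host; return the delay applied."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Call `limiter.wait(url)` immediately before each load and skip any URL for which `allowed_by_robots` returns `False`.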

Hard limit: keep single-PDF ingestion under **5 MB** per call. Larger files
should be pre-split with `pdftk` / `qpdf` to avoid OOM on `PyMuPDFLoader`'s
full-document parse.
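
The size limit is easy to enforce before a loader ever opens the file. A minimal sketch, assuming "5 MB" means 5 × 1024 × 1024 bytes; `needs_presplit` and `check_pdf_size` are hypothetical helper names.

```python
import os

# Assumption: "5 MB" is read as 5 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 5 * 1024 * 1024


def needs_presplit(size_bytes: int, limit: int = MAX_PDF_BYTES) -> bool:
    """True when a file of this size should be split before loading."""
    return size_bytes > limit


def check_pdf_size(path: str, limit: int = MAX_PDF_BYTES) -> int:
    """Return the file size in bytes, or raise if it exceeds the limit."""
    size = os.path.getsize(path)
    if needs_presplit(size, limit):
        raise ValueError(
            f"{path} is {size} bytes; pre-split it before loading (limit {limit})"
        )
    return size
```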

See [Loader Selection Matrix](references/loader-selection-matrix.md) for the
full per-format table with cost and accuracy notes.

### Step 2 — Pick a splitter by content type

| Content | Splitter | chunk_size | chunk_overlap | Why |
|---|---|---|---|---|
| Prose (docs, articles) | `RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)` | 1000 | 100 | Preserves code fences (P13) |
| Python source | `RecursiveCharacterTextSplitter.from_language(Language.PYTHON)` | 1500 | 150 | Splits at `def`/`class` |
| FAQ / Q&A | `RecursiveCharacterTextSplitter` with `separators=["\n\n"]` | 500 | 50 | One chunk per Q-A pair |
| HTML long-form | `HTMLHeaderTextSplitter` | — | — | Headers become metadata |
| Generic text | `RecursiveCharacterTextSplitter` | 1000 | 100 | Safe default |

```python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
    HTMLHeaderTextSplitter,
)

# GOOD — P13 fix for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
)

# GOOD — Python code
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=1500, chunk_overlap=150,
)

# GOOD — HTML long-form with heading-as-metadata
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)

# BAD — breaks inside code fences (P13)
bad = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
```

See [Language-Aware Splitters](references/language-aware-splitters.md) for the
full list of `Language.*` enum values, custom separator patterns, and the
code-fence-detection regex for when you need a custom splitter.
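
When a fence must never be split, one option is to segment the Markdown into prose and code spans first and run a splitter over the prose only. The pattern below is an assumed sketch, not the regex from the referenced file: it treats a line starting with three backticks as a fence opener and the next such line as its closer, and it ignores tilde fences and longer backtick runs. `segment_markdown` is a hypothetical name.

```python
import re

# Assumed pattern: opener line, lazily matched body, closer line.
FENCE_RE = re.compile(r"^```[^\n]*\n.*?^```[ \t]*$", re.MULTILINE | re.DOTALL)


def segment_markdown(text: str) -> list[tuple[str, str]]:
    """Return ("prose", ...) and ("code", ...) segments in document order."""
    segments = []
    pos = 0
    for match in FENCE_RE.finditer(text):
        if match.start() > pos:
            segments.append(("prose", text[pos:match.start()]))
        segments.append(("code", match.group(0)))
        pos = match.end()
    if pos < len(text):
        segments.append(("prose", text[pos:]))
    return segments


sample = "Intro line.\n\n```python\nx = 1\n\ny = 2\n```\n\nOutro.\n"
for kind, body in segment_markdown(sample):
    print(kind, repr(body))
```

An unclosed fence does not match the pattern, so that text falls through as prose. Decide whether that is acceptable for your corpus before relying on it.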

### Step 3 — Tune chunk_size and overlap

Defaults from the table work for most corpora. Both settings count characters
unless you pass a custom `length_function`. Tune when:

- **Retrieval misses context**: increase `chunk_size` (1000 → 1500) or
  `chunk_overlap` (100 → 200). Overlap is what bridges a concept that crosses
  chunk boundaries.
- **Retrieval too broad, answers wander**: decrease `chunk_size` (1000 → 500).
  Smaller chunks = more precise retrieval but more chunks to index.
- **Tables / structured data**: do NOT tune — index them separately (step 4).

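An overlap-to-size ratio of 1% (200/20000) is too low to bridge anything.
10-20% is the sweet spot for most prose: the table defaults sit at 10%, and
raising overlap from 100 to 200 on a 1000-character chunk gives 20%. Code
rarely needs more than 10% because function boundaries are natural splits.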
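
The trade-off between overlap and index size can be estimated with simple arithmetic. The sketch below assumes idealized fixed windows that advance by `chunk_size - chunk_overlap`. Real splitters break at separators, so actual counts will differ. `overlap_ratio` and `estimate_chunks` are hypothetical helpers.

```python
import math


def overlap_ratio(chunk_size: int, chunk_overlap: int) -> float:
    """Fraction of each chunk that is shared with its neighbour."""
    return chunk_overlap / chunk_size


def estimate_chunks(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Idealized chunk count for fixed windows advancing by size - overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((total_chars - chunk_size) / step) + 1


for size, overlap in ((1000, 100), (1000, 200), (500, 50)):
    print(size, overlap, overlap_ratio(size, overlap),
          estimate_chunks(100_000, size, overlap))
```

Under these assumptions a 100,000-character corpus needs 111 windows at 1000/100 and 125 at 1000/200, so doubling the overlap costs roughly 13% more chunks to embed and store.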

### Step 4 — Detect and index tables as structured records

Tables are **not** text. If your corpus has financial filings, product specs,
or any tabular data, index tables as separate records with column metadata:

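```python
import fitz  # PyMuPDF, used directly for table detection

def extract_tables_as_records(pdf_path: str) -> list[dict]:
    """Extract tables as one record per row."""
    doc = fitz.open(pdf_path)
    records = []
    for page_num, page in enumerate(doc):
        # find_tables() returns a TableFinder; its .tables attribute is the list.
        for table_idx, table in enumerate(page.find_tables().tables):
            rows = table.extract()
            if not rows:
                continue
            headers = rows[0]  # first row is treated as the header row
            for row_idx, row in enumerate(rows[1:], start=1):
                record = {
                    "page": page_num,  # zero-based page index
                    "table_idx": table_idx,
                    "row_idx": row_idx,
                    # Assumption: the original listing is truncated after
                    # `f"{h}: {`; pairing each header with its cell value
                    # is a reconstruction.
                    "content": " | ".join(
                        f"{h}: {v}" for h, v in zip(headers, row)
                    ),
                }
                records.append(record)
    doc.close()
    return records
```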
