langchain-data-handling
Load and chunk documents for LangChain 1.0 RAG pipelines correctly — language-aware splitters, table-safe PDF loaders, Cloudflare-compatible web loaders, chunk-boundary strategies that survive real-world structure. Use when building a RAG pipeline, diagnosing why retrieval misquotes a table, or debugging a crawler returning blank content. Trigger with "langchain document loader", "text splitter", "chunking strategy", "pdf loader", "markdown splitter", "webbaseloader".
# LangChain Data Handling — Loaders and Splitters (Python)
## Overview
You have a RAG system over a Python docs site. A user asks "what does
`trim_messages` do?" and the retriever returns this chunk:
```
### `trim_messages(strategy="last", include_system=True)`
Trim a message history to fit a token budget. The newest messages are kept;
older messages are dropped. Pass `include_system=True` to preserve the system
```
...and that's it. The chunk ends there. The code example showing the function
body — the actual thing the user wanted — is in a **different** chunk, retrieved
with a lower similarity score and dropped before the LLM sees it. The model
then hallucinates the function's behavior from the signature alone.
This is pain-catalog entry **P13**. `RecursiveCharacterTextSplitter`'s default
separators are `["\n\n", "\n", " ", ""]`, so every blank line is a candidate
split point, including blank lines **inside** triple-backtick code fences in
Markdown. The fix is a one-line swap to
`RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)`, which tries
heading and code-fence boundaries before blank lines, so a fenced block is far
less likely to be cut in half. You have to know the bug exists to make the swap.
The sibling failures this skill prevents:
- **P49** — `PyPDFLoader` splits by page. A 5-row financial table that spans
a page break gets torn in half; rows 1-3 go in one chunk, rows 4-5 in another
with no header. A RAG answer sourced from the second chunk misquotes the
numbers because the column meanings are in the first chunk. Fix: use
`PyMuPDFLoader` or `UnstructuredPDFLoader`, which detect tables and emit
them as distinct structured elements.
- **P50** — `WebBaseLoader`'s default User-Agent is `python-requests/2.x`.
Cloudflare-protected sites flag this as a bot and return a **403 interstitial
HTML page** ("Checking your browser...") instead of real content. The crawler
indexes the challenge page. You notice weeks later when every retrieval from
that source returns the same Cloudflare text. Fix: set a realistic
`header_template={"User-Agent": "Mozilla/5.0 ..."}`, respect `robots.txt`,
and rate-limit per-host to 1 req/sec.
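
For P50, a cheap guard is to test each fetched page for challenge-page text before it reaches the index. The sketch below is a heuristic, not part of LangChain: `looks_like_challenge_page` and `CHALLENGE_MARKERS` are hypothetical names, and the marker phrases are assumptions you should replace with what your crawler actually sees.

```python
import re

# Assumed marker phrases; real interstitials vary by site and over time.
CHALLENGE_MARKERS = (
    "checking your browser",
    "just a moment",
    "attention required",
)


def looks_like_challenge_page(status_code: int, html: str) -> bool:
    """Return True when a response is probably a bot-challenge interstitial."""
    # Blocked or throttled responses are never real content.
    if status_code in (403, 429, 503):
        return True
    # Strip tags crudely so markers are matched against visible text only.
    visible = re.sub(r"<[^>]+>", " ", html).lower()
    return any(marker in visible for marker in CHALLENGE_MARKERS)
```

Run the check on the raw response (or on each loaded document's `page_content` with a status of 200) and skip or re-queue anything it flags, rather than discovering the problem at retrieval time.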
Pinned versions: `langchain-core 1.0.x`, `langchain-community 1.0.x`,
`langchain-text-splitters 1.0.x`, `pymupdf`, `unstructured`.
Pain-catalog anchors: P13, P49, P50, P15.
This skill is the **upstream half** of the RAG pipeline — load and chunk.
For the downstream half (embedding, scoring, reranking) see the pair skill
`langchain-embeddings-search`, which covers score semantics (P12), dim guards
(P14), and reranker filtering (P15). Do not re-implement chunking there.
## Prerequisites
- Python 3.10+
- `langchain-core >= 1.0, < 2.0` and `langchain-community >= 1.0, < 2.0`
- `langchain-text-splitters >= 1.0, < 2.0`
- PDF support: `pip install pymupdf unstructured[pdf]`
- Web loading: `pip install beautifulsoup4 requests`
- For corpus dedup (optional): `pip install datasketch`
## Instructions
### Step 1 — Choose a loader by source format
Loader selection is the first decision — get it wrong and no amount of
splitter tuning will recover. Use the decision table:
| Source | Use | NOT | Why |
|---|---|---|---|
| PDF with tables | `PyMuPDFLoader` or `UnstructuredPDFLoader` | `PyPDFLoader` | Tables torn by page splits (P49) |
| PDF text-only | `PyPDFLoader` | — | Simple, fast, OK when no tables |
| Web page | `WebBaseLoader(header_template=...)` | Default UA | Cloudflare 403 (P50) |
| Markdown docs | `UnstructuredMarkdownLoader` | Plain text read | Preserves heading structure |
| HTML long-form | `WebBaseLoader` + `HTMLHeaderTextSplitter` | Plain text | Keeps `<h1>`/`<h2>` context |
| Code repo | `GenericLoader` with language parser | `DirectoryLoader` as text | Language-aware chunking |
| Corpus (1000+ docs) | `DirectoryLoader` + `glob` filter | One-by-one | Parallel load, progress |
```python
from langchain_community.document_loaders import (
    PyMuPDFLoader,  # table-aware PDF
    WebBaseLoader,  # web pages (set custom UA)
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

# PDF with tables (P49 fix)
pdf_docs = PyMuPDFLoader("10-Q-filing.pdf").load()

# Web page (P50 fix)
web_docs = WebBaseLoader(
    "https://example.com/article",
    header_template={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    },
).load()

# Markdown docs site
md_docs = UnstructuredMarkdownLoader("docs/guide.md").load()

# Corpus
corpus = DirectoryLoader(
    "./docs",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
).load()
```
Hard limit: keep single-PDF ingestion under **5 MB** per call. Larger files
should be pre-split with `pdftk` / `qpdf` to avoid OOM on `PyMuPDFLoader`'s
full-document parse.
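
A pre-flight size check keeps oversized files away from the loader. This is a minimal sketch: `exceeds_limit`, `needs_presplit`, and `MAX_PDF_BYTES` are hypothetical names, and reading "5 MB" as 5 MiB is an assumption.

```python
import os

MAX_PDF_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 MiB


def exceeds_limit(size_bytes: int, limit: int = MAX_PDF_BYTES) -> bool:
    """True when a file of this size should be pre-split before loading."""
    return size_bytes > limit


def needs_presplit(pdf_path: str) -> bool:
    """Check a file on disk against the ingestion limit."""
    return exceeds_limit(os.path.getsize(pdf_path))
```

Call `needs_presplit(path)` before constructing the loader and route flagged files to your `pdftk` / `qpdf` step.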
See [Loader Selection Matrix](references/loader-selection-matrix.md) for the
full per-format table with cost and accuracy notes.
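
The P50 fix also asks for `robots.txt` compliance and a 1 req/sec per-host limit. One way to sketch both with the standard library is shown below. `PerHostThrottle` and `allowed_by_robots` are hypothetical helpers, not LangChain APIs, and the injectable `clock` and `sleep` arguments exist only to make the class testable.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PerHostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the host of `url` may be hit again; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Call `throttle.wait(url)` immediately before each fetch, and skip any URL for which `allowed_by_robots` returns `False`.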
### Step 2 — Pick a splitter by content type
| Content | Splitter | chunk_size | chunk_overlap | Why |
|---|---|---|---|---|
| Prose (docs, articles) | `RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)` | 1000 | 100 | Preserves code fences (P13) |
| Python source | `RecursiveCharacterTextSplitter.from_language(Language.PYTHON)` | 1500 | 150 | Splits at `def`/`class` |
| FAQ / Q&A | `RecursiveCharacterTextSplitter` with `separators=["\n\n"]` | 500 | 50 | One chunk per Q-A pair |
| HTML long-form | `HTMLHeaderTextSplitter` | — | — | Headers become metadata |
| Generic text | `RecursiveCharacterTextSplitter` | 1000 | 100 | Safe default |
```python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
    HTMLHeaderTextSplitter,
)

# GOOD: P13 fix for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
)

# GOOD: Python code
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=1500, chunk_overlap=150,
)

# GOOD: HTML long-form with heading-as-metadata
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)

# BAD for Markdown: default separators split inside code fences (P13)
bad = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
```
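
In a mixed corpus, the table above can be turned into a lookup keyed by file extension so each file gets the right defaults. This is a sketch only: `SPLITTER_DEFAULTS` and `splitter_defaults_for` are hypothetical names, and mapping extensions to content types is an assumption that will not hold for every repository.

```python
from pathlib import Path

# (splitter label, chunk_size, chunk_overlap), values copied from the table.
# The extension-to-content-type mapping is an illustrative assumption.
SPLITTER_DEFAULTS = {
    ".md": ("markdown", 1000, 100),
    ".py": ("python", 1500, 150),
    ".html": ("html_header", None, None),  # header splitter takes no sizes
}
GENERIC_DEFAULTS = ("generic", 1000, 100)


def splitter_defaults_for(path: str) -> tuple:
    """Return the table defaults for a file, falling back to generic text."""
    return SPLITTER_DEFAULTS.get(Path(path).suffix.lower(), GENERIC_DEFAULTS)
```

The label is meant to select one of the splitters constructed above; the sizes feed its `chunk_size` and `chunk_overlap` arguments.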
See [Language-Aware Splitters](references/language-aware-splitters.md) for the
full list of `Language.*` enum values, custom separator patterns, and the
code-fence-detection regex for when you need a custom splitter.
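
To see what fence detection buys you, here is a stdlib-only sketch of a blank-line splitter that refuses to cut inside a fenced block. It is not LangChain's implementation and not the regex from the reference file: `split_outside_fences` and `FENCE` are hypothetical names, and the fence handling is simplified (it ignores indentation rules and info strings on closing fences).

```python
import re

# A fence is a run of three or more backticks or tildes at the start of a line.
FENCE = re.compile(r"^(`{3,}|~{3,})")


def split_outside_fences(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    open_fence = None  # the marker that opened the fence we are inside, if any

    for line in markdown.splitlines():
        match = FENCE.match(line.strip())
        if match:
            marker = match.group(1)
            if open_fence is None:
                open_fence = marker
            elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
                open_fence = None
        if not line.strip() and open_fence is None:
            # Blank line outside a fence: close the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
            continue
        current.append(line)

    if current:
        blocks.append("\n".join(current))
    return blocks


fence = "`" * 3  # built at runtime to keep this example's own fence intact
doc = (
    f"Intro paragraph.\n\n{fence}python\ndef f():\n    x = 1\n\n"
    f"    return x\n{fence}\n\nOutro."
)
naive = doc.split("\n\n")
aware = split_outside_fences(doc)
print(len(naive), len(aware))  # 4 3
```

The naive split produces four pieces because the blank line inside the function body counts as a boundary; the fence-aware version keeps the whole code block as one piece.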
### Step 3 — Tune chunk_size and overlap
Defaults from the table work for most corpora. Tune when:
- **Retrieval misses context**: increase `chunk_size` (1000 → 1500) or
`chunk_overlap` (100 → 200). Overlap is what bridges a concept that crosses
chunk boundaries.
- **Retrieval too broad, answers wander**: decrease `chunk_size` (1000 → 500).
Smaller chunks = more precise retrieval but more chunks to index.
- **Tables / structured data**: do NOT tune — index them separately (step 4).
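A 1% overlap-to-size ratio is too low (200/20000). For most prose, 10-20% is the
working range: the table defaults sit at 10%, and 20% is the next step when
concepts keep straddling chunk boundaries. Code can stay at 10% because function
boundaries are natural splits.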
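
The effect of overlap is easiest to see on fixed-size windows. The sketch below is a simplified model, not how `RecursiveCharacterTextSplitter` works internally (it splits on separators first); `overlap_ratio` and `window_chunks` are hypothetical helpers.

```python
def overlap_ratio(chunk_size: int, chunk_overlap: int) -> float:
    """Fraction of each chunk that is repeated from the previous one."""
    return chunk_overlap / chunk_size


def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows; each repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(overlap_ratio(20000, 200))          # 0.01
print(window_chunks("abcdefghij", 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
print(window_chunks("abcdefghij", 4, 0))  # ['abcd', 'efgh', 'ij']
```

With an overlap of 2, a phrase such as `"de"` that straddles the first boundary appears whole in the second window; with no overlap it is cut between chunks.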
### Step 4 — Detect and index tables as structured records
Tables are **not** text. If your corpus has financial filings, product specs,
or any tabular data, index tables as separate records with column metadata:
```python
import fitz  # pymupdf directly for table detection


def extract_tables_as_records(pdf_path: str) -> list[dict]:
    """Extract tables as one record per row."""
    doc = fitz.open(pdf_path)
    records = []
    for page_num, page in enumerate(doc):
        tables = page.find_tables()
        for table_idx, table in enumerate(tables.tables):
            rows = table.extract()
            if not rows:
                continue
            headers = rows[0]  # first row is treated as the header row
            for row_idx, row in enumerate(rows[1:], start=1):
                record = {
                    "page": page_num,
                    "table_idx": table_idx,
                    "row_idx": row_idx,
                    # Assumed completion from here on: pair each header with
                    # its cell so every row carries its own column names.
                    "content": " | ".join(
                        f"{h}: {v}" for h, v in zip(headers, row)
                    ),
                }
                records.append(record)
    return records
```