langchain-observability
Wire LangSmith tracing and custom metric callbacks into a LangChain 1.0 chain or LangGraph 1.0 agent correctly — env-var spelling, subgraph propagation, per-tenant dimensions, cost and latency counters. Use when setting up observability on a new service, debugging blank traces in LangSmith, or adding per-tenant cost breakdowns. Trigger with "langchain observability", "langsmith tracing", "langchain callbacks", "langchain metrics".
# LangChain Observability (Python)
## Overview
An engineer sets `LANGCHAIN_TRACING_V2=true` and `LANGCHAIN_API_KEY=...` from
the 0.2 docs, restarts the service, and sees zero traces in LangSmith, with no
errors and no warnings. That is P26: in LangChain 1.0 the canonical env vars are
`LANGSMITH_TRACING` and `LANGSMITH_API_KEY`. The `LANGCHAIN_*` names are
soft-deprecated and fail silently on any chain that goes through 1.0 middleware
or `create_react_agent`. The fix is to switch to the canonical names:
```bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=lsv2_...
export LANGSMITH_PROJECT=my-service-prod
```
Next failure mode: a custom `BaseCallbackHandler` attached via
`chain.with_config(callbacks=[meter])` fires on the parent but is silent on
LangGraph subgraphs and `create_react_agent` tool calls — token counts
under-report by 30-70% vs the provider dashboard. That is P28: LangGraph
creates a child runtime per subgraph, and bound callbacks do not propagate.
Pass callbacks at invocation time instead:
```python
await chain.ainvoke(inputs, config={"callbacks": [meter], "configurable": {"tenant_id": t}})
```
This skill walks through canonical LangSmith setup, a metric-callback template
with tenant dimensions, invocation-time propagation, `RunnableConfig` trace
tagging, and a decision tree for LangSmith-only vs OTEL-native stacks (for
OTEL-heavy setups, defer to `langchain-otel-observability` / L33). Pin
`langchain-core 1.0.x` and `langgraph 1.0.x`, and keep `langsmith` current.
LangSmith tracing adds <5ms of overhead per span; metric callbacks add <1ms per
fire. Pain-catalog anchors: P26, P28, P04 (cache-token aggregation), P25 (retry
double-counting).
## Prerequisites
- Python 3.10+
- `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0`
- `langsmith` (bundled with `langchain`; upgrade to current for 1.0 env-var support)
- A LangSmith API key (`lsv2_...`) — free tier at https://smith.langchain.com
- Optional metric sinks: `prometheus_client`, `statsd`, or `datadog` Python packages
## Instructions
### Step 1 — Enable LangSmith with the canonical 1.0 env vars
`LANGSMITH_TRACING=true` is the switch. `LANGSMITH_API_KEY` authenticates.
`LANGSMITH_PROJECT` groups traces by environment — use one project per
`service-env` pair (`myapp-prod`, `myapp-staging`), not one per service.
```bash
# .env (loaded via python-dotenv or secret manager)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_PROJECT=my-service-prod
# Legacy names (soft-deprecated; see P26 for where they fail silently; do not use in new code):
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=lsv2_pt_...
# LANGCHAIN_PROJECT=my-service-prod
```
Verify in a REPL that the client sees the key before relying on it in
production:
```python
from langsmith import Client

c = Client()  # reads LANGSMITH_API_KEY and LANGSMITH_ENDPOINT
# list_projects() is lazy; list() forces the request, so a wrong key raises LangSmithAuthError
print(list(c.list_projects(limit=1)))
```
Do NOT set both `LANGCHAIN_TRACING_V2` and `LANGSMITH_TRACING` — mixed settings
have caused stale project routing in 1.0.x. See P26.
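The mixed-settings problem above can be caught at startup. The following is a minimal sketch, not part of LangChain or LangSmith: `check_tracing_env` and `LEGACY_TO_CANONICAL` are hypothetical names introduced here, and the check only inspects the variable names mentioned in this section.

```python
import os

# Legacy name -> canonical 1.0 name, as listed in Step 1.
LEGACY_TO_CANONICAL = {
    "LANGCHAIN_TRACING_V2": "LANGSMITH_TRACING",
    "LANGCHAIN_API_KEY": "LANGSMITH_API_KEY",
    "LANGCHAIN_PROJECT": "LANGSMITH_PROJECT",
}


def check_tracing_env(env=None) -> list[str]:
    """Return human-readable problems with the tracing env vars (empty list = OK)."""
    env = os.environ if env is None else env
    problems = []
    for legacy, canonical in LEGACY_TO_CANONICAL.items():
        if legacy in env and canonical in env:
            problems.append(f"both {legacy} and {canonical} are set; unset {legacy}")
        elif legacy in env:
            problems.append(f"{legacy} is set; rename it to {canonical}")
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not 'true'; tracing is off")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing")
    return problems
```

Calling `check_tracing_env()` once at service start and logging (or failing on) a non-empty result turns the silent P26 failure into a visible one.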
For selective sampling in high-traffic services, set
`LANGSMITH_TRACING_SAMPLING_RATE=0.1` (10% of runs). Full detail in
[LangSmith Setup](references/langsmith-setup.md).
### Step 2 — Write a metric callback for per-request observability
Subclass `BaseCallbackHandler`. Record `token_in`, `token_out`, `latency_ms`,
`tool_calls`, and `error`, tagged with a `tenant_id` dimension for downstream
grouping.
```python
import time

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult


class MetricCallback(BaseCallbackHandler):
    """Per-LLM-call metrics tagged with tenant_id. Overhead <1ms per event."""

    def __init__(self, tenant_id: str, sink) -> None:
        self.tenant_id = tenant_id
        self.sink = sink
        self._starts: dict[str, float] = {}  # run_id -> perf_counter() at start

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs) -> None:
        self._starts[str(run_id)] = time.perf_counter()

    def on_llm_end(self, response: LLMResult, *, run_id, **kwargs) -> None:
        t0 = self._starts.pop(str(run_id), time.perf_counter())
        elapsed_ms = (time.perf_counter() - t0) * 1000  # wall-clock latency
        tags = {"tenant_id": self.tenant_id}
        for gen in response.generations:
            for g in gen:
                # Only chat generations carry a message; plain Generation objects do not
                msg = getattr(g, "message", None)
                meta = getattr(msg, "usage_metadata", None) or {}
                self.sink.incr("llm.token_in", meta.get("input_tokens", 0), tags)
                self.sink.incr("llm.token_out", meta.get("output_tokens", 0), tags)
                # P04: aggregate Anthropic cache reads across calls
                cache = (meta.get("input_token_details") or {}).get("cache_read", 0)
                self.sink.incr("llm.cache_read", cache, tags)
        # One latency sample per LLM call, recorded outside the generation loops
        self.sink.hist("llm.latency_ms", elapsed_ms, tags)

    def on_llm_error(self, error, *, run_id, **kwargs) -> None:
        self._starts.pop(str(run_id), None)  # drop the timer so it cannot leak
        self.sink.incr("llm.error", 1, {"tenant_id": self.tenant_id,
                                        "error_type": type(error).__name__})

    def on_tool_end(self, output, *, run_id, **kwargs) -> None:
        self.sink.incr("llm.tool_calls", 1, {"tenant_id": self.tenant_id})
```
A thin `sink` protocol (`incr`, `hist`) swaps between Prometheus, StatsD, or
Datadog. Alternative sinks (LangSmith-only, OTEL) do not need this callback
at all — see Step 5. Full sink adapters and P25 retry dedupe in
[Custom Metrics Callback](references/custom-metrics-callback.md).
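As a rough illustration of the thin `sink` protocol described above, the sketch below defines the two methods the callback calls and an in-memory implementation suitable for unit tests. `MetricSink` and `InMemorySink` are hypothetical names introduced here; the real adapters for Prometheus, StatsD, or Datadog live in the referenced document and may differ.

```python
from collections import defaultdict
from typing import Mapping, Protocol


class MetricSink(Protocol):
    """Structural type: any object with these two methods can back MetricCallback."""

    def incr(self, name: str, value: float, tags: Mapping[str, str]) -> None: ...
    def hist(self, name: str, value: float, tags: Mapping[str, str]) -> None: ...


class InMemorySink:
    """Test double: stores counters and histogram samples keyed by (name, sorted tags)."""

    def __init__(self) -> None:
        self.counters = defaultdict(float)
        self.samples = defaultdict(list)

    @staticmethod
    def _key(name, tags):
        # Sort tags so the same dimensions always map to the same key
        return (name, tuple(sorted(tags.items())))

    def incr(self, name, value, tags):
        self.counters[self._key(name, tags)] += value

    def hist(self, name, value, tags):
        self.samples[self._key(name, tags)].append(value)


sink = InMemorySink()
sink.incr("llm.token_in", 120, {"tenant_id": "acme"})
sink.incr("llm.token_in", 30, {"tenant_id": "acme"})
sink.hist("llm.latency_ms", 412.5, {"tenant_id": "acme"})
```

Because the protocol is structural, a production adapter only needs matching `incr` and `hist` methods; it does not have to inherit from anything.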
### Step 3 — Pass callbacks via `config["callbacks"]` at invocation (P28)
This is the single most common observability bug in LangGraph 1.0 services.
Binding callbacks at definition time does not propagate into subgraphs or
`create_react_agent` tool nodes — those create child runtimes with their own
callback scope.
```python
# WRONG: fires on parent runnable only; silent on subgraphs (P28)
agent_bound = agent.with_config(callbacks=[MetricCallback(tenant_id, sink)])
result = await agent_bound.ainvoke(inputs)

# RIGHT: propagates to every runnable, subgraph, and tool call
meter = MetricCallback(tenant_id, sink)
result = await agent.ainvoke(
    inputs,
    config={
        "callbacks": [meter],
        "configurable": {"thread_id": session_id, "tenant_id": tenant_id},
        "tags": ["prod", f"tenant:{tenant_id}"],
        "metadata": {"request_id": req_id, "tier": "enterprise"},
    },
)
```
Construct the callback *inside* the request handler so it captures a fresh
`tenant_id` per request — and in that pattern, invocation-time config is the
only way callbacks reach subgraphs. See [Trace Metadata and Tagging](references/trace-metadata-and-tagging.md)
for the full `RunnableConfig` shape.
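One way to keep the per-request construction consistent is a small builder that every handler calls. This is a sketch under assumptions: `build_run_config` and its `make_callback` parameter are hypothetical names, and the lambda below stands in for a factory that would return `MetricCallback(tenant_id, sink)` in a real service.

```python
from typing import Any, Callable


def build_run_config(
    tenant_id: str,
    session_id: str,
    request_id: str,
    make_callback: Callable[[str], Any],
) -> dict[str, Any]:
    """Build the dict passed as `config=` to invoke/ainvoke for one request."""
    meter = make_callback(tenant_id)  # fresh handler, never shared across requests
    return {
        "callbacks": [meter],
        "configurable": {"thread_id": session_id, "tenant_id": tenant_id},
        "tags": ["prod", f"tenant:{tenant_id}"],
        "metadata": {"request_id": request_id},
    }


# Stand-in factory for illustration only
cfg_a = build_run_config("acme", "sess-1", "req-1", lambda t: {"tenant": t})
cfg_b = build_run_config("globex", "sess-2", "req-2", lambda t: {"tenant": t})
```

Each call yields its own callback instance, so token counts for one tenant cannot leak into another tenant's dimensions.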
### Step 4 — Tag and annotate traces via `RunnableConfig`
LangSmith indexes two per-request fields: `tags` (flat list, filterable) and
`metadata` (key-value, searchable). Fix conventions early — LangSmith has no
rename tool.
```python
import os

config = {
    "callbacks": [meter],
    "tags": [
        "env:prod",                 # environment
        f"tenant:{tenant_id}",      # tenant
        f"tier:{tenant_tier}",      # plan tier
        f"feature:{feature_flag}",  # A/B experiment arm
    ],
    "metadata": {
        "request_id": req_id,
        "user_id": user_id,
        "session_id": session_id,
        "app_version": os.environ["APP_VERSION"],
    },
    "run_name": "agent_main",  # LangSmith UI label; overrides chain class name
}
```
Hierarchical tag conventions (`env:prod`, `tenant:acme`, `tier:enterprise`)
make LangSmith filters work. Free-form tags (`"important"`, `"check-me"`) do
not. See [Trace Metadata and Tagging](references/trace-metadata-and-tagging.md).
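The `key:value` convention can be enforced in code so free-form tags never reach LangSmith. The helper below is a sketch: `make_tags` is a hypothetical name, and the regular expression encodes one possible reading of the convention (lowercase key, no spaces in the value), not a rule imposed by LangSmith.

```python
import re

# Assumed convention: lowercase key, colon, value without whitespace
_TAG_RE = re.compile(r"^[a-z][a-z0-9_]*:[A-Za-z0-9_.\-]+$")


def make_tags(**dimensions: str) -> list[str]:
    """Turn keyword dimensions into key:value tags, rejecting malformed ones."""
    tags = [f"{key}:{value}" for key, value in dimensions.items()]
    bad = [t for t in tags if not _TAG_RE.match(t)]
    if bad:
        raise ValueError(f"tags must look like key:value, got {bad}")
    return tags


tags = make_tags(env="prod", tenant="acme", tier="enterprise")
```

The resulting list can be passed directly as the `"tags"` entry of the config dict.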
### Step 5 — Pick a sink and the stack shape
The callback handler is the integration point. Options, in decreasing order of
fit:
- **LangSmith only** — zero additional overhead; tracing already covers latency
and token accounting. Fine for solo dev, small teams, and LLM-native ops.
- **Prometheus (pull)** — bes