langchain-eval-harness
Build reproducible evaluation pipelines for LangChain 1.0 chains and LangGraph 1.0 agents — golden datasets, LangSmith evaluate(), ragas RAG metrics, deepeval LLM-as-judge, agent trajectory analysis, and CI gating on quality regressions. Use when setting up quality measurement for a new chain, diagnosing regression after a model switch, or building an evaluation gate for a pull request. Trigger with "langchain eval", "langsmith evaluate", "ragas", "llm-as-judge", "agent trajectory eval", "eval regression gate".
# LangChain Eval Harness (Python)
## Overview
A team swapped `gpt-4o` for `claude-sonnet-4-6` to save money, and a week later
customer support noticed that answer quality had dropped on 15% of refund
tickets. The regression was invisible in code review and invisible in CI
because no golden set existed. The fix: a versioned golden set, a stacked eval
pipeline (LangSmith + ragas + deepeval + custom trajectory), and a PR-blocking
regression gate with paired Wilcoxon significance. The tooling exists; the
patterns for wiring it into a statistically honest loop are scattered across
five doc sites.
Build a 100-example JSONL golden set, wire LangSmith `evaluate()` with a
custom correctness evaluator, add a ragas quartet (faithfulness, answer
relevance, context precision/recall) for RAG, add deepeval LLM-as-judge
with N=3 judge quorum, score LangGraph trajectories on coverage/precision/
order, and gate PRs on a 2% aggregate drop or 5% per-example drop. Pin:
`langchain-core 1.0.x`, `langgraph 1.0.x`, `langsmith>=0.2`, `ragas>=0.2`,
`deepeval>=2.0`. Pain-catalog anchors: P01, P11, P12, P22, P33.
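The gating rule above can be sketched in a few lines. This is a minimal illustration, not the skill's own gate: `regression_gate` and its parameter names are hypothetical, and it assumes one reading of the thresholds, namely that "2% aggregate drop" is an absolute drop of 0.02 in the mean score and "5% per-example drop" is any single example losing more than 0.05. The paired Wilcoxon test is left out here; `scipy.stats.wilcoxon` applied to the same paired deltas would be the natural place to add it.

```python
from statistics import mean


def regression_gate(baseline, candidate, max_mean_drop=0.02, max_example_drop=0.05):
    """Compare paired per-example scores (dicts keyed by example id).

    Thresholds are an assumed reading of the 2% / 5% rule, in absolute score units.
    """
    shared_ids = sorted(baseline.keys() & candidate.keys())
    if not shared_ids:
        raise ValueError("no paired examples to compare")
    # Negative delta means the candidate scored lower than the baseline.
    deltas = {i: candidate[i] - baseline[i] for i in shared_ids}
    mean_drop = -mean(deltas.values())
    regressed = [i for i, d in deltas.items() if -d > max_example_drop]
    passed = mean_drop <= max_mean_drop and not regressed
    return {"passed": passed, "mean_drop": mean_drop, "regressed_ids": regressed}


baseline_scores = {"gs-0001": 1.0, "gs-0002": 1.0, "gs-0003": 0.5}
candidate_scores = {"gs-0001": 1.0, "gs-0002": 0.9, "gs-0003": 0.5}
gate = regression_gate(baseline_scores, candidate_scores)
print(gate["passed"], gate["regressed_ids"])
```

Pairing by example id matters: comparing two unpaired means hides the case where one example collapses while another improves.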
## Prerequisites
- Python 3.10+
- `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0` for the system under eval
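- `pip install "langsmith>=0.2" "ragas>=0.2" "deepeval>=2.0" scipy` (quote the specifiers; an unquoted `>=` is parsed by the shell as an output redirect), plus `langchain-anthropic`, `langchain-openai`, and `datasets` for the snippets below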
- LangSmith account + `LANGSMITH_API_KEY` (free tier is sufficient for dataset versioning)
- Provider API keys for the judge LLM: `OPENAI_API_KEY` and/or `ANTHROPIC_API_KEY`
## Instructions
### Step 1 — Build a versioned golden set
Format: JSONL, one example per line, with a `dataset_version` tag. Minimum 20
examples to start; grow to 100 for PR gating, 200+ for absolute-metric claims.
Example file `evals/golden_set/v2026.04.jsonl`:
```json
{"id": "gs-0001", "input": "Refund policy for SKU ABC-42?", "expected": "30 days with receipt", "contexts": ["policy_v3.md"], "tags": ["refund"], "difficulty": "easy", "dataset_version": "2026.04"}
{"id": "gs-0002", "input": "Return policy for opened software?", "expected": "No, opened software is final sale", "contexts": ["policy_v3.md#returns"], "tags": ["refund"], "difficulty": "medium", "dataset_version": "2026.04"}
```
Sample from real traffic (redacted), not imagination. Stratify by tag and
difficulty (aim for 30% hard). Two annotators per example, disagreements
reconciled — reconciliation rate under 90% means your task definition is
ambiguous. Treat the file as immutable within a version; bump the version
to refresh. See [Golden Set Curation](references/golden-set-curation.md) for
sourcing strategy, annotation tool options, and the refresh cadence.
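The invariants above (one schema, unique ids, a single version per file, a known share of hard examples) are cheap to check before every upload. The following is a minimal sketch; `validate_golden_set` and `REQUIRED_KEYS` are hypothetical names, and the key set is inferred from the example rows rather than from a published schema.

```python
import json
from collections import Counter

# Keys taken from the example rows above; adjust if your schema differs.
REQUIRED_KEYS = {"id", "input", "expected", "tags", "difficulty", "dataset_version"}


def validate_golden_set(lines):
    """Parse JSONL lines and report schema, id, and version problems."""
    examples = [json.loads(line) for line in lines if line.strip()]
    problems = []
    for n, ex in enumerate(examples, start=1):
        missing = REQUIRED_KEYS - set(ex)
        if missing:
            problems.append(f"example {n}: missing {sorted(missing)}")
    id_counts = Counter(ex.get("id") for ex in examples)
    problems += [f"duplicate id {i}" for i, c in id_counts.items() if c > 1]
    # A file is immutable within a version, so it should carry exactly one.
    versions = {str(ex.get("dataset_version")) for ex in examples}
    if len(versions) > 1:
        problems.append(f"mixed dataset_version values: {sorted(versions)}")
    hard = sum(1 for ex in examples if ex.get("difficulty") == "hard")
    hard_share = hard / len(examples) if examples else 0.0
    return {"count": len(examples), "hard_share": hard_share, "problems": problems}


sample = [
    '{"id": "gs-0001", "input": "q1", "expected": "a1", "tags": ["refund"], "difficulty": "easy", "dataset_version": "2026.04"}',
    '{"id": "gs-0002", "input": "q2", "expected": "a2", "tags": ["refund"], "difficulty": "hard", "dataset_version": "2026.04"}',
]
report = validate_golden_set(sample)
print(report)
```

In practice `lines` would be the open file handle for the versioned JSONL file, and a non-empty `problems` list would fail the upload step.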
### Step 2 — Wire LangSmith `evaluate()` with a custom evaluator
```python
import json

from langchain_anthropic import ChatAnthropic
from langsmith import Client
from langsmith.evaluation import EvaluationResult, evaluate

client = Client()
DATASET_VERSION = "2026.04"


# One-time: upload the golden set as a versioned dataset
def upload_golden_set(jsonl_path, dataset_name):
    with open(jsonl_path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    client.create_dataset(dataset_name)
    client.create_examples(
        inputs=[{"input": e["input"]} for e in examples],
        outputs=[{"expected": e["expected"]} for e in examples],
        metadata=[{"id": e["id"], "tags": e["tags"]} for e in examples],
        dataset_name=dataset_name,
    )


chain = ChatAnthropic(model="claude-sonnet-4-6", temperature=0, timeout=30)


def target(inputs):
    # System under eval: maps one example's inputs to an outputs dict
    return {"answer": chain.invoke(inputs["input"]).content}


def correctness(outputs, reference_outputs):
    """Deterministic exact-match floor: a baseline, not a ceiling."""
    match = outputs["answer"].strip().lower() == reference_outputs["expected"].strip().lower()
    return EvaluationResult(key="exact_match", score=float(match))


results = evaluate(
    target,
    data=f"golden-set-v{DATASET_VERSION}",
    evaluators=[correctness],
    experiment_prefix="refund-bot-v3",
    max_concurrency=10,  # Cap concurrency to avoid 429s (P22)
)
```
Free-form outputs need semantic scoring (ragas, deepeval, or LLM-as-judge — Step 4).
### Step 3 — Add ragas metrics for RAG pipelines
For a RAG chain returning `{answer, contexts}`, ragas scores four standard
dimensions. The default judge is `gpt-4o-mini`; override to pin model +
cost:
```python
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate as ragas_evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embed = OpenAIEmbeddings(model="text-embedding-3-small")

# Prepare rows: ragas wants HuggingFace Dataset shape.
# `golden_examples` (the parsed golden set) and `rag_chain` are defined elsewhere.
rows = []
for ex in golden_examples:
    result = rag_chain.invoke({"question": ex["input"]})
    rows.append({
        "question": ex["input"],
        "answer": result["answer"],
        "contexts": [d.page_content for d in result["source_documents"]],
        "ground_truth": ex["expected"],
    })

ragas_results = ragas_evaluate(
    Dataset.from_list(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge,
    embeddings=embed,
)
# ragas_results reports per-metric means; call .to_pandas() for per-row scores
```
Do not use ragas on non-RAG chains — `context_precision` against an empty
context list returns 0 and looks like a regression. See
[Framework Comparison](references/framework-comparison.md) for when each
tool fits.
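One way to keep that failure mode out of the numbers is to split rows before scoring, so that examples with no retrieved context are reported separately instead of scored as zeros. A minimal sketch, assuming the row shape built in the snippet above (`contexts` as a list of strings); `split_rag_rows` is a hypothetical helper, not part of ragas.

```python
def split_rag_rows(rows):
    """Separate rows that have usable retrieved contexts from rows that do not."""
    scorable, skipped = [], []
    for row in rows:
        # Treat missing, empty, and whitespace-only contexts the same way.
        contexts = [c for c in row.get("contexts", []) if c and c.strip()]
        (scorable if contexts else skipped).append(row)
    return scorable, skipped


demo_rows = [
    {"question": "q1", "answer": "a1", "contexts": ["30 days with receipt"], "ground_truth": "a1"},
    {"question": "q2", "answer": "a2", "contexts": [], "ground_truth": "a2"},
    {"question": "q3", "answer": "a3", "contexts": ["  "], "ground_truth": "a3"},
]
scorable, skipped = split_rag_rows(demo_rows)
print(len(scorable), "scorable;", len(skipped), "skipped")
```

Only `scorable` would go to ragas; a growing `skipped` count is itself worth tracking, since it points at a retrieval problem rather than a generation problem.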
### Step 4 — Add deepeval LLM-as-judge for free-form outputs
deepeval is pytest-shaped — each example is an `LLMTestCase` asserting against
metrics. Run N=3 judge invocations per example and take the median to tame
LLM-as-judge variance (±5-15% across runs; single-run scores are not CI-ready):
```python
import statistics

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def eval_with_quorum(test_case, metric, n=3):
    """Run the judge n times; return (median score, sample stdev)."""
    scores = []
    for _ in range(n):
        metric.measure(test_case)
        scores.append(metric.score)
    spread = statistics.stdev(scores) if n > 1 else 0.0
    return statistics.median(scores), spread


correctness = GEval(
    name="Correctness",
    criteria="Does the actual output match the expected output in meaning?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model="gpt-4o-mini",
)

# `golden_examples`, `chain`, and `flag_for_review` are defined elsewhere
for ex in golden_examples:
    result = chain.invoke({"input": ex["input"]})
    # LLMTestCase needs a string; chat models return a message with `.content`
    answer = result if isinstance(result, str) else result.content
    case = LLMTestCase(input=ex["input"], actual_output=answer, expected_output=ex["expected"])
    median, sd = eval_with_quorum(case, correctness, n=3)
    if sd > 0.2:  # judge disagreeing with itself: flag, don't gate
        flag_for_review(ex["id"], median, sd)
```
### Step 5 — LangGraph agent trajectory eval
For agents, final-answer correctness misses the process. Score the tool-call
sequence on three axes — coverage (did required tools run?), precision
(were extra tools used?), and order (Kendall's tau on shared tools):
```python
from langchain_core.messages import AIMessage


def extract_trajectory(final_state: dict) -> list[dict]:
    """Flatten every tool call in the final graph state, in call order."""
    return [
        {"tool": tc["name"], "args": tc["args"]}
        for msg in final_state["messages"] if isinstance(msg, AIMessage)
        for tc in (msg.tool_calls or [])
    ]


def trajectory_score(expected: list[str], actual: list[str]) -> dict:
    e_set, a_set = set(expected), set(actual)
    coverage = len(e_set & a_set) / len(e_set) if e_set else 1.0   # required tools that ran
    precision = len(e_set & a_set) / len(a_set) if a_set else 0.0  # distinct tools used that were expected
    shared = [t for t in actual if t in e_set]
    # `_kendall_tau` is a helper that is not shown in this snippet
    order = _kendall_tau(expected, shared) if len(shared) >= 2 else 1.0
    return {"coverage": coverage, "precision": precision, "order": order}


# Composite: 0.5 * coverage + 0.3 * precision + 0.2 * order
```
Set `temperature=0` for the agent during eval — `temperature > 0` produces
different trajectories across runs (P11) and makes paired comparison
statistically invalid. See [Agent Trajectory Eval](references/agent-