langchain-eval-harness

Build reproducible evaluation pipelines for LangChain 1.0 chains and LangGraph 1.0 agents — golden datasets, LangSmith evaluate(), ragas RAG metrics, deepeval LLM-as-judge, agent trajectory analysis, and CI gating on quality regressions. Use when setting up quality measurement for a new chain, diagnosing regression after a model switch, or building an evaluation gate for a pull request. Trigger with "langchain eval", "langsmith evaluate", "ragas", "llm-as-judge", "agent trajectory eval", "eval regression gate".

Tags: Cloud & DevOps, saas, langchain, langgraph, python, langchain-1.0, evaluation, langsmith, ragas

# LangChain Eval Harness (Python)

## Overview

A team swapped `gpt-4o` for `claude-sonnet-4-6` to save money. A week later,
customer support noticed that answer quality had dropped on 15% of refund
tickets. The regression was invisible in code review and in CI because no
golden set existed.

Fix: a versioned golden set, a stacked eval pipeline (LangSmith +
ragas + deepeval + custom trajectory), and a PR-blocking regression gate
with paired Wilcoxon significance. The tooling exists; the patterns for
wiring it into a statistically honest loop are scattered across five doc sites.

Build a 100-example JSONL golden set, wire LangSmith `evaluate()` with a
custom correctness evaluator, add a ragas quartet (faithfulness, answer
relevance, context precision/recall) for RAG, add deepeval LLM-as-judge
with N=3 judge quorum, score LangGraph trajectories on coverage/precision/
order, and gate PRs on a 2% aggregate drop or 5% per-example drop. Pin:
`langchain-core 1.0.x`, `langgraph 1.0.x`, `langsmith>=0.2`, `ragas>=0.2`,
`deepeval>=2.0`. Pain-catalog anchors: P01, P11, P12, P22, P33.
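
The regression gate at the end of that pipeline can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the skill's actual gate: it assumes per-example scores on a 0 to 1 scale keyed by example id, reads "2%" and "5%" as absolute score points, and runs a one-sided paired Wilcoxon test from `scipy.stats` only when SciPy is importable. The name `regression_gate` and its return shape are hypothetical.

```python
from statistics import mean

try:
    from scipy.stats import wilcoxon  # scipy is listed in the prerequisites
except ImportError:  # keep the threshold checks usable without it
    wilcoxon = None


def regression_gate(baseline, candidate, max_aggregate_drop=0.02,
                    max_example_drop=0.05, alpha=0.05):
    """Compare paired per-example scores (dicts of example id -> score in [0, 1]).

    Assumption: both thresholds are absolute score points, not relative changes.
    """
    ids = sorted(set(baseline) & set(candidate))
    if not ids:
        raise ValueError("no shared example ids to compare")

    diffs = [candidate[i] - baseline[i] for i in ids]
    aggregate_drop = -mean(diffs)  # positive means the candidate got worse
    regressed_ids = [i for i, d in zip(ids, diffs) if -d > max_example_drop]

    # One-sided paired test: is the candidate systematically lower?
    # Skipped when SciPy is missing or when every paired difference is zero.
    p_value = None
    if wilcoxon is not None and any(d != 0 for d in diffs):
        p_value = float(wilcoxon(diffs, alternative="less").pvalue)

    # Assumption: an aggregate drop only blocks when it is also significant
    # (or when no p-value could be computed).
    aggregate_failed = aggregate_drop > max_aggregate_drop and (
        p_value is None or p_value < alpha
    )
    return {
        "passed": not aggregate_failed and not regressed_ids,
        "aggregate_drop": aggregate_drop,
        "regressed_ids": regressed_ids,
        "p_value": p_value,
    }
```

A CI job could call this with the scores from the baseline branch and the PR branch and exit non-zero when `passed` is false. Pairing by example id is what makes the Wilcoxon test applicable, so both runs must use the same golden-set version.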

## Prerequisites

- Python 3.10+
- `langchain-core >= 1.0, < 2.0`, `langgraph >= 1.0, < 2.0` for the system under eval
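- `pip install "langsmith>=0.2" "ragas>=0.2" "deepeval>=2.0" scipy langchain-anthropic langchain-openai` (quote the version specifiers so the shell does not treat `>` as a redirect)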
- LangSmith account + `LANGSMITH_API_KEY` (free tier is sufficient for dataset versioning)
- Provider API keys for the judge LLM: `OPENAI_API_KEY` and/or `ANTHROPIC_API_KEY`

## Instructions

### Step 1 — Build a versioned golden set

Format: JSONL, one example per line, with a `dataset_version` tag. Minimum 20
examples to start; grow to 100 for PR gating, 200+ for absolute-metric claims.

File: `evals/golden_set/v2026.04.jsonl`

```json
{"id": "gs-0001", "input": "Refund policy for SKU ABC-42?", "expected": "30 days with receipt", "contexts": ["policy_v3.md"], "tags": ["refund"], "difficulty": "easy", "dataset_version": "2026.04"}
{"id": "gs-0002", "input": "Return policy for opened software?", "expected": "No, opened software is final sale", "contexts": ["policy_v3.md#returns"], "tags": ["refund"], "difficulty": "medium", "dataset_version": "2026.04"}
```
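
A loader for this format might look like the following. It is a minimal sketch using only the standard library: the function name `load_golden_set` and the choice of required fields are assumptions based on the two example rows above, and `contexts` is left optional on the assumption that non-RAG examples may omit it.

```python
import json
from pathlib import Path

# Assumption: these fields are mandatory; "contexts" is optional.
REQUIRED_FIELDS = {"id", "input", "expected", "tags", "difficulty", "dataset_version"}


def load_golden_set(path):
    """Parse a golden-set JSONL file and fail fast on malformed rows."""
    examples, seen_ids = [], set()
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = REQUIRED_FIELDS - set(row)
            if missing:
                raise ValueError(f"{path}:{lineno}: missing fields {sorted(missing)}")
            if row["id"] in seen_ids:
                raise ValueError(f"{path}:{lineno}: duplicate id {row['id']!r}")
            seen_ids.add(row["id"])
            examples.append(row)

    # A single file should carry exactly one dataset version.
    versions = {row["dataset_version"] for row in examples}
    if len(versions) > 1:
        raise ValueError(f"mixed dataset_version values: {sorted(versions)}")
    return examples
```

Calling `load_golden_set("evals/golden_set/v2026.04.jsonl")` returns a list of dicts, which is the shape the later steps iterate over as `golden_examples`.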

Sample from real traffic (redacted), not imagination. Stratify by tag and
difficulty (aim for 30% hard). Use two annotators per example and reconcile
disagreements; an initial agreement rate under 90% suggests the task
definition is ambiguous. Treat the file as immutable within a version; bump the version
to refresh. See [Golden Set Curation](references/golden-set-curation.md) for
sourcing strategy, annotation tool options, and the refresh cadence.
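
The stratification target and the annotator check above can be verified mechanically before a version is frozen. The sketch below assumes each example is a dict with `difficulty` and `tags` fields as in the JSONL rows, and reads the 90% threshold as raw agreement between two annotators before reconciliation, which is an interpretation rather than something the source spells out. Both function names are hypothetical.

```python
from collections import Counter


def stratification_report(examples):
    """Return the share of examples per difficulty and the count per tag."""
    total = len(examples)
    if total == 0:
        raise ValueError("golden set is empty")
    by_difficulty = Counter(ex["difficulty"] for ex in examples)
    by_tag = Counter(tag for ex in examples for tag in ex["tags"])
    return {
        "difficulty_share": {k: v / total for k, v in by_difficulty.items()},
        "tag_counts": dict(by_tag),
    }


def agreement_rate(labels_a, labels_b):
    """Fraction of examples on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

A pre-commit check could then require `difficulty_share.get("hard", 0)` to be near 0.3 and `agreement_rate(...)` to be at least 0.9 before the file is tagged with a new `dataset_version`.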

### Step 2 — Wire LangSmith `evaluate()` with a custom evaluator

```python
import json

from langsmith import Client
from langsmith.evaluation import evaluate, EvaluationResult
from langchain_anthropic import ChatAnthropic

client = Client()
DATASET_VERSION = "2026.04"

# One-time: upload golden set as a versioned dataset
def upload_golden_set(jsonl_path, dataset_name):
    # Close the file promptly and skip blank lines
    with open(jsonl_path, encoding="utf-8") as fh:
        examples = [json.loads(line) for line in fh if line.strip()]
    client.create_dataset(dataset_name)
    client.create_examples(
        inputs=[{"input": e["input"]} for e in examples],
        outputs=[{"expected": e["expected"]} for e in examples],
        metadata=[{"id": e["id"], "tags": e["tags"]} for e in examples],
        dataset_name=dataset_name,
    )

chain = ChatAnthropic(model="claude-sonnet-4-6", temperature=0, timeout=30)

def target(inputs):
    return {"answer": chain.invoke(inputs["input"]).content}

def correctness(outputs, reference_outputs):
    """Deterministic exact-match floor — baseline, not ceiling."""
    match = outputs["answer"].strip().lower() == reference_outputs["expected"].strip().lower()
    return EvaluationResult(key="exact_match", score=float(match))

results = evaluate(
    target,
    data=f"golden-set-v{DATASET_VERSION}",
    evaluators=[correctness],
    experiment_prefix="refund-bot-v3",
    max_concurrency=10,   # Avoid 429s on judge LLM (P22)
)
```

Free-form outputs need semantic scoring (ragas, deepeval, or LLM-as-judge — Step 4).

### Step 3 — Add ragas metrics for RAG pipelines

For a RAG chain returning `{answer, contexts}`, ragas scores four standard
dimensions. The default judge is `gpt-4o-mini`; override to pin model +
cost:

```python
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from datasets import Dataset

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embed = OpenAIEmbeddings(model="text-embedding-3-small")

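# Prepare rows: ragas wants HuggingFace Dataset shape.
# golden_examples: list of dicts parsed from the golden-set JSONL (Step 1).
# rag_chain: the RAG chain under evaluation, defined elsewhere.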
rows = []
for ex in golden_examples:
    result = rag_chain.invoke({"question": ex["input"]})
    rows.append({
        "question": ex["input"],
        "answer": result["answer"],
        "contexts": [d.page_content for d in result["source_documents"]],
        "ground_truth": ex["expected"],
    })

ragas_results = ragas_evaluate(
    Dataset.from_list(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge,
    embeddings=embed,
)
# ragas_results reports per-metric means; call .to_pandas() for per-row scores
```

Do not use ragas on non-RAG chains — `context_precision` against an empty
context list returns 0 and looks like a regression. See
[Framework Comparison](references/framework-comparison.md) for when each
tool fits.
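
One way to keep that failure mode out of the numbers is to filter rows before they reach ragas. The helper below is a sketch, not part of ragas: `split_ragas_rows` is a hypothetical name, and it assumes rows shaped like the dicts built in the loop above, with a `contexts` list of strings.

```python
def split_ragas_rows(rows):
    """Separate rows ragas can score from rows with no retrieved context."""
    scorable, skipped = [], []
    for row in rows:
        # Drop empty or whitespace-only context strings before deciding.
        contexts = [c for c in row.get("contexts", []) if c and c.strip()]
        if contexts:
            scorable.append({**row, "contexts": contexts})
        else:
            skipped.append(row)
    return scorable, skipped
```

Passing only `scorable` to the ragas call, and reporting `len(skipped)` separately, keeps retrieval misses visible without letting them drag `context_precision` to zero.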

### Step 4 — Add deepeval LLM-as-judge for free-form outputs

deepeval is pytest-shaped — each example is an `LLMTestCase` asserting against
metrics. Run N=3 judge invocations per example and take the median to tame
LLM-as-judge variance (±5-15% across runs; single-run scores are not CI-ready):

```python
import statistics
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def eval_with_quorum(test_case, metric, n=3):
    scores = []
    for _ in range(n):
        metric.measure(test_case)
        scores.append(metric.score)
    return statistics.median(scores), statistics.stdev(scores) if n > 1 else 0.0

# Named correctness_metric so it does not shadow the Step 2 evaluator function
correctness_metric = GEval(
    name="Correctness",
    criteria="Does the actual output match the expected output in meaning?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model="gpt-4o-mini",
)

for ex in golden_examples:
    # Assuming chain is the chat model from Step 2: pass the prompt string and
    # unwrap .content, because LLMTestCase expects actual_output to be a str
    result = chain.invoke(ex["input"]).content
    case = LLMTestCase(input=ex["input"], actual_output=result, expected_output=ex["expected"])
    median, sd = eval_with_quorum(case, correctness_metric, n=3)
    if sd > 0.2:  # judge disagrees with itself: flag, don't gate
        flag_for_review(ex["id"], median, sd)  # your own helper, not defined here
```
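
The quorum logic itself can be exercised without any judge calls, which is useful for unit-testing the harness in CI. The sketch below repeats the quorum function and drives it with a hypothetical `StubMetric` that replays canned scores. `triage` and its `pass_threshold=0.7` are illustrative assumptions; only the 0.2 spread cutoff comes from the snippet above.

```python
import statistics


class StubMetric:
    """Hypothetical stand-in for a judge metric: replays canned scores."""

    def __init__(self, scores):
        self._scores = iter(scores)
        self.score = None

    def measure(self, test_case):
        self.score = next(self._scores)
        return self.score


def eval_with_quorum(test_case, metric, n=3):
    """Run the metric n times; return (median score, sample std dev)."""
    scores = []
    for _ in range(n):
        metric.measure(test_case)
        scores.append(metric.score)
    spread = statistics.stdev(scores) if n > 1 else 0.0
    return statistics.median(scores), spread


def triage(median, spread, max_spread=0.2, pass_threshold=0.7):
    """Route an example: unstable judges go to review, the rest pass or fail."""
    if spread > max_spread:
        return "review"
    return "pass" if median >= pass_threshold else "fail"
```

Feeding `StubMetric([0.1, 0.9, 0.5])` through `eval_with_quorum` and `triage` yields a "review" outcome, since the three scores disagree too much to trust the median.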

### Step 5 — LangGraph agent trajectory eval

For agents, final-answer correctness misses the process. Score the tool-call
sequence on three axes — coverage (did required tools run?), precision
(were extra tools used?), and order (Kendall's tau on shared tools):

```python
from langchain_core.messages import AIMessage

def extract_trajectory(final_state: dict) -> list[dict]:
    return [
        {"tool": tc["name"], "args": tc["args"]}
        for msg in final_state["messages"] if isinstance(msg, AIMessage)
        for tc in (msg.tool_calls or [])
    ]

def trajectory_score(expected: list[str], actual: list[str]) -> dict:
    # Both arguments are tool-name lists, e.g.
    # actual = [step["tool"] for step in extract_trajectory(final_state)]
    e_set, a_set = set(expected), set(actual)
    coverage = len(e_set & a_set) / len(e_set) if e_set else 1.0
    precision = len(e_set & a_set) / len(a_set) if a_set else 0.0
    shared = [t for t in actual if t in e_set]
    # _kendall_tau: rank-correlation helper, not defined in this snippet
    order = _kendall_tau(expected, shared) if len(shared) >= 2 else 1.0
    return {"coverage": coverage, "precision": precision, "order": order}

# Composite: 0.5 * coverage + 0.3 * precision + 0.2 * order
```
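
The order axis relies on a Kendall's tau helper that the snippet above does not define. One possible implementation, as a sketch: it ranks each tool by its first occurrence in each sequence (an assumption for handling repeated tool calls) and counts concordant versus discordant pairs. The names `kendall_tau_order` and `composite_score` are hypothetical. Tau ranges from -1 to 1, and the source does not say whether it is rescaled before weighting, so the composite here uses it as-is.

```python
from itertools import combinations


def kendall_tau_order(expected, actual):
    """Kendall's tau over tools that appear in both sequences.

    Assumption: a repeated tool is ranked by its first occurrence.
    """
    pos_e, pos_a = {}, {}
    for i, tool in enumerate(expected):
        pos_e.setdefault(tool, i)
    for i, tool in enumerate(actual):
        pos_a.setdefault(tool, i)

    shared = [tool for tool in pos_e if tool in pos_a]
    if len(shared) < 2:
        return 1.0  # nothing to misorder

    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        # Positions are distinct per tool, so the product is never zero.
        if (pos_e[x] - pos_e[y]) * (pos_a[x] - pos_a[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def composite_score(scores):
    """Weighted composite from the comment above: 0.5 / 0.3 / 0.2."""
    return (0.5 * scores["coverage"]
            + 0.3 * scores["precision"]
            + 0.2 * scores["order"])
```

For example, an agent that calls the right tools in exactly reversed order gets full coverage and precision but an order score of -1, so its composite comes out at 0.6 rather than 1.0.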

Set `temperature=0` for the agent during eval — `temperature > 0` produces
different trajectories across runs (P11) and makes paired comparison
statistically invalid. See [Agent Trajectory Eval](references/agent-
