
evidence-standards


Testing evidence standards and validation checklist; use when defining or reviewing test evidence requirements.



# Evidence Standards for All Testing and Verification

## Core Principle

**Evidence must prove what you claim.** Mock data cannot prove production behavior.

## Minimum Viable Evidence Checklist

**Every test MUST capture these at minimum (copy-paste into test setup):**

```python
import os
import subprocess

# BASE_URL must be defined by the test module and end with an explicit port.


def capture_provenance():
    """REQUIRED: Capture all evidence standards."""
    provenance = {}

    # === GIT PROVENANCE (MANDATORY) ===
    # Refresh origin/main first so merge-base and diff stats use current refs.
    subprocess.run(["git", "fetch", "origin", "main"], timeout=10, capture_output=True)
    provenance["git_head"] = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    provenance["git_branch"] = subprocess.check_output(
        ["git", "branch", "--show-current"], text=True).strip()
    provenance["merge_base"] = subprocess.check_output(
        ["git", "merge-base", "HEAD", "origin/main"], text=True).strip()
    provenance["commits_ahead_of_main"] = int(subprocess.check_output(
        ["git", "rev-list", "--count", "origin/main..HEAD"], text=True).strip())
    provenance["diff_stat_vs_main"] = subprocess.check_output(
        ["git", "diff", "--stat", "origin/main...HEAD"], text=True).strip()

    # === SERVER RUNTIME (MANDATORY for server tests) ===
    port = BASE_URL.split(":")[-1].rstrip("/")
    # lsof exits non-zero when nothing listens on the port, so use run() and
    # split() to get an empty list instead of an exception or [""].
    lsof = subprocess.run(
        ["lsof", "-i", f":{port}", "-t"], capture_output=True, text=True)
    pids = lsof.stdout.split()
    provenance["server"] = {
        "pid": pids[0] if pids else None,
        "port": port,
        "process_cmdline": subprocess.check_output(
            ["ps", "-p", pids[0], "-o", "command="], text=True).strip() if pids else None,
        "env_vars": {var: os.environ.get(var) for var in
            ["WORLDAI_DEV_MODE", "TESTING", "GOOGLE_APPLICATION_CREDENTIALS"]}
    }

    return provenance
```

**Quick validation:** If your metadata.json is missing ANY of these fields, the test is incomplete:
- `provenance.merge_base`
- `provenance.commits_ahead_of_main`
- `provenance.diff_stat_vs_main`
- `provenance.server.pid`
- `provenance.server.port`
- `provenance.server.process_cmdline`
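
The quick validation above could be automated with a small check. This is a minimal sketch, assuming metadata.json stores the provenance under a top-level `provenance` key (as the dotted field names suggest) and that a `null` value should count as missing. `REQUIRED_FIELDS`, `missing_provenance_fields`, and `check_metadata_file` are hypothetical names, not part of any existing tooling.

```python
import json
from pathlib import Path

# Dotted paths copied from the quick-validation list.
REQUIRED_FIELDS = [
    "provenance.merge_base",
    "provenance.commits_ahead_of_main",
    "provenance.diff_stat_vs_main",
    "provenance.server.pid",
    "provenance.server.port",
    "provenance.server.process_cmdline",
]


def missing_provenance_fields(metadata: dict) -> list[str]:
    """Return the required dotted paths that are absent (or null) in metadata."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = metadata
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        # Assumption: a null value is treated the same as a missing key.
        if node is None:
            missing.append(dotted)
    return missing


def check_metadata_file(path: Path) -> list[str]:
    """Load a metadata.json file and report its missing provenance fields."""
    return missing_provenance_fields(json.loads(path.read_text()))
```

An empty return value means every listed field is present. Values such as `0` commits ahead or an empty diff stat still count as present, since only absence or `null` is flagged.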

## Three Evidence Rule (from CLAUDE.md)

**MANDATORY for ANY integration claim:**

1. **Configuration Evidence**: Show actual config file entries enabling the behavior
2. **Trigger Evidence**: Demonstrate automatic trigger mechanism (not manual execution)
3. **Log Evidence**: Timestamped logs from automatic behavior (not manual testing)

## Mock vs Real Mode Decision Tree

Before running ANY test, answer:

| Question | If YES → |
|----------|----------|
| Testing production/preview server behavior? | MUST use real mode |
| Validating actual API responses? | MUST use real mode |
| Checking data integrity (dice, state, persistence)? | MUST use real mode |
| Proving a bug is fixed in production? | MUST use real mode |
| Development workflow validation only? | Mock mode acceptable |
| Unit testing isolated functions? | Mock mode acceptable |
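
The table above can be read as a simple rule: if any "MUST use real mode" question applies, real mode wins. The sketch below encodes that reading. The purpose labels and the `required_mode` function are hypothetical names invented for illustration, and rejecting an empty or unrecognized set of purposes is a conservative assumption rather than something the table states.

```python
# Hypothetical labels for the six questions in the decision table.
REQUIRES_REAL_MODE = {
    "production_or_preview_behavior",
    "actual_api_responses",
    "data_integrity",
    "production_bug_fix",
}
MOCK_ACCEPTABLE = {
    "dev_workflow_validation",
    "isolated_unit_test",
}


def required_mode(purposes: set[str]) -> str:
    """Return "real" if any purpose demands real mode, else "mock-acceptable"."""
    if not purposes:
        # Assumption: a test with no stated purpose cannot be classified.
        raise ValueError("At least one test purpose is required")
    unknown = purposes - REQUIRES_REAL_MODE - MOCK_ACCEPTABLE
    if unknown:
        raise ValueError(f"Unrecognized test purposes: {sorted(unknown)}")
    # A single real-mode purpose overrides any mock-acceptable ones.
    if purposes & REQUIRES_REAL_MODE:
        return "real"
    return "mock-acceptable"
```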

### Production Mode vs Real Mode

**Production mode is NOT required for valid evidence.** Local testing with real services
(real LLM APIs, real Firebase, real dice) is sufficient to prove behavior.
If a run artifact records `production_mode`, `production_mode: false` is acceptable
for evidence as long as the claim is not about production configuration or prod-only behavior.

| Mode | When to Use | Evidence Value |
|------|-------------|----------------|
| `--production-mode` | Final deployment validation | Highest (actual prod config) |
| `--evidence` (local server) | PR validation, feature proof | **Valid** (real APIs, real data) |
| Mock mode | Unit tests, CI speed | Invalid for behavior claims |

The key requirement is **real execution** (actual API calls, actual RNG), not a production
environment. Evidence from `--start-local --evidence` is valid proof.

## Mock Mode Prohibition

**MOCK MODE = INVALID EVIDENCE** for:
- Production server validation
- API integration claims
- Data integrity verification (dice rolls, state changes)
- Bug fix confirmation
- Performance claims
- Security validation

**Mock mode tests ONLY prove:**
- Code syntax is correct
- Function signatures work
- Basic logic flow (in isolation)

**Mock mode tests NEVER prove:**
- Production behavior
- Real API responses
- Actual data execution
- Integration correctness

## Evidence Collection Requirements

### Canonical Evidence Bundle Files

**Required files in every evidence bundle:**

| File | Purpose | Required Keys |
|------|---------|---------------|
| `run.json` | Test results | `scenarios[*].name`, `scenarios[*].campaign_id`, `scenarios[*].errors` |
| `metadata.json` | Git/server provenance | `git_provenance`, `server`, `timestamp` |
| `evidence.md` | Human-readable summary | Pass/fail counts matching run.json |
| `methodology.md` | Test methodology | Environment, steps, validation |
| `README.md` | Package manifest | Git commit, branch, collection time |
| `request_responses.jsonl` | Raw MCP captures | Full request/response pairs |

**DEPRECATED:** `evidence.json` - use `run.json` + `metadata.json` instead.
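
A bundle can be checked against the file table before it is submitted. The following is a minimal sketch, assuming all canonical files sit directly in the bundle directory; `missing_bundle_files` and `uses_deprecated_evidence_json` are hypothetical helper names.

```python
from pathlib import Path

# File names copied from the canonical bundle table.
REQUIRED_BUNDLE_FILES = [
    "run.json",
    "metadata.json",
    "evidence.md",
    "methodology.md",
    "README.md",
    "request_responses.jsonl",
]


def missing_bundle_files(bundle_dir: Path) -> list[str]:
    """Return the required file names that do not exist in bundle_dir."""
    return [name for name in REQUIRED_BUNDLE_FILES
            if not (bundle_dir / name).is_file()]


def uses_deprecated_evidence_json(bundle_dir: Path) -> bool:
    """Return True if the bundle still ships the deprecated evidence.json."""
    return (bundle_dir / "evidence.json").is_file()
```

This only confirms that the files exist. Checking the required keys inside each file is a separate step.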

### Mandatory Scenarios Array

**Every test MUST emit `results["scenarios"]`** even for single-scenario runs:

```python
# ❌ BAD - Missing scenarios array causes "Total Scenarios: 0"
results = {"test_result": {...}}

# ✅ GOOD - Always include scenarios array
results = {
    "scenarios": [
        {
            "name": "scenario_name",
            "campaign_id": "abc123",  # Required for log traceability
            "passed": True,
            "errors": [],  # Always include, even if empty
            "checks": {...}
        }
    ],
    "test_result": {...}  # Optional summary
}
```
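
A test harness could reject malformed results before writing run.json. The sketch below assumes the required per-scenario keys are the ones listed for `run.json` in the bundle table (`name`, `campaign_id`, `errors`) and that an empty `scenarios` list is as invalid as a missing one; `validate_scenarios` is a hypothetical name.

```python
def validate_scenarios(results: dict) -> list[str]:
    """Return a list of problems with results["scenarios"]; empty means valid."""
    scenarios = results.get("scenarios")
    # Assumption: an empty list is rejected, since it also yields zero scenarios.
    if not isinstance(scenarios, list) or not scenarios:
        return ['results["scenarios"] must be a non-empty list']

    problems = []
    for index, scenario in enumerate(scenarios):
        for key in ("name", "campaign_id", "errors"):
            if key not in scenario:
                problems.append(f"scenarios[{index}] is missing '{key}'")
        # "errors" must be a list even when no errors occurred.
        if "errors" in scenario and not isinstance(scenario["errors"], list):
            problems.append(f"scenarios[{index}]['errors'] must be a list")
    return problems
```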

### Evidence Integrity (Checksums)

**ALL evidence files MUST have separate checksum files:**

```bash
# Generate checksums AFTER finalizing content
sha256sum run.json > run.json.sha256
sha256sum metadata.json > metadata.json.sha256

# Verify checksums
sha256sum -c run.json.sha256
```

**Anti-pattern:** Embedding checksums inside JSON files (self-invalidating).

**Checksum usability requirement:** `.sha256` files must reference the **local basename**
(e.g., `run.json`), not a nested path like `artifacts/run_.../run.json`.
This ensures `sha256sum -c` works when run from the evidence directory.

**ALL evidence files require checksums, including:**
- Individual test result files (PASS_*.json, FAIL_*.json)
- Aggregated files (request_responses.jsonl)
- Server logs (artifacts/server.log)

```python
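import hashlib
from pathlib import Path


def _write_checksum_for_file(filepath: Path) -> None:
    """Generate SHA256 checksum file for an existing file."""
    content = filepath.read_bytes()
    sha256_hash = hashlib.sha256(content).hexdigest()
    # Append ".sha256" to the full name (run.json -> run.json.sha256).
    checksum_file = Path(str(filepath) + ".sha256")
    # Reference only the basename so `sha256sum -c` works from the evidence directory.
    checksum_file.write_text(f"{sha256_hash}  {filepath.name}\n")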
```
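
The verification side can be sketched in Python as well, roughly mirroring `sha256sum -c` while also enforcing the basename requirement. This is an illustrative sketch: it assumes each `.sha256` file holds a single `<hash>  <name>` line in the two-space format written above, and `verify_checksum_file` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def verify_checksum_file(checksum_file: Path) -> bool:
    """Return True if the checksum names a local basename and the digest matches."""
    expected_hash, _, name = checksum_file.read_text().strip().partition("  ")
    # Reject nested paths such as "artifacts/run_.../run.json".
    if not name or Path(name).name != name:
        return False
    target = checksum_file.parent / name
    if not target.is_file():
        return False
    actual_hash = hashlib.sha256(target.read_bytes()).hexdigest()
    return actual_hash == expected_hash
```

Resolving the target relative to the checksum file's own directory matches the behavior of running `sha256sum -c` from inside the evidence directory.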

### Evidence Package Consistency (NEW)

**Single-run attribution:** If a bundle contains multiple runs, the docs **must**
name the exact run directory used for claims (e.g., `run_YYYYMMDD...`). Claims
must be traceable to one run only.

**Multi-campaign isolation:** If tests create multiple campaigns (e.g., isolated tests
for state-sensitive scenarios), evidence.md **must** include:
1. **Isolation Note** explaining why multiple campaigns exist
2. **Campaign ID** for each scenario result for traceability
3. **Claim Scoping** clarifying which campaign(s) aggregate claims reference

Example isolation note in evidence.md:
```markdown
## ⚠️ Multi-Campaign Isolation Note
This bundle contains **11 campaigns**: 1 shared + 10 isolated.
Each scenario includes its `campaign_id` for traceability.
```
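
Part of the multi-campaign requirement can be checked mechanically. The sketch below assumes the isolation note can be recognized by the phrase "isolation note" in evidence.md and that each campaign ID appears there verbatim; claim scoping still needs a human reviewer. `check_multi_campaign_docs` is a hypothetical name.

```python
def check_multi_campaign_docs(results: dict, evidence_md: str) -> list[str]:
    """Return problems with evidence.md when run.json spans several campaigns."""
    campaign_ids = {
        scenario["campaign_id"]
        for scenario in results.get("scenarios", [])
        if scenario.get("campaign_id")
    }
    # A single campaign needs no isolation note.
    if len(campaign_ids) <= 1:
        return []

    problems = []
    # Assumption: the note is identified by its heading text, case-insensitively.
    if "isolation note" not in evidence_md.lower():
        problems.append("evidence.md lacks an isolation note")
    for campaign_id in sorted(campaign_ids):
        if campaign_id not in evidence_md:
            problems.append(f"evidence.md does not mention campaign {campaign_id}")
    return problems
```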

**Per-scenario campaign ID in run.json:** When using fresh campaigns per scenario,
the test output **must** include `campaign_id` for each scenario entry:

```json
{
  "scenarios": [
    {
      "name": "Skill Check (Stealth)",
      "campaign_id": "zuFsywkYErTZpGBGDhDC",  // ← Required for log traceability
      "dice_audit_events": [...],
      "tool_results": [...]
    }
  ]
}
```

This enables matching server logs (which 
