evidence-standards

Testing evidence standards and validation checklist; use when defining or reviewing test evidence requirements.

# Evidence Standards for All Testing and Verification

## Core Principle

**Evidence must prove what you claim.** Mock data cannot prove production behavior.

## Minimum Viable Evidence Checklist

**Every test MUST capture these at minimum (copy-paste into test setup):**

```python
import os
import subprocess

# BASE_URL is expected to be defined by the test module (the server under test).


def capture_provenance():
    """REQUIRED: Capture all evidence standards."""
    provenance = {}

    # === GIT PROVENANCE (MANDATORY) ===
    # Refresh origin/main first so merge-base and diff stats reflect the current upstream.
    subprocess.run(["git", "fetch", "origin", "main"], timeout=10, capture_output=True)
    provenance["git_head"] = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    provenance["git_branch"] = subprocess.check_output(
        ["git", "branch", "--show-current"], text=True).strip()
    provenance["merge_base"] = subprocess.check_output(
        ["git", "merge-base", "HEAD", "origin/main"], text=True).strip()
    provenance["commits_ahead_of_main"] = int(subprocess.check_output(
        ["git", "rev-list", "--count", "origin/main..HEAD"], text=True).strip())
    provenance["diff_stat_vs_main"] = subprocess.check_output(
        ["git", "diff", "--stat", "origin/main...HEAD"], text=True).strip()

    # === SERVER RUNTIME (MANDATORY for server tests) ===
    port = BASE_URL.split(":")[-1].rstrip("/")
    # lsof exits non-zero when nothing listens on the port, so check_output would raise here.
    lsof = subprocess.run(
        ["lsof", "-i", f":{port}", "-t"], capture_output=True, text=True)
    pids = lsof.stdout.split()  # [] when no process is found
    provenance["server"] = {
        "pid": pids[0] if pids else None,
        "port": port,
        "process_cmdline": subprocess.check_output(
            ["ps", "-p", pids[0], "-o", "command="], text=True).strip() if pids else None,
        "env_vars": {var: os.environ.get(var) for var in
            ["WORLDAI_DEV_MODE", "TESTING", "GOOGLE_APPLICATION_CREDENTIALS"]}
    }

    return provenance
```

**Quick validation:** If your metadata.json is missing ANY of these fields, the test is incomplete:
- `provenance.merge_base`
- `provenance.commits_ahead_of_main`
- `provenance.diff_stat_vs_main`
- `provenance.server.pid`
- `provenance.server.port`
- `provenance.server.process_cmdline`
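
The quick validation above can be automated. The following is a minimal sketch, assuming metadata.json stores the dictionary returned by `capture_provenance()` under a top-level `provenance` key (inferred from the dotted field names) and that a field set to `None` counts as missing. The helper names `missing_provenance_fields` and `check_metadata_file` are hypothetical.

```python
import json
from pathlib import Path

# Dotted paths copied from the quick-validation checklist.
REQUIRED_FIELDS = [
    "provenance.merge_base",
    "provenance.commits_ahead_of_main",
    "provenance.diff_stat_vs_main",
    "provenance.server.pid",
    "provenance.server.port",
    "provenance.server.process_cmdline",
]


def missing_provenance_fields(metadata: dict) -> list[str]:
    """Return the required dotted paths that are absent or None in metadata."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = metadata
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        # Assumption: an explicit None (e.g. no server PID found) is treated as missing.
        if node is None:
            missing.append(dotted)
    return missing


def check_metadata_file(path: Path) -> list[str]:
    """Load a metadata.json file and report its missing fields."""
    return missing_provenance_fields(json.loads(path.read_text()))
```

An empty return value means the checklist is satisfied. Values such as `0` commits ahead or an empty diff stat still count as present, since only absence or `None` is flagged.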

## Three Evidence Rule (from CLAUDE.md)

**MANDATORY for ANY integration claim:**

1. **Configuration Evidence**: Show actual config file entries enabling the behavior
2. **Trigger Evidence**: Demonstrate automatic trigger mechanism (not manual execution)
3. **Log Evidence**: Timestamped logs from automatic behavior (not manual testing)

## Mock vs Real Mode Decision Tree

Before running ANY test, answer:

| Question | If YES → |
|----------|----------|
| Testing production/preview server behavior? | MUST use real mode |
| Validating actual API responses? | MUST use real mode |
| Checking data integrity (dice, state, persistence)? | MUST use real mode |
| Proving a bug is fixed in production? | MUST use real mode |
| Development workflow validation only? | Mock mode acceptable |
| Unit testing isolated functions? | Mock mode acceptable |

### Production Mode vs Real Mode

**Production mode is NOT required for valid evidence.** Local testing with real services
(real LLM APIs, real Firebase, real dice) is sufficient to prove behavior.
If a run artifact records `production_mode`, `production_mode: false` is acceptable
for evidence as long as the claim is not about production configuration or prod-only behavior.

| Mode | When to Use | Evidence Value |
|------|-------------|----------------|
| `--production-mode` | Final deployment validation | Highest (actual prod config) |
| `--evidence` (local server) | PR validation, feature proof | **Valid** (real APIs, real data) |
| Mock mode | Unit tests, CI speed | Invalid for behavior claims |

The key requirement is **real execution** (actual API calls, actual RNG), not production
environment. Evidence from `--start-local --evidence` is valid proof.
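
The mode table above reduces to a small lookup. This is only a sketch of that rule: the `Mode` enum, its member names, and `is_valid_evidence` are hypothetical labels introduced for illustration and are not part of any existing test runner.

```python
from enum import Enum


class Mode(Enum):
    PRODUCTION = "production"  # run with --production-mode
    EVIDENCE = "evidence"      # local server with real services (--evidence)
    MOCK = "mock"


def is_valid_evidence(mode: Mode, claim_is_prod_specific: bool = False) -> bool:
    """Apply the mode table to a behavior claim.

    Mock runs never prove behavior. Local real-mode runs are valid unless the
    claim is about production configuration or prod-only behavior.
    """
    if mode is Mode.MOCK:
        return False
    if mode is Mode.EVIDENCE:
        return not claim_is_prod_specific
    return True
```

For example, `is_valid_evidence(Mode.EVIDENCE)` is true for an ordinary feature claim, while the same mode with `claim_is_prod_specific=True` is rejected.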

## Mock Mode Prohibition

**MOCK MODE = INVALID EVIDENCE** for:
- Production server validation
- API integration claims
- Data integrity verification (dice rolls, state changes)
- Bug fix confirmation
- Performance claims
- Security validation

**Mock mode tests ONLY prove:**
- Code syntax is correct
- Function signatures work
- Basic logic flow (in isolation)

**Mock mode tests NEVER prove:**
- Production behavior
- Real API responses
- Actual data execution
- Integration correctness

## Evidence Collection Requirements

### Canonical Evidence Bundle Files

**Required files in every evidence bundle:**

| File | Purpose | Required Keys |
|------|---------|---------------|
| `run.json` | Test results | `scenarios[*].name`, `scenarios[*].campaign_id`, `scenarios[*].errors` |
| `metadata.json` | Git/server provenance | `git_provenance`, `server`, `timestamp` |
| `evidence.md` | Human-readable summary | Pass/fail counts matching run.json |
| `methodology.md` | Test methodology | Environment, steps, validation |
| `README.md` | Package manifest | Git commit, branch, collection time |
| `request_responses.jsonl` | Raw MCP captures | Full request/response pairs |

**DEPRECATED:** `evidence.json` - use `run.json` + `metadata.json` instead.
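
A bundle can be checked against the file table above before it is published. The sketch below assumes all canonical files sit directly in the bundle directory; `audit_bundle` is a hypothetical helper name, and it checks only file presence, not the required keys.

```python
from pathlib import Path

# File names copied from the canonical bundle table.
REQUIRED_BUNDLE_FILES = [
    "run.json",
    "metadata.json",
    "evidence.md",
    "methodology.md",
    "README.md",
    "request_responses.jsonl",
]


def audit_bundle(bundle_dir: Path) -> dict[str, list[str]]:
    """Report missing required files and deprecated files in an evidence bundle."""
    present = {p.name for p in bundle_dir.iterdir() if p.is_file()}
    return {
        "missing": [name for name in REQUIRED_BUNDLE_FILES if name not in present],
        "deprecated": ["evidence.json"] if "evidence.json" in present else [],
    }
```

A bundle passes this check when both lists in the returned report are empty.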

### Mandatory Scenarios Array

**Every test MUST emit `results["scenarios"]`** even for single-scenario runs:

```python
# ❌ BAD - Missing scenarios array causes "Total Scenarios: 0"
results = {"test_result": {...}}

# ✅ GOOD - Always include scenarios array
results = {
    "scenarios": [
        {
            "name": "scenario_name",
            "campaign_id": "abc123",  # Required for log traceability
            "passed": True,
            "errors": [],  # Always include, even if empty
            "checks": {...}
        }
    ],
    "test_result": {...}  # Optional summary
}
```
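
A guard like the following could catch a missing or malformed scenarios array before results are written. It is a sketch, assuming each scenario is a dictionary and that the required keys are the ones the bundle table lists for `run.json` (`name`, `campaign_id`, `errors`); `validate_scenarios` is a hypothetical name.

```python
REQUIRED_SCENARIO_KEYS = ("name", "campaign_id", "errors")


def validate_scenarios(results: dict) -> list[str]:
    """Return a list of problems with results["scenarios"]; empty means OK."""
    scenarios = results.get("scenarios")
    if not isinstance(scenarios, list) or not scenarios:
        return ["results['scenarios'] is missing or empty"]
    problems = []
    for index, scenario in enumerate(scenarios):
        for key in REQUIRED_SCENARIO_KEYS:
            if key not in scenario:
                problems.append(f"scenarios[{index}] is missing '{key}'")
        # "errors" must be a list even when there are no errors.
        if not isinstance(scenario.get("errors", []), list):
            problems.append(f"scenarios[{index}]['errors'] must be a list")
    return problems
```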

### Evidence Integrity (Checksums)

**ALL evidence files MUST have separate checksum files:**

```bash
# Generate checksums AFTER finalizing content
sha256sum run.json > run.json.sha256
sha256sum metadata.json > metadata.json.sha256

# Verify checksums
sha256sum -c run.json.sha256
```

**Anti-pattern:** Embedding checksums inside JSON files (self-invalidating).

**Checksum usability requirement:** `.sha256` files must reference the **local basename**
(e.g., `run.json`), not a nested path like `artifacts/run_.../run.json`.
This ensures `sha256sum -c` works when run from the evidence directory.

**ALL evidence files require checksums, including:**
- Individual test result files (PASS_*.json, FAIL_*.json)
- Aggregated files (request_responses.jsonl)
- Server logs (artifacts/server.log)

```python
import hashlib
from pathlib import Path


def _write_checksum_for_file(filepath: Path) -> None:
    """Generate SHA256 checksum file for an existing file."""
    content = filepath.read_bytes()
    sha256_hash = hashlib.sha256(content).hexdigest()
    checksum_file = Path(str(filepath) + ".sha256")
    # Two spaces + basename matches `sha256sum` output, so `sha256sum -c` works in place.
    checksum_file.write_text(f"{sha256_hash}  {filepath.name}\n")
```
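
The verification side can be sketched in Python as well, which also enforces the basename requirement. This assumes the `.sha256` file holds a single line in `sha256sum` format and sits next to the file it describes; `verify_checksum_file` is a hypothetical helper, not part of the original tooling.

```python
import hashlib
from pathlib import Path


def verify_checksum_file(checksum_file: Path) -> bool:
    """Check a `<name>.sha256` file against the sibling file it names.

    Returns False if the recorded name is not a bare basename, the target
    file is absent, or the digest does not match.
    """
    recorded_hash, recorded_name = checksum_file.read_text().split(maxsplit=1)
    # sha256sum prefixes the name with '*' when it hashed in binary mode.
    recorded_name = recorded_name.strip().lstrip("*")
    if Path(recorded_name).name != recorded_name:
        return False  # nested path: `sha256sum -c` would fail from the evidence directory
    target = checksum_file.parent / recorded_name
    if not target.is_file():
        return False
    return hashlib.sha256(target.read_bytes()).hexdigest() == recorded_hash
```

Rejecting nested paths outright, rather than resolving them, keeps the check aligned with what `sha256sum -c` would do when run from the evidence directory.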

### Evidence Package Consistency (NEW)

**Single-run attribution:** If a bundle contains multiple runs, the docs **must**
name the exact run directory used for claims (e.g., `run_YYYYMMDD...`). Claims
must be traceable to one run only.

**Multi-campaign isolation:** If tests create multiple campaigns (e.g., isolated tests
for state-sensitive scenarios), evidence.md **must** include:
1. **Isolation Note** explaining why multiple campaigns exist
2. **Campaign ID** for each scenario result for traceability
3. **Claim Scoping** clarifying which campaign(s) aggregate claims reference

Example isolation note in evidence.md:
```markdown
## ⚠️ Multi-Campaign Isolation Note
This bundle contains **11 campaigns**: 1 shared + 10 isolated.
Each scenario includes its `campaign_id` for traceability.
```
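
To support claim scoping, the campaigns in a run can be indexed from `run.json`. A minimal sketch, assuming the `scenarios[*].name` and `scenarios[*].campaign_id` keys described earlier; `scenarios_by_campaign` is a hypothetical helper name.

```python
from collections import defaultdict


def scenarios_by_campaign(run: dict) -> dict[str, list[str]]:
    """Map each campaign_id to the scenario names recorded against it.

    Raises ValueError if any scenario lacks a campaign_id, since such a
    result cannot be traced back to server logs.
    """
    grouped = defaultdict(list)
    for scenario in run.get("scenarios", []):
        campaign_id = scenario.get("campaign_id")
        if not campaign_id:
            raise ValueError(f"scenario {scenario.get('name')!r} has no campaign_id")
        grouped[campaign_id].append(scenario.get("name"))
    return dict(grouped)
```

The number of keys in the result is the campaign count an isolation note should report, and each aggregate claim can cite the campaign IDs it draws on.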

**Per-scenario campaign ID in run.json:** When using fresh campaigns per scenario,
the test output **must** include `campaign_id` for each scenario entry:

```json
{
  "scenarios": [
    {
      "name": "Skill Check (Stealth)",
      "campaign_id": "zuFsywkYErTZpGBGDhDC",  // ← Required for log traceability
      "dice_audit_events": [...],
      "tool_results": [...]
    }
  ]
}
```

This enables matching server logs (which

Source: https://github.com/jleechanorg/claude-commands/tree/main/codex_skills/evidence-standards
