Claude
Skills
Sign in
Back

o11y-logging

Included with Lifetime
$97 forever

Implement and verify joelclaw observability on every change so failures cannot stay silent. Use when adding/updating Inngest functions, gateway channels, webhook providers, APIs, workers, or any pipeline step. Enforces canonical OTEL contract, storage path, and verification gates. Triggers on: 'o11y', 'observability', 'logging', 'otel', 'instrument this', 'silent failure', 'add telemetry', 'log this function'.

Generaljoelclawobservabilityloggingo11yoteltypesenseconvexinngestscriptsassets

What this skill does


# JoelClaw Observability + Logging

Prevent silent failure by default. Observability is not optional polish: it is part of done.

## Non-Negotiable Rules

1. Use the canonical event contract only.
   - `packages/system-bus/src/observability/otel-event.ts`
   - `packages/system-bus/src/observability/emit.ts`
   - `packages/system-bus/src/observability/store.ts`
2. Worker/Inngest code emits through `emitOtelEvent` or `emitMeasuredOtelEvent`.
3. Gateway code emits through `emitGatewayOtel`.
4. Internal ingestion goes through `POST /observability/emit` (`packages/system-bus/src/serve.ts`), not ad-hoc writes.
5. Never treat `console.log` as primary observability. Keep structured events as source of truth.
6. High-cardinality values go in `metadata`, not in facet fields (`source`, `component`, `level`, `success`).
7. Failures must set `success: false` with a meaningful `error`.
8. For warn/error/fatal, verify Convex mirror behavior (rolling window) in addition to Typesense write.
9. In Inngest durable functions, any "emit once" telemetry must live inside `step.run(...)` to avoid replay duplication after resume.

## Event Conventions

- `source`: subsystem (`worker`, `gateway`, `webhook`, `memory`, `verification`, etc.)
- `component`: stable module/service name (`check-system-health`, `redis-channel`, `observe`)
- `action`: stable dotted action (`system.health.checked`, `events.immediate_telegram`)
- `metadata`: request IDs, deployment IDs, function IDs, session IDs, payload identifiers
- `duration_ms`: include for timed operations

Use event-per-hop (wide event style): one context-rich event for each major boundary/operation, not scattered string logs.

## Implementation Workflow

1. Identify the boundary being changed.
   - Inngest function, gateway channel, webhook route, API route, background job, sync step.
2. Add success and failure envelopes.
   - Start + completion for long tasks, or a single completion event for short tasks.
3. Include operational and business context in `metadata`.
   - Example: function id, event id, provider, queue depth, affected resource id.
4. Keep severity useful.
   - `debug/info` for normal activity, `warn` for degraded but recoverable, `error/fatal` for failures.
5. Run verification gates before finishing.

For full checklists and command recipes, read `references/implementation-checklist.md`.

## Quick Patterns

### Worker / Inngest timed operation

```typescript
import { emitMeasuredOtelEvent } from "../../observability/emit";

await emitMeasuredOtelEvent(
  {
    level: "info",
    source: "worker",
    component: "content-sync",
    action: "content_sync.run",
    metadata: { trigger: event.name },
  },
  async () => {
    await runSync();
  }
);
```

### Gateway emission

```typescript
import { emitGatewayOtel } from "../observability";

await emitGatewayOtel({
  level: "error",
  component: "redis-channel",
  action: "events.immediate_telegram",
  success: false,
  error: "telegram_send_failed",
  metadata: { sessionId, queueDepth },
});
```

### CLI emission

Use `--metadata` for JSON context. There is no `--attributes` flag.

```bash
joelclaw otel emit "task.completed" \
  --source system \
  --component skills \
  --success true \
  --metadata '{"session":"NimbleBadger","task":"install wzrrd-publish skill"}'
```

## Definition of Done

- Structured OTEL events added for the changed path.
- No direct feature-level writes to Typesense/Convex for observability data.
- Smoke probe passes (`scripts/otel-smoke.sh`).
- `joelclaw otel list` and `joelclaw otel stats` show expected behavior.
- New failure modes are queryable by `source`, `component`, and `action`.

## Inngest Replay + Hang Triage

Use this when step code appears to run but runs remain `RUNNING`/`CANCELLED` with `Finalization` errors.

1. Inspect run trace first.

```bash
joelclaw run <run-id>
```

Look for `errors.Finalization.stack` containing `Unable to reach SDK URL`.

2. Confirm whether this is true network reachability or worker-side blocking.

```bash
joelclaw inngest status
joelclaw logs worker --lines 200
joelclaw logs errors --lines 200
```

3. Check for replay-noise in OTEL.

If an action that should emit once (for example `manifest.archive.prereqs-passed`) appears hundreds of times in one run window, move that emit into its own `step.run`.

```bash
joelclaw otel search "manifest.archive.prereqs-passed" --hours 1
```

4. Treat `Unable to reach SDK URL` as an ambiguous symptom.

It can indicate ingress problems, but in practice it can also happen when a function handler blocks on local IO/dependencies long enough that finalization cannot complete.

## Helper Script

Use `scripts/otel-smoke.sh` for a fast end-to-end probe:

```bash
./skills/o11y-logging/scripts/otel-smoke.sh verification o11y-skill probe.emit
```

## Key Files

- `packages/system-bus/src/observability/otel-event.ts`
- `packages/system-bus/src/observability/emit.ts`
- `packages/system-bus/src/observability/store.ts`
- `packages/system-bus/src/serve.ts`
- `packages/gateway/src/observability.ts`
- `packages/system-bus/src/inngest/functions/check-system-health.ts`
- `packages/cli/src/commands/otel.ts`
- `apps/web/app/api/otel/route.ts`
Files: 6
Size: 16.8 KB
Complexity: 65/100
Category: General

Related in General