
byteforge-prometheus-metrics


Configure Prometheus metrics for Python/Flask applications with multi-worker gunicorn safety and optional Redis-backed business metrics. Use when adding a /metrics endpoint, instrumenting a Flask app, or fixing inflated metrics from multi-worker gunicorn.


# Prometheus Metrics for Python/Flask Apps

This skill wires Prometheus metrics into a Python/Flask application **safely under multi-worker gunicorn** and adds an optional Redis-backed pattern for cross-worker business metrics.

## When to Use This Skill

Use this skill when:
- Adding a `/metrics` endpoint to a Python/Flask service
- The service runs (or might run) under gunicorn with more than one worker
- You need cross-worker business metrics (signups, jobs, queue depth, etc.) and Redis is available
- You are debugging inflated `rate()` values or phantom traffic in PromQL — counter values appear to "reset" because scrapes bounce between workers with private registries

## The Multi-Worker Gunicorn Trap (Read This First)

`prometheus_client` keeps application metrics (`Counter`, `Histogram`, `Gauge`) in **per-process** memory. Under gunicorn with N workers, each worker has its own private registry. Prometheus scrapes hit a random worker each time, so:

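- Counter values appear to "reset" as scrapes bounce between workers. PromQL `rate()` treats every drop as a counter reset and counts the post-drop value in full, so it manufactures phantom traffic (often 5–10× the real rate for a 4-worker setup)
- Gauges that should be aggregate-able (e.g. queue depth) report whichever worker answered the scrape, not the cluster-wide value
- Histograms break the same way: their `_bucket`, `_count`, and `_sum` series are counters, so rates and quantiles computed from them are distorted too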
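
The effect described above can be reproduced without any Prometheus server. The sketch below is a simplified model, not PromQL itself: `prometheus_increase` is a hypothetical helper that applies only the reset rule (a drop counts the new value in full) and ignores the extrapolation that real `rate()`/`increase()` also perform. The worker totals are made-up numbers.

```python
from itertools import cycle


def prometheus_increase(samples):
    """Sum increases between consecutive samples, treating any drop as a
    counter reset (the post-reset value is counted in full)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical steady state: 4 workers whose private counters hold different
# totals, with NO new traffic during the scrape window.
worker_counters = [120, 95, 130, 101]

# Each scrape lands on a different worker (round-robin here for determinism;
# real scrapes hit workers in no particular order).
scrapes = [value for value, _ in zip(cycle(worker_counters), range(8))]

phantom = prometheus_increase(scrapes)
print(scrapes)  # [120, 95, 130, 101, 120, 95, 130, 101]
print(phantom)  # 481.0, although the true increase is 0
```

With a single shared view of the counter, the same eight scrapes would return a non-decreasing sequence and the computed increase would match real traffic.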

**What does NOT prove this bug:** `process_start_time_seconds`, `process_cpu_seconds_total`, `python_gc_*`, and other metrics from `ProcessCollector` / `GCCollector` describe only the worker that answered the scrape, so they will **always** show up to N distinct values across scrapes, one per worker PID. (`python_info` from `PlatformCollector` is per-process as well, but its value is normally identical in every worker.) `prometheus_client` does not aggregate those collectors across processes by design, regardless of whether multiproc is wired correctly. Diagnose using application metrics (the `Counter`/`Histogram`/`Gauge` objects you declared), not process metrics.

**Fix:** every Step in this skill assumes multi-worker. The pieces are:

1. `prometheus_flask_exporter.GunicornPrometheusMetrics` (multiproc-aware)
2. `PROMETHEUS_MULTIPROC_DIR` env var set, dir cleaned on startup
3. `gunicorn_conf.py` with `child_exit` hook so dead workers release their shards
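4. Custom Gauges declare `multiprocess_mode='livesum'` (or similar); otherwise they fall back to the default `all` mode, which exports one series per worker PID (with a `pid` label) instead of a single aggregate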
5. Redis as the source-of-truth for cross-worker business counters (optional)

Skip any of these and metrics will look fine in dev (single process) and silently wrong in prod.
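
As an illustration of piece 2, the startup cleanup might look like the sketch below. `reset_multiproc_dir` is a hypothetical helper name, and the sketch assumes the shard files are the `*.db` files that `prometheus_client` writes into `PROMETHEUS_MULTIPROC_DIR`. It has to run once before gunicorn forks any worker, for example from the container entrypoint.

```python
import tempfile
from pathlib import Path


def reset_multiproc_dir(path):
    """Create the directory if needed and delete leftover *.db shard files.

    Run this once, before any worker starts; running it while workers are
    live would delete shards that are still in use.
    """
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    removed = 0
    for shard in directory.glob("*.db"):
        shard.unlink()
        removed += 1
    return removed


# Demo against a throwaway directory containing one stale shard.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "counter_1234.db").write_bytes(b"")
    removed_count = reset_multiproc_dir(tmp)
    remaining = list(Path(tmp).iterdir())

print(removed_count, remaining)  # 1 []
```

Stale shards from a previous run would otherwise be merged into the first scrapes after a restart, which is why the cleanup belongs in the startup path rather than in a periodic job.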

## What This Skill Creates

1. **requirements.txt entries** — `prometheus_client`, `prometheus_flask_exporter`, optionally `redis`
2. **Metrics initialization** in the main app file, wired through `create_app()` (post-fork)
3. **`gunicorn_conf.py`** — `child_exit` hook for multiproc shard cleanup
4. **Deploy snippets** — Dockerfile/docker-compose/systemd showing `PROMETHEUS_MULTIPROC_DIR` setup + startup dir cleanup
5. **Custom metric examples** — file-multiproc-safe (with Gauge mode) and Redis-backed (cross-worker aggregate)
6. **Starter Grafana dashboard JSON** — `references/starter-dashboard.json`, importable as-is
7. **Prometheus scrape config snippet** — for whoever runs the Prometheus server
8. **Verification checklist** — counter monotonicity and `rate()` sanity post-deploy

## Step 1: Gather Project Information

**IMPORTANT**: Before making changes, ask the user these questions:

1. **"What is your application tag/name?"** (e.g., "my-api", "ingest-worker")
   - This is the `application` label on all metrics — **must match** the `application_tag` you use in [[byteforge-loki-logging]] so logs and metrics correlate in Grafana
2. **"What is your main application file?"** (e.g., "app.py", "server.py")
   - Where `create_app()` lives
3. **"Does this service run under gunicorn with more than one worker?"** (yes/no — assume yes if unsure)
   - If yes: full multiproc setup (Steps 4 and 5)
   - If no: simpler single-process path, but **still scaffold the multiproc setup** behind a `PROMETHEUS_MULTIPROC_DIR` env-var gate so the service stays safe when scaled up later
4. **"Is Redis available to this service?"** (yes/no)
   - If yes: scaffold Step 6's Redis-backed business metrics pattern
   - If no: skip Step 6; document it as a follow-up if Redis is added later
5. **"What's the metrics scrape path?"** (default: `/metrics`)
   - Override only if `/metrics` collides with an existing route
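
The env-var gate mentioned under question 3 could be sketched roughly as follows. `multiproc_dir` is a hypothetical helper, not part of any library; it only decides whether the multiproc path should be taken and leaves the actual metrics wiring to the caller.

```python
import os


def multiproc_dir(environ=None):
    """Return the multiproc directory if the gate is open, else None.

    The gate is open only when PROMETHEUS_MULTIPROC_DIR is set to a
    non-empty value that points at an existing directory.
    """
    env = os.environ if environ is None else environ
    path = env.get("PROMETHEUS_MULTIPROC_DIR", "").strip()
    if path and os.path.isdir(path):
        return path
    return None


# Gate closed: variable unset, so the single-process path applies.
print(multiproc_dir({}))  # None

# Gate open: variable points at a directory that exists.
print(multiproc_dir({"PROMETHEUS_MULTIPROC_DIR": os.getcwd()}) == os.getcwd())  # True
```

Treating a missing directory as "gate closed" is a design choice in this sketch; a service that must never silently run in single-process mode may prefer to raise an error instead.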

## Step 2: Add Dependencies to requirements.txt

Add:

```txt
# Prometheus metrics
prometheus_client>=0.19.0
prometheus_flask_exporter>=0.23.0
```

If the user answered "yes" to Redis (Step 1, Q4), also add:

```txt
redis>=5.0.0
```

Install:

```bash
pip install -r requirements.txt
```

## Step 3: Wire Metrics into the Flask App

Add inside `create_app()` in `{app_file}`, **after** `Flask(__name__)` and **before** registering blueprints/routes:

```python
import os
from flask import Flask
from prometheus_flask_exporter.multiprocess import GunicornPrometheusMetrics


def create_app() -> Flask:
    app = Flask(__name__)

    # ... existing setup (logging, etc.) ...

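    # Prometheus metrics, multiproc-aware. PROMETHEUS_MULTIPROC_DIR must
    # already point at an existing directory: upstream's multiprocess classes
    # validate it on construction and raise ValueError rather than falling
    # back, so dev / workers=1 runs without it need their own gate.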
    metrics = GunicornPrometheusMetrics(
        app,
        defaults_prefix='flask',
        group_by='endpoint',  # group HTTP latency by Flask endpoint name, not URL path
    )
    metrics.info(
        'app_info',
        'Application info',
        application='{application_tag}',
    )

    # ... register blueprints, configure Api, etc.
    return app


app = create_app()
```

**CRITICAL**: Replace:
- `{app_file}` → your main application filename
- `{application_tag}` → the tag from Step 1, Q1

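**Why inside `create_app()` and not at module level**: by default (without `--preload`), gunicorn imports the application module in each worker after forking, so `create_app()` runs post-fork and each worker owns its own multiproc shards. Initialization that runs in the master process instead (for example, module-level setup imported under `--preload`) breaks multiproc shard ownership. This matches the same fork-safety rule that [[byteforge-loki-logging]] documents for the Loki handler.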

**Why `group_by='endpoint'`**: URL-path grouping explodes cardinality (one time series per unique URL — `/users/123`, `/users/456`, ...). Endpoint grouping uses Flask's route name (`get_user`), keeping cardinality bounded.
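
The cardinality difference can be illustrated with a small stand-alone model. The route table and `endpoint_for` below are hypothetical stand-ins for Flask's URL map, used only to count how many distinct label values each grouping would produce for the same traffic.

```python
import re

# Hypothetical route table: URL pattern -> Flask-style endpoint name.
ROUTES = [
    (re.compile(r"^/users/\d+$"), "get_user"),
    (re.compile(r"^/health$"), "health"),
]


def endpoint_for(path):
    """Map a concrete URL path to its endpoint name."""
    for pattern, endpoint in ROUTES:
        if pattern.match(path):
            return endpoint
    return "unmatched"


# 1000 distinct user URLs plus one health check.
paths = [f"/users/{user_id}" for user_id in range(1000)] + ["/health"]

series_by_path = len(set(paths))
series_by_endpoint = len({endpoint_for(p) for p in paths})

print(series_by_path)      # 1001
print(series_by_endpoint)  # 2
```

Each distinct label value is multiplied again by the method and status labels and by every histogram bucket, so the gap in stored series is larger than these raw counts suggest.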

### What this gives you for free

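- `/metrics` exposure. Caveat: in upstream `prometheus_flask_exporter`, `GunicornPrometheusMetrics` serves metrics from a separate HTTP server started with `GunicornPrometheusMetrics.start_http_server_when_ready(port)` in the gunicorn config, while the sibling `GunicornInternalPrometheusMetrics` is the class that mounts `/metrics` on the Flask app itself; confirm which behavior your installed version provides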
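- `flask_http_request_total{method,status}`: request counter (upstream documents only these two labels, with no per-endpoint breakdown)
- `flask_http_request_duration_seconds{method,status,endpoint}`: latency histogram (p50/p95/p99 via PromQL)
- `flask_http_request_exceptions_total{method,status,endpoint}`: unhandled exception counter
- Default process metrics (`process_resident_memory_bytes`, `process_cpu_seconds_total`, `process_start_time_seconds`, GC stats) when the global `REGISTRY` is served; a fresh multiprocess `CollectorRegistry()` omits them unless the collectors are re-registered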

### Note: raw `prometheus_client` path (only if you can't use `GunicornPrometheusMetrics`)

Some services have reason not to adopt `prometheus_flask_exporter` (custom transport, non-Flask consumer of the same registry, etc.). The canonical bare-bones multiproc pattern in upstream `prometheus_client` docs looks like this:

```python
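import os

from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)


def get_metrics():
    """Return (payload, content_type) for a /metrics response."""
    if os.environ.get('PROMETHEUS_MULTIPROC_DIR'):
        # Multiprocess mode: build a fresh registry per scrape that
        # aggregates every worker's shard files from the directory.
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        return generate_latest(registry), CONTENT_TYPE_LATEST
    # Single-process mode: serve the default global registry.
    return generate_latest(), CONTENT_TYPE_LATEST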
```

**Gotcha**: the fresh `CollectorRegistry()` does **not** include `ProcessCollector`, `PlatformCollector`, or `GCCollector` — those are attached to the global `REGISTRY`. The resulting `/metrics` will be missing `process_*`, `python_info`, and `python_gc_*` entirely. To restore them:

```python
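from prometheus_client import (
    GC_COLLECTOR, CollectorRegistry, PlatformCollector, ProcessCollector,
    multiprocess,
)

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
# Re-attach the per-process collectors; they describe only the worker
# that serves the scrape and are not aggregated across workers.
ProcessCollector(registry=registry)
PlatformCollector(registry=registry)
registry.register(GC_COLLECTOR)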
```
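
One way to confirm the collectors came back is to inspect a scrape body for the expected metric families. The helper below is a hypothetical sketch that works on the plain text exposition format and needs no Prometheus libraries; the sample payload is invented for the demo.

```python
def missing_families(exposition_text,
                     prefixes=("process_", "python_info", "python_gc_")):
    """Return the prefixes that have no sample line in a text-format scrape."""
    sample_names = [
        line.split("{", 1)[0].split(" ", 1)[0]
        for line in exposition_text.splitlines()
        if line and not line.startswith("#")  # skip blanks, HELP/TYPE lines
    ]
    return [
        prefix for prefix in prefixes
        if not any(name.startswith(prefix) for name in sample_names)
    ]


# Invented scrape body: application metrics plus python_info only.
sample = """\
# HELP flask_http_request_total Total number of HTTP requests
# TYPE flask_http_request_total counter
flask_http_request_total{method="GET",status="200"} 42.0
python_info{implementation="CPython",major="3",minor="12"} 1.0
"""

print(missing_families(sample))  # ['process_', 'python_gc_']
```

An empty list means every expected family appears at least once. Note that `process_*` metrics are only available on platforms where `ProcessCollector` can read process stats, so a missing `process_` prefix is not always a wiring mistake.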

The
