
prometheus-expert


Expert-level Prometheus monitoring, metrics collection, PromQL queries, alerting, and production operations

Tags: devops, prometheus, monitoring, metrics, observability, alerting, promql

What this skill does


# Prometheus Expert

You are an expert in Prometheus with deep knowledge of metrics collection, PromQL queries, recording rules, alerting rules, service discovery, and production operations. You design and manage comprehensive observability systems following monitoring best practices.

## Core Expertise

### Prometheus Architecture

**Components:**
```
Prometheus Stack:
├── Prometheus Server (TSDB + scraper)
├── Alertmanager (alert routing)
├── Pushgateway (batch jobs)
├── Exporters (metrics exposure)
├── Service Discovery (target discovery)
└── Client Libraries (instrumentation)
```

### Installation on Kubernetes

**Prometheus Operator:**
```bash
# Install with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```
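The `--set` flags above can also be kept in a values file, which is easier to review and version. A sketch of the equivalent file, assuming it is saved as `values.yaml` (the filename is an arbitrary choice, not from the original):

```yaml
# values.yaml: same settings as the two --set flags in the install command
prometheus:
  prometheusSpec:
    retention: 30d                # keep 30 days of samples
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi       # size of the persistent volume for the TSDB
```

It would be passed to `helm install` with `-f values.yaml` in place of the `--set` flags.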

**Prometheus Config (standalone deployment, without the Operator):**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        region: us-east-1

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093

    # Rule files
    rule_files:
    - /etc/prometheus/rules/*.yml

    # Scrape configurations
    scrape_configs:
    # Prometheus itself
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    # Kubernetes API server
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Kubernetes nodes
    - job_name: kubernetes-nodes
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Kubernetes pods
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
```
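The `kubernetes-pods` job above keeps only pods annotated with `prometheus.io/scrape: "true"`, and uses the optional `prometheus.io/path` and `prometheus.io/port` annotations to build the scrape URL. A minimal sketch of a workload that opts in, assuming a hypothetical `myapp` Deployment that serves metrics on port 8080 (the name, image, and port are illustrative, not from the original):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                    # hypothetical workload name
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Annotation values must be strings, so quote them.
        prometheus.io/scrape: "true"     # matched by the `keep` rule
        prometheus.io/path: "/metrics"   # copied into __metrics_path__
        prometheus.io/port: "8080"       # rewritten into __address__
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.0.0   # placeholder image
        ports:
        - name: metrics
          containerPort: 8080            # assumed metrics port
```

Prometheus exposes each pod annotation as a `__meta_kubernetes_pod_annotation_<name>` label with unsupported characters replaced by underscores, which is why `prometheus.io/scrape` shows up as `__meta_kubernetes_pod_annotation_prometheus_io_scrape` in the relabel rules.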

### ServiceMonitor (Prometheus Operator)

**ServiceMonitor for Application:**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
    release: prometheus
spec:
  selector:
    matchLabels:
      app: myapp

  namespaceSelector:
    matchNames:
    - production

  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: node
```
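A ServiceMonitor selects Services rather than pods: `selector.matchLabels` is compared against the Service's own labels, and `endpoints[].port` refers to a port name on that Service. A sketch of a Service that the ServiceMonitor above would pick up, assuming the application exposes metrics on container port 8080 (an illustrative value):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production      # listed in namespaceSelector.matchNames
  labels:
    app: myapp               # matched by spec.selector.matchLabels
spec:
  selector:
    app: myapp               # pods backing this Service
  ports:
  - name: metrics            # referenced by endpoints[].port
    port: 8080
    targetPort: 8080         # assumed container port
```

The `release: prometheus` label on the ServiceMonitor itself also matters. With the default selector settings of the kube-prometheus-stack chart, Prometheus loads only ServiceMonitors labeled with the Helm release name, which is `prometheus` in the install command shown earlier.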

**PodMonitor:**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pods
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp

  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: instance
    - sourceLabels: [__meta_kubernetes_pod_container_name]
      targetLabel: container
```
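A PodMonitor bypasses Services and matches pods directly, so the `port` it names has to be a named container port. A sketch of a pod the PodMonitor above would scrape, assuming a hypothetical pod name, a placeholder image, and port 8080 for metrics:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-0              # hypothetical pod name
  namespace: production      # same namespace as the PodMonitor
  labels:
    app: myapp               # matched by spec.selector.matchLabels
spec:
  containers:
  - name: myapp
    image: registry.example.com/myapp:1.0.0   # placeholder image
    ports:
    - name: metrics          # referenced by podMetricsEndpoints[].port
      containerPort: 8080    # assumed metrics port
```

Because the PodMonitor sets no `namespaceSelector`, it only considers pods in its own namespace.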

### PromQL Queries

**Basic Queries:**
```promql
# Instant vector - current value
http_requests_total

# Rate of requests (per second over 5m)
rate(http_requests_total[5m])

# Sum by label
sum(rate(http_requests_total[5m])) by (job, method)

# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk usage percentage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```

**Advanced Queries:**
```promql
# Request latency (95th percentile)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, method)
)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)

# Top 10 endpoints by request count
topk(10, sum(rate(http_requests_total[1h])) by (endpoint))

# Prediction (linear regression)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600)

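# Aggregation over time (meant for gauges; wrap counters in rate() using a subquery)
avg_over_time(rate(http_requests_total[5m])[1h:1m])
max_over_time(rate(http_requests_total[5m])[1h:1m])
min_over_time(rate(http_requests_total[5m])[1h:1m])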

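# Join metrics: copy the node label from kube_pod_info (an info metric whose value is 1)
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
  * on(namespace, pod) group_left(node) kube_pod_info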
```
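Expressions such as the error rate and the `predict_linear` forecast above are commonly turned into alerts by adding a threshold and a `for` duration. A minimal sketch as a PrometheusRule, where the rule names, thresholds (5% errors, disk predicted full within 4 hours), durations, and severity labels are illustrative assumptions rather than recommendations from the original:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts           # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus          # assumed to match the Prometheus ruleSelector
spec:
  groups:
  - name: example.alerts
    rules:
    - alert: HighErrorRate
      # Fires when more than 5% of requests return 5xx for 10 minutes.
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job) * 100 > 5
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "High 5xx error rate for {{ $labels.job }}"
        description: "{{ $value | humanize }}% of requests are failing."

    - alert: DiskWillFillIn4Hours
      # Linear forecast over the last hour predicts no free space 4 hours from now.
      expr: |
        predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Root filesystem on {{ $labels.instance }} is predicted to fill within 4 hours"
```

The `for` clause keeps short spikes from paging anyone: the expression has to stay true for the whole duration before the alert moves from pending to firing.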

**Kubernetes-Specific Queries:**
```promql
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

# Pod memory usage
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)

# Pod restart count
kube_pod_container_status_restarts_total{namespace="production"}

# Available replicas
kube_deployment_status_replicas_available{namespace="production"}

# Pending pods (the metric is 0 or 1 per pod and phase, so sum rather than count)
sum(kube_pod_status_phase{phase="Pending"}) by (namespace)

# CPU requests as a percentage of allocatable CPU, per node
sum(kube_pod_container_resource_requests{resource="cpu"}) by (node) /
sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100
```

### Recording Rules

**Recording Rules Configuration:**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus ruleSelector; the Helm chart selects on the release name by default
spec:
  groups:
  - name: api_performance
    interval: 30s
    rules:
    # Request rate by endpoint
    - record: api:http_requests:rate5m
      expr: |
        sum(rate(http_requests_total[5m])) by (job, endpoint, method)

    # Request rate by status
    - record: api:http_requests:rate5m:status
      expr: |
        sum(rate(http_requests_total[5m])) by (job, status)

    # Error rate
    - record: api:http_requests:error_rate5m
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
        sum(rate(http_requests_total[5m])) by (job)

    # Latency percentiles
    - record: api:http_request_duration:p50
      expr: |
        histogram_quantile(0.50,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
        )

    - record: api:http_request_duration:p95
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
        )

    - record: api:http_request_duration:p99
      expr: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
        )
```
