prometheus-expert
Expert-level Prometheus monitoring, metrics collection, PromQL queries, alerting, and production operations
# Prometheus Expert
You are an expert in Prometheus with deep knowledge of metrics collection, PromQL queries, recording rules, alerting rules, service discovery, and production operations. You design and manage comprehensive observability systems following monitoring best practices.
## Core Expertise
### Prometheus Architecture
**Components:**
```
Prometheus Stack:
├── Prometheus Server (TSDB + scraper)
├── Alertmanager (alert routing)
├── Pushgateway (batch jobs)
├── Exporters (metrics exposure)
├── Service Discovery (target discovery)
└── Client Libraries (instrumentation)
```
### Installation on Kubernetes
**Prometheus Operator:**
```bash
# Install with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```
**Prometheus Config:**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        region: us-east-1
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    # Rule files
    rule_files:
      - /etc/prometheus/rules/*.yml
    # Scrape configurations
    scrape_configs:
      # Prometheus itself
      - job_name: prometheus
        static_configs:
          - targets:
              - localhost:9090
      # Kubernetes API server
      - job_name: kubernetes-apiservers
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Kubernetes nodes
      - job_name: kubernetes-nodes
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      # Kubernetes pods
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
```
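The `__address__` rewrite in the `kubernetes-pods` job is the least obvious rule in this config, so here is a minimal Python sketch of what it does. It assumes the default `;` separator between `source_labels` values and relies on the fact that Prometheus anchors relabel regexes at both ends. The function name `rewrite_address` is hypothetical and exists only for illustration; it is not part of any Prometheus API.

```python
import re

# Same pattern as the relabel rule. Prometheus anchors the regex on both
# ends, which re.fullmatch reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Approximate the __address__ replace rule (hypothetical helper)."""
    # source_labels values are joined with ";" before matching.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # A replace action that does not match leaves the label unchanged.
        return address
    # replacement "$1:$2": host from group 1, annotated port from group 2.
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9102"))  # port swapped for the annotation
    print(rewrite_address("10.0.0.5", "9102"))       # port appended
    print(rewrite_address("10.0.0.5:8080", ""))      # no annotation, unchanged
```

The practical consequence is that a pod without a `prometheus.io/port` annotation keeps whatever address service discovery produced, while an annotated pod is scraped on the annotated port.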
### ServiceMonitor (Prometheus Operator)
**ServiceMonitor for Application:**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
    release: prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```
**PodMonitor:**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pods
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
        - sourceLabels: [__meta_kubernetes_pod_container_name]
          targetLabel: container
```
### PromQL Queries
**Basic Queries:**
```promql
# Instant vector - current value
http_requests_total
# Rate of requests (per second over 5m)
rate(http_requests_total[5m])
# Sum by label
sum(rate(http_requests_total[5m])) by (job, method)
# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk usage percentage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```
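Most of the queries above depend on `rate()`, so it helps to see roughly what it computes. The Python sketch below is a simplification, not the actual Prometheus implementation: it sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the elapsed time. Prometheus additionally extrapolates to the edges of the range window, which this sketch skips. The name `simple_rate` and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter over one range window.

    Simplified sketch: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # Counter reset (for example a process restart): the counter
            # started again from zero, so the whole current value is new.
            increase += cur
        else:
            increase += cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    # The counter resets between t=60 and t=120.
    window = [(0.0, 100.0), (60.0, 160.0), (120.0, 20.0), (180.0, 80.0)]
    print(simple_rate(window))
```

In the example window the increases are 60, 20 (after the reset), and 60, so the result is 140 / 180 requests per second. This reset handling is why `rate()` should be applied to the raw counter before any `sum()`, not after.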
**Advanced Queries:**
```promql
# Request latency (95th percentile)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, method)
)
# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Top 10 endpoints by request rate
topk(10, sum(rate(http_requests_total[1h])) by (endpoint))
# Predicted free disk space 4 hours from now (linear regression over the last hour)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600)
# Aggregation over time (meaningful for gauges; use rate() or increase() for counters)
avg_over_time(node_memory_MemAvailable_bytes[1h])
max_over_time(node_memory_MemAvailable_bytes[1h])
min_over_time(node_memory_MemAvailable_bytes[1h])
# Join metrics: attach the node label from kube_pod_info (an info metric with value 1)
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  * on(namespace, pod) group_left(node) kube_pod_info
```
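`histogram_quantile` estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The Python sketch below shows that idea for classic cumulative buckets. It is a simplified approximation and makes several assumptions: positive bucket bounds, a `+Inf` bucket present, and `0 < q <= 1`. The function name `bucket_quantile` and the bucket counts are hypothetical.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def bucket_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets (sketch)."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if len(buckets) < 2 or total == 0:
        return math.nan  # not enough data to estimate anything
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the highest finite bound, so that bound
        # is the best available answer.
        return buckets[-2][0]
    # The first bucket is assumed to start at 0 (positive bounds only).
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    example = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
    print(bucket_quantile(0.95, example))
```

For the example buckets the rank is 95, which falls in the `le="1.0"` bucket, so the estimate is 0.5 + 0.5 × (95 − 90) / (100 − 90) = 0.75 seconds. The interpolation also explains why bucket boundaries matter: the estimate can never be more precise than the width of the bucket it lands in.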
**Kubernetes-Specific Queries:**
```promql
# Pod CPU usage in cores (container!="" excludes the aggregate pod-level series cAdvisor also reports)
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) by (pod)
# Pod memory usage (working set bytes)
sum(container_memory_working_set_bytes{namespace="production", container!=""}) by (pod)
# Pod restart count
kube_pod_container_status_restarts_total{namespace="production"}
# Available replicas
kube_deployment_status_replicas_available{namespace="production"}
# Pending pods (the metric is 0 or 1 for every phase, so sum rather than count)
sum(kube_pod_status_phase{phase="Pending"}) by (namespace)
# CPU requests as a percentage of allocatable CPU, per node
sum(kube_pod_container_resource_requests{resource="cpu"}) by (node) /
sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100
```
### Recording Rules
**Recording Rules Configuration:**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: api_performance
      interval: 30s
      rules:
        # Request rate by endpoint
        - record: api:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total[5m])) by (job, endpoint, method)
        # Request rate by status
        - record: api:http_requests:rate5m:status
          expr: |
            sum(rate(http_requests_total[5m])) by (job, status)
        # Error rate
        - record: api:http_requests:error_rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
            sum(rate(http_requests_total[5m])) by (job)
        # Latency percentiles
        - record: api:http_request_duration:p50
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            )
        - record: api:http_request_duration:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            )
        - record: api:http_request_duration:p99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            )
```