deploying-monitoring-stacks
Use this skill when deploying monitoring stacks such as Prometheus, Grafana, and Datadog. Trigger with phrases like "deploy monitoring stack", "setup prometheus", "configure grafana", or "install datadog agent". Generates production-ready configurations with metric collection, visualization dashboards, and alerting rules.
What this skill does
# Deploying Monitoring Stacks

## Overview

Deploy production monitoring stacks (Prometheus + Grafana, Datadog, or VictoriaMetrics) with metric collection, custom dashboards, and alerting rules. Configure exporters, scrape targets, recording rules, and notification channels for comprehensive infrastructure and application observability.

## Prerequisites

- Target infrastructure identified: Kubernetes cluster, Docker hosts, or bare-metal servers
- Metric endpoints accessible from the monitoring platform (application `/metrics`, node exporters)
- Storage backend capacity planned for time-series data (Prometheus TSDB, or Thanos or Cortex for long-term storage)
- Alert notification channels defined: Slack webhook, PagerDuty integration key, or email SMTP
- Helm 3+ for Kubernetes deployments using kube-prometheus-stack or similar charts

## Instructions

1. Select the monitoring platform: Prometheus + Grafana for open-source self-hosted, Datadog for managed SaaS, VictoriaMetrics for high-cardinality workloads
2. Deploy the monitoring stack: on Kubernetes, add the `prometheus-community` Helm repository and run `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack`; use Docker Compose for non-Kubernetes hosts
3. Install exporters on monitored systems: node-exporter for host metrics, kube-state-metrics for Kubernetes object states, application-specific exporters
4. Configure scrape targets in `prometheus.yml`: define job names, scrape intervals, and relabeling rules for service discovery
5. Create recording rules for frequently queried aggregations to reduce dashboard query load
6. Define alerting rules with meaningful thresholds: high CPU (>80% for 5m), high memory (>90%), error rate (>1%), latency P99 (>500ms)
7. Configure Alertmanager with routing, grouping, and notification channels (Slack, PagerDuty, email)
8. Build Grafana dashboards: RED metrics (Rate, Errors, Duration) for services, USE metrics (Utilization, Saturation, Errors) for resources
9. Set up data retention: configure the TSDB retention period (15-30 days local), and set up Thanos or Cortex for long-term storage if needed
10. Test the full pipeline: trigger a test alert and verify notification delivery

## Output

- Helm values file or Docker Compose file for the monitoring stack
- Prometheus configuration with scrape targets, recording rules, and alerting rules
- Alertmanager configuration with routing tree and notification receivers
- Grafana dashboard JSON files for infrastructure and application metrics
- Exporter deployment manifests (node-exporter DaemonSet, application ServiceMonitor)

## Error Handling

| Error | Cause | Solution |
|-------|-------|---------|
| `No data points in dashboard` | Scrape target not reachable or metric name wrong | Check the `Targets` page in the Prometheus UI; verify service discovery and the metric name |
| `Too many time series (high cardinality)` | Labels with unbounded values (user IDs, request IDs) | Remove high-cardinality labels with `metric_relabel_configs`; use recording rules for aggregation |
| `Alert condition met but no notification` | Alertmanager routing or receiver misconfigured | Validate the Alertmanager config with `amtool check-config`; check which receiver a label set routes to with `amtool config routes test` |
| `Prometheus OOMKilled` | Insufficient memory for series count | Increase memory limits; reduce scrape targets, series count, or retention |
| `Grafana datasource connection failed` | Wrong Prometheus URL or network policy blocking access | Verify the datasource URL in Grafana; check the Kubernetes service name and port; review network policies |

## Examples

- "Deploy kube-prometheus-stack on Kubernetes with alerts for node CPU > 80%, pod restart count > 5, and API error rate > 1%, sending to Slack."
- "Set up Prometheus + Grafana on Docker Compose for monitoring 10 application servers with node-exporter and custom application metrics."
- "Create Grafana dashboards for the four golden signals (latency, traffic, errors, saturation) for a microservices application."

## Resources

- Prometheus documentation: https://prometheus.io/docs/
- Grafana documentation: https://grafana.com/docs/grafana/latest/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- Alerting best practices: https://prometheus.io/docs/practices/alerting/
- Datadog documentation: https://docs.datadoghq.com/
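
The Instructions list four alert thresholds (CPU, memory, error rate, P99 latency) without showing rule syntax. The following is a minimal sketch of a Prometheus rule file for those thresholds, not a definitive rule set. It assumes node-exporter metrics are being scraped. The application metrics `http_requests_total` (with a `status` label) and `http_request_duration_seconds_bucket` are hypothetical names, and the group name, alert names, severity labels, and every `for` duration other than the 5m CPU window are illustrative assumptions.

```yaml
# Sketch of alerting rules for the thresholds named in the Instructions.
# Assumptions: node-exporter is scraped; the http_* metric names and the
# `status` label are hypothetical and must be adapted to the application.
groups:
  - name: example-thresholds        # hypothetical group name
    rules:
      - alert: HighCpuUsage
        # Busy CPU % = 100 - idle %, averaged across cores per instance.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        # Used memory % derived from available vs. total memory.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m                     # assumed; the source gives no duration
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% on {{ $labels.instance }}"

      - alert: HighErrorRate
        # Share of 5xx responses across all requests (hypothetical metric).
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m                     # assumed
        labels:
          severity: critical
        annotations:
          summary: "HTTP error rate above 1%"

      - alert: HighLatencyP99
        # P99 estimated from histogram buckets, in seconds (hypothetical metric).
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m                     # assumed
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms"
```

A file like this can be validated with `promtool check rules` before it is loaded. With kube-prometheus-stack, the same `groups` would typically be placed under `spec.groups` of a `PrometheusRule` resource rather than in a standalone file.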