datadog-observability

Included with Lifetime

$97 forever

Full-stack observability with Datadog APM, logs, metrics, synthetics, and RUM. Use when implementing monitoring, tracing, alerting, or cost optimization for production systems.

platformobservabilitymonitoringapmloggingmetricstracingdatadogalerting

What this skill does


# Datadog Observability

## Overview

Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.

**Core Capabilities:**
- **APM**: Distributed tracing with automatic instrumentation for 8+ languages
- **Infrastructure**: Host, container, and cloud service monitoring
- **Logs**: Centralized collection with processing pipelines and 15-month retention
- **Metrics**: Custom metrics via DogStatsD with cardinality management
- **Synthetics**: Proactive API and browser testing from 29+ global locations
- **RUM**: Frontend performance with Core Web Vitals and session replay

## When to Use This Skill

**Activate when:**
- Setting up production monitoring and observability
- Implementing distributed tracing across microservices
- Configuring log aggregation and analysis pipelines
- Creating custom metrics and dashboards
- Setting up alerting and anomaly detection
- Optimizing Datadog costs

**Do not use when:**
- Building with open-source stack (use Prometheus/Grafana instead)
- Cost is primary concern and budget is limited
- Need maximum customization over managed solution

## Quick Start

### 1. Install Datadog Agent

**Docker (simplest):**
```bash
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
```

**Kubernetes (Helm):**
```bash
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true
```

### 2. Instrument Your Application

**Python:**
```python
from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries
patch_all()

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here
```

**Node.js:**
```javascript
// Must be first import
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});
```

### 3. Verify in Datadog UI

1. Go to Infrastructure > Host Map to verify agent
2. Go to APM > Services to see traced services
3. Go to Logs > Search to verify log collection

## Core Concepts

### Tagging Strategy

Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.

**Required Tags:**
| Tag | Purpose | Example |
|-----|---------|---------|
| `env` | Environment | `env:production` |
| `service` | Service name | `service:api-gateway` |
| `version` | Deployment version | `version:1.2.3` |
| `team` | Owning team | `team:platform` |

**Avoid High-Cardinality Tags:**
- User IDs, request IDs, timestamps
- Pod IDs in Kubernetes
- Build numbers, commit hashes

### Unified Observability

Datadog correlates metrics, traces, and logs automatically:
- Traces include span tags that link to metrics
- Logs inject trace IDs for correlation
- Dashboards combine all data sources

## Best Practices

### Start Simple
1. Install Agent with basic configuration
2. Enable automatic instrumentation
3. Verify data in Datadog UI
4. Add custom spans/metrics as needed

### Progressive Enhancement
```
Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM
```

### Key Instrumentation Points
- HTTP entry/exit points
- Database queries
- External service calls
- Message queue operations
- Business-critical flows

## Common Mistakes

1. **High-cardinality tags**: Using user IDs or request IDs as tags creates millions of unique metrics
2. **Missing log index quotas**: Leads to unexpected bills from log volume spikes
3. **Over-alerting**: Creates alert fatigue; alert on symptoms, not causes
4. **Missing service tags**: Prevents correlation between metrics, traces, and logs
5. **No sampling for high-volume traces**: Ingests everything, causing cost explosion

## Navigation

For detailed implementation:

- **[Agent Installation](references/agent-installation.md)**: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
- **[APM Instrumentation](references/apm-instrumentation.md)**: Python, Node.js, Go, Java instrumentation with code examples
- **[Log Management](references/log-management.md)**: Pipelines, Grok parsing, standard attributes, archives
- **[Custom Metrics](references/custom-metrics.md)**: DogStatsD patterns, metric types, tagging best practices
- **[Alerting](references/alerting.md)**: Monitor types, anomaly detection, alert hygiene
- **[Cost Optimization](references/cost-optimization.md)**: Metrics without Limits, sampling, index quotas
- **[Kubernetes](references/kubernetes.md)**: DaemonSet, Cluster Agent, autodiscovery

## Complementary Skills

When using this skill, consider these related skills (if deployed):

- **docker**: Container instrumentation patterns
- **kubernetes**: K8s-native monitoring patterns
- **python/nodejs/go**: Language-specific APM setup

## Resources

**Official Documentation:**
- APM: https://docs.datadoghq.com/tracing/
- Logs: https://docs.datadoghq.com/logs/
- Metrics: https://docs.datadoghq.com/metrics/
- DogStatsD: https://docs.datadoghq.com/developers/dogstatsd/

**Cost Management:**
- Billing: https://docs.datadoghq.com/account_management/billing/
- Usage Attribution: https://docs.datadoghq.com/account_management/billing/usage_attribution/

Files: 9

Size: 75.7 KB

Complexity: 56/100

Category: platform

Source: https://github.com/bobmatnyc/claude-mpm-skills/tree/main/toolchains/platforms/observability/datadog

Related in platform

flow-nexus-platform

Included

Comprehensive Flow Nexus platform management - authentication, sandboxes, app deployment, payments, and challenges

platform

flow-nexus-platform

Included

Comprehensive Flow Nexus platform management - authentication, sandboxes, app deployment, payments, and challenges

platform

flow-nexus-platform

Included

Comprehensive Flow Nexus platform management - authentication, sandboxes, app deployment, payments, and challenges

platform

flow-nexus-platform

Included

Comprehensive Flow Nexus platform management - authentication, sandboxes, app deployment, payments, and challenges

platform