devops-automation
DevOps and IT Ops automation - CI/CD, monitoring, incident management, and infrastructure workflows
What this skill does
# DevOps Automation
Automate DevOps workflows including CI/CD pipelines, monitoring, incident management, and infrastructure operations. Based on n8n's IT Ops workflow templates.
## Overview
This skill covers:
- CI/CD pipeline automation
- Monitoring and alerting
- Incident management
- Infrastructure automation
- Deployment workflows
---
## CI/CD Automation
### GitHub Actions Integration
```yaml
workflow: "GitHub CI/CD Notifications"
triggers:
- github_push
- github_pull_request
- github_workflow_run
on_push:
action:
- trigger_ci: if_main_branch
- notify_slack:
channel: "#deployments"
message: |
๐ฆ *New Push to {branch}*
Commit: `{commit_sha_short}`
Author: {author}
Message: {commit_message}
[View Diff]({compare_url})
on_pr_opened:
action:
- notify_slack:
channel: "#code-review"
message: |
๐ *New Pull Request*
Title: {pr_title}
Author: {author}
Branch: {head} โ {base}
[Review PR]({pr_url})
- assign_reviewers: based_on_codeowners
- run_ci_checks
on_workflow_complete:
action:
- notify_slack:
message: |
{status_emoji} *Build {status}*
Workflow: {workflow_name}
Branch: {branch}
Duration: {duration}
{if_failed: [View Logs]({logs_url})}
```
### Deployment Pipeline
```yaml
deployment_pipeline:
stages:
build:
trigger: push_to_main
steps:
- checkout_code
- install_dependencies
- run_tests
- build_artifact
- push_to_registry
staging:
trigger: build_success
steps:
- deploy_to_staging
- run_integration_tests
- notify_qa
production:
trigger: manual_approval
steps:
- create_backup
- deploy_to_production
- run_smoke_tests
- notify_team
rollback:
trigger: deployment_failed OR manual
steps:
- revert_to_previous
- notify_team
- create_incident
```
---
## Monitoring & Alerting
### Alert Routing
```yaml
alert_routing:
sources:
- prometheus
- datadog
- cloudwatch
- new_relic
severity_levels:
critical:
response_time: 5_minutes
channels: [pagerduty, slack_urgent, sms]
escalation: immediate
high:
response_time: 15_minutes
channels: [slack_alerts, email]
escalation: after_15_minutes
medium:
response_time: 1_hour
channels: [slack_alerts]
low:
response_time: 24_hours
channels: [slack_logging]
routing_rules:
- if: service == "payments"
team: payments_oncall
severity_boost: +1
- if: service == "auth"
team: security_oncall
- default:
team: platform_oncall
```
### Alert Templates
```yaml
alert_templates:
infrastructure:
cpu_high:
title: "๐ฅ High CPU Usage"
body: |
Server: {host}
CPU: {cpu_percent}%
Duration: {duration}
Threshold: {threshold}%
[View Dashboard]({grafana_url})
memory_critical:
title: "๐พ Critical Memory"
body: |
Server: {host}
Memory: {memory_percent}%
Available: {available_mb}MB
[SSH to Server]({ssh_link})
disk_full:
title: "๐ฟ Disk Space Critical"
body: |
Server: {host}
Disk: {disk_percent}%
Available: {available_gb}GB
Suggestion: Clean logs or expand volume
application:
error_spike:
title: "๐ Error Rate Spike"
body: |
Service: {service}
Error Rate: {error_rate}%
Normal: {baseline}%
Top Errors:
{top_errors}
latency_high:
title: "๐ข High Latency"
body: |
Service: {service}
P99 Latency: {p99_ms}ms
Threshold: {threshold_ms}ms
```
---
## Incident Management
### Incident Workflow
```yaml
incident_workflow:
detection:
sources: [monitoring, user_report, automated_check]
triage:
auto_severity:
- if: affects_payments
severity: critical
- if: affects_auth
severity: critical
- if: affects_api AND error_rate > 10%
severity: high
response:
critical:
- create_incident_channel: "#inc-{timestamp}"
- page_oncall: immediately
- notify_stakeholders: [engineering_lead, product]
- start_war_room: zoom_link
- create_status_page: incident
high:
- create_incident_channel
- notify_oncall: slack
- create_ticket: jira
communication:
internal:
frequency: every_30_minutes
channel: incident_channel
template: |
๐ *Incident Update*
Status: {status}
Impact: {impact}
Next update: {next_update_time}
Current actions:
{action_items}
external:
channel: status_page
template: customer_facing_update
resolution:
steps:
- confirm_resolution
- update_status_page: resolved
- notify_stakeholders
- schedule_postmortem
- close_incident_channel: after_24h
```
### Postmortem Template
```yaml
postmortem_template:
sections:
summary:
- incident_title
- duration
- severity
- impact
timeline:
format: |
| Time | Event |
|------|-------|
| {time} | {event} |
root_cause:
- what_happened
- why_it_happened
- contributing_factors
impact:
- users_affected
- revenue_impact
- sla_breach
resolution:
- how_it_was_fixed
- time_to_detect
- time_to_resolve
action_items:
format: |
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
lessons_learned:
- what_went_well
- what_went_poorly
- lucky_breaks
```
---
## Infrastructure Automation
### Server Provisioning
```yaml
provisioning_workflow:
trigger: jira_ticket OR slack_request
steps:
1. validate_request:
check: [budget_approval, security_review]
2. create_infrastructure:
terraform:
- vpc
- security_groups
- ec2_instances
- load_balancer
3. configure_server:
ansible:
- base_configuration
- security_hardening
- monitoring_agent
- application_setup
4. validate:
- health_check
- security_scan
- performance_baseline
5. notify:
slack: "โ
Server {hostname} is ready"
include: [ssh_access, dashboard_link]
```
### Scheduled Maintenance
```yaml
maintenance_automation:
tasks:
certificate_renewal:
schedule: "30 days before expiry"
action:
- request_new_cert: letsencrypt
- deploy_cert
- verify_ssl
- notify: if_failure
security_patching:
schedule: "weekly"
action:
- check_updates
- if_critical: immediate_patch
- else: schedule_maintenance_window
log_rotation:
schedule: "daily"
action:
- rotate_logs
- compress_old
- upload_to_s3
- delete_local: older_than_7_days
backup_verification:
schedule: "weekly"
action:
- restore_to_test_env
- run_integrity_checks
- report_status
```
---
## Kubernetes Automation
### K8s Workflows
```yaml
kubernetes_automation:
deployment:
trigger: docker_image_pushed
steps:
- update_manifest: with_new_image_tag
- apply_to_staging
- run_tests
- if_success: apply_to_prodRelated in devops
github-actions-advanced
IncludedDesign, debug, and harden GitHub Actions CI/CD workflows, including reusable workflows, matrix builds, self-hosted runners, OIDC authentication, caching, environments, secrets, and release automation.
cicd-pipeline-skill
IncludedGenerates CI/CD pipeline configurations for test automation with GitHub Actions, Jenkins, GitLab CI, and Azure DevOps. Includes TestMu AI cloud integration. Use when user mentions "CI/CD", "pipeline", "GitHub Actions", "Jenkins", "GitLab CI". Triggers on: "CI/CD", "pipeline", "GitHub Actions", "Jenkins", "GitLab CI", "Azure DevOps", "automated testing pipeline".
docker-expert
IncludedDocker containerization expert with deep knowledge of multi-stage builds, image optimization, container security, Docker Compose orchestration, and production deployment patterns. Use PROACTIVELY for Dockerfile optimization, container issues, image size problems, security hardening, networking, and orchestration challenges.
terraform-expert
IncludedExpert-level Terraform infrastructure as code, modules, state management, and production best practices
cicd-expert
IncludedExpert-level CI/CD with GitHub Actions, Jenkins, deployment pipelines, and automation
monitoring-expert
IncludedExpert-level monitoring and observability with Prometheus, Grafana, logging, and alerting