it-operations

Manages IT infrastructure, monitoring, incident response, and service reliability. Provides frameworks for ITIL service management, observability strategies, automation, backup/recovery, capacity planning, and operational excellence practices.

# IT Operations Expert

A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.

## Core Principles

### 1. Service Reliability First
- **Proactive Monitoring**: Implement comprehensive observability before incidents occur
- **Incident Management**: Structured response processes with clear escalation paths
- **SLA/SLO Management**: Define and maintain service level objectives aligned with business needs
- **Continuous Improvement**: Learn from incidents through blameless post-mortems
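
One common way to make an SLO operational is to turn it into an error budget: the amount of downtime the objective still allows within a window. The sketch below assumes an availability SLO expressed as a fraction (for example 0.999) and a 30-day rolling window; both the function names and the window length are illustrative choices, not part of this skill.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, downtime_minutes=10.8), 2))
```

A team might slow feature rollouts when the remaining fraction drops below an agreed level; that policy is a local decision and is not prescribed here.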

### 2. Automation Over Manual Processes
- **Infrastructure as Code**: Manage infrastructure configuration through version-controlled code
- **Runbook Automation**: Convert manual procedures into automated workflows
- **Self-Healing Systems**: Implement automated remediation for common issues
- **Configuration Management**: Maintain consistency across environments

### 3. ITIL Service Management (ITIL v3 Service Lifecycle)
- **Service Strategy**: Align IT services with business objectives
- **Service Design**: Design resilient, scalable services
- **Service Transition**: Manage changes with minimal disruption
- **Service Operation**: Deliver and support services effectively
- **Continual Service Improvement**: Iteratively enhance service quality

### 4. Operational Excellence
- **Documentation**: Maintain current runbooks, procedures, and architecture diagrams
- **Knowledge Management**: Build searchable knowledge bases from incident resolutions
- **Capacity Planning**: Forecast and provision resources proactively
- **Cost Optimization**: Balance performance requirements with infrastructure costs

## Core Workflow

### Infrastructure Operations Workflow

```
1. MONITORING & OBSERVABILITY
   ├─ Define SLIs/SLOs/SLAs for critical services
   ├─ Implement metrics collection (infrastructure, application, business)
   ├─ Configure alerting with proper thresholds and escalation
   ├─ Build dashboards for different audiences (ops, devs, executives)
   └─ Establish on-call rotation and escalation procedures

2. INCIDENT MANAGEMENT
   ├─ Receive alert or user report
   ├─ Assess severity and impact (P1/P2/P3/P4)
   ├─ Engage appropriate responders
   ├─ Investigate and diagnose root cause
   ├─ Implement fix or workaround
   ├─ Communicate status to stakeholders
   ├─ Document resolution in knowledge base
   └─ Conduct post-incident review

3. CHANGE MANAGEMENT
   ├─ Submit change request with impact assessment
   ├─ Review and approve through CAB (Change Advisory Board)
   ├─ Schedule change window
   ├─ Execute change with rollback plan ready
   ├─ Validate success criteria
   ├─ Document actual vs planned results
   └─ Close change ticket

4. CAPACITY PLANNING
   ├─ Collect resource utilization trends
   ├─ Analyze growth patterns
   ├─ Forecast future requirements
   ├─ Plan procurement or provisioning
   ├─ Execute capacity additions
   └─ Monitor effectiveness

5. AUTOMATION & OPTIMIZATION
   ├─ Identify repetitive manual tasks
   ├─ Document current process
   ├─ Design automated solution
   ├─ Implement and test automation
   ├─ Deploy to production
   ├─ Measure time/cost savings
   └─ Iterate and improve
```
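
The forecasting steps in the capacity planning stage can be sketched with a simple linear trend. This is a minimal illustration that assumes one utilization sample per day and roughly linear growth; the function name and the 80% default threshold are assumptions, and real workloads with seasonality would need a richer model.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_usage_pct: Sequence[float],
                         threshold_pct: float = 80.0) -> Optional[float]:
    """Fit a least-squares line to daily samples and estimate days to threshold.

    Returns 0.0 if the fitted current value is already at or above the
    threshold, and None if usage is flat or declining.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_usage_pct))
    slope = sxy / sxx                      # percentage points per day
    intercept = mean_y - slope * mean_x
    current = intercept + slope * (n - 1)  # fitted value for the latest day
    if current >= threshold_pct:
        return 0.0
    if slope <= 0:
        return None
    return (threshold_pct - current) / slope


if __name__ == "__main__":
    disk_usage = [50, 52, 54, 56, 58, 60]  # hypothetical daily disk usage in %
    print(days_until_threshold(disk_usage))
```

The estimate is only as good as the linearity assumption, so it is best treated as an input to the capacity review rather than a provisioning trigger.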

## Decision Frameworks

### Alert Configuration Decision Matrix

| Scenario | Alert Type | Threshold | Response Time | Escalation |
|----------|-----------|-----------|---------------|------------|
| Service completely down | Page | Immediate | < 5 min | Immediate to on-call |
| Service degraded | Page | 2-3 failures | < 15 min | After 15 min to on-call |
| High resource usage | Warning | > 80% sustained | < 1 hour | After 2 hours to team lead |
| Approaching capacity | Info | > 70% trend | < 24 hours | Weekly capacity review |
| Configuration drift | Ticket | Any deviation | < 7 days | Monthly review |
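
The matrix above can also be kept as data so that alerting tooling and documentation read from one source. The sketch below is one possible encoding; the scenario keys such as `service_down` are hypothetical identifiers introduced for illustration, while the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    alert_type: str
    threshold: str
    response_time: str
    escalation: str


# Keys are illustrative names; values mirror the decision matrix rows.
ALERT_MATRIX = {
    "service_down": AlertPolicy("Page", "Immediate", "< 5 min",
                                "Immediate to on-call"),
    "service_degraded": AlertPolicy("Page", "2-3 failures", "< 15 min",
                                    "After 15 min to on-call"),
    "high_resource_usage": AlertPolicy("Warning", "> 80% sustained", "< 1 hour",
                                       "After 2 hours to team lead"),
    "approaching_capacity": AlertPolicy("Info", "> 70% trend", "< 24 hours",
                                        "Weekly capacity review"),
    "configuration_drift": AlertPolicy("Ticket", "Any deviation", "< 7 days",
                                       "Monthly review"),
}


def policy_for(scenario: str) -> AlertPolicy:
    """Look up the policy for a scenario, failing loudly on unknown names."""
    try:
        return ALERT_MATRIX[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None


def should_page(scenario: str) -> bool:
    """True when the matrix says the scenario wakes someone up."""
    return policy_for(scenario).alert_type == "Page"


if __name__ == "__main__":
    print(policy_for("high_resource_usage").escalation)
    print(should_page("configuration_drift"))
```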

### Incident Severity Classification

**Priority 1 (Critical)**
- Complete service outage affecting all users
- Data loss or security breach
- Financial impact > $10K/hour
- Response: Immediate, 24/7, all hands on deck

**Priority 2 (High)**
- Partial service outage affecting many users
- Significant performance degradation
- Financial impact $1K-$10K/hour
- Response: < 30 minutes during business hours

**Priority 3 (Medium)**
- Service degradation affecting some users
- Non-critical functionality impaired
- Workaround available
- Response: < 4 hours during business hours

**Priority 4 (Low)**
- Minor issues with minimal impact
- Cosmetic problems
- Enhancement requests
- Response: Next business day
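
As a rough illustration, the classification above can be expressed as a small function so that triage is applied consistently. The input model is an assumption: `users_affected` takes one of four hypothetical labels, and any single P1 criterion is treated as sufficient. Qualitative criteria such as "workaround available" are left to human judgment.

```python
def classify_priority(users_affected: str,
                      hourly_loss_usd: float = 0.0,
                      data_loss_or_breach: bool = False) -> str:
    """Map incident facts to P1-P4 using the thresholds listed above.

    users_affected: "all", "many", "some", or "minimal" (illustrative labels).
    """
    if users_affected not in {"all", "many", "some", "minimal"}:
        raise ValueError(f"unknown scope: {users_affected!r}")
    if users_affected == "all" or data_loss_or_breach or hourly_loss_usd > 10_000:
        return "P1"
    if users_affected == "many" or hourly_loss_usd >= 1_000:
        return "P2"
    if users_affected == "some":
        return "P3"
    return "P4"


if __name__ == "__main__":
    print(classify_priority("some", hourly_loss_usd=15_000))  # financial impact dominates
    print(classify_priority("minimal"))
```

Taking the highest priority that any criterion triggers is a conservative choice; a team could instead require several criteria to agree.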

### Change Management Risk Assessment

```
Risk Level = Impact × Likelihood × Complexity

Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing

Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before

Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide

Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
```
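
The scoring formula and bands above translate directly into code. The following sketch assumes integer ratings from 1 to 5 for each factor; the function names are illustrative.

```python
def risk_score(impact: int, likelihood: int, complexity: int) -> int:
    """Risk Level = Impact x Likelihood x Complexity, each rated 1-5."""
    for name, value in (("impact", impact),
                        ("likelihood", likelihood),
                        ("complexity", complexity)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * likelihood * complexity


def change_category(score: int) -> str:
    """Interpret a risk score using the bands from the assessment above."""
    if not 1 <= score <= 125:
        raise ValueError("score must be between 1 and 125")
    if score <= 20:
        return "Standard change (pre-approved)"
    if score <= 50:
        return "Normal change (CAB review)"
    if score <= 75:
        return "High-risk change (extensive testing, senior approval)"
    return "Emergency change only (executive approval)"


if __name__ == "__main__":
    # Department-wide impact, familiar procedure, few components: 3 * 2 * 2 = 12
    score = risk_score(impact=3, likelihood=2, complexity=2)
    print(score, change_category(score))
```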

### Monitoring Tool Selection

| Requirement | Prometheus + Grafana | Datadog | New Relic | ELK Stack | Splunk |
|-------------|---------------------|---------|-----------|-----------|---------|
| Cost | Free (self-hosted) | $$$$ | $$$$ | Free-$$ | $$$$$ |
| Metrics | Excellent | Excellent | Excellent | Good | Good |
| Logs | Via Loki | Excellent | Excellent | Excellent | Excellent |
| Traces | Via Tempo | Excellent | Excellent | Limited | Good |
| Learning Curve | Steep | Moderate | Moderate | Steep | Steep |
| Cloud-Native | Excellent | Excellent | Excellent | Good | Good |
| On-Premises | Excellent | Good | Good | Excellent | Excellent |
| APM | Via exporters | Excellent | Excellent | Limited | Good |

## Common Operational Challenges

### Challenge 1: Alert Fatigue
**Problem**: Too many false-positive alerts cause team burnout

**Solution**:
```
Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
   - Actionable + Urgent = Keep as page
   - Actionable + Not Urgent = Ticket
   - Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
   - MTTA (Mean Time to Acknowledge): < 5 min target
   - False Positive Rate: < 20% target
   - Alert Volume per Week: Trending down
```
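
Steps 2 and 6 of the tuning process can be sketched as follows. The `AlertRecord` fields are a hypothetical shape for exported alert history, not the schema of any particular tool; in practice the data would come from whatever alerting system is in use.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class AlertRecord:
    name: str
    actionable: bool
    urgent: bool
    seconds_to_ack: float
    false_positive: bool


def disposition(alert: AlertRecord) -> str:
    """Apply the actionability rules from step 2."""
    if not alert.actionable:
        return "remove_or_dashboard"
    return "page" if alert.urgent else "ticket"


def tuning_metrics(alerts: Sequence[AlertRecord]) -> Dict[str, float]:
    """Compute MTTA and false positive rate for a review period (step 6)."""
    if not alerts:
        raise ValueError("no alerts in the review period")
    return {
        "mtta_seconds": mean(a.seconds_to_ack for a in alerts),
        "false_positive_rate": sum(a.false_positive for a in alerts) / len(alerts),
    }


if __name__ == "__main__":
    history = [
        AlertRecord("api_down", True, True, 120, False),
        AlertRecord("disk_70pct", True, False, 240, False),
        AlertRecord("cpu_spike", False, False, 60, True),
        AlertRecord("db_latency", True, True, 180, False),
    ]
    for alert in history:
        print(alert.name, "->", disposition(alert))
    print(tuning_metrics(history))
```

Comparing these numbers against the 5-minute MTTA and 20% false positive targets at each weekly review shows whether tuning is working.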

### Challenge 2: Incident Documentation During Crisis
**Problem**: Teams skip documentation during high-pressure incidents

**Solution**:
- Assign dedicated scribe role (not the incident commander)
- Use incident management tools (PagerDuty, Opsgenie) that capture the timeline automatically
- Template-based incident reports with required fields
- Post-incident review scheduled automatically (within 48 hours)
- Gamify documentation (track and recognize thorough documentation)

### Challenge 3: Knowledge Silos
**Problem**: Critical knowledge trapped in individual team members' heads

**Solution**:
```yaml
Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have a runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion
```
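
The cross-training matrix can be checked mechanically for single points of failure. This sketch assumes the matrix is kept as a mapping from person to the systems they can support; the names, the `min_people` default of 2, and the function itself are illustrative.

```python
from typing import Dict, Iterable, List, Mapping, Set


def coverage_gaps(matrix: Mapping[str, Set[str]],
                  all_systems: Iterable[str],
                  min_people: int = 2) -> Dict[str, List[str]]:
    """Return systems known by fewer than min_people, with who knows them."""
    experts: Dict[str, List[str]] = {system: [] for system in all_systems}
    for person, systems in matrix.items():
        for system in systems:
            experts.setdefault(system, []).append(person)
    return {
        system: sorted(people)
        for system, people in sorted(experts.items())
        if len(people) < min_people
    }


if __name__ == "__main__":
    # Hypothetical team and systems.
    matrix = {
        "alice": {"postgres", "dns"},
        "bob": {"postgres"},
    }
    print(coverage_gaps(matrix, ["postgres", "dns", "backup"]))
```

Systems that appear in the result are candidates for the next shadowing or documentation sprint.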

### Challenge 4: Balancing Stability vs Innovation
**Problem**: Operations team resists change to maintain stability

**Solution**:
- Implement change windows (planned maintenance periods)
- Use blue-green or canary deployments for lower risk
- Establi
