oraclecloud-incident-runbook
Self-service incident runbook for OCI outages — health probes, instance recovery, cross-AD/region failover. Use when OCI instances go down, the status page is silent, or you need automated recovery without waiting for support. Trigger with "oraclecloud incident", "oci outage runbook", "oci failover", "oci instance recovery".
What this skill does
# Oracle Cloud Incident Runbook
## Overview
Self-service runbook for when OCI instances go down and the status page stays green. OCI's status page has a history of not acknowledging outages in real time (London Jan 2026 — 502s and instances disappearing for 10 minutes with no status update). OCI Support response times average 4+ hours for Sev-1 tickets. This runbook gives you health probes, automated instance recovery, cross-AD failover, and cross-region failover — all executable without waiting on Oracle.
**Purpose:** Detect OCI service degradation independently, recover instances automatically, and fail over to alternate availability domains or regions when the primary is impacted.
## Prerequisites
- **OCI CLI installed and configured** — `~/.oci/config` validated (see `oraclecloud-install-auth`)
- **Python 3.8+** with the OCI SDK — `pip install oci`
- **Pre-configured resources**: at least one compute instance, a VCN with subnets in multiple ADs
- **IAM policies**: `manage instances`, `manage volumes`, `inspect work-requests` in the target compartment
- **Boot volume backups** enabled (recovery depends on having a recent backup)
## Instructions
### Step 1: Independent Health Probes
Do not trust the OCI status page alone. Run your own health checks against the OCI API:
```python
import oci
import time
config = oci.config.from_file("~/.oci/config")
def probe_oci_health(config):
"""Probe OCI API endpoints independently of the status page."""
results = {}
# Probe 1: Identity service (lightest call)
try:
start = time.time()
identity = oci.identity.IdentityClient(config)
identity.list_regions()
results["identity"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
except oci.exceptions.ServiceError as e:
results["identity"] = {"status": "degraded", "error": str(e.status)}
# Probe 2: Compute service
try:
start = time.time()
compute = oci.core.ComputeClient(config)
compute.list_instances(compartment_id=config["tenancy"], limit=1)
results["compute"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
except oci.exceptions.ServiceError as e:
results["compute"] = {"status": "degraded", "error": str(e.status)}
# Probe 3: Networking service
try:
start = time.time()
network = oci.core.VirtualNetworkClient(config)
network.list_vcns(compartment_id=config["tenancy"], limit=1)
results["networking"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
except oci.exceptions.ServiceError as e:
results["networking"] = {"status": "degraded", "error": str(e.status)}
return results
health = probe_oci_health(config)
for service, status in health.items():
print(f" {service}: {status['status']} ({status.get('latency_ms', 'N/A')}ms)")
```
### Step 2: Severity Classification
| Level | Condition | Detection | Response Time |
|-------|-----------|-----------|---------------|
| **P1 — Total Outage** | All API probes fail, instances unreachable | All 3 probes return `degraded` | Immediate — trigger cross-region failover |
| **P2 — Partial Degradation** | Some APIs slow (>2s latency), instance agent unresponsive | Latency >2000ms on any probe | 15 min — attempt instance recovery, prepare failover |
| **P3 — Intermittent** | Sporadic 429/500 errors, some requests succeed | Occasional probe failures | 1 hour — enable retries, monitor trend |
### Step 3: Instance Recovery Actions
```bash
# Check current instance state
oci compute instance get --instance-id "$INSTANCE_OCID" \
--query 'data."lifecycle-state"' --raw-output
# Action: Reset (hard reboot) — fastest recovery for hung instances
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action RESET
# Action: Reboot (graceful) — for instances still responding to ACPI
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action SOFTRESET
# Action: Stop then Start — forces reallocation to new host hardware
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action STOP
# Wait for STOPPED state
oci compute instance action --instance-id "$INSTANCE_OCID" \
--action START
```
### Step 4: Cross-AD Failover
When an entire availability domain is impacted, launch a replacement instance in a different AD using the latest boot volume backup:
```python
import oci
config = oci.config.from_file("~/.oci/config")
compute = oci.core.ComputeClient(config)
blockstorage = oci.core.BlockstorageClient(config)
# Find latest boot volume backup
backups = blockstorage.list_boot_volume_backups(
compartment_id="COMPARTMENT_OCID",
boot_volume_id="BOOT_VOLUME_OCID",
sort_by="TIMECREATED",
sort_order="DESC",
limit=1,
).data
if not backups:
raise RuntimeError("No boot volume backups found — cannot failover")
latest_backup = backups[0]
print(f"Using backup: {latest_backup.id} from {latest_backup.time_created}")
# Launch replacement instance in alternate AD
launch_details = oci.core.models.LaunchInstanceDetails(
availability_domain="AD-2", # Different from the impacted AD
compartment_id="COMPARTMENT_OCID",
shape="VM.Standard.E4.Flex",
shape_config=oci.core.models.LaunchInstanceShapeConfigDetails(ocpus=2, memory_in_gbs=16),
display_name="failover-instance",
source_details=oci.core.models.InstanceSourceViaBootVolumeDetails(
source_type="bootVolume",
boot_volume_id=latest_backup.id,
),
create_vnic_details=oci.core.models.CreateVnicDetails(
subnet_id="SUBNET_OCID_IN_AD2",
),
)
response = compute.launch_instance(launch_details)
print(f"Failover instance launching: {response.data.id}")
```
### Step 5: Cross-Region Failover
For region-wide outages, use a pre-replicated boot volume in the DR region:
```bash
# List available boot volume replicas in DR region
oci bv boot-volume-backup list \
--compartment-id "$COMPARTMENT_OCID" \
--query 'data[0].id' --raw-output \
--region us-phoenix-1
# Launch instance in DR region
oci compute instance launch \
--compartment-id "$COMPARTMENT_OCID" \
--availability-domain "AD-1" \
--shape "VM.Standard.E4.Flex" \
--shape-config '{"ocpus":2,"memoryInGBs":16}' \
--display-name "dr-failover-instance" \
--source-boot-volume-id "$DR_BACKUP_OCID" \
--subnet-id "$DR_SUBNET_OCID" \
--region us-phoenix-1
```
### Step 6: Status Monitoring Script
Run this in a loop during an incident to detect recovery:
```bash
#!/bin/bash
while true; do
STATUS=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
--query 'data."lifecycle-state"' --raw-output 2>&1)
TIMESTAMP=$(date -u +%H:%M:%S)
echo "$TIMESTAMP | Instance: $STATUS"
if [ "$STATUS" = "RUNNING" ]; then
echo "$TIMESTAMP | Instance recovered."
break
fi
sleep 30
done
```
## Output
Successful execution produces:
- Independent health probe results for Identity, Compute, and Networking services
- Severity classification (P1/P2/P3) based on probe results
- Instance recovery action (reset, reboot, or stop/start) applied to the impacted instance
- Cross-AD failover instance launched from the latest boot volume backup (if AD-level failure)
- Cross-region failover instance launched in the DR region (if region-level failure)
- Continuous monitoring log tracking instance lifecycle state until recovery
## Error Handling
| Error | Code | Cause | Solution |
|-------|------|-------|----------|
| NotAuthenticated | 401 | API key expired during incident | Use `oci session authenticate` for token-based short-term auth |
| NotAuthorizedOrNotFound | 404 | IAM policy missing for recovery actions | Pre-create policy: `allow group sre to manage instances in compartment prod` |
| TooManyRequests | 429 | Rate limited during mass recovery | OCI has no Retry-After header — back off 30 seconds between calls |
| InternalError | 500 | OCI service itself is degraded | This confirms the ouRelated in Ads & Marketing
ads
IncludedMulti-platform paid advertising audit and optimization skill. Analyzes Google, Meta, YouTube, LinkedIn, TikTok, Microsoft, and Apple Ads. 250+ checks with scoring, parallel agents, industry templates, and AI creative generation.
banana
IncludedAI image generation Creative Director powered by Google Gemini Nano Banana models. Use this skill for ANY request involving image creation, editing, visual asset production, or creative direction. Triggers on: generate an image, create a photo, edit this picture, design a logo, make a banner, visual for my anything, and all /banana commands. Handles text-to-image, image editing, multi-turn creative sessions, batch workflows, and brand presets.
rpg-migration-analyzer
IncludedAnalyzes legacy RPG (Report Program Generator) programs from AS/400 and IBM i systems for migration to modern Java applications. Extracts business logic from RPG III/IV/ILE source code, identifies data structures (D-specs), file operations (F-specs), program dependencies (CALLB/CALLP), and converts RPG constructs to Java equivalents. Generates migration reports, complexity estimates, and Java implementation strategies with POJO classes, JPA entities, and service methods. Use when modernizing AS/400 or IBM i legacy systems, analyzing RPG source files (.rpg, .rpgle, .RPGLE), converting RPG to Java, mapping data specifications to Java classes, planning legacy system migration, or when user mentions RPG analysis, Report Program Generator, RPG III/IV/ILE, AS/400 modernization, IBM i migration, packed decimal conversion, or mainframe application rewrite.
brand-library-architect
IncludedBuild a complete brand library for a product — visual asset render pipeline, brand documentation set (BRAND, COPY, MANIFESTO, BIOS, FAQ, GLOSSARY, TONE, PRICING), open-source convention files (README, CONTRIBUTING, SECURITY, CODE_OF_CONDUCT), and a self-contained press kit. This skill should be used when the user asks to "build a brand library / brand kit / press kit / brand assets" for a product, "set up a brand library workflow," "create a positioning manifesto plus visual identity," or any combination of brand documentation + visual asset pipeline. Apply phase-by-phase or run end-to-end. Templates are product-agnostic and use {{TOKEN}} placeholders the skill prompts the user to fill.
writing-tech-post
IncludedAuthors engineering blog posts end-to-end: launch deep-dives, incident postmortems, architecture migrations, performance case studies, tutorials, AI/agent system writeups, security disclosures, and research-to-product translations. Picks the correct archetype, plans the abstraction ladder, enforces an evidence cadence (diagrams, benchmarks, profiles, traces, code, ablations), tunes voice against publisher house styles (Datadog, Vercel, GitHub, AWS, Meta, Cloudflare, Jane Street), and runs a pre-publish gate for narrative momentum and disclosure ethics. Use when drafting a new engineering post, restructuring a draft that feels flat, deciding which evidence form belongs where, validating that depth and product context are balanced, or preparing a postmortem, migration, or performance narrative for external publication. Do not use for API reference documentation, README authoring, marketing copy, release notes, generic SEO content, ghost-written executive thought leadership, or non-engineering long-form essays.
blog-google
IncludedGoogle API integration for blog performance: PageSpeed Insights, CrUX Core Web Vitals with 25-week history, Search Console performance, URL Inspection, Indexing API, GA4 organic traffic, NLP entity analysis for E-E-A-T, YouTube video search for embedding, and Google Ads Keyword Planner. Progressive feature availability based on credential tier (API key, OAuth/service account, GA4, Ads). Shares config with claude-seo at ~/.config/claude-seo/google-api.json. Use when user says "google data", "page speed", "core web vitals", "search console", "indexation", "GA4", "keyword research", "nlp entities", "blog performance", "youtube search", "google api setup".