vercel-incident-runbook

Included with Lifetime

$97 forever

Vercel incident response procedures with triage, instant rollback, and postmortem. Use when responding to Vercel-related outages, investigating production errors, or running post-incident reviews for deployment failures. Trigger with phrases like "vercel incident", "vercel outage", "vercel down", "vercel on-call", "vercel emergency", "vercel broken".

Cloud & DevOpssaasvercelincident-responserunbook

What this skill does

# Vercel Incident Runbook

## Overview

Step-by-step incident response for Vercel deployment failures, function errors, and platform outages. Covers rapid triage, instant rollback, communication templates, and postmortem procedures.

## Prerequisites

- Access to Vercel dashboard and CLI
- Access to Vercel status page (vercel-status.com)
- Communication channels (Slack, PagerDuty) configured
- Log drain or runtime log access

## Instructions

### Step 1: Rapid Triage (First 5 Minutes)

```bash
# 1. Check if it's a Vercel platform issue
curl -s "https://www.vercel-status.com/api/v2/summary.json" \
  | jq '.status.description, [.components[] | select(.status != "operational") | {name, status}]'

# 2. Check current production deployment status
vercel ls --prod
vercel inspect $(vercel ls --prod --json | jq -r '.[0].url')

# 3. Check recent deployments — did a deploy just happen?
curl -s -H "Authorization: Bearer $VERCEL_TOKEN" \
  "https://api.vercel.com/v6/deployments?target=production&limit=5&projectId=prj_xxx" \
  | jq '.deployments[] | {uid, state, createdAt: (.createdAt/1000 | todate), url}'

# 4. Check function logs for errors
vercel logs $(vercel ls --prod --json | jq -r '.[0].url') --level=error --limit=20
```

### Step 2: Decision Tree

```
Is vercel-status.com showing an incident?
├── YES → Vercel platform issue
│   ├── Subscribe to updates on status page
│   ├── Post internal status: "Vercel platform incident — monitoring"
│   └── No action needed from us — wait for Vercel resolution
│
└── NO → Issue is in our deployment
    ├── Did a deployment happen in the last 30 minutes?
    │   ├── YES → Likely deployment regression
    │   │   └── ROLLBACK immediately (Step 3)
    │   └── NO → Application-level issue
    │       ├── Check function logs for new errors
    │       ├── Check external dependency status (DB, APIs)
    │       └── Investigate and hotfix (Step 4)
    │
    └── Is the issue region-specific?
        ├── YES → Check function regions, possible edge issue
        └── NO → Global issue, check code and env vars
```

### Step 3: Instant Rollback (< 30 Seconds)

```bash
# Option A: Rollback to previous production deployment (fastest)
vercel rollback
# This instantly swaps production traffic — no rebuild needed

# Option B: Rollback to a specific known-good deployment
vercel rollback dpl_xxxxxxxxxxxx

# Option C: Via API (for automation/PagerDuty integration)
curl -X POST "https://api.vercel.com/v9/projects/my-app/promote" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"deploymentId": "dpl_known_good_id"}'

# Verify rollback succeeded
vercel ls --prod
curl -s https://yourdomain.com/api/health | jq .
```

### Step 4: Investigate Root Cause

```bash
# Collect evidence while it's fresh
mkdir incident-$(date +%Y%m%d)
cd incident-$(date +%Y%m%d)

# Function logs around the incident time
vercel logs https://yourdomain.com --limit=200 > function-logs.txt

# Deployment diff — what changed?
curl -s -H "Authorization: Bearer $VERCEL_TOKEN" \
  "https://api.vercel.com/v13/deployments/dpl_broken" \
  | jq '.meta' > broken-deployment-meta.json

# Compare env vars between working and broken deployments
vercel env ls > env-vars.txt

# Check git diff between last good and broken commit
git log --oneline -10
git diff dpl_good_commit..dpl_broken_commit -- api/ src/
```

### Step 5: Enable Maintenance Page (If Needed)

```json
// vercel.json — temporary maintenance mode via rewrite
{
  "rewrites": [
    {
      "source": "/((?!_next|api/health).*)",
      "destination": "/maintenance.html"
    }
  ]
}
```

```html
<!-- public/maintenance.html -->
<!DOCTYPE html>
<html>
<head><title>Maintenance</title></head>
<body>
  <h1>We'll be right back</h1>
  <p>We're performing scheduled maintenance. Please check back shortly.</p>
</body>
</html>
```

### Step 6: Communication Templates

**Internal — Slack (Incident Start)**

```
:rotating_light: INCIDENT: [Project Name] production issue detected
Status: Investigating
Impact: [Description of user impact]
Start time: [UTC timestamp]
On-call: @[engineer]
Thread: replies here
```

**Internal — Slack (Mitigation)**

```
:white_check_mark: MITIGATED: [Project Name]
Action: Rolled back to deployment dpl_xxx
Impact duration: [X minutes]
Root cause: [Brief description]
Postmortem: [link] scheduled for [date]
```

**External — Status Page**

```
Title: Degraded performance on [service]
Body: We are investigating reports of [issue]. Some users may experience
[impact]. Our team is actively working on a resolution.
Update: The issue has been resolved. [Brief root cause].
```

### Step 7: Postmortem Template

```markdown
# Incident Postmortem: [Title]

## Summary
- Duration: [start] to [end] ([X minutes])
- Impact: [users/requests affected]
- Severity: [P1/P2/P3]

## Timeline (UTC)
- HH:MM — [event]
- HH:MM — Alert fired
- HH:MM — On-call acknowledged
- HH:MM — Root cause identified
- HH:MM — Rollback executed
- HH:MM — Service restored

## Root Cause
[What broke and why]

## Resolution
[What was done to fix it]

## Action Items
- [ ] [Preventive action] — Owner: @xxx — Due: [date]
- [ ] [Detection improvement] — Owner: @xxx — Due: [date]
- [ ] [Process improvement] — Owner: @xxx — Due: [date]
```

## Incident Severity Levels

| Severity | Definition | Response Time | Rollback? |
|----------|-----------|---------------|-----------|
| P1 | Production down, all users affected | < 5 min | Immediate |
| P2 | Degraded, some users affected | < 15 min | If not fixable in 30 min |
| P3 | Minor issue, workaround exists | < 1 hour | No |
| P4 | Cosmetic or non-urgent | Next business day | No |

## Output

- Incident categorized and triaged within 5 minutes
- Instant rollback executed if deployment regression detected
- Communication sent to internal and external stakeholders
- Postmortem scheduled with action items

## Error Handling

| Scenario | Action |
|----------|--------|
| Vercel status page shows incident | Monitor, communicate, no deployment changes |
| `vercel rollback` fails | Use API promotion: POST to `/v9/projects/.../promote` |
| Rollback deployment also broken | Deploy from a known-good git tag |
| Cannot access Vercel dashboard | Use CLI with saved VERCEL_TOKEN |
| Log retention expired | Check external log drain provider |

## Resources

- [Vercel Status Page](https://www.vercel-status.com)
- [Instant Rollback](https://vercel.com/docs/instant-rollback)
- [Vercel Support](https://vercel.com/support)
- [Vercel Logs CLI](https://vercel.com/docs/cli/logs)

## Next Steps

For data handling and compliance, see `vercel-data-handling`.

Files: 4

Size: 9.0 KB

Complexity: 31/100

Category: Cloud & DevOps

Source: https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/vercel-pack/skills/vercel-incident-runbook

Related in Cloud & DevOps

appbuilder-action-scaffolder

Included

Create, implement, deploy, and debug Adobe Runtime actions with consistent layout, validation, and error handling. Use this skill whenever the user needs to add actions to an App Builder project, understand action structure (params, response format, web/raw actions), configure actions in the manifest, use App Builder SDKs (State, Files, Events, database), deploy and invoke actions via CLI, debug action issues, or implement patterns such as webhook receivers, custom event providers, journaling consumers, large payload redirects, action sequence pipelines, and Asset Compute workers. Also trigger when users mention serverless functions in Adobe context, action logging, IMS authentication for actions, or cron-style scheduled actions.

Cloud & DevOpsscripts

orchestrating-datacloud

Included

Salesforce Data Cloud product orchestrator for connect→prepare→harmonize→segment→act workflows. Use this skill when the user needs a multi-step Data Cloud pipeline, cross-phase troubleshooting, or data space and data kit management. TRIGGER when: user needs a multi-step Data Cloud pipeline, asks to set up or troubleshoot Data Cloud across phases, manages data spaces or data kits, or wants a cross-phase sf data360 workflow. DO NOT TRIGGER when: work is isolated to a single phase (use the matching phase-specific skill), the task is STDM/session tracing/parquet telemetry (use observing-agentforce), standard CRM SOQL (use querying-soql), or Apex implementation (use generating-apex).

Cloud & DevOpsscripts

github-project-automation

Included

Automate GitHub repository setup with CI/CD workflows, issue templates, Dependabot, and CodeQL security scanning. Includes 12 production-tested workflows and prevents 18 errors: YAML syntax, action pinning, and configuration. Use when: setting up GitHub Actions CI/CD, creating issue/PR templates, enabling Dependabot or CodeQL scanning, deploying to Cloudflare Workers, implementing matrix testing, or troubleshooting YAML indentation, action version pinning, secrets syntax, runner versions, or CodeQL configuration. Keywords: github actions, github workflow, ci/cd, issue templates, pull request templates, dependabot, codeql, security scanning, yaml syntax, github automation, repository setup, workflow templates, github actions matrix, secrets management, branch protection, codeowners, github projects, continuous integration, continuous deployment, workflow syntax error, action version pinning, runner version, github context, yaml indentation error

Cloud & DevOpsscripts

sf-datacloud

Included

Salesforce Data Cloud product orchestrator for connect→prepare→harmonize→segment→act workflows. TRIGGER when: user needs a multi-step Data Cloud pipeline, asks to set up or troubleshoot Data Cloud across phases, manages data spaces or data kits, or wants a cross-phase `sf data360` workflow. DO NOT TRIGGER when: work is isolated to a single phase (use the matching sf-datacloud-* skill), the task is STDM/session tracing/parquet telemetry (use sf-ai-agentforce-observability), standard CRM SOQL (use sf-soql), or Apex implementation (use sf-apex).

Cloud & DevOpsscripts

fabric-cli

Included

Use this skill for Fabric.so CLI workflows with the `fabric` terminal command: diagnose/install/login, search or browse a Fabric library, save notes/links/files, create folders, ask the Fabric AI assistant, manage tasks/workspaces, generate shell completion, check subscription usage, produce JSON output, and use Fabric as persistent agent memory. Do not use for Microsoft Fabric/Azure/Power BI `fab`, Daniel Miessler's Fabric framework, Python Fabric SSH, Fabric.js, or textile/fashion fabric.

Cloud & DevOpsscripts

lark

Included

Lark/Feishu CLI skills: lark-cli operations for docs, markdown, sheets, base, calendar, im, mail, task, okr, drive, wiki, slides, whiteboard, apps, approval, attendance, contact, vc, minutes, event. Use when the user needs to operate Lark/Feishu resources via lark-cli, send messages, manage documents, spreadsheets, calendars, tasks, OKRs, deploy web pages, or any Feishu/Lark workspace operations.

Cloud & DevOpsscripts