ci-failure-analysis

This skill should be used when analyzing failed GitHub Actions CI/CD runs for Breenix kernel development. Use for diagnosing test failures, parsing QEMU logs, identifying kernel panics or faults, understanding timeout issues, and determining root causes of CI failures.


# CI Failure Analysis for Breenix

Systematically analyze and diagnose CI/CD test failures in Breenix kernel development.

## Purpose

This skill provides tools and workflows for analyzing failed CI runs, understanding kernel crashes, identifying environment issues, and determining root causes. It focuses on the unique challenges of kernel development CI: QEMU logs, kernel panics, double faults, page faults, and timeout analysis.

## When to Use This Skill

Use this skill when:

- **CI run fails**: GitHub Actions workflow fails and you need to understand why
- **Test timeout**: Test exceeds time limit and you need to determine if it's a hang or just slow
- **Kernel panic/fault**: Double fault, page fault, or other kernel crash in CI
- **Missing output**: Expected kernel log signals don't appear
- **Environment issues**: Build or dependency problems in CI that don't occur locally
- **Regression analysis**: New PR breaks previously passing tests

## Quick Start

When a CI run fails:

1. **Download artifacts**: Open the failed GitHub Actions run and download the log artifacts
2. **Run analyzer**: `ci-failure-analysis/scripts/analyze_ci_failure.py target/xtask_*_output.txt`
3. **Review findings**: The analyzer reports known patterns, each with a diagnosis and suggested fixes
4. **Check context**: Use the `--context` flag to see surrounding log lines
5. **Apply fix**: Follow the suggested remediation steps

## Failure Analysis Script

The skill provides `analyze_ci_failure.py` to automatically detect common failures:

### Basic Usage

```bash
# Analyze a CI log file
ci-failure-analysis/scripts/analyze_ci_failure.py target/xtask_ring3_smoke_output.txt

# Show context around failures
ci-failure-analysis/scripts/analyze_ci_failure.py --context target/xtask_ring3_smoke_output.txt

# Analyze multiple logs
ci-failure-analysis/scripts/analyze_ci_failure.py target/*.txt logs/breenix_*.log
```

### What It Detects

The analyzer recognizes these failure patterns:

1. **Double Fault** - Stack corruption, unmapped exception handlers
2. **Page Fault** - Accessing unmapped or incorrectly mapped memory
3. **Test Timeout** - Exceeding time limits
4. **QEMU Not Found** - Missing system dependencies
5. **Rust Target Missing** - Wrong toolchain configuration
6. **rust-src Missing** - Missing required Rust component
7. **Userspace Binary Missing** - Forgetting to build userspace tests
8. **Compilation Error** - Build failures
9. **Signal Not Found** - Expected output missing (test didn't complete)
10. **Kernel Panic** - Unrecoverable errors
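
As a rough illustration of how this kind of detection can work, the sketch below scans log lines against a small table of regular expressions. The `PATTERNS` table and the `scan` function are hypothetical and are not taken from `analyze_ci_failure.py`; the regexes are based only on the symptom strings quoted in this document, and only a few of the ten patterns are covered.

```python
import re

# Hypothetical pattern table: the real analyzer may use different names
# and regexes. Each regex mirrors a symptom string quoted in this document.
PATTERNS = [
    ("Double Fault", re.compile(r"DOUBLE FAULT")),
    ("Page Fault", re.compile(r"PAGE FAULT at 0x[0-9a-fA-F]+")),
    ("Test Timeout", re.compile(r"Timeout reached|test exceeded time limit")),
    ("QEMU Not Found", re.compile(r"qemu-system-x86_64: command not found")),
    ("Compilation Error", re.compile(r"^error\[E\d+\]")),
]


def scan(lines):
    """Return (line_number, pattern_name, text) for every matching line."""
    hits = []
    for number, text in enumerate(lines, start=1):
        for name, regex in PATTERNS:
            if regex.search(text):
                hits.append((number, name, text.rstrip()))
                break  # report at most one pattern per line
    return hits
```

Calling `scan(open(path).read().splitlines())` on a downloaded log would give a list of hits that can then be paired with a diagnosis and fix text per pattern name.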

### Output Format

```
======================================================================
CI Failure Analysis: target/xtask_ring3_smoke_output.txt
======================================================================
Log size: 1523 lines
Patterns detected: 2

──────────────────────────────────────────────────────────────────────

[1] Page Fault
    Line 1234: PAGE FAULT at 0x10001082 Error Code: 0x0

    📊 Diagnosis:
       Page fault accessing unmapped or incorrectly mapped memory

    🔧 Fix:
       Identify the faulting address and check:
       1) Is it mapped in the active page table?
       2) Are the flags correct (USER_ACCESSIBLE, WRITABLE)?
       3) Was it recently unmapped?

    📄 Context:
         1230: [ INFO] Process created: PID 2
         1231: [DEBUG] Switching to process page table
         1232: [DEBUG] About to access userspace memory
         1233: [DEBUG] Buffer pointer: 0x10001082
    >>>  1234: PAGE FAULT at 0x10001082 Error Code: 0x0
         1235: Stack trace:
         1236:   0: copy_from_user
         1237:   1: sys_write
         1238:   2: syscall_handler
```
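
The context section in the sample report can be approximated with a small helper that slices the log around a hit and marks the matching line with `>>>`. This is a minimal sketch; the function name `context_window` and its `before`/`after` parameters are assumptions, and the real script may format context differently.

```python
def context_window(lines, hit_line, before=4, after=4):
    """Format lines around a 1-based hit_line, marking the hit with '>>>'.

    The window is clamped to the start and end of the log.
    """
    start = max(1, hit_line - before)
    end = min(len(lines), hit_line + after)
    out = []
    for number in range(start, end + 1):
        marker = ">>>" if number == hit_line else "   "
        out.append(f"{marker} {number:5d}: {lines[number - 1].rstrip()}")
    return out
```

Printing `"\n".join(context_window(lines, 1234))` for the page fault in the sample would show lines 1230 through 1238 with line 1234 marked.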

## Common Failure Patterns

### Double Fault

**Symptoms**:
```
DOUBLE FAULT - Error Code: 0x0
Instruction Pointer: 0x...
Code Segment: ... Ring3
```

**Common Causes**:
1. Kernel stack not mapped in process page table (Ring 3 → Ring 0 transition fails)
2. IST stack misconfigured or unmapped
3. Exception handler itself causes exception
4. Stack overflow

**Diagnosis**:
- Check if fault occurs during syscall (int 0x80)
- Look for recent page table changes
- Verify TSS RSP0 points to valid kernel stack
- Check IST configuration

**Fix Examples**:
- Add kernel stack mapping to process page tables
- Verify IST stacks are mapped
- Increase stack size if overflow
- Review exception handler code

### Page Fault

**Symptoms**:
```
PAGE FAULT at 0x... Error Code: 0x...
```

**Error Code Decoding**:
- Bit 0 (P): 0 = not present, 1 = protection violation
- Bit 1 (W/R): 0 = read, 1 = write
- Bit 2 (U/S): 0 = kernel, 1 = user
- Bit 3 (RSVD): 1 = reserved bit violation
- Bit 4 (I/D): 1 = instruction fetch
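
The bit layout above can be turned into a small decoder. The helper below is an illustrative sketch (the name `decode_page_fault` is made up for this example) that covers only the five bits listed here.

```python
def decode_page_fault(error_code):
    """Decode the low five bits of an x86-64 page-fault error code."""
    return {
        # Bit 0 (P): set means the page was present and access was denied
        "cause": "protection violation" if error_code & 0x1 else "page not present",
        # Bit 1 (W/R): set means the faulting access was a write
        "access": "write" if error_code & 0x2 else "read",
        # Bit 2 (U/S): set means the access came from user mode
        "mode": "user" if error_code & 0x4 else "kernel",
        # Bit 3 (RSVD): a reserved bit was set in a paging structure
        "reserved_bit_violation": bool(error_code & 0x8),
        # Bit 4 (I/D): the fault was caused by an instruction fetch
        "instruction_fetch": bool(error_code & 0x10),
    }
```

For the `Error Code: 0x0` in the sample report, this decodes to a kernel-mode read of a page that is not present, which fits a fault inside `copy_from_user`.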

**Common Causes**:
1. Accessing unmapped memory
2. Writing to read-only page
3. User code accessing kernel page
4. Page table entry missing

**Diagnosis**:
- Identify faulting address and operation
- Check if address should be mapped
- Verify page table flags (PRESENT, WRITABLE, USER_ACCESSIBLE)
- Look for recent memory operations

### Test Timeout

**Symptoms**:
```
Timeout reached (60s)
... OR ...
Error: test exceeded time limit
```

**Distinguishing Hang vs Slow**:

1. **Kernel hang**: No new output for extended period
   - Timer interrupt not firing
   - Infinite loop
   - Deadlock

2. **Legitimately slow**: Continuous output, just takes longer
   - CI environment slower than local
   - Verbose logging enabled
   - Many tests in sequence

**Diagnosis**:
- Check last log message - what was kernel doing?
- Is timer interrupt still firing? (look for timer ticks)
- Are there any locks being acquired?
- Does it complete locally?

**Fixes**:
- Infinite loop: Add timeout or fix logic
- Deadlock: Review lock acquisition order
- Slow test: Increase timeout or optimize
- Hang: Add debug checkpoints to narrow down location
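
One way to make the hang-versus-slow distinction mechanical is to measure how long the log was silent before the timeout. The sketch below assumes the test harness records an arrival time for each output line, which the logs shown in this document do not include; `classify_timeout` and the 10-second `quiet_threshold_s` default are illustrative choices, not part of the skill.

```python
def classify_timeout(events, timeout_s, quiet_threshold_s=10.0):
    """Classify a timed-out run as 'hang' or 'slow'.

    events: list of (seconds_since_start, line) pairs in arrival order.
    Returns (verdict, last_line) so the caller can see what the kernel
    was doing when output stopped.
    """
    if not events:
        return "hang", None  # no output at all before the timeout
    last_time, last_line = events[-1]
    quiet_for = timeout_s - last_time
    verdict = "hang" if quiet_for >= quiet_threshold_s else "slow"
    return verdict, last_line
```

A run whose last line arrived 2 seconds into a 60-second timeout would be reported as a hang, while one still printing at 58 seconds would be reported as slow. The threshold would need tuning for a given CI environment.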

### Missing Success Signal

**Symptoms**:
```
❌ Ring-3 smoke test failed: no evidence of userspace execution
```

**Common Causes**:
1. Test didn't run (compilation failed silently)
2. Kernel panicked before reaching test
3. Test ran but failed assertions
4. Signal string changed but test wasn't updated

**Diagnosis**:
- Search log for ANY output from the test
- Check if kernel reached test execution point
- Look for earlier errors or panics
- Verify signal string matches test code
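
The diagnosis steps above amount to two searches: one for the success signal itself, and one for the furthest point the kernel reached. A minimal sketch follows; `locate_signal` is a hypothetical helper, and the signal and checkpoint strings passed to it must come from the actual test code and kernel log messages.

```python
def locate_signal(lines, signal, checkpoints):
    """Report whether `signal` appears and the last checkpoint reached.

    checkpoints: substrings expected in the log, listed in boot order.
    Returns (signal_found, last_checkpoint_seen or None).
    """
    found = any(signal in line for line in lines)
    last_checkpoint = None
    for checkpoint in checkpoints:
        if any(checkpoint in line for line in lines):
            last_checkpoint = checkpoint
    return found, last_checkpoint
```

If the signal is missing but a late checkpoint is present, the test probably ran and failed; if only early checkpoints appear, look for a panic or fault before the test started.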

### Compilation Error

**Symptoms**:
```
error[E0...]: ...
  --> kernel/src/...
```

**Common Causes**:
1. Wrong Rust nightly version
2. Missing features
3. Syntax error
4. Dependency version mismatch

**Diagnosis**:
- Check Rust version in CI vs. expected
- Verify all required crates are available
- Look for changed dependencies
- Check for feature flag mismatches

### Environment Issues

**Symptoms**:
```
qemu-system-x86_64: command not found
... OR ...
error: target 'x86_64-unknown-none' may not be installed
```

**Common Causes**:
1. System dependencies not installed
2. Rust components missing
3. Wrong Rust installation method
4. PATH not set correctly

**Diagnosis**:
- Check workflow YAML for dependency installation
- Verify Rust toolchain setup
- Check for typos in package names
- Confirm the runner uses the correct Ubuntu version

## Analysis Workflow

### Step 1: Identify Failure Type

1. **Download artifacts** from failed GitHub Actions run
2. **Check Actions summary** for which step failed
3. **Determine failure category**:
   - Build failure (compilation)
   - Environment setup failure (missing deps)
   - Test execution failure (kernel crash, timeout, wrong output)

### Step 2: Automated Analysis

```bash
# Run the analyzer on downloaded logs
ci-failure-analysis/scripts/analyze_ci_failure.py \
  --context \
  target/xtask_*_output.txt
```

Review the output for:
- Detected patterns
- Suggested diagnosis
- Recommended fixes

### Step 3: Manual Analysis

If automated analysis doesn't find clear patterns:

```bash
# Search for specific error keywords
grep -i "error\|panic\|fault\|timeout" target/xtask_*_output.txt

# Find last successful operation
grep "SUCCESS\|✓\|✅" target/xtask_*_output.txt | tail -20

```

Source: https://github.com/ryanbreen/breenix/tree/main/breenix-ci-failure-analysis
