Claude
Skills
Sign in
Back

cpu-kernels

Included with Lifetime
$97 forever

Provides guidance for writing, optimizing, and benchmarking C++ CPU kernels with SIMD intrinsics (AVX2/AVX512) for the Hugging Face kernels ecosystem. Includes a two-phase workflow: Phase 1 correctness (generic → AVX2) and Phase 2 performance exploration (AVX512 with branching trial loop), runtime CPU dispatch, OpenMP threading, and brgemm integration for GEMM-heavy kernels.

AI Agentsscripts

What this skill does


# CPU C++ Kernels for x86 Processors

This skill provides patterns and guidance for developing optimized C++ kernels targeting x86 CPUs (Intel Xeon and compatible processors) with AVX2 and AVX512 intrinsics. Kernels are compiled via `kernel-builder` and distributed through the Hugging Face kernels ecosystem.

> **Who runs these commands?** *You*, the agent — not a human. This is an autonomous loop: you write/edit the C++ kernel, build it, then run the scripts below as tools (via Bash) to check correctness, benchmark, and profile. You read each result, record it with `trial_manager.py`, decide the next change from the Phase 2 decision tree, and repeat until you hit `early_stop_speedup` or run all `max_trials`.

## Key Concepts (read before the Quick Start)

The commands use a few names that mean different things. They are **not** interchangeable:

| Name (example) | What it is | Used by |
|----------------|-----------|---------|
| **`baseline.py`** | The **PyTorch reference implementation** you optimize against. It is the ground truth for correctness *and* the speed reference for speedup. **It must define `get_inputs()`** and **either** `get_reference_output()` **or** a `Model` class (plus optional `get_init_inputs()`). You write this file (or it is given) before starting. | every script |
| **`my_rmsnorm`** | A **trial-tree label** — an arbitrary name you pick for this optimization task. `trial_manager.py` stores all attempts under `trials/my_rmsnorm/`. It is *only* a tracking ID. | `trial_manager.py` only |
| **`my_kernel`** | The **installed Python package name** — the build artifact produced by `kernel-builder build` + `pip install`. This is the importable module that contains your compiled kernel. | `--kernel-package` |
| **`my_kernel.rms_norm`** | An **`<package>.<function>` path** — the actual callable inside the installed package. Passed to `--op` to tell the benchmark/profiler which function to run. | `--op` |

> ⚠️ **`--op` means two different things depending on the script.** In `analyze_op.py`, `--op` is a plain **operation name** (e.g. `"rms_norm"`) used to look up compute/memory characteristics. In `benchmark_cpu.py` and `cpu_profiler.py`, `--op` is a **`package.function` path** (e.g. `my_kernel.rms_norm`) used to import and call your kernel. Same flag, different meaning — read each command below carefully.

## Quick Start

### Write a New CPU Kernel

The example below optimizes an RMSNorm kernel. The trial label is `my_rmsnorm`, the built package is `my_kernel`, and its function is `my_kernel.rms_norm` — keep these consistent across all six steps.

```bash
# 1. Analyze the target op. Here --op is an OPERATION NAME (looked up in the
#    knowledge base), not a package path.
python scripts/analyze_op.py --op "rms_norm" --shapes "1024x4096,2048x8192"

# 2. Initialize trial tracking. Args: <trial-label> <baseline-file>.
#    Creates trials/my_rmsnorm/ and records baseline.py as the reference.
python scripts/trial_manager.py init my_rmsnorm baseline.py

# 3. Build the kernel package (produces the installable 'my_kernel' wheel).
cd /path/to/my-kernel && kernel-builder build --release && pip install dist/*.whl --force-reinstall

# 4. Benchmark correctness + performance. Here --op is a PACKAGE.FUNCTION path.
#    Compares my_kernel.rms_norm against baseline.py (correctness + speedup).
python scripts/benchmark_cpu.py baseline.py --kernel-package my_kernel --op my_kernel.rms_norm

# 5. Profile with perf stat (same package.function path as step 4).
python scripts/cpu_profiler.py --kernel-package my_kernel --op my_kernel.rms_norm

# 6. Finalize: promote the best trial in trials/my_rmsnorm/ into output/.
python scripts/trial_manager.py finalize my_rmsnorm output/
```

## Supported Hardware

| ISA | Extensions | Key Instructions | Typical CPUs |
|-----|-----------|-----------------|-------------|
| **AVX2** | FMA, F16C | `_mm256_fmadd_ps`, `_mm256_cvtph_ps` | Most x86 CPUs (2013+) |
| **AVX512** | F, BF16, VL, DQ, BW, VBMI | `_mm512_dpbf16_ps`, `_mm512_permutexvar_epi16` | Intel Xeon |

### GEMM Acceleration: brgemm

For kernels that involve matrix multiplication (quantized GEMM, Flash Attention, MoE), large-M cases use `at::native::cpublas::brgemm()` — a PyTorch wrapper around oneDNN brgemm, which internally dispatches to AMX tile instructions on Intel Xeon (4th Gen+). Small-M cases (M ≤ 4 for bf16) fall back to hand-written `tinygemm` using AVX512 `_mm512_dpbf16_ps`. See [brgemm_patterns.yaml](references/brgemm_patterns.yaml) for details.

> **Note**: brgemm is NOT used in element-wise kernels (RMSNorm, activations, reductions). Those use AVX512 intrinsics directly.

## When This Skill Applies

Use this skill when:
- Writing C++ CPU kernels with SIMD intrinsics for the HF kernels ecosystem
- Optimizing existing CPU kernels (e.g., adding AVX512 to a generic implementation)
- Implementing quantized GEMM kernels (INT4, NF4, FP4, FP8, MXFP4)
- Implementing Flash Attention or other attention kernels for CPU
- Building kernels with `kernel-builder` that target `backend = "cpu"`

## Two-Phase Optimization Workflow

CPU kernel development has two distinct phases with different strategies.

### Configuration — Read `config.yaml` first

At the start of every session, read `scripts/config.yaml`. It controls:
- **`max_trials`** — hard cap on Phase 2 optimization trials
- **`early_stop_speedup`** — speedup vs PyTorch baseline to trigger early stop (default: 3.0)
- **`perf_stat_enabled`** — if `true`, use `perf stat` for profiling (default)
- **`vtune_enabled`** — if `true`, use VTune for detailed microarchitecture analysis
- **`build_command`** — command to build the kernel package

### Rules — Never Violate

1. **ONLY modify** C++ kernel files (`.cpp`, `.hpp`), `torch_binding.cpp`, and `build.toml`. Do NOT create benchmark or test scripts.
2. **NEVER write custom timing code** — ONLY use `scripts/benchmark_cpu.py`.
3. If a tool fails, **STOP and report the error**. Do NOT work around it with custom scripts.
4. Generated kernels must follow the **runtime dispatch pattern** with `cpu_features.hpp` — see `references/runtime_dispatch.yaml`.
5. Every kernel should have a **generic ATen fallback** that works on any CPU. If a specific path cannot have a meaningful fallback, use `TORCH_CHECK(false, ...)` with a clear error message.
6. Each SIMD tier (AVX2, AVX512) must be in a **separate translation unit** (`.cpp` file) with its own compiler flags in `build.toml`. Do NOT mix intrinsics from different ISA levels in the same file.
7. All SIMD implementations must handle **edge cases** (hidden_size not divisible by vector width).
8. AVX2 tier is **optional** — most CPU kernels go directly from generic fallback to AVX512. Only add AVX2 when it provides meaningful benefit for element-wise ops.
9. You **MUST run all `max_trials` trials** in Phase 2. Do NOT stop early due to plateau — the only valid early stop is speedup > `early_stop_speedup`.

### Mandatory Tools

| Tool | Command | Purpose |
|------|---------|---------|
| **Analyze** | `python scripts/analyze_op.py --op <op_name> --shapes <shapes>` | Analyze PyTorch op: compute/memory characteristics, SIMD strategy recommendations |
| **Validate** | `python scripts/validate_cpu_kernel.py <kernel_dir>` | Static checks: alignment, OpenMP usage, intrinsics correctness, build.toml validation |
| **Build** | `kernel-builder build --release` | Compile C++ kernel via build.toml into a wheel |
| **Benchmark** | `python scripts/benchmark_cpu.py <baseline_file> --kernel-package <pkg> --op <func>` | Correctness + performance via `torch.utils.benchmark` |
| **Profile** | `python scripts/cpu_profiler.py --kernel-package <pkg> --op <func>` | `perf stat` hardware counters + optimization recommendations |
| **Trial Manager** | `python scripts/trial_manager.py <command> ...` | Trial tree management (init/save/result/status/best/finalize) |

> **Benchmark discipline**: Pin to a single NUMA node — `numactl --cpunodebind=0 --membind=0 pyt

Related in AI Agents