ml-training-recipes
Battle-tested PyTorch training recipes for all domains — LLMs, vision, diffusion, medical imaging, protein/drug discovery, spatial omics, genomics. Covers training loops, optimizer selection (AdamW, Muon), LR scheduling, mixed precision, debugging, and systematic experimentation. Use when training or fine-tuning neural networks, debugging loss spikes or OOM, choosing architectures, or optimizing GPU throughput.
What this skill does
# ML Training Recipes
Battle-tested patterns for PyTorch training across domains. Drawn from production codebases
(Karpathy's autoresearch/nanochat, torchvision, HuggingFace) and modern training practice.
## Reference files (read when needed)
- `references/architecture.md` — Transformer/LLM architecture code patterns, weight init
- `references/optimizers.md` — Muon, AdamW hybrid, per-group LR, compiled optimizer steps
- `references/domain-specific.md` — Vision, diffusion, contrastive, distributed, checkpointing, data loading
- `references/scaling-and-selection.md` — Scaling laws, compute budget tables, decision trees, DGX Spark
- `references/biomedical.md` — Drug discovery, protein models, medical imaging, genomics, clinical NLP
- `references/experiment-loop.md` — Autonomous experiment loop (autoresearch keep/discard/revert)
---
## Architecture Selection
Pick the right model by **data type** and **data scale**:
| Data Type | < 10K samples | 10K-100K | > 100K |
|-----------|--------------|----------|--------|
| **Images** | Pretrained CNN + fine-tune | Fine-tune ViT or CNN | ViT from scratch |
| **Text (gen)** | Few-shot prompting | Fine-tune GPT/LLaMA (LoRA) | Pretrain from scratch |
| **Tabular** | XGBoost/LightGBM | Still XGBoost | Neural viable |
| **Audio** | Pretrained Whisper | Fine-tune AST | Train from scratch |
| **Molecules** | Pretrained GNN | Fine-tune molecular LM | Train GNN from scratch |
| **Proteins** | ESM-2 embeddings + head | Fine-tune ESM-2 | Train protein LM |
| **Medical img** | Pretrained CNN | nnU-Net (auto-config) | Swin-UNETR / MedSAM |
**Key principle**: architecture matters less than training recipe at equal compute. A well-tuned
ResNet beats a poorly-tuned ViT (ref: "ResNet Strikes Back", Wightman 2021).
For biomedical domains, see `references/biomedical.md`.
For sequence model selection and compute planning, see `references/scaling-and-selection.md`.
---
## Scaling Laws
### Chinchilla rule (Hoffmann et al., 2022)
Compute-optimal training: **~20 tokens per parameter**.
| Model Size | Compute-Optimal | Inference-Optimal (100×) |
|-----------|----------------|--------------------------|
| 125M | 2.5B tokens | 12.5B tokens |
| 1B | 20B tokens | 100B tokens |
| 7B | 140B tokens | 700B tokens |
**FLOPs ≈ 6 × N × D** (N=params, D=tokens). Data repetition limit: ~4 epochs before diminishing returns.
---
## Training Loop
```python
import gc, time, torch
torch.manual_seed(42)
torch.set_float32_matmul_precision("high") # TF32 on Ampere+
autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
grad_accum_steps = total_batch_size // (batch_size * seq_len)
step = 0
while not done:
t0 = time.time()
for micro_step in range(grad_accum_steps):
with autocast_ctx:
loss = model(x, y)
(loss / grad_accum_steps).backward()
x, y = next(train_loader)
update_lr(optimizer, progress)
optimizer.step()
model.zero_grad(set_to_none=True) # frees memory vs zeroing
if loss.item() > 100: # fast-fail on divergence
print("FAIL: loss exploded"); exit(1)
torch.cuda.synchronize()
if step == 0:
gc.collect(); gc.freeze(); gc.disable() # avoid ~500ms GC stalls
step += 1
```
### Key principles
- **Gradient clipping**: `clip_grad_norm_(params, 1.0)` — near-universal for Transformers.
Exception: Muon optimizer normalizes updates via orthogonalization, so clipping is optional.
- **Tensor Core alignment**: batch size, hidden dims should be multiples of 8 (bf16) or 64 (A100).
- **Time-based budgets** make experiments comparable across hardware.
- **`cudnn.benchmark = True`** for fixed-size vision inputs.
---
## Optimizer Configuration
Modern LLM training uses different optimizers per parameter group:
| Parameter Type | Optimizer | LR (base) | Weight Decay |
|---------------|-----------|-----------|--------------|
| 2D weight matrices | Muon | 0.04 | 0.2 |
| Token embeddings | AdamW | 0.6 × scale | 0.0 |
| Unembedding (lm_head) | AdamW | 0.004 × scale | 0.0 |
| Per-layer scalars | AdamW | 0.005 × scale | 0.0 |
**LR scaling by dimension**: `lr * (d_model / 768)^(-0.5)` — keeps dynamics stable across sizes.
### Rules of thumb
- Embeddings need higher LR (sparse updates). Never weight-decay embeddings.
- Weight decay scheduling: linearly decay WD to 0 over training.
- AdamW defaults: β1=0.9, β2=0.95, eps=1e-10 (not default 1e-8 — prevents stale updates in bf16).
For Muon details (polar express orthogonalization, NorMuon), see `references/optimizers.md`.
---
## Learning Rate Scheduling
### Time-based (autoresearch style)
```python
def get_lr_multiplier(progress): # progress = elapsed_time / time_budget
if progress < warmup_ratio:
return progress / warmup_ratio
elif progress < 1.0 - warmdown_ratio:
return 1.0
else:
cooldown = (1.0 - progress) / warmdown_ratio
return cooldown + (1 - cooldown) * final_lr_frac
```
### Cosine decay
```python
def get_lr(step, total_steps, max_lr, min_lr, warmup_steps):
if step < warmup_steps:
return max_lr * step / warmup_steps
progress = (step - warmup_steps) / (total_steps - warmup_steps)
return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```
**WSD (Warmup-Stable-Decay)**: gaining traction — easier to resume training mid-run.
### Guidance
- **Warmup**: 1-5% of training. Zero warmup valid with Muon (autoresearch uses `WARMUP_RATIO=0.0`).
- **Warmdown**: 30-50% of training in LR decay. Matters more than warmup for final quality.
- **Final LR**: 0 or ~10% of peak. Zero is simpler.
---
## Mixed Precision & Compilation
```python
import os
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True" # before torch import
import torch
torch.set_float32_matmul_precision("high")
autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
model = torch.compile(model, dynamic=False)
```
- **bf16** (Ampere+): same exponent as fp32, no loss scaling needed. Preferred over fp16.
- **fp16**: needs GradScaler. Use only on V100 or older.
- `dynamic=False` enables max optimization. Add `fullgraph=True` if no graph breaks.
- First steps are slow (JIT) — exclude from timing.
---
## Memory & Performance
### Meta device init (large models)
```python
with torch.device("meta"):
model = GPT(config) # zero memory
model.to_empty(device="cuda")
model.init_weights()
```
### MFU (Model FLOPs Utilization)
```python
achieved_flops = model_flops_per_token * batch_tokens / step_time
mfu = achieved_flops / gpu_peak_flops
# H100 SXM: 989.5 TFLOPS | A100: 312 | RTX 4090: 165
```
Good targets: >30% decent, >40% good, >50% excellent (single-GPU).
### OOM solutions (in order)
1. Reduce `DEVICE_BATCH_SIZE`, increase `grad_accum_steps`
2. `PYTORCH_ALLOC_CONF=expandable_segments:True`
3. `model.zero_grad(set_to_none=True)`
4. Meta device init → `to_empty`
5. Activation checkpointing: `torch.utils.checkpoint.checkpoint()`
6. 8-bit optimizer (bitsandbytes): ~30% savings on optimizer states
---
## Hyperparameter Search
### Priority order (tune first → last)
1. **Learning rate** — most impactful. Always tune first.
2. **Batch size** — largest that fits. Speed knob, not quality knob.
3. **Weight decay** — 0.01-0.1 for AdamW.
4. **Warmup steps** — 1-5% of training.
### The 2025 default recipe
| Setting | Value |
|---------|-------|
| Optimizer | AdamW (β1=0.9, β2=0.95, eps=1e-10) |
| Weight decay | 0.1 |
| LR schedule | Cosine decay or WSD |
| Peak LR | 3e-4 (scale down for larger models) |
| Precision | bf16 |
| Grad clipping | max_norm=1.0 |
| Normalization | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Position encoding | RoPE |
| Attention | Flash Attention, optionallRelated in General
modeling-omnistudio-epc-catalog
IncludedSalesforce Industries CME EPC product-modeling skill for Product2-based catalog creation. Use when creating EPC products, configuring product attributes, building offer bundles with Product Child Items, or reviewing EPC DataPack JSON metadata for product catalog changes. TRIGGER when: user creates or updates Product2 EPC records, AttributeAssignment payloads, AttributeMetadata/AttributeDefaultValues, Offer bundles, or ProductChildItem relationships. DO NOT TRIGGER when: designing OmniScripts/FlexCards/Integration Procedures (use building-omnistudio-omniscript, building-omnistudio-flexcard, or building-omnistudio-integration-procedure), implementing Apex business logic (use generating-apex), or troubleshooting deployment pipelines (use deploying-metadata).
relationship-science-coach
IncludedUse this skill for direct, practical adult relationship coaching: couples conflict, repair, trust, marriage, dating, flirting, attachment patterns, emotional connection, sex, desire differences, eroticism, kink negotiation, affection, love languages, breakups, and long-term passion. Draw on Gottman, EFT and Hold Me Tight, attachment science, modern sex research, Perel, Nagoski, Kerner, Schnarch, Love and Stosny, and flexible love-language tools. Be concrete and low-hedge. Redirect only for imminent danger, abuse, coercive control, minors, non-consent, self-harm, stalking, or medical/legal/psychiatric decisions.
building-sf-integrations
IncludedSalesforce integration architecture and runtime plumbing with 120-point scoring. Use this skill to set up Named Credentials, External Credentials, External Services, REST/SOAP callout patterns, Platform Events, and Change Data Capture. TRIGGER when: user sets up Named Credentials, External Services, REST/SOAP callouts, Platform Events, CDC, or touches .namedCredential-meta.xml files. DO NOT TRIGGER when: Connected App/OAuth config (use configuring-connected-apps), Apex-only logic (use generating-apex), or data import/export (use handling-sf-data).
venue-templates
IncludedAccess comprehensive LaTeX templates, formatting requirements, and submission guidelines for major scientific publication venues (Nature, Science, PLOS, IEEE, ACM), academic conferences (NeurIPS, ICML, CVPR, CHI), research posters, and grant proposals (NSF, NIH, DOE, DARPA). This skill should be used when preparing manuscripts for journal submission, conference papers, research posters, or grant proposals and need venue-specific formatting requirements and templates.
let-fate-decide
IncludedDraws the 12 Houses of the Zodiac Tarot spread to inject entropy into planning when prompts are vague, ambiguous, or casually delegated. Interprets the spread to guide next steps. Use when the user says 'let fate decide', 'YOLO', 'whatever', 'idk', or other nonchalant phrases, makes Yu-Gi-Oh references, or when you are about to arbitrarily pick between multiple reasonable approaches. Prefer over ask-questions-if-underspecified when the user's tone is casual or playful rather than precision-seeking.
net-ops
IncludedCross-platform network troubleshooting (Windows, macOS, Linux) via local or remote shell. Use for: DNS broken, can't resolve hostnames, nslookup/dig works but apps fail, NRPT, WFP, scutil, /etc/resolver, systemd-resolved, /etc/resolv.conf, NetworkManager, VPN DNS leak residue (ProtonVPN/Mullvad/WireGuard/AnyConnect), AV/firewall blocking DNS or DoH, Tailscale DNS interaction, intermittent connectivity, remote diagnostics over SSH.