verl-rl-training
Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
What this skill does
# verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models like Doubao-1.5-pro achieving O1-level performance on math benchmarks.
## When to Use verl
**Choose verl when you need:**
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
**Consider alternatives when:**
- You need Megatron-native training → use **slime** or **miles**
- You want PyTorch-native abstractions with Monarch → use **torchforge**
- You only need simple SFT/DPO → use **TRL** or **Axolotl**
## Key Features
- **Training backends**: FSDP, FSDP2, Megatron-LM
- **Rollout engines**: vLLM, SGLang, HuggingFace Transformers
- **Algorithms**: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- **Models**: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- **Advanced**: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
## Installation
```bash
# Option 1: pip install
pip install verl[vllm] # or verl[sglang] for SGLang backend
# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest
# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
```
## Quick Start: GRPO Training
```bash
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=~/data/gsm8k/train.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.actor.use_kl_loss=True \
trainer.n_gpus_per_node=8
```
## Core Architecture
verl uses a **HybridFlow** programming model separating control flow from computation:
```
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray) │
│ - Orchestrates: rollout → reward → train → sync │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers │
│ ├── ActorRolloutRefWorker (policy + generation) │
│ ├── CriticWorker (value estimation, PPO only) │
│ └── RewardManager (model-based or rule-based rewards) │
└─────────────────────────────────────────────────────────┘
```
---
## Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
### Prerequisites Checklist
- [ ] GPU cluster with 8+ GPUs (H100 recommended)
- [ ] Dataset in parquet format with `prompt` and `reward_model` columns
- [ ] Base model from HuggingFace Hub
### Step 1: Prepare Dataset
```python
import pandas as pd
data = [
{
"prompt": [{"role": "user", "content": "What is 15 + 27?"}],
"reward_model": {"ground_truth": "42"}
},
# ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
### Step 2: Define Reward Function
```python
# reward_function.py
import re
def compute_reward(responses, ground_truths):
rewards = []
for response, gt in zip(responses, ground_truths):
# Extract answer from response
match = re.search(r'\\boxed{([^}]+)}', response)
if match and match.group(1).strip() == gt.strip():
rewards.append(1.0)
else:
rewards.append(0.0)
return rewards
```
### Step 3: Create Training Config
```yaml
# config/grpo_math.yaml
algorithm:
adv_estimator: grpo
gamma: 1.0
lam: 1.0
data:
train_files: /path/to/train.parquet
val_files: /path/to/val.parquet
train_batch_size: 256
max_prompt_length: 512
max_response_length: 2048
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-7B-Instruct
actor:
use_kl_loss: true
kl_loss_coef: 0.001
ppo_mini_batch_size: 64
rollout:
name: vllm
n: 8 # samples per prompt
temperature: 0.7
top_p: 0.95
trainer:
total_epochs: 3
n_gpus_per_node: 8
save_freq: 100
```
### Step 4: Launch Training
```bash
python3 -m verl.trainer.main_ppo \
--config-path config \
--config-name grpo_math \
trainer.experiment_name=grpo_math_qwen7b
```
### Step 5: Monitor and Validate
- [ ] Check WandB/TensorBoard for loss curves
- [ ] Verify reward is increasing over steps
- [ ] Run evaluation on held-out test set
---
## Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
### Key Differences from GRPO
- Requires separate critic model
- Uses Generalized Advantage Estimation (GAE)
- Better for tasks with dense rewards
### Configuration
```yaml
algorithm:
adv_estimator: gae # Use GAE instead of GRPO
gamma: 0.99
lam: 0.95
critic:
model:
path: Qwen/Qwen2.5-7B-Instruct # Can be same or different from actor
ppo_mini_batch_size: 64
actor_rollout_ref:
actor:
use_kl_loss: true
kl_loss_coef: 0.02
clip_ratio: 0.2 # PPO clipping
```
### Launch with Critic
```bash
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
trainer.n_gpus_per_node=8
```
---
## Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
### Prerequisites
- [ ] Install Megatron-LM bridge: `pip install mbridge`
- [ ] Convert model to Megatron format
- [ ] Multi-node cluster with NVLink/InfiniBand
### Configuration for 70B+ Models
```yaml
actor_rollout_ref:
model:
path: /path/to/megatron/checkpoint
backend: megatron
actor:
strategy: megatron
tensor_model_parallel_size: 8
pipeline_model_parallel_size: 2
rollout:
name: vllm
tensor_parallel_size: 8
```
### Launch Multi-Node
```bash
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_ip:6379'
# Launch training
python3 -m verl.trainer.main_ppo \
trainer.nnodes=4 \
trainer.n_gpus_per_node=8
```
---
## Configuration Reference
### Algorithm Selection
| Algorithm | `adv_estimator` | Use Case |
|-----------|-----------------|----------|
| GRPO | `grpo` | Critic-free, math/reasoning |
| PPO/GAE | `gae` | Dense rewards, value estimation |
| REINFORCE++ | `reinforce_plus_plus` | Variance reduction |
| RLOO | `rloo` | Leave-one-out baseline |
| ReMax | `remax` | Maximum reward baseline |
| OPO | `opo` | Optimal policy optimization |
### Key Parameters
```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8 # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7 # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95 # Nucleus sampling
# Training parameters
actor_rollout_ref.actor.lr: 1e-6 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2 # PPO clip range
# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1 # For adaptive KL control
```
---
## Common Issues and Solutions
### Issue: OOM During Rollout
**Symptoms**: CUDA out of memory during generation phase
**Solutions**:
```yaml
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true
# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```
### Issue: Training Instability
**Symptoms**: Loss spikes, reward collapse
**Solutions**:
```yaml
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7
# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coeRelated in AI Agents
skill-development
IncludedComprehensive meta-skill for creating, managing, validating, auditing, and distributing Claude Code skills and slash commands (unified in v2.1.3+). Provides skill templates, creation workflows, validation patterns, audit checklists, naming conventions, YAML frontmatter guidance, progressive disclosure examples, and best practices lookup. Use when creating new skills, validating existing skills, auditing skill quality, understanding skill architecture, needing skill templates, learning about YAML frontmatter requirements, progressive disclosure patterns, tool restrictions (allowed-tools), skill composition, skill naming conventions, troubleshooting skill activation issues, creating custom slash commands, configuring command frontmatter, using command arguments ($ARGUMENTS, $1, $2), bash execution in commands, file references in commands, command namespacing, plugin commands, MCP slash commands, Skill tool configuration, or deciding between skills vs slash commands. Delegates to docs-management skill for official documentation.
reprompter
IncludedTransform messy prompts into well-structured, effective prompts — single or multi-agent. Use when: "reprompt", "reprompt this", "clean up this prompt", "structure my prompt", rough text needing XML tags and best practices, "reprompter teams", "repromptception", "run with quality", "smart run", "smart agents", multi-agent tasks, audits, parallel work, anything going to agent teams. Don't use when: simple Q&A, pure chat, immediate execution-only tasks. See "Don't Use When" section for details. Outputs: Structured XML/Markdown prompt, quality score (before/after), optional team brief + per-agent sub-prompts, agent team output files. Success criteria: Single mode quality score ≥ 7/10; Repromptception per-agent prompt quality score 8+/10; all required sections present, actionable and specific.
adaptive-compaction
IncludedAdaptive add-on policy and recovery layer that decides WHEN to compact, prune, snapshot, or fork -- replacing fixed-percent auto-compaction across Claude Code, Codex, and MCP-capable hosts. Trigger on auto-compact timing or damage: "when should I compact", "is it safe to compact now or start a fresh session", "auto-compact fires too early/mid-task", "switching to an unrelated task but the window still has space", "context rot", "answers get worse the longer the session runs", "the agent forgot the plan or my decisions after it summarized", "add a layer on top that manages context without changing the agent", raising autoCompactWindow to give the policy room, or installing/tuning a cross-tool compaction policy or PreCompact hook -- even when "compaction" is never said but the problem is context-window pressure or post-summarization memory loss. Do NOT use to summarize a conversation, build RAG, write a summarization prompt (decides WHEN not HOW), or answer max-context-length trivia.
agent-skill-creator
IncludedCreate cross-platform agent skills from workflow descriptions. Activates when users ask to create an agent, automate a repetitive workflow, create a custom skill, or need advanced agent creation. Triggers on phrases like create agent for, automate workflow, create skill for, every day I have to, daily I need to, turn process into agent, need to automate, create a cross-platform skill, validate this skill, export this skill, migrate this skill. Supports single skills, multi-agent suites, transcript processing, template-based creation, interactive configuration, cross-platform export, and spec validation.
llm-wiki
IncludedUse when building or maintaining a persistent personal knowledge base (second brain) in Obsidian where an LLM incrementally ingests sources, updates entity/concept pages, maintains cross-references, and keeps a synthesis current. Triggers include "second brain", "Obsidian wiki", "personal knowledge management", "ingest this paper/article/book", "build a research wiki", "compound knowledge", "Memex", or whenever the user wants knowledge to accumulate across sessions instead of being re-derived by RAG on every query.
skill-master
IncludedAgent Skills authoring, evaluation, and optimization. Create, edit, validate, benchmark, and improve skills following the agentskills.io specification. Use when designing SKILL.md files, structuring skill folders (references, scripts, assets), ingesting external documentation into skills, running trigger evals, benchmarking skill quality, optimizing descriptions, or performing blind A/B comparisons. Keywords: agentskills.io, SKILL.md, skill authoring, eval, benchmark, trigger optimization.