Claude
Skills
Sign in
Back

unsloth-training

Included with Lifetime
$97 forever

Fine-tune LLMs with Unsloth using GRPO or SFT. Supports FP8, vision models, mobile deployment, Docker, packing, GGUF export, dataset preparation, synthetic data, MLX (Apple Silicon). Use when: train with GRPO, fine-tune, reward functions, SFT training, FP8 training, vision fine-tuning, phone deployment, docker training, packing, export to GGUF, prepare dataset, synthetic data, install unsloth, environment flags, MLX training.

Cloud & DevOps

What this skill does


<objective>
Guide LLM fine-tuning using Unsloth:

1. **GRPO** - RL with reward functions (no labeled outputs needed)
2. **SFT** - Supervised fine-tuning with input/output pairs
3. **Vision** - VLM fine-tuning (Qwen3-VL, Gemma3, Llama 3.2 Vision)

Key capabilities:
- **FP8 Training** - 60% less VRAM, 1.4x faster (RTX 40+, H100)
- **3x Packing** - Automatic 2-5x speedup for mixed-length data
- **Docker** - Official `unsloth/unsloth` image
- **Mobile** - QAT → ExecuTorch → iOS/Android (~40 tok/s)
- **Export** - GGUF, Ollama, vLLM, LM Studio, SGLang
</objective>

<quick_start>
**GRPO with FP8 (60% less VRAM):**
```python
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1"  # Shared memory
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048, load_in_fp8=True, fast_inference=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def correctness_reward(completions, answer, **kwargs):
    return [2.0 if extract_answer(c) == a else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(num_generations=4, beta=0.04, learning_rate=5e-6),
    train_dataset=dataset, reward_funcs=[correctness_reward],
)
trainer.train()
```

**SFT with Packing (2-5x faster):**
```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model, train_dataset=dataset, processing_class=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2, num_train_epochs=3,
        learning_rate=2e-4, packing=True,  # 2-5x speedup
    ),
)
trainer.train()
```
</quick_start>

<success_criteria>
A training run is successful when:
- Model loads without OOM errors
- Reward (GRPO) or loss (SFT) shows improvement trend
- Generated outputs match expected format
- Model exported to desired format (LoRA, merged, GGUF)
- Test inference produces reasonable outputs
</success_criteria>

<activation_triggers>
**Explicit triggers:**
- `/unsloth grpo` - GRPO (RL) training
- `/unsloth sft` - SFT training
- `/unsloth fp8` - FP8 training setup
- `/unsloth vision` - VLM fine-tuning
- `/unsloth mobile` - Phone deployment (QAT)
- `/unsloth docker` - Docker container setup
- `/unsloth troubleshoot` - Debug issues
- `/unsloth install` - Installation guide
- `/unsloth dataset` - Dataset preparation
- `/unsloth mlx` - Apple Silicon training

**Natural language:**
- "train with GRPO", "fine-tune", "reward functions"
- "FP8 training", "fp8", "less VRAM"
- "vision fine-tuning", "VLM", "image training"
- "phone deployment", "mobile LLM", "ExecuTorch"
- "docker training", "container", "unsloth docker"
- "packing", "faster training", "500k context"
- "export GGUF", "Ollama", "vLLM", "SGLang"
- "install unsloth", "pip install", "setup unsloth"
- "prepare dataset", "training data", "synthetic data", "ChatML", "ShareGPT"
- "environment flags", "UNSLOTH_RETURN_LOGITS"
- "MLX", "Apple Silicon", "Mac training", "unsloth-mlx"
</activation_triggers>

<file_locations>
**Core references:**
- `reference/reward-design.md` - Reward function patterns
- `reference/domain-examples.md` - Voice AI, Sales Agent examples
- `reference/hyperparameters.md` - GRPOConfig reference
- `reference/troubleshooting.md` - Common fixes

**Setup and data references:**
- `reference/installation.md` - pip/uv install, CUDA versions, venv, Colab
- `reference/environment-flags.md` - UNSLOTH_RETURN_LOGITS, COMPILE_DISABLE, etc.
- `reference/datasets-guide.md` - Formats (ChatML/ShareGPT/Alpaca), chat templates, synthetic data
- `reference/mlx-training.md` - Apple Silicon training with unsloth-mlx

**Training feature references:**
- `reference/fp8-training.md` - FP8 setup, VRAM savings
- `reference/deployment.md` - Docker, vLLM, LoRA hot-swap, SGLang
- `reference/export-formats.md` - GGUF, Ollama, LM Studio, Dynamic 2.0
- `reference/advanced-training.md` - 500K context, packing, checkpoints
- `reference/vision-training.md` - VLM fine-tuning
- `reference/mobile-deployment.md` - QAT, ExecuTorch, iOS/Android

**Code examples:** `reference/grpo/`, `reference/sft/`
</file_locations>

<core_concepts>
## When to Use GRPO vs SFT

| Method | Use When | Data Needed |
|--------|----------|-------------|
| **GRPO** | Improving reasoning quality | Prompts + verifiable answers |
| **GRPO** | Aligning behavior with preferences | Reward functions |
| **GRPO** | When you can verify correctness | Verifiable outputs |
| **SFT** | Teaching specific output format | Input/output pairs |
| **SFT** | Following new instructions | Conversation examples |
| **SFT** | Learning domain knowledge | Labeled examples |

## Model Selection

| Model | Size | VRAM | Use Case |
|-------|------|------|----------|
| `unsloth/Qwen2.5-0.5B-Instruct` | 0.5B | 5GB | Mobile deployment (~200MB GGUF) |
| `unsloth/Qwen2.5-1.5B-Instruct` | 1.5B | 5GB | Learning/prototyping |
| `Qwen/Qwen2.5-3B-Instruct` | 3B | 8GB | Good balance (recommended start) |
| `unsloth/Qwen2.5-7B-Instruct` | 7B | 16GB | Production quality |
| `unsloth/Phi-4` | 14B | 20GB | Strong reasoning |

## Core Hyperparameters

**GRPO (RL):**
```python
GRPOConfig(
    num_generations=4,        # Completions per prompt (2-8)
    beta=0.04,                # KL penalty (0.01-0.1)
    learning_rate=5e-6,       # 10x smaller than SFT!
    max_completion_length=512,
    max_steps=300,            # Minimum for results
)
```

**SFT:**
```python
TrainingArguments(
    learning_rate=2e-4,       # Standard SFT rate
    num_train_epochs=3,       # 2-4 typical
    per_device_train_batch_size=2,
)
```
</core_concepts>

<reward_functions>
## Reward Function Design

Reward functions are the core of GRPO. They return a list of floats for each completion.

### Pattern 1: Correctness (Primary Signal)

```python
def correctness_reward(completions, answer, **kwargs):
    """
    +2.0 for correct answer, 0.0 otherwise.
    This should be your highest-weighted reward.
    """
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards
```

### Pattern 2: Format Compliance

```python
def format_reward(completions, **kwargs):
    """
    +0.5 for proper XML structure with reasoning and answer tags.
    """
    rewards = []
    for completion in completions:
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
        if has_reasoning and has_answer:
            rewards.append(0.5)
        elif has_answer:
            rewards.append(0.2)
        else:
            rewards.append(0.0)
    return rewards
```

### Pattern 3: Reasoning Quality

```python
def reasoning_length_reward(completions, **kwargs):
    """
    +0.3 for substantive reasoning (30-200 words).
    """
    rewards = []
    for completion in completions:
        reasoning = extract_reasoning(completion)
        word_count = len(reasoning.split()) if reasoning else 0
        if 30 <= word_count <= 200:
            rewards.append(0.3)
        elif 15 <= word_count < 30:
            rewards.append(0.1)
        else:
            rewards.append(0.0)
    return rewards
```

### Pattern 4: Negative Constraints

```python
def no_hedging_reward(completions, **kwargs):
    """
    -0.3 penalty for uncertainty language.
    """
    hedging = ["i think", "maybe", "perhaps", "possibly", 

Related in Cloud & DevOps