rlhf

Understanding Reinforcement Learning from Human Feedback (RLHF) for aligning language models. Use when learning about preference data, reward modeling, policy optimization, or direct alignment algorithms like DPO.

# Understanding RLHF

Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences. Rather than relying solely on next-token prediction, RLHF uses human judgment to guide model behavior toward helpful, harmless, and honest outputs.

## Table of Contents

- [Core Concepts](#core-concepts)
- [The RLHF Pipeline](#the-rlhf-pipeline)
- [Preference Data](#preference-data)
- [Instruction Tuning](#instruction-tuning)
- [Reward Modeling](#reward-modeling)
- [Policy Optimization](#policy-optimization)
- [Direct Alignment Algorithms](#direct-alignment-algorithms)
- [Challenges](#challenges)
- [Best Practices](#best-practices)
- [References](#references)

## Core Concepts

### Why RLHF?

Pretraining produces models that predict likely text, not necessarily *good* text. A model trained on internet data learns to complete text in ways that reflect its training distribution—including toxic, unhelpful, or dishonest patterns. RLHF addresses this gap by optimizing for human preferences rather than likelihood.

The core insight: humans can often recognize good outputs more easily than they can specify what makes an output good. RLHF exploits this by collecting human judgments and using them to shape model behavior.

### The Alignment Problem

Language models face several alignment challenges:

- **Helpfulness**: Following instructions and providing useful information
- **Harmlessness**: Avoiding toxic, dangerous, or inappropriate outputs
- **Honesty**: Acknowledging uncertainty and avoiding fabrication
- **Intent alignment**: Understanding what users actually want, not just what they say

RLHF provides a framework for encoding these properties through preference data.

### Key Components

1. **Preference data**: Human judgments comparing model outputs
2. **Reward model**: A learned function approximating human preferences
3. **Policy optimization**: RL algorithms that maximize expected reward
4. **Regularization**: Constraints that limit how far the policy drifts from the reference model (typically the SFT model)

## The RLHF Pipeline

The standard RLHF pipeline consists of three main stages:

### Stage 1: Supervised Fine-Tuning (SFT)

Start with a pretrained language model and fine-tune it on high-quality demonstrations. This teaches the model the desired format and style of responses.

**Input**: Pretrained model + demonstration dataset
**Output**: SFT model that can follow instructions

### Stage 2: Reward Model Training

Train a model to predict human preferences between pairs of outputs. The reward model learns to score outputs in a way that correlates with human judgment.

**Input**: SFT model + preference dataset (chosen/rejected pairs)
**Output**: Reward model that scores any output

### Stage 3: Policy Optimization

Use reinforcement learning to optimize the SFT model against the reward model, while staying close to the original SFT distribution.

**Input**: SFT model + reward model
**Output**: Final aligned model

### Alternative: Direct Alignment

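Direct alignment algorithms (DPO, IPO, KTO) skip training a separate reward model and optimize the policy directly on preference data. This simplifies the pipeline, but it gives up some flexibility: there is no standalone reward model to reuse for scoring freshly sampled responses.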
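As an illustration of optimizing directly from preferences, the following is a minimal sketch of the DPO loss for a single (prompt, chosen, rejected) example. It assumes the summed log-probabilities of each response under the policy and under the reference model have already been computed; the function names and the default `beta=0.1` are illustrative choices, not values taken from this document.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, tune per task
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors the chosen response more strongly
    # than the reference model does.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# When the policy equals the reference model, the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Shifting probability mass toward the chosen response lowers the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The reference log-probabilities play the role that the KL penalty plays in reward-model RL: the loss only rewards changes relative to the reference model.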

### Modern Post-Training Variants

Current post-training stacks often mix these stages rather than using a single linear pipeline:

| Method | Data Signal | Best Fit |
|--------|-------------|----------|
| SFT | Demonstrations | Format, style, instruction following |
| DPO/IPO/KTO/ORPO | Offline preferences or binary feedback | Simpler alignment without online rollouts |
| PPO/RLOO | Reward model scores on sampled responses | Reward-model RL with explicit KL control |
| GRPO | Grouped completions scored by reward functions/models | Reasoning and verifiable-task optimization |
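To make the "grouped completions" row concrete, here is a small sketch of one common way GRPO-style methods turn a group of rewards for the same prompt into advantages: normalize each reward by the group mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; eps avoids division by zero when
    # every completion in the group receives the same reward.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, scored by a verifiable check (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantages, incorrect ones negative.
```

Because the baseline comes from the group itself, this formulation does not need a separately learned value function.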

## Preference Data

Preference data encodes human judgment about model outputs. The most common format is pairwise comparisons.

### Pairwise Preferences

Given a prompt, collect two or more model outputs and have humans indicate which is better:

```
Prompt: "Explain quantum entanglement"

Response A: [technical explanation]
Response B: [simpler explanation with analogy]

Human preference: B > A
```

This creates (prompt, chosen, rejected) tuples for training.
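One possible in-memory representation of these tuples is sketched below. The `PreferencePair` class and `to_pair` helper are hypothetical names introduced for illustration; real datasets may store additional fields such as annotator IDs or preference strength.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt plus the preferred and dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def to_pair(prompt: str, response_a: str, response_b: str, preferred: str) -> PreferencePair:
    """Convert an A/B annotation into a (prompt, chosen, rejected) tuple."""
    if preferred == "A":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if preferred == "B":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError(f"preferred must be 'A' or 'B', got {preferred!r}")


pair = to_pair(
    "Explain quantum entanglement",
    "[technical explanation]",
    "[simpler explanation with analogy]",
    preferred="B",
)
```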

### Collection Methods

**Human annotation**: Trained annotators compare outputs according to guidelines. Most reliable but expensive and slow.

**AI feedback**: Use a capable model to generate preferences. Faster and cheaper but may propagate biases. This is the basis for Constitutional AI (CAI) and RLAIF.

**Implicit signals**: User interactions like upvotes, regeneration requests, or conversation length. Noisy but abundant.

### Data Quality Considerations

- **Annotator agreement**: Low agreement suggests ambiguous criteria or subjective preferences
- **Distribution coverage**: Preferences should cover the range of model behaviors
- **Prompt diversity**: Diverse prompts prevent overfitting to narrow scenarios
- **Preference strength**: Some comparisons are clear; others are nearly ties
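Annotator agreement can be checked with a few lines of code. The sketch below computes the raw agreement rate and Cohen's kappa for two annotators labeling the same comparisons; the function names and the example labels are made up for illustration, and projects with more than two annotators would need a different statistic.

```python
from collections import Counter


def agreement_rate(labels_1: list[str], labels_2: list[str]) -> float:
    """Fraction of comparisons on which two annotators picked the same response."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_1, labels_2)) / len(labels_1)


def cohens_kappa(labels_1: list[str], labels_2: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_1)
    observed = agreement_rate(labels_1, labels_2)
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(
        counts_1[k] * counts_2[k] for k in counts_1.keys() | counts_2.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both annotators always gave the same single label; kappa is
        # undefined here, and returning 1.0 is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["A", "B", "A", "A"]
annotator_2 = ["A", "B", "B", "A"]
rate = agreement_rate(annotator_1, annotator_2)   # 0.75
kappa = cohens_kappa(annotator_1, annotator_2)    # 0.5
```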

## Instruction Tuning

Instruction tuning (supervised fine-tuning on instruction-response pairs) serves as the foundation for RLHF.

### Purpose

- Teaches the model to follow instructions rather than just complete text
- Establishes the format and style for responses
- Creates a starting point that already exhibits desired behaviors
- Provides the reference policy for KL regularization

### Dataset Composition

Typical instruction tuning datasets include:

- **Single-turn QA**: Questions with direct answers
- **Multi-turn dialogue**: Conversational exchanges
- **Task instructions**: Specific tasks with examples
- **Chain-of-thought**: Reasoning traces for complex problems

### Relationship to RLHF

The SFT model defines the "prior" that RLHF refines. A better SFT model means:

- The reward model has better starting outputs to compare
- Policy optimization has less work to do
- The KL penalty anchors the final model to a stronger baseline

## Reward Modeling

The reward model transforms pairwise preferences into a scalar signal for RL optimization.

### The Bradley-Terry Model

Preferences are modeled using the Bradley-Terry framework:

```
P(A > B) = sigmoid(r(A) - r(B))
```

Where r(x) is the reward for output x. This assumes preferences depend only on the difference in rewards.
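A short numeric sketch shows what "depends only on the difference" means in practice. The function name is illustrative and the reward values are arbitrary.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B."""
    diff = reward_a - reward_b
    # Two branches keep exp() from overflowing for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


p_tie = preference_probability(1.0, 1.0)            # equal rewards give 0.5
p_small = preference_probability(2.0, 1.0)
p_shifted = preference_probability(102.0, 101.0)    # same difference, same probability
```

Adding a constant to every reward leaves all preference probabilities unchanged, which is why only reward differences are identifiable from pairwise data.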

The loss function is:

```
L = -log(sigmoid(r(chosen) - r(rejected)))
```

This pushes the reward model to assign higher scores to chosen outputs.
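The loss can be sketched in plain Python for scalar rewards. This is a minimal illustration with hypothetical function names; a real reward model would compute the same quantity on batched tensors and backpropagate through it.

```python
import math


def pairwise_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(reward_pairs: list[tuple[float, float]]) -> float:
    """Mean pairwise loss over (chosen_reward, rejected_reward) pairs."""
    return sum(pairwise_loss(c, r) for c, r in reward_pairs) / len(reward_pairs)


correct_order = pairwise_loss(2.0, 0.0)   # chosen scored higher: small loss
wrong_order = pairwise_loss(0.0, 2.0)     # chosen scored lower: large loss
no_margin = pairwise_loss(0.0, 0.0)       # log(2)
```

The loss never reaches zero, so training keeps widening the margin between chosen and rejected scores unless it is regularized or stopped early.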

### Architecture

Reward models are typically:

- The SFT model with a scalar head instead of the language modeling head
- Trained on (prompt, chosen, rejected) tuples
- Output a single scalar reward for any (prompt, response) pair
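The scalar head can be pictured as a single linear projection of the model's final hidden state. The toy sketch below uses plain Python lists in place of tensors; reading the reward from the last token's hidden state is a common convention but is an assumption here, as are all names and numbers.

```python
def scalar_reward(
    hidden_states: list[list[float]],
    head_weights: list[float],
    head_bias: float = 0.0,
) -> float:
    """Project the last token's hidden state to a single scalar reward."""
    last_hidden = hidden_states[-1]  # assumed pooling choice: final token
    if len(last_hidden) != len(head_weights):
        raise ValueError("head_weights must match the hidden size")
    return sum(w * h for w, h in zip(head_weights, last_hidden)) + head_bias


# Two tokens with hidden size 2 (toy values standing in for transformer outputs).
hidden = [[0.1, 0.2], [1.0, -1.0]]
reward = scalar_reward(hidden, head_weights=[0.5, 0.25])
```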

### Considerations

- **Scaling**: Larger reward models generally produce better signals
- **Calibration**: Absolute reward values matter less than rankings, because the Bradley-Terry loss depends only on reward differences
- **Generalization**: The model must score outputs it hasn't seen during training
- **Over-optimization**: Policies can exploit reward model weaknesses

See `reference/reward-modeling.md` for detailed training procedures.

## Policy Optimization

Policy optimization uses RL to maximize expected reward while staying close to the reference policy.

### The RLHF Objective

```
maximize E[R(x, y)] - β * KL(π || π_ref)
```

Where:
- R(x, y) is the reward for response y to prompt x
- KL(π || π_ref) measures deviation from the reference policy
- β controls the strength of the regularization
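In practice the KL term is often estimated from the sampled response itself, using per-token log-probabilities under the policy and the reference model. The sketch below shows that estimate folded into a single penalized reward; the function name, the per-sequence (rather than per-token) penalty, and `beta=0.1` are assumptions for illustration.

```python
def kl_penalized_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.1,  # assumed value
) -> float:
    """Reward minus beta times a single-sample estimate of KL(pi || pi_ref)."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    # For a response sampled from the policy, the summed log-ratio is an
    # unbiased estimate of the sequence-level KL; it can be negative for
    # an individual sample.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# The policy assigns this response more probability than the reference does,
# so part of the reward is paid back as a penalty.
penalized = kl_penalized_reward(1.0, [-1.0, -0.5], [-1.5, -1.0])
# With identical log-probs there is no penalty.
unpenalized = kl_penalized_reward(1.0, [-1.0, -0.5], [-1.0, -0.5])
```

A larger β keeps the policy closer to the reference at the cost of lower achievable reward.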

### PPO (Proximal Policy Optimization)

PPO is the classic reward-model RLHF algorithm:

1. Sample responses from the current policy
2. Score responses with the reward model
3. Compute advantage estimates
4. Update policy with clipped surrogate objective

The clipping prevents large policy updates that could destabilize training.
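The effect of clipping is easiest to see on a single action. This is a minimal sketch of the clipped surrogate term for one sample, with an illustrative function name and an assumed clip range of 0.2; a full PPO implementation would average this over tokens and batches and add value and entropy terms.

```python
import math


def clipped_surrogate(
    new_logprob: float,
    old_logprob: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed clip range
) -> float:
    """PPO clipped surrogate objective for one sample (to be maximized)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to push the ratio outside
    # the clip range in the direction that would increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a negative advantage: the full penalty is kept.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```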

### KL Regularization

The KL penalty serves multiple purposes:

- **Limits reward hacking**: Discourages the policy from drifting toward outputs that exploit flaws in the reward model
- **Maintains capabilitie

Source: https://github.com/itsmostafa/llm-engineering-skills/tree/main/skills/rlhf