Reinforcement Learning from Human Feedback

RLHF aligns a language model with human preferences through a three-stage pipeline. Each stage builds on the previous one: supervised fine-tuning produces an initial policy, a preference model learns to score outputs, and RL training optimizes the policy against that learned reward.

SFT (instruction data) → Preference Model (pairwise data) → RL Training (reward from PM)

Stage 1: Supervised Fine-Tuning

The base model is fine-tuned on the no_robots dataset, which contains human-written instruction-following examples designed to match the InstructGPT methodology. This gives the model basic instruction-following ability and produces the initial policy that RL will refine.
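A key detail of this stage is loss masking: the training loss is computed only on the completion tokens, not the instruction tokens, so the model learns to produce responses rather than reproduce prompts. A minimal sketch of that masked loss (function and variable names here are illustrative, not from the actual codebase):

```python
import math

def sft_loss(token_logprobs, is_response):
    # Token-level negative log-likelihood, averaged over response
    # tokens only. Prompt tokens are masked out so the model is
    # trained to reproduce the human-written completion, not the
    # instruction itself.
    losses = [-lp for lp, resp in zip(token_logprobs, is_response) if resp]
    return sum(losses) / len(losses)

# Toy example: 2 prompt tokens (masked) and 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8),                    # prompt: ignored
            math.log(0.5), math.log(0.25), math.log(0.5)]    # response: scored
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
```

In a real trainer the log-probabilities come from the model's forward pass and the mask is derived from the chat template, but the averaging logic is the same.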

Stage 2: Preference Model

A separate model is trained on the Anthropic HHH dataset of pairwise comparisons. Given two completions for the same prompt, the model learns to predict which one a human preferred. The ComparisonRenderer formats each pair with section markers so the model sees both completions in a single forward pass.
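The standard objective for this kind of pairwise preference model is the Bradley-Terry loss, which pushes the score of the preferred completion above the rejected one. A sketch under that assumption (the source does not name the exact loss used):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized by scoring the human-preferred completion higher;
    # equals log(2) when the two scores are tied.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss depends only on the score margin, so the preference model's outputs are calibrated relative to each other rather than on an absolute scale.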

Stage 3: RL Training

The SFT policy is optimized against the preference model using self-play. For each prompt, multiple completions are sampled and the preference model grades all pairs in a tournament. Each completion receives a reward based on its win fraction, and the policy is updated to produce more of the winning responses.
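The tournament scoring above can be sketched as a round-robin over the sampled completions. The grader callable `prefers` here stands in for the preference model and is a hypothetical name for illustration:

```python
from itertools import combinations

def tournament_rewards(completions, prefers):
    # Round-robin tournament: every pair of sampled completions is
    # graded once; `prefers(a, b)` (a stand-in for the preference
    # model) returns True if a beats b. Each completion's reward is
    # its win fraction across the games it played.
    wins = {c: 0 for c in completions}
    for a, b in combinations(completions, 2):
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    games = len(completions) - 1  # games each completion plays
    return {c: wins[c] / games for c in completions}

# Toy grader for demonstration: prefer the longer completion.
comps = ["ok", "good answer", "a much more detailed answer"]
rewards = tournament_rewards(comps, lambda a, b: len(a) > len(b))
```

Win fractions land in [0, 1] regardless of the number of samples per prompt, which keeps the reward scale stable for the policy update.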

Resources

  • RLHF Tutorial -- Interactive marimo notebook walking through all three stages with code
  • RLHF Recipe -- Production-ready pipeline script with CLI configuration and wandb logging