Reinforcement Learning from Human Feedback
RLHF aligns a language model with human preferences through a three-stage pipeline. Each stage builds on the previous one: supervised fine-tuning produces an initial policy, a preference model learns to score outputs, and RL training optimizes the policy against that learned reward.
Stage 1: Supervised Fine-Tuning
The base model is fine-tuned on the no_robots dataset, which contains human-written instruction-following examples designed to match the InstructGPT methodology. This gives the model basic instruction-following ability and produces the initial policy that RL will refine.
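The key mechanical detail of SFT is that the loss is computed only over the completion tokens, not the prompt. A minimal sketch of that masking, using hypothetical pre-computed token log-probabilities rather than a real model:

```python
def sft_token_loss(logprobs, prompt_len):
    """Average negative log-likelihood over completion tokens only.

    `logprobs` holds the model's log-probability of each target token
    in a prompt+completion sequence (stand-in values here); prompt
    tokens are masked out so the model is trained only to reproduce
    the human-written completion.
    """
    completion = logprobs[prompt_len:]
    return -sum(completion) / len(completion)

# Example: a 5-token sequence whose first 2 tokens are the prompt.
loss = sft_token_loss([-0.1, -0.2, -0.5, -0.4, -0.3], prompt_len=2)
```

The real pipeline computes the same quantity per batch with tensors; this shows only the masking logic.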
Stage 2: Preference Model
A separate model is trained on the Anthropic HHH dataset of pairwise comparisons. Given two completions for the same prompt, the model learns to predict which one a human preferred. The ComparisonRenderer formats each pair with section markers so the model sees both completions in a single forward pass.
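The standard training objective for such a pairwise model is a Bradley-Terry style loss: the scalar score of the human-preferred completion should exceed that of the rejected one. A minimal sketch (the exact loss used by the tutorial may differ):

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(chosen - rejected)).

    The loss shrinks as the score margin in favor of the preferred
    completion grows, and blows up when the ranking is reversed.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ranked pair yields a small loss; a reversed pair a large one.
good = preference_loss(2.0, -1.0)   # margin +3
bad = preference_loss(-1.0, 2.0)    # margin -3
```

Rendering both completions in one forward pass, as the ComparisonRenderer does, lets the model attend across the pair before emitting its preference.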
Stage 3: RL Training
The SFT policy is optimized against the preference model using self-play. For each prompt, multiple completions are sampled and the preference model grades all pairs in a tournament. Each completion receives a reward based on its win fraction, and the policy is updated to produce more of the winning responses.
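The tournament reward described above can be sketched as follows; `prefers` is a hypothetical stand-in for the preference model, returning True when its first argument beats the second:

```python
from itertools import combinations

def tournament_rewards(completions, prefers):
    """Grade every pair of completions with `prefers(a, b)` and reward
    each completion by the fraction of its matchups it won."""
    wins = {c: 0 for c in completions}
    for a, b in combinations(completions, 2):
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    n = len(completions) - 1  # each completion faces n opponents
    return {c: wins[c] / n for c in completions}

# Toy grader for illustration: the longer completion wins.
rewards = tournament_rewards(["a", "bbb", "cc"], lambda a, b: len(a) > len(b))
```

These win fractions then serve as the per-completion rewards for the policy-gradient update.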
Resources
- RLHF Tutorial -- Interactive marimo notebook walking through all three stages with code
- RLHF Recipe -- Production-ready pipeline script with CLI configuration and wandb logging