Learning from Preferences
In this section, we focus on learning from pairwise feedback, where we have preference data indicating which of two completions is better for a given prompt. This kind of feedback is a natural fit for tasks where there is no simple correctness criterion that can be computed programmatically. These preferences might be collected from human evaluators or generated by a model.
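For concreteness, a single pairwise preference record might look like the sketch below. The field names are illustrative and depend on your data pipeline; this guide does not prescribe a specific schema.

```python
# A hypothetical pairwise preference record: one prompt, two completions,
# and an indication of which completion was preferred. Field names are
# assumptions for illustration, not a required format.
preference_example = {
    "prompt": "Summarize the following article in one sentence: ...",
    "chosen": "The article argues that ...",     # completion preferred by the annotator
    "rejected": "This is an article about ...",  # completion judged worse
}
```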
Two Approaches to Preference Learning
When you have pairwise preference data, there are two main approaches:
- Direct Preference Optimization (DPO): Directly update the policy to prefer chosen responses over rejected ones, without needing a separate reward model. This is simpler and computationally cheaper. See the DPO Guide for details, and the loss sketch after this list.
- Reinforcement Learning from Human Feedback (RLHF): Train a reward model on preference data, then use reinforcement learning to optimize the policy against this reward model. This two-stage approach provides more flexibility. See the RLHF example for details, and the reward-model sketch after this list.
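As a concrete reference for the first approach, here is a minimal sketch of the DPO objective, assuming you already have per-sequence log-probabilities from the policy and a frozen reference model. The tensor names and the `beta` coefficient are illustrative, not part of this guide's API; see the DPO Guide for the full recipe.

```python
# Minimal DPO loss sketch (assumed shapes: one scalar log-prob per sequence,
# batched over preference pairs). Names and beta=0.1 are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise DPO loss: push the policy's chosen-vs-rejected log-prob
    margin above the reference model's margin."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over pairs
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```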
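For the second approach, the first RLHF stage fits a reward model on the same pairwise data. The sketch below shows one common choice, a Bradley-Terry style ranking loss over scalar scores; the second stage (running RL against the trained reward model) is covered in the RLHF example and not shown here. The function and argument names are assumptions for illustration.

```python
# Minimal reward-model training loss sketch: the reward model (not shown)
# maps (prompt, completion) to a scalar score, and this Bradley-Terry style
# loss trains it to score chosen completions above rejected ones.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```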