Learning from Preferences
In this section, we focus on learning from pairwise feedback, where we have preference data indicating which of two completions is better for a given prompt. This kind of feedback is a natural fit for tasks where there is no simple correctness criterion that can be computed programmatically. These preferences might be collected from human evaluators or generated by a model.
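For concreteness, a single pairwise preference record might look like the sketch below. The field names are illustrative and depend on your data pipeline; this guide does not prescribe a specific schema.

```python
# A hypothetical pairwise preference record: one prompt, two completions,
# and an indication of which completion was preferred. Field names are
# assumptions for illustration, not a required format.
preference_example = {
    "prompt": "Summarize the following article in one sentence: ...",
    "chosen": "The article argues that ...",     # completion preferred by the annotator
    "rejected": "This is an article about ...",  # completion judged worse
}
```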
Two Approaches to Preference Learning
When you have pairwise preference data, there are two main approaches:
- Direct Preference Optimization (DPO): Directly update the policy to prefer chosen responses over rejected ones, without needing a separate reward model. This is simpler and computationally cheaper. See the DPO Guide for details, and the loss sketch after this list.
- Reinforcement Learning from Human Feedback (RLHF): Train a reward model on preference data, then use reinforcement learning to optimize the policy against this reward model. This two-stage approach provides more flexibility. See the RLHF example for details, and the reward-model sketch after this list.
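As a concrete reference for the first approach, here is a minimal sketch of the DPO objective, assuming you already have per-sequence log-probabilities from the policy and a frozen reference model. The tensor names and the `beta` coefficient are illustrative, not part of this guide's API; see the DPO Guide for the full recipe.

```python
# Minimal DPO loss sketch (assumed shapes: one scalar log-prob per sequence,
# batched over preference pairs). Names and beta=0.1 are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise DPO loss: push the policy's chosen-vs-rejected log-prob
    margin above the reference model's margin."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over pairs
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```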
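For the second approach, the first RLHF stage fits a reward model on the same pairwise data. The sketch below shows one common choice, a Bradley-Terry style ranking loss over scalar scores; the second stage (running RL against the trained reward model) is covered in the RLHF example and not shown here. The function and argument names are assumptions for illustration.

```python
# Minimal reward-model training loss sketch: the reward model (not shown)
# maps (prompt, completion) to a scalar score, and this Bradley-Terry style
# loss trains it to score chosen completions above rejected ones.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```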