Preferences

Learning from Preferences

In this section, we focus on learning from pairwise feedback, where we have preference data indicating which of two completions is better for a given prompt. This kind of feedback is a natural fit for tasks with no simple correctness criterion that can be checked programmatically, such as judging the helpfulness or tone of a response. These preferences might be collected from human evaluators or generated by a model.
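For concreteness, each pairwise example can be represented as a prompt plus a chosen and a rejected completion. The record below is a minimal sketch; the class and field names are illustrative rather than a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class PreferenceExample:
    """One pairwise comparison: the labeler preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str    # completion judged better for this prompt
    rejected: str  # completion judged worse for this prompt

example = PreferenceExample(
    prompt="Summarize the main point of the article in one sentence.",
    chosen="The article argues that pairwise feedback scales better than absolute scores.",
    rejected="The article is about feedback.",
)
```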

Two Approaches to Preference Learning

When you have pairwise preference data, there are two main approaches:

  1. Direct Preference Optimization (DPO): Directly update the policy to prefer chosen responses over rejected ones, without needing a separate reward model. This is simpler and computationally cheaper. See the DPO Guide for details; a minimal loss sketch also follows this list.

  2. Reinforcement Learning from Human Feedback (RLHF): Train a reward model on the preference data, then use reinforcement learning to optimize the policy against this reward model. This two-stage approach is more involved but more flexible, since the learned reward model can score fresh samples during RL. See the RLHF example for details; a reward-model loss sketch follows the DPO sketch below.
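
As a rough illustration of the first approach, the sketch below computes the DPO objective from per-sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and `beta` default are placeholders for illustration, not part of any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probs (token log-probs summed per completion).

    Each row is one preference pair; minimizing the loss increases the policy's
    margin for the chosen completion over the rejected one, relative to the
    frozen reference model. `beta` controls how strongly the policy is pushed
    away from the reference.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Negative log-sigmoid of the implicit reward gap, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Example: a batch of 4 preference pairs with dummy log-probs.
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch))
```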
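
For the second approach, the reward model is typically fit to the same pairs with a Bradley-Terry style objective before any reinforcement learning happens. Again, a minimal sketch with placeholder names; the scores would come from a scalar-output reward model evaluated on each completion:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: score the chosen completion above the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: scalar reward scores for a batch of 4 pairs.
print(reward_model_loss(torch.randn(4), torch.randn(4)))
```

Once trained, the reward model can score any new completion the policy samples during RL, which is what gives the two-stage approach its flexibility.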