# Direct Preference Optimization (DPO)
DPO trains a model to prefer chosen responses over rejected ones using a classification loss — no separate reward model needed.
Preference Data (chosen vs rejected) → Reference Policy π_ref (frozen) → DPO Loss (β-weighted) → Aligned Model π_θ
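A preference example is just a prompt paired with a chosen and a rejected completion. A minimal record might look like the following; the field names are illustrative, not the actual `tinker_cookbook.preference` schema:

```python
# One pairwise preference record (hypothetical field names).
record = {
    "prompt": "Explain DPO in one sentence.",
    "chosen": "DPO trains the policy to rank preferred responses above rejected "
              "ones with a classification loss, no reward model required.",
    "rejected": "DPO is a kind of optimizer.",
}
```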
## The DPO Loss
\[
\mathcal{L}_{\theta} = -\mathbb{E}_{x, y_\text{chosen}, y_\text{rejected} \sim \mathcal{D}}\left[\log\sigma\left(\beta\log \frac{\pi_{\theta}(y_\text{chosen}|x)}{\pi_{\text{ref}}(y_\text{chosen}|x)} - \beta\log \frac{\pi_{\theta}(y_\text{rejected}|x)}{\pi_{\text{ref}}(y_\text{rejected}|x)}\right)\right]
\]
- \(\pi_{\theta}\) — current policy being trained
- \(\pi_{\text{ref}}\) — reference model (frozen, typically the pre-DPO checkpoint)
- \(\beta\) — inverse temperature controlling how strongly the policy is penalized for deviating from the reference (higher = more conservative)
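The loss above can be sketched in plain Python for a single preference pair. The function and argument names are illustrative, not from any particular library; inputs are summed log-probabilities of each full response:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log(pi_theta / pi_ref) for y_chosen
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log(pi_theta / pi_ref) for y_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # log1p(exp(-z)) is a numerically stable form of -log(sigmoid(z))
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2 ≈ 0.693; raising the chosen response's log-probability relative to the reference drives the loss toward zero.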
## DPO vs RLHF

DPO eliminates the need for a separate reward model by directly optimizing the policy. It is simpler and cheaper than classical RLHF, but it requires the base model to already be in-distribution with the preference data.
## Available Datasets

| Dataset | Source | Description |
|---|---|---|
| `hhh` | Anthropic | Helpful-Harmless-Honest pairwise preferences |
| `helpsteer3` | NVIDIA | HelpSteer3 preference dataset |
| `ultrafeedback` | UltraFeedback | Binarized preference comparisons |
## Key Hyperparameters

- `dpo_beta` — start with `0.1`. Higher values are more conservative.
- `learning_rate` — lower than for SFT, typically `1e-5` to `1e-6`.
- Base model — should be in-distribution with the preference data. Start with a light SFT phase or collect on-policy preferences.
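As a starting point, these defaults can be collected into a plain config dict. The key names mirror the hyperparameters above but are not tied to any specific trainer API:

```python
# Illustrative starting values; tune per the guidance above.
dpo_config = {
    "dpo_beta": 0.1,        # higher values keep the policy closer to the reference
    "learning_rate": 1e-6,  # well below a typical SFT learning rate
}
```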
## Training Metrics

- `dpo_loss` — classification loss (should decrease)
- `accuracy` — how often the model prefers the chosen response
- `margin` — average reward difference between chosen and rejected
- `chosen_reward` / `rejected_reward` — average implicit rewards
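All of these metrics derive from the implicit reward \(r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\). A minimal sketch of computing them over a batch of summed log-probabilities (the function name and batch layout are assumptions for illustration):

```python
def batch_metrics(batch, beta=0.1):
    """batch: list of (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probs."""
    # Implicit reward for each response: beta * log(pi_theta / pi_ref)
    chosen = [beta * (pc - rc) for pc, _, rc, _ in batch]
    rejected = [beta * (pr - rr) for _, pr, _, rr in batch]
    n = len(batch)
    return {
        "accuracy": sum(c > r for c, r in zip(chosen, rejected)) / n,
        "margin": sum(c - r for c, r in zip(chosen, rejected)) / n,
        "chosen_reward": sum(chosen) / n,
        "rejected_reward": sum(rejected) / n,
    }
```

A healthy run shows `accuracy` and `margin` rising while `dpo_loss` falls; both implicit rewards often drift negative, which is fine as long as the gap between them grows.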
## Learn More
- DPO & Preferences Tutorial — hands-on interactive walkthrough
- RLHF Pipeline Tutorial — full 3-stage RLHF with reward model
- DPO Recipe — production training script
- tinker_cookbook.preference API — Comparison, PreferenceModel, Config