Direct Preference Optimization (DPO)

DPO trains a model to prefer chosen responses over rejected ones using a classification loss — no separate reward model needed.

Pipeline: Preference Data (chosen vs rejected) → Reference Policy π_ref (frozen) → DPO Loss (β-weighted) → Aligned Model π_θ

The DPO Loss

\[ \mathcal{L}_{\theta} = -\mathbb{E}_{x, y_\text{chosen}, y_\text{rejected} \sim \mathcal{D}}\left[\log\sigma\left(\beta\log \frac{\pi_{\theta}(y_\text{chosen}|x)}{\pi_{\text{ref}}(y_\text{chosen}|x)} - \beta\log \frac{\pi_{\theta}(y_\text{rejected}|x)}{\pi_{\text{ref}}(y_\text{rejected}|x)}\right)\right] \]
  • \(\pi_{\theta}\) — current policy being trained
  • \(\pi_{\text{ref}}\) — reference model (frozen, typically the pre-DPO checkpoint)
  • \(\beta\) — trades off fitting the preferences against staying close to \(\pi_{\text{ref}}\) (higher = more conservative, i.e. closer to the reference)
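The loss above can be computed directly from per-response log-probabilities. A minimal sketch (the function name and signature are illustrative, not a specific framework's API); each argument is \(\log \pi(y|x)\) summed over the response tokens:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities."""
    # Log-ratios of policy vs reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Argument of the sigmoid in the DPO loss
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written stably as softplus(-logits)
    loss = math.log1p(math.exp(-logits)) if logits > -30 else -logits
    # Implicit rewards, useful for logging training metrics
    chosen_reward = beta * chosen_logratio
    rejected_reward = beta * rejected_logratio
    return loss, chosen_reward, rejected_reward
```

The loss is minimized when the policy raises the likelihood of the chosen response (relative to the reference) more than that of the rejected one.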

DPO vs RLHF

DPO eliminates the need for a separate reward model by directly optimizing the policy. It is simpler and cheaper than classical RLHF, but it requires the base model to already be in-distribution with the preference data.

Available Datasets

Dataset        Source         Description
hhh            Anthropic      Helpful-Harmless-Honest pairwise preferences
helpsteer3     NVIDIA         HelpSteer3 preference dataset
ultrafeedback  UltraFeedback  Binarized preference comparisons
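All of these datasets reduce to the same record shape: a prompt plus a preferred and a dispreferred response. A sketch of one such record and a basic validity check (the field names are illustrative, not a specific dataset's schema):

```python
# Hypothetical shape of one preference example; exact field names
# vary by dataset.
example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules; shorter (blue) "
              "wavelengths scatter more strongly.",
    "rejected": "The sky is blue because it reflects the ocean.",
}

def is_valid_pair(rec):
    # A usable DPO record needs a prompt and two distinct responses.
    return (all(rec.get(k) for k in ("prompt", "chosen", "rejected"))
            and rec["chosen"] != rec["rejected"])
```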

Key Hyperparameters

  • dpo_beta — Start with 0.1. Higher values are more conservative.
  • learning_rate — Lower than SFT, typically 1e-5 to 1e-6.
  • Base model — Should be in-distribution with the preference data. Start with a light SFT phase or collect on-policy preferences.
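The guidance above might be captured in a config like the following. This is a hedged sketch: the key names are illustrative and not tied to a specific framework.

```python
# Hypothetical DPO training config; key names are illustrative.
dpo_config = {
    "dpo_beta": 0.1,        # start here; raise it to stay closer to the reference
    "learning_rate": 1e-5,  # lower than a typical SFT learning rate (1e-5 to 1e-6)
    "num_epochs": 1,        # preference data is easy to overfit
}
```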

Training Metrics

  • dpo_loss — classification loss (should decrease)
  • accuracy — how often the model prefers the chosen response
  • margin — average reward difference between chosen and rejected
  • chosen_reward / rejected_reward — average implicit rewards
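Given per-example implicit rewards (β times the policy-vs-reference log-ratio for each response), the metrics above are simple batch averages. A minimal sketch, with a hypothetical helper name:

```python
def dpo_metrics(chosen_rewards, rejected_rewards):
    """Aggregate DPO training metrics over a batch of implicit rewards."""
    n = len(chosen_rewards)
    # Per-example reward gap between chosen and rejected responses
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        # Fraction of pairs where the model prefers the chosen response
        "accuracy": sum(m > 0 for m in margins) / n,
        "margin": sum(margins) / n,
        "chosen_reward": sum(chosen_rewards) / n,
        "rejected_reward": sum(rejected_rewards) / n,
    }
```

During healthy training, accuracy and margin should rise while dpo_loss falls.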

Learn More