# Direct Preference Optimization (DPO)
DPO trains a model to prefer chosen responses over rejected ones using a classification loss — no separate reward model needed.
Preference Data (chosen vs rejected) → Reference Policy π_ref (frozen) → DPO Loss (β-weighted) → Aligned Model π_θ
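A preference example is just a prompt paired with a chosen and a rejected completion. A minimal record might look like the following; the field names are illustrative, not the actual `tinker_cookbook.preference` schema:

```python
# One pairwise preference record (hypothetical field names).
record = {
    "prompt": "Explain DPO in one sentence.",
    "chosen": "DPO trains the policy to rank preferred responses above rejected "
              "ones with a classification loss, no reward model required.",
    "rejected": "DPO is a kind of optimizer.",
}
```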
## The DPO Loss
\[
\mathcal{L}_{\theta} = -\mathbb{E}_{x, y_\text{chosen}, y_\text{rejected} \sim \mathcal{D}}\left[\log\sigma\left(\beta\log \frac{\pi_{\theta}(y_\text{chosen}|x)}{\pi_{\text{ref}}(y_\text{chosen}|x)} - \beta\log \frac{\pi_{\theta}(y_\text{rejected}|x)}{\pi_{\text{ref}}(y_\text{rejected}|x)}\right)\right]
\]
- \(\pi_{\theta}\) — current policy being trained
- \(\pi_{\text{ref}}\) — reference model (frozen, typically the pre-DPO checkpoint)
- \(\beta\) — inverse temperature controlling how strongly the policy is penalized for deviating from the reference (higher = more conservative)
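The loss above can be sketched in plain Python for a single preference pair. The function and argument names are illustrative, not from any particular library; inputs are summed log-probabilities of each full response:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log(pi_theta / pi_ref) for y_chosen
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log(pi_theta / pi_ref) for y_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # log1p(exp(-z)) is a numerically stable form of -log(sigmoid(z))
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2 ≈ 0.693; raising the chosen response's log-probability relative to the reference drives the loss toward zero.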
## DPO vs RLHF

DPO eliminates the need for a separate reward model by directly optimizing the policy. It is simpler and cheaper than classical RLHF, but it requires the base model to already be in-distribution with the preference data.
## Available Datasets

| Dataset | Source | Description |
|---|---|---|
| `hhh` | Anthropic | Helpful-Harmless-Honest pairwise preferences |
| `helpsteer3` | NVIDIA | HelpSteer3 preference dataset |
| `ultrafeedback` | UltraFeedback | Binarized preference comparisons |
## Key Hyperparameters

- `dpo_beta` — start with `0.1`. Higher values are more conservative.
- `learning_rate` — lower than for SFT, typically `1e-5` to `1e-6`.
- Base model — should be in-distribution with the preference data. Start with a light SFT phase or collect on-policy preferences.
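As a starting point, these defaults can be collected into a plain config dict. The key names mirror the hyperparameters above but are not tied to any specific trainer API:

```python
# Illustrative starting values; tune per the guidance above.
dpo_config = {
    "dpo_beta": 0.1,        # higher values keep the policy closer to the reference
    "learning_rate": 1e-6,  # well below a typical SFT learning rate
}
```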
## Training Metrics

- `dpo_loss` — classification loss (should decrease)
- `accuracy` — how often the model prefers the chosen response
- `margin` — average reward difference between chosen and rejected
- `chosen_reward` / `rejected_reward` — average implicit rewards
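All of these metrics derive from the implicit reward \(r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}\). A minimal sketch of computing them over a batch of summed log-probabilities (the function name and batch layout are assumptions for illustration):

```python
def batch_metrics(batch, beta=0.1):
    """batch: list of (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probs."""
    # Implicit reward for each response: beta * log(pi_theta / pi_ref)
    chosen = [beta * (pc - rc) for pc, _, rc, _ in batch]
    rejected = [beta * (pr - rr) for _, pr, _, rr in batch]
    n = len(batch)
    return {
        "accuracy": sum(c > r for c, r in zip(chosen, rejected)) / n,
        "margin": sum(c - r for c, r in zip(chosen, rejected)) / n,
        "chosen_reward": sum(chosen) / n,
        "rejected_reward": sum(rejected) / n,
    }
```

A healthy run shows `accuracy` and `margin` rising while `dpo_loss` falls; both implicit rewards often drift negative, which is fine as long as the gap between them grows.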
## Learn More
- DPO & Preferences Tutorial — hands-on interactive walkthrough
- RLHF Pipeline Tutorial — full 3-stage RLHF with reward model
- DPO Recipe — production training script
- tinker_cookbook.preference API — Comparison, PreferenceModel, Config