Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model. Instead of using reinforcement learning with human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.
DPO Algorithm Details
The core DPO loss is computed as:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Where:
- $\pi_\theta$ is the current policy
- $\pi_{\mathrm{ref}}$ is the reference model (typically the initial model before DPO training)
- $\beta$ is the DPO beta parameter
- $\mathcal{D}$ is a dataset of prompts $x$, chosen responses $y_w$, and rejected responses $y_l$
This optimizes the classical constrained RLHF objective, where the reference model constrains deviation from the initial distribution.
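For intuition, here is a minimal PyTorch sketch of the loss above, assuming you already have the summed per-sequence log-probabilities of each response under the policy and the frozen reference model (illustrative only, not the code in train_dpo.py):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    # Binary classification loss: push the chosen reward above the rejected one.
    loss = -F.logsigmoid(margin).mean()
    # Accuracy of the implicit reward model on this batch.
    accuracy = (margin > 0).float().mean()
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), accuracy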
DPO vs RLHF: DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.
Running DPO Training
The implementation is in train_dpo.py with a CLI interface in train.py. You can run it from the command line:
python -m tinker_cookbook.recipes.preference.train \
log_path=/tmp/dpo-hhh-experiment \
model_name=meta-llama/Llama-3.2-1B \
dataset=hhh \
renderer_name=role_colon \
learning_rate=1e-5 \
dpo_beta=0.1

Key Parameters
- log_path: Directory where results and checkpoints are saved
- model_name: Base model used as initialization and for the reference policy
- dataset: Dataset name (hhh, helpsteer3, ultrafeedback)
- renderer_name: How conversations are formatted (see Rendering)
- learning_rate: Learning rate for optimization
- dpo_beta: DPO beta parameter (controls the strength of preference learning)
Available Datasets
There are several pre-defined datasets:
- hhh: Anthropic's Helpful-Harmless-Honest dataset
- helpsteer3: NVIDIA's HelpSteer3 preference dataset
- ultrafeedback: UltraFeedback binarized preferences dataset
These are implemented as DPODatasetBuilder classes; you can implement a custom dataset builder by following the tinker_cookbook.preference.preference_datasets interface.
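The exact builder interface is defined in that module; at its core, a builder just has to produce (prompt, chosen, rejected) triples. As a rough illustration only, assuming the Hugging Face datasets library and the Anthropic/hh-rlhf schema (each row stores two full transcripts that share a prompt prefix), the conversion step might look like:

import os
from datasets import load_dataset

def load_hh_pairs(split="train"):
    # Each hh-rlhf row holds a "chosen" and a "rejected" transcript that share
    # the same prompt prefix and differ in the final assistant reply.
    rows = load_dataset("Anthropic/hh-rlhf", split=split)
    pairs = []
    for row in rows:
        prefix = os.path.commonprefix([row["chosen"], row["rejected"]])
        # In practice you would snap the cut to the last "Assistant:" marker;
        # this sketch simply splits at the shared character prefix.
        pairs.append((prefix, row["chosen"][len(prefix):], row["rejected"][len(prefix):]))
    return pairs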
Training Process
During training, you'll see output like this showing the DPO metrics:
Step 50
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy │ 0.568627 │
│ batch_time │ 27.953704 │
│ chosen_reward │ 0.053621 │
│ dpo_loss │ 0.683825 │
│ learning_rate │ 0.000009 │
│ margin │ 0.002147 │
│ num_pairs │ 255 │
│ num_tokens │ 112638 │
│ progress │ 0.081210 │
│ rejected_reward │ 0.032152 │
│ test/nll │ 1.871778 │
└────────────────────────────────┴───────────┘

The key metrics are:
- dpo_loss: The DPO classification loss
- accuracy: Accuracy of the implicit reward model evaluated on the preference dataset
- margin: Average difference between chosen and rejected rewards
- chosen_reward / rejected_reward: Average rewards for chosen/rejected responses
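Here, chosen_reward and rejected_reward correspond to the implicit rewards $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ from the loss above. The per-sequence log-probabilities that feed into them come from summing token log-probs over response positions; a minimal sketch, assuming you already have shifted logits and target token ids (not the cookbook's actual code):

import torch

def sequence_logprob(logits, target_ids, response_mask):
    # logits: (batch, seq, vocab), shifted so logits[:, t] predicts target_ids[:, t]
    # target_ids: (batch, seq) token ids
    # response_mask: (batch, seq), 1.0 on response tokens, 0.0 on prompt/padding
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, dim=-1, index=target_ids.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask).sum(dim=-1)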
Evaluating DPO Models
After training, you can evaluate your DPO model using the inspect evaluation framework:
MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
python -m tinker_cookbook.eval.run_inspect_evals \
model_path=$MODEL_PATH \
model_name=meta-llama/Llama-3.2-1B \
tasks=inspect_evals/ifeval \
renderer_name=role_colon

This will evaluate the model on various benchmarks to measure the impact of preference optimization.
Tips for DPO Training
- Beta Parameter: Start with dpo_beta=0.1 and adjust based on your dataset (a sweep sketch follows this list).
- Learning Rate: Use a lower learning rate than supervised fine-tuning (typically 1e-5 to 1e-6).
- Base Model: The base model should already be in-distribution with the preference data. Either start with a light SFT phase or collect on-policy preferences. Training will still work under a sharp distribution mismatch, but it can produce strange model behaviors.
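If you want to compare a few settings, a small driver script can launch one run per configuration using the same CLI shown above; the value grids and log paths here are just an example:

import itertools
import subprocess

betas = [0.05, 0.1, 0.3]
learning_rates = [1e-5, 5e-6]

for beta, lr in itertools.product(betas, learning_rates):
    # One training run per (dpo_beta, learning_rate) combination.
    subprocess.run(
        [
            "python", "-m", "tinker_cookbook.recipes.preference.train",
            f"log_path=/tmp/dpo-hhh-beta{beta}-lr{lr}",
            "model_name=meta-llama/Llama-3.2-1B",
            "dataset=hhh",
            "renderer_name=role_colon",
            f"learning_rate={lr}",
            f"dpo_beta={beta}",
        ],
        check=True,
    )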