Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model. Instead of using reinforcement learning with human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.
DPO Algorithm Details
The core DPO loss is computed as:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Where:
- $\pi_\theta$ is the current policy
- $\pi_{\mathrm{ref}}$ is the reference model (typically the initial model before DPO training)
- $\beta$ is the DPO beta parameter
- $\mathcal{D}$ is a dataset of prompts $x$, chosen responses $y_w$, and rejected responses $y_l$
- $\sigma$ is the logistic (sigmoid) function
This loss optimizes the classical KL-constrained RLHF objective, where the reference model constrains deviation from the initial distribution.
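To make the loss concrete, here is a minimal PyTorch sketch of the pairwise objective above, computed from summed per-sequence log-probabilities. It is an illustration rather than the cookbook's train_dpo.py implementation; the function name and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Pairwise DPO loss from summed per-sequence log-probs (1-D tensors of equal length)."""
    # Implicit rewards: beta-scaled log-ratio of policy to reference probabilities
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification loss on the reward margin: -log sigmoid(r_chosen - r_rejected)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards, rejected_rewards
```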
DPO vs RLHF: DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.
Running DPO Training
The implementation is in train_dpo.py with a CLI interface in train.py. You can run it from the command line:
python -m tinker_cookbook.recipes.preference.train \
    log_path=/tmp/dpo-hhh-experiment \
    model_name=meta-llama/Llama-3.2-1B \
    dataset=hhh \
    renderer_name=role_colon \
    learning_rate=1e-5 \
    dpo_beta=0.1
Key Parameters
- log_path: Directory where results and checkpoints are saved
- model_name: Base model used as initialization and for the reference policy
- dataset: Dataset name (hhh, helpsteer3, ultrafeedback)
- renderer_name: How conversations are formatted (see Rendering)
- learning_rate: Learning rate for optimization
- dpo_beta: DPO beta parameter (controls the strength of preference learning)
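As a sketch of how these parameters compose, the snippet below launches a small sweep over dpo_beta from Python using subprocess; the flag values mirror the command shown above, and the log paths are arbitrary examples.

```python
import subprocess

# Hypothetical sweep over dpo_beta; the flags mirror the CLI invocation shown above.
for beta in [0.05, 0.1, 0.3]:
    subprocess.run(
        [
            "python", "-m", "tinker_cookbook.recipes.preference.train",
            f"log_path=/tmp/dpo-hhh-beta-{beta}",
            "model_name=meta-llama/Llama-3.2-1B",
            "dataset=hhh",
            "renderer_name=role_colon",
            "learning_rate=1e-5",
            f"dpo_beta={beta}",
        ],
        check=True,
    )
```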
Available Datasets
There are several pre-defined datasets:
- hhh: Anthropic's Helpful-Harmless-Honest dataset
- helpsteer3: NVIDIA's HelpSteer3 preference dataset
- ultrafeedback: UltraFeedback binarized preferences dataset

These are implemented as DPODatasetBuilder classes, and you can implement a custom dataset builder following the tinker_cookbook.preference.preference_datasets interface.
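If you want a feel for the underlying preference data before writing a custom builder, you can inspect a public preference dataset directly. The sketch below assumes the Hugging Face datasets library and the Anthropic/hh-rlhf schema (chosen/rejected transcript fields), which is the kind of data the hhh builder wraps; it does not show the cookbook's builder interface.

```python
from datasets import load_dataset

# Peek at raw preference pairs: each record pairs a preferred ("chosen") and a
# dispreferred ("rejected") conversation transcript.
ds = load_dataset("Anthropic/hh-rlhf", split="train")
example = ds[0]
print(example["chosen"][:300])    # preferred transcript
print(example["rejected"][:300])  # dispreferred transcript
```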
Training Process
During training, you'll see output like this showing the DPO metrics:
Step 50
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy │ 0.568627 │
│ batch_time │ 27.953704 │
│ chosen_reward │ 0.053621 │
│ dpo_loss │ 0.683825 │
│ learning_rate │ 0.000009 │
│ margin │ 0.002147 │
│ num_pairs │ 255 │
│ num_tokens │ 112638 │
│ progress │ 0.081210 │
│ rejected_reward │ 0.032152 │
│ test/nll │ 1.871778 │
└────────────────────────────────┴───────────┘
The key metrics are:
- dpo_loss: The DPO classification loss
- accuracy: Accuracy of the implicit reward model evaluated on the preference dataset
- margin: Average difference between chosen and rejected rewards
- chosen_reward / rejected_reward: Average rewards for chosen and rejected responses
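As a quick illustration of how these metrics relate to the implicit rewards (the beta-scaled log-ratios from the loss above), the snippet below computes them from hypothetical per-pair reward values; the numbers are made up.

```python
import torch

# Hypothetical per-pair implicit rewards (beta-scaled policy/reference log-ratios)
chosen_rewards = torch.tensor([0.12, -0.03, 0.08, 0.01])
rejected_rewards = torch.tensor([0.05, 0.02, -0.04, 0.00])

accuracy = (chosen_rewards > rejected_rewards).float().mean().item()  # fraction of pairs ranked correctly
margin = (chosen_rewards - rejected_rewards).mean().item()            # average chosen-minus-rejected gap
print(f"accuracy={accuracy:.3f}, margin={margin:.4f}, "
      f"chosen_reward={chosen_rewards.mean().item():.4f}, "
      f"rejected_reward={rejected_rewards.mean().item():.4f}")
```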
Evaluating DPO Models
After training, you can evaluate your DPO model using the inspect evaluation framework:
MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
python -m tinker_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=meta-llama/Llama-3.2-1B \
    tasks=inspect_evals/ifeval \
    renderer_name=role_colon
This will evaluate the model on various benchmarks to measure the impact of preference optimization.
Tips for DPO Training
- Beta Parameter: Start with dpo_beta=0.1 and adjust based on your dataset.
- Learning Rate: Use a lower learning rate than for supervised fine-tuning (typically 1e-5 to 1e-6).
- Base Model: The base model should already be in-distribution with the preference data, so either start with a light SFT phase or collect on-policy preferences. Training will still work under a sharp distribution mismatch, but it can produce strange model behaviors.