Tutorial 403: DPO and Preference Learning
Build preference data, render comparisons, and work through the DPO loss. The full pipeline, including training with train_dpo.main and evaluating with a PreferenceModel, is outlined in the Summary.
Direct Preference Optimization (DPO) trains a model to prefer "chosen" over "rejected" responses without an explicit reward model. The key idea: the optimal policy under a KL-constrained reward maximization objective has a closed-form relationship to a preference model.
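The closed-form relationship mentioned above can be written out as the standard derivation below. The symbols r (reward), Z(x) (partition function), and y_w / y_l (chosen / rejected responses) are notation introduced here for illustration; they are not names from the cookbook.

```latex
\begin{align}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right) \\
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
p(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
&= \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
\end{align}
```

The first line is the optimum of the KL-constrained objective, the second inverts it for the reward, and the third plugs that reward into a Bradley-Terry preference model. The beta log Z(x) term depends only on the prompt, so it cancels in the difference, which is why no explicit reward model is needed.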
from tinker_cookbook.preference.types import (
    Comparison,
    ComparisonRendererFromChatRenderer,
    LabeledComparison,
)
Step 1 -- Create preference data
A Comparison pairs two completions for the same prompt. A LabeledComparison adds a human preference label (A, B, or Tie).
# Create a labeled comparison: the human prefers completion A
comparison = Comparison(
    prompt_conversation=[
        {"role": "user", "content": "Explain gravity in one sentence."},
    ],
    completion_A=[
        {
            "role": "assistant",
            "content": "Gravity is the force that attracts objects with mass toward each other.",
        },
    ],
    completion_B=[
        {"role": "assistant", "content": "Gravity is like magnets but for everything."},
    ],
)
labeled = LabeledComparison(comparison=comparison, label="A")
print(f"Prompt: {comparison.prompt_conversation[0]['content']}")
print(f"Completion A: {comparison.completion_A[0]['content']}")
print(f"Completion B: {comparison.completion_B[0]['content']}")
print(f"Preferred: {labeled.label}")
# Swapping reverses the label
swapped = labeled.swap()
print("\nAfter swap:")
print(f"Completion A: {swapped.comparison.completion_A[0]['content']}")
print(f"Completion B: {swapped.comparison.completion_B[0]['content']}")
print(f"Preferred: {swapped.label}")
Output
Prompt: Explain gravity in one sentence.
Completion A: Gravity is the force that attracts objects with mass toward each other.
Completion B: Gravity is like magnets but for everything.
Preferred: A
After swap:
Completion A: Gravity is like magnets but for everything.
Completion B: Gravity is the force that attracts objects with mass toward each other.
Preferred: B
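To make the swap behavior concrete, here is a minimal stand-in written with plain dataclasses. The class names SketchComparison and SketchLabeledComparison are hypothetical and are not the cookbook's implementation; in particular, keeping a "Tie" label unchanged under a swap is an assumption.

```python
from dataclasses import dataclass
from typing import Literal

Message = dict[str, str]


@dataclass(frozen=True)
class SketchComparison:
    """Hypothetical stand-in for Comparison: one prompt, two completions."""

    prompt_conversation: list[Message]
    completion_A: list[Message]
    completion_B: list[Message]


@dataclass(frozen=True)
class SketchLabeledComparison:
    """Hypothetical stand-in for LabeledComparison."""

    comparison: SketchComparison
    label: Literal["A", "B", "Tie"]

    def swap(self) -> "SketchLabeledComparison":
        # Exchange the two completions and flip the label so it still
        # points at the same text. Assumption: a tie stays a tie.
        flipped = {"A": "B", "B": "A", "Tie": "Tie"}[self.label]
        return SketchLabeledComparison(
            comparison=SketchComparison(
                prompt_conversation=self.comparison.prompt_conversation,
                completion_A=self.comparison.completion_B,
                completion_B=self.comparison.completion_A,
            ),
            label=flipped,
        )


demo = SketchLabeledComparison(
    comparison=SketchComparison(
        prompt_conversation=[{"role": "user", "content": "Explain gravity in one sentence."}],
        completion_A=[{"role": "assistant", "content": "good answer"}],
        completion_B=[{"role": "assistant", "content": "weak answer"}],
    ),
    label="A",
)
demo_swapped = demo.swap()
```

Swapping twice returns the original pair. A common reason to swap is to present each pair in both orders so that a judge's position bias averages out.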
Step 2 -- ComparisonRenderer
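The ComparisonRendererFromChatRenderer converts a Comparison into tokenized model input by formatting both completions with section markers: system messages reading ==== Completion A ==== and ==== Completion B ====, followed by a final ==== Preference ==== marker in the generation prompt.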
For DPO training, each labeled comparison produces two datums (chosen + rejected) with per-token loss weights on the completion tokens.
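As a rough illustration of per-token loss weights, the sketch below masks out prompt tokens and sums log-probabilities over the completion only. Both helpers are hypothetical, and the toy numbers are made up; how the cookbook actually lays out weights (for example, any one-token shift for next-token prediction) is not shown in this tutorial.

```python
def completion_loss_weights(num_prompt_tokens: int, num_completion_tokens: int) -> list[float]:
    """Hypothetical helper: weight 0.0 on prompt tokens, 1.0 on completion tokens."""
    return [0.0] * num_prompt_tokens + [1.0] * num_completion_tokens


def weighted_sequence_logprob(token_logprobs: list[float], weights: list[float]) -> float:
    """Sum per-token log-probabilities, counting only tokens with nonzero weight."""
    if len(token_logprobs) != len(weights):
        raise ValueError("token_logprobs and weights must have the same length")
    return sum(lp * w for lp, w in zip(token_logprobs, weights))


# Toy example: 3 prompt tokens followed by 2 completion tokens.
weights = completion_loss_weights(num_prompt_tokens=3, num_completion_tokens=2)
chosen_token_logprobs = [-0.5, -0.25, -1.0, -0.5, -0.25]
chosen_logprob = weighted_sequence_logprob(chosen_token_logprobs, weights)
```

Only the last two entries contribute to chosen_logprob, so the prompt, which is shared by the chosen and rejected datums, does not influence the comparison.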
from tinker_cookbook import renderers
from tinker_cookbook.tokenizer_utils import get_tokenizer
MODEL_NAME = "Qwen/Qwen3.5-4B"
tokenizer = get_tokenizer(MODEL_NAME)
renderer = renderers.get_renderer("qwen3_5_disable_thinking", tokenizer)
comparison_renderer = ComparisonRendererFromChatRenderer(renderer)
# Build a generation prompt for preference prediction
model_input = comparison_renderer.build_generation_prompt(comparison)
print(f"Prompt tokens: {model_input.length}")
print(f"Decoded:\n{tokenizer.decode(list(model_input.to_ints()))}")
Output
Prompt tokens: 84
Decoded:
<|im_start|>user
Explain gravity in one sentence.<|im_end|>
<|im_start|>system
==== Completion A ====<|im_end|>
<|im_start|>assistant
<think>
</think>
Gravity is the force that attracts objects with mass toward each other.<|im_end|>
<|im_start|>system
==== Completion B ====<|im_end|>
<|im_start|>assistant
<think>
</think>
Gravity is like magnets but for everything.<|im_end|>
<|im_start|>system
==== Preference ====<|im_end|>
<|im_start|>assistant
<think>
</think>
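Note: the empty <think></think> blocks in the decoded prompt come from the disable-thinking renderer used here (qwen3_5_disable_thinking).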
Step 3 -- Configure DPO training
preference.train_dpo.Config is similar to the SFT config but adds:
- dpo_beta -- KL penalty coefficient (higher = more conservative updates)
- reference_model_name -- optional explicit reference model (default: initial weights)
The dataset builder must produce interleaved chosen/rejected datum pairs.
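The interleaving requirement can be pictured with plain lists. This is a sketch, not the cookbook's dataset builder: the helper names are hypothetical, and placing the chosen datum at the even index of each pair is an assumption.

```python
def interleave_pairs(chosen: list, rejected: list) -> list:
    """Hypothetical helper: [c0, r0, c1, r1, ...] with chosen first in each pair."""
    if len(chosen) != len(rejected):
        raise ValueError("need exactly one rejected datum per chosen datum")
    batch = []
    for c, r in zip(chosen, rejected):
        batch.extend([c, r])
    return batch


def split_pairs(batch: list) -> tuple[list, list]:
    """Recover the chosen and rejected halves from an interleaved batch."""
    if len(batch) % 2 != 0:
        raise ValueError("an interleaved batch must have even length")
    return batch[0::2], batch[1::2]


# Strings stand in for real datums.
batch = interleave_pairs(["chosen_0", "chosen_1"], ["rejected_0", "rejected_1"])
chosen_part, rejected_part = split_pairs(batch)
```

Keeping each pair adjacent means a single forward pass over the batch yields log-probabilities that can be split back into chosen and rejected halves by stride.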
from tinker_cookbook.preference.train_dpo import compute_dpo_loss
# Example config (not running training here)
print("DPO Config fields:")
print(f" model_name: {MODEL_NAME}")
print(" dpo_beta: 0.1 (default)")
print(" learning_rate: 1e-5 (default, lower than SFT)")
print(" lr_schedule: linear")
print(" lora_rank: 32")
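Output
DPO Config fields:
 model_name: Qwen/Qwen3.5-4B
 dpo_beta: 0.1 (default)
 learning_rate: 1e-5 (default, lower than SFT)
 lr_schedule: linear
 lora_rank: 32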
Step 4 -- Understanding the DPO loss
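The DPO loss is:
loss = -log sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
where log_ratio = log p_policy(y|x) - log p_ref(y|x), computed separately for the chosen and the rejected completion.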
Intuitively:
- The model should assign higher probability to chosen over rejected
- The beta parameter controls how much the model can deviate from the reference
- Higher beta = more conservative (stays closer to reference)
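The loss can also be written out in a few lines of plain Python. This is a minimal sketch of the standard DPO formula using sequence-level log-probabilities; dpo_loss_sketch is a hypothetical name, and its definitions of accuracy and margin are assumptions that may differ from what the cookbook's compute_dpo_loss reports.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss_sketch(chosen, rejected, chosen_ref, rejected_ref, beta):
    """Mean DPO loss over pairs of sequence-level log-probabilities.

    Returns (loss, accuracy, margins). Assumption: accuracy is the fraction
    of pairs with a positive beta-scaled margin.
    """
    margins = []
    for pc, pr, rc, rr in zip(chosen, rejected, chosen_ref, rejected_ref):
        chosen_log_ratio = pc - rc      # log p_policy - log p_ref for the chosen response
        rejected_log_ratio = pr - rr    # same quantity for the rejected response
        margins.append(beta * (chosen_log_ratio - rejected_log_ratio))
    loss = -sum(log_sigmoid(m) for m in margins) / len(margins)
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return loss, accuracy, margins


# Two toy preference pairs with made-up log-probabilities.
toy = ([-2.0, -1.5], [-3.0, -4.0], [-2.5, -2.0], [-2.5, -3.0])
loss_small_beta, acc, _ = dpo_loss_sketch(*toy, beta=0.1)
loss_large_beta, _, _ = dpo_loss_sketch(*toy, beta=0.5)
```

A pair with zero margin contributes log 2 to the loss. In this toy data the policy already favors both chosen responses relative to the reference, so both losses fall below log 2, and the larger beta scales the margins up and gives the smaller loss.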
import torch
# Simulate DPO loss computation
# Positive log-ratio means policy prefers this over reference
chosen_logprobs = [torch.tensor(-2.0), torch.tensor(-1.5)]
rejected_logprobs = [torch.tensor(-3.0), torch.tensor(-4.0)]
chosen_ref_logprobs = [torch.tensor(-2.5), torch.tensor(-2.0)]
rejected_ref_logprobs = [torch.tensor(-2.5), torch.tensor(-3.0)]
for beta in [0.05, 0.1, 0.5]:
    loss, metrics = compute_dpo_loss(
        chosen_logprobs,
        rejected_logprobs,
        chosen_ref_logprobs,
        rejected_ref_logprobs,
        dpo_beta=beta,
    )
    print(
        f"beta={beta:.2f}: loss={metrics['dpo_loss']:.4f}, "
        f"accuracy={metrics['accuracy']:.2f}, margin={metrics['margin']:.4f}"
    )
Summary
DPO workflow:
- Collect preference data as LabeledComparison objects (from humans or an AI judge)
- Render them into chosen/rejected datum pairs using DPODatasetBuilderFromComparisons
- Configure training with train_dpo.Config (set dpo_beta, learning_rate)
- Run train_dpo.main(config) -- handles reference logprob computation, custom loss, and checkpointing
- Evaluate with a PreferenceModel to measure win rate against a baseline
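For the final evaluation step, a win rate can be computed from judge labels roughly as follows. The win_rate function is hypothetical and does not use the cookbook's PreferenceModel; it assumes the tuned model's response sits in slot A, the baseline in slot B, and that a tie counts as half a win.

```python
from collections import Counter


def win_rate(labels: list[str], candidate: str = "A", tie_credit: float = 0.5) -> float:
    """Fraction of comparisons won by `candidate`, with partial credit for ties."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    unknown = set(counts) - {"A", "B", "Tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts[candidate] + tie_credit * counts["Tie"]) / len(labels)


# Made-up judge labels for four prompts.
judged = ["A", "A", "B", "Tie"]
rate = win_rate(judged)
```

A rate above 0.5 suggests the tuned model is preferred over the baseline more often than not. In practice it helps to randomize which slot each model occupies (for example with swap()) and map labels back before counting.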
Key hyperparameters:
- dpo_beta: 0.05-0.5 (start with 0.1)
- learning_rate: 1e-6 to 5e-5 (lower than SFT)
- num_epochs: 1-3 (DPO is prone to overfitting)