Tutorial 403: DPO and Preference Learning

Run it interactively

curl -O https://raw.githubusercontent.com/thinking-machines-lab/tinker-cookbook/main/tutorials/403_dpo_preferences.py && marimo edit 403_dpo_preferences.py

Build preference data, render comparisons, and work through the DPO loss. The full pipeline, including training with train_dpo.main and evaluating with a PreferenceModel, is outlined in the Summary.

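Direct Preference Optimization (DPO) trains a model to prefer "chosen" over "rejected" responses without an explicit reward model. The key idea: under a KL-constrained reward-maximization objective, the optimal policy determines the reward in closed form (up to a term that depends only on the prompt), so the preference likelihood can be written directly in terms of policy and reference log-probabilities and optimized as a classification-style loss.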
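The closed-form relationship can be sketched as follows. The notation is the one commonly used in the DPO literature and is an assumption here, since the tutorial does not fix symbols: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} the reference model, r the implicit reward, Z(x) a normalizer, y_w the chosen response and y_l the rejected one.

```latex
\begin{align}
\pi^{*} &= \arg\max_{\pi}\;
  \mathbb{E}_{x}\Big[\,
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
    - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
  \Big] \\
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big(r(x, y)/\beta\big) \\
r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x) \\
p(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
\mathcal{L}_{\mathrm{DPO}} &= -\,\mathbb{E}\left[\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
\end{align}
```

The \beta \log Z(x) term depends only on the prompt, so it cancels in the difference r(x, y_w) - r(x, y_l). That cancellation is why neither an explicit reward model nor the normalizer is needed during training.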

from tinker_cookbook.preference.types import (
    Comparison,
    ComparisonRendererFromChatRenderer,
    LabeledComparison,
)

Step 1 -- Create preference data

A Comparison pairs two completions for the same prompt. A LabeledComparison adds a human preference label (A, B, or Tie).

# Create a labeled comparison: the human prefers completion A
comparison = Comparison(
    prompt_conversation=[
        {"role": "user", "content": "Explain gravity in one sentence."},
    ],
    completion_A=[
        {
            "role": "assistant",
            "content": "Gravity is the force that attracts objects with mass toward each other.",
        },
    ],
    completion_B=[
        {"role": "assistant", "content": "Gravity is like magnets but for everything."},
    ],
)

labeled = LabeledComparison(comparison=comparison, label="A")
print(f"Prompt:       {comparison.prompt_conversation[0]['content']}")
print(f"Completion A: {comparison.completion_A[0]['content']}")
print(f"Completion B: {comparison.completion_B[0]['content']}")
print(f"Preferred:    {labeled.label}")

# swap() exchanges completions A and B and flips the label to match
swapped = labeled.swap()
print("\nAfter swap:")
print(f"Completion A: {swapped.comparison.completion_A[0]['content']}")
print(f"Completion B: {swapped.comparison.completion_B[0]['content']}")
print(f"Preferred:    {swapped.label}")

Output

Prompt:       Explain gravity in one sentence.
Completion A: Gravity is the force that attracts objects with mass toward each other.
Completion B: Gravity is like magnets but for everything.
Preferred:    A

After swap:
Completion A: Gravity is like magnets but for everything.
Completion B: Gravity is the force that attracts objects with mass toward each other.
Preferred:    B
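The output shows that swapping changes which slot holds each completion but not which completion is preferred. A minimal sketch of that invariant, using a hypothetical stand-in class (`ToyLabeledComparison` and `chosen_rejected` are illustrative names, not part of tinker_cookbook). Dropping ties is an assumption made for the sketch; the tutorial does not say how the library treats them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyLabeledComparison:
    """Hypothetical stand-in for a labeled comparison (not the library class)."""

    prompt: str
    completion_a: str
    completion_b: str
    label: str  # "A", "B", or "Tie"

    def swap(self) -> "ToyLabeledComparison":
        # Exchange the two completions and flip the label so it still
        # points at the same text.
        flipped = {"A": "B", "B": "A", "Tie": "Tie"}[self.label]
        return ToyLabeledComparison(
            self.prompt, self.completion_b, self.completion_a, flipped
        )


def chosen_rejected(item: ToyLabeledComparison) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None for a tie (assumption: ties are dropped)."""
    if item.label == "A":
        return item.completion_a, item.completion_b
    if item.label == "B":
        return item.completion_b, item.completion_a
    return None


toy = ToyLabeledComparison(
    prompt="Explain gravity in one sentence.",
    completion_a="Gravity is the force that attracts objects with mass toward each other.",
    completion_b="Gravity is like magnets but for everything.",
    label="A",
)
pair = chosen_rejected(toy)
swapped_pair = chosen_rejected(toy.swap())
print(pair == swapped_pair)  # True: the preferred text is unchanged by swap
```

Because the (chosen, rejected) pair is the same before and after the swap, swapped copies add nothing for DPO itself. They matter mainly when a model is trained or prompted to predict the A/B label, where slot position could otherwise leak into the prediction.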

Step 2 -- ComparisonRenderer

The ComparisonRendererFromChatRenderer converts a Comparison into tokenized model input by formatting both completions with section markers:

[prompt] ==== Completion A ==== [text A] ==== Completion B ==== [text B] ==== Preference ====

For DPO training, each labeled comparison produces two datums (chosen + rejected) with per-token loss weights on the completion tokens.
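The per-token loss weights can be pictured as a mask that is zero over the prompt and one over the completion, so only completion tokens contribute to a sequence's log-probability. A minimal sketch, assuming plain Python lists instead of the cookbook's datum type; `build_weighted_example` and `weighted_logprob` are hypothetical helpers, the token IDs and log-probabilities are made up, and the one-position shift between inputs and next-token targets is ignored for brevity.

```python
def build_weighted_example(prompt_tokens, completion_tokens):
    """Concatenate prompt and completion; weight only the completion tokens."""
    tokens = list(prompt_tokens) + list(completion_tokens)
    weights = [0.0] * len(prompt_tokens) + [1.0] * len(completion_tokens)
    return tokens, weights


def weighted_logprob(token_logprobs, weights):
    """Sum per-token log-probs, masked by the per-token weights."""
    if len(token_logprobs) != len(weights):
        raise ValueError("token_logprobs and weights must have the same length")
    return sum(lp * w for lp, w in zip(token_logprobs, weights))


prompt = [11, 12, 13]          # made-up token IDs for the shared prompt
chosen = [21, 22]              # made-up token IDs for the chosen completion
rejected = [31, 32, 33]        # made-up token IDs for the rejected completion

chosen_tokens, chosen_weights = build_weighted_example(prompt, chosen)
rejected_tokens, rejected_weights = build_weighted_example(prompt, rejected)

# Made-up per-token log-probs; the prompt positions are masked out.
chosen_lp = weighted_logprob([-0.5, -0.5, -0.5, -1.0, -0.5], chosen_weights)
rejected_lp = weighted_logprob([-0.5, -0.5, -0.5, -1.0, -1.0, -2.0], rejected_weights)

print(chosen_weights)    # [0.0, 0.0, 0.0, 1.0, 1.0]
print(chosen_lp)         # -1.5
print(rejected_lp)       # -4.0
```

The two masked sums are the kind of per-sequence quantities that enter the DPO loss as the chosen and rejected log-probabilities.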

from tinker_cookbook import renderers
from tinker_cookbook.tokenizer_utils import get_tokenizer

MODEL_NAME = "Qwen/Qwen3.5-4B"
tokenizer = get_tokenizer(MODEL_NAME)
renderer = renderers.get_renderer("qwen3_5_disable_thinking", tokenizer)

comparison_renderer = ComparisonRendererFromChatRenderer(renderer)

# Build a generation prompt for preference prediction
model_input = comparison_renderer.build_generation_prompt(comparison)
print(f"Prompt tokens: {model_input.length}")
print(f"Decoded:\n{tokenizer.decode(list(model_input.to_ints()))}")

Output

Prompt tokens: 84
Decoded:
<|im_start|>user
Explain gravity in one sentence.<|im_end|>
<|im_start|>system
==== Completion A ====<|im_end|>
<|im_start|>assistant
<think>

</think>

Gravity is the force that attracts objects with mass toward each other.<|im_end|>
<|im_start|>system
==== Completion B ====<|im_end|>
<|im_start|>assistant
<think>

</think>

Gravity is like magnets but for everything.<|im_end|>
<|im_start|>system
==== Preference ====<|im_end|>
<|im_start|>assistant
<think>

</think>

Note: the empty <think></think> blocks are Qwen3.5's non-thinking markers (part of the chat template, not content). The judge model reads them as "answer directly," so you can ignore them here.

Step 3 -- Configure DPO training

preference.train_dpo.Config is similar to the SFT config but adds:

dpo_beta -- KL penalty coefficient (higher = more conservative updates)
reference_model_name -- optional explicit reference model (default: initial weights)

The dataset builder must produce interleaved chosen/rejected datum pairs.
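One way to picture the interleaved layout is a flat batch in which each chosen datum is immediately followed by its rejected counterpart. A minimal sketch with strings standing in for datums; `interleave` and `split_interleaved` are hypothetical helpers, and placing the chosen datum at the even index is an assumption inferred from the "chosen + rejected" ordering in the text.

```python
def interleave(pairs):
    """Flatten [(chosen, rejected), ...] into [chosen0, rejected0, chosen1, ...]."""
    flat = []
    for chosen, rejected in pairs:
        flat.extend([chosen, rejected])
    return flat


def split_interleaved(flat):
    """Recover the chosen and rejected lists from an interleaved batch."""
    if len(flat) % 2 != 0:
        raise ValueError("interleaved batch must contain an even number of datums")
    # Assumption: even indices hold chosen datums, odd indices hold rejected ones.
    return flat[0::2], flat[1::2]


pairs = [("chosen_0", "rejected_0"), ("chosen_1", "rejected_1")]
batch = interleave(pairs)
chosen_batch, rejected_batch = split_interleaved(batch)

print(batch)           # ['chosen_0', 'rejected_0', 'chosen_1', 'rejected_1']
print(chosen_batch)    # ['chosen_0', 'chosen_1']
print(rejected_batch)  # ['rejected_0', 'rejected_1']
```

Keeping each pair adjacent means a batch boundary never separates a chosen datum from its rejected partner, provided the batch size is even.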

from tinker_cookbook.preference.train_dpo import compute_dpo_loss

# Example config (not running training here)
print("DPO Config fields:")
print(f"  model_name:      {MODEL_NAME}")
print("  dpo_beta:        0.1  (default)")
print("  learning_rate:   1e-5 (default, lower than SFT)")
print("  lr_schedule:     linear")
print("  lora_rank:       32")

Output

DPO Config fields:
  model_name:      Qwen/Qwen3.5-4B
  dpo_beta:        0.1  (default)
  learning_rate:   1e-5 (default, lower than SFT)
  lr_schedule:     linear
  lora_rank:       32

Step 4 -- Understanding the DPO loss

The DPO loss is:

L = -log sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))

where log_ratio = log p_policy(y|x) - log p_ref(y|x).

Intuitively:

The model should raise the probability of the chosen response, relative to the reference, more than that of the rejected response
The beta parameter scales the log-ratio gap and controls how far the model can deviate from the reference
Higher beta = more conservative (stays closer to reference)

import torch

# Simulate DPO loss computation
# Positive log-ratio means policy prefers this over reference
chosen_logprobs = [torch.tensor(-2.0), torch.tensor(-1.5)]
rejected_logprobs = [torch.tensor(-3.0), torch.tensor(-4.0)]
chosen_ref_logprobs = [torch.tensor(-2.5), torch.tensor(-2.0)]
rejected_ref_logprobs = [torch.tensor(-2.5), torch.tensor(-3.0)]

for beta in [0.05, 0.1, 0.5]:
    loss, metrics = compute_dpo_loss(
        chosen_logprobs,
        rejected_logprobs,
        chosen_ref_logprobs,
        rejected_ref_logprobs,
        dpo_beta=beta,
    )
    print(
        f"beta={beta:.2f}: loss={metrics['dpo_loss']:.4f}, "
        f"accuracy={metrics['accuracy']:.2f}, margin={metrics['margin']:.4f}"
    )

Output

beta=0.05: loss=0.6624, accuracy=1.00, margin=0.0625
beta=0.10: loss=0.6327, accuracy=1.00, margin=0.1250
beta=0.50: loss=0.4305, accuracy=1.00, margin=0.6250
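To see where these numbers come from, the formula can be evaluated by hand. The following is a dependency-free sketch, not the library's implementation: `dpo_loss_reference` is a hypothetical function, and it assumes that the loss is averaged over examples, that margin is the mean of beta * (log_ratio_chosen - log_ratio_rejected), and that accuracy is the fraction of examples with a positive margin. Those assumptions are inferred from the printed output rather than from the library source.

```python
import math


def dpo_loss_reference(chosen_lp, rejected_lp, chosen_ref_lp, rejected_ref_lp, beta):
    """Evaluate the DPO loss formula per example and average (illustrative only)."""
    losses, margins, correct = [], [], []
    for c, r, c_ref, r_ref in zip(chosen_lp, rejected_lp, chosen_ref_lp, rejected_ref_lp):
        log_ratio_chosen = c - c_ref        # log p_policy - log p_ref for chosen
        log_ratio_rejected = r - r_ref      # same for rejected
        margin = beta * (log_ratio_chosen - log_ratio_rejected)
        # -log sigmoid(m) equals log(1 + exp(-m))
        losses.append(math.log1p(math.exp(-margin)))
        margins.append(margin)
        correct.append(1.0 if margin > 0 else 0.0)
    n = len(losses)
    return {
        "dpo_loss": sum(losses) / n,
        "accuracy": sum(correct) / n,
        "margin": sum(margins) / n,
    }


# Same values as the torch example, as plain floats.
chosen = [-2.0, -1.5]
rejected = [-3.0, -4.0]
chosen_ref = [-2.5, -2.0]
rejected_ref = [-2.5, -3.0]

results = {}
for beta in [0.05, 0.1, 0.5]:
    results[beta] = dpo_loss_reference(chosen, rejected, chosen_ref, rejected_ref, beta)
    m = results[beta]
    print(
        f"beta={beta:.2f}: loss={m['dpo_loss']:.4f}, "
        f"accuracy={m['accuracy']:.2f}, margin={m['margin']:.4f}"
    )
```

The log-ratio gaps for the two examples are 1.0 and 1.5, so the mean margin is 1.25 * beta, which matches the 0.0625, 0.1250, and 0.6250 values above. The loss falls as beta grows here only because the policy already favors the chosen responses; with a negative gap, a larger beta would increase the loss instead.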

Summary

DPO workflow:

Collect preference data as LabeledComparison objects (from humans or an AI judge)
Render them into chosen/rejected datum pairs using DPODatasetBuilderFromComparisons
Configure training with train_dpo.Config (set dpo_beta, learning_rate)
Run train_dpo.main(config) -- handles reference logprob computation, custom loss, and checkpointing
Evaluate with a PreferenceModel to measure win rate against a baseline
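
For the evaluation step, a win rate can be computed from per-prompt judge labels once they are available. A minimal sketch, assuming the judge emits "A", "B", or "Tie" per comparison with the trained model in slot A; `win_rate` is a hypothetical helper, the labels are made up, and counting a tie as half a win is one common convention rather than something the tutorial specifies.

```python
from collections import Counter


def win_rate(labels, candidate="A"):
    """Fraction of comparisons won by `candidate`, counting ties as half a win."""
    counts = Counter(labels)
    unknown = set(counts) - {"A", "B", "Tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return (counts[candidate] + 0.5 * counts["Tie"]) / total


# Made-up judge labels: the trained model is completion A, the baseline is B.
judge_labels = ["A", "A", "B", "Tie", "A", "B", "A", "Tie"]
rate = win_rate(judge_labels)
print(f"win rate vs. baseline: {rate:.3f}")  # 0.625
```

Randomizing which slot the trained model occupies for each prompt, then mapping labels back, helps keep any position preference in the judge from inflating the result.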

Key hyperparameters:

dpo_beta: 0.05-0.5 (start with 0.1)
learning_rate: 1e-6 to 5e-5 (lower than SFT)
num_epochs: 1-3 (DPO is prone to overfitting)