Preference Learning

Train models to align with human preferences using RLHF or DPO pipelines.

What you'll build

Models optimized for human preferences through three approaches: a simple "shorter responses" demo, a full three-stage RLHF pipeline (SFT, reward model, RL), and Direct Preference Optimization (DPO) with a custom loss function.

Prerequisites

uv pip install tinker-cookbook

Key concepts

  • Pairwise preferences — learning from pairs of responses where one is preferred over the other
  • RLHF — reinforcement learning from human feedback: train a reward model on preferences, then optimize the policy against it
  • DPO — direct preference optimization: skip the reward model and optimize preferences directly via a custom loss
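Both RLHF and DPO rest on the same probabilistic model of pairwise preferences: under the Bradley-Terry model, the chance that the chosen response beats the rejected one is a sigmoid of their score difference. A minimal sketch (the function name and plain-float signature are illustrative, not part of the cookbook's API):

```python
import math

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry model: probability that the chosen response is
    preferred, as a sigmoid of the score (reward) difference."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

# Equal scores give a 50/50 preference; a higher chosen score
# pushes the probability toward 1.
print(preference_probability(0.0, 0.0))  # 0.5
```

Training on preference pairs amounts to maximizing this probability for the labeled winner, which is exactly what the reward-model and DPO losses below do.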

Run it

Shorter responses (introductory demo)

python -m tinker_cookbook.recipes.preference.shorter.train

This sub-recipe introduces the PairwisePreferenceRLDatasetBuilder abstraction, which is the cookbook's core building block for preference-based RL. It walks through a simple example that trains a model to generate shorter responses, making it an ideal starting point before tackling the more complex pipelines below.
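The heart of this demo is a comparator that labels whichever response is shorter as the preferred one. A hypothetical sketch of that comparison logic (the function name and return convention are illustrative; the actual builder interface lives in `PairwisePreferenceRLDatasetBuilder`):

```python
def prefer_shorter(response_a: str, response_b: str) -> int:
    """Hypothetical pairwise comparator: return 0 if response_a is
    preferred (shorter), 1 if response_b is. Ties go to response_a."""
    return 0 if len(response_a) <= len(response_b) else 1

# The shorter response wins the comparison.
print(prefer_shorter("brief answer", "a much longer and more verbose answer"))  # 0
```

Plugging a deterministic comparator like this into the preference pipeline gives a reward signal that is trivially checkable, which is why it makes a good first run.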

RLHF (three-stage pipeline)

See the RLHF subdirectory for the full pipeline. This sub-recipe demonstrates the standard three-stage RLHF pipeline: (1) supervised fine-tuning to initialize the policy, (2) reward model learning on pairwise preference data, and (3) reinforcement learning to optimize the policy against the learned reward model.
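Stage (2) trains the reward model by minimizing the negative log-likelihood of each preference under the Bradley-Terry model. A minimal sketch of that per-pair loss (plain floats stand in for the reward model's scalar outputs; the function name is illustrative):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the model scores the chosen response well above
    the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin between chosen and rejected scores means a smaller loss.
print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 0.0))  # True
```

Stage (3) then runs RL against the trained scorer, typically with a KL penalty toward the stage-(1) SFT policy to keep generations on-distribution.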

DPO

See the DPO subdirectory for direct preference optimization. This sub-recipe shows how to implement the DPO loss as a custom loss function and how to use the ComparisonRenderer to format chosen/rejected pairs for training. It skips the reward model entirely and optimizes on preference data directly.
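The standard DPO loss for one preference pair is -log σ(β · [(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]). A minimal sketch of that formula over plain-float log-probabilities (the cookbook's custom-loss hook will pass tensors, so treat this as the math rather than the exact interface):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair: the implicit reward of each
    response is beta times its log-probability ratio against the frozen
    reference policy, and the loss is -log sigmoid of the reward margin."""
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At initialization (policy == reference) the loss is log 2; shifting
# probability mass toward the chosen response lowers it.
print(dpo_loss(-0.5, -1.5, -1.0, -1.0) < dpo_loss(-1.0, -1.0, -1.0, -1.0))  # True
```

Because the reference log-probabilities are held fixed, the only trainable quantities are the policy's own log-probabilities, which is what lets DPO drop the separate reward model.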

Expected results

The shorter-responses demo shows reward increasing within the first few steps. RLHF and DPO results depend on your dataset and model choice.

Learn more