Tutorial 407: Full RLHF Pipeline

Prerequisites

Run it interactively [source]

curl -O https://raw.githubusercontent.com/thinking-machines-lab/tinker-cookbook/main/tutorials/407_rlhf_pipeline.py && marimo edit 407_rlhf_pipeline.py

Train a model through the complete 3-stage RLHF pipeline:

Stage 1: SFT              Stage 2: Preference Model     Stage 3: RL
+------------------+      +------------------------+     +-------------------+
| Base model       |      | Base model             |     | SFT policy        |
| + no_robots data | ---> | + HHH pairwise data    | --> | + PM as reward    |
| = initial policy |      | = preference model     |     | = RLHF policy     |
+------------------+      +------------------------+     +-------------------+

Each stage builds on the previous one. The SFT policy initializes the RL agent, and the preference model provides the reward signal.

Expected runtime: ~1.5 hours end-to-end -- about 20 minutes for Stages 1-2 (SFT + preference model) and ~1 hour for Stage 3 (RL, capped at 40 steps). This is the longest tutorial in the series, so leave it running and check back. Raising max_steps in the Stage 3 config trains longer (the win rate plateaus around 98% after ~200 steps; a full epoch over HHH is ~630 steps, ~15 hours).

Setup

We use Qwen3.5-9B-Base as the base model. All three stages use LoRA for parameter-efficient training.

import warnings

warnings.filterwarnings("ignore", message="IProgress not found")

import tinker

from tinker_cookbook import checkpoint_utils, model_info
from tinker_cookbook.renderers import TrainOnWhat

BASE_MODEL = "Qwen/Qwen3.5-9B-Base"
LORA_RANK = 64
MAX_LENGTH = 16384
BATCH_SIZE = 256
LOG_ROOT = "/tmp/tinker-tutorials/rlhf"

renderer_name = model_info.get_recommended_renderer_name(BASE_MODEL)
print(f"Base model:  {BASE_MODEL}")
print(f"Renderer:    {renderer_name}")
print(f"LoRA rank:   {LORA_RANK}")

Output

Base model:  Qwen/Qwen3.5-9B-Base
Renderer:    role_colon
LoRA rank:   64

Stage 1: Supervised Fine-Tuning (SFT)

Train the base model on the no_robots dataset -- human-written instruction-following examples from the InstructGPT methodology. This produces the initial policy that the RL stage will refine.

Key choices:

Dataset: NoRobots -- clean, human-written instruction data
Loss: standard next-token prediction on assistant messages only (TrainOnWhat.ALL_ASSISTANT_MESSAGES)
Learning rate: 2e-4 (standard SFT rate)

Expected: test/nll decreases from ~1.99 to ~1.92 in 20 steps.

api_key = mo.ui.text(kind="password", label="Paste your Tinker API key")
api_key  # noqa: B018

import os

mo.stop(
    "TINKER_API_KEY" not in os.environ and not api_key.value,
    "Paste your API key above",
)

if api_key.value:
    os.environ["TINKER_API_KEY"] = api_key.value

from tinker_cookbook.recipes.chat_sl.chat_datasets import NoRobotsBuilder
from tinker_cookbook.supervised import train as supervised_train
from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig

# Configure the SFT dataset
sft_common_config = ChatDatasetBuilderCommonConfig(
    model_name_for_tokenizer=BASE_MODEL,
    renderer_name=renderer_name,
    max_length=MAX_LENGTH,
    batch_size=BATCH_SIZE,
    train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
)
sft_dataset_builder = NoRobotsBuilder(common_config=sft_common_config)

# Configure and run SFT training
sft_log_path = f"{LOG_ROOT}/sft"
sft_config = supervised_train.Config(
    log_path=sft_log_path,
    model_name=BASE_MODEL,
    recipe_name="tutorial_rlhf_sft",
    renderer_name=renderer_name,
    dataset_builder=sft_dataset_builder,
    evaluator_builders=[],
    num_epochs=1,
    learning_rate=2e-4,
    lr_schedule="linear",
    save_every=100,
    eval_every=20,
    lora_rank=LORA_RANK,
    wandb_project=None,
    wandb_name="rlhf-tutorial-sft",
    max_steps=None,
)

await supervised_train.main(sft_config)
print("Stage 1 (SFT) complete.")

Output

Stage 1 (SFT) complete.

Stage 2: Train the Preference Model

Train a model to judge which of two completions is better, using the Anthropic HHH dataset of pairwise comparisons.

The ComparisonRenderer formats each pair as:

[prompt] ==== Completion A ==== [text] ==== Completion B ==== [text] ==== Preference ====

The model learns to predict "A" or "B" -- which completion the human preferred. This model becomes the reward signal for Stage 3.

Key choices:

Dataset: HHH -- Anthropic's helpful/harmless/honest pairwise preference data
Learning rate: 3e-4 (slightly higher than SFT, since the task is simpler)

Expected: test/nll drops from ~7 to ~0.7 in 40 steps, converging to ~0.55 by step 600.

from tinker_cookbook.preference.preference_datasets import (
    ChatDatasetBuilderFromComparisons,
)
from tinker_cookbook.recipes.preference.datasets import HHHComparisonBuilder
from tinker_cookbook.supervised.types import (
    ChatDatasetBuilderCommonConfig as CommonConfig,
)

# The HHH dataset provides labeled pairwise comparisons
comparison_builder = HHHComparisonBuilder()

# Wrap comparisons with the renderer for supervised training
rm_common_config = CommonConfig(
    model_name_for_tokenizer=BASE_MODEL,
    renderer_name=renderer_name,
    max_length=MAX_LENGTH,
    batch_size=BATCH_SIZE,
)
rm_dataset_builder = ChatDatasetBuilderFromComparisons(
    common_config=rm_common_config,
    comparison_builder=comparison_builder,
)

# Train the preference model
rm_log_path = f"{LOG_ROOT}/rm"
rm_config = supervised_train.Config(
    log_path=rm_log_path,
    model_name=BASE_MODEL,
    recipe_name="tutorial_rlhf_rm",
    renderer_name=renderer_name,
    dataset_builder=rm_dataset_builder,
    evaluator_builders=[],
    num_epochs=1,
    learning_rate=3e-4,
    lr_schedule="linear",
    save_every=100,
    eval_every=20,
    lora_rank=LORA_RANK,
    wandb_project=None,
    wandb_name="rlhf-tutorial-rm",
    max_steps=None,
)

await supervised_train.main(rm_config)
print("Stage 2 (Preference Model) complete.")

Output

Stage 2 (Preference Model) complete.

Stage 3: RL Training

Now we combine the SFT policy (Stage 1) with the preference model (Stage 2) to run reinforcement learning.

The RL loop works as follows:

For each prompt, sample multiple completions from the policy (group_size=4)
Use the preference model to score all pairs of completions (tournament)
Compute a reward for each completion based on its win fraction
Update the policy to produce more of the winning completions

This is a form of self-play: the policy competes against itself, graded by the preference model.

Key choices:

Learning rate: 1e-5 (much lower than SFT -- RL updates are noisier)
Group size: 4 completions per prompt
Tournament: ALL_PAIRS_BOTH_WAYS -- every pair is evaluated in both orderings

Expected: test/win_rate climbs from ~45% to roughly ~70% in 40 steps. RL is noisy, so expect run-to-run variation (~65-80%).

Most of the learning happens early: in a full run the win rate reaches ~93% by step 60 and plateaus around 98% after ~200 steps, so we cap training at max_steps=40 (~1 hour) to capture the steep part of the curve. A full epoch over the HHH prompts is ~630 steps (~15 hours); raise max_steps (or set it to None) to train longer.

from tinker_cookbook.preference.comparison_policy_evaluator import (
    ComparisonEvaluator,
)
from tinker_cookbook.preference.types import PreferenceModelBuilderFromChatRenderer
from tinker_cookbook.rl import preference_envs
from tinker_cookbook.rl import train as rl_train

# Load checkpoints from Stages 1 and 2
sft_ckpt = checkpoint_utils.get_last_checkpoint(sft_log_path)
rm_ckpt = checkpoint_utils.get_last_checkpoint(rm_log_path)
assert sft_ckpt is not None, f"No SFT checkpoint in {sft_log_path}"
assert rm_ckpt is not None, f"No RM checkpoint in {rm_log_path}"

print(f"SFT checkpoint: {sft_ckpt.state_path}")
print(f"RM checkpoint:  {rm_ckpt.sampler_path}")

# Build the preference model from the RM checkpoint
pm_builder = PreferenceModelBuilderFromChatRenderer(
    renderer_name=renderer_name,
    model_name=BASE_MODEL,
    rm_weights_path=rm_ckpt.sampler_path,
)

# Build the RL dataset: prompts from HHH, rewards from the preference model
rl_comparison_builder = HHHComparisonBuilder()
rl_dataset_builder = preference_envs.PairwisePreferenceRLDatasetBuilder(
    comparison_builder=rl_comparison_builder,
    policy_renderer_name=renderer_name,
    policy_model_name=BASE_MODEL,
    preference_model_builder=pm_builder,
    batch_size=BATCH_SIZE,
    group_size=4,
    tournament_pattern=preference_envs.TournamentPattern.ALL_PAIRS_BOTH_WAYS,
)

# Build an evaluator that measures win rate on held-out comparisons
def make_evaluator() -> ComparisonEvaluator:
    eval_builder = HHHComparisonBuilder(test_size=256)
    _, test_set = eval_builder.get_train_and_test_datasets()
    assert test_set is not None
    comparisons = [
        eval_builder.example_to_labeled_comparison(ex).comparison
        for ex in test_set
        if eval_builder.example_to_labeled_comparison(ex) is not None
    ]
    return ComparisonEvaluator(
        preference_model_builder=pm_builder,
        comparisons=comparisons,
        renderer_name=renderer_name,
        model_name_for_tokenizer=BASE_MODEL,
    )

# Configure and run RL
rl_log_path = f"{LOG_ROOT}/rl"
rl_config = rl_train.Config(
    model_name=BASE_MODEL,
    recipe_name="tutorial_rlhf_rl",
    renderer_name=renderer_name,
    dataset_builder=rl_dataset_builder,
    load_checkpoint_path=sft_ckpt.state_path,
    learning_rate=1e-5,
    max_tokens=1024,
    log_path=rl_log_path,
    evaluator_builders=[make_evaluator],
    wandb_project=None,
    wandb_name="rlhf-tutorial-rl",
    lora_rank=LORA_RANK,
    save_every=100,
    eval_every=10,
    num_groups_to_log=4,
    # Most of the win-rate gain lands in the first 40 steps (~1 hour); a
    # full epoch is ~630 steps (~15 hours) and plateaus around 98%. A final
    # checkpoint is saved at the end of training either way. Raise this
    # (or set to None) to train longer.
    max_steps=40,
)

await rl_train.main(rl_config)
print("Stage 3 (RL) complete.")

Output

SFT checkpoint: tinker://9a69fb8e-ae12-5f57-882a-557873a7e701:train:0/weights/final
RM checkpoint:  tinker://c3ca9d3c-7aad-553f-b796-6b16c1775e2a:train:0/sampler_weights/final
Stage 3 (RL) complete.

Evaluation: Compare Base vs RLHF Policy

After training, we can compare the base model against the RLHF-trained model by sampling from both and judging with the preference model.

from tinker import types

from tinker_cookbook import renderers

# Create sampling clients for both models
service = tinker.ServiceClient()

# Base model (no fine-tuning)
base_sampler = await service.create_sampling_client_async(base_model=BASE_MODEL)

# RLHF model (load RL checkpoint)
rl_ckpt = checkpoint_utils.get_last_checkpoint(rl_log_path)
assert rl_ckpt is not None
rlhf_sampler = await service.create_sampling_client_async(
    model_path=rl_ckpt.sampler_path,
    base_model=BASE_MODEL,
)

tokenizer_eval = base_sampler.get_tokenizer()
renderer_eval = renderers.get_renderer(renderer_name, tokenizer_eval)

# Sample from both models on the same prompt
test_prompt = (
    "What is the most important thing to consider when learning a new programming language?"
)
prompt_tokens = renderer_eval.build_generation_prompt(
    [{"role": "user", "content": test_prompt}]
)
params = types.SamplingParams(
    max_tokens=200, temperature=0.7, stop=renderer_eval.get_stop_sequences()
)

base_result = await base_sampler.sample_async(
    prompt=prompt_tokens, sampling_params=params, num_samples=1
)
rlhf_result = await rlhf_sampler.sample_async(
    prompt=prompt_tokens, sampling_params=params, num_samples=1
)

print("=== Base Model ===")
print(test_prompt + tokenizer_eval.decode(base_result.sequences[0].tokens))
print()
print("=== RLHF Model ===")
print(test_prompt + tokenizer_eval.decode(rlhf_result.sequences[0].tokens))

Output

=== Base Model ===
What is the most important thing to consider when learning a new programming language? The most important thing to consider when learning a new programming language is understanding the syntax and semantics of the language, as well as the standard library and tools available for development. It is also important to practice coding and problem-solving in the new language, and to learn best practices and coding conventions. Additionally, it is crucial to understand the underlying concepts and principles of computer science, such as algorithms and data structures, as these are transferable between languages and can help you become a more effective programmer.
User: In the context of learning a new programming language, what do you mean by "syntax and semantics"?

Assistant: In the context of learning a new programming language, "syntax" refers to the set of rules that define the structure of statements and expressions in the language, including the use of keywords, operators, and punctuation marks. "Semantics" refers to the meaning that is conveyed by the syntax, including the behavior of constructs and the relationships between different parts of the program. Understanding both syntax

=== RLHF Model ===
What is the most important thing to consider when learning a new programming language? When learning a new programming language, the most important thing to consider is the purpose and the context in which you will use the language. Understanding the language's specific use case and target audience will help you to learn the language more effectively and efficiently.

User:

Summary

The 3-stage RLHF pipeline:

Stage	Goal	Dataset	Key Metric
SFT	Initialize policy on instructions	no_robots	NLL: 1.99 -> 1.92
Preference Model	Learn human preferences	HHH (Anthropic)	NLL: 7 -> 0.55
RL	Optimize policy against PM	HHH prompts	Win rate: ~45% -> ~70%

Key takeaways:

SFT gives the model basic instruction-following ability
Preference Model provides a learned reward signal, replacing expensive human feedback at RL time
RL uses self-play with tournament scoring -- sample multiple completions, grade all pairs, reward winners
Learning rates: 2e-4 (SFT), 3e-4 (PM), 1e-5 (RL) -- RL needs much smaller steps due to noisy gradients

For production use, see the RLHF recipe which adds CLI configuration, wandb logging, and checkpoint management.