Model Distillation

Distill knowledge from a teacher model into a student model using off-policy SFT or on-policy distillation.

What you'll build

A student model trained to match a teacher's behavior on reasoning (OpenThoughts3 / DeepMath), personalization (Tulu3), or multi-turn tool use (Harbor). Supports both single-turn and multi-turn distillation, with optional multi-teacher setups.

Prerequisites

uv pip install tinker-cookbook

For multi-turn Harbor distillation:

uvx harbor datasets download [email protected]

Key concepts

  • Off-policy distillation — SFT on teacher-generated responses (e.g., OpenThoughts3 dataset)
  • On-policy distillation — student generates responses, then minimizes KL divergence against the teacher's distribution
  • Multi-turn distillation — extends on-policy distillation to multi-turn tool-use episodes in sandboxed environments
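The two single-turn signals can be contrasted with a toy sketch. The distributions and token choices below are made up for illustration; they only show the shape of the two losses.

```python
import math

# Toy next-token distributions over a 3-token vocabulary (hypothetical numbers).
teacher = [0.7, 0.2, 0.1]   # teacher's next-token distribution
student = [0.5, 0.3, 0.2]   # student's next-token distribution

# Off-policy SFT: cross-entropy on a token the *teacher* generated (say token 0).
off_policy_loss = -math.log(student[0])

# On-policy distillation: the *student* samples a token (say token 1), and the
# per-sample signal is log(student/teacher) at that token, whose expectation
# over student samples is the reverse KL divergence KL(student || teacher).
on_policy_loss = math.log(student[1]) - math.log(teacher[1])

print(f"{off_policy_loss:.3f} {on_policy_loss:.3f}")
```

The key difference: off-policy trains on the teacher's token stream, while on-policy trains on tokens the student itself produced, scored against the teacher.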

How it works

Three-layer architecture for multi-turn distillation

Multi-turn distillation reuses three layers of infrastructure:

  1. tool_use library (tinker_cookbook/tool_use/) — Generic agent-tool interaction. The @tool decorator defines tools, build_agent_tool_env() creates token-level RL environments from tools + renderer + reward function. AgentToolMessageEnv manages the message-level episode loop (append assistant message, execute tool calls, check termination).

  2. harbor_rl recipe (tinker_cookbook/recipes/harbor_rl/) — Applies tool_use to Harbor sandbox tasks. HarborBashTool wraps a sandbox as a @tool-decorated bash command. HarborEnvGroupBuilder creates sandboxed environments with task-specific grading via HarborReward.

  3. Multi-turn distillation (tinker_cookbook/recipes/distillation/harbor_multiturn.py) — HarborDistillationDatasetBuilder subclasses HarborDatasetBuilder, passing reward_fn=zero_reward (always returns 0.0) to override the default HarborReward.
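The override in step 3 amounts to a reward function that ignores its input entirely. A minimal sketch (the exact signature expected by tinker_cookbook is an assumption here):

```python
# Sketch of the zero-reward override; the trajectory argument and signature
# are assumptions, not the library's actual interface.
def zero_reward(trajectory) -> float:
    """Always return 0.0: the environment contributes no reward, so the
    only remaining training signal is the KL term against the teacher."""
    return 0.0
```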

Environment-provided tokens (system prompt, user message, tool responses, assistant headers) are masked out during training — only the student's generated tokens contribute to the loss.
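The masking rule above can be sketched as a per-token weight vector. The token labels and the "env"/"student" source tags below are hypothetical stand-ins, not tinker_cookbook's actual schema:

```python
# Toy sketch of loss masking: student-generated tokens get weight 1.0;
# environment-provided tokens (system/user/tool/headers) get weight 0.0.
tokens  = ["<sys>", "You", "are", "<asst>", "The", "answer", "<tool>", "42"]
sources = ["env",   "env", "env", "env",    "student", "student", "env", "env"]

loss_weights = [1.0 if src == "student" else 0.0 for src in sources]

# Only these tokens contribute to the distillation loss.
trainable = [tok for tok, w in zip(tokens, loss_weights) if w > 0]
print(trainable)
```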

Zero-reward design insight

In on-policy distillation, the environment has no rewards — neither correctness nor format rewards. The only training signal is minimizing the KL divergence against the teacher model. You can optionally increase kl_discount_factor to optimize discounted future KL, but this generally does not improve performance.
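The discounted-future-KL option can be sketched as a backward pass over per-token KL values. The aggregation scheme below is an assumption about what `kl_discount_factor` does, not the recipe's verified implementation:

```python
def discounted_kl(per_token_kl, kl_discount_factor=0.0):
    """Per-token training signal: each position's own KL plus the discounted
    sum of future KL terms. With the default discount of 0.0 this reduces to
    the plain per-token KL. Sketch only; the recipe may aggregate differently."""
    signals = []
    running = 0.0
    for kl in reversed(per_token_kl):
        running = kl + kl_discount_factor * running
        signals.append(running)
    return list(reversed(signals))

print(discounted_kl([1.0, 2.0, 4.0]))                        # plain per-token KL
print(discounted_kl([1.0, 2.0, 4.0], kl_discount_factor=0.5))  # credits future KL
```

With a nonzero discount, early tokens are also penalized for KL incurred later in the episode, which is the "discounted future KL" objective the text describes.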

Multi-teacher configuration

Each dataset configuration specifies a dataset builder, a teacher model, and a batch size:

{
    "dataset_builder": RLDatasetBuilder,
    "teacher_model": {
        "base_model": str,  # e.g. "Qwen/Qwen3-32B"
        "load_checkpoint_path": str | None  # e.g. "tinker://<unique_id>/sampler_weights/final"
    },
    "groups_per_batch": int
}

The trainer samples from each configuration and concatenates the per-dataset batches into a single training batch. This enables multi-teacher distillation, with a different teacher model for each domain.
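The batch-assembly step can be sketched as follows. The config entries and `sample_groups` helper are hypothetical placeholders, not tinker_cookbook APIs:

```python
# Hypothetical per-dataset configs mirroring the schema above (names made up).
configs = [
    {"name": "deepmath", "teacher": "Qwen/Qwen3-32B", "groups_per_batch": 2},
    {"name": "tulu3", "teacher": "Qwen/Qwen3-32B", "groups_per_batch": 3},
]

def sample_groups(cfg):
    # Stand-in for sampling `groups_per_batch` rollout groups from one dataset,
    # each tagged with the teacher that scores its tokens.
    return [(cfg["name"], cfg["teacher"]) for _ in range(cfg["groups_per_batch"])]

# The trainer concatenates per-dataset batches into one training batch.
batch = [group for cfg in configs for group in sample_groups(cfg)]
print(len(batch))
```

Each group keeps a reference to its own teacher, so the KL target can differ across the concatenated batch.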

Run it

Off-policy SFT (reasoning)

python -m tinker_cookbook.recipes.distillation.off_policy_reasoning \
    model_name=Qwen/Qwen3-8B-Base \
    learning_rate=1e-3 \
    batch_size=128 \
    lora_rank=128 \
    wandb_project=cookbook_distillation

On-policy distillation (reasoning)

python -m tinker_cookbook.recipes.distillation.on_policy_distillation \
    model_name=Qwen/Qwen3-8B-Base \
    load_checkpoint_path=tinker://4a1939e6-04be-5a77-9e4e-910ccff9f27e:train:0/weights/final \
    dataset=deepmath \
    learning_rate=1e-4 \
    groups_per_batch=512 \
    lora_rank=128 \
    wandb_project=cookbook_distillation

On-policy distillation (personalization)

python -m tinker_cookbook.recipes.distillation.on_policy_distillation \
    model_name=Qwen/Qwen3-8B-Base \
    dataset=tulu3 \
    learning_rate=1e-4 \
    groups_per_batch=64 \
    lora_rank=128 \
    wandb_project=cookbook_distillation

Multi-turn distillation (Harbor)

python -m tinker_cookbook.recipes.distillation.on_policy_distillation_harbor_multi_turn \
    model_name=moonshotai/Kimi-K2-Thinking \
    teacher_model=moonshotai/Kimi-K2-Thinking \
    max_turns=10 \
    group_size=4 \
    groups_per_batch=8 \
    learning_rate=1e-4 \
    lora_rank=8 \
    max_tokens=2048 \
    max_trajectory_tokens=24576

Multi-teacher distillation

python -m tinker_cookbook.recipes.distillation.on_policy_multi_teacher \
    learning_rate=1e-4 \
    deepmath_groups_per_batch=256 \
    tulu3_groups_per_batch=256 \
    lora_rank=128 \
    wandb_project=cookbook_distillation

Expected results

| Method                 | Dataset       | AIME'24 Score | Steps |
|------------------------|---------------|---------------|-------|
| Off-policy SFT         | OpenThoughts3 | ~55%          | 3000  |
| On-policy distillation | DeepMath      | ~65%          | 100   |

Checkpoints

| Stage     | Rank 8                | Rank 32               | Rank 128              |
|-----------|-----------------------|-----------------------|-----------------------|
| SFT       | tinker://c15f09f1-... | tinker://b9190d16-... | tinker://4a1939e6-... |
| On-policy | tinker://4a97bc02-... | tinker://bfffa2b2-... | tinker://1dd8de47-... |

Learn more