# Model Distillation

Distill knowledge from a teacher model into a student model using off-policy SFT or on-policy distillation.
## What you'll build
A student model trained to match a teacher's behavior on reasoning (OpenThoughts3 / DeepMath), personalization (Tulu3), or multi-turn tool use (Harbor). Supports both single-turn and multi-turn distillation, with optional multi-teacher setups.
## Prerequisites

For multi-turn Harbor distillation, download the dataset first:

```shell
uvx harbor datasets download [email protected]
```
## Key concepts
- Off-policy distillation — SFT on teacher-generated responses (e.g., OpenThoughts3 dataset)
- On-policy distillation — student generates responses, then minimizes KL divergence against the teacher's distribution
- Multi-turn distillation — extends on-policy distillation to multi-turn tool-use episodes in sandboxed environments
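The on-policy objective can be sketched as a per-token estimate of the reverse KL, KL(student ‖ teacher), evaluated on the tokens the student actually sampled. The helper below is illustrative only (a hypothetical name, not the recipe's actual loss code):

```python
def per_token_kl_penalty(student_logprobs, teacher_logprobs):
    """Single-sample estimate of KL(student || teacher) on sampled tokens.

    Each argument is a list of log-probabilities, one per generated token:
    the student's log-prob of the token it sampled, and the teacher's
    log-prob of that same token. Averaging log p_student - log p_teacher
    over sampled tokens gives an unbiased reverse-KL estimate.
    Illustrative sketch, not the recipe's actual implementation.
    """
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)
```

When the student matches the teacher exactly, the estimate is zero; when the student is systematically more confident than the teacher on its own samples, the estimate is positive and training pushes the student back toward the teacher.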
## How it works

### Three-layer architecture for multi-turn distillation

Multi-turn distillation reuses three layers of infrastructure:
- **`tool_use` library** (`tinker_cookbook/tool_use/`) — Generic agent-tool interaction. The `@tool` decorator defines tools; `build_agent_tool_env()` creates token-level RL environments from tools + renderer + reward function. `AgentToolMessageEnv` manages the message-level episode loop (append assistant message, execute tool calls, check termination).
- **`harbor_rl` recipe** (`tinker_cookbook/recipes/harbor_rl/`) — Applies `tool_use` to Harbor sandbox tasks. `HarborBashTool` wraps a sandbox as a `@tool`-decorated bash command. `HarborEnvGroupBuilder` creates sandboxed environments with task-specific grading via `HarborReward`.
- **Multi-turn distillation** (`tinker_cookbook/recipes/distillation/harbor_multiturn.py`) — `HarborDistillationDatasetBuilder` subclasses `HarborDatasetBuilder`, passing `reward_fn=zero_reward` (always returns 0.0) to override the default `HarborReward`.
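The decorator-plus-registry pattern behind the `@tool` layer can be sketched in a few lines. This is a simplified stand-in, not the cookbook's actual `tool_use` API; the registry name and signatures here are assumptions for illustration:

```python
from typing import Callable

# Hypothetical registry; the real @tool decorator in
# tinker_cookbook/tool_use/ carries more metadata (schemas, renderers).
TOOL_REGISTRY: dict[str, Callable] = {}

def tool(fn: Callable) -> Callable:
    """Register a function as a callable tool, keyed by its name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@tool
def bash(command: str) -> str:
    """Toy sandbox command: echoes instead of executing anything."""
    return f"$ {command}\n(sandboxed output)"

# The episode loop can then dispatch model tool calls by name:
result = TOOL_REGISTRY["bash"]("ls /workspace")
```

In the real stack, `HarborBashTool` plays the role of `bash` here, with the command running inside a Harbor sandbox rather than being echoed.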
Environment-provided tokens (system prompt, user message, tool responses, assistant headers) are masked out during training — only the student's generated tokens contribute to the loss.
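The masking rule above amounts to a per-token loss-weight vector: 1.0 for student-sampled tokens, 0.0 for everything the environment injected. A minimal sketch, assuming each token is tagged with its origin (the recipe tracks this inside its Env classes rather than with an explicit helper like this):

```python
def loss_weights(token_sources):
    """Per-token loss weights from origin tags.

    'env' tokens (system prompt, user message, tool responses,
    assistant headers) get weight 0.0 and never contribute to the loss;
    'student' tokens (sampled by the policy) get weight 1.0.
    Illustrative sketch only.
    """
    return [1.0 if src == "student" else 0.0 for src in token_sources]

w = loss_weights(["env", "env", "student", "student", "env", "student"])
# Only the three student-generated tokens carry nonzero weight.
```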
### Zero-reward design insight

In on-policy distillation, the environment provides no rewards at all: neither correctness rewards nor format rewards. The only training signal is minimizing the KL divergence against the teacher model. You can optionally increase `kl_discount_factor` to optimize discounted future KL, but this generally does not improve performance.
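Concretely, the override is a constant-zero reward function, and the discount turns per-token KL into a discounted sum over the remaining tokens. Both functions below are illustrative sketches; `zero_reward` mirrors the function named above, while `discounted_future_kl` is a hypothetical stand-in for what `kl_discount_factor` controls:

```python
def zero_reward(*args, **kwargs) -> float:
    """Constant-zero reward: with no environment reward, the KL term
    against the teacher is the only gradient signal."""
    return 0.0

def discounted_future_kl(per_token_kl, gamma=0.0):
    """Discounted future KL per token (gamma = kl_discount_factor).

    With gamma=0.0 this reduces to the plain per-token KL; larger gamma
    makes each token also account for KL incurred later in the rollout.
    Illustrative sketch, not the recipe's actual implementation.
    """
    out = []
    running = 0.0
    for kl in reversed(per_token_kl):
        running = kl + gamma * running
        out.append(running)
    return out[::-1]
```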
### Multi-teacher configuration

For every dataset, you can define a teacher model and a batch size:
```python
{
    "dataset_builder": RLDatasetBuilder,
    "teacher_model": {
        "base_model": str,                   # e.g. "Qwen/Qwen3-32B"
        "load_checkpoint_path": str | None,  # e.g. "tinker://<unique_id>/sampler_weights/final"
    },
    "groups_per_batch": int,
}
```
The trainer samples from each configuration and concatenates all individual dataset batches to form the batch for training. This enables multi-teacher distillation with different teacher models for different domains.
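The sample-then-concatenate step can be sketched as follows. The config fields mirror the structure above, but `build_batch` and the `sample_group` callback are hypothetical names introduced for illustration:

```python
# Two per-dataset configs, each with its own teacher and batch share.
dataset_configs = [
    {"name": "deepmath", "teacher": "Qwen/Qwen3-32B", "groups_per_batch": 2},
    {"name": "tulu3", "teacher": "Qwen/Qwen3-8B", "groups_per_batch": 1},
]

def build_batch(configs, sample_group):
    """Draw groups_per_batch groups from each dataset config and
    concatenate them into one training batch. Each group is tagged
    with its teacher so the KL is computed against the right model
    for that domain. Illustrative sketch only."""
    batch = []
    for cfg in configs:
        for _ in range(cfg["groups_per_batch"]):
            group = sample_group(cfg["name"])
            batch.append({"group": group, "teacher": cfg["teacher"]})
    return batch

batch = build_batch(dataset_configs, lambda name: f"{name}-prompt")
```

This is what makes different teachers per domain possible: the teacher assignment travels with each group, so a single training step can mix math groups scored against one teacher with chat groups scored against another.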
## Run it

### Off-policy SFT (reasoning)

```shell
python -m tinker_cookbook.recipes.distillation.off_policy_reasoning \
    model_name=Qwen/Qwen3-8B-Base \
    learning_rate=1e-3 \
    batch_size=128 \
    lora_rank=128 \
    wandb_project=cookbook_distillation
```
### On-policy distillation (reasoning)

```shell
python -m tinker_cookbook.recipes.distillation.on_policy_distillation \
    model_name=Qwen/Qwen3-8B-Base \
    load_checkpoint_path=tinker://4a1939e6-04be-5a77-9e4e-910ccff9f27e:train:0/weights/final \
    dataset=deepmath \
    learning_rate=1e-4 \
    groups_per_batch=512 \
    lora_rank=128 \
    wandb_project=cookbook_distillation
```
### On-policy distillation (personalization)

```shell
python -m tinker_cookbook.recipes.distillation.on_policy_distillation \
    model_name=Qwen/Qwen3-8B-Base \
    dataset=tulu3 \
    learning_rate=1e-4 \
    groups_per_batch=64 \
    lora_rank=128 \
    wandb_project=cookbook_distillation
```
### Multi-turn distillation (Harbor)

```shell
python -m tinker_cookbook.recipes.distillation.on_policy_distillation_harbor_multi_turn \
    model_name=moonshotai/Kimi-K2-Thinking \
    teacher_model=moonshotai/Kimi-K2-Thinking \
    max_turns=10 \
    group_size=4 \
    groups_per_batch=8 \
    learning_rate=1e-4 \
    lora_rank=8 \
    max_tokens=2048 \
    max_trajectory_tokens=24576
```
### Multi-teacher distillation

```shell
python -m tinker_cookbook.recipes.distillation.on_policy_multi_teacher \
    learning_rate=1e-4 \
    deepmath_groups_per_batch=256 \
    tulu3_groups_per_batch=256 \
    lora_rank=128 \
    wandb_project=cookbook_distillation
```
## Expected results
| Method | Dataset | AIME'24 Score | Steps |
|---|---|---|---|
| Off-policy SFT | OpenThoughts3 | ~55% | 3000 |
| On-policy distillation | DeepMath | ~65% | 100 |
## Checkpoints

| Stage | Rank 8 | Rank 32 | Rank 128 |
|---|---|---|---|
| SFT | tinker://c15f09f1-... | tinker://b9190d16-... | tinker://4a1939e6-... |
| On-policy | tinker://4a97bc02-... | tinker://bfffa2b2-... | tinker://1dd8de47-... |