# Self-Distillation Fine-Tuning (SDFT)
This recipe implements the SDFT algorithm from "Self-Distillation Enables Continual Learning" (Shenfeld et al., 2026) using Tinker. SDFT is an on-policy distillation method that learns new skills from demonstrations while preserving prior capabilities — addressing catastrophic forgetting in sequential fine-tuning.
## Algorithm

### What the paper does
SDFT uses the same model in two roles with different prompts:

- **Teacher** sees the question plus the golden answer as an in-context demonstration.
- **Student** sees only the question and generates a completion on-policy.
- The loss minimizes the forward KL divergence between the teacher and student distributions at each token position of the student's completion:

\[\mathcal{L}_{\text{SDFT}} = \frac{1}{T} \sum_{t=1}^{T} D_{\text{KL}}\!\left( P_{\text{teacher}}(\cdot \mid t) \,\Vert\, P_{\text{student}}(\cdot \mid t) \right)\]

where \(T\) is the number of completion tokens, and the KL at each position sums over the full vocabulary (~150K tokens).
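As a concrete illustration of this objective, here is a toy sketch of the per-position forward KL over a tiny vocabulary (illustrative names and shapes only; the actual recipe works with Tinker's logprob APIs, not raw logit lists):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def forward_kl(p_teacher, p_student):
    """KL(teacher || student) at a single token position."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)

def sdft_loss(teacher_logits, student_logits):
    """Mean forward KL over the T positions of the student's completion."""
    T = len(teacher_logits)
    return sum(
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ) / T

# Two completion positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[2.0, 0.5, -1.0], [1.0, 0.0, 0.0]]
print(sdft_loss(teacher, student))  # 0 at position 1, positive at position 2
```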
Both teacher and student start from the same base model weights. The paper maintains the teacher as an EMA (Exponential Moving Average) of the student, updated every step with `alpha=0.01`, so the teacher slowly tracks the student's learning while providing stable distillation targets.
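The paper's EMA update can be sketched as a per-parameter interpolation (toy code with flat parameter lists; an assumption-laden sketch, not the reference implementation):

```python
ALPHA = 0.01  # EMA coefficient from the paper

def ema_update(teacher_params, student_params, alpha=ALPHA):
    """teacher <- (1 - alpha) * teacher + alpha * student, applied every step."""
    return [(1 - alpha) * t + alpha * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 0.0]
student = [1.0, 1.0]
for _ in range(3):  # three optimizer steps
    teacher = ema_update(teacher, student)
print(teacher)  # each entry -> 1 - 0.99**3, i.e. ~0.0297
```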
### What we implement in Tinker
Our Tinker implementation differs from the paper in two ways:
1. **Top-K instead of full-vocabulary KL.** The Tinker API does not expose full-vocabulary logits. Instead, we use Tinker's top-K distillation API (`topk_prompt_logprobs`) to recover the teacher's top-K token distribution at each position, and train with `cross_entropy` loss:

   \[\mathcal{L}_{\text{top-K}} = \frac{1}{T} \sum_{t=1}^{T} \left[ -\sum_{k=1}^{K} P_{\text{teacher}}(x_k \mid t) \cdot \log P_{\text{student}}(x_k \mid t) \right]\]

   where the inner sum is over the \(K\) tokens with the highest teacher probability (renormalized to sum to 1).

2. **Static teacher instead of EMA.** The paper maintains the teacher as an EMA of the student (updated every step with `alpha=0.01`). Our implementation keeps the teacher frozen at the initial base-model weights, which is simpler and avoids the overhead of periodic weight syncing.
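A minimal sketch of the top-K cross-entropy above, assuming the teacher's top-K probabilities and the student's log-probabilities have already been gathered (hypothetical data layout; the recipe itself uses Tinker's `topk_prompt_logprobs` and `cross_entropy` loss):

```python
import math

def topk_ce(teacher_topk, student_logprobs):
    """teacher_topk: per-position list of (token_id, teacher_prob) for the K
    most likely teacher tokens; student_logprobs: per-position dict mapping
    token_id -> student log-probability."""
    T = len(teacher_topk)
    loss = 0.0
    for entries, logp in zip(teacher_topk, student_logprobs):
        z = sum(p for _, p in entries)  # renormalize top-K mass to sum to 1
        loss += -sum((p / z) * logp[tok] for tok, p in entries)
    return loss / T

# One position, K=2: teacher puts mass 0.6 / 0.3 on tokens 7 and 3.
teacher = [[(7, 0.6), (3, 0.3)]]
student = [{7: math.log(0.5), 3: math.log(0.25)}]
print(topk_ce(teacher, student))  # ~0.9242
```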
### Validating the approximation
We verified both design choices by running ablations on the official SDFT codebase with Qwen/Qwen2.5-7B-Instruct (full fine-tuning, lr=2e-5, tool-use task):
| KL Type | EMA Teacher | Tool-use | Science |
|---|---|---|---|
| Full-vocab | Yes | 68.04% | 33.33% |
| Full-vocab | No (static) | 67.01% | 36.69% |
| Top-K=20 | Yes | 68.04% | 35.31% |
| Top-K=20 | No (static) | 69.07% | 34.52% |
| SFT baseline | N/A | 65.98% | 36.88% |
Top-K=20 matches full-vocab KL (68.04% vs. 68.04% with an EMA teacher, 69.07% vs. 67.01% without). The static teacher matches EMA; the 1-2 pp differences are within noise. Both design choices are validated.
We also confirmed our Tinker implementation produces identical results to the reference on the same model (Qwen/Qwen3-4B-Instruct-2507): both reach 56.70% tool-use accuracy with top-K=20 and a static teacher.
## Setup

### 1. Download the data
Training data comes from the Self-Distillation repository, which includes preprocessed golden reasoning chains for tool-use and science tasks.
```bash
git clone https://github.com/idanshen/Self-Distillation.git
# Data is at Self-Distillation/data/tooluse_data/ and Self-Distillation/data/science_data/
```
### 2. Set your Tinker API key
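Export the key before running (assuming Tinker's standard `TINKER_API_KEY` environment variable; substitute your own key):

```shell
# Placeholder value; obtain a real key from your Tinker account.
export TINKER_API_KEY=<your-api-key>
```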
## Running the Recipe

### Single-task SDFT training
```bash
# SDFT on tool-use (top-K distillation, K=20)
python -m tinker_cookbook.recipes.sdft.train \
    model_name=Qwen/Qwen3.5-35B-A3B \
    dataset=toolalpaca \
    toolalpaca_data_path=Self-Distillation/data/tooluse_data/train_data \
    groups_per_batch=128 \
    learning_rate=5e-4 \
    topk=20 \
    lora_rank=64
```
### Continual learning experiment (SFT vs SDFT)
The key experiment: train sequentially on two tasks and measure retention.
```bash
python -m tinker_cookbook.recipes.sdft.run_continual_learning \
    model_name=Qwen/Qwen3.5-35B-A3B \
    data_dir=Self-Distillation/data \
    methods=sft,sdft_topk \
    learning_rates=5e-4 \
    stages=1,2 \
    lora_rank=64 \
    topk=20 \
    thinking_format=true
```
### Debug run
```bash
python -m tinker_cookbook.recipes.sdft.train \
    model_name=Qwen/Qwen3.5-35B-A3B \
    dataset=sciknoweval \
    groups_per_batch=4 \
    max_tokens=256 \
    max_steps=3 \
    topk=20 \
    lora_rank=64
```
## Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `model_name` | — | Model for both student and teacher |
| `topk` | 20 | Top-K tokens for distillation (0 = importance sampling fallback) |
| `learning_rate` | 2e-5 | Adam learning rate; for LoRA, use 5e-4 to 1e-3 |
| `groups_per_batch` | 32 | Problems per training batch |
| `lora_rank` | 128 | LoRA adapter rank (64 for Qwen/Qwen3.5-35B-A3B) |
| `max_tokens` | 2048 | Max completion length |
| `thinking_format` | false | Convert data for thinking models (e.g., Qwen3.5) |
| `teacher_sync_every` | None | Optional periodic teacher weight sync |
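As a sketch of how `teacher_sync_every` could behave, the toy loop below (hypothetical names; the actual training loop lives in `tinker_cookbook/distillation/sdft.py`) refreshes the teacher from the student every N steps:

```python
def train(num_steps, teacher_sync_every=None):
    """Toy training loop recording the steps at which the teacher is synced."""
    student, teacher = {"w": 0.0}, {"w": 0.0}
    synced_at = []
    for step in range(1, num_steps + 1):
        student["w"] += 0.1  # stand-in for an optimizer step
        if teacher_sync_every and step % teacher_sync_every == 0:
            teacher = dict(student)  # refresh the distillation target
            synced_at.append(step)
    return synced_at

print(train(10, teacher_sync_every=4))  # prints [4, 8]
```

With `teacher_sync_every=None` (the default), the teacher is never refreshed, which recovers the static-teacher setup described above.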
## Results

### Continual Learning: Qwen/Qwen3.5-35B-A3B (Tinker, LoRA rank 64)

#### Stage 1: Train on tool-use, evaluate both tasks
| Method | LR | Tool-use | Science | Sci Δ |
|---|---|---|---|---|
| Base model | — | 61.86% | 46.35% | — |
| SFT | 1e-4 | 8.25% | 36.88% | -9.47 |
| SFT | 5e-4 | 67.01% | 29.19% | -17.16 |
| SFT | 1e-3 | 69.07% | 37.08% | -9.27 |
| SDFT | 1e-4 | 63.92% | 45.17% | -1.18 |
| SDFT | 5e-4 | 65.98% | 46.55% | +0.20 |
| SDFT | 1e-3 | 67.01% | 53.65% | +7.30 |
#### Stage 2: Train on science (from the Stage 1 checkpoint), evaluate both tasks
| Method | LR | Tool-use | Science | TU Retention |
|---|---|---|---|---|
| SFT | 1e-4 | 64.95% | 57.40% | — * |
| SFT | 5e-4 | 68.04% | 63.71% | 101% |
| SFT | 1e-3 | 8.25% | 64.69% | 12% |
| SDFT | 1e-4 | 61.86% | 56.80% | 97% |
| SDFT | 5e-4 | 61.86% | 63.51% | 94% |
| SDFT | 1e-3 | 35.05% | 60.75% | 52% |
* SFT lr=1e-4 Stage 1 tool-use was 8.25% (anomalous), so Stage 2 retention is not meaningful.
## Findings

- **SDFT preserves prior knowledge during new-task training.** After tool-use training (Stage 1), SFT degrades science by 9 to 17 pp, while SDFT preserves or improves it (+0 to +7 pp).
- **SFT can retain knowledge with careful LR tuning.** At lr=5e-4, SFT retains 101% of tool-use after Stage 2 science training. However, at lr=1e-3, tool-use collapses to 8.25% (12% retention). The right LR for retention is not known in advance.
- **SDFT retention is robust across learning rates.** SDFT keeps 94-97% of tool-use in Stage 2 at lr=1e-4 and 5e-4, without needing to tune for retention specifically.
- **Practical recommendation.** Use SDFT when you need to fine-tune on new data without risking degradation of existing capabilities. Use SFT with a conservative learning rate when you can afford some forgetting and want maximum target-task performance; tuning SFT's learning rate and number of steps can also help control retention.
## Files

| File | Description |
|---|---|
| `tinker_cookbook/distillation/sdft.py` | Core: teacher prompts, top-K datum construction, training loop |
| `tinker_cookbook/recipes/sdft/train.py` | CLI entry point |
| `tinker_cookbook/recipes/sdft/run_continual_learning.py` | 2-stage experiment runner |
| `tinker_cookbook/recipes/sdft/datasets.py` | Data loading (tool-use, science, thinking-model format) |
| `tinker_cookbook/recipes/sdft/eval.py` | Evaluation (tool-use exact match, science MCQ) |
| `tinker_cookbook/recipes/sdft/sdft_test.py` | Unit tests |
## References
- Shenfeld et al., "Self-Distillation Enables Continual Learning", 2026
- Official implementation (TRL-based, Qwen2.5-7B)
- Tinker loss functions (top-K distillation, cross_entropy)