Get Started
Installation
Optional extras: [math-rl], [modal], [wandb], [inspect], [all].
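Extras use pip's standard bracket syntax. For example (a sketch; check which extras your use case needs):

```shell
# Install with a single extra (here, the [math-rl] feature set)
uv pip install 'tinker-cookbook[math-rl]'

# Or install everything at once
uv pip install 'tinker-cookbook[all]'
```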
For nightly builds:
uv pip install 'tinker-cookbook @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git@nightly'
To develop locally or customize recipes for your own needs:
git clone https://github.com/thinking-machines-lab/tinker-cookbook.git
cd tinker-cookbook
uv pip install -e .
The cookbook provides higher-level abstractions on top of the Tinker SDK. Where the SDK gives you raw operations (forward_backward, optim_step, sample), the cookbook gives you configurable training pipelines that handle pipelining, checkpointing, evaluation, and logging automatically.
SFT with the Cookbook
With the SDK, an SFT loop requires manually calling forward_backward → optim_step → save_state in a loop. The cookbook wraps this into a single train.Config:
- Define your data — implement a `SupervisedDatasetBuilder` that returns batches of `Datum` objects
- Configure training — set model, learning rate, LoRA rank, checkpointing schedule
- Run — `train.main(config)` handles the entire loop
import asyncio

from tinker_cookbook.supervised import train
from tinker_cookbook.supervised.types import ChatDatasetBuilder

# my_dataset_builder: your SupervisedDatasetBuilder (e.g. a ChatDatasetBuilder)
config = train.Config(
    log_path="~/logs/my-sft-run",
    model_name="Qwen/Qwen3-8B",
    dataset_builder=my_dataset_builder,
    learning_rate=1e-4,
    lora_rank=32,
    save_every=20,
    eval_every=10,
)
asyncio.run(train.main(config))
Under the hood, train.main creates a TrainingClient, pipelines forward_backward + optim_step for throughput, runs evaluators, and saves checkpoints. See the SL architecture overview for the full component diagram.
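The pipelining idea can be sketched in plain asyncio. This is a conceptual illustration with stand-in coroutines, not the real SDK calls: while one batch's `optim_step` is in flight, the next batch's `forward_backward` is already running.

```python
import asyncio

# Stand-in async ops; in the real SDK these would be TrainingClient requests.
async def forward_backward(batch):
    await asyncio.sleep(0.01)  # pretend compute
    return f"grads({batch})"

async def optim_step(grads):
    await asyncio.sleep(0.01)  # pretend optimizer update
    return f"updated-after-{grads}"

async def pipelined_train(batches):
    results = []
    pending_optim = None
    for batch in batches:
        # While this forward_backward awaits, the previous optim_step task runs.
        grads = await forward_backward(batch)
        if pending_optim is not None:
            results.append(await pending_optim)
        # Kick off optim_step without blocking the next forward_backward.
        pending_optim = asyncio.create_task(optim_step(grads))
    results.append(await pending_optim)
    return results

results = asyncio.run(pipelined_train(["b0", "b1", "b2"]))
```

The key design point is that `optim_step` is scheduled as a task rather than awaited immediately, so the two phases overlap across consecutive batches.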
RL with the Cookbook
RL is more complex than SFT because it involves a rollout loop: sample from the model, score with a reward function, then train on the scored trajectories. The cookbook abstracts this into composable types:
sample → score → train → repeat
- Define environments — implement `Env` (or `ProblemEnv` for Q&A tasks) that produces rewards
- Group environments — `EnvGroupBuilder` creates batches of environments (GRPO centers rewards across groups)
- Define a dataset — `RLDataset` produces batches of `EnvGroupBuilder`s
- Run — the training loop handles rollouts, advantage computation, and weight updates
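The group-centering that GRPO applies can be illustrated in plain Python (a sketch of the idea, not the cookbook's implementation): each rollout's advantage is its reward minus the mean reward of its group.

```python
def center_rewards(group_rewards):
    """GRPO-style advantages: each reward minus the group mean."""
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]

# Four rollouts of the same prompt, scored by the reward function:
advantages = center_rewards([1.0, 0.0, 1.0, 0.0])
# → [0.5, -0.5, 0.5, -0.5]
```

Rollouts above the group mean get positive advantage, those below get negative, which is why environments are sampled in groups rather than individually.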
The simplest starting point is ProblemEnv — a single-turn Q&A environment:
from tinker_cookbook.rl.problem_env import ProblemEnv

class MathEnv(ProblemEnv):
    def get_question(self) -> str:
        return "What is 2 + 3?"

    def check_answer(self, response: str) -> float:
        return 1.0 if "5" in response else 0.0
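The grading logic can be exercised on its own. This standalone sketch (without the `ProblemEnv` base class) shows how the substring check maps responses to rewards:

```python
def check_answer(response: str) -> float:
    # Reward 1.0 if the correct answer appears anywhere in the response.
    return 1.0 if "5" in response else 0.0

r1 = check_answer("The answer is 5.")  # 1.0
r2 = check_answer("The answer is 6.")  # 0.0
```

Note that a bare substring check is loose (a response containing "15" would also score 1.0); real recipes typically parse out the final answer before comparing.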
See the RL architecture overview for the full component diagram.
Preferences (DPO / RLHF)
Preference learning trains models from pairwise comparisons rather than scalar rewards:
- Build comparisons — `Comparison` objects containing two completions (A and B) for the same prompt
- Label preferences — `LabeledComparison` adds a human preference (A, B, or tie)
- Train with DPO — directly optimize the policy from preference data
from tinker_cookbook.preference.types import Comparison, LabeledComparison
comparison = Comparison(messages=messages, completion_a=resp_a, completion_b=resp_b)
labeled = LabeledComparison(**comparison.__dict__, label=1) # B preferred
See DPO Guide and RLHF Example for full walkthroughs.
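The DPO objective itself is easy to state: given policy and reference log-probabilities of the chosen and rejected completions, the loss is the negative log-sigmoid of the scaled log-ratio difference. A minimal numeric sketch (not the cookbook's implementation; `beta` is the usual DPO temperature):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probs."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-logits)))  # -log(sigmoid(logits))

# When the policy already favors the chosen completion more strongly than
# the reference does, the loss falls below log(2):
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

At initialization, when policy and reference agree, the log-ratios cancel and the loss is exactly log(2); training pushes it down by widening the margin on the chosen side.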
Next steps
- SL Architecture — full component diagram for supervised learning
- RL Architecture — full component diagram for reinforcement learning
- Recipes — 13 production-ready training examples
- Tutorials — interactive notebooks
- API Reference — all `tinker_cookbook` modules