Get Started

Installation

uv pip install tinker tinker-cookbook
export TINKER_API_KEY="your-api-key-here"

Optional extras: [math-rl], [modal], [wandb], [inspect], [all] — installed with the usual extras syntax, e.g. uv pip install 'tinker-cookbook[wandb]'.

For nightly builds:

uv pip install 'tinker-cookbook @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git@nightly'

To develop locally or customize recipes for your own needs:

git clone https://github.com/thinking-machines-lab/tinker-cookbook.git
cd tinker-cookbook
uv pip install -e .

The cookbook provides higher-level abstractions on top of the Tinker SDK. Where the SDK gives you raw operations (forward_backward, optim_step, sample), the cookbook gives you configurable training pipelines that handle pipelining, checkpointing, evaluation, and logging automatically.

SFT with the Cookbook

With the SDK, an SFT loop requires manually calling forward_backward, optim_step, and save_state in a loop. The cookbook wraps this into a single train.Config:

DatasetBuilder (define data) → train.Config (model, LR, rank) → train.main() (pipelined loop) → checkpoints + metrics

  1. Define your data — implement a SupervisedDatasetBuilder that returns batches of Datum objects
  2. Configure training — set model, learning rate, LoRA rank, checkpointing schedule
  3. Run — train.main(config) handles the entire loop

import asyncio

from tinker_cookbook.supervised import train
from tinker_cookbook.supervised.types import ChatDatasetBuilder

config = train.Config(
    log_path="~/logs/my-sft-run",
    model_name="Qwen/Qwen3-8B",
    dataset_builder=my_dataset_builder,  # e.g. a ChatDatasetBuilder instance
    learning_rate=1e-4,
    lora_rank=32,
    save_every=20,
    eval_every=10,
)
asyncio.run(train.main(config))

Under the hood, train.main creates a TrainingClient, pipelines forward_backward + optim_step for throughput, runs evaluators, and saves checkpoints. See the SL architecture overview for the full component diagram.
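
The overlap of forward_backward and optim_step can be sketched with plain asyncio. This is an illustrative model of the pipelining idea, not the cookbook's code: dummy_forward_backward and dummy_optim_step are hypothetical stand-ins for the SDK calls, and the loop launches each batch's optimizer step as a background task so the next batch's forward/backward work proceeds concurrently.

```python
import asyncio

async def dummy_forward_backward(batch: int) -> float:
    # Stand-in for the SDK's forward_backward; returns a fake loss.
    await asyncio.sleep(0.01)
    return 1.0 / (batch + 1)

async def dummy_optim_step(batch: int) -> None:
    # Stand-in for the SDK's optim_step.
    await asyncio.sleep(0.01)

async def pipelined_loop(num_batches: int) -> list[float]:
    losses: list[float] = []
    pending_optim: asyncio.Task | None = None
    for batch in range(num_batches):
        # Forward/backward for this batch runs while the previous
        # batch's optim_step task (if any) is still in flight.
        loss = await dummy_forward_backward(batch)
        if pending_optim is not None:
            await pending_optim
        pending_optim = asyncio.create_task(dummy_optim_step(batch))
        losses.append(loss)
    if pending_optim is not None:
        await pending_optim
    return losses

losses = asyncio.run(pipelined_loop(3))
```

The key design point is that optim_step for batch N overlaps with forward_backward for batch N+1, which is what keeps the accelerator busy.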

RL with the Cookbook

RL is more complex than SFT because it involves a rollout loop: sample from the model, score with a reward function, then train on the scored trajectories. The cookbook abstracts this into composable types:

RLDataset.get_batch() → EnvGroupBuilder.make_envs() → Env rollout → rewards + advantages → forward_backward (train) → repeat

  1. Define environments — implement Env (or ProblemEnv for Q&A tasks) that produces rewards
  2. Group environments — EnvGroupBuilder creates batches of environments (GRPO centers rewards across groups)
  3. Define a dataset — RLDataset produces batches of EnvGroupBuilders
  4. Run — the training loop handles rollouts, advantage computation, and weight updates
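
The group-centering mentioned in step 2 can be sketched in plain Python. This is a simplified illustration of the GRPO idea, not the cookbook's implementation: each rollout's advantage is its reward minus the mean reward of its group.

```python
def center_rewards(group_rewards: list[float]) -> list[float]:
    # GRPO-style baseline: subtract the group's mean reward, so
    # advantages sum to zero within each group of rollouts.
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]

# Four rollouts of the same prompt with binary rewards from the env:
advantages = center_rewards([1.0, 0.0, 0.0, 1.0])
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
```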

The simplest starting point is ProblemEnv — a single-turn Q&A environment:

from tinker_cookbook.rl.problem_env import ProblemEnv

class MathEnv(ProblemEnv):
    def get_question(self) -> str:
        return "What is 2 + 3?"

    def check_answer(self, response: str) -> float:
        return 1.0 if "5" in response else 0.0

See the RL architecture overview for the full component diagram.

Preferences (DPO / RLHF)

Preference learning trains models from pairwise comparisons rather than scalar rewards:

  1. Build comparisons — Comparison objects containing two completions (A and B) for the same prompt
  2. Label preferences — LabeledComparison adds a human preference (A, B, or tie)
  3. Train with DPO — directly optimize the policy from preference data

from tinker_cookbook.preference.types import Comparison, LabeledComparison

comparison = Comparison(messages=messages, completion_a=resp_a, completion_b=resp_b)
labeled = LabeledComparison(**comparison.__dict__, label=1)  # B preferred
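
For intuition, the standard DPO objective on one labeled pair can be written out directly. This is a textbook sketch, not the cookbook's training code: the policy is pushed to widen its log-probability margin on the preferred completion relative to a frozen reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # Margin of the policy over the reference, chosen vs. rejected.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin (lower is better).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin (policy already favors the chosen completion)
# gives a loss below log(2); a zero margin gives exactly log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```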

See DPO Guide and RLHF Example for full walkthroughs.

Next steps