Get Started

Installation

uv pip install tinker tinker-cookbook
export TINKER_API_KEY="your-api-key-here"

Optional extras: [math-rl], [modal], [wandb], [inspect], [all] — installed with the usual extras syntax, e.g. uv pip install 'tinker-cookbook[wandb]'.

For nightly builds:

uv pip install 'tinker-cookbook @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git@nightly'

To develop locally or customize recipes for your own needs:

git clone https://github.com/thinking-machines-lab/tinker-cookbook.git
cd tinker-cookbook
uv pip install -e .

The cookbook provides higher-level abstractions on top of the Tinker SDK. Where the SDK gives you raw operations (forward_backward, optim_step, sample), the cookbook gives you configurable training pipelines that handle pipelining, checkpointing, evaluation, and logging automatically.

SFT with the Cookbook

With the SDK, an SFT loop requires manually calling forward_backward, optim_step, and save_state in a loop. The cookbook wraps this into a single train.Config:

DatasetBuilder (define data) → train.Config (model, LR, rank) → train.main() (pipelined loop) → checkpoints + metrics

  1. Define your data — implement a SupervisedDatasetBuilder that returns batches of Datum objects
  2. Configure training — set model, learning rate, LoRA rank, checkpointing schedule
  3. Run — train.main(config) handles the entire loop

import asyncio

from tinker_cookbook.supervised import train
from tinker_cookbook.supervised.types import ChatDatasetBuilder

config = train.Config(
    log_path="~/logs/my-sft-run",
    model_name="Qwen/Qwen3-8B",
    dataset_builder=my_dataset_builder,  # e.g. a ChatDatasetBuilder instance
    learning_rate=1e-4,
    lora_rank=32,
    save_every=20,
    eval_every=10,
)
asyncio.run(train.main(config))

Under the hood, train.main creates a TrainingClient, pipelines forward_backward + optim_step for throughput, runs evaluators, and saves checkpoints. See the SL architecture overview for the full component diagram.
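
The overlap of forward_backward and optim_step can be sketched with plain asyncio. This is an illustrative model of the pipelining idea, not the cookbook's code: dummy_forward_backward and dummy_optim_step are hypothetical stand-ins for the SDK calls, and the loop launches each batch's optimizer step as a background task so the next batch's forward/backward work proceeds concurrently.

```python
import asyncio

async def dummy_forward_backward(batch: int) -> float:
    # Stand-in for the SDK's forward_backward; returns a fake loss.
    await asyncio.sleep(0.01)
    return 1.0 / (batch + 1)

async def dummy_optim_step(batch: int) -> None:
    # Stand-in for the SDK's optim_step.
    await asyncio.sleep(0.01)

async def pipelined_loop(num_batches: int) -> list[float]:
    losses: list[float] = []
    pending_optim: asyncio.Task | None = None
    for batch in range(num_batches):
        # Forward/backward for this batch runs while the previous
        # batch's optim_step task (if any) is still in flight.
        loss = await dummy_forward_backward(batch)
        if pending_optim is not None:
            await pending_optim
        pending_optim = asyncio.create_task(dummy_optim_step(batch))
        losses.append(loss)
    if pending_optim is not None:
        await pending_optim
    return losses

losses = asyncio.run(pipelined_loop(3))
```

The key design point is that optim_step for batch N overlaps with forward_backward for batch N+1, which is what keeps the accelerator busy.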

RL with the Cookbook

RL is more complex than SFT because it involves a rollout loop: sample from the model, score with a reward function, then train on the scored trajectories. The cookbook abstracts this into composable types:

RLDataset.get_batch() → EnvGroupBuilder.make_envs() → Env rollout → rewards + advantages → forward_backward (train) → repeat

  1. Define environments — implement Env (or ProblemEnv for Q&A tasks) that produces rewards
  2. Group environments — EnvGroupBuilder creates batches of environments (GRPO centers rewards across groups)
  3. Define a dataset — RLDataset produces batches of EnvGroupBuilders
  4. Run — the training loop handles rollouts, advantage computation, and weight updates
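
The group-centering mentioned in step 2 can be sketched in plain Python. This is a simplified illustration of the GRPO idea, not the cookbook's implementation: each rollout's advantage is its reward minus the mean reward of its group.

```python
def center_rewards(group_rewards: list[float]) -> list[float]:
    # GRPO-style baseline: subtract the group's mean reward, so
    # advantages sum to zero within each group of rollouts.
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]

# Four rollouts of the same prompt with binary rewards from the env:
advantages = center_rewards([1.0, 0.0, 0.0, 1.0])
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
```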

The simplest starting point is ProblemEnv — a single-turn Q&A environment:

from tinker_cookbook.rl.problem_env import ProblemEnv

class MathEnv(ProblemEnv):
    def get_question(self) -> str:
        return "What is 2 + 3?"

    def check_answer(self, response: str) -> float:
        return 1.0 if "5" in response else 0.0

See the RL architecture overview for the full component diagram.

Preferences (DPO / RLHF)

Preference learning trains models from pairwise comparisons rather than scalar rewards:

  1. Build comparisons — Comparison objects containing two completions (A and B) for the same prompt
  2. Label preferences — LabeledComparison adds a human preference (A, B, or tie)
  3. Train with DPO — directly optimize the policy from preference data

from tinker_cookbook.preference.types import Comparison, LabeledComparison

comparison = Comparison(messages=messages, completion_a=resp_a, completion_b=resp_b)
labeled = LabeledComparison(**comparison.__dict__, label=1)  # B preferred
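
For intuition, the standard DPO objective on one labeled pair can be written out directly. This is a textbook sketch, not the cookbook's training code: the policy is pushed to widen its log-probability margin on the preferred completion relative to a frozen reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # Margin of the policy over the reference, chosen vs. rejected.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin (lower is better).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin (policy already favors the chosen completion)
# gives a loss below log(2); a zero margin gives exactly log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```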

See DPO Guide and RLHF Example for full walkthroughs.

Next steps