Reinforcement Learning

Reinforcement learning (RL) means learning from trial and error: instead of supplying input-output pairs, you provide prompts and reward functions, and the algorithm discovers good outputs.
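
Concretely, a reward function can be as simple as checking the completion for a correct answer. This illustrative sketch mirrors the MathEnv example later on this page:

# Illustrative reward function: 1.0 if the completion contains the answer, else 0.0
def reward_fn(completion: str) -> float:
    return 1.0 if "5" in completion else 0.0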

Architecture

RLDataset
├── get_batch(index) → list[EnvGroupBuilder]
└── __len__() → number of batches

EnvGroupBuilder
├── make_envs() → list[Env]              create environments for this group
├── compute_group_rewards(trajs, envs)   final reward using the whole group
├── cleanup()                             release resources (sandboxes, etc.)
└── logging_tags()                        tags for metric aggregation

Env (stateful, one per episode)
├── initial_observation() → (Observation, StopCondition)
└── step(action, extra) → StepResult
     ├── reward: float
     ├── episode_done: bool
     ├── next_observation
     └── metrics / logs

Trajectory = list[Transition(ob, ac, reward, episode_done)]
TrajectoryGroup = list[Trajectory] + final_rewards + metrics
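
The trajectory containers can be pictured as plain dataclasses. The following is a minimal sketch that follows the field names in the diagram above, not necessarily the library's exact definitions:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Transition:
    ob: Any             # observation the policy saw
    ac: Any             # action the policy produced (sampled tokens)
    reward: float       # per-step reward
    episode_done: bool  # True on the terminal transition

Trajectory = list[Transition]

@dataclass
class TrajectoryGroup:
    trajectories: list[Trajectory]
    final_rewards: list[float]  # group-level rewards, one per trajectory
    metrics: dict[str, float] = field(default_factory=dict)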

Key Components

Env

A stateful environment for a single episode. Discard after use.

from tinker_cookbook.rl.types import Env, StepResult

class MyEnv(Env):
    def __init__(self, question: str, renderer):
        self.question = question
        self.renderer = renderer  # converts chat messages to model input and back

    async def initial_observation(self):
        prompt = self.renderer.build_generation_prompt(
            [{"role": "user", "content": self.question}]
        )
        return prompt, self.renderer.get_stop_sequences()

    async def step(self, action, *, extra=None):
        response = self.renderer.parse_response(action)
        reward = 1.0 if check_correct(response) else 0.0  # check_correct: your grading logic
        return StepResult(reward=reward, episode_done=True, ...)  # plus next_observation, metrics
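
For orientation, here is a hedged sketch of how a rollout drives one episode end to end; sample_completion is a placeholder for the sampling client, not a tinker_cookbook API:

async def run_episode(env: Env) -> float:
    ob, stop_condition = await env.initial_observation()
    # Placeholder: the policy samples a completion for the observation
    action = await sample_completion(ob, stop_condition)
    result = await env.step(action)
    # Single-turn envs like MyEnv set episode_done=True on the first step
    return result.reward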

ProblemEnv

Convenience base class for single-turn Q&A tasks:

from tinker_cookbook.rl.problem_env import ProblemEnv

class MathEnv(ProblemEnv):
    def get_question(self) -> str:
        return "What is 2 + 3?"

    def check_answer(self, response: str) -> float:
        return 1.0 if "5" in response else 0.0

EnvGroupBuilder

Builds a group of environments. Groups enable reward centering (GRPO) and multi-agent setups.

from tinker_cookbook.rl.types import EnvGroupBuilder

class MyGroupBuilder(EnvGroupBuilder):
    async def make_envs(self) -> list[Env]:
        # For GRPO-style reward centering, a group is typically the same
        # question rolled out several times (self.num_envs is illustrative)
        return [MyEnv(self.question, self.renderer) for _ in range(self.num_envs)]

    # Optional: group-level rewards computed once all episodes finish
    # (default: 0.0 for every trajectory)
    async def compute_group_rewards(self, trajs, envs):
        return [(0.0, {}) for _ in trajs]

Built-in: ProblemGroupBuilder builds groups of ProblemEnv instances from a factory.
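
A hypothetical usage sketch; the import path and parameter names (env_thunk, num_envs) are assumptions and may differ from the actual constructor:

from functools import partial
from tinker_cookbook.rl.problem_env import ProblemGroupBuilder

# Builds a group of 8 identical MathEnv rollouts (parameter names assumed)
group_builder = ProblemGroupBuilder(env_thunk=partial(MathEnv), num_envs=8)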

RLDataset / RLDatasetBuilder

An RLDataset produces batches of EnvGroupBuilders for the training loop; an RLDatasetBuilder constructs the dataset, as sketched below.

from tinker_cookbook.rl.types import RLDataset, RLDatasetBuilder
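
A minimal dataset sketch, assuming a fixed list of questions and a MyGroupBuilder that stores its question (an assumed constructor); the batching logic is illustrative:

class MyDataset(RLDataset):
    def __init__(self, questions: list[str], batch_size: int):
        self.questions = questions
        self.batch_size = batch_size

    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
        # One group builder per question in this batch
        start = index * self.batch_size
        batch = self.questions[start : start + self.batch_size]
        return [MyGroupBuilder(q) for q in batch]

    def __len__(self) -> int:
        # Number of batches
        return len(self.questions) // self.batch_size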

Data Processing

After rollouts, convert trajectories to training data:

from tinker_cookbook.rl.data_processing import compute_advantages, trajectory_to_data, assemble_training_data

# Compute GRPO-style advantages
advantages = compute_advantages(trajectory_groups, loss_fn="cispo")

# Convert to Datum objects for forward_backward
training_data = assemble_training_data(trajectory_groups, advantages, ...)

Rollout Strategies

Control error handling during rollouts:

Strategy         Behavior
FailFast         Stop on first error
RetryOnFailure   Retry failed trajectories; continue on partial success

Training Loop

Each iteration of the RL training loop (see the pseudocode sketch after this list):

  1. Sample a batch of EnvGroupBuilders from RLDataset
  2. Rollout: for each group, create environments, sample completions, collect rewards
  3. Compute advantages: normalize rewards across the group (GRPO)
  4. Training update: forward_backward with the rollout data + optim_step
  5. Evaluate and checkpoint
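
In pseudocode: the rollout and evaluation helpers here are placeholders, not tinker_cookbook APIs, and the forward_backward / optim_step signatures are simplified.

for step in range(len(dataset)):
    env_group_builders = dataset.get_batch(step)              # 1. sample a batch
    trajectory_groups = [
        await do_group_rollout(b)                             # 2. rollouts (placeholder)
        for b in env_group_builders
    ]
    advantages = compute_advantages(trajectory_groups, loss_fn="cispo")  # 3. centering
    data = assemble_training_data(trajectory_groups, advantages)
    await training_client.forward_backward(data)              # 4. accumulate gradients
    await training_client.optim_step()                        #    then apply the update
    evaluate_and_checkpoint(step)                             # 5. placeholder helper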

Next Steps