# Reinforcement Learning
RL is learning by trial and error. Instead of providing input-output pairs, you provide prompts and reward functions, and the algorithm discovers good outputs on its own.
## Architecture
```
RLDataset
├── get_batch(index) → list[EnvGroupBuilder]
└── __len__() → number of batches

EnvGroupBuilder
├── make_envs() → list[Env]              create environments for this group
├── compute_group_rewards(trajs, envs)   final reward using the whole group
├── cleanup()                            release resources (sandboxes, etc.)
└── logging_tags()                       tags for metric aggregation

Env (stateful, one per episode)
├── initial_observation() → (Observation, StopCondition)
└── step(action, extra) → StepResult
    ├── reward: float
    ├── episode_done: bool
    ├── next_observation
    └── metrics / logs

Trajectory      = list[Transition(ob, ac, reward, episode_done)]
TrajectoryGroup = list[Trajectory] + final_rewards + metrics
```
## Key Components
### Env
A stateful environment for a single episode. Discard after use.
```python
from tinker_cookbook.rl.types import Env, StepResult

class MyEnv(Env):
    async def initial_observation(self):
        # `renderer` and `self.question` are assumed to be set up
        # elsewhere (e.g. in __init__).
        prompt = renderer.build_generation_prompt(
            [{"role": "user", "content": self.question}]
        )
        return prompt, renderer.get_stop_sequences()

    async def step(self, action, *, extra=None):
        # Parse the sampled tokens back into a message and score it.
        response = renderer.parse_response(action)
        reward = 1.0 if check_correct(response) else 0.0
        return StepResult(reward=reward, episode_done=True, ...)
```
### ProblemEnv
Convenience base class for single-turn Q&A tasks:
```python
from tinker_cookbook.rl.problem_env import ProblemEnv

class MathEnv(ProblemEnv):
    def get_question(self) -> str:
        return "What is 2 + 3?"

    def check_answer(self, response: str) -> float:
        return 1.0 if "5" in response else 0.0
```
### EnvGroupBuilder
Builds a group of environments. Groups enable reward centering (GRPO) and multi-agent setups.
```python
from tinker_cookbook.rl.types import Env, EnvGroupBuilder

class MyGroupBuilder(EnvGroupBuilder):
    async def make_envs(self) -> list[Env]:
        return [MyEnv(q) for q in self.questions]

    # Optional: compute group-level rewards (default: 0)
    async def compute_group_rewards(self, trajs, envs):
        return [(0.0, {}) for _ in trajs]
```
Built-in: `ProblemGroupBuilder` builds groups of `ProblemEnv` instances from a factory.
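For instance, a group of eight rollouts of the `MathEnv` above might be built like this (a sketch; check `problem_env.py` for `ProblemGroupBuilder`'s exact signature):

```python
from functools import partial
from tinker_cookbook.rl.problem_env import ProblemGroupBuilder

# Eight copies of the same problem in one group, so their rewards
# can be centered against each other (GRPO-style).
group_builder = ProblemGroupBuilder(env_thunk=partial(MathEnv), num_envs=8)
```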
### RLDataset / RLDatasetBuilder
Produces batches of `EnvGroupBuilder`s for the training loop.
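A minimal sketch, assuming `MyGroupBuilder` from above accepts the questions it should build envs for (its `__init__` is not shown):

```python
from tinker_cookbook.rl.types import EnvGroupBuilder, RLDataset

class MyDataset(RLDataset):
    def __init__(self, questions: list[str], batch_size: int, group_size: int):
        self.questions = questions
        self.batch_size = batch_size
        self.group_size = group_size

    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
        batch = self.questions[index * self.batch_size : (index + 1) * self.batch_size]
        # One group per question; repeating a question group_size times gives
        # GRPO multiple rollouts to center rewards against.
        return [MyGroupBuilder(questions=[q] * self.group_size) for q in batch]

    def __len__(self) -> int:
        return len(self.questions) // self.batch_size
```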
## Data Processing
After rollouts, convert trajectories to training data:
```python
from tinker_cookbook.rl.data_processing import (
    compute_advantages,
    trajectory_to_data,
    assemble_training_data,
)

# Compute GRPO-style advantages
advantages = compute_advantages(trajectory_groups, loss_fn="cispo")

# Convert to Datum objects for forward_backward
training_data = assemble_training_data(trajectory_groups, advantages, ...)
```
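Within each group, the advantage is (roughly) the group-centered reward, `A_i = r_i - mean(r)`; see `data_processing.py` for the exact normalization each loss function uses.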
## Rollout Strategies
Control error handling during rollouts:
| Strategy | Behavior |
|---|---|
| `FailFast` | Stop on the first error |
| `RetryOnFailure` | Retry failed trajectories; continue on partial success |
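Conceptually, `RetryOnFailure` wraps each rollout in a retry loop. A sketch only (the `do_group_rollout` helper and retry details here are illustrative, not the cookbook's exact implementation):

```python
async def rollout_with_retries(group_builder, policy, max_attempts: int = 2):
    """Retry a failed group rollout; re-raise once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return await do_group_rollout(group_builder, policy)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error, as FailFast would immediately
```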
## Training Loop
The RL training loop (sketched in code below):

- Sample a batch of `EnvGroupBuilder`s from the `RLDataset`
- Rollout: for each group, create environments, sample completions, collect rewards
- Compute advantages: normalize rewards across each group (GRPO)
- Training update: `forward_backward` with the rollout data + `optim_step`
- Evaluate and checkpoint
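A condensed sketch of one iteration. The `policy` and `training_client` objects, the `do_group_rollout` helper, and the exact call signatures are assumptions here, not the cookbook's verbatim loop:

```python
for batch_index in range(len(dataset)):
    # 1. Sample a batch of EnvGroupBuilders
    group_builders = dataset.get_batch(batch_index)

    # 2. Rollout: one TrajectoryGroup per EnvGroupBuilder
    trajectory_groups = [
        await do_group_rollout(builder, policy) for builder in group_builders
    ]

    # 3. Advantages: center rewards within each group (GRPO)
    advantages = compute_advantages(trajectory_groups, loss_fn="cispo")

    # 4. Training update
    training_data = assemble_training_data(trajectory_groups, advantages)
    await training_client.forward_backward(training_data, loss_fn="cispo")
    await training_client.optim_step(adam_params)  # adam_params defined elsewhere

    # 5. Evaluate and checkpoint (omitted)
```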
## Next Steps
- RL Training Loop — minimal GRPO loop implementation
- RL Environments — building custom environments
- RL Hyperparameters — KL penalty, advantages, reward shaping
- Tutorials: First RL — interactive walkthrough