Tutorial 12: Sequence Extension in Multi-Turn RL
Manage conversation history and masks across turns.
In single-turn RL, each episode has one prompt and one completion. In multi-turn RL, the model generates multiple responses across turns, and the training data must correctly track which tokens get gradients.
The sequence extension concept
When the model takes multiple turns, each observation contains the full conversation so far. The RL data pipeline detects when each new observation extends the previous observation plus action (i.e., consecutive sequences share a prefix) and merges the turns into a single training datum, avoiding redundant computation:
Turn 1: [O1] [A1]
Turn 2: [O1 A1 O2] [A2]          <- O1+A1 is a prefix, so merge
Turn 3: [O1 A1 O2 A2 O3] [A3]    <- continues extending

Result: One datum with [O1 A1 O2 A2 O3 A3]
Mask:                  [0  1  0  1  0  1 ]
                       (only action tokens get gradients)
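Merging is what keeps multi-turn training affordable: without it, the datum for turn t would re-encode every earlier turn, so the total number of processed tokens grows quadratically with the number of turns. A quick back-of-envelope comparison (per-turn sizes are made up for illustration):

turn_sizes = [50, 60, 40, 70]  # tokens added per turn (observation + action), hypothetical
# Without merging: each turn's datum repeats the whole conversation so far
unmerged = sum(sum(turn_sizes[:i + 1]) for i in range(len(turn_sizes)))
# With sequence extension: one datum containing each token exactly once
merged = sum(turn_sizes)
print(unmerged, merged)  # 530 vs 220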
import tinker
from tinker_cookbook.rl.data_processing import trajectory_to_data
from tinker_cookbook.rl.types import Trajectory, Transition
from tinker_cookbook.completers import TokensWithLogprobs
How prompt/completion boundaries shift
At each turn, the observation grows because it includes all prior turns. The mask field in the training datum marks which tokens should receive gradients:
| Position | Content | Mask | Why |
|---|---|---|---|
| O1 tokens | System prompt + user question | 0 | Prompt -- no gradient |
| A1 tokens | Model's first response | 1 | Action -- gets gradient |
| O2 tokens | Environment feedback / next user message | 0 | Prompt -- no gradient |
| A2 tokens | Model's second response | 1 | Action -- gets gradient |
The trajectory_to_data function handles this automatically by checking prefix relationships.
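Conceptually, building the merged datum is just concatenating observation and action segments while recording one mask entry per token. A minimal sketch of that bookkeeping (illustrative only, not the library's implementation):

def merge_turns(segments):
    """segments: (tokens, is_action) pairs in conversation order."""
    tokens, mask = [], []
    for seg_tokens, is_action in segments:
        tokens.extend(seg_tokens)
        mask.extend([1.0 if is_action else 0.0] * len(seg_tokens))
    return tokens, mask

tokens, mask = merge_turns([
    ([100, 101], False),  # O1: prompt tokens, no gradient
    ([200], True),        # A1: action tokens, gradient
    ([300], False),       # O2
    ([400, 401], True),   # A2
])
assert mask == [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

The example below exercises the real function on a simulated 3-turn conversation.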
# Simulate a 3-turn conversation
# Observation tokens (prompt parts)
o1 = [100, 101, 102, 103] # "What is 2+3?"
a1 = [200, 201] # "5"
o2 = [300, 301] # "Correct! Now what is 5*3?"
a2 = [400, 401, 402] # "15"
o3 = [500] # "Right! Final: 15+1?"
a3 = [600, 601] # "16"
# Build transitions -- each observation extends the previous
transitions = [
Transition(
ob=tinker.ModelInput.from_ints(o1),
ac=TokensWithLogprobs(tokens=a1, logprobs=[-0.5, -0.3]),
reward=0.0, episode_done=False,
),
Transition(
ob=tinker.ModelInput.from_ints(o1 + a1 + o2), # Extends previous
ac=TokensWithLogprobs(tokens=a2, logprobs=[-0.4, -0.2, -0.1]),
reward=0.0, episode_done=False,
),
Transition(
ob=tinker.ModelInput.from_ints(o1 + a1 + o2 + a2 + o3), # Extends again
ac=TokensWithLogprobs(tokens=a3, logprobs=[-0.6, -0.3]),
reward=1.0, episode_done=True,
),
]
traj = Trajectory(transitions=transitions, final_ob=tinker.ModelInput.empty())
data = trajectory_to_data(traj, traj_advantage=1.0)
print(f"Number of datums: {len(data)} (merged into 1 because all observations extend)")
datum = data[0]
mask = datum.loss_fn_inputs["mask"].data
advantages = datum.loss_fn_inputs["advantages"].data
print(f"Total tokens: {datum.model_input.length}")
print(f"Mask (1=action): {mask}")
print(f"Advantages: {advantages}")
print(f"Action tokens: {sum(1 for m in mask if m == 1.0)}")
print(f"Prompt tokens: {sum(1 for m in mask if m == 0.0)}")
When sequences do NOT extend
If a later observation does not extend the previous observation and action (e.g., the environment resets or shows a different view), trajectory_to_data creates a new datum:
# Turn 1: normal conversation
# Turn 2: observation is NOT a prefix extension (different context)
transitions_split = [
Transition(
ob=tinker.ModelInput.from_ints([100, 101]),
ac=TokensWithLogprobs(tokens=[200, 201], logprobs=[-0.5, -0.3]),
reward=0.5, episode_done=False,
),
Transition(
ob=tinker.ModelInput.from_ints([300, 301, 302]), # New context, not a prefix
ac=TokensWithLogprobs(tokens=[400, 401], logprobs=[-0.4, -0.2]),
reward=0.5, episode_done=True,
),
]
traj_split = Trajectory(transitions=transitions_split, final_ob=tinker.ModelInput.empty())
data_split = trajectory_to_data(traj_split, traj_advantage=1.0)
print(f"Number of datums: {len(data_split)} (split because observation changed)")
for i, d in enumerate(data_split):
print(f" Datum {i}: {d.model_input.length} tokens, mask={d.loss_fn_inputs['mask'].data}")
Multi-turn environments
The MessageEnv + EnvFromMessageEnv adapter makes multi-turn RL easy:
- MessageEnv -- you implement initial_observation() (returns messages) and step(message) (processes the assistant message, returns the next messages + reward)
- EnvFromMessageEnv -- automatically handles tokenization, prefix detection, context overflow, and parse errors
The adapter wraps a message-level environment into the token-level Env interface:
# Assumes MessageEnv and MessageStepResult are imported from tinker_cookbook
class TutorEnv(MessageEnv):
    def __init__(self):
        self.history = []  # conversation so far, as chat messages

    async def initial_observation(self):
        self.history = [{"role": "user", "content": "Teach me about X"}]
        return self.history

    async def step(self, message):
        # message is the assistant's response; append it and the next
        # user message, then return the updated conversation and reward
        self.history = [
            *self.history,
            message,
            {"role": "user", "content": "Tell me more"},
        ]
        return MessageStepResult(
            reward=0.0,
            episode_done=False,
            next_messages=self.history,
        )
# renderer converts chat messages to and from tokens (constructed elsewhere)
env = EnvFromMessageEnv(renderer, TutorEnv(), max_trajectory_tokens=4096)
The max_trajectory_tokens parameter prevents context overflow by terminating the episode early when the conversation would exceed the limit.
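Conceptually, the cap is a running token-budget check; a hypothetical helper (not the adapter's actual internals) makes the behavior concrete:

MAX_TRAJECTORY_TOKENS = 4096

def should_stop(conversation_tokens: int, next_turn_tokens: int) -> bool:
    # End the episode early rather than letting the rendered
    # conversation exceed the trajectory token budget
    return conversation_tokens + next_turn_tokens > MAX_TRAJECTORY_TOKENS

print(should_stop(4000, 200))  # True -> terminate instead of overflowing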
Summary
- Multi-turn RL uses sequence extension: consecutive observations that share a prefix are merged into a single training datum
- The mask field marks action tokens (gradient) vs prompt tokens (no gradient)
- trajectory_to_data handles merging/splitting automatically
- Use MessageEnv + EnvFromMessageEnv for clean multi-turn environment implementations
- Set max_trajectory_tokens to prevent context overflow in long conversations