Sequence Extension Property in Multi-Turn RL
When running reinforcement learning with multi-turn conversations, the way you render observations at each timestep has important implications for compute efficiency. This document explains the extension property and how it affects training and sampling.
What is the Extension Property?
A sequence of observations has the extension property if each successive observation contains all previous observations and actions as a prefix. In other words, the context grows monotonically by appending new tokens to the end.
When this property holds, multiple timesteps can be merged into a single training datum, the KV-cache can be reused during sampling, and compute scales as O(T) rather than O(T²) for a trajectory of length T.
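Stated as code, the extension property is just a prefix check over token sequences. The sketch below is illustrative rather than library code; it assumes each timestep's observation and action have already been tokenized into lists of ints.

def has_extension_property(timesteps: list[tuple[list[int], list[int]]]) -> bool:
    """Return True if each observation starts with the previous observation + action.

    Illustrative sketch: timesteps is a list of (observation_tokens, action_tokens).
    """
    for (prev_obs, prev_action), (curr_obs, _) in zip(timesteps, timesteps[1:]):
        expected_prefix = prev_obs + prev_action
        if curr_obs[: len(expected_prefix)] != expected_prefix:
            return False
    return True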
Example 1: Qwen3 with Thinking Visible (Extension Holds)
When using Qwen3Renderer with strip_thinking_from_history=False, the full conversation history (including <think> blocks) is preserved at each timestep. Consider a two-turn math conversation:
Timestep 1:
Assistant: <think>Let me calculate...</think> 4
User:
Timestep 2:
Assistant: <think>Let me calculate...</think> 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
Notice that the observation (green) at timestep 2 contains the entire timestep 1 sequence as a prefix. The new observation just appends "What is 3+3?\n\nAssistant:" to the end. This is the extension property.
Because extension holds, the RL code can merge both timesteps into a single Datum:
Assistant: <think>Let me calculate...</think> 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
Green = observation tokens (loss weight = 0). Red = action tokens (loss weight > 0).
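To make the merge concrete, here is a minimal sketch of flattening consecutive timesteps into one token sequence with per-token loss weights. MergedDatum and its field names are hypothetical stand-ins for illustration, not the actual tinker.Datum structure.

from dataclasses import dataclass

@dataclass
class MergedDatum:
    # Hypothetical stand-in for a training datum, not the real tinker.Datum fields.
    tokens: list[int]
    loss_weights: list[float]  # 0.0 for observation tokens, > 0 for action tokens

def merge_timesteps(timesteps: list[tuple[list[int], list[int]]]) -> MergedDatum:
    """Merge (observation, action) pairs that satisfy the extension property.

    Each observation only contributes the suffix it adds beyond the previous
    observation + action, so the merged sequence is the final context rendered once.
    """
    tokens: list[int] = []
    weights: list[float] = []
    for obs, action in timesteps:
        new_obs = obs[len(tokens):]       # the tokens this observation appends
        tokens += new_obs
        weights += [0.0] * len(new_obs)   # no loss on observation tokens
        tokens += action
        weights += [1.0] * len(action)    # loss on action tokens
    return MergedDatum(tokens=tokens, loss_weights=weights)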
Example 2: Qwen3 with Thinking Hidden (Extension Breaks)
When using Qwen3Renderer with the default strip_thinking_from_history=True, the <think>...</think> blocks are stripped from previous assistant messages. This matches how Qwen3 models were post-trained by the Qwen team.
Timestep 1:
Assistant: <think>Let me calculate...</think> 4
User:
Timestep 2:
Assistant: 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
The observation at timestep 2 is not an extension of timestep 1's full sequence. The <think>Let me calculate...</think> portion was stripped, so the prefix doesn't match. The RL code must create two separate Datums:
Datum 1:
Assistant: <think>Let me calculate...</think> 4
User:
Datum 2:
Assistant: 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
This results in more compute during training (two forward/backward passes instead of one) and prevents KV-cache reuse during sampling. For a trajectory of T timesteps, compute scales as O(T²) instead of O(T).
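A rough way to see the scaling: if each turn adds about k observation-plus-action tokens, a merged trajectory is processed once as roughly T·k tokens, while separate datums re-process the whole history at every timestep, totalling about T²·k/2 tokens. The snippet below just computes these two counts; the turn count and tokens-per-turn values are illustrative.

def tokens_processed(num_turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Rough per-trajectory training token counts (illustrative only)."""
    # Extension holds: one merged datum, rendered and processed once.
    merged = num_turns * tokens_per_turn
    # Extension breaks: the datum for turn t re-renders all ~t turns of history.
    separate = sum(t * tokens_per_turn for t in range(1, num_turns + 1))
    return merged, separate

print(tokens_processed(num_turns=20, tokens_per_turn=500))
# (10000, 105000) -- roughly a 10x difference at 20 turns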
The Tradeoff
Keeping thinking visible (strip_thinking_from_history=False) gives you O(T) compute scaling, allows packing sequences together in training batches, and enables KV-cache reuse during sampling. The downside is that context grows faster since all thinking tokens are retained, so you may hit context length limits sooner.
Stripping thinking (strip_thinking_from_history=True, the default) keeps context smaller but breaks the extension property, leading to O(T²) compute scaling.
Note that while stripping thinking matches Qwen3's original post-training distribution, RL fine-tuning should let the model quickly adapt to seeing its earlier thinking preserved in the history, so the distribution mismatch may not be a major concern in practice.
How the RL Code Handles This
The RL training code in data_processing.py automatically detects whether consecutive timesteps satisfy the extension property. The key function is trajectory_to_data:
def trajectory_to_data(traj: Trajectory, traj_advantage: float) -> list[tinker.Datum]:
    """
    Return one or more Datum objects corresponding to the trajectory.
    If the sequence grows by appending, i.e., each successive observation contains
    the previous observation+action as a prefix, then we can return a single Datum.
    However, if we get a sequence that's not an extension of the previous sequence,
    then that results in a new Datum.
    """

When rendering your conversations, be aware of whether your renderer has the extension property. For Qwen3Renderer:
- strip_thinking_from_history=False → Extension holds
- strip_thinking_from_history=True (default) → Extension breaks
Note on sampling: The training code automatically merges timesteps when possible. Sampling infrastructure doesn't yet adjust billing based on KV-cache hits, but this is planned for a future release.
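For intuition, here is a simplified sketch of the grouping logic described in the docstring above. It is not the actual implementation in data_processing.py; it just shows the greedy idea of starting a new group (and hence a new Datum) whenever the prefix check fails.

def group_timesteps(
    timesteps: list[tuple[list[int], list[int]]],
) -> list[list[tuple[list[int], list[int]]]]:
    """Greedily group (observation, action) pairs; each group becomes one datum.

    Simplified sketch of the idea behind trajectory_to_data, not the real code.
    """
    groups: list[list[tuple[list[int], list[int]]]] = []
    current: list[tuple[list[int], list[int]]] = []
    prefix: list[int] = []
    for obs, action in timesteps:
        if current and obs[: len(prefix)] != prefix:
            groups.append(current)    # extension broke: start a new datum
            current = []
        current.append((obs, action))
        prefix = obs + action         # what the next observation must extend
    if current:
        groups.append(current)
    return groups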
Advanced: Periodic Compaction
A hybrid approach is to use periodic compaction: keep thinking visible most of the time (preserving extension), but periodically clear old thinking blocks from the context.
How it works:
- For turns 1-10, keep all thinking visible (extension holds, single datum)
- At turn 11, strip thinking from turns 1-10 (extension breaks once, new datum starts)
- For turns 11-20, keep thinking visible again (extension holds)
- Repeat every N turns
Here's what the datums look like with compaction every 3 turns:
Datum 1 (turns 1-3):
Assistant: <think>...</think> A1
User: Q2
Assistant: <think>...</think> A2
User: Q3
Assistant: <think>...</think> A3
User:
Datum 2 (turns 4-6, thinking from turns 1-3 stripped):
Assistant: A1
User: Q2
Assistant: A2
User: Q3
Assistant: A3
User: Q4
Assistant: <think>...</think> A4
User: Q5
Assistant: <think>...</think> A5
User: Q6
Assistant: <think>...</think> A6
User:
This approach breaks extension only every N timesteps instead of every timestep, keeps context size bounded (old thinking doesn't accumulate forever), and amortizes the recomputation cost over N turns.
To implement this, you would modify your environment or renderer to periodically transform the conversation history, stripping <think> blocks from messages older than N turns.
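As a sketch of that transformation, the helper below strips <think>...</think> blocks from assistant messages older than the last N turns. The message format (dicts with "role" and "content" keys), the helper name, and the choice of N are assumptions for illustration; your renderer or environment may represent history differently.

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def compact_history(messages: list[dict], keep_last_n_turns: int = 10) -> list[dict]:
    """Strip thinking from assistant messages older than the last N assistant turns.

    Illustrative sketch only; assumes messages are {"role": ..., "content": ...} dicts.
    """
    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    old_idxs = set(assistant_idxs[:-keep_last_n_turns]) if keep_last_n_turns > 0 else set(assistant_idxs)
    compacted = []
    for i, m in enumerate(messages):
        if i in old_idxs:
            m = {**m, "content": THINK_BLOCK.sub("", m["content"])}
        compacted.append(m)
    return compacted

Calling this every N turns, rather than on every step, is what keeps the extension break and the associated re-rendering cost amortized.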
Summary
For Qwen3Renderer:
- strip_thinking_from_history=False → Extension holds → Use for long trajectories where compute efficiency matters
- strip_thinking_from_history=True (default) → Extension breaks → Use for short trajectories, or when you want minimal changes from base model behavior
- Periodic compaction → Best of both worlds when you need efficiency with bounded context
When designing your RL environment, consider how many turns you expect and whether the O(T) vs O(T²) difference will be significant for your use case.