Rendering

Rendering to tokens

Rendering converts conversations (lists of messages) into their token representations for model training and inference. While similar to chat templates, Tinker's rendering system is designed for the full training lifecycle rather than just inference: it supports supervised learning, reinforcement learning, and deployment.

HuggingFace Compatibility

Important: Tinker's default renderers are designed to produce identical tokens to HuggingFace's apply_chat_template. This is critical because:

  1. The OpenAI-compatible endpoint (/chat/completions) uses HuggingFace chat templates to convert messages to tokens
  2. If you train with a non-HF-compatible renderer, your model may not work correctly with the OpenAI endpoint

The default renderers (qwen3, llama3, deepseekv3, etc.) match HF behavior. The exception is when using strip_thinking_from_history=False on thinking-enabled renderers (Qwen3Renderer, DeepSeekV3ThinkingRenderer)—this is a special mode for multi-turn RL efficiency that does NOT match HF (see Sequence Extension).

Renderer                  HF Equivalent
qwen3                     apply_chat_template(..., enable_thinking=True) (default)
qwen3_disable_thinking    apply_chat_template(..., enable_thinking=False)
llama3                    apply_chat_template(...) *
deepseekv3                apply_chat_template(...)

* The Llama3 renderer omits the "Cutting Knowledge Date..." preamble that HF prepends to system messages. Add this manually if you need exact HF compatibility.
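As a minimal sketch of doing this by hand (the preamble text and dates below are illustrative; copy the exact wording from the HF template you are matching):

# Illustrative only: prepend the HF-style preamble to your system message.
hf_preamble = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n"
system_message = {'role': 'system', 'content': hf_preamble + 'You are a helpful assistant.'}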

Recommendation: If you plan to use the OpenAI endpoint for inference, always use the default renderers with default options.
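If you want to check this parity yourself, here is a minimal sketch (it assumes get_tokenizer returns a HuggingFace tokenizer exposing apply_chat_template, and uses the renderer methods introduced later on this page):

from tinker_cookbook import renderers, tokenizer_utils

tokenizer = tokenizer_utils.get_tokenizer('Qwen/Qwen3-30B-A3B')
renderer = renderers.get_renderer('qwen3', tokenizer)

messages = [{'role': 'user', 'content': 'What is the longest-lived rodent species?'}]

# Tokens from the Tinker renderer
renderer_tokens = renderer.build_generation_prompt(messages).to_ints()
# Tokens from the HF chat template
hf_tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)

assert renderer_tokens == hf_tokens, "renderer and HF chat template disagree"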

The Renderer class

The Renderer class is the main interface used for rendering. It can be found in tinker_cookbook/renderers/.

Example conversation:

messages = [
    {'role': 'system', 'content': 'Answer concisely; at most one sentence per response'},
    {'role': 'user', 'content': 'What is the longest-lived rodent species?'},
    {'role': 'assistant', 'content': 'The naked mole rat, which can live over 30 years.'},
    {'role': 'user', 'content': 'How do they live so long?'},
    {'role': 'assistant', 'content': 'They evolved multiple protective mechanisms including special hyaluronic acid that prevents cancer, extremely stable proteins, and efficient DNA repair systems that work together to prevent aging.'}
]

We'll use this conversation throughout the examples below.

Inference: Generating messages

The model itself maps tokens to tokens, but with a renderer it can map messages to messages. To sample messages from the model, we use three renderer methods:

  • build_generation_prompt
  • get_stop_sequences
  • parse_response

build_generation_prompt converts a conversation into a prompt that we can use to sample from the assistant. This is used during reinforcement learning and at deployment time.

Example: Generate an alternative assistant response

Let's remove the last assistant message and call build_generation_prompt to get a prompt that we can use to sample an alternative response from the assistant:

from tinker_cookbook import renderers, tokenizer_utils
tokenizer = tokenizer_utils.get_tokenizer('Qwen/Qwen3-30B-A3B')
renderer = renderers.get_renderer('qwen3', tokenizer)
prompt = renderer.build_generation_prompt(messages[:-1])
print(prompt)
print('-'*10)
print(tokenizer.decode(prompt.to_ints()))

Output:

ModelInput(chunks=[EncodedTextChunk(tokens=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 8948, 198, 16141, 3529, 285, 974, 26, 518, 1429, 825, 11652, 817, 2033, 151645, 198, 151644, 872, 198, 3838, 374, 279, 22032, 61854, 20589, 306, 9419, 30, 151645, 198, 151644, 77091, 198, 785, 19020, 34651, 11244, 11, 892, 646, 3887, 916, 220, 18, 15, 1635, 13, 151645, 198, 151644, 872, 198, 10234, 30, 151645, 198, 151644, 77091, 198], type='encoded_text')])
----------
<|im_start|>system
Answer concisely; at most one sentence per response<|im_end|>
<|im_start|>user
What is the longest-lived rodent species?<|im_end|>
<|im_start|>assistant
The naked mole rat, which can live over 30 years.<|im_end|>
<|im_start|>user
How do they live so long?<|im_end|>
<|im_start|>assistant

You can see that the prompt is a ModelInput object, which holds a list of chunks; here it's a single EncodedTextChunk, but multimodal inputs also contain other chunk types (such as images).

Sampling and parsing the response:

Since we provide messages as input, we usually want a message back rather than raw tokens. For that, we use parse_response.

import tinker
from tinker.types import SamplingParams

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(base_model='Qwen/Qwen3-30B-A3B')

# The renderer tells us which token sequences should terminate sampling
stop_sequences = renderer.get_stop_sequences()
print(f"Stop sequences: {stop_sequences}")

sampling_params = SamplingParams(max_tokens=100, temperature=0.5, stop=stop_sequences)
output = sampling_client.sample(prompt, sampling_params=sampling_params, num_samples=1).result()
print(f"Sampled tokens: {output.sequences[0].tokens}")

# Parse the sampled tokens back into a structured assistant message
sampled_message, parse_success = renderer.parse_response(output.sequences[0].tokens)
print(f"Sampled message: {sampled_message}")
print(f"Parse success: {parse_success}")

Output:

Stop sequences: [151645]
Sampled tokens: [45, 7741, 34651, 31410, 614, 4911, 76665, 11, 2670, 264, 7548, 11050, 22077, 1849, 323, 264, 1602, 3347, 40761, 4379, 11, 892, 16792, 311, 862, 57119, 13, 151645]
Sampled message: {'role': 'assistant', 'content': 'Naked mole rats have unique adaptations, including a highly efficient immune system and a very low metabolic rate, which contribute to their longevity.'}
Parse success: True

You can see that there is one stop sequence, 151645, which you can verify is the <|im_end|> token. The output is parsed successfully into a message.
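You can check this directly with the tokenizer:

print(tokenizer.decode([151645]))  # prints: <|im_end|>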

Training: Supervised learning

For supervised learning (and some other algorithms like DPO), we need to distinguish between prompt tokens (context) and completion tokens (what the model should learn to generate). We want to provide a target assistant message, and the renderer needs to tell us which tokens are part of the prompt and completion.

We can use build_supervised_example to get a ModelInput and per-token loss weights:

model_input, weights = renderer.build_supervised_example(messages)
 
from tinker_cookbook.utils.format_colorized import format_colorized
print(format_colorized(model_input.to_ints(), weights, tokenizer))

We get the following output:

<|im_start|>system↵
Answer concisely; at most one sentence per response<|im_end|>↵
<|im_start|>user↵
What is the longest-lived rodent species?<|im_end|>↵
<|im_start|>assistant↵
The naked mole rat, which can live over 30 years.<|im_end|>↵
<|im_start|>user↵
How do they live so long?<|im_end|>↵
<|im_start|>assistant↵
They evolved multiple protective mechanisms including special hyaluronic acid that prevents cancer, extremely stable proteins, and efficient DNA repair systems that work together to prevent aging.<|im_end|>


The green text is part of the prompt (weight=0, so no loss is computed on these tokens) and the red text is part of the completion (weight=1, so the model is trained to predict these tokens). Note that the ↵ symbols are inserted only to show where newlines occur; they are not part of the token sequence.

The key insight here is that only the final assistant message is treated as the completion. All previous context, including the first assistant response, is part of the prompt, so the model learns to continue conversations rather than just answer single questions.
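To make the role of the weights concrete, here is a minimal sketch of a weight-masked loss (illustrative only, not Tinker's actual training code): tokens with weight=0 contribute nothing, and the loss averages over the weight=1 completion tokens.

import torch
import torch.nn.functional as F

def masked_nll(logits: torch.Tensor, targets: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); targets and weights: (batch, seq)
    per_token_nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction='none')
    # Prompt tokens (weight=0) are masked out; only completion tokens (weight=1) count.
    return (per_token_nll * weights).sum() / weights.sum().clamp(min=1)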

Vision Inputs

Tinker supports vision-language models (VLMs) like Qwen/Qwen3-VL-30B-A3B-Instruct and Qwen/Qwen3-VL-235B-A22B-Instruct. For low-level ImageChunk usage, see Vision inputs in the Training and Sampling guide. This section covers the higher-level message abstractions.

Multimodal messages

For VLMs, message content can be either a string or a list of content parts:

from tinker_cookbook.renderers import Message, TextPart, ImagePart
 
# Text-only message (standard)
text_message = Message(role='user', content='What is this?')
 
# Multimodal message with image
multimodal_message = Message(
    role='user',
    content=[
        ImagePart(type='image', image='https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png'),
        TextPart(type='text', text='What is in this image?'),
    ]
)

For lower-level control using ImageChunk directly, see Vision inputs in the Training and Sampling guide.

Using Qwen3VLRenderer

The Qwen3VLRenderer and Qwen3VLInstructRenderer handle Qwen's vision special tokens (<|vision_start|>, <|vision_end|>) automatically:

from tinker_cookbook import renderers, tokenizer_utils
from tinker_cookbook.image_processing_utils import get_image_processor
 
model_name = "Qwen/Qwen3-VL-235B-A22B-Instruct"
tokenizer = tokenizer_utils.get_tokenizer(model_name)
image_processor = get_image_processor(model_name)
 
renderer = renderers.Qwen3VLInstructRenderer(tokenizer, image_processor)
 
messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'image', 'image': 'https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png'},
            {'type': 'text', 'text': 'What is in this image?'},
        ]
    }
]
 
prompt = renderer.build_generation_prompt(messages)

For a complete example of training a VLM image classifier, see the VLM Classifier recipe in the cookbook.

Multi-turn RL and the Extension Property

When using renderers in multi-turn RL, an important consideration is whether consecutive timesteps satisfy the extension property—where each observation is a prefix extension of the previous observation plus action. This affects compute efficiency (O(T) vs O(T^2)) and KV-cache reuse.

Some renderers, like Qwen3Renderer, have options that affect this property. For example, strip_thinking_from_history controls whether <think> blocks are preserved in conversation history.

See the Sequence Extension documentation for details on how this works and the tradeoffs involved.
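As a rough illustration (this helper is hypothetical, not part of the renderer API), the property amounts to a token-level prefix check between consecutive timesteps:

def is_prefix_extension(prev_obs: list[int], prev_action: list[int], next_obs: list[int]) -> bool:
    # The next observation should start with (previous observation + sampled action),
    # so the sampler can reuse its KV cache instead of re-encoding the whole history.
    prefix = prev_obs + prev_action
    return next_obs[:len(prefix)] == prefix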

Appendix: Why not Jinja templates?

HuggingFace chat templates (Jinja2) are designed for a single use case: converting messages to tokens for inference. Tinker's renderer system handles the full training lifecycle, which requires capabilities that chat templates don't provide:

  1. Supervised learning requires per-token loss weights. Chat templates produce a flat token sequence; they can't distinguish which tokens are prompt (weight=0) vs completion (weight=1). The renderer's build_supervised_example method returns both tokens and weights.

  2. Parsing model output back into messages. After sampling, you need to convert tokens back into structured messages. This includes knowing the stop sequences (get_stop_sequences), extracting thinking blocks (<think>...</think>), and parsing tool calls. The parse_response method handles all of this, including graceful handling of malformed output.

  3. Tool calling details vary by model. Each model family has its own tool calling format (Qwen uses <tool_call> XML tags, DeepSeek uses special tokens, Llama3 doesn't support tool calling reliably). The renderers encode these formats correctly and parse tool calls from model output, including handling parse failures.

  4. Precise tokenization control. Tokenization can produce different results depending on how strings are split. For example, tokenizing "Hello" + " world" separately may differ from "Hello world". The renderers directly construct a token sequence, rather than strings, so they give you precise control over the token sequence, which is important for ensuring train-test consistency and KV-cache friendliness.
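As a quick illustration of the last point (a sketch; whether the ids differ depends on the tokenizer and the strings involved):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3-30B-A3B')
joint = tok.encode('Hello world', add_special_tokens=False)
split = tok.encode('Hello', add_special_tokens=False) + tok.encode(' world', add_special_tokens=False)
# Concatenating token ids from separately tokenized strings is not guaranteed
# to match tokenizing the concatenated string.
print(joint == split)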