Tutorial 03: Efficient Sampling with Tinker
Tinker runs on remote GPUs. Every API call involves network latency plus GPU compute time. If you send sampling requests one at a time -- send, wait, send, wait -- you spend most of your time idle while Tinker works.
The solution: send requests concurrently as futures. Tinker can batch and pipeline concurrent requests on the GPU, so N requests take far less than N times the cost of one request. This matters most for sampling, where RL training may require hundreds of completions per step.
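To see why this matters independent of Tinker, here is a toy asyncio sketch: a 0.1 s sleep stands in for the network-plus-GPU latency of one sampling call, and the timings are illustrative, not measurements of Tinker itself.

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    # Stand-in for one remote sampling call: some latency, then a result
    await asyncio.sleep(0.1)
    return i

async def main() -> tuple[float, float]:
    # Sequential: each request waits for the previous one to finish
    start = time.time()
    seq = [await fake_request(i) for i in range(8)]
    seq_time = time.time() - start

    # Concurrent: all 8 requests are in flight at once
    start = time.time()
    conc = await asyncio.gather(*[fake_request(i) for i in range(8)])
    conc_time = time.time() - start

    assert seq == list(conc)  # same results, very different wall-clock time
    return seq_time, conc_time

seq_time, conc_time = asyncio.run(main())
print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
# sequential is roughly 8 x 0.1s; concurrent is roughly 0.1s total
```

The rest of this tutorial applies exactly this pattern to real sampling requests.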
import time
import warnings
warnings.filterwarnings("ignore", message="IProgress not found")
import tinker
from tinker_cookbook.renderers import get_renderer, get_text_content
Setup
We create a SamplingClient for the base Qwen3.5-4B model (no fine-tuning needed for this tutorial). We also set up a renderer to handle the chat template and a list of diverse prompts to sample from.
BASE_MODEL = "Qwen/Qwen3.5-4B"
service_client = tinker.ServiceClient()
sampling_client = await service_client.create_sampling_client_async(base_model=BASE_MODEL)
tokenizer = sampling_client.get_tokenizer()
renderer = get_renderer("qwen3_5", tokenizer)
stop_sequences = renderer.get_stop_sequences()
params = tinker.SamplingParams(max_tokens=150, temperature=0.7, stop=stop_sequences)
# A diverse set of prompts to sample from
prompts = [
"What causes thunder?",
"Write a haiku about the ocean.",
"What is the capital of New Zealand?",
"Explain what a hash table is in two sentences.",
"Name three inventions from the 19th century.",
"Why do leaves change color in autumn?",
"Translate to Spanish: The library closes at nine.",
"What is the smallest prime number greater than 50?",
]
print(f"Model: {BASE_MODEL}")
print(f"Prompts: {len(prompts)}")
Sequential sampling (the slow way)
The simplest approach: for each prompt, build the generation input, call sample_async(), and await it immediately so the loop blocks until that completion finishes before moving on to the next prompt. Each request waits for the previous one to complete before starting.
_start = time.time()
sequential_results = []
for _prompt_text in prompts:
_messages = [{"role": "user", "content": _prompt_text}]
_model_input = renderer.build_generation_prompt(_messages)
_result = await sampling_client.sample_async(
prompt=_model_input, num_samples=1, sampling_params=params
)
_response_msg, _ = renderer.parse_response(_result.sequences[0].tokens)
sequential_results.append(
get_text_content(_response_msg)
) # Block on each request before sending the next
sequential_time = time.time() - _start
for _prompt_text, _answer in zip(prompts, sequential_results):
print(f"Q: {_prompt_text}")
print(f"A: {_answer[:120]}...\n")
print(
f"Sequential: {sequential_time:.1f}s for {len(prompts)} prompts ({sequential_time / len(prompts):.1f}s each)"
)
Output
Q: What causes thunder?
A: Thinking Process:
1. **Analyze the Request:**
* Question: "What causes thunder?"
* Intent: The user wants ...
Q: Write a haiku about the ocean.
A: Thinking Process:
1. **Analyze the Request:**
* Topic: The ocean.
* Form: Haiku (5-7-5 syllables).
2. **...
Q: What is the capital of New Zealand?
A: The capital of New Zealand is **Wellington**....
Q: Explain what a hash table is in two sentences.
A: Thinking Process:
1. **Analyze the Request:**
* Topic: Hash table.
* Constraint: Explain in exactly two se...
Q: Name three inventions from the 19th century.
A: Thinking Process:
1. **Analyze the Request:**
* Task: Name three inventions.
* Constraint: From the 19th c...
Q: Why do leaves change color in autumn?
A: Here's a thinking process that leads to the explanation of why leaves change color in autumn:
1. **Analyze the Request...
Q: Translate to Spanish: The library closes at nine.
A: We need to translate "The library closes at nine." into Spanish. Let's break it down.
First, "The library" is "La bibli...
Q: What is the smallest prime number greater than 50?
A: Thinking Process:
1. **Analyze the Request:** The user is asking for the smallest prime number that is strictly greate...
Sequential: 33.5s for 8 prompts (4.2s each)
Concurrent sampling with futures
Each sample_async() call is submitted as soon as its coroutine starts running, so wrapping the calls in asyncio.gather puts every request in flight at once instead of awaiting them one by one. The key insight: submit all requests first, then collect results. Tinker batches concurrent requests on the GPU for higher throughput.
import asyncio
_start = time.time()
# Step 1: Submit ALL requests concurrently using asyncio.gather
async def _sample_one(_prompt_text):
_messages = [{"role": "user", "content": _prompt_text}]
_model_input = renderer.build_generation_prompt(_messages)
return await sampling_client.sample_async(prompt=_model_input, num_samples=1, sampling_params=params)
_results = await asyncio.gather(*[_sample_one(p) for p in prompts])
concurrent_results = []
for _result in _results:
# Step 2: Parse results (all requests were running in parallel)
_response_msg, _ = renderer.parse_response(_result.sequences[0].tokens)
concurrent_results.append(get_text_content(_response_msg))
concurrent_time = time.time() - _start
for _prompt_text, _answer in zip(prompts, concurrent_results):
print(f"Q: {_prompt_text}")
print(f"A: {_answer[:120]}...\n")
print(f"Concurrent: {concurrent_time:.1f}s for {len(prompts)} prompts")
print(f"Sequential: {sequential_time:.1f}s for {len(prompts)} prompts")
print(f"Speedup: {sequential_time / concurrent_time:.1f}x")
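With eight prompts, firing everything at once is fine; with hundreds, you may want to cap how many requests are in flight at a time. A minimal sketch using asyncio.Semaphore (the limit of 32 is an arbitrary illustration, not a Tinker requirement, and the `_demo` coroutine is a stand-in for a real sampling call):

```python
import asyncio

async def gather_bounded(coros, limit: int = 32):
    # Like asyncio.gather, but at most `limit` coroutines run at once
    sem = asyncio.Semaphore(limit)

    async def _run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*[_run(c) for c in coros])

async def _demo(i: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for a sampling request
    return i * i

results = asyncio.run(gather_bounded([_demo(i) for i in range(10)], limit=3))
print(results)  # gather preserves submission order: [0, 1, 4, 9, ...]
```

In the concurrent example above you could drop this in as `await gather_bounded([_sample_one(p) for p in prompts], limit=32)`.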
Multiple completions per prompt (num_samples)
In GRPO-style RL, you need group_size independent completions for each problem so you can compare them and compute advantages. The num_samples parameter generates multiple completions in a single API call -- more efficient than sending separate requests for the same prompt.
_GROUP_SIZE = 4
test_prompt = "Name a famous scientist and explain their key contribution in one sentence."
_messages = [{"role": "user", "content": test_prompt}]
_model_input = renderer.build_generation_prompt(_messages)
_start = time.time()
_result = await sampling_client.sample_async(
prompt=_model_input, num_samples=_GROUP_SIZE, sampling_params=params
)
# Single call with num_samples=4 -- generates 4 independent completions
multi_time = time.time() - _start
print(f"Prompt: {test_prompt}\n")
for i, _seq in enumerate(_result.sequences):
_response_msg, _ = renderer.parse_response(_seq.tokens)
text = get_text_content(_response_msg)
print(f"Completion {i + 1}: {text[:150]}\n")
_start = time.time()
for _ in range(_GROUP_SIZE):
await sampling_client.sample_async(prompt=_model_input, num_samples=1, sampling_params=params)
sequential_multi_time = time.time() - _start
print(f"num_samples={_GROUP_SIZE} in one call: {multi_time:.1f}s")
print(f"{_GROUP_SIZE} sequential calls: {sequential_multi_time:.1f}s")
# Compare: 4 sequential single calls
print(f"Speedup: {sequential_multi_time / multi_time:.1f}x")
Output
Prompt: Name a famous scientist and explain their key contribution in one sentence.
Completion 1: Thinking Process:
1. **Analyze the Request:**
* Task: Name a famous scientist and explain their key contribution.
* Constraint: Do it in
Completion 2: Thinking Process:
1. **Analyze the Request:**
* Target: Name a famous scientist.
* Task: Explain their key contribution.
* Constra
Completion 3: Thinking Process:
1. **Analyze the Request:**
* Task: Name a famous scientist and explain their key contribution.
* Constraint: Explain
Completion 4: Thinking Process:
1. **Analyze the Request:**
* Task: Name a famous scientist and explain their key contribution.
* Constraint: Explain
num_samples=4 in one call: 5.0s
4 sequential calls: 18.1s
Speedup: 3.6x
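Once you have a group of completions, GRPO-style training grades each one and compares rewards within the group. Here is a sketch of that advantage computation; the rewards are made-up numbers, and mean-centering with std normalization is one common GRPO variant, not necessarily the exact recipe used later in this series.

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Center each reward against its group's mean and scale by the group's std
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up rewards for a group of 4 completions of the same prompt
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # above-mean completions get positive advantage, below-mean negative
```

This is why the group must come from the same prompt: the advantages only make sense relative to siblings answering the same question.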
Putting it together: batch evaluation
Combine both techniques -- concurrent futures across prompts and num_samples per prompt -- for maximum throughput. This is exactly the pattern used in RL training: submit many sampling requests in parallel, each generating a group of completions, then collect and grade them all.
import asyncio
_GROUP_SIZE = 4
_start = time.time()
# Submit all requests concurrently using asyncio.gather, each with num_samples=GROUP_SIZE
async def _sample_group(_prompt_text):
_messages = [{"role": "user", "content": _prompt_text}]
_model_input = renderer.build_generation_prompt(_messages)
_result = await sampling_client.sample_async(
prompt=_model_input, num_samples=_GROUP_SIZE, sampling_params=params
)
return _prompt_text, _result
_results = await asyncio.gather(*[_sample_group(p) for p in prompts])
total_completions = 0
for _prompt_text, _result in _results:
completions = []
for _seq in _result.sequences:
# Collect all results
_response_msg, _ = renderer.parse_response(_seq.tokens)
completions.append(get_text_content(_response_msg))
total_completions += len(completions)
print(f"Q: {_prompt_text}")
print(f" ({len(completions)} completions, showing first): {completions[0][:100]}...\n")
batch_time = time.time() - _start
print(f"Total: {total_completions} completions in {batch_time:.1f}s")
print(f"Throughput: {total_completions / batch_time:.1f} completions/second")
Output
Q: What causes thunder?
(4 completions, showing first): Thunder is caused by the rapid expansion of air heated by a lightning bolt...
...
Total: 32 completions in 7.2s
Throughput: 4.4 completions/second
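In an RL loop, the collected completions would each be scored by a reward function before training. As a toy illustration of the "grade them all" step, here is a keyword-based grader; the keyword check and the example strings are stand-ins for a real reward function, not part of the Tinker API.

```python
def keyword_reward(completion: str, keywords: list[str]) -> float:
    # 1.0 if any expected keyword appears, else 0.0 -- a stand-in grader
    text = completion.lower()
    return 1.0 if any(k.lower() in text for k in keywords) else 0.0

# Hypothetical completions for "What causes thunder?"
group = [
    "Thunder is caused by the rapid expansion of air heated by lightning.",
    "It is the sound of clouds colliding.",
]
rewards = [keyword_reward(c, ["lightning", "expansion"]) for c in group]
print(rewards)  # [1.0, 0.0]
```

Swapping this for a real grader (exact-match checking, a verifier, or a reward model) is the main change needed to turn this notebook into the RL loop in the next tutorial.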
Next steps
This tutorial showed the two key techniques for efficient sampling: concurrent futures (submit all requests before collecting results) and num_samples (generate multiple completions per call). Together, they give you high throughput with minimal code changes.
- Tutorial 04: First RL: Uses this exact pattern -- sample many completions, grade them with a reward function, and train with GRPO.
- Clock Cycles & Pipelining: Full reference for sync/async APIs, the double-await pattern, and overlapping training requests.