Tutorial 103: Efficient Sampling with Tinker

Run it interactively [source]

curl -O https://raw.githubusercontent.com/thinking-machines-lab/tinker-cookbook/main/tutorials/103_async_patterns.py && marimo edit 103_async_patterns.py

Tinker runs on remote GPUs. Every API call involves network latency plus GPU compute time. If you send sampling requests one at a time -- send, wait, send, wait -- you spend most of your time idle while Tinker works.

The solution: send requests concurrently with asyncio.gather. Tinker can batch and pipeline concurrent requests on the GPU, so N requests take far less than N times the cost of one request. This matters most for sampling, where RL training may require hundreds of completions per step.
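
The latency argument can be reproduced without a Tinker account. The sketch below is a simulation, not Tinker code: `fake_sample` is a hypothetical stand-in that sleeps for an assumed 50 ms round trip, and the script is written to run standalone with `asyncio.run` rather than inside a notebook cell.

```python
import asyncio
import time

LATENCY_S = 0.05  # assumed per-request round trip, chosen only for this demo


async def fake_sample(prompt: str) -> str:
    """Hypothetical stand-in for a remote sampling call."""
    await asyncio.sleep(LATENCY_S)  # simulates network + GPU time
    return f"completion for {prompt!r}"


async def run_sequential(prompts: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    results = []
    for p in prompts:
        # Each request finishes before the next one is sent
        results.append(await fake_sample(p))
    return results, time.perf_counter() - start


async def run_concurrent(prompts: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    # All requests are in flight together; their waits overlap
    results = await asyncio.gather(*(fake_sample(p) for p in prompts))
    return list(results), time.perf_counter() - start


async def compare(n: int = 8) -> tuple[float, float]:
    prompts = [f"prompt {i}" for i in range(n)]
    seq_results, seq_time = await run_sequential(prompts)
    conc_results, conc_time = await run_concurrent(prompts)
    assert seq_results == conc_results  # same answers, different wall time
    return seq_time, conc_time


if __name__ == "__main__":
    seq_time, conc_time = asyncio.run(compare())
    print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

In this simulation the waits overlap perfectly, so the concurrent run costs roughly one round trip. Against a real service the speedup depends on how much the server can batch, so treat the ratio as an upper bound rather than a prediction.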

import asyncio
import os
import time
import warnings

warnings.filterwarnings("ignore", message="IProgress not found")

import marimo as mo
import tinker

from tinker_cookbook.renderers import get_renderer, get_text_content

Setup

We create a SamplingClient for the base Qwen3.5-4B model (no fine-tuning needed for this tutorial). We also set up a renderer to handle the chat template and a list of diverse prompts to sample from.

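# Password input for the API key (shown as cell output in marimo)
api_key = mo.ui.text(kind="password", label="Paste your Tinker API key")
api_key  # noqa: B018
# Stop here until a key is available from the environment or the input above
mo.stop(
    "TINKER_API_KEY" not in os.environ and not api_key.value,
    "Paste your API key above",
)

# A key typed into the input takes precedence over the environment
if api_key.value:
    os.environ["TINKER_API_KEY"] = api_key.value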

BASE_MODEL = "Qwen/Qwen3.5-4B"

service_client = tinker.ServiceClient()
sampling_client = await service_client.create_sampling_client_async(base_model=BASE_MODEL)
tokenizer = sampling_client.get_tokenizer()
renderer = get_renderer("qwen3_5", tokenizer)

stop_sequences = renderer.get_stop_sequences()
params = tinker.SamplingParams(max_tokens=150, temperature=0.7, stop=stop_sequences)

# A diverse set of prompts to sample from
prompts = [
    "What causes thunder?",
    "Write a haiku about the ocean.",
    "What is the capital of New Zealand?",
    "Explain what a hash table is in two sentences.",
    "Name three inventions from the 19th century.",
    "Why do leaves change color in autumn?",
    "Translate to Spanish: The library closes at nine.",
    "What is the smallest prime number greater than 50?",
]

print(f"Model: {BASE_MODEL}")
print(f"Prompts: {len(prompts)}")
Output
Model: Qwen/Qwen3.5-4B
Prompts: 8

Sequential sampling (the slow way)

The simplest approach: for each prompt, build the generation input, await sample_async() to block until it finishes, then move on to the next. Each request waits for the previous one to complete before starting.

_start = time.time()
sequential_results = []
for _prompt_text in prompts:
    _messages = [{"role": "user", "content": _prompt_text}]
    _model_input = renderer.build_generation_prompt(_messages)
    # Block on each request before sending the next
    _result = await sampling_client.sample_async(
        prompt=_model_input, num_samples=1, sampling_params=params
    )
    _response_msg, _ = renderer.parse_response(_result.sequences[0].tokens)
    sequential_results.append(get_text_content(_response_msg))
sequential_time = time.time() - _start
for _prompt_text, _answer in zip(prompts, sequential_results):
    print(f"Q: {_prompt_text}")
    print(f"A: {_answer[:120]}...\n")
print(
    f"Sequential: {sequential_time:.1f}s for {len(prompts)} prompts ({sequential_time / len(prompts):.1f}s each)"
)
Output
Q: What causes thunder?
A: Here's a thinking process that leads to the explanation of thunder's cause:

1.  **Analyze the Request:**
    *   **Ques...

Q: Write a haiku about the ocean.
A: Thinking Process:

1.  **Analyze the Request:**
    *   Topic: The ocean.
    *   Form: Haiku (5-7-5 syllables).

2.  **...

Q: What is the capital of New Zealand?
A: Thinking Process:

1.  **Identify the core question:** The user is asking for the capital of New Zealand.
2.  **Retrieve...

Q: Explain what a hash table is in two sentences.
A: Thinking Process:

1.  **Analyze the Request:**
    *   Topic: Hash table.
    *   Constraint: Exactly two sentences.
  ...

Q: Name three inventions from the 19th century.
A: Thinking Process:

1.  **Analyze the Request:**
    *   Task: Name three inventions.
    *   Constraint: They must be fr...

Q: Why do leaves change color in autumn?
A: Here's a thinking process that leads to the explanation of why leaves change color in autumn:

1.  **Deconstruct the Req...

Q: Translate to Spanish: The library closes at nine.
A: We need to translate "The library closes at nine." into Spanish. Let's break it down.

First, "The library" = "La biblio...

Q: What is the smallest prime number greater than 50?
A: Thinking Process:

1.  **Analyze the Request:** The user is asking for the smallest prime number that is greater than 50...

Sequential: 14.2s for 8 prompts (1.8s each)

Concurrent sampling with asyncio.gather

asyncio.gather wraps every sample_async() coroutine in a task and runs them on the event loop together, so all the requests go out before any of them finishes; it then waits for the whole batch. The rule to remember: submit all requests first, then collect results. gather returns results in the same order as its inputs, which is why the code below can zip prompts with concurrent_results. Tinker batches concurrent requests on the GPU for higher throughput.

_start = time.time()

async def _sample_one(_prompt_text):
    _messages = [{"role": "user", "content": _prompt_text}]
    _model_input = renderer.build_generation_prompt(_messages)
    return await sampling_client.sample_async(
        prompt=_model_input, num_samples=1, sampling_params=params
    )

# Step 1: Submit ALL requests concurrently using asyncio.gather
_results = await asyncio.gather(*[_sample_one(p) for p in prompts])

# Step 2: Parse results (all requests ran concurrently)
concurrent_results = []
for _result in _results:
    _response_msg, _ = renderer.parse_response(_result.sequences[0].tokens)
    concurrent_results.append(get_text_content(_response_msg))
concurrent_time = time.time() - _start
for _prompt_text, _answer in zip(prompts, concurrent_results):
    print(f"Q: {_prompt_text}")
    print(f"A: {_answer[:120]}...\n")
print(f"Concurrent: {concurrent_time:.1f}s for {len(prompts)} prompts")
print(f"Sequential: {sequential_time:.1f}s for {len(prompts)} prompts")
print(f"Speedup: {sequential_time / concurrent_time:.1f}x")
Output
Q: What causes thunder?
A: Here's a thinking process that leads to the explanation of thunder:

1.  **Analyze the Request:**
    *   **Question:** ...

Q: Write a haiku about the ocean.
A: Thinking Process:

1.  **Analyze the Request:**
    *   Topic: The ocean.
    *   Form: Haiku (5-7-5 syllables).

2.  **...

Q: What is the capital of New Zealand?
A: The capital of New Zealand is **Wellington**....

Q: Explain what a hash table is in two sentences.
A: Thinking Process:

1.  **Analyze the Request:**
    *   Topic: Hash table.
    *   Constraint: Exactly two sentences.
  ...

Q: Name three inventions from the 19th century.
A: Thinking Process:

1.  **Analyze the Request:**
    *   Task: Name three inventions.
    *   Constraint: They must be fr...

Q: Why do leaves change color in autumn?
A: Here's a thinking process that leads to the explanation of why leaves change color in autumn:

1.  **Analyze the Request...

Q: Translate to Spanish: The library closes at nine.
A: We need to translate the sentence "The library closes at nine." into Spanish. Let's break it down.

First, identify the ...

Q: What is the smallest prime number greater than 50?
A: Thinking Process:

1.  **Identify the goal:** The user wants to find the smallest prime number greater than 50.

2.  **D...

Concurrent: 2.2s for 8 prompts
Sequential: 14.2s for 8 prompts
Speedup: 6.5x
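
Pairing prompts with answers via `zip` relies on `asyncio.gather` returning results in input order, even when requests finish in a different order. The sketch below illustrates that guarantee with a hypothetical `fake_sample` whose delays are chosen so the last request submitted finishes first; it is a standalone simulation and does not call Tinker.

```python
import asyncio


async def demo() -> tuple[list[str], list[str]]:
    finished: list[str] = []  # records the order in which requests complete

    async def fake_sample(prompt: str, delay_s: float) -> str:
        """Hypothetical stand-in for a sampling call with a given latency."""
        await asyncio.sleep(delay_s)
        finished.append(prompt)
        return prompt.upper()

    # Submitted slowest-first so completion order is the reverse of input order
    jobs = [("slow", 0.06), ("medium", 0.03), ("fast", 0.01)]
    results = await asyncio.gather(*(fake_sample(p, d) for p, d in jobs))
    return list(results), finished


if __name__ == "__main__":
    results, finished = asyncio.run(demo())
    print("gather results:  ", results)
    print("completion order:", finished)
```

Because the result list follows submission order, no request IDs or lookup tables are needed to match each completion back to its prompt.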

Multiple completions per prompt (num_samples)

In GRPO-style RL, you need group_size independent completions for each problem so you can compare them and compute advantages. The num_samples parameter generates multiple completions in a single API call -- more efficient than sending separate requests for the same prompt.

_GROUP_SIZE = 4
test_prompt = "Name a famous scientist and explain their key contribution in one sentence."
_messages = [{"role": "user", "content": test_prompt}]
_model_input = renderer.build_generation_prompt(_messages)
# Single call with num_samples=4: generates 4 independent completions
_start = time.time()
_result = await sampling_client.sample_async(
    prompt=_model_input, num_samples=_GROUP_SIZE, sampling_params=params
)
multi_time = time.time() - _start

print(f"Prompt: {test_prompt}\n")
for i, _seq in enumerate(_result.sequences):
    _response_msg, _ = renderer.parse_response(_seq.tokens)
    text = get_text_content(_response_msg)
    print(f"Completion {i + 1}: {text[:150]}\n")

# Compare: 4 sequential single calls
_start = time.time()
for _ in range(_GROUP_SIZE):
    await sampling_client.sample_async(
        prompt=_model_input, num_samples=1, sampling_params=params
    )
sequential_multi_time = time.time() - _start

print(f"num_samples={_GROUP_SIZE} in one call: {multi_time:.1f}s")
print(f"{_GROUP_SIZE} sequential calls:        {sequential_multi_time:.1f}s")
print(f"Speedup: {sequential_multi_time / multi_time:.1f}x")
Output
Prompt: Name a famous scientist and explain their key contribution in one sentence.

Completion 1: Thinking Process:

1.  **Analyze the Request:**
    *   Task: Name a famous scientist and explain their key contribution.
    *   Constraint: Explain 

Completion 2: Thinking Process:

1.  **Analyze the Request:**
    *   Task: Name a famous scientist.
    *   Task: Explain their key contribution.
    *   Constrain

Completion 3: Thinking Process:

1.  **Analyze the Request:**
    *   Target: Name a famous scientist.
    *   Task: Explain their key contribution.
    *   Constra

Completion 4: Thinking Process:

1.  **Analyze the Request:**
    *   Task: Name a famous scientist and explain their key contribution.
    *   Constraint: Explain 

num_samples=4 in one call: 1.9s
4 sequential calls:        6.6s
Speedup: 3.4x
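
To make "compare them and compute advantages" concrete, here is a minimal sketch of one common group-relative formulation: subtract the group's mean reward from each completion's reward, and optionally divide by the group's standard deviation. The function name, the `normalize` flag, and the `eps` value are illustrative assumptions; this tutorial does not say which variant Tinker's training recipes use.

```python
from statistics import fmean, pstdev


def group_advantages(
    rewards: list[float], normalize: bool = False, eps: float = 1e-6
) -> list[float]:
    """Advantage of each completion relative to its own group (illustrative)."""
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered
    # Optional variant: scale by the group's spread; eps avoids division by zero
    std = pstdev(rewards)
    return [c / (std + eps) for c in centered]


if __name__ == "__main__":
    # Hypothetical rewards for a group of 4 completions of the same prompt
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every completion scored the same carries no signal
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

A group whose completions all receive the same reward yields all-zero advantages, which is one reason several completions per prompt are sampled: the comparison needs variation within the group.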

Putting it together: batch evaluation

Combine both techniques -- concurrent requests across prompts and num_samples per prompt -- for maximum throughput. This is exactly the pattern used in RL training: submit many sampling requests in parallel, each generating a group of completions, then collect and grade them all.

_GROUP_SIZE = 4
_start = time.time()

# Submit all requests concurrently using asyncio.gather, each with num_samples=GROUP_SIZE
async def _sample_group(_prompt_text):
    _messages = [{"role": "user", "content": _prompt_text}]
    _model_input = renderer.build_generation_prompt(_messages)
    _result = await sampling_client.sample_async(
        prompt=_model_input, num_samples=_GROUP_SIZE, sampling_params=params
    )
    return _prompt_text, _result

_results = await asyncio.gather(*[_sample_group(p) for p in prompts])

# Collect all results
total_completions = 0
for _prompt_text, _result in _results:
    completions = []
    for _seq in _result.sequences:
        _response_msg, _ = renderer.parse_response(_seq.tokens)
        completions.append(get_text_content(_response_msg))
    total_completions += len(completions)
    print(f"Q: {_prompt_text}")
    print(f"   ({len(completions)} completions, showing first): {completions[0][:100]}...\n")
batch_time = time.time() - _start
print(f"Total: {total_completions} completions in {batch_time:.1f}s")
print(f"Throughput: {total_completions / batch_time:.1f} completions/second")
Output
Q: What causes thunder?
   (4 completions, showing first): Here's a thinking process that leads to the explanation of thunder:

1.  **Analyze the Request:**
  ...

Q: Write a haiku about the ocean.
   (4 completions, showing first): Thinking Process:

1.  **Analyze the Request:**
    *   Topic: The ocean.
    *   Form: Haiku (5-7-5...

Q: What is the capital of New Zealand?
   (4 completions, showing first): Thinking Process:

1.  **Identify the core question:** The user is asking for the capital city of Ne...

Q: Explain what a hash table is in two sentences.
   (4 completions, showing first): Thinking Process:

1.  **Analyze the Request:**
    *   Topic: Hash Table.
    *   Constraint: Exact...

Q: Name three inventions from the 19th century.
   (4 completions, showing first): Thinking Process:

1.  **Analyze the Request:**
    *   Task: Name three inventions.
    *   Constra...

Q: Why do leaves change color in autumn?
   (4 completions, showing first): Here's a thinking process that leads to the explanation of why leaves change color in autumn:

1.  *...

Q: Translate to Spanish: The library closes at nine.
   (4 completions, showing first): We need to translate the sentence "The library closes at nine." from English to Spanish. First, let'...

Q: What is the smallest prime number greater than 50?
   (4 completions, showing first): Thinking Process:

1.  **Analyze the Request:** The user is asking for the smallest prime number tha...

Total: 32 completions in 2.2s
Throughput: 14.8 completions/second
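
The "collect and grade" half of the pattern can be sketched in the same shape. Everything below is a simulation under stated assumptions: `fake_sample_group` stands in for a sampling call with several samples per prompt, and `toy_reward` is a placeholder grader with no meaning beyond the demo. Neither is part of Tinker.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Graded:
    prompt: str
    completion: str
    reward: float


async def fake_sample_group(prompt: str, num_samples: int) -> list[str]:
    """Hypothetical stand-in for one sampling call returning a group."""
    await asyncio.sleep(0.01)  # simulated latency
    return [f"{prompt} / draft {i}" for i in range(num_samples)]


def toy_reward(completion: str) -> float:
    """Placeholder grader: rewards even-numbered drafts. Swap in a real check."""
    draft_index = int(completion.rsplit(" ", 1)[-1])
    return 1.0 if draft_index % 2 == 0 else 0.0


async def sample_and_grade(prompts: list[str], group_size: int) -> list[list[Graded]]:
    # One request per prompt, all in flight together; each returns a whole group
    groups = await asyncio.gather(
        *(fake_sample_group(p, group_size) for p in prompts)
    )
    # gather preserves input order, so groups[i] belongs to prompts[i]
    return [
        [Graded(p, c, toy_reward(c)) for c in group]
        for p, group in zip(prompts, groups)
    ]


if __name__ == "__main__":
    graded = asyncio.run(sample_and_grade(["a", "b", "c"], group_size=4))
    for group in graded:
        print(group[0].prompt, [g.reward for g in group])
```

Keeping the results grouped per prompt, rather than flattening them, leaves each group ready for a per-group comparison of rewards.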

Next steps

This tutorial showed the two key techniques for efficient sampling: concurrent requests with asyncio.gather (submit all requests before collecting results) and num_samples (generate multiple completions per call). Together, they give you high throughput with minimal code changes.

  • Tutorial 104 (104_first_rl.py): Uses this exact pattern -- sample many completions, grade them with a reward function, and train with GRPO.
  • Async docs (docs/async.mdx): Full reference for sync/async APIs, the double-await pattern, and overlapping training requests.