Tutorial 01: Hello Tinker
Tinker is a remote GPU service for LLM training and inference. You write training loops in Python on your local machine; Tinker executes the heavy GPU operations (forward passes, backpropagation, sampling) on remote workers.
Your machine (CPU)                  Tinker Service (GPU)
+-----------------------+           +------------------------+
| Python training loop  | --------> | Forward/backward pass  |
| Data preparation      | <-------- | Optimizer steps        |
| Evaluation logic      |           | Text generation        |
+-----------------------+           +------------------------+
You control the logic. Tinker runs the compute.
import warnings
warnings.filterwarnings("ignore", message="IProgress not found")
import tinker
from tinker import types
The client hierarchy
The entry point to Tinker is the ServiceClient. From it, you create specialized clients:
- SamplingClient -- generates text from a model (inference)
- TrainingClient -- runs forward/backward passes and optimizer steps (training)
Both talk to the same remote GPU workers. Let's start with the ServiceClient.
# Create a ServiceClient. This reads TINKER_API_KEY from your environment.
service_client = tinker.ServiceClient()
# Check what models are available
capabilities = await service_client.get_server_capabilities_async()
print("Available models:")
for model in capabilities.supported_models:
    print(f"  - {model.model_name}")
Output
Available models:
- deepseek-ai/DeepSeek-V3.1
- deepseek-ai/DeepSeek-V3.1-Base
- moonshotai/Kimi-K2-Thinking
- moonshotai/Kimi-K2.5
- moonshotai/Kimi-K2.5:peft:131072
- meta-llama/Llama-3.1-70B
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.2-1B
- meta-llama/Llama-3.2-3B
- meta-llama/Llama-3.3-70B-Instruct
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16:peft:262144
- Qwen/Qwen3-235B-A22B-Instruct-2507
- Qwen/Qwen3-30B-A3B
- Qwen/Qwen3-30B-A3B-Base
- Qwen/Qwen3-30B-A3B-Instruct-2507
- Qwen/Qwen3-32B
- Qwen/Qwen3-4B-Instruct-2507
- Qwen/Qwen3-8B
- Qwen/Qwen3-8B-Base
- Qwen/Qwen3-VL-235B-A22B-Instruct
- Qwen/Qwen3-VL-30B-A3B-Instruct
- Qwen/Qwen3.5-27B
- Qwen/Qwen3.5-35B-A3B
- Qwen/Qwen3.5-397B-A17B
- Qwen/Qwen3.5-397B-A17B:peft:262144
- Qwen/Qwen3.5-4B
- openai/gpt-oss-120b
- openai/gpt-oss-120b:peft:131072
- openai/gpt-oss-20b
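Rather than scanning this list by eye, you can filter it programmatically. A minimal sketch, using a few names copied from the output above in place of the real `capabilities.supported_models` (in practice you would build the list with `[m.model_name for m in capabilities.supported_models]`):

```python
# A few model names copied from the capabilities listing above.
model_names = [
    "Qwen/Qwen3-4B-Instruct-2507",
    "Qwen/Qwen3-8B-Base",
    "openai/gpt-oss-120b",
    "openai/gpt-oss-120b:peft:131072",
]

# Keep instruct-tuned Qwen models, skipping the ":peft:" variants.
qwen_instruct = [
    name for name in model_names
    if name.startswith("Qwen/") and "Instruct" in name and ":peft:" not in name
]
print(qwen_instruct)  # ['Qwen/Qwen3-4B-Instruct-2507']
```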
Sampling from a model
Let's create a SamplingClient to generate text. We will use Qwen/Qwen3-4B-Instruct-2507, a compact model that keeps costs low.
The sampling workflow is:
1. Create a SamplingClient with a base model name
2. Encode your prompt into tokens using the model's tokenizer
3. Call sample() with the prompt and sampling parameters
4. Decode the returned tokens back into text
MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"
# Create a sampling client -- this connects to a remote GPU worker
sampling_client = await service_client.create_sampling_client_async(base_model=MODEL_NAME)
# Get the tokenizer for encoding/decoding text
tokenizer = sampling_client.get_tokenizer()
# Encode a prompt into tokens
prompt_text = "The three largest cities in the world by population are"
prompt = types.ModelInput.from_ints(tokenizer.encode(prompt_text))
# Sample a completion
params = types.SamplingParams(max_tokens=50, temperature=0.7, stop=["\n"])
result = await sampling_client.sample_async(prompt=prompt, sampling_params=params, num_samples=1)
# Decode and print
completion_tokens = result.sequences[0].tokens
print(prompt_text + tokenizer.decode(completion_tokens))
Output
Inspecting the response
The sample() call returns a SampleResponse containing a list of SampledSequence objects. Each sequence has:
- tokens -- the generated token IDs
- logprobs -- log probability of each generated token (if requested)
- stop_reason -- why generation stopped (e.g., hit max tokens, hit a stop string)
_seq = result.sequences[0]
print(f"Stop reason: {_seq.stop_reason}")
print(f"Tokens generated: {len(_seq.tokens)}")
print(f"Token IDs: {_seq.tokens[:10]}...")
print(f"Log probs: {_seq.logprobs}")
Output
Stop reason: length
Tokens generated: 50
Token IDs: [26194, 11, 37047, 11, 323, 21996, 11, 448, 26194, 3432]...
Log probs: [-0.5095227956771851, -0.004706732928752899, -2.459627389907837, -0.0007908792467787862, 0.0, -0.02780775912106037, -4.47653865814209, -0.09116745740175247, -0.6356776356697083, -0.24344521760940552, -1.3870069980621338, -0.368590772151947, -0.0031325577292591333, -0.21025529503822327, -6.460402488708496, -0.06543213129043579, -2.1195731163024902, -0.6492109894752502, -0.13893219828605652, -0.08088330924510956, -0.0002803409588523209, -0.027131833136081696, -0.0028139064088463783, -0.25534406304359436, -0.04747806861996651, -0.27529969811439514, -0.07974009215831757, -1.1192808151245117, 0.0, -0.9094678163528442, -0.002930040005594492, -2.3841855067985307e-07, -0.11119800060987473, -0.004901773761957884, -0.0008088654140010476, -0.019186854362487793, -1.2159273865108844e-05, 0.0, -0.09361772984266281, 0.0, -9.798523387871683e-05, -0.22930651903152466, -2.622600959512056e-06, -0.01948232762515545, 0.0, -0.0020399729255586863, 0.0, 0.0, 0.0, -1.764281842042692e-05]
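The logprobs are natural-log token probabilities, so they can be aggregated into summary statistics such as the sequence's average log probability and perplexity. A small sketch, using made-up values in place of `_seq.logprobs`:

```python
import math

# Stand-in values; in practice use the list returned in _seq.logprobs.
logprobs = [-0.51, -0.005, -2.46, -0.0008, 0.0]

avg_logprob = sum(logprobs) / len(logprobs)
perplexity = math.exp(-avg_logprob)  # lower means the model was more confident

print(f"Average logprob: {avg_logprob:.4f}")
print(f"Perplexity:      {perplexity:.4f}")
```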
You can also generate multiple samples at once by setting num_samples. Each sample is an independent completion from the same prompt.
result_1 = await sampling_client.sample_async(
prompt=prompt,
sampling_params=types.SamplingParams(max_tokens=50, temperature=0.9, stop=["\n"]),
num_samples=3,
)
for i, _seq in enumerate(result_1.sequences):
    text = tokenizer.decode(_seq.tokens)
    print(f"Sample {i}: {text}")
Output
Sample 0: : Tokyo, Delhi, and Shanghai. Based on this information, which of the following is true?
Sample 1: Beijing, Tokyo, and Delhi. The population of Beijing is 22 million, Tokyo is 13 million, and Delhi is 16 million. What is the average of the three cities' populations?
Sample 2: London, Tokyo, and Shanghai. What are the city's populations in millions of people?
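When you draw several samples, the per-token logprobs give a natural way to rank them, for example by length-normalized log probability. A sketch with made-up `(tokens, logprobs)` pairs standing in for the `SampledSequence` objects in `result_1.sequences`:

```python
# (tokens, logprobs) pairs standing in for SampledSequence objects.
samples = [
    ([101, 102], [-0.2, -0.3]),
    ([103, 104, 105], [-0.1, -0.1, -0.1]),
]

def score(sample):
    """Length-normalized log probability, to avoid penalizing longer completions."""
    _tokens, logprobs = sample
    return sum(logprobs) / len(logprobs)

best = max(samples, key=score)
print(best[0])  # token IDs of the highest-scoring sample: [103, 104, 105]
```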
What about training?
So far we have only done inference. The real power of Tinker is training -- running forward/backward passes and optimizer steps on remote GPUs while you control the training loop locally.
The workflow looks like this:
- Create a TrainingClient with service_client.create_lora_training_client()
- Prepare training data as Datum objects (input tokens + loss targets)
- Call training_client.forward_backward() to compute gradients
- Call training_client.optim_step() to update weights
- Save weights and create a SamplingClient to evaluate the trained model
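As a preview of the data-preparation step, the "input tokens + loss targets" idea can be sketched in plain Python: each training example pairs token IDs with per-token weights so that the loss is computed only on the completion, not the prompt. The helper below is illustrative only; the exact Datum fields are covered in the next tutorial.

```python
def make_weights(prompt_tokens, completion_tokens):
    """Per-token loss weights: 0 for prompt tokens, 1 for completion tokens."""
    tokens = prompt_tokens + completion_tokens
    weights = [0.0] * len(prompt_tokens) + [1.0] * len(completion_tokens)
    return tokens, weights

tokens, weights = make_weights([1, 2, 3], [4, 5])
print(tokens)   # [1, 2, 3, 4, 5]
print(weights)  # [0.0, 0.0, 0.0, 1.0, 1.0]
```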
We will walk through this in the next tutorial.
Next steps
- Tutorial 02: First SFT -- Train a model with supervised fine-tuning
- Quick Start -- Full walkthrough of training and sampling
- Models & Pricing -- All supported models and their characteristics