OpenAI API Compatible Inference (in beta)
OpenAI-compatible inference lets you interact with any model checkpoint in Tinker, using an endpoint compatible with the OpenAI Completions API. It’s designed to let you easily “poke at” your model while you're training it.
For inference within your training runs (e.g. RL), we recommend using Tinker’s standard sampling client.
Currently, OpenAI-compatible inference is meant for testing and internal use at low traffic volumes, rather than large, high-throughput, user-facing deployments. Latency and throughput may vary by model and may change without notice during the beta. If you need higher or more stable throughput, contact the Tinker team in our Discord for guidance on larger-scale setups.
Use Cases
OpenAI-compatible inference is designed for:
- Fast feedback while training: Start sampling very quickly from any sampler checkpoint obtained during training.
- Sampling while training continues: Sample even while the training job is still running on that experiment.
- Developer & internal workflows: Intended for testing, evaluation, and internal tools.
We will release production-grade inference soon and will update our users then.
Using OpenAI-compatible inference from an OpenAI client
The new interface exposes an OpenAI-compatible HTTP API. You can use any OpenAI SDK or HTTP client that lets you override the base URL.
1. Set the base URL of your OpenAI-compatible client to:
   `https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1`
2. Use a Tinker sampler weight path as the model name. For example:
   `tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080`
   Any valid Tinker sampler checkpoint path works here. You can keep training and sample from the same checkpoint simultaneously.
3. Authenticate with your Tinker API key by passing it as the API key to the OpenAI client (it’s the same key you already use for Tinker).
Note: We support both `/completions` and `/chat/completions` endpoints. Chat requests are rendered with the model’s default Hugging Face chat template; if your checkpoint expects a different renderer, render the prompt yourself (see Rendering) and use `/completions`.
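If your checkpoint’s default chat template is the right one, a chat request is a small variation on the completions flow shown below. Here is a minimal sketch; the message content is illustrative:

```python
from os import getenv

from openai import OpenAI

client = OpenAI(
    base_url="https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1",
    api_key=getenv("TINKER_API_KEY"),
)

# /chat/completions renders the messages with the model's default
# Hugging Face chat template before sampling.
response = client.chat.completions.create(
    model="tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080",
    messages=[{"role": "user", "content": "Give me one fun fact about Paris."}],
    max_tokens=100,
    temperature=0.7,
)

print(response.choices[0].message.content)
```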
Code Example
```python
from os import getenv

from openai import OpenAI

BASE_URL = "https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1"
MODEL_PATH = "tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080"

# Authenticate with the same API key you use for Tinker.
api_key = getenv("TINKER_API_KEY")

client = OpenAI(
    base_url=BASE_URL,
    api_key=api_key,
)

# Sample a short completion from the checkpoint.
response = client.completions.create(
    model=MODEL_PATH,
    prompt="The capital of France is",
    max_tokens=50,
    temperature=0.7,
    top_p=0.9,
)

print(response.choices[0].text)
```

Notes:
- `BASE_URL` points to the OpenAI-compatible inference endpoint.
- `MODEL_PATH` is a sampler checkpoint path from Tinker (`tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080`).
- The rest of the arguments (`prompt`, `max_tokens`, `temperature`, `top_p`) behave like they do in the OpenAI Completions API.
- You can swap `MODEL_PATH` to any other sampler checkpoint to compare runs quickly in your evals or notebooks, as in the sketch below.
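To make that comparison concrete, here is a minimal sketch that sends the same prompt to several checkpoints; the second path is a placeholder for whichever other sampler checkpoint you want to compare:

```python
from os import getenv

from openai import OpenAI

client = OpenAI(
    base_url="https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1",
    api_key=getenv("TINKER_API_KEY"),
)

# Sampler checkpoints to compare; the second path is a placeholder,
# not a real checkpoint.
CHECKPOINTS = [
    "tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080",
    "tinker://<your-run-id>:train:0/sampler_weights/<step>",
]

for path in CHECKPOINTS:
    response = client.completions.create(
        model=path,
        prompt="The capital of France is",
        max_tokens=50,
        temperature=0.7,
    )
    print(f"{path}\n  {response.choices[0].text.strip()}\n")
```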