# TINKER DOCUMENTATION
This file contains the core Tinker documentation (index, quickstart, and losses).
## File: index.mdx
# Tinker: a training API for researchers and developers
Tinker lets you focus on what matters in LLM fine-tuning – your data and algorithms – while we handle the heavy lifting of distributed training.
You write a simple loop that runs on your CPU-only machine, including the data or environment and the loss function. We figure out how to make the training work on a bunch of GPUs, doing the exact computation you specified, efficiently. To change the model you're working with, you only need to change a single string in your code.
Tinker gives you full control over the training loop and all the algorithmic details. It's not a magic black box that makes fine-tuning "easy". It's a clean abstraction that shields you from the complexity of distributed training while preserving your control.
Here's how the division of responsibilities works in practice:
| **You focus on** | **You write** | **We handle** |
|---|---|---|
| 📊 **Datasets and RL environments**<br/>Your custom training data | 💻 **Simple Python script**<br/>Runs on your CPU | ⚡ **Efficient distributed training of large models**<br/>Llama, Qwen, and more |
| 🎯 **Training logic**<br/>Your loss functions, training loop, and evals | 🔧 **API calls**<br/>`forward_backward()`<br/>`optim_step()`<br/>`sample()` | 🛡️ **Reliability**<br/>Hardware failures handled transparently |
## Features
What the Tinker service currently supports:
- Tinker lets you fine-tune open-weight models like the Qwen and Llama series, including large mixture-of-experts models like Qwen3-235B-A22B.
- Tinker implements low-rank adaptation (LoRA) fine-tuning, not full fine-tuning. However, we believe that LoRA gives the same performance as full fine-tuning for many important use cases, especially in RL (see [LoRA Without Regret](https://thinkingmachines.ai/blog/lora/)).
- You can download the weights of your trained model to use outside of Tinker, for example with your inference provider of choice.
## A quick look at functionality
Tinker's main functionality is contained in a few key functions:
- `forward_backward`: feed in your data and loss function, and we'll compute and accumulate the gradients for you.
- `optim_step`: update your model using the accumulated gradients.
- `sample`: generate outputs from your trained model.
- Other functions handle saving and loading weights and optimizer state.
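To make this concrete, here's a minimal sketch of a supervised training loop built from these calls. It's illustrative rather than copy-pasteable: the client construction, `AdamParams`, the model name, and the learning rate are placeholders written from memory of the SDK, and data preparation is elided. The point is that the loop is plain Python running on your machine, while each call executes on our infrastructure.
```python
import tinker

# Placeholder setup; see the quickstart for the exact client construction.
service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="<base-model-name>",  # change models by changing this one string
)

# Your dataset, prepared as lists of tinker.Datum (elided here).
batches: list[list[tinker.Datum]] = []

for batch in batches:
    # Compute and accumulate gradients for this batch on the service.
    training_client.forward_backward(batch, loss_fn="cross_entropy")
    # Apply the accumulated gradients with one optimizer update.
    training_client.optim_step(tinker.AdamParams(learning_rate=1e-4))
```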
## What's next?
Some features we expect to support in the future:
- Image input for applicable models
- Full fine-tuning
---
## File: losses.mdx
# Loss functions in Tinker
For most use cases, you can use the Tinker API's built-in loss functions by passing in a string identifier to `forward_backward`, which supports cross entropy and policy gradient objectives. When you need more control, `forward_backward_custom` enables arbitrary differentiable loss functions at the cost of an additional forward pass; we explain both approaches in this doc.
When you call `forward_backward`, you specify a loss function using a string that selects from a predetermined set of options, comprising the most common losses used for language model training.
- **Input:** `forward_backward` expects a certain set of input tensors, passed in via `datum.loss_fn_inputs`, which is a dict mapping `str` to either a numpy or torch tensor
- **Output:** `forward_backward` returns a `ForwardBackwardOutput`, which has a set of output tensors in `fwd_bwd_result.loss_fn_outputs`
For an example of using `forward_backward`, see `rl/train.py` in the Cookbook:
```python
async def forward_backward(
    training_client: tinker.TrainingClient,
    batch_d: List[tinker.Datum],
) -> List[torch.Tensor]:
    """Accumulate gradients on a minibatch of data"""
    fwd_bwd_future = await training_client.forward_backward_async(
        list(map(remove_mask, batch_d)), loss_fn="importance_sampling"
    )
    fwd_bwd_result = await fwd_bwd_future.result_async()
    # Extract training logprobs from loss_fn_outputs
    training_logprobs_D: list[torch.Tensor] = []
    for output in fwd_bwd_result.loss_fn_outputs:
        training_logprobs = output["logprobs"].to_torch()
        training_logprobs_D.append(training_logprobs)
    return training_logprobs_D
```
## Basic loss functions
Currently, the Tinker API supports `cross_entropy` (for supervised learning), `importance_sampling` (for RL), and `ppo` (for RL).
All tensors below have shape `(N,)` where `N` is `model_input.length`. They can be provided as `numpy.ndarray` or `torch.Tensor`, and the return values will use the same tensor type.
### Supervised learning: `cross_entropy`
For SL, we implement the standard cross-entropy loss (i.e., negative-log-likelihood), which optimizes the policy $p_\theta$ to maximize the log-probability of the tokens $x$:
$$
\mathcal{L}(\theta) = -\mathbb{E}_x[\log p_\theta(x)]
$$
In practice, this looks like `-(weights * logp(target_tokens)).sum()`, where each entry of `weights` is either 0 or 1, typically generated by `renderers.build_supervised_example` (i.e., to train only on the desired assistant turns).
- **Input tensors:**
- `target_tokens: array[(N,), int]` - Target token IDs
- `weights: array[(N,), float]` - Token-level loss weights (typically from the renderer)
- **Output tensors:**
- `logprobs: array[(N,), float]` - Log probabilities of predicted tokens
- **Output diagnostics:**
- `loss:sum` (scalar) - Sum of weighted cross-entropy losses
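To make the formula concrete, here is a tiny PyTorch sketch with made-up values (an illustration of the arithmetic above, not the service implementation):
```python
import torch

# Illustrative values only: per-token log-probabilities of `target_tokens`
# under the model, shape (N,).
logprobs = torch.tensor([-0.2, -1.5, -0.7, -3.0])
# Renderer-style mask: train only on the last two tokens (the assistant turn).
weights = torch.tensor([0.0, 0.0, 1.0, 1.0])

# The `loss:sum` diagnostic: sum of weighted negative log-likelihoods.
loss_sum = -(weights * logprobs).sum()
print(loss_sum)  # tensor(3.7000)
```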
### Policy gradient: `importance_sampling`
For RL, we implement a common variant of the policy gradient objective, used in practical settings where the *learner policy* $p$ may differ from the *sampling policy* $q$, which is common due to e.g. [non-determinism](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). To remove the bias caused by this difference, we can use a modified "importance sampling" objective:
$$
\nabla \mathbb{E}_{x\sim p_\theta}\bigl[r(x) \bigr] = \mathbb{E}_{x\sim q}\Bigl[r(x) \cdot \frac{\nabla p_\theta(x)}{q(x)}\Bigr]
$$
which recovers the correct gradient of the expected reward in expectation. This is implemented as `(exp(new_logprobs - logprobs) * advantages).sum()`, where `advantages` may additionally subtract a baseline from the rewards. Note that this objective applies to the bandit setting (each sampled sequence is treated as a single action), which is common in both RLHF and RLVR setups.
- **Input tensors:**
- `target_tokens: array[(N,), int]` - Target token IDs (from the sampler $q$)
- `logprobs: array[(N,), float]` - Reference log probabilities $q$ for the target tokens
- `advantages: array[(N,), float]` - Advantage values for RL
- **Output tensors:**
- `logprobs: array[(N,), float]` - Log probabilities $p$ for the target tokens
- **Output diagnostics:**
- `loss:sum` (scalar) - Sum of importance-weighted policy gradient losses
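Here is the analogous illustration for `importance_sampling`, restating `(exp(new_logprobs - logprobs) * advantages).sum()` with made-up per-token values (again, not the service implementation):
```python
import torch

# Illustrative per-token values for one sampled sequence:
sampler_logprobs = torch.tensor([-0.9, -1.2, -0.4])  # input `logprobs`, from the sampling policy q
learner_logprobs = torch.tensor([-1.0, -1.1, -0.5])  # output `logprobs`, from the learner p_theta
advantages = torch.tensor([0.7, 0.7, 0.7])           # e.g. reward minus baseline, broadcast per token

# The `loss:sum` diagnostic: importance-weighted policy gradient objective.
ratio = torch.exp(learner_logprobs - sampler_logprobs)
loss_sum = (ratio * advantages).sum()
```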
**Addendum:** Let's consider naively applying the policy gradient objective when $q \neq p$:
$$
\begin{align*}
\mathbb{E}_{x\sim q}\bigl[ r(x) \cdot \nabla \log p_\theta(x) \bigr] &= \sum_x q(x) r(x) \cdot \nabla \log p_\theta(x) \\
&= \sum_x q(x) (r(x) - \bar{r}) \nabla \log p_\theta(x) + \sum_x q(x) \bar{r} \cdot \nabla \log p_\theta(x) \\
&= \mathbb{E}_{x\sim q}\bigl[(r(x) - \bar{r}) \nabla \log p_\theta(x)\bigr] - \bar{r} \cdot \nabla KL(q \Vert p)
\end{align*}
$$
where $\bar{r} = \sum_x q(x) r(x)$, effectively an average-reward baseline.
- The first expectation term resembles a pseudo-policy gradient, increasing the log-likelihood of tokens $x$ which achieve higher-than-average rewards. (It is not an actual policy gradient, because $q \neq p$.)
- The second KL term is effectively a bias term which can destabilize RL optimization. This bias increases as the divergence $KL(q \Vert p)$ grows or as the average reward $\bar{r}$ moves away from zero.
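A quick way to convince yourself of this is to compare both estimators against the true gradient on a toy categorical bandit. The sketch below is plain PyTorch with arbitrary values and nothing Tinker-specific; all expectations are computed exactly over the vocabulary, so there is no sampling noise:
```python
import torch

# Toy bandit over a 5-token vocabulary: learner p_theta = softmax(theta),
# sampler q is a different fixed distribution, r is a per-token reward.
torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)
q = torch.softmax(torch.randn(5), dim=0)  # sampling policy (fixed)
r = torch.rand(5)                         # reward for each token

def grad_of(objective: torch.Tensor) -> torch.Tensor:
    return torch.autograd.grad(objective, theta)[0]

# True gradient of the expected reward E_{x~p_theta}[r(x)].
p = torch.softmax(theta, dim=0)
true_grad = grad_of((p * r).sum())

# Naive "policy gradient" under q: E_{x~q}[r(x) * grad log p_theta(x)].
logp = torch.log_softmax(theta, dim=0)
naive_grad = grad_of((q * r * logp).sum())

# Importance-sampled objective: E_{x~q}[r(x) * p_theta(x) / q(x)],
# i.e. the exp(new_logprobs - logprobs) * advantages form used above.
logp = torch.log_softmax(theta, dim=0)
is_grad = grad_of((q * r * torch.exp(logp - torch.log(q))).sum())

print(torch.allclose(is_grad, true_grad, atol=1e-6))     # should print True: unbiased
print(torch.allclose(naive_grad, true_grad, atol=1e-6))  # typically prints False: biased
```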
## Flexible loss functions: `forward_backward_custom`
For use cases not covered by the built-in losses above, we provide the more flexible (but slower) methods `forward_backward_custom` and `forward_backward_custom_async`, which support a more general class of loss functions.
### Usage
Here's a simple example of a custom loss function:
```python
import torch
from tinker import Datum

def logprob_squared_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    # `logprobs` contains one per-token logprob tensor per Datum; concatenate before squaring.
    loss = (torch.cat(logprobs) ** 2).sum()
    return loss, {"logprob_squared_loss": loss.item()}
```
You can call this loss function with `forward_backward_custom` like:
```python
loss, metrics = training_client.forward_backward_custom(data, logprob_squared_loss)
```
You can also define loss functions that operate on multiple sequences at a time. For example (though not practically useful), a loss function that computes the variance of log-probabilities across sequences can be implemented as:
```python
def variance_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    flat_logprobs = torch.cat(logprobs)
    variance = torch.var(flat_logprobs)
    return variance, {"variance_loss": variance.item()}
```
A more practical use case is a Bradley-Terry loss on pairwise comparison data, a classic approach in RL from human feedback popularized by [Learning to Summarize](https://arxiv.org/abs/2009.01325). [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) likewise computes a loss over pairs of sequences; see the [DPO guide](/preferences/dpo-guide) for more details.
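As a rough sketch of the pairwise idea only (not the exact formulation from either paper), suppose the batch is ordered as alternating (chosen, rejected) pairs and each sequence is scored by the sum of its per-token logprobs; both of those conventions, and the function name, are assumptions made for this example:
```python
import torch
from tinker import Datum

def pairwise_preference_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    # Assumes data is ordered as (chosen_0, rejected_0, chosen_1, rejected_1, ...).
    # Score each sequence by the sum of its per-token logprobs (a simplification).
    scores = torch.stack([lp.sum() for lp in logprobs])
    chosen, rejected = scores[0::2], scores[1::2]
    # Bradley-Terry-style objective: prefer the chosen sequence in each pair.
    loss = -torch.nn.functional.logsigmoid(chosen - rejected).mean()
    return loss, {"pairwise_preference_loss": loss.item()}
```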
If you're using a custom loss function that you think is generally useful, please let us know, and we'll add it to the list of built-in loss functions.
We describe the `_async` versions of these methods in the [Async and Futures](./async) section of these docs.
### How `forward_backward_custom` works
---