# Loss Functions
Tinker provides built-in loss functions for supervised learning and reinforcement learning. You select a loss by passing a string to `forward_backward`:
```python
future = await training_client.forward_backward_async(data, loss_fn="cross_entropy")
result = await future.result_async()
```
## How it works
Each training example is a Datum — a single sequence with all the information needed to compute the loss. A Datum contains:
- `model_input` — the token sequence (shape: `(N,)`, where N is the sequence length)
- `loss_fn_inputs` — a dict of tensors that the loss function needs (targets, weights, advantages, etc.)
The key design principle: everything the loss needs is in the Datum. This means each Datum is self-contained — no batch-level state, no external lookups. The simplest example is cross-entropy for SFT:
```python
import tinker
from tinker import types

# A single training example: predict target_tokens from input_tokens
datum = types.Datum(
    model_input=types.ModelInput.from_ints(input_tokens),  # shape: (N,)
    loss_fn_inputs={
        "target_tokens": target_tokens,  # shape: (N,) — what to predict at each position
        "weights": weights,              # shape: (N,) — 0 for prompt, 1 for completion
    },
)

# forward_backward computes the loss and gradients in one call
future = await training_client.forward_backward_async([datum], loss_fn="cross_entropy")
result = await future.result_async()
print(f"Loss: {result.loss}")
```
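To make the shapes concrete, here is one common way to build these tensors from a prompt/completion pair for next-token prediction: concatenate the two, shift the targets by one position, and zero out the weights on prompt positions. This is a sketch with made-up token ids, not a Tinker helper:

```python
import numpy as np

# Hypothetical token ids; in practice these come from a tokenizer.
prompt = [101, 7592, 102]
completion = [2023, 2003, 102]

full = prompt + completion
input_tokens = np.array(full[:-1])   # model sees everything but the last token
target_tokens = np.array(full[1:])   # predict the next token at each position
# Zero weight on positions whose target is still part of the prompt.
weights = np.array([0.0] * (len(prompt) - 1) + [1.0] * len(completion))

assert input_tokens.shape == target_tokens.shape == weights.shape
```

With this construction the completion tokens each contribute one cross-entropy term, while the prompt contributes nothing to the loss.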
For RL losses, the Datum also includes the sampling log-probabilities and advantages:
```python
rl_datum = types.Datum(
    model_input=types.ModelInput.from_ints(tokens),  # shape: (N,)
    loss_fn_inputs={
        "target_tokens": target_tokens,    # shape: (N,)
        "weights": weights,                # shape: (N,)
        "logprobs": sampling_logprobs,     # shape: (N,) — from the rollout policy
        "advantages": advantages,          # shape: (N,) — reward signal per token
    },
)

future = await training_client.forward_backward_async([rl_datum], loss_fn="importance_sampling")
```
`forward_backward` returns a `ForwardBackwardOutput` with output tensors in `result.loss_fn_outputs` (e.g., the model's logprobs for each token).
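These outputs can be useful for sanity checks. For example, per-token logprobs combine with the weights to recompute the weighted negative log-likelihood by hand, matching the summed cross-entropy convention (a sketch with stand-in values, assuming the output dict exposes a `logprobs` tensor of shape `(N,)`):

```python
import numpy as np

# Stand-in values; in practice these come from result.loss_fn_outputs and the Datum.
logprobs = np.array([-0.5, -1.0, -0.25, -2.0])  # model logprob of each target token
weights = np.array([0.0, 1.0, 1.0, 1.0])        # prompt positions carry zero weight

# Weighted NLL summed over the sequence
nll = -(weights * logprobs).sum()
print(nll)  # sum of the weighted terms: 0 + 1.0 + 0.25 + 2.0
```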
## At a glance
| Loss | Use case | Key idea |
|---|---|---|
| `cross_entropy` | Supervised learning | Maximize log-probability of target tokens |
| `importance_sampling` | RL (policy gradient) | Correct for off-policy sampling with \(p/q\) ratio |
| `ppo` | RL (clipped) | Clip the \(p/q\) ratio to prevent large updates |
| `cispo` | RL (clipped grad) | Clip the ratio but use it as a gradient coefficient |
| `dro` | RL (off-policy) | Quadratic penalty on policy divergence |
| `forward_backward_custom` | Any | Write arbitrary loss over logprobs |
## Notation
We denote the training model as \(p_{\theta}\), the sampling distribution as \(q\), and advantages as \(A\). For notational simplicity we omit the query and denote the model's full completion sequence of tokens as \(x\).
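Under this notation, the RL objectives in the table take roughly the following standard token-level forms, writing \(r_t = p_\theta(x_t \mid x_{<t}) / q(x_t \mid x_{<t})\) for the importance ratio. These are sketches based on the common definitions in the literature, not the exact implemented expressions:

```latex
% Importance sampling (REINFORCE with an off-policy correction)
L_{\mathrm{IS}}(\theta) = -\sum_t r_t \, A_t,
\qquad r_t = \frac{p_\theta(x_t \mid x_{<t})}{q(x_t \mid x_{<t})}

% PPO: clip the ratio so a single update cannot move too far
L_{\mathrm{PPO}}(\theta) = -\sum_t \min\bigl(r_t A_t,\ \mathrm{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\, A_t\bigr)

% CISPO: use the clipped, stop-gradient ratio as a coefficient on the logprob gradient
L_{\mathrm{CISPO}}(\theta) = -\sum_t \mathrm{sg}\bigl[\mathrm{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\bigr]\, A_t \log p_\theta(x_t \mid x_{<t})
```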
All losses are applied at the token level. Unless noted otherwise, tensors have shape `(N,)`, where N is `model_input.length`. They can be provided as `numpy.ndarray` or `torch.Tensor`, and the return values will use the same tensor type.
Additional notes on RL losses:
- The loss formulations are quite general, since you organize data generation and advantage estimation in your own code. For example, the main RL training scripts in the Tinker Cookbook use group-based rollouts with per-group advantage centering, similar to GRPO (Shao et al., 2024).
- The functional implementations of REINFORCE and PPO do not use an additional KL term like the original GRPO work, a formulation that has been noted to be mathematically inconsistent (Zhang et al., 2025; Tang et al., 2025). However, it is possible to include a KL regularization term as part of the reward, which is mathematically correct, and we provide this option in our RL training code and examples (see the `incorporate_kl_penalty` function).
- For all objectives we sum the token-level losses over the sequence length, unlike some other loss implementations. If you would like to explore different aggregation schemes, you can fold them into the advantage tensor computation.
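For instance, per-group advantage centering and a mean-style aggregation can both be folded into the advantage tensor before building each `Datum`. This is a sketch with made-up rewards, not code from the Cookbook:

```python
import numpy as np

# Hypothetical scalar rewards for four rollouts sampled from the same prompt
rewards = np.array([1.0, 0.0, 0.0, 1.0])
centered = rewards - rewards.mean()  # per-group centering, similar to GRPO

# Broadcast each rollout's scalar advantage over its tokens. Dividing by the
# token count here would turn the summed token losses into a per-token mean.
lengths = [5, 3, 4, 6]
advantages = [np.full(n, a) for a, n in zip(centered, lengths)]

assert abs(centered.sum()) < 1e-9  # centering makes the group sum to zero
```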