Tutorial 205: Evaluations

Prerequisites

Your First SFT

Run it interactively

curl -O https://raw.githubusercontent.com/thinking-machines-lab/tinker-cookbook/main/tutorials/205_evaluations.py && uv run marimo edit 205_evaluations.py

Tinker's evaluation system uses two abstract classes:

TrainingClientEvaluator -- uses the training client (forward passes) to compute metrics like NLL
SamplingClientEvaluator -- uses the sampling client (generation) to compute metrics like accuracy

Both return dict[str, float] and plug into train.Config.evaluator_builders for automatic evaluation during training.

In this tutorial you will:

Implement a TrainingClientEvaluator that computes NLL on held-out data
Implement a SamplingClientEvaluator that samples answers and checks correctness
Wire evaluators into train.Config via evaluator_builders
Learn about the Inspect AI integration for standardized benchmarks

import warnings

warnings.filterwarnings("ignore", message="IProgress not found")

import tinker
import torch
from tinker import TensorData

from tinker_cookbook.eval.evaluators import (
    SamplingClientEvaluator,
    TrainingClientEvaluator,
)
from tinker_cookbook.renderers import get_renderer, get_text_content

The evaluator pattern

Both evaluator types are async callables with a simple contract:

class TrainingClientEvaluator:
    async def __call__(self, training_client: tinker.TrainingClient) -> dict[str, float]:
        ...

class SamplingClientEvaluator:
    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
        ...

The training loop calls your evaluator periodically and logs the returned metrics. The keys become metric names (e.g., "eval/nll", "eval/accuracy").

Setup

Create a training client and prepare some evaluation data.

MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

service_client = tinker.ServiceClient()
training_client = await service_client.create_lora_training_client_async(
    base_model=MODEL_NAME, rank=16
)
tokenizer = training_client.get_tokenizer()
renderer = get_renderer("qwen3_instruct", tokenizer)

# Prepare held-out SFT data for the NLL evaluator
eval_examples = [
    "The speed of light is approximately 3 * 10^8 meters per second.",
    "Water freezes at 0 degrees Celsius under standard pressure.",
    "The Earth orbits the Sun once every 365.25 days.",
]

eval_datums = []
for text in eval_examples:
    ids = tokenizer.encode(text)
    model_input = tinker.ModelInput.from_ints(ids[:-1])
    target_tokens = ids[1:]
    w = [1.0] * len(target_tokens)
    eval_datums.append(
        tinker.Datum(
            model_input=model_input,
            loss_fn_inputs={
                "target_tokens": TensorData.from_torch(torch.tensor(target_tokens)),
                "weights": TensorData.from_torch(torch.tensor(w)),
            },
        )
    )

print(f"Prepared {len(eval_datums)} evaluation datums")

Output

Prepared 3 evaluation datums

Implementing a TrainingClientEvaluator: NLL

A TrainingClientEvaluator receives the current TrainingClient and can run forward passes to compute metrics. Here we compute the mean negative log-likelihood (NLL) on held-out data -- a standard measure of how well the model predicts the evaluation text.

class NLLEvaluator(TrainingClientEvaluator):
    """Compute mean NLL on held-out data using forward passes."""

    def __init__(self, eval_data: list[tinker.Datum], name: str = "eval"):
        self.eval_data = eval_data
        self.name = name

    async def __call__(self, training_client: tinker.TrainingClient) -> dict[str, float]:
        # Run a forward pass (no gradients) to get logprobs
        future = await training_client.forward_async(
            self.eval_data, loss_fn="cross_entropy"
        )
        result = await future.result_async()

        # Compute weighted mean NLL
        total_nll = 0.0
        total_tokens = 0
        for datum, output in zip(self.eval_data, result.loss_fn_outputs):
            logprobs = torch.tensor(output["logprobs"])
            weights = torch.tensor(datum.loss_fn_inputs["weights"])
            total_nll += -(logprobs * weights).sum().item()
            total_tokens += weights.sum().item()

        mean_nll = total_nll / max(total_tokens, 1)
        return {f"{self.name}/nll": mean_nll}

# Test the evaluator
nll_evaluator = NLLEvaluator(eval_datums, name="held_out")
nll_metrics = await nll_evaluator(training_client)
print(f"NLL evaluation: {nll_metrics}")

Implementing a SamplingClientEvaluator: Accuracy

A SamplingClientEvaluator receives a SamplingClient and generates text to compute metrics. Here we sample answers to simple factual questions and check if they contain the expected answer.

class AccuracyEvaluator(SamplingClientEvaluator):
    """Sample answers and check if they contain the expected string."""

    def __init__(self, questions_and_answers, renderer, tokenizer):
        self.qa_pairs = questions_and_answers
        self.renderer = renderer
        self.tokenizer = tokenizer

    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
        correct = 0

        for question, expected_answer in self.qa_pairs:
            messages = [{"role": "user", "content": question}]
            prompt = self.renderer.build_generation_prompt(messages)
            stop = self.renderer.get_stop_sequences()

            result = await sampling_client.sample_async(
                prompt=prompt,
                sampling_params=tinker.SamplingParams(
                    max_tokens=64, temperature=0.0, stop=stop
                ),
                num_samples=1,
            )

            tokens = result.sequences[0].tokens
            parsed, _ = self.renderer.parse_response(tokens)
            response_text = get_text_content(parsed).lower()

            if expected_answer.lower() in response_text:
                correct += 1

        accuracy = correct / len(self.qa_pairs)
        return {"eval/accuracy": accuracy, "eval/correct": float(correct)}

# Create a sampling client from current weights
sampling_client = await training_client.save_weights_and_get_sampling_client_async()

# Define test questions
test_qa = [
    ("What is the capital of Japan?", "tokyo"),
    ("What is 15 + 27?", "42"),
    ("What element has the symbol 'O'?", "oxygen"),
]

accuracy_evaluator = AccuracyEvaluator(test_qa, renderer, tokenizer)
acc_metrics = await accuracy_evaluator(sampling_client)
print(f"Accuracy evaluation: {acc_metrics}")

Output

Accuracy evaluation: {'eval/accuracy': 1.0, 'eval/correct': 3.0}

Wiring evaluators into train.Config

The train.Config classes in tinker_cookbook.supervised.train and tinker_cookbook.rl.train accept an evaluator_builders parameter. Each builder is a zero-argument callable that returns an evaluator instance.

The training loop calls each builder once at startup, then runs the evaluators periodically (controlled by eval_every).

Supervised training with evaluators

from tinker_cookbook.supervised import train

def make_nll_evaluator():
    # Build eval data here (or capture it from outer scope)
    return NLLEvaluator(eval_datums, name="validation")

def make_accuracy_evaluator():
    return AccuracyEvaluator(test_qa, renderer, tokenizer)

config = train.Config(
    log_path="~/logs/sft-with-evals",
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    dataset_builder=my_dataset_builder,
    learning_rate=1e-4,

    # Evaluator builders -- called once at startup, run every eval_every steps
    evaluator_builders=[make_nll_evaluator, make_accuracy_evaluator],
    eval_every=50,  # run evaluators every 50 steps
)

The SFT training loop detects the evaluator type automatically: - TrainingClientEvaluator is called with the training client - SamplingClientEvaluator is called with a sampling client created from current weights

RL training with evaluators

The RL train.Config works the same way, though it only accepts SamplingClientEvaluatorBuilder:

from tinker_cookbook.rl import train

config = train.Config(
    log_path="~/logs/rl-with-evals",
    model_name="meta-llama/Llama-3.1-8B",
    dataset_builder=my_rl_dataset_builder,

    evaluator_builders=[make_accuracy_evaluator],
    eval_every=10,
)

Built-in evaluator: NLLEvaluator

The cookbook ships a production-ready NLL evaluator at tinker_cookbook.supervised.nll_evaluator.NLLEvaluator. It can be constructed from a SupervisedDataset:

from tinker_cookbook.supervised.nll_evaluator import NLLEvaluator

# From a dataset object
nll_eval = NLLEvaluator.from_dataset(eval_dataset, name="test")

# Or from raw datums
nll_eval = NLLEvaluator(data=eval_datums, name="validation")

Inspect AI integration

For standardized benchmarks (MMLU, GSM8K, HumanEval, etc.), the cookbook integrates with Inspect AI. The InspectEvaluator wraps any Inspect task as a SamplingClientEvaluator:

from tinker_cookbook.eval.run_inspect_evals import Config

# Run Inspect evals standalone
config = Config(
    model_path="tinker://run-id/sampler_weights/final",
    # ... Inspect task configuration ...
)

Inspect AI provides a large library of pre-built evaluation tasks, so you can benchmark your fine-tuned model against established benchmarks without writing custom evaluation code.

Summary

Evaluator type	Receives	Typical metrics	Example
`TrainingClientEvaluator`	`TrainingClient`	NLL, perplexity	Forward pass on held-out data
`SamplingClientEvaluator`	`SamplingClient`	Accuracy, reward	Generate and grade answers

Key points: - Evaluators are async callables returning dict[str, float] - Wire them into training via evaluator_builders (list of zero-arg factories) - eval_every controls how often they run - Use TrainingClientEvaluator for metrics that only need a forward pass (fast) - Use SamplingClientEvaluator for metrics that need generation (slower but more informative) - For standard benchmarks, use the Inspect AI integration