Benchmarks Guide

Configuration

BenchmarkConfig controls how benchmarks run:

from tinker_cookbook.eval.benchmarks import run_benchmark, BenchmarkConfig

config = BenchmarkConfig(
    max_examples=200,         # Cap number of examples (None = all)
    concurrency=64,           # Parallel rollouts (single-turn)
    agent_concurrency=8,      # Parallel rollouts (multi-turn)
    timeout_seconds=300,      # Per-example timeout
    max_tokens=32768,         # Max generation tokens
    temperature=0.6,
    save_dir="evals/run_01",  # Save trajectories and results
)

result = await run_benchmark("gsm8k", sampling_client, renderer, config)

Model-Specific Defaults

BenchmarkConfig.for_model() sets max_tokens, timeout_seconds, and context_window from a built-in table:

config = BenchmarkConfig.for_model(
    "Qwen/Qwen3.5-35B-A3B",
    save_dir="evals/qwen3.5",
)
# max_tokens=65536, context_window=65536, timeout_seconds=1800

Multi-Turn Benchmarks

Agent benchmarks (swe_bench, terminal_bench, tau2_bench) need context management:

config = BenchmarkConfig(
    timeout_seconds=1800,
    max_trajectory_tokens=60000,   # Total tokens across all turns
    max_generation_tokens=8192,    # Tokens per generation step
    agent_concurrency=4,
)

Sandbox Benchmarks

Benchmarks that execute code (mbpp, livecodebench, swe_bench) require a sandbox:

from tinker_cookbook.sandbox.modal_sandbox import ModalSandbox

config = BenchmarkConfig(
    sandbox_factory=ModalSandbox.create,
)

Judge Benchmarks

Benchmarks like arena_hard require a separate LLM judge. See Judge Models for configuration details and examples.

Storing and Inspecting Results

Set save_dir to persist trajectories and results:

evals/run_01/
├── summary.json                    # Combined scores (run_benchmarks only)
├── gsm8k/
│   ├── result.json                 # BenchmarkResult
│   └── trajectories.jsonl          # One StoredTrajectory per line
└── ifeval/
    ├── result.json
    └── trajectories.jsonl

Loading Results

from tinker_cookbook.eval.benchmarks import (
    load_result,
    load_trajectories,
    load_summary,
    print_trajectory,
)

result = load_result("evals/run_01", "gsm8k")
print(f"Score: {result.score:.1%}, Completed: {result.score_completed:.1%}")

# Filter trajectories
wrong = load_trajectories("evals/run_01", "gsm8k", incorrect_only=True)
errors = load_trajectories("evals/run_01", "gsm8k", errors_only=True)
print_trajectory(wrong[0])

# Combined summary across benchmarks
summary = load_summary("evals/run_01")

Resumability

Rerunning with the same save_dir skips completed examples. Deduplication uses example_id (a content hash), so it's robust to dataset shuffling.
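The content-hash idea can be sketched in plain Python. This is an illustration of the principle, not `make_example_id`'s actual implementation, and `make_example_id_sketch` is a hypothetical name:

```python
import hashlib

def make_example_id_sketch(benchmark: str, content: str) -> str:
    # Hash the example content so the id is stable regardless of
    # dataset ordering; prefix with the benchmark name for readability.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    return f"{benchmark}-{digest}"

# The same question always maps to the same id, even after shuffling:
ids = {make_example_id_sketch("gsm8k", q) for q in ["2+2?", "3+5?", "2+2?"]}
print(len(ids))  # 2 unique ids for 2 unique questions
```

Because the id depends only on content, a rerun can match each example against what is already in trajectories.jsonl and skip it.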

Understanding Scores

  • score: num_correct / num_examples. Truncated and errored examples count as 0.
  • score_completed: num_correct / num_completed. Excludes truncated and errored examples from the denominator.

For thinking models that often hit max_tokens, score_completed is the better comparison against published scores.

result = await run_benchmark("gsm8k", client, renderer, config)
print(f"Raw: {result.score:.1%}")                # 81.7%
print(f"Completed: {result.score_completed:.1%}") # 95.6%
print(f"{result.num_truncated} truncated, {result.num_errors} errors")
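The relationship between the two scores is just arithmetic over the counts. A sketch with made-up numbers (the counts below are invented for illustration):

```python
# Hypothetical counts from a single benchmark run:
num_examples = 200
num_truncated = 25
num_errors = 4
num_correct = 163

num_completed = num_examples - num_truncated - num_errors  # 171

score = num_correct / num_examples             # truncated/errored count as 0
score_completed = num_correct / num_completed  # denominator excludes them

print(f"{score:.1%} vs {score_completed:.1%}")  # 81.5% vs 95.3%
```

The gap between the two numbers grows with the truncation rate, which is why score_completed is the fairer metric for models that frequently run out of generation budget.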

Comparing Checkpoints

checkpoints = {
    "base": "tinker://run-abc/sampler_weights/step0",
    "step500": "tinker://run-abc/sampler_weights/step500",
    "final": "tinker://run-abc/sampler_weights/final",
}

for name, path in checkpoints.items():
    client = await sc.create_sampling_client_async(model_path=path)
    results = await run_benchmarks(
        ["gsm8k", "ifeval"], client, renderer,
        BenchmarkConfig(save_dir=f"evals/{name}"),
    )
    for bench, r in results.items():
        print(f"{name}/{bench}: {r.score:.1%}")

EvalStore for Multi-Run Tracking

EvalStore manages evaluation runs across checkpoints with cloud-compatible storage:

from tinker_cookbook.stores.eval_store import EvalStore

store = EvalStore("~/experiments/evals")

run_id = store.create_run(
    model_name="Qwen/Qwen3.5-35B-A3B",
    checkpoint_path="tinker://run-123/weights/step500",
    checkpoint_name="step500",
    benchmarks=["gsm8k", "ifeval"],
)

config = BenchmarkConfig(save_dir=store.run_dir(run_id))
await run_benchmarks(["gsm8k", "ifeval"], client, renderer, config)

metadata = store.finalize_run(run_id)
print(metadata.scores)  # {"gsm8k": 0.847, "ifeval": 0.923}

Pass@k Evaluation

Set num_samples > 1 to compute unbiased pass@k estimates (Codex formula):

config = BenchmarkConfig(num_samples=4, save_dir="evals/pass_at_k")
result = await run_benchmark("gsm8k", sampling_client, renderer, config)
print(result.pass_at_k)  # {1: 0.45, 2: 0.58, 4: 0.72}
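The unbiased estimator behind pass_at_k is short enough to write out. This is the standard Codex-paper formula, shown as a standalone sketch rather than the cookbook's exact code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n total hits one of the c correct ones."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=4, c=1, k=1))  # 0.25
print(pass_at_k(n=4, c=1, k=4))  # 1.0
```

With num_samples=4, each example contributes its own (n=4, c) pair, and the reported pass_at_k values are these estimates averaged over examples.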

Customization

For system prompts, custom grading, answer parsing, and judge model configuration, see Customizing Benchmarks.

Adding a New Benchmark

Benchmarks reuse the same Env protocol as RL training — a single MessageEnv can drive both training and evaluation. See Shared Env Abstraction for the conceptual overview.

To add a benchmark, implement a BenchmarkBuilder and register it:

from tinker_cookbook.eval.benchmarks import BenchmarkBuilder, BenchmarkConfig, register
from tinker_cookbook.eval.benchmarks._common import load_benchmark_dataset, make_example_id
from tinker_cookbook.rl.message_env import MessageEnv, MessageStepResult, EnvFromMessageEnv
from tinker_cookbook.renderers import Message, get_text_content


class MyMessageEnv(MessageEnv):
    """Single-turn env for one example."""

    def __init__(self, question: str, expected: str, example_id: str):
        self.question = question
        self.expected = expected
        self._example_id = example_id

    @property
    def example_id(self) -> str:
        return self._example_id

    async def initial_observation(self) -> list[Message]:
        return [{"role": "user", "content": self.question}]

    async def step(self, message: Message) -> MessageStepResult:
        response = get_text_content(message).lower()
        correct = self.expected.lower() in response
        return MessageStepResult(
            reward=1.0 if correct else 0.0,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
            logs={"expected": self.expected},
        )


class MyBenchmark(BenchmarkBuilder):
    name = "my_benchmark"
    recommended_system_prompt = "Answer concisely."

    def make_envs(self, renderer, config):
        ds = load_benchmark_dataset("my/dataset", split="test")
        envs = []
        for row in ds:
            env = EnvFromMessageEnv(
                renderer=renderer,
                message_env=MyMessageEnv(
                    question=row["question"],
                    expected=row["answer"],
                    example_id=make_example_id("my_benchmark", row["question"]),
                ),
            )
            envs.append(env)
        return envs

register(MyBenchmark())

Key points:

  • Use MessageEnv for single-turn benchmarks — it handles prompt building and response parsing
  • Set example_id via make_example_id() for stable resumability across runs
  • logs stores diagnostic data per trajectory (expected answer, extracted answer); metrics stores numeric values for aggregation
  • For multi-turn benchmarks, implement Env directly and set multi_turn = True on the builder

Learn More