Benchmarks Guide
Configuration
BenchmarkConfig controls how benchmarks run:
```python
from tinker_cookbook.eval.benchmarks import run_benchmark, BenchmarkConfig

config = BenchmarkConfig(
    max_examples=200,         # Cap number of examples (None = all)
    concurrency=64,           # Parallel rollouts (single-turn)
    agent_concurrency=8,      # Parallel rollouts (multi-turn)
    timeout_seconds=300,      # Per-example timeout
    max_tokens=32768,         # Max generation tokens
    temperature=0.6,
    save_dir="evals/run_01",  # Save trajectories and results
)

result = await run_benchmark("gsm8k", sampling_client, renderer, config)
```
Model-Specific Defaults
BenchmarkConfig.for_model() sets max_tokens, timeout_seconds, and context_window from a built-in table:
```python
config = BenchmarkConfig.for_model(
    "Qwen/Qwen3.5-35B-A3B",
    save_dir="evals/qwen3.5",
)
# max_tokens=65536, context_window=65536, timeout_seconds=1800
```
Multi-Turn Benchmarks
Agent benchmarks (swe_bench, terminal_bench, tau2_bench) need context management:
```python
config = BenchmarkConfig(
    timeout_seconds=1800,
    max_trajectory_tokens=60000,  # Total tokens across all turns
    max_generation_tokens=8192,   # Tokens per generation step
    agent_concurrency=4,
)
```
Sandbox Benchmarks
Benchmarks that execute code (mbpp, livecodebench, swe_bench) require a sandbox:
```python
from tinker_cookbook.sandbox.modal_sandbox import ModalSandbox

config = BenchmarkConfig(
    sandbox_factory=ModalSandbox.create,
)
```
Judge Benchmarks
Benchmarks like arena_hard require a separate LLM judge. See Judge Models for configuration details and examples.
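At its core, judge-based grading means prompting a second model to compare answers and mapping its verdict to a score. The sketch below illustrates that flow in plain Python; the prompt template, function names, and win/tie/loss mapping are illustrative assumptions, not tinker_cookbook's API.

```python
# Illustrative sketch of LLM-as-judge grading (not the library's actual API).
# A judge model sees the baseline and candidate answers side by side and
# emits a single verdict token, which is parsed into a candidate score.

JUDGE_PROMPT = """Compare two answers to the question below.
Question: {question}
Answer A: {baseline}
Answer B: {candidate}
Reply with exactly one token: A, B, or TIE."""


def build_judge_prompt(question: str, baseline: str, candidate: str) -> str:
    """Assemble the comparison prompt sent to the judge model."""
    return JUDGE_PROMPT.format(
        question=question, baseline=baseline, candidate=candidate
    )


def score_verdict(verdict: str) -> float:
    """Map the judge's verdict to the candidate's score: win=1, tie=0.5, loss=0.
    Unparseable verdicts fall back to a tie."""
    return {"A": 0.0, "TIE": 0.5, "B": 1.0}.get(verdict.strip().upper(), 0.5)
```

Real judge benchmarks also randomize answer order to control for position bias, which is one reason the configuration lives behind a dedicated interface.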
Storing and Inspecting Results
Set save_dir to persist trajectories and results:
```
evals/run_01/
├── summary.json              # Combined scores (run_benchmarks only)
├── gsm8k/
│   ├── result.json           # BenchmarkResult
│   └── trajectories.jsonl    # One StoredTrajectory per line
└── ifeval/
    ├── result.json
    └── trajectories.jsonl
```
Loading Results
```python
from tinker_cookbook.eval.benchmarks import (
    load_result,
    load_trajectories,
    load_summary,
    print_trajectory,
)

result = load_result("evals/run_01", "gsm8k")
print(f"Score: {result.score:.1%}, Completed: {result.score_completed:.1%}")

# Filter trajectories
wrong = load_trajectories("evals/run_01", "gsm8k", incorrect_only=True)
errors = load_trajectories("evals/run_01", "gsm8k", errors_only=True)
print_trajectory(wrong[0])

# Combined summary across benchmarks
summary = load_summary("evals/run_01")
```
Resumability
Rerunning with the same save_dir skips completed examples. Deduplication uses example_id (a content hash), so it's robust to dataset shuffling.
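The exact digest used by `example_id` is internal, but the idea behind content-hash deduplication can be sketched in a few lines (the hash function, truncation length, and id format below are assumptions for illustration):

```python
import hashlib


def content_hash_id(benchmark: str, content: str) -> str:
    """A content-derived example id: identical content always maps to the
    same id, so resuming with the same save_dir skips already-completed
    examples even if the dataset is reshuffled between runs."""
    digest = hashlib.sha256(f"{benchmark}:{content}".encode()).hexdigest()[:16]
    return f"{benchmark}-{digest}"


# The same question yields the same id regardless of dataset order:
a = content_hash_id("gsm8k", "What is 7 * 8?")
b = content_hash_id("gsm8k", "What is 7 * 8?")
assert a == b
```

An index-based id (`gsm8k-0`, `gsm8k-1`, ...) would break under shuffling; hashing the content sidesteps that entirely.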
Understanding Scores
- `score`: num_correct / num_examples. Truncated and errored examples count as 0.
- `score_completed`: num_correct / num_completed. Excludes truncated and errored examples from the denominator.
For thinking models that often hit max_tokens, score_completed is the better comparison against published scores.
```python
result = await run_benchmark("gsm8k", client, renderer, config)
print(f"Raw: {result.score:.1%}")                  # 81.7%
print(f"Completed: {result.score_completed:.1%}")  # 95.6%
print(f"{result.num_truncated} truncated, {result.num_errors} errors")
```
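The two metrics differ only in the denominator. A quick check with hypothetical counts (chosen here for illustration, not taken from a real run) shows how much truncation can move the headline number:

```python
def scores(num_correct: int, num_examples: int, num_completed: int) -> tuple[float, float]:
    """score treats truncated/errored examples as wrong;
    score_completed drops them from the denominator."""
    return num_correct / num_examples, num_correct / num_completed


# Hypothetical run: 200 examples, 30 truncated or errored, 150 correct.
score, score_completed = scores(150, 200, 170)
print(f"{score:.1%} vs {score_completed:.1%}")  # 75.0% vs 88.2%
```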
Comparing Checkpoints
```python
checkpoints = {
    "base": "tinker://run-abc/sampler_weights/step0",
    "step500": "tinker://run-abc/sampler_weights/step500",
    "final": "tinker://run-abc/sampler_weights/final",
}

for name, path in checkpoints.items():
    client = await sc.create_sampling_client_async(model_path=path)
    results = await run_benchmarks(
        ["gsm8k", "ifeval"], client, renderer,
        BenchmarkConfig(save_dir=f"evals/{name}"),
    )
    for bench, r in results.items():
        print(f"{name}/{bench}: {r.score:.1%}")
```
EvalStore for Multi-Run Tracking
EvalStore manages evaluation runs across checkpoints with cloud-compatible storage:
```python
from tinker_cookbook.stores.eval_store import EvalStore

store = EvalStore("~/experiments/evals")
run_id = store.create_run(
    model_name="Qwen/Qwen3.5-35B-A3B",
    checkpoint_path="tinker://run-123/weights/step500",
    checkpoint_name="step500",
    benchmarks=["gsm8k", "ifeval"],
)

config = BenchmarkConfig(save_dir=store.run_dir(run_id))
await run_benchmarks(["gsm8k", "ifeval"], client, renderer, config)

metadata = store.finalize_run(run_id)
print(metadata.scores)  # {"gsm8k": 0.847, "ifeval": 0.923}
```
Pass@k Evaluation
Set num_samples > 1 to compute unbiased pass@k estimates (Codex formula):
```python
config = BenchmarkConfig(num_samples=4, save_dir="evals/pass_at_k")
result = await run_benchmark("gsm8k", sampling_client, renderer, config)
print(result.pass_at_k)  # {1: 0.45, 2: 0.58, 4: 0.72}
```
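The estimator behind these numbers is the standard unbiased formula from the Codex paper: with n samples per example, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A self-contained reference implementation:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn without replacement from n total is among the c correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k must contain a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# n=4 samples, c=1 correct:
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 4))  # 1.0
```

The benchmark-level pass@k is then the mean of these per-example estimates; naive "did any of k fixed samples pass" counting would be biased for k < n.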
Customization
For system prompts, custom grading, answer parsing, and judge model configuration, see Customizing Benchmarks.
Adding a New Benchmark
Benchmarks reuse the same Env protocol as RL training — a single MessageEnv can drive both training and evaluation. See Shared Env Abstraction for the conceptual overview.
To add a benchmark, implement a BenchmarkBuilder and register it:
```python
from tinker_cookbook.eval.benchmarks import BenchmarkBuilder, BenchmarkConfig, register
from tinker_cookbook.eval.benchmarks._common import load_benchmark_dataset, make_example_id
from tinker_cookbook.rl.message_env import MessageEnv, MessageStepResult, EnvFromMessageEnv


class MyMessageEnv(MessageEnv):
    """Single-turn env for one example."""

    def __init__(self, question: str, expected: str, example_id: str):
        self.question = question
        self.expected = expected
        self._example_id = example_id

    @property
    def example_id(self) -> str:
        return self._example_id

    async def initial_observation(self) -> list[Message]:
        return [{"role": "user", "content": self.question}]

    async def step(self, message: Message) -> MessageStepResult:
        response = get_text_content(message).lower()
        correct = self.expected.lower() in response
        return MessageStepResult(
            reward=1.0 if correct else 0.0,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
            logs={"expected": self.expected},
        )


class MyBenchmark(BenchmarkBuilder):
    name = "my_benchmark"
    recommended_system_prompt = "Answer concisely."

    def make_envs(self, renderer, config):
        ds = load_benchmark_dataset("my/dataset", split="test")
        envs = []
        for row in ds:
            env = EnvFromMessageEnv(
                renderer=renderer,
                message_env=MyMessageEnv(
                    question=row["question"],
                    expected=row["answer"],
                    example_id=make_example_id("my_benchmark", row["question"]),
                ),
            )
            envs.append(env)
        return envs


register(MyBenchmark())
```
Key points:
- Use `MessageEnv` for single-turn benchmarks; it handles prompt building and response parsing.
- Set `example_id` via `make_example_id()` for stable resumability across runs.
- `logs` stores diagnostic data per trajectory (expected answer, extracted answer); `metrics` stores numeric values for aggregation.
- For multi-turn benchmarks, implement `Env` directly and set `multi_turn = True` on the builder.
Learn More
- Evaluation Overview — available benchmarks and architecture
- Customizing Benchmarks — system prompts, grading, judge models, shared Env design
- Evaluations Tutorial — interactive walkthrough
- RL Environments — the Env protocol used by benchmarks
- API Reference — full type documentation