# Evaluation
Run standardized benchmarks against any Tinker model or checkpoint. The framework handles concurrency, grading, trajectory storage, and result aggregation.
## Quick Start

```python
from tinker_cookbook.eval.benchmarks import run_benchmark, run_benchmarks, BenchmarkConfig

# Single benchmark
result = await run_benchmark("gsm8k", sampling_client, renderer)
print(f"GSM8K: {result.score:.1%}")

# Multiple benchmarks (parallel by default)
results = await run_benchmarks(
    ["gsm8k", "mmlu_pro", "ifeval"],
    sampling_client, renderer,
)
for name, r in results.items():
    print(f"{name}: {r.score:.1%}")
```
## Architecture

Benchmarks reuse the RL `Env` protocol — the same environment can drive both evaluation and training.

```
BenchmarkBuilder
├── make_envs(renderer, config) → list[Env]      # one Env per example
├── aggregate(rewards, metrics) → BenchmarkResult
└── name, multi_turn, requires_sandbox, ...

Runner (run_benchmark / run_benchmarks)
├── Creates envs from BenchmarkBuilder
├── Runs rollouts concurrently (semaphore)
├── Grades responses via Env.step()
├── Saves trajectories as JSONL
└── Aggregates into BenchmarkResult

BenchmarkResult
├── score              # raw accuracy
├── score_completed    # accuracy excluding truncated/errored examples
├── num_examples, num_correct, num_errors, num_truncated
├── pass_at_k          # {1: 0.45, 5: 0.72} (when num_samples > 1)
└── time_seconds
```
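To make the result fields concrete, here is a short illustrative sketch of consuming a `BenchmarkResult`. The field names come from the diagram above; the reporting logic is just an example, not a prescribed pattern.

```python
# Illustrative only: field names are taken from the BenchmarkResult diagram above.
result = await run_benchmark("gsm8k", sampling_client, renderer)

print(f"score:           {result.score:.1%}")
print(f"score_completed: {result.score_completed:.1%}")
print(f"{result.num_correct}/{result.num_examples} correct, "
      f"{result.num_errors} errors, {result.num_truncated} truncated "
      f"in {result.time_seconds:.0f}s")

# pass_at_k is only populated when num_samples > 1
if result.pass_at_k:
    for k, score in sorted(result.pass_at_k.items()):
        print(f"pass@{k}: {score:.1%}")
```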
## Available Benchmarks

### Programmatic Grading (Single-Turn)

| Benchmark | Examples | Description |
|---|---|---|
| `gsm8k` | 1,319 | Grade school math |
| `math500` | 500 | Competition math (MATH) |
| `aime` | 30 | AIME math competition |
| `aime_2025` | 30 | AIME 2025 |
| `aime_2026` | 30 | AIME 2026 |
| `hmmt_feb_2025` | 30 | HMMT February 2025 |
| `hmmt_nov_2025` | 30 | HMMT November 2025 |
| `mmlu_pro` | 12,032 | Multi-domain knowledge |
| `mmlu_redux` | ~3,000 | Curated MMLU subset (30 subjects, error-filtered) |
| `gpqa` | 198 | Graduate-level science QA (Diamond set) |
| `supergpqa` | 26,529 | Extended graduate QA |
| `ceval` | ~1,300 | Chinese exam evaluation (52 subjects, val split) |
| `ifeval` | 541 | Instruction following |
| `ifbench` | 300 | Instruction following |
### Code Execution

| Benchmark | Type | Description |
|---|---|---|
| `mbpp` | Single-turn | Python code generation (sanitized, 427) |
| `livecodebench` | Single-turn | Competitive programming |
| `swe_bench` | Multi-turn | GitHub issue resolution (500, agent) |
| `terminal_bench` | Multi-turn | Terminal interaction (agent) |
| `tau2_bench` | Multi-turn | Customer service agent |
### LLM-as-Judge

| Benchmark | Description |
|---|---|
| `arena_hard` | ArenaHard (500, requires `judge_sampling_client`) |
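Since `arena_hard` needs a second model to grade responses, the call takes a judge client alongside the policy's sampling client. The `judge_sampling_client` name comes from the table above, but treating it as a keyword argument of `run_benchmark` is an assumption; see Customizing Benchmarks for the confirmed judge-model setup.

```python
# Sketch: judge_sampling_client is named in the table above; passing it as a
# keyword to run_benchmark is an assumption, not confirmed API.
result = await run_benchmark(
    "arena_hard",
    sampling_client,                               # the policy being evaluated
    renderer,
    judge_sampling_client=judge_sampling_client,   # stronger model used as judge
)
print(f"ArenaHard: {result.score:.1%}")
```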
### Other

| Benchmark | Description |
|---|---|
| `longbench` | Long-context evaluation |
| `bfcl` | Function calling / tool use |
## Using Benchmarks During Training

`BenchmarkEvaluator` wraps any benchmark as a training-loop evaluator. It runs a small sample at each eval step and logs metrics to TensorBoard / W&B:

```python
from tinker_cookbook.eval.benchmark_evaluator import BenchmarkEvaluator

config = train.Config(
    # ... training config ...
    evaluator_builders=[
        lambda: BenchmarkEvaluator("gsm8k", renderer, max_examples=100),
        lambda: BenchmarkEvaluator("ifeval", renderer, max_examples=50),
    ],
    eval_every=50,
)
```

Logged metrics: `eval/gsm8k/score`, `eval/gsm8k/num_correct`, etc.
## Next Steps
- Benchmarks Guide — configuration, storage, pass@k, adding new benchmarks
- Customizing Benchmarks — system prompts, answer parsing, judge models, shared Env design
- Evaluations Tutorial — interactive walkthrough of the evaluator pattern
- API Reference — full type documentation