Evaluation

Run standardized benchmarks against any Tinker model or checkpoint. The framework handles concurrency, grading, trajectory storage, and result aggregation.

Quick Start

from tinker_cookbook.eval.benchmarks import run_benchmark, run_benchmarks, BenchmarkConfig

# sampling_client and renderer come from your existing Tinker setup.

# Single benchmark
result = await run_benchmark("gsm8k", sampling_client, renderer)
print(f"GSM8K: {result.score:.1%}")

# Multiple benchmarks (run in parallel by default)
results = await run_benchmarks(
    ["gsm8k", "mmlu_pro", "ifeval"],
    sampling_client, renderer,
)
for name, r in results.items():
    print(f"{name}: {r.score:.1%}")

Architecture

Benchmarks reuse the RL Env protocol, so the same environment can drive both evaluation and training.

BenchmarkBuilder
├── make_envs(renderer, config) → list[Env]   one Env per example
├── aggregate(rewards, metrics) → BenchmarkResult
└── name, multi_turn, requires_sandbox, ...

Runner (run_benchmark / run_benchmarks)
├── Creates envs from BenchmarkBuilder
├── Runs rollouts concurrently (semaphore)
├── Grades responses via Env.step()
├── Saves trajectories as JSONL
└── Aggregates into BenchmarkResult

BenchmarkResult
├── score              raw accuracy
├── score_completed    accuracy excluding truncated/errored examples
├── num_examples, num_correct, num_errors, num_truncated
├── pass_at_k          {1: 0.45, 5: 0.72}  (when num_samples > 1)
└── time_seconds
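
When num_samples > 1, pass_at_k reports, for each k, the fraction of examples solved by at least one of k samples. For reference, the conventional unbiased estimator from Chen et al. (2021) is sketched below; whether the runner computes pass@k exactly this way is an assumption:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn from n total
    # samples of which c are correct, succeeds.
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. n=10 samples with c=3 correct: pass@1 = 0.30, pass@5 ≈ 0.92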

Available Benchmarks

Programmatic Grading (Single-Turn)

Benchmark        Examples   Description
gsm8k            1,319      Grade school math
math500          500        Competition math (MATH)
aime             30         AIME math competition
aime_2025        30         AIME 2025
aime_2026        30         AIME 2026
hmmt_feb_2025    30         HMMT February 2025
hmmt_nov_2025    30         HMMT November 2025
mmlu_pro         12,032     Multi-domain knowledge
mmlu_redux       ~3,000     Curated MMLU subset (30 subjects, error-filtered)
gpqa             198        Graduate-level science QA (Diamond set)
supergpqa        26,529     Extended graduate QA
ceval            ~1,300     Chinese exam evaluation (52 subjects, val split)
ifeval           541        Instruction following
ifbench          300        Instruction following

Code Execution

Benchmark        Type         Description
mbpp             Single-turn  Python code generation (sanitized, 427 examples)
livecodebench    Single-turn  Competitive programming
swe_bench        Multi-turn   GitHub issue resolution (500 examples, agentic)
terminal_bench   Multi-turn   Terminal interaction (agentic)
tau2_bench       Multi-turn   Customer service agent
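
The multi-turn rows above are agentic and need an execution sandbox, surfaced by the requires_sandbox flag on BenchmarkBuilder. A pre-flight check might look like the sketch below; get_benchmark_builder and sandbox_available are hypothetical names used only for illustration:

# Hypothetical: get_benchmark_builder / sandbox_available are illustrative,
# not confirmed API.
builder = get_benchmark_builder("swe_bench")
if builder.requires_sandbox and not sandbox_available():
    raise RuntimeError("swe_bench needs a sandbox before rollouts can run")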

LLM-as-Judge

Benchmark    Description
arena_hard   ArenaHard (500 examples, requires judge_sampling_client)
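
Because arena_hard grades with a second model, a judge client has to be supplied alongside the usual arguments. A hedged sketch follows; the name judge_sampling_client comes from the table above, but treating it as a keyword argument to run_benchmark is an assumption:

# judge_client: any second sampling client used for grading.
# Passing it as a keyword argument is assumed, not verified.
result = await run_benchmark(
    "arena_hard", sampling_client, renderer,
    judge_sampling_client=judge_client,
)
print(f"arena_hard: {result.score:.1%}")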

Other

Benchmark    Description
longbench    Long-context evaluation
bfcl         Function calling / tool use

Using Benchmarks During Training

BenchmarkEvaluator wraps any benchmark as a training-loop evaluator. It evaluates a small sample of examples at each eval step and logs the metrics to TensorBoard / W&B:

from tinker_cookbook.eval.benchmark_evaluator import BenchmarkEvaluator

config = train.Config(
    # ... training config ...
    evaluator_builders=[
        lambda: BenchmarkEvaluator("gsm8k", renderer, max_examples=100),
        lambda: BenchmarkEvaluator("ifeval", renderer, max_examples=50),
    ],
    eval_every=50,
)

Logged metrics: eval/gsm8k/score, eval/gsm8k/num_correct, etc.
