Evaluation

Run standardized benchmarks against any Tinker model or checkpoint. The framework handles concurrency, grading, trajectory storage, and result aggregation.

Quick Start

from tinker_cookbook.eval.benchmarks import run_benchmark, run_benchmarks, BenchmarkConfig

# sampling_client and renderer come from your existing Tinker setup.

# Single benchmark
result = await run_benchmark("gsm8k", sampling_client, renderer)
print(f"GSM8K: {result.score:.1%}")

# Multiple benchmarks (run in parallel by default)
results = await run_benchmarks(
    ["gsm8k", "mmlu_pro", "ifeval"],
    sampling_client, renderer,
)
for name, r in results.items():
    print(f"{name}: {r.score:.1%}")

Architecture

Benchmarks reuse the RL Env protocol, so the same environment can drive both evaluation and training.

BenchmarkBuilder
├── make_envs(renderer, config) → list[Env]   one Env per example
├── aggregate(rewards, metrics) → BenchmarkResult
└── name, multi_turn, requires_sandbox, ...

Runner (run_benchmark / run_benchmarks)
├── Creates envs from BenchmarkBuilder
├── Runs rollouts concurrently (semaphore)
├── Grades responses via Env.step()
├── Saves trajectories as JSONL
└── Aggregates into BenchmarkResult

BenchmarkResult
├── score              raw accuracy
├── score_completed    accuracy excluding truncated/errored examples
├── num_examples, num_correct, num_errors, num_truncated
├── pass_at_k          {1: 0.45, 5: 0.72}  (when num_samples > 1)
└── time_seconds
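
When num_samples > 1, pass_at_k reports, for each k, the fraction of examples solved by at least one of k samples. For reference, the conventional unbiased estimator from Chen et al. (2021) is sketched below; whether the runner computes pass@k exactly this way is an assumption:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn from n total
    # samples of which c are correct, succeeds.
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. n=10 samples with c=3 correct: pass@1 = 0.30, pass@5 ≈ 0.92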

Available Benchmarks

Programmatic Grading (Single-Turn)

Benchmark        Examples   Description
gsm8k            1,319      Grade school math
math500          500        Competition math (MATH)
aime             30         AIME math competition
aime_2025        30         AIME 2025
aime_2026        30         AIME 2026
hmmt_feb_2025    30         HMMT February 2025
hmmt_nov_2025    30         HMMT November 2025
mmlu_pro         12,032     Multi-domain knowledge
mmlu_redux       ~3,000     Curated MMLU subset (30 subjects, error-filtered)
gpqa             198        Graduate-level science QA (Diamond set)
supergpqa        26,529     Extended graduate QA
ceval            ~1,300     Chinese exam evaluation (52 subjects, val split)
ifeval           541        Instruction following
ifbench          300        Instruction following

Code Execution

Benchmark        Type         Description
mbpp             Single-turn  Python code generation (sanitized, 427 examples)
livecodebench    Single-turn  Competitive programming
swe_bench        Multi-turn   GitHub issue resolution (500 examples, agentic)
terminal_bench   Multi-turn   Terminal interaction (agentic)
tau2_bench       Multi-turn   Customer service agent
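
The multi-turn rows above are agentic and need an execution sandbox, surfaced by the requires_sandbox flag on BenchmarkBuilder. A pre-flight check might look like the sketch below; get_benchmark_builder and sandbox_available are hypothetical names used only for illustration:

# Hypothetical: get_benchmark_builder / sandbox_available are illustrative,
# not confirmed API.
builder = get_benchmark_builder("swe_bench")
if builder.requires_sandbox and not sandbox_available():
    raise RuntimeError("swe_bench needs a sandbox before rollouts can run")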

LLM-as-Judge

Benchmark    Description
arena_hard   ArenaHard (500 examples, requires judge_sampling_client)
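
Because arena_hard grades with a second model, a judge client has to be supplied alongside the usual arguments. A hedged sketch follows; the name judge_sampling_client comes from the table above, but treating it as a keyword argument to run_benchmark is an assumption:

# judge_client: any second sampling client used for grading.
# Passing it as a keyword argument is assumed, not verified.
result = await run_benchmark(
    "arena_hard", sampling_client, renderer,
    judge_sampling_client=judge_client,
)
print(f"arena_hard: {result.score:.1%}")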

Other

Benchmark    Description
longbench    Long-context evaluation
bfcl         Function calling / tool use

Using Benchmarks During Training

BenchmarkEvaluator wraps any benchmark as a training-loop evaluator. It evaluates a small sample of examples at each eval step and logs the metrics to TensorBoard / W&B:

from tinker_cookbook.eval.benchmark_evaluator import BenchmarkEvaluator

config = train.Config(
    # ... training config ...
    evaluator_builders=[
        lambda: BenchmarkEvaluator("gsm8k", renderer, max_examples=100),
        lambda: BenchmarkEvaluator("ifeval", renderer, max_examples=50),
    ],
    eval_every=50,
)

Logged metrics: eval/gsm8k/score, eval/gsm8k/num_correct, etc.
