Customizing Benchmarks
Shared Env Abstraction
Benchmarks reuse the same Env protocol as RL training. The connection is direct — BenchmarkBuilder.make_envs() returns Env instances identical to those used in RLDataset:
```
RL Training                          Benchmark Evaluation
───────────                          ────────────────────
RLDataset                            BenchmarkBuilder
└── EnvGroupBuilder                  └── make_envs()
    └── make_envs() → list[Env]          └── list[Env]
        └── Env.step() → reward              └── Env.step() → reward
```
Both paths use EnvFromMessageEnv to adapt message-level environments to the token-level Env interface. A single MessageEnv implementation can drive both training and evaluation — just wrap it with EnvFromMessageEnv in either context. The rollout infrastructure is also shared: run_benchmark() calls do_single_rollout() from the RL module internally.
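The adapter idea can be sketched in isolation. Everything below is illustrative: the simplified interfaces, the toy environment, and the character-level "decoder" are assumptions for the sketch, not the actual signatures of `EnvFromMessageEnv` or `MessageEnv` in tinker_cookbook.

```python
from typing import Protocol


class MessageEnvLike(Protocol):
    """Message-level environment: consumes text, returns (obs, reward, done)."""
    def step(self, message: str) -> tuple[str, float, bool]: ...


class ToyMathEnv:
    """Toy message-level env: reward 1.0 when the reply mentions '42'."""
    def step(self, message: str) -> tuple[str, float, bool]:
        reward = 1.0 if "42" in message else 0.0
        return "", reward, True  # observation, reward, done


class TokenAdapterSketch:
    """Illustrative adapter in the spirit of EnvFromMessageEnv: the
    token-level step() decodes tokens to text, delegates to the wrapped
    message-level env, and passes the reward through unchanged."""
    def __init__(self, message_env: MessageEnvLike, decode):
        self.message_env = message_env
        self.decode = decode  # tokens -> text

    def step(self, tokens: list[int]) -> float:
        text = self.decode(tokens)
        _, reward, _ = self.message_env.step(text)
        return reward


# The same message-level env instance could back both training and evaluation.
env = TokenAdapterSketch(ToyMathEnv(), decode=lambda ts: "".join(map(chr, ts)))
```

The design point is that grading logic lives once, in the message-level env; only the thin token adapter differs between the two call paths.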
For a complete example of implementing and registering a MessageEnv-based benchmark, see Adding a New Benchmark.
System Prompts
How Prompts Are Resolved
Each benchmark defines a recommended_system_prompt that improves scores (e.g., math benchmarks instruct models to use \boxed{}). The runner resolves prompts in this order:
1. `BenchmarkConfig.system_prompt` — if set, always takes precedence
2. `BenchmarkBuilder.recommended_system_prompt` — applied automatically when no override is set
3. No system prompt — if neither is set
Overriding Per Model
Different models may need different prompting strategies. Override at the config level:
```python
# Thinking model — needs explicit formatting instruction
thinking_config = BenchmarkConfig.for_model(
    "Qwen/Qwen3.5-35B-A3B",
    system_prompt="Think step by step. Put your final answer in \\boxed{}.",
)

# Instruct model — shorter prompt works better
instruct_config = BenchmarkConfig.for_model(
    "Qwen/Qwen3-4B-Instruct-2507",
    system_prompt="Answer concisely. Final answer in \\boxed{}.",
)

# Compare
for name, config in [("thinking", thinking_config), ("instruct", instruct_config)]:
    result = await run_benchmark("gsm8k", client, renderer, config)
    print(f"{name}: {result.score:.1%}")
```
Built-In Recommended Prompts
| Benchmark | `recommended_system_prompt` |
|---|---|
| `gsm8k` | `"Put your final answer in \boxed{}."` |
| `math500` | `"Put your final answer in \boxed{}."` |
| `aime*`, `hmmt*` | `"Put your final answer in \boxed{}."` |
| `mmlu_pro`, `gpqa`, `supergpqa` | `None` (MCQ format instruction is in the user prompt) |
| `ifeval` | `None` |
| Most others | `None` |
Answer Parsing and Grading
Built-In Extraction Utilities
The framework provides common answer extraction functions in `tinker_cookbook.eval.benchmarks._common`:
| Function | Use case |
|---|---|
| `extract_boxed(text)` | Extract content from `\boxed{...}` (handles nested braces) |
| `extract_gsm8k_answer(text)` | Extract numeric answer — tries `\boxed{}`, `####`, "answer is", then last number |
| `extract_mcq_answer(text)` | Extract multiple-choice letter (A/B/C/D) |
| `extract_python_code(text)` | Extract Python from fenced code blocks |
| `check_gsm8k(response, expected)` | Float comparison with tolerance |
Each benchmark's MessageEnv.step() uses these to grade responses. The extracted answer and expected answer are stored in logs for inspection.
Custom Grade Function
BenchmarkConfig.grade_fn overrides the built-in grading without modifying the Env. The runner calls it after the rollout completes, replacing the reward:
```python
def my_grader(response: str, logs: Logs) -> float:
    """Custom grading function.

    Args:
        response: Last assistant turn (thinking stripped).
        logs: Benchmark-specific fields — e.g., "expected", "input", "extracted".

    Returns:
        Reward value (typically 0.0 or 1.0).
    """
    expected = logs["expected"]
    # Use your own extraction logic
    extracted = my_custom_extract(response)
    return 1.0 if extracted == expected else 0.0

config = BenchmarkConfig(grade_fn=my_grader)
result = await run_benchmark("gsm8k", sampling_client, renderer, config)
```
This is useful when:
- The built-in extraction misses a valid answer format your model uses
- You want more lenient or stricter grading
- You need domain-specific parsing logic
Regrading Without Re-Running
regrade_trajectories() applies a new grade function to saved trajectories — no model inference needed:
```python
from tinker_cookbook.eval.benchmarks import regrade_trajectories

# Try a more lenient grader on existing results
def lenient_grader(response: str, logs: Logs) -> float:
    expected = logs["expected"]
    # Accept answers with or without \boxed{}
    return 1.0 if expected in response else 0.0

regraded = regrade_trajectories("evals/run_01", "gsm8k", grade_fn=lenient_grader)

# `original` is the result object from the initial run
print(f"Original: {original.score:.1%}, Regraded: {regraded.score:.1%}")
```
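Conceptually, regrading is just a map over saved (response, logs) pairs. A self-contained sketch of the idea follows; the records and graders are made up for illustration, while the real function reads trajectories from disk:

```python
def regrade(records: list[dict], grade_fn) -> float:
    """Apply a new grade function to saved rollouts: no inference needed.
    Each record holds the model's response plus its benchmark logs."""
    rewards = [grade_fn(r["response"], r["logs"]) for r in records]
    return sum(rewards) / len(rewards)


# Hypothetical saved trajectories from a previous run
saved = [
    {"response": "The answer is \\boxed{42}", "logs": {"expected": "42"}},
    {"response": "42 is my guess", "logs": {"expected": "42"}},
]

# Strict grading requires the boxed form; lenient accepts any mention.
strict = lambda resp, logs: 1.0 if f"\\boxed{{{logs['expected']}}}" in resp else 0.0
lenient = lambda resp, logs: 1.0 if logs["expected"] in resp else 0.0
```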
Judge Models
Benchmarks like `arena_hard` use an LLM judge to grade responses. The judge is fully independent of the candidate model — different model, different renderer, different scale.
Configuring a Judge
```python
# Candidate model: the model being evaluated
candidate_client = await sc.create_sampling_client_async(model_path="tinker://run-abc/weights/final")
candidate_renderer = get_renderer("qwen3_5", candidate_tokenizer)

# Judge model: a separate, typically stronger model
judge_client = await sc.create_sampling_client_async(base_model="Qwen/Qwen3.5-397B-A17B")
judge_renderer = get_renderer("qwen3_5", judge_tokenizer)

config = BenchmarkConfig(
    judge_sampling_client=judge_client,
    judge_renderer=judge_renderer,
)
result = await run_benchmark("arena_hard", candidate_client, candidate_renderer, config)
```
How Judging Works
For arena_hard, the flow is:
1. The candidate model generates a response to the benchmark question
2. The judge model receives the question + response and scores it (1-10)
3. Scores >= 7 are graded as correct
The judge renderer defaults to the candidate renderer if not explicitly set.
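The score-to-reward step can be sketched as follows. The "Score:" reply format and the regex are illustrative assumptions (the actual judge prompt and parsing live in the benchmark implementation); only the >= 7 threshold comes from the flow above.

```python
import re

JUDGE_THRESHOLD = 7  # scores >= 7 count as correct, per the arena_hard flow


def grade_from_judge_output(judge_text: str) -> float:
    """Parse a 1-10 rating from the judge's reply and binarize it."""
    m = re.search(r"(?:Score|Rating)\s*:?\s*(\d+)", judge_text, re.IGNORECASE)
    if m is None:
        return 0.0  # unparseable judge output counts as incorrect
    score = int(m.group(1))
    return 1.0 if score >= JUDGE_THRESHOLD else 0.0
```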
Swapping Judges
Compare how different judges rate the same model:
```python
# judges: a dict mapping judge name -> sampling client, built beforehand
for judge_name, judge_client in judges.items():
    config = BenchmarkConfig(
        judge_sampling_client=judge_client,
        judge_renderer=judge_renderer,
    )
    result = await run_benchmark("arena_hard", candidate_client, candidate_renderer, config)
    print(f"Judge={judge_name}: {result.score:.1%}")
```
Learn More
- Evaluation Overview — available benchmarks and architecture
- Benchmarks Guide — configuration, storage, pass@k
- RL Environments — the Env protocol shared with benchmarks
- API Reference — full type documentation