Customizing Benchmarks

Shared Env Abstraction

Benchmarks reuse the same Env protocol as RL training. The connection is direct — BenchmarkBuilder.make_envs() returns Env instances identical to those used in RLDataset:

RL Training                          Benchmark Evaluation
───────────                          ────────────────────
RLDataset                            BenchmarkBuilder
└── EnvGroupBuilder                  └── make_envs()
    └── make_envs() → list[Env]         └── list[Env]
        └── Env.step() → reward             └── Env.step() → reward

Both paths use EnvFromMessageEnv to adapt message-level environments to the token-level Env interface. A single MessageEnv implementation can drive both training and evaluation — just wrap it with EnvFromMessageEnv in either context. The rollout infrastructure is also shared: run_benchmark() calls do_single_rollout() from the RL module internally.

For a complete example of implementing and registering a MessageEnv-based benchmark, see Adding a New Benchmark.
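The shared abstraction can be sketched with toy stand-ins. Everything below is illustrative: the real MessageEnv, EnvFromMessageEnv, and Env types live in tinker_cookbook and have richer signatures, but the shape is the same — one message-level env, one token-level adapter, and both training and evaluation call step():

```python
# Illustrative sketch only: toy stand-ins for the cookbook's MessageEnv /
# EnvFromMessageEnv types, showing how one message-level env can serve
# both training and evaluation through a single adapter.
from dataclasses import dataclass


@dataclass
class GradeResult:
    reward: float
    done: bool


class ToyMathEnv:
    """A message-level env: sees text, returns a reward."""

    def __init__(self, question: str, expected: str):
        self.question = question
        self.expected = expected

    def initial_message(self) -> str:
        return self.question

    def grade(self, response: str) -> GradeResult:
        return GradeResult(reward=1.0 if self.expected in response else 0.0, done=True)


class ToyEnvAdapter:
    """Token-level wrapper (stands in for EnvFromMessageEnv): decodes tokens
    to text, then delegates grading to the message-level env."""

    def __init__(self, message_env: ToyMathEnv, decode):
        self.message_env = message_env
        self.decode = decode

    def step(self, tokens: list[int]) -> GradeResult:
        return self.message_env.grade(self.decode(tokens))


# Both an RL rollout and a benchmark rollout would end up calling adapter.step():
decode = lambda toks: "".join(chr(t) for t in toks)  # toy "tokenizer"
adapter = ToyEnvAdapter(ToyMathEnv("What is 2+2?", "4"), decode)
print(adapter.step([ord(c) for c in "The answer is 4"]).reward)  # 1.0
```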

System Prompts

How Prompts Are Resolved

Each benchmark defines a recommended_system_prompt that improves scores (e.g., math benchmarks instruct models to use \boxed{}). The runner resolves prompts in this order:

  1. BenchmarkConfig.system_prompt — if set, always takes precedence
  2. BenchmarkBuilder.recommended_system_prompt — applied automatically when no override is set
  3. No system prompt — if neither is set
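The three-step resolution above can be sketched as a small helper. The function name and signature are illustrative, not the runner's actual internals:

```python
from typing import Optional


def resolve_system_prompt(
    config_prompt: Optional[str],
    recommended_prompt: Optional[str],
) -> Optional[str]:
    """Illustrative sketch of the resolution order described above:
    config override first, then the benchmark's recommendation, else none."""
    if config_prompt is not None:       # 1. BenchmarkConfig.system_prompt
        return config_prompt
    if recommended_prompt is not None:  # 2. BenchmarkBuilder.recommended_system_prompt
        return recommended_prompt
    return None                         # 3. no system prompt


print(resolve_system_prompt(None, "Put your final answer in \\boxed{}."))
```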

Overriding Per Model

Different models may need different prompting strategies. Override at the config level:

# Thinking model — needs explicit formatting instruction
thinking_config = BenchmarkConfig.for_model(
    "Qwen/Qwen3.5-35B-A3B",
    system_prompt="Think step by step. Put your final answer in \\boxed{}.",
)

# Instruct model — shorter prompt works better
instruct_config = BenchmarkConfig.for_model(
    "Qwen/Qwen3-4B-Instruct-2507",
    system_prompt="Answer concisely. Final answer in \\boxed{}.",
)

# Compare
for name, config in [("thinking", thinking_config), ("instruct", instruct_config)]:
    result = await run_benchmark("gsm8k", client, renderer, config)
    print(f"{name}: {result.score:.1%}")

Benchmark                    recommended_system_prompt
─────────                    ─────────────────────────
gsm8k                        "Put your final answer in \boxed{}."
math500                      "Put your final answer in \boxed{}."
aime*, hmmt*                 "Put your final answer in \boxed{}."
mmlu_pro, gpqa, supergpqa    None (MCQ format instruction is in the user prompt)
ifeval                       None
Most others                  None

Answer Parsing and Grading

Built-In Extraction Utilities

The framework provides common answer extraction functions in tinker_cookbook.eval.benchmarks._common:

Function                          Use case
────────                          ────────
extract_boxed(text)               Extract content from \boxed{...} (handles nested braces)
extract_gsm8k_answer(text)        Extract a numeric answer — tries \boxed{}, ####, "answer is", then the last number
extract_mcq_answer(text)          Extract a multiple-choice letter (A/B/C/D)
extract_python_code(text)         Extract Python from fenced code blocks
check_gsm8k(response, expected)   Float comparison with tolerance

Each benchmark's MessageEnv.step() uses these to grade responses. The extracted answer and expected answer are stored in logs for inspection.

Custom Grade Function

BenchmarkConfig.grade_fn overrides the built-in grading without modifying the Env. The runner calls it after the rollout completes, replacing the reward:

def my_grader(response: str, logs: Logs) -> float:
    """Custom grading function.

    Args:
        response: Last assistant turn (thinking stripped).
        logs: Benchmark-specific fields — e.g., "expected", "input", "extracted".

    Returns:
        Reward value (typically 0.0 or 1.0).
    """
    expected = logs["expected"]
    # Use your own extraction logic
    extracted = my_custom_extract(response)
    return 1.0 if extracted == expected else 0.0

config = BenchmarkConfig(grade_fn=my_grader)
result = await run_benchmark("gsm8k", sampling_client, renderer, config)

This is useful when:

  • The built-in extraction misses a valid answer format your model uses
  • You want more lenient or stricter grading
  • You need domain-specific parsing logic

Regrading Without Re-Running

regrade_trajectories() applies a new grade function to saved trajectories — no model inference needed:

from tinker_cookbook.eval.benchmarks import regrade_trajectories

# Try a more lenient grader on existing results
def lenient_grader(response: str, logs: Logs) -> float:
    expected = logs["expected"]
    # Accept answers with or without \boxed{}
    return 1.0 if expected in response else 0.0

regraded = regrade_trajectories("evals/run_01", "gsm8k", grade_fn=lenient_grader)
# `original` is the result from the initial run_benchmark call
print(f"Original: {original.score:.1%}, Regraded: {regraded.score:.1%}")

Judge Models

Benchmarks like arena_hard use an LLM judge to grade responses. The judge is fully independent from the candidate model — different model, different renderer, different scale.

Configuring a Judge

# Candidate model: the model being evaluated
candidate_client = await sc.create_sampling_client_async(model_path="tinker://run-abc/weights/final")
candidate_renderer = get_renderer("qwen3_5", candidate_tokenizer)

# Judge model: a separate, typically stronger model
judge_client = await sc.create_sampling_client_async(base_model="Qwen/Qwen3.5-397B-A17B")
judge_renderer = get_renderer("qwen3_5", judge_tokenizer)

config = BenchmarkConfig(
    judge_sampling_client=judge_client,
    judge_renderer=judge_renderer,
)

result = await run_benchmark("arena_hard", candidate_client, candidate_renderer, config)

How Judging Works

For arena_hard, the flow is:

  1. The candidate model generates a response to the benchmark question
  2. The judge model receives the question + response and scores it (1-10)
  3. Scores >= 7 are graded as correct

The judge renderer defaults to the candidate renderer if not explicitly set.
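The score-then-threshold step can be sketched as follows. The regex and the assumed "Rating: N" reply format are illustrative, not the benchmark's actual parsing code:

```python
import re


def grade_from_judge(judge_text: str, threshold: int = 7) -> float:
    """Illustrative sketch: pull a 1-10 rating out of the judge's reply and
    binarize it at the threshold, as in the arena_hard flow above."""
    match = re.search(r"(?:rating|score)\s*[:=]?\s*(\d+)", judge_text, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable judge output counts as incorrect
    return 1.0 if int(match.group(1)) >= threshold else 0.0


print(grade_from_judge("Rating: 8. Fluent and correct."))  # 1.0
```

Treating unparseable judge output as 0.0 is one possible policy; logging such cases for manual review is another reasonable choice.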

Swapping Judges

Compare how different judges rate the same model:

for judge_name, judge_client in judges.items():
    config = BenchmarkConfig(
        judge_sampling_client=judge_client,
        judge_renderer=judge_renderer,
    )
    result = await run_benchmark("arena_hard", candidate_client, candidate_renderer, config)
    print(f"Judge={judge_name}: {result.score:.1%}")

Learn More