Customizing Benchmarks

Shared Env Abstraction

Benchmarks reuse the same Env protocol as RL training. The connection is direct — BenchmarkBuilder.make_envs() returns Env instances identical to those used in RLDataset:

RL Training                          Benchmark Evaluation
───────────                          ────────────────────
RLDataset                            BenchmarkBuilder
└── EnvGroupBuilder                  └── make_envs()
    └── make_envs() → list[Env]         └── list[Env]
        └── Env.step() → reward             └── Env.step() → reward

Both paths use EnvFromMessageEnv to adapt message-level environments to the token-level Env interface. A single MessageEnv implementation can drive both training and evaluation — just wrap it with EnvFromMessageEnv in either context. The rollout infrastructure is also shared: run_benchmark() calls do_single_rollout() from the RL module internally.

For a complete example of implementing and registering a MessageEnv-based benchmark, see Adding a New Benchmark.
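The shared abstraction can be sketched with toy stand-ins. Everything below is illustrative: the real MessageEnv, EnvFromMessageEnv, and Env types live in tinker_cookbook and have richer signatures, but the shape is the same — one message-level env, one token-level adapter, and both training and evaluation call step():

```python
# Illustrative sketch only: toy stand-ins for the cookbook's MessageEnv /
# EnvFromMessageEnv types, showing how one message-level env can serve
# both training and evaluation through a single adapter.
from dataclasses import dataclass


@dataclass
class GradeResult:
    reward: float
    done: bool


class ToyMathEnv:
    """A message-level env: sees text, returns a reward."""

    def __init__(self, question: str, expected: str):
        self.question = question
        self.expected = expected

    def initial_message(self) -> str:
        return self.question

    def grade(self, response: str) -> GradeResult:
        return GradeResult(reward=1.0 if self.expected in response else 0.0, done=True)


class ToyEnvAdapter:
    """Token-level wrapper (stands in for EnvFromMessageEnv): decodes tokens
    to text, then delegates grading to the message-level env."""

    def __init__(self, message_env: ToyMathEnv, decode):
        self.message_env = message_env
        self.decode = decode

    def step(self, tokens: list[int]) -> GradeResult:
        return self.message_env.grade(self.decode(tokens))


# Both an RL rollout and a benchmark rollout would end up calling adapter.step():
decode = lambda toks: "".join(chr(t) for t in toks)  # toy "tokenizer"
adapter = ToyEnvAdapter(ToyMathEnv("What is 2+2?", "4"), decode)
print(adapter.step([ord(c) for c in "The answer is 4"]).reward)  # 1.0
```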

System Prompts

How Prompts Are Resolved

Each benchmark defines a recommended_system_prompt that improves scores (e.g., math benchmarks instruct models to use \boxed{}). The runner resolves prompts in this order:

  1. BenchmarkConfig.system_prompt — if set, always takes precedence
  2. BenchmarkBuilder.recommended_system_prompt — applied automatically when no override is set
  3. No system prompt — if neither is set
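The three-step resolution above can be sketched as a small helper. The function name and signature are illustrative, not the runner's actual internals:

```python
from typing import Optional


def resolve_system_prompt(
    config_prompt: Optional[str],
    recommended_prompt: Optional[str],
) -> Optional[str]:
    """Illustrative sketch of the resolution order described above:
    config override first, then the benchmark's recommendation, else none."""
    if config_prompt is not None:       # 1. BenchmarkConfig.system_prompt
        return config_prompt
    if recommended_prompt is not None:  # 2. BenchmarkBuilder.recommended_system_prompt
        return recommended_prompt
    return None                         # 3. no system prompt


print(resolve_system_prompt(None, "Put your final answer in \\boxed{}."))
```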

Overriding Per Model

Different models may need different prompting strategies. Override at the config level:

# Thinking model — needs explicit formatting instruction
thinking_config = BenchmarkConfig.for_model(
    "Qwen/Qwen3.5-35B-A3B",
    system_prompt="Think step by step. Put your final answer in \\boxed{}.",
)

# Instruct model — shorter prompt works better
instruct_config = BenchmarkConfig.for_model(
    "Qwen/Qwen3-4B-Instruct-2507",
    system_prompt="Answer concisely. Final answer in \\boxed{}.",
)

# Compare
for name, config in [("thinking", thinking_config), ("instruct", instruct_config)]:
    result = await run_benchmark("gsm8k", client, renderer, config)
    print(f"{name}: {result.score:.1%}")

Benchmark                    recommended_system_prompt
─────────                    ─────────────────────────
gsm8k                        "Put your final answer in \boxed{}."
math500                      "Put your final answer in \boxed{}."
aime*, hmmt*                 "Put your final answer in \boxed{}."
mmlu_pro, gpqa, supergpqa    None (MCQ format instruction is in the user prompt)
ifeval                       None
Most others                  None

Answer Parsing and Grading

Built-In Extraction Utilities

The framework provides common answer extraction functions in tinker_cookbook.eval.benchmarks._common:

Function                          Use case
────────                          ────────
extract_boxed(text)               Extract content from \boxed{...} (handles nested braces)
extract_gsm8k_answer(text)        Extract a numeric answer — tries \boxed{}, ####, "answer is", then the last number
extract_mcq_answer(text)          Extract a multiple-choice letter (A/B/C/D)
extract_python_code(text)         Extract Python from fenced code blocks
check_gsm8k(response, expected)   Float comparison with tolerance

Each benchmark's MessageEnv.step() uses these to grade responses. The extracted answer and expected answer are stored in logs for inspection.

Custom Grade Function

BenchmarkConfig.grade_fn overrides the built-in grading without modifying the Env. The runner calls it after the rollout completes, replacing the reward:

def my_grader(response: str, logs: Logs) -> float:
    """Custom grading function.

    Args:
        response: Last assistant turn (thinking stripped).
        logs: Benchmark-specific fields — e.g., "expected", "input", "extracted".

    Returns:
        Reward value (typically 0.0 or 1.0).
    """
    expected = logs["expected"]
    # Use your own extraction logic
    extracted = my_custom_extract(response)
    return 1.0 if extracted == expected else 0.0

config = BenchmarkConfig(grade_fn=my_grader)
result = await run_benchmark("gsm8k", sampling_client, renderer, config)

This is useful when:

  • The built-in extraction misses a valid answer format your model uses
  • You want more lenient or stricter grading
  • You need domain-specific parsing logic

Regrading Without Re-Running

regrade_trajectories() applies a new grade function to saved trajectories — no model inference needed:

from tinker_cookbook.eval.benchmarks import regrade_trajectories

# Try a more lenient grader on existing results
def lenient_grader(response: str, logs: Logs) -> float:
    expected = logs["expected"]
    # Accept answers with or without \boxed{}
    return 1.0 if expected in response else 0.0

regraded = regrade_trajectories("evals/run_01", "gsm8k", grade_fn=lenient_grader)
# `original` is the result from the initial run_benchmark call
print(f"Original: {original.score:.1%}, Regraded: {regraded.score:.1%}")

Judge Models

Benchmarks like arena_hard use an LLM judge to grade responses. The judge is fully independent from the candidate model — different model, different renderer, different scale.

Configuring a Judge

# Candidate model: the model being evaluated
candidate_client = await sc.create_sampling_client_async(model_path="tinker://run-abc/weights/final")
candidate_renderer = get_renderer("qwen3_5", candidate_tokenizer)

# Judge model: a separate, typically stronger model
judge_client = await sc.create_sampling_client_async(base_model="Qwen/Qwen3.5-397B-A17B")
judge_renderer = get_renderer("qwen3_5", judge_tokenizer)

config = BenchmarkConfig(
    judge_sampling_client=judge_client,
    judge_renderer=judge_renderer,
)

result = await run_benchmark("arena_hard", candidate_client, candidate_renderer, config)

How Judging Works

For arena_hard, the flow is:

  1. The candidate model generates a response to the benchmark question
  2. The judge model receives the question + response and scores it (1-10)
  3. Scores >= 7 are graded as correct

The judge renderer defaults to the candidate renderer if not explicitly set.
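The score-then-threshold step can be sketched as follows. The regex and the assumed "Rating: N" reply format are illustrative, not the benchmark's actual parsing code:

```python
import re


def grade_from_judge(judge_text: str, threshold: int = 7) -> float:
    """Illustrative sketch: pull a 1-10 rating out of the judge's reply and
    binarize it at the threshold, as in the arena_hard flow above."""
    match = re.search(r"(?:rating|score)\s*[:=]?\s*(\d+)", judge_text, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable judge output counts as incorrect
    return 1.0 if int(match.group(1)) >= threshold else 0.0


print(grade_from_judge("Rating: 8. Fluent and correct."))  # 1.0
```

Treating unparseable judge output as 0.0 is one possible policy; logging such cases for manual review is another reasonable choice.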

Swapping Judges

Compare how different judges rate the same model:

for judge_name, judge_client in judges.items():
    config = BenchmarkConfig(
        judge_sampling_client=judge_client,
        judge_renderer=judge_renderer,
    )
    result = await run_benchmark("arena_hard", candidate_client, candidate_renderer, config)
    print(f"Judge={judge_name}: {result.score:.1%}")

Learn More