tinker_cookbook.stores.EvalStore

class tinker_cookbook.stores.EvalStore(**)

Manages evaluation runs across checkpoints.

url(path)

Return a human-readable URI for a path within this eval store.

Parameters:

path (str)

create_run(model_name, benchmarks, checkpoint_path, checkpoint_name, config, run_id)

Create a new evaluation run and return its run_id.

Parameters:

model_name (str)
benchmarks (list[str])
checkpoint_path (str | None)
checkpoint_name (str | None)
config (dict | None)
run_id (str | None)
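
A minimal sketch of calling create_run, using only the parameter names listed above. The helper name start_run is hypothetical, and the sketch assumes that None is accepted for the optional fields and that the store then picks its own defaults; this reference does not confirm either point.

```python
from typing import Any


def start_run(store: Any, model_name: str, benchmarks: list[str]) -> str:
    """Create a run with no checkpoint attached and return its run_id.

    `store` is expected to be an EvalStore instance; how it is constructed
    is not shown in this reference.
    """
    # Arguments are passed by keyword, using the documented parameter names.
    # Assumption: None is a valid value for every `... | None` parameter.
    return store.create_run(
        model_name=model_name,
        benchmarks=benchmarks,
        checkpoint_path=None,
        checkpoint_name=None,
        config=None,
        run_id=None,
    )
```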

run_dir(run_id)

Return the filesystem path for a run, for backward compatibility with BenchmarkConfig.save_dir.

Parameters:

run_id (str)

finalize_run(run_id)

Collect scores from benchmark results and update metadata.

Parameters:

run_id (str)

list_runs()

List all evaluation runs, most recent first.

read_run(run_id)

Load metadata for a specific run. Raises FileNotFoundError if missing.

Parameters:

run_id (str)
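
Because read_run raises FileNotFoundError for an unknown run, callers that treat a missing run as a normal case may want a small wrapper. The helper name read_run_or_none below is hypothetical; only read_run and the documented exception come from this reference.

```python
from typing import Any, Optional


def read_run_or_none(store: Any, run_id: str) -> Optional[Any]:
    """Return the run metadata, or None when the run does not exist."""
    try:
        return store.read_run(run_id)
    except FileNotFoundError:
        # Documented behavior: read_run raises when the run is missing.
        return None
```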

list_benchmarks(run_id)

List benchmark names that have results for a run.

Parameters:

run_id (str)

read_result(run_id, benchmark)

Get the aggregated result for a benchmark.

Parameters:

run_id (str)
benchmark (str)

read_trajectories(run_id, benchmark, correct_only, incorrect_only, errors_only)

Get trajectories with optional filtering.

Parameters:

run_id (str)
benchmark (str)
correct_only (bool)
incorrect_only (bool)
errors_only (bool)
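
The three boolean flags can be used to get a quick breakdown of a benchmark. The sketch below is illustrative only: trajectory_counts is a hypothetical helper, and it assumes that read_trajectories returns an iterable and that leaving all three flags False returns every trajectory.

```python
from typing import Any


def trajectory_counts(store: Any, run_id: str, benchmark: str) -> dict[str, int]:
    """Count trajectories under each documented filter flag."""

    def count(**enabled: bool) -> int:
        # Start with every filter off, then switch on the requested one.
        flags = {"correct_only": False, "incorrect_only": False, "errors_only": False}
        flags.update(enabled)
        # Assumption: the return value is iterable (list or generator).
        return sum(1 for _ in store.read_trajectories(run_id, benchmark, **flags))

    return {
        "all": count(),
        "correct": count(correct_only=True),
        "incorrect": count(incorrect_only=True),
        "errors": count(errors_only=True),
    }
```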

read_single_trajectory(run_id, benchmark, idx)

Get a single trajectory by index. This is an O(n) scan that loads all trajectories.

Parameters:

run_id (str)
benchmark (str)
idx (int)
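
Since each read_single_trajectory call loads every trajectory, fetching several indices one at a time repeats that work. One possible workaround is to load the list once and index into it. The helper read_many_by_index is hypothetical, and the sketch assumes that idx refers to the position in the unfiltered output of read_trajectories, which this reference does not state.

```python
from typing import Any, Sequence


def read_many_by_index(
    store: Any, run_id: str, benchmark: str, indices: Sequence[int]
) -> list[Any]:
    """Fetch several trajectories with a single load instead of one scan each."""
    # One load of the full, unfiltered trajectory list.
    trajectories = list(
        store.read_trajectories(
            run_id,
            benchmark,
            correct_only=False,
            incorrect_only=False,
            errors_only=False,
        )
    )
    # Assumption: position in this list matches the idx used by
    # read_single_trajectory. An out-of-range index raises IndexError.
    return [trajectories[i] for i in indices]
```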

read_summary(run_id)

Read the combined summary for a run, or None if missing.

Parameters:

run_id (str)

write_result(run_id, result)

Save a benchmark result.

Parameters:

run_id (str)
result (BenchmarkResult)

write_trajectory(run_id, benchmark, traj)

Append one trajectory to the JSONL file.

Parameters:

run_id (str)
benchmark (str)
traj (StoredTrajectory)

write_summary(run_id, results)

Save a combined summary.

Parameters:

run_id (str)
results (dict[str, BenchmarkResult])
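
A sketch of rebuilding the combined summary from results already stored for a run. The helper rebuild_summary is hypothetical; it assumes the results dict is keyed by benchmark name and that read_result returns a value write_summary accepts.

```python
from typing import Any


def rebuild_summary(store: Any, run_id: str) -> dict[str, Any]:
    """Collect every stored benchmark result for a run and save it as the summary."""
    # Assumption: dict keys are the benchmark names from list_benchmarks.
    results = {
        name: store.read_result(run_id, name)
        for name in store.list_benchmarks(run_id)
    }
    store.write_summary(run_id, results)
    return results
```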

delete_run(run_id)

Delete all data for a run. Idempotent (no error if already gone).

Parameters:

run_id (str)
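
Because list_runs returns runs most recent first and delete_run is idempotent, pruning old runs can be sketched as below. The helper prune_old_runs is hypothetical. This reference does not say what each list_runs entry looks like, so the caller supplies get_run_id to extract the identifier.

```python
from typing import Any, Callable


def prune_old_runs(
    store: Any, keep: int, get_run_id: Callable[[Any], str]
) -> list[str]:
    """Delete all but the `keep` most recent runs and return the deleted ids."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # list_runs is documented as most recent first, so older runs follow.
    stale = list(store.list_runs())[keep:]
    deleted = []
    for run in stale:
        run_id = get_run_id(run)
        # Safe to retry after a partial failure: delete_run is idempotent.
        store.delete_run(run_id)
        deleted.append(run_id)
    return deleted
```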

alist_runs()

Async version of list_runs().

aread_trajectories(run_id, benchmark)

Async version of read_trajectories().

Parameters:

run_id (str)
benchmark (str)

aread_result(run_id, benchmark)

Async version of read_result().

Parameters:

run_id (str)
benchmark (str)
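
The async variants make it possible to read several results concurrently. A minimal sketch using asyncio.gather follows; read_all_results is a hypothetical helper, and it assumes aread_result is a coroutine method that can safely be awaited concurrently for different benchmarks.

```python
import asyncio
from typing import Any


async def read_all_results(store: Any, run_id: str) -> dict[str, Any]:
    """Read the aggregated result of every benchmark in a run concurrently."""
    names = list(store.list_benchmarks(run_id))
    # gather preserves input order, so results line up with names.
    results = await asyncio.gather(
        *(store.aread_result(run_id, name) for name in names)
    )
    return dict(zip(names, results))
```

From synchronous code this could be driven with asyncio.run(read_all_results(store, run_id)).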