tinker_cookbook.stores.EvalStore

class tinker_cookbook.stores.EvalStore(**)

Manages evaluation runs across checkpoints.

url(path)

Return a human-readable URI for a path within this eval store.

Parameters:

  • path (str)

create_run(model_name, benchmarks, checkpoint_path, checkpoint_name, config, run_id)

Create a new evaluation run and return its run_id.

Parameters:

  • model_name (str)
  • benchmarks (list[str])
  • checkpoint_path (str | None)
  • checkpoint_name (str | None)
  • config (dict | None)
  • run_id (str | None)

run_dir(run_id)

Return the filesystem path for a run. Provided for backward compatibility with BenchmarkConfig.save_dir.

Parameters:

  • run_id (str)

finalize_run(run_id)

Collect scores from benchmark results and update metadata.

Parameters:

  • run_id (str)

list_runs()

List all evaluation runs, most recent first.

read_run(run_id)

Load metadata for a specific run. Raises FileNotFoundError if missing.

Parameters:

  • run_id (str)

list_benchmarks(run_id)

List benchmark names that have results for a run.

Parameters:

  • run_id (str)

read_result(run_id, benchmark)

Get aggregated result for a benchmark.

Parameters:

  • run_id (str)
  • benchmark (str)

read_trajectories(run_id, benchmark, correct_only, incorrect_only, errors_only)

Get trajectories with optional filtering.

Parameters:

  • run_id (str)
  • benchmark (str)
  • correct_only (bool)
  • incorrect_only (bool)
  • errors_only (bool)

read_single_trajectory(run_id, benchmark, idx)

Get a single trajectory by index. This is an O(n) scan: all trajectories for the benchmark are loaded to return one.

Parameters:

  • run_id (str)
  • benchmark (str)
  • idx (int)
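
The O(n) cost follows from line-oriented storage: a JSONL file has no index, so reaching record idx means parsing the records before it. A minimal sketch of that access pattern, assuming one JSON object per line; `read_by_index` and the record fields are hypothetical and not the library's implementation.

```python
import json
import tempfile
from pathlib import Path


def read_by_index(jsonl_path: Path, idx: int) -> dict:
    """Load every record from a JSONL file, then return the idx-th one."""
    with jsonl_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return records[idx]  # raises IndexError if idx is out of range


# Demo on a throwaway file holding three hypothetical records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "trajectories.jsonl"
    demo_path.write_text(
        "".join(json.dumps({"idx": i}) + "\n" for i in range(3)),
        encoding="utf-8",
    )
    second = read_by_index(demo_path, 1)
```

When many trajectories from the same benchmark are needed, a single read_trajectories call followed by in-memory indexing is likely cheaper than repeated single lookups.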

read_summary(run_id)

Read the combined summary for a run, or None if missing.

Parameters:

  • run_id (str)

write_result(run_id, result)

Save a benchmark result.

Parameters:

  • run_id (str)
  • result (BenchmarkResult)

write_trajectory(run_id, benchmark, traj)

Append one trajectory to the JSONL file.

Parameters:

  • run_id (str)
  • benchmark (str)
  • traj (StoredTrajectory)
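
Appending to JSONL means writing one serialized record plus a newline, with the file opened in append mode. The sketch below shows the general pattern only; `append_jsonl`, the directory layout, and the record fields are assumptions, and a real StoredTrajectory would need its own serialization.

```python
import json
import tempfile
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single line, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# Demo: two appends produce two lines, in write order.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "run-1" / "bench" / "trajectories.jsonl"
    append_jsonl(demo_path, {"idx": 0, "correct": True})
    append_jsonl(demo_path, {"idx": 1, "correct": False})
    lines = demo_path.read_text(encoding="utf-8").splitlines()
```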

write_summary(run_id, results)

Save a combined summary.

Parameters:

  • run_id (str)
  • results (dict[str, BenchmarkResult])

delete_run(run_id)

Delete all data for a run. Idempotent (no error if already gone).

Parameters:

  • run_id (str)
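
Idempotent here means a second delete of the same run succeeds silently. For directory-backed storage, one way to get that behavior is to treat "already gone" as success. This is a sketch under that assumption; `delete_tree_idempotent` is a hypothetical helper, and the source does not say how EvalStore implements deletion.

```python
import shutil
import tempfile
from pathlib import Path


def delete_tree_idempotent(run_path: Path) -> None:
    """Remove a directory tree; do nothing if it is already absent."""
    try:
        shutil.rmtree(run_path)
    except FileNotFoundError:
        pass  # already deleted, which is the desired end state


# Demo: deleting twice leaves the same result and raises nothing.
with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "run-1"
    (run_path / "bench").mkdir(parents=True)
    delete_tree_idempotent(run_path)
    delete_tree_idempotent(run_path)  # second call is a no-op
    still_exists = run_path.exists()
```

Catching only FileNotFoundError, rather than ignoring every error, keeps permission problems and similar failures visible.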

alist_runs()

Async version of list_runs().

aread_trajectories(run_id, benchmark)

Async version of read_trajectories().

Parameters:

  • run_id (str)
  • benchmark (str)

aread_result(run_id, benchmark)

Async version of read_result().

Parameters:

  • run_id (str)
  • benchmark (str)
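
A common way to offer async variants of blocking I/O methods is to run the blocking call in a worker thread with asyncio.to_thread. Whether EvalStore does this is not stated here, so the sketch below only illustrates the pattern; `list_runs_blocking` and `alist_runs_sketch` are hypothetical stand-ins.

```python
import asyncio


def list_runs_blocking() -> list[str]:
    """Hypothetical blocking call, standing in for a filesystem listing."""
    return ["run-b", "run-a"]


async def alist_runs_sketch() -> list[str]:
    """Run the blocking call in a thread so the event loop stays responsive."""
    return await asyncio.to_thread(list_runs_blocking)


runs = asyncio.run(alist_runs_sketch())
```

Inside an existing coroutine, the async methods documented above would be awaited directly, for example `await store.alist_runs()`.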