RL Training Outputs

Each RL training run writes files to `log_path`. This page describes each file and how to extract data from it.

Files written to log_path

| File | Format | Contents |
| --- | --- | --- |
| `metrics.jsonl` | JSONL | One JSON object per training iteration with all scalar metrics |
| `config.json` | JSON | Serialized training config (hyperparams, model, dataset, etc.) |
| `checkpoints.jsonl` | JSONL | Checkpoint metadata (paths, loop state for resume) |
| `train_iteration_NNNNNN.html` | HTML | Human-readable logtree report for training rollouts |
| `train_iteration_NNNNNN_logtree.json` | JSON | Machine-readable export of the same logtree trace |
| `train_iteration_NNNNNN_rollout_summaries.jsonl` | JSONL | One JSON object per trajectory with rewards, metrics, and step-level data |
| `eval_<name>_iteration_NNNNNN.html` | HTML | Logtree report for eval rollouts |
| `eval_<name>_iteration_NNNNNN_logtree.json` | JSON | Machine-readable export of the eval logtree trace |
| `eval_<name>_iteration_NNNNNN_rollout_summaries.jsonl` | JSONL | Per-trajectory eval data (for RLTestSetEvaluator) |
| `code.diff` | text | Git diff at the time training started |

`<name>` is the evaluator name (sanitized for filenames); iteration numbers are zero-padded to 6 digits.
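As a sketch, the naming scheme above is enough to enumerate which iterations have a training report on disk (`list_iterations` is a hypothetical helper, not part of the training code):

```python
from pathlib import Path

def list_iterations(log_path):
    """Return the zero-padded iteration numbers that have a training report."""
    return sorted(
        p.stem.split("_")[2]  # "train_iteration_000010" -> "000010"
        for p in Path(log_path).glob("train_iteration_*.html")
    )
```

Only the `train_iteration_NNNNNN.html` reports match this glob; the `_logtree.json` and `_rollout_summaries.jsonl` siblings have different extensions.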

metrics.jsonl

Each line is a JSON object keyed by metric name. Common keys (the exact set varies by env and config):

  • `progress/batch`, `progress/done_frac` — iteration index and completion fraction
  • `env/all/reward/total` — mean total reward across all trajectories
  • `env/all/<metric>` — env-emitted metrics (e.g., format_parse, correct)
  • `ac_tokens_per_turn` — mean generated tokens per turn
  • `entropy` — per-token entropy
  • `kl_sample_train_v1`, `kl_sample_train_v2` — KL divergence estimators
  • `optim/lr` — learning rate
  • `time/...` — wall-clock timings for different stages
Load it into a DataFrame with pandas:

```python
import pandas as pd

df = pd.read_json("path/to/metrics.jsonl", lines=True)
df.plot(x="progress/batch", y="env/all/reward/total")
```

*_rollout_summaries.jsonl

One line per trajectory. Best for aggregate analysis (reward distributions, per-step metrics).

```python
import json

with open("train_iteration_000010_rollout_summaries.jsonl") as f:
    trajectories = [json.loads(line) for line in f]

# Each trajectory has:
# - metadata: schema_version, split, iteration, group_idx, traj_idx, tags, sampling_client_step
# - episode totals: total_reward, final_reward, trajectory_metrics, final_ob_len
# - steps: list of {step_idx, ob_len, ac_len, reward, episode_done, metrics, logs}
```
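For aggregate work, it is convenient to load these lines straight into pandas. A minimal sketch (`load_summaries` is a hypothetical helper; the field names are those listed above):

```python
import json
import pandas as pd

def load_summaries(path):
    """Load one rollout-summaries JSONL file into a DataFrame."""
    with open(path) as f:
        return pd.DataFrame(json.loads(line) for line in f)

# e.g.:
# df = load_summaries("train_iteration_000010_rollout_summaries.jsonl")
# df["total_reward"].describe()                   # reward distribution
# df.groupby("group_idx")["total_reward"].mean()  # per-group mean reward
```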

*_logtree.json

The logtree JSON contains full rollout transcripts: prompts, model responses, grading details, and reward breakdowns. Use this when you need the actual text content of rollouts.

Top-level keys: `title`, `started_at`, `path`, and `root`. `root` is a tree of nodes, each with `tag`, `attrs`, and `children` (either text strings or nested nodes).

Some nodes carry a `data` field with structured content; use it to extract typed content such as conversation messages:

```python
import json

def find_conversations(node):
    """Recursively collect all nodes carrying conversation data."""
    results = []
    if isinstance(node, dict):
        if node.get("data", {}).get("type") == "conversation":
            results.append(node["data"])
        for child in node.get("children", []):
            if isinstance(child, dict):
                results.extend(find_conversations(child))
    return results

with open("eval_test_iteration_000020_logtree.json") as f:
    trace = json.load(f)

for conv in find_conversations(trace["root"]):
    for msg in conv["messages"]:
        content = msg["content"]
        print(f"{msg['role']}: {content[:100] if isinstance(content, str) else '...'}")
```

Note: `num_groups_to_log` (default: 4) controls how many trajectory groups get detailed env-level logging. Groups beyond this limit have no rollout content in the logtree; only the Trajectory Details section (turn-level stats) is always present.

config.json

Serialized chz config capturing all training hyperparameters. Useful for reproducing a run or comparing configs across experiments.
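Since the file is plain JSON, two runs can be compared directly. A minimal sketch that diffs top-level keys only (`config_diff` is a hypothetical helper; nested sub-configs are compared as whole values):

```python
import json

def config_diff(path_a, path_b):
    """Return {key: (a_value, b_value)} for top-level keys that differ."""
    with open(path_a) as f:
        a = json.load(f)
    with open(path_b) as f:
        b = json.load(f)
    return {
        k: (a.get(k), b.get(k))
        for k in set(a) | set(b)
        if a.get(k) != b.get(k)
    }
```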

checkpoints.jsonl

Each line records a saved checkpoint with its path and the loop state at save time. Used by the resume logic to pick up where training left off.
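A minimal sketch for locating the newest checkpoint record, assuming only that lines are appended in chronological order (the exact fields inside each record depend on your setup):

```python
import json

def latest_checkpoint(path="checkpoints.jsonl"):
    """Return the last (most recent) checkpoint record, or None if empty."""
    last = None
    with open(path) as f:
        for line in f:
            if line.strip():
                last = json.loads(line)
    return last
```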