RL Training Outputs

Each RL training run writes files to log_path. This page describes each file and how to extract data from it.

Files written to log_path

| File | Format | Contents |
|---|---|---|
| metrics.jsonl | JSONL | One JSON object per training iteration with all scalar metrics |
| config.json | JSON | Serialized training config (hyperparams, model, dataset, etc.) |
| checkpoints.jsonl | JSONL | Checkpoint metadata (paths, loop state for resume) |
| train_iteration_NNNNNN.html | HTML | Human-readable logtree report for training rollouts |
| train_iteration_NNNNNN_logtree.json | JSON | Machine-readable export of the same logtree trace |
| train_iteration_NNNNNN_rollout_summaries.jsonl | JSONL | One JSON object per trajectory with rewards, metrics, and step-level data |
| eval_<name>_iteration_NNNNNN.html | HTML | Logtree report for eval rollouts |
| eval_<name>_iteration_NNNNNN_logtree.json | JSON | Machine-readable export of the eval logtree trace |
| eval_<name>_iteration_NNNNNN_rollout_summaries.jsonl | JSONL | Per-trajectory eval data (for RLTestSetEvaluator) |
| code.diff | text | Git diff at the time training started |

<name> is the evaluator name (sanitized for filenames); iteration numbers are zero-padded to 6 digits.
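Because iteration numbers are zero-padded to six digits, the per-iteration filenames sort lexicographically and can be matched with a simple regex. A minimal sketch (the filenames below are illustrative, not taken from a real run):

```python
import re

# Illustrative filenames following the patterns in the table above.
names = [
    "train_iteration_000003_rollout_summaries.jsonl",
    "eval_test_iteration_000003_rollout_summaries.jsonl",
    "config.json",
]

# Extract the iteration number from training rollout summary filenames.
pattern = re.compile(r"train_iteration_(\d{6})_rollout_summaries\.jsonl")
iters = [int(m.group(1)) for n in names if (m := pattern.fullmatch(n))]
print(iters)  # [3]
```

In a real run you would apply the same pattern to the directory listing of log_path.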

metrics.jsonl

Each line is a JSON object keyed by metric name. Common keys (the exact set varies by env and config):

  • progress/batch, progress/done_frac — iteration index and completion fraction
  • env/all/reward/total — mean total reward across all trajectories
  • env/all/<metric> — env-emitted metrics (e.g., format_parse, correct)
  • ac_tokens_per_turn — mean generated tokens per turn
  • entropy — per-token entropy
  • kl_sample_train_v1, kl_sample_train_v2 — KL divergence estimators
  • optim/lr — learning rate
  • time/... — wall-clock timings for different stages
For example, to load the metrics into a pandas DataFrame and plot reward over iterations:

import pandas as pd

# Each line of metrics.jsonl becomes one DataFrame row.
df = pd.read_json("path/to/metrics.jsonl", lines=True)
df.plot(x="progress/batch", y="env/all/reward/total")

*_rollout_summaries.jsonl

One line per trajectory. Best for aggregate analysis (reward distributions, per-step metrics).

import json
 
with open("train_iteration_000010_rollout_summaries.jsonl") as f:
    trajectories = [json.loads(line) for line in f]
 
# Each trajectory has:
# - metadata: schema_version, split, iteration, group_idx, traj_idx, tags, sampling_client_step
# - episode totals: total_reward, final_reward, trajectory_metrics, final_ob_len
# - steps: list of {step_idx, ob_len, ac_len, reward, episode_done, metrics, logs}
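The episode totals above are enough for quick aggregate analysis without touching the logtree files. A minimal sketch using hand-written trajectories in the documented shape (real files contain one such object per line):

```python
import statistics

# Illustrative trajectories with a subset of the fields documented above.
trajectories = [
    {"total_reward": 1.0, "steps": [{"step_idx": 0, "reward": 1.0, "episode_done": True}]},
    {"total_reward": 0.0, "steps": [{"step_idx": 0, "reward": 0.0, "episode_done": True}]},
]

# Aggregate statistics across trajectories.
mean_reward = statistics.mean(t["total_reward"] for t in trajectories)
mean_steps = statistics.mean(len(t["steps"]) for t in trajectories)
print(mean_reward, mean_steps)  # 0.5 1
```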

*_logtree.json

The logtree JSON contains full rollout transcripts: prompts, model responses, grading details, and reward breakdowns. Use this when you need the actual text content of rollouts.

Top level: title, started_at, path, root. root is a tree of nodes, each with tag, attrs, and children (either text strings or nested nodes).
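A small traversal sketch over a hand-built root node with the tag/attrs/children shape described above (the tag names here are illustrative):

```python
def iter_tags(node):
    """Yield the tag of every dict node in a logtree, depth-first."""
    if isinstance(node, dict):
        yield node.get("tag")
        for child in node.get("children", []):
            yield from iter_tags(child)

# Illustrative root node: children may be text strings or nested nodes.
root = {"tag": "root", "attrs": {}, "children": [
    {"tag": "group", "attrs": {}, "children": ["some text"]},
]}
print(list(iter_tags(root)))  # ['root', 'group']
```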

Some nodes carry a data field with structured content. Use data to extract typed data like conversation messages:

import json
 
def find_conversations(node):
    """Recursively find all nodes with conversation data."""
    results = []
    if isinstance(node, dict):
        if node.get("data", {}).get("type") == "conversation":
            results.append(node["data"])
        for child in node.get("children", []):
            if isinstance(child, dict):
                results.extend(find_conversations(child))
    return results
 
with open("eval_test_iteration_000020_logtree.json") as f:
    trace = json.load(f)
 
for conv in find_conversations(trace["root"]):
    for msg in conv["messages"]:
        print(f"{msg['role']}: {msg['content'][:100] if isinstance(msg['content'], str) else '...'}")

Note: num_groups_to_log (default: 4) controls how many trajectory groups get detailed env-level logging. Groups beyond this limit have no rollout content in the logtree — only the Trajectory Details section (turn-level stats) is always present.

config.json

Serialized chz config capturing all training hyperparameters. Useful for reproducing a run or comparing configs across experiments.
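One way to compare runs is a shallow key-by-key diff of their configs. A sketch over two illustrative config dicts (a real comparison would load each run's config.json; the keys below are made up for the example):

```python
# Illustrative configs from two hypothetical runs.
cfg_a = {"lr": 1e-4, "batch_size": 64, "model": "m1"}
cfg_b = {"lr": 3e-4, "batch_size": 64, "model": "m1"}

# Collect keys whose values differ between the two runs.
diff = {k: (cfg_a.get(k), cfg_b.get(k))
        for k in sorted(set(cfg_a) | set(cfg_b))
        if cfg_a.get(k) != cfg_b.get(k)}
print(diff)  # {'lr': (0.0001, 0.0003)}
```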

checkpoints.jsonl

Each line records a saved checkpoint with its path and the loop state at save time. Used by the resume logic to pick up where training left off.
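Since checkpoint records are appended one per line, the last line describes the most recent checkpoint. A sketch over illustrative lines (the exact keys here are assumptions for the example, not the documented schema):

```python
import json

# Illustrative checkpoints.jsonl contents; real records hold a
# checkpoint path plus the loop state at save time.
lines = [
    '{"path": "/ckpts/000005", "iteration": 5}',
    '{"path": "/ckpts/000010", "iteration": 10}',
]

# The final line is the latest checkpoint, which resume logic picks up.
latest = json.loads(lines[-1])
print(latest["path"])  # /ckpts/000010
```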