# RL Training Outputs

Each RL training run writes files to `log_path`. This page describes each file and how to extract data from it.

## Files written to `log_path`
| File | Format | Contents |
|---|---|---|
| `metrics.jsonl` | JSONL | One JSON object per training iteration with all scalar metrics |
| `config.json` | JSON | Serialized training config (hyperparams, model, dataset, etc.) |
| `checkpoints.jsonl` | JSONL | Checkpoint metadata (paths, loop state for resume) |
| `train_iteration_NNNNNN.html` | HTML | Human-readable logtree report for training rollouts |
| `train_iteration_NNNNNN_logtree.json` | JSON | Machine-readable export of the same logtree trace |
| `train_iteration_NNNNNN_rollout_summaries.jsonl` | JSONL | One JSON object per trajectory with rewards, metrics, and step-level data |
| `eval_<name>_iteration_NNNNNN.html` | HTML | Logtree report for eval rollouts |
| `eval_<name>_iteration_NNNNNN_logtree.json` | JSON | Machine-readable export of the eval logtree trace |
| `eval_<name>_iteration_NNNNNN_rollout_summaries.jsonl` | JSONL | Per-trajectory eval data (for `RLTestSetEvaluator`) |
| `code.diff` | text | Git diff at the time training started |
`<name>` is the evaluator name (sanitized for filenames); iteration numbers are zero-padded to 6 digits.
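Since all per-iteration files share the zero-padded iteration number, they can be gathered with a glob. A minimal sketch (the `list_iteration_files` helper and its pattern are illustrative, not part of the training code):

```python
import pathlib

def list_iteration_files(log_path, iteration):
    """Return all per-iteration files for one training iteration."""
    pattern = f"*iteration_{iteration:06d}*"  # iteration numbers are zero-padded to 6 digits
    return sorted(pathlib.Path(log_path).glob(pattern))
```

This matches the HTML report, logtree JSON, and rollout summaries for both train and eval files of that iteration.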
## `metrics.jsonl`

Each line is a JSON object keyed by metric name. Common keys (varies by env and config):

- `progress/batch`, `progress/done_frac`: iteration index and completion fraction
- `env/all/reward/total`: mean total reward across all trajectories
- `env/all/<metric>`: env-emitted metrics (e.g., `format_parse`, `correct`)
- `ac_tokens_per_turn`: mean generated tokens per turn
- `entropy`: per-token entropy
- `kl_sample_train_v1`, `kl_sample_train_v2`: KL divergence estimators
- `optim/lr`: learning rate
- `time/...`: wall-clock timings for different stages
```python
import pandas as pd

df = pd.read_json("path/to/metrics.jsonl", lines=True)
df.plot(x="progress/batch", y="env/all/reward/total")
```

## `*_rollout_summaries.jsonl`
One line per trajectory. Best for aggregate analysis (reward distributions, per-step metrics).
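Because each line carries episode totals, reward distributions can be computed directly from the file. A minimal sketch using only the `total_reward` field documented below (`summarize_rewards` is an illustrative helper):

```python
import json
import statistics

def summarize_rewards(path):
    """Aggregate per-trajectory total_reward from a *_rollout_summaries.jsonl file."""
    rewards = []
    with open(path) as f:
        for line in f:
            rewards.append(json.loads(line)["total_reward"])
    return {
        "n": len(rewards),
        "mean": statistics.mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
    }
```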
```python
import json

with open("train_iteration_000010_rollout_summaries.jsonl") as f:
    trajectories = [json.loads(line) for line in f]

# Each trajectory has:
# - metadata: schema_version, split, iteration, group_idx, traj_idx, tags, sampling_client_step
# - episode totals: total_reward, final_reward, trajectory_metrics, final_ob_len
# - steps: list of {step_idx, ob_len, ac_len, reward, episode_done, metrics, logs}
```

## `*_logtree.json`
The logtree JSON contains full rollout transcripts: prompts, model responses, grading details, and reward breakdowns. Use this when you need the actual text content of rollouts.
Top level: `title`, `started_at`, `path`, `root`. `root` is a tree of nodes, each with `tag`, `attrs`, and `children` (either text strings or nested nodes).
Some nodes carry a `data` field with structured content. Use `data` to extract typed content such as conversation messages:
```python
import json

def find_conversations(node):
    """Recursively find all nodes with conversation data."""
    results = []
    if isinstance(node, dict):
        if node.get("data", {}).get("type") == "conversation":
            results.append(node["data"])
        for child in node.get("children", []):
            if isinstance(child, dict):
                results.extend(find_conversations(child))
    return results

with open("eval_test_iteration_000020_logtree.json") as f:
    trace = json.load(f)

for conv in find_conversations(trace["root"]):
    for msg in conv["messages"]:
        content = msg["content"]
        print(f"{msg['role']}: {content[:100] if isinstance(content, str) else '...'}")
```

Note: `num_groups_to_log` (default: 4) controls how many trajectory groups get detailed env-level logging. Groups beyond this limit have no rollout content in the logtree; only the Trajectory Details section (turn-level stats) is present for every group.
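For a quick overview of a trace's shape before digging into content, the same recursive walk can count nodes by tag. A hedged sketch assuming only the `tag`/`children` structure described above (`count_tags` is an illustrative helper, not part of the library):

```python
import collections

def count_tags(node, counts=None):
    """Count logtree nodes by tag, walking children recursively."""
    if counts is None:
        counts = collections.Counter()
    if isinstance(node, dict):
        counts[node.get("tag", "<untagged>")] += 1
        for child in node.get("children", []):
            count_tags(child, counts)  # text-string children are skipped by the isinstance check
    return counts
```

Calling `count_tags(trace["root"])` on a loaded trace shows at a glance which node types dominate the report.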
## `config.json`

Serialized `chz` config capturing all training hyperparameters. Useful for reproducing a run or comparing configs across experiments.
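To compare configs across experiments, two `config.json` files can be flattened into dotted keys and diffed. A minimal sketch (`flatten` and `diff_configs` are illustrative helpers; the actual config fields depend on your run):

```python
import json

def flatten(d, prefix=""):
    """Flatten nested dicts into dotted keys for easy comparison."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

def diff_configs(path_a, path_b):
    """Return {key: (a_value, b_value)} for keys that differ between two configs."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = flatten(json.load(fa)), flatten(json.load(fb))
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}
```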
## `checkpoints.jsonl`

Each line records a saved checkpoint with its path and the loop state at save time. The resume logic uses this file to pick up where training left off.
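To locate the most recent checkpoint programmatically, read the last record of the file. A minimal sketch, assuming each record carries a `path` key (the exact field names may differ in your version):

```python
import json

def latest_checkpoint(path):
    """Return the most recent checkpoint record (last line of checkpoints.jsonl)."""
    last = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last
```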