Reinforcement Learning Training Loop

We've provided a simple RL training loop in rl_loop.py. It avoids our environment classes and instead defines data loading and rollouts in a self-contained way, which suits people who like to write their own training loops or want to learn how things work under the hood. Our implementation in rl/train.py does essentially the same thing, but adds performance optimizations and extra features such as periodic evals.
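To give a feel for the general shape of such a loop before opening rl_loop.py, here is a minimal, library-free sketch. It is not the code in rl_loop.py and does not use the Tinker API. The three-armed bandit, the `toy_rl_loop` function, and every hyperparameter apart from the 57-step count are hypothetical and chosen only for illustration. Each step samples rollouts from the current policy, scores them, centers the rewards to get advantages, applies a policy-gradient update, and records a `reward/mean` metric.

```python
import json
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_rl_loop(num_steps=57, batch_size=64, lr=0.5, seed=0, metrics_path=None):
    """Toy policy-gradient loop on a 3-armed bandit (illustration only)."""
    rng = random.Random(seed)
    arm_rewards = [0.0, 0.5, 1.0]  # hypothetical fixed reward for each action
    logits = [0.0, 0.0, 0.0]       # the "policy" parameters
    history = []
    for step in range(num_steps):
        probs = softmax(logits)
        # Rollouts: sample one action per batch element from the current policy.
        actions = rng.choices(range(3), weights=probs, k=batch_size)
        rewards = [arm_rewards[a] for a in actions]
        mean_reward = sum(rewards) / batch_size
        # Advantages: center each reward on the batch mean.
        advantages = [r - mean_reward for r in rewards]
        # Policy gradient: d log pi(a) / d logit_k = 1[k == a] - probs[k].
        grads = [0.0, 0.0, 0.0]
        for a, adv in zip(actions, advantages):
            for k in range(3):
                grads[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
        logits = [w + lr * g / batch_size for w, g in zip(logits, grads)]
        history.append({"step": step, "reward/mean": mean_reward})
    # Optionally write one JSON object per line, like a metrics.jsonl file.
    if metrics_path is not None:
        with open(metrics_path, "w") as f:
            for row in history:
                f.write(json.dumps(row) + "\n")
    return history
```

Calling `toy_rl_loop()` returns one metrics row per step. Passing a `metrics_path` writes the same rows as JSON lines, so the toy output can be loaded with `pandas.read_json(..., lines=True)`.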

You can run the RL training loop using:

python -m tinker_cookbook.recipes.rl_loop

With the default config, results should be written to /tmp/tinker-examples/rl-loop, and the experiment should complete after 57 training steps. You can plot the reward curve as follows:

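import pandas
import matplotlib.pyplot as plt

# Each line of metrics.jsonl is one JSON object holding that step's metrics.
metrics_path = "/tmp/tinker-examples/rl-loop/metrics.jsonl"
df = pandas.read_json(metrics_path, lines=True)

# Plot the mean reward against the row index (one row per logged step).
plt.plot(df["reward/mean"], label="reward/mean")
plt.xlabel("step")
plt.legend()
plt.show()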

You should see a plot like this:

[Figure: Reward as a function of steps]
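If the per-step reward looks noisy, a trailing average can make the trend easier to read. The sketch below uses only the standard library and assumes that metrics.jsonl holds one JSON object per line with a `reward/mean` key, as in the plotting snippet above. The helper names `load_metric` and `trailing_mean` and the window size of 5 are hypothetical choices for illustration, not part of the cookbook.

```python
import json


def load_metric(metrics_path, key="reward/mean"):
    """Read one metric from a JSON-lines file, skipping rows that lack it."""
    values = []
    with open(metrics_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if key in row:
                values.append(row[key])
    return values


def trailing_mean(values, window=5):
    """Average each value with up to window - 1 values that precede it."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

For example, `trailing_mean(load_metric(metrics_path))` returns a list that can be passed to `plt.plot` alongside the raw curve. Early entries average over fewer points, so the smoothed curve has the same length as the input.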