Your First RL Run
We've provided a minimal script that runs RL on the GSM8K dataset: rl_basic.py. You can run it from the command line as follows:
python -m tinker_cookbook.recipes.rl_basic
This script will fine-tune the Llama-3.1-8B base (pretrained) model on this dataset, using a reward that combines a format component (is the completion formatted as requested?) and a correctness component (is the final answer right?).
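As a rough illustration of what such a combined reward can look like, here is a minimal sketch; the answer-extraction regex, the "#### answer" convention, and the 0.1 format weight are illustrative assumptions, not the script's actual logic:

```python
import re

def gsm8k_reward(completion: str, target_answer: str) -> dict:
    """Illustrative reward combining a format and a correctness component.

    The regex, answer convention, and weights here are assumptions for
    the sake of example, not rl_basic.py's exact implementation.
    """
    # Format component: did the model end with an answer line like "#### 42"?
    match = re.search(r"####\s*(-?[\d,]+)\s*$", completion.strip())
    format_ok = match is not None
    # Correctness component: does the extracted answer match the reference?
    correct = format_ok and match.group(1).replace(",", "") == target_answer
    return {
        "format": float(format_ok),
        "correct": float(correct),
        "total": float(correct) + 0.1 * float(format_ok),  # illustrative weighting
    }
```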
The training should take about 1 minute per iteration and climb to about 63% accuracy (env/all/correct) after 15 iterations. You can look at the printouts for some other metrics of interest:
- ac_tokens_per_turn: the number of tokens in each generated completion
- env/all/format: the fraction of completions that are formatted correctly
- env/all/reward/total: mean total reward (combining format and correctness as defined above)
- entropy: per-token entropy (mean negative log-probability of sampled tokens)
- kl_sample_train_{v1,v2}: two different estimators of the KL divergence between the sampler's and the learner's probability distributions, which is nonzero due to numerical differences and rounding noise (entropy and these estimators are illustrated in the sketch after this list)
- progress/done_frac: the fraction of the total number of iterations completed so far
- time/...: time spent in different parts of the training loop
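To make the entropy and KL metrics concrete, here is a minimal sketch of how such quantities can be computed from per-token log-probabilities. The two estimator forms below are standard sample-based KL estimators; they are an assumption, not necessarily the script's exact implementation:

```python
import numpy as np

# Hypothetical per-token log-probs of the sampled tokens, evaluated
# under the sampler and under the learner (same tokens, two models).
logp_sampler = np.array([-1.2, -0.4, -2.1, -0.8])
logp_learner = np.array([-1.1, -0.5, -2.0, -0.9])

# Per-token entropy estimate: mean negative log-prob of the sampled tokens.
entropy = -logp_sampler.mean()

# Two standard estimators of KL(sampler || learner) from samples:
log_ratio = logp_sampler - logp_learner
kl_v1 = log_ratio.mean()                              # naive estimator
kl_v2 = (np.exp(-log_ratio) - 1 + log_ratio).mean()   # lower-variance "k3" estimator

print(f"entropy={entropy:.3f} kl_v1={kl_v1:.4f} kl_v2={kl_v2:.4f}")
```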
You can also look at the log_path directory for more detailed metrics. There are several files of interest, which are mostly the same as in the Supervised Learning case.
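For a quick programmatic look at these files, here is a sketch that assumes the run writes a metrics.jsonl file with one JSON object of metrics per iteration, as in the supervised learning case; the path is a placeholder for your own log_path:

```python
import json
from pathlib import Path

# Hypothetical log directory; substitute your run's actual log_path.
log_path = Path("/tmp/rl_basic_logs")

# Assumes a metrics.jsonl file with one JSON object per training iteration.
with open(log_path / "metrics.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Print accuracy progression over the run.
for row in rows:
    print(row.get("progress/done_frac"), row.get("env/all/correct"))
```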