Verifiers RL
Use RL environments from Prime Intellect's Environments Hub with Tinker for training.
What you'll build
An RL training loop using any text-based environment from the Environments Hub, powered by the Verifiers library. Environments include reverse-text, alphabet-sort, math-python, wordle, and community contributions.
Prerequisites
uv pip install tinker-cookbook
uv tool install prime
prime env install primeintellect/reverse-text # or any other environment
Key concepts
- Verifiers — a library for creating RL environments for LLMs with standardized reward functions
- Environments Hub — a registry of community-built environments installable via the prime CLI
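Each environment couples a task with one or more reward functions that the trainer calls on sampled completions to produce scalar rewards. As a flavor of what such a reward computes, here is a toy sketch — the function name and signature are illustrative, not the actual Verifiers API or the real reverse-text implementation:

```python
# Toy sketch of a reward function of the kind a Verifiers environment bundles.
# Illustrative only: the actual reverse-text environment defines its own reward,
# and the real Verifiers reward interface differs from this signature.

def reverse_text_reward(prompt: str, completion: str) -> float:
    """Partial-credit reward: fraction of characters matching the reversed prompt."""
    target = prompt[::-1]
    matches = sum(a == b for a, b in zip(completion, target))
    return matches / max(len(target), 1)

print(reverse_text_reward("hello", "olleh"))  # exact reversal scores 1.0
print(reverse_text_reward("hello", "hello"))  # non-reversal scores lower
```

Standardizing on this shape — completion in, scalar out — is what lets one training loop drive any environment from the Hub.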
Run it
To evaluate offline:
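Conceptually, offline evaluation samples one completion per prompt from a fixed policy, scores each with the environment's reward, and reports the mean. A self-contained sketch with a stand-in reward and canned completions (illustrative only; the real recipe drives a Tinker sampling client against the installed environment):

```python
# Offline evaluation in miniature: score fixed completions against a reward
# and average. Everything here is a stand-in for illustration; the real loop
# samples from the model via Tinker and scores with the environment's reward.

def reward(prompt: str, completion: str) -> float:
    """Toy reverse-text reward: 1.0 for an exact reversal, else 0.0."""
    return 1.0 if completion == prompt[::-1] else 0.0

# Canned "model" outputs standing in for sampled completions.
completions = {"hello": "olleh", "world": "world", "tinker": "reknit"}

scores = [reward(p, c) for p, c in completions.items()]
print(sum(scores) / len(scores))  # mean reward over the eval batch
```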
Expected results
The reverse-text environment's reward should climb from ~0.2 to ~0.35 over 32 training steps.
This recipe also includes a standalone AsyncOpenAI-compatible client (tinker_openai.py) implemented with Tinker; it can be adapted and reused in other applications that need an OpenAI-compatible inference interface backed by Tinker.
Note: Some Environments Hub environments implement their own <think> parsers (e.g. for reasoning RL started from Instruct models). The Qwen3 models, Instruct variants included, all share a chat template that automatically strips <think> sections from prior turns, so thinking content can be inadvertently penalized by reward functions that expect to find it. If you observe this with Qwen3 models, modify the renderer, the tokenizer chat template, or the environment module.
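To see why this bites, here is a self-contained illustration. The strip logic below mimics what a Qwen3-style chat template does to <think> sections in prior turns; it is a simplification for demonstration, not the actual template:

```python
import re

# A reward parser of the kind some Environments Hub modules implement:
# it looks for a <think>...</think> section and rewards its presence.
def thinking_reward(completion: str) -> float:
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

# Simplified stand-in for what a Qwen3-style chat template does when
# re-rendering prior turns: thinking sections are removed before the text
# is ever seen again.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

raw = "<think>reverse each character</think> olleh"
print(thinking_reward(raw))               # thinking section present: full reward
print(thinking_reward(strip_think(raw)))  # stripped by the template: penalized
```

The reward function never sees the thinking section the model actually produced, so it scores the completion as if the model skipped reasoning entirely.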