Verifiers RL

Use RL environments from Prime Intellect's Environments Hub with Tinker for training.

What you'll build

An RL training loop using any text-based environment from the Environments Hub, powered by the Verifiers library. Environments include reverse-text, alphabet-sort, math-python, wordle, and community contributions.

Prerequisites

uv pip install tinker-cookbook
uv tool install prime
prime env install primeintellect/reverse-text  # or any other environment

Key concepts

  • Verifiers — a library for creating RL environments for LLMs with standardized reward functions
  • Environments Hub — a registry of community-built environments installable via the prime CLI
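
To make "standardized reward functions" concrete, here is a hedged sketch of what a reward function for an environment like reverse-text could look like. This is illustrative only — the function name, signature, and scoring scheme are assumptions, not the actual Verifiers API:

```python
import difflib

def reverse_text_reward(prompt: str, completion: str, answer: str) -> float:
    """Illustrative reward: 1.0 for an exact reversed string, partial credit otherwise.

    A sketch of the reward-function concept, not the actual Verifiers rubric interface.
    """
    predicted = completion.strip()
    if predicted == answer:
        return 1.0
    # Partial credit: character-level similarity to the target reversal.
    return difflib.SequenceMatcher(None, predicted, answer).ratio()

# The task is to reverse "hello world"; the target is the reversed string.
target = "hello world"[::-1]  # "dlrow olleh"
print(reverse_text_reward("Reverse: hello world", "dlrow olleh", target))  # exact match -> 1.0
```

The key idea is that every environment exposes the same scoring surface (completion in, scalar reward out), which is what lets one training loop drive many different tasks.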

Run it

python -m tinker_cookbook.recipes.verifiers_rl.train \
    vf_env_id=reverse-text \
    vf_env_args='{}'

To evaluate offline:

python -m tinker_cookbook.recipes.verifiers_rl.evaluate \
    vf_env_id=reverse-text \
    vf_env_args='{}'

Expected results

On reverse-text, reward should climb from roughly 0.2 to roughly 0.35 over 32 training steps.

This recipe also includes tinker_openai.py, a standalone AsyncOpenAI-compatible client implemented on top of Tinker. It can be adapted and reused by other applications that need an OpenAI-compatible inference interface backed by Tinker.
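
As a rough illustration of the call surface such a client must expose (the `client.chat.completions.create(...)` shape that OpenAI-compatible consumers expect), here is a self-contained toy. All class names and the echo backend are invented for the sketch; tinker_openai.py's actual internals will differ:

```python
import asyncio
from dataclasses import dataclass, field

# Minimal stand-ins for the OpenAI-style response objects.
@dataclass
class Message:
    role: str
    content: str

@dataclass
class Choice:
    message: Message

@dataclass
class ChatCompletion:
    choices: list[Choice] = field(default_factory=list)

class EchoChatCompletions:
    """Toy backend that echoes the last user message; a real client would sample from Tinker."""
    async def create(self, *, model: str, messages: list[dict], **kwargs) -> ChatCompletion:
        last_user = next(m["content"] for m in reversed(messages) if m["role"] == "user")
        return ChatCompletion([Choice(Message("assistant", last_user))])

class FakeAsyncOpenAI:
    """Exposes the `client.chat.completions.create(...)` surface OpenAI-compatible callers use."""
    def __init__(self) -> None:
        self.chat = type("Chat", (), {"completions": EchoChatCompletions()})()

async def main() -> str:
    client = FakeAsyncOpenAI()
    resp = await client.chat.completions.create(
        model="toy", messages=[{"role": "user", "content": "ping"}]
    )
    return resp.choices[0].message.content

print(asyncio.run(main()))  # -> ping
```

Any code written against this surface can swap between the real OpenAI client and a Tinker-backed one without changes, which is the point of the adapter.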

Note: Some Environments Hub implementations rely on users writing their own <think> parsers (e.g. for reasoning RL starting from Instruct models). The Qwen3 models, Instruct variants included, all share a tokenizer chat template that automatically strips any observed <think> sections. Thinking content can therefore be inadvertently penalized by reward functions that expect to find it. If you observe issues with thinking sections from Qwen3 models, modify the renderer, the tokenizer chat template, or the environment module.

Learn more