# Search Tool RL

Train LLMs to use retrieval tools for multi-hop question answering, replicating the Search-R1 approach.

## What you'll build
A model that learns multi-turn search behavior over a Wikipedia index to answer complex questions. Uses Chroma DB for vector search and Gemini embeddings. Based on the Search-R1 paper.
## Prerequisites

Set up Google Vertex AI credentials for embeddings:

```shell
export GOOGLE_GENAI_USE_VERTEXAI=...
export GCP_VERTEXAI_PROJECT_NUMBER=...
export GCP_VERTEXAI_REGION=...
```
Download and launch the pre-computed Wikipedia index:

- Download: the wiki18 index on HuggingFace
- Launch:

```shell
chroma run --host localhost --path <decompressed_path>/chroma_db --port 8000
```

Note: the Chroma service needs 160+ GB of RAM to load the full index.
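To make the retrieval step concrete, here is a toy, self-contained sketch of what the Chroma index provides: documents stored as vectors, queried by cosine similarity. The three-dimensional vectors and documents below are fabricated for illustration; real runs use Gemini embeddings against the wiki18 index served by `chroma run`.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Fake "index": document text -> embedding vector (real embeddings are
# high-dimensional; 3 dims here keep the example readable).
index = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "The Eiffel Tower is in Paris.": [0.8, 0.3, 0.1],
    "Mount Everest is in Nepal.": [0.1, 0.2, 0.9],
}

def search(query_vec, k=2):
    # Return the k documents most similar to the query vector.
    ranked = sorted(index, key=lambda doc: cosine(query_vec, index[doc]), reverse=True)
    return ranked[:k]

results = search([1.0, 0.2, 0.0])
print(results[0])  # "Paris is the capital of France."
```

In the actual recipe, the query vector comes from the Gemini embedding of the model's search query, and the lookup is served by the Chroma HTTP service launched above.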
## Key concepts
- Tool-use RL — the model learns when and how to invoke search tools during multi-turn episodes
- Multi-hop QA — answering questions that require chaining multiple search queries together
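A toy sketch of why multi-hop questions need chained queries: no single search returns the answer, so the model must use the result of one hop to form the next. The mini "corpus" and question below are hypothetical stand-ins for the Wikipedia search tool.

```python
# Hypothetical question: "Who directed the film that won Best Picture in 2020?"
# No single document answers it; two hops are required.
corpus = {
    "Best Picture 2020": "Parasite won Best Picture at the 2020 Oscars.",
    "Parasite director": "Parasite was directed by Bong Joon-ho.",
}

def search(query: str) -> str:
    # Stand-in for the retrieval tool: exact-match lookup.
    return corpus.get(query, "no result")

# Hop 1: find the film. Hop 2: use that result to find its director.
hop1 = search("Best Picture 2020")
film = "Parasite" if "Parasite" in hop1 else None
hop2 = search(f"{film} director")
print(hop2)
```

During RL, the model itself decides when to issue each hop and how to phrase the query; the reward comes from whether the final answer is correct.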
## How it works

### Adding custom tools
Tools are defined using the `@tool` decorator. See `ChromaTool.search()` in `./tools.py` as an example. To add new tools:

- Define your tool class with methods decorated by `@tool`.
- The tool-call rendering and parsing logic lives in `tinker_cookbook/renderers/`; tool calling is supported across multiple renderers (Qwen, GPT-OSS, DeepSeek, Kimi).
- The system prompt for enabling tool calling is in `./search_env.py` (`SEARCH_TASK_INSTRUCTIONS`) and is written specifically for Qwen. Changing the tool-call parsing format requires updating the system prompt accordingly.
- Extend `./embedding.py` to replace the Gemini embedding if needed.
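The steps above can be sketched as follows. The `tool` decorator here is a minimal stand-in so the example runs on its own; the real decorator from the Tinker Cookbook also handles schema registration and rendering, and `CalculatorTool` is a hypothetical example, not part of the recipe.

```python
# Stand-in for the cookbook's @tool decorator (assumption: the real one
# additionally registers the method's signature for rendering/parsing).
def tool(fn):
    fn.__is_tool__ = True
    return fn

class CalculatorTool:
    """Hypothetical custom tool: evaluates simple arithmetic expressions."""

    @tool
    def calculate(self, expression: str) -> str:
        # Evaluate with builtins stripped so only arithmetic works.
        return str(eval(expression, {"__builtins__": {}}, {}))

calc = CalculatorTool()
print(calc.calculate("2 + 3 * 4"))  # "14"
```

After defining the tool, you would still need to make sure the renderer's tool-call format and the system prompt in `./search_env.py` describe it, as noted above.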
## Key differences from the Search-R1 paper
Our replication achieves consistently higher scores than the original paper across all benchmarks. The key differences are:
- We used the default importance-weighted REINFORCE loss implemented in Tinker (vs. the paper's loss formulation)
- We used the default synchronous rollout logic in the Tinker Cookbook (vs. the paper's rollout strategy)
- We used Gemini embeddings and Chroma DB, chosen for their ease of setup in a public demo. In exploratory experiments, the Gemini embeddings did not improve RL performance over the E5 embedding model used in the original paper.
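For intuition on the first difference, here is a minimal sketch of an importance-weighted REINFORCE loss. This is an assumption about the general form of Tinker's default loss, not its exact implementation: each sampled token's policy-gradient term is scaled by the ratio between the current policy and the (possibly stale) policy that generated the rollout.

```python
import math

def iw_reinforce_loss(logp_new, logp_old, advantages):
    """Average of -(pi_new / pi_old) * advantage over sampled tokens."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance weight
        terms.append(-ratio * adv)
    return sum(terms) / len(terms)

# Fully on-policy (logp_new == logp_old) the ratios are 1, so the loss
# reduces to plain REINFORCE; symmetric advantages average to zero here.
loss = iw_reinforce_loss([-1.0, -2.0], [-1.0, -2.0], [0.5, -0.5])
print(loss)  # 0.0
```

The importance weights let the update remain valid when the rollout policy lags slightly behind the trained policy, which is why the loss pairs naturally with the cookbook's synchronous rollout logic.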
## Run it

Monitor `env/all/turns_per_episode`: a successful run learns multi-turn search within 10-25 steps, and the metric should rise above 2 turns per episode.
## Expected results

### Qwen3-4B-Instruct-2507 (default)
| Natural Questions | Trivia QA | HotpotQA | 2WikiMultihopQA |
|---|---|---|---|
| 51.8 | 70.2 | 52.0 | 47.7 |
### Qwen2.5-7B-Instruct (vs. original paper)

| | Natural Questions | Trivia QA | HotpotQA | 2WikiMultihopQA |
|---|---|---|---|---|
| Search-R1 paper | 42.9 | 62.3 | 38.6 | 34.6 |
| Tinker | 51.6 | 67.3 | 49.7 | 42.8 |