Replicating Search-R1 with Tinker
Search-R1 [1] is a recent paper that showcases tool-use RL for multi-hop QA on Wikipedia.
It provides a clean setup for testing tool-use RL, and the authors also released their training and evaluation data.
In this demo, we run similar experiments using Qwen3.5-4B in non-thinking mode, and we include our replication results using Qwen/Qwen2.5-7B-Instruct at the end.
Running This Demo
Installation and Setup
This demo is built with Chroma DB and the Gemini API. You can install the additional dependencies by
By default, we use Google Vertex AI for the embedding service, so you need to set $GOOGLE_GENAI_USE_VERTEXAI, $GCP_VERTEXAI_PROJECT_NUMBER, and $GCP_VERTEXAI_REGION. Alternatively, tweak ./embedding.py to authenticate differently.
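Before launching a run, it can help to confirm that the three variables above are set. The following is a minimal sketch: the helper name missing_vertex_env is hypothetical and not part of this demo, and it only checks that each variable is non-empty, not that its value is valid.

```python
import os

# Variable names are taken from the setup notes above.
REQUIRED_VARS = (
    "GOOGLE_GENAI_USE_VERTEXAI",
    "GCP_VERTEXAI_PROJECT_NUMBER",
    "GCP_VERTEXAI_REGION",
)


def missing_vertex_env(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_vertex_env() with no arguments inspects the current process environment; an empty list means all three variables are present.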
Currently, the tool-use RL run relies on a separate Chroma vector search service. You can set it up with the following steps:
- Download a pre-computed wiki18 index: https://huggingface.co/datasets/tianyi-thinks/2018-wiki-index/blob/main/chroma_db.tar.xz
- Launch the Chroma service on localhost. Example command:
chroma run --host localhost --path <decompressed_path>/chroma_db --port 8000
If you launch the Chroma service locally, you generally need 160+ GB of RAM to load the vector index in memory for good performance.
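Once the service is up, a quick connectivity check can save a failed training launch. The sketch below assumes the chromadb Python client is installed and that the server listens on localhost:8000 as in the example command; the helper names are hypothetical and not part of this demo.

```python
def list_chroma_collections(client):
    """Ping a Chroma client and return the names of its collections."""
    client.heartbeat()  # raises if the server cannot be reached
    # Depending on the chromadb version, list_collections() yields
    # Collection objects or plain names, so handle both.
    return [getattr(item, "name", item) for item in client.list_collections()]


def connect_local_chroma(host="localhost", port=8000):
    """Build an HTTP client for a locally running Chroma service."""
    import chromadb  # imported lazily so the helper above has no hard dependency

    return chromadb.HttpClient(host=host, port=port)
```

With the service running, list_chroma_collections(connect_local_chroma()) should return the collection names stored in the downloaded index. The names themselves depend on the index and are not assumed here.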
Example command
This default command trains Qwen3.5-4B in non-thinking mode with reasonable hyperparameters.
With the default hyperparameters, you can expect performance like:

| | Natural Questions | Trivia QA | HotpotQA | 2WikiMultihopQA |
|---|---|---|---|---|
| Qwen3-4B-Instruct-2507 | 51.8 | 70.2 | 52.0 | 47.7 |

Rerun needed: these numbers were measured on the now-retired Qwen3-4B-Instruct-2507 and have not been refreshed with a full run. A 30-step verification run with Qwen/Qwen3.5-4B + qwen3_5_disable_thinking showed healthy learning (train reward 0.39 → 0.60; NQ 0.27 → 0.47, HotpotQA 0.47 → 0.70), but full-length results may differ.
A successful run shows env/all/reward/total climbing within the first ~20 steps. env/all/turns_per_episode above 2 indicates the model is actually using the search tool; weaker starting models learn this within 10-25 steps, while Qwen3.5-4B already issues multi-turn searches from step 0 (~3 turns per episode), so expect the learning to show up in reward rather than turn count.
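The two health signals above can be checked programmatically. This is a minimal sketch that assumes you have the per-step metrics as a list of dicts keyed by the metric names above; the JSON-lines loader, the helper names, and the 5-step comparison window are all assumptions for illustration, not part of this demo.

```python
import json

REWARD_KEY = "env/all/reward/total"
TURNS_KEY = "env/all/turns_per_episode"


def load_jsonl(path):
    """Read one JSON object per line (assumed log format)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_run(rows, window=5, min_turns=2.0):
    """Compare mean reward over the first and last `window` logged steps."""
    rewards = [r[REWARD_KEY] for r in rows if REWARD_KEY in r]
    turns = [r[TURNS_KEY] for r in rows if TURNS_KEY in r]
    if len(rewards) < 2 * window or not turns:
        raise ValueError("not enough logged steps to summarize")
    early = sum(rewards[:window]) / window
    late = sum(rewards[-window:]) / window
    return {
        "early_reward": early,
        "late_reward": late,
        "reward_climbing": late > early,
        # More than `min_turns` turns suggests the search tool is being called.
        "uses_search": turns[-1] > min_turns,
    }
```

For Qwen3.5-4B, uses_search is expected to be true from the start, so reward_climbing is the more informative flag.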
Note: The max_trajectory_tokens parameter (default: 32,768) limits the total context length for multi-turn interactions. If your searches require longer contexts, you can adjust it with max_trajectory_tokens=<value>.
To speed up training, you may consider turning on --stream_minibatch. In principle, this system improvement should have minimal effect on training.
Extensions: How to Include Other Tools?
- The tool call rendering / parsing logic is in tinker_cookbook/renderers/. Tool calling is supported on multiple renderers (Qwen, GPT-OSS, DeepSeek, Kimi). Currently, the system prompt necessary for enabling tool calling is included in ./search_env.py (SEARCH_TASK_INSTRUCTIONS) and is written specifically for Qwen. Changing the tool calling parsing format requires updating the system prompt accordingly.
- Extend ./embedding.py to replace the Gemini embedding.
- Extend ./tools.py to add new tools using the @tool decorator; see ChromaTool.search() as an example.
Replication Results
We conducted experiments on a Qwen/Qwen2.5-7B-Instruct model and compared with the results reported in the original paper.
Note that this model is not available on Tinker; we chose it specifically to compare with the original paper.
The results are shown below:

| | Natural Questions | Trivia QA | HotpotQA | 2WikiMultihopQA |
|---|---|---|---|---|
| original paper | 42.9 | 62.3 | 38.6 | 34.6 |
| tinker | 51.6 | 67.3 | 49.7 | 42.8 |
The key differences between our experiment and the original paper include:
- We used the default importance-weighting REINFORCE loss implemented in Tinker
- We used the default synchronous rollout logic in the Tinker Cookbook
- We used Gemini embedding and Chroma DB, motivated by their ease of setup for a public demo. In exploratory experiments, the Gemini embedding does not improve RL performance over the E5 embedding model used in the original paper.
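
For readers unfamiliar with the first difference, an importance-weighted REINFORCE objective can be sketched as follows. This is the textbook form, written under the assumption that each token has a log-probability under the sampling policy, a log-probability under the current policy, and an advantage. Tinker's actual implementation may differ in details such as normalization, and the function name is hypothetical.

```python
import math


def importance_weighted_reinforce_loss(target_logprobs, sampling_logprobs, advantages):
    """Sum of -(p_current / p_sampler) * advantage over tokens (illustrative sketch)."""
    if not (len(target_logprobs) == len(sampling_logprobs) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    loss = 0.0
    for new_lp, old_lp, adv in zip(target_logprobs, sampling_logprobs, advantages):
        # Importance ratio corrects for the gap between sampler and learner.
        ratio = math.exp(new_lp - old_lp)
        loss -= ratio * adv
    return loss
```

When the sampling and current policies agree, every ratio equals 1 and the expression reduces to the negative sum of advantages.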
[1] Jin, B., Zeng, H., Yue, Z., Yoon, J., Arık, S. O., Wang, D., Zamani, H., & Han, J. (2025). Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.