# Search Tool RL

Train LLMs to use retrieval tools for multi-hop question answering, replicating the Search-R1 approach.

## What you'll build
A model that learns multi-turn search behavior over a Wikipedia index to answer complex questions. Uses Chroma DB for vector search and Gemini embeddings. Based on the Search-R1 paper.
## Prerequisites

Set up Google Vertex AI credentials for embeddings:

```shell
export GOOGLE_GENAI_USE_VERTEXAI=...
export GCP_VERTEXAI_PROJECT_NUMBER=...
export GCP_VERTEXAI_REGION=...
```
Download and launch the pre-computed Wikipedia index:

- Download: the wiki18 index on HuggingFace
- Launch:

```shell
chroma run --host localhost --path <decompressed_path>/chroma_db --port 8000
```

Note: the Chroma service needs 160+ GB of RAM to load the full index.
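To make the retrieval step concrete, here is a toy, self-contained sketch of what the Chroma index provides: documents stored as vectors, queried by cosine similarity. The three-dimensional vectors and documents below are fabricated for illustration; real runs use Gemini embeddings against the wiki18 index served by `chroma run`.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Fake "index": document text -> embedding vector (real embeddings are
# high-dimensional; 3 dims here keep the example readable).
index = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "The Eiffel Tower is in Paris.": [0.8, 0.3, 0.1],
    "Mount Everest is in Nepal.": [0.1, 0.2, 0.9],
}

def search(query_vec, k=2):
    # Return the k documents most similar to the query vector.
    ranked = sorted(index, key=lambda doc: cosine(query_vec, index[doc]), reverse=True)
    return ranked[:k]

results = search([1.0, 0.2, 0.0])
print(results[0])  # "Paris is the capital of France."
```

In the actual recipe, the query vector comes from the Gemini embedding of the model's search query, and the lookup is served by the Chroma HTTP service launched above.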
## Key concepts
- Tool-use RL — the model learns when and how to invoke search tools during multi-turn episodes
- Multi-hop QA — answering questions that require chaining multiple search queries together
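A toy sketch of why multi-hop questions need chained queries: no single search returns the answer, so the model must use the result of one hop to form the next. The mini "corpus" and question below are hypothetical stand-ins for the Wikipedia search tool.

```python
# Hypothetical question: "Who directed the film that won Best Picture in 2020?"
# No single document answers it; two hops are required.
corpus = {
    "Best Picture 2020": "Parasite won Best Picture at the 2020 Oscars.",
    "Parasite director": "Parasite was directed by Bong Joon-ho.",
}

def search(query: str) -> str:
    # Stand-in for the retrieval tool: exact-match lookup.
    return corpus.get(query, "no result")

# Hop 1: find the film. Hop 2: use that result to find its director.
hop1 = search("Best Picture 2020")
film = "Parasite" if "Parasite" in hop1 else None
hop2 = search(f"{film} director")
print(hop2)
```

During RL, the model itself decides when to issue each hop and how to phrase the query; the reward comes from whether the final answer is correct.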
## How it works

### Adding custom tools
Tools are defined using the `@tool` decorator. See `ChromaTool.search()` in `./tools.py` as an example. To add new tools:

- Define your tool class with methods decorated by `@tool`.
- The tool-call rendering and parsing logic lives in `tinker_cookbook/renderers/`; tool calling is supported across multiple renderers (Qwen, GPT-OSS, DeepSeek, Kimi).
- The system prompt for enabling tool calling is in `./search_env.py` (`SEARCH_TASK_INSTRUCTIONS`) and is written specifically for Qwen. Changing the tool-call parsing format requires updating the system prompt accordingly.
- Extend `./embedding.py` to replace the Gemini embedding if needed.
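The steps above can be sketched as follows. The `tool` decorator here is a minimal stand-in so the example runs on its own; the real decorator from the Tinker Cookbook also handles schema registration and rendering, and `CalculatorTool` is a hypothetical example, not part of the recipe.

```python
# Stand-in for the cookbook's @tool decorator (assumption: the real one
# additionally registers the method's signature for rendering/parsing).
def tool(fn):
    fn.__is_tool__ = True
    return fn

class CalculatorTool:
    """Hypothetical custom tool: evaluates simple arithmetic expressions."""

    @tool
    def calculate(self, expression: str) -> str:
        # Evaluate with builtins stripped so only arithmetic works.
        return str(eval(expression, {"__builtins__": {}}, {}))

calc = CalculatorTool()
print(calc.calculate("2 + 3 * 4"))  # "14"
```

After defining the tool, you would still need to make sure the renderer's tool-call format and the system prompt in `./search_env.py` describe it, as noted above.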
## Key differences from the Search-R1 paper
Our replication achieves consistently higher scores than the original paper across all benchmarks. The key differences are:
- We used the default importance-weighted REINFORCE loss implemented in Tinker (vs. the paper's loss formulation)
- We used the default synchronous rollout logic in the Tinker Cookbook (vs. the paper's rollout strategy)
- We used Gemini embeddings and Chroma DB, chosen for their ease of setup in a public demo. In exploratory experiments, the Gemini embeddings did not improve RL performance over the E5 embedding model used in the original paper.
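For intuition on the first difference, here is a minimal sketch of an importance-weighted REINFORCE loss. This is an assumption about the general form of Tinker's default loss, not its exact implementation: each sampled token's policy-gradient term is scaled by the ratio between the current policy and the (possibly stale) policy that generated the rollout.

```python
import math

def iw_reinforce_loss(logp_new, logp_old, advantages):
    """Average of -(pi_new / pi_old) * advantage over sampled tokens."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance weight
        terms.append(-ratio * adv)
    return sum(terms) / len(terms)

# Fully on-policy (logp_new == logp_old) the ratios are 1, so the loss
# reduces to plain REINFORCE; symmetric advantages average to zero here.
loss = iw_reinforce_loss([-1.0, -2.0], [-1.0, -2.0], [0.5, -0.5])
print(loss)  # 0.0
```

The importance weights let the update remain valid when the rollout policy lags slightly behind the trained policy, which is why the loss pairs naturally with the cookbook's synchronous rollout logic.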
## Run it

Monitor `env/all/turns_per_episode`: a successful run learns multi-turn search within 10-25 steps, and the metric should rise above 2 turns per episode.
## Expected results

### Qwen3-4B-Instruct-2507 (default)
| Natural Questions | Trivia QA | HotpotQA | 2WikiMultihopQA |
|---|---|---|---|
| 51.8 | 70.2 | 52.0 | 47.7 |
### Qwen2.5-7B-Instruct (vs. original paper)

| | Natural Questions | Trivia QA | HotpotQA | 2WikiMultihopQA |
|---|---|---|---|---|
| Search-R1 paper | 42.9 | 62.3 | 38.6 | 34.6 |
| Tinker | 51.6 | 67.3 | 49.7 | 42.8 |