# Agent RL
Train LLMs to use MCP-hosted tools via reinforcement learning, graded by LLM judges against task rubrics.
## What you'll build
An agent that discovers, selects, and uses tools from MCP servers running in sandboxed Modal containers. It is trained on the APEX benchmark (document editing, spreadsheet manipulation, email drafting), uses a dynamic toolbelt with meta-tools for tool discovery, and relies on LLM-as-judge grading for reward.
## Prerequisites

```bash
uv pip install 'tinker-cookbook[modal]'
modal token new
export OPENAI_API_KEY=...  # for LLM judge grading
```
## Key concepts
- **Dynamic toolbelt** — the agent starts with only meta-tools (list, inspect, add, remove) and must learn to discover relevant tools from ~60 available
- **MCP tool provider** — tools are hosted as MCP servers in sandboxed containers, discovered and wrapped automatically
- **LLM-as-judge reward** — each episode is graded by comparing sandbox state before/after interaction, then evaluating rubric criteria with an LLM judge
## How it works

### Dynamic toolbelt design
When many tools are available (the APEX benchmark exposes ~60 tools across 9 MCP servers), including all tool schemas in the prompt would exhaust the context window. Instead, the agent starts with only meta-tools visible:
| Meta-tool | Purpose |
|---|---|
| `toolbelt_list_tools` | List all available tools not in the active toolbelt |
| `toolbelt_inspect_tool` | Get the full schema of a specific tool |
| `toolbelt_add_tool` | Add a tool to the active toolbelt (max 80) |
| `toolbelt_remove_tool` | Remove a tool from the active toolbelt |
| `final_answer` | Submit the answer and end the episode |
The agent must learn to discover and select relevant tools before it can call them. When a tool is added to the toolbelt, its schema becomes part of the prompt on the next turn. This approach means the model learns tool discovery as part of the RL training.
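The toolbelt mechanics above can be sketched in a few lines. This is a minimal illustration, not the recipe's actual API: the `Toolbelt` class, `MAX_TOOLBELT_SIZE` constant, and method names are hypothetical.

```python
# Hypothetical sketch of a dynamic toolbelt: only meta-tools are visible at
# first; a tool added here contributes its schema to the next turn's prompt.
MAX_TOOLBELT_SIZE = 80  # cap from the table above

class Toolbelt:
    def __init__(self, all_tools: dict[str, dict]):
        self.all_tools = all_tools          # name -> full JSON schema
        self.active: dict[str, dict] = {}   # tools currently in the prompt

    def list_tools(self) -> list[str]:
        """Meta-tool: names of tools not yet in the active toolbelt."""
        return sorted(set(self.all_tools) - set(self.active))

    def inspect_tool(self, name: str) -> dict:
        """Meta-tool: full schema of one tool."""
        return self.all_tools[name]

    def add_tool(self, name: str) -> str:
        """Meta-tool: add a tool; its schema appears in the next prompt."""
        if len(self.active) >= MAX_TOOLBELT_SIZE:
            return "error: toolbelt is full"
        self.active[name] = self.all_tools[name]
        return f"added {name}"

    def remove_tool(self, name: str) -> str:
        """Meta-tool: remove a tool to free prompt budget."""
        self.active.pop(name, None)
        return f"removed {name}"

    def prompt_schemas(self) -> list[dict]:
        """Schemas included in the model's prompt this turn."""
        return list(self.active.values())
```

Because only active schemas reach the prompt, context stays bounded even when the full tool catalog is large.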
### Grading pipeline
Each episode is graded by comparing sandbox state before and after the agent's interaction, then running an LLM judge on each rubric criterion. The reward signal is the pass rate (fraction of criteria met):
- Snapshot diff — compare initial vs final sandbox filesystem (created/modified/deleted files with text diffs)
- LLM judge — evaluate each rubric criterion against the diff + agent's final answer (parallel, with concurrency limit)
- Score aggregation — pass rate across all criteria
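The three steps above can be sketched as follows. This is an illustrative simplification: `snapshot_diff`, `grade_episode`, and the `judge_fn` callback (standing in for an LLM judge call) are hypothetical names, and real snapshots are tarballs rather than in-memory dicts.

```python
def snapshot_diff(before: dict[str, str], after: dict[str, str]) -> dict:
    """Step 1: compare two filesystem snapshots (path -> file text)."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(p for p in before.keys() & after.keys()
                      if before[p] != after[p])
    return {"created": created, "deleted": deleted, "modified": modified}

def grade_episode(before, after, final_answer, criteria, judge_fn) -> float:
    """Steps 2-3: judge each rubric criterion, return the pass rate."""
    diff = snapshot_diff(before, after)
    # judge_fn(criterion, diff, final_answer) -> bool; in the real pipeline
    # these calls run in parallel under a concurrency limit.
    verdicts = [judge_fn(c, diff, final_answer) for c in criteria]
    return sum(verdicts) / len(verdicts)
```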
### Episode lifecycle
Each environment instance gets its own Modal sandbox — a cloud container running a FastAPI server with MCP tool servers:
- Create — spin up container from user-provided Dockerfile
- Populate — upload world data (filesystem archives, app data) via HTTP
- Configure MCP — POST server definitions to the sandbox gateway
- Interact — agent calls tools via MCP protocol (streamable HTTP transport)
- Snapshot — capture final filesystem state as tar.gz for grading
- Terminate — clean up the container
For GRPO training, a group of sandboxes is created per task (all starting from the same world state), and advantages are computed within each group.
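The within-group advantage computation mentioned above can be sketched in one function (an illustrative simplification of GRPO; the function name is hypothetical):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center each rollout's reward on the mean of
    its group, so gradients compare rollouts on the same task/world state."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because all sandboxes in a group start from the same world state, the centered reward isolates what the policy did differently rather than task difficulty.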
## Cost considerations
This recipe involves several cost-bearing components:
- Modal sandboxes: Each environment instance is a cloud container. With `group_size=4` and `groups_per_batch=8`, that's 32 concurrent sandboxes per batch.
- LLM judge calls: Each episode requires one judge call per rubric criterion (typically 3-8 criteria per task). Using `gpt-4o-mini` is significantly cheaper than `gpt-4o`.
- Long episodes: APEX tasks can run up to 50 turns with a 3600s timeout. For initial experimentation, consider reducing `max_turns` and `sandbox_timeout`.
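The back-of-envelope arithmetic behind these numbers, using the default settings and the typical rubric-size range stated above:

```python
# Illustrative cost arithmetic for the default configuration.
group_size = 4
groups_per_batch = 8
criteria_per_task = (3, 8)  # typical rubric size range

# Every episode gets its own sandbox, so this is concurrent sandboxes too.
episodes_per_batch = group_size * groups_per_batch

# One judge call per criterion per episode -> (min, max) calls per batch.
judge_calls = (episodes_per_batch * criteria_per_task[0],
               episodes_per_batch * criteria_per_task[1])

print(episodes_per_batch)  # 32
print(judge_calls)         # (96, 256)
```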
Tips for reducing cost during development:
- Use `task_indices="0-2"` to limit to a few tasks
- Use `group_size=2` and `groups_per_batch=2` for small batches
- Use a cheaper `judge_model` (e.g. `gpt-4o-mini`)
- Reduce `max_turns` and `max_trajectory_tokens`
## Run it

### Train

```bash
python -m tinker_cookbook.recipes.agent_rl.train \
  model_name=meta-llama/Llama-3.1-8B-Instruct \
  dockerfile_path=/path/to/environment/Dockerfile \
  docker_context_dir=/path/to/repo/root \
  group_size=4 \
  groups_per_batch=8 \
  learning_rate=1e-5 \
  lora_rank=32 \
  max_turns=50
```
### Evaluate

```bash
python -m tinker_cookbook.recipes.agent_rl.eval \
  model_name=meta-llama/Llama-3.1-8B-Instruct \
  dockerfile_path=/path/to/environment/Dockerfile \
  docker_context_dir=/path/to/repo/root \
  load_checkpoint_path=/path/to/checkpoint \
  task_indices=0-9 \
  parallel=4
```
## Expected results

Reward (pass rate across rubric criteria) increases over training. For cost-effective experimentation, use `task_indices=0-2`, `group_size=2`, `groups_per_batch=2`, and `judge_model=gpt-4o-mini`.