# Agent RL
Train LLMs to use MCP-hosted tools via reinforcement learning, graded by LLM judges against task rubrics.
## What you'll build
An agent that discovers, selects, and uses tools from MCP servers running in sandboxed Modal containers. It is trained on the APEX benchmark (document editing, spreadsheet manipulation, email drafting), uses a dynamic toolbelt with meta-tools for tool discovery, and relies on LLM-as-judge grading for reward.
## Prerequisites

```bash
uv pip install 'tinker-cookbook[modal]'
modal token new
export OPENAI_API_KEY=...  # for LLM judge grading
```
## Key concepts
- **Dynamic toolbelt** — the agent starts with only meta-tools (list, inspect, add, remove) and must learn to discover relevant tools from ~60 available
- **MCP tool provider** — tools are hosted as MCP servers in sandboxed containers, discovered and wrapped automatically
- **LLM-as-judge reward** — each episode is graded by comparing sandbox state before/after interaction, then evaluating rubric criteria with an LLM judge
## How it works

### Dynamic toolbelt design
When many tools are available (the APEX benchmark exposes ~60 tools across 9 MCP servers), including all tool schemas in the prompt would exhaust the context window. Instead, the agent starts with only meta-tools visible:
| Meta-tool | Purpose |
|---|---|
| `toolbelt_list_tools` | List all available tools not in the active toolbelt |
| `toolbelt_inspect_tool` | Get the full schema of a specific tool |
| `toolbelt_add_tool` | Add a tool to the active toolbelt (max 80) |
| `toolbelt_remove_tool` | Remove a tool from the active toolbelt |
| `final_answer` | Submit the answer and end the episode |
The agent must learn to discover and select relevant tools before it can call them. When a tool is added to the toolbelt, its schema becomes part of the prompt on the next turn. This approach means the model learns tool discovery as part of the RL training.
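The toolbelt mechanics above can be sketched in a few lines. This is a minimal illustration, not the recipe's actual API: the `Toolbelt` class, `MAX_TOOLBELT_SIZE` constant, and method names are hypothetical.

```python
# Hypothetical sketch of a dynamic toolbelt: only meta-tools are visible at
# first; a tool added here contributes its schema to the next turn's prompt.
MAX_TOOLBELT_SIZE = 80  # cap from the table above

class Toolbelt:
    def __init__(self, all_tools: dict[str, dict]):
        self.all_tools = all_tools          # name -> full JSON schema
        self.active: dict[str, dict] = {}   # tools currently in the prompt

    def list_tools(self) -> list[str]:
        """Meta-tool: names of tools not yet in the active toolbelt."""
        return sorted(set(self.all_tools) - set(self.active))

    def inspect_tool(self, name: str) -> dict:
        """Meta-tool: full schema of one tool."""
        return self.all_tools[name]

    def add_tool(self, name: str) -> str:
        """Meta-tool: add a tool; its schema appears in the next prompt."""
        if len(self.active) >= MAX_TOOLBELT_SIZE:
            return "error: toolbelt is full"
        self.active[name] = self.all_tools[name]
        return f"added {name}"

    def remove_tool(self, name: str) -> str:
        """Meta-tool: remove a tool to free prompt budget."""
        self.active.pop(name, None)
        return f"removed {name}"

    def prompt_schemas(self) -> list[dict]:
        """Schemas included in the model's prompt this turn."""
        return list(self.active.values())
```

Because only active schemas reach the prompt, context stays bounded even when the full tool catalog is large.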
### Grading pipeline
Each episode is graded by comparing sandbox state before and after the agent's interaction, then running an LLM judge on each rubric criterion. The reward signal is the pass rate (fraction of criteria met):
- Snapshot diff — compare initial vs final sandbox filesystem (created/modified/deleted files with text diffs)
- LLM judge — evaluate each rubric criterion against the diff + agent's final answer (parallel, with concurrency limit)
- Score aggregation — pass rate across all criteria
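The three steps above can be sketched as follows. This is an illustrative simplification: `snapshot_diff`, `grade_episode`, and the `judge_fn` callback (standing in for an LLM judge call) are hypothetical names, and real snapshots are tarballs rather than in-memory dicts.

```python
def snapshot_diff(before: dict[str, str], after: dict[str, str]) -> dict:
    """Step 1: compare two filesystem snapshots (path -> file text)."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(p for p in before.keys() & after.keys()
                      if before[p] != after[p])
    return {"created": created, "deleted": deleted, "modified": modified}

def grade_episode(before, after, final_answer, criteria, judge_fn) -> float:
    """Steps 2-3: judge each rubric criterion, return the pass rate."""
    diff = snapshot_diff(before, after)
    # judge_fn(criterion, diff, final_answer) -> bool; in the real pipeline
    # these calls run in parallel under a concurrency limit.
    verdicts = [judge_fn(c, diff, final_answer) for c in criteria]
    return sum(verdicts) / len(verdicts)
```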
### Episode lifecycle
Each environment instance gets its own Modal sandbox — a cloud container running a FastAPI server with MCP tool servers:
- Create — spin up container from user-provided Dockerfile
- Populate — upload world data (filesystem archives, app data) via HTTP
- Configure MCP — POST server definitions to the sandbox gateway
- Interact — agent calls tools via MCP protocol (streamable HTTP transport)
- Snapshot — capture final filesystem state as tar.gz for grading
- Terminate — clean up the container
For GRPO training, a group of sandboxes is created per task (all starting from the same world state), and advantages are computed within each group.
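The within-group advantage computation mentioned above can be sketched in one function (an illustrative simplification of GRPO; the function name is hypothetical):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center each rollout's reward on the mean of
    its group, so gradients compare rollouts on the same task/world state."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because all sandboxes in a group start from the same world state, the centered reward isolates what the policy did differently rather than task difficulty.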
## Cost considerations
This recipe involves several cost-bearing components:
- Modal sandboxes: Each environment instance is a cloud container. With `group_size=4` and `groups_per_batch=8`, that's 32 concurrent sandboxes per batch.
- LLM judge calls: Each episode requires one judge call per rubric criterion (typically 3-8 criteria per task). Using `gpt-4o-mini` is significantly cheaper than `gpt-4o`.
- Long episodes: APEX tasks can run up to 50 turns with a 3600s timeout. For initial experimentation, consider reducing `max_turns` and `sandbox_timeout`.
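The back-of-envelope arithmetic behind these numbers, using the default settings and the typical rubric-size range stated above:

```python
# Illustrative cost arithmetic for the default configuration.
group_size = 4
groups_per_batch = 8
criteria_per_task = (3, 8)  # typical rubric size range

# Every episode gets its own sandbox, so this is concurrent sandboxes too.
episodes_per_batch = group_size * groups_per_batch

# One judge call per criterion per episode -> (min, max) calls per batch.
judge_calls = (episodes_per_batch * criteria_per_task[0],
               episodes_per_batch * criteria_per_task[1])

print(episodes_per_batch)  # 32
print(judge_calls)         # (96, 256)
```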
Tips for reducing cost during development:
- Use `task_indices="0-2"` to limit to a few tasks
- Use `group_size=2` and `groups_per_batch=2` for small batches
- Use a cheaper `judge_model` (e.g. `gpt-4o-mini`)
- Reduce `max_turns` and `max_trajectory_tokens`
## Run it

### Train

```bash
python -m tinker_cookbook.recipes.agent_rl.train \
  model_name=meta-llama/Llama-3.1-8B-Instruct \
  dockerfile_path=/path/to/environment/Dockerfile \
  docker_context_dir=/path/to/repo/root \
  group_size=4 \
  groups_per_batch=8 \
  learning_rate=1e-5 \
  lora_rank=32 \
  max_turns=50
```
### Evaluate

```bash
python -m tinker_cookbook.recipes.agent_rl.eval \
  model_name=meta-llama/Llama-3.1-8B-Instruct \
  dockerfile_path=/path/to/environment/Dockerfile \
  docker_context_dir=/path/to/repo/root \
  load_checkpoint_path=/path/to/checkpoint \
  task_indices=0-9 \
  parallel=4
```
## Expected results

Reward (pass rate across rubric criteria) increases over training. For cost-effective experimentation, use `task_indices=0-2`, `group_size=2`, `groups_per_batch=2`, and `judge_model=gpt-4o-mini`.