Harbor RL

RL training on Harbor-formatted tasks with sandboxed code execution, where an agent uses bash tools inside containers.

What you'll build

An agent trained via RL on Terminal-Bench 2.0 tasks. The agent gets a bash tool inside a sandboxed container, attempts a task, and receives a reward based on the task's test results. Sandboxes run in the cloud on Modal.

Prerequisites

uv pip install 'tinker-cookbook[modal]'
modal token new

Download tasks:

uvx harbor datasets download terminal-bench@2.0

Key concepts

HarborTask — a standardized format for SWE/Terminal-Bench style tasks with Dockerfile, test script, and instructions
SandboxInterface — a protocol for running commands, reading/writing files in isolated containers
SandboxFactory — injectable factory for swapping sandbox backends (Modal by default)

How it works

HarborTask format

Harbor offers a standardized format for SWE/Terminal-Bench style tasks. Tasks are downloaded via uvx harbor datasets download terminal-bench@2.0 and land in ~/.cache/harbor/tasks/ with this structure:

~/.cache/harbor/tasks/
  └── <shortuuid(task_id)>/
      └── <task_name>/
          ├── environment/
          │   └── Dockerfile
          ├── tests/
          │   └── test.sh
          ├── instruction.md
          ├── task.toml
          └── solution/

The training interface consumes tasks as:

@dataclass(frozen=True)
class HarborTask:
    task_name: str
    instruction: str
    task_dir: Path      # must contain environment/Dockerfile and tests/test.sh
    config: dict[str, Any] = field(default_factory=dict)

You can customize your own tasks as long as they conform to this interface.

SandboxInterface protocol

tinker_cookbook.sandbox.sandbox_interface defines the protocol that all sandbox backends must implement:

@runtime_checkable
class SandboxInterface(Protocol):
    async def run_command(self, command: str, workdir: str | None = None, timeout: int = 60, max_output_bytes: int | None = None) -> SandboxResult: ...
    async def read_file(self, path: str, max_bytes: int | None = None, timeout: int = 60) -> SandboxResult: ...
    async def write_file(self, path: str, content: str | bytes, executable: bool = False, timeout: int = 60) -> SandboxResult: ...
    async def send_heartbeat(self) -> None: ...
    async def cleanup(self) -> None: ...

ModalSandbox is the default implementation of this interface. To swap sandbox backends, inject a custom SandboxFactory via cli_main().
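To make the protocol concrete, here is a rough sketch of an alternative backend that runs commands as local subprocesses. Everything in it is an assumption for illustration: the class name LocalSubprocessSandbox is hypothetical, and SandboxResult is a local stand-in whose fields (stdout, stderr, exit_code) may not match the real type in tinker_cookbook. It provides no isolation, so it would only suit debugging on a trusted machine, and file paths are interpreted relative to the sandbox root.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SandboxResult:
    # Stand-in for illustration; the real SandboxResult's fields may differ.
    stdout: str
    stderr: str
    exit_code: int


class LocalSubprocessSandbox:
    """Hypothetical backend: runs on the host with NO isolation."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    async def run_command(self, command: str, workdir: str | None = None,
                          timeout: int = 60, max_output_bytes: int | None = None) -> SandboxResult:
        proc = await asyncio.create_subprocess_shell(
            command,
            cwd=workdir or str(self.root),
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return SandboxResult("", f"timed out after {timeout}s", 124)
        if max_output_bytes is not None:
            out = out[:max_output_bytes]  # truncate captured stdout
        code = proc.returncode if proc.returncode is not None else -1
        return SandboxResult(out.decode(errors="replace"), err.decode(errors="replace"), code)

    async def read_file(self, path: str, max_bytes: int | None = None,
                        timeout: int = 60) -> SandboxResult:
        try:
            data = (self.root / path).read_bytes()
        except OSError as exc:
            return SandboxResult("", str(exc), 1)
        if max_bytes is not None:
            data = data[:max_bytes]
        return SandboxResult(data.decode(errors="replace"), "", 0)

    async def write_file(self, path: str, content: str | bytes, executable: bool = False,
                         timeout: int = 60) -> SandboxResult:
        target = self.root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content.encode() if isinstance(content, str) else content)
        if executable:
            target.chmod(0o755)
        return SandboxResult("", "", 0)

    async def send_heartbeat(self) -> None:
        return None  # nothing to keep alive for a local process

    async def cleanup(self) -> None:
        return None  # the caller owns the root directory in this sketch
```

Because SandboxInterface is decorated with @runtime_checkable, an isinstance check against it only verifies that these method names exist, not that their signatures or return types match, so a custom backend still needs its own tests.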

Error analysis

At 32K context with no compaction, 80.4% of errors are context window overflows (prompt_tokens + max_tokens > 32768). They occur once the conversation history exceeds 24,576 tokens (32768 - 8192, about 24.5K), which no longer leaves room for the 8192-token max_tokens generation budget. Context compaction (summarizing the tool-calling history) would likely improve results significantly.
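The overflow condition is plain arithmetic, sketched below with the numbers from this section. The function names prompt_budget and would_overflow are hypothetical and only illustrate the check; they are not part of the recipe.

```python
CONTEXT_WINDOW = 32768  # 32K context window
MAX_TOKENS = 8192       # generation budget reserved for each turn


def prompt_budget(context_window: int = CONTEXT_WINDOW, max_tokens: int = MAX_TOKENS) -> int:
    """Largest prompt (conversation history) that still leaves room to generate."""
    return context_window - max_tokens


def would_overflow(prompt_tokens: int, context_window: int = CONTEXT_WINDOW,
                   max_tokens: int = MAX_TOKENS) -> bool:
    """True when prompt_tokens + max_tokens exceeds the context window."""
    return prompt_tokens + max_tokens > context_window


print(prompt_budget())        # 24576
print(would_overflow(24576))  # False: exactly fits
print(would_overflow(24577))  # True: one token over
```

A check like this could run before each turn to decide when the history needs to be compacted or the episode ended.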

Run it

Train

uv run python tinker_cookbook/recipes/harbor_rl/scripts/train_terminal_bench.py

Evaluate

uvx harbor datasets download terminal-bench@2.0 -o ~/.cache/harbor/tasks/terminal-bench-2.0
uvx harbor datasets download swebench-verified@1.0 -o ~/.cache/harbor/tasks/swebench-verified-1.0
uv run python tinker_cookbook/recipes/harbor_rl/scripts/eval_terminal_bench.py

Expected results

Kimi-K2-Thinking at 32K context (no compaction):

Benchmark	Total	Pass Rate
SWE-Bench Verified 1.0	500	9.2%
Terminal-Bench 2.0	89	20.2%

Config: max_turns=200, max_tokens=8192, temperature=0.1, sandbox_timeout=3600s. Most errors are context window overflow.