Harbor RL
RL training on Harbor-formatted tasks with sandboxed code execution, where an agent uses bash tools inside containers.
What you'll build
An agent trained via RL on Terminal-Bench 2.0 tasks. The agent gets a bash tool inside a sandboxed container, attempts a task, and receives a reward based on test results. The recipe uses Modal for cloud-based sandboxing.
Prerequisites
Download tasks:
uvx harbor datasets download terminal-bench@2.0
Key concepts
- HarborTask — a standardized format for SWE/Terminal-Bench style tasks with Dockerfile, test script, and instructions
- SandboxInterface — a protocol for running commands, reading/writing files in isolated containers
- SandboxFactory — injectable factory for swapping sandbox backends (Modal by default)
How it works
HarborTask format
Harbor offers a standardized format for SWE/Terminal-Bench style tasks. Tasks are downloaded via uvx harbor datasets download terminal-bench@2.0 and land in ~/.cache/harbor/tasks/ with this structure:
~/.cache/harbor/tasks/
└── <shortuuid(task_id)>/
    └── <task_name>/
        ├── environment/
        │   └── Dockerfile
        ├── tests/
        │   └── test.sh
        ├── instruction.md
        ├── task.toml
        └── solution/
The training interface consumes tasks as:
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

@dataclass(frozen=True)
class HarborTask:
    task_name: str
    instruction: str
    task_dir: Path  # must contain environment/Dockerfile and tests/test.sh
    config: dict[str, Any] = field(default_factory=dict)
You can customize your own tasks as long as they conform to this interface.
SandboxInterface protocol
tinker_cookbook.sandbox.sandbox_interface defines the protocol that all sandbox backends must implement:
@runtime_checkable
class SandboxInterface(Protocol):
    async def run_command(self, command: str, workdir: str | None = None, timeout: int = 60, max_output_bytes: int | None = None) -> SandboxResult: ...
    async def read_file(self, path: str, max_bytes: int | None = None, timeout: int = 60) -> SandboxResult: ...
    async def write_file(self, path: str, content: str | bytes, executable: bool = False, timeout: int = 60) -> SandboxResult: ...
    async def send_heartbeat(self) -> None: ...
    async def cleanup(self) -> None: ...
ModalSandbox implements this interface by default. You can inject a custom SandboxFactory via cli_main() to swap sandbox backends.
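To make the protocol concrete, here is a minimal sketch of a custom backend with the same five method signatures. It runs commands directly on the host, so it provides no isolation and is only suitable for local experimentation. LocalSandbox is a hypothetical name, and the SandboxResult defined here is a stand-in whose fields (stdout, stderr, exit_code) are assumptions. The real class in tinker_cookbook may differ, and the SandboxFactory signature is not shown in this document, so it is not sketched.

```python
from __future__ import annotations

import asyncio
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SandboxResult:
    """Stand-in for the cookbook's SandboxResult; field names are assumptions."""
    stdout: str
    stderr: str
    exit_code: int


class LocalSandbox:
    """Hypothetical backend: runs on the host with NO isolation (demo only)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    async def run_command(self, command: str, workdir: str | None = None,
                          timeout: int = 60,
                          max_output_bytes: int | None = None) -> SandboxResult:
        proc = await asyncio.create_subprocess_shell(
            command,
            cwd=workdir or str(self.root),
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
        except asyncio.TimeoutError:
            # Kill the process and report a conventional timeout exit code.
            proc.kill()
            await proc.wait()
            return SandboxResult("", f"timed out after {timeout}s", 124)
        if max_output_bytes is not None:
            out, err = out[:max_output_bytes], err[:max_output_bytes]
        code = proc.returncode if proc.returncode is not None else -1
        return SandboxResult(out.decode(errors="replace"),
                             err.decode(errors="replace"), code)

    async def read_file(self, path: str, max_bytes: int | None = None,
                        timeout: int = 60) -> SandboxResult:
        try:
            data = (self.root / path).read_bytes()  # relative paths resolve under root
        except OSError as exc:
            return SandboxResult("", str(exc), 1)
        if max_bytes is not None:
            data = data[:max_bytes]
        return SandboxResult(data.decode(errors="replace"), "", 0)

    async def write_file(self, path: str, content: str | bytes,
                         executable: bool = False, timeout: int = 60) -> SandboxResult:
        target = self.root / path
        try:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(content.encode() if isinstance(content, str) else content)
            if executable:
                target.chmod(target.stat().st_mode | 0o111)
        except OSError as exc:
            return SandboxResult("", str(exc), 1)
        return SandboxResult("", "", 0)

    async def send_heartbeat(self) -> None:
        return None  # nothing to keep alive for a local process

    async def cleanup(self) -> None:
        return None  # the caller owns the root directory in this sketch


async def demo() -> str:
    """Write a script into a temp directory, run it, and return its stdout."""
    with tempfile.TemporaryDirectory() as root:
        sandbox = LocalSandbox(root)
        await sandbox.write_file("hello.sh", "echo hello\n")
        result = await sandbox.run_command("sh hello.sh")
        await sandbox.cleanup()
        return result.stdout


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the protocol is declared runtime_checkable, an isinstance check against SandboxInterface should accept any object that defines these five methods, though it does not verify their signatures or return types.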
Error analysis
At 32K context with no compaction, 80.4% of errors are context window overflows (prompt_tokens + max_tokens > 32768). They occur once the conversation history exceeds 24,576 tokens (32768 - 8192), which leaves too little room for the 8192-token max_tokens generation budget. Context compaction (for example, summarizing the tool-calling history) would likely improve results significantly.
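The overflow condition can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above (a 32768-token window and an 8192-token generation budget). The function names are illustrative and are not part of the recipe.

```python
CONTEXT_WINDOW = 32_768  # 32K context, from the text
MAX_TOKENS = 8_192       # generation budget, from the text


def prompt_budget(context_window: int = CONTEXT_WINDOW,
                  max_tokens: int = MAX_TOKENS) -> int:
    """Largest prompt (conversation history) that still leaves room to generate."""
    return context_window - max_tokens


def overflows(prompt_tokens: int,
              context_window: int = CONTEXT_WINDOW,
              max_tokens: int = MAX_TOKENS) -> bool:
    """True when prompt_tokens + max_tokens exceeds the context window."""
    return prompt_tokens + max_tokens > context_window


print(prompt_budget())    # 24576
print(overflows(24_576))  # False: exactly fills the window
print(overflows(24_577))  # True: one token over
```

A rollout loop could run such a check before each generation call to decide when the history needs compacting, rather than letting the request fail.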
Run it
Train
Evaluate
uvx harbor datasets download terminal-bench@2.0 -o ~/.cache/harbor/tasks/terminal-bench-2.0
uvx harbor datasets download swebench-verified@1.0 -o ~/.cache/harbor/tasks/swebench-verified-1.0
uv run python tinker_cookbook/recipes/harbor_rl/scripts/eval_terminal_bench.py
Expected results
Kimi-K2-Thinking at 32K context (no compaction):
| Benchmark | Total | Pass Rate |
|---|---|---|
| SWE-Bench Verified 1.0 | 500 | 9.2% |
| Terminal-Bench 2.0 | 89 | 20.2% |
Config: max_turns=200, max_tokens=8192, temperature=0.1, sandbox_timeout=3600s. Most errors are context window overflow.