Harbor RL
Installation
uv pip install 'tinker-cookbook[modal] @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git@nightly'
RL training on Harbor-formatted tasks (e.g., Terminal-Bench 2.0) with sandboxed code execution. An agent gets a bash tool inside a sandboxed container, attempts a task, and receives a reward based on test results.
HarborTask
Harbor offers a standardized format for SWE/Terminal-Bench style tasks.
Adhering to it separates the task-creation layer from the evaluation/training harness layer.
Harbor datasets can be downloaded with uvx harbor datasets download terminal-bench@2.0.
By default, tasks land in ~/.cache/harbor/tasks/ with the following structure:
~/.cache/harbor/tasks/
└── <shortuuid(task_id)>/        # deterministic hash for deduplication
    └── <task_name>/             # human-readable task directory
        ├── environment/
        │   └── Dockerfile
        ├── tests/
        │   └── test.sh
        ├── instruction.md
        ├── task.toml
        └── solution/
@dataclass(frozen=True)
class HarborTask:
    task_name: str
    instruction: str
    task_dir: Path  # must contain environment/Dockerfile and tests/test.sh
    config: dict[str, Any] = field(default_factory=dict)
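The layout requirement noted on task_dir can be checked before a run starts. The following is a minimal sketch, not part of tinker-cookbook: missing_task_files is a hypothetical helper, and its list of required files is taken only from the comment on task_dir above.

```python
from pathlib import Path

# Files the HarborTask comment says every task directory must contain.
REQUIRED_FILES = ("environment/Dockerfile", "tests/test.sh")


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the required relative paths that are absent from task_dir.

    Hypothetical helper for illustration; not part of tinker-cookbook.
    """
    return [rel for rel in REQUIRED_FILES if not (task_dir / rel).is_file()]
```

An empty return value means the directory has both required files; anything else names what is missing, which is easier to debug than a failed sandbox build later on.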
You can load your downloaded tasks (e.g., 89 Terminal-Bench tasks) via load_harbor_tasks() in launch_terminal_bench.py:
from tinker_cookbook.recipes.harbor_rl.launch_terminal_bench import load_harbor_tasks
tasks = load_harbor_tasks() # reads from ~/.cache/harbor/tasks/ by default
print(f"Loaded {len(tasks)} tasks")
print(tasks[0].task_name, tasks[0].task_dir)
Sandbox Protocol and custom backends
The Protocol
tinker_cookbook.sandbox.sandbox_interface defines SandboxInterface:
@runtime_checkable
class SandboxInterface(Protocol):
    async def run_command(self, command: str, workdir: str | None = None, timeout: int = 60, max_output_bytes: int | None = None) -> SandboxResult: ...
    async def read_file(self, path: str, max_bytes: int | None = None, timeout: int = 60) -> SandboxResult: ...
    async def write_file(self, path: str, content: str | bytes, executable: bool = False, timeout: int = 60) -> SandboxResult: ...
    async def send_heartbeat(self) -> None: ...
    async def cleanup(self) -> None: ...
ModalSandbox implements this interface.
SandboxFactory and injection
harbor_env.py defines a backend-agnostic factory type and a default Modal implementation:
SandboxFactory = Callable[[Path, int], Awaitable[SandboxInterface]]

async def default_sandbox_factory(env_dir: Path, timeout: int) -> SandboxInterface:
    """Create a Modal sandbox from a task environment directory."""
    import modal
    dockerfile_path = env_dir / "Dockerfile"
    image = modal.Image.from_dockerfile(path=str(dockerfile_path), context_dir=str(env_dir))
    return await ModalSandbox.create(image=image, timeout=timeout)
The first argument is the task's environment/ directory (containing a Dockerfile and build context). Each backend converts this to its own image format internally (e.g. Modal builds a modal.Image).
cli_main() accepts an optional sandbox_factory parameter. When None, it falls back to default_sandbox_factory (Modal). The factory flows through: cli_main -> HarborDatasetBuilder -> HarborEnvGroupBuilder.make_envs().
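Since a factory is just an async callable taking (env_dir, timeout), factories compose easily. As one illustration, the sketch below wraps any factory with retries for transient creation failures. with_retries and its parameters are hypothetical and not part of tinker-cookbook; Any stands in for SandboxInterface so the block runs without the library installed.

```python
import asyncio
from pathlib import Path
from typing import Any, Awaitable, Callable

# Same shape as SandboxFactory above; Any stands in for SandboxInterface
# so the sketch runs without tinker-cookbook installed.
SandboxFactory = Callable[[Path, int], Awaitable[Any]]


def with_retries(factory: SandboxFactory, attempts: int = 3, delay_s: float = 1.0) -> SandboxFactory:
    """Wrap a factory so failed sandbox creation is retried.

    Hypothetical helper for illustration; not part of tinker-cookbook.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    async def wrapped(env_dir: Path, timeout: int) -> Any:
        for attempt in range(1, attempts + 1):
            try:
                return await factory(env_dir, timeout)
            except Exception:
                if attempt == attempts:
                    raise  # out of retries: surface the last error
                await asyncio.sleep(delay_s)

    return wrapped
```

Under these assumptions, something like with_retries(default_sandbox_factory) could be passed as the sandbox_factory argument, since the wrapper keeps the same call signature as the factory it wraps.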
Running
First, download the Terminal-Bench tasks:
uvx harbor datasets download terminal-bench@2.0 -o ~/.cache/harbor/tasks/terminal-bench-2.0/
Then launch training:
uv run python tinker_cookbook/recipes/harbor_rl/scripts/train_terminal_bench.py \
    model_name=moonshotai/Kimi-K2.6 \
    max_tokens=8192 \
    group_size=4 \
    groups_per_batch=8 \
    learning_rate=1e-5 \
    lora_rank=32 \
    wandb_project=cookbook_harbor_rl
Evaluation
Evaluate a Tinker endpoint on Harbor datasets without training.
Download datasets:
uvx harbor datasets download terminal-bench@2.0 -o ~/.cache/harbor/tasks/terminal-bench-2.0
uvx harbor datasets download swebench-verified@1.0 -o ~/.cache/harbor/tasks/swebench-verified-1.0
Run evaluation:
uv run python tinker_cookbook/recipes/harbor_rl/scripts/eval_harbor_rl.py \
    checkpoint_url=tinker://YOUR_CHECKPOINT/sampler_weights/final \
    benchmarks=terminal_bench,swe_bench \
    max_turns=200 \
    max_tokens=8192 \
    temperature=1.0
Key parameters in EvalConfig: checkpoint_url, max_turns, max_tokens, temperature.
run_eval() also accepts sandbox_factory for custom sandbox backends and output_path to control where results are written (default: tinker_cookbook/recipes/harbor_rl/scripts/results/<timestamp>/).
We evaluated SWE-Bench-Verified-1.0 and Terminal-Bench-2.0 at a 32K context length with a naive agent harness, without advanced features such as context compaction (summarizing the tool-calling history).
Results: Kimi-K2.6 (32K context, no compaction)
| Benchmark | Total | PASS | FAIL | ERROR | Pass Rate |
|---|---|---|---|---|---|
| SWE-Bench Verified 1.0 | 500 | 145 (29.0%) | 52 (10.4%) | 303 (60.6%) | 29.0% |
| Terminal-Bench 2.0 | 89 | 14 (15.7%) | 31 (34.8%) | 44 (49.4%) | 15.7% |
Config: max_turns=200, max_tokens=8192, temperature=1.0, sandbox_timeout=3600s
All ERRORs are context window overflow (prompt_tokens + max_tokens > 32768).
These occur when the conversation history exceeds 24,576 tokens (32768 - 8192), leaving insufficient room for the 8192-token max_tokens generation budget.
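The overflow condition is simple arithmetic and can be sketched as below. The two constants come from the configuration above; the function names are hypothetical and only illustrate the check, not how the harness or the endpoint implements it.

```python
CONTEXT_WINDOW = 32_768  # tokens, from the overflow condition above
MAX_TOKENS = 8_192       # generation budget used in the evaluation config


def max_prompt_tokens(context_window: int = CONTEXT_WINDOW, max_tokens: int = MAX_TOKENS) -> int:
    """Largest prompt that still leaves room for a full generation."""
    return context_window - max_tokens


def overflows(prompt_tokens: int, context_window: int = CONTEXT_WINDOW, max_tokens: int = MAX_TOKENS) -> bool:
    """True when prompt_tokens + max_tokens exceeds the context window."""
    return prompt_tokens + max_tokens > context_window
```

With these values, a prompt of exactly 24,576 tokens still fits, and one more token triggers the error. Lowering max_tokens raises the prompt ceiling by the same amount.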