# Code RL
Train LLMs on competitive programming problems with sandboxed code execution, replicating the DeepCoder approach.
## What you'll build
A code-reasoning model fine-tuned with RL on competitive programming tasks. Generated code is executed in a sandbox (SandboxFusion or Modal) and scored by test-case pass rates. Based on the DeepCoder pipeline.
## Prerequisites
For Modal sandboxing, you need a Modal account and an authenticated client (`pip install modal`, then `modal setup`).
For SandboxFusion (default), start a local Docker sandbox:
```bash
docker run -it -p 8080:8080 \
  -v ${TINKER_COOKBOOK_ROOT}/tinker_cookbook/recipes/code_rl/sandbox_config/local.yaml:/root/sandbox/sandbox/configs/local.yaml \
  volcengine/sandbox-fusion:server-20250609

export SANDBOX_URL=http://localhost:8080/run_code
```
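Once the container is up, you can smoke-test the endpoint from Python. The payload fields (`code`, `language`) reflect SandboxFusion's `run_code` API as commonly documented; treat this as a sketch for sanity-checking your setup, not the recipe's own client code.

```python
import json
import os
import urllib.request

# Falls back to the local Docker address used above
SANDBOX_URL = os.environ.get("SANDBOX_URL", "http://localhost:8080/run_code")


def build_payload(code: str, language: str = "python") -> dict:
    # Request body for the run_code endpoint (field names assumed)
    return {"code": code, "language": language}


def run_in_sandbox(code: str) -> dict:
    # POSTs the snippet to the sandbox; requires the container to be running
    req = urllib.request.Request(
        SANDBOX_URL,
        data=json.dumps(build_payload(code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```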
## Key concepts
- Sandboxed execution — generated code runs in isolated containers for safe evaluation
- Test-case grading — reward is based on passing hidden test cases for each problem
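A minimal sketch of test-case grading as a pass-rate reward (illustrative only: a real grader also handles timeouts, runtime errors, and output-format tolerance):

```python
def pass_rate_reward(actual_outputs: list[str], expected_outputs: list[str]) -> float:
    """Reward = fraction of hidden test cases whose output matches.

    Sketch under simplifying assumptions: exact string comparison after
    stripping whitespace, one output string per test case.
    """
    if not expected_outputs:
        return 0.0
    passed = sum(
        actual.strip() == expected.strip()
        for actual, expected in zip(actual_outputs, expected_outputs)
    )
    return passed / len(expected_outputs)
```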
## Run it
```bash
python -m tinker_cookbook.recipes.code_rl.train \
  model_name=Qwen/Qwen3-4B-Instruct-2507 \
  group_size=8 \
  groups_per_batch=128 \
  learning_rate=4e-5 \
  lora_rank=32 \
  max_tokens=24576
```
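For intuition, the two sampling knobs multiply out to the per-step rollout count (a back-of-envelope sketch; the variable names mirror the flags above, not trainer internals):

```python
group_size = 8          # completions sampled per problem (shared-baseline group)
groups_per_batch = 128  # distinct problems per training batch

# Every step, this many programs are generated, executed, and graded
rollouts_per_step = group_size * groups_per_batch
print(rollouts_per_step)  # 1024
```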
To use Modal instead of SandboxFusion, add `sandbox_backend=modal`. Optional environment variables for Modal:

- `MODAL_POOL_SIZE`: number of concurrent sandboxes (default: 32)
- `MODAL_CREATION_RATE_LIMIT`: max sandboxes created per second (default: 4)
The training dataset is a composite of three sources: primeintellect, taco, and lcbv5 (LiveCodeBench v5), providing broad coverage of competitive programming problems.
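Building such a composite amounts to concatenating and shuffling the three problem sources; a hypothetical sketch (the loader names are illustrative, not the recipe's actual dataset builders):

```python
import random


def build_composite(primeintellect, taco, lcbv5, seed=0):
    """Concatenate the three problem sources and shuffle deterministically."""
    problems = list(primeintellect) + list(taco) + list(lcbv5)
    random.Random(seed).shuffle(problems)
    return problems
```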
## Expected results
After 100 training steps, evaluated on LiveCodeBench v6 (problems from 2025.02-2025.05):
| Model | Pass@1 | Pass@8 |
|---|---|---|
| Qwen3-4B-Instruct-2507 (before) | 33.8% | 44.3% |
| Qwen3-4B-Instruct-2507 (after) | 42.7% | 55.0% |
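For reference, Pass@k is conventionally computed with the unbiased estimator from the Codex paper (a sketch; the recipe's evaluation script may differ in details):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: samples generated per problem, c: samples that pass all tests,
    k: budget. Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```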