# Code RL
Train LLMs on competitive programming problems with sandboxed code execution, replicating the DeepCoder approach.
## What you'll build
A code-reasoning model fine-tuned with RL on competitive programming tasks. Generated code is executed in a sandbox (SandboxFusion or Modal) and scored by test-case pass rates. Based on the DeepCoder pipeline.
## Prerequisites
For Modal sandboxing, you need a Modal account and an authenticated client (`pip install modal`, then `modal setup`).
For SandboxFusion (default), start a local Docker sandbox:
```bash
docker run -it -p 8080:8080 \
  -v ${TINKER_COOKBOOK_ROOT}/tinker_cookbook/recipes/code_rl/sandbox_config/local.yaml:/root/sandbox/sandbox/configs/local.yaml \
  volcengine/sandbox-fusion:server-20250609

export SANDBOX_URL=http://localhost:8080/run_code
```
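Once the container is up, you can smoke-test the endpoint from Python. The payload fields (`code`, `language`) reflect SandboxFusion's `run_code` API as commonly documented; treat this as a sketch for sanity-checking your setup, not the recipe's own client code.

```python
import json
import os
import urllib.request

# Falls back to the local Docker address used above
SANDBOX_URL = os.environ.get("SANDBOX_URL", "http://localhost:8080/run_code")


def build_payload(code: str, language: str = "python") -> dict:
    # Request body for the run_code endpoint (field names assumed)
    return {"code": code, "language": language}


def run_in_sandbox(code: str) -> dict:
    # POSTs the snippet to the sandbox; requires the container to be running
    req = urllib.request.Request(
        SANDBOX_URL,
        data=json.dumps(build_payload(code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```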
## Key concepts
- Sandboxed execution — generated code runs in isolated containers for safe evaluation
- Test-case grading — reward is based on passing hidden test cases for each problem
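A minimal sketch of test-case grading as a pass-rate reward (illustrative only: a real grader also handles timeouts, runtime errors, and output-format tolerance):

```python
def pass_rate_reward(actual_outputs: list[str], expected_outputs: list[str]) -> float:
    """Reward = fraction of hidden test cases whose output matches.

    Sketch under simplifying assumptions: exact string comparison after
    stripping whitespace, one output string per test case.
    """
    if not expected_outputs:
        return 0.0
    passed = sum(
        actual.strip() == expected.strip()
        for actual, expected in zip(actual_outputs, expected_outputs)
    )
    return passed / len(expected_outputs)
```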
## Run it
```bash
python -m tinker_cookbook.recipes.code_rl.train \
  model_name=Qwen/Qwen3-4B-Instruct-2507 \
  group_size=8 \
  groups_per_batch=128 \
  learning_rate=4e-5 \
  lora_rank=32 \
  max_tokens=24576
```
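For intuition, the two sampling knobs multiply out to the per-step rollout count (a back-of-envelope sketch; the variable names mirror the flags above, not trainer internals):

```python
group_size = 8          # completions sampled per problem (shared-baseline group)
groups_per_batch = 128  # distinct problems per training batch

# Every step, this many programs are generated, executed, and graded
rollouts_per_step = group_size * groups_per_batch
print(rollouts_per_step)  # 1024
```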
To use Modal instead of SandboxFusion, add `sandbox_backend=modal`. Optional environment variables for Modal:

- `MODAL_POOL_SIZE`: number of concurrent sandboxes (default: 32)
- `MODAL_CREATION_RATE_LIMIT`: max sandboxes created per second (default: 4)
The training dataset is a composite of three sources: primeintellect, taco, and lcbv5 (LiveCodeBench v5), providing broad coverage of competitive programming problems.
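Building such a composite amounts to concatenating and shuffling the three problem sources; a hypothetical sketch (the loader names are illustrative, not the recipe's actual dataset builders):

```python
import random


def build_composite(primeintellect, taco, lcbv5, seed=0):
    """Concatenate the three problem sources and shuffle deterministically."""
    problems = list(primeintellect) + list(taco) + list(lcbv5)
    random.Random(seed).shuffle(problems)
    return problems
```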
## Expected results
After 100 training steps, evaluated on LiveCodeBench v6 (problems from 2025.02-2025.05):
| Model | Pass@1 | Pass@8 |
|---|---|---|
| Qwen3-4B-Instruct-2507 (before) | 33.8% | 44.3% |
| Qwen3-4B-Instruct-2507 (after) | 42.7% | 55.0% |
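For reference, Pass@k is conventionally computed with the unbiased estimator from the Codex paper (a sketch; the recipe's evaluation script may differ in details):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: samples generated per problem, c: samples that pass all tests,
    k: budget. Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```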