Replicating DeepCoder with Tinker

Competitive programming problems are a common testbed for RL with LLMs. The recent DeepCoder blog post [1] introduces a dataset and training pipeline for this purpose. This recipe demonstrates a similar setup using Qwen3.5-4B with thinking enabled.

Running This Demo

Sandboxing

Sandboxing is essential for safely executing generated code during training and evaluation. Two sandbox backends are supported:

SandboxFusion (Default)

Sandbox Fusion provides local Docker-based sandboxing. Start a sandbox container with:

docker run -it -p 8080:8080 \
    -v ${TINKER_COOKBOOK_ROOT}/tinker_cookbook/recipes/code_rl/sandbox_config/local.yaml:/root/sandbox/sandbox/configs/local.yaml \
    volcengine/sandbox-fusion:server-20250609

Here, ${TINKER_COOKBOOK_ROOT} is the absolute path to your local tinker-cookbook repository. The training script reads the sandbox endpoint from the SANDBOX_URL environment variable and uses http://localhost:8080/run_code by default. To set it explicitly:

export SANDBOX_URL=http://localhost:8080/run_code
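
As a rough illustration of how a client might use this endpoint, the sketch below resolves the URL from SANDBOX_URL and builds a POST request for it without sending anything. The helper names (resolve_sandbox_url, build_run_code_request) are hypothetical and not part of tinker-cookbook. The JSON body with `code` and `language` fields is an assumption based on the request format in the Sandbox Fusion documentation, so check it against the server version you run.

```python
import json
import os
import urllib.request

# Default endpoint quoted in this README.
DEFAULT_SANDBOX_URL = "http://localhost:8080/run_code"


def resolve_sandbox_url(env=None):
    """Return SANDBOX_URL if it is set and non-empty, else the local default."""
    env = os.environ if env is None else env
    return env.get("SANDBOX_URL") or DEFAULT_SANDBOX_URL


def build_run_code_request(code, language="python", env=None):
    """Build (but do not send) a JSON POST request for the sandbox."""
    # Assumption: the endpoint accepts a JSON body with `code` and `language`.
    body = json.dumps({"code": code, "language": language}).encode("utf-8")
    return urllib.request.Request(
        resolve_sandbox_url(env),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_code_request("print(1 + 1)")
    print(request.full_url)
    # Sending the request requires a sandbox listening at that URL:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to check the target URL and payload before any generated code is executed.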

If you prefer not to use Docker, you can set up the sandbox manually by following the instructions in the Sandbox Fusion repository.

Modal (Alternative)

Modal provides cloud-based sandboxed execution without local Docker setup. To use Modal:

1. Install the modal extra and authenticate:

uv pip install 'tinker-cookbook[modal] @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git@nightly'
modal token new

2. Set the sandbox backend in your training command:

python -m tinker_cookbook.recipes.code_rl.train \
    sandbox_backend=modal \
    ...

Optional environment variables for Modal:

- MODAL_POOL_SIZE: Number of concurrent sandboxes (default: 32)
- MODAL_CREATION_RATE_LIMIT: Max sandboxes created per second (default: 4)
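
To get a feel for how these two settings interact, the sketch below reads them with the documented defaults and estimates how long it takes to fill an empty pool. The function names are hypothetical, and the parsing (integer pool size, numeric rate) is an assumption rather than the recipe's actual code. The estimate assumes the pool starts empty and sandboxes are created at exactly the rate limit.

```python
import os

# Defaults quoted in this README.
DEFAULT_POOL_SIZE = 32
DEFAULT_CREATION_RATE_LIMIT = 4


def read_modal_settings(env=None):
    """Read the two optional Modal variables, falling back to the defaults."""
    env = os.environ if env is None else env
    pool_size = int(env.get("MODAL_POOL_SIZE", DEFAULT_POOL_SIZE))
    rate_limit = float(
        env.get("MODAL_CREATION_RATE_LIMIT", DEFAULT_CREATION_RATE_LIMIT)
    )
    if pool_size <= 0 or rate_limit <= 0:
        raise ValueError("pool size and creation rate limit must be positive")
    return pool_size, rate_limit


def approx_pool_fill_seconds(pool_size, rate_limit):
    """Rough time to create `pool_size` sandboxes at `rate_limit` per second."""
    return pool_size / rate_limit


if __name__ == "__main__":
    pool_size, rate_limit = read_modal_settings()
    print(approx_pool_fill_seconds(pool_size, rate_limit))
```

With the defaults this gives roughly 8 seconds, so a larger pool with an unchanged rate limit takes proportionally longer to warm up.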

Example command

Train a Qwen3.5-4B model with thinking enabled:

python -m tinker_cookbook.recipes.code_rl.train \
    model_name="Qwen/Qwen3.5-4B" \
    group_size=8 groups_per_batch=128 \
    learning_rate=4e-5 \
    lora_rank=32 \
    max_tokens=24576
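
The batch settings in this command imply a per-step sampling budget, which the sketch below works out. It assumes that group_size is the number of rollouts sampled per problem, groups_per_batch is the number of problems per training step, and max_tokens caps the tokens sampled per rollout. These readings of the flags are assumptions, and batch_budget is a hypothetical helper.

```python
def batch_budget(group_size, groups_per_batch, max_tokens):
    """Return rollouts per step and an upper bound on sampled tokens per step."""
    rollouts_per_step = group_size * groups_per_batch
    return {
        "rollouts_per_step": rollouts_per_step,
        # Upper bound only: most rollouts stop before reaching max_tokens.
        "max_sampled_tokens_per_step": rollouts_per_step * max_tokens,
    }


if __name__ == "__main__":
    # Values copied from the example command.
    budget = batch_budget(group_size=8, groups_per_batch=128, max_tokens=24576)
    print(budget["rollouts_per_step"])            # 1024
    print(budget["max_sampled_tokens_per_step"])  # 25165824
```

Under these assumptions each step samples 1,024 rollouts, each of which needs sandboxed test execution, so sandbox throughput can matter as much as sampling speed.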

After 190 steps of training, you can expect the following performance on LiveCodeBench v6 (2025.02–2025.05):

Model	Pass@1	Pass@8
Qwen3.5-4B (before training)	23.6%	34.3%
Qwen3.5-4B (after 190 steps)	52.4%	76.0%
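
This README does not say how Pass@1 and Pass@8 are computed. A common choice is the unbiased pass@k estimator from Chen et al. (2021), which the sketch below implements as an assumption rather than as this recipe's evaluation code. For a problem with n samples of which c pass all tests, it estimates the probability that at least one of k randomly chosen samples passes.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: two problems, 8 samples each, 2 and 0 passing.
    print(mean_pass_at_k([(8, 2), (8, 0)], k=1))  # 0.125
    print(mean_pass_at_k([(8, 2), (8, 0)], k=8))  # 0.5
```

With n = k = 8, pass@8 for a problem reduces to whether any of the 8 samples passes, while pass@1 is the fraction of samples that pass.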

[1] Luo, M., Tan, S., Huang, R., Patel, A., Ariyak, A., Wu, Q., Shi, X., Xin, R., Cai, C., Weber, M., Zhang, C., Li, L. E., Popa, R. A., & Stoica, I. (2025). DeepCoder: A fully open-source 14B coder at O3-mini level.
