Multi-Agent RL

Train LLMs in multi-turn and multiplayer environments using the Environment abstraction.

What you'll build

Models that perform well across multiple interaction turns, from simple number guessing to self-play in games. Three environments of increasing complexity: Guess the Number, Twenty Questions, and Tic-Tac-Toe.

Prerequisites

uv pip install tinker-cookbook

Key concepts

  • Environment abstraction — a flexible interface for defining multi-turn interactions with programmatic or LLM-based counterparts
  • Self-play — in Tic-Tac-Toe, the model trains by playing against itself, requiring multiple simultaneous LLM clients

How it works

Progressive complexity

The three environments form a progression of increasing difficulty:

  1. Programmatic opponent (Guess the Number) — The user turn is a simple Python function returning "too high" or "too low". This is the simplest case since no LLM is needed for the counterpart.
  2. LLM opponent (Twenty Questions) — A separate language model answers yes/no questions from the policy. This introduces the complexity of coordinating with another LLM during rollouts.
  3. Self-play (Tic-Tac-Toe) — The policy trains by playing against itself, requiring multiple simultaneous LLM clients and coordinated turn management.

The Environment abstraction handles all three cases uniformly.
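One way to picture that shared interface, as a hypothetical sketch rather than the actual tinker-cookbook API (the names `Environment`, `initial_observation`, and `step` here are illustrative):

```python
from typing import Protocol


class Environment(Protocol):
    """Sketch of a multi-turn environment interface (illustrative names)."""

    def initial_observation(self) -> str:
        """First message shown to the policy."""
        ...

    def step(self, action: str) -> tuple[str, float, bool]:
        """Consume the policy's turn; return (next observation, reward, done).

        The counterpart producing the observation may be a Python function,
        a separate LLM, or the policy itself in self-play.
        """
        ...


class GuessNumberEnv:
    """Case 1: a purely programmatic counterpart."""

    def __init__(self, target: int):
        self.target = target

    def initial_observation(self) -> str:
        return "Guess a number between 1 and 100."

    def step(self, action: str) -> tuple[str, float, bool]:
        guess = int(action)
        if guess == self.target:
            return "correct", 1.0, True
        return ("too high" if guess > self.target else "too low"), 0.0, False
```

Because the LLM-opponent and self-play cases implement the same two methods, the rollout loop never needs to know what kind of counterpart it is talking to.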

Self-play coordination

In the Tic-Tac-Toe recipe, self-play is managed by a TwoPlayerCoordinator that handles async turn management. Both players share the same model (but may use different LoRA checkpoints). The coordinator:

  • Maintains separate conversation histories for each player
  • Routes turns to the correct player based on game state
  • Handles simultaneous games in a batch, with advantages computed across the group
  • Ensures both players' experiences contribute to the training signal
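The turn-routing logic above can be sketched as follows. This is an illustrative stand-in, not the recipe's actual `TwoPlayerCoordinator`; the `policy` callable and the `env.reset`/`env.apply` methods are assumptions for the sketch:

```python
import asyncio


class TwoPlayerCoordinator:
    """Illustrative sketch: alternate turns between two players of one model."""

    def __init__(self, policy):
        self.policy = policy             # the same model serves both players
        self.histories = {0: [], 1: []}  # separate conversation per player

    async def play(self, env, max_turns: int = 9):
        obs = env.reset()
        player = 0
        for _ in range(max_turns):
            # Each player only sees its own history plus the latest observation.
            self.histories[player].append(("user", obs))
            action = await self.policy(self.histories[player])
            self.histories[player].append(("assistant", action))
            obs, done = env.apply(action, player)  # route the move to the game
            if done:
                break
            player = 1 - player  # alternate turns
        return self.histories
```

Keeping a separate history per player is what lets both sides' trajectories be scored and fed back into training, rather than only the winner's.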

Run it

Guess the Number

The policy learns to guess a target number given "too high" / "too low" feedback.

python -m tinker_cookbook.recipes.multiplayer_rl.guess_number.train
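As a sanity check on the reward structure, note that the optimal strategy here is binary search, which pins down a target in 1–100 within at most 7 guesses (2^7 = 128 ≥ 100). A self-contained sketch, not the recipe's code:

```python
def feedback(target: int, guess: int) -> str:
    """Programmatic 'user turn': compare the guess to the hidden target."""
    if guess == target:
        return "correct"
    return "too high" if guess > target else "too low"


def binary_search_episode(target: int, lo: int = 1, hi: int = 100) -> int:
    """Play one episode with the optimal strategy; return the guess count."""
    guesses = 0
    while True:
        guess = (lo + hi) // 2
        guesses += 1
        reply = feedback(target, guess)
        if reply == "correct":
            return guesses
        if reply == "too high":
            hi = guess - 1
        else:
            lo = guess + 1
```

A well-trained policy should approach this bound, which is one way to judge how close training has come to converging.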

Twenty Questions

The policy learns to identify an object by asking yes/no questions answered by an LLM.

python -m tinker_cookbook.recipes.multiplayer_rl.twenty_questions.train
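To see the shape of the episode without a second model in the loop, the LLM answerer can be replaced by a rule-based stand-in. Everything below (`SECRET`, `FACTS`, `answer`, `reward`) is a hypothetical sketch; the recipe queries a real model instead of a lookup table:

```python
# Rule-based stand-in for the LLM answerer (the recipe uses a real model).
SECRET = "penguin"
FACTS = {
    "penguin": {
        "is it an animal": "yes",
        "can it fly": "no",
        "does it live in cold climates": "yes",
    }
}


def answer(question: str) -> str:
    """Answer yes/no as the judge would; 'unknown' for questions off-table."""
    return FACTS[SECRET].get(question.strip().lower().rstrip("?"), "unknown")


def reward(final_guess: str) -> float:
    """Terminal reward: 1.0 only if the policy names the secret object."""
    return 1.0 if final_guess.strip().lower() == SECRET else 0.0
```

Swapping the table for a model call is the only change needed to recover the real environment, which is exactly the uniformity the Environment abstraction provides.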

Tic-Tac-Toe (self-play)

The policy learns by playing against itself in text-based Tic-Tac-Toe.

python -m tinker_cookbook.recipes.multiplayer_rl.text_arena.train
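Since the game is text-based, the environment's job reduces to parsing moves and scoring board states. A minimal sketch of the win check (illustrative, not the recipe's code), with the board as a 9-character string read row by row:

```python
from typing import Optional

# Rows, columns, and diagonals of a 3x3 board flattened to indices 0-8.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: str) -> Optional[str]:
    """board is 9 chars of 'X', 'O', or '.'; return the winner, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None
```

In self-play, this check is what turns a terminal board into the +1/-1 signal assigned to the two players' trajectories.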

Expected results

All three environments show increasing reward over training. Guess the Number converges fastest; Tic-Tac-Toe requires the most steps because the opponent improves alongside the policy, making the reward target non-stationary.

Learn more