Multi-Agent RL
Train LLMs in multi-turn and multiplayer environments using the Environment abstraction.
What you'll build
Models that perform well across multiple interaction turns, from simple number guessing to self-play in games. Three environments of increasing complexity: Guess the Number, Twenty Questions, and Tic-Tac-Toe.
Prerequisites
Key concepts
- Environment abstraction — a flexible interface for defining multi-turn interactions with programmatic or LLM-based counterparts
- Self-play — in Tic-Tac-Toe, the model trains by playing against itself, requiring multiple simultaneous LLM clients
How it works
Progressive complexity
The three environments form a progression of increasing difficulty:
- Programmatic opponent (Guess the Number) — The user turn is a simple Python function returning "too high" or "too low". This is the simplest case since no LLM is needed for the counterpart.
- LLM opponent (Twenty Questions) — A separate language model answers yes/no questions from the policy. This introduces the complexity of coordinating with another LLM during rollouts.
- Self-play (Tic-Tac-Toe) — The policy trains by playing against itself, requiring multiple simultaneous LLM clients and coordinated turn management.
The Environment abstraction handles all three cases uniformly.
Self-play coordination
In the Tic-Tac-Toe recipe, self-play is managed by a TwoPlayerCoordinator that handles async turn management. Both players share the same model (but may use different LoRA checkpoints). The coordinator:
- Maintains separate conversation histories for each player
- Routes turns to the correct player based on game state
- Handles simultaneous games in a batch, with advantages computed across the group
- Ensures both players' experiences contribute to the training signal
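The real coordinator is async and batched; a stripped-down sketch of just the per-player history and turn-routing logic (names are illustrative) looks like this:

```python
class TwoPlayerCoordinator:
    """Illustrative sketch: keep separate histories and alternate turns."""

    def __init__(self):
        # Separate conversation history for each player, since each
        # sees the game from its own side.
        self.histories = {0: [], 1: []}
        self.current = 0  # index of the player whose turn it is

    def record_turn(self, message: str) -> int:
        """Append the current player's move to its own history,
        then hand the turn to the other player. Returns the
        player who moved."""
        player = self.current
        self.histories[player].append(message)
        self.current = 1 - player
        return player
```

In the actual recipe this routing runs concurrently across a batch of games, and both histories are turned into training examples so each player's experience contributes to the gradient.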
Run it
Guess the Number
The policy learns to guess a target number given "too high" / "too low" feedback.
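As a sanity check on what convergence should look like: the optimal strategy here is binary search, which a quick standalone sketch (not part of the recipe) shows needs at most seven turns for targets in 1–100:

```python
def optimal_rollout(target: int, low: int = 1, high: int = 100) -> int:
    """Play Guess the Number with binary search; return turns used."""
    turns = 0
    while low <= high:
        guess = (low + high) // 2
        turns += 1
        if guess == target:
            return turns
        if guess > target:
            high = guess - 1  # feedback: "too high"
        else:
            low = guess + 1   # feedback: "too low"


worst_case = max(optimal_rollout(t) for t in range(1, 101))
# A fully trained policy should approach this 7-turn worst case.
```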
Twenty Questions
The policy learns to identify an object by asking yes/no questions answered by an LLM.
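The shape of a rollout here can be sketched with the LLM answerer replaced by a keyword stub; everything below (the secret, the stub's facts, the function names) is illustrative, not the recipe's API:

```python
SECRET = "penguin"


def stub_answerer(question: str) -> str:
    """Stand-in for the separate LLM that answers yes/no questions."""
    facts = {"bird": "yes", "fly": "no", "cold": "yes", "mammal": "no"}
    q = question.lower()
    for keyword, answer in facts.items():
        if keyword in q:
            return answer
    return "no"


def run_episode(questions: list[str], final_guess: str, budget: int = 20):
    """Ask up to `budget` questions, then score the final guess."""
    transcript = [(q, stub_answerer(q)) for q in questions[:budget]]
    reward = 1.0 if final_guess == SECRET else 0.0
    return transcript, reward
```

In the real recipe the stub is a second LLM client queried during rollouts, which is what makes this case harder to coordinate than the programmatic opponent.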
Tic-Tac-Toe (self-play)
The policy learns by playing against itself in text-based Tic-Tac-Toe.
Expected results
All three environments show increasing reward over training. Guess the Number converges fastest; Tic-Tac-Toe requires the most steps because in self-play the opponent improves alongside the policy, making the learning target nonstationary.