Rubric-Based Grading
Use an LLM grader with rubrics to provide reward signals for RL training.
What you'll build
An RL training setup where a grader LLM evaluates policy responses against rubric criteria and returns a reward score. Includes a simple addition task demo and an experimental setup using the Prometheus evaluation dataset.
Prerequisites
Key concepts
- Rubric items — structured criteria specifying what constitutes a good response, with extraction regex for parsing grader output
- LLM-as-judge — a separate language model grades each response against the rubric, producing a scalar reward
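To make the LLM-as-judge idea concrete, here is a minimal sketch of how a grading prompt might be assembled from a rubric item and a policy response. The prompt template and the `build_grader_prompt` helper are illustrative assumptions, not this recipe's actual code:

```python
# Illustrative sketch of assembling an LLM-as-judge grading prompt.
# The template and helper name are hypothetical, not the recipe's API.

def build_grader_prompt(rubric_str: str, format_instruction: str,
                        convo_text: str, response: str) -> str:
    """Combine a rubric item and a policy response into one grading prompt."""
    return (
        "You are grading a chatbot response against a rubric.\n\n"
        f"Conversation:\n{convo_text}\n\n"
        f"Response to grade:\n{response}\n\n"
        f"Rubric: {rubric_str}\n"
        f"{format_instruction}\n"
    )

prompt = build_grader_prompt(
    "Does the chatbot correctly get the answer 134?",
    "Please output your score between 0 and 1 wrapped in <score> ... </score>",
    "user: What is 122 + 12?",
    "134",
)
```

The grader LLM is then called with this prompt, and its text output is parsed into a scalar score.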
How it works
JSON data format
Each datapoint consists of a conversation prefix and a list of rubric items:
{
  "convo": [
    {
      "role": "user",
      "content": "What is 4 + 5?"
    },
    {
      "role": "assistant",
      "content": "9"
    },
    {
      "role": "user",
      "content": "What is 122 + 12?"
    }
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
Each rubric item specifies what constitutes a good response (rubric_str), how the grader should format its output (grader_output_format_instruction), and how the score is extracted (extraction_regex).
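The extraction step can be sketched as a small regex parse. This is a minimal sketch under the assumption that the grader's raw text is already in hand; `parse_score` is an illustrative helper, not the recipe's actual API:

```python
import re

def parse_score(grader_output: str, extraction_regex: str) -> float:
    """Pull the scalar score out of the grader's text using the rubric's regex.

    Returns 0.0 when the grader did not follow the format instruction,
    so malformed grader output contributes no reward.
    """
    match = re.search(extraction_regex, grader_output, re.DOTALL)
    if match is None:
        return 0.0
    try:
        return float(match.group(1).strip())
    except ValueError:
        return 0.0

score = parse_score("Reasoning... <score>0.8</score>", r"<score>(.*)</score>")
# score == 0.8
```

Treating unparseable output as a zero score is one reasonable choice; it also gently pressures the grader prompt design toward instructions the grader reliably follows.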
Debug workflow
Use debug_env.py to inspect what happens during rollouts before launching a full training run. It shows the message the policy sees, its response, the grader input, and the grader output, which is useful for verifying that the grading pipeline works correctly.
Parallel grading
The environment grades each rubric item in parallel using asyncio.gather. For each rollout, the policy reads the conversation prefix, generates a response, then all rubric items are evaluated concurrently by the grader LLM. The final reward is the sum of individual rubric scores.
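The concurrent grading step can be sketched as follows. This is a minimal sketch: `grade_one` stands in for a real async grader-LLM call (the part that actually benefits from running concurrently), and the function names are illustrative:

```python
import asyncio
import re

async def grade_one(response: str, rubric_item: dict) -> float:
    """Grade one rubric item. A real implementation would await a grader-LLM
    call here; this stub fakes the grader's output for illustration."""
    await asyncio.sleep(0)  # stands in for the async LLM request
    grader_output = ("<score>1.0</score>" if "134" in response
                     else "<score>0.0</score>")
    match = re.search(rubric_item["extraction_regex"], grader_output)
    return float(match.group(1)) if match else 0.0

async def compute_reward(response: str, rubric_items: list[dict]) -> float:
    # Evaluate all rubric items concurrently; the reward is the sum of scores.
    scores = await asyncio.gather(
        *(grade_one(response, item) for item in rubric_items)
    )
    return sum(scores)

items = [{"extraction_regex": r"<score>(.*)</score>"}]
reward = asyncio.run(compute_reward("The answer is 134.", items))
# reward == 1.0
```

Because each rubric item is an independent grader call, `asyncio.gather` lets the wall-clock cost per rollout stay close to that of a single grading request rather than growing linearly with the number of rubric items.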
Run it
Generate example data
Debug rollouts (inspect grader behavior)
Train on addition task
Train on Prometheus dataset (experimental)
Expected results
The addition task shows reward climbing quickly within the first few steps. The Prometheus dataset shows steady reward improvement, though this recipe is experimental; fine-tuning the grader LLM may be needed for stronger results.