Rubric-Based Grading
Use an LLM grader with rubrics to provide reward signals for RL training.
What you'll build
An RL training setup where a grader LLM evaluates policy responses against rubric criteria and returns a reward score. Includes a simple addition task demo and an experimental setup using the Prometheus evaluation dataset.
Prerequisites
Key concepts
- Rubric items — structured criteria specifying what constitutes a good response, with extraction regex for parsing grader output
- LLM-as-judge — a separate language model grades each response against the rubric, producing a scalar reward
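To make the LLM-as-judge idea concrete, here is a minimal sketch of how a grading prompt might be assembled from a rubric item and a policy response. The prompt template and the `build_grader_prompt` helper are illustrative assumptions, not this recipe's actual code:

```python
# Illustrative sketch of assembling an LLM-as-judge grading prompt.
# The template and helper name are hypothetical, not the recipe's API.

def build_grader_prompt(rubric_str: str, format_instruction: str,
                        convo_text: str, response: str) -> str:
    """Combine a rubric item and a policy response into one grading prompt."""
    return (
        "You are grading a chatbot response against a rubric.\n\n"
        f"Conversation:\n{convo_text}\n\n"
        f"Response to grade:\n{response}\n\n"
        f"Rubric: {rubric_str}\n"
        f"{format_instruction}\n"
    )

prompt = build_grader_prompt(
    "Does the chatbot correctly get the answer 134?",
    "Please output your score between 0 and 1 wrapped in <score> ... </score>",
    "user: What is 122 + 12?",
    "134",
)
```

The grader LLM is then called with this prompt, and its text output is parsed into a scalar score.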
How it works
JSON data format
Each datapoint consists of a conversation prefix and a list of rubric items:
{
  "convo": [
    {
      "role": "user",
      "content": "What is 4 + 5?"
    },
    {
      "role": "assistant",
      "content": "9"
    },
    {
      "role": "user",
      "content": "What is 122 + 12?"
    }
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
Each rubric item specifies what constitutes a good response (rubric_str), how the grader should format its output (grader_output_format_instruction), and how the score is extracted (extraction_regex).
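The extraction step can be sketched as a small regex parse. This is a minimal sketch under the assumption that the grader's raw text is already in hand; `parse_score` is an illustrative helper, not the recipe's actual API:

```python
import re

def parse_score(grader_output: str, extraction_regex: str) -> float:
    """Pull the scalar score out of the grader's text using the rubric's regex.

    Returns 0.0 when the grader did not follow the format instruction,
    so malformed grader output contributes no reward.
    """
    match = re.search(extraction_regex, grader_output, re.DOTALL)
    if match is None:
        return 0.0
    try:
        return float(match.group(1).strip())
    except ValueError:
        return 0.0

score = parse_score("Reasoning... <score>0.8</score>", r"<score>(.*)</score>")
# score == 0.8
```

Treating unparseable output as a zero score is one reasonable choice; it also gently pressures the grader prompt design toward instructions the grader reliably follows.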
Debug workflow
Use debug_env.py to inspect what happens during rollouts before launching a full training run. It shows the message the policy sees, its response, the grader input, and the grader output, which is useful for verifying that the grading pipeline works correctly.
Parallel grading
The environment grades each rubric item in parallel using asyncio.gather. For each rollout, the policy reads the conversation prefix, generates a response, then all rubric items are evaluated concurrently by the grader LLM. The final reward is the sum of individual rubric scores.
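The concurrent grading step can be sketched as follows. This is a minimal sketch: `grade_one` stands in for a real async grader-LLM call (the part that actually benefits from running concurrently), and the function names are illustrative:

```python
import asyncio
import re

async def grade_one(response: str, rubric_item: dict) -> float:
    """Grade one rubric item. A real implementation would await a grader-LLM
    call here; this stub fakes the grader's output for illustration."""
    await asyncio.sleep(0)  # stands in for the async LLM request
    grader_output = ("<score>1.0</score>" if "134" in response
                     else "<score>0.0</score>")
    match = re.search(rubric_item["extraction_regex"], grader_output)
    return float(match.group(1)) if match else 0.0

async def compute_reward(response: str, rubric_items: list[dict]) -> float:
    # Evaluate all rubric items concurrently; the reward is the sum of scores.
    scores = await asyncio.gather(
        *(grade_one(response, item) for item in rubric_items)
    )
    return sum(scores)

items = [{"extraction_regex": r"<score>(.*)</score>"}]
reward = asyncio.run(compute_reward("The answer is 134.", items))
# reward == 1.0
```

Because each rubric item is an independent grader call, `asyncio.gather` lets the wall-clock cost per rollout stay close to that of a single grading request rather than growing linearly with the number of rubric items.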
Run it
Generate example data
Debug rollouts (inspect grader behavior)
Train on addition task
Train on Prometheus dataset (experimental)
Expected results
The addition task shows reward climbing quickly within the first few steps. The Prometheus dataset shows steady reward improvement, though this recipe is experimental; fine-tuning the grader LLM may be needed for stronger results.