Evaluations
Our training scripts will print out training and test loss. Beyond that, there are two common evaluation workflows: inline evals that run during training, and offline evals that run on checkpoints saved from a run.
Inline Evals
You can add inline evaluations to your training runs by configuring evaluator builders in advance for both supervised fine-tuning and RL training jobs.
Supervised Fine-Tuning (supervised.train)
Add one or both of the following to your config:
evaluator_builders: list[EvaluatorBuilder]
- Runs evaluations every eval_every steps
infrequent_evaluator_builders: list[EvaluatorBuilder]
- Runs evaluations every infrequent_eval_every steps
RL Training (rl.train)
Add the following to your config:
evaluator_builders: list[SamplingClientEvaluator]
- Runs evaluations every eval_every steps
For implementation guidance and a detailed example, see here and here respectively.
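As a rough sketch of how these options fit together, here is what a supervised fine-tuning config might look like with evaluators attached. This is a hedged example, not a complete config: it assumes the SFT Config lives at tinker_cookbook.supervised.train, that an EvaluatorBuilder is a zero-argument callable returning an evaluator, and it reuses the CustomEvaluator, QA_DATASET, and grader_fn defined later on this page; all other fields are elided.
from tinker_cookbook.supervised import train  # assumed module path for the SFT config


def build_qa_evaluator():
    # Zero-argument builder returning the CustomEvaluator defined in the
    # "Create your own SamplingClientEvaluator" section below.
    return CustomEvaluator(
        dataset=QA_DATASET,
        grader_fn=grader_fn,
        renderer_name="llama3",
        model_name="meta-llama/Llama-3.1-8B-Instruct",
    )


config = train.Config(
    # ... model, dataset, logging, and optimizer fields as in the linked examples ...
    evaluator_builders=[build_qa_evaluator],  # run every eval_every steps
    eval_every=100,
    infrequent_evaluator_builders=[],         # optional slower evals, run every infrequent_eval_every steps
    infrequent_eval_every=1000,
)
The rl.train config is wired up the same way, but as listed above it only takes evaluator_builders and eval_every.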
Offline evals
We support and recommend several ways to create and run offline evaluations on your model checkpoints.
Running Standard Evaluations with Inspect AI
We support running many standard, widely cited evaluations using the Inspect AI library.
We have provided a script to evaluate models using Tinker's internal sampling functionality as shown below.
MODEL_PATH=tinker://FIXME  # YOUR MODEL PATH HERE
MODEL_NAME=FIXME           # YOUR MODEL NAME HERE
RENDERER_NAME=FIXME        # YOUR RENDERER NAME HERE
python -m tinker_cookbook.eval.run_inspect_evals \
  model_path=$MODEL_PATH \
  model_name=$MODEL_NAME \
  tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
  renderer_name=$RENDERER_NAME
Click here to view additional supported evaluations.
Creating your own Sampling Evaluations
We recommend two ways to create your own evaluations:
- creating your own tasks with Inspect AI and running them as shown above
- creating your own SamplingClientEvaluator
Create tasks with Inspect AI
In addition to passing in standard evaluations, you can create your own tasks using Inspect AI as detailed here.
Here is a toy example of how to create an LLM-as-a-judge evaluation where we use a model served through Tinker sampling as the grader.
import tinker
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.model import GenerateConfig as InspectAIGenerateConfig
from inspect_ai.model import Model as InspectAIModel
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate

from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling

QA_DATASET = MemoryDataset(
    name="qa_dataset",
    samples=[
        Sample(
            input="What is the capital of France?",
            target="Paris",
        ),
        Sample(
            input="What is the capital of Italy?",
            target="Rome",
        ),
    ],
)

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(
    base_model="meta-llama/Llama-3.1-8B-Instruct"
)
api = InspectAPIFromTinkerSampling(
    renderer_name="llama3",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    sampling_client=sampling_client,
    verbose=False,
)
GRADER_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())


@task
def example_lm_as_judge() -> Task:
    """
    Example task using LLM-as-a-judge scoring.

    Note: The grader model defaults to the model being evaluated.
    To use a different grader model, specify it with --model-grader when using inspect directly.
    """
    return Task(
        name="llm_as_judge",
        dataset=QA_DATASET,
        solver=generate(),
        scorer=model_graded_qa(
            instructions="Grade strictly against the target text as general answer key and rubric. "
            "Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
            partial_credit=False,
            # model parameter is optional - if not specified, uses the model being evaluated
            model=GRADER_MODEL,
        ),
    )
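Once defined, the task can be run programmatically. Below is a hedged sketch that assumes Inspect AI's eval() entry point and reuses the same InspectAPIFromTinkerSampling wrapper as the model under evaluation; the variable names are illustrative.
from inspect_ai import eval as inspect_eval

# Wrap the Tinker sampling client as the model under evaluation (the same
# wrapper construction used above for the grader) and run the task.
EVAL_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())
logs = inspect_eval(example_lm_as_judge(), model=EVAL_MODEL)
print(logs[0].results)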
Inspect also natively supports replacing our GRADER_MODEL with any OpenAI-chat-completions-style API (e.g. OpenRouter).
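For example, a hosted grader could be swapped in roughly as follows. The provider string and API-key environment variable depend on your Inspect AI setup; "openai/gpt-4o-mini" and OPENAI_API_KEY are illustrative assumptions.
from inspect_ai.model import get_model

# Illustrative: grade with a hosted OpenAI-compatible model instead of the
# Tinker-backed GRADER_MODEL. Assumes OPENAI_API_KEY is set in the environment.
HOSTED_GRADER = get_model("openai/gpt-4o-mini")

scorer = model_graded_qa(
    instructions="Grade strictly against the target text as general answer key and rubric. "
    "Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
    partial_credit=False,
    model=HOSTED_GRADER,  # replaces model=GRADER_MODEL in the task above
)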
Create your own SamplingClientEvaluator
Alternatively, you can create your own SamplingClientEvaluator class instead of using Inspect AI. This is a lower-level abstraction than the above, giving you finer-grained control over how your evaluations run.
We expose this interface to give users more control over their datasets and metrics. To illustrate, see this custom evaluators example of how one might create a more complex SamplingClientEvaluator.
For a simpler, instructive toy example, see below.
from typing import Any, Callable

import tinker
from tinker import types
from tinker_cookbook import renderers
from tinker_cookbook.evaluators import SamplingClientEvaluator
from tinker_cookbook.tokenizer_utils import get_tokenizer


class CustomEvaluator(SamplingClientEvaluator):
    """
    A toy SamplingClientEvaluator that runs a custom evaluation and returns its metrics.
    """

    def __init__(
        self,
        dataset: Any,
        grader_fn: Callable[[str, str], bool],
        model_name: str,
        renderer_name: str,
    ):
        """
        Initialize the CustomEvaluator.

        Args:
            dataset: Sequence of examples with "input" and "output" fields
            grader_fn: Function mapping (response, target) to whether the response is correct
            model_name: Model whose tokenizer is used to build prompts
            renderer_name: Renderer used to build prompts and parse responses
        """
        self.dataset = dataset
        self.grader_fn = grader_fn
        tokenizer = get_tokenizer(model_name)
        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)

    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
        """
        Run custom evaluation on the given sampling client and return metrics.

        Args:
            sampling_client: The sampling client to evaluate

        Returns:
            Dictionary of metrics from the evaluation
        """
        metrics = {}
        num_examples = len(self.dataset)
        num_correct = 0

        sampling_params = types.SamplingParams(
            max_tokens=100,
            temperature=0.7,
            top_p=1.0,
            stop=self.renderer.get_stop_sequences(),
        )

        for datum in self.dataset:
            model_input: types.ModelInput = self.renderer.build_generation_prompt(
                [renderers.Message(role="user", content=datum["input"])]
            )
            # Generate response
            r: types.SampleResponse = await sampling_client.sample_async(
                prompt=model_input, num_samples=1, sampling_params=sampling_params
            )
            tokens: list[int] = r.sequences[0].tokens
            response: renderers.Message = self.renderer.parse_response(tokens)[0]
            if self.grader_fn(response["content"], datum["output"]):
                num_correct += 1

        metrics["accuracy"] = num_correct / num_examples
        return metrics
Here is an example of how to use the above CustomEvaluator with a toy dataset and grader function.
import asyncio

QA_DATASET = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is the capital of Germany?", "output": "Berlin"},
    {"input": "What is the capital of Italy?", "output": "Rome"},
]


def grader_fn(response: str, target: str) -> bool:
    return target.lower() in response.lower()


evaluator = CustomEvaluator(
    dataset=QA_DATASET,
    grader_fn=grader_fn,
    renderer_name="llama3",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
)

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(
    base_model="meta-llama/Llama-3.1-8B-Instruct"
)


async def main():
    result = await evaluator(sampling_client)
    print(result)


asyncio.run(main())