# VLM Image Classification
Fine-tune vision-language models as image classifiers using supervised learning.
## What you'll build
An image classifier powered by a vision-language model (Qwen3-VL), fine-tuned on datasets like Caltech-101. The model learns to output class names for input images.
## Prerequisites

### Key concepts
- VLM fine-tuning — adapting a vision-language model for a specific classification task via supervised learning
- ClassifierDataset — a dataset abstraction for loading image-label pairs from HuggingFace or custom sources
## Run it
### Train

```bash
python -m tinker_cookbook.recipes.vlm_classifier.train \
    experiment_dir=./vlm_classifier \
    wandb_project=vlm-classifier \
    dataset=caltech101 \
    renderer_name=qwen3_vl \
    model_name=Qwen/Qwen3-VL-30B-A3B-Instruct
```
### Evaluate

```bash
python -m tinker_cookbook.recipes.vlm_classifier.eval \
    dataset=caltech101 \
    model_path=$YOUR_MODEL_PATH \
    model_name=Qwen/Qwen3-VL-30B-A3B-Instruct \
    renderer_name=qwen3_vl
```
## How it works
### Supported datasets
Four datasets are supported out of the box:
- Caltech-101 — 101 object categories (animals, vehicles, household items, etc.)
- Flowers-102 — 102 flower species
- Oxford Pets — 37 cat and dog breeds
- Stanford Cars — 196 classes of cars (make, model, year)
### Custom datasets and evaluators
You can add a custom dataset by creating a SupervisedDatasetBuilder in tinker_cookbook.recipes.vlm_classifier.data, provided the dataset is downloadable from Hugging Face and has one column containing images and another containing integer labels. You must also define a ClassLabel that maps each integer label to a human-readable class name. For more general cases, subclass the base ClassifierDataset to load arbitrary image classification data.

To define a custom evaluator for a new dataset, create a new EvaluatorBuilder if the dataset is available on Hugging Face, or subclass ClassifierEvaluator to supply an arbitrary dataset and parsing strategy.
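As a rough illustration of the subclassing path, the sketch below shows the shape of a custom dataset: integer labels mapped through a class-name list (playing the role of a ClassLabel) so `__getitem__` yields the human-readable name the model is trained to emit. The base class and `Example` type here are hypothetical stand-ins; the real interface in tinker_cookbook.recipes.vlm_classifier may differ.

```python
from dataclasses import dataclass


class ClassifierDataset:
    """Hypothetical stand-in for the cookbook's ClassifierDataset base class."""

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, idx):
        raise NotImplementedError


@dataclass
class Example:
    image: object  # a PIL.Image in practice; opaque here
    label: int     # integer class index


class MyCustomDataset(ClassifierDataset):
    """Yields (image, class_name) pairs for supervised fine-tuning."""

    def __init__(self, examples, class_names):
        self.examples = examples
        self.class_names = class_names  # maps label index -> readable name

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # The target is the human-readable class name, not the integer label.
        return ex.image, self.class_names[ex.label]


ds = MyCustomDataset([Example(image=None, label=1)], ["cat", "dog"])
print(ds[0][1])  # dog
```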
### Few-shot support
The examples_per_class parameter controls few-shot prompting. When set, the specified number of labeled examples per class are included in the prompt to guide the model.
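A minimal sketch of the idea, with hypothetical function and variable names: take the first k labeled examples of each class and prepend them to the query. The real recipe attaches actual images to the prompt; text placeholders stand in for them here.

```python
def build_few_shot_prompt(labeled, class_names, examples_per_class):
    """Build a prompt with `examples_per_class` labeled shots per class."""
    lines = [f"Classify the image into one of: {', '.join(class_names)}."]
    for cls in class_names:
        # Keep at most `examples_per_class` shots for this class.
        shots = [img for img, name in labeled if name == cls][:examples_per_class]
        for img in shots:
            lines.append(f"[image: {img}] -> {cls}")
    lines.append("[query image] -> ?")
    return "\n".join(lines)


labeled = [("cat1.jpg", "cat"), ("cat2.jpg", "cat"), ("dog1.jpg", "dog")]
prompt = build_few_shot_prompt(labeled, ["cat", "dog"], examples_per_class=1)
print(prompt)
```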
### Image augmentation
Training applies horizontal flip augmentation to input images for improved generalization.
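The augmentation amounts to randomly mirroring each training image left-to-right. A minimal sketch, using a list of pixel rows in place of a real image (training code would operate on PIL images or tensors):

```python
import random


def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror each pixel row left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


img = [[1, 2, 3],
       [4, 5, 6]]
flipped = random_horizontal_flip(img, p=1.0)  # force the flip
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```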
## Expected results
The eval script prints test accuracy. Results depend on the model, dataset, and training duration. Only Qwen-series VLMs are currently supported.