# VLM Image Classification
Fine-tune vision-language models as image classifiers using supervised learning.
## What you'll build
An image classifier powered by a vision-language model (Qwen3-VL), fine-tuned on datasets like Caltech-101. The model learns to output class names for input images.
## Prerequisites

### Key concepts
- VLM fine-tuning — adapting a vision-language model for a specific classification task via supervised learning
- ClassifierDataset — a dataset abstraction for loading image-label pairs from HuggingFace or custom sources
## Run it
### Train

```bash
python -m tinker_cookbook.recipes.vlm_classifier.train \
    experiment_dir=./vlm_classifier \
    wandb_project=vlm-classifier \
    dataset=caltech101 \
    renderer_name=qwen3_vl \
    model_name=Qwen/Qwen3-VL-30B-A3B-Instruct
```
### Evaluate

```bash
python -m tinker_cookbook.recipes.vlm_classifier.eval \
    dataset=caltech101 \
    model_path=$YOUR_MODEL_PATH \
    model_name=Qwen/Qwen3-VL-30B-A3B-Instruct \
    renderer_name=qwen3_vl
```
## How it works
### Supported datasets
Four datasets are supported out of the box:
- Caltech-101 — 101 object categories (animals, vehicles, household items, etc.)
- Flowers-102 — 102 flower species
- Oxford Pets — 37 cat and dog breeds
- Stanford Cars — 196 classes of cars (make, model, year)
### Custom datasets and evaluators
You can add a custom dataset by creating a SupervisedDatasetBuilder in tinker_cookbook.recipes.vlm_classifier.data, provided the dataset is downloadable from Hugging Face and has one column containing images and another containing integer labels. You must also define a ClassLabel that maps each integer label to a human-readable class name. For more general cases, subclass the base ClassifierDataset to load arbitrary image classification data.

To define a custom evaluator for a new dataset, create a new EvaluatorBuilder if the dataset is available on Hugging Face, or subclass ClassifierEvaluator to supply an arbitrary dataset and parsing strategy.
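As a rough illustration of the subclassing path, the sketch below shows the shape of a custom dataset: integer labels mapped through a class-name list (playing the role of a ClassLabel) so `__getitem__` yields the human-readable name the model is trained to emit. The base class and `Example` type here are hypothetical stand-ins; the real interface in tinker_cookbook.recipes.vlm_classifier may differ.

```python
from dataclasses import dataclass


class ClassifierDataset:
    """Hypothetical stand-in for the cookbook's ClassifierDataset base class."""

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, idx):
        raise NotImplementedError


@dataclass
class Example:
    image: object  # a PIL.Image in practice; opaque here
    label: int     # integer class index


class MyCustomDataset(ClassifierDataset):
    """Yields (image, class_name) pairs for supervised fine-tuning."""

    def __init__(self, examples, class_names):
        self.examples = examples
        self.class_names = class_names  # maps label index -> readable name

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # The target is the human-readable class name, not the integer label.
        return ex.image, self.class_names[ex.label]


ds = MyCustomDataset([Example(image=None, label=1)], ["cat", "dog"])
print(ds[0][1])  # dog
```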
### Few-shot support
The examples_per_class parameter controls few-shot prompting. When set, the specified number of labeled examples per class are included in the prompt to guide the model.
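A minimal sketch of the idea, with hypothetical function and variable names: take the first k labeled examples of each class and prepend them to the query. The real recipe attaches actual images to the prompt; text placeholders stand in for them here.

```python
def build_few_shot_prompt(labeled, class_names, examples_per_class):
    """Build a prompt with `examples_per_class` labeled shots per class."""
    lines = [f"Classify the image into one of: {', '.join(class_names)}."]
    for cls in class_names:
        # Keep at most `examples_per_class` shots for this class.
        shots = [img for img, name in labeled if name == cls][:examples_per_class]
        for img in shots:
            lines.append(f"[image: {img}] -> {cls}")
    lines.append("[query image] -> ?")
    return "\n".join(lines)


labeled = [("cat1.jpg", "cat"), ("cat2.jpg", "cat"), ("dog1.jpg", "dog")]
prompt = build_few_shot_prompt(labeled, ["cat", "dog"], examples_per_class=1)
print(prompt)
```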
### Image augmentation
Training applies horizontal flip augmentation to input images for improved generalization.
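The augmentation amounts to randomly mirroring each training image left-to-right. A minimal sketch, using a list of pixel rows in place of a real image (training code would operate on PIL images or tensors):

```python
import random


def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror each pixel row left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


img = [[1, 2, 3],
       [4, 5, 6]]
flipped = random_horizontal_flip(img, p=1.0)  # force the flip
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```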
## Expected results
The eval script prints test accuracy. Results depend on the model, dataset, and training duration. Only Qwen-series VLMs are currently supported.