MultiModal Fine-tuning with Synthetic Captions

About

In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
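The abstract names two components without giving formulas: a supervised contrastive loss that pulls same-class representations together, and inference against class-averaged text embeddings built from multiple synthetic captions. The NumPy sketch below illustrates both ideas under stated assumptions. The loss is a generic SupCon-style formulation, not necessarily the authors' exact objective; the function names (`supervised_contrastive_loss`, `class_text_prototypes`, `predict`), the temperature value, and the use of plain arrays in place of encoder outputs are all hypothetical. For the actual implementation, see the linked repository.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic SupCon-style loss (an assumption, not the paper's exact form).

    features: (N, D) embeddings; labels: (N,) integer class ids.
    Anchors with no same-class partner in the batch are skipped.
    """
    z = l2_normalize(np.asarray(features, dtype=np.float64))
    labels = np.asarray(labels)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    # Temperature-scaled cosine similarities, shifted for numerical stability.
    logits = z @ z.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax over all other samples (self is excluded from the denominator).
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each valid anchor.
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(-mean_log_prob_pos.mean())


def class_text_prototypes(caption_embeddings, caption_labels, num_classes):
    """Average caption embeddings per class, then re-normalize.

    Assumes every class has at least one caption embedding.
    """
    emb = l2_normalize(np.asarray(caption_embeddings, dtype=np.float64))
    caption_labels = np.asarray(caption_labels)
    protos = np.zeros((num_classes, emb.shape[1]))
    for c in range(num_classes):
        protos[c] = emb[caption_labels == c].mean(axis=0)
    return l2_normalize(protos)


def predict(image_embeddings, prototypes):
    """Assign each image to the class whose text prototype is most similar."""
    img = l2_normalize(np.asarray(image_embeddings, dtype=np.float64))
    return (img @ prototypes.T).argmax(axis=1)
```

In practice the image and caption embeddings would come from the image and text encoders of a pre-trained vision-language model; here they are ordinary arrays so the sketch stays self-contained. Averaging several caption embeddings per class before normalizing yields one text prototype per class, which reduces the influence of any single noisy caption at inference time.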

Shohei Enomoto, Shin'ya Yamaguchi • 2026

Related benchmarks

Task	Dataset	Result
Classification	Cars	Accuracy 82.25	492
Image Classification	DTD	Accuracy 77.68	487
Image Classification	CUB	Accuracy 78.75	331
Image Classification	GTSRB	Accuracy 99.08	291
Image Classification	CIFAR10	Accuracy 98.04	240
Image Classification	Caltech101	Accuracy 96.73	228
Image Classification	EuroSAT	Accuracy 98.97	226
Image Classification	Food	Accuracy 88.79	91
Image Classification	Flowers	Accuracy 94.09	86
Image Classification	CIFAR100	Accuracy 88.92	50

Showing 10 of 15 rows
