MultiModal Fine-tuning with Synthetic Captions

About

In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
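The two components named in the abstract, a supervised contrastive loss and inference with class-averaged text embeddings, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes image and caption embeddings are already available as plain lists of floats, uses a common SupCon-style formulation that may differ from the paper's exact loss, and all function names (`class_prototypes`, `predict`, `supervised_contrastive_loss`) and the `temperature` default are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def class_prototypes(caption_embeddings, labels):
    """Average the normalized caption embeddings of each class, then renormalize."""
    sums, counts = {}, {}
    for emb, label in zip(caption_embeddings, labels):
        emb = l2_normalize(emb)
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + e for s, e in zip(sums[label], emb)]
        counts[label] += 1
    return {c: l2_normalize([s / counts[c] for s in sums[c]]) for c in sums}


def predict(image_embedding, prototypes):
    """Return the class whose text prototype is most cosine-similar to the image."""
    img = l2_normalize(image_embedding)
    return max(prototypes, key=lambda c: dot(img, prototypes[c]))


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss (assumed form): same-class pairs are pulled together."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-class partner contribute nothing
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

In this sketch, each class prototype is built once from all synthetic caption embeddings of that class, and a test image is assigned to the class whose prototype is closest in cosine similarity. The loss is lower when embeddings that share a label already lie close together, which is the clustering behavior the abstract describes.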

Shohei Enomoto, Shin'ya Yamaguchi • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Classification | DTD | Accuracy 77.68 | 419 |
| Classification | Cars | Accuracy 82.25 | 314 |
| Image Classification | GTSRB | Accuracy 99.08 | 291 |
| Image Classification | CUB | Accuracy 78.75 | 249 |
| Image Classification | CIFAR10 | Accuracy 98.04 | 240 |
| Image Classification | Caltech101 | Accuracy 96.73 | 162 |
| Image Classification | EuroSAT | Accuracy 98.97 | 83 |
| Image Classification | Flowers | Accuracy 94.09 | 83 |
| Image Classification | CIFAR100 | Accuracy 88.92 | 38 |
| Image Classification | Food | Accuracy 88.79 | 23 |
Showing 10 of 15 rows
