MultiModal Fine-tuning with Synthetic Captions
About
In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
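The two mechanisms named in the abstract, a supervised contrastive loss that pulls same-class representations together and inference against class-averaged text embeddings, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes embeddings are already available as plain lists of floats, uses the generic supervised contrastive formulation rather than the paper's exact loss, and all function names and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_prototypes(caption_embeddings, labels):
    """Average the (normalized) caption embeddings of each class.

    Assumption: one text embedding per synthetic caption, with the class
    label of the image the caption was generated for.
    """
    sums, counts = {}, {}
    for emb, y in zip(caption_embeddings, labels):
        emb = l2_normalize(emb)
        acc = sums.setdefault(y, [0.0] * len(emb))
        for i, x in enumerate(emb):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: l2_normalize([x / counts[y] for x in acc]) for y, acc in sums.items()}


def predict(image_embedding, prototypes):
    """Return the class whose averaged text embedding is most similar."""
    img = l2_normalize(image_embedding)
    return max(prototypes, key=lambda y: dot(img, prototypes[y]))


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over a batch (illustrative only).

    For each anchor, same-class samples are positives and every other
    sample in the batch forms the softmax denominator.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

In this sketch, `class_prototypes` would be computed once from the training captions, and `predict` would then be applied to each test image embedding. The loss is lower when samples sharing a label are already close in embedding space, which is the clustering behavior the abstract describes.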
Related benchmarks
| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Image Classification | DTD | 77.68 | 419 |
| Classification | Cars | 82.25 | 314 |
| Image Classification | GTSRB | 99.08 | 291 |
| Image Classification | CUB | 78.75 | 249 |
| Image Classification | CIFAR10 | 98.04 | 240 |
| Image Classification | Caltech101 | 96.73 | 162 |
| Image Classification | EuroSAT | 98.97 | 83 |
| Image Classification | Flowers | 94.09 | 83 |
| Image Classification | CIFAR100 | 88.92 | 38 |
| Image Classification | Food | 88.79 | 23 |