Exploring Cross-Modal Flows for Few-Shot Learning

About

Aligning features from different modalities is one of the most fundamental challenges in cross-modal tasks. Although pre-trained vision-language models achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Current PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) selectively fine-tune a subset of parameters, which slightly adjusts either the visual or the textual features while avoiding overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform a one-step adjustment, which is insufficient for complex (or difficult) datasets where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach, Flow Matching Alignment (FMA), which learns a cross-modal velocity field. Specifically, to ensure the correspondence between categories during training, we first use a fixed coupling strategy. We then propose a noise augmentation strategy to alleviate data scarcity. Finally, we design an early-stopping solver that terminates the transformation process early, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA's multi-step rectification achieves more precise and robust alignment. Extensive results demonstrate that FMA consistently yields significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
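The three ingredients named in the abstract (fixed coupling, noise augmentation, early-stopping solver) can be illustrated with a toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the random toy features, the Gaussian jitter used as noise augmentation, the linear velocity model fitted by least squares, the Euler solver, and all function names (`make_pairs`, `fit_velocity`, `euler_solve`).

```python
import numpy as np


def make_pairs(img_feats, labels, text_feats, noise_std, rng):
    """Build (source, target) pairs for flow matching.

    Fixed coupling (assumed form): each image feature is paired with the
    text feature of its own class. Noise augmentation (assumed form):
    Gaussian jitter on the image features.
    """
    x0 = img_feats + noise_std * rng.standard_normal(img_feats.shape)
    x1 = text_feats[labels]
    return x0, x1


def fit_velocity(x0, x1, rng):
    """Fit a linear velocity model v(x, t) = [x, t, 1] @ W by least squares.

    Uses the standard flow-matching regression target: along the straight
    path x_t = (1 - t) * x0 + t * x1, the target velocity is x1 - x0.
    A real model would be a neural network; linear is a toy stand-in.
    """
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1
    inputs = np.hstack([xt, t, np.ones_like(t)])
    W, *_ = np.linalg.lstsq(inputs, x1 - x0, rcond=None)
    return W


def velocity(W, x, t):
    """Evaluate the fitted velocity field at features x and scalar time t."""
    t_col = np.full((x.shape[0], 1), t)
    return np.hstack([x, t_col, np.ones_like(t_col)]) @ W


def euler_solve(W, x, num_steps=10, stop_step=None):
    """Move features along the velocity field with Euler steps.

    stop_step < num_steps ends the integration early (a simple stand-in
    for an early-stopping solver); None runs all num_steps steps.
    """
    stop = num_steps if stop_step is None else min(stop_step, num_steps)
    dt = 1.0 / num_steps
    x = x.copy()
    for k in range(stop):
        x = x + dt * velocity(W, x, k * dt)
    return x


# Toy few-shot setup: 3 classes, 5 shots each, 4-dimensional features.
rng = np.random.default_rng(0)
num_classes, dim, shots = 3, 4, 5
text_feats = rng.standard_normal((num_classes, dim))
labels = np.repeat(np.arange(num_classes), shots)
img_protos = rng.standard_normal((num_classes, dim))
img_feats = img_protos[labels] + 0.1 * rng.standard_normal((labels.size, dim))

x0, x1 = make_pairs(img_feats, labels, text_feats, noise_std=0.05, rng=rng)
W = fit_velocity(x0, x1, rng)

full = euler_solve(W, img_feats, num_steps=10)               # all 10 steps
early = euler_solve(W, img_feats, num_steps=10, stop_step=6)  # stop after 6
```

In this sketch the multi-step adjustment is the loop in `euler_solve`: each step nudges the image features a little further toward the text side, and `stop_step` picks how far along the path to stop. How the stopping point is chosen in FMA is not stated in the abstract, so it is left as a plain parameter here.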

Ziqi Jiang, Yanghao Wang, Long Chen • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy 87.7	705
Classification	Cars	Accuracy 65.8	571
Image Classification	UCF101	Top-1 Acc 87.1	529
Image Classification	Food101	Accuracy 87.4	457
Image Classification	SUN397	Accuracy 77.2	450
Image Classification	ImageNet	Top-1 Accuracy 73.5	366
Image Classification	Pets	Accuracy 90.5	320
Image Classification	Oxford Flowers 102	Accuracy 99.1	244
Image Classification	EuroSAT	Accuracy 91	226
Image Classification	Oxford-IIIT Pet	Accuracy 93.2	219

Showing 10 of 21 rows
