Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

About

We introduce CoMET (Composing Modality Encoders with Tabular foundation models), a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate them as input to a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices as an adaptor, yielding strong, robust performance across modalities. When the CLS tokens of the foundation model align poorly with downstream tasks, we propose PALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation-learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with large fine-grained class spaces, our approach enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that composing foundation models is a simple yet powerful out-of-the-box solution for multimodal learning, challenging the necessity of complex end-to-end training pipelines for new problems.

Herman Bergström, Aditya Mehrotra, Rahul G. Krishnan • 2026
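
The pipeline described in the abstract (frozen embeddings per modality, PCA compression, concatenation, tabular model) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the embeddings are synthetic stand-ins for frozen backbone outputs, PCA is done with a plain NumPy SVD, and the `NearestCentroid` class is a hypothetical placeholder for the Tabular Foundation Model. All function names, dimensions, and the number of components are assumptions made for this example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on training rows; return the mean and the top principal directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T

def compose_features(train_mods, test_mods, n_components):
    """PCA-compress each modality separately, then concatenate column-wise."""
    train_parts, test_parts = [], []
    for X_tr, X_te in zip(train_mods, test_mods):
        mean, comps = pca_fit(X_tr, n_components)  # fit on training data only
        train_parts.append(pca_transform(X_tr, mean, comps))
        test_parts.append(pca_transform(X_te, mean, comps))
    return np.hstack(train_parts), np.hstack(test_parts)

class NearestCentroid:
    """Hypothetical stand-in for the tabular foundation model (assumption)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every row to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def fake_embeddings(labels, centers, rng):
    """Synthetic 'frozen backbone' embeddings: a class center plus unit noise."""
    return centers[labels] + rng.normal(size=(len(labels), centers.shape[1]))

rng = np.random.default_rng(0)
y_train = rng.integers(0, 3, 200)
y_test = rng.integers(0, 3, 50)

# Two made-up modalities with different embedding widths.
text_centers = rng.normal(scale=3.0, size=(3, 64))
image_centers = rng.normal(scale=3.0, size=(3, 128))
train_mods = [fake_embeddings(y_train, text_centers, rng),
              fake_embeddings(y_train, image_centers, rng)]
test_mods = [fake_embeddings(y_test, text_centers, rng),
             fake_embeddings(y_test, image_centers, rng)]

X_train, X_test = compose_features(train_mods, test_mods, n_components=8)
model = NearestCentroid().fit(X_train, y_train)
accuracy = (model.predict(X_test) == y_test).mean()
print(X_train.shape, X_test.shape, accuracy)
```

In a real setting the synthetic arrays would be replaced by embeddings from actual pre-trained encoders and the placeholder classifier by a tabular foundation model; nothing in the backbones is trained, and only the PCA projection is fitted on the training split.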

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Hierarchical Text Classification | AMAZON | Macro F1 91.33 | 18 |
| Classification | Airbnb Text + Tabular | Accuracy 45.17 | 8 |
| Classification | PetFinder (Text + Tabular) | Accuracy 43.26 | 8 |
| Hierarchical classification | Amazon (test) | Accuracy 89.85 | 8 |
| Hierarchical classification | Bugs (test) | Accuracy 76.61 | 8 |
| Hierarchical Text Classification | Bugs | Accuracy 76.61 | 8 |
| Classification | Jigsaw Text + Tabular | Accuracy 95.25 | 8 |
| Classification | Wine Text + Tabular | Accuracy 83.18 | 8 |
| Hierarchical classification | WOS (test) | Accuracy 75.53 | 8 |
| Hierarchical Text Classification | WOS (Web of Science) | Accuracy 75.53 | 8 |
Showing 10 of 16 rows
